# Text Tokenization
Text tokenization is the process of breaking down a longer text into smaller units, known as tokens. These tokens can be words, phrases, sentences, or even individual characters, depending on the level of granularity required for the analysis.
Tokenization is a fundamental step in natural language processing (NLP) and is used to prepare textual data for further analysis, such as text classification, sentiment analysis, language modeling, and information retrieval.

Here are the common types of text tokenization:

### Word Tokenization:

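This is the most common type of tokenization: the text is split into individual words. With NLTK's `word_tokenize`, punctuation marks are also returned as separate tokens, as the output of the cell below shows.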

In [24]:
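# Import the word_tokenize function from the nltk.tokenize module.
# Note: word_tokenize needs NLTK's Punkt data, installed once with
# nltk.download("punkt") (or "punkt_tab" on newer NLTK releases).
from nltk.tokenize import word_tokenize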

# Define a text string that you want to tokenize
text = "Hello everyone. Welcome to NLP tutorial."

# Use the word_tokenize function to tokenize the text into words
tokens = word_tokenize(text)

# Print the list of tokens (words) obtained from the text
print(tokens)


['Hello', 'everyone', '.', 'Welcome', 'to', 'NLP', 'tutorial', '.']
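
For comparison, a similar split can be approximated with the standard library alone. This is a minimal sketch and not NLTK's algorithm: `simple_word_tokenize` is a hypothetical helper built on a single regular expression. It happens to give the same tokens for the sentence above, but it should be expected to differ from `word_tokenize` on contractions, abbreviations, and similar cases.

```python
import re

def simple_word_tokenize(text):
    """Hypothetical helper: split text into word runs and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Hello everyone. Welcome to NLP tutorial."
print(simple_word_tokenize(text))
```

One visible difference: this sketch splits `don't` into `don`, `'`, and `t`, which is rarely what an NLP pipeline wants, so a dedicated tokenizer is usually the better choice for real text.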


### Sentence Tokenization:
In this case, the text is divided into individual sentences. 

In [25]:
# Import the sent_tokenize function from the nltk.tokenize module
from nltk.tokenize import sent_tokenize

# Define a text string that you want to sentence tokenize
text = "Hello everyone. Welcome to my Github profile. let's study NLP."

# Use the sent_tokenize function to tokenize the text into sentences
sentences = sent_tokenize(text)

# Print the list of sentences obtained from the text
print(sentences)


['Hello everyone.', 'Welcome to my Github profile.', "let's study NLP."]
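
To see why a dedicated sentence tokenizer is useful, it helps to compare it with a naive rule. The sketch below is an illustration only, not how `sent_tokenize` works: `naive_sent_tokenize` is a hypothetical helper that splits wherever `.`, `!`, or `?` is followed by whitespace.

```python
import re

def naive_sent_tokenize(text):
    """Hypothetical helper: split after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation attached to the preceding sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Hello everyone. Welcome to my Github profile. let's study NLP."
print(naive_sent_tokenize(text))

# An abbreviation is wrongly treated as a sentence boundary by this rule.
print(naive_sent_tokenize("Dr. Smith arrived. He was late."))
```

The first call reproduces the three sentences above, but the second breaks after `Dr.`, which is the kind of case a trained sentence tokenizer is designed to handle.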


### Phrasal Tokenization:
Sometimes it is useful to tokenize text into multi-word phrases such as n-grams, which are contiguous sequences of n tokens (bigrams for n = 2, trigrams for n = 3).

In [27]:
import nltk
from nltk.util import ngrams

# Sample text
text = "Natural language processing is a subfield of artificial intelligence."

# Tokenize the text into words
words = nltk.word_tokenize(text)

# Define the n-gram size (e.g., 2 for bigrams, 3 for trigrams)
n = 2

# Create a list of n-grams using NLTK's ngrams function
ngram_list = list(ngrams(words, n))

# Print the list of n-grams
for ngram in ngram_list:
    print(" ".join(ngram))


Natural language
language processing
processing is
is a
a subfield
subfield of
of artificial
artificial intelligence
intelligence .
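
The sliding-window idea behind n-grams can also be written in plain Python. The following is a minimal sketch, assuming the text has already been split into a list of tokens; `make_ngrams` is a hypothetical helper name, not part of NLTK.

```python
def make_ngrams(tokens, n):
    """Hypothetical helper: return contiguous n-grams as tuples."""
    # Zip n copies of the token list, each shifted one position further,
    # so that each resulting tuple holds n consecutive tokens.
    return list(zip(*(tokens[i:] for i in range(n))))

words = ["Natural", "language", "processing", "is", "a", "subfield"]

# Print the trigrams (n = 3), one per line
for gram in make_ngrams(words, 3):
    print(" ".join(gram))
```

A list of k tokens yields k - n + 1 n-grams, so the six tokens here produce four trigrams, and a list shorter than n produces none.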


### Character Tokenization:
In some cases, text is tokenized into individual characters.
Character tokenization is less common than word or sentence tokenization, but it can be useful for specific tasks such as text generation, spelling correction, and handwriting recognition.

In [28]:
# Sample text
text = "Hello, world!"

# Tokenize the text into individual characters
characters = list(text)

# Print the list of individual characters
for char in characters:
    print(char)


H
e
l
l
o
,
 
w
o
r
l
d
!
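
For tasks such as text generation, character tokens are typically converted to integer IDs before being passed to a model. The sketch below shows one simple way to do that under stated assumptions: the names `vocab`, `char_to_id`, and `id_to_char` are illustrative, and the vocabulary is built from this one string rather than from a training corpus.

```python
text = "Hello, world!"

# Build a vocabulary of the distinct characters, sorted for a stable order.
vocab = sorted(set(text))

# Map each character to an integer ID and back.
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Encode the text as a list of IDs, then decode it to check the round trip.
encoded = [char_to_id[ch] for ch in text]
decoded = "".join(id_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```

Because every character has its own ID, decoding recovers the original string exactly, including the space and punctuation.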
