## Tokenization: Explained with Simple Examples

### What is Tokenization?

Tokenization is the process of splitting a stream of text into smaller units called tokens: words, sentences, or subwords. It is an essential early step in Natural Language Processing (NLP) because it breaks text into manageable units for further analysis.

### Types of Tokenization

- **Word Tokenization:** Splitting the text into individual words.
- **Sentence Tokenization:** Splitting the text into individual sentences.

**Why It's Useful:**  
Think of a sentence as a big puzzle. Tokenization helps us by breaking it down into smaller pieces (tokens), making it easier to analyze and understand each piece.


In [3]:
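import nltk

# Download the pre-trained Punkt sentence tokenizer models.
# sent_tokenize() uses them directly, and word_tokenize() uses them to split
# text into sentences before splitting each sentence into words.
# Punkt models are available for several languages.
# Note: recent NLTK releases may ask for nltk.download('punkt_tab') instead.
nltk.download('punkt')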

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Word tokenization with NLTK
# word_tokenize() separates punctuation and symbols from words and splits
# contractions into parts (for example, "I'm" becomes "I" and "'m").

# Example use
text = "Hello, how are you doing today?"
words = nltk.word_tokenize(text)
print(words)
print("number of items in list : ", len(words))


text = "I'm excited to go to the market! It's going to be fun."
tokens = nltk.word_tokenize(text)
print(tokens)
print("number of items in list : ", len(tokens))


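['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']
number of items in list :  8
['I', "'m", 'excited', 'to', 'go', 'to', 'the', 'market', '!', 'It', "'s", 'going', 'to', 'be', 'fun', '.']
number of items in list :  16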
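
To see why a dedicated tokenizer is used instead of plain whitespace splitting, the sketch below applies Python's built-in `str.split()` to the first example sentence. No NLTK is involved here; the `nltk_tokens` list is simply copied from the `word_tokenize()` output above for comparison.

```python
text = "Hello, how are you doing today?"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
naive_tokens = text.split()
print(naive_tokens)
# ['Hello,', 'how', 'are', 'you', 'doing', 'today?']

# Result of nltk.word_tokenize(text), copied from the notebook output.
nltk_tokens = ['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']

# Tokens that only exist once punctuation is split off.
only_in_nltk = [tok for tok in nltk_tokens if tok not in naive_tokens]
print(only_in_nltk)
# ['Hello', ',', 'today', '?']
```

With whitespace splitting, `'Hello,'` and `'Hello'` would count as different words, which is usually not what a later analysis step wants.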


In [None]:
#Sentence Tokenization with NLTK
# Example text
text = "Hello, how are you doing today? I hope you're doing well."

# Sentence tokenization
sentences = nltk.sent_tokenize(text)
print(sentences)
print("number of items in list : ", len(sentences))


['Hello, how are you doing today?', "I hope you're doing well."]
number of items in list :  2
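
Sentence splitting looks simple, but a rule such as "split after `.`, `!` or `?`" breaks on abbreviations. The sketch below uses a hand-written regular expression to show the kind of failure that the Punkt model is designed to handle. The helper `naive_sent_split` and the second sample text are assumptions for illustration; this is not how `sent_tokenize()` works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r'(?<=[.!?])\s+', text)

# Works on the simple example from the notebook.
simple = naive_sent_split("Hello, how are you doing today? I hope you're doing well.")
print(simple)
# ['Hello, how are you doing today?', "I hope you're doing well."]

# The same rule wrongly cuts after the abbreviation "Dr."
with_abbrev = naive_sent_split("Dr. Smith is here. She will see you now.")
print(with_abbrev)
# ['Dr.', 'Smith is here.', 'She will see you now.']
```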


In [1]:
# Customizing tokenization with RegexpTokenizer
#
# RegexpTokenizer returns every match of a regular expression as a token.
# The pattern r'\w+|[^\w\s]+' used below matches runs of word characters (\w+)
# or runs of punctuation ([^\w\s]+). Punctuation is split from words but kept
# as separate tokens, which is useful when punctuation itself carries meaning.

from nltk.tokenize import RegexpTokenizer

# Keep only words (punctuation is dropped)
tokenizer = RegexpTokenizer(r'\w+')

# Example text
text = "Hello, how are you doing today?"
tokens = tokenizer.tokenize(text)
print(tokens)

# Keep only punctuation
tokenizer = RegexpTokenizer(r'[^\w\s]+')
tokens = tokenizer.tokenize(text)
print(tokens)

# Keep both words and punctuation as separate tokens
tokenizer = RegexpTokenizer(r'\w+|[^\w\s]+')
tokens = tokenizer.tokenize(text)
print(tokens)

# Tokenizing only numbers
tokenizer = RegexpTokenizer(r'\d+')
text = "The price of the book is 50 dollars and the tax is 5."
tokens = tokenizer.tokenize(text)
print(tokens)

# Tokenizing decimal numbers
tokenizer = RegexpTokenizer(r'\d+\.\d+')
text = "The price is 12.99 dollars and the discount is 0.25."
tokens = tokenizer.tokenize(text)
print(tokens)

# Tokenizing numbers with commas
tokenizer = RegexpTokenizer(r'\d{1,3}(?:,\d{3})*')
text = "The population is 1,234,567 and the revenue is 987,654,321."
tokens = tokenizer.tokenize(text)
print(tokens)

# Tokenizing percentages
tokenizer = RegexpTokenizer(r'\d+%')
text = "The success rate was 75% last year, but now it is 80%."
tokens = tokenizer.tokenize(text)
print(tokens)

# Tokenizing phone numbers
tokenizer = RegexpTokenizer(r'\+?\d{1,2}\s?\(?\d{1,4}\)?[\s\-]?\d{1,4}[\s\-]?\d{1,4}')
text = "You can reach me at 800-555-1234 or +44 20 7946 0958."
tokens = tokenizer.tokenize(text)
print(tokens)


['Hello', 'how', 'are', 'you', 'doing', 'today']
[',', '?']
['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']
['50', '5']
['12.99', '0.25']
['1,234,567', '987,654,321']
['75%', '80%']
['800-555-1234', '+44 20 7946 0958']
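
Each pattern above extracts one kind of token. When several kinds are needed in a single pass, the patterns can be joined with `|`. The regex engine tries alternatives from left to right, so more specific patterns should come first. With its default settings, `RegexpTokenizer.tokenize()` essentially returns the regex matches in order, so the sketch below uses `re.findall` from the standard library to show the idea without NLTK. The combined pattern and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

# Order matters: numbers (with optional decimals and %) come before generic
# words, and punctuation comes last.
PATTERN = re.compile(r"\d+(?:\.\d+)?%?|\w+|[^\w\s]")

text = "The price is 12.99 dollars, up 5%."
tokens = PATTERN.findall(text)
print(tokens)
# ['The', 'price', 'is', '12.99', 'dollars', ',', 'up', '5%', '.']

# With the generic word pattern first, "12.99" and "5%" are broken apart.
wrong_order = re.findall(r"\w+|\d+(?:\.\d+)?%?|[^\w\s]", text)
print(wrong_order)
# ['The', 'price', 'is', '12', '.', '99', 'dollars', ',', 'up', '5', '%', '.']
```

The group is written as non-capturing, `(?:...)`, because `re.findall` returns only the group contents when a pattern contains capturing groups.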


In [None]:
# Capture only words (ignores numbers): [A-Za-z]+
tokenizer = RegexpTokenizer(r'[A-Za-z]+')
text = "The price of the book is 50 dollars and the tax is 5."
tokens = tokenizer.tokenize(text)
print(tokens)

# Capture words and punctuation (ignores numbers): [A-Za-z]+|[.,!?;]
tokenizer = RegexpTokenizer(r'[A-Za-z]+|[.,!?;]')
text = "The price of the book is 50 dollars, and the tax is 5!"
tokens = tokenizer.tokenize(text)
print(tokens)

# Capture words with an optional apostrophe part (ignores numbers, including phone numbers): [A-Za-z]+(?:'[A-Za-z]+)?
tokenizer = RegexpTokenizer(r"[A-Za-z]+(?:'[A-Za-z]+)?")
text = "I'm looking forward to seeing 123-456-7890, it's a great day!"
tokens = tokenizer.tokenize(text)
print(tokens)




['The', 'price', 'of', 'the', 'book', 'is', 'dollars', 'and', 'the', 'tax', 'is']
['The', 'price', 'of', 'the', 'book', 'is', 'dollars', ',', 'and', 'the', 'tax', 'is', '!']
["I'm", 'looking', 'forward', 'to', 'seeing', "it's", 'a', 'great', 'day']
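
One limitation of `[A-Za-z]+` is that it only covers unaccented ASCII letters, so words with accented characters get cut. A possible alternative, sketched here with the standard `re` module, is `[^\W\d_]+`, which in Python 3 matches runs of Unicode letters (word characters that are neither digits nor underscores). The sample sentence is an assumption for illustration.

```python
import re

text = "Zoë paid 3 euros at the café"

# ASCII-only letters: accented characters end the match early.
ascii_only = re.findall(r"[A-Za-z]+", text)
print(ascii_only)
# ['Zo', 'paid', 'euros', 'at', 'the', 'caf']

# Any Unicode letter: word characters minus digits and underscore.
unicode_letters = re.findall(r"[^\W\d_]+", text)
print(unicode_letters)
# ['Zoë', 'paid', 'euros', 'at', 'the', 'café']
```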


In [4]:
text = """Hello! My name is Alice. I work at the XYZ company.
I love programming in Python. How about you?"""

# Sentence tokenization
sentences = nltk.sent_tokenize(text)

# Word tokenization for each sentence
word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]

print(word_tokens)


[['Hello', '!'], ['My', 'name', 'is', 'Alice', '.'], ['I', 'work', 'at', 'the', 'XYZ', 'company', '.'], ['I', 'love', 'programming', 'in', 'Python', '.'], ['How', 'about', 'you', '?']]
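
A nested list like this keeps sentence boundaries, which makes per-sentence statistics easy, and it can be flattened when a single token stream is needed. The sketch below starts from the output printed above (copied in as a literal so it runs without NLTK) and uses only the standard library.

```python
from itertools import chain

# Output of the previous cell, copied as a literal.
word_tokens = [
    ['Hello', '!'],
    ['My', 'name', 'is', 'Alice', '.'],
    ['I', 'work', 'at', 'the', 'XYZ', 'company', '.'],
    ['I', 'love', 'programming', 'in', 'Python', '.'],
    ['How', 'about', 'you', '?'],
]

# Number of tokens in each sentence
tokens_per_sentence = [len(sentence) for sentence in word_tokens]
print(tokens_per_sentence)  # [2, 5, 7, 6, 4]

# Flatten into one token list for the whole text
flat_tokens = list(chain.from_iterable(word_tokens))
print(len(flat_tokens))  # 24
```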


In [5]:
# TweetTokenizer is NLTK's tokenizer for social media text, particularly tweets.
# It keeps hashtags, mentions, and emoticons intact as single tokens, which a
# general-purpose tokenizer tends to split apart, so it copes better with the
# noisy, informal language typical of social media content.


from nltk.tokenize import TweetTokenizer

# Initialize the TweetTokenizer
tweet_tokenizer = TweetTokenizer()

# Example tweet text
tweet_text = "Hello world! 😊 I'm loving #NLP and it's so fun! @user"

# Tokenizing the tweet
tokens = tweet_tokenizer.tokenize(tweet_text)

# Displaying the tokens
print("Tokens from tweet:")
print(tokens)

Tokens from tweet:
['Hello', 'world', '!', '😊', "I'm", 'loving', '#NLP', 'and', "it's", 'so', 'fun', '!', '@user']
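
The key difference from a generic word pattern is that `#` and `@` stay attached to the word that follows. The sketch below imitates that behaviour with a single regular expression and the standard `re` module. It is a simplified assumption for illustration only: the real `TweetTokenizer` relies on a much larger set of patterns (emoticons, URLs, and so on), and the sample tweet and the name `TWEET_PATTERN` are made up.

```python
import re

TWEET_PATTERN = re.compile(
    r"[@#]\w+"        # hashtags and mentions stay in one piece
    r"|\w+(?:'\w+)?"  # words, optionally with an apostrophe part
    r"|[^\w\s]"       # any other single non-space symbol
)

tweet = "Loving #NLP with @user, it's so fun!"

tweet_tokens = TWEET_PATTERN.findall(tweet)
print(tweet_tokens)
# ['Loving', '#NLP', 'with', '@user', ',', "it's", 'so', 'fun', '!']

# A generic words-and-punctuation pattern splits the same markers apart.
generic_tokens = re.findall(r"\w+|[^\w\s]", tweet)
print(generic_tokens)
# ['Loving', '#', 'NLP', 'with', '@', 'user', ',', 'it', "'", 's', 'so', 'fun', '!']
```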
