# $$ Step\ 5\ : Tokenization\ $$

_____________

### Tokenizing Text  

A key step in Natural Language Processing (NLP) is **tokenization**, which involves breaking down text into smaller components called **tokens**. The most common type is **word tokenization**, where each word in a sentence is treated as an individual token. However, tokenization can also occur at different levels, such as **sentence tokenization**, **subword tokenization**, or even **character tokenization**, depending on the application.  
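As a rough illustration of these levels, the sketch below uses only Python's standard library. The regular expressions are simplifications chosen for this example and are not what `nltk` does internally. Subword tokenization is left out because it normally requires a learned vocabulary.

```python
import re

sample = "Tokenization is useful. It splits text!"

# Sentence level: split on whitespace that follows ., ! or ? (naive rule)
sentence_tokens = re.split(r"(?<=[.!?])\s+", sample)

# Word level: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", sample)

# Character level: every character, including spaces
char_tokens = list(sample)

print(sentence_tokens)  # ['Tokenization is useful.', 'It splits text!']
print(word_tokens)      # ['Tokenization', 'is', 'useful', '.', 'It', 'splits', 'text', '!']
print(char_tokens[:5])  # ['T', 'o', 'k', 'e', 'n']
```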

### Why is Tokenization Important?  
Tokenization helps in analyzing text by breaking it into meaningful parts, making it easier to process and interpret. It is an essential preprocessing step before transforming text into numerical representations for machine learning models. By understanding both individual words and their overall context, we can improve tasks such as **text classification, machine translation, and sentiment analysis**.  
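To make the link between tokens and numerical representations concrete, here is a minimal sketch that maps each distinct token to an integer id. The token list and the names `vocab` and `encoded` are made up for illustration; real pipelines typically use a library vectorizer instead.

```python
# Hypothetical token list, as a word tokenizer might return it
tokens = ["the", "battery", "lasts", "all", "day", ".",
          "the", "camera", "is", "good", "."]

# Assign each distinct token an integer id in order of first appearance
vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)

# Encode the token sequence as a list of ids
encoded = [vocab[tok] for tok in tokens]

print(vocab)
print(encoded)  # [0, 1, 2, 3, 4, 5, 0, 6, 7, 8, 5]
```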

Now, let's explore some examples of sentence and word tokenization using the `nltk` package.  

____________________

In [2]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Sentence tokenization:

Sentence tokenization, also called sentence segmentation, is the process of splitting a text into individual sentences. This is useful in sentiment analysis when analyzing opinions sentence by sentence rather than as a whole.

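#### How does it work?

Most sentence tokenizers use regular expressions or pre-trained models to identify sentence boundaries. NLTK's `sent_tokenize` relies on the pre-trained Punkt model, which learns common abbreviations from data so that a period such as the one in "Dr." is not mistaken for the end of a sentence.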

In [4]:
from nltk.tokenize import sent_tokenize  

text = "The new phone is amazing! The battery lasts all day. However, the camera quality could be better. Overall, I love it!"  
sentences = sent_tokenize(text)  

print(sentences)

['The new phone is amazing!', 'The battery lasts all day.', 'However, the camera quality could be better.', 'Overall, I love it!']


In [5]:
text = "Dr. Smith is a great doctor. He works at St. John's Hospital! Do you know him?"  
sentences = sent_tokenize(text)  

print(sentences)

['Dr. Smith is a great doctor.', "He works at St. John's Hospital!", 'Do you know him?']
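To see why abbreviation handling matters, compare the output above with a naive regular-expression splitter applied to the same text. This sketch uses only the standard library, and the splitting rule is an assumption made for illustration.

```python
import re

text = "Dr. Smith is a great doctor. He works at St. John's Hospital! Do you know him?"

# Naive rule: a sentence ends at ., ! or ? followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(naive_sentences)
# ['Dr.', 'Smith is a great doctor.', 'He works at St.', "John's Hospital!", 'Do you know him?']
```

The naive rule breaks after "Dr." and "St.", producing five fragments instead of three sentences.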


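## Word tokenization:

Word tokenization splits a sentence into words and punctuation marks. As the output below shows, `word_tokenize` also separates contractions, so "It's" becomes the two tokens "It" and "'s".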

In [6]:
from nltk.tokenize import word_tokenize  

text = "I absolutely love this product! It's fantastic and exceeded my expectations."  
tokens = word_tokenize(text)  

print(tokens)

['I', 'absolutely', 'love', 'this', 'product', '!', 'It', "'s", 'fantastic', 'and', 'exceeded', 'my', 'expectations', '.']


### Example: Handling Contractions Before Tokenization

In [8]:
text = "I can't believe this! It's horrible."  
tokens = word_tokenize(text)  
print(tokens)

['I', 'ca', "n't", 'believe', 'this', '!', 'It', "'s", 'horrible', '.']


In [9]:
import re  

text = "I can't believe this! It's horrible."  
text_cleaned = re.sub(r"can't", "can not", text)  # Expand contractions
# Capture the first letter so that both "it's" and "It's" are expanded
text_cleaned = re.sub(r"\b([Ii])t's\b", r"\1t is", text_cleaned)

tokens = word_tokenize(text_cleaned)
print(tokens)

['I', 'can', 'not', 'believe', 'this', '!', 'It', 'is', 'horrible', '.']
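Hand-written substitutions do not scale beyond a few cases. One possible generalisation, sketched here with a small hypothetical `CONTRACTIONS` table and a helper named `expand_contractions` (neither is part of `nltk`), looks up each contraction in a dictionary and matches it case-insensitively. Note that "it's" can also mean "it has", so a lookup table like this is only an approximation.

```python
import re

# Small illustrative mapping; a real project would need a much longer list
CONTRACTIONS = {
    "can't": "can not",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}

# One pattern that matches any key of the table, ignoring case
pattern = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction, keeping a leading capital letter."""
    def replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        # Preserve the capitalisation of the first letter ("It's" -> "It is")
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return pattern.sub(replace, text)

print(expand_contractions("I can't believe this! It's horrible."))
# I can not believe this! It is horrible.
```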


### Example: Tokenization with Special Characters & Emojis

In [10]:
text = "I love this product!!! 😍 It's the best ever!! #HappyCustomer"
tokens = word_tokenize(text)
print(tokens)

['I', 'love', 'this', 'product', '!', '!', '!', '😍', 'It', "'s", 'the', 'best', 'ever', '!', '!', '#', 'HappyCustomer']
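In the output above, the hashtag is split into "#" and "HappyCustomer". If hashtags should stay intact, one option is a custom pattern like the one sketched below. The regular expression and the name `TOKEN_PATTERN` are assumptions made for this example. NLTK also provides a `TweetTokenizer` class in `nltk.tokenize` that is aimed at social-media text.

```python
import re

text = "I love this product!!! 😍 It's the best ever!! #HappyCustomer"

# Order matters: try hashtags first, then words (with an optional
# apostrophe part), then any single non-space symbol such as "!" or an emoji
TOKEN_PATTERN = re.compile(r"#\w+|\w+(?:'\w+)?|[^\w\s]")

social_tokens = TOKEN_PATTERN.findall(text)
print(social_tokens)
# ['I', 'love', 'this', 'product', '!', '!', '!', '😍', "It's", 'the', 'best', 'ever', '!', '!', '#HappyCustomer']
```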


### Example: Case Sensitivity in Word Tokenization

In [11]:
from nltk.tokenize import word_tokenize  

text = "Her service was outstanding! her kindness made my day."  
tokens = word_tokenize(text)  

print(tokens)


['Her', 'service', 'was', 'outstanding', '!', 'her', 'kindness', 'made', 'my', 'day', '.']


"Her" and "her" are treated as different tokens.

In frequency-based models such as bag-of-words, this splits the counts for a single word across two vocabulary entries, which can make sentiment features inconsistent.

##### Solution: Convert the Tokens to Lowercase

In [12]:
tokens_lower = [word.lower() for word in tokens]  
print(tokens_lower)

['her', 'service', 'was', 'outstanding', '!', 'her', 'kindness', 'made', 'my', 'day', '.']
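The effect on token frequencies can be checked with `collections.Counter`. This is a small sketch in which the token list is copied from the output above.

```python
from collections import Counter

tokens = ['Her', 'service', 'was', 'outstanding', '!',
          'her', 'kindness', 'made', 'my', 'day', '.']

# Case-sensitive counts: "Her" and "her" are separate entries
raw_counts = Counter(tokens)

# Case-insensitive counts: both forms are merged under "her"
lower_counts = Counter(tok.lower() for tok in tokens)

print(raw_counts["Her"], raw_counts["her"])  # 1 1
print(lower_counts["her"])                   # 2
```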


### Example: Word Tokenization for a List of Sentences

In [13]:
reviews = [
    "The product is amazing and works perfectly!",
    "I really loved the experience of using this.",
    "This is the worst purchase I have ever made.",
    "The shipping was delayed, but the product is great."
]

In [22]:
# Tokenize each review separately, giving one list of tokens per review
token = []
for sentence in reviews:
    token.append(word_tokenize(sentence))

In [23]:
token

[['The', 'product', 'is', 'amazing', 'and', 'works', 'perfectly', '!'],
 ['I', 'really', 'loved', 'the', 'experience', 'of', 'using', 'this', '.'],
 ['This', 'is', 'the', 'worst', 'purchase', 'I', 'have', 'ever', 'made', '.'],
 ['The',
  'shipping',
  'was',
  'delayed',
  ',',
  'but',
  'the',
  'product',
  'is',
  'great',
  '.']]
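A nested list like this is often flattened before counting words across the whole corpus. The sketch below shows one way to do that with the standard library. The token lists are copied from the first two reviews above, and the names `tokenized_reviews`, `all_tokens`, and `vocabulary` are chosen for this example only.

```python
from itertools import chain

# Token lists copied from the first two reviews in the output above
tokenized_reviews = [
    ['The', 'product', 'is', 'amazing', 'and', 'works', 'perfectly', '!'],
    ['I', 'really', 'loved', 'the', 'experience', 'of', 'using', 'this', '.'],
]

# Number of tokens in each review
lengths = [len(tokens) for tokens in tokenized_reviews]

# Flatten the nested lists into one corpus-level token list
all_tokens = list(chain.from_iterable(tokenized_reviews))

# Vocabulary: the distinct lowercased tokens, sorted for readability
vocabulary = sorted({tok.lower() for tok in all_tokens})

print(lengths)          # [8, 9]
print(len(all_tokens))  # 17
print(len(vocabulary))  # 16
```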

In [None]:
# Tokenize and stem words in reviews
from nltk.tokenize import word_tokenize
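
The cell above is left unfinished in the source. As a hedged sketch of what tokenizing and stemming the reviews could look like, the example below uses NLTK's `PorterStemmer`. It makes two assumptions: a simple regular expression stands in for `word_tokenize` so that the code runs without downloaded NLTK data, and the helper name `tokenize_and_stem` is invented for illustration.

```python
import re
from nltk.stem import PorterStemmer

reviews = [
    "The product is amazing and works perfectly!",
    "I really loved the experience of using this.",
]

stemmer = PorterStemmer()

def tokenize_and_stem(sentence):
    # Assumption: a regex replaces word_tokenize here; it lowercases the
    # text and keeps word tokens only, dropping punctuation
    words = re.findall(r"\w+", sentence.lower())
    return [stemmer.stem(word) for word in words]

stemmed_reviews = [tokenize_and_stem(review) for review in reviews]
print(stemmed_reviews)
```

Stemming maps inflected forms to a common stem, for example "works" to "work" and "loved" to "love", which reduces the vocabulary size at the cost of some stems not being real words.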

