# Foundations in NLP: Text Preprocessing, Tokenisation and Embeddings

**Why do we need natural language processing?**

Imagine you're a detective analyzing thousands of unstructured text files. You need patterns, connections and insights, fast.

That's where the idea comes from: computers need to understand the text in a meaningful way.

**Making Text Understandable to Computers: 3 Steps**

To help computers understand text, we need:

- Text Preprocessing 
- Tokenisation 
- Embeddings


# Step 1: Text Preprocessing

Prepare the text by:
- Lowercasing
- Removing punctuation
- Removing stopwords
- Stemming / Lemmatization (reducing words to their root form)

# Text Preprocessing on Real Data

## Objective
Clean the text data so it's uniform, lowercase, and ready for further analysis.

## Lowercasing
- Goal: Convert all text to lowercase to avoid treating "LOVE" and "love" as different words.

In [2]:
import pandas as pd

# Initial data: ten sample customer reviews
reviews = [
    "Love my new glasses! 😍 The frames are so lightweight and stylish. Perfect for my daily wear!",
    "Ugh, the lenses scratched in just a week. 😤 Not worth the price, honestly.",
    "Best purchase ever! 👓 The blue-light filter actually helps with eye strain. 10/10!",
    "Delivery was fast, but the frames feel cheap. 🤷‍♂️ Expected better quality for the price.",
    "These glasses make me look so cool! 😎 The fit is perfect, and the lenses are crystal clear.",
    "Terrible experience. 😡 The prescription was wrong, and customer service didn’t help at all.",
    "Super comfy and trendy! 💖 I get compliments every time I wear them.",
    "Meh, it’s okay. The frame color looks different from the website photo. 😐",
    "Worth every penny! The anti-glare coating is amazing for night driving. 🌟",
    "Broke within 2 weeks. 😒 Poor durability, but at least they looked nice.",
]

# Convert the list to a DataFrame with a single column, "Review", holding the raw reviews
df_reviews = pd.DataFrame(reviews, columns=["Review"])

# Store a lowercased copy of each review in a new column, "Review_LowerCase"
df_reviews["Review_LowerCase"] = df_reviews["Review"].str.lower()

# Display the DataFrame
df_reviews





| | Review | Review_LowerCase |
|---|---|---|
| 0 | Love my new glasses! 😍 The frames are so light... | love my new glasses! 😍 the frames are so light... |
| 1 | Ugh, the lenses scratched in just a week. 😤 No... | ugh, the lenses scratched in just a week. 😤 no... |
| 2 | Best purchase ever! 👓 The blue-light filter ac... | best purchase ever! 👓 the blue-light filter ac... |
| 3 | Delivery was fast, but the frames feel cheap. ... | delivery was fast, but the frames feel cheap. ... |
| 4 | These glasses make me look so cool! 😎 The fit ... | these glasses make me look so cool! 😎 the fit ... |
| 5 | Terrible experience. 😡 The prescription was wr... | terrible experience. 😡 the prescription was wr... |
| 6 | Super comfy and trendy! 💖 I get compliments ev... | super comfy and trendy! 💖 i get compliments ev... |
| 7 | Meh, it’s okay. The frame color looks differen... | meh, it’s okay. the frame color looks differen... |
| 8 | Worth every penny! The anti-glare coating is a... | worth every penny! the anti-glare coating is a... |
| 9 | Broke within 2 weeks. 😒 Poor durability, but a... | broke within 2 weeks. 😒 poor durability, but a... |


## **Removing punctuation and emojis**

**Goal**: Strip unnecessary characters such as punctuation and emojis to simplify analysis.

- Note: Emojis also add meaning to the text, and there are libraries that can extract it, but to keep things simple here we just remove them.
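If the emoji meaning is worth keeping, one rough option is to replace each emoji with its official Unicode name before cleaning. The sketch below uses only the standard library; the helper name `emoji_to_text` is hypothetical, and treating every character in Unicode category `So` ("Symbol, other") as an emoji is a simplifying assumption that a dedicated emoji library would handle more carefully.

```python
import unicodedata

def emoji_to_text(text: str) -> str:
    """Replace each symbol character (Unicode category 'So') with its lowercase Unicode name."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            # Pad with spaces so the name does not fuse with neighbouring words
            pieces.append(f" {name.lower()} " if name else " ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced by the padding
    return " ".join("".join(pieces).split())

print(emoji_to_text("Love my new glasses! 😍"))
# Love my new glasses! smiling face with heart-shaped eyes
```

The resulting words then pass through the rest of the cleaning steps like any other text.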


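In [3]:
import re

# Create a new column without punctuation or emojis:
# the pattern deletes every character that is neither a word character (\w) nor whitespace (\s)
df_reviews["Review_NoPunctEmoji"] = df_reviews["Review_LowerCase"].apply(lambda x: re.sub(r'[^\w\s]', '', x))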

In [4]:
df_reviews

| | Review | Review_LowerCase | Review_NoPunctEmoji |
|---|---|---|---|
| 0 | Love my new glasses! 😍 The frames are so light... | love my new glasses! 😍 the frames are so light... | love my new glasses the frames are so lightwe... |
| 1 | Ugh, the lenses scratched in just a week. 😤 No... | ugh, the lenses scratched in just a week. 😤 no... | ugh the lenses scratched in just a week not w... |
| 2 | Best purchase ever! 👓 The blue-light filter ac... | best purchase ever! 👓 the blue-light filter ac... | best purchase ever the bluelight filter actua... |
| 3 | Delivery was fast, but the frames feel cheap. ... | delivery was fast, but the frames feel cheap. ... | delivery was fast but the frames feel cheap e... |
| 4 | These glasses make me look so cool! 😎 The fit ... | these glasses make me look so cool! 😎 the fit ... | these glasses make me look so cool the fit is... |
| 5 | Terrible experience. 😡 The prescription was wr... | terrible experience. 😡 the prescription was wr... | terrible experience the prescription was wron... |
| 6 | Super comfy and trendy! 💖 I get compliments ev... | super comfy and trendy! 💖 i get compliments ev... | super comfy and trendy i get compliments ever... |
| 7 | Meh, it’s okay. The frame color looks differen... | meh, it’s okay. the frame color looks differen... | meh its okay the frame color looks different f... |
| 8 | Worth every penny! The anti-glare coating is a... | worth every penny! the anti-glare coating is a... | worth every penny the antiglare coating is ama... |
| 9 | Broke within 2 weeks. 😒 Poor durability, but a... | broke within 2 weeks. 😒 poor durability, but a... | broke within 2 weeks poor durability but at l... |
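One side effect is visible in the output above: deleting the hyphen turns "blue-light" into "bluelight". A possible alternative, sketched here on a single lowercased review, is to replace each removed character with a space and then collapse the extra whitespace. The variable names are illustrative only, and which variant is better depends on the data.

```python
import re

sample = "best purchase ever! 👓 the blue-light filter actually helps with eye strain. 10/10!"

# Variant used in the notebook: delete everything that is not a word character or whitespace
deleted = re.sub(r"[^\w\s]", "", sample)

# Alternative: substitute a space instead, then collapse runs of whitespace
spaced = " ".join(re.sub(r"[^\w\s]", " ", sample).split())

print(deleted)
print(spaced)
# best purchase ever the blue light filter actually helps with eye strain 10 10
```

The trade-off is that contractions are split as well: with the space-based variant, "it's" becomes "it s" rather than "its".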


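## **Remove Stopwords**

**Goal**: Remove very common words (such as "the", "is" and "and") that add little meaning on their own.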

In [5]:
!pip uninstall nltk -y


Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1


In [6]:
!pip install --upgrade nltk

Collecting nltk
  Using cached nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Using cached nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.9.1


In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize  # not used in this cell

# Download the stop-word list and the Punkt tokenizer data
nltk.download('stopwords')
nltk.download('punkt')

# Keep only the words that are not in NLTK's English stop-word list
stop_words = set(stopwords.words('english'))
df_reviews["Review_NoStopwords"] = df_reviews["Review_NoPunctEmoji"].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop_words)
)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
df_reviews

| | Review | Review_LowerCase | Review_NoPunctEmoji | Review_NoStopwords |
|---|---|---|---|---|
| 0 | Love my new glasses! 😍 The frames are so light... | love my new glasses! 😍 the frames are so light... | love my new glasses the frames are so lightwe... | love new glasses frames lightweight stylish pe... |
| 1 | Ugh, the lenses scratched in just a week. 😤 No... | ugh, the lenses scratched in just a week. 😤 no... | ugh the lenses scratched in just a week not w... | ugh lenses scratched week worth price honestly |
| 2 | Best purchase ever! 👓 The blue-light filter ac... | best purchase ever! 👓 the blue-light filter ac... | best purchase ever the bluelight filter actua... | best purchase ever bluelight filter actually h... |
| 3 | Delivery was fast, but the frames feel cheap. ... | delivery was fast, but the frames feel cheap. ... | delivery was fast but the frames feel cheap e... | delivery fast frames feel cheap expected bette... |
| 4 | These glasses make me look so cool! 😎 The fit ... | these glasses make me look so cool! 😎 the fit ... | these glasses make me look so cool the fit is... | glasses make look cool fit perfect lenses crys... |
| 5 | Terrible experience. 😡 The prescription was wr... | terrible experience. 😡 the prescription was wr... | terrible experience the prescription was wron... | terrible experience prescription wrong custome... |
| 6 | Super comfy and trendy! 💖 I get compliments ev... | super comfy and trendy! 💖 i get compliments ev... | super comfy and trendy i get compliments ever... | super comfy trendy get compliments every time ... |
| 7 | Meh, it’s okay. The frame color looks differen... | meh, it’s okay. the frame color looks differen... | meh its okay the frame color looks different f... | meh okay frame color looks different website p... |
| 8 | Worth every penny! The anti-glare coating is a... | worth every penny! the anti-glare coating is a... | worth every penny the antiglare coating is ama... | worth every penny antiglare coating amazing ni... |
| 9 | Broke within 2 weeks. 😒 Poor durability, but a... | broke within 2 weeks. 😒 poor durability, but a... | broke within 2 weeks poor durability but at l... | broke within 2 weeks poor durability least loo... |
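Row 1 of the output shows a caveat: "not worth the price" becomes "worth price", because "not" is in the stop-word list, and the sentiment of the review flips. If negation matters for the later analysis, one option is to keep a few negation words. The sketch below uses a small hand-written stop-word set as a stand-in for the NLTK list so that it runs without downloads; both that set and the `negations` keep-list are assumptions made for illustration.

```python
# Small illustrative stop-word set (a stand-in for NLTK's English list)
stop_words = {"the", "in", "just", "a", "not", "no"}

# Hypothetical keep-list of negation words that should survive filtering
negations = {"not", "no", "nor", "never"}
custom_stop_words = stop_words - negations

text = "ugh the lenses scratched in just a week not worth the price honestly"
filtered = " ".join(word for word in text.split() if word not in custom_stop_words)

print(filtered)
# ugh lenses scratched week not worth price honestly
```

The same set difference can be applied to the `stop_words` set built from NLTK in the cell above.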


# Stemming 

### **Stemming in RAG (Retrieval-Augmented Generation)**  

**Definition:**  
Stemming is a text normalization technique in Natural Language Processing (NLP) that reduces words to a base or root form by stripping affixes, usually suffixes, with simple rules. In RAG, stemming can improve retrieval by mapping different word forms to the same root, so a query can match documents that use a different form of the same word.  

### **Simple Example:**  
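- Original words: *"running", "runs", "runner"*  
- After stemming (ideal case): *"run", "run", "run"*  

The exact output depends on the stemmer: the widely used Porter stemmer, for example, maps *"running"* and *"runs"* to *"run"* but leaves *"runner"* unchanged.

If a user searches for *"run"*, the RAG system can then retrieve documents containing any variation that shares the stem, improving recall.  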
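The idea can be illustrated with a toy suffix stripper. This is a deliberately crude sketch and not the Porter algorithm or any library's implementation; the function `toy_stem` and its three-suffix list are hypothetical.

```python
def toy_stem(word: str) -> str:
    """Strip one of a few common English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "er", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

words = ["running", "runs", "runner", "run"]
print([toy_stem(w) for w in words])
# ['run', 'run', 'run', 'run']
```

Rules this simple also produce non-words: `toy_stem("glasses")` returns `"glasse"`. For real work, a library stemmer such as NLTK's `PorterStemmer` would be the usual choice.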

### **How It Helps in RAG:**  
- Reduces vocabulary size for efficient searching.  
- Helps match queries with relevant documents even if word forms differ.  

**Note:** Stemming is less precise than lemmatization, which uses a vocabulary and part-of-speech information to return valid dictionary words, but it is faster and useful for large-scale retrieval tasks.


### **Tokenization in RAG (Retrieval-Augmented Generation)**  

#### **Definition:**  
Tokenization is the process of breaking down text into smaller units called **tokens**, which can be words, subwords, or even characters. In RAG, tokenization helps the system process and understand text efficiently.  

#### **Why is Tokenization Needed?**  
1. **Standardizes Input:** Converts raw text into a structured format for NLP models.  
2. **Enables Efficient Processing:** Breaks text into small units that a model can count, index, and map to numeric IDs.  
3. **Improves Retrieval & Generation:** Ensures accurate matching of queries with documents in RAG.  

#### **How Tokenization Works:**  
1. **Splitting Text:** A sentence is divided into tokens (words, subwords, or symbols).  
2. **Handling Punctuation & Special Cases:** Decides whether to treat punctuation as separate tokens.  
3. **Subword Tokenization (Advanced):** Breaks rare words into smaller, more frequent sub-parts (e.g., "unhappiness" → "un", "happiness"; the actual split depends on the tokenizer's learned vocabulary).  
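The subword step can be sketched as a greedy longest-match split against a fixed vocabulary, loosely in the spirit of WordPiece but without its `##` continuation markers. The vocabulary below is a hand-picked toy set, and `greedy_subword_tokenize` is a hypothetical helper; real tokenizers learn their vocabulary from data.

```python
def greedy_subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens

toy_vocab = {"un", "happiness", "happy", "ness", "run", "ing"}
print(greedy_subword_tokenize("unhappiness", toy_vocab))
# ['un', 'happiness']
```

Because unknown stretches fall back to single characters, any input can be tokenized, which is the property that lets subword tokenizers cope with rare words.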

---

### **Simple Example:**  
**Input Sentence:**  
*"ChatGPT is amazing!"*  

**Tokenized Output (Word-Level):**  
`["ChatGPT", "is", "amazing", "!"]`  

**Tokenized Output (Subword-Level, e.g., Byte-Pair Encoding; illustrative, since the exact pieces depend on the learned vocabulary):**  
`["Chat", "G", "PT", "is", "amazing", "!"]`  
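The word-level output can be reproduced with a single regular expression. This is a minimal sketch rather than the tokenizer of any particular library; the function name `word_tokenize_simple` is made up for the example.

```python
import re

def word_tokenize_simple(text: str) -> list[str]:
    """Return runs of word characters and individual punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize_simple("ChatGPT is amazing!"))
# ['ChatGPT', 'is', 'amazing', '!']
```

Here every punctuation mark becomes its own token, so a contraction such as "it's" is split into three tokens. Library tokenizers usually apply extra rules for such cases.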

---

### **How It Helps in RAG:**  
- **Better Search:** Ensures queries and documents are broken into comparable units.  
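- **Handles Unknown Words:** Subword tokenization represents rare or misspelled words as sequences of known pieces instead of a single unknown token.  
- **Compatibility with LLMs:** Language models such as GPT operate on token IDs, so input text must be tokenized with the model's own tokenizer.  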
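As a rough illustration of the "comparable units" point, the sketch below tokenizes a query and a few documents with the same function and ranks the documents by the number of shared tokens. The documents are shortened variants of the sample reviews, and the scoring rule is a simplifying assumption. Production RAG systems typically rely on weighted keyword scores or embedding similarity instead.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens; punctuation is dropped."""
    return set(re.findall(r"\w+", text.lower()))

def rank(query: str, docs: list[str]) -> list[tuple[int, str]]:
    """Score each document by how many tokens it shares with the query, best first."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

documents = [
    "The blue-light filter helps with eye strain.",
    "Delivery was fast, but the frames feel cheap.",
    "The anti-glare coating is amazing for night driving.",
]

best_score, best_doc = rank("Does the filter reduce eye strain?", documents)[0]
print(best_score, best_doc)
# 4 The blue-light filter helps with eye strain.
```

Because the query and the documents go through the same tokenizer, "strain?" in the query still matches "strain." in the document.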

**Note:** The choice of tokenizer (word-based, subword-based, or character-based) affects RAG's performance. Modern systems often use **subword tokenizers** (like BPE or WordPiece) for flexibility.