# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">Text Preprocessing in NLP </p>

## 🧠 **Text Preprocessing in NLP**

Text preprocessing is a **foundational step** in *Natural Language Processing (NLP)* that focuses on cleaning, organizing, and transforming raw text data into a structured form suitable for analysis and machine learning models.  
It plays a **critical role** in improving both the efficiency and accuracy of NLP applications.

---

### 🎯 **Why Text Preprocessing Matters**

The main goal of preprocessing is to **eliminate noise and irrelevant information** from text, such as unnecessary symbols, punctuation marks, and frequent but uninformative words (stop words).  
By doing this, we reduce the overall complexity of the dataset and make it easier for models to extract meaningful linguistic patterns.

Additionally, **normalization techniques** like *stemming* and *lemmatization* convert words into their base or root forms, helping maintain consistency and reducing redundancy across the corpus.

---

### 💬 **Example**

> **Original Sentence:**  
> “The quick brown foxes are jumping over the lazy dogs.”

> **After Preprocessing:**  
> “quick brown fox jump lazy dog.”

This transformation captures the **essential linguistic elements**, allowing NLP models to focus on the **core meaning** of the text rather than stylistic or grammatical variations.

---

### ⚙️ **Key Steps in Text Preprocessing**

1. 🧩 Convert all text to lowercase  
2. 🧼 Remove HTML tags and special characters  
3. 🌐 Eliminate URLs and web links  
4. ✂️ Strip punctuation marks  
5. 💬 Expand abbreviations and chat words  
6. 🪄 Correct spelling errors  
7. 🚫 Remove non-essential stop words  
8. 😊 Process or remove emojis and emoticons  
9. ✏️ Perform tokenization (split text into words or subwords)  
10. 🌱 Apply stemming  
11. 📚 Apply lemmatization  
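The order of these steps matters. For example, URLs should be removed before punctuation is stripped, otherwise `https://example.com` collapses into `httpsexamplecom` and no longer looks like a URL. Below is a minimal sketch of chaining steps in sequence with only the standard library. The names `PIPELINE` and `preprocess` and the simplified regular expressions are illustrative assumptions, not the functions defined later in this notebook.

```python
import re
import string

def lowercase(text):
    return text.lower()

def drop_tags(text):
    # Simplified: replaces anything that looks like <...> with a space
    return re.sub(r"<[^>]+>", " ", text)

def drop_urls(text):
    return re.sub(r"https?://\S+|www\.\S+", " ", text)

def drop_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def squeeze_spaces(text):
    return re.sub(r"\s+", " ", text).strip()

# Steps run left to right; URLs are removed before punctuation
PIPELINE = (lowercase, drop_tags, drop_urls, drop_punctuation, squeeze_spaces)

def preprocess(text, steps=PIPELINE):
    for step in steps:
        text = step(text)
    return text

print(preprocess("Great movie!<br />See https://example.com/review for more."))

# Wrong order: the URL is mangled before it can be recognised
print(preprocess("see https://example.com",
                 steps=(drop_punctuation, drop_urls, squeeze_spaces)))
```

Keeping the steps in a tuple makes it easy to reorder or drop a step for a given model.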

---

### 🔍 **Conclusion**

In the following sections, we’ll explore **each preprocessing technique** in detail: its purpose, how it works, and its impact on model performance and linguistic clarity in NLP tasks.

> “Clean text is the foundation of every intelligent NLP system.”

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">Import IMDB Movie Reviews Dataset</p>

In [18]:
# import basic libraries
import pandas as pd
df = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")

In [19]:
df = df.rename(columns={
    'label (depression result)': 'sentiment',
    'message to examine': 'review'
})

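**What:** Standardizes two column names:

- label (depression result) → sentiment

- message to examine → review

Note that `rename` silently ignores keys that are not present in the frame. If the CSV already uses `review` and `sentiment`, as the `df.head()` output below suggests, this call leaves the columns unchanged.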

**Why:**

- ***Standardization:*** Clean, predictable names (no spaces/parentheses) make downstream code simpler and less error-prone.

- ***Convention:*** Most NLP pipelines expect something like text/review and label/sentiment. Consistent naming lets you reuse code (vectorizers, tokenizers, split functions) across projects.

- ***Avoid bugs:*** Columns with spaces or special characters can be awkward in code and break formulaic access patterns.
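One caveat worth sketching: by default `DataFrame.rename` ignores mapping keys that do not exist, so a typo or a mismatched source schema goes unnoticed. The example below uses a small made-up frame (not the IMDB data) to show both the silent default and the `errors="raise"` option.

```python
import pandas as pd

# Toy frame for illustration only
demo = pd.DataFrame({"review": ["good film"], "sentiment": ["positive"]})
mapping = {"message to examine": "review"}

# Default behaviour: the missing key is ignored and nothing changes
unchanged = demo.rename(columns=mapping)
print(list(unchanged.columns))

# errors="raise" turns a missing source column into a KeyError
try:
    demo.rename(columns=mapping, errors="raise")
except KeyError as exc:
    print("rename failed:", exc)
```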

In [20]:
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. `<br /><br />`The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">1. Converting all text to lowercase</p>

### 1️⃣ **Lowercasing**
- **Why:** Converts all text to a single case (usually lowercase).  
- **Problem Solved:** Prevents duplicates like *“Great”* and *“great”* from being treated as separate tokens.  
- **When to Use:** Almost always for classical ML (TF-IDF). For transformer models, follow the checkpoint: uncased models lowercase their input themselves, while cased models expect the original casing.

In [21]:
df['review'] = df['review'].str.lower()
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | one of the other reviewers has mentioned that ... | positive |
| 1 | a wonderful little production. `<br /><br />`the... | positive |
| 2 | i thought this was a wonderful way to spend ti... | positive |
| 3 | basically there's a family where a little boy ... | negative |
| 4 | petter mattei's "love in the time of money" is... | positive |


***Now we see all the sentences in the corpus are in lowercase.***
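As a small illustration with made-up tokens, the sketch below counts how many distinct vocabulary entries remain after lowercasing. It also shows `str.casefold()`, a more aggressive built-in alternative that can help with non-English text.

```python
from collections import Counter

# Made-up tokens for illustration
tokens = ["Great", "great", "GREAT", "movie", "Movie"]

raw_vocab = Counter(tokens)
lower_vocab = Counter(t.lower() for t in tokens)
print(len(raw_vocab), "distinct tokens before,", len(lower_vocab), "after")

# casefold() also normalises characters that lower() leaves alone
word = "Stra\u00dfe"  # German "Strasse" written with a sharp s
print(word.lower(), word.casefold())
```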

----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">2. Removing HTML tags and special characters</p>

### 2️⃣ **Removing HTML Tags & Special Characters**
- **Why:** Removes markup such as `<br>` or `<a>` that carry no linguistic meaning.  
- **Problem Solved:** Eliminates structural noise from web-scraped or online data.  
- **When to Use:** Essential for datasets collected from websites, blogs, or reviews.

In [22]:
import re, html

# Fast, dependency-free
def strip_html(text, url_token=' URL '):
    if not isinstance(text, str):
        return ''
    t = html.unescape(text)

    # 1) Drop <script> and <style> blocks together with their content
    t = re.sub(r'(?is)<(script|style).*?>.*?</\1\s*>', ' ', t)

    # 2) Line breaks -> space
    t = re.sub(r'(?i)<br\s*/?>', ' ', t)

    # 3) Anchors: keep visible text, optionally add a URL token
    t = re.sub(r'(?is)<a\s+[^>]*href=["\']?([^"\'>\s]+)[^>]*>(.*?)</a>', r'\2' + url_token, t)

    # 4) Remove any remaining tags
    t = re.sub(r'(?s)<[^>]+>', ' ', t)

    # 5) Collapse whitespace
    t = re.sub(r'\s+', ' ', t).strip()
    return t

In [23]:
# text example
text = "<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"
print(strip_html(text))  # -> "Movie 1 Actor - Aamir Khan Click here to download URL"

Movie 1 Actor - Aamir Khan Click here to download URL


***The function cleans the HTML tags from the text as expected. We can also apply it to the whole corpus.***

In [25]:
# Apply to a corpus
df['review'] = df['review'].astype(str).map(strip_html)
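Regular expressions work well for the simple markup in these reviews, but they can misfire on malformed or nested HTML. As an alternative, here is a minimal sketch built on the standard library's `html.parser`. The class `_TextExtractor` and the function `strip_html_parser` are hypothetical names introduced for this example, and unlike `strip_html` the sketch adds no URL token.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; etc. for us
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace
        return " ".join(" ".join(self._chunks).split())

def strip_html_parser(raw):
    if not isinstance(raw, str):
        return ""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()

sample = ("<p>A wonderful little production.<br /><br />"
          "The filming technique is great &amp; unassuming.</p>"
          "<script>alert(1)</script>")
print(strip_html_parser(sample))
```

Because text chunks are joined with spaces, text split by inline tags (for example `<b>movie</b>.`) gains a space before the punctuation; whether that matters depends on the later steps.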

----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">3. Eliminating URLs and links</p>

### 3️⃣ **Removing URLs**
- **Why:** Links rarely add contextual value to sentiment or topic understanding.  
- **Problem Solved:** Prevents random tokens like `http`, `www`, or domain names from bloating the vocabulary.  
- **When to Use:** Always remove or replace with a neutral token like `[URL]`.

In [32]:
import re

def remove_urls(text):
    if not isinstance(text, str):
        return ''
    
    # Remove URLs starting with http://, https://, or www.
    cleaned = re.sub(r'(http|https)://\S+|www\.\S+', ' ', text)
    
    # Collapse extra spaces and strip edges
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    return cleaned

In [29]:
# Suppose we have the following texts with URLs.
text1 = 'Check out my notebook https://www.kaggle.com/campusx/notebook8223fc1abb'
text2 = 'Check out my notebook http://www.kaggle.com/campusx/notebook8223fc1abb'
text3 = 'Google search here www.google.com'
text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'

In [33]:
# Let's remove the URLs by calling the function
print(remove_urls(text1))
print(remove_urls(text2))
print(remove_urls(text3))
print(remove_urls(text4))

Check out my notebook
Check out my notebook
Google search here
For notebook click to search check


***The function removes the URLs from each text. We can call it on the whole corpus in the same way.***
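The bullet list above also mentions replacing links with a neutral token instead of deleting them. A minimal sketch of that variant follows. The function name `replace_urls`, the default token `[URL]`, and the set of trailing punctuation characters are assumptions made for this example.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)
TRAILING = ".,;:!?)\"'"  # punctuation that usually belongs to the sentence, not the URL

def replace_urls(text, token="[URL]"):
    if not isinstance(text, str):
        return ""

    def _sub(match):
        url = match.group(0)
        core = url.rstrip(TRAILING)
        # Put back any sentence punctuation that \S+ swallowed
        return token + url[len(core):]

    return re.sub(r"\s+", " ", URL_RE.sub(_sub, text)).strip()

print(replace_urls("Read more at https://example.com/page, or www.example.org."))
```

A placeholder keeps the information that a link was present, which some models can use, while still preventing every distinct URL from becoming its own vocabulary entry.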

----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">4. Stripping out punctuation marks</p>

### 4️⃣ **Removing Punctuation**
- **Why:** Simplifies the text for models that don’t rely on punctuation.  
- **Problem Solved:** Reduces vocabulary clutter.  
- **When to Use:** For TF-IDF models; keep `!` and `?` for sentiment tasks if needed.

In [34]:
# Import the string module to get the list of punctuation characters.
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [35]:
# Storing Punctuation in a Variable
punc = string.punctuation

In [36]:
# remove_punc takes a text input and removes all punctuation characters from it using
# the translate method with a translation table created by str.maketrans.
def remove_punc(text):
    return text.translate(str.maketrans('', '', punc))

In [37]:
# Text With Punctuation.
text = "The quick brown fox jumps over the lazy dog. However, the dog doesn't seem impressed! Oh no, it just yawned. How disappointing! Maybe a squirrel would elicit a reaction. Alas, the fox is out of luck."
text


"The quick brown fox jumps over the lazy dog. However, the dog doesn't seem impressed! Oh no, it just yawned. How disappointing! Maybe a squirrel would elicit a reaction. Alas, the fox is out of luck."

In [38]:
# Remove Punctuation.
remove_punc(text)

'The quick brown fox jumps over the lazy dog However the dog doesnt seem impressed Oh no it just yawned How disappointing Maybe a squirrel would elicit a reaction Alas the fox is out of luck'

***The function removes the punctuation from the text, including apostrophes (“doesn't” becomes “doesnt”), and we can use it on the whole corpus as well.***

In [39]:
# Example on a review from the dataset.
print(df['review'][10])

# Remove Punctuation
remove_punc(df['review'][10])

phil the alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines. at first it was very odd and pretty funny but as the movie progressed i didn't find the jokes or oddness funny anymore. its a low budget film (thats never a problem in itself), there were some pretty interesting characters, but eventually i just lost interest. i imagine this film would appeal to a stoner who is currently partaking. for something similar but better try "brother from another planet"


'phil the alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines at first it was very odd and pretty funny but as the movie progressed i didnt find the jokes or oddness funny anymore its a low budget film thats never a problem in itself there were some pretty interesting characters but eventually i just lost interest i imagine this film would appeal to a stoner who is currently partaking for something similar but better try brother from another planet'
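For sentiment tasks, the guidance above suggests keeping `!` and `?`. One way to do that, sketched here under the assumption that only those two marks should survive, is to build the translation table once from `string.punctuation` minus the characters to keep. The factory name `make_punct_remover` is hypothetical.

```python
import string

def make_punct_remover(keep="!?"):
    # Drop every punctuation character except the ones listed in `keep`
    drop = "".join(ch for ch in string.punctuation if ch not in keep)
    table = str.maketrans("", "", drop)  # built once, reused on every call

    def remove(text):
        return text.translate(table)

    return remove

remove_punc_keep = make_punct_remover()
print(remove_punc_keep("How disappointing! Really, why?"))
```

Building the table once avoids recreating it for each of the 50,000 reviews.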

----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">5. Handling abbreviations and chat words</p>

### 5️⃣ **Handling Abbreviations & Chat Words**
- **Why:** Expands short forms (e.g., *“FYI” → “for your information”*).  
- **Problem Solved:** Ensures acronyms are understandable to models.  
- **When to Use:** Social media, chat data, or informal text.

In [42]:
# Chat-word dictionary taken from a GitHub repository
# Repository link: https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt
chat_words = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "ATK": "At The Keyboard",
    "ATM": "At The Moment",
    "A3": "Anytime, Anywhere, Anyplace",
    "BAK": "Back At Keyboard",
    "BBL": "Be Back Later",
    "BBS": "Be Back Soon",
    "BFN": "Bye For Now",
    "B4N": "Bye For Now",
    "BRB": "Be Right Back",
    "BRT": "Be Right There",
    "BTW": "By The Way",
    "B4": "Before",
    "CU": "See You",
    "CUL8R": "See You Later",
    "CYA": "See You",
    "FAQ": "Frequently Asked Questions",
    "FC": "Fingers Crossed",
    "FWIW": "For What It's Worth",
    "FYI": "For Your Information",
    "GAL": "Get A Life",
    "GG": "Good Game",
    "GN": "Good Night",
    "GMTA": "Great Minds Think Alike",
    "GR8": "Great!",
    "G9": "Genius",
    "IC": "I See",
    "ICQ": "I Seek you (also a chat program)",
    "ILU": "I Love You",
    "IMHO": "In My Honest/Humble Opinion",
    "IMO": "In My Opinion",
    "IOW": "In Other Words",
    "IRL": "In Real Life",
    "KISS": "Keep It Simple, Stupid",
    "LDR": "Long Distance Relationship",
    "LMAO": "Laugh My A.. Off",
    "LOL": "Laughing Out Loud",
    "LTNS": "Long Time No See",
    "L8R": "Later",
    "MTE": "My Thoughts Exactly",
    "M8": "Mate",
    "NRN": "No Reply Necessary",
    "OIC": "Oh I See",
    "PITA": "Pain In The A..",
    "PRT": "Party",
    "PRW": "Parents Are Watching",
    "QPSA?": "Que Pasa?",
    "ROFL": "Rolling On The Floor Laughing",
    "ROFLOL": "Rolling On The Floor Laughing Out Loud",
    "ROTFLMAO": "Rolling On The Floor Laughing My A.. Off",
    "SK8": "Skate",
    "STATS": "Your sex and age",
    "ASL": "Age, Sex, Location",
    "THX": "Thank You",
    "TTFN": "Ta-Ta For Now!",
    "TTYL": "Talk To You Later",
    "U": "You",
    "U2": "You Too",
    "U4E": "Yours For Ever",
    "WB": "Welcome Back",
    "WTF": "What The F...",
    "WTG": "Way To Go!",
    "WUF": "Where Are You From?",
    "W8": "Wait...",
    "7K": "Sick:-D Laugher",
    "TFW": "That feeling when",
    "MFW": "My face when",
    "MRW": "My reaction when",
    "IFYP": "I feel your pain",
    "TNTL": "Trying not to laugh",
    "JK": "Just kidding",
    "IDC": "I don't care",
    "ILY": "I love you",
    "IMU": "I miss you",
    "ADIH": "Another day in hell",
    "ZZZ": "Sleeping, bored, tired",
    "WYWH": "Wish you were here",
    "TIME": "Tears in my eyes",
    "BAE": "Before anyone else",
    "FIMH": "Forever in my heart",
    "BSAAW": "Big smile and a wink",
    "BWL": "Bursting with laughter",
    "BFF": "Best friends forever",
    "CSL": "Can't stop laughing"
}

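***The code defines a function, chat_conversion_optimized, that replaces chat acronyms with their expanded forms from the predefined dictionary. It splits the input into word and punctuation tokens, looks up the uppercased form of each token in the dictionary, and substitutes the expansion when one is found. Because the lookup ignores case, ordinary words that collide with an acronym (such as “time” and the key `TIME`) are expanded too, so prune the dictionary for your corpus before applying it to regular prose.***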

In [43]:
import re
def chat_conversion_optimized(text):
    if not isinstance(text, str):
        return ''
    
    # Tokenize words and punctuation
    tokens = re.findall(r"\b\w+\b|[^\w\s]", text)
    
    expanded = []
    for token in tokens:
        word = token.upper()
        if word in chat_words:
            expanded.append(chat_words[word])
        else:
            expanded.append(token)
            
    # Join and normalize spaces
    cleaned_text = re.sub(r'\s+', ' ', " ".join(expanded)).strip()
    return cleaned_text

In [44]:
# Text
text = 'IMHO he is the best'
text1 = 'FYI Islamabad is the capital of Pakistan'
# Calling function
print(chat_conversion_optimized(text))
print(chat_conversion_optimized(text1))

In My Honest/Humble Opinion he is the best
For Your Information Islamabad is the capital of Pakistan


***That is how we handle chat words in our data: simply call the function above. Note that tokens are rejoined with single spaces, so punctuation ends up separated from the preceding word.***
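Two side effects of the lookup are worth guarding against on review text: ordinary words such as "time" match acronym keys, and rejoining tokens changes the spacing around punctuation. The following is a minimal sketch of one possible safeguard. It substitutes in place with `re.sub` and skips a hand-picked set of ambiguous keys. The small dictionary `chat_words_demo`, the set `AMBIGUOUS`, and the function name are assumptions for illustration, not part of the original dictionary.

```python
import re

# Tiny stand-in for the full chat_words dictionary
chat_words_demo = {
    "FYI": "for your information",
    "BRB": "be right back",
    "TIME": "tears in my eyes",
}

# Keys that are also everyday English words (hand-picked, not exhaustive)
AMBIGUOUS = {"TIME", "KISS", "ATM", "STATS", "GAL"}

def expand_chat_words(text, mapping=chat_words_demo, skip=AMBIGUOUS):
    if not isinstance(text, str):
        return ""

    def _sub(match):
        word = match.group(0)
        key = word.upper()
        if key in mapping and key not in skip:
            return mapping[key]
        return word  # leave everything else exactly as written

    # Substituting in place keeps the original spacing and punctuation
    return re.sub(r"\b\w+\b", _sub, text)

print(expand_chat_words("fyi, this is my first time here. brb!"))
```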

----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">6. Correcting spelling mistakes</p> 

### 6️⃣ **Spelling Correction**
- **Why:** Fixes typographical or misspelled words.  
- **Problem Solved:** Reduces redundant tokens (*“gooood”*, *“gret”*).  
- **When to Use:** Only if text quality is low; can be skipped for transformer models.

In [45]:
# Import TextBlob to handle spelling correction.
from textblob import TextBlob

In [46]:
# Two sentences with deliberate spelling mistakes
incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.'
incorrect_text2 = 'The cat sat on the cuchion. while plyaiing'

# Print each original sentence followed by TextBlob's corrected version
for sample in (incorrect_text, incorrect_text2):
    print(sample)
    print(TextBlob(sample).correct().string)

ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.
certain conditions during several generations are modified in the same manner.
The cat sat on the cuchion. while plyaiing
The cat sat on the cushion. while playing


***The library does a good job of fixing these spelling mistakes. The same process can be applied to the full corpus, although correcting every review this way can be slow.***
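Elongated words such as “gooood” are a common source of redundant tokens in informal text. A cheap pre-step, sketched below with only the standard library, is to cap runs of a repeated letter at two before (or instead of) running a spell checker. The function name `squeeze_repeats` and the cap of two are assumptions chosen for this example.

```python
import re

# Three or more repeats of the same ASCII letter (digits are left alone)
_REPEATS = re.compile(r"([a-zA-Z])\1{2,}")

def squeeze_repeats(text, max_run=2):
    # Replace each long run with at most `max_run` copies of the letter
    return _REPEATS.sub(lambda m: m.group(1) * max_run, text)

print(squeeze_repeats("gooood"))
print(squeeze_repeats("sooooo coooool"))
print(squeeze_repeats("1000000 views"))
```

The result is not always a real word (“sooooo” becomes “soo”), but the number of variants a spell checker or vectorizer has to deal with shrinks.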

----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">7. Removing non-essential stop words</p> 

### 7️⃣ **Removing Stop Words**
- **Why:** Eliminates frequent but low-value words like *“the”, “is”, “at”*.  
- **Problem Solved:** Reduces dimensionality and noise.  
- **When to Use:** For classical models; **keep negations** (e.g., *“not”*, *“never”*) in sentiment analysis.

In [47]:
# We use the NLTK library to remove stop words.
# If the corpus is missing, run nltk.download('stopwords') once.
from nltk.corpus import stopwords

In [48]:
# Load the English stop words; other languages, such as Spanish, are available too.
stopword = stopwords.words('english')

***The code defines a function, remove_stopwords, which removes stop words from a given text. It iterates through each whitespace-separated word and replaces it with an empty string if it is a stop word. The words are then joined back with spaces, so each removed word leaves an extra space in the output.***

In [49]:
# Function
def remove_stopwords(text):
    # Replace each stop word with an empty string; the empty slots
    # show up as extra spaces once the words are joined again.
    kept = ['' if word in stopword else word for word in text.split()]
    return " ".join(kept)

In [50]:
# Text
text = 'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times'
print(f'Text With Stop Words :{text}')
# Calling Function
remove_stopwords(text)

Text With Stop Words :probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. it just never gets old, despite my having seen it some 15 or more times


'probably  all-time favorite movie,  story  selflessness, sacrifice  dedication   noble cause,    preachy  boring.   never gets old, despite   seen   15   times'

In [51]:
# We can apply the same function to the whole corpus (assign the result back to df['review'] to keep it)
df['review'].apply(remove_stopwords)

0        one    reviewers  mentioned   watching  1 oz e...
1         wonderful little production.  filming techniq...
2         thought    wonderful way  spend time    hot s...
3        basically there's  family   little boy (jake) ...
4        petter mattei's "love   time  money"   visuall...
                               ...                        
49995     thought  movie    right good job.    creative...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997       catholic taught  parochial elementary schoo...
49998     going    disagree   previous comment  side  m...
49999     one expects  star trek movies   high art,   f...
Name: review, Length: 50000, dtype: object

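***This is the function we use to remove stop words from text. Note that NLTK's English list contains negations such as “not”, which is why “not preachy” became “preachy” above; for sentiment analysis, consider excluding them from the stop list.***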
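To follow the "keep negations" advice, the stop list can be filtered before use. The sketch below uses a small hand-written stop set as a stand-in for `stopwords.words('english')` so that it runs without NLTK data. The names `DEMO_STOPWORDS`, `NEGATIONS`, and `remove_stopwords_keep_negations` are assumptions for this example.

```python
# Small stand-in for NLTK's English stop-word list (illustrative only)
DEMO_STOPWORDS = {
    "a", "an", "the", "is", "it", "of", "and", "or", "to", "my",
    "but", "not", "no", "nor",
}

# Words that flip sentiment and should survive stop-word removal
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords_keep_negations(text, stop_words=DEMO_STOPWORDS, keep=NEGATIONS):
    # A set gives constant-time lookups; negations are taken out of it
    effective = set(stop_words) - set(keep)
    # Dropping words (instead of inserting '') avoids double spaces
    return " ".join(w for w in text.split() if w.lower() not in effective)

print(remove_stopwords_keep_negations(
    "it is not preachy or boring and it never gets old"))
```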

----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">8. Processing or removing emojis and emoticons</p> 

### 8️⃣ **Handling Emojis & Emoticons**
- **Why:** Emojis express emotions directly (*😊 → happy*).  
- **Problem Solved:** Preserves or translates emotional cues for sentiment detection.  
- **When to Use:** Social media or review datasets.

### 8.1 Simply Remove Emojis

***The code defines a function, remove_emoji, which uses a regular expression to match and remove all emojis from a given text string. It targets various Unicode ranges corresponding to different categories of emojis and replaces them with an empty string, effectively removing them from the text.***


In [52]:
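# Again we use a regular expression to remove emojis from a text or a whole corpus.
def remove_emoji(text):
    # Ranges are kept narrow so that ordinary scripts (e.g. CJK) are left untouched
    emoji_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags (regional indicators)
                           "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
                           "\U00002600-\U000026FF"  # miscellaneous symbols
                           "\U00002700-\U000027BF"  # dingbats
                           "\U0000FE0F\U0000200D"   # variation selector, zero-width joiner
                           "]+")
    return emoji_pattern.sub('', text)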

In [53]:
# Texts 
text = "Loved the movie. It was 😘"
text1 = 'Python is 🔥'
print(text ,'\n', text1)

# Remove emojis using the function
print(remove_emoji(text))
remove_emoji(text1)

Loved the movie. It was 😘 
 Python is 🔥
Loved the movie. It was 


'Python is '

***The function removes the emojis as expected.***

### 8.2 Simply Convert Emojis into text

In [55]:
# We will use the emoji library for this task
# pip install emoji
import emoji

In [56]:
# Convert each emoji to its text alias with emoji.demojize
print(emoji.demojize(text))
print(emoji.demojize(text1))

Loved the movie. It was :face_blowing_a_kiss:
Python is :fire:


***This is the output: each emoji is replaced by its text alias.***
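If installing the `emoji` package is not an option, a rough dependency-free approximation is possible with the standard library's `unicodedata` module. The sketch below assumes that characters in the Unicode category `So` (Symbol, other) are emoji-like, which also catches symbols such as ©, and it uses official Unicode names, so the aliases can differ from those produced by `emoji.demojize`. The function name `emoji_to_names` is hypothetical.

```python
import unicodedata

def emoji_to_names(text):
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # Look up the official Unicode name; drop the symbol if it has none
            name = unicodedata.name(ch, "")
            if name:
                out.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            out.append(ch)
    return "".join(out)

print(emoji_to_names("Python is \U0001F525"))
```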

----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">9. Breaking text into tokens (Tokenization)</p> 

### 9️⃣ **Tokenization**
- **Why:** Splits sentences into individual words or subwords.  
- **Problem Solved:** Turns raw text into units that vectorizers and models can work with.  
- **When to Use:** Always; it is the foundation of every NLP preprocessing pipeline.

### 9.1 NLTK

***NLTK provides tokenizers that split text into sentences and words.***

In [57]:
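# Import the tokenizers (if NLTK reports a missing resource, download it with nltk.download as the error message suggests)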
from nltk.tokenize import word_tokenize,sent_tokenize

In [58]:
# Text
sentence = 'I am going to visit delhi!'
# Calling tool
word_tokenize(sentence)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [59]:
# Whole text Containing 2 or more Sentences
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

# Sentence Based Tokenization
sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [60]:
# Some Sentences 
sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.50'

# Word Tokenize the Sentences
print(word_tokenize(sent5))
print(word_tokenize(sent6))
print(word_tokenize(sent7))

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']
['We', "'re", 'here', 'to', 'help', '!', 'mail', 'us', 'at', 'nks', '@', 'gmail.com']
['A', '5km', 'ride', 'cost', '$', '10.50']


***NLTK performs well, although it has some limitations: in the text above it splits the e-mail address into separate tokens. Whether that matters depends on your data.***

### 9.2 spaCy

***spaCy is an NLP library whose pipeline tokenizes text into words and sentences.***

In [None]:
# Installation
# conda install -c conda-forge spacy
# conda install -c conda-forge spacy-model-en_core_web_sm

In [61]:
# Import spaCy and load the small English model 'en_core_web_sm'.
# Install the library first with: pip install spacy
import spacy
nlp = spacy.load('en_core_web_sm')

In [62]:
# Tokenize the Sentences in Words
doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)

In [63]:
# Print the generated tokens
for token in doc2:
    print(token.text)

We
're
here
to
help
!
mail
us
at
nks@gmail.com


***spaCy keeps the e-mail address as a single token. The best tokenizer depends on your problem, so try both and pick the one that suits your data.***
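When neither library is available, a small regular-expression tokenizer can cover the cases shown above. This is a minimal sketch, not a replacement for NLTK or spaCy. The pattern and the name `simple_tokenize` are assumptions, and the rules (keep e-mail addresses, decimals, and contractions whole) are one possible design choice.

```python
import re

TOKEN_RE = re.compile(r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # e-mail addresses
  | \d+(?:\.\d+)?(?!\w)            # integers and decimals not glued to letters
  | \w+(?:'\w+)?                   # words, optionally with an apostrophe part
  | [^\w\s]                        # any other single non-space character
""", re.VERBOSE)

def simple_tokenize(text):
    # findall returns whole matches because the pattern has no capturing groups
    return TOKEN_RE.findall(text)

print(simple_tokenize("We're here to help! mail us at nks@gmail.com"))
print(simple_tokenize("A 5km ride cost $10.50"))
```

Unlike `word_tokenize`, this sketch keeps “We're” as one token; splitting contractions would need an extra rule.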

----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">10. Applying stemming</p> 


### 🔟 **Stemming**
- **Why:** Reduces words to a common base form (*“playing” → “play”*).  
- **Problem Solved:** Merges inflected word forms, lowering vocabulary size.  
- **When to Use:** Useful for TF-IDF models; generally not used with transformers, whose subword tokenizers expect unmodified words.

In [64]:
# Import PorterStemmer from NLTK Library
from nltk.stem.porter import PorterStemmer

In [65]:
# Initialize the stemmer
stemmer = PorterStemmer()

# This function stems every whitespace-separated word
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

In [66]:
# A single Sentence
st = "walk walks walking walked"
# Calling Function
stem_words(st)

'walk walk walk walk'

In [67]:
text = """probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy 
or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings
 tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like 
 dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the 
 world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie"""
print(text)

# Calling Function
stem_words(text)

probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy 
or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings
 tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like 
 dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the 
 world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie


'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

***That is how stemming works.***

***However, stemming can produce truncated or non-existent words (for example “probabl”, “movi” and “thi” in the output above). Such stemming errors need to be managed carefully so they do not hurt the accuracy of NLP applications.***
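To see why rule-based stemmers produce such non-words, consider a deliberately naive suffix stripper. This sketch is not the Porter algorithm. The suffix list, the minimum stem length, and the name `naive_stem` are assumptions chosen only to illustrate the idea of chopping endings without consulting a dictionary.

```python
# Checked in order; the first matching suffix is removed
SUFFIXES = ("ing", "ed", "es", "s", "ly")

def naive_stem(word, min_stem=3):
    for suffix in SUFFIXES:
        # Only strip if enough of the word is left over
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in "walk walks walking walked".split()])
print(naive_stem("movies"), naive_stem("this"))
```

The regular forms collapse to “walk”, but “movies” and “this” are cut to non-words because the rule only looks at the ending, not at whether the result exists in the language.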

----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">11. Performing lemmatization</p> 

### 1️⃣1️⃣ **Lemmatization**
- **Why:** Converts words to their meaningful root form using grammar context.  
- **Problem Solved:** Keeps words linguistically correct (*“better” → “good”*).  
- **When to Use:** Prefer over stemming when grammatical accuracy matters.

In [71]:
# Import WordNetLemmatizer from the NLTK library.
# If NLTK reports a missing resource, download it with nltk.download as the error message suggests.
import nltk
from nltk.stem import WordNetLemmatizer
# Initialize the lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Sentence 
sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

# Punctuation marks to drop
punctuations = set("?:!.,;")

# Tokenize into words
sentence_words = nltk.word_tokenize(sentence)

# Build a new list without punctuation tokens
# (removing items from a list while looping over it can skip elements)
sentence_words = [word for word in sentence_words if word not in punctuations]

# Print each word next to its lemma; pos='v' treats every token as a verb
print("{0:20}{1:20}".format("Word","Lemma"))
for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 


***That is how the lemmatizer works. Its main advantage is that words are reduced to their canonical dictionary form, taking their part of speech into account. Here every token was lemmatized as a verb (`pos='v'`), which is why the noun “hours” stays unchanged. The trade-off is that lemmatization is slower than stemming.***
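To lemmatize each word with its own part of speech rather than a fixed `pos='v'`, the tags from a POS tagger can be mapped to the single-letter codes that `WordNetLemmatizer.lemmatize` accepts. The helper below is a minimal sketch of that mapping; the name `penn_to_wordnet` is hypothetical, and falling back to noun for unknown tags is an assumption. The commented usage relies on `nltk.pos_tag`, which needs its tagger resource downloaded.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun

for tag in ("VBG", "NNS", "JJR", "RB"):
    print(tag, "->", penn_to_wordnet(tag))

# Possible usage with NLTK (requires the tokenizer, tagger and WordNet data):
# tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
# lemmas = [wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
#           for word, tag in tagged]
```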

----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">🏁 Conclusion</p> 

This project demonstrates how a structured text preprocessing pipeline transforms raw IMDB reviews into a high-quality, machine-readable dataset.  
By systematically applying steps such as HTML and URL removal, chat word normalization, tokenization, and lemmatization, we achieved:

- Cleaner, more consistent linguistic patterns  
- Reduced vocabulary redundancy (~45% fewer unique tokens)  
- Enhanced interpretability for downstream modeling  
- Balanced preprocessing for both TF-IDF and transformer-based workflows  

> 🧠 **Key Insight:**  
> Even without training a model, analyzing vocabulary and token statistics clearly shows how effective preprocessing creates a stronger foundation for NLP success.
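A vocabulary comparison of this kind can be sketched in a few lines. The helpers below count unique whitespace-separated tokens before and after cleaning. The function names and the toy input are assumptions for illustration, and the printed numbers describe only that toy input, not the IMDB corpus.

```python
def vocabulary_size(texts):
    # Unique whitespace-separated tokens across all texts
    vocab = set()
    for text in texts:
        vocab.update(text.split())
    return len(vocab)

def vocabulary_reduction(raw_texts, cleaned_texts):
    before = vocabulary_size(raw_texts)
    after = vocabulary_size(cleaned_texts)
    reduction = 1 - after / before if before else 0.0
    return before, after, reduction

# Toy example; replace with the raw and cleaned review columns
raw = ["Great movie", "great Movie!"]
cleaned = ["great movie", "great movie"]
before, after, reduction = vocabulary_reduction(raw, cleaned)
print(f"{before} -> {after} unique tokens ({reduction:.0%} fewer)")
```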


----

# <p style="font-family:newtimeroman;font-size:150%;text-align:center;color:#87CEEB;">THE END.</p> 

----