# Text Preprocessing Pipeline

## Overview
This notebook demonstrates a comprehensive text preprocessing workflow for NLP tasks using the IMDB dataset of 50k movie reviews.

## Steps Performed

1. **Data Loading & Exploration** - Load CSV dataset and examine structure
2. **Lowercasing** - Convert all text to lowercase for consistency
3. **HTML Tag Removal** - Strip HTML tags from reviews
4. **URL Removal** - Remove HTTP/HTTPS and www URLs
5. **Punctuation Removal** - Eliminate punctuation marks (with performance comparison)
6. **Slang Conversion** - Convert chat abbreviations to full words using dictionary
7. **Spell Correction** - Fix misspelled words using TextBlob
8. **Stopword Removal** - Remove common English stopwords
9. **Emoji Handling** - Remove or demojize emojis
10. **Tokenization** - Split text into words/sentences using multiple methods:
    - String split
    - Regular expressions
    - NLTK
    - spaCy
11. **Stemming** - Reduce words to root form using Porter Stemmer
12. **Lemmatization** - Convert words to base dictionary form
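
The first few cleaning steps above can be chained into one helper. The following is a minimal sketch, not the notebook's own code: the name `basic_clean` and the final whitespace collapse are assumptions added for illustration, and only steps 2 to 5 are covered.

```python
import re
import string

# Precompiled patterns mirroring the HTML, URL and punctuation steps
_TAG_RE = re.compile(r"<.*?>")
_URL_RE = re.compile(r"https?://\S+|www\.\S+")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def basic_clean(text):
    """Lowercase, strip HTML tags, strip URLs, strip punctuation (hypothetical helper)."""
    text = text.lower()
    text = _TAG_RE.sub("", text)
    # URLs must go before punctuation, otherwise "https://..." loses its ":" and "/"
    text = _URL_RE.sub("", text)
    text = text.translate(_PUNCT_TABLE)
    # Collapse the gaps left behind by the removals
    return " ".join(text.split())


print(basic_clean("A <b>GREAT</b> film! See https://example.com for more."))
```

The order matters here: removing punctuation first would break the URL pattern, so URL removal runs before it.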

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are expected in the read-only "./input/" directory.
# Download the dataset from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews,
# unzip it, and place the CSV in the input folder.

import os
data_path = "./input"
for dirname, _, filenames in os.walk(data_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./input/en_core_web_sm-3.7.1-py3-none-any.whl
./input/IMDB Dataset.csv


In [2]:
df = pd.read_csv('./input/IMDB Dataset.csv')

In [3]:
df.shape

(50000, 2)

In [4]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
df['review'][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [6]:
df['review'] = df['review'].str.lower()

In [7]:
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


In [8]:
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

In [9]:
text = "<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"

In [10]:
remove_html_tags(text)

' Movie 1 Actor - Aamir Khan Click here to download'

In [11]:
df['review'] = df['review'].apply(remove_html_tags)

In [12]:
df['review'][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'
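
The regex `<.*?>` is enough for simple tags such as `<br />`, but it leaves character entities like `&amp;` untouched. As an alternative, the standard-library `html.parser` can collect only the text nodes. This is a sketch under that assumption; the names `_TextExtractor` and `strip_tags` are hypothetical and not part of the notebook.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(strip_tags("a wonderful little production. <br /><br />the filming"))
print(strip_tags("tom &amp; jerry<br />"))
```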

In [13]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [14]:
text1 = 'Check out my notebook https://www.kaggle.com/campusx/notebook8223fc1abb'
text2 = 'Check out my notebook http://www.kaggle.com/campusx/notebook8223fc1abb'
text3 = 'Google search here www.google.com'
text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'

In [15]:
remove_url(text4)

'For notebook click  to search check '

In [16]:
import string,time
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [17]:
exclude = string.punctuation

In [18]:
def remove_punc(text):
    for char in exclude:
        text = text.replace(char,'')
    return text
        

In [19]:
text = 'string. With. Punctuation?'

In [20]:
start = time.time()
print(remove_punc(text))
time1 = time.time() - start
print(time1*50000)

string With Punctuation
22.351741790771484


In [21]:
def remove_punc1(text):
    return text.translate(str.maketrans('', '', exclude))

In [22]:
start = time.time()
remove_punc1(text)
time2 = time.time() - start
print(time2*50000)

12.421607971191406


In [23]:
time1/time2

1.7994241842610366
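
A single `time.time()` measurement is noisy, and the first timed cell also includes the cost of `print`. A more stable comparison can be sketched with `timeit`, which repeats each call many times. The function names and the repetition count below are illustrative choices, not the notebook's.

```python
import string
import timeit

exclude = string.punctuation
# Build the translation table once instead of on every call
table = str.maketrans("", "", exclude)


def remove_punc_loop(text):
    # One str.replace pass per punctuation character
    for char in exclude:
        text = text.replace(char, "")
    return text


def remove_punc_translate(text):
    # Single pass over the text using the prebuilt table
    return text.translate(table)


sample = "string. With. Punctuation?"
loop_time = timeit.timeit(lambda: remove_punc_loop(sample), number=10_000)
translate_time = timeit.timeit(lambda: remove_punc_translate(sample), number=10_000)
print(f"loop: {loop_time:.4f}s  translate: {translate_time:.4f}s")
```

Exact timings depend on the machine and on text length, so the ratio printed here should be read as indicative only.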

In [24]:
df['review'][5]

'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times in the last 25 years. paul lukas\' performance brings tears to my eyes, and bette davis, in one of her very few truly sympathetic roles, is a delight. the kids are, as grandma says, more like "dressed-up midgets" than children, but that only makes them more fun to watch. and the mother\'s slow awakening to what\'s happening in the world and under her own roof is believable and startling. if i had a dozen thumbs, they\'d all be "up" for this movie.'

In [25]:
remove_punc1(df['review'][5])

'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'

In [26]:
def slang_to_dict(text):
    slang_dict = {}
    lines = text.splitlines()

    for line in lines:
        # Skip empty or invalid lines
        if "=" not in line:
            continue
        
        key, value = line.split("=", 1)  # split only on first '='
        key = key.strip()
        value = value.strip()

        if key:  # Ignore empty keys
            slang_dict[key] = value

    return slang_dict


In [27]:
slang_text = """
A3=Anytime, Anywhere, Anyplace
ADIH=Another Day In Hell
AFK=Away From Keyboard
AFAIK=As Far As I Know
ASAP=As Soon As Possible
ASL=Age, Sex, Location
ATK=At The Keyboard
ATM=At The Moment
BAE=Before Anyone Else
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRUH=Bro
BRT=Be Right There
BSAAW=Big Smile And A Wink
BTW=By The Way
BWL=Bursting With Laughter
CSL=Can’t Stop Laughing
CU=See You
CUL8R=See You Later
CYA=See You
DM=Direct Message
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FIMH=Forever In My Heart
FOMO=Fear Of Missing Out
FR=For Real
FWIW=For What It's Worth
FYP=For You Page
FYI=For Your Information
G9=Genius
GAL=Get A Life
GG=Good Game
GMTA=Great Minds Think Alike
GN=Good Night
GOAT=Greatest Of All Time
GR8=Great!
HBD=Happy Birthday
IC=I See
ICQ=I Seek You
IDC=I Don’t Care
IDK=I Don't Know
IFYP=I Feel Your Pain
ILU=I Love You
ILY=I Love You
IMHO=In My Honest/Humble Opinion
IMU=I Miss You
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
IYKYK=If You Know, You Know
JK=Just Kidding
KISS=Keep It Simple, Stupid
L=Loss
L8R=Later
LDR=Long Distance Relationship
LMK=Let Me Know
LMAO=Laughing My A** Off
LOL=Laughing Out Loud
LTNS=Long Time No See
M8=Mate
MFW=My Face When
MID=Mediocre
MRW=My Reaction When
MTE=My Thoughts Exactly
NVM=Never Mind
NRN=No Reply Necessary
NPC=Non-Player Character
OIC=Oh I See
OP=Overpowered
PITA=Pain In The A**
POV=Point Of View
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A** Off
RN=Right Now
SK8=Skate
STATS=Your Sex And Age
SUS=Suspicious
TBH=To Be Honest
TFW=That Feeling When
THX=Thank You
TIME=Tears In My Eyes
TLDR=Too Long, Didn’t Read
TNTL=Trying Not To Laugh
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
W=Win
W8=Wait...
WB=Welcome Back
WTF=What The F**k
WTG=Way To Go!
WUF=Where Are You From?
WYD=What You Doing?
WYWH=Wish You Were Here
ZZZ=Sleeping, Bored, Tired
"""

chat_words = slang_to_dict(slang_text)
print(chat_words)


{'A3': 'Anytime, Anywhere, Anyplace', 'ADIH': 'Another Day In Hell', 'AFK': 'Away From Keyboard', 'AFAIK': 'As Far As I Know', 'ASAP': 'As Soon As Possible', 'ASL': 'Age, Sex, Location', 'ATK': 'At The Keyboard', 'ATM': 'At The Moment', 'BAE': 'Before Anyone Else', 'BAK': 'Back At Keyboard', 'BBL': 'Be Back Later', 'BBS': 'Be Back Soon', 'BFN': 'Bye For Now', 'B4N': 'Bye For Now', 'BRB': 'Be Right Back', 'BRUH': 'Bro', 'BRT': 'Be Right There', 'BSAAW': 'Big Smile And A Wink', 'BTW': 'By The Way', 'BWL': 'Bursting With Laughter', 'CSL': 'Can’t Stop Laughing', 'CU': 'See You', 'CUL8R': 'See You Later', 'CYA': 'See You', 'DM': 'Direct Message', 'FAQ': 'Frequently Asked Questions', 'FC': 'Fingers Crossed', 'FIMH': 'Forever In My Heart', 'FOMO': 'Fear Of Missing Out', 'FR': 'For Real', 'FWIW': "For What It's Worth", 'FYP': 'For You Page', 'FYI': 'For Your Information', 'G9': 'Genius', 'GAL': 'Get A Life', 'GG': 'Good Game', 'GMTA': 'Great Minds Think Alike', 'GN': 'Good Night', 'GOAT': 'G

In [28]:
chat_words

{'A3': 'Anytime, Anywhere, Anyplace',
 'ADIH': 'Another Day In Hell',
 'AFK': 'Away From Keyboard',
 'AFAIK': 'As Far As I Know',
 'ASAP': 'As Soon As Possible',
 'ASL': 'Age, Sex, Location',
 'ATK': 'At The Keyboard',
 'ATM': 'At The Moment',
 'BAE': 'Before Anyone Else',
 'BAK': 'Back At Keyboard',
 'BBL': 'Be Back Later',
 'BBS': 'Be Back Soon',
 'BFN': 'Bye For Now',
 'B4N': 'Bye For Now',
 'BRB': 'Be Right Back',
 'BRUH': 'Bro',
 'BRT': 'Be Right There',
 'BSAAW': 'Big Smile And A Wink',
 'BTW': 'By The Way',
 'BWL': 'Bursting With Laughter',
 'CSL': 'Can’t Stop Laughing',
 'CU': 'See You',
 'CUL8R': 'See You Later',
 'CYA': 'See You',
 'DM': 'Direct Message',
 'FAQ': 'Frequently Asked Questions',
 'FC': 'Fingers Crossed',
 'FIMH': 'Forever In My Heart',
 'FOMO': 'Fear Of Missing Out',
 'FR': 'For Real',
 'FWIW': "For What It's Worth",
 'FYP': 'For You Page',
 'FYI': 'For Your Information',
 'G9': 'Genius',
 'GAL': 'Get A Life',
 'GG': 'Good Game',
 'GMTA': 'Great Minds Think Al

In [29]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [30]:
chat_conversion('IMHO he is the best')

'In My Honest/Humble Opinion he is the best'

In [31]:
chat_conversion('FYI delhi is the capital of india')

'For Your Information delhi is the capital of india'
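
Because `chat_conversion` splits on whitespace, a token such as `IMHO,` or `imho.` carries its punctuation and will not match a dictionary key. One possible workaround is to substitute on alphanumeric runs with a regex, which leaves punctuation and spacing in place. This is a sketch only: `demo_chat_words` is a small stand-in for `chat_words`, and `chat_conversion_punct` is a hypothetical name.

```python
import re

# Small stand-in for the full chat_words dictionary
demo_chat_words = {
    "IMHO": "In My Honest/Humble Opinion",
    "FYI": "For Your Information",
}

_WORD_RE = re.compile(r"[A-Za-z0-9]+")


def chat_conversion_punct(text, mapping):
    def expand(match):
        word = match.group(0)
        # Look the token up in upper case; fall back to the original token
        return mapping.get(word.upper(), word)

    return _WORD_RE.sub(expand, text)


print(chat_conversion_punct("FYI, he is the best imho.", demo_chat_words))
```

Note that the case-insensitive lookup cuts both ways: with the full dictionary, ordinary lowercase words such as "time", "kiss" or "u" would also be expanded, so short or ambiguous keys may need to be dropped before applying this to reviews.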

In [32]:
from textblob import TextBlob

In [34]:
incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.'

textBlb = TextBlob(incorrect_text)

textBlb.correct().string

'certain conditions during several generations are modified in the same manner.'

In [35]:
from nltk.corpus import stopwords

In [36]:
stopwords.words('spanish')

['de',
 'la',
 'que',
 'el',
 'en',
 'y',
 'a',
 'los',
 'del',
 'se',
 'las',
 'por',
 'un',
 'para',
 'con',
 'no',
 'una',
 'su',
 'al',
 'lo',
 'como',
 'más',
 'pero',
 'sus',
 'le',
 'ya',
 'o',
 'este',
 'sí',
 'porque',
 'esta',
 'entre',
 'cuando',
 'muy',
 'sin',
 'sobre',
 'también',
 'me',
 'hasta',
 'hay',
 'donde',
 'quien',
 'desde',
 'todo',
 'nos',
 'durante',
 'todos',
 'uno',
 'les',
 'ni',
 'contra',
 'otros',
 'ese',
 'eso',
 'ante',
 'ellos',
 'e',
 'esto',
 'mí',
 'antes',
 'algunos',
 'qué',
 'unos',
 'yo',
 'otro',
 'otras',
 'otra',
 'él',
 'tanto',
 'esa',
 'estos',
 'mucho',
 'quienes',
 'nada',
 'muchos',
 'cual',
 'poco',
 'ella',
 'estar',
 'estas',
 'algunas',
 'algo',
 'nosotros',
 'mi',
 'mis',
 'tú',
 'te',
 'ti',
 'tu',
 'tus',
 'ellas',
 'nosotras',
 'vosotros',
 'vosotras',
 'os',
 'mío',
 'mía',
 'míos',
 'mías',
 'tuyo',
 'tuya',
 'tuyos',
 'tuyas',
 'suyo',
 'suya',
 'suyos',
 'suyas',
 'nuestro',
 'nuestra',
 'nuestros',
 'nuestras',

In [37]:
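def remove_stopwords(text):
    new_text = []

    for word in text.split():
        # Note: stopwords.words('english') builds the list again for every word,
        # which makes this slow on large inputs.
        if word in stopwords.words('english'):
            # A stopword is replaced by an empty string, so the output keeps extra spaces
            new_text.append('')
        else:
            new_text.append(word)
    return " ".join(new_text)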

In [38]:
remove_stopwords('probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times')

'probably  all-time favorite movie,  story  selflessness, sacrifice  dedication   noble cause,    preachy  boring.   never gets old, despite   seen   15   times'

In [39]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


In [40]:
df['review'].apply(remove_stopwords)

KeyboardInterrupt: 
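
The `KeyboardInterrupt` suggests the cell was stopped by hand, plausibly because filtering 50k reviews word by word is slow when each word is checked against a list. A faster variant builds the stopword collection once as a set and drops matches instead of leaving empty strings. This is a minimal sketch: `DEMO_STOPWORDS` is a tiny hypothetical stand-in, and in the notebook it would be replaced by something like `set(stopwords.words('english'))`.

```python
# Tiny stand-in for the NLTK English stopword list (assumption for illustration)
DEMO_STOPWORDS = frozenset(
    {"my", "a", "of", "and", "to", "but", "not", "or", "it", "just"}
)


def remove_stopwords_fast(text, stop_set=DEMO_STOPWORDS):
    # Set membership is O(1) on average; stopwords are dropped, so no double spaces remain
    return " ".join(word for word in text.split() if word not in stop_set)


print(remove_stopwords_fast(
    "a story of selflessness, sacrifice and dedication to a noble cause"
))
```

With a set built once, `df['review'].apply(remove_stopwords_fast)` does a constant amount of work per word rather than scanning the whole list each time.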

In [41]:
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; a very wide range that also covers non-emoji code points such as CJK
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [42]:
remove_emoji("Loved the movie. It was 😘😘")

'Loved the movie. It was '

In [43]:
remove_emoji("Lmao 😂😂")

'Lmao '

In [44]:
import emoji
print(emoji.demojize('Python is 🔥'))

Python is :fire:


In [45]:
print(emoji.demojize('Loved the movie. It was 😘'))

Loved the movie. It was :face_blowing_a_kiss:


### 1. Using the split function

In [46]:
# word tokenization
sent1 = 'I am going to delhi'
sent1.split()

['I', 'am', 'going', 'to', 'delhi']

In [47]:
# sentence tokenization
sent2 = 'I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to delhi',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [48]:
# Problems with split function
sent3 = 'I am going to delhi!'
sent3.split()

['I', 'am', 'going', 'to', 'delhi!']

In [49]:
sent4 = 'Where do think I should go? I have 3 day holiday'
sent4.split('.')

['Where do think I should go? I have 3 day holiday']

### 2. Regular Expression

In [50]:
import re
sent3 = 'I am going to delhi!'
tokens = re.findall(r"[\w']+", sent3)  # raw string avoids an invalid-escape warning for \w
tokens

['I', 'am', 'going', 'to', 'delhi']

In [51]:

text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
sentences = re.compile('[.!?] ').split(text)
sentences

['Lorem Ipsum is simply dummy text of the printing and typesetting industry',
 "\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]
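
Splitting on `'[.!?] '` consumes the terminating punctuation and leaves the newline attached to the next sentence. A lookbehind keeps the punctuation with its sentence and splits on any run of whitespace. This is a sketch, and `split_sentences` is a hypothetical helper name.

```python
import re


def split_sentences(text):
    # Split on whitespace that follows ., ! or ? while keeping the punctuation itself
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(split_sentences("Where do think I should go? I have 3 day holiday"))
print(split_sentences("I am going to delhi. I will stay there for 3 days."))
```

This still breaks on abbreviations followed by a space (for example "Dr. Smith"), which is one reason to prefer a trained sentence tokenizer for real text.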

### 3. NLTK

In [52]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [53]:
sent1 = 'I am going to visit delhi!'
word_tokenize(sent1)

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [54]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry? 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, \nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [55]:
sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.50'

word_tokenize(sent5)

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']

In [56]:
word_tokenize(sent6)

['We',
 "'re",
 'here',
 'to',
 'help',
 '!',
 'mail',
 'us',
 'at',
 'nks',
 '@',
 'gmail.com']

In [57]:
word_tokenize(sent7)

['A', '5km', 'ride', 'cost', '$', '10.50']

### 4. spaCy

In [58]:
import spacy
# Install the model first if needed: uv pip install ./input/en_core_web_sm-3.7.1-py3-none-any.whl
nlp = spacy.load('en_core_web_sm')

In [59]:
doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)
doc4 = nlp(sent1)

In [60]:
for token in doc4:
    print(token)

I
am
going
to
visit
delhi
!


In [61]:
from nltk.stem.porter import PorterStemmer

In [62]:
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [63]:
sample = "walk walks walking walked"
stem_words(sample)

'walk walk walk walk'

In [64]:
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
print(text)

probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie


In [65]:
stem_words(text)

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

In [66]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = set("?:!.,;")
# Build a new list rather than removing items from the list while iterating over it
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    # pos='v' treats every token as a verb
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 
