Text preprocessing cleans and normalizes raw text so that a model can learn from it more effectively.
Not every technique shown here is required for every dataset. The right choice depends on the type and source of the data, the model used, and the use case.

Data Set:
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

This dataset contains 50,000 movie reviews for binary sentiment classification, substantially more data than previous benchmark datasets.

In [5]:
import pandas as pd

In [14]:
!pwd

/content


In [2]:
data_path = "/content/IMDB Dataset.csv"

In [9]:
df = pd.read_csv(data_path)

In [10]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [11]:
df.shape  # (rows, columns): 50K movie reviews, each labelled with a positive or negative sentiment

(50000, 2)

In [12]:
df['review'][3] #sample review text

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

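# Perform some basic pre-processing
## 1. Lower case
Convert the text to lower case so that the same word written in different casings (e.g. "Movie" and "movie") is treated as one token, which reduces the vocabulary size. This is important here because the reviews are free text.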

In [13]:
df['review'] = df['review'].str.lower()

In [14]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


## 2. Remove HTML tags

In [15]:
import re  # re: regular expressions
def remove_html_tags(text):
  pattern = re.compile(r'<.*?>')  # non-greedy match of anything between < and >
  return pattern.sub('', text)  # i.e. replace HTML tags with an empty string


In [16]:
df['review'] = df['review'].apply(remove_html_tags)
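Replacing a tag with an empty string joins the text on either side of it, so `time.<br /><br />this` becomes `time.this`. A minimal sketch of one way to avoid that, assuming it is acceptable to replace each tag with a space and then collapse repeated whitespace. The function name `remove_html_tags_keep_spacing` is hypothetical and not part of the original notebook.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')   # same non-greedy tag pattern as remove_html_tags
SPACE_PATTERN = re.compile(r'\s+')   # any run of whitespace

def remove_html_tags_keep_spacing(text):
    # Replace each tag with a space so adjacent sentences stay separated,
    # then collapse the extra whitespace this can introduce.
    without_tags = TAG_PATTERN.sub(' ', text)
    return SPACE_PATTERN.sub(' ', without_tags).strip()

sample = "fighting all the time.<br /><br />this movie is slower"
print(remove_html_tags_keep_spacing(sample))
```

Whether this variant is preferable depends on the later steps: if punctuation is removed afterwards, the extra space is what keeps "time" and "this" from merging into a single token.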

In [17]:
df['review'][3]

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

## 3. Remove URLs

In [18]:
def remove_url(text):
  pattern = re.compile(r'https?://\S+|www\.\S+')
  return pattern.sub(r'', text)

In [19]:
df['review'] = df['review'].apply(remove_url)
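To check what the URL pattern actually removes, it can be tried on a few strings first. This is only a sketch: the sample sentences and the `example.com` addresses are made up for illustration.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')  # same pattern as remove_url

def remove_url(text):
    return URL_PATTERN.sub('', text)

samples = [
    "trailer at https://example.com/trailer is great",
    "see www.example.com for details",
    "no links here",
]
for s in samples:
    # repr() makes the leftover double space visible
    print(repr(remove_url(s)))
```

Two things are worth noticing: the removed URL leaves a double space behind, and because `\S+` matches any non-whitespace character, punctuation directly attached to a URL (such as a trailing full stop) is removed along with it.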

## 4. Punctuation handling

In [20]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [21]:
exclude = string.punctuation
exclude ##!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [22]:
def remove_punc(text):
  for char in exclude:
    text = text.replace(char, '') #remove any of the char same as one in exclude string and replace with empty char
  return text

In [23]:
def remove_punc_betterway(text):
  return text.translate(str.maketrans('','',exclude))

In [24]:
df['review'] = df['review'].apply(remove_punc_betterway)

In [25]:
df['review'][3]

'basically theres a family where a little boy jake thinks theres a zombie in his closet  his parents are fighting all the timethis movie is slower than a soap opera and suddenly jake decides to become rambo and kill the zombieok first of all when youre going to make a film you must decide if its a thriller or a drama as a drama the movie is watchable parents are divorcing  arguing like in real life and then we have jake with his closet which totally ruins all the film i expected to see a boogeyman similar movie and instead i watched a drama with some meaningless thriller spots3 out of 10 just for the well playing parents  descent dialogs as for the shots with jake just ignore them'
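The two punctuation functions give the same result, but the `replace` loop scans the text once per punctuation character while `translate` handles all of them in one pass. A rough sketch of how the two could be compared with the standard `timeit` module. The sample text and repetition count are arbitrary choices, and absolute timings will vary by machine.

```python
import string
import timeit

exclude = string.punctuation
table = str.maketrans('', '', exclude)   # build the translation table once

def remove_punc(text):
    # one full pass over the text per punctuation character
    for char in exclude:
        text = text.replace(char, '')
    return text

def remove_punc_translate(text):
    # a single pass that drops every character listed in the table
    return text.translate(table)

sample = "ok, first of all: when you're going to make a film... decide!" * 100

# Both approaches must give the same result
assert remove_punc(sample) == remove_punc_translate(sample)

loop_time = timeit.timeit(lambda: remove_punc(sample), number=200)
translate_time = timeit.timeit(lambda: remove_punc_translate(sample), number=200)
print(f"replace loop: {loop_time:.4f}s, translate: {translate_time:.4f}s")
```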

## 5. Chat word handling
### For example, short forms such as ASAP, FYI, AFAIK, LOL etc.

In [26]:
# Prepare the dictionary of chat abbreviations and their meanings (each key listed once)
chat_words = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "FYI": "For Your Information",
    "BRB": "Be Right Back",
    "BTW": "By The Way",
    "OMG": "Oh My God",
    "IMO": "In My Opinion",
    "LOL": "Laugh Out Loud",
    "TTYL": "Talk To You Later",
    "GTG": "Got To Go",
    "TTYT": "Talk To You Tomorrow",
    "IDK": "I Don't Know",
    "TMI": "Too Much Information",
    "IMHO": "In My Humble Opinion",
    "ICYMI": "In Case You Missed It",
    "FAQ": "Frequently Asked Questions",
    "TGIF": "Thank God It's Friday",
    "FYA": "For Your Action",
}

In [27]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [28]:
df['review'] = df['review'].apply(chat_conversion)
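A small sketch of the abbreviation expansion on single sentences, using a three-entry dictionary so the example is self-contained. The sample sentences are made up. Note that the expansions are title-cased, so if the text was lowercased earlier, it may be worth lowercasing again afterwards.

```python
# Small illustrative subset of the chat_words dictionary
chat_words = {
    "ASAP": "As Soon As Possible",
    "IMO": "In My Opinion",
    "FYI": "For Your Information",
}

def chat_conversion(text):
    # Look each word up in upper case; keep the word unchanged if it is not an abbreviation
    return " ".join(chat_words.get(w.upper(), w) for w in text.split())

expanded = chat_conversion("imo this movie was great")
print(expanded)
print(expanded.lower())   # restore lower case after expansion

# An abbreviation with punctuation attached is not matched,
# which is one reason punctuation is removed before this step
print(chat_conversion("reply asap!"))
```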

## 6. Incorrect text handling (for example, spelling mistakes or incorrect words)

In [29]:
from textblob import TextBlob

In [30]:
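# Example: "setnance" is corrected to "sentence", but "This" is wrongly changed to "His",
# so automatic correction should be reviewed before it is applied to a whole dataset
incorrect_text = "This is an incorrect setnance"
text_blob = TextBlob(incorrect_text)
text_blob.correct().string   #output:  His is an incorrect sentence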

'His is an incorrect sentence'

## 7. Stopwords
Stopwords are very common words such as "the", "is", "me" and "us" that usually carry little information on their own.

In [31]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
## NLTK (Natural Language Toolkit) provides the stopwords corpus downloaded here

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [32]:
stopwords.words('english') #list stopwords in english

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [33]:
len(stopwords.words('english')) #179

179

In [34]:
#Function to remove stopwords
##*** It may or may not be useful to remove all the stopwords
##*** It depends upon the use case
stop_words = set(stopwords.words('english'))  # build the set once; set lookups are fast

def remove_stopwords(text):
    # keep only the words that are not stopwords
    return " ".join(word for word in text.split() if word not in stop_words)

In [None]:
df['review'].apply(remove_stopwords)
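One case where removing every stopword can hurt is sentiment analysis: NLTK's English list includes negations such as "not" and "no", and dropping them can flip the meaning of a review. A minimal sketch of keeping selected words, using a small hand-written stopword set so the example runs without downloads. The set, the `negations` choice, and the function name are all illustrative assumptions.

```python
# Illustrative subset only; in the notebook this would be set(stopwords.words('english'))
stop_words = {"the", "is", "a", "this", "was", "not", "no", "and"}
negations = {"not", "no"}   # words to keep for sentiment tasks (an assumption)

def remove_stopwords_keep(text, stop_words=stop_words, keep=negations):
    # Remove stopwords except the ones explicitly listed in `keep`
    effective = stop_words - keep
    return " ".join(w for w in text.split() if w not in effective)

print(remove_stopwords_keep("this movie was not good"))              # negation kept
print(remove_stopwords_keep("this movie was not good", keep=set()))  # negation lost
```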

## 7. Emoji handling

In [36]:
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; caution: this wide range also matches many non-emoji characters (e.g. CJK text)
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [37]:
remove_emoji("Loved the movie. It was 😘😘")

'Loved the movie. It was '

In [38]:
##Another way, using the emoji library
!pip install emoji

Collecting emoji
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.0-py3-none-any.whl (586 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.0


In [39]:
import emoji
print(emoji.demojize('Python is 🔥'))

Python is :fire:


## 8. Tokenization
Tokenization splits text into smaller units: words (word tokens) or sentences (sentence tokens). There are several ways to do it.

### 8.1 Using the split function

In [40]:
# word tokenization
sent1 = 'I am going to delhi'
sent1.split()

### Problem with this technique: it only splits on whitespace (or on the separator you pass in),
### so punctuation stays attached to the neighbouring word, e.g. 'delhi!' would be a single token

['I', 'am', 'going', 'to', 'delhi']
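A quick sketch of where `split()` falls short as a tokenizer. The sentences are made up for illustration.

```python
sent = 'I am going to delhi!'
tokens = sent.split()
print(tokens)   # the '!' stays attached to the last word

# With no argument, split() treats any run of whitespace as one separator.
# With an explicit separator, repeated separators produce empty strings.
spaced = 'I am  going'
print(spaced.split())
print(spaced.split(' '))
```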

### 8.2 Using regular expressions

In [41]:
import re
sent3 = 'I am going to delhi!'
tokens = re.findall(r"[\w']+", sent3)  # raw string avoids an invalid-escape warning for \w
tokens

['I', 'am', 'going', 'to', 'delhi']

In [42]:
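# Sentence tokens: split on '.', '!' or '?' followed by a space.
# In the text below every sentence end is followed by a newline or the end of the string,
# never by a space, so this pattern finds no split point and the text comes back as one element.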

text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
sentences = re.compile('[.!?] ').split(text)
sentences

["Lorem Ipsum is simply dummy text of the printing and typesetting industry?\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]
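The pattern `'[.!?] '` only splits when the punctuation mark is followed by a literal space, which never happens in this text. A minimal sketch of a more tolerant pattern, assuming sentences end in `.`, `!` or `?` followed by any whitespace. It still splits wrongly after abbreviations such as "Dr.", so it is an illustration rather than a robust sentence tokenizer.

```python
import re

text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

# Split on whitespace (space, tab or newline) that follows '.', '!' or '?'.
# The lookbehind keeps the punctuation mark attached to its sentence.
sentences = re.split(r'(?<=[.!?])\s+', text)

print(len(sentences))
for s in sentences:
    print(repr(s))
```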

### 8.3 Using NLTK (Natural Language Toolkit)

In [47]:
from nltk.tokenize import word_tokenize,sent_tokenize
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [48]:
sent1 = 'I am going to visit delhi!'
word_tokenize(sent1) ##Word Tokeniser

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']

In [49]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text) ##Sentence Tokeniser

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

### 8.4 Using spaCy (good)

In [50]:
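import spacy
# small English pipeline; if it is missing, install it with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')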

In [51]:
doc1 = nlp(text)

for token in doc1:
    print(token)

Lorem
Ipsum
is
simply
dummy
text
of
the
printing
and
typesetting
industry
?


Lorem
Ipsum
has
been
the
industry
's
standard
dummy
text
ever
since
the
1500s
,


when
an
unknown
printer
took
a
galley
of
type
and
scrambled
it
to
make
a
type
specimen
book
.


## 9. Stemming and lemmatization
Convert similar words into their fundamental or root form, e.g. "liking" and "liked" to "like".

### 9.1 Stemming

In [52]:
from nltk.stem.porter import PorterStemmer

In [53]:
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [54]:
sample = "walk walks walking walked"
stem_words(sample)

'walk walk walk walk'

In [57]:
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
print(text)


probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie


In [58]:
stem_words(text)

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

### 9.2 Lemmatization

In [59]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = "?:!.,;"
# Build a new list instead of removing items while iterating, which can skip tokens
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    # pos='v' treats every word as a verb, so nouns such as "hours" are left unchanged
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 


**Stemming**
Removes common suffixes from the end of words to produce a base form. Stemming is fast and efficient for processing large amounts of text. However, it can produce stems that are not actual words, which can lead to less accurate results. For example, stemming "engine" or "engines" gives "engin", which is not a valid English word.

**Lemmatization**
Returns the dictionary base form (lemma) of a word. Lemmatization is usually more accurate than stemming because it takes the word's part of speech into account when choosing which normalization rules to apply. For example, "following" can be a noun, verb, or adjective: lemmatization returns "following" for the noun or adjective, and "follow" for the verb.

**Stemming is useful for search engines, information retrieval, and text mining. Lemmatization is better suited to chatbots, text classification, and semantic analysis.**
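The library-free steps from this notebook (lower-casing, HTML tag removal, URL removal and punctuation removal) can be chained into one function before stemming or lemmatization is applied. A minimal sketch under a few assumptions: the function name `preprocess` is hypothetical, tags are replaced with a space rather than an empty string so words on either side do not merge, and whitespace is collapsed at the end. The sample string and the `example.com` address are made up.

```python
import re
import string

HTML_TAG = re.compile(r'<.*?>')
URL = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preprocess(text):
    text = text.lower()                 # 1. lower case
    text = HTML_TAG.sub(' ', text)      # 2. remove HTML tags (space keeps words apart)
    text = URL.sub('', text)            # 3. remove URLs
    text = text.translate(PUNCT_TABLE)  # 4. remove punctuation
    return " ".join(text.split())       # collapse leftover whitespace

print(preprocess("A wonderful little production. <br /><br />See www.example.com!"))
```

The order matters: URLs are removed before punctuation, because stripping the dots and slashes first would leave URL fragments that the URL pattern can no longer recognise.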