
# Text Preprocessing

In [1]:
import numpy as np
import pandas as pd

In [2]:
df=pd.read_csv('IMDB Dataset.csv')

In [3]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
df.shape

(50000, 2)

* Lowercasing: For ensuring consistency across the dataset

In [5]:
df['review']=df['review'].str.lower()

In [6]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


* Removing HTML Tags: Through regex

In [7]:
df['review'][1]

'a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well d

In [8]:
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

In [9]:
df['review']=df['review'].apply(remove_html_tags)

In [10]:
df['review'][1]

'a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well done.'
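As a quick check of the pattern used above, the sketch below applies the same non-greedy `<.*?>` regex to a short made-up string. The sample text and the name `strip_tags` are illustrative only. Tags are replaced with an empty string, so words separated only by a tag end up joined; the review above keeps its spaces only because the original text already had a space before each `<br />`.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")  # non-greedy: each match stops at the first '>'

def strip_tags(text):
    """Remove anything that looks like an HTML tag."""
    return TAG_PATTERN.sub("", text)

sample = "great film.<br /><br />loved the <b>ending</b>!"
print(strip_tags(sample))  # great film.loved the ending!
```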

* Removing URLs

In [11]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [12]:
text1 = 'Check out https://www.google.com'
text2 = 'Check out http://www.google.com'
text3 = 'Google search here www.google.com'
text4 = 'Click https://www.google.com to search check www.google.com'

In [13]:
for sample in (text1, text2, text3, text4):
    print(remove_url(sample))

Check out 
Check out 
Google search here 
Click  to search check 


* Removing Punctuation

In [14]:
import string,time
exclude=string.punctuation

In [15]:
exclude

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
# Method:1
def remove_punc1(text):
    for char in exclude:
        text = text.replace(char,'')
    return text

* str.maketrans() creates a translation table. The three arguments are:

  - 1st → characters to replace.

  - 2nd → characters to replace them with (must be the same length as the 1st).

  - 3rd → characters to delete.

* .translate() applies the translation table to the string.

In Method 2, the first two arguments are empty and the third is the punctuation string, so the table says "delete all punctuation marks" and the call returns the text with punctuation removed.
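To make the three arguments concrete, here is a minimal sketch that uses all of them at once. The mapping from 'abc' to 'xyz' and the sample string are made up for illustration.

```python
import string

# 1st and 2nd arguments: map each character of 'abc' to the character
# at the same position in 'xyz'.
# 3rd argument: delete every punctuation character.
table = str.maketrans("abc", "xyz", string.punctuation)

print("a-b-c, done!".translate(table))  # xyz done
```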

In [17]:
# Method:2
def remove_punc2(text):
    return text.translate(str.maketrans('', '', exclude))

In [18]:
text = 'string. With. Punctuation?!'

In [19]:
start = time.time()
print(remove_punc1(text))
time1 = time.time() - start
print(time1)

string With Punctuation
0.00016260147094726562


In [20]:
start = time.time()
print(remove_punc2(text))
time2 = time.time() - start
print(time2)

string With Punctuation
0.00023102760314941406


In [21]:
print(time1/time2)
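# In this run time1 < time2 (ratio below 1), so method 1 was actually faster here.
# One time.time() measurement on a very short string is too noisy to rank the two methods;
# translate() is often faster on long texts, so measure on representative data before choosing.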

0.7038183694530443
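A single `time.time()` measurement on a short string is noisy (in the run above the ratio was about 0.70, meaning method 1 finished first). A more reliable comparison repeats each call many times on a longer input, for example with the standard-library `timeit` module. The sketch below redefines both functions so it runs on its own; `long_text` and the repetition count are arbitrary choices, and the timings depend on the machine and the input.

```python
import string
import timeit

exclude = string.punctuation

def remove_punc1(text):
    # One pass over the text per punctuation character
    for char in exclude:
        text = text.replace(char, "")
    return text

_TABLE = str.maketrans("", "", exclude)  # build the table once, reuse it

def remove_punc2(text):
    # A single pass over the text
    return text.translate(_TABLE)

long_text = "string. With. Punctuation?! " * 1000

t1 = timeit.timeit(lambda: remove_punc1(long_text), number=200)
t2 = timeit.timeit(lambda: remove_punc2(long_text), number=200)
print(f"replace loop: {t1:.4f}s   translate: {t2:.4f}s")
```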


* Chat Word Treatment: For making a machine understand abbreviations

In [22]:
chat_words = {
    "A3": "Anytime, Anywhere, Anyplace",
    "ADIH": "Another Day In Hell",
    "AFK": "Away From Keyboard",
    "AFAIK": "As Far As I Know",
    "ASAP": "As Soon As Possible",
    "ASL": "Age, Sex, Location",
    "ATK": "At The Keyboard",
    "ATM": "At The Moment",
    "BAE": "Before Anyone Else",
    "BAK": "Back At Keyboard",
    "BBL": "Be Back Later",
    "BBS": "Be Back Soon",
    "BFN": "Bye For Now",
    "B4N": "Bye For Now",
    "BRB": "Be Right Back",
    "BRUH": "Bro",
    "BRT": "Be Right There",
    "BSAAW": "Big Smile And A Wink",
    "BTW": "By The Way",
    "BWL": "Bursting With Laughter",
    "CSL": "Can’t Stop Laughing",
    "CU": "See You",
    "CUL8R": "See You Later",
    "CYA": "See You",
    "DM": "Direct Message",
    "FAQ": "Frequently Asked Questions",
    "FC": "Fingers Crossed",
    "FIMH": "Forever In My Heart",
    "FOMO": "Fear Of Missing Out",
    "FR": "For Real",
    "FWIW": "For What It's Worth",
    "FYP": "For You Page",
    "FYI": "For Your Information",
    "G9": "Genius",
    "GAL": "Get A Life",
    "GG": "Good Game",
    "GMTA": "Great Minds Think Alike",
    "GN": "Good Night",
    "GOAT": "Greatest Of All Time",
    "GR8": "Great!",
    "HBD": "Happy Birthday",
    "IC": "I See",
    "ICQ": "I Seek You",
    "IDC": "I Don’t Care",
    "IDK": "I Don't Know",
    "IFYP": "I Feel Your Pain",
    "ILU": "I Love You",
    "ILY": "I Love You",
    "IMHO": "In My Honest/Humble Opinion",
    "IMU": "I Miss You",
    "IMO": "In My Opinion",
    "IOW": "In Other Words",
    "IRL": "In Real Life",
    "IYKYK": "If You Know, You Know",
    "JK": "Just Kidding",
    "KISS": "Keep It Simple, Stupid",
    "L": "Loss",
    "L8R": "Later",
    "LDR": "Long Distance Relationship",
    "LMK": "Let Me Know",
    "LMAO": "Laughing My A** Off",
    "LOL": "Laughing Out Loud",
    "LTNS": "Long Time No See",
    "M8": "Mate",
    "MFW": "My Face When",
    "MID": "Mediocre",
    "MRW": "My Reaction When",
    "MTE": "My Thoughts Exactly",
    "NVM": "Never Mind",
    "NRN": "No Reply Necessary",
    "NPC": "Non-Player Character",
    "OIC": "Oh I See",
    "OP": "Overpowered",
    "PITA": "Pain In The A**",
    "POV": "Point Of View",
    "PRT": "Party",
    "PRW": "Parents Are Watching",
    "ROFL": "Rolling On The Floor Laughing",
    "ROFLOL": "Rolling On The Floor Laughing Out Loud",
    "ROTFLMAO": "Rolling On The Floor Laughing My A** Off",
    "RN": "Right Now",
    "SK8": "Skate",
    "STATS": "Your Sex And Age",
    "SUS": "Suspicious",
    "TBH": "To Be Honest",
    "TFW": "That Feeling When",
    "THX": "Thank You",
    "TIME": "Tears In My Eyes",
    "TLDR": "Too Long, Didn’t Read",
    "TNTL": "Trying Not To Laugh",
    "TTFN": "Ta-Ta For Now!",
    "TTYL": "Talk To You Later",
    "U": "You",
    "U2": "You Too",
    "U4E": "Yours For Ever",
    "W": "Win",
    "W8": "Wait...",
    "WB": "Welcome Back",
    "WTF": "What The F**k",
    "WTG": "Way To Go!",
    "WUF": "Where Are You From?",
    "WYD": "What You Doing?",
    "WYWH": "Wish You Were Here",
    "ZZZ": "Sleeping, Bored, Tired"
}


In [23]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [24]:
chat_conversion('IMHO RM is overrated')

'In My Honest/Humble Opinion RM is overrated'

In [25]:
chat_conversion('Same to u')

'Same to You'
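Because `chat_conversion` splits on whitespace, an abbreviation with punctuation attached (such as "brb,") is not found in the dictionary. One possible workaround, sketched below, substitutes each run of letters and digits with a regex instead. The names `demo_chat_words` and `expand_chat_words` are hypothetical, and the two-entry dictionary only stands in for the full `chat_words` mapping.

```python
import re

# Tiny stand-in for the notebook's chat_words dictionary
demo_chat_words = {"BRB": "Be Right Back", "IMO": "In My Opinion"}

def expand_chat_words(text, mapping):
    """Expand abbreviations even when punctuation is attached to them."""
    def replace(match):
        word = match.group(0)
        return mapping.get(word.upper(), word)
    # Substitute each run of letters/digits; everything else is left as-is
    return re.sub(r"[A-Za-z0-9]+", replace, text)

print(expand_chat_words("brb, imo it's fine", demo_chat_words))
# Be Right Back, In My Opinion it's fine
```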

- Spelling Correction

In [26]:
from textblob import TextBlob

In [27]:
incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.'
textBlb = TextBlob(incorrect_text)
textBlb.correct().string

'certain conditions during several generations are modified in the same manner.'

- Stopwords Removal

In [28]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [29]:
from nltk.corpus import stopwords

In [30]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [31]:
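stop_words = set(stopwords.words('english'))  # build the set once for fast membership tests

def remove_stopwords(text):
    # Matching is case-sensitive and the NLTK list is lowercase, so lowercase the text first
    # (as done for df['review'] earlier) if capitalized stopwords such as 'This' should go too.
    return " ".join(word for word in text.split() if word not in stop_words)

In [32]:
remove_stopwords('This is an example of a sentence where the stopwords will be removed from the text')

'This example sentence stopwords removed text'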

- Emoji Handling: Either removing or replacing them with meaning

In [35]:
# Removing
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons 🙂😂😍
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs 🌍🎉🐍
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols 🚗🚀
                           u"\U0001F1E0-\U0001F1FF"  # regional indicators (flags) 🇮🇳
                           u"\U00002702-\U000027B0"  # dingbats ✂️✈️➕
                           u"\U000024C2-\U0001F251"  # enclosed characters 🅰️🅱️🆘; caution: this wide range also spans CJK and other non-emoji text
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)


In [36]:
s = "I love Python 🐍🔥 but debugging makes me 😭"
print(remove_emoji(s))

I love Python  but debugging makes me 


In [38]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.15.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.15.0-py3-none-any.whl (608 kB)
Installing collected packages: emoji
Successfully installed emoji-2.15.0


In [39]:
# Replacing with meaning
import emoji
print(emoji.demojize('Python is 😍'))

Python is :smiling_face_with_heart-eyes:


## **Tokenization**:
Tokenization is a fundamental step in Natural Language Processing (NLP) that involves splitting a piece of text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the specific tokenization method used.

For example, if you have the sentence "I love natural language processing!", tokenization might break it down into the following tokens:

- "I"
- "love"
- "natural"
- "language"
- "processing"
- "!"

Tokenization is often one of the first steps in an NLP pipeline as it prepares the text for further processing and analysis. It's crucial for tasks like sentiment analysis, machine translation, and text classification.
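The example above can be reproduced with a very small regex-based tokenizer. This is only a sketch to illustrate word-level versus character-level tokens; the function name `simple_word_tokenize` is made up here, and real tokenizers handle many more cases.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I love natural language processing!"
print(simple_word_tokenize(sentence))
# ['I', 'love', 'natural', 'language', 'processing', '!']

# Character-level tokens for a single word
print(list("love"))  # ['l', 'o', 'v', 'e']
```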

### 1. Using the split function

In [40]:
# word tokenization
sent1 = 'I am going to delhi'
sent1.split()

['I', 'am', 'going', 'to', 'delhi']

In [41]:
# sentence tokenization
sent2 = 'I am going to delhi. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to delhi',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [42]:
# Problems with split function
sent3 = 'I am going to delhi!'
print(sent3.split())

sent4 = 'Where do think I should go? I have 3 day holiday'
print(sent4.split('.'))

['I', 'am', 'going', 'to', 'delhi!']
['Where do think I should go? I have 3 day holiday']


### 2. Regular Expression

In [44]:
import re
sent3 = 'I am going to delhi!'
tokens = re.findall(r"[\w']+", sent3)  # raw string avoids an invalid-escape warning for \w
tokens

['I', 'am', 'going', 'to', 'delhi']

In [45]:

text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""
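# The pattern requires a space after the punctuation mark. Here '?' is followed by a newline
# and the final '.' by nothing, so the text is returned as a single unsplit string.
sentences = re.compile('[.!?] ').split(text)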
sentences

["Lorem Ipsum is simply dummy text of the printing and typesetting industry?\nLorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]
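The pattern `'[.!?] '` above only matches punctuation followed by a literal space, which is why the text comes back as one string. A possible variant, sketched below, splits on any whitespace (including newlines) that follows a sentence-ending mark, using a lookbehind so the punctuation stays attached to its sentence. It is still a heuristic and would mis-split abbreviations such as "Ph.D. in".

```python
import re

text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

# (?<=[.!?]) is a lookbehind: split on whitespace that follows '.', '!' or '?'
# without consuming the punctuation itself.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(len(sentences))  # 2
for s in sentences:
    print(repr(s))
```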

### 3. NLTK

In [46]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [47]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [48]:
sent1 = 'I am going to visit delhi!'
sent2 = 'I have a Ph.D in A.I'
sent3 = "We're here to help! mail us at aj@gmail.com"
sent4 = 'A 5km ride cost $10.50'

In [49]:
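# Some tokens are questionable: the e-mail address is split into 'aj', '@', 'gmail.com',
# and '5km' is kept as a single token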
print(word_tokenize(sent1))
print(word_tokenize(sent2))
print(word_tokenize(sent3))
print(word_tokenize(sent4))

['I', 'am', 'going', 'to', 'visit', 'delhi', '!']
['I', 'have', 'a', 'Ph.D', 'in', 'A.I']
['We', "'re", 'here', 'to', 'help', '!', 'mail', 'us', 'at', 'aj', '@', 'gmail.com']
['A', '5km', 'ride', 'cost', '$', '10.50']


In [50]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

### 4. spaCy

In [51]:
import spacy
nlp = spacy.load('en_core_web_sm')
# Loads the small English language model (en_core_web_sm).
# This model knows things like vocabulary, grammar, tokenization rules, POS tags, entities, etc.

In [52]:
doc1 = nlp(sent1)
doc2 = nlp(sent2)
doc3 = nlp(sent3)
doc4 = nlp(sent4)

In [53]:
for token in doc1:
    print(token)

I
am
going
to
visit
delhi
!


In [54]:
for token in doc2:
    print(token)

I
have
a
Ph
.
D
in
A.I


In [55]:
for token in doc3:
    print(token)

We
're
here
to
help
!
mail
us
at
aj@gmail.com


In [57]:
for token in doc4:
    print(token)

A
5
km
ride
cost
$
10.50


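- spaCy generally handles these examples better than NLTK (it keeps the e-mail address whole and separates '5' from 'km'), but neither is foolproof: spaCy breaks 'Ph.D' into three tokens.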

## **Inflection**:
Inflection is the modification of a word to express different grammatical categories such as tense, case, voice, gender, and mood.

## **Stemming**:
Stemming is a technique used in Natural Language Processing (NLP) to reduce words to their root or base form, which is called the "stem". It is a more aggressive process than lemmatization and **can sometimes result in stems that are not actual words.**

Stemming strips inflectional endings so that different forms of a word map to a common base form. For example, "running" and "runs" are both reduced to the stem "run"; an irregular form such as "ran" has no suffix to strip, so rule-based stemmers usually leave it unchanged. This is useful for tasks where you want to treat different forms of the same word as equivalent, such as in **information retrieval or text analysis.**

- **Stemmer**: An algorithm that performs stemming.
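To illustrate the idea of suffix stripping, here is a deliberately naive stemmer. It is not the Porter algorithm, just a toy with a hand-picked suffix list and a minimum stem length of three characters, both of which are assumptions made for this sketch.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ("walking", "walked", "walks", "ran", "movies"):
    print(w, "->", toy_stem(w))
# walking -> walk
# walked -> walk
# walks -> walk
# ran -> ran
# movies -> movi
```

The last two lines show both limitations mentioned above: the irregular form "ran" is untouched, and "movi" is not an actual word.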

In [58]:
from nltk.stem.porter import PorterStemmer  # Porter stemmer: an algorithm published by Martin Porter in 1980

In [59]:
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [60]:
sample = "walk walks walking walked"
stem_words(sample)

'walk walk walk walk'

In [61]:
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'
text

'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets old despite my having seen it some 15 or more times in the last 25 years paul lukas performance brings tears to my eyes and bette davis in one of her very few truly sympathetic roles is a delight the kids are as grandma says more like dressedup midgets than children but that only makes them more fun to watch and the mothers slow awakening to whats happening in the world and under her own roof is believable and startling if i had a dozen thumbs theyd all be up for this movie'

In [62]:
stem_words(text)  # Notice that many stems are not real words (e.g. 'probabl', 'movi', 'stori')

'probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get old despit my have seen it some 15 or more time in the last 25 year paul luka perform bring tear to my eye and bett davi in one of her veri few truli sympathet role is a delight the kid are as grandma say more like dressedup midget than children but that onli make them more fun to watch and the mother slow awaken to what happen in the world and under her own roof is believ and startl if i had a dozen thumb theyd all be up for thi movi'

## **Lemmatization**:

Lemmatization is a more sophisticated technique than stemming that aims to reduce words to their base or dictionary form, known as the "lemma". Unlike stemming, lemmatization considers the context and the part of speech of a word to return a **valid word.**

For example:
- "running", "runs", and "ran" would all be reduced to the lemma "run".
- "better" might be reduced to "good" because "good" is the base form of "better" in the context of adjectives.

Lemmatization typically requires a dictionary or a lexicon to look up the base forms, making it computationally more expensive than stemming. However, it often produces more accurate results and is preferred in applications where the meaning of the word is important, such as in question answering or machine translation.<br>
To summarize: **use stemming when speed matters; use lemmatization when meaning matters.**
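The dictionary-lookup idea can be sketched with a toy lexicon keyed by word and part of speech. The entries and the name `toy_lemmatize` are invented for illustration; a real lemmatizer relies on a full resource such as WordNet.

```python
# Toy lexicon keyed by (word, part of speech)
TOY_LEXICON = {
    ("running", "v"): "run",
    ("runs", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma from the toy lexicon, or the word unchanged if unknown."""
    return TOY_LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("ran", "v"))      # run
print(toy_lemmatize("better", "a"))   # good
print(toy_lemmatize("running", "n"))  # running (no noun entry, left as-is)
```

Unlike suffix stripping, the lookup handles the irregular form "ran", but only because the lexicon contains it, which is where the extra cost comes from.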

In [63]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [64]:
import nltk
from nltk.stem import WordNetLemmatizer  # WordNet is a large lexical database of English
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
punctuations = set("?:!.,;")

# Build a new list instead of calling remove() while iterating, which can skip tokens
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print(sentence_words)

['He', 'was', 'running', 'and', 'eating', 'at', 'same', 'time', 'He', 'has', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'long', 'hours', 'in', 'the', 'Sun']


In [65]:
print("{0:20}{1:20}".format("Word","Lemma"))
for word in sentence_words:
    # pos='v' treats every token as a verb, so the noun 'hours' is not reduced to 'hour'
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 


- pos stands for part-of-speech. It tells the lemmatizer what kind of word you are dealing with: noun, verb, adjective, or adverb.
- Different POS → different lemma.
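In practice the part of speech usually comes from a tagger rather than being typed by hand. A common approach is to convert Penn Treebank tags (the kind returned by `nltk.pos_tag`) into the single-letter codes that the WordNet lemmatizer expects. The helper below is a sketch of that mapping only; the name `penn_to_wordnet_pos` is hypothetical, and no tagger is called so that the block runs without extra downloads.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's default pos

for tag in ("VBG", "JJR", "NNS", "RB"):
    print(tag, "->", penn_to_wordnet_pos(tag))
# VBG -> v
# JJR -> a
# NNS -> n
# RB -> r
```

The returned letter can then be passed as the `pos` argument of `lemmatize`.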


In [66]:
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running", pos="v"))  # verb → run
print(lemmatizer.lemmatize("running", pos="n"))  # noun → running

run
running


In [67]:
print(lemmatizer.lemmatize("better"))           # default noun → better (wrong)
print(lemmatizer.lemmatize("better", pos="a"))  # adjective → good

print(lemmatizer.lemmatize("flies"))            # noun → fly
print(lemmatizer.lemmatize("flies", pos="v"))   # verb → fly


better
good
fly
fly
