# Natural Language Toolkit

Basic Preprocessing Types

1.   Lowercasing
2.   Remove HTML Tags
3.   Remove URL
4.   Remove Punctuation
5.   Chat word treatment
6.   Spelling Correction
7.   Removing Stop Words
8.   Handling Emojis
9.   Tokenization
10.  Stemming
11.  Lemmatization

# 1. Lowercasing

In [1]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv('IMDB Dataset.csv')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [4]:
df['review'][3]

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

In [5]:
df['review'][3].lower()

"basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.<br /><br />ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."

In [6]:
df['review'].str.lower()

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. <br /><br />the...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

In [7]:
df['review'] = df['review'].str.lower()
df

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i'm going to have to disagree with the previou...,negative


# 2. Remove HTML Tags

In [8]:
import re

In [9]:
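def remove_html_tags(text):
    # Non-greedy '.*?' stops at the first '>', so only the tag itself is removed
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', text)

Note: the outputs shown below were produced with the greedy pattern '<?.*>', which deletes everything up to the last '>' in a review. That is why row 3 shows only the text after its final '<br />'. With the non-greedy pattern above, only the tags are removed.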

In [11]:
df['review'] = df['review'].apply(remove_html_tags)
df

Unnamed: 0,review,sentiment
0,i would say the main appeal of the show is due...,positive
1,the realism really comes home with the little ...,positive
2,"this may not be the crown jewel of his career,...",positive
3,3 out of 10 just for the well playing parents ...,negative
4,we wish mr. mattei good luck and await anxious...,positive
...,...,...
49995,8/10,positive
49996,if you want to watch something similar but a t...,negative
49997,at first i thought the gun might be a fake and...,negative
49998,i'm going to have to disagree with the previou...,negative
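The difference between a greedy and a non-greedy tag pattern can be checked on a short fragment of review 3. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the notebook.

```python
import re

# Fragment of review 3 (lowercased), taken from the output in section 1
fragment = "fighting all the time.<br /><br />this movie is slower than a soap opera"

# Greedy: '.*' runs to the end of the string and backtracks to the LAST '>'
greedy = re.compile(r'<?.*>')
# Non-greedy: '.*?' stops at the FIRST '>' after each '<'
non_greedy = re.compile(r'<.*?>')

greedy_result = greedy.sub('', fragment)
non_greedy_result = non_greedy.sub('', fragment)

print(repr(greedy_result))      # text before the last tag is lost
print(repr(non_greedy_result))  # only the tags are removed
```

Replacing each tag with a space instead of an empty string would stop the neighbouring sentences from running together ("time.this"), at the cost of occasional double spaces.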


# 3. Remove URL

In [12]:
# for example
text1 = 'stackoverflow link  https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string'
text2 = 'Google search link https://www.google.co.in/'

In [22]:
def remove_url(text):
    # Raw string so that \S and \. reach the regex engine unchanged
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub('', text)

In [23]:
remove_url(text1)

'stackoverflow link  '
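The same pattern can be checked against the second example string and against a link that has no scheme. A small self-contained sketch follows; `text3` and its `www.example.com` address are made up for illustration.

```python
import re

# Compile once at module level so the pattern is not rebuilt on every call
url_pattern = re.compile(r'https?://\S+|www\.\S+')

def remove_url(text):
    return url_pattern.sub('', text)

text2 = 'Google search link https://www.google.co.in/'
text3 = 'see www.example.com for details'  # hypothetical example without a scheme

result2 = remove_url(text2)
result3 = remove_url(text3)
print(repr(result2))
print(repr(result3))
```

Note that the URL is removed but the surrounding spaces stay, so `text3` ends up with two consecutive spaces.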

In [24]:
df['review'] = df['review'].apply(remove_url)
df

Unnamed: 0,review,sentiment
0,i would say the main appeal of the show is due...,positive
1,the realism really comes home with the little ...,positive
2,"this may not be the crown jewel of his career,...",positive
3,3 out of 10 just for the well playing parents ...,negative
4,we wish mr. mattei good luck and await anxious...,positive
...,...,...
49995,8/10,positive
49996,if you want to watch something similar but a t...,negative
49997,at first i thought the gun might be a fake and...,negative
49998,i'm going to have to disagree with the previou...,negative


# 4. Remove Punctuation

Punctuation details: https://en.wikipedia.org/wiki/Punctuation

In [38]:
import string, time
string.punctuation


'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [39]:
exclude = string.punctuation

In [40]:
def remove_punc(text):
    for char in exclude:
        text = text.replace(char,'')
    return text

In [41]:
text = 'string. With. puctuation?'

In [42]:
start = time.time()
print(remove_punc(text))
time1 = time.time() - start
print(time1)

string With puctuation
0.0005002021789550781


In [43]:

def remove_punc1(text):
    return text.translate(str.maketrans('','',exclude))

In [44]:
start = time.time()
print(remove_punc1(text))
time2 = time.time() - start
print(time2)

string With puctuation
0.0004565715789794922
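A single call timed with `time.time()` also includes the cost of `print`, so the two numbers above are mostly noise. A hedged sketch of a steadier comparison uses the standard `timeit` module on a longer input; the repetition counts are arbitrary choices, and the measured times will vary by machine.

```python
import string
import timeit

exclude = string.punctuation

def remove_punc(text):
    # One full pass over the text per punctuation character
    for char in exclude:
        text = text.replace(char, '')
    return text

# Build the translation table once instead of on every call
punct_table = str.maketrans('', '', exclude)

def remove_punc1(text):
    # Single pass over the text using the prebuilt table
    return text.translate(punct_table)

sample = 'string. With. puctuation?' * 200

loop_time = timeit.timeit(lambda: remove_punc(sample), number=50)
translate_time = timeit.timeit(lambda: remove_punc1(sample), number=50)
print(f"replace loop: {loop_time:.4f}s, translate: {translate_time:.4f}s")
```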


In [45]:
# Strip punctuation from every review using the translate-based version
df['review'] = df['review'].apply(remove_punc1)


# 5. Chat word treatment

In [46]:
chart_words = {'AFAIK':'As Far As I Know',
'AFK':'Away From Keyboard',
'ASAP':'As Soon As Possible',
'ATK':'At The Keyboard',
'ATM':'At The Moment',
'A3':'Anytime, Anywhere, Anyplace',
'BAK':'Back At Keyboard',
'BBL':'Be Back Later',
'BBS':'Be Back Soon',
'BFN':'Bye For Now',
'B4N':'Bye For Now',
'BRB':'Be Right Back',
'BRT':'Be Right There',
'BTW':'By The Way',
'B4':'Before',
'CU':'See You',
'CUL8R':'See You Later',
'CYA':'See You',
'FAQ':'Frequently Asked Questions',
'FC':'Fingers Crossed',
'FWIW':'For What It\'s Worth',
'FYI':'For Your Information',
'GAL':'Get A Life',
'GG':'Good Game',
'GN':'Good Night',
'GMTA':'Great Minds Think Alike',
'GR8':'Great!',
'G9':'Genius',
'IC':'I See',
'ICQ':'I Seek you (also a chat program)',
'ILU':'ILU: I Love You',
'IMHO':'In My Honest/Humble Opinion',
'IMO':'In My Opinion',
'IOW':'In Other Words',
'IRL':'In Real Life',
'KISS':'Keep It Simple, Stupid',
'LDR':'Long Distance Relationship',
'LMAO':'Laugh My A.. Off',
'LOL':'Laughing Out Loud',
'LTNS':'Long Time No See',
'L8R':'Later',
'MTE':'My Thoughts Exactly',
'M8':'Mate',
'NRN':'No Reply Necessary',
'OIC':'Oh I See',
'PITA':'Pain In The A..',
'PRT':'Party',
'PRW':'Parents Are Watching',
'QPSA':'Que Pasa?',
'ROFL':'Rolling On The Floor Laughing',
'ROFLOL':'Rolling On The Floor Laughing Out Loud',
'ROTFLMAO':'Rolling On The Floor Laughing My A.. Off',
'SK8':'Skate',
'STATS':'Your sex and age',
'ASL':'Age, Sex, Location',
'THX':'Thank You',
'TTFN':'Ta-Ta For Now!',
'TTYL':'Talk To You Later',
'U':'You',
'U2':'You Too',
'U4E':'Yours For Ever',
'WB':'Welcome Back',
'WTF':'What The F...',
'WTG':'Way To Go!',
'WUF':'Where Are You From?',
'W8':'Wait...',
'7K':'Sick:-D Laugher'}

In [48]:
chart_words

{'AFAIK': 'As Far As I Know',
 'AFK': 'Away From Keyboard',
 'ASAP': 'As Soon As Possible',
 'ATK': 'At The Keyboard',
 'ATM': 'At The Moment',
 'A3': 'Anytime, Anywhere, Anyplace',
 'BAK': 'Back At Keyboard',
 'BBL': 'Be Back Later',
 'BBS': 'Be Back Soon',
 'BFN': 'Bye For Now',
 'B4N': 'Bye For Now',
 'BRB': 'Be Right Back',
 'BRT': 'Be Right There',
 'BTW': 'By The Way',
 'B4': 'Before',
 'CU': 'See You',
 'CUL8R': 'See You Later',
 'CYA': 'See You',
 'FAQ': 'Frequently Asked Questions',
 'FC': 'Fingers Crossed',
 'FWIW': "For What It's Worth",
 'FYI': 'For Your Information',
 'GAL': 'Get A Life',
 'GG': 'Good Game',
 'GN': 'Good Night',
 'GMTA': 'Great Minds Think Alike',
 'GR8': 'Great!',
 'G9': 'Genius',
 'IC': 'I See',
 'ICQ': 'I Seek you (also a chat program)',
 'ILU': 'ILU: I Love You',
 'IMHO': 'In My Honest/Humble Opinion',
 'IMO': 'In My Opinion',
 'IOW': 'In Other Words',
 'IRL': 'In Real Life',
 'KISS': 'Keep It Simple, Stupid',
 'LDR': 'Long Distance Relationship',
 ...

In [49]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chart_words:
            new_text.append(chart_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [50]:
chat_conversion('IMHO he is the best')

'In My Honest/Humble Opinion he is the best'
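Because the lookup splits on whitespace, a token with attached punctuation such as 'FYI,' is not found in the dictionary. One possible workaround, sketched below under the assumption that abbreviations consist only of letters and digits, is to substitute on word matches instead. The function name and the three-entry dictionary are illustrative; the entries are copied from the table above.

```python
import re

# Small subset of the chat-word table, for illustration only
chat_words_subset = {
    'IMHO': 'In My Honest/Humble Opinion',
    'BRB': 'Be Right Back',
    'FYI': 'For Your Information',
}

word_pattern = re.compile(r'\w+')

def chat_conversion_punct(text):
    def expand(match):
        word = match.group(0)
        # Fall back to the original word when it is not a known abbreviation
        return chat_words_subset.get(word.upper(), word)
    # Only the word itself is replaced, so nearby punctuation is preserved
    return word_pattern.sub(expand, text)

converted = chat_conversion_punct('FYI, he said brb.')
print(converted)
```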

# 6. Spelling Correction

In [52]:
from textblob import TextBlob

In [58]:
incorrect_text = 'ceertain conditionas duriing seveal ggenrations aree moodified in the saame maner.'
txtblb = TextBlob(incorrect_text)

In [59]:
txtblb.correct().string

'certain conditions during several generations are modified in the same manner.'

# 7. Removing Stopwords

In [60]:
from nltk.corpus import stopwords

In [61]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]
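The list above can then be used to filter a text. The sketch below is self-contained, so it hard-codes a handful of the stop words shown above instead of calling `stopwords.words('english')`; in the notebook the full list would be used. The function name `remove_stopwords` is an assumption.

```python
# Subset of the NLTK English stop words listed above; a set gives fast membership tests
stop_words = {'i', 'this', 'is', 'was', 'a', 'the', 'and', 'of', 'in'}

def remove_stopwords(text, stop_words=stop_words):
    # Keep only the tokens that are not stop words (case-insensitive)
    return " ".join(word for word in text.split() if word.lower() not in stop_words)

filtered = remove_stopwords("this movie is slower than a soap opera")
print(filtered)
```

With the full NLTK list more words would be dropped than with this subset.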

# 8. Removing emojis

In [62]:
import re

def remove_emoji(text):
    # Parameter is named 'text' so it does not shadow the 'string' module imported earlier
    emoji_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "\U00002702-\U000027B0"  # dingbats
                           "\U000024C2-\U0001F251"  # very broad range: also covers CJK and other scripts
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub('', text)

In [63]:
remove_emoji("Hilarious 😂")

'Hilarious '
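The last range in the pattern, U+24C2 to U+1F251, is wide enough to include CJK characters, so it can delete ordinary text in some languages. The sketch below compares the pattern with and without that range; the names `broad_pattern` and `narrow_pattern` and the sample string are illustrative.

```python
import re

# The five emoji-specific ranges from the function above
emoji_ranges = (
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "\U00002702-\U000027B0"  # dingbats
)
broad_range = "\U000024C2-\U0001F251"  # also spans CJK ideographs

broad_pattern = re.compile("[" + emoji_ranges + broad_range + "]+")
narrow_pattern = re.compile("[" + emoji_ranges + "]+")

# Two CJK characters (U+6620, U+753B), an English word, and the U+1F602 emoji
sample = "\u6620\u753b great \U0001F602"

broad_result = broad_pattern.sub('', sample)
narrow_result = narrow_pattern.sub('', sample)
print(repr(broad_result))   # CJK characters are removed along with the emoji
print(repr(narrow_result))  # only the emoji is removed
```

For English-only reviews the broad range is harmless; for multilingual text the narrower pattern is the safer choice, although it misses some symbols the broad one catches.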

# 9. Tokenization

In [64]:
from nltk.tokenize import word_tokenize,sent_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/sunil/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [65]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry?
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type specimen book."""

In [66]:
sent_tokenize(text)

['Lorem Ipsum is simply dummy text of the printing and typesetting industry?',
 "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,\nwhen an unknown printer took a galley of type and scrambled it to make a type specimen book."]

In [67]:
sent7 = 'A 5km. ride cost $10.50'
word_tokenize(sent7)

['A', '5km', '.', 'ride', 'cost', '$', '10.50']
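To see why a tokenizer is preferable to `str.split()`, the same sentence can be split both ways. The regex tokenizer below is a rough standard-library approximation, not NLTK's algorithm; it happens to give the same tokens for this sentence but will differ on contractions and other cases. The name `simple_word_tokenize` is made up.

```python
import re

# Order matters: decimal numbers first, then runs of word characters, then any single symbol
token_pattern = re.compile(r"\d+\.\d+|\w+|[^\w\s]")

def simple_word_tokenize(text):
    return token_pattern.findall(text)

sent7 = 'A 5km. ride cost $10.50'

split_tokens = sent7.split()
regex_tokens = simple_word_tokenize(sent7)
print(split_tokens)  # punctuation and the currency sign stay glued to the words
print(regex_tokens)  # punctuation and '$' become separate tokens
```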

# 10. Stemming

Stemming reduces inflected words to a common root form (the stem) by stripping affixes. Words that share a stem are mapped together, even if the stem itself is not a valid word in the language.

In [69]:
from nltk.stem.porter import PorterStemmer

In [72]:
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [73]:
sample = "walk walks walking walked"
stem_words(sample)

'walk walk walk walk'

In [74]:
text = "basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them."
print(text)

basically there's a family where a little boy (jake) thinks there's a zombie in his closet & his parents are fighting all the time.this movie is slower than a soap opera... and suddenly, jake decides to become rambo and kill the zombie.ok, first of all when you're going to make a film you must decide if its a thriller or a drama! as a drama the movie is watchable. parents are divorcing & arguing like in real life. and then we have jake with his closet which totally ruins all the film! i expected to see a boogeyman similar movie, and instead i watched a drama with some meaningless thriller spots.3 out of 10 just for the well playing parents & descent dialogs. as for the shots with jake: just ignore them.


In [75]:
stem_words(text)

"basic there' a famili where a littl boy (jake) think there' a zombi in hi closet & hi parent are fight all the time.thi movi is slower than a soap opera... and suddenly, jake decid to becom rambo and kill the zombie.ok, first of all when you'r go to make a film you must decid if it a thriller or a drama! as a drama the movi is watchable. parent are divorc & argu like in real life. and then we have jake with hi closet which total ruin all the film! i expect to see a boogeyman similar movie, and instead i watch a drama with some meaningless thriller spots.3 out of 10 just for the well play parent & descent dialogs. as for the shot with jake: just ignor them."

# 11. Lemmatization

Unlike stemming, lemmatization reduces inflected words to a root that is a valid word in the language. This root is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

In [1]:
import nltk


In [2]:
from nltk.stem import WordNetLemmatizer

In [3]:
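wordnet_lemmatization = WordNetLemmatizer()  # needs the WordNet corpus: nltk.download('wordnet')

sentence = 'He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun.'
punctuations = "?:!.,;"

# Tokenize, then drop punctuation tokens with a comprehension
# (removing items from a list while iterating over it can skip elements)
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

print("{0:20}{1:20}".format("Word","Lemma"))
for word in sentence_words:
    # lemmatize() treats every word as a noun by default (pos='n'),
    # so verbs such as 'running' are left unchanged and 'was' becomes 'wa'
    print("{0:20}{1:20}".format(word,wordnet_lemmatization.lemmatize(word)))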

Word                Lemma               
He                  He                  
was                 wa                  
running             running             
and                 and                 
eating              eating              
at                  at                  
same                same                
time                time                
He                  He                  
has                 ha                  
bad                 bad                 
habit               habit               
of                  of                  
swimming            swimming            
after               after               
playing             playing             
long                long                
hours               hour                
in                  in                  
the                 the                 
Sun                 Sun                 
