# SECTION 1: Introduction

1. lower case
2. remove HTML tags
3. remove urls
4. remove punctuation
5. chat words treatment
6. emoji treatment
7. spelling correction
8. removal of stopwords
9. tokenization
10. stemming and lemmatization
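The steps above are applied one after another, each taking the output of the previous one. A minimal sketch of that chaining, using only three of the steps and hypothetical helper names (`remove_punctuation`, `run_pipeline`) that do not appear in the notebook:

```python
import string
from functools import reduce

def remove_punctuation(text):
    # Delete every character listed in string.punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))

def run_pipeline(text, steps):
    # Feed the output of each step into the next one, left to right.
    return reduce(lambda current, step: step(current), steps, text)

# Hypothetical step order for illustration; the notebook applies more steps.
steps = [str.lower, remove_punctuation, str.split]

print(run_pipeline("Forest FIRE near La Ronge, Sask. Canada!", steps))
```

The order matters: for example, URLs should be removed before punctuation, otherwise the `://` that identifies them is already gone.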

In [1]:
import numpy as np
import pandas as pd

In [2]:
# NLP Library

# regular expression
import re

In [3]:
# for string manipulation
import string, time

In [4]:
# for emoji treatment
import emoji

In [5]:
# for spelling correction
from textblob import TextBlob

In [6]:
# nltk library
import nltk

In [7]:
# list of stop words
from nltk.corpus import stopwords

In [8]:
# for tokenization
from nltk.tokenize import word_tokenize, sent_tokenize

In [9]:
# for stemming
from nltk.stem.porter import PorterStemmer

In [10]:
# for lemmatization
from nltk.stem import WordNetLemmatizer

In [11]:
# spacy library
import spacy

# SECTION 2: Load Dataset

In [14]:
disaster_tweet_df = pd.read_csv("F:/EDGE/LearningAI/nlp-getting-started/train.csv")

In [15]:
movies_review_df = pd.read_csv("F:/EDGE/LearningAI/archive/IMDB Dataset.csv")

In [16]:
disaster_tweet_df.sample(5)

Unnamed: 0,id,keyword,location,text,target
5648,8058,refugees,"Geneva, Switzerland",CHPSRE: RT: Refugees: For our followers in Par...,1
829,1206,blizzard,"California, USA",@DaBorsch not really that shocking :( blizzard...,0
7558,10805,wrecked,probably not home,coleslaw #wrecked http://t.co/sijNBmCZIJ,0
1390,2006,bush%20fires,,Ted Cruz fires back at Jeb &amp; Bush: ÛÏWe l...,0
6850,9818,trauma,,@crazyindapeg @VETS78734 completely understand...,0


In [17]:
movies_review_df.sample(5)

Unnamed: 0,review,sentiment
21934,I have just finished watching this film and I ...,positive
12631,I should preface this by stating that I am a D...,negative
49646,"Awful dreams, wild premonitions, blasphemy and...",positive
29720,This movie treads on very familiar ground -- t...,negative
3849,"The only reason The Duke Is Tops, one of sever...",positive


# SECTION 3: Lowercasing of All Sentences

Lowercasing loses some information: capitalization can signal emotions such as anger or excitement.
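If that signal matters, one option is to record it before lowercasing. A minimal sketch, assuming a hypothetical helper `uppercase_ratio` that is not part of the notebook:

```python
def uppercase_ratio(text):
    # Share of alphabetic characters that are upper case (0.0 if there are none).
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)

shouted = "THE HOUSE IS ON FIRE"
calm = "the house is on fire"

# After lowercasing, the two strings are indistinguishable...
print(shouted.lower() == calm)
# ...but a ratio computed beforehand keeps the difference.
print(uppercase_ratio(shouted), uppercase_ratio(calm))
```

Such a ratio could be stored as an extra column before the text column is lowercased.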

In [18]:
disaster_tweet_df['text'] = \
disaster_tweet_df['text'].str.lower()

In [19]:
movies_review_df['review'] = \
movies_review_df['review'].str.lower()

In [20]:
disaster_tweet_df['text']

0       our deeds are the reason of this #earthquake m...
1                  forest fire near la ronge sask. canada
2       all residents asked to 'shelter in place' are ...
3       13,000 people receive #wildfires evacuation or...
4       just got sent this photo from ruby #alaska as ...
                              ...                        
7608    two giant cranes holding a bridge collapse int...
7609    @aria_ahrary @thetawniest the out of control w...
7610    m1.94 [01:04 utc]?5km s of volcano hawaii. htt...
7611    police investigating after an e-bike collided ...
7612    the latest: more homes razed by northern calif...
Name: text, Length: 7613, dtype: object

In [21]:
movies_review_df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. <br /><br />the...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

# SECTION 4: Removal of HTML Tags

In [22]:
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'',text)

Before removal of HTML tags:

In [23]:
movies_review_df['review'][1]

'a wonderful little production. <br /><br />the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. <br /><br />the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well d

In [24]:
movies_review_df['review'] = \
movies_review_df['review'].apply(remove_html_tags)

After removal of HTML tags:

In [25]:
movies_review_df['review'][1]

'a wonderful little production. the filming technique is very unassuming- very old-time-bbc fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. the actors are extremely well chosen- michael sheen not only "has got all the polari" but he has all the voices down pat too! you can truly see the seamless editing guided by the references to williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. a masterful production about one of the great master\'s of comedy and his life. the realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. it plays on our knowledge and our senses, particularly with the scenes concerning orton and halliwell and the sets (particularly of their flat with halliwell\'s murals decorating every surface) are terribly well done.'
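Replacing a tag with an empty string can glue two words together when there is no space around the tag; tokens such as `methe` and `wordit` later in this notebook suggest that this happens in some reviews. A minimal sketch of a variant that substitutes a space instead and also decodes entities such as `&amp;` (the name `remove_html_tags_spaced` is hypothetical):

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")

def remove_html_tags_spaced(text):
    # Replace each tag with a space so neighbouring words stay separate,
    # decode entities such as &amp;, then collapse repeated whitespace.
    without_tags = TAG_PATTERN.sub(" ", text)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())

print(remove_html_tags_spaced("what happened with me.<br /><br />the first thing"))
print(remove_html_tags_spaced("jeb &amp; bush"))
```

This is only a regex-level cleanup; for heavily nested or malformed markup a real HTML parser would be a safer choice.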

# SECTION 5: Removal of URLs

In [26]:
disaster_tweet_df[disaster_tweet_df['text'].str.contains('http')].head(5)

Unnamed: 0,id,keyword,location,text,target
31,48,ablaze,Birmingham,@bbcmtd wholesale markets ablaze http://t.co/l...,1
32,49,ablaze,Est. September 2012 - Bristol,we always try to bring the heavy. #metal #rt h...,0
33,50,ablaze,AFRICA,#africanbaze: breaking news:nigeria flag set a...,1
35,53,ablaze,"London, UK",on plus side look at the sky last night it was...,0
37,55,ablaze,World Wide!!,inec office in abia set ablaze - http://t.co/3...,1


In [27]:
movies_review_df[movies_review_df['review'].str.contains('http')].head(5)

Unnamed: 0,review,sentiment
907,following directly from where the story left o...,positive
1088,this quasi j-horror film followed a young woma...,negative
1972,the basic plot of 'marigold' boasts of a roman...,negative
2132,"i, too, found ""oppenheimer"" to be a brilliant ...",positive
3038,"i really love this movie , i saw it for the fi...",positive


In [28]:
def remove_url(text):
    # Match http(s):// links and bare www. links
    pattern = re.compile(r'https?: ?//\S+|www\.\S+')
    return pattern.sub(r'',text)

In [29]:
disaster_tweet_df['text'] = \
disaster_tweet_df['text'].apply(remove_url)

In [30]:
movies_review_df['review'] = \
movies_review_df['review'].apply(remove_url)

After removal of URLs:

In [31]:
disaster_tweet_df[disaster_tweet_df['text'].str.contains('http')].head(5)

Unnamed: 0,id,keyword,location,text,target
121,174,aftershock,Baker City Oregon,aftershock: protect yourself and profit in the...,0


In [32]:
movies_review_df[movies_review_df['review'].str.contains('http')].head(5)

Unnamed: 0,review,sentiment


One tweet in `disaster_tweet_df` still contains the substring `http` after the first pass, so we remove the remaining `http` directly.

In [33]:
disaster_tweet_df['text'] = disaster_tweet_df['text'].str.replace('http','')

In [34]:
disaster_tweet_df[disaster_tweet_df['text'].str.contains('http')].head(5)

Unnamed: 0,id,keyword,location,text,target
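A small self-contained check of a URL pattern of this kind. It is a sketch with a hypothetical function name (`remove_url_strict`) and made-up sample strings, not rows from the dataset. It shows that a bare word `http` without `://` is not treated as a URL, which may be why one tweet still matched `str.contains('http')` after the first pass:

```python
import re

# http:// or https:// followed by non-space characters, or a bare www. link.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_url_strict(text):
    return URL_PATTERN.sub("", text)

print(remove_url_strict("markets ablaze http://t.co/abc now"))
print(remove_url_strict("see www.example.com for details"))
print(remove_url_strict("http is a protocol"))
```

Note that removing a URL from the middle of a sentence leaves two adjacent spaces behind, which later steps such as `str.split()` tolerate.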


# SECTION 6: Removal of Punctuation

In [35]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [36]:
exclude = string.punctuation

First, we will use a basic function to remove punctuation.

In [37]:
def remove_punc(text):
    for char in exclude:
        text = text.replace(char, '')
    return text

In [38]:
text = 'string. With. Punctuation?'

In [41]:
start = time.time() * 5000000000
print(remove_punc(text))
time1 = time.time() * 5000000000 - start
print(time1)

string With Punctuation
8497152.0


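The loop-based `remove_punc` above is the slower approach: it scans the text once for every punctuation character. `str.translate` removes all of them in a single pass. (The timings here are scaled by an arbitrary factor and come from a single run, so only their relative size is meaningful.)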

In [42]:
def remove_punc_2(text):
    return text.translate(str.maketrans('','',exclude))

In [43]:
start = time.time() * 5000000000
print(remove_punc_2(text))
time2 = time.time() * 5000000000 - start
print(time2)

string With Punctuation
7519232.0
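A single `time.time()` difference is noisy for such a short call. A minimal sketch of the same comparison with the standard `timeit` module, which repeats each call many times; the repetition count and the prebuilt table `TABLE` are assumptions of this sketch, not part of the notebook:

```python
import string
import timeit

exclude = string.punctuation
text = "string. With. Punctuation?"

def remove_punc(text):
    # One str.replace call per punctuation character.
    for char in exclude:
        text = text.replace(char, "")
    return text

# Build the translation table once instead of on every call.
TABLE = str.maketrans("", "", exclude)

def remove_punc_2(text):
    return text.translate(TABLE)

loop_seconds = timeit.timeit(lambda: remove_punc(text), number=10_000)
translate_seconds = timeit.timeit(lambda: remove_punc_2(text), number=10_000)
print(f"loop: {loop_seconds:.4f}s  translate: {translate_seconds:.4f}s")
```

The absolute numbers depend on the machine, so no expected output is shown here.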


Before removal in datasets:

In [44]:
disaster_tweet_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,our deeds are the reason of this #earthquake m...,1
1,4,,,forest fire near la ronge sask. canada,1
2,5,,,all residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,just got sent this photo from ruby #alaska as ...,1


In [45]:
movies_review_df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


In [46]:
disaster_tweet_df['text'] = \
disaster_tweet_df['text'].apply(remove_punc_2)

In [47]:
movies_review_df['review'] = \
movies_review_df['review'].apply(remove_punc_2)

After removal of punctuation marks:

In [48]:
disaster_tweet_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,our deeds are the reason of this earthquake ma...,1
1,4,,,forest fire near la ronge sask canada,1
2,5,,,all residents asked to shelter in place are be...,1
3,6,,,13000 people receive wildfires evacuation orde...,1
4,7,,,just got sent this photo from ruby alaska as s...,1


In [49]:
movies_review_df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


# SECTION 7: Chat Word Treatment

Chat words (abbreviations such as IMHO or FYI) cannot be understood by machines in their short form. Therefore, we expand them into the full phrases they stand for.

In [50]:
with open("F:/EDGE/LearningAI/slang.txt", 'r') as f:
    text = f.read()

list_of_slang = text.split("\n")

chat_words = {}
for sentence in list_of_slang:
    # Each useful line has the form ABBREVIATION=Expansion
    if "=" in sentence:
        key, value = sentence.split("=", 1)
        chat_words[key] = value

In [51]:
chat_words

{'AFAIK': 'As Far As I Know',
 'AFK': 'Away From Keyboard',
 'ASAP': 'As Soon As Possible',
 'ATK': 'At The Keyboard',
 'ATM': 'At The Moment',
 'A3': 'Anytime, Anywhere, Anyplace',
 'BAK': 'Back At Keyboard',
 'BBL': 'Be Back Later',
 'BBS': 'Be Back Soon',
 'BFN': 'Bye For Now',
 'B4N': 'Bye For Now',
 'BRB': 'Be Right Back',
 'BRT': 'Be Right There',
 'BTW': 'By The Way',
 'B4': 'Before',
 'CU': 'See You',
 'CUL8R': 'See You Later',
 'CYA': 'See You',
 'FAQ': 'Frequently Asked Questions',
 'FC': 'Fingers Crossed',
 'FWIW': "For What It's Worth",
 'FYI': 'For Your Information',
 'GAL': 'Get A Life',
 'GG': 'Good Game',
 'GN': 'Good Night',
 'GMTA': 'Great Minds Think Alike',
 'GR8': 'Great!',
 'G9': 'Genius',
 'IC': 'I See',
 'ICQ': 'I Seek you (also a chat program)',
 'ILU': 'ILU: I Love You',
 'IMHO': 'In My Honest/Humble Opinion',
 'IMO': 'In My Opinion',
 'IOW': 'In Other Words',
 'IRL': 'In Real Life',
 'KISS': 'Keep It Simple, Stupid',
 'LDR': 'Long Distance Relationship',
 'LM

In [54]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [55]:
chat_conversion('IMHO he is the best')

'In My Honest/Humble Opinion he is the best'

In [56]:
chat_conversion('FYI dhaka is the capital of bangladesh')

'For Your Information dhaka is the capital of bangladesh'
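Splitting on whitespace misses a chat word that has punctuation attached, such as `brb,`. A minimal sketch of a regex-based variant, assuming a hypothetical three-entry mapping in place of the full `slang.txt` file:

```python
import re

# Hypothetical sample; the notebook loads the full mapping from slang.txt.
sample_chat_words = {
    "BRB": "Be Right Back",
    "FYI": "For Your Information",
    "IMO": "In My Opinion",
}

WORD_PATTERN = re.compile(r"\w+")

def chat_conversion_regex(text, mapping=sample_chat_words):
    def expand(match):
        # Look the word up case-insensitively; leave unknown words untouched.
        word = match.group(0)
        return mapping.get(word.upper(), word)
    return WORD_PATTERN.sub(expand, text)

print(chat_conversion_regex("brb, fyi it works"))
```

Because only the word characters are replaced, the surrounding punctuation and spacing are preserved.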

# SECTION 8: Emojis Treatment

## SECTION 8.1: Remove Emojis

In [57]:
def remove_emoji(text):
    emoji_pattern = re.compile(
        "["
        u"\U0001F600-\U0001F64F" # emoticons
        u"\U0001F300-\U0001F5FF" # symbols & pictographs
        u"\U0001F680-\U0001F6FF" # transport & map symbols
        u"\U0001F1E0-\U0001F1FF" # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE
    )
    return emoji_pattern.sub(r'',text)
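Two properties of range-based emoji patterns are worth checking. The broad range `\U000024C2-\U0001F251` also spans CJK ideographs, and newer emoji such as 🥰 (U+1F970) lie outside all six ranges. A minimal sketch of a narrower variant; the function name and the added `\U0001F900-\U0001F9FF` block are assumptions of this sketch:

```python
import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs (assumed addition)
    "\U00002702-\U000027B0"  # dingbats
    "]+"
)

def remove_emoji_narrow(text):
    return EMOJI_PATTERN.sub("", text)

print(remove_emoji_narrow("Python is 🔥"))
print(remove_emoji_narrow("It was 🥰"))
print(remove_emoji_narrow("東京 is 🔥"))  # the CJK characters are kept
```

This list of blocks is still not exhaustive; it only illustrates the trade-off between broad and narrow ranges.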

## SECTION 8.2: Converting Emojis to Meaningful Information

`.demojize` in the `emoji` library
> converts each emoji to its text description

In [59]:
print(emoji.demojize('Python is 🔥'))

Python is :fire:


In [60]:
print(emoji.demojize("Loved the movie. It was 🥰"))

Loved the movie. It was :smiling_face_with_hearts:


# SECTION 9: Spelling Correction

In [61]:
incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.'

In [62]:
def spelling_corrector(text):
    textBlb = TextBlob(text)
    return textBlb.correct().string

In [63]:
spelling_corrector(incorrect_text)

'certain conditions during several generations are modified in the same manner.'

# SECTION 10: Removing Stopwords

In [65]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\asfar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [82]:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

In [83]:
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [84]:
words_discard_from_stopwords = ['because', 'about', 'against', 
                                'between', 'no', 'nor', 'not', 
                                'before', 'after', 'do', 'during', 
                                'above', 'below', 'over', 'under', 
                                'further', 'once', 'how', 'all', 
                                'any', 'don', "don't", 'should', 
                                'ain', 'aren', "aren't", 'couldn', 
                                "couldn't", 'didn', "didn't", 
                                'doesn', "doesn't", 'hadn', "hadn't",
                                'hasn', "hasn't", 'haven', "haven't",
                                'isn', "isn't",'mightn', "mightn't", 
                                'mustn', "mustn't", 'needn', "needn't",
                                'shouldn', "shouldn't", 'wasn', 
                                "wasn't", 'weren', "weren't", 'won', 
                                "won't", "wouldn", "wouldn't"]

In [85]:
# Remove words from stopwords if they exist
stopwords = [word for word in stopwords 
             if word not in words_discard_from_stopwords]

In [86]:
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'into', 'through', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'again', 'then', 'here', 'there', 'when', 'where', 'why', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ma', 'shan', "shan't"]


In [87]:
disaster_tweet_df.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,our deeds are the reason of this earthquake ma...,1
1,4,,,forest fire near la ronge sask canada,1
2,5,,,all residents asked to shelter in place are be...,1
3,6,,,13000 people receive wildfires evacuation orde...,1
4,7,,,just got sent this photo from ruby alaska as s...,1


In [88]:
movies_review_df.head(3)

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive


In [89]:
def remove_stopwords(text):
    # Replace each stopword with an empty string and keep every other word.
    # The empty strings leave extra spaces in the joined result.
    new_text = ['' if word in stopwords else word for word in text.split()]
    return " ".join(new_text)
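Replacing stopwords with empty strings leaves runs of spaces in the text. A minimal sketch of a variant that drops the stopwords entirely and uses a `set` for faster membership tests; the mini stopword set and the function name are hypothetical:

```python
# Hypothetical mini stopword set; the notebook uses NLTK's English list.
sample_stopwords = {"our", "are", "the", "of", "this", "is"}

def remove_stopwords_compact(text, stop_set=sample_stopwords):
    # Keep only non-stopwords, so no empty placeholders reach the join.
    return " ".join(word for word in text.split() if word not in stop_set)

print(remove_stopwords_compact("our deeds are the reason of this earthquake"))
```

With the full NLTK list, wrapping it once in `set(...)` would give the same constant-time lookup.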

In [90]:
disaster_tweet_df['text'] = \
disaster_tweet_df['text'].apply(remove_stopwords)

In [91]:
movies_review_df['review'] = \
movies_review_df['review'].apply(remove_stopwords)

After removal of stopwords:

In [92]:
disaster_tweet_df.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,deeds reason earthquake may allah forgive...,1
1,4,,,forest fire near la ronge sask canada,1
2,5,,,all residents asked shelter place notified...,1
3,6,,,13000 people receive wildfires evacuation orde...,1
4,7,,,got sent photo ruby alaska smoke wildfire...,1


In [93]:
movies_review_df.head(3)

Unnamed: 0,review,sentiment
0,one reviewers mentioned after watching 1...,positive
1,wonderful little production filming techniqu...,positive
2,thought wonderful way spend time hot s...,positive


# SECTION 11: Tokenization

Tokenization is the process of breaking a text document into smaller parts, known as tokens.

2 types of tokenization:
1. sentence
2. word

**NLTK Methods**

In [96]:
paragraph = "The library is a vital resource for academic and personal growth. It provides access to a vast collection of books, journals, and other materials that support research and learning. In addition, libraries offer a quiet and conducive environment for studying and reflection. They also provide access to technology and other resources that enhance the learning experience. Libraries are staffed by knowledgeable professionals who can assist with research and provide guidance on how to use library resources effectively. As such, the library is an essential component of any educational institution and plays a critical role in promoting lifelong learning and intellectual development."

In [98]:
sent_tokenize(paragraph)

['The library is a vital resource for academic and personal growth.',
 'It provides access to a vast collection of books, journals, and other materials that support research and learning.',
 'In addition, libraries offer a quiet and conducive environment for studying and reflection.',
 'They also provide access to technology and other resources that enhance the learning experience.',
 'Libraries are staffed by knowledgeable professionals who can assist with research and provide guidance on how to use library resources effectively.',
 'As such, the library is an essential component of any educational institution and plays a critical role in promoting lifelong learning and intellectual development.']

In [99]:
sent_1 = 'I am going to Toronto, Canada!'
word_tokenize(sent_1)

['I', 'am', 'going', 'to', 'Toronto', ',', 'Canada', '!']
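To make the idea of word tokenization concrete, here is a rough regex stand-in. It is not NLTK's algorithm, only a sketch that treats runs of word characters as tokens and every other non-space character as its own token; the function name is hypothetical:

```python
import re

def simple_word_tokenize(text):
    # A run of word characters is one token;
    # any other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("I am going to Toronto, Canada!"))
```

On this sentence it yields the same tokens as `word_tokenize`, but it will differ on contractions, abbreviations and similar cases that NLTK handles with dedicated rules.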

**spaCy Methods**

In [101]:
nlp = spacy.load("en_core_web_sm")

tokenizer_spacy = lambda text : [token for token in nlp(text)]

In [102]:
tokenizer_spacy(sent_1)

[I, am, going, to, Toronto, ,, Canada, !]

**Trying Both Methods on Disaster Tweets Dataset**

In [103]:
disaster_tweet_df['text'].apply(word_tokenize)

0       [deeds, reason, earthquake, may, allah, forgiv...
1           [forest, fire, near, la, ronge, sask, canada]
2       [all, residents, asked, shelter, place, notifi...
3       [13000, people, receive, wildfires, evacuation...
4       [got, sent, photo, ruby, alaska, smoke, wildfi...
                              ...                        
7608    [two, giant, cranes, holding, bridge, collapse...
7609    [ariaahrary, thetawniest, control, wild, fires...
7610                [m194, 0104, utc5km, volcano, hawaii]
7611    [police, investigating, after, ebike, collided...
7612    [latest, homes, razed, northern, california, w...
Name: text, Length: 7613, dtype: object

In [104]:
disaster_tweet_df['text'].apply(tokenizer_spacy)

0       [ , deeds,   , reason,   , earthquake, may, al...
1           [forest, fire, near, la, ronge, sask, canada]
2       [all, residents, asked,  , shelter,  , place, ...
3       [13000, people, receive, wildfires, evacuation...
4       [ , got, sent,  , photo,  , ruby, alaska,  , s...
                              ...                        
7608    [two, giant, cranes, holding,  , bridge, colla...
7609    [ariaahrary, thetawniest,    , control, wild, ...
7610          [m194, 0104, utc5, km,   , volcano, hawaii]
7611    [police, investigating, after,  , ebike, colli...
7612    [ , latest,  , homes, razed,  , northern, cali...
Name: text, Length: 7613, dtype: object

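The `spacy` tokenizer is better, but slower. It also emits the extra spaces left behind by stopword removal as separate whitespace tokens, as the output above shows.

We're prioritizing *speed*, so we use `nltk` on both datasets.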

In [105]:
disaster_tweet_df['text'] = \
disaster_tweet_df['text'].apply(word_tokenize)

In [106]:
movies_review_df['review'] = \
movies_review_df['review'].apply(word_tokenize)

In [107]:
disaster_tweet_df.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,"[deeds, reason, earthquake, may, allah, forgiv...",1
1,4,,,"[forest, fire, near, la, ronge, sask, canada]",1
2,5,,,"[all, residents, asked, shelter, place, notifi...",1
3,6,,,"[13000, people, receive, wildfires, evacuation...",1
4,7,,,"[got, sent, photo, ruby, alaska, smoke, wildfi...",1


In [108]:
movies_review_df.head(5)

Unnamed: 0,review,sentiment
0,"[one, reviewers, mentioned, after, watching, 1...",positive
1,"[wonderful, little, production, filming, techn...",positive
2,"[thought, wonderful, way, spend, time, hot, su...",positive
3,"[basically, theres, family, little, boy, jake,...",negative
4,"[petter, matteis, love, time, money, visually,...",positive


In [109]:
disaster_tweet_df['text'][0]

['deeds', 'reason', 'earthquake', 'may', 'allah', 'forgive', 'us', 'all']

In [110]:
movies_review_df['review'][0]

['one',
 'reviewers',
 'mentioned',
 'after',
 'watching',
 '1',
 'oz',
 'episode',
 'youll',
 'hooked',
 'right',
 'exactly',
 'happened',
 'methe',
 'first',
 'thing',
 'struck',
 'about',
 'oz',
 'brutality',
 'unflinching',
 'scenes',
 'violence',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'not',
 'show',
 'faint',
 'hearted',
 'timid',
 'show',
 'pulls',
 'no',
 'punches',
 'regards',
 'drugs',
 'sex',
 'violence',
 'hardcore',
 'classic',
 'use',
 'wordit',
 'called',
 'oz',
 'nickname',
 'given',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'focuses',
 'mainly',
 'emerald',
 'city',
 'experimental',
 'section',
 'prison',
 'all',
 'cells',
 'glass',
 'fronts',
 'face',
 'inwards',
 'privacy',
 'not',
 'high',
 'agenda',
 'em',
 'city',
 'home',
 'manyaryans',
 'muslims',
 'gangstas',
 'latinos',
 'christians',
 'italians',
 'irish',
 'moreso',
 'scuffles',
 'death',
 'stares',
 'dodgy',
 'dealings',
 'shady',
 'agreements',
 'never',
 'far',
 'awayi',
 'would

# SECTION 12: Stemming and Lemmatization

**Stemming** reduces a word to its stem (root form), e.g. walking -> walk.

2 common stemmers:
1. Porter stemmer
2. Snowball stemmer
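Both stemmers work by stripping suffixes according to rules. The following toy sketch illustrates only that general idea. It is deliberately tiny, it is not the Porter or Snowball algorithm, and the suffix list and minimum stem length are arbitrary assumptions:

```python
# Toy suffix list, checked in order; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    # Strip the first matching suffix, but never shorten the word
    # below min_stem characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for word in ["walking", "deeds", "houses", "us"]:
    print(word, "->", toy_stem(word))
```

Even this toy version shows the typical side effect: `houses` becomes `hous`, a stem that is not a valid word.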

In [111]:
ps = PorterStemmer()

In [112]:
def stem_words(tokenized_text):
    return [ps.stem(word) for word in tokenized_text]

In [113]:
disaster_tweet_df['text'][0]

['deeds', 'reason', 'earthquake', 'may', 'allah', 'forgive', 'us', 'all']

In [114]:
stem_words(disaster_tweet_df['text'][0])

['deed', 'reason', 'earthquak', 'may', 'allah', 'forgiv', 'us', 'all']

In [115]:
stem_words(movies_review_df['review'][0])

['one',
 'review',
 'mention',
 'after',
 'watch',
 '1',
 'oz',
 'episod',
 'youll',
 'hook',
 'right',
 'exactli',
 'happen',
 'meth',
 'first',
 'thing',
 'struck',
 'about',
 'oz',
 'brutal',
 'unflinch',
 'scene',
 'violenc',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'not',
 'show',
 'faint',
 'heart',
 'timid',
 'show',
 'pull',
 'no',
 'punch',
 'regard',
 'drug',
 'sex',
 'violenc',
 'hardcor',
 'classic',
 'use',
 'wordit',
 'call',
 'oz',
 'nicknam',
 'given',
 'oswald',
 'maximum',
 'secur',
 'state',
 'penitentari',
 'focus',
 'mainli',
 'emerald',
 'citi',
 'experiment',
 'section',
 'prison',
 'all',
 'cell',
 'glass',
 'front',
 'face',
 'inward',
 'privaci',
 'not',
 'high',
 'agenda',
 'em',
 'citi',
 'home',
 'manyaryan',
 'muslim',
 'gangsta',
 'latino',
 'christian',
 'italian',
 'irish',
 'moreso',
 'scuffl',
 'death',
 'stare',
 'dodgi',
 'deal',
 'shadi',
 'agreement',
 'never',
 'far',
 'awayi',
 'would',
 'say',
 'main',
 'appeal',
 'show',
 'due',
 'fact',
 

Stemming is the process of reducing inflected words to their root forms, even if the stem itself is not a valid word in the language (e.g. `earthquak`, `forgiv`).

Hence, we apply **lemmatization**. The converted root word is a *lemma*, which is a valid dictionary form.

In [127]:
wordnet_lemmatizer = WordNetLemmatizer()

In [128]:
def lemmatize_words(tokenized_text):
    return [wordnet_lemmatizer.lemmatize(word) for word in tokenized_text]

In [129]:
lemmatize_words(disaster_tweet_df['text'][0])

['deed', 'reason', 'earthquak', 'may', 'allah', 'forgiv', 'u', 'all']

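Note that the cell numbers indicate that the stemming cells (In [125] and In [126]) ran before this one, so the input here was already stemmed; that is why stems such as `earthquak` appear in the output. `us` became `u` because `WordNetLemmatizer` treats words as nouns by default and strips the final `s`.

Lemmatization takes more time than stemming.

Our priority is speed, so we are using only stemming.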

In [125]:
disaster_tweet_df['text'] = \
disaster_tweet_df['text'].apply(stem_words)

In [126]:
movies_review_df['review'] = \
movies_review_df['review'].apply(stem_words)

In [130]:
disaster_tweet_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,"[deed, reason, earthquak, may, allah, forgiv, ...",1
1,4,,,"[forest, fire, near, la, rong, sask, canada]",1
2,5,,,"[all, resid, ask, shelter, place, notifi, offi...",1
3,6,,,"[13000, peopl, receiv, wildfir, evacu, order, ...",1
4,7,,,"[got, sent, photo, rubi, alaska, smoke, wildfi...",1


In [131]:
movies_review_df.head()

Unnamed: 0,review,sentiment
0,"[one, review, mention, after, watch, 1, oz, ep...",positive
1,"[wonder, littl, product, film, techniqu, unass...",positive
2,"[thought, wonder, way, spend, time, hot, summe...",positive
3,"[basic, there, famili, littl, boy, jake, think...",negative
4,"[petter, mattei, love, time, money, visual, st...",positive
