This Notebook contains Natural processing language Machine learning topics and their practical Implementation.

Topics covered are:

1. Tokenization
2. Stop words
3. Stemming
4. Lemmatization
5. Bag of Words
6. Term frequency & Inverse document Frequency (TF-IDF)
7. Word2vec
8. AvgWord2vec

from 5-8 are different technique to convert words to vectors.

In [1]:
## Importing all Libraries
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm
import gensim
import re
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [4]:
dataframe = pd.read_csv('/content/IMDB.csv', sep = ',', engine='python',encoding='utf-8', error_bad_lines=True)



  dataframe = pd.read_csv('/content/IMDB.csv', sep = ',', engine='python',encoding='utf-8', error_bad_lines=True)


In [5]:
dataframe

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [6]:
dataframe.columns = dataframe.columns.str.lower()

In [7]:
dataframe.columns

Index(['review', 'sentiment'], dtype='object')

In [8]:
dataframe.shape

(50000, 2)

In [9]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [10]:
ps = PorterStemmer()
wnl = WordNetLemmatizer()

In [11]:
'''
we are looking for html tages, punctuations, URL's
'''

def clean_html(sentence):
    pattern = re.compile('<.*?>')
    cleaned_text = re.sub(pattern,' ',sentence)
    return cleaned_text

print("Removing Html")
print('After Removing HTML tags:',clean_html('This is a demo test text!<>'))
print('\n')

#Function to keep only words containing letters A-Z and a-z.
#this will remove all punctuations, special characters.
def rem_pun(sentence):
    cleaned_text  = re.sub('[^a-zA-Z]',' ',sentence)
    return (cleaned_text)

print("Removing Punctuations")
print("After Removing Punctuations:",rem_pun("fsd*?~,,,( sdfsdfdsvv)#"))
print("\n")

#Remove URL from sentences.
def rem_url(sen):
    txt = re.sub(r"http\S+", "", sen)
    sen = re.sub(r"www.\S+", "", txt)
    return (sen)

print("Removing URL")
print("After Removing URL:",rem_url("https://colab.research.google.com/drive/1dG8sy949kwnxsOX6BN4Dkime6JdVjGqL#scrollTo=_0_gNhnK6TRY notice the URL is removed"))
print("\n")


Removing Html
After Removing HTML tags: This is a demo test text! 


Removing Punctuations
After Removing Punctuations: fsd        sdfsdfdsvv  


Removing URL
After Removing URL:  notice the URL is removed




In [12]:
clean_html(dataframe['review'][0])

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.  The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.  It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.  I would say the main appeal of the show is due to the fact that it goes where other sh

In [13]:
"""
data cleaning removing the special char's, cleaning the hmtl tages, removing Punctuations and url's.
once the cleaning is done we will apply Lemmatization {as we doing sentimental analysis each word is important so we are using lemmatization,
stemming would created meaningless words. Lemmatization will take some time.
"""

corpus = []


for i in tqdm(range(0, len(dataframe)), desc='Processing'):
  review = clean_html(dataframe['review'][i])
  review = rem_pun(review)
  review = rem_url(review)
  review = review.lower()
  review = review.split()

  review = [wnl.lemmatize(word) for word in review if not word in stopwords.words('english')]

  review = ' '.join(review)
  corpus.append(review)

Processing: 100%|██████████| 50000/50000 [24:30<00:00, 34.00it/s]


In [14]:
dataframe.columns

Index(['review', 'sentiment'], dtype='object')

In [15]:
dataframe.review[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [16]:
corpus[0]

'one reviewer mentioned watching oz episode hooked right exactly happened first thing struck oz brutality unflinching scene violence set right word go trust show faint hearted timid show pull punch regard drug sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focus mainly emerald city experimental section prison cell glass front face inwards privacy high agenda em city home many aryan muslim gangsta latino christian italian irish scuffle death stare dodgy dealing shady agreement never far away would say main appeal show due fact go show dare forget pretty picture painted mainstream audience forget charm forget romance oz mess around first episode ever saw struck nasty surreal say ready watched developed taste oz got accustomed high level graphic violence violence injustice crooked guard sold nickel inmate kill order get away well mannered middle class inmate turned prison bitch due lack street skill prison experience watching oz m

In [17]:
del dataframe['review']

In [18]:
dataframe.shape

(50000, 1)

In [19]:
dataframe['review'] = corpus

In [20]:
dataframe.shape

(50000, 2)

In [21]:
X = dataframe['review']
X

0        one reviewer mentioned watching oz episode hoo...
1        wonderful little production filming technique ...
2        thought wonderful way spend time hot summer we...
3        basically family little boy jake think zombie ...
4        petter mattei love time money visually stunnin...
                               ...                        
49995    thought movie right good job creative original...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    catholic taught parochial elementary school nu...
49998    going disagree previous comment side maltin on...
49999    one expects star trek movie high art fan expec...
Name: review, Length: 50000, dtype: object

In [22]:
dataframe.sentiment.value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [23]:
sentiment_mapping = {'positive': 1, 'negative': 0}

In [24]:
y = dataframe['sentiment'].map(sentiment_mapping)

In [25]:
y.value_counts()

1    25000
0    25000
Name: sentiment, dtype: int64

In [26]:
#Train Test Model

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=42)

In [27]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((40000,), (10000,), (40000,), (10000,))

In [28]:
#creating an bag of words model
#max_features maximum top high frequency terms, binary {0,1}
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500,binary=True)
X = cv.fit_transform(X_train).toarray()

In [29]:
X.shape, y_train.shape

((40000, 2500), (40000,))

In [30]:
from sklearn.ensemble import RandomForestClassifier

In [31]:
rf = RandomForestClassifier(n_estimators=10, random_state=42)
rf.fit(X,y_train)

In [32]:
X_text = cv.transform(X_test)

In [33]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score,classification_report

In [34]:
y_pred = rf.predict(X_text)

In [35]:
y_pred

array([0, 1, 0, ..., 0, 0, 1])

In [36]:
accuracy_score(y_test,y_pred)

0.7725

In [37]:
# I'm taking min_sample_leaf = 1, d =10 and n_estimators =75

rf_opt = RandomForestClassifier(n_estimators=75,max_depth=10,min_samples_leaf=1, random_state=42)
rf_opt.fit(X,y_train)
y_pred = rf_opt.predict(X_text)
auc = accuracy_score(y_test,y_pred)

In [38]:
auc

0.8191

Till Now we used the bag of words technique to convert the text to vector. Let us use the TFIDF technique to convert the text to vectors. for this we are going to use TfidfVectorizer from sklearn.

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [46]:
#creating an bag of words model
#max_features maximum top high frequency terms, binary {0,1}
tv = TfidfVectorizer(max_features=2500,binary=True)
X = tv.fit_transform(X_train).toarray()

In [47]:
X.shape, y_train.shape

((40000, 2500), (40000,))

In [48]:
from sklearn.ensemble import RandomForestClassifier

In [49]:
rf = RandomForestClassifier(n_estimators=10, random_state=42)
rf.fit(X,y_train)

In [50]:
X_text = tv.transform(X_test)

In [51]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score,classification_report

In [52]:
y_pred = rf.predict(X_text)

In [53]:
y_pred

array([0, 0, 0, ..., 1, 0, 1])

In [54]:
accuracy_score(y_test,y_pred)

0.7784

In [55]:
# I'm taking min_sample_leaf = 1, d =10 and n_estimators =75

rf_opt = RandomForestClassifier(n_estimators=75,max_depth=10,min_samples_leaf=1, random_state=42)
rf_opt.fit(X,y_train)
y_pred = rf_opt.predict(X_text)
auc = accuracy_score(y_test,y_pred)
auc

0.8223

Till Now we used the bag of words and TFIDF technique to convert the text to vector. Let us use the word2vec and Avgword2vec techniques.we need to use the gensim library for this technique.

In [57]:
import gensim
from nltk import sent_tokenize
from gensim.utils import simple_preprocess
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [58]:
# Need to convert the sentences to words from corpus and these words are fed to the word2vec model

words=[]
for sent in corpus:
    sent_token=sent_tokenize(sent)
    for sent in sent_token:
        words.append(simple_preprocess(sent))

In [59]:
words[0]

['one',
 'reviewer',
 'mentioned',
 'watching',
 'oz',
 'episode',
 'hooked',
 'right',
 'exactly',
 'happened',
 'first',
 'thing',
 'struck',
 'oz',
 'brutality',
 'unflinching',
 'scene',
 'violence',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'show',
 'faint',
 'hearted',
 'timid',
 'show',
 'pull',
 'punch',
 'regard',
 'drug',
 'sex',
 'violence',
 'hardcore',
 'classic',
 'use',
 'word',
 'called',
 'oz',
 'nickname',
 'given',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'focus',
 'mainly',
 'emerald',
 'city',
 'experimental',
 'section',
 'prison',
 'cell',
 'glass',
 'front',
 'face',
 'inwards',
 'privacy',
 'high',
 'agenda',
 'em',
 'city',
 'home',
 'many',
 'aryan',
 'muslim',
 'gangsta',
 'latino',
 'christian',
 'italian',
 'irish',
 'scuffle',
 'death',
 'stare',
 'dodgy',
 'dealing',
 'shady',
 'agreement',
 'never',
 'far',
 'away',
 'would',
 'say',
 'main',
 'appeal',
 'show',
 'due',
 'fact',
 'go',
 'show',
 'dare',
 'forget',
 'pretty',
 'p

In [60]:
## Lets train Word2vec from scratch
model=gensim.models.Word2Vec(words)

In [67]:
## To Get All the Vocabulary
len(model.wv.index_to_key)

34673

In [63]:
model.corpus_count

50000

In [64]:
model.epochs

5

In [65]:
model.wv.similar_by_word('good')

[('decent', 0.7728003263473511),
 ('great', 0.7113469839096069),
 ('bad', 0.6892054677009583),
 ('okay', 0.6558389067649841),
 ('alright', 0.6532899737358093),
 ('ok', 0.6405544877052307),
 ('fine', 0.6331086754798889),
 ('nice', 0.6047989130020142),
 ('average', 0.5819214582443237),
 ('excellent', 0.5738447904586792)]

In [66]:
model.vector_size

100

In [61]:
model.wv['kid'].shape

(100,)

In [68]:
def avgword2vec(doc):
  return np.mean([model.wv[word] for word in doc if word in model.wv.index_to_key],axis=0)

In [69]:
X = []

for i in tqdm(range(len(words))):
  X.append(avgword2vec(words[i]))

100%|██████████| 50000/50000 [08:14<00:00, 101.09it/s]


In [70]:
type(X)

list

In [71]:
X_new = np.array(X)

In [73]:
X_new.shape

(50000, 100)

In [76]:
words[0]

['one',
 'reviewer',
 'mentioned',
 'watching',
 'oz',
 'episode',
 'hooked',
 'right',
 'exactly',
 'happened',
 'first',
 'thing',
 'struck',
 'oz',
 'brutality',
 'unflinching',
 'scene',
 'violence',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'show',
 'faint',
 'hearted',
 'timid',
 'show',
 'pull',
 'punch',
 'regard',
 'drug',
 'sex',
 'violence',
 'hardcore',
 'classic',
 'use',
 'word',
 'called',
 'oz',
 'nickname',
 'given',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'focus',
 'mainly',
 'emerald',
 'city',
 'experimental',
 'section',
 'prison',
 'cell',
 'glass',
 'front',
 'face',
 'inwards',
 'privacy',
 'high',
 'agenda',
 'em',
 'city',
 'home',
 'many',
 'aryan',
 'muslim',
 'gangsta',
 'latino',
 'christian',
 'italian',
 'irish',
 'scuffle',
 'death',
 'stare',
 'dodgy',
 'dealing',
 'shady',
 'agreement',
 'never',
 'far',
 'away',
 'would',
 'say',
 'main',
 'appeal',
 'show',
 'due',
 'fact',
 'go',
 'show',
 'dare',
 'forget',
 'pretty',
 'p

In [74]:
X_new[0]

array([-0.20281795,  0.09490767,  0.16579908, -0.38300735, -0.2785943 ,
       -0.01875068,  0.15630338,  0.24195954, -0.22736458, -0.05327965,
       -0.07614865, -0.15919502, -0.22948802,  0.00376907, -0.03978518,
       -0.04461985,  0.2978749 , -0.5216743 ,  0.06048151, -0.32240334,
       -0.05007555, -0.02201805, -0.3306045 , -0.11483   ,  0.05856982,
        0.22730121, -0.45548272,  0.37531325,  0.12268218,  0.19204143,
        0.46177512, -0.32101327, -0.03476228, -0.36342102, -0.11838903,
        0.6732902 , -0.15151656,  0.38450056, -0.3096272 ,  0.19246295,
       -0.10423347, -0.2592303 , -0.21944952,  0.11816273,  0.01707923,
       -0.47271085,  0.44593334, -0.11325435, -0.02860805,  0.41289496,
        0.47416526,  0.23392978, -0.1285916 ,  0.16633704, -0.21178779,
       -0.1943526 , -0.14618754,  0.01430091, -0.10621922,  0.03536858,
       -0.36379823, -0.0089318 ,  0.53694797,  0.11331793, -0.14090703,
       -0.16734254,  0.13348341, -0.0075645 , -0.27368718,  0.26

In [77]:
df = pd.DataFrame(X_new, columns=[f'dim_{i}' for i in range(100)])

In [78]:
df

Unnamed: 0,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6,dim_7,dim_8,dim_9,...,dim_90,dim_91,dim_92,dim_93,dim_94,dim_95,dim_96,dim_97,dim_98,dim_99
0,-0.202818,0.094908,0.165799,-0.383007,-0.278594,-0.018751,0.156303,0.241960,-0.227365,-0.053280,...,0.349301,0.244401,-0.045721,-0.040536,0.648144,0.136739,0.020938,0.145907,0.137537,-0.262799
1,-0.321950,0.314521,0.094887,0.002823,-0.222583,0.002843,0.313749,0.389165,-0.051099,0.037231,...,-0.129954,0.107908,-0.289629,-0.152867,0.040166,-0.033017,-0.475925,0.392240,0.247260,-0.460642
2,-0.302849,0.025770,-0.038214,-0.014555,-0.149666,0.146332,0.166173,-0.016083,-0.244363,-0.100021,...,0.220555,0.420064,-0.219340,0.142388,0.291076,0.023317,-0.394996,0.029069,0.151342,-0.201608
3,-0.026140,0.301957,-0.003335,-0.299396,-0.236882,0.018884,0.133509,0.114151,-0.082147,-0.326713,...,0.273587,0.336693,-0.175260,-0.543649,0.417571,-0.081170,-0.418492,0.335728,0.309550,-0.517151
4,-0.334754,0.114438,-0.122220,-0.232564,-0.412956,0.129136,0.066186,0.136826,-0.131628,-0.200468,...,0.203894,0.398909,-0.000891,-0.160250,0.225601,0.032715,-0.069757,0.277453,0.030005,-0.263259
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,-0.584474,0.156636,-0.127392,-0.274071,-0.345282,0.500408,0.090970,-0.212377,-0.249057,-0.279228,...,0.396790,0.761188,-0.600996,-0.001030,0.562303,-0.005669,-0.721671,0.097547,0.600127,-0.702075
49996,-0.305474,0.226246,-0.070455,-0.614518,-0.192302,0.262522,0.413970,0.016805,-0.393052,-0.079298,...,0.485067,0.488733,-0.232362,-0.330080,0.534543,0.190037,-0.559639,0.156327,0.267636,-0.739673
49997,-0.176239,0.116982,0.151695,-0.154848,-0.085060,0.115188,0.279364,0.034404,-0.224587,0.026550,...,0.453435,0.408228,-0.108760,-0.120499,0.257251,-0.069690,0.041909,-0.125069,0.210620,-0.271808
49998,-0.144867,0.203766,0.045839,-0.090647,-0.176818,-0.149622,0.099642,0.349280,-0.048045,-0.007247,...,0.078259,0.240626,0.140965,-0.045758,0.331444,0.120782,-0.108691,0.087732,0.162077,-0.368241


In [80]:
y.value_counts()

1    25000
0    25000
Name: sentiment, dtype: int64

In [84]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42)

In [85]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((40000, 100), (10000, 100), (40000,), (10000,))

In [86]:
# I'm taking min_sample_leaf = 1, d =10 and n_estimators =75

rf_opt = RandomForestClassifier(n_estimators=75,max_depth=10,min_samples_leaf=1, random_state=42)
rf_opt.fit(X_train,y_train)
y_pred = rf_opt.predict(X_test)
auc = accuracy_score(y_test,y_pred)
auc

0.8258