In [3]:
import pyprind
import pandas as pd
import os


In [4]:
basepath = "aclImdb" # to update after dataset download
labels = {"pos":1, "neg":0}
pbar  =pyprind.ProgBar(50000)
df = pd.DataFrame()
for s in ("test","train"):
    for l in ("pos","neg"):
        path = os.path.join(basepath,s,l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), "r", encoding="utf-8") as infile:
                txt = infile.read()
            df = pd.concat([df,pd.DataFrame([[txt, labels[l]]])], ignore_index=True) # df.append is now deprecated
            pbar.update()
df.columns= ["review", "sentiment"]







In [5]:
#reshuffling rows
import numpy as np

np.random.seed(42)
df = df.reindex(np.random.permutation(df.index))
df.to_csv("movie_data.csv", index=False, encoding='utf-8')

In [6]:
#file read check

df = pd.read_csv("movie_data.csv",encoding="utf-8")


In [7]:
df["review"][0]

"I can't add an awful lot to the positive reviews already on here - great acting, balanced writing, multi-faceted characters, a great anti-hero in Tony, great commentary on millennial American life. The integral use of psychiatry coupled with Tony's mother issues are especially fresh and humorous. Several other characters add a lot of depth - Hesh's interesting history as an outsider muscling in, Ralphie's total irredeemable viciousness, Chris' dual desires in life, and so on.<br /><br />I have to dig into some of the criticisms however, especially the 'it glorifies violence/belittles Italian-Americans' one.Most of the writers and actors are Italian-American, would they attack themselves? There are several positive Italian-American characters - Artie Bucco the chef, Dr. Melfi and her family and the Cusamanos next door to the Sopranos. Indeed, Dr Melfi's ex-husband notes in season 1 that only a tiny minority of Italian-Americans have ever had Mob connections (certainly smaller than the 

Bag of words model: 
To work with ML algorithms, we need numbers. so, the words need to be converted to numbers to train our models and achieve the required objectives. \
words -> numerical feature vectors -> ML model -> output numerical results \
In BOW,  a vocabulary is created which stores the list of unique tokens from the entire set of documents. Then parsing through each document, a feature vector is obtained for words present in it taking vocab as reference



In [8]:
# to construct bag of words model based on count of words appearing in document, CountVectorizer library in sklearn can be used.
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
# a sample doc
docs = np.array(["the sun is shining", "the water is sweet", "The clouds are beautiful and the sun is shining"])
bag = count.fit_transform(docs)

In [9]:
print(count.vocabulary_)

{'the': 8, 'sun': 6, 'is': 4, 'shining': 5, 'water': 9, 'sweet': 7, 'clouds': 3, 'are': 1, 'beautiful': 2, 'and': 0}


In [10]:
# our feature vector for three phrases in docs
print(bag.toarray())

[[0 0 0 0 1 1 1 0 1 0]
 [0 0 0 0 1 0 0 1 1 1]
 [1 1 1 1 1 1 1 0 2 0]]


The array basically gives the count of the token at index i.for example , and has position 0 in the vocab. thus for first two sentences, in indices 0, we have no and, and the value is 0. in third sentence, we have a and and thus we have 1 there.'


The values in the feature vector are called raw term frequencies tf (t,d), number of times term t occurs in document d.

The sequence of items created in the vocabulary is basically a 1 gram model. where we assign a number to a word in the dataset. WE could also have n-gram model where we have a n words together assigned to a number. 
For example 2 gram model could have {"the sun", "sun is", "is shining"}. the n could depend on the application , for spam filtering, n=3,4 works well. for other tasks, we can try with different parameters.

setting n is trivial in sklearn. n_gram_range = (2,2) - for 2 gram model

Word relevancy : Term frequency inverse document frequency
Some words occur frequently across multiple documents and thus do not contain useful information. it is better to reduce the influence of these words while making decisions or constructing feature vectors with these words. one way of doing it is using tf - idf (term frequence - inverse document frequency)
tf-idf(t,d) = tf(t,d) * idf(t,d)
idf(t,d) = $log{\frac{n_d}{1+df(d,t)}}$

$n_d$ denotes the number of documents and df is the number of documents in which the term is found. the +1 in denom is to prevent division by zero error in case. and logarithm makes sure idf does not blow up to high value for lower df

In [11]:
#sklearn tfidf transformer
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.   0.   0.   0.43 0.56 0.56 0.   0.43 0.  ]
 [0.   0.   0.   0.   0.36 0.   0.   0.61 0.36 0.61]
 [0.38 0.38 0.38 0.38 0.22 0.29 0.29 0.   0.45 0.  ]]


sklearn computes tf-idf in a slightly different way. \
idf(t,d) = $\log{\frac{1+n_d}{1+df(d,t)}}$ \
tf-idf(t,d) = tf(t,d) * (idf(t,d)+1) \
Though raw frequencies may be normalized before passing tfidf transformer, tf-idf transformer normalizes the frequencies while computing tf-idf . By default, l2 normalization is used.

Cleaning text data:
The first important step before passing our text into transformers and model is to clean the data and strip it off unwanted elements.

In [12]:
df.loc[1,'review'][-50:]

's a dream.<br /><br />Seriously interesting stuff.'

we see that we lot of html tags in the text and this gives no useful information for opinion mining and is in fact a baggage in already expensive nlp. Usually , punctuations can be very helpful in nlp related tasks like sentiment analysis etc. here we are stripping punctuation marks as well. \
Python's regex library can be used to search and remove certain characters from the code.

In [13]:
import re
def preprocess_text(text):
    text = re.sub('<[^>]*>','',text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                       text)
    text= re.sub('[\W]+',' ',text.lower()) + ' '.join(emoticons).replace('-','')
    return text

Adding emoticons at the end after remaining text may not seem right, but for our application of movie review, we ll be using bag of words model and the order does not matter -> sentiment analysis requires the weights associated with individual words and order does not matter much. it is important in sentence completion etc


In [14]:
preprocess_text(df.loc[1,'review'][-50:])

's a dream seriously interesting stuff '

In [15]:
# applying across all reviews
df["review"] = df["review"].apply(preprocess_text)


In [16]:
# before getting into next topic, we ll see regex
# re.search(pat, str) -> search for pat in str and return if found , else return None
str = 'an example word:cat!!'
match  = re.search(r'word:\w\w\w',str) # \w matches a word character a-zA-Z0-9_, W catches ny 
print(match)

<re.Match object; span=(11, 19), match='word:cat'>


In [17]:
match1 = re.search(r'iii', 'piiig')
match2 = re.search(r'igs', 'piiig') # not found, match == None
print(match1, match2)

<re.Match object; span=(1, 4), match='iii'> None


In [18]:
# . meaning match any character except \n
match = re.search(r'..g', 'piiig')
print(match)

<re.Match object; span=(2, 5), match='iig'>


In [19]:
match1 = re.search(r'\d\d\d', 'p123g')
print(match1.group())
match2 = re.search(r'\w\w\w', '@@abcd!')
print(match2.group())
match3 = re.search(r'\W\w\w', '@@abcd!')
print(match3.group())

123
abc
@ab


In [20]:
#repetion 
#+ -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
#* -- 0 or more occurrences of the pattern to its left
#? -- match 0 or 1 occurrences of the pattern to its left
#leftmost & largest match is chosen

match = re.search(r'pi+', 'piigs')
print(match.group())
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
print(match)



pii
<re.Match object; span=(1, 3), match='ii'>


In [21]:
# * -> meaning, pattern to its left can be matched 0 or more times, even zero is allowed


match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"

In [29]:
match = re.search(r'^b\w+', 'foobar') # expected : no match. ^ denotes start of string, foobar does not begin with b
print(None if match == None else match.group())
match = re.search(r'b\w+', 'foobar') # expected bar
print(None if match == None else match.group())
match = re.search(r'^b\w+', 'barley') # expected barley because barley does begin with b
print(None if match == None else match.group())

None
bar
barley


In [30]:
# email 
str = 'purple alice-b@google.com monkey dishwasher'
#simple approach
match = re.search(r'\w+@\w+',str)
print(None if match == None else match.group())

b@google


does not produce full email id, because -, . are not included in \w. \
A set of characters to search for can be included in a square bracket. this can solve a lot of problems

In [32]:
match = re.search(r'[\w.-]+@[\w.-]+',str) # this included . and - in the character search
print(None if match == None else match.group())

alice-b@google.com


Some square bracket rules:
1. to include a range of characters, put a dash in between. for example [a-z]
2. to include dash as a character, put it in the end.for example [ab-]
3. ^ in the beginning will excluded characters. for example [^ab] meaning chars except a and b


Group extraction
The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match, instead they establish logical "groups" inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parentheses, and match.group(2) is the text corresponding to the 2nd left parentheses. The plain match.group() is still the whole match text as usual.

In [33]:
# group extraction
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)',str)
if match:
    print(match.group()) # the full match
    print(match.group(1)) # match group 1
    print(match.group(2)) # match group 2



alice-b@google.com
alice-b
google.com


In [38]:
# findall() just like search but finds all matches within the string

str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
emails = re.findall(r'([\w.-]+)@([\w.-]+)',str)
for email in emails:
    print(email)

('alice', 'google.com')
('bob', 'abc.com')


In [94]:
f  = open(r'/Users/divyeshkanagavel/Desktop/DSA_cpp/programming/c_programs/trim.c','r',encoding='utf-8')
text = f.read()
print(text[:50])

#include <stdio.h>
#include <string.h>
int trim(ch


In [102]:
# get the includes from the c file

match = re.findall(r'#\w+\s[\w.<>]+',text)
print(match)


['#include <stdio.h>', '#include <string.h>']


Greedy vs non-greedy :
By default , regex searches are greedy. suppose you have a text with tags, say html code \
<b>foo</b> or <i>hi</i>. if we give something like <.*> will get entire line <b>foo</b>. to get just <b>, need to enable non-greedy option. this can be done with a question mark. <.*?>

In [108]:
text= "<b>foo</b>"
match =re.search(r'<.*>', text)
print(match.group())

<b>foo</b>


In [109]:
text= "<b>foo</b>"
match =re.search(r'<.*?>', text)
print(match.group())

<b>


Substitution:

The re.sub(pat, replacement, str) function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include '\1', '\2' which refer to the text from group(1), group(2), and so on from the original matching text.





In [114]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
print(re.findall(r'([\w\.-]+)@([\w\.-]+)', str))

print(re.sub(r'([\w\.-]+)@([\w\.-]+)',r'\1@yahoo.com',str))

[('alice', 'google.com'), ('bob', 'abc.com')]
purple alice@yahoo.com, blah monkey bob@yahoo.com blah dishwasher


TODO : babyname Exercises in Google developer Regex tutorials

In [115]:
# back to NLP

#processing documents into tokens

#simple tokenizer : split text at its whitespace

def tokenizer(text):
    return text.split()
#example
tokenizer("I like coding, weight lifting and playing chess!")



['I', 'like', 'coding,', 'weight', 'lifting', 'and', 'playing', 'chess!']

Another useful technique in the context of tokenization is word stemming. that is transforming a word into its root form. It allows us to map words which have same roots to the root. Porter stemmer algorithm is one of the first known stemming algorithm. NLTK library has it implemented in python


In [116]:
import nltk

In [117]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

tokenizer_porter("I like coding, weight lifting and playing chess!")

['i', 'like', 'coding,', 'weight', 'lift', 'and', 'play', 'chess!']

Lancaster and snowball stemmer are advanced stemming algorithms which build on top of stemming algorithm and they are found in the library. Stemming can sometimes result in non-English words like thu . lemmatization is the process of having grammatically correct words in the dataset -> this is computationally expensive and usually both lemmatization and stemming can have similar performances.

Stop word removal : Stop words are those words which are extremely common and appear in many documents and thus do not help much in the classification of documents or sentiment analysis and hence can be removed from the dataset.

In [118]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/divyeshkanagavel/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [119]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter("I like coding, weight lifting and playing chess!") if w not in stop]


['like', 'coding,', 'weight', 'lift', 'play', 'chess!']

Cool! once we have the tokenizer, we can go ahead and train a model on this dataset and see if we can analyze the sentiments successfully

In [120]:
X_train = df.loc[:25000,'review'].values
y_train = df.loc[:25000,'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:,'sentiment'].values

In [121]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

In [126]:
tfidf = TfidfVectorizer(strip_accents=None,lowercase=False,preprocessor=None)
param_grid = [{'vect__ngram_range':[(1,1)],'vect__stop_words':[stop,None], 'vect__tokenizer':[tokenizer, tokenizer_porter],'clf__penalty':['l1','l2'],'clf__C':[1.0,10.0,100.0]},
              {'vect__ngram_range':[(1,1)], 'vect__stop_words':[stop,None],'vect__tokenizer':[tokenizer, tokenizer_porter], 'vect__use_idf':[False], 'vect__norm':[None], 'clf__penalty':['l1','l2'], 'clf__C':[1.0,10.0,100.0]}]


In [127]:
lr_tfidf = Pipeline([('vect', tfidf),('clf',LogisticRegression(random_state=0))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy',cv=5,verbose=1,n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)


Fitting 5 folds for each of 48 candidates, totalling 240 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

In [None]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)

Another popular classifier for movie review and nlp classification documents is naive Bayes classifier. it is simple to implement, computationally efficient and works well for small datasets.

Out of core learning : Doing grid search with a bunch of parameters on a dataset of 50000 reviews is quite expensive, took a lot of time to run. the real world datasets are even bigger and sometimes don't even fit in device memory. So, we do something called out of core learning, where gradients are updated after processing a batch of data instead of updating it one sample at a time

we will make use of the partial_fit function of the SGDClassifier in scikit-learn to stream the documents directly from our local drive, and train a logistic regression model using small mini-batches of documents.

In [128]:
import numpy as np
import re
from nltk.corpus import stopwords
def tokenizer(text):

    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                       text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) \
       + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized



In [146]:
def stream_docs(path):
    with open(path,'r',encoding='utf-8') as csv:
        next(csv) # skip the header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [147]:
next(stream_docs(path='movie_data.csv'))

('"I can\'t add an awful lot to the positive reviews already on here - great acting, balanced writing, multi-faceted characters, a great anti-hero in Tony, great commentary on millennial American life. The integral use of psychiatry coupled with Tony\'s mother issues are especially fresh and humorous. Several other characters add a lot of depth - Hesh\'s interesting history as an outsider muscling in, Ralphie\'s total irredeemable viciousness, Chris\' dual desires in life, and so on.<br /><br />I have to dig into some of the criticisms however, especially the \'it glorifies violence/belittles Italian-Americans\' one.Most of the writers and actors are Italian-American, would they attack themselves? There are several positive Italian-American characters - Artie Bucco the chef, Dr. Melfi and her family and the Cusamanos next door to the Sopranos. Indeed, Dr Melfi\'s ex-husband notes in season 1 that only a tiny minority of Italian-Americans have ever had Mob connections (certainly smaller

In [148]:
# minibatch processing
def minibatch(docs_stream, size):
    docs, y = [],[]
    try:
        for _ in range(size):
            text,label = next(docs_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

count vectorizer needs the entire vocabulary of dataset in memory and tfidf vectorizer needs access to all documents to compute idf frequencies. there is another vectorizer called the hash vectorizer which can work with out of core learning. Hash vectorizer is data independent, it takes the string and based on the string assigns a feature vector using a hashing function. having higher feature vector can minimize hash collisions


In [149]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error="ignore",n_features = 2**21, preprocessor=None,tokenizer=tokenizer)
clf = SGDClassifier(loss='log_loss', random_state=1,max_iter=1)
doc_stream = stream_docs(path='movie_data.csv')







In [151]:
import pyprind

pbar = pyprind.ProgBar(45)
classes = np.array([0,1])

for _ in range(45):
    X_train, y_train = minibatch(doc_stream, size=1000)
    
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()






In [152]:
X_test, y_test = minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print("Accuracy score = %.2f" % clf.score(X_test, y_test))

Accuracy score = 0.86


In [153]:
clf = clf.partial_fit(X_test, y_test)

fit starts from random parameters and after learning from the entire dataset, we get the correct weights. \
partial_fit updates the parameters without clearing or starting from scratch. the weights are merely updated based on gradients from new data.

Modern alternative to bag of words is word2vec, which is an unsupervised neural network algorithm which is trained on corpus of data to learn representations between words. the feature vectors are calculated such that similarity in words can be easily mapped to vector transformations like addition, scaling etc.

Topic modelling with LDA : Latent Dirichlet Allocation -> not to be confused with Linear Discriminant Analysis (another supervised learning algorithm which is used in dimensionality reduction). \
Latent Dirichlet Allocation is a unsupervised learning algorithm which is used to assign topics to unlabelled documents

LDA : generative probabilistic model that tries to find groups of words that appear frequently together across different documents.
Input to LDA : bag of words model [a matrix] \
LDA decomposes this matrix into two matrices : document to topic matrix, a word to topic matrix
these two can be multiplied together to get back the original bag of words matrix with very little loss of error.





In [154]:
import pandas as pd
df = pd.read_csv("movie_data.csv", encoding='utf-8')




In [155]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words = 'english', max_df=.1, max_features = 5000) #max_df - maximum document frequency - if a word occurs in  more than 10 percent of the documents , omit it pls

X = count.fit_transform(df['review'].values) 



In [157]:
from sklearn.decomposition import LatentDirichletAllocation 
lda = LatentDirichletAllocation(n_components = 10 , random_state = 123,learning_method='batch')
X_topics = lda.fit_transform(X)
