### Introduction
___

- IMDB movie reviews dataset
- http://ai.stanford.edu/~amaas/data/sentiment
- Contains 25000 positive and 25000 negative reviews
- Contains at most 30 reviews per movie
- At least 7 stars out of 10 $\rightarrow$ positive (label = 1)
- At most 4 stars out of 10 $\rightarrow$ negative (label = 0)
- 50/50 train/test split
- Evaluation metric: accuracy

<b>Features: bag of 1-grams with TF-IDF values</b>:
- Extremely sparse feature matrix - close to 97% are zeros

 <b>Model: Logistic regression</b>
- $p(y = 1|x) = \sigma(w^{T}x)$
- Linear classification model
- Can handle sparse data
- Fast to train
- Weights can be interpreted
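
As a minimal sketch of the model formula above, the snippet below evaluates $\sigma(w^{T}x)$ for a single feature vector in plain Python. The function names `sigmoid` and `predict_proba`, and the example weights and features, are made up for illustration; they are not the weights learned later in this notebook.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function: avoids overflow for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, features):
    # p(y = 1 | x) = sigmoid(w^T x), with w^T x computed as a dot product
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for three TF-IDF features (illustrative values only)
example_weights = [2.0, -1.5, 0.0]
example_features = [0.5, 0.2, 0.8]

p_positive = predict_proba(example_weights, example_features)
print(p_positive)  # greater than 0.5, so this example would be labelled positive
```

A positive weight pushes the prediction towards label 1 and a negative weight towards label 0, which is why the weights can be read as per-word evidence for or against a positive review.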

In [1]:
#we will be using the IMDB movie reviews dataset
import pandas as pd

df=pd.read_csv('data/movie_data.csv')
df.head(5)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [2]:
#view the full text of the first review
df['review'][0]

'In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />"Murder in Greenwich" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich famil

In [3]:
#Bag of words/Bag of N-grams model
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

#initialize the count vectorizer
count=CountVectorizer()

docs=np.array(['The sun is shining',
                'The weather is sweet',
                'The sun is shining, the weather is sweet, and one and one is two'])
bag=count.fit_transform(docs)
#fit_transform learns the vocabulary and converts the three sentences into a sparse matrix of word counts (bag of words)

In [5]:
print(count.vocabulary_)
#vocabulary_ is a dictionary that maps each word to its column index in the count matrix (the values are indexes, not counts)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [6]:
print(bag.toarray())
#each row is one sentence; each column is the word whose index is given in count.vocabulary_
#the values are raw term frequencies (tf): the number of times a word occurs in a document

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
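
To make the count matrix above less of a black box, here is a small sketch that rebuilds it by hand with the standard library. It assumes the default `CountVectorizer` behaviour (lowercasing, tokens of two or more word characters, vocabulary sorted alphabetically); the names `tokenize`, `count_vector`, and `bag_by_hand` are introduced only for this illustration.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

def tokenize(text):
    # Lowercase the text and keep tokens made of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# Map each distinct word to a column index, in alphabetical order
all_words = sorted({word for doc in docs for word in tokenize(doc)})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

def count_vector(text):
    # Count how often each vocabulary word occurs in one document
    counts = Counter(tokenize(text))
    row = [0] * len(vocabulary)
    for word, n in counts.items():
        row[vocabulary[word]] = n
    return row

bag_by_hand = [count_vector(doc) for doc in docs]
for row in bag_by_hand:
    print(row)
```

Each row should match the corresponding row of `bag.toarray()`, and `vocabulary` should match `count.vocabulary_`.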


Raw term frequencies: *tf(t, d)* is the number of times a term *t* occurs in a document *d*.

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

$$\text{idf}(t,d) = \text{log}\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and df(d, t) is the number of documents d that contain the term t.

In [12]:
#idf is the inverse document frequency
#the log is taken so that terms with low document frequencies are not given too much weight
np.set_printoptions(precision=2)

from sklearn.feature_extraction.text import TfidfTransformer
tfidf=TfidfTransformer(use_idf=True,norm='l2',smooth_idf=True)
#use_idf=True enables inverse document frequency reweighting
#norm='l2' scales each output row to unit L2 norm
#smooth_idf=True adds 1 to every document frequency, as if one extra document contained every term, which prevents division by zero

doc=tfidf.fit_transform(bag)
print(doc.toarray())

#higher values mark the words that are most helpful for distinguishing one sentence from the others

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


With `smooth_idf=True`, the idf equation that is implemented in scikit-learn is:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

The tf-idf equation that is implemented in scikit-learn is as follows:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$
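
As a worked check of the two scikit-learn equations above, the sketch below applies them to the three-sentence count matrix and then rescales each row to unit L2 norm, as `norm='l2'` does. It uses only the standard library; the names `counts`, `tfidf_row`, and `tfidf_by_hand` are introduced here for illustration.

```python
import math

# Raw term frequencies for the three example sentences (rows = documents)
counts = [[0, 1, 0, 1, 1, 0, 1, 0, 0],
          [0, 1, 0, 0, 0, 1, 1, 0, 1],
          [2, 3, 2, 1, 1, 1, 2, 1, 1]]

n_docs = len(counts)
n_terms = len(counts[0])

# df: number of documents that contain each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# idf(t, d) = log((1 + n_d) / (1 + df(d, t)))
idf = [math.log((1 + n_docs) / (1 + df_j)) for df_j in df]

def tfidf_row(row):
    # tf-idf(t, d) = tf(t, d) * (idf(t, d) + 1), then L2-normalise the row
    raw = [tf * (idf_j + 1) for tf, idf_j in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf_by_hand = [tfidf_row(row) for row in counts]
for row in tfidf_by_hand:
    print([round(value, 2) for value in row])
```

For example, the word "is" appears in all three documents, so its idf is $\log(4/4) = 0$ and its unnormalised tf-idf in the third sentence is $3 \times (0 + 1) = 3$. Dividing by that row's L2 norm gives roughly 0.45, which agrees with the `TfidfTransformer` output.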

In [13]:
#inspect the end of the first review: it still contains HTML tags that need to be removed
df.loc[0,'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [16]:
import re
#using regular expressions to clean and preprocess the text
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  #strip HTML tags by replacing them with an empty string
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)  #find text emoticons such as :) or :-(
    #lowercase the text, replace runs of non-word characters with a space,
    #then append the emoticons (without the '-' nose) to the end of the review
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text

In [17]:
preprocessor(df.loc[0,'review'][-50:])

'is seven title brazil not available'

In [18]:
preprocessor("</a>This :) is a :( test :-)!")

'this is a test :) :( :)'

In [20]:
#use the apply method to run preprocessor on every review
df['review']=df['review'].apply(preprocessor)

In [21]:
#tokenization of documents
#stemming reduces inflected or derived word forms to their base (stem) form

from nltk.stem.porter import PorterStemmer
porter=PorterStemmer()

In [24]:
#a basic tokenizer: split the text on whitespace
def tokenizer(text):
    return text.split()

In [30]:
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
#tokenize the text, then return each token reduced to its stem

In [31]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [32]:
tokenizer_porter('runners like running and thus they run')
#stemming strips suffixes, e.g. running -> run; a stem such as 'thu' does not have to be a real word

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [34]:
#to remove stop words (very common words such as 'a', 'and', 'the') from the sentences
import nltk
nltk.download('stopwords')
#the stopwords corpus contains the stop word lists, so we download it

[nltk_data] Downloading package stopwords to /home/rhyme/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [36]:
from nltk.corpus import stopwords

stop=stopwords.words('english')
[w for w in tokenizer_porter('a runner like running and runs a lot')[-10:] if w not in stop]
#tokenize and stem the sentence, then drop the stop words

['runner', 'like', 'run', 'run', 'lot']

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
#TfidfVectorizer combines CountVectorizer and TfidfTransformer in a single step

tfidf=TfidfVectorizer(strip_accents=None,
                     lowercase=False,
                     preprocessor=None,
                     tokenizer=tokenizer_porter,
                     use_idf=True,
                     norm='l2',
                     smooth_idf=True)
#use_idf=True applies inverse document frequency weighting, which downweights words that occur in many reviews
#preprocessor is set to None since the reviews were already cleaned in the steps above
#lowercase is set to False since preprocessor already lowercased the text

#the target y is the sentiment label
y=df.sentiment.values
x=tfidf.fit_transform(df.review)
#transform the reviews into a sparse TF-IDF feature matrix

In [44]:
#using logistic regression for Document Classification

from sklearn.model_selection import train_test_split
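#shuffle=False keeps the original row order: the first half is used for training and the second half for testing
#(random_state has no effect when shuffle=False)
X_train,X_test,y_train,y_test=train_test_split(x,y,random_state=1,test_size=0.5,shuffle=False)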

In [46]:
#we will use the pickle library to save the model
import pickle
from sklearn.linear_model import LogisticRegressionCV

clf=LogisticRegressionCV(cv=5,
                        scoring='accuracy',random_state=0,n_jobs=-1,
                        verbose=3,max_iter=300).fit(X_train,y_train)

#cv is the number of cross-validation folds
#LogisticRegressionCV tunes the regularization strength C automatically via cross-validation
#n_jobs=-1 uses all available CPU cores
#verbose controls how much progress output is printed
#max_iter is the maximum number of iterations of the optimization algorithm

#save the model locally on disk; 'wb' opens the file in binary write mode
with open('saved_model.sav','wb') as saved_model:
    pickle.dump(clf,saved_model)


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.5min finished


In [48]:
filename='saved_model.sav'
#load the model back from disk
with open(filename,'rb') as model_file:
    saved_clf=pickle.load(model_file)

In [49]:
#check the accuracy on the held-out test set
saved_clf.score(X_test,y_test)
#score returns the mean accuracy of the logistic regression classifier



0.89604