# Sentiment Analysis (Movie Review)

#### Data Set: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data 

To begin with the project: step one would be downloading the data and playing with it. Use pandas to load the data, actual data is in *.tsv format. One can load that or I converted it in csv format and used it (4-5 lines had some special characters which caused problem in loading the dataset, but after removing those special characters, dataset can be loaded normaly).

In [1]:
import pandas as pd
dataset = pd.read_csv('./train.csv', delimiter=',')

Use: dataset.info() to get the summary of different columns, if any column contains any row (if yes, then how many?) with Na values. Use datase.head(), dataset.tail(), dataset.describe(), dataset['$ColumnName'] e.t.c to gain more insight into the dataset. 

In [2]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
PhraseId      156060 non-null int64
SentenceId    156060 non-null int64
Phrase        156060 non-null object
Sentiment     156060 non-null int64
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


Use: dataset.$ColumnName.value_count() to get the information about any particular column. 

In [3]:
dataset.Sentiment.value_counts()

2    79582
3    32927
1    27273
4     9206
0     7072
Name: Sentiment, dtype: int64

To proceed further with the sentiment analysis we need to do text classification. We can use 'bag of words (BOW)' model for the analysis. In laymen terms BOW model converts text in the form of numbers which can then be used in an algorithm for analysis.

Specifically BOW model is used for feature extraction in text data. It returns a vector with all the words and number of times each word is repeated. It is known as BOW because it is only concerned with the number of times a word is repeated rather than order of words. Let's take an example to understand it better (assume each document contains a sentence only):

    Doc1: Switzerland is a beautiful country. 
    Doc2: India is a country of smart IT professionals. 
    Doc3: USA is a country of opportunities. 



More the content in each document lengthier would be the length of each vector (will contain lot of zeros). Basically doc vectors would be a sparse vectors if documents are too large. Sparse vectors need lot of memory for storage and due to length even computation becomes slow. In order to reduce the length of the sparse vectors one may use the technique like stemming, lematization, converting to lower case or ignoring stop-words e.t.c. 

Now, we will generate DTM using CountVectorizer module of scikit-learn. To read more about the arguments of CountVectorizer you may visit __[here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)__. As discussed above we will use:<br>
<ul>
<li>tokenizer = Overrides the string tokenization step, we generatre tokenizer from NLTK's Regex tokenizer (by default: None)</li>
<li>lowercase = True (no need to use, as it is set True by default)</li>
<li>stop_words = 'english' (by default None is used, to improve the result we can provide custom made list of stop words)</li>
<li>ngram_range = (1,1) (by defualt its (1,1) i.e strictly monograms will be used, (2,2) only bigrams while (1,2) uses both)</li>
</ul>    

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english',ngram_range = (1,1),tokenizer = token.tokenize)
text_counts = cv.fit_transform(dataset['Phrase'])

We will now split the data for training and testing to check how well our model has performed. Also we will randomize the data in case our data includes all positive first and then all negative or some other kind of bias. We will use: scikit_learn's __[train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)__ for splitting the text_count (which contains our X) and dataset['Sentiment'] (this contains Y). 

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, dataset['Sentiment'], test_size=0.25, random_state=5)

Now we have the training and testing data. We should start the analysis. Our analysis (as most of ML analysis) will be in 5 steps(a mneumonic to remember them is <b>DC-FEM</b> remember as DC Female or District of Columbia Fire and Emergency Medical service): 
<ol>
    <li>Defining the model</li>
    <li>Compiling the model</li>
    <li>Fitting the model</li>
    <li>Evaluating the model</li>
    <li>Making predictions with the model</li>
</ol>
 
### 1. Defining the model
We will use one of the __[Naive Bayes (NB)](https://scikit-learn.org/stable/modules/naive_bayes.html)__ classifier for defining the model. Specifically we will use __[MultinomialNB classifier](https://scikit-learn.org/stable/modules/naive_bayes.html)__. As a fresher to ML one can use cheat sheet given by sklearn __[here](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)__ to determine the best model to use for a particular problem. It tell us to use NB classifier. Let us take a detour to learn more about NB model. 
####  Naive Bayes Model
This model applies Bayes theorem with a Naive assumption of no relation between different features. According to Bayes theorem:<br>

Posterior = likelihood * proposition/evidedence or  P(A|B) = P(B|A) * P(A)/P(B)<br>
<b>For ex: In a deck of playing cards, a card is chosen. What is the probability of a card being queen given the card is a face card?</b><br>
This can be solved using Bayes theorem.<br>
P(Queen given Face card) = P(Queen|Face)<br> 
P(Face given Queen) = P(Face|Queen) = 1<br>
P(Queen) = 4/52 = 1/13
P(Face) = 3/13
From Bayes theore:<br>
P(Queen|Face) = P(Face|Queen) P(Queen)/P(Face) = 1/3<br>



For an input with several variables:<br>
P(y|x1, x2, ... xn) = P(x1, x2, ... xn|y)* P(y)/P(x1,x2, ...xn)<br>
with Naive Bayes we assume x1, x2 ... xn are independent of each other, i.e:<br>
P(x1, x2, ... xn|y) = P(x1|y) * P(x2|y) ... * P(xn|y)<br> 
The assumption in distribution of P(xi|y) give rise to different NBM. For example assuming Gaussian distribution will give rise to Gaussian Naive Bayes (GNB) or multinomial distribusion will give Multinomial Naive Bayes (MNB). 

Naive Bayes Model works particularly well with text classification and spam filtering. <b>Advantages</b> of working with NB algorithm are:
<ul>
    <li>Requires small amount of training data to learn the parameters</li>
    <li>Can be trained relatively fast compared to sophisticated models</li>
</ul>
Main <b>disadvantage</b> of NB Algorithm is:
<ul>
    <li>It's a decent classifier but a bad estimator</li>
    <li>It works well with discrete values but won't work with continuous values (can't be used in regression)</li>
</ul>

#### Dilemma of NB Algorithm
A challenging question which can be asked regarding NB algorithm is: although the condinal independence assumed in NB algorithm is hardly true in real life then howcome NB Algorithm work so well as classifier? 
I won't discuss the solution here, rather will direct you towards the resource which contains the solution (__[here](https://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf)__). In short the answer lies in distribution of dependencies rather than dependency, somehow due to distribution the effect of dependencies cancels out. 

#### Loss function for NB classification
NB classification uses a zero-one loss function. In this function error = number of incorrect classifications. Here accuracy of probability estimation is not taken into account by error function given that class with highest probability is predicted right. For example let's say there are two classes A and B, and different attributes (x1, x2, ... xn) are given. P(A|all atributes) = 0.95 and P(B|all atributes)=0.05 but NB might estimates P(A|all atributes) = 0.7 and P(B|all atributes) = 0.3. Here althogh estimates are far from accurate but classifiction is correct.

Let's move back to our analysis. The first two steps of defining and compiling the model are reduced to identifying and importing the model from sklearn (as sklearn gives as precompiled models).

### 2. Compiling the model
Since we are using sklearn's modules and classes we just need to import the precompiled classes. Sklearn gives the information of all the classes __[here](https://scikit-learn.org/stable/modules/classes.html)__.   


In [7]:
from sklearn.naive_bayes import MultinomialNB

### 3. Fitting the model
In this step we generate our model fitting our dataset in the MultinomialNB. Inorder to look for the arguments which can be passed while fitting the model its advised to check the sklearn webpage of the module under use. For MNB it can be checked __[here](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)__ 

In [8]:
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### 4. Evaluating the model
Here we quantify the quality of our model. We use __[metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation)__ module from sklearn library to evaluate the predictions. 

In [9]:
from sklearn import metrics
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(predicted, Y_test)

In [10]:
print(str('{:04.2f}'.format(accuracy_score*100))+'%')

60.25%


## Tweaking the model
We have observed the accuracy of our model is over 60%. We can now play with our model to increase its' accuracy.

#### Trying different n-grams

In [11]:
#from sklearn.feature_extraction import CountVectorizer
#from nltk.tokenize import RegexpTokenizer
#token = RegexpTokenizer(r'[A-Za-z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range = (2,2), tokenizer = token.tokenize)
text_counts = cv.fit_transform(dataset['Phrase'])

#from sklearn.model_selection import train_test_split()
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, dataset['Sentiment'],test_size=0.25, random_state=5)

#Defining the model-> we will use MultinomialNB

#Compiling the model -> We will import precompiled MNB from sklearn library
#from sklearn.naive_bayes import MultinomialNB 

#Fitting the model
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

#Evaulating the model
#form sklearn import metrics
accuracy_score = metrics.accuracy_score(MNB.predict(X_test), Y_test)
print(str('{:04.2f}'.format(accuracy_score*100))+'%')

60.37%


In [12]:
#It shows only a marginal imporvement, let us try with trigram tokenization now:
cv = CountVectorizer(stop_words='english', ngram_range = (3,3), tokenizer = token.tokenize)
text_counts = cv.fit_transform(dataset['Phrase'])
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, dataset['Sentiment'],test_size=0.25, random_state=5)
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)
accuracy_score = metrics.accuracy_score(MNB.predict(X_test), Y_test)
print(str('{:04.2f}'.format(accuracy_score*100))+'%')

58.86%


#### Trying different Naive Bayes Algorithms 

In [13]:
#With this particular MNB model we are gaining success which is close to 60%, nomatter what n-gram vectorization we opt for.
#Let's try to change the model to ComplementNB. 

#let's write the complete code assuming we have our data imported to dataset.

#from sklearn.feature_extraction import CountVectorizer
#from nlkt.tokenize import RegexpTokenizer
#token = RegexpTokenixer(r'[A-Za-z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range=(1,1), tokenizer=token.tokenize)
text_count = cv.fit_transform(dataset['Phrase'])

#split the dataset in train test 
#form sklearn.model_selection() import train_test_split()
X_train, X_test, Y_train, Y_test = train_test_split(text_count, dataset['Sentiment'], test_size=0.25, random_state=2)

#Defining and compiling the model -> we will use ComplementNB
from sklearn.naive_bayes import ComplementNB

#Fitting the model
CNB = ComplementNB()
CNB.fit(X_train, Y_train)

#evaluating the model
#from sklearn import metrics
accuracy_score = metrics.accuracy_score(CNB.predict(X_test),Y_test)

print(str('{:4.2f}'.format(accuracy_score*100))+'%')

47.53%


How about using several different algorithms all at once!

In [14]:
from sklearn.naive_bayes import GaussianNB
GNB = GaussianNB()
GNB.fit(X_train.todense(), Y_train)
accuracy_score = metrics.accuracy_score(CNB.predict(X_test),Y_test)

print('GNB accuracy = ' + str('{:4.2f}'.format(accuracy_score*100))+'%')

GNB accuracy = 47.53%


In [16]:
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()
BNB.fit(X_train, Y_train)
accuracy_score_bnb = metrics.accuracy_score(BNB.predict(X_test),Y_test)
print('BNB accuracy = ' + str('{:4.2f}'.format(accuracy_score_bnb*100))+'%')

BNB accuracy = 60.61%


### Improving the accuracy
We have tried using different n-grams and different Naive Bayes models but maximum accuracy lingers arround 60%. In order to improve our model let's try to change the way the BOW is created. Currently we created BOW with CountVectorizer which counts the occurance of the word in the text. More number of time a word occurs it becomes more important for classification. 


### TF-IDF: Term Frequency-Inverse Document Frequency
Let's use TF-IDF here product of term frequency and inverse document frequency is used. Term frequency is how frequently a terms has appeared in a document. Let's say a term appears f times in a document with d words. <br>
Term Frequency = f/d <br>
IDF is inverse document frequency. If a corpus contains N documents and the term of our interest appears only in D documents then IDF is:<br>
IDF = log(N/D)
TF-IDF is product of Term Frequncy and Inverse Document Frequency. <b>TF-IDF shows the rarity of a word in the corpus.</b> If a word is rare then probably its a signature word for a particular sentiment/information.  


In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
text_count_2 = tfidf.fit_transform(dataset['Phrase'])

#splitting the data in test and training
#from sklearn.model_selection() import train_test_split()
x_train, x_test, y_train, y_test = train_test_split(text_count_2, dataset['Sentiment'],test_size=0.25,random_state=5)

#defining the model
#compilimg the model -> we are going to use already used models GNB, MNB, CNB, BNB
#fitting the model
MNB.fit(x_train, y_train)
accuracy_score_mnb = metrics.accuracy_score(MNB.predict(x_test), y_test)
print('accuracy_score_mnb = '+str('{:4.2f}'.format(accuracy_score_mnb*100))+'%')

BNB.fit(x_train, y_train)
accuracy_score_bnb = metrics.accuracy_score(BNB.predict(x_test), y_test)
print('accuracy_score_bnb = '+str('{:4.2f}'.format(accuracy_score_bnb*100))+'%')

CNB.fit(x_train, y_train)
accuracy_score_cnb = metrics.accuracy_score(CNB.predict(x_test), y_test)
print('accuracy_score_cnb = '+str('{:4.2f}'.format(accuracy_score_cnb*100))+'%')

GNB.fit(x_train.todense(), y_train)
accuracy_score_gnb = metrics.accuracy_score(GNB.predict(x_test.todense()), y_test)
print('accuracy_score_gnb = '+str('{:4.2f}'.format(accuracy_score_gnb*100))+'%')

accuracy_score_mnb = 58.50%
accuracy_score_bnb = 59.33%
accuracy_score_cnb = 51.42%
accuracy_score_gnb = 19.97%


### Trying non Bayesian algorithms
Even the Tfidf vectorizer i.e creating a different BOW didn't help in imporving the accuracy of the model. Rather than naive bayes algorithm we can also opt for stochastic gradient descent classifier or linear support vector classifier. Both of these are known to work well with the text data classification. Let's try to use these:

In [23]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
SGDC = SGDClassifier()
LSVC = LinearSVC()

#on TF-IDF data
LSVC.fit(x_train, y_train)
accuracy_score_lsvc = metrics.accuracy_score(LSVC.predict(x_test), y_test)
print('accuracy_score_lsvc = '+str('{:4.2f}'.format(accuracy_score_lsvc*100))+'%')

SGDC.fit(x_train, y_train)
accuracy_score_sgdc = metrics.accuracy_score(SGDC.predict(x_test), y_test)
print('accuracy_score_sgdc = '+str('{:4.2f}'.format(accuracy_score_sgdc*100))+'%')

#on CountVectorize data
LSVC.fit(X_train, Y_train)
accuracy_score_lsvc_CV = metrics.accuracy_score(LSVC.predict(X_test), Y_test)
print('accuracy_score_lsvc_cv = '+str('{:4.2f}'.format(accuracy_score_lsvc_CV*100))+'%')

SGDC.fit(X_train, Y_train)
accuracy_score_sgdc_CV = metrics.accuracy_score(SGDC.predict(X_test), Y_test)
print('accuracy_score_sgdc_cv = '+str('{:4.2f}'.format(accuracy_score_sgdc_CV*100))+'%')



accuracy_score_lsvc = 63.88%
accuracy_score_sgdc = 56.47%




accuracy_score_lsvc_cv = 63.05%
accuracy_score_sgdc_cv = 60.18%
