# Intro to Movie Review Sentiment Analysis


![](https://i.imgur.com/WNgxr2I.png)


For this movie review sentiment analysis, we will work with the Rotten Tomatoes movie review dataset from Kaggle.
The task is to label phrases on a five-value scale (negative, somewhat negative, neutral, somewhat positive, positive) based on the sentiment they express.

The dataset is comprised of tab-separated files with phrases from the Rotten Tomatoes dataset. Each phrase has a PhraseId. Each sentence has a SentenceId. Phrases that are repeated (such as short/common words) are only included once in the data.

The sentiment labels are:

* 0 - *negative*
* 1 - *somewhat negative*
* 2 - *neutral*
* 3 - *somewhat positive*
* 4 - *positive*
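
When reading model output later, it can help to turn the numeric codes back into names. The lookup below is a minimal sketch built from the list above; `SENTIMENT_LABELS` and `label_name` are hypothetical names introduced only for illustration.

```python
# Hypothetical lookup table built from the label list above
SENTIMENT_LABELS = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}

def label_name(code):
    """Return the readable name for a sentiment code (0-4)."""
    return SENTIMENT_LABELS[code]

print(label_name(3))
```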







**Any suggestions for improvement or comments are highly appreciated!**

Please upvote(like button) and share this kernel if you like it so that more people can learn from it. 



Below is the step-by-step methodology that we will follow:

- <a href='#1'>1. Initial Look at the Data</a>
    - <a href='#1.1'>1.1 Distribution of reviews in each sentiment category</a>
    - <a href='#1.2'>1.2 Dropping insignificant columns</a>
    - <a href='#1.3'>1.3 Overall Distribution of the length of the reviews under each sentiment class</a>
    - <a href='#1.4'>1.4 Creating Word Cloud of negative and positive movie reviews</a>
        - <a href='#1.4.1'>1.4.1 Filtering out positive and negative movie reviews</a>
        - <a href='#1.4.2'>1.4.2 Word Cloud for negatively classified movie reviews</a>
        - <a href='#1.4.3'>1.4.3 Word Cloud for positively classified movie reviews</a>
    - <a href='#1.5'>1.5 Term Frequencies of each Sentiment class</a>
        - <a href='#1.5.1'>1.5.1 Term Frequency for 'negative' sentiments</a>
        - <a href='#1.5.2'>1.5.2 Term Frequency for 'some negative' sentiments</a>
        - <a href='#1.5.3'>1.5.3 Term Frequency for 'neutral' sentiments</a>
        - <a href='#1.5.4'>1.5.4 Term Frequency for 'some positive' sentiments</a>
        - <a href='#1.5.5'>1.5.5 Term Frequency for 'positive' sentiments</a>
    - <a href='#1.6'>1.6 Total Term Frequency of all the 5 sentiment classes</a>
    - <a href='#1.7'>1.7 Frequency plot of top frequent 500 phrases in movie reviews</a>
    - <a href='#1.8'>1.8 Plot of Absolute frequency of phrases against their rank</a>
    - <a href='#1.9'>1.9 Movie Reviews Tokens Visualisation</a>
        - <a href='#1.9.1'>1.9.1 Plot of top frequently used 50 phrases in negative movie reviews</a>
        - <a href='#1.9.2'>1.9.2 Plot of top frequently used 50 phrases in positive movie reviews</a>
- <a href='#2'>2. Traditional Supervised Machine Learning Models</a>
    - <a href='#2.1'>2.1 Feature Engineering</a>
    - <a href='#2.2'>2.2 Implementation of CountVectorizer & TF-IDF</a>
        - <a href='#2.2.1'>2.2.1 CountVectorizer</a>
        - <a href='#2.2.2'>2.2.2 How is TF-IDF different from CountVectorizer?</a>
        - <a href='#2.2.3'>2.2.3 How exactly does TF-IDF work?</a>
        - <a href='#2.2.4'>2.2.4 Understanding the parameters of TfidfVectorizer</a>
        - <a href='#2.2.5'>2.2.5 Setting the parameters of CountVectorizer</a>
    - <a href='#2.3'>2.3 Model Training, Prediction and Performance Evaluation</a>
        - <a href='#2.3.1'>2.3.1 Logistic Regression model on CountVectorizer</a>
        - <a href='#2.3.2'>2.3.2 Logistic Regression model on TF-IDF features</a>
        - <a href='#2.3.3'>2.3.3 SGD model on Countvectorizer</a>
        - <a href='#2.3.4'>2.3.4 SGD model on TF-IDF</a>
        - <a href='#2.3.5'>2.3.5 RandomForest model on TF-IDF</a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


## <a id='1'>1. Initial Look at the Data</a>

In [None]:
df_train = pd.read_csv("../input/train.tsv", sep='\t')
df_train.head()

In [None]:
df_test = pd.read_csv("../input/test.tsv", sep='\t')
df_test.head()

## <a id='1.1'>1.1 Distribution of reviews in each sentiment category</a>

In the training dataset, neutral phrases dominate, followed by somewhat positive and then somewhat negative phrases.

In [None]:
df_train.Sentiment.value_counts()

In [None]:
df_train.info()

## <a id='1.2'>1.2 Dropping insignificant columns</a>

In [None]:
df_train_1 = df_train.drop(['PhraseId','SentenceId'],axis=1)
df_train_1.head()

Let's check the phrase length of each of the movie reviews.

In [None]:
df_train_1['phrase_len'] = [len(t) for t in df_train_1.Phrase]
df_train_1.head(4)

## <a id='1.3'>1.3 Overall Distribution of the length of the reviews under each sentiment class</a>

In [None]:
fig,ax = plt.subplots(figsize=(5,5))
plt.boxplot(df_train_1.phrase_len)
plt.show()

From the above box plot, some of the reviews are well over 100 characters long.

In [None]:
df_train_1[df_train_1.phrase_len > 100].head()

In [None]:
df_train_1[df_train_1.phrase_len > 100].iloc[0].Phrase  # first matching row by position, not by index label

## <a id='1.4'>1.4 Creating Word Cloud of negative and positive movie reviews</a>

### Word Cloud


A word cloud is a graphical representation of the frequently used words in a collection of texts. The size of each word in the picture indicates how often the word occurs in the entire text. Such diagrams are very useful in text analytics.

It provides a general idea of what kind of words are frequent in the corpus, in a sort of quick and dirty way.

Let's start doing some EDA on text data by Word Cloud.

## <a id='1.4.1'>1.4.1 Filtering out positive and negative movie reviews</a>

In [None]:
neg_phrases = df_train_1[df_train_1.Sentiment == 0]
neg_words = []
for t in neg_phrases.Phrase:
    neg_words.append(t)
neg_words[:4]

**pandas.Series.str.cat**: Concatenates the strings in the Series/Index with the given separator. Here we use a space as the separator, so all the strings in the Series are joined into one string, separated by spaces.
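
As a quick illustration of `str.cat`, here is a minimal sketch on a toy Series. The three phrases are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy phrases (made up for illustration, not taken from the dataset)
toy_phrases = pd.Series(["a dull film", "not good", "bad acting"])

# Join every element into one string, separated by a single space
joined = toy_phrases.str.cat(sep=' ')
print(joined)  # a dull film not good bad acting
```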

In [None]:
neg_text = pd.Series(neg_words).str.cat(sep=' ')
neg_text[:100]

In [None]:
for t in neg_phrases.Phrase[:300]:
    if 'good' in t:
        print(t)

So, we can very well see, even if the texts contain words like "good", it is a negative sentiment because it indicates that the movie is **NOT** a good movie. 

In [None]:
pos_phrases = df_train_1[df_train_1.Sentiment == 4] ## 4 is positive sentiment
pos_string = []
for t in pos_phrases.Phrase:
    pos_string.append(t)
pos_text = pd.Series(pos_string).str.cat(sep=' ')
pos_text[:100]
    

## <a id='1.4.2'>1.4.2 Word Cloud for negatively classified movie reviews</a>

In [None]:
from wordcloud import WordCloud
wordcloud = WordCloud(width=1600, height=800, max_font_size=200).generate(neg_text)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Some of the big words are fairly neutral, such as "movie", "film", etc. Some of the smaller words make sense in negative movie reviews, like "bad cinema", "annoying", "dull", etc.

However, some words like "good" are also present in the negatively classified reviews.
As the phrases printed in section 1.4.1 show, such words usually appear in contexts saying that the movie is **not** good.

## <a id='1.4.3'>1.4.3 Word Cloud for positively classified movie reviews</a>

In [None]:
wordcloud = WordCloud(width=1600, height=800, max_font_size=200).generate(pos_text)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Again I see some neutral words in big size, "movie","film", but positive words like "good", "best", "fascinating" also stand out.

## <a id='1.5'>1.5 Term Frequencies of each Sentiment class</a>

We also want to understand how terms are distributed across documents. This helps us to characterize the properties of the algorithms for compressing phrases.

A commonly used model of the distribution of terms in a collection is Zipf's law. It states that, if $t_1$ is the most common term in the collection, $t_2$ is the next most common, and so on, then the collection frequency $cf_i$ of the $i$th most common term is proportional to $1/i$:

$\displaystyle cf_i \propto \frac{1}{i}.$	


So if the most frequent term occurs $cf_1$ times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on. The intuition is that frequency decreases very rapidly with rank.
The above equation is one of the simplest ways of formalizing such a rapid decrease and it has been found to be a reasonably good model.
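
To make the formula concrete, the sketch below computes the counts Zipf's law would predict for the first few ranks. The top frequency of 1200 is a hypothetical number chosen for easy arithmetic, and `zipf_expected` is a name introduced only for this example.

```python
def zipf_expected(cf_1, n_terms):
    """Expected collection frequencies cf_i = cf_1 / i for ranks 1..n_terms."""
    return [cf_1 / i for i in range(1, n_terms + 1)]

# Hypothetical top frequency of 1200 occurrences
expected = zipf_expected(1200, 4)
print(expected)  # [1200.0, 600.0, 400.0, 300.0]
```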


We need the term frequency data to see which words are used in the movie reviews and how many times each one is used.
Let's proceed with CountVectorizer to calculate term frequencies:


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cvector = CountVectorizer(min_df = 0.0, max_df = 1.0, ngram_range=(1,2))
cvector.fit(df_train_1.Phrase)

In [None]:
len(cvector.vocabulary_)  # vocabulary size; get_feature_names() is not available in recent scikit-learn releases

It looks like CountVectorizer has extracted 94644 terms from the corpus. Because ngram_range=(1,2), these are unigrams and bigrams rather than single words only.
The term frequency for each class can be obtained with the code blocks below.

## <a id='1.5.1'>1.5.1 Term Frequency for 'negative' sentiments</a>

In [None]:
neg_matrix = cvector.transform(df_train_1[df_train_1.Sentiment == 0].Phrase)
som_neg_matrix = cvector.transform(df_train_1[df_train_1.Sentiment == 1].Phrase)
neu_matrix = cvector.transform(df_train_1[df_train_1.Sentiment == 2].Phrase)
som_pos_matrix = cvector.transform(df_train_1[df_train_1.Sentiment == 3].Phrase)
pos_matrix = cvector.transform(df_train_1[df_train_1.Sentiment == 4].Phrase)


In [None]:
neg_words = neg_matrix.sum(axis=0)
neg_words_freq = [(word, neg_words[0, idx]) for word, idx in cvector.vocabulary_.items()]
neg_tf = pd.DataFrame(list(sorted(neg_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','negative'])

In [None]:
neg_tf.head()

In [None]:
neg_tf_df = neg_tf.set_index('Terms')
neg_tf_df.head()

## <a id='1.5.2'>1.5.2 Term Frequency for 'some negative' sentiments</a>

In [None]:

som_neg_words = som_neg_matrix.sum(axis=0)
som_neg_words_freq = [(word, som_neg_words[0, idx]) for word, idx in cvector.vocabulary_.items()]
som_neg_tf = pd.DataFrame(list(sorted(som_neg_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','some-negative'])
som_neg_tf_df = som_neg_tf.set_index('Terms')
som_neg_tf_df.head()

## <a id='1.5.3'>1.5.3 Term Frequency for 'neutral' sentiments</a>

In [None]:

neu_words = neu_matrix.sum(axis=0)
neu_words_freq = [(word, neu_words[0, idx]) for word, idx in cvector.vocabulary_.items()]
neu_words_tf = pd.DataFrame(list(sorted(neu_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','neutral'])
neu_words_tf_df = neu_words_tf.set_index('Terms')
neu_words_tf_df.head()

## <a id='1.5.4'>1.5.4 Term Frequency for 'some positive' sentiments</a>

In [None]:

som_pos_words = som_pos_matrix.sum(axis=0)
som_pos_words_freq = [(word, som_pos_words[0, idx]) for word, idx in cvector.vocabulary_.items()]
som_pos_words_tf = pd.DataFrame(list(sorted(som_pos_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','some-positive'])
som_pos_words_tf_df = som_pos_words_tf.set_index('Terms')
som_pos_words_tf_df.head()

## <a id='1.5.5'>1.5.5 Term Frequency for 'positive' sentiments</a>


In [None]:

pos_words = pos_matrix.sum(axis=0)
pos_words_freq = [(word, pos_words[0, idx]) for word, idx in cvector.vocabulary_.items()]
pos_words_tf = pd.DataFrame(list(sorted(pos_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','positive'])
pos_words_tf_df = pos_words_tf.set_index('Terms')
pos_words_tf_df.head()

In [None]:
term_freq_df = pd.concat([neg_tf_df,som_neg_tf_df,neu_words_tf_df,som_pos_words_tf_df,pos_words_tf_df],axis=1)
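
The five per-class blocks above repeat the same steps, so they could be folded into one helper. The following is only a sketch under assumptions: `class_term_freq` is a hypothetical function name, and the toy DataFrame stands in for `df_train_1` so the example runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def class_term_freq(vectorizer, phrases, column):
    """Sum the term counts over `phrases` and return them as a one-column DataFrame."""
    counts = vectorizer.transform(phrases).sum(axis=0)   # 1 x n_terms matrix
    data = {term: counts[0, idx] for term, idx in vectorizer.vocabulary_.items()}
    return pd.Series(data, name=column).to_frame()

# Toy data standing in for df_train_1 (made up for illustration)
toy = pd.DataFrame({
    "Phrase": ["bad movie", "bad bad plot", "good movie", "great good fun"],
    "Sentiment": [0, 0, 4, 4],
})
vec = CountVectorizer().fit(toy.Phrase)

# One column per class, aligned on the shared vocabulary
frames = [class_term_freq(vec, toy[toy.Sentiment == s].Phrase, name)
          for s, name in [(0, "negative"), (4, "positive")]]
toy_tf = pd.concat(frames, axis=1)
print(toy_tf.sort_values(by="negative", ascending=False))
```

With the real data, the same helper could be called once per sentiment code with the fitted `cvector`.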

## <a id='1.6'>1.6 Total Term Frequency of all the 5 sentiment classes</a>

In [None]:
term_freq_df['total'] = term_freq_df['negative'] + term_freq_df['some-negative'] \
                                 + term_freq_df['neutral'] + term_freq_df['some-positive'] \
                                 +  term_freq_df['positive'] 
term_freq_df.sort_values(by='total', ascending=False).head(20)

## <a id='1.7'>1.7 Frequency plot of top frequent 500 phrases in movie reviews</a>

**"Given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc."**

In other words, the $r$th most frequent word has a frequency $f(r)$ that scales according to $f(r) \propto \frac{1}{r^\alpha}$ for $\alpha \approx 1$.


Let's see what the movie review tokens and their frequencies look like on a plot.

In [None]:
y_pos = np.arange(500)
plt.figure(figsize=(10,8))
s = 1
# Sort once and reuse; .iloc gives positional access on the term-indexed Series
sorted_total = term_freq_df.sort_values(by='total', ascending=False)['total']
expected_zipf = [sorted_total.iloc[0]/(i+1)**s for i in y_pos]
plt.bar(y_pos, sorted_total.iloc[:500], align='center', alpha=0.5)
plt.plot(y_pos, expected_zipf, color='r', linestyle='--',linewidth=2,alpha=0.5)
plt.ylabel('Frequency')
plt.title('Top 500 phrases in movie reviews')

The X-axis shows the rank by frequency, from the most frequent term on the left to the 500th most frequent on the right. The Y-axis is the frequency observed in the corpus.

Another way to plot this is on a log-log graph, with log(rank) on the X-axis and log(frequency) on the Y-axis. On a log-log scale, data that follows Zipf's law yields a roughly straight line.
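
To see why the log-log view gives a straight line, the sketch below fits a line to synthetic counts that follow Zipf's law exactly. The numbers are generated for illustration and are not the review data; real corpora will only approximate a slope of -1.

```python
import numpy as np

# Synthetic counts that follow Zipf's law exactly (hypothetical, not the review data)
ranks = np.arange(1, 501)
freqs = 10000 / ranks

# Fit a straight line to log10(frequency) against log10(rank)
slope, intercept = np.polyfit(np.log10(ranks), np.log10(freqs), 1)
print(round(slope, 3), round(intercept, 3))
```

The fitted slope corresponds to $-\alpha$, so a slope near -1 matches $\alpha \approx 1$.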

## <a id='1.8'>1.8 Plot of Absolute frequency of phrases against their rank</a>

In [None]:
# Use explicit np/plt calls instead of `from pylab import *`, and plain NumPy
# arrays so that all indexing below is positional
counts = term_freq_df.total.values
tokens = term_freq_df.index.values
ranks = np.arange(1, len(counts)+1)
indices = np.argsort(-counts)
frequencies = counts[indices]
plt.figure(figsize=(8,6))
plt.ylim(1,10**6)
plt.xlim(1,10**6)
plt.loglog(ranks, frequencies, marker=".")
plt.plot([1,frequencies[0]],[frequencies[0],1],color='r')
plt.title("Zipf plot for phrases tokens")
plt.xlabel("Frequency rank of token")
plt.ylabel("Absolute frequency of token")
plt.grid(True)
# Label a handful of tokens at log-spaced ranks
for n in np.logspace(-0.5, np.log10(len(counts)-2), 25).astype(int):
    plt.text(ranks[n], frequencies[n], " " + tokens[indices[n]],
             verticalalignment="bottom",
             horizontalalignment="left")

We can clearly see that words like "the", "in", "it", etc. sit at the top left of the plot: they have the highest frequencies and therefore the first ranks, even though they carry little significance for the sentiment of a movie review. On the other hand, terms like "downbeat laughably" sit at the far right, because they occur very rarely in the corpus, although they seem much more related to the sentiment of a movie. Rank here reflects frequency only, not how informative a term is.

## <a id='1.9'>1.9 Movie Reviews Tokens Visualisation</a>

Next, let's explore how the tokens differ between the two extreme classes (positive and negative).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(stop_words='english',max_features=10000)
cvec.fit(df_train_1.Phrase)

In [None]:
neg_matrix = cvec.transform(df_train_1[df_train_1.Sentiment == 0].Phrase)
som_neg_matrix = cvec.transform(df_train_1[df_train_1.Sentiment == 1].Phrase)
neu_matrix = cvec.transform(df_train_1[df_train_1.Sentiment == 2].Phrase)
som_pos_matrix = cvec.transform(df_train_1[df_train_1.Sentiment == 3].Phrase)
pos_matrix = cvec.transform(df_train_1[df_train_1.Sentiment == 4].Phrase)

neg_words = neg_matrix.sum(axis=0)
neg_words_freq = [(word, neg_words[0, idx]) for word, idx in cvec.vocabulary_.items()]
neg_tf = pd.DataFrame(list(sorted(neg_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','negative'])

neg_tf_df = neg_tf.set_index('Terms')


som_neg_words = som_neg_matrix.sum(axis=0)
som_neg_words_freq = [(word, som_neg_words[0, idx]) for word, idx in cvec.vocabulary_.items()]
som_neg_tf = pd.DataFrame(list(sorted(som_neg_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','some-negative'])
som_neg_tf_df = som_neg_tf.set_index('Terms')

neu_words = neu_matrix.sum(axis=0)
neu_words_freq = [(word, neu_words[0, idx]) for word, idx in cvec.vocabulary_.items()]
neu_words_tf = pd.DataFrame(list(sorted(neu_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','neutral'])
neu_words_tf_df = neu_words_tf.set_index('Terms')

som_pos_words = som_pos_matrix.sum(axis=0)
som_pos_words_freq = [(word, som_pos_words[0, idx]) for word, idx in cvec.vocabulary_.items()]
som_pos_words_tf = pd.DataFrame(list(sorted(som_pos_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','some-positive'])
som_pos_words_tf_df = som_pos_words_tf.set_index('Terms')

pos_words = pos_matrix.sum(axis=0)
pos_words_freq = [(word, pos_words[0, idx]) for word, idx in cvec.vocabulary_.items()]
pos_words_tf = pd.DataFrame(list(sorted(pos_words_freq, key = lambda x: x[1], reverse=True)),columns=['Terms','positive'])
pos_words_tf_df = pos_words_tf.set_index('Terms')

term_freq_df = pd.concat([neg_tf_df,som_neg_tf_df,neu_words_tf_df,som_pos_words_tf_df,pos_words_tf_df],axis=1)

term_freq_df['total'] = term_freq_df['negative'] + term_freq_df['some-negative'] \
                                 + term_freq_df['neutral'] + term_freq_df['some-positive'] \
                                 +  term_freq_df['positive'] 
        
term_freq_df.sort_values(by='total', ascending=False).head(15)

## <a id='1.9.1'>1.9.1 Plot of top frequently used 50 phrases in negative movie reviews</a>

In [None]:
y_pos = np.arange(50)
plt.figure(figsize=(12,10))
plt.bar(y_pos, term_freq_df.sort_values(by='negative', ascending=False)['negative'][:50], align='center', alpha=0.5)
plt.xticks(y_pos, term_freq_df.sort_values(by='negative', ascending=False)['negative'][:50].index,rotation='vertical')
plt.ylabel('Frequency')
plt.xlabel('Top 50 negative tokens')
plt.title('Top 50 tokens in negative movie reviews')

We can see that some negative words like "bad", "worst", "dull" are among the high-frequency words. But a few neutral words like "movie", "film", "minutes" still dominate the frequency plot.

Let's also take a look at the top 50 positive tokens on a bar chart.

## <a id='1.9.2'>1.9.2 Plot of top frequently used 50 phrases in positive movie reviews</a>

In [None]:
y_pos = np.arange(50)
plt.figure(figsize=(12,10))
plt.bar(y_pos, term_freq_df.sort_values(by='positive', ascending=False)['positive'][:50], align='center', alpha=0.5)
plt.xticks(y_pos, term_freq_df.sort_values(by='positive', ascending=False)['positive'][:50].index,rotation='vertical')
plt.ylabel('Frequency')
plt.xlabel('Top 50 positive tokens')
plt.title('Top 50 tokens in positive movie reviews')

Once again, some neutral words like "film" and "movie" rank quite high.

## <a id='2'>2. Traditional Supervised Machine Learning Models</a>


## <a id='2.1'>2.1 Feature Engineering</a>

In [None]:
phrase = np.array(df_train_1['Phrase'])
sentiments = np.array(df_train_1['Sentiment'])
# build train and test datasets

from sklearn.model_selection import train_test_split    
phrase_train, phrase_test, sentiments_train, sentiments_test = train_test_split(phrase, sentiments, test_size=0.2, random_state=4)

Next, we will try to see how the tokens differ across the 5 classes (positive, some positive, neutral, some negative, negative).

## <a id='2.2'>2.2 Implementation of CountVectorizer & TF-IDF</a>

## <a id='2.2.1'>2.2.1 CountVectorizer</a>



Machine learning algorithms work with numbers, so we have to convert the text data into numbers without losing much of the information.
One way to do such a transformation is Bag-of-Words (BOW). Just giving a number to each word is very inefficient.
So, a practical way to build BOW features is **CountVectorizer**: it counts the words in each document, i.e. it converts a collection of text documents to a matrix of the counts of occurrences of each word in each document.

For example, if we have a collection of 3 text documents as below, then CountVectorizer converts them into individual counts of occurrences of each of the words in each document:




In [None]:
cv1 = CountVectorizer()
x_traincv = cv1.fit_transform(["Hi How are you How are you doing","Hi what's up","Wow that's awesome"])

In [None]:
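# Order the terms by their column index; this works across scikit-learn versions
x_traincv_df = pd.DataFrame(x_traincv.toarray(),columns=sorted(cv1.vocabulary_, key=cv1.vocabulary_.get))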
x_traincv_df
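
For intuition, the same kind of counts can be reproduced with only the standard library. This is a rough sketch, not CountVectorizer's implementation: it lowercases the text and keeps tokens of two or more word characters, which mirrors CountVectorizer's default settings. `bag_of_words` is a hypothetical helper name.

```python
import re
from collections import Counter

docs = ["Hi How are you How are you doing", "Hi what's up", "Wow that's awesome"]

def bag_of_words(doc):
    """Lowercase the text and count tokens of two or more word characters."""
    return Counter(re.findall(r"\b\w\w+\b", doc.lower()))

bags = [bag_of_words(d) for d in docs]
for bag in bags:
    print(dict(bag))
```

Each Counter corresponds to one row of the matrix; CountVectorizer additionally lines all documents up against one shared vocabulary.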

With CountVectorizer we are just counting the words in each document, and it often happens that words like "are", "you", "hi", etc. occur very frequently and would dominate the results of a machine learning algorithm.

## <a id='2.2.2'>2.2.2 How is TF-IDF different from CountVectorizer?</a>

So, TF-IDF (which stands for **Term Frequency-Inverse Document Frequency**) down-weights the common words occurring in almost all the documents and gives more importance to the words that appear in a subset of documents.
TF-IDF works by penalising these common words with lower weights while giving importance to words that are rare across the corpus but present in a particular document.

## <a id='2.2.3'>2.2.3 How exactly does TF-IDF work?</a>


Consider the below sample table which gives the count of terms(tokens/words) in two documents.

![](https://i.imgur.com/iVOI1TQ.png)

Now, let us define a few terms related to TF-IDF.

**TF (Term Frequency)** :
Denotes the contribution of the word to the document i.e. words relevant to the document should be frequent. 

            TF = (Number of times term t appears in a document)/(Number of terms in the document)

So, TF(This,Document1) = 1/8

TF(This, Document2)=1/5

**IDF (Inverse Document Frequency)** :
If a word appears in all the documents, then that word is probably not relevant to any particular document. 
But if it appears in only a subset of documents, then the word is probably of some relevance to the documents it is present in.


           IDF = log(N/n), where N is the number of documents, n is the number of documents in which term t appears, and log is the base-10 logarithm.

So, IDF(This) = log(2/2) = 0.
IDF(Messi) = log(2/1) = 0.301.


Now, let us compare the TF-IDF for a common word ‘This’ and a word ‘Messi’ which seems to be of relevance to Document 1.

TF-IDF(This,Document1) = (1/8) * (0) = 0

TF-IDF(This, Document2) = (1/5) * (0) = 0

TF-IDF(Messi, Document1) = (4/8) * 0.301 = 0.15

So, for Document1, the TF-IDF method heavily penalises the word ‘This’ but assigns a greater weight to ‘Messi’. This may be understood as ‘Messi’ being an important word for Document1 in the context of the entire corpus.
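
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the textbook formulas as written in this section; the helper names `tf` and `idf` are introduced only for the example, and the counts (1/8, 1/5, 4/8) are the ones quoted in the text.

```python
import math

def tf(term_count, doc_length):
    """Term frequency: occurrences of the term divided by terms in the document."""
    return term_count / doc_length

def idf(n_docs, docs_with_term):
    """Inverse document frequency with a base-10 log, matching log(2/1) = 0.301."""
    return math.log10(n_docs / docs_with_term)

tfidf_this_doc1 = tf(1, 8) * idf(2, 2)    # 'This' in Document1
tfidf_this_doc2 = tf(1, 5) * idf(2, 2)    # 'This' in Document2
tfidf_messi_doc1 = tf(4, 8) * idf(2, 1)   # 'Messi' in Document1
print(tfidf_this_doc1, tfidf_this_doc2, round(tfidf_messi_doc1, 2))
```

Note that scikit-learn's `TfidfVectorizer` uses a smoothed, natural-log variant of IDF and L2-normalises each row by default, so its values will differ from this hand calculation.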


## "Rare terms are more informative than frequent terms"

The graphic below attempts to express this intuition. Note that the TF-IDF weight is a relative measurement, so the values in red on the axis are not intended to be taken as absolute weights.

![](https://i.imgur.com/pmjduLZ.png)





When your corpus (a structured set of texts) is large, TF-IDF is often a better option than raw counts.

Now, let's get back to our problem:

## <a id='2.2.4'>2.2.4 Understanding the parameters of TfidfVectorizer</a>


* min_df : While building the vocabulary, it ignores terms that have a document frequency strictly lower than the given threshold. In our case, min_df = 0.0, so no term is dropped for being too rare.

* max_df : While building the vocabulary, it ignores terms that have a document frequency strictly higher than the given threshold. For us, max_df = 1.0, so no term is dropped for being too common.

* ngram_range : A tuple with the lower and upper boundary of the range of n-values for the n-grams to be extracted. For example, (1,2) extracts unigrams and bigrams.


![](https://i.imgur.com/Gld6LGz.png)





* sublinear_tf : Sublinear tf scaling addresses the problem that 20 occurrences of a word are probably not 20 times more important than 1 occurrence.

![](https://i.imgur.com/ZzspOIQ.png)
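
The effect of `min_df` and `ngram_range` is easiest to see on a tiny corpus. The sketch below uses CountVectorizer, which shares these parameters with TfidfVectorizer; the three documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration)
corpus = ["good movie", "good plot", "bad movie"]

# Unigrams and bigrams, no document-frequency filtering
all_terms = CountVectorizer(ngram_range=(1, 2)).fit(corpus)

# Same n-grams, but keep only terms that appear in at least 2 documents
frequent_terms = CountVectorizer(ngram_range=(1, 2), min_df=2).fit(corpus)

print(sorted(all_terms.vocabulary_))
print(sorted(frequent_terms.vocabulary_))
```

An integer `min_df` is an absolute document count, while a float such as the 0.0 used in this kernel is a proportion of documents.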


**Why is log used when calculating the term frequency weight and the IDF (inverse document frequency) in the sublinear_tf transformation?**

I found an answer to this question on Stack Overflow which you may find useful.

![](https://i.imgur.com/85dZ0io.png)
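
As a small numeric illustration of the damping effect, the sketch below applies the sublinear scaling that scikit-learn documents for `sublinear_tf=True`, namely replacing tf with 1 + log(tf) using the natural logarithm. `sublinear_tf` here is a hypothetical standalone helper, not the library's internal code.

```python
import math

def sublinear_tf(raw_count):
    """Replace a raw term count with 1 + ln(count); zero counts stay zero."""
    return 1 + math.log(raw_count) if raw_count > 0 else 0.0

for count in (1, 2, 20):
    print(count, round(sublinear_tf(count), 2))
```

A term that occurs 20 times ends up weighted about 4 times as much as a term that occurs once, rather than 20 times as much.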

## <a id='2.2.5'>2.2.5 Setting the parameters of CountVectorizer</a>

**For CountVectorizer**
This time, the stop words will not help much, because the same high-frequency words, such as "the" and "to", will be equally frequent across the classes. If these stop words dominate every class, I won't be able to get a meaningful result. So, I decided to remove stop words, and also to limit max_features to 10,000 with CountVectorizer.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## Build Bag-Of-Words on train phrases
cv = CountVectorizer(stop_words='english',max_features=10000)
cv_train_features = cv.fit_transform(phrase_train)

In [None]:

# build TFIDF features on train reviews
tv = TfidfVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1,2),
                     sublinear_tf=True)
tv_train_features = tv.fit_transform(phrase_train)

In [None]:
# transform test reviews into features
cv_test_features = cv.transform(phrase_test)
tv_test_features = tv.transform(phrase_test)

In [None]:
print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

## <a id='2.3'>2.3 Model Training, Prediction and Performance Evaluation</a>

In [None]:
####Evaluation metrics


from sklearn import metrics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.base import clone
from sklearn.preprocessing import label_binarize
from numpy import interp  # scipy.interp was a deprecated alias of numpy.interp and is gone from recent SciPy
from sklearn.metrics import roc_curve, auc 


def get_metrics(true_labels, predicted_labels):
    
    print('Accuracy:', np.round(
                        metrics.accuracy_score(true_labels, 
                                               predicted_labels),
                        4))
    print('Precision:', np.round(
                        metrics.precision_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        4))
    print('Recall:', np.round(
                        metrics.recall_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        4))
    print('F1 Score:', np.round(
                        metrics.f1_score(true_labels, 
                                               predicted_labels,
                                               average='weighted'),
                        4))
                        

def train_predict_model(classifier, 
                        train_features, train_labels, 
                        test_features, test_labels):
    # build model    
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features) 
    return predictions    


def display_confusion_matrix(true_labels, predicted_labels, classes=[1,0]):
    
    total_classes = len(classes)
    level_labels = [total_classes*[0], list(range(total_classes))]

    cm = metrics.confusion_matrix(y_true=true_labels, y_pred=predicted_labels, 
                                  labels=classes)
    # pandas renamed the MultiIndex 'labels' argument to 'codes' in version 0.24
    cm_frame = pd.DataFrame(data=cm, 
                            columns=pd.MultiIndex(levels=[['Predicted:'], classes], 
                                                  codes=level_labels), 
                            index=pd.MultiIndex(levels=[['Actual:'], classes], 
                                                codes=level_labels)) 
    print(cm_frame) 
    
def display_classification_report(true_labels, predicted_labels, classes=[1,0]):

    report = metrics.classification_report(y_true=true_labels, 
                                           y_pred=predicted_labels, 
                                           labels=classes) 
    print(report)
    
    
    
def display_model_performance_metrics(true_labels, predicted_labels, classes=[1,0]):
    print('Model Performance metrics:')
    print('-'*30)
    get_metrics(true_labels=true_labels, predicted_labels=predicted_labels)
    print('\nModel Classification report:')
    print('-'*30)
    display_classification_report(true_labels=true_labels, predicted_labels=predicted_labels, 
                                  classes=classes)
    print('\nPrediction Confusion Matrix:')
    print('-'*30)
    display_confusion_matrix(true_labels=true_labels, predicted_labels=predicted_labels, 
                             classes=classes)


def plot_model_decision_surface(clf, train_features, train_labels,
                                plot_step=0.02, cmap=plt.cm.RdYlBu,
                                markers=None, alphas=None, colors=None):
    
    if train_features.shape[1] != 2:
        raise ValueError("X_train should have exactly 2 columns!")
    
    x_min, x_max = train_features[:, 0].min() - plot_step, train_features[:, 0].max() + plot_step
    y_min, y_max = train_features[:, 1].min() - plot_step, train_features[:, 1].max() + plot_step
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    clf_est = clone(clf)
    clf_est.fit(train_features,train_labels)
    if hasattr(clf_est, 'predict_proba'):
        Z = clf_est.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:,1]
    else:
        Z = clf_est.predict(np.c_[xx.ravel(), yy.ravel()])    
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=cmap)
    
    le = LabelEncoder()
    y_enc = le.fit_transform(train_labels)
    n_classes = len(le.classes_)
    plot_colors = ''.join(colors) if colors else [None] * n_classes
    label_names = le.classes_
    markers = markers if markers else [None] * n_classes
    alphas = alphas if alphas else [None] * n_classes
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y_enc == i)
        plt.scatter(train_features[idx, 0], train_features[idx, 1], c=color,
                    label=label_names[i], cmap=cmap, edgecolors='black', 
                    marker=markers[i], alpha=alphas[i])
    plt.legend()
    plt.show()


def plot_model_roc_curve(clf, features, true_labels, label_encoder=None, class_names=None):
    
    ## Compute ROC curve and ROC area for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    if hasattr(clf, 'classes_'):
        class_labels = clf.classes_
    elif label_encoder:
        class_labels = label_encoder.classes_
    elif class_names:
        class_labels = class_names
    else:
        raise ValueError('Unable to derive prediction classes, please specify class_names!')
    n_classes = len(class_labels)
    y_test = label_binarize(true_labels, classes=class_labels)
    if n_classes == 2:
        if hasattr(clf, 'predict_proba'):
            prob = clf.predict_proba(features)
            y_score = prob[:, prob.shape[1]-1] 
        elif hasattr(clf, 'decision_function'):
            # For binary problems decision_function returns a 1-D array of scores
            y_score = clf.decision_function(features)
        else:
            raise AttributeError("Estimator doesn't have a probability or confidence scoring system!")
        
        fpr, tpr, _ = roc_curve(y_test, y_score)      
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, label='ROC curve (area = {0:0.2f})'
                                 ''.format(roc_auc),
                 linewidth=2.5)
        
    elif n_classes > 2:
        if hasattr(clf, 'predict_proba'):
            y_score = clf.predict_proba(features)
        elif hasattr(clf, 'decision_function'):
            y_score = clf.decision_function(features)
        else:
            raise AttributeError("Estimator doesn't have a probability or confidence scoring system!")

        for i in range(n_classes):
            fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
            roc_auc[i] = auc(fpr[i], tpr[i])

        ## Compute micro-average ROC curve and ROC area
        fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
        roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

        ## Compute macro-average ROC curve and ROC area
        # First aggregate all false positive rates
        all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
        # Then interpolate all ROC curves at these points
        mean_tpr = np.zeros_like(all_fpr)
        for i in range(n_classes):
            mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
        # Finally average it and compute AUC
        mean_tpr /= n_classes
        fpr["macro"] = all_fpr
        tpr["macro"] = mean_tpr
        roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

        ## Plot ROC curves
        plt.figure(figsize=(6, 4))
        plt.plot(fpr["micro"], tpr["micro"],
                 label='micro-average ROC curve (area = {0:0.2f})'
                       ''.format(roc_auc["micro"]), linewidth=3)

        plt.plot(fpr["macro"], tpr["macro"],
                 label='macro-average ROC curve (area = {0:0.2f})'
                       ''.format(roc_auc["macro"]), linewidth=3)

        for i, label in enumerate(class_labels):
            plt.plot(fpr[i], tpr[i], label='ROC curve of class {0} (area = {1:0.2f})'
                                           ''.format(label, roc_auc[i]), 
                     linewidth=2, linestyle=':')
    else:
        raise ValueError('Number of classes should be at least 2')
        
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.show()
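
The micro- and macro-averaging steps used in `plot_model_roc_curve` can be checked in isolation, without plotting. The block below is a minimal sketch under stated assumptions: the synthetic 3-class data from `make_classification` and the `LogisticRegression` settings are placeholders chosen for illustration, not the notebook's review features or tuned models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 3-class data stands in for the review features (assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)                   # shape (n_samples, n_classes)
y_bin = label_binarize(y_test, classes=clf.classes_)  # one-vs-rest indicator matrix

# One ROC curve per class (one-vs-rest).
fpr, tpr, roc_auc = {}, {}, {}
for i in range(len(clf.classes_)):
    fpr[i], tpr[i], _ = roc_curve(y_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Micro-average: pool every (sample, class) decision into one binary problem.
fpr_micro, tpr_micro, _ = roc_curve(y_bin.ravel(), y_score.ravel())
auc_micro = auc(fpr_micro, tpr_micro)

# Macro-average: interpolate each class curve on a shared FPR grid, then average.
all_fpr = np.unique(np.concatenate([fpr[i] for i in fpr]))
mean_tpr = np.mean([np.interp(all_fpr, fpr[i], tpr[i]) for i in fpr], axis=0)
auc_macro = auc(all_fpr, mean_tpr)

print("per-class AUC:", {int(k): round(float(v), 3) for k, v in roc_auc.items()})
print("micro AUC: {:.3f}, macro AUC: {:.3f}".format(auc_micro, auc_macro))
```

Micro-averaging weights every prediction equally, so frequent classes dominate the score, while macro-averaging weights every class equally. On an imbalanced label distribution the two can differ noticeably.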

In [None]:
from sklearn.linear_model import SGDClassifier, LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=100, C=1)
# `n_iter` was removed from SGDClassifier in newer scikit-learn releases; `max_iter` replaces it
sgd = SGDClassifier(loss='hinge', max_iter=100)

## <a id='2.3.1'>2.3.1 Logistic Regression model on CountVectorizer</a>

In [None]:
# Logistic Regression model on BOW features
lr_bow_predictions = train_predict_model(classifier=lr, 
                                             train_features=cv_train_features, train_labels=sentiments_train,
                                             test_features=cv_test_features, test_labels=sentiments_test)
display_model_performance_metrics(true_labels=sentiments_test, predicted_labels=lr_bow_predictions,
                                      classes=[0,1,2,3,4])
                                    

## <a id='2.3.2'>2.3.2 Logistic Regression model on TF-IDF features</a>


In [None]:
# Logistic Regression model on TF-IDF features
lr_tfidf_predictions = train_predict_model(classifier=lr, 
                                               train_features=tv_train_features, train_labels=sentiments_train,
                                               test_features=tv_test_features, test_labels=sentiments_test)
display_model_performance_metrics(true_labels=sentiments_test, predicted_labels=lr_tfidf_predictions,
                                      classes=[0,1,2,3,4])


## <a id='2.3.3'>2.3.3 SGD model on CountVectorizer</a>

In [None]:
# SGD model on BOW (CountVectorizer) features
sgd_bow_predictions = train_predict_model(classifier=sgd, 
                                             train_features=cv_train_features, train_labels=sentiments_train,
                                             test_features=cv_test_features, test_labels=sentiments_test)
display_model_performance_metrics(true_labels=sentiments_test, predicted_labels=sgd_bow_predictions,
                                      classes=[0,1,2,3,4])

## <a id='2.3.4'>2.3.4 SGD model on TF-IDF</a>

In [None]:
# SGD model on TF-IDF
sgd_tfidf_predictions = train_predict_model(classifier=sgd, 
                                                train_features=tv_train_features, train_labels=sentiments_train,
                                                test_features=tv_test_features, test_labels=sentiments_test)
display_model_performance_metrics(true_labels=sentiments_test, predicted_labels=sgd_tfidf_predictions,
                                      classes=[0,1,2,3,4])

## <a id='2.3.5'>2.3.5 RandomForest model on TF-IDF</a>

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_jobs=-1)

In [None]:
# RandomForest model on TF-IDF
rfc_tfidf_predictions = train_predict_model(classifier=rfc, 
                                                train_features=tv_train_features, train_labels=sentiments_train,
                                                test_features=tv_test_features, test_labels=sentiments_test)
display_model_performance_metrics(true_labels=sentiments_test, predicted_labels=rfc_tfidf_predictions,
                                      classes=[0,1,2,3,4])

**Of the five model and feature combinations evaluated above, Logistic Regression on TF-IDF features performs best on the reported metrics.**
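
A comparison like this is easier to read when the scores sit side by side. The sketch below shows one way to collect accuracy and macro-averaged F1 for each model and feature set in a single loop. It is illustrative only: the tiny corpus, its binary labels, and the variable names (`train_texts`, `vectorizers`, `models`, `results`) are hypothetical and do not come from the notebook, and the notebook's `train_predict_model` helper is not used.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy corpus (1 = positive, 0 = negative); not the Kaggle data.
train_texts = ["a wonderful , moving film", "terrible plot and wooden acting",
               "great performances all around", "boring and far too long",
               "a delightful surprise", "a dull , lifeless mess"]
train_labels = [1, 0, 1, 0, 1, 0]
test_texts = ["wonderful acting", "a boring mess"]
test_labels = [1, 0]

vectorizers = {"BOW": CountVectorizer(), "TF-IDF": TfidfVectorizer()}
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SGD (hinge)": SGDClassifier(loss="hinge", max_iter=1000, random_state=0),
}

results = []
for feat_name, vec in vectorizers.items():
    # Fit the vocabulary on the training split only, then reuse it for the test split.
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)
    for model_name, model in models.items():
        # clone() gives a fresh, unfitted copy so the runs do not share state.
        preds = clone(model).fit(X_train, train_labels).predict(X_test)
        results.append({
            "model": model_name,
            "features": feat_name,
            "accuracy": accuracy_score(test_labels, preds),
            "macro_f1": f1_score(test_labels, preds, average="macro", zero_division=0),
        })

# Best macro F1 first.
results.sort(key=lambda r: r["macro_f1"], reverse=True)
for r in results:
    print("{model:<20} {features:<7} acc={accuracy:.2f} macro-F1={macro_f1:.2f}".format(**r))
```

Macro F1 is included alongside accuracy because the five sentiment classes may be imbalanced, in which case accuracy alone can favor a model that mostly predicts the majority class.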

