# Sentiment Analysis of Movie Reviews

![](https://i.imgur.com/6Wfmf2S.png)

> **Problem Statement**: Apply the TF-IDF technique to train ML models for sentiment analysis using data from the "[Sentiment Analysis on Movie Reviews](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews)" Kaggle competition.


Outline:

1. Download and Explore Dataset
2. Implement the TF-IDF Technique
3. Train baseline model & submit to Kaggle
4. Train & finetune different ML models
5. Document & submit your notebook


Dataset: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews


## Download and Explore the Data

Outline:

1. Download Dataset from Kaggle
2. Explore and visualize data

### Download Dataset from Kaggle

- Read the "Description", "Evaluation" and "Data" sections on the Kaggle competition page carefully
- Make sure to download the `kaggle.json` file from your [Kaggle account](https://kaggle.com/me/account) and upload it to Colab

In [None]:
!ls

data  kaggle.json  sample_data	sentiment-analysis-on-movie-reviews.zip  submission.csv


In [None]:
!pip install kaggle --upgrade


In [None]:
import os

In [None]:
os.environ["KAGGLE_CONFIG_DIR"] = '.'

In [None]:
!kaggle competitions download -c sentiment-analysis-on-movie-reviews

sentiment-analysis-on-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
!unzip sentiment-analysis-on-movie-reviews.zip -d data

Archive:  sentiment-analysis-on-movie-reviews.zip
replace data/sampleSubmission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: data/sampleSubmission.csv  
replace data/test.tsv.zip? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: data/test.tsv.zip       
replace data/train.tsv.zip? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: data/train.tsv.zip      


### Explore and Visualize Data

* Load the train, test, and submission files using Pandas
* Explore rows, columns, sample values etc.
* Visualize distribution of target columns

In [None]:
train_fname = 'data/train.tsv.zip'
test_fname = 'data/test.tsv.zip'
sample_fname = 'data/sampleSubmission.csv'

In [None]:
import pandas as pd

In [None]:
#read training data
train_df = pd.read_csv(train_fname, sep = '\t')

#read testing data
test_df = pd.read_csv(test_fname, sep = '\t')

#read submission data
sub_df = pd.read_csv(sample_fname)

In [None]:
train_df

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2
...,...,...,...,...
156055,156056,8544,Hearst 's,2
156056,156057,8544,forced avuncular chortles,1
156057,156058,8544,avuncular chortles,3
156058,156059,8544,avuncular,2


In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PhraseId    156060 non-null  int64 
 1   SentenceId  156060 non-null  int64 
 2   Phrase      156060 non-null  object
 3   Sentiment   156060 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


In [None]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66292 entries, 0 to 66291
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   PhraseId    66292 non-null  int64 
 1   SentenceId  66292 non-null  int64 
 2   Phrase      66292 non-null  object
dtypes: int64(2), object(1)
memory usage: 1.5+ MB


In [None]:
sub_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66292 entries, 0 to 66291
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   PhraseId   66292 non-null  int64
 1   Sentiment  66292 non-null  int64
dtypes: int64(2)
memory usage: 1.0 MB


In [None]:
train_df.Sentiment.value_counts(normalize=True)

2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64

In [None]:
sub_df

Unnamed: 0,PhraseId,Sentiment
0,156061,2
1,156062,2
2,156063,2
3,156064,2
4,156065,2
...,...,...
66287,222348,2
66288,222349,2
66289,222350,2
66290,222351,2


Summarize your insights and learnings from the dataset below:

* Neither the training nor the test data contains null values.
* `Sentiment` ranges from 0 (negative) to 4 (positive).
* The classes are imbalanced: about 51% of phrases are labeled neutral, 21.1% somewhat positive, 17.5% somewhat negative, 5.9% positive, and 4.5% negative.
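
The class shares summarized above can be made easier to read by mapping each numeric label to its name. The sketch below runs on a small hand-made Series rather than `train_df`; the `LABELS` mapping follows the competition's label description, and the helper name `class_shares` and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Label names as described on the competition's Data page.
LABELS = {0: "negative", 1: "somewhat negative", 2: "neutral",
          3: "somewhat positive", 4: "positive"}

def class_shares(sentiment: pd.Series) -> pd.Series:
    """Return the share of each sentiment class, indexed by label name."""
    shares = sentiment.value_counts(normalize=True).sort_index()
    return shares.rename(index=LABELS)

# Toy stand-in for train_df.Sentiment (hypothetical values).
toy_sentiment = pd.Series([2, 2, 2, 2, 3, 3, 1, 1, 4, 0])
toy_shares = class_shares(toy_sentiment)
print(toy_shares)
```

On the real data, `class_shares(train_df.Sentiment)` should give the same numbers as `value_counts(normalize=True)`, only with readable labels.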

## Implement TF-IDF Technique

![](https://i.imgur.com/5VbUPup.png)

Outline:

1. Learn the vocabulary using `TfidfVectorizer`
2. Transform training and test data

### Learn Vocabulary using `TfidfVectorizer`

* Create custom tokenizer with stemming
* Create a list of stop words
* Configure and create `TfidfVectorizer`
* Learn vocabulary from training set
* View sample entries from vocabulary
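
Before relying on the library, it may help to see what a TF-IDF weight is. The sketch below computes the weights by hand for three toy documents, using the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which are the `TfidfVectorizer` defaults. The function name `tfidf` and the toy documents are assumptions for illustration, and whitespace splitting stands in for real tokenization.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute L2-normalised TF-IDF weights for whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)  # raw term counts within this document
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

toy_docs = ["good movie", "bad movie", "good good story"]
toy_weights = tfidf(toy_docs)
for doc, w in zip(toy_docs, toy_weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

In the second document, "bad" appears in only one document while "movie" appears in two, so "bad" receives the larger weight.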

In [None]:
train_df['Phrase'] = train_df['Phrase'].apply(lambda Phrase: Phrase.lower())
train_df.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,a series of escapades demonstrating the adage ...,1
1,2,1,a series of escapades demonstrating the adage ...,2
2,3,1,a series,2
3,4,1,a,2
4,5,1,series,2


In [None]:
import nltk
from nltk.tokenize import word_tokenize

In [None]:
train_df.Phrase[0]

'a series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .'

In [None]:
word_tokenize(train_df.Phrase[0])

['a',
 'series',
 'of',
 'escapades',
 'demonstrating',
 'the',
 'adage',
 'that',
 'what',
 'is',
 'good',
 'for',
 'the',
 'goose',
 'is',
 'also',
 'good',
 'for',
 'the',
 'gander',
 ',',
 'some',
 'of',
 'which',
 'occasionally',
 'amuses',
 'but',
 'none',
 'of',
 'which',
 'amounts',
 'to',
 'much',
 'of',
 'a',
 'story',
 '.']

In [None]:
train_df['Phrase'] = train_df['Phrase'].apply(word_tokenize)
train_df.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,"[a, series, of, escapades, demonstrating, the,...",1
1,2,1,"[a, series, of, escapades, demonstrating, the,...",2
2,3,1,"[a, series]",2
3,4,1,[a],2
4,5,1,[series],2


In [None]:
from nltk.corpus import stopwords

In [None]:
nltk.download('stopwords')
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
stop_words = set(stopwords.words('english'))
train_df['Phrase'] = train_df['Phrase'].apply(lambda tokens: [word for word in tokens if word not in stop_words])
train_df.head(10)

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,"[series, escapades, demonstrating, adage, good...",1
1,2,1,"[series, escapades, demonstrating, adage, good...",2
2,3,1,[series],2
3,4,1,[],2
4,5,1,[series],2
5,6,1,"[escapades, demonstrating, adage, good, goose]",2
6,7,1,[],2
7,8,1,"[escapades, demonstrating, adage, good, goose]",2
8,9,1,[escapades],2
9,10,1,"[demonstrating, adage, good, goose]",2


In [None]:
from nltk.stem import PorterStemmer

In [None]:
Stemmer= PorterStemmer()
train_df['Phrase'] = train_df['Phrase'].apply(lambda tokens: [Stemmer.stem(word) for word in tokens])
train_df.head(10)

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,"[seri, escapad, demonstr, adag, good, goos, al...",1
1,2,1,"[seri, escapad, demonstr, adag, good, goos]",2
2,3,1,[seri],2
3,4,1,[],2
4,5,1,[seri],2
5,6,1,"[escapad, demonstr, adag, good, goos]",2
6,7,1,[],2
7,8,1,"[escapad, demonstr, adag, good, goos]",2
8,9,1,[escapad],2
9,10,1,"[demonstr, adag, good, goos]",2


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
def tokenize(text):
  #split the text into word tokens and reduce each token to its stem
  return [Stemmer.stem(word) for word in word_tokenize(text)]

In [None]:
tfid = TfidfVectorizer(lowercase=True,
                       tokenizer = tokenize,
                       stop_words = 'english')
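
To see how a stemming tokenizer affects the learned vocabulary, here is a minimal sketch on a three-sentence toy corpus. It is not the configuration used in this notebook: it swaps `word_tokenize` for a simple regular expression (an assumption, so that no NLTK data download is needed), omits `stop_words`, and the names `stem_tokenize`, `toy_corpus`, and `toy_vectorizer` are made up for the example.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

toy_stemmer = PorterStemmer()

def stem_tokenize(text):
    """Lowercase the text, keep runs of letters, and stem each token."""
    return [toy_stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

toy_corpus = ["A series of escapades", "One escapade too many", "The series ends"]

# token_pattern=None avoids the warning that the default pattern is unused
# when a custom tokenizer is supplied.
toy_vectorizer = TfidfVectorizer(tokenizer=stem_tokenize, token_pattern=None)
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)
```

Because "escapades" and "escapade" share a stem, they collapse into a single vocabulary entry, which keeps the feature matrix narrower.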

### Transform Training & Test Data

* Transform phrases from training set
* Transform phrases from test set
* Look at some example values

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
#Create training and validation sets.
train_inputs, val_inputs, train_targets, val_targets = train_test_split(train_df.Phrase, train_df.Sentiment,
                                                                        test_size=0.3, random_state=42)

In [None]:
#Override the random split above with a positional split: first 110,000 rows for training, the rest for validation.
train_inputs = train_df[:110_000].Phrase
val_inputs = train_df[110_000:].Phrase
train_targets = train_df[:110_000].Sentiment
val_targets = train_df[110_000:].Sentiment

In [None]:
train_inputs.shape

(110000,)

In [None]:
train_targets.shape

(110000,)

In [None]:
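#join the token lists back into strings, since the vectorizer expects raw text
train_inputs = train_inputs.apply(lambda tokens: ' '.join(tokens))
val_inputs = val_inputs.apply(lambda tokens: ' '.join(tokens))
tfid.fit(train_inputs)
#NOTE: fit() relearns the vocabulary from scratch, so after this second call
#the vectorizer is fitted on val_inputs only
tfid.fit(val_inputs)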



In [None]:
%%time
input_vectors = tfid.transform(train_inputs)
val_vectors = tfid.transform(val_inputs)

CPU times: user 31.1 s, sys: 95.2 ms, total: 31.2 s
Wall time: 31.4 s


In [None]:
input_vectors.toarray().shape

(110000, 6616)
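
A common convention is to learn the vocabulary from the training split only and then transform every split with that same fitted vectorizer. The sketch below illustrates this on toy documents, and also shows that calling `fit` a second time replaces the earlier vocabulary rather than extending it. All names and documents here are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train_docs = ["good movie", "bad movie", "great story"]
toy_val_docs = ["good story", "terrible plot"]

toy_vec = TfidfVectorizer()
toy_vec.fit(toy_train_docs)
train_vocab = set(toy_vec.vocabulary_)

# Transform both splits with the vocabulary learned from the training split.
toy_train_vectors = toy_vec.transform(toy_train_docs)
toy_val_vectors = toy_vec.transform(toy_val_docs)
print(toy_train_vectors.shape, toy_val_vectors.shape)

# Refitting discards the earlier vocabulary instead of extending it.
toy_vec.fit(toy_val_docs)
refit_vocab = set(toy_vec.vocabulary_)
print(sorted(train_vocab))
print(sorted(refit_vocab))
```

Words that occur only in the validation documents ("terrible", "plot") are simply ignored when transforming with the training vocabulary, which mirrors what happens with unseen words at prediction time.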

## Train Baseline Model & Submit to Kaggle

1. Split training and validation sets
2. Train logistic regression model
3. Study predictions on sample phrases
4. Make predictions and submit to Kaggle




### Train Logistic Regression Model



In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression(max_iter=100, solver='sag')

In [None]:
model.fit(input_vectors, train_targets)

### Study Predictions on Sample Inputs

In [None]:
train_preds = model.predict(input_vectors)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
#Training score
accuracy_score(train_targets, train_preds)

0.6540090909090909

In [None]:
val_preds = model.predict(val_vectors)

In [None]:
accuracy_score(val_targets, val_preds)

0.5778115501519757
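
Because about half of the phrases are neutral, a single accuracy figure can hide weak performance on the rarer classes. One way to look closer is a confusion matrix with per-class recall. The sketch below uses small hand-written label arrays, not the model's actual predictions; `toy_true` and `toy_pred` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for ten phrases.
toy_true = np.array([2, 2, 2, 2, 3, 3, 1, 1, 4, 0])
toy_pred = np.array([2, 2, 2, 2, 2, 3, 2, 1, 3, 1])

toy_acc = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes.
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1, 2, 3, 4])

# Per-class recall: correctly predicted count divided by the true count of each class.
toy_recall = toy_cm.diagonal() / toy_cm.sum(axis=1)

print("accuracy:", toy_acc)
print(toy_cm)
print("recall per class:", toy_recall)
```

The same calls should work with `val_targets` and `val_preds` in place of the toy arrays.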

### Make Predictions & Submit to Kaggle

1. Make predictions on Test Dataset
2. Generate & submit CSV on Kaggle
3. Add screenshot of your score



In [None]:
#using the entire dataset for vectorization and training
Vectorizer = TfidfVectorizer(tokenizer = tokenize,
                              stop_words = 'english',
                              lowercase = True,
                              max_features = 2000)

train_df['Phrase'] = train_df['Phrase'].apply(lambda tokens: ' '.join(tokens))

#fit vectorizer
Vectorizer.fit(train_df.Phrase)

#training inputs and outputs
T_inputs = Vectorizer.transform(train_df.Phrase)
T_outputs = train_df.Sentiment.values

#test inputs
test_inputs = Vectorizer.transform(test_df.Phrase)



In [None]:
#initialize model
model1 = LogisticRegression(solver = 'sag',
                            n_jobs = -1)

In [None]:
model1.fit(T_inputs, T_outputs)

In [None]:
import numpy as np

#prediction on training data
T_preds = model1.predict(T_inputs)

#accuracy on predictions
print(f"Prediction accuracy: {accuracy_score(T_outputs, T_preds)}")

#baseline accuracies: always predicting class 2, and uniformly random guesses
print(f"Constant 2s accuracy: {accuracy_score(T_outputs, 2*np.ones(T_preds.shape))}")
print(f"Random array accuracy: {accuracy_score(T_outputs, np.random.choice([0,1,2,3,4],size = T_preds.shape))}")

Prediction accuracy: 0.6238562091503268
Constant 2s accuracy: 0.5099448929898757
Random array accuracy: 0.19862232474689223
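
The two reference scores printed above have simple explanations: always predicting the most frequent class scores that class's share of the data, and uniform random guessing over five classes scores about 1/5. The sketch below checks this on simulated labels. The `toy_class_shares` values are rounded from the class distribution reported earlier, and the labels are drawn at random, not taken from `train_df`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate class shares for labels 0..4 (rounded; they sum to 1).
toy_class_shares = [0.045, 0.175, 0.510, 0.211, 0.059]
toy_labels = rng.choice(5, size=100_000, p=toy_class_shares)

# Baseline 1: always predict the majority class (2).
constant_acc = np.mean(toy_labels == 2)

# Baseline 2: guess uniformly at random among the five classes.
random_guesses = rng.integers(0, 5, size=toy_labels.shape)
random_acc = np.mean(toy_labels == random_guesses)

print(f"Constant 2s accuracy: {constant_acc:.3f}")
print(f"Random guess accuracy: {random_acc:.3f}")
```

Any trained model should be judged against the constant baseline, since it is the harder of the two to beat on this dataset.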


In [None]:
#Test predictions
test_preds = model1.predict(test_inputs)

In [None]:
test_preds

array([2, 2, 2, ..., 2, 2, 1])

In [None]:
sub_df.head()

Unnamed: 0,PhraseId,Sentiment
0,156061,2
1,156062,2
2,156063,2
3,156064,2
4,156065,2


In [None]:
sub_df.Sentiment = test_preds
sub_df.head()

Unnamed: 0,PhraseId,Sentiment
0,156061,2
1,156062,2
2,156063,2
3,156064,2
4,156065,2


In [None]:
sub_df.to_csv("submission.csv",
              index = False)

In [None]:
!head submission.csv

PhraseId,Sentiment
156061,2
156062,2
156063,2
156064,2
156065,2
156066,2
156067,2
156068,2
156069,2


## Train & Finetune Different ML Models


### Model 1: Random Forest

In [None]:
#import Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier

In [None]:
#instantiate the model.
rf = RandomForestClassifier()

In [None]:
%%time
#fit the model.
rf.fit(T_inputs, T_outputs)

CPU times: user 16min 58s, sys: 2.1 s, total: 17min
Wall time: 17min 1s


In [None]:
#training predictions.
rf_preds = rf.predict(T_inputs)

#training accuracy
print(f"Prediction accuracy: {accuracy_score(T_outputs, rf_preds)}")

#baseline accuracies: always predicting class 2, and uniformly random guesses
print(f"Constant 2s accuracy: {accuracy_score(T_outputs, 2*np.ones(rf_preds.shape))}")
print(f"Random array accuracy: {accuracy_score(T_outputs, np.random.choice([0,1,2,3,4],size = rf_preds.shape))}")

Prediction accuracy: 0.807920030757401
Constant 2s accuracy: 0.5099448929898757
Random array accuracy: 0.1990965013456363


In [None]:
rf_test_preds = rf.predict(test_inputs)

#submission files.
rf_df1 = sub_df.copy()
rf_df1.Sentiment = rf_test_preds
rf_df1.to_csv("submission1.csv", index = False)

### Model 2: XGBoost

In [None]:
!pip install xgboost

#import XGBoost classifier.
from xgboost import XGBClassifier



In [None]:
xgb = XGBClassifier()

In [None]:
xgb.fit(T_inputs, T_outputs)

In [None]:
#training predictions.
xgb_preds = xgb.predict(T_inputs)

#training accuracy
print(f"Prediction accuracy: {accuracy_score(T_outputs, xgb_preds)}")

#baseline accuracies: always predicting class 2, and uniformly random guesses
print(f"Constant 2s accuracy: {accuracy_score(T_outputs, 2*np.ones(xgb_preds.shape))}")
print(f"Random array accuracy: {accuracy_score(T_outputs, np.random.choice([0,1,2,3,4],size = xgb_preds.shape))}")

Prediction accuracy: 0.6213443547353582
Constant 2s accuracy: 0.5099448929898757
Random array accuracy: 0.20233243624247085


### Model 3: Decision Tree

In [None]:
#import Decision trees from sklearn
from sklearn.tree import DecisionTreeClassifier

In [None]:
#instantiate the Decision tree model
dt = DecisionTreeClassifier()

In [None]:
#fit the model
dt.fit(T_inputs, T_outputs)

In [None]:
#training predictions.
tree_preds = dt.predict(T_inputs)

#training accuracy
print(f"Prediction accuracy: {accuracy_score(T_outputs, tree_preds)}")

#baseline accuracies: always predicting class 2, and uniformly random guesses
print(f"Constant 2s accuracy: {accuracy_score(T_outputs, 2*np.ones(tree_preds.shape))}")
print(f"Random array accuracy: {accuracy_score(T_outputs, np.random.choice([0,1,2,3,4],size = tree_preds.shape))}")

Prediction accuracy: 0.8079264385492759
Constant 2s accuracy: 0.5099448929898757
Random array accuracy: 0.19915417147251058


In [None]:
tree_test_preds = dt.predict(test_inputs)

#submission files.
sub_df2 = sub_df.copy()
sub_df2.Sentiment = tree_test_preds
sub_df2.to_csv("submission2.csv", index = False)

Future work:
- Try more machine learning models
- Try configuring `TfidfVectorizer` differently (e.g. `max_features`, `ngram_range`)
- Try approaches other than bag of words
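
As one possible direction for configuring the vectorizer differently, word bigrams can capture short negations such as "not good" that single-word features miss. The sketch below compares the vocabularies learned with and without bigrams on a toy corpus; the documents and variable names are hypothetical, and whether bigrams improve the score on this dataset is untested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_phrases = ["not good", "good", "not bad"]

# Unigrams only (the default) versus unigrams plus bigrams.
unigram_vec = TfidfVectorizer(ngram_range=(1, 1)).fit(toy_phrases)
bigram_vec = TfidfVectorizer(ngram_range=(1, 2)).fit(toy_phrases)

unigram_terms = sorted(unigram_vec.vocabulary_)
bigram_terms = sorted(bigram_vec.vocabulary_)
print(unigram_terms)
print(bigram_terms)
```

Bigrams enlarge the vocabulary quickly on real text, so combining `ngram_range` with `max_features` or `min_df` is a reasonable way to keep the matrix manageable.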
