<a href="https://colab.research.google.com/github/SumitraMukherjee/analytics/blob/master/SM_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Polarity Detection
- We train a model on labeled *IMDB* movie reviews (https://keras.io/api/datasets/imdb/) and use it to predict the polarity (*negative* / *positive*) of comments we provide.
- To vectorize the text, we use *TfidfVectorizer* from *scikit-learn* (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
- To classify sentiment, we use *RidgeClassifier* from *scikit-learn* (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html).
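
To make the vectorization step concrete, here is a minimal sketch of what *TfidfVectorizer* produces when it extracts unigrams and bigrams. The two comments in `docs` are hypothetical toy examples, not IMDB data, and `max_df` is left out because it would discard most terms in such a tiny corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy comments, not taken from the IMDB data.
docs = ["good movie", "bad movie"]

# Unigrams and bigrams with sublinear term-frequency scaling.
vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X = vec.fit_transform(docs)  # sparse matrix: one row per comment, one column per n-gram

terms = sorted(vec.vocabulary_)  # every unigram and bigram found in the toy comments
print(terms)
print(X.shape)

# Rows are L2-normalized by default, so each comment vector has unit length.
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)
```

Each comment becomes one sparse row of n-gram weights, which is the input the classifier receives.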


## Import libraries

In [1]:
from keras.datasets import imdb # to import dataset
import numpy as np # for computation
import pandas as pd # for data handling
from time import time # for run times
from sklearn.linear_model import RidgeClassifier # for classification
from sklearn.feature_extraction.text import TfidfVectorizer # to vectorize text
from sklearn.pipeline import Pipeline # to combine vectorizer and classifier
from sklearn.metrics import accuracy_score # metric for model evaluation

Using TensorFlow backend.


## Read data

In [2]:
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
print("%d labeled comments for training model" %len(train_data))
print("%d labeled comments for testing model" %len(test_data))

word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

train = pd.DataFrame([' '.join([reverse_word_index.get(i - 3, '?') for i in s]) for s in train_data], columns=['comment'])
train['label'] = train_labels

test = pd.DataFrame([' '.join([reverse_word_index.get(i - 3, '?') for i in s]) for s in test_data], columns=['comment'])
test['label'] = test_labels

print("\nFirst 5 labeled comments for training (1: Positive, 0:Negative)")
pd.set_option('display.max_colwidth', 160)
train.head()

25000 labeled comments for training model
25000 labeled comments for testing model

First 5 labeled comments for training (1: Positive, 0:Negative)


| | comment | label |
|---|---|---|
| 0 | ? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there... | 1 |
| 1 | ? big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i've seen hun... | 0 |
| 2 | ? this has to be one of the worst films of the 1990s when my friends i were watching this film being the target audience it was aimed at we just sat watched... | 0 |
| 3 | ? the ? ? at storytelling the traditional sort many years after the event i can still see in my ? eye an elderly lady my friend's mother retelling the battl... | 1 |
| 4 | ? worst mistake of my life br br i picked this movie up at target for 5 because i figured hey it's sandler i can get some cheap laughs i was wrong completel... | 0 |
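
Every decoded comment starts with `?`, and further `?` marks appear inside some comments. This follows from the `i - 3` offset used when decoding: with the default settings of `imdb.load_data`, index 1 marks the start of a sequence and index 2 stands for an out-of-vocabulary word, so real words begin at index 4. The sketch below reproduces that behavior on a hypothetical three-word vocabulary; `toy_word_index` and `decode` are illustrative names, not part of Keras.

```python
# Hypothetical miniature vocabulary in the same shape as imdb.get_word_index():
# word -> frequency rank, starting at 1.
toy_word_index = {"the": 1, "film": 2, "brilliant": 3}
toy_reverse = {rank: word for word, rank in toy_word_index.items()}


def decode(sequence, reverse_index, offset=3):
    """Map encoded integers back to words; reserved or unknown indices become '?'."""
    return " ".join(reverse_index.get(i - offset, "?") for i in sequence)


# 1 = start marker, 2 = out-of-vocabulary marker,
# 4 -> rank 1 ("the"), 6 -> rank 3 ("brilliant"), 5 -> rank 2 ("film")
encoded = [1, 4, 6, 5, 2]
decoded = decode(encoded, toy_reverse)
print(decoded)  # prints "? the brilliant film ?"
```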


## Train and validate model

In [3]:
clf = RidgeClassifier(tol=1e-2, solver="sag", max_iter=1000) # classifier 
vec = TfidfVectorizer(ngram_range=(1,2), sublinear_tf=True, max_df = 0.5) # vectorizer
model = Pipeline([('v', vec), ('c', clf)])
print("Training classifier ... this may take over 30 seconds")
st = time() # start time
model.fit(train.comment.values, train.label.values) # train model
y_pred = model.predict(test.comment.values) # predict polarity for test examples
acc = accuracy_score(test.label.values, y_pred) # accuracy; signature is (y_true, y_pred)
t = time() - st # time taken for training and prediction
print(f'Accuracy on validation examples = {acc:.3f}, Time = {t:.3f}')

Training classifier ... this may take over 30 seconds
Accuracy on validation examples = 0.906, Time = 46.752
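
The reported accuracy is simply the fraction of test comments whose predicted label matches the true label. A minimal sketch, using hypothetical labels and predictions rather than the IMDB results:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 1, 1, 0])

acc_sklearn = accuracy_score(y_true, y_hat)  # signature is (y_true, y_pred)
acc_manual = np.mean(y_true == y_hat)        # fraction of matching labels
print(acc_sklearn, acc_manual)
```

Here four of the five predictions match, so both values are 0.8.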


## Predict polarity of new comments

In [5]:
s = input("Type in a comment: \n").lower()
polarity = model.predict([s])[0]
confidence = model.decision_function([s])[0]
print('Comment expresses ', ['Negative', 'Positive'][polarity], 'sentiment.')
print(f'Strength of sentiment = {confidence:.3f}')


Type in a comment: 
this annoying session is a total waste of time
Comment expresses  Negative sentiment.
Strength of sentiment = -1.534
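
The sign of the value returned by `decision_function` determines the predicted class, and its magnitude is what this notebook reports as the strength of sentiment. The sketch below checks that relationship on a hypothetical four-comment training set; `toy_model`, `texts`, and `new_comments` are illustrative names, and a model trained on so little data is not meaningful in itself.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy training set (1: positive, 0: negative), far too small for real use.
texts = [
    "a brilliant and moving film",
    "wonderful acting and a great story",
    "a dull and boring mess",
    "terrible script and awful acting",
]
labels = [1, 1, 0, 0]

toy_model = Pipeline([
    ("v", TfidfVectorizer(ngram_range=(1, 2))),
    ("c", RidgeClassifier()),
])
toy_model.fit(texts, labels)

new_comments = ["what a great film", "boring and awful"]
scores = toy_model.decision_function(new_comments)  # one signed score per comment
preds = toy_model.predict(new_comments)             # one 0/1 label per comment

# With binary 0/1 labels, the predicted class is 1 exactly when the score is positive.
print(scores, preds)
```

This is why the negative comment above yields both the label *Negative* and a negative strength value.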
