# Sentiment Analysis

In this project, I build two models for binary classification of text sentiment. The first is a feedforward neural network with an embedding layer, built with Python's TensorFlow library; the second is a gradient-boosted decision tree model built with the XGBoost library. I train both models on movie reviews to see how well they predict whether a review is positive or negative. I also train them on Steam reviews to predict, from the wording of a review, whether the person liked the game.


In [1]:
#Importing libraries
import tensorflow as tf
import pandas as pd
import numpy as np
import xgboost as xgb

In [2]:
#Loading the dataset
# To run the model on Steam reviews, use 'Steam_reviews.csv'; for movie reviews, use 'IMDB Dataset.csv'

fileName = 'IMDB Dataset.csv'
data = pd.read_csv(fileName)
print(len(data))
data 

50000


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


### Pre-processing Data:

In [3]:
#Preprocessing the data:

# For movie reviews, uncomment this block of code and comment out the other one:

data['review'] = data['review'].str.replace('<br /><br />', '', regex=False) # drop HTML line-break markers
data['review'] = data['review'].str.replace(r'[^\w\s]', '', regex=True).str.lower() # strip punctuation, then lowercase
data['tokens'] = data['review'].str.split() #tokenizing to see if there are unnecessary characters
data['sentiment'] = (data['sentiment'] == 'positive').astype(int)


# For steam reviews, uncomment the following block of code and comment out the other one:

# data['review'] = data['user_review'].str.replace(r'[^\w\s]', '', regex=True).str.lower()
# data['sentiment'] = data['user_suggestion']


data

Unnamed: 0,review,sentiment,tokens
0,one of the other reviewers has mentioned that ...,1,"[one, of, the, other, reviewers, has, mentione..."
1,a wonderful little production the filming tech...,1,"[a, wonderful, little, production, the, filmin..."
2,i thought this was a wonderful way to spend ti...,1,"[i, thought, this, was, a, wonderful, way, to,..."
3,basically theres a family where a little boy j...,0,"[basically, theres, a, family, where, a, littl..."
4,petter matteis love in the time of money is a ...,1,"[petter, matteis, love, in, the, time, of, mon..."
...,...,...,...
49995,i thought this movie did a down right good job...,1,"[i, thought, this, movie, did, a, down, right,..."
49996,bad plot bad dialogue bad acting idiotic direc...,0,"[bad, plot, bad, dialogue, bad, acting, idioti..."
49997,i am a catholic taught in parochial elementary...,0,"[i, am, a, catholic, taught, in, parochial, el..."
49998,im going to have to disagree with the previous...,0,"[im, going, to, have, to, disagree, with, the,..."
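
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same three operations (drop the `<br /><br />` markers, strip punctuation, lowercase) to a single string using only the standard library. The helper name `clean_review` and the sample text are illustrative assumptions and are not part of the notebook.

```python
import re

def clean_review(text):
    """Mirror the pandas cleaning steps on one string (illustrative helper)."""
    text = text.replace('<br /><br />', '')   # drop HTML line-break markers
    text = re.sub(r'[^\w\s]', '', text)       # remove everything except word characters and whitespace
    return text.lower()                       # lowercase the result

# Hypothetical sample text, shaped like an IMDB review
sample = "A wonderful little production. <br /><br />The filming technique is very unassuming!"
cleaned = clean_review(sample)
tokens = cleaned.split()                      # whitespace tokenization, as in the 'tokens' column

print(cleaned)
print(tokens[:5])
```

Note that a break marker with no surrounding spaces would glue two words together under this scheme, so replacing the marker with a space may be worth considering.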


In [4]:
# Remove extra columns for steam reviews dataset only. Uncomment the following lines for steam reviews dataset but comment for movie reviews.
# data = data[['review', 'sentiment']]
# data

In [4]:
#Splitting the data into training and testing sets by slicing the rows in order
#(scikit-learn's train_test_split function is not used here)

# For movie reviews, uncomment this block of code and comment out the other one:

X_train = data['review'][:40000]
y_train = data['sentiment'][:40000]
X_test = data['review'][40000:]
y_test = data['sentiment'][40000:]

# For steam reviews, uncomment the following block of code and comment out the other one:

# X_train = data['review'][:10000]
# y_train = data['sentiment'][:10000]
# X_test = data['review'][10000:]
# y_test = data['sentiment'][10000:]

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(40000,)
(40000,)
(10000,)
(10000,)
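
The slices above take the first rows for training, which quietly assumes the file is not ordered by label. If that assumption is in doubt, shuffling the row indices before slicing is a common precaution. The sketch below shows the idea with the standard library only; `shuffled_split`, the seed value, and the toy data are hypothetical and not part of the notebook.

```python
import random

def shuffled_split(items, labels, train_size, seed=42):
    """Shuffle indices with a fixed seed, then slice into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    train_idx, test_idx = indices[:train_size], indices[train_size:]
    X_tr = [items[i] for i in train_idx]
    y_tr = [labels[i] for i in train_idx]
    X_te = [items[i] for i in test_idx]
    y_te = [labels[i] for i in test_idx]
    return X_tr, y_tr, X_te, y_te

# Toy data: ten fake reviews, ordered so that all positives come first
toy_reviews = [f"review {i}" for i in range(10)]
toy_labels = [1] * 5 + [0] * 5

X_tr, y_tr, X_te, y_te = shuffled_split(toy_reviews, toy_labels, train_size=8)
print(len(X_tr), len(X_te))
```

Each review stays paired with its own label because both lists are indexed with the same shuffled indices.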


## Model 1 : Feedforward Neural Network using TensorFlow and Keras

In [6]:
#Creating a tokenizer using Keras from TensorFlow
from tensorflow.keras.preprocessing.text import Tokenizer
# Only keep 10,000 most common words and call other words 'out of 
# vocabulary words' or 'OOV'
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
#Building tokenizer vocabulary on the training data
tokenizer.fit_on_texts(X_train)
#print(dict(list(tokenizer.word_index.items())[0:30])) #uncomment to see the 30 most common tokens

In [7]:
#Converting the text into numerical sequence
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
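
The tokenizer steps above can be pictured as building a frequency-ranked word index and then looking each word up, with rare or unseen words mapped to the OOV index. The following is a simplified sketch of that idea in plain Python, not the Keras implementation (Keras also lowercases and filters punctuation by default). The function names and toy texts are assumptions made for illustration.

```python
from collections import Counter

def build_word_index(texts, oov_token='<OOV>'):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words, oov_token='<OOV>'):
    """Map words to indices; unseen words and ranks >= num_words become OOV."""
    oov = word_index[oov_token]
    sequence = []
    for word in text.split():
        idx = word_index.get(word, oov)
        sequence.append(idx if idx < num_words else oov)
    return sequence

train_texts = ['good movie', 'good plot', 'bad movie good']   # toy training texts
word_index = build_word_index(train_texts)
sequence = to_sequence('good film bad', word_index, num_words=5)
print(word_index)
print(sequence)
```

Here `film` never appeared in the training texts and `bad` falls outside the `num_words` cutoff, so both are replaced by the OOV index.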

In [8]:
#Padding the sequences to ensure all sequences have the same length because
# we need to input all sequences of same length into the NN to train it.

from tensorflow.keras.preprocessing.sequence import pad_sequences
#limiting the length to 100 tokens
max_length = 100

X_train_pad = pad_sequences(X_train_seq, maxlen=max_length, padding='post', truncating='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_length, padding='post', truncating='post')
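
As a rough illustration of what `padding='post'` and `truncating='post'` do, the sketch below pads or trims a single list of token indices in plain Python. The helper `pad_post` is hypothetical; it only mimics the behavior for one sequence and is not how Keras implements it.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Keep the first maxlen tokens, then fill the tail with pad_value."""
    trimmed = sequence[:maxlen]                              # truncating='post' drops tokens from the end
    return trimmed + [pad_value] * (maxlen - len(trimmed))   # padding='post' appends zeros at the end

short_padded = pad_post([5, 8, 2], maxlen=5)
long_trimmed = pad_post([5, 8, 2, 9, 4, 7], maxlen=5)
print(short_padded)   # short sequence gets trailing zeros
print(long_trimmed)   # long sequence loses its last token
```

Whatever convention is chosen here should also be used when padding new text at prediction time, so the model sees inputs laid out the same way as during training.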

In [9]:
#Defining the model: a feedforward NN.
# The first layer is an embedding layer, followed by average pooling over the
# sequence and a hidden ReLU layer. The last layer is a single neuron with a
# sigmoid activation function, which produces a probability between 0 and 1.

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 16)           160000    
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 16)                272       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
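
The parameter counts in the summary can be checked by hand: an embedding layer stores one vector per vocabulary entry, and a dense layer stores one weight per input-output pair plus one bias per output. A short sketch of that arithmetic, using the layer sizes from the model definition (the variable names are only for illustration):

```python
vocab_size = 10000     # input_dim of the embedding layer
embed_dim = 16         # output_dim of the embedding layer
hidden_units = 16      # units in the hidden Dense layer

embedding_params = vocab_size * embed_dim                 # one 16-dim vector per word index
pooling_params = 0                                        # averaging has nothing to learn
hidden_params = embed_dim * hidden_units + hidden_units   # weights plus biases
output_params = hidden_units * 1 + 1                      # weights plus one bias

total_params = embedding_params + pooling_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.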


In [10]:
#Compiling the model while using binary cross entropy loss
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
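
For reference, binary cross-entropy averages the negative log-probability that the model assigned to the correct label. The sketch below computes it in plain Python for a handful of made-up predictions; the function name and the clipping constant `eps` are assumptions for illustration, not the Keras internals.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)     # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confidently wrong predictions are penalized much more heavily than confidently right ones are rewarded, which is why the loss can rise even while accuracy stays flat.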

In [11]:
#Training and Evaluating the model

#Re-running this cell continues training from the current weights for 5 more epochs, which can lead to overfitting.
#Note that the test set doubles as the validation set here.
history = model.fit(X_train_pad, y_train, epochs=5, validation_data=(X_test_pad, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [12]:
train_loss, train_accuracy = model.evaluate(X_train_pad, y_train)
print("Training Accuracy: ", train_accuracy)
print("Training Loss: ", train_loss)
test_loss, test_accuracy = model.evaluate(X_test_pad, y_test)
print("Testing Accuracy: ", test_accuracy)
print("Testing Loss: ", test_loss)

Training Accuracy:  0.9186750054359436
Training Loss:  0.2134408950805664
Testing Accuracy:  0.8360000252723694
Testing Loss:  0.39441898465156555


In [13]:
#Predicting a review from the user.
if fileName == 'Steam_reviews.csv':
    review = input('Enter a game review: ')
else:
    review = input('Enter a movie review: ')
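review = tokenizer.texts_to_sequences([review])
# Pad and truncate the same way as the training data (post-padding, post-truncation)
review = pad_sequences(review, maxlen=max_length, padding='post', truncating='post')
prediction = model.predict(review)[0][0] # probability that the review is positive

# Print the predicted sentiment
if prediction > 0.5: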
    print('Positive review')
else:
    print('Negative review')
    
#It does better on longer reviews than short reviews.

Enter a movie review:  Absolute sh*te game! DO NOT PLAY!!!  Lets be honest here, its a game with brilliant graphics, but that's the only positive about this game. This game throws you in at the deep end with other players who have been playing this game for a very long time and will NOT give you a chance to fire a shot. Even if you did manage to get a shot off, it will no doubt do zero damage to the target while apparently your own armour might as well be made out of hopes and dreams!


Negative review


## Model 2 : Gradient-Boosted Decision Trees using XGBoost

In [5]:
# Run all the cells in the preprocessing section before running this model.
# (Don't run this cell twice in a row: it overwrites X_train and X_test with the
# vectorized matrices, so you would need to preprocess again!)

#Now we extract features using Scikit-learn's CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer() # XGBoost needs one fixed-length numeric feature vector per review, so we use bag-of-words counts instead of the Keras token sequences
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
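
To show what the vectorizer produces, here is a minimal bag-of-words sketch in plain Python. It assumes `CountVectorizer`'s default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) and returns dense lists, whereas the real class returns a sparse matrix. The function names and toy texts are hypothetical.

```python
import re
from collections import Counter

# Same shape as CountVectorizer's default token pattern: words of 2+ characters
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')

def fit_vocabulary(texts):
    """Assign a column index to every distinct token, in alphabetical order."""
    words = sorted({w for text in texts for w in TOKEN_PATTERN.findall(text.lower())})
    return {word: column for column, word in enumerate(words)}

def transform(texts, vocabulary):
    """Turn each text into a row of counts; unseen words are ignored."""
    rows = []
    for text in texts:
        counts = Counter(TOKEN_PATTERN.findall(text.lower()))
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

train_texts = ['good movie good plot', 'bad movie']   # toy training texts
vocabulary = fit_vocabulary(train_texts)
train_rows = transform(train_texts, vocabulary)
test_rows = transform(['a great movie'], vocabulary)
print(vocabulary)
print(train_rows)
print(test_rows)
```

The vocabulary is learned from the training texts only, which is why the test text is passed through `transform` and words such as `great` simply contribute nothing.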


In [6]:
# Defining and training the gradient-boosted decision tree model with a binary logistic objective

model2 = xgb.XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=100, objective='binary:logistic')
model2.fit(X_train, y_train)
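
With `objective='binary:logistic'`, the ensemble produces a raw score (a margin) by adding up the outputs of its trees, and a sigmoid turns that score into a probability. The sketch below illustrates only that final step. The leaf values and the `base_margin` default are made-up numbers; the real library learns them during training.

```python
import math

def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def boosted_probability(leaf_values, base_margin=0.0):
    """Combine per-tree leaf values into one probability (illustrative only)."""
    return sigmoid(base_margin + sum(leaf_values))

# Hypothetical outputs of four trees for one review
leaf_values = [0.12, 0.08, -0.03, 0.10]
probability = boosted_probability(leaf_values)
label = int(probability > 0.5)     # 1 = positive, 0 = negative
print(probability, label)
```

Each tree nudges the score a little, which is why many shallow trees with a small learning rate can add up to a confident prediction.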

In [7]:
#Evaluating the model using Scikit-learn's model evaluation functions:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

pred1 = model2.predict(X_train)
pred2 = model2.predict(X_test)

print('Training Accuracy:', accuracy_score(y_train, pred1))
print('Testing Accuracy:', accuracy_score(y_test, pred2))
#print('Confusion matrix:', confusion_matrix(y_test, pred2)) #Uncomment to see the confusion matrix


Training Accuracy: 0.8532
Testing Accuracy: 0.8264
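
Accuracy alone hides which kind of mistake the model makes. A confusion matrix splits the predictions into true negatives, false positives, false negatives, and true positives. The sketch below counts them in plain Python on made-up labels, using the same row and column layout as scikit-learn's `confusion_matrix` for labels 0 and 1; the helper name and data are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions."""
    tn = fp = fn = tp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 0 and guess == 0:
            tn += 1
        elif truth == 0 and guess == 1:
            fp += 1
        else:
            fn += 1
    return [[tn, fp], [fn, tp]]

# Made-up labels and predictions for six reviews
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]

matrix = confusion_counts(toy_true, toy_pred)
accuracy = (matrix[0][0] + matrix[1][1]) / len(toy_true)
print(matrix, accuracy)
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the number of examples.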
