# Sentiment analysis for entertainment media using Natural Language Processing models
This notebook contains the code used to build and evaluate the proposed models for sentiment analysis of IMDb reviews.

First, run the following cell to install the required libraries:
- pandas
- scikit-learn

In [2]:
# The first step is to be sure that pandas and scikit-learn are installed.
%pip install pandas scikit-learn  

Note: you may need to restart the kernel to use updated packages.





In [3]:
# The needed libraries and models are imported for the project
import pandas as pd
import numpy as np

from sklearn.svm import SVC, LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

from sklearn.pipeline import make_pipeline

In [4]:
# The dataset is read from the file stored in the /dataset folder
dataset = pd.read_csv("./dataset/IMDB Dataset.csv")
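
Before splitting, it can help to confirm that the data has the two columns used below (review and sentiment) and to count the reviews per label. This is a minimal sketch on a small hypothetical stand-in DataFrame (sample), not the real file; only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; only the column names
# ("review", "sentiment") come from the notebook.
sample = pd.DataFrame({
    "review": ["Great film", "Terrible plot", "Loved it", "Boring"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Column names and number of reviews per label
columns = list(sample.columns)
label_counts = sample["sentiment"].value_counts().to_dict()

print(columns)
print(label_counts)
```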

## Preparing the testing and training datasets

In [5]:
# The positive and negative reviews are separated into two DataFrames
positiveReviews = dataset[(dataset["sentiment"] == "positive")]
negativeReviews = dataset[(dataset["sentiment"] == "negative")]

# Two arrays are obtained from the positive reviews: one for the input (review text) and one for the output (label)
positiveInput = positiveReviews["review"].values
positiveOutput = positiveReviews["sentiment"].values

# Two arrays are obtained from the negative reviews: one for the input (review text) and one for the output (label)
negativeInput = negativeReviews["review"].values
negativeOutput = negativeReviews["sentiment"].values

# The training and testing series of inputs and outputs are created
pInput_train, pInput_test, pOutput_train, pOutput_test = train_test_split(positiveInput, positiveOutput, train_size = 0.50, random_state = 54321)
nInput_train, nInput_test, nOutput_train, nOutput_test = train_test_split(negativeInput, negativeOutput, train_size = 0.50, random_state = 54321)

In [6]:
# Positive and negative inputs and outputs are concatenated for training
input_train = np.concatenate((nInput_train, pInput_train), axis = 0)
output_train = np.concatenate((nOutput_train, pOutput_train), axis = 0)

# Positive and negative inputs and outputs are concatenated for testing
input_test = np.concatenate((nInput_test, pInput_test), axis = 0)
output_test = np.concatenate((nOutput_test, pOutput_test), axis = 0)
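
Splitting each class separately and concatenating the halves keeps both sets balanced. A similar effect can be sketched with a single call to train_test_split using its stratify argument. The toy arrays below are hypothetical, and this variant would not reproduce the exact same rows as the cells above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the review texts and labels
reviews = np.array([f"review {i}" for i in range(20)])
labels = np.array(["negative"] * 10 + ["positive"] * 10)

# stratify=labels keeps the label proportions equal in both halves
toy_in_train, toy_in_test, toy_out_train, toy_out_test = train_test_split(
    reviews, labels, train_size=0.50, stratify=labels, random_state=54321
)

def count_labels(values):
    # Number of occurrences of each label, as plain Python types
    names, counts = np.unique(values, return_counts=True)
    return {str(name): int(count) for name, count in zip(names, counts)}

train_counts = count_labels(toy_out_train)
test_counts = count_labels(toy_out_test)
print(train_counts, test_counts)
```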

## Model evaluation logic
The function evaluateModel takes the expected outputs and the model predictions, prints their confusion matrix, and derives the following metrics from it:
- Accuracy
- Precision
- Recall
- F1-Score
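
With VP, VN, FP and FN denoting the true positives, true negatives, false positives and false negatives (the variable names used in the code), the four metrics follow the standard definitions:

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{VP + VN}{VP + VN + FP + FN} \\
\text{Precision} &= \frac{VP}{VP + FP} \\
\text{Recall}    &= \frac{VP}{VP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
```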

In [7]:
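def evaluateModel(outputTest, predictions):
    # Rows of the matrix are the expected labels and columns are the predictions.
    # Labels are sorted alphabetically, so index 0 is "negative" and index 1 is "positive".
    confusionMatrix = confusion_matrix(outputTest, predictions)
    print("Confusion Matrix: ")
    print(confusionMatrix)

    # "negative" (index 0) is taken as the class of interest
    VP = confusionMatrix[0][0]  # expected negative, predicted negative
    VN = confusionMatrix[1][1]  # expected positive, predicted positive
    FN = confusionMatrix[0][1]  # expected negative, predicted positive
    FP = confusionMatrix[1][0]  # expected positive, predicted negative

    accuracy = (VP + VN) / (VP + VN + FP + FN)
    precision = VP / (VP + FP)
    recall = VP / (VP + FN)
    F1 = (2 * precision * recall) / (precision + recall)

    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1:", F1)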
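
scikit-learn lays the confusion matrix out with the expected labels on the rows and the predicted labels on the columns, sorted so that "negative" is index 0. Hand-computed metrics can be cross-checked against the library's own scoring functions. This is a minimal sketch on hypothetical toy labels, assuming "negative" is the class of interest (pos_label).

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)

# Hypothetical toy labels: 4 negative and 4 positive reviews
expected = np.array(["negative"] * 4 + ["positive"] * 4)
predicted = np.array(["negative", "negative", "positive", "positive",
                      "negative", "positive", "positive", "positive"])

# cm[i][j] counts samples whose expected label is i and predicted label is j
cm = confusion_matrix(expected, predicted)

toy_accuracy = accuracy_score(expected, predicted)
toy_precision = precision_score(expected, predicted, pos_label="negative")
toy_recall = recall_score(expected, predicted, pos_label="negative")
toy_f1 = f1_score(expected, predicted, pos_label="negative")

print(cm)
print(toy_accuracy, toy_precision, toy_recall, toy_f1)
```

In this layout, precision for "negative" divides cm[0][0] by the sum of the first column, and recall divides it by the sum of the first row.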

## Proposed model #1: SVC with "rbf" kernel using a TF-IDF vectorizer

In [10]:
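# An SVC with the "rbf" kernel (the default) and TF-IDF vectorization is created.
# gamma='auto' sets the kernel coefficient to 1 / n_features; max_iter caps the solver iterations.
kRbfTfidf = make_pipeline(TfidfVectorizer(), SVC(gamma='auto', max_iter=1000))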
kRbfTfidf.fit(input_train, output_train)
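
The first step of the pipeline turns each review into a row of TF-IDF weights, which the classifier then receives as numeric features. This is a minimal sketch of that step alone, on a hypothetical toy corpus (toy_reviews) rather than the IMDb data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the IMDb reviews
toy_reviews = [
    "a great and moving film",
    "a dull and boring film",
    "great acting, great story",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_reviews)

# One row per review and one column per distinct token.
# The default token pattern drops one-character tokens such as "a".
vocabulary = sorted(vectorizer.vocabulary_)
print(tfidf.shape)
print(vocabulary)
```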



In [11]:
# The predictions for the model are stored and evaluated.
predictionSVCkRbf = kRbfTfidf.predict(input_test)

evaluateModel(output_test, predictionSVCkRbf)

Confusion Matrix: 
[[ 3281  9219]
 [  193 12307]]
Accuracy: 0.62352
Precision: 0.9444444444444444
Recall: 0.26248
F1: 0.4107925378740453


## Proposed model #2: SVC with "linear" kernel using a TF-IDF vectorizer

In [16]:
# An SVC with the "linear" kernel (kLinear) and TF-IDF vectorization is created.
# gamma only applies to the "rbf", "poly" and "sigmoid" kernels, so it has no effect here.
kLinearTfidf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", gamma='auto', max_iter=4000))
kLinearTfidf.fit(input_train, output_train)



In [17]:
# The predictions for the model are stored and evaluated. 
predictionSVCkLinear = kLinearTfidf.predict(input_test)

evaluateModel(output_test, predictionSVCkLinear)

Confusion Matrix: 
[[11087  1413]
 [ 1288 11212]]
Accuracy: 0.89196
Precision: 0.8959191919191919
Recall: 0.88696
F1: 0.8914170854271355


## Proposed model #3: LinearSVC using a TF-IDF vectorizer

In [18]:
# A LinearSVC model with Tfidf vectorization is created
linearTfidf = make_pipeline(TfidfVectorizer(), LinearSVC(max_iter= 4000))
linearTfidf.fit(input_train, output_train)

In [19]:
# The predictions for the model are stored and evaluated.
predictionLinearSVC = linearTfidf.predict(input_test)

evaluateModel(output_test, predictionLinearSVC)

Confusion Matrix: 
[[11123  1377]
 [ 1268 11232]]
Accuracy: 0.8942
Precision: 0.8976676620127512
Recall: 0.88984
F1: 0.8937366919770198
