# Introduction

In notebook 9 we conduct sentiment analysis on a movie review data set. This demonstrates both a common classification task that employs natural language processing (NLP) and the machine learning workflow, using a blank notebook as a template.

Sentiment analysis is a subfield of NLP that focuses on determining the sentiment or emotional tone behind words in order to understand the attitudes, opinions, and emotions of the writer. It's often applied to areas like customer feedback, product reviews, and social media monitoring.

The basic premise of sentiment analysis is classifying the polarity of a given text. The classification is often binary (positive, negative), as in this example, but it can feature more than two discrete outputs/labels.

The dataset in this notebook is a collection of movie reviews, each labeled with "positive" or "negative" sentiment, available from Kaggle (Sentiment Polarity Data Set v2.0 from Movie Review Data by Pang, Lee and Vaithyanathan). The data came already split into train and test sets, so after importing it we use the handy scikit-learn (sklearn) library to vectorize the text with TF-IDF, which weights each word by how often it appears in a review, discounted by how common it is across all reviews. After the training and test data are vectorized, we train a support vector classifier (SVC) model from sklearn, which is well suited to the sentiment analysis classification task. After fitting the SVC model, we evaluate its accuracy and precision, and then make predictions on novel reviews to probe its performance and flaws.


# Loading the data

Here we import the pandas package and load our pre-split training and test sets.

In [1]:
import pandas as pd

#importing train test data sets
reviews_training = pd.read_csv("/kaggle/input/movie-reviews-sentiment-polarity/movie_reviews_train.csv")
reviews_test = pd.read_csv("/kaggle/input/movie-reviews-sentiment-polarity/movie_reviews_test.csv")


# Exploring data

The next step is to understand the data we are working with. This is done by looking at a few rows of data and the shape or description of the datasets.

Let's look at the first few lines of data for both the train and test datasets using head().

In [2]:
reviews_training.head()

Unnamed: 0,Content,Label
0,every once in a while you see a film that is s...,pos
1,the love for family is one of the strongest dr...,pos
2,after the terminally bleak reservoir dogs and ...,pos
3,( warning to those who have not seen seven : ...,pos
4,"having not seen , "" who framed roger rabbit "" ...",pos


In [3]:
reviews_test.head()

Unnamed: 0,Content,Label
0,hedwig ( john cameron mitchell ) was born a bo...,pos
1,one of the more unusual and suggestively viole...,pos
2,what do you get when you combine clueless and ...,pos
3,>from the man who presented us with henry : th...,pos
4,tibet has entered the american consciousness s...,pos


We can see the format is consistent across the two datasets. There is one column, “Content”, which is the feature holding the text to be classified, and a second column, “Label”, which marks each review as positive (“pos”) or negative (“neg”).

The describe() function provides additional insight into our data set as seen below.

In [4]:
reviews_training.describe()

Unnamed: 0,Content,Label
count,1800,1800
unique,1800,2
top,every once in a while you see a film that is s...,pos
freq,1,900


In [5]:
reviews_test.describe()

Unnamed: 0,Content,Label
count,200,200
unique,200,2
top,hedwig ( john cameron mitchell ) was born a bo...,pos
freq,1,100


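The describe() function shows that the training set has 1800 records and the test set 200, a 9:1 split. Every review text is unique, and the labels take two unique values, positive or negative. The most frequent label accounts for exactly half of each set (900 of 1800 and 100 of 200), so both sets are balanced.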
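The class balance that describe() hints at can also be read off directly with value_counts(). The following is a minimal sketch on a small made-up DataFrame; `toy_reviews` and its contents are illustrative assumptions that only mirror the column names of the real data.

```python
import pandas as pd

# Hypothetical stand-in for reviews_training, using the same column names
toy_reviews = pd.DataFrame({
    "Content": ["great film", "dull plot", "loved it", "fell asleep"],
    "Label": ["pos", "neg", "pos", "neg"],
})

# Number of rows carrying each label
label_counts = toy_reviews["Label"].value_counts()

# Share of each class; equal shares mean the set is balanced
label_shares = toy_reviews["Label"].value_counts(normalize=True)

print(label_counts)
print(label_shares)
```

Running the same two calls on reviews_training["Label"] and reviews_test["Label"] would show the counts for both classes rather than only the most frequent one.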

Let’s make sure there are no missing values that might cause future problems.

In [6]:
reviews_training.isna().sum()

Content    0
Label      0
dtype: int64

In [7]:
reviews_test.isna().sum()

Content    0
Label      0
dtype: int64

Great, both datasets are free of missing values.

Now that we have a sense of the data we are working with let’s move on to any pre-processing and preparing the data sets.

# Preprocessing the data & Preparing the training and test sets

In our data exploration we found that the two data sets are well organized and in good shape, so they need little additional preprocessing before training. An important preprocessing step for sentiment analysis is to vectorize the data using TF-IDF, which converts the text strings into weighted numerical vectors that the model can train on. We will rely on sklearn's TfidfVectorizer class to accomplish this. Two variables, X_train and X_test, will be created by vectorizing the “Content” column of the original data frames. The label data (Y variables) does not need to be split into separate variables at this point.

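Here we accomplish the key tasks of importing the vectorizer, creating an instance of it, and then applying it to the features in the training and test datasets to create our X_train and X_test variables. Note that the vectorizer is fitted on the training text only; the test text is transformed with the vocabulary learned from training.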

In [8]:
#imports the TF-IDF vectorizer class
from sklearn.feature_extraction.text import TfidfVectorizer

#creates an instance of the vectorizer
vectorizer = TfidfVectorizer()

#learns the vocabulary from the training text and vectorizes it
X_train = vectorizer.fit_transform(reviews_training["Content"])

#vectorizes the test text using the vocabulary learned from training
X_test = vectorizer.transform(reviews_test["Content"])



Let’s look at the vectorized data as a sanity check.

In [9]:
print(X_train)

  (0, 37728)	0.019557322533372945
  (0, 3591)	0.018613017612739185
  (0, 12297)	0.03690729640884355
  (0, 17656)	0.03646139735698193
  (0, 12346)	0.04860691140872605
  (0, 27323)	0.03573101423242628
  (0, 15767)	0.034256965318728114
  (0, 29509)	0.03090024505659402
  (0, 22330)	0.036036739822766194
  (0, 29250)	0.06957206198519013
  (0, 10870)	0.04860691140872605
  (0, 8393)	0.03895421032567651
  (0, 28411)	0.01972213541121538
  (0, 9454)	0.0269234257561756
  (0, 35929)	0.0420119478454016
  (0, 6599)	0.03235271241041815
  (0, 7739)	0.045426753803619664
  (0, 10361)	0.0597766998936176
  (0, 9320)	0.030557274972506234
  (0, 29172)	0.05288284389099513
  (0, 36072)	0.03487211423424892
  (0, 14950)	0.05103150581345296
  (0, 14344)	0.01920511390031357
  (0, 33376)	0.030727111018264492
  (0, 37056)	0.024216623086966205
  :	:
  (1799, 27189)	0.021610419271034357
  (1799, 15413)	0.013757250742321862
  (1799, 2440)	0.026978876380204273
  (1799, 26764)	0.04367002153433109
  (1799, 25356)	0.128851

In [10]:
print(X_test)

  (0, 37810)	0.049233302497535825
  (0, 37809)	0.006722119146968519
  (0, 37801)	0.01783853441303375
  (0, 37728)	0.011269008059395832
  (0, 37554)	0.008063419859195647
  (0, 37426)	0.02731475123347245
  (0, 37372)	0.036934013963421386
  (0, 37368)	0.04424874854344269
  (0, 37329)	0.040087702490054514
  (0, 37167)	0.017575014810606034
  (0, 37156)	0.005845986417172121
  (0, 37076)	0.018148478731249307
  (0, 37056)	0.02093057372291351
  (0, 36818)	0.02543141450152861
  (0, 36797)	0.006603127886297591
  (0, 36117)	0.05019527629175949
  (0, 35952)	0.010879548081688246
  (0, 35862)	0.006741572051307036
  (0, 35348)	0.028258157994876743
  (0, 34962)	0.02568542199981158
  (0, 34908)	0.012919199431329811
  (0, 34650)	0.12026310747016354
  (0, 34531)	0.04141024184007774
  (0, 34462)	0.036526654761964275
  (0, 34335)	0.017941877802180164
  :	:
  (199, 2499)	0.05402996851424845
  (199, 2497)	0.06581649435927639
  (199, 2440)	0.009684470387721593
  (199, 2275)	0.035076806787243134
  (199, 2132)	0

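It appears that the vectorization of the features was successful. Each line shows a (row, column) position and its TF-IDF weight. The individual values are hard to interpret by eye, but the formatting and dimensions look appropriate: the row indices run to 1799 for the 1800 training reviews and to 199 for the 200 test reviews.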
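A more compact sanity check than printing the whole matrix is to compare shapes and count the stored values. This is a sketch on a made-up corpus; the `toy_` names are assumptions for illustration. On the real data the same attributes (`shape`, `nnz`) are available on X_train and X_test.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts standing in for the review columns
toy_train = ["a fun clever film", "a slow dull film", "clever and fun"]
toy_test = ["a dull but clever film"]

toy_vectorizer = TfidfVectorizer()
toy_X_train = toy_vectorizer.fit_transform(toy_train)
toy_X_test = toy_vectorizer.transform(toy_test)

# Rows = documents, columns = size of the vocabulary learned from training
print(toy_X_train.shape, toy_X_test.shape)

# Both matrices must have the same number of columns for the model to use them
assert toy_X_train.shape[1] == toy_X_test.shape[1]

# nnz is the number of stored non-zero weights; most cells of a TF-IDF matrix are zero
total_cells = toy_X_train.shape[0] * toy_X_train.shape[1]
print(toy_X_train.nnz, "non-zero values out of", total_cells)
```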

Now that preprocessing and splitting the data is done, we can move on to model selection.

# Creating and configuring a sklearn.svm.SVC

At the outset of this project, we decided to use an SVC for our model because it is well suited to binary classification, which matches the two-label structure of our sentiment data. In this part we import the svm module from sklearn and create the SVC model that will be trained in the next section. Initially we use the default settings of the SVC.

In [11]:
#imports svm function
from sklearn import svm

#creates an instance of the SVC for future reference
SVC_classifier = svm.SVC()

This was a simple step. Now that our model is created, we are ready to train the SVM.

# Training the SVM

Now we train our SVC_classifier model by fitting it to the X_train data and the labels in the reviews_training data set (the Y variable of the training set).

In [12]:
#fits the SVC classifier to the training data
SVC_classifier.fit(X_train, reviews_training["Label"])

The output indicates that the model was successfully fitted to the training data. Now let's see how the model performs in the next step.
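Beyond the printed confirmation, a fitted SVC exposes attributes that show what it learned. The sketch below fits a default SVC on a made-up four-review corpus; `toy_texts`, `toy_labels`, and the other `toy_` names are assumptions for illustration only.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training texts and labels (illustrative assumption)
toy_texts = ["wonderful moving story", "terrible boring mess",
             "wonderful acting", "boring and terrible"]
toy_labels = ["pos", "neg", "pos", "neg"]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)

# Default settings, as in the notebook: RBF kernel with C=1.0
toy_classifier = svm.SVC()
toy_classifier.fit(toy_X, toy_labels)

# Labels the model learned, in sorted order
print(toy_classifier.classes_)

# Number of support vectors kept for each class
print(toy_classifier.n_support_)

# Accuracy on the training data itself; the test set is the fairer judge
print(toy_classifier.score(toy_X, toy_labels))
```

The same attributes can be read from SVC_classifier after fitting on the real training data.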

# Validating and Testing the SVM

In this step we will evaluate the model's performance on the test dataset (reviews_test). In addition, we will create our own example reviews to feed into the model to see if its outputs are correct and logical. This is done with the default hyperparameter settings to first ensure the model is producing reasonable outputs.

Now we will have the model make predictions based off the X_test data.

In [13]:
#applies the classifier to the test data set and saves results
predictions = SVC_classifier.predict(X_test)

#prints the first few model predictions and corresponding labels in the dataset
print(f"First predictions: {predictions[0:4]}\nCorrect labels:{reviews_test['Label'][0:4]}")

First predictions: ['pos' 'pos' 'neg' 'pos']
Correct labels:0    pos
1    pos
2    pos
3    pos
Name: Label, dtype: object


We can see the model is off to a reasonable start, with three of these four predictions matching the original labels. Note that the labels are abbreviated: positive is “pos” and negative is “neg”.
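The same comparison can be done in code instead of by eye. This sketch re-enters the four predictions and labels printed above as literal arrays (the variable names are assumptions) and computes the fraction that agree.

```python
import numpy as np

# The four predictions and labels printed above, re-entered by hand
first_predictions = np.array(["pos", "pos", "neg", "pos"])
first_labels = np.array(["pos", "pos", "pos", "pos"])

# Element-wise comparison gives True where prediction and label agree
matches = first_predictions == first_labels

# The mean of a boolean array is the fraction of True values, i.e. the accuracy
print(matches.sum(), "of", len(matches), "correct")
print("Accuracy:", matches.mean())
```

Four rows are far too few to judge a model, which is why the full test set is evaluated next.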

Next we evaluate the model on the entire test set rather than just the first four rows. To do this in only a couple of lines of code we use the classification_report function from sklearn. We also print the results with full class names instead of the abbreviated values in the dataset.

In [14]:
#imports evaluation function from sklearn
from sklearn.metrics import classification_report

#saves evaluation data
report = classification_report(reviews_test["Label"], predictions, output_dict=True)

#prints results
print('Positives: ', report['pos'])
print('Negatives: ', report['neg'])

Positives:  {'precision': 0.8571428571428571, 'recall': 0.84, 'f1-score': 0.8484848484848485, 'support': 100}
Negatives:  {'precision': 0.8431372549019608, 'recall': 0.86, 'f1-score': 0.8514851485148515, 'support': 100}


The model seems to have performed well in sentiment analysis. The F1-score, which balances precision and recall, is approximately 0.85 for both the positive and negative classes. This suggests the model was fairly accurate and consistent in predicting both positive and negative sentiments. Now let's see if we can improve the model through hyperparameter tuning.
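To see where these numbers come from, the report can be traced back to raw counts. With 100 reviews per class, a recall of 0.84 for "pos" implies 84 positives found and 16 missed, and a recall of 0.86 for "neg" implies 86 negatives found and 14 mislabeled as positive. These counts are inferred from the report rather than printed by the notebook; the sketch below recomputes the positive-class metrics from them.

```python
# Counts implied by the report above (100 reviews per class)
true_pos, false_neg = 84, 16   # actual positives: predicted pos / predicted neg
true_neg, false_pos = 86, 14   # actual negatives: predicted neg / predicted pos

# Precision: of everything predicted 'pos', how much really was 'pos'
precision_pos = true_pos / (true_pos + false_pos)   # 84 / 98

# Recall: of everything that really was 'pos', how much was found
recall_pos = true_pos / (true_pos + false_neg)      # 84 / 100

# F1 is the harmonic mean of precision and recall
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Accuracy: all correct predictions over all 200 test reviews
accuracy = (true_pos + true_neg) / (true_pos + false_neg + true_neg + false_pos)

print(precision_pos, recall_pos, f1_pos, accuracy)
```

The recomputed precision and F1 match the 'pos' row of the report, and the overall accuracy works out to 0.85.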

# Evaluate (and Improve) the results

To see if we can improve the performance of the model, GridSearchCV from sklearn will be used to vary the C, gamma, and kernel parameters of the SVC model. The goal here is to find the hyperparameters that improve the accuracy of the model.

In [15]:
"""#imports grid search for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

#setting different parameters for the model to use
parameters = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf', 'linear','poly']}

#creates models with different parameters
grid_search_model = GridSearchCV(estimator=svm.SVC(), param_grid=parameters)

# Train the classifier on the training data
grid_search_model.fit(X_train, reviews_training["Label"])

# Print out the results 
print('Best score:', grid_search_model.best_score_)
print('Best C:',grid_search_model.best_estimator_.C)
print('Best Kernel:',grid_search_model.best_estimator_.kernel)
print('Best Gamma:',grid_search_model.best_estimator_.gamma)
"""
#Output:
#Best score: 0.8522222222222222
#Best C: 10
#Best Kernel: linear
#Best Gamma: 1



The grid search indicated that the best cross-validation score obtainable was 0.8522, with C equal to 10, a linear kernel, and gamma equal to 1 (gamma has no effect with a linear kernel, so its value here is incidental). This score is an average over held-out folds of the training data, so it is not directly comparable to the test-set results above, but it is about the same as the default model's 0.85 test accuracy rather than clearly better. The code is commented out because it took a very long time to run, but the outputs are shown in the section above. Now let's apply these hyperparameters to the model and evaluate the model’s performance.
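One way to shorten such a search, offered here as a suggestion rather than what the notebook ran, is to give the linear kernel its own sub-grid without gamma. The single dictionary above yields 4 x 4 x 3 = 48 combinations, each cross-validated 5 times by default; since gamma is ignored by the linear kernel, splitting the grid reduces this to 4 + 16 + 16 = 36 distinct combinations. The sketch below shows the list-of-dictionaries form on a made-up corpus with a much smaller grid; all `toy_` names and values are assumptions for illustration.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Made-up corpus, just large enough for 2-fold cross-validation
toy_texts = ["loved this wonderful film", "a wonderful touching story",
             "great acting and great fun", "loved the great ending",
             "boring and painfully slow", "a terrible boring script",
             "awful acting and awful pacing", "terrible slow awful mess"]
toy_labels = ["pos"] * 4 + ["neg"] * 4

toy_X = TfidfVectorizer().fit_transform(toy_texts)

# gamma only affects the rbf, poly and sigmoid kernels,
# so the linear kernel gets a sub-grid without it
toy_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.1, 1]},
]

toy_search = GridSearchCV(estimator=svm.SVC(), param_grid=toy_grid, cv=2)
toy_search.fit(toy_X, toy_labels)

# 2 linear combinations + 4 rbf combinations
n_combinations = len(toy_search.cv_results_["params"])
print(n_combinations, "parameter combinations tried")
print("Best parameters:", toy_search.best_params_)
print("Best cross-validation score:", toy_search.best_score_)
```

Reading the winner from best_params_ also avoids reporting a gamma value for a kernel that never used it.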

In [16]:
#optimized model with hyperparameters:
op_classifier = svm.SVC(C=10,gamma=1,kernel='linear')

#fits the SVC classifier to the training data
op_classifier.fit(X_train, reviews_training["Label"])

#applies the classifier to the test data set and saves results
op_predictions = op_classifier.predict(X_test)

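#saves evaluation data for the tuned model's predictions
op_report = classification_report(reviews_test["Label"], op_predictions, output_dict=True)

#prints results
print('Positives: ', op_report['pos'])
print('Negatives: ', op_report['neg'])

Positives:  {'precision': 0.8571428571428571, 'recall': 0.84, 'f1-score': 0.8484848484848485, 'support': 100}
Negatives:  {'precision': 0.8431372549019608, 'recall': 0.86, 'f1-score': 0.8514851485148515, 'support': 100}


The scores printed here match the default model (SVC_classifier) to every decimal place. That is what happens when the default model's predictions are scored in place of op_predictions, so this printout should be regenerated with the report computed from op_predictions, as in the call above, before drawing a firm conclusion about the tuned model (op_classifier). What we can say is that the grid search's best cross-validation score (0.8522) is in line with the default model's 0.85 test accuracy, so there is no evidence that hyperparameter tuning improved performance, and the originally trained model can be used. Let's see the SVC_classifier model make some predictions on novel movie reviews in the next section.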

# Demonstrating making predictions

In this section we will evaluate the model's performance on made-up movie reviews to see if the predictions are correct or if there are any obvious flaws. This will be useful for understanding how the sentiment analysis model might perform in real-world applications.

Here are a few examples of made-up reviews that we will use to test the model.

In [17]:
# New movie reviews to test
reviews = ["Disappointed, there were no wolfs in Wolf of Wall Street",
           "Tom Cruise ejecting at mock 10 was such an amazing and realistic scene in the new top gun",
           "Two Thumbs up",
           "Two Thumbs down",
           "1 out of 10",
           "10 out of 10",
           "one star",
           "five star"]

# Vectorize and predict sentiment for each review in a for loop to be concise
for review in reviews:
    test_prediction = vectorizer.transform([review])
    print(f"Review: {review}")
    print(f"Sentiment: {SVC_classifier.predict(test_prediction)[0]}")
    print("\n")

Review: Disappointed, there were no wolfs in Wolf of Wall Street
Sentiment: neg


Review: Tom Cruise ejecting at mock 10 was such an amazing and realistic scene in the new top gun
Sentiment: pos


Review: Two Thumbs up
Sentiment: neg


Review: Two Thumbs down
Sentiment: neg


Review: 1 out of 10
Sentiment: pos


Review: 10 out of 10
Sentiment: pos


Review: one star
Sentiment: neg


Review: five star
Sentiment: neg




This test with made-up reviews shows that the first two reviews were handled as expected, with the negative and positive sentiment labeled correctly. This makes sense because they are full sentences, closer to the format of the training data. The short rating statements fared worse. “Two Thumbs up” and “Two Thumbs down” were both rated negative, so the first is wrong; “1 out of 10” and “10 out of 10” were both rated positive, so the first is wrong; and “one star” and “five star” were both rated negative, so the second is wrong. In each pair the model gave the same label regardless of the rating, which suggests it works better with full text than with short rating statements such as thumbs, scores, or stars.
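One contributing factor may be how little of a short review reaches the model. With its default settings, TfidfVectorizer lowercases the text and keeps only tokens of two or more characters, so the "1" in "1 out of 10" is dropped before vectorization. The sketch below uses build_analyzer() on a fresh, unfitted vectorizer to show the tokens produced for some of the short reviews. Whether this fully explains the misclassifications is an assumption, not something the notebook verifies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer returns the preprocessing + tokenizing function
# that a default TfidfVectorizer applies to each document
analyzer = TfidfVectorizer().build_analyzer()

short_reviews = ["Two Thumbs up", "Two Thumbs down", "1 out of 10", "10 out of 10"]

# Map each review to the tokens the vectorizer would actually see
tokens_by_review = {review: analyzer(review) for review in short_reviews}

for review, tokens in tokens_by_review.items():
    print(review, "->", tokens)
```

Each pair differs by a single token at most, and any of these tokens that is missing from the vocabulary learned from the training reviews would be ignored as well, leaving the classifier very little to separate the two ratings.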

# Conclusion

In notebook 9 we performed sentiment analysis using a movie review dataset, demonstrating both the application of a common NLP classification task and the machine learning workflow from scratch.

The fundamental goal of sentiment analysis is to classify a given text into categories, in this case, 'positive' and 'negative.' Our dataset, sourced from Kaggle's Sentiment Polarity Data Set v2.0 from Movie Review Data, offered us a collection of movie reviews labeled as either 'positive' or 'negative.' These data were pre-split into training and test sets and required little pre-processing outside of model specific requirements of vectorization.

We utilized the powerful sklearn library to vectorize our data and apply TF-IDF transformation. This weighs the importance of words in the text based on their frequency, ensuring that we account for both common and distinctive words when training our model.

The chosen model was a Support Vector Classifier (SVC), which has proven to be effective for sentiment analysis tasks. Post-training, we observed high precision rates in both positive (85.71%) and negative (84.31%) classifications, demonstrating the model's capability to handle sentiment classification effectively.

An attempt at hyperparameter tuning was made using GridSearchCV to vary the C, gamma, and kernel parameters of the SVC model. This suggested C=10, a linear kernel, and gamma=1. However, this did not show an improvement in model performance over the default settings.

Further, the model was evaluated on new, unseen movie reviews to gauge its performance in real-world scenarios. While the model performed well on whole-sentence reviews, it did exhibit some limitations in understanding short ratings such as one star or thumbs up.

In conclusion, this notebook presents a comprehensive sentiment analysis workflow, underlining the value and potential of NLP in extracting meaningful insights from textual data. However, it also emphasizes the need for continual model refinement and adaptation to improve performance in more complex and nuanced linguistic scenarios.
