# Sentiment analysis of text data

This lab focuses on classifying natural language data. We'll use the [Movie Review Polarity Dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/), which contains film reviews annotated with a label that classifies each one as positive or negative. The task is to build a classifier that predicts the label of new (unseen) reviews.

The usual workflow for building and deploying a classifier is depicted in the image below:

![Text Classification workflow](https://developers.google.com/machine-learning/guides/text-classification/images/Workflow.png)

First we prepare the computing environment by importing the necessary libraries:

In [1]:
import os
import random
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
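import sklearn
import sklearn.feature_extraction.text  # "import sklearn" alone is not guaranteed to load this submodule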

The installed [NLTK](https://www.nltk.org/) package doesn't include the data it needs, which must be installed separately as described in the [documentation](https://www.nltk.org/data.html). The full list of available corpora is on the [NLTK website](https://www.nltk.org/nltk_data/).

In our case we need the stopwords corpus and the `punkt` tokeniser models (used later by `word_tokenize`), which can be downloaded as follows:

In [2]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\black\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\black\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

## Collect and load data

The dataset is in the zip archive `data/review_polarity.zip`. Inside the archive, reviews are stored as single files within the `txt_sentoken` directory. Each file holds one review and sits in the `pos` or `neg` subdirectory according to its classification. The function below loads the dataset into a pandas dataframe:

In [3]:
from zipfile import ZipFile
import re

def load_dataset_archive(ziparch, seed=None, encoding='utf-8'):
    """Load the Movie Review Polarity Dataset from the given zip archive.
    For the description of the data see <http://www.cs.cornell.edu/people/pabo/movie-review-data/>
    """
    data = []
    with ZipFile(ziparch, 'r') as myzip:
        for fi in myzip.infolist():
            if not fi.is_dir():
                m = re.search(r'/(neg|pos)/(\w+)\.txt$', fi.filename)
                if m:
                    row = {'id': m.group(2), 'Text': myzip.read(fi).decode(encoding), 'Label': 0 if m.group(1) == 'neg' else 1}
                    data.append(row)

    # shuffle data to avoid order biases
    random.seed(seed)
    random.shuffle(data)
    return pd.DataFrame.from_records(data, columns=['id', 'Text', 'Label'], index='id')
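
To see how the file name pattern turns a path inside the archive into an id and a label, here is a minimal sketch that runs the same matching logic on a tiny in-memory archive. The file names and review texts are invented for illustration, and only the standard library is used.

```python
import io
import re
from zipfile import ZipFile

# Build a tiny in-memory archive mimicking the txt_sentoken/{pos,neg} layout.
# File names and contents are invented for illustration.
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('txt_sentoken/pos/cv000_00001.txt', 'a wonderful film')
    archive.writestr('txt_sentoken/neg/cv001_00002.txt', 'a dull film')
    archive.writestr('txt_sentoken/README', 'not a review')

pattern = re.compile(r'/(neg|pos)/(\w+)\.txt$')

rows = []
buffer.seek(0)
with ZipFile(buffer, 'r') as archive:
    for info in archive.infolist():
        match = pattern.search(info.filename)
        if match:  # entries outside pos/ or neg/ are skipped
            rows.append({
                'id': match.group(2),
                'Text': archive.read(info).decode('utf-8'),
                'Label': 0 if match.group(1) == 'neg' else 1,
            })

for row in rows:
    print(row)
```

The `README` entry does not match the pattern, so only the two review files end up in `rows`.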

In [4]:
dataset = load_dataset_archive('data/review_polarity.zip')
print(dataset.info())
dataset.sample(10)

<class 'pandas.core.frame.DataFrame'>
Index: 1999 entries, cv163_10110 to cv969_13250
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    1999 non-null   object
 1   Label   1999 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 46.9+ KB
None


id,Text,Label
cv444_9975,the makers of spawn have created something alm...,0
cv925_9459,everybody in this film's thinking of alicia . ...,0
cv634_11989,i didn't come into city of angels expecting gr...,0
cv338_8821,leonardo decaprio ( what's eating gilbert grap...,1
cv100_12406,warning : spoilers are included in this review...,0
cv332_16307,"like the wonderful 1990 drama , "" awakenings ,...",1
cv675_22871,have you ever been in an automobile accident w...,0
cv254_5870,even the best comic actor is at the mercy of h...,0
cv717_17472,starring arnold schwarzenegger ; danny devito ...,0
cv957_8737,capsule : the best place to start if you're a ...,1


## Feature extraction

To use ML techniques we need to transform the textual representation into a set of features, and for that we can use the infrastructure provided by [scikit-learn](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature). This example uses [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer), but you can also try the extractor based on [Tf–idf term weighting](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer).

To limit the number of features we can pass the `max_features` parameter to the `CountVectorizer` constructor (the set of all features might be unmanageable).

We can ignore stopwords by using the `stop_words` parameter. Below we'll use the list from the downloaded NLTK corpus.

N-grams can be considered by specifying the `ngram_range` parameter. For example, `(1,3)` uses n-grams of length 1, 2, and 3.
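
As a rough illustration of what `ngram_range=(1,3)` produces, the sketch below builds word n-grams in plain Python. It is not scikit-learn's implementation: the helper `word_ngrams` and the three-word stopword list are made up for this example. To my understanding `CountVectorizer` also drops stopwords before forming n-grams, which is why words that were not adjacent in the original text can end up in the same n-gram.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all word n-grams with lengths in the inclusive ngram_range."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

stop_words = {'the', 'is', 'a'}  # tiny illustrative list, not the NLTK one
text = 'the plot is a complete mess'
tokens = [t for t in text.split() if t not in stop_words]

print(word_ngrams(tokens))
# ['plot', 'complete', 'mess', 'plot complete', 'complete mess', 'plot complete mess']
```

Three tokens already yield six features, which hints at why `max_features` is useful once n-grams are enabled on a real corpus.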

In [5]:
vectorizer = sklearn.feature_extraction.text.CountVectorizer(
    stop_words=nltk.corpus.stopwords.words('english'),
    max_features=100000,
    ngram_range=(1,3)
)

%time fmatrix = vectorizer.fit_transform(dataset['Text'])

print(fmatrix.shape)

CPU times: total: 5.8 s
Wall time: 5.92 s
(1999, 100000)


The default tokeniser is the regular expression `(?u)\b\w\w+\b` (`(?u)` switches on the `re.U (re.UNICODE)` flag), but we can also use one of the [NLTK tokenisers](https://www.nltk.org/api/nltk.tokenize.html):

In [6]:
small_vectorizer = sklearn.feature_extraction.text.CountVectorizer(
    stop_words=nltk.corpus.stopwords.words('english'),
    max_features=1000,
    ngram_range=(1,3),
    tokenizer=nltk.tokenize.word_tokenize
)

small_matrix = small_vectorizer.fit_transform(dataset['Text'])

print(small_matrix.shape)



(1999, 1000)
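
For comparison, the default token pattern mentioned above can be tried directly with the `re` module. This is only a sketch of the regular expression itself, applied to the opening words of one of the sampled reviews; it does not reproduce the rest of the `CountVectorizer` preprocessing.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters.
token_pattern = re.compile(r'(?u)\b\w\w+\b')

text = "i didn't come into city of angels"
tokens = token_pattern.findall(text.lower())

print(tokens)
# ['didn', 'come', 'into', 'city', 'of', 'angels']
```

Single-character tokens such as `i` and the `t` of `didn't` are dropped, and the apostrophe acts as a separator. A tokeniser such as `nltk.tokenize.word_tokenize` handles contractions and punctuation differently, so the two vectorisers do not share the same vocabulary.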


## Classification and Evaluation

Below you'll find an example using the [logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) classifier with the `sag` solver, tuning the `C` parameter by cross-validated grid search:


In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

base_estimator = LogisticRegression(solver='sag')

param_grid = {'C': [0.01, 0.05, 0.25, 0.5, 1]}  # possible options for the C parameter of the regression

clf = GridSearchCV(base_estimator, param_grid=param_grid)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    %time clf.fit(fmatrix, dataset['Label'])

pd.concat([pd.DataFrame(clf.cv_results_["params"]),pd.DataFrame(clf.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)

CPU times: total: 19.6 s
Wall time: 19.7 s


,C,Accuracy
0,0.01,0.838914
1,0.05,0.842915
2,0.25,0.845917
3,0.5,0.843416
4,1.0,0.842916
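
To make reading the table less error-prone, the short sketch below picks the best `C` from the scores reported above and shows how small the differences are. The numbers are copied from the table; on the fitted `GridSearchCV` object the same information is available as `clf.best_params_` and `clf.best_score_`.

```python
# Mean cross-validated accuracy per C value, copied from the table above.
scores = {0.01: 0.838914, 0.05: 0.842915, 0.25: 0.845917, 0.5: 0.843416, 1.0: 0.842916}

best_c = max(scores, key=scores.get)
spread = max(scores.values()) - min(scores.values())

print(f"best C = {best_c}, accuracy = {scores[best_c]:.4f}")
print(f"spread across the grid = {spread:.4f}")
# best C = 0.25, accuracy = 0.8459
# spread across the grid = 0.0070
```

The whole grid spans less than one percentage point of accuracy, so on this dataset the choice of `C` within this range matters little.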


### Train with whole dataset

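Once you have selected the parameter, you can prepare the model for classifying unseen data. Usually the model for deployment is trained on the whole dataset (beware of overfitting, though). Note that the cell below uses `C=1`, although the grid search above scored `C=0.25` slightly higher.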

In [8]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    %time lr_full = LogisticRegression(C=1, solver='sag').fit(fmatrix, dataset['Label'])

CPU times: total: 875 ms
Wall time: 872 ms


Trained models can be saved for later deployment using Python libraries for serialisation. The scikit-learn documentation suggests using [joblib](https://scikit-learn.org/stable/modules/model_persistence.html):

In [9]:
import joblib

joblib.dump(lr_full, 'my_lr_classifier.joblib')
joblib.dump(vectorizer, 'my_full_vectorizer.joblib')
lr_copy = joblib.load('my_lr_classifier.joblib')
lr_copy
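
The cell above stores the classifier and the vectoriser in two separate files, and both are needed at prediction time. One possible alternative, sketched here on an invented toy corpus, is to wrap the two steps in a scikit-learn pipeline and persist that single object. The corpus, the labels, and the file name `toy_pipeline.joblib` are assumptions made for this example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; 1 = positive, 0 = negative.
texts = ['good film', 'great film', 'bad film', 'awful film']
labels = [1, 1, 0, 0]

# The pipeline keeps the fitted vocabulary and the classifier together.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy_pipeline.joblib')  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw strings directly.
new_texts = ['good film', 'awful film']
print(restored.predict(new_texts))
```

With a single file there is no risk of loading a classifier together with a vectoriser it was not trained with.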

## Using the model for prediction

To classify new instances, their features must be aligned with the ones used for training. To this end, use the `transform` method of the fitted vectoriser (the features are selected during the `fit` phase, so calling `fit` again would change them):

In [10]:
new_example = vectorizer.transform(['drill', 'good drill', 'crap film', 'excellent one'])
lr_full.predict(new_example)

array([0, 1, 0, 1], dtype=int64)

Let's have a look at the features of the new data. To understand which features are in the example we need to consider the list of feature names in the vectoriser (the method `get_feature_names_out()`).

In [11]:
features = vectorizer.get_feature_names_out()
for row, col in zip(*new_example.nonzero()):
    print('{} ({},{})={} '.format(features[col], row, col, new_example[row,col]))

drill (0,23230)=1 
drill (1,23230)=1 
good (1,37750)=1 
crap (2,17940)=1 
crap film (2,17943)=1 
film (2,30208)=1 
excellent (3,27495)=1 
excellent one (3,27511)=1 
one (3,62006)=1 


## Try a different classifier

Select a different classifier and verify whether you can obtain a better accuracy. With textual data [naïve Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes) and [support vector machine (SVM)](https://scikit-learn.org/stable/modules/svm.html#svm) are often used, but you can also train and use [deep learning models](https://developers.google.com/machine-learning/guides/text-classification/step-4).

Comment on your experiments.

Before applying our classifier to custom data, let us first test it on the existing dataset by dividing the original dataset into two portions: a training set and a test set.

In [28]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Splitting the dataset into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(dataset['Text'], dataset['Label'], test_size=0.2, random_state=42)

# Vectorizing the training set
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)

# Training the SVM model using only the training set
svm = SVC(kernel='linear')
svm.fit(X_train_vectorized, y_train)

# Vectorizing the test set
X_test_vectorized = vectorizer.transform(X_test)

# Predicting labels for the test set
predictions = svm.predict(X_test_vectorized)

# Calculating accuracy by comparing predicted labels to true labels
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy of the classifier: {accuracy * 100:.2f}%")

Accuracy of the classifier: 86.25%


As the output shows, the classifier labels 86.25% of the test reviews correctly. This is a reasonable score, so we can proceed to apply it to custom data of our choice.
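
A single 80/20 split gives an estimate that depends on which reviews happen to land in the test set. A possible complement, sketched below on an invented toy corpus, is to cross-validate a pipeline that contains both the vectoriser and the SVM. For the real data one would presumably pass `dataset['Text']` and `dataset['Label']` instead of the toy lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus invented for illustration; 1 = positive, 0 = negative.
texts = [
    'a good and moving film', 'great acting and a good plot',
    'a wonderful, funny film', 'good fun from start to finish',
    'a bad and boring film', 'awful acting and a bad plot',
    'a dull, boring film', 'bad jokes from start to finish',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Inside cross_val_score the vectoriser is re-fitted on each training fold,
# so the held-out fold never influences the vocabulary or the idf weights.
model = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))
scores = cross_val_score(model, texts, labels, cv=4)

print(scores)
print(f'mean accuracy: {scores.mean():.2f}')
```

The scores on eight toy sentences mean nothing in themselves; the point is the structure, which keeps feature extraction inside the evaluation loop.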

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Vectorize the whole dataset with a TfidfVectorizer
vectorizer = TfidfVectorizer()
X_vectorized = vectorizer.fit_transform(dataset['Text'])

# Train the SVM model (since we want to test custom reviews, we can use the whole dataset here, without splitting)
svm = SVC(kernel='linear')
svm.fit(X_vectorized, dataset['Label'])

# Custom reviews whose sentiment we want to predict
test_reviews = ["good film", "crap film", "I find it very interesting", "I will for sure watch it again", "how can you keep watching this for 2 hours straight, boredom in its purest for"]
test_reviews_vectorized = vectorizer.transform(test_reviews)
predictions = svm.predict(test_reviews_vectorized)

# Display predictions for test reviews
for review, prediction in zip(test_reviews, predictions):
    print(f"Review: {review} --> Predicted Label: {prediction}")


Review: good film --> Predicted Label: 1
Review: crap film --> Predicted Label: 0
Review: I find it very interesting --> Predicted Label: 0
Review: I will for sure watch it again --> Predicted Label: 1
Review: how can you keep watching this for 2 hours straight, boredom in its purest for --> Predicted Label: 0


Reading the reviews, a human would label them as follows:
- Positive reviews:
  1. good film
  2. I find it very interesting
  3. I will for sure watch it again
- Negative reviews:
  1. crap film
  2. how can you keep watching this for 2 hours straight, boredom in its purest for

The classifier predicted the label correctly for 4 of the 5 reviews; the only miss is "I find it very interesting", which it labelled as negative. However, this is a small, hand-picked sample of reviews, so the result says little about general performance.
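
The tally above can also be computed rather than counted by hand. The sketch below hard-codes the labels from the human reading in the list above and the predictions printed by the SVM cell, so it runs without the trained model.

```python
reviews = [
    'good film',
    'crap film',
    'I find it very interesting',
    'I will for sure watch it again',
    'how can you keep watching this for 2 hours straight, boredom in its purest for',
]
expected = [1, 0, 1, 1, 0]   # labels a human reader would assign (see the list above)
predicted = [1, 0, 0, 1, 0]  # labels printed by the SVM cell

# Collect the reviews where the prediction disagrees with the human label.
misses = [r for r, e, p in zip(reviews, expected, predicted) if e != p]
accuracy = 1 - len(misses) / len(reviews)

print(f'accuracy: {accuracy:.0%}')
print('misclassified:', misses)
# accuracy: 80%
# misclassified: ['I find it very interesting']
```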