# Applying Machine Learning Techniques to Sentiment Analysis

### Objective:

*In this project, we will delve into a subfield of natural language processing (NLP) called sentiment analysis and learn how to use machine learning algorithms to classify documents based on their polarity: the attitude of the writer. In particular, we are going to work with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb) and build a predictor that can distinguish between positive and negative reviews.*

### Overview:

* Cleaning and Preparing Text Data
* Building Feature Vectors From Text Documents
* Training A Machine Learning Model to Classify Reviews
* Working with Large Text Datasets Using Out-of-Core Learning

## Cleaning and Preparing Text Data

The movie review dataset can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/ as a gzip-compressed tarball (84.1 MB).
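As a minimal sketch of the unpacking step, assuming the archive has already been downloaded into the working directory, the standard-library `tarfile` module can extract a gzip-compressed tarball. The file name `aclImdb_v1.tar.gz` and the helper `extract_archive` are assumptions made for illustration; check the download page for the actual file name.

```python
import os
import tarfile

def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tarball into dest_dir (hypothetical helper)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)

# Assumed file name; adjust it to match the file you actually downloaded.
archive = "aclImdb_v1.tar.gz"
if os.path.exists(archive):
    extract_archive(archive)
    print(sorted(os.listdir("aclImdb")))
```

After extraction, the reviews should sit in a folder that the loading code can walk split by split and label by label.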

### Data Preparation

Let's download the dataset from the link above, extract it, and load it into a more convenient format (a pandas DataFrame in our case).

In [1]:
# Let's import the necessary libraries first
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [3]:
import os

base_path = "aclImdb"
reviews = {"pos": 1, "neg": 0}  # map folder name to class label

# Walk the train/test splits and the pos/neg folders, collecting (text, label) pairs
review_list = []
for split in ("train", "test"):
    for label in ("pos", "neg"):
        folder = os.path.join(base_path, split, label)
        for fname in sorted(os.listdir(folder)):
            with open(os.path.join(folder, fname), 'r', encoding='utf-8') as infile:
                review_list.append((infile.read(), reviews[label]))

df = pd.DataFrame(review_list, columns=['review', 'sentiment'])

Next, let's do some basic data exploration.

In [4]:
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | 1 |
| 1 | Homelessness (or Houselessness as George Carli... | 1 |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | 1 |
| 3 | This is easily the most underrated film inn th... | 1 |
| 4 | This is not the typical Mel Brooks film. It wa... | 1 |


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 781.4+ KB


In [6]:
df.shape

(50000, 2)

Now, let's shuffle the DataFrame and store it in a CSV file for convenience.

In [7]:
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 11841 | Often tagged as a comedy, The Man In The White... | 1 |
| 19602 | After Chaplin made one of his best films: Doug... | 0 |
| 45519 | I think the movie was one sided I watched it r... | 0 |
| 25747 | I have fond memories of watching this visually... | 1 |
| 42642 | This episode had potential. The basic premise ... | 0 |


In [8]:
df = df.reset_index(drop=True)
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | Often tagged as a comedy, The Man In The White... | 1 |
| 1 | After Chaplin made one of his best films: Doug... | 0 |
| 2 | I think the movie was one sided I watched it r... | 0 |
| 3 | I have fond memories of watching this visually... | 1 |
| 4 | This episode had potential. The basic premise ... | 0 |


In [9]:
df.to_csv("movie_data.csv", index=False, encoding='utf-8')
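The shuffle-and-renumber steps can also be written as a single expression. This is only an alternative sketch on a made-up four-row DataFrame (`toy` is not part of the project data); it produces a different permutation than the `np.random.permutation` call, so it is not a drop-in way to reproduce the exact row order shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    "review": ["good", "bad", "great", "awful"],
    "sentiment": [1, 0, 1, 0],
})

# sample(frac=1) returns all rows in random order; reset_index(drop=True)
# discards the old index so the rows are renumbered from 0.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```

Each review keeps its own label because whole rows are moved together.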

### Cleaning Text Data

The reviews contain HTML markup and punctuation such as '/', ';', and ')' that provide no useful information for telling reviews apart. Therefore, let's remove them, keeping only emoticons such as ':)', which do carry sentiment.

In [10]:
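import re

def preprocessor(text):
    # Strip HTML tags such as <br />
    text = re.sub(r"<[^>]*>", "", text)
    # Collect emoticons such as :) or ;-( before punctuation is removed
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Lowercase and replace every run of non-word characters with one space
    text = re.sub(r"\W+", " ", text.lower())
    # Append the emoticons to the end of the cleaned text
    return " ".join([text.strip()] + emoticons)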

In [11]:
df['review'] = df['review'].apply(preprocessor)
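To see what the cleaning step does, here is a self-contained sketch applied to a made-up string. The function name `clean_text` is hypothetical, and the patterns are written on the assumption that the intent is to strip whole HTML tags (`<[^>]*>`) and to allow an optional `-` nose in emoticons.

```python
import re

def clean_text(text):
    """Illustrative cleaning function (hypothetical name, assumed patterns)."""
    # Remove HTML tags such as <br /> or </a>
    text = re.sub(r"<[^>]*>", "", text)
    # Collect emoticons like :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Lowercase and collapse every run of non-word characters into one space
    text = re.sub(r"\W+", " ", text.lower())
    # Put the emoticons at the end, separated by single spaces
    return " ".join([text.strip()] + emoticons)

print(clean_text("</a>This :) is :( a test :-)!"))
# → this is a test :) :( :-)
```

Moving the emoticons to the end loses their position in the text, which does not matter for a bag-of-words model because word order is ignored anyway.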

## Building Feature Vectors From Text Documents

### Tokenization

One way to tokenize documents is to split them into individual words at whitespace.

In [12]:
def tokenizer(text):
    return text.split()

In [13]:
tokenizer(df.iloc[0,0][:50])

['often',
 'tagged',
 'as',
 'a',
 'comedy',
 'the',
 'man',
 'in',
 'the',
 'white',
 'suit']

Another useful technique is stemming, in which words are reduced to their root form (for example, 'tagged' becomes 'tag'). Because different inflections map to the same stem, this also reduces the size of the bag-of-words vocabulary.

In [14]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [15]:
tokenizer_porter(df.iloc[0,0][:50])

['often',
 'tag',
 'as',
 'a',
 'comedi',
 'the',
 'man',
 'in',
 'the',
 'white',
 'suit']

### Stopwords

Stop-words are very common words such as 'a', 'the', and 'in' that carry little information for distinguishing documents. NLTK ships a list of English stop-words that we can use to filter them out.

In [16]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lovepreet/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
from nltk.corpus import stopwords
stop = stopwords.words("english")

# Example: drop stop-words from the first 50 characters of the first review
[w for w in df.iloc[0,0][:50].split() if w not in stop]

['often', 'tagged', 'comedy', 'man', 'white', 'suit']

### Transformer

We can use term frequency (tf) or term frequency-inverse document frequency (tf-idf) weighting to build our bag-of-words model. Let's use scikit-learn's TfidfVectorizer for convenience.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer(use_idf=True, smooth_idf=True, norm='l2')

In [19]:
df.iloc[0,0][:50].split()

['often',
 'tagged',
 'as',
 'a',
 'comedy',
 'the',
 'man',
 'in',
 'the',
 'white',
 'suit']

In [20]:
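# Each word in the list is treated as its own one-word document, so every row
# below has at most one non-zero entry ('a' is dropped by the default token pattern)
tfidf_vec.fit_transform(df.iloc[0,0][:50].split()).toarray()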

array([[0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.]])
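To make the tf-idf weighting more concrete, the sketch below recomputes it by hand on a made-up three-document corpus and compares the result with `TfidfVectorizer`. It relies on the formula scikit-learn uses when `smooth_idf=True`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row; the corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the sun is shining",
    "the weather is sweet",
    "the sun is shining and the weather is sweet",
]

# Raw term counts: one row per document, one column per vocabulary term
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears
df_t = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency
idf = np.log((1 + n_docs) / (1 + df_t)) + 1

# tf-idf = term count * idf, then scale each row to unit L2 norm
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

sk = TfidfVectorizer(use_idf=True, smooth_idf=True, norm='l2').fit_transform(docs).toarray()
print(np.allclose(manual, sk))
```

Terms that occur in every document (such as "the" and "is" here) get the smallest idf, so they contribute least to each document's vector.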

## Training A Machine Learning Model to Classify Reviews

In [21]:
# First, let's split the data into train and test sets
X = df['review']
y = df['sentiment']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
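The split above is purely random. As an optional sketch, `train_test_split` also accepts a `stratify` argument that preserves the class proportions in both parts. The toy arrays below are made up and deliberately imbalanced so the effect is visible; this option is not used in the project code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 negatives and 20 positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratify, both parts keep roughly the same share of positives
print(y_tr.mean(), y_te.mean())
```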

In [22]:
# Next, let's import the required libraries
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [23]:
# Now, let's build a pipeline that chains vectorization and classification
lr_tfidf = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", SGDClassifier())
])

Next, let's define the grid of hyperparameter values we want to try, and then use grid search with cross-validation to find the best combination for the model.

In [24]:
params = {
          "vec__norm":['l2'], 
          "vec__use_idf":[True],
          "vec__smooth_idf":[True],
          "vec__stop_words":[stop, None],
          "vec__tokenizer":[tokenizer_porter, tokenizer],
          "clf__loss":["log_loss"],
          "clf__alpha":[0.00001,0.001,0.1] # regularization strengths to try
}

In [25]:
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid=params, scoring='accuracy', cv=3, n_jobs=-1)
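Before starting the search, it helps to know how much work it involves. The sketch below counts the candidate combinations with `ParameterGrid`, using a grid shaped like the one above; the placeholder strings stand in for the stop-word list and the tokenizer functions and are assumptions made to keep the example self-contained.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder strings stand in for the stop-word list and tokenizer functions;
# only the number of options per parameter matters for the count.
grid = {
    "vec__norm": ["l2"],
    "vec__use_idf": [True],
    "vec__smooth_idf": [True],
    "vec__stop_words": ["nltk_stop_words", None],
    "vec__tokenizer": ["tokenizer_porter", "tokenizer"],
    "clf__loss": ["log_loss"],
    "clf__alpha": [0.00001, 0.001, 0.1],
}

n_candidates = len(ParameterGrid(grid))
n_folds = 3
print(n_candidates, "candidates,", n_candidates * n_folds, "cross-validation fits")
# → 12 candidates, 36 cross-validation fits
```

The `vec__` and `clf__` prefixes route each value to the pipeline step with that name, so the same dictionary can tune both the vectorizer and the classifier.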

In [26]:
gs_lr_tfidf.fit(X_train, y_train)

In [27]:
gs_lr_tfidf.best_score_

0.8959428349315169

In [28]:
gs_lr_tfidf.best_params_

{'clf__alpha': 1e-05,
 'clf__loss': 'log_loss',
 'vec__norm': 'l2',
 'vec__smooth_idf': True,
 'vec__stop_words': ['i',
  'me',
  'my',
  'myself',
  'we',
  'our',
  'ours',
  'ourselves',
  'you',
  "you're",
  "you've",
  "you'll",
  "you'd",
  'your',
  'yours',
  'yourself',
  'yourselves',
  'he',
  'him',
  'his',
  'himself',
  'she',
  "she's",
  'her',
  'hers',
  'herself',
  'it',
  "it's",
  'its',
  'itself',
  'they',
  'them',
  'their',
  'theirs',
  'themselves',
  'what',
  'which',
  'who',
  'whom',
  'this',
  'that',
  "that'll",
  'these',
  'those',
  'am',
  'is',
  'are',
  'was',
  'were',
  'be',
  'been',
  'being',
  'have',
  'has',
  'had',
  'having',
  'do',
  'does',
  'did',
  'doing',
  'a',
  'an',
  'the',
  'and',
  'but',
  'if',
  'or',
  'because',
  'as',
  'until',
  'while',
  'of',
  'at',
  'by',
  'for',
  'with',
  'about',
  'against',
  'between',
  'into',
  'through',
  'during',
  'before',
  'after',
  'above',
  'below',
  'to',

Finally, let's select the optimal model and evaluate its predictions.

In [29]:
# best_estimator_ is the pipeline refit on the whole training set with the best parameters
best_model = gs_lr_tfidf.best_estimator_

train_pred = best_model.predict(X_train)
test_pred = best_model.predict(X_test)

from sklearn.metrics import accuracy_score
train_acc = accuracy_score(y_train, train_pred)
test_acc = accuracy_score(y_test, test_pred)

print(f"Predictor Accuracy on Training Data: {train_acc}")
print(f"Predictor Accuracy on Test Data: {test_acc}")

Predictor Accuracy on Training Data: 0.9626571428571429
Predictor Accuracy on Test Data: 0.9008666666666667


In [30]:
# Also, let's take a look at the confusion matrix for the test set
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, test_pred)

array([[6646,  850],
       [ 637, 6867]])
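The confusion matrix contains enough information to derive other metrics. The sketch below recomputes accuracy from the four counts reported above and also derives precision and recall for the positive class, assuming scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions,
# both ordered as (negative=0, positive=1).
cm = np.array([[6646, 850],
               [637, 6867]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted positives that are truly positive
recall = tp / (tp + fn)      # share of true positives that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy computed this way agrees with the test accuracy printed earlier, which is a quick check that the matrix is being read in the right orientation.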

## Working with Large Text Datasets Using Out-of-Core Learning

First, we will define a tokenizer function that cleans the unprocessed text data from the movie_data.csv file that we constructed at the beginning of this project and separates it into word tokens while removing stop-words:

In [31]:
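def tokenizer(text):
    # Same cleaning steps as the preprocessor: strip HTML tags, keep emoticons
    text = re.sub(r"<[^>]*>", "", text)
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    text = re.sub(r"\W+", " ", text.lower())
    text = " ".join([text.strip()] + emoticons)

    # Split on whitespace, drop stop-words, and stem what remains
    return [porter.stem(word) for word in text.split() if word not in stop]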

Next, let's define a generator that reads one document at a time, together with a helper that groups those documents into batches.

In [32]:
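# Generator function: yields one (text, label) pair per line of the CSV file
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip the header line
        for line in csv:
            # Each line ends with ",<label>\n": the label is the second-to-last
            # character and the review text is everything before the final comma
            text, label = line[:-3], line[-2:-1]
            yield text, label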
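The generator above slices each raw line, which assumes one review per line and leaves any CSV quoting in the text. As an alternative sketch, the standard-library `csv` module can parse the rows instead. The function name `stream_docs_csv` and the small temporary file are assumptions for illustration, not part of the project.

```python
import csv
import os
import tempfile

def stream_docs_csv(path):
    """Yield (text, label) pairs using csv.reader (hypothetical variant)."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for text, label in reader:
            yield text, label

# Tiny stand-in file with the same two columns as movie_data.csv
tmp_path = os.path.join(tempfile.mkdtemp(), "toy_movie_data.csv")
with open(tmp_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["review", "sentiment"])
    writer.writerow(['He said "wow", twice', 1])
    writer.writerow(["Dull, slow and far too long", 0])

for text, label in stream_docs_csv(tmp_path):
    print(label, text)
```

As in the original generator, the labels come back as strings, so they would still match `classes=np.array(['0','1'])` during training.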

In [33]:
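def get_mini_batch(doc_stream, size):
    """Collect `size` documents from the stream; return (None, None) once it runs out."""
    docs, labels = [], []

    for _ in range(size):
        try:
            text, label = next(doc_stream)
            docs.append(text)
            labels.append(label)
        except StopIteration:
            return None, None

    return docs, labels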
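The function above returns `None, None` as soon as the stream runs out, so documents already collected for an incomplete batch are dropped. If keeping a short final batch is preferable, one possible variant uses `itertools.islice`; the name `get_mini_batch_partial` is hypothetical and the toy stream is made up.

```python
from itertools import islice

def get_mini_batch_partial(doc_stream, size):
    """Return up to `size` documents; hypothetical variant that keeps a short final batch."""
    batch = list(islice(doc_stream, size))
    if not batch:
        return None, None
    docs, labels = zip(*batch)
    return list(docs), list(labels)

stream = iter([("good", "1"), ("bad", "0"), ("fine", "1")])
print(get_mini_batch_partial(stream, 2))  # first two documents
print(get_mini_batch_partial(stream, 2))  # the one remaining document
print(get_mini_batch_partial(stream, 2))  # stream exhausted
```

With 50,000 documents and batches of 1,000 or 5,000 the difference never shows up, since the batch sizes divide the dataset evenly.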

Finally, let's use HashingVectorizer for text processing, which is data-independent (it needs no fitting pass over the corpus), and train our machine learning model incrementally using out-of-core learning.

In [34]:
from sklearn.feature_extraction.text import HashingVectorizer

doc_stream = stream_docs(path='movie_data.csv')
vec = HashingVectorizer(decode_error='ignore', n_features=2**21, preprocessor=None, tokenizer=tokenizer)
clf = SGDClassifier(loss='log_loss', random_state=42)

# Train on 45 mini-batches of 1,000 documents each; the last 5,000 are kept for testing
for _ in range(45):
    X_train, y_train = get_mini_batch(doc_stream, 1000)
    if not X_train:
        break
    X_train = vec.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=np.array(['0','1']))

In [35]:
# Model evaluation on the remaining 5,000 documents
X_test, y_test = get_mini_batch(doc_stream, 5000)
X_test = vec.transform(X_test)

clf.score(X_test, y_test)

0.871

As we can see, the accuracy is about 87%, which is slightly lower than the roughly 90% test accuracy we got in the previous section. However, out-of-core learning is very memory-efficient, and the training time was just over a minute.
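The reason a hashing vectorizer fits this streaming setup is that it keeps no vocabulary: a token's column is determined by a hash function alone. The sketch below illustrates this with two independent vectorizers and made-up sentences; the small `n_features` value is an assumption chosen only to keep the example light.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Two independent vectorizers that never see each other's data
vec_a = HashingVectorizer(n_features=2**10)
vec_b = HashingVectorizer(n_features=2**10)

batch_1 = ["a wonderful moving film"]
batch_2 = ["a wonderful moving film", "dreadful boring plot"]

X_a = vec_a.transform(batch_1)   # no fit step is needed
X_b = vec_b.transform(batch_2)

# The shared document gets an identical feature vector in both cases
same = np.array_equal(X_a.toarray()[0], X_b.toarray()[0])
print(X_a.shape, X_b.shape, same)
```

Because the number of columns is fixed in advance and no state is learned, every mini-batch can be transformed on its own and passed straight to `partial_fit`.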