<a href="https://colab.research.google.com/github/potasali/Machine-Learning/blob/master/Programming_Assignment_2_Sentiment_Analyzer/Programming_Assignment_2_Sentiment_Analyzer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Programming Assignment 2: Sentiment Analyzer

## Introduction

The purpose of this assignment is to get you familiar with sentiment classification. By the
end of this assignment you will have your very own “Sentiment Analyzer”. You are given
with Large Movie Review Dataset that contains separate labelled train and test set. Your
task is to train a Logistic Regression classifier on train set and report evaluation metrics on
test set. 

Let's start with the necessary imports.

In [0]:
# Importing Libraries
import os
import glob
import re
import numpy as np
from matplotlib import pyplot
import math 
import pandas as pd
import random

## 1. Dataset

- $x_1$ = count(positive words) ∈ review
- $x_2$ = count(negative words) ∈ review
- $x_3$ = Star Rating (1-10 scale)
- $x_4$ = log(word count of review)
- $x_5$ = 1 if “no” ∈ review, 0 otherwise
- $x_6$ = 1 if “!” ∈ review, 0 otherwise
- $y$ = 1 if positive, 0 otherwise

### 1.1 Fetching Reviews

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). There are two top-level directories `[train/, test/]` corresponding to the training and test sets. Each contains `[pos/,neg/]` directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention `[[id]_[rating].txt]` where `[id]` is a unique id and `[rating]` is the star rating for that review on a 1-10 scale. For example, the file `[test/pos/200_8.txt]` is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb.

The reviews and the ratings are loaded from the data files

In [0]:
# A function to fetch reviews from the files
def getDataset(_path):
    
    # Creating a dictionary to store each review according to their id
    dataset = {}
    ratings = []
    temp = ""

    # Using glob module to retrieve files/pathnames.
    files = (glob.glob(_path))

    # Iterating through each file in the given folder
    for file in files:

        # Opening the file as read only
        with open(file, 'r', encoding="utf8") as myFile:

            # Reading review from a file
            review = myFile.read()

            # Preprocessing the file name before using it as the index for our dictionary
            name = file.split('/')[-1]

            # Storing review for each file in the dictionary
            dataset[name] = review

    return dataset

# Loading the datasets, separating train and test dataset and further separating into two sub-categories, positive and negative reviews.
ptrReview = getDataset('Dataset/train/pos/*')
ntrReview = getDataset('Dataset/train/neg/*')
pteReview = getDataset('Dataset/test/pos/*')
nteReview = getDataset('Dataset/test/neg/*')

### 1.2 Dataset Preprocessing
<ul>
    <li>Loading Positive and Negative Lexicons from the dataset file </li>
    <li>Finding lexicon count</li>
    <li>Finding word count</li>
    <li>Finding `no` count</li>
    <li>Finding `!` count</li>
</ul>


In [0]:
# Loading Positive Lexicons
with open('Dataset/positive-words.txt','r') as myFile:
    pLex = myFile.read().split('\n')

# Loading Negative Lexicons
with open('Dataset/negative-words.txt','r') as myFile:
    nLex = myFile.read().split('\n')

In [0]:
# Data preprocessing function
def preprocess(review, pLex, nLex):
    X = []

    # Iterating through each review
    for filename, rev in review.items():

        # Initializing the positive and negative lexicon count
        x1 = np.float64(0)
        x2 = np.float64(0)

        # Initializing the word count
        count = np.float64(0)

        # Converting the review text into lower case
        rev = rev.lower()

        # Creating a word vector array for each review 
        revsplit = rev.split()
        
        for word in revsplit:
            
            # Counting the number of Positive Lexicons in the review
            if word in pLex:
                x1 += 1
                
            # Counting the number of Negative Lexicons in the review
            if word in nLex:
                x2 += 1
            
            # Counting the number of words in the review
            count += 1
        
        # Taking log of the word count
        x4 = np.log(count)
        
        # Extracting the rating from the dataset dictionary index
        x3 = np.float64(((re.findall("_\d+", filename)).pop(0))[1:])
        
        # Checking the occurence of `no` in the review
        if 'no' in rev:
            x5 = np.float64(1)
        else:
            x5 = np.float64(0)

        # Checking the occurence of `!` in the review
        if '!' in rev:
            x6 = np.float64(1)
        else:
            x6 = np.float64(0)

        # Storing the features in a feature vector
        x = np.array([x1, x2, x3, x4, x5, x6])
        X.append(x)
   
    X = np.array(X)
    return X

In [0]:
# Passing the datasets to the preprocessing function 
ptrX = preprocess(ptrReview, pLex, nLex)
ntrX = preprocess(ntrReview, pLex, nLex)
pteX = preprocess(pteReview, pLex, nLex)
nteX = preprocess(nteReview, pLex, nLex)

# Extracting labels from our datasets
ptrY = np.ones(ptrX.shape[0]).reshape(-1,1)
ntrY = np.zeros(ntrX.shape[0]).reshape(-1,1)
pteY = np.ones(pteX.shape[0]).reshape(-1,1)
nteY = np.zeros(nteX.shape[0]).reshape(-1,1)

# Combining the negative and positive dataset to get train and test data further divided by X and Y
Xtr = np.concatenate((ptrX, ntrX))
Ytr = np.concatenate((ptrY, ntrY))
Xte = np.concatenate((pteX, nteX))
Yte = np.concatenate((pteY, nteY))

### 1.3 Shuffling Dataset
Shuffling data serves the purpose of reducing variance and making sure that models remain general and overfit less.


In [0]:
# Shuffle numbers from 0 to the size of the dataset
def createShuffle(size):
    array = np.array([i for i in range(size)])
    random.shuffle(array)
    return array
    
# Sorting the datasets index according to the previously shuffled array
def shuffleArray(posArray, arrayToShuffle):
    size = len(posArray)

    # Initializing a shuffled array
    shuffledArray = np.empty_like(arrayToShuffle) 

    # Iterating through the entire dataset and sorting the dataset using the previously shuffled array as the index
    for i in range(size):
        shuffledArray[i] = arrayToShuffle[posArray[i]]
    return shuffledArray

m = len(ptrReview) + len(ntrReview)
array = createShuffle(m)

In [0]:
Xtr = shuffleArray(array, Xtr)
Ytr = shuffleArray(array, Ytr)

## 2. Implementation

In [0]:
# Sigmoid function takes an input z of any real number and returns an output value in the range of 0 and 1
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Defining epsilon
epsilon = np.float( 1e-100 )

# Cross Entropy loss measures the performance of our classification model.
def crossEntropyLoss(h_x, Y):
    total_cost = (-Y * np.log(h_x + epsilon) - (1 - Y) * np.log(1 - h_x + epsilon)).mean()
    return total_cost

# Hypothesis Function h(x) predicts the label of a set of data
def hypothesis(X, Theta):

    # Taking dot product of the Feature Vector 'X' and the parameters 'Thetas'
    hyp = sigmoid(np.dot(X, Theta).reshape(-1,1))

    return hyp

#Predicting Y
def predict(X, Theta):

    # Using our model to predict the label of the set of data
    h_x = hypothesis(X, Theta)
    predictions = []

    # Classifying the predicted value into {0,1} by separating at the 0.5 mark
    for h_xi in h_x:
        if h_xi[0] <= 0.5:
            predictions.append(0)
        else:
            predictions.append(1)
    
    # Storing the predictions in an array
    predictions = np.array(predictions)
    predictions = predictions.reshape(-1,1)
    return predictions


### 2.1 Batch Gradient Descent

In [0]:
# Batch gradient function
def batchGradientDescent(X, Y, n_epochs, alpha):

    # Inittializing the parameters with ones
    Theta = np.ones(X.shape[1]).reshape(-1, 1)

    # Initializing the cost
    J = np.zeros(n_epochs)

    # Iterating for a predefined epoch
    for i in range(n_epochs):

        # Predicting the labels using our model
        h_x = hypothesis(X, Theta)

        # Calculating the cross entropy loss for our model for the entire data set
        J[i] = crossEntropyLoss(h_x, Y)

        # Updating Theta (parameters) using gradient descent with a predefined alpha
        Theta -= alpha * np.dot(X.T, (h_x - Y)) / X.shape[0]

    return J, Theta

In [0]:
n_epochs = 500
alpha = 0.01
J_batch, T_batch = batchGradientDescent(Xtr, Ytr, n_epochs, alpha)

### 2.2 Stochastic Gradient Descent

In [0]:

def stochastic_gradient_descent(X, Y, n_epochs, alpha):
    
    # Inittializing the parameters with ones
    Theta = np.ones(X.shape[1]).reshape(-1, 1)
    
    # Initializing the cost
    J = np.zeros(n_epochs * m)

    index = 0

    # Iterating for a predefined epoch
    for i in range(n_epochs):

        # Iterating through each data entry
        for x, y in zip(X, Y):

            # Predicting the label using our model
            h_x = hypothesis(x.reshape(1, -1), Theta)
            
            # Calculating the cross entropy loss for our model
            J[index] = crossEntropyLoss(h_x, y.reshape(1, -1))

            index +=1      
            
            # Updating Theta (parameters) using gradient descent with a predefined alpha for each data entry
            Theta -= alpha * np.dot(x.reshape(1,-1).T, (h_x - y.reshape(1,-1))) / x.reshape(1,-1).shape[0] 

    return J, Theta

In [0]:
J_stochastic, T_stochastic = stochastic_gradient_descent(Xtr, Ytr, n_epochs, alpha)

## 3. Evaluation

In [0]:
predict_batch = predict(Xte, T_batch)
predict_stochastic = predict(Xte, T_batch)

def eval(prediction, expected):
    tp = 0
    tn = 0
    fp = 0
    fn = 0

    for E, P in zip(expected, prediction):
        if E == 0:

            # Prediction = 0 and Expected = 0
            if P == E:
                tn += 1

            # Prediction = 1 and Expected = 0
            else:
                fp += 1

        if E == 1:

            # Prediction = 0 and Expected = 1
            if P == E:
                tp += 1

            # Prediction = 1 and Expected = 1
            else:
                fn += 1
                
    #Precision
    precision = float(tp) / float(tp + fp)
    print("Precision: ", precision)
    
    #Recall
    recall = float(tp) / float(tp + fn)
    print("Recall: ", recall)
    
    #Accuracy
    accuracy = float(tp + tn) / float(tp + fn + tn + fp)
    print("Accuracy: ", accuracy)
    
    #F1 score
    f1_score = (2 * precision * recall) / (precision + recall)
    print("F1 Score: ", f1_score)
    
    #Confusion Matrix
    print("\nConfusion Matrix:")
    print(tp, fp)
    print(fn, tn)
    



### 3.1 Batch Gradient Descent Evaluation

In [0]:
eval(predict_batch, Yte)

Precision:  0.9879996821107844
Recall:  0.99456
Accuracy:  0.99124
F1 Score:  0.9912689869632818

Confusion Matrix:
12432 151
68 12349


### 3.2 Stochastic Gradient Descent Evaluation

In [0]:
eval(predict_stochastic, Yte)

Precision:  0.9879996821107844
Recall:  0.99456
Accuracy:  0.99124
F1 Score:  0.9912689869632818

Confusion Matrix:
12432 151
68 12349
