<a href="https://colab.research.google.com/github/potasali/Machine-Learning/blob/master/Programming_Assignment_3_Multinomial_Logistic_Regression/Programming_Assignment_3_Multinomial_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Programming Assignment 3: Multinomial Logistic Regression

## Instructions
- The aim of this assignment is to give you an initial hands-on regarding real-life
machine learning application.
- Use separate training and testing data as discussed in class.
- You can only use Python programming language and Jupyter Notebook.
- You can only use numpy, matplotlib and are not allowed to use NLTK, scikit-learn
or any other machine learning toolkit.
- Submit your code as one notebook file (.ipynb) on LMS. The name of file
should be your roll number.

## Problem
The purpose of this assignment is to get you familiar with multinomial sentiment
classification. By the end of this assignment you will have your very own “Sentiment
Analyzer”. You are given with Twitter US Airline Sentiment Dataset that contains around
14,640 tweets about airlines labelled as positive, negative and neutral. Your task is to train
a Multinomial Logistic Regression classifier on this dataset.


## 1. Dataset Splitting
Instead of a usual random split, we will split the dataset in a stratified fashion. Stratified
splitting ensure that the train and test sets have approximately the same percentage of
samples of each target class as the complete set. For example, in an 80-20 stratified split
80% samples of each class will be in train set and 20% in test set.
Implement stratified split and do the 80-20 train-test split of the provided dataset

In [0]:
# Importing Libraries
import os
import glob
import re
import numpy as np
from matplotlib import pyplot
import math 
import pandas as pd
import string
import heapq
from nltk.corpus import stopwords

In [0]:
# Loading the dataset
df = pd.read_csv (r'Tweets.csv')

In [0]:
# Dividing the dataset into train and test data and then furthe dividing it into X and Y. 
def dataSplit(df):
    
    # Exctracting data for each class
    neu = df[df['airline_sentiment'] == 'neutral']
    pos = df[df['airline_sentiment'] == 'positive']
    neg = df[df['airline_sentiment'] == 'negative']

    # Splitting the dataset into train and test data using 80:20 ratio

    # Using the first 80% data entries as train data
    neu_train = neu[0:int( len(neu)*0.8 )]
    pos_train = pos[0:int( len(pos)*0.8 )]
    neg_train = neg[0:int( len(neg)*0.8 )]

    # Using the last 20% data entries as test data
    neu_test = neu[int(len(neu)*0.8): len(neu)]
    pos_test = pos[int(len(pos)*0.8): len(pos)]
    neg_test = neg[int(len(neg)*0.8): len(neg)]

    # Combining the three classes to get train and test data further divided by X and Y
    X_train = np.concatenate((neu_train['text'], pos_train['text'], neg_train['text']))
    Y_train = np.concatenate((neu_train['airline_sentiment'], pos_train['airline_sentiment'], neg_train['airline_sentiment']))
    X_test = np.concatenate((neu_test['text'], pos_test['text'], neg_test['text']))
    Y_test = np.concatenate((neu_test['airline_sentiment'], pos_test['airline_sentiment'], neg_test['airline_sentiment']))
    
    return X_train, Y_train, X_test, Y_test

X_train, Y_train, X_test, Y_test = dataSplit(df)


## 2. Dataset Preprocessing
We’ll represent a tweet as a bag-of-words, that is, an unordered set of words with their
position ignored, keeping only their frequency in the tweet. <br>
Please note that in our case the vocabulary might be in thousands, so we will use text cleaning techniques such as ignore case, punctuation and frequent (stop) words like “a”, ”an”, “the” etc. to reduce the size of vocabulary.


### 2.1 One Hot Encoding & Label Encoding

In [0]:
# One hot encoding for each label in the training dataset
# Using 
#   first column as neutral
#   second column as positive
#   third column as negative
def oheEncoding(data, arr):
    '''
    Returns a matrix where each sample in y is represented
           as a row, and each column represents the class label in
           the one-hot encoding scheme.
    '''
    for label in data:

        if label == 'neutral':
            arr.append([1,0,0])

        elif label == 'positive':
            arr.append([0,1,0])

        elif label == 'negative':
            arr.append([0,0,1])

    arr = np.asarray(arr)

# label encoding for each label in the test dataset
def labelEncoding(data, arr):
    for label in data:

        # Labeling neutral as 0
        if label == 'neutral':
            arr.append(0)

        # Labeling positive as 1
        elif label == 'positive':
            arr.append(1)

        # Labeling negative as 2
        elif label == 'negative':
            arr.append(2)

    arr = np.asarray(arr)


def yProcess(Y_train, Y_test):
    ytr = []
    yte = []
    oheEncoding(Y_train, ytr)
    labelEncoding(Y_test, yte)
    return ytr, yte


### 2.1 Data Cleaning & Bag of Words Implementation

In [0]:

# Cleaning the dataset
def dataCleaning(data):
    arr = []

    # Iterating through the entire X data
    for text in data:

        # Converting text into lower case
        text = text.lower()

        # Replaces all non word characters (this excludes characters from a to Z and digits from 0-9) with a space
        text = re.sub(r'\W',' ',text)

        # Replaces multiple spaces with a single space
        text = re.sub(r'\s+',' ',text)

        # Appends the preprocessesed data to an array
        arr.append(text)

    arr = pd.DataFrame(arr, columns=["text"])
    return arr

# Extracting the entire vocabulary from the dataset
def dictionary(tweets):

    # Initializing the word dictionary
    wordfreq = {}

    # Iterating through the entire X data
    for tweet in tweets:

        # Tokenizing each word from the data
        tokens = tweet.split()

        # Iterating through every word
        for token in tokens:

            # Update the count of the word in the dictionary while dynamically adding new words to the dictionary
            if token not in wordfreq.keys():
                wordfreq[token] = 1
            else:
                wordfreq[token] += 1

    return wordfreq

# Creating a bag of words array
def bow(tweets, vocab):

    #initializing the bag of word vector
    tweet_vectors = []

    # Iterating through the entire X data
    for tweet in tweets:

        # Tokenizing each word from the data
        tweet_tokens = tweet.split()

        sent_vec = []

        # Iterating through every word from the entire vocabulary
        for token in vocab:

            # Update the count of the each word in the tweet for the given vocabulary
            if token in tweet_tokens:
                sent_vec.append(tweet.count(token))
            else:
                sent_vec.append(0)
        
        # Appending the counts to the bag of word vector
        tweet_vectors.append(sent_vec)

    return tweet_vectors

def xProcess(X_train, X_test):

    # Passing the datasets through the dataCleaing function 
    xtr = dataCleaning(X_train)
    xte = dataCleaning(X_test)
    
    # Finding the vocabulary for the train data
    wordfreq = dictionary(xte.text)

    # Sorting the vocabulary using heapsort for the counts
    mostfreq = heapq.nlargest(5989, wordfreq, key=wordfreq.get)

    # Removing stopwords fromo the vocabulary
    mostfreq_stop = [word for word in mostfreq if not word in stopwords.words()]

    # Finding the bag of words for the datasets
    xtr = bow(xtr.text, mostfreq_stop)
    xte = bow(xte.text, mostfreq_stop)
    
    xtr = pd.DataFrame(xtr)
    xte = pd.DataFrame(xte)
    
    xtr = xtr.to_numpy()
    xte = xte.to_numpy()
    
    return xtr, xte

In [0]:
def dataSetPreProcess(X_train, X_test, Y_train, Y_test):
    ytr, yte = yProcess(Y_train, Y_test)
    xtr, xte = xProcess(X_train, X_test)
    return xtr, ytr, xte, yte
xtr, ytr, xte, yte = dataSetPreProcess(X_train, X_test, Y_train, Y_test)

In [0]:
xtr

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

 ## 3. Implementation

### 3.1 Softmax function

In [0]:
# Softmax function is an activation function that turns numbers into probabilities which sum to one.
def softmax(z):
    smax = (np.exp(z.T) / np.sum(np.exp(z), axis=1)).T
    return smax

# Hypothesis Function h(x) predicts the label of a set of data
def h_x(X, W):

    # Taking dot product of the Feature Vector 'X' and the weights 'W'
    hx = (np.dot(X,W))

    return hx

### 3.2 Cross-entropy loss function


In [0]:
# Cross Entropy loss measures the performance of our classification model.
def cross_entropy(hx, y):
    ce = - np.sum(np.log(hx) * (y), axis=1)
    return J

# Takes the mean of the cross entropy loss
def cost(hx, y):
    J = np.mean(cross_entropy(hx, y))
    return J

### 3.3 Mini-batch Gradient Descent with batch size of 32 samples


In [0]:
def miniBatchGD(X, Y, alpha, iters):
    """Mini Batch Gradient Descent.

        Parameters
        ------------
        X : {array-like, sparse matrix}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        Y : {array-like, sparse matrix}, shape = [n_samples, 1]
            Training vectors, where n_samples is the number of samples
        alpha : float (default: 0.001)
            Learning rate (between 0.0 and 1.0)
        epochs : int (default: 100)
            Passes over the training dataset.

        """

    # number of samples
    m = (xtr.shape)[0]
    
    # number of features
    n = (xtr.shape)[1] 

    # W : 2d-array, shape={n_features, 1}
    # Model weights after fitting.
    # Setting random state for shuffling and initializing the weights.
    W = np.full((n, 3), 0.1)
    
    # Iterating for a predefined iter
    for epoch in range (iters):

        # Iterating through batches of 32
        for i in range(365):

            # Extracting the next batch of 32 samples from the dataset
            b = i * 32
            batchX = X[b : b + 32]
            batchY = Y[b : b + 32]
            
            # Predicting the labels using our model
            ni = h_x(batchX, W)

            # Applying softmax on the predicted value
            smax = softmax(ni) 

            # Updating the weights using gradient descent with the predefined alpha
            diff = smax - batchY
            mse = np.mean(diff, axis=0)
            grad = np.dot(batchX.T, diff)
            W -= (alpha * grad)

    return W

W = miniBatchGD(xtr,ytr,0.001,100)

### 3.4 Prediction function 
to predict whether the tweet is positive, negative or neutral using learned multinomial logistic regression

In [0]:
# Returns the class with the maximum probability
def to_classlabel(z):
    label = z.argmax(axis=1)
    return label

## 4. Evaluation report


In [0]:
def eval(prediction, expected):
    
    # Initializing the confusion matrix along with all the outcomes
    tp = [0,0,0]
    tn = [0,0,0]
    fp = [0,0,0]
    fn = [0,0,0]
    CM = np.full((3,3), 0)

    # Updating the confusion matrix
    for E, P in zip(expected, prediction): 
        CM[P][E] += 1
    
    # Updating values for the true-positives, true-negatives, false-positives, false-negatives
    for i in range(3):
        tp[i] = CM[i][i]
        fp[i] = (CM[:, i]).sum() - tp[i]
        fn[i] = (CM[i, :]).sum() - tp[i]
        tn[i] = (CM.sum()) - tp[i] - fp[i] - fn[i]

    # Displaying the Confusion Matrix
    CM = pd.DataFrame(data=CM)
    CM.columns = ['neutral', 'positive', 'negative']
    CM.rename(index={0:'neutral',1:'positive',2:'negative'}, inplace=True)
    print("Confusion Matrix\n\n", CM)

    # Finding the pooled confusion matrix
    poolCM = np.array([[sum(tp), sum(fp)],[sum(fn), sum(tn)]])
    print("\nPooled: \n", poolCM)
    
    # Micro-Averaging Precision 
    print("\nMicro-Average Precision: ", (sum(tp)/(sum(fp)+sum(tp))))
    
    # Micro-Averaging Recall
    print("Micro-Average Recall: ", (sum(tp)/(sum(fn)+sum(tp))))
    
    # Micro-Averaging Accuracy
    print("Micro-Average Accuracy: ", ((sum(tp)+sum(tn))/(sum(fn)+sum(tp)+sum(fp)+sum(tn))))
    
    # Micro-Averaging F1 score
    miP = (sum(tp)/(sum(fp)+sum(tp)))
    miR = (sum(tp)/(sum(fn)+sum(tp)))
    f1_score = (2 * miP * miR) / (miP + miR)
    print("Micro-Average F1-Score: ", f1_score)
    
    # Finding Precison, Accuracy and Recall for each class
    maP = []
    maR = []
    maA = []
    for i in range(3):
        maP.append(tp[i] / (tp[i] + fp[i]))
        maR.append(tp[i] / (tp[i] + fn[i]))
        maA.append((tp[i] + tn[i]) / (tp[i] + tn[i] + fp[i] + fn[i]))

    # Macro-Average Precision
    _maP = (sum(maP))/3
    print("\nMacro-Average Precision: ", _maP)

    # Macro-Average Recall
    _maR = (sum(maR))/3
    print("Macro-Average Recall: ", _maR)

    # Macro-Average Accuracy
    _maA = (sum(maA))/3
    print("Macro-Average Accuracy: ", _maA)

    # Macro-Average F1 score
    f1_score = (2 * _maP * _maR) / (_maP + _maR)
    print("Macro-Average F1-Score: ", f1_score)

# Using our model to predict labels from the test data
ni = h_x(xte,W)
smax = softmax(ni)
predicted = to_classlabel(smax)
eval(predicted, yte)  

Confusion Matrix

           neutral  positive  negative
neutral       214        27        43
positive       73       275        42
negative      333       171      1751

Pooled: 
 [[2240  689]
 [ 689 5169]]

Micro-Average Precision:  0.7647661317855924
Micro-Average Recall:  0.7647661317855924
Micro-Average Accuracy:  0.8431774211903948
Micro-Average F1-Score:  0.7647661317855925

Macro-Average Precision:  0.6267534476211646
Macro-Average Recall:  0.7450486686488061
Macro-Average Accuracy:  0.7450486686488061
Macro-Average F1-Score:  0.6808005559736282
