<a href="https://colab.research.google.com/github/potasali/Machine-Learning/blob/master/Programming_Assignment_4_Na%C3%AFve_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Programming Assignment 4: Naïve Bayes


## Problem
The purpose of this assignment is to get you familiar with multinomial sentiment
classification. By the end of this assignment you will have your very own “Sentiment
Analyzer”. You are given with Twitter US Airline Sentiment Dataset that contains around
14,640 tweets about airlines labelled as positive, negative and neutral. Your task is to train
a Naïve Bayes classifier on this dataset.



## 1. Dataset Splitting
Instead of a usual random split, we will split the dataset in a stratified fashion. Stratified
splitting ensure that the train and test sets have approximately the same percentage of
samples of each target class as the complete set. For example, in an 80-20 stratified split
80% samples of each class will be in train set and 20% in test set.
Implement stratified split and do the 80-20 train-test split of the provided dataset

In [0]:
# Importing Libraries
import os
import glob
import re
import numpy as np
from matplotlib import pyplot
import math 
import pandas as pd
import string
import heapq
from nltk.corpus import stopwords

In [0]:
# Loading the dataset
df = pd.read_csv (r'Tweets.csv')

# Shuffling the dataset
df.sample(frac=1)
df

Unnamed: 0,airline_sentiment,text
0,neutral,@USAirways Is there a phone line to call into ...
1,positive,@united Bag was finally delivered and intact. ...
2,positive,@usairways Thanks to Kevin and team at F38ish ...
3,negative,"@AmericanAir Yes, talked to them. FLL says is ..."
4,negative,@VirginAmerica and it's a really big bad thing...
...,...,...
14635,positive,@southwestair Great job celebrating #MardiGras...
14636,negative,@JetBlue I have been on phone with rep for ove...
14637,positive,@VirginAmerica @SSal thanks!
14638,positive,@VirginAmerica Thank you!!


In [0]:
# Dividing the dataset into train and test data and then furthe dividing it into X and Y. 
def dataSplit(df):
    
    # Exctracting data for each class
    neu = df[df['airline_sentiment'] == 'neutral']
    pos = df[df['airline_sentiment'] == 'positive']
    neg = df[df['airline_sentiment'] == 'negative']

    # Splitting the dataset into train and test data using 80:20 ratio

    # Using the first 80% data entries as train data
    neu_train = neu[0:int( len(neu)*0.8 )]
    pos_train = pos[0:int( len(pos)*0.8 )]
    neg_train = neg[0:int( len(neg)*0.8 )]
    
    # Using the last 20% data entries as test data
    neu_test = neu[int(len(neu)*0.8): len(neu)]
    pos_test = pos[int(len(pos)*0.8): len(pos)]
    neg_test = neg[int(len(neg)*0.8): len(neg)]

    # Combining the three classes to get train and test data further divided by X and Y
    X_train = np.concatenate((neu_train['text'], pos_train['text'], neg_train['text']))
    Y_train = np.concatenate((neu_train['airline_sentiment'], pos_train['airline_sentiment'], neg_train['airline_sentiment']))
    X_test = np.concatenate((neu_test['text'], pos_test['text'], neg_test['text']))
    Y_test = np.concatenate((neu_test['airline_sentiment'], pos_test['airline_sentiment'], neg_test['airline_sentiment']))
    
    return X_train, Y_train, X_test, Y_test, np.array([neu_train['text'], pos_train['text'], neg_train['text']]), np.array( [neu_test['text'], pos_test['text'], neg_test['text']])

X_train, Y_train, X_test, Y_test, x_train, x_test  = dataSplit(df)


## 2. Dataset Preprocessing
We’ll represent a tweet as a bag-of-words, that is, an unordered set of words with their
position ignored, keeping only their frequency in the tweet. <br>
Please note that in our case the vocabulary might be in thousands, so we will use text cleaning techniques such as ignore case, punctuation and frequent (stop) words like “a”, ”an”, “the” etc. to reduce the size of vocabulary.


### 2.1 One Hot Encoding & Label Encoding

In [0]:
# One hot encoding for each label in the training dataset
# Using 
#   first column as neutral
#   second column as positive
#   third column as negative
def oheEncoding(data, arr):
    '''
    Returns a matrix where each sample in y is represented
           as a row, and each column represents the class label in
           the one-hot encoding scheme.
    '''
    for label in data:

        if label == 'neutral':
            arr.append([1,0,0])

        elif label == 'positive':
            arr.append([0,1,0])

        elif label == 'negative':
            arr.append([0,0,1])

    arr = np.asarray(arr)

# label encoding for each label in the test dataset
def labelEncoding(data, arr):

    for label in data:

        # Labeling neutral as 0
        if label == 'neutral':
            arr.append(0)

        # Labeling positive as 1
        elif label == 'positive':
            arr.append(1)
        
        # Labeling negative as 2
        elif label == 'negative':
            arr.append(2)
            
    arr = np.asarray(arr)

def yProcess(Y_train, Y_test):
    ytr = []
    yte = []
    oheEncoding(Y_train, ytr)
    labelEncoding(Y_test, yte)
    return ytr, yte


### 2.1 Data Cleaning & Bag of Words Implementation

In [0]:

# Cleaning the dataset
def dataCleaning(data):
    arr = []

    # Iterating through the entire X data
    for text in data:

        # Converting text into lower case
        text = text.lower()

        # Replaces all non word characters (this excludes characters from a to Z and digits from 0-9) with a space
        text = re.sub(r'\W',' ',text)

        # Replaces multiple spaces with a single space
        text = re.sub(r'\s+',' ',text)

        # Appends the preprocessesed data to an array
        arr.append(text)

    arr = pd.DataFrame(arr, columns=["text"])
    return arr

# Finding the frequently used words from the dataset

def dictionary(tweets):

    # Initializing the word dictionary
    wordfreq = {}

    # Iterating through the entire X data
    for tweet in tweets:

        # Tokenizing each word from the data
        tokens = tweet.split()

        # Iterating through every word
        for token in tokens:

            # Update the count of the word in the dictionary while dynamically adding new words to the dictionary
            if token not in wordfreq.keys():
                wordfreq[token] = 1
            else:
                wordfreq[token] += 1

    return wordfreq

# Creating a bag of words vector
def bow(tweets, vocab):

    # Initializing the bag of word vector
    tweet_vectors = []

    # Iterating through the entire X data
    for tweet in tweets:

        # Tokenizing each word from the data
        tweet_tokens = tweet.split()
        sent_vec = []

        # Iterating through every word from the entire vocabulary
        for token in vocab:

            # Update the count of the each word in the tweet for the given vocabulary
            if token in tweet_tokens:
                sent_vec.append(tweet.count(token))
            else:
                sent_vec.append(0)

        # Appending the counts to the bag of word vector   
        tweet_vectors.append(sent_vec)
    return tweet_vectors

def xProcess(X_train, X_test, x_train, x_test):
    
    # Passing the datasets through the dataCleaing function 
    xtr = dataCleaning(X_train)
    xte = dataCleaning(X_test)

    # Initializing an array to store preprocessed data separated by classes
    _xtr = []

    # Passing the classes separately through the dataCleaing function 
    for i in range(3):
        _xtr.append(dataCleaning(x_train[i]))

    # Finding the vocabulary for the train data
    wordfreq = dictionary(xtr.text)

    # Sorting the vocabulary using heapsort using word count
    mostfreq = heapq.nlargest(5989, wordfreq, key=wordfreq.get)

    # Removing stopwords fromo the vocabulary
    mostfreq_stop = [word for word in mostfreq if not word in stopwords.words()]

     # Finding the bag of words for the classes separately
    for i in range(3):
        _xtr[i] = bow(_xtr[i].text, mostfreq_stop)
        _xtr[i] = pd.DataFrame(_xtr[i])
        _xtr[i] = _xtr[i].to_numpy()

    xte = bow(xte.text, mostfreq_stop)
    xte = pd.DataFrame(xte)    
    xte = xte.to_numpy()
    
    return _xtr, xte

In [0]:
# Returns separate preprocessed dataset for each class
def dataSetPreProcess(X_train, X_test, Y_train, Y_test, x_train, x_test):   
    ytr, yte = yProcess(Y_train, Y_test)
    xtr, xte = xProcess(X_train, X_test, x_train, x_test)
    return xtr, ytr, xte, yte
xtr, ytr, xte, yte = dataSetPreProcess(X_train, X_test, Y_train, Y_test, x_train, x_test)

In [0]:
print(xtr[0].shape)
print(xtr[1].shape)
print(xtr[2].shape)
print(xte.shape)

(2479, 5729)
(1890, 5729)
(7342, 5729)
(2929, 5729)


 ## 3. Implementation

### 3.1  Training Naive Bayes

In [0]:
def trainNaiveBayes(X_train, xtr):
    
    # number of samples in X
    N_doc = (X_train.shape)[0]

    # number of samples from X in each class
    N_c = np.array([(xtr[0].shape)[0], (xtr[1].shape)[0], (xtr[2].shape)[0]])

    # Finding the probablity of each prior
    prob_prior = np.array([N_c[0], N_c[1], N_c[2]])
    prob_prior = prob_prior / N_doc

    # Finding the log of the priors
    log_prior = np.log(prob_prior)
    

    class_sum = []
    class_total = []
    likelihood = []

    # Iterating over each class
    for i in range(3):
        
        # Finding the total count of each word
        class_sum.append(np.sum(xtr[i], axis = 0)) 
        
         # Applying Add-1 smoothing 
        class_sum[i] = class_sum[i] + 1        
        class_total.append(np.sum(class_sum))
        
        # Finding the probability of each word
        likelihood.append(class_sum[i] / class_total[i])
    
    # Finding the log of the probabilty of each word
    log_likelihood = []
    for i in range(3):
        log_likelihood.append(np.log(likelihood[i]))
    
    # returns log P(c) and log P(w|c)
    return log_prior, log_likelihood

In [0]:
log_prior, log_likelihood = trainNaiveBayes(X_train, xtr)

### 3.2 Testing Naive Bayes

In [0]:
def testNaiveBayes(log_prior, log_likelihood, xte):
    predictions = []

    # Finding the probability of the test sample being in one of the classes
    for i in range(3):
        class_prob = np.dot(xte,log_likelihood[i])
        class_prob = class_prob + log_prior[i]
        predictions.append(class_prob)
    
    prediction_label = []
    m = (xte.shape)[0]

    # Iterating through the entire test data
    for i in range(m):

        # Choosing the class with the maximum probability
        _max = max(predictions[0][i],predictions[1][i],predictions[2][i])
        for j in range(3):
            if _max == predictions[j][i]:
                prediction_label.append(j)
                
    print(len(prediction_label))
    
    return prediction_label

In [0]:
prediction_label = testNaiveBayes(log_prior, log_likelihood, xte)

2929


## 4. Evaluation report


In [0]:
def eval(prediction, expected):
    
    # Initializing the confusion matrix along with all the outcomes
    tp = [0,0,0]
    tn = [0,0,0]
    fp = [0,0,0]
    fn = [0,0,0]
    CM = np.full((3,3), 0)

    # Updating the confusion matrix
    for E, P in zip(expected, prediction): 
        CM[P][E] += 1
    
    # Updating values for the true-positives, true-negatives, false-positives, false-negatives
    for i in range(3):
        tp[i] = CM[i][i]
        fp[i] = (CM[:, i]).sum() - tp[i]
        fn[i] = (CM[i, :]).sum() - tp[i]
        tn[i] = (CM.sum()) - tp[i] - fp[i] - fn[i]

    # Displaying the Confusion Matrix
    CM = pd.DataFrame(data=CM)
    CM.columns = ['neutral', 'positive', 'negative']
    CM.rename(index={0:'neutral',1:'positive',2:'negative'}, inplace=True)
    print("Confusion Matrix\n\n", CM)
    
    # Finding the pooled confusion matrix
    poolCM = np.array([[sum(tp), sum(fp)],[sum(fn), sum(tn)]])
    print("\nPooled: \n", poolCM)
    
    # Micro-Averaging Precision 
    print("\nMicro-Average Precision: ", (sum(tp)/(sum(fp)+sum(tp))))
    
    # Micro-Averaging Recall
    print("Micro-Average Recall: ", (sum(tp)/(sum(fn)+sum(tp))))
    
    # Micro-Averaging Accuracy
    print("Micro-Average Accuracy: ", ((sum(tp)+sum(tn))/(sum(fn)+sum(tp)+sum(fp)+sum(tn))))
    
    # Micro-Averaging F1 score
    miP = (sum(tp)/(sum(fp)+sum(tp)))
    miR = (sum(tp)/(sum(fn)+sum(tp)))
    f1_score = (2 * miP * miR) / (miP + miR)
    print("Micro-Average F1-Score: ", f1_score)
    
    # Finding Precison, Accuracy and Recall for each class
    maP = []
    maR = []
    maA = []
    for i in range(3):
        maP.append(tp[i] / (tp[i] + fp[i]))
        maR.append(tp[i] / (tp[i] + fn[i]))
        maA.append((tp[i] + tn[i]) / (tp[i] + tn[i] + fp[i] + fn[i]))

    # Macro-Average Precision
    _maP = (sum(maP))/3
    print("\nMacro-Average Precision: ", _maP)

    # Macro-Average Recall
    _maR = (sum(maR))/3
    print("Macro-Average Recall: ", _maR)

    # Macro-Average Accuracy
    _maA = (sum(maA))/3
    print("Macro-Average Accuracy: ", _maA)

    # Macro-Average F1 score
    f1_score = (2 * _maP * _maR) / (_maP + _maR)
    print("Macro-Average F1-Score: ", f1_score)

eval(prediction_label, yte)  

Confusion Matrix

           neutral  positive  negative
neutral       510       307       681
positive        7        68         1
negative      103        98      1154

Pooled: 
 [[1732 1197]
 [1197 4661]]

Micro-Average Precision:  0.5913280983270741
Micro-Average Recall:  0.5913280983270741
Micro-Average Accuracy:  0.7275520655513827
Micro-Average F1-Score:  0.5913280983270741

Macro-Average Precision:  0.531628054567613
Macro-Average Recall:  0.6956170990984031
Macro-Average Accuracy:  0.6956170990984031
Macro-Average F1-Score:  0.602666164967869
