<h1>Bag-of-Words Representation and Multinomial Naive Bayes Model</h1>
<hr>
<strong>Question 4.5 (Coding) [4 points]:</strong> Extend your classifier so that it can compute an MAP estimate of
parameters using a fair Dirichlet prior. This corresponds to additive smoothing.
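
For reference, the estimate asked for here can be written as below. This is a sketch in notation assumed for illustration: T_{j,c} is the number of occurrences of word j in the training tweets of class c (the `Tj,neutral`, `Tj,positive`, `Tj,negative` of the code comments), V is the vocabulary size (`wordCount`), and \alpha is the pseudo-count added to each word. "Fair" is read here as a symmetric Dirichlet prior with parameter \alpha + 1 for every word; \alpha = 1 gives the add-one smoothing used in the cells that follow.

```latex
\hat{\theta}_{j \mid c}^{\text{MAP}}
  = \frac{T_{j,c} + \alpha}{\sum_{k=1}^{V} T_{k,c} + \alpha V},
\qquad
\alpha = 1 \;\Rightarrow\;
\hat{\theta}_{j \mid c}^{\text{MAP}}
  = \frac{T_{j,c} + 1}{\sum_{k=1}^{V} T_{k,c} + V}
```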

In [1]:
# including the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
cwd = os.getcwd()

In [2]:
# getting the corresponding data sets.
dataFrameTestFeatures  = pd.read_csv( os.path.join(cwd, "data", "question-4-test-features.csv"),  header=None)
dataFrameTrainFeatures = pd.read_csv( os.path.join(cwd, "data", "question-4-train-features.csv"), header=None)
dataFrameTrainLabels   = pd.read_csv( os.path.join(cwd, "data", "question-4-train-labels.csv"),   header=None)
dataFrameTestLabels    = pd.read_csv( os.path.join(cwd, "data", "question-4-test-labels.csv"),    header=None)

In [3]:
# prepending the label column to the train and test feature matrices (labels end up in column 0)
npTrain = np.hstack((dataFrameTrainLabels.values, dataFrameTrainFeatures.values))
npTest  = np.hstack((dataFrameTestLabels.values,  dataFrameTestFeatures.values))

In [4]:
# printing the stacked matrices
print('*****************************')
print(npTrain)
print('*****************************')
print(npTest)
print('*****************************')
print(npTest.shape)

*****************************
[['neutral' 1 0 ... 0 0 0]
 ['positive' 1 1 ... 0 0 0]
 ['neutral' 1 0 ... 0 0 0]
 ...
 ['negative' 0 0 ... 0 0 0]
 ['negative' 0 0 ... 0 0 0]
 ['positive' 0 0 ... 0 0 0]]
*****************************
[['negative' 0 0 ... 0 0 0]
 ['negative' 0 0 ... 0 0 0]
 ['positive' 0 0 ... 0 0 0]
 ...
 ['neutral' 0 0 ... 0 0 0]
 ['negative' 0 0 ... 0 0 0]
 ['neutral' 0 0 ... 0 0 0]]
*****************************
(2928, 5723)


In [5]:
# getting the corresponding data according to the tweet type.
frameTrain      = pd.DataFrame(npTrain)
npTrainNeutral  = frameTrain[frameTrain[0] == 'neutral'].loc[:, 1:frameTrain.shape[1]].values
npTrainPositive = frameTrain[frameTrain[0] == 'positive'].loc[:, 1:frameTrain.shape[1]].values
npTrainNegative = frameTrain[frameTrain[0] == 'negative'].loc[:, 1:frameTrain.shape[1]].values

# getting the required numbers for calculations.
wordCount = npTrainNeutral.shape[1]
testCount = npTest.shape[0]
trainNeutralCount  = npTrainNeutral.shape[0]
trainPositiveCount = npTrainPositive.shape[0]
trainNegativeCount = npTrainNegative.shape[0]

In [6]:
# printing the required numbers.
print('TEST COUNT: ' + str(testCount))
print('WORD COUNT: ' + str(wordCount))
print('TRAIN NEUTRAL  COUNT: ' + str(trainNeutralCount))
print('TRAIN POSITIVE COUNT: ' + str(trainPositiveCount))
print('TRAIN NEGATIVE COUNT: ' + str(trainNegativeCount))

TEST COUNT: 2928
WORD COUNT: 5722
TRAIN NEUTRAL  COUNT: 2617
TRAIN POSITIVE COUNT: 2004
TRAIN NEGATIVE COUNT: 7091


In [7]:
# printing the train matrices.
print('-------TRAIN NEUTRAL---------')
print(npTrainNeutral)
print('*****************************')

print('-------TRAIN POSITIVE--------')
print(npTrainPositive)
print('*****************************')

print('-------TRAIN NEGATIVE--------')
print(npTrainNegative)
print('*****************************')

-------TRAIN NEUTRAL---------
[[1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
*****************************
-------TRAIN POSITIVE--------
[[1 1 1 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
*****************************
-------TRAIN NEGATIVE--------
[[1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
*****************************


In [8]:
# calculating the number of occurrences of word J.
npNumOfWordJOccNeutral  = npTrainNeutral.sum(axis=0)  # Tj,neutral
npNumOfWordJOccPositive = npTrainPositive.sum(axis=0) # Tj,positive
npNumOfWordJOccNegative = npTrainNegative.sum(axis=0) # Tj,negative

# the number of tweets in the training set.
neutralTweetCount  = npTrainNeutral.shape[0]  # N neutral
positiveTweetCount = npTrainPositive.shape[0] # N positive
negativeTweetCount = npTrainNegative.shape[0] # N negative
tweetCount = neutralTweetCount + positiveTweetCount + negativeTweetCount # N

# MAP (add-one smoothed) estimate of the probability of the j-th vocabulary word given the class.
probOfOccOfWordJNeutral  = (npNumOfWordJOccNeutral  + 1) / (npNumOfWordJOccNeutral.sum()  + wordCount) # P(Xj | Y = neutral)
probOfOccOfWordJPositive = (npNumOfWordJOccPositive + 1) / (npNumOfWordJOccPositive.sum() + wordCount) # P(Xj | Y = positive)
probOfOccOfWordJNegative = (npNumOfWordJOccNegative + 1) / (npNumOfWordJOccNegative.sum() + wordCount) # P(Xj | Y = negative)

# the probability that any particular tweet will be positive, negative or neutral.
probOfNeutral  = neutralTweetCount  / tweetCount # P(Y = neutral)
probOfPositive = positiveTweetCount / tweetCount # P(Y = positive)
probOfNegative = negativeTweetCount / tweetCount # P(Y = negative)

# converting the content to float.
probOfOccOfWordJNeutral = probOfOccOfWordJNeutral.astype(float)
probOfOccOfWordJPositive = probOfOccOfWordJPositive.astype(float)
probOfOccOfWordJNegative = probOfOccOfWordJNegative.astype(float)
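
The three smoothed estimates above share one formula, so they could be factored into a helper. The following is a minimal sketch, not the notebook's own code: `smoothed_word_probs`, its `alpha` parameter, and the toy count matrix are hypothetical names and data introduced only for illustration.

```python
import numpy as np

def smoothed_word_probs(class_counts, alpha=1.0):
    """Additive-smoothing estimate of P(word j | class).

    class_counts: (tweets x words) count matrix for a single class.
    alpha: pseudo-count added to every word (alpha=1 is add-one smoothing).
    """
    # total occurrences of each word in this class (T_j for the class)
    word_totals = np.asarray(class_counts, dtype=float).sum(axis=0)
    vocab_size = word_totals.shape[0]
    return (word_totals + alpha) / (word_totals.sum() + alpha * vocab_size)

# toy data, made up for illustration: 3 tweets, 4-word vocabulary
toy_counts = np.array([[1, 0, 2, 0],
                       [0, 0, 1, 0],
                       [1, 0, 0, 0]])
toy_probs = smoothed_word_probs(toy_counts, alpha=1.0)
print(toy_probs)
```

With the notebook's variables, a call such as `smoothed_word_probs(npTrainNeutral)` would be expected to match `probOfOccOfWordJNeutral`. Every returned probability is strictly positive, and the probabilities sum to 1 over the vocabulary.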

In [9]:
print('--------------------------------------------------------------------------------------------------------------------')
print('Number of Word J Occ in Neutral      : ' + str(npNumOfWordJOccNeutral))
print('Probability of Neutral               : ' + str(probOfNeutral))
print('Number of Neutral Tweets             : ' + str(neutralTweetCount))
print('Probability of Word J Occ in Neutral : ' + str(probOfOccOfWordJNeutral))
print('--------------------------------------------------------------------------------------------------------------------')
print('Number of Word J Occ in Positive     : ' + str(npNumOfWordJOccPositive))
print('Probability of Positive              : ' + str(probOfPositive))
print('Number of Positive Tweets            : ' + str(positiveTweetCount))
print('Probability of Word J Occ in Positive: ' + str(probOfOccOfWordJPositive))
print('--------------------------------------------------------------------------------------------------------------------')
print('Number of Word J Occ in Negative     : ' + str(npNumOfWordJOccNegative))
print('Probability of Negative              : ' + str(probOfNegative))
print('Number of Negative Tweets            : ' + str(negativeTweetCount))
print('Probability of Word J Occ in Negative: ' + str(probOfOccOfWordJNegative))
print('--------------------------------------------------------------------------------------------------------------------')

--------------------------------------------------------------------------------------------------------------------
Number of Word J Occ in Neutral      : [175 1 4 ... 0 0 0]
Probability of Neutral               : 0.2234460382513661
Number of Neutral Tweets             : 2617
Probability of Word J Occ in Neutral : [8.06710363e-03 9.16716322e-05 2.29179081e-04 ... 4.58358161e-05
 4.58358161e-05 4.58358161e-05]
--------------------------------------------------------------------------------------------------------------------
Number of Word J Occ in Positive     : [152 2 32 ... 0 0 0]
Probability of Positive              : 0.1711065573770492
Number of Positive Tweets            : 2004
Probability of Word J Occ in Positive: [8.54366763e-03 1.67522895e-04 1.84275184e-03 ... 5.58409649e-05
 5.58409649e-05 5.58409649e-05]
--------------------------------------------------------------------------------------------------------------------
Number of Word J Occ in Negative     : [189 3 136 ... 

In [10]:
# calculating the class scores for every test tweet.
npTest = dataFrameTestFeatures.values.astype(float)
resultsNeutral  = np.zeros((testCount, 1))
resultsPositive = np.zeros((testCount, 1))
resultsNegative = np.zeros((testCount, 1))

for i in range(testCount):
    tempNeutral  = np.zeros((1, wordCount))
    tempPositive = np.zeros((1, wordCount))
    tempNegative = np.zeros((1, wordCount))
    
    for j in range(wordCount):
        # number of times word j occurs in test tweet i.
        occOfWordJInTweetI = npTest[i][j]
        # the guards below skip only the 0 * log(0) case; with smoothing every probability is positive.
        if ((probOfOccOfWordJNeutral[j] != 0 or occOfWordJInTweetI != 0)):
            tempNeutral[0][j]  = occOfWordJInTweetI * np.log(probOfOccOfWordJNeutral[j])
            
        if ((probOfOccOfWordJPositive[j] != 0 or occOfWordJInTweetI != 0)):
            tempPositive[0][j] = occOfWordJInTweetI * np.log(probOfOccOfWordJPositive[j])
            
        if ((probOfOccOfWordJNegative[j] != 0 or occOfWordJInTweetI != 0)):
            tempNegative[0][j] = occOfWordJInTweetI * np.log(probOfOccOfWordJNegative[j])
        
    # log-posterior score of each class: log prior + sum over words of count * log P(word | class)
    resultsNeutral[i][0]  = (np.log(probOfNeutral)  + tempNeutral.sum(axis=1))[0]
    resultsPositive[i][0] = (np.log(probOfPositive) + tempPositive.sum(axis=1))[0]
    resultsNegative[i][0] = (np.log(probOfNegative) + tempNegative.sum(axis=1))[0]
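
The double loop above computes, for each tweet, the log prior plus a count-weighted sum of log word probabilities. When every word probability is strictly positive, as it is with smoothing, the same quantity can be obtained with one matrix-vector product. A minimal sketch follows; `log_posterior_scores` and the toy arrays are hypothetical names and data used only for illustration.

```python
import numpy as np

def log_posterior_scores(test_counts, word_probs, class_prior):
    """Return log P(Y=c) + sum_j count_ij * log P(X_j | Y=c) for every test row.

    Assumes every entry of word_probs is strictly positive (true after smoothing).
    """
    test_counts = np.asarray(test_counts, dtype=float)
    return np.log(class_prior) + test_counts @ np.log(word_probs)

# toy data, made up for illustration: 2 tweets, 3-word vocabulary
toy_test = np.array([[1, 0, 2],
                     [0, 0, 0]])
toy_word_probs = np.array([0.5, 0.25, 0.25])
toy_scores = log_posterior_scores(toy_test, toy_word_probs, class_prior=0.5)
print(toy_scores)
```

With the notebook's variables, a call such as `log_posterior_scores(npTest, probOfOccOfWordJNeutral, probOfNeutral)` would be expected to reproduce `resultsNeutral`, but as a 1-D array of length `testCount` rather than a column.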

In [11]:
# printing the log-posterior scores for each tweet type.
print('-------Neutral-------')
print(resultsNeutral)
print('---------------------')

print('-------Positive------')
print(resultsPositive)
print('---------------------')

print('-------Negative------')
print(resultsNegative)
print('---------------------')

-------Neutral-------
[[-77.32434937]
 [-86.01422333]
 [-25.11251507]
 ...
 [-28.97438985]
 [-62.08238288]
 [-53.13756197]]
---------------------
-------Positive------
[[-71.49682708]
 [-89.72128942]
 [-25.31050152]
 ...
 [-31.53684995]
 [-65.82798516]
 [-57.62132515]]
---------------------
-------Negative------
[[-68.25541905]
 [-80.1110864 ]
 [-26.26012836]
 ...
 [-29.74415034]
 [-56.72641379]
 [-49.51032312]]
---------------------


In [12]:
# picking the highest-scoring class for each tweet (ties go to neutral, then negative, then positive).
results = []
for i in range(testCount):
    if ((resultsNeutral[i] >= resultsNegative[i]) and (resultsNeutral[i] >= resultsPositive[i])):
        results.append('neutral')
    elif ((resultsNegative[i] >= resultsNeutral[i]) and (resultsNegative[i] >= resultsPositive[i])):
        results.append('negative')
    elif ((resultsPositive[i] >= resultsNeutral[i]) and (resultsPositive[i] > resultsNegative[i])):
        results.append('positive')
        
results = np.array(results).reshape(-1,1)
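
The comparison chain above amounts to an arg-max over the three class scores. A minimal sketch of that idea follows; `pick_labels`, `class_names`, and the toy scores are hypothetical. The columns are ordered neutral, negative, positive because `np.argmax` returns the first maximum, which keeps the same tie-breaking order as the `if`/`elif` chain.

```python
import numpy as np

# column order fixes the tie-breaking: neutral, then negative, then positive
class_names = np.array(['neutral', 'negative', 'positive'])

def pick_labels(scores_neutral, scores_negative, scores_positive):
    """Return an (n, 1) array with the highest-scoring class name per tweet."""
    stacked = np.column_stack((np.ravel(scores_neutral),
                               np.ravel(scores_negative),
                               np.ravel(scores_positive)))
    return class_names[np.argmax(stacked, axis=1)].reshape(-1, 1)

# toy scores, made up for illustration; the second tweet is a three-way tie
toy_labels = pick_labels([-3.0, -1.0, -2.0],
                         [-2.0, -1.0, -5.0],
                         [-4.0, -1.0, -1.0])
print(toy_labels)
```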

In [13]:
print('-------Results------')
print(results)
print('--------------------')
print('-----Test Values----')
print(dataFrameTestLabels.values)
print('--------------------')

-------Results------
[['negative']
 ['negative']
 ['neutral']
 ...
 ['neutral']
 ['negative']
 ['negative']]
--------------------
-----Test Values----
[['negative']
 ['negative']
 ['positive']
 ...
 ['neutral']
 ['negative']
 ['neutral']]
--------------------


In [14]:
# calculating Accuracy
estimateCount = 0
failureCount  = 0

for i in range(testCount):
    if dataFrameTestLabels.values[i] == results[i]:
        estimateCount = estimateCount + 1
    else:
        failureCount = failureCount + 1

print(estimateCount)
print(failureCount)
accuracy = (estimateCount / (estimateCount + failureCount)) * 100
print('ACCURACY: ' + str(accuracy))

2205
723
ACCURACY: 75.30737704918032
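
The counting loop above can also be written as the mean of an element-wise comparison. A minimal sketch, assuming both label arrays have one entry per tweet; `accuracy_percent` and the toy labels are hypothetical.

```python
import numpy as np

def accuracy_percent(true_labels, predicted_labels):
    """Percentage of positions where the predicted label equals the true label."""
    true_labels = np.ravel(true_labels)
    predicted_labels = np.ravel(predicted_labels)
    return float(np.mean(true_labels == predicted_labels) * 100)

# toy labels, made up for illustration: 3 of 4 predictions are correct
toy_true = np.array([['negative'], ['positive'], ['neutral'], ['negative']])
toy_pred = np.array([['negative'], ['neutral'], ['neutral'], ['negative']])
toy_accuracy = accuracy_percent(toy_true, toy_pred)
print('ACCURACY: ' + str(toy_accuracy))
```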


<h1 align="center">INTERPRETATION OF THE RESULTS</h1>
<hr>
With this model, an accuracy of <strong>75.30737704918032%</strong> is achieved. Because this model uses additive smoothing (a MAP estimate with a fair Dirichlet prior), it predicts the test data better than the model without smoothing, which reached <strong>62.807377049180324%</strong> accuracy. That is to say, smoothing alone raised the accuracy by 12.5 percentage points, and the current model is more reliable than the previous one.
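
One likely reason for the gap, offered here as an assumption rather than something measured in this notebook, is that without smoothing a word never seen in a class gets probability zero, so any tweet containing it scores minus infinity for that class. The toy counts below are made up to illustrate the effect.

```python
import numpy as np

# hypothetical per-class word totals; the word at index 1 never occurs in this class
word_totals = np.array([3.0, 0.0, 2.0])
vocab_size = word_totals.shape[0]

mle_probs = word_totals / word_totals.sum()                       # no smoothing
map_probs = (word_totals + 1) / (word_totals.sum() + vocab_size)  # add-one smoothing

# a tweet that contains the unseen word once
tweet = np.array([1.0, 1.0, 0.0])

with np.errstate(divide='ignore'):  # silence the log(0) warning
    mle_score = np.sum(tweet * np.log(mle_probs))
map_score = np.sum(tweet * np.log(map_probs))

print(mle_score)  # minus infinity: the class is ruled out by a single unseen word
print(map_score)  # finite: the unseen word only lowers the score
```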