# QnA Matching Data Science Scenario

## Part 4: Naive Bayes Classifier (Unigrams and Bigrams)

### Overview

__Part 4__ of the series shows the implementation of the _Naive Bayes Classifier_ as described in the paper entitled ["MCE Training Techniques for Topic Identification of Spoken Audio Documents"](http://ieeexplore.ieee.org/abstract/document/5742980/).

In this notebook, we have trained two classifiers. The first classifier is trained on a collection of unigrams, which could be single words or learned phrases from the __Part 1__. The second classifier is trained on a collection of bigrams, which could be the concatenation of single words or learned phrases. For example, the bag of bigrams of the sentence "create a javascript object" are #_create, create_a, a_javascript, javascript_object, object_# after adding the sentence boundary symbol #.

At the end, we save the Naive Bayes scores obtained on both training and test datasets into text files, which can be retrieved in the future notebooks.

Note: This notebook series are built under Python 3.5 and NLTK 3.2.2.


## Import required Python modules

In [91]:
import pandas as pd
import math
import numpy as np
import copy
import os
import matplotlib
import matplotlib.pyplot as plt
from azure.storage import CloudStorageAccount
from IPython.display import display

%matplotlib inline

# suppress all warnings
import warnings
warnings.filterwarnings("ignore")

## Read trainQ and testQ into Pandas DataFrames

In [92]:
trainQ_url = 'https://mezsa.blob.core.windows.net/stackoverflownew/trainQwithTokens.tsv'
testQ_url = 'https://mezsa.blob.core.windows.net/stackoverflownew/testQwithTokens.tsv'

trainQ = pd.read_csv(trainQ_url, sep='\t', index_col='Id', encoding='latin1')
testQ = pd.read_csv(testQ_url, sep='\t', index_col='Id', encoding='latin1')

## Create Tokens to IDs Hash

We assign an unique ID to each token in the vocabulary.

In [93]:
# create a list of ngrams [ngram1, ngram2, ...]
def create_ngram(tokens, ngram):
    if ngram > 1:
        # split tokens into a list
        # add "#" to represent the benginning and the end of sentence
        tokenList = ["#"] + tokens.split(',') + ["#"]
        # create a list of ngrams
        ngramList = ["_".join(tokenList[i:i+ngram]) for i in range(len(tokenList) - (ngram-1))]
    else: 
        # If ngram = 1, then only unigram will be considered
        ngramList = tokens.split(',')

    return ngramList

In [94]:
# get Token to ID mapping: {Token: tokenId}
def tokens_to_ids(tokens, featureHash, ngram):
    token2IdHash = {}
    for i in range(len(tokens)):
        # If ngram = 1, then only unigram will be considered
        tokenList = create_ngram(tokens.iloc[i], ngram)
            
        if featureHash is None:
            for t in tokenList:
                if t not in token2IdHash.keys():
                    token2IdHash[t] = len(token2IdHash)
        else:
            for t in tokenList:
                if t not in token2IdHash.keys() and t in list(featureHash.keys()):
                    token2IdHash[t] = len(token2IdHash)
            
    return token2IdHash

In [95]:
token2IdHashInit = tokens_to_ids(trainQ['Tokens'], featureHash=None, ngram=1)
token2IdHashInit_bigram = tokens_to_ids(trainQ['Tokens'], featureHash=None, ngram=2)

In [96]:
print("Total number of unique unigrams in the TrainQ: " + str(len(token2IdHashInit)))
print("Total number of unique bigrams in the TrainQ: " + str(len(token2IdHashInit_bigram)))

Total number of unique unigrams in the TrainQ: 4848
Total number of unique bigrams in the TrainQ: 184469


## Create Count Matrix for Each Token in Each Answer

In [97]:
def count_matrix(frame, token2IdHash, uniqueAnswerId, ngram):
    # create am empty matrix with the shape of:
    # num_row = num of unique tokens
    # num_column = num of unique answerIds (N_wA) or num of questions in testQ (tfMatrix)
    # rowIdx = token2IdHash.values()
    # colIdx = index of uniqueAnswerId (N_wA) or index of questions in testQ (tfMatrix)
    num_row = len(token2IdHash)
    if uniqueAnswerId is not None:  # get N_wA
        num_column = len(uniqueAnswerId)
    else:
        num_column = len(frame)
    countMatrix = np.empty(shape=(num_row, num_column))

    # loop through each question in the frame to fill in the countMatrix with corresponding counts
    for i in range(len(frame)):
        # If ngram = 1, then only unigram will be considered
        tokens = create_ngram(frame['Tokens'].iloc[i], ngram)
            
        if uniqueAnswerId is not None:   # get N_wA
            answerId = frame['AnswerId'].iloc[i]
            colIdx = uniqueAnswerId.index(answerId)
        else:
            colIdx = i
            
        for t in tokens:
            if t in token2IdHash.keys():
                rowIdx = token2IdHash[t]
                countMatrix[rowIdx, colIdx] += 1

    return countMatrix

In [98]:
# get unique answerId in ascending order
uniqueAnswerId = list(np.unique(trainQ['AnswerId']))
# calculate the count matrix of all training questions (unigrams only).
N_wAInit = count_matrix(trainQ, token2IdHashInit, uniqueAnswerId, ngram=1)
# calculate the count matrix of all training questions (bigrams only).
N_wAInit_bigram = count_matrix(trainQ, token2IdHashInit_bigram, uniqueAnswerId, ngram=2)

## Feature Selection Based on Posteriori Probability P(A|w) 

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/feature_selection.PNG?token=APoO9ioyiKq8Sx0dxZ5onqIzyy6ywUmEks5YnH6cwA%3D%3D">

In [99]:
# calculate P(A): [P_A1, P_A2, ...]
def prior_probability_answer(answerIds, uniqueAnswerId): 
    P_A = []
    # convert a pandas series to a list
    answerIds = list(answerIds)
    
    for id in uniqueAnswerId:
        P_A.append(answerIds.count(id)/len(answerIds))
    return np.array(P_A)

In [100]:
P_A = prior_probability_answer(trainQ['AnswerId'], uniqueAnswerId)

In [101]:
# calculate P(A|w)
def posteriori_prob(N_wAInit, P_A, uniqueAnswerId):
    # N_A is the total number of answers
    N_A = len(uniqueAnswerId)
    # N_w is the total number of times w appears over all documents 
    # rowSum of count matrix (N_wAInit)
    N_wInit = np.sum(N_wAInit, axis = 1)
    # P(A|w) = (N_w|A + N_A * P(A))/(N_w + N_A)
    N = N_wAInit + N_A * P_A
    D = N_wInit + N_A
    P_Aw = np.divide(N.T, D).T    
    
    return P_Aw

In [102]:
P_Aw = posteriori_prob(N_wAInit, P_A, uniqueAnswerId)
P_Aw_bigram = posteriori_prob(N_wAInit_bigram, P_A, uniqueAnswerId)

In [103]:
# select the top N tokens w which maximize P(A|w) for each A.
# get FeatureHash: {token: 1}
def feature_selection(P_Aw, token2IdHashInit, topN):
    featureHash = {}
    # for each answer A, sort tokens w by P(A|w)
    sortedIdxMatrix = np.argsort(P_Aw, axis=0)[::-1]
    # select top N tokens for each answer A
    topMatrix = sortedIdxMatrix[0:topN, :]
    # for each token w in topMatrix, add w to FeatureHash if it has not already been included
    topTokenIdList = np.reshape(topMatrix, topMatrix.shape[0] * topMatrix.shape[1])
    # get ID to Token mapping: {tokenId: Token}
    Id2TokenHashInit = {y:x for x, y in token2IdHashInit.items()}
    
    for tokenId in topTokenIdList:
        token = Id2TokenHashInit[tokenId]
        if token not in featureHash.keys():
            featureHash[token] = 1
    return featureHash

To determine the best top N tokens to select per answer, we have experimented different _topN_ values and found that selecting top 3 unigrams and top 30 bigrams per answer yields the best restuls. Four plots of evaluation matrices against different number of features are provided at the end of this notebook to describe how we determine those two numbers. 

In [104]:
featureHash = feature_selection(P_Aw, token2IdHashInit, topN=3)
featureHash_bigram = feature_selection(P_Aw_bigram, token2IdHashInit_bigram, topN=30)

## Re-assign ID to Each Selected Token and Re-calculate Count Matrix

After selecting the top N tokens of each answer, we use the collection of selected tokens for training and re-assign an ID to each selected token. Based on the new assigned IDs, we re-calculate the Count Matrix.

In [105]:
%time token2IdHash = tokens_to_ids(trainQ['Tokens'], featureHash=featureHash, ngram=1)
%time token2IdHash_bigram = tokens_to_ids(trainQ['Tokens'], featureHash=featureHash_bigram, ngram=2)

Wall time: 5.53 s
Wall time: 10min 15s


In [106]:
N_wA = count_matrix(trainQ, token2IdHash, uniqueAnswerId, ngram=1)
N_wA_bigram = count_matrix(trainQ, token2IdHash_bigram, uniqueAnswerId, ngram=2)

## Calculate P(w) on Full Collection of Training Questions (w is selected token)

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/P_w.PNG?token=APoO9vsVzM00o3cEfOPq5mUeQ_2eJK94ks5YnH6ywA%3D%3D">

In [107]:
def feature_weights(N_wA, alpha):
    # N_w is the total number of times w appears over all documents 
    # rowSum of count matrix (N_wA)
    N_w = np.sum(N_wA, axis = 1)
    # N_W is the total count of all words
    N_W = np.sum(N_wA)
    # N_V is the count of unique words in the vocabulary
    N_V = N_wA.shape[0]
    # P(w) = (N_w + 1*alpha) / (N_W +N_V*alpha)
    N2 = N_w + 1 * alpha
    D2 = N_W + alpha * N_V
    P_w = N2/D2

    return P_w

In [108]:
alpha = 0.0001
P_w = feature_weights(N_wA, alpha)
P_w_bigram = feature_weights(N_wA_bigram, alpha)

## Calculate Probability Function P(w|A) and P(w|NotA) on Training Data

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/probability_function.PNG?token=APoO9rWEZ1g_OgvWT_pleQlhT2DEFw3tks5YnIHzwA%3D%3D">

In [109]:
def word_probability_in_answer(N_wA, P_w, beta):
    # N_V is the count of unique words in the vocabulary
    N_V = N_wA.shape[0]
    # N_WA is the total count of all words in questions on answer A 
    # colSum of count matrix (N_wA)
    N_WA = np.sum(N_wA, axis=0)
    # P(w|A) = (N_w|A + beta N_V P(w))/(N_W|A + beta * N_V)
    N = (N_wA.T + beta * N_V * P_w).T
    D = N_WA + beta * N_V
    P_wA = N / D
    
    return P_wA

In [110]:
def word_probability_Notin_answer(N_wA, P_w, beta):
    # N_V is the count of unique words in the vocabulary
    N_V = N_wA.shape[0]
    # N_wNotA is the count of w over all documents but not on answer A
    # N_wNotA = N_w - N_wA
    N_w = np.sum(N_wA, axis = 1)
    N_wNotA = (N_w - N_wA.T).T
    # N_WNotA is the count of all words over all documents but not on answer A
    # N_WNotA = N_W - N_WA
    N_W = np.sum(N_wA)
    N_WA = np.sum(N_wA, axis=0)
    N_WNotA = N_W - N_WA
    # P(w|NotA) = (N_w|NotA + beta * N_V * P(w))/(N_W|NotA + beta * N_V)
    N = (N_wNotA.T + beta * N_V * P_w).T
    D = N_WNotA + beta * N_V
    P_wNotA = N / D
    
    return P_wNotA

In [111]:
beta = 0.0001
# unigrams:
P_wA = word_probability_in_answer(N_wA, P_w, beta)
P_wNotA = word_probability_Notin_answer(N_wA, P_w, beta)
# bigrams:
P_wA_bigram = word_probability_in_answer(N_wA_bigram, P_w_bigram, beta)
P_wNotA_bigram = word_probability_Notin_answer(N_wA_bigram, P_w_bigram, beta)

## Calculate Naive Bayes Weights

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/NB_weight.PNG?token=APoO9s-2DejCvW03RgK6zXgiXX6UT5WWks5YnjiRwA%3D%3D">

In [112]:
# given an answer set A, get the NB weights for each word w: {answerId: [(word_index1, weight1), (word_index2, weight2)]}
NBWeights = np.log(P_wA / P_wNotA)
NBWeights_bigram = np.log(P_wA_bigram / P_wNotA_bigram)

## Calculate Normalized TF of Each Word w in Test Set

Each document/question d is typically represented by a feature vector x that represents the contents of d. Because different documents can have different lengths, it can be useful to apply L1 normalized feature vector x. 

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/tf.PNG?token=APoO9tMyEVzqoUJYT9ALcdF3_BryHHEVks5YnIQywA%3D%3D">

In [113]:
def normalize_tf(frame, token2IdHash, ngram):
    N_wQ = count_matrix(frame, token2IdHash, uniqueAnswerId=None, ngram=ngram)
    N_WQ = np.sum(N_wQ, axis=0)
    
    # find the index where N_WQ is zero
    zeroIdx = np.where(N_WQ == 0)[0]
    
    # if N_WQ is zero, then the x_w for that particular question would be zero.
    # for a simple calculation, we convert the N_WQ to 1 in those cases so the demoninator is not zero. 
    if len(zeroIdx) > 0:
        N_WQ[zeroIdx] = 1
    
    # x_w = P_wd = count(w)/sum(count(i in V))
    x_w = N_wQ / N_WQ
    
    return x_w

In [114]:
# unigrams:
x_wTrain = normalize_tf(trainQ, token2IdHash, ngram=1)
x_wTest = normalize_tf(testQ, token2IdHash, ngram=1)

# bigrams:
x_wTrain_bigram = normalize_tf(trainQ, token2IdHash_bigram, ngram=2)
x_wTest_bigram = normalize_tf(testQ, token2IdHash_bigram, ngram=2)

## Score Each Question in Test Set Against a Specific Answer

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/NB_scores.PNG?token=APoO9vABVheo1aZRkUYQq41utE6VRM1Yks5YnITBwA%3D%3D">

In [115]:
beta_A = 0
# unigrams:
NBScoresTrain = -beta_A + np.dot(x_wTrain.T, NBWeights)
NBScoresTest = -beta_A + np.dot(x_wTest.T, NBWeights)

# bigrams:
NBScoresTrain_bigram = -beta_A + np.dot(x_wTrain_bigram.T, NBWeights_bigram)
NBScoresTest_bigram = -beta_A + np.dot(x_wTest_bigram.T, NBWeights_bigram)

## Save or Reload Scores
We save the Naive Bayes scores into text files, which can be retrieved in the future notebooks. 

In [116]:
fileNameTrain = os.path.join(os.getcwd(), "NBScoresTrain.out")
fileNameTest = os.path.join(os.getcwd(), "NBScoresTest.out")

fileNameTrain_bigram = os.path.join(os.getcwd(), "NBScoresBigramTrain.out")
fileNameTest_bigram = os.path.join(os.getcwd(), "NBScoresBigramTest.out")

# save scores to a text file:
if True: 
    np.savetxt(fileNameTrain, NBScoresTrain, delimiter=',')
    np.savetxt(fileNameTest, NBScoresTest, delimiter=',')
    np.savetxt(fileNameTrain_bigram, NBScoresTrain_bigram, delimiter=',')
    np.savetxt(fileNameTest_bigram, NBScoresTest_bigram, delimiter=',')

# reload the text file into numpy matrix:
if False:
    NBScoresTrain = np.loadtxt(fileNameTrain, delimiter=',')
    NBScoresTest = np.loadtxt(fileNameTest, delimiter=',')
    NBScoresTrain_bigram = np.loadtxt(fileNameTrain_bigram, delimiter=',')
    NBScoresTest_bigram = np.loadtxt(fileNameTest_bigram, delimiter=',')

## Rank the Naive Bayes Scores and Calculate Average Rank 

We use two evaluation matrices to test our model performance. For each question in the test set, we calculate a __Naive Bayes Score__ against each answer. Then we rank the answers based on their __Naive Bayes Scores__ to calculate __Average Rank__ and __Top 10 Percentage__ in the Test set using the below formula:

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/evaluation.PNG?token=APoO9hyYDFxGc9FRbmIXU3VGv0wdeCaPks5YnIVtwA%3D%3D">

The __Average Rank__ can be interpreted as in average at which position we can find the correct answer among all available answers for a given question. 

The __Top 10 Percentage__ can be interpreted as how many percentage of the new questions that we can find their correct answers in the first 10 choices.

In [117]:
# sort the similarity scores in descending order and map them to the corresponding AnswerId in Answer set
def rank(frame, scores, uniqueAnswerId):
    frame['SortedAnswers'] = list(np.array(uniqueAnswerId)[np.argsort(-scores, axis=1)])
    
    rankList = []
    for i in range(len(frame)):
        rankList.append(np.where(frame['SortedAnswers'].iloc[i] == frame['AnswerId'].iloc[i])[0][0] + 1)
    frame['Rank'] = rankList
    
    return frame

In [118]:
testQ_bigram = copy.deepcopy(testQ)
testQ = rank(testQ, NBScoresTest, uniqueAnswerId)
testQ_bigram = rank(testQ_bigram, NBScoresTest_bigram, uniqueAnswerId)

The below is the model performance from _Naive Bayes Classifier_ using unigrams.

In [119]:
print('Total number of questions in test set: ' + str(len(testQ)))
print('Total number of answers: ' + str(len(uniqueAnswerId)))
print('Total number of unique features: ' + str(len(featureHash)))
print('Average of rank: ' + str(np.floor(testQ['Rank'].mean())))
print('Percentage of questions find answers in top 10: ' + str(round(len(testQ.query('Rank <= 10'))/len(testQ), 3)))

Total number of questions in test set: 3468
Total number of answers: 1201
Total number of unique features: 1396
Average of rank: 41.0
Percentage of questions find answers in top 10: 0.631


The below is the model performance from _Naive Bayes Classifier_ using bigrams.

In [120]:
print('Average of rank: ' + str(np.floor(testQ_bigram['Rank'].mean())))
print('Percentage of questions find answers in top 10: ' + str(round(len(testQ_bigram.query('Rank <= 10'))/len(testQ_bigram), 3)))

Average of rank: 335.0
Percentage of questions find answers in top 10: 0.42


## Plots and Results

As mentioned earlier, we plot the __Average Rank__ and __Top 10 Percentage__ against different numbers of features we use to train the models. We also experiment with different combinations of hyperparameters (alpha, beta, and beta_A) and the best performance on this test set is obtained as below.

By implementing the Naive Bayes Classifier using a collection of unigrams, we have improved the __Average Rank__ from 65 (Part 3) to 41 and __Top 10 Percentage__ from 39.9% (Part 3) to 63.1%. 

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/NB_results.PNG">

The Naive Bayes Classifier built using a collection of bigrams seem to perform poorly. 

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/NB_bigram_results.PNG">