# QnA Matching Data Science Scenario

## Part 5: Calibrated One-vs-rest Support Vector Machine (SVM) Classifier (Unigrams, Unigrams+Bigrams)

### Overview

Part 5 of the series shows the implementation of a One-vs-rest SVM Classifier. The classifier is built using the scores learned from the _Naive Bayes Classifier_ in __Part 4__ as the feature vectors. Two sets of feature vectors are used to build the SVM classifiers: the scores learned on unigrams, and the concatenation of the scores learned on unigrams with the scores learned on bigrams.

Note: This notebook series is built with Python 3.5 and NLTK 3.2.2.

## Import required Python modules

In [3]:
import pandas as pd
import math
import gc
import numpy as np
import copy
import matplotlib
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.calibration import CalibratedClassifierCV
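try:
    import joblib  # standalone package; sklearn.externals.joblib was removed in newer scikit-learn releases
except ImportError:
    from sklearn.externals import joblib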
from IPython.display import display

%matplotlib inline

# suppress all warnings
import warnings
warnings.filterwarnings("ignore")

## Read trainQ and testQ into DataFrames

In [4]:
trainQ_url = 'https://mezsa.blob.core.windows.net/stackoverflow/trainQwithTokens.tsv'
testQ_url = 'https://mezsa.blob.core.windows.net/stackoverflow/testQwithTokens.tsv'

trainQ = pd.read_csv(trainQ_url, sep='\t', index_col='Id', encoding='latin1')
testQ = pd.read_csv(testQ_url, sep='\t', index_col='Id', encoding='latin1')

## Create Tokens to IDs Hash

For each token in the entire vocabulary, we assign a unique ID.

In [5]:
# get Token to ID mapping: {Token: tokenId}
def tokens_to_ids(tokens, featureHash):
    token2IdHash = {}
    for i in range(len(tokens)):
        tokenList = tokens.iloc[i].split(',')
        if featureHash is None:
            for t in tokenList:
                if t not in token2IdHash:
                    token2IdHash[t] = len(token2IdHash)
        else:
            for t in tokenList:
                # dict membership is O(1); no need to build a list of keys on every check
                if t not in token2IdHash and t in featureHash:
                    token2IdHash[t] = len(token2IdHash)
            
    return token2IdHash

In [6]:
token2IdHashInit = tokens_to_ids(trainQ['Tokens'], None)

In [7]:
print("Total number of unique tokens in the TrainQ: " + str(len(token2IdHashInit)))

Total number of unique tokens in the TrainQ: 4977
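To make the mapping concrete, here is a minimal sketch on a made-up token list. The function name `tokens_to_ids_sketch`, the `allowed` parameter, and the toy rows are illustrative assumptions, not part of the notebook; the logic mirrors `tokens_to_ids` above, where `allowed` plays the role of `featureHash`.

```python
def tokens_to_ids_sketch(token_rows, allowed=None):
    """Assign consecutive integer IDs to tokens in first-seen order.

    token_rows: iterable of comma-separated token strings (toy stand-in for trainQ['Tokens']).
    allowed: optional collection of tokens to keep (toy stand-in for featureHash).
    """
    token_to_id = {}
    for row in token_rows:
        for token in row.split(','):
            if allowed is not None and token not in allowed:
                continue  # skip tokens that were not selected as features
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

# hypothetical toy questions, already tokenized
toy_rows = ['python,list,sort', 'python,dict', 'sort,numpy']
toy_all = tokens_to_ids_sketch(toy_rows)
toy_filtered = tokens_to_ids_sketch(toy_rows, allowed={'sort', 'numpy'})
print(toy_all)
print(toy_filtered)
```

Because IDs are handed out in first-seen order, filtering by `allowed` renumbers the surviving tokens from zero, which is why the count matrix has to be recomputed after feature selection.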


## Create Count Matrix for Each Token in Each Answer

In [8]:
def count_matrix(frame, token2IdHash, uniqueAnswerId):
    # create an empty matrix with the shape of:
    # num_row = num of unique tokens
    # num_column = num of unique answerIds (N_wA) or num of questions in testQ (tfMatrix)
    # rowIdx = token2IdHash.values()
    # colIdx = index of uniqueAnswerId (N_wA) or index of questions in testQ (tfMatrix)
    num_row = len(token2IdHash)
    if uniqueAnswerId is not None:  # get N_wA
        num_column = len(uniqueAnswerId)
    else:
        num_column = len(frame)
    # initialize with zeros (np.empty would leave arbitrary values that the += below would build on)
    countMatrix = np.zeros(shape=(num_row, num_column))

    # loop through each question in the frame to fill in the countMatrix with corresponding counts
    for i in range(len(frame)):
        tokens = frame['Tokens'].iloc[i].split(',')
        if uniqueAnswerId is not None:   # get N_wA
            answerId = frame['AnswerId'].iloc[i]
            colIdx = uniqueAnswerId.index(answerId)
        else:
            colIdx = i
            
        for t in tokens:
            if t in token2IdHash.keys():
                rowIdx = token2IdHash[t]
                countMatrix[rowIdx, colIdx] += 1

    return countMatrix

In [9]:
# get unique answerId in ascending order
uniqueAnswerId = list(np.unique(trainQ['AnswerId']))
# calculate the count matrix of all training questions.
N_wAInit = count_matrix(trainQ, token2IdHashInit, uniqueAnswerId)
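As a small illustration of what the count matrix holds, the sketch below builds one by hand for three made-up questions. All names and values here (`toy_token_ids`, `toy_questions`, the answer IDs 101 and 202) are hypothetical; rows are tokens and columns are answers, as in `count_matrix` above.

```python
import numpy as np

# hypothetical vocabulary and (tokens, AnswerId) pairs
toy_token_ids = {'python': 0, 'list': 1, 'sort': 2}
toy_questions = [('python,list', 101), ('python,sort,sort', 202), ('list', 101)]
toy_answer_ids = sorted({answer_id for _, answer_id in toy_questions})  # [101, 202]

# start from zeros so that the += 1 updates are well defined
toy_counts = np.zeros((len(toy_token_ids), len(toy_answer_ids)))
for tokens, answer_id in toy_questions:
    col = toy_answer_ids.index(answer_id)
    for t in tokens.split(','):
        if t in toy_token_ids:
            toy_counts[toy_token_ids[t], col] += 1

print(toy_counts)
```

Each column aggregates the token counts of all questions that share the same answer, so `toy_counts[1, 0]` is 2 because "list" appears once in each of the two questions mapped to answer 101.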

## Feature Selection Based on Posterior Probability P(A|w)

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/feature_selection.PNG?token=APoO9ioyiKq8Sx0dxZ5onqIzyy6ywUmEks5YnH6cwA%3D%3D">

In [10]:
# calculate P(A): [P_A1, P_A2, ...]
def prior_probability_answer(answerIds, uniqueAnswerId): 
    P_A = []
    # convert a pandas series to a list
    answerIds = list(answerIds)
    
    for answerId in uniqueAnswerId:
        P_A.append(answerIds.count(answerId)/len(answerIds))
    return np.array(P_A)

In [11]:
P_A = prior_probability_answer(trainQ['AnswerId'], uniqueAnswerId)

In [12]:
# calculate P(A|w)
def posteriori_prob(N_wAInit, P_A, uniqueAnswerId):
    # N_A is the total number of answers
    N_A = len(uniqueAnswerId)
    # N_w is the total number of times w appears over all documents 
    # rowSum of count matrix (N_wAInit)
    N_wInit = np.sum(N_wAInit, axis = 1)
    # P(A|w) = (N_w|A + N_A * P(A))/(N_w + N_A)
    N = N_wAInit + N_A * P_A
    D = N_wInit + N_A
    P_Aw = np.divide(N.T, D).T    
    
    return P_Aw

In [13]:
P_Aw = posteriori_prob(N_wAInit, P_A, uniqueAnswerId)
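The smoothed posterior can be checked by hand on a tiny example. The sketch below applies the same formula as the code comment above, P(A|w) = (N_w|A + N_A * P(A)) / (N_w + N_A), to a made-up 3-token, 2-answer count matrix; the `toy_` arrays are assumptions for illustration only.

```python
import numpy as np

# hypothetical counts: rows = tokens, columns = answers
toy_N_wA = np.array([[1., 1.],
                     [2., 0.],
                     [0., 2.]])
# hypothetical priors: answer 0 has two training questions, answer 1 has one
toy_P_A = np.array([2 / 3, 1 / 3])

N_A = toy_N_wA.shape[1]            # number of answers
toy_N_w = toy_N_wA.sum(axis=1)     # total count of each token over all answers

# P(A|w) = (N_w|A + N_A * P(A)) / (N_w + N_A), one row per token
toy_P_Aw = (toy_N_wA + N_A * toy_P_A) / (toy_N_w + N_A)[:, None]
print(toy_P_Aw)
```

Since the priors sum to 1, every row of `toy_P_Aw` also sums to 1, so each row can be read as a distribution over answers given a token. The prior term pulls rarely seen tokens toward P(A) instead of letting a single occurrence dominate.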

In [14]:
# select the top N tokens w which maximize P(A|w) for each A.
# get FeatureHash: {token: 1}
def feature_selection(P_Aw, token2IdHashInit, topN):
    featureHash = {}
    # for each answer A, sort tokens w by P(A|w)
    sortedIdxMatrix = np.argsort(P_Aw, axis=0)[::-1]
    # select top N tokens for each answer A
    topMatrix = sortedIdxMatrix[0:topN, :]
    # for each token w in topMatrix, add w to FeatureHash if it has not already been included
    topTokenIdList = np.reshape(topMatrix, topMatrix.shape[0] * topMatrix.shape[1])
    # get ID to Token mapping: {tokenId: Token}
    Id2TokenHashInit = {y:x for x, y in token2IdHashInit.items()}
    
    for tokenId in topTokenIdList:
        token = Id2TokenHashInit[tokenId]
        if token not in featureHash.keys():
            featureHash[token] = 1
    return featureHash

To determine the best number of top tokens to select per answer, we experimented with different _topN_ values and found that selecting the top 10 unigrams yields the best results. Two plots of the evaluation metrics against different numbers of features are provided at the end of this notebook to show how we determined this number.

In [15]:
topN = 10
featureHash = feature_selection(P_Aw, token2IdHashInit, topN)
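The argsort-based selection can be traced on a small matrix. This is a sketch under assumed toy values (`toy_P_Aw`, `toy_id_to_token`, and `top_n = 1` are hypothetical); it follows the same steps as `feature_selection`: sort each answer's column in descending order, keep the first rows, and take the union of the selected tokens.

```python
import numpy as np

# hypothetical P(A|w): rows = tokens, columns = answers
toy_P_Aw = np.array([[7 / 12, 5 / 12],
                     [5 / 6, 1 / 6],
                     [1 / 3, 2 / 3]])
toy_id_to_token = {0: 'python', 1: 'list', 2: 'sort'}
top_n = 1

# argsort is ascending along axis 0, so reversing the rows gives descending order per column
order = np.argsort(toy_P_Aw, axis=0)[::-1]
top_ids = order[:top_n, :]

# union of the top tokens over all answers
selected = {toy_id_to_token[int(i)] for i in top_ids.ravel()}
print(selected)
```

A token shared evenly between answers ("python" here) is never the top token for any answer and is dropped, which is the intended effect of selecting by P(A|w).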

## Re-assign ID to Each Selected Token and Re-calculate Count Matrix

After selecting the top N tokens for each answer, we use the collection of selected tokens for training and re-assign an ID to each selected token. Based on the newly assigned IDs, we re-calculate the Count Matrix.

In [16]:
token2IdHash = tokens_to_ids(trainQ['Tokens'], featureHash)

In [17]:
N_wA = count_matrix(trainQ, token2IdHash, uniqueAnswerId)

## Calculate P(w) on Full Collection of Training Questions (w is selected token)

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/P_w.PNG?token=APoO9vsVzM00o3cEfOPq5mUeQ_2eJK94ks5YnH6ywA%3D%3D">

In [18]:
def feature_weights(N_wA, alpha):
    # N_w is the total number of times w appears over all documents 
    # rowSum of count matrix (N_wA)
    N_w = np.sum(N_wA, axis = 1)
    # N_W is the total count of all words
    N_W = np.sum(N_wA)
    # N_V is the count of unique words in the vocabulary
    N_V = N_wA.shape[0]
    # P(w) = (N_w + 1*alpha) / (N_W +N_V*alpha)
    N2 = N_w + 1 * alpha
    D2 = N_W + alpha * N_V
    P_w = N2/D2

    return P_w

In [19]:
alpha = 0.0001
P_w = feature_weights(N_wA, alpha)

## Calculate Probability Function P(w|A) and P(w|NotA) on Training Data

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/probability_function.PNG?token=APoO9rWEZ1g_OgvWT_pleQlhT2DEFw3tks5YnIHzwA%3D%3D">

In [20]:
def word_probability_in_answer(N_wA, P_w, beta):
    # N_V is the count of unique words in the vocabulary
    N_V = N_wA.shape[0]
    # N_WA is the total count of all words in questions on answer A 
    # colSum of count matrix (N_wA)
    N_WA = np.sum(N_wA, axis=0)
    # P(w|A) = (N_w|A + beta N_V P(w))/(N_W|A + beta * N_V)
    N = (N_wA.T + beta * N_V * P_w).T
    D = N_WA + beta * N_V
    P_wA = N / D
    
    return P_wA

In [21]:
def word_probability_Notin_answer(N_wA, P_w, beta):
    # N_V is the count of unique words in the vocabulary
    N_V = N_wA.shape[0]
    # N_wNotA is the count of w over all documents but not on answer A
    # N_wNotA = N_w - N_wA
    N_w = np.sum(N_wA, axis = 1)
    N_wNotA = (N_w - N_wA.T).T
    # N_WNotA is the count of all words over all documents but not on answer A
    # N_WNotA = N_W - N_WA
    N_W = np.sum(N_wA)
    N_WA = np.sum(N_wA, axis=0)
    N_WNotA = N_W - N_WA
    # P(w|NotA) = (N_w|NotA + beta * N_V * P(w))/(N_W|NotA + beta * N_V)
    N = (N_wNotA.T + beta * N_V * P_w).T
    D = N_WNotA + beta * N_V
    P_wNotA = N / D
    
    return P_wNotA

In [22]:
beta = 0.0001
P_wA = word_probability_in_answer(N_wA, P_w, beta)
P_wNotA = word_probability_Notin_answer(N_wA, P_w, beta)

## Calculate Naive Bayes Weights

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/NB_weight.PNG?token=APoO9s-2DejCvW03RgK6zXgiXX6UT5WWks5YnjiRwA%3D%3D">

In [23]:
# NB weight of each word w for each answer A, stored as a matrix of shape (num tokens, num answers):
# NBWeights[w, A] = log(P(w|A) / P(w|NotA))
NBWeights = np.log(P_wA / P_wNotA)
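To see how the smoothed probabilities turn into weights, the sketch below runs the same formulas as `feature_weights`, `word_probability_in_answer`, and `word_probability_Notin_answer` on a made-up count matrix. The `toy_` arrays are assumptions; `alpha` and `beta` reuse the values set in the notebook.

```python
import numpy as np

alpha = beta = 0.0001  # same smoothing values as in the notebook

# hypothetical counts: rows = tokens (python, list, sort), columns = answers
toy_N_wA = np.array([[1., 1.],
                     [2., 0.],
                     [0., 2.]])
N_V = toy_N_wA.shape[0]        # vocabulary size
N_w = toy_N_wA.sum(axis=1)     # count of each token over all answers
N_W = toy_N_wA.sum()           # total count of all tokens
N_WA = toy_N_wA.sum(axis=0)    # total token count per answer

# P(w) = (N_w + alpha) / (N_W + N_V * alpha)
toy_P_w = (N_w + alpha) / (N_W + alpha * N_V)

# P(w|A) and P(w|NotA), both smoothed toward P(w)
toy_P_wA = (toy_N_wA + beta * N_V * toy_P_w[:, None]) / (N_WA + beta * N_V)
toy_N_wNotA = N_w[:, None] - toy_N_wA
toy_P_wNotA = (toy_N_wNotA + beta * N_V * toy_P_w[:, None]) / ((N_W - N_WA) + beta * N_V)

toy_weights = np.log(toy_P_wA / toy_P_wNotA)
print(np.round(toy_weights, 3))
```

In this toy case "python" occurs equally often in both answers and gets a weight of zero, "list" gets a large positive weight for the first answer, and "sort" gets a large negative one. The small `beta` keeps the logarithm finite when a token never occurs with an answer.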

## Calculate Normalized TF of Each Word w in Training and Test sets

Each document/question d is typically represented by a feature vector x that represents the contents of d. Because different documents can have different lengths, it can be useful to L1-normalize the feature vector x.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/tf.PNG?token=APoO9tMyEVzqoUJYT9ALcdF3_BryHHEVks5YnIQywA%3D%3D">

In [24]:
def normalize_tf(frame, token2IdHash):
    N_wQ = count_matrix(frame, token2IdHash, uniqueAnswerId=None)
    N_WQ = np.sum(N_wQ, axis=0)
    
    # find the index where N_WQ is zero
    zeroIdx = np.where(N_WQ == 0)[0]
    
    # if N_WQ is zero, then the x_w for that particular question would be zero.
    # for a simple calculation, we convert the N_WQ to 1 in those cases so the denominator is not zero.
    if len(zeroIdx) > 0:
        N_WQ[zeroIdx] = 1
    
    # x_w = P_wd = count(w)/sum(count(i in V))
    x_w = N_wQ / N_WQ
    
    return x_w

In [25]:
x_wTrain = normalize_tf(trainQ, token2IdHash)
x_wTest = normalize_tf(testQ, token2IdHash)
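The effect of the L1 normalization, including the zero-count guard, can be seen on a small matrix. The values in `toy_N_wQ` are made up; the third question contains no selected token, which is the case the guard in `normalize_tf` handles.

```python
import numpy as np

# hypothetical token counts: rows = tokens, columns = questions
toy_N_wQ = np.array([[1., 0., 0.],
                     [1., 2., 0.],
                     [2., 0., 0.]])

col_sums = toy_N_wQ.sum(axis=0)
# replace zero totals by 1 so the division is defined; those columns stay all-zero
safe_sums = np.where(col_sums == 0, 1.0, col_sums)
toy_x_w = toy_N_wQ / safe_sums
print(toy_x_w)
```

Every question with at least one selected token ends up with a column that sums to 1, so long and short questions are scored on the same scale.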

## Score Each Question in Test Set Against a Specific Answer

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/NB_scores.PNG?token=APoO9vABVheo1aZRkUYQq41utE6VRM1Yks5YnITBwA%3D%3D">

In [26]:
beta_A = 0
NBScoresTrain = -beta_A + np.dot(x_wTrain.T, NBWeights)
NBScoresTest = -beta_A + np.dot(x_wTest.T, NBWeights)
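The scoring step is a single matrix product: each question's normalized term-frequency vector is multiplied by the weight matrix, giving one score per answer. The sketch below uses made-up weights and term frequencies (`toy_weights`, `toy_x_w`) purely to show the shapes involved.

```python
import numpy as np

beta_A = 0  # same offset as in the notebook

# hypothetical NB weights: rows = tokens, columns = answers
toy_weights = np.array([[0., 0.],
                        [2., -2.],
                        [-1., 1.]])
# hypothetical normalized term frequencies: rows = tokens, columns = questions
toy_x_w = np.array([[0.5, 0.],
                    [0.5, 0.],
                    [0., 1.]])

# result has one row per question and one column per answer
toy_scores = -beta_A + np.dot(toy_x_w.T, toy_weights)
best_answer = np.argmax(toy_scores, axis=1)
print(toy_scores)
print(best_answer)
```

The resulting matrix has shape (number of questions, number of answers), which is why it can be passed directly to the SVM as a feature matrix with one feature per answer.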

## Fit One-vs-Rest SVM using NB Scores and Calibrate SVM Scores into Probability Estimates

Traditional SVM training finds a hyperplane that maximally separates positive and negative training examples in a vector space. In its standard form, an SVM is a two-class classifier. To create a multi-class SVM for a problem with N_A classes, a one-versus-rest SVM classifier is typically learned for each answer class a.

First, we fit a linear one-vs-rest Support Vector Classifier using _svm.LinearSVC()_ from the open-source Python package **Scikit Learn**. The features used to train the classifier are the Naive Bayes scores obtained on the training set. Like most supervised learning methods, the SVM classifier outputs scores s(x) that can be used to rank the questions in the test set from the most probable member to the least probable member of a class a. However, those SVM scores are not equivalent to probabilities, especially in a multi-class classification case.

To map scores into probability estimates, we use the parametric approach proposed by John Platt for SVM scores: find the parameters A and B of a sigmoid function P(s) such that the negative log-likelihood of the data is minimized.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/calibration.PNG">

**Scikit Learn** has an implementation of such probability calibration. 
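For intuition, the sigmoid fit can be sketched in a few lines of NumPy. This is a simplified illustration, not the Scikit Learn implementation: it assumes the common form P(y=1|s) = 1 / (1 + exp(A*s + B)), fits A and B by plain gradient descent, and omits the target smoothing used in Platt's original method. The scores, labels, learning rate, and iteration count are made up.

```python
import numpy as np

# hypothetical decision scores and binary labels (1 = positive class)
scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
labels = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

def platt_prob(s, A, B):
    """Sigmoid mapping from a raw score s to a probability estimate."""
    return 1.0 / (1.0 + np.exp(A * s + B))

A, B = 0.0, 0.0
learning_rate = 0.1
for _ in range(5000):
    p = platt_prob(scores, A, B)
    # derivative of the negative log-likelihood w.r.t. z = A*s + B is (y - p)
    residual = labels - p
    A -= learning_rate * np.mean(residual * scores)
    B -= learning_rate * np.mean(residual)

calibrated = platt_prob(scores, A, B)
print(A, B)
print(np.round(calibrated, 3))
```

A negative fitted A means that higher SVM scores map to higher probabilities. The mapping is monotone, so calibration changes the scale of the scores for one class but not their ordering.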

### SVM + Unigrams

In the following sections, we experiment with two _one-vs-rest SVM_ classifiers. In the first classifier, we simply use the _Naive Bayes Scores_ learned on unigrams as the feature vectors.

In [27]:
# NB Scores as training features
# AnswerIds as targets
X_train = NBScoresTrain
Y_train = np.array(trainQ['AnswerId'])
X_test = NBScoresTest

In [28]:
# first, fit a Linear SVC model 
est = svm.LinearSVC(dual=False, multi_class='ovr', max_iter=7)
# then fit a Calibrated Classifier with 3-fold cross-validation
clf = CalibratedClassifierCV(est, cv=3, method='sigmoid')
%time clf.fit(X_train, Y_train) 

Wall time: 14min 40s


CalibratedClassifierCV(base_estimator=LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=7,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
            cv=3, method='sigmoid')

In [29]:
# add model persistence
pklFile = 'C:/Users/mez/Desktop/SVM_model.pkl'
# save the model as an external .pkl file and load the model when it is needed
if True: 
    joblib.dump(clf, pklFile) 
if False:
    clf = joblib.load(pklFile) 

In [30]:
# make a deep copy of the testQ for the second classifier
testQ_concat = copy.deepcopy(testQ)

In [31]:
# predict probabilities on the test set
Y_test_pred = clf.predict_proba(X_test)
testQ['SVMProb'] = list(Y_test_pred)
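The same fit-then-calibrate pattern can be tried quickly on synthetic data before running it on the full feature matrix. The sketch below is an assumption-laden toy: the data come from `make_blobs` rather than NB scores, and `LinearSVC` is left at its default iteration limit.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# synthetic 3-class problem standing in for (NB scores, AnswerId)
X_toy, y_toy = make_blobs(n_samples=90, centers=3, random_state=0)

base = LinearSVC(dual=False)  # one-vs-rest linear SVM
calibrated_clf = CalibratedClassifierCV(base, cv=3, method='sigmoid')
calibrated_clf.fit(X_toy, y_toy)

toy_proba = calibrated_clf.predict_proba(X_toy[:5])
print(calibrated_clf.classes_)
print(np.round(toy_proba, 3))
```

The columns of `predict_proba` follow `classes_`, which is sorted in ascending order. In the notebook this lines up with `uniqueAnswerId`, which is also produced in ascending order by `np.unique`.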

### SVM + concat(Unigrams, Bigrams)

In the second classifier, we concatenate the _Naive Bayes Scores_ learned using unigrams and the _Naive Bayes Scores_ learned using bigrams as the feature vectors. We experimented with various _Naive Bayes Scores_ learned using bigrams and found that concatenating the unigram scores learned above (*NBScoresTrain*, *NBScoresTest*) with the bigram scores learned in __Part 4__ (*NBScoresTrain_bigram*, *NBScoresTest_bigram*) yields the best results.

In [33]:
# reload the NBScoresTrain_bigram and NBScoresTest_bigram learned from Part 4 into the current notebook
fileNameTrain_bigram = "C:/Users/mez/Desktop/NBScoresBigramTrain.out"
fileNameTest_bigram = "C:/Users/mez/Desktop/NBScoresBigramTest.out"

# reload the text file into numpy matrix:
if True:
    NBScoresTrain_bigram = np.loadtxt(fileNameTrain_bigram, delimiter=',')
    NBScoresTest_bigram = np.loadtxt(fileNameTest_bigram, delimiter=',')

In [35]:
# NB Scores (unigrams + bigrams) as training features
# AnswerIds as targets
X_train = np.concatenate((NBScoresTrain, NBScoresTrain_bigram), axis=1)
Y_train = np.array(trainQ['AnswerId'])
X_test = np.concatenate((NBScoresTest, NBScoresTest_bigram), axis=1)

In [36]:
# first, fit a Linear SVC model 
est_concat = svm.LinearSVC(dual=False, multi_class='ovr', max_iter=7)
# then fit a Calibrated Classifier with 3-fold cross-validation
clf_concat = CalibratedClassifierCV(est_concat, cv=3, method='sigmoid')
%time clf_concat.fit(X_train, Y_train) 

Wall time: 28min 59s


CalibratedClassifierCV(base_estimator=LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=7,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
            cv=3, method='sigmoid')

In [37]:
# add model persistence
pklFile = 'C:/Users/mez/Desktop/SVM_model_concat.pkl'
# save the model as an external .pkl file and load the model when it is needed
if True: 
    joblib.dump(clf_concat, pklFile) 
if False:
    clf_concat = joblib.load(pklFile) 

In [39]:
# predict probabilities on the test set
Y_test_pred_concat = clf_concat.predict_proba(X_test)
testQ_concat['SVMProb'] = list(Y_test_pred_concat)

## Save or Reload Scores
We save the probability estimates to text files so that they can be reloaded in later notebooks.

In [40]:
fileNameTest = "C:/Users/mez/Desktop/SVMProbsTest.out"
fileNameTest_concat = "C:/Users/mez/Desktop/SVMProbsTest_concat.out"

# save scores to a text file:
if True: 
    np.savetxt(fileNameTest, Y_test_pred, delimiter=',')
    np.savetxt(fileNameTest_concat, Y_test_pred_concat, delimiter=',')

# reload the text file into numpy matrix:
if False:
    Y_test_pred = np.loadtxt(fileNameTest, delimiter=',')
    Y_test_pred_concat = np.loadtxt(fileNameTest_concat, delimiter=',')

## Rank the Predicted Probability and Calculate Average Rank

We use two evaluation metrics to test our model performance. For each question in the test set, we calculate a calibrated probability against each answer. We then rank the answers by their probabilities to calculate the Average Rank and the Top 10 Percentage on the test set using the formulas below:

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/evaluation.PNG?token=APoO9hyYDFxGc9FRbmIXU3VGv0wdeCaPks5YnIVtwA%3D%3D">

In [41]:
# sort the predicted probability in descending order and map them to the corresponding AnswerId in Answer set
def rank(frame, scores, uniqueAnswerId):
    frame['SortedAnswers'] = list(np.array(uniqueAnswerId)[np.argsort(-scores, axis=1)])
    
    rankList = []
    for i in range(len(frame)):
        rankList.append(np.where(frame['SortedAnswers'].iloc[i] == frame['AnswerId'].iloc[i])[0][0] + 1)
    frame['Rank'] = rankList
    
    return frame

In [42]:
testQ = rank(testQ, Y_test_pred, uniqueAnswerId)
testQ_concat = rank(testQ_concat, Y_test_pred_concat, uniqueAnswerId)
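Both metrics can be verified by hand on a tiny example. The sketch below uses made-up probabilities for two questions and three answers (all `toy_` names are hypothetical) and a cutoff of 2 instead of 10 so that the top-k fraction is not trivially 1.

```python
import numpy as np

toy_answer_ids = np.array([101, 202, 303])
# hypothetical calibrated probabilities: rows = questions, columns = answers
toy_probs = np.array([[0.2, 0.5, 0.3],
                      [0.1, 0.6, 0.3]])
toy_true = np.array([202, 101])  # correct AnswerId for each question

# sort answers by descending probability for each question
order = np.argsort(-toy_probs, axis=1)
sorted_answers = toy_answer_ids[order]

# 1-based position of the correct answer in each sorted list
toy_ranks = np.argmax(sorted_answers == toy_true[:, None], axis=1) + 1

average_rank = toy_ranks.mean()
top_k = 2
top_k_fraction = np.mean(toy_ranks <= top_k)
print(toy_ranks, average_rank, top_k_fraction)
```

The first question has its correct answer ranked first and the second has it ranked third, so the average rank is 2 while only half of the questions fall within the top 2. A few badly ranked questions can raise the average rank even when the top-k fraction improves.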

Below is the model performance of the _One-vs-rest SVM Classifier_ using unigrams.

In [43]:
print('Total number of questions in test set: ' + str(len(testQ)))
print('Total number of answers: ' + str(len(uniqueAnswerId)))
print('Total number of unique features: ' + str(len(featureHash)))
print('Average of rank: ' + str(np.floor(testQ['Rank'].mean())))
print('Percentage of questions find answers in top 10: ' + str(round(len(testQ.query('Rank <= 10'))/len(testQ), 3)))

Total number of questions in test set: 3671
Total number of answers: 1275
Total number of unique features: 3184
Average of rank: 38.0
Percentage of questions find answers in top 10: 0.612


Below is the model performance of the _One-vs-rest SVM Classifier_ using the concatenation of the scores learned on unigrams and the scores learned on bigrams.

In [44]:
print('Average of rank: ' + str(np.floor(testQ_concat['Rank'].mean())))
print('Percentage of questions find answers in top 10: ' + str(round(len(testQ_concat.query('Rank <= 10'))/len(testQ_concat), 3)))

Average of rank: 46.0
Percentage of questions find answers in top 10: 0.641


## Plots and Results

As mentioned earlier, we plot the Average Rank and Top 10 Percentage against the number of features used to train the model. We also experimented with different combinations of hyperparameters (alpha, beta, and beta_A); the best performance on this test set is shown below.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/SVM_results.PNG">

When we train the classifier only with the Naive Bayes scores learned from unigrams, we obtain an Average Rank of 38 and a Top 10 Percentage of 61.2%.

However, the feature vectors learned from bigrams do not seem to improve overall model performance. Although the correct answer appears in the top 10 choices for more test questions (64.1%), the Average Rank worsens to 46, compared with 38 for the first classifier.