# QnA Matching Data Science Scenario

## Part 7: Ensemble Model

### Overview

__Part 7__ of the series shows an ensemble model that combines the predicted probabilities or scores from 5 base classifiers learned in the previous Parts. The ensemble is a weighted average of the following base classifiers:
* Calibrated Naïve Bayes Scores (train a Logistic Regression model on the Naive Bayes Scores to obtain probabilities)
* Calibrated linear SVM learned with NB Scores (Unigrams)
* Calibrated linear SVM learned with NB Scores (Unigrams + Bigrams)
* Calibrated linear SVM learned with DSSM Embeddings
* Calibrated linear SVM learned with DSSM Embeddings and NB Scores (Unigrams)

Note: This notebook series is built under Python 3.5 and NLTK 3.2.2.

## Import required Python modules

In [1]:
import pandas as pd
import math
import gc
import os
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression as LR
# sklearn.externals.joblib was removed in scikit-learn 0.23; use `import joblib` on newer versions
from sklearn.externals import joblib
from IPython.display import display

%matplotlib inline

# suppress all warnings
import warnings
warnings.filterwarnings("ignore")

## Read trainQ and testQ into DataFrames

In [2]:
trainQ_url = 'https://mezsa.blob.core.windows.net/stackoverflow/trainQwithTokens.tsv'
testQ_url = 'https://mezsa.blob.core.windows.net/stackoverflow/testQwithTokens.tsv'

trainQ = pd.read_csv(trainQ_url, sep='\t', index_col='Id', encoding='latin1')
testQ = pd.read_csv(testQ_url, sep='\t', index_col='Id', encoding='latin1')

## Read Probabilities or Scores from 5 Base Classifiers into NumPy Matrices

In [3]:
# NBScores (unigrams):
NBScoresTrain = np.loadtxt(os.path.join(os.getcwd(), "NBScoresTrain_top10.out"), delimiter=',')
NBScoresTest = np.loadtxt(os.path.join(os.getcwd(), "NBScoresTest_top10.out"), delimiter=',')

In [4]:
# SVM Probabilities learned with NB Scores (Unigrams):
SVMProbsTest = np.loadtxt(os.path.join(os.getcwd(), "SVMProbsTest.out"), delimiter=',')

In [5]:
# SVM Probabilities learned with NB Scores (Unigrams + Bigrams):
SVMProbsTest_concat = np.loadtxt(os.path.join(os.getcwd(), "SVMProbsTest_concat.out"), delimiter=',')

In [6]:
# SVM Probabilities learned with DSSM Embeddings:
SVM_DSSMTest = np.loadtxt(os.path.join(os.getcwd(), "SVM_DSSMTest.out"), delimiter=',')

In [7]:
# SVM Probabilities learned with DSSM Embeddings and NB Scores (Unigrams):
SVM_DSSM_NBScoresTest = np.loadtxt(os.path.join(os.getcwd(), "SVM_DSSM_NBScoresTest.out"), delimiter=',')

## Calibrate NB Scores into Probabilities

Because the calibrated SVM classifiers output probabilities, we want to convert the NB scores into probabilities as well, so that all inputs to the ensemble model are on the same scale. For this conversion, we fit a _Logistic Regression_ model on the observed NB Scores to predict the probability of each class.

In [8]:
# NB Scores as training features
# AnswerIds as targets
X_train = NBScoresTrain
Y_train = np.array(trainQ['AnswerId'])
X_test = NBScoresTest

In [11]:
# train a logistic regression model on the classifier outputs.
# Note: this model may take a while to complete.
lr = LR(multi_class='ovr', solver='sag', max_iter=20)                                                       
%time lr.fit(X_train, Y_train)
lr

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=20, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='sag', tol=0.0001,
          verbose=0, warm_start=False)

In [9]:
# add model persistence
pklFile = os.path.join(os.getcwd(), "LR_NBScores.pkl")
# save the model as an external .pkl file and load the model when it is needed
if True: 
    joblib.dump(lr, pklFile) 
if False:
    lr = joblib.load(pklFile) 

In [12]:
# get the probability estimates
NBScoresTest_calibrated = lr.predict_proba(X_test)
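
The ensemble below adds the calibrated matrix to the SVM probability matrices element by element, so the columns of `predict_proba` need to be in the same AnswerId order as the other matrices. The following is a minimal sketch on synthetic data (the names `toy_scores`, `toy_labels`, and `toy_lr` and the AnswerIds are hypothetical, and the default solver is used rather than the `sag` settings above). It illustrates that the columns follow `classes_`, which holds the sorted unique labels, and that each row sums to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for NB scores: 60 questions, 3 answer classes.
rng = np.random.RandomState(0)
true_cols = np.repeat([0, 1, 2], 20)
toy_labels = np.array([101, 205, 309])[true_cols]   # hypothetical AnswerIds
toy_scores = rng.normal(size=(60, 3))
toy_scores[np.arange(60), true_cols] += 3.0         # make the true class score highest on average

# Fit a logistic regression on the raw scores to obtain class probabilities.
toy_lr = LogisticRegression(max_iter=1000)
toy_lr.fit(toy_scores, toy_labels)
toy_probs = toy_lr.predict_proba(toy_scores)

# Column j of toy_probs is the probability of toy_lr.classes_[j].
print(toy_lr.classes_)
# Each row is a probability distribution over the classes.
print(np.allclose(toy_probs.sum(axis=1), 1.0))
```

If the label order of any base classifier differed from `np.unique(trainQ['AnswerId'])`, its columns would need to be reordered before averaging.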

## Assign Weights to Each Model

We performed a grid search over the ensemble weights, in which the weight assigned to each base classifier follows a uniform distribution, and found that the best results are obtained when the 5 classifiers are equally weighted. Therefore, we assign a weight of 0.2 to each classifier below.

In [13]:
Y_test_pred = 0.2*NBScoresTest_calibrated + 0.2*SVMProbsTest + 0.2*SVMProbsTest_concat + 0.2*SVM_DSSMTest + 0.2*SVM_DSSM_NBScoresTest
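
The notebook does not include the weight search itself, so the following is only a sketch of one way such a search could be done, not the procedure actually used. It enumerates weight vectors on a regular grid that sum to 1 and keeps the one with the lowest average rank. The helpers `average_rank`, `weight_grid`, and `search_weights`, the grid resolution, and the toy matrices are all assumptions made for illustration. Ideally the probabilities passed in would come from a held-out validation split rather than the test set.

```python
import itertools
import numpy as np

def average_rank(probs, true_cols):
    """Mean 1-based rank of the true class (ties count in favor of the true class)."""
    true_scores = probs[np.arange(len(probs)), true_cols][:, None]
    return float(((probs > true_scores).sum(axis=1) + 1).mean())

def weight_grid(n_models, steps):
    """Yield every weight vector made of multiples of 1/steps that sums to 1."""
    for combo in itertools.product(range(steps + 1), repeat=n_models):
        if sum(combo) == steps:
            yield np.array(combo) / steps

def search_weights(prob_list, true_cols, steps=5):
    """Return (weights, average_rank) for the best grid point."""
    stacked = np.stack(prob_list)          # shape: (n_models, n_questions, n_classes)
    best = None
    for w in weight_grid(len(prob_list), steps):
        blended = np.tensordot(w, stacked, axes=1)   # weighted average over models
        score = average_rank(blended, true_cols)
        if best is None or score < best[1]:
            best = (w, score)
    return best

# Toy example: one informative model and one misleading model.
good = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]])
bad = np.array([[0.1, 0.8, 0.1], [0.8, 0.1, 0.1]])
toy_true_cols = np.array([0, 1])          # column index of the correct answer per question
best_weights, best_score = search_weights([good, bad], toy_true_cols)
print(best_weights, best_score)
```

With `steps=5` and 5 models, the grid contains the equal-weight point (0.2 for each classifier), so the setting chosen above is one of the candidates such a search would evaluate.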

## Rank the Weighted Average of Probability and Calculate Average Rank 

We use two evaluation metrics to test our model performance. For each question in the test set, we calculate a weighted average of the probabilities obtained from the base classifiers for each answer. We then rank the answers by their weighted average and calculate the __Average Rank__ and __Top 10 Percentage__ on the test set using the formula below:

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/evaluation.PNG?token=APoO9hyYDFxGc9FRbmIXU3VGv0wdeCaPks5YnIVtwA%3D%3D">

The __Average Rank__ can be interpreted as the position at which, on average, we find the correct answer among all available answers for a given question.

The __Top 10 Percentage__ can be interpreted as the percentage of new questions whose correct answer appears among the first 10 choices.

In [14]:
# sort the predicted probability in descending order and map them to the corresponding AnswerId in Answer set
def rank(frame, scores, uniqueAnswerId):
    frame['SortedAnswers'] = list(np.array(uniqueAnswerId)[np.argsort(-scores, axis=1)])
    
    rankList = []
    for i in range(len(frame)):
        rankList.append(np.where(frame['SortedAnswers'].iloc[i] == frame['AnswerId'].iloc[i])[0][0] + 1)
    frame['Rank'] = rankList
    
    return frame
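
As a small worked example of the two metrics, the sketch below computes the same 1-based rank without the Python loop. The function name `rank_of_true_answer` and the toy values are hypothetical. It assumes every true answer appears in `answer_ids`; for an unseen answer it would silently return rank 1, whereas the loop above raises an `IndexError`.

```python
import numpy as np

def rank_of_true_answer(scores, answer_ids, true_answers):
    """1-based position of the true AnswerId when answers are sorted by descending score."""
    order = np.argsort(-scores, axis=1)               # column indices, best first
    sorted_ids = np.asarray(answer_ids)[order]        # AnswerIds, best first
    hits = sorted_ids == np.asarray(true_answers)[:, None]
    return hits.argmax(axis=1) + 1                    # index of the first match, plus 1

toy_ids = [10, 20, 30]                                # sorted unique AnswerIds
toy_scores = np.array([[0.2, 0.5, 0.3],
                       [0.6, 0.3, 0.1]])
toy_truth = [30, 10]                                  # correct AnswerId per question

toy_ranks = rank_of_true_answer(toy_scores, toy_ids, toy_truth)
print(toy_ranks)                  # [2 1]
print(toy_ranks.mean())           # Average Rank: 1.5
print((toy_ranks <= 10).mean())   # Top 10 Percentage (as a fraction): 1.0
```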

In [15]:
# get unique answerId in ascending order
uniqueAnswerId = list(np.unique(trainQ['AnswerId']))
testQ = rank(testQ, Y_test_pred, uniqueAnswerId)

In [16]:
print('Total number of questions in test set: ' + str(len(testQ)))
print('Total number of answers: ' + str(len(uniqueAnswerId)))
print('Average of rank: ' + str(np.floor(testQ['Rank'].mean())))
print('Percentage of questions find answers in top 10: ' + str(round(len(testQ.query('Rank <= 10'))/len(testQ), 3)))

Total number of questions in test set: 3671
Total number of answers: 1275
Average of rank: 31.0
Percentage of questions find answers in top 10: 0.642


## An Analysis of the Training Example Size

In our experiment, we noticed that some answer classes contain only a few training examples. Because we built one-vs-rest Support Vector Machine classifiers in most of our experiments, classes with very few training examples lead to imbalanced datasets, and a small number of examples is not sufficient to train a classifier. Therefore, we performed an analysis to study how the number of training examples per class affects model performance.

In this study, we test the __Average Rank__ and __Top 10 Percentage__ distribution with different numbers of training examples per class. As we can see from the distribution below, our ensemble model can secure an __Average Rank__ less than 20 (out of 1275 different answer classes) and a __Top 10 Percentage__ over 60% when we have more than 15 training examples per class.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/training_size.PNG">
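
The code behind this analysis is not part of the notebook. The following is one possible sketch of how the two metrics could be broken down by the number of training examples per class, assuming a test frame with the `AnswerId` and `Rank` columns produced earlier. The function name `metrics_by_class_size`, the `TrainSize` column, and the toy data are hypothetical.

```python
import pandas as pd

def metrics_by_class_size(train_answers, test_frame):
    """Average Rank and Top 10 Percentage grouped by training examples per class."""
    # Number of training questions for each AnswerId.
    class_size = pd.Series(train_answers).value_counts()
    # Attach the training size of each test question's correct answer class.
    sized = test_frame.assign(TrainSize=test_frame['AnswerId'].map(class_size))
    grouped = sized.groupby('TrainSize')['Rank']
    return pd.DataFrame({
        'AverageRank': grouped.mean(),
        'Top10Percentage': grouped.apply(lambda r: (r <= 10).mean()),
        'Questions': grouped.size(),
    })

# Toy data: class 1 has 3 training examples, class 2 has 2, class 3 has 1.
toy_train = [1, 1, 1, 2, 2, 3]
toy_test = pd.DataFrame({'AnswerId': [1, 1, 2, 3], 'Rank': [1, 3, 12, 40]})
toy_summary = metrics_by_class_size(toy_train, toy_test)
print(toy_summary)
```

On the notebook's data, a call such as `metrics_by_class_size(trainQ['AnswerId'], testQ)` would give one row per class size, which could then be plotted.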

The number of classes that have more than 15 training examples is very limited in our particular example. Nevertheless, this study is meaningful for future work: it shows that the training set size is critical and that a decent number of training examples per class is a must-have.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/training_size_details.PNG">

Based on the above study, we decided to consider only the answer classes that have more than 13 training examples, which reduces the dataset to 5423 training examples, 1832 test examples, and 109 unique answer classes. Using this subset of the training and test datasets, we replicated the ensemble modeling process described in this notebook and obtained an __Average Rank__ of 4 and a __Top 10 Percentage__ of 91%.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/subset_results.PNG">
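
The subsetting step described above could look roughly like the sketch below, which keeps only the answer classes with more than a given number of training examples. The function name `keep_frequent_classes` and the toy frames are hypothetical. The sketch only filters the question frames; the base classifier outputs would still need to be regenerated on the subset.

```python
import pandas as pd

def keep_frequent_classes(train_frame, test_frame, min_examples=13):
    """Keep rows whose AnswerId has more than `min_examples` training examples."""
    counts = train_frame['AnswerId'].value_counts()
    frequent = counts[counts > min_examples].index
    train_subset = train_frame[train_frame['AnswerId'].isin(frequent)]
    test_subset = test_frame[test_frame['AnswerId'].isin(frequent)]
    return train_subset, test_subset, frequent.sort_values()

# Toy data with a threshold of 2: only class 1 (3 examples) is kept.
toy_trainQ = pd.DataFrame({'AnswerId': [1, 1, 1, 2]})
toy_testQ = pd.DataFrame({'AnswerId': [1, 2, 1]})
sub_train, sub_test, kept_ids = keep_frequent_classes(toy_trainQ, toy_testQ, min_examples=2)
print(len(sub_train), len(sub_test), list(kept_ids))
```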
