# Part 3: Model Training and Evaluation

If you haven't completed **Part 1: Data Preparation** and **Part 2: Phrase Learning**, please complete them before moving on to **Part 3: Model Training and Evaluation**.

**NOTE**: The Python 3 kernel doesn't include Azure Machine Learning Workbench functionality. Please switch the kernel to `local` before continuing.

This example scores new questions against the pre-existing Q&A pairs by training text classification models in which each pre-existing Q&A pair is a unique class and a subset of the duplicate questions for each Q&A pair is available as training material.

In Part 3, the classification model uses an ensemble method to aggregate the following three base classifiers. In each base classifier, `AnswerId` is used as the class label and the bag-of-words (BOW) representation is used as the features.

1. Naive Bayes Classifier
2. Support Vector Machine (TF-IDF as features)
3. Random Forest (NB Scores as features)

Two different evaluation metrics are used to assess performance.
1. `Average Rank (AR)`: the average position at which the correct answer is found in the list of retrieved Q&A pairs (out of the full set of 103 answer classes).
2. `Top 3 Percentage`: the percentage of test questions for which the correct answer appears among the top three choices in the returned ranked list.

`Average Rank (AR)` and `Top 3 Percentage` on the test set are calculated using the following formulas:

<img src="https://raw.githubusercontent.com/Azure/MachineLearningSamples-QnAMatching/master/Image/evaluation_3.PNG?token=APoO9pHTAVmmb7YsGlsyWXgMHXDUz0xkks5Zwt4ywA%3D%3D">
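
As a minimal sketch of the two metrics, assuming they reduce to a plain mean of the 1-based ranks and the fraction of ranks that are at most three (which matches the definitions above), the calculation could look like the following. The function names and the sample ranks are hypothetical and are not part of this example's code base.

```python
import numpy as np

def average_rank(ranks):
    """Mean 1-based rank of the correct answer over all test questions."""
    return float(np.mean(ranks))

def top_k_percentage(ranks, k=3):
    """Fraction of test questions whose correct answer is ranked within the top k."""
    return float(np.mean(np.asarray(ranks) <= k))

# Hypothetical ranks of the correct answer for five test questions.
ranks = [1, 2, 7, 3, 12]
print(average_rank(ranks))         # 5.0
print(top_k_percentage(ranks, 3))  # 0.6
```

A lower `Average Rank` and a higher `Top 3 Percentage` both indicate better retrieval.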

### Import Required Python Modules

`modules.feature_extractor` contains user-defined Python functions that extract the features used in this example. You can find the source code in `modules/feature_extractor.py`.

In [1]:
import pandas as pd
import numpy as np
import os, warnings
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from modules.feature_extractor import (tokensToIds, countMatrix, priorProbabilityAnswer, posterioriProb, 
                               feature_selection, featureWeights, wordProbabilityInAnswer, 
                               wordProbabilityNotinAnswer, normalizeTF, getIDF, softmax)
warnings.filterwarnings("ignore")

## Access trainQ and testQ from Part 2

Because _trainQ_ and _testQ_ were prepared with learned phrases and tokens in `Part 2: Phrase Learning`, we load those datasets here for further processing.

_trainQ_ contains 5,153 training examples and _testQ_ contains 1,735 test examples. Also, there are 103 unique answer classes in both datasets.

In [2]:
workfolder = os.environ.get('AZUREML_NATIVE_SHARE_DIRECTORY')

# paths to trainQ and testQ.
trainQ_path = os.path.join(workfolder, 'trainQ_part2')
testQ_path = os.path.join(workfolder, 'testQ_part2')

# load the training and test data.
trainQ = pd.read_csv(trainQ_path, sep='\t', index_col='Id', encoding='latin1')
testQ = pd.read_csv(testQ_path, sep='\t', index_col='Id', encoding='latin1')

## Extract Features

Selecting the right set of features is critical for model training. In this section, we show several feature extraction approaches that have yielded good performance in text classification use cases.

### Term Frequency and Inverse Document Frequency (TF-IDF) 

TF-IDF is commonly used as features when training text classification models. 

Each question `d` is typically represented by a feature vector `x` that represents the contents of `d`. Because different questions may have different lengths, it can be useful to apply L1 normalization on the feature vector `x`. Therefore, a normalized `Term Frequency` matrix can be obtained based on the following formula.

<img src="https://raw.githubusercontent.com/Azure/MachineLearningSamples-QnAMatching/master/Image/tf.PNG?token=APoO9vtYknxorWSIoJ-dvhbNdu-3pjSIks5ZwuKzwA%3D%3D">

Considering all tokens observed in the training questions, we compute the `Inverse Document Frequency` for each token based on the following formula.

<img src="https://raw.githubusercontent.com/Azure/MachineLearningSamples-QnAMatching/master/Image/idf.PNG?token=APoO9gVRgPlRbg7OSaV56CO0-yj2178Iks5ZwuK-wA%3D%3D">

Given the `Term Frequency (TF)` matrix and the `Inverse Document Frequency (IDF)` vector, we compute the `TF-IDF` matrix by multiplying each token's TF values by that token's IDF weight.

<img src="https://raw.githubusercontent.com/Azure/MachineLearningSamples-QnAMatching/master/Image/tfidf.PNG?token=APoO9pllkWjHQTsshFCEGIUbyknjvq8Vks5ZwuMxwA%3D%3D">
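
The three steps can be sketched on a toy count matrix. This is an illustration under stated assumptions, not the implementation in `modules/feature_extractor.py`: the token-by-question layout and the `log(N / df)` form of IDF are assumed here, and the helper functions used in this notebook may differ in detail.

```python
import numpy as np

# Toy count matrix: rows are tokens, columns are questions (assumed layout).
counts = np.array([[2, 0, 1],
                   [1, 1, 0],
                   [1, 0, 0]], dtype=float)

# L1-normalized term frequency: each column (question) sums to 1.
tf = counts / counts.sum(axis=0, keepdims=True)

# Assumed IDF variant: log(number of questions / number of questions containing the token).
n_questions = counts.shape[1]
doc_freq = (counts > 0).sum(axis=1)
idf = np.log(n_questions / doc_freq)

# Scale each token row by its IDF weight.
tfidf = tf * idf[:, np.newaxis]
print(tfidf.shape)  # (3, 3)
```

Broadcasting with `idf[:, np.newaxis]` produces the same result as the transpose, multiply, transpose pattern `(tf.T * idf).T`.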

In [3]:
token2IdHashInit = tokensToIds(trainQ['Tokens'], featureHash=None)

# get unique answerId in ascending order
uniqueAnswerId = list(np.unique(trainQ['AnswerId']))

N_wQ = countMatrix(trainQ, token2IdHashInit)
idf = getIDF(N_wQ)

x_wTrain = normalizeTF(trainQ, token2IdHashInit)
x_wTest = normalizeTF(testQ, token2IdHashInit)

tfidfTrain = (x_wTrain.T * idf).T
tfidfTest = (x_wTest.T * idf).T

### Naive Bayes Scores

Besides using the IDF as the word weighting mechanism, a hypothesis testing likelihood ratio approach is also implemented here.

In this approach, the word weights are associated with the answer classes and are calculated using the following formula.

<img src="https://raw.githubusercontent.com/Azure/MachineLearningSamples-QnAMatching/master/Image/NB_weight.PNG?token=APoO9kRUjFMeslJIVyY3wpBy8ycfyddKks5ZwuNjwA%3D%3D">

<img src="https://raw.githubusercontent.com/Azure/MachineLearningSamples-QnAMatching/master/Image/probability_function.PNG?token=APoO9hVi60vQ3PGqz-F7fdeDX6HLaxckks5ZwuNywA%3D%3D">

Given the `Term Frequency (TF)` matrix and the `Weight` vector for each class, we compute the `Naive Bayes Scores` matrix for each class by multiplying them together.
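
To make the likelihood ratio idea concrete, here is a rough sketch that derives per-class token weights from a token-by-class count matrix and scores one question. The function `nb_weights` and its smoothing scheme are assumptions made for illustration; the helpers in `modules/feature_extractor.py` may smooth the probabilities differently.

```python
import numpy as np

def nb_weights(n_wa, alpha=1e-4, beta=1e-4):
    """Log-likelihood-ratio weights from a token-by-class count matrix.

    The smoothing used here is an assumption for illustration only.
    """
    n_wa = np.asarray(n_wa, dtype=float)
    n_w = n_wa.sum(axis=1, keepdims=True)   # total count of each token
    n_a = n_wa.sum(axis=0, keepdims=True)   # total token count in each class
    n_total = n_wa.sum()
    vocab_size = n_wa.shape[0]

    p_w = (n_w + alpha) / (n_total + alpha * vocab_size)            # smoothed P(w)
    p_w_a = (n_wa + beta * p_w) / (n_a + beta)                      # P(w | A)
    p_w_not_a = (n_w - n_wa + beta * p_w) / (n_total - n_a + beta)  # P(w | not A)
    return np.log(p_w_a / p_w_not_a)

# Three tokens, two answer classes (made-up counts).
counts = np.array([[8, 1],
                   [1, 9],
                   [3, 3]])
weights = nb_weights(counts)

# One question as an L1-normalized TF column vector over the three tokens.
tf = np.array([[0.5], [0.0], [0.5]])
scores = tf.T @ weights   # shape (1, 2): one score per answer class
```

A token that occurs mostly in one class gets a positive weight for that class and a negative weight for the others, so the question above scores higher for the first class.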

#### Feature selection

Text classification models often pre-select a set of features (i.e., tokens) that carry the most class-relevant information for further processing, while ignoring words that carry little to no value for identifying classes. A variety of feature selection methods have previously been explored for text processing. In this example, we have had the most success selecting features based on the estimated class posterior probability `P(A|w)`, where `A` is a specific answer class and `w` is a specific token. The maximum a posteriori probability (MAP) estimate of `P(A|w)` is expressed as

<img src="https://raw.githubusercontent.com/Azure/MachineLearningSamples-QnAMatching/master/Image/feature_selection.PNG?token=APoO9vG9-E4syU3E5F0Ysm7fjyb0N5T4ks5ZwuWSwA%3D%3D">

Feature selection in this example is performed by selecting, for each answer class `A`, the top `N` tokens that maximize `P(A|w)`. To determine the best value for the `TopN` parameter, you can run `scripts/naive_bayes.py` with the `local` compute context in the Azure Machine Learning Workbench and enter different integer values as `Arguments`.

<img src="https://raw.githubusercontent.com/Azure/MachineLearningSamples-QnAMatching/master/Image/run_naive_bayes.PNG?token=APoO9pKfKs4--gnpxNfM8Pueedv5oOwAks5ZwuXpwA%3D%3D">

Based on our experiments, `TopN = 19` yields the best result and is used in this notebook.
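
The selection step can be sketched as follows. The function `select_top_n_tokens` is hypothetical, and the smoothed estimate of `P(A|w)` inside it (counts pulled toward the class prior) is an assumed form used only for illustration; it is not necessarily the estimate computed by `posterioriProb`.

```python
import numpy as np

def select_top_n_tokens(n_wa, prior, vocab, top_n):
    """Return the union of the top_n tokens per class, ranked by an estimate of P(A|w)."""
    n_wa = np.asarray(n_wa, dtype=float)
    n_classes = n_wa.shape[1]
    n_w = n_wa.sum(axis=1, keepdims=True)
    # Assumed MAP-style estimate: token counts smoothed toward the class prior.
    p_a_w = (n_wa + n_classes * prior) / (n_w + n_classes)
    selected = set()
    for a in range(n_classes):
        top_idx = np.argsort(-p_a_w[:, a])[:top_n]
        selected.update(vocab[i] for i in top_idx)
    return selected

# Four tokens, two answer classes (made-up counts).
vocab = ["install", "error", "the", "license"]
counts = np.array([[9, 0],
                   [1, 8],
                   [5, 5],
                   [0, 1]])
prior = np.array([0.5, 0.5])
print(sorted(select_top_n_tokens(counts, prior, vocab, top_n=1)))  # ['error', 'install']
```

Tokens such as `the`, which are spread evenly across classes, never reach the top of any class ranking and are dropped from the feature set.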

In [4]:
# calculate the count matrix of all training questions.
N_wAInit = countMatrix(trainQ, token2IdHashInit, 'AnswerId', uniqueAnswerId)

P_A = priorProbabilityAnswer(trainQ['AnswerId'], uniqueAnswerId)
P_Aw = posterioriProb(N_wAInit, P_A, uniqueAnswerId)

# select top N important tokens per answer class.
featureHash = feature_selection(P_Aw, token2IdHashInit, topN=19)
token2IdHash = tokensToIds(trainQ['Tokens'], featureHash=featureHash)

N_wA = countMatrix(trainQ, token2IdHash, 'AnswerId', uniqueAnswerId)

alpha = 0.0001
P_w = featureWeights(N_wA, alpha)

beta = 0.0001
P_wA = wordProbabilityInAnswer(N_wA, P_w, beta)
P_wNotA = wordProbabilityNotinAnswer(N_wA, P_w, beta)

NBWeights = np.log(P_wA / P_wNotA)

## Train Classification Models and Predict on Test Data

### Naive Bayes Classifier

We implement the _Naive Bayes Classifier_ as described in the paper entitled ["MCE Training Techniques for Topic Identification of Spoken Audio Documents"](http://ieeexplore.ieee.org/abstract/document/5742980/).

In [5]:
beta_A = 0

x_wTest = normalizeTF(testQ, token2IdHash)
Y_test_prob1 = softmax(-beta_A + np.dot(x_wTest.T, NBWeights))
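
The cell above converts the per-class Naive Bayes scores into probabilities with `softmax`. For readers unfamiliar with that step, here is a minimal row-wise softmax sketch. It is an assumption about what such a helper typically does; the `softmax` in `modules/feature_extractor.py` may be implemented differently.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Two test questions scored against three answer classes (made-up values).
scores = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
probs = softmax_rows(scores)
```

Each row of `probs` sums to 1, and the ordering of classes within a row is the same as in the raw scores, so the ranking of answers is unchanged.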

### Support Vector Machine (TF-IDF as features)

Traditionally, a Support Vector Machine (SVM) model finds a hyperplane that maximally separates positive and negative training examples in a vector space. In its standard form, an SVM is a two-class classifier. To create an SVM model for a problem with multiple classes, a one-versus-rest (OVR) SVM classifier is typically learned for each answer class.

The `sklearn` Python package implements such a classifier, and we use that implementation in this example. More information about the `LinearSVC` classifier can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html).

In [6]:
X_train, Y_train = tfidfTrain.T, np.array(trainQ['AnswerId'])
clf = svm.LinearSVC(dual=True, multi_class='ovr', penalty='l2', C=1, loss="squared_hinge", random_state=1)
clf.fit(X_train, Y_train)

X_test = tfidfTest.T
Y_test_prob2 = softmax(clf.decision_function(X_test))

### Random Forest (NB Scores as features)

Similar to the above one-versus-rest SVM classifier, we also implement a one-versus-rest Random Forest classifier based on a base two-class Random Forest classifier from `sklearn`. More information about the `RandomForestClassifier` can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In each base classifier, we dynamically compute the Naive Bayes scores for the positive class as the features. Since the number of negative examples is much larger than the number of positive examples, we keep all positive examples and randomly select negative examples based on a negative-to-positive ratio to obtain more balanced training data.
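
The sampling idea can be sketched in isolation. The function `balanced_indices` is hypothetical. Unlike the `np.random.choice` call used later in this notebook, which samples with replacement by default, this sketch draws negatives without replacement and uses a seeded generator so the selection is reproducible.

```python
import numpy as np

def balanced_indices(binary_labels, ratio, rng):
    """Keep every positive example and draw up to ratio * n_pos negatives without replacement."""
    binary_labels = np.asarray(binary_labels)
    pos_idx = np.flatnonzero(binary_labels == 1)
    neg_idx = np.flatnonzero(binary_labels == 0)
    n_neg = min(len(neg_idx), ratio * len(pos_idx))
    sampled_neg = rng.choice(neg_idx, size=n_neg, replace=False)
    return np.concatenate([pos_idx, sampled_neg])

rng = np.random.default_rng(1)
labels = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 0])   # 2 positives, 8 negatives
idx = balanced_indices(labels, ratio=3, rng=rng)
print(len(idx))  # 8
```

With `ratio=3`, the two positives are kept and six of the eight negatives are drawn, so the subsample has a 3:1 negative-to-positive ratio.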

In this classifier, we need to tune two hyper-parameters: `TopN` and `n_estimators`. `TopN` is the same parameter introduced in the _Feature selection_ step, and `n_estimators` is the number of trees constructed in the Random Forest classifier. To identify the best values for the hyper-parameters, you can run `scripts/random_forest.py` with the `local` compute context in the Azure Machine Learning Workbench and enter different integer values as `Arguments`. The values of `TopN` and `n_estimators` should be space-delimited.

<img src="https://raw.githubusercontent.com/Azure/MachineLearningSamples-QnAMatching/master/Image/run_rf.PNG?token=APoO9qTD6OH201WZFpAETKAWN3MII-Ocks5ZwumRwA%3D%3D">

Based on our experiments, `TopN = 19` and `n_estimators = 250` yield the best result and are used in this notebook.

In [7]:
# train one-vs-rest classifier using NB scores as features.
def ovrClassifier(trainLabels, x_wTrain, x_wTest, NBWeights, clf, ratio):
    uniqueLabel = np.unique(trainLabels)
    dummyLabels = pd.get_dummies(trainLabels)
    numTest = x_wTest.shape[1]
    Y_test_prob = np.zeros(shape=(numTest, len(uniqueLabel)))

    for i in range(len(uniqueLabel)):
        X_train_all, Y_train_all = x_wTrain.T * NBWeights[:, i], dummyLabels.iloc[:, i]
        X_test = x_wTest.T * NBWeights[:, i]
        
        # with sample selection.
        if ratio is not None:
            # ratio = # of Negative/# of Positive
            posIdx = np.where(Y_train_all == 1)[0]
            negIdx = np.random.choice(np.where(Y_train_all == 0)[0], ratio*len(posIdx))
            allIdx = np.concatenate([posIdx, negIdx])
            X_train, Y_train = X_train_all[allIdx], Y_train_all.iloc[allIdx]
        else: # without sample selection.
            X_train, Y_train = X_train_all, Y_train_all
            
        clf.fit(X_train, Y_train)
        if hasattr(clf, "decision_function"):
            Y_test_prob[:, i] = clf.decision_function(X_test)
        else:
            Y_test_prob[:, i] = clf.predict_proba(X_test)[:, 1]

    return softmax(Y_test_prob)

In [8]:
x_wTrain = normalizeTF(trainQ, token2IdHash)
x_wTest = normalizeTF(testQ, token2IdHash)

clf = RandomForestClassifier(n_estimators=250, criterion='entropy', random_state=1)
Y_test_prob3 = ovrClassifier(trainQ["AnswerId"], x_wTrain, x_wTest, NBWeights, clf, ratio=3)

### Ensemble Model

We build an ensemble model by aggregating the predicted probabilities from the three previously trained classifiers. The base classifiers are equally weighted in this ensemble method.

In [9]:
Y_test_prob_aggr = np.mean([Y_test_prob1, Y_test_prob2, Y_test_prob3], axis=0)
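
Equal weighting is a special case of a weighted average. As a small sketch, the hypothetical helper `ensemble_average` below accepts optional per-classifier weights; the probability values are made up, and unequal weights were not part of the experiments described here.

```python
import numpy as np

def ensemble_average(prob_list, weights=None):
    """Weighted average of per-classifier probability matrices of identical shape."""
    stacked = np.stack(prob_list)   # shape: (n_classifiers, n_questions, n_classes)
    return np.average(stacked, axis=0, weights=weights)

# One test question, three answer classes, three classifiers.
p1 = np.array([[0.7, 0.2, 0.1]])
p2 = np.array([[0.4, 0.5, 0.1]])
p3 = np.array([[0.1, 0.6, 0.3]])

combined = ensemble_average([p1, p2, p3])   # equal weights, same as np.mean
```

Because each input row sums to 1, every row of the averaged matrix also sums to 1, so the result can still be read as a probability distribution over answer classes.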

## Evaluate Model Performance

Two different evaluation metrics are used to assess performance.
1. `Average Rank (AR)`: the average position at which the correct answer is found in the list of retrieved Q&A pairs (out of the full set of 103 answer classes).
2. `Top 3 Percentage`: the percentage of test questions for which the correct answer appears among the top three choices in the returned ranked list.
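
Both metrics depend on the rank of the correct answer for each question. The sketch below shows one vectorized way to obtain those ranks from a score matrix. The function `ranks_of_correct_answer`, the answer IDs, and the scores are hypothetical and only illustrate the idea.

```python
import numpy as np

def ranks_of_correct_answer(scores, answer_ids, true_ids):
    """Return the 1-based rank of each question's true answer id."""
    scores = np.asarray(scores)
    answer_ids = np.asarray(answer_ids)
    order = np.argsort(-scores, axis=1)   # column indices, best score first
    sorted_ids = answer_ids[order]        # answer ids in ranked order per question
    # Position of the true id in each row, converted to a 1-based rank.
    return np.argmax(sorted_ids == np.asarray(true_ids)[:, None], axis=1) + 1

answer_ids = [101, 102, 103]
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])
print(ranks_of_correct_answer(scores, answer_ids, true_ids=[103, 101]))  # [2 1]
```

This assumes every true answer id appears in `answer_ids`, which holds when the training and test sets share the same answer classes.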

In [10]:
# get the rank of answerIds for a given question. 
def rank(frame, scores, uniqueAnswerId):
    frame['SortedAnswers'] = list(np.array(uniqueAnswerId)[np.argsort(-scores, axis=1)])
    
    rankList = []
    for i in range(len(frame)):
        rankList.append(np.where(frame['SortedAnswers'].iloc[i] == frame['AnswerId'].iloc[i])[0][0] + 1)
    frame['Rank'] = rankList
    
    return frame

In [11]:
testQ = rank(testQ, Y_test_prob_aggr, uniqueAnswerId)

AR = np.floor(testQ['Rank'].mean())
top3 = round(len(testQ.query('Rank <= 3'))/len(testQ), 3)
 
print('Average of rank: ' + str(AR))
print('Percentage of questions find answers in the first 3 choices: ' + str(top3))

Average of rank: 5.0
Percentage of questions find answers in the first 3 choices: 0.684
