# QnA Matching Data Science Scenario

## Part 6: Calibrated One-vs-rest Support Vector Machine (SVM) Classifier (DSSM Feature Embeddings)

### Overview

Like __Part 5__, __Part 6__ of the series implements a _One-vs-rest SVM Classifier_, using the feature embeddings extracted from a [_Deep Structured Semantic Model (DSSM) Transformer_](https://microsoft.sharepoint.com/teams/TLC/SitePages/Transforms/DssmTransform.aspx) in this notebook. The transform uses pre-trained DSSM models either to featurize text into a semantic embedding vector or, given two strings, to output a similarity score between them.

DSSM is a neural network algorithm (see the architecture below) that produces feature embeddings for key-value string pairs. It is trained on a dataset of positive key-value pairs: the original rows serve as correct examples, and the strings are recombined to produce adversarial, incorrect training examples. Examples of key-value pairs include search query and clicked document title text, search query and clicked ad content text, and entity-tweet string pairs.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/DSSM.PNG">
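The idea of recombining strings from positive pairs into incorrect examples can be illustrated with a minimal sketch. This is not the actual DSSM training procedure; the function name `make_training_pairs`, its parameters, and the toy strings are hypothetical and chosen only for illustration.

```python
import random

def make_training_pairs(positive_pairs, negatives_per_pair=1, seed=0):
    """Label the original (key, value) rows 1 and recombined rows 0.

    Hypothetical helper: it only illustrates the recombination idea,
    not how the pre-trained DSSM models were actually built.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    values = [value for _, value in positive_pairs]

    # Original rows are the correct examples.
    examples = [(key, value, 1) for key, value in positive_pairs]

    # Pair each key with values taken from other rows to get incorrect examples.
    for key, _ in positive_pairs:
        candidates = [v for v in values if (key, v) not in positives]
        n_draws = min(negatives_per_pair, len(candidates))
        for value in rng.sample(candidates, n_draws):
            examples.append((key, value, 0))
    return examples

pairs = [("q1", "d1"), ("q2", "d2"), ("q3", "d3")]
examples = make_training_pairs(pairs)
```

Each key keeps its original value as a positive example and receives one mismatched value as a negative example.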

The DSSM Transform comes with two general-purpose pre-trained models that can be used out of the box to produce text featurizations, in a role similar to n-gram extracting transforms. Using them can improve prediction results on certain datasets. Both models were trained on large Bing datasets: the web search model on click-through data of query strings and search result titles, and the ads model similarly on pairs of query strings and ads. The transformer takes input text as either _QueryColumn_ or _DocumentColumn_. In the question answering scenario, the question (short text) is usually treated as the _QueryColumn_ and the answer (long text) as the _DocumentColumn_, but swapping the two columns and comparing the results is also a worthwhile experiment.

In addition, the input text data can be either the raw data or pre-cleaned data.

### Results

For learning purposes, we ran 16 different experiments; their results are shown below.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/DSSM_results.PNG">

* Matching Approach:
    * Q to Q: match the new question to previously seen questions, which link to their correct answers.
    * Q to Ans: match the new question directly to answers.
* Text Column:
    * Text with Phrases: the text data has been pre-processed and reconstructed with the semantically meaningful phrases from __Part 1__.
    * Raw text: the raw text data without any pre-processing.
* Feature Extraction:
    * Bing Ads Selection DSSM - Query Column: use the pre-trained Ads Selection model to extract feature embeddings from the last layer of neural network. The text data is considered as _QueryColumn_.
    * Bing Web Search DSSM - Query Column: use the pre-trained Web Search model to extract feature embeddings from the last layer of neural network. The text data is considered as _QueryColumn_.
    * Bing Ads Selection DSSM - Document Column: use the pre-trained Ads Selection model to extract feature embeddings from the last layer of neural network. The text data is considered as _DocumentColumn_.
    * Bing Web Search DSSM - Document Column: use the pre-trained Web Search model to extract feature embeddings from the last layer of neural network. The text data is considered as _DocumentColumn_.
* Model:
    * One-vs-rest SVM: fit a calibrated one-vs-rest SVM classifier using the feature embeddings (__Part 5__ has explained the details of this type of classifier).
    * Cosine Similarity: compare the Cosine Similarity between two feature embeddings (__Part 2__ and __Part 3__ have explained the details of Cosine Similarity).
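
As a reminder of what the Cosine Similarity model does with two feature embeddings, here is a minimal NumPy sketch. The helper name `cosine_rank` and the three-dimensional toy vectors are assumptions made for illustration; the real embeddings have 300 dimensions, and the earlier parts of the series may compute the similarity differently.

```python
import numpy as np

def cosine_rank(query_vec, candidate_vecs):
    """Return candidate indices sorted by descending cosine similarity, plus the scores."""
    # Normalize to unit length so the dot product equals the cosine of the angle.
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims), sims

# Toy embeddings (hypothetical values).
query = np.array([1.0, 0.0, 1.0])
candidates = np.array([
    [1.0, 0.0, 1.0],   # same direction as the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [1.0, 1.0, 0.0],   # partially aligned with the query
])
order, sims = cosine_rank(query, candidates)
```

The candidate pointing in the same direction as the query ranks first, the partially aligned one second, and the orthogonal one last.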

In the first half of this notebook, we will show you the process of building the model that yields the best results in the above table.

Note: This notebook series was built with Python 3.5 and NLTK 3.2.2.

## Import required Python modules

In [24]:
import pandas as pd
import math
import gc
import numpy as np
import copy
import os
import matplotlib
import matplotlib.pyplot as plt
from numpy import linalg as LA
from sklearn import svm
from sklearn.calibration import CalibratedClassifierCV
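try:
    import joblib  # standalone package, needed with newer scikit-learn releases
except ImportError:
    from sklearn.externals import joblib  # fallback for older scikit-learn releases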
from IPython.display import display

%matplotlib inline

# suppress all warnings
import warnings
warnings.filterwarnings("ignore")

# Approach 1: SVM + DSSM Feature Embeddings
## Read trainQ and testQ into DataFrames

In [25]:
trainQ_url = 'https://mezsa.blob.core.windows.net/stackoverflownew/trainQwithTokens.tsv'
testQ_url = 'https://mezsa.blob.core.windows.net/stackoverflownew/testQwithTokens.tsv'

trainQ = pd.read_csv(trainQ_url, sep='\t', index_col='Id', encoding='latin1')
testQ = pd.read_csv(testQ_url, sep='\t', index_col='Id', encoding='latin1')

## Read DSSM featurized data into DataFrames

We apply the DSSM transformer to the pre-processed and reconstructed text data through the TLC GUI. Each text string is transformed into a 300-dimensional feature embedding.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/TLC.PNG">

Then we output the feature embeddings to .tsv files, which are loaded into this notebook. 


In [26]:
trainQ_embeddings_url = "https://mezsa.blob.core.windows.net/stackoverflownew/_dssm/train_phrase_Web_Q.tsv"
testQ_embeddings_url = "https://mezsa.blob.core.windows.net/stackoverflownew/_dssm/test_phrase_Web_Q.tsv"

trainQ_embeddings = pd.read_csv(trainQ_embeddings_url, sep='\t', header=None, skiprows=7, encoding='latin1')
testQ_embeddings = pd.read_csv(testQ_embeddings_url, sep='\t', header=None, skiprows=7, encoding='latin1')

In [27]:
# the first column of this DataFrame holds the AnswerId
# the remaining columns hold the 300-dimensional feature embeddings
testQ_embeddings.head(2)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3777 | -0.052651 | 0.339395 | -0.712155 | -0.150491 | 0.612713 | -0.807458 | -0.12581 | -0.481467 | -0.498498 | ... | 0.597605 | -0.0524 | -0.636145 | -0.888976 | -0.893814 | 0.311533 | -0.64789 | -0.431012 | 0.117726 | 0.570042 |
| 1 | 3777 | 0.194004 | 0.123597 | -0.649545 | 0.469424 | 0.471408 | -0.911375 | -0.49782 | 0.109869 | -0.715097 | ... | 0.488747 | 0.176386 | -0.853409 | -0.801949 | -0.691216 | 0.163063 | -0.604138 | -0.254578 | -0.149677 | 0.109133 |


## Fit One-vs-Rest SVM using DSSM Feature Embeddings

In [28]:
# DSSM feature embeddings as training features
# AnswerIds as targets
X_train = np.array(trainQ_embeddings.loc[:, 1:300].astype(float))
Y_train = np.array(trainQ_embeddings.loc[:, 0].astype(int))
X_test = np.array(testQ_embeddings.loc[:, 1:300].astype(float))
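
The slices above rely on two pandas behaviors: with `header=None` the columns are labeled with integers starting at 0, and `.loc` slices by label and includes both endpoints, so `1:300` selects all 300 embedding columns. The sketch below shows this on a miniature, hypothetical file with two header lines and three feature columns; the real export layout is only assumed to be analogous.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical miniature export: two header lines to skip,
# then one label followed by three feature values per row.
raw = "#meta line 1\n#meta line 2\n11\t0.1\t0.2\t0.3\n12\t0.4\t0.5\t0.6\n"
toy = pd.read_csv(io.StringIO(raw), sep="\t", header=None, skiprows=2)

# Columns are labeled 0..3; .loc[:, 1:3] is label-based and inclusive of 3.
X_toy = np.array(toy.loc[:, 1:3].astype(float))
y_toy = np.array(toy.loc[:, 0].astype(int))

print(X_toy.shape)  # (2, 3)
print(y_toy)        # [11 12]
```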

In [29]:
# first, fit a Linear SVC model 
est = svm.LinearSVC(dual=True, multi_class='ovr', max_iter=7, penalty='l2', C=1.0, loss="hinge")
# then fit a Calibrated Classifier with 3-fold cross-validation
clf = CalibratedClassifierCV(est, cv=3, method='sigmoid')
%time clf.fit(X_train, Y_train) 

Wall time: 1min 56s


CalibratedClassifierCV(base_estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=7, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0),
            cv=3, method='sigmoid')
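
To see what the calibrated classifier returns, the same two-step construction can be tried on a small synthetic dataset. This is a sketch under assumptions: the data come from `make_classification`, and `max_iter` is raised to 1000 so the toy problem converges, which differs from the setting used in this notebook.

```python
import numpy as np
from sklearn import svm
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification

# Synthetic three-class problem (hypothetical stand-in for the embeddings).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Linear SVM (one-vs-rest by default) wrapped in sigmoid calibration with 3-fold CV.
base = svm.LinearSVC(C=1.0, loss="hinge", dual=True, max_iter=1000, random_state=0)
toy_clf = CalibratedClassifierCV(base, cv=3, method="sigmoid")
toy_clf.fit(X_toy, y_toy)

# One probability per class for each sample; columns follow toy_clf.classes_.
proba = toy_clf.predict_proba(X_toy[:5])
print(proba.shape)      # (5, 3)
print(toy_clf.classes_)
```

The columns of `predict_proba` follow `classes_`, which is sorted in ascending order. This is why the ranking step later maps scores to answers through the sorted unique `AnswerId` values.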

In [30]:
# add model persistence
pklFile = os.path.join(os.getcwd(), "SVM_DSSM.pkl")
# save the model as an external .pkl file and load the model when it is needed
if True: 
    joblib.dump(clf, pklFile) 
if False:
    clf = joblib.load(pklFile) 

In [31]:
# make a deep copy of the testQ for the second approach
testQ2 = copy.deepcopy(testQ)

In [32]:
# predict probabilities on the test set
Y_test_pred = clf.predict_proba(X_test)
testQ['SVMProb'] = list(Y_test_pred)

## Save or Reload Probability Estimates
We save the probability estimates to a text file so that they can be reloaded in later notebooks.

In [33]:
fileNameTest = os.path.join(os.getcwd(), "SVM_DSSMTest.out")

# save scores to a text file:
if True: 
    np.savetxt(fileNameTest, Y_test_pred, delimiter=',')
    
# reload the text file into numpy matrix:
if False:
    Y_test_pred = np.loadtxt(fileNameTest, delimiter=',')
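
The save and reload steps can be checked with a quick round trip on a small matrix. The file name and the temporary directory below are hypothetical and used only for this sketch.

```python
import os
import tempfile
import numpy as np

# Small stand-in for the probability matrix (hypothetical values).
scores = np.array([[0.125, 0.875], [0.5, 0.5]])

# Write to a temporary location so the sketch does not touch the working directory.
path = os.path.join(tempfile.mkdtemp(), "toy_scores.out")
np.savetxt(path, scores, delimiter=",")

# Reload with the same delimiter that was used for saving.
reloaded = np.loadtxt(path, delimiter=",")
print(np.allclose(reloaded, scores))
```

The delimiter passed to `np.loadtxt` has to match the one passed to `np.savetxt`; otherwise the file will not parse back into the same matrix.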

## Rank the Predicted Probability and Calculate Average Rank 

In [34]:
# sort the predicted probabilities in descending order and map them to the corresponding AnswerIds in the answer set
def rank(frame, scores, uniqueAnswerId):
    frame['SortedAnswers'] = list(np.array(uniqueAnswerId)[np.argsort(-scores, axis=1)])
    
    rankList = []
    for i in range(len(frame)):
        rankList.append(np.where(frame['SortedAnswers'].iloc[i] == frame['AnswerId'].iloc[i])[0][0] + 1)
    frame['Rank'] = rankList
    
    return frame

In [35]:
# get unique answerId in ascending order
uniqueAnswerId = list(np.unique(trainQ['AnswerId']))
testQ = rank(testQ, Y_test_pred, uniqueAnswerId)

In [36]:
print('Total number of questions in test set: ' + str(len(testQ)))
print('Total number of answers: ' + str(len(uniqueAnswerId)))
print('Total number of unique features: ' + str(X_train.shape[1]))
print('Average of rank: ' + str(np.floor(testQ['Rank'].mean())))
print('Percentage of questions find answers in top 10: ' + str(round(len(testQ.query('Rank <= 10'))/len(testQ), 3)))

Total number of questions in test set: 3468
Total number of answers: 1201
Total number of unique features: 300
Average of rank: 78.0
Percentage of questions find answers in top 10: 0.535
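
The ranking and the two summary metrics can be traced by hand on a tiny example. The sketch below mirrors the logic of `rank` without a DataFrame; the helper name `rank_of_true_answer` and the toy scores are assumptions for illustration.

```python
import numpy as np

def rank_of_true_answer(scores, answer_ids, true_ids):
    """Return the 1-based position of each true answer after sorting scores descending."""
    order = np.argsort(-scores, axis=1)           # best-scoring column first
    sorted_ids = np.asarray(answer_ids)[order]    # map column positions to answer ids
    return np.array([
        np.where(row == true_id)[0][0] + 1
        for row, true_id in zip(sorted_ids, true_ids)
    ])

# Two test questions scored against three answers (hypothetical values).
toy_scores = np.array([[0.1, 0.7, 0.2],
                       [0.5, 0.3, 0.2]])
toy_ranks = rank_of_true_answer(toy_scores, [10, 20, 30], [30, 10])

average_rank = toy_ranks.mean()
top1_share = np.mean(toy_ranks <= 1)
print(toy_ranks, average_rank, top1_share)  # [2 1] 1.5 0.5
```

For the first question the true answer 30 has the second-highest score, so its rank is 2. For the second question the true answer 10 scores highest, so its rank is 1.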


# Approach 2: SVM + DSSM + NBScores

In the second half, we concatenate the same set of DSSM embeddings with the Naive Bayes scores learned from unigrams in __Part 4__ and train an SVM classifier on the combined features. By adding this new set of features, we want to learn whether it improves the model's predictive performance.

## Read NBScores into a Numpy Matrix

In [37]:
fileNameTrain = os.path.join(os.getcwd(), "NBScoresTrain_top4.out")
fileNameTest = os.path.join(os.getcwd(), "NBScoresTest_top4.out")

# reload the text file into numpy matrix:
if True:
    NBScoresTrain = np.loadtxt(fileNameTrain, delimiter=',')
    NBScoresTest = np.loadtxt(fileNameTest, delimiter=',')

In [38]:
# Concatenate DSSM Embeddings and NB Scores as training features
# AnswerIds as targets
X_train2 = np.concatenate((NBScoresTrain, np.array(trainQ_embeddings.loc[:, 1:300])), axis=1)
Y_train2 = np.array(trainQ['AnswerId'])
X_test2 = np.concatenate((NBScoresTest, np.array(testQ_embeddings.loc[:, 1:300])), axis=1)
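
Column-wise concatenation only makes sense when both matrices hold the same questions in the same row order. The sketch below shows the resulting shape on toy arrays with hypothetical sizes; in this notebook the combined matrix has 1501 columns, that is, the 300 embedding dimensions plus 1201 Naive Bayes score columns.

```python
import numpy as np

# Toy stand-ins (hypothetical sizes): 3 questions, 2 NB scores, 4 embedding dimensions.
nb_scores = np.array([[0.2, 0.8],
                      [0.6, 0.4],
                      [0.5, 0.5]])
embeddings = np.arange(12, dtype=float).reshape(3, 4)

# Both blocks must describe the same rows before they are joined side by side.
assert nb_scores.shape[0] == embeddings.shape[0]

features = np.concatenate((nb_scores, embeddings), axis=1)
print(features.shape)  # (3, 6)
```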

In [39]:
# first, fit a Linear SVC model 
est2 = svm.LinearSVC(dual=True, multi_class='ovr', max_iter=7, penalty='l2', C=1.0, loss="hinge")
# then fit a Calibrated Classifier with 3-fold cross-validation
clf2 = CalibratedClassifierCV(est2, cv=3, method='sigmoid')
%time clf2.fit(X_train2, Y_train2) 

Wall time: 8min 23s


CalibratedClassifierCV(base_estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=7, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0),
            cv=3, method='sigmoid')

In [40]:
# add model persistence
pklFile2 = os.path.join(os.getcwd(), "SVM_DSSM_NBScores.pkl")
# save the model as an external .pkl file and load the model when it is needed
if True: 
    joblib.dump(clf2, pklFile2) 
if False:
    clf2 = joblib.load(pklFile2) 

In [41]:
# predict probabilities on the test set
Y_test_pred2 = clf2.predict_proba(X_test2)
testQ2['SVMProb'] = list(Y_test_pred2)

## Save or Reload Probability Estimates
We save the probability estimates to a text file so that they can be reloaded in later notebooks.

In [42]:
fileNameTest2 = os.path.join(os.getcwd(), "SVM_DSSM_NBScoresTest.out")

# save scores to a text file:
if True: 
    np.savetxt(fileNameTest2, Y_test_pred2, delimiter=',')

# reload the text file into numpy matrix:
if False:
    Y_test_pred2 = np.loadtxt(fileNameTest2, delimiter=',')

## Rank the Predicted Probability and Calculate Average Rank

In [43]:
# get unique answerId in ascending order
uniqueAnswerId = list(np.unique(trainQ['AnswerId']))
testQ2 = rank(testQ2, Y_test_pred2, uniqueAnswerId)

In [44]:
print('Total number of questions in test set: ' + str(len(testQ2)))
print('Total number of answers: ' + str(len(uniqueAnswerId)))
print('Total number of unique features: ' + str(X_train2.shape[1]))
print('Average of rank: ' + str(np.floor(testQ2['Rank'].mean())))
print('Percentage of questions find answers in top 10: ' + str(round(len(testQ2.query('Rank <= 10'))/len(testQ2), 3)))

Total number of questions in test set: 3468
Total number of answers: 1201
Total number of unique features: 1501
Average of rank: 32.0
Percentage of questions find answers in top 10: 0.657


## Results

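As the two runs show, adding the NB scores (unigrams) to the feature space improves model performance: the average rank drops from 78 to 32, and the share of test questions whose correct answer appears in the top 10 rises from 0.535 to 0.657.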