# QnA Matching Data Science Scenario

## Part 2: TF-IDF and Cosine Similarity - Match Q to Similar A

### Overview

__Part 2__ of the series shows the process of matching Questions to Answers based on the _Cosine Similarity_ of their _Term Frequency-Inverse Document Frequency (TF-IDF)_ matrix.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/experiment1_overview.PNG?token=APoO9gLiCcPVowvW0OsAuuXD9coI-HEsks5Ynhp0wA%3D%3D">

Note: This notebook series was built with Python 3.5 and NLTK 3.2.2.

## Import required Python modules

In [35]:
import pandas as pd
import math
import numpy as np
from numpy import linalg as LA
from azure.storage import CloudStorageAccount
from IPython.display import display

# suppress all warnings
import warnings
warnings.filterwarnings("ignore")

## Read trainQ, testQ, and answersC into DataFrames

In [36]:
trainQ_url = 'https://mezsa.blob.core.windows.net/stackoverflownew/trainQwithTokens.tsv'
testQ_url = 'https://mezsa.blob.core.windows.net/stackoverflownew/testQwithTokens.tsv'
answersC_url = 'https://mezsa.blob.core.windows.net/stackoverflownew/answersCwithTokens.tsv'

trainQ = pd.read_csv(trainQ_url, sep='\t', index_col='Id', encoding='latin1')
testQ = pd.read_csv(testQ_url, sep='\t', index_col='Id', encoding='latin1')
answersC = pd.read_csv(answersC_url, sep='\t', index_col='Id', encoding='latin1')

## Create Tokens to IDs Hash

We assign a unique ID to each token in the vocabulary.
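
As a quick illustration of the idea, here is a minimal plain-Python sketch on toy data. The token strings and the `token_to_id` name are made up for this example; it only assumes that tokens are stored as comma-separated strings, as in the `Tokens` column.

```python
# Toy comma-separated token strings standing in for the 'Tokens' column.
docs = ["python,list,sort", "python,dict", "sort,list,list"]

token_to_id = {}
for doc in docs:
    for token in doc.split(","):
        # A token gets the next free ID only the first time it is seen.
        if token not in token_to_id:
            token_to_id[token] = len(token_to_id)

print(token_to_id)
```

IDs are handed out in order of first appearance, so the mapping depends on the order in which documents are read.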

In [37]:
# get Token to ID mapping: {Token: tokenId}
def tokens_to_ids(tokens, featureHash):
    token2IdHash = {}
    for i in range(len(tokens)):
        tokenList = tokens.iloc[i].split(',')
        if featureHash is None:
            for t in tokenList:
                if t not in token2IdHash.keys():
                    token2IdHash[t] = len(token2IdHash)
        else:
            for t in tokenList:
                if t not in token2IdHash.keys() and t in list(featureHash.keys()):
                    token2IdHash[t] = len(token2IdHash)
            
    return token2IdHash

In [38]:
token2IdHashInit = tokens_to_ids(trainQ['Tokens'].append(answersC['Tokens']), None)

In [39]:
print("Total number of unique tokens in the TrainQ and Answers: " + str(len(token2IdHashInit)))

Total number of unique tokens in the TrainQ and Answers: 5133


## Create Count Matrix for Each Token in Each Question/Answer

In [40]:
def count_matrix(frame, token2IdHash, uniqueAnswerId):
    # create an empty matrix with the shape of:
    # num_row = num of unique tokens
    # num_column = num of unique answerIds (N_wA) or num of questions
    # rowIdx = token2IdHash.values()
    # colIdx = index of uniqueAnswerId (N_wA) or index of questions
    num_row = len(token2IdHash)
    if uniqueAnswerId is not None:  # get N_wA
        num_column = len(uniqueAnswerId)
    else:
        num_column = len(frame)
    # initialize with zeros so that the += 1 updates below produce true counts
    countMatrix = np.zeros(shape=(num_row, num_column))

    # loop through each question in the frame to fill in the countMatrix with corresponding counts
    for i in range(len(frame)):
        tokens = frame['Tokens'].iloc[i].split(',')
        if uniqueAnswerId is not None:   # get N_wA
            answerId = frame['AnswerId'].iloc[i]
            colIdx = uniqueAnswerId.index(answerId)
        else:  
            colIdx = i
            
        for t in tokens:
            if t in token2IdHash.keys():
                rowIdx = token2IdHash[t]
                countMatrix[rowIdx, colIdx] += 1

    return countMatrix

In [41]:
# calculate the count matrix of all training questions and answers.
N_wQA = count_matrix(trainQ.append(answersC), token2IdHashInit, uniqueAnswerId=None)
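
To make the layout of the count matrix concrete, the following plain-Python sketch builds one for two toy documents. The data and names here are hypothetical. It only mirrors the convention described in the comments of `count_matrix` (rows are tokens, columns are documents) and is not the notebook's implementation.

```python
# Toy inputs: two documents and a token-to-ID mapping (both made up).
docs = ["python,list,sort", "sort,list,list"]
token_to_id = {"python": 0, "list": 1, "sort": 2}

# Rows are tokens, columns are documents; every cell starts at zero.
counts = [[0] * len(docs) for _ in token_to_id]

for col, doc in enumerate(docs):
    for token in doc.split(","):
        if token in token_to_id:  # tokens outside the vocabulary are skipped
            counts[token_to_id[token]][col] += 1

for token, row_idx in token_to_id.items():
    print(token, counts[row_idx])
```

Starting from zeros matters here because every cell is only ever incremented.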

## Compute IDF Vector

Considering all tokens observed in the training questions and answers, we compute their __Inverse Document Frequency__ using the formula below.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/idf.PNG?token=APoO9qf3WptQgUPVRQJuOt4cobf56-Y3ks5YnhLSwA%3D%3D">
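
As a small worked example, the sketch below computes IDF for a toy count matrix in plain Python. It assumes the natural-log form log(N / D_w) used by the notebook's `get_idf`, where N is the number of documents and D_w is the number of documents containing token w. The toy counts and variable names are made up for illustration.

```python
import math

# Toy count matrix: rows are tokens, columns are 4 documents.
counts = [
    [1, 0, 0, 1],  # token appears in 2 of 4 documents
    [1, 2, 0, 1],  # token appears in 3 of 4 documents
    [1, 1, 1, 1],  # token appears in every document
]
n_docs = len(counts[0])

idf = []
for row in counts:
    doc_freq = sum(1 for c in row if c > 0)  # D_w: documents containing the token
    idf.append(math.log(n_docs / doc_freq))

print(idf)
```

A token that appears in every document gets an IDF of zero, so it contributes nothing to the similarity score.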

In [42]:
def get_idf(N_wQA):
    # N_V is the number of tokens in the vocabulary (rows)
    # N is the total number of documents in the corpus (columns)
    N_V, N = N_wQA.shape
    # D[w] is the number of documents in which token w appears
    D = (N_wQA != 0).sum(axis=1)
    return np.log(N/D)

In [43]:
idf = get_idf(N_wQA)

## Calculate Normalized TF of Each Word w in Test and Answers Sets

Each document d is typically represented by a feature vector x that describes the contents of d. Because documents can have different lengths, it is useful to L1-normalize x so that its entries sum to 1. A normalized __Term Frequency__ matrix can then be obtained using the formula below.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/tf.PNG?token=APoO9tMyEVzqoUJYT9ALcdF3_BryHHEVks5YnIQywA%3D%3D">
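
The normalization can be sketched in plain Python on a toy count matrix, as shown below. The numbers and names are hypothetical. The third document is deliberately empty to show why the column sum is replaced by 1 before dividing, which is the same guard the notebook's `normalize_tf` applies.

```python
# Toy count matrix: rows are tokens, columns are 3 documents.
# The third document contains no vocabulary tokens at all.
counts = [
    [1, 0, 0],
    [1, 2, 0],
    [1, 1, 0],
]
n_docs = len(counts[0])

# Total token count per document (column sums).
col_sums = [sum(row[d] for row in counts) for d in range(n_docs)]

# Replace zero sums by 1 so that empty documents do not divide by zero.
safe_sums = [s if s > 0 else 1 for s in col_sums]

# L1-normalized term frequency: each count divided by its document total.
tf = [[row[d] / safe_sums[d] for d in range(n_docs)] for row in counts]

for row in tf:
    print(row)
```

Each non-empty column of `tf` sums to 1, while the empty document stays all zeros.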

In [44]:
def normalize_tf(frame, token2IdHash):
    N_w = count_matrix(frame, token2IdHash, uniqueAnswerId=None)
    # get the column sum of the count matrix
    N_W = np.sum(N_w, axis=0)
    
    # find the indices where N_W is zero
    zeroIdx = np.where(N_W == 0)[0]
    
    # if N_W is zero, then x_w for that particular question/answer is zero.
    # to keep the calculation simple, we set N_W to 1 in those cases so the denominator is not zero.
    if len(zeroIdx) > 0:
        N_W[zeroIdx] = 1
    
    # x_w = P_wd = count(w)/sum(count(i in V))
    x_w = N_w / N_W
    
    return x_w

In [45]:
# calculate tf matrix of test question and answers independently.
x_wTest = normalize_tf(testQ, token2IdHashInit)
x_wAns = normalize_tf(answersC, token2IdHashInit)

## Compute TF-IDF Matrix

Given the __Term Frequency (TF)__ matrix and the __Inverse Document Frequency (IDF)__ vector, we compute the __TF-IDF__ matrix by multiplying each token's row of the TF matrix by that token's IDF value.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/tfidf.PNG?token=APoO9gw3rPhLusbG3if65TuVZNAnyqTCks5YnhWPwA%3D%3D">
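
The row-wise scaling can be illustrated with a toy example in plain Python. The TF values, IDF values, and names below are made up; the only assumption carried over from the notebook is that rows correspond to tokens and columns to documents.

```python
import math

# Toy normalized TF matrix: rows are tokens, columns are 2 documents.
tf = [
    [0.50, 0.0],
    [0.25, 0.5],
    [0.25, 0.5],
]
# Toy IDF vector with one value per token (row).
idf = [math.log(4), math.log(2), 0.0]

# Scale every row of the TF matrix by the IDF of its token.
tfidf = [[idf[w] * value for value in row] for w, row in enumerate(tf)]

for row in tfidf:
    print(row)
```

The last token has an IDF of zero, so its whole row vanishes regardless of how often it occurs.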

In [46]:
tfidfTest = (x_wTest.T * idf).T
tfidfAns = (x_wAns.T * idf).T

## Calculate Cosine Similarity between Test Set and Answers

For each question in the Test set, we compute its __Cosine Similarity__ against all answers in the Answers set. We use this score to measure how similar a question and an answer are.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/cosine.PNG?token=APoO9lOcKkP_7tFDh7p6KZXXVmokwLbGks5YnhY-wA%3D%3D">
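
For intuition, here is a minimal plain-Python sketch of cosine similarity between one toy question vector and three toy answer vectors. All values and names are hypothetical. Returning 0.0 for an all-zero vector is an assumption made for this sketch and is not something the notebook's own function does.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption for this sketch: an all-zero vector matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)

question = [1.0, 2.0, 0.0]
answers = [
    [2.0, 4.0, 0.0],  # same direction as the question
    [0.0, 0.0, 3.0],  # shares no tokens with the question
    [1.0, 0.0, 0.0],  # shares one token with the question
]

scores = [cosine(question, a) for a in answers]
print(scores)
```

Because the score depends only on direction, the first answer scores about 1 even though its vector is twice as long as the question's.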

In [47]:
def cosine_similarity(tfidfL, tfidfR):
    # calculate the dot product of every column pair of the two tfidf arrays
    N = np.dot(tfidfL.T, tfidfR)
    # calculate the norm of each column of each tfidf array
    normL = LA.norm(tfidfL, axis=0)
    normR = LA.norm(tfidfR, axis=0)
    # divide row i by normL[i] and column j by normR[j]
    similarity = (N.T/normL).T/normR
    
    return similarity

In [48]:
# calculate similarity scores of each question in Test set against all answers.
simScores = cosine_similarity(tfidfTest, tfidfAns)

## Rank the Cosine Similarity and Calculate Average Rank 

We use two evaluation metrics to test model performance. For each question in the test set, we calculate a __Cosine Similarity__ score against each answer. We then rank the answers by their __Cosine Similarity__ scores and calculate the __Average Rank__ and __Top 10 Percentage__ on the Test set using the formulas below:

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/evaluation.PNG?token=APoO9hyYDFxGc9FRbmIXU3VGv0wdeCaPks5YnIVtwA%3D%3D">

The __Average Rank__ is the position at which, on average, the correct answer appears among all available answers for a given question.

The __Top 10 Percentage__ is the share of new questions whose correct answer appears among the first 10 ranked answers.
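
Both metrics can be sketched on toy data in plain Python, as below. The scores, answer IDs, and correct answers are made up, and the cutoff is set to 2 rather than 10 only because the toy example has just three answers.

```python
# Toy similarity scores: one row per question, one column per answer.
scores = [
    [0.9, 0.1, 0.4],
    [0.2, 0.3, 0.8],
    [0.5, 0.6, 0.1],
]
answer_ids = [101, 102, 103]  # hypothetical IDs, in column order
correct = [101, 102, 103]     # hypothetical correct answer per question

ranks = []
for row, truth in zip(scores, correct):
    # Column indices sorted by descending similarity score.
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranked_ids = [answer_ids[j] for j in order]
    ranks.append(ranked_ids.index(truth) + 1)  # ranks start at 1

average_rank = sum(ranks) / len(ranks)

top_k = 2  # the notebook uses 10
top_k_share = sum(1 for r in ranks if r <= top_k) / len(ranks)

print(ranks, average_rank, top_k_share)
```

A lower average rank and a higher top-k share both indicate better matching.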

In [49]:
# sort the similarity scores in descending order and map them to the corresponding AnswerId in Answer set
def rank(frame, scores, uniqueAnswerId):
    frame['SortedAnswers'] = list(np.array(uniqueAnswerId)[np.argsort(-scores, axis=1)])
    
    rankList = []
    for i in range(len(frame)):
        rankList.append(np.where(frame['SortedAnswers'].iloc[i] == frame['AnswerId'].iloc[i])[0][0] + 1)
    frame['Rank'] = rankList
    
    return frame

In [50]:
# get the unique answerIds in answersC row order, which is the column order of simScores.
uniqueAnswerId = answersC.index.values
# calculate the rank of each question in Test set.
testQ = rank(testQ, simScores, uniqueAnswerId)

In [51]:
# average of rank
print('Average of rank: ' + str(np.floor(testQ['Rank'].mean())))
print('Total number of questions in test set: ' + str(len(testQ)))
print('Total number of answers: ' + str(len(uniqueAnswerId)))
print('Total number of unique features: ' + str(len(token2IdHashInit)))
print('Percentage of questions find answers in top 10: ' + str(round(len(testQ.query('Rank <= 10'))/len(testQ), 3)))

Average of rank: 244.0
Total number of questions in test set: 3468
Total number of answers: 1201
Total number of unique features: 5133
Percentage of questions find answers in top 10: 0.202
