# Document (QnA) Matching Data Science Process

## Part 2: TF-IDF and Cosine Similarity - Match Q to Similar A

### Overview

__Part 2__ of the series shows the process of matching Questions to Answers based on the _Cosine Similarity_ of their _Term Frequency-Inverse Document Frequency (TF-IDF)_ matrix.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/experiment1_overview.PNG?token=APoO9gLiCcPVowvW0OsAuuXD9coI-HEsks5Ynhp0wA%3D%3D">

Note: This notebook series is built under Python 3.5 and NLTK 3.2.2.

## Import required Python modules

In [1]:
import pandas as pd
import math
import numpy as np
from numpy import linalg as LA
from azure.storage import CloudStorageAccount
from IPython.display import display

# suppress all warnings
import warnings
warnings.filterwarnings("ignore")

## Read trainQ and testQ into DataFrames

In [2]:
trainQ_url = 'https://mezsa.blob.core.windows.net/stackoverflow/trainQwithTokens.tsv'
testQ_url = 'https://mezsa.blob.core.windows.net/stackoverflow/testQwithTokens.tsv'
answersC_url = 'https://mezsa.blob.core.windows.net/stackoverflow/answersCwithTokens.tsv'

trainQ = pd.read_csv(trainQ_url, sep='\t', index_col='Id', encoding='latin1')
testQ = pd.read_csv(testQ_url, sep='\t', index_col='Id', encoding='latin1')
answersC = pd.read_csv(answersC_url, sep='\t', index_col='Id', encoding='latin1')

## Create Tokens to IDs Hash

For each token in the entire vocabulary, we assign it a unique ID.

In [3]:
# get Token to ID mapping: {Token: tokenId}
def tokens_to_ids(tokens, featureHash):
    # tokens: a Series of comma-separated token strings
    # featureHash: optional dict; when given, only tokens that are keys of it get an ID
    token2IdHash = {}
    for i in range(len(tokens)):
        tokenList = tokens.iloc[i].split(',')
        for t in tokenList:
            if t in token2IdHash:
                continue
            # dict membership is a hash lookup, so no list conversion is needed
            if featureHash is None or t in featureHash:
                token2IdHash[t] = len(token2IdHash)

    return token2IdHash

In [4]:
# pd.concat stacks the two Series; Series.append is not available in recent pandas releases
token2IdHashInit = tokens_to_ids(pd.concat([trainQ['Tokens'], answersC['Tokens']]), None)

In [5]:
print("Total number of unique tokens in the TrainQ and Answers: " + str(len(token2IdHashInit)))

Total number of unique tokens in the TrainQ and Answers: 5266
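
To make the mapping concrete, here is a minimal sketch on toy data. The series `toy_tokens` and the dictionary `token_to_id` are hypothetical names that are not part of the notebook; the sketch only assumes that tokens are stored as comma-separated strings, as in the `Tokens` column.

```python
import pandas as pd

# Hypothetical toy data: each entry is a comma-separated token string,
# mirroring the layout of the 'Tokens' column.
toy_tokens = pd.Series(["date,subtract,date", "closure,callback,date"])

token_to_id = {}
for token_string in toy_tokens:
    for token in token_string.split(","):
        # setdefault stores the next free ID only the first time a token is seen
        token_to_id.setdefault(token, len(token_to_id))

print(token_to_id)
```

IDs are handed out in order of first appearance, so a repeated token such as `date` keeps the ID it received the first time.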


## Create Count Matrix for Each Token in Each Question/Answer

In [6]:
def count_matrix(frame, token2IdHash, uniqueAnswerId):
    # create a zero-filled matrix with the shape of:
    # num_row = num of unique tokens
    # num_column = num of unique answerIds (N_wA) or num of questions
    # rowIdx = token2IdHash.values()
    # colIdx = index of uniqueAnswerId (N_wA) or index of questions
    num_row = len(token2IdHash)
    if uniqueAnswerId is not None:  # get N_wA
        num_column = len(uniqueAnswerId)
    else:
        num_column = len(frame)
    # np.zeros (not np.empty) so the += 1 updates below start from 0 instead of uninitialized memory
    countMatrix = np.zeros(shape=(num_row, num_column))

    # loop through each question in the frame to fill in the countMatrix with corresponding counts
    for i in range(len(frame)):
        tokens = frame['Tokens'].iloc[i].split(',')
        if uniqueAnswerId is not None:   # get N_wA
            answerId = frame['AnswerId'].iloc[i]
            colIdx = uniqueAnswerId.index(answerId)
        else:  
            colIdx = i
            
        for t in tokens:
            if t in token2IdHash.keys():
                rowIdx = token2IdHash[t]
                countMatrix[rowIdx, colIdx] += 1

    return countMatrix

In [7]:
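# calculate the count matrix of all training questions and answers.
# pd.concat stacks the two frames; DataFrame.append is not available in recent pandas releases
N_wQA = count_matrix(pd.concat([trainQ, answersC]), token2IdHashInit, uniqueAnswerId=None)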
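
The layout of the count matrix (one row per token, one column per document) can be illustrated on a tiny corpus. This is a sketch under assumptions: `token_to_id`, `docs`, and `counts` are hypothetical names, and the two documents are made up for illustration.

```python
import numpy as np

# Hypothetical vocabulary and documents (comma-separated tokens)
token_to_id = {"date": 0, "subtract": 1, "closure": 2}
docs = ["date,subtract,date", "closure,date,unknown"]

# Start from zeros so that every increment counts up from a defined value
counts = np.zeros((len(token_to_id), len(docs)))
for col, doc in enumerate(docs):
    for token in doc.split(","):
        row = token_to_id.get(token)
        if row is not None:  # skip tokens that are not in the vocabulary
            counts[row, col] += 1

print(counts)
```

Each column sums to the number of in-vocabulary tokens in that document, which is the quantity used later to normalize term frequencies.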

## Compute IDF Vector

Considering all tokens observed in the training questions and answers, we compute their Inverse Document Frequency (IDF) using the formula below.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/idf.PNG?token=APoO9qf3WptQgUPVRQJuOt4cobf56-Y3ks5YnhLSwA%3D%3D">

In [8]:
def get_idf(N_wQA):
    # N is total number of documents in the corpus
    # N_V is the number of tokens in the vocabulary
    N_V, N = N_wQA.shape
    # D[w] is the number of documents where the token w appears,
    # i.e. the number of nonzero entries in row w of the count matrix
    D = np.count_nonzero(N_wQA, axis=1)
    return np.log(N/D)

In [9]:
idf = get_idf(N_wQA)
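
As a small worked example, the sketch below computes IDF for a toy count matrix. It assumes the formula in the figure matches what `get_idf` computes, namely `log(N / D_w)`, where `N` is the number of documents and `D_w` is the number of documents containing token `w`. The names `counts`, `doc_freq`, and `idf_toy` are hypothetical.

```python
import numpy as np

# Hypothetical count matrix: 3 tokens (rows) x 3 documents (columns)
counts = np.array([[2, 1, 0],
                   [1, 0, 0],
                   [1, 1, 1]], dtype=float)

n_docs = counts.shape[1]
# Number of documents in which each token occurs at least once
doc_freq = np.count_nonzero(counts, axis=1)
idf_toy = np.log(n_docs / doc_freq)

print(doc_freq)
print(idf_toy)
```

A token that occurs in every document (the last row) gets an IDF of 0, so it contributes nothing to the similarity score, while the rarest token gets the largest weight.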

## Calculate Normalized TF of Each Word w in Test and Answers Sets

Each document d is typically represented by a feature vector x that represents the contents of d. Because different documents can have different lengths, it can be useful to apply L1 normalization to the feature vector x. A normalized Term Frequency matrix can therefore be obtained using the formula below.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/tf.PNG?token=APoO9tMyEVzqoUJYT9ALcdF3_BryHHEVks5YnIQywA%3D%3D">

In [13]:
def normalize_tf(frame, token2IdHash):
    N_w = count_matrix(frame, token2IdHash, uniqueAnswerId=None)
    # get the column sum of the count matrix
    N_W = np.sum(N_w, axis=0)
    
    # find the index where N_W is zero
    zeroIdx = np.where(N_W == 0)[0]

    # if N_W is zero, then the x_w for that particular question/answer would be zero.
    # for a simple calculation, we convert the N_W to 1 in those cases so the denominator is not zero.
    if len(zeroIdx) > 0:
        N_W[zeroIdx] = 1
    
    # x_w = P_wd = count(w)/sum(count(i in V))
    x_w = N_w / N_W
    
    return x_w

In [14]:
# calculate tf matrix of test question and answers independently.
x_wTest = normalize_tf(testQ, token2IdHashInit)
x_wAns = normalize_tf(answersC, token2IdHashInit)
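
The normalization step, including the guard for documents with no in-vocabulary tokens, can be sketched on a toy matrix. The names `counts`, `safe_sums`, and `tf_toy` are hypothetical, and the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical counts: 3 tokens (rows) x 3 documents (columns);
# the middle document contains no in-vocabulary tokens.
counts = np.array([[2, 0, 0],
                   [1, 0, 3],
                   [1, 0, 1]], dtype=float)

col_sums = counts.sum(axis=0)
# Replace zero sums by 1 so the division is defined; those columns stay all zero
safe_sums = np.where(col_sums == 0, 1.0, col_sums)
tf_toy = counts / safe_sums

print(tf_toy)
print(tf_toy.sum(axis=0))
```

Every non-empty column now sums to 1 (L1 normalization), and the empty column remains a zero vector rather than producing a division-by-zero warning.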

## Compute TF-IDF Matrix

Given the Term Frequency (TF) matrix and the Inverse Document Frequency (IDF) vector, we compute the TF-IDF matrix by multiplying each token's row of the TF matrix by that token's IDF.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/tfidf.PNG?token=APoO9gw3rPhLusbG3if65TuVZNAnyqTCks5YnhWPwA%3D%3D">

In [15]:
tfidfTest = (x_wTest.T * idf).T
tfidfAns = (x_wAns.T * idf).T
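
The double transpose above relies on NumPy broadcasting to scale each token row by its IDF. The sketch below, using hypothetical toy arrays `tf_toy` and `idf_toy`, checks that it gives the same result as broadcasting the IDF vector as a column.

```python
import numpy as np

# Hypothetical TF matrix (2 tokens x 2 documents) and IDF vector (one value per token)
tf_toy = np.array([[0.5, 0.0],
                   [0.5, 1.0]])
idf_toy = np.array([np.log(2.0), 0.0])

# Form used in the notebook: transpose, scale the token columns, transpose back
via_transpose = (tf_toy.T * idf_toy).T
# Equivalent form: reshape idf to a column so it broadcasts across documents
via_broadcast = tf_toy * idf_toy[:, np.newaxis]

print(np.allclose(via_transpose, via_broadcast))
```

Either form works; the column-broadcast version avoids the two transposes and may be easier to read.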

## Calculate Cosine Similarity between Test Set and Answers

For each question in the Test set, we compute its Cosine Similarity against all answers in the Answers set. We use this score to measure how similar a question and an answer are.

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/cosine.PNG?token=APoO9lOcKkP_7tFDh7p6KZXXVmokwLbGks5YnhY-wA%3D%3D">

In [16]:
def consine_similarity(tfidfL, tfidfR):
    # calculate the dot product of two tfidf arrays
    N = np.dot(tfidfL.T, tfidfR)
    # calculate the norm of each tfidf array
    normL = LA.norm(tfidfL, axis = 0)
    normR = LA.norm(tfidfR, axis = 0)
    similarity = (N.T/normL).T/normR
    
    return similarity

In [17]:
# calculate similarity scores of each question in Test set against all answers. 
simScores = consine_similarity(tfidfTest, tfidfAns)
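
One edge case worth keeping in mind is a document whose TF-IDF vector is all zeros (for example, a question with no in-vocabulary tokens): its norm is 0 and the division yields `nan`. The following is a minimal sketch of one way to guard against that, not the notebook's method; `cosine_similarity_safe` is a hypothetical helper, and the toy matrices are made up.

```python
import numpy as np

def cosine_similarity_safe(left, right):
    """Hypothetical helper: columns are documents, rows are tokens."""
    dots = left.T @ right
    norm_left = np.linalg.norm(left, axis=0)
    norm_right = np.linalg.norm(right, axis=0)
    # denom[i, j] = |left[:, i]| * |right[:, j]|
    denom = np.outer(norm_left, norm_right)
    # Leave the score at 0 wherever either vector has zero length
    out = np.zeros_like(dots, dtype=float)
    return np.divide(dots, denom, out=out, where=denom > 0)

left = np.array([[1.0, 0.0],
                 [0.0, 0.0]])   # second column is an all-zero document
right = np.array([[2.0, 0.0],
                  [0.0, 3.0]])
scores = cosine_similarity_safe(left, right)
print(scores)
```

Treating a zero vector as having similarity 0 to everything is a design choice; it keeps the later ranking step free of `nan` values.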

## Rank the Cosine Similarity and Calculate Average Rank

We use two evaluation metrics to test our model performance: Average Rank and Top 10 Percentage. For each question in the test set, we calculate a Cosine Similarity score against each answer, rank the answers by that score, and then compute both metrics over the Test set using the formula below:

<img src="https://raw.githubusercontent.com/Azure/Document_Matching/master/pic/evaluation.PNG?token=APoO9hyYDFxGc9FRbmIXU3VGv0wdeCaPks5YnIVtwA%3D%3D">

In [18]:
# sort the similarity scores in descending order and map them to the corresponding AnswerId in Answer set
def rank(frame, scores, uniqueAnswerId):
    frame['SortedAnswers'] = list(np.array(uniqueAnswerId)[np.argsort(-scores, axis=1)])
    
    rankList = []
    for i in range(len(frame)):
        rankList.append(np.where(frame['SortedAnswers'].iloc[i] == frame['AnswerId'].iloc[i])[0][0] + 1)
    frame['Rank'] = rankList
    
    return frame

In [20]:
# get unique answerId in ascending order.
uniqueAnswerId = answersC.index.values
# calculate the rank of each question in Test set.
testQ = rank(testQ, simScores, uniqueAnswerId)

In [21]:
# average of rank
print('Average of rank: ' + str(np.floor(testQ['Rank'].mean())))
print('Total number of questions in test set: ' + str(len(testQ)))
print('Total number of answers: ' + str(len(uniqueAnswerId)))
print('Total number of unique features: ' + str(len(token2IdHashInit)))
print('Percentage of questions find answers in top 10: ' + str(round(len(testQ.query('Rank <= 10'))/len(testQ), 3)))

Average of rank: 257.0
Total number of questions in test set: 3671
Total number of answers: 1275
Total number of unique features: 5266
Percentage of questions find answers in top 10: 0.199
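
To show how the two metrics are derived from the score matrix, here is a minimal sketch on toy data. It assumes, consistent with the code above, that Average Rank is the mean rank of the correct answer and that the top-k percentage is the share of questions whose correct answer ranks within the first k. All names and numbers below are hypothetical, and `top_k` is set to 1 only because the toy example has three answers.

```python
import numpy as np

# Hypothetical data: 3 candidate answers, 2 test questions
answer_ids = np.array([10, 20, 30])
scores = np.array([[0.1, 0.9, 0.3],
                   [0.8, 0.2, 0.5]])
true_answers = np.array([30, 10])

# Sort answer IDs by descending similarity for each question
sorted_ids = answer_ids[np.argsort(-scores, axis=1)]
# 1-based position of the correct answer in each sorted row
ranks = np.argmax(sorted_ids == true_answers[:, np.newaxis], axis=1) + 1

average_rank = ranks.mean()
top_k = 1
top_k_share = np.mean(ranks <= top_k)

print(ranks, average_rank, top_k_share)
```

A lower average rank and a higher top-k share both indicate that correct answers tend to appear near the top of the sorted list.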


In [22]:
testQ.query('Rank <= 3')

| Id | AnswerId | Text0 | CreationDate | Text | NumChars | TextWithPhrases | Tokens | SortedAnswers | Rank |
|---|---|---|---|---|---|---|---|---|---|
| 15762825 | 3777.0 | Call a c# function from JavaScript in asp.net.... | 2013-04-02 11:21:29.383 | call a c# function from javascript in asp.net.... | 380.0 | call a c# function from javascript in asp.net ... | c#,asp.net,named,inside,c#,want,need,operation... | [476445, 994406, 3777, 206932, 19391416, 74753... | 3 |
| 32363998 | 27943.0 | Function to calculate geospatial distance betw... | 2015-09-02 22:11:35.147 | function to calculate geospatial distance betw... | 625.0 | function to calculate geospatial distance betw... | calculate,distance,two,points,lat,long,using,p... | [27943, 938195, 1726662, 4960020, 9316216, 134... | 1 |
| 21757141 | 41960.0 | Date substract in javascript. \<p>How do I subt... | 2014-02-13 14:34:20.863 | date substract in javascript. how do i subtrac... | 222.0 | date substract in javascript how do i subtract... | date,subtract,dates,string,date,subtract,date,... | [41960, 1016908, 4365317, 1056730, 7556634, 32... | 1 |
| 14004216 | 52597.0 | Placeholder while an image is loading with Emb... | 2012-12-22 15:56:10.010 | placeholder while an image is loading with emb... | 503.0 | placeholder while an image is loading with emb... | placeholder,image,loading,possible_duplicate,s... | [52597, 934925, 1288555, 7225820, 6150397, 328... | 1 |
| 21849065 | 63506.0 | How to take screenshot of remote website by ur... | 2014-02-18 09:19:34.343 | how to take screenshot of remote website by ur... | 349.0 | how to take screenshot of remote website by ur... | screenshot,remote,website,url,wondering,screen... | [6678156, 63506, 20035319, 3059129, 3528331, 1... | 2 |
| 27087687 | 105074.0 | Generate unique product id. \<p>I have a page t... | 2014-11-23 09:57:12.213 | generate unique product id. i have a page that... | 252.0 | generate unique product id i have a page that ... | generate,unique,product,id,page,receive,site,a... | [11114634, 105074, 2100767, 16120977, 6680842,... | 2 |
| 34564112 | 109091.0 | jQuery break setInterval operation. \<p>i try t... | 2016-01-02 08:42:42.683 | jquery break setinterval operation. i try to b... | 195.0 | jquery break setinterval operation i try to br... | jquery,break,setinterval,operation,try,break,s... | [7421030, 109091, 11600783, 10313023, 10050496... | 2 |
| 34825965 | 109091.0 | run an ajax call while waiting for the first t... | 2016-01-16 10:37:30.117 | run an ajax call while waiting for the first t... | 1016.0 | run an ajax call while waiting for the first t... | run,ajax,waiting,first,complete,ajax,long,comp... | [4402359, 109091, 8563735, 10899796, 10050826,... | 2 |
| 33176982 | 111111.0 | Nodejs closures and res object in a callback f... | 2015-10-16 18:11:55.243 | nodejs closures and res object in a callback f... | 643.0 | nodejs closures and res object in a callback_f... | nodejs,closures,res,object,callback_function,w... | [111111, 11192941, 13418980, 1225683, 4841288,... | 1 |
| 33981274 | 111111.0 | Understanding Closures. \<p>I was going through... | 2015-11-29 09:28:26.687 | understanding closures. i was going through an... | 309.0 | understanding closures i was going through an ... | understanding,closures,advanced,text,came,code... | [3924998, 111111, 4091780, 3059129, 11192941, ... | 2 |
