# TFIDF and Sentence Similarity

Problem: https://www.hackerrank.com/challenges/nlp-similarity-scores/problem

In [12]:
a = "I'd like an apple"
b = 'An apple a day keeps the doctor away'
c = 'Never compare an apple to an orange'
d = 'I prefer scikit-learn to orange'

## TF: Term Frequency:

TF measures how frequently a term occurs in a document. Because documents differ in length, a term is likely to appear more often in a long document than in a short one. The raw count is therefore often divided by the document length (the total number of terms in the document) to normalize it:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
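
As a rough illustration of the formula above, the sketch below computes TF over a tokenized document. The helper name `term_frequency` and the whitespace tokenization are assumptions made for this example; they are not part of the notebook's own code.

```python
def term_frequency(tokens, term):
    """Fraction of the tokens in one document that are equal to `term`."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


# Hypothetical example: one sentence, lowercased and split on whitespace.
doc = "never compare an apple to an orange".split()

tf_an = term_frequency(doc, "an")        # "an" is 2 of the 7 tokens
tf_apple = term_frequency(doc, "apple")  # "apple" is 1 of the 7 tokens
print(tf_an, tf_apple)
```

Counting over a token list rather than a raw string keeps "an" from also matching inside "orange".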



In [111]:
# Collect the four sentences as lowercased documents.
k = [s.lower() for s in (a, b, c, d)]

In [24]:
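# Concatenate all sentences (original casing) into one string,
# then build a lowercased token list from it.
elements = a + ' ' + b + ' ' + c + ' ' + d
e = [i.lower() for i in elements.split(' ')]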

In [30]:
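def tf(lis,elem):
    # Occurrences of elem in lis, divided by the length of lis.
    # Pass a list of tokens so that the length is the number of terms.
    no = lis.count(elem)
    return no/len(lis)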

## IDF: Inverse Document Frequency:

IDF measures how important a term is. When computing TF, all terms are treated as equally important. However, certain terms such as "is", "of", and "that" may appear many times yet carry little importance. We therefore weigh down frequent terms and scale up rare ones by computing:

IDF(t) = log(Total number of documents / Number of documents with term t in it).
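
A minimal sketch of this formula follows. It assumes the natural logarithm (the formula does not state a base), whole-token matching, and a hypothetical helper name `inverse_document_frequency`.

```python
import math


def inverse_document_frequency(docs, term):
    """log(N / df), where df is the number of documents containing `term`.

    `docs` is a list of token lists. The natural log is an assumption here.
    """
    df = sum(1 for tokens in docs if term in tokens)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(docs) / df)


# The four example sentences, lowercased and split on whitespace.
sentences = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
    "I prefer scikit-learn to orange",
]
docs = [s.lower().split() for s in sentences]

idf_apple = inverse_document_frequency(docs, "apple")    # in 3 of 4 documents
idf_doctor = inverse_document_frequency(docs, "doctor")  # in 1 of 4 documents
print(idf_apple, idf_doctor)
```

A term that appears in fewer documents gets a larger IDF, so "doctor" outweighs "apple" here.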



In [200]:
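import math
def idf(k,e,elem):
    # k: lowercased documents (strings). e is unused but kept so existing calls still work.
    # `elem in i` is a substring test on each document string, not a whole-word match.
    count = 0
    n = len(k)
    for i in k:
        if elem in i:
            count+=1
    # Smoothed variant of the formula above: 1 + log10(N / df).
    return 1 + math.log10(n/count)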

In [201]:
import pandas as pd
df = pd.DataFrame()

In [202]:
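TF=[]
IDF = []
e = list(set(e))  # unique terms; set ordering is arbitrary
for i in e:
    # elements is the raw string, so this is a case-sensitive substring count
    # divided by the number of characters, not by the number of terms.
    TF.append(tf(elements,i))
    IDF.append(idf(k,e,i))
T = pd.Series(TF)
W = pd.Series(e)
I = pd.Series(IDF)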

In [203]:
df['Words'] = W.values
df['TF'] = T.values
df['IDF'] = I.values

In [204]:
df['TFIDF'] = df['TF'] * df['IDF']

In [205]:
df

| | Words | TF | IDF | TFIDF |
|---|---|---|---|---|
| 0 | apple | 0.02459 | 1.124939 | 0.027662 |
| 1 | never | 0.0 | 1.60206 | 0.0 |
| 2 | day | 0.008197 | 1.60206 | 0.013132 |
| 3 | keeps | 0.008197 | 1.60206 | 0.013132 |
| 4 | like | 0.008197 | 1.60206 | 0.013132 |
| 5 | away | 0.008197 | 1.60206 | 0.013132 |
| 6 | compare | 0.008197 | 1.60206 | 0.013132 |
| 7 | an | 0.040984 | 1.0 | 0.040984 |
| 8 | prefer | 0.008197 | 1.60206 | 0.013132 |
| 9 | scikit-learn | 0.008197 | 1.60206 | 0.013132 |


In [161]:
tfidf = pd.Series(df.TFIDF.values,index=df.Words).to_dict()

## Convert the TFIDF values into a matrix

In [217]:
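# One row per document: the TF-IDF weight of each vocabulary term
# that occurs in the document as a whole word, and 0 otherwise.
mat = []
for i in k:
    wor = i.split(' ')
    el=[]
    for j in e:
        if j in wor:
            el.append(tfidf[j])
        else:
            el.append(0)
    mat.append(el)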

In [231]:
(e)

['apple',
 'never',
 'day',
 'keeps',
 'like',
 'away',
 'compare',
 'an',
 'prefer',
 'scikit-learn',
 'doctor',
 "i'd",
 'orange',
 'a',
 'the',
 'i',
 'to']

In [228]:
mat[0]

[0.003072264014958195,
 0,
 0,
 0,
 0.00493491796170461,
 0,
 0,
 0.0,
 0,
 0,
 0,
 0.0,
 0,
 0,
 0,
 0,
 0]

## Cosine Similarity:

The cosine similarity between two vectors (or two documents in the vector space) is the cosine of the angle between them. It measures orientation rather than magnitude: documents are compared on a normalized space, so what matters is not the size of each document's TF-IDF weights but the angle between the document vectors.
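
To make the definition concrete, here is a small pure-Python sketch of cosine similarity. The function name `cosine_similarity` and the toy vectors are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])  # close to 1
orthogonal = cosine_similarity([1, 0], [0, 1])            # exactly 0: no shared components
print(same_direction, orthogonal)
```

Scaling a vector does not change the result, which is why a long document and a short one with the same term proportions count as similar.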



In [241]:
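import scipy.spatial.distance as distance
import numpy as np
# Note: a, b and c now shadow the original sentence strings.
a = np.array(mat[0])
b = np.array(mat[0])
c = distance.cosine(a, b)  # cosine distance = 1 - cosine similarity
print(c)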

0.0


In [245]:
k[0],k[0]

("i'd like an apple", "i'd like an apple")

**Identical documents are overlapping vectors and have a cosine distance of 0**

In [243]:
a = np.array(mat[0])
b = np.array(mat[3])
c = distance.cosine(a, b)
print(c)

1.0


In [244]:
k[0],k[3]

("i'd like an apple", 'i prefer scikit-learn to orange')

**Documents with no terms in common are orthogonal vectors and have a cosine distance of 1, as far apart as these vectors can be**

In [246]:
a = np.array(mat[0])
b = np.array(mat[2])
c = 1-distance.cosine(a, b)
print(c)

0.19750828933028974


### Similarity = 1 - Cosine distance

In [249]:
k[0],k[2]

("i'd like an apple", 'never compare an apple to an orange')

## The first and third sentences are about 19.8% similar

In [248]:
a = np.array(mat[0])
b = np.array(mat[1])
c = 1-distance.cosine(a, b)
print(c)

0.14175292958921348


In [250]:
k[0],k[1]

("i'd like an apple", 'an apple a day keeps the doctor away')

## The first and second sentences are about 14% similar