# Text Similarity Measures

- Text similarity measures quantify how alike two pieces of text are (terms, strings, documents, etc.).
- Example applications:
    - Scoring how relevant a matched document is to a query
    - Computing semantic relatedness between strings or terms
- Various measures available:
    - Edit Distance/Levenshtein Distance
    - Jaccard Distance
    - Cosine Similarity
    - ...

## Edit Distance

- The edit distance (also known as Levenshtein distance) between two strings is the minimum number of single-character deletions, insertions, or substitutions required to transform one string into the other.
- The edit distance between "good" and "goodbye" is 3: the characters "b", "y", and "e" must be inserted.
- Useful in spell-checking applications.
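The definition above can be computed with dynamic programming over the prefixes of the two strings. The following is a minimal pure-Python sketch for illustration only; the function name `levenshtein` is chosen here, and this is not necessarily how `nltk.edit_distance` is implemented.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    # Before any character of a is read, that is j insertions.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein("good", "goodbye"))    # 3
print(levenshtein("mapping", "mappng"))  # 1
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.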

```python
# Calculate the edit distance between two terms
import nltk

w1 = 'mapping'
w2 = 'mappng'

nltk.edit_distance(w1, w2)
```

```
1
```

```python
# Calculate the edit distance between two strings
import nltk

s1 = 'It might help to re-install Python if possible.'
s2 = 'I possibly love Python programming.'

nltk.edit_distance(s1, s2)
```

```
32
```

```python
# Compare a misspelled word against a list of candidates using edit distance
import nltk

mistake = "ligting"

words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living', 'lighting', 'orange', 'walking', 'zoo']

for word in words:
    ed = nltk.edit_distance(mistake, word)
    print(word, ed)
```

```
apple 7
bag 6
drawing 4
listing 1
linking 2
living 2
lighting 1
orange 6
walking 4
zoo 7
```
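To go from the printed distances to an actual suggestion, the candidates with the smallest distance can be selected. The sketch below copies the distances from the output above into a dictionary, so that it runs without NLTK; the variable names `distances`, `best`, and `closest` are illustrative.

```python
# Edit distances to "ligting", copied from the output above.
distances = {
    'apple': 7, 'bag': 6, 'drawing': 4, 'listing': 1, 'linking': 2,
    'living': 2, 'lighting': 1, 'orange': 6, 'walking': 4, 'zoo': 7,
}

# Keep every word that reaches the minimum distance, since ties are possible.
best = min(distances.values())
closest = [word for word, ed in distances.items() if ed == best]
print(closest)  # ['listing', 'lighting']
```

Two words tie at distance 1 here, so edit distance alone does not decide between them; a spell checker would likely need an extra signal, such as word frequency, to rank the tied candidates.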


## Jaccard Distance

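- Jaccard distance measures how dissimilar two sets are (for example, sets of characters or tokens). The lower the distance, the stronger the similarity.
- It is defined as one minus the Jaccard similarity, where the Jaccard similarity is the size of the intersection divided by the size of the union of the two sets.
- Lemmatizing the text first tends to increase the size of the intersection, because different inflected forms of a word are reduced to the same token.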
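The relationship between Jaccard similarity and Jaccard distance can be sketched with plain Python sets. The function names below are illustrative, and treating two empty sets as identical is an assumption made here to avoid dividing by zero.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(union)


def jaccard_distance(a: set, b: set) -> float:
    """Dissimilarity: one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


w1 = set('mapping')  # {'m', 'a', 'p', 'i', 'n', 'g'}
w2 = set('mappng')   # {'m', 'a', 'p', 'n', 'g'}
print(jaccard_similarity(w1, w2))  # 5 shared characters out of 6 in the union
print(jaccard_distance(w1, w2))    # 1 - 5/6, up to floating-point rounding
```

Because sets discard duplicates and order, the repeated "p" in both words counts only once.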

```python
# Calculate the Jaccard distance between the character sets of two terms
import nltk

w1 = set('mapping')
w2 = set('mappng')

nltk.jaccard_distance(w1, w2)
```

```
0.16666666666666666
```

## Cosine Similarity

- Cosine similarity measures the cosine of the angle between two vectors; the smaller the angle, the closer the score is to 1.
- Sentences must therefore first be converted to vectors, for example with bag-of-words (BOW) or TF-IDF.
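To make the computation concrete, here is a minimal sketch that builds bag-of-words vectors by hand and applies the cosine formula (dot product divided by the product of the vector lengths). The helper names `bow_vector` and `cosine_similarity_bow` are illustrative, tokenization is a plain lowercase whitespace split, and the trailing periods are left out of the sentences to keep the tokens clean.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Bag-of-words vector: lowercase whitespace tokens mapped to counts."""
    return Counter(text.lower().split())


def cosine_similarity_bow(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    # Terms missing from v contribute 0 to the dot product.
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_u * norm_v)


d0 = bow_vector("AI is our friend and it has been friendly")
d1 = bow_vector("AI and humans have always been friendly")
print(cosine_similarity_bow(d0, d1))  # about 0.50
```

The score differs from the TF-IDF result in this section because raw counts weight every term equally, whereas TF-IDF down-weights terms that appear in many documents.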

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = (
    "AI is our friend and it has been friendly.",
    "AI and humans have always been friendly.",
)

# Convert the documents to TF-IDF vectors (one row per document)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
# print(tfidf_matrix.toarray())

# Compare document 0 against every document in the corpus
cs = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(cs)
```

```
[[1.         0.34082422]]
```


The result is an array of cosine similarities between Document 0 and every document in the corpus. The first element, 1, is the cosine similarity of Document 0 with itself. The second element, 0.3408, is the cosine similarity between Document 0 and Document 1.