## Text Similarity

What is text similarity?  

Text similarity measures how ‘close’ two pieces of text are, both in surface form (lexical similarity) and in meaning (semantic similarity).  
For instance, how similar are the phrases “the cat ate the mouse” and “the mouse ate the cat food” if we look only at the words?
On the surface, considering only word-level similarity, these two phrases appear very similar: leaving aside “the”, 3 of the 4 unique words (cat, ate, mouse, food) overlap exactly. A word-level comparison like this typically does not take into account the meaning behind the words or the phrase as a whole.
Rather than comparing word for word, we also need to pay attention to context in order to capture more of the semantics. To measure semantic similarity we need to work at the phrase or paragraph level (or the lexical chain level), where a piece of text is broken into groups of related words before similarity is computed. Although the words overlap significantly, the two phrases actually have different meanings.  
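
As a rough illustration of the word-level comparison above, the following sketch computes the overlap of unique words between the two phrases. The helper name `word_overlap` and the one-word stop list are assumptions made for this example; they do not come from any library.

```python
def word_overlap(text_a, text_b, stop_words=frozenset()):
    """Return shared words, all words, and their ratio (hypothetical helper)."""
    a = set(text_a.lower().split()) - stop_words
    b = set(text_b.lower().split()) - stop_words
    shared = a & b
    total = a | b
    ratio = len(shared) / len(total) if total else 0.0
    return shared, total, ratio

phrase_1 = "the cat ate the mouse"
phrase_2 = "the mouse ate the cat food"

# Drop "the" so that only the content words are compared.
shared, total, ratio = word_overlap(phrase_1, phrase_2, stop_words=frozenset({"the"}))
print(sorted(shared), sorted(total), ratio)
# ['ate', 'cat', 'mouse'] ['ate', 'cat', 'food', 'mouse'] 0.75
```

The score is high even though the two phrases describe different events, which is the limitation of a purely lexical measure.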

Every sentence has a dependency structure:  

“mouse” is the object of “ate” in the first phrase, and “food” is the object of “ate” in the second.
Since differences in word order often go hand in hand with differences in meaning (compare “the dog bites the man” with “the man bites the dog”), we'd like our sentence embeddings to be sensitive to this variation.
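
To see why word order matters, here is a minimal sketch showing that a bag-of-words comparison cannot tell the two example sentences apart, while a comparison over adjacent word pairs (bigrams) can. The function names are hypothetical, and bigram overlap is used only as a simple stand-in for an order-sensitive representation; it is not a sentence embedding.

```python
from collections import Counter

def bag_of_words(text):
    """Word counts, ignoring word order."""
    return Counter(text.split())

def bigram_jaccard(text_a, text_b):
    """Jaccard similarity over the sets of adjacent word pairs."""
    words_a, words_b = text_a.split(), text_b.split()
    pairs_a = set(zip(words_a, words_a[1:]))
    pairs_b = set(zip(words_b, words_b[1:]))
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0

s1 = "the dog bites the man"
s2 = "the man bites the dog"

print(bag_of_words(s1) == bag_of_words(s2))  # True: same words, same counts
print(bigram_jaccard(s1, s2))                # 0.6: 3 shared bigrams out of 5
```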

Fortunately, word vectors have evolved over the years to the point where they can tell the difference between “record the play” and “play the record”.

In [15]:
str1 = "I am a man. She is a woman."
str2 = """Case C-40/08

Asturcom Telecomunicaciones SL

v

Cristina Rodríguez Nogueira

(Reference for a preliminary ruling from the Juzgado de Primera Instancia nº 4 de Bilbao)

(Directive 93/13/EEC – Consumer contracts – Unfair arbitration clause – Measure void – Arbitration award which has become final – Enforcement – Whether the national court responsible for enforcement can consider of its own motion whether the unfair arbitration clause is null and void – Principles of equivalence and effectiveness)
"""

In [3]:
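def get_jaccard_sim(str1, str2):
    """Jaccard similarity between the sets of whitespace-separated tokens."""
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    # size of intersection / size of union, where union = |a| + |b| - |c|
    return float(len(c)) / (len(a) + len(b) - len(c))

print(get_jaccard_sim(str1, str2))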

0.03125


In [17]:
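import textdistance

tokens_1 = str1.split()
tokens_2 = str2.split()
# Repeated tokens are counted here (2/80), unlike the set-based version above (2/64).
textdistance.jaccard(tokens_1, tokens_2)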


0.025

In [18]:
tokens_3 = "hello new world".split()
textdistance.jaccard(tokens_1 , tokens_3)

0.0
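
The two Jaccard results above differ (0.03125 versus 0.025) even though both compare the same strings. The numbers are consistent with `get_jaccard_sim` working on sets of unique tokens and `textdistance.jaccard` counting repeated tokens. The sketch below reproduces both variants in plain Python on a small example; `set_jaccard` and `multiset_jaccard` are illustrative names, not functions from textdistance.

```python
from collections import Counter

def set_jaccard(tokens_a, tokens_b):
    """Jaccard over unique tokens: duplicates are ignored."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def multiset_jaccard(tokens_a, tokens_b):
    """Jaccard over token counts: duplicates contribute to both sizes."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    union = sum((a | b).values())         # per-token maximum count
    intersection = sum((a & b).values())  # per-token minimum count
    return intersection / union if union else 1.0

tokens_a = "a a b".split()
tokens_b = "a b c".split()

print(set_jaccard(tokens_a, tokens_b))       # 2 shared of 3 unique tokens
print(multiset_jaccard(tokens_a, tokens_b))  # 0.5: 2 shared of 4 counted tokens
```

Which variant is appropriate depends on whether repeated words should carry extra weight in the comparison.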

In [19]:
#https://github.com/life4/textdistance
import textdistance

textdistance.hamming('test', 'text')
# 1

textdistance.hamming.distance('test', 'text')
# 1

textdistance.hamming.similarity('test', 'text')
# 3

textdistance.hamming.normalized_distance('test', 'text')
# 0.25

textdistance.hamming.normalized_similarity('test', 'text')
# 0.75

textdistance.Hamming(qval=2).distance('test', 'text')
# 2

2
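
For intuition about the numbers above, here is a plain-Python sketch of Hamming distance and its normalized forms. It is an illustration rather than textdistance's implementation: the function names are hypothetical, and treating `qval=2` as a comparison over overlapping two-character chunks is an assumption that happens to match the result shown above.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel so that extra tail positions never compare equal

def hamming_distance(seq_a, seq_b):
    """Count positions where the two sequences differ."""
    pairs = zip_longest(seq_a, seq_b, fillvalue=_MISSING)
    return sum(x is _MISSING or y is _MISSING or x != y for x, y in pairs)

def normalized_distance(seq_a, seq_b):
    """Distance divided by the length of the longer sequence."""
    longest = max(len(seq_a), len(seq_b))
    return hamming_distance(seq_a, seq_b) / longest if longest else 0.0

def char_ngrams(text, n):
    """Overlapping character chunks of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(hamming_distance("test", "text"))         # 1
print(normalized_distance("test", "text"))      # 0.25
print(1 - normalized_distance("test", "text"))  # 0.75
# Bigrams: ['te', 'es', 'st'] vs ['te', 'ex', 'xt'] differ in two positions.
print(hamming_distance(char_ngrams("test", 2), char_ngrams("text", 2)))  # 2
```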