# Lexical Similarity

Lexical text similarity measures how similar two pieces of text are at the surface level. For example, how similar are the phrases “the cat ate the mouse” and “the mouse ate the cat food” when we look only at the words? Lexical measures typically do not take into account the meaning of the words or of the entire phrase in context.

In [3]:
### Import Libraries
import Levenshtein as lev
from nltk.stem import WordNetLemmatizer
import re

### Text Preprocessing

In [4]:
# WordNetLemmatizer needs the WordNet corpus: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def clean_text(doc):
    # Lowercase, replace hyphens with spaces, then lemmatize each word
    text = doc.lower()
    text = re.sub("-", " ", text)
    words = text.split(" ")
    words = [lemmatizer.lemmatize(x) for x in words]
    return " ".join(words)

## 1. Jaccard Similarity

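DEFINITION:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where J is the Jaccard similarity, A is set 1 (the unique words of the first text), and B is set 2 (the unique words of the second text).

USE CASE:

Jaccard similarity shows how much two texts have in common: the share of their unique words that appears in both.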



In [6]:
def Jaccard_Similarity(doc1, doc2): 
    
    # List the unique words in a document
    words_doc1 = set(doc1.split(" ")) 
    words_doc2 = set(doc2.split(" "))
    
    # Find the intersection of words list of doc1 & doc2
    intersection = words_doc1.intersection(words_doc2)

    # Find the union of words list of doc1 & doc2
    union = words_doc1.union(words_doc2)
        
    # Calculate Jaccard similarity score 
    # using length of intersection set divided by length of union set
    jac_similarity = float(len(intersection)) / len(union)
    return (jac_similarity)

In [8]:
Jaccard_Similarity("Communication skill","Communication")

0.5
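
As a quick check on the two phrases from the introduction, the sketch below recomputes the score with plain Python sets and also derives the Jaccard distance as 1 minus the similarity. The helper name `jaccard_tokens` is made up for this illustration, and treating two empty texts as identical is an assumption.

```python
def jaccard_tokens(doc1, doc2):
    """Jaccard similarity over lowercased, whitespace-separated words."""
    a = set(doc1.lower().split())
    b = set(doc2.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty texts count as identical
    return len(a & b) / len(a | b)

s1 = "the cat ate the mouse"
s2 = "the mouse ate the cat food"

similarity = jaccard_tokens(s1, s2)
distance = 1 - similarity  # Jaccard distance is the complement of the similarity

print(similarity)          # 0.8 (4 shared words out of 5 distinct words)
print(round(distance, 2))  # 0.2
```

The score is high even though the two phrases mean different things, which is the surface-level behaviour described at the top of this notebook.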

## 2. Levenshtein Similarity

The Levenshtein distance is also called the minimum edit distance.

In [12]:
def Lev_Similarity(doc1, doc2):     
    words_doc1 = clean_text(doc1)
    words_doc2 = clean_text(doc2)
    Distance = lev.distance(words_doc1, words_doc2)
    Ratio = lev.ratio(words_doc1,words_doc2)       
    return (Distance,Ratio)

In [14]:
Lev_Similarity("Color","Colour")

(1, 0.9090909090909091)

##### Score calculation behind Levenshtein Distance and Levenshtein Ratio:
Calculation of Levenshtein Distance: 

It is the minimum number of single-letter edits required to transform one string into another. Each edit is one of the following:
 1. Addition of a new letter
 2. Removal of a letter
 3. Replacement of a letter
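
The three edit operations above can be sketched as a small dynamic-programming routine. This is a minimal pure-Python illustration, not the implementation used by the `Levenshtein` package; the function name `edit_distance` is hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit cost for each addition, removal, or replacement."""
    # prev[j] is the distance between the prefix of s handled so far and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i removals
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # removal of cs
                curr[j - 1] + 1,           # addition of ct
                prev[j - 1] + (cs != ct),  # replacement (free when letters match)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("color", "colour"))    # 1
print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept in memory at a time, so the space used grows with the length of `t` rather than with the product of both lengths.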
 
 
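Calculation of Levenshtein Ratio: 

The ratio is (combined length of both strings - edit cost) / (combined length of both strings), where the edit cost is counted as follows:

1. Replacement of a letter costs 2 points, e.g. "ab" vs "ad": (4-2)/4 = 0.5
2. Deletion of a letter costs 1 point, e.g. "ab" vs "a": (3-1)/3 = 0.666
3. Insertion of a letter costs 1 point, e.g. "ab" vs "abc": (5-1)/5 = 0.8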
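
To make the cost rule concrete, here is a minimal pure-Python sketch that computes the weighted edit cost (replacement 2, insertion and deletion 1) and turns it into a ratio over the combined length of both strings. The names `weighted_cost` and `lev_ratio_sketch` are hypothetical, and returning 1.0 for two empty strings is an assumption; the sketch is meant to reproduce the arithmetic, not the library's code.

```python
def weighted_cost(s, t):
    """Edit cost where insertion/deletion cost 1 and replacement costs 2."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            replace = 0 if cs == ct else 2
            curr.append(min(prev[j] + 1,             # deletion
                            curr[j - 1] + 1,         # insertion
                            prev[j - 1] + replace))  # replacement or match
        prev = curr
    return prev[-1]

def lev_ratio_sketch(s, t):
    """(combined length - weighted cost) / combined length."""
    total = len(s) + len(t)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (total - weighted_cost(s, t)) / total

print(lev_ratio_sketch("ab", "ad"))             # 0.5
print(round(lev_ratio_sketch("ab", "a"), 3))    # 0.667
print(lev_ratio_sketch("ab", "abc"))            # 0.8
print(round(lev_ratio_sketch("color", "colour"), 4))  # 0.9091
```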

In [None]:
##Replacement
Lev_Similarity("ab","ad")        ### change of 1 letter, i.e. b -> d

In [None]:
##Deletion
Lev_Similarity("ab","a")        ### removal of 1 letter, i.e. b

In [17]:
##Insertion
Lev_Similarity("ab","abc")      ### addition of 1 letter, i.e. c

(1, 0.8)

Where can lexical similarity be used?
1. Clustering – If you want to group similar texts together, how can you tell whether two texts are similar in the first place?
2. Redundancy removal – If two pieces of text are nearly identical, why keep both? You can eliminate the redundant one. Think of duplicate product listings, the same person stored in your database with slight variations in the name, or HTML pages that are near duplicates.
3. Information Retrieval – How do you rank documents that are similar to a query? You could start with something as simple as cosine similarity. While there are more established document retrieval measures like BM25, language models and PL2, you could also use a measure like cosine once you have a vector representation of your query and documents. You can even use Jaccard for information retrieval tasks, but it is not very effective because Jaccard ignores term frequencies completely.
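
The point about term frequencies can be illustrated with a short sketch that builds raw word-count vectors and compares them with cosine similarity. The function name `cosine_similarity` and the use of plain counts (rather than TF-IDF or another weighting) are assumptions made for this example.

```python
import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    tf1 = Counter(doc1.lower().split())
    tf2 = Counter(doc2.lower().split())
    # Dot product over the words the two texts share
    dot = sum(tf1[w] * tf2[w] for w in tf1.keys() & tf2.keys())
    norm1 = math.sqrt(sum(c * c for c in tf1.values()))
    norm2 = math.sqrt(sum(c * c for c in tf2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty text matches nothing
    return dot / (norm1 * norm2)

query = "cat food"
doc = "cat cat cat food"

# Both texts have the same set of unique words, so Jaccard would score 1.0,
# while cosine reacts to "cat" appearing three times in doc.
print(set(query.split()) == set(doc.split()))    # True
print(round(cosine_similarity(query, doc), 3))   # 0.894
```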

## 3. Fuzzy Similarity

1. Ratio: This function calculates the Levenshtein similarity ratio between two strings (sequences).
2. Partial_Ratio: This function performs substring matching. It takes the shorter string and compares it with every substring of the longer string that has the same length.
>USE CASE: This function is useful when matching names. For example, when one sequence is someone’s first and middle name and the sequence you are trying to match is that person’s first, middle, and last name.
3. Token_Sort_Ratio: This function comes in handy when the strings you are comparing contain the same words but in a different order.
>USE CASE: This function is useful when the words of a name are jumbled. For example, 
Name1 = 'Gunner William Kline' 
Name2 = 'Kline, Gunner William'
4. Token_Set_Ratio: This function is the most helpful when the strings being compared differ significantly in length.
>USE CASE: For example, 
Str1='The 3000 meter steeplechase winner, Soufiane El Bakkali' 
Str2='Soufiane El Bakkali'
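
The ideas behind Partial_Ratio and Token_Sort_Ratio can be approximated with the standard library alone. The sketch below uses `difflib.SequenceMatcher` as the base scorer; the helper names are hypothetical, and this is a rough illustration of the two ideas rather than fuzzywuzzy's own code, so scores on other inputs may differ from the library's.

```python
import re
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity of two strings as an integer score from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def sorted_tokens(text):
    """Lowercase, drop punctuation, and sort the words alphabetically."""
    return " ".join(sorted(re.findall(r"\w+", text.lower())))

def token_sort_sketch(a, b):
    """Compare the strings after sorting their words, so word order is ignored."""
    return simple_ratio(sorted_tokens(a), sorted_tokens(b))

def partial_sketch(a, b):
    """Best score of the shorter string against same-length windows of the longer one."""
    short, long_ = sorted((a, b), key=len)
    windows = (long_[i:i + len(short)]
               for i in range(len(long_) - len(short) + 1))
    return max(simple_ratio(short, w) for w in windows)

Name1 = 'Gunner William Kline'
Name2 = 'Kline, Gunner William'
print(token_sort_sketch(Name1, Name2))  # 100: same words, different order

Str1 = 'The 3000 meter steeplechase winner, Soufiane El Bakkali'
Str2 = 'Soufiane El Bakkali'
print(partial_sketch(Str1, Str2))       # 100: Str2 is a substring of Str1
```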

In [4]:
from fuzzywuzzy import fuzz
Str1 = 'Chicago, Illinois' 
Str2 = 'Chicago'
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print("Ratio",Ratio)
print("Partial_Ratio",Partial_Ratio)
print("Token_Sort_Ratio",Token_Sort_Ratio)
print("Token_Set_Ratio",Token_Set_Ratio)

Ratio 58
Partial_Ratio 100
Token_Sort_Ratio 61
Token_Set_Ratio 100


### Notes:

1. In Jaccard Similarity, the number of common attributes is divided by the number of attributes that exist in at least one of the two objects.
2. In Levenshtein Similarity, the ratio is based on the edit cost: the combined length of the two strings minus the edit cost, divided by the combined length.