# Lexical similarity options

This notebook shows the lexical similarity options that can be used to compare two words.

The following options are available:
- `Jaccard Similarity`
- `Cosine Similarity`
- `Levenshtein Distance`


In [8]:
# Example text data
human_word = "Head_and_Neck_Part"
mouse_word = "head/neck"

In [9]:
import re

def normalize_string(s):
    # Convert to lowercase
    s = s.lower()
    # Replace every character that is not a letter, digit, or whitespace with a space
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
    # Remove extra spaces
    s = re.sub(r'\s+', ' ', s).strip()
    return s
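
As a quick check of what the normalization produces, the snippet below runs it on the two example words and on one extra input. The function is repeated so the snippet runs on its own, and the third input string is an illustrative assumption, not data from this notebook.

```python
import re

def normalize_string(s):
    # Same steps as the cell above: lowercase, replace symbols with spaces, collapse whitespace
    s = s.lower()
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

# The last entry is a made-up input to show repeated symbols and padding
for raw in ["Head_and_Neck_Part", "head/neck", "  Head--Neck  "]:
    print(repr(raw), "->", repr(normalize_string(raw)))

# 'Head_and_Neck_Part' -> 'head and neck part'
# 'head/neck' -> 'head neck'
# '  Head--Neck  ' -> 'head neck'
```

Underscores and slashes both become token boundaries, which is what lets the set-based comparison below work on whole tokens.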

## Jaccard Similarity

In [10]:
def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union != 0 else 0.0

In [11]:
# Normalize both words first so they are split into tokens in the same format
set_human = set(normalize_string(human_word).split())
set_mouse = set(normalize_string(mouse_word).split())


print("Human word: ", set_human)
print("Mouse word: ", set_mouse)
print("Jaccard similarity between human and mouse words: ", jaccard_similarity(set_human, set_mouse))

Human word:  {'head', 'neck', 'part', 'and'}
Mouse word:  {'head', 'neck'}
Jaccard similarity between human and mouse words:  0.5
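
To see where the 0.5 comes from, the score can be broken down into its two parts. This is a minimal sketch that hard-codes the token sets printed above; the variable names `shared` and `combined` are only illustrative.

```python
# Token sets as printed above (after normalization and splitting)
set_human = {"head", "and", "neck", "part"}
set_mouse = {"head", "neck"}

shared = set_human & set_mouse      # tokens present in both words
combined = set_human | set_mouse    # tokens present in either word
score = len(shared) / len(combined)

print(sorted(shared))    # ['head', 'neck']
print(sorted(combined))  # ['and', 'head', 'neck', 'part']
print(score)             # 0.5
```

Two of the four distinct tokens are shared, so the extra tokens `and` and `part` in the human word are what pull the score down to 2 / 4.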


## Cosine Similarity

In [13]:
from simpletransformers.language_representation import RepresentationModel


In [16]:
import numpy as np

In [17]:
def cosine_similarity(v1, v2):
    # Dot product divided by the product of the vector norms (undefined if either norm is zero)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

In [18]:
model = RepresentationModel(
    model_type="bert",
    model_name="bert-base-uncased",
    use_cuda=False
)

word_vectors = model.encode_sentences([human_word, mouse_word], combine_strategy="mean")

In [19]:
cosine_similarity(word_vectors[0], word_vectors[1])

0.30870065

In [20]:
word_vectors_2 = model.encode_sentences(["same words", "same words"], combine_strategy="mean")
cosine_similarity(word_vectors_2[0], word_vectors_2[1])

0.9999999

This is one way I found to compute the embeddings and compare them; however, I don't know whether it is the best way to do it. Note that the raw strings are encoded here, without applying `normalize_string` first.
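
For a purely lexical comparison, one possible alternative is cosine similarity over token counts instead of BERT embeddings. The sketch below is an assumption and not the method used in the cells above; `token_cosine` is a hypothetical helper name, and the inputs are the two example words written in their already-normalized form.

```python
import math
from collections import Counter

def token_cosine(text_a, text_b):
    """Cosine similarity between whitespace-token count vectors (hypothetical helper)."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    # Only tokens present in both texts contribute to the dot product
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an empty text as having no similarity
    return dot / (norm_a * norm_b)

score = token_cosine("head and neck part", "head neck")
print(round(score, 4))  # 0.7071
```

This variant needs no model download and depends only on shared tokens, so it will not capture synonyms the way an embedding-based comparison might.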

## Levenshtein Distance

In [6]:
import Levenshtein

In [14]:
human_normalized = normalize_string(human_word)
mouse_normalized = normalize_string(mouse_word)

distance = Levenshtein.distance(human_normalized, mouse_normalized)

print("Levenshtein distance between human and mouse words: ", distance)

Levenshtein distance between human and mouse words:  9
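
The raw distance grows with string length, so it is not directly comparable with the two similarity scores above. One common way to map it into the range 0 to 1 is to divide by the length of the longer string. The sketch below uses a plain-Python edit distance so it runs without the `Levenshtein` package; `edit_distance` and `normalized_similarity` are hypothetical helper names, and this normalization is one choice among several.

```python
def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

def normalized_similarity(a, b):
    """1.0 for identical strings, 0.0 when every position must change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - edit_distance(a, b) / longest

print(edit_distance("head and neck part", "head neck"))          # 9
print(normalized_similarity("head and neck part", "head neck"))  # 0.5
```

Here the distance of 9 over a longest length of 18 gives 0.5.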
