**Q5: Compare the performance of smaller distilled multilingual models with that of their largest counterparts**

**Q6: Evaluate different distance functions used to measure semantic similarity in a practical setting**

## Similarity Methods (Traditional)

1. Jaccard Similarity - the number of unique words shared by two sentences, divided by the total number of unique words that appear in either sentence.
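
As a compact way to state the definition above, the formula can be written as follows. The symbols $A$ and $B$ are notation introduced here for illustration: they stand for the sets of unique words in the first and second sentence.

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

The score ranges from 0 (no shared words) to 1 (identical word sets).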
   

In [19]:
first_sentence = "I am Ashish, I love playing football. I wanted to be a professional football player."
second_sentence = "I am Ahmed, I wish I could play football. My dream was to be a professional football player."

In [6]:
# Split the sentences into lists of words
first_sentence = first_sentence.split()
second_sentence = second_sentence.split()

In [14]:
# Convert our lists of words into sets (to remove duplicates)
first_sentence = set(first_sentence)
second_sentence = set(second_sentence)

print(first_sentence)
print('----' * 34)
print(second_sentence)

{'football.', 'a', 'to', 'Ashish,', 'player.', 'love', 'wanted', 'be', 'professional', 'playing', 'football', 'I', 'am'}
----------------------------------------------------------------------------------------------------------------------------------------
{'football.', 'to', 'My', 'could', 'am', 'play', 'a', 'player.', 'dream', 'be', 'professional', 'Ahmed,', 'was', 'football', 'I', 'wish'}


In [9]:
# Calculate the shared words between the two sentences
shared_words = first_sentence.intersection(second_sentence)
print(shared_words)
print('----' * 20)
print(len(shared_words))

{'football.', 'a', 'to', 'player.', 'be', 'professional', 'football', 'I', 'am'}
--------------------------------------------------------------------------------
9


In [11]:
# Count the total number of unique words in both sentences
total_words = first_sentence.union(second_sentence)
print(total_words)
print('----' * 20)
print(len(total_words))

{'football.', 'was', 'to', 'Ashish,', 'love', 'My', 'could', 'am', 'play', 'a', 'player.', 'dream', 'wanted', 'be', 'professional', 'Ahmed,', 'playing', 'football', 'I', 'wish'}
--------------------------------------------------------------------------------
20


In [16]:
jaccard_similarity = len(shared_words) / len(total_words)
print(jaccard_similarity)

0.45


Challenges:

Two sentences that share nothing but common words like 'the', 'a', 'how', etc. can still receive a high similarity score despite having unrelated meanings.

Solution:

We can use stop-word removal, stemming or lemmatization (so words like 'travelling' and 'travels' can match), and other preprocessing techniques.

However, this will not work well for languages like Chinese, Japanese, Korean etc., where splitting on whitespace does not yield meaningful word units. There are methods that avoid these problems altogether.
Also, for our use case we need to keep stop words for lexical search.
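
The preprocessing idea above can be sketched roughly as follows. This is a minimal illustration, not the preprocessing pipeline used for this project: `normalize`, `jaccard`, and the tiny `STOP_WORDS` set are hypothetical names and values introduced here, and stop-word removal is off by default because the use case keeps stop words.

```python
import string

# Hypothetical, deliberately tiny stop-word list (illustration only)
STOP_WORDS = {"i", "am", "a", "to", "be", "the", "was", "my"}


def normalize(sentence, remove_stop_words=False):
    """Lowercase tokens, strip surrounding punctuation, optionally drop stop words."""
    tokens = (tok.strip(string.punctuation).lower() for tok in sentence.split())
    unique = {tok for tok in tokens if tok}
    if remove_stop_words:
        unique -= STOP_WORDS
    return unique


def jaccard(x, y):
    """Jaccard similarity of two sets; defined as 0.0 when both sets are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


first = "I am Ashish, I love playing football. I wanted to be a professional football player."
second = "I am Ahmed, I wish I could play football. My dream was to be a professional football player."

# With normalization, 'football.' and 'football' collapse into one token
print(jaccard(normalize(first), normalize(second)))
# Dropping stop words removes matches that carry little meaning
print(jaccard(normalize(first, True), normalize(second, True)))
```

Stemming or lemmatization would slot into `normalize` in the same way, but they need a language-specific library, so they are left out of this sketch.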

## w-Shingling

w-shingling applies the same Jaccard calculation, but to sets of n-grams (sequences of n consecutive words) instead of individual words.
Let's see an example using bigrams (n=2), i.e. 2-shingling.

In [20]:
# Split the sentences into lists of words
first_sentence = first_sentence.split()
second_sentence = second_sentence.split()

In [21]:
# Build the set of bigrams: each pair of adjacent words joined by a space
first_sentence_shingle = {' '.join(first_sentence[i:i+2]) for i in range(len(first_sentence) - 1)}
second_sentence_shingle = {' '.join(second_sentence[i:i+2]) for i in range(len(second_sentence) - 1)}

print(first_sentence_shingle)
print('----' * 34)  
print(second_sentence_shingle)

{'Ashish, I', 'I am', 'I love', 'am Ashish,', 'be a', 'football. I', 'a professional', 'love playing', 'professional football', 'to be', 'wanted to', 'I wanted', 'playing football.', 'football player.'}
----------------------------------------------------------------------------------------------------------------------------------------
{'I wish', 'I could', 'I am', 'was to', 'could play', 'My dream', 'be a', 'professional football', 'a professional', 'dream was', 'football. My', 'play football.', 'to be', 'Ahmed, I', 'wish I', 'football player.', 'am Ahmed,'}


In [22]:
second_sentence_shingle.intersection(first_sentence_shingle)

{'I am',
 'a professional',
 'be a',
 'football player.',
 'professional football',
 'to be'}

In [23]:
def jac(x: set, y: set):
    shared = x.intersection(y)  # selects shared tokens only
    return len(shared) / len(x.union(y))  # union adds both sets together

In [24]:
jac(first_sentence_shingle, second_sentence_shingle)

0.24
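
The bigram construction above generalizes to any shingle width. The sketch below is one possible way to write it; `shingles` and its `w` parameter are hypothetical names introduced here, and `jaccard` simply repeats the set-based calculation from this section so that the block runs on its own.

```python
def shingles(tokens, w=2):
    """Return the set of w-shingles: every run of w consecutive tokens, joined by a space."""
    if w < 1:
        raise ValueError("w must be at least 1")
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(x, y):
    """Jaccard similarity of two sets; defined as 0.0 when both sets are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


a = "I am Ashish, I love playing football. I wanted to be a professional football player.".split()
b = "I am Ahmed, I wish I could play football. My dream was to be a professional football player.".split()

# w=1 is plain word-level Jaccard; larger w rewards shared word order
for w in (1, 2, 3):
    print(w, round(jaccard(shingles(a, w), shingles(b, w)), 2))
```

Larger values of `w` make the measure stricter, because two sentences must share longer runs of identical words to score well.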

## Levenshtein Distance


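The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. We will use a Wagner-Fischer matrix to calculate it. Let's write a function that performs this operation for two given strings.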
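
For reference, the standard recurrence that fills a Wagner-Fischer matrix can be written as below. The notation is introduced here for illustration: $D_{i,j}$ is the edit distance between the first $i$ characters of string $a$ and the first $j$ characters of string $b$, and $[a_i \neq b_j]$ is 1 when the two characters differ and 0 when they match.

$$
D_{i,j} =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0 \\
\min\left\{
D_{i-1,j} + 1,\;
D_{i,j-1} + 1,\;
D_{i-1,j-1} + [a_i \neq b_j]
\right\} & \text{otherwise}
\end{cases}
$$

The three terms correspond to a deletion, an insertion, and a substitution (or a free match), and the bottom-right entry of the matrix is the distance between the full strings.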

In [25]:
import numpy as np

def leven(a, b):
    # prepend a placeholder character so that row 0 and column 0
    # represent the empty prefix of each string
    a = f' {a}'
    b = f' {b}'
    # initialize empty zero array
    lev = np.zeros((len(a), len(b)))
    # now iterate through each cell, finding the cheapest path
    for i in range(len(a)):
        for j in range(len(b)):
            if min(i, j) == 0:
                # reaching or leaving an empty prefix costs max(i, j) edits
                lev[i, j] = max(i, j)
            else:
                # each of the three possible operations costs 1,
                # except a substitution between matching characters
                deletion = lev[i-1, j] + 1
                insertion = lev[i, j-1] + 1
                substitution = lev[i-1, j-1] + (a[i] != b[j])
                # take the minimum (i.e. the best path/operation)
                lev[i, j] = min(deletion, insertion, substitution)
    return lev, lev[-1, -1]

In [26]:
leven('Levenshtein', 'Levinsten')

# Here we need 3 operations to get from 'Levenshtein' to 'Levinsten'. The bottom-right value of the matrix is the edit distance (the number of operations needed to get from one string to the other).

(array([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.],
        [ 1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.],
        [ 2.,  1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
        [ 3.,  2.,  1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.],
        [ 4.,  3.,  2.,  1.,  1.,  2.,  3.,  4.,  4.,  5.],
        [ 5.,  4.,  3.,  2.,  2.,  1.,  2.,  3.,  4.,  4.],
        [ 6.,  5.,  4.,  3.,  3.,  2.,  1.,  2.,  3.,  4.],
        [ 7.,  6.,  5.,  4.,  4.,  3.,  2.,  2.,  3.,  4.],
        [ 8.,  7.,  6.,  5.,  5.,  4.,  3.,  2.,  3.,  4.],
        [ 9.,  8.,  7.,  6.,  6.,  5.,  4.,  3.,  2.,  3.],
        [10.,  9.,  8.,  7.,  6.,  6.,  5.,  4.,  3.,  3.],
        [11., 10.,  9.,  8.,  7.,  6.,  6.,  5.,  4.,  3.]]),
 3.0)
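
When only the final distance is needed, the full matrix does not have to be stored. The sketch below keeps just two rows of the table and also converts the distance into a score between 0 and 1. Both `levenshtein` and `levenshtein_similarity` are hypothetical helpers introduced here, and dividing by the longer string's length is one common normalization, assumed for illustration rather than taken from this notebook.

```python
def levenshtein(a, b):
    """Edit distance between a and b, keeping only two rows of the table in memory."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein('Levenshtein', 'Levinsten'))
print(round(levenshtein_similarity('Levenshtein', 'Levinsten'), 3))
```

A normalized score like this is easier to compare against the Jaccard values above, since both then live on the same 0 to 1 scale.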

## TF-IDF

TF-IDF is one of the best-known methods for text-focused search.

To calculate the TF-IDF for a given word (the query) and a sentence (the document), we calculate the **T**erm **F**requency (**TF**), and the **I**nverse **D**ocument **F**requency (**IDF**).
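
Written out, the two components and their product look like this. The notation is an assumption introduced here: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $|d|$ is the number of tokens in $d$, $N$ is the number of documents, and $n_t$ is the number of documents that contain $t$. The base-10 logarithm is chosen to match the `np.log10` call in this notebook's `tfidf` function; other variants of TF-IDF use a different base or add smoothing.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

A term that occurs in every document has $n_t = N$, so its IDF, and therefore its TF-IDF score, is 0.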


In [54]:
import numpy as np

# we'll merge all docs into a list of lists for easier calculations below
docs = [first_sentence, second_sentence]

def tfidf(word, sentence):
    # term frequency
    tf = sentence.count(word) / len(sentence)
    # inverse document frequency
    idf = np.log10(len(docs) / sum([1 for doc in docs if word in doc]))
    return round(tf*idf, 4)

In [55]:
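# Let's calculate the score for each sentence against the word 'football'.
# 'football' appears in both documents, so its IDF is log10(2/2) = 0 and both scores come out as 0.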

first = tfidf('football', first_sentence)
print(f'First sentence score: {first}')

second = tfidf('football', second_sentence)
print(f'Second sentence score: {second}')

First sentence score: 0.0
Second sentence score: 0.0


In [56]:
# TF-IDF vectors are slightly different: we compute TF-IDF scores for every word in our document vocabulary (all words in all documents), producing one document-specific TF-IDF vector per document.

vocab = set(first_sentence + second_sentence)

In [57]:
# initialize one vector per document
vec_a = []
vec_b = []

for word in vocab:
    vec_a.append(tfidf(word, first_sentence))
    vec_b.append(tfidf(word, second_sentence))

print(vec_a)

[0.0, 0.0, 0.0201, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0201, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0201, 0.0, 0.0, 0.0, 0.0201, 0.0]


In [58]:
print(vec_b)

[0.0, 0.0, 0.0, 0.0167, 0.0167, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0167, 0.0167, 0.0, 0.0167, 0.0, 0.0167, 0.0167, 0.0, 0.0, 0.0]
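
Once each document is a vector, a common way to compare two of them is cosine similarity. The sketch below is a plain-Python illustration; `cosine_similarity` is a hypothetical helper and the toy vectors are made-up values shaped like the TF-IDF vectors above, not the actual `vec_a` and `vec_b`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy vectors (illustration only)
toy_a = [0.0, 0.0201, 0.0, 0.0201]
toy_b = [0.0167, 0.0, 0.0167, 0.0]

print(cosine_similarity(toy_a, toy_b))  # no position is non-zero in both vectors
print(cosine_similarity(toy_a, toy_a))  # a vector compared with itself
```

Note that with only two documents every shared word gets an IDF of log10(2/2) = 0, so `vec_a` and `vec_b` above have no overlapping non-zero entries and their cosine similarity would be 0. The measure becomes informative once the corpus contains more documents.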


In [5]:
it_data = {"name": "Vacuum", "price": 130.675}

# doubled braces {{ }} print literal braces around the formatted price
print(f"{it_data['name']}: {{{it_data['price']:.2f}}}")

Vacuum: {130.68}


In [9]:
x = float(input("Enter today's temperature in Celsius: "))
if x < 10:
    print(f"You entered {x:.1f}. It's cold outside!")
elif x < 20:  # also covers x == 10, which the original range check sent to the 'hot' branch
    print(f"You entered {x:.1f}. It's cool!")
else:
    print(f"You entered {x:.1f}. It's hot!")

You entered 15.8. It's cool!
