# Exercise 15: Feature Engineering (Text Similarity)

Calculate (i) the Jaccard similarity and (ii) the cosine similarity for each of the following pairs of texts. (Note: for cosine similarity, use the TF-IDF representation.)

Pair 1: “What you do defines you” and “Your deeds define you” <br>
Pair 2: “Once upon a time there lived a king.” and “Who is your queen?” <br>
Pair 3: “He is desperate” and “Is he not desperate?” <br>

In [1]:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
lemmatizer = WordNetLemmatizer()

In [2]:
pair1 = ["What you do defines you","Your deeds define you"]
pair2 = ["Once upon a time there lived a king.", "Who is your queen?"]
pair3 = ["He is desperate", "Is he not desperate?"]

In [3]:
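def extract_text_similarity_jaccard(text1, text2):
    """Return the Jaccard similarity of the lowercased, lemmatized token sets of two texts."""
    # Tokenize, lowercase and lemmatize each text, keeping only the unique tokens
    words_text1 = {lemmatizer.lemmatize(word.lower()) for word in word_tokenize(text1)}
    words_text2 = {lemmatizer.lemmatize(word.lower()) for word in word_tokenize(text2)}
    nr = len(words_text1 & words_text2)  # numerator: size of the intersection
    dr = len(words_text1 | words_text2)  # denominator: size of the union
    # Two empty texts give an empty union; return 0.0 instead of dividing by zero
    return nr / dr if dr else 0.0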

In [4]:
extract_text_similarity_jaccard(pair1[0],pair1[1])

0.14285714285714285

In [5]:
extract_text_similarity_jaccard(pair2[0],pair2[1])

0.0

In [6]:
extract_text_similarity_jaccard(pair3[0],pair3[1])

0.6
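
The three scores above can be checked by hand. The sketch below is a minimal, NLTK-free illustration, assuming the lemmatizer reduces "deeds" to "deed" and leaves the other pair 1 tokens unchanged; `jaccard` and the two hard-coded token sets are hypothetical names introduced here, not part of the exercise.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Assumed token sets for pair 1 after lowercasing and lemmatization
tokens_1 = {"what", "you", "do", "defines"}
tokens_2 = {"your", "deed", "define", "you"}

# Only "you" is shared, and the union holds 7 distinct tokens
print(jaccard(tokens_1, tokens_2))  # 1/7, matching the pair 1 score above
```

`WordNetLemmatizer.lemmatize` treats every word as a noun unless a part of speech is passed, so "defines" and "define" appear to stay distinct here, which would explain why only one token overlaps.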

In [7]:
tfidf_model = TfidfVectorizer()

# Build a corpus holding the texts of pair1, pair2 and pair3, in that order
corpus = [pair1[0], pair1[1], pair2[0], pair2[1], pair3[0], pair3[1]]

# Keep the result as a sparse matrix: recent scikit-learn versions reject the
# np.matrix returned by .todense(), and each row slice below is already 2-D
tfidf_results = tfidf_model.fit_transform(corpus)
# Note: the rows of tfidf_results follow the corpus order, so
# tfidf_results[0], tfidf_results[1] represent pair1
# tfidf_results[2], tfidf_results[3] represent pair2
# tfidf_results[4], tfidf_results[5] represent pair3

In [8]:
# Cosine similarity between the texts of pair1
cosine_similarity(tfidf_results[0],tfidf_results[1])

array([[0.3082764]])

In [9]:
# Cosine similarity between the texts of pair2
cosine_similarity(tfidf_results[2],tfidf_results[3])

array([[0.]])
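
A score of exactly zero means the two texts share no term in the TF-IDF vocabulary. As a rough sketch, the snippet below mimics the default `TfidfVectorizer` tokenization (lowercasing, then keeping tokens of two or more word characters) with a plain regular expression. `default_tokens` is a hypothetical helper, and the real vectorizer goes on to apply IDF weighting and normalization.

```python
import re

def default_tokens(text):
    """Approximate TfidfVectorizer's default tokenization."""
    # The default token_pattern keeps runs of 2+ word characters, so "a" and "?" vanish
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

left = default_tokens("Once upon a time there lived a king.")
right = default_tokens("Who is your queen?")
print(set(left) & set(right))  # empty set: no shared term, so the dot product is 0
```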

In [10]:
# Cosine similarity between the texts of pair3
cosine_similarity(tfidf_results[4],tfidf_results[5])

array([[0.80368547]])
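
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula, not scikit-learn's implementation; `cosine` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))
    return dot / norm if norm else 0.0

print(cosine([1.0, 0.0], [0.0, 1.0]))  # no shared dimension: 0.0
print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction: approximately 1.0
```

Since TF-IDF weights are non-negative, the scores for these text pairs fall between 0 and 1, with higher values indicating more shared, heavily weighted terms.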