## Text Similarity using Jaccard Similarity
Jaccard similarity is an alternative to cosine similarity. It is a common way to compute the similarity between two objects, such as two text documents. It can be used to find the similarity between two asymmetric binary vectors or between two sets.
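
For reference, the standard definitions can be written as follows. The symbols are introduced here for illustration: $A$ and $B$ are the two sets (for text, the sets of unique words), $M_{11}$ counts attributes present in both binary vectors, and $M_{10}$ and $M_{01}$ count attributes present in only one of them.

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\qquad\text{and, for binary vectors,}\qquad
J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}
$$

Attributes absent from both vectors do not appear in the denominator, which is why the measure suits asymmetric binary data.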

To calculate Jaccard similarity, first lemmatize both texts so that different forms of a word reduce to the same root word, for example "friendly" -> "friend" and "has" or "have" -> "have".

There are some differences between these two methods:

* Jaccard similarity uses only the set of unique words in each sentence or document, while cosine similarity is computed on the full term vectors, so word counts affect the result. (These vectors could be built from bag-of-words term frequencies or TF-IDF.)
* This means that if you repeat the word “friend” in Sentence 1 several times, cosine similarity changes but Jaccard similarity does not. For example, if the word “friend” is repeated 50 times in the first sentence, cosine similarity drops to 0.4 but Jaccard similarity remains at 0.5.
* Jaccard similarity suits cases where duplication does not matter, and cosine similarity suits cases where it does. For two product descriptions, Jaccard similarity is the better choice, since repeating a word does not reduce their similarity.
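
The repetition effect described in the list above can be reproduced with a minimal sketch. The token lists `s1` and `s2` are made up for illustration (they are not the sentences behind the 0.4 and 0.5 figures), and the helpers `jaccard` and `cosine` are hypothetical names, with cosine computed on raw term counts rather than TF-IDF.

```python
import math
from collections import Counter

def jaccard(tokens_a, tokens_b):
    """Jaccard similarity on the sets of unique tokens."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty inputs
    return len(a & b) / len(a | b)

def cosine(tokens_a, tokens_b):
    """Cosine similarity on raw term-frequency vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Illustrative token lists (assumed already tokenized and lemmatized)
s1 = ["friend", "likes", "coffee"]
s2 = ["friend", "likes", "tea"]
s1_repeated = s1 + ["friend"] * 50  # repeat one word 50 more times

print("Jaccard:", jaccard(s1, s2), "->", jaccard(s1_repeated, s2))
print("Cosine :", cosine(s1, s2), "->", cosine(s1_repeated, s2))
```

The Jaccard score is identical for both inputs because the set of unique tokens is unchanged, while the cosine score falls because the repeated word dominates the first vector.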



Task: "Compare two documents using cosine similarity and Jaccard similarity"

Baseline: Run the same document through the same preprocessing pipeline, then compute its cosine and Jaccard similarity.

Experiment 1: Alter the preprocessing pipeline and rerun both measures to see which one drops in value.
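
One possible shape for the baseline and Experiment 1 is sketched below on a toy sentence. Everything in it is an assumption made for illustration: the two pipelines (`pipeline_baseline`, `pipeline_altered`), the tiny `STOPWORDS` set and the sample `doc` are not part of the notebook, and cosine is computed on raw term counts.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in"}  # tiny illustrative list, not NLTK's

def pipeline_baseline(text):
    # Lowercase, split on whitespace, keep alphabetic tokens
    return [w for w in text.lower().split() if w.isalpha()]

def pipeline_altered(text):
    # Same as the baseline, plus stopword removal
    return [w for w in pipeline_baseline(text) if w not in STOPWORDS]

def jaccard(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def cosine(tokens_a, tokens_b):
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = "The manager of the team and the director of the unit met in the office"
base = pipeline_baseline(doc)
alt = pipeline_altered(doc)

# Baseline: same document, same pipeline, so both scores should be about 1.0
print("baseline:", cosine(base, base), jaccard(base, base))
# Experiment 1: same document, altered pipeline on one side
print("altered :", cosine(base, alt), jaccard(base, alt))
```

On this toy input both scores fall below the baseline once the pipelines differ; how far each one falls on the resume data is what the experiment is meant to find out.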

In [218]:
import re
import pandas as pd
import numpy as np
import spacy
import nltk
from nltk.stem import WordNetLemmatizer
from collections import Counter

wn = WordNetLemmatizer()

resume = pd.read_csv("archive/Resume/Resume.csv")
#resume = resume.reindex(np.random.permutation(resume.index))
resume.head()


| | ID | Resume_str | Resume_html | Category |
|---|---|---|---|---|
| 0 | 16852973 | HR ADMINISTRATOR/MARKETING ASSOCIATE\... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 1 | 22323967 | HR SPECIALIST, US HR OPERATIONS ... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 2 | 33176873 | HR DIRECTOR Summary Over 2... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 3 | 27018550 | HR SPECIALIST Summary Dedica... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 4 | 17812897 | HR MANAGER Skill Highlights ... | `<div class="fontsize fontface vmargins hmargin...` | HR |


In [2]:
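import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# Requires the NLTK stopwords, tokenizer and WordNet data to be downloaded
stopwords = set(nltk.corpus.stopwords.words('english'))
wn = WordNetLemmatizer()

def tokenize(input_doc):
    # Lowercase before the stopword check so capitalized stopwords ("The", "And")
    # are filtered too; keep alphabetic tokens of at least 2 characters, then lemmatize.
    cptoken = [wn.lemmatize(x) for x in map(str.lower, word_tokenize(input_doc))
               if x.isalpha() and len(x) >= 2 and x not in stopwords]

    return cptoken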

In [3]:
import gensim

gensim.corpora.Dictionary()

<gensim.corpora.dictionary.Dictionary at 0x7fbb02addc50>

In [221]:
bow = {}         # doc id -> Counter of token frequencies
corpus = {}      # doc id -> list of tokens
fullcorpus = []  # tokenized documents, in row order
rawcorpus = []   # raw resume text, in row order
# docid is the position from enumerate; it matches resume.index only while
# the DataFrame keeps its default RangeIndex (i.e. it has not been shuffled)
for docid, doc in enumerate(resume.Resume_str):
    words = tokenize(doc)
    corpus[docid] = words
    bow[docid] = Counter(words)
    fullcorpus.append(words)
    rawcorpus.append(doc)

In [223]:
import gensim
from gensim.similarities import MatrixSimilarity, Similarity

id2bow = gensim.corpora.Dictionary(fullcorpus)

# per doc id with its bow
gensim_corpus = [id2bow.doc2bow(text) for text in fullcorpus]

In [243]:
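# Build the models used for cosine similarity
tfidf_model = gensim.models.TfidfModel(gensim_corpus, id2word=id2bow)
# TF-IDF alone did not seem good enough, so an LSI model is fitted as well;
# 10 topics is taken as sufficient here, following the paper behind gensim's implementation.
# It is fitted on the TF-IDF corpus because it is later applied to TF-IDF vectors.
lsi_model   = gensim.models.LsiModel(tfidf_model[gensim_corpus], num_topics=10, id2word=id2bow)

# Note: the index holds TF-IDF vectors, not LSI vectors
similarity_index = MatrixSimilarity(tfidf_model[gensim_corpus])
# num_best applies to similarity_index[query]; get_similarities() returns all scores
similarity_index.num_best = 11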


In [244]:
from operator import itemgetter
from nltk.tokenize import word_tokenize

def get_similarity_result(term):
    # Prints the corpus document most similar to `term`.
    # The highest score corresponds to the highest similarity.
    val = id2bow.doc2bow(tokenize(term))
    tfidf_val = tfidf_model[val]
    lsi_val   = lsi_model[tfidf_val]  # computed but unused: the index holds TF-IDF vectors
    # One cosine score per document in the corpus
    similarity_value = similarity_index.get_similarities(tfidf_val)
    assert similarity_value.shape[0] == len(fullcorpus)


    closest_docid = np.argmax(similarity_value)
    highest_score = np.max(similarity_value)
    print(f"closest doc id: {closest_docid}, score: {highest_score}")
    print(resume.Resume_str[closest_docid])

get_similarity_result(resume.Resume_str[1])

closest doc id: 1, score: 0.9999999403953552
         HR SPECIALIST, US HR OPERATIONS       Summary     Versatile  media professional with background in Communications, Marketing, Human Resources and Technology.         Experience     09/2015   to   Current     HR Specialist, US HR Operations    Company Name   －   City  ,   State       Managed communication regarding launch of Operations group, policy changes and system outages      Designed standard work and job aids to create comprehensive training program for new employees and contractors         Audited job postings for old, pending, on-hold and draft positions.           Audited union hourly, non-union hourly and salary background checks and drug screens             Conducted monthly new hire benefits briefing to new employees across all business units               Served as a link between HR Managers and vendors by handling questions and resolving system-related issues         Provide real-time process improvement feedback on ke
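
The cell above only reports the cosine score. A Jaccard counterpart could look like the sketch below, which is an assumption about how the comparison might be completed rather than the notebook's own code. The function name `most_similar_by_jaccard` and the `toy_corpus` are hypothetical, and the corpus is assumed to be non-empty.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity between two sets of tokens."""
    if not set_a and not set_b:
        return 1.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(set_a | set_b)

def most_similar_by_jaccard(query_tokens, tokenized_corpus):
    """Return (doc id, score) for the corpus entry with the highest Jaccard score."""
    query = set(query_tokens)
    scores = [jaccard(query, set(doc)) for doc in tokenized_corpus]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy stand-in for the tokenized resumes
toy_corpus = [
    ["hr", "administrator", "marketing", "associate"],
    ["hr", "specialist", "operation", "communication"],
    ["hr", "director", "summary", "experience"],
]

best_id, score = most_similar_by_jaccard(toy_corpus[1], toy_corpus)
print(f"closest doc id: {best_id}, score: {score}")
```

In the notebook, the equivalent call would be `most_similar_by_jaccard(tokenize(resume.Resume_str[1]), fullcorpus)`, which should also return document 1 with a score of 1.0, since a document's token set is identical to itself.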