# Ranking candidate CVs using keywords and semantic matching


In [1]:
import json

# Load CVs
with open('test_CVs.json') as in_file:
    test_data = json.load(in_file)
    
# Load the job listing
with open('test_job_listing.json') as in_file:
    job_data = json.load(in_file)

titles = [item[0] for item in test_data['data']]
CVs = [item[1] for item in test_data['data']]

company = job_data['data'].split("#")[0]
job_listing = job_data['data'].split("#")[1]

In [2]:
print("Company:", company,"\n", "Job listing:", job_listing[0:200])

Company: Brave 
 Job listing: Brave is looking for an experienced Machine Learning engineer to help build our Brave Web Browser. It's already receiving rave reviews and we are only just beginning. Jump in and work with a top-notch


In [3]:
# Display the index, title and first 100 characters of each CV
for idx in range(len(CVs)):
    print(idx, " \t ", titles[idx], " : \t", CVs[idx][:100])

0  	  Assistant Retail Manager  : 	 Assistant Retail Manager    ROBERT SMITH    Phone: (123) 456 78 99 Email: info@qwikresume.com Websit
1  	  Python Developer/Tester  : 	     ROBERT SMITH    Python Developer/Tester    info@qwikresume.com | LinkedIn Profile | Qwikresume.c
2  	  SOFTWARE EXPERT  : 	 Anthony Applicant    567 North Street  •  Boston, MA 02108  •  (123) 456-7890  •  anthony.applicant@
3  	  Full Stack Python Developer  : 	 E­mail: info@qwikresumc.com    ROBERT SMITH Full Stack Python Developer    SUMMARY    Phone: (0123)­
4  	  Python Developer  : 	 CONTACT DETAILS    1737 Marshville Road, Alabama    (123)-456-7899 info@qwikresume.com www.qwikresum
5  	  Data Scientist  : 	 Malik Rabb      Seattle, WA • (123) 456-7891      mrabb@email.com            SUMMARY      Data Scien


<a id="sec1"></a>
## 1. TF-IDF to score shared keywords

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

import nltk
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords', quiet=True)  # the stop-word list must be present before it is loaded
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /Users/jama/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:


# Recent scikit-learn releases expect stop_words to be a list rather than a set
vectorizer = TfidfVectorizer(stop_words=list(stop_words))
vectors = vectorizer.fit_transform([job_listing] + CVs)

# TfidfVectorizer L2-normalises each row by default, so the dot product from linear_kernel
# is the cosine similarity between the job listing (row 0) and every document
cosine_similarities = linear_kernel(vectors[0:1], vectors).flatten()
applicant_scores = [item.item() for item in cosine_similarities[1:]]  # convert back to native Python dtypes

# Print the top-scoring results and their titles
score_titles = [(score, title) for score, title in zip(applicant_scores, titles)]

for score, title in (sorted(score_titles, reverse=True, key=lambda x: x[0])[:5]):
    print(f'{score:0.3f} \t {title}')

0.134 	 Data Scientist
0.063 	 SOFTWARE EXPERT
0.059 	 Python Developer
0.047 	 Full Stack Python Developer
0.031 	 Python Developer/Tester
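
To make the scoring above less of a black box, here is a minimal pure-Python sketch of the same idea: weight each term by TF-IDF, scale every document to unit length, and take the dot product with the job listing. The toy documents, the whitespace tokenizer and the helper names (`tfidf_vectors`, `dot`) are illustrative assumptions, not part of the notebook. The smoothed IDF formula mirrors scikit-learn's documented default, but this is not a reimplementation of `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per whitespace-tokenised document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def dot(a, b):
    """Dot product of two sparse vectors; equals cosine similarity for unit-length inputs."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Hypothetical toy data: one job listing followed by three CVs
toy_job = "machine learning engineer python"
toy_cvs = ["python developer django",
           "retail manager sales",
           "machine learning python research"]

job_vec, *cv_vecs = tfidf_vectors([toy_job] + toy_cvs)
toy_scores = [dot(job_vec, cv) for cv in cv_vecs]

for score, cv in sorted(zip(toy_scores, toy_cvs), reverse=True):
    print(f"{score:0.3f} \t {cv}")
```

The second toy CV shares no words with the listing, so it scores exactly zero. This is the exact-match limitation that the semantic approach in section 2 is meant to soften.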


<a id="sec1b"></a>
## 1b. Using a lemmatizer

A lemmatizer reduces each word to its base form, or 'lemma'. This is particularly helpful when dealing with plurals.

In [6]:
# from: https://scikit-learn.org/stable/modules/feature_extraction.html

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer:
    """
    Interface to the WordNet lemmatizer from nltk
    """
    ignore_tokens = [',', '.', ';', ':', '"', '``', "''", '`']
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if t not in self.ignore_tokens]

In [7]:
# Demonstrate the job of the tokenizer
nltk.download('wordnet')
tokenizer=LemmaTokenizer()

tokenizer('It was raining cats and dogs in FooBar')

[nltk_data] Downloading package wordnet to /Users/jama/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['It', 'wa', 'raining', 'cat', 'and', 'dog', 'in', 'FooBar']

In [8]:
# Initialise TfidfVectorizer with the LemmaTokenizer. The stop words must be lemmatized too, so that they match the tokenizer's output
token_stop = tokenizer(' '.join(stop_words))
vectorizer = TfidfVectorizer(stop_words=token_stop, tokenizer=tokenizer)

# Calculate the word frequency, and calculate the cosine similarity of the search terms to the CVs
vectors = vectorizer.fit_transform([job_listing] + CVs)
cosine_similarities = linear_kernel(vectors[0:1], vectors).flatten()

document_scores = [item.item() for item in cosine_similarities[1:]]  # convert back to native Python dtypes

score_titles = [(score, title) for score, title in zip(document_scores, titles)]

for score, title in (sorted(score_titles, reverse=True, key=lambda x: x[0])[:5]):
    print(f'{score:0.3f} \t {title}')

0.143 	 Data Scientist
0.060 	 SOFTWARE EXPERT
0.040 	 Python Developer
0.038 	 Full Stack Python Developer
0.036 	 Python Developer/Tester
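
The cell above passes the stop words through the same tokenizer as the documents. A small sketch of why that matters: the lookup table below is a toy stand-in for the WordNet lemmatizer, built only from the example output shown earlier (`was` becomes `wa`, `cats` becomes `cat`, `dogs` becomes `dog`). The names `TOY_LEMMAS` and `toy_tokenizer` and the short stop-word list are assumptions for illustration.

```python
# Toy stand-in for the WordNet lemmatizer, taken from the example output above
TOY_LEMMAS = {"was": "wa", "cats": "cat", "dogs": "dog"}

def toy_tokenizer(text):
    """Lower-case, split on whitespace and map each token through the toy lemma table."""
    return [TOY_LEMMAS.get(token, token) for token in text.lower().split()]

raw_stop_words = ["it", "was", "and", "in"]
lemmatized_stop_words = toy_tokenizer(" ".join(raw_stop_words))

tokens = toy_tokenizer("It was raining cats and dogs in FooBar")

# Filtering with the raw list lets the lemmatized form 'wa' slip through
kept_with_raw = [t for t in tokens if t not in raw_stop_words]
# Filtering with the lemmatized list removes it as intended
kept_with_lemmatized = [t for t in tokens if t not in lemmatized_stop_words]

print(kept_with_raw)
print(kept_with_lemmatized)
```

If the stop-word list and the tokenizer disagree, altered stop words such as `wa` end up as features in the TF-IDF vocabulary.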


<a id="sec1c"></a>
## 1c. Using the standalone module

You can find the above functionality (`TfidfVectorizer`, stop words, `LemmaTokenizer`, cosine similarity) inside the `tfidf.py` module. This allows document scores to be calculated from a single function call:

In [9]:
from tfidf import rank_documents

[nltk_data] Downloading package punkt to /Users/jama/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [10]:
cv_scores = rank_documents(job_listing, CVs)

In [11]:
score_titles = [(score, title) for score, title in zip(cv_scores, titles)]

for score, title in (sorted(score_titles, reverse=True, key=lambda x: x[0])[:5]):
    print(f'{score:0.3f} \t {title}')

0.143 	 Data Scientist
0.060 	 SOFTWARE EXPERT
0.040 	 Python Developer
0.038 	 Full Stack Python Developer
0.036 	 Python Developer/Tester


<a id="sec2"></a>
## 2. Semantic matching using GloVe embeddings

In [12]:
import json
import logging
from re import sub
from multiprocessing import cpu_count

import numpy as np

import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models import WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.similarities import SoftCosineSimilarity

In [13]:
import logging

# Initialize logging.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.WARNING)  # DEBUG # INFO

In [14]:
import nltk

# Import and download stopwords from NLTK.
nltk.download('stopwords')  # Download stopwords list.
stopwords = set(nltk.corpus.stopwords.words("english"))

[nltk_data] Downloading package stopwords to /Users/jama/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
# Support functions for pre-processing and calculation
# From: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb

def preprocess(doc):
    # Tokenize, clean up input document string
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

In [16]:
# Preprocess the CVs and the job listing
corpus = [preprocess(document) for document in CVs]
job_list = preprocess(job_listing)

### Build the model

The word embedding model is a large file, so loading is quite a long-running task.

In [17]:
%%time

# Download and/or load the GloVe word vector embeddings

if 'glove' not in locals():  # only load if not already in memory
    glove = api.load("glove-wiki-gigaword-50")
    
similarity_index = WordEmbeddingSimilarityIndex(glove)

CPU times: user 20.4 s, sys: 154 ms, total: 20.5 s
Wall time: 21.2 s
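
For intuition about what a word-embedding similarity index provides, here is a minimal sketch that scores term pairs by the cosine similarity of their vectors. The three-dimensional vectors are made up for illustration (the GloVe model loaded above uses 50 dimensions), and the `threshold` and `exponent` parameters are modelled on my understanding of gensim's defaults, so treat them as assumptions rather than a description of the library's internals.

```python
import math

# Made-up 3-dimensional "embeddings", purely for illustration
toy_vectors = {
    "engineer":  [0.9, 0.1, 0.0],
    "developer": [0.8, 0.3, 0.1],
    "retail":    [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def term_similarity(t1, t2, threshold=0.0, exponent=2.0):
    """Cosine similarity of two word vectors, zeroed at or below `threshold`, then raised to `exponent`."""
    sim = cosine(toy_vectors[t1], toy_vectors[t2])
    return sim ** exponent if sim > threshold else 0.0

for pair in [("engineer", "developer"), ("engineer", "retail")]:
    print(pair, f"{term_similarity(*pair):0.3f}")
```

Raising the similarity to a power greater than one pushes weak matches towards zero, so only closely related terms contribute noticeably.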


In [18]:
%%time

# Build the term dictionary and the TF-IDF model
# The job listing must be in the dictionary as well, in case its terms do not overlap with the CVs (we still want similarity)
dictionary = Dictionary(corpus+[job_list])
tfidf = TfidfModel(dictionary=dictionary)

# Create the term similarity matrix. 
# The nonzero_limit enforces sparsity by limiting the number of non-zero terms in each column. 
# For my application, I got best results by removing the default value of 100
similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)  # , nonzero_limit=None)

CPU times: user 19.7 s, sys: 228 ms, total: 19.9 s
Wall time: 5.12 s


In [19]:
# Compute Soft Cosine Measure between the job listing and the CVs.
query_tf = tfidf[dictionary.doc2bow(job_list)]

index = SoftCosineSimilarity(
            tfidf[[dictionary.doc2bow(document) for document in corpus]],
            similarity_matrix)

doc_similarity_scores = index[query_tf]

### Output the document similarity results

In [20]:
# Output the similarity scores for top 15 CVs
sorted_indexes = np.argsort(doc_similarity_scores)[::-1]
for idx in sorted_indexes[:15]:
    print(f'{idx} \t {doc_similarity_scores[idx]:0.3f} \t {titles[idx]}')

5 	 0.653 	 Data Scientist
4 	 0.525 	 Python Developer
3 	 0.516 	 Full Stack Python Developer
1 	 0.446 	 Python Developer/Tester
0 	 0.414 	 Assistant Retail Manager
2 	 0.360 	 SOFTWARE EXPERT
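
The soft cosine measure behind these scores generalises ordinary cosine similarity by inserting a term-similarity matrix S between the two vectors: the score is aᵀSb divided by the square roots of aᵀSa and bᵀSb. The sketch below uses dense Python lists and a hand-written S as stated assumptions. It illustrates the formula only and is not how gensim implements it.

```python
import math

def soft_cosine(a, b, S):
    """Soft cosine measure between dense vectors a and b given a term-similarity matrix S."""
    def bilinear(x, y):
        # Computes x^T S y
        return sum(x[i] * S[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))
    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0

# Vocabulary order: ["engineer", "developer", "retail"]; similarity values are hypothetical
S = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

job = [1.0, 0.0, 0.0]  # mentions only "engineer"
cv = [0.0, 1.0, 0.0]   # mentions only "developer"

plain = soft_cosine(job, cv, identity)  # ordinary cosine: no shared terms
soft = soft_cosine(job, cv, S)          # credits the engineer/developer similarity

print(f"plain cosine: {plain:0.3f}, soft cosine: {soft:0.3f}")
```

With the identity matrix the measure reduces to plain cosine similarity, which is why two documents with no words in common can still score above zero once S carries off-diagonal similarities.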


### Find the most relevant terms in the CVs

In [21]:
# For each term in the job listing, what were the most similar words in each CV?
doc_similar_terms = []
max_results_per_doc = 5
for term in job_list:
    idx1 = dictionary.token2id[term]
    for document in corpus:
        results_this_doc = []
        for word in set(document):
            idx2 = dictionary.token2id[word]
            score = similarity_matrix.matrix[idx1, idx2]
            if score > 0.0:
                results_this_doc.append((word, score))
        results_this_doc = sorted(results_this_doc, reverse=True, key=lambda x: x[1])  # sort results by score
        results_this_doc = results_this_doc[:min(len(results_this_doc), max_results_per_doc)]  # take the top results
        doc_similar_terms.append(results_this_doc)

In [22]:
# Output the results for the top 15 CVs
for idx in sorted_indexes[:15]:
    similar_terms_string = ', '.join([result[0] for result in doc_similar_terms[idx]])
    print(f'{idx} \t {doc_similarity_scores[idx]:0.3f} \t {titles[idx]}  :  {similar_terms_string}')

5 	 0.653 	 Data Scientist  :  passionate
4 	 0.525 	 Python Developer  :  
3 	 0.516 	 Full Stack Python Developer  :  
1 	 0.446 	 Python Developer/Tester  :  
0 	 0.414 	 Assistant Retail Manager  :  
2 	 0.360 	 SOFTWARE EXPERT  :  hero


This shows which terms in each of the documents were most similar to terms in the search query. What it doesn't show, however, is the exact contribution of each term to the document score, because each word similarity score is weighted by the TF-IDF weights of the terms involved.
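
Note that the loop above appends one result list per (term, document) pair, so indexing `doc_similar_terms` by document index picks up only the matches for the first term of the job listing. That would explain why most rows in the output are empty. One possible alternative, sketched here with a hypothetical similarity lookup in place of `similarity_matrix.matrix`, is to keep a single list per document and record each word's best score against any query term. The function name and the toy data are assumptions for illustration.

```python
def top_terms_per_document(query_terms, documents, term_similarity, max_results=5):
    """For each document, return its words most similar to any query term, best first."""
    results = []
    for document in documents:
        best = {}  # word -> highest similarity to any query term
        for word in set(document):
            for term in query_terms:
                score = term_similarity(term, word)
                if score > best.get(word, 0.0):
                    best[word] = score
        ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
        results.append(ranked[:max_results])
    return results

# Hypothetical similarity lookup standing in for similarity_matrix.matrix[idx1, idx2]
toy_sims = {("brave", "hero"): 0.4,
            ("engineer", "developer"): 0.9}

def lookup(term, word):
    return toy_sims.get((term, word), 0.0)

toy_query = ["brave", "engineer"]
toy_docs = [["python", "developer"], ["hero", "retail"]]
per_doc = top_terms_per_document(toy_query, toy_docs, lookup)

for doc_idx, terms in enumerate(per_doc):
    print(doc_idx, ', '.join(word for word, _ in terms))
```

The result has exactly one entry per document, so it can be indexed with `sorted_indexes` directly.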

<a id="sec2b"></a>
## 2b. Using the ready-made DocSim class

The `DocSim` class wraps up functionality to prepare and compare data in a single object. It also persists the word embedding model to avoid having to reload it each time it is used. The word embedding model is loaded on initialisation, as this is quite a long-running task.

`DocSim_threaded` has similar functionality, but loads the model in a separate thread. Similarity queries cannot be evaluated until the model is ready, so check the `model_ready` flag before running one.
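
The background-loading pattern can be sketched as follows. This is a hypothetical illustration, not the code in `docsim.py`: the `BackgroundModel` class, the `wait_until_ready` helper and the `time.sleep` stand-in for the slow model load are all assumptions. Only the `model_ready` flag name comes from the description above.

```python
import threading
import time

class BackgroundModel:
    """Hypothetical illustration: load a slow resource in a worker thread and expose a ready flag."""

    def __init__(self, load_seconds=0.1):
        self.model = None
        self.model_ready = False
        self._thread = threading.Thread(target=self._load, args=(load_seconds,), daemon=True)
        self._thread.start()

    def _load(self, load_seconds):
        time.sleep(load_seconds)       # stand-in for the slow embedding load
        self.model = {"loaded": True}
        self.model_ready = True        # set last, once the model is usable

    def wait_until_ready(self, timeout=None):
        """Block until loading finishes (or the timeout elapses) and return the ready flag."""
        self._thread.join(timeout)
        return self.model_ready

bg = BackgroundModel()
print("Ready straight away:", bg.model_ready)
ready_after_wait = bg.wait_until_ready(timeout=5)
print("Ready after waiting:", ready_after_wait)
```

The constructor returns immediately, which keeps an application responsive while the model loads. The caller then has to guard every query with the ready flag.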

In [23]:
import json
import docsim

In [24]:
%%time

docsim_obj = docsim.DocSim(verbose=True)
# docsim_obj = docsim.DocSim_threaded(verbose=True)

Loading default GloVe word vector model: glove-wiki-gigaword-50
Model loaded
CPU times: user 21.2 s, sys: 175 ms, total: 21.3 s
Wall time: 22 s


In [25]:
print(f'Model ready: {docsim_obj.model_ready}')

Model ready: True


In [26]:
%%time

similarities = docsim_obj.similarity_query(job_listing, CVs)

6 documents loaded into corpus
CPU times: user 21.1 s, sys: 369 ms, total: 21.5 s
Wall time: 5.55 s


In [27]:
# Output the similarity scores for top 15 CVs
for idx, score in (sorted(enumerate(similarities), reverse=True, key=lambda x: x[1])[:15]):
    print(f'{idx} \t {score:0.3f} \t {titles[idx]}')

5 	 0.653 	 Data Scientist
4 	 0.525 	 Python Developer
3 	 0.516 	 Full Stack Python Developer
1 	 0.446 	 Python Developer/Tester
0 	 0.414 	 Assistant Retail Manager
2 	 0.360 	 SOFTWARE EXPERT


# Preliminary Results


## Summary: TF-IDF

1. It’s fast and works well when documents are large and/or have lots of overlap.
2. It looks for exact matches, so at the very least you should use a lemmatizer to take care of the plurals.
3. When comparing short documents with a limited variety of terms, such as search queries, there is a risk that you will miss semantic relationships where there isn’t an exact word match.

## Summary: Semantic similarity using GloVe

1. It is more flexible as it doesn’t rely on finding exact matches.
2. There is a lot more computation involved so it can be slower, and the word embedding models can be quite large and take a while to prepare for first use. This scales well, but running a single query is slow.
3. Most words have some degree of similarity to other words, so almost all documents will have some non-zero similarity to other documents. Semantic similarity is good for ranking content in order, rather than making specific judgements about whether a document is or is not about a specific topic.
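
The last point can be illustrated with the soft cosine scores reported earlier. Every CV gets a non-zero score, so a fixed cut-off (the `0.4` below is an arbitrary, hypothetical value) is a poor test of relevance, while the ordering is still informative.

```python
# Soft cosine scores copied from the output above, keyed by CV title
scm_scores = {
    "Data Scientist": 0.653,
    "Python Developer": 0.525,
    "Full Stack Python Developer": 0.516,
    "Python Developer/Tester": 0.446,
    "Assistant Retail Manager": 0.414,
    "SOFTWARE EXPERT": 0.360,
}

# Ranking: order the CVs by score, best first
ranking = sorted(scm_scores, key=scm_scores.get, reverse=True)

# Thresholding: the cut-off is an arbitrary, hypothetical value
cutoff = 0.4
accepted = [title for title in ranking if scm_scores[title] >= cutoff]

for position, title in enumerate(ranking, start=1):
    print(f"{position}. {scm_scores[title]:0.3f}  {title}")
print("Above cut-off:", accepted)
```

With this cut-off the retail CV is accepted and the software CV is rejected, which shows how sensitive a yes/no judgement is to the chosen threshold.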

## Preliminary results suggest that semantic similarity using GloVe is the more suitable method for ranking CVs.