In [2]:
job_description='''
Job Description: Senior Backend Engineer

Responsibilities:

Design, develop, and maintain robust and scalable backend systems.
Collaborate with frontend and mobile teams to build seamless user experiences.
Optimize database performance and write efficient SQL queries.
Implement robust security measures to protect sensitive data.
Mentor junior engineers and foster a culture of continuous learning.
Required Skills:

Strong proficiency in backend programming languages (e.g., Python, Node.js, Ruby on Rails, Java).
Experience with database technologies (e.g., PostgreSQL, MySQL, MongoDB).
Solid understanding of RESTful API design and development.
Knowledge of cloud platforms (e.g., AWS, GCP, Azure).
Experience with containerization technologies (e.g., Docker, Kubernetes).
'''

In [3]:
interviewee_responce='''
I've been passionate about backend development for 3 years, and I'm excited to apply my skills to challenging projects.
At my previous role at egy_tech, I was responsible for building a scalable API that handled 100 requests per second.
I utilized [Specific technologies, e.g., Python, Flask, PostgreSQL] to optimize performance and ensure reliability.

I'm particularly interested in your company's focus on [database, data privacy, machine learning, Azuru].
I've been exploring Node.js and believe it could be a valuable asset to your team.
I'm eager to contribute to innovative projects and learn from experienced engineers.
'''

In [4]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

from nltk.stem import WordNetLemmatizer
import re


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to C:\Users\Mohamed
[nltk_data]     Walid\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Mohamed
[nltk_data]     Walid\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Mohamed
[nltk_data]     Walid\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Mohamed
[nltk_data]     Walid\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Mohamed Walid\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [5]:
stop_words = set(stopwords.words('english'))
translated_table = str.maketrans('', '', string.punctuation)
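
The translation table above deletes every ASCII punctuation character. The short sketch below (the sample strings are illustrative, taken from the two texts) shows the effect this has on tokens such as "Node.js" and "I've", which is why they later appear as "nodejs" and "ive" in the preprocessed output.

```python
import string

# Same construction as the notebook's table: map every ASCII punctuation
# character to None so that str.translate() removes it.
strip_punctuation = str.maketrans('', '', string.punctuation)

# Illustrative samples drawn from the job description and the response.
samples = ["Node.js", "I've", "e.g.,", "egy_tech"]
cleaned = [s.lower().translate(strip_punctuation) for s in samples]
print(cleaned)
```

Because the apostrophe is removed, contractions no longer match the entries in the stop-word list, so they pass through the stop-word filter.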

In [6]:
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ  # Adjective
    elif tag.startswith('V'):
        return wordnet.VERB  # Verb
    elif tag.startswith('N'):
        return wordnet.NOUN  # Noun
    elif tag.startswith('R'):
        return wordnet.ADV  # Adverb
    else:
        return wordnet.NOUN  # Default to Noun

In [7]:
def preprocess_text(text):
    """Lowercase, strip digits and punctuation, drop stop words, and lemmatize."""
    text = text.lower()

    text = re.sub(r'\d+', '', text)          # Remove numbers
    text = text.translate(translated_table)  # Remove punctuation ("node.js" -> "nodejs")

    text_tokens = word_tokenize(text)

    # Contractions lose their apostrophe above, so tokens such as "ive" and "im"
    # no longer match the stop-word list and are kept.
    filtered_words = [word for word in text_tokens if word not in stop_words]

    # Lemmatization => transforming words to their base or dictionary form
    lemmatizer = WordNetLemmatizer()

    lemma_words = []
    for word in filtered_words:
        # Each word is tagged on its own, so the tagger sees no sentence context
        pos_tag = nltk.pos_tag([word])[0][1]  # Get POS tag for each word
        wordnet_pos = get_wordnet_pos(pos_tag)  # Map POS to WordNet POS
        lemma_word = lemmatizer.lemmatize(word, pos=wordnet_pos)  # Lemmatize using WordNet POS
        lemma_words.append(lemma_word)

    processed_text = ' '.join(lemma_words)
    return processed_text

In [8]:
preprocessed_job_description = preprocess_text(job_description)
print(f"Preprocessed job description : {preprocessed_job_description}")

Preprocessed job description : job description senior backend engineer responsibility design develop maintain robust scalable backend system collaborate frontend mobile team build seamless user experience optimize database performance write efficient sql query implement robust security measure protect sensitive data mentor junior engineer foster culture continuous learn require skill strong proficiency backend program language eg python nodejs ruby rail java experience database technology eg postgresql mysql mongodb solid understand restful api design development knowledge cloud platform eg aws gcp azure experience containerization technology eg docker kubernetes


In [9]:
preprocessed_interviewee_responce= preprocess_text(interviewee_responce)
print(f"Preprocessed interviewee response : {preprocessed_interviewee_responce}")

Preprocessed interviewee response : ive passionate backend development year im excite apply skill challenge project previous role egytech responsible building scalable api handle request per second utilized specific technology eg python flask postgresql optimize performance ensure reliability im particularly interested company focus database data privacy machine learn azuru ive explore nodejs believe could valuable asset team im eager contribute innovative project learn experienced engineer


### ***Keyword-based similarity***
##### Lexical similarity: TF-IDF with cosine similarity, Jaccard

### ***TF-IDF***

In [9]:
documents=[preprocessed_job_description,preprocessed_interviewee_responce]


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

tf_idf=TfidfVectorizer()
sparse_matrix=tf_idf.fit_transform(documents)

In [14]:
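# Build a dense document-term table so the TF-IDF weights are easy to inspect
doc_term_matrix = sparse_matrix.toarray()
df = pd.DataFrame(doc_term_matrix, columns=tf_idf.get_feature_names_out(), index=['job_description', 'interviewee_response'])

# Cosine similarity between the two documents (entry [0, 1] of the 2x2 pairwise matrix)
similarity_scores = cosine_similarity(df, df)[0, 1]

# A term matches when its column contains no zero weight, i.e. it occurs in both documents
match_keys = df.isin([0]).sum(axis=0)
match_words = match_keys[match_keys == 0].keys()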


# print the similarity score
print("\n Cosine Similarity Score using TF-IDF: ", round(similarity_scores,2))
print("\n Matching Words: ", list(match_words))


 Cosine Similarity Score using TF-IDF:  0.18

 Matching Words:  ['api', 'backend', 'data', 'database', 'development', 'eg', 'engineer', 'learn', 'nodejs', 'optimize', 'performance', 'postgresql', 'python', 'scalable', 'skill', 'team', 'technology']
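
To make the score above easier to interpret, here is a minimal pure-Python sketch of the same idea: weight each term by smoothed TF-IDF, then take the cosine of the two weight vectors. It assumes plain whitespace tokenization and the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` that `TfidfVectorizer` uses by default; the function names `tfidf_cosine` and `shared_terms` are hypothetical helpers, not part of scikit-learn, and the result may differ slightly from the notebook's because the tokenization differs.

```python
import math
from collections import Counter


def tfidf_cosine(doc_a, doc_b):
    """Cosine similarity of two whitespace-tokenized texts under smoothed TF-IDF."""
    docs = [doc_a.split(), doc_b.split()]
    vocab = sorted(set(docs[0]) | set(docs[1]))
    n_docs = len(docs)

    # Document frequency: in how many of the documents each term occurs.
    df = {term: sum(term in doc for doc in docs) for term in vocab}
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

    # One TF-IDF weight vector per document (raw term count times idf).
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] * idf[term] for term in vocab])

    dot = sum(x * y for x, y in zip(vectors[0], vectors[1]))
    norm_a = math.sqrt(sum(x * x for x in vectors[0]))
    norm_b = math.sqrt(sum(y * y for y in vectors[1]))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shared_terms(doc_a, doc_b):
    """Terms that occur in both texts, sorted alphabetically."""
    return sorted(set(doc_a.split()) & set(doc_b.split()))


print(round(tfidf_cosine("python api", "python flask"), 2))
print(shared_terms("python api", "python flask"))
```

Terms that occur in only one document receive a higher idf weight but contribute nothing to the dot product, so a long list of unshared terms pulls the cosine down even when many keywords match.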


### ***Jaccard and Sørensen-Dice***

In [15]:
import textdistance as td

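# Note: given plain strings, textdistance compares them character by character
# (default qval=1), so these scores measure character overlap, not shared words.
print("\n Lexical similarity methods:")
print("Jaccard: ", td.jaccard.similarity(preprocessed_job_description, preprocessed_interviewee_responce))
print("Sorensen: ", round(td.sorensen_dice.similarity(preprocessed_job_description, preprocessed_interviewee_responce),2))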



 Lexical similarity methods:
Jaccard:  0.6907692307692308
Sorensen:  0.82
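
The scores above are computed on raw strings, and textdistance treats a string as a sequence of characters by default, which likely explains why they are so much higher than the TF-IDF score. As a rough sketch of the word-level alternative, the two measures can be computed on token sets with plain Python. The helper names `jaccard_words` and `dice_words` are hypothetical, and returning 1.0 for two empty texts is an assumed convention.

```python
def jaccard_words(text_a, text_b):
    """Jaccard similarity on word sets: |A & B| / |A | B|."""
    a, b = set(text_a.split()), set(text_b.split())
    union = a | b
    # Two empty texts are treated as identical (assumed convention).
    return len(a & b) / len(union) if union else 1.0


def dice_words(text_a, text_b):
    """Sorensen-Dice similarity on word sets: 2|A & B| / (|A| + |B|)."""
    a, b = set(text_a.split()), set(text_b.split())
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


# Two of the three words are shared; the union holds four distinct words.
print(jaccard_words("python flask postgresql", "python postgresql mongodb"))
print(round(dice_words("python flask postgresql", "python postgresql mongodb"), 2))
```

Applied to `preprocessed_job_description` and `preprocessed_interviewee_responce`, these functions would measure shared vocabulary rather than shared letters.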


### NLP techniques
#### Embedding-based methods capture semantic meaning, context, and relationships between words or phrases, so they can give a more nuanced measure of text similarity than lexical overlap.

### all-MiniLM-L6-v2 (Sentence-Transformers)
#### The cell below loads all-MiniLM-L6-v2, a compact sentence-embedding model, rather than BERT base uncased. The model is uncased, which means it does not distinguish between upper and lower case: for example, "english" = "English".

In [10]:
from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load a pre-trained model 
model = SentenceTransformer('all-MiniLM-L6-v2')        

# Generate embeddings
job_description_embedding = model.encode(preprocessed_job_description, convert_to_tensor=True)
interviewee_response_embedding = model.encode(preprocessed_interviewee_responce, convert_to_tensor=True)

# Compute cosine similarity using util.cos_sim
similarity_score = util.cos_sim(job_description_embedding, interviewee_response_embedding)
print(f"Context-based Similarity Score (util.cos_sim): {similarity_score.item():.2f}")


Context-based Similarity Score (util.cos_sim): 0.76


### DistilBERT model
#### A small, fast, cheap, and light Transformer model trained by distilling BERT base.

In [11]:
distil_model=SentenceTransformer('distilbert-base-nli-mean-tokens')

# Generate embeddings (prefixed with "distil_" to keep them apart from the MiniLM embeddings)
distil_job_description_embedding = distil_model.encode(preprocessed_job_description, convert_to_tensor=True)
distil_interviewee_response_embedding = distil_model.encode(preprocessed_interviewee_responce, convert_to_tensor=True)

# Compute cosine similarity using util.cos_sim
similarity_score = util.cos_sim(distil_job_description_embedding, distil_interviewee_response_embedding)
print(f"Similarity Score (util.cos_sim): {similarity_score.item():.2f}")


Similarity Score (util.cos_sim): 0.82
