In [2]:
job_description='''
Job Description: Senior Backend Engineer

Responsibilities:

Design, develop, and maintain robust and scalable backend systems.
Collaborate with frontend and mobile teams to build seamless user experiences.
Optimize database performance and write efficient SQL queries.
Implement robust security measures to protect sensitive data.
Mentor junior engineers and foster a culture of continuous learning.
Required Skills:

Strong proficiency in backend programming languages (e.g., Python, Node.js, Ruby on Rails, Java).
Experience with database technologies (e.g., PostgreSQL, MySQL, MongoDB).
Solid understanding of RESTful API design and development.
Knowledge of cloud platforms (e.g., AWS, GCP, Azure).
Experience with containerization technologies (e.g., Docker, Kubernetes).
'''

In [3]:
interviewee_responce='''
I've been passionate about backend development for 3 years, and I'm excited to apply my skills to challenging projects.
At my previous role at egy_tech, I was responsible for building a scalable API that handled 100 requests per second.
I utilized [Specific technologies, e.g., Python, Flask, PostgreSQL] to optimize performance and ensure reliability.

I'm particularly interested in your company's focus on [database, data privacy, machine learning, Azuru].
I've been exploring Node.js and believe it could be a valuable asset to your team.
I'm eager to contribute to innovative projects and learn from experienced engineers.
'''

### Preprocessing

In [4]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

from nltk.stem import WordNetLemmatizer
import re


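nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# nltk.pos_tag (used in preprocess_text) needs the perceptron tagger data;
# newer NLTK releases may name these resources 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead
nltk.download('averaged_perceptron_tagger')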

[nltk_data] Downloading package punkt to C:\Users\Mohamed
[nltk_data]     Walid\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Mohamed
[nltk_data]     Walid\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Mohamed
[nltk_data]     Walid\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
stop_words = set(stopwords.words('english'))
translated_table = str.maketrans('', '', string.punctuation)

In [6]:
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ  # Adjective
    elif tag.startswith('V'):
        return wordnet.VERB  # Verb
    elif tag.startswith('N'):
        return wordnet.NOUN  # Noun
    elif tag.startswith('R'):
        return wordnet.ADV  # Adverb
    else:
        return wordnet.NOUN  # Default to Noun

In [7]:
def preprocess_text(text):
    """Lowercase, strip digits and punctuation, drop stop words, and lemmatize."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)          # Remove numbers
    text = text.translate(translated_table)  # Remove punctuation ("node.js" becomes "nodejs")

    text_tokens = word_tokenize(text)
    filtered_words = [word for word in text_tokens if word not in stop_words]

    # Lemmatization: reduce each word to its base (dictionary) form
    lemmatizer = WordNetLemmatizer()

    lemma_words = []
    for word in filtered_words:
        # Each word is tagged in isolation, so the tagger sees no sentence context
        pos_tag = nltk.pos_tag([word])[0][1]
        wordnet_pos = get_wordnet_pos(pos_tag)  # Map the Penn Treebank tag to a WordNet POS
        lemma_words.append(lemmatizer.lemmatize(word, pos=wordnet_pos))

    return ' '.join(lemma_words)





In [8]:
preprocessed_job_description = preprocess_text(job_description)
print(f"Preprocessed job description : {preprocessed_job_description}")

Preprocessed job description : job description senior backend engineer responsibility design develop maintain robust scalable backend system collaborate frontend mobile team build seamless user experience optimize database performance write efficient sql query implement robust security measure protect sensitive data mentor junior engineer foster culture continuous learn require skill strong proficiency backend program language eg python nodejs ruby rail java experience database technology eg postgresql mysql mongodb solid understand restful api design development knowledge cloud platform eg aws gcp azure experience containerization technology eg docker kubernetes


In [9]:
preprocessed_interviewee_responce= preprocess_text(interviewee_responce)
print(f"Preprocessed interviewee response : {preprocessed_interviewee_responce}")

Preprocessed interviewee response : ive passionate backend development year im excite apply skill challenge project previous role egytech responsible building scalable api handle request per second utilized specific technology eg python flask postgresql optimize performance ensure reliability im particularly interested company focus database data privacy machine learn azuru ive explore nodejs believe could valuable asset team im eager contribute innovative project learn experienced engineer
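
The preprocessed text above contains tokens such as `ive`, `im`, `eg` and `nodejs`, because all punctuation is deleted before tokenization. The following is a minimal sketch of one way to keep selected technical terms intact. It splits on whitespace instead of using `word_tokenize`, and `PROTECTED_TERMS` and `clean_tokens` are hypothetical names that are not part of the notebook.

```python
import string

# Hypothetical allow-list of terms whose internal punctuation should be kept.
PROTECTED_TERMS = {"node.js"}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tokens(text, protected=PROTECTED_TERMS):
    """Lowercase and strip punctuation, except inside allow-listed terms."""
    cleaned = []
    for raw in text.lower().split():
        core = raw.strip(string.punctuation)  # trim leading/trailing punctuation only
        if core not in protected:
            core = core.translate(PUNCT_TABLE)  # remove any remaining punctuation
        if core:
            cleaned.append(core)
    return cleaned

print(clean_tokens("I've been exploring Node.js, and PostgreSQL."))
# ['ive', 'been', 'exploring', 'node.js', 'and', 'postgresql']
```

Contractions such as "I've" still collapse to `ive` here; handling them would need a separate expansion step.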


## Extract important keywords from the job description and the interviewee response

Install the dependencies first:

    pip install keybert spacy nltk
    python -m spacy download en_core_web_md

In [10]:
from keybert import KeyBERT
import spacy


In [11]:

def extract_relevant_keywords(text, nlp=None, min_word_length=2, max_keywords=15):
    """
    Extract relevant keywords with robust filtering and customization options.
    
    Args:
        text (str): Input text for keyword extraction.
        nlp (spacy.Language): spaCy language model for linguistic analysis.
        min_word_length (int): Minimum length of keywords to consider.
        max_keywords (int): Maximum number of keywords to return.
    
    Returns:
        List[str]: Refined list of keywords.
        List[float]: Corresponding scores for the keywords.
    """
    
    # Load spaCy model if not provided
    if nlp is None:
        nlp = spacy.load("en_core_web_md")
    
    # Initialize KeyBERT
    kw_model = KeyBERT()
    
    # Extract raw keywords
    raw_keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 2),  # Allow phrases of 1 to 2 words
        top_n=max_keywords * 3,  # Extract more initially for better filtering
        use_mmr=True,  # Maximal Marginal Relevance
        diversity=0.7  # Increase diversity to balance relevance
    )
    
    # Filter keywords
    filtered_keywords = []
    valid_pos = {"NOUN", "PROPN"}  # Focus on nouns and proper nouns for relevance
    for keyword, score in raw_keywords:
        doc = nlp(keyword)  # Process the keyword with spaCy
        
        # Check linguistic and quality criteria
        if (
            len(keyword) >= min_word_length and  # Minimum keyword length
            len(keyword.split()) <= 2 and  # Limit to 2-word phrases
            all(token.pos_ in valid_pos for token in doc)  # Check POS
        ):
            filtered_keywords.append((keyword.strip().lower(), score))
    
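    # Remove duplicate keywords (keeping the highest score for each) and sort by score
    best_scores = {}
    for keyword, score in filtered_keywords:
        if keyword not in best_scores or score > best_scores[keyword]:
            best_scores[keyword] = score
    unique_keywords = sorted(best_scores.items(), key=lambda x: x[1], reverse=True)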
    
    # Limit to max_keywords and store both keywords and scores
    keywords = [kw for kw, _ in unique_keywords[:max_keywords]]
    scores = [score for _, score in unique_keywords[:max_keywords]]
    
    return keywords, scores
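
The `use_mmr=True` and `diversity=0.7` arguments above ask KeyBERT to trade relevance against redundancy when picking keywords. The sketch below illustrates the general Maximal Marginal Relevance idea on toy 2-D vectors. It is an illustration only, not KeyBERT's code, and the vectors, `cosine` and `mmr_select` are made up for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(doc_vec, candidates, top_n=2, diversity=0.7):
    """Pick top_n phrases, penalising similarity to phrases already chosen."""
    relevance = {p: cosine(v, doc_vec) for p, v in candidates.items()}
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant
    remaining = [p for p in candidates if p != selected[0]]
    while remaining and len(selected) < top_n:
        def mmr_score(p):
            redundancy = max(cosine(candidates[p], candidates[s]) for s in selected)
            return (1 - diversity) * relevance[p] - diversity * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy embeddings (hypothetical values, not real model output)
doc_vec = [1.0, 1.0]
candidates = {
    "backend engineer": [1.0, 0.9],
    "backend developer": [1.0, 0.8],
    "kubernetes": [0.2, 1.0],
}
print(mmr_select(doc_vec, candidates, diversity=0.7))  # ['backend engineer', 'kubernetes']
print(mmr_select(doc_vec, candidates, diversity=0.0))  # ['backend engineer', 'backend developer']
```

With a high diversity value the near-duplicate phrase is skipped in favour of a less relevant but different one, which may explain why the extracted keyword lists mix strong and weak phrases.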


In [12]:
key_words_JobD, key_words_JobD_scores = extract_relevant_keywords(preprocessed_job_description)
print("Keywords:", key_words_JobD)
print("Scores:", key_words_JobD_scores)


Keywords: ['backend engineer', 'proficiency backend', 'data mentor', 'job description', 'technology postgresql', 'experience database', 'knowledge cloud', 'mentor junior', 'api design', 'responsibility design', 'kubernetes', 'experience containerization', 'mysql', 'java', 'gcp azure']
Scores: [0.651, 0.4771, 0.4636, 0.4554, 0.4319, 0.4262, 0.3726, 0.3447, 0.3348, 0.3321, 0.3166, 0.3149, 0.2747, 0.2714, 0.2557]


In [38]:
print(f"the length of keywords in the job description is : {len(key_words_JobD)}")

the length of keywords in the job description is : 15


In [13]:
key_words_interviewee,key_words_interviewee_scores = extract_relevant_keywords(preprocessed_interviewee_responce)

print("Keywords:", key_words_interviewee)
print("Scores:", key_words_interviewee_scores)

Keywords: ['backend development', 'asset team', 'development year', 'technology python', 'database', 'skill challenge', 'postgresql', 'machine', 'performance', 'python flask', 'data privacy']
Scores: [0.5504, 0.4411, 0.3551, 0.3276, 0.327, 0.3259, 0.3188, 0.2793, 0.2254, 0.2124, 0.1523]


In [39]:
print(f"the length of keywords in the interviewee response is : {len(key_words_interviewee)}")

the length of keywords in the interviewee response is : 11


In [40]:
total_keywords=len(key_words_JobD)+len(key_words_interviewee)
print(f"the length of total keywords is : {total_keywords}")

the length of total keywords is : 26


# Under Testing

## Get synonyms and calculate a similarity score for each synonym (job description - interviewee)

In [18]:
from nltk.util import ngrams
from nltk.corpus import wordnet

In [21]:
# Function to fetch synonyms for a word using WordNet
def get_synonyms(word):
    """Fetch a set of synonyms for a word using WordNet."""
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
    return synonyms

# Function to calculate similarity between words using Wu-Palmer Similarity
def get_similarity(word1, word2):
    """Calculate the similarity between two words using WordNet's Wu-Palmer similarity."""
    syn1 = wordnet.synsets(word1)
    syn2 = wordnet.synsets(word2)
    
    if syn1 and syn2:
        # Compare only the first synset of each word
        score = syn1[0].wup_similarity(syn2[0])  # Wu-Palmer similarity (range: 0 to 1)
        return score if score is not None else 0  # wup_similarity may return None
    return 0  # Return 0 if either word has no synsets

# Function to generate n-grams (1-gram and 2-gram) from the tokens
def generate_ngrams(tokens, n=2):
    """Generate n-grams from the list of tokens."""
    n_grams = ngrams(tokens, n)
    return [' '.join(gram) for gram in n_grams]

# Function to combine each bigram with its synonyms and similarity
def combine_with_synonyms_and_similarity(doc, n=2):
    """Combine each bigram in the text with its synonyms and calculate similarity."""
    combined_dict = {}
    tokens = [token.lower() for token in doc]  # Tokenize and lowercase
    
    n_grams = generate_ngrams(tokens, n)  # Generate n-grams
    
    for gram in n_grams:
        synonyms_with_scores = {}
        words_in_bigram = gram.split()  # Split bigram into individual words
        
        for word in words_in_bigram:
            synonyms = get_synonyms(word)  # Get synonyms for the word
            
            for synonym in synonyms:
                if word != synonym:  # Avoid self-similarity
                    similarity_score = get_similarity(word, synonym)
                    synonyms_with_scores[synonym] = similarity_score
        
        combined_dict[gram] = synonyms_with_scores  # Store the bigram with synonyms and scores
    
    return combined_dict


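# The keyword list is passed as the token list with n=1, so each keyword phrase
# becomes one dictionary key; synonyms are then looked up per word in the phrase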
result = combine_with_synonyms_and_similarity(key_words_JobD, n=1)
print(result)



{'backend engineer': {'locomotive_engineer': 0.6, 'technologist': 0.75, 'organise': 0.16666666666666666, 'orchestrate': 0.16666666666666666, 'organize': 0.2, 'direct': 0.11764705882352941, 'applied_scientist': 0.75, 'mastermind': 0.7058823529411765, 'engine_driver': 0.6, 'railroad_engineer': 0.6}, 'proficiency backend': {'technique': 0.26666666666666666}, 'data mentor': {'data_point': 0.36363636363636365, 'datum': 0.36363636363636365, 'information': 0.4, 'wise_man': 1.0}, 'job description': {'chore': 0.75, 'problem': 0.2857142857142857, 'farm_out': 0.18181818181818182, 'subcontract': 0.25, 'line_of_work': 1.0, 'speculate': 0.15384615384615385, 'occupation': 1.0, 'task': 0.8, 'line': 0.3076923076923077, 'Job': 1.0, 'Book_of_Job': 0.26666666666666666, 'business': 0.2857142857142857, 'caper': 0.1111111111111111, 'verbal_description': 1.0}, 'technology postgresql': {'applied_science': 0.35294117647058826, 'engineering_science': 0.35294117647058826, 'engineering': 1.0}, 'experience database
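
The keys in the output above are whole keyword phrases because the keyword list itself was used as the token list with `n=1`. The following sketch reproduces that windowing behaviour in plain Python; `simple_ngrams` is a hypothetical stand-in for `generate_ngrams`, written without NLTK so it can be run on its own.

```python
def simple_ngrams(tokens, n):
    """Return all contiguous n-token windows, each joined by a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

keywords = ['backend engineer', 'api design']

# Passing the keyword list with n=1 returns the phrases unchanged
print(simple_ngrams(keywords, 1))                    # ['backend engineer', 'api design']

# Tokenizing a single phrase first gives word-level n-grams
print(simple_ngrams('backend engineer'.split(), 1))  # ['backend', 'engineer']
print(simple_ngrams('backend engineer'.split(), 2))  # ['backend engineer']

# A single-word keyword has no bigram at all
print(simple_ngrams(['kubernetes'], 2))              # []
```

The last case suggests why single-word keywords such as `kubernetes` cannot appear as keys when the phrases are tokenized and `n=2` is used.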

### Get synonyms and filter them by a similarity threshold of 0.9

In [22]:
from nltk.corpus import wordnet
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# Function to fetch synonyms for a word using WordNet
def get_synonyms(word):
    """Fetch a set of synonyms for a word using WordNet."""
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
    return synonyms

# Function to calculate similarity between words using Wu-Palmer Similarity
def get_similarity(word1, word2):
    """Calculate the similarity between two words using WordNet's Wu-Palmer similarity."""
    syn1 = wordnet.synsets(word1)
    syn2 = wordnet.synsets(word2)
    
    if syn1 and syn2:
        # Compare only the first synset of each word
        score = syn1[0].wup_similarity(syn2[0])  # Wu-Palmer similarity (range: 0 to 1)
        return score if score is not None else 0  # wup_similarity may return None
    return 0  # Return 0 if either word has no synsets

# Function to generate n-grams (1-gram and 2-gram) from the tokens
def generate_ngrams(tokens, n=2):
    """Generate n-grams from the list of tokens."""
    n_grams = ngrams(tokens, n)
    return [' '.join(gram) for gram in n_grams]

# Function to combine each bigram with its synonyms and similarity
def combine_with_synonyms_and_similarity(key_words, n=2, similarity_threshold=0.98):
    """Combine each bigram in the text with its synonyms and calculate similarity, filtering by similarity threshold."""
    combined_dict = {}
    
    for doc in key_words:
        tokens = [token.lower() for token in word_tokenize(doc)]  # Tokenize and lowercase the doc
        n_grams = generate_ngrams(tokens, n)  # Generate n-grams
        
        for gram in n_grams:
            synonyms_with_scores = {}
            words_in_bigram = gram.split()  # Split bigram into individual words
            
            for word in words_in_bigram:
                synonyms = get_synonyms(word)  # Get synonyms for the word
                
                for synonym in synonyms:
                    if word != synonym:  # Avoid self-similarity
                        similarity_score = get_similarity(word, synonym)
                        # Only include synonyms with similarity >= similarity_threshold
                        if similarity_score >= similarity_threshold:
                            synonyms_with_scores[synonym] = similarity_score
            
            if synonyms_with_scores:  # Only add to dictionary if there are valid synonyms
                combined_dict[gram] = synonyms_with_scores  # Store the bigram with synonyms and scores
    
    return combined_dict

In [25]:
Synonyms_similarity_JobD = combine_with_synonyms_and_similarity(key_words_JobD, n=2, similarity_threshold=0.90)
Synonyms_similarity_JobD


{'data mentor': {'wise_man': 1.0},
 'job description': {'line_of_work': 1.0,
  'occupation': 1.0,
  'Job': 1.0,
  'verbal_description': 1.0},
 'technology postgresql': {'engineering': 1.0},
 'knowledge cloud': {'noesis': 1.0, 'cognition': 1.0},
 'mentor junior': {'wise_man': 1.0, 'Junior': 1.0},
 'api design': {'designing': 1.0},
 'responsibility design': {'obligation': 1.0, 'duty': 1.0, 'designing': 1.0},
 'gcp azure': {'lazuline': 1.0, 'sky-blue': 1.0, 'cerulean': 1.0}}
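
Every score that survives the threshold above is exactly 1.0, most likely because each of those synonyms shares its first synset with the original word, so the comparison is a synset against itself. The sketch below shows the Wu-Palmer formula, 2 * depth(LCS) / (depth(a) + depth(b)), on a toy taxonomy. The tree and the names `PARENT`, `path_to_root`, `depth` and `wu_palmer` are invented for illustration, and NLTK's depth conventions may differ in detail.

```python
# Toy taxonomy (hypothetical, not WordNet data): child -> parent
PARENT = {
    "entity": None,
    "person": "entity",
    "engineer": "person",
    "mentor": "person",
    "color": "entity",
    "azure": "color",
}

def path_to_root(node):
    """Return the chain node -> ... -> root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """2 * depth(lowest common ancestor) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("engineer", "engineer"))         # 1.0 (a node compared with itself)
print(round(wu_palmer("engineer", "mentor"), 3)) # 0.667 (siblings under "person")
print(round(wu_palmer("engineer", "azure"), 3))  # 0.333 (only the root in common)
```

Under this reading, a threshold of 0.9 keeps little more than same-synset lemmas, including unrelated senses such as the colour meaning of "azure".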

In [26]:
Synonyms_similarity_interviewee = combine_with_synonyms_and_similarity(key_words_interviewee, n=2, similarity_threshold=0.90)
Synonyms_similarity_interviewee

{'asset team': {'plus': 1.0},
 'development year': {'twelvemonth': 1.0, 'yr': 1.0},
 'technology python': {'engineering': 1.0, 'Python': 1.0},
 'skill challenge': {'acquirement': 1.0},
 'python flask': {'Python': 1.0},
 'data privacy': {'seclusion': 1.0}}

In [47]:
total_Synonyms_similarity_with_threshold = len(Synonyms_similarity_JobD)+len(Synonyms_similarity_interviewee)
print(f"the length of total keywords with similarity thresholding in job description and interviewee response is : {total_Synonyms_similarity_with_threshold}")

the length of total keywords with similarity thresholding in job description and interviewee response is : 14


In [43]:
print(f"the length of total keywords in job description and interviewee response is : {total_keywords}")

the length of total keywords in job description and interviewee response is : 26


## Similarity percentage between the job description and the interviewee response

In [51]:
# Function to calculate percentage similarity using hash maps
def calculate_similarity_percentage_using_hash_map(dict1, dict2):
    count = 0

    # Create a hash map for synonyms from both dictionaries
    synonym_map_jobd = {}
    synonym_map_interviewee = {}

    # Fill the synonym map for Synonyms_similarity_JobD
    for key, synonyms in dict1.items():
        for synonym in synonyms:
            if synonym not in synonym_map_jobd:
                synonym_map_jobd[synonym] = set()
            synonym_map_jobd[synonym].add(key)

    # Fill the synonym map for Synonyms_similarity_interviewee
    for key, synonyms in dict2.items():
        for synonym in synonyms:
            if synonym not in synonym_map_interviewee:
                synonym_map_interviewee[synonym] = set()
            synonym_map_interviewee[synonym].add(key)

    # Loop through the synonyms in Synonyms_similarity_JobD
    for synonym, keywords in synonym_map_jobd.items():
        if synonym in synonym_map_interviewee:
            matching_keywords_jobd = keywords
            matching_keywords_interviewee = synonym_map_interviewee[synonym]
            # Count keyword phrases listed under this synonym in BOTH maps;
            # a phrase only counts if it is identical in the two documents
            count += len(matching_keywords_jobd & matching_keywords_interviewee)

            # Remove the synonym from interviewee map to avoid double counting
            del synonym_map_interviewee[synonym]

    # Calculate the percentage similarity
    # (total_keywords is the global computed in an earlier cell)
    similarity_percentage = (count / total_keywords) * 100
    return similarity_percentage

# Calculate similarity percentage using hash maps
similarity_percentage = calculate_similarity_percentage_using_hash_map(Synonyms_similarity_JobD, Synonyms_similarity_interviewee)
print(f"Similarity Percentage using Hash Map: {similarity_percentage:.2f}%")

Similarity Percentage using Hash Map: 0.00%
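
The 0.00% above follows from the intersection step: it compares keyword phrases, and no phrase (for example `technology postgresql` versus `technology python`) is identical in both dictionaries, even though the synonym `engineering` appears in both. One possible alternative, sketched here under the assumption that overlap should be measured on the synonyms themselves, is a Jaccard-style percentage. `collect_synonyms` and `synonym_overlap_percentage` are hypothetical names, and the two dictionaries are short excerpts of the outputs shown earlier.

```python
def collect_synonyms(synonym_dict):
    """Flatten {keyword: {synonym: score}} into a set of lowercase synonyms."""
    return {syn.lower() for syns in synonym_dict.values() for syn in syns}

def synonym_overlap_percentage(dict1, dict2):
    """Jaccard overlap of the two synonym sets, as a percentage."""
    syns1, syns2 = collect_synonyms(dict1), collect_synonyms(dict2)
    union = syns1 | syns2
    return 100 * len(syns1 & syns2) / len(union) if union else 0.0

# Excerpts of the thresholded dictionaries printed earlier
job_excerpt = {
    'technology postgresql': {'engineering': 1.0},
    'api design': {'designing': 1.0},
}
interviewee_excerpt = {
    'technology python': {'engineering': 1.0, 'Python': 1.0},
    'data privacy': {'seclusion': 1.0},
}

print(collect_synonyms(job_excerpt) & collect_synonyms(interviewee_excerpt))  # {'engineering'}
print(synonym_overlap_percentage(job_excerpt, interviewee_excerpt))           # 25.0
```

This measure normalises by the size of the combined synonym set rather than by `total_keywords`, so its values are not directly comparable with the percentage printed above.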
