# Developing Algorithms for Evaluating Competencies of Candidates for IT Positions: A SMART Analysis Approach - Bachelor's Thesis

Modern businesses increasingly rely on current technologies and analytical tools for effective human resource management. Objective, accurate, and efficient methods of assessing candidates’ skills are vital for high-quality staffing and competitiveness. A modern approach based on SMART analysis can significantly improve candidate selection and human resource management: it helps ensure that candidates’ skills align with the needs of the IT industry, which in turn supports the development of the sector as a whole. In my Bachelor’s thesis, I examined the general recruitment process. The first stage is pre-planning how candidates’ skills will be assessed, which is one of the key aspects of effective candidate selection. The second stage is collecting a large body of data, such as resumes, containing essential information about each candidate’s skills, education, work experience, achievements, and so on. The third stage is the selection itself: drawing up a short list of a few applicants for a vacant position from the initial large pool of candidates.  

To address the problem of selecting the most suitable candidate, I showed the need for an automated skill-assessment system precisely at the third stage, where such a system can evaluate candidates’ skills most effectively. The proposed automated solution consists of three stages:  
1) Parsing information from candidates’ resumes using libraries such as pypdf, re, and NLTK;
2) Converting the extracted information into vectors using SMART analysis methods such as Bag-of-words, TF-IDF, Word2Vec, GloVe, and fastText. This stage also covers capturing knowledge about words and their contexts in the text using models such as ELMo and BERT;
3) Comparing the vectors of the resume and job-description corpora to assess the candidate’s skills, experience, and qualifications against the company’s requirements, using similarity measures such as Jaccard, Dice, cosine, sqrt-cos, and ISC (improved sqrt-cos).
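The three stages above can be illustrated on toy data. The following is only a minimal sketch using the standard library: the two strings stand in for already-parsed resume and job-description text, and `bow_vectors` and `cosine` are hypothetical helper names, not the functions defined later in this notebook.

```python
import math
from collections import Counter

def bow_vectors(text_a, text_b):
    """Stage 2 (sketch): count words over the joint vocabulary of both texts."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    vocabulary = sorted(set(counts_a) | set(counts_b))
    return ([counts_a[word] for word in vocabulary],
            [counts_b[word] for word in vocabulary])

def cosine(u, v):
    """Stage 3 (sketch): cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Stage 1 is assumed done: these strings play the role of parsed, cleaned text.
resume_text = "business analyst sql jira agile"
position_text = "business analyst sql optimization transportation"

score = cosine(*bow_vectors(resume_text, position_text))
print(round(score, 3))
```

Three of the five words in each toy text are shared, so the score reflects a partial match between the resume and the position.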

### Section 1: Text Preprocessing

In [None]:
# Import all necessary libraries
import re
import nltk
import torch
import warnings
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from pypdf import PdfReader
from itertools import product
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from scipy import stats
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import KeyedVectors
from transformers import BertTokenizer, BertModel
tf.compat.v1.enable_eager_execution()
warnings.filterwarnings('ignore')

In [None]:
# Download all necessary sets of words
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dmytro.zhuk_whalebon\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dmytro.zhuk_whalebon\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dmytro.zhuk_whalebon\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\dmytro.zhuk_whalebon\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
# Text preprocessing functions

# Read resume
def read_resume(resume) -> str:
    try:
        reader = PdfReader(resume)
        number_of_pages = reader.get_num_pages()
        text = ''
        for page_number in range(number_of_pages):
            page = reader.get_page(page_number)
            page_content = page.extract_text()
            text+=page_content
        return text
    except Exception as e:
        return f'The resume {resume} was not scraped due to {e}.'

# Read position description
def read_position(position) -> str:
    try:
        # Explicit encoding: the platform default can garble characters such as ’
        with open(position, encoding='utf-8') as file:
            return file.read()
    except Exception as e:
        return f'The position {position} was not scraped due to {e}.'

# Tokenize description
def tokenize_description(description) -> list:
    try:
        return nltk.sent_tokenize(description)
    except Exception as e:
        return f'The description {description} was not scraped due to {e}.'

# Lemmatization; stopwords, punctuation, and whitespace removal
def preprocessing_text(text) -> str:
    try:
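        lemmatizer = WordNetLemmatizer()
        # Build the stopword set once instead of once per word
        stop_words = set(stopwords.words('english'))
        # Note: filtering runs before lowercasing, so capitalized stopwords ("My", "The") are kept
        text = [lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words]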
        text = ' '.join(text)
        text = re.sub(r'\d+', '', text)
        text = re.sub(r'[^\w\s]', '', text)
        text = re.sub(r'\s+', ' ', text)
        text = text.lower().strip()
        return text
    except Exception as e:
        return f'The text was not preprocessed due to {e}.'

# Create one corpus from smaller corpora
def create_corpus(text) -> str:
    try:
        return ' '.join(text)
    except Exception as e:
        return f'The corpus was not created due to {e}.'

In [None]:
# Metrics and helper functions

# Cosine similarity - Euclidean distance
def cos_sim(a, b):
    try:
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    except Exception as e:
        return f'The cosine similarity was not calculated due to {e}.'

# Sqrt-cos similarity (based on the Hellinger distance); inputs must be non-negative, so vectors are first scaled to the range [0, 1]
def sqrt_cos_sim(a, b):
    try:
        def hellinger_distance(a,b):
            return np.sqrt(np.sum((np.sqrt(a) - np.sqrt(b))**2)) / np.sqrt(2)
        return np.sqrt(1 - hellinger_distance(a,b)**2)
    except Exception as e:
        return f'Sqrt-cos similarity was not calculated due to {e}.'

# MinMax scaler
def normalize_vector(vector):
    try:
        return (vector - np.min(vector)) / (np.max(vector) - np.min(vector))
    except Exception as e:
        return f'Vector normalization was not calculated due to {e}.'

# Improved sqrt-cos similarity - Manhattan distance
def improved_sqrt_cos_sim(a, b):
    try:
        return np.sqrt(np.dot(a,b)) / (np.sqrt(np.linalg.norm(a)) * np.sqrt(np.linalg.norm(b)))
    except Exception as e:
        return f'Improved sqrt-cos similarity was not calculated due to {e}.'

# Jaccard similarity
def jaccard_sim(a, b):
    try:
        set1 = set(a)
        set2 = set(b)
        intersection = len(set1.intersection(set2))
        union = len(set1.union(set2))
        similarity = intersection / union
        return similarity
    except Exception as e:
        return f'Jaccard similarity was not calculated due to {e}.'

# Dice similarity
def dice_sim(a, b):
    try:
        set1 = set(a)
        set2 = set(b)
        intersection = len(set1.intersection(set2))
        dice = (2.0 * intersection) / (len(set1) + len(set2))
        return dice
    except Exception as e:
        return f'Dice similarity was not calculated due to {e}.'

# Cosine similarity for tensors - Euclidean distance
def cosine_similarity_tensorflow(a, b):
    try:
        a = tf.reshape(a, [-1])
        b = tf.reshape(b, [-1])
        dot_product = tf.reduce_sum(a * b)
        magnitude1 = tf.sqrt(tf.reduce_sum(a * a))
        magnitude2 = tf.sqrt(tf.reduce_sum(b * b))
        similarity = dot_product / (magnitude1 * magnitude2)
        return similarity
    except Exception as e:
        return f'The cosine similarity was not calculated due to {e}.'

# Sqrt-cos similarity for tensors (based on the Hellinger distance); inputs must be non-negative, so values are clipped to the range [0, 1]
def sqrt_cos_sim_tensorflow(a, b):
    try:
        a = tf.clip_by_value(a, 0.0, 1.0)
        b = tf.clip_by_value(b, 0.0, 1.0)
        def hellinger_distance(a, b):
            return tf.sqrt(tf.reduce_sum(tf.square(tf.sqrt(a) - tf.sqrt(b)))) / tf.sqrt(2.0)
        similarity = tf.sqrt(1 - tf.square(hellinger_distance(a, b)))
        return similarity
    except Exception as e:
        return f'Sqrt-cos similarity was not calculated due to {e}.'

# Improved sqrt-cos similarity for tensors - Manhattan distance
def improved_sqrt_cos_sim_tensorflow(a, b):
    try:
        a = tf.reshape(a, [-1])
        b = tf.reshape(b, [-1])
        dot_product = tf.sqrt(tf.reduce_sum(tf.multiply(a, b)))
        magnitude1 = tf.sqrt(tf.norm(a))
        magnitude2 = tf.sqrt(tf.norm(b))
        similarity = dot_product / (magnitude1 * magnitude2)
        return similarity
    except Exception as e:
        return f'Improved sqrt-cos similarity was not calculated due to {e}.'

In [None]:
# Models word embeddings functions

# Bag-of-words model
def bag_of_words(resume_corpus, position_corpus) -> list:
    try:
        resume_tokens = word_tokenize(resume_corpus)
        position_tokens = word_tokenize(position_corpus)
        vocabulary = set(resume_tokens + position_tokens)
        vectorizer = CountVectorizer(vocabulary=vocabulary)
        vectorizer.fit([resume_corpus, position_corpus])
        resume_bow = vectorizer.transform([resume_corpus])
        position_bow = vectorizer.transform([position_corpus])
        feature_names = vectorizer.get_feature_names_out()
        df = pd.DataFrame(
            data=[resume_bow[0].toarray()[0], position_bow[0].toarray()[0]],
            columns=feature_names,
            index=[resume_corpus, position_corpus])
        return [df, resume_bow[0].toarray()[0], position_bow[0].toarray()[0]]
    except Exception as e:
        return f'Bag-of-words model was not created due to {e}.'

# TF-IDF model
def tf_idf(resume_corpus, position_corpus) -> list:
    try:
        resume_tokens = word_tokenize(resume_corpus)
        position_tokens = word_tokenize(position_corpus)
        vocabulary = set(resume_tokens + position_tokens)
        vectorizer = TfidfVectorizer(vocabulary=vocabulary)
        vectorizer.fit([resume_corpus, position_corpus])
        resume_bow = vectorizer.transform([resume_corpus])
        position_bow = vectorizer.transform([position_corpus])
        feature_names = vectorizer.get_feature_names_out()
        df = pd.DataFrame(
            data=[resume_bow[0].toarray()[0], position_bow[0].toarray()[0]],
            columns=feature_names,
            index=[resume_corpus, position_corpus])
        return [df, resume_bow[0].toarray()[0], position_bow[0].toarray()[0]]
    except Exception as e:
        return f'Tf-idf model was not created due to {e}.'

# Load a pretrained Word2Vec, GloVe, or fastText model
def build_model_ncontext(model_path) -> KeyedVectors:
    try:
        return KeyedVectors.load(model_path)
    except Exception as e:
        return f'The model {model_path} was not built due to {e}.'

# Word2Vec, GloVe, and fastText word embeddings (document vector = sum of word vectors)
def ncontext_word_embeddings(resume_tokens, position_tokens, model, vectors_size) -> list:
    try:
        p_sen1 = [item for sublist in resume_tokens for item in sublist]
        p_sen2 = [item for sublist in position_tokens for item in sublist]
        sen_vec1 = np.zeros(vectors_size)
        sen_vec2 = np.zeros(vectors_size)
        for val in p_sen1:
            try:
                sen_vec1 = np.add(sen_vec1, model[val])
            except KeyError:
                # Out-of-vocabulary token: contributes nothing to the sum
                continue
        for val in p_sen2:
            try:
                sen_vec2 = np.add(sen_vec2, model[val])
            except KeyError:
                # Out-of-vocabulary token: contributes nothing to the sum
                continue
        return [sen_vec1, sen_vec2]
    except Exception as e:
        return f'Word embeddings were not obtained by model {model} due to {e}.'

# Build BERT model and obtain word embeddings
def context_word_embeddings_bert(model_name, resume_corpus, position_corpus, tokenizer, model) -> list:
    try:
        return [get_word_embeddings(resume_corpus, tokenizer, model), get_word_embeddings(position_corpus, tokenizer, model)]
    except Exception as e:
        return f'Word embeddings were not obtained by BERT model due to {e}.'

# Helper function to obtain word embeddings from BERT model
def get_word_embeddings(sentence, tokenizer, model):
    try:
        tokens = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True)
        with torch.no_grad():
            outputs = model(**tokens)
        hidden_states = outputs.hidden_states
        embeddings = hidden_states[-1].squeeze(0)  # Embeddings from last layer
        if embeddings.size(0) > 100:
            embeddings = embeddings[:100, :]
        return embeddings
    except Exception as e:
        return f'Word embeddings were not obtained by BERT model due to {e}.'

# Obtain word embeddings from a loaded ELMo model
def context_word_embeddings_elmo(elmo_model, resume_corpus, position_corpus):
    try:
        return elmo_model.signatures["default"](tf.constant([resume_corpus, position_corpus]))["elmo"]
    except Exception as e:
        return f'Word embeddings were not obtained by ELMo model due to {e}.'
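
`ncontext_word_embeddings` builds one document vector by summing the word vectors and skipping out-of-vocabulary tokens. Because cosine similarity ignores vector length, summing and averaging give the same cosine score. The sketch below checks this on a hypothetical three-dimensional `toy_model` (a plain dictionary, not a gensim model); `pool` and `cosine` are illustrative names.

```python
import math

# Hypothetical 3-dimensional "embeddings"; the pretrained models use 300 dimensions.
toy_model = {
    "sql": [1.0, 0.0, 0.0],
    "jira": [0.0, 1.0, 0.0],
    "agile": [0.0, 1.0, 1.0],
}

def pool(tokens, model, size, average=False):
    """Sum (or average) the vectors of in-vocabulary tokens; skip the rest."""
    total = [0.0] * size
    known = 0
    for token in tokens:
        vector = model.get(token)
        if vector is None:  # out-of-vocabulary token
            continue
        total = [t + v for t, v in zip(total, vector)]
        known += 1
    if average and known:
        total = [t / known for t in total]
    return total

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

resume_words = ["sql", "jira", "clicku"]  # "clicku" is out of vocabulary
position_words = ["agile", "sql"]

cos_sum = cosine(pool(resume_words, toy_model, 3),
                 pool(position_words, toy_model, 3))
cos_mean = cosine(pool(resume_words, toy_model, 3, average=True),
                  pool(position_words, toy_model, 3, average=True))
print(cos_sum, cos_mean)
```

Measures that are not scale-invariant would behave differently, so the choice between summing and averaging matters only for those.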

In [None]:
# Read resume and position descriptions
resume = read_resume("Business analyst Male.pdf")
position = read_position("Business Analyst.txt")

In [None]:
print(resume)

Page 1 of 2 Abc Abc  
E-mail  : 111.abc@gmail.com
Position : BUSINESS ANALYST /SYSTEM ANALYST 
SUMMARY 
I’d like to suggest to you my expertise , knowle dge and skills . My skills 
of instant response to change s in the cases, proficiency of 
negotiations , mediat ion, issue s solving allow me be sure in my ability 
to quickly dive into a new sphere of activity  
EDUCATION 
Years Educational institution  Qualification  
1989-1994 Yaroslav Mudryi National Law University  LL.M.
PROFESSIONAL SKILLS  
OS Windows, Android, iOS , MacOS, Ubuntu  
Technologies  WEB, mobile  
DBMS  MS SQL, MySQL, Oracle, Access  (surface immersion ) 
BA support tools  Jira, Confluence, ClickU p, MS project, Google Docs, Enterprise architect , 
MS Office etc. , Postman  
Design , diagrams  AdobeXD, inVision, Figma, Draw.io, Miro, 
Methodologies  Agile (Scrum, Kanban), Waterfall  
Foreign languages  English Upper Intermediate  
Other skills  IDEF, UML, BPMN , REST API, GraphQL , XML, JSON  
SERVICE RECORDS  
Year

In [None]:
print(position)

https://www.amazon.jobs/en/jobs/2608538/business-analyst

DESCRIPTION
Will you be able to determine if Amazon's Middle Mile automated driver supply planning is predicting the 'right' number of driver shifts to deliver millions of packages every day to its customers? As a Business Analyst, you will operate at the crossroads of multiple complex Amazon systems, getting global visibility of how Amazon moves inventory across our network and serves our customers. You will work with the core optimization models that drive the Middle Mile planning business for Amazon. You will enable the creation of products that drive ever-greater automation, scalability and optimization of every aspect of transportation, removing cost and delivering speed of execution for our customers. The impact of your work will be global, material, and remarkable. The successful candidate will be voraciously curious about Amazon’s transportation operations and how data consumed and produced by our systems can be used t

In [None]:
# Tokenize descriptions into list of sentences
resume = tokenize_description(resume)
position = tokenize_description(position)

In [None]:
resume

['Page 1 of 2 Abc Abc  \nE-mail  : 111.abc@gmail.com\nPosition : BUSINESS ANALYST /SYSTEM ANALYST \nSUMMARY \nI’d like to suggest to you my expertise , knowle dge and skills .',
 'My skills \nof instant response to change s in the cases, proficiency of \nnegotiations , mediat ion, issue s solving allow me be sure in my ability \nto quickly dive into a new sphere of activity  \nEDUCATION \nYears Educational institution  Qualification  \n1989-1994 Yaroslav Mudryi National Law University  LL.M.',
 'PROFESSIONAL SKILLS  \nOS Windows, Android, iOS , MacOS, Ubuntu  \nTechnologies  WEB, mobile  \nDBMS  MS SQL, MySQL, Oracle, Access  (surface immersion ) \nBA support tools  Jira, Confluence, ClickU p, MS project, Google Docs, Enterprise architect , \nMS Office etc.',
 ", Postman  \nDesign , diagrams  AdobeXD, inVision, Figma, Draw.io, Miro, \nMethodologies  Agile (Scrum, Kanban), Waterfall  \nForeign languages  English Upper Intermediate  \nOther skills  IDEF, UML, BPMN , REST API, GraphQL , X

In [None]:
position

["https://www.amazon.jobs/en/jobs/2608538/business-analyst\n\nDESCRIPTION\nWill you be able to determine if Amazon's Middle Mile automated driver supply planning is predicting the 'right' number of driver shifts to deliver millions of packages every day to its customers?",
 'As a Business Analyst, you will operate at the crossroads of multiple complex Amazon systems, getting global visibility of how Amazon moves inventory across our network and serves our customers.',
 'You will work with the core optimization models that drive the Middle Mile planning business for Amazon.',
 'You will enable the creation of products that drive ever-greater automation, scalability and optimization of every aspect of transportation, removing cost and delivering speed of execution for our customers.',
 'The impact of your work will be global, material, and remarkable.',
 'The successful candidate will be voraciously curious about Amazon’s transportation operations and how data consumed and produced by 

In [None]:
# Preprocess sentences in descriptions
resume = [preprocessing_text(sentence) for sentence in resume]
position = [preprocessing_text(sentence) for sentence in position]

In [None]:
resume

['page abc abc email abcgmailcom position business analyst system analyst summary id like suggest expertise knowle dge skill',
 'my skill instant response change cases proficiency negotiation mediat ion issue solving allow sure ability quickly dive new sphere activity education years educational institution qualification yaroslav mudryi national law university llm',
 'professional skills os windows android ios macos ubuntu technologies web mobile dbms ms sql mysql oracle access surface immersion ba support tool jira confluence clicku p ms project google docs enterprise architect ms office etc',
 'postman design diagram adobexd invision figma drawio miro methodologies agile scrum kanban waterfall foreign language english upper intermediate other skill idef uml bpmn rest api graphql xml json service records years employer position brief job description telesens business analyst dipocket business analyst resty application business analyst product manager project experience years brief pro

In [None]:
position

['httpswwwamazonjobsenjobsbusinessanalyst description will able determine amazons middle mile automated driver supply planning predicting right number driver shift deliver million package every day customers',
 'as business analyst operate crossroad multiple complex amazon systems getting global visibility amazon move inventory across network serf customers',
 'you work core optimization model drive middle mile planning business amazon',
 'you enable creation product drive evergreater automation scalability optimization every aspect transportation removing cost delivering speed execution customers',
 'the impact work global material remarkable',
 'the successful candidate voraciously curious amazons transportation operation data consumed produced system used improve outcome lower costs',
 'your responsibility expose measure current performance systems find quantify opportunity improvement dive deep existing algorithm explain unexpected performance',
 'we looking sophisticated user dat
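
The cleaned sentences above still begin with words such as "my", "as", "you", and "the". This is consistent with `preprocessing_text` filtering stopwords before lowercasing: the NLTK stopword list is lowercase, so "My" or "The" does not match it. The sketch below contrasts the two orderings. It uses a tiny hypothetical `STOPWORDS` set in place of the NLTK list and leaves out lemmatization, so it illustrates the ordering effect only.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption, for illustration only).
STOPWORDS = {"my", "the", "of", "to", "you", "as", "will"}

def strip_noise(text):
    """Remove digits and punctuation, then collapse whitespace."""
    text = re.sub(r"\d+", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_filter_first(text):
    """Same order as the notebook: filter stopwords, then lowercase."""
    kept = [word for word in text.split() if word not in STOPWORDS]
    return strip_noise(" ".join(kept)).lower()

def clean_lower_first(text):
    """Alternative order: lowercase and strip punctuation, then filter."""
    words = strip_noise(text.lower()).split()
    return " ".join(word for word in words if word not in STOPWORDS)

sample = "My skills of instant response"
print(clean_filter_first(sample))  # capitalized "My" survives the filter
print(clean_lower_first(sample))   # "my" is removed
```

Lowercasing first would shorten the token lists and therefore change the similarity scores reported in Section 2, so the numbers below correspond to the filter-first order.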

In [None]:
# Tokenize words in sentences in descriptions
resume_tokens = [word_tokenize(sentence) for sentence in resume]
position_tokens = [word_tokenize(sentence) for sentence in position]
resume_tokens[:5]

[['page',
  'abc',
  'abc',
  'email',
  'abcgmailcom',
  'position',
  'business',
  'analyst',
  'system',
  'analyst',
  'summary',
  'id',
  'like',
  'suggest',
  'expertise',
  'knowle',
  'dge',
  'skill'],
 ['my',
  'skill',
  'instant',
  'response',
  'change',
  'cases',
  'proficiency',
  'negotiation',
  'mediat',
  'ion',
  'issue',
  'solving',
  'allow',
  'sure',
  'ability',
  'quickly',
  'dive',
  'new',
  'sphere',
  'activity',
  'education',
  'years',
  'educational',
  'institution',
  'qualification',
  'yaroslav',
  'mudryi',
  'national',
  'law',
  'university',
  'llm'],
 ['professional',
  'skills',
  'os',
  'windows',
  'android',
  'ios',
  'macos',
  'ubuntu',
  'technologies',
  'web',
  'mobile',
  'dbms',
  'ms',
  'sql',
  'mysql',
  'oracle',
  'access',
  'surface',
  'immersion',
  'ba',
  'support',
  'tool',
  'jira',
  'confluence',
  'clicku',
  'p',
  'ms',
  'project',
  'google',
  'docs',
  'enterprise',
  'architect',
  'ms',
  'office

In [None]:
position_tokens[:5]

[['httpswwwamazonjobsenjobsbusinessanalyst',
  'description',
  'will',
  'able',
  'determine',
  'amazons',
  'middle',
  'mile',
  'automated',
  'driver',
  'supply',
  'planning',
  'predicting',
  'right',
  'number',
  'driver',
  'shift',
  'deliver',
  'million',
  'package',
  'every',
  'day',
  'customers'],
 ['as',
  'business',
  'analyst',
  'operate',
  'crossroad',
  'multiple',
  'complex',
  'amazon',
  'systems',
  'getting',
  'global',
  'visibility',
  'amazon',
  'move',
  'inventory',
  'across',
  'network',
  'serf',
  'customers'],
 ['you',
  'work',
  'core',
  'optimization',
  'model',
  'drive',
  'middle',
  'mile',
  'planning',
  'business',
  'amazon'],
 ['you',
  'enable',
  'creation',
  'product',
  'drive',
  'evergreater',
  'automation',
  'scalability',
  'optimization',
  'every',
  'aspect',
  'transportation',
  'removing',
  'cost',
  'delivering',
  'speed',
  'execution',
  'customers'],
 ['the', 'impact', 'work', 'global', 'material', '

In [None]:
# Create resume and position corpora
corpus_resume = create_corpus(resume)
corpus_position = create_corpus(position)

In [None]:
corpus_resume

'page abc abc email abcgmailcom position business analyst system analyst summary id like suggest expertise knowle dge skill my skill instant response change cases proficiency negotiation mediat ion issue solving allow sure ability quickly dive new sphere activity education years educational institution qualification yaroslav mudryi national law university llm professional skills os windows android ios macos ubuntu technologies web mobile dbms ms sql mysql oracle access surface immersion ba support tool jira confluence clicku p ms project google docs enterprise architect ms office etc postman design diagram adobexd invision figma drawio miro methodologies agile scrum kanban waterfall foreign language english upper intermediate other skill idef uml bpmn rest api graphql xml json service records years employer position brief job description telesens business analyst dipocket business analyst resty application business analyst product manager project experience years brief project descript

In [None]:
corpus_position

'httpswwwamazonjobsenjobsbusinessanalyst description will able determine amazons middle mile automated driver supply planning predicting right number driver shift deliver million package every day customers as business analyst operate crossroad multiple complex amazon systems getting global visibility amazon move inventory across network serf customers you work core optimization model drive middle mile planning business amazon you enable creation product drive evergreater automation scalability optimization every aspect transportation removing cost delivering speed execution customers the impact work global material remarkable the successful candidate voraciously curious amazons transportation operation data consumed produced system used improve outcome lower costs your responsibility expose measure current performance systems find quantify opportunity improvement dive deep existing algorithm explain unexpected performance we looking sophisticated user data querying tool expert synthe

### Section 2: Similarity Computation

In [None]:
# Binary similarity
flatten_position_tokens = sum(position_tokens, [])
flatten_resume_tokens = sum(resume_tokens, [])
print("Jaccard similarity:", jaccard_sim(flatten_resume_tokens, flatten_position_tokens))
print("Dice similarity:", dice_sim(flatten_resume_tokens, flatten_position_tokens))

Jaccard similarity: 0.06581740976645435
Dice similarity: 0.12350597609561753
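
The two set-based scores are not independent. Since |A| + |B| = |A ∪ B| + |A ∩ B|, the Dice coefficient can be written as D = 2J / (1 + J), where J is the Jaccard index. The sketch below checks this identity against the two values printed above and on a toy pair of token lists; the function names are illustrative and the functions assume non-empty inputs.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| on the sets of unique tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|) on the sets of unique tokens."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

def dice_from_jaccard(j):
    """Dice coefficient implied by a Jaccard index."""
    return 2 * j / (1 + j)

# Values printed by the cell above
reported_jaccard = 0.06581740976645435
reported_dice = 0.12350597609561753
print(abs(dice_from_jaccard(reported_jaccard) - reported_dice) < 1e-12)

# Toy check on two small token lists
resume_toy = ["sql", "jira", "agile"]
position_toy = ["sql", "agile", "uml", "bpmn"]
print(jaccard(resume_toy, position_toy), dice(resume_toy, position_toy))
```

Because one score is a monotone function of the other, Jaccard and Dice always rank candidates in the same order; they differ only in scale.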


##### Section 2.1. Statistical models

In [None]:
# Bag-of-words word embeddings
# Build the model once and unpack both count vectors
_, bow_resume, bow_position = bag_of_words(corpus_resume, corpus_position)

In [None]:
bow_resume

array([ 2,  1,  1,  0,  1,  0,  3,  1,  1,  1,  0,  1,  0,  0,  1,  0,  0,
        0,  0,  1, 12,  5,  0,  1,  1,  1,  1,  1,  1,  0,  0,  0,  0,  0,
        1,  0,  1,  0,  2,  0,  1,  0,  0,  0,  0,  1,  2,  0,  9,  3,  1,
        0,  0,  1,  0,  0,  6,  1,  1,  1,  1,  1,  2,  0,  0,  0,  2,  0,
        1,  1,  0,  0,  1,  0,  7,  1,  0,  0,  0,  0,  1,  2,  0,  0,  0,
        0,  1,  0,  0,  0,  1,  0,  0,  0,  0,  0,  2,  1,  1,  0,  0,  0,
        2,  1,  1,  1,  1,  1,  0,  0,  1,  0,  0,  0,  1,  1,  0,  0,  1,
        2,  1,  1,  0,  0,  1,  0,  1,  0,  0,  0,  1,  0,  0,  0,  0,  1,
        0,  0,  1,  0,  1,  0,  0,  2,  0,  1,  0,  0,  1,  0,  0,  0,  0,
        1,  1,  0,  1,  0,  0,  0,  0,  1,  1,  0,  0,  0,  0,  1,  0,  0,
        0,  3,  1,  1,  1,  1,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  0,  3,  1,  0,  1,  1,  1,  1,  1,  5,  1,  1,  0,  1,  1,  0,
        1,  1,  0,  0,  1,  0,  0,  1,  0,  1,  0,  0,  1,  1,  0,  0,  0,
        0,  0,  0,  0,  0

In [None]:
bow_position

array([ 0,  0,  0,  1,  0,  2,  0,  0,  0,  0,  1,  0,  1,  1,  0,  1,  5,
        1,  1,  0,  1,  1,  1,  0,  0,  0,  0,  0,  0,  1,  2,  1,  1,  1,
        0,  1,  0,  1,  0,  1,  1,  1,  1,  1,  1,  0,  0,  2, 11,  0,  0,
        2,  1,  0,  1,  1,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  0,  4,
        0,  0,  1,  1,  3,  1,  0,  0,  1,  1,  1,  1,  1,  3,  1,  9,  1,
        1,  0,  2,  2,  1,  0,  1,  1,  1,  1,  1,  1,  2,  0,  1,  2,  1,
        0,  0,  0,  0,  2,  0,  1,  1,  0,  3,  3,  1,  0,  0,  1,  1,  0,
        0,  0,  0,  1,  1,  0,  1,  0,  1,  1,  1,  1,  1,  1,  2,  3,  0,
        1,  1,  9,  1,  0,  1,  1,  0,  1,  0,  1,  1,  0,  1,  1,  1,  1,
        0,  0,  1,  0,  1,  2,  2,  2,  0,  0,  1,  1,  1,  1,  0,  2,  1,
        1,  0,  0,  0,  0,  0,  2,  0,  2,  1,  3,  1,  1,  1,  1,  4,  0,
        0,  1,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,  1,  1,  0,  0,  1,
        0,  0,  1,  1,  1,  1,  1,  0,  1,  0,  1,  1,  1,  0,  1,  1,  1,
        1,  1,  1,  1,  1

In [None]:
# Bag-of-words word embeddings similarity
print("Bag-of-words cosine similarity:", cos_sim(bow_resume, bow_position))
print("Bag-of-words sqrt-cos similarity:", sqrt_cos_sim(normalize_vector(bow_resume), normalize_vector(bow_position)))
print("Bag-of-words improved sqrt-cos similarity:", improved_sqrt_cos_sim(bow_resume, bow_position))

Bag-of-words cosine similarity: 0.183583553976399
Bag-of-words sqrt-cos similarity: nan
Bag-of-words improved sqrt-cos similarity: 0.4284665144167032
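
The `nan` for sqrt-cos is consistent with how the inputs are scaled. Min-max scaling maps each vector into [0, 1] but does not make it sum to 1, and the squared Hellinger distance is guaranteed to stay within [0, 1] only for vectors that do sum to 1. When it exceeds 1, the term under the outer square root becomes negative, which NumPy reports as `nan`. The sketch below reproduces the effect on toy vectors and shows one possible alternative, dividing each vector by its sum. This alternative is an assumption for illustration and not the thesis method; all names are hypothetical.

```python
import math

def min_max(v):
    """Scale values into [0, 1], as normalize_vector does."""
    lo, hi = min(v), max(v)
    return [(x - lo) / (hi - lo) for x in v]

def to_distribution(v):
    """Scale non-negative values so that they sum to 1."""
    total = sum(v)
    return [x / total for x in v]

def squared_hellinger(p, q):
    return 0.5 * sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))

def hellinger_similarity(a, b):
    """sqrt(1 - H^2) on sum-normalized vectors, guarded against rounding below zero."""
    h2 = squared_hellinger(to_distribution(a), to_distribution(b))
    return math.sqrt(max(0.0, 1.0 - h2))

a = [2, 1, 0, 0]
b = [0, 0, 1, 3]
c = [1, 1, 0, 2]

h2_minmax = squared_hellinger(min_max(a), min_max(b))
print(h2_minmax > 1)               # 1 - H^2 is negative, hence nan in NumPy
print(hellinger_similarity(a, c))  # stays within [0, 1]
```

With sum-normalized inputs, vectors with no shared non-zero positions score 0 and identical vectors score 1.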


In [None]:
# Tf-idf word embeddings
# Build the model once and unpack both TF-IDF vectors
_, tf_idf_resume, tf_idf_position = tf_idf(corpus_resume, corpus_position)

In [None]:
tf_idf_resume

array([0.06073262, 0.03036631, 0.03036631, 0.        , 0.03036631,
       0.        , 0.09109894, 0.03036631, 0.03036631, 0.03036631,
       0.        , 0.03036631, 0.        , 0.        , 0.03036631,
       0.        , 0.        , 0.        , 0.        , 0.03036631,
       0.25927057, 0.10802941, 0.        , 0.03036631, 0.03036631,
       0.03036631, 0.03036631, 0.03036631, 0.03036631, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.03036631,
       0.        , 0.03036631, 0.        , 0.06073262, 0.        ,
       0.02160588, 0.        , 0.        , 0.        , 0.        ,
       0.03036631, 0.06073262, 0.        , 0.19445293, 0.09109894,
       0.03036631, 0.        , 0.        , 0.03036631, 0.        ,
       0.        , 0.18219787, 0.03036631, 0.03036631, 0.03036631,
       0.03036631, 0.03036631, 0.06073262, 0.        , 0.        ,
       0.        , 0.06073262, 0.        , 0.03036631, 0.03036631,
       0.        , 0.        , 0.02160588, 0.        , 0.21256

In [None]:
tf_idf_position

array([0.        , 0.        , 0.        , 0.03559548, 0.        ,
       0.07119097, 0.        , 0.        , 0.        , 0.        ,
       0.03559548, 0.        , 0.03559548, 0.03559548, 0.        ,
       0.03559548, 0.17797741, 0.03559548, 0.03559548, 0.        ,
       0.02532648, 0.02532648, 0.03559548, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.03559548,
       0.07119097, 0.03559548, 0.03559548, 0.03559548, 0.        ,
       0.03559548, 0.        , 0.03559548, 0.        , 0.03559548,
       0.02532648, 0.03559548, 0.03559548, 0.03559548, 0.03559548,
       0.        , 0.        , 0.07119097, 0.27859127, 0.        ,
       0.        , 0.07119097, 0.03559548, 0.        , 0.03559548,
       0.03559548, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.03559548, 0.03559548,
       0.03559548, 0.        , 0.14238193, 0.        , 0.        ,
       0.03559548, 0.03559548, 0.07597944, 0.03559548, 0.     
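
Within each array above, words that occur once take one of two values (0.03036631 or 0.02160588 for the resume, 0.03559548 or 0.02532648 for the position). This matches the default weighting of scikit-learn's `TfidfVectorizer`, which uses the smoothed inverse document frequency idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization. With only two documents, a word found in both gets idf = 1, and a word found in one gets idf = ln(1.5) + 1. The sketch below checks that ratio against the printed values; `smoothed_idf` is an illustrative helper, not a scikit-learn function.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf as used by TfidfVectorizer with its default settings."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_shared = smoothed_idf(2, 2)  # word appears in resume and position
idf_unique = smoothed_idf(2, 1)  # word appears in only one of them
ratio = idf_unique / idf_shared

# Ratios of the printed weights for words that occur once
resume_ratio = 0.03036631 / 0.02160588
position_ratio = 0.03559548 / 0.02532648
print(ratio, resume_ratio, position_ratio)
```

With a two-document corpus the idf term can only separate shared words from unshared ones, so TF-IDF behaves here much like a reweighted bag-of-words.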

In [None]:
# Tf-idf word embeddings similarity
print("Tf-idf cosine similarity:", cos_sim(tf_idf_resume, tf_idf_position))
print("Tf-idf sqrt-cos similarity:", sqrt_cos_sim(normalize_vector(tf_idf_resume), normalize_vector(tf_idf_position)))
print("Tf-idf improved sqrt-cos similarity:", improved_sqrt_cos_sim(tf_idf_resume, tf_idf_position))

Tf-idf cosine similarity: 0.11108178270708521
Tf-idf sqrt-cos similarity: nan
Tf-idf improved sqrt-cos similarity: 0.33328933782388726
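
As implemented in `improved_sqrt_cos_sim`, the score is algebraically the square root of the cosine similarity: sqrt(a·b) / (sqrt(‖a‖) · sqrt(‖b‖)) = sqrt(a·b / (‖a‖ ‖b‖)). The printed values confirm this for both models. The sketch below verifies the identity. It also shows an element-wise variant that sums sqrt(x_i · y_i) and divides by the square roots of the L1 norms; treating that variant as the intended ISC definition is an assumption here, since the thesis text defining ISC is not part of this notebook.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def notebook_isc(u, v):
    """Same arithmetic as improved_sqrt_cos_sim, written with the standard library."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return math.sqrt(dot) / (math.sqrt(norm_u) * math.sqrt(norm_v))

def isc_elementwise(u, v):
    """Assumed alternative: sum of sqrt(x_i * y_i) over sqrt of the L1 norms (non-negative inputs)."""
    numerator = sum(math.sqrt(x * y) for x, y in zip(u, v))
    return numerator / (math.sqrt(sum(u)) * math.sqrt(sum(v)))

u = [2, 1, 0, 3]
v = [1, 0, 4, 1]
print(math.isclose(notebook_isc(u, v), math.sqrt(cosine(u, v))))

# (cosine, improved sqrt-cos) pairs printed above
reported = [(0.183583553976399, 0.4284665144167032),
            (0.11108178270708521, 0.33328933782388726)]
print(all(abs(math.sqrt(c) - s) < 1e-9 for c, s in reported))
```

Since the square root is monotone, the score as implemented ranks candidates exactly as cosine similarity does; the element-wise variant can produce a different ranking.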


##### Section 2.2. Deep Learning models (without context)

In [None]:
# Build Deep Learning models (without context)
model_w2v = build_model_ncontext(model_path="word2vec-google-news-300.model")
model_glove = build_model_ncontext(model_path="glove-wiki-gigaword-300.model")
model_fasttext = build_model_ncontext(model_path="fasttext-wiki-news-subwords-300.model")

In [None]:
# "Project" word embedding derived by Word2Vec
model_w2v["project"]

array([-1.80664062e-02,  8.54492188e-03,  6.98242188e-02,  3.07617188e-02,
        8.05664062e-02,  3.44238281e-02,  4.41406250e-01,  1.91650391e-02,
        1.10473633e-02,  9.13085938e-02, -8.05664062e-02,  8.66699219e-03,
       -5.66406250e-02, -2.79296875e-01, -3.04687500e-01,  2.66113281e-02,
       -1.01074219e-01, -2.44140625e-01,  1.10473633e-02, -2.19726562e-02,
       -1.27929688e-01,  2.11914062e-01, -4.27246094e-02,  6.34765625e-02,
       -6.12792969e-02, -1.52343750e-01, -4.27246094e-02, -1.40625000e-01,
       -2.41699219e-02, -1.74804688e-01,  1.55639648e-02, -4.61425781e-02,
       -1.83593750e-01, -8.74023438e-02,  1.27929688e-01, -1.05957031e-01,
        7.26318359e-03, -2.64892578e-02,  1.35742188e-01, -1.41601562e-01,
       -1.19628906e-02,  2.43164062e-01,  5.61523438e-02,  1.40625000e-01,
       -3.22265625e-01, -3.39843750e-01, -2.53906250e-01, -1.36718750e-01,
       -2.08984375e-01,  3.61328125e-01,  1.34765625e-01, -1.11816406e-01,
       -1.14257812e-01, -

In [None]:
# "Project" word embedding derived by GloVe
model_glove["project"]

array([-2.0397e-01, -3.5959e-02, -2.4745e-01, -5.5419e-01,  6.7167e-03,
       -8.7778e-02,  2.3057e-01, -3.3634e-01, -2.1594e-01, -1.3637e+00,
        2.1076e-01, -4.4217e-01,  2.1688e-01,  2.5215e-01,  3.8284e-01,
        1.7151e-02,  7.5829e-02,  1.8668e-01,  2.5643e-01,  4.7164e-01,
       -3.0530e-01,  1.8262e-01, -1.3302e-01,  2.1855e-01, -3.9873e-02,
        1.9053e-01,  3.3508e-01,  1.9015e-01, -1.5546e-02,  2.3514e-01,
        7.2200e-01,  2.9326e-01, -2.7213e-01,  5.2866e-01, -1.2719e-01,
        2.0123e-01, -2.4419e-01, -7.9395e-02,  3.3330e-01,  1.3958e-02,
       -2.7907e-01, -3.7687e-01, -3.3006e-01,  3.0789e-01, -1.2030e-01,
       -2.8289e-01, -8.8605e-02,  1.3664e-01, -1.6403e-01, -6.1411e-02,
        1.9604e-01,  1.0830e-01, -3.6917e-01,  1.8505e-03, -2.9781e-01,
        3.5050e-01,  4.3316e-01,  4.4869e-01, -1.3611e-01,  1.3710e-01,
       -8.4922e-01,  3.1850e-01, -4.3727e-02, -5.8593e-01,  5.6550e-02,
        8.6663e-01,  4.2441e-01,  3.1674e-01,  5.9644e-02, -2.14

In [None]:
# "Project" word embedding derived by fasttext
model_fasttext["project"]

array([ 3.6720e-02, -4.6776e-02,  1.1238e-02, -1.4262e-02, -3.9045e-02,
       -3.1122e-04,  2.3092e-02, -9.4404e-02,  1.0284e-02,  1.8978e-02,
       -2.3707e-02, -7.2466e-02,  4.2775e-02, -5.9705e-03,  2.7818e-02,
        1.2642e-02,  1.0934e-01,  2.6429e-02,  5.4367e-02,  4.7241e-05,
       -1.2384e-02,  4.0120e-02, -3.6997e-02,  5.7316e-02,  3.5739e-02,
       -1.9788e-02, -4.6655e-02,  3.0372e-02, -1.7557e-02, -5.2527e-03,
       -2.3609e-02,  5.6846e-03,  9.2050e-03, -9.4131e-02, -8.8816e-03,
       -6.2144e-04,  9.2826e-03,  1.8940e-02, -7.2345e-03, -5.3008e-02,
       -1.7007e-02, -8.6123e-02, -6.8775e-02, -1.8025e-02, -1.9746e-02,
        3.3024e-02, -1.2066e-02,  1.9534e-02, -1.7749e-02, -2.6609e-02,
        5.0450e-02,  2.3968e-02, -1.8095e-02,  3.8698e-02, -7.7544e-02,
        3.9485e-02, -5.6603e-02, -8.2745e-02, -6.2821e-02,  7.1270e-02,
        5.3229e-02,  4.2347e-02,  1.1013e-01, -1.9416e-02,  4.0057e-02,
        1.1877e-02,  2.1976e-02, -4.3974e-02, -3.9952e-02,  8.14

In [None]:
# Resume and position word embeddings by Word2Vec model (without context)
# Compute the embeddings once and reuse the result for both documents
w2v_embeddings = ncontext_word_embeddings(resume_tokens, position_tokens, model_w2v, 300)
w2v_resume = w2v_embeddings[0]
w2v_position = w2v_embeddings[1]

In [None]:
print("Word2Vec cosine similarity:", cos_sim(w2v_resume, w2v_position))
print("Word2Vec sqrt-cos similarity:", sqrt_cos_sim(normalize_vector(w2v_resume), normalize_vector(w2v_position)))
print("Word2Vec improved sqrt-cos similarity:", improved_sqrt_cos_sim(w2v_resume, w2v_position))

Word2Vec cosine similarity: 0.8775057765303992
Word2Vec sqrt-cos similarity: 0.5479819305240776
Word2Vec improved sqrt-cos similarity: 0.9367527830385127


In [None]:
# Resume and position word embeddings by GloVe model (without context)
# Compute the embeddings once and reuse the result for both documents
glove_embeddings = ncontext_word_embeddings(resume_tokens, position_tokens, model_glove, 300)
glove_resume = glove_embeddings[0]
glove_position = glove_embeddings[1]

In [None]:
print("GloVe cosine similarity:", cos_sim(glove_resume, glove_position))
print("GloVe sqrt-cos similarity:", sqrt_cos_sim(normalize_vector(glove_resume), normalize_vector(glove_position)))
print("GloVe improved sqrt-cos similarity:", improved_sqrt_cos_sim(glove_resume, glove_position))

GloVe cosine similarity: 0.9282224260373347
GloVe sqrt-cos similarity: 0.9669506642601486
GloVe improved sqrt-cos similarity: 0.9634430061178163


In [None]:
# Resume and position word embeddings by fasttext model (without context)
# Compute the embeddings once and reuse the result for both documents
fasttext_embeddings = ncontext_word_embeddings(resume_tokens, position_tokens, model_fasttext, 300)
fasttext_resume = fasttext_embeddings[0]
fasttext_position = fasttext_embeddings[1]

In [None]:
print("fasttext cosine similarity:", cos_sim(fasttext_resume, fasttext_position))
print("fasttext sqrt-cos similarity:", sqrt_cos_sim(normalize_vector(fasttext_resume), normalize_vector(fasttext_position)))
print("fasttext improved sqrt-cos similarity:", improved_sqrt_cos_sim(fasttext_resume, fasttext_position))

fasttext cosine similarity: 0.9597905927359239
fasttext sqrt-cos similarity: 0.920887663840816
fasttext improved sqrt-cos similarity: 0.9796890285881147
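
The static (context-free) models above map each word to one fixed vector, so a whole resume or job description has to be reduced to a single vector before a similarity can be computed. The sketch below shows one common way to do this, averaging the vectors of known tokens. It is only an illustration: `mean_document_vector`, `cosine`, and `toy_vectors` are hypothetical names, and the notebook's own `ncontext_word_embeddings` may pool word vectors differently.

```python
import numpy as np

def mean_document_vector(tokens, vectors, dim):
    """Average the vectors of all tokens found in `vectors` (assumed pooling)."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No known token: fall back to a zero vector of the right size
        return np.zeros(dim)
    return np.mean(known, axis=0)

def cosine(u, v):
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Toy 3-dimensional "embeddings" standing in for a 300-dimensional model
toy_vectors = {
    "python": np.array([1.0, 0.0, 1.0]),
    "java": np.array([0.9, 0.1, 0.8]),
    "marketing": np.array([0.0, 1.0, 0.1]),
}

resume_vec = mean_document_vector(["python", "java", "unknown"], toy_vectors, 3)
position_vec = mean_document_vector(["python"], toy_vectors, 3)
other_vec = mean_document_vector(["marketing"], toy_vectors, 3)

print("resume vs. developer position:", cosine(resume_vec, position_vec))
print("resume vs. marketing position:", cosine(resume_vec, other_vec))
```

Averaging over many tokens tends to pull document vectors toward a shared direction, which may be one reason the static-embedding scores above all lie close to 1.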


##### Section 2.3. Deep Learning models (with context)

In [None]:
# Build BERT model (with context)
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert_embeddings = context_word_embeddings_bert("bert-base-uncased", corpus_resume, corpus_position, bert_tokenizer, bert_model)

In [None]:
# Resume and position word embeddings by BERT model (with context)
bert_resume = tf.convert_to_tensor(bert_embeddings[0].numpy())
bert_position = tf.convert_to_tensor(bert_embeddings[1].numpy())

In [None]:
bert_resume

<tf.Tensor: shape=(100, 768), dtype=float32, numpy=
array([[-0.24948883,  0.23671284,  0.21300325, ..., -0.5167286 ,
        -0.0167064 ,  0.40110615],
       [-0.13348748,  0.50298893,  1.1789578 , ...,  0.11585795,
         0.89747536, -0.50147355],
       [ 0.3652598 ,  1.3704478 ,  0.76849085, ..., -0.15348034,
         0.26663238, -0.07749266],
       ...,
       [-0.38198444,  0.37855992,  1.1884888 , ..., -0.09209996,
         0.16719589,  0.6228476 ],
       [-0.10712307,  0.3816907 ,  0.21916893, ..., -0.21529454,
         0.04929146, -0.10734079],
       [-0.22910808,  0.26557174,  0.8073412 , ...,  0.01077889,
        -0.11126518,  0.36337057]], dtype=float32)>

In [None]:
bert_position

<tf.Tensor: shape=(100, 768), dtype=float32, numpy=
array([[-0.32719356,  0.2911166 ,  0.0995127 , ..., -0.70458657,
         0.10058991,  0.12757708],
       [-0.09747656, -0.32714102,  0.8564409 , ...,  0.58886635,
         0.62087536,  1.1605374 ],
       [ 0.2581681 ,  0.05714206,  0.87274504, ..., -0.47108504,
        -0.19724384, -0.45346382],
       ...,
       [-0.27867275,  0.08526617,  0.94428164, ..., -0.40449864,
         0.08880097,  0.09942784],
       [ 0.10933503,  0.32998013,  0.32587463, ..., -0.25213304,
        -0.16210476, -0.29880297],
       [-1.200947  , -0.5176007 ,  0.75854206, ..., -0.33721527,
         0.00796723, -0.65257365]], dtype=float32)>

In [None]:
print("BERT cosine tensor similarity:", cosine_similarity_tensorflow(bert_resume, bert_position).numpy())
print("BERT sqrt-cos tensor similarity:", sqrt_cos_sim_tensorflow(bert_resume, bert_position).numpy())
print("BERT improved sqrt-cos tensor similarity:", improved_sqrt_cos_sim_tensorflow(bert_resume, bert_position).numpy())

BERT cosine tensor similarity: 0.38806093
BERT sqrt-cos tensor similarity: nan
BERT improved sqrt-cos tensor similarity: 0.62294537
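
The `nan` reported for the BERT sqrt-cos similarity may stem from the square root itself: the BERT tensors printed above contain negative components, and the real square root of a negative number is undefined. The sketch below illustrates this under the assumption that sqrt-cos takes element-wise square roots of the products `x_i * y_i`. `sqrt_cos_sketch` is a hypothetical stand-in, not the notebook's `sqrt_cos_sim_tensorflow`, whose exact definition is not shown in this section.

```python
import numpy as np

def sqrt_cos_sketch(x, y):
    """Assumed Hellinger-style form: sum(sqrt(x_i * y_i)) / (sqrt(sum x) * sqrt(sum y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Silence the RuntimeWarning NumPy raises for sqrt of a negative number
    with np.errstate(invalid="ignore"):
        numerator = np.sum(np.sqrt(x * y))
        denominator = np.sqrt(np.sum(x)) * np.sqrt(np.sum(y))
        return numerator / denominator

# Non-negative vectors (e.g. term frequencies): the measure is well defined
nonneg_score = sqrt_cos_sketch([0.2, 0.3, 0.5], [0.1, 0.4, 0.5])

# One negative component (as in dense embeddings): sqrt(negative) gives nan
mixed_score = sqrt_cos_sketch([0.2, -0.3, 0.5], [0.1, 0.4, 0.5])

print("non-negative input:", nonneg_score)
print("input with a negative component:", mixed_score)
```

A single negative product is enough to turn the whole sum into `nan`, so measures of this family are normally applied to non-negative vectors only.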


In [None]:
# Build ELMo model (with context)
elmo_model = hub.load("https://tfhub.dev/google/elmo/3")
elmo_embeddings = context_word_embeddings_elmo(elmo_model, corpus_resume, corpus_position)

In [None]:
# Resume and position word embeddings by ELMo model (with context)
elmo_resume = elmo_embeddings[0]
elmo_position = elmo_embeddings[1]

In [None]:
elmo_resume

<tf.Tensor: shape=(402, 1024), dtype=float32, numpy=
array([[ 0.96926683,  0.0197357 , -0.05022468, ..., -0.6011605 ,
         0.04994965,  0.564479  ],
       [ 0.6123357 ,  0.54600865,  0.39083087, ..., -0.6492162 ,
         0.16072023, -0.4677832 ],
       [ 0.443457  ,  0.4283167 ,  0.27562428, ..., -0.20470259,
         0.0925293 , -0.32184893],
       ...,
       [-0.02840841, -0.04353216,  0.04130162, ...,  0.02583168,
        -0.01429836, -0.01650422],
       [-0.02840841, -0.04353216,  0.04130162, ...,  0.02583168,
        -0.01429836, -0.01650422],
       [-0.02840841, -0.04353216,  0.04130162, ...,  0.02583168,
        -0.01429836, -0.01650422]], dtype=float32)>

In [None]:
elmo_position

<tf.Tensor: shape=(402, 1024), dtype=float32, numpy=
array([[ 0.0866392 , -0.04667937,  0.00528036, ..., -0.35448056,
         0.26946062, -0.2253712 ],
       [-0.39725125,  0.03733823,  0.06280055, ..., -0.93967694,
         0.5207259 ,  0.75969124],
       [-0.15930858,  0.29070768, -0.1728603 , ..., -0.14953661,
         0.42402536, -0.6926011 ],
       ...,
       [-0.3457772 ,  0.12610844, -0.34094432, ...,  0.8936167 ,
         0.55811393,  0.49377853],
       [-0.34069228,  0.21508452, -0.49008286, ..., -0.19523671,
        -0.10895094, -0.07450269],
       [-0.40868595, -0.34031767, -0.54438156, ...,  0.15125084,
         0.0727873 , -0.43524438]], dtype=float32)>

In [None]:
print("ElMo cosine tensor similarity:", cosine_similarity_tensorflow(elmo_resume, elmo_position).numpy())
print("ElMo sqrt-cos tensor similarity:", sqrt_cos_sim_tensorflow(elmo_resume, elmo_position).numpy())
print("ElMo improved sqrt-cos tensor similarity:", improved_sqrt_cos_sim_tensorflow(elmo_resume, elmo_position).numpy())

ElMo cosine tensor similarity: 0.42995268
ElMo sqrt-cos tensor similarity: nan
ElMo improved sqrt-cos tensor similarity: 0.6557077


### Section 3: Resume Selection Automation

In [None]:
# Build Deep Learning models
model_w2v = build_model_ncontext(model_path="word2vec-google-news-300.model")
model_glove = build_model_ncontext(model_path="glove-wiki-gigaword-300.model")
model_fasttext = build_model_ncontext(model_path="fasttext-wiki-news-subwords-300.model")

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
elmo_model = hub.load("https://tfhub.dev/google/elmo/3")

# Define list of resumes and positions
resume_list = ["Business analyst Male.pdf", "Junior Frontend Developer Male.pdf", "Junior Full Stack Developer Female.pdf",
               "Junior Software Developer Female.pdf", "Middle .NET Developer Male.pdf", "Middle Java Engineer Male.pdf",
               "Product Marketing Manager Female.pdf", "QA Specialist Female.pdf", "Senior .NET Developer Male.pdf",
               "Senior Electrical Engineer Male.pdf", "Senior Software Engineer Male.pdf", "Trainee UX-UI Designer Female.pdf"]
position_list = ["Business Analyst.txt", "Senior Software Developer.txt"]

# Initialize an empty list to store the results
results = []

# Iterate over all combinations of resumes and positions
for resume, position in product(resume_list, position_list):
    # Read and preprocess resume and position
    resume_text = read_resume(resume)
    position_text = read_position(position)
    resume_text = tokenize_description(resume_text)
    position_text = tokenize_description(position_text)
    resume_text = [preprocessing_text(sentence) for sentence in resume_text]
    position_text = [preprocessing_text(sentence) for sentence in position_text]
    position_tokens = [word_tokenize(sentence) for sentence in position_text]
    resume_tokens = [word_tokenize(sentence) for sentence in resume_text]
    flatten_position_tokens = sum(position_tokens, [])
    flatten_resume_tokens = sum(resume_tokens, [])
    corpus_resume = create_corpus(resume_text)
    corpus_position = create_corpus(position_text)

    # Calculate and store characteristics
    characteristics = {}

    characteristics['Resume'] = resume
    characteristics['Position'] = position
    characteristics['Jaccard similarity'] = jaccard_sim(flatten_resume_tokens, flatten_position_tokens)
    characteristics['Dice similarity'] = dice_sim(flatten_resume_tokens, flatten_position_tokens)

    # Build the bag-of-words vectors once and reuse them for both documents
    bow_vectors = bag_of_words(corpus_resume, corpus_position)
    bow_resume = bow_vectors[1]
    bow_position = bow_vectors[2]
    characteristics['Bag-of-words cosine similarity'] = cos_sim(bow_resume, bow_position)
    characteristics['Bag-of-words sqrt-cos similarity'] = sqrt_cos_sim(normalize_vector(bow_resume), normalize_vector(bow_position))
    characteristics['Bag-of-words improved sqrt-cos similarity'] = improved_sqrt_cos_sim(bow_resume, bow_position)

    # Build the TF-IDF vectors once and reuse them for both documents
    tf_idf_vectors = tf_idf(corpus_resume, corpus_position)
    tf_idf_resume = tf_idf_vectors[1]
    tf_idf_position = tf_idf_vectors[2]
    characteristics['Tf-idf cosine similarity'] = cos_sim(tf_idf_resume, tf_idf_position)
    characteristics['Tf-idf sqrt-cos similarity'] = sqrt_cos_sim(normalize_vector(tf_idf_resume), normalize_vector(tf_idf_position))
    characteristics['Tf-idf improved sqrt-cos similarity'] = improved_sqrt_cos_sim(tf_idf_resume, tf_idf_position)

    for model_name, model in [("Word2Vec", model_w2v), ("GloVe", model_glove), ("FastText", model_fasttext)]:
        # Compute the embeddings once per model and reuse them for both documents
        embeddings = ncontext_word_embeddings(resume_tokens, position_tokens, model, 300)
        embeddings_resume = embeddings[0]
        embeddings_position = embeddings[1]
        characteristics[f"{model_name} cosine similarity"] = cos_sim(embeddings_resume, embeddings_position)
        characteristics[f"{model_name} sqrt-cos similarity"] = sqrt_cos_sim(normalize_vector(embeddings_resume), normalize_vector(embeddings_position))
        characteristics[f"{model_name} improved sqrt-cos similarity"] = improved_sqrt_cos_sim(embeddings_resume, embeddings_position)

    bert_embeddings = context_word_embeddings_bert("bert-base-uncased", corpus_resume, corpus_position, bert_tokenizer, bert_model)
    bert_resume = tf.convert_to_tensor(bert_embeddings[0].numpy())
    bert_position = tf.convert_to_tensor(bert_embeddings[1].numpy())
    characteristics['BERT cosine tensor similarity'] = cosine_similarity_tensorflow(bert_resume, bert_position).numpy()
    characteristics['BERT sqrt-cos tensor similarity'] = sqrt_cos_sim_tensorflow(bert_resume, bert_position).numpy()
    characteristics['BERT improved sqrt-cos tensor similarity'] = improved_sqrt_cos_sim_tensorflow(bert_resume, bert_position).numpy()

    elmo_embeddings = context_word_embeddings_elmo(elmo_model, corpus_resume, corpus_position)
    elmo_resume = elmo_embeddings[0]
    elmo_position = elmo_embeddings[1]
    characteristics['ElMo cosine tensor similarity'] = cosine_similarity_tensorflow(elmo_resume, elmo_position).numpy()
    characteristics['ElMo sqrt-cos tensor similarity'] = sqrt_cos_sim_tensorflow(elmo_resume, elmo_position).numpy()
    characteristics['ElMo improved sqrt-cos tensor similarity'] = improved_sqrt_cos_sim_tensorflow(elmo_resume, elmo_position).numpy()

    results.append(characteristics)

df = pd.DataFrame(results)
results

[{'Resume': 'Business analyst Male.pdf',
  'Position': 'Business Analyst.txt',
  'Jaccard similarity': 0.06581740976645435,
  'Dice similarity': 0.12350597609561753,
  'Bag-of-words cosine similarity': 0.183583553976399,
  'Bag-of-words sqrt-cos similarity': nan,
  'Bag-of-words improved sqrt-cos similarity': 0.4284665144167032,
  'Tf-idf cosine similarity': 0.11108178270708521,
  'Tf-idf sqrt-cos similarity': nan,
  'Tf-idf improved sqrt-cos similarity': 0.33328933782388726,
  'Word2Vec cosine similarity': 0.8775057765303992,
  'Word2Vec sqrt-cos similarity': 0.5479819305240776,
  'Word2Vec improved sqrt-cos similarity': 0.9367527830385127,
  'GloVe cosine similarity': 0.9282224260373347,
  'GloVe sqrt-cos similarity': 0.9669506642601486,
  'GloVe improved sqrt-cos similarity': 0.9634430061178163,
  'FastText cosine similarity': 0.9597905927359239,
  'FastText sqrt-cos similarity': 0.920887663840816,
  'FastText improved sqrt-cos similarity': 0.9796890285881147,
  'BERT cosine tensor 

In [None]:
df

| | Resume | Position | Jaccard similarity | Dice similarity | Bag-of-words cosine similarity | Bag-of-words sqrt-cos similarity | Bag-of-words improved sqrt-cos similarity | Tf-idf cosine similarity | Tf-idf sqrt-cos similarity | Tf-idf improved sqrt-cos similarity | ... | GloVe improved sqrt-cos similarity | FastText cosine similarity | FastText sqrt-cos similarity | FastText improved sqrt-cos similarity | BERT cosine tensor similarity | BERT sqrt-cos tensor similarity | BERT improved sqrt-cos tensor similarity | ElMo cosine tensor similarity | ElMo sqrt-cos tensor similarity | ElMo improved sqrt-cos tensor similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Business analyst Male.pdf | Business Analyst.txt | 0.065817 | 0.123506 | 0.183584 | NaN | 0.428467 | 0.111082 | NaN | 0.333289 | ... | 0.963443 | 0.959791 | 0.920888 | 0.979689 | 0.388061 | NaN | 0.622945 | 0.429953 | NaN | 0.655708 |
| 1 | Business analyst Male.pdf | Senior Software Developer.txt | 0.065076 | 0.1222 | 0.128332 | NaN | 0.358235 | 0.074858 | NaN | 0.273601 | ... | 0.960415 | 0.957273 | 0.94058 | 0.978403 | 0.313262 | NaN | 0.559698 | 0.449745 | NaN | 0.670631 |
| 2 | Junior Frontend Developer Male.pdf | Business Analyst.txt | 0.056604 | 0.107143 | 0.116304 | NaN | 0.341034 | 0.064117 | NaN | 0.253213 | ... | 0.96281 | 0.971515 | 0.917988 | 0.985655 | 0.339251 | NaN | 0.582453 | 0.366787 | NaN | 0.605629 |
| 3 | Junior Frontend Developer Male.pdf | Senior Software Developer.txt | 0.076419 | 0.141988 | 0.175285 | NaN | 0.41867 | 0.104283 | NaN | 0.322929 | ... | 0.967528 | 0.974688 | 0.94798 | 0.987263 | 0.303084 | NaN | 0.550531 | 0.406456 | NaN | 0.637539 |
| 4 | Junior Full Stack Developer Female.pdf | Business Analyst.txt | 0.039583 | 0.076152 | 0.109768 | NaN | 0.331313 | 0.060767 | NaN | 0.246509 | ... | 0.950115 | 0.890547 | NaN | 0.943688 | 0.360206 | NaN | 0.600171 | 0.367743 | NaN | 0.606418 |
| 5 | Junior Full Stack Developer Female.pdf | Senior Software Developer.txt | 0.070175 | 0.131148 | 0.20591 | NaN | 0.453773 | 0.120432 | NaN | 0.347033 | ... | 0.954712 | 0.894368 | NaN | 0.94571 | 0.301823 | NaN | 0.549384 | 0.41639 | NaN | 0.645283 |
| 6 | Junior Software Developer Female.pdf | Business Analyst.txt | 0.04359 | 0.083538 | 0.133633 | NaN | 0.365558 | 0.077524 | NaN | 0.278431 | ... | 0.895564 | 0.863143 | NaN | 0.929055 | 0.24699 | NaN | 0.49698 | 0.229923 | NaN | 0.479502 |
| 7 | Junior Software Developer Female.pdf | Senior Software Developer.txt | 0.07027 | 0.131313 | 0.155096 | NaN | 0.393823 | 0.090537 | NaN | 0.300893 | ... | 0.887932 | 0.860557 | NaN | 0.927662 | 0.329097 | NaN | 0.57367 | 0.289976 | NaN | 0.538494 |
| 8 | Middle .NET Developer Male.pdf | Business Analyst.txt | 0.002933 | 0.005848 | 0.0 | NaN | 0.0 | 0.0 | NaN | 0.0 | ... | 0.537811 | -0.113323 | NaN | NaN | 0.286953 | NaN | 0.53568 | 0.109714 | NaN | 0.331231 |
| 9 | Middle .NET Developer Male.pdf | Senior Software Developer.txt | 0.00303 | 0.006042 | 0.0 | NaN | 0.0 | 0.0 | NaN | 0.0 | ... | 0.548054 | -0.106376 | NaN | NaN | 0.282867 | NaN | 0.531853 | 0.14184 | NaN | 0.376617 |


Next, an expert defines the ideal ranking of the candidates for each position, where 1 marks the most suitable candidate and 12 the least suitable.

Business Analyst, Amazon:
- 1 - Business analyst Male.pdf
- 7 - Junior Frontend Developer Male.pdf
- 3 - Junior Full Stack Developer Female.pdf
- 12 - Junior Software Developer Female.pdf
- 10 - Middle .NET Developer Male.pdf
- 9 - Middle Java Engineer Male.pdf
- 5 - Product Marketing Manager Female.pdf
- 6 - QA Specialist Female.pdf
- 4 - Senior .NET Developer Male.pdf
- 8 - Senior Electrical Engineer Male.pdf
- 3 - Senior Software Engineer Male.pdf
- 12 - Trainee UX-UI Designer Female.pdf

Senior Software Engineer, Google:
- 7 - Business analyst Male.pdf
- 8 - Junior Frontend Developer Male.pdf
- 5 - Junior Full Stack Developer Female.pdf
- 10 - Junior Software Developer Female.pdf
- 2 - Middle .NET Developer Male.pdf
- 7 - Middle Java Engineer Male.pdf
- 8 - Product Marketing Manager Female.pdf
- 3 - QA Specialist Female.pdf
- 3 - Senior .NET Developer Male.pdf
- 9 - Senior Electrical Engineer Male.pdf
- 1 - Senior Software Engineer Male.pdf
- 12 - Trainee UX-UI Designer Female.pdf

In [None]:
# Define lists of expert rankings
ba_ideal_rank = pd.Series([1, 7, 3, 12, 10, 9, 5, 3, 4, 8, 3, 11], name="Business Analyst profession ranking").astype(dtype=np.float64)
se_ideal_rank = pd.Series([7, 8, 5, 10, 2, 7, 8, 3, 3, 9, 1, 12], name="Senior Software Engineer profession ranking").astype(dtype=np.float64)
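
The expert lists contain tied ranks (for example, rank 3 appears more than once), and the cells below turn similarity scores into ranks with `Series.rank(ascending=False)`. The sketch below, on made-up scores, shows how pandas handles ties and missing values in that step; the resume labels and score values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical similarity scores for five resumes; one score is missing
scores = pd.Series(
    [0.39, 0.34, 0.36, 0.34, np.nan],
    index=["resume_a", "resume_b", "resume_c", "resume_d", "resume_e"],
)

# Highest score gets rank 1; tied scores share the average of their ranks;
# NaN scores stay NaN (pandas default na_option="keep")
ranks = scores.rank(ascending=False)
print(ranks)
```

Here `resume_b` and `resume_d` share rank 3.5, and `resume_e` has no rank at all, so a metric that returns `nan` for some resumes yields an incomplete ranking for the comparison with the expert.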

In [None]:
# Calculate weighted Kendall's tau coefficients for every metric and model - Business Analyst, Amazon
# Filter the rows for this position once instead of on every iteration
ba_scores = df[df['Position'] == 'Business Analyst.txt']
weightedtau_coefficients = {}
for coefficient in ba_scores.columns[2:]:
    weightedtau_coefficients[coefficient] = stats.weightedtau(ba_scores[coefficient].rank(ascending=False), ba_ideal_rank)[0]
sorted_weightedtau_coefficients = sorted(weightedtau_coefficients.items(), key=lambda x: (np.isnan(x[1]), -x[1] if not np.isnan(x[1]) else np.inf))
for coefficient, value in sorted_weightedtau_coefficients:
    if np.isnan(value):
        print(f"Weighted Tau coefficient for {coefficient} equals NaN")
    else:
        print(f"Weighted Tau coefficient for {coefficient} equals {value}")

Weighted Tau coefficient for BERT cosine tensor similarity equals 0.8674191741179471
Weighted Tau coefficient for BERT improved sqrt-cos tensor similarity equals 0.8674191741179471
Weighted Tau coefficient for Word2Vec cosine similarity equals 0.5801721967564071
Weighted Tau coefficient for Word2Vec improved sqrt-cos similarity equals 0.5801721967564071
Weighted Tau coefficient for GloVe cosine similarity equals 0.5643702849778357
Weighted Tau coefficient for GloVe improved sqrt-cos similarity equals 0.5643702849778357
Weighted Tau coefficient for ElMo cosine tensor similarity equals 0.5592359727945382
Weighted Tau coefficient for ElMo improved sqrt-cos tensor similarity equals 0.5592359727945382
Weighted Tau coefficient for FastText cosine similarity equals 0.38780912975348425
Weighted Tau coefficient for Jaccard similarity equals 0.339690242290892
Weighted Tau coefficient for Dice similarity equals 0.339690242290892
Weighted Tau coefficient for GloVe sqrt-cos similarity equals 0.2326

In [None]:
# Calculate weighted Kendall's tau coefficients for every metric and model - Senior Software Engineer, Google
# Filter the rows for this position once instead of on every iteration
se_scores = df[df['Position'] == 'Senior Software Developer.txt']
weightedtau_coefficients = {}
for coefficient in se_scores.columns[2:]:
    weightedtau_coefficients[coefficient] = stats.weightedtau(se_scores[coefficient].rank(ascending=False), se_ideal_rank)[0]
sorted_weightedtau_coefficients = sorted(weightedtau_coefficients.items(), key=lambda x: (np.isnan(x[1]), -x[1] if not np.isnan(x[1]) else np.inf))
for coefficient, value in sorted_weightedtau_coefficients:
    if np.isnan(value):
        print(f"Weighted Tau coefficient for {coefficient} equals NaN")
    else:
        print(f"Weighted Tau coefficient for {coefficient} equals {value}")

Weighted Tau coefficient for GloVe sqrt-cos similarity equals 0.8276414238016554
Weighted Tau coefficient for GloVe cosine similarity equals 0.31069970138022807
Weighted Tau coefficient for GloVe improved sqrt-cos similarity equals 0.31069970138022807
Weighted Tau coefficient for FastText improved sqrt-cos similarity equals 0.2969169791042248
Weighted Tau coefficient for Word2Vec cosine similarity equals 0.2399238308633545
Weighted Tau coefficient for Word2Vec improved sqrt-cos similarity equals 0.2399238308633545
Weighted Tau coefficient for BERT cosine tensor similarity equals 0.16012410391050208
Weighted Tau coefficient for BERT improved sqrt-cos tensor similarity equals 0.16012410391050208
Weighted Tau coefficient for ElMo cosine tensor similarity equals 0.14686813039501512
Weighted Tau coefficient for ElMo improved sqrt-cos tensor similarity equals 0.14686813039501512
Weighted Tau coefficient for Jaccard similarity equals 0.13644042210955845
Weighted Tau coefficient for Dice simil
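
When reading these coefficients, it helps to know how `scipy.stats.weightedtau` assigns weights: with its default settings, elements with larger values are treated as more important. The sketch below uses small made-up rankings to show the effect when the inputs are ranks in which 1 is the best candidate. Negating both inputs is shown as one possible way to emphasise agreement at the top of the list; whether that matches the intent of the analysis above is an assumption.

```python
import numpy as np
from scipy import stats

# Toy rankings of five candidates (1 = most suitable); values are illustrative
expert_rank = np.array([1, 2, 3, 4, 5], dtype=float)
swap_top = np.array([2, 1, 3, 4, 5], dtype=float)     # two best candidates swapped
swap_bottom = np.array([1, 2, 3, 5, 4], dtype=float)  # two worst candidates swapped

# Ranks passed as-is: the largest values (worst candidates) get the largest weights,
# so a swap at the bottom of the list is penalised more than a swap at the top
tau_top_raw = stats.weightedtau(swap_top, expert_rank)[0]
tau_bottom_raw = stats.weightedtau(swap_bottom, expert_rank)[0]

# Ranks negated: the best candidates now have the largest values,
# so a swap at the top of the list is penalised more
tau_top_neg = stats.weightedtau(-swap_top, -expert_rank)[0]
tau_bottom_neg = stats.weightedtau(-swap_bottom, -expert_rank)[0]

print("raw ranks:     top swap", tau_top_raw, "| bottom swap", tau_bottom_raw)
print("negated ranks: top swap", tau_top_neg, "| bottom swap", tau_bottom_neg)
```

In both variants the set of concordant and discordant pairs is the same; only the weight given to each pair changes.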

### Section 4. Conclusions

To compare human and algorithmic assessments and measure how well they correlate, an expert manually evaluated the candidates’ skills from their resumes and ranked the candidates for each position. These manual rankings were then compared with the rankings generated by the intelligent system. The comparison aimed to identify discrepancies, validate the effectiveness of the automated system, and determine the extent to which human judgment aligns with algorithmic assessment.

The results show that the automated system can reproduce the expert ranking closely: the Weighted Tau coefficient reaches about 0.867 for the Business Analyst position (BERT embeddings with cosine similarity) and about 0.828 for the Senior Software Engineer position (GloVe embeddings with sqrt-cos similarity). The best-performing model differs between the two positions, so the results depend strongly on the input data. I therefore also recommend a hybrid “Human-in-the-loop” approach that combines intelligent technologies with expert review, helping to keep the assessment of candidate skills objective and neutral.