## Problem Statement
In talent sourcing and management, understanding the unique requirements of each role and identifying the candidates who meet them remains a complex challenge. Because understanding client requirements, judging candidate suitability, and locating a large number of potential candidates are all intricate tasks, current semi-automated processes demand significant manual labor. To improve efficiency, a machine learning-powered pipeline will be established that can spot and rank candidates based on their fit for specific roles. While keyword-based searches, such as “full-stack software engineer” or “aspiring human resources,” are used to source candidates, the aim is to introduce a dynamic ranking system. This system will incorporate warning signals that request supervisory approval for occasional manual reviews, and it will allow candidate lists to be re-ranked in real time whenever a candidate is identified and starred as the ideal match. The objective is to streamline the candidate shortlisting process, reduce manual intervention, and continuously improve the accuracy of the automated rankings based on human feedback.

## Loading Libraries & Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
import gensim
from gensim.models import Word2Vec

## Loading Data

In [2]:
sheet_id = '117X6i53dKiO7w6kuA1g1TpdTlv1173h_dPlJt5cNNMU'
url = 'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_id}'.format(sheet_id=sheet_id)
df = pd.read_csv(url)


In [3]:
df.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85.0,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,
3,4,People Development Coordinator at Ryan,"Denton, Texas",,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",,


## Data Cleaning

In [4]:
df.loc[df['location'] == 'Kanada', 'location'] = 'Canada'


In [5]:
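def remove_digits_punc(df):
  # Raw-string patterns with regex=True: recent pandas versions treat the
  # pattern as a literal string unless regex=True is passed explicitly.
  df['job_title'] = df['job_title'].str.replace(r'\d+', '', regex=True)      # drop digits
  df['job_title'] = df['job_title'].str.replace(r'[^\w\s]', '', regex=True)  # drop punctuation

  return df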


df = remove_digits_punc(df)

df.head()


Unnamed: 0,id,job_title,location,connection,fit
0,1,CT Bauer College of Business Graduate Magna C...,"Houston, Texas",85.0,
1,2,Native English Teacher at EPIK English Program...,Canada,,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,
3,4,People Development Coordinator at Ryan,"Denton, Texas",,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",,
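
The effect of the two patterns can be checked on a single string with the standard `re` module. This is only a sketch: `clean_title` is a hypothetical helper, not part of the notebook, and the sample string is a shortened version of the first job title.

```python
import re

def clean_title(title: str) -> str:
    """Apply the same two substitutions as remove_digits_punc to one string."""
    no_digits = re.sub(r'\d+', '', title)     # drop runs of digits
    return re.sub(r'[^\w\s]', '', no_digits)  # drop everything except word characters and whitespace

sample = "2019 C.T. Bauer College of Business Graduate"
print(repr(clean_title(sample)))
```

Removing the leading year leaves a leading space in the cleaned title; calling `.strip()` afterwards would remove it if that matters downstream.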


## Modeling and Preprocessing
Machine learning models cannot work with raw text directly, so a language model was needed to convert the job titles into numeric vectors.
A Word2Vec model was used to create word embeddings that represent each job title in the data frame. The embeddings were then used to measure how closely each job title vector matched the keyword, 'Aspiring Human Resources'.
In the code below, the job titles are first tokenised and then embedded using the Word2Vec model, with each title represented by the average of its word vectors. Cosine similarity is then used to determine how closely each job title matches the keyword. Scores close to 1 indicate a very strong match, while scores close to or below 0 indicate little or no relevance.
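
As a rough illustration of the scoring step, the sketch below averages word vectors and compares them with cosine similarity using NumPy alone. The three-dimensional vectors are invented for the example and are not learned embeddings; `mean_vector` and `cosine` are hypothetical helper names.

```python
import numpy as np

def mean_vector(vectors):
    """Average a list of equal-length word vectors into one title vector."""
    return np.mean(np.asarray(vectors, dtype=float), axis=0)

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 3-dimensional vectors standing in for learned embeddings.
toy_vectors = {
    "aspiring": [1.0, 0.0, 0.0],
    "human": [0.0, 1.0, 0.0],
    "resources": [0.0, 0.0, 1.0],
    "specialist": [1.0, 1.0, 0.0],
    "teacher": [-1.0, 0.0, 0.0],
}

keyword = mean_vector([toy_vectors[w] for w in ("aspiring", "human", "resources")])
close_title = mean_vector(
    [toy_vectors[w] for w in ("aspiring", "human", "resources", "specialist")]
)
far_title = mean_vector([toy_vectors[w] for w in ("teacher",)])

print(cosine(keyword, close_title))  # high score: the title shares the keyword's words
print(cosine(keyword, far_title))    # negative score: the title points away from the keyword
```

Because cosine similarity depends only on direction, scaling a title vector (for example, dividing by a different token count) does not change its score.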


In [6]:
# Tokenization
tokenized_titles = [gensim.utils.simple_preprocess(title) for title in df['job_title']]

# Training the Word2Vec model
model = Word2Vec(sentences=tokenized_titles, vector_size=100, window=5, min_count=1, workers=4, sg=0)

# Calculate similarity with "aspiring human resources"
keyword_vector = (model.wv['aspiring'] + model.wv['human'] + model.wv['resources']) / 3

similarities = []
title_vectors = []
for title_tokens in tokenized_titles:
    # Average only the vectors that were actually found in the vocabulary
    title_vector = np.mean([model.wv[token] for token in title_tokens if token in model.wv], axis=0)
    sim = cosine_similarity([title_vector], [keyword_vector])[0][0]
    title_vectors.append(title_vector)
    similarities.append(sim)

# Attach similarities to the dataframe (sorting happens in the next cell)
df['fit'] = similarities


In [7]:
df.sort_values(by='fit', ascending=False).head(20)

Unnamed: 0,id,job_title,location,connection,fit
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823
23,24,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823
48,49,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823
35,36,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823
59,60,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823
16,17,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062
32,33,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062
20,21,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062
96,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71.0,0.859062


In [8]:
X = title_vectors
y = df['fit']


## Ranking Candidates
Although the Word2Vec model was efficient at scoring candidates based on their job titles, we need a model that can rank candidates and account for those who may be misranked or otherwise overlooked in the talent acquisition process.
A regression model will be used to rank candidates, with the fit scores calculated from each candidate's job title as the target. After this initial ranking, the model will be used to re-rank candidates, showing how to account for individuals who could be overlooked.

## Random Forest Regressor Model



In [9]:
rf = RandomForestRegressor(criterion='squared_error')
rf.fit(X, y)

In [10]:
df['predicted_fit'] = rf.predict(X)
df.sort_values(by='predicted_fit', ascending=False).head(20)



Unnamed: 0,id,job_title,location,connection,fit,predicted_fit
48,49,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.863125
35,36,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.863125
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.863125
59,60,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.863125
23,24,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.863125
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062
20,21,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062
16,17,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062
45,46,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062
57,58,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062


In [11]:
df.sort_values(by='fit', ascending=False).tail(20)

Unnamed: 0,id,job_title,location,connection,fit,predicted_fit
79,80,Junior MES Engineer Information Systems,"Myrtle Beach, South Carolina Area",52.0,0.067764,0.144397
1,2,Native English Teacher at EPIK English Program...,Canada,,0.06749,0.065605
15,16,Native English Teacher at EPIK English Program...,Canada,,0.06749,0.065605
19,20,Native English Teacher at EPIK English Program...,Canada,,0.06749,0.065605
44,45,Native English Teacher at EPIK English Program...,Canada,,0.06749,0.065605
31,32,Native English Teacher at EPIK English Program...,Canada,,0.06749,0.065605
41,42,SVP CHRO Marketing Communications CSR Officer...,"Houston, Texas Area",,0.041502,0.052651
63,64,SVP CHRO Marketing Communications CSR Officer...,"Houston, Texas Area",,0.041502,0.052651
54,55,SVP CHRO Marketing Communications CSR Officer...,"Houston, Texas Area",,0.041502,0.052651
11,12,SVP CHRO Marketing Communications CSR Officer...,"Houston, Texas Area",,0.041502,0.052651


## Random Forest Regressor Evaluation

In [12]:
mean_squared_error(y, rf.predict(X))


0.0010498440484866104
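
For reference, mean squared error is the average of the squared differences between the target and the prediction. The sketch below computes it by hand with NumPy on five (`fit`, `predicted_fit`) pairs copied from the tables above; `mse` is a hypothetical helper name. Since it covers only five rows, its result will not equal the full-dataset score, and note that the score above is measured on the same rows the model was trained on.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# (fit, predicted_fit) pairs taken from the output tables above
fit = [0.868823, 0.859062, 0.067764, 0.06749, 0.041502]
predicted_fit = [0.863125, 0.859062, 0.144397, 0.065605, 0.052651]

print(mse(fit, predicted_fit))
```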

## Re-ranking Candidates
After ranking candidates using the random forest regressor model, candidates whose job titles begin with 'Aspiring Human Resources' were ranked significantly higher than candidates with the title 'HR Senior Specialist'. The code below shows how the senior candidates were re-ranked by setting their target score to 1.

In [13]:
df.loc[df['job_title'] == 'HR Senior Specialist', 'predicted_fit'] = 1
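
The starring step can be wrapped in a small function that also reports how many rows matched, which is useful because an exact-match filter silently does nothing when the title is spelled differently. This is a sketch under assumptions: `star_title` and its parameters are hypothetical names, and the toy frame is invented for illustration.

```python
import pandas as pd

def star_title(frame, title, score_column="predicted_fit", starred_score=1.0):
    """Return a copy with the score of every row matching `title` overwritten,
    together with the number of rows that matched."""
    out = frame.copy()                       # leave the caller's frame untouched
    mask = out["job_title"] == title         # exact match on the cleaned title
    out.loc[mask, score_column] = starred_score
    return out, int(mask.sum())

# Invented toy frame, not the notebook's data
toy = pd.DataFrame({
    "job_title": ["HR Senior Specialist", "Aspiring Human Resources Specialist", "HR Senior Specialist"],
    "predicted_fit": [0.10, 0.86, 0.10],
})

starred, n_matched = star_title(toy, "HR Senior Specialist")
print(n_matched)
print(starred.sort_values(by="predicted_fit", ascending=False))
```

The updated column can then be used as the target when the model is refit.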

## Modeling after Reranking


In [14]:
rf1 = RandomForestRegressor(criterion='squared_error')
rf1.fit(X, df['predicted_fit'])


In [15]:
df['reranked_fit'] = rf1.predict(X)

df.sort_values(by='reranked_fit', ascending=False).head(20)



Unnamed: 0,id,job_title,location,connection,fit,predicted_fit,reranked_fit
60,61,HR Senior Specialist,San Francisco Bay Area,,-0.121039,1.0,1.0
50,51,HR Senior Specialist,San Francisco Bay Area,,-0.121039,1.0,1.0
25,26,HR Senior Specialist,San Francisco Bay Area,,-0.121039,1.0,1.0
7,8,HR Senior Specialist,San Francisco Bay Area,,-0.121039,1.0,1.0
37,38,HR Senior Specialist,San Francisco Bay Area,,-0.121039,1.0,1.0
23,24,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.863125,0.863125
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.863125,0.863125
35,36,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.863125,0.863125
59,60,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.863125,0.863125
48,49,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.863125,0.863125


## Conclusion & Remarks

