## Loading Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
import gensim
from gensim.models import Word2Vec

## Loading Data

In [None]:
sheet_id = '117X6i53dKiO7w6kuA1g1TpdTlv1173h_dPlJt5cNNMU'
url = 'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_id}'.format(sheet_id=sheet_id)
df = pd.read_csv(url)


In [None]:
df.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85.0,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,
3,4,People Development Coordinator at Ryan,"Denton, Texas",,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",,


## Data Cleaning

In [None]:
df.loc[df['location'] == 'Kanada', 'location'] = 'Canada'


In [None]:
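def remove_digits_punc(df):
  # regex=True is needed because pandas >= 2.0 treats the pattern as a literal string by default
  df['job_title'] = df['job_title'].str.replace(r'\d+', '', regex=True)      # drop digits
  df['job_title'] = df['job_title'].str.replace(r'[^\w\s]', '', regex=True)  # drop punctuation

  return df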


df = remove_digits_punc(df)

df.head()



Unnamed: 0,id,job_title,location,connection,fit
0,1,CT Bauer College of Business Graduate Magna C...,"Houston, Texas",85.0,
1,2,Native English Teacher at EPIK English Program...,Canada,,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,
3,4,People Development Coordinator at Ryan,"Denton, Texas",,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",,
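
The two substitutions in `remove_digits_punc` can be checked on a single string with Python's `re` module. This is only a sketch: the sample title and the helper name `clean_title` are made up for illustration and do not come from the dataset.

```python
import re

def clean_title(title):
    """Apply the same two patterns used in remove_digits_punc to one string."""
    no_digits = re.sub(r'\d+', '', title)       # drop runs of digits
    return re.sub(r'[^\w\s]', '', no_digits)    # drop anything that is not a word char or whitespace

# Hypothetical job title, not taken from the data
sample = "Senior HR Manager (2019), Acme Co."
cleaned = clean_title(sample)
print(repr(cleaned))
```

Note that removing characters can leave doubled spaces behind; `gensim.utils.simple_preprocess` splits on whitespace later, so this does not affect the tokens.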


## Modeling and Preprocessing
Machine learning models cannot operate on raw text, so the job titles first had to be converted into numeric vectors with a language model.
A Word2Vec model was used to create word embeddings representing each job title in the data frame. The embeddings were then used to measure how closely each job title vector matched the keyword phrase 'Aspiring Human Resources'.
In the code below, the job titles are first tokenised and then embedded using the Word2Vec model, with each title represented by the average of its word vectors. Cosine similarity is then used to determine how closely each job title matches the keyword. Scores close to 1 indicate a very strong match, while scores close to or below 0 indicate little or no relevance.
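
To make the score ranges concrete, here is a minimal sketch of cosine similarity written with NumPy. The three-dimensional vectors are hypothetical stand-ins for embeddings, and `cosine_sim` is an illustrative helper, not the scikit-learn function used in the notebook.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between a and b; returns 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical "embeddings", made up for illustration only
keyword = np.array([1.0, 2.0, 0.0])
same_direction = np.array([2.0, 4.0, 0.0])  # parallel to keyword: strong match
orthogonal = np.array([0.0, 0.0, 5.0])      # shares no direction with keyword: no match
opposite = -keyword                         # points the other way: negative score

for name, vec in [("same", same_direction), ("orthogonal", orthogonal), ("opposite", opposite)]:
    print(name, round(cosine_sim(keyword, vec), 3))
```

Because the measure depends only on direction, scaling a vector (as averaging word vectors does) leaves the score unchanged.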


In [None]:
# Tokenization
tokenized_titles = [gensim.utils.simple_preprocess(title) for title in df['job_title']]

# Training the Word2Vec model
model = Word2Vec(sentences=tokenized_titles, vector_size=100, window=5, min_count=1, workers=4, sg=0)

# Calculating similarity with "aspiring human resources"
keyword_vector = (model.wv['aspiring'] + model.wv['human'] + model.wv['resources']) / 3

similarities = []
title_vectors = []
for title_tokens in tokenized_titles:
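    # Average only the tokens found in the vocabulary; use a zero vector if a title has no tokens
    vectors = [model.wv[token] for token in title_tokens if token in model.wv]
    title_vector = np.mean(vectors, axis=0) if vectors else np.zeros(model.wv.vector_size)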
    sim = cosine_similarity([title_vector], [keyword_vector])[0][0]
    title_vectors.append(title_vector)
    similarities.append(sim)

# Attaching similarities to the dataframe (sorting happens in the next cell)
df['fit'] = similarities


In [None]:
df.sort_values(by='fit', ascending=False).head(20)

Unnamed: 0,id,job_title,location,connection,fit
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823
23,24,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823
48,49,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823
35,36,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823
59,60,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823
16,17,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062
32,33,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062
20,21,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062
96,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71.0,0.859062


In [None]:
X = title_vectors
y = df['fit']


## Ranking Candidates
Although the Word2Vec model was effective at scoring candidates from their job titles, we also need a model that can rank candidates and account for those who may be misranked or otherwise overlooked in the talent acquisition process.
A regression model is trained to predict the fit scores calculated from each candidate's job title, and its predictions are used to rank the candidates. After the initial ranking, a candidate is manually promoted and the model is refit, showing how to account for individuals who could be overlooked.

## Random Forest Regressor Model



In [None]:
rf = RandomForestRegressor(criterion='squared_error')
rf.fit(X, y)

In [None]:
df['predicted_fit'] = rf.predict(X)
df.sort_values(by='predicted_fit', ascending=False).head(20)



Unnamed: 0,id,job_title,location,connection,fit,predicted_fit
35,36,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823
48,49,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823
59,60,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823
23,24,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062
20,21,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062
45,46,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062
57,58,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062
96,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71.0,0.859062,0.859062


## Random Forest Regressor Evaluation

In [None]:
mean_squared_error(y, rf.predict(X))


0.0012692172762427346
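
The MSE above is computed on the same rows the forest was trained on, so it measures how well the model reproduces the training scores rather than how it would score unseen titles. A minimal sketch of a hold-out check follows, assuming synthetic stand-in data: `X_demo`, `y_demo`, and `rf_demo` are hypothetical and are not the notebook's `X`, `y`, and `rf`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows of 5 features and a noisy target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# Hold out 25% of the rows so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0)
rf_demo.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, rf_demo.predict(X_train))
test_mse = mean_squared_error(y_test, rf_demo.predict(X_test))
print(f"train MSE: {train_mse:.4f}  test MSE: {test_mse:.4f}")
```

Comparing the two numbers gives a rough sense of how much of the training-set error is optimistic.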

## Re-ranking Candidates
Re-ranking a candidate from position 10 to number 1 by raising that candidate's predicted fit (id 72) to match the top candidate's score (id 49).



In [None]:
rank_diff = df.loc[df['id'] == 49, 'predicted_fit'].values[0] - df.loc[df['id'] == 72, 'predicted_fit'].values[0]
rank_diff


0.22176853820681575

In [None]:
df.loc[df['id'] == 72, 'predicted_fit'] = df.loc[df['id'] == 72, 'predicted_fit'].values[0] + rank_diff

In [None]:
df.sort_values(by='predicted_fit', ascending=False).head(20)

Unnamed: 0,id,job_title,location,connection,fit,predicted_fit
48,49,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823
59,60,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823
23,24,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823
71,72,Business Management Major and Aspiring Human R...,"Monroe, Louisiana Area",5.0,0.686652,0.868823
35,36,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823
32,33,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062
20,21,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062
45,46,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062


## Modeling after Reranking


In [None]:
rf.fit(X, df['predicted_fit'])


In [None]:
df['reranked_fit'] = rf.predict(X)

df.sort_values(by='reranked_fit', ascending=False).head(20)



Unnamed: 0,id,job_title,location,connection,fit,predicted_fit,reranked_fit
23,24,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823,0.868823
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823,0.868823
35,36,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823,0.868823
48,49,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823,0.868823
59,60,Aspiring Human Resources Specialist,Greater New York City Area,1.0,0.868823,0.868823,0.868823
45,46,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062,0.859062
96,97,Aspiring Human Resources Professional,"Kokomo, Indiana Area",71.0,0.859062,0.859062,0.859062
57,58,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062,0.859062
20,21,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062,0.859062
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44.0,0.859062,0.859062,0.859062


## Conclusion & Remarks



* In this project, a pointwise ranking algorithm was used to rank the candidates. The problem is treated as a regression problem in which each candidate is scored independently against a chosen keyword, and the scores are then used to rank the candidates.



* This approach works because it is simple to implement and easy to understand: it turns the task of sourcing candidates into a standard regression problem. Also, because each candidate is scored independently, the algorithm scales to a larger number of potential candidates, which helps save time when sourcing talent.


* Candidates who are not relevant to the keyword search can easily be filtered out as the list grows larger. In the table above, candidates with scores in the range 0 to -1 can be filtered out because they do not match the keyword search for the ideal candidate.
Also in the table above, a cut-off of 0.5 or greater can be used to filter candidates without losing high-potential individuals.
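
The cut-off described above can be sketched as a simple pandas filter. The frame `demo` and its scores are hypothetical values chosen for illustration; in the notebook the same filter would apply to the `fit` column of `df`.

```python
import pandas as pd

# Hypothetical candidates and fit scores, not taken from the dataset
demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "fit": [0.87, 0.69, 0.12, -0.05],
})

cutoff = 0.5  # threshold suggested in the remarks above

# Keep candidates at or above the cut-off, best match first
shortlist = demo[demo["fit"] >= cutoff].sort_values("fit", ascending=False)
print(shortlist)
```

A fixed threshold is the simplest choice; whether 0.5 is appropriate depends on how the scores are distributed for a given keyword.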
