# TALENT SOURCING
   
In this notebook, a talent sourcing company first screens the available candidates against a set of job-title key words and ranks them by how closely their job titles match. A preferred candidate is then selected from the ranked list, and the list is ranked again using both the job title and the location of that preferred candidate.

Data Attributes:   
id : unique identifier for candidate (numeric)   
job_title : job title for candidate (text)   
location : geographical location for candidate (text)   
connection : number of connections the candidate has; 500+ means over 500 (text)   
Output (desired target):   
fit : how well the candidate fits the role (numeric, probability between 0 and 1)   
Keywords: “Aspiring human resources” or “seeking human resources”   


In [42]:
import nltk
import gensim
import numpy as np
import pandas as pd
from multiprocessing import cpu_count
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

In [163]:
df = pd.read_csv('potential-talents - Aspiring human resources - seeking human resources.csv')
df.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"Ä°zmir, TÃ¼rkiye",500+,


In [157]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         104 non-null    float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


## Data cleaning

In [167]:
# fix the mis-encoded and non-English location entries
# also replace '500+ ' (stored with a trailing space) in the 'connection' column with '500' so that the column can be converted to integers below
df.replace({'Ä°zmir, TÃ¼rkiye': 'İzmir, Türkiye', 'Kanada' : 'Canada', '500+ ':'500'}, inplace=True)

In [175]:
#change the data type of 'connection' column to 'int64'
df['connection'] = df['connection'].astype('int64')
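
The replacement above only works when the raw value matches `'500+ '` exactly, trailing space included. A more tolerant alternative is sketched below, assuming every entry is a digit string optionally followed by `+` and whitespace. `parse_connections` is a hypothetical helper, not part of the original notebook.

```python
def parse_connections(raw):
    """Convert a connection string such as '85' or '500+ ' to an int.

    Assumption: values are digits, optionally followed by '+' and
    surrounding whitespace; '500+' is capped at 500, as in the notebook.
    """
    cleaned = str(raw).strip().rstrip('+').strip()
    return int(cleaned)


examples = ['85', '500+ ', ' 500+', '44']
parsed = [parse_connections(value) for value in examples]
print(parsed)
```

With pandas, the same helper could be applied as `df['connection'] = df['connection'].apply(parse_connections)`.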

In [176]:
df.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Canada,500,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500,


In [177]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    int64  
 4   fit         0 non-null      float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.2+ KB


In [178]:
df.shape

(104, 5)

## Pre-process job_title column

In [97]:
df.job_title[3]

'People Development Coordinator at Ryan'

In [41]:
#preprocess job_title column only
job_titl = df.job_title.apply(gensim.utils.simple_preprocess)
job_titl

0      [bauer, college, of, business, graduate, magna...
1      [native, english, teacher, at, epik, english, ...
2             [aspiring, human, resources, professional]
3           [people, development, coordinator, at, ryan]
4      [advisory, board, member, at, celal, bayar, un...
                             ...                        
99     [aspiring, human, resources, manager, graduati...
100          [human, resources, generalist, at, loparex]
101    [business, intelligence, and, analytics, at, t...
102                [always, set, them, up, for, success]
103    [director, of, administration, at, excellence,...
Name: job_title, Length: 104, dtype: object
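
The output above shows that `simple_preprocess` lowercases the text, drops digits and punctuation, and discards very short tokens (for example the `C` and `T` of `C.T.`). The sketch below approximates that behaviour with the standard library only. `rough_preprocess` is a hypothetical stand-in for illustration, not gensim's implementation, and it may differ from gensim on edge cases such as accented characters.

```python
import re

# Runs of word characters that are not digits
TOKEN_PATTERN = re.compile(r"[^\W\d]+")


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and keep tokens of between min_len and max_len characters."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens
            if min_len <= len(tok) <= max_len and not tok.startswith('_')]


title_tokens = rough_preprocess('People Development Coordinator at Ryan')
graduate_tokens = rough_preprocess('2019 C.T. Bauer College of Business Graduate')
print(title_tokens)
print(graduate_tokens)
```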

In [46]:
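# the stop-word corpus must be available locally (run nltk.download('stopwords') once if it is not)
stopwords = nltk.corpus.stopwords.words('english')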
stopwords[:10]  #check the first 10 stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

## Define function to remove stop words from input phrases

In [56]:
# define function to remove stop words from a tokenized job title or location
def remove_stopwords(tokenized_text):
    cleaned_text = [word for word in tokenized_text if word not in stopwords]
    return cleaned_text
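
Because `stopwords` is a list, each membership test scans it from the start. A possible variant, sketched here, keeps the stop words in a set so that each lookup takes constant time on average. The tiny `STOP_WORDS` set and the name `remove_stopwords_fast` are hypothetical, chosen so the example runs without nltk.

```python
# Hypothetical mini stop-word set; the notebook uses nltk's English list
STOP_WORDS = frozenset({'at', 'of', 'and', 'for', 'them', 'up'})


def remove_stopwords_fast(tokenized_text, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [word for word in tokenized_text if word not in stop_words]


cleaned_title = remove_stopwords_fast(['people', 'development', 'coordinator', 'at', 'ryan'])
cleaned_motto = remove_stopwords_fast(['always', 'set', 'them', 'up', 'for', 'success'])
print(cleaned_title)
print(cleaned_motto)
```

In the notebook itself, the equivalent would be to build the set once with `set(nltk.corpus.stopwords.words('english'))`.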

### Remove stop words from the job_title column

In [192]:
# remove stop words from each tokenized job title
job_titl = job_titl.apply(remove_stopwords)

job_titl

0      [bauer, college, business, graduate, magna, cu...
1      [native, english, teacher, epik, english, prog...
2             [aspiring, human, resources, professional]
3               [people, development, coordinator, ryan]
4      [advisory, board, member, celal, bayar, univer...
                             ...                        
99     [aspiring, human, resources, manager, graduati...
100              [human, resources, generalist, loparex]
101       [business, intelligence, analytics, travelers]
102                               [always, set, success]
103      [director, administration, excellence, logging]
Name: job_title, Length: 104, dtype: object

### Remove stop words from the location column

In [285]:
# preprocess the location column and remove stop words
location_name = df.location.apply(gensim.utils.simple_preprocess)
location_name = location_name.apply(remove_stopwords)
location_name

0                              [houston, texas]
1                                      [canada]
2      [raleigh, durham, north, carolina, area]
3                               [denton, texas]
4                               [zmir, türkiye]
                         ...                   
99                  [cape, girardeau, missouri]
100    [raleigh, durham, north, carolina, area]
101            [greater, new, york, city, area]
102               [greater, los, angeles, area]
103                               [katy, texas]
Name: location, Length: 104, dtype: object

### Combine cleaned job titles and locations into a single series

In [203]:
# stack the two series of token lists into one training corpus
job_location_combined = pd.concat([job_titl, location_name])
job_location_combined

0      [bauer, college, business, graduate, magna, cu...
1      [native, english, teacher, epik, english, prog...
2             [aspiring, human, resources, professional]
3               [people, development, coordinator, ryan]
4      [advisory, board, member, celal, bayar, univer...
                             ...                        
99                           [cape, girardeau, missouri]
100             [raleigh, durham, north, carolina, area]
101                     [greater, new, york, city, area]
102                        [greater, los, angeles, area]
103                                        [katy, texas]
Length: 208, dtype: object

## Initialize gensim model

In [79]:
cpu_count()

4

In [208]:
# initialize a gensim Word2Vec model
# window: maximum number of words before and after the target word that count as context (e.g. 5 or 7)
# min_count: words whose total frequency in the corpus is below this value are ignored, so words seen only once are dropped
# workers: number of worker threads used for training; cpu_count() reports 4 logical CPUs on this laptop

model = gensim.models.Word2Vec(
         window = 5,
         min_count = 2,
         workers = 4)  # cpu_count for this laptop is 4
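
`min_count` is a threshold on how often a word occurs across the whole corpus, not on sentence length. The toy sketch below shows the effect of such a threshold on a few token lists. `filter_rare_words` and `toy_corpus` are hypothetical illustrations and do not reproduce gensim's internal vocabulary building.

```python
from collections import Counter


def filter_rare_words(sentences, min_count=2):
    """Drop words whose total frequency across all sentences is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[word for word in sentence if counts[word] >= min_count]
            for sentence in sentences]


toy_corpus = [
    ['aspiring', 'human', 'resources', 'professional'],
    ['human', 'resources', 'generalist', 'loparex'],
    ['houston', 'texas'],
    ['katy', 'texas'],
]
filtered = filter_rare_words(toy_corpus)
print(filtered)
```

With only 208 short token lists in the training corpus, a threshold of 2 leaves a small vocabulary, so many rare title words receive no vector at all.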

## Build model vocabulary with BOTH cleaned job titles and locations

In [209]:
#build the vocabulary
model.build_vocab(job_location_combined)

In [210]:
model.corpus_count, model.epochs

(208, 5)

## Train the model

In [211]:
model.train(job_location_combined, total_examples = model.corpus_count, epochs = model.epochs)

(1523, 4695)

In [90]:
# preprocess, tokenize and remove stop words (using the remove_stopwords function) from a key-word phrase
key_words = remove_stopwords(gensim.utils.simple_preprocess('Aspiring human resources'))
key_words

['aspiring', 'human', 'resources']

## Define function to compute the cosine similarity between two lists of words derived from two phrases

In [212]:
# define function to compute the cosine similarity between two lists of words (not individual words)
def word_set_similarity(test_set, reference):
    return model.wv.n_similarity(test_set, reference)

In [213]:
score = word_set_similarity(job_titl[5], key_words)
score

0.8704935
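
A score like the one above is, as far as I understand `n_similarity`, the cosine similarity between the mean vector of each word list. The sketch below reproduces that idea with hand-made 2-D vectors. `TOY_VECTORS` and the helper names are hypothetical, and the real model uses learned vectors of much higher dimension.

```python
import math

# Hypothetical 2-D word vectors, purely for illustration
TOY_VECTORS = {
    'aspiring': (1.0, 0.0),
    'human': (0.8, 0.6),
    'resources': (0.6, 0.8),
    'teacher': (-1.0, 0.0),
}


def mean_vector(words, vectors):
    """Average the vectors of the given words, component by component."""
    rows = [vectors[word] for word in words]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_set_similarity(test_set, reference, vectors=TOY_VECTORS):
    """Cosine similarity between the mean vectors of two word lists."""
    return cosine(mean_vector(test_set, vectors), mean_vector(reference, vectors))


same_sets = toy_set_similarity(['human', 'resources'], ['human', 'resources'])
opposite = toy_set_similarity(['teacher'], ['aspiring'])
print(same_sets, opposite)
```

Identical word lists score close to 1 and opposed vectors score close to -1, which is why the ranking can contain negative values.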

## Define function to pre-process and remove stop words from an input phrase

In [214]:
#define function to remove stopwords and preprocess a sentence
def rem_stopwrds_preprocess(sentence):
    return remove_stopwords(gensim.utils.simple_preprocess(sentence))

## Rank candidates by cosine similarity to the key words

Calculate the cosine similarity between the key words and each job title of the potential candidates in the list. Create a new dataframe with the similarity values in the job_fit column, sorted in descending order.

In [215]:
# preprocess and tokenize the key-word phrase that will be used to rank the candidate job titles
key_words = rem_stopwrds_preprocess('Aspiring human resources') # enter 'seeking human resources' or  'Aspiring human resources'

# each candidate job title is compared with the key words using the cosine-similarity function
df['fit'] = df['job_title'].apply(lambda x: word_set_similarity(rem_stopwrds_preprocess(x), key_words))

candidates_1 = df.sort_values('fit', ignore_index=True, ascending=False)
candidates_1.rename(columns={'fit':'job_fit'}, inplace=True)

candidates_1

Unnamed: 0,id,job_title,location,connection,job_fit
0,79,Liberal Arts Major. Aspiring Human Resources A...,"Baton Rouge, Louisiana Area",7,0.887459
1,36,Aspiring Human Resources Specialist,Greater New York City Area,1,0.870493
2,49,Aspiring Human Resources Specialist,Greater New York City Area,1,0.870493
3,24,Aspiring Human Resources Specialist,Greater New York City Area,1,0.870493
4,6,Aspiring Human Resources Specialist,Greater New York City Area,1,0.870493
...,...,...,...,...,...
99,32,Native English Teacher at EPIK (English Progra...,Canada,500,-0.032830
100,45,Native English Teacher at EPIK (English Progra...,Canada,500,-0.032830
101,2,Native English Teacher at EPIK (English Progra...,Canada,500,-0.032830
102,80,Junior MES Engineer| Information Systems,"Myrtle Beach, South Carolina Area",52,-0.046584
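
Since `min_count = 2` drops words seen only once, a job title can contain tokens for which the model has no vector, and how gensim treats such tokens may depend on the installed version. The defensive sketch below filters tokens against a known vocabulary before scoring. `keep_known_tokens`, `safe_similarity`, `toy_vocab` and the stand-in `token_overlap` similarity are all hypothetical and not part of the original notebook.

```python
def keep_known_tokens(tokens, vocabulary):
    """Keep only the tokens present in the vocabulary, preserving their order."""
    return [tok for tok in tokens if tok in vocabulary]


def safe_similarity(test_tokens, reference_tokens, vocabulary, similarity_fn):
    """Score two token lists, returning None when either has no known tokens."""
    known_test = keep_known_tokens(test_tokens, vocabulary)
    known_ref = keep_known_tokens(reference_tokens, vocabulary)
    if not known_test or not known_ref:
        return None  # nothing to compare; the caller decides how to rank this
    return similarity_fn(known_test, known_ref)


def token_overlap(a, b):
    """Stand-in similarity: shared tokens divided by all distinct tokens."""
    return len(set(a) & set(b)) / len(set(a) | set(b))


toy_vocab = {'aspiring', 'human', 'resources', 'professional'}
keys = ['aspiring', 'human', 'resources']
score_known = safe_similarity(['human', 'resources', 'loparex'], keys, toy_vocab, token_overlap)
score_unknown = safe_similarity(['loparex'], keys, toy_vocab, token_overlap)
print(score_known, score_unknown)
```

With the notebook's model, the vocabulary could presumably be `model.wv.key_to_index` (assuming gensim 4.x naming) and the similarity function `word_set_similarity`.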


In [216]:
rem_stopwrds_preprocess(df[df.id==102].iloc[0]['location'])

['greater', 'new', 'york', 'city', 'area']

## Re-rank candidates using a preferred (starred) candidate

Define a function to re-rank the list of candidates based on the job title AND location of a preferred (starred) candidate. Re-ranking involves differentially weighting the job title, location and the number of connections.

In [277]:
# the following function takes the id number of the preferred candidate from the first round and uses that candidate's
# job title and location to re-rank the candidates. For the ranking, job title is given the highest weight,
# followed by candidate location, followed by the candidate's number of LinkedIn connections

def rerank(id_num, input_df):
    '''
    input: id_num - id number of the preferred candidate from the first round of screening
    input: input_df - the df where the data from the first screening round is stored
    
    output: df with new columns for the job and location similarity values, a score for the number of
            connections, and a column for the combined weighted score, sorted by that score.
    '''
    # keep only the first four columns (id, job_title, location, connection); the scores are recalculated below
    # copy the slice so that the new columns are not written into a view of input_df
    df = input_df.iloc[:, :4].copy()
    
    # tokenized job title of the preferred candidate
    cand_job_preprocess = rem_stopwrds_preprocess(df[df.id==id_num].iloc[0]['job_title'])
    # tokenized location of the preferred candidate
    cand_loc_preprocess = rem_stopwrds_preprocess(df[df.id==id_num].iloc[0]['location'])
    
    # cosine similarity between each candidate's job title and the job title of the preferred candidate
    df['job_fit'] = df['job_title'].apply(lambda x: word_set_similarity(rem_stopwrds_preprocess(x), cand_job_preprocess))
    # cosine similarity between each candidate's location and the location of the preferred candidate
    df['location_fit'] = df['location'].apply(lambda x: word_set_similarity(rem_stopwrds_preprocess(x), cand_loc_preprocess))
    
    # scale the connection count by the maximum value in the column
    df['connection_score'] = df['connection'] / df['connection'].max()

    # combine the job_fit, location_fit and connection_score values:
    # job_fit gets the most weight (*3), followed by location_fit (*2) and then connection_score (*0.5)
    # the weighted sum is divided by 3 (the number of scores), not by the sum of the weights, so it can exceed 1
    df['job_loc_connection_fit'] = (df['job_fit']*3+df['location_fit']*2+df['connection_score']*0.5)/3
    
    # sort the df by the combined weighted score
    df = df.sort_values('job_loc_connection_fit', ignore_index = True, ascending=False)
    
    return df

## An example run of the rerank function using a starred candidate (id = 53)

In [278]:
# rerank the list based on candidate id 53
output_df = rerank(53, candidates_1)
output_df.head(10)

Unnamed: 0,id,job_title,location,connection,job_fit,location_fit,connection_score,job_loc_connection_fit
0,62,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500,1.0,1.0,1.0,1.833333
1,40,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500,1.0,1.0,1.0,1.833333
2,53,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500,1.0,1.0,1.0,1.833333
3,10,Seeking Human Resources HRIS and Generalist Po...,Greater Philadelphia Area,500,1.0,1.0,1.0,1.833333
4,69,"Director of Human Resources North America, Gro...","Greater Grand Rapids, Michigan Area",500,0.391234,0.817515,1.0,1.10291
5,74,Human Resources Professional,Greater Boston Area,16,0.516438,0.817515,0.032,1.066781
6,89,Director Human Resources at EY,Greater Atlanta Area,349,0.434369,0.718418,0.698,1.029648
7,99,Seeking Human Resources Position,"Las Vegas, Nevada Area",48,0.603871,0.583191,0.096,1.008665
8,68,Human Resources Specialist at Luxottica,Greater New York City Area,500,0.51184,0.482865,1.0,1.000417
9,29,Aspiring Human Resources Management student se...,"Houston, Texas Area",500,0.552762,0.411324,1.0,0.993644
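
The `job_loc_connection_fit` values in the table above follow directly from the weighting used in `rerank`: a weighted sum with weights 3, 2 and 0.5, divided by 3. The sketch below recomputes two rows by hand. `combined_score` is a hypothetical helper, and the `divisor=5.5` variant (dividing by the sum of the weights) is shown only as an alternative, not as the notebook's method.

```python
def combined_score(job_fit, location_fit, connection_score,
                   weights=(3.0, 2.0, 0.5), divisor=3.0):
    """Weighted sum of the three scores divided by a fixed divisor."""
    w_job, w_loc, w_conn = weights
    total = w_job * job_fit + w_loc * location_fit + w_conn * connection_score
    return total / divisor


# Recompute the rows for the starred candidate (id 53) and for id 74
starred = combined_score(1.0, 1.0, 1.0)
boston = combined_score(0.516438, 0.817515, 0.032)
print(round(starred, 6), round(boston, 6))

# Alternative: divide by the sum of the weights so the weights act as a true weighted mean
normalized = combined_score(1.0, 1.0, 1.0, divisor=5.5)
print(normalized)
```

Dividing by 5.5 would not change the ordering of candidates, only the scale of the score.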
