# Potential Talents Project

## Background:

As a talent sourcing and management company, we find talented individuals and source them to technology companies. Finding talented candidates is not easy, for several reasons. First, filling a role requires understanding it very well, which means understanding the client’s needs and what they are looking for in a potential candidate. Second, one needs to understand what makes a candidate shine for the role in question. Third, knowing where to find talented individuals is a challenge in itself.

Our work requires a lot of human labor and is full of manual operations. To automate this process, we want to build a better approach that saves us time and helps us spot potential candidates who could fit the roles we are trying to fill. Beyond that, for a specific role we are interested in developing a machine-learning-powered pipeline that spots talented individuals and ranks them by fitness.

We are currently sourcing a few candidates semi-automatically, so sourcing itself is not a concern at this time. Instead, we first want to determine the best-matching candidates based on how well they fit a given role. We generally search using keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources”, depending on the role we are trying to fill. These keywords might change, and you can expect that specific keywords will be provided to you.

Once we have listed and ranked the fitting candidates, we apply a review procedure: each candidate is inspected manually to determine how good a fit they are. At the end of this manual review, we might choose not the first candidate in the list but, say, the 7th. If that happens, we want to re-rank the previous list based on this information. This supervisory signal is supplied by starring the 7th candidate in the list. Starring a candidate sets that candidate as an ideal candidate for the given role, and we expect the list to be re-ranked each time a candidate is starred.

## Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

### Attributes:
* id : unique identifier for candidate (numeric)

* job_title : job title for candidate (text)

* location : geographical location for candidate (text)

* connection : number of connections the candidate has, 500+ means over 500 (text)
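
Because the connection count is stored as text, a small helper can turn it into a number if it is ever needed as a feature. The sketch below is only an illustration: `parse_connections` is a hypothetical name, and capping `500+` at 500 is an assumption, since the exact count is not available.

```python
def parse_connections(raw: str) -> int:
    """Convert a connection string such as '85' or '500+' to an integer.

    Assumption: '500+' is capped at 500 because the true count is unknown.
    """
    text = raw.strip()
    if text.endswith('+'):
        text = text[:-1]  # drop the trailing plus sign
    return int(text)

print(parse_connections('85'))
print(parse_connections('500+'))
```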

### Output (desired target):
* fit - how fit the candidate is for the role? (numeric, probability between 0-1)

Keywords: “Aspiring human resources” or “seeking human resources”


### Import Libraries

In [1]:
# random
import random

# cell output clear tool
from IPython.display import clear_output

# data
import pandas as pd
import numpy as np

# visualisations
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

from gensim.models import Word2Vec

from sklearn.metrics.pairwise import cosine_similarity

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nntor\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nntor\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\nntor\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nntor\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Load Data and Data Preprocessing

In [2]:
data = pd.read_csv('Data/potential-talents - Aspiring human resources - seeking human resources.csv')
data.head()

,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [3]:
# get info about the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          104 non-null    int64  
 1   job_title   104 non-null    object 
 2   location    104 non-null    object 
 3   connection  104 non-null    object 
 4   fit         0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 4.2+ KB


In [4]:
# drop the fit column because it contains only NaNs
data.drop('fit', axis=1, inplace=True)

In [5]:
# we will check for duplicates by removing the id col, which is unique for each row
duplicates_set = data.copy()
duplicates_set.drop('id', axis=1, inplace=True)
print(f'Number of duplicates: {duplicates_set.duplicated().sum()}')

Number of duplicates: 51


In [6]:
# we will drop the duplicate values
data = data.set_index('id')
data.drop_duplicates(inplace=True)
print(f'Number of duplicates: {data.duplicated().sum()}')
data.info()

Number of duplicates: 0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 1 to 104
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   job_title   53 non-null     object
 1   location    53 non-null     object
 2   connection  53 non-null     object
dtypes: object(3)
memory usage: 1.7+ KB


### We will see the values of job titles in order to modify them for further processing

In [7]:
data['job_title'].value_counts()

Aspiring Human Resources Professional                                                                                    2
2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional                 1
Lead Official at Western Illinois University                                                                             1
Senior Human Resources Business Partner at Heil Environmental                                                            1
Aspiring Human Resources Professional | An energetic and Team-Focused Leader                                             1
HR Manager at Endemol Shine North America                                                                                1
Human Resources professional for the world leader in GIS software                                                        1
RRP Brand Portfolio Executive at JTI (Japan Tobacco International)                                                       1
Information Syst

We should convert the job titles to lowercase, remove numbers, special characters, extra whitespace, and stopwords, and expand acronyms.
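
As a rough illustration of these cleaning steps, the following standard-library sketch applies them to one of the job titles shown above. It is not the notebook's implementation: `simple_clean` and `SAMPLE_STOPWORDS` are hypothetical names, the stopword set is a deliberately tiny sample rather than NLTK's English list, and lemmatization is left out.

```python
import re

# Hypothetical, deliberately tiny stopword sample (assumption, not NLTK's list)
SAMPLE_STOPWORDS = {'at', 'and', 'of', 'the', 'for', 'in', 'an', 'a'}

def simple_clean(text: str) -> str:
    # lowercase, then split on anything that is not a word character
    tokens = re.findall(r'\w+', text.lower())
    kept = [
        tok for tok in tokens
        if len(tok) > 1                          # drop one-letter tokens
        and not any(ch.isdigit() for ch in tok)  # drop tokens containing digits
        and tok not in SAMPLE_STOPWORDS          # drop stopwords
    ]
    return ' '.join(kept)

print(simple_clean('2019 C.T. Bauer College of Business Graduate (Magna Cum Laude)'))
print(simple_clean('People Development Coordinator at Ryan'))
```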

In [8]:
# clean a text field: tokenize, lowercase, lemmatize, then drop stopwords,
# one-letter tokens and tokens that contain digits
STOP_WORDS = set(stopwords.words('english'))  # build the set once instead of once per token
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):

  # tokenize on runs of word characters (drops punctuation and extra whitespace)
  tokenizer = RegexpTokenizer(r'\w+')
  tokens = tokenizer.tokenize(text)

  # lowercase + lemmatize (every token is treated as a verb because pos='v')
  tokens = [LEMMATIZER.lemmatize(token.lower(), pos='v') for token in tokens]

  # remove stopwords, one-letter words and tokens containing digits
  keep_words = [token for token in tokens
                if token not in STOP_WORDS
                and len(token) > 1
                and not re.search(r'\d', token)]
  return ' '.join(keep_words)

# apply our function
data['job_title'] = data['job_title'].apply(clean_text)

In [9]:
# expand the acronyms manually as well; the \b word boundaries stop short
# patterns such as 'hr' from matching inside longer words
data.replace({'job_title' : {r'\bmagna cum laude\b' : 'with great distinction', r'\bchro\b' : 'chief human resources officer', r'\bsvp\b' : 'senior vice president'
        , r'\bgphr\b' : 'global professional in human resources', r'\bhris\b' : 'human resources information system'
        , r'\bcsr\b' : 'corporate social responsibility', r'\bsphr\b' : 'senior professional in human resources'
        , r'\bhr\b' : 'human resources', r'\brrp\b' : 'recommended retail prices'}}, regex=True, inplace=True)

data['job_title'] = data['job_title'].apply(clean_text)
data.head()

id,job_title,location,connection
1,bauer college business graduate great distinct...,"Houston, Texas",85
2,native english teacher epik english program korea,Kanada,500+
3,aspire human resources professional,"Raleigh-Durham, North Carolina Area",44
4,people development coordinator ryan,"Denton, Texas",500+
5,advisory board member celal bayar university,"İzmir, Türkiye",500+


In [10]:
data['job_title'].value_counts()

aspire human resources professional                                                                                                                                                                                         2
bauer college business graduate great distinction aspire human resources professional                                                                                                                                       1
lead official western illinois university                                                                                                                                                                                   1
senior human resources business partner heil environmental                                                                                                                                                                  1
aspire human resources professional energetic team focus leader                                                 

### We will see the values of location in order to modify them for further processing

In [11]:
data['location'].value_counts()

Houston, Texas Area                    4
Raleigh-Durham, North Carolina Area    3
Greater New York City Area             3
Austin, Texas Area                     2
Amerika Birleşik Devletleri            2
Kanada                                 2
Greater Philadelphia Area              2
Greater Atlanta Area                   2
Torrance, California                   1
Highland, California                   1
Gaithersburg, Maryland                 1
Baltimore, Maryland                    1
Milpitas, California                   1
Greater Chicago Area                   1
Houston, Texas                         1
Long Beach, California                 1
Chattanooga, Tennessee Area            1
Bridgewater, Massachusetts             1
Lafayette, Indiana                     1
Kokomo, Indiana Area                   1
Las Vegas, Nevada Area                 1
Cape Girardeau, Missouri               1
Greater Los Angeles Area               1
Los Angeles, California                1
Dallas/Fort Wort

In [12]:
# manually normalise some problem values in location as well
data.replace({'location' : {'Amerika Birleşik Devletleri' : 'United States of America', 'Houston, Texas Area' : 'Houston, Texas', 'Austin, Texas Area' : 'Austin, Texas',
                            'Kanada' : 'Canada', 'Kokomo, Indiana Area' : 'Kokomo, Indiana', 'Türkiye' : 'Turkey'}}, regex=True, inplace=True)
data['location'] = data['location'].apply(clean_text)
data.head()

id,job_title,location,connection
1,bauer college business graduate great distinct...,houston texas,85
2,native english teacher epik english program korea,canada,500+
3,aspire human resources professional,raleigh durham north carolina area,44
4,people development coordinator ryan,denton texas,500+
5,advisory board member celal bayar university,i̇zmir turkey,500+


In [13]:
data['location'].value_counts()

houston texas                         5
raleigh durham north carolina area    3
greater new york city area            3
greater philadelphia area             2
greater atlanta area                  2
canada                                2
austin texas                          2
unite state america                   2
lake forest california                1
atlanta georgia                       1
greater los angeles area              1
cape girardeau missouri               1
las vegas nevada area                 1
kokomo indiana                        1
lafayette indiana                     1
bridgewater massachusetts             1
long beach california                 1
torrance california                   1
greater chicago area                  1
denton texas                          1
milpitas california                   1
baltimore maryland                    1
gaithersburg maryland                 1
highland california                   1
los angeles california                1


### Combine the job title field with location in order to use both as keywords for each candidate

In [14]:
data['keywords'] = (data['job_title'] + ' ' + data['location'])
data.head()

id,job_title,location,connection,keywords
1,bauer college business graduate great distinct...,houston texas,85,bauer college business graduate great distinct...
2,native english teacher epik english program korea,canada,500+,native english teacher epik english program ko...
3,aspire human resources professional,raleigh durham north carolina area,44,aspire human resources professional raleigh du...
4,people development coordinator ryan,denton texas,500+,people development coordinator ryan denton texas
5,advisory board member celal bayar university,i̇zmir turkey,500+,advisory board member celal bayar university i...


### Word2Vec Model to make the ranking based on keywords

In [95]:
def word2vector(feature, keyword, df, min_count=1, window=5, epochs=10):

  # put df's rows in a list. This is our corpus.
  corpus = df[feature].tolist()

  # also add our keywords to the corpus so that the model learns a vector for each of them
  temp_keywords = ' '.join(keyword)
  corpus.append(temp_keywords)

  # break our rows into tokens
  tokens = [word_tokenize(row) for row in corpus]

  # train our Word2Vec model
  # sg = 0 -> CBOW | sg = 1 -> skip-gram
  model = Word2Vec(tokens, min_count=min_count, window=window, sg=0, batch_words=10)
  # the constructor has already trained once; continue training for `epochs` more passes
  model.train(tokens, total_examples=model.corpus_count, epochs=epochs)

  # compute the keyword vector by summing each keyword's vector
  p = np.array(0.)
  for k in keyword:
    p = p + model.wv[k]

  # init
  cs = []

  # the keyword row was only needed for training, so remove it before scoring the candidates
  tokens.pop()

  # for each candidate row, sum the vectors of the tokens that are also keywords
  # and compare that sum with the keyword vector computed above
  for i in tokens:
    a = np.array(0.)
    for j in i:
      if j in keyword:
        a = a + model.wv[j]
    # note: np.mean reduces each vector to a single number, so this score can only be -1, 0 or 1
    cs.append(cosine_similarity(np.mean(np.array(p)).reshape(-1, 1), np.mean(a).reshape(-1, 1)).item())

  # add the cosine similarities into our df and sort the values based on cs
  df['cosine_similarity'] = np.array(cs)
  df = df.sort_values('cosine_similarity', ascending=False)
  return df

In [96]:
feature = 'keywords'
keywords = ['seeking', 'human', 'resources']

d_w2v = data.copy()
d_w2v = word2vector(feature, keywords, d_w2v)
d_w2v

id,job_title,location,connection,keywords,cosine_similarity
1,bauer college business graduate great distinct...,houston texas,85,bauer college business graduate great distinct...,1.0
83,human resources manager endemol shine north am...,los angeles california,268,human resources manager endemol shine north am...,1.0
74,human resources professional,greater boston area,16,human resources professional greater boston area,1.0
75,nortia staff seek human resources payroll admi...,san jose california,500+,nortia staff seek human resources payroll admi...,1.0
76,aspire human resources professional passionate...,new york new york,212,aspire human resources professional passionate...,1.0
77,human resources conflict management policies p...,dallas fort worth area,409,human resources conflict management policies p...,1.0
79,liberal arts major aspire human resources analyst,baton rouge louisiana area,7,liberal arts major aspire human resources anal...,1.0
81,senior human resources business partner heil e...,chattanooga tennessee area,455,senior human resources business partner heil e...,1.0
82,aspire human resources professional energetic ...,austin texas,174,aspire human resources professional energetic ...,1.0
84,human resources professional world leader gi s...,highland california,50,human resources professional world leader gi s...,1.0
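
Every score in the table above is 1.0 because `word2vector` applies `np.mean` to each summed vector before calling `cosine_similarity`, and the cosine of two one-element vectors can only be -1, 0 or 1. The sketch below contrasts that with comparing the full vectors. It is a dependency-free illustration, not the notebook's code: `cosine`, `mean` and the two three-dimensional vectors are made-up stand-ins for the Word2Vec vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def mean(v):
    return sum(v) / len(v)

# hypothetical stand-ins for the keyword vector and one candidate vector
query = [0.9, 0.1, 0.3]
candidate = [0.1, 0.8, 0.2]

# collapsing each vector to its mean leaves only the sign of each mean
collapsed = cosine([mean(query)], [mean(candidate)])

# comparing the full vectors keeps the directional information
full = cosine(query, candidate)

print(collapsed, full)
```

With the full vectors the two inputs score well below 1, which is what allows candidates to be told apart.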


### Word2Vec Model to make the re-ranking when a candidate is starred

In [99]:
candidate_id = 1
if candidate_id not in data.index:
    print('Candidate ID NOT valid!')

else:
    print('Candidate ID is valid!')

Candidate ID is valid!


In [100]:
feature = 'job_title'
keywords = word_tokenize(data.loc[candidate_id, feature])

d_w2v_starred = data.copy()
d_w2v_starred  = word2vector(feature, keywords, d_w2v_starred)
d_w2v_starred

id,job_title,location,connection,keywords,cosine_similarity
1,bauer college business graduate great distinct...,houston texas,85,bauer college business graduate great distinct...,1.0
73,aspire human resources manager seek internship...,houston texas,7,aspire human resources manager seek internship...,1.0
75,nortia staff seek human resources payroll admi...,san jose california,500+,nortia staff seek human resources payroll admi...,1.0
76,aspire human resources professional passionate...,new york new york,212,aspire human resources professional passionate...,1.0
77,human resources conflict management policies p...,dallas fort worth area,409,human resources conflict management policies p...,1.0
79,liberal arts major aspire human resources analyst,baton rouge louisiana area,7,liberal arts major aspire human resources anal...,1.0
81,senior human resources business partner heil e...,chattanooga tennessee area,455,senior human resources business partner heil e...,1.0
82,aspire human resources professional energetic ...,austin texas,174,aspire human resources professional energetic ...,1.0
83,human resources manager endemol shine north am...,los angeles california,268,human resources manager endemol shine north am...,1.0
84,human resources professional world leader gi s...,highland california,50,human resources professional world leader gi s...,1.0
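
The re-ranking cell above reuses the starred candidate's `job_title` tokens as the keyword list. The same idea can be sketched without Word2Vec by scoring every candidate against the starred title with a bag-of-words cosine. This is a simplified stand-in under stated assumptions: `bow_cosine` and `rerank` are hypothetical helpers, and `titles` holds only four of the cleaned job titles shown earlier.

```python
import math
from collections import Counter

def bow_cosine(a_tokens, b_tokens):
    """Cosine similarity between two token lists using raw word counts."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def rerank(titles, starred_id):
    """Return candidate ids ordered by similarity to the starred candidate's title."""
    star_tokens = titles[starred_id].split()
    scores = {cid: bow_cosine(star_tokens, title.split())
              for cid, title in titles.items()}
    return sorted(scores, key=scores.get, reverse=True)

# a few cleaned job titles taken from the tables above
titles = {
    3: 'aspire human resources professional',
    4: 'people development coordinator ryan',
    5: 'advisory board member celal bayar university',
    74: 'human resources professional',
}

print(rerank(titles, starred_id=74))
```

Starring a different candidate changes `star_tokens`, so calling `rerank` again yields a new ordering, which mirrors the requirement that the list be re-ranked each time a candidate is starred.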
