#Potential Talents

The objective of this project is to list and rank fitting candidates as the first step. This list then goes through review process; each candidate is reviewed and then determined how good a fit they are. This procedure is done manually and at the end of this manual review, best fit candidate is selected despite it's original ranking. Based on this feedback, the list will be re-rank. This supervisory signal is going to be supplied by starring the selected candidate. Starring one candidate actually sets this candidate as an ideal candidate for the given role. The list is expected to re-ranked each time a candidate is starred.

##Getting started

In [2]:
!pip install sentence_transformers --quiet

In [3]:
import pandas as pd 
import time
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

# doc2vec
import gensim
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from gensim.models.doc2vec import TaggedDocument
import multiprocessing
#fasttext
from gensim.models.fasttext import FastText
from gensim.test.utils import datapath

from sentence_transformers import SentenceTransformer, util
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [5]:
#load data
talents = pd.read_csv("potential-talents.csv")
talents.head()

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [6]:
talents = talents.drop(columns=['fit'])
talents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          104 non-null    int64 
 1   job_title   104 non-null    object
 2   location    104 non-null    object
 3   connection  104 non-null    object
dtypes: int64(1), object(3)
memory usage: 3.4+ KB


In [7]:
jobs = talents['job_title'].tolist()
keywords = ['Aspiring human resources','seeking human resources']

Create corpus ready to be vectorized

In [8]:
corpus = jobs
corpus.extend(keywords)
print(corpus)

['2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional', 'Native English Teacher at EPIK (English Program in Korea)', 'Aspiring Human Resources Professional', 'People Development Coordinator at Ryan', 'Advisory Board Member at Celal Bayar University', 'Aspiring Human Resources Specialist', 'Student at Humber College and Aspiring Human Resources Generalist', 'HR Senior Specialist', 'Student at Humber College and Aspiring Human Resources Generalist', 'Seeking Human Resources HRIS and Generalist Positions', 'Student at Chapman University', 'SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR', 'Human Resources Coordinator at InterContinental Buckhead Atlanta', '2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional', '2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional', 'Native En

## Word Embedding

###TF-IDF

In [14]:
# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

Compute and print the cosine similarity matrix

In [15]:
# Record start time
start = time.time()
# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
# Print cosine similarity matrix
print(cosine_sim)
# Print time taken
print("Time taken: %s seconds" % (time.time() - start))

[[1.         0.         0.37463732 ... 0.         0.28065738 0.12170504]
 [0.         1.         0.         ... 0.         0.         0.        ]
 [0.37463732 0.         1.         ... 0.         0.74914421 0.32486095]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.28065738 0.         0.74914421 ... 0.         1.         0.43364274]
 [0.12170504 0.         0.32486095 ... 0.         0.43364274 1.        ]]
Time taken: 0.002537965774536133 seconds


Compute and print the linear kernel matrix

In [16]:
# Record start time
start = time.time()
# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# Print cosine similarity matrix
print(cosine_sim)
# Print time taken
print("Time taken: %s seconds" % (time.time() - start))

[[1.         0.         0.37463732 ... 0.         0.28065738 0.12170504]
 [0.         1.         0.         ... 0.         0.         0.        ]
 [0.37463732 0.         1.         ... 0.         0.74914421 0.32486095]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.28065738 0.         0.74914421 ... 0.         1.         0.43364274]
 [0.12170504 0.         0.32486095 ... 0.         0.43364274 1.        ]]
Time taken: 0.004076957702636719 seconds


In [17]:
jobs = talents['job_title']
jobs = jobs.append(pd.Series(['Aspiring human resources']),ignore_index=True)
indices = pd.Series(jobs.index, index=jobs).drop_duplicates()

tfidf = TfidfVectorizer(stop_words='english')
# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(jobs)
# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

idx = indices['Aspiring human resources']
sim_scores = list(enumerate(cosine_sim[idx]))
for idx, val in enumerate(sim_scores):
  sim_scores[idx] = val[1]
sim_scores_df = pd.DataFrame(sim_scores, columns=['scores'])
sim_scores_df = sim_scores_df.reset_index()
sim_scores_df = sim_scores_df.rename(columns={"index": "id"})
sim_scores_df = sim_scores_df.sort_values(by=['scores'], ascending=False)
talent_indices = sim_scores_df['id'][1:11]

#print top 10 for key word "Aspiring human resources"
print(talents[['id','job_title']].iloc[talent_indices])

    id                              job_title
57  58  Aspiring Human Resources Professional
45  46  Aspiring Human Resources Professional
20  21  Aspiring Human Resources Professional
16  17  Aspiring Human Resources Professional
32  33  Aspiring Human Resources Professional
2    3  Aspiring Human Resources Professional
96  97  Aspiring Human Resources Professional
59  60    Aspiring Human Resources Specialist
23  24    Aspiring Human Resources Specialist
48  49    Aspiring Human Resources Specialist


In [18]:
jobs = talents['job_title']
jobs = jobs.append(pd.Series(['seeking human resources']),ignore_index=True)
indices = pd.Series(jobs.index, index=jobs).drop_duplicates()

tfidf = TfidfVectorizer(stop_words='english')
# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(jobs)
# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

idx = indices['seeking human resources']
sim_scores = list(enumerate(cosine_sim[idx]))
for idx, val in enumerate(sim_scores):
  sim_scores[idx] = val[1]
sim_scores_df = pd.DataFrame(sim_scores, columns=['scores'])
sim_scores_df = sim_scores_df.reset_index()
sim_scores_df = sim_scores_df.rename(columns={"index": "id"})
sim_scores_df = sim_scores_df.sort_values(by=['scores'], ascending=False)
talent_indices = sim_scores_df['id'][1:11]

#print top 10 for key word "seeking human resources"
print(talents[['id','job_title']].iloc[talent_indices])

    id                                          job_title
29  30              Seeking Human Resources Opportunities
27  28              Seeking Human Resources Opportunities
98  99                   Seeking Human Resources Position
72  73  Aspiring Human Resources Manager, seeking inte...
9   10  Seeking Human Resources HRIS and Generalist Po...
61  62  Seeking Human Resources HRIS and Generalist Po...
39  40  Seeking Human Resources HRIS and Generalist Po...
52  53  Seeking Human Resources HRIS and Generalist Po...
26  27  Aspiring Human Resources Management student se...
28  29  Aspiring Human Resources Management student se...


### Word2vec

In [19]:
jobs = talents['job_title']

cores = multiprocessing.cpu_count()
model = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0, 
            epochs=20, workers=cores)
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(jobs)]
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

#keyword - aspiring human resources
vec1 = model.infer_vector('Aspiring human resources')

similars = model.docvecs.most_similar(positive=[vec1], topn=10)
similar_index = [sim_tuple[0] for sim_tuple in similars]
# print(similars)
print(talents[['id','job_title']].iloc[similar_index])

    id                                          job_title
72  73  Aspiring Human Resources Manager, seeking inte...
96  97              Aspiring Human Resources Professional
16  17              Aspiring Human Resources Professional
20  21              Aspiring Human Resources Professional
57  58              Aspiring Human Resources Professional
45  46              Aspiring Human Resources Professional
32  33              Aspiring Human Resources Professional
48  49                Aspiring Human Resources Specialist
23  24                Aspiring Human Resources Specialist
2    3              Aspiring Human Resources Professional


In [20]:
#keyword - seeking human resources
vec2 = model.infer_vector('seeking human resources')

similars = model.docvecs.most_similar(positive=[vec2], topn=10)
similar_index = [sim_tuple[0] for sim_tuple in similars]
# print(similars)
print(talents[['id','job_title']].iloc[similar_index])

    id                                          job_title
72  73  Aspiring Human Resources Manager, seeking inte...
96  97              Aspiring Human Resources Professional
16  17              Aspiring Human Resources Professional
20  21              Aspiring Human Resources Professional
57  58              Aspiring Human Resources Professional
45  46              Aspiring Human Resources Professional
71  72  Business Management Major and Aspiring Human R...
73  74                       Human Resources Professional
32  33              Aspiring Human Resources Professional
2    3              Aspiring Human Resources Professional


##Ranking

In [21]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')

Downloading:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/10.2k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/349 [00:00<?, ?B/s]

In [22]:
corpus = jobs
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

In [23]:
# Query sentences: 
querry = 'Aspiring human resources'
query_embedding = embedder.encode(querry, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings)

hits = pd.DataFrame(hits[0])
hits = hits.rename(columns={'corpus_id': 'id'}) 

In [24]:
result = pd.merge(talents,hits, on=['id'])
result

Unnamed: 0,id,job_title,location,connection,score
0,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,0.949807
1,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.928035
2,16,Native English Teacher at EPIK (English Progra...,Kanada,500+,0.949807
3,20,Native English Teacher at EPIK (English Progra...,Kanada,500+,0.949807
4,32,Native English Teacher at EPIK (English Progra...,Kanada,500+,0.949807
5,35,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.928035
6,45,Native English Teacher at EPIK (English Progra...,Kanada,500+,0.949807
7,48,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.928035
8,57,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,0.949807
9,96,Student at Indiana University Kokomo - Busines...,"Lafayette, Indiana",19,0.949807


In [25]:
# Query sentences: 
querry = 'seeking human resources'
query_embedding = embedder.encode(querry, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings)

hits = pd.DataFrame(hits[0])
hits = hits.rename(columns={'corpus_id': 'id'}) 

In [26]:
result = pd.merge(talents,hits, on=['id'])
result

Unnamed: 0,id,job_title,location,connection,score
0,9,Student at Humber College and Aspiring Human R...,Kanada,61,0.807988
1,23,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.780673
2,27,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.899172
3,29,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.899172
4,35,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.780673
5,39,Student at Humber College and Aspiring Human R...,Kanada,61,0.807988
6,52,Student at Humber College and Aspiring Human R...,Kanada,61,0.807988
7,59,People Development Coordinator at Ryan,"Denton, Texas",500+,0.780673
8,61,HR Senior Specialist,San Francisco Bay Area,500+,0.807988
9,98,Student,"Houston, Texas Area",4,0.904126


##Re-rank with RankNet

In [27]:
class RankNet(nn.Module):
    def __init__(self, num_feature):
        super(RankNet, self).__init__()
        self.model = nn.Sequential(
            nn.Linear( num_feature, 512),
            nn.Dropout(0.5),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, 256),
            nn.Dropout(0.5),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )
        self.output_sig = nn.Sigmoid()
    def forward(self, input_1,input_2):
        s1 = self.model(input_1)
        s2 = self.model(input_2)
        out = self.output_sig(s1-s2)
        return out
    def predict(self, input_):
        s = self.model(input_)
        return s

In [28]:
result['starred'] = result['score']

Starrring candidate after manual inpsection

In [29]:
starred_id = [int(item) for item in input("Please, enter ids of candidates you want to star?: ").split()]

Please, enter ids of candidates you want to star?: 7


In [30]:
#starred_id
result.loc[result['id'].isin(starred_id),'starred'] = 1

In [31]:
result

Unnamed: 0,id,job_title,location,connection,score,starred
0,9,Student at Humber College and Aspiring Human R...,Kanada,61,0.807988,0.807988
1,23,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.780673,0.780673
2,27,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.899172,0.899172
3,29,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.899172,0.899172
4,35,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.780673,0.780673
5,39,Student at Humber College and Aspiring Human R...,Kanada,61,0.807988,0.807988
6,52,Student at Humber College and Aspiring Human R...,Kanada,61,0.807988,0.807988
7,59,People Development Coordinator at Ryan,"Denton, Texas",500+,0.780673,0.780673
8,61,HR Senior Specialist,San Francisco Bay Area,500+,0.807988,0.807988
9,98,Student,"Houston, Texas Area",4,0.904126,0.904126


In [32]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def train(optimiz, list_lrs, num_epochs):
    
    dict_best=dict()
    dict_hidden=dict()
    loss_best=0
    num_epochs = num_epochs
    model = RankNet(num_feature = doc_1.shape[1]).to(device)
    #Loss function and optimizer
    criterion = nn.BCELoss()
    for lr in list_lrs:
        if optimiz == 'SGD':
            optimizer = torch.optim.SGD(model.parameters(), lr = lr, momentum = 0.9)
        elif optimiz == 'Adam':
            optimizer = torch.optim.Adam(model.parameters(), lr = lr)
        elif optimiz =='Adadelta':
            optimizer = torch.optim.Adadelta(model.parameters(), lr = lr)

        print('lr: ', lr, 'optimizer: ', optimiz)
        base_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
        base_path = os.path.dirname(base_path)
        data_path = base_path + '/train_result.txt'

        total_step = len(y_true)
        for epoch in range(num_epochs):
            pred = model(doc_1, doc_2)
            loss = criterion(pred, y_true)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if epoch % 100 == 0:
                print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch + 1, num_epochs, loss.item()))
            if loss_best==0:
                loss_best=loss.item()
            if loss.item()<loss_best:
                loss_best=loss.item()
                name_model=optimiz+'model.ckpt'
                torch.save(model.state_dict(), name_model)
        
        dict_hidden[lr]=loss_best    
        print('>' * 60)
    dict_best[optimiz]=dict_hidden
    lr_min=min(dict_hidden, key=dict_hidden.get)
    print("=="*100)
    print('Best model with the optimizer', optimiz,':learning rate =:',lr_min,'; loss =', dict_hidden[lr_min])
    return model, loss

In [33]:
#GENERATE DATA
rows_1 = result.sample(n = 100, replace = True)
rows_2 = result.sample(n = 100, replace = True)
#get list of job titles for each data generated
job_list_1 = list(rows_1['job_title'])
job_list_2 = list(rows_2['job_title'])

doc_1 = embedder.encode(job_list_1)
doc_2 = embedder.encode(job_list_2)
doc_1 = torch.from_numpy(doc_1).float()
doc_2 = torch.from_numpy(doc_2).float()

#Define Y true labels
y_1_true = list(rows_1['starred'])
y_2_true = list(rows_2['starred'])
y_true = torch.tensor([1.0 if y1_i>y2_i else 0.5 if y1_i==y2_i else 0.0 for y1_i, y2_i in zip(y_1_true, y_2_true)]).float()

y_true=y_true.unsqueeze(1)

print('doc_1.shape',doc_1.shape)
print('doc_2.shape',doc_2.shape)

print('y_true.shape',y_true.shape)

doc_1.shape torch.Size([100, 384])
doc_2.shape torch.Size([100, 384])
y_true.shape torch.Size([100, 1])


In [34]:
#Test 3 types of optimizers and different learning rates
optimizer_list=['Adam','SGD','Adadelta']
for optimiz in optimizer_list:
    model, loss = train(optimiz=optimiz, list_lrs=[0.2, 0.1, 0.01, 0.001, 0.0001, 0.00001], num_epochs=1000)


lr:  0.2 optimizer:  Adam
Epoch [1/1000], Loss: 0.6928
Epoch [101/1000], Loss: 0.5006
Epoch [201/1000], Loss: 0.5056
Epoch [301/1000], Loss: 0.5132
Epoch [401/1000], Loss: 0.6931
Epoch [501/1000], Loss: 0.6931
Epoch [601/1000], Loss: 0.6931
Epoch [701/1000], Loss: 0.6931
Epoch [801/1000], Loss: 0.6931
Epoch [901/1000], Loss: 0.6931
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
lr:  0.1 optimizer:  Adam
Epoch [1/1000], Loss: 0.6931
Epoch [101/1000], Loss: 0.6931
Epoch [201/1000], Loss: 0.6931
Epoch [301/1000], Loss: 0.6931
Epoch [401/1000], Loss: 0.6931
Epoch [501/1000], Loss: 0.6931
Epoch [601/1000], Loss: 0.6931
Epoch [701/1000], Loss: 0.6931
Epoch [801/1000], Loss: 0.6931
Epoch [901/1000], Loss: 0.6931
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
lr:  0.01 optimizer:  Adam
Epoch [1/1000], Loss: 0.6931
Epoch [101/1000], Loss: 0.6931
Epoch [201/1000], Loss: 0.6931
Epoch [301/1000], Loss: 0.6931
Epoch [401/1000], Loss: 0.6931
Epoch [501/1000], Loss: 0.6931

Best model with the optimizer SGD :learning rate =: 0.2 ; loss = 0.5320268869400024
lr:  0.2 optimizer:  Adadelta

In [35]:
model = RankNet(num_feature = doc_1.shape[1]).to(device)
model.load_state_dict(torch.load('SGDmodel.ckpt'))

<All keys matched successfully>

In [36]:
pred_scores = []
for i in result['job_title']:
    sentence_embeddings = embedder.encode(i)
    sentence_embeddings_tensor = torch.from_numpy(sentence_embeddings).float()
    pred = round(model.predict(sentence_embeddings_tensor).detach().numpy().sum(),2)
    pred_scores.append(pred)

result['rerank_fit'] = pred_scores
result.sort_values(by ='rerank_fit', ascending = False).head()

Unnamed: 0,id,job_title,location,connection,score,starred,rerank_fit
2,27,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.899172,0.899172,1.0
3,29,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.899172,0.899172,1.0
9,98,Student,"Houston, Texas Area",4,0.904126,0.904126,1.0
6,52,Student at Humber College and Aspiring Human R...,Kanada,61,0.807988,0.807988,0.98
0,9,Student at Humber College and Aspiring Human R...,Kanada,61,0.807988,0.807988,0.97


In [37]:
result.sort_values(by ='rerank_fit', ascending = False)

Unnamed: 0,id,job_title,location,connection,score,starred,rerank_fit
2,27,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.899172,0.899172,1.0
3,29,Aspiring Human Resources Management student se...,"Houston, Texas Area",500+,0.899172,0.899172,1.0
9,98,Student,"Houston, Texas Area",4,0.904126,0.904126,1.0
6,52,Student at Humber College and Aspiring Human R...,Kanada,61,0.807988,0.807988,0.98
0,9,Student at Humber College and Aspiring Human R...,Kanada,61,0.807988,0.807988,0.97
5,39,Student at Humber College and Aspiring Human R...,Kanada,61,0.807988,0.807988,0.95
8,61,HR Senior Specialist,San Francisco Bay Area,500+,0.807988,0.807988,0.83
1,23,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.780673,0.780673,0.0
4,35,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,0.780673,0.780673,0.0
7,59,People Development Coordinator at Ryan,"Denton, Texas",500+,0.780673,0.780673,0.0


##Conclusion

The current best model is based on BERT and RankNet. It's the optimizer SGD with a learning rate of 0.2 and loss score of 0.44%.