# Modeling

This notebook explores several learning-to-rank (LTR) algorithms to see whether they produce better recommendations than the basic techniques laid out in the 'eda.ipynb' notebook.

* The first task will be to add a few features to the dataset (mainly the 'relevancy' column) for LambdaRank to work properly.

* The second section will use PyTorch to implement the RankNet algorithm.

* The third section will implement LambdaRank via XGBoost and LightGBM.

* The fourth section will use pre-trained LLMs to see whether rankings can be obtained via prompt engineering.

In [58]:
import h5py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import random
import sklearn
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings(action='ignore')

## reads in the BERT embeddings
with h5py.File('../Save_Models/embeddings.h5', 'r') as f:
    sentence_embeddings = f['embeddings'][:]

print(f"Length of the overall dataset: {len(sentence_embeddings)}\n")
print(f"Length of each vector in each row: {len(sentence_embeddings[0])}\n")
print(f"First 5 vectors in the dataset:")
print(sentence_embeddings[:5])

Length of the overall dataset: 104

Length of each vector in each row: 384

First 5 vectors in the dataset:
[[-0.0670834   0.02225027  0.00085495 ... -0.1070449  -0.00528176
  -0.03933419]
 [-0.02041392  0.01740083  0.05136457 ...  0.01713212 -0.07010847
   0.04791332]
 [-0.03121123  0.04798946 -0.01407241 ... -0.06177549 -0.02161481
   0.03701356]
 [-0.10693805 -0.01276875 -0.07286399 ...  0.02066204  0.01083926
  -0.01667509]
 [-0.08322132  0.01650264 -0.00730821 ... -0.05996063 -0.0486112
   0.00990038]]


In [59]:
## reads in the altered dataset with job_titles stripped of stopwords and other unnecessary words
df = pd.read_csv('../Data/alterned_job_talents.csv')
df.drop(labels='Unnamed: 0', axis=1, inplace=True)
df['vector_embeddings'] = sentence_embeddings.tolist()
df.head()

Unnamed: 0,id,job_title,connections,location,fit,vector_embeddings
0,1,ct bauer college business graduate magna cum l...,85,"Houston, Texas",,"[-0.06708339601755142, 0.02225026860833168, 0...."
1,2,native english teacher epik english program korea,500,Kanada,,"[-0.020413920283317566, 0.017400825396180153, ..."
2,3,aspiring human resources professional,44,"Raleigh-Durham, North Carolina Area",,"[-0.031211234629154205, 0.047989461570978165, ..."
3,4,people development coordinator ryan,500,"Denton, Texas",,"[-0.10693804919719696, -0.012768750078976154, ..."
4,5,advisory board member celal bayar university,500,"İzmir, Türkiye",,"[-0.08322132378816605, 0.016502641141414642, -..."


In [60]:
## reads in the original dataset and stores the original job_titles into a list for later on
og_df = pd.read_csv('../Data/potential-talents - Aspiring human resources - seeking human resources.csv')
og_df_job_title = og_df['job_title'].tolist()
og_df_id = og_df['id']

In [61]:
SEED = random.randint(1000,9999) ## random draw, overridden by the fixed value on the next line
SEED = 6992

In [62]:
##### this step builds a dataframe of pairwise cosine similarity scores between every pair of candidates
sentence_similarities = cosine_similarity(sentence_embeddings, sentence_embeddings)
sentence_similarities = sentence_similarities.round(6)
sim_df = pd.DataFrame(sentence_similarities, columns=df['job_title'], index=df['job_title'])
sim_df.head(2)

job_title,ct bauer college business graduate magna cum laude aspiring human resources professional,native english teacher epik english program korea,aspiring human resources professional,people development coordinator ryan,advisory board member celal bayar university,aspiring human resources specialist,student humber college aspiring human resources generalist,hr senior specialist,student humber college aspiring human resources generalist,seeking human resources hris generalist positions,...,student westfield state university,student indiana university kokomo business management retail manager delphi hardware paint,aspiring human resources professional,student,seeking human resources position,aspiring human resources manager graduating seeking entry level human resources position st louis,human resources generalist loparex,business intelligence analytics travelers,set success,director administration excellence logging
ct bauer college business graduate magna cum laude aspiring human resources professional,1.0,0.149286,0.612663,0.339209,0.32649,0.60809,0.61273,0.398876,0.61273,0.49061,...,0.343203,0.31335,0.612663,0.298546,0.488825,0.614539,0.350448,0.090413,0.087822,0.227342
native english teacher epik english program korea,0.149286,1.0,0.155521,0.227968,0.312158,0.153452,0.1283,0.168338,0.1283,0.102023,...,0.226923,0.298217,0.155521,0.301423,0.135444,0.206688,0.157501,0.007307,0.078028,0.111799


In [63]:
## averages each job_title's similarity scores against all job_titles (the self-similarity of 1.0 is included)
mean_sim_scores = sim_df.to_numpy().mean(axis=1).tolist()
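
To make the similarity step above concrete, here is a minimal sketch of what a pairwise cosine similarity matrix contains, plus a variant of the row mean that leaves out each row's similarity with itself. The helper name `cosine_similarity_matrix` and the toy vectors are made up for illustration; the notebook itself relies on scikit-learn's `cosine_similarity` and averages over the full row.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale each row to unit length so a dot product equals the cosine.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Toy 2-D "embeddings" (illustrative values, not taken from the dataset).
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sims = cosine_similarity_matrix(toy)

# Mean similarity to the *other* rows only, i.e. without the diagonal of 1.0s.
n = len(toy)
mean_excluding_self = (sims.sum(axis=1) - np.diag(sims)) / (n - 1)
print(sims.round(3))
print(mean_excluding_self.round(3))
```

Whether the diagonal is included only shifts every mean by the same `1/n` contribution, so it should not change the ordering of candidates by this score.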

In [64]:
## adds the mean_sim_scores list to the dataframe as a new column
df['average_cosine_score'] = mean_sim_scores
df.head()

Unnamed: 0,id,job_title,connections,location,fit,vector_embeddings,average_cosine_score
0,1,ct bauer college business graduate magna cum l...,85,"Houston, Texas",,"[-0.06708339601755142, 0.02225026860833168, 0....",0.461227
1,2,native english teacher epik english program korea,500,Kanada,,"[-0.020413920283317566, 0.017400825396180153, ...",0.211561
2,3,aspiring human resources professional,44,"Raleigh-Durham, North Carolina Area",,"[-0.031211234629154205, 0.047989461570978165, ...",0.544989
3,4,people development coordinator ryan,500,"Denton, Texas",,"[-0.10693804919719696, -0.012768750078976154, ...",0.376504
4,5,advisory board member celal bayar university,500,"İzmir, Türkiye",,"[-0.08322132378816605, 0.016502641141414642, -...",0.269539


In [65]:
## creates a dataframe that houses each job title's vector embedding for training and testing purposes
embeddings_df = pd.DataFrame(sentence_embeddings.tolist())
embeddings_df['cosine_score'] = df['average_cosine_score']
embeddings_df['connection'] = df['connections']
embeddings_df['job_title'] = df['job_title']
embeddings_df['id'] = df['id']
embeddings_df.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,378,379,380,381,382,383,cosine_score,connection,job_title,id
0,-0.067083,0.02225,0.000855,0.014587,-0.024483,0.006765,-0.057208,0.001901,-0.022589,0.023938,...,-0.106439,0.008999,0.023517,-0.107045,-0.005282,-0.039334,0.461227,85,ct bauer college business graduate magna cum l...,1
1,-0.020414,0.017401,0.051365,-0.011567,0.01064,-0.016599,0.043034,0.025093,-0.06136,0.085835,...,-0.058693,-0.062418,0.03424,0.017132,-0.070108,0.047913,0.211561,500,native english teacher epik english program korea,2
2,-0.031211,0.047989,-0.014072,0.099424,-0.009609,-0.04226,0.06437,0.00912,-0.045817,0.07037,...,-0.073091,0.041168,0.041373,-0.061775,-0.021615,0.037014,0.544989,44,aspiring human resources professional,3


In [66]:
'''
this step is for later on when we go to evaluate the LGBMRanker and XGBRanker models

'''

## looking at job titles that have human resources in the title 
hr_df = embeddings_df[embeddings_df['job_title'].str.contains('human resources')].copy() ## .copy() so the new column is added to an independent dataframe
## relevancy is 2 when the title has 'aspiring human resources' and 1 when it only has 'human resources'
hr_df['relevancy'] = hr_df['job_title'].str.contains('aspiring human resources').astype(int) + 1

print(f"Length of the 'human resources' dataframe: {len(hr_df)}")
hr_df_id = hr_df['id'] ## ids of hr_df, which identify the candidates that have 'human resources' in their title

## creates another dataframe of the candidates whose job titles do not have 'human resources' in their job_title
non_relevant_cands = embeddings_df.drop(labels=hr_df.index, axis=0).copy()
non_relevant_cands['relevancy'] = 0 ## these candidates are not relevant
print(f"Length of the NON-'human resources' dataframe: {len(non_relevant_cands)}")

## adds the two dataframes back together 
df = pd.concat([hr_df, non_relevant_cands]) 
df.tail()

Length of the 'human resources' dataframe: 61
Length of the NON-'human resources' dataframe: 43


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,379,380,381,382,383,cosine_score,connection,job_title,id,relevancy
95,-0.059213,0.054709,0.033263,-0.109099,0.035175,-0.050855,0.078239,-0.000411,-0.028953,-0.04313,...,-0.038664,0.043533,-0.038126,-0.055773,-0.002229,0.24823,19,student indiana university kokomo business man...,96,0
97,-0.012676,0.094574,0.006054,-0.008066,-0.044017,-0.097967,0.111964,0.03536,0.009763,0.051144,...,-0.027925,0.125092,0.009117,-0.019378,0.004659,0.238392,4,student,98,0
101,0.075153,-0.066715,-0.044303,0.058389,0.015654,0.019709,0.13126,-0.029413,-0.032807,-0.033412,...,0.04143,-0.027918,-0.016993,0.005655,-0.061106,0.118918,49,business intelligence analytics travelers,102,0
102,0.015027,0.049594,-0.014256,0.001936,-0.111583,0.049082,0.099522,0.025276,0.010212,-0.062282,...,0.052836,0.015224,0.009837,-0.007557,0.00817,0.056241,500,set success,103,0
103,0.026823,-0.053658,0.008991,0.02091,0.027634,-0.077393,0.006712,-0.000525,-0.04933,0.045446,...,-0.005262,0.078313,-0.06342,-0.00851,0.038078,0.25253,500,director administration excellence logging,104,0


In [67]:
## saves the pre-ranked dataframe
df.to_csv('../Data/pre-ranked_df.csv')

To make this a little clearer: I needed to pre-rank the candidates on a certain phrase or job title description (this phrase can be changed depending on the job title). The first part looks at the more 'generic' part of the description, in this case 'human resources', because a phrase that is too specific, e.g. 'aspiring human resources', would select only a few candidates.

With that generic description we can partition out the more specific candidates. Those with 'aspiring human resources' in their title get a ranking of 2 (relevant), and the remaining 'human resources' candidates get 1 (sort of relevant).

The candidates who do not have that generic phrase in their job title are ranked 0 (not relevant). The two dataframes are then concatenated into a pre-ranked dataframe for later on, when we use the LGBM and XGB models for LambdaRank.
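
The labelling rule described above can be sketched as a single function. The function name `relevancy_label` and its parameter names are hypothetical and only meant to show the 2/1/0 scheme; the notebook builds the same labels with pandas string matching.

```python
def relevancy_label(job_title, generic="human resources",
                    specific="aspiring human resources"):
    """Graded relevancy: 2 = specific phrase, 1 = generic phrase only, 0 = neither."""
    title = job_title.lower()
    if specific in title:
        return 2
    if generic in title:
        return 1
    return 0

# Three cleaned job titles that appear in the dataset.
titles = [
    "aspiring human resources professional",
    "hr senior specialist",
    "seeking human resources position",
]
labels = [relevancy_label(t) for t in titles]
print(labels)
```

Changing `generic` and `specific` is all that is needed to pre-rank against a different job description.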

In [68]:
relevancy_counts = df['relevancy'].value_counts()

print(f"There are {relevancy_counts[2]} candidates that have 'aspiring human resources' in their job title.\nThere are {relevancy_counts[1]} candidates that have 'human resources' (without 'aspiring') in their job title.\nAnd there are {relevancy_counts[0]} candidates that have neither of those phrases in their job title.")

There are 35 candidates that have 'aspiring human resources' in their job title.
There are 26 candidates that have 'human resources' (without 'aspiring') in their job title.
And there are 43 candidates that have neither of those phrases in their job title.


In [69]:
## this dataframe will be used for the final ranking of candidates after the model has been trained
''' the process here is to pre-rank the candidates based upon our preliminary ranking
 of the phrase 'aspiring human resources' as well as the cosine similarity scores'''

sorted_final_df_ = df.sort_values(by=['relevancy', 'cosine_score', 'connection'], ascending=False) ## sorts by relevancy (2 is highest, then 1, then 0), then by the cosine_score feature, then by connection
sorted_final_df_label = sorted_final_df_['relevancy'].astype('float') ## changes the dtype to float
sorted_final_df_id = sorted_final_df_['id'] ## Series that will be used for indexing by the id later in the LambdaRanking section
sorted_final_df = sorted_final_df_.drop(labels=['relevancy', 'id', 'job_title'], axis=1) ## drops the columns that are not needed for training and prediction

print(f"Shape of final dataset: {sorted_final_df.shape}")

sorted_final_df.head()

Shape of final dataset: (104, 386)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,376,377,378,379,380,381,382,383,cosine_score,connection
5,-0.030779,0.029002,-0.015748,0.113863,-0.022077,-0.037371,0.078885,0.0004,-0.073562,0.044439,...,0.095469,0.005587,-0.065189,0.059353,0.043539,-0.089666,-0.007134,0.031979,0.557132,1
23,-0.030779,0.029002,-0.015748,0.113863,-0.022077,-0.037371,0.078885,0.0004,-0.073562,0.044439,...,0.095469,0.005587,-0.065189,0.059353,0.043539,-0.089666,-0.007134,0.031979,0.557132,1
35,-0.030779,0.029002,-0.015748,0.113863,-0.022077,-0.037371,0.078885,0.0004,-0.073562,0.044439,...,0.095469,0.005587,-0.065189,0.059353,0.043539,-0.089666,-0.007134,0.031979,0.557132,1
48,-0.030779,0.029002,-0.015748,0.113863,-0.022077,-0.037371,0.078885,0.0004,-0.073562,0.044439,...,0.095469,0.005587,-0.065189,0.059353,0.043539,-0.089666,-0.007134,0.031979,0.557132,1
59,-0.030779,0.029002,-0.015748,0.113863,-0.022077,-0.037371,0.078885,0.0004,-0.073562,0.044439,...,0.095469,0.005587,-0.065189,0.059353,0.043539,-0.089666,-0.007134,0.031979,0.557132,1


The step above sorts the dataset by relevancy and then, additionally, by each candidate's cosine similarity score and 'connection' number.

The dataset 'df' from a couple of cells above only separates the 'human resources' candidates (relevancy 2 and 1) from the rest (relevancy 0), so I thought the additional sorting might be helpful.

In [70]:
## randomly permutes the rows of the dataframe for testing and training purposes
permutation_df = np.random.permutation(df)
permutation_df = pd.DataFrame(permutation_df, columns=df.columns)
permutation_df.drop(labels=['job_title'], axis=1, inplace=True)
permutation_df = permutation_df.astype('float')

permutation_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,378,379,380,381,382,383,cosine_score,connection,id,relevancy
0,-0.056335,0.045705,-0.040824,-0.05263,0.012346,0.033436,0.022016,0.080037,-0.013169,-0.026876,...,-0.018279,0.015418,0.011737,-0.135312,-0.031769,-0.030431,0.128608,500.0,85.0,0.0
1,-0.014094,-0.027586,0.015556,0.046323,-0.006678,-0.042339,0.06359,-0.067278,-0.044355,0.048176,...,-0.0523,0.011147,0.029232,-0.104724,-0.077627,0.026439,0.50417,61.0,7.0,2.0
2,-0.020414,0.017401,0.051365,-0.011567,0.01064,-0.016599,0.043034,0.025093,-0.06136,0.085835,...,-0.058693,-0.062418,0.03424,0.017132,-0.070108,0.047913,0.211561,500.0,45.0,0.0
3,-0.08646,-0.002064,-0.015576,0.068794,-0.007499,-0.000398,0.048112,-0.008788,-0.04924,0.009,...,-0.028367,0.015792,-0.030258,-0.023479,0.008908,0.052969,0.489066,500.0,67.0,1.0
4,-0.014094,-0.027586,0.015556,0.046323,-0.006678,-0.042339,0.06359,-0.067278,-0.044355,0.048176,...,-0.0523,0.011147,0.029232,-0.104724,-0.077627,0.026439,0.50417,61.0,52.0,2.0


In [71]:
train_number = int((len(permutation_df)*.80)-3)

## these training and testing sets are comprised of:
##   train_label - the relevancy score (2, 1 or 0) assigned to each candidate above
##   train_id - each candidate's unique id
##   train - the features that will be used for training the neural network

## the ids will be used to see which candidates got chosen for comparison during training

train = permutation_df[:train_number]
train_label = train['relevancy'].to_numpy()
train_id = train['id'].astype('int32').to_numpy()
train = train.drop(labels=['relevancy', 'id'], axis=1).to_numpy()

test = permutation_df[train_number:]
test_label = test['relevancy'].to_numpy()
test_id = test['id'].astype('int32').to_numpy()
test = test.drop(labels=['relevancy', 'id'], axis=1).to_numpy() ## converts all pd.DataFrames/pd.Series into numpy.ndarray for next cell

print(f"Train shape: {train.shape} | Train label shape: {train_label.shape} | Test shape: {test.shape} | Test label shape: {test_label.shape}")

Train shape: (80, 386) | Train label shape: (80,) | Test shape: (24, 386) | Test label shape: (24,)
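
The permutation above does not appear to be seeded, so the split changes from run to run even though a `SEED` is defined earlier. As a sketch only, a seeded index split could look like the following; `shuffled_split` is a hypothetical helper, and the row counts (104 rows, 80 for training) are taken from the shapes printed above.

```python
import numpy as np

SEED = 6992  # the fixed seed value set earlier in the notebook

def shuffled_split(n_rows, train_size, seed):
    """Return reproducible (train, test) row positions for a shuffled split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    return order[:train_size], order[train_size:]

train_idx, test_idx = shuffled_split(n_rows=104, train_size=80, seed=SEED)
print(len(train_idx), len(test_idx))
```

The returned positions could then be passed to `df.iloc[...]` to build the training and testing frames, which keeps the column dtypes intact instead of converting the whole frame to an object array first.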


In [72]:
from torch.utils.data import DataLoader, TensorDataset

# train_ and test_dataloader will be used for training and evaluating the neural network

train_dataset = TensorDataset(torch.tensor(train), torch.tensor(train_label), torch.tensor(train_id))
test_dataset = TensorDataset(torch.tensor(test), torch.tensor(test_label), torch.tensor(test_id))
final_dataset = TensorDataset(torch.tensor(sorted_final_df.to_numpy()), torch.tensor(sorted_final_df_id.to_numpy()))

train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=2, shuffle=False)
final_dataloader = DataLoader(final_dataset, batch_size=2, shuffle=True)
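
With `batch_size=2`, each batch from the dataloaders above holds one randomly drawn pair of candidates. Pairwise methods such as RankNet are commonly trained on pairs whose labels differ, since those are the pairs that carry an ordering preference. The sketch below enumerates such pairs for a toy label list; `informative_pairs` is a hypothetical helper, not part of the notebook.

```python
from itertools import combinations

def informative_pairs(labels):
    """Return (i, j) index pairs ordered so that labels[i] > labels[j]."""
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            pairs.append((i, j))
        elif labels[j] > labels[i]:
            pairs.append((j, i))
        # Pairs with equal labels carry no ordering preference and are skipped.
    return pairs

toy_labels = [2, 0, 1, 2]  # illustrative relevancy scores
pairs = informative_pairs(toy_labels)
print(pairs)
```

For the 80 training rows this enumeration would produce far more pairs than the 40 per epoch that a shuffled `batch_size=2` loader yields, so in practice one might sample from it.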

## PyTorch

This link was a good guide for implementing RankNet via PyTorch:

https://medium.com/@mandeep0405/learning-to-rank-ranknet-simplified-5d7f7334133d

In [73]:
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
import torchmetrics
from torchmetrics import Accuracy, Recall, F1Score
from sklearn.metrics import accuracy_score as acc
from sklearn.metrics import ndcg_score
from sklearn.preprocessing import MultiLabelBinarizer
# torch.autograd.detect_anomaly(True)

####### RankNet implementation utilizing PyTorch ########
class RankNet(nn.Module):
    def __init__(self, num_feature):
        super(RankNet, self).__init__()
        self.input = nn.Linear(num_feature, 256)
        self.hidden_1 = nn.Linear(256, 256)
        self.hidden_2 = nn.Linear(256, 256)
        self.output = nn.Linear(256, 1)

        self.dropout = nn.Dropout(p=0.3)
        self.norm_layer = nn.LayerNorm(512)
        self.activation = nn.ReLU()

        self.sigmoid = nn.Sigmoid()

    def forward(self, x1, x2):
        # Process first item
        in_x1 = self.input(x1)
        h1_x1 = self.activation(self.dropout(self.hidden_1(in_x1)))
        h2_x1 = self.activation(self.dropout(self.hidden_2(h1_x1)))
        out1 = self.output(h2_x1)

        # Process second item
        in_x2 = self.input(x2)
        h1_x2 = self.activation(self.dropout(self.hidden_1(in_x2)))
        h2_x2 = self.activation(self.dropout(self.hidden_2(h1_x2)))
        out2 = self.output(h2_x2)

        # returns both candidate scores
        return (out1, out2)

    def cross_entropy_loss(self, out1, out2):
        # Target probability that x1 (out1) ranks above x2 (out2); note that it is derived from the model's own scores, not from the relevancy labels
        if out1 > out2:
            p_hat_ij = 1.0
        elif out1 < out2:
            p_hat_ij = 0.0
        else:
            p_hat_ij = 0.5

        # Obtain probability whether x1 is greater than x2
        p_ij = self.sigmoid(out1 - out2)

        # Cross-Entropy Loss function
        bce_loss = -p_hat_ij * torch.log(p_ij) - (1.0 - p_hat_ij) * torch.log(1.0 - p_ij)
        # mean_bce_loss = np.mean(bce_loss)

        return (p_ij, bce_loss)
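
For comparison, in the usual formulation of RankNet the target probability comes from the known labels of the two items rather than from the model's scores. The sketch below shows that variant for a single pair using plain Python; `ranknet_pair_loss` and its arguments are hypothetical names, and it is meant as an illustration of the formula, not as the loss used to train the saved model.

```python
import math

def ranknet_pair_loss(score_i, score_j, label_i, label_j):
    """Cross-entropy between the label-derived target and sigmoid(score_i - score_j)."""
    # Target probability that item i should rank above item j.
    if label_i > label_j:
        target = 1.0
    elif label_i < label_j:
        target = 0.0
    else:
        target = 0.5
    diff = score_i - score_j
    # Stable form of -t*log(sigmoid(d)) - (1-t)*log(1-sigmoid(d)),
    # which simplifies to log(1 + exp(d)) - t*d.
    softplus = max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return softplus - target * diff

# Item i is more relevant (2 vs 0): scoring it higher should cost less.
good = ranknet_pair_loss(3.0, 0.0, label_i=2, label_j=0)
bad = ranknet_pair_loss(0.0, 3.0, label_i=2, label_j=0)
print(good, bad)
```

With label-derived targets, a pair that the model orders incorrectly yields a large loss, which is the signal that pushes the scores toward the pre-ranked order.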

In [74]:
def train_pt(model, optimizer, train_data, test_data, epochs=100):
    '''
    -- Parameters --
    model: PyTorch RankNet model
    optimizer: PyTorch optimizer
    train_data: PyTorch Dataloader object used for training
    test_data: Pytorch Dataloader object used for evaluation
    epochs: set to 100 but can be changed if need be

    -- Returns --
    nothing; prints the training and evaluation loss for every epoch (the plotting code at the bottom is commented out)
    '''
    model.train()
    training_loss = []
    accuracy_scores = []
    test_loss = []
    val_accuracy_scores = []

    for epoch in range(epochs):
        train_loss = 0.0
        correct_predictions = 0 ## reset every epoch so the accuracy stays between 0 and 1
        for batch in train_data:
            data, _, _ = batch
            ## converts the data tensor to list for indexing purposes
            data = data.tolist()

            optimizer.zero_grad()

            data1_out, data2_out = model(torch.tensor(data[0]).reshape(1,-1), torch.tensor(data[1]).reshape(1,-1))

            p_ij, loss = model.cross_entropy_loss(data1_out, data2_out)

            if p_ij > .50:
                correct_predictions += 1

            loss.backward(gradient=[torch.tensor(1e-6, dtype=torch.float).reshape(1, -1)])
            optimizer.step()

            train_loss += loss.item()

        train_accuracy = correct_predictions / len(train_data)

        ## evaluation, run once per epoch after all training batches
        val_correct_predictions = 0
        val_loss = 0.0
        model.eval()

        with torch.no_grad():
            for batch in test_data:
                data, _, _  = batch
                data = data.tolist()

                data1_out, data2_out = model(torch.tensor(data[0]).reshape(1,-1), torch.tensor(data[1]).reshape(1,-1))

                p_ij, loss = model.cross_entropy_loss(data1_out, data2_out)

                val_loss += loss.item()

                if p_ij > .50:
                    val_correct_predictions += 1

        val_accuracy = val_correct_predictions / len(test_data)
        eval_loss = val_loss / len(test_data)
        test_loss.append(eval_loss)
        val_accuracy_scores.append(val_accuracy)
        model.train()

        train_loss = train_loss / len(train_data)
        training_loss.append(train_loss)
        accuracy_scores.append(train_accuracy)

        print(f"Epoch {epoch}\nTraining Loss: {train_loss} | Evaluation Loss: {eval_loss}")
        print() 

    ## plots the training loss over all epochs
    # plt.plot(val_accuracy_scores, label='Validation Accuracy')
    # plt.plot(accuracy_scores, label='Training Accuracy')
    # plt.title('Training Loss by Epoch')
    # plt.legend()
    # plt.xlabel('Epoch')
    # plt.ylabel('Training Loss')
    # plt.show()


In [75]:
def final_ranking(model, dataloader, epochs=100):
    '''
    -- Parameters --
    model: PyTorch model
    dataloader: PyTorch Dataloader object
    epochs: set to 100 but can be changed if need be
    
    -- Returns --
    numpy array of the candidate ids collected during the final pass over the dataloader.
     - Candidates are compared in pairs; within each pair, the candidate the model scores higher is listed first
    '''
    id_1 = []
    id_2 = []
    score = []
    ids = []
    model.eval()

    with torch.no_grad():
        for epoch in range(epochs):
            for batch in dataloader:
                data, id  = batch
                data = data.tolist()
                id = id.tolist()
                
                # id_1.append(id[0])
                # id_2.append(id[1])
                
                data1_out, data2_out = model.forward(torch.tensor(data[0]).reshape(1,-1), torch.tensor(data[1]).reshape(1,-1))

                p_ij, loss = model.cross_entropy_loss(data1_out, data2_out)

                if epoch == epochs-1: ## final batch results
                    if p_ij > .50:
                        ids.append(id[0])
                        ids.append(id[1])
                    else:
                        ids.append(id[1])
                        ids.append(id[0])

    return np.array(ids)

    '''these four lines would be to print out the dataframe with the ids of the two candidates and the probability score that the model gave
    #### make sure to uncomment the id_1.append and id_2.append lines
    '''
    #             y = 1 if p_ij >= .50 else 0
    #             score.append(y)
    
    # df = pd.DataFrame({'id_1': id_1, 'id_2': id_2, 'probability_score': score})
    # return df.head(len(dataloader))
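
Because the network assigns one scalar score per candidate (`out1` and `out2` above), another way to obtain a single overall ordering is to score every candidate once and sort by that score. The sketch below shows only the sorting step on toy values; `rank_by_score` and the example ids and scores are hypothetical, and the scores would in practice come from one forward pass per candidate.

```python
import numpy as np

def rank_by_score(ids, scores):
    """Order candidate ids from highest to lowest model score."""
    ids = np.asarray(ids)
    scores = np.asarray(scores, dtype=float)
    # A stable sort keeps the input order for candidates with equal scores.
    order = np.argsort(-scores, kind="stable")
    return ids[order]

toy_ids = [11, 12, 13, 14]          # illustrative candidate ids
toy_scores = [0.2, 1.5, -0.3, 0.9]  # illustrative model outputs
ranking = rank_by_score(toy_ids, toy_scores)
print(ranking)
```

Unlike the pair-by-pair output, this ordering does not depend on which candidates happen to be shuffled into the same batch.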


In [76]:
n_feature = train.shape[1]

rank_model = RankNet(num_feature=n_feature)
optimizer = optim.SGD(lr=.001, params=rank_model.parameters(), weight_decay=1e-4)

In [77]:
# train_pt(model=rank_model, optimizer=optimizer, train_data=train_dataloader, test_data=test_dataloader)

In [78]:
## saves the rank_net model
# torch.save(rank_model, f='../Save_Models/rank_net.pt')

In [79]:
'''
- final rankings (by id) from the model
'''
rank_net_model = torch.load(f='../Save_Models/rank_net.pt')
final_eval = final_ranking(model=rank_net_model, dataloader=final_dataloader)
final_eval

array([ 41,  90,  38,  18,  37, 103,  17,  15,  86,   5,  84,  44,  99,
        75,  69,  16,  93,  92,  60,   4,  64,  22,  29,  65,  36,  47,
        55,  61,  88,  20,  52,  23,  31,   2,  70,  57,  72,  58,   3,
        56,  54,  25,   6,  45,   8,  85,  89,  51,  80,  78,  87, 102,
        95,  28,  96, 100,  77,  42,  74,  63,  11,   1,  50,  68,  79,
        35,  12,  32,  66,  67,   7,  81,  21,  40,  91,  71,  14,  13,
         9,  59, 104,  62,  30,  94,  24,  76,  73,  98,  33,  43,  27,
        48,  46,  19,  83,  53,  39,  34,  97, 101,  82,  26,  49,  10])

In [80]:
og_df.loc[final_eval-1].head(40)

Unnamed: 0,id,job_title,location,connection,fit
40,41,Student at Chapman University,"Lake Forest, California",2,
89,90,Undergraduate Research Assistant at Styczynski...,Greater Atlanta Area,155,
37,38,HR Senior Specialist,San Francisco Bay Area,500+,
17,18,People Development Coordinator at Ryan,"Denton, Texas",500+,
36,37,Student at Humber College and Aspiring Human R...,Kanada,61,
102,103,Always set them up for Success,Greater Los Angeles Area,500+,
16,17,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
14,15,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
85,86,Information Systems Specialist and Programmer ...,"Gaithersburg, Maryland",4,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


In [81]:
## use this cell and next cell ONLY if you uncomment the lines in the final_ranking function

# test_candidates_1 = final_eval['id_1'].unique()
# df.iloc[test_candidates_1 - 1].tail(10)

In [82]:
# test_candidates_2 = final_eval['id_2'].unique()
# df.iloc[test_candidates_2 - 1].tail(10)

## LambdaRank

Using this guide (https://forecastegy.com/posts/lightgbm-learning-to-rank-python/), I implemented a LambdaRank model with both XGBoost and LightGBM.

In [83]:
import lightgbm as lgb 
import xgboost as xgb
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import ndcg_score

SEED = 7247

train = permutation_df[:train_number]
train_group = train.groupby(by=['relevancy'])['relevancy'].count().to_numpy() ## counts the rows per relevancy score; groupby sorts the keys, so the order is 0, 1, 2
X_train = train.drop(labels=['relevancy', 'id'], axis=1) ## drops the two features that are not relevant for training
y_train = train['relevancy'].to_numpy() ## target vector
print(f"Train group sizes (in order of relevancy score 0, 1, 2): {train_group}")

test = permutation_df[train_number:]
test_group = test.groupby(by=['relevancy'])['relevancy'].count().to_numpy()
X_test = test.drop(labels=['relevancy', 'id'], axis=1)
y_test = test['relevancy'].to_numpy()
print(f"Test group sizes: {test_group}\n")

print(f"X_train shape: {X_train.shape}\ny_train shape: {y_train.shape}\nX_test shape: {X_test.shape}\ny_test shape: {y_test.shape}")

X_train.head()

Train group sizes (in order of relevancy score 0, 1, 2): [34 19 27]
Test group sizes: [9 7 8]

X_train shape: (80, 386)
y_train shape: (80,)
X_test shape: (24, 386)
y_test shape: (24,)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,376,377,378,379,380,381,382,383,cosine_score,connection
0,-0.056335,0.045705,-0.040824,-0.05263,0.012346,0.033436,0.022016,0.080037,-0.013169,-0.026876,...,0.010774,0.001869,-0.018279,0.015418,0.011737,-0.135312,-0.031769,-0.030431,0.128608,500.0
1,-0.014094,-0.027586,0.015556,0.046323,-0.006678,-0.042339,0.06359,-0.067278,-0.044355,0.048176,...,0.067768,-0.018939,-0.0523,0.011147,0.029232,-0.104724,-0.077627,0.026439,0.50417,61.0
2,-0.020414,0.017401,0.051365,-0.011567,0.01064,-0.016599,0.043034,0.025093,-0.06136,0.085835,...,0.016263,-0.103442,-0.058693,-0.062418,0.03424,0.017132,-0.070108,0.047913,0.211561,500.0
3,-0.08646,-0.002064,-0.015576,0.068794,-0.007499,-0.000398,0.048112,-0.008788,-0.04924,0.009,...,0.039766,-0.003743,-0.028367,0.015792,-0.030258,-0.023479,0.008908,0.052969,0.489066,500.0
4,-0.014094,-0.027586,0.015556,0.046323,-0.006678,-0.042339,0.06359,-0.067278,-0.044355,0.048176,...,0.067768,-0.018939,-0.0523,0.011147,0.029232,-0.104724,-0.077627,0.026439,0.50417,61.0


In [84]:
#### LightGBM model
LGBM_ranker = lgb.LGBMRanker(
    random_state=SEED,
    objective='lambdarank',
    metric='pairwise',
    objective_seed=SEED,
    early_stopping=5,
    force_col_wise=True
    )

LGBM_ranker.fit(
    X_train,
    y_train,
    eval_metric=['ndcg'],
    group=train_group,
    eval_set=[(X_test, y_test)],
    eval_group=[test_group],
    eval_at=[5,10]
    )

[LightGBM] [Info] Total Bins 8288
[LightGBM] [Info] Number of data points in the train set: 80, number of used features: 386
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[5]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1
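
A note on the `group` argument: in the LightGBM and XGBoost ranking APIs it lists how many consecutive rows belong to each query, so the rows are expected to be ordered query by query and the sizes must add up to the number of rows. That reading differs from the per-relevancy counts passed above. The sketch below shows how per-query sizes could be derived; `group_sizes` and the toy query ids are hypothetical, since this dataset has no query column.

```python
import numpy as np

def group_sizes(query_ids):
    """Sizes of consecutive runs of equal query ids (rows already grouped by query)."""
    query_ids = np.asarray(query_ids)
    # Positions where a new query starts, including the very first row.
    starts = np.flatnonzero(np.r_[True, query_ids[1:] != query_ids[:-1]])
    return np.diff(np.r_[starts, len(query_ids)])

toy_query_ids = ["q1", "q1", "q1", "q2", "q2"]  # illustrative query labels
sizes = group_sizes(toy_query_ids)
print(sizes)

# With a single search phrase, one option is a single group covering all 80 training rows.
single_query_group = np.array([80])
```

Treating the whole training set as one query is only one possible choice here; it would make the model compare every candidate against every other candidate for the same search phrase.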


In [85]:
## for the accuracy score metric in the LGBMRanker and XGBRanker evaluations
def create_multilabels(predictions: list):
    '''
    -- Parameters --
    predictions: a list of continuous floating point numbers (raw ranker scores)

    -- Returns --
    a list of discrete labels: 0 for scores below 0, 1 for scores from 0 up to (but not including) .50, and 2 otherwise
    '''
    altered_preds = []
    for preds in predictions:
        if preds < 0:
            y = 0
        elif preds < .50:
            y = 1
        else:
            y = 2
        altered_preds.append(y)
    return altered_preds

In [86]:
preds = LGBM_ranker.predict(sorted_final_df)

preds = create_multilabels(preds)

# mlb = MultiLabelBinarizer()
# preds_multilabel = mlb.fit_transform([preds])

LGBMRanker_predictions = pd.DataFrame({'Candidate_ID': sorted_final_df_id, 'Predictions': preds})
LGBMRanker_predictions = LGBMRanker_predictions.sort_values(by='Predictions', ascending=False)
# lgbm_ndcg_score = ndcg_score(sorted_final_df_label, preds)
# lgbm_ndcg_score = ndcg_score(sorted_final_df_label, y)

# print(f"NDCG Score: {lgbm_ndcg_score}")

print(f"Accuracy: {acc(sorted_final_df_label, preds)}")

og_df.loc[LGBMRanker_predictions['Candidate_ID']-1].head(40)

Accuracy: 0.75


Unnamed: 0,id,job_title,location,connection,fit
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
75,76,Aspiring Human Resources Professional | Passio...,"New York, New York",212,
56,57,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
43,44,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
30,31,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
18,19,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
14,15,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
13,14,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
65,66,Experienced Retail Manager and aspiring Human ...,"Austin, Texas Area",57,
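
Accuracy on thresholded scores does not look at the order of the candidates, which is what the commented-out `ndcg_score` lines were aiming for. Note that scikit-learn's `ndcg_score` expects 2-D inputs of shape (n_queries, n_items), so the 1-D call would need reshaping. As a library-free sketch, NDCG@k with an exponential gain of `2**rel - 1` can be written as below; the function names and toy values are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a relevance list that is already in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(true_relevance, scores, k):
    """NDCG@k of the ordering induced by `scores` against `true_relevance`."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [true_relevance[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(true_relevance, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0

toy_relevance = [2, 1, 0]  # illustrative relevancy labels
perfect = ndcg_at_k(toy_relevance, [0.9, 0.5, 0.1], k=3)
reversed_order = ndcg_at_k(toy_relevance, [0.1, 0.5, 0.9], k=3)
print(perfect, reversed_order)
```

Passing the raw ranker scores, rather than the discretised labels, keeps the ordering information that the metric is designed to measure.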


In [87]:
lambdarank_truncation_levels = [2,4,8,16,32]
lgbm_ranker2 = lgb.LGBMRanker(
    random_state=SEED,
    objective='rank_xendcg',
    learning_rate=.01,
    sigmoid=.50,
    metric='rank_xendcg',
    early_stopping=5,
    # label_gain=[0,4],
    objective_seed=SEED,
    force_col_wise=True,
    lambdarank_truncation_level=16
    )

lgbm_ranker2.fit(
    X_train,
    y_train,
    eval_metric=['rank_xendcg'],
    group=train_group,
    eval_set=[(X_test, y_test)],
    eval_group=[test_group],
    eval_at=[5,10]
)

[LightGBM] [Info] Total Bins 8288
[LightGBM] [Info] Number of data points in the train set: 80, number of used features: 386
Training until validation scores don't improve for 5 rounds
Early stopping, best iteration is:
[5]	valid_0's ndcg@5: 1	valid_0's ndcg@10: 1


In [88]:
preds2 = lgbm_ranker2.predict(sorted_final_df)

LGBMRanker2_predictions = pd.DataFrame({'Candidate_ID': sorted_final_df_id, 'Predictions': preds2})
LGBMRanker2_predictions = LGBMRanker2_predictions.sort_values(by='Predictions', ascending=False)

preds = create_multilabels(preds2)
LGBMRanker2_accuracy_score = acc(sorted_final_df_label, preds)
print(f"Accuracy score: {LGBMRanker2_accuracy_score}")

og_df.loc[LGBMRanker2_predictions['Candidate_ID']-1].head(40)

Accuracy score: 0.5


Unnamed: 0,id,job_title,location,connection,fit
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
14,15,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
18,19,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
30,31,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
43,44,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
56,57,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
51,52,Student at Humber College and Aspiring Human R...,Kanada,61,
49,50,Student at Humber College and Aspiring Human R...,Kanada,61,
38,39,Student at Humber College and Aspiring Human R...,Kanada,61,


In [89]:
## XGBoost model

XGB_ranker = xgb.XGBRanker(
    objective='rank:ndcg',
    random_state=SEED
    )

XGB_ranker.fit(
    X_train,
    y_train,
    group=train_group
    )

In [90]:
xgb_preds = XGB_ranker.predict(sorted_final_df)

XGB_ranker_predictions = pd.DataFrame({'Candidate_ID': sorted_final_df_id, 'Predictions': xgb_preds})
XGB_ranker_predictions = XGB_ranker_predictions.sort_values(by='Predictions', ascending=False)

preds = create_multilabels(xgb_preds)
accuracy_score = acc(sorted_final_df_label, preds)
print(f"Accuracy score: {accuracy_score}")

og_df.loc[XGB_ranker_predictions['Candidate_ID']-1].head(40)

Accuracy score: 0.7788461538461539


Unnamed: 0,id,job_title,location,connection,fit
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
8,9,Student at Humber College and Aspiring Human R...,Kanada,61,
56,57,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
43,44,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
30,31,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
18,19,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
14,15,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
13,14,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
51,52,Student at Humber College and Aspiring Human R...,Kanada,61,


In [91]:
# XGB_ranker.save_model('../Save_Models/xgb_ranker.json')

In [92]:
## mean method
XGB_ranker_2 = xgb.XGBRanker(
    lambdarank_pair_method='mean',
    lambdarank_num_pair_per_sample=16,
    objective='rank:ndcg',
    random_state=SEED,
    lambdarank_normalization=False,
    eval_metric='ndcg-',
    learning_rate=.01
)

XGB_ranker_2.fit(
    X_train,
    y_train,
    group=train_group
)

In [93]:
xgb2_preds = XGB_ranker_2.predict(sorted_final_df)

XGB_ranker_2_predictions = pd.DataFrame({'Candidate_ID': sorted_final_df_id, 'Predictions': xgb2_preds})
XGB_ranker_2_predictions = XGB_ranker_2_predictions.sort_values(by='Predictions', ascending=False)

preds = create_multilabels(xgb2_preds)
accuracy_score = acc(sorted_final_df_label, preds)
print(f"Accuracy score: {accuracy_score}")

og_df.loc[XGB_ranker_2_predictions['Candidate_ID']-1].head(40)

Accuracy score: 0.7692307692307693


Unnamed: 0,id,job_title,location,connection,fit
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
24,25,Student at Humber College and Aspiring Human R...,Kanada,61,
81,82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,
56,57,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
43,44,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
30,31,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
18,19,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
14,15,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
13,14,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,


In [94]:
## topk method
XGB_ranker_3 = xgb.XGBRanker(
    lambdarank_pair_method='topk',
    lambdarank_num_pair_per_sample=10,
    objective='rank:ndcg',
    random_state=SEED,
    lambdarank_normalization=False,
    eval_metric='ndcg-',
    learning_rate=.01
)

XGB_ranker_3.fit(
    X_train,
    y_train,
    group=train_group
)
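
The `mean` and `topk` rankers above differ in `lambdarank_pair_method` (and in the number of pairs per sample). According to the XGBoost documentation, `mean` samples `lambdarank_num_pair_per_sample` pairs for each document, while `topk` builds pairs only for the documents that the model currently ranks within the top `lambdarank_num_pair_per_sample` positions. The following is a simplified, illustrative sketch of those two strategies and not XGBoost's actual implementation; the function names and toy data are assumptions.

```python
import random

def mean_pairs(labels, num_pair_per_sample, seed=0):
    """For every document, sample up to `num_pair_per_sample` partners with a different label."""
    rng = random.Random(seed)
    pairs = []
    for i, label_i in enumerate(labels):
        partners = [j for j, label_j in enumerate(labels) if label_j != label_i]
        for j in rng.sample(partners, min(num_pair_per_sample, len(partners))):
            pairs.append((i, j))
    return pairs

def topk_pairs(labels, scores, k):
    """Pair each of the k highest-scored documents with every document that has a different label."""
    ranked = sorted(range(len(labels)), key=lambda idx: scores[idx], reverse=True)
    pairs = []
    for i in ranked[:k]:
        pairs.extend((i, j) for j in ranked if labels[j] != labels[i])
    return pairs

# Toy query (hypothetical): five documents with binary relevance labels.
toy_labels = [1, 0, 0, 1, 0]
toy_scores = [0.2, 0.9, 0.5, 0.7, 0.1]
print(mean_pairs(toy_labels, num_pair_per_sample=2))
print(topk_pairs(toy_labels, toy_scores, k=2))
```

In this simplified picture, `topk` concentrates the pairs on the head of the current ranking, whereas `mean` spreads them over every document in the query.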

In [95]:
xgb3_preds = XGB_ranker_3.predict(sorted_final_df)

XGB_ranker_3_predictions = pd.DataFrame({'Candidate_ID': sorted_final_df_id, 'Predictions': xgb3_preds})
XGB_ranker_3_predictions = XGB_ranker_3_predictions.sort_values(by='Predictions', ascending=False)

preds = create_multilabels(xgb3_preds)
accuracy_score = acc(sorted_final_df_label, preds)
print(f"Accuracy score: {accuracy_score}")

og_df.loc[XGB_ranker_3_predictions['Candidate_ID']-1].head(40)

Accuracy score: 0.75


Unnamed: 0,id,job_title,location,connection,fit
5,6,Aspiring Human Resources Specialist,Greater New York City Area,1,
23,24,Aspiring Human Resources Specialist,Greater New York City Area,1,
81,82,Aspiring Human Resources Professional | An ene...,"Austin, Texas Area",174,
56,57,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
43,44,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
30,31,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
18,19,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
14,15,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
13,14,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
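
The ranked tables above share many of the same candidates. One simple way to quantify how much two rankers agree, which the notebook itself does not use, is the overlap of their top-k candidate ids. The sketch below assumes two lists of ids already sorted by predicted score; the ids are copied from the `id` column of the first XGBoost table and the `topk` table above purely for illustration, and `top_k_overlap` is a hypothetical helper.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of ids shared by the first k entries of two rankings."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

# First ten ids shown for XGB_ranker and XGB_ranker_3 in the tables above.
xgb_default_ids = [6, 9, 57, 44, 31, 19, 15, 14, 1, 52]
xgb_topk_ids = [6, 24, 82, 57, 44, 31, 19, 15, 14, 1]
print(f"Overlap@10: {top_k_overlap(xgb_default_ids, xgb_topk_ids, 10):.2f}")
```

This measure ignores the order inside the top k, so it complements rather than replaces an order-aware metric.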


In [96]:
# XGB_ranker = xgb.XGBRanker()
# XGB_ranker_model = XGB_ranker.load_model('../Save_Models/xgb_ranker.json')
# type(XGB_ranker_model)

In [97]:
## this function recursively ranks a pre-sorted pd.DataFrame
## the first dataframe is assumed to be a vectorized representation of the candidates' job titles along with other features used for prediction
## the second dataframe is the original dataset, which is used to display the final ranking of candidates

def get_candidates(sorted_df, dataset, model, num_cands=10, index_list=None):
    '''
    --Parameters--
    sorted_df: pd.DataFrame that has been pre-ranked for the job title search term
    dataset: pd.DataFrame that is the original un-ranked dataset of the candidates
    model: fitted ranker that exposes a predict() method
    num_cands: number of additional ranking rounds after the first one, so num_cands + 1 candidates are selected - default=10
    index_list: indices selected in earlier rounds (used by the recursive calls)

    --Return--
    dataset that is indexed by the top candidates from each round of rankings
    '''

    ## avoid a mutable default argument, which would keep indices from previous calls
    indices = [] if index_list is None else index_list

    predictions = model.predict(sorted_df)
    predictions = pd.DataFrame({'candidate_index': sorted_df.index, 'predictions': predictions})
    sorted_predictions = predictions.sort_values(by='predictions', ascending=False)

    top_candidate_index = sorted_predictions['candidate_index'].values[0]
    indices.append(top_candidate_index)

    ## drop the top candidate and re-rank the remaining ones
    rerank_indices = sorted_predictions['candidate_index'][1:].values
    sorted_df = sorted_df.loc[rerank_indices]

    if num_cands == 0:
        return dataset.loc[indices]
    return get_candidates(sorted_df=sorted_df, dataset=dataset, model=model, num_cands=num_cands-1, index_list=indices)
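
The recursion above repeats one step: score the remaining candidates, keep the top one, and drop it from the pool. The same idea can be written as a loop. The sketch below is a simplified illustration that replaces the DataFrame and the trained ranker with plain Python objects; `iterative_top_candidates`, `score_fn`, and the toy features are all hypothetical.

```python
def iterative_top_candidates(features, score_fn, num_cands=10):
    """Repeatedly score the remaining pool and move its best item to the result.

    features: dict mapping candidate index -> feature value
    score_fn: callable that takes a list of feature values and returns one score per value
    """
    remaining = dict(features)  # copy so the caller's dict is left untouched
    selected = []
    while remaining and len(selected) < num_cands:
        indices = list(remaining)
        scores = score_fn([remaining[i] for i in indices])
        # Pick the index with the highest score in this round.
        best = max(zip(scores, indices), key=lambda pair: pair[0])[1]
        selected.append(best)
        del remaining[best]
    return selected

# Toy stand-in for model.predict: the score is simply the feature value.
toy_features = {5: 0.9, 8: 0.7, 6: 0.8, 24: 0.1}
print(iterative_top_candidates(toy_features, score_fn=lambda rows: rows, num_cands=3))
```

The loop stops early when the pool is empty, so it returns at most `num_cands` indices and avoids deep recursion for large values of `num_cands`.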

In [98]:
top_candidates = get_candidates(sorted_df=sorted_final_df, dataset=og_df, model=XGB_ranker, num_cands=10)
top_candidates

    id                                          job_title  \
5    6                Aspiring Human Resources Specialist   
8    9  Student at Humber College and Aspiring Human R...   
6    7  Student at Humber College and Aspiring Human R...   
24  25  Student at Humber College and Aspiring Human R...   
56  57  2019 C.T. Bauer College of Business Graduate (...   
16  17              Aspiring Human Resources Professional   
96  97              Aspiring Human Resources Professional   
2    3              Aspiring Human Resources Professional   
38  39  Student at Humber College and Aspiring Human R...   
20  21              Aspiring Human Resources Professional   
30  31  2019 C.T. Bauer College of Business Graduate (...   

                               location connection  fit  
5            Greater New York City Area          1  NaN  
8                                Kanada         61  NaN  
6                                Kanada         61  NaN  
24                               Ka

## Prompt Engineering

Using some of the latest large language models (LLMs), we will prompt these models to see if we can obtain results like the ones above with only a few lines of code.

The downside of this method is that generation takes over 2 hours. With better hardware (GPUs), generation might not take as long.

For more information on fine-tuning the Phi-3 model, see: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

In [99]:
candidates = og_df['job_title'].tolist()
candidates_ids = og_df['id'].tolist()

candidates_list = list(zip(candidates_ids, candidates)) ## list of tuples that correspond to the candidate id and their job title
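
Because every (id, title) tuple is interpolated into the prompt, the prompt grows with the dataset, and many titles in this dataset are exact duplicates. A possible way to shorten the prompt, sketched here under the assumption that identical job titles can share one line, is to group ids by title and render a compact list. The helper `build_compact_prompt` and the sample tuples are illustrative and not part of the notebook.

```python
def build_compact_prompt(candidates, phrase, top_n):
    """Group candidate ids by identical title and build a short ranking prompt."""
    ids_by_title = {}
    for cand_id, title in candidates:
        ids_by_title.setdefault(title.strip(), []).append(cand_id)
    # One bullet per distinct title, followed by the ids that carry it.
    items = "\n".join(
        f"- {title} (ids: {', '.join(map(str, ids))})"
        for title, ids in ids_by_title.items()
    )
    return (f"Items:\n{items}\n\n"
            f"Give me the top {top_n} items that are related to the phrase '{phrase}'.")

# Sample (id, title) tuples in the same shape as candidates_list.
sample = [
    (3, "Aspiring Human Resources Professional"),
    (6, "Aspiring Human Resources Specialist"),
    (17, "Aspiring Human Resources Professional"),
    (24, "Aspiring Human Resources Specialist"),
]
print(build_compact_prompt(sample, "aspiring human resources", top_n=2))
```

Whether a shorter prompt noticeably reduces the generation time depends on the hardware and on `max_new_tokens`, so this should be treated as an experiment rather than a guaranteed speed-up.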

In [100]:
import transformers
from transformers.generation import CompileConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import flash_attention

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [
    {
        "role": "system",
        "content": "You are a ranking algorithm that ranks a list of items based on how similar each item is to a given phrase and returns the top 'n' items that the user requests."},
    {
        "role": "user", 
        "content": f"In this list of items {candidates_list}, give me the top 15 items that are related to the phrase 'aspiring human resources'."}
]

model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
# print(model_inputs.shape)
input_length = model_inputs.shape[1]
generated_ids = model.generate(model_inputs, do_sample=True, max_new_tokens=650)

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  6.14it/s]
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
You are not running the flash-attention implementation, expect numerical differences.


In [101]:
print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])

Given the instruction to rank items in relation to the keyword 'aspiring human resources' from an initially sorted list (items 1-104), let's extract and rank them to find the top 15 most related to that theme.

Here are the top 15 items related to 'aspiring human resources':

1. Aspiring Human Resources Professional
2. HR Senior Specialist at Ryan
3. Aspiring Human Resources Professional
4. Senior HR Business Partner at EY
5. Aspiring Human Resources Management Major
6. HR Generalist at Schwan's
7. Aspiring Human Resources Professional | Passionate about helping to create an inclusive and engaging work environment
8. HR Manager at Endemol Shine North America
9. HR Professional for GIS software industry
10. HR Specialist at Heil Environmental
11. Aspiring Human Resources Analyst
12. Aspiring Human Resources Manager
13. Student at Humber College and Aspiring Human Resources Generalist
14. Seeking Human Resources Position in St. Louis
15. Aspiring Human Resources Professional – Seeking En
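
The model returns free text, so mapping its answer back to candidate ids takes an extra step. A minimal sketch, assuming the numbered-list format shown above and exact title matches, is given below. `parse_ranked_titles` and `titles_to_ids` are hypothetical helpers, and titles that the model paraphrased or truncated would need fuzzy matching instead.

```python
import re

def parse_ranked_titles(generated_text):
    """Extract the titles from lines shaped like '1. Some title'."""
    pattern = re.compile(r"^\s*\d+\.\s+(.*\S)\s*$")
    titles = []
    for line in generated_text.splitlines():
        match = pattern.match(line)
        if match:
            titles.append(match.group(1))
    return titles

def titles_to_ids(titles, candidates):
    """Map each title to all candidate ids with exactly that title (empty list if none)."""
    ids_by_title = {}
    for cand_id, title in candidates:
        ids_by_title.setdefault(title, []).append(cand_id)
    return [(title, ids_by_title.get(title, [])) for title in titles]

# Shortened stand-in for the generated text and for candidates_list.
sample_output = """Here are the top items:

1. Aspiring Human Resources Professional
2. HR Senior Specialist at Ryan
"""
sample_candidates = [(3, "Aspiring Human Resources Professional"),
                     (17, "Aspiring Human Resources Professional")]
print(titles_to_ids(parse_ranked_titles(sample_output), sample_candidates))
```

Titles without a match come back with an empty id list, which makes it easy to spot entries the model altered or invented.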

In [102]:
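## saves the model and tokenizer locally; the weights are the pre-trained checkpoint, no fine-tuning was applied above
model.save_pretrained('../Save_Models/phi3_tuned')
tokenizer.save_pretrained('../Save_Models/phi3_tuned')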

('../Save_Models/phi3_tuned/tokenizer_config.json',
 '../Save_Models/phi3_tuned/special_tokens_map.json',
 '../Save_Models/phi3_tuned/tokenizer.json')

In [None]:
## loads the saved model and tokenizer back in from the local directory

model = AutoModelForCausalLM.from_pretrained("../Save_Models/phi3_tuned", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("../Save_Models/phi3_tuned")

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.40it/s]
