# Potential Talents

## Background:

As a talent sourcing and management company, we are interested in finding talented individuals for sourcing these candidates to technology companies. Finding talented candidates is not easy, for several reasons. The first reason is one needs to understand what the role is very well to fill in that spot, this requires understanding the client’s needs and what they are looking for in a potential candidate. The second reason is one needs to understand what makes a candidate shine for the role we are in search for. Third, where to find talented individuals is another challenge.

The nature of our job requires a lot of human labor and is full of manual operations. Towards automating this process we want to build a better approach that could save us time and finally help us spot potential candidates that could fit the roles we are in search for. Moreover, going beyond that for a specific role we want to fill in we are interested in developing a machine learning powered pipeline that could spot talented individuals, and rank them based on their fitness.

We are right now semi-automatically sourcing a few candidates, therefore the sourcing part is not a concern at this time but we expect to first determine best matching candidates based on how fit these candidates are for a given role. We generally make these searches based on some keywords such as “full-stack software engineer”, “engineering manager” or “aspiring human resources” based on the role we are trying to fill in. These keywords might change, and you can expect that specific keywords will be provided to you.

Assuming that we were able to list and rank fitting candidates, we then employ a review procedure, as each candidate needs to be reviewed and then determined how good a fit they are through manual inspection. This procedure is done manually and at the end of this manual review, we might choose not the first fitting candidate in the list but maybe the 7th candidate in the list. If that happens, we are interested in being able to re-rank the previous list based on this information. This supervisory signal is going to be supplied by starring the 7th candidate in the list. Starring one candidate actually sets this candidate as an ideal candidate for the given role. Then, we expect the list to be re-ranked each time a candidate is starred.`

## Data Description:

The data comes from our sourcing efforts. We removed any field that could directly reveal personal details and gave a unique identifier for each candidate.

Attributes:
id : unique identifier for candidate (numeric)

job_title : job title for candidate (text)

location : geographical location for candidate (text)

connections: number of connections candidate has, 500+ means over 500 (text)

Output (desired target):
fit - how fit the candidate is for the role? (numeric, probability between 0-1)

Keywords: “Aspiring human resources” or “seeking human resources”

## Goal:

Predict how fit the candidate is based on their available information (variable fit)

## Success Metric(s):

Rank candidates based on a fitness score.

Re-rank candidates when a candidate is starred.

## Imports and Preprocessing
Let's start by importing necessary libraries and packages for our project.

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from transformers import BertTokenizer, BertModel
from sentence_transformers import SentenceTransformer 
import torch
import spacy
import warnings
warnings.filterwarnings('ignore')

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# Download required NLTK data
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')    

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sirak\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Loading Glove embeddings
glove_path='../glove_data/glove.6B.100d.txt'
def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf8') as f:
        for line in f:
            parts = line.strip().split()
            word = parts[0]
            vector = np.array(parts[1:], dtype=np.float32)
            embeddings[word] = vector
    return embeddings

glove_embeddings = load_glove_embeddings(glove_path)

In [4]:
# Loading fasttext embeddings
fasttext_path = '../fasttext_data/cc.en.300.vec'
def load_fasttext_vectors(file_path, max_words=200000):
    embeddings = {}
    with open(file_path, 'r', encoding='utf8', newline='\n', errors='ignore') as f:
        next(f)  # skip header
        for i, line in enumerate(f):
            if i >= max_words:
                break
            parts = line.rstrip().split(' ')
            word = parts[0]
            vector = np.array(parts[1:], dtype=np.float32)
            embeddings[word] = vector
    return embeddings

fasttext_embeddings = load_fasttext_vectors(fasttext_path)

In [5]:
# Now let's load the data
data = pd.read_csv('../data/potential_talents_data.csv')
print(f"Loaded {len(data)} candidates")
data.head()

Loaded 104 candidates


Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,


Now let's start the preprocessing by creating a custom function which will convert the text to lowercase, remove punctuation and extra whitespaces, remove the stopwords, tokenize and finally lemmatize the given text:

In [6]:
def preprocess_text(text):
    if pd.isna(text):
        return ""

    #convert to lowercase
    text=text.lower()

    #remove punctuation
    text=re.sub(r'[^\w\s]', '', text)

    # remove extra whitespace
    text=re.sub(r'\s+', ' ', text).strip()

    #Tokenize
    tokens=word_tokenize(text)

    # Create lemmatizer object
    lemmatizer = WordNetLemmatizer()

    #remove stopwords and lemmatize
    tokens= [lemmatizer.lemmatize(token) for token in tokens
    if token not in set(stopwords.words('english')) ]

    return ' '.join(tokens)

For some additional information about our candidates we can combine the 'job_title' and 'Location' columns and use the combined column in our models:

In [7]:
combined_strings = [
        f"{row['job_title']} {row['location']}" if pd.notna(row['job_title']) and pd.notna(row['location']) else ""
        for _, row in data.iterrows()
    ]

data['combined_string'] = combined_strings

In [8]:
# Now let's add a new column 'combined_string_preprocessed' to our dataset by applying the preprocess_text function
data['combined_string_preprocessed']=data['combined_string'].apply(preprocess_text)

In [9]:
data[['combined_string', 'combined_string_preprocessed']].head()

Unnamed: 0,combined_string,combined_string_preprocessed
0,2019 C.T. Bauer College of Business Graduate (...,2019 ct bauer college business graduate magna ...
1,Native English Teacher at EPIK (English Progra...,native english teacher epik english program ko...
2,Aspiring Human Resources Professional Raleigh-...,aspiring human resource professional raleighdu...
3,"People Development Coordinator at Ryan Denton,...",people development coordinator ryan denton texas
4,Advisory Board Member at Celal Bayar Universit...,advisory board member celal bayar university i...


## Embedding

For word embeddings we'll try a few methods and see which one works the best: Bag of Words, TF-IDF, Bert and Sbert. We'll create custom functions for each one of them.

In [10]:
# Starting with the Bag of Words method

def bag_of_words_similarity(data, target_string):
    #create BoW vectorizer
    vectorizer=CountVectorizer(max_features=1000, ngram_range=(1,2))

    # combine job titles with target string
    all_texts=list(data['combined_string_preprocessed'])+[target_string]
    bow_matrix=vectorizer.fit_transform(all_texts)

    #Calculate similarity between each job title and target
    job_title_matrix=bow_matrix[:-1] # All except last (target)
    target_vector = bow_matrix[-1:] # Last row (target)

    similarities=cosine_similarity(job_title_matrix, target_vector).flatten()
    return similarities

In [11]:
# Next on the list is TF-IDF

def tfidf_similarity(data, target_string):
    #create TF-IDF vectorizer
    vectorizer=TfidfVectorizer(max_features=1000, ngram_range=(1,2))

    #Combine job titles with target string
    all_texts=list(data['combined_string_preprocessed'])+[target_string]
    tfidf_matrix=vectorizer.fit_transform(all_texts)

    # Calculate similarity between each job title and target
    job_titles_matrix = tfidf_matrix[:-1]  # All except last (target)
    target_vector = tfidf_matrix[-1:]  # Last row (target)
        
    similarities = cosine_similarity(job_titles_matrix, target_vector).flatten()
    return similarities

In [12]:
# Now let's try the Glove embedding method
def glove_similarity(data, target_string):

    # First, we need a function for embedding each row by averaging word vectors
    def get_document_embedding(text): 
        if pd.isna(text) or text == "":
            return np.zeros(100)
            
        # Preprocess text
        text_processed = preprocess_text(text)
        words = text_processed.split()
        
        # Get word vectors
        word_embeddings = []
        for word in words:
            if word in glove_embeddings:
                word_embeddings.append(glove_embeddings[word])
        
        if len(word_embeddings) == 0:
            return np.zeros(100)
        
        # Average the word vectors
        return np.mean(word_embeddings, axis=0)

    # Get embeddings for all rows
    job_embeddings = []
    for job_title in data['combined_string']:
        embedding = get_document_embedding(job_title)
        job_embeddings.append(embedding)
    
    # Get embedding for target string
    target_embedding = get_document_embedding(target_string)
    
    # Calculate similarities
    similarities = cosine_similarity(job_embeddings, [target_embedding]).flatten()
    
    return np.array(similarities)

In [26]:
def fasttext_similarity(data, target_string):
    
    # Again, row embedding first
    def get_document_embedding(text):
        if pd.isna(text) or text == "":
            return np.zeros(300)
        
        # Get row embedding
        text_processed = preprocess_text(text)
        words = text_processed.split()
        
        word_embeddings = []
        for word in words:
            if word in fasttext_embeddings:
                word_embeddings.append(fasttext_embeddings[word])
        
        if len(word_embeddings) == 0:
            return np.zeros(300)
        return np.mean(word_embeddings, axis=0)
        
    # Get embeddings for all rows
    job_embeddings = []
    for job_title in data['combined_string']:
        embedding = get_document_embedding(job_title)
        job_embeddings.append(embedding)
    
    # Get embedding for target string
    target_embedding = get_document_embedding(target_string)
    
    # Calculate similarities using cosine_similarity
    similarities = cosine_similarity(job_embeddings, [target_embedding]).flatten()
    return similarities

In [14]:
# BERT similarity

def bert_similarity(data, target_string):
    #Load BERT model and tokenizer
    tokenizer=BertTokenizer.from_pretrained('bert-base-uncased')
    model=BertModel.from_pretrained('bert-base-uncased')
    model.eval()

    #Calculate BERT embeddings for job titles
    doc_embeddings=[]
    for text in data['combined_string']:
        if pd.isna(text):
            doc_embeddings.append(np.zeros(768))
            continue

        inputs=tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding=True)

        with torch.no_grad():
            outputs=model(**inputs)
            embedding=outputs.last_hidden_state[:,0,:].numpy().flatten()
            doc_embeddings.append(embedding)

    #Calculate BERT embedding for target
    target_inputs=tokenizer(target_string, return_tensors='pt', max_length=512, truncation=True, padding=True)

    with torch.no_grad():
        target_outputs=model(**target_inputs)
        target_embedding=target_outputs.last_hidden_state[:,0,:].numpy().flatten()

    #Calculate similarities
    similarities=[]
    for doc_embedding in doc_embeddings:
        similarity=np.dot(doc_embedding, target_embedding) / (np.linalg.norm(doc_embedding) * np.linalg.norm(target_embedding) + 1e-8)
        similarities.append(similarity)

    return np.array(similarities)        

In [15]:
# And last, but not least SBERT

def sbert_similarity(data, target_string):

    #Load SBERT model
    sbert_model=SentenceTransformer('all-MiniLM-L6-v2')

    #Get embeddings for job titles
    job_titles=data['combined_string'].fillna('').tolist()
    job_embeddings=sbert_model.encode(job_titles)

    #get embedding for target
    target_embedding = sbert_model.encode([target_string])

    #Calculate similarities
    similarities=cosine_similarity(job_embeddings, target_embedding).flatten()
    return similarities

## Ranking and Reranking functions

Before moving to comparing our embedding methods, let's build functions for ranking and reranking our candidates based on a embedding method that was chosen. We'll try two different methods for reranking the starred candidates. But first let's build our main ranking function.

In [16]:
# ranking candidates first

def rank_candidates(data, target_string, method='tfidf', starred_candidates=None, reranking_method='combining', connection_weight=0.1):
    """
    method (str): Embedding method ('bow', 'tfidf', 'glove', 'fasttext','bert', 'sbert')
    starred_candidates (list): List of candidate IDs that have been starred
    reranking_method (str) : Reranking method ('combining', 'boosting') -- functions are provided below
    """
    #Preprocessing the target string
    target_processed=preprocess_text(target_string)

    

    # Calculate similarities based on method
    if method == 'bow':
        similarities=bag_of_words_similarity(data, target_processed)
    elif method == 'tfidf':
        similarities = tfidf_similarity(data, target_processed)
    elif method == 'glove':
        similarities = glove_similarity(data, target_string)
    elif method == 'fasttext':
        similarities = fasttext_similarity(data, target_string)
    elif method == 'bert':
        similarities = bert_similarity(data, target_string)  
    elif method == 'sbert':
        similarities = sbert_similarity(data, target_string)  
    else:
        raise ValueError(f"Unknown method: {method}. Available methods: 'bow', 'tfidf', 'glove', 'fasttext','bert', 'sbert'")

    # Normalize similarities to 0-1 range
    scaler=MinMaxScaler()
    similarities_norm=scaler.fit_transform(similarities.reshape(-1,1)).flatten()

    # Create ranking dataframe
    ranking_df=data.copy()
    ranking_df['similarity_score'] = similarities_norm
    ranking_df['rank'] = ranking_df['similarity_score'].rank(ascending=False, method='min').astype(int)

    # Apply re-ranking if starred candidates provided
    if starred_candidates:
        if reranking_method == 'boosting':
            ranking_df=rerank_boosting(ranking_df, starred_candidates)
        else:
            if  method == 'bow':
                ranking_df=rerank_combining(ranking_df, target_string, starred_candidates, embedding_func=bag_of_words_similarity)
            elif method == 'tfidf':
                ranking_df=rerank_combining(ranking_df, target_string, starred_candidates, embedding_func=tfidf_similarity)
            elif method == 'glove':
                ranking_df=rerank_combining(ranking_df, target_string, starred_candidates, embedding_func=glove_similarity)
            elif method == 'fasttext':
                ranking_df=rerank_combining(ranking_df, target_string, starred_candidates, embedding_func=fasttext_similarity)
            elif method == 'bert':
                ranking_df=rerank_combining(ranking_df, target_string, starred_candidates, embedding_func=bert_similarity)
            elif method == 'sbert':
                ranking_df=rerank_combining(ranking_df, target_string, starred_candidates, embedding_func=sbert_similarity)

     # Adding a boost to the final similarity score based on the number of connections

    # Cleaning the column first (converting to int)
    def parse_connections(val):
        if isinstance(val, str) and "500" in val:
            return 500
        try:
            return int(val)
        except:
            return 0

    connections = ranking_df['connection'].apply(parse_connections)
    connections_norm = scaler.fit_transform(connections.to_numpy().reshape(-1,1)).flatten()
    ranking_df['similarity_score'] += connection_weight * connections_norm

    # Re-rank after adding connections
    ranking_df['rank'] = ranking_df['similarity_score'].rank(ascending=False, method='min').astype(int)
    
    # Sort by rank
    ranking_df=ranking_df.sort_values('rank').reset_index(drop=True)

    return ranking_df

### Method 1 : Boosting score based on the similarity to starred candidates average

First we'll try to calculate the average similarity score for our starred candidates and compare it to each candidate, giving them boosting scores based on the similarity.

In [17]:
def rerank_boosting(ranking_df, starred_candidates):

    # Get features of starred candidates
    starred_mask=ranking_df['id'].isin(starred_candidates)
    starred_features=ranking_df[starred_mask]['similarity_score']

    if len(starred_candidates) == 0:
        return ranking_df

    # Calculate average similarity of starred candidates
    starred_avg = starred_features.mean()

    # Boost scores for candidates similar to starred ones
    for idx, row in ranking_df.iterrows():
        candidate_score=row['similarity_score']

        # Calculate similarity to starred candidates' average
        similarity_to_starred=1-abs(candidate_score - starred_avg)

        # Apply boost
        boost_factor= 1 + 0.3 * similarity_to_starred
        ranking_df.loc[idx, 'similarity_score'] = candidate_score * boost_factor

    # Re-rank
    ranking_df['rank']=ranking_df['similarity_score'].rank(ascending=False, method='min').astype(int)

    return ranking_df

### Method 2 : Combining each candidates info with the starred candidates before calculating the similarity

In [18]:
def rerank_combining(ranking_df, target_string, starred_candidates, embedding_func):

    # Get features of starred candidates
    starred_mask = ranking_df['id'].isin(starred_candidates)
    starred_info = " ".join(
        ranking_df.loc[starred_mask, 'combined_string'].fillna("").tolist()
    )

    # Create combined string for each candidate (candidate info string + starred info)
    combined_strings = [
        f"{row['combined_string']} {starred_info}" if pd.notna(row['combined_string']) else starred_info
        for _, row in ranking_df.iterrows()
    ]

    # Compute similarity using the combined string
    temp_df = ranking_df.copy()
    temp_df['job_title'] = combined_strings 

    similarities = embedding_func(temp_df, target_string)
    temp_df['similarity_score'] = similarities

    # Re-rank
    temp_df['rank'] = temp_df['similarity_score'].rank(ascending=False, method='min').astype(int)


    return temp_df

## Comparing methods and choosing the best one

Now we're ready to create functions for comparing our methods:

In [62]:
# First, comparing the methods
def compare_methods(data, target_string, starred_candidates=None, reranking_method='combining'):
    methods = ['bow', 'tfidf','glove', 'fasttext', 'bert', 'sbert']
    results = {}

    # Comparing methods for target
    for method in methods:
        try:
            ranking=rank_candidates(data, target_string, method, starred_candidates, reranking_method=reranking_method)

            # Store top 10 results
            top10= ranking.head(10)[['rank', 'id', 'combined_string', 'similarity_score']]
            results[method]={
                'ranking': ranking,
                'top_10': top10,
                'avg_score': ranking['similarity_score'].mean(),
                'max_score': ranking['similarity_score'].max()
            }

           # print top 5 candidates
            if starred_candidates:
                print(f"Top 5 candidates using {method.upper()} embedding and {reranking_method.upper()} reranking method:")
            else:
                print(f"Top 5 candidates using {method.upper()} embedding:")
            for _, row in top10.head(5).iterrows():
                starred_mark = " ⭐" if row['id'] in (starred_candidates or []) else ""
                print(f"  {row['rank']:2d}. ID {row['id']:3d} - {row['combined_string'][:50]:<50} (Score: {row['similarity_score']:.4f}){starred_mark}")

        except Exception as e:
            print(f"Error with {method}: {e}")
            continue
    return results

# Now let's get the best method
def get_best_method(data, target_string, starred_candidates=None, reranking_method='combining'):
    results=compare_methods(data, target_string, starred_candidates=starred_candidates, reranking_method=reranking_method)

    #Find method with highest average score
    best_method = max(results.keys(), key=lambda x:results[x]['avg_score'])

    print(f"Best Method : {best_method}")
    print(f"Average similarity score : {results[best_method]['avg_score']:.4f}")

    return best_method


Now let's test our function, first without starred candidates

In [63]:
target_string = "seeking human resources"
get_best_method(data, target_string)

Top 5 candidates using BOW embedding:
   1. ID  28 - Seeking Human Resources Opportunities Chicago, Ill (Score: 1.0780)
   1. ID  30 - Seeking Human Resources Opportunities Chicago, Ill (Score: 1.0780)
   3. ID  10 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.9044)
   3. ID  40 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.9044)
   3. ID  53 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.9044)
Top 5 candidates using TFIDF embedding:
   1. ID  28 - Seeking Human Resources Opportunities Chicago, Ill (Score: 1.0780)
   1. ID  30 - Seeking Human Resources Opportunities Chicago, Ill (Score: 1.0780)
   3. ID  10 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.9660)
   3. ID  40 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.9660)
   3. ID  53 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.9660)
Top 5 candidates using GLOVE embedding:
   1. ID  10 - Seeking Human Resources HRIS and Generalist Pos

'bert'

Now let's add some starred candidates and use the 'boosting' reranking method:

In [57]:
target_string = "seeking human resources"
starred_candidates=[53, 28, 68]

get_best_method(data, target_string, starred_candidates=starred_candidates, reranking_method='boosting')

Top 5 candidates using BOW embedding and BOOSTING reranking method:
   1. ID  28 - Seeking Human Resources Opportunities Chicago, Ill (Score: 1.3067) ⭐
   1. ID  30 - Seeking Human Resources Opportunities Chicago, Ill (Score: 1.3067)
   3. ID  10 - Seeking Human Resources HRIS and Generalist Positi (Score: 1.1356)
   3. ID  40 - Seeking Human Resources HRIS and Generalist Positi (Score: 1.1356)
   3. ID  53 - Seeking Human Resources HRIS and Generalist Positi (Score: 1.1356) ⭐
Top 5 candidates using TFIDF embedding and BOOSTING reranking method:
   1. ID  28 - Seeking Human Resources Opportunities Chicago, Ill (Score: 1.2870) ⭐
   1. ID  30 - Seeking Human Resources Opportunities Chicago, Ill (Score: 1.2870)
   3. ID  10 - Seeking Human Resources HRIS and Generalist Positi (Score: 1.1819)
   3. ID  40 - Seeking Human Resources HRIS and Generalist Positi (Score: 1.1819)
   3. ID  53 - Seeking Human Resources HRIS and Generalist Positi (Score: 1.1819) ⭐
Top 5 candidates using GLOVE embed

'bert'

And lastly let's try 'combining' reranking method with the same list of starred candidates:

In [40]:
starred_candidates=[53, 28, 68]
get_best_method(data, target_string, starred_candidates=starred_candidates, reranking_method='combining')

Top 5 candidates using BOW embedding and COMBINING reranking method:
   1. ID  28 - Seeking Human Resources Opportunities Chicago, Ill (Score: 0.4825) ⭐
   1. ID  30 - Seeking Human Resources Opportunities Chicago, Ill (Score: 0.4825)
   3. ID  10 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.4254)
   3. ID  40 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.4254)
   3. ID  53 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.4254) ⭐
Top 5 candidates using TFIDF embedding and COMBINING reranking method:
   1. ID  28 - Seeking Human Resources Opportunities Chicago, Ill (Score: 0.2903) ⭐
   1. ID  30 - Seeking Human Resources Opportunities Chicago, Ill (Score: 0.2903)
   3. ID  10 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.2839)
   3. ID  40 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.2839)
   3. ID  53 - Seeking Human Resources HRIS and Generalist Positi (Score: 0.2839) ⭐
Top 5 candidates using GLOVE emb

'bert'

It looks like the only embedding method that shows different results based on the reranking method is BERT. Others show the same results and include at least one starred candidate in the final top 5 results.