<a href="https://colab.research.google.com/github/Echo9k/3-potential_talents/blob/dev/Potential%20Talent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Candidate Ranking and Re-Ranking with Starring

We will implement this in a two-step process:

1. **Initial Ranking**: Rank candidates based on their `fit` score.
2. **Re-Ranking with Starring**: Incorporate supervisory signals (starring) dynamically into a Learning-to-Rank (LTR) algorithm. We can use a pairwise ranking approach (e.g., LambdaMART).

## Introduction to Candidate Ranking
This section introduces the learning-to-rank approach used to order candidates by their `fit` score. Recruiter feedback (starring) is fed back into the model so that the ranking adapts as candidates are reviewed, with the aim of reducing manual screening effort.

### Implementation Code

In [19]:
import os
import sys


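try:
    from google.colab import drive
    drive.mount('/content/drive')
    root_dir = "/content/drive/MyDrive/wdir/repos/Apziva/3-potential_talents/"
    # Work from the project root so relative paths such as "data" resolve
    os.chdir(root_dir)

except ImportError:
    # Not on Colab: walk up until the directory containing 'potential_talents' is found
    while 'potential_talents' not in os.listdir('.'):
        if os.getcwd() == os.path.dirname(os.getcwd()):
            raise FileNotFoundError("Could not locate the project root.")
        os.chdir('..')
    root_dir = os.getcwd()

    # Add the project root to sys.path to import custom functions
    sys.path.append('.')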
    
%pwd

'/home/sagemaker-user/3-potential_talents'

#### Dependencies

In [20]:
from pathlib import Path
from IPython.display import display, Markdown, clear_output
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

#### Initial Setup

In [21]:
data_path = Path("data")
data = pd.read_parquet(data_path / "processed" / "grouped_results.parquet")\
    .sort_values('fit', ascending=False)
# Initial Ranking
data['rank'] = range(1, len(data) + 1)
# Initial Starred
data['is_starred'] = 0
data = data.reset_index()
data.head()

Unnamed: 0,job_title,fit,rank,is_starred
0,human resources staffing and recruiting profes...,0.752753,1,0
1,retired army national guard recruiter office m...,0.728867,2,0
2,aspiring human resources professional an ener...,0.727303,3,0
3,aspiring human resources manager seeking inter...,0.720167,4,0
4,human resources coordinator at intercontinenta...,0.717091,5,0


## Training the LambdaMART Model
- The **LambdaMART algorithm** is used for supervised learning-to-rank tasks.
- Features are engineered from candidate data to represent relevancy scores.
- Supervisory signals (e.g., recruiter "stars") are integrated to refine rankings dynamically.

In [22]:
def star_candidate(data):
    """
    Allow the user to interactively star a candidate and update the rankings.
    The function continues until the user types 'exit' or 'q'.
    """
    while True:
        # Clear the previous output
        clear_output(wait=True)

        # Display the current candidates table
        display(Markdown("## Current Candidates:"))
        display(data[['job_title', 'fit', 'rank', 'is_starred']])

        # Get user input
        user_input = input("\nEnter the job title or rank of the candidate to star (or type 'exit' or 'q' to quit): ").strip()

        # Exit condition
        if user_input.lower() in ['exit', 'q']:
            print("Exiting the star candidate selection.")
            break

        try:
            if user_input.isdigit():
                rank = int(user_input)
                if rank in data['rank'].values:
                    data.loc[data['rank'] == rank, 'is_starred'] = 1
                    print(f"\nCandidate with rank {rank} has been starred.")
                else:
                    print("Invalid rank. Please try again.")
            elif user_input in data['job_title'].values:
                data.loc[data['job_title'] == user_input, 'is_starred'] = 1
                print(f"\nCandidate '{user_input}' has been starred.")
            else:
                print("Invalid job title or rank. Please try again.")
        except Exception as e:
            print(f"Error: {e}")

    return data
data = star_candidate(data)

## Current Candidates:

Unnamed: 0,job_title,fit,rank,is_starred
0,human resources staffing and recruiting profes...,0.752753,1,1
1,retired army national guard recruiter office m...,0.728867,2,0
2,aspiring human resources professional an ener...,0.727303,3,1
3,aspiring human resources manager seeking inter...,0.720167,4,1
4,human resources coordinator at intercontinenta...,0.717091,5,0
5,experienced retail manager and aspiring human ...,0.715438,6,0
6,not tech is seeking human resources payroll a...,0.713653,7,1
7,human resources manager at not tech shine nort...,0.713335,8,1
8,aspiring human resources professional passion...,0.710101,9,1
9,aspiring human resources manager graduating m...,0.710046,10,0


Exiting the star candidate selection.


#### Re-Ranking with Learning-to-Rank

1. **Generate Pairwise Data**: Create pairwise training examples based on starred candidates.

In [23]:
def generate_pairwise_data(df):
    """Generate pairwise training data for LTR."""
    starred = df[df['is_starred'] == 1]
    not_starred = df[df['is_starred'] == 0]
    
    X, y = [], []
    for _, s_row in starred.iterrows():
        for _, ns_row in not_starred.iterrows():
            # Features: difference of features between starred and not-starred
            X.append([s_row['fit'] - ns_row['fit']])
            y.append(1)  # Positive pair: Starred is better
            
            X.append([ns_row['fit'] - s_row['fit']])
            y.append(0)  # Negative pair: Not-starred is worse
    return np.array(X), np.array(y)

X, y = generate_pairwise_data(data)
SEED = 42
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

print("Pairwise Training Data:")
print(X[:5], y[:5])

Pairwise Training Data:
[[ 0.02388589]
 [-0.02388589]
 [ 0.03566164]
 [-0.03566164]
 [ 0.03731496]] [1 0 1 0 1]
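The nested `iterrows` loops above can also be expressed with NumPy broadcasting. The following is a minimal sketch, assuming a single `fit` feature and 0/1 `is_starred` flags; `pairwise_differences` is a hypothetical helper name and the toy values are illustrative only.

```python
import numpy as np

def pairwise_differences(fit, is_starred):
    """Build (starred - not-starred) difference pairs and their mirrored negatives."""
    fit = np.asarray(fit, dtype=float)
    is_starred = np.asarray(is_starred).astype(bool)
    starred = fit[is_starred]
    not_starred = fit[~is_starred]
    # Every starred score minus every not-starred score, flattened row by row
    diffs = (starred[:, None] - not_starred[None, :]).ravel()
    # Interleave each positive pair with its mirrored negative, as the loop does
    X = np.column_stack([diffs, -diffs]).reshape(-1, 1)
    y = np.tile([1, 0], diffs.size)
    return X, y

# Toy input (assumed values): two starred and two not-starred candidates
X_demo, y_demo = pairwise_differences([0.75, 0.73, 0.72, 0.70], [1, 0, 1, 0])
```

The output keeps the same ordering as the loop version (positive pair, then its mirror), so it should be usable as a drop-in replacement when only `fit` is used as a feature.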


2. **Train a LambdaMART Model**: Use `XGBoost` for Learning-to-Rank.

In [24]:
# Train a LambdaMART model (using XGBoost)
model = xgb.XGBRanker(
    objective="rank:pairwise",
    learning_rate=0.1,
    max_depth=5,
    n_estimators=100,
    random_state=SEED
)

# Group by candidate sets (all training pairs in one group for this example)
group = [len(X_train)]

model.fit(X_train, y_train, group=group)

# Predict scores.
# Note: the model was trained on fit *differences* but is applied here to raw
# fit values, which may fall outside the range seen during training.
data['adjusted_fit'] = model.predict(data[['fit']].values)

3. **Re-Rank Candidates**: Re-rank based on updated scores.

# Re-rank based on adjusted fit scores
data = data.sort_values('adjusted_fit', ascending=False).reset_index(drop=True)
data['rank'] = range(1, len(data) + 1)

print("Re-Ranked Candidates:")
print(data[['job_title', 'adjusted_fit', 'rank', 'is_starred']])

### Summary of Steps



1. **Initial Ranking**:
   - Candidates are ranked based on their initial fitness scores (`fit`).

2. **Re-Ranking**:
   - Starred candidates influence the ranking by dynamically updating the model with pairwise preferences.
   - LambdaMART (via XGBoost) is used to learn pairwise rankings, where starred candidates are preferred.

3. **Real-Time Updates**:
   - As new candidates are starred, the pairwise training data is updated, and the model can be fine-tuned or retrained.

In [25]:
# Re-rank based on adjusted fit scores
data = data.sort_values('adjusted_fit', ascending=False).reset_index(drop=True)
data['rank'] = range(1, len(data) + 1)

display(Markdown("### Re-Ranked Candidates:"))
display(data[['rank', 'job_title', 'is_starred',
              'fit', 'adjusted_fit']])

### Re-Ranked Candidates:

Unnamed: 0,rank,job_title,is_starred,fit,adjusted_fit
0,1,human resources staffing and recruiting profes...,1,0.752753,3.836226
1,2,retired army national guard recruiter office m...,0,0.728867,3.836226
2,3,aspiring human resources professional an ener...,1,0.727303,3.836226
3,4,aspiring human resources manager seeking inter...,1,0.720167,3.836226
4,5,human resources coordinator at intercontinenta...,0,0.717091,3.836226
5,6,experienced retail manager and aspiring human ...,0,0.715438,3.836226
6,7,not tech is seeking human resources payroll a...,1,0.713653,3.836226
7,8,human resources manager at not tech shine nort...,1,0.713335,3.836226
8,9,aspiring human resources professional passion...,1,0.710101,3.836226
9,10,aspiring human resources manager graduating m...,0,0.710046,3.836226
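Every row shown above receives the same `adjusted_fit`, so the re-ranking leaves the original order unchanged. A probable cause is that the model was trained on score differences but asked to predict from raw `fit` values. A linear scorer does not have this mismatch, because w·(a − b) = w·a − w·b: weights learned on differences can score individual candidates directly. The sketch below swaps in a plain logistic pairwise model written in NumPy (not LambdaMART); the function name, the second feature column, and the toy values are assumptions for illustration.

```python
import numpy as np

def fit_pairwise_linear(X_diff, y, lr=0.5, epochs=500):
    """Logistic regression on feature differences (no intercept), by gradient descent."""
    X_diff = np.asarray(X_diff, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X_diff.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X_diff @ w))   # P(first item is preferred)
        w -= lr * X_diff.T @ (p - y) / len(y)   # gradient of the mean log loss
    return w

# Toy candidates: columns are [fit, hypothetical second feature]
features = np.array([[0.75, 0.2], [0.73, 0.9], [0.72, 0.8], [0.70, 0.1]])
starred = np.array([0, 1, 1, 0], dtype=bool)

# Differences for (starred, not-starred) pairs plus their mirrored negatives
pos = (features[starred][:, None, :] - features[~starred][None, :, :]).reshape(-1, 2)
X_diff = np.vstack([pos, -pos])
y = np.concatenate([np.ones(len(pos)), np.zeros(len(pos))])

w = fit_pairwise_linear(X_diff, y)
scores = features @ w          # score each candidate, not a difference
order = np.argsort(-scores)    # candidate indices from best to worst
```

With `fit` as the only feature, any such scorer can only preserve or reverse the initial order, so starring can change the ranking only when additional candidate features are available.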


---

### **3. Candidate Filtering**

#### **Approach:**

- Use a **preprocessing filter** to exclude irrelevant candidates before ranking:
  - Text Matching:
    - Use fuzzy matching (Levenshtein distance, cosine similarity) to ensure `job_title` aligns closely with keywords.
    - Eliminate candidates with low semantic similarity.
  - Location Matching:
    - Set geographical cutoffs if location proximity is essential.
  - Connections:
    - Eliminate candidates with extremely low connections (e.g., <50 if a minimum threshold applies).
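
The text-matching step mentions cosine similarity. A minimal bag-of-words sketch using only the standard library is shown here; `cosine_similarity` and the whitespace tokenization are illustrative assumptions, and an embedding-based similarity would capture meaning that raw word counts miss.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between whitespace-tokenized word-count vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty title matches nothing
    return dot / (norm_a * norm_b)

similar = cosine_similarity("aspiring human resources professional", "aspiring human resources")
unrelated = cosine_similarity("software engineer", "aspiring human resources")
```

A title that shares most of its words with a keyword scores close to 1, while a title with no shared words scores 0, so a threshold on this value could serve as a second text filter.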

#### **Output:**

- A cleaned list of candidates.

#### **Success Metrics:**

- Filter efficiency (percentage of irrelevant candidates removed).
- False positive rate (relevant candidates mistakenly excluded).

In [34]:
from geopy.distance import geodesic
from fuzzywuzzy import fuzz

In [27]:
target_keywords = [
    'aspiring human resources',
    'human resources assistant',
    'hr coordinator',
    'hr generalist (entry-level)',
    'talent acquisition assistant',
    'recruitment coordinator',
    'hr intern',
    'hr trainee',
    'junior hr specialist',
    'hr associate',
    'people operations assistant'
    ]

# User-defined parameters
target_location = "New York"
target_coordinates = (40.7128, -74.0060)  # Latitude and longitude of New York
connection_threshold = 50

In [28]:
def filter_by_text(data, keywords, threshold=60):
    """Filter candidates based on fuzzy text matching with job titles."""
    def calculate_match(job_title):
        scores = [fuzz.partial_ratio(job_title.lower(), keyword.lower()) for keyword in keywords]
        return max(scores)  # Maximum match with any keyword

    # Work on a copy so the caller's DataFrame is not modified
    data = data.copy()
    data['text_match_score'] = data['job_title'].apply(calculate_match)
    return data[data['text_match_score'] >= threshold]


In [29]:
def filter_by_location(data, target_coordinates, max_distance_km=50):
    """Filter candidates based on geographical proximity."""
    def calculate_distance(row):
        candidate_coords = (row['latitude'], row['longitude'])
        return geodesic(candidate_coords, target_coordinates).kilometers

    # Work on a copy to avoid assigning a column to a slice of another DataFrame
    data = data.copy()
    data['distance_to_target'] = data.apply(calculate_distance, axis=1)
    return data[data['distance_to_target'] <= max_distance_km]


In [30]:
def filter_by_connections(data, min_connections):
    """Filter candidates based on minimum connections threshold."""
    return data[data['connections'] >= min_connections]

In [35]:
# Apply filters sequentially
filtered_data = filter_by_text(data, target_keywords, threshold=60)

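# filter_by_location reads 'latitude' and 'longitude', so check for those columns
if {"latitude", "longitude"}.issubset(filtered_data.columns):
    filtered_data = filter_by_location(filtered_data, target_coordinates, max_distance_km=50)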

if "connections" in filtered_data.columns:
    filtered_data = filter_by_connections(filtered_data, min_connections=connection_threshold)

print("Filtered Dataset:")
display(filtered_data)


Filtered Dataset:


Unnamed: 0,job_title,fit,rank,is_starred,adjusted_fit,text_match_score
0,human resources staffing and recruiting profes...,0.752753,1,1,3.836226,80
1,retired army national guard recruiter office m...,0.728867,2,0,3.836226,79
2,aspiring human resources professional an ener...,0.727303,3,1,3.836226,100
3,aspiring human resources manager seeking inter...,0.720167,4,1,3.836226,100
4,human resources coordinator at intercontinenta...,0.717091,5,0,3.836226,86
5,experienced retail manager and aspiring human ...,0.715438,6,0,3.836226,100
6,not tech is seeking human resources payroll a...,0.713653,7,1,3.836226,83
7,human resources manager at not tech shine nort...,0.713335,8,1,3.836226,72
8,aspiring human resources professional passion...,0.710101,9,1,3.836226,100
9,aspiring human resources manager graduating m...,0.710046,10,0,3.836226,100


In [36]:
# Efficiency: Percentage of candidates removed
initial_count = len(data)
filtered_count = len(filtered_data)
efficiency = (initial_count - filtered_count) / initial_count * 100

# Example placeholder for false positives (requires labeled data)
false_positives = 0
false_positive_rate = false_positives / initial_count * 100

print(f"Filter Efficiency: {efficiency:.2f}%")
print(f"False Positive Rate: {false_positive_rate:.2f}%")

Filter Efficiency: 23.08%
False Positive Rate: 0.00%
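
With labeled data, the placeholder above could be replaced by an actual count. The sketch below assumes a hypothetical ground-truth set of relevant candidate indices, which the dataset does not contain, and treats a "positive" as a candidate the filter removed. It divides by the number of relevant candidates rather than by the total, which is the usual definition of a false positive rate.

```python
def filter_metrics(all_ids, kept_ids, relevant_ids):
    """Return (efficiency %, false positive rate %) for a candidate filter.

    A "positive" here is a candidate the filter removed, so a false
    positive is a relevant candidate that was removed.
    """
    all_ids, kept_ids, relevant_ids = set(all_ids), set(kept_ids), set(relevant_ids)
    removed = all_ids - kept_ids
    efficiency = len(removed) / len(all_ids) * 100
    false_positives = removed & relevant_ids
    fpr = len(false_positives) / len(relevant_ids) * 100 if relevant_ids else 0.0
    return efficiency, fpr

# Hypothetical labels: 13 candidates, 3 removed, one of the removed is relevant
efficiency, fpr = filter_metrics(range(13), range(10), relevant_ids=[0, 2, 3, 6, 11])
```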


### **4. Determining a Cut-Off Point for Other Roles**

#### **Approach:**

- Use historical hiring patterns to determine thresholds:
  - Analyze the average fitness score of successful candidates for a role.
  - Use this as a baseline to create cut-offs, dynamically adjustable by user input.
- Incorporate a confidence interval to capture high-potential candidates slightly below the threshold.
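
One way to turn the approach above into a number, sketched under assumptions: take the mean `fit` of past successful hires for a role and lower it by a margin derived from the spread of those scores. The historical scores, the function name `role_cutoff`, and the normal-approximation interval with z = 1.96 are all illustrative choices, not something the dataset provides.

```python
import math
import statistics

def role_cutoff(successful_fits, z=1.96):
    """Lower bound of an approximate confidence interval around the mean fit."""
    mean_fit = statistics.mean(successful_fits)
    if len(successful_fits) < 2:
        return mean_fit  # no spread estimate is available from a single score
    std_err = statistics.stdev(successful_fits) / math.sqrt(len(successful_fits))
    return mean_fit - z * std_err

# Hypothetical fit scores of candidates hired for the role in the past
historical_fits = [0.74, 0.71, 0.76, 0.72, 0.73]
cutoff = role_cutoff(historical_fits)
```

Exposing `z` as a parameter keeps the cut-off adjustable by the user: a larger value admits more candidates below the historical mean, at the cost of more manual review.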

#### **Output:**

- Role-specific cut-off scores for ranking.

#### **Success Metrics:**

- Evaluate missed potential hires (false negatives).
- Assess efficiency in reducing manual reviews.

In [38]:
filtered_data

Unnamed: 0,job_title,fit,rank,is_starred,adjusted_fit,text_match_score
0,human resources staffing and recruiting profes...,0.752753,1,1,3.836226,80
1,retired army national guard recruiter office m...,0.728867,2,0,3.836226,79
2,aspiring human resources professional an ener...,0.727303,3,1,3.836226,100
3,aspiring human resources manager seeking inter...,0.720167,4,1,3.836226,100
4,human resources coordinator at intercontinenta...,0.717091,5,0,3.836226,86
5,experienced retail manager and aspiring human ...,0.715438,6,0,3.836226,100
6,not tech is seeking human resources payroll a...,0.713653,7,1,3.836226,83
7,human resources manager at not tech shine nort...,0.713335,8,1,3.836226,72
8,aspiring human resources professional passion...,0.710101,9,1,3.836226,100
9,aspiring human resources manager graduating m...,0.710046,10,0,3.836226,100


In [46]:
thresh = 0.7
refiltered_data = filtered_data[
    (filtered_data.fit > thresh) &
    (filtered_data.text_match_score > 75) &
    (filtered_data.adjusted_fit > 3.35)
]
refiltered_data.to_csv(data_path / "processed" /"pre_filtered.csv", index=False)
refiltered_data.to_parquet(data_path / "processed" /"pre_filtered.parquet", index=False, compression="brotli")
display(refiltered_data)

Unnamed: 0,job_title,fit,rank,is_starred,adjusted_fit,text_match_score
0,human resources staffing and recruiting profes...,0.752753,1,1,3.836226,80
1,retired army national guard recruiter office m...,0.728867,2,0,3.836226,79
2,aspiring human resources professional an ener...,0.727303,3,1,3.836226,100
3,aspiring human resources manager seeking inter...,0.720167,4,1,3.836226,100
4,human resources coordinator at intercontinenta...,0.717091,5,0,3.836226,86
5,experienced retail manager and aspiring human ...,0.715438,6,0,3.836226,100
6,not tech is seeking human resources payroll a...,0.713653,7,1,3.836226,83
8,aspiring human resources professional passion...,0.710101,9,1,3.836226,100
9,aspiring human resources manager graduating m...,0.710046,10,0,3.836226,100
10,ct bauer college of business graduate magna c...,0.70498,11,0,3.836226,100
