## Candidate Ranking and Re-Ranking with Starring

This section focuses on testing the performance of the candidate ranking system. Key steps and highlights include:

- **Data-Driven Insights**: Automating and refining the ranking process to improve efficiency and decision-making accuracy.
- **Two-Step Ranking Process**:
  1. **Initial Ranking**: Candidates are ranked based on their initial fitness scores (`fit`), providing a baseline.
  2. **Re-Ranking with Starring**: Feedback from users (e.g., starring preferred candidates) dynamically updates the ranking model using Learning-to-Rank techniques.

In [None]:
import os
import sys


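try:
    from google.colab import drive
    drive.mount('/content/drive')
    root_dir = "/content/drive/MyDrive/wdir/repos/Apziva/3-potential_talents/"
    os.chdir(root_dir)  # work from the project root so relative paths resolve

except ImportError:
    # Not on Colab: walk up until the 'potential_talents' folder is visible
    while 'potential_talents' not in os.listdir('.'):
        if os.getcwd() == os.path.dirname(os.getcwd()):
            raise FileNotFoundError("'potential_talents' not found in any parent directory")
        os.chdir('..')
    root_dir = os.getcwd()  # set even when the loop body never runs

    # append the project root to the system path to import custom functions
    sys.path.append('.')

%pwd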

### **1. Dependencies**

In [None]:
from pathlib import Path
from IPython.display import display, Markdown
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

from geopy.distance import geodesic
from fuzzywuzzy import fuzz
import toml


SEED = 42
connection_threshold = 50

data_path = Path("data")
data = pd.read_parquet(data_path  / "interim" / "encoded.parquet", columns=['job_title'])

credentials_path = Path(root_dir) / "config" / ".credentials"
credentials = toml.load(credentials_path)

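keywords_path = Path(root_dir) / "config" / "search_terms.toml"
search_terms = toml.load(keywords_path)  # parse the file once
target_keywords = search_terms['search_phrases']
# NOTE: this reads the same 'search_phrases' key as target_keywords;
# point it at the location entry of search_terms.toml if one is defined.
target_location = search_terms['search_phrases']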

# API and credentials setup
API_URL = "https://api-inference.huggingface.co/models/sentence-transformers/msmarco-distilbert-base-tas-b"
headers = {"Authorization": f"Bearer {credentials['hf_api_token']}"}

**Key Highlights**:

- `xgboost` is a robust and scalable gradient boosting library, ideal for implementing Learning-to-Rank models.
- `pandas` enables efficient data manipulation and analysis.
- `IPython.display` renders rich output, such as Markdown, inside the notebook.

### **2. Initial Setup**

#### Data Loading and Initial Ranking

In [None]:
data_path = Path("data")
data = pd.read_parquet(data_path / "processed" / "grouped_results.parquet")\
         .sort_values('fit', ascending=False)
# Initial Ranking
data['rank'] = range(1, len(data) + 1)
data['is_starred'] = 0
data = data.reset_index()
data.head()


- The `fit` score represents the initial evaluation of candidate suitability for a role.
- A simple ranking (`rank`) is assigned based on descending `fit` values.
- The `is_starred` column records user feedback: `1` marks a candidate the user has starred, and `0` (the default) marks an unstarred one.
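
The ranking rule described above can be sketched without pandas. This is a minimal illustration, not the notebook's code: the job titles and `fit` values are made up, and `assign_initial_rank` is a hypothetical helper that stands in for the `sort_values` and `range` steps.

```python
# Hypothetical candidates; titles and fit values are invented for illustration.
candidates = [
    {"job_title": "HR Coordinator", "fit": 0.62},
    {"job_title": "Aspiring Human Resources Specialist", "fit": 0.91},
    {"job_title": "Data Analyst", "fit": 0.35},
]


def assign_initial_rank(rows):
    """Sort by descending fit, number rows from 1, and reset all stars to 0."""
    ordered = sorted(rows, key=lambda r: r["fit"], reverse=True)
    # dict(r, ...) copies each row, so the input list is left untouched.
    return [dict(r, rank=i, is_starred=0) for i, r in enumerate(ordered, start=1)]


ranked = assign_initial_rank(candidates)
for row in ranked:
    print(row["rank"], row["job_title"], row["fit"])
```

The highest `fit` receives rank 1, and every candidate starts unstarred.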

### **3. Incorporating User Feedback for Re-Ranking**

#### **Dynamic Starring Functionality**

In [None]:
def star_candidate(data):
    """
    Allow the user to interactively star a candidate and update the rankings.
    """
    print("\nCurrent Candidates:")
    print(data[['job_title', 'fit', 'rank']])

    user_input = input("\nEnter the job title or rank of the candidate to star: ")

    try:
        if user_input.isdigit():
            rank = int(user_input)
            if rank in data['rank'].values:
                data.loc[data['rank'] == rank, 'is_starred'] = 1
            else:
                print("Invalid rank. Please try again.")
                return data
        elif user_input in data['job_title'].values:
            data.loc[data['job_title'] == user_input, 'is_starred'] = 1
        else:
            print("Invalid job title. Please try again.")
            return data

        print(f"\nCandidate '{user_input}' has been starred.")
    except Exception as e:
        print(f"Error: {e}")

    return data

data = star_candidate(data)

- The `star_candidate` function allows users to influence rankings in real time, incorporating domain-specific preferences.
- This interactivity ensures human oversight remains central to the process.
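
Because `star_candidate` calls `input()`, it is awkward to run in a script or a test. A non-interactive variant might look like the sketch below. It assumes rows are plain dictionaries rather than a DataFrame, and `star_by_identifier` is a hypothetical name, not part of the notebook.

```python
def star_by_identifier(rows, identifier):
    """Star rows whose rank (all digits) or job_title (any other text) matches.

    Returns the number of rows starred; 0 means nothing matched.
    """
    if identifier.isdigit():
        matches = [r for r in rows if r["rank"] == int(identifier)]
    else:
        matches = [r for r in rows if r["job_title"] == identifier]
    for r in matches:
        r["is_starred"] = 1  # mutate in place, like data.loc[...] = 1
    return len(matches)


# Made-up rows for illustration only.
rows = [
    {"job_title": "HR Specialist", "rank": 1, "is_starred": 0},
    {"job_title": "HR Coordinator", "rank": 2, "is_starred": 0},
    {"job_title": "HR Coordinator", "rank": 3, "is_starred": 0},
]
n_by_rank = star_by_identifier(rows, "1")
n_by_title = star_by_identifier(rows, "HR Coordinator")
```

As in the interactive version, starring by job title marks every candidate who shares that title, so a rank is the safer way to pick out a single row.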

### **4. Re-Ranking with Learning-to-Rank (LTR)**

#### Pairwise Data Generation for Training

In [None]:
def generate_pairwise_data(df):
    """Generate pairwise training data for LTR."""
    starred = df[df['is_starred'] == 1]
    not_starred = df[df['is_starred'] == 0]
    
    X, y = [], []
    for _, s_row in starred.iterrows():
        for _, ns_row in not_starred.iterrows():
            X.append([s_row['fit'] - ns_row['fit']])
            y.append(1)  # Positive pair: Starred is better
            
            X.append([ns_row['fit'] - s_row['fit']])
            y.append(0)  # Negative pair: Not-starred is worse
    return np.array(X), np.array(y)


X, y = generate_pairwise_data(data)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

print("Pairwise Training Data:")
print(X[:5], y[:5])
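
To make the pair construction concrete, here is a small worked example in plain Python. It mirrors the loop in `generate_pairwise_data` but takes lists of `fit` values directly. `pairwise_differences` is a hypothetical helper, and the numbers are invented.

```python
def pairwise_differences(starred_fits, other_fits):
    """Build (feature, label) pairs from fit differences, two per combination."""
    X, y = [], []
    for s in starred_fits:
        for o in other_fits:
            X.append([s - o])
            y.append(1)  # starred minus not-starred
            X.append([o - s])
            y.append(0)  # not-starred minus starred
    return X, y


# One starred candidate (fit 0.75) and two unstarred ones (fit 1.0 and 0.5).
X_demo, y_demo = pairwise_differences([0.75], [1.0, 0.5])
print(X_demo)  # [[-0.25], [0.25], [0.25], [-0.25]]
print(y_demo)  # [1, 0, 1, 0]
```

Each starred and unstarred combination contributes two rows, so the dataset has `2 * len(starred) * len(not_starred)` rows. In this example the difference `-0.25` appears once with label 1 and once with label 0. This happens whenever a starred candidate sits between two unstarred ones, and a model with `fit` difference as its only feature cannot separate such rows.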

#### Training the LambdaMART Model

In [None]:
model = xgb.XGBRanker(
    objective="rank:pairwise",
    learning_rate=0.1,
    max_depth=5,
    n_estimators=100,
    random_state=SEED
)

# Treat all training pairs as a single query group
group = [len(X_train)]

model.fit(X_train, y_train, group=group)

In [None]:
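# The model was trained on fit differences; scoring raw fit values here
# treats each candidate's fit as its difference from a fit of 0.
data['adjusted_fit'] = model.predict(data[['fit']].values)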
data = data.sort_values('adjusted_fit', ascending=False).reset_index(drop=True)
data['rank'] = range(1, len(data) + 1)


**Key Advantages**:

- **Scalability**: XGBoost's gradient-boosted ranker trains on pairwise comparisons and scales to large candidate pools.
- **Feedback Integration**: Each new star adds training pairs, so retraining shifts the ranking toward the user's stated preferences.
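
The cells above create `X_test` and `y_test` but never score them. One way to check the ranker on held-out pairs is a pairwise accuracy: the share of (positive, negative) combinations in which the positive row receives the higher score. The sketch below is an assumption about how that check could be done, not part of the notebook. `pairwise_accuracy` is a hypothetical helper, and the toy scores stand in for `model.predict(X_test)`.

```python
def pairwise_accuracy(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. Raises ValueError if either class is missing.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy scores and labels, invented for illustration.
perfect = pairwise_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 1, 0])
tied = pairwise_accuracy([0.5, 0.5], [1, 0])
```

A value near 0.5 means the scores order the held-out pairs no better than chance.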

### **5. Filtering Candidates**

#### Pre-Processing Filters

In [None]:
# Filter candidates whose text fuzzily matches the target keywords
def filter_by_text(data, keywords, threshold=60):
    """Filter candidates based on fuzzy text matching."""
    return data  # Implementation placeholder: currently returns the input unchanged

filtered_data = filter_by_text(data, target_keywords, threshold=60)
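
Since `filter_by_text` is still a placeholder, the following sketch shows one way threshold-based fuzzy filtering could work. It is an assumption, not the notebook's implementation. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzzywuzzy.fuzz`, so the scores will not match fuzzywuzzy's exactly. `similarity` and `filter_titles` are hypothetical names, and the keywords are assumed to be a list of strings.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a case-insensitive similarity score between 0 and 100."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_titles(titles, keywords, threshold=60):
    """Keep titles whose best score against any keyword meets the threshold."""
    return [
        t for t in titles
        # default=0 means an empty keyword list keeps nothing.
        if max((similarity(t, k) for k in keywords), default=0) >= threshold
    ]


# Made-up titles and keyword for illustration.
titles = ["Aspiring Human Resources", "Human Resources", "Software Engineer"]
kept = filter_titles(titles, ["human resources"], threshold=60)
print(kept)
```

Taking the best score across keywords means a title passes if it resembles any one search phrase. Raising `threshold` makes the filter stricter.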
