# Passage Ranking

This notebook ranks candidates' fit for specific roles by scoring how similar each job title is to a set of search phrases. The scores come from a hosted sentence-similarity model, and the aim is to reduce manual screening effort in the recruitment process and support more consistent decisions.

## Overview

This step focuses on testing and optimizing the performance of our candidate ranking system. We:

1. Set up the environment and load the necessary data and libraries.
2. Compute similarity scores between job titles and predefined phrases.
3. Filter, rank, and aggregate results to identify the best matches.
4. Store the results for further analysis and integration.

## Set up

### Loading Libraries and Setting Paths

We initialize the working environment, import libraries, and set up paths for data and configurations.

In [7]:
import os
import sys

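try:
    from google.colab import drive
    drive.mount('/content/drive')
    root_dir = "/content/drive/MyDrive/wdir/repos/Apziva/3-potential_talents/"
    # Work from the project root so relative paths such as "data" resolve
    os.chdir(root_dir)

except ImportError:
    # Not on Colab: walk up until the current directory contains 'potential_talents'
    while 'potential_talents' not in os.listdir('.'):
        if os.path.dirname(os.getcwd()) == os.getcwd():
            raise FileNotFoundError("'potential_talents' not found in any parent directory")
        os.chdir('..')
    # Set root_dir after the loop so it is defined even when no chdir was needed
    root_dir = os.getcwd()

    # append the project root to the system path to import custom functions
    sys.path.append('.')

%pwd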

'g:\\Mi unidad\\wdir\\repos\\Apziva\\3-potential_talents'
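The directory walk in the cell above can also be written without changing the working directory. The following is a minimal sketch, assuming the project root is the nearest ancestor that contains an entry named `potential_talents` (the same check the cell uses). The helper name `find_project_root` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "potential_talents") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Hypothetical helper: the default marker mirrors the check in the setup cell.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains {marker!r}")
```

Calling `find_project_root(Path.cwd())` would then give a value suitable for `root_dir`, and the explicit error avoids looping forever when the marker is missing.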

### Loading Data and API Setup

Here, we load the encoded job titles, read the search phrases, and configure the API credentials for similarity computations.


In [8]:
import pandas as pd
from pathlib import Path
import toml
import json
import requests
import numpy as np
import time

data_path = Path("data")
data = pd.read_parquet(data_path / "interim" / "encoded.parquet", columns=['job_title'])

credentials_path = Path(root_dir) / "config" / ".credentials"
credentials = toml.load(credentials_path)

# Define multiple search phrases for comparison
phrases_path = Path(root_dir) / "config" / "search_phrases.toml"
phrases = toml.load(phrases_path)['search_phrases']

# API and credentials setup
API_URL = "https://api-inference.huggingface.co/models/sentence-transformers/msmarco-distilbert-base-tas-b"
headers = {"Authorization": f"Bearer {credentials['hf_api_token']}"}

FileNotFoundError: [Errno 2] No such file or directory: 'g:\\Mi unidad\\wdir\\repos\\Apziva\\3-potential_talents\\config\\.credentials'
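The traceback above shows the cell stopping because `config/.credentials` was not present. A minimal pre-flight check, assuming the two file names used in the cell, could list missing files before any loading starts. The helper `missing_config_files` is hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def missing_config_files(
    root_dir,
    names: Iterable[str] = (".credentials", "search_phrases.toml"),
) -> List[Path]:
    """Return the expected config files that do not exist under <root_dir>/config."""
    config_dir = Path(root_dir) / "config"
    return [config_dir / name for name in names if not (config_dir / name).is_file()]
```

An empty list means both files are in place. A non-empty list names exactly what has to be created, which is easier to act on than a traceback from `toml.load`.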

---

## Core Functions

### Query Function

The `query` function sends POST requests to the Hugging Face inference API to compute similarity scores.


In [None]:
def query(payload):
    """Send a POST request to Hugging Face inference API."""
    response = requests.post(API_URL, headers=headers, json=payload)
    if response.status_code == 200:
        return response.json()  # Assuming the API returns a JSON response
    else:
        raise Exception(f"API Error: {response.status_code} {response.text}")

---

### Computing Similarities

This function computes the similarity scores between multiple predefined phrases and job titles.


In [None]:
def compute_similarities(data, phrases):
    """Compute similarities between multiple phrases and job titles."""
    similarity_matrix = []
    
    for phrase in phrases:
        payload = {
            "inputs": {
                "source_sentence": phrase,
                "sentences": data['job_title'].tolist()
            }
        }
        response = query(payload)
        
        if isinstance(response, dict) and 'similarities' in response:
            scores = response['similarities']
        elif isinstance(response, list):
            scores = response
        else:
            raise TypeError(f"Unexpected response format: {response}")
        
        similarity_matrix.append(scores)
    
    return np.array(similarity_matrix)
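The response handling inside `compute_similarities` can be isolated and exercised without calling the API. The sketch below assumes the two response shapes the function already checks for: a bare list of scores, or a dict with a `similarities` key. The helper name `extract_scores` and the length check are additions for illustration, and the score values are made up.

```python
from typing import Any, List


def extract_scores(response: Any, expected_len: int) -> List[float]:
    """Normalize an API response into a flat list of floats."""
    if isinstance(response, dict) and "similarities" in response:
        scores = response["similarities"]
    elif isinstance(response, list):
        scores = response
    else:
        raise TypeError(f"Unexpected response format: {response!r}")
    # Added safeguard (not in the notebook): one score per input sentence
    if len(scores) != expected_len:
        raise ValueError(f"Expected {expected_len} scores, got {len(scores)}")
    return [float(s) for s in scores]


# Stubbed responses standing in for real API output
titles = ["Data Scientist", "HR Manager", "Student"]
print(extract_scores([0.8, 0.3, 0.1], len(titles)))
print(extract_scores({"similarities": [0.8, 0.3, 0.1]}, len(titles)))
```

Checking the length early keeps a malformed response from surfacing later as a shape error when the scores are assigned to DataFrame columns.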

---

### Retry Mechanism for Robustness

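The functions below add a retry mechanism: when the API responds with status 503 while the model is still loading, the request is repeated after a fixed delay, up to a maximum number of attempts. Any other error status is raised immediately.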


In [None]:
def query_with_retry(payload, retries=5, delay=20):
    """Send a POST request to Hugging Face inference API with retry mechanism."""
    for attempt in range(retries):
        response = requests.post(API_URL, headers=headers, json=payload)
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 503:
            print(f"Model is loading, retrying in {delay} seconds...")
            time.sleep(delay)
        else:
            raise Exception(f"API Error: {response.status_code} {response.text}")
    raise Exception("Max retries exceeded")
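`query_with_retry` waits a fixed `delay` between attempts. A common variation is exponential backoff, where the wait doubles after each failed attempt. The sketch below is not the notebook's implementation: `send` is a hypothetical callable returning `(status_code, body)`, and `sleep` is injectable, so the logic can be run without network access or real waiting.

```python
import time
from typing import Any, Callable, Tuple


def retry_with_backoff(
    send: Callable[[], Tuple[int, Any]],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `send` until it returns status 200, doubling the wait after each 503."""
    for attempt in range(retries):
        status, body = send()
        if status == 200:
            return body
        if status != 503:
            raise RuntimeError(f"API Error: {status} {body}")
        if attempt < retries - 1:  # no point sleeping after the final attempt
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("Max retries exceeded")


# Simulated endpoint: two 503 responses, then success (made-up values)
replies = iter([(503, "loading"), (503, "loading"), (200, [0.9, 0.2])])
waits = []
result = retry_with_backoff(lambda: next(replies), sleep=waits.append)
print(result, waits)  # [0.9, 0.2] [1.0, 2.0]
```

Passing `waits.append` as `sleep` records the delays instead of pausing, which makes the schedule easy to inspect.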

In [None]:
def compute_similarities_with_retry(data, phrases):
    """Compute similarities between multiple phrases and job titles with retry mechanism."""
    similarity_matrix = []
    
    for phrase in phrases:
        payload = {
            "inputs": {
                "source_sentence": phrase,
                "sentences": data['job_title'].tolist()
            }
        }
        response = query_with_retry(payload)
        
        # Accept either response shape: a dict with 'similarities' or a bare list
        if isinstance(response, dict) and 'similarities' in response:
            scores = response['similarities']
        elif isinstance(response, list):  # Sometimes APIs return a list of scores
            scores = response
        else:
            raise TypeError(f"Unexpected response format: {response}")
        
        similarity_matrix.append(scores)
    
    return np.array(similarity_matrix)

---

### Adding Similarity Scores to the DataFrame


In [None]:
# Compute similarity scores
similarity_matrix = compute_similarities_with_retry(data, phrases)

# Add scores for each phrase to the DataFrame
for i, phrase in enumerate(phrases):
    data[f"similarity_to_{phrase}"] = similarity_matrix[i]

---

## Filtering and Ranking

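This section ranks job titles by their similarity score to each phrase and stacks the per-phrase rankings into a single DataFrame. Every row is kept for every phrase; no score threshold is applied at this stage.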


In [None]:
# Filter and rank results for each phrase
filtered_results = []
for phrase in phrases:
    filtered = (
        data
        .sort_values(f"similarity_to_{phrase}", ascending=False)
    )
    filtered['matching_phrase'] = phrase
    filtered_results.append(filtered)

# Combine filtered results into a single DataFrame
final_result = pd.concat(filtered_results).drop_duplicates().reset_index(drop=True)
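The cell above keeps every job title for every phrase. If a shortlist per phrase is wanted, one option is to keep only the `k` highest-scoring titles. The sketch below shows that idea with plain Python structures; `top_matches`, the cutoff `k`, and the sample scores are assumptions for illustration and do not appear in the notebook.

```python
from typing import Dict, List, Sequence, Tuple


def top_matches(
    titles: Sequence[str],
    scores_by_phrase: Dict[str, Sequence[float]],
    k: int = 2,
) -> Dict[str, List[Tuple[str, float]]]:
    """Return the k highest-scoring (title, score) pairs for each phrase."""
    ranked = {}
    for phrase, scores in scores_by_phrase.items():
        pairs = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
        ranked[phrase] = pairs[:k]
    return ranked


# Made-up titles and scores
titles = ["Data Scientist", "HR Manager", "Student"]
scores_by_phrase = {
    "data": [0.8, 0.3, 0.1],
    "human resources": [0.2, 0.9, 0.4],
}
for phrase, matches in top_matches(titles, scores_by_phrase).items():
    print(phrase, matches)
```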

---

## Aggregation and Final Rankings

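We compute an aggregate fit score for each row: the median of its similarity scores across all phrases minus their standard deviation. Subtracting the spread favors job titles that score consistently well over titles that match one phrase strongly and the others poorly.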


In [None]:
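# Similarity columns sit between job_title (first) and matching_phrase (last)
sim_cols = final_result.iloc[:, 1:-1]
final_result = (
    final_result.assign(fit=sim_cols.median(axis=1) - sim_cols.std(axis=1))
    .sort_values('fit', ascending=False)
    .iloc[:, [0, -2, -1]]  # keep job_title, matching_phrase, fit
)
final_result.sample(4, random_state=27)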
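To make the fit score concrete, the sketch below reproduces `median - std` for two made-up rows using only the standard library. It assumes the sample standard deviation (`ddof=1`), which is the pandas default for `DataFrame.std`. The scores are illustrative, not results from the notebook.

```python
from statistics import median, stdev
from typing import Sequence


def fit_score(similarities: Sequence[float]) -> float:
    """Median of the scores minus their sample standard deviation."""
    return median(similarities) - stdev(similarities)


consistent = [0.60, 0.62, 0.58]  # similar score for every phrase
spiky = [0.95, 0.62, 0.10]       # higher median, but very uneven

print(round(fit_score(consistent), 3))
print(round(fit_score(spiky), 3))
```

The second row has the higher median yet ends up with a much lower fit, because the large spread across phrases is subtracted from it.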

Finally, we average the fit score per job title, sort the titles from best to worst fit, and save the result for downstream analysis.


In [None]:
grouped_results = pd.DataFrame(
    final_result.groupby('job_title')["fit"].mean()
    .sort_values(ascending=False)
)
grouped_results.to_parquet(data_path / "processed" / "grouped_results.parquet")