# Model A: AutoML for Recruiter Decision Classification

Trains Model A using AutoML (PyCaret) to predict **Recruiter Decision (Hire/Reject)** based on resume features.

## Model A Specifications:
- **Input Features**: All resume features (Skills, Experience, Education, Certifications, Job Role, Salary, Projects)
- **Excluded**: Demographic features (Gender, Race, Age, Disability_Status) - NOT used in training
- **Target Variable**: Recruiter_Decision (Hire/Reject) - **Classification Task**
- **Purpose**: Serves as the company's hiring model whose fairness we are testing

## What we'll do:
1. Load processed data from Data_processing.ipynb
2. Set up PyCaret AutoML for classification
3. Train and compare multiple models
4. Evaluate metrics (Accuracy, Precision, Recall, F1, AUC)
5. Save the best model and metrics

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import pickle
import json
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# PyCaret for AutoML
from pycaret.classification import *

print("Libraries imported successfully!")
print(f"PyCaret version: {__import__('pycaret').__version__}")


Libraries imported successfully!
PyCaret version: 3.3.2


## Step 1: Load Processed Data


In [None]:
# Load processed data from saved CSV files
print("Loading processed data from CSV files...")
print("(Make sure Data_processing.ipynb has been run and saved all files)")

try:
    # Load training and test features
    X_train = pd.read_csv('X_train.csv')
    X_test = pd.read_csv('X_test.csv')
    
    # Load target variables (Recruiter_Decision) 
    y_train_df = pd.read_csv('y_train.csv')
    y_test_df = pd.read_csv('y_test.csv')
    y_train = y_train_df['Recruiter_Decision'].squeeze()  
    y_test = y_test_df['Recruiter_Decision'].squeeze()   
    
    try:
        ai_score_train = pd.read_csv('ai_score_train.csv')['AI_Score']  
        ai_score_test = pd.read_csv('ai_score_test.csv')['AI_Score']    
        print("✓ Loaded AI_Score files for fairness metrics")
    except FileNotFoundError:
        ai_score_train = None
        ai_score_test = None
        print("⚠ AI_Score files not found (optional for fairness analysis)")
    
    # Load demographics for fairness metrics (optional, for later use)
    try:
        demographics_train = pd.read_csv('demographics_train.csv')
        demographics_test = pd.read_csv('demographics_test.csv')
        print("✓ Loaded demographics files for fairness metrics")
    except FileNotFoundError:
        demographics_train = None
        demographics_test = None
        print("⚠ Demographics files not found (optional for fairness analysis)")
    
    print("\n✓ Successfully loaded all processed data from CSV files!")
    print(f"\nTraining set shape: {X_train.shape}")
    print(f"Test set shape: {X_test.shape}")
    print(f"\nFeature columns: {list(X_train.columns)}")
    print(f"\nTarget variable (Recruiter_Decision) distribution:")  
    print(y_train.value_counts())
    print(f"\nClass balance: {y_train.value_counts(normalize=True).to_dict()}")  
    
except FileNotFoundError as e:
    print(f"\n Error: Required CSV files not found!")
    print(f"Missing file: {e.filename if hasattr(e, 'filename') else str(e)}")
    print("\nPlease run Data_processing.ipynb first to generate the processed CSV files.")
    print("Required files:")
    print("  - X_train.csv, X_test.csv")
    print("  - y_train.csv, y_test.csv")
    raise
except Exception as e:
    print(f"\n Error loading data: {e}")
    print(f"Error type: {type(e).__name__}")
    import traceback
    traceback.print_exc()
    raise

Loading processed data from CSV files...
(Make sure Data_processing.ipynb has been run and saved all files)
✓ Loaded AI_Score files for fairness metrics
✓ Loaded demographics files for fairness metrics

✓ Successfully loaded all processed data from CSV files!

Training set shape: (800, 7)
Test set shape: (200, 7)

Feature columns: ['Skills', 'Experience', 'Education_Ordinal', 'Certifications_Encoded', 'Job_Role_Encoded', 'Salary_Expectation', 'Projects_Count']

Target variable (Recruiter_Decision) distribution:
Recruiter_Decision
Hire      650
Reject    150
Name: count, dtype: int64

Class balance: {'Hire': 0.8125, 'Reject': 0.1875}


In [3]:
# Combine X and y for PyCaret setup
train_data = X_train.copy()
train_data['Recruiter_Decision'] = y_train.values  

test_data = X_test.copy()
test_data['Recruiter_Decision'] = y_test.values   

print(f"Training data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")
print(f"\nTarget distribution in training data:") 
print(train_data['Recruiter_Decision'].value_counts()) 
print(f"\nFirst few rows:")
train_data.head()

Training data shape: (800, 8)
Test data shape: (200, 8)

Target distribution in training data:
Recruiter_Decision
Hire      650
Reject    150
Name: count, dtype: int64

First few rows:


Unnamed: 0,Skills,Experience,Education_Ordinal,Certifications_Encoded,Job_Role_Encoded,Salary_Expectation,Projects_Count,Recruiter_Decision
0,"Deep Learning, Python",9,2,2,2,91723,8,Hire
1,"Linux, Ethical Hacking, Cybersecurity",4,4,0,1,96836,1,Hire
2,"Pytorch, NLP, TensorFlow, Python",7,3,0,0,51478,9,Hire
3,"Pytorch, NLP, Python, TensorFlow",1,3,2,0,94795,7,Hire
4,"Pytorch, Python",5,2,1,0,89338,1,Hire


## Step 2: Setup PyCaret AutoML

Initialize the PyCaret classification environment. This will:
- Handle text features (Skills) automatically
- Prepare data for modeling
- Set up preprocessing pipeline


In [37]:
# Initialize PyCaret CLASSIFICATION environment 
clf = setup(  
    data=train_data,
    target='Recruiter_Decision',  
    test_data=test_data,
    session_id=42,
    normalize=True,
    transformation=True,
    feature_selection=True,
    remove_multicollinearity=True,
    multicollinearity_threshold=0.95,
    remove_outliers=False,
    fix_imbalance=False,
    verbose=True,
    index=False
)

print("✓ PyCaret setup completed!")
print(f"\nTraining samples: {len(train_data)}")
print(f"Test samples: {len(test_data)}")

Unnamed: 0,Description,Value
0,Session id,42
1,Target,Recruiter_Decision
2,Target type,Binary
3,Target mapping,"Hire: 0, Reject: 1"
4,Original data shape,"(1000, 8)"
5,Transformed data shape,"(1000, 2)"
6,Transformed train set shape,"(800, 2)"
7,Transformed test set shape,"(200, 2)"
8,Numeric features,6
9,Categorical features,1


✓ PyCaret setup completed!

Training samples: 800
Test samples: 200


## Step 3: Compare Models

Compare multiple classification models to find the best one for predicting the Recruiter Decision (Hire/Reject).


In [38]:
# Check available classification models
from pycaret.classification import models

available_models = models()
print("Available classification models in PyCaret:")
print(available_models)

Available classification models in PyCaret:
                                     Name  \
ID                                          
lr                    Logistic Regression   
knn                K Neighbors Classifier   
nb                            Naive Bayes   
dt               Decision Tree Classifier   
svm                   SVM - Linear Kernel   
rbfsvm                SVM - Radial Kernel   
gpc           Gaussian Process Classifier   
mlp                        MLP Classifier   
ridge                    Ridge Classifier   
rf               Random Forest Classifier   
qda       Quadratic Discriminant Analysis   
ada                  Ada Boost Classifier   
gbc          Gradient Boosting Classifier   
lda          Linear Discriminant Analysis   
et                 Extra Trees Classifier   
lightgbm  Light Gradient Boosting Machine   
dummy                    Dummy Classifier   


In [39]:
# Compare CLASSIFICATION models and select top performers 
best_models = compare_models(
    include=['rf','lightgbm', 'et', 'ada', 'dt', 'lr', 'ridge', 'nb', 'knn'], 
    sort='AUC', 
    n_select=5,
    verbose=False
)

print("✓ Model comparison completed!")
print(f"\nTop 5 models selected based on AUC") 

✓ Model comparison completed!

Top 5 models selected based on AUC


In [40]:
# Get the best model (highest AUC, since compare_models was sorted by AUC)
best_model = best_models[0] if isinstance(best_models, list) else best_models

print(f"Best model: {type(best_model).__name__}")
print("\nModel performance on test set:")
evaluate_model(best_model)


Best model: RandomForestClassifier

Model performance on test set:



## Step 4: Model Tuning (Optional)

Fine-tune the best model to improve performance.


In [41]:
# Tune the best model
tuned_model = tune_model(best_model, optimize='AUC', n_iter=50, verbose=False)

print("✓ Model tuning completed!")
print(f"\nTuned model: {type(tuned_model).__name__}")
print("\nTuned model performance:")
evaluate_model(tuned_model)


✓ Model tuning completed!

Tuned model: RandomForestClassifier

Tuned model performance:



## Step 5: Final Model Selection

Select the final model (tuned or original) and evaluate on test set.


In [42]:
# Use the tuned model only if it performs better than the original.
# In our case tuned_model performed worse, so we keep best_model.
final_model = best_model

# Make predictions on test set
predictions = predict_model(final_model, data=test_data)

# Calculate CLASSIFICATION metrics 
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report

y_true = test_data['Recruiter_Decision'].values  
y_pred = predictions['prediction_label'].values
y_pred_proba = predictions['prediction_score'].values if 'prediction_score' in predictions.columns else None

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label='Hire')  
recall = recall_score(y_true, y_pred, pos_label='Hire')       
f1 = f1_score(y_true, y_pred, pos_label='Hire')               

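# AUC requires probability scores.
# NOTE: PyCaret's 'prediction_score' is the confidence of the predicted label,
# not P(Hire); predict_model(..., raw_score=True) returns per-class scores.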
if y_pred_proba is not None:
    # Convert labels to binary (Hire=1, Reject=0)
    y_true_binary = (y_true == 'Hire').astype(int)
    auc = roc_auc_score(y_true_binary, y_pred_proba)
else:
    auc = None

print("=" * 60)
print("FINAL MODEL PERFORMANCE METRICS")
print("=" * 60)
print(f"\nModel Type: {type(final_model).__name__}")
print(f"\nClassification Metrics (Recruiter Decision Prediction):")  
print(f"  Accuracy:  {accuracy:.4f}")
print(f"  Precision: {precision:.4f}")
print(f"  Recall:    {recall:.4f}")
print(f"  F1 Score:  {f1:.4f}")
if auc is not None:
    print(f"  AUC-ROC:   {auc:.4f}")

print(f"\nConfusion Matrix:")
print(confusion_matrix(y_true, y_pred))

print(f"\nClassification Report:")
print(classification_report(y_true, y_pred))

print(f"\nTest Set Size: {len(y_true)} samples")
print(f"Class Distribution: {pd.Series(y_true).value_counts().to_dict()}")
print("=" * 60)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.695,0.4974,0.695,0.698,0.6965,0.019,0.019


FINAL MODEL PERFORMANCE METRICS

Model Type: RandomForestClassifier

Classification Metrics (Recruiter Decision Prediction):
  Accuracy:  0.6950
  Precision: 0.8137
  Recall:    0.8086
  F1 Score:  0.8111
  AUC-ROC:   0.5306

Confusion Matrix:
[[131  31]
 [ 30   8]]

Classification Report:
              precision    recall  f1-score   support

        Hire       0.81      0.81      0.81       162
      Reject       0.21      0.21      0.21        38

    accuracy                           0.69       200
   macro avg       0.51      0.51      0.51       200
weighted avg       0.70      0.69      0.70       200


Test Set Size: 200 samples
Class Distribution: {'Hire': 162, 'Reject': 38}
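
The reported metrics can be re-derived by hand from the confusion matrix printed above. The following is a minimal sketch, assuming the rows and columns follow scikit-learn's alphabetical label order (Hire, then Reject); the always-Hire baseline is added here only as a reference point and is not part of the original notebook.

```python
# Confusion matrix from the test-set output above
# (rows = true class, columns = predicted class; order: Hire, Reject).
cm = [[131, 31],
      [30, 8]]

tp, fn = cm[0]   # true Hire:   predicted Hire / predicted Reject
fp, tn = cm[1]   # true Reject: predicted Hire / predicted Reject
total = tp + fn + fp + tn

accuracy = (tp + tn) / total
precision_hire = tp / (tp + fp)
recall_hire = tp / (tp + fn)
f1_hire = 2 * precision_hire * recall_hire / (precision_hire + recall_hire)

# Reference point (an assumption for illustration): a trivial classifier
# that always predicts "Hire" is right for every true Hire.
baseline_accuracy = (tp + fn) / total

print(f"accuracy={accuracy:.4f} precision={precision_hire:.4f} "
      f"recall={recall_hire:.4f} f1={f1_hire:.4f}")
print(f"always-Hire baseline accuracy={baseline_accuracy:.4f}")
```

Comparing `accuracy` with `baseline_accuracy` is a quick way to check whether a model on imbalanced data adds anything over predicting the majority class.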


In [43]:
# Store metrics in a dictionary
metrics = {
    'model_type': type(final_model).__name__,
    'accuracy': float(accuracy),
    'precision': float(precision),
    'recall': float(recall),
    'f1_score': float(f1),
    'auc': float(auc) if auc is not None else None,
    'test_samples': int(len(y_true)),
    'target_variable': 'Recruiter_Decision',  
    'task_type': 'classification',  
    'positive_class': 'Hire',  
    'class_distribution': pd.Series(y_true).value_counts().to_dict()  
}

# Display metrics
print("\nMetrics Summary:")
for key, value in metrics.items():
    print(f"  {key}: {value}")


Metrics Summary:
  model_type: RandomForestClassifier
  accuracy: 0.695
  precision: 0.8136645962732919
  recall: 0.808641975308642
  f1_score: 0.8111455108359134
  auc: 0.5306205328135153
  test_samples: 200
  target_variable: Recruiter_Decision
  task_type: classification
  positive_class: Hire
  class_distribution: {'Hire': 162, 'Reject': 38}


Our baseline hiring model (Random Forest) reaches precision and recall of about 0.81 for the Hire class, but that is roughly the share of Hire cases in the test set (162 of 200, or 81%), and its accuracy (0.695) is below the 0.81 that always predicting Hire would give. The AUC is close to 0.5 (0.53 as computed above, 0.50 in PyCaret's own report), which means the model barely separates Hire from Reject; on this imbalanced dataset (162 Hire vs. 38 Reject) it mostly predicts the majority class. Skewed data of this kind is realistic for HR scenarios, where models often overpredict hires. This imperfect model still suits our FairHire system, which aims to evaluate fairness, detect bias, and monitor drift rather than optimize predictive performance.
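
One caveat when computing AUC from PyCaret output: the default `prediction_score` column holds the confidence of the predicted label, not the probability of the Hire class. The sketch below shows one way to convert it to P(Hire) for a binary target. The toy DataFrame is made up for illustration and only mimics the column names that `predict_model` returns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for predict_model() output (values are illustrative only).
preds = pd.DataFrame({
    "Recruiter_Decision": ["Hire", "Hire", "Reject", "Hire", "Reject"],
    "prediction_label":   ["Hire", "Reject", "Hire", "Hire", "Reject"],
    "prediction_score":   [0.90, 0.60, 0.55, 0.70, 0.80],
})

# prediction_score is the confidence of the predicted label, so for rows
# predicted "Reject" the Hire probability is 1 - prediction_score.
p_hire = np.where(
    preds["prediction_label"] == "Hire",
    preds["prediction_score"],
    1.0 - preds["prediction_score"],
)

preds["p_hire"] = p_hire
print(preds[["prediction_label", "prediction_score", "p_hire"]])
```

The resulting `p_hire` column is what a call such as `roc_auc_score(y_true_binary, p_hire)` expects when Hire is the positive class.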

In [54]:
import pandas as pd
import numpy as np
from pycaret.classification import load_model, predict_model

print("\n===================================================")
print("Building Model A shortlist base for all candidates")
print("===================================================")

# ============================================================
# 1. Load Model A using PyCaret's load_model (includes preprocessing)
# ============================================================
print("Loading Model A...")
modelA = load_model('model_A/modelA_final')  # ⚠️ Don't include .pkl extension!
print("✓ Model A loaded successfully")

# ============================================================
# 2. Load full processed dataset
# ============================================================
full_df = pd.read_csv("model_A/Dataset_A_processed.csv")
print(f"✓ Loaded Dataset_A_processed.csv with shape: {full_df.shape}")

# ============================================================
# 3. Prepare data for prediction (need target column for PyCaret)
# ============================================================
# PyCaret expects the target column to be present (even if we ignore it)
# Create a dummy target column if it doesn't exist
if 'Recruiter_Decision' not in full_df.columns:
    full_df['Recruiter_Decision'] = 'Hire'  # Dummy value

# Feature columns that were used in training
feature_columns = [
    'Skills', 
    'Experience', 
    'Education_Ordinal', 
    'Certifications_Encoded', 
    'Job_Role_Encoded', 
    'Salary_Expectation', 
    'Projects_Count'
]

# Sanity check: make sure all feature columns exist
missing_feats = [c for c in feature_columns if c not in full_df.columns]
if missing_feats:
    print(f"⚠️ Missing feature columns in Dataset_A_processed: {missing_feats}")
    raise ValueError("Feature columns mismatch between training and full dataset.")

print(f"✓ All {len(feature_columns)} feature columns present")

# ============================================================
# 4. Make predictions using PyCaret's predict_model
# ============================================================
print("Making predictions on all candidates...")
predictions = predict_model(modelA, data=full_df)

# Extract predictions and probabilities
pred_labels = predictions['prediction_label'].values
pred_scores = predictions['prediction_score'].values  # Confidence of the predicted label (equals P(Hire) only where the label is 'Hire')

# Convert labels to binary
label_map = {'Hire': 1, 'Reject': 0}
pred_binary = predictions['prediction_label'].map(label_map).fillna(0).astype(int).values

print(f"✓ Predictions complete:")
print(f"  - Predicted Hire: {(pred_binary == 1).sum()} ({(pred_binary == 1).mean()*100:.1f}%)")
print(f"  - Predicted Reject: {(pred_binary == 0).sum()} ({(pred_binary == 0).mean()*100:.1f}%)")

# ============================================================
# 5. Build shortlist base DataFrame with rankings
# ============================================================
shortlist_base = full_df.copy()

# Add prediction columns
shortlist_base["ModelA_Hire_Prob"] = pred_scores
shortlist_base["ModelA_Prediction"] = pred_binary
shortlist_base["ModelA_Pred_Label"] = predictions['prediction_label'].values

# Add ranking based on hire probability (1 = highest probability)
shortlist_base["ModelA_Rank"] = shortlist_base["ModelA_Hire_Prob"].rank(
    ascending=False,  # Higher probability = better rank
    method='min'      # Ties get same rank
).astype(int)

# Sort by rank for easy viewing
shortlist_base = shortlist_base.sort_values('ModelA_Rank')

print(f"\n✓ Shortlist base created:")
print(f"  - Total candidates: {len(shortlist_base)}")
print(f"  - Rank range: {shortlist_base['ModelA_Rank'].min()} to {shortlist_base['ModelA_Rank'].max()}")
print(f"  - Top candidate hire probability: {shortlist_base['ModelA_Hire_Prob'].max():.3f}")
print(f"  - Bottom candidate hire probability: {shortlist_base['ModelA_Hire_Prob'].min():.3f}")

# ============================================================
# 6. Save to CSV for Streamlit app
# ============================================================
output_path = "model_A/modelA_shortlist_base.csv"
shortlist_base.to_csv(output_path, index=False)

print(f"\n✓ Saved: {output_path}")
print(f"Columns: {shortlist_base.columns.tolist()}")

# ============================================================
# 7. Display sample of top and bottom candidates
# ============================================================
print("\n🏆 Top 5 Candidates (Highest Hire Probability):")
top_cols = ['ModelA_Rank', 'Job_Role', 'Experience', 'Education_Ordinal', 
            'ModelA_Hire_Prob', 'ModelA_Pred_Label']
print(shortlist_base[top_cols].head())

print("\n📉 Bottom 5 Candidates (Lowest Hire Probability):")
print(shortlist_base[top_cols].tail())

print("\n===================================================")
print("✅ Model A shortlist base complete!")
print("===================================================")


Building Model A shortlist base for all candidates
Loading Model A...
Transformation Pipeline and Model Successfully Loaded
✓ Model A loaded successfully
✓ Loaded Dataset_A_processed.csv with shape: (1000, 13)
✓ All 7 feature columns present
Making predictions on all candidates...


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.938,0.8975,0.938,0.938,0.938,0.7969,0.7969


✓ Predictions complete:
  - Predicted Hire: 812 (81.2%)
  - Predicted Reject: 188 (18.8%)

✓ Shortlist base created:
  - Total candidates: 1000
  - Rank range: 1 to 999
  - Top candidate hire probability: 1.000
  - Bottom candidate hire probability: 0.540

✓ Saved: model_A/modelA_shortlist_base.csv
Columns: ['Skills', 'Experience', 'Education', 'Certifications', 'Job_Role', 'Recruiter_Decision', 'Salary_Expectation', 'Projects_Count', 'AI_Score', 'Education_Ordinal', 'Certifications_Encoded', 'Job_Role_Encoded', 'Education_Encoded', 'ModelA_Hire_Prob', 'ModelA_Prediction', 'ModelA_Pred_Label', 'ModelA_Rank']

🏆 Top 5 Candidates (Highest Hire Probability):
     ModelA_Rank               Job_Role  Experience  Education_Ordinal  \
0              1          AI Researcher          10                  2   
528            1         Data Scientist           6                  3   
527            1  Cybersecurity Analyst           0                  2   
524            1      Software 
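
Several candidates share rank 1 in the output above because `rank(method='min')` gives tied scores the same rank. The sketch below, using made-up scores, contrasts that with `method='first'`, which breaks ties by row order and yields a unique rank per candidate. Which behavior suits the shortlist is a design choice the notebook does not spell out.

```python
import pandas as pd

# Illustrative scores with ties (not taken from the notebook's data).
scores = pd.Series([1.0, 1.0, 0.9, 0.8, 0.8], name="ModelA_Hire_Prob")

# method="min": tied scores share the lowest rank of their group, leaving gaps.
rank_min = scores.rank(ascending=False, method="min").astype(int)

# method="first": ties are broken by order of appearance, so ranks are unique.
rank_first = scores.rank(ascending=False, method="first").astype(int)

print(pd.DataFrame({"score": scores, "rank_min": rank_min, "rank_first": rank_first}))
```

With `method="min"`, a "top K by rank" filter can return more than K rows when ties straddle the cutoff, which is worth keeping in mind for the Streamlit shortlist.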

In [None]:
#For streamlit UI implementation for top rank candidates for a job role...

# Original columns from Dataset_A_processed.csv used here:
#   'Skills', 'Experience', 'Education_Ordinal', 'Certifications_Encoded',
#   'Job_Role', 'Job_Role_Encoded', 'Salary_Expectation', 'Projects_Count', 'Recruiter_Decision'

# NEW columns added:
#   'ModelA_Hire_Prob'   - Model A score taken from prediction_score (0.0 to 1.0)
#   'ModelA_Prediction'  - Binary prediction (1=Hire, 0=Reject)
#   'ModelA_Pred_Label'  - Text label ('Hire' or 'Reject')
#   'ModelA_Rank'        - Ranking (1 = best candidate)

# In your Streamlit app
import streamlit as st
import pandas as pd

# Load shortlist
shortlist = pd.read_csv('model_A/modelA_shortlist_base.csv')

# Filter by job role
job_role = st.selectbox("Select Job Role", shortlist['Job_Role'].unique())
filtered = shortlist[shortlist['Job_Role'] == job_role]  # compare against the same column the selectbox lists

# Show top K candidates
top_k = st.slider("Number of candidates to show", 10, 100, 50)
top_candidates = filtered.nsmallest(top_k, 'ModelA_Rank')

# Display with ranking
st.dataframe(top_candidates[['ModelA_Rank', 'Skills', 'Experience', 
                              'Education_Ordinal', 'ModelA_Hire_Prob']])

## Step 6: Save Model and Metrics

Save the trained model and evaluation metrics for later use.


In [16]:
# Save the final model
save_model(final_model, 'modelA_final')

print("✓ Model saved as 'modelA_final.pkl'")
print("  (PyCaret automatically saves the model with preprocessing pipeline)")


Transformation Pipeline and Model Successfully Saved
✓ Model saved as 'modelA_final.pkl'
  (PyCaret automatically saves the model with preprocessing pipeline)


In [17]:
# Save metrics to JSON file
with open('modelA_metrics.json', 'w') as f:
    json.dump(metrics, f, indent=4)

print("✓ Metrics saved to 'modelA_metrics.json'")
print("\nSaved metrics:")
print(json.dumps(metrics, indent=2))


✓ Metrics saved to 'modelA_metrics.json'

Saved metrics:
{
  "model_type": "RandomForestClassifier",
  "accuracy": 0.695,
  "precision": 0.8136645962732919,
  "recall": 0.808641975308642,
  "f1_score": 0.8111455108359134,
  "auc": 0.5306205328135153,
  "test_samples": 200,
  "target_variable": "Recruiter_Decision",
  "task_type": "classification",
  "positive_class": "Hire",
  "class_distribution": {
    "Hire": 162,
    "Reject": 38
  }
}


## Step 7: Model Visualization

Visualize model performance and feature importance.


In [19]:
plot_model(final_model, plot='auc', save=True)           # ROC-AUC curve
plot_model(final_model, plot='confusion_matrix', save=True)  # Confusion matrix
plot_model(final_model, plot='class_report', save=True)  # Classification report
plot_model(final_model, plot='pr', save=True)            # Precision-Recall curve
plot_model(final_model, plot='feature', save=True)       # Feature importance

print("✓ Model plots saved in 'Plots' directory")

✓ Model plots saved in 'Plots' directory


## Step 8: Summary

Model A training completed! This model can now be used for:
- Predicting the Recruiter Decision (Hire/Reject) based on resume features
- Fairness analysis (using demographics_test and y_class_test from Data_processing.ipynb)
- Comparison with other models


In [20]:
print("=" * 60)
print("MODEL A TRAINING SUMMARY")
print("=" * 60)
print(f"\n✓ Model Type: {metrics['model_type']}")
print(f"✓ Target Variable: {metrics['target_variable']} (Classification)")
print(f"✓ Positive Class: {metrics['positive_class']}")
print(f"✓ Test Set Performance:")
print(f"    Accuracy:  {metrics['accuracy']:.4f}")  
print(f"    Precision: {metrics['precision']:.4f}") 
print(f"    Recall:    {metrics['recall']:.4f}")     
print(f"    F1 Score:  {metrics['f1_score']:.4f}")   
if metrics['auc'] is not None:
    print(f"    AUC-ROC:   {metrics['auc']:.4f}")     
print(f"\n✓ Files Saved:")
print(f"    - modelA_final.pkl (trained model)")
print(f"    - modelA_metrics.json (evaluation metrics)")
print(f"    - modelA_predictions.csv (test set predictions)")
print(f"\n✓ Next Steps:")
print(f"    - Use this model for fairness analysis")
print(f"    - Compare predictions across demographic groups")  
print(f"    - Analyze bias in Hire/Reject decisions")         
print("=" * 60)

MODEL A TRAINING SUMMARY

✓ Model Type: RandomForestClassifier
✓ Target Variable: Recruiter_Decision (Classification)
✓ Positive Class: Hire
✓ Test Set Performance:
    Accuracy:  0.6950
    Precision: 0.8137
    Recall:    0.8086
    F1 Score:  0.8111
    AUC-ROC:   0.5306

✓ Files Saved:
    - modelA_final.pkl (trained model)
    - modelA_metrics.json (evaluation metrics)
    - modelA_predictions.csv (test set predictions)

✓ Next Steps:
    - Use this model for fairness analysis
    - Compare predictions across demographic groups
    - Analyze bias in Hire/Reject decisions


## Step 9: Fairness Analysis

- Imports fairness_util.py, which implements fairness metrics from the arXiv paper "Fairness in AI-Driven Recruitment: Challenges, Metrics, Methods, and Future Directions": https://arxiv.org/pdf/2405.19699
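
The contents of fairness_util.py are not shown in this notebook. As a rough illustration of what two of the metrics measure, here is a minimal sketch with hypothetical helper names (`demographic_parity_gap`, `equal_opportunity_gap`) and toy data. It is an assumption about the general technique, not the actual implementation in fairness_util.py.

```python
import pandas as pd

def demographic_parity_gap(df, group_col, pred_col):
    """Predicted-hire rate per group and the largest gap between groups."""
    rates = df.groupby(group_col)[pred_col].mean()
    return {"per_group": rates.to_dict(),
            "max_gap": float(rates.max() - rates.min())}

def equal_opportunity_gap(df, group_col, y_true_col, pred_col):
    """True-positive rate per group (among true hires) and the largest gap."""
    positives = df[df[y_true_col] == 1]
    tpr = positives.groupby(group_col)[pred_col].mean()
    return {"per_group_tpr": tpr.to_dict(),
            "max_tpr_gap": float(tpr.max() - tpr.min())}

# Toy data (illustrative only, not taken from the notebook's dataset).
toy = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "y_true": [1, 1, 0, 1, 1, 1, 0, 0],
    "y_pred": [1, 1, 1, 0, 1, 0, 0, 0],
})

dp = demographic_parity_gap(toy, "group", "y_pred")
eo = equal_opportunity_gap(toy, "group", "y_true", "y_pred")
print("Demographic parity:", dp)
print("Equal opportunity:", eo)
```

Demographic parity looks only at predictions, while equal opportunity conditions on the true label, so the two can disagree about which group is disadvantaged.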


In [44]:
import numpy as np
import pandas as pd
import json

from fairness_util import (
    demographic_parity_diff,
    selection_rate_parity_topk,
    rank_ordering_bias,
    equal_opportunity_diff,
    score_distribution_overlap,
)

print("\n======================================================================")
print(" FAIRNESS ANALYSIS FOR MODEL A")
print("======================================================================")

# -------------------------------------------------------------------
# 1. Prepare ground truth, predictions, base fairness DataFrame
# -------------------------------------------------------------------

# Map Recruiter_Decision text -> binary 1/0
label_map = {"Hire": 1, "Reject": 0, "No hire": 0, "No Hire": 0}

# Get true labels from test_data
y_true = test_data['Recruiter_Decision'].map(label_map).fillna(0).astype(int)

# Get predictions from your PyCaret model
predictions_full = predict_model(final_model, data=test_data)

# Extract predicted labels
y_pred_labels = predictions_full['prediction_label'].map(label_map).fillna(0).astype(int)
y_pred = y_pred_labels.values

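# Extract probability of the Hire class.
# NOTE: the plain 'prediction_score' column is PyCaret's confidence in the
# predicted label, so it equals P(Hire) only for rows predicted as Hire.
if 'prediction_score_Hire' in predictions_full.columns:
    y_proba = predictions_full['prediction_score_Hire'].values
elif 'prediction_score' in predictions_full.columns:
    y_proba = predictions_full['prediction_score'].values
else:
    print(" Warning: Using fallback probability extraction")
    # With raw_score=True, PyCaret returns one score column per class
    y_proba = predict_model(final_model, data=test_data, raw_score=True)['prediction_score_Hire'].values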

accuracy = (y_pred == y_true).mean()

print("✓ Fairness Analysis Setup:")
print(f"  - Test samples: {len(y_true)}")
print(f"  - Ground truth (Hire=1): {y_true.sum()} ({y_true.mean()*100:.1f}%)")
print(f"  - Model predicts (Hire=1): {y_pred.sum()} ({y_pred.mean()*100:.1f}%)")
print(f"  - Model accuracy: {accuracy*100:.1f}%")

fair_df = pd.DataFrame({
    "y_true":  y_true.values,
    "y_pred":  y_pred,
    "y_proba": y_proba,
})

# -------------------------------------------------------------------
# 2. Clean demographics and attach to fairness DataFrame
# -------------------------------------------------------------------

# Load demographics if not already in memory
if 'demographics_test' not in locals() or demographics_test is None:
    try:
        demographics_test = pd.read_csv('demographics_test.csv')
        print("✓ Loaded demographics_test.csv")
    except FileNotFoundError:
        print("⚠️ Warning: demographics_test.csv not found!")
        demographics_test = pd.DataFrame()

demo = demographics_test.copy()

# Age -> bucket into coarse groups so output is readable
if "Age" in demo.columns:
    demo["Age_Group"] = pd.cut(
        demo["Age"],
        bins=[17, 29, 39, 120],
        labels=["18-29", "30-39", "40+"],
        right=True,
        include_lowest=True,
    )

# Race -> collapse rare categories into "Other / Minority"
if "Race" in demo.columns:
    race_counts = demo["Race"].value_counts()
    rare_races = race_counts[race_counts < 10].index  # threshold can be tuned
    demo["Race_Grouped"] = demo["Race"].where(~demo["Race"].isin(rare_races),
                                              "Other / Minority")

# Attach cleaned demographics using your original column names where possible
if "Gender" in demo.columns:
    fair_df["Gender"] = demo["Gender"].values
if "Race_Grouped" in demo.columns:
    fair_df["Race"] = demo["Race_Grouped"].values   # we overwrite Race with grouped
elif "Race" in demo.columns:
    fair_df["Race"] = demo["Race"].values
if "Age_Group" in demo.columns:
    fair_df["Age_Group"] = demo["Age_Group"].values
if "Disability_Status" in demo.columns:
    fair_df["Disability_Status"] = demo["Disability_Status"].values

demographic_attrs = ["Gender", "Race", "Age_Group", "Disability_Status"]
demographic_attrs = [a for a in demographic_attrs if a in fair_df.columns]

print("\n✓ Fairness DataFrame:")
print(f"  - Shape: {fair_df.shape}")
print(f"  - Demographics: {demographic_attrs}")

# Only report groups with at least this many samples
MIN_GROUP_SIZE = 10

def filter_small_groups(df, group_col, min_size=MIN_GROUP_SIZE):
    counts = df[group_col].value_counts()
    valid = counts[counts >= min_size].index
    return df[df[group_col].isin(valid)], valid


fairness_summary_A = {}

# -------------------------------------------------------------------
# 3. Per-attribute fairness metrics (clean, human-readable)
# -------------------------------------------------------------------
for attr in demographic_attrs:
    print("\n" + "="*70)
    print(f"📊 Fairness Analysis by: {attr}")
    print("="*70)

    df_attr = fair_df.copy()
    df_attr["group"] = df_attr[attr]

    # Filter out tiny groups to avoid extreme 100%/0% stats from 1–2 rows
    df_attr, kept_groups = filter_small_groups(df_attr, "group", MIN_GROUP_SIZE)
    if len(kept_groups) < 2:
        print(f"⚠️ Not enough data for {attr} after filtering groups with <{MIN_GROUP_SIZE} samples. Skipping.")
        continue

    print("\nGroup distribution (kept groups):")
    print(df_attr["group"].value_counts())

    # 1️⃣ Demographic Parity – use y_pred directly (hire rate by group)
    dp = demographic_parity_diff(
        df_attr,
        group_col="group",
        score_col="y_pred",  # mean of y_pred == hire rate
        threshold=None,
    )
    print("\n1️⃣ Demographic Parity (Hire Rate by Group):")
    for g, r in dp["per_group"].items():
        print(f"   {g}: {r*100:.1f}%")
    print(f"   ⚠️ Max gap: {dp['max_gap']*100:.1f} percentage points (lower = fairer)")

    # 2️⃣ Top-K Selection Rate Parity
    top_k = min(50, len(df_attr))
    srp = selection_rate_parity_topk(
        df_attr,
        group_col="group",
        score_col="y_proba",
        k=top_k,
    )
    print(f"\n2️⃣ Top-{top_k} Selection Rate Parity:")
    for g, r in srp["per_group"].items():
        print(f"   {g}: {r*100:.1f}% in Top-{top_k}")
    print(f"   Min/Max ratio: {srp['min_over_max']:.3f}")
    if srp["min_over_max"] < 0.8:
        print("    FAIL (4/5 rule: < 0.80)")
    else:
        print("    PASS (4/5 rule: ≥ 0.80)")

    # 3️⃣ Equal Opportunity – TPR parity across groups
    eo = equal_opportunity_diff(
        df_attr,
        group_col="group",
        y_true_col="y_true",
        score_col="y_proba",
        positive_label=1,
        threshold=0.5,
    )
    print("\n3️⃣ Equal Opportunity (TPR among true Hires):")
    for g, tpr in eo["per_group_tpr"].items():
        if np.isnan(tpr):
            print(f"   {g}: N/A (no true positives)")
        else:
            print(f"   {g}: {tpr*100:.1f}%")
    print(f"    Max TPR gap: {eo['max_tpr_gap']*100:.1f} percentage points")

    # 4️⃣ Rank Ordering Bias
    rob = rank_ordering_bias(
        df_attr,
        group_col="group",
        score_col="y_proba",
    )
    print("\n4️⃣ Rank Ordering Bias (lower avg rank = appears earlier in shortlist):")
    for g, r in rob["per_group_avg_rank"].items():
        print(f"   {g}: average rank {r:.1f}")
    print(f"    Max rank gap: {rob['max_rank_gap']:.1f} positions")

    # Optional score overlap if exactly 2 groups kept
    kept_groups_list = list(kept_groups)
    if len(kept_groups_list) == 2:
        sdo = score_distribution_overlap(
            df_attr,
            group_a=kept_groups_list[0],
            group_b=kept_groups_list[1],
            group_col="group",
            score_col="y_proba",
            bins=20,
        )
    else:
        sdo = None

    fairness_summary_A[attr] = {
        "demographic_parity_max_gap": float(dp["max_gap"]),
        "topk_min_over_max": float(srp["min_over_max"]),
        "equal_opportunity_max_gap": float(eo["max_tpr_gap"]) if not np.isnan(eo["max_tpr_gap"]) else None,
        "rank_ordering_max_gap": float(rob["max_rank_gap"]),
        "score_distribution_overlap": float(sdo) if sdo is not None else None,
    }

print("\n======================================================================")
print("===  FAIRNESS SUMMARY FOR MODEL A (COMPACT) ===")
print("======================================================================")
for attr, metrics_dict in fairness_summary_A.items():
    print(f"\n🔍 {attr}:")
    for k, v in metrics_dict.items():
        if v is not None:
            print(f"   {k}: {v:.4f}")
        else:
            print(f"   {k}: N/A")

# Save summary to JSON
with open("modelA_fairness_metrics.json", "w") as f:
    json.dump(fairness_summary_A, f, indent=2)
print("\n✓ Saved: modelA_fairness_metrics.json")

# Update main metrics file
try:
    with open("modelA_metrics.json", "r") as f:
        modelA_metrics = json.load(f)
except FileNotFoundError:
    modelA_metrics = {}

modelA_metrics["fairness"] = fairness_summary_A

with open("modelA_metrics.json", "w") as f:
    json.dump(modelA_metrics, f, indent=2)

print("✓ Updated: modelA_metrics.json")

print("\n======================================================================")
print(" FAIRNESS ANALYSIS COMPLETE!")
print("======================================================================")


 FAIRNESS ANALYSIS FOR MODEL A


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.695,0.4974,0.695,0.698,0.6965,0.019,0.019


✓ Fairness Analysis Setup:
  - Test samples: 200
  - Ground truth (Hire=1): 162 (81.0%)
  - Model predicts (Hire=1): 161 (80.5%)
  - Model accuracy: 69.5%

✓ Fairness DataFrame:
  - Shape: (200, 7)
  - Demographics: ['Gender', 'Race', 'Age_Group', 'Disability_Status']

📊 Fairness Analysis by: Gender

Group distribution (kept groups):
group
Male      107
Female     93
Name: count, dtype: int64

1️⃣ Demographic Parity (Hire Rate by Group):
   Female: 83.9%
   Male: 77.6%
   ⚠️ Max gap: 6.3 percentage points (lower = fairer)

2️⃣ Top-50 Selection Rate Parity:
   Female: 24.7% in Top-50
   Male: 25.2% in Top-50
   Min/Max ratio: 0.980
    PASS (4/5 rule: ≥ 0.80)

3️⃣ Equal Opportunity (TPR among true Hires):
   Female: 100.0%
   Male: 100.0%
    Max TPR gap: 0.0 percentage points

4️⃣ Rank Ordering Bias (lower avg rank = appears earlier in shortlist):
   Female: average rank 102.7
   Male: average rank 98.6
    Max rank gap: 4.2 positions

📊 Fairness Analysi

Why Race Shows Bias:
Looking at the numbers:

- Hispanic candidates: 66.7% hire rate, only 18.2% in the top 50, average rank 109.1
- Asian candidates: 81.8% hire rate, 31.8% in the top 50, average rank 94.4
- White candidates: 83.8% hire rate, 27.0% in the top 50, average rank 94.5

These numbers suggest that the model systematically ranks Hispanic candidates lower: their hire rate, top-50 share, and average rank are all worse than those of Asian and White candidates. Even when Hispanic candidates should be hired (ground truth), they appear lower in rankings. **This could violate equal opportunity laws (disparate impact).**
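
The disparate-impact concern can be made concrete with the four-fifths (4/5) rule that the analysis cell applies to top-K selection. The sketch below applies the same min/max ratio to the Race figures quoted above. The helper name `impact_ratio` is hypothetical, and applying the rule to the hire rate as well as the top-50 rate is an illustrative choice.

```python
# Rates quoted in the Race analysis above, expressed as fractions.
hire_rate = {"Hispanic": 0.667, "Asian": 0.818, "White": 0.838}
top50_rate = {"Hispanic": 0.182, "Asian": 0.318, "White": 0.270}

def impact_ratio(rates):
    """Lowest group rate divided by the highest group rate."""
    return min(rates.values()) / max(rates.values())

for name, rates in [("hire rate", hire_rate), ("top-50 rate", top50_rate)]:
    ratio = impact_ratio(rates)
    verdict = "PASS" if ratio >= 0.8 else "FAIL"
    print(f"{name}: min/max ratio = {ratio:.3f} -> {verdict} (4/5 rule)")
```

A ratio below 0.80 is the conventional flag for potential disparate impact; it marks a result for closer review and is not by itself a legal finding.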