# The Hired Hand

**Machine Learning for Job Placement Prediction**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Angry-Jay/ML_TheHiredHand/blob/main/ml-the-hired-hand.ipynb)

---

## Table of Contents

1. [Project & Dataset Description](#1-project--dataset-description)
   - [1.1 Project Aim](#11-project-aim)
   - [1.2 Existing Solutions](#12-existing-solutions)
   - [1.3 Dataset Information](#13-dataset-information)
2. [Library Imports](#2-library-imports)
3. [Data Access](#3-data-access)
4. [Dataset Exploratory Analysis](#4-dataset-exploratory-analysis)
   - [4.1 Metadata Analysis](#41-metadata-analysis)
   - [4.2 Missing Values Analysis](#42-missing-values-analysis)
   - [4.3 Feature Distributions, Scaling & Outliers](#43-feature-distributions-scaling--outliers)
   - [4.4 Target Feature Study](#44-target-feature-study)
   - [4.5 Feature Correlation & Selection](#45-feature-correlation--selection)
   - [4.6 Unsupervised Clustering](#46-unsupervised-clustering)
   - [4.7 Interpretations & Conclusions](#47-interpretations--conclusions)
5. [ML Baseline & Ensemble Models](#5-ml-baseline--ensemble-models)
   - [5.1 Train/Validation/Test Splits](#51-trainvalidationtest-splits)
   - [5.2 Pipelines & Models](#52-pipelines--models)
   - [5.3 Training & Validation](#53-training--validation)
   - [5.4 Testing](#54-testing)
   - [5.5 Results Interpretation & Discussion](#55-results-interpretation--discussion)
6. [Enhanced Models & Hyperparameter Tuning](#6-enhanced-models--hyperparameter-tuning)
   - [6.1 Justification of Choices](#61-justification-of-choices)
   - [6.2 Hyperparameter Optimization](#62-hyperparameter-optimization)
   - [6.3 Final Results & Analysis](#63-final-results--analysis)
7. [Conclusion & Future Work](#7-conclusion--future-work)

---

## 1. Project & Dataset Description

### 1.1 Project Aim

This project applies Machine Learning techniques to predict employment outcomes for graduating students using the **Job Placement Dataset**. 

**Primary Objectives:**
- **Predict employment outcomes** (Placed vs. Not Placed) based on demographic, academic, and professional attributes
- **Demonstrate a coherent ML methodology** from data discovery through model optimization
- **Apply comprehensive data analysis** including:
  - Data cleaning and preprocessing
  - Exploratory Data Analysis (EDA)
  - Feature engineering and selection
  - Correlation and clustering analysis
- **Build and evaluate multiple classification models** with proper validation techniques
- **Identify key employability factors** through feature importance analysis and model interpretation
- **Apply ML best practices** including proper train/validation/test splits, pipeline construction, and hyperparameter tuning

---

### 1.2 Existing Solutions

**Traditional Approach:**

Historically, HR departments and educational institutions have relied on manual screening with heuristic filters (e.g., GPA cutoffs, specific degree specializations, work experience thresholds). This approach has several limitations:
- Time-consuming and difficult to scale
- Subjective and prone to human bias
- Often inaccurate in predicting actual job placement success
- Fails to capture complex interactions between multiple factors

**Machine Learning Solutions:**

Several ML-based approaches exist on platforms like Kaggle and GitHub for placement prediction:

**Common Algorithms Used:**
- **Baseline Models:** Logistic Regression, K-Nearest Neighbors (KNN)
- **Tree-based Models:** Decision Trees, Random Forest, ExtraTrees
- **Boosting Methods:** XGBoost, AdaBoost, Gradient Boosting
- **Support Vector Machines:** SVC with various kernels

**Key Findings from Literature:**
- Tree-based ensemble methods (Random Forest, XGBoost) typically outperform simpler baselines
- Non-linear models better capture feature interactions (e.g., combined effect of GPA and work experience)
- Feature engineering significantly impacts model performance
- Proper handling of class imbalance is crucial for accurate predictions

**Typical Methodology:**
1. Exploratory Data Analysis (distributions, correlations, class imbalance)
2. Preprocessing pipelines (encoding categorical variables, scaling, imputation)
3. Model comparison using multiple metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC
4. Hyperparameter tuning using GridSearchCV or RandomizedSearchCV
5. Feature importance analysis for interpretability
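
The steps above can be sketched end to end in a few lines. This is a minimal illustration on synthetic data, not the pipeline used later in this notebook: the column names (`score`, `hours`, `track`), the toy target, and the small `C` grid are all hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, not the project dataset)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "score": rng.normal(60, 10, n),
    "hours": rng.normal(40, 15, n),
    "track": rng.choice(["A", "B", "C"], n),
})
toy_y = (toy["score"] + rng.normal(0, 5, n) > 60).astype(int)

# Step 2: preprocessing (imputation, scaling, one-hot encoding) per column type
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["score", "hours"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["track"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Steps 3-4: stratified hold-out split, then a grid search scored by ROC-AUC
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)
search = GridSearchCV(model, {"clf__C": [0.1, 1.0, 10.0]}, scoring="roc_auc", cv=5)
search.fit(X_tr, y_tr)

test_auc = search.score(X_te, y_te)  # ROC-AUC on the held-out split
print(search.best_params_, round(test_auc, 3))
```

Keeping the preprocessing inside the `Pipeline` means imputation and scaling statistics are re-fit on each cross-validation fold, which avoids leaking validation data into the transformers.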

---

### 1.3 Dataset Information

**Dataset Name:** Job Placement Dataset

**Original Source:** [Kaggle - Job Placement Dataset](https://www.kaggle.com/datasets/ahsan81/job-placement-dataset/data)

**Dataset Characteristics:**
- **Type:** Dense, structured tabular data
- **Size:** Small-to-medium (215 instances, 13 features)
- **Features:** Mix of numeric and categorical variables
- **Target Variable:** Binary classification (Placed / Not Placed)
- **Quality:** Clean with no missing values or duplicates

**Dataset Access:**
- **GitHub Repository:** `https://github.com/Angry-Jay/ML_TheHiredHand`
- **Raw Data URL:** `https://raw.githubusercontent.com/Angry-Jay/ML_TheHiredHand/main/Job_Placement_Data.csv`

**Features Overview:**
- Student demographics (gender)
- Academic performance (SSC %, HSC %, Degree %, MBA %)
- Educational background (SSC board, HSC board, HSC specialization, Degree type, MBA specialization)
- Work experience
- Employment test scores

## 2. Library Imports

In [None]:
# Setting up
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing & Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Model Selection & Tuning
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    train_test_split,
)

# Models
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Metrics
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Configuration
%matplotlib inline

## 3. Data Access

In [None]:
DATA_URL = "https://raw.githubusercontent.com/Angry-Jay/ML_TheHiredHand/refs/heads/main/aug_train.csv"

try:
    df = pd.read_csv(DATA_URL)
except Exception as e:
    print(f"Error loading data from {DATA_URL}")
    print(f"Error details: {e}")
    raise  # Stop here: every later cell depends on df
else:
    print("Dataset loaded successfully!")
    print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")

    display(df.head())

## 4. Dataset Exploratory Analysis

### 4.1 Metadata Analysis

In this section, we analyze the dataset's metadata to understand its structure, data types, quality, and characteristics. This initial exploration helps identify:

- **Dataset dimensions** and scale
- **Feature data types** (numerical vs. categorical)
- **Data quality issues** (duplicates, missing values, irrelevant columns)
- **Statistical properties** of numerical features
- **Potential data leakage** concerns

In [None]:
# Display dataset info
df.info()

In [None]:
print("=" * 60)
print("DUPLICATE ANALYSIS")
print("=" * 60)
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
if duplicates > 0:
    print("\nDuplicate rows:")
    display(df[df.duplicated(keep=False)])
else:
    print("No duplicate rows found.")


In [None]:
print("=" * 60)
print("FEATURE TYPE SEPARATION")
print("=" * 60)

numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"\nNumerical features ({len(numerical_cols)}):")
print(numerical_cols)

print(f"\nCategorical features ({len(categorical_cols)}):")
print(categorical_cols)


In [None]:
print("=" * 60)
print("NUMERICAL FEATURES - STATISTICAL SUMMARY")
print("=" * 60)
display(df[numerical_cols].describe())

In [None]:
print("=" * 60)
print("CATEGORICAL FEATURES - UNIQUE VALUES")
print("=" * 60)

for col in categorical_cols:
    print(f"\n{col}:")
    print(f"  Unique values: {df[col].nunique()}")
    print(f"  Values: {df[col].unique().tolist()}")

In [None]:
# Data Leakage Assessment and Target Variable Identification
print("=" * 60)
print("TARGET VARIABLE & DATA LEAKAGE ASSESSMENT")
print("=" * 60)

# Identify the target variable
target_col = 'target'
print(f"\nTarget variable: '{target_col}'")
print(f"Classes: {df[target_col].unique().tolist()}")
print(f"\nClass distribution:")
print(df[target_col].value_counts())
print(f"\nClass proportions:")
print(df[target_col].value_counts(normalize=True).round(3))

# Verify feature composition
print(f"\n--- Feature Inventory ---")
print(f"Total features: {len(df.columns)}")
print(f"  - Predictors: {len(df.columns) - 1}")
print(f"  - Target: 1 ('{target_col}')")

# Check for post-placement features that could leak information
print(f"\n--- Data Leakage Check ---")
suspicious_keywords = ['salary', 'offer', 'package', 'compensation', 'hired']
leakage_found = False

for keyword in suspicious_keywords:
    if any(keyword in col.lower() for col in df.columns):
        print(f"WARNING: Potential leakage feature containing '{keyword}' detected")
        leakage_found = True

if not leakage_found:
    print("No obvious data leakage features detected.")
    print("All features represent information available at prediction time.")

#### Summary

The initial metadata analysis reveals a **substantially larger dataset** than typical placement studies, with **19,158 instances** across **14 columns** (13 non-target columns, including the `enrollee_id` identifier, and 1 target). The dataset contains **no duplicate records**. However, **missing values are present** in several features, with the largest gaps in `company_type` (6,140 missing, 32.1%), `company_size` (5,938 missing, 31.0%), `gender` (4,508 missing, 23.5%), and `major_discipline` (2,813 missing, 14.7%). These gaps call for an explicit missing-value strategy during preprocessing.

The feature composition consists of **2 numerical predictors** (`city_development_index` and `training_hours`) and **10 categorical predictors** representing demographics, education, and employment history. Additionally, `enrollee_id` serves as a unique identifier and must be excluded from modeling, while `target` is the binary outcome variable. Categorical features exhibit **varying cardinality**: low cardinality for features like `relevent_experience` (2 values) and `gender` (3 observed values, plus missing entries), moderate cardinality for features like `education_level` (5 levels) and `major_discipline` (6 disciplines), and **high cardinality** for `city` (123 unique cities) and `experience` (22 levels), which may require specialized encoding techniques such as target encoding or frequency encoding.

The target variable exhibits **significant class imbalance**, with **75.1% of candidates not looking for job change** (class 0) and only **24.9% actively seeking change** (class 1), yielding an imbalance ratio of **3.01:1**. This substantial imbalance must be addressed during model training through techniques such as class weighting, resampling (SMOTE/undersampling), or using evaluation metrics robust to imbalance (F1-score, ROC-AUC, precision-recall curves). **No data leakage concerns** were identified; all features represent information collected during training enrollment, ensuring model validity for predicting actual job change intentions.
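
To make the class-weighting option concrete, the sketch below computes scikit-learn's "balanced" weights from the class counts reported in this notebook's target analysis (14,381 for class 0 and 4,777 for class 1). It only illustrates the arithmetic; whether these weights are the right choice for the final models is a separate question.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the notebook's target analysis
y_counts = {0: 14_381, 1: 4_777}
y = np.repeat(list(y_counts.keys()), list(y_counts.values()))

# "balanced" weights follow n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = {cls: round(float(w), 3) for cls, w in zip([0, 1], weights)}
print(class_weight)
```

The minority class receives roughly three times the weight of the majority class, mirroring the 3.01:1 imbalance ratio. A dictionary like this can be passed as the `class_weight` argument of estimators such as `LogisticRegression` or `RandomForestClassifier`.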

### 4.2 Missing Values Analysis

In [None]:
print("=" * 60)
print("MISSING VALUES ANALYSIS")
print("=" * 60)

# Missing values count
missing_counts = df.isnull().sum()
print("\nMissing values per feature:")
print(missing_counts)

# Missing values percentage
print("\n" + "=" * 60)
print("MISSING VALUES PERCENTAGE")
print("=" * 60)
missing_percentages = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_percentages)

# Summary statistics
total_missing = df.isnull().sum().sum()
total_cells = df.shape[0] * df.shape[1]
print("\n" + "=" * 60)
print("SUMMARY")
print("=" * 60)
print(f"Total missing values: {total_missing}")
print(f"Total cells: {total_cells}")
print(f"Overall missingness: {(total_missing / total_cells * 100):.2f}%")
print(f"Features with missing values: {(missing_counts > 0).sum()} out of {len(df.columns)}")
print(f"Complete features: {(missing_counts == 0).sum()} out of {len(df.columns)}")

In [None]:
# Visualize missing values
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot of missing values
missing_data = df.isnull().sum()
missing_data = missing_data[missing_data > 0].sort_values(ascending=False)

if len(missing_data) > 0:
    axes[0].bar(range(len(missing_data)), missing_data.values, edgecolor='black', alpha=0.7, color='orange')
    axes[0].set_xticks(range(len(missing_data)))
    axes[0].set_xticklabels(missing_data.index, rotation=45, ha='right')
    axes[0].set_title('Missing Values Count by Feature', fontsize=12, fontweight='bold')
    axes[0].set_xlabel('Features')
    axes[0].set_ylabel('Number of Missing Values')
    axes[0].grid(axis='y', alpha=0.3)
    
    # Add count labels
    for i, v in enumerate(missing_data.values):
        axes[0].text(i, v + 100, str(v), ha='center', va='bottom')
else:
    axes[0].text(0.5, 0.5, 'No Missing Values', ha='center', va='center', fontsize=14)
    axes[0].set_xlim(0, 1)
    axes[0].set_ylim(0, 1)

# Percentage plot
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

if len(missing_pct) > 0:
    axes[1].bar(range(len(missing_pct)), missing_pct.values, edgecolor='black', alpha=0.7, color='red')
    axes[1].set_xticks(range(len(missing_pct)))
    axes[1].set_xticklabels(missing_pct.index, rotation=45, ha='right')
    axes[1].set_title('Missing Values Percentage by Feature', fontsize=12, fontweight='bold')
    axes[1].set_xlabel('Features')
    axes[1].set_ylabel('Percentage (%)')
    axes[1].grid(axis='y', alpha=0.3)
    
    # Add percentage labels
    for i, v in enumerate(missing_pct.values):
        axes[1].text(i, v + 0.5, f'{v}%', ha='center', va='bottom')
else:
    axes[1].text(0.5, 0.5, 'No Missing Values', ha='center', va='center', fontsize=14)
    axes[1].set_xlim(0, 1)
    axes[1].set_ylim(0, 1)

plt.tight_layout()
plt.show()

In [None]:
# Missing value pattern analysis
print("=" * 60)
print("MISSING VALUE PATTERNS")
print("=" * 60)

# Features with missing values
features_with_missing = df.columns[df.isnull().any()].tolist()

if len(features_with_missing) > 0:
    print(f"\nFeatures with missing values ({len(features_with_missing)}): {features_with_missing}")
    
    # Check co-occurrence of missing values
    print("\n" + "=" * 60)
    print("CO-OCCURRENCE OF MISSING VALUES")
    print("=" * 60)
    
    # Check common combinations
    if 'company_size' in features_with_missing and 'company_type' in features_with_missing:
        both_missing = df['company_size'].isnull() & df['company_type'].isnull()
        print(f"\ncompany_size AND company_type both missing: {both_missing.sum()} ({both_missing.sum() / len(df) * 100:.2f}%)")
    
    if 'company_size' in features_with_missing and 'company_type' in features_with_missing and 'experience' in features_with_missing:
        all_three = df['company_size'].isnull() & df['company_type'].isnull() & df['experience'].isnull()
        print(f"company_size AND company_type AND experience all missing: {all_three.sum()} ({all_three.sum() / len(df) * 100:.2f}%)")
    
    # Distribution of missing counts per row
    print("\n" + "=" * 60)
    print("MISSING VALUES PER ROW DISTRIBUTION")
    print("=" * 60)
    missing_per_row = df.isnull().sum(axis=1)
    print(f"\nMissing values distribution:")
    print(missing_per_row.value_counts().sort_index())
    
    # Rows with any missing value
    rows_with_missing = df.isnull().any(axis=1).sum()
    print(f"\n--- Summary ---")
    print(f"Total rows with at least one missing value: {rows_with_missing} ({rows_with_missing / len(df) * 100:.2f}%)")
    
    # Rows with all values complete
    complete_rows = (~df.isnull().any(axis=1)).sum()
    print(f"Complete rows (no missing values): {complete_rows} ({complete_rows / len(df) * 100:.2f}%)")
    
    # Most common missing value patterns
    print("\n" + "=" * 60)
    print("TOP 5 MISSING VALUE PATTERNS")
    print("=" * 60)
    missing_patterns = df[features_with_missing].isnull().astype(int)
    pattern_counts = missing_patterns.groupby(features_with_missing).size().sort_values(ascending=False).head(5)
    
    for idx, (pattern, count) in enumerate(pattern_counts.items(), 1):
        missing_features = [feat for feat, is_missing in zip(features_with_missing, pattern) if is_missing == 1]
        if missing_features:
            print(f"\n{idx}. Missing: {missing_features}")
            print(f"   Count: {count} ({count / len(df) * 100:.2f}%)")
        else:
            print(f"\n{idx}. No missing values")
            print(f"   Count: {count} ({count / len(df) * 100:.2f}%)")
            
else:
    print("\nNo missing values detected in the dataset.")

#### Summary

The missing values analysis reveals **significant data incompleteness** affecting **53.26% of all rows** (10,203 instances), while only **46.74%** (8,955 instances) are complete. Out of **14 features**, **8 contain missing values** with a **hierarchical pattern**: **company_type** (6,140 missing, 32.05%) and **company_size** (5,938 missing, 30.99%) dominate, followed by **gender** (4,508 missing, 23.53%) and **major_discipline** (2,813 missing, 14.68%). Lower-level missingness appears in **education_level** (460 missing, 2.40%), **last_new_job** (423 missing, 2.21%), **enrolled_university** (386 missing, 2.01%), and **experience** (65 missing, 0.34%). The overall dataset missingness is **7.73%** of total cells.

**Pattern analysis** reveals **systematic co-occurrence** of missing values, which suggests the missingness is not completely at random (the data alone cannot confirm a specific mechanism such as MNAR). The top missing value patterns are: **(1) Complete rows with no missing values: 46.74%**; **(2) company_size AND company_type missing, with no other gaps: 14.50%** (2,777 rows), plausibly unemployed candidates or students; **(3) Only gender missing: 11.61%** (2,224 rows); **(4) major_discipline, company_size, AND company_type missing: 4.42%** (847 rows); **(5) gender, company_size, AND company_type missing: 4.36%** (835 rows). Across all patterns, **5,360 rows (27.98%)** have both employment features missing together, while only **20 rows (0.10%)** have all three employment-related features (company_size, company_type, experience) missing simultaneously. The missing values per row distribution shows most affected rows have **1-3 missing features**, with decreasing frequency for higher counts (628 rows with 4 missing, 176 with 5, 62 with 6, and only 12 with 7).

**Preprocessing strategy:**

- **company_size and company_type:** we will **create an explicit "Not Employed" category** rather than impute values. The 27.98% co-occurrence pattern most plausibly represents candidates without current employment (students/freshers) for whom these fields are not applicable; imputation would introduce false information and obscure this employment-status signal.
- **Low-missingness features** (education_level, enrolled_university, last_new_job, experience, all <3%): we will apply **mode imputation**, since their sporadic missingness suggests random data collection gaps rather than systematic patterns, and their low prevalence limits the impact on model validity.
- **gender and major_discipline:** we will **create "Unknown" categories**, because their substantial independent missingness (11.61% for gender alone, 14.68% for major_discipline) may indicate reluctance to provide data or privacy concerns rather than inapplicability, and this "not provided" signal may itself be predictive of job change behavior.
- **Missingness indicators:** we will **engineer binary indicators** (`has_employment_info`, `gender_provided`, `education_complete`), as the 53.26% of incomplete rows may exhibit distinct job-seeking behaviors that these indicators could capture.
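
The strategy above could be sketched roughly as follows. The function name `apply_missing_value_strategy` and the exact definitions of the three indicators are assumptions made for illustration (the text names the indicators but does not define them), and the toy frame is hypothetical. In a real workflow the modes should be learned from the training split only.

```python
import numpy as np
import pandas as pd

def apply_missing_value_strategy(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill missing values per the strategy described above."""
    out = frame.copy()

    # Build indicators first, while the original gaps are still visible.
    # These definitions are assumptions, not taken from the notebook.
    out["has_employment_info"] = (
        out["company_size"].notna() | out["company_type"].notna()
    ).astype(int)
    out["gender_provided"] = out["gender"].notna().astype(int)
    out["education_complete"] = (
        out[["education_level", "major_discipline"]].notna().all(axis=1).astype(int)
    )

    # Explicit categories instead of imputed values
    for col in ["company_size", "company_type"]:
        out[col] = out[col].fillna("Not Employed")
    for col in ["gender", "major_discipline"]:
        out[col] = out[col].fillna("Unknown")

    # Mode imputation for the low-missingness features
    for col in ["education_level", "enrolled_university", "last_new_job", "experience"]:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Tiny hypothetical frame using the dataset's column names
toy = pd.DataFrame({
    "company_size": ["50-99", np.nan, "<10"],
    "company_type": ["Pvt Ltd", np.nan, np.nan],
    "gender": ["Male", np.nan, "Female"],
    "major_discipline": ["STEM", "STEM", np.nan],
    "education_level": ["Graduate", np.nan, "Graduate"],
    "enrolled_university": ["no_enrollment", "no_enrollment", np.nan],
    "last_new_job": ["1", np.nan, "1"],
    "experience": [">20", ">20", np.nan],
})
cleaned = apply_missing_value_strategy(toy)
print(cleaned)
```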

### 4.3 Feature Distributions, Scaling & Outliers

In [None]:
# Visualize distributions of numerical features (excluding enrollee_id and target)
numerical_features_for_viz = [col for col in numerical_cols if col not in ['enrollee_id', 'target']]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for idx, col in enumerate(numerical_features_for_viz):
    axes[idx].hist(df[col], bins=30, edgecolor='black', alpha=0.7, color='steelblue')
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Detect outliers using boxplots (excluding enrollee_id and target)
numerical_features_for_viz = [col for col in numerical_cols if col not in ['enrollee_id', 'target']]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for idx, col in enumerate(numerical_features_for_viz):
    axes[idx].boxplot(df[col].dropna(), vert=True)
    axes[idx].set_title(f'{col}', fontsize=11, fontweight='bold')
    axes[idx].set_ylabel('Value')
    axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Quantitative outlier detection using IQR method (excluding enrollee_id and target)
print("=" * 60)
print("OUTLIER DETECTION (IQR METHOD)")
print("=" * 60)

numerical_features_for_analysis = [col for col in numerical_cols if col not in ['enrollee_id', 'target']]

for col in numerical_features_for_analysis:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    
    print(f"\n{col}:")
    print(f"  Q1: {Q1:.2f}, Q3: {Q3:.2f}, IQR: {IQR:.2f}")
    print(f"  Bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
    print(f"  Outliers detected: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")
    
    if len(outliers) > 0:
        print(f"  Min outlier value: {outliers[col].min():.2f}")
        print(f"  Max outlier value: {outliers[col].max():.2f}")

In [None]:
# Categorical features distribution (visualize top categories for high-cardinality features)
categorical_features = categorical_cols.copy()

fig, axes = plt.subplots(3, 4, figsize=(18, 12))
axes = axes.ravel()

for idx, col in enumerate(categorical_features[:10]):
    # For high-cardinality features, show only top 10
    value_counts = df[col].value_counts()
    
    if len(value_counts) > 10:
        value_counts = value_counts.head(10)
        title_suffix = " (Top 10)"
    else:
        title_suffix = ""
    
    axes[idx].bar(range(len(value_counts)), value_counts.values, edgecolor='black', alpha=0.7)
    axes[idx].set_xticks(range(len(value_counts)))
    axes[idx].set_xticklabels(value_counts.index, rotation=45, ha='right', fontsize=8)
    axes[idx].set_title(f'{col}{title_suffix}', fontsize=11, fontweight='bold')
    axes[idx].set_xlabel(col, fontsize=9)
    axes[idx].set_ylabel('Count', fontsize=9)
    axes[idx].grid(axis='y', alpha=0.3)
    
    # Add count labels on bars
    for i, v in enumerate(value_counts.values):
        axes[idx].text(i, v + max(value_counts.values)*0.01, str(v), ha='center', va='bottom', fontsize=8)

# Remove empty subplots
for i in range(len(categorical_features[:10]), 12):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()

# Print summary statistics for all categorical features
print("=" * 60)
print("CATEGORICAL FEATURES - CARDINALITY SUMMARY")
print("=" * 60)
for col in categorical_cols:
    unique_count = df[col].nunique()
    print(f"{col}: {unique_count} unique values")

#### Summary

**Numerical Feature Distributions:** The dataset contains only **two meaningful numerical predictors**. The `city_development_index` histogram reveals a **strongly left-skewed distribution** with a massive concentration at the highest development indices (0.9-0.95 range shows ~7,000+ candidates), indicating most training participants come from highly developed urban centers. The boxplot shows only a few lower outliers around 0.448-0.47. In contrast, `training_hours` displays a **right-skewed distribution** with the mode at the lowest bins (0-50 hours showing ~2,700 candidates), followed by progressively decreasing frequencies toward higher training hours. The boxplot shows **extensive upper outliers** (984 instances, 5.1%) stretching from ~185 to 336 hours, well above the third quartile of 88 hours.

**Outlier Analysis:** We will **retain all outliers** rather than remove them. The `city_development_index` outliers (17 instances, 0.1%) represent candidates from less-developed cities, a legitimate and potentially informative minority segment whose job-seeking behavior may differ from the urban majority. The `training_hours` outliers (984 instances, 5.1%) are particularly valuable: candidates investing 185-336 hours in training show a level of commitment that could predict job change intentions, indicating either active upskilling for a career transition or employer-mandated training for current roles. Removing these 5% of candidates would discard a behaviorally distinct cohort whose outlier status may itself carry predictive signal. We treat these outliers as genuine behavioral variance rather than measurement errors.
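
One way to retain outliers while still exposing them to a model is to flag them instead of dropping them. The sketch below applies the same 1.5 × IQR rule used above to a small made-up series; the helper name `iqr_fences` and the `training_hours_outlier` column are hypothetical.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Made-up values for illustration only
hours = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 300], name="training_hours")
lower, upper = iqr_fences(hours)

# Keep every row; record outlier status as an extra binary feature
is_outlier = ~hours.between(lower, upper)
flagged = hours.to_frame().assign(training_hours_outlier=is_outlier.astype(int))
print(lower, upper)
print(flagged)
```

Here Q1 = 32.5 and Q3 = 77.5, so the fences are -35.0 and 145.0, and only the value 300 is flagged.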

**Categorical Feature Distributions:** The visualizations reveal **pronounced imbalances across all features**:

- **City** (top 10 of 123 shown) concentrates heavily in city_103 (4,355 candidates); other cities have far lower representation.
- **Gender** shows **strong male dominance** (13,221 males vs 1,238 females vs 191 other), a 10.7:1 male-to-female imbalance.
- **Relevant experience** is heavily skewed toward "Has relevant experience" (13,792 vs 5,366).
- **Enrolled university** shows most candidates are "no_enrollment" (13,817 vs 3,757 full-time vs 1,198 part-time).
- **Education level** is dominated by **Graduates** (11,598) followed by Masters (4,361), with minimal representation for High School, PhD, and Primary School.
- **Major discipline** overwhelmingly favors **STEM** (14,492), dwarfing Business Degree, Other, Humanities, Arts, and No Major.
- **Experience** (top 10 of 22 levels shown) is broadly distributed, with ">20" years leading (3,286) and relatively balanced counts across the 5-20 year range.
- **Company size** peaks at "50-99" (3,083), with lower counts in the other size bands.
- **Company type** is dominated by **"Pvt Ltd"** (9,817), far exceeding Funded Startup, Public Sector, Early Stage Startup, NGO, and Other.
- **Last new job** shows "1" year as most common (8,040), declining for longer gaps.

These imbalances will require careful encoding strategies: target/frequency encoding for high-cardinality features (city, experience), one-hot encoding for low-cardinality features, and attention to class imbalance during model training to prevent bias toward majority classes.
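
As a rough illustration of frequency encoding for a high-cardinality feature such as `city`, the sketch below maps each category to its share of the training rows. The helper `frequency_encode`, the toy values, and the choice to encode unseen categories as 0.0 are assumptions for this example, not decisions taken elsewhere in the notebook.

```python
import pandas as pd

def frequency_encode(train_col: pd.Series, other_col: pd.Series) -> pd.Series:
    """Map categories to their relative frequency in the training column."""
    freq = train_col.value_counts(normalize=True)
    # Categories never seen in training get frequency 0.0 (an assumption)
    return other_col.map(freq).fillna(0.0)

# Hypothetical values for illustration
train_city = pd.Series(["city_103", "city_103", "city_103", "city_21"])
new_city = pd.Series(["city_103", "city_21", "city_999"])

encoded = frequency_encode(train_city, new_city)
print(encoded.tolist())
```

Learning the frequencies from the training column only, then applying them to validation or test data, keeps the encoding free of information from held-out rows.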

### 4.4 Target Feature Study

In [None]:
# Target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Count plot
target_counts = df['target'].value_counts()
axes[0].bar(target_counts.index, target_counts.values, edgecolor='black', alpha=0.7, color=['red', 'green'])
axes[0].set_title('Target Distribution (Count)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Target (0=Not Looking, 1=Looking for Job Change)')
axes[0].set_ylabel('Count')
for i, v in enumerate(target_counts.values):
    axes[0].text(i, v + 100, str(v), ha='center', va='bottom', fontweight='bold')

# Pie chart
axes[1].pie(target_counts.values, labels=['Not Looking (0)', 'Looking (1)'], autopct='%1.1f%%', 
            startangle=90, colors=['red', 'green'], explode=(0.05, 0))
axes[1].set_title('Target Distribution (Proportion)', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

print("=" * 60)
print("TARGET VARIABLE ANALYSIS")
print("=" * 60)
print(f"\nClass distribution:")
print(target_counts)
print(f"\nClass proportions:")
print(df['target'].value_counts(normalize=True).round(3))
print(f"\nClass imbalance ratio: {target_counts.max() / target_counts.min():.2f}:1")

In [None]:
# Numerical features comparison by target class
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes = axes.ravel()

# Exclude enrollee_id from comparison (it's just an identifier)
numerical_features_for_comparison = [col for col in numerical_cols if col not in ['enrollee_id', 'target']]

for idx, col in enumerate(numerical_features_for_comparison):
    looking = df[df['target'] == 1.0][col]
    not_looking = df[df['target'] == 0.0][col]
    
    axes[idx].hist([not_looking, looking], bins=15, label=['Not Looking (0)', 'Looking (1)'], 
                   edgecolor='black', alpha=0.7, color=['red', 'green'])
    axes[idx].set_title(f'{col} by Target', fontsize=11, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')
    axes[idx].legend()
    axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Statistical comparison of numerical features by target class
print("=" * 60)
print("NUMERICAL FEATURES - MEAN COMPARISON BY TARGET")
print("=" * 60)

numerical_features_for_comparison = [col for col in numerical_cols if col not in ['enrollee_id', 'target']]
comparison = df.groupby('target')[numerical_features_for_comparison].mean()
comparison.index = ['Not Looking (0)', 'Looking (1)']
print("\nMean values by target class:")
print(comparison.round(2))

print("\n" + "=" * 60)
print("DIFFERENCE (Looking - Not Looking)")
print("=" * 60)
difference = comparison.loc['Looking (1)'] - comparison.loc['Not Looking (0)']
print(difference.round(2))

# Visualize mean comparison
comparison.T.plot(kind='bar', figsize=(10, 5), edgecolor='black', alpha=0.7, color=['red', 'green'])
plt.title('Mean Comparison of Numerical Features by Target', fontsize=13, fontweight='bold')
plt.xlabel('Features')
plt.ylabel('Mean Value')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Target')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Categorical features vs target - job change rates
categorical_features_for_analysis = [col for col in categorical_cols]

# Limit to top categories for high-cardinality features like 'city'
fig, axes = plt.subplots(3, 4, figsize=(18, 12))
axes = axes.ravel()

for idx, col in enumerate(categorical_features_for_analysis[:10]):
    if col == 'city':
        # For city, show only top 10 cities
        top_cities = df[col].value_counts().head(10).index
        df_subset = df[df[col].isin(top_cities)]
        ct = pd.crosstab(df_subset[col], df_subset['target'], normalize='index') * 100
    else:
        ct = pd.crosstab(df[col], df['target'], normalize='index') * 100
    
    ct.plot(kind='bar', ax=axes[idx], edgecolor='black', alpha=0.7, color=['red', 'green'])
    axes[idx].set_title(f'Job Change Rate by {col}', fontsize=10, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Percentage (%)')
    axes[idx].legend(['Not Looking (0)', 'Looking (1)'], fontsize=7)
    axes[idx].tick_params(axis='x', rotation=45, labelsize=8)
    axes[idx].grid(axis='y', alpha=0.3)

# Remove empty subplots
for i in range(len(categorical_features_for_analysis[:10]), 12):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()

#### Summary

**Target Class Distribution:** The target variable exhibits **significant class imbalance** with **14,381 candidates (75.1%) not looking for job change** versus **4,777 candidates (24.9%) actively seeking change**, yielding a **3.01:1 imbalance ratio**. This imbalance will require careful handling during model training through class weighting or SMOTE resampling, together with stratified splits that preserve the class ratio. Otherwise the model may default to majority-class predictions and achieve deceptively high accuracy (always predicting class 0 would already score about 75.1%) while failing to identify job-seeking candidates.
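
The risk of deceptively high accuracy can be demonstrated with a majority-class baseline. The sketch below rebuilds a label vector from the reported class counts, uses placeholder features (an assumption, since a most-frequent baseline ignores them), and applies a stratified 80/20 split; the split ratio and random seed are illustrative choices.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Labels rebuilt from the reported class counts; features are placeholders
y = np.array([0] * 14_381 + [1] * 4_777)
X = np.zeros((len(y), 1))

# stratify=y keeps the ~75/25 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = baseline.predict(X_test)

acc = accuracy_score(y_test, pred)
rec = recall_score(y_test, pred)  # recall for class 1 (job seekers)
print(f"accuracy={acc:.3f}, recall={rec:.3f}")
```

The baseline reaches roughly 75% accuracy while identifying none of the job seekers, which is why metrics such as recall, F1-score, and ROC-AUC are more informative here than accuracy alone.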

**Numerical Features vs Target:** The comparative histograms reveal **minimal discriminatory power** for both numerical features. `city_development_index` shows nearly identical distributions for both classes, with both concentrated at high development indices (0.9-0.95 range). The mean comparison confirms this: **Not Looking: 0.83** vs **Looking: 0.81**, a difference of only **-0.02**. Similarly, `training_hours` displays overlapping distributions, with both classes peaking at 0-50 hours and job-seekers showing a slightly broader spread into higher training hours. The mean difference is also small: **Not Looking: 65.70** vs **Looking: 64.35**, a difference of **-1.35 hours**. The bar chart makes this visible: both features have nearly identical mean values across classes. This suggests that **neither numerical feature alone provides a strong predictive signal** for job change intentions, so the model will need to rely heavily on categorical features and feature interactions.

**Categorical Features vs Target:** Job change propensity varies substantially across segments:

- **City:** the widest variance; city_21 has the **highest job change rate (~60%)**, while most other top cities sit around 10-20%.
- **Gender:** **females (26%) have a higher job-seeking rate than males (22%)**.
- **Relevant experience:** candidates **without relevant experience have a 35% job change rate** versus **20% for experienced candidates**, suggesting that less-established professionals are more mobile.
- **Enrolled university:** **full-time students are the most mobile (38%)**, followed by part-time students (25%); non-enrolled candidates show the lowest rate (20%).
- **Education level:** **Graduate and High School** candidates have higher mobility (~28%) than Masters, PhD, and Primary School.
- **Major discipline:** relatively **balanced rates (20-28%) across all fields**.
- **Experience:** a non-linear pattern; **freshers (<1 year) and very experienced candidates (>20 years) show higher job-seeking rates (~25-30%)** than mid-career professionals (15-20%).
- **Company size:** **smaller companies (`<10`, `10/49`) have higher rates (~23%)** than larger ones (~15-18%).
- **Company type:** **Early Stage Startups have the highest mobility (~24%)**, while Public Sector has the lowest (~18%).
- **Last new job:** candidates who have **never changed jobs have the highest rate (~32%)**, which suggests first-time job seekers; recent changers (1-2 years) show ~22-25%.

These categorical patterns will be **crucial for model predictions**: they reveal distinct behavioral segments with different job change propensities that the numerical features fail to capture.

### 4.5 Feature Correlation & Selection

In [None]:
# Numerical features correlation analysis
print("=" * 60)
print("NUMERICAL FEATURES CORRELATION")
print("=" * 60)

# Only correlate meaningful numerical features (exclude enrollee_id)
numerical_features_for_correlation = [col for col in numerical_cols if col not in ['enrollee_id']]

# Calculate correlation matrix
corr_matrix = df[numerical_features_for_correlation].corr()

print("\nCorrelation Matrix:")
print(corr_matrix.round(3))

# Visualize correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8}, fmt='.3f')
plt.title('Correlation Heatmap - Numerical Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

#### Categorical Feature Correlation (Cramér's V)

Cramér's V measures the strength of association between two categorical variables (0 = none, 1 = perfect).
- **V < 0.1**: Negligible | **0.1-0.3**: Weak | **0.3-0.5**: Moderate | **V ≥ 0.5**: Strong
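
As a small worked example of the statistic, Cramér's V can be derived from the Pearson chi-square as V = sqrt(χ² / (n · (min(r, c) − 1))), where n is the number of observations and r, c are the table dimensions. The sketch below checks this against `scipy.stats.contingency.association` on a made-up 3x2 table (the counts are illustrative, not taken from the dataset).

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

# Toy 3x2 contingency table (made-up counts): rows = categories, columns = target 0/1
table = np.array([[30, 10],
                  [20, 20],
                  [10, 30]])

# Pearson chi-square without continuity correction, to match association()
chi2, _, _, _ = chi2_contingency(table, correction=False)
n = table.sum()
min_dim = min(table.shape) - 1

cramers_v_manual = np.sqrt(chi2 / (n * min_dim))
cramers_v_scipy = association(table, method="cramer")

print(f"chi2 = {chi2:.2f}")
print(f"Cramér's V (manual) = {cramers_v_manual:.3f}")
print(f"Cramér's V (scipy)  = {cramers_v_scipy:.3f}")
```

On this table the two values agree at roughly 0.41, which falls in the "Moderate" band of the scale above.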

In [None]:
from scipy.stats.contingency import association

# Calculate Cramér's V matrix for categorical features
categorical_features_for_corr = [col for col in categorical_cols]
n_features = len(categorical_features_for_corr)
cramers_matrix = np.zeros((n_features, n_features))

for i, col1 in enumerate(categorical_features_for_corr):
    for j, col2 in enumerate(categorical_features_for_corr):
        if i == j:
            cramers_matrix[i, j] = 1.0
        elif i < j:
            mask = df[col1].notna() & df[col2].notna()
            contingency = pd.crosstab(df.loc[mask, col1], df.loc[mask, col2])
            v = association(contingency, method='cramer')
            cramers_matrix[i, j] = v
            cramers_matrix[j, i] = v

# Visualize
cramers_df = pd.DataFrame(cramers_matrix, index=categorical_features_for_corr, columns=categorical_features_for_corr)

plt.figure(figsize=(10, 8))
sns.heatmap(cramers_df, annot=True, cmap='YlOrRd', square=True, fmt='.2f', vmin=0, vmax=1)
plt.title("Cramér's V - Categorical Features Correlation", fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

#### Chi-Square Tests: Categorical Features vs Target

A test of independence assesses whether each categorical feature is associated with the target variable.

In [None]:
from scipy.stats import chi2_contingency

# Chi-square tests: categorical features vs target
chi2_results = []

for col in categorical_cols:
    mask = df[col].notna()
    contingency = pd.crosstab(df.loc[mask, col], df.loc[mask, 'target'])
    chi2, p_value, dof, _ = chi2_contingency(contingency)
    v = association(contingency, method='cramer')
    
    chi2_results.append({
        'Feature': col,
        'Chi-square': round(chi2, 2),
        'p-value': p_value,
        "Cramér's V": round(v, 3)
    })

chi2_df = pd.DataFrame(chi2_results).sort_values("Cramér's V", ascending=False)
print("Chi-square Tests: Categorical Features vs Target")
print(chi2_df.to_string(index=False))

# Visualize
plt.figure(figsize=(8, 5))
plt.barh(chi2_df['Feature'], chi2_df["Cramér's V"], color='steelblue', edgecolor='black')
plt.xlabel("Cramér's V")
plt.title("Association Strength with Target", fontsize=12, fontweight='bold')
plt.axvline(x=0.1, color='orange', linestyle='--', label='Weak (0.1)')
plt.axvline(x=0.3, color='red', linestyle='--', label='Moderate (0.3)')
plt.legend()
plt.tight_layout()
plt.show()

#### Summary

**Numerical correlation:** The two numerical features (`city_development_index` and `training_hours`) are essentially uncorrelated with each other (r = 0.002). Only `city_development_index` shows a moderate correlation with the target (r = -0.34), indicating that candidates from less developed cities are more likely to look for a job change.
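
A Pearson correlation between a numeric feature and a 0/1 target is the same quantity as the point-biserial correlation. The sketch below illustrates this on synthetic data; the variable names and distribution parameters are assumptions chosen for illustration and are not fitted to the dataset.

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(42)

# Synthetic stand-ins: a binary target and a numeric feature whose mean is lower for class 1
target = rng.integers(0, 2, size=1000)
feature = rng.normal(loc=0.85 - 0.10 * target, scale=0.10)

# Pearson r computed directly, and the point-biserial r from SciPy
r_pearson = np.corrcoef(target, feature)[0, 1]
r_pb, p_value = pointbiserialr(target, feature)

print(f"Pearson r:        {r_pearson:.3f}")
print(f"Point-biserial r: {r_pb:.3f}")
```

A negative value means the feature tends to be lower when the target is 1, which is the direction reported for `city_development_index`.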

**Categorical correlation (Cramér's V):** The matrix reveals moderate associations among some features related to professional experience:
- `relevent_experience` ↔ `experience` (V = 0.40)
- `relevent_experience` ↔ `enrolled_university` / `last_new_job` (V = 0.39)
- `relevent_experience` ↔ `education_level` (V = 0.32)

These associations make sense (career progression) but remain moderate, without severe multicollinearity.

**Association with the target (chi-square):** All categorical features are statistically significant (p < 0.05). Ranked by strength of association:
- **Moderate (V ≥ 0.3):** `city` (0.396), the best categorical predictor
- **Weak (0.1-0.3):** `experience` (0.192), `enrolled_university` (0.156), `relevent_experience` (0.128)
- **Negligible (V < 0.1):** `education_level`, `last_new_job`, `company_size`, `company_type`, `major_discipline`, `gender`

**Conclusion:** `city` and `city_development_index` are the strongest predictors. The experience-related features (`experience`, `enrolled_university`, `relevent_experience`) form a second group of useful predictors. The demographic features (`gender`, `major_discipline`) have little predictive power but remain statistically significant.

### 4.6 Unsupervised Clustering

In [None]:
# Data preparation for clustering
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Feature selection (without enrollee_id and target)
X_cluster = df.drop(['enrollee_id', 'target'], axis=1)

# Identify column types
num_cols = X_cluster.select_dtypes(include=np.number).columns.tolist()
cat_cols = X_cluster.select_dtypes(include='object').columns.tolist()

# Preprocessing pipeline
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), num_cols),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ]), cat_cols)
])

# Transform the data
X_processed = preprocessor.fit_transform(X_cluster)
print(f"Preprocessed data: {X_processed.shape[0]} samples, {X_processed.shape[1]} features")

In [None]:
# Elbow method
from sklearn.cluster import KMeans

inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_processed)
    inertias.append(kmeans.inertia_)

# Visualization
plt.figure(figsize=(8, 5))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method', fontsize=13, fontweight='bold')
plt.xticks(K_range)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Apply K-Means (choose k after inspecting the elbow plot)
k_optimal = 4  # Adjust according to the elbow plot

kmeans = KMeans(n_clusters=k_optimal, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_processed)

# PCA reduction for visualization
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_processed)

# Cluster visualization
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', alpha=0.5, s=10)
plt.colorbar(scatter, label='Cluster')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
plt.title(f'K-Means Clustering (k={k_optimal}) - PCA Projection', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print(f"Variance explained by PCA: {pca.explained_variance_ratio_.sum()*100:.1f}%")

In [None]:
# Cluster analysis vs target
df_clusters = df.copy()
df_clusters['cluster'] = clusters

# Target distribution per cluster
cluster_analysis = df_clusters.groupby('cluster')['target'].agg(['count', 'sum', 'mean'])
cluster_analysis.columns = ['Total', 'Looking for Change', 'Rate (%)']
cluster_analysis['Rate (%)'] = (cluster_analysis['Rate (%)'] * 100).round(1)
print("Target distribution per cluster:")
print(cluster_analysis)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Cluster sizes
axes[0].bar(cluster_analysis.index, cluster_analysis['Total'], color='steelblue', edgecolor='black')
axes[0].set_xlabel('Cluster')
axes[0].set_ylabel('Number of candidates')
axes[0].set_title('Cluster sizes', fontweight='bold')

# Job-seeking rate per cluster
colors = plt.cm.RdYlGn_r(cluster_analysis['Rate (%)'] / 100)
axes[1].bar(cluster_analysis.index, cluster_analysis['Rate (%)'], color=colors, edgecolor='black')
axes[1].axhline(y=df['target'].mean()*100, color='red', linestyle='--', label=f'Overall mean ({df["target"].mean()*100:.1f}%)')
axes[1].set_xlabel('Cluster')
axes[1].set_ylabel('Job-seeking rate (%)')
axes[1].set_title('Job-seeking rate per cluster', fontweight='bold')
axes[1].legend()

plt.tight_layout()
plt.show()

#### Summary

K-Means clustering (k=4) reveals **4 candidate segments** with distinct job-seeking behavior:

- **Cluster 1** (~8,800 candidates, 13% job-seeking rate): The most stable profile, well below the overall average
- **Cluster 3** (~4,800 candidates, 47% job-seeking rate): **High attrition risk** profile, almost 2x the overall average (24.9%)
- **Clusters 0 and 2**: Intermediate behavior (18% and 27%)

The variance explained by the 2-component PCA projection (29.6%) is limited because of the large number of one-hot encoded categorical features, but the clusters remain visually distinct. The fact that unsupervised clustering finds groups that correlate with the target **without having used it** confirms that the features contain exploitable predictive patterns.

### 4.7 Interpretations & Conclusions

The exploratory analysis revealed the following key points:

**Data:**
- 19,158 candidates, 12 predictive features (2 numerical, 10 categorical)
- Class imbalance: 75% / 25% (3:1 ratio)
- Substantial missing values (53% of rows affected), notably in `company_size` and `company_type` (candidates without a job)

**Most predictive features:**

| Feature | Type | Association with target |
|---------|------|------------------------|
| city | Categorical | V = 0.396 (moderate) |
| city_development_index | Numerical | r = -0.342 (moderate) |
| experience | Categorical | V = 0.192 (weak) |
| enrolled_university | Categorical | V = 0.156 (weak) |
| relevent_experience | Categorical | V = 0.128 (weak) |

**Business insights:**
- Candidates from **less developed cities** are more likely to look for a job change
- **Full-time students** and candidates **without relevant experience** are more mobile
- Clustering identifies a **high-risk segment (47%)** vs a **stable segment (13%)**

**Implications for modeling:**
1. Handle the class imbalance (class_weight, SMOTE)
2. Handle missing values (imputation + an "Unknown" category)
3. Encode high-cardinality features (`city`: 123 values)
4. Favor non-linear models (Random Forest, XGBoost) to capture interactions

---

## 5. ML Baseline & Ensemble Models

### 5.1 Train/Validation/Test Splits

The data is split into **3 distinct sets** for a rigorous evaluation:

| Set | Share | Role |
|-----|-------|------|
| **Train** | 70% | Model fitting |
| **Validation** | 15% | Held out for hyperparameter tuning (Section 6) |
| **Test** | 15% | Final evaluation (never seen during training) |

**Key points:**
- *Stratification* → preserves the class ratio (75/25) in each set
- Dropping `enrollee_id` → identifier with no predictive value
- `random_state=42` → reproducible results

In [None]:
# Data preparation
X = df.drop(['enrollee_id', 'target'], axis=1)
y = df['target']

# Stratified 70/15/15 split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Sanity check
print(f"Train: {X_train.shape[0]} | Val: {X_val.shape[0]} | Test: {X_test.shape[0]}")

### 5.2 Pipelines & Models

We build **pipelines** that chain preprocessing and the model into a single estimator:

**Automatic preprocessing:**
- *Numerical features* → Imputation (median) + Standardization
- *Categorical features* → Imputation (mode) + One-Hot Encoding

**Baseline models:**

| Model | Type | Why |
|-------|------|-----|
| **Logistic Regression** | Linear | Simple, interpretable, fast baseline |
| **Random Forest** | Ensemble | Captures non-linear interactions |

**Note:** `class_weight='balanced'` → automatically compensates for the 75/25 imbalance
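
For reference, scikit-learn's `'balanced'` setting weights each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that arithmetic using the full-dataset class counts from Section 4.4 as illustrative input; the actual weights in the pipelines are computed from the training split, so they may differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Full-dataset class counts from Section 4.4, used here only as illustrative input
y = np.array([0] * 14381 + [1] * 4777)
classes = np.array([0, 1])

# The "balanced" heuristic: n_samples / (n_classes * count_per_class)
manual_weights = len(y) / (len(classes) * np.bincount(y))
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(f"Manual:  {manual_weights.round(3)}")
print(f"sklearn: {sklearn_weights.round(3)}")
print(f"Weight ratio (class 1 / class 0): {sklearn_weights[1] / sklearn_weights[0]:.2f}")
```

The minority class ends up weighted about three times as heavily as the majority class, mirroring the 3.01:1 imbalance ratio.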

In [None]:
# Identify feature types
num_features = X_train.select_dtypes(include=np.number).columns.tolist()
cat_features = X_train.select_dtypes(include='object').columns.tolist()

# Preprocessor
preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), num_features),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]), cat_features)
])

# Pipelines with models
lr_pipeline = Pipeline([('preprocessor', preprocessor), ('model', LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42))])
rf_pipeline = Pipeline([('preprocessor', preprocessor), ('model', RandomForestClassifier(class_weight='balanced', random_state=42))])

print(f"Features: {len(num_features)} numerical, {len(cat_features)} categorical")

### 5.3 Training & Validation

**5-fold cross-validation** on the training set gives a more robust estimate of model performance.

*Principle:* The training set is split into 5 parts; the model is trained on 4 and validated on 1, and the process is repeated 5 times so that each part serves once as the validation fold.
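
The fold mechanics can be sketched on synthetic labels. When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds, so the sketch uses `StratifiedKFold` directly; the label array and placeholder features are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with the same 75/25 class ratio as the dataset
y = np.array([0] * 750 + [1] * 250)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

skf = StratifiedKFold(n_splits=5)
fold_summary = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    positive_rate = y[val_idx].mean()
    fold_summary.append((len(train_idx), len(val_idx), positive_rate))
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}, positive rate={positive_rate:.2f}")
```

Each sample lands in exactly one validation fold, and every fold keeps the 25% positive rate.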

**Metrics evaluated:**
- `accuracy` → share of correct predictions
- `f1_weighted` → per-class F1 (harmonic mean of precision and recall), averaged with weights proportional to class support
- `roc_auc` → ability to rank positive cases above negative ones, independent of the decision threshold
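
Because `f1_weighted` is weighted by class support, it can stay fairly high even when the minority class is poorly detected. The toy example below (hand-made labels, not model output) compares it with the class-1 F1 and recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic 75/25 labels and predictions that find only 5 of the 25 positives
y_true = np.array([0] * 75 + [1] * 25)
y_pred = np.array([0] * 75 + [1] * 5 + [0] * 20)

accuracy = accuracy_score(y_true, y_pred)
f1_weighted = f1_score(y_true, y_pred, average="weighted")
f1_class1 = f1_score(y_true, y_pred, pos_label=1)
recall_class1 = recall_score(y_true, y_pred, pos_label=1)

print(f"accuracy:      {accuracy:.3f}")
print(f"f1_weighted:   {f1_weighted:.3f}")
print(f"F1 (class 1):  {f1_class1:.3f}")
print(f"recall (cl.1): {recall_class1:.3f}")
```

Here the weighted F1 is about 0.75 while the class-1 F1 is only about 0.33, so it is worth reading the per-class report alongside the weighted average.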

In [None]:
from sklearn.model_selection import cross_validate

scoring = ['accuracy', 'f1_weighted', 'roc_auc']

# Cross-validate both models; a single cross_validate call scores
# every metric on the same 5 folds instead of refitting once per metric
results = {}
for name, pipeline in [('Logistic Regression', lr_pipeline), ('Random Forest', rf_pipeline)]:
    print(f"\n{name}:")
    cv_scores = cross_validate(pipeline, X_train, y_train, cv=5, scoring=scoring)
    results[name] = {}
    for metric in scoring:
        scores = cv_scores[f'test_{metric}']
        results[name][metric] = scores.mean()
        print(f"  {metric}: {scores.mean():.3f} (+/- {scores.std():.3f})")

### 5.4 Testing

**Final evaluation** on the test set (data never seen during training).

- Fit on the full `X_train`
- Predict and evaluate on `X_test`
- **Classification report** → precision, recall, and F1 per class
- **Confusion matrix** → visualization of the errors
- **Feature importance** → comparison of RF (*Gini importance*) vs LR (*coefficients*)

In [None]:
# Fit on the training set and evaluate on the test set
models = {'Logistic Regression': lr_pipeline, 'Random Forest': rf_pipeline}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    
    print(f"\n{'='*50}\n{name}\n{'='*50}")
    print(classification_report(y_test, y_pred))
    print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for idx, (name, model) in enumerate(models.items()):
    cm = confusion_matrix(y_test, model.predict(X_test))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx])
    axes[idx].set_title(f'{name}')
    axes[idx].set_xlabel('Predicted')
    axes[idx].set_ylabel('Actual')
plt.tight_layout()
plt.show()

In [None]:
# Feature importance: comparison of the two models
feature_names = num_features + list(rf_pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['encoder'].get_feature_names_out(cat_features))

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Random Forest - Feature Importances
rf_importances = rf_pipeline.named_steps['model'].feature_importances_
top_rf = np.argsort(rf_importances)[-15:]
axes[0].barh(range(15), rf_importances[top_rf], color='steelblue')
axes[0].set_yticks(range(15))
axes[0].set_yticklabels([feature_names[i] for i in top_rf])
axes[0].set_xlabel('Importance')
axes[0].set_title('Random Forest (Gini Importance)', fontweight='bold')

# Logistic Regression - Coefficients (absolute value)
lr_coefs = np.abs(lr_pipeline.named_steps['model'].coef_[0])
top_lr = np.argsort(lr_coefs)[-15:]
axes[1].barh(range(15), lr_coefs[top_lr], color='coral')
axes[1].set_yticks(range(15))
axes[1].set_yticklabels([feature_names[i] for i in top_lr])
axes[1].set_xlabel('|Coefficient|')
axes[1].set_title('Logistic Regression (Coefficient Magnitude)', fontweight='bold')

plt.tight_layout()
plt.show()

### 5.5 Results Interpretation & Discussion

**Comparison of the baseline models:**

| Metric | Logistic Regression | Random Forest | Best |
|--------|---------------------|---------------|------|
| Accuracy | 0.73 | **0.77** | RF |
| ROC-AUC | **0.777** | 0.765 | LR |
| Recall (class 1) | **0.70** | 0.39 | LR |
| F1 (class 1) | **0.57** | 0.45 | LR |

**Conclusions:**

1. **Logistic Regression outperforms Random Forest** on the business objective (detecting candidates who are looking for a job)
   - It detects 70% of job seekers vs only 39% for RF
   - Better ROC-AUC (0.777 vs 0.765)

2. **Accuracy is misleading** with imbalanced classes
   - RF reaches 77% accuracy but misses 61% of at-risk candidates

3. **Feature importance reveals different strategies:**
   - *LR* → relies heavily on the `city` variable (the best predictor identified in the EDA)
   - *RF* → spreads importance onto `training_hours` despite its weak linear correlation

**Next step (Section 6):** Tune hyperparameters to improve performance, in particular the class-1 recall.

---

## 6. Enhanced Models & Hyperparameter Tuning

### 6.1 Justification of Choices

**Model choice: HistGradientBoostingClassifier**

*Justification based on our analysis:*
- **EDA (Section 4):** Class imbalance (75/25); `city` is the best predictor (V=0.396); high-cardinality categorical features
- **Baselines (Section 5):** Random Forest underperforms on class-1 recall (0.39); LR does better but is limited to linear relationships
- **Course material:** "RandomForest, AdaBoost, GBRT, and HGB are among the first models you should test"

*Why HistGradientBoosting:*
- **Gradient Boosting** → often outperforms Random Forest on tabular data
- **Optimized** implementation in scikit-learn (fast)
- Handles **non-linear interactions** between features well

*Preprocessing:* Same pipeline as the baselines (imputation + one-hot encoding) for a fair comparison.

**Optimization metrics:** `roc_auc` and `f1_weighted` (chosen with the class imbalance in mind)

### 6.2 Hyperparameter Optimization

**GridSearchCV** searches for the best hyperparameters:
- `max_depth` → tree depth (controls overfitting)
- `learning_rate` → shrinkage applied to each tree's contribution
- `max_iter` → number of boosting iterations (trees in the ensemble)

Each candidate is scored with 5-fold cross-validation on `X_train`; the code below does not use the `X_val` split created in 5.1.
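
To get a feel for the cost of the search, the number of candidate settings and model fits can be counted with `ParameterGrid`. The grid values are copied from the tuning cell; the fit count assumes 5 folds per candidate and ignores the final refit of the best estimator.

```python
from sklearn.model_selection import ParameterGrid

# Same grid values as in the tuning cell
param_grid = {
    "model__max_depth": [3, 5, 7],
    "model__learning_rate": [0.05, 0.1, 0.2],
    "model__max_iter": [100, 200],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 2 combinations
n_folds = 5
n_fits_per_metric = n_candidates * n_folds
n_fits_total = n_fits_per_metric * 2  # the search is run once per scoring metric

print(f"Candidates: {n_candidates}")
print(f"Fits per metric: {n_fits_per_metric}")
print(f"Fits for both metrics: {n_fits_total}")
```

With 18 candidates the search stays cheap; adding one more three-valued hyperparameter would triple the number of fits.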

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Pipeline with the preprocessor (reuses the one from Section 5.2)
hgb_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', HistGradientBoostingClassifier(class_weight='balanced', random_state=42))
])

# Hyperparameter grid (the 'model__' prefix targets the pipeline's model step)
param_grid = {
    'model__max_depth': [3, 5, 7],
    'model__learning_rate': [0.05, 0.1, 0.2],
    'model__max_iter': [100, 200]
}

# GridSearchCV, run once per optimization metric
# (loop variable named `metric` so it does not shadow the `scoring` list from Section 5.3)
results_grid = {}
for metric in ['roc_auc', 'f1_weighted']:
    print(f"\n{'='*50}\nOptimizing for: {metric}\n{'='*50}")
    
    grid_search = GridSearchCV(hgb_pipeline, param_grid, cv=5, scoring=metric, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    
    results_grid[metric] = grid_search
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV score: {grid_search.best_score_:.3f}")

In [None]:
# Compare the two tuned models (optimized for ROC-AUC vs F1-weighted)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

for idx, (metric, label) in enumerate([('roc_auc', 'ROC-AUC'), ('f1_weighted', 'F1-weighted')]):
    best_model = results_grid[metric].best_estimator_
    y_pred = best_model.predict(X_test)
    y_proba = best_model.predict_proba(X_test)[:, 1]
    
    print(f"\n{'='*50}\nHGB tuned for {label}\n{'='*50}")
    print(f"Params: {results_grid[metric].best_params_}")
    print(classification_report(y_test, y_pred))
    print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
    
    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx])
    axes[idx].set_title(f'HGB tuned for {label}')
    axes[idx].set_xlabel('Predicted')
    axes[idx].set_ylabel('Actual')

plt.tight_layout()
plt.show()

### 6.3 Final Results & Analysis

**Final comparison of all models (test set):**

| Model | ROC-AUC | Recall (class 1) | F1 (class 1) | Candidates detected |
|-------|---------|------------------|--------------|---------------------|
| Logistic Regression | 0.777 | 0.70 | 0.57 | 504/716 (70%) |
| Random Forest | 0.765 | 0.39 | 0.45 | 276/716 (39%) |
| **HGB (ROC-AUC)** | **0.784** | **0.76** | **0.59** | **542/716 (76%)** |
| HGB (F1) | 0.785 | 0.74 | 0.59 | 530/716 (74%) |

**Best model: HistGradientBoosting optimized for ROC-AUC**, selected for its higher class-1 recall; the F1-optimized variant has a marginally higher test ROC-AUC (0.785 vs 0.784).

*Hyperparameters:* `max_depth=5`, `learning_rate=0.05`, `max_iter=100`

**Gains over the baselines:**
- **+0.007** ROC-AUC vs Logistic Regression (0.784 vs 0.777)
- **+6 points** of recall vs LR → detects 38 additional candidates
- **+37 points** of recall vs Random Forest → detects 266 additional candidates

**Business interpretation:** Of the 716 job-seeking candidates in the test set, the tuned model correctly identifies **542 (76%)**, versus 504 for LR and only 276 for RF. This improvement allows more effective targeting of candidates at risk of attrition.

---

## 7. Conclusion & Future Work

**Objective:** Predict which candidates are looking to change jobs from their demographic, educational, and professional characteristics.

**Main findings (EDA):**
- Dataset of 19,158 candidates with class imbalance (75/25)
- `city` and `city_development_index` are the best predictors
- Clustering reveals a high-risk segment (47% mobility)

**Model comparison:**

| Model | ROC-AUC | Recall (class 1) |
|-------|---------|------------------|
| Logistic Regression | 0.777 | 70% |
| Random Forest | 0.765 | 39% |
| **HGB (tuned)** | **0.784** | **76%** |

**Best model:** HistGradientBoostingClassifier
- Hyperparameters: `max_depth=5`, `learning_rate=0.05`, `max_iter=100`
- Detects **76% of candidates** looking for a job (542/716)

**Limitations:**
- Class imbalance hurts precision: about half of the candidates flagged as job-seeking are false positives (49%)
- High-cardinality `city` feature (123 values) → one-hot encoding creates 123 features

**Directions for improvement:**
- Target encoding for `city` (reduces dimensionality)
- Test other models (XGBoost, LightGBM)
- Adjust the decision threshold to balance precision and recall
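
As a rough sketch of the threshold-adjustment idea, the example below trains a plain logistic regression on synthetic imbalanced data and compares precision and recall at a few cut-offs. The data, the model, and the thresholds are all assumptions for illustration; none of the numbers relate to the project's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 75/25), standing in for the real dataset
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.75, 0.25], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Compare a few decision thresholds instead of the default 0.5
threshold_results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred)
    threshold_results[threshold] = (precision, recall)
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

Lowering the threshold can only keep or increase recall, usually at the cost of precision. In practice the threshold would be chosen on held-out validation data rather than on the test set.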