# The Hired Hand

**Machine Learning for Job Change Prediction**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Angry-Jay/ML_TheHiredHand/blob/main/ml-the-hired-hand.ipynb)

---

## Table of Contents

1. [Project & Dataset Description](#1-project--dataset-description)
   - [1.1 Project Aim](#11-project-aim)
   - [1.2 Existing Solutions](#12-existing-solutions)
   - [1.3 Dataset Information](#13-dataset-information)
2. [Library Imports](#2-library-imports)
3. [Data Access](#3-data-access)
4. [Dataset Exploratory Analysis](#4-dataset-exploratory-analysis)
   - [4.1 Metadata Analysis](#41-metadata-analysis)
   - [4.2 Missing Values Analysis](#42-missing-values-analysis)
   - [4.3 Feature Distributions, Scaling & Outliers](#43-feature-distributions-scaling--outliers)
   - [4.4 Target Feature Study](#44-target-feature-study)
   - [4.5 Feature Correlation & Selection](#45-feature-correlation--selection)
   - [4.6 Unsupervised Clustering](#46-unsupervised-clustering)
   - [4.7 Interpretations & Conclusions](#47-interpretations--conclusions)
5. [ML Baseline & Ensemble Models](#5-ml-baseline--ensemble-models)
   - [5.1 Train/Validation/Test Splits](#51-trainvalidationtest-splits)
   - [5.2 Pipelines & Models](#52-pipelines--models)
   - [5.3 Training & Validation](#53-training--validation)
   - [5.4 Testing](#54-testing)
   - [5.5 Results Interpretation & Discussion](#55-results-interpretation--discussion)
6. [Enhanced Models & Hyperparameter Tuning](#6-enhanced-models--hyperparameter-tuning)
   - [6.1 Justification of Choices](#61-justification-of-choices)
   - [6.2 Hyperparameter Optimization](#62-hyperparameter-optimization)
   - [6.3 Final Results & Analysis](#63-final-results--analysis)
7. [Conclusion & Future Work](#7-conclusion--future-work)

---

## 1. Project & Dataset Description

### 1.1 Project Aim

This project applies Machine Learning techniques to predict job-change behavior for data science candidates using the **HR Analytics: Job Change of Data Scientists** dataset from Kaggle.

**Primary Objectives:**
- **Predict job-change probability** (target = 1: looking for a job change; target = 0: not looking) based on demographic, educational, and professional attributes
- **Demonstrate a coherent ML methodology** from data discovery through model optimization
- **Apply comprehensive data analysis** including:
  - Data cleaning and preprocessing
  - Exploratory Data Analysis (EDA)
  - Feature engineering and selection
  - Correlation and clustering analysis
- **Build and evaluate multiple classification models** with proper validation techniques
- **Identify key factors** influencing job-change decisions through feature importance analysis
- **Apply ML best practices** including proper train/validation/test splits, pipeline construction, and hyperparameter tuning

---

### 1.2 Existing Solutions

**Traditional Approach:**

HR departments traditionally rely on manual candidate assessment with heuristic filters (e.g., years of experience, specific education levels, company type). This approach has several limitations:
- Time-consuming and difficult to scale
- Subjective and prone to human bias
- Often inaccurate in predicting actual behavior
- Fails to capture complex interactions between multiple factors

**Machine Learning Solutions:**

Several ML-based approaches exist for HR analytics and employee retention prediction:

**Common Algorithms Used:**
- **Baseline Models:** Logistic Regression, K-Nearest Neighbors (KNN)
- **Tree-based Models:** Decision Trees, Random Forest, ExtraTrees
- **Boosting Methods:** XGBoost, LightGBM, CatBoost, AdaBoost
- **Support Vector Machines:** SVC with various kernels

**Key Findings from Literature:**
- Tree-based ensemble methods (Random Forest, XGBoost, LightGBM) typically outperform simpler baselines
- Gradient boosting models (LightGBM, CatBoost) handle categorical features natively and efficiently
- Feature engineering significantly impacts model performance
- Proper handling of class imbalance is crucial for accurate predictions
- Training hours and relevant experience are often among the strongest predictors

**Typical Methodology:**
1. Exploratory Data Analysis (distributions, correlations, class imbalance)
2. Preprocessing pipelines (encoding categorical variables, imputation, scaling)
3. Model comparison using multiple metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC
4. Hyperparameter tuning using GridSearchCV or RandomizedSearchCV
5. Feature importance analysis for interpretability
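
Steps 2 and 3 of the methodology above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the notebook's final model: the synthetic `toy` frame, the choice of columns, the imputation strategies, and the rule used to generate labels are all assumptions made so that the example runs without the Kaggle file.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data (assumption: illustration only).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "city_development_index": rng.uniform(0.4, 0.95, n),
    "training_hours": rng.integers(1, 300, n).astype(float),
    "education_level": rng.choice(["Graduate", "Masters", "Phd"], n),
    "company_type": rng.choice(["Pvt Ltd", "Public Sector", "NGO"], n),
})
# Inject missing values so the imputers have something to do.
toy.loc[toy.index % 10 == 0, "company_type"] = np.nan
toy.loc[toy.index % 15 == 0, "training_hours"] = np.nan
# Hypothetical labelling rule, used only to give the model a signal.
y = (toy["city_development_index"] < 0.6).astype(int)

numeric_features = ["city_development_index", "training_hours"]
categorical_features = ["education_level", "company_type"]

# Numeric branch: median imputation, then standardization.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: most-frequent imputation, then one-hot encoding.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_features),
    ("cat", categorical_pipe, categorical_features),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation; preprocessing is refit inside each fold.
scores = cross_val_score(model, toy, y, cv=5, scoring="roc_auc")
print("ROC-AUC per fold:", np.round(scores, 3))
```

Keeping the imputers, scaler, and encoder inside the pipeline means they are fitted on the training folds only, which avoids leaking validation statistics into preprocessing.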

---

### 1.3 Dataset Information

**Dataset Name:** HR Analytics: Job Change of Data Scientists

**Original Source:** [Kaggle - HR Analytics (arashnic)](https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists)

**Dataset Characteristics:**
- **Type:** Dense, structured tabular data
- **Size:** Medium (19,158 training instances, 13 features + target)
- **Features:** Mix of numeric and categorical variables (mostly categorical)
- **Target Variable:** Binary classification (0 = not looking for job change, 1 = looking for job change)
- **Quality:** Real-world data with missing values and class imbalance
- **License:** CC0 (Public Domain)

**Dataset Context:**

A company active in Big Data and Data Science conducts training programs and wants to identify which candidates are likely to work for the company after training versus those looking for new employment. This helps reduce cost and time while improving training quality and candidate planning.

**Dataset Files:**
- `aug_train.csv` - Training data with target labels
- `aug_test.csv` - Test data without target labels
- `sample_submission.csv` - Submission format example

**Key Features:**
- `enrollee_id`: Unique ID for candidate
- `city`: City code
- `city_development_index`: Development index of the city (scaled)
- `gender`: Gender of candidate
- `relevent_experience`: Relevant experience (Has relevant experience / No relevant experience)
- `enrolled_university`: Type of university course enrolled if any
- `education_level`: Education level of candidate
- `major_discipline`: Education major discipline
- `experience`: Candidate total experience in years
- `company_size`: Number of employees in current employer's company
- `company_type`: Type of current employer
- `last_new_job`: Difference in years between previous job and current job
- `training_hours`: Training hours completed
- `target`: 0 - Not looking for job change, 1 - Looking for job change

**Dataset Access:**
- **Kaggle:** `https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists`
- **GitHub Repository:** `https://github.com/Angry-Jay/ML_TheHiredHand`
- In this notebook, the data is loaded from a local file: `aug_train.csv`

**Known Challenges:**
- Dataset is imbalanced (more candidates are not looking for a job change)
- Most features are categorical, and some have high cardinality (e.g., `city`)
- Missing values present in several features requiring imputation strategy
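
The three challenges listed above can each be quantified with a few pandas calls. The helper below is a hypothetical sketch (the name `audit_challenges` and the toy rows are assumptions, not part of the dataset or the notebook); it reports the class shares, the number of distinct values per categorical column, and the fraction of missing values per feature.

```python
import pandas as pd


def audit_challenges(df: pd.DataFrame, target: str = "target") -> dict:
    """Summarize class balance, categorical cardinality, and missingness."""
    features = df.drop(columns=[target])
    categorical = features.select_dtypes(exclude="number")
    return {
        # Share of each class in the target column.
        "class_share": df[target].value_counts(normalize=True).sort_index(),
        # Number of distinct non-missing values per categorical column.
        "cardinality": categorical.nunique().sort_values(ascending=False),
        # Fraction of missing entries per feature.
        "missing_share": features.isna().mean().sort_values(ascending=False),
    }


# Toy rows for illustration only (assumption: not taken from the real file).
toy = pd.DataFrame({
    "city": ["city_103", "city_40", "city_21", "city_103", "city_16", "city_21"],
    "gender": ["Male", None, "Female", "Male", None, "Male"],
    "training_hours": [36, 47, 83, 52, 8, 24],
    "target": [1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

report = audit_challenges(toy)
for name, series in report.items():
    print(f"--- {name} ---")
    print(series)
```

Run against the real training frame, the same report would indicate which columns need an imputation strategy and which encodings may produce a large number of columns.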

## 2. Library Imports

```python
# Setting up
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing & Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Model Selection & Tuning
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    train_test_split,
)

# Models
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Metrics
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Configuration
%matplotlib inline
```

## 3. Data Access
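
As noted in the dataset description, the training data is read from a local copy of `aug_train.csv`. A minimal loading sketch follows; the helper name `load_training_data` and its checks are assumptions added for illustration, and the default path assumes the file sits next to the notebook.

```python
from pathlib import Path

import pandas as pd


def load_training_data(path: str = "aug_train.csv") -> pd.DataFrame:
    """Read the training CSV and fail early with a clear message if it is unusable."""
    csv_path = Path(path)
    if not csv_path.exists():
        raise FileNotFoundError(
            f"{csv_path} not found. Download it from Kaggle or the GitHub "
            "repository and place it next to the notebook."
        )
    df = pd.read_csv(csv_path)
    # The labelled training file is expected to contain the `target` column.
    if "target" not in df.columns:
        raise ValueError(f"{csv_path} has no 'target' column")
    return df
```

In the notebook this would typically be called as `df = load_training_data()`, followed by `df.head()` to confirm that the columns match the feature list in section 1.3.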

## 4. Dataset Exploratory Analysis

### 4.1 Metadata Analysis

In this section, we analyze the dataset's metadata to understand its structure, data types, quality, and characteristics. This initial exploration helps identify:

- **Dataset dimensions** and scale
- **Feature data types** (numerical vs. categorical)
- **Data quality issues** (duplicates, missing values, irrelevant columns)
- **Statistical properties** of numerical features
- **Potential data leakage** concerns
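
The checks listed above map onto a handful of pandas calls. The sketch below is one possible way to gather them; the function name `metadata_overview`, the toy rows, and the decision to treat `enrollee_id` as an identifier to be excluded from modeling are assumptions made for illustration.

```python
import pandas as pd


def metadata_overview(df: pd.DataFrame, id_col: str = "enrollee_id") -> dict:
    """Collect basic structural facts about a DataFrame."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),   # numerical vs. categorical columns
        "n_unique": df.nunique(),         # distinct non-missing values
        "n_missing": df.isna().sum(),     # missing entries per column
    })
    return {
        "shape": df.shape,                                # rows, columns
        "summary": summary,
        "duplicate_rows": int(df.duplicated().sum()),     # fully identical rows
        # A unique identifier carries no predictive signal and should not be
        # fed to a model as a feature.
        "id_is_unique": bool(df[id_col].is_unique),
        "numeric_stats": df.select_dtypes(include="number").describe().T,
    }


# Toy rows for illustration only (assumption: not taken from the real file).
toy = pd.DataFrame({
    "enrollee_id": [1, 2, 3, 4],
    "city_development_index": [0.92, 0.776, 0.624, 0.92],
    "experience": [">20", "15", "5", None],
    "target": [1.0, 0.0, 0.0, 1.0],
})

overview = metadata_overview(toy)
print(overview["shape"])
print(overview["summary"])
print("Duplicate rows:", overview["duplicate_rows"])
print("ID column unique:", overview["id_is_unique"])
```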

### 4.2 Missing Values Analysis

### 4.3 Feature Distributions, Scaling & Outliers

### 4.4 Target Feature Study

### 4.5 Feature Correlation & Selection

### 4.6 Unsupervised Clustering

### 4.7 Interpretations & Conclusions

---

## 5. ML Baseline & Ensemble Models

### 5.1 Train/Validation/Test Splits
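
Because the target is imbalanced, one common option is to stratify every split so that each subset keeps roughly the same class ratio. The sketch below shows a three-way stratified split built from two calls to `train_test_split`; the helper name, the 70/15/15 proportions, the seed, and the synthetic arrays are assumptions, not choices confirmed by this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_three_way_split(X, y, val_size=0.15, test_size=0.15, seed=42):
    """Split into train/validation/test sets while preserving class ratios."""
    # First carve out the test set.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Rescale the validation fraction relative to what is left.
    relative_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=relative_val,
        stratify=y_trainval, random_state=seed,
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic data with a 75/25 class ratio (assumption: illustration only).
X = np.arange(400).reshape(200, 2)
y = np.array([0] * 150 + [1] * 50)

train, val, test = stratified_three_way_split(X, y)
for name, (X_part, y_part) in {"train": train, "val": val, "test": test}.items():
    print(f"{name}: {len(X_part)} rows, positive share = {y_part.mean():.3f}")
```

The test set is held back until the final evaluation; model selection and tuning use only the training and validation parts.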

### 5.2 Pipelines & Models

### 5.3 Training & Validation

### 5.4 Testing

### 5.5 Results Interpretation & Discussion

---

## 6. Enhanced Models & Hyperparameter Tuning

### 6.1 Justification of Choices

### 6.2 Hyperparameter Optimization

### 6.3 Final Results & Analysis

---

## 7. Conclusion & Future Work