# The Hired Hand

**Machine Learning for Job Placement Prediction**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Angry-Jay/ML_TheHiredHand/blob/main/ml-the-hired-hand.ipynb)

---

## Table of Contents

1. [Project & Dataset Description](#1-project--dataset-description)
   - [1.1 Project Aim](#11-project-aim)
   - [1.2 Existing Solutions](#12-existing-solutions)
   - [1.3 Dataset Information](#13-dataset-information)
2. [Library Imports](#2-library-imports)
3. [Data Access](#3-data-access)
4. [Dataset Exploratory Analysis](#4-dataset-exploratory-analysis)
   - [4.1 Metadata Analysis](#41-metadata-analysis)
   - [4.2 Missing Values Analysis](#42-missing-values-analysis)
   - [4.3 Feature Distributions, Scaling & Outliers](#43-feature-distributions-scaling--outliers)
   - [4.4 Target Feature Study](#44-target-feature-study)
   - [4.5 Feature Correlation & Selection](#45-feature-correlation--selection)
   - [4.6 Unsupervised Clustering](#46-unsupervised-clustering)
   - [4.7 Interpretations & Conclusions](#47-interpretations--conclusions)
5. [ML Baseline & Ensemble Models](#5-ml-baseline--ensemble-models)
   - [5.1 Train/Validation/Test Splits](#51-trainvalidationtest-splits)
   - [5.2 Pipelines & Models](#52-pipelines--models)
   - [5.3 Training & Validation](#53-training--validation)
   - [5.4 Testing](#54-testing)
   - [5.5 Results Interpretation & Discussion](#55-results-interpretation--discussion)
6. [Enhanced Models & Hyperparameter Tuning](#6-enhanced-models--hyperparameter-tuning)
   - [6.1 Justification of Choices](#61-justification-of-choices)
   - [6.2 Hyperparameter Optimization](#62-hyperparameter-optimization)
   - [6.3 Final Results & Analysis](#63-final-results--analysis)
7. [Conclusion & Future Work](#7-conclusion--future-work)

---

## 1. Project & Dataset Description

### 1.1 Project Aim

This project applies Machine Learning techniques to predict employment outcomes for graduating students using the **Job Placement Dataset**. 

**Primary Objectives:**
- **Predict employment outcomes** (Placed vs. Not Placed) based on demographic, academic, and professional attributes
- **Demonstrate a coherent ML methodology** from data discovery through model optimization
- **Apply comprehensive data analysis** including:
  - Data cleaning and preprocessing
  - Exploratory Data Analysis (EDA)
  - Feature engineering and selection
  - Correlation and clustering analysis
- **Build and evaluate multiple classification models** with proper validation techniques
- **Identify key employability factors** through feature importance analysis and model interpretation
- **Apply ML best practices** including proper train/validation/test splits, pipeline construction, and hyperparameter tuning

---

### 1.2 Existing Solutions

**Traditional Approach:**

Historically, HR departments and educational institutions have relied on manual screening with heuristic filters (e.g., GPA cutoffs, specific degree specializations, work experience thresholds). This approach has several limitations:
- Time-consuming and difficult to scale
- Subjective and prone to human bias
- Often inaccurate in predicting actual job placement success
- Fails to capture complex interactions between multiple factors

**Machine Learning Solutions:**

Several ML-based approaches exist on platforms like Kaggle and GitHub for placement prediction:

**Common Algorithms Used:**
- **Baseline Models:** Logistic Regression, K-Nearest Neighbors (KNN)
- **Tree-based Models:** Decision Trees, Random Forest, ExtraTrees
- **Boosting Methods:** XGBoost, AdaBoost, Gradient Boosting
- **Support Vector Machines:** SVC with various kernels

**Key Findings Reported in These Solutions:**
- Tree-based ensemble methods (Random Forest, XGBoost) often outperform simpler baselines
- Non-linear models can better capture feature interactions (e.g., the combined effect of GPA and work experience)
- Feature engineering can significantly affect model performance
- Class imbalance needs explicit handling, both in training and in the choice of evaluation metrics

**Typical Methodology:**
1. Exploratory Data Analysis (distributions, correlations, class imbalance)
2. Preprocessing pipelines (encoding categorical variables, scaling, imputation)
3. Model comparison using multiple metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC
4. Hyperparameter tuning using GridSearchCV or RandomizedSearchCV
5. Feature importance analysis for interpretability
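
The preprocessing, comparison, and tuning steps listed above can be sketched in a few lines of scikit-learn. This is a minimal illustration under stated assumptions, not the pipeline used later in this notebook: the data is synthetic, and the column names `score`, `track`, and `label` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, NOT the placement dataset).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "score": rng.normal(65, 10, n),
    "track": rng.choice(["A", "B", "C"], n),
})
# The label is loosely driven by the score so the model has a signal to find.
toy["label"] = (toy["score"] + rng.normal(0, 5, n) > 65).astype(int)

X, y = toy[["score", "track"]], toy["label"]
# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale the numeric column and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["score"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["track"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune the regularization strength with 5-fold cross-validation on F1.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5, scoring="f1")
search.fit(X_train, y_train)

# With a single scoring string, score() reports that same metric (F1 here).
test_f1 = search.score(X_test, y_test)
print(search.best_params_, round(test_f1, 3))
```

Keeping the preprocessing inside the `Pipeline` means the scaler and encoder are refit on each training fold, so the validation folds do not influence the preprocessing.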

---

### 1.3 Dataset Information

**Dataset Name:** Job Placement Dataset

**Original Source:** [Kaggle - Job Placement Dataset](https://www.kaggle.com/datasets/ahsan81/job-placement-dataset/data)

**Dataset Characteristics:**
- **Type:** Dense, structured tabular data
- **Size:** Small (215 instances, 13 features: 12 predictors and 1 target)
- **Features:** Mix of numeric and categorical variables
- **Target Variable:** Binary classification (Placed / Not Placed)
- **Quality:** Clean with no missing values or duplicates

**Dataset Access:**
- **GitHub Repository:** `https://github.com/Angry-Jay/ML_TheHiredHand`
- **Raw Data URL:** `https://raw.githubusercontent.com/Angry-Jay/ML_TheHiredHand/main/Job_Placement_Data.csv`

**Features Overview:**
- Student demographics (gender)
- Academic performance (SSC %, HSC %, Degree %, MBA %)
- Educational background (SSC board, HSC board, HSC specialization, Degree type, MBA specialization)
- Work experience
- Employment test scores

## 2. Library Imports

In [1]:
# Setting up
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing & Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Model Selection & Tuning
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    train_test_split,
)

# Models
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Metrics
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Configuration
%matplotlib inline

## 3. Data Access

In [2]:
DATA_URL = "https://raw.githubusercontent.com/Angry-Jay/ML_TheHiredHand/refs/heads/main/Job_Placement_Data.csv"

try:
    # Load the dataset directly into a pandas DataFrame
    df = pd.read_csv(DATA_URL)
except Exception as e:
    print("Error loading data. Check your URL.")
    print(f"Error details: {e}")
    # Re-raise so later cells do not run against an undefined `df`
    raise

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")

# Display the first 5 rows to verify
display(df.head())

## 4. Dataset Exploratory Analysis

### 4.1 Metadata Analysis

In this section, we analyze the dataset's metadata to understand its structure, data types, quality, and characteristics. This initial exploration helps identify:

- **Dataset dimensions** and scale
- **Feature data types** (numerical vs. categorical)
- **Data quality issues** (duplicates, missing values, irrelevant columns)
- **Statistical properties** of numerical features
- **Potential data leakage** concerns

In [3]:
# Display dataset info
df.info()

In [None]:
print("=" * 60)
print("DUPLICATE ANALYSIS")
print("=" * 60)
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
if duplicates > 0:
    print("\nDuplicate rows:")
    display(df[df.duplicated(keep=False)])
else:
    print("No duplicate rows found.")


In [None]:
print("=" * 60)
print("FEATURE TYPE SEPARATION")
print("=" * 60)

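# Select by dtype family rather than exact dtype, so other integer/float widths
# and non-object string dtypes are not silently dropped
numerical_cols = df.select_dtypes(include='number').columns.tolist()
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()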

print(f"\nNumerical features ({len(numerical_cols)}):")
print(numerical_cols)

print(f"\nCategorical features ({len(categorical_cols)}):")
print(categorical_cols)


In [None]:
print("=" * 60)
print("NUMERICAL FEATURES - STATISTICAL SUMMARY")
print("=" * 60)
display(df[numerical_cols].describe())

In [7]:
print("=" * 60)
print("CATEGORICAL FEATURES - UNIQUE VALUES")
print("=" * 60)

for col in categorical_cols:
    print(f"\n{col}:")
    print(f"  Unique values: {df[col].nunique()}")
    print(f"  Values: {df[col].unique().tolist()}")

In [9]:
# Data Leakage Assessment and Target Variable Identification
print("=" * 60)
print("TARGET VARIABLE & DATA LEAKAGE ASSESSMENT")
print("=" * 60)

# Identify the target variable
target_col = 'status'
print(f"\nTarget variable: '{target_col}'")
print(f"Classes: {df[target_col].unique().tolist()}")
print("\nClass distribution:")
print(df[target_col].value_counts())
print("\nClass proportions:")
print(df[target_col].value_counts(normalize=True).round(3))

# Verify feature composition
print("\n--- Feature Inventory ---")
print(f"Total features: {len(df.columns)}")
print(f"  - Predictors: {len(df.columns) - 1}")
print(f"  - Target: 1 ('{target_col}')")

# Check column names for post-placement features that could leak information.
# This is a name-based heuristic: it cannot detect leakage hidden behind a neutral name.
print("\n--- Data Leakage Check ---")
suspicious_keywords = ['salary', 'offer', 'package', 'compensation', 'hired']
leakage_found = False

for keyword in suspicious_keywords:
    if any(keyword in col.lower() for col in df.columns):
        print(f"WARNING: Potential leakage feature containing '{keyword}' detected")
        leakage_found = True

if not leakage_found:
    print("No obvious data leakage features detected.")
    print("No column name matches a post-placement keyword.")

#### Summary

The initial metadata analysis reveals a **clean, well-structured dataset** suitable for classification modeling. With **215 instances** across **13 features** (12 predictors and 1 target), the dataset contains **no missing values or duplicate records**, eliminating the need for imputation strategies at this stage.

The feature composition consists of **5 numerical variables** (four academic percentages plus the employment test percentage) and **8 categorical variables** (demographics, educational background, work experience, and the target). All categorical features have low cardinality (2-3 unique values), which simplifies encoding in the modeling phases. The numerical features share a common percentage scale (roughly 40-98%), with means of about 62-72% and standard deviations of about 5-13%, so their distributions are broadly comparable across assessment levels.

The target variable exhibits **moderate class imbalance**: *68.8% of students were placed* and *31.2% were not*. Model training and evaluation must account for this imbalance to avoid bias toward the majority class. The keyword check on column names flagged **no obvious data leakage**. Because this check is a name-based heuristic, it suggests, rather than proves, that every feature is available at prediction time.
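
To make the imbalance concrete, the short calculation below shows what a trivial "always predict Placed" model would score. It is a worked-arithmetic sketch: the count of 148 placed students is derived from the 215 instances and the 68.8% share reported above, and the variable names are illustrative.

```python
# Class counts implied by the figures above: 215 students, 68.8% placed.
n_total = 215
n_placed = 148                      # 148 / 215 rounds to 68.8%
n_not_placed = n_total - n_placed   # 67

# A trivial "model" that always predicts the majority class (Placed).
baseline_accuracy = n_placed / n_total

# That model never identifies a single Not Placed student.
baseline_recall_not_placed = 0 / n_not_placed

imbalance_ratio = n_placed / n_not_placed

print(f"Majority-class accuracy: {baseline_accuracy:.3f}")
print(f"Recall on 'Not Placed':  {baseline_recall_not_placed:.3f}")
print(f"Imbalance ratio:         {imbalance_ratio:.2f}:1")
```

Any classifier should therefore be judged against roughly 69% accuracy rather than 50%, and metrics such as recall or F1 on the minority class show what accuracy alone hides.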

### 4.2 Missing Values Analysis

In [10]:
# Verify missing values
print("=" * 60)
print("MISSING VALUES VERIFICATION")
print("=" * 60)
print("\nMissing values per feature:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

### 4.3 Feature Distributions, Scaling & Outliers

In [11]:
# Visualize distributions of numerical features
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    axes[idx].hist(df[col], bins=20, edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(axis='y', alpha=0.3)

# Remove empty subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()

In [12]:
# Detect outliers using boxplots
fig, axes = plt.subplots(1, 5, figsize=(18, 4))

for idx, col in enumerate(numerical_cols):
    axes[idx].boxplot(df[col], vert=True)
    axes[idx].set_title(f'{col}', fontsize=11, fontweight='bold')
    axes[idx].set_ylabel('Value')
    axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

In [13]:
# Quantitative outlier detection using IQR method
print("=" * 60)
print("OUTLIER DETECTION (IQR METHOD)")
print("=" * 60)

for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    
    print(f"\n{col}:")
    print(f"  Q1: {Q1:.2f}, Q3: {Q3:.2f}, IQR: {IQR:.2f}")
    print(f"  Bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
    print(f"  Outliers detected: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")

In [14]:
# Categorical features distribution
categorical_features = [col for col in categorical_cols if col != 'status']

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.ravel()

for idx, col in enumerate(categorical_features):
    value_counts = df[col].value_counts()
    axes[idx].bar(value_counts.index, value_counts.values, edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{col}', fontsize=11, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Count')
    axes[idx].tick_params(axis='x', rotation=45)
    
    # Add count labels on bars
    for i, v in enumerate(value_counts.values):
        axes[idx].text(i, v + 1, str(v), ha='center', va='bottom')

# Remove empty subplot
fig.delaxes(axes[7])

plt.tight_layout()
plt.show()

#### Summary

**Numerical Feature Distributions:** The histogram analysis reveals that most numerical features exhibit approximately **normal distributions** with slight variations. Academic performance metrics (SSC, HSC, degree, and MBA percentages) are centered around their respective means (62-72%), with the majority of students scoring between 50% and 85%. The `emp_test_percentage` shows a more **uniform distribution** across its range, suggesting diverse performance levels on employment assessments. All features are naturally bounded within the percentage scale, maintaining consistency in measurement units.

**Outlier Analysis:** The IQR-based outlier detection identified **minimal outliers** across the dataset. Only **8 outliers (3.7%)** were detected in `hsc_percentage` and **1 outlier (0.5%)** in `degree_percentage`, while other features showed **no outliers**. The boxplots confirm this finding, with `hsc_percentage` displaying several lower-bound outliers (students with unusually low HSC scores around 37-42%). These outliers represent legitimate data points rather than errors and may provide valuable information about placement outcomes for lower-performing students. Given their small proportion, **no removal is recommended** at this stage.

**Categorical Feature Distributions:** Several categorical features are **unevenly distributed across their categories**. Gender splits into **139 males (64.7%)** and **76 females (35.3%)**. Academic backgrounds are concentrated: **Commerce dominates HSC subjects** (113 students), **Comm&Mgmt leads undergraduate degrees** (145 students), and most students have **no work experience (141 vs. 74)**. Board affiliations are relatively balanced between Central and Others. The **Mkt&Fin specialization** moderately outnumbers Mkt&HR (120 vs. 95). These imbalances matter for feature encoding and model interpretation, because sparsely represented categories give a model few examples to learn from.
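
One way to check for sparsely represented categories before encoding is to tabulate each level's share and flag those below a threshold. The sketch below is illustrative only: `category_shares`, the 10% threshold, and the toy columns are assumptions, not part of this notebook's pipeline.

```python
import pandas as pd

def category_shares(frame: pd.DataFrame, min_share: float = 0.10) -> pd.DataFrame:
    """Return one row per (feature, level) with its count, share, and a rarity flag."""
    rows = []
    for col in frame.columns:
        counts = frame[col].value_counts()
        for level, count in counts.items():
            share = count / len(frame)
            rows.append({
                "feature": col,
                "level": level,
                "count": int(count),
                "share": round(share, 3),
                "rare": bool(share < min_share),
            })
    return pd.DataFrame(rows)

# Toy frame with hypothetical columns and values (not the placement data).
toy = pd.DataFrame({
    "group": ["x"] * 13 + ["y"] * 6 + ["z"] * 1,
    "flag": ["yes"] * 7 + ["no"] * 13,
})
report = category_shares(toy, min_share=0.10)
print(report)
```

On the real data, the same helper could be called on `df[categorical_features]`. Flagged levels are candidates for merging or for cautious interpretation rather than automatic removal.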

### 4.4 Target Feature Study

In [15]:
# Target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Count plot
status_counts = df['status'].value_counts()
axes[0].bar(status_counts.index, status_counts.values, edgecolor='black', alpha=0.7, color=['green', 'red'])
axes[0].set_title('Target Distribution (Count)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Status')
axes[0].set_ylabel('Count')
for i, v in enumerate(status_counts.values):
    axes[0].text(i, v + 2, str(v), ha='center', va='bottom', fontweight='bold')

# Pie chart
axes[1].pie(status_counts.values, labels=status_counts.index, autopct='%1.1f%%', 
            startangle=90, colors=['green', 'red'], explode=(0.05, 0))
axes[1].set_title('Target Distribution (Proportion)', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

print("=" * 60)
print("TARGET VARIABLE ANALYSIS")
print("=" * 60)
print("\nClass distribution:")
print(status_counts)
print("\nClass proportions:")
print(df['status'].value_counts(normalize=True).round(3))
print(f"\nClass imbalance ratio: {status_counts.max() / status_counts.min():.2f}:1")

In [16]:
# Numerical features comparison by target class
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    placed = df[df['status'] == 'Placed'][col]
    not_placed = df[df['status'] == 'Not Placed'][col]
    
    axes[idx].hist([placed, not_placed], bins=15, label=['Placed', 'Not Placed'], 
                   edgecolor='black', alpha=0.7, color=['green', 'red'])
    axes[idx].set_title(f'{col} by Status', fontsize=11, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')
    axes[idx].legend()
    axes[idx].grid(axis='y', alpha=0.3)

# Remove empty subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()

In [17]:
# Statistical comparison of numerical features by target class
print("=" * 60)
print("NUMERICAL FEATURES - MEAN COMPARISON BY STATUS")
print("=" * 60)

comparison = df.groupby('status')[numerical_cols].mean()
print("\nMean values by placement status:")
print(comparison.round(2))

print("\n" + "=" * 60)
print("DIFFERENCE (Placed - Not Placed)")
print("=" * 60)
difference = comparison.loc['Placed'] - comparison.loc['Not Placed']
print(difference.round(2))

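# Visualize mean comparison
# groupby sorts the index alphabetically ('Not Placed' first), so reorder explicitly
# to keep Placed = green and Not Placed = red, as in the other plots
comparison.loc[['Placed', 'Not Placed']].T.plot(kind='bar', figsize=(12, 5), edgecolor='black', alpha=0.7, color=['green', 'red'])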
plt.title('Mean Comparison of Numerical Features by Status', fontsize=13, fontweight='bold')
plt.xlabel('Features')
plt.ylabel('Mean Value (%)')
plt.xticks(rotation=45, ha='right')
plt.legend(title='Status')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [18]:
# Categorical features vs target - placement rates
categorical_features = [col for col in categorical_cols if col != 'status']

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.ravel()

for idx, col in enumerate(categorical_features):
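    # Create crosstab for placement rates (row percentages)
    ct = pd.crosstab(df[col], df['status'], normalize='index') * 100
    # crosstab sorts columns alphabetically; reorder so Placed = green, Not Placed = red
    ct = ct[['Placed', 'Not Placed']]
    ct.plot(kind='bar', ax=axes[idx], edgecolor='black', alpha=0.7, color=['green', 'red'])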
    axes[idx].set_title(f'Placement Rate by {col}', fontsize=10, fontweight='bold')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Percentage (%)')
    axes[idx].legend(title='Status', fontsize=8)
    axes[idx].tick_params(axis='x', rotation=45)
    axes[idx].grid(axis='y', alpha=0.3)

# Remove empty subplot
fig.delaxes(axes[7])

plt.tight_layout()
plt.show()

#### Summary

The findings for this section are pending the output of the cells above. The analysis addresses three questions: how large the class imbalance is, how the numerical features differ between placed and not-placed students, and how placement rates vary across categorical features. The answers will inform feature selection and model training in the following sections.
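
Raw mean differences are easier to rank when expressed in units of spread. As one possible complement to the mean comparison above, the sketch below computes a standardized mean difference (Cohen's d with a pooled standard deviation). The helper `cohens_d` and the toy values are assumptions for illustration and are not results from the placement data.

```python
import numpy as np
import pandas as pd

def cohens_d(a: pd.Series, b: pd.Series) -> float:
    """Standardized mean difference between two groups, using the pooled SD."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)
    ) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Toy values with hypothetical scores (not taken from the placement data).
toy = pd.DataFrame({
    "status": ["Placed"] * 4 + ["Not Placed"] * 4,
    "score": [70.0, 72.0, 74.0, 76.0, 60.0, 62.0, 64.0, 66.0],
})
placed = toy.loc[toy["status"] == "Placed", "score"]
not_placed = toy.loc[toy["status"] == "Not Placed", "score"]

d = cohens_d(placed, not_placed)
print(round(d, 2))
```

Applied per column of `numerical_cols`, this would give a scale-free ranking of how strongly each numerical feature separates the two classes.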

### 4.5 Feature Correlation & Selection

### 4.6 Unsupervised Clustering

### 4.7 Interpretations & Conclusions

---

## 5. ML Baseline & Ensemble Models

### 5.1 Train/Validation/Test Splits

### 5.2 Pipelines & Models

### 5.3 Training & Validation

### 5.4 Testing

### 5.5 Results Interpretation & Discussion

---

## 6. Enhanced Models & Hyperparameter Tuning

### 6.1 Justification of Choices

### 6.2 Hyperparameter Optimization

### 6.3 Final Results & Analysis

---

## 7. Conclusion & Future Work