# The Hired Hand

**Machine Learning for Job Placement Prediction**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Angry-Jay/ML_TheHiredHand/blob/main/ml-the-hired-hand.ipynb)

---

## Table of Contents

- [Project Overview](#project-overview)
- [Dataset Information](#dataset-information)
- [Project Objectives](#project-objectives)
- [Methodology](#methodology)

---

## Project Overview

This project applies machine learning to predict employment outcomes for graduating students using the **Job Placement Dataset** from Kaggle. The goal is a binary classification model that predicts whether a student will be placed (employed) or not placed, based on demographic, academic, and professional attributes.

---

## Dataset Information

- **Dataset Name:** Job Placement Dataset
- **Source:** [Kaggle - Job Placement Dataset](https://www.kaggle.com/datasets/ahsan81/job-placement-dataset/data)
- **Size:** Small/medium-sized tabular dataset
- **Type:** Binary classification problem

**Features Include:**
- **Academic Performance:** Secondary education percentage, higher secondary percentage, degree percentage
- **Educational Background:** Board of education, degree specialization, field of study
- **Professional Attributes:** Work experience, employability test scores
- **Target Variable:** Placement status (Placed / Not Placed)

**Dataset Characteristics:**
- Heterogeneous data (numerical and categorical features)
- Potential class imbalance
- Real-world data with possible missing values
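
The three characteristics listed above can each be checked in a few lines before any modeling. The sketch below runs on a toy frame; the column names and values are hypothetical stand-ins, not taken from the real dataset, and `summarize` is an illustrative helper rather than part of this project.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are hypothetical stand-ins for the real data.
toy = pd.DataFrame({
    "degree_percentage": [65.0, 72.5, np.nan, 58.0, 80.0, 61.0],
    "work_experience": ["No", "Yes", "No", None, "Yes", "No"],
    "status": ["Placed", "Placed", "Not Placed", "Placed", "Placed", "Not Placed"],
})


def summarize(frame: pd.DataFrame, target: str) -> dict:
    """Report feature types, missing-value counts, and class shares."""
    features = frame.drop(columns=[target])
    return {
        # Heterogeneous data: split features by dtype
        "numeric": features.select_dtypes(include="number").columns.tolist(),
        "categorical": features.select_dtypes(exclude="number").columns.tolist(),
        # Missing values per column
        "missing": frame.isnull().sum().to_dict(),
        # Class imbalance: share of each target label
        "class_share": frame[target].value_counts(normalize=True).to_dict(),
    }


summary = summarize(toy, target="status")
print(summary)
```

If the class shares turn out to be far from even, accuracy alone can be misleading, which is one reason to also report precision, recall, and F1.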

---

## Project Objectives

1. **Predict employment outcomes** (Placed vs. Not Placed) with high accuracy and reliability
2. **Conduct comprehensive data analysis** including:
   - Data cleaning and preprocessing
   - Exploratory Data Analysis (EDA)
   - Feature engineering and selection
   - Correlation analysis
3. **Build and evaluate multiple classification models** with proper validation
4. **Identify key employability factors** through feature importance analysis
5. **Apply ML best practices** throughout the pipeline
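
As a rough illustration of objectives 3 and 5, the sketch below wraps preprocessing and a classifier in a single scikit-learn `Pipeline` and scores it with stratified cross-validation. It runs on synthetic data with hypothetical column names, and the choice of logistic regression, median imputation, and F1 scoring is an assumption made for the example, not the project's final configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data; column names are hypothetical, not the real dataset schema.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "degree_percentage": rng.normal(65, 8, n),
    "work_experience": rng.choice(["Yes", "No"], n),
})
# Synthetic target loosely tied to the features so there is signal to learn
signal = (X["degree_percentage"] - 65) / 8 + (X["work_experience"] == "Yes") * 1.0
y = (signal + rng.normal(0, 1, n) > 0).astype(int)

numeric = ["degree_percentage"]
categorical = ["work_experience"]

# Separate preprocessing per feature type
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(f"F1 per fold: {np.round(scores, 3)}")
```

Keeping the imputers, scaler, and encoder inside the pipeline means they are refit on each training fold, so no statistics from the validation fold leak into preprocessing.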

## 2. Setting up imports and environment

In [2]:
# Setting up
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing & Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Model Selection & Tuning
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_val_score,
    train_test_split,
)

# Models
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Metrics
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Configuration
%matplotlib inline

## 3. Data Access

In [None]:
DATA_URL = "https://raw.githubusercontent.com/Angry-Jay/ML_TheHiredHand/refs/heads/main/Job_Placement_Data.csv"

try:
    # Load the dataset directly into a pandas DataFrame
    df = pd.read_csv(DATA_URL)

    print("Dataset loaded successfully!")
    print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")

    # Display the first 5 rows to verify
    display(df.head())

except Exception as e:
    print("Error loading data. Check the URL and your network connection.")
    print(f"Error details: {e}")
    # Re-raise so later cells do not run against an undefined `df`
    raise
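
A load can succeed and still return a frame without the columns later cells expect. One way to catch that early is a small schema check right after reading. The sketch below is illustrative only: `load_and_check` is a hypothetical helper, and the inline CSV with its column names stands in for the downloaded file.

```python
import io

import pandas as pd

# Inline CSV standing in for the downloaded file; the columns are hypothetical.
csv_text = """degree_percentage,work_experience,status
65.0,No,Placed
58.0,Yes,Not Placed
"""


def load_and_check(source, required_columns):
    """Read a CSV and raise early if an expected column is absent."""
    frame = pd.read_csv(source)
    missing = [c for c in required_columns if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame


# pd.read_csv accepts a URL, a path, or a file-like object such as StringIO
toy_df = load_and_check(io.StringIO(csv_text), required_columns=["status"])
print(toy_df.shape)
```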

## 4. Initial data inspection & cleaning

In [None]:
df.info()

# Drop irrelevant or leakage-prone columns.
# 'sl_no' is only a row ID. 'salary' leaks the target because only placed students have one.
columns_to_drop = ['sl_no', 'salary']
# errors='ignore' skips any listed column that is absent from this version of the dataset
df_clean = df.drop(columns=columns_to_drop, errors='ignore')

# Check for duplicate rows
duplicates = df_clean.duplicated().sum()
print(f"Number of duplicates found: {duplicates}")
if duplicates > 0:
    # drop_duplicates returns a new DataFrame; with inplace=True it returns None
    df_clean = df_clean.drop_duplicates()
    print("Duplicates removed")

# Check for Missing Values
print("\n--- Missing Values per Feature ---")
print(df_clean.isnull().sum())

# Display statistical summary for numerical features
print("\n--- Numerical Feature Statistics ---")
display(df_clean.describe())

# Display sample rows
print("\n--- Sample Data ---")
display(df_clean.head())
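
Two points from the cleaning step are easy to verify on a toy frame: why a `salary` column would leak the target, and what `drop_duplicates` returns when called with `inplace=True`. The rows below are made up for illustration and assume, as the comment in the cell states, that only placed students have a salary.

```python
import numpy as np
import pandas as pd

# Toy rows with hypothetical values; the last row duplicates the first.
toy = pd.DataFrame({
    "degree_percentage": [65.0, 58.0, 72.0, 65.0],
    "status": ["Placed", "Not Placed", "Placed", "Placed"],
    "salary": [270000.0, np.nan, 250000.0, 270000.0],
})

# Leakage: whether salary is present reproduces the label exactly,
# so a model given this column would not need any other feature.
leaks = bool((toy["salary"].notna() == (toy["status"] == "Placed")).all())
print(f"Salary presence matches status: {leaks}")

# With inplace=True the frame is modified and the call returns None,
# so assigning the result back would overwrite the frame with None.
result = toy.copy().drop_duplicates(inplace=True)
print(f"Return value with inplace=True: {result}")

# Without inplace, a new de-duplicated frame is returned.
deduped = toy.drop(columns=["salary"]).drop_duplicates()
print(f"Rows before: {len(toy)}, after: {len(deduped)}")
```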