# K-Nearest Neighbours – Challenge: Adult Income Classification

## Overview

This notebook is designed as a hands-on coding challenge for beginners. You'll build a k-NN classifier with scikit-learn to predict whether someone's annual income exceeds $50K.

**Goal Question:** Can we predict whether someone's annual income exceeds $50K based on their personal and work attributes?

## About the Dataset

**Data Source:** [Income Dataset from Kaggle](https://www.kaggle.com/datasets/mastmustu/income)

### Dataset Details:
- **Total samples:** 32,561 adults
- **Features:** 14 attributes (mix of numerical and categorical)
- **Target:** Income bracket (≤50K or >50K)
- **Real-world application:** Economic analysis, policy making, demographic studies

### Features in Our Dataset:

| Feature | Description | Type | Example Values |
|---------|-------------|------|----------------|
| **age** | Age in years | Numerical | 25, 38, 44 |
| **workclass** | Employment type | Categorical | Private, Self-emp, Gov |
| **fnlwgt** | Census sampling weight | Numerical | 226802, 89814 |
| **education** | Education level | Categorical | HS-grad, Bachelors, Masters |
| **educational-num** | Education level as an ordinal number | Numerical | 9, 13, 16 |
| **marital-status** | Marital status | Categorical | Married, Single, Divorced |
| **occupation** | Job type | Categorical | Tech-support, Craft-repair |
| **relationship** | Family relationship | Categorical | Husband, Wife, Own-child |
| **race** | Race | Categorical | White, Black, Asian-Pac-Islander |
| **gender** | Gender | Categorical | Male, Female |
| **capital-gain** | Capital gains | Numerical | 0, 7688, 3103 |
| **capital-loss** | Capital losses | Numerical | 0, 1848, 323 |
| **hours-per-week** | Work hours per week | Numerical | 40, 50, 30 |
| **native-country** | Country of origin | Categorical | United-States, Mexico, India |

**Target Variable:** `income` - Income bracket (≤50K or >50K)

## What You'll Learn

Through this exercise, you will:
- Handle missing values in real-world data
- Encode categorical variables for machine learning
- Apply feature scaling for k-NN
- Implement k-NN classification with hyperparameter tuning
- Evaluate a binary classifier with a confusion matrix
- Understand precision, recall, and class imbalance

## Instructions

**Reference Material:** Look at `Glass Classification.ipynb` for coding examples and patterns to follow.

💡 **Key Reminders:**
- This is a **binary classification** problem (2 classes: ≤50K, >50K)
- k-NN requires **feature scaling** - very important!
- Use the **confusion matrix** as your primary evaluation tool
- Handle **missing values** (marked as '?' in the dataset)
- **Encode categorical variables** before using k-NN

Let's begin!

## Step 1: Import Required Libraries

You'll need several Python libraries for this exercise. Import the essential tools for:
- **Data manipulation:** pandas, numpy
- **Visualisation:** matplotlib, seaborn  
- **Machine learning:** scikit-learn modules
- **Data preprocessing:** StandardScaler, LabelEncoder, train_test_split
- **k-NN algorithm:** KNeighborsClassifier
- **Evaluation:** classification_report, confusion_matrix, cross_val_score

💡 **Hint:** Look at the Glass Classification notebook to see which specific imports you need!

In [None]:
# TODO: Import all required libraries here
# Refer to Glass Classification.ipynb for the exact imports needed

# Data manipulation


# Visualisation  


# Machine learning


## Step 2: Load and Explore the Dataset

Your first task is to load the Adult dataset and understand its structure. The dataset is stored in `Data/adult.csv`.

### What to do:
1. **Load the data** using pandas
2. **Examine the shape** - How many samples and features?
3. **Check for missing values** - Look for '?' entries
4. **Display basic statistics** for numerical columns
5. **Show value counts** for categorical columns
6. **Check the target distribution** - How balanced are the income classes?

### Key Questions to Answer:
- How many people earn ≤50K vs >50K?
- Which features have missing values?
- What are the most common values for categorical features?
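
The checks listed above can be sketched on a tiny hand-made table. This is only an illustration: the rows are invented, the label spellings `<=50K` and `>50K` are an assumption about how the CSV writes the classes, and in the real notebook you would load `Data/adult.csv` with `pd.read_csv` instead.

```python
import pandas as pd

# Toy stand-in for the real file (invented rows, for illustration only).
# In the notebook you would use: df = pd.read_csv("Data/adult.csv")
df = pd.DataFrame({
    "age": [25, 38, 44, 52, 29],
    "workclass": ["Private", "?", "Private", "Self-emp", "?"],
    "occupation": ["Tech-support", "?", "Craft-repair", "Sales", "Sales"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

print(df.shape)       # (rows, columns)
print(df.describe())  # summary statistics for the numerical columns

# Count '?' placeholders per column. isnull() would miss them because
# '?' is an ordinary string, not NaN.
missing_counts = (df == "?").sum()
print(missing_counts[missing_counts > 0])

# Class balance as proportions rather than raw counts
class_share = df["income"].value_counts(normalize=True)
print(class_share)
```

If the real file pads its values with spaces (for example `' ?'`), the comparison against `"?"` will find nothing, so it is worth printing a few unique values per column before trusting a count of zero.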

In [None]:
# TODO: Load the adult.csv dataset


# TODO: Explore the dataset structure


# TODO: Check for missing values (look for '?' entries)


# TODO: Check target variable distribution


# TODO: Display basic info about the dataset


## Step 3: Visualise the Data

Create visualisations to understand patterns in the data. This will help you understand which features might be important for predicting income.

### Suggested Visualisations:
1. **Income distribution** - Bar chart of ≤50K vs >50K counts
2. **Age vs Income** - Box plot or histogram showing age distribution by income bracket
3. **Education vs Income** - Count plot showing income by education level
4. **Work hours vs Income** - Box plot of hours-per-week by income bracket
5. **Correlation heatmap** - For numerical features only

### What to Look For:
- Which age groups tend to earn more?
- How does education level relate to income?
- Are there clear patterns that k-NN could learn from?
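
As a rough sketch of the first two suggested plots, the example below uses plain matplotlib on a few invented rows. The column names follow the feature table, while the data, figure size, and label spellings are assumptions made for illustration. The Glass Classification notebook may use seaborn helpers instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows standing in for the real dataset
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 44, 47, 52, 58],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K"],
})

fig, (ax_counts, ax_age) = plt.subplots(1, 2, figsize=(10, 4))

# 1. Income distribution: one bar per class
counts = df["income"].value_counts()
ax_counts.bar(counts.index, counts.values)
ax_counts.set_title("Income distribution")
ax_counts.set_ylabel("Number of people")

# 2. Age vs income: one box per income bracket
labels = []
groups = []
for name, grp in df.groupby("income"):
    labels.append(name)
    groups.append(grp["age"].to_numpy())
ax_age.boxplot(groups)
ax_age.set_xticks(range(1, len(labels) + 1))
ax_age.set_xticklabels(labels)
ax_age.set_title("Age by income bracket")
ax_age.set_ylabel("Age")

fig.tight_layout()
```

A box plot makes it easy to see whether the two brackets overlap heavily on a feature. Heavy overlap suggests that feature alone will not separate the classes well for k-NN.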

In [None]:
# TODO: Create visualisations to understand the data
# Set up plotting style similar to Glass Classification notebook

# TODO: 1. Income distribution bar chart
# TODO: 2. Age vs Income box plot
# TODO: 3. Education vs Income (you'll need to handle this after data cleaning)
# TODO: 4. Hours per week vs Income
# TODO: 5. Correlation heatmap for numerical columns only

## Step 4: Data Preprocessing - The Most Important Step

This is where the real work happens! You need to clean and prepare the data for k-NN classification.

### Major Preprocessing Tasks:

1. **Handle Missing Values:** Replace '?' with the column's most common value, or remove the affected rows
2. **Encode Categorical Variables:** Convert text categories to numbers using LabelEncoder or One-Hot Encoding
3. **Feature Scaling:** Standardise all features (crucial for k-NN), fitting the scaler on the training data only
4. **Train/Test Split:** Split the data for proper evaluation, before fitting the scaler so the test set stays unseen

### Why Each Step Matters:

- **Missing values:** Distances cannot be computed when a feature value is missing
- **Categorical encoding:** k-NN measures distances, so it needs numerical input
- **Feature scaling:** Features with large ranges (capital-gain) would otherwise dominate the distance over features with small ranges (age)
- **Train/test split:** You need unseen data to evaluate the model fairly
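
To make the scaling point concrete, here is a small worked example in plain Python. The three people and their values are hypothetical; only the two feature names come from the dataset.

```python
import math

# Three hypothetical people described by (age, capital-gain)
a = (25, 0)
b = (60, 0)      # very different age, same capital gain as a
c = (25, 3000)   # same age as a, different capital gain


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


raw_ab = euclidean(a, b)   # 35.0
raw_ac = euclidean(a, c)   # 3000.0, so capital-gain swamps the age difference


def standardise(points):
    """Rescale each feature to (value - mean) / std, using these points only."""
    cols = list(zip(*points))
    means = [sum(col) / len(col) for col in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
        for col, m in zip(cols, means)
    ]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]


sa, sb, sc = standardise([a, b, c])
scaled_ab = euclidean(sa, sb)   # about 2.12
scaled_ac = euclidean(sa, sc)   # about 2.12

print(f"raw:    a-b = {raw_ab:.1f}, a-c = {raw_ac:.1f}")
print(f"scaled: a-b = {scaled_ab:.2f}, a-c = {scaled_ac:.2f}")
```

Before scaling, person `c` looks almost a hundred times further from `a` than `b` does, purely because capital-gain is measured in larger numbers. After standardising, both differences carry equal weight.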

### Key Challenge:
This dataset has many categorical features, unlike the Glass Classification dataset. You'll need to decide:
- Which categorical features to keep?
- How to encode them (LabelEncoder vs One-Hot)?
- How to handle the many categories in some features?
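
One possible way to chain the four preprocessing tasks is sketched below on invented data. It is not the only valid answer: it assumes mode imputation, one-hot encoding via `pd.get_dummies`, and a 75/25 stratified split, all of which are choices you are free to change. The label spellings are again assumed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Invented rows standing in for the real dataset
df = pd.DataFrame({
    "age": [25, 38, 44, 52, 29, 61, 33, 47],
    "hours-per-week": [40, 50, 30, 60, 40, 20, 45, 55],
    "workclass": ["Private", "?", "Private", "Self-emp", "Private", "Gov", "?", "Gov"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

# 1. Missing values: turn '?' into NaN, then fill with the most common value
df = df.replace("?", np.nan)
df["workclass"] = df["workclass"].fillna(df["workclass"].mode()[0])

# 2. Encoding: one-hot encode the nominal feature, map the target to 0/1
X = pd.get_dummies(df.drop(columns="income"), columns=["workclass"], dtype=float)
y = (df["income"] == ">50K").astype(int)

# 3. Split first, stratifying so both sets keep a similar class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# 4. Scaling: fit on the training set only, then reuse that fit on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X.columns.tolist())
print(X_train_scaled.shape, X_test_scaled.shape)
```

One-hot encoding is used here because LabelEncoder would assign arbitrary integers to categories such as `Private` and `Gov`, and k-NN would then treat those integers as meaningful distances. The cost is more columns, which matters for features with many categories such as native-country.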

In [None]:
# TODO: Data Preprocessing Steps

In [None]:
# TODO: Hyperparameter Tuning for k-NN
# Find the best k value using cross-validation (similar to Glass Classification)
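
A minimal sketch of a cross-validated search over k follows. It runs on synthetic data from `make_classification` rather than the income dataset, and the range of k values, the 5 folds, and the accuracy scoring are all assumptions rather than the settings used in the Glass Classification notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed training data (roughly 75/25 classes)
X_train, y_train = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=0,
)

k_values = range(1, 22, 2)  # odd k avoids tied votes in a two-class problem
mean_scores = {}
for k in k_values:
    # Scaling sits inside the pipeline, so each fold is scaled using
    # statistics from its own training portion only
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(f"Best k: {best_k} (mean CV accuracy {mean_scores[best_k]:.3f})")
```

Because the classes are imbalanced, you might also try `scoring="f1"` or `scoring="recall"` to see whether a different k is preferred when the minority class matters more.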

## Step 5: Train Final Model and Evaluate Performance

Now train your final k-NN model using the best k value and evaluate it on the test set.

### Evaluation Focus:
Since this is a **binary classification** problem, pay special attention to:
- **Overall accuracy:** How often do we predict correctly?
- **Precision and Recall:** Especially for the >50K class (minority class)
- **Confusion Matrix:** Where does our model make mistakes?
- **Class imbalance:** Do we predict the majority class (≤50K) too often?

### Important Questions:
- Is the model better at predicting ≤50K or >50K incomes?
- What's the trade-off between precision and recall?
- How does performance compare to always predicting the majority class?
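
The majority-class comparison can be sketched with scikit-learn's `DummyClassifier`. The data below is synthetic, `n_neighbors=7` is a placeholder for whatever k your tuning selected, and the class names passed to the report are assumed spellings.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly 75% of samples in class 0
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# k-NN with scaling; 7 is a placeholder for your tuned k
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
knn_acc = accuracy_score(y_test, y_pred)

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"k-NN accuracy:     {knn_acc:.3f}")
print(classification_report(y_test, y_pred, target_names=["<=50K", ">50K"]))
```

If the k-NN accuracy is only slightly above the baseline, the per-class precision and recall in the report tell you more than the headline accuracy does.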

In [None]:
# TODO: Train final k-NN model and make predictions

## Step 6: Confusion Matrix Analysis

Create and analyse the confusion matrix to understand your model's performance in detail.

### What the Confusion Matrix Shows:
- **True Negatives (TN):** Correctly predicted ≤50K
- **False Positives (FP):** Incorrectly predicted >50K (Type I error)
- **False Negatives (FN):** Incorrectly predicted ≤50K (Type II error)
- **True Positives (TP):** Correctly predicted >50K
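
The four cells can be read straight off scikit-learn's `confusion_matrix`. In this sketch the labels are invented, with `1` standing for >50K (the positive class) and `0` for ≤50K. That encoding is an assumption, so adapt it to however you encoded the target.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 0 = <=50K (negative), 1 = >50K (positive)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)   # of those predicted >50K, how many really are
recall = tp / (tp + fn)      # of those really >50K, how many we found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Here the model reaches 70% accuracy while finding only half of the high earners, which is the kind of gap that accuracy alone hides on imbalanced data.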

### Key Analysis Questions:
- Does the model have a bias toward predicting one class?
- Which type of error is more common?
- How does this compare to the Glass Classification results?

In [None]:
# TODO: Create and visualise the confusion matrix

## Reflection and Next Steps

### Questions to Consider:
1. **How did your k-NN model perform?** Compare your accuracy to the majority-class baseline (always predicting ≤50K).
2. **What was the biggest challenge?** Handling categorical variables? Class imbalance?
3. **Which features seem most important?** Based on your visualisations, what drives high income?
4. **How did this differ from Glass Classification?** What made this problem harder/easier?

### Key Learnings:
- **Real-world data is messy** - missing values and categorical variables are common
- **Feature preprocessing is crucial** - especially encoding and scaling
- **Class imbalance matters** - accuracy alone might be misleading
- **Domain knowledge helps** - understanding what features mean aids interpretation