# Census Income Classification

**Author:** Nathan Schaaf<br>
**Date:** 02/15/2025<br>
**Class:** DSBA 6162 - Data Mining

## Overview
This notebook applies **Decision Tree** and **Random Forest** models to predict whether an individual’s income exceeds **$50K per year** using the **Census Income** dataset.

## Steps Covered:
### Data Preprocessing
   - Load the dataset
   - Drop rows with missing values
   - Remove categorical variables with more than 32 levels
   - Encode categorical variables  

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [3]:
# Load the dataset
url = "https://raw.githubusercontent.com/NRSchaaf/census-income-ml/refs/heads/main/AdultUCI.csv"
df = pd.read_csv(url)

In [4]:
df.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,small
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,small
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,small
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,small
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,small


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      46033 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  47985 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [6]:
# Drop rows with missing values (this also removes the 16,281 rows without an income label)
df.dropna(inplace=True)

df.shape

(30162, 15)

In [7]:
# Remove categorical variables with more than 32 levels (specific requirement for this assignment)

# Identify categorical columns with more than 32 unique values
categorical_columns = df.select_dtypes(include=['object']).columns
high_cardinality_cols = [col for col in categorical_columns if df[col].nunique() > 32]

# Drop these columns
df.drop(columns=high_cardinality_cols, inplace=True)
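
The cardinality filter above can be checked on a small example before it is applied to the real data. The sketch below uses a hypothetical toy frame (the column names `city`, `sex`, and `age` are illustrative, not taken from the dataset) and selects non-numeric columns with `exclude=["number"]`, which is an assumption made here so that the example behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy data: 'city' has 40 distinct levels, 'sex' has 2
toy = pd.DataFrame({
    "city": [f"city_{i}" for i in range(40)],
    "sex": ["Male", "Female"] * 20,
    "age": list(range(40)),
})

max_levels = 32  # same threshold as in the cell above

# Count the distinct levels of every non-numeric column
level_counts = toy.select_dtypes(exclude=["number"]).nunique()

# Keep only the names of columns that exceed the threshold
high_cardinality = level_counts[level_counts > max_levels].index.tolist()

print(level_counts)
print("Columns to drop:", high_cardinality)
```

Printing the level counts first makes it easy to see which columns are close to the threshold before anything is dropped.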


In [8]:
# Encode categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns

# Apply one-hot encoding
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
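
Because `drop_first=True` is also applied to the target, the `income` column is reduced to a single indicator named after the level that is kept. The sketch below illustrates this on hypothetical toy rows; the label `"large"` for the second income level is an assumption, since only `"small"` is visible in the output above.

```python
import pandas as pd

# Hypothetical toy rows; "large" is an assumed label for the other income level
toy = pd.DataFrame({
    "income": ["small", "large", "small"],
    "sex": ["Male", "Female", "Male"],
})

# drop_first=True removes the first level of each column in sorted order,
# so "large" and "Female" are dropped and one indicator per column remains
encoded = pd.get_dummies(toy, columns=["income", "sex"], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

In this toy example the remaining columns are `income_small` and `sex_Male`, so a true value in `income_small` marks a row whose income level is `"small"`.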

In [9]:
df.head()

,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,...,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Male,income_small
0,39,77516,13,2174,0,40,False,False,False,False,...,False,False,False,False,False,False,False,True,True,True
1,50,83311,13,0,0,13,False,False,False,True,...,False,False,False,False,False,False,False,True,True,True
2,38,215646,9,0,0,40,False,True,False,False,...,False,False,False,False,False,False,False,True,True,True
3,53,234721,7,0,0,40,False,True,False,False,...,False,False,False,False,False,True,False,False,True,True
4,28,338409,13,0,0,40,False,True,False,False,...,False,False,False,True,False,True,False,False,False,True


### Train-Test Split
   - Split the dataset into **80% training** and **20% test** data

In [11]:
# Define features and target
X = df.drop(columns=['income_small'])  # 'income_small' is the one-hot encoded target (True = "small" income)
y = df['income_small']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
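
One way to confirm that `stratify` keeps the class balance the same in both splits is to compare the share of the positive class in each. The sketch below does this on synthetic data; the arrays `X_demo` and `y_demo` and the 75/25 class ratio are illustrative assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows with a 75% / 25% class balance (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([True] * 750 + [False] * 250)

# Same split settings as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of the True class in each split
train_share = y_tr.mean()
test_share = y_te.mean()
print(f"Train share of True: {train_share:.3f}")
print(f"Test share of True:  {test_share:.3f}")
```

The same check can be run on `y_train` and `y_test` from the notebook by calling `.mean()` on each.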

### Model Building
   - Train a **Decision Tree** model and evaluate accuracy
   - Train a **Random Forest** model with **50 trees** and evaluate accuracy

**Decision Tree**

In [14]:
# Initialize the Decision Tree classifier
dt_model = DecisionTreeClassifier(random_state=42)

# Train the model
dt_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_dt = dt_model.predict(X_test)

# Evaluate accuracy
dt_accuracy = accuracy_score(y_test, y_pred_dt)
print(f"Decision Tree Model Accuracy: {dt_accuracy:.4f}")

Decision Tree Model Accuracy: 0.8071
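
A tree grown without a depth limit can fit the training data almost perfectly, so comparing training and test accuracy is a simple way to look for overfitting. The sketch below runs this comparison on synthetic data from `make_classification`; the data, the `max_depth=5` setting, and the variable names are assumptions for illustration and were not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise (flip_y)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# An unrestricted tree versus a depth-limited tree (max_depth=5 is an arbitrary choice)
models = {
    "full": DecisionTreeClassifier(random_state=42),
    "shallow": DecisionTreeClassifier(max_depth=5, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    results[name] = (train_acc, test_acc)
    print(f"{name:8s} train = {train_acc:.4f}  test = {test_acc:.4f}")
```

A large gap between training and test accuracy for the unrestricted tree would suggest that it is memorizing noise in the training set.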


**Random Forest**

In [15]:
# Initialize the Random Forest classifier with 50 trees
rf_model = RandomForestClassifier(n_estimators=50, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test)

# Evaluate the accuracy of the Random Forest model
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {rf_accuracy:.4f}")

# Optionally, display classification report for more details
print("\nClassification Report for Random Forest:")
print(classification_report(y_test, y_pred_rf))

Random Forest Accuracy: 0.8498

Classification Report for Random Forest:
              precision    recall  f1-score   support

       False       0.73      0.63      0.68      1502
        True       0.88      0.92      0.90      4531

    accuracy                           0.85      6033
   macro avg       0.81      0.78      0.79      6033
weighted avg       0.84      0.85      0.85      6033
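
The per-class recall values in the report can be read directly from a confusion matrix. The sketch below shows the calculation on a small hand-made example; the label arrays are hypothetical and only illustrate how the rows of the matrix map to the `False` and `True` classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 3 rows of class False, 5 rows of class True
y_true = np.array([False, False, False, True, True, True, True, True])
y_pred = np.array([False, False, True, True, True, True, True, False])

# Rows are the true classes, columns the predicted classes, in the order [False, True]
cm = confusion_matrix(y_true, y_pred, labels=[False, True])
print(cm)

# Recall for a class = correct predictions in its row / total of that row
recall_false = cm[0, 0] / cm[0].sum()
recall_true = cm[1, 1] / cm[1].sum()
print(f"Recall (False): {recall_false:.2f}")
print(f"Recall (True):  {recall_true:.2f}")
```

Applied to `y_test` and `y_pred_rf`, the same calculation should reproduce the recall column of the report above.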



## Results and Comparison
   - Compare model performance and insights

<strong>Model Comparison</strong><br>
<ul>
<li><strong>Decision Tree Model</strong>: test accuracy = 80.71%</li>
<li><strong>Random Forest Model</strong> (50 trees): test accuracy = 84.98%</li>
</ul>
<br>
<strong>Insights</strong><br>
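<p>The Random Forest model outperforms the single Decision Tree on the test set (0.8498 vs. 0.8071 accuracy). A likely reason is that a Random Forest averages the predictions of many trees, each trained on a bootstrap sample of the data, which reduces overfitting and improves generalization. In the classification report, the <code>True</code> class corresponds to <code>income_small</code>; the forest's recall is noticeably lower for the <code>False</code> class (0.63) than for the <code>True</code> class (0.92).</p>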
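
A single train-test split gives only one estimate of the gap between the two models. As a rough sketch, cross-validation can show whether the forest's advantage holds across several folds. The example below uses synthetic data from `make_classification` rather than the census data, and the 5-fold setting and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise (flip_y)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

# Same model settings as in the notebook: one tree versus a forest of 50 trees
models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy; each fold is held out once
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name:14s} mean = {scores.mean():.4f}  std = {scores.std():.4f}")
```

Comparing the mean and the spread of the fold scores gives a more stable picture than a single accuracy number.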

