# Census Income Classification

**Author:** Nathan Schaaf<br>
**Date:** 02/15/2025<br>
**Class:** DSBA 6162 - Data Mining

## Overview
This notebook applies **Decision Tree** and **Random Forest** models to predict whether an individual’s income exceeds **$50K per year** using the **Census Income** dataset.

## Steps Covered:
1. **Data Preprocessing**
   - Load the dataset
   - Drop missing values
   - Remove categorical variables with more than 32 levels
   - Encode categorical variables  

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [2]:
# Load the dataset
url = "https://raw.githubusercontent.com/NRSchaaf/census-income-ml/refs/heads/main/AdultUCI.csv"
df = pd.read_csv(url)

In [3]:
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | small |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | small |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | small |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | small |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | small |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      46033 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  47985 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [6]:
# Drop rows with missing values; this also removes the 16,281 rows that have no income label
df.dropna(inplace=True)

df.shape

(30162, 15)
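
The two remaining preprocessing steps listed above (removing categorical variables with more than 32 levels and encoding the rest) could be sketched roughly as follows. This is a minimal sketch, not necessarily the notebook's own approach: the helper name `prepare_features`, the use of one-hot encoding via `pd.get_dummies`, and the `drop_first=True` setting are all assumptions, since the step list only says "encode".

```python
import pandas as pd

MAX_LEVELS = 32  # threshold taken from the step list above


def prepare_features(df, target="income", max_levels=MAX_LEVELS):
    """Hypothetical helper: drop high-cardinality categoricals, then one-hot encode.

    Returns the encoded feature frame, the target column, and the names of
    the categorical columns that were dropped.
    """
    features = df.drop(columns=[target])

    # Treat every non-numeric column as categorical.
    cat_cols = features.select_dtypes(exclude="number").columns

    # Columns with more distinct values than the threshold are removed.
    too_many = [c for c in cat_cols if features[c].nunique() > max_levels]
    features = features.drop(columns=too_many)

    # One-hot encode the remaining categoricals (assumed encoding scheme).
    X = pd.get_dummies(features, drop_first=True)
    y = df[target]
    return X, y, too_many
```

On the cleaned frame from the cell above, a call would look like `X, y, dropped = prepare_features(df)`; inspecting `dropped` shows which columns exceeded the threshold.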

2. **Train-Test Split**
   - Split the dataset into **80% training** and **20% test** data

3. **Model Building**
   - Train a **Decision Tree** model and evaluate accuracy
   - Train a **Random Forest** model with **50 trees** and evaluate accuracy

4. **Results and Comparison**
   - Compare model performance and insights
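
Steps 2 through 4 can be sketched as a single helper that splits the data 80/20, fits both models, and returns their test accuracies for comparison. This is an illustrative sketch under assumptions: the function name `train_and_compare`, the fixed `seed`, the stratified split, and the default hyperparameters (apart from the 50 trees stated above) are not taken from the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_and_compare(X, y, seed=42):
    """Hypothetical helper: 80/20 split, fit both models, return test accuracy."""
    # Hold out 20% for testing; stratifying on y is an assumption that keeps
    # the class balance similar in both partitions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )

    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        # 50 trees, as stated in the model-building step.
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=seed),
    }

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores
```

Given a numeric feature matrix `X` and a label vector `y` (for example, the encoded census features and the `income` column), `train_and_compare(X, y)` returns a dictionary of accuracies keyed by model name, which makes the side-by-side comparison in step 4 straightforward.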