# AI Student Collective: Machine Learning with the Adult Income Dataset

Welcome to this tutorial by the AI Student Collective. In this notebook, we will walk through a typical machine learning pipeline using the Adult Income dataset. We will explore how to preprocess data, build models, and evaluate performance using different metrics.


## 1. Introduction

The goal of this tutorial is to predict whether an individual earns more than $50,000 per year based on various demographic attributes such as age, education, occupation, and more.


First, we must import the packages that we will use later on in the notebook.

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

## 2. Data Overview

Let's start by loading and viewing the data.

In [8]:
data = pd.read_csv('adult.csv')

In [9]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


The dataset contains columns such as age, workclass, education, occupation, gender, and hours-per-week. The target variable is income, which records whether an individual earns more than $50,000 per year (`>50K`) or not (`<=50K`).
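
The preview above shows `?` in the workclass and occupation columns of row 4. pandas reads these as ordinary strings, so they are not counted as missing values. Here is a minimal sketch, on a hypothetical three-row frame rather than the real file, of how such placeholders could be counted and converted to NaN. The names `toy` and `toy_clean` are ours.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the '?' placeholder seen in the preview.
toy = pd.DataFrame({
    'age': [25, 18, 44],
    'workclass': ['Private', '?', 'Private'],
    'occupation': ['Farming-fishing', '?', 'Machine-op-inspct'],
})

# Count the placeholder per column; isna() would report zero for these cells.
placeholder_counts = (toy == '?').sum()
print(placeholder_counts)

# Replace the placeholder with NaN so that imputers can detect it.
toy_clean = toy.replace('?', np.nan)
print(toy_clean.isna().sum())
```

An alternative, if it suits your workflow, is to pass `na_values='?'` to `pd.read_csv` so the conversion happens at load time.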

## 3. Setting Up Our Target

To feed our data into the model, we need a target to train on; here it is the 'income' column. We encode it as 0/1, log-transform the fnlwgt column, set aside every row after the first 40,000 as a held-out test set, and split the remaining data into 'features' and 'target'.

In [10]:
data['income'] = data['income'].map({'>50K' : 1, '<=50K' : 0})

In [15]:
len(data.query('income==1'))

11687

In [17]:
data['fnlwgt'] = np.log(data['fnlwgt'])

In [19]:
data, test = data[:40000], data[40000:]

In [21]:
X, y = data.drop(columns='income'), data['income']

## 4. Encoding and Cleaning Our Training Data

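1. To handle missing data, we use SimpleImputer from sklearn to fill in missing values. SimpleImputer only fills values that pandas marks as missing (NaN); the `?` placeholders seen in the preview are plain strings and are left unchanged unless they are converted first.
2. We convert categorical variables into numerical ones. OneHotEncoder is intended for nominal categories and OrdinalEncoder for ordinal ones. The code below one-hot encodes only gender and race and assigns integer codes to the remaining categorical columns, which keeps the feature matrix small even though most of those columns (for example occupation) have no natural order.
3. Numeric features are standardized to zero mean and unit variance. Random forests are insensitive to feature scale, but scaling keeps the preprocessing reusable with scale-sensitive models.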

We will use Pipeline from sklearn to streamline our preprocessing steps and model training.
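
To make the difference between the two encoders concrete, here is a small sketch on a toy column. The values are borrowed from the preview; the array itself is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column with three unordered categories.
workclass = np.array([['Private'], ['Local-gov'], ['Private'], ['?']])

# One-hot: one 0/1 column per category, no order implied.
onehot = OneHotEncoder().fit_transform(workclass).toarray()
print(onehot)

# Ordinal: a single integer column; codes follow the sorted category order,
# which a model may read as a ranking.
ordinal = OrdinalEncoder().fit_transform(workclass)
print(ordinal.ravel())
```

The one-hot result has three columns for three categories, while the ordinal result stays a single column. That is the trade-off between column count and implied order.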

In [23]:
num_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = ['gender', 'race']
ord_cols = ['occupation', 'relationship', 'education', 'workclass', 'marital-status', 'native-country']

In [25]:
num_trans = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean', add_indicator=True)),
    ('scale', StandardScaler())
])
cat_trans = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent', add_indicator=True)),
    ('OneHot', OneHotEncoder())
])
ord_trans = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent', add_indicator=True)),
    ('Ord', OrdinalEncoder())
])
preprocessor = ColumnTransformer(transformers=[
    ('num', num_trans, num_cols),
    ('cat', cat_trans, cat_cols),
    ('ord', ord_trans, ord_cols)
])

## 5. Model Selection: Random Forest

For this project, we will use a RandomForestClassifier to model the data. A random forest is an ensemble method that trains many decision trees on bootstrap samples of the data and averages their predictions, which usually reduces overfitting compared with a single tree.
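
The sketch below checks the averaging behaviour on synthetic data (a stand-in, not the Adult dataset): the forest's predicted class probabilities should equal the mean of the probabilities predicted by its individual trees. The names `X_toy`, `y_toy`, and `forest` are ours.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem used only for illustration.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Average the class probabilities predicted by the individual trees.
tree_probs = np.mean(
    [tree.predict_proba(X_toy) for tree in forest.estimators_], axis=0
)

# Compare with the probabilities reported by the forest itself.
matches = np.allclose(tree_probs, forest.predict_proba(X_toy))
print(matches)
```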

In [27]:
# Equal weights: no correction for the class imbalance.
class_weights = {0: 1, 1: 1}
model = RandomForestClassifier(n_estimators=100, class_weight=class_weights, random_state=1337)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

## 6. Training and Evaluation

We split the 40,000 rows into a training set and a 20% evaluation set using train_test_split. The rows held out earlier in `test` are not used in this step.

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337)
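
With an imbalanced target, a plain random split can leave the two parts with slightly different class shares. As a sketch on synthetic labels (not the real data), the `stratify` argument of train_test_split keeps the class ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 25% positives.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 250 + [0] * 750)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1337, stratify=y_toy
)

# Share of positives in each part.
print(y_tr.mean(), y_te.mean())
```

Whether stratification changes the results on this dataset is something to try rather than assume.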

Next, we fit our pipeline with the training data.

In [31]:
pipeline.fit(X_train, y_train)

## 7. Model Evaluation

Accuracy is a common metric to evaluate classification models. It measures the ratio of correctly predicted instances to the total instances.

In [32]:
pipeline.score(X_test, y_test)

0.859875

Because the classes are imbalanced, accuracy alone can hide weak performance on the `>50K` class, so we also report three per-class metrics:

1. Precision: the share of predicted positives that are actually positive. High precision means few false positives.
2. Recall: the share of actual positives that the model finds. High recall means few false negatives.
3. F1-Score: the harmonic mean of precision and recall, which balances the trade-off between the two.
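
As a worked example of these definitions, the sketch below recomputes the class-1 metrics from the confusion-matrix counts that this notebook prints for the tuned model in section 8. The variable names are ours.

```python
# Counts from the confusion matrix printed in section 8:
# rows are true labels, columns are predicted labels.
tn, fp = 5725, 356
fn, tp = 745, 1174

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.6f}")
```

The results (0.77, 0.61, 0.68 and 0.862375) agree with the class-1 row of the classification report and the score in section 8.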

In [35]:
predictions = pipeline.predict(X_test)

# classification_report expects the true labels first, then the predictions.
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.89      0.94      0.91      6081
           1       0.75      0.62      0.68      1919

    accuracy                           0.86      8000
   macro avg       0.82      0.78      0.79      8000
weighted avg       0.85      0.86      0.85      8000



## 8. Hyperparameter Tuning

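To further improve our model, we can tune hyperparameters using GridSearchCV, which evaluates every combination in the grid below (3 × 4 × 2 × 2 = 48 candidates) with 5-fold cross-validation by default.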

In [30]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 250],
    'max_depth': [5, 10, 30, None],
    'min_samples_split': [2,4],
    'max_features': ['sqrt', 'log2']
}

grid_search = GridSearchCV(estimator=RandomForestClassifier(),
                          param_grid=param_grid, verbose=10)

grid_search_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', grid_search)
])
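
In the cell above, the preprocessor is fitted once on all of X_train before cross-validation starts. An alternative layout, sketched here on synthetic data with a deliberately tiny grid, passes the whole pipeline to GridSearchCV so that preprocessing is refitted inside each fold. Parameters of a pipeline step are then addressed with the `<step>__<param>` prefix. The step names, the toy data, and the grid values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

toy_pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('model', RandomForestClassifier(random_state=0)),
])

# Step parameters use the '<step name>__<parameter>' convention.
toy_grid = {
    'model__n_estimators': [10, 20],
    'model__max_depth': [3, None],
}

search = GridSearchCV(toy_pipeline, param_grid=toy_grid, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
```

After fitting, `search` can be used like any fitted pipeline, for example `search.score(...)` or `search.predict(...)`.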


In [186]:
grid_search_pipeline.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV 1/5; 1/48] START max_depth=5, max_features=sqrt, min_samples_split=2, n_estimators=50
[CV 1/5; 1/48] END max_depth=5, max_features=sqrt, min_samples_split=2, n_estimators=50;, score=0.844 total time=   0.3s
[CV 2/5; 1/48] START max_depth=5, max_features=sqrt, min_samples_split=2, n_estimators=50
[CV 2/5; 1/48] END max_depth=5, max_features=sqrt, min_samples_split=2, n_estimators=50;, score=0.851 total time=   0.3s
[CV 3/5; 1/48] START max_depth=5, max_features=sqrt, min_samples_split=2, n_estimators=50
[CV 3/5; 1/48] END max_depth=5, max_features=sqrt, min_samples_split=2, n_estimators=50;, score=0.848 total time=   0.3s
[CV 4/5; 1/48] START max_depth=5, max_features=sqrt, min_samples_split=2, n_estimators=50
[CV 4/5; 1/48] END max_depth=5, max_features=sqrt, min_samples_split=2, n_estimators=50;, score=0.849 total time=   0.3s
[CV 5/5; 1/48] START max_depth=5, max_features=sqrt, min_samples_split=2, n_estimators=50
...

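By default, GridSearchCV refits the best parameter combination on the whole training set. We take that fitted estimator and pair it with the already-fitted preprocessor to score it on X_test.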

In [189]:
random_forest = grid_search.best_estimator_

In [191]:
random_forest_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', random_forest)
])

In [193]:
random_forest_pipeline.score(X_test, y_test)

0.862375

In [201]:
predictions = random_forest_pipeline.predict(X_test)

In [205]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.88      0.94      0.91      6081
           1       0.77      0.61      0.68      1919

    accuracy                           0.86      8000
   macro avg       0.83      0.78      0.80      8000
weighted avg       0.86      0.86      0.86      8000



In [237]:
print(confusion_matrix(y_test, predictions))

[[5725  356]
 [ 745 1174]]


## Your Turn

#### Use what you learned today to optimize the model. See if you can beat our accuracy with clever feature engineering or parameter tuning!

#### Thank you for following along! Feel free to reach out if you have any questions or need further clarification.