# Predicting income levels from census data

This notebook uses [xgboost](https://github.com/dmlc/xgboost) and [scikit-learn](http://scikit-learn.org/stable/) to build a model for the [census->income prediction challenge](https://archive.ics.uci.edu/ml/datasets/census+income).

The emphasis is on showing how quickly such a model can be trained on Google Cloud Platform infrastructure.

In [3]:
import os
import pandas as pd
import tensorflow as tf
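import sklearn.preprocessing  # import the submodule explicitly; a bare `import sklearn` is not guaranteed to load it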
import subprocess
import xgboost as xgb

## Downloading the data

The first step is to download the data so that it is accessible in this environment. We will download the data files to the directory named by `CENSUS_DATA_PATH`.

In [12]:
CENSUS_DATA_PATH = '/tmp/census'

In [14]:
os.makedirs(CENSUS_DATA_PATH, exist_ok=True)

In [15]:
BASE_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult'
data = {'adult.data', 'adult.test', 'adult.names'}

for data_file in data:
    # check_call raises CalledProcessError if curl exits with a non-zero status
    subprocess.check_call([
        'curl', '-o', os.path.join(CENSUS_DATA_PATH, data_file),
        '{}/{}'.format(BASE_URL, data_file),
    ])

In [17]:
os.listdir(CENSUS_DATA_PATH)

['adult.data', 'adult.test', 'adult.names']
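
The download step relies on the `curl` binary being available in the environment. Where it is not, the same step can be sketched with the standard library alone. This is a minimal sketch, not the notebook's method: `download_files` and its `fetch` parameter are hypothetical names introduced here, and `urllib.request.urlretrieve` is assumed to be an acceptable stand-in for `curl`.

```python
import os
import urllib.request


def download_files(base_url, file_names, dest_dir, fetch=urllib.request.urlretrieve):
    """Download each file into dest_dir, skipping files that already exist.

    `fetch` (hypothetical parameter) is any callable taking (url, local_path);
    it defaults to urllib.request.urlretrieve and can be replaced in tests.
    """
    os.makedirs(dest_dir, exist_ok=True)
    downloaded = []
    for name in sorted(file_names):
        local_path = os.path.join(dest_dir, name)
        if not os.path.exists(local_path):
            fetch('{}/{}'.format(base_url, name), local_path)
            downloaded.append(local_path)
    # Only the paths fetched during this call are returned.
    return downloaded
```

In this notebook it would be called as `download_files(BASE_URL, data, CENSUS_DATA_PATH)`. Skipping files that already exist keeps repeated runs of the cell from downloading the data again.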

In [35]:
TRAIN_DATA = '{}/adult.data'.format(CENSUS_DATA_PATH)
TEST_DATA = '{}/adult.test'.format(CENSUS_DATA_PATH)

In [36]:
COLUMNS = (
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income-level'
)

In [50]:
CATEGORICAL_COLUMNS = (
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country'
)

In [37]:
train_raw_df = pd.read_csv(TRAIN_DATA, header=None, names=COLUMNS)

In [38]:
train_raw_df.head(10)

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-level
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K
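
The raw file separates fields with a comma followed by a space, so with default parsing every string value keeps a leading space. That is why the label comparison later in the notebook matches `' >50K'` rather than `'>50K'`. The sketch below illustrates the difference on a two-row in-memory sample. `SAMPLE` and `SAMPLE_COLUMNS` are made-up names with a shortened column list, and `skipinitialspace=True` is an alternative the notebook itself does not use.

```python
import io

import pandas as pd

# Two rows in the same "comma + space" layout as adult.data (columns shortened for the sketch).
SAMPLE = "39, State-gov, 77516, <=50K\n52, Self-emp-not-inc, 209642, >50K\n"
SAMPLE_COLUMNS = ('age', 'workclass', 'fnlwgt', 'income-level')

# Default parsing keeps the space after each comma inside string fields.
raw_df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=SAMPLE_COLUMNS)

# skipinitialspace=True drops whitespace that follows the delimiter.
clean_df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=SAMPLE_COLUMNS,
                       skipinitialspace=True)
```

If the files were read this way, every later string comparison would need to drop the leading space as well.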


In [52]:
# One LabelEncoder per categorical column, keyed by column name
encoders = {col: sklearn.preprocessing.LabelEncoder() for col in CATEGORICAL_COLUMNS}

In [55]:
train_features_df = train_raw_df.drop('income-level', axis=1)

# Fit each encoder on its training column and replace the strings with integer codes
for col in CATEGORICAL_COLUMNS:
    train_features_df[col] = encoders[col].fit_transform(train_features_df[col])
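
To make the encoding step concrete, here is a small sketch of what a single `LabelEncoder` does to one column. The sample values are made up for illustration; only `LabelEncoder`, `fit_transform`, `transform` and `classes_` are real scikit-learn names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample columns; values keep the leading space found in the raw files.
train_col = pd.Series([' Private', ' State-gov', ' Private', ' ?'])
test_col = pd.Series([' State-gov', ' ?'])

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_col)  # learns the sorted set of categories
test_codes = encoder.transform(test_col)        # reuses the mapping learned on training data

# classes_ holds the categories in sorted order; a category's position is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
```

Because the codes come from the sorted training categories, the same fitted encoder has to be applied to the test data for the integers to mean the same thing. `transform` raises a `ValueError` if it meets a category it did not see during fitting.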

In [54]:
# String values in adult.data carry a leading space, hence ' >50K'
train_labels_df = (train_raw_df['income-level'] == ' >50K')

In [60]:
# skiprows=1: the first line of adult.test is not a data row
test_raw_df = pd.read_csv(TEST_DATA, names=COLUMNS, skiprows=1)

In [61]:
test_raw_df.head(10)

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income-level
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.
5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K.
6,29,?,227026,HS-grad,9,Never-married,?,Unmarried,Black,Male,0,0,40,United-States,<=50K.
7,63,Self-emp-not-inc,104626,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,3103,0,32,United-States,>50K.
8,24,Private,369667,Some-college,10,Never-married,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K.
9,55,Private,104996,7th-8th,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,10,United-States,<=50K.


In [63]:
test_features_df = test_raw_df.drop('income-level', axis=1)

# Reuse the encoders fitted on the training data so that the integer codes match
for col in CATEGORICAL_COLUMNS:
    test_features_df[col] = encoders[col].transform(test_features_df[col])

In [64]:
test_features_df.head(10)

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,4,226802,1,7,4,7,3,2,1,0,0,40,39
1,38,4,89814,11,9,2,5,0,4,1,0,0,50,39
2,28,2,336951,7,12,2,11,0,4,1,0,0,40,39
3,44,4,160323,15,10,2,7,0,2,1,7688,0,40,39
4,18,0,103497,15,10,4,0,3,4,0,0,0,30,39
5,34,4,198693,0,6,4,8,1,4,1,0,0,30,39
6,29,0,227026,11,9,4,0,4,2,1,0,0,40,39
7,63,6,104626,14,15,2,10,0,4,1,3103,0,32,39
8,24,4,369667,15,10,4,8,4,4,0,0,0,40,39
9,55,4,104996,5,4,2,3,0,4,1,0,0,10,39


In [66]:
# Labels in adult.test end with a period, unlike those in adult.data
test_labels_df = (test_raw_df['income-level'] == ' >50K.')

In [67]:
test_labels_df.head(10)

0    False
1    False
2     True
3     True
4    False
5    False
6    False
7     True
8    False
9    False
Name: income-level, dtype: bool
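
The training and test files spell the positive label differently (`' >50K'` versus `' >50K.'`), so the two comparisons in the notebook use different literals. One way to avoid keeping two literals in sync is to normalize the strings before comparing them. The helper below is a hypothetical sketch, not part of the original notebook, and is shown on made-up sample series.

```python
import pandas as pd


def to_binary_label(income_series):
    """Return True where income is above 50K, ignoring spaces and a trailing period."""
    normalized = income_series.str.strip().str.rstrip('.')
    return normalized == '>50K'


# Made-up samples mimicking the two spellings seen in adult.data and adult.test.
train_income = pd.Series([' <=50K', ' >50K'], name='income-level')
test_income = pd.Series([' <=50K.', ' >50K.'], name='income-level')

train_flags = to_binary_label(train_income)
test_flags = to_binary_label(test_income)
```

With such a helper, the same call could be applied to `train_raw_df['income-level']` and `test_raw_df['income-level']`.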