# AutoML to predict income

Download the data file from its remote location. If a local copy already exists, reuse it instead.


In [1]:
import requests
from pathlib import Path
import os

def get_census():
    """Download the census data unless a local copy already exists."""
    data_file = './data/census.csv'
    my_file = Path(data_file)
    if my_file.is_file():
        print('file already downloaded and extracted')
    else:
        print('file not found - doing initial download')
        # exist_ok avoids an error when ./data exists but the file does not
        os.makedirs('data', exist_ok=True)
        url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
        r = requests.get(url)
        r.raise_for_status()  # fail loudly instead of saving an error page
        with open(data_file, 'wb') as f:
            f.write(r.content)

get_census()

file already downloaded and extracted


Load the CSV and display a preview

In [2]:
import pandas as pd

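# Note: the raw file has no header row, so read_csv uses the first record as
# column names and it is lost when we rename below; header=None would keep it.
df = pd.read_csv('./data/census.csv')
df.columns = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation',
              'relationship','race','sex','capital-gain','capital-loss','hours-per-week',
              'native-country','target']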
display(df)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32555,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32556,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32557,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32558,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


People normally use this dataset to predict whether the last column (which we called target) is >50K or <=50K annual income. Most ML models prefer numeric targets, so we'll encode <=50K as 0 and >50K as 1.

We're also going to drop the fnlwgt column, which is the Census estimate of how many people fit into this bucket; each row isn't actually a single person. Since that's not really a property of the person, I've left it out to keep things simple.

In [3]:
df.loc[df['target'] == ' <=50K', 'target'] = 0
df.loc[df['target'] == ' >50K', 'target'] = 1
df.drop(columns='fnlwgt', inplace=True)  # a positional axis argument is rejected by newer pandas
display(df)

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
1,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
2,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
3,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
4,37,Private,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32555,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32556,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32557,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32558,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0
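
Notice that the comparisons above match ' <=50K' and ' >50K' with a leading space. Here is a minimal sketch of why, assuming the downloaded file separates fields with a comma followed by a space (the two records are copied from the preview; `encode_target` is a hypothetical helper, not part of the notebook):

```python
import csv
import io

# Two records from the preview, in the assumed ", "-separated raw layout.
raw = (
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
    "40, Private, 154374, HS-grad, 9, Married-civ-spouse, "
    "Machine-op-inspct, Husband, White, Male, 0, 0, 40, United-States, >50K\n"
)

# Default parsing keeps the space after each comma as part of the value.
default_rows = list(csv.reader(io.StringIO(raw)))
# skipinitialspace=True drops it.
clean_rows = list(csv.reader(io.StringIO(raw), skipinitialspace=True))

LABELS = {"<=50K": 0, ">50K": 1}

def encode_target(label):
    """Map an income label to 0/1, ignoring surrounding whitespace."""
    return LABELS[label.strip()]

raw_labels = [row[-1] for row in default_rows]
clean_labels = [row[-1] for row in clean_rows]
encoded = [encode_target(label) for label in raw_labels]
print(raw_labels, clean_labels, encoded)
```

Stripping whitespace before comparing makes the encoding work either way, whether or not the loader removed the spaces.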


One last step before we start. There are two high-level types of variables:
* Categorical - attributes that take one of a finite set of values. Examples include "state I live in".
* Continuous - numeric attributes that exist on a continuum. Examples include "height" and "weight".

With some algorithms, like logistic regression, you used to have to convert categorical variables to binary flags (1/0) yourself. Most modern ML algorithms can handle both types fine, but a visualization technique we're going to use at the end works better with numeric variables, so we're going to quickly convert them with this one statement. It takes each categorical variable and turns it into N 1/0 flags, where N is the number of distinct values it has.

In [4]:
df = pd.get_dummies(df, columns=['workclass','education','marital-status','occupation','relationship','race','sex','native-country'])
display(df)

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,target,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,50,13,0,0,13,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,38,9,0,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,53,7,0,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,28,13,0,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,37,14,0,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32555,27,12,0,0,38,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32556,40,9,0,0,40,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32557,58,9,0,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
32558,22,9,0,0,20,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
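
To make the "N 1/0 flags" idea concrete, here is a small plain-Python sketch of the same transformation on a toy table. The `one_hot` helper and the three sample rows are made up for illustration; this is not how pandas implements `get_dummies` internally.

```python
def one_hot(rows, column):
    """Replace `column` in each dict row with one 0/1 flag per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one flag per category; exactly one of them is 1 for each row.
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded

# Hypothetical toy data with one categorical column that has N = 2 values.
people = [
    {"age": 50, "sex": "Male"},
    {"age": 28, "sex": "Female"},
    {"age": 53, "sex": "Male"},
]
flags = one_hot(people, "sex")
for row in flags:
    print(row)
```

Each row ends up with exactly one flag set per original categorical column, which is why the column count above grows so much for variables like native-country.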


We're going to use the PyCaret package for AutoML, so we need to import it. We also need to tell it which column to predict (the target), how much of the data to train on, and how much to hold out for testing. Here we train on 80% of the rows and hold out the remaining 20%.

In [5]:
import pycaret.classification as pc
exp_clf = pc.setup(data=df, target='target', train_size=0.8)


Unnamed: 0,Description,Value
0,session_id,6423
1,Target,target
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(32560, 108)"
5,Missing Values,False
6,Numeric Features,106
7,Categorical Features,1
8,Ordinal Features,False
9,High Cardinality Features,False
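
As a rough picture of what a train_size of 0.8 means, here is a sketch of a shuffled 80/20 split in plain Python. The `train_test_split` helper below is hypothetical and deliberately simple; PyCaret's own splitting may differ (for example, it may stratify by the target).

```python
import random

def train_test_split(rows, train_size=0.8, seed=0):
    """Shuffle a copy of the rows, then cut it into train and holdout parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 32560 rows reported in the setup summary above.
rows = list(range(32560))
train, holdout = train_test_split(rows, train_size=0.8)
print(len(train), len(holdout))
```

The holdout rows are never used for fitting, so they give an honest check of the final model later on.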


Let's see how different algorithms do with 3-fold cross-validation.

In [6]:
pc.compare_models(fold =3)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
catboost,CatBoost Classifier,0.8722,0.9279,0.6497,0.7821,0.7096,0.6286,0.6331,10.7433
lightgbm,Light Gradient Boosting Machine,0.8712,0.9262,0.6511,0.777,0.7084,0.6265,0.6306,0.39
gbc,Gradient Boosting Classifier,0.864,0.9183,0.5875,0.7932,0.6749,0.5913,0.602,2.6367
ada,Ada Boost Classifier,0.8608,0.9134,0.6094,0.7639,0.6778,0.5904,0.5967,0.8833
lr,Logistic Regression,0.8494,0.9046,0.6013,0.726,0.6575,0.5621,0.5664,3.9067
rf,Random Forest Classifier,0.8457,0.89,0.6112,0.7073,0.6555,0.5568,0.5594,2.2233
lda,Linear Discriminant Analysis,0.8397,0.8922,0.5658,0.7092,0.6292,0.5286,0.5343,0.4233
ridge,Ridge Classifier,0.8392,0.0,0.5081,0.7421,0.603,0.5066,0.521,0.1033
knn,K Neighbors Classifier,0.8386,0.8704,0.6196,0.6806,0.6486,0.5442,0.5453,8.12
et,Extra Trees Classifier,0.8268,0.8464,0.582,0.6579,0.6175,0.5061,0.5078,2.5033


<catboost.core.CatBoostClassifier at 0x7f4b213b3b10>

CatBoost did well, which is not surprising since it is one of the newer gradient boosting approaches. Let's explore CatBoost and LightGBM. Each supports different diagnostics.

In [7]:
lg = pc.create_model('lightgbm', fold = 3)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8701,0.9251,0.6397,0.7802,0.703,0.6209,0.6259
1,0.8742,0.9301,0.6497,0.7897,0.7129,0.6334,0.6383
2,0.8692,0.9233,0.664,0.761,0.7092,0.6253,0.6277
Mean,0.8712,0.9262,0.6511,0.777,0.7084,0.6265,0.6306
SD,0.0022,0.0029,0.01,0.012,0.0041,0.0052,0.0055
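
For intuition about the fold rows above, here is a simplified sketch of k-fold splitting, followed by the Mean and SD of the three accuracy values from the table. `k_fold_indices` is a hypothetical helper that uses contiguous folds; PyCaret's splitter may shuffle or stratify, so treat this as an illustration only.

```python
import statistics

def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, validation
        start = stop

# Every row lands in exactly one validation fold.
folds = list(k_fold_indices(10, 3))
validation_sizes = [len(validation) for _, validation in folds]

# Per-fold LightGBM accuracies copied from the table above.
fold_accuracy = [0.8701, 0.8742, 0.8692]
mean_accuracy = statistics.mean(fold_accuracy)
sd_accuracy = statistics.pstdev(fold_accuracy)  # population standard deviation
print(validation_sizes, round(mean_accuracy, 4), round(sd_accuracy, 4))
```

With these rounded inputs, the population standard deviation reproduces the SD row for accuracy after rounding to four places.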


In [None]:
cb = pc.create_model('catboost', fold=2)



What does the ROC curve, and the area under it (AUC), look like?
Background: https://en.wikipedia.org/wiki/Receiver_operating_characteristic


In [None]:
pc.plot_model(lg,plot='auc')
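
One way to read the AUC number: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. Here is a minimal sketch of that definition on made-up labels and scores (the `roc_auc` helper is hypothetical and is not what PyCaret calls internally):

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up example: one positive (0.35) is ranked below one negative (0.4).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)
```

A value of 0.5 means the scores rank no better than chance, and 1.0 means every positive outranks every negative.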

How about the tradeoff between precision and recall? We could change the confidence threshold for a positive prediction and:
* Go for high precision and lose a bit of recall. This would be the right strategy when a false positive is costly, like an invasive medical procedure.
* Go for high recall at the cost of lower precision. This would be the right strategy when a false positive is cheap. Unsurprisingly, people who send junk mail are OK with this option.

In [None]:
pc.plot_model(lg,plot='pr')
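
To see the tradeoff in numbers, here is a small sketch that scores the same made-up predictions at a strict and a lenient threshold. The `precision_recall` helper, the data, and the two thresholds are all hypothetical.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores at or above `threshold` count as positive."""
    predicted = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and model scores, sorted by score for readability.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

strict = precision_recall(y_true, scores, threshold=0.65)   # fewer, safer positives
lenient = precision_recall(y_true, scores, threshold=0.35)  # more positives, more mistakes
print(strict, lenient)
```

Raising the threshold removes the one false positive but misses a true positive; lowering it does the reverse.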

Subject matter experts are going to want to know how the model is working. Which features (columns) were most important in the outcome?

In [None]:
pc.plot_model(lg,plot='feature')

Ultimately, how did it do predicting 0 (below 50k) vs 1 (above 50k)?

In [None]:
pc.plot_model(lg,plot='confusion_matrix')
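
The four cells of a confusion matrix are enough to recompute several of the metrics in the tables above. A minimal sketch on made-up predictions (the `confusion_counts` helper and the data are hypothetical):

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count the four (actual, predicted) combinations for a 0/1 target."""
    pairs = Counter(zip(y_true, y_pred))
    return {
        "tn": pairs[(0, 0)],  # actual 0, predicted 0
        "fp": pairs[(0, 1)],  # actual 0, predicted 1
        "fn": pairs[(1, 0)],  # actual 1, predicted 0
        "tp": pairs[(1, 1)],  # actual 1, predicted 1
    }

# Made-up example: 6 people at or below 50K, 4 above.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
counts = confusion_counts(y_true, y_pred)

accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
f1 = 2 * precision * recall / (precision + recall)
print(counts, accuracy, precision, recall, f1)
```

Note how accuracy can look acceptable while recall on the smaller class is poor, which is the pattern to watch for with imbalanced data.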

The next cell brings up all the diagnostic tools at the same time so you can click around. Each of these has value in analyzing a specific behavior. For example:
* The class report shows you how the model did on the specific classes (potential outcomes for the target variable). It is not uncommon for a model to behave differently for different groups, especially if the data is imbalanced.
* The calibration curve tells you how the model did at different levels of confidence. Is it over-confident? We didn't show it, but you can re-calibrate the model to compensate for this curve.

In [None]:
pc.evaluate_model(lg)
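
To show what a calibration curve measures, here is a sketch that bins predictions by confidence and compares the average predicted probability with the observed rate of positives in each bin. The `calibration_bins` helper and the data are made up for illustration.

```python
def calibration_bins(y_true, probs, n_bins=5):
    """Return (mean predicted probability, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        # Equal-width bins over [0, 1]; a probability of exactly 1.0 goes in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((y, p))
    summary = []
    for members in bins:
        if members:
            mean_p = sum(p for _, p in members) / len(members)
            rate = sum(y for y, _ in members) / len(members)
            summary.append((mean_p, rate, len(members)))
    return summary

# Made-up model that says 0.1 or 0.9, but is right less often than it claims.
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
curve = calibration_bins(y_true, probs, n_bins=2)
for mean_p, rate, count in curve:
    print(f"predicted {mean_p:.2f}  observed {rate:.2f}  n={count}")
```

A well-calibrated model has the two numbers roughly equal in every bin; here the confident bin predicts about 0.9 but only 0.75 of its cases are positive, so this toy model is over-confident.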

Show how each model does on the holdout data set only.

In [None]:
pc.predict_model(lg);

In [None]:
pc.predict_model(cb);

Not bad!

Accuracy dropped only slightly compared with cross-validation, which suggests very little overfitting to the training data.

In [None]:
pc.interpret_model(cb)