# Understanding Robustness: Adult Census Income

----

In this notebook you'll explore the notion of *robustness* for a machine learning model. More specifically, we'll see that a robust model is:

1. A model that is neither overfitted nor underfitted (bias-variance tradeoff)
2. A model that stays coherent on newly generated data, both credible samples and outliers
3. A model that resists attacks

This list walks us through some specific steps of a machine learning project:


| Section | Topics                            | Some references |
|---------|-----------------------------------|-----------------|
| 1.      | Cross validation                  |                 |
| 1.      | Train-Test split                  |                 |
| 1.      | Bias-Variance tradeoff            |                 |
| 2.      | Interpretability                  |                 |
| 2.      | Local explanation                 |                 |
| 2.      | Generating data to test the model |                 |
| 3.      | Different attacks on a model      |                 |
| 3.      | Defending against these attacks   |                 |
    

For this notebook I chose the [Adult Census Income dataset](https://www.kaggle.com/uciml/adult-census-income). It's available in the `../data/` directory.

## Import packages

In [1]:
import pandas as pd
import numpy as np

import os.path

from IPython.display import display, Markdown

## Load data

In [2]:
root_dir = '..'

In [12]:
fpath = os.path.join(root_dir, 'data/adult.csv')

data = pd.read_csv(fpath, na_values='?')

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
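
The non-null counts above fall short of 32561 for `workclass`, `occupation`, and `native.country` because `na_values='?'` tells `read_csv` to treat the `?` placeholder as a missing value. A minimal sketch of that behaviour, using a made-up three-row CSV (the rows are illustrative and not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical miniature CSV using the same '?' placeholder as adult.csv
raw = io.StringIO(
    "age,workclass,occupation\n"
    "39,Private,Sales\n"
    "54,?,?\n"
    "28,Private,?\n"
)

# '?' cells are parsed as NaN instead of the literal string '?'
toy = pd.read_csv(raw, na_values='?')

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Without `na_values='?'`, the same cells would be read as ordinary strings and `info()` would report every column as fully populated.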


The dataset is loaded! Great.

Now let's take a quick look at the data. I use the [`pandas-profiling`](https://github.com/pandas-profiling/pandas-profiling) package to get a quick overview of it.

## Analyse dataset

In [18]:
data.sample(5)

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
29433,29,Private,168015,HS-grad,9,Divorced,Machine-op-inspct,Unmarried,White,Female,0,0,40,United-States,<=50K
30326,33,Private,31449,Assoc-acdm,12,Divorced,Machine-op-inspct,Unmarried,Amer-Indian-Eskimo,Female,0,0,40,United-States,<=50K
8266,45,Private,162187,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,0,0,52,United-States,>50K
213,63,,234083,HS-grad,9,Divorced,,Not-in-family,White,Female,0,2205,40,United-States,<=50K
31608,40,Self-emp-not-inc,57233,Assoc-voc,11,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,>50K


In [14]:
from pandas_profiling import ProfileReport

profile = ProfileReport(data, title="Adult Census Income", explorative=True)

In [15]:
fpath = os.path.join(root_dir, 'notebooks/reports/adult.html')

profile.to_file(fpath)

In [17]:
data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex',
       'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
       'income'],
      dtype='object')

In [16]:
Markdown('Report available at : [%s](%s)'%(fpath, fpath))

Report available at : [../notebooks/reports/adult.html](../notebooks/reports/adult.html)

From this report we can see the following:

- There are 24 duplicate rows
- `workclass` has 1836 (5.6%) missing values
- `occupation` has 1843 (5.7%) missing values
- `native.country` has 583 (1.8%) missing values
- `capital.gain` has 29849 (91.7%) zeros
- `capital.loss` has 31042 (95.3%) zeros
- our target `income` is not correlated with `fnlwgt`, `race`, or `native.country`
- `relationship` and `sex` are strongly correlated
- `education.num` is the encoded version of `education`
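
Several of these figures can be cross-checked without the HTML report. The sketch below uses a hypothetical helper, `quick_checks` (not part of the notebook or of `pandas-profiling`), to compute the duplicate count, the missing values per column, and the share of zeros in numeric columns with plain pandas. It is shown on a made-up four-row frame:

```python
import pandas as pd


def quick_checks(df):
    """Return a few of the figures a profiling report would surface."""
    return {
        # Rows that repeat an earlier row (the first occurrence is not counted)
        'duplicate_rows': int(df.duplicated().sum()),
        # Number of NaN cells in each column
        'missing_per_column': df.isna().sum().to_dict(),
        # Fraction of cells equal to zero, numeric columns only
        'zero_share': {
            col: float((df[col] == 0).mean())
            for col in df.select_dtypes('number').columns
        },
    }


# Illustrative data only: one repeated row, one missing value, three zeros
toy = pd.DataFrame({
    'workclass': ['Private', None, 'Private', 'Private'],
    'capital.gain': [0, 0, 0, 2174],
})

checks = quick_checks(toy)
print(checks)
```

On the real data, `quick_checks(data)['duplicate_rows']` is the number of rows that `drop_duplicates()` would remove, so it has to be computed before the deduplication step.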

## Data Preparation

We'll start with the following preprocessing tasks:

- Drop duplicate rows
- Drop useless columns

For the remaining tasks we create a `scikit-learn` pipeline that transforms the data in the following steps:

- Missing-value imputation: most frequent value for categorical features, median for numeric features
- `OneHotEncoder` for categorical features
- `StandardScaler` for numeric features

In [19]:
# drop duplicate rows
data = data.drop_duplicates().reset_index(drop=True)

In [20]:
# drop useless columns
data = data.drop(columns=[
    'fnlwgt','race','native.country','education','relationship'
])
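
Dropping `education` relies on the observation that `education.num` carries the same information. One way to check such a claim is to test whether the two columns map one-to-one. The helper `is_one_to_one` below is hypothetical (it is not part of the notebook), and the toy frame reuses only the label/number pairs visible in the sample rows earlier:

```python
import pandas as pd


def is_one_to_one(df, col_a, col_b):
    """True if each value of col_a pairs with exactly one value of col_b, and vice versa."""
    a_to_b = df.groupby(col_a)[col_b].nunique().max() == 1
    b_to_a = df.groupby(col_b)[col_a].nunique().max() == 1
    return bool(a_to_b and b_to_a)


# Pairs taken from the sample output above; not the full dataset
toy = pd.DataFrame({
    'education': ['HS-grad', 'Assoc-acdm', 'Bachelors', 'HS-grad', 'Assoc-voc'],
    'education.num': [9, 12, 13, 9, 11],
})

redundant = is_one_to_one(toy, 'education', 'education.num')
print(redundant)
```

On the real data this check would need to run before the `drop` call, while both columns are still present.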

In [22]:
data.sample(5)

Unnamed: 0,age,workclass,education.num,marital.status,occupation,sex,capital.gain,capital.loss,hours.per.week,income
13782,37,Private,9,Never-married,Machine-op-inspct,Male,0,0,40,<=50K
795,27,Private,13,Married-civ-spouse,Machine-op-inspct,Male,0,1887,60,>50K
3659,30,Self-emp-not-inc,11,Married-civ-spouse,Sales,Male,3137,0,60,<=50K
7611,48,Private,10,Married-civ-spouse,Exec-managerial,Male,0,0,40,<=50K
318,34,Private,9,Never-married,Machine-op-inspct,Male,0,2001,40,<=50K


Before creating the pipeline let's encode our target `income` to 1 and 0.

In [23]:
target = 'income'

data[target] = data[target].replace({
    '<=50K':0,
    '>50K':1
})

In [24]:
data[target].value_counts()

0    24698
1     7839
Name: income, dtype: int64
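
The counts above show an imbalanced target: 7839 of 32537 rows, about 24%, belong to the positive class. When the data is later split into train and test sets, a stratified split keeps that ratio in both parts. A minimal sketch on synthetic labels with the same class counts; the `test_size=0.2` and `random_state=0` values, and the placeholder feature `X`, are assumptions made for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the output above
y = np.array([0] * 24698 + [1] * 7839)
# Placeholder single feature, illustrative only
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Positive-class share overall, in train, and in test
print(round(y.mean(), 3), round(y_train.mean(), 3), round(y_test.mean(), 3))
```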

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [27]:
numeric_features = ['age', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_features = ['workclass', 'marital.status', 'occupation', 'sex']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

In [30]:
preprocessor = preprocessor.fit(data)

In [38]:
data_preprocessed = preprocessor.transform(data)

In [40]:
pd.DataFrame.sparse.from_spmatrix(data_preprocessed).sample(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26,27,28,29,30,31,32,33,34,35
27898,-1.142822,1.134777,-0.145975,-0.216743,-0.035664,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
20878,-0.776193,-0.031815,-0.145975,-0.216743,2.394135,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
26666,2.083511,2.301369,-0.145975,-0.216743,-1.65553,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
30118,0.250367,-0.420679,-0.145975,-0.216743,4.499961,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
32077,1.350253,-2.364998,-0.145975,-0.216743,0.774269,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
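
The transformed matrix has anonymous integer column labels, which makes it hard to tell which one-hot column corresponds to which category. A `ColumnTransformer` emits its blocks in the order the transformers are listed, so the names can be rebuilt from the numeric feature list plus the fitted encoder's `categories_`. The sketch below assumes a cut-down version of the pipeline above, fitted on a made-up four-row frame; the `feature=category` naming scheme is my own choice, not something scikit-learn produces:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data only, with one missing categorical value
toy = pd.DataFrame({
    'age': [29, 45, 63, 40],
    'hours.per.week': [40, 52, 40, 40],
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'workclass': ['Private', 'Private', np.nan, 'Self-emp-not-inc'],
})
numeric_features = ['age', 'hours.per.week']
categorical_features = ['sex', 'workclass']

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), numeric_features),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_features),
]).fit(toy)

# The fitted encoder stores one array of categories per categorical feature
onehot = preprocessor.named_transformers_['cat'].named_steps['onehot']
encoded_names = [
    f'{feature}={category}'
    for feature, categories in zip(categorical_features, onehot.categories_)
    for category in categories
]
feature_names = numeric_features + encoded_names

# The output may be sparse or dense depending on its density
transformed = preprocessor.transform(toy)
dense = transformed.toarray() if hasattr(transformed, 'toarray') else transformed
frame = pd.DataFrame(dense, columns=feature_names)
print(frame.columns.tolist())
```

Recent scikit-learn versions also offer `preprocessor.get_feature_names_out()`, which returns names with its own prefixing scheme; the manual approach here is used only because it does not depend on the installed version.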
