# Understand Robustness : Adult Census Income

----

In this notebook you'll explore what "$\text{Robustness}$" means for a Machine Learning model. More specifically, we'll see that a robust model needs to be :

1. A model that is neither overfitted nor underfitted (bias-variance tradeoff)
2. A model that stays coherent on newly generated data, both credible samples and outliers
3. A model that resists attacks

This list allows us to go through some specific steps in a Machine Learning project :


| Section | Topics                            | Some references |
|---------|-----------------------------------|-----------------|
| 1.      | Cross validation                  |                 |
| 1.      | Train-Test split                  |                 |
| 1.      | Bias-Variance tradeoff            |                 |
| 2.      | Interpretability                  |                 |
| 2.      | Local explanation                 |                 |
| 2.      | Generating data to test the model |                 |
| 3.      | Different attacks on a model      |                 |
| 3.      | Defending against these attacks   |                 |
    

For this notebook I chose the [Adult Census Income dataset](https://www.kaggle.com/uciml/adult-census-income). It's available in the `../data/` directory.

## Import packages

In [1]:
import pandas as pd
import numpy as np

import os.path

from IPython.display import display, Markdown

## Load data

In [2]:
root_dir = '..'

In [3]:
fpath = os.path.join(root_dir, 'data/adult.csv')

data = pd.read_csv(fpath, na_values='?')

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


The dataset is loaded. Great!

Now let's take a quick look at the data. I use the [`pandas-profiling`](https://github.com/pandas-profiling/pandas-profiling) package to get a quick overview of it.

## Analyse dataset

In [5]:
data.sample(5)

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
30900,23,Private,436798,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
28242,30,Private,225053,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,50,United-States,<=50K
17704,56,Private,33323,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
23829,17,Private,118792,11th,7,Never-married,Sales,Own-child,White,Female,0,0,9,United-States,<=50K
52,51,State-gov,68898,Assoc-voc,11,Divorced,Tech-support,Not-in-family,White,Male,0,2444,39,United-States,>50K


In [6]:
from pandas_profiling import ProfileReport

profile = ProfileReport(data, title="Adult Census Income", explorative=True)

In [7]:
fpath = os.path.join(root_dir, 'notebooks/reports/adult.html')

profile.to_file(fpath)


In [8]:
data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex',
       'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
       'income'],
      dtype='object')

In [9]:
Markdown('Report available at : [%s](%s)'%(fpath, fpath))

Report available at : [../notebooks/reports/adult.html](../notebooks/reports/adult.html)

Using this report we can see the following information :

- There are 24 duplicate rows
- `workclass` has 1836 (5.6%) missing values
- `occupation` has 1843 (5.7%) missing values
- `native.country` has 583 (1.8%) missing values
- `capital.gain` has 29849 (91.7%) zeros
- `capital.loss` has 31042 (95.3%) zeros
- our target `income` is not correlated with `fnlwgt`, `race` and `native.country`
- `relationship` and `sex` are strongly correlated
- `education.num` is the encoded version of `education`
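
Some of these figures can also be cross-checked directly with `pandas`, without the report. The sketch below is only an illustration: `quality_summary` is a hypothetical helper name, and `toy` is a made-up frame standing in for the census data.

```python
import numpy as np
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Return a few of the statistics that a profiling report displays."""
    return {
        # number of rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # missing values per column, keeping only columns that have some
        "missing": df.isna().sum().loc[lambda s: s > 0].to_dict(),
        # share of zeros in each numeric column
        "zero_share": (df.select_dtypes(include="number") == 0).mean().to_dict(),
    }


# Hypothetical toy frame (not the real dataset)
toy = pd.DataFrame({
    "age": [39, 50, 39, 28],
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "capital.gain": [0, 0, 0, 2174],
})

summary = quality_summary(toy)
print(summary)
```

Calling `quality_summary(data)` on the loaded frame should give counts comparable to the ones listed in the report.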

## Data Preparation

We'll start with the following preprocessing tasks :

- Drop duplicate rows
- Drop useless columns

Then, for the next tasks, we create a `scikit-learn` pipeline that transforms our data with the following steps :

- Missing values imputer : most common for categories and median for numeric
- OneHotEncoder for categories
- StandardScaler to finish

In [10]:
# drop duplicate rows
data = data.drop_duplicates().reset_index(drop=True)

In [11]:
# drop useless columns
data = data.drop(columns=[
    'fnlwgt','race','native.country','education','relationship'
])

In [12]:
data.sample(5)

Unnamed: 0,age,workclass,education.num,marital.status,occupation,sex,capital.gain,capital.loss,hours.per.week,income
23940,21,Private,9,Never-married,Other-service,Male,0,0,40,<=50K
137,21,Private,10,Married-civ-spouse,Exec-managerial,Male,0,2377,48,<=50K
9833,60,Self-emp-not-inc,9,Married-civ-spouse,Exec-managerial,Male,0,0,48,<=50K
21890,17,Private,6,Never-married,Prof-specialty,Female,0,0,15,<=50K
16904,39,Private,10,Divorced,Adm-clerical,Female,0,0,45,<=50K


Before creating the pipeline let's encode our target `income` to 1 and 0.

In [13]:
target = 'income'

data[target] = data[target].replace({
    '<=50K':0,
    '>50K':1
})

In [14]:
data[target].value_counts()

0    24698
1     7839
Name: income, dtype: int64

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [16]:
numeric_features = ['age', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_features = ['workclass', 'marital.status', 'occupation', 'sex']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

In [17]:
preprocessor = preprocessor.fit(data)

In [18]:
data_preprocessed = preprocessor.transform(data)

In [19]:
pd.DataFrame.sparse.from_spmatrix(data_preprocessed).sample(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26,27,28,29,30,31,32,33,34,35
11806,-1.216148,-0.420679,-0.145975,-0.216743,-0.035664,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25031,0.250367,-0.031815,-0.145975,-0.216743,-0.035664,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
28829,1.423579,-0.420679,-0.145975,-0.216743,-0.035664,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
32430,-0.996171,-0.031815,-0.145975,-0.216743,-0.035664,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
12368,-1.436125,-0.031815,-0.145975,-0.216743,-0.035664,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
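
Note that the preprocessor above is fitted on the whole dataset, before any train/test split. One possible alternative, which is not what this notebook does, is to chain the preprocessing and the estimator in a single `Pipeline` and fit it on the training rows only, so that the medians, scaling statistics and categories are learned without looking at the test rows. The sketch below assumes a made-up `toy` frame and hypothetical names (`toy_preprocessor`, `model`).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one numeric and one categorical feature
toy = pd.DataFrame({
    "age": [25, 38, np.nan, 52, 46, 31, 60, 29],
    "workclass": ["Private", "Private", "State-gov", "Private",
                  "Private", "State-gov", "Private", "Private"],
    "income": [0, 0, 1, 1, 1, 0, 1, 0],
})

# Same structure as the notebook's preprocessor, on the toy columns
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["age"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["workclass"]),
])

# Preprocessing and classifier chained in one estimator
model = Pipeline([
    ("preprocess", toy_preprocessor),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])

X_toy, y_toy = toy.drop(columns="income"), toy["income"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

model.fit(X_tr, y_tr)              # preprocessing statistics come from X_tr only
predictions = model.predict(X_te)  # X_te is transformed with those statistics
print(predictions)
```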


# 1. Robustness : bias-variance tradeoff

If you're not familiar with a classic Machine Learning project pipeline : after defining your problem, then collecting and analysing your data, you need to choose one or more algorithms to train, so that you get a model able to perform the task you defined at the start.

Like cooking, training a model needs some basic ingredients :
- Reliable data : analysed to get the best insight from it and to be sure of its quality
- A sample to learn, a sample to validate and a sample to test
- An algorithm (or more) that is compatible with your task
- A metric (or more) to validate your trained model

**Let's assume that our data are reliable for the next part.**

The next question to ask before starting the training is the following :

    "Is my dataset representative of the real world ?"

In most cases, training data is extracted from the same source as the future data on which the predictions will be made. But imagine you are training an AI to recognize traffic signs and you only use images of US traffic signs : if your algorithm is then used in Europe, the real data will differ from the training data ! **You need to anticipate whether your training data is missing inputs that can occur in real situations.**

Again for our Adult Census income case, let's say that the data is representative.

---

## 1.1. What is a train, validation and test set ?

Now how can we ensure our model is robust ?

Wikipedia says :

    In computer science, robustness is the ability of a computer system to cope with errors during execution 
    and cope with erroneous input.
    
<div style="text-align:right"><a target="_blank" href="https://en.wikipedia.org/wiki/Robustness_(computer_science)">Wikipedia : Robustness (computer science)</a></div>
    
For Machine Learning, we can add that :

    The robustness is the property that characterizes how effective your algorithm is while being tested 
    on the new independent (but similar) dataset. In the other words, the robust algorithm is the one, 
    the testing error of which is close to the training error.
    

<div style="text-align:right"><a target="_blank" href="https://www.researchgate.net/post/What_is_the_definition_of_the_robustness_of_a_machine_learning_algorithm">ResearchGate : What is the definition of the robustness of a machine learning algorithm?</a></div>

This introduces the terms **train set** and **test set** (to which I add the **validation set**). What are they ?

- **Training Dataset**: The sample of data used to fit the model.
- **Validation Dataset**: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
- **Test Dataset**: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

[About Train, Validation and Test Sets in Machine Learning](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7)

This picture summarizes the concept :

<br>
<br>

<img width="700" src="https://miro.medium.com/max/1896/1*r73p1rxMZWnZLoYi5Odf4A.png">

<br>
<br>

<strong style="color:red">/!\ You need to be sure that each dataset is representative of the "real world" and randomly sampled /!\ </strong> (some tasks, such as time series prediction, do not call for a random split)


## 1.2. Cross validation (K-fold)

In k-fold cross-validation, the original sample is randomly partitioned into $k$ equal sized subsamples. Of the $k$ subsamples, a single subsample is retained as the validation data for testing the model, and the remaining $k - 1$ subsamples are used as training data. The cross-validation process is then repeated $k$ times, with each of the $k$ subsamples used exactly once as the validation data. The $k$ results can then be averaged to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once.


<div style="text-align:right"><a target="_blank" href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation">Wikipedia : Cross Validation</a></div>



<br>
<br>

<img width="700" src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4b/KfoldCV.gif/1920px-KfoldCV.gif">

Illustration of k-fold cross-validation when n = 12 observations and k = 3. After data is shuffled, a total of 3 models will be trained and tested.
<br>
<br>
<br>
<br>
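
The illustration above (n = 12 observations, k = 3) can be reproduced with `KFold` from `scikit-learn`. This is a minimal sketch on made-up data; `X_demo` and the other variable names are only illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(12).reshape(-1, 1)  # 12 toy observations

# 3 folds, shuffled once before being partitioned
kfold = KFold(n_splits=3, shuffle=True, random_state=42)

validation_counts = np.zeros(len(X_demo), dtype=int)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_demo)):
    validation_counts[val_idx] += 1  # count how often each row is validated
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```

Each fold trains on 8 observations and validates on the 4 remaining ones, and every observation lands in a validation fold exactly once.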



## 1.3. How to split the dataset into train, validation and test ?

There is no magic ratio for splitting a dataset. If you do not have a lot of data (1,000 to 10,000 rows), a split of 60% train, 20% validation and 20% test is commonly recommended.

When you have a lot of data, say more than 10 million rows, keeping as little as 1% for validation and test is more than acceptable.
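
A 60% / 20% / 20% split can be obtained by calling `train_test_split()` twice. The sketch below uses random made-up data, and the `_demo` names are only illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))      # 1000 toy rows, 5 features
y_demo = rng.integers(0, 2, size=1000)   # toy binary target

# First cut: keep 20% of the rows aside as the test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Second cut: 25% of the remaining 80% is 20% of the whole dataset
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```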


## 1.4. Overfitting & Underfitting

**Overfitting** is when your model is specialized on your training set : it works really well on your train set, but it does not generalize well. **When a model overfits, we say that it has a high variance.**

What is variance?

    Variance is the variability of model prediction for a given data point or a value which tells us spread 
    of our data. Model with high variance pays a lot of attention to training data and does not 
    generalize on the data which it hasn’t seen before. As a result, such models perform very well 
    on training data but has high error rates on test data.
    
**Underfitting** is the case where the model has "not learned enough" from the training data, resulting in poor generalization and unreliable predictions. **When a model underfits, we say that it has a high bias.**

What is bias?

    Bias is the difference between the average prediction of our model and the correct value which 
    we are trying to predict. Model with high bias pays very little attention to the training data 
    and oversimplifies the model. It always leads to high error on training and test data.
    
[Understanding the Bias-Variance Tradeoff](https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229)

<br>
<br>

<img src="https://miro.medium.com/max/700/1*9hPX9pAO3jqLrzt0IE3JzA.png">

Why is there a Bias-Variance Tradeoff?

If our model is too simple and has very few parameters then it may have high bias and low variance. On the other hand if our model has large number of parameters then it’s going to have high variance and low bias. So we need to find the right/good balance without overfitting and underfitting the data.

This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more complex and less complex at the same time.
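
To make the tradeoff concrete, one can vary the complexity of a model and watch the train and test scores move apart. The sketch below does this with the depth of a decision tree on a synthetic, noisy dataset; the data and all variable names are assumptions for illustration, not the census data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem; flip_y assigns a random class to 20% of the samples
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

scores = {}
for depth in [1, 3, 5, 10, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # (train accuracy, test accuracy) for this level of complexity
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f} test={scores[depth][1]:.3f}")
```

A very shallow tree scores modestly on both sets (high bias), while the unconstrained tree fits the training set perfectly but scores lower on the test set (high variance).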

--- 

## In practice : on Adult Census Income

Here, I decided to train a **RandomForest** algorithm and to use **accuracy** as the score function.

<br>
<br>

### train_test_split

If you're using tabular data, you can use the `train_test_split()` function from `scikit-learn`.

[Documentation of train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

Using this function allows you to separate train and test randomly.

In [20]:
from sklearn.model_selection import train_test_split

X = preprocessor.transform(data)
y = data[target]

# Splitting train and test sets.
# Don't forget to specify a random_state to reproduce this operation in the future !
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
display(Markdown('X_train shape : %s'%str(X_train.shape)))
display(Markdown('X_test shape : %s'%str(X_test.shape)))

X_train shape : (26029, 36)

X_test shape : (6508, 36)

In [23]:
# Check target freq into train and test
print('y_train freq')
display(y_train.value_counts(normalize=True))
print('y_test freq')
display(y_test.value_counts(normalize=True))

y_train freq


0    0.757232
1    0.242768
Name: income, dtype: float64

y_test freq


0    0.766441
1    0.233559
Name: income, dtype: float64
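
The class frequencies differ slightly between the two sets because the split is purely random. If identical proportions are wanted, `train_test_split()` accepts a `stratify` argument. A minimal sketch on a made-up imbalanced target (the `_demo` names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 75% zeros, 25% ones
y_demo = np.array([0] * 750 + [1] * 250)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train share of class 1 :", y_tr.mean())
print("test share of class 1  :", y_te.mean())
```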

### Overfitting example

Let's create a Decision Tree with a `max_depth` of 500. Such a deep tree is free to keep splitting until it has memorised almost every case in the train set, which leads to overfitting.

In [58]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=500,
    random_state=42
)

# Training the model
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=500, random_state=42)

In [59]:
from sklearn.metrics import accuracy_score

y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

acc_train = accuracy_score(y_train, y_train_pred)
acc_test = accuracy_score(y_test, y_test_pred)

print('Train accuracy : %.2f%%'%(acc_train*100))
print('Test accuracy : %.2f%%'%(acc_test*100))

Train accuracy : 97.16%
Test accuracy : 82.42%


We can see the training score is far better than the testing one.

**To check that your model is not overfitted, compare the train and test scores : if the gap is large, your model is overfitted**

### Underfitting example

In [66]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=1,
    random_state=42
)

# Training the model
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=1, random_state=42)

In [67]:
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

acc_train = accuracy_score(y_train, y_train_pred)
acc_test = accuracy_score(y_test, y_test_pred)

print('Train accuracy : %.2f%%'%(acc_train*100))
print('Test accuracy : %.2f%%'%(acc_test*100))

Train accuracy : 75.72%
Test accuracy : 76.64%


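In this example the accuracy is low for both the train set and the test set ! Both values are exactly the share of class 0 in each set (75.72% and 76.64%, see the target frequencies printed earlier), so this depth-1 tree does no better than always predicting `<=50K`.


**To check that your model is not underfitted, look at both the train and test scores : if both are low, your model is underfitted**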
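
Whether an accuracy counts as "low" depends on the class balance, so it helps to compare it with a trivial baseline. The sketch below uses `DummyClassifier` from `scikit-learn` on a made-up target with roughly the same imbalance as the census data; the `_demo` names and the exact 76/24 proportions are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced target: 76% of class 0, 24% of class 1
y_demo = np.array([0] * 76 + [1] * 24)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(f"Majority-class baseline accuracy: {baseline_acc:.2%}")
```

A model whose accuracy is no higher than this baseline has learned nothing useful from the features.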

### Cross validation with a grid search

`scikit-learn` offers several ways to use cross validation.

You can use `GridSearchCV`, which tests every combination from a given grid of parameters. There is also `RandomizedSearchCV`, which tests randomly sampled combinations.

In [68]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parameters = {
    'n_estimators': [10,25,50,100],
    'max_depth': [2,5,7,10],
    'min_samples_split': [2,5,10]
}

rf = RandomForestClassifier(random_state=42)

# cv defaults to 5 folds; a classifier is scored with accuracy by default
grid = GridSearchCV(rf, parameters)

grid.fit(X_train, y_train)

GridSearchCV(estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_depth': [2, 5, 7, 10],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [10, 25, 50, 100]})

In [70]:
clf = grid.best_estimator_

In [71]:
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

acc_train = accuracy_score(y_train, y_train_pred)
acc_test = accuracy_score(y_test, y_test_pred)

print('Train accuracy : %.2f%%'%(acc_train*100))
print('Test accuracy : %.2f%%'%(acc_test*100))

Train accuracy : 86.88%
Test accuracy : 86.09%
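
For comparison, here is a minimal sketch of the `RandomizedSearchCV` class mentioned above. It samples a fixed number of combinations instead of trying them all. The synthetic data, the parameter ranges and `n_iter=5` are assumptions chosen to keep the example fast; they are not tuned values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset (not the census data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# randint(low, high) draws integers in [low, high)
param_distributions = {
    "n_estimators": randint(10, 101),     # 10 to 100
    "max_depth": randint(2, 11),          # 2 to 10
    "min_samples_split": randint(2, 11),  # 2 to 10
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # only 5 sampled combinations are evaluated
    cv=3,                # 3-fold cross validation
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```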


# 2. Robustness : reliability of the prediction on new data

# 3. Robustness : resistance to attacks