# Classification of earnings
The aim is to use details about a person to predict whether or not they earn more than $50,000 per year.

Run the cell below to download the data.

In [None]:
!mkdir -p ./data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data -O ./data/adult.csv
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test -O ./data/adult_test.csv

In [None]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Union, Optional, Tuple
from collections import OrderedDict, defaultdict
import os

import sklearn
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble.forest import ForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from keras.models import Model, Sequential
from keras.layers import Dense, Activation, Dropout, BatchNormalization, Input, Embedding, Reshape, Concatenate
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, History

# Data Importing
The data is in *Comma Separated Value* (CSV) format. To load it up, we'll use Pandas.

In [None]:
df = pd.read_csv('data/adult.csv', header=None)
print(len(df))
df.head()

There's also a test set. The first line of that file is not a data row, so we skip it.

In [None]:
df_test = pd.read_csv('data/adult_test.csv', header=None, skiprows=[0])

## Column names
The file has no header row, so the column names are currently just integers. We can replace them with more human-readable names.

In [None]:
df.columns = [ "Age", "WorkClass", "fnlwgt", "Education", "EducationNum", "MaritalStatus", "Occupation",
              "Relationship", "Race", "Gender", "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Target"]
df_test.columns = df.columns

In [None]:
df.head()

We need a numerical target for our model, so we'll map `<=50K` to 0 and `>50K` to 1.

In [None]:
set(df.Target)

In [None]:
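df['Target'] = df.Target.map({' <=50K': 0, ' >50K': 1})
# The labels in the test file carry a trailing full stop (' <=50K.'), so strip it before mapping
df_test['Target'] = df_test.Target.str.rstrip('.').map({' <=50K': 0, ' >50K': 1})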
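
`Series.map` returns `NaN` for any value that is missing from the dictionary, without raising an error, so it is worth checking that every label was mapped. A minimal sketch on made-up labels (`toy_labels` is hypothetical, not the real data):

```python
import pandas as pd

# Hypothetical labels; the second one has a trailing full stop and will not match the dictionary
toy_labels = pd.Series([' <=50K', ' >50K.', ' >50K'])
toy_mapped = toy_labels.map({' <=50K': 0, ' >50K': 1})

# Count the rows that failed to map; anything above zero means the dictionary is incomplete
n_unmapped = int(toy_mapped.isna().sum())
print(f'{n_unmapped} label(s) could not be mapped')
```

The same check could be applied to `df.Target` and `df_test.Target` after mapping.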

There seems to be a class imbalance, but we'll ignore it for now

In [None]:
df.Target.hist()
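
To put a number on the imbalance rather than reading it off the histogram, the class fractions can be computed directly. A minimal sketch on a small made-up target column (`toy_target` is hypothetical, not the Adult data):

```python
import pandas as pd

# Hypothetical toy target column: 0 = <=50K, 1 = >50K (not the real data)
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])

# normalize=True returns the proportion of rows in each class instead of raw counts
class_fractions = toy_target.value_counts(normalize=True).sort_index()
print(class_fractions)
```

Running `df.Target.value_counts(normalize=True)` would give the equivalent numbers for the real training data.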

## Validation set
Since we're fitting our model to data, we want to have an unbiased estimate of its performance to help optimise the architecture before we apply the model to the testing data. We can randomly sample a *validation* set from the training data.

In [None]:
_, val_ids = train_test_split(df.index, stratify=df.Target, test_size=0.2, random_state=0)

To help reduce code overhead in the next step, we'll simply set a flag in the data for whether we want to use each row for training or validation.

In [None]:
df['val'] = 0
df.loc[val_ids, 'val'] = 1
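
The `stratify` argument used in the split is what keeps the class proportions roughly equal in the training and validation subsets. A small sketch on made-up labels illustrates this; all `toy_` names are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75% zeros, 25% ones
toy_labels = np.array([0] * 75 + [1] * 25)
toy_indices = np.arange(len(toy_labels))

# stratify keeps the class proportions (approximately) equal in both subsets
toy_trn_ids, toy_val_ids = train_test_split(
    toy_indices, stratify=toy_labels, test_size=0.2, random_state=0)

# Fraction of positive labels in each subset
print(toy_labels[toy_trn_ids].mean(), toy_labels[toy_val_ids].mean())
```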

# Feature processing
The data contains both continuous features (real values with numerical comparison) and categorical features (discrete values or string labels with no numerical comparison). Each needs to be treated slightly differently.

In [None]:
cat_feats = ['WorkClass', 'Education', 'MaritalStatus', 'Occupation',
             'Relationship', 'Race', 'Gender', 'NativeCountry']
cont_feats = ['Age', 'fnlwgt', 'EducationNum', 'CapitalGain', 'CapitalLoss', 'HoursPerWeek']
train_feats = cont_feats+cat_feats

## Categorical encoding
Our model can only operate on numbers, but the categorical features use strings. We can map these string values to integers in order to feed the data into our model. We also want to know whether there are categories which appear only in training or only in testing.

In [None]:
for feat in cat_feats:
    print(feat, set(df[feat]) == set(df_test[feat]))

In [None]:
print('Missing from test:',  [f for f in set(df.NativeCountry) if f not in set(df_test.NativeCountry)])
print('Missing from train:', [f for f in set(df_test.NativeCountry) if f not in set(df.NativeCountry)])

So, the training data contains an extra country which doesn't appear in the testing data. However, the model may well be able to learn things from the extra data which are invariant to country, so we'll keep it in.

We need to ensure the same string --> integer mapping is applied to both training and testing, in order to make sure the data still has the same meaning when we apply the model to the testing data. We'll also construct dictionaries to keep track of the mapping. **N.B.** Pandas has a dedicated column type `Categorical` for helping with this kind of data, but we'll stick with integer mapping for now.

In [None]:
cat_maps = defaultdict(dict)
for feat in cat_feats:
    # Sort the categories so that the mapping is reproducible between runs
    for i, val in enumerate(sorted(set(df[feat]))):
        cat_maps[feat][val] = i
    # Apply the same string --> integer mapping to training and testing data
    df[feat] = df[feat].map(cat_maps[feat])
    df_test[feat] = df_test[feat].map(cat_maps[feat])
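
For comparison, the pandas `Categorical` route mentioned above could look something like the following sketch. It is not what this notebook uses, and the `toy_` data is made up; the key point is that the category list is fixed from the training data so that both sets share one mapping.

```python
import pandas as pd

# Hypothetical string categories for training and testing
toy_train = pd.Series([' Private', ' State-gov', ' Private', ' Self-emp'])
toy_test = pd.Series([' State-gov', ' Private'])

# Fix the category list from the training data so both sets share one mapping
toy_categories = sorted(toy_train.unique())
toy_train_codes = pd.Categorical(toy_train, categories=toy_categories).codes
toy_test_codes = pd.Categorical(toy_test, categories=toy_categories).codes

print(dict(enumerate(toy_categories)))
print(toy_train_codes, toy_test_codes)
```

With this approach, a value that is not in the category list receives the code -1 rather than a valid integer.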

In [None]:
df.head()

Looks good: our data now contains only numerical information.

## Continuous preprocessing
The weight initialisation we use is optimal for inputs which are unit-Gaussian. The closest we can get is to shift and scale each feature to have mean zero and standard deviation one. Scikit-learn has a `Pipeline` class to handle a series of transformations to data, and we'll use the `StandardScaler` to transform the data.

In [None]:
input_pipe = Pipeline([('norm_in', StandardScaler(with_mean=True, with_std=True))])

Next we need to fit the transformation to the data. Note the Boolean indexing, which selects only the training rows so that the validation data does not influence the scaling.

In [None]:
input_pipe.fit(df[df.val == 0][cont_feats].values.astype('float32'))

And finally apply the transformation to the training, validation, and testing data.

In [None]:
df[cont_feats] = input_pipe.transform(df[cont_feats].values.astype('float32'))
df_test[cont_feats] = input_pipe.transform(df_test[cont_feats].values.astype('float32'))

We can check the transformation by plotting an example feature

In [None]:
df.Age.hist()
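
A numerical check can complement the plot. The sketch below uses randomly generated data (all `toy_` names are hypothetical) to show what to expect: the rows used for fitting end up with mean zero and standard deviation one, while held-out rows are close to, but not exactly at, those values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical feature with mean ~40 and standard deviation ~12
toy_train = rng.normal(40, 12, size=(1000, 1))
toy_val = rng.normal(40, 12, size=(200, 1))

# The shift and scale are computed from the training rows only
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_val_scaled = toy_scaler.transform(toy_val)

print(toy_train_scaled.mean(), toy_train_scaled.std())  # 0 and 1 up to floating-point error
print(toy_val_scaled.mean(), toy_val_scaled.std())      # close to 0 and 1, but not exact
```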

# Model 
Now we need to build a model to fit to the data. Use the previous two example notebooks to help write a function which returns an appropriate model.

In [None]:
def get_model(n_in:int, hidden_sizes:List[int], n_out:int=1, lr:float=1e-3) -> Model:
    # Your code here______________________________________________________
    
    
    
    # ____________________________________________________________________

    model.compile(optimizer=Adam(lr=lr), loss='binary_crossentropy', metrics=['acc'])
    print(model.summary())
    return model

In [None]:
model = get_model(len(train_feats), [100, 100], 1, lr=1e-3)

Now we need to extract our inputs and targets.

In [None]:
x, y = df[df.val == 0][train_feats], df[df.val == 0]['Target']
x_val, y_val = df[df.val == 1][train_feats], df[df.val == 1]['Target']
len(x)

It's also useful to plot the training and validation performance evolution

In [None]:
def plot_history(hist:History) -> None:
    with sns.axes_style('whitegrid'):
        fig, axs = plt.subplots(1, 2, figsize=(24,8))
        axs[0].plot(range(len(hist.history['loss'])), np.array(hist.history['loss']), label='Training')
        axs[0].plot(range(len(hist.history['val_loss'])), np.array(hist.history['val_loss']), label='Validation')
        axs[1].plot(range(len(hist.history['acc'])), np.array(hist.history['acc']), label='Training')
        axs[1].plot(range(len(hist.history['val_acc'])), np.array(hist.history['val_acc']), label='Validation')

        axs[0].set_ylabel("Loss", fontsize=24)
        axs[1].set_ylabel("Accuracy", fontsize=24)
        for ax in axs:
            ax.legend(fontsize=16)
            ax.set_xlabel("Epoch", fontsize=24)
            ax.tick_params(axis='x', labelsize=16)
            ax.tick_params(axis='y', labelsize=16)
        plt.show()

In [None]:
hist = model.fit(x=x, y=y, validation_data=(x_val, y_val), batch_size=128, epochs=30, verbose=0)
plot_history(hist)

# 1-hot encoding
Our treatment of categorical features so far has simply allowed us to feed the inputs into the model. However, NNs have a continuous response to their inputs, and by encoding categories as integers we have implied a numerical comparison between them.
An encoding which removes this implication, and allows the network to construct a separate response to each category, is to expand each feature into a series of features, one per category, marking the correct column with a 1 and the rest with zeros, e.g. Monday --> (1,0,0,0,0,0,0).

Pandas has a function to do this.

In [None]:
df = pd.get_dummies(df, columns=cat_feats)
df_test = pd.get_dummies(df_test, columns=cat_feats)
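
To see what `pd.get_dummies` does in isolation, here is a small sketch on a made-up weekday column (`toy_days` is hypothetical). The `dtype=int` argument is only there to print 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical categorical column with three distinct values
toy_days = pd.DataFrame({'Day': ['Mon', 'Tue', 'Mon', 'Wed']})

# One new column per category; each row has a single 1 in the column of its category
toy_onehot = pd.get_dummies(toy_days, columns=['Day'], dtype=int)
print(toy_onehot)
```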

In [None]:
df.head()

As we can see, the original categorical columns have been removed and replaced by dedicated columns for each feature category. We do need to remember the `NativeCountry` category which was not present in the testing data, so we add a column of zeros for it.

In [None]:
cat_maps['NativeCountry'][' Holand-Netherlands']

In [None]:
df_test[f"NativeCountry_{cat_maps['NativeCountry'][' Holand-Netherlands']}"] = 0
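
When more than one category is missing, adding columns by hand becomes tedious. One possible alternative, sketched here on made-up data (the `toy_` frames are hypothetical), is to reindex the test frame onto the training columns. Note that this also drops any test column that is absent from training and puts the columns in the same order.

```python
import pandas as pd

# Hypothetical data: category 'C' appears in training but not in testing
toy_train = pd.get_dummies(pd.DataFrame({'Country': ['A', 'B', 'C']}),
                           columns=['Country'], dtype=int)
toy_test = pd.get_dummies(pd.DataFrame({'Country': ['B', 'A']}),
                          columns=['Country'], dtype=int)

# Missing columns are created and filled with 0; column order follows the training frame
toy_test_aligned = toy_test.reindex(columns=toy_train.columns, fill_value=0)
print(toy_test_aligned)
```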

Note that the number of input features has now shot up from 14 to 108: 1-hot encoding is not an efficient method, but should help our model to better access the information in the data.

In [None]:
cat_feats = [f for f in df.columns if f not in cont_feats+['Target', 'val']]
train_feats = cont_feats+cat_feats
len(train_feats)

In [None]:
model = get_model(len(train_feats), [100, 100], 1, lr=1e-3)

In [None]:
x, y = df[df.val == 0][train_feats], df[df.val == 0]['Target']
x_val, y_val = df[df.val == 1][train_feats], df[df.val == 1]['Target']
len(x)

In [None]:
hist = model.fit(x=x, y=y, validation_data=(x_val, y_val), batch_size=128, epochs=30, verbose=0)
plot_history(hist)

## Regularisation
So, whilst the performance of the model improved, we can see that over time the model begins to overtrain: the training loss continues to decrease, but this improvement does not transfer to the validation data, and the model loses generalisation power.

We can help combat this by applying *regularisation* techniques. One of the most popular is [Dropout](http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf).
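
As a rough illustration of the idea, the NumPy sketch below implements "inverted" dropout: during training a random fraction of activations is set to zero and the survivors are rescaled so that the expected value is unchanged. The function `toy_dropout` is hypothetical and only meant to show the mechanism; in the model itself the Keras `Dropout` layer handles this and is inactive at inference time.

```python
import numpy as np

def toy_dropout(activations: np.ndarray, rate: float, rng: np.random.Generator) -> np.ndarray:
    """Inverted dropout: zero a fraction `rate` of units and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where the unit is kept
    return activations * mask / keep_prob             # rescaling preserves the expected value

rng = np.random.default_rng(0)
toy_activations = np.ones((1000, 100))
toy_dropped = toy_dropout(toy_activations, rate=0.5, rng=rng)

# Roughly half of the units are zeroed, and the mean stays close to the original value of 1
print((toy_dropped == 0).mean(), toy_dropped.mean())
```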

Adapt your previous model construction function to apply Dropout at a set rate after each dense layer.

In [None]:
def get_do_model(n_in:int, hidden_sizes:List[int], n_out:int=1, lr:float=1e-3, do:float=0) -> Model:
    # Your code here______________________________________________________
    
    
    
    
    # ____________________________________________________________________

    model.compile(optimizer=Adam(lr=lr), loss='binary_crossentropy', metrics=['acc'])
    print(model.summary())
    return model

In [None]:
model = get_do_model(len(train_feats), [100, 100], 1, lr=1e-3, do=0.5)

In [None]:
hist = model.fit(x=x, y=y, validation_data=(x_val, y_val), batch_size=128, epochs=30, verbose=0)
plot_history(hist)

The model should still reach about the same validation performance, but the overtraining should now be suppressed.