# Census Income Example
This is a Keras example of a classification problem. To simplify the data preparation, I used the Census Income problem from the scikit-learn class.

The dataset comes from http://archive.ics.uci.edu/. 

Data extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: `((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0))`. The data was also preprocessed for the purpose of this example.

The prediction task is to determine whether a person makes over 50K a year.


### List of attributes:

##### Features
- age: continuous. 
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
- education-num: continuous. 
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 
- sex: Female, Male. 
- hours-per-week: continuous. 
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.



##### Labels
- income: >50K, <=50K. 

# Install tensorflow
If necessary, uncomment the line below.

In [None]:
# Tensorflow installation - uncomment if necessary
#!pip install tensorflow

# Imports

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf

import matplotlib.pyplot as plt
%matplotlib inline

### Load dataset

In [None]:
df = pd.read_csv("https://github.com/PrzemekSekula/DeepLearningClasses1/raw/master/data/census.csv")

print (df.shape)
print (df.columns)
df.head()

# Data preparation
### Selecting columns
Personally, I decided to delete the following columns:
- education - we have education-num, which carries the same information
- marital-status - too many classes
- relationship - I am not sure if it is useful, and there are many classes
- race - I am not sure if it is useful
- native-country - too many classes

**I deleted many columns just to make the task easier to read. Students are encouraged to experiment with the columns and check if they can improve the results.**


In [None]:
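# .copy() makes df an independent frame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
df = df[['age', 'workclass', 'education-num', 'occupation', 
         'sex', 'hours-per-week', 'income']].copy()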

df.head()

### Data preprocessing

#### First step - change the labels into binary values.

In [None]:
df['income'] = df['income'].replace({'<=50K': 0, '>50K': 1}).astype(int, errors='ignore')
print (df.income.value_counts())
df.head()

#### Second step - change the `sex` column into binary values

In [None]:
df = pd.get_dummies(df, columns=['sex'], drop_first=True)
df.head()
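
As a minimal illustration of what `drop_first=True` does, the sketch below encodes a toy `sex` column. The three-row frame is invented for this example and is not taken from the census data. With only two categories, dropping the first one leaves a single indicator column, which is enough to recover the original value.

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({'age': [25, 40, 31], 'sex': ['Male', 'Female', 'Male']})

# Categories are sorted alphabetically, so 'Female' becomes the dropped
# (reference) level and only the 'sex_Male' indicator remains
encoded = pd.get_dummies(toy, columns=['sex'], drop_first=True)

print(list(encoded.columns))
# Cast to int because recent pandas versions return boolean indicators
print(encoded['sex_Male'].astype(int).tolist())
```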

#### Third step - replace rare classes

Let's check if we have any rare classes

In [None]:
df.workclass.value_counts()

In [None]:
df.occupation.value_counts()

Then we replace the rare classes in the columns that will be one-hot encoded, so that they do not produce nearly empty columns.

In [None]:
df.loc[df.workclass.isin(['Without-pay', 'Never-worked']), 'workclass'] = '?'
df.workclass.value_counts()

In [None]:
df.loc[df.occupation.isin(['Protective-serv', 'Priv-house-serv', 'Armed-Forces']), 'occupation'] = '?'
df.occupation.value_counts()
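
The cells above list the rare classes by hand. A possible alternative is to pick them by a count threshold. The helper below is a sketch only: `merge_rare_classes` and its `min_count` parameter are hypothetical names introduced here, and the toy series is invented. It reuses the same `'?'` placeholder as the cells above.

```python
import pandas as pd

def merge_rare_classes(series, min_count, placeholder='?'):
    """Replace every class seen fewer than min_count times with placeholder."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare, substitute the placeholder elsewhere
    return series.where(~series.isin(rare), placeholder)

# Toy column, invented for illustration only
toy_workclass = pd.Series(['Private'] * 5 + ['State-gov'] * 3
                          + ['Without-pay', 'Never-worked'])

merged = merge_rare_classes(toy_workclass, min_count=2)
print(merged.value_counts())
```

The threshold is a judgment call: a higher `min_count` gives fewer one-hot columns but hides more detail.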

#### Final step - one hot encoding

In [None]:
df = pd.get_dummies(df, columns=['workclass', 'occupation'])
print (df.shape)
print (df.columns)
df.head()

## Splitting dataset

Let's split the dataset into features and labels first.
- `income` is the label (`y`)
- all other columns are features (`X`)

In [None]:
y = np.array(df.income).astype('int32')
X = np.array(df.drop(['income'], axis=1)).astype('float32')

### Train test split
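#### NOTE: This time we want to split the data into 3 datasets
Split ratio: 60% train, 20% validation, 20% test. The first split holds out 40% of the data, and the second split divides that 40% in half.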

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, 
                                                    stratify = y, random_state=1)

X_valid, X_test, y_valid, y_test = train_test_split(X_test, y_test, test_size=0.5, 
                                                    stratify = y_test, random_state=1)


print ('X train shape:', X_train.shape)
print ('y train shape:', y_train.shape)

print ('X valid shape:', X_valid.shape)
print ('y valid shape:', y_valid.shape)

print ('X test shape:', X_test.shape)
print ('y test shape:', y_test.shape)

# Keras

Import modules

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras import optimizers

## Building model

Let's define the hyperparameters first

In [None]:
CELLS_1 = 32
CELLS_2 = 8
LEARNING_RATE = 0.001
EPOCHS = 10
BATCH_SIZE = 128


NR_INPUTS = X_train.shape[1]

print ('X_train dataset contains {} features (columns).'.format(NR_INPUTS))

Now we may build the model

In [None]:
model = Sequential()

model.add(Input([NR_INPUTS, ]))
model.add(Dense(CELLS_1, activation='relu'))
model.add(Dense(CELLS_2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

optimizer = optimizers.Adam(learning_rate=LEARNING_RATE)

model.compile(loss = 'binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

model.summary()
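
The parameter counts printed by `model.summary()` can be checked by hand: each fully connected layer has one weight per input-unit pair plus one bias per unit. The helper below is a sketch of that arithmetic. `dense_param_count` is a hypothetical name, and the 20 input features are a made-up number, not the real value of `NR_INPUTS`.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for units in layer_sizes:
        total += (n_inputs + 1) * units  # weights plus one bias per unit
        n_inputs = units                 # this layer feeds the next one
    return total

# 20 input features is an invented number used only for this example
demo_params = dense_param_count(20, [32, 8, 1])
print(demo_params)
```

In the notebook, `dense_param_count(NR_INPUTS, [CELLS_1, CELLS_2, 1])` should match the total reported by `model.summary()`.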

In [None]:
history = model.fit(X_train, y_train, 
                    validation_data = (X_valid, y_valid),
                    batch_size = BATCH_SIZE, 
                    epochs=EPOCHS)

Let's display the training history

In [None]:
def plot_train_valid_history(history):
    """
    Plots train and validation losses (and accuracies, if available).
    Arguments: history - history of training (result of keras model.fit).
        history.history must be a dictionary that looks as follows:
        {
            'loss' : .....
            'val_loss' : .....
            'accuracy' : .... # Optional
            'val_accuracy' : ..... # Optional
        }
    """
    epochs = np.arange(len(history.history['val_loss'])) + 1
    fig = plt.figure(figsize=(8, 4))
    if 'accuracy' in history.history:
        ax1 = fig.add_subplot(121)
        ax1.plot(epochs, history.history['loss'], c='b', label='Train loss')
        ax1.plot(epochs, history.history['val_loss'], c='g', label='Valid loss')
        plt.legend(loc='lower left');
        plt.grid(True)        
        
        ax1 = fig.add_subplot(122)
        ax1.plot(epochs, history.history['accuracy'], c='b', label='Train acc')
        ax1.plot(epochs, history.history['val_accuracy'], c='g', label='Valid acc')
        plt.legend(loc='lower right');
        plt.grid(True)        
         
        
    else:
        ax1 = fig.add_subplot(111)
        ax1.plot(epochs, history.history['loss'], c='b', label='Train loss')
        ax1.plot(epochs, history.history['val_loss'], c='g', label='Valid loss')
        plt.legend(loc='lower left');
        plt.grid(True)
    plt.show()


plot_train_valid_history(history)

In [None]:
score, acc = model.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

## Task 1
Create, train and test a model with the following parameters:
- First hidden layer: 32 neurons, relu activation
- Dropout after the first hidden layer, keep probability = 0.5 (the Keras `Dropout` layer takes the fraction of units to drop, which is also 0.5 here)
- Second hidden layer: 32 neurons, relu activation
- Output layer: 1 neuron, sigmoid activation

Training parameters:
- Learning Rate: 0.0003
- Number of Epochs: 50
- Batch size: 128

*Note: You will need your model in Task 2, so it is a good idea to write a function that creates the model.*

![alt text](./img/model_keras_task1.png "Task 1 model")
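
A note on the dropout parameter: the Keras `Dropout` layer is configured with `rate`, the fraction of units to drop, so `rate = 1 - keep probability`. During training, the surviving activations are scaled up by `1 / (1 - rate)` so that their expected value stays the same. The NumPy sketch below illustrates this idea. It is not the Keras implementation, and `inverted_dropout` is a hypothetical helper name.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero each element with probability `rate` and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob  # True for the units that are kept
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 32))  # toy activations, all equal to 1
dropped = inverted_dropout(activations, rate=0.5, rng=rng)

print('fraction kept:', (dropped != 0).mean())
print('mean activation:', dropped.mean())
```

With `rate=0.5`, roughly half of the values are zeroed and the rest are doubled, so the mean activation stays close to 1.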

In [None]:
# HYPERPARAMETERS
# ENTER YOUR CODE HERE


In [None]:
def create_model():
    # ENTER YOUR CODE HERE

    return model


In [None]:
# ENTER YOUR CODE HERE


## Question 1
- How does the model behave? Can you see any overfitting or underfitting problems?
- How can you prevent these problems?

## Task 2
Normalize your features. Use `StandardScaler` from the `sklearn.preprocessing` module. Then train your model on the normalized features. Did it change the behaviour of the model?
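
As background for this task, standardization subtracts the mean of each feature and divides by its standard deviation. The statistics should come from the training data only and then be reused for the validation and test data. The sketch below shows the arithmetic with plain NumPy on invented toy values. It is not a solution to the task, which should use `StandardScaler`.

```python
import numpy as np

# Toy features on very different scales (invented values)
toy_train = np.array([[20.0, 1000.0], [40.0, 3000.0], [60.0, 5000.0]])
toy_valid = np.array([[30.0, 2000.0]])

mean = toy_train.mean(axis=0)  # statistics come from the training rows only
std = toy_train.std(axis=0)    # population standard deviation (ddof=0)

toy_train_scaled = (toy_train - mean) / std
toy_valid_scaled = (toy_valid - mean) / std  # reuse the training statistics

print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0))
print(toy_valid_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while the validation rows are only approximately centered because they use the training statistics.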

In [None]:
from sklearn.preprocessing import StandardScaler
# ENTER YOUR CODE HERE

In [None]:
# Create and train the model
# ENTER YOUR CODE HERE
