# Intro to Deep Learning with Keras

This Jupyter notebook contains code and explanations for the 2018 AIS Intro to Deep Learning workshop.

## How to Use This Notebook
This notebook has several cells, some with markdown and others with runnable Python code. To run a cell, click on the cell and then use the **SHIFT + ENTER** keyboard shortcut or navigate to **Cell** in the top menu bar and click on **Run Cells** in the dropdown menu.

## Software Prerequisites
Make sure to install the following software/libraries:

- **Anaconda** - Python distribution with many useful libraries
- **TensorFlow** - deep learning library, acts as a backend for Keras
- **Keras** - a high-level deep learning library that runs on top of TensorFlow

## Libraries Used

- **NumPy** - for handling linear algebra and numerical computations in machine learning.
- **Pandas** - for reading in, preprocessing, and analyzing data.
- **scikit-learn** - a general-purpose machine learning library.
- **Keras** - features a simple API for deep learning.

## About the Dataset - Predicting Income Class Using Census Data
The goal of this exercise is to train a neural network to predict whether or not a person earns over $50K per year using information from Census data. This is an example of a **binary classification** problem. The dataset is present in the GitHub repo for this workshop, but can also be found online at the UCI Machine Learning repository: https://archive.ics.uci.edu/ml/datasets/Adult

## Importing Numpy and Pandas

In [1]:
import numpy as np
import pandas as pd

## Reading in the Data
Using Pandas, we can read the dataset from a zipped CSV file and store the data in a DataFrame object.

In [2]:
data = pd.read_csv('adult-census-income.zip')
data.head() # Displays the first 5 rows of the data

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education.num     32561 non-null int64
marital.status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital.gain      32561 non-null int64
capital.loss      32561 non-null int64
hours.per.week    32561 non-null int64
native.country    32561 non-null object
income            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
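
Note that `data.info()` reports no null values even though the preview shows `?` entries: the dataset marks unknown values with the string `?`, which Pandas treats as ordinary text. The following is a minimal sketch, using a small made-up frame (`toy` is hypothetical, not the workshop data), of how those placeholders could be counted:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics the '?' placeholder convention
toy = pd.DataFrame({
    'workclass': ['?', 'Private', '?', 'Private'],
    'occupation': ['?', 'Exec-managerial', '?', 'Prof-specialty'],
    'age': [90, 82, 66, 41],
})

# Pandas sees '?' as a regular string, so nothing is reported as missing
nulls_before = toy.isnull().sum()

# Replacing '?' with NaN makes the gaps visible to isnull()
toy_clean = toy.replace('?', np.nan)
nulls_after = toy_clean.isnull().sum()

print(nulls_after)
```

This notebook keeps `?` as a category of its own rather than dropping or imputing those rows, which is why a `workclass_?` column appears after one-hot encoding.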


### Label Encoding the Data

In [4]:
data['education'].unique()

array(['HS-grad', 'Some-college', '7th-8th', '10th', 'Doctorate',
       'Prof-school', 'Bachelors', 'Masters', '11th', 'Assoc-acdm',
       'Assoc-voc', '1st-4th', '5th-6th', '12th', '9th', 'Preschool'],
      dtype=object)

In [5]:
def encode_education(education):
    # Ordinal codes ordered from least to most schooling, following the
    # ranking used by the dataset's own education.num column
    # (Prof-school sits between Masters and Doctorate).
    code_dict = {'Preschool': 0,
                 '1st-4th': 1,
                 '5th-6th': 2,
                 '7th-8th': 3,
                 '9th': 4,
                 '10th': 5,
                 '11th': 6,
                 '12th': 7,
                 'HS-grad': 8,
                 'Some-college': 9,
                 'Assoc-voc': 10,
                 'Assoc-acdm': 11,
                 'Bachelors': 12,
                 'Masters': 13,
                 'Prof-school': 14,
                 'Doctorate': 15}

    return code_dict[education]

data['education'] = data['education'].apply(encode_education)

In [6]:
data['native.country'].unique()

array(['United-States', '?', 'Mexico', 'Greece', 'Vietnam', 'China',
       'Taiwan', 'India', 'Philippines', 'Trinadad&Tobago', 'Canada',
       'South', 'Holand-Netherlands', 'Puerto-Rico', 'Poland', 'Iran',
       'England', 'Germany', 'Italy', 'Japan', 'Hong', 'Honduras', 'Cuba',
       'Ireland', 'Cambodia', 'Peru', 'Nicaragua', 'Dominican-Republic',
       'Haiti', 'El-Salvador', 'Hungary', 'Columbia', 'Guatemala',
       'Jamaica', 'Ecuador', 'France', 'Yugoslavia', 'Scotland',
       'Portugal', 'Laos', 'Thailand', 'Outlying-US(Guam-USVI-etc)'],
      dtype=object)

In [7]:
from sklearn.preprocessing import LabelEncoder

# Two-valued columns: replace each string value with 0 or 1
label_encode_cols = ['income', 'sex']

for col in label_encode_cols:
    lbl = LabelEncoder()
    data[col] = lbl.fit_transform(data[col])

# Multi-valued categorical columns: one-hot encode into indicator columns
data = pd.get_dummies(data, columns=['workclass', 'marital.status', 'occupation', 'relationship',
                                     'race', 'native.country'])

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 93 columns):
age                                          32561 non-null int64
fnlwgt                                       32561 non-null int64
education                                    32561 non-null int64
education.num                                32561 non-null int64
sex                                          32561 non-null int64
capital.gain                                 32561 non-null int64
capital.loss                                 32561 non-null int64
hours.per.week                               32561 non-null int64
income                                       32561 non-null int64
workclass_?                                  32561 non-null uint8
workclass_Federal-gov                        32561 non-null uint8
workclass_Local-gov                          32561 non-null uint8
workclass_Never-worked                       32561 non-null uint8
workclass_Private                
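
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up frame (`toy` and its values are illustrative only). Each category becomes its own indicator column named `<column>_<value>`, which is how the 15 original columns grow to 93.

```python
import pandas as pd

# Hypothetical two-column frame, not the workshop data
toy = pd.DataFrame({
    'age': [54, 41, 66],
    'race': ['White', 'Black', 'White'],
})

# One indicator column is created per distinct value of 'race'
encoded = pd.get_dummies(toy, columns=['race'])

print(encoded.columns.tolist())
print(encoded)
```

Numeric columns such as `age` pass through unchanged; only the columns listed in `columns=` are expanded.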

### Scaling the Data

In [8]:
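from sklearn.preprocessing import StandardScaler

# Separate the input features from the target column
X = data.drop(['income'], axis=1)
y = data['income']

# Standardize every feature column to zero mean and unit variance
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)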
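
One caveat with the cell above: fitting the scaler on all rows lets statistics from the eventual test rows influence the training features. A common alternative, sketched here on synthetic data (the array `features` and the split settings are assumptions for illustration), is to split first and fit `StandardScaler` on the training portion only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (not the census data)
rng = np.random.RandomState(42)
features = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
labels = rng.randint(0, 2, size=200)

# Split first so the test rows stay unseen while the scaler is fitted
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

scaler = StandardScaler()
f_train_scaled = scaler.fit_transform(f_train)  # learn mean/std from training rows
f_test_scaled = scaler.transform(f_test)        # reuse those statistics on test rows

print(f_train_scaled.mean(axis=0).round(3))
print(f_test_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately standardized, since they are scaled with the training statistics.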




### Splitting the Data into Training and Testing Sets

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [10]:
y_train

6010     0
4461     0
8762     0
26437    0
2952     1
21580    0
16361    0
10508    1
21269    0
10914    0
24091    1
10042    0
5824     0
23009    0
20200    1
2084     1
5503     0
9328     0
27232    1
7420     1
11491    0
8161     0
9892     1
17316    0
8483     0
23843    0
6081     0
4305     0
11358    0
29827    0
        ..
5051     0
5311     1
2433     1
23333    0
32157    1
30187    0
26967    0
769      1
32052    0
1685     1
8322     1
16023    0
27495    1
11363    0
28020    0
14423    0
21962    0
4426     0
29910    1
16850    0
6265     0
22118    0
11284    0
11964    0
21575    0
29802    0
5390     1
860      1
15795    1
23654    0
Name: income, Length: 29304, dtype: int64

### Building the Structure of a Neural Network in Keras

In [21]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(92, input_dim=92, activation='sigmoid'))
model.add(Dense(184, activation='sigmoid'))
model.add(Dense(92, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_14 (Dense)             (None, 92)                8556      
_________________________________________________________________
dense_15 (Dense)             (None, 184)               17112     
_________________________________________________________________
dense_16 (Dense)             (None, 92)                17020     
_________________________________________________________________
dense_17 (Dense)             (None, 1)                 93        
Total params: 42,781
Trainable params: 42,781
Non-trainable params: 0
_________________________________________________________________
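
The parameter counts in the summary can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. Here is a short sketch of that arithmetic (the helper `dense_params` is a hypothetical name, not a Keras function):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# (inputs, units) for each of the four Dense layers in the model
layer_shapes = [(92, 92), (92, 184), (184, 92), (92, 1)]

counts = [dense_params(n_in, n_out) for n_in, n_out in layer_shapes]
total = sum(counts)

print(counts)  # [8556, 17112, 17020, 93]
print(total)   # 42781
```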


### Adding a Loss Function and Optimizer to the Neural Network

In [22]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
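
For intuition, `binary_crossentropy` averages `-(y*log(p) + (1-y)*log(1-p))` over the samples, where `y` is the true label and `p` is the predicted probability of class 1. The following is a minimal NumPy sketch with made-up labels and predictions (the function name `binary_cross_entropy` and the clipping constant `eps` are illustrative choices; Keras's own implementation may differ in details):

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true = [0, 0, 1, 1]              # made-up labels
confident = [0.1, 0.2, 0.9, 0.8]   # predictions that lean the right way
unsure = [0.5, 0.5, 0.5, 0.5]      # uninformative predictions

loss_confident = binary_cross_entropy(y_true, confident)
loss_unsure = binary_cross_entropy(y_true, unsure)

print(loss_confident)
print(loss_unsure)
```

Predictions that lean toward the correct class give a lower loss than always predicting 0.5, whose loss equals `log(2)`.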

### Training the Model

In [23]:
model.fit(X_train.values, y_train, validation_data=(X_test.values, y_test), epochs=20, batch_size=32)

Train on 29304 samples, validate on 3257 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x12b818668>
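
Because the final layer is a single sigmoid unit, the trained model outputs a probability between 0 and 1 for each row. Here is a sketch of turning such probabilities into class labels and measuring accuracy, using made-up numbers in place of real `model.predict` output (the 0.5 threshold is a conventional default, not something this notebook specifies):

```python
import numpy as np

# Made-up probabilities standing in for model.predict(X_test.values),
# which returns one column of probabilities per row
probs = np.array([[0.08], [0.73], [0.41], [0.95], [0.52]])
true_labels = np.array([0, 1, 1, 1, 0])

# Predict class 1 (income > 50K) when the probability is at least 0.5
pred_labels = (probs.ravel() >= 0.5).astype(int)
accuracy = float((pred_labels == true_labels).mean())

print(pred_labels)
print(accuracy)
```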