# Intro to Deep Learning with Keras

This Jupyter notebook contains code and explanations for the 2018 AIS Intro to Deep Learning workshop.

## How to Use This Notebook
This notebook has several cells, some with markdown and others with runnable Python code. To run a cell, click on the cell and then use the **SHIFT + ENTER** keyboard shortcut or navigate to **Cell** in the top menu bar and click on **Run Cells** in the dropdown menu.

## Software Prerequisites
Make sure to install the following software/libraries:

- **Anaconda** - Python distribution with many useful libraries
- **TensorFlow** - deep learning library, acts as a backend for Keras
- **Keras** - a high-level deep learning library that runs on top of TensorFlow

## Libraries Used

- **NumPy** - for handling linear algebra and numerical computations in machine learning.
- **Pandas** - for reading in, preprocessing, and analyzing data.
- **scikit-learn** - a general-purpose ML library.
- **Keras** - features a simple API for deep learning.

## About the Dataset - Predicting Income Class Using Census Data
The goal of this exercise is to train a neural network to predict whether or not a person earns over $50K per year using information from Census data. This is an example of a **binary classification** problem. The dataset is present in the GitHub repo for this workshop, but can also be found online at the UCI Machine Learning repository: https://archive.ics.uci.edu/ml/datasets/Adult

## Importing NumPy and Pandas

In [1]:
import numpy as np
import pandas as pd

## Reading in the Data
Using Pandas, we can read the dataset from a zipped csv file, and store the data in a dataframe object.

In [2]:
data = pd.read_csv('adult-census-income.zip')
data.head() # Displays the first 5 rows of the data

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |


Based on the output above, we can see that our dataframe has several columns with information such as the age and education level of different individuals. We can use the **info()** function to get some more information about our data.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education.num     32561 non-null int64
marital.status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital.gain      32561 non-null int64
capital.loss      32561 non-null int64
hours.per.week    32561 non-null int64
native.country    32561 non-null object
income            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


### Understanding the Problem - Features vs. Targets 
Every machine learning problem involves using some variables called **features** to predict single or multiple **targets**. For this problem, based on our data we have a total of **14 features**:
- **age**
- **workclass**
- **fnlwgt**
- **education**
- **education.num**
- **marital.status**
- **occupation**
- **relationship**
- **race**
- **sex**
- **capital.gain**
- **capital.loss**
- **hours.per.week**
- **native.country**

Our goal is to train a model that can use these **features** to predict the target value, which is the **income** column. This is a **binary classification** problem since the target (income) has two possible values: <=50K and >50K. The income of a given person is either at most $50K or above it, and we want to train a deep learning model to predict this income class for a person based on the information provided by the features.
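To make the split between features and target concrete, here is a minimal sketch that uses plain Python dictionaries instead of the dataframe. The two rows are abridged from the `data.head()` output shown earlier. The helper name `split_features_target` is hypothetical, and the 0/1 encoding of the two income classes is an assumption chosen for illustration.

```python
# Two rows abridged from the data.head() output (only a few columns kept).
rows = [
    {"age": 90, "education.num": 9, "hours.per.week": 40, "income": "<=50K"},
    {"age": 82, "education.num": 9, "hours.per.week": 18, "income": "<=50K"},
]

TARGET = "income"


def split_features_target(row, target=TARGET):
    """Return (features, label), where label is 1 for '>50K' and 0 otherwise."""
    # Everything except the target column is a feature.
    features = {name: value for name, value in row.items() if name != target}
    # Assumed encoding: the two income classes become the integers 0 and 1.
    label = 1 if row[target].strip() == ">50K" else 0
    return features, label


features, label = split_features_target(rows[0])
print(features)  # the inputs the model sees
print(label)     # the binary class the model should predict
```

Both sample rows belong to the <=50K class, so their label is 0. A row whose income is `>50K` would get the label 1.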

## What is Deep Learning? What are Neural Networks?
Deep learning is a subfield of machine learning focused on using biologically inspired models known as **neural networks** to solve a wide range of machine learning problems. At a high level, neural networks are **mathematical models** loosely based on neurological concepts in **human learning**.

## The Three Key Components of Neural Networks
Neural networks can be a slightly challenging concept to grasp since they involve a mix of ideas from math, computer science, and even neuroscience. There is a lot of technical information in this workshop, so I would recommend focusing on gaining a high-level understanding of **three fundamental components** of neural networks:

1. **Structure** - what the neural network looks like, including the mathematical functions involved, the number of inputs and outputs, and the parameters, called **weights**, that the network has to learn.
2. **Loss Function** - a metric that tells us how good or bad the network's predictions are.
3. **Optimizer** - the algorithm used for **learning the weights** that give the network the best predictions.
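The three components can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the "network" is a single sigmoid neuron, the loss is binary cross-entropy, the optimizer is plain gradient descent (not the optimizer used later in this notebook), and the two training examples are made up.

```python
import math


# Structure: a single neuron with a weight per input and one bias.
def predict(weights, bias, inputs):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)


# Loss function: binary cross-entropy for one example.
def loss(y_true, y_pred):
    eps = 1e-12  # keeps log() away from 0
    y_pred = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))


# Optimizer: one step of plain gradient descent on one example.
def gradient_step(weights, bias, inputs, y_true, learning_rate=0.5):
    # For sigmoid + cross-entropy, the gradient w.r.t. z is (prediction - label).
    error = predict(weights, bias, inputs) - y_true
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


# Made-up toy data: (inputs, label) pairs.
examples = [([0.0, 1.0], 0), ([1.0, 0.0], 1)]
weights, bias = [0.0, 0.0], 0.0


def mean_loss(weights, bias):
    return sum(loss(y, predict(weights, bias, x)) for x, y in examples) / len(examples)


loss_before = mean_loss(weights, bias)
for _ in range(100):
    for x, y in examples:
        weights, bias = gradient_step(weights, bias, x, y)
loss_after = mean_loss(weights, bias)

print(loss_before, loss_after)  # the loss shrinks as the weights are learned
```

Keras handles all three pieces for us, but the division of labor is the same: the structure makes predictions, the loss scores them, and the optimizer adjusts the weights to lower the loss.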


## Structure of Neural Networks - A Biology and Math Lesson
As mentioned above, neural networks are **biologically inspired** models. For a moment, let's forget about machine learning and review how the human nervous system works to understand where the concept of neural networks came from. A **neuron**, the fundamental unit of this system, looks something like this:

<img src='neuron.png' width='400' height='200'>
(image source: http://home.agh.edu.pl/~vlsi/AI/intro/)

### Key Parts of the Neuron
A neuron transmits electrical signals that are constantly activated as the human brain learns and recognizes new concepts. The main parts of a neuron that we should take note of are:

- **inputs** - the neuron receives several input signals through **dendrites** from connections to neighboring neurons. 

### Building the Structure of a Neural Network in Keras

In [11]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(14, input_dim=14, activation='sigmoid'))
model.add(Dense(28, activation='sigmoid'))
model.add(Dense(14, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 14)                210       
_________________________________________________________________
dense_2 (Dense)              (None, 28)                420       
_________________________________________________________________
dense_3 (Dense)              (None, 14)                406       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 15        
Total params: 1,051
Trainable params: 1,051
Non-trainable params: 0
_________________________________________________________________
None
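The `Param #` column in the summary can be checked by hand. Each unit in a `Dense` layer has one weight per input plus one bias, so a layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units + n_units` parameters. The helper name `dense_param_count` below is hypothetical and used only for this check.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (one per input per unit) plus one bias per unit."""
    return n_inputs * n_units + n_units


# 14 input features, followed by the number of units in each Dense layer.
layer_sizes = [14, 14, 28, 14, 1]

# Pair each layer's input size with its unit count.
counts = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [210, 420, 406, 15]
print(sum(counts))  # 1051
```

These numbers match the per-layer counts and the total of 1,051 trainable parameters reported by the summary.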


### Adding a Loss Function and Optimizer to the Neural Network

In [12]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])



## Preparing Our Data for the Neural Network

### Label Encoding the Data

In [4]:
data['education'].unique()

array(['HS-grad', 'Some-college', '7th-8th', '10th', 'Doctorate',
       'Prof-school', 'Bachelors', 'Masters', '11th', 'Assoc-acdm',
       'Assoc-voc', '1st-4th', '5th-6th', '12th', '9th', 'Preschool'],
      dtype=object)

In [5]:
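def encode_education(education):
    """Return an ordinal code for an education level (higher = more schooling)."""
    # The order follows the dataset's education.num column, so Prof-school
    # (a professional degree) sits between Masters and Doctorate.
    code_dict = {'Preschool': 0,
                 '1st-4th': 1,
                 '5th-6th': 2,
                 '7th-8th': 3,
                 '9th': 4,
                 '10th': 5,
                 '11th': 6,
                 '12th': 7,
                 'HS-grad': 8,
                 'Some-college': 9,
                 'Assoc-voc': 10,
                 'Assoc-acdm': 11,
                 'Bachelors': 12,
                 'Masters': 13,
                 'Prof-school': 14,
                 'Doctorate': 15}

    return code_dict[education]

# Replace the education labels with their ordinal codes.
data['education'] = data['education'].apply(encode_education)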

In [6]:
data['native.country'].unique()

array(['United-States', '?', 'Mexico', 'Greece', 'Vietnam', 'China',
       'Taiwan', 'India', 'Philippines', 'Trinadad&Tobago', 'Canada',
       'South', 'Holand-Netherlands', 'Puerto-Rico', 'Poland', 'Iran',
       'England', 'Germany', 'Italy', 'Japan', 'Hong', 'Honduras', 'Cuba',
       'Ireland', 'Cambodia', 'Peru', 'Nicaragua', 'Dominican-Republic',
       'Haiti', 'El-Salvador', 'Hungary', 'Columbia', 'Guatemala',
       'Jamaica', 'Ecuador', 'France', 'Yugoslavia', 'Scotland',
       'Portugal', 'Laos', 'Thailand', 'Outlying-US(Guam-USVI-etc)'],
      dtype=object)

In [7]:
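from sklearn.preprocessing import LabelEncoder

# Collect the names of all remaining text (object-typed) columns.
object_cols = list(data.select_dtypes(include=['object']))

# Replace each text column with integer codes, using one encoder per column.
for col in object_cols:
    lbl = LabelEncoder()
    data[col] = lbl.fit_transform(data[col])

# Alternative (disabled): one-hot encode the nominal columns instead.
#data = pd.get_dummies(data, columns=['workclass', 'marital.status', 'occupation', 'relationship',
                                    #'race', 'native.country'])

data.info()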

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null int64
fnlwgt            32561 non-null int64
education         32561 non-null int64
education.num     32561 non-null int64
marital.status    32561 non-null int64
occupation        32561 non-null int64
relationship      32561 non-null int64
race              32561 non-null int64
sex               32561 non-null int64
capital.gain      32561 non-null int64
capital.loss      32561 non-null int64
hours.per.week    32561 non-null int64
native.country    32561 non-null int64
income            32561 non-null int64
dtypes: int64(15)
memory usage: 3.7 MB
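To see what label encoding does to a column, here is a minimal pure-Python sketch. It assumes the encoder assigns integers to the distinct values in sorted order, which is how `LabelEncoder` orders its classes. The helper name `fit_label_encoder` is hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct value to its position in sorted order."""
    classes = sorted(set(values))
    return {label: index for index, label in enumerate(classes)}


incomes = ["<=50K", ">50K", "<=50K"]
mapping = fit_label_encoder(incomes)
encoded = [mapping[value] for value in incomes]

print(mapping)  # {'<=50K': 0, '>50K': 1}
print(encoded)  # [0, 1, 0]
```

For the income column this gives 0 for <=50K and 1 for >50K. For nominal columns such as `native.country`, keep in mind that the resulting integers carry no meaningful order.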


### Scaling the Data

In [8]:
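from sklearn.preprocessing import StandardScaler

# Separate the features (X) from the target (y).
X = data.drop(['income'], axis=1)
y = data['income']

# Standardize each feature column to zero mean and unit variance.
for col in X.columns:
    scaler = StandardScaler()
    X[col] = scaler.fit_transform(X[col].values.reshape(-1, 1)).ravel()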
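Standardization subtracts a column's mean and divides by its standard deviation. The sketch below applies that formula to the five ages from the first rows of the dataset, using the population standard deviation, which is what `StandardScaler` uses. Because the notebook fits the scaler on the whole column, the real scaled values differ from the ones produced by this five-row example.

```python
import statistics

ages = [90, 82, 66, 54, 41]  # ages from the first five rows of the dataset

mean = statistics.mean(ages)
std = statistics.pstdev(ages)  # population standard deviation

# z = (x - mean) / std for every value in the column
scaled = [(age - mean) / std for age in ages]

print([round(value, 3) for value in scaled])
print(round(statistics.mean(scaled), 6), round(statistics.pstdev(scaled), 6))
```

After scaling, the values have a mean of 0 and a standard deviation of 1, which puts features with very different ranges (for example `age` and `fnlwgt`) on a comparable scale.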




### Splitting the Data into Training and Testing Sets

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
y_train

5514     0
19777    0
10781    0
32240    0
9876     0
5455     0
8615     0
29805    1
15081    0
17203    0
9747     1
12114    0
327      0
231      1
13770    1
5860     0
3272     0
27240    0
24431    0
7743     0
6951     1
21032    0
14778    0
11423    1
26009    1
16066    0
30825    0
10276    0
75       1
4        0
        ..
5051     0
5311     1
2433     1
23333    0
32157    1
30187    0
26967    0
769      1
32052    0
1685     1
8322     1
16023    0
27495    1
11363    0
28020    0
14423    0
21962    0
4426     0
29910    1
16850    0
6265     0
22118    0
11284    0
11964    0
21575    0
29802    0
5390     1
860      1
15795    1
23654    0
Name: income, Length: 26048, dtype: int64
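The `Length: 26048` above follows from `test_size=0.2`. The sketch below reproduces the split sizes with plain Python, assuming the test share is rounded up to a whole number of rows. The seeded shuffle only illustrates the idea of a reproducible random split and does not produce the same permutation as scikit-learn.

```python
import math
import random

n_samples = 32561
test_size = 0.2

# Assumption: the number of test rows is the test share rounded up.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 26048 6513

# Shuffle the row indices with a fixed seed, then cut them into two groups.
indices = list(range(n_samples))
random.Random(42).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

Fixing the seed (the role of `random_state=42` in the notebook) means the same rows land in the training and testing sets every time the cell is run.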

## Training the Neural Network

In [15]:
model.fit(X_train.values, y_train, validation_data=(X_test.values, y_test), epochs=15, batch_size=32)

Train on 26048 samples, validate on 6513 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x12373b4e0>
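The `epochs` and `batch_size` arguments together determine how many times the weights are updated during training. A small arithmetic sketch, assuming one optimizer update per batch and that a final partial batch would count as one more step:

```python
import math

n_train = 26048   # training samples reported by model.fit
batch_size = 32
epochs = 15

# Number of batches needed to see every training sample once.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 814
print(total_updates)    # 12210
```

Here 26,048 divides evenly by 32, so every batch is full.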