<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Neural Networks


---

<a id="learning-objectives"></a>
### Learning Objectives
- Get a quick overview of neural networks
- Build a Multilayer Perceptron Feed-Forward network with sklearn
- Compare to other algorithms

### Lesson Guide

- [Introduction](#introduction)
- [What are Neural Networks?](#what-are-neural-networks)
- [Pros vs. Cons](#pros-vs-cons)
- [Features](#features)
- [Outputs](#outputs)
- [Hidden Layers](#hidden-layers)
- [Activation Function](#activation-function)
	- [ReLU](#relu)
	- [Softmax](#softmax)
- [Backpropagation](#backpropagation)
- [Epochs and Batch Sizes](#epochs-and-batch-sizes)
- [Train a Multilayer Perceptron](#multilayer)
- [Additional Resources](#additionl-resources)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
np.set_printoptions(precision=4) 
    
plt.style.use('ggplot')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
import tensorflow as tf
from sklearn.base import TransformerMixin
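from sklearn.preprocessing import LabelBinarizer, StandardScaler, OneHotEncoder
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer is its replacement and is aliased so the pipelines below still work
from sklearn.impute import SimpleImputer as Imputer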
from sklearn.pipeline import make_pipeline, make_union
from sklearn.model_selection import train_test_split



<a id="introduction"></a>
## Introduction
---

Neural networks are incredibly powerful and constantly talked about these days -- they've handled tasks such as image classification, [playing Go](http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234), and [creating tweets in the style of President Trump](https://twitter.com/deepdrumpf?lang=en) with relatively little effort.


Neural networks were first studied in the 1940s (!) as a model of biological neural networks and had various ups and downs in the research mainstream.

Currently, this is a rapidly evolving field and represents some of the newest parts of Data Science, thanks to the increase in processing power and scale of data.

<a id="what-are-neural-networks"></a>
## What are Neural Networks?
---

Neural networks, in a single line, attempt to iteratively train a set (or sets) of weights that, when used together, return the most accurate predictions for a set of inputs. Just like many of our past models, the model is trained using a loss function, which our model will attempt to minimize over iterations. Remember that a loss function is some function that takes in our predictions and the actual values and returns some sort of aggregate value that shows how accurate (or not) we were.
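As a small illustration of what a loss function does, here is a minimal sketch using the mean squared error. The function name and the toy numbers are made up for this example; any loss that compares predictions with actual values would play the same role.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Toy data (hypothetical values, for illustration only)
y_true = [3.0, 5.0, 2.0]
good_pred = [2.5, 5.0, 2.5]   # close to the actual values
bad_pred = [0.0, 9.0, 6.0]    # far from the actual values

loss_good = mean_squared_error(y_true, good_pred)
loss_bad = mean_squared_error(y_true, bad_pred)
print(loss_good, loss_bad)
```

The closer predictions produce the smaller loss, which is exactly the signal the training procedure tries to minimize.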

Neural networks do this by establishing sets of neurons (known as hidden layers) that take in some sort of input(s), apply a weight, and pass that output onward. As we feed more data into the network, it adjusts those weights based on the output of the loss function, until we have highly trained and specific weights.

Why does one neuron turn out one way and a second neuron another? That's not generally something we can understand (though attempts have been made, such as Google's Deep Dream). You can understand this as a kind of (very advanced) mathematical optimization.

![](./assets/images/neuralnet.png)

<a id="pros-vs-cons"></a>
## Pros vs. Cons
---

**Advantages**

- Exceptionally accurate because we can learn complicated decision boundaries
- Appropriate for a vast range of tasks

**Disadvantages**

- Long training time
- Requires more data than most algorithms
- Can become very complex and hard to interpret
- Less user-friendly coding

<a id="features"></a>
## Features
---

Much like our other machine learning techniques, we do need to feed data into the network. While neural networks are pretty good at taking data in any form, it can help the network a lot to reduce the number of inputs when necessary, particularly with image data. A smaller number of inputs can often give results that are about as accurate as those from a larger number.

<a id="outputs"></a>
## Outputs
---

Much like other techniques, we do want some sort of output at the end as well. In most cases:

- for a regression style technique, one output is usually fine
- for a classification technique, one output per class is a good idea (in other words, we model a one-versus-all approach)
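To make "one output per class" concrete, the sketch below turns integer class labels into one column per class (one-hot encoding) with plain NumPy. The labels and the number of classes are invented for this example.

```python
import numpy as np

# Hypothetical class labels for four samples and three classes
labels = np.array([0, 2, 1, 2])
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the labels yields one target column per class
one_hot = np.eye(n_classes)[labels]
print(one_hot)
```

Each row contains a single 1 in the column of its class, which is the target format a network with one output per class is trained against.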


<a id="hidden-layers"></a>
## Hidden Layers
---

What makes neural networks tick is the idea of hidden layers. Hidden does not mean anything particularly devious here, just that it is not the input or the output layer.

Hidden layers can have any number of neurons per layer and you can include any number of layers in a neural network. Inputs into a neuron have different weights that are modified across iterations of the model and have a bias term as well -- you can almost imagine them as mini-linear models (though, that linearity does not need to hold at all).

<a id="activation-function"></a>
## Activation Function
---


Neurons process the input they receive in a standard way. Each of them first forms a weighted sum of its inputs plus a bias term:

$$
z = b+\sum_i w_i X_i
$$

Weights and intercepts are specific to each neuron and have to be determined through an iterative procedure.
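The formula for $z$ can be sketched in a few lines of NumPy. The inputs, weights, and bias below are arbitrary values chosen for illustration, and `pre_activation` is a hypothetical helper name.

```python
import numpy as np

def pre_activation(x, w, b):
    """Compute z = b + sum_i w_i * x_i for a single neuron."""
    return b + np.dot(w, x)

# Hypothetical inputs, weights, and bias
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.25])
b = 0.1

z = pre_activation(x, w, b)
print(z)
```

In a trained network, `w` and `b` would be the values found by the iterative fitting procedure rather than hand-picked numbers.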


Once the neuron has formed $z$ it applies a user-defined activation function to it. Some examples are:

<a id="relu"></a>
### ReLU

Also known as a [Rectified Linear Unit](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)), this returns 0 if the input is less than 0; otherwise it simply returns the input, i.e., 

- take the input and feed it through $f(z) = {\rm max}(0, z)$. 

This means that the neuron is activated when $z$ is positive and not activated otherwise.

**The ReLU function**
![](./assets/images/relu.png)
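A minimal NumPy sketch of $f(z) = {\rm max}(0, z)$, applied elementwise to a few example values (the values are arbitrary):

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(0, z), applied elementwise."""
    return np.maximum(0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
activated = relu(z)
print(activated)
```

Negative inputs are mapped to 0, and positive inputs pass through unchanged.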

<a id="softmax"></a>
### Softmax

You already know the softmax function from logistic regression; for two classes it reduces to the sigmoid. It returns values between 0 and 1 that sum to 1 across the classes, as desired for assigning probabilities of falling into any of the given classes ([more information here](https://en.wikipedia.org/wiki/Softmax_function)).
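The following sketch implements softmax with NumPy and checks the two-class case against the sigmoid. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result; the input scores are arbitrary.

```python
import numpy as np

def softmax(z):
    """Map a vector of scores to probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)        # avoids overflow in exp, result unchanged
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Three-class example with arbitrary scores
probs = softmax([2.0, 1.0, 0.1])

# Two classes with scores (z, 0): the first probability equals sigmoid(z)
two_class = softmax([1.5, 0.0])
print(probs, probs.sum())
print(two_class[0], sigmoid(1.5))
```

The two-class check works because $e^{z}/(e^{z}+e^{0}) = 1/(1+e^{-z})$.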

There's a wealth of information on different types of activation functions within [this article](https://en.wikipedia.org/wiki/Activation_function) - different activation functions, hidden layers, and neurons per layer can change how effective your neural network will be!


Of course there are a whole lot of other activation functions, for example the identity (just returning the input again) or the hyperbolic tangent.

One of the advantages of the ReLU is that its slope is constant but non-vanishing on the positive side, whereas functions like the sigmoid become very flat as they asymptote towards 1 or 0 which challenges optimization algorithms like gradient descent (see the [vanishing gradient](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) problem).
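To see the difference in slopes numerically, this sketch evaluates the derivatives of the sigmoid and the ReLU at a few increasingly large inputs. The helper names are chosen for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of the ReLU: 1 for positive inputs, 0 otherwise."""
    return (np.asarray(z) > 0).astype(float)

z = np.array([0.0, 2.0, 5.0, 10.0])
sig_slopes = sigmoid_grad(z)
relu_slopes = relu_grad(z)
print(sig_slopes)
print(relu_slopes)
```

The sigmoid slope starts at 0.25 and shrinks rapidly as the input grows, while the ReLU slope stays at 1 for every positive input.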

<a id="backpropagation"></a>
## Backpropagation
---

While there are many ways that a neural network learns, we'll focus on the method that is easiest to understand. Backpropagation is the method to adjust the weights in each hidden layer according to how well the network performed compared to the actual outputs in each iteration step.

How do we tell good choices from bad ones within the network? We compare the predictions with the actual values (using the loss function) and make small changes to the weights that reduce the loss. Most frequently we use gradient descent, with a learning rate that controls the size of each change.
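The update rule can be sketched for the simplest possible case: a single linear neuron trained with gradient descent on a mean squared error loss. This is only an illustration under that assumption. Backpropagation in a real network applies the chain rule to obtain the same kind of gradients for every layer. The synthetic data, the learning rate, and the number of steps are arbitrary choices.

```python
import numpy as np

# Synthetic data generated from known weights (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
true_b = 0.5
y = X @ true_w + true_b

# Start from zero weights and repeatedly step against the gradient
w = np.zeros(2)
b = 0.0
learning_rate = 0.1

for step in range(200):
    y_pred = X @ w + b
    error = y_pred - y
    grad_w = 2 * X.T @ error / len(y)   # d(MSE)/dw
    grad_b = 2 * error.mean()           # d(MSE)/db
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = np.mean((X @ w + b - y) ** 2)
print(w, b, final_loss)
```

A larger learning rate takes bigger steps and may overshoot, while a smaller one needs more iterations to reach the same loss.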

<a id="epochs-and-batch-sizes"></a>
## Epochs and Batch Sizes
---

- **Epochs:** The number of full passes through the training data (i.e., how many times one runs through the fitting process). There's no upper limit, but generally there will be a point where additional epochs no longer improve the model.
- **Batch Size:** Neural networks tend to work best when you feed portions of your data in at a time (versus the full set) and adjust weights in between. Smaller batches allow for more frequent updates but may be less consistent in what changes are needed.
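The relationship between epochs, batch size, and the number of weight updates can be sketched with a skeleton training loop. No actual model is trained here; the helper `iterate_minibatches` and the sizes are assumptions chosen for illustration.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that together cover a shuffled dataset once."""
    indices = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

rng = np.random.default_rng(0)
n_samples, batch_size, n_epochs = 596, 100, 3

n_updates = 0
for epoch in range(n_epochs):
    for batch_idx in iterate_minibatches(n_samples, batch_size, rng):
        # A real training loop would compute the gradients on the rows
        # selected by batch_idx and update the weights here
        n_updates += 1

print(n_updates)
```

With 596 samples and a batch size of 100, each epoch consists of six batches (the last one smaller), so three epochs give 18 weight updates.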

<a id="the-perceptron"></a>
### The Perceptron

This is the original model, inspired by how the neurons in the brain might work.

- Each neuron is connected to many other neurons in a network.
- These neurons both send and receive signals from connected neurons.
- When a neuron receives a signal it can either fire or not, depending on whether the incoming signal is above some threshold.

A single perceptron, like a neuron, can be thought of as a decision-making unit. If the weighted sum of the incoming signals is above a threshold, the perceptron fires, and if not it doesn't. In this case firing equals outputting a value of 1 and not firing equals outputting a value of 0.

<img src="images/ann-perceptron.png" width=500>

The graph shows how inputs are fed with some weights into a neuron. 
The neuron processes these inputs. It multiplies each input by its weight, sums them up together with a bias and checks if that sum is larger than zero. If it is, it produces a signal, otherwise not.

$$
\begin{eqnarray*}
b + \sum_i w_i X_i &>& 0 \Rightarrow 1 \\
b + \sum_i w_i X_i &\le& 0 \Rightarrow 0
\end{eqnarray*}
$$

The activation function used in the case of the perceptron is the Heaviside step function $\theta(z)$, giving 1 if $z>0$ and 0 otherwise.
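As a minimal sketch, here is a perceptron with hand-picked weights that behaves like a logical AND gate. The weights and bias are chosen by hand for illustration rather than learned.

```python
import numpy as np

def heaviside(z):
    """Step function: 1 if z > 0, otherwise 0."""
    return np.where(z > 0, 1, 0)

def perceptron_output(X, w, b):
    """Weighted sum of the inputs plus bias, passed through the step function."""
    return heaviside(X @ w + b)

# All four combinations of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked parameters: the sum exceeds zero only when both inputs are 1
w = np.array([1.0, 1.0])
b = -1.5

outputs = perceptron_output(X, w, b)
print(outputs)
```

The perceptron fires only for the last row, where both inputs are 1.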

Logistic regression would work in the same way, only choosing a different activation function, the sigmoid (or the softmax function in the case of more than two classes).

$$
\begin{eqnarray*}
z &=& b+\sum_i w_i X_i\\
\sigma(z) &=& \frac{1}{1+e^{-z}}
\end{eqnarray*}
$$

**What would the activation function look like for linear regression?**

<a id="multilayer"></a>
## Train a Multilayer Perceptron
---

- A feedforward multilayer perceptron is one of the most well known neural network architectures
- They are structured just like the picture in the intro
    - We have an input layer of features
    - These input features are passed into neurons in the hidden layers
    - Each neuron is a perceptron, kind of like a bunch of small linear regressions
    - We pass information from one layer of neurons to the next layer of neurons until we hit the output layer
    - The output layer does one calculation to output a prediction for the outcome.

![](./assets/images/neuralnet.png)
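The flow of information described above can be sketched as a forward pass in NumPy. The weights here are random rather than trained, the layer sizes are arbitrary, and using ReLU in the hidden layers with an identity output is an assumption suited to a regression setting.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(X, weights, biases):
    """Pass the inputs through each hidden layer, then the output layer."""
    activation = X
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ W + b)      # hidden layers
    return activation @ weights[-1] + biases[-1]   # identity output

rng = np.random.default_rng(0)
layer_sizes = [13, 10, 10, 1]   # inputs, two hidden layers, one output

# Random, untrained parameters with the shapes implied by the layer sizes
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

X = rng.normal(size=(5, 13))    # five samples with 13 features each
predictions = forward(X, weights, biases)
print(predictions.shape)
```

Training consists of adjusting `weights` and `biases` so that these predictions move closer to the actual outcomes.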

Let's start with a simple linear regression problem.

In [3]:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head()

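| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |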


In [4]:
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(df,y,test_size = 0.3,random_state=1)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [5]:
model = LinearRegression(fit_intercept=True)
model.fit(X_train,y_train)
metrics.mean_squared_error(y_test,model.predict(X_test))

19.829609248605056

In [6]:
metrics.r2_score(y_test,model.predict(X_test))

0.7836482437943237

In [7]:
print(model.coef_)
print(model.intercept_)

[-0.8262  1.4271  0.4085  0.6805 -2.5339  1.9297  0.1034 -3.2349  2.6968
 -1.9156 -2.1581  0.593  -4.141 ]
22.339830508474606


We can use sklearn's multi-layer perceptron for this regression problem. Let's first reproduce the linear regression result. To do so, we use the identity activation function. The default solver is `adam`; for small datasets, however, `lbfgs` might work better.

In [8]:
from sklearn.neural_network import MLPRegressor

In [9]:
nnet = MLPRegressor(hidden_layer_sizes=1,solver='lbfgs',activation='identity',max_iter=1000,random_state=1)
nnet.fit(X_train,y_train)
metrics.mean_squared_error(y_test,nnet.predict(X_test))

19.829613095150115

We can extract the neural network coefficients (the weights for each edge).

In [10]:
print(nnet.coefs_)

[array([[ 0.5094],
       [-0.8799],
       [-0.252 ],
       [-0.4195],
       [ 1.5624],
       [-1.1898],
       [-0.0637],
       [ 1.9947],
       [-1.6629],
       [ 1.1813],
       [ 1.3307],
       [-0.3656],
       [ 2.5531]]), array([[-1.6219]])]


In [11]:
nnet.intercepts_

[array([-9.0868]), array([7.6019])]

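With the identity activation, the network output is a linear function of the inputs, so multiplying the first-layer weights by the second-layer weight gives the linear regression coefficients: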

In [12]:
print((nnet.coefs_[0]*nnet.coefs_[1]).flatten())

[-0.8262  1.4271  0.4086  0.6804 -2.5341  1.9297  0.1032 -3.2352  2.6971
 -1.9159 -2.1582  0.593  -4.1409]


We get very good agreement:

In [13]:
print(model.coef_-(nnet.coefs_[0]*nnet.coefs_[1]).flatten())

[ 4.0056e-05 -6.8109e-06 -1.3393e-04  1.0810e-04  2.3199e-04  3.8637e-05
  1.4538e-04  3.3354e-04 -3.0156e-04  3.6917e-04  1.5923e-04 -9.2895e-07
 -9.8528e-05]


The same for the intercept:

In [14]:
print(nnet.intercepts_[0]*nnet.coefs_[1]+nnet.intercepts_[1])

[[22.3398]]


In [15]:
print(model.intercept_ - (nnet.intercepts_[0]*nnet.coefs_[1]+nnet.intercepts_[1]))

[[6.4899e-05]]
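Why this works can be checked independently of the fitted model. With identity activations, a network with one hidden neuron computes $(X W_1 + b_1) W_2 + b_2 = X (W_1 W_2) + (b_1 W_2 + b_2)$, which is just a linear model. The sketch below verifies this with random weights; all values are synthetic and the variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # four samples, three features

# Random parameters for a network with a single hidden neuron
W1 = rng.normal(size=(3, 1))
b1 = rng.normal(size=1)
W2 = rng.normal(size=(1, 1))
b2 = rng.normal(size=1)

# Network output with the identity activation in the hidden layer
network_out = (X @ W1 + b1) @ W2 + b2

# Equivalent single linear model
effective_coef = (W1 * W2).flatten()
effective_intercept = (b1 * W2 + b2).item()
linear_out = X @ effective_coef + effective_intercept

print(np.allclose(network_out.ravel(), linear_out))
```

The two outputs agree up to floating-point precision, mirroring the comparison between `model` and `nnet` above.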


Now let's add a few hidden layers and a non-trivial activation function to see if we can do better.

In [16]:
nnet = MLPRegressor(hidden_layer_sizes=(10,10,10),solver='lbfgs',activation='relu',random_state=1)
nnet.fit(X_train,y_train)
metrics.mean_squared_error(y_test,nnet.predict(X_test))

13.455161951094174

There are many more model coefficients now.

In [17]:
print([coef.shape for coef in nnet.coefs_])
print(sum([np.prod(coef.shape) for coef in nnet.coefs_]))

[(13, 10), (10, 10), (10, 10), (10, 1)]
340
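The number of weights follows directly from the layer sizes: each pair of consecutive layers contributes (size of first) × (size of second) weights, and every non-input neuron has one intercept. A small sketch, with `count_parameters` as a hypothetical helper name:

```python
def count_parameters(layer_sizes):
    """Return (number of weights, number of intercepts) for a dense network."""
    n_weights = sum(n_in * n_out
                    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    n_biases = sum(layer_sizes[1:])
    return n_weights, n_biases

# 13 input features, three hidden layers of 10 neurons, one output
n_weights, n_biases = count_parameters([13, 10, 10, 10, 1])
print(n_weights, n_biases)
```

For this architecture the weight count matches the 340 printed above, with 31 intercepts on top of that.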


This gives the full list of the first set of coefficients:

In [18]:
print(nnet.coefs_[0])

[[ 0.393   0.2272 -0.2243  0.1135 -0.7329 -0.4566 -0.7866 -0.6251  0.0682
   0.3956]
 [-0.0198  0.9524 -0.9206  0.6652 -0.5628 -0.9042 -0.2413  0.8059 -0.8367
   0.307 ]
 [ 0.2966  0.3557 -0.0991 -0.4841 -0.1875  0.5684 -0.4308 -0.3834  0.2487
   0.3143]
 [-0.6005 -0.0409  1.1553 -0.1094 -0.2673 -0.7373 -0.6231 -0.5248 -1.1516
   0.8766]
 [ 0.4853 -0.6792 -0.8762 -0.7296  0.5356  0.4203  0.6987 -1.2268 -0.3795
  -0.5007]
 [-0.912  -0.3973 -0.0096 -0.3797  0.5824 -0.5669  0.926  -0.7645  0.7282
  -0.2356]
 [-0.5224 -0.0702  0.2336  0.164  -1.2548 -0.9444 -0.7949  0.5867  0.4373
   0.0828]
 [ 0.5682 -2.1178 -1.673  -0.4395  0.0121 -0.6657  0.9492 -1.0564 -0.1392
   0.4557]
 [ 0.5237  0.6498 -0.0246 -0.7877 -0.5531  0.6493 -0.9693  1.0615  0.3887
   0.7602]
 [-0.6973 -0.5462 -0.7218  0.8881 -1.0415 -0.4868 -0.2169  0.37   -0.1886
   0.4734]
 [-0.4464 -0.5386  0.9107 -0.7168  0.0676  0.9284 -1.2061  0.6931  0.3981
  -0.0392]
 [-1.646  -1.196   0.7899  0.9793 -0.0071 -0.0659 -0.2709 -0.2232

There are also many intercepts now:

In [19]:
[intercept.shape  for intercept in nnet.intercepts_]

[(10,), (10,), (10,), (1,)]

The total number of layers (including input and output) is:

In [20]:
nnet.n_layers_

5

For the regression model we have a single output:

In [21]:
nnet.n_outputs_

1

The output layer uses the following activation function:

In [22]:
nnet.out_activation_

'identity'

We get predictions and scores (R2) in the usual way:

In [23]:
nnet.predict(X_test)[:10]

array([29.7623, 23.5729, 14.4505, 20.6724, 20.1204, 21.0417, 32.8136,
       15.0943, 21.5044, 25.0493])

In [24]:
nnet.score(X_test,y_test)

0.8531969096488475

<a id="load-in-the-titanic-data"></a>
### Load in the Titanic data

In [25]:
data = pd.read_csv('assets/datasets/titanic_train.csv')
X = data.drop('Survived', axis=1)
y = data[['Survived']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

<a id="do-a-bit-of-data-cleaning"></a>
### Do a bit of data cleaning

In [26]:
# Create a helper class to extract features one by one in a pipeline
class FeatureExtractor(TransformerMixin):
    def __init__(self, column):
        self.column = column
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, x, y=None):
        return x[self.column].values.reshape(-1, 1)
    
    
FeatureExtractor('Fare').fit_transform(X_train)[0:5]

array([[51.8625],
       [15.5   ],
       [41.5792],
       [14.4542],
       [10.5167]])

The sklearn `LabelBinarizer` does not fit in a pipeline, so let's use [this](https://github.com/scikit-learn/scikit-learn/issues/3112) customized version instead:

In [27]:
from sklearn.base import BaseEstimator

class CustomBinarizer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None,**fit_params):
        return self
    def transform(self, X):
        return LabelBinarizer().fit(X).transform(X)

In [28]:
# Create a pipeline to binarize labels and impute missing values with an appropriate method
pclass_pipe = make_pipeline(
    FeatureExtractor('Pclass'),
    CustomBinarizer(),
    Imputer(strategy='most_frequent'),
    StandardScaler()
)
sex_pipe = make_pipeline(
    FeatureExtractor('Sex'),
    CustomBinarizer(),
    Imputer(strategy='most_frequent'),
    StandardScaler()
)
age_pipe = make_pipeline(
    FeatureExtractor('Age'),
    Imputer(strategy='mean'),
    StandardScaler()
)
sibsp_pipe = make_pipeline(
    FeatureExtractor('SibSp'),
    Imputer(strategy='most_frequent'),
    StandardScaler()
)
parch_pipe = make_pipeline(
    FeatureExtractor('Parch'),
    Imputer(strategy='most_frequent'),
    StandardScaler()
)

fu = make_union(pclass_pipe, sex_pipe, 
                age_pipe, sibsp_pipe, parch_pipe)

In [29]:
# Create X, y, train, and test
def multi_binarizer(data):
    data = data.copy()
    data['Survived Class 0'] = data['Survived'].apply(lambda x: 1 if x == 0 else 0)
    return data[['Survived Class 0', 'Survived']].values

train_X = fu.fit_transform(X_train)
train_Y = multi_binarizer(y_train)
test_X = fu.transform(X_test)
ttest_Y = multi_binarizer(y_test)

In [30]:
train_X.shape, test_X.shape

((596, 7), (295, 7))

In [31]:
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.ensemble import RandomForestClassifier

model = Perceptron(tol=10**(-6))
model.fit(train_X,train_Y[:,0])
print(model.score(test_X,ttest_Y[:,0]))

model = LogisticRegression(fit_intercept=True,C=10**10,solver='lbfgs',random_state=1)
model.fit(train_X,train_Y[:,0])
print(model.score(test_X,ttest_Y[:,0]))

model_rf = RandomForestClassifier(n_estimators=500)
model_rf.fit(train_X,train_Y[:,0])
print(model_rf.score(test_X,ttest_Y[:,0]))

0.7627118644067796
0.8271186440677966
0.8338983050847457


In [32]:
from sklearn.neural_network import MLPClassifier

Let's reproduce logistic regression. The hidden layer uses the identity function, whereas the output layer automatically uses the sigmoid.

In [33]:
clf = MLPClassifier(solver='lbfgs', 
                    alpha=10**(-1),
                    hidden_layer_sizes=1, 
                    activation='identity', 
                    random_state=1,
                    batch_size='auto')
clf.fit(train_X, train_Y)
print(metrics.accuracy_score(y_test['Survived'], clf.predict(test_X)[:,1]))

0.8271186440677966


In [34]:
clf.n_layers_

3

Let's tweak the model a little bit.

In [35]:
clf = MLPClassifier(solver='adam', 
                    alpha=10**(-6),
                    hidden_layer_sizes=(8,8,8,8,8), 
                    activation='relu', 
                    random_state=42,
                    batch_size=100)
clf.fit(train_X, train_Y)
metrics.accuracy_score(y_test['Survived'], clf.predict(test_X)[:,1])

0.8406779661016949

Now we have many layers.

In [36]:
clf.n_layers_

7

We have even more coefficients.

In [37]:
print([coef.shape for coef in clf.coefs_])
print(sum([np.prod(coef.shape) for coef in clf.coefs_]))

[(7, 8), (8, 8), (8, 8), (8, 8), (8, 8), (8, 2)]
328


### Exercise:

- Tune the models above for the Boston housing data set and the Titanic data set. Explore all the different tuning options.

- Practice with further datasets.

- Use the Multi-Layer-Perceptron in the context of your capstone project.

<a id="additionl-resources"></a>
## Additional Resources
---

- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/chap1.html)
- [Deep Learning](http://www.deeplearningbook.org/)
- [Tensorflow Tutorials](https://github.com/pkmital/tensorflow_tutorials)
- [Awesome Tensorflow](https://github.com/jtoy/awesome-tensorflow)
- [Tensorflow Examples](https://github.com/aymericdamien/TensorFlow-Examples)
- [Mind: How to Build a Neural Network](https://stevenmiller888.github.io/mind-how-to-build-a-neural-network/)