# Introduction to Neural Networks

<a id="learning-objectives"></a>
### Learning Objectives
- Get a quick overview of neural networks
- Build a Multilayer Perceptron Feed-Forward network with sklearn
- Compare to other algorithms

### Lesson Guide

- [Introduction](#introduction)
- [What are Neural Networks?](#what-are-neural-networks)
- [Pros vs. Cons](#pros-vs-cons)
- [Features](#features)
- [Outputs](#outputs)
- [Hidden Layers](#hidden-layers)
- [Activation Function](#activation-function)
	- [ReLU](#relu)
	- [Softmax](#softmax)
- [Backpropagation](#backpropagation)
- [Epochs and Batch Sizes](#epochs-and-batch-sizes)
- [Train a Multilayer Perceptron](#multilayer)
- [Additional Resources](#additionl-resources)


In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
np.set_printoptions(precision=4) 
    
plt.style.use('ggplot')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [9]:
from sklearn.impute import SimpleImputer
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelBinarizer, StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline, make_union
from sklearn.model_selection import train_test_split

<a id="introduction"></a>
## Introduction
---

Neural networks are powerful and widely discussed these days. They have been applied to tasks such as image classification, [playing Go](http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234), and [creating tweets in the style of President Trump](https://twitter.com/deepdrumpf?lang=en).


Neural networks were first studied in the 1940s (!) as a model of biological neural networks and had various ups and downs in the research mainstream.

Currently, this is a rapidly evolving field and represents some of the newest parts of Data Science, thanks to the increase in processing power and scale of data.

<a id="what-are-neural-networks"></a>
## What are Neural Networks?
---

Neural networks, in a single line, attempt to iteratively train a set (or sets) of weights that, when used together, return the most accurate predictions for a set of inputs. Just like many of our past models, the model is trained using a loss function, which our model will attempt to minimize over iterations. Remember that a loss function is some function that takes in our predictions and the actual values and returns some sort of aggregate value that shows how accurate (or not) we were.
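As a small illustration of what a loss function does, here is a minimal sketch of the mean squared error in NumPy. The function name and the example numbers are made up for this illustration; they are not part of the lesson's data.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical targets and two sets of predictions
y_true = [3.0, 5.0, 7.0]
good_pred = [3.0, 5.0, 8.0]   # off by 1 on a single sample
bad_pred = [0.0, 0.0, 0.0]    # far from every target

loss_good = mean_squared_error(y_true, good_pred)
loss_bad = mean_squared_error(y_true, bad_pred)
print(loss_good, loss_bad)
```

The better predictions give the smaller loss, which is the quantity training tries to push down.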

Neural networks do this by establishing sets of neurons (known as hidden layers) that take in some sort of input(s), apply a weight, and pass that output onward. As we feed more data into the network, it adjusts those weights based on the output of the loss function, until we have highly trained and specific weights.

Why does one neuron turn out one way and a second neuron another? That's not generally something we can understand (though attempts have been made, such as Google's Deep Dream). You can understand this as a kind of (very advanced) mathematical optimization.

![](./assets/images/neuralnet.png)

<a id="pros-vs-cons"></a>
## Pros vs. Cons
---

**Advantages**

- Exceptionally accurate because we can learn complicated decision boundaries
- Appropriate for a vast range of tasks

**Disadvantages**

- Long training time
- Requires more data than most algorithms
- Can become very complex and hard to interpret
- Less user-friendly coding

<a id="features"></a>
## Features
---

Much like our other machine learning techniques, a neural network needs data as input. While neural networks cope well with data in many forms, reducing the number of inputs can help the network considerably, particularly with image data. A smaller set of inputs often gives results that are about as accurate as a larger one.

<a id="outputs"></a>
## Outputs
---

Much like other techniques, we do want some sort of output at the end as well. In most cases:

- for a regression style technique, one output is usually fine
- for a classification technique, one output per class is a good idea (in other words, we model a one-versus-all approach)
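
To make "one output per class" concrete, the sketch below turns class labels into one-hot target rows using plain NumPy. The labels are invented for illustration; in practice a helper such as sklearn's `LabelBinarizer` does the same job.

```python
import numpy as np

# Hypothetical class labels for four samples
labels = np.array(["cat", "dog", "bird", "dog"])

# np.unique returns the sorted classes and, for each label, its class index
classes, class_idx = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(classes))[class_idx]

print(classes)
print(one_hot)
```

Each row has a single 1 in the column of its class, so the network needs as many output neurons as there are columns.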


<a id="hidden-layers"></a>
## Hidden Layers
---

What makes neural networks tick is the idea of hidden layers. Hidden does not mean anything particularly devious here, just that it is not the input or the output layer.

Hidden layers can have any number of neurons per layer and you can include any number of layers in a neural network. Inputs into a neuron have different weights that are modified across iterations of the model and have a bias term as well -- you can almost imagine them as mini-linear models (though, that linearity does not need to hold at all).

<a id="activation-function"></a>
## Activation Function
---


Every neuron processes the input it receives in the same way. It first forms a weighted sum of its inputs plus a bias:

$$
z = b+\sum_i w_i X_i
$$

Weights and intercepts are specific to each neuron and have to be determined through an iterative procedure.
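
The formula for $z$ is just a dot product plus a bias. A minimal sketch with made-up numbers (the values of `x`, `w`, and `b` are assumptions for illustration only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])      # inputs X_i
w = np.array([0.5, -1.0, 0.25])    # weights w_i, one per input
b = 0.1                            # bias (intercept) of this neuron

# z = b + sum_i w_i * X_i
z = b + np.dot(w, x)
print(z)
```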


Once the neuron has formed $z$ it applies a user-defined activation function to it. Some examples are:

<a id="relu"></a>
### ReLU

Also known as a [Rectified Linear Unit](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)), this returns 0 if the input is less than 0; otherwise it simply returns the input, i.e.,

- take the input and feed it through $f(z) = {\rm max}(0, z)$.

This means that the neuron is activated when $z$ is positive and not activated otherwise.

**The ReLU function**
![](./assets/images/relu.png)
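
In code, the ReLU is a one-liner. This is a minimal NumPy sketch; the input values are chosen only to show both sides of zero.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
a = relu(z)
print(a)
```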

<a id="softmax"></a>
### Softmax

You already know the softmax function from logistic regression; for two classes it reduces to the sigmoid. It returns one value between 0 and 1 per class, and these values sum to 1, so they can be read as the probabilities of falling into each of the given classes ([more information here](https://en.wikipedia.org/wiki/Softmax_function)).
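
The following sketch implements the softmax in NumPy and checks the two-class case against the sigmoid. The scores are arbitrary example values, and subtracting the maximum is a common numerical-stability trick rather than part of the definition.

```python
import numpy as np

def softmax(z):
    """Map a vector of scores to probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)        # does not change the result, avoids overflow
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

probs = softmax([2.0, 1.0, 0.1])   # three classes
two_class = softmax([1.5, 0.0])    # two classes, second score fixed at 0

print(probs, probs.sum())
print(two_class[0], sigmoid(1.5))  # these two numbers agree
```

With two classes and one score fixed at 0, the first softmax output equals the sigmoid of the other score.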

There's a wealth of information on different types of activation functions within [this article](https://en.wikipedia.org/wiki/Activation_function) - different activation functions, hidden layers, and neurons per layer can change how effective your neural network will be!


Of course there are a whole lot of other activation functions, for example the identity (just returning the input again) or the hyperbolic tangent.

One of the advantages of the ReLU is that its slope is constant but non-vanishing on the positive side, whereas functions like the sigmoid become very flat as they asymptote towards 1 or 0 which challenges optimization algorithms like gradient descent (see the [vanishing gradient](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) problem).
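
The difference in slopes can be seen numerically. This sketch evaluates the derivative of the sigmoid and of the ReLU at a few arbitrary points; the helper names are chosen for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of the ReLU: 1 for positive inputs, 0 otherwise."""
    return (np.asarray(z) > 0).astype(float)

z = np.array([0.0, 2.0, 5.0, 10.0])
sig_slopes = sigmoid_grad(z)
relu_slopes = relu_grad(z)

print(sig_slopes)    # shrinks quickly towards 0 as z grows
print(relu_slopes)   # stays at 1 for every positive z
```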

<a id="backpropagation"></a>
## Backpropagation
---

While there are many ways that a neural network can learn, we'll focus on the method that is easiest to understand. Backpropagation adjusts the weights in each layer according to how well the network's predictions match the actual outputs at each iteration step.

How do we tell good choices from bad ones within the network? We compare the predictions with the actual values using the loss function and make small changes to the weights that reduce the loss. Most frequently we use a gradient descent method, in which the gradient of the loss gives the direction of each change and a learning rate sets its size.
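
The update rule can be sketched for the simplest possible case: a single linear neuron trained with gradient descent on a mean squared error loss. This is an illustration on synthetic, noise-free data (all values below are assumptions), not what sklearn does internally. In a network with hidden layers, backpropagation applies the chain rule to pass these gradients backward through each layer.

```python
import numpy as np

# Synthetic data generated from known weights (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])
true_b = 0.5
y = X @ true_w + true_b

# Start from zero weights and repeatedly step against the gradient
w = np.zeros(2)
b = 0.0
learning_rate = 0.1
losses = []

for step in range(200):
    y_pred = X @ w + b
    error = y_pred - y
    losses.append(np.mean(error ** 2))

    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * X.T @ error / len(y)
    grad_b = 2.0 * np.mean(error)

    # The learning rate controls the size of each change
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b, losses[0], losses[-1])
```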

<a id="epochs-and-batch-sizes"></a>
## Epochs and Batch Sizes
---

- **Epochs:** The number of full passes through the training data during fitting. There's no upper limit, but generally there will be a point where additional epochs no longer improve the model.
- **Batch Size:** Neural networks tend to work best when you feed in portions of your data at a time (versus the full set) and adjust the weights in between. Smaller batches allow for more frequent updates, but each update is based on fewer samples and is therefore noisier.
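
The relationship between epochs, batches, and weight updates can be sketched as a pair of nested loops. The sample count, batch size, and number of epochs are arbitrary, and the actual weight update is left as a comment.

```python
import numpy as np

n_samples = 10
batch_size = 4
n_epochs = 3

rng = np.random.default_rng(0)
updates = 0

for epoch in range(n_epochs):
    # Shuffle the sample order once per epoch
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        batch_idx = order[start:start + batch_size]
        # ... compute the loss on X[batch_idx] and adjust the weights here ...
        updates += 1

batches_per_epoch = int(np.ceil(n_samples / batch_size))
print(batches_per_epoch, updates)
```

With 10 samples and a batch size of 4, each epoch consists of two full batches and one smaller final batch.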

<a id="the-perceptron"></a>
### The Perceptron

This is the original model, based on how the neurons in the brain might work.

- Each neuron is connected to many other neurons in a network.
- These neurons both send and receive signals from connected neurons.
- When a neuron receives a signal it can either fire or not, depending on whether the incoming signal is above some threshold.

A single perceptron, like a neuron, can be thought of as a decision-making unit. If the weighted sum of the incoming signals is above a threshold, the perceptron fires, and if not it doesn't. In this case firing equals outputting a value of 1 and not firing equals outputting a value of 0.

<img src="images/ann-perceptron.png" width=500>

The graph shows how inputs are fed with some weights into a neuron. 
The neuron processes these inputs. It multiplies each input by its weight, sums them up together with a bias and checks if that sum is larger than zero. If it is, it produces a signal, otherwise not.

$$
\begin{eqnarray*}
b + \sum_i w_i X_i &>& 0 \Rightarrow 1 \\
b + \sum_i w_i X_i &\leq& 0 \Rightarrow 0
\end{eqnarray*}
$$

The activation function used in the case of the perceptron is the Heaviside step function $\theta(z)$, giving 1 if $z>0$ and 0 otherwise.

Logistic regression would work in the same way, only choosing a different activation function, the sigmoid (or the softmax function in the case of more than two classes).

$$
\begin{eqnarray*}
z &=& b+\sum_i w_i X_i\\
\sigma(z) &=& \frac{1}{1+e^{-z}}
\end{eqnarray*}
$$
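
The two activation functions can be compared on the same weighted sums. In this sketch the weights and bias are hand-picked (an assumption for illustration) so that the perceptron fires only when both binary inputs are 1.

```python
import numpy as np

def heaviside(z):
    """Step activation: 1 if z > 0, otherwise 0."""
    return np.where(z > 0, 1, 0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked parameters implementing a logical AND of two inputs
w = np.array([1.0, 1.0])
b = -1.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
z = X @ w + b

fired = heaviside(z)   # hard 0/1 decision of the perceptron
prob = sigmoid(z)      # smooth value between 0 and 1, as in logistic regression

print(z)
print(fired)
print(prob)
```

Both units use the same $z$; only the function applied afterwards differs.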

**What would the activation function look like for linear regression?**

<a id="multilayer"></a>
## Train a Multilayer Perceptron
---

- A feedforward multilayer perceptron is one of the most well known neural network architectures
- They are structured just like the picture in the intro
    - We have an input layer of features
    - These input features are passed into neurons in the hidden layers
    - Each neuron is a perceptron, kind of like a bunch of small linear regressions
    - We pass information from one layer of neurons to the next layer of neurons until we hit the output layer
    - The output layer does one calculation to output a prediction for the outcome.

![](./assets/images/neuralnet.png)
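
The layer-by-layer flow described above can be sketched as a forward pass in NumPy. The layer sizes, random weights, and the `forward` helper are assumptions for illustration; a trained network would have learned weights instead of random ones.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Pass inputs through every hidden layer, then the output layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)                    # hidden layer: weighted sum, then activation
    return a @ weights[-1] + biases[-1]        # output layer: identity activation

rng = np.random.default_rng(0)
layer_sizes = [13, 10, 10, 1]   # 13 input features, two hidden layers, 1 output

weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

X = rng.normal(size=(5, 13))    # five made-up samples
y_hat = forward(X, weights, biases)
print(y_hat.shape)
```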

Let's start with a simple linear regression problem.

In [10]:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head()


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

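|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |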


In [11]:
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(df,y,test_size = 0.3,random_state=1)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [12]:
model = LinearRegression(fit_intercept=True)
model.fit(X_train,y_train)
metrics.mean_squared_error(y_test,model.predict(X_test))

19.83132367206314

In [13]:
metrics.r2_score(y_test,model.predict(X_test))

0.7836295385076291

In [14]:
print(model.coef_)
print(model.intercept_)

[-0.8388  1.4284  0.4053  0.6794 -2.5304  1.9338  0.1009 -3.2362  2.7032
 -1.9173 -2.1558  0.5823 -4.1343]
22.339830508474606


We can use sklearn's multi-layer perceptron for this regression problem. Let's first reproduce the linear regression result. To do so, we use the identity activation function. The default solver is `adam`; for small datasets, however, `lbfgs` might work better.

In [15]:
from sklearn.neural_network import MLPRegressor

In [16]:
nnet = MLPRegressor(hidden_layer_sizes=1,solver='lbfgs',activation='identity',max_iter=1000,random_state=1)
nnet.fit(X_train,y_train)
metrics.mean_squared_error(y_test,nnet.predict(X_test))

19.831456079421418

We can extract the neural network coefficients (the weights for each edge).

In [17]:
print(nnet.coefs_)

[array([[ 0.5127],
       [-0.8731],
       [-0.2478],
       [-0.4153],
       [ 1.5467],
       [-1.1821],
       [-0.0616],
       [ 1.9783],
       [-1.6525],
       [ 1.1721],
       [ 1.3179],
       [-0.3559],
       [ 2.5273]]), array([[-1.6359]])]


In [18]:
nnet.intercepts_

[array([-9.0325]), array([7.5635])]

We multiply the first-layer weights by the second-layer weight to obtain the linear regression coefficients:

In [19]:
print((nnet.coefs_[0]*nnet.coefs_[1]).flatten())

[-0.8388  1.4283  0.4053  0.6793 -2.5303  1.9338  0.1007 -3.2363  2.7033
 -1.9174 -2.1559  0.5823 -4.1344]


We get very good agreement:

In [20]:
print(model.coef_-(nnet.coefs_[0]*nnet.coefs_[1]).flatten())

[-5.7636e-05  5.2393e-05 -1.2655e-05  9.0265e-05 -8.5366e-05  6.1953e-05
  2.0033e-04  1.6900e-04 -1.2091e-04  1.3566e-04  7.7246e-05  7.8270e-06
  2.4097e-05]


The same for the intercept:

In [21]:
print(nnet.intercepts_[0]*nnet.coefs_[1]+nnet.intercepts_[1])

[[22.3397]]


In [22]:
print(model.intercept_ - (nnet.intercepts_[0]*nnet.coefs_[1]+nnet.intercepts_[1]))

[[0.0001]]
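
The products used in these comparisons follow directly from composing two linear steps. As a sketch, write $w^{(1)}_i$ and $b^{(1)}$ for the weights and bias of the single hidden neuron and $w^{(2)}$, $b^{(2)}$ for those of the output neuron (this superscript notation is introduced here for illustration only). With the identity activation:

$$
\begin{eqnarray*}
h &=& b^{(1)} + \sum_i w^{(1)}_i X_i \\
\hat{y} &=& b^{(2)} + w^{(2)} h \\
&=& \left(b^{(2)} + w^{(2)} b^{(1)}\right) + \sum_i \left(w^{(2)} w^{(1)}_i\right) X_i
\end{eqnarray*}
$$

The effective coefficients are therefore $w^{(2)} w^{(1)}_i$ and the effective intercept is $b^{(2)} + w^{(2)} b^{(1)}$, which are the expressions evaluated with `nnet.coefs_` and `nnet.intercepts_`.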


Now let's add a few hidden layers and a non-trivial activation function to see if we can do better.

In [23]:
nnet = MLPRegressor(hidden_layer_sizes=(10,10,10),solver='lbfgs',activation='relu',random_state=1)
nnet.fit(X_train,y_train)
metrics.mean_squared_error(y_test,nnet.predict(X_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


12.1730020768693

There are many more model coefficients now.

In [24]:
print([coef.shape for coef in nnet.coefs_])
print(sum([np.prod(coef.shape) for coef in nnet.coefs_]))

[(13, 10), (10, 10), (10, 10), (10, 1)]
340


This gives the full list of the first set of coefficients:

In [25]:
print(nnet.coefs_[0])

[[-1.3501e-01  1.8892e-01 -1.7448e-02  6.2869e-02 -9.3634e-01 -7.3111e-01
  -1.1192e+00 -8.8028e-01 -1.4081e-01  3.0936e-01]
 [ 2.2721e-02  8.2162e-01 -3.2669e-01  8.6818e-01 -4.7454e-01 -7.4026e-01
   1.7749e-01  8.3349e-01 -3.8158e-01  7.9707e-01]
 [ 1.6577e-01  2.9439e-01 -2.0889e-01 -2.2020e-01  4.8117e-01  4.2613e-01
  -8.5555e-01 -2.3904e-01  1.6931e-01  6.4456e-01]
 [-7.8899e-01 -2.3106e-01  8.8758e-01 -2.9974e-01  1.2332e-01 -4.0143e-01
  -8.4782e-01 -3.5369e-01 -1.3146e+00  4.3904e-01]
 [ 7.4132e-01 -3.2816e-01 -1.2440e+00 -5.5151e-01  3.4954e-01  6.6733e-01
   5.6538e-01 -7.6458e-01 -3.5615e-01 -6.0188e-01]
 [-8.1371e-01 -3.0567e-01  1.6828e-01 -5.9699e-01 -5.4383e-02 -8.2802e-01
   8.0427e-01 -1.0490e+00  7.3252e-01 -3.0316e-01]
 [-4.9710e-01 -1.5290e-01  9.0812e-02  1.7136e-01 -6.5082e-01 -1.0200e+00
  -2.5964e-01  4.1763e-01  4.5452e-01  1.8840e-01]
 [ 5.0594e-01 -1.6867e+00 -2.3020e+00 -2.6761e-01 -1.3438e-02 -7.9774e-01
   8.4861e-01 -1.0140e+00 -1.5321e-01  3.6364e-01]


There are also many intercepts now:

In [26]:
[intercept.shape  for intercept in nnet.intercepts_]

[(10,), (10,), (10,), (1,)]
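
The shapes printed for the weights and intercepts can be reproduced from the layer sizes alone. This sketch assumes the architecture used above (13 inputs, three hidden layers of 10 neurons, 1 output) and counts both kinds of parameters.

```python
# Layer sizes: input, three hidden layers, output
layer_sizes = [13, 10, 10, 10, 1]

# One weight matrix between each pair of consecutive layers
weight_shapes = [(n_in, n_out)
                 for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

n_weights = sum(n_in * n_out for n_in, n_out in weight_shapes)
n_biases = sum(layer_sizes[1:])      # one intercept per non-input neuron
n_params = n_weights + n_biases

print(weight_shapes)
print(n_weights, n_biases, n_params)
```

The 340 weights match the count printed earlier; adding the 31 intercepts gives the total number of trainable parameters.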

This is the total number of layers (including input and output):

In [27]:
nnet.n_layers_

5

For the regression model we have a single output:

In [28]:
nnet.n_outputs_

1

The output layer uses the following activation function:

In [29]:
nnet.out_activation_

'identity'

We get predictions and scores (R2) in the usual way:

In [30]:
nnet.predict(X_test)[:10]

array([31.9661, 24.706 , 20.2971, 21.0214, 23.1073, 19.6475, 33.1616,
       15.4072, 21.5863, 24.5721])

In [31]:
nnet.score(X_test,y_test)

0.8671859669745491

<a id="load-in-the-titanic-data"></a>
### Load in the titanic data

### Exercise:

- Tune the models above for the Boston housing data set and the Titanic data set. Explore all the different tuning options.

- Practice with further datasets.
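
One possible starting point for the tuning exercise is a grid search over a few `MLPRegressor` settings. The sketch below uses synthetic data from `make_regression` as a stand-in for a real dataset, and the grid values are arbitrary choices rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: replace with your own X and y)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# Scaling inside the pipeline keeps each cross-validation fold leak-free
pipeline = make_pipeline(
    StandardScaler(),
    MLPRegressor(solver="lbfgs", max_iter=500, random_state=1),
)

# make_pipeline names each step after its lowercased class name
param_grid = {
    "mlpregressor__hidden_layer_sizes": [(5,), (10, 10)],
    "mlpregressor__activation": ["identity", "relu"],
    "mlpregressor__alpha": [1e-4, 1e-2],
}

search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_   # scores are negated errors
print(best_params, best_mse)
```

Some fits may emit convergence warnings, as seen earlier in the lesson; raising `max_iter` is one way to address them.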

<a id="additionl-resources"></a>
## Additional Resources
---

- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/chap1.html)
- [Deep Learning](http://www.deeplearningbook.org/)
- [Tensorflow Tutorials](https://github.com/pkmital/tensorflow_tutorials)
- [Awesome Tensorflow](https://github.com/jtoy/awesome-tensorflow)
- [Tensorflow Examples](https://github.com/aymericdamien/TensorFlow-Examples)
- [Mind: How to Build a Neural Network](https://stevenmiller888.github.io/mind-how-to-build-a-neural-network/)