Good source on deep learning: Geoffrey Hinton

### Artificial Neural Networks

#### The Neuron

Neuron --> Node. A node receives m input signals and transforms them into an output signal.
The final (output) layer can produce a continuous, binary, or categorical value.
Each connection between nodes carries a weight.

What happens in the node?
- Compute the weighted sum of all the input values: 
$x=\sum_{i=1}^{m} w_i x_i$
- Apply an activation function to that sum --> decides whether (and how strongly) the signal is passed on

<img src="files/Neuron_network.png">
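A minimal sketch of what a single node computes, assuming NumPy and a sigmoid activation. The input values and weights are made up for illustration, and `neuron_output` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(x):
    """Smooth activation that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(inputs, weights, activation):
    """Weighted sum of the input signals, followed by the activation function."""
    weighted_sum = np.dot(weights, inputs)
    return activation(weighted_sum)

# Hypothetical values: m = 3 input signals and their weights.
inputs = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -0.25, 0.0])

# Weighted sum = 0.5 - 0.5 + 0.0 = 0.0, and sigmoid(0.0) = 0.5.
output = neuron_output(inputs, weights, sigmoid)
print(output)
```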


#### Activation Functions

The activation function determines the value that is passed on to the next layer.


Options:
- Threshold Function 
$$
\phi(x)=
\begin{cases}
1, & x > 0\\
0, & x \leq 0
\end{cases}
$$

- Sigmoid Function (Smooth Threshold Function)
$$\phi(x) = \frac{1}{1+e^{-x}}$$
- Rectifier Function (ReLU)
$$\phi(x) = \max(x,0)$$

- Hyperbolic Tangent (tanh)

Note: similar to the sigmoid, but the output ranges from -1 to +1.

$$\phi(x) = \frac{1-e^{-2x}}{1+e^{-2x}}$$

where $x=\sum_{i=1}^{m} w_i x_i$

Common scheme: ReLU function in the hidden layers, sigmoid function in the output layer.
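The four activation functions listed above can be written directly from their formulas. This is a small NumPy sketch; the function names and the sample inputs are chosen here for illustration.

```python
import numpy as np

def threshold(x):
    """1 where x > 0, otherwise 0."""
    return np.where(x > 0, 1.0, 0.0)

def sigmoid(x):
    """Smooth threshold with output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectifier: max(x, 0)."""
    return np.maximum(x, 0.0)

def tanh(x):
    """Hyperbolic tangent with output in (-1, 1), written as in the formula above."""
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

# Sample weighted sums, chosen for illustration.
x = np.array([-2.0, 0.0, 2.0])
for name, phi in [("threshold", threshold), ("sigmoid", sigmoid),
                  ("relu", relu), ("tanh", tanh)]:
    print(name, phi(x))
```

At `x = 0` the sigmoid returns 0.5 and tanh returns 0, which shows the difference in output range between the two.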

Extra reading on activation functions: http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf
       


#### How do Neural Networks work?

Good source: Computerphile - Dr Mike (Michael) Pound's playlist on NNs, image recognition, and CNNs.

#### How do Neural Networks learn?

Define a cost function (the error in the prediction) e.g.
$C=\frac{1}{2} (\hat{y}-y)^2$, where $\hat{y}$ is the predicted value & y is the actual value.

As a result: update the weights $w_i$ to minimize the cost function.

$w_i$ need to be updated to correctly predict all rows on the training set. 

- Batch Gradient Descent:
$C=\sum_i \frac{1}{2} (\hat{y}_i-y_i)^2$ --> the weights are updated only once, after all training rows have been fed through the network (1 epoch).

- Stochastic Gradient Descent (see later):
rows are fed 1-by-1 and the weights are updated after each row.
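To make the two cost formulas concrete, here is a short NumPy sketch. The predicted and actual values are invented for illustration, and `row_cost` and `batch_cost` are hypothetical helper names.

```python
import numpy as np

def row_cost(y_hat, y):
    """Cost of each prediction on its own: C = 1/2 * (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2

def batch_cost(y_hat, y):
    """Cost summed over all rows, as used by batch gradient descent."""
    return np.sum(row_cost(y_hat, y))

# Made-up actual values and predictions for three training rows.
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.8, 0.4, 0.5])

per_row = row_cost(y_hat, y)   # approximately [0.02, 0.08, 0.125]
total = batch_cost(y_hat, y)   # approximately 0.225
print(per_row, total)
```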

More reading: 
- CrossValidated (2015) - A list of cost functions used in neural networks, alongside applications.
- https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications

#### Gradient Descent

Gradient descent: a method for updating the weights.

There are too many weight combinations to brute-force. 

Instead, calculate the gradient of the cost function with respect to the weight vector $w$ and step in the opposite (downhill) direction to move towards a local minimum.
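As an illustration, consider the simplest possible case: a single linear node with $\hat{y} = w \cdot x$ and no activation function. This is an assumption made here to keep the gradient short. For the summed squared cost, the gradient with respect to $w$ is then $X^T(\hat{y} - y)$. The toy data, learning rate, and the name `batch_gradient_descent` are hypothetical.

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.1, epochs=200):
    """Minimise C = sum_i 1/2 * (y_hat_i - y_i)^2 for a linear node y_hat = X @ w."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y_hat = X @ w                    # predictions for all rows
        gradient = X.T @ (y_hat - y)     # dC/dw, summed over all rows
        w -= learning_rate * gradient    # one update per epoch, against the gradient
    return w

# Toy data generated from known weights, so the minimum is known in advance.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
true_w = np.array([2.0, -1.0])
y = X @ true_w

w = batch_gradient_descent(X, y)
print(w)  # close to [2, -1]
```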


#### Stochastic Gradient Descent

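Variant of gradient descent in which the weights are updated after each row instead of after the whole training set. The noisier updates help to escape local minima, which makes reaching the global minimum more likely, but not guaranteed. (It is a different method from simulated annealing.)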
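A sketch of the row-by-row update, again assuming a single linear node $\hat{y} = w \cdot x$ so that the per-row gradient is simply $(\hat{y}_i - y_i)\,x_i$. The rows are visited in a random order each epoch. The data, learning rate, seed, and the name `stochastic_gradient_descent` are hypothetical.

```python
import numpy as np

def stochastic_gradient_descent(X, y, learning_rate=0.1, epochs=200, seed=0):
    """Update the weights after every single row instead of once per epoch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):            # shuffle the row order
            y_hat_i = X[i] @ w                       # prediction for one row
            w -= learning_rate * (y_hat_i - y[i]) * X[i]
    return w

# Toy data generated from known weights.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
true_w = np.array([2.0, -1.0])
y = X @ true_w

w = stochastic_gradient_descent(X, y)
print(w)  # close to [2, -1]
```

On this convex toy problem both variants of gradient descent end up at the same weights; the difference only matters when the cost surface has several local minima.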

More reading:
- Andrew Trask (2015) - A Neural network in 13 lines of Python (Part 2 - Gradient Descent) - https://iamtrask.github.io/2015/07/27/python-network-part2/
- Michael Nielsen (2015) - Neural networks and deep learning
http://neuralnetworksanddeeplearning.com/chap2.html
(the entire book is good background on the mathematics of NNs)



#### Backpropagation

How to handle multiple layers: the error is propagated backwards through the network to determine how much of it each weight is responsible for.
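A minimal sketch of this idea for one hidden layer, assuming sigmoid activations in both layers, no bias terms, and the cost $C=\frac{1}{2}(\hat{y}-y)^2$ from above. The chain rule assigns each weight its share of the error. The network size, the input values, and the names `forward` and `backward` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, w2):
    """Forward pass: input -> hidden layer -> single output."""
    hidden = sigmoid(W1 @ x)
    y_hat = sigmoid(w2 @ hidden)
    return hidden, y_hat

def backward(x, y, W1, w2):
    """Gradients of C = 1/2 * (y_hat - y)^2 with respect to W1 and w2."""
    hidden, y_hat = forward(x, W1, w2)
    # Error signal at the output node (derivative of sigmoid is s * (1 - s)).
    delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)
    grad_w2 = delta_out * hidden
    # Error signal sent back to each hidden node through its outgoing weight.
    delta_hidden = delta_out * w2 * hidden * (1.0 - hidden)
    grad_W1 = np.outer(delta_hidden, x)
    return grad_W1, grad_w2

# Hypothetical example: 2 inputs, 3 hidden nodes, 1 output.
rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])
y = 1.0
W1 = rng.normal(size=(3, 2))
w2 = rng.normal(size=3)

grad_W1, grad_w2 = backward(x, y, W1, w2)
print(grad_W1.shape, grad_w2.shape)  # (3, 2) (3,)
```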


#### Training a NN with stochastic gradient descent


1. Randomly initialise the weights to small numbers close to 0 (but not 0)
2. Feed the 1st observation into the input layer, one feature per input node
3. Forward-propagation
4. Compare predicted result to the actual result
5. Back-propagation: the error is back-propagated. Update the weights according to how much they are responsible for the error. The learning rate decides by how much we update the weights.
    - After each observation = Stochastic (online) learning
    - After a batch of observations = Batch Learning
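The steps above can be sketched in plain NumPy for a tiny network with one hidden layer. This is an illustrative sketch under several assumptions: sigmoid activations, no bias terms, weights updated after each observation, and a made-up toy data set in which the target is 1 when the first feature is larger than the second. The names `train_sgd` and `total_cost` and all hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def total_cost(X, y, W1, w2):
    """Summed cost 1/2 * (y_hat - y)^2 over all rows."""
    y_hat = sigmoid(sigmoid(X @ W1.T) @ w2)
    return np.sum(0.5 * (y_hat - y) ** 2)

def train_sgd(X, y, hidden_units=3, learning_rate=0.5, epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: small random weights close to (but not exactly) 0.
    W1 = rng.normal(scale=0.1, size=(hidden_units, X.shape[1]))
    w2 = rng.normal(scale=0.1, size=hidden_units)
    initial_cost = total_cost(X, y, W1, w2)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            x_i, y_i = X[i], y[i]              # Step 2: one observation
            hidden = sigmoid(W1 @ x_i)         # Step 3: forward-propagation
            y_hat = sigmoid(w2 @ hidden)
            error = y_hat - y_i                # Step 4: compare with the actual result
            # Step 5: back-propagate the error and update the weights.
            delta_out = error * y_hat * (1.0 - y_hat)
            delta_hidden = delta_out * w2 * hidden * (1.0 - hidden)
            w2 -= learning_rate * delta_out * hidden
            W1 -= learning_rate * np.outer(delta_hidden, x_i)
    return W1, w2, initial_cost

# Toy data: the target is 1 when the first feature exceeds the second.
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

W1, w2, initial_cost = train_sgd(X, y)
final_cost = total_cost(X, y, W1, w2)
print(initial_cost, final_cost)
```

The learning rate scales every update in step 5. Keras performs the same steps internally when `fit` is called.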
    
    
#### Libraries 

- Theano/Torch: deep-learning math libraries for parallel computations on CPU/GPU (claimed to be faster than TensorFlow)
- TensorFlow: deep-learning core library
- Keras: library to build deep-learning networks in a few lines of code.

To learn more about the different packages used for NNs:

https://www.microway.com/hpc-tech-tips/deep-learning-frameworks-survey-tensorflow-torch-theano-caffe-neon-ibm-machine-learning-stack/


## Example: Geodemographic segmentation Model 

In [51]:
#libraries
%matplotlib notebook   
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

# - Set path - 
# Raw string; the trailing backslash is doubled because a raw string cannot end in a single backslash
data_dir = (r'C:\Users\msfernandez\Machine Learning A-Z\Machine Learning A-Z Template Folder\Part 8 - Deep Learning\Section 39 - Artificial Neural Networks (ANN)\\')
os.chdir(data_dir)

# - - - - - - - - - - - -
# - import the dataset - 
# - - - - - - - - - - - -
dataset = pd.read_csv('Churn_Modelling.csv', quoting = 3)
display(dataset.head())

# - - - - - - - - - - - - - - 
# Part 1: Data Preprocessing:
# - - - - - - - - - - - - - -

X = dataset.iloc[:, 3:13]
y = dataset.iloc[:, 13]


    # Label Encoding
    # - - - - - - - - - - - - - -

# Encode categorical data
# Get dummy-variable columns for the categorical data; drop_first=True avoids the dummy variable trap (multicollinearity with the constant term)
X = pd.get_dummies(X,columns=['Geography'],drop_first=True)
X = pd.get_dummies(X,columns=['Gender'],drop_first=True)

print('Categorical variables encoded')

display(X.head())


    # Splitting the data in train & test
    # - - - - - - - - - - - - - -

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)


    # Feature scaling
    # - - - - - - - - - - - - - -

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train.astype(float))
X_test = sc_X.transform(X_test.astype(float))

print('Feature Scaled')
display(pd.DataFrame(X_train[:5]))


# - - - - - - - - - - - - - - 
# Part 2: Build the ANN
# - - - - - - - - - - - - - -

    #Importing the keras libraries and packages
    # - - - - - - - - - - - - - -

import keras
from keras.models import Sequential #to initialize the NN (2 options: defining the sequence of layers, or defining the graph)
from keras.layers import Dense #to create the layers in the NN

    #Initialise the ANN 
    # - - - - - - - - - - - - - -

#initialise the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'glorot_uniform', activation = 'relu', input_dim = 11))

# How many nodes in the hidden layers?  --> General rule of thumb: average of number of nodes in the input/output layer.
classifier.add(Dense(units = 6, kernel_initializer = 'glorot_uniform', activation = 'relu'))

# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'glorot_uniform', activation = 'sigmoid'))
# More than 2 categories: use one output unit per class, the 'softmax' activation and the 'categorical_crossentropy' loss.


# Compiling the ANN (add the stochastic gradient descent optimizer 'adam')
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics =['accuracy'])

# - - - - - - - - - - - - - - 
# Part 3: Fit the model to training set
# - - - - - - - - - - - - - -
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)



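| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |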


Categorical variables encoded


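| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 0 | 0 | 0 |
| 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 1 | 0 |
| 2 | 502 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 0 | 0 | 0 |
| 3 | 699 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 0 |
| 4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 1 | 0 |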


Feature Scaled


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.735507 | 0.015266 | 0.00886 | 0.67316 | 2.535034 | -1.553624 | -1.03446 | -1.64081 | 1.760216 | -0.574682 | -1.087261 |
| 1 | 1.024427 | -0.652609 | 0.00886 | -1.207724 | 0.804242 | 0.643657 | -1.03446 | -0.079272 | -0.568112 | -0.574682 | -1.087261 |
| 2 | 0.808295 | -0.461788 | 1.393293 | -0.356937 | 0.804242 | 0.643657 | 0.966688 | -0.99684 | -0.568112 | 1.740094 | -1.087261 |
| 3 | 0.396614 | -0.080145 | 0.00886 | -0.009356 | -0.926551 | 0.643657 | 0.966688 | -1.591746 | -0.568112 | 1.740094 | 0.919743 |
| 4 | -0.467915 | 1.255605 | 0.701077 | -1.207724 | 0.804242 | 0.643657 | 0.966688 | 1.283302 | -0.568112 | -0.574682 | 0.919743 |






Epoch 1/100 ... Epoch 77/100 (per-epoch progress lines only; the captured output is cut off at epoch 78)


In [54]:

# - - - - - - - - - - - - - -
# Predict the test set results
# - - - - - - - - - - - - - -
y_pred = classifier.predict(X_test)


# Making the confusion matrix
# y_pred contains probabilities --> round them to 0/1 class labels

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred.round())

display(cm)

array([[1871,  120],
       [ 240,  269]], dtype=int64)
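The confusion matrix above can be turned into summary metrics. The sketch below assumes the scikit-learn layout, where rows are the actual classes and columns the predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. The variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[1871, 120],
               [240, 269]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # (1871 + 269) / 2500 = 0.856
precision = tp / (tp + fp)        # share of predicted churners that actually left
recall = tp / (tp + fn)           # share of actual churners that were detected

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

The accuracy is 0.856 on the 2500 test rows, but recall is considerably lower (269 of 509 actual churners), so accuracy alone overstates how well the model finds customers who leave.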