# Artificial Neural Networks

<br></br>
Before we look at the Database Prediction problem and start programming, let's take a look at the theory behind the Artificial Neural Network algorithm, which was popularized by Geoffrey Hinton in the 1980s and is used in Deep Learning. "Deep" in Deep Learning refers to the many hidden layers stacked between the input and output layers of this type of model.


<br></br>
The input layer observations and the related output refer to ONE row of data. Neural Nets learn by adjusting weights: the weights decide the strength and importance of the signals that are passed along or blocked by an Activation Function. The network keeps adjusting its weights until the predicted output closely matches the actual output.


<br></br>
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/01%20-%20Deep%20Learning.png">


<br></br>
Here is a zoomed-in version of the node diagram. Yellow nodes represent inputs, green nodes are the hidden layers, and red nodes are outputs.


<br></br>
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/02%20-%20Neuron.png">
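
<br></br>
To make the diagram concrete, here is a minimal sketch of a single neuron in plain Python. The function name `neuron_output`, the example weights, and the choice of a sigmoid activation are illustrative assumptions and are not taken from the project code.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a sigmoid activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# One row of data: three (already scaled) input values and hypothetical weights
row = [0.5, -1.2, 0.3]
weights = [0.8, 0.1, -0.4]
signal = neuron_output(row, weights)
print(signal)
```

Larger weights make an input count for more in the weighted sum, which is why adjusting them changes what the neuron passes on to the next layer.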


<br></br>
Feature Scaling (Standardization or Normalization) is applied to the input variables. It makes it easier for Neural Nets to process data by bringing the values of the different inputs onto a similar scale. Read 'Efficient Back Propagation.pdf' in the research papers section.


<br></br>
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/02_1%20-%20Standardized%20Equation.png" width="400">


<br></br>
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/02_2%20-%20Standardized%20Equation.png" width="400">
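
<br></br>
As a rough illustration, the two scaling options can be sketched in plain Python. This assumes the equations above are the usual standardization (subtract the mean, divide by the standard deviation) and min-max normalization formulas. The function names and the sample ages are made up for this example.

```python
import statistics

def standardize(values):
    """Rescale to zero mean and unit standard deviation: (x - mean) / std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(x - mean) / std for x in values]

def normalize(values):
    """Rescale to the range [0, 1]: (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    return [(x - lowest) / (highest - lowest) for x in values]

ages = [22, 35, 47, 58, 63]  # made-up column of customer ages
standardized_ages = standardize(ages)
normalized_ages = normalize(ages)
print(standardized_ages)
print(normalized_ages)
```

Both functions assume the column is not constant, otherwise the divisor would be zero.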


<br></br>
<br></br>

# Activation Function

<br></br>
Here is a list of some Neural Network Activation Functions. Read 'Deep sparse rectifier neural networks.pdf' in the research papers section.


<br></br>
1. Threshold Function - Rigid binary style function
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/03%20-%20Threshold.png" width="400">

2. Sigmoid Function - Smooth, good for output Layers that predict probability
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/04%20-%20Sigmoid.png" width="400">

3. Rectifier Function - Outputs zero for negative inputs and increases linearly as the input value increases
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/05%20-%20Rectifier.png" width="400">

4. Hyperbolic Tangent Function - Similar to Sigmoid Function, but output values range from -1 to 1, so they can go below zero
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/06%20-%20Tanh.png" width="400">
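
<br></br>
The four functions listed above can be written in a few lines of plain Python. This is only a sketch: the function names are chosen for this example, and treating an input of exactly zero as "on" in the Threshold Function is an assumption.

```python
import math

def threshold(x):
    """Step function: 1 when the input is non-negative, otherwise 0."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    """Smooth curve between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def rectifier(x):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)

def hyperbolic_tangent(x):
    """Smooth curve between -1 and 1."""
    return math.tanh(x)

for x in (-2.0, 0.0, 2.0):
    print(x, threshold(x), sigmoid(x), rectifier(x), hyperbolic_tangent(x))
```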


<br></br>
Different layers of a Neural Net can use different Activation Functions.


<br></br>
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/07%20-%20NN%20Activation%20Example.png" width="600">
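
<br></br>
A small sketch of this idea, assuming a hypothetical network with two inputs, three hidden neurons using the Rectifier Function, and one output neuron using the Sigmoid Function. All weights and biases below are invented for illustration.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        activation(sum(x * w for x, w in zip(inputs, neuron_weights)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hypothetical parameters: 3 hidden neurons (rectifier), 1 output neuron (sigmoid)
hidden_weights = [[0.2, -0.5], [0.7, 0.1], [-0.3, 0.8]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.6, -0.4, 0.9]]
output_biases = [0.05]

hidden = dense([1.0, 2.0], hidden_weights, hidden_biases, relu)
y_hat = dense(hidden, output_weights, output_biases, sigmoid)[0]
print(hidden, y_hat)
```

The first hidden neuron receives a negative weighted sum, so the Rectifier Function blocks its signal, while the Sigmoid Function on the output turns the final sum into a value between 0 and 1.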


<br></br>
<br></br>

# Cost Function

<br></br>
The Cost Function measures the difference between the target and the network's output, which we try to minimize through weight adjustments (Backpropagation) over several epochs (one epoch is one training cycle on the whole Training Set). Once input information is fed through the network and a y_hat output estimate is found (Forward-propagation), we take the error, go back through the network, and adjust the weights (Backpropagation Algorithm). The most common cost function is the Quadratic (squared error) cost:


<br></br>
$$
Cost = \frac{(\hat y - y)^2}{2} = \frac{(\text{Weighted Estimate} - \text{Actual})^2}{2}
$$
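
<br></br>
The quadratic cost translates directly into code. The sketch below uses made-up predictions and targets. The helper names are assumptions for this example, and summing the per-row costs over the Training Set is one common convention rather than something stated in the source.

```python
def quadratic_cost(y_hat, y):
    """Half squared error for one observation."""
    return (y_hat - y) ** 2 / 2

def quadratic_cost_gradient(y_hat, y):
    """Derivative of the cost with respect to y_hat."""
    return y_hat - y

def total_cost(predictions, targets):
    """Sum of the per-row costs over a whole set of observations."""
    return sum(quadratic_cost(p, t) for p, t in zip(predictions, targets))

single = quadratic_cost(0.8, 1.0)
overall = total_cost([0.8, 0.3, 0.6], [1, 0, 1])
print(single, overall)
```

Dividing by 2 is a convenience: it cancels the exponent when differentiating, so the gradient with respect to the prediction is simply the error `y_hat - y`.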


<br></br>
Read this [Deep Learning Book](http://neuralnetworksanddeeplearning.com/index.html) and this [list of Cost Functions and their uses](https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications?).


<br></br>
<br></br>

# Batch Gradient Descent

<br></br>
This is a Cost minimization technique that repeatedly steps in the downhill direction of the slope (the negative gradient). It reliably finds the Global Minimum only on Convex Cost Functions. The function can have any number of dimensions, but we are only able to visualize up to three dimensions.


### 1D Gradient Descent
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/09%20-%20Gradient%20Descent%201D.png" width="600">


### 2D Gradient Descent
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/10%20-%20Gradient%20Descent%202D.png" width="300">


### 3D Gradient Descent
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/11%20-%20Gradient%20Descent%203D.png" width="600">
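
<br></br>
A minimal sketch of the 1D case, assuming a toy model `y_hat = w * x` with a single weight and the quadratic cost summed over every row. The data, the learning rate, and the function name are all chosen for illustration.

```python
def batch_gradient_descent(xs, ys, learning_rate=0.05, epochs=200):
    """Fit y_hat = w * x by minimising the summed quadratic cost."""
    w = 0.0
    for _ in range(epochs):
        # Gradient of sum((w*x - y)^2 / 2) with respect to w, using ALL rows
        gradient = sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= learning_rate * gradient  # step downhill
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated with w = 2
w = batch_gradient_descent(xs, ys)
print(w)
```

Because this cost is convex in `w`, every step moves toward the single minimum. The weight changes only once per pass over the data, which is what makes this the batch variant.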



<br></br>
<br></br>

# Stochastic Gradient Descent

<br></br>
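This method usually converges faster than Batch Gradient Descent on large data sets because the weights are updated after every row instead of once per pass, and its noisier steps make it less likely to get stuck in a Local Minimum.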


<br></br>
In order to avoid the Local Minimum trap, we can take noisier, less regular steps to increase the likelihood of finding the Global Minimum. We can achieve this by adjusting weights one row at a time (Stochastic Gradient Descent) instead of once for the whole Training Set (Batch Gradient Descent). Read 'Neural Network in 13 lines of Python.pdf' in the research papers section.


<br></br>
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/12%20-%20Local%20Min%20Trap.png">
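
<br></br>
The trap can be reproduced on a small non-convex function. The cost `w^4 - 3w^2 + w` below is an arbitrary example chosen because it has one Local Minimum and one Global Minimum; it is not the function shown in the figure.

```python
def cost(w):
    """A non-convex cost with a local minimum (right) and a global one (left)."""
    return w ** 4 - 3 * w ** 2 + w

def slope(w):
    """Derivative of the cost."""
    return 4 * w ** 3 - 6 * w + 1

def descend(start, learning_rate=0.01, steps=1000):
    """Plain gradient descent from a given starting weight."""
    w = start
    for _ in range(steps):
        w -= learning_rate * slope(w)
    return w

trapped = descend(2.0)   # rolls into the nearest valley and stays there
lucky = descend(-2.0)    # happens to start on the side of the deeper valley
print(trapped, cost(trapped))
print(lucky, cost(lucky))
```

Plain downhill steps end wherever the nearest valley is, so the result depends entirely on the starting point. This is the problem that noisier updates try to reduce.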


<br></br>
These are the steps for Stochastic Gradient Descent:
1. Initialize weights to small numbers close to 0 (but NOT 0)
2. Input first row of Observation Data into input layer
3. Forward-propagate: Apply weights to inputs to get predicted result 'y_hat'
4. Compute Error = 'y_hat' - 'y_actual'
5. Back-propagate: Update weights according to the Learning Rate and how much they're responsible for the Error.
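6. Repeat steps 2-5 after each observation (Stochastic Gradient Descent), or update the weights only after a whole batch of observations (Batch Gradient Descent)
7. One Epoch is one full pass of the Training Set through the Artificial Neural Network. Repeat for more Epochs; more Epochs generally improve results until the model starts to overfit.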
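
<br></br>
The steps above can be sketched for the simplest possible network, a single sigmoid neuron trained with the quadratic cost. Everything here is an assumption made for illustration: the tiny OR-style data set, the learning rate, the number of epochs, and the function names. Shuffling the row order every epoch is a common practice, not something the steps require.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def train_sgd(rows, targets, learning_rate=0.5, epochs=500, seed=0):
    rng = random.Random(seed)
    # Step 1: small random weights close to zero
    weights = [rng.uniform(-0.05, 0.05) for _ in rows[0]]
    bias = 0.0
    for _ in range(epochs):              # Step 7: one epoch = one full pass
        order = list(range(len(rows)))
        rng.shuffle(order)
        for i in order:                  # Steps 2 and 6: one row at a time
            x, y = rows[i], targets[i]
            y_hat = predict(weights, bias, x)        # Step 3: forward-propagate
            error = y_hat - y                        # Step 4: compute the error
            # Step 5: gradient of the quadratic cost through the sigmoid
            delta = error * y_hat * (1.0 - y_hat)
            weights = [w - learning_rate * delta * v for w, v in zip(weights, x)]
            bias -= learning_rate * delta
    return weights, bias

rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 1, 1, 1]
weights, bias = train_sgd(rows, targets)
for x in rows:
    print(x, round(predict(weights, bias, x), 3))
```

Note that the weights are initialized once, before the first epoch, and only the per-row update is repeated.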


<br></br>
<br></br>

# Evaluating the ANN

<br></br>
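Be careful when measuring the accuracy of a model. The accuracy measured on a single Train/Test split can change every time the model is trained and evaluated, so one number says little about the model's Bias and Variance. To solve this problem, we can use K-Fold Cross Validation, which splits the data into K segments (folds), tests on each fold in turn while training on the rest, and averages the resulting accuracies.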


<br></br>
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/13%20-%20Bias-Variance%20Tradeoff.png" width="400">


<br></br>
<img src="https://raw.githubusercontent.com/AMoazeni/Machine-Learning-Database-Prediction/master/Images/14%20-%20K-Fold%20Cross%20Validation.png" width="400">
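
<br></br>
The splitting and averaging can be sketched without any library. In this example the "model" is a deliberately trivial majority-class baseline, and the labels, helper names, and the `fit_and_score` callback are hypothetical. Real implementations usually shuffle the rows before splitting, which is skipped here to keep the folds easy to follow.

```python
import statistics

def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size

def cross_validate(n_rows, k, fit_and_score):
    """Return the mean and spread of the score over k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_rows, k)]
    return statistics.mean(scores), statistics.pstdev(scores)

labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]  # made-up stay (0) / leave (1) labels

def majority_baseline(train, test):
    """'Train' by picking the most common label, then score on the test fold."""
    train_labels = [labels[i] for i in train]
    guess = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == guess for i in test) / len(test)

mean_accuracy, spread = cross_validate(len(labels), 5, majority_baseline)
print(mean_accuracy, spread)
```

The mean summarizes how accurate the model is on average, while the spread across folds shows how much a single split could mislead you.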



<br></br>
<br></br>

# Overfitting

<br></br>
Overfitting is when your model is over-trained on the Training Set and isn't generalized enough. This reduces performance on Test Set predictions.


<br></br>
Indicators of overfitting:

1. Training and Test Accuracies have a large difference
2. Observing High Accuracy Variance when applying K-Fold Cross Validation


<br></br>
Solve overfitting with "Dropout Regularization". It randomly disables Neurons during training iterations so they don't grow too dependent on each other. This helps the Neural Network learn several independent correlations from the data.
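
<br></br>
A minimal sketch of the idea, assuming the "inverted dropout" variant in which the surviving activations are scaled up during training so that nothing has to change at prediction time. The function name, the `rate` parameter, and the sample activations are chosen for this example.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate`."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Survivors are scaled by 1/keep so the expected total signal is unchanged
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)
hidden = [0.5, 1.2, 0.0, 0.9, 0.3, 0.7]  # made-up hidden layer outputs
during_training = dropout(hidden, rate=0.5, rng=rng)
during_prediction = dropout(hidden, rate=0.5, rng=rng, training=False)
print(during_training)
print(during_prediction)
```

Each training iteration draws a different random mask, so no neuron can rely on a specific neighbour always being present.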


<br></br>
<br></br>

# Sample Problem - Bank Database Prediction

<br></br>
Let's test our knowledge of Artificial Neural Networks by solving a real world problem. Take a look at 'Bank_Customer_Data.csv' in the Data folder of this repository. This technique can be applied to any customer-oriented business data set.


<br></br>
### Problem Description:

A Bank (or any business) is trying to improve customer retention. The Bank engineers have put together a table of data about their customers (Name, Age, Location, Income, etc). They also have data on whether customers left the Bank or stayed with them (last column of data).


<br></br>
The Bank is trying to build a Machine Learning model that predicts the likelihood of a customer leaving before it actually happens so they can work on improving customer satisfaction.


<br></br>
<br></br>

### Code

<br></br>
You can run the code online with Google Colab, which is web based and doesn't require any installation.


<br></br>
The better alternative is to download the code and run it with 'Spyder', found in the [Anaconda Distribution](https://www.anaconda.com/download/). 'Spyder' is similar to MATLAB: it allows you to step through the code and examine the 'Variable Explorer' to see exactly how the data is parsed and analyzed.


<br></br>
```shell
$ git clone https://github.com/AMoazeni/Machine-Learning-Database-Prediction.git
$ cd Machine-Learning-Database-Prediction
```

<br></br>
<br></br>
<br></br>
<br></br>

In [None]:
# Artificial Neural Network
# Part 1 - Data Preprocessing

# Install the libraries with pip in the Terminal
# Install Theano (U. Montreal numerical computation library that can run on GPU or CPU, useful when parallel floating-point computation is important)
# Install Tensorflow (Google, same purpose as above)
# Install Keras (high-level Neural Network API that runs on top of the above 2 libraries)

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
# NOTE: this is the original author's local path; point it at your own copy of the CSV file
dataset = pd.read_csv('/Users/mac/Google Drive/Python & RasPi/Udemy Deep Learning/Deep Learning Code/Volume 1 - Supervised Deep Learning/Part 1 - Artificial Neural Networks (ANN)/Churn_Modelling.csv')
# Extract Independent Variables (Matrix of Features / Observations)
X = dataset.iloc[:, 3:13].values
# Extract Dependent Variable Vector
y = dataset.iloc[:, 13].values

# Encoding categorical (Dep/Indep) data
# We need to convert non-number data into numbers
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# In the Bank example: Convert France/Germany/Spain into 0/1/2
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
# In the Bank example: Convert Female/Male into 0/1
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
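# Since our country categorical data is not ordinal (order doesn't matter)
# We need to create Dummy variables (one column per country)
# 'categorical_features' was removed from OneHotEncoder in scikit-learn 0.22, so we use ColumnTransformer instead
from sklearn.compose import ColumnTransformer
columntransformer = ColumnTransformer([('country', OneHotEncoder(), [1])], remainder = 'passthrough', sparse_threshold = 0)
# Dummy columns come first; make all Independent X values have the same type (Float 64)
X = columntransformer.fit_transform(X).astype(np.float64)
# Remove the first Dummy column to avoid the Dummy Variable trap
X = X[:, 1:]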

# Encoding the Dependent Variable
# In Bank example we don't need to encode the Dependent variable because it's already Binary
# Uncomment to activate the following code
#labelencoder_y = LabelEncoder()
#y = labelencoder_y.fit_transform(y)


# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
# Test_size = 0.2 means 80% of data for training, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
# This step Standardizes Input Data to ease computation
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


In [None]:
# Artificial Neural Network
# Part 2 - Making the ANN

# Importing the Keras libraries and packages
import keras

# Required to initialize NN
from keras.models import Sequential

# Required to build Deep layers
from keras.layers import Dense

# Prevent Overfitting with Dropout Regularization
from keras.layers import Dropout

# Initialising the ANN Sequentially (can also initialize as Graph)
# We use Sequential because we have successive layers
# We call our NN "Classifier"
classifier = Sequential()

# Adding the input layer and the first hidden layer
# This step initializes the Weights to small random numbers
# 'Units' is the number of nodes in the hidden layer (begin with average of Input & Output layer nodes = (11+1)/2 = 6)
# 'Kernel_initializer': Initialize weights as small random numbers
# 'Input_dim': number of Independent Variables
# 'Activation': Rectifier Activation Function ('relu') for Hidden Layers, Sigmoid Function for Output Layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))

# Add Dropout Regularization to first layer to prevent Overfitting
# 'rate': Fraction of Neurons to drop (named 'p' in Keras 1). Start with 0.1 (10% dropped) and increment by 0.1 until Overfitting is solved, don't go over 0.5
classifier.add(Dropout(rate = 0.1))

# Adding the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dropout(rate = 0.1))

# Adding the output layer
# No Dropout after the output layer: zeroing the single output node would corrupt the prediction during training
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN
# 'optimizer': Algorithm used to find the best Weights. 'adam' is a popular Stochastic Gradient Descent Algorithm
# 'loss' = 'binary_crossentropy' is the logarithmic loss, used for Binary Outputs
# 'loss' = 'categorical_crossentropy' is useful for 3+ categorical Outputs
# 'metrics' = Used to evaluate the ANN, requires list. We use 1 metric called 'accuracy'
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the ANN to the Training set
# Experiment to find best 'batch_size' and 'epochs'
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)


In [None]:
# Artificial Neural Network
# Part 3 - Making predictions and evaluating the model

# Predicting the Test set results
# This gives a vector of probabilities of Customers leaving the bank
# You can rank the probabilities of customers most likely to leave the bank
y_pred = classifier.predict(X_test)
# Choose a threshold of which customers leave or stay (use 50% as a starting threshold)
# This line converts probabilities into True/False
y_pred = (y_pred > 0.5)


# Predicting a single new observation
# Predict if the customer with the following information will leave the bank:
# Geography: France
# Credit Score: 600
# Gender: Male
# Age: 40
# Tenure: 3
# Balance: 60000
# Number of Products: 2
# Has Credit Card: Yes
# Is Active Member: Yes
# Estimated Salary: 50000
# sc.transform Feature Scales the new prediction so the model will understand it
# Set 1 element as a float64 to set all to float64
new_prediction = classifier.predict(sc.transform(np.array([[0.0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])))
new_prediction = (new_prediction > 0.5)


# Making the Confusion Matrix
# Tells you the number of correct vs. incorrect predictions
# Correct Predictions are on the diagonal: cm[0,0] + cm[1,1]
# Incorrect Predictions are off the diagonal: cm[0,1] + cm[1,0]
# Compute accuracy = correct predictions / total predictions
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Measure accuracy percentage on the Test Set
accuracy = (cm[0,0] + cm[1,1])/cm.sum()*100


In [None]:
# Artificial Neural Network
# Part 4 - Evaluating the ANN

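# Evaluating the ANN
# Import K-Fold Cross Validation Libraries
# Note: this wrapper module ships with older Keras releases; newer installations may need the separate SciKeras package instead
from keras.wrappers.scikit_learn import KerasClassifier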
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense
# Set up NN as a function
def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier
# 'estimator': Object used to fit the data
# 'X': Features of the Training set
# 'y': Target variable of Training set
# 'cv': Number of Train Test Folds for K-Fold Cross Validation, start with 10, check for low Bias
# 'n_jobs': How many CPU cores to use. Use '-1' to use all available CPU cores for parallel computation
classifier = KerasClassifier(build_fn = build_classifier, batch_size = 10, epochs = 100)
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = -1)
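# We're looking for low Bias (means high Accuracy) & low Variance
# We will get 10 Accuracies, one per fold
mean = accuracies.mean()
# Note: std() returns the standard deviation (spread) of the 10 Accuracies, which we use as the Variance indicator
variance = accuracies.std()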



In [None]:
# Artificial Neural Network
# Part 5 - Improving and Tuning the ANN

# Dropout Regularization to reduce overfitting if needed
# GridSearch tries several Tuning Hyper Parameters to find the best ones

# Tuning the ANN
from keras.wrappers.scikit_learn import KerasClassifier
# Older scikit-learn versions (before 0.18) used sklearn.grid_search instead of sklearn.model_selection
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense

# This function has an input (Optimizer) so we can try different ones
# 'Adam' and 'rmsprop' (also good for RNN) are good optimizers for stochastic gradient descent
def build_classifier(optimizer):
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    classifier.compile(optimizer = optimizer, loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier
# Build NN Classifier, we will train with K-Fold Cross Validation
classifier = KerasClassifier(build_fn = build_classifier)
parameters = {'batch_size': [25, 32],
              'epochs': [100, 500],
              'optimizer': ['adam', 'rmsprop']}
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10)
# Fit Model to data using grid_search to try various Hyper Parameters
grid_search = grid_search.fit(X_train, y_train)
# Output best parameters
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_
