**DEEP LEARNING**  
* What is Deep Learning?  
* Need A LOT of data and A LOT of processing power - which we didn't have until recently.  
* This is why ANN/CNN etc were invented a while ago - but are only catching on recently.  
* Geoffrey Hinton:  Godfather of deep learning - researched it heavily in the 80s.  Good videos.  
* Attempt to mimic how the human brain functions.  
* Brain has 100B neurons - each is connected to ~ 1,000 neighbors  

How Deep Learning Attempts to mimic the brain:  
* Input Layer  
* Hidden Layer  
* Output Layer  
* Each node in the input layer is connected to each node in the first hidden layer  
* With multiple hidden layers - each layer is fully connected to the next one  

**TOPICS FOR THIS SECTION - PLAN OF ATTACK**  
* Neuron  
* Activation Function  
* How do NN work?  (Example)  
* How do NN learn?  
* Gradient Descent  
* Stochastic Gradient Descent  
* Backpropagation  

**NEURON**  
* Basic building block of NNs  
1.  Input Values (1-N):  These are the N Independent Variables used in the model for a specific observation  
* Need to standardize or normalize  
2. Neuron: Hidden Layers  
3. Output:  Can be continuous (numerical) ; binary ; categorical  
* The above is an example of a single observation  
4. Synapses:  Connections between Input Value and Neuron (Hidden Layers)  
* Assigned weights:  these get adjusted through the process of learning  
* This is where Gradient Descent and Backward Propagation come into play  

**WHAT HAPPENS INSIDE THE NEURON?**  
1.  Sum product of input values and weights for each observation.  
2.  Applies Activation Function to the above sumproduct.  
3.  Depending on this result - value may or may not get passed to next level  
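
The three steps above can be sketched in a few lines of R. This is only an illustration: the input values and weights are made up, and sigmoid is just one possible choice of activation function.

```r
# Hypothetical single observation with three already-standardized inputs
x <- c(0.5, -1.2, 0.3)
# Hypothetical synapse weights (in a real network these are learned)
w <- c(0.8, 0.1, -0.4)

# Step 1: weighted sum (sum product) of input values and weights
z <- sum(x * w)

# Step 2: apply an activation function to the weighted sum (sigmoid here)
sigmoid <- function(z) 1 / (1 + exp(-z))
a <- sigmoid(z)

# Step 3: the activation is the signal passed on to the next layer
print(z)
print(a)
```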

**ACTIVATION FUNCTION**  
* There are several options - we will look at 4 primary:  
1. Threshold Function:  1 if x above a certain value - otherwise 0  
2. Sigmoid Function:  1/(1 + exp(-x)) ; smooth S-shaped curve from 0 to 1 (same function as in logistic regression) ; unlike threshold - this is smooth  
3. Rectifier: max(x,0) ; flat at 0 for negative x - then linear with x ; very popular even though it has a kink (is not differentiable) at 0  
4. Hyperbolic Tangent (tanh): Looks like Sigmoid - but Y values go from -1 to 1 instead of 0 to 1  
* tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x))  
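
As a rough sketch, the four activation functions can be written directly in R. The threshold cutoff of 0 is an assumption (the notes only say "a certain value"), and the function names are made up for illustration.

```r
# Threshold: 1 at or above the cutoff, otherwise 0 (cutoff of 0 is an assumption)
threshold <- function(x) ifelse(x >= 0, 1, 0)

# Sigmoid: smooth curve between 0 and 1
sigmoid <- function(x) 1 / (1 + exp(-x))

# Rectifier: 0 for negative inputs, linear for positive inputs
rectifier <- function(x) pmax(x, 0)

# Hyperbolic tangent written out from the formula in the notes
tanh_manual <- function(x) (1 - exp(-2 * x)) / (1 + exp(-2 * x))

x <- c(-2, -1, 0, 1, 2)
print(data.frame(x,
                 threshold = threshold(x),
                 sigmoid   = round(sigmoid(x), 3),
                 rectifier = rectifier(x),
                 tanh      = round(tanh_manual(x), 3)))

# The manual formula should agree with R's built-in tanh()
print(all.equal(tanh_manual(x), tanh(x)))
```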


**HOW DO NNs WORK?**  
* Example:  Predict a House Price ; 4 Input Variables ; Area - Bedrooms - Distance to City (Miles) - Age  
* the 4 variables above are our input layer for each observation  
* Simplest form:  No hidden layer ; Price = weighted sum of inputs.  
* Even without hidden layers this has some value (it is essentially a linear regression) - but hidden layers make the model much more powerful  
* All inputs connected to each node in the hidden layer - but some weights might be 0  
* Different subsets of inputs have non-zero weights for the various neurons  
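
A minimal sketch of one forward pass for the house price example, assuming one hidden layer with three neurons and a rectifier activation. All numbers (inputs, weights) are hypothetical and the inputs are assumed to be already scaled, so the result is in scaled units rather than a real price.

```r
# One hypothetical, already-scaled observation
x <- c(Area = 1.2, Bedrooms = 0.5, Distance = -0.7, Age = -0.3)

# Hypothetical weights: one row per hidden neuron, one column per input.
# A weight of 0 means that neuron ignores that input.
W_hidden <- rbind(
  c(0.6, 0.0, -0.8,  0.0),  # neuron 1: looks at Area and Distance only
  c(0.5, 0.4,  0.0, -0.3),  # neuron 2: looks at Area, Bedrooms and Age
  c(0.0, 0.0,  0.0, -0.9)   # neuron 3: looks at Age only
)
# Hypothetical weights from the hidden neurons to the output
w_out <- c(0.7, 0.5, 0.2)

rectifier <- function(z) pmax(z, 0)

# Hidden layer: weighted sum per neuron, then activation
hidden <- rectifier(as.vector(W_hidden %*% x))

# Output layer: weighted sum of the hidden activations
price_hat <- sum(w_out * hidden)
print(hidden)
print(price_hat)
```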

**HOW DO NNs LEARN?**  
* 2 ways for a machine to learn:  
A. Hard code all the rules  
B. NN - provide a framework and let the algo learn  
* Ultimately - NN will learn - and you might not know what rules it figured out / is using  

* Start with single layer:  'Single Layer Feed Forward NN' also called a perceptron - output is y-hat  
* For the NN to learn - it needs to compare y-hat with y  
* Cost Function:  C = 1/2(y-hat - y)^2  ; there are many options - but this is common and will be reviewed more with gradient descent  
* Feed the cost back into the NN to adjust the weights - goal is to ultimately minimize the cost function  
* For this example - dealing with a single row (observation) - but several associated independent variables  
WHAT ABOUT MULTIPLE ROWS?  
* Use the same perceptron (same weights) for every row  
* One epoch = go through a whole dataset and train NN on all rows  
* Start off - get y-hat for every row - compare to the actual y for every row - and sum into one cost for the entire dataset:  C = sum of 1/2(y-hat - y)^2  
* Now you update the weights - same weights for all the rows - trying to minimize overall cost function  
* This whole process = BACK PROPAGATION: the sum of squared differences is calculated and back propagated through the NN to update the weights.  
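
To make the cost function concrete, here is a small sketch that computes the cost per row and for the whole dataset. The actual values and predictions are made up for illustration.

```r
# Hypothetical actual values and predictions for four rows
y     <- c(1, 0, 1, 1)
y_hat <- c(0.8, 0.3, 0.6, 0.9)

# Cost per row: 1/2 * (y_hat - y)^2
cost_per_row <- 0.5 * (y_hat - y)^2

# Cost for the entire dataset: sum over all rows
total_cost <- sum(cost_per_row)
print(cost_per_row)
print(total_cost)
```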

**GRADIENT DESCENT**  
* How can we minimize cost function during back propagation?  
* One option - brute force: try many weight values and keep the best (trial and error)  
* Curse of dimensionality: with multiple layers and independent variables - the number of weight combinations to try grows exponentially  
* Gradient Descent is a method to make this more efficient  
* Compute the slope (gradient) of the cost function wrt each weight - and adjust the weight in the downhill direction  
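
A minimal sketch of gradient descent, assuming the simplest possible model: one weight, no hidden layer, y-hat = w * x. The data, learning rate and number of steps are all hypothetical choices for illustration.

```r
# Hypothetical data generated with a true weight of 2
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

# Cost over the whole dataset and its slope with respect to w
cost <- function(w) sum(0.5 * (w * x - y)^2)
grad <- function(w) sum((w * x - y) * x)

w <- 0                 # starting weight
learning_rate <- 0.01  # step size (assumed value)

for (step in 1:200) {
  # Move the weight against the slope, i.e. downhill on the cost curve
  w <- w - learning_rate * grad(w)
}
print(w)
print(cost(w))
```

With these numbers the weight should settle very close to 2, the value used to generate the data.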

**STOCHASTIC GRADIENT DESCENT**  
* Gradient Descent - only guaranteed to find the global minimum if the Cost Function is Convex (has a single global minimum)  
* Not Convex: Gradient Descent might find a local min that is not a global min  
* Stochastic Gradient Descent does not rely on convexity as much - but still has no guarantee of finding the global min  
* Gradient Descent above - looked at all rows at once - also called Batch Gradient Descent  
* Stochastic Gradient Descent - look at rows one by one and update the weights after each row  
* Stochastic Gradient Descent helps avoid local mins - because it has higher fluctuations  
* Stochastic Gradient Descent is faster (somewhat counterintuitive) - each update only needs one row instead of the whole dataset  
* Batch Gradient Descent is Deterministic - same starting weights give the same results every run ; Stochastic is Random - results can differ from run to run  
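
The difference between the two update schemes can be sketched on a toy one-weight model (y-hat = w * x). Everything here is an assumption made for illustration: the data, the learning rate and the number of epochs.

```r
# Hypothetical data generated with a true weight of 2
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
learning_rate <- 0.01

# Batch: one update per epoch, using the gradient summed over all rows
w_batch <- 0
for (epoch in 1:50) {
  w_batch <- w_batch - learning_rate * sum((w_batch * x - y) * x)
}

# Stochastic: one update per row, rows visited in random order
set.seed(1)  # fixed seed only so the run is repeatable
w_sgd <- 0
for (epoch in 1:50) {
  for (i in sample(length(x))) {
    w_sgd <- w_sgd - learning_rate * (w_sgd * x[i] - y[i]) * x[i]
  }
}
print(c(batch = w_batch, stochastic = w_sgd))
```

Both versions should end up near 2 on this clean data. On real, noisy data the stochastic path jumps around more, which is the fluctuation mentioned above.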

**BACKPROPAGATION**  
* Forward Propagation - data is entered into our algo (NN) - hidden layers - and results in a prediction  
* Backpropagation is where the error between predicted and actual values is fed back through the algo to adjust the weights  
* ALL WEIGHTS ARE ADJUSTED AT THE SAME TIME!!!  THIS IS THE STRENGTH OF BACKPROPAGATION  
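
As a worked example, here is the chain rule for a single sigmoid neuron with the cost function used earlier in these notes. The sigmoid activation and the learning rate symbol $\eta$ are assumptions for illustration; a full network applies the same idea layer by layer.

$$
z = \sum_{i=1}^{N} w_i x_i, \qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad C = \tfrac{1}{2}(\hat{y} - y)^2
$$

$$
\frac{\partial C}{\partial w_i} = \frac{\partial C}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_i} = (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})\,x_i
$$

$$
w_i \leftarrow w_i - \eta \, \frac{\partial C}{\partial w_i}
$$

The same error term $(\hat{y} - y)$ appears in the gradient for every weight, which is why all weights can be updated in one pass.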

**TRAINING THE ANN WITH STOCHASTIC GRADIENT DESCENT**   
1. Randomly initialize weights to values close to 0 (but not 0).  
2. Input first observation of your dataset in the input layer - each feature in one input node  
3. Forward-Propagation:  From L-R the neurons are activated in a way that the impact of each neuron's activation is limited by the weights.  Propagate the activations until getting the predicted result y-hat.  
4. Compare the predicted result to the actual result.  Measure the generated error.  
5. Back-Propagation: From R-L the error is back propagated.  Update the weights according to how much they are responsible for the error.  Learning rate decides by how much we update the weights.  
6. Repeat Steps 2-5 and update weights after each observation (Stochastic / Online Learning) OR  
   Repeat Steps 2-5 but update weights only after a batch of observations (Batch Learning)  
7. When whole training set passes through ANN - that makes an epoch.  Redo more epochs.  
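
The steps above can be sketched for the smallest possible network: a single sigmoid neuron with no hidden layer. This is not the h2o implementation, only a hand-rolled illustration. The toy data, learning rate and number of epochs are all assumptions.

```r
set.seed(42)

# Hypothetical toy data: 2 already-scaled features and a binary target
n <- 200
X <- matrix(rnorm(n * 2), ncol = 2)
y <- as.numeric(X[, 1] + X[, 2] > 0)

sigmoid <- function(z) 1 / (1 + exp(-z))

# Step 1: randomly initialize weights close to 0 (but not 0)
w <- runif(2, min = -0.1, max = 0.1)
learning_rate <- 0.1

for (epoch in 1:20) {                 # Step 7: redo more epochs
  for (i in sample(n)) {              # Step 6: update after each observation
    x_i   <- X[i, ]                   # Step 2: input one observation
    y_hat <- sigmoid(sum(w * x_i))    # Step 3: forward propagation
    error <- y_hat - y[i]             # Step 4: compare predicted vs actual
    # Step 5: gradient of 1/2 * (y_hat - y)^2 with respect to each weight
    grad  <- error * y_hat * (1 - y_hat) * x_i
    w     <- w - learning_rate * grad
  }
}

# Share of training rows classified correctly with a 0.5 cutoff
accuracy <- mean((sigmoid(X %*% w) > 0.5) == (y == 1))
print(w)
print(accuracy)
```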

**SAMPLE BUSINESS PROBLEM**  
* Simulated bank data - look at who might leave (noticed high churn rate)  

In [9]:
# Artificial Neural Network

# Importing the dataset
dataset = read.csv('Churn_Modelling.csv')
dataset = dataset[4:14]

# Encoding the categorical variables as factors
dataset$Geography = as.numeric(factor(dataset$Geography,
                                      levels = c('France', 'Spain', 'Germany'),
                                      labels = c(1, 2, 3)))
dataset$Gender = as.numeric(factor(dataset$Gender,
                                   levels = c('Female', 'Male'),
                                   labels = c(1, 2)))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Exited, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling
training_set[-11] = scale(training_set[-11])
test_set[-11] = scale(test_set[-11])

# Fitting ANN to the Training set
# install.packages('h2o')
library(h2o)
h2o.init(nthreads = -1)
model = h2o.deeplearning(y = 'Exited',
                         training_frame = as.h2o(training_set),
                         activation = 'Rectifier',
                         hidden = c(5,5),
                         epochs = 100,
                         train_samples_per_iteration = -2)

# Predicting the Test set results
y_pred = h2o.predict(model, newdata = as.h2o(test_set[-11]))
y_pred = (y_pred > 0.5)
y_pred = as.vector(y_pred)

# Making the Confusion Matrix
cm = table(test_set[, 11], y_pred)

h2o.shutdown()

 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 hours 44 minutes 
    H2O cluster timezone:       America/Los_Angeles 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.24.0.5 
    H2O cluster version age:    1 year, 9 months and 7 days !!! 
    H2O cluster name:           H2O_started_from_R_tgsxp_cti312 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.37 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.6.1 (2019-07-05) 


"
Your H2O cluster version is too old (1 year, 9 months and 7 days)!
Please download and install the latest version from http://h2o.ai/download/"


Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)? Y


In [10]:
cm

   y_pred
       0    1
  0 1545   48
  1  229  178
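
To read the confusion matrix above, a short sketch that rebuilds it by hand (rows are actual values, columns are predictions) and derives a few summary metrics. The matrix is re-typed here so the snippet runs without the H2O cluster.

```r
# Confusion matrix copied from the output above
cm <- matrix(c(1545, 229, 48, 178), nrow = 2,
             dimnames = list(actual = c("0", "1"), y_pred = c("0", "1")))

# Accuracy: correct predictions (the diagonal) over all test rows
accuracy <- sum(diag(cm)) / sum(cm)

# Recall for churners: share of customers who left that the model caught
recall <- cm["1", "1"] / sum(cm["1", ])

# Precision for churners: share of predicted leavers who actually left
precision <- cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy = accuracy, recall = recall, precision = precision), 4))
```

Accuracy comes out at 0.8615 (1,723 of 2,000 test rows), but recall is only about 0.44, so the model misses more than half of the customers who actually left.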