<font color='green'>
**Welcome to Deep Learning Tutorial for Beginners** 
* I am going to explain <u>everything</u> one by one.
* Instead of writing long, hard-to-read paragraphs, I define and emphasize keywords line by line.
* At the end of this tutorial, you will know enough about deep learning to go deeper into it.
* Let's look at the content.

<font color='red'>
<br>Content:
    
* [Introduction](#1)
* [Overview the Data Set](#2)
* [Logistic Regression](#3)
    * [Computation Graph](#4)
    * [Initializing parameters](#5)
    * [Forward Propagation](#6)
        * Sigmoid Function
        * Loss(error) Function
        * Cost Function
    * [Optimization Algorithm with Gradient Descent](#7)
        * Backward Propagation
        * Updating parameters
    * [Logistic Regression with Sklearn](#8)
    * [Summary and Questions in Minds](#9)
    
* [Artificial Neural Network](#10)
    * [2-Layer Neural Network](#11)
        * [Size of layers and initializing parameters weights and bias](#12)
        * [Forward propagation](#13)
        * [Loss function and Cost function](#14)
        * [Backward propagation](#15)
        * [Update Parameters](#16)
        * [Prediction with learnt parameters weight and bias](#17)
        * [Create Model](#18)
    * [L-Layer Neural Network](#19)
        * [Implementing with keras library](#22)
* Time Series Prediction: https://www.kaggle.com/kanncaa1/time-series-prediction-with-eda-of-world-war-2
* [Artificial Neural Network with Pytorch Library](#23)
* [Convolutional Neural Network with Pytorch Library](#24)
* [Recurrent Neural Network with Pytorch Library](#25)
* [Conclusion](#20)



<a id="1"></a> <br>

# INTRODUCTION
* **Deep learning:** A machine learning technique that learns features directly from data. 

* **Why deep learning:** When the amount of data increases, classical machine learning techniques stop improving, while deep learning keeps giving better performance on metrics such as accuracy.

<a href="http://ibb.co/m2bxcc"><img src="http://preview.ibb.co/d3CEOH/1.png" alt="1" border="0"></a>
* **How much data is "big":** It is hard to answer, but intuitively 1 million samples is enough to say "big amount of data".
* **Usage fields of deep learning:** Speech recognition, image classification, natural language processing (NLP), and recommendation systems.
* **How deep learning differs from machine learning:** 
    * Machine learning covers deep learning. 
    * In classical machine learning, features are given manually.
    * Deep learning, on the other hand, learns features directly from data.
    
<a href="http://ibb.co/f8Epqx"><img src="http://preview.ibb.co/hgpNAx/2.png" alt="2" border="0"></a>

<br>Let's look at our data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import warnings
# filter warnings
warnings.filterwarnings("ignore")
from subprocess import check_output
print(check_output(["ls", "../input"]).decode('utf8'))
# Any results you write to the current directory are saved as output.

<a id="2"></a> <br>
# Overview the Data Set
* We will use the "Sign Language Digits Dataset" for this tutorial.
* This dataset contains 2062 sign language digit images.
* Digits run from 0 to 9, so there are 10 unique signs.
* At the beginning of the tutorial we will use only signs 0 and 1 for simplicity. 
* In the data, sign zero is between indexes 204 and 408, so there are 205 zero signs.
* Sign one is between indexes 822 and 1027, so there are 206 one signs. We will use all of them: 205 zeros and 206 ones, 411 images in total.
* Note: About 205 samples per class is very little for deep learning, but this is a tutorial, so it does not matter much. 
* Let's prepare our X and Y arrays. X is the image array (zero and one signs) and Y is the label array (0 and 1).

In [None]:
# load dataset
X_1 = np.load('/kaggle/input/sign-language-digits-dataset/X.npy')
y_1 = np.load('/kaggle/input/sign-language-digits-dataset/Y.npy')

img_size = 64

# Look at some signs :
plt.subplot(1,2,1)
plt.imshow(X_1[210].reshape(img_size, img_size))
plt.axis('off')
plt.subplot(1,2,2)
plt.imshow(X_1[900].reshape(img_size, img_size))
plt.axis('off')

In [None]:
print(X_1.shape) 
print(y_1.shape)   # 10 possible outcomes 

* To create the image array, I concatenate the zero sign and one sign arrays.
* Then I create the label array: 0 for zero sign images and 1 for one sign images.

In [None]:
# Join a sequence of arrays of 0s and 1s along the rows axis
X = np.concatenate((X_1[204:409], X_1[822:1028]), axis = 0) # rows 0 to 204 are zero signs and rows 205 to 410 are one signs 
z = np.zeros(205)
o = np.ones(206)
y = np.concatenate((z,o), axis = 0).reshape(X.shape[0], 1)

print("X shape : ", X.shape)
print("y shape : ", y.shape)

* The shape of X is (411, 64, 64)
    * 411 means that we have 411 images (zero and one signs)
    * 64 means that our image size is 64x64 (64x64 pixels)
* The shape of Y is (411,1)
    *  411 means that we have 411 labels (0 and 1) 
* Let's split X and Y into train and test sets.
    * test_size = fraction of the data used for testing. test = 15% and train = 85%
    * random_state = the seed used while randomizing. If we call train_test_split repeatedly with the same random_state, it always creates the same train and test split.
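To see what a fixed `random_state` buys us, here is a minimal sketch on toy data. The arrays `X_toy` and `y_toy` are made up for illustration and are not part of the sign language dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples, 2 features, balanced binary labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# Two calls with the same random_state
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.15, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.15, random_state=42)

print(np.array_equal(a_test, b_test))   # True: both calls produce the same split
print(len(a_train) + len(a_test))       # 20: every sample lands in exactly one set
```

Changing `random_state` to another integer would give a different, but again repeatable, split.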

In [None]:
# Let's create X_train, y_train, X_test, y_test arrays
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)

number_of_train = X_train.shape[0]
number_of_test = X_test.shape[0]
print(f"train : {number_of_train}\ntest : {number_of_test}")

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

* Now we have a 3-dimensional input array (X), so we need to flatten it to 2D in order to use it as input for our first deep learning model.
* Our label array (y) is already 2D, so we leave it as it is.
* Let's flatten the X array (image array).
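Flattening can be tried on a tiny array first. This is only a sketch: `images` is a hypothetical stand-in with 3 "images" of 2x2 pixels instead of 64x64.

```python
import numpy as np

# Toy stand-in (hypothetical): 3 "images" of 2x2 pixels
images = np.arange(12).reshape(3, 2, 2)

# Keep the first axis (samples) and merge the remaining axes into one
flat = images.reshape(images.shape[0], -1)

print(images.shape)  # (3, 2, 2)
print(flat.shape)    # (3, 4)
print(flat[0])       # [0 1 2 3]
```

The `-1` tells NumPy to work out the remaining dimension itself; for 64x64 images that becomes 4096.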


In [None]:
X_train_flatten = X_train.reshape(number_of_train,-1)
X_test_flatten = X_test.reshape(number_of_test, -1)

print("X train flatten",X_train_flatten.shape)
print("X test flatten",X_test_flatten.shape)

* As you can see, we have 349 images for training, and each image has 4096 pixels.
* We also have 62 images for testing, each with 4096 pixels.
* Then let's take the transpose, so that each column holds one image.

In [None]:
X_train = X_train_flatten.T   # shape(4096, 349)
X_test = X_test_flatten.T      # shape(4096, 62)
y_train = y_train.T            # shape(1,349)
y_test = y_test.T              # shape(1,62)

<font color='purple'>
What we did up to this point:
    
* Chose our labels (classes): sign zero and sign one
* Created and flattened the train and test sets
* Our final inputs (images) and outputs (labels or classes) look like this:
<a href="http://ibb.co/bWMK7c"><img src="http://image.ibb.co/fOqCSc/3.png" alt="3" border="0"></a>


**Remark: the image shows 348 training samples; with the split used here, the correct number is 349.**

<a id="3"></a> <br>
# Logistic Regression
* When we talk about binary classification (0 and 1 outputs), what comes to mind first is logistic regression.
* However, one may ask: what is logistic regression doing in a deep learning tutorial?
* The answer is that logistic regression is actually a very simple neural network. 
* By the way, in this tutorial the terms neural network and deep learning are used for the same thing. When we get to artificial neural networks, I will explain the term "deep" in detail.
* In order to understand logistic regression (simple deep learning), let's first learn about the computation graph.

<a id="4"></a> <br>
##  Computation Graph
* Computation graphs are a nice way to think about mathematical expressions.
* They are like a visualization of mathematical expressions.
* For example, we have $$c = \sqrt{a^2 + b^2}$$
* Its computation graph is shown below. As you can see, we express math with a graph.
<a href="http://imgbb.com/"><img src="http://image.ibb.co/hWn6Lx/d.jpg" alt="d" border="0"></a>
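The same expression can be written as one line of code per node of the graph. A minimal sketch, with example inputs and node names chosen only for illustration:

```python
import math

a_val, b_val = 3.0, 4.0   # example inputs (chosen for illustration)

# Each line below is one node of the computation graph
u = a_val ** 2        # node 1: square a
v = b_val ** 2        # node 2: square b
s = u + v             # node 3: add the squares
c = math.sqrt(s)      # node 4: take the square root

print(c)  # 5.0
```

Reading the lines top to bottom is the forward pass through the graph.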

* Now let's look at the computation graph of logistic regression
<a href="http://ibb.co/c574qx"><img src="http://preview.ibb.co/cxP63H/5.jpg" alt="5" border="0"></a>
    * Parameters are weight and bias.
    * Weights: coefficients of each pixel
    * Bias: intercept
    * $z = (w^T)X + b  => z$ equals (transpose of weights times input $X$) + bias 
    * In other words $=> z = b + px_1*w_1 + px_2*w_2 + ... + px_{4096}*w_{4096}$
    * $y_{head} = sigmoid(z)$
    * The sigmoid function squashes $z$ into the range between zero and one so that it can be read as a probability. You can see the sigmoid function in the computation graph.
* Why do we use the sigmoid function?
    * It gives a probabilistic result.
    * It is differentiable, so we can use it in the gradient descent algorithm (we will see it later).
* Let's take an example:
    * Let's say we find $z = 4$ and put $z$ into the sigmoid function. The result ($y_{head}$) is about 0.98. It means that our classification result is 1 with 98% probability.
* Now let's start from the beginning and examine each component of the computation graph in more detail.
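To get a feel for the numbers, here is a quick sketch that evaluates the sigmoid at a few values of $z$. The helper name `sigmoid_demo` is only for this illustration.

```python
import numpy as np

def sigmoid_demo(z):
    """Logistic sigmoid 1 / (1 + e^(-z)); the name is local to this illustration."""
    return 1 / (1 + np.exp(-z))

# Negative z gives values near 0, z = 0 gives exactly 0.5, positive z gives values near 1
for z in (-4.0, 0.0, 2.2, 4.0):
    print(z, round(float(sigmoid_demo(z)), 3))
```

For $z = 4$ this prints about 0.982, and $z$ around 2.2 is where the output reaches 0.9.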

<a id="5"></a> <br>
## Initializing parameters
* As you know, our input is images with 4096 pixels each (each image in X_train).
* Each pixel has its own weight.
* The first step is multiplying each pixel by its own weight.
* The question is: what is the initial value of the weights?
    * There are some techniques that I will explain in the artificial neural network part, but for now the initial weights are 0.01.
    * Okay, the weights are 0.01, but what is the shape of the weight array? As you can see from the computation graph of logistic regression, it is (4096,1).
    * The initial bias is 0.
* Let's write some code. In order to reuse it in coming topics like artificial neural network (ANN), I define some functions (methods).

In [None]:
# let's initialize parameters
# So what we need is dimension 4096 that is number 
# of pixels as a parameter for our initialize method(def)

def initialize_weights_and_bias(dimension) :
    w = np.full((dimension,1), 0.01) # shape(dimension,1)
    b = 0.0
    return w, b

In [None]:
w,b = initialize_weights_and_bias(4096) # w is shape(4096,1)

In [None]:
print(w)
print(b)

In [None]:
print(w.shape)

<a id="6"></a> <br>
## Forward Propagation

* All the steps from pixels to cost are called **forward propagation**
    * $z = (w^T)X + b =>$ in this equation $X$ is the pixel array, $w$ the weights, $b$ the bias, and $T$ denotes the transpose.
    * Then we put $z$ into the sigmoid function, which returns $y_{head}$ (probability); see the computation graph.
    * Then we calculate the loss (error) function. 
    * The cost function is the summation of all losses (errors).
    * Let's start with $z$ and then write the sigmoid definition (method), which takes $z$ as input parameter and returns $y_{head}$ (probability).

In [None]:
# Sigmoid: maps z to y_head, a value between 0 and 1
def sigmoid(z) :
    y_head = 1/(1+ np.exp(-z))
    return y_head

In [None]:
# Example :
print(sigmoid(0))

* Now that we have written the sigmoid method and can calculate $y_{head}$, let's learn what a loss (error) function is.
* Let's take an example. I put one image in as input, multiply it by the weights and add the bias term to get $z$. Then I put $z$ into the sigmoid method to find $y_{head}$. Say $y_{head}$ comes out as 0.9, which is bigger than 0.5, so our prediction is that the image is a sign "one". Everything looks fine, but is our prediction correct, and how do we check? The answer is the **loss (error) function**:
    * The mathematical expression of the log loss (error) function is: 
    <a href="https://imgbb.com/"><img src="https://image.ibb.co/eC0JCK/duzeltme.jpg" alt="duzeltme" border="0"></a>
    * It says that if you make a wrong prediction, the loss (error) becomes big.
        * Example: Our real image is labeled as "sign one" with the label $y = 1$. Upon making a prediction, let's say $y_{\text{head}} = 1$. When we substitute $y$ and $y_{\text{head}}$ into the loss (error) equation, the result is 0. This indicates a correct prediction, hence our loss is 0. However, if we were to make an incorrect prediction, such as $y_{\text{head}} = 0$, the loss (error) goes to infinity.

* Following this, the cost function is the summation of the loss function. Each image contributes to the loss function. Therefore, the cost function is the summation of the loss functions generated by each input image.
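The behaviour of the log loss is easy to check numerically. A small sketch, assuming the usual form $-(y\log(y_{head}) + (1-y)\log(1-y_{head}))$; the helper name `log_loss_single` is made up for this illustration.

```python
import numpy as np

def log_loss_single(y_true, y_head):
    """Log loss for one sample: -(y*log(y_head) + (1-y)*log(1-y_head))."""
    return -(y_true * np.log(y_head) + (1 - y_true) * np.log(1 - y_head))

y_true = 1  # the image really is a sign "one"
for y_head in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(y_head, round(float(log_loss_single(y_true, y_head)), 3))
```

A confident correct prediction (0.99) costs almost nothing, while a confident wrong one (0.01) costs a lot, and the loss keeps growing as $y_{head}$ approaches 0.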

* Now, let's proceed to implement forward propagation.


In [None]:
# Forward propagation steps :
# 1/ Find z = w^T * X + b
# 2/ y_head = sigmoid(z)
# 3/ loss(error) = loss(y, y_head)
# 4/ cost = sum(loss)

def forward_propagation(w, b, X_train, y_train):
    z = w.T @ X_train + b   # shape(1,349)
    y_head = sigmoid(z)    # shape(1,349)
    loss = -(1-y_train) * np.log(1-y_head) - y_train*np.log(y_head)  # shape(1,349)
    cost = loss.sum() / X_train.shape[1]   # for scaling
    return cost

<a id="7"></a> <br>
##  Optimization Algorithm with Gradient Descent

* Now that we understand our cost, which is the error, we need to decrease it, because a high cost indicates that we made wrong predictions.

* Let's consider the **first step**: everything starts with initializing weights and biases. Therefore, the cost depends on them.

* To decrease the cost, we need to update the weights and biases, in other words, our model needs to learn the parameters (weights and biases) that minimize the cost function. This technique is called **gradient descent**.

* Let's illustrate this with an example:

Suppose we have $w = 5$ and bias $= 0$ (ignore bias for now). Then, after forward propagation, our cost function is $1.5$.
It looks like this: (red lines)

<a href="http://imgbb.com/"><img src="http://image.ibb.co/dAaYJH/7.jpg" alt="7" border="0"></a>

* As seen from the graph, we are not at the minimum point of the cost function. Therefore, we need to move towards the minimum cost. Ok, let's update the weight. (The symbol $:=$ denotes updating)
* $w := w - \text{step}$. The question is, what is this step? The step is the slope1. Ok, it seems remarkable. To find the minimum point, we can use slope1. Let's say slope1 $= 3$ and update our weight. $w := w - \text{slope1} \Rightarrow w = 2$.
* Now, our weight $w$ is $2$. As you remember, we need to find the cost function with forward propagation again.
* Let's say, according to forward propagation with $w = 2$, the cost function is $0.4$. Hmm, we are heading in the right direction because our cost function is decreasing. We have a new value for the cost function, which is $0.4$. Is that enough? Actually, I do not know; let's try one more step.
* Slope2 $= 0.7$ and $w = 2$. Let's update the weight: $w := w - \text{step}(\text{slope2}) \Rightarrow w = 1.3$, which is the new weight. So, let's find the new cost.
* Perform one more forward propagation with $w = 1.3$, and our cost $= 0.3$. Ok, our cost even decreased. It looks fine, but is it enough, or do we need to take one more step? The answer is, again, I do not know; let's try.
* Slope3 $= 0.01$ and $w = 1.3$. Updating weight: $w := w - \text{step}(\text{slope3}) \Rightarrow w = 1.29 \approx 1.3$. The weight does not change because we found the minimum point of the cost function.
* Everything seems good, but how do we find the slope? If you remember from high school or university, to find the slope of a function (the cost function) at a given point (a given weight), we take the derivative of the function at that point. You may ask: how does it know where to go? Couldn't it move toward higher cost values instead of the minimum? The answer is that the slope (derivative) gives both the size and the direction of the step. Therefore, do not worry :)
* The update equation is shown below. It says that there is a cost function (which takes weight and bias). Take the derivative of the cost function with respect to weight and bias, multiply it by $\alpha$, the learning rate, and then update the weight. (To keep the explanation simple I ignore the bias, but all these steps apply to the bias as well.)

<a href="http://imgbb.com/"><img src="http://image.ibb.co/hYTTJH/8.jpg" alt="8" border="0"></a>
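The update rule can be tried on a toy problem. This is only a sketch under stated assumptions: the cost $(w-1)^2$ and the helper names `toy_cost`, `toy_slope` and `gradient_descent` are invented for illustration and are not the tutorial's cost function.

```python
def toy_cost(w):
    """Toy cost with its minimum at w = 1 (an assumption for illustration)."""
    return (w - 1.0) ** 2

def toy_slope(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 1.0)

def gradient_descent(w_start, learning_rate, steps):
    """Repeat w := w - learning_rate * slope(w) and return the final weight."""
    w_toy = w_start
    for _ in range(steps):
        w_toy = w_toy - learning_rate * toy_slope(w_toy)
    return w_toy

# Same starting weight, three different learning rates
for lr in (0.01, 0.3, 1.1):
    w_final = gradient_descent(w_start=5.0, learning_rate=lr, steps=20)
    print(f"learning rate {lr}: w = {w_final:.4f}, cost = {toy_cost(w_final):.4f}")
```

With 0.01 the weight has barely moved after 20 steps, with 0.3 it settles at the minimum, and with 1.1 every step overshoots and the cost grows.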


* Now, I'm sure you are asking, what is the **learning rate** that I mentioned earlier? It is a simple term: it sets how big each update step is. However, there is a tradeoff between learning too slowly and never learning at all. For example, you are in Paris (current cost) and want to go to Madrid (minimum cost). If your speed (learning rate) is small, you can go to Madrid very slowly, and it takes too long. On the other hand, if your speed (learning rate) is big, you can go very fast, but maybe you crash and never get to Madrid. Therefore, we need to choose our speed (learning rate) wisely.
* The learning rate is also called a hyperparameter that needs to be chosen and tuned. I will explain it more in detail in artificial neural networks with other hyperparameters. For now, just assume the learning rate is $1$ for our previous example.
* I think now you understand the logic behind **forward propagation (from weights and bias to cost)** and **backward propagation (from cost to weights and bias to update them)**. Also, you learned about gradient descent. Before implementing the code, you need to learn one more thing: how to take the derivative of the cost function with respect to weights and bias. It is not related to Python or coding; it is pure mathematics. There are two options: the first one is to google how to take the derivative of the log loss function, and the second one is to google what the derivative of the log loss function is :) I choose the second one because I cannot explain math without talking.

$$ \frac{\partial J}{\partial w} = \frac{1}{m}X(y_{head} - y)^T$$
$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (y_{head}^{(i)}-y^{(i)})$$
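If you do not want to take these two formulas on faith, you can compare them with numerical derivatives. A minimal sketch on small random data; all names ending in `_chk` and the sizes are assumptions for this check only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, m = 5, 8                                  # toy sizes, not 4096 and 349
X_chk = rng.normal(size=(n_features, m))              # one sample per column
y_chk = rng.integers(0, 2, size=(1, m)).astype(float)
w_chk = rng.normal(size=(n_features, 1)) * 0.1
b_chk = 0.0

def cost_chk(w, b):
    """Mean log loss of logistic regression on the toy data."""
    y_head = 1 / (1 + np.exp(-(w.T @ X_chk + b)))
    loss = -y_chk * np.log(y_head) - (1 - y_chk) * np.log(1 - y_head)
    return loss.sum() / m

# Analytic gradients from the two formulas
y_head_chk = 1 / (1 + np.exp(-(w_chk.T @ X_chk + b_chk)))
dw = X_chk @ (y_head_chk - y_chk).T / m
db = np.sum(y_head_chk - y_chk) / m

# Numerical gradients by central finite differences
eps = 1e-6
dw_num = np.zeros_like(w_chk)
for j in range(n_features):
    step = np.zeros_like(w_chk)
    step[j, 0] = eps
    dw_num[j, 0] = (cost_chk(w_chk + step, b_chk) - cost_chk(w_chk - step, b_chk)) / (2 * eps)
db_num = (cost_chk(w_chk, b_chk + eps) - cost_chk(w_chk, b_chk - eps)) / (2 * eps)

print(np.max(np.abs(dw - dw_num)))  # difference should be tiny
print(abs(db - db_num))             # difference should be tiny
```

If both printed differences are close to zero, the formulas and the cost function agree with each other.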

In [None]:
# In backward propagation we will use y_head that we found in forward propagation
# Therefore, instead of writing a separate backward propagation method, let's combine forward propagation and backward propagation


def forward_backward_propagation(w,b,X_train,y_train):
    
    # forward propagation
    z = w.T @ X_train + b
    y_head = sigmoid(z)
    loss = -y_train*np.log(y_head)-(1-y_train)*np.log(1-y_head)
    cost = (np.sum(loss))/X_train.shape[1]      # for scaling
    
    # backward propagation
    derivative_weight = (X_train @ ((y_head-y_train).T)) / X_train.shape[1] 
    derivative_bias = np.sum(y_head-y_train) / X_train.shape[1]
    
    # Dictionary of weights and biases derivatives
    gradients = {"derivative_weight": derivative_weight,"derivative_bias": derivative_bias}
    return cost,gradients

Up to this point, we have learned:

* Initializing the parameters (implemented)
* Finding the cost with forward propagation and cost function (implemented)
* Updating (learning) parameters (weight and bias). Now let's implement it.

In [None]:
# Updating the parameters (learning) 

def update(w, b, X_train, y_train, learning_rate,number_of_iterarion):
    cost_list = []
    cost_list2 = []
    index = []

    # updating(learning) the parameters, number_of_iterarion times
    for i in range(number_of_iterarion):
        # make forward and backward propagation and find cost and gradients
        cost,gradients = forward_backward_propagation(w,b,X_train,y_train)
        cost_list.append(cost)
        # let's update
        w = w - learning_rate * gradients["derivative_weight"]
        b = b - learning_rate * gradients["derivative_bias"]
        if i % 10 == 0:
            cost_list2.append(cost)  # add the cost after each 10th iteration
            index.append(i)          # iteration index
            print ("Cost after iteration %i: %f" %(i, cost))
            
    # weights and bias final update
    parameters = {"weight": w,"bias": b}
    
    # plotting the cost results after each 10th iteration
    plt.plot(index,cost_list2)
    plt.xticks(index,rotation='vertical')
    plt.xlabel("Number of Iteration")
    plt.ylabel("Cost")
    plt.show()
    return parameters, gradients, cost_list
#parameters, gradients, cost_list = update(w, b, X_train, y_train, learning_rate = 0.009,number_of_iterarion = 200)

* Up to this point, we have learned our parameters. It means we have fitted the data.
* In order to make predictions, we need parameters. Therefore, let's predict.
* During the prediction step, we have `X_test` as an input, and we use it to make forward predictions.

In [None]:
# Prediction function

def predict(w, b, X_test) :
    # X_test is now the input for forward propagation
    y_head = sigmoid(w.T @ X_test + b)
    y_pred = np.zeros((1,X_test.shape[1]))
    # if y_head is bigger than 0.5, our prediction is sign one (y_pred=1),
    # if y_head is smaller than/equal to 0.5, our prediction is sign zero (y_pred=0),
    for i in range(y_head.shape[1]) :
        if y_head[0,i] <= 0.5 :
            y_pred[0,i] = 0   # assignment (=), not comparison (==)
        else :
            y_pred[0,i] = 1
    return y_pred

#predict(parameters["weight"],parameters["bias"],X_test)

In [None]:
(1,X_test.shape[1])

* We made the prediction.
* Now let's put them all together.

In [None]:
def logistic_regression(X_train, y_train, X_test, y_test, learning_rate ,  num_iterations):
    # initialization of the parameters :
    dimension =  X_train.shape[0]  # that's 4096
    w,b = initialize_weights_and_bias(dimension)
    
    # We don't change the learning rate
    parameters, gradients, cost_list = update(w, b, X_train, y_train, learning_rate,num_iterations)
    
    # Predictions 
    y_prediction_test = predict(parameters["weight"],parameters["bias"],X_test)
    y_prediction_train = predict(parameters["weight"],parameters["bias"],X_train)
    
    # Print train/test accuracies
    print("train accuracy: {} %".format(100 - np.mean(np.abs(y_prediction_train - y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(y_prediction_test - y_test)) * 100))
    
logistic_regression(X_train, y_train, X_test, y_test,learning_rate = 0.01, num_iterations = 150)

<a id="8"></a> <br>
## Logistic Regression with Sklearn
* The sklearn library has a logistic regression class that eases implementing logistic regression.
* To understand each parameter of logistic regression in sklearn, you can read the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* The accuracies are different from what we found, because sklearn's logistic regression uses many features that we did not, such as a different optimization algorithm and regularization.
* Let's conclude logistic regression and continue with artificial neural networks.

In [None]:
from sklearn.linear_model import LogisticRegression

# sklearn expects samples in rows and a 1D label vector
logreg = LogisticRegression(random_state = 42,max_iter= 150)
logreg.fit(X_train.T, y_train.T.ravel())   # fit on the training set only

print("test accuracy: {} ".format(logreg.score(X_test.T, y_test.T.ravel())))
print("train accuracy: {} ".format(logreg.score(X_train.T, y_train.T.ravel())))

<a id="9"></a> <br>
## Summary and Questions in Minds

<font color='purple'>
What we did in this first part:

* Initialized parameters: weight and bias
* Performed forward propagation
* Calculated the loss function
* Derived the cost function
* Implemented backward propagation (gradient descent)
* Made predictions using the learned parameters: weight and bias
* Utilized logistic regression with `sklearn`
* We will construct an artificial neural network based on logistic regression.

HOMEWORK: This is a good place to pause and practice. Your homework assignment is to create your own logistic regression method and classify two different sign language digits.

<a id="10"></a> <br>
# Artificial Neural Network (ANN)

An artificial neural network is also called a **deep neural network**, and working with it is called **deep learning**.

**What is a neural network ?**
It basically involves taking logistic regression and repeating it at least 2 times.

* In logistic regression, there are input and output layers. However, in a neural network, there is at least one hidden layer between the input and output layers.

**What is deep, in order to say "deep", and how many layers do I need to have?**
"'Deep' is a relative term, it of course refers to the 'depth' of a network, meaning how many hidden layers it has. 'How deep is your swimming pool?' could be 12 feet or it might be 2 feet; nevertheless, it still has a depth--it has the quality of 'deepness'. 32 years ago, people used two or three hidden layers. That was the limit for the specialized hardware of the day. Just a few years ago, 20 layers was considered pretty deep. In October, Andrew Ng mentioned 152 layers was (one of) the biggest commercial networks he knew of. Last week, I talked to someone at a big, famous company who said he was using 'thousands'. So I prefer to just stick with 'How deep?'"
**Why is it called hidden ?**
Because the hidden layer does not see the inputs (training set).
For example, if you have input, one hidden, and output layers, when someone asks you "hey, my friend, how many layers does your neural network have?" The answer is "I have a 2-layer neural network". Because **while computing the layer number, the input layer is ignored**.
Let's see a 2-layer neural network:
<a href="http://ibb.co/eF315x"><img src="http://preview.ibb.co/dajVyH/9.jpg" alt="9" border="0"></a>

Step by step, we will learn about this image:

* As you can see, there is one hidden layer between the input and output layers, and this hidden layer has 3 nodes. If you're curious why I chose the number of nodes as 3, the answer is there is no reason, I just chose it that way :). The number of nodes is a hyperparameter like the learning rate. Therefore, we will discuss hyperparameters at the end of artificial neural networks.

* Input and output layers do not change. They are the same as logistic regression.

* In the image, there is a $tanh$ function, it is an activation function like the sigmoid function. **The $tanh$ activation function is better than sigmoid for hidden units because the mean of its output is closer to zero, so it centers the data better for the next layer**. Also, the $tanh$ activation function increases non-linearity, which helps our model learn better.

* As you can see, with the purple color, there are two parts. Both parts are like logistic regression. The only difference is the activation function, inputs, and outputs.
    * **In logistic regression: input => output**
    * **In a 2-layer neural network: input => hidden layer => output**. You can think of the hidden layer as the output of part 1 and the input of part 2.
    
* That's all. We will follow the same path as logistic regression for the 2-layer neural network.
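The "mean closer to zero" remark about $tanh$ can be checked with a few lines. A sketch on inputs spread symmetrically around zero; the range and the variable names are chosen only for illustration.

```python
import numpy as np

z_sym = np.linspace(-3, 3, 601)           # inputs spread symmetrically around zero
tanh_out = np.tanh(z_sym)
sigmoid_out = 1 / (1 + np.exp(-z_sym))

print(abs(round(float(tanh_out.mean()), 3)))   # 0.0: tanh outputs are centered at zero
print(round(float(sigmoid_out.mean()), 3))     # 0.5: sigmoid outputs are centered at 0.5
```

So a hidden layer with $tanh$ hands roughly zero-centered values to the next layer, while sigmoid outputs are always positive.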

<a id="11"></a> <br>
## 2-Layer Neural Network

* 1/ Size of layers and initializing parameters weights and bias
* 2/ Forward propagation
* 3/ Loss function and Cost function
* 4/ Backward propagation
* 5/ Update Parameters
* 6/ Prediction with learnt parameters weight and bias
* 7/ Create Model

<a id="12"></a> <br>
## 2-1) Size of layers and initializing parameters weights and bias

* For each sample $x^{(i)}$ of `X_train` ($i = 1, \dots, 349$; 1 image per column):
$$z^{[1] (i)} =  W^{[1]} x^{(i)} + b^{[1]}$$ 
$$a^{[1] (i)} = \tanh(z^{[1] (i)})$$
$$z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2]}$$
$$\hat{y}^{(i)} = a^{[2] (i)} = \sigma(z^{ [2] (i)})$$

* In logistic regression, we initialized weights to 0.01 and bias to 0. However, now we initialize weights randomly, because if we initialize the parameters to zero, each neuron in the first hidden layer will perform the same computation. Therefore, even after multiple iterations of gradient descent, each neuron in the layer will compute the same thing as the other neurons. Thus, we initialize them randomly. Additionally, initial weights should be small. If they are very large initially, this will cause the inputs of the tanh function to be very large, resulting in gradients close to zero (vanishing gradient). Consequently, the optimization algorithm will be slow.

* Bias can be initialized to zero initially.
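Both points, identical neurons and saturated $tanh$, can be seen on a tiny example. This is a sketch under assumptions: the toy data `X_small`, `y_small`, the sizes, and the helper `hidden_weight_gradient` are invented for illustration, and biases are left at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X_small = rng.normal(size=(4, 6))                       # 4 features, 6 samples (toy sizes)
y_small = rng.integers(0, 2, size=(1, 6)).astype(float)

def hidden_weight_gradient(W1, W2):
    """Gradient of the cross-entropy cost w.r.t. W1 (tanh hidden layer, sigmoid output)."""
    A1 = np.tanh(W1 @ X_small)
    A2 = 1 / (1 + np.exp(-(W2 @ A1)))
    dZ2 = A2 - y_small
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    return dZ1 @ X_small.T / X_small.shape[1]

# Constant initialization: every hidden unit receives exactly the same gradient
dW1_const = hidden_weight_gradient(np.full((3, 4), 0.01), np.full((1, 3), 0.01))
print(np.allclose(dW1_const[0], dW1_const[1]))   # True: the units stay identical

# Small random initialization: gradients differ, so units can learn different things
dW1_rand = hidden_weight_gradient(rng.normal(size=(3, 4)) * 0.1,
                                  rng.normal(size=(1, 3)) * 0.1)
print(np.allclose(dW1_rand[0], dW1_rand[1]))     # False

# Large inputs saturate tanh: its derivative 1 - tanh(z)^2 is almost zero
print(1 - np.tanh(0.1) ** 2, 1 - np.tanh(10.0) ** 2)
```

The last line shows why large initial weights slow learning down: at $z = 10$ the $tanh$ derivative is practically zero, so almost no gradient flows back.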







In [None]:
# Initialization of the parameters and layer sizes
layer_size = 3
def initialize_paramters_and_layer_sizes_NN(X_train, y_train) :
    parameters = {"weight1" : np.random.randn(layer_size, X_train.shape[0])* 0.1, #shape(3,4096)
               "bias1" : np.zeros((layer_size,1)),
               "weight2" : np.random.randn(y_train.shape[0],layer_size)*0.1,
               "bias2" : np.zeros((y_train.shape[0], 1))}
    return parameters

In [None]:
print(X_train.shape)
print(y_train.shape)

<a id="13"></a> <br>
## 2-2) Forward propagation
* Forward propagation is almost the same as logistic regression.
* The only difference is that we use the $tanh$ function and repeat the entire process twice.
* Also, `NumPy` has a tanh function, so we do not need to implement it.

In [None]:
def forward_propagation_NN(X_train, parameters) :
    # Z1 : shape(layer_size, X_train.shape[0]) @ shape(X_train.shape[0], X_train.shape[1]) ---> shape(layer_size,X_train.shape[1])
    Z1 = parameters["weight1"]@X_train + parameters["bias1"]
    # A1 : shape(layer_size,X_train.shape[1])
    A1 = np.tanh(Z1)
    # Z2 : shape(y_train.shape[0],layer_size)@shape(layer_size,X_train.shape[1]) ---- > shape(y_train.shape[0],X_train.shape[1])
    Z2 = parameters["weight2"]@A1 + parameters['bias2']
    # A2 : shape(y_train.shape[0],X_train.shape[1])
    A2 = sigmoid(Z2)
    
    cache = {"Z1": Z1, "A1":A1, "Z2":Z2, "A2":A2}
    
    return A2, cache

<a id="14"></a> <br>
## 2-3) Loss function and Cost function
* Loss and cost functions are same as with logistic regression
* Cross entropy function
<a href="https://imgbb.com/"><img src="https://image.ibb.co/nyR9LU/as.jpg" alt="as" border="0"></a><br />
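Written out in the notation used for the 2-layer network, and assuming the same log loss as in logistic regression, the cost over $m$ samples is (with $a^{[2](i)}$ the network output for sample $i$ and $J$ the cost):

$$J = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log a^{[2](i)} + (1-y^{(i)})\log\left(1-a^{[2](i)}\right)\right]$$

The first term is active for images labeled 1 and the second for images labeled 0, so both kinds of mistakes are penalized.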

In [None]:
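# Compute the cost
def compute_cost_NN(A2, y, parameters):
    """A2 : predicted probabilities (output of the sigmoid)
    y : ground truths
    parameters : the weights and biases (not used here)"""
    
    # binary cross-entropy, same as in logistic regression
    logprobs = y * np.log(A2) + (1 - y) * np.log(1 - A2)
    cost = -np.sum(logprobs)/y.shape[1]  # average
    return cost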

<a id="15"></a> <br>
## 2-4) Backward propagation
* As you know, backward propagation means computing derivatives.

* The logic is the same, let's write code.
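For reference, these are the derivatives that the chain rule gives for a $tanh$ hidden layer and a sigmoid output with the cross-entropy cost. This is a sketch in matrix form, where $m$ is the number of samples, $\odot$ is the element-wise product, and the sum runs over the sample (column) axis:

$$dZ^{[2]} = A^{[2]} - Y$$
$$dW^{[2]} = \frac{1}{m}\, dZ^{[2]} A^{[1]T} \qquad db^{[2]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[2](i)}$$
$$dZ^{[1]} = \left(W^{[2]T} dZ^{[2]}\right) \odot \left(1 - A^{[1]} \odot A^{[1]}\right)$$
$$dW^{[1]} = \frac{1}{m}\, dZ^{[1]} X^{T} \qquad db^{[1]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[1](i)}$$

The factor $1 - A^{[1]} \odot A^{[1]}$ is the derivative of $tanh$, since $\tanh'(z) = 1 - \tanh^2(z)$.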

In [None]:
# backward propagation
def backward_propagation_NN(parameters, cache, X, y) :
    """parameters : weights and biases
    cache : dictionary of the inputs and outputs at each step
    X : X_train
    y : ground truths"""
    # shape of dX is same as the shape of X :
    # shape of dZ2 : (y_train.shape[0],X_train.shape[1])
    dZ2 = cache["A2"] - y
    # dW2 : (y_train.shape[0],layer_size)
    dW2 = (dZ2 @ cache["A1"].T)/X.shape[1]
    # db2 : (y_train.shape[0], 1)
    db2 = np.sum(dZ2,axis =1,keepdims=True)/X.shape[1]
    # dZ1 : (layer_size, y_train.shape[0])@(y_train.shape[0],X_train.shape[1]) --------> (layer_size,X_train.shape[1])
    dZ1 = (parameters["weight2"].T @ dZ2)*(1 - np.power(cache["A1"], 2))
    # dW1 : (layer_size, X_train.shape[0])
    dW1 = (dZ1@X.T)/X.shape[1]
    # db1 : (layer_size, 1)
    db1 = np.sum(dZ1,axis =1,keepdims=True)/X.shape[1]
    
    grads = {"dweight1": dW1,
             "dbias1": db1,
             "dweight2": dW2,
             "dbias2": db2}
    return grads

<a id="16"></a> <br>
## 2-5) Updating the Parameters 
* Updating the parameters is also the same as in logistic regression.
* We actually did a lot of the work already with logistic regression.

In [None]:
# update the parameters
def update_parameters_NN(parameters, grads, learning_rate =1e-2) :
    parameters = {"weight1" : parameters['weight1'] - learning_rate*grads["dweight1"],
                 "bias1" : parameters['bias1'] - learning_rate*grads["dbias1"],
                 "weight2" : parameters['weight2'] - learning_rate*grads["dweight2"],
                 "bias2" : parameters['bias2'] - learning_rate*grads["dbias2"]}
    return parameters
    

<a id="17"></a> <br>
## 2-6) Prediction with the learned parameters weight and bias
* Let's write a predict function like we did with logistic regression.
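
The thresholding step can also be written without a loop. The sketch below uses a hypothetical helper, `threshold_predictions`, and a made-up array of probabilities; it is one possible vectorized form, not the tutorial's own function.

```python
import numpy as np

def threshold_predictions(A, threshold=0.5):
    """Hypothetical helper: map probabilities of shape (1, m) to 0/1 labels."""
    # values strictly above the threshold become 1, all others become 0
    return (A > threshold).astype(float)

A2_example = np.array([[0.2, 0.5, 0.7, 0.99]])
y_pred_example = threshold_predictions(A2_example)
print(y_pred_example)  # [[0. 0. 1. 1.]]
```

A probability of exactly 0.5 maps to 0 here, which matches the `<= 0.5` rule of the loop version.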

In [None]:
# Prediction
def predict_NN(parameters, X_test):
    # Forward propagation
    # X_test is the input for forward propagation
    A2, cache = forward_propagation_NN(X_test, parameters)
    y_pred = np.zeros((1, X_test.shape[1]))
    # if A2 is bigger than 0.5, our prediction is sign one (y_pred=1),
    # if A2 is smaller than or equal to 0.5, our prediction is sign zero (y_pred=0),
    for i in range(A2.shape[1]) :
        if A2[0,i] <= 0.5 :
            y_pred[0,i] = 0
        else :
            y_pred[0,i] = 1
    return y_pred

<a id="18"></a> <br>
## 2-7) Create Model
* Let's put all the code together.

In [None]:
# 2-layer neural network
def two_layer_NN(X_train, y_train, X_test, y_test, num_iterations) :
    cost_list = []
    index_list = []
    
    # Parameters initialization and Layer Sizes
    parameters = initialize_paramters_and_layer_sizes_NN(X_train, y_train)
    
    for i in range(num_iterations) :
        # Forward propagation
        A2, cache = forward_propagation_NN(X_train, parameters)
        
        # Compute the cost
        cost = compute_cost_NN(A2, y_train, parameters)
        
        # backward propagation
        grads = backward_propagation_NN(parameters, cache, X_train, y_train)
        
        # Update the parameters
        parameters = update_parameters_NN(parameters, grads)
        
        if i%100 == 0:
            cost_list.append(cost)  # list of costs after each 100 iterations
            index_list.append(i)    # list of indexes
            print("Cost after iteration %i : %f" %(i, cost))  # %i for integer, %f for float
    
    # Plotting the cost 
    plt.plot(index_list, cost_list)
    plt.xticks(index_list,rotation='vertical')
    plt.xlabel("Number of Iterations")
    plt.ylabel("Cost")
    plt.show()
    
    # Prediction
    y_pred_test = predict_NN(parameters, X_test)
    y_pred_train = predict_NN(parameters, X_train)
    
    # Print train/test accuracies
    print("train accuracy: {} %".format(100 - np.mean(np.abs(y_pred_train - y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(y_pred_test - y_test)) * 100))
    return parameters
    
parameters = two_layer_NN(X_train, y_train, X_test, y_test, num_iterations = 2600)

<font color='purple'>
Up to this point, we have created a 2-layer neural network and learned how to implement:
* The size of layers and initializing the weights and bias parameters
* Forward propagation
* Loss function and Cost function
* Backward propagation
* Update Parameters
* Prediction with learned parameters weight and bias
* Create Model

<br> Now let's learn how to implement an L-layer neural network with Keras.

<a id="19"></a> <br>
# L Layer Neural Network

* **What happens if the number of hidden layers increases? Earlier layers can detect simple features.**

* When the model composes simple features together in the later layers of the neural network, it can learn more and more complex functions. For example, let's look at our sign one.
<a href="http://ibb.co/dNgDJH"><img src="http://preview.ibb.co/mpD4Qx/10.jpg" alt="10" border="0"></a>
* The first hidden layer learns edges or basic shapes like lines. When the number of layers increases, layers start to learn more complex things like convex shapes or characteristic features like the forefinger.
* Let's create our model:
    * There are some hyperparameters we need to choose like learning rate, number of iterations, number of hidden layers, number of hidden units, and the type of activation functions.
    
    * These hyperparameters can be chosen intuitively once you have spent a lot of time in the deep learning world. If you have not, a reasonable starting point is to search for values that others have used on similar problems. Either way, you still need to try several values to find the best ones.
    
    * In this tutorial, our model will have **2 hidden layers with 8 and 4 nodes, respectively**, because training takes too much time as the number of hidden layers and nodes increases.
    
    * As activation functions, we will use **ReLU (Rectified Linear Unit) for the first hidden layer, ReLU for the second hidden layer, and sigmoid for the output layer**.
     
    * The **number of iterations will be 100**.
    
* Our approach is the same as in the previous parts. However, now that we have learned the logic behind deep learning, we can make our job easier and use `Keras` for deeper neural networks.

* First, let's reshape our `X_train`, `X_test`, `y_train`, and `y_test`.
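
A minimal sketch of that reshape, assuming the arrays are stored as (features, samples) as in the 2-layer network above, while Keras and scikit-learn expect (samples, features). The toy arrays and the `_demo` names are stand-ins, not the tutorial's data.

```python
import numpy as np

# Stand-in arrays in the (features, samples) layout: 4096 pixels, 10 images
X_demo = np.zeros((4096, 10))
y_demo = np.zeros((1, 10))

# Transpose so that each row is one sample
X_demo_T = X_demo.T  # shape (10, 4096)
y_demo_T = y_demo.T  # shape (10, 1)

print(X_demo_T.shape, y_demo_T.shape)
```

After the transpose, `shape[1]` of the input array is the number of pixels, which is the value that `input_dim` expects.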

<a id="22"></a> <br>
## Implementing with keras library
Let's look at some parameters of the Keras library:
* **units**: number of nodes in the layer, that is, its output dimension
* **kernel_initializer**: how the weights are initialized
* **activation**: activation function; we use ReLU for the hidden layers and sigmoid for the output layer
* **input_dim**: input dimension, that is, the number of pixels in our images (4096 px)
* **optimizer**: we use adam optimizer
    * Adam is one of the most effective optimization algorithms for training neural networks.
    * Some advantages of Adam are its relatively low memory requirements and that it usually works well even with little tuning of hyperparameters.
* **loss**: The cost function is the same. By the way, its name is the **cross-entropy cost function**, which we used in the previous parts.
$$J = - \frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[2] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[2] (i)}\right) \right) \tag{6}$$
* **metrics**: it's accuracy.
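* **cross_val_score**: we use cross-validation. If you are not familiar with it, see this machine learning tutorial (https://www.kaggle.com/kanncaa1/machine-learning-tutorial-for-beginners)
* **epochs**: number of passes over the training data, that is, the number of iterations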
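
To see what these parameters correspond to, here is a rough NumPy sketch of the forward pass of a network with 8, 4, and 1 units. The random input batch, the weight range, and all variable names are assumptions for illustration; Keras handles the initialization and training internally.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_features = 4096                      # input_dim: pixels per image
X_batch = rng.random((5, n_features))  # 5 made-up samples, one per row

# Small random weights, loosely mimicking a 'uniform' initializer (assumed range)
W1, b1 = rng.uniform(-0.05, 0.05, (n_features, 8)), np.zeros(8)
W2, b2 = rng.uniform(-0.05, 0.05, (8, 4)), np.zeros(4)
W3, b3 = rng.uniform(-0.05, 0.05, (4, 1)), np.zeros(1)

h1 = relu(X_batch @ W1 + b1)    # units = 8, activation = 'relu'
h2 = relu(h1 @ W2 + b2)         # units = 4, activation = 'relu'
y_prob = sigmoid(h2 @ W3 + b3)  # units = 1, activation = 'sigmoid'

print(h1.shape, h2.shape, y_prob.shape)
```

Each `units` value sets the number of columns of that layer's weight matrix, and the sigmoid keeps the final output between 0 and 1 so it can be read as a probability.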

In [None]:
!pip install scikeras

In [None]:
# Evaluating the ANN

from scikeras.wrappers import KerasClassifier  # scikit-learn wrapper from SciKeras
from sklearn.model_selection import cross_val_score
from tensorflow.keras.models import Sequential  # import Keras through TensorFlow consistently
from tensorflow.keras.layers import Dense



def build_classifier():
    classifier = Sequential() # initialize neural network
    # 1st layer : 8 neurons, Relu activation
    classifier.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu', input_dim = X_train.shape[1]))
    # 2nd layer : 4 neurons, Relu activation
    classifier.add(Dense(units = 4, kernel_initializer = 'uniform', activation = 'relu'))
    # 3rd layer : 1 neuron, sigmoid activation
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    return classifier

classifier = KerasClassifier(model = build_classifier, epochs = 100)  # SciKeras uses `model`; `build_fn` is deprecated
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 3)
mean = accuracies.mean()
std = accuracies.std()  # standard deviation of the fold accuracies, not the variance
print("Accuracy mean: "+ str(mean))
print("Accuracy standard deviation: "+ str(std))