In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Neural Network - Part II

# A few things to remember while training a Neural Network (Perceptron)

## Shuffling
We discussed before that a training set can be processed multiple times, with each complete pass through the set referred to as one epoch. However, it’s a good idea to shuffle the samples in a training set after each epoch, so that the network isn’t negatively affected by the order in which the samples are presented. 

It may be a good idea to implement this feature in the neural-net software, or manually duplicate the training set in the spreadsheet and then randomly reorder the samples in the duplicated sets.
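
As an illustration of per-epoch shuffling, here is a minimal sketch using NumPy. The helper name `shuffled_epochs` and the toy data are assumptions made for this example; the key point is that inputs and expected outputs are reordered with the same random permutation so each sample stays paired with its label.

```python
import numpy as np

def shuffled_epochs(inputs, targets, n_epochs, seed=0):
    """Yield (inputs, targets) in a fresh random order for each epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        # One permutation of the sample indices, applied to both arrays
        order = rng.permutation(len(inputs))
        yield inputs[order], targets[order]

# Hypothetical toy training set: four samples with two features each
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

for epoch, (X_epoch, y_epoch) in enumerate(shuffled_epochs(X, y, n_epochs=3)):
    print("epoch", epoch, "label order:", y_epoch)
```

Fixing the `seed` makes the shuffling reproducible, which can help when comparing training runs.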

## Overtraining

We have seen in other modeling that **Overtraining** (i.e., making the model learn **too close** to the training data) is **NOT** a good idea: the model performs very well on Training data but does not predict well when exposed to more generic Non-Training data.

The following diagram illustrates the concept.

![image.png](attachment:image.png)

The red dots and blue dots represent training samples that the neural network is classifying. 

The black line represents a good classification strategy: it follows the general pattern that separates red from blue, and consequently it will probably produce the lowest error on real data. 

The green line is an overtrained classification strategy. It follows the training data too well; in its attempt to perfectly classify the training samples, it has created an input–output relationship that is less generalized and therefore less appropriate for real-life data.



Other examples of different training strategies

![image.png](attachment:image.png)


![image.png](attachment:image.png)

![image.png](attachment:image.png)

## In conclusion: Undertraining the Neural Net is definitely NOT GOOD, but Overtraining is also problematic, because in that case the Neural Net does not perform well enough with real-life data

# Training: Theory and Practice

At first glance, neural-network training seems fairly straightforward. 

When we’re working with a simple network (such as the single-layer Perceptron shown below), the math required for training is certainly not overwhelming, the network itself can be implemented in a relatively short program written in common languages such as Python, and the training process doesn’t require excessive amounts of computational time.

Also, the general concept is fairly straightforward: 

* We apply an input, produce an output, 
* Compare the produced output to the expected output, 
* And feed that information back into the network in a way that allows the weights to gradually converge on values that are appropriate for the task at hand.
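
The three steps above can be sketched as a single training step. This is a minimal sketch, assuming a unit-step activation and the simple rule "weight change = learning rate × error × input"; the function names and the sample values are hypothetical.

```python
import numpy as np

def unit_step(z):
    """Unit-step activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0

def train_step(weights, x, expected, learning_rate=0.1):
    output = unit_step(np.dot(weights, x))              # 1. apply input, produce output
    error = expected - output                           # 2. compare with expected output
    new_weights = weights + learning_rate * error * x   # 3. feed the error back into the weights
    return new_weights, error

weights = np.zeros(3)
x = np.array([1.0, 0.5, -1.0])   # hypothetical sample; the leading 1.0 acts as a bias input
expected = 0

for step in range(5):
    weights, error = train_step(weights, x, expected)
    print(step, error, weights)
```

Once the error reaches zero for this sample, the weights stop changing, which is the small-scale version of the convergence described above.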

### However, there is more to the Training Process of a Neural Network

# Perceptron as Universal Approximator

A neural network can perform classification because it automatically finds and implements (via training) a mathematical relationship between input data and output values. 

In mathematical terminology, we use the word “function” to identify an input–output relationship, and we often express functions symbolically as f(x), e.g., f(x) = sin(x). Thus, x represents the input data, and f(x) is set equal to the procedure that we use when we want the function to operate on the input and produce an output.

# Multi-Layer Perceptron

A Perceptron that includes an additional layer of nodes (i.e., more than just the input and output layer) is called a **Multilayer Perceptron, or MLP.** 

These additional nodes constitute a hidden layer, because they aren’t directly “visible” from the input side or the output side. 

The diagram below shows a **Multi Layer Perceptron** architecture

![image.png](attachment:image.png)

A multilayer Perceptron is sometimes called a **Universal Approximator** because it allows a high degree of flexibility in modeling the relationship between input and output that the network’s mathematical operations create.

When the network first starts training, random values have been assigned to the weights, and consequently the network’s function, **fNN(x)**, is not at all consistent with the **real relationship, fREAL(x)**, between input and output. 

During training, the network generates beneficial adjustments in weight values by looking at error information fed back from the output, and **fNN(x)** gradually becomes more and more consistent with **fREAL(x)**.

**A thing to note:** the symbol x doesn’t represent just one variable, it is a generic representation of the input. In reality, x could be a 50-element vector.  

# The Error Bowl

Let’s assume that we’re working with a neural network that has two weighted connections leading to a single output node. 

If we’re training, we know the correct output value, and consequently we can calculate the error produced by this output node.

We can visualize the error using a three-dimensional plot: the two weights correspond to the x-axis and the y-axis, and the error corresponds to the z-axis.

Each combination of the two weights and the resulting output error is a point in three-dimensional space. 

As the weights are modified, the x and y components of the point change, and the z component will change as well if the weight modification produces a change in error. As the weights improve, the error decreases toward zero, and this is represented by the three-dimensional error bowl shown below.

![image.png](attachment:image.png)

The training procedure can be seen as a quest for the bottom of that bowl, **where z = 0**, because if z = 0 the output produced by the node is equal to the expected output.

As the weights are gradually adjusted during training, the point defined by the two weights and the error moves along the surface of that bowl.

Remember the concept of **Maxima and Minima, both Local and Global (from your Math class)**. Only one set of values of X and Y produces the Minimum value of Z. All other sets of X, Y values produce a Z-value higher than the **Minimum Z-value** (assuming the function has a single Minimum; there are functions that have **Multiple Maxima/Minima**).


This method of stepping toward the minimum error is called **Gradient Descent**. It is used in many other areas, for example to find the **Slope** of the **Best Fit** line in **Linear Regression**.

In summary, the training forces the network to modify its weights in a way that results in minimization of the error function and that causes the mathematical operations of the overall network to approximate the **best** mathematical relationship between input and output.
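
To make the "descent into the bowl" concrete, here is a minimal sketch of gradient descent on a hypothetical bowl-shaped error function of two weights. The error function, the location of its minimum (`BEST`), the starting point, and the step size are all assumptions chosen for illustration; a real network's error surface is defined by its training data.

```python
import numpy as np

BEST = np.array([2.0, -1.0])   # hypothetical weights at the bottom of the bowl

def error(w):
    """Bowl-shaped error: zero at BEST and growing in every direction."""
    return np.sum((w - BEST) ** 2)

def gradient(w):
    """Partial derivatives of the error with respect to each weight."""
    return 2.0 * (w - BEST)

w = np.array([-3.0, 4.0])      # arbitrary starting weights
learning_rate = 0.1

for step in range(50):
    w = w - learning_rate * gradient(w)   # step downhill, against the gradient

print("final weights:", w)
print("final error:", error(w))
```

Each step moves the point (w[0], w[1], error) a little further down the surface of the bowl.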

# Learning Rate

**Learning Rate** influences the rate at which the neural network learns.

In the context of neural networks, **Learning** is more or less equivalent in meaning to **Training** but there are slight differences between the two concepts.

A Data Scientist **Trains** a neural network by providing training data and performing a training procedure. While this is happening, the network is **Learning** or more specifically, it is learning to approximate the input–output relationship contained in the training data. 

The manifestation of **Learning** is the **Weight Modification**, and learning rate affects the way in which weights are modified.
# The Effect of Learning Rate

Learning rate influences the size of the jumps that lead to the bottom of the bowl. I’m going to switch to a two-dimensional representation now because the images will be easier to create and easier to interpret. Here is our two-dimensional error function:

![image.png](attachment:image.png)

In a previous section (and in the basic example code for training a Perceptron) we saw the formula used to update a weight with the **Learning Rate**:

                      W(new) = W(old) + (α * δ * Input)
               
where α is the learning rate and δ is the difference between expected output and calculated output (i.e., the error). 

**Every time we apply this learning rule, the weight jumps to a new point on the error curve**. 

If δ is large, those jumps could also be quite large, and the network may not train effectively because the weights are not gradually converging toward minimum error. Instead, they’re bouncing around somewhat chaotically, as shown below.

![image.png](attachment:image.png)

Since δ is multiplied by learning rate before the modification is applied to the weight, we can reduce the size of the jumps by choosing α < 1. The goal is to use learning rate to promote moderately fast, consistent convergence.

The sort of training that we want might look something like this:

![image.png](attachment:image.png)
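
The contrast between bouncing and steady convergence can be reproduced with a small experiment. This is a minimal sketch, assuming a hypothetical one-weight error curve E(w) = w**2 and plain gradient descent rather than the Perceptron rule itself; the two learning rates are picked only to show the two behaviors.

```python
def descend(learning_rate, start=1.0, steps=8):
    """Follow gradient descent on E(w) = w**2 and record every weight visited."""
    w = start
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w   # the gradient of w**2 is 2*w
        path.append(w)
    return path

bouncing = descend(learning_rate=1.05)   # too large: jumps across the minimum and grows
smooth = descend(learning_rate=0.3)      # moderate: shrinks steadily toward zero

print("bouncing:", [round(w, 3) for w in bouncing])
print("smooth:  ", [round(w, 3) for w in smooth])
```

With the larger rate the weight changes sign on every step and moves away from the minimum, while the smaller rate closes in on it.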

# How to Choose a Learning Rate

There’s no universal rule that tells you how to choose a learning rate, and there’s not even a neat and tidy way to identify the optimal learning rate for a given application. 

Training is a complex and variable process, and when it comes to learning rate, we need to rely on intuition and experimentation.

If the network can process the training data quickly, we can simply choose a few different learning rates and compare the resulting weights (if we know what the weights should be) or input fresh data and assess the relationship between learning rate and classification accuracy.

A more involved approach, and one that would be more practical for networks that require long training times, is to analyze the changes in error as the network is training. The error should be decreasing toward the minimum, and the changes in error should be small enough to avoid the “bouncing” behavior shown above but not so small that the network learns extremely slowly. 

So the key to selecting a **Learning Rate** is the balance between **Too much bouncing around** and **Speed of the Learning Process**.

# Learning Rate Schedule

To understand the **Learning Technique** of a Neural Network, it is important to recognize that the **Learning Rate** need not be constant throughout the entire training procedure. 

**Learning Rate** is applied every time the weights are updated via the learning rule; thus, if learning rate changes during training, the network’s evolutionary path toward its final form will immediately be altered. 

One way to take advantage of this is to **decrease the learning rate during training**. This is called **Annealing** the learning rate. There are various ways to do this, but for now, the important thing is to recognize why it helps.

When the network first starts training, the **error is probably going to be large**. A higher learning rate helps the network to take long strides toward minimum error. 

As the network approaches the bottom of the error curve, though, these long strides can impede convergence, similar to how a person taking long strides might find it difficult to land directly in the middle of a small circle painted on the floor. 

**As learning rate decreases, long strides become smaller steps, and eventually the network is tiptoeing toward the center of the circle**.
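
One possible way to anneal the learning rate is exponential decay, sketched below. This is only one of the various schedules mentioned above; the function name, the initial rate of 0.5, and the decay factor of 0.9 are assumptions for illustration.

```python
def annealed_learning_rate(initial_rate, decay, epoch):
    """Exponential decay: the rate shrinks by the factor `decay` every epoch."""
    return initial_rate * (decay ** epoch)

for epoch in range(0, 50, 10):
    rate = annealed_learning_rate(initial_rate=0.5, decay=0.9, epoch=epoch)
    print("epoch", epoch, "learning rate", round(rate, 4))
```

Early epochs take long strides, and later epochs take progressively smaller steps.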

# Going Deeper in Creating a Realistic Neural Network for Classification

Thus far we have focused on the single-layer Perceptron, which consists of an input layer and an output layer. As you might recall, we use the term “single-layer” because this configuration includes only one layer of computationally active nodes—i.e., nodes that modify data by summing and then applying the activation function. The nodes in the input layer just distribute data.

The single-layer Perceptron is conceptually simple, and the training procedure is quite straightforward and even **somewhat underwhelming**. 

Unfortunately, it doesn’t offer the functionality that we need for complex, real-life applications. An easy way to explain the fundamental limitation of the single-layer Perceptron is by using **Boolean Operations** as illustrative examples.

# A Neural Network Logic Gate

The general shape of the simple Single Layer Perceptron is similar to a **Logic Gate**. 

If we train this network with samples consisting of zeros and ones for the elements of the input vector and an output value that equals one only if both inputs equal one, the result will be a neural network that classifies an input vector in a way that is **analogous to the behavior of an AND gate**.

The dimensionality of this network’s input is 2, so we can easily plot the input samples in a two-dimensional graph. Let’s say that input0 corresponds to the horizontal axis and input1 corresponds to the vertical axis. The four possible input combinations will be arranged as follows:

![image.png](attachment:image.png)


Since we’re replicating the AND operation, the network needs to modify its weights such that 

* the output is one for input vector [1,1] and 
* zero for the other three input vectors. 

Based on this information, let’s divide the input space into sections corresponding to the desired output classifications:

![image.png](attachment:image.png)

# Linearly Separable Data

In this plot of the data set (using an AND Logic Gate) the plotted input vectors can be **Classified** by drawing a straight line. 

Everything on one side of the line receives an output value of one, and everything on the other side receives an output value of zero. 

Thus, in the case of an AND operation, the data that are presented to the network are linearly separable. This would also be the case with an OR operation.

This represents a rather **Simplistic** Classification problem. In real life, however, we need to deal with problems that are not simplistic, where the data set is **NOT Linearly Separable**, as shown in the figure in the next section.
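
To see a single-layer Perceptron learn linearly separable data, here is a minimal sketch that trains one on the AND and OR truth tables. It assumes a unit-step activation, a bias handled as an extra constant input, and the weight-update rule W(new) = W(old) + (α * δ * Input); the function names, the learning rate of 0.5, and the epoch count are choices made for this example.

```python
import numpy as np

def add_bias(inputs):
    """Prepend a constant 1 to every sample so that weights[0] acts as the bias."""
    return np.hstack([np.ones((len(inputs), 1)), inputs])

def train_perceptron(inputs, targets, learning_rate=0.5, epochs=50):
    """Train a single-layer Perceptron with a unit-step activation."""
    samples = add_bias(inputs)
    weights = np.zeros(samples.shape[1])
    for _ in range(epochs):
        for x, expected in zip(samples, targets):
            output = 1 if np.dot(weights, x) >= 0 else 0
            # W(new) = W(old) + (alpha * delta * Input)
            weights += learning_rate * (expected - output) * x
    return weights

def predict(weights, inputs):
    return (add_bias(inputs) @ weights >= 0).astype(int)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
and_targets = np.array([0, 0, 0, 1])
or_targets = np.array([0, 1, 1, 1])

and_weights = train_perceptron(X, and_targets)
or_weights = train_perceptron(X, or_targets)

print("AND predictions:", predict(and_weights, X))
print("OR predictions: ", predict(or_weights, X))
```

The learned weights define the straight line that splits the input plane: everything on one side is classified as one, everything on the other side as zero.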

# Solving Problems that are NOT Linearly Separable

A single-layer Perceptron can solve a problem only if the data are linearly separable, regardless of the dimensionality of the input samples. 

To generalize the concept of linear separability beyond two dimensions, we have to use the word **Hyperplane** instead of “line.” A **Hyperplane** is a geometric feature that can separate data in **n-dimensional space**. 

* In a **two-dimensional environment**, a hyperplane is a one-dimensional feature (i.e., **a line**). 

* In a **three-dimensional environment**, a hyperplane is an ordinary **two-dimensional plane**. 

* In an **n-dimensional environment**, a hyperplane has **(n-1) dimensions**

During the training procedure, a single-layer Perceptron is using the training samples to figure out where the classification hyperplane should be. After it finds the hyperplane that reliably separates the data into the correct classification categories, it is ready to Classify.

Let’s look at an example of an input-to-output relationship that is not linearly separable:

![image.png](attachment:image.png)

Take a closer look and you’ll see that it is the XOR operation. 

XOR data cannot be separated with a straight line. Thus, a single-layer Perceptron cannot implement the functionality provided by an XOR gate, and if it can’t perform the XOR operation, it will not work for other more sophisticated problems.

Fortunately, we can vastly increase the problem-solving power of a neural network simply by adding one additional layer of nodes. 

This turns the **Single-layer Perceptron** into a **Multi-layer Perceptron (MLP)**. As mentioned before this layer is called **Hidden Layer** because it has no direct interface with the outside world.
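
To illustrate why one hidden layer is enough for XOR, here is a minimal sketch of a two-layer network whose weights are picked by hand rather than learned. The specific weights and biases are assumptions for this example: one hidden node behaves like OR, the other like NAND, and the output node combines them like AND.

```python
import numpy as np

def step(z):
    """Unit-step activation applied element-wise."""
    return np.where(z >= 0, 1, 0)

# Hidden layer (hand-picked, not trained): row 0 acts like OR, row 1 like NAND
W_hidden = np.array([[1.0, 1.0],
                     [-1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Output node: fires only when both hidden nodes fire (AND)
w_out = np.array([1.0, 1.0])
b_out = -1.5

def mlp_xor(x):
    hidden = step(W_hidden @ x + b_hidden)
    return int(step(w_out @ hidden + b_out))

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", mlp_xor(np.array(x, dtype=float)))
```

Each hidden node still draws a single straight line; the output node combines the two lines into a region that no single line could describe.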

## We will explore more sophisticated functions, such as the Sigmoid Function, as the Activation Function in the next steps

# Gradient Descent with Summed Squared Error

**Summed Squared Error** is our error function, and updating weights via **Gradient Descent requires that we find the partial derivative of the error function with respect to the weight that we want to update**. Performing this differentiation reveals that the error gradient with respect to a weight is given by an expression that includes the **derivative of the activation function**.

The unit step Activation Function is simple but it becomes meaningless in the context of gradient descent because the unit step is **not differentiable—it’s not a continuous function**, and **the slope** at the point where the output transitions from zero to one **is infinity**.

If we intend to train a neural network using gradient descent, **we need a differentiable activation function**. Since the unit step is consistent with the on/off behavior of biological neurons and theoretically effective (though limited) within systems consisting of artificial neurons, **it makes sense to consider an activation function that is similar to the unit step but without the lack of differentiability**. We need look no further than the **logistic sigmoid function**.
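
To show where the derivative of the activation function enters the gradient, here is a minimal sketch for a single output node. It assumes the error is E = sum((t - o)**2) with no 1/2 factor, a standard logistic activation, and hypothetical data and weights; the analytic gradient from the chain rule is compared against a numerical estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sse(weights, X, t):
    """Summed squared error of one sigmoid node over all samples."""
    return np.sum((t - sigmoid(X @ weights)) ** 2)

def sse_gradient(weights, X, t):
    o = sigmoid(X @ weights)
    # Chain rule: dE/dw = sum over samples of -2 * (t - o) * o * (1 - o) * x,
    # where o * (1 - o) is the derivative of the logistic activation
    return X.T @ (-2.0 * (t - o) * o * (1.0 - o))

# Hypothetical data: the first column is a constant bias input
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
t = np.array([0.0, 0.0, 0.0, 1.0])
w = np.array([0.1, -0.2, 0.3])

# Central-difference estimate of each partial derivative
h = 1e-6
numeric = np.array([(sse(w + h * e, X, t) - sse(w - h * e, X, t)) / (2 * h)
                    for e in np.eye(3)])

print("analytic:", sse_gradient(w, X, t))
print("numeric: ", numeric)
```

With a unit step in place of the sigmoid, the activation derivative would be zero almost everywhere, so this gradient would carry no information for the weight update.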

# The Sigmoid Activation Function

The word “sigmoid” refers to something that is curved in two directions. There are various sigmoid functions, and we’re only interested in one. It’s called the logistic function, and the mathematical expression is fairly straightforward:

                                 f(x) = L / (1 + e**(-kx))

The constant **L determines the curve’s maximum value**, and the **constant k influences the steepness of the transition**. The plot below shows examples of the logistic function for different values of L, and the following plot shows curves for different values of k.

![image.png](attachment:image.png)

          Legend:  Blue --> L = 1.5,    Orange --> L = 1.0      Green --> L = 0.5

![image.png](attachment:image.png)

        Legend:  Blue --> k = 1.5,    Orange --> k = 1.0      Green --> k = 0.5
        
The logistic function is not the only activation function used in MLPs, but it is very common and has multiple benefits:

* As mentioned above, logistic activation is an excellent improvement upon the unit step because the general behavior is equivalent, but the smoothness in the transition region ensures that the function is continuous and therefore differentiable.

* The computational burden certainly exceeds that of the unit step, but it still seems fairly reasonable to me—just one exponential operation, one addition, and one division.

* We can easily fine-tune the input–output relationship by adjusting the L and k parameters. However, I believe that neural networks typically use the standard logistic function, i.e., with L = 1 and k = 1.
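
The logistic function and its two parameters can be written in a few lines. This is a minimal sketch using NumPy; the function name and the sample parameter values are chosen to mirror the plots above.

```python
import numpy as np

def logistic(x, L=1.0, k=1.0):
    """General logistic function: L sets the maximum value, k sets the steepness."""
    return L / (1.0 + np.exp(-k * x))

x = np.linspace(-6, 6, 5)

print("standard (L=1, k=1):", np.round(logistic(x), 3))
print("taller   (L=1.5):   ", np.round(logistic(x, L=1.5), 3))
print("steeper  (k=1.5):   ", np.round(logistic(x, k=1.5), 3))
```

At x = 0 the output is always L / 2, and for large positive x it approaches L.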

# The Derivative of the Logistic Function

The standard logistic function, f(x), has the following first derivative:

         f(x) = 1 / (1 + e**(-x)) ---> f-prime(x) = e**x / (1 + e**x)**2   [f-prime(x) = First Derivative of f(x)]
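
As a sanity check on the derivative, here is a minimal sketch that compares the closed-form expression e**x / (1 + e**x)**2 for the standard logistic function f(x) = 1 / (1 + e**(-x)) with a numerical slope estimate. It also checks the equivalent form f(x) * (1 - f(x)), which is often more convenient in code; the sample points and step size are arbitrary choices.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_derivative(x):
    """Closed-form first derivative of the standard logistic function."""
    return np.exp(x) / (1.0 + np.exp(x)) ** 2

x = np.linspace(-5, 5, 11)
h = 1e-5

# Central-difference estimate of the slope at each sample point
numeric = (logistic(x + h) - logistic(x - h)) / (2 * h)

print("max difference from numeric slope:",
      np.max(np.abs(numeric - logistic_derivative(x))))
print("max difference from f(x) * (1 - f(x)):",
      np.max(np.abs(logistic(x) * (1 - logistic(x)) - logistic_derivative(x))))
```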

## In the next part (Part - III) we will explore how this is used to train a Multi-Layer Perceptron