## **Resources**

### **Training Neural Network with C++**

[Linkedin_Learning](https://www.linkedin.com/learning/training-neural-networks-in-c-plus-plus-22661958/the-many-applications-of-machine-learning?autoSkip=true&resume=false&u=42288921)

### **Understanding Neural Network in Depth**

[Essential_Idea_Of_Neural_Network](https://www.youtube.com/watch?v=CqOfi41LfDw)

[How_CNN_Works_in_Depth](https://www.youtube.com/watch?v=JB8T_zN7ZC0)

### **The Mathematics Behind Neural Network**

[Maths_Behind_Neural_Network](https://www.youtube.com/watch?v=Ixl3nykKG9M)


# **Tomorrow**

[Cont_Run](https://www.linkedin.com/learning/training-neural-networks-in-c-plus-plus-22661958/solution-finish-the-multilayer-perceptron-class?resume=false&u=42288921)


<hr>
<hr>
<hr>
<hr>


## **Neural Network Implementation Note**

- All values must be real numbers, not integers. We will use double-precision floating point, i.e. the `double` type (e.g., 0.1, 0.2).

- The weights and inputs may be implemented as `1-D` vectors. We will use the `std::vector<double>` type from the C++ Standard Library i.e. `vec(w)` and `vec(x)`.

- This way, the weighted sum may be calculated in one operation, the dot product of the two vectors: `z = vec(w) * vec(x)`.

- We will feed the weighted sum to the sigmoid activation function.
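The notes above can be sketched as two small functions. This is only an illustration: the names `weighted_sum` and `activate` are hypothetical and are not part of the `Perceptron` class developed later.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Dot product of weights and inputs: z = w[0]*x[0] + w[1]*x[1] + ...
double weighted_sum(const std::vector<double>& w, const std::vector<double>& x)
{
  assert(w.size() == x.size()); // both vectors must have the same length
  return std::inner_product(w.begin(), w.end(), x.begin(), 0.0);
}

// Feeds the weighted sum to the sigmoid activation function
double activate(const std::vector<double>& w, const std::vector<double>& x)
{
  return 1.0 / (1.0 + std::exp(-weighted_sum(w, x)));
}
```

For instance, `weighted_sum({0.5, -1.0}, {2.0, 3.0})` evaluates `0.5*2.0 + (-1.0)*3.0`, and `activate` with all-zero weights returns `0.5` for any input.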


## **Files and Their Meaning**

**`.h` files**: These are header files in C++ that typically contain function declarations, class definitions, and macros. They are included in `.cpp` files to provide the necessary declarations for the functions and classes used in the implementation.

**`.cpp` files**: These are source files in C++ that contain the actual implementation of the functions and classes declared in the corresponding header files. They are compiled to create the final executable program.


<hr>
<hr>
<hr>
<hr>


## **Built in Functions That We Will Use**

`std::vector`: A dynamic array that can resize itself automatically when elements are added or removed.

Syntax: `std::vector<Type> vec;`

<hr>

`std::inner_product`: Computes the inner product of two ranges.

Syntax: `std::inner_product(first1, last1, first2, init);`

<hr>

`std::generate`: Fills a range with values generated by a function.

Syntax: `std::generate(first, last, generator);`

<hr>

`std::vector::push_back`: A member function of `std::vector` that adds an element to the end of the vector.

Syntax: `vec.push_back(value);`

<hr>

`std::vector::resize`: A member function of `std::vector` that changes the size of the vector.

Syntax: `vec.resize(new_size);`

<hr>

`std::exp`: Computes the exponential function.

Syntax: `std::exp(x);`
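To see `resize`, `std::generate` and `push_back` working together, here is a small sketch. The helper names `make_ramp` and `with_appended` are made up for this example only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns {start, start + step, start + 2*step, ...}
std::vector<double> make_ramp(std::size_t count, double start, double step)
{
  std::vector<double> v;
  v.resize(count); // now holds `count` elements, all 0.0
  double next = start;
  // std::generate calls the lambda once per element and stores the result
  std::generate(v.begin(), v.end(), [&next, step]() {
    double current = next;
    next += step;
    return current;
  });
  assert(v.size() == count);
  return v;
}

// Hypothetical helper: copy of x with one extra element at the end
std::vector<double> with_appended(std::vector<double> x, double value)
{
  x.push_back(value);
  return x;
}
```

`make_ramp(3, 1.0, 0.5)` yields the elements `1.0, 1.5, 2.0`, and `with_appended` grows the vector by exactly one element, which is the same pattern the perceptron uses later to append the bias to its input vector.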


<hr>
<hr>
<hr>
<hr>


## **Neural Network into Action**

We will write all the declarations in the header files and all the implementations in the source files. This will help us keep our code organized and modular.

Our first task is to implement a basic `Perceptron` class, the building block of the `Multi Layer Perceptron`, in C++.

For that we are creating `mlp.h` and `mlp.cpp` files.

### **`mlp.h`**

```C++

#pragma once

#include <cstddef> // size_t
#include <vector>  // std::vector

// Perceptron class

class Perceptron
{
public:
  std::vector<double> weights;
  double bias;

  // Constructor
  Perceptron(size_t inputs, double bias = 1.0);

  // Run the perceptron
  double run(std::vector<double> x);

  // Set Custom Weights if needed
  void set_weights(std::vector<double> w_init);

  // Sigmoid Activation Function
  double sigmoid(double x);
};
```

Here, `size_t` is used to represent the number of inputs to the perceptron, ensuring that the value is always non-negative. It is an `unsigned integer` type that typically occupies `8 bytes` on 64-bit systems and `4 bytes` on 32-bit systems.

<hr>

Now we'll implement the `Perceptron` class in `mlp.cpp`.

### **`mlp.cpp`**

Here, we will write the implementation of the `Perceptron` class.

```C++

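#include "mlp.h"
#include <algorithm> // std::generate
#include <cmath>     // std::exp
#include <cstdlib>   // rand, RAND_MAX
#include <iostream>
#include <numeric>   // std::inner_product
using namespace std;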

// Random Number Generator Function

double frand()
{
  return (2.0 * (double)rand() / RAND_MAX) - 1.0;
}

// Return a new Perceptron Object with the Specified number of Inputs (+1 for the bias)

Perceptron::Perceptron(size_t inputs, double bias)
{
  this->bias = bias;

  // Initialize the Weights as Random numbers of Double between -1 and 1

  weights.resize(inputs + 1); // Resize the Vector for Weights + Bias

  // Generate Random Numbers and Fill in the Vectors. Pass the frand function to generate the number

  generate(weights.begin(), weights.end(), frand);
}

// Run Function
// Feeds an Input Vector X into the perceptron to return the activation function output.

double Perceptron::run(std::vector<double> x)
{

  // Add the bias at the end
  x.push_back(bias);

  // Weighted Sum
  double sum = inner_product(x.begin(), x.end(), weights.begin(), (double)0.0);

  return sigmoid(sum); // Pass into the sigmoid function
}

// Set the weights. w_init is a vector with the Weights

void Perceptron::set_weights(std::vector<double> w_init)
{
  weights = w_init; // Copies the vector
}

// Evaluate the Sigmoid Function for the floating point of input

double Perceptron::sigmoid(double x)
{
  return 1.0 / (1.0 + exp(-x));
}
```

**Below is a step-by-step explanation of each line of the implementation.**

`weights.resize(inputs + 1);`

This line resizes the weights vector to hold the specified number of inputs plus one `additional` element for the bias. This ensures that the weights vector has the correct size to accommodate all input weights and the bias term.

`generate(weights.begin(), weights.end(), frand);`

This line fills the weights vector with random values generated by the `frand` function. The `generate` function takes a range (from the beginning to the end of the weights vector) and assigns the result of one `frand` call to each element in that range, initializing every weight to a random value between -1 and 1.

`x.push_back(bias);`

This line adds the bias term to the end of the input vector `x`. This is necessary because the bias is treated as an additional input to the perceptron, and it needs to be included in the weighted sum calculation.

`inner_product(x.begin(), x.end(), weights.begin(), (double)0.0);`

This line computes the weighted sum of the inputs by taking the inner product of the input vector `x` (which now includes the bias) and the weights vector. The `inner_product` function multiplies each element of the input vector by the corresponding element of the weights vector and sums the results. The last argument `(double)0.0` specifies the initial value for the sum.

`return sigmoid(sum);`

This line passes the computed weighted sum into the sigmoid function and returns the result. The sigmoid function applies the logistic activation function to the weighted sum, squashing the output to a range between 0 and 1. This is a crucial step in the perceptron's operation, as it determines the final output of the neuron.

`weights = w_init;`

This line sets the weights of the perceptron to the provided initialization vector `w_init`. This allows the user to specify custom weights for the perceptron, which is useful for hard-coding a known solution (as we do for the logic gates below) or for loading previously trained weights.

`return 1.0 / (1.0 + exp(-x));`

Calculates the sigmoid activation value for the given input `x`.
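As a side note, the code shown above never calls `srand`, so `rand()` produces the same sequence on every run. A possible alternative, not what these notes use, is the `<random>` header. The sketch below assumes a hypothetical helper `random_weights` with an explicit seed so that the initialization is reproducible on purpose.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical replacement for frand(): returns `count` weights drawn
// uniformly from [lo, hi) using a seeded Mersenne Twister engine.
std::vector<double> random_weights(std::size_t count, unsigned seed,
                                   double lo = -1.0, double hi = 1.0)
{
  assert(lo < hi);
  std::mt19937 engine(seed);
  std::uniform_real_distribution<double> dist(lo, hi);
  std::vector<double> weights(count);
  std::generate(weights.begin(), weights.end(),
                [&]() { return dist(engine); });
  return weights;
}
```

Calling `random_weights(3, 42u)` twice gives the same three values, while a different seed gives a different set.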


<hr>
<hr>
<hr>
<hr>


## **AND Gate**

Both the inputs need to be `True` for `True` output.

Now how do we create a Perceptron that can classify inputs like an AND gate?

Let's visualize the inputs and outputs of the `AND` gate in a Graph:

<img src='./Notes_Images/and_gate.png'>

Now, to successfully classify we need to draw a line that separates the two classes (0 and 1). This line is called the decision boundary.

<img src='./Notes_Images/boundary.png'>

**The Boundary Line Comes from the Sigmoid Function**

<img src='./Notes_Images/sigmoid.png'>

In this image, the boundary is the line where sigmoid is `0.5`.

<hr>

Before moving forward,

let's try to find a function that exactly reproduces the `AND` gate while staying linear, i.e. `f(x1, x2) = w1*x1 + w2*x2 + b`.

**Is that Possible?**

<img src='./Notes_Images/linear_and.png'>

This proof shows that no linear function can reproduce the `AND` gate exactly. Indeed, the inputs `(0, 0)`, `(1, 0)` and `(0, 1)` force `b = 0`, `w1 = 0` and `w2 = 0`, so `f(1, 1)` would be `0` instead of `1`.

The function therefore has to be non-linear, for example `exponential`, `quadratic`, or any other non-linear form.

<img src='./Notes_Images/non_linear_and.png'>

**TL;DR: AND is `linearly separable` (a perceptron can classify it), but it is not a linear function of the `inputs`.**

### **A Perceptron as an AND Gate**

Let's say there are two inputs `x1` and `x2`. The perceptron will compute a weighted sum of the inputs and pass it through an activation function to produce the output.

The weighted sum can be represented as:

```
z = w1*x1 + w2*x2 + b
```

Where:

- `w1` and `w2` are the weights for the inputs
- `b` is the bias term

<hr>

The earlier problem was that no linear function produces exactly the outputs `0` and `1` of the `AND` gate.

But if we pass the output of the `linear function` i.e. `z = w1*x1 + w2*x2 + b` through a `non-linear activation` function i.e. `sigmoid`, we can achieve the desired results.

**Sigmoid**

The sigmoid function is defined as:

```
σ(z) = 1 / (1 + e^(-z))
```

Where `e` is the base of the natural logarithm.

When the value of `z` is `0`, the sigmoid function outputs `0.5`.

When `z` is positive, the sigmoid function outputs a value between `0.5` and `1`. When `z` is negative, the sigmoid function outputs a value between `0` and `0.5`.

For positive `z`, the output approaches `1` as `z` increases. For negative `z`, the output approaches `0` as `z` decreases.
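These properties are easy to check in code. The sketch below uses a free function `sigmoid` plus a hypothetical helper `to_label` that applies the `0.5` cut-off; neither is the member function from `mlp.cpp`.

```cpp
#include <cassert>
#include <cmath>

// Logistic sigmoid: 1 / (1 + e^(-z))
double sigmoid(double z)
{
  return 1.0 / (1.0 + std::exp(-z));
}

// Turns an activation in [0, 1] into a class label using the 0.5 cut-off
int to_label(double activation)
{
  assert(activation >= 0.0 && activation <= 1.0);
  return activation >= 0.5 ? 1 : 0;
}
```

`sigmoid(0.0)` is exactly `0.5`, `sigmoid(5.0)` is close to `1`, and `sigmoid(-5.0)` is close to `0`, so `to_label(sigmoid(z))` is `1` exactly when `z >= 0`.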

<hr>

If we threshold the sigmoid output at `0.5`, the perceptron behaves like a step function: it outputs `1` if `z` is greater than or equal to `0`, and `0` otherwise.

To implement the AND gate, we need to find appropriate values for `w1`, `w2`, and `b` such that the perceptron produces the correct output for all possible combinations of inputs.

The truth table for the AND gate is as follows:

| x1  | x2  | AND |
| --- | --- | --- |
| 0   | 0   | 0   |
| 0   | 1   | 0   |
| 1   | 0   | 0   |
| 1   | 1   | 1   |

From the truth table, we can see that the perceptron should output `1` only when both `x1` and `x2` are `1`. One set of weights and bias that achieves this is:

- `w1 = 10`
- `w2 = 10`
- `b = -15`

With these values, the perceptron will compute the following:

```text
For (0, 0): z = 10*0 + 10*0 - 15 = -15 (output 0) i.e. 0.0000003 near to 0
For (0, 1): z = 10*0 + 10*1 - 15 = -5 (output 0) i.e. 0.0066929 near to 0
For (1, 0): z = 10*1 + 10*0 - 15 = -5 (output 0) i.e. 0.0066929 near to 0
For (1, 1): z = 10*1 + 10*1 - 15 = 5 (output 1) i.e. 0.9933071 near to 1
```

As we can see, the perceptron correctly mimics the behavior of the AND gate.

<hr>

To conclude,

we can see that a `Single Perceptron` with a non-linear activation function (sigmoid) is able to represent the `AND` gate. Here, the weighted sum `z` is the value the `Perceptron` computes before applying the sigmoid function.

### **The Equation of Boundary Line That Separates the Classes**

The decision boundary for the AND gate can be derived as follows:

```text
z = 10*x1 + 10*x2 - 15

The sigmoid function outputs 0.5 when z = 0, so the boundary is

10*x1 + 10*x2 - 15 = 0

So,

x1 + x2 = 1.5

or

x2 = 1.5 - x1 i.e. y = mx + c with m = -1 and c = 1.5
```

This equation defines a line in the 2D space (x1, x2) that separates the two classes (0 and 1).
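The same rearrangement works for any weights and bias. As a minimal sketch, assuming a hypothetical helper named `boundary_x2`, the boundary `w1*x1 + w2*x2 + b = 0` can be solved for `x2`:

```cpp
#include <cassert>

// x2 coordinate of the decision boundary w1*x1 + w2*x2 + b = 0 at a given x1
double boundary_x2(double w1, double w2, double b, double x1)
{
  assert(w2 != 0.0); // a vertical boundary has no unique x2
  return -(w1 * x1 + b) / w2;
}
```

With the AND weights, `boundary_x2(10, 10, -15, 0.0)` gives `1.5` and `boundary_x2(10, 10, -15, 1.0)` gives `0.5`, two points on the line `x2 = 1.5 - x1`.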

**Image**

<img src='./Notes_Images/boundary_line.png'>

### **Note**

We just witnessed how a `Simple Single Perceptron` can model the behavior of an `AND` gate using a non-linear activation function.

Now, imagine what `1000s` or `even millions` of these simple perceptrons can achieve when combined in a multi-layer architecture.

Also, note that here we witnessed that for the `Non-Linear Activation Function` to give correct output, the combination of `Weights` and `Bias` should be carefully chosen.

Therefore, the design of neural networks involves not just the architecture (how many layers, how many neurons per layer) but also the careful tuning of these parameters to achieve the desired performance.

<hr>

A general rule for the `AND` gate:

The weights `w1` and `w2` should be positive and the bias `b` should be negative, so that the perceptron activates (outputs 1) only when both inputs are 1.

But,

the magnitude of the `Bias` must be larger than each individual weight and smaller than the sum of both weights, i.e. `max(w1, w2) < -b < w1 + w2`. With `w1 = w2 = 10` and `b = -15` we get `10 < 15 < 20`: a single active input leaves `z` negative, while two active inputs make `z` positive.
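Whether a particular choice of weights and bias really behaves like a gate can be checked by brute force over the four input combinations. The sketch below assumes a hypothetical helper `realizes_gate` that thresholds the sigmoid output at `0.5`.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// True if thresholding sigmoid(w1*x1 + w2*x2 + b) at 0.5 reproduces `expected`,
// listed in the input order (0,0), (0,1), (1,0), (1,1).
bool realizes_gate(double w1, double w2, double b,
                   const std::array<int, 4>& expected)
{
  const double inputs[4][2] = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
  for (std::size_t i = 0; i < 4; ++i)
  {
    assert(expected[i] == 0 || expected[i] == 1);
    double z = w1 * inputs[i][0] + w2 * inputs[i][1] + b;
    double out = 1.0 / (1.0 + std::exp(-z));
    int label = out >= 0.5 ? 1 : 0;
    if (label != expected[i])
      return false;
  }
  return true;
}
```

For the AND truth table `{0, 0, 0, 1}`, the weights `10, 10` with bias `-15` pass the check, while a bias of `-5` (too small in magnitude) or `-25` (too large) fails it.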

**Would `Sigmoid` still produce the right outputs if the Weights and Bias are not carefully chosen?**

No. The `Sigmoid` function only squashes `z` into the range between 0 and 1; the position of the decision boundary is determined entirely by the weights and bias. If they place the boundary in the wrong position, some inputs end up on the wrong side and are misclassified.

For example, with `Weights` `{10,10}` and `Bias` `{-5}` the output of the `sigmoid` would be as below:

```bash

Gate: AND
0 AND 0 = 0.00669285 i.e. 0 which is correct
0 AND 1 = 0.993307 i.e. 1 which is not correct should be 0
1 AND 0 = 0.993307 i.e. 1 which is not correct should be 0
1 AND 1 = 1 i.e. 1 which is correct

```

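Therefore, for the `Non-Linear` Activation function to work effectively, the weights and bias must be chosen carefully to create a suitable decision boundary. (The values above place the boundary at `x1 + x2 = 0.5`, which is the behavior of an `OR` gate rather than an `AND` gate.)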

For that, Gradient Descent is often used to optimize the weights and bias during the training process.
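To give a feel for what that looks like, here is a minimal sketch of gradient descent for a single 2-input sigmoid neuron. It assumes the delta rule for squared error with per-sample updates and zero initial weights; the function name `train_gate` and the default `eta` and `epochs` are choices made for this example, not something taken from these notes.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical trainer: learns {w1, w2, b} for a 2-input sigmoid neuron on a
// 4-row truth table given in the order (0,0), (0,1), (1,0), (1,1).
std::array<double, 3> train_gate(const std::array<double, 4>& targets,
                                 double eta = 0.5, int epochs = 10000)
{
  assert(eta > 0.0 && epochs > 0);
  const double inputs[4][2] = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
  std::array<double, 3> w = {{0.0, 0.0, 0.0}}; // w1, w2, b start at zero
  for (int epoch = 0; epoch < epochs; ++epoch)
  {
    for (std::size_t i = 0; i < 4; ++i)
    {
      double z = w[0] * inputs[i][0] + w[1] * inputs[i][1] + w[2];
      double out = 1.0 / (1.0 + std::exp(-z));
      // Delta rule for squared error: (target - out) * sigmoid'(z)
      double delta = (targets[i] - out) * out * (1.0 - out);
      w[0] += eta * delta * inputs[i][0];
      w[1] += eta * delta * inputs[i][1];
      w[2] += eta * delta; // the bias input is a constant 1
    }
  }
  return w;
}
```

Trained on the AND targets `{0, 0, 0, 1}`, the result should satisfy the same pattern as the hand-picked values: positive weights and a negative bias whose magnitude lies between a single weight and the sum of both.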

### **Follow Up Questions**

**What if our inputs are not binary (0 or 1) but continuous values? How would that affect the design of the perceptron?**

**What if we want to implement a different logical operation, such as OR or XOR? How would the design of the perceptron change in those cases?**

**What if we want to implement a multi-class classification problem? How would the design of the perceptron change in that case?**


<hr>
<hr>
<hr>
<hr>


## **Our Perceptron as an AND Gate**

Now let's try to implement our Perceptron as an AND gate. The AND gate outputs 1 only if both inputs are 1, otherwise, it outputs 0.

```C++
#include "mlp.h"
#include <iostream>
#include <vector>
using namespace std;

int main()
{
  Perceptron p(2); // Object with 2 inputs on the Stack, No need to delete

  p.set_weights({10, 10, -15}); // w1, w2, and the weight applied to the bias input

  cout << "Gate: AND" << endl;

  cout << "0 AND 0 = " << p.run({0,0}) << endl;
  cout << "0 AND 1 = " << p.run({0,1}) << endl;
  cout << "1 AND 0 = " << p.run({1,0}) << endl;
  cout << "1 AND 1 = " << p.run({1,1}) << endl;

  return 0;
}

// Output

// Gate: AND
// 0 AND 0 = 3.05902e-07
// 0 AND 1 = 0.00669285
// 1 AND 0 = 0.00669285
// 1 AND 1 = 0.993307
```


<hr>
<hr>
<hr>
<hr>


## **OR Gate**

<img src='./Notes_Images/or_gate.png'>

The OR gate outputs 1 if at least one of the inputs is 1, otherwise, it outputs 0.

<hr>

One workable choice is weights `{15,15}` and bias `-10`.

The decision boundary for the OR gate is the line where the weighted sum is zero:

```bash
15x + 15y - 10 = 0

then

y = -x + 2/3

```

Below are the outputs of Sigmoid function for the OR gate:

```C++

// Output

// Gate: OR
// 0 OR 0 = 4.53979e-05
// 0 OR 1 = 0.993307
// 1 OR 0 = 0.993307
// 1 OR 1 = 1

```

**OR Gate Boundary Line Equation**

<img src='./Notes_Images/or_gate_boundary.png'>


<hr>
<hr>
<hr>
<hr>


## **Linear Separability**

`Linear separability` is a property of a dataset that allows it to be separated into different classes using a `linear boundary`. In the context of neural networks, this means that a single layer perceptron can be used to classify the data points.

For example, the `AND` gate is linearly separable because we can draw a `straight line` (or `hyperplane` in higher dimensions) that separates the positive examples (1s) from the negative examples (0s).

Similarly, the `OR` gate is also linearly separable for the same reason.

**Note**

- Both the `Straight Line` i.e. `y = mx + b` and the `Hyperplane` i.e. `W*x + b = 0` can be used to separate linearly separable data.

- For 2 dimensional data i.e. 2 input features we need the equation of line, for 3 dimensional data, we need the equation of plane i.e. `Ax + By + Cz + D = 0` and for data whose dimension is greater than 3, we need the equation of hyperplane i.e. `W*x + b = 0`, where `W` is the weight vector and `b` is the bias.
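Because the hyperplane form `W*x + b = 0` covers the line and the plane as special cases, one function can classify points in any dimension. This is a sketch with a hypothetical name, `side_of_hyperplane`:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Returns 1 if x lies on or above the hyperplane W*x + b = 0, otherwise 0.
int side_of_hyperplane(const std::vector<double>& W,
                       const std::vector<double>& x, double b)
{
  assert(W.size() == x.size()); // one weight per input feature
  // Start the inner product at b so the result is W*x + b
  double z = std::inner_product(W.begin(), W.end(), x.begin(), b);
  return z >= 0.0 ? 1 : 0;
}
```

With `W = {10, 10}` and `b = -15` this reproduces the 2D `AND` boundary, and with `W = {1, 1, 1}` and `b = -2.5` it acts as a 3-input `AND`, where only the point `(1, 1, 1)` lands on the positive side.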

<hr>

On the other hand, the `XOR` gate is not linearly separable because there is no single straight line that can separate the positive examples from the negative examples.

Below is the Graphical representation of the `XOR` gate:

<img src='./Notes_Images/xor_gate.png'>

Here, we cannot separate the positive and negative examples with a single straight line. We would need two lines to separate the classes.

If we use only an `OR` gate, it gets all but one of the `XOR` cases right. The `OR` gate outputs `1` for the `(1, 1)` input, whereas the `XOR` operation must output `0` there.

<img src='./Notes_Images/or_for_xor.png'>

If we use only a `NAND` gate, it also gets exactly one case wrong: it outputs `1` for the `(0, 0)` input, whereas the `XOR` gate must output `0` there.

<img src='./Notes_Images/nand_for_xor.png'>

**But**

If we combine the outputs of the `NAND` gate and the `OR` gate, we can create a circuit that correctly implements the `XOR` function.

<img src='./Notes_Images/nand_or_xor.png'>

### **Creating XOR with NAND, AND, and OR Gates**

To create an `XOR` gate using `NAND`, `AND`, and `OR` gates, we can use the following configuration:

1. **Inputs**: A and B

2. **NAND Gate**: Connect A and B to a `NAND` gate. This will give us the output `NAND(A, B)`.

3. **OR Gate**: Connect A and B to an `OR` gate. This will give us the output `OR(A, B)`.

4. **AND Gate**: Connect the outputs of the `NAND` gate and the `OR` gate to an `AND` gate. This will give us the final output `XOR(A, B)`.

The logical expression for the `XOR` gate can be represented as:

```bash
XOR(A, B) = AND(NAND(A, B), OR(A, B))
```

This configuration allows us to implement the `XOR` function using only `NAND`, `AND`, and `OR` gates.
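The expression can be verified against the full truth table with plain boolean functions. The function names below are chosen for this sketch; the `static_assert` lines make the compiler check all four cases.

```cpp
constexpr bool nand_gate(bool a, bool b) { return !(a && b); }
constexpr bool or_gate(bool a, bool b) { return a || b; }
constexpr bool and_gate(bool a, bool b) { return a && b; }

// XOR(A, B) = AND(NAND(A, B), OR(A, B))
constexpr bool xor_gate(bool a, bool b)
{
  return and_gate(nand_gate(a, b), or_gate(a, b));
}

// Compile-time check against the XOR truth table
static_assert(!xor_gate(false, false), "0 XOR 0 must be 0");
static_assert(xor_gate(false, true), "0 XOR 1 must be 1");
static_assert(xor_gate(true, false), "1 XOR 0 must be 1");
static_assert(!xor_gate(true, true), "1 XOR 1 must be 0");
```

If any row of the truth table were wrong, the file would fail to compile, so no separate test run is needed.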

**XOR Diagram**

<img src='./Notes_Images/xor_diagram.png'>

### **Neural Network for XOR Gate**

We know a single perceptron can solve a linearly separable problem, but the `XOR` function is not linearly separable. Therefore, we need a neural network with `3 perceptrons` i.e. `2 in the hidden layer and 1 in the output layer`.

Also, we know that a single perceptron can represent all the three basic logic gates: `AND`, `OR`, and `NAND` each having different `weight` and `bias` configurations.

**Linear Equations for Basic Logic Gates**

`OR Gate` : `y = -x + 2/3` with Weights `{15,15}` and Bias `{-10}`

`NAND Gate` : `y = -x + 1.5` with Weights `{-10,-10}` and Bias `{15}` (the same line as `AND`, but with the two sides swapped)

Then we plug the outputs of the `OR Gate` and the `NAND Gate` in as the inputs of the `AND Gate`.

`AND Gate` : `y = -x + 1.5` with Weights `{10,10}` and Bias `{-15}`

<hr>


## **Multi Layer Perceptron**

A Multi-Layer Perceptron (MLP) is a type of neural network that consists of multiple layers of neurons, including an input layer, one or more hidden layers, and an output layer. MLPs are capable of learning complex patterns in data and can be used for a variety of tasks, including classification and regression.

**Image of MLP for XOR Gate**

<img src='./Notes_Images/mlp.png'/>

### **Architecture of MLP for XOR Gate**

1. **Input Layer**: The input layer consists of two inputs, each representing one of the input features (X1 and X2) of the XOR gate.

2. **Hidden Layer**: The hidden layer contains two neurons. One of the neurons represents the `NAND Gate` operation, while the other represents the `OR` operation.

3. **Output Layer**: The output layer consists of a single neuron that produces the final output of the network. This neuron receives inputs from both hidden layer neurons, applies a weighted sum and a non-linear activation function, and produces the final output (Y) of the `XOR` gate.


## **XOR Gate Implementation**

<hr>

In the `mlp.h` file we previously declared the single `Perceptron` class. Now, we will add a `MultiLayerPerceptron` class next to it that can handle the XOR problem.

### **`mlp.h`**

```C++
#pragma once

#include <algorithm>
#include <vector>
#include <iostream>
#include <random>
#include <numeric>
#include <cmath>
#include <cstdlib> // rand, RAND_MAX
#include <ctime>

class Perceptron
{
public:
  std::vector<double> weights;
  double bias;

  // Constructor
  Perceptron(size_t inputs, double bias = 1.0);

  // Run the Perceptron
  double run(std::vector<double> x);

  // Set the Customize Weights if Needed
  void set_weights(std::vector<double> w_init);

  // Sigmoid Activation Function
  double sigmoid(double x);
};

class MultiLayerPerceptron
{
public:
  // Constructor for initializing layers
  MultiLayerPerceptron(std::vector<size_t> layers, double bias = 1.0, double eta = 0.5);

  // Set custom weights; w_init[i][j] holds the weights of neuron j in layer i + 1
  void set_weights(std::vector<std::vector<std::vector<double>>> w_init);

  // Display the weights
  void print_weights();

  // Run the MLP
  std::vector<double> run(std::vector<double> x);

  // Backpropagation: one training step on input x with target output y
  double bp(std::vector<double> x, std::vector<double> y);

  // Attributes

  std::vector<size_t> layers; // Number of neurons per layer, e.g. {2, 2, 1}: 2 inputs, 2 hidden neurons, 1 output neuron

  double bias; // Bias
  double eta;  // Learning Rate

  std::vector<std::vector<Perceptron>> network;
  // network[i][j] is neuron j of layer i; network[0] (the input layer) stays empty.
  // The vector object itself lives wherever the MultiLayerPerceptron object lives
  // (on the stack if the MLP is a local variable created without 'new').
  // Its elements, the inner vectors and their Perceptrons, are always
  // allocated dynamically on the heap by std::vector.

  std::vector<std::vector<double>> values; // Output values of each neuron in each layer (same storage rule as 'network')

  std::vector<std::vector<double>> d; // Error terms (deltas) for each neuron (same storage rule as 'network')
};
```

In the above code,

The `MultiLayerPerceptron` class is designed to handle the XOR problem by utilizing multiple layers of neurons.

**`MultiLayerPerceptron(std::vector<size_t> layers, double bias = 1.0, double eta = 0.5)`** : This constructor initializes the MLP with the specified layer structure, bias, and learning rate. It creates the necessary layers and populates the network with `Perceptron` objects.

**`run(std::vector<double> x)`** : This method takes an input vector `x` and passes it through the network, returning the output of the MLP.

**`bp(std::vector<double> x, std::vector<double> y)`** : This method performs backpropagation to update the weights of the network based on the error between the predicted output and the true output `y`.

**`std::vector<std::vector<Perceptron>> network;`** : This attribute holds the layers of the MLP, where each layer is a vector of `Perceptron` objects.

**`std::vector<std::vector<double>> values;`** : This attribute stores the output values of each neuron in the network for a given input.

**`std::vector<std::vector<double>> d;`** : This attribute holds the error terms for the neurons, which are used during backpropagation to update the weights.

<hr>

### **`mlp.cpp`**

```C++

#include "mlp.h"
#include <iostream>
using namespace std;

// Random Number Generator Function
double frand()
{
  return (2.0 * (double)rand() / RAND_MAX) - 1.0;
}

/*
Single Layer Perceptron Implementation
*/

// Return a new Perceptron Object with the Specified number of Inputs (+1 for the bias)

Perceptron::Perceptron(size_t inputs, double bias)
{
  this->bias = bias;

  // Initialize the Weights as Random numbers of Double between -1 and 1

  weights.resize(inputs + 1); // Resize the Vector for Weights + Bias

  // Generate Random Numbers and Fill in the Vectors. Pass the frand function to generate the number

  generate(weights.begin(), weights.end(), frand);
}

// Run Function
// Feeds an Input Vector X into the perceptron to return the activation function output.
double Perceptron::run(std::vector<double> x)
{

  // Add the bias at the end
  x.push_back(bias);

  // Weighted Sum
  double sum = inner_product(x.begin(), x.end(), weights.begin(), (double)0.0);

  return sigmoid(sum); // Pass into the sigmoid function
}

// Set the weights. w_init is a vector with the Weights
void Perceptron::set_weights(std::vector<double> w_init)
{
  weights = w_init; // Copies the vector
}

// Evaluate the Sigmoid Function for the floating point of input

double Perceptron::sigmoid(double x)
{
  return 1.0 / (1.0 + exp(-x));
}

/*
Multi Layer Perceptron Implementation
*/

// Return a new MLP Object with the Specified number layers, bias and Learning Rate

MultiLayerPerceptron::MultiLayerPerceptron(std::vector<size_t> layers, double bias, double eta) : layers(layers), bias(bias), eta(eta)
{
  // Create Neurons Layer By Layer
  // Outer Loop
  for (size_t i = 0; i < layers.size(); i++)
  {
    // Add Vector of Values Filled with Zeros
    values.push_back(vector<double>(layers[i], 0.0)); // Output of Each Neuron Value set to Zero based on the number of Neurons in Each layer

    // Add Vector of Neurons
    // 'vector<Perceptron>()' constructs a temporary empty vector object; without the '()'
    // we would only name the type, which is a compiler error.
    network.push_back(vector<Perceptron>());

    // Inner Loop
    // network[0] is the input layer, so it has no neurons
    if (i > 0)
    {
      // Iterate on Each Neuron in the Layer
      for (size_t j = 0; j < layers[i]; j++)
      {
        // Add a Perceptron to every layer starting with layer 1, because layer 0 is the input layer
        // Each Perceptron accepts as many inputs as there are neurons in the previous layer
        network[i].push_back(Perceptron(layers[i - 1], bias));
      }
    }
  }
}
```

Here, in the above code for `MLP`,

**`MultiLayerPerceptron::MultiLayerPerceptron(std::vector<size_t> layers, double bias, double eta)`** : This constructor initializes the MLP with the specified layer structure, bias, and learning rate. It creates the necessary layers and populates the network with `Perceptron` objects.

```C++

// Create Neurons Layer By Layer

for (size_t i = 0; i < layers.size(); i++)
{
  // Add Vector of Values Filled with Zeros
  values.push_back(vector<double>(layers[i], 0.0));
}

```

The above code creates the necessary layers for the MLP by adding vectors of zeros for each layer.

This `vector<double>(layers[i], 0.0)` calls the `std::vector` constructor to create a vector of the specified size (`layers[i]`) initialized with zeros (`0.0`).

This means that for `layers = {2, 2, 1}` the `values` vector holds `{{0,0}}` after the first iteration.

```C++
// Add Vector of Neurons
network.push_back(vector<Perceptron>()); // Empty Vector of Perceptrons is added to the end
```

The above code adds an empty vector of `Perceptron` objects for each layer in the network. The neurons themselves are added in the next step, and each one receives its random weights in the `Perceptron` constructor.

Before `network.push_back(vector<Perceptron>());` runs for the first time, `network.size()` is `0`; after this line, it becomes `1`.

So in the first iteration, i.e. for the input layer, `network` holds one empty vector of `Perceptron` objects, i.e. `network = {{}}`.

<hr>

```C++
if (i > 0)
{
  // Iterate on Each Neuron in the Layer
  for (size_t j = 0; j < layers[i]; j++)
  {
    // Add a Perceptron to every layer starting with layer 1, because layer 0 is the input layer
    // Each Perceptron accepts as many inputs as there are neurons in the previous layer
    network[i].push_back(Perceptron(layers[i - 1], bias));
  }
}
```

In the above code,

The `i > 0` condition skips the input layer, so `Perceptron` objects are first added to the first hidden layer (layer 1).

For each such layer `i`, we add as many `Perceptron` objects as the `layers` vector specifies, i.e. `layers[i]`.

Note: Each `Perceptron` in the current layer will be initialized with the number of inputs equal to the number of neurons in the previous layer, and the specified bias value. `layers[i - 1]` means the number of neurons in the previous layer (layer `i - 1`).
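The bookkeeping of that loop can be checked with two small counting functions. They are hypothetical helpers for illustration and mirror the constructor's rules: no neurons in layer 0, and `layers[i - 1] + 1` weights per neuron in layer `i` (the `+ 1` is the bias weight).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of Perceptron objects the constructor loop creates
std::size_t count_perceptrons(const std::vector<std::size_t>& layers)
{
  assert(!layers.empty());
  std::size_t total = 0;
  for (std::size_t i = 1; i < layers.size(); ++i) // layer 0 has no neurons
    total += layers[i];
  return total;
}

// Number of weights over all perceptrons: every neuron in layer i has one
// weight per neuron of layer i - 1 plus one weight for the bias input
std::size_t count_weights(const std::vector<std::size_t>& layers)
{
  assert(!layers.empty());
  std::size_t total = 0;
  for (std::size_t i = 1; i < layers.size(); ++i)
    total += layers[i] * (layers[i - 1] + 1);
  return total;
}
```

For `layers = {2, 2, 1}` this gives 3 perceptrons and 9 weights: two hidden neurons with 3 weights each and one output neuron with 3 weights.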

<hr>
<hr>

**set_weights(vector<vector<vector<double>>> w_init) method**

```C++
// Set Custom Weights
void MultiLayerPerceptron::set_weights(vector<vector<vector<double>>> w_init)
{
  // Write all the weights into the neural network
  // w_init is a vector of vectors of vectors of doubles
  for (size_t i = 0; i < w_init.size(); i++)
  {
    for (size_t j = 0; j < w_init[i].size(); j++)
    {
      network[i + 1][j].set_weights(w_init[i][j]);
    }
  }

  // Example shape of w_init for layers {2, 2, 1}
  // (each neuron has 2 input weights + 1 bias weight):
  //
  //   {
  //     {{0, 0, 0}, {0, 0, 0}}, // hidden layer: 2 neurons
  //     {{0, 0, 0}}             // output layer: 1 neuron
  //   }
}

```

In the above code,

As our `Constructor` would have initialized,

`values` as : `{{0,0},{0,0},{0}}`

and

`network` as : `{{},{Perceptron(2,bias),Perceptron(2,bias)},{Perceptron(2,bias)}}`, we can set the weights of the `Perceptron` objects in the network using the `set_weights` method.

For example, suppose `w_init` is:

```C++
std::vector<std::vector<std::vector<double>>> w_init = {
{   // w_init[0]: hidden layer (network[1])
  {-10.0, -10.0, 15.0},   // Hidden neuron 1: 2 input weights + bias weight
  {15.0, 15.0, -10.0}     // Hidden neuron 2: 2 input weights + bias weight
},
{   // w_init[1]: output layer (network[2])
  {10.0, 10.0, -15.0}     // Output neuron: 2 input weights + bias weight
}
};
```

Here `w_init.size()` is 2, one entry per non-input layer. `w_init[0].size()` is also 2, one weight vector for each of the 2 `Perceptron` objects in the hidden layer. Each weight vector has 3 entries, because `Perceptron::run` appends the bias to the 2 inputs before taking the inner product.

Because `network[0]` (the input layer) does not contain any `Perceptron`, the weights in `w_init[i]` belong to `network[i + 1]`.

To store the weights in the `Neural Network`, we use the `set_weights(vector<double> w_init)` method of each `Perceptron` object in the network.

For example, for the `Perceptrons` of first layer,

`network[1][0].set_weights(w_init[0][0]);` i.e. First `Perceptron` of Layer 1

`network[1][1].set_weights(w_init[0][1]);` i.e. Second `Perceptron` of Layer 1
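Since `set_weights` copies whatever it is given, a weight vector of the wrong length would only show up later, when `run` takes the inner product. A shape check along these lines could catch that earlier. It is a sketch, and `weights_match_layers` and the alias `Weights3D` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Weights3D = std::vector<std::vector<std::vector<double>>>;

// True if w_init has the shape expected for the given layer sizes
bool weights_match_layers(const std::vector<std::size_t>& layers,
                          const Weights3D& w_init)
{
  assert(!layers.empty());
  if (w_init.size() != layers.size() - 1)
    return false; // one entry per non-input layer
  for (std::size_t i = 0; i < w_init.size(); ++i)
  {
    if (w_init[i].size() != layers[i + 1])
      return false; // one weight vector per neuron
    for (const auto& neuron : w_init[i])
      if (neuron.size() != layers[i] + 1)
        return false; // inputs from the previous layer + 1 bias weight
  }
  return true;
}
```

For `layers = {2, 2, 1}`, a `w_init` with three weights per neuron passes, while one with only two weights per neuron (no bias weight) is rejected.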

<hr>
<hr>

**print_weights() method**

```C++
// Print the Weights
void MultiLayerPerceptron::print_weights()
{
  cout << endl;

  for (size_t i = 1; i < network.size(); i++)
  {
    for (size_t j = 0; j < layers[i]; j++)
    {
      cout << "Layer" << i + 1 << " Neuron " << j << ": ";
      for (auto &it : network[i][j].weights)
      {
        cout << it << " ";
      }
      cout << endl;
    }
  }
}
```

In the above code, we are printing the weights of each neuron in the network. We iterate through each layer and each neuron within that layer, printing the weights associated with each neuron.

<hr>
<hr>

**`run(vector<double> x)`**

```C++
// Run the Network
vector<double> MultiLayerPerceptron::run(vector<double> x)
{
  // Run an Input Forward Through the Neural Network
  // x is a vector with the input values

  // Set the values for the first layer to the given value x, before it was initialized with {0,0}
  values[0] = x;

  for (size_t i = 1; i < network.size(); i++)
  {
    for (size_t j = 0; j < layers[i]; j++)
    {
      values[i][j] = network[i][j].run(values[i - 1]);
    }
  }
  return values.back(); // Return the output of the last layer
}
```

In the above code,

`values[0] = x`

Here, for `layers = {2, 2, 1}`, the initial value of `values` is `{{0,0},{0,0},{0}}`. We overwrite the entry for the input layer with the given vector `x`, i.e. `values = {x, {0,0}, {0}}`.

We then loop over the network from index `1` to `network.size() - 1`, because index `0` is the input layer and only the hidden and output layers contain neurons to evaluate.

Then from each layer we will update the output of each `Perceptron` using the `run` method with the output of the previous layer as input.

The `run` function will compute the weighted sum of the inputs and apply the activation i.e. `Sigmoid` and return the output which will be stored as `values[i][j]`.

This way all the outputs from the previous layer are used as inputs for the next layer, allowing the network to learn complex patterns in the data.

Finally, we return the output of the last layer using `return values.back();`, which gives us the final output of the MLP for the given input `x`. Because the last value of `values` is the output from the last layer of the network.
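The forward pass described above can also be sketched in a self-contained form. `SketchPerceptron` and `forward` are hypothetical stand-ins for the `Perceptron` and `MultiLayerPerceptron` classes in these notes, not their actual implementation. The sketch assumes that each neuron stores its bias as the last weight and uses a sigmoid activation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Perceptron class: the last weight is the bias.
struct SketchPerceptron
{
  std::vector<double> weights;

  // Weighted sum of the inputs plus the bias, passed through the sigmoid.
  double run(const std::vector<double> &x) const
  {
    assert(weights.size() == x.size() + 1);
    double z = weights.back(); // bias term
    for (std::size_t k = 0; k < x.size(); k++)
      z += weights[k] * x[k];
    return 1.0 / (1.0 + std::exp(-z));
  }
};

// network[i] holds the neurons of the (i + 1)-th layer; the input layer has no neurons.
std::vector<double> forward(const std::vector<std::vector<SketchPerceptron>> &network,
                            std::vector<double> values)
{
  for (const auto &layer : network)
  {
    std::vector<double> next;
    next.reserve(layer.size());
    for (const auto &neuron : layer)
      next.push_back(neuron.run(values)); // previous layer's outputs are the inputs
    values.swap(next);                    // these outputs feed the following layer
  }
  return values; // outputs of the last layer
}
```

Unlike the class in these notes, this sketch does not keep the intermediate `values` of every layer. It only carries the current layer's outputs forward, which is enough for inference.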

<hr>
<hr>

### **`neuralNet.cpp`**

Now we will implement the complete code for the `XOR` Gate.

```C++

// XOR Gate
  // Instantiate with 2 inputs, 2 Perceptrons in the hidden layer and 1 Perceptron in the output layer
  MultiLayerPerceptron mlp({2, 2, 1});

  // Set the Weights, NAND Gate, OR Gate and then AND Gate
  mlp.set_weights(
      {{{-10, -10, 15}, {15, 15, -10}},
       {{10, 10, -15}}});

  cout << endl;

  // Print Weights
  cout << "Hardcoded Weights:" << endl;
  mlp.print_weights();

  cout << endl;

  // Run the Network
  cout << "XOR: " << endl;
  cout << "0 0 = " << mlp.run({0, 0})[0] << endl; // For 0 0 Input, Output should be close to 0 (0.00669585)
  cout << "0 1 = " << mlp.run({0, 1})[0] << endl; // For 0 1 Input, Output should be close to 1 (0.992356)
  cout << "1 0 = " << mlp.run({1, 0})[0] << endl; // For 1 0 Input, Output should be close to 1 (0.992356)
  cout << "1 1 = " << mlp.run({1, 1})[0] << endl; // For 1 1 Input, Output should be close to 0 (0.00715281)

// Output

// Hardcoded Weights:

// Layer2 Neuron 0: -10 -10 15
// Layer2 Neuron 1: 15 15 -10
// Layer3 Neuron 0: 10 10 -15

// XOR:
// 0 0 = 0.00669585
// 0 1 = 0.992356
// 1 0 = 0.992356
// 1 1 = 0.00715281
```

In the above code, we are implementing a Multi-Layer Perceptron (MLP) to solve the XOR problem. The MLP consists of an input layer, a hidden layer, and an output layer. We hardcode the weights for the neurons in each layer to mimic the behavior of the XOR gate. The `print_weights()` method displays the weights of each neuron, and the `run()` method takes an input vector and computes the output of the network by propagating the inputs through the layers. Finally, we test the network with all possible combinations of binary inputs for the XOR gate and print the results.

<hr>
<hr>


## **Mathematical Intuition Behind XOR Gate**

<hr>

First, let's look at our Neural Network,

Hidden 1 (NAND): $(-10,-10; \, +15)$

Hidden 2 (OR): $(+15,+15; \, -10)$

Output (AND): $(+10,+10; \, -15)$

Let $\sigma(t)=\dfrac{1}{1+e^{-t}}$.

### Hidden layer

$$
\begin{aligned}
z_1 &= -10x_1-10x_2+15,
&h_1 &= \sigma(z_1),\\[2pt]
z_2 &= \phantom{-}15x_1+15x_2-10,
&h_2 &= \sigma(z_2).
\end{aligned}
$$

### Output neuron

$$
z_{\text{out}} = 10h_1+10h_2-15,\qquad
y=\sigma(z_{\text{out}}).
$$

### Explicit closed-form (final equation)

$$
\boxed{\;
y(x_1,x_2)=\sigma\!\Big(
10\,\sigma(-10x_1-10x_2+15)\;+\;10\,\sigma(15x_1+15x_2-10)\;-\;15
\Big)\; }
$$

### Decision boundaries (hyperplanes at $\sigma=0.5$)

- Hidden 1 (NAND): $z_1=0 \iff \boxed{x_1+x_2=1.5}$.
- Hidden 2 (OR): $z_2=0 \iff \boxed{x_1+x_2=\tfrac{10}{15}=\tfrac{2}{3}}$.
- Output (in $(h_1,h_2)$-space): $z_{\text{out}}=0 \iff \boxed{h_1+h_2=1.5}$.

Pulled back to $(x_1,x_2)$-space, the **final decision boundary** is the implicit curve

$$
\boxed{\;\sigma(-10x_1-10x_2+15)\;+\;\sigma(15x_1+15x_2-10)\;=\;1.5\;}
$$

(Region “$y>0.5$” is where the left-hand side $>\,1.5$.)

<hr>

But how?

### 1. Reminder: what happens for AND

For the AND gate with weights $(10,10,-15)$, the **decision boundary** was

$$
\sigma(10x_1+10x_2-15)=0.5 \;\;\Longleftrightarrow\;\; 10x_1+10x_2-15=0,
$$

because $\sigma(z)=0.5 \iff z=0$. That’s a straight line.

So a **single perceptron** always produces a _linear_ boundary.

### 2. XOR with 2 hidden neurons

Now for XOR, the **output neuron** takes two hidden activations:

$$
h_1 = \sigma(-10x_1 - 10x_2 + 15), \qquad
h_2 = \sigma(15x_1 + 15x_2 - 10).
$$

Then at the output:

$$
z_{\text{out}} = 10h_1 + 10h_2 - 15, \quad
y=\sigma(z_{\text{out}}).
$$

The **decision boundary** is $y=0.5 \iff z_{\text{out}}=0$:

$$
10h_1 + 10h_2 - 15 = 0
\quad\Longleftrightarrow\quad
h_1 + h_2 = 1.5.
$$

### 3. Why it’s nonlinear in $(x_1, x_2)$

Substitute the hidden activations:

$$
\sigma(-10x_1-10x_2+15) + \sigma(15x_1+15x_2-10) = 1.5.
$$

Each term here is a **sigmoid of a linear function**.

- A sigmoid of a line is an **S-shaped curve**, not a line.
- The **sum of two sigmoids = constant** is generally a _curved contour_.

That’s why the boundary is **not linear** anymore—it bends.

### 4. Visual intuition

- For a **single perceptron**, the boundary is always a straight line, because the “0.5 cutoff” corresponds to the weighted sum $z = 0$.

- But once you **stack perceptrons** (like in XOR), the output depends on hidden nonlinearities. The “boundary = 0.5” condition then becomes an implicit equation involving **sums of sigmoids**, which defines a **nonlinear curve**.

- In the **steep-weight limit**, the sigmoids approximate steps. The positive region then becomes a band between two parallel lines:

  $$
  \tfrac{2}{3} < x_1+x_2 < 1.5,
  $$

  which is not a single half-plane, but the _intersection of two half-planes_. That is what makes XOR work.

So, the reason it’s “not a line” is because once you include multiple hidden sigmoids, the boundary equation is no longer linear in $(x_1, x_2)$; it’s the **level set of a sum of nonlinear functions**.
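To see where the actual (finite-weight) boundary sits relative to the ideal band $\tfrac{2}{3} < x_1+x_2 < 1.5$, the boundary equation can be solved numerically. The sketch below is an illustration only: it uses the fact that both hidden neurons depend on $s = x_1 + x_2$ alone, and the helper names `boundary_gap` and `find_crossing` are made up for this example.

```cpp
#include <cassert>
#include <cmath>

double sigmoid(double t) { return 1.0 / (1.0 + std::exp(-t)); }

// g(s) = h1 + h2 - 1.5 with s = x1 + x2; the decision boundary is g(s) = 0.
double boundary_gap(double s)
{
  return sigmoid(-10.0 * s + 15.0) + sigmoid(15.0 * s - 10.0) - 1.5;
}

// Plain bisection on [lo, hi]; requires g to have opposite signs at the ends.
double find_crossing(double lo, double hi)
{
  assert(boundary_gap(lo) * boundary_gap(hi) < 0.0);
  for (int iter = 0; iter < 60; iter++)
  {
    const double mid = 0.5 * (lo + hi);
    if (boundary_gap(lo) * boundary_gap(mid) <= 0.0)
      hi = mid; // the sign change is in the left half
    else
      lo = mid; // the sign change is in the right half
  }
  return 0.5 * (lo + hi);
}
```

Calling `find_crossing(0.0, 1.0)` and `find_crossing(1.0, 2.0)` gives the two values of $x_1 + x_2$ where the output crosses $0.5$. With these weights, the first lands slightly above $\tfrac{2}{3}$ and the second slightly below $1.5$, so the real positive region is marginally narrower than the ideal band.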

Below is a plot of this curve, showing how it bends between the four XOR points, which makes it clear why the boundary isn't a line.

<img src='./Notes_Images/xor_graph.png'>

**Desmos for XOR**

<img src='./Notes_Images/desmos_xor.png'>

This shows that `XOR` needs two half-planes rather than one. The positive region is the intersection of these two half-planes, so the **decision boundary** consists of two curves rather than a single straight line. In other words, the network has found a non-linear decision boundary.

<hr>

### Numeric evaluations on $\{0,1\}^2$

$$
\begin{array}{c|c|c|c|c}
(x_1,x_2) & (z_1,h_1) & (z_2,h_2) & z_{\text{out}} & y=\sigma(z_{\text{out}})\\ \hline
(0,0) & (15,\;0.9999996941) & (-10,\;4.5398{\times}10^{-5}) & -4.99954908 & 0.00669585\\
(0,1) & (5,\;0.9933071491) & (5,\;0.9933071491) & \phantom{-}4.86614298 & 0.99235586\\
(1,0) & (5,\;0.9933071491) & (5,\;0.9933071491) & \phantom{-}4.86614298 & 0.99235586\\
(1,1) & (-5,\;0.0066928509) & (20,\;0.9999999980) & -4.93307151 & 0.00715281
\end{array}
$$

(Threshold $0.5$ yields XOR: $0,1,1,0$.)
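The last column of the table can be reproduced directly from the closed-form equation. The following is a minimal sketch; `xor_closed_form` and `xor_thresholded` are names chosen for this example and are not part of the `MultiLayerPerceptron` class.

```cpp
#include <cassert>
#include <cmath>

double sigmoid(double t) { return 1.0 / (1.0 + std::exp(-t)); }

// y(x1, x2) = sigma(10 * sigma(-10x1 - 10x2 + 15) + 10 * sigma(15x1 + 15x2 - 10) - 15)
double xor_closed_form(double x1, double x2)
{
  const double h1 = sigmoid(-10.0 * x1 - 10.0 * x2 + 15.0); // hidden 1 (NAND)
  const double h2 = sigmoid(15.0 * x1 + 15.0 * x2 - 10.0);  // hidden 2 (OR)
  return sigmoid(10.0 * h1 + 10.0 * h2 - 15.0);             // output (AND)
}

// Applies the 0.5 threshold to turn the sigmoid output into a logic value.
bool xor_thresholded(int x1, int x2)
{
  assert((x1 == 0 || x1 == 1) && (x2 == 0 || x2 == 1)); // binary inputs only
  return xor_closed_form(x1, x2) > 0.5;
}
```

For the four binary inputs, `xor_thresholded` returns `false, true, true, false`, matching the XOR truth table.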

### Why the solution is not linear

- Each $h_i=\sigma(a_ix_1+b_ix_2+c_i)$ is **nonlinear** in $(x_1,x_2)$.
- The final boundary is the **nonlinear** implicit curve
  $\sigma(-10x_1-10x_2+15)+\sigma(15x_1+15x_2-10)=1.5$ (not a line).
- In the large-slope limit, $\sigma$ behaves like a Heaviside step, giving the band
  $\tfrac{2}{3}<x_1+x_2<1.5$ as the positive region—again **not** a single hyperplane.

These equations fully specify the XOR realized by this MLP.


## **Why Is Training Needed in the MLP?**

Until this point, we've been using hardcoded weights for the neurons in our Multi-Layer Perceptron (MLP). This approach allows us to quickly test the network's behavior without going through the training process. However, hardcoding weights is not a scalable solution for more complex problems or larger datasets.

<hr>

Also, small problems like the `AND`, `OR` and `XOR` functions can be solved with ordinary `Programming Logic`; they do not need a `Neural Network`.

The real value of a `Neural Network` is its ability to learn complex patterns and generalize from examples, which makes it suitable for a wide range of tasks beyond simple logic functions.

So, instead of hardcoding weights, what if we show the network lots of examples of how an `XOR` behaves so that it can learn from them?

For that we use an algorithm called `backpropagation`, which is a supervised learning algorithm used for training artificial neural networks.

### **Reasons to Train a Neural Network**

**Linear Separability Is Hardly a Given**

In many real-world problems, the data is not linearly separable. This means that a simple linear decision boundary (a straight line in 2D, for example) cannot effectively separate the different classes in the data. Neural networks, especially those with hidden layers, can learn complex, non-linear decision boundaries, making them much more powerful for a wide range of tasks.

Take the below example:

Suppose we're trying to classify `Small or Large` based on `Length` and `Width`. Below is the scatter plot of the data points:

<img src="./Notes_Images/l_w.png" alt="Scatter Plot">

Here, the `line` shown in the graph, i.e. a `Single Perceptron`, is the best straight line we can draw, the one that classifies with the least error. It still misclassifies some points, which indicates that a linear boundary is not sufficient for this problem.

To improve, we can use `MLP` to generalize the decision boundary and better classify the data points.

Below is the graph:

<img src="./Notes_Images/mlp_db.png" alt="Scatter Plot">

We can see here, we are only misclassifying one data point. It's better but not perfect, and that's what we are looking for.

That is the whole point of training a neural network: to minimize misclassifications and improve accuracy.

<hr>

There are three situations:

**Underfitting** occurs when the model is too simple to capture the underlying patterns in the data. This can happen if the model has too few parameters or if it is not trained long enough. Underfitting results in high training and validation errors.

**Overfitting** occurs when the model is too complex and learns the noise in the training data instead of the actual patterns. The model then performs well on the training data (low error) but fails to generalize to new, unseen data (high error). For data points close to the decision boundary, an overfitted model may also become overly sensitive to small fluctuations in the input, leading to erratic predictions.

**Good fitting** occurs when the model is able to generalize well to unseen data. This is the desired outcome of the training process, where the model achieves a balance between bias and variance.

<img src='./Notes_Images/triad_nn.png'>

<hr>

### **Datasets**

Training needs data. A dataset is a collection of examples used to train the model, and each example consists of input features and the corresponding target output.

We teach the network by showing samples to it. The `Neural Network` learns with each feature-label pair.

We split the datasets into three parts:

**Training** : This subset is used to train the model. The model learns from this data by adjusting its weights based on the input-output pairs.

**Validation** : This subset is used to tune the model's hyperparameters and prevent overfitting. The model is evaluated on this data during training, but it does not learn from it.

**Testing** : This subset is used to assess the model's performance after training is complete. The model makes predictions on this data, and its accuracy is measured.
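The three-way split above can be sketched as follows. The `Sample` and `DatasetSplit` types, the `split_dataset` function, and the split fractions passed to it are all assumptions made for this illustration; the notes do not prescribe specific proportions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical feature-label pair.
struct Sample
{
  std::vector<double> features;
  double label;
};

struct DatasetSplit
{
  std::vector<Sample> training, validation, testing;
};

// Shuffles a copy of the data, then cuts it into three consecutive parts.
// Whatever is left after the training and validation parts becomes the test set.
DatasetSplit split_dataset(std::vector<Sample> data, double train_frac,
                           double val_frac, unsigned seed)
{
  assert(train_frac >= 0.0 && val_frac >= 0.0 && train_frac + val_frac <= 1.0);
  std::mt19937 rng(seed);
  std::shuffle(data.begin(), data.end(), rng); // avoid any ordering bias

  const auto n = static_cast<double>(data.size());
  const auto n_train = static_cast<std::ptrdiff_t>(train_frac * n);
  const auto n_val = static_cast<std::ptrdiff_t>(val_frac * n);

  DatasetSplit split;
  split.training.assign(data.begin(), data.begin() + n_train);
  split.validation.assign(data.begin() + n_train, data.begin() + n_train + n_val);
  split.testing.assign(data.begin() + n_train + n_val, data.end());
  return split;
}
```

Shuffling before cutting matters when the raw data is sorted (for example by label), because otherwise one subset could end up containing only one class.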

<hr>

So the only data the `Network` learns from is the `Training` set. The other subsets are used to assess the model's performance and generalization ability.

We have to run through the `Training` set many times for the model to learn effectively. Each such pass is called an `Epoch`. We stop after some number of `Epochs`, once the model's performance on the validation set is no longer improving.
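One way to express the "stop when the validation set no longer improves" rule in code is sketched below. The `stopping_epoch` function and its `patience` parameter (how many non-improving epochs to tolerate) are assumptions made for this example; the notes do not specify an exact stopping criterion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given the validation error recorded after each epoch, returns the 0-based
// epoch at which training would stop: the first epoch at which the error has
// failed to improve on the best value so far for `patience` epochs in a row.
// If that never happens, the last epoch is returned.
std::size_t stopping_epoch(const std::vector<double> &validation_errors,
                           std::size_t patience)
{
  assert(!validation_errors.empty() && patience >= 1);
  double best = validation_errors[0];
  std::size_t epochs_without_improvement = 0;

  for (std::size_t epoch = 1; epoch < validation_errors.size(); epoch++)
  {
    if (validation_errors[epoch] < best)
    {
      best = validation_errors[epoch]; // new best, reset the counter
      epochs_without_improvement = 0;
    }
    else
    {
      epochs_without_improvement++;
      if (epochs_without_improvement >= patience)
        return epoch; // no improvement for `patience` epochs
    }
  }
  return validation_errors.size() - 1;
}
```

In a real training loop the validation errors would be computed as training proceeds rather than passed in as a finished list; the list form is used here only to keep the sketch short.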

After running the `Training` dataset for some `Epochs`, the `Network` will have learned something, so we can evaluate its performance on the `Validation` set to see how well it generalizes to unseen data and how it compares to competing models.

Also, during training we try multiple `Architectures` and `Techniques`. The model that performs best on the validation set is selected for further evaluation on the test set.

<img src='./Notes_Images/rank.png'>

Lastly, the model's performance on the test set is evaluated to get an unbiased estimate of its generalization ability. This is crucial for understanding how well the model will perform in real-world scenarios.

### **What Happens When we Train One Single Training Sample?**

1. Feed an input sample `X` to the `Network`.

2. Compare the output to the correct value of `Y`.

3. Calculate the error.

4. Use this error to adjust the weights of the network through backpropagation.

<hr>
<hr>


## **Training Error Functions**

An error function measures how bad a model's predictions are compared to the actual target values. It quantifies the difference between the predicted output and the true output, providing a way to assess the model's performance during training.

The `Error Function` is a crucial component in training neural networks, as it guides the optimization process. By minimizing the error function, we can improve the model's predictions and overall performance.

We use a training process called `Gradient Descent` to minimize the error function.

Also, we use two error metrics i.e. the `Output Error` for the individual predictions and the `Overall Error` which is also known as the `Loss Function`.

<hr>

### **Output Error**

The **Output Error** measures the difference between the predicted output of the network and the actual target output for a single training example. It provides a way to assess how well the network is performing on individual predictions.

Mathematically, the output error for a single example can be defined as:

$$
E = \frac{1}{2} (y - \hat{y})^2
$$

Where:

- $y$ is the true output (target value)
- $\hat{y}$ is the predicted output (network's output)

The factor of $\tfrac{1}{2}$ is included for convenience, as it simplifies the derivative calculation during backpropagation.

The output error is used to compute the gradients for updating the weights in the network. By minimizing the output error for each training example, we can improve the network's performance on that specific example.
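As a small illustration of the formula above, the output error and its derivative with respect to the prediction can be written as two functions. The names `output_error` and `output_error_gradient` are chosen for this sketch only.

```cpp
#include <cassert>
#include <cmath>

// E = 1/2 * (y - y_hat)^2 for a single training example.
double output_error(double y_true, double y_pred)
{
  assert(std::isfinite(y_true) && std::isfinite(y_pred));
  const double diff = y_true - y_pred;
  return 0.5 * diff * diff;
}

// dE/dy_hat = -(y - y_hat): the factor 1/2 cancels the 2 from the square.
double output_error_gradient(double y_true, double y_pred)
{
  return -(y_true - y_pred);
}
```

For example, with a target of `1.0` and a prediction of `0.5`, the error is `0.125` and the gradient is `-0.5`, meaning the error shrinks if the prediction increases.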

### **Overall Error (Loss Function)**

The **Overall Error**, also known as the **Loss Function**, measures the average error across all training examples. It provides a single scalar value that represents the model's performance on the entire training dataset.

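Mathematically, the overall error is the mean of the per-example errors. This is the `Mean Squared Error (MSE)`, scaled by the $\tfrac{1}{2}$ factor carried over from the output error: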

$$
E_{overall} = \frac{1}{N} \sum_{i=1}^{N} E_i
$$

Where:

- $N$ is the total number of training examples
- $E_i$ is the output error for the $i$-th training example

The overall error is used to guide the optimization process during training. By minimizing the overall error, we can improve the model's performance across all training examples.
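The averaging step can be sketched as below. `overall_error` is a name chosen for this example; it follows the definition in these notes, so each term keeps the $\tfrac{1}{2}$ factor from the output error.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// E_overall = (1 / N) * sum of E_i, with E_i = 1/2 * (y_i - y_hat_i)^2.
double overall_error(const std::vector<double> &targets,
                     const std::vector<double> &predictions)
{
  assert(!targets.empty() && targets.size() == predictions.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < targets.size(); i++)
  {
    const double diff = targets[i] - predictions[i];
    sum += 0.5 * diff * diff; // output error of example i
  }
  return sum / static_cast<double>(targets.size());
}
```

Because every term is squared, a prediction that is `0.5` too high and one that is `0.5` too low contribute the same amount, so they cannot cancel each other out in the average.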

<hr>

The advantage of using `Mean Squared Error (MSE)` as the overall error metric is that it penalizes larger errors more heavily than smaller ones. This is because the errors are squared before being averaged, which means that outliers have a greater impact on the overall error value. This property makes MSE a useful metric for tasks where large errors are particularly undesirable.

Also, MSE is differentiable, which is a key requirement for optimization algorithms like gradient descent. The smooth nature of the MSE curve allows for more stable and efficient convergence during training.

Squaring also removes the `Sign` of the error, so positive and negative errors cannot cancel each other out when they are averaged.

What matters is how large the error is, regardless of its direction.


### **Delta Rule**

It is a simple update formula for adjusting the weights in a neuron. We add the product of the learning rate, the error term, and the input to the weight.

**It considers the following values:**

- The output error

- One Input

- Learning rate factor.

Mathematically, it can be expressed as:

$$
\Delta w = \eta \cdot (d - y) \cdot x
$$

Where:

- $\Delta w$ is the change in weight

- $\eta$ is the learning rate

- $d$ is the desired output

- $y$ is the actual output

- $x$ is the input

This rule helps the model learn from its mistakes by adjusting the weights in the direction that reduces the error.
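A minimal sketch of the Delta Rule in code is shown below, assuming a single neuron with a linear output $y = w \cdot x$ and the bias handled as an extra input fixed at `1`. The functions `delta_rule_step` and `fit_line`, the toy data ($d = 2x + 1$), and the starting weights are all made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Applies one Delta Rule update, w_i += eta * (d - y) * x_i, and returns (d - y).
double delta_rule_step(std::vector<double> &w, const std::vector<double> &x,
                       double d, double eta)
{
  assert(w.size() == x.size());
  double y = 0.0; // linear output y = w . x
  for (std::size_t i = 0; i < w.size(); i++)
    y += w[i] * x[i];

  const double error = d - y;
  for (std::size_t i = 0; i < w.size(); i++)
    w[i] += eta * error * x[i];
  return error;
}

// Trains on four hypothetical points of the line d = 2x + 1.
// The second input component is always 1, so w[1] plays the role of the bias.
std::vector<double> fit_line(int epochs, double eta)
{
  const std::vector<std::vector<double>> inputs = {
      {0.0, 1.0}, {1.0, 1.0}, {2.0, 1.0}, {3.0, 1.0}};
  const std::vector<double> targets = {1.0, 3.0, 5.0, 7.0};

  std::vector<double> w = {0.0, 0.0};
  for (int epoch = 0; epoch < epochs; epoch++)
    for (std::size_t n = 0; n < inputs.size(); n++)
      delta_rule_step(w, inputs[n], targets[n], eta);
  return w;
}
```

With a small learning rate such as `0.05` and enough epochs, the weights returned by `fit_line` approach `{2, 1}`, the slope and intercept of the line that generated the data.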

**Output Error**

The `output error` $(d - y)$ will be `positive` if the predicted output is lower than the desired output and `negative` if it is higher. This means that when we later update the `Weight`, the change moves the output closer to the desired value.

**Summary**

- **The Delta Rule is a _special case_ of Gradient Descent.**
- Specifically, it’s (stochastic) gradient descent on a **single linear neuron** using **squared-error loss**.

Below is the clean picture,

Assume a single neuron with linear activation:
$y = w^\top x$

Use the per-example squared error:
$E = \tfrac12(d - y)^2$

Take the gradient w.r.t. $w$:

$$
\frac{\partial E}{\partial w}
= \frac{\partial}{\partial w}\tfrac12(d - w^\top x)^2
= -(d - y)\,x
$$

One **gradient descent** step does:

$$
w \leftarrow w - \eta \frac{\partial E}{\partial w}
= w + \eta(d - y)\,x
$$

which is exactly the **Delta Rule** given above:

$$
\Delta w = \eta(d - y)\,x.
$$

So, the delta rule = “do (stochastic) gradient descent on a linear unit with MSE.”

Gradient Descent is the general recipe:

$$
w \leftarrow w - \eta\,\frac{\partial E_{\text{overall}}}{\partial w}
$$

where:

- the **model** can be anything differentiable (linear, logistic, deep nets, …),
- the **loss** can be anything differentiable (MSE, cross-entropy, etc.),
- and you can update **per example** (SGD), **per mini-batch**, or **full-batch**.


<hr>


### **Gradient Descent**

Gradient Descent is an optimization algorithm used to minimize the error function by iteratively updating the model's parameters (weights and biases). The basic idea is to compute the gradient (or derivative) of the error function with respect to the model's parameters and then adjust the parameters in the opposite direction of the gradient.

The update rule for a single parameter $w$ can be expressed as:

$$
w = w - \eta \frac{\partial E_{overall}}{\partial w}
$$

Where:

- $w$ is the model parameter (weight or bias)

- $\eta$ is the learning rate (a small positive scalar)

- $\frac{\partial E_{overall}}{\partial w}$ is the gradient of the overall error with respect to the parameter $w$

The learning rate controls the step size of the update. If the learning rate is too large, the optimization process may overshoot the minimum and diverge. If it is too small, the convergence may be slow.
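The effect of the learning rate can be seen on a toy error curve. The sketch below assumes a made-up one-weight error $E(w) = (w - 3)^2$, whose minimum is at $w = 3$; the function name `minimize_quadratic` is hypothetical.

```cpp
#include <cassert>

// Runs `steps` gradient descent updates on E(w) = (w - 3)^2.
// The gradient is dE/dw = 2 * (w - 3), so each update is w -= eta * 2 * (w - 3).
double minimize_quadratic(double w, double eta, int steps)
{
  assert(steps >= 0);
  for (int i = 0; i < steps; i++)
  {
    const double gradient = 2.0 * (w - 3.0);
    w -= eta * gradient; // move against the gradient
  }
  return w;
}
```

Each update multiplies the distance to the minimum by $(1 - 2\eta)$. Starting from `w = 0` and taking 50 steps, `eta = 0.1` ends very close to `3`, `eta = 0.001` has barely moved, and `eta = 1.1` overshoots further on every step and diverges.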

<hr>

### **Understanding with Graph**

Suppose we have a network with several weights, but for now let's study how changing a single weight affects the overall error.

<img src='./Notes_Images/gradient_des.png'>

In the graph above, say that at some point in training the weight $w$ is at a given position on the x-axis, and the corresponding overall error $E_{overall}$ is the height of the curve on the y-axis. As we adjust $w$, we can see how the overall error changes.

We see that we need to increase the weight $w$ to reduce the overall error $E_{overall}$ until we reach the `Global Minimum`.

This is the lowest error we can get by modifying the weight $w$.

**But,**

What if the `Weight` is at a Local Minimum? As shown in the image below:

<img src='./Notes_Images/local_minimum.png'>

As we will initialize the weights randomly, there's a possibility that the weight $w$ starts in the valley around a local minimum. In such cases, gradient descent only moves downhill, so it may settle in the `local minimum` and never reach the global minimum.
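This dependence on the starting point can be reproduced with a toy curve. The sketch below assumes a made-up error function $E(w) = w^4 - 3w^2 + w$, which has a global minimum near $w \approx -1.30$ and a shallower local minimum near $w \approx 1.13$. It is not the error surface of the XOR network.

```cpp
#include <cassert>

// Hypothetical one-weight error curve with two valleys.
double error_curve(double w) { return w * w * w * w - 3.0 * w * w + w; }

// Its derivative: dE/dw = 4w^3 - 6w + 1.
double error_slope(double w) { return 4.0 * w * w * w - 6.0 * w + 1.0; }

// Plain gradient descent from a given starting weight.
double descend(double w, double eta, int steps)
{
  assert(eta > 0.0 && steps >= 0);
  for (int i = 0; i < steps; i++)
    w -= eta * error_slope(w); // always moves downhill
  return w;
}
```

Running `descend(-2.0, 0.01, 1000)` settles in the deeper valley on the left, while `descend(2.0, 0.01, 1000)` settles in the shallower valley on the right and stays there, even though a lower error exists elsewhere.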

<hr>

This was the case for a `Single Weight`, but in real neural networks, we have many weights and biases. The optimization landscape becomes much more complex, with many local minima and saddle points. This makes it challenging for gradient descent to find the global minimum.

For example, let's say we are modifying two weights to manipulate the error. This gives us a `3D` plot where the height is the error, and each pair of values for the `two weights` places a marble at a different point on this surface of mountains and valleys. The objective is to find the lowest point in this 3D landscape, which corresponds to the minimum error.

Below is the `Graph`:

<img src='./Notes_Images/3d_grad.png'>

It becomes more complex as we add more weights and biases, creating a high-dimensional optimization landscape. In this space, the gradient descent algorithm must navigate through various local minima and saddle points to find the global minimum.

So, with `Two Weights` it became a `3D` optimization problem. Similarly, for our `XOR` problem we would have a `10D` optimization problem, as we have `9 Weights` (plus one dimension for the error), which we cannot even visualize graphically.
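The count of `9 Weights` follows from the layer sizes, assuming (as in the hardcoded XOR weights) that each neuron has one weight per input plus one bias weight. A small sketch, with `count_parameters` as a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts the trainable weights of a fully connected network.
// layers[0] is the number of inputs; every later entry is a layer of neurons,
// each with one weight per output of the previous layer plus one bias weight.
std::size_t count_parameters(const std::vector<std::size_t> &layers)
{
  assert(!layers.empty());
  std::size_t total = 0;
  for (std::size_t i = 1; i < layers.size(); i++)
    total += layers[i] * (layers[i - 1] + 1); // +1 for the bias
  return total;
}
```

For the `{2, 2, 1}` XOR network this gives $2 \cdot (2 + 1) + 1 \cdot (2 + 1) = 9$.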


<hr>


## **Types of Gradient Descent**

There are different types of `Gradient Descent` algorithms, each with its own characteristics and use cases:

1. **Batch Gradient Descent**: Computes the gradient using the entire dataset. It provides a stable estimate of the gradient but can be slow and memory-intensive for large datasets.

2. **Stochastic Gradient Descent (SGD)**: Computes the gradient using a single randomly chosen example. It is faster and can escape local minima but introduces noise into the optimization process.

3. **Mini-Batch Gradient Descent**: Combines the advantages of both batch and stochastic gradient descent by using a small random subset (mini-batch) of the data to compute the gradient. It strikes a balance between speed and stability.

4. **Momentum**: An extension of SGD that accumulates a velocity vector in the direction of the gradient, helping to accelerate convergence and reduce oscillations.

5. **Adaptive Learning Rate Methods**: Algorithms like AdaGrad, RMSProp, and Adam adjust the learning rate for each parameter based on past gradients, allowing for more efficient training.

6. **Nesterov Accelerated Gradient (NAG)**: A variant of momentum that looks ahead at the future position of the parameters, providing a more accurate gradient estimate.

Each of these methods has its own strengths and weaknesses, and the choice of which to use depends on the specific problem and dataset.
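The first three variants differ only in how many examples are used per update, which the sketch below makes explicit with a `batch_size` parameter. It assumes a deliberately tiny model, a single weight fitting $d = w \cdot x$, and the function name `fit_slope` is hypothetical. Momentum, NAG, and the adaptive methods are not covered by this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Fits d = w * x by gradient descent on E_i = 1/2 * (d_i - w * x_i)^2.
// batch_size == xs.size() -> batch gradient descent
// batch_size == 1         -> stochastic gradient descent (SGD)
// anything in between     -> mini-batch gradient descent
double fit_slope(const std::vector<double> &xs, const std::vector<double> &ds,
                 std::size_t batch_size, double eta, int epochs, unsigned seed)
{
  assert(xs.size() == ds.size() && batch_size >= 1 && batch_size <= xs.size());
  std::vector<std::size_t> order(xs.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::mt19937 rng(seed);

  double w = 0.0;
  for (int epoch = 0; epoch < epochs; epoch++)
  {
    std::shuffle(order.begin(), order.end(), rng); // random order each epoch
    for (std::size_t start = 0; start < order.size(); start += batch_size)
    {
      const std::size_t end = std::min(start + batch_size, order.size());
      double gradient = 0.0;
      for (std::size_t k = start; k < end; k++)
      {
        const std::size_t i = order[k];
        gradient += -(ds[i] - w * xs[i]) * xs[i]; // dE_i/dw
      }
      gradient /= static_cast<double>(end - start); // average over the batch
      w -= eta * gradient;                          // one update per batch
    }
  }
  return w;
}
```

On clean data such as `xs = {1, 2, 3, 4}` and `ds = {2, 4, 6, 8}`, all three settings converge to a slope of `2`. The difference shows up in how many updates happen per epoch and, on noisy data, in how smooth the path to the minimum is.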

<hr>


## **Backpropagation**

`Backpropagation` is a supervised learning algorithm used for training artificial neural networks. It uses the chain rule to compute the gradient of the `loss function` with respect to each `weight`, and gradient descent then uses these gradients to update the weights efficiently.

This algorithm will update the `Weights` throughout the network.

<hr>
