# Perceptron for the OR problem in NumPy

We will start by building and training a perceptron that will act as an OR gate.

An OR gate is a simple system with two binary inputs and one output. It returns the logical OR of the inputs, that is:

| x_0 | x_1 | out |
| --- | --- | --- |
|  0  |  0  |  0  |
|  0  |  1  |  1  |
|  1  |  0  |  1  |
|  1  |  1  |  1  |

## Requirements 

We will first install and import the necessary requirements.

In [None]:
!pip install numpy matplotlib

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## set random seed for reproducibility 
np.random.seed(42)

A machine learning problem has three components: a dataset, a model, and an optimization method.

## Training data.

We will first create the training data. A training data point has the form (X, Y), where X is the input (in our case, the two values we want to OR) and Y is the label, i.e. the output our model should give (in our case, 0 or 1).

The dataset is the list of training data points.

In [None]:
import numpy as np 
X = np.array([
    [0, 0],  
    [0, 1],  
    [1, 0],  
    [1, 1] 
])
Y = np.array([0,1,1,1])

print(f"The output of an OR gate if the input is {X[0]} is {Y[0]}")

## Model

The model will be the function that will process our data and generate the required output. In our case the model will be a simple perceptron. 

A perceptron is a processing unit that works in a MISO (Multiple Input, Single Output) way. It computes the weighted sum of its inputs and adds a bias. The result is then passed through an activation function to introduce a non-linearity.

In other words, for an input vector $ \mathbf{x} = (x_0, x_1) $, a perceptron $u$ computes $u(\mathbf{x}) = \sigma(w_0 \cdot x_0 + w_1 \cdot x_1 + b)$, where $\sigma$ is the activation function, $w_0, w_1$ are the weights and $b$ is the bias.


### Activation function

In our case, the activation function will be the sigmoid function. 
$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
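
For comparison with your own implementation, here is one possible NumPy sketch of the formula above. The name `sigmoid_sketch` is ours (chosen so that it does not clash with the `sigmoid` function you are asked to write), and the version below is deliberately naive: for very large negative inputs `np.exp` may emit an overflow warning.

```python
import numpy as np

def sigmoid_sketch(x):
    """Element-wise sigmoid 1 / (1 + exp(-x)); accepts scalars or arrays."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))

# sigmoid(0) is exactly 0.5, and the curve is symmetric around that point
print(sigmoid_sketch(0.0))
print(sigmoid_sketch(np.array([-10.0, 10.0])))  # close to 0 and close to 1
```

A quick sanity check is that $\sigma(x) + \sigma(-x) = 1$ for any $x$.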

In [None]:
# TODO : implement the sigmoid function
import numpy as np

def sigmoid(x) :
    return None

You can plot the sigmoid function using matplotlib to validate your implementation. It should look like this :
 
<img src="sigmoid.png">

In [None]:
import matplotlib.pyplot as plt

x = np.arange(-10, 10, 0.1)
plt.plot(x, sigmoid(x), label="sigmoid")
plt.grid()
plt.legend()  # display the "sigmoid" label
plt.title("Sigmoid activation function")
plt.show()

### The perceptron class

Now we need to define the perceptron class as stated above. It should be able to take multiple inputs and be called to provide the output.

Your task is to implement the class and its functionality.

In [None]:
# TODO: implement the perceptron class using numpy

class Perceptron:

    def __init__(self, num_inputs):
        ## TODO: initialize the weights and the bias, given the number of inputs the perceptron takes
        pass

    def __call__(self, x):  ## required so that an instance can be called like a function
        ## TODO: implement the perceptron function
        pass

Once the class has been created, you need to instantiate a perceptron that takes 2 inputs and test it on your dataset. It has not been trained yet, so it will give rubbish results. The next section covers the training procedure.
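
As a hedged illustration of what such a class could look like, here is a minimal sketch. The class name `PerceptronSketch`, the `seed` argument, and the choice of small random weights with a zero bias are our assumptions, not requirements of the exercise; other initializations work too.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class PerceptronSketch:
    """One possible perceptron: sigmoid(w . x + b)."""

    def __init__(self, num_inputs, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: small random weights and a zero bias
        self.w = rng.normal(scale=0.1, size=num_inputs)
        self.b = 0.0

    def __call__(self, x):
        # Works for one input of shape (num_inputs,) or a batch (n, num_inputs)
        return sigmoid(np.dot(x, self.w) + self.b)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([0, 1, 1, 1])

model = PerceptronSketch(num_inputs=2)
for x, y in zip(X, Y):
    print(f"input {x} -> output {model(x):.3f} (label {y})")
```

Since the weights are small and untrained, every output stays close to 0.5, which is what "rubbish results" looks like in practice.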

In [None]:
## TODO : instantiate your model and test it on your training data

## Training the model with gradient descent.

Now that the model is implemented, we need a training procedure to be able to learn the problem. We will use gradient descent to train the model parameters.

In a few words, gradient descent consists of computing the gradient of the loss with respect to the parameters and taking a step in the opposite direction in order to minimize the loss. A more detailed explanation can be found in the course and [here](https://medium.com/@datasciencewizards/a-simple-guide-to-gradient-descent-algorithm-60cbb66a0df9).

Our perceptron has three parameters:

- $w_0$ is the weight for the first input.

- $w_1$ is the weight for the second input.

- $b$ is the bias of the perceptron.


### Loss function and its gradient

We define the L2-loss function between the prediction $ \hat{y} $ and the label $ y $ as:

$$
L(\hat{y},y) = (\hat{y} - y)^2
$$

The prediction $ \hat{y} $ for input data $ \mathbf{x} = [x_0,x_1] $ is a function of the perceptron parameters (the weights $ \mathbf{w} = [w_0, w_1] $ and the bias $ b $), passed through the sigmoid activation function $ \sigma(z) = \frac{1}{1 + e^{-z}} $:

$$
\hat{y} = \sigma(f) = \frac{1}{1 + e^{-f}} \\
~\\
f(\mathbf{x},\mathbf{w},b) = w_0 \cdot x_0 + w_1 \cdot x_1 + b
$$

The derivative of the sigmoid function is:
$$
\frac{\partial \sigma}{\partial z}(z) = \sigma(z) \cdot (1 - \sigma(z))
$$
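
If you want to convince yourself of this identity, one option is to compare it with a central finite difference. The sketch below is only a numerical check; the helper name `sigmoid_derivative` and the step size `eps` are our choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # Closed form: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
# Central finite difference approximation of the derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
max_gap = np.max(np.abs(numeric - sigmoid_derivative(z)))
print(f"largest gap between the two derivatives: {max_gap:.2e}")
```

The gap should be tiny, limited only by floating-point rounding.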

### Derivatives for Gradient Descent

To perform gradient descent, we need to compute the gradient of the loss function with respect to all parameters: $ w_0 $, $ w_1 $, and $ b $. The loss function is a composite function so we need to apply the chain rule. 

1. **Derivative with respect to $ w_0 $:**

   $$
   \frac{\partial L}{\partial w_0} = \frac{\partial L}{\partial {\hat{y}}} \cdot \frac{\partial {\hat{y}}}{\partial f} \cdot \frac{\partial {f}}{\partial w_0}
   $$

   We have:

   $$
   \frac{\partial L}{\partial \hat{y}} = 2 (\hat{y} - y)   \\
   ~\\
   \frac{\partial \hat{y}}{\partial f} = \hat{y} (1-\hat{y}) \\
   ~\\
   \frac{\partial f}{\partial w_0} = x_0
   $$

   Thus:

   $$
   \frac{\partial L}{\partial w_0} = 2(\hat{y} - y) \cdot \hat{y}(1 - \hat{y}) \cdot x_0
   $$

2. **Derivative with respect to $ w_1 $:**

   Similarly:

   $$
   \frac{\partial L}{\partial w_1} = 2(\hat{y} - y) \cdot \hat{y}(1 - \hat{y}) \cdot x_1
   $$

3. **Derivative with respect to $ b $:**

   $$
   \frac{\partial L}{\partial b} = 2(\hat{y} - y) \cdot \hat{y}(1 - \hat{y})
   $$
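
The three derivatives above share the common factor $2(\hat{y} - y) \cdot \hat{y}(1 - \hat{y})$, which makes them easy to code. The sketch below computes them for a single data point and checks them against finite differences. The function names and the example values of `x`, `y`, `w` and `b` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

def analytic_gradients(x, y, w, b):
    """Gradients of the L2 loss w.r.t. w and b, from the chain rule."""
    y_hat = sigmoid(np.dot(w, x) + b)
    common = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)
    return common * x, common

def numeric_gradients(x, y, w, b, eps=1e-6):
    """Central finite differences, used only to check the formulas."""
    def loss(w_, b_):
        return l2_loss(sigmoid(np.dot(w_, x) + b_), y)
    grad_w = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad_w[i] = (loss(w + step, b) - loss(w - step, b)) / (2 * eps)
    grad_b = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
    return grad_w, grad_b

# Made-up example point and parameters
x, y = np.array([1.0, 1.0]), 1.0
w, b = np.array([0.3, -0.2]), 0.1

gw, gb = analytic_gradients(x, y, w, b)
nw, nb = numeric_gradients(x, y, w, b)
print("analytic:", gw, gb)
print("numeric :", nw, nb)
```

Here the label is 1 and the prediction is below 1, so all three gradients are negative: a descent step will increase the weights and the bias.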

### Parameter Update Rules

The parameters are updated as follows:

- $ w_0 \leftarrow w_0 - \alpha \cdot \frac{\partial L}{\partial w_0} $

- $ w_1 \leftarrow w_1 - \alpha \cdot \frac{\partial L}{\partial w_1} $

- $ b \leftarrow b - \alpha \cdot \frac{\partial L}{\partial b} $

Where $ \alpha $ is the learning rate.
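
To make the update rules concrete, here is a hedged sketch of a complete training run on the OR data. It updates the parameters after every data point; the function name `train_sketch`, the learning rate of 1.0, the 2000 epochs and the random initialization are assumptions chosen for illustration, not prescribed values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sketch(X, Y, lr=1.0, num_epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])  # assumption: small random init
    b = 0.0
    epoch_losses = []
    for _ in range(num_epochs):
        total = 0.0
        for x, y in zip(X, Y):
            y_hat = sigmoid(np.dot(w, x) + b)      # forward pass
            total += (y_hat - y) ** 2              # L2 loss for this point
            common = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)
            w = w - lr * common * x                # w_i <- w_i - alpha * dL/dw_i
            b = b - lr * common                    # b   <- b   - alpha * dL/db
        epoch_losses.append(total / len(X))        # mean loss over the epoch
    return w, b, epoch_losses

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0, 1, 1, 1], dtype=float)

w, b, losses = train_sketch(X, Y)
print(f"first epoch loss {losses[0]:.4f}, last epoch loss {losses[-1]:.4f}")
print("weights", w, "bias", b)
```

Keeping the per-epoch losses in a list makes it straightforward to plot the training curve afterwards.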


In [None]:
# TODO: implement the L2 loss function
def l2_loss(y_pred, y_true):
    return 0

In [None]:
# TODO: implement the training loop
def train(model, X, Y, lr, num_epochs):
    ### Loop over epochs
    for epoch in range(num_epochs):
        pass
        ### TODO: you need to:
        ### - loop over your training data
        ### - set up the data as input / label
        ### - forward pass your input through your model
        ### - compute the loss between your prediction and the label
        ### - compute the gradient of the loss
        ### - take a gradient descent step to update the model's parameters using the learning rate lr

        ### optional: add some prints and save your loss so you can plot it later.

In [None]:
## TODO: run the training on your model
import numpy as np
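## instantiate model (the constructor expects the number of inputs)
model = Perceptron(num_inputs=2)

## select relevant hyperparameters
lr = 0 # learning rate (placeholder: choose a positive value)
num_epochs = 0 # number of training epochs (placeholder: choose a positive integer)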

## run the training loop on your data 
train(model, X, Y, lr, num_epochs)

### Visualization of the loss

Create plots to show the loss over time.

In [None]:
## TODO : plot the loss over each epoch 

### Making a decision.

Now that the model is trained, we can feed it inputs and look at its decisions. The model currently outputs a value between 0 and 1, but we want it to output exactly 0 or 1. To do so, we threshold the output: if the value is greater than $0.5$, we set the decision to 1, and if it is smaller than $0.5$, we set it to 0. This decision step converts a probability-like score into an actual choice.
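
The thresholding step can be sketched in one line of NumPy. The helper name `decide` and the example outputs are made up for illustration; note that this particular sketch maps a value of exactly 0.5 to 0, which is an arbitrary choice.

```python
import numpy as np

def decide(outputs, threshold=0.5):
    """Map outputs in (0, 1) to hard 0/1 decisions."""
    return (np.asarray(outputs) > threshold).astype(int)

# Made-up outputs, such as a trained OR perceptron might produce
outputs = np.array([0.08, 0.95, 0.94, 0.99])
print(decide(outputs))  # prints [0 1 1 1]
```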

In [None]:
## TODO : Compute the decision of the model for each of the inputs.

## The XOR problem

The XOR ("exclusive or") operation outputs 1 if and only if exactly one of the inputs is 1.

The truth table of the XOR operation is as follows:

| x_0 | x_1 | out |
| --- | --- | --- |
|  0  |  0  |  0  |
|  0  |  1  |  1  |
|  1  |  0  |  1  |
|  1  |  1  |  0  |

Train a perceptron on the XOR problem and show that the results are not good: XOR is not linearly separable, so a single perceptron cannot classify all four inputs correctly.
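
As a hedged preview of what to expect, the sketch below trains a single sigmoid perceptron on XOR with per-sample gradient descent. The function name `train_sketch` and the hyperparameters are illustrative assumptions. Whatever the hyperparameters, thresholding $\sigma(w_0 x_0 + w_1 x_1 + b)$ at 0.5 draws a straight line in the input plane, so at most three of the four XOR points can end up on the correct side.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sketch(X, Y, lr=1.0, num_epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])  # assumption: small random init
    b = 0.0
    for _ in range(num_epochs):
        for x, y in zip(X, Y):
            y_hat = sigmoid(np.dot(w, x) + b)
            common = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)
            w = w - lr * common * x
            b = b - lr * common
    return w, b

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y_xor = np.array([0, 1, 1, 0])

w, b = train_sketch(X_xor, Y_xor)
outputs = sigmoid(X_xor @ w + b)
decisions = (outputs > 0.5).astype(int)
accuracy = np.mean(decisions == Y_xor)

print("outputs  ", np.round(outputs, 3))
print("decisions", decisions, "labels", Y_xor)
print("accuracy ", accuracy)
```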

In [None]:
## TODO : Create the dataset of the XOR problem

X = None
Y = None

## TODO : Instantiate a perceptron
model = Perceptron(num_inputs=2)

## TODO : Train the model
train(model, X, Y, lr, num_epochs)

## TODO : Show the decision of the model for each input 
