# Lab - Training `and` and `xor` Feedforward NNs in `torch`
This notebook serves as the starter code and lab description covering **Chapter 21 - Deep Learning (Part 1)** from the book *Artificial Intelligence: A Modern Approach.*

In [1]:
import torch
device = torch.device('cpu')
# if you are using a gpu or you want your code be flexibly running over both CPU and GPU use the following line instead:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## OVERVIEW
In the lecture, we talked about the simple feedforward networks and the way they work. In this lab, we learn to use `torch` library to create and train such networks.

## First, training `and` network
The first network that we create is a simple network to perform binary `and` operation on two binary inputs and calculate the result. For this purpose, lets first create the input/output data. In `torch`, the networks work with an object type called `Tensor`. Simply put, tensors are multi-dimensional matrices (e.g. a 1-d tensor is an actual vector, 2-d tensors are also equivalent to want you have seen as a matrix, higher than 2-d tensors are also valid and can be used to represent different dimensions of data). Read [here](https://pytorch.org/docs/stable/tensors.html) and familiarize yourself with different tensor types.

In our example, we create input/output pairs as `torch.FloatTensor` as most of the loss calculation objects in torch work with floating point typed tensors. 

**Note(1):** you remember that in the lecture we talked about batching the input data while training. `torch` library lets you batch up many input records in one `torch.FloatTensor` object.  However, as we are not going to use multi-record batches (for the sake of simplicity), you see the records receiving a 2-dimentional list containing only one input record (e.g. torch.FloatTensor(**\[\[0, 0\]\]**)).

**Note(2):** each index of input object (`data_x`) pairs up with the item with the same index in output object (`data_y`).

In [2]:
# here are expected binary input/output pairs defined using torch tensors
data_x = [
    torch.FloatTensor([[0, 0]]), 
    torch.FloatTensor([[0, 1]]), 
    torch.FloatTensor([[1, 0]]), 
    torch.FloatTensor([[1, 1]])
]
data_y = [
    torch.FloatTensor([0]), 
    torch.FloatTensor([0]), 
    torch.FloatTensor([0]), 
    torch.FloatTensor([1])
]

Now that we have the network input/outputs ready, we need to create the feedforward network model. Our model, however, does not have to be really complicated. It wil be just an instance of `torch.nn.Linear` (the implementation of the simple feedforward layer in `torch` library. Read more about it [here](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)).

For the next step, create an instance of `torch.nn.Linear` with the input dimension size of `2` and output dimension size of `1`. For more help, you can use the documentation of the `torch.nn.Linear` class, mentioned earlier.

**Note: don't forget to call `.to(device)` on any model that you create to support underlying hardware changes from GPU to CPU and the reverse.**

In [3]:
# TODO create the `model` object here
model = torch.nn.Linear(2, 1)



Next thing we need is an instance of a loss calculation object which can help the training process. [Here](https://pytorch.org/docs/stable/nn.html#loss-functions) you can find a complete list of all the implemented loss functions in `torch` library. For this section, we focus on mean squeared error loss function [`torch.nn.MSELoss`](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html).

Read through the documentation of `torch.nn.MSELoss` and create an instance of it with the parameter (`reduction='sum'`) which guides the network to augment the loss values for each instance by summing them up (as opposed to averaging them).

In [4]:
# TODO create and instance of `MSELoss` and call it loss_fn
loss_fn = torch.nn.MSELoss(reduction='sum')





Next, implement a function `train` which
1. Receives an input/output pair (e.g. one `(x,y)` pair from the pairs we stored in `data_x` and `data_y`).
 - It should also receive the model instance as well as the `learning_rate` parameter.
2. Calculates the model prediction on `x`.
3. Calculates the loss using the model prediction and `y`.
4. Runs `model.zero_grad()` to reset the gradient variables of the computation graph in the network.
5. Performs `backward()` on the loss.
6. Updates each of model parameters (e.g. `param in model.parameters()`) with the value of `learning_rate * param.grad`.
 - Make sure this step is done under `with torch.no_grad()` block.
7. Returns the value of loss at the end.
 

In [5]:
# output = model(data_x)
# out = loss_fn(data_x,data_y)
# model.zero_rad()
# out.backward()
# model.parameters()

In [18]:
# TODO implement `train` function here

def train(x,y, model, learning_rate):
    model.zero_grad()
    Y_hat = model(x)
    loss = loss_fn(Y_hat,y)
    
    loss.backward()
#     with torch.no_grad():
    for param in model.parameters():
        param = param.data+ learning_rate* param.grad
            
    return loss.item()    






The last training step would be to use the function we just implemented to train the feedforward `model` which we created earlier. 
- Set the value of `learning_rate` to `1`.
- Iterate over the training data (pairs in `data_x` and `data_y`) for about 500 epochs (this should be done really quickly so don't worry). 
  - After each epoch if the loss of the last instance (or the average of the losses of all the instances to be more accurate!) is less than `0.01`, make sure you break the outer loop (which is performing training epochs). This is called early stopping. 
  - Print out the value of loss (you can comment this line in your submission).
  - After each 50 epochs, half the `learning_rate` (this is called learning rate decay).

In [19]:
# TODO perform the training loop in here
learning_rate= 1
epoch = 500
loss = 0
loss_avg = 0
for i in range(epoch):
    for x,y in zip(data_x, data_y):
#             print(x,y)
        loss = loss + train(x,y,model,learning_rate)
            
    loss_avg = loss_avg + (loss/len(data_x))
    if loss_avg < 0.1:
        break
    if epoch == 50:
        learning_rate = learning_rate/2


The next cell will perform an evaluation on the model you just trained (I expect you get 4 `Correctly predicted` messages, if you didn't re-run the training cell and then re-run the test cell).

In [20]:
print(model.weight)
print(model.bias)

for x, y in zip(data_x, data_y):
    y_pred = model(x) > 0.5
    print(x.numpy()[0], "\tPrediction: {}\t{} predicted.".format(y_pred.item(), "Correctly" if y.item() == y_pred.item() else "Incorrectly"))

Parameter containing:
tensor([[-0.0397,  0.0941]], requires_grad=True)
Parameter containing:
tensor([0.2987], requires_grad=True)
[0. 0.] 	Prediction: False	Correctly predicted.
[0. 1.] 	Prediction: False	Correctly predicted.
[1. 0.] 	Prediction: False	Correctly predicted.
[1. 1.] 	Prediction: False	Incorrectly predicted.


## Next, training `xor` network
Now that we have some hands on experience try reusing what you just implemented on the following train set representing binary `xor` operation, and show the results.

In [21]:
data_x = [
    torch.FloatTensor([[0, 0]]), 
    torch.FloatTensor([[0, 1]]), 
    torch.FloatTensor([[1, 0]]), 
    torch.FloatTensor([[1, 1]])
]
data_y = [
    torch.FloatTensor([1]), 
    torch.FloatTensor([0]), 
    torch.FloatTensor([0]), 
    torch.FloatTensor([1])
]

In [22]:
# TODO re-create the model object and reuse the training loop and report the test results for it here
model_xor = torch.nn.Linear(2, 1)
# TODO perform the training loop in here
learning_rate= 1
epoch = 500
loss = 0
loss_avg = 0
for i in range(epoch):
    for x,y in zip(data_x, data_y):
#             print(x,y)
        loss = loss + train(x,y,model_xor,learning_rate)
            
    loss_avg = loss_avg + (loss/len(data_x))
    if loss_avg < 0.1:
        break
    if epoch == 50:
        learning_rate = learning_rate/2









## training `xor` network with multi-layer perceptron (MLP) module
Now lets's try a more complex model for our `xor` dataset. The model we intend to use will have two fully-connected (feedforward; `torch.nn.Linear`) layers put together using [`torch.nn.Sequential`](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) container. Look at [here](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) to see how Sequential container works. For a better performance, have the `Sequential` instance pass the results of the first fully-connected layer through a `torch.nn.Tanh` non-linearity before passing it through the second fully-connected layer. 

Using this architecture, your model will first map the input of size `2` to a hidden value of size `4` (we are setting these values) and then maps the hidden vector to the output of size `1`. In the next cell, implement this model and use the `xor` data and the `train` function and the created `loss_fn` instance to train the new model.

**Note(1): \[again\] don't forget to call `.to(device)` on any model that you create to support underlying hardware changes from GPU to CPU and the reverse.**

**Note(2): since this new model has more parameters, set the initial value of `learning_rate` to `0.25` instead of `1`.**

In [24]:
# TODO create the MLP module here
learning_rate= 0.25
model = torch.nn.Sequential(
          torch.nn.Linear(2, 1),
          torch.nn.Tanh() ,
          torch.nn.Linear(2, 1)
        )


The next cell will perform an evaluation on the model you just trained.

In [25]:
print(model[0].weight)
print(model[0].bias)
print(model[2].weight)
print(model[2].bias)

for x, y in zip(data_x, data_y):
    y_pred = model(x) > 0.5
    print(x.numpy()[0], "\tPrediction: {}\t{} predicted.".format(y_pred.item(), "Correctly" if y.item() == y_pred.item() else "Incorrectly"))

Parameter containing:
tensor([[ 0.2211, -0.2304]], requires_grad=True)
Parameter containing:
tensor([-0.4005], requires_grad=True)
Parameter containing:
tensor([[-0.3336,  0.6473]], requires_grad=True)
Parameter containing:
tensor([-0.4531], requires_grad=True)


RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1 and 2x1)

Compare the prediction results of the MLP model with the single layer perceptron (the one we first trained on the `xor` data) in the next cell.

## Analysis
...

## Further Reading
You definitely need to learn `torch` in great detail. [Here](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html) is where you can get started. Please talk to me if you had any problem understanding the examples provided in this link.