# PyTorch Journey Part 1
Hi all, this notebook contains the first episode of my **PyTorch** Series. 

The other notebooks of this series:
* [Part 2: CNN & Gradient Accumulation](https://www.kaggle.com/milankalkenings/pytorch-2-cnn-gradient-accumulation/edit)
* [Part 3: (Batch) Normalization](https://www.kaggle.com/milankalkenings/pytorch-3-batch-normalization)

<h1 style="background-color:SteelBlue; color:white" >-> Content:</h1>

## 0. [Prerequisites](#sec1)

## 1. [Tensor Fundamentals](#sec2)
#### 1.1. [Change the Appearance of Tensors](#sec21)
#### 1.2. [Tensor Broadcasting](#sec22)
#### 1.3. [Use Tensors on GPUs](#sec23)
#### 1.4. [Autograd](#sec24)

## 2. [Linear Layers](#sec3)

## 3. [FNN & Trainloop](#sec4)
#### 3.1. [TensorDatasets](#sec41)
#### 3.2. [Samplers](#sec42)
#### 3.3. [DataLoaders](#sec43)
#### 3.4. [Create the Model & Investigate it](#sec44)
#### 3.5. [Train and Validation Loop](#sec45)

<a id="sec1"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 0. Prerequisites</h1>


1. **Python fundamentals:** [This simple & free Kaggle Course](https://www.kaggle.com/learn/python) is already enough!
2. **Notebooks & NumPy:** [Chapters 1 & 2 of this free book](https://jakevdp.github.io/PythonDataScienceHandbook/) are probably the best way to learn them!

From now on, I expect you to be familiar with the concepts covered in these sources. I don't expect any further Python skills.

<a id="sec2"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 1. Tensor Fundamentals</h1>

There are multiple ways of interpreting tensors. Physicists view them differently than we do, but our point of view is sufficient for our use cases:

A tensor can be interpreted as an array. We can create arrays of arrays; arrays of arrays of arrays and so on.

* a scalar $\in \mathbb{R}$ can be interpreted as a tensor of rank 0
* an array of scalars / a vector $\in \mathbb{R}^n$ can be interpreted as a tensor of rank 1
* an array of arrays of scalars / a matrix $\in \mathbb{R}^{n \times m}$ can be interpreted as a tensor of rank 2
* an array of arrays of arrays of scalars $\in \mathbb{R}^{n \times m \times k}$ can be interpreted as a tensor of rank 3, and so on.

PyTorch tensors behave much like NumPy arrays but have some additional beneficial properties, as we will see later on.

In [None]:
import numpy as np
import torch

# create a tensor from a list or a np array
array = [[[1, 2, 3, 4],
          [5, 6, 7, 8],
          [0, 0, 0, 1]],
         
         [[4, 3, 2, 1],
          [8, 7, 6, 5],
          [1, 0, 0, 0]]
        ]

# "bridge" to torch tensors
tensor = torch.tensor(array)
print(tensor, "\n\n")

# get the dimensionality of the tensor
print("size =", tensor.size())

# get the dtype of the tensor
print("dtype =", tensor.dtype)

The *size* of the tensor, `torch.Size([2, 3, 4])`, tells us that this is a tensor of rank 3: it consists of 2 tensors of rank 2, each of which stores 3 tensors of rank 1. Each of these rank 1 tensors contains 4 tensors of rank 0.
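
Put differently, the rank of a tensor is simply the length of its size. Here is a minimal sketch that builds one tensor per rank and reads the rank with `Tensor.dim()`; the variable names are only illustrative:

```python
import torch

scalar = torch.tensor(3.0)           # rank 0
vector = torch.tensor([1.0, 2.0])    # rank 1
matrix = torch.ones(2, 3)            # rank 2
cube = torch.zeros(2, 3, 4)          # rank 3

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    # dim() returns the rank, size() the extent of every dimension
    print(name, "rank:", t.dim(), "size:", tuple(t.size()))

ranks = [t.dim() for t in (scalar, vector, matrix, cube)]
```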

In some cases it might be necessary to bridge from tensors to numpy arrays:

In [None]:
print("tensor:\n", tensor.numpy())
print("\ntype:\n", type(tensor.numpy()))
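
One detail that is easy to miss: for a CPU tensor, `tensor.numpy()` does not copy the data, so the tensor and the array share the same memory. The following small sketch (with made-up values) shows the effect and one way to get an independent copy:

```python
import numpy as np
import torch

t = torch.tensor([1, 2, 3])
a = t.numpy()            # shares memory with t (works for CPU tensors)
a[0] = 100               # changing the array also changes the tensor

# the reverse bridge, torch.from_numpy, shares memory as well
b = torch.from_numpy(np.array([4, 5, 6]))

# clone() first if the array should be independent of the tensor
c = t.clone().numpy()
c[1] = -1                # t stays untouched

print(t, a, b, c)
```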

<a id="sec21"></a>
***
<h2 style="background-color:SteelBlue; color:white" >-> 1.1. Change the Appearance of Tensors</h2>

The tensor from the example above is of *dtype=torch.int64*. We might change this in later operations. Casting the *dtype* of a tensor is as easy as:

In [None]:
print("tensor double:\n", tensor.double())

`tensor.reshape()` works similarly to `numpy.reshape()` (`tensor.view()` is a closely related method that never copies the data). I use it in the following example to transform a tensor of rank 3 into a tensor of rank 2. Try out different sizes to get a feeling for the results!

In [None]:
print("reshaped:\n", tensor.reshape(2, 12))
print("\nsize:", tensor.reshape(2, 12).size())

In [None]:
print("reshaped:\n", tensor.reshape(3, 8))
print("\nsize:", tensor.reshape(3, 8).size())

To conclude: reshaping a rank $n$ tensor fills the desired shape element by element, beginning with the first element of the first rank $n-1$ tensor, so the order of the elements never changes.
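
The sketch below checks this on a toy tensor with the values 0 to 23. It also shows the `-1` placeholder and the main practical difference between `view()` and `reshape()`; the tensor and variable names are my own examples:

```python
import torch

t = torch.arange(24).reshape(2, 3, 4)    # values 0..23, rank 3

# reshaping never reorders elements: the flattened order stays the same
same_order = torch.equal(t.reshape(3, 8).flatten(), t.flatten())

# -1 lets PyTorch infer one dimension from the number of elements
inferred = t.reshape(-1, 6)              # size (4, 6)

# view() needs a compatible memory layout, reshape() copies if necessary
transposed = t.transpose(0, 2)           # non-contiguous, size (4, 3, 2)
flat = transposed.reshape(24)            # works (makes a copy)
try:
    transposed.view(24)
    view_failed = False
except RuntimeError:
    view_failed = True

print(same_order, tuple(inferred.size()), view_failed)
```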

Data often comes in a format similar to the following, with some dimensions of size 1:

In [None]:
tensor = tensor.reshape(1, 2, 1, 12)
print("tensor:\n", tensor)
print("\nsize:", tensor.size())

The method [squeeze](https://pytorch.org/docs/stable/generated/torch.squeeze.html) was created to handle exactly this situation: it removes dimensions of size 1. It can be applied to one specific dimension or to all dimensions at once.

In [None]:
tensor_squeezed = tensor.squeeze(dim=0)
print("tensor:\n", tensor_squeezed)
print("\nsize:", tensor_squeezed.size())

In [None]:
tensor_squeezed = tensor.squeeze()
print("tensor:\n", tensor_squeezed)
print("\nsize:", tensor_squeezed.size())
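
Two more things are worth trying here: squeezing a dimension whose size is not 1 leaves the tensor unchanged, and `unsqueeze()` is the counterpart that inserts a dimension of size 1. A short sketch on a tensor of zeros with the same size as above:

```python
import torch

t = torch.zeros(1, 2, 1, 12)

unchanged = t.squeeze(dim=1)     # dim 1 has size 2, so nothing is removed
only_third = t.squeeze(dim=2)    # removes only the third dimension

# unsqueeze() inserts a dimension of size 1 at the given position
restored = t.squeeze().unsqueeze(0).unsqueeze(2)

print(unchanged.size(), only_third.size(), restored.size())
```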

<a id="sec22"></a>
***
<h2 style="background-color:SteelBlue; color:white" >-> 1.2. Tensor Broadcasting</h2>

Similar to NumPy arrays, we can add, subtract, multiply, ... two tensors.

The following example shows that tensor operations stick to the rules of **NumPy broadcasting** as explained in [this chapter](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html) of the book listed in the prerequisites:

If we add a tensor **a of rank 2** to a tensor **b of rank 1**, b is added to each tensor of rank 1 stored in a:

In [None]:
tensor_a = torch.tensor([[1, 2, 3],
                         [4, 5, 6]])

tensor_b = torch.tensor([7, 8, 9])

tensor_a + tensor_b
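
To make the rule more tangible, here is a small sketch with the same values plus two extra cases: a column that is stretched along the other dimension, and a pair of sizes that cannot be broadcast. The extra tensors are my own examples:

```python
import torch

a = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])        # size (2, 3)
b = torch.tensor([7, 8, 9])          # size (3,)

row_wise = a + b                     # b is treated as size (1, 3)

col = torch.tensor([[10], [20]])     # size (2, 1)
col_wise = a + col                   # col is stretched along dim 1

# sizes that cannot be aligned raise an error
try:
    a + torch.tensor([1, 2])         # trailing dimensions 3 and 2 clash
    failed = False
except RuntimeError:
    failed = True

print(row_wise, col_wise, failed)
```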

<a id="sec23"></a>
***
<h2 style="background-color:SteelBlue; color:white" >-> 1.3. Use Tensors on GPUs</h2>


One of the most important benefits of tensors is that you can perform tensor operations on the GPU (if you have one):

In [None]:
# "cuda" is only available if your NVIDIA GPU is activated
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# move the tensors to the device;
# .to() returns the moved tensor, so the result has to be assigned.
# all operations on them are then performed on that device,
# and both operands have to be on the same device
tensor_a = tensor_a.to(device)
tensor_b = tensor_b.to(device)

tensor_a + tensor_b
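
A compact variant of the same pattern, as a sketch that also runs without a GPU. Note that `.to()` returns the moved tensor instead of changing it in place, and that a tensor has to be moved back to the CPU before it can be bridged to NumPy:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

# .to() returns the moved tensor; the result must be assigned
a = a.to(device)
b = b.to(device)

# tensors can also be created on the target device directly
c = torch.ones(3, dtype=torch.int64, device=device)

result = a + b + c
print(result.device)                 # same device as the operands

result_np = result.cpu().numpy()     # back to the CPU before bridging to NumPy
```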

<a id="sec24"></a>
***
<h2 style="background-color:SteelBlue; color:white" >-> 1.4. Autograd</h2>

We can describe every tensor operation as a function. PyTorch Autograd allows us to calculate the gradient of such a function with respect to each individual input. Let me give you an example (if you are not familiar with matrix calculus, you can always refer to [this article](https://en.wikipedia.org/wiki/Matrix_calculus)). The function

$$f:\mathbb{R}^2\rightarrow \mathbb{R}, f(x) = x^Tx+1 = x_1^2+x_2^2 +1$$

has the following partial derivatives:
$$\frac{\partial f(x)}{\partial x_1}=2x_1$$

$$\frac{\partial f(x)}{\partial x_2}=2x_2$$

So for $x = (2, 3)^T$ we obtain:


$$f(\left(\begin{array}{c} 2 \\ 3 \end{array}\right))=2^2+3^2+1=14$$

$$\frac{\partial f(x)}{\partial x_1}=4$$

$$\frac{\partial f(x)}{\partial x_2}=6$$

In [None]:
def f(x):
    return x.t()@x + 1

x = torch.tensor([[2], 
                  [3]],dtype=torch.float32)

# requires_grad_() enables autograd functionality for that tensor
# the underscore in the end denotes inplace operations in PyTorch
x.requires_grad_()

y = f(x)

print(y)

In [None]:
y.backward() # calculates the gradient w.r.t. every input
x.grad

`x.grad` now holds the gradient w.r.t. every input of the function, $(4, 6)^T$, which matches the partial derivatives derived by hand.
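
As a sanity check, the sketch below compares the autograd result with the hand-derived gradient $2x$. It also shows that gradients accumulate over several `backward()` calls until they are reset, which is the reason gradients are zeroed in every step of a training loop:

```python
import torch

x = torch.tensor([[2.0], [3.0]], requires_grad=True)
y = (x.t() @ x + 1).sum()        # sum() turns the 1x1 result into a scalar
y.backward()

analytic = 2 * x.detach()        # the hand-derived gradient 2x
matches = torch.allclose(x.grad, analytic)

# gradients accumulate across backward() calls until they are reset
y2 = (x.t() @ x + 1).sum()
y2.backward()
accumulated = x.grad.clone()     # now 2 * (2x)
x.grad.zero_()                   # reset in place

print(matches, accumulated, x.grad)
```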

<a id="sec3"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 2. Linear Layers</h1>

PyTorch combines tensors and tensor operations into modules, which can then be used in artificial neural networks. The simplest module is the [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layer.

Let's create such a layer from scratch!

Given some input $x$, the linear layer computes its output $y$ via $y=xA^T+b$. Both the weight matrix $A$ and the bias $b$ are adapted during the training process so that the output $y$ becomes as close as possible to the ground truth.

In [None]:
import torch.nn as nn

class CustomLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        A = torch.randn(size=(out_features, in_features))
        self.weight = nn.Parameter(data=A, requires_grad=True)  # weight
        b = torch.randn(size=(out_features,))
        self.bias = nn.Parameter(data=b, requires_grad=True)  # bias
        
    def forward(self, x):
        return x @ self.weight.t() + self.bias

The `CustomLinear` module is a simplified implementation of the [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) module. The fundamental functionalities (forward and backward) are the same, so `CustomLinear` can already be used in a fully connected network. Note that this layer is nothing but a combination of tensors, some of them with `requires_grad=True`, and tensor operations. `A` and `b` are wrapped in `nn.Parameter`, so they are registered as parameters of the module and are iteratively altered during the training process to improve its performance. I named them `weight` and `bias`, respectively, to match the naming of the original linear layer.
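
To convince ourselves that both layers compute the same function, we can copy the parameters of a `CustomLinear` layer into an `nn.Linear` layer and compare the outputs. This is only a sketch: the class is repeated in condensed form so that the cell runs on its own, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class CustomLinear(nn.Module):
    """Condensed copy of the layer above, repeated to keep the sketch self-contained."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.randn(out_features))

    def forward(self, x):
        return x @ self.weight.t() + self.bias

custom = CustomLinear(in_features=4, out_features=2)
reference = nn.Linear(in_features=4, out_features=2)

# copy the parameters so that both layers compute the same function
with torch.no_grad():
    reference.weight.copy_(custom.weight)
    reference.bias.copy_(custom.bias)

x = torch.randn(5, 4)                # a batch of 5 observations
same_output = torch.allclose(custom(x), reference(x), atol=1e-5)

# nn.Parameter attributes are registered automatically
param_names = [name for name, _ in custom.named_parameters()]
print(same_output, param_names)
```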

<a id="sec4"></a>
***
<h1 style="background-color:SteelBlue; color:white" >-> 3. FNN & Trainloop</h1>

Now that we have a feeling for the deep learning perspective on tensors and how to calculate gradients from tensor operations, we can finally train our first neural network.

I will use an **FNN** (feedforward neural network) to explain the most important components of the training process.

<a id="sec41"></a>
***
<h2 style="background-color:SteelBlue; color:white" >-> 3.1. TensorDatasets</h2>


In [None]:
import pandas as pd
from torch.utils.data import TensorDataset

# train
data = pd.read_csv("../input/mnist-in-csv/mnist_train.csv")
X = torch.tensor(data.drop(["label"], axis=1).values, dtype=torch.float).to(device)
y = torch.tensor(data["label"].values, dtype=torch.long).to(device)
data_t = TensorDataset(X, y)

# val
data = pd.read_csv("../input/mnist-in-csv/mnist_test.csv")
X = torch.tensor(data.drop(["label"], axis=1).values, dtype=torch.float).to(device)
y = torch.tensor(data["label"].values, dtype=torch.long).to(device)
data_v = TensorDataset(X, y)

Each entry of a TensorDataset contains the independent variables and the target / dependent variable(s) of one observation.

In [None]:
obs_1 = data_t[0]

print("Type of the first observation:\n", type(obs_1), "\n")
print("The first observation:\n", obs_1, "\n")
print("The independent variables of the first observation:\n", obs_1[0], "\n")
print("The target of the first observation:\n", obs_1[1], "\n")
print("The training dataset contains", len(data_t), "observations.")
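
If you don't have the CSV files at hand, the same structure can be reproduced with random data. In this sketch, `X_toy`, `y_toy` and `toy_data` are hypothetical stand-ins for the MNIST tensors:

```python
import torch
from torch.utils.data import TensorDataset

# hypothetical stand-in for MNIST: 100 observations with 784 features each
X_toy = torch.rand(100, 784)
y_toy = torch.randint(low=0, high=10, size=(100,))
toy_data = TensorDataset(X_toy, y_toy)

# indexing returns a tuple (features, target) for one observation
features, target = toy_data[0]
print(len(toy_data), features.size(), target.size())
```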

<a id="sec42"></a>
***
<h2 style="background-color:SteelBlue; color:white" >-> 3.2. Samplers</h2>

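Samplers allow us to define data drawing policies, i.e. the order in which observations are drawn from a dataset. A `RandomSampler` draws them in random order, by default without replacement.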

In [None]:
from torch.utils.data import RandomSampler

train_sampler = RandomSampler(data_source=data_t)
val_sampler = RandomSampler(data_source=data_v)
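
A sampler is just an iterable over dataset indices, which becomes obvious when we turn it into a list. The sketch below uses a tiny toy dataset and compares `RandomSampler` with `SequentialSampler`, which draws the observations in their stored order:

```python
import torch
from torch.utils.data import TensorDataset, RandomSampler, SequentialSampler

# toy dataset with 6 observations
toy_data = TensorDataset(torch.arange(6).float().unsqueeze(1), torch.arange(6))

sequential = list(SequentialSampler(toy_data))   # indices in stored order
shuffled = list(RandomSampler(toy_data))         # a random permutation of the indices

print(sequential, shuffled)
```

For validation the drawing order does not influence the accuracy, so a sequential policy would work there as well.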

<a id="sec43"></a>
***
<h2 style="background-color:SteelBlue; color:white" >-> 3.3. DataLoaders</h2>

A DataLoader uses the drawing policy defined in a `sampler` to draw batches of `batch_size` observations from a `dataset`.

The result is an iterable containing `ceil(len(dataset) / batch_size)` batches, because the last, possibly smaller batch is kept by default (`drop_last=False`).

In [None]:
from torch.utils.data import DataLoader
batch_size = 16 
epochs = 3

train_loader = DataLoader(dataset=data_t, 
                          batch_size=batch_size, 
                          sampler=train_sampler)

val_loader = DataLoader(dataset=data_v, 
                        batch_size=batch_size, 
                        sampler=val_sampler)

In [None]:
some_x_batch, some_y_batch = next(iter(train_loader))

print("each batch is stored in a list.")
print("\nthe first entry is of type", type(some_x_batch))
print("and has a size of", some_x_batch.size())
print("\nthe second entry is of type", type(some_y_batch))
print("and has a size of", some_y_batch.size())
print("\nThe train_loader contains", len(train_loader), "many batches.")
print("\nThe val_loader contains", len(val_loader), "many batches.")

The first entry of each batch contains the independent variables / features and is a rank 2 tensor with `batch_size` rank 1 tensors, each storing `features` rank 0 tensors.

The second entry of each batch contains the targets and is a rank 1 tensor containing `batch_size` rank 0 tensors.
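
With the default `drop_last=False`, a dataset whose length is not a multiple of `batch_size` ends with a smaller batch. Here is a sketch on 100 random observations (a made-up dataset, not MNIST):

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

toy_data = TensorDataset(torch.rand(100, 784), torch.randint(0, 10, (100,)))
loader = DataLoader(toy_data, batch_size=16, sampler=RandomSampler(toy_data))

sizes = [x_batch.size(0) for x_batch, _ in loader]
print(len(loader), sizes)    # 7 batches, the last one holds only 4 observations

# drop_last=True discards the incomplete batch
loader_full = DataLoader(toy_data, batch_size=16, shuffle=True, drop_last=True)
print(len(loader_full))
```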

<a id="sec44"></a>
***
<h2 style="background-color:SteelBlue; color:white" >-> 3.4. Create the Model & Investigate it</h2>

A PyTorch model has to fulfill the following requirements:

* it has to inherit from `torch.nn.Module`
* it has to implement (at least) `__init__()` and `forward()`
* `__init__()` has to call the constructor of `torch.nn.Module`

Note that both linear layers of the following model are `CustomLinear` modules; `nn.Linear` layers could be used in exactly the same way.

In [None]:
class ExampleFNN(nn.Module):
    def __init__(self, num_feats):
        super(ExampleFNN, self).__init__()
        
        # hidden layer 1
        self.linear1 = CustomLinear(in_features=num_feats, out_features=256)
        self.relu1 = nn.ReLU()
        
        # output layer 
        self.linear2 = CustomLinear(in_features=256, out_features=10)

        
    def forward(self, x):
        x = self.linear1(x)
        x = self.relu1(x)
    
        return self.linear2(x)

In [None]:
model = ExampleFNN(num_feats=784).to(device)

Let's investigate our model a little. For this, we can either use a predefined model summarizer like [torch-summary](https://pypi.org/project/torch-summary/) or write our own summarizer from scratch. The latter provides some further insights into handling the model.

Let's investigate how the model parameters are initialized:

In [None]:
def show_weights(module):
    print(module)
    print(type(module))
    if isinstance(module, (nn.Linear, CustomLinear)):
        print(module.weight)
    print("\n\n")
        
model.apply(show_weights);

Note that we can use the `apply` method to iterate over the whole model: the function is called on every submodule and finally on the model itself, which is why the last printed element contains the whole model. We can exploit this functionality to select certain module types and transform them. For example, we can perform our own parameter initialization before training the model.
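
A custom initialization could look like the following sketch. The choice of Xavier initialization and the `nn.Sequential` toy model with `nn.Linear` layers are assumptions for illustration; any function that takes a module works with `apply`:

```python
import torch
import torch.nn as nn

def init_weights(module):
    # only touch linear layers; other modules (e.g. ReLU) have no weights
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

toy_model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
toy_model.apply(init_weights)    # visits every submodule, then the model itself

print(toy_model[0].bias.abs().sum(), toy_model[0].weight.abs().max())
```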

In [None]:
from tabulate import tabulate

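def summary(module, x):
    # prints output shape and parameter count per child module;
    # assumes the children are applied one after another in registration order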
    print_list = []
    total_params = 0
    if len(list(module.named_children())) > 0:  # if it's a model
        for child in module.named_children():
            x = child[1](x)
            param_string = ""
            child_params = 0
            for param in child[1].named_parameters():
                shape = list(param[1].size())
                params = 1
                for ax in shape:
                    params *= ax
                child_params += params
                param_string = param_string + f"'{param[0]}'" + " shape: " + str(shape) + " "
            total_params += child_params
            print_list.append([child[0], list(x.size()), param_string, child_params])
            
    else:  # if it's a single module
        x = module(x)
        param_string = ""
        for param in module.named_parameters():
            shape = list(param[1].size())
            params = 1
            for ax in shape:
                params *= ax
            total_params += params
            param_string = param_string + f"'{param[0]}'" + " shape: " + str(shape) + " "
        print_list.append([module, list(x.size()), param_string, total_params])
        
    print(f"Using a Batch Size of {x.size(0)}:\n")
    print(tabulate(print_list, headers=["Name", "Out Shape", "Weights", "Trainable Parameters"]))
    print("\nTrainable Model Parameters:", total_params)

summary(module=model, x=some_x_batch)

In [None]:
model.linear1.weight

In [None]:
linear_module = CustomLinear(in_features=784, out_features=3).to(device)  # same device as the batch

summary(module=linear_module, x=some_x_batch)

Each layer $\ell$ of an FNN has $M_{\ell-1} \cdot M_{\ell}$ weights plus $M_{\ell}$ bias terms. $M_{\ell}$ is the number of nodes (i.e. the output size) of layer $\ell$, and $M_0$ is the number of input features.
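
We can check this formula against the layer sizes used in this notebook (784 inputs, 256 hidden nodes, 10 outputs). The sketch uses `nn.Linear` layers as a stand-in, since their parameter shapes are the same as those of `CustomLinear`:

```python
import torch.nn as nn

layer_sizes = [784, 256, 10]     # input size, hidden size, output size

# formula: M_{l-1} * M_l weights plus M_l bias terms per layer
by_formula = sum(m_in * m_out + m_out
                 for m_in, m_out in zip(layer_sizes[:-1], layer_sizes[1:]))

toy_model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
by_counting = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)

print(by_formula, by_counting)   # both are 203530
```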

Describing a neural network as a so-called [computational graph](https://www.tutorialspoint.com/python_deep_learning/python_deep_learning_computational_graphs.htm) gives us a visual representation of the network. A gradient can be computed for each **leaf node** that requires one, e.g. for every model parameter.

We can display such a computational graph using [tensorboard](https://www.youtube.com/watch?v=pSexXMdruFM).
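
Even without a visualization tool, the graph can be inspected directly on the tensors. In this small sketch (with made-up tensors), `is_leaf` tells us whether a tensor is a leaf node and `grad_fn` points to the operation that created a non-leaf tensor:

```python
import torch

w = torch.randn(3, requires_grad=True)   # leaf: created by the user
x = torch.randn(3)                       # leaf, but does not require a gradient
z = (w * x).sum()                        # non-leaf: result of an operation

print(w.is_leaf, x.is_leaf, z.is_leaf)   # True True False
print(z.grad_fn)                         # the operation that produced z

z.backward()
grad_matches = torch.allclose(w.grad, x)  # dz/dw = x
```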

<a id="sec45"></a>
***
<h2 style="background-color:SteelBlue; color:white" >-> 3.5. Train and Validation Loop</h2>

In [None]:
from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=0.002)

loss_func = nn.CrossEntropyLoss()
total_epochs = 10

In [None]:
from sklearn.metrics import accuracy_score

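def train(model, optimizer, loss_func, train_loader):
    model.train()  # training mode (matters for layers like dropout)
    for batch in train_loader:
        x_batch = batch[0]
        y_batch = batch[1]
        optimizer.zero_grad()  # reset the gradients of the previous step
        logits = model(x_batch)  # raw class scores; CrossEntropyLoss expects these
        loss = loss_func(logits, y_batch)
        loss.backward()  # compute the gradients of the loss
        optimizer.step()  # update the parameters

def validate(model, loader):
    model.eval()  # evaluation mode
    acc = 0
    for batch in loader:
        x_batch = batch[0]
        y_batch = batch[1]
        with torch.no_grad():  # no gradients needed for evaluation
            logits = model(x_batch)
        pred = np.argmax(logits.cpu().numpy(), axis=1)
        acc += accuracy_score(y_true=y_batch.cpu().numpy(), y_pred=pred)
    # mean of the per-batch accuracies
    return np.round(acc / len(loader), 4)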

In [None]:
for epoch in range(1, total_epochs + 1):
    print(f"Epoch {epoch} / {total_epochs}:")
    train(model=model, optimizer=optimizer, loss_func=loss_func, train_loader=train_loader)
    acc_train = validate(model=model, loader=train_loader)
    acc_val = validate(model=model, loader=val_loader)
    print("Train Accuracy:", acc_train)
    print("Validation Accuracy:", acc_val)
    print("\n")
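
Two details of the loop are easier to see on a tiny hand-made example: the model outputs raw class scores (logits), which `nn.CrossEntropyLoss` takes directly because it applies the log-softmax internally, and the prediction is the index of the highest score. The helper `accuracy` and the values below are hypothetical and only serve as an illustration:

```python
import torch

def accuracy(logits, targets):
    """Fraction of observations whose highest-scoring class equals the target."""
    return (logits.argmax(dim=1) == targets).float().mean().item()

# 4 observations, 3 classes
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.3, 0.2, 1.5],
                       [0.0, 3.0, 0.5],
                       [1.0, 0.9, 0.8]])
targets = torch.tensor([0, 2, 1, 2])

loss = torch.nn.CrossEntropyLoss()(logits, targets)   # expects raw scores, not probabilities
acc = accuracy(logits, targets)                        # 3 of 4 predictions are correct

print(loss.item(), acc)
```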

## Weight Decay

In [None]:
model = ExampleFNN(num_feats=784).to(device)

optimizer_w = Adam([{"params": model.linear1.bias},
                   {"params": model.linear2.bias},
                   {"params": model.linear1.weight, "weight_decay": 50},
                   {"params": model.linear2.weight, "weight_decay": 50}], lr=0.002)

loss_func = nn.CrossEntropyLoss()
total_epochs = 30

In [None]:
for epoch in range(1, total_epochs + 1):
    print(f"Epoch {epoch} / {total_epochs}:")
    train(model=model, optimizer=optimizer_w, loss_func=loss_func, train_loader=train_loader)
    acc_train = validate(model=model, loader=train_loader)
    acc_val = validate(model=model, loader=val_loader)
    print("Train Accuracy:", acc_train)
    print("Validation Accuracy:", acc_val)
    print("\n")
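
Weight decay penalizes large weights and is usually applied to the weights only, not to the biases, which is why the optimizer above receives separate parameter groups. Since a misspelled option is easy to overlook, it can help to inspect `optimizer.param_groups` after creating the optimizer. The sketch below does this for a toy model; the model and the decay value of 0.01 are hypothetical:

```python
import torch.nn as nn
from torch.optim import Adam

toy_model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# split the parameters by name: decay the weights, leave the biases untouched
weights = [p for name, p in toy_model.named_parameters() if name.endswith("weight")]
biases = [p for name, p in toy_model.named_parameters() if name.endswith("bias")]

optimizer = Adam([{"params": biases},                          # falls back to weight_decay=0
                  {"params": weights, "weight_decay": 0.01}],  # hypothetical value
                 lr=0.002)

# every group lists the options that are actually in effect
decays = [group["weight_decay"] for group in optimizer.param_groups]
print(decays)
```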

That's it, thank you for reading this notebook!

The other notebooks of this series:
* [Part 2: CNN & Gradient Accumulation](https://www.kaggle.com/milankalkenings/pytorch-2-cnn-gradient-accumulation/edit)
* [Part 3: (Batch) Normalization](https://www.kaggle.com/milankalkenings/pytorch-3-batch-normalization)

Helpful Videos and Blogs:
* [Elliot Waite: Autograd](https://www.youtube.com/watch?v=MswxJw-8PvE&t=75s)

<div class="alert alert-danger" role="alert">
    <h3>Feel free to <span style="color:red">comment</span> if you have any suggestions   |   motivate me with an <span style="color:red">upvote</span> if you like this project.</h3>
</div>