# Lab 6: Introduction to Deep Learning

This is an important lab session that aims to develop intuition for Deep Reinforcement Learning (DRL).
For those familiar with Deep Learning (DL), you'll get to brush up on some concepts. For those unfamiliar with it, this lab session will be a crash course in DL.

We'll use PyTorch, a popular tool, in the upcoming DRL lab sessions.
Please go through the Jupyter notebooks in order.

Technically you should have PyTorch installed from Lab 1, but in case you missed it, you can install it locally with:
`pip install -r requirements.txt --user`

After going through the notebooks and getting the hang of how PyTorch works, we will ask you to play around with the hyperparameters and with techniques that can greatly improve your network's performance, so that you get a better intuition for how and why it works.

Please read the notebook chronologically, and fill in the **TODO**s as you encounter them.
* <span style="color:blue"> Blue **TODOs** </span> mean you have to implement the TODOs in the code.
* <span style="color:red"> Red **TODOs** </span> mean you have to submit an explanation (of a graph or of results).



----------

## 6.1. Linear regression
In linear regression, relationships are modeled using linear functions whose unknown parameters are estimated from the data. <br> We'll play around with a popular wine dataset. <br> Based on multiple features in this dataset, you'll predict the quality of the wine.

In [4]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

Load dataset

In [None]:
dataset = pd.read_csv('winequality-red.csv',sep=';')

Check what the data actually looks like. A popular way to do this is calling the `head()` function. Please refer to the pandas documentation for more details.

In [None]:
dataset.head()

Sometimes a dataset is too large to display; `shape` comes in handy to check its dimensions. <br> In the current example there are 1599 rows and 12 columns. <br> When working with an image dataset, the shape may also include the number of images and the number of channels (3 for RGB, 1 for a black-and-white image).

In [None]:
dataset.shape

When working on statistical problems, `describe()` can give you quick insights. (Here it's just to show you some functionalities of pandas.)

In [None]:
dataset.describe()

Real-world data is not as clean as this dataset: **ALWAYS KNOW YOUR DATA**. Here we check whether there are null values that we may have to remove or fill.

In [None]:
dataset.isnull().any()

Define features

In [None]:
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates','alcohol']

Defining your data and labels

In [None]:
X = dataset[features].values
y = dataset['quality'].values

We want to see the distribution of the data so we know how imbalanced it is. You can read a lot about class imbalance in detail. **It's good and bad ;)** Here we'll continue playing with this imbalanced data.

In [None]:
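plt.figure(figsize=(15,10))
# distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
sns.histplot(dataset['quality'], kde=True, stat='density')
plt.tight_layout()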

Here we just plot how these 3 features vary across the different quality classes. If you are a wine lover, you'll probably know this already.

In [None]:
relationship = sns.pairplot(dataset, vars=['fixed acidity','volatile acidity','citric acid'], hue='quality')
plt.show()

Stop and think about why we have a train set and a test set. If you don't know, ask one of the teaching assistants to explain it to you.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

**FINALLY** <br> In this example we just use the default parameters. You can have fun with them.

In [None]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)

Apart from estimating a function, linear regression has many other uses. It's one of the favorite tools used in reporting. Here we are just trying to explain how wine quality relates to the different features.

In [None]:
coeff_df = pd.DataFrame(regressor.coef_, features, columns=['Coefficient'])  
coeff_df

In [None]:
y_pred = regressor.predict(X_test)

In [None]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1 = df.head(25)

Let's check how you performed shall we?

In [None]:
df1

Will you be able to explain the graph?

In [None]:
df1.plot(kind='bar',figsize=(10,8))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()

Here we report common metrics, i.e. how close our predictions are to the data.

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
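
To make the three numbers above less of a black box, here is a minimal sketch that recomputes them by hand with the standard library only. The function name `regression_errors` and the sample values are made up for illustration; they are not part of the lab code or the wine dataset.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equally long sequences."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / len(residuals)   # mean absolute error
    mse = sum(r * r for r in residuals) / len(residuals)    # mean squared error
    return mae, mse, math.sqrt(mse)                         # RMSE is the root of MSE

# Made-up quality scores, for illustration only.
actual = [5, 6, 7, 5]
predicted = [5.5, 6.0, 6.0, 5.5]
mae, mse, rmse = regression_errors(actual, predicted)
print('MAE:', mae, 'MSE:', mse, 'RMSE:', rmse)
```

Because the errors are squared before averaging, the single prediction that is off by 1 dominates the MSE, while the MAE treats all errors proportionally.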

## 6.2. Decision Tree
You can read more about them here (https://en.wikipedia.org/wiki/Decision_tree), but in short they are like a flowchart: you reach the final class by following a sequence of decisions on the features. Each decision and its possible consequences are determined by your features.

In [None]:
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [None]:
x = dataset.drop('quality',axis=1)
y = dataset['quality']

In [None]:
clfTre = tree.DecisionTreeClassifier(max_depth=5)
clfTre.fit(X_train, y_train)

In [None]:
utfall = np.count_nonzero(clfTre.predict(X_test) == y_test)
print("The decision tree predicts the test data in", utfall/(len(X_test))*100 , "% of the cases.")

<span style="color:red"> **TODO** </span> Are we performing better or worse compared to linear regression? If worse, what can we do?

## 6.3. Random Forest
Random forests, or random decision forests, are an ensemble method. They build on decision trees: they construct a multitude of decision trees at training time and output the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
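
The aggregation step described above can be sketched in a few lines. This is only an illustration of "mode for classification, mean for regression"; the function names and the votes are hypothetical, and library implementations may combine their trees differently.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Return the most frequent class among the individual trees' votes."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Return the average of the individual trees' numeric predictions."""
    return mean(tree_outputs)

# Hypothetical predictions of five trees for one wine sample.
print(forest_classify([5, 6, 5, 7, 5]))
print(forest_regress([5.0, 6.0, 7.0]))
```

A single tree that overfits and votes for an unusual class is simply outvoted, which is the intuition behind the reduced overfitting.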

In [None]:
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
rf.fit(X_train, y_train)

In [None]:
utfall = np.count_nonzero(rf.predict(X_test) == y_test)
print("The random forest predicts the test data in", utfall/(len(X_test))*100 , "% of the cases.")

## 6.4. Multi-Layer Perceptron (MLP)

A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). Multilayer perceptrons are sometimes colloquially referred to as "vanilla" neural networks, especially when they have a single hidden layer.

<!-- ![image.png](attachment:image.png)
 -->
<img src=https://scikit-learn.org/stable/_images/multilayerperceptron_network.png alt="drawing" width="400"/>

[Image source](https://scikit-learn.org/stable/modules/neural_networks_supervised.html)

The above image shows an MLP with one hidden layer and a scalar output.


Most of the following can be visualized beforehand on a toy problem using the [Tensorflow Playground](https://playground.tensorflow.org/#activation=sigmoid&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=2&seed=0.35230&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false).

On this playground, you can change the inputs, depth and width of the network, learning rate, task and noise, activation function, regularizer, batch size. We very strongly recommend you give it some time to play around, as it is excellent to build an intuition and understand the impact of all the choices the NN designer makes.

### Network architecture
Three main choices can be made about an MLP's architecture:
* *depth*: number of hidden layers
* *width*: number of neurons in each layer
* *non-linearity*: activation functions

First, playing around with the depth of the network - the depth is defined by the number of functions you add in the init method of the Net class.

**Note that** for 0 hidden layers, we have a regular linear model. 

Using only one hidden layer, but with sufficient width, ensures that the network is a Universal Approximator (can approximate any function). 

The activation functions can have a very important impact on the network. It is common practice to choose the same one for all layers except the output layer, whose activation controls the nature of the overall function (net). First, use TF Playground to visualize how the hypothesis (function shape) changes with the activation function. The sigmoid function is historically important, but its gradient is close to 0 for inputs of large absolute value, which makes the gradient vanish. Try comparing performance on MNIST between sigmoid and other activation functions such as tanh or the more recent ReLU.
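
The vanishing-gradient remark can be checked numerically. The following is a small standard-library sketch (the helper names are ours, not PyTorch functions) that compares the derivative of the sigmoid with that of ReLU for increasingly large inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # derivative of the sigmoid: s(z) * (1 - s(z)), at most 0.25
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if z > 0 else 0.0

for z in (0.0, 2.0, 10.0):
    print('z =', z, 'sigmoid grad =', sigmoid_grad(z), 'relu grad =', relu_grad(z))
```

The sigmoid derivative peaks at 0.25 and shrinks quickly as |z| grows, so multiplying many such factors across layers drives the gradient towards zero. ReLU keeps a derivative of 1 for every positive input.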

### Learning Rate and Optimizer
The learning rate of gradient descent is a very common and crucial hyperparameter in a lot of ML applications, including Deep Learning.
Try tweaking it: you should observe that high values lead to unstable learning, while low values lead to slow learning.

The plain SGD update `w -= alpha * grad_w(J)` is often too naive: every step follows a noisy gradient estimate. Lately, many methods have appeared that add momentum, adapt the learning rate per parameter, etc. In PyTorch, the optimizer is selected using
`optimizer = torch.optim.SGD(net.parameters(), lr=0.01)`
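
The effect of the learning rate can be reproduced without PyTorch on a toy problem. The sketch below is an assumption-laden illustration, not the lab's training loop: it minimises J(w) = w² with the plain update `w -= lr * grad` and reports where `w` ends up after 20 steps.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimise J(w) = w**2 with plain gradient descent and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w      # dJ/dw for J(w) = w**2
        w -= lr * grad      # the update rule quoted in the text
    return w

for lr in (0.01, 0.4, 1.1):
    print('lr =', lr, 'final w =', gradient_descent(lr))
```

With `lr=0.01` the iterate barely moves towards the minimum at 0, with `lr=0.4` it converges quickly, and with `lr=1.1` every step overshoots and the magnitude of `w` grows.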

### Loss and Regularization
One of the most common losses to optimize is simply the Mean Squared Error, (h(x) - y)², which minimizes the distance between your prediction and the ground truth. However, other measures can be used. After reading up on both, try comparing MSE with the Cross Entropy Loss in PyTorch on the MNIST problem.
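
To get a feel for the two losses before running the comparison, here is a minimal standard-library sketch for a single example with three classes. The helper names and the logits are illustrative assumptions; in PyTorch you would use the built-in loss modules instead.

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # negative log-probability assigned to the correct class
    return -math.log(softmax(logits)[target])

def mse(probs, target):
    # mean squared distance between the probabilities and the one-hot label
    one_hot = [1.0 if i == target else 0.0 for i in range(len(probs))]
    return sum((p - t) ** 2 for p, t in zip(probs, one_hot)) / len(probs)

logits = [2.0, 0.0, 0.0]                     # the network favours class 0
for target in (0, 1):
    print('target', target,
          'CE', cross_entropy(logits, target),
          'MSE', mse(softmax(logits), target))
```

Both losses are smaller when the favoured class is the correct one, but the cross entropy grows without bound as the probability of the correct class approaches 0, whereas the MSE on probabilities stays bounded.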

A very common cause of overfitting is that the network weights explode: if you try to fit 10 2D points with a degree-10 polynomial, you will often find very high weight values that lead to severe overfitting rather than capturing the trend.
In order to prevent weight explosion, *L2 Regularization* adds a soft constraint to the loss in the form of a lambda*||w||\_2² term (L1 regularization uses the 1-norm instead). This way, the optimizer tries to solve the task using weights as small as possible. Conveniently, in PyTorch the regularization ("weight decay") is an optional argument to the optimizer, as you can see in the documentation!
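
As a rough sketch of what weight decay does inside an SGD step: the gradient of a penalty (weight_decay / 2) * ||w||² is weight_decay * w, so it is simply added to the task gradient. The function `sgd_step` below is a hypothetical helper written for this illustration, not a PyTorch API.

```python
def sgd_step(w, grad, lr, weight_decay=0.0):
    """One SGD step in which weight_decay * w is added to the task gradient."""
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]

weights = [1.0, -2.0]
zero_grad = [0.0, 0.0]      # pretend the task loss is already flat here

print(sgd_step(weights, zero_grad, lr=0.1))                    # unchanged
print(sgd_step(weights, zero_grad, lr=0.1, weight_decay=0.5))  # pulled towards 0
```

Even when the task gradient is zero, the penalty keeps shrinking every weight by a constant factor per step, which is where the name "weight decay" comes from.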

### Batch size
The optimizer is called Stochastic Gradient Descent, as opposed to the usual Gradient Descent, because we only use subsets (batches) of the training data instead of the whole thing at once, each batch acting like a sample in a stochastic computation. This was found to lead to great gains in wall-clock performance, since we don't have to loop over the whole dataset, which might contain millions of entries, for every update. In particular, this has led to huge gains in efficiency thanks to GPUs, which excel at parallel computation but have limited memory that cannot hold the whole dataset at once.
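
The batching itself is simple to sketch. The generator below is an illustrative stand-in for what a data loader does when shuffling is enabled (the name `minibatches` and its parameters are our own): it shuffles the sample indices once per epoch and yields them in chunks of `batch_size`.

```python
import random

def minibatches(n_samples, batch_size, seed=0):
    """Yield lists of shuffled sample indices, one list per batch."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # new random order for this epoch
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]  # the last batch may be smaller

for batch in minibatches(n_samples=10, batch_size=4):
    print(batch)
```

Each sample is visited exactly once per epoch, but the parameters are updated after every batch rather than once per pass over the data.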

### Dropout
The neurons of a neural network depend heavily on the values of the previous neurons: each of the inputs can have a drastic impact on the output. This is often a major culprit for overfitting, where the neurons cannot generalize properly because the testing distribution looks very different from the training distribution.
To prevent these heavy dependencies, one of the core techniques of Deep Learning was invented: Dropout. During training, each neuron has some probability of being turned off altogether! The downstream neurons therefore need to be flexible enough to adapt to all kinds of changes in their input: they should not rely too heavily on a single input, but rather find valuable information in all of them.
Dropout can conveniently be seen as an [additional layer](https://pytorch.org/docs/stable/nn.html#dropout-layers) that you can add after any layer (except the output), with a constant giving the probability of turning a neuron off.
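
A minimal sketch of the mechanism, assuming the common "inverted dropout" formulation in which the surviving activations are scaled by 1 / (1 - p) during training so that their expected value is unchanged. The function below is a plain-Python illustration, not the PyTorch layer.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Zero each activation with probability p (p < 1) and rescale the survivors."""
    if not training or p == 0.0:
        return list(activations)        # dropout is a no-op at test time
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

layer_output = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(dropout(layer_output, p=0.5, rng=random.Random(0)))   # training mode
print(dropout(layer_output, p=0.5, training=False))         # evaluation mode
```

This is also why the training and evaluation modes of a model matter: the same layer behaves differently in the two modes.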



In [None]:
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


In [None]:
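class Net(nn.Module):
    # define nn
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(11, 100)   # 11 input features
        self.fc2 = nn.Linear(100, 100)  # hidden layer
        self.fc3 = nn.Linear(100, 6)    # 6 quality classes

    def forward(self, X):
        X = F.relu(self.fc1(X))
        # without a non-linearity here, fc2 and fc3 would collapse into one linear map
        X = F.relu(self.fc2(X))
        # return raw logits: nn.CrossEntropyLoss applies log-softmax internally
        X = self.fc3(X)

        return X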

In [None]:
dataset = pd.read_csv('winequality-red.csv', sep=';')

In [None]:
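# convert each label to a string; str(series) would assign one identical string to every row
dataset['quality'] = dataset['quality'].astype(str)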

In [None]:
unique_labels = np.unique(dataset['quality'].values)
le = preprocessing.LabelEncoder()
le.fit(unique_labels)
labels = le.transform(dataset['quality'])

In [None]:
train_X, test_X, train_y, test_y = train_test_split(dataset[features].values,
                                                    labels, test_size=0.8)

In [None]:
# tensors track gradients themselves, so the legacy Variable wrapper is not needed
train_X = torch.tensor(train_X, dtype=torch.float32)
test_X = torch.tensor(test_X, dtype=torch.float32)
train_y = torch.tensor(train_y, dtype=torch.long)
test_y = torch.tensor(test_y, dtype=torch.long)

In [None]:
net = Net()

criterion = nn.CrossEntropyLoss()# cross entropy loss

optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

Try to observe that, for a very wide layer, the network overfits to the training data (...leading to a suspicious wine quality accuracy). The width can be controlled by changing how many inputs and outputs each intermediate function (i.e. hidden layer) takes in and gives out.

In [None]:
for epoch in range(1000):
    optimizer.zero_grad()
    out = net(train_X)
    loss = criterion(out, train_y)
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print('number of epoch', epoch, 'loss', loss.data)

predict_out = net(test_X)
_, predict_y = torch.max(predict_out, 1)

print ('prediction accuracy', accuracy_score(test_y.data, predict_y.data))

print ('macro precision', precision_score(test_y.data, predict_y.data, average='macro'))
print ('micro precision', precision_score(test_y.data, predict_y.data, average='micro'))
print ('macro recall', recall_score(test_y.data, predict_y.data, average='macro'))
print ('micro recall', recall_score(test_y.data, predict_y.data, average='micro'))
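
The difference between the macro and micro averages printed above matters on imbalanced data such as this one. The sketch below recomputes both precisions by hand on made-up labels; `precision_scores` is a hypothetical helper written for this illustration, and classes that are never predicted are given a precision of 0.

```python
def precision_scores(y_true, y_pred):
    """Return (macro, micro) precision over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    total_tp = total_fp = 0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
        per_class.append(tp / (tp + fp) if tp + fp else 0.0)
        total_tp += tp
        total_fp += fp
    macro = sum(per_class) / len(per_class)      # every class counts equally
    micro = total_tp / (total_tp + total_fp)     # every sample counts equally
    return macro, micro

# A classifier that always predicts the majority class 0 (made-up labels).
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0]
macro, micro = precision_scores(y_true, y_pred)
print('macro precision', macro, 'micro precision', micro)
```

The micro average looks respectable because most samples belong to the majority class, while the macro average exposes that two of the three classes are never predicted correctly.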

<span style="color:red"> **TODO** </span> How can you identify an overfitting network?

Now try to compare your overfitting network with a deeper but narrower network. Hopefully, you get stronger results at test time: a deeper network can help generalize better. You can test the limits of this by using a much deeper network: you will overfit again! A lot of weights often means great expressiveness, which sometimes hurts generalization.
To observe overfitting in TF Playground, try using the maximal network depth and width!

<span style="color:blue"> **TODO** </span> Reinitialize a narrower network, lower the number of epochs, and re-run the code.

In [None]:
for epoch in range( **TODO** ):
    optimizer.zero_grad()
    out = net(train_X)
    loss = criterion(out, train_y)
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print('number of epoch', epoch, 'loss', loss.data)

predict_out = net(test_X)
_, predict_y = torch.max(predict_out, 1)

print ('prediction accuracy', accuracy_score(test_y.data, predict_y.data))

print ('macro precision', precision_score(test_y.data, predict_y.data, average='macro'))
print ('micro precision', precision_score(test_y.data, predict_y.data, average='micro'))
print ('macro recall', recall_score(test_y.data, predict_y.data, average='macro'))
print ('micro recall', recall_score(test_y.data, predict_y.data, average='micro'))

-----------

## 6.5. CNN
Welcome to the third part of the tutorial. <br>
CNNs are Convolutional Neural Networks; they were born out of the limitations of MLPs. <br>
Now it's time to play. <br>
As a reminder, it is very important that you develop an understanding of **MLPs** and **CNNs**, as they will be the primary building blocks of DRL.

In [None]:
from __future__ import print_function
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout(0.5)  # plain dropout: applied after flatten, on (batch, features) input
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x) # convolution layer 1
        x = F.relu(x) # activation function
        x = self.conv2(x) # convolution layer 2
        x = F.relu(x) # activation function
        x = F.max_pool2d(x, 2) # pooling layer
        x = self.dropout1(x) # drop out
        x = torch.flatten(x, 1) # flatten
        x = self.fc1(x) # fully connected
        x = F.relu(x) # activation function
        x = self.dropout2(x) # drop out
        x = self.fc2(x) # fully connected
        output = F.log_softmax(x, dim=1)  
        return output


def train(model, device, train_loader, optimizer, epoch, log_interval):
    """
    Function to train the model.
    
    Parameters
    ----------
    model: instance of model.
    
    device: GPU/CPU
        determined by pytorch if you have cuda installed. by default CPU
        
    train_loader: DataLoader
        instance to load training data, useful to form batches and do data transformation.
    
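    optimizer: torch.optim.Optimizer
        optimizer that updates the model parameters.
    
    epoch: int
        index of the current epoch, used for logging.
    
    log_interval: int
        number of batches between two logged metrics.
    
    Returns
    -------
    train_loss: list of float
        loss recorded at each logged batch.
    
    train_accuracy: list of float
        accuracy (in %) recorded at each logged batch.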
    """
    model.train()
    correct = 0
    train_loss = list()
    train_accuracy = list()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            # accuracy of the current batch only, so it is comparable across batches
            pred = output.argmax(dim=1, keepdim=True)
            batch_correct = pred.eq(target.view_as(pred)).sum().item()
            accuracy = 100. * batch_correct / len(data)
            train_accuracy.append(accuracy)
            train_loss.append(loss.item())
            print('Train Epoch: {} [{}/{} ({:.0f}%)]'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader)))
    return train_loss, train_accuracy


def test(model, device, test_loader):
    """
    Function to test the model.
    
    Parameters
    ----------
    model: instance of model.
    
    device: GPU/CPU
        determined by pytorch if you have cuda installed. by default CPU
        
    test_loader: DataLoader
        instance to load testing data, useful to form batches and do data transformation.
    
    Returns
    -------
    test_loss: float
    
    test_accuracy: float
    """
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    accuracy = 100. * correct / len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        accuracy))
    return test_loss, accuracy
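
The value 9216 passed to `fc1` in `Net` is not arbitrary: it is the number of features left after the two convolutions and the pooling layer. The arithmetic can be sketched as follows, assuming 28x28 MNIST inputs and the layer settings used in `Net` (3x3 kernels, stride 1, no padding, 2x2 max pooling). The helper `conv_out` is our own, not a PyTorch function.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                   # MNIST images are 28x28 pixels
size = conv_out(size, 3)    # after conv1: 26
size = conv_out(size, 3)    # after conv2: 24
size = size // 2            # after 2x2 max pooling: 12
flat_features = 64 * size * size   # conv2 has 64 output channels
print(flat_features)
```

If you change the kernel sizes, the number of channels, or the pooling, the input size of `fc1` has to be recomputed in the same way.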


In [None]:
def main(batch_size, test_batch_size, epochs, lr, gamma, seed, file_name):
    # Training settings (the remaining ones are passed in as arguments)
    no_cuda = False
    log_interval = 1
    save_model = False
    test_loss_array = list()
    train_loss_array = list()
    test_accuracy_array = list()
    train_accuracy_array = list()
    use_cuda = not no_cuda and torch.cuda.is_available()

    torch.manual_seed(seed)
    
    # set device for training (cpu/gpu)
    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    
    # wrap data in class and apply transformations
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=batch_size, shuffle=True, **kwargs)
    
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=test_batch_size, shuffle=True, **kwargs)
    
    # declare model and copy its instance to device
    model = Net().to(device)
    
    # declare optimiser.
    optimizer = optim.SGD(model.parameters(), lr=lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=gamma)
    # train and evaluate once per epoch, then decay the learning rate
    for epoch in range(1, epochs + 1):
        train_loss , train_accuracy = train(model, device, train_loader, optimizer, epoch, log_interval)
        test_loss, test_accuracy = test(model, device, test_loader)
        test_loss_array.append(test_loss)
        test_accuracy_array.append(test_accuracy)
        train_loss_array.append(np.mean(train_loss))
        train_accuracy_array.append(np.mean(train_accuracy))
        scheduler.step()

    epoch_count = range(1, len(test_loss_array) + 1)
    # Visualize loss history
    plt.plot(epoch_count, test_loss_array, 'r--')
    plt.plot(epoch_count, train_loss_array, 'b-')
    plt.legend(['Test Loss', 'Train Loss'])
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.show();
    
    plt.plot(epoch_count, test_accuracy_array, 'r--')
    plt.plot(epoch_count, train_accuracy_array, 'b-')
    plt.legend(['Test Accuracy', 'Train Accuracy'])
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.show();
    
    if save_model:
        torch.save(model.state_dict(), file_name)

<span style="color:blue"> **TODO** </span> Run the main function.

In [None]:
main(batch_size=64, test_batch_size=64, epochs=10, lr=0.2, gamma=0.7, seed=1, file_name="mnist_cnn.pt")
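
The `gamma` argument controls how fast the learning rate decays: `StepLR` with `step_size=1` multiplies the learning rate by `gamma` after every epoch. The schedule used in the call above can be previewed with this small sketch, where `step_lr` is a hypothetical helper that mirrors the decay rule rather than the PyTorch class itself.

```python
def step_lr(initial_lr, gamma, epochs, step_size=1):
    """Learning rate used in each epoch under a step decay schedule."""
    return [initial_lr * gamma ** (epoch // step_size) for epoch in range(epochs)]

schedule = step_lr(initial_lr=0.2, gamma=0.7, epochs=10)
for epoch, lr in enumerate(schedule, start=1):
    print('epoch', epoch, 'lr', round(lr, 5))
```

By the last of the 10 epochs the learning rate has dropped to roughly 4% of its initial value, which helps the loss settle instead of bouncing around a minimum.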

<span style="color:blue"> **TODO** </span> Reinitialize the optimizer as Adam.

<span style="color:red"> **TODO** </span>  How does this impact learning? Explain briefly.

In [None]:
main(batch_size=64, test_batch_size=64, epochs=10, lr=0.2, gamma=0.7, seed=1, file_name="mnist_cnn.pt")

<span style="color:blue"> **TODO** </span> Change the batch size (increase and decrease it).

<span style="color:red"> **TODO** </span>  How does this impact learning? Explain briefly.

In [None]:
main(batch_size=, test_batch_size=64, epochs=10, lr=0.2, gamma=0.7, seed=1, file_name="mnist_cnn.pt")

In [None]:
main(batch_size=, test_batch_size=64, epochs=10, lr=0.2, gamma=0.7, seed=1, file_name="mnist_cnn.pt")