# Notebook 2: Classification with Neural Networks

**Authors:** Kenny Choo, Mark H. Fischer, Eliska Greplova for the conference "Summer School: ML in Quantum Physics and Chemistry" (24.08.-03.09.2021, Warsaw)

Adapted for the ML4Q-retreat 2022 by Alexander Gresch

## Import of the first ML library!
There are various libraries useful for ML practitioners. Among the most popular are TensorFlow (with Keras) and PyTorch. The two compete closely with each other and offer essentially the same functionality [[source]](https://www.imaginarycloud.com/blog/pytorch-vs-tensorflow/).
- **TensorFlow** is a Python numerical library (created by the Google Brain team) in which many methods are in fact high-performance C++ binaries. In Google Colaboratory it can be accelerated by choosing a TPU (Tensor Processing Unit) runtime. Keras is the high-level API of TensorFlow 2, a user-friendly layer that simplifies many operations.
- **PyTorch** is a Python library created by the Facebook AI Research lab. Before TensorFlow 2.0, PyTorch's strength was a slightly different way of storing and building ML models, which allowed easier interaction with the models' internals. TensorFlow 2.0 has since adopted that approach, and now the main difference is that PyTorch code is somewhat easier to debug than TensorFlow code.

Fun fact to confuse you even more: the Keras API used to build ML models in TensorFlow looks very similar to the PyTorch way of building ML models. We will stick to PyTorch. Note that there is a wrapper module, PyTorch Lightning, for more convenient control of the workflow.

In [None]:
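# install the pytorch lightning wrapper locally for this notebook
# (the LitModel class below uses the training_epoch_end/validation_epoch_end hooks,
# which were removed in Lightning 2.0, hence the version pin)
!pip install "pytorch-lightning<2.0"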

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
import pytorch_lightning as pl

# Helper Libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

## Step 0. Loading and saving files with Google Colaboratory

Within these tutorials, we will need to upload some data (in particular, training data). E.g., today, we will upload Monte Carlo generated configurations of Ising spins on a 30x30 lattice.

- *Clone the GitHub repository*

If you want to avoid working with Google Drive, you can use this work-around. All needed data are on [this GitHub](https://github.com/Shmoo137/SummerSchool2021_MLinQuantum). You can clone the GitHub repository into your Colab environment in the same way as you would on your local machine, using `git clone`. Once the repository is cloned, refresh the file explorer on the left to browse through its contents.

In [None]:
# Option B.
!git clone https://github.com/GreschAl/ML4Q_retreat22_ML_with_python

In [None]:
# Option B.
folder = "/content/ML4Q_retreat22_ML_with_python/exercises/Ising_data"
# Access the data with simple commands:
ising_training_configs = np.load(folder + "/ising_training_configs_30x30.npy")

Note that these are local, temporary files; they will disappear once the notebook is closed.

In this notebook we will repeat the exercise from the notebook `01_Unsupervised_learning` on using data analysis to try to identify different phases of matter. We have seen that clustering can sometimes be a really powerful tool, but sometimes it does not work too well. Here, we will learn how to accomplish the same task using neural networks and see if they can do better.

# Example #1: Ising Spin Configuration Classification


The Ising model is given by the (classical) Hamiltonian:

\begin{align}
H(\boldsymbol{\sigma}) = -\sum_{<ij>} \sigma_{i}\sigma_{j},
\end{align}
where the spins $\sigma_{i} \in \lbrace -1, 1 \rbrace$ are binary variables living on the vertices of a square lattice and the sum is taken over nearest neighbours $<ij>$. We have set $J=1$.
  
At a given inverse temperature $\beta = 1/T$, the probability of a configuration $\boldsymbol{\sigma}$ is given by the Boltzmann distribution
  
\begin{align}
  P(\boldsymbol{\sigma}) = \frac{e^{-\beta H(\boldsymbol{\sigma})}}{Z},
 \end{align}
  
  where $Z$ is the partition function. This model exhibits a phase transition from the ferromagnetic phase at low temperatures to a paramagnetic phase at high temperatures. The transition temperature is $T_c \approx 2.2692$.
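
To make the Hamiltonian concrete, the following sketch evaluates $H(\boldsymbol{\sigma})$ for a single configuration with NumPy. It assumes periodic boundary conditions (the notebook does not state which boundary conditions the Monte Carlo data use), and the function name `ising_energy` is introduced here for illustration only.

```python
import numpy as np

def ising_energy(config):
    """Energy H = -sum_<ij> s_i s_j on a square lattice, periodic boundaries (assumption)."""
    # Each nearest-neighbour bond is counted once: pair every site with its
    # neighbour along axis 1 and its neighbour along axis 0.
    right = np.roll(config, shift=-1, axis=1)
    down = np.roll(config, shift=-1, axis=0)
    return -np.sum(config * right + config * down)

N = 30

# Fully ordered configuration: all 2*N*N bonds are satisfied
all_up = np.ones((N, N), dtype=int)
print(ising_energy(all_up))        # -1800

# Checkerboard configuration: every bond is violated
checkerboard = np.indices((N, N)).sum(axis=0) % 2 * 2 - 1
print(ising_energy(checkerboard))  # 1800
```

The ordered state sits at the minimum energy $-2N^2$, which is why it dominates the Boltzmann distribution at low temperature.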
  
  **Task**
 
1.   Classify the ferromagnetic versus the paramagnetic phase of the Ising model
2.   Find the transition temperature
  
**Dataset**: Monte Carlo generated configurations on a 30x30 square lattice. The configurations are labelled by temperature.



## Step 1: Import data and analyze the data shape

The folder `Ising_data` contains Monte Carlo generated Ising configurations on the two-dimensional lattice. The data set is divided into training and test parts and corresponding label files containing the temperature, $T$, of each Monte Carlo sample.

In [None]:
N = 30 # linear dimension of the lattice 

ising_training_configs = np.load(folder + "/ising_training_configs_{0}x{0}.npy".format(N))
ising_training_labels = np.load(folder + "/ising_training_labels_{0}x{0}.npy".format(N))
ising_test_configs = np.load(folder + "/ising_test_configs_{0}x{0}.npy".format(N))
ising_test_labels = np.load(folder + "/ising_test_labels_{0}x{0}.npy".format(N))

print('train_images.shape =', ising_training_configs.shape)
print('train_labels.shape =', ising_training_labels.shape)
print('test_images.shape =', ising_test_configs.shape)
print('test_labels.shape =', ising_test_labels.shape)

We see that we have a training set of size 1000 and a test set of size 1000.
Each image is a 30x30 array which takes values in {-1, 1}. The labels of these images are the temperatures and they are in [1, 3.5].

## Step 2: Prepare data

At the moment, our configurations are labelled by their temperature. Since we want to learn to classify the two phases, we need to label our data by 'Ordered' (label=0) vs 'Disordered' (label=1).

Let us assume that we know $1.5=T_{low}<T_{c}< T_{high} = 2.5$. Then we exclude all the data between $T_{low}$ and $T_{high}$. We label all configurations below $T_{low}$ with '0' and all those above $T_{high}$ with '1'.

In [None]:
# Assign labels according to the temperature
T_low = 1.5
T_high = 2.5

########################################################################
### TODO: ###
########################################################################
# Starting with training data, create the labels for the data according
# to the explanation above. Do not forget to omit data from in between the phases.
included = np.bitwise_or(ising_training_labels <= T_low , ising_training_labels >= T_high)
train_images = ising_training_configs[included]
train_labels = np.zeros(len(train_images),dtype=int)
train_labels[ising_training_labels[included] >= T_high] = 1

# do the same for the test data
included = np.bitwise_or(ising_test_labels <= T_low , ising_test_labels >= T_high)
test_images = ising_test_configs[included]
test_labels = np.zeros(len(test_images),dtype=int)
test_labels[ising_test_labels[included] >= T_high] = 1
########################################################################

# Now you should have a smaller training data set, check it:
print(train_images.shape,train_labels.shape)
# The same holds for the test data set:
print(test_images.shape,test_labels.shape)

Before we can set up the neural networks, we have to wrap the data further into a PyTorch data loader instance. For example, the data loader takes care of the mini-batching for us. It is also possible to alter the data as they are being fed into the model. This is really helpful if you are dealing with large data samples such as high-resolution images.

In [None]:
train_set = TensorDataset(torch.tensor(train_images,dtype=torch.float),torch.tensor(train_labels,dtype=torch.long))
test_set  = TensorDataset(torch.tensor(test_images,dtype=torch.float), torch.tensor( test_labels,dtype=torch.long))
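
To illustrate what the data loader does with such a dataset, here is a minimal sketch on toy data. The `toy_*` names and the random configurations are placeholders, not the Ising data.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy stand-in for the Ising data: 10 random 30x30 spin configurations
toy_images = torch.randint(0, 2, (10, 30, 30)).float() * 2 - 1
toy_labels = torch.randint(0, 2, (10,))
toy_set = TensorDataset(toy_images, toy_labels)

# With 10 samples and batch_size=4 we get batches of 4, 4 and 2 samples
toy_loader = DataLoader(toy_set, batch_size=4, shuffle=True)
batch_sizes = []
for x, y in toy_loader:
    batch_sizes.append(len(y))
    print(x.shape, y.shape)
```

The last batch is smaller because 10 is not a multiple of 4; `shuffle=True` changes which samples end up in which batch, not the batch sizes.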

## Step 3: Setup the model

Specifications of our model: 

*  Our model takes in a 30 by 30 array
*  And outputs a 2-dimensional vector

The 2-dimensional vector gives the model's prediction for whether the system is ferromagnetic (i.e., $T < T_c$) or paramagnetic ($T>T_c$).

The model's prediction is given by the index with the largest value, i.e. argmax(output)

There are two ways to create models within pytorch.

1.   Sequential Model
2.   Model class with the functional API

In both methods, the basic building block is the layer. A layer takes some input tensor and applies some transformation and returns an output tensor.

First let us explore the sequential model.

In [None]:
# first, we define a convenience layer that flattens the 2d input samples into a 1d vector
class Flatten(nn.Module):
    def forward(self,x):
        return x.view(len(x),-1)

In [None]:
network = nn.Sequential(
    Flatten(),
    nn.Linear(900,32),
    nn.ReLU(),
    nn.Linear(32,2)
)

print(network)
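
For comparison, the second way of building a model (a model class) could look roughly like the sketch below. The class name `DenseNet` and the attribute names are illustrative choices; the architecture mirrors the sequential model (900 inputs, 32 hidden neurons, 2 outputs). The last lines also show how a prediction follows from the output via `argmax`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseNet(nn.Module):
    """Same architecture as the sequential model: 900 -> 32 -> 2."""
    def __init__(self, n_inputs=900, n_hidden=32, n_classes=2):
        super().__init__()
        self.hidden = nn.Linear(n_inputs, n_hidden)
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, x):
        x = x.view(len(x), -1)       # flatten each 30x30 sample into 900 entries
        x = F.relu(self.hidden(x))
        return self.out(x)           # raw scores, one per class

class_network = DenseNet()
dummy_batch = torch.randn(5, 30, 30)               # 5 random "configurations"
scores = class_network(dummy_batch)
predicted_class = torch.argmax(scores, dim=1)      # index of the largest score
print(scores.shape, predicted_class.shape)
```

The class-based form is more verbose, but `forward` can contain arbitrary Python logic, which a purely sequential stack of layers cannot express.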

### Hyperparameters

Why 32 neurons? Good question! No one knows exactly ;) Building an ML model is a combination of educated guesses, good practices, and pure luck. As a general rule, the simpler the model (the smaller the number of parameters), the better. We will encounter more quantities whose values we need to choose to some degree arbitrarily. These quantities are called **hyperparameters**.

*Aside note for curious minds:*

While training, you optimize the parameters of the model. You check the performance of the trained model (on the subset of data called validation data) and then try to improve it by playing with hyperparameters. You can think that you optimize the model parameters on the training data and the hyperparameters on the validation data. The final test of the model performance is the test data. 
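
A separate validation set as described here could be carved out of the training data as in the following sketch. The 80/20 split and the toy dataset are arbitrary choices for illustration; for simplicity, the rest of this notebook passes the test set as validation data instead.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset standing in for the training data
toy_set = TensorDataset(torch.randn(100, 30, 30), torch.randint(0, 2, (100,)))

# Hold out 20% of the training data for validation (the fraction is arbitrary)
n_val = int(0.2 * len(toy_set))
n_train = len(toy_set) - n_val
train_subset, val_subset = random_split(
    toy_set, [n_train, n_val], generator=torch.Generator().manual_seed(0)
)
print(len(train_subset), len(val_subset))
```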

*Question:* Why shouldn't you play with your hyperparameters to improve your model performance on test data?

In [None]:
# define the LightningModule. The init() takes all relevant model(s), hyperparameters etc.
# We only have to define the training_step(), i.e. how a mini-batch of data is processed.
# The self.log() and self.log_dict() functions keep track of the training loss but can also be used for other metrics.
class LitModel(pl.LightningModule):
    def __init__(self, network, learning_rate=1e-3):
        super().__init__()
        self.network = network
        self.lr = learning_rate
        self.train_loss = []
        self.test_loss = []
        self.train_acc = []
        self.test_acc = []

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        # batch = (x,y) comes internally from the data loader
        # batch_idx is only passed for internal reasons
        x, y = batch
        
        scores = self.network(x)
        loss = F.cross_entropy(scores, y)

        y_pred = torch.argmax(F.softmax(scores, dim=1), dim=1)
        correct = torch.sum((y_pred==y).to(dtype=torch.long))
        # Logging to TensorBoard by default
        values = {"loss": loss, "train_correct": correct, "train_total": len(y)}  # add more items if needed
        self.log_dict(values)
        # self.log("train_loss", loss) <- in case you want to only log one value
        return values
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        scores = self.network(x)
        loss = F.cross_entropy(scores, y)

        y_pred = torch.argmax(F.softmax(scores, dim=1), dim=1)
        correct = torch.sum((y_pred==y).to(dtype=torch.long))
        values = {"loss": loss, "val_correct": correct, "val_total": len(y)}
        self.log_dict(values)
        return values

    def training_epoch_end(self,outputs):
        #  the function is called after every epoch is completed
        # calculating average loss  
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()

        # calculating correct and total predictions
        correct=sum([x["train_correct"] for  x in outputs])
        total=sum([x["train_total"] for  x in outputs])

        self.train_loss.append(avg_loss)
        self.train_acc.append(correct/total)

        return

    def validation_epoch_end(self,outputs):
        #  the function is called after every epoch is completed
        # calculating average loss  
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()

        # calculating correct and total predictions
        correct=sum([x["val_correct"] for  x in outputs])
        total=sum([x["val_total"] for  x in outputs])

        self.test_loss.append(avg_loss)
        self.test_acc.append(correct/total)

        return

    def predict_step(self, batch, batch_idx, dataloader_idx=0):
        return torch.argmax(F.softmax(self.network(batch[0]),dim=1),dim=1)

    def predict(self,data):
        with torch.no_grad():
            return F.softmax(self.network(torch.tensor(data,dtype=torch.float)),dim=1).numpy()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        return optimizer

In [None]:
dense_model = LitModel(network)

We also defined a few specific details before commencing with the training, namely:


1.   Loss function: we need to choose what function we want our model to minimise e.g. mean square error or cross entropy or ...
2.   Optimisation method: How we want to update the weights e.g. stochastic gradient descent or ADAM or ...
3.   Metrics: some quantity we want to keep track of while we are training, e.g. value of the loss function or the accuracy of the model...

One could also choose other loss functions or optimisers: https://pytorch.org/docs/stable/nn.html#loss-functions, https://pytorch.org/docs/stable/optim.html
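
To see what the cross-entropy loss used in `LitModel` computes, the sketch below compares `F.cross_entropy` with a by-hand evaluation on two made-up score vectors. The numbers are arbitrary.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, -1.0],
                       [0.5, 1.5]])    # raw network outputs for two samples
targets = torch.tensor([0, 1])         # true class of each sample

# Built-in: log-softmax followed by the negative log-likelihood, averaged over the batch
loss_builtin = F.cross_entropy(scores, targets)

# The same by hand: log p_k = score_k - log(sum_j exp(score_j))
log_probs = scores - torch.logsumexp(scores, dim=1, keepdim=True)
loss_manual = -log_probs[torch.arange(len(targets)), targets].mean()

print(loss_builtin.item(), loss_manual.item())
```

Because the softmax is already part of `F.cross_entropy`, the network itself returns raw scores rather than probabilities.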

It is always good to check the history of the training. What did the training loss look like? What did the validation loss look like? You can learn a lot from that, and we will show you a very meaningful example at the end of this notebook. For now, remember that it is good practice to plot the training and validation loss!



## Step 4. Train the model

Finally we are ready to train our model. Here we basically just need to feed the data set to our model, and the model will minimise the loss function we chose and update the model parameters according to the optimisation method we specified.

The last thing we need to choose is the number of epochs and the batch_size.


*   batch_size: this is the number of images we feed to our model in 1 iteration
*   epochs: the number of times we run through our data set.

Let's suppose batch size = 100. Then the training proceeds as follows: 

1.   Divide our dataset in batches of 100. 
2.   Take a batch of 100 samples and feed it to the model. This gives 100 output vectors from which we compute the loss function and its gradient w.r.t. the model parameters.
3. Use the gradients to update the model parameters according to the optimiser we chose 
4. Repeat steps 2 and 3 until we have cycled through to the end of the dataset. This will be the end of one epoch.

Number of iterations in one epoch = size of data set / batch_size
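
The steps above are what the Lightning trainer performs internally. Written out in plain PyTorch, they could look like the following sketch. The toy data, the use of the built-in `nn.Flatten` and all variable names are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
# Toy data: the label is the sign of the mean of each random 30x30 "configuration"
x = torch.randn(200, 30, 30)
y = (x.mean(dim=(1, 2)) > 0).long()
loader = DataLoader(TensorDataset(x, y), batch_size=100, shuffle=True)  # step 1

model = nn.Sequential(nn.Flatten(), nn.Linear(900, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

epoch_losses = []
for epoch in range(5):
    running = 0.0
    for xb, yb in loader:                      # step 2: one mini-batch at a time
        loss = F.cross_entropy(model(xb), yb)
        optimizer.zero_grad()
        loss.backward()                        # gradient w.r.t. the model parameters
        optimizer.step()                       # step 3: parameter update
        running += loss.item()
    epoch_losses.append(running / len(loader)) # step 4: one epoch is finished

print(len(loader), epoch_losses)
```

Here `len(loader)` is the number of iterations per epoch, 200 / 100 = 2.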


A few comments:

* The smaller the batch size, the cheaper each iteration and the more parameter updates happen per epoch.
* The smaller the batch size, the noisier the training will be.
* Some amount of noise is useful to prevent us from getting stuck in local minima.



In [None]:
# wrap the torch dataset instance in a data loader
train_loader = DataLoader(train_set,batch_size=16,shuffle=True)
test_loader  = DataLoader(test_set ,batch_size=100)

In [None]:
trainer = pl.Trainer(max_epochs=5)
trainer.fit(model=dense_model, train_dataloaders= train_loader, val_dataloaders=test_loader )

In [None]:
# Useful plotting procedure to plot training and validation losses vs. training time (measured in so-called epochs)
def plot_history(model,key="loss"):
    plt.figure(figsize=(16,10))

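    for i, (name, style) in enumerate(zip(["train","test"],["r-","r--"])):
        # the test history has one extra leading entry from Lightning's sanity-check
        # validation pass before training, so it is skipped (i=1) to align both curves
        data = getattr(model,name+"_"+key)[i:]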
        plt.plot(data, style, label=name)
    plt.xlabel('Epochs')
    plt.ylabel(key)
    plt.legend()

    plt.xlim([0,len(data)])
    return

In [None]:
# See how the training looked like:
# summarize history for accuracy
plot_history(dense_model, 'acc')
# summarize history for loss function value
plot_history(dense_model, 'loss')

### Can you tell something out of these plots?

**Answer**: the training was successful: in both metrics, the graphs for training and validation data are very similar to each other

## Step 5. Evaluate our model
Now that our model is trained, we can test our model on the configurations we have not yet seen (ising_test_configs). 

In [None]:
predictions = trainer.predict(dense_model,test_loader)
# concatenate the per-batch predictions (also works if the last batch is smaller)
predictions = np.concatenate([pred.numpy() for pred in predictions])
acc = np.mean(predictions==test_labels)

In [None]:
print('accuracy on test set =', acc*100, "%")

In [None]:
ising_predictions = dense_model.predict(ising_test_configs)
print(ising_predictions.shape)

To evaluate where the model predicts $T_c$ to be, we average the prediction for all the configurations for a given temperature. 

We also calculate the absolute value of the magnetization per spin ($m=|\frac{1}{N^2}\sum_i \sigma_i|$) for comparison, since we know that this is our order parameter.

In [None]:
Temps = list(np.sort(list(set(ising_test_labels))))
NT = len(Temps)
phase1 = np.zeros(NT)
phase2 = np.zeros(NT)
points = np.zeros(NT)
m = np.zeros(NT)
for i, T in enumerate(ising_test_labels):
    j = Temps.index(T)
    phase1[j]+=ising_predictions[i, 0]
    phase2[j]+=ising_predictions[i, 1]
    m[j] += abs(np.mean(ising_test_configs[i]))
    points[j]+=1.

for j in range(NT):
    phase1[j] /= points[j]
    phase2[j] /= points[j]
    m[j] /= points[j]
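
The per-temperature averaging in the loop above can also be written without explicit loops. The sketch below does so on made-up data (the `toy_*` arrays are placeholders), using `np.unique` and `np.bincount`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins: 3 temperatures with 4 samples each and fake two-class probabilities
toy_T = np.repeat([1.0, 2.0, 3.0], 4)
toy_p_ordered = rng.random(len(toy_T))
toy_probs = np.stack([toy_p_ordered, 1 - toy_p_ordered], axis=1)

# inverse[i] is the temperature bin of sample i, counts[j] the samples in bin j
temps, inverse, counts = np.unique(toy_T, return_inverse=True, return_counts=True)

# bincount with weights sums the predictions within each temperature bin
phase1_avg = np.bincount(inverse, weights=toy_probs[:, 0]) / counts
phase2_avg = np.bincount(inverse, weights=toy_probs[:, 1]) / counts
print(temps, phase1_avg + phase2_avg)
```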

In [None]:
plt.rcParams["figure.figsize"] = (8,6)
plt.plot(Temps, phase1, 'b', label='ordered')
plt.plot(Temps, phase2, 'r', label='disordered')
plt.plot(Temps, m, 'g--', label="magnetization")
plt.legend()
plt.ylim(0, 1.05)
plt.xlim(1,3.5)
plt.xlabel('T [J]')
plt.ylabel('model prediction/magnetization')
plt.grid()
plt.show()

We can now estimate the location of the transition. Let's define this to be the location where our model's prediction drops to 0.5.

In [None]:
index = (np.abs(phase1 - 0.5)).argmin()
tc = Temps[index]

print("Estimated Transition Temp =", tc)

The exact transition temperature in the thermodynamic limit is $T_c \approx 2.2692$, so our result is not so bad considering finite-size effects. If we look again at the above plot, we can see that the curves coincide relatively nicely with the average magnetization. This suggests that the network is indeed learning the magnetization, i.e. it is computing the magnetization and using it to make its prediction.
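
The `argmin` estimate is restricted to the temperatures present in the data. One possible refinement, sketched here with made-up numbers, is to interpolate linearly between the two grid points that bracket the 0.5 crossing. The helper `crossing_temperature` is hypothetical and not part of the original exercise.

```python
import numpy as np

def crossing_temperature(temps, p_ordered, level=0.5):
    """Linearly interpolate the temperature at which p_ordered first drops below `level`."""
    temps = np.asarray(temps, dtype=float)
    p = np.asarray(p_ordered, dtype=float)
    below = np.nonzero(p < level)[0]
    if len(below) == 0 or below[0] == 0:
        raise ValueError("no downward crossing inside the temperature range")
    k = below[0]
    # straight line through the two grid points that bracket the crossing
    frac = (p[k - 1] - level) / (p[k - 1] - p[k])
    return temps[k - 1] + frac * (temps[k] - temps[k - 1])

toy_temps = [2.0, 2.2, 2.4, 2.6]
toy_p = [0.95, 0.80, 0.20, 0.05]
tc_interp = crossing_temperature(toy_temps, toy_p)
print(tc_interp)
```

In this toy example the crossing lies halfway between 2.2 and 2.4, so the interpolated value is approximately 2.3.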

# Example #2: Ising model with local constraints: Ising Gauge Theory (IGT)

In the previous example, we classified spin configurations of the simple Ising model. That was a relatively easy task given that we know that there's a global order parameter, i.e., the magnetization that distinguishes the two phases the model has.

In the following, we will look at spin configurations coming from a different model, on which simple PCA fails spectacularly. In this model, Ising spins live on the edges of a square lattice (see Figs. below). The Hamiltonian favors an even number of down spins (and hence of up spins) around each square. If the number is odd, a penalty is paid. The Hamiltonian is given by

\begin{align}
H(\boldsymbol{\sigma}) = -\sum_{p} \prod_{i \in p}\sigma_{i},
\end{align}
where we sum over the plaquettes $p$ of the square lattice.

This model does not have a finite temperature transition. We thus want to train a network to distinguish the (highly degenerate) ground states of this system from any excited state.
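
The sketch below evaluates this Hamiltonian for a single configuration and illustrates the ground-state degeneracy. It rests on an assumed data layout that is not confirmed by the data files: an array of shape (N, N, 2) where `config[x, y, 0]` and `config[x, y, 1]` are the two edges leaving site (x, y) along axis 0 and axis 1, with periodic boundaries. The function name `igt_energy` is illustrative.

```python
import numpy as np

def igt_energy(config):
    """H = -sum_p prod_{i in p} sigma_i for a config of shape (N, N, 2).

    Assumed layout: config[x, y, 0] is the edge from site (x, y) to (x+1, y),
    config[x, y, 1] the edge from (x, y) to (x, y+1), periodic boundaries.
    """
    e0 = config[:, :, 0]
    e1 = config[:, :, 1]
    # The plaquette with corner (x, y) is bounded by
    # e0[x, y], e1[x+1, y], e0[x, y+1] and e1[x, y].
    plaquettes = e0 * np.roll(e1, -1, axis=0) * np.roll(e0, -1, axis=1) * e1
    return -np.sum(plaquettes)

N = 16
ground = np.ones((N, N, 2), dtype=int)
print(igt_energy(ground))     # -256

excited = ground.copy()
excited[3, 5, 0] = -1         # one flipped edge violates two plaquettes
print(igt_energy(excited))    # -252

gauge = ground.copy()         # flip all four edges meeting at site (3, 5)
gauge[3, 5, 0] = gauge[3, 5, 1] = gauge[2, 5, 0] = gauge[3, 4, 1] = -1
print(igt_energy(gauge))      # -256
```

Flipping all edges around a site changes two spins on each adjacent plaquette, so the energy is unchanged. Such local flips generate many distinct ground states with very different magnetization, which is why there is no simple global order parameter for the network to pick up.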

First, we load and analyze the shape of our data set again. As before, they are located in the folder `Ising_data` and labeled by a temperature.

In [None]:
N = 16 # linear dimension of the lattice 

ilgt_training_configs = np.load(folder + "/ilgt_training_configs.npy")
ilgt_training_labels = np.load(folder + "/ilgt_training_labels.npy")
ilgt_test_configs = np.load(folder + "/ilgt_test_configs.npy")
ilgt_test_labels = np.load(folder + "/ilgt_test_labels.npy")

print('train_images.shape =', ilgt_training_configs.shape)
print('train_labels.shape =', ilgt_training_labels.shape)
print('test_images.shape =', ilgt_test_configs.shape)
print('test_labels.shape =', ilgt_test_labels.shape)

In [None]:
# wrap data in a Dataset instance
train_set = TensorDataset(torch.tensor(ilgt_training_configs,dtype=torch.float),torch.tensor(ilgt_training_labels,dtype=torch.long))
test_set  = TensorDataset(torch.tensor(ilgt_test_configs,dtype=torch.float),    torch.tensor(ilgt_test_labels,dtype=torch.long))

In [None]:
## Let's build this dense neural network (DNN) ourselves now!
## We want (for start) a DNN which takes an input of certain shape
## then let's go with hidden layer of 100 neurons and ReLU
## then output layer (# of neurons = # of classes in the problem)

# Dense network
dense_network = nn.Sequential(
    Flatten(),
    nn.Linear(16*16*2,100),
    nn.ReLU(),
    nn.Linear(100,2)
)

# wrap it up in pytorch lightning model
dense_model = LitModel(dense_network)

# Print a summary of the model
print(dense_network)

In [None]:
# Now let's train our DNN! 50 epochs, take batch size 32
# wrap the torch dataset instance in a data loader
train_loader = DataLoader(train_set,batch_size=32,shuffle=True)
test_loader  = DataLoader(test_set ,batch_size=100)

# create a trainer instance
trainer = pl.Trainer(max_epochs=50)
trainer.fit(model=dense_model, train_dataloaders= train_loader, val_dataloaders=test_loader )

In [None]:
# How does the DNN perform?
predictions = trainer.predict(dense_model,test_loader)
predictions = np.concatenate([pred.numpy() for pred in predictions])
acc = np.mean(predictions==ilgt_test_labels)

print('Dense Network')
print('accuracy on test set = {}%'.format(acc*100))

### WOW! What just happened? Can you guess by checking the training history? Plot the accuracy and loss for training and validation data...

In [None]:
# See how the training looked like:
# summarize history for accuracy
plot_history(dense_model, 'acc')
# summarize history for loss function value
plot_history(dense_model, 'loss')

## Convolutional Neural Networks

Finally, let's play with one of the most powerful ML models, designed especially for images (i.e., for higher-dimensional data where spatial dependence is a significant feature).

What do convolutional layers do?

![Alt Text](https://miro.medium.com/max/789/0*jLoqqFsO-52KHTn9.gif)

The yellow matrix is called a kernel, and its size is one of the hyperparameters. It moves around the green (input) image with step defined by `stride` (here = 1), and how it behaves at the edges of the image is called `padding`. The resulting convolved image is an input to a next layer.
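
How kernel size, stride and padding determine the size of the convolved image can be checked with a short sketch. The helper `conv_output_size` is a hypothetical name; it implements the standard output-size formula, which is then compared against an actual `nn.Conv2d` layer on an 18x18 input with 2 channels (the shape the padded 16x16 IGT configurations have further below).

```python
import torch
import torch.nn as nn

def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Side length after a convolution: floor((size + 2*padding - kernel_size) / stride) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

# An 18x18 input with a 2x2 kernel, stride 1 and no padding
print(conv_output_size(18, kernel_size=2))   # 17

conv = nn.Conv2d(in_channels=2, out_channels=16, kernel_size=2, stride=1, padding=0)
out = conv(torch.zeros(1, 2, 18, 18))
print(out.shape)                             # torch.Size([1, 16, 17, 17])
```

This is where the `17*17*16` input features of the first linear layer after the convolution come from.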

In [None]:
# we want to be able to use different kernel sizes and compare
conv_network = nn.Sequential(
    nn.Conv2d(2,16,kernel_size=2,stride=1, padding=0),
    nn.ReLU(),
    Flatten(),
    nn.Linear(17*17*16,8),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(8,2)
)

conv_model = LitModel(conv_network,learning_rate=1e-4)
# I have adjusted the default learning rate.
# This hyperparameter tuning is, in general, quite tedious and might sometimes explain why learning fails as well.

print(conv_model)

We have introduced two new layers here. Let's briefly look at what each layer does.

1.  **Convolutional**: This layer applies 16 kernels of size 2 by 2 to the 2-channel input image (if both kernel dimensions are identical, a single integer suffices). For our purpose, we need periodic boundary conditions and we thus use no padding, which means it does not add additional 'pixels' around the configuration. We instead add the padding ourselves (see below).
```
 nn.Conv2d(2, 16, kernel_size=(2,2), stride=(1,1), padding=0)
```


2.  **Dropout**: This layer drops each node of the preceding layer with a probability of 20%.
```
nn.Dropout(0.2)
```
```
Note that this layer is only applied during training. When evaluating or predicting on test samples, this layer is not applied.
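
The difference between training and evaluation mode can be made visible with a small sketch, using the dropout rate 0.2 from `conv_network` on a vector of ones. Lightning switches between the two modes automatically; here it is done by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(1000)

drop.train()            # training mode: entries are zeroed at random,
y_train = drop(x)       # the survivors are rescaled by 1/(1-p) = 1.25
drop.eval()             # evaluation mode: the layer acts as the identity
y_eval = drop(x)

print((y_train == 0).float().mean().item())   # fraction of dropped entries, close to 0.2
print(torch.equal(y_eval, x))                 # True
```

The rescaling during training keeps the expected value of each entry unchanged, so no correction is needed at evaluation time.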

For more information about the layers, check out the PyTorch documentation: https://pytorch.org/docs/stable/nn.html#loss-functions


One final note on the input shape: While for the (first) dense layer we can define just about any shape, the input shape for the convolutional layer is necessarily N x C x M, where C is the number of channels. If the input available has only a single channel, i.e., its shape is N x M, an additional axis with dimension 1 needs to be added for it to work (to make it N x 1 x M). M here can even be another shape M = M1 x M2 x ...
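
As a sketch of both cases: a single-channel input gets its channel axis via `unsqueeze`, while data stored with the channel as the last axis needs a transpose. The array sizes are illustrative.

```python
import numpy as np
import torch

# Single-channel data without a channel axis: N x M1 x M2
single_channel = torch.zeros(8, 30, 30)
with_channel = single_channel.unsqueeze(1)     # N x 1 x M1 x M2
print(with_channel.shape)                      # torch.Size([8, 1, 30, 30])

# Channel stored last (N x M1 x M2 x C): move it to position 1
channels_last = np.zeros((8, 16, 16, 2))
channels_first = channels_last.transpose(0, 3, 1, 2)
print(channels_first.shape)                    # (8, 2, 16, 16)
```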

In [None]:
# Create the periodic padding for training and test configurations
def create_periodic_padding(configs, kernel_size=2):
    N = np.shape(configs)[1]
    padding = kernel_size-1
    x = []
    for config in configs:
        padded = np.zeros((N+2*padding, N+2*padding, 2))
        # lower left corner
        padded[:padding,:padding, :] = config[N-padding:,N-padding:,:]
        # lower middle
        padded[padding:N+padding, :padding, :] = config[:,N-padding:,:]
        # lower right corner
        padded[N+padding:, :padding, :] = config[:padding, N-padding:, :]
        # left side
        padded[:padding, padding:N+padding, :] = config[N-padding:, :, :]
        # center
        padded[padding:N+padding, padding:N+padding, :] = config[:,:,:]
        # right side
        padded[N+padding:, padding:N+padding, :] = config[:padding, :, :]
        # top left corner
        padded[:padding, N+padding:,:] = config[N-padding:, :padding, :]
        # top middle
        padded[padding:N+padding, N+padding:, :] = config[:, :padding, :]
        # top right corner
        padded[N+padding:, N+padding:, :] = config[:padding, :padding, :]
        x.append(padded)
    return np.array(x).transpose([0,3,1,2]) # this ensures that the color channel is at the right position

x_n = create_periodic_padding(ilgt_training_configs)
test_x_n = create_periodic_padding(ilgt_test_configs)

# wrap data in a Dataset instance
train_set = TensorDataset(torch.tensor(x_n,dtype=torch.float),torch.tensor(ilgt_training_labels,dtype=torch.long))
test_set  = TensorDataset(torch.tensor(test_x_n,dtype=torch.float),    torch.tensor(ilgt_test_labels,dtype=torch.long))

In [None]:
# notice the distinctive shape now
x_n.shape, test_x_n.shape
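
For reference, NumPy's `np.pad` with `mode="wrap"` should produce the same periodic padding in a single call. The sketch below uses random toy configurations; `periodic_padding_np` is an illustrative name, not part of the original notebook.

```python
import numpy as np

def periodic_padding_np(configs, kernel_size=2):
    """Periodic padding via np.pad; configs has shape (n_samples, N, N, channels)."""
    p = kernel_size - 1
    # wrap only the two spatial axes; the sample and channel axes stay untouched
    padded = np.pad(configs, ((0, 0), (p, p), (p, p), (0, 0)), mode="wrap")
    return padded.transpose(0, 3, 1, 2)   # channels first, as nn.Conv2d expects

toy_configs = np.random.default_rng(0).choice([-1, 1], size=(3, 16, 16, 2))
toy_padded = periodic_padding_np(toy_configs)
print(toy_padded.shape)   # (3, 2, 18, 18)
```

Unlike the loop-based version, this keeps the integer dtype of the input; the conversion to `torch.float` happens when the tensors are created anyway.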

In [None]:
# wrap the torch dataset instance in a data loader
train_loader = DataLoader(train_set,batch_size=32,shuffle=True)
test_loader  = DataLoader(test_set ,batch_size=100)

# create a trainer instance
trainer = pl.Trainer(max_epochs=50)
trainer.fit(model=conv_model, train_dataloaders= train_loader, val_dataloaders=test_loader )

In [None]:
# Convolutional Network
predictions = trainer.predict(conv_model,test_loader)
predictions = np.concatenate([pred.numpy() for pred in predictions])
acc = np.mean(predictions==ilgt_test_labels)

print('Conv Network')
print('accuracy on test set = {}%'.format(acc*100))

In [None]:
# See how the training looked like:
# summarize history for accuracy
plot_history(conv_model, 'acc')
# summarize history for loss function value
plot_history(conv_model, 'loss')

Comparing the plots for the dense network (above) and the conv network (here), we can observe a few things. 

1.  **Dense Network**: Very quickly, the loss is decreasing on the training set, but it is actually increasing for the validation set. There is a large and widening gap between validation loss and training loss. (We are overfitting.)
2.   **Convolutional Network**: Both the validation and the training losses are decreasing for all epochs. (No overfitting.)
3. Even though the dense network has a lower loss on the training set, its loss on the validation set is much higher than that of the convolutional network.

We can see that the dropout layer definitely helps to prevent overfitting.
For many applications, adding a dropout layer can help to avoid overfitting. This is, however, not enough here: we have to include conv layers as well!


Some ways to combat overfitting:

1. **Dropout layers**: This prevents the model from co-adapting or memorising data
2. **Adding regularisation** terms to the cost function, e.g. to penalise large parameters: For example, adding an L2 regularisation term to the loss function, i.e. $L_2 = \|w\|^2$, where $w$ represents the weights of the corresponding layer. This limits the power of the model by preventing it from exploring large parameters. In pytorch, this can be done via the `weight_decay` argument of the *optimizer*, see https://pytorch.org/docs/stable/optim.html


3. **Early stopping**: Here we simply stop training when we have achieved a satisfactory validation accuracy or loss. For example, in the dense network we trained above, it is probably a good idea to stop the training after around 11 epochs.
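
Two of these remedies can be sketched in a few lines. The value `weight_decay=1e-4`, the helper `early_stopping_epoch`, the patience of 3 epochs and the toy loss values are all arbitrary choices made for illustration. For `torch.optim.Adam`, `weight_decay` corresponds to an L2 penalty on all parameters passed to the optimizer.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16 * 2, 100), nn.ReLU(), nn.Linear(100, 2))

# L2 regularisation through the optimizer; 1e-4 is an arbitrary example value
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch at which training would stop: the first epoch at which the
    validation loss has not improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1   # never triggered: train until the end

toy_val_losses = [0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
print(early_stopping_epoch(toy_val_losses))   # 5
```

In practice one would also restore the parameters from the best epoch (here epoch 2). PyTorch Lightning provides an `EarlyStopping` callback for this kind of patience-based rule, which requires the monitored validation metric to be logged.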

