
# CNN on CIFAR 10

We've used a fully connected NN with many layers to classify the 10 CIFAR10 classes. However, its performance was not very good (please go back and check its accuracy, as well as how many parameters the network had).

By adding convolutional layers to the beginning of a neural network, we can create a CNN that achieves better results with fewer parameters.

IMPORTANT: We highly recommend using a GPU for this exercise. The training of the models takes a significant amount of time. You should write down the parameters and results of every run you make to compare and determine the best model.




## 1. AllConvolutional Network

In this case, we'll use the `All-CNN-C` variant of [the AllConvolutional network](https://arxiv.org/abs/1412.6806). This network was one of the first deep networks to propose using only convolution operations to classify images. It does not use MaxPooling layers; instead, it relies on convolutions with `strides=(2,2)` to downsample the feature maps. As with VGG, AllConvolutional relies on `3x3` convolutions and doubles the number of filters every time a subsampling operation reduces the number of pixels in the feature maps to 1/4.
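As a quick check of the downsampling arithmetic, the sketch below assumes the `padding='same'` rule used by Keras, where the output size is `ceil(input_size / stride)`. The helper name `same_padding_output_size` is made up for illustration and is not part of any library.

```python
import math


def same_padding_output_size(input_size: int, stride: int) -> int:
    """Spatial size after a convolution with padding='same' and the given stride."""
    return math.ceil(input_size / stride)


size = 32  # CIFAR10 images are 32x32
sizes = [size]
for _ in range(2):  # two stride-2 convolutions, as an example
    size = same_padding_output_size(size, stride=2)
    sizes.append(size)

print(sizes)  # [32, 16, 8]

# Each stride-2 step keeps one quarter of the pixels.
pixel_ratio = (sizes[1] ** 2) / (sizes[0] ** 2)
print(pixel_ratio)  # 0.25
```

Halving both spatial dimensions is what leaves 1/4 of the pixels, which is why the filter count is increased at those points.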

Note that the AllConvolutional model does not include a Flatten layer after the final convolutional layer, but instead relies on a combination of two special layers to generate the scores for each class.

First, a `Conv2D` layer with `10` feature maps as outputs and `1x1` convolutions converts the `192` feature maps into `10xHxW` scores. Then a [GlobalAveragePooling2D](https://keras.io/api/layers/pooling_layers/global_average_pooling2d) layer averages out the `HxW` spatial dimensions, leaving just `10` scores, one per class. Finally, the original article used a softmax directly over these `10` class scores. Instead, we propose you add a `Dense` layer with `softmax` activation to perform the final classification. This combination of Global Average Pooling with Dense layers has become a standard practice since it adds a bit of prediction power to the model.
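To make the shapes concrete, here is a minimal NumPy sketch of what the `1x1` convolution and the global average pooling compute for a single image. It is an illustration of the arithmetic, not the Keras implementation. It assumes the channels-last layout `(H, W, C)` that Keras uses by default, and the `8x8` spatial size and random weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last 192-filter convolution for one image,
# in channels-last layout (H, W, C). H = W = 8 is an assumption.
features = rng.normal(size=(8, 8, 192))

# A 1x1 convolution applies the same (192 -> 10) linear map at every pixel.
kernel = rng.normal(size=(192, 10))
bias = np.zeros(10)
score_maps = features @ kernel + bias  # shape (8, 8, 10)

# Global average pooling: mean over the two spatial axes.
class_scores = score_maps.mean(axis=(0, 1))  # shape (10,)

print(score_maps.shape, class_scores.shape)
```

Neither step depends on a fixed `H` or `W`, which is why no Flatten layer is needed.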

**Hint:** Your network should have only around 1M parameters.

**Hint:** Use `padding='same'` to keep all intermediate spatial dimensions powers of 2 (32, 16, 8, etc.).
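When comparing your model against the parameter hint, it can help to count the parameters of a convolutional layer by hand and check them against `model.summary()`. The sketch below assumes square kernels and one bias per output channel; `conv2d_params` is a hypothetical helper, not a Keras function.

```python
def conv2d_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights plus biases of a square 2D convolution."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# A 3x3 convolution between two 192-channel feature maps.
print(conv2d_params(3, 192, 192))  # 331968
# The 1x1 convolution that maps 192 feature maps to 10 class scores.
print(conv2d_params(1, 192, 10))   # 1930
```

Note how cheap the `1x1` classification layer is compared with a single `3x3` layer in the body of the network.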

In [None]:
from tensorflow import keras

model = None

model = keras.Sequential([
    # YOUR IMPLEMENTATION HERE (START)

    # YOUR IMPLEMENTATION HERE (END)
])

model.summary()

## 2. Data loading

You can load the CIFAR10 dataset with a single Keras function: `tf.keras.datasets.cifar10.load_data`.
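The loader returns `uint8` images of shape `(N, 32, 32, 3)` and integer labels of shape `(N, 1)`. One common normalization, shown here on a small synthetic stand-in so it runs without downloading anything, is to scale pixels to `[0, 1]`. The names `x_fake` and `y_fake` are placeholders, and other schemes (for example per-channel standardization) are equally valid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in with the same layout as the CIFAR10 arrays:
# uint8 images of shape (N, 32, 32, 3) and integer labels of shape (N, 1).
x_fake = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
y_fake = rng.integers(0, 10, size=(4, 1), dtype=np.uint8)

# Scale pixel values from [0, 255] to [0, 1] as 32-bit floats.
x_scaled = x_fake.astype("float32") / 255.0

print(x_scaled.dtype, x_scaled.min(), x_scaled.max())
print(y_fake.ravel())  # integer class ids, not one-hot vectors
```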


In [None]:
# Load the data, then normalize it
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# YOUR IMPLEMENTATION HERE (START)


# YOUR IMPLEMENTATION HERE (END)

## 3. Training the model

CNNs are trained just the same as any NN model; you'll need to `compile` the model with an optimizer, a loss function, and some metrics to monitor the performance of the model.

Since this is a classification problem, you'll use a cross-entropy loss. Check the encoding of the targets to decide which variant applies: `sparse_categorical_crossentropy` for integer labels, or `categorical_crossentropy` for one-hot encoded vectors.
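The two loss variants compute the same quantity and differ only in how the targets are encoded. The NumPy sketch below illustrates this on made-up probabilities. It is not the Keras implementation, and the numbers are placeholders.

```python
import numpy as np

# Hypothetical softmax outputs for 3 samples and 4 classes (each row sums to 1).
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
])

labels = np.array([0, 1, 3])  # integer ("sparse") encoding
one_hot = np.eye(4)[labels]   # one-hot encoding of the same labels

# Sparse form: pick the probability of the true class by index.
sparse_loss = -np.log(probs[np.arange(len(labels)), labels]).mean()

# One-hot form: multiplying by the one-hot row selects the same entry.
dense_loss = -(one_hot * np.log(probs)).sum(axis=1).mean()

print(np.isclose(sparse_loss, dense_loss))  # True
```

Choosing the variant that does not match your label encoding usually shows up as a shape error or a loss that does not decrease.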


Afterwards, you can just call the `fit` function as usual. Training for 30 epochs with a batch size of 128 should suffice to obtain a reasonably good accuracy on the test set (~79%). Save the result of the `fit` function to a variable named `history`.

In [None]:
# Compile the model

# YOUR IMPLEMENTATION HERE (START)


# YOUR IMPLEMENTATION HERE (END)

history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=30, batch_size=128)

## 4. Analyzing the accuracy curves

Let's visualize the accuracy on the training set (`accuracy`) and the test set (`val_accuracy`) during training.

* Is the training accuracy always increasing? And the training loss?
* What about the test accuracy/loss?
* Is the model underfitting or overfitting?
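Besides looking at the plots, the questions above can be checked numerically. The sketch below uses made-up numbers stored under the same keys as `history.history`. They are not real results, and `summarize` is a hypothetical helper.

```python
# Made-up numbers with the same keys as `history.history`; not real results.
fake_history = {
    "accuracy":     [0.40, 0.55, 0.65, 0.72, 0.80],
    "val_accuracy": [0.42, 0.54, 0.61, 0.63, 0.62],
}


def summarize(history_dict):
    """Return the best validation epoch (1-based) and the final train/val gap."""
    val = history_dict["val_accuracy"]
    best_epoch = max(range(len(val)), key=val.__getitem__) + 1
    gap = history_dict["accuracy"][-1] - val[-1]
    return best_epoch, gap


best_epoch, gap = summarize(fake_history)
print(f"best val epoch: {best_epoch}, final train-val gap: {gap:.2f}")
```

A gap that keeps widening while the validation accuracy stalls or drops is a typical sign of overfitting.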

In [None]:
import matplotlib.pyplot as plt


def plot_history(history):
    # Accuracy curves for the training and test sets
    plt.figure()
    plt.plot(history.history['accuracy'], label='Train accuracy')
    plt.plot(history.history['val_accuracy'], label='Test accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.ylim([0, 1])
    plt.legend(loc='lower right')

    # Loss curves in a separate figure
    plt.figure()
    plt.plot(history.history['loss'], label='Train loss')
    plt.plot(history.history['val_loss'], label='Test loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend(loc='upper right')


plot_history(history)

## 5. Adding Dropout regularization

Your network has many parameters! Therefore, if you train it for too long, it is likely to overfit the training set. To counter that, you can try adding a `Dropout` layer before the last Dense layer, introducing a reasonably low amount of noise. We suggest trying low dropout probabilities, on the order of `0.1` to `0.3`. Define a new model, now with Dropout, and train it again.

* What happens if you increase the probability too much?
* Can you find a dropout probability that increases the final test set accuracy?
* Try adding a bit of dropout between convolutions as well. Is the model more sensitive to higher dropout probabilities in these lower layers? Why?
* Do you need to adjust the number of epochs to obtain a good performance?
* Extra: You could also try using L1/L2 regularization and compare the results.
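To build intuition for the dropout probability, the following NumPy sketch shows inverted dropout, the scheme the Keras `Dropout` layer follows: units are zeroed with probability `rate` during training, the survivors are scaled by `1 / (1 - rate)`, and at inference time the input passes through unchanged. The `dropout` function here is an illustrative stand-in, not the library code.

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x  # identity at inference time
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((1000, 10))

dropped = dropout(activations, rate=0.3, training=True, rng=rng)
unchanged = dropout(activations, rate=0.3, training=False, rng=rng)

print((dropped == 0).mean())  # fraction of zeroed units, close to 0.3
print(dropped.mean())         # close to 1.0 thanks to the rescaling
```

With a high `rate`, most of the signal is replaced by zeros on every step, which is one way to reason about the first question above.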


In [None]:


model = None
# YOUR IMPLEMENTATION HERE (START)

# YOUR IMPLEMENTATION HERE (END)
history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=30, batch_size=128)
plot_history(history)

## 6. Data Augmentation

Data augmentation (DA) is a common and relatively cheap way to increase model performance.

In Keras, DA transformations can be added to the model definition simply as additional layers. They will only run while training the model (`fit` and friends), but not during evaluation or other scenarios.
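The train-only behaviour can be pictured with a small NumPy sketch. It mimics the idea of a `training` flag with a hand-written horizontal flip. It is not how the Keras layers are implemented, and `random_horizontal_flip` is a hypothetical helper.

```python
import numpy as np


def random_horizontal_flip(images, training, rng):
    """Flip each image left-right with probability 0.5, only when training."""
    if not training:
        return images  # evaluation sees the untouched images
    flip = rng.random(len(images)) < 0.5
    out = images.copy()
    out[flip] = out[flip, :, ::-1, :]  # reverse the width axis of (N, H, W, C)
    return out


rng = np.random.default_rng(0)
batch = rng.random((8, 32, 32, 3)).astype("float32")

augmented = random_horizontal_flip(batch, training=True, rng=rng)
evaluated = random_horizontal_flip(batch, training=False, rng=rng)

print(np.array_equal(evaluated, batch))
```

Because a different random subset is flipped on every call, the model sees a slightly different version of each batch in every epoch.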

Try adding [random rotations](https://www.tensorflow.org/api_docs/python/tf/keras/layers/RandomRotation), [flips](https://www.tensorflow.org/api_docs/python/tf/keras/layers/RandomFlip) and other [random DA layers](https://keras.io/api/layers/preprocessing_layers/image_augmentation/) between the input layer and the first convolutional layer. Then you can simply train your model as usual.

Note that you'll probably need to train the model a bit more than in the previous cases.

Note: This way of applying DA is relatively new in Keras and differs significantly from the previous API (replaced in ~2020) that used the `ImageDataGenerator` class, so beware when browsing documentation and tutorials.


Answer:

* Try using extreme DA transformations, such as cropping more than 50% of the image. What effect does that have on the model and the train/test performance?
* By default, data augmentation is applied only to the training set; the test set is not augmented. Why do you think that is? In which cases would it make sense to apply DA to the test set as well?
* Given your previous answers, think of 3 transformations that would not help the model perform better on the test set.




In [None]:
model = None
# YOUR IMPLEMENTATION HERE (START)


# YOUR IMPLEMENTATION HERE (END)
history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=30, batch_size=128)
plot_history(history)