# Welcome to the second part of the ML image analysis and classification workshop!

In this notebook you will implement a very basic CNN model to try to classify the images that you analysed in the previous lab.

You will use the Keras framework to implement the CNN, since Keras is generally easier to get started with than PyTorch.

## <ins>STARTING COMMANDS</ins>
1. Start by changing the runtime type to GPU. Do this by selecting __runtime__ in the menu, and then __change runtime type__. Choose GPU.


2. Then run these cells below, to bring all files into the directory and get you up to speed :-)

In [None]:
# We start by running this cell to make sure that all relevant files are present in the folder structure
!git clone https://github.com/NordAxon/NBI-Handelsakademin-ML-Labs.git


Cloning into 'NBI-Handelsakademin-ML-Labs'...
remote: Enumerating objects: 64, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 64 (delta 11), reused 59 (delta 9), pack-reused 0
Unpacking objects: 100% (64/64), done.
Downloading covid-xray-modified.zip to /content
 52% 9.00M/17.3M [00:01<00:01, 5.12MB/s]
100% 17.3M/17.3M [00:01<00:00, 11.2MB/s]


### Then we download the data from Kaggle
#### For this to work you need to do the following:
1. Register or Log in to [Kaggle](https://www.kaggle.com/)
2. Create an API token: Kaggle -> Settings -> Account -> Create New Token ([link](https://www.kaggle.com/settings/account))
3. Place the downloaded *kaggle.json* file under the NBI-Handelsakademin-ML-Labs folder.

#### In the next cell we run commands to move the Kaggle API key to the right place and download and unzip the data. 

In [None]:

# We move the Kaggle API token to where Colab wants it
!mkdir -p ~/.kaggle/ && mv NBI-Handelsakademin-ML-Labs/kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

# We download the kaggle dataset
!kaggle datasets download -d suddirutten/covid-xray-modified

# And we unzip the dataset and put it in the image-lab folder :-)
import zipfile
with zipfile.ZipFile("/content/covid-xray-modified.zip","r") as zip_ref:
    zip_ref.extractall("/content/NBI-Handelsakademin-ML-Labs/image-lab")


### Preprocessing

In [26]:
# In case you did not finish the first part of the lab, this cell will convert the raw data into the processed images :-)
# You can add your own code here and run that instead if you wish
# Feel free to change the hyperparameters to the CLAHE function as you see fit

from collections import defaultdict
import cv2
import numpy as np
from pathlib import Path

# Point to dataset path
path_to_images = Path(
    "/content/NBI-Handelsakademin-ML-Labs/image-lab/Covid19-dataset/raw"
)
# rglob through and make sure to get all file extensions
all_paths = (
    list(path_to_images.rglob("*.jpeg"))
    + list(path_to_images.rglob("*.jpg"))
    + list(path_to_images.rglob("*.png"))
)

# Create empty list and empty dict
images_list = []
set_sizes = defaultdict(lambda: defaultdict(int))
# Loop through each path that we have
for image_path in all_paths:
    set_name = image_path.parent.parent.name
    image_class = image_path.parent.name
    set_sizes[set_name][image_class] += 1
    image = cv2.imread(str(image_path))
    images_list.append((image_path, image))

# Saving the CLAHE transformed images under processed
# Create the CLAHE object once and reuse it for every image
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
for path, img in images_list:
    # Skip files that OpenCV could not read (cv2.imread returns None)
    if img is None:
        continue
    # --- Applying CLAHE to the first channel -----
    cl1 = clahe.apply(img[:, :, 0])

    # --- Saving new image under processed ----
    new_path = path.parents[3] / "processed" / path.parents[1].name / path.parent.name
    new_path.mkdir(parents=True, exist_ok=True)
    cv2.imwrite(str(new_path / path.name), cl1)

## <ins>BACKGROUND</ins>
When training a model, you will need to split your dataset into 3 subsets.

These sets are called __training set__, __validation set__ and finally the __test set__.

As you have noticed, the dataset is currently split only into training and test set folders. Therefore, your first task is related to splitting the dataset.

When splitting datasets, normally we use the names `X` and `y` for the tensors (matrices) that we are creating. The `X` matrix will consist of the samples of a given set and the `y` matrix will consist of the labels that correspond to each sample in `X`. 

- It is __VERY IMPORTANT__ that the order of the labels in the `y` matrix matches the order of the samples in `X`. 
- It is also __VERY IMPORTANT__ that no samples overlap between the sets - i.e. every image should be present in only one set. 
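
The two rules above can be illustrated with a small NumPy sketch. The data here is made up (random toy "images" and hand-written labels), and this is just one possible way to split, not the required solution: a single shared permutation is used to index both `X` and `y`, which keeps them aligned, and the index sets are checked to be disjoint.

```python
import numpy as np

# Hypothetical toy data: 10 "images" of 4x4 pixels and their integer labels.
rng = np.random.default_rng(seed=0)
X = rng.random((10, 4, 4))
y = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])

# One shared permutation shuffles samples and labels identically,
# so X[i] and y[i] still belong together afterwards.
order = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, val_idx = order[:split], order[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# No index appears in both subsets, so no image is shared between them.
assert set(train_idx).isdisjoint(val_idx)
print(X_train.shape, X_val.shape)
```

Shuffling `X` and `y` separately would break the pairing between samples and labels, which is why both arrays are indexed with the same `order`.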

## <ins>EXERCISE 1 - Dataset splitting</ins>
Your first task is to read the images from the Covid19-dataset/processed/ folder into 3 separate sets, each consisting of an array of samples and an array of labels. These arrays should be created:
- `X_train` = a numpy array of training images. In this array, you will put 80% of the images found in the folder /processed/train.
- `y_train` = a numpy array consisting of the labels for the `X_train` array. You can use the following class to number mapping: `labels = {"normal": 0, "viral pneumonia": 1, "covid": 2}`. Make sure that this array corresponds to the order of the images in `X_train`: if the first image in `X_train` is of class `normal`, then the first entry in `y_train` should be `0`.
- `X_val` = a numpy array of validation images. In this array, you will put the remaining 20% of the images found in the folder /processed/train.
- `y_val` = a numpy array consisting of the labels for the `X_val` array. Make sure that this array corresponds to the order of the images in `X_val`: if the first image in `X_val` is of class `covid`, then the first entry in `y_val` should be `2`.
- `X_test` = a numpy array of all images found in the folder /processed/test
- `y_test` = a numpy array consisting of the labels for the `X_test` array. Make sure that this array corresponds to the order of the images in `X_test`.


#### When the arrays are created, you will one-hot encode the __label__ arrays. You can do this by using the Keras function `to_categorical`.

After this exercise, you should be able to answer the following questions:
- How many samples are there in the training and validation sets?
- How did the labels change from `{0,1,2}` when you one-hot encoded them?
- How many channels are there in each image?


### Optional hints
- You will make your life easier if you use the sklearn function called `train_test_split` :-)
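
As a rough illustration of the hint, the sketch below calls `train_test_split` on made-up stand-in data (blank 8x8 images and hand-written labels, not the real dataset). The `stratify` and `random_state` arguments are optional choices made here for the example. The one-hot step uses a plain NumPy identity matrix only to show what the encoded labels look like; in the exercise you would use `to_categorical` instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the real data: 20 fake 8x8 grayscale images.
X_all = np.zeros((20, 8, 8), dtype=np.uint8)
y_all = np.array([0] * 8 + [1] * 6 + [2] * 6)

# test_size=0.2 gives the 80/20 split; stratify keeps the class proportions
# similar in both subsets; random_state makes the split reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42
)

# NumPy illustration of one-hot encoding: row i of the identity matrix
# is the one-hot vector for class i.
y_val_onehot = np.eye(3)[y_val]
print(X_train.shape, X_val.shape, y_val_onehot.shape)
```

Because samples and labels are passed to the same call, they are shuffled together and stay aligned.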


In [None]:
# Importing some libraries
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
import cv2
import numpy as np
from pathlib import Path


## Enter your code here :-)

## <ins>BACKGROUND</ins>
Now we are getting ready to actually implement and train a real ML model! We will implement a Convolutional Neural Network (CNN), and hopefully you already know the basics of how a CNN is built. Either way, here comes a recap of how to build a CNN in Keras:

1. Initiate a model. This is normally done with `models.Sequential()` in Keras.
2. Add convolutional layers and max pooling layers as you wish. Remember to specify number of filters, kernel size and activation function as hyperparameters to each convolutional layer. In the max pooling layers you can specify the pooling size.
3. Add a flattening layer to the CNN, to squish the tensor into one long array.
4. Normally, we add a dense layer following this, to boil down the flattened layer to a specified number of neurons.
5. If you add more dense layers at the end of the model, remember that the final layer should have as many outputs as there are classes.
6. Do a `model.summary()` call to check that the layers are working together and to inspect the output shapes and parameter counts.
7. When you have a working model, finish it off by compiling it with `model.compile`. For this lab, it is enough to pass the arguments `loss="categorical_crossentropy"`, `optimizer=optimizers.RMSprop()` and `metrics=["acc"]`.

If you want a head start, you can find a simple Keras CNN in the hidden cell below (double click "A basic CNN"). If you want a challenge - implement it yourself from scratch :-) 

In [None]:
# @title A basic CNN
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import optimizers

# Basic model for a CNN
model = models.Sequential()
# Conv net
model.add(layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
# Classifier
model.add(layers.Flatten())
model.add(layers.Dense(128, activation="relu"))
model.add(layers.Dense(3, activation="softmax"))
model.summary()
# Compile model
model.compile(
    loss="categorical_crossentropy",
    optimizer=optimizers.RMSprop(learning_rate=1e-4),
    metrics=["acc"],
)

## <ins>EXERCISE 2 - Building a CNN</ins>

Finally we get to the fun stuff!

It is now time to implement your very own CNN. The specifications for this model are given below:
- The number of trainable parameters should exceed 1 million
- Use categorical crossentropy as loss function
- Use the accuracy metric when training
- The RMSprop optimizer should be used
- You should train the model for at least 10 epochs
- Use as many convolutional layers and max pooling layers as you wish
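
To get a feel for the 1 million parameter requirement, parameter counts can be estimated by hand and then compared with `model.summary()`. The sketch below does this for an architecture like the basic CNN shown earlier (a 28x28x1 input, one 3x3 convolution with 16 filters, 2x2 max pooling, then dense layers with 128 and 3 units). The helper functions are hypothetical names introduced for this example, and the calculation assumes "valid" padding and stride 1 for the convolution.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # Each filter has kernel_h * kernel_w * in_channels weights plus one bias.
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(inputs, units):
    # Each unit has one weight per input plus one bias.
    return (inputs + 1) * units


# Conv 3x3 with "valid" padding and stride 1: 28x28x1 -> 26x26x16
conv = conv2d_params(3, 3, 1, 16)
# Max pooling 2x2: 26x26x16 -> 13x13x16, no parameters.
# Flattening then gives 13 * 13 * 16 values.
flat = 13 * 13 * 16
dense1 = dense_params(flat, 128)
dense2 = dense_params(128, 3)

total = conv + dense1 + dense2
print(conv, dense1, dense2, total)  # 160 346240 387 346787
```

Most of the parameters sit in the first dense layer, so the size of the flattened tensor and the number of dense units have the largest effect on the total.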


You should be able to answer these questions when you have implemented your CNN:
- What is the input shape of the CNN? How does this correlate to the size of the images?
- How many trainable parameters do you have in your CNN? Do you have any non-trainable parameters? If yes, why do you think that is?


In [None]:
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import ModelCheckpoint

# Enter your code from here on below

## <ins>BACKGROUND</ins>

Now we have arrived at the last step - training and evaluating the models. 

To make the training and saving of models as easy as possible, we have defined a `model_save_path` for you. The idea is that you will only save the best model during training, as the weights will update each epoch and the last epoch is not necessarily the best. When we say *best* model, we mean the one that has the highest validation accuracy. 

We have pre-defined the `callbacks_list` for you as well, but it is up to you to figure out how to pass the `callbacks_list` to the training command.
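
Why the last epoch is not necessarily the best can be seen from a small sketch. The accuracy values below are made up for illustration; they are shaped like the per-epoch list that Keras keeps in `history.history` after `fit`, assuming the metric is logged under the key `val_acc`.

```python
# Hypothetical validation accuracies per epoch, shaped like the "val_acc"
# entry of the dict that Keras exposes as `history.history` after `fit`.
val_acc = [0.61, 0.74, 0.82, 0.79, 0.80]

# Index of the epoch with the highest validation accuracy (0-based).
best_epoch = max(range(len(val_acc)), key=lambda epoch: val_acc[epoch])

print(f"best epoch: {best_epoch + 1}, val_acc: {val_acc[best_epoch]:.2f}")
print(f"last epoch: {len(val_acc)}, val_acc: {val_acc[-1]:.2f}")
```

In this made-up run the third epoch is the best one, so a checkpoint that only keeps the model with the highest validation accuracy would hold the weights from epoch 3 rather than epoch 5.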

## <ins>EXERCISE 3 - Training and evaluating the model</ins>
You will now fit your model to the training data (i.e. the `X_train` and encoded `y_train` matrices).

Requirements for the training:
- You should train for at least 10 epochs
- You should choose a reasonable batch size, normally a power of 2
- You should use your validation set during training

When the model is fully trained, you will predict on the test set (that you have __not__ used in training). Print a classification report from the predictions and analyze it.

You should be able to answer the following questions after the training and evaluation:
- What is your highest validation accuracy during training?
- What is your F1 score on the test set?
- What is the difference between validation accuracy and F1 score on the test set? How do these normally compare?
- Should you modify the model to better fit the test set?
- What is the precision on the different classes? What does this mean?
- Do you think precision or recall is most important as a metric when talking about medical applications like this?


### Optional hints
- Use the Keras function `fit` to train the model.
- Use the `sklearn` function `classification_report` to easily print the classification report. 
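
To make the metrics in the report less of a black box, the sketch below computes precision, recall and F1 per class by hand with NumPy. The labels and probabilities are made up for illustration. It also shows a step that is easy to miss: the model outputs class probabilities and the labels are one-hot encoded, so both are typically converted back to class indices with `argmax` before they are compared.

```python
import numpy as np

# Hypothetical one-hot test labels and softmax outputs: 6 samples, 3 classes.
y_true_onehot = np.eye(3)[[0, 0, 1, 1, 2, 2]]
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.2, 0.3],
])

# argmax turns one-hot rows and probability rows back into class indices.
y_true = y_true_onehot.argmax(axis=1)
y_pred = y_prob.argmax(axis=1)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = np.zeros((3, 3), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # of all samples predicted as class k, how many were k
recall = true_pos / cm.sum(axis=1)     # of all samples that truly are class k, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("f1:       ", f1.round(2))
```

The same `y_true` and `y_pred` index arrays are what you would pass to `classification_report`.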


In [None]:
from tensorflow.keras.models import load_model
from sklearn.metrics import classification_report

# Enter your code from here on below :-) Use as many cells as you wish