# Welcome to the first part of the ML image analysis and classification workshop!

### In this notebook you will do some basic analysis of a dataset. The tasks consist of loading data with pathlib and applying an image transform with OpenCV.

## <ins>START COMMANDS</ins>

#### First we clone the git repo.

In [1]:
# We start by running this cell to make sure that all relevant files are present in the folder structure
!git clone https://github.com/NordAxon/NBI-Handelsakademin-ML-Labs.git


Cloning into 'NBI-Handelsakademin-ML-Labs'...
remote: Enumerating objects: 64, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 64 (delta 11), reused 59 (delta 9), pack-reused 0
Unpacking objects: 100% (64/64), done.
Downloading covid-xray-modified.zip to /content
 52% 9.00M/17.3M [00:01<00:01, 5.12MB/s]
100% 17.3M/17.3M [00:01<00:00, 11.2MB/s]


### Then we download the data from Kaggle
#### For this to work you need to do the following:
1. Register or Log in to [Kaggle](https://www.kaggle.com/)
2. Create an API token: Kaggle -> Settings -> Account -> Create New Token ([link](https://www.kaggle.com/settings/account))
3. Place the downloaded *kaggle.json* file under the NBI-Handelsakademin-ML-Labs folder.

#### In the first cell below we move the Kaggle API key to the right place, then download and unzip the data. In the second cell we run the basic imports that you might need. Feel free to import more libraries if you need them!


In [None]:

# We move the Kaggle API token to where Colab wants it
!mkdir -p ~/.kaggle/ && mv NBI-Handelsakademin-ML-Labs/kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

# We download the kaggle dataset
!kaggle datasets download -d suddirutten/covid-xray-modified

# And we unzip the dataset and put it in the image-lab folder :-)
import zipfile
with zipfile.ZipFile("/content/covid-xray-modified.zip","r") as zip_ref:
    zip_ref.extractall("/content/NBI-Handelsakademin-ML-Labs/image-lab")


#### <ins>Google Colab Navigation</ins>

When you have run the cell above, make sure to open the folder structure by clicking the folder symbol in the menu to the left (see image). If you cannot see a folder named NBI-Handelsakademin-ML-Labs, click the folder update symbol (see image).

![alt text](./utils/image2.JPG "YOLO")

In [5]:
# Run this cell to import all necessary packages

import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
import cv2
import random
from collections import defaultdict

## <ins>BACKGROUND</ins>
We will analyse a dataset of X-Ray images that consists of 3 classes - Covid19, Viral Pneumonia and Normal. You can explore the dataset under __/content/NBI-Handelsakademin-ML-Labs/image-lab/Covid19-dataset/raw__ or https://www.kaggle.com/suddirutten/covid-xray-modified

## <ins>EXERCISE 1 - Analysis</ins>

### Your first task is to read in all images from the dataset and analyze them
Read the images into one list so that you can inspect the sizes of the tensors (i.e. the images).

__You should be able to answer the following questions after this exercise__:

- What shapes are the images?
- How many images are there, per class, in the training and test set?
- There are 3 channels in each image. Is this necessary? (I.e., is the data unique across the channels?)
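
One way to approach the channel question is to compare the channels of an array directly. The snippet below is only a sketch on a synthetic array (a single random grayscale plane copied into three channels), not on the real dataset; the helper name `channels_identical` is made up for illustration.

```python
import numpy as np

# Synthetic stand-in for an image: one grayscale plane copied into 3 channels.
rng = np.random.default_rng(seed=0)
gray = rng.integers(0, 256, size=(4, 5), dtype=np.uint8)
image = np.stack([gray, gray, gray], axis=-1)  # shape: (height, width, channels)


def channels_identical(img: np.ndarray) -> bool:
    """Return True if every channel holds exactly the same pixel values."""
    first = img[..., 0]
    return all(np.array_equal(first, img[..., c]) for c in range(1, img.shape[-1]))


print(image.shape)
print(channels_identical(image))
```

If a check like this returns True for the images you load, a single channel carries all the information.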


In [None]:
# Point to dataset path
path_to_images = Path(
    "/content/NBI-Handelsakademin-ML-Labs/image-lab/Covid19-dataset/raw"
)

## Enter your solution below. Use as many cells as you wish :-)

#### Optional hints for solving the task:
- By using pathlib, we can glob our way through all images and find them immediately. Example: path_dir.rglob('*.xlsx') recursively finds all Excel files (files ending in .xlsx) under path_dir
- Using the pathlib parent and name property, we can easily find the class and set of an image
- Saving the number of images found in a nested dict makes it easy to read (both code and output). A nested dict can be created with dict_name = defaultdict(lambda: defaultdict(int)), or with a plain literal such as dict_name = {"outer_key": {}}. defaultdict is neat because we do not need to check for existing keys when handling the dict.
- Using cv2 we can read images into a numpy array
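
The pathlib and defaultdict hints above can be combined roughly as follows. This is a sketch on a throwaway temporary folder: the `<set>/<class>/<file>` layout, the folder names and the `.png` extension are assumptions made for the example, so check what the real dataset looks like before reusing it.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Hypothetical layout: <root>/<set>/<class>/<file>
    for rel in ["train/A/1.png", "train/A/2.png", "train/B/3.png", "test/A/4.png"]:
        file_path = root / rel
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.touch()

    # Nested counter: counts[set_name][class_name] -> number of files
    counts = defaultdict(lambda: defaultdict(int))
    for image_path in root.rglob("*.png"):
        class_name = image_path.parent.name        # folder directly above the file
        set_name = image_path.parent.parent.name   # folder one level further up
        counts[set_name][class_name] += 1

# Convert to plain dicts for a cleaner printout
plain_counts = {set_name: dict(per_class) for set_name, per_class in counts.items()}
print(plain_counts)
```

Because missing keys are created on first access, the loop needs no `if key in dict` checks.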

## <ins>BACKGROUND</ins>
Assume that you have done a basic pixel value distribution analysis on the images and hypothesize that a contrast/brightness adjustment could be useful. To apply a transform like this you will use the ready-made function __createCLAHE__ from OpenCV.

## <ins>EXERCISE 2 - Pre-processing</ins>

### Your second task is to apply some pre-processing techniques to your images

You will apply one pre-processing technique to one channel of each image and save the results in a separate folder. This folder should mirror the folder structure of the raw data, but be located under "Covid19-dataset/processed". Each image should keep the same name it has under the raw folder.
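
Mirroring the folder structure can be done with pathlib's `relative_to`. The sketch below only computes the target path and touches no files; the helper name `processed_path_for`, the sub-folders and the file name are hypothetical.

```python
from pathlib import Path


def processed_path_for(raw_path: Path, raw_root: Path, processed_root: Path) -> Path:
    """Map a file under raw_root to the same relative location under processed_root."""
    return processed_root / raw_path.relative_to(raw_root)


raw_root = Path("Covid19-dataset/raw")
processed_root = Path("Covid19-dataset/processed")

# Hypothetical example file; the real sub-folders and names may differ.
example = raw_root / "train" / "Normal" / "example.jpeg"
target = processed_path_for(example, raw_root, processed_root)
print(target)

# Before writing an image to `target`, create its folder if needed:
# target.parent.mkdir(parents=True, exist_ok=True)
```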

__You should be able to answer the following questions after this exercise__:
- Looking at one picture, how did the pixel value distribution change after applying a CLAHE function to the image?
- What hyperparameters did you try? How did they affect the distribution and how the image looked?
- Do you think this pre-processing technique could aid an ML model when training?
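
To compare pixel value distributions, one option is to count how many pixels take each 8-bit value. The following is a sketch on synthetic data: the "after" image is produced by a plain linear stretch that stands in for a processed image and is not CLAHE, and the helper name `intensity_histogram` is made up for illustration.

```python
import numpy as np


def intensity_histogram(img: np.ndarray) -> np.ndarray:
    """Count how many pixels take each of the 256 possible 8-bit values."""
    counts, _ = np.histogram(img.ravel(), bins=256, range=(0, 256))
    return counts


rng = np.random.default_rng(seed=0)
before = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)

# Stand-in for a processed image: a simple linear stretch to 0..255 (NOT CLAHE).
after = ((before.astype(np.float32) - 100) / 39 * 255).astype(np.uint8)

hist_before = intensity_histogram(before)
hist_after = intensity_histogram(after)

print("value range before:", before.min(), before.max())
print("value range after: ", after.min(), after.max())
```

The two histograms can then be plotted side by side, for example with `plt.plot(hist_before)` and `plt.plot(hist_after)`.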

In [None]:
## Enter your solution below. Use as many cells as you wish :-)