# Ready, Steady, Go AI (*Tutorial*)

This tutorial is a supplement to the paper **Ready, Steady, Go AI: A Practical Tutorial on Explainable Artificial Intelligence and Its Applications in Plant Digital Phenomics** (submitted to *Patterns, 2021*) by Farid Nakhle and Antoine Harfouche.

Read the accompanying paper [here](https://doi.org).

# Table of contents


* **1. Background**
* **2. Downloading the Dataset**
* **3. Splitting the Dataset**

# 1. Background


**Why do we need to split?**

When training a model for image analysis, an algorithm is presented with image data to learn from. For the model to learn, the algorithm uses a loss function that tells the model how close to or far from a correct prediction it is. Guided by this feedback, the model gradually builds a prediction function that maps the pixels of an image to an output prediction.
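To make the idea of a loss function concrete, here is a minimal sketch using cross-entropy, one common loss for classification. The choice of cross-entropy and the probabilities below are assumptions made for illustration only; they are not taken from the paper or from any trained model.

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Loss for one image: minus the log of the probability given to the true class."""
    return -math.log(predicted_probs[true_class])

# Hypothetical predicted probabilities over three classes for a single image
confident_correct = [0.90, 0.05, 0.05]
confident_wrong = [0.05, 0.90, 0.05]
true_class = 0  # index of the correct class

loss_good = cross_entropy(confident_correct, true_class)
loss_bad = cross_entropy(confident_wrong, true_class)

print(f"loss when the prediction is correct: {loss_good:.3f}")
print(f"loss when the prediction is wrong:   {loss_bad:.3f}")
```

The loss is small when the model puts most of its probability on the correct class and large when it does not, which is the feedback the model uses to adjust itself.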

While this works well, the model might learn an overly specific function that performs well on the given images without being able to generalize to new images that it has not seen during its training. This is known as overfitting.

The train, validation, and testing splits help us avoid overfitting.

Instead of using all the data for training, only the training set, commonly the largest split of a dataset, is used to train the model.

During training, it is important to understand how well the model is doing on images that are not part of the training set. Here, the validation set is used to report a validation metric that tells us the performance of the model at each training epoch.

After training has concluded, the validation metrics give us a hint about the performance of the final model. However, those same metrics might have influenced our hyperparameter choices while building the model, and in this sense, we might have accidentally overfitted to the validation set.
This is where the test set comes in. The evaluation metrics should be measured on the test set at the very end of the project, to finally get a real sense of how well the model will perform in production.
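The three splits described above can be sketched in a few lines of plain Python. This is only an illustration of the idea: the file names are hypothetical, the 80/10/10 ratio and the seed value are example choices, and the helper `split_items` is not part of any library.

```python
import random

def split_items(items, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Shuffle items reproducibly and cut them into train/val/test lists."""
    shuffled = sorted(items)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)  # same seed, same shuffle
    n_train = int(len(shuffled) * ratio[0])
    n_val = int(len(shuffled) * ratio[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

# Hypothetical file names standing in for real images
images = [f"leaf_{i:03d}.jpg" for i in range(100)]
train, val, test = split_items(images)
print(len(train), len(val), len(test))
```

Every image ends up in exactly one of the three lists, so the model is never evaluated on an image it was trained on.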

**What is the best splitting ratio?**


There is no golden ratio for splitting. The choice of a split ratio depends on the dataset: having more data allows a larger portion to be assigned to the training set. It is best to experiment with different splitting ratios. A related technique is cross-validation, in which the data is divided into K parts (commonly 3, 5, or 10) called folds, and each fold takes a turn as the held-out set. There are many strategies for creating these folds; however, they are outside the scope of this tutorial.

# 2. Downloading the Dataset


As a reminder, we are working with the PlantVillage dataset, originally obtained from [here](http://dx.doi.org/10.17632/tywbtsjrjv.1).
For this tutorial, we will be working with a subset of PlantVillage that contains the tomato classes only. We have made the subset available [here](http://dx.doi.org/10.17632/4g7k9wptyd.1).

The following code automatically downloads the dataset and saves it to our current working environment.

**It is important to note that Colab deletes all unsaved data once the instance is recycled. Therefore, remember to download your results once you run the code.**

In [None]:
import os
import zipfile

import requests

## FEEL FREE TO CHANGE THESE PARAMETERS
dataset_url = "http://faridnakhle.com/pv/tomato-original.zip"
save_data_to = "/content/dataset/"
dataset_file_name = "dataset.zip"
#######################################

# Create the target folder if it does not exist yet
os.makedirs(save_data_to, exist_ok=True)
zip_path = os.path.join(save_data_to, dataset_file_name)

print("Downloading dataset...")

# Stream the download so the whole archive is never held in memory at once
r = requests.get(dataset_url, stream=True, headers={"User-Agent": "Ready, Steady, Go AI"})
r.raise_for_status()  # stop here if the server answered with an HTTP error

with open(zip_path, "wb") as file:
    for block in r.iter_content(chunk_size=1024):
        if block:
            file.write(block)

## Extract downloaded zip dataset file
print("Dataset downloaded")
print("Extracting files...")
with zipfile.ZipFile(zip_path, "r") as zip_dataset:
    zip_dataset.extractall(save_data_to)

## Delete the zip file as we no longer need it
os.remove(zip_path)
print("All done!")


Downloading dataset...
Dataset downloaded
Extracting files...
All done!


# 3. Splitting the Dataset

To split the dataset, we will use the split-folders Python library to do a random split of 80% for training, 10% for validation, and 10% for testing.
split-folders accepts a parameter called seed, which makes a specific split reproducible (instead of producing a new random split every time).

To reproduce our results, use the value 1337 for the seed as shown below.

In [None]:
!pip install split-folders tqdm
!splitfolders --output "/content/dataset/tomato-split/" --seed 1337 --ratio .8 .1 .1 -- "/content/dataset/"

Collecting split-folders
  Downloading https://files.pythonhosted.org/packages/b8/5f/3c2b2f7ea5e047c8cdc3bb00ae582c5438fcdbbedcc23b3cc1c2c7aae642/split_folders-0.4.3-py3-none-any.whl
Installing collected packages: split-folders
Successfully installed split-folders-0.4.3
Copying files: 18160 files [00:02, 6226.72 files/s]
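As a quick sanity check, the sketch below counts the files in each split folder and prints their share of the total. It assumes that split-folders created `train`, `val`, and `test` subfolders under the output directory used above; the helper `count_files_per_split` is a hypothetical name introduced here, not part of split-folders.

```python
import os

def count_files_per_split(split_root, splits=("train", "val", "test")):
    """Return {split name: number of files} for each split folder under split_root."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(split_root, split)
        total = 0
        # os.walk yields nothing for a missing folder, so its count stays at 0
        for _dirpath, _dirnames, filenames in os.walk(split_dir):
            total += len(filenames)
        counts[split] = total
    return counts

counts = count_files_per_split("/content/dataset/tomato-split/")
total = sum(counts.values())
for split, n in counts.items():
    share = n / total if total else 0.0
    print(f"{split}: {n} files ({share:.1%})")
```

If the split worked as intended, the printed shares should be close to 80%, 10%, and 10%.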
