# A primer on artificial intelligence in plant digital phenomics: embarking on the data to insights journey (*Tutorial*)

This notebook is a supplement to the paper **A primer on artificial intelligence in plant digital phenomics: embarking on the data to insights journey** (submitted to *Trends in Plant Science, 2022*) by Antoine L. Harfouche, Farid Nakhle, Antoine H. Harfouche, Orlando G. Sardella, Eli Dart, and Daniel Jacobson.

Read the accompanying paper [here](https://doi.org) (a link will be available once the paper is published).

Before attempting the exercises in this notebook, visit our GitHub repository and try to open and run the notebook provided by the tutorial.

Here, the solution for each exercise can be found in a hidden code cell at its end.

Interested users should try to solve the exercises with the help of the notebook provided by the tutorial before looking at the solution.

**It is important to note that Colab deletes all unsaved data once the instance is recycled. Therefore, remember to download your results once you run the code.**

# Exercise I: dataset preparation

The next code block defines a function that downloads files from Google Drive based on the file ID. 

Use this function to download the cassava dataset hosted on Google Drive at https://drive.google.com/file/d/13jwC684Sg1wWLhF7SjPIlsfJNuKqJ_IQ/view?usp=sharing.
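
The function takes the file ID, not the full sharing link. As a minimal sketch, assuming the link follows the `/file/d/<ID>/` pattern shown above, the ID can be pulled out with a regular expression. The helper name `extract_drive_file_id` is hypothetical and is not part of the tutorial code.

```python
import re

def extract_drive_file_id(share_url):
    """Return the file ID embedded in a Google Drive sharing URL (hypothetical helper)."""
    # Assumes the '/file/d/<ID>/' layout used by the link above.
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"No file ID found in: {share_url}")
    return match.group(1)

share_url = "https://drive.google.com/file/d/13jwC684Sg1wWLhF7SjPIlsfJNuKqJ_IQ/view?usp=sharing"
print(extract_drive_file_id(share_url))
```

The returned string is the value to pass as the first argument of the download function.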

In [None]:
import requests

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

In [None]:
### WRITE YOUR CODE HERE ###

# Solution

In [None]:
file_id = '13jwC684Sg1wWLhF7SjPIlsfJNuKqJ_IQ'
destination = '/content/dataset.zip'
download_file_from_google_drive(file_id, destination)

# Exercise II: data extraction

Complete the code using the unzip command to extract the dataset to /content/dataset.

In [None]:
!mkdir /content/dataset
!apt-get install unzip
### WRITE YOUR CODE HERE ###

# Solution

In [None]:
#unzip dataset
!mkdir /content/dataset
!apt-get install unzip
!unzip /content/dataset.zip -d /content/dataset/
!rm -R  /content/dataset.zip #save some space
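
The same extraction can also be done from Python without shell commands. The following is a minimal sketch using the standard-library `zipfile` module; the helper name `extract_archive` is hypothetical, and the demonstration runs on a throwaway archive rather than the cassava dataset.

```python
import os
import tempfile
import zipfile

def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    os.makedirs(target_dir, exist_ok=True)  # create the folder if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()

# Self-contained demonstration on a throwaway archive (not the cassava dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("cdsv5/train/0/example.txt", "placeholder")
    names = extract_archive(demo_zip, os.path.join(tmp, "dataset"))
    extracted_ok = os.path.exists(
        os.path.join(tmp, "dataset", "cdsv5", "train", "0", "example.txt")
    )
    print(names, extracted_ok)
```

In this notebook, the equivalent call would be `extract_archive('/content/dataset.zip', '/content/dataset')`.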

# Exercise III: descriptive data analysis

The next code block counts the images in each class of the training dataset and displays the results as a bar plot. This shows whether the training dataset is balanced across classes.

In [None]:
import numpy as np
import pandas as pd
import os
import shutil
import cv2
import matplotlib.pyplot as plt
import seaborn as sns

train_dir = '/content/dataset/cdsv5/train/'
# Each sub-folder of train_dir is one class
train_classes = os.listdir(train_dir)
# Map every class to the list of image files it contains
train_imgs = {ID: os.listdir(os.path.join(train_dir, ID)) for ID in train_classes}
# Number of images per class, in the same order as train_classes
train_classes_count = [len(train_imgs[trainClass]) for trainClass in train_classes]

plt.figure(figsize=(15, 10))
g = sns.barplot(x=train_classes, y=train_classes_count)
g.set_xticklabels(labels=train_classes, rotation=30, ha='right')

Use the code above to check the class distribution of the augmented training dataset located under /content/dataset/cdsv5/train_aug.

In [None]:
### WRITE YOUR CODE HERE ###

# Solution

In [None]:
train_dir = '/content/dataset/cdsv5/train_aug/'
# Each sub-folder of train_dir is one class
train_classes = os.listdir(train_dir)
# Map every class to the list of image files it contains
train_imgs = {ID: os.listdir(os.path.join(train_dir, ID)) for ID in train_classes}
# Number of images per class, in the same order as train_classes
train_classes_count = [len(train_imgs[trainClass]) for trainClass in train_classes]

plt.figure(figsize=(15, 10))
g = sns.barplot(x=train_classes, y=train_classes_count)
g.set_xticklabels(labels=train_classes, rotation=30, ha='right')
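
A bar plot shows imbalance visually. A single number can complement it. The sketch below counts files per class folder and reports the ratio of the largest class to the smallest. Both helper names and the demonstration class sizes are made up for illustration; the ratio is only one of several possible imbalance measures.

```python
import os
import tempfile

def count_images_per_class(split_dir):
    """Map each class sub-folder of split_dir to the number of files it contains."""
    counts = {}
    for class_name in sorted(os.listdir(split_dir)):
        class_path = os.path.join(split_dir, class_name)
        if os.path.isdir(class_path):  # skip stray files that are not class folders
            counts[class_name] = len(os.listdir(class_path))
    return counts

def imbalance_ratio(counts):
    """Largest class size divided by smallest; 1.0 means perfectly balanced."""
    return max(counts.values()) / min(counts.values())

# Demonstration on a throwaway folder tree with made-up class sizes.
with tempfile.TemporaryDirectory() as tmp:
    for class_name, n_files in {"0": 2, "1": 6}.items():
        os.makedirs(os.path.join(tmp, class_name))
        for i in range(n_files):
            open(os.path.join(tmp, class_name, f"{i}.jpg"), "w").close()
    demo_counts = count_images_per_class(tmp)
    demo_ratio = imbalance_ratio(demo_counts)
    print(demo_counts, demo_ratio)
```

Calling `count_images_per_class` on the train and train_aug folders would allow the two ratios to be compared directly.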

# Exercise IV: cloning a GitHub repository

Our implementation of the 'this looks like that' interpretable-by-design AI algorithm is hosted on our GitHub repository at https://github.com/HarfoucheLab/A-Primer-on-AI-in-Plant-Digital-Phenomics.

Clone this repository under /content to obtain the code so that you can use it later on to train a model.

In [None]:
### YOUR CODE HERE ###

# Solution

In [None]:
!git clone https://github.com/HarfoucheLab/A-Primer-on-AI-in-Plant-Digital-Phenomics.git

# Exercise V: configuring 'this looks like that'

Like any AI algorithm, 'this looks like that' requires some hyperparameters to be set before it runs. We save all hyperparameters in a file called settings.py, located at /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/settings.py.

The settings are defined in a string variable in the next code block.

Complete the code that saves these settings to settings.py.

In [None]:
settings = """base_architecture = 'densenet161'
img_size = 224
prototype_shape = (2000, 128, 1, 1)
num_classes = 5
prototype_activation_function = 'log'
add_on_layers_type = 'regular'

experiment_run = '001'

data_path = '/content/dataset/cdsv5/'
train_dir = data_path + 'train_aug/'
test_dir = data_path + 'val/'
train_push_dir = data_path + 'train/'
train_batch_size = 40 #80
test_batch_size = 40
train_push_batch_size = 64

num_workers=3
min_saving_accuracy=0.05

joint_optimizer_lrs = {'features': 1e-4,
                       'add_on_layers': 3e-3,
                       'prototype_vectors': 3e-3}
joint_lr_step_size = 5

warm_optimizer_lrs = {'add_on_layers': 3e-3,
                      'prototype_vectors': 3e-3}

last_layer_optimizer_lr = 1e-4

coefs = {
    'crs_ent': 1,
    'clst': 0.8,
    'sep': -0.08,
    'l1': 1e-4,
}

num_train_epochs = 1000
num_warm_epochs = 5

push_start = 10
push_epochs = [i for i in range(num_train_epochs) if i % 10 == 0] """

In [None]:
text_file = open("/content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/settings.py", "w")
### YOUR CODE HERE ###
text_file.close()

# Solution

In [None]:
text_file = open("/content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/settings.py", "w")
n = text_file.write(settings)
text_file.close()
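
As an alternative to explicit open and close calls, the file can be written inside a `with` block and then read back to confirm that it is valid Python. The sketch below does this with a shortened, illustrative settings string and a temporary folder; it is not the full tutorial configuration.

```python
import os
import tempfile

# Shortened, illustrative settings string (not the full tutorial configuration).
demo_settings = """num_train_epochs = 30
push_start = 10
push_epochs = [i for i in range(num_train_epochs) if i % 10 == 0]
"""

with tempfile.TemporaryDirectory() as tmp:
    settings_path = os.path.join(tmp, "settings.py")
    # The context manager closes the file even if write() raises.
    with open(settings_path, "w") as text_file:
        text_file.write(demo_settings)
    # Read the file back and execute it in an isolated namespace to check it.
    with open(settings_path) as text_file:
        written = text_file.read()
    namespace = {}
    exec(written, namespace)

print(namespace["push_epochs"])  # [0, 10, 20]
```

The check also makes the meaning of `push_epochs` concrete: it lists every epoch index divisible by 10.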

# Exercise VI: training 'this looks like that'

With the settings in place, run mainDistributed.py, located under /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py, to start the training process.

Hint: use 1 node, 1 GPU, and set nr to 0.

In [None]:
%cd /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/
### WRITE YOUR CODE HERE ###

# Solution

In [None]:
%cd /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/
!python3 mainDistributed.py --nodes 1 --gpus 1 --nr 0
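
The three flags resemble a convention common in multi-process training scripts, where the total number of processes is nodes × gpus and each process receives a global rank of nr × gpus + local GPU index. Whether mainDistributed.py computes these values exactly this way is an assumption; the sketch below only illustrates the convention, and its function names are hypothetical.

```python
import argparse

def parse_launch_args(argv):
    """Parse the three launch flags used in the command above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--nodes", type=int, default=1, help="number of machines")
    parser.add_argument("--gpus", type=int, default=1, help="GPUs per machine")
    parser.add_argument("--nr", type=int, default=0, help="index of this machine")
    return parser.parse_args(argv)

def global_rank(nr, gpus, local_gpu):
    """Assumed convention: processes are numbered machine by machine."""
    return nr * gpus + local_gpu

args = parse_launch_args(["--nodes", "1", "--gpus", "1", "--nr", "0"])
world_size = args.nodes * args.gpus  # total number of training processes
ranks = [global_rank(args.nr, args.gpus, gpu) for gpu in range(args.gpus)]
print(world_size, ranks)  # 1 [0]
```

With one node and one GPU, there is a single process with rank 0, which matches the hint for a single Colab instance.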

# Exercise VII: using a pretrained model

The next code block downloads a pretrained model, which is extracted to /content/pretrained, and a testing dataset, which is extracted to /content/dataset/cdsv5/test.

In [None]:
%cd /content/
file_id = '12ugCaMfPdylDPPmfqzoOMWtB55k0L9tL'
destination = '/content/pretrained.zip'
download_file_from_google_drive(file_id, destination)
!mkdir /content/pretrained
!unzip /content/pretrained.zip -d /content/pretrained/
!rm -R /content/pretrained.zip
file_id = '1Ruy2At0G3oLlA1Gb9gz1-aMpcfJ6653B'
destination = '/content/dataset_test.zip'
download_file_from_google_drive(file_id, destination)
!unzip /content/dataset_test.zip -d /content/dataset/cdsv5/
!rm -R /content/dataset_test.zip

Complete the code below to use the downloaded pretrained model and test dataset to test the model performance and generate the confusion matrix. 

Hint: the Python file for testing and confusion matrix generation is located at /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/RunTestAndConfusionMatrix.py.

In [None]:
%cd /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/
### YOUR CODE HERE ###

# Solution

In [None]:
%cd /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/
!python3 RunTestAndConfusionMatrix.py
%cd /content/

# Exercise VIII: generating the confusion matrix

Testing the pretrained model should have generated a PNG file containing the confusion matrix located under /content/confusion_matrix.png.

Complete the code below to display the confusion matrix.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img = mpimg.imread('/content/confusion_matrix.png')
### YOUR CODE HERE ###
plt.show()

# Solution

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img = mpimg.imread('/content/confusion_matrix.png')
plt.figure(figsize=(10,10))
plt.imshow(img)
plt.show()
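
To make the contents of the image concrete, the sketch below builds a small confusion matrix from made-up labels in plain Python. It assumes that rows index the true class and columns the predicted class; the plot produced by RunTestAndConfusionMatrix.py may use a different orientation. The function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count predictions per (true class, predicted class) pair."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_class, predicted_class in zip(true_labels, predicted_labels):
        matrix[true_class][predicted_class] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels for three classes, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)            # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(accuracy(cm))
```

Off-diagonal cells show which classes are confused with each other, which an overall accuracy value alone does not reveal.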

# Exercise IX: explaining predictions

Explanations of predictions are generated by running a local analysis on a specific image. This is done with the local_analysis.py Python file located under /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py.


In [None]:
!python3 /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/local_analysis.py -modeldir /content/pretrained/ -model 240_12push0.8884.pth -imgdir /content/dataset/cdsv5/test/1/ -img 931787054.jpg -imgclass 1

Running the code block above should generate many images under /content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes.

Complete the Python code below to display the best explanation generated for the prediction of the test image located at /content/dataset/cdsv5/test/1/931787054.jpg.

In [None]:
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/prototype_activation_map_by_top-1_prototype.png')
### WRITE YOUR CODE HERE ###

# Solution

In [None]:
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/prototype_activation_map_by_top-1_prototype.png')
plt.figure(figsize=(10,10))
plt.imshow(img)
plt.show()

# Exercise X: displaying the prototypes

Complete the next code block to display the two best prototypes used to classify the image explained above.

In [None]:
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/top-1_activated_prototype.png')
img2 = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/top-2_activated_prototype.png')
### WRITE YOUR CODE HERE ###

# Solution

In [None]:
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/top-1_activated_prototype.png')
plt.figure(figsize=(3,3))
plt.imshow(img)
plt.show()
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/top-2_activated_prototype.png')
plt.figure(figsize=(3,3))
plt.imshow(img)
plt.show()
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/top-17_activated_prototype.png')
plt.figure(figsize=(3,3))
plt.imshow(img)
plt.show()
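
The prototype file names differ only in their rank, so the long paths can be built from one base folder instead of being repeated. The following is a minimal sketch that assumes the output folder and file naming shown above; `prototype_image_path` is a hypothetical helper, and the existence check simply reports missing files before any image is loaded.

```python
import os

# Output folder reported earlier in this notebook for the analysed test image.
ANALYSIS_DIR = ("/content/dataset/cdsv5/test/1/pretrained/"
                "240_12push0.8884.pth/top-1_class_prototypes")

def prototype_image_path(rank, analysis_dir=ANALYSIS_DIR):
    """Path of the rank-th most activated prototype image (hypothetical helper)."""
    return os.path.join(analysis_dir, f"top-{rank}_activated_prototype.png")

paths = [prototype_image_path(rank) for rank in (1, 2)]
for path in paths:
    # Report missing files instead of failing later when the image is read.
    status = "found" if os.path.exists(path) else "missing"
    print(status, path)
```

Each path reported as found can then be passed to `mpimg.imread` and displayed with `plt.imshow`, as in the solution above.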