# A primer on artificial intelligence in plant digital phenomics: embarking on the data to insights journey (*Tutorial*)

This tutorial is a supplement to the paper **A primer on artificial intelligence in plant digital phenomics: embarking on the data to insights journey** (submitted to *Trends in Plant Science, 2022*) by Antoine L. Harfouche, Farid Nakhle, Antoine H. Harfouche, Orlando G. Sardella, Eli Dart, and Daniel Jacobson.

Read the accompanying paper [here](https://doi.org) (a link will be available once the paper is published).

This interactive tutorial aims to train, for the first time, an interpretable by design model to identify and classify cassava plant diseases, and to explain its predictions.

This tutorial is organized as follows:

- Why interpretable by design algorithms? 
- Why cassava?
- Data analytics
  - Exploring the dataset: data preprocessing steps and image classes and format
  - 'This looks like that' explainable artificial intelligence (X-AI) algorithm architecture
  - Training a 'this looks like that' model
  - The computer cluster used for training and testing
  - Analyzing the model performance and generating the confusion matrix
  - Generating explanations for predictions made by the model

- Glossary
- References

## Why interpretable by design algorithms?

AI models are commonly referred to as black boxes because they do not reveal their internal mechanisms to their users. Such models are created directly from data, and not even the scientists who created them can understand or explain exactly how they made a specific prediction.
As AI becomes more advanced and widely adopted, scientists are challenged to comprehend and retrace how a model came to a prediction.



To open up black box models, approaches that make the inner workings of AI models understandable to humans have been developed. These approaches create a second (post-hoc) model to explain the first black box model. Post-hoc models can be classified based on whether they are applicable to all AI algorithms (i.e., model-agnostic) or only to one AI algorithm (i.e., model-specific). They often employ data perturbation strategies, which involve modifying the input data and observing the changes in the black box model predictions. Based on these changes, they identify which parts of the data were important for the predictions and thus generate an explanation. However, according to [2], these explanations are unreliable because they cannot have perfect fidelity with respect to the original model. A post-hoc model that predicts almost identically to a black box model might use completely different features, and is thus not faithful to the computation of the black box model.
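The perturbation idea described above can be illustrated with a minimal occlusion sketch. It is not taken from the paper or from any specific post-hoc library: `toy_model`, `occlusion_map`, the patch size, and the fill value are hypothetical names and choices made for illustration only. The sketch masks one patch of the input at a time and records how much the score of a stand-in black box drops.

```python
import numpy as np

def toy_model(image):
    """Hypothetical black box: scores an image by the mean of its top-left quadrant."""
    h, w = image.shape
    return float(image[: h // 2, : w // 2].mean())

def occlusion_map(model, image, patch=2, fill=0.0):
    """Mask one patch at a time and record the resulting drop in the model score."""
    base = model(image)
    h, w = image.shape
    importance = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            perturbed = image.copy()
            perturbed[i:i + patch, j:j + patch] = fill  # perturb the input
            importance[i // patch, j // patch] = base - model(perturbed)
    return importance

image = np.ones((4, 4))                 # toy 4x4 grayscale "image"
importance = occlusion_map(toy_model, image)
print(importance)                       # only the top-left patch matters to toy_model
```

The resulting map is an explanation of the black box built from the outside, which is why its fidelity to the internal computation cannot be guaranteed.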

As a solution, other approaches aim to develop algorithms that are interpretable by design; they provide their own explanations, which are faithful to what their models actually compute. For example, the 'this looks like that' algorithm appends a special prototype layer to the end of a deep convolutional neural network. During training, the prototype layer finds parts of training images that act as prototypes for each class. During testing, when a new image needs to be evaluated, the network finds parts of the test image that are similar to the prototypes it learned during training. The final class prediction of the network is based on the weighted sum of similarities to these prototypes. The explanations given by the network are the prototypes themselves. These explanations are the actual computations of the model, and are not post-hoc explanations.



In this tutorial, we will show how to use the 'this looks like that' interpretable by design algorithm to identify and classify cassava plant diseases, and to generate explanations for the predictions.

## Why cassava?

Any X-AI-based analysis should start with problem formulation. Defining the scope and purpose of the analysis is paramount to identify what task the algorithm should perform and to guide the choice of the input data.

We encourage framing the biological question at hand in conjunction with all stakeholders, to make sure that the pertinent questions are answered and that the stakeholders will ultimately be satisfied by the algorithm predictions and explanations. AI and X-AI models are often powered by data whose collection can be both time-consuming and expensive. Thus, the availability of public image datasets is crucial, as it provides the phenomics data science community with access to valuable data and allows for prototyping and evaluating X-AI algorithms for digital phenomics tasks. The availability of phenomic datasets is growing, and many have been established and made publicly available in various repositories. Table 1 of the accompanying paper surveys the main characteristics and potential applications of these datasets.

Here, we employ the ‘this looks like that’ algorithm, originally designed and implemented by [1], to identify and classify cassava plant diseases using the [cassava disease classification dataset](https://www.kaggle.com/c/cassava-leaf-disease-classification/data) shared on the Kaggle repository by the AI lab at Makerere University. This choice is motivated by the importance of cassava, a key food security crop grown by smallholder farmers in Africa, Asia, and South America because of its robustness to adverse weather conditions.

The accompanying paper describes the labor-intensive process that makes it difficult to monitor and treat disease progression. With the help of X-AI, it may be possible to identify and classify cassava diseases to monitor their progression.

The dataset was **crowdsourced** (see Glossary) from farmers in Uganda who took the images with smartphones. Images were manually annotated by experts at the National Crops Resources Research Institute (NaCRRI) in collaboration with the AI lab at Makerere University. 

## Data analytics

Before starting, you should note that **graphics processing units** (GPUs) can dramatically increase training speed thanks to their processing cores initially designed to process visual data such as videos and images.
It is recommended to use a Google Colab GPU instance for faster training.

By default, this **notebook** runs on GPU. If you would like to change the instance type, check the [Google Colab documentation](https://colab.research.google.com/notebooks/gpu.ipynb).

Let us start by checking the number and type of GPUs that Google Colab assigned us for this session:

In [None]:
import tensorflow as tf
import torch

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))

### Exploring the dataset: data preprocessing steps and image classes and format

The original version of the cassava dataset is publicly available on the Kaggle repository at https://www.kaggle.com/c/cassava-leaf-disease-classification.

The dataset included a total of n<sub>cassava</sub> $=$ 21,397 labeled red-green-blue (RGB) joint photographic experts group (JPG) images divided into 5 classes: class 0 (healthy, n<sub>h</sub> $=$ 2,577), class 1 (bacterial blight, n<sub>bb</sub> $=$ 1,087), class 2 (brown streak, n<sub>bs</sub> $=$ 2,189), class 3 (green mottle, n<sub>gm</sub> $=$ 2,386), and class 4 (mosaic, n<sub>m</sub> $=$ 13,158). The data was cleaned by removing irrelevant (e.g., images that do not contain a leaf) or wrongly annotated (e.g., diseased images labeled as healthy) images using [image-sorter2](https://github.com/Nestak2/image-sorter2), a free, open-source Python script that helps data scientists in sorting a dataset into folders differentiated by classes.

After data cleaning, the image distribution in the dataset became as follows: n<sub>cassava</sub> $=$ 17,190, n<sub>h</sub> $=$ 1,398, n<sub>bb</sub> $=$ 963, n<sub>bs</sub> $=$ 1,823, n<sub>gm</sub> $=$ 1,915, and n<sub>m</sub> $=$ 11,091.

The dataset was randomly split into training, validation, and testing sets, allocating 60% of images for training (n<sub>train</sub> $=$ 10,311), 20% for validation (n<sub>val</sub> $=$ 3,436), and 20% for testing (n<sub>test</sub> $=$ 3,443). To minimize noise in training images, a [second version of the dataset](https://www.kaggle.com/ammarali32/cassava-datasetv2) with images cropped to leaf boundaries using a trained ‘you only look once’ (YOLO) model was used.
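As a rough illustration of a random 60/20/20 split, the sketch below shuffles a list of file names and slices it into three parts. It uses only the Python standard library as a stand-in and is not the procedure used to produce the tutorial's dataset; `split_dataset`, the seed, and the placeholder file names are hypothetical. Because the exact splitting procedure (e.g., per-class splitting and rounding) is not reproduced here, the resulting counts need not match the ones reported above.

```python
import random

def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle items reproducibly and split them into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # the remainder goes to the test set
    return train, val, test

# Placeholder file names standing in for the 17,190 cleaned images
file_names = [f"img_{i}.jpg" for i in range(17190)]
train, val, test = split_dataset(file_names)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, so the same images end up in the same subset every time the cell is run.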

The training set was balanced using the [Augmentor](https://augmentor.readthedocs.io/en/master/) Python software library, **oversampling** each class to approximately 20,000 images (see Figure 4 in the paper).

Basic data preprocessing steps, including data splitting, balancing, cropping, and segmenting, are explained in our previous tutorial **Ready, Steady, Go AI: A Practical Tutorial on Fundamentals of Artificial Intelligence and Its Applications in Phenomics Image Analysis**, [published](https://doi.org/10.1016/j.patter.2021.100323) in *Patterns*, where the code is implemented in interactive notebooks hosted on our GitHub repository at https://github.com/HarfoucheLab/Ready-Steady-Go-AI.

Please visit the following notebooks for our tutorials on:
- [Data Splitting Using split-folders](https://colab.research.google.com/github/faridnakhle/RSG/blob/main/1.%20RSG_Data%20splitter.ipynb)
- [Image Cropping Using the 'you only look once' (YOLO) AI Algorithm](https://colab.research.google.com/github/faridnakhle/RSG/blob/main/2.%20RSG_Leaf%20cropper.ipynb)
- [Image **Segmentation** Using SegNet AI Algorithm](https://colab.research.google.com/github/faridnakhle/RSG/blob/main/3.%20RSG_Leaf%20segmenter.ipynb)
- [Data Balancing by Oversampling with **Geometric Transformations** Using Augmentor](https://colab.research.google.com/github/faridnakhle/RSG/blob/main/4.%20RSG_Oversample%20with%20Augmentor.ipynb)
- [Data Balancing by Oversampling with **Synthetic Data** Using Deep Convolutional Generative Adversarial Network (DCGAN) AI Algorithm](https://colab.research.google.com/github/faridnakhle/RSG/blob/main/5.%20RSG_Oversample%20with%20DCGAN.ipynb)
- [Data Balancing by Downsampling Using K Nearest Neighbor AI Algorithm](https://colab.research.google.com/github/faridnakhle/RSG/blob/main/6.%20RSG_Downsample%20with%20KNN.ipynb)


The following code block will download the preprocessed cassava dataset which is hosted on Google Drive. 

In [None]:
import requests

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:
                f.write(chunk)

In [None]:
file_id = '13jwC684Sg1wWLhF7SjPIlsfJNuKqJ_IQ'
destination = '/content/dataset.zip'
download_file_from_google_drive(file_id, destination)

Next, we will create a folder called 'dataset' under /content/, and extract the downloaded dataset to it. As a result, three folders should be created under /content/dataset/cdsv5 as follows:
- train: the folder containing the training dataset.
- train_aug: the folder containing the augmented training dataset.
- val: the folder containing the validation dataset.

In [None]:
!mkdir /content/dataset
!apt-get install unzip
!unzip /content/dataset.zip -d /content/dataset/
!rm -R  /content/dataset.zip

Now that our dataset is ready, let us take a quick look at the differences between the class distributions in the original and the balanced training sets. To do so, the next code block will count all images in every class in the original training set. A bar plot will be used to display the results.

In [None]:
import numpy as np
import pandas as pd
import os
import shutil
import cv2
import matplotlib.pyplot as plt
import seaborn as sns

train_dir = '/content/dataset/cdsv5/train/'
train_classes = [path for path in os.listdir(train_dir)]
train_imgs = dict([(ID, os.listdir(os.path.join(train_dir, ID))) for ID in train_classes])
train_classes_count = []
for trainClass in train_classes:
  train_classes_count.append(len(train_imgs[trainClass]))

plt.figure(figsize=(15, 10))
g = sns.barplot(x=train_classes, y=train_classes_count)
g.set_xticklabels(labels=train_classes, rotation=30, ha='right')

We can see that the training set is highly **unbalanced**. Let us check the distribution in the balanced folder by running the next code block.

In [None]:
train_dir = '/content/dataset/cdsv5/train_aug/'
train_classes = [path for path in os.listdir(train_dir)]
train_imgs = dict([(ID, os.listdir(os.path.join(train_dir, ID))) for ID in train_classes])
train_classes_count = []
for trainClass in train_classes:
  train_classes_count.append(len(train_imgs[trainClass]))

plt.figure(figsize=(15, 10))
g = sns.barplot(x=train_classes, y=train_classes_count)
g.set_xticklabels(labels=train_classes, rotation=30, ha='right')

We can see that the data is balanced.

### 'This looks like that' X-AI algorithm architecture

With the data ready, we will clone our implementation of the 'this looks like that' algorithm hosted on our [GitHub repository](https://github.com/HarfoucheLab/A-Primer-on-AI-in-Plant-Digital-Phenomics).

It is worth mentioning that our version of the code was modified to support **parallel computing** and thus can run on a **computer cluster** with multiple nodes and supports training on multiple GPUs.

In [None]:
!git clone https://github.com/HarfoucheLab/A-Primer-on-AI-in-Plant-Digital-Phenomics.git

The code should now be located under /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py.



Next, we define some settings that specify the architecture of our network and include other **hyperparameters**, such as the dataset directories (training and validation sets), the batch size, the number of workers, and the learning rates.

Feel free to change those parameters according to your needs and hardware.

In [None]:
settings = """base_architecture = 'densenet161'
img_size = 224
prototype_shape = (2000, 128, 1, 1)
num_classes = 5
prototype_activation_function = 'log'
add_on_layers_type = 'regular'

experiment_run = '001'

data_path = '/content/dataset/cdsv5/'
train_dir = data_path + 'train_aug/'
test_dir = data_path + 'val/'
train_push_dir = data_path + 'train/'
train_batch_size = 40 #80
test_batch_size = 40
train_push_batch_size = 64

num_workers=3
min_saving_accuracy=0.05

joint_optimizer_lrs = {'features': 1e-4,
                       'add_on_layers': 3e-3,
                       'prototype_vectors': 3e-3}
joint_lr_step_size = 5

warm_optimizer_lrs = {'add_on_layers': 3e-3,
                      'prototype_vectors': 3e-3}

last_layer_optimizer_lr = 1e-4

coefs = {
    'crs_ent': 1,
    'clst': 0.8,
    'sep': -0.08,
    'l1': 1e-4,
}

num_train_epochs = 1000
num_warm_epochs = 5

push_start = 10
push_epochs = [i for i in range(num_train_epochs) if i % 10 == 0] """

After defining the settings, we will write them to a file called settings.py so that 'this looks like that' can locate and read those settings.

In [None]:
# Write the settings string to settings.py; the context manager closes the file automatically
with open("/content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/settings.py", "w") as text_file:
    text_file.write(settings)
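To make two of the settings above more concrete, the sketch below derives the number of prototypes per class from `prototype_shape` and `num_classes`, and lists the epochs at which prototypes would be pushed. It assumes the conventions of the original 'this looks like that' implementation [1], namely that prototypes are allocated to classes in equal contiguous blocks and that a push happens only at epochs in `push_epochs` that are not earlier than `push_start`. Whether our modified code follows exactly the same rules is an assumption here; `effective_push_epochs` is a hypothetical name.

```python
import numpy as np

# Values copied from the settings above
prototype_shape = (2000, 128, 1, 1)
num_classes = 5
num_train_epochs = 1000
push_start = 10
push_epochs = [i for i in range(num_train_epochs) if i % 10 == 0]

num_prototypes = prototype_shape[0]
prototypes_per_class = num_prototypes // num_classes

# One-hot map: row j marks the class that prototype j belongs to
# (assumed block-wise allocation, as in the original implementation)
prototype_class_identity = np.zeros((num_prototypes, num_classes))
for j in range(num_prototypes):
    prototype_class_identity[j, j // prototypes_per_class] = 1

# Epochs at which prototypes would actually be pushed (assumed rule)
effective_push_epochs = [e for e in push_epochs if e >= push_start]

print(prototypes_per_class)          # 400 prototypes for each of the 5 classes
print(len(effective_push_epochs))    # 99 push epochs: 10, 20, ..., 990
```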

The algorithm was configured to build a densely connected convolutional network (DenseNet)-161 architecture [3]. Instead of starting from randomly initialized weights, the network weights were loaded from a DenseNet-161 model pretrained on the ImageNet dataset (**transfer learning**). The algorithm architecture follows [1] and consists of regular convolutional layers, followed by a prototype layer and a fully connected layer (see Figure 4 in the paper).

Given an input image, the convolutional layers extract useful features to be used for prediction. Then, the prototype layer collects activation patterns in the convolutional output and forms prototypes, each of which corresponds to an image patch in the original input. The prototype layer then computes similarity scores indicating how strongly a prototypical part is present in the image, and produces a heat map that identifies which part of the input image is most similar to the learned prototype. Finally, the fully connected layer uses the similarity scores to produce the output logits, which yield the predicted probabilities of the input belonging to each class.
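The forward pass described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the code of our repository: it treats the convolutional output as a flat list of patch feature vectors, uses the squared Euclidean distance with the log activation from [1] (matching `prototype_activation_function = 'log'` in the settings), and uses toy values for the features, prototypes, and weights. The names `prototype_logits` and `softmax` are hypothetical.

```python
import numpy as np

def prototype_logits(patches, prototypes, class_weights, eps=1e-4):
    """
    patches:       (P, D) feature vectors, one per spatial location of the conv output
    prototypes:    (M, D) learned prototype vectors
    class_weights: (C, M) weights of the final fully connected layer
    """
    # Squared L2 distance between every patch and every prototype: shape (P, M)
    diffs = patches[:, None, :] - prototypes[None, :, :]
    dist2 = (diffs ** 2).sum(axis=-1)
    # Log activation: large when a patch is close to a prototype
    similarity_map = np.log((dist2 + 1.0) / (dist2 + eps))
    # Global max pooling: keep the best-matching patch for each prototype
    similarity = similarity_map.max(axis=0)
    best_patch = similarity_map.argmax(axis=0)
    # Weighted sum of similarities gives one logit per class
    logits = class_weights @ similarity
    return similarity_map, best_patch, logits

def softmax(z):
    """Numerically stable softmax turning logits into probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy example: 3 patches, 2 prototypes, 2 classes (one prototype per class)
patches = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
prototypes = np.array([[1.0, 0.0], [0.0, 3.0]])
class_weights = np.eye(2)

similarity_map, best_patch, logits = prototype_logits(patches, prototypes, class_weights)
probabilities = softmax(logits)
print(best_patch, probabilities.argmax())
```

In this sketch, `similarity_map` plays the role of the heat map (one column per prototype), and `best_patch` indicates which part of the input "looks like" each prototype.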

### Training a 'this looks like that' model

Now that we are all set, we're ready to start the training process!

However, as this is a data-intensive and time-consuming step, we have included our pretrained model in this notebook, so you can skip the next code block.

The parameters below indicate the number of **compute nodes** and GPUs that the model will be trained on.

In [None]:
%cd /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/
!python3 mainDistributed.py --nodes 1 --gpus 1 --nr 0

### The computer cluster used for training and testing

Training was executed on a computer cluster with four compute nodes, each having four 24-core Intel Xeon Gold 6240R **central processing units** (CPUs) and two Nvidia Tesla T4 GPUs.
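To relate the `--nodes`, `--gpus`, and `--nr` flags of the training command to such a cluster, the sketch below computes the total number of processes and the global rank of each one. It assumes the common one-process-per-GPU convention used in distributed PyTorch training and that `--nr` is the index of the current node; the internals of `mainDistributed.py` are not shown in this notebook, so this is an illustration of the convention rather than a description of that script. `world_size` and `global_rank` are hypothetical helper names.

```python
def world_size(nodes, gpus_per_node):
    """Total number of training processes (assumed: one process per GPU)."""
    return nodes * gpus_per_node

def global_rank(node_rank, gpus_per_node, local_gpu):
    """Unique index of one process across all nodes (assumed convention)."""
    return node_rank * gpus_per_node + local_gpu

# Single Colab instance, as in the command above: --nodes 1 --gpus 1 --nr 0
colab_size = world_size(1, 1)
colab_rank = global_rank(0, 1, 0)

# Cluster described above: four nodes with two GPUs each
cluster_size = world_size(4, 2)
cluster_ranks = [global_rank(nr, 2, gpu) for nr in range(4) for gpu in range(2)]

print(colab_size, colab_rank)
print(cluster_size, cluster_ranks)
```

Under this convention, the same command would be launched once per node, changing only the node index.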

The next code block will download and extract our pretrained model hosted on Google Drive. The downloaded archive will be extracted to /content/pretrained.

In [None]:
%cd /content/
file_id = '12ugCaMfPdylDPPmfqzoOMWtB55k0L9tL'
destination = '/content/pretrained.zip'
download_file_from_google_drive(file_id, destination)

In [None]:
!mkdir /content/pretrained
!unzip /content/pretrained.zip -d /content/pretrained/
!rm -R /content/pretrained.zip

### Analyzing the model performance and generating the confusion matrix

Now that we have our model ready, it's time to test its performance!
The first step is to download the testing dataset.

In [None]:
file_id = '1Ruy2At0G3oLlA1Gb9gz1-aMpcfJ6653B'
destination = '/content/dataset_test.zip'
download_file_from_google_drive(file_id, destination)

In [None]:
!unzip /content/dataset_test.zip -d /content/dataset/cdsv5/
!rm -R /content/dataset_test.zip

The testing set is now located under /content/dataset/cdsv5/test.

The next code block will attempt to classify every image in the test dataset, and calculate the overall **accuracy** of the model, along with its **confusion matrix**.

In [None]:
%cd /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/
!python3 RunTestAndConfusionMatrix.py
%cd /content/

Let us display the normalized confusion matrix to get a better overview of the model performance.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img = mpimg.imread('/content/confusion_matrix.png')
plt.figure(figsize=(10,10))
plt.imshow(img)
plt.show()
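For readers who want to see how accuracy and a row-normalized confusion matrix are obtained from predictions, here is a minimal NumPy sketch. It is not the code of `RunTestAndConfusionMatrix.py`; the labels and predictions are toy values invented for illustration, and `confusion_matrix` and `normalize_rows` are hypothetical helper names.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count how often each true class (rows) is predicted as each class (columns)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def normalize_rows(cm):
    """Divide each row by its total so that every true class sums to 1."""
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row_sums, 1)   # guard against empty classes

# Toy annotations and predictions for the 5 cassava classes (illustrative only)
y_true = [0, 0, 1, 1, 1, 2, 3, 4, 4, 4]
y_pred = [0, 1, 1, 1, 0, 2, 3, 4, 4, 3]

cm = confusion_matrix(y_true, y_pred, num_classes=5)
accuracy = np.trace(cm) / cm.sum()        # correct predictions lie on the diagonal
cm_normalized = normalize_rows(cm)

print(cm)
print(accuracy)                           # 7 of the 10 toy predictions are correct
```

Row normalization makes classes of different sizes comparable, which is useful when the test set is unbalanced.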

### Generating explanations for predictions made by the model

Now that we are satisfied with the model performance, we can generate the explanation of a specific prediction using the following code:

In [None]:
!python3 /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/local_analysis.py -modeldir /content/pretrained/ -model 240_12push0.8884.pth -imgdir /content/dataset/cdsv5/test/1/ -img 931787054.jpg -imgclass 1

The next code block will plot the most activated region in the image, on which the algorithm based its prediction.

In [None]:
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/prototype_activation_map_by_top-1_prototype.png')
plt.figure(figsize=(10,10))
plt.imshow(img)
plt.show()

Next, we display some of the prototypes that the activated region resembles, which allows us to interpret the prediction.

In [None]:
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/top-1_activated_prototype.png')
plt.figure(figsize=(3,3))
plt.imshow(img)
plt.show()
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/top-2_activated_prototype.png')
plt.figure(figsize=(3,3))
plt.imshow(img)
plt.show()
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/top-17_activated_prototype.png')
plt.figure(figsize=(3,3))
plt.imshow(img)
plt.show()

Let us generate the explanation for another example!

In [None]:
!python3 /content/A-Primer-on-AI-in-Plant-Digital-Phenomics/py/local_analysis.py -modeldir /content/pretrained/ -model 240_12push0.8884.pth -imgdir /content/dataset/cdsv5/test/1/ -img 1074333151.jpg -imgclass 1

In [None]:
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/prototype_activation_map_by_top-1_prototype.png')
plt.figure(figsize=(10,10))
plt.imshow(img)
plt.show()

In [None]:
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/top-1_activated_prototype.png')
plt.figure(figsize=(3,3))
plt.imshow(img)
plt.show()
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/top-2_activated_prototype.png')
plt.figure(figsize=(3,3))
plt.imshow(img)
plt.show()
img = mpimg.imread('/content/dataset/cdsv5/test/1/pretrained/240_12push0.8884.pth/top-1_class_prototypes/top-17_activated_prototype.png')
plt.figure(figsize=(3,3))
plt.imshow(img)
plt.show()

## Glossary

**Accuracy:** a measure of performance reflecting how close the predictions of an AI model are to actual annotations, calculated by dividing the number of correct predictions by the number of total predictions.

**Central processing unit (CPU):** the core component of a classical computer that controls the interpretation and execution of instructions.

**Computer cluster:** a group of interconnected computers working together as a single, integrated computing resource.

**Compute node:** a standalone computer connected to other compute nodes through a high-performance local network, forming a computer cluster.

**Confusion matrix:** a visual representation that describes the complete performance of an AI model, summarizing its predictions in four categories: true-positives, true-negatives, false-positives, and false-negatives.

**Crowdsourced:** the act of collecting data by soliciting contributions from a large group of people rather than from traditional experiments.

**Graphics processing unit (GPU):** a device used to generate computer output that has processors optimized for graphics computations, making it suitable for parallel computing.

**Geometric transformation:** a series of operations performed on a set of images, such as rotating and flipping, in order to augment a dataset. 

**Hyperparameters:** a group of variables whose values cannot be estimated from data and are manually tweaked to determine the optimal configuration to train a specific model (e.g., learning rate, batch size, number of training epochs).

**Unbalanced dataset:** a dataset having certain classes contain substantially more training examples than other classes, misleading the classifier algorithm to overlearn the majority classes and to perform poorly in the prediction of the minority classes.

**Notebook:** a web-based interactive computing environment that can be used to combine software code, computational output, explanatory text, and multimedia resources in a single document.

**Oversampling:** a technique wherein the number of training examples within the minority class in a dataset is augmented to be equivalent to other classes.

**Parallel computing:** a form of computation in which multiple compute nodes operating simultaneously are used to solve a large problem broken into independent smaller parts that can be processed concurrently.

**Segmentation:** the task of assigning a class to every pixel in an input image (e.g., leaf or background).

**Synthetic data:** data generated artificially using AI algorithms when real data cannot be collected in sufficient amounts.

**Transfer learning:** a technique in which an AI algorithm reuses parts of a previously trained model on a new model to perform a different but similar task.


## References

1. Chen, C. et al. (2019) This looks like that: deep learning for interpretable image recognition. In *Advances in Neural Information Processing Systems*, 32
2. Rudin, C. (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. *Nat. Mach. Intell.* 1, 206–215
3. Huang, G. et al. (2016) Densely connected convolutional networks. *arXiv* 1608.06993