# Ready, Steady, Go AI (*Tutorial*)

This tutorial is a supplement to the paper **Ready, Steady, Go AI: A Practical Tutorial on Explainable Artificial Intelligence and Its Applications in Plant Digital Phenomics** (submitted to *Patterns, 2021*) by Farid Nakhle and Antoine Harfouche.

Read the accompanying paper [here](https://doi.org).

# Table of contents


* **1. Background**
* **2. Downloading Segmented Images**
* **3. Downsampling the Yellow Leaf Curl Class**

# 1. Background


**Why do we need to balance a dataset?**

Data imbalance refers to an unequal distribution of classes within a dataset. In such a scenario, a classification model can become biased and inaccurate, and might produce unsatisfactory results. We therefore balance the dataset either by oversampling the minority classes or by undersampling the majority classes. To demonstrate both approaches, oversampling and undersampling will each be applied. Here, we will downsample the yellow leaf curl class in the training set using K-nearest neighbors (KNN).
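To make the idea of imbalance concrete, the sketch below counts the samples per class and reports the ratio between the largest and smallest class. The labels and counts are hypothetical and chosen only for illustration; they are not the real PlantVillage numbers.

```python
from collections import Counter

# Hypothetical labels for illustration only (not the real PlantVillage counts)
labels = ["yellow_leaf_curl"] * 8 + ["healthy"] * 3 + ["mosaic_virus"] * 1

# Count how many samples each class has
counts = Counter(labels)

# Ratio between the largest and the smallest class; 1.0 means perfectly balanced
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, n in counts.most_common():
    print(f"{name}: {n}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio well above 1 suggests that the majority class may dominate training unless the dataset is rebalanced.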

**What is KNN?**

Like oversampling, undersampling is designed to balance the class distribution in an imbalanced dataset. In contrast to oversampling, however, undersampling techniques delete data from the majority classes to balance the distribution.

KNN is an ML algorithm that uses the feature similarity between a new input and its training data to predict values for previously unseen data. Given a new input, KNN finds the k (a user-defined number) most similar training samples, its nearest neighbors, using a similarity metric such as the Euclidean distance. The algorithm then assigns the new input the majority class of those neighbors.
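The description above can be sketched in a few lines of NumPy. This is a minimal illustration of KNN classification with the Euclidean distance, not the code used later in this tutorial; the function name `knn_predict`, the toy feature vectors, and the labels are all hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority class among those neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D feature vectors forming two well-separated groups
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_y = ["healthy", "healthy", "healthy", "diseased", "diseased", "diseased"]

prediction = knn_predict(train_X, train_y, np.array([0.95, 1.0]), k=3)
print(prediction)
```

Choosing an odd `k` for a two-class problem avoids tied votes.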

We will use KNN to find similar leaf images within the yellow leaf curl virus class and delete images with redundant features, undersampling the class to 1500 images.
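As a rough sketch of this idea on toy data, the snippet below ranks feature vectors by cosine distance to a reference vector and drops the closest ones until a target size is reached. It computes the distances directly with NumPy rather than with scikit-learn, and the helper name `nearest_to_reference`, the random vectors, and the target size are assumptions made for illustration.

```python
import numpy as np


def nearest_to_reference(vectors, ref_index, n_remove):
    """Return indices of the n_remove vectors closest (cosine) to the reference."""
    # Normalize rows so that a dot product equals the cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cosine_distance = 1.0 - unit @ unit[ref_index]
    order = np.argsort(cosine_distance)
    # Never select the reference itself for removal
    order = order[order != ref_index]
    return order[:n_remove]


rng = np.random.default_rng(0)
# Hypothetical feature vectors; the offset avoids all-zero rows
vectors = rng.random((10, 4)) + 0.1
target_size = 7

to_remove = nearest_to_reference(vectors, ref_index=0,
                                 n_remove=len(vectors) - target_size)
kept = np.delete(vectors, to_remove, axis=0)
print(kept.shape)
```

Samples that lie closest to the reference carry the most similar features, so removing them is assumed to lose the least information.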



# 2. Downloading Segmented Images


As a reminder, we are working with the PlantVillage dataset, originally obtained from [here](http://dx.doi.org/10.17632/tywbtsjrjv.1).
For this tutorial, we will work with a subset of PlantVillage that contains the tomato classes only. We have made the subset available [here](http://dx.doi.org/10.17632/4g7k9wptyd.1).

The following code automatically downloads the dataset segmented with SegNet.

**It is important to note that Colab deletes all unsaved data once the instance is recycled. Therefore, remember to download your results once you run the code.**

In [None]:
import requests
import os
import zipfile

## FEEL FREE TO CHANGE THESE PARAMETERS
dataset_url = "http://faridnakhle.com/pv/tomato-split-cropped-segmented.zip"
save_data_to = "/content/dataset/tomato-segmented/"
dataset_file_name = "tomato-segmented.zip"
#######################################

if not os.path.exists(save_data_to):
    os.makedirs(save_data_to)

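r = requests.get(dataset_url, stream=True, headers={"User-Agent": "Ready, Steady, Go AI"})
# Stop early on an HTTP error instead of saving an error page as a zip file
r.raise_for_status()

print("Downloading dataset...")

# Stream the response to disk in chunks to keep memory usage low
with open(save_data_to + dataset_file_name, "wb") as file:
    for block in r.iter_content(chunk_size=1024):
        if block:
            file.write(block)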

## Extract downloaded zip dataset file
print("Dataset downloaded")  
print("Extracting files...")  
with zipfile.ZipFile(save_data_to + dataset_file_name, 'r') as zip_dataset:
    zip_dataset.extractall(save_data_to)

## Delete the zip file as we no longer need it
os.remove(save_data_to + dataset_file_name)
print("All done!")  


Downloading dataset...
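Once the archive is extracted, it can be useful to confirm the folder layout before balancing. The sketch below counts the entries in each class subfolder. It assumes that the `train` directory holds one subfolder per class, which matches the path used in the next section, and `count_images_per_class` is a hypothetical helper name.

```python
import os


def count_images_per_class(split_dir):
    """Map each class subfolder of split_dir to the number of entries it contains."""
    counts = {}
    # Return an empty mapping if the dataset has not been extracted yet
    if not os.path.isdir(split_dir):
        return counts
    for class_name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_name)
        if os.path.isdir(class_dir):
            counts[class_name] = len(os.listdir(class_dir))
    return counts


# Assumed location of the training split after extraction
train_dir = "/content/dataset/tomato-segmented/train"
for class_name, n in count_images_per_class(train_dir).items():
    print(f"{class_name}: {n}")
```

A class whose count is far above the others is a candidate for undersampling.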


# 3. Downsampling the Yellow Leaf Curl Class

In [None]:
from glob import glob
import os

import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors
from keras.applications import VGG19
from keras.applications.vgg19 import preprocess_input
# Model is imported from keras.models; recent Keras releases no longer expose it in keras.engine
from keras.models import Model
from keras.preprocessing import image


img_dir = "/content/dataset/tomato-segmented/train/Tomato___Tomato_Yellow_Leaf_Curl_Virus/*"
targetLimit = 1500
deleteImages = True

def SaveFile(arr, filename):
    with open(filename, 'w') as filehandle:
        for listitem in arr:
            filehandle.write(str(listitem) + "\n")


def vectorize_all(files, model, px=224, n_dims=512, batch_size=512):
    """Return a (len(files), n_dims) sparse matrix of model features, one row per image."""
    total = len(files)
    preds = sp.lil_matrix((total, n_dims))

    print("Total: {}".format(total))
    # Process the files in consecutive, non-overlapping batches
    for min_idx in range(0, total, batch_size):
        max_idx = min(min_idx + batch_size, total)
        print(min_idx)
        X = np.zeros((max_idx - min_idx, px, px, 3))
        # Load each file in the batch as one row of X
        for i in range(min_idx, max_idx):
            try:
                img = image.load_img(files[i], target_size=(px, px))
                X[i - min_idx, :, :, :] = image.img_to_array(img)
            except Exception as e:
                # An unreadable image is reported and stays an all-zero input
                print(e)
        X = preprocess_input(X)
        these_preds = model.predict(X)
        preds[min_idx:max_idx, :] = these_preds.reshape((max_idx - min_idx, n_dims))
    return preds

def vectorizeOne(path, model):
    img = image.load_img(path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    pred = model.predict(x)
    return pred.ravel()

def findSimilar(vec, knn, filenames, n_neighbors=6):
    if n_neighbors >= len(filenames):
        # Fail loudly; returning nothing here would break the caller later
        raise ValueError("The number of neighbours should be less than the number of images.")
    # Request one extra neighbour: the query image itself is expected to be
    # returned first (distance 0) and is dropped below
    n_neighbors = n_neighbors + 1
    dist, indices = knn.kneighbors(vec.reshape(1, -1), n_neighbors=n_neighbors)
    dist, indices = dist.flatten(), indices.flatten()
    similarList = [(filenames[indices[i]], dist[i]) for i in range(len(indices))]
    del similarList[0]
    #similarList.sort(reverse=True, key=lambda tup: tup[1])
    return similarList

files = glob(img_dir)
nbrOfImages2Delete = len(files) - targetLimit

if (nbrOfImages2Delete > 0):

    imgToSearchFor = files[0]

    base_model = VGG19(weights='imagenet')
    model = Model(inputs=base_model.input, outputs=base_model.get_layer('fc1').output)
    vecs = vectorize_all(files, model, n_dims=4096)

    knn = NearestNeighbors(metric='cosine', algorithm='brute')
    knn.fit(vecs)

    vec = vectorizeOne(imgToSearchFor, model)
    similarImages = findSimilar(vec, knn, files, nbrOfImages2Delete)
    print(similarImages)
    SaveFile(similarImages, "deletedImages.txt")

    if deleteImages:
        for i in range(0, len(similarImages)):
            if os.path.exists(similarImages[i][0]):
                os.remove(similarImages[i][0])
    print("Balancing done. A list of deleted images can be found in deletedImages.txt")
else:
    print("nothing to delete")

Let's re-count the files in the folder to confirm the new class size:

In [None]:
files = glob(img_dir)
print(len(files))

1500
