# Introduction

As the second-largest provider of carbohydrates in Africa, cassava is a key food security crop grown by smallholder farmers because it can withstand harsh conditions. At least 80% of household farms in Sub-Saharan Africa grow this starchy root, but viral diseases are a major source of poor yields. With the help of data science, it may be possible to identify common diseases so they can be treated. Our dataset consists of 21,367 labeled images collected during a regular survey in Uganda. Our goal is to classify each cassava image into one of four disease categories or a fifth category indicating a healthy leaf. With such a classifier, farmers may be able to quickly identify diseased plants, potentially saving their crops before the disease inflicts irreparable damage.

In [None]:
#shutil.rmtree("/kaggle/working/train_data")
#shutil.rmtree("/kaggle/working/valid_data")
#os.remove("/kaggle/working/0")
#os.remove("/kaggle/working/2")
#os.remove("/kaggle/working/3")
#os.remove("/kaggle/working/4")
#os.remove("/kaggle/working/1")

# Set up environment
## Load libraries

In [None]:
!pip install --upgrade tensorflow
#!pip install --upgrade pip
#!pip install -U efficientnet

In [None]:
# System related imports
import math, re, os, warnings

# Data manipulation
import numpy as np
import pandas as pd
import json
import shutil

# Visualization imports
import matplotlib.pyplot as plt
from matplotlib import gridspec

# Deep learning framework
import tensorflow as tf
from kaggle_datasets import KaggleDatasets
from tensorflow import keras
from functools import partial
from sklearn.model_selection import train_test_split
import tensorflow_hub as hub
# keras is already imported above; only the augmentation layers are needed here
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
tf.config.optimizer.set_jit(True)
from tensorflow.keras.preprocessing import image_dataset_from_directory
from tensorflow.keras import regularizers

# Validation
from sklearn import metrics

# Check whether we actually use these
import cv2
import PIL
from PIL import Image

# Check if tensorflow is updated
print("Tensorflow version " + tf.__version__)

In [None]:
# Set Matplotlib defaults
plt.rc('figure', autolayout=True)
plt.rc('axes', labelweight='bold', labelsize='large',
       titleweight='bold', titlesize=18, titlepad=10)
plt.rc('image', cmap='magma')
warnings.filterwarnings("ignore") # to clean up output cells

## Place images in the right folders

Because we are unable to manipulate the images from the input folders directly, we will transfer them to new folders in the output.

In [None]:
# Create training and validation folder
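# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs('/kaggle/working/train_data/', exist_ok=True)
os.makedirs('/kaggle/working/valid_data/', exist_ok=True)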

# Open file with labels 
dataset = pd.read_csv("../input/cassava-leaf-disease-classification/train.csv")

# Split training images in training and validation images
training_data, validation_data = train_test_split(dataset, test_size=0.33)

training_file_names = list(training_data['image_id'].values) 
training_img_labels = list(training_data['label'].values) 
validation_file_names = list(validation_data['image_id'].values) 
validation_img_labels = list(validation_data['label'].values) 

# Create folders of labels
folders_to_be_created = np.unique(list(dataset['label']))

# Create folders for training and validation images
for new_path in folders_to_be_created:
    train_map = os.path.join('/kaggle/working/train_data/', str(new_path))
    valid_map = os.path.join('/kaggle/working/valid_data/', str(new_path))
    # exist_ok=True skips folders that already exist instead of raising an error
    os.makedirs(train_map, exist_ok=True)
    os.makedirs(valid_map, exist_ok=True)
        
folders = folders_to_be_created.copy() 

# Set source and destination
source = "../input/cassava-leaf-disease-classification/train_images"
training_destination = '/kaggle/working/train_data'
validation_destination = '/kaggle/working/valid_data'

# Places training images in the train folders   
for f in range(len(training_file_names)): 
    tr_current_img = training_file_names[f] 
    tr_current_label = training_img_labels[f] 
    src = os.path.join(source, tr_current_img)
    dst = os.path.join(training_destination, str(tr_current_label))
    shutil.copy(src, dst)
    
# Places validation images in the validation folders    
for f in range(len(validation_file_names)): 
    va_current_img = validation_file_names[f] 
    va_current_label = validation_img_labels[f] 
    src = os.path.join(source, va_current_img)
    dst = os.path.join(validation_destination, str(va_current_label))
    shutil.copy(src, dst)

# Set up variables
We'll set up some of our variables and functions for our notebook here. 

#### Variables
* `BATCH_SIZE`: The number of images processed in each weight update within an epoch. Larger batch sizes tend to degrade model quality, as measured by the ability to generalize, so we chose a relatively small batch size of 64. The downside of a smaller batch size is that training takes longer. (Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2016, Sep 17). *On large-batch training for deep learning: Generalization gap and sharp minima.* Arxiv. https://arxiv.org/abs/1609.04836)
* `IMAGE_SIZE`: Dimensions, in pixels, to which the images are resized when they are loaded. Due to memory and speed considerations, we picked an image size of 48 x 48.
* `CLASSES`: Four disease categories and a fifth category indicating a healthy leaf.
* `EPOCHS`: The number of times the whole training dataset is passed through the model. We chose 60 epochs.
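
To make the effect of `BATCH_SIZE` concrete: the number of weight updates per epoch is the number of training images divided by the batch size, rounded up. The snippet below is a minimal sketch. The helper name `batches_per_epoch` is ours, and the training-set size of roughly 14,315 images (67% of the 21,367 images, before any balancing) is an assumption.

```python
import math

def batches_per_epoch(n_images: int, batch_size: int) -> int:
    """Number of batches (weight updates) in one pass over the data."""
    return math.ceil(n_images / batch_size)

# Assumed training-set size: about 67% of 21,367 images.
n_train = 14315
print(batches_per_epoch(n_train, 64))   # 224
print(batches_per_epoch(n_train, 256))  # 56
```

A batch size of 64 therefore gives about four times as many weight updates per epoch as a batch size of 256, which is part of why smaller batches take longer to run.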

#### Functions
* `early_stopping`: an `EarlyStopping` callback that can be passed to `model.fit()` to prevent overfitting. It stops training when the validation accuracy has not improved by at least 0.01 for 15 consecutive epochs, and then restores the best weights. (We did not use this callback in the end because our model was more likely to be underfitting than overfitting.)
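
The patience logic behind early stopping can be illustrated without Keras. The function below is a simplified sketch of the idea, not the actual `EarlyStopping` implementation. The name `stopping_epoch` and the example scores are hypothetical.

```python
def stopping_epoch(val_scores, min_delta=0.01, patience=15):
    """Return the index of the epoch at which training would stop, or None.

    Simplified patience logic for a metric that should increase (mode='max').
    """
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score - min_delta > best:
            # Improvement of more than min_delta: remember it and reset the counter.
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies that plateau after the second epoch.
scores = [0.50, 0.55, 0.555, 0.556, 0.557]
print(stopping_epoch(scores, min_delta=0.01, patience=3))  # 4
```

With `patience=15` and 60 epochs, as in this notebook, the callback would only trigger after a long plateau in validation accuracy.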

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 64
IMAGE_SIZE = [48, 48]
CLASSES = ['0', '1', '2', '3', '4']
EPOCHS = 60

early_stopping = EarlyStopping(
    min_delta=0.01,
    patience=15, 
    restore_best_weights=True,
    monitor = 'val_sparse_categorical_accuracy', 
    mode = 'max',
)

## Load training and validation datasets 

For tweaking of the parameters we used smaller training and validation sets. 


In [None]:
# Load training and validation sets
ds_train_ = image_dataset_from_directory(
    '/kaggle/working/train_data',
    labels='inferred',
    label_mode='int',
    image_size=IMAGE_SIZE, 
    interpolation='nearest',
    batch_size=BATCH_SIZE,  
    shuffle=True,
    )

ds_valid_ = image_dataset_from_directory(
    '/kaggle/working/valid_data',
    labels='inferred',
    label_mode='int',
    image_size=IMAGE_SIZE,
    interpolation='nearest',
    batch_size=BATCH_SIZE,
    shuffle=True,
)

# Data pipeline
def convert_to_float(image, label):
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    return image, label

AUTOTUNE = tf.data.experimental.AUTOTUNE
ds_train = (
    ds_train_
    .map(convert_to_float)
    .cache()
    .prefetch(buffer_size=AUTOTUNE)
)
ds_valid = (
    ds_valid_
    .map(convert_to_float)
    .cache()
    .prefetch(buffer_size=AUTOTUNE)
)

# Exploratory data analysis (EDA)

We want to do some initial investigation of the data to discover patterns and spot anomalies, with the help of summary statistics and graphical representations.

### Explore disease types

We first map the label number to the different disease types.

In [None]:
with open('/kaggle/input/cassava-leaf-disease-classification/label_num_to_disease_map.json') as f:
    mapping = json.loads(f.read())
    print(mapping)

### Visualise disease types

We adapted the code from Manu Siddhartha's notebook for our visualisations. (Siddhartha, M. (2021, July). *Cassava Leaf Disease Classification EfficientNet*. Kaggle https://www.kaggle.com/sid321axn/cassava-leaf-disease-classification-efficientnet)

In [None]:
def visualize(img_list):
    rows = 1
    cols = 3

    plt.figure(figsize=(18, 10))

    for i in range(rows*cols):
        # Integer division: plt.subplot requires an integer number of rows
        plt.subplot(10 // cols + 1, cols, i+1)
        r = np.random.randint(len(img_list))
        img_path = "/kaggle/input/cassava-leaf-disease-classification/train_images/" + str(img_list[r])
        img = cv2.imread(img_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        plt.xticks([])
        plt.yticks([])
        plt.title(str(img_list[r]))
        plt.imshow(img)
       

    plt.tight_layout()
    plt.show()

#### '0': 'Cassava Bacterial Blight (CBB)'

In [None]:
cbb_df = dataset[dataset['label'].isin([0])]
# Sample from the filtered frame so that only class 0 images are shown
cbb_img_list = list(cbb_df['image_id'])

visualize(cbb_img_list)

#### '1': 'Cassava Brown Streak Disease (CBSD)'

In [None]:
cbsd_df = dataset[dataset['label'].isin([1])]
# Sample from the filtered frame so that only class 1 images are shown
cbsd_img_list = list(cbsd_df['image_id'])

visualize(cbsd_img_list)

#### '2': 'Cassava Green Mottle (CGM)'

In [None]:
cgm_df = dataset[dataset['label'].isin([2])]
# Sample from the filtered frame so that only class 2 images are shown
cgm_img_list = list(cgm_df['image_id'])

visualize(cgm_img_list)

#### '3': 'Cassava Mosaic Disease (CMD)'

In [None]:
cmd_df = dataset[dataset['label'].isin([3])]
# Sample from the filtered frame so that only class 3 images are shown
cmd_img_list = list(cmd_df['image_id'])

visualize(cmd_img_list)

#### '4': 'Healthy'

In [None]:
healthy_df = dataset[dataset['label'].isin([4])]
# Sample from the filtered frame so that only class 4 images are shown
healthy_img_list = list(healthy_df['image_id'])

visualize(healthy_img_list)

### Look at the target distribution

The plot below shows the number of images per category. Class 3, Cassava Mosaic Disease (CMD), contains substantially more images than any other class.

In [None]:
dataset['label'].value_counts()

In [None]:
dataset['label'].value_counts().plot(kind='bar')

### Balance the dataset

Because class 3 has many more training samples than the other classes, we decided to remove a portion of the images in this class from both the training and the validation dataset.

Another option was to pass custom class weights to the loss function during training. However, this option worked less well: it resulted in a lower accuracy and a peculiar loss curve.

In [None]:
#Delete images out of 3
path, dirs, files = next(os.walk("/kaggle/working/train_data/3"))
file_count = len(files)
print(file_count)

path, dirs, files = next(os.walk("/kaggle/working/valid_data/3"))
file_count_valid = len(files)
print(file_count_valid)

# Remove training photos
filenames = os.listdir("/kaggle/working/train_data/3")

for i in filenames[:7000]:
    os.remove("/kaggle/working/train_data/3/" + i) 
    
# Remove validation photos
filenames_valid = os.listdir("/kaggle/working/valid_data/3")

for i in filenames_valid[:3500]:
    os.remove("/kaggle/working/valid_data/3/" + i) 

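# Check how many class 3 images remain (print both, since a cell only displays its last expression)
path, dirs, files = next(os.walk("/kaggle/working/valid_data/3"))
file_count = len(files)
print(file_count)

path, dirs, files = next(os.walk("/kaggle/working/train_data/3"))
file_count = len(files)
print(file_count)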

# Calculate weights for each class
total_images = 8452+1580 + 1479 + 706 + 1718
weight_0 = 1 / (706/total_images)
weight_1 = 1 / (1479/total_images)
weight_2 = 1 / (1580/total_images)
weight_3 = 1 / (8452/total_images)
weight_4 = 1 / (1718/total_images)
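
The inverse-frequency weights above can also be written in a normalized form, so that a perfectly balanced dataset would give every class a weight of 1.0. The snippet below is a sketch of that variant, not the calculation we used. The per-class counts are the ones from the cell above, and the dictionary names are ours.

```python
# Per-class image counts, taken from the weight calculation in this notebook.
class_counts = {0: 706, 1: 1479, 2: 1580, 3: 8452, 4: 1718}

total = sum(class_counts.values())
n_classes = len(class_counts)

# Inverse-frequency weights, scaled so that a balanced dataset gives 1.0 per class.
class_weight = {
    label: total / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weight.items():
    print(label, round(weight, 2))
```

A dictionary of this shape is what the `class_weight` argument of `model.fit()` expects. Rare classes such as class 0 get a weight well above 1, while class 3 gets a weight below 1.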
    

# Building the model

## Computer vision model with a self-defined base

Our model is a convolutional neural network with a self-defined base consisting of four blocks of convolutional and pooling layers, and a head consisting of a flatten layer, one dense layer with dropout, and a softmax output layer. We chose not to build a model with a pre-trained base because this gave us memory problems. We did try several of them: VGG-16, ResNet50, Inceptionv3, and EfficientNetB0.

### Model specifications
#### Preprocessing:
* `input_shape = [48,48,3]`: we chose to reduce the image size to reduce training time while still keeping enough image information.
* `layers.Rescaling(1./255)`: we rescale the input in the [0, 255] range to be in the [0, 1] range.
* `preprocessing`: we add data augmentation so that we have more data to train the model on. We only use a few geometric augmentations (horizontal and vertical flips and a small rotation) and leave the colors untouched, because color is a feature for distinguishing the classes.

#### Base:
* The base consists of four blocks with the same structure: batch normalization, two convolutional layers, a max-pooling layer, and dropout. The number of filters grows from 64 in the first block to 256 in the last.
* `layers.BatchNormalization(renorm=True)`: applies a transformation that maintains the mean output close to 0 and the output standard deviation close to 1.
* `layers.Dropout(0.2)`: randomly sets input units to 0 with a rate of 0.2 at each step during training time, which helps prevent overfitting. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the expected sum over all inputs is unchanged. (Baldi, P., & Sadowski, P. J. (2013). Understanding dropout. In *Advances in neural information processing systems* (pp. 2814–2822)).
* `kernel_size`: a kernel size of 3 is advised. (Sahoo, S. (2018, August 19). *Deciding optimal kernel size for CNN*. Towards Data Science. https://towardsdatascience.com/deciding-optimal-filter-size-for-cnns-d6f7b56f9363)
* `kernel_regularizer`: here we use an L2 regularizer, which is advised for image classification tasks. (Chollet, F. (2017). *Overfit and underfit*. Tensorflow. https://www.tensorflow.org/tutorials/keras/overfit_and_underfit)
* The chosen parameter values gave the highest accuracy. They were determined by trial and error.
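
To see how the base changes the spatial size of the feature maps, the shapes can be traced by hand. The snippet below is a sketch that assumes the usual rule for `padding='same'` layers (output size = input size divided by the stride, rounded up) and the strides used in the model definition below. The helper name `same_padding_output` is ours, and `model.summary()` remains the authoritative source for the actual shapes.

```python
import math

def same_padding_output(size: int, stride: int) -> int:
    """Spatial output size of a layer with padding='same'."""
    return math.ceil(size / stride)

size = 48  # input height and width
for block in ["one", "two", "three", "four"]:
    size = same_padding_output(size, 2)  # first Conv2D, strides=(2, 2)
    size = same_padding_output(size, 2)  # second Conv2D, strides=(2, 2)
    size = same_padding_output(size, 2)  # MaxPool2D, strides=2
    print(f"after block {block}: {size} x {size}")
```

Under this rule, the 48 x 48 input is reduced to 6 x 6 after block one and to 1 x 1 after block two, so blocks three and four operate on a single spatial position. Using a stride of 1 in the convolutional layers would keep more spatial detail, at the cost of memory and training time.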

#### Head:
* Our head consists of a flatten layer, one dense layer with dropout, and an output layer.
* `layers.Flatten()` flattens the input while not affecting the batch size.
* The final layer consists of 5 nodes, corresponding to the number of classes, and a `softmax` activation function. This function is used for multi-class predictions. The sum of all outputs generated by softmax is 1.
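
The softmax step in the final layer can be illustrated in plain Python. This is a minimal sketch with hypothetical scores for the five classes, not the Keras implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, 0.3, -0.5, 2.8, 0.1]  # hypothetical scores for classes 0-4
probs = softmax(logits)
predicted_class = probs.index(max(probs))
print(predicted_class)  # 3
```

The predicted class is the index of the largest probability, which is what `np.argmax` does on the model output later in this notebook.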

In [None]:
model = keras.Sequential([
    layers.InputLayer(input_shape=[48, 48, 3]),
    layers.Rescaling(1./255),
    layers.Dropout(0.2),
    
    # Data Augmentation
    preprocessing.RandomFlip(mode='horizontal'), 
    preprocessing.RandomRotation(factor=0.1),
    preprocessing.RandomFlip(mode='vertical'), 
    
    # Block One
    layers.BatchNormalization(renorm=True),
    layers.Conv2D(filters=64, 
                  kernel_size=3,
                  activation='relu',
                  padding='same',
                  strides = (2,2),
                  kernel_regularizer=regularizers.l2(0.001)),
    layers.Conv2D(filters=64, 
                  kernel_size=3,
                  activation='relu',
                  padding='same',
                  strides = (2,2),
                  kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPool2D(pool_size=2,
                     strides=2,
                     padding='same'),
    layers.Dropout(0.2), 
    
    # Block Two
    layers.BatchNormalization(renorm=True),
    layers.Conv2D(filters=128,
                  kernel_size=3,
                  activation='relu',
                  padding='same', 
                  strides = (2,2),
                  kernel_regularizer=regularizers.l2(0.001)),
    layers.Conv2D(filters=128,
                  kernel_size=3,
                  activation='relu',
                  padding='same', 
                  strides = (2,2),
                  kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPool2D(pool_size=2,
                     strides=2,
                     padding='same'),
    layers.Dropout(0.2),
    
    # Block Three
    layers.BatchNormalization(renorm=True),
    layers.Conv2D(filters=128,
                  kernel_size=3,
                  activation='relu', 
                  padding='same',
                  strides = (2,2),
                  kernel_regularizer=regularizers.l2(0.001)),
    layers.Conv2D(filters=128,
                  kernel_size=3,
                  activation='relu',
                  padding='same',
                  strides = (2,2),
                  kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPool2D(pool_size=2,
                     strides=2,
                     padding='same'),
    layers.Dropout(0.2),
    
    # Block four
    layers.BatchNormalization(renorm=True),
    layers.Conv2D(filters=256,
                  kernel_size=3,
                  activation='relu', 
                  padding='same',
                  strides = (2,2),
                  kernel_regularizer=regularizers.l2(0.001)),
    layers.Conv2D(filters=256,
                  kernel_size=3,
                  activation='relu',
                  padding='same',
                  strides = (2,2),
                  kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPool2D(pool_size=2,
                     strides=2,
                     padding='same'),
    layers.Dropout(0.2),

    # Head
    layers.BatchNormalization(renorm=True),
    layers.Flatten(),
    layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.5),
    layers.Dense(len(CLASSES), activation='softmax'),
])

With `model.summary()` we'll see a printout of each of our layers, their corresponding shape, as well as the associated number of parameters.

In [None]:
model.summary()

# Train the model

### Training specifications:
* `optimizer = 'adam'`: we chose Adam as an optimizer because it is computationally efficient, requires little memory, and is well suited for problems that are large in terms of data or parameters or both.
* We are using `sparse_categorical_crossentropy` as our loss function and `sparse_categorical_accuracy` as our performance metric, because we did _not_ one-hot encode our labels. The four disease categories and the fifth category indicating a healthy leaf are mutually exclusive (i.e. each image belongs to exactly one of the classes). (Author Unknown, (Date unknown). *Sparse Categorical Crossentropy Class*. Keras. https://keras.io/api/losses/probabilistic_losses/#sparsecategoricalcrossentropy-class)
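
For a single image, the sparse categorical crossentropy is the negative logarithm of the probability the model assigns to the true class, with the label given as an integer rather than a one-hot vector. The snippet below is a sketch with a hypothetical softmax output. The function is our own illustration, not the Keras loss itself.

```python
import math

def sparse_categorical_crossentropy(true_label: int, probs) -> float:
    """Loss for one image: negative log of the probability of the true class."""
    return -math.log(probs[true_label])

probs = [0.05, 0.10, 0.05, 0.70, 0.10]  # hypothetical softmax output

# Small loss: the true class (3) received a probability of 0.70.
print(sparse_categorical_crossentropy(3, probs))
# Large loss: the true class (0) received a probability of only 0.05.
print(sparse_categorical_crossentropy(0, probs))
```

The loss reported by Keras during training is the average of this quantity over the images in a batch, plus the L2 regularization penalties.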

In [None]:
# `lr` is deprecated in recent TensorFlow versions; use `learning_rate`
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)

model.compile(
    optimizer = optimizer,
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy'], 
)

history = model.fit(
    ds_train,
    validation_data=ds_valid,
    epochs=EPOCHS,
    # batch_size is not passed here: ds_train is already batched by image_dataset_from_directory
    #callbacks=[early_stopping],
    verbose=1,
    #class_weight = {0: weight_0, 
                    #1: weight_1, 
                    #2: weight_2, 
                    #3: weight_3, 
                    #4: weight_4, 
                    #}
)

# Evaluating our model

In [None]:
# loss and accuracy graph
history_frame = pd.DataFrame(history.history)
history_frame.loc[:, ['loss', 'val_loss']].plot()
history_frame.loc[:, ['sparse_categorical_accuracy', 'val_sparse_categorical_accuracy']].plot();

In [None]:
# Make predictions
predictions = model.predict(ds_valid)

# Count individual counts per class
values, counts = np.unique([CLASSES[np.argmax(predictions[i,])] for i in range(len(predictions))], return_counts=True)
values, counts

In [None]:
# Calculate Cohen Kappa score to determine how far away from 'chance' we are
classs = [0, 1, 2, 3, 4]

target = np.concatenate([label for example, label in ds_valid])

metrics.cohen_kappa_score([classs[np.argmax(predictions[i,])] for i in range(len(predictions))], target)
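
Cohen's kappa compares the observed agreement between predictions and labels with the agreement expected by chance from the class frequencies. The snippet below is a plain-Python sketch of that calculation on a tiny hypothetical example. It is meant to show the idea and is not a replacement for `metrics.cohen_kappa_score`.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    n = len(y_true)
    # Fraction of items on which labels and predictions agree.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance, from the class frequencies of both sequences.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    # Note: undefined (division by zero) when expected agreement is exactly 1.
    return (observed - expected) / (1 - expected)

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]
print(cohen_kappa(y_true, y_pred))  # 0.5
```

A kappa of 0 means the predictions are no better than chance, and a kappa of 1 means perfect agreement. This makes the score more informative than plain accuracy when one class dominates.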

# Making predictions on the test set
Now that we've trained our model we can use it to make predictions.

We adapted the code from Aayush Mishraa's notebook for our predictions. (Mishraa, A. (2021). *Cassava Leaf Disease 69% Acc [Simple CNN approach]*. Kaggle. https://www.kaggle.com/aayushmishra1512/cassava-leaf-disease-69-acc-simple-cnn-approach)

In [None]:
preds = []
ss = pd.read_csv('../input/cassava-leaf-disease-classification/sample_submission.csv')

for image in ss.image_id:
    img = tf.keras.preprocessing.image.load_img('../input/cassava-leaf-disease-classification/test_images/' + image)
    img = tf.keras.preprocessing.image.img_to_array(img)
    img = tf.keras.preprocessing.image.smart_resize(img, (48, 48))
    img = tf.reshape(img, (-1, 48, 48, 3))
    prediction = model.predict(img/255)
    preds.append(np.argmax(prediction))

submission = pd.DataFrame({'image_id': ss.image_id, 'label': preds})
submission

# Creating a submission file
Now that we've trained a model and made predictions we're ready to submit to the competition.

In [None]:
submission.to_csv('submission.csv', index = False)