# CASSAVA LEAF DISEASE CLASSIFICATION

This is a starter training notebook for TensorFlow users in this competition. It runs with internet access turned on so that the pretrained model weights can be downloaded.

- It uses tf.data to load the training data.
- The model is trained using transfer learning.

Importing required dependencies

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

Getting paths to the CSV files:

- train.csv
- sample_submission.csv

sample_submission.csv is loaded only to check the format of the submission file.

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if filename.endswith(".csv"):
            print(os.path.join(dirname, filename))

In [None]:
submission_file = "/kaggle/input/cassava-leaf-disease-classification/sample_submission.csv"
train_path = "/kaggle/input/cassava-leaf-disease-classification/train.csv"
train_images_path = "/kaggle/input/cassava-leaf-disease-classification/train_images"

Reading CSV files and viewing their heads

In [None]:
train_data = pd.read_csv(train_path)
submission = pd.read_csv(submission_file)

In [None]:
train_data.head()

In [None]:
submission.head()

## Checking class imbalance 

In [None]:
train_data.label.value_counts()

In [None]:
sns.countplot(x="label", data=train_data)

The plot above shows that the dataset has more images of class 3 (Cassava Mosaic Disease (CMD)) than of any other class, so we need to take care to split the data in a stratified manner.

## Splitting data into training and validation

In [None]:
trainX,valX,trainY,valY = train_test_split(train_data.iloc[:,0].values,
                                             train_data.iloc[:,1].values,
                                             stratify= train_data.iloc[:,1].values,
                                             random_state=11,
                                             test_size=0.2
                                            )

In [None]:
num_train_images = len(trainX)
num_eval_images = len(valX)

In [None]:
print("Number of train images: ",num_train_images)
print("Number of validation images: ",num_eval_images)
print("Shape of train data: ",trainX.shape)
print("Shape of validation data: ",valX.shape)
print("Shape of train targets: ",trainY.shape)
print("Shape of validation targets: ",valY.shape)

In [None]:
df_train = pd.DataFrame.from_dict({"image_id":trainX, "label":trainY})
df_val = pd.DataFrame.from_dict({"image_id":valX, "label":valY})

In [None]:
df_train.head()

In [None]:
df_train.label.value_counts()

In [None]:
sns.countplot(x="label", data=df_train)

In [None]:
df_val.label.value_counts()

In [None]:
sns.countplot(x="label", data=df_val)

The plots above show that the training and validation splits have the same label distribution, which is what we want.
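
The same check can be done numerically instead of by eye. The snippet below is a minimal sketch on hypothetical, synthetic labels (not the competition data): it repeats a stratified split and compares the per-class shares of the two parts. The helper `class_shares` is introduced here only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: class 3 dominates, similar in spirit to the plots.
labels = np.repeat([0, 1, 2, 3, 4], [50, 100, 100, 600, 150])
ids = np.arange(len(labels))

_, _, y_train, y_val = train_test_split(
    ids, labels, stratify=labels, test_size=0.2, random_state=11
)

def class_shares(y, num_classes=5):
    """Fraction of samples that belong to each class."""
    return np.bincount(y, minlength=num_classes) / len(y)

train_shares = class_shares(y_train)
val_shares = class_shares(y_val)

# Largest difference in class share between the two splits.
max_gap = np.abs(train_shares - val_shares).max()
print(train_shares.round(3), val_shares.round(3), max_gap)
```

A small `max_gap` means the split kept the class proportions; the same helper could be applied to `trainY` and `valY`.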

## Setting Parameters here

In [None]:
EPOCHS=5
BATCH_SIZE=32
IMAGE_DIM=(224,224)

## Helper Functions

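- get_path_of_image : get the full path to an image from its file name
- load_tf_image : load an image, resize it, and scale its pixel values to [0, 1]
- generate_tf_dataset : build a batched tf.data dataset of images and one-hot labels
- plot_images_grid : plot a batch of images in a grid with their class names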

In [None]:
def get_path_of_image(image):
    return os.path.join(train_images_path,image)

In [None]:
def load_tf_image(image_path,dim):
    # Read the file and decode it into a uint8 RGB tensor.
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image,channels=3)
    # tf.image.resize already returns float32, with values still in [0, 255].
    image = tf.image.resize(image,dim)
    # Scale pixel values to [0, 1].
    image = image/255.0
    return image
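
To see why a single division by 255 is what brings the pixels into [0, 1], here is a small sketch on a hypothetical in-memory image (a stand-in for a decoded JPEG). It relies on two documented behaviours: `tf.image.resize` with the default bilinear method returns float32, and `tf.image.convert_image_dtype` only rescales when the dtype actually changes.

```python
import numpy as np
import tensorflow as tf

# Hypothetical 8x8 RGB image with all pixels at 255, standing in for a decoded JPEG.
raw = tf.constant(np.full((8, 8, 3), 255, dtype=np.uint8))

# Bilinear resize returns float32; the values stay in the [0, 255] range.
resized = tf.image.resize(raw, (4, 4))

# float32 -> float32 conversion leaves the values unchanged.
converted = tf.image.convert_image_dtype(resized, tf.float32)

# The division is what maps the pixels to [0, 1].
scaled = converted / 255.0

print(resized.dtype, float(tf.reduce_max(converted)), float(tf.reduce_max(scaled)))
```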

In [None]:
def generate_tf_dataset(X,Y,image_size):
    # Turn image names into full file paths.
    X = [get_path_of_image(str(x)) for x in X]
    # Load and preprocess the images in parallel.
    datasetX = tf.data.Dataset.from_tensor_slices(X).map(
            lambda path: load_tf_image(path,image_size),
            num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
    # One-hot encode the labels to match the categorical_crossentropy loss.
    datasetY = tf.data.Dataset.from_tensor_slices(tf.keras.utils.to_categorical(Y))
    dataset = tf.data.Dataset.zip((datasetX,datasetY))
    dataset = dataset.batch(BATCH_SIZE)
    # Repeat indefinitely; fit() is given explicit step counts per epoch.
    dataset = dataset.repeat()
    dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    return dataset

In [None]:
def plot_images_grid(data,num_rows=1,class_names=None):
    images, labels = data
    n=len(images)
    labels = np.argmax(labels.numpy(), axis=1)
    if n > 1:
        num_cols=np.ceil(n/num_rows)
        fig,axes=plt.subplots(ncols=int(num_cols),nrows=int(num_rows))
        axes=axes.flatten()
        fig.set_size_inches((20,20))
        for i,image in enumerate(images):
            axes[i].imshow(image.numpy())
            axes[i].axis('off')
            axes[i].set_title(class_names[str(labels[i])])

## Setting up train and validation tf dataset

Create tf.data datasets for training and validation. Each element of these datasets is a batch *(Image, Label)*, where Image has shape *(batch_size, image_height, image_width, channels)* and Label is one-hot encoded with shape *(batch_size, num_classes)*.
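
Because generate_tf_dataset passes the labels through `tf.keras.utils.to_categorical`, every label batch is one-hot encoded. The NumPy sketch below mimics that encoding on a hypothetical batch of four labels to show the resulting shape; it is an illustration, not the code the dataset runs.

```python
import numpy as np

NUM_CLASSES = 5  # five classes in this competition

# Hypothetical batch of integer class ids.
labels = np.array([3, 0, 4, 3])

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(NUM_CLASSES, dtype=np.float32)[labels]

print(one_hot.shape)            # (4, 5)
print(one_hot.argmax(axis=1))   # [3 0 4 3], the original class ids
```

This is also why plot_images_grid applies `np.argmax` before looking up class names.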

In [None]:
train_dataset=generate_tf_dataset(trainX,trainY,IMAGE_DIM)
print(train_dataset.element_spec)

In [None]:
eval_dataset=generate_tf_dataset(valX,valY,IMAGE_DIM)
print(eval_dataset.element_spec)

## Plotting and Visualizing images

In [None]:
image_classes = {
    "0":"Cassava Bacterial Blight (CBB)",
    "1":"Cassava Brown Streak Disease (CBSD)",
    "2":"Cassava Green Mottle (CGM)",
    "3":"Cassava Mosaic Disease (CMD)",
    "4":"Healthy"
}

In [None]:
plot_images_grid(next(iter(train_dataset.take(1))),class_names=image_classes,num_rows=8)

In [None]:
plot_images_grid(next(iter(eval_dataset.take(1))),class_names=image_classes,num_rows=8)

## Training Model

We use an InceptionResNetV2 model pretrained on ImageNet, remove its top classification layers (include_top=False), add a new classification head, and fine-tune it.

In [None]:
pretrained = tf.keras.applications.InceptionResNetV2(
                include_top=False, weights='imagenet',input_shape=(*IMAGE_DIM,3)
            )
pretrained.summary()

In [None]:
model = tf.keras.Sequential([
    pretrained,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(512),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(128),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(len(image_classes),activation="softmax")
])

In [None]:
model.summary()

In [None]:
model.compile(loss="categorical_crossentropy",optimizer="adam",metrics=["accuracy"])

Callbacks for:
- model_checkpoint: checkpoints the model weights with the best validation accuracy.
- early_stop: stops training if the validation accuracy has not improved in the last 5 epochs.
- reduce_lr: reduces the learning rate if the validation accuracy has not improved in the last 2 epochs.
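
For reduce_lr, the floor matters as much as the factor. The Keras documentation describes the update as `new_lr = lr * factor`, with `min_lr` as a lower bound. The plain-Python sketch below mirrors that rule (the function `reduced_lr` is a hypothetical helper, not a Keras API) and shows that a floor equal to the starting rate leaves nothing to reduce.

```python
def reduced_lr(current_lr, factor, min_lr):
    """One plateau step: scale the rate by `factor`, but never go below `min_lr`."""
    return max(current_lr * factor, min_lr)

start_lr = 0.001  # default learning rate of the Keras Adam optimizer

# Floor equal to the starting rate: the "reduced" rate is unchanged.
stuck = reduced_lr(start_lr, factor=0.2, min_lr=0.001)

# Lower floor: the rate actually drops by the factor.
lowered = reduced_lr(start_lr, factor=0.2, min_lr=1e-6)

print(stuck, lowered)
```

So `min_lr` should sit below the optimizer's initial learning rate for the callback to have any effect.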

In [None]:
checkpoint_path="best_checkpoint"

In [None]:
model_checkpoint=tf.keras.callbacks.ModelCheckpoint(checkpoint_path,monitor="val_accuracy",
                                                    save_best_only=True,mode="max",
                                                    save_weights_only=True,
                                                    verbose=1)
early_stop=tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",patience=5,
                                            mode="max", verbose=1)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy',mode="max",
                                                 factor=0.2,patience=2,
                                                 # min_lr must be below Adam's default rate of 0.001,
                                                 # otherwise the rate can never be reduced.
                                                 min_lr=1e-6, verbose=1)

In [None]:
callbacks=[model_checkpoint,early_stop,reduce_lr]

Model training happens here with the fit method. It takes the tf datasets and the number of steps per epoch.

In [None]:
history=model.fit(train_dataset,
                  epochs=EPOCHS,
                  steps_per_epoch=num_train_images//BATCH_SIZE,
                  validation_data=eval_dataset,
                  validation_steps=num_eval_images//BATCH_SIZE,
                  callbacks=callbacks)
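
A note on the step counts: integer division drops the final partial batch. The arithmetic sketch below uses a hypothetical image count (not the real dataset size) to show the difference between rounding down and rounding up.

```python
import math

BATCH_SIZE = 32
num_images = 1000  # hypothetical image count, for illustration only

full_batches = num_images // BATCH_SIZE           # batches with exactly 32 images
leftover = num_images % BATCH_SIZE                # images in the final partial batch
all_batches = math.ceil(num_images / BATCH_SIZE)  # batches needed to see every image

print(full_batches, leftover, all_batches)  # 31 8 32
```

Since the dataset is batched and then repeated, using `//` means an epoch ends one batch short of a full pass; `math.ceil` is one way to make each epoch cover every image if that matters.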

Loading best model checkpoint

In [None]:
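# With save_weights_only=True and no file extension, TF 2.x writes the checkpoint
# as "best_checkpoint.index" plus data shards, so check for the index file.
if os.path.isfile(checkpoint_path + ".index"):
    model.load_weights(checkpoint_path)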

## Predicting test data

This competition provides the test data only during submission scoring, and the submission notebook must run with internet disabled. So create a separate inference notebook for submission.
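
As a rough sketch of what the last step of such an inference notebook might look like, the snippet below turns softmax outputs into a submission table. The file names and probabilities are made up, and the column names `image_id` and `label` are assumed to match sample_submission.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical file names and softmax outputs standing in for model.predict().
image_ids = ["a.jpg", "b.jpg", "c.jpg"]
probs = np.array([
    [0.10, 0.10, 0.10, 0.60, 0.10],
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.20, 0.20, 0.20, 0.10, 0.30],
])

# The predicted class is the index of the highest probability in each row.
submission_df = pd.DataFrame({
    "image_id": image_ids,
    "label": probs.argmax(axis=1),
})

csv_text = submission_df.to_csv(index=False)
print(csv_text)
```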

## What to do next?

- Add data augmentation for better generalization of the model. We can use [albumentations](https://albumentations.ai/) or TensorFlow image augmentation as explained [here](https://www.tensorflow.org/tutorials/images/data_augmentation).

- Tune hyperparameters and check performance.

- Use different pretrained models such as Xception, InceptionV3, VGG, EfficientNet, ResNet, etc. Find a list of some pretrained models [here](https://www.tensorflow.org/api_docs/python/tf/keras/applications).

- Use K-fold cross-validation (this notebook already uses a stratified hold-out split).

- Train different models and combine them in an ensemble using voting, model stacking, or blending.

- Do not limit yourself to these; use your intuition for feature engineering.
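
For the K-fold suggestion, a stratified variant keeps the class proportions in every fold. The sketch below uses scikit-learn's `StratifiedKFold` on hypothetical synthetic labels; in this notebook the indices would select rows of `train_data`, and one model would be trained per fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels standing in for train_data.label.
labels = np.repeat([0, 1, 2, 3, 4], [50, 100, 100, 600, 150])
ids = np.arange(len(labels))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(skf.split(ids, labels)):
    # Class counts in this fold's validation part.
    counts = np.bincount(labels[val_idx], minlength=5)
    fold_sizes.append(len(val_idx))
    print(f"fold {fold}: val size={len(val_idx)}, class counts={counts.tolist()}")
```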

### ALL THE BEST 👍

#### Thanks for reading to the end, and consider upvoting ✔✔.