**Read this before starting**

This notebook was adapted from the course materials for the Udemy course "Deployment of Machine Learning Models". For more info, see the README in the GitHub repo. 
- **Source data**: The notebook uses the Kaggle dataset "V2 Plant Seedlings Dataset" (<https://www.kaggle.com/vbookshelf/v2-plant-seedlings-dataset>).
- **On Kaggle**: To avoid downloading the large dataset locally, a Kaggle Kernel has been created to run this notebook here: <https://www.kaggle.com/btw78jt/deploy-ml-course-cnn>.
- **In the Git repo**: The notebook is saved in my GitHub fork of the course repo here: <https://github.com/A-Breeze/deploying-machine-learning-models>. See `jupyter_notebooks/Section12_DeepLearningModel/`. It is up to the user to *manually* ensure that the copy of the notebook in the Kaggle Kernel is the same as the copy committed to the repo.

## Machine Learning Model Building Pipeline: Big Data, Images and Neural Networks

In this notebook, we work through a practical example of building a neural network on a large dataset (> 1GB). We first explore the data to understand what it contains, and then pre-process it so that it can be used in a convolutional neural network.

The accompanying repo goes on to show the code for productionising and deploying the model.

<!-- This table of contents is updated *manually* -->
## Contents
1. [Differentiating weed from crop seedlings](#Differentiating-weed-from-crop-seedlings)
1. [Load images](#Load-images)
1. [Examine images](#Examine-images)
1. [Separate train and test](#Separate-train-and-test)
1. [Pre-process data for modelling](#Pre-process-data-for-modelling)
1. [CNN: Specify and train](#CNN:-Specify-and-train)
1. [CNN: Assess model](#CNN:-Assess-model)


<p style="text-align: right"><a href="#Contents">Back to Contents</a></p>

## Differentiating weed from crop seedlings

The aim of the project is to correctly identify the weed type from a variety of weed and crop RGB images.

### Why is this important? 

As stated on the Kaggle website:

"Successful cultivation of maize depends largely on the efficacy of weed control. Weed control during the first six to eight weeks after planting is crucial, because weeds compete vigorously with the crop for nutrients and water during this period. Annual yield losses occur as a result of weed infestations in cultivated crops. Crop yield losses that are attributable to weeds vary with type of weed, type of crop, and the environmental conditions involved. Generally, depending on the level of weed control practiced yield losses can vary from 10 to 100 %. Therefore, effective weed control is imperative. In order to do effective control the first critical requirement is correct weed identification."


### What is the objective of the machine learning model?

We aim to maximise the accuracy, that is, the proportion of images for which the weed or crop variety is classified correctly.

### How do I download the dataset?
- Go to the Kaggle dataset page: <https://www.kaggle.com/vbookshelf/v2-plant-seedlings-dataset>. Log in to Kaggle.
- Click the **download (2GB)** button towards the top right of the screen to download the dataset.
    - You may need to accept the terms and conditions of the competition.
- Unzip the folder and save it in `NOTEBOOK_FOLDER/kaggle/input`, where `NOTEBOOK_FOLDER` is the directory of this notebook.
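
Once unzipped, the data folder is expected to hold one subfolder per class, each containing PNG images. The following is a minimal sketch of a layout check, not part of the original course code. So that it runs anywhere, it builds a throwaway directory with placeholder class and file names (these are not the real dataset contents), and the helper name `count_images_per_class` is hypothetical.

```python
import tempfile
from pathlib import Path


def count_images_per_class(data_folder):
    """Return {class_folder_name: number_of_png_files}, looking one subfolder down."""
    return {
        class_dir.name: len(list(class_dir.glob("*.png")))
        for class_dir in sorted(data_folder.iterdir())
        if class_dir.is_dir()
    }


# Build a tiny stand-in for NOTEBOOK_FOLDER/kaggle/input/v2-plant-seedlings-dataset.
# The class names and file names below are placeholders for illustration only.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_data_folder = Path(tmp_dir) / "kaggle" / "input" / "v2-plant-seedlings-dataset"
    for class_name, n_files in [("Maize", 2), ("Cleavers", 3)]:
        class_dir = demo_data_folder / class_name
        class_dir.mkdir(parents=True)
        for i in range(n_files):
            (class_dir / f"{i}.png").touch()
    demo_counts = count_images_per_class(demo_data_folder)

print(demo_counts)
```

Pointing the same helper at the real data folder should list one entry per class, which is a quick way to confirm the unzip location before running the rest of the notebook.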

====================================================================================================

In [None]:
# check system that is running
import platform
import sys

# Show all warnings in IPython
import warnings
warnings.filterwarnings("always")
# Ignore specific numpy warning, as per: <https://github.com/numpy/numpy/issues/11788#issuecomment-422846396>
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

# navigate folders
from glob import glob
import os
from pathlib import Path

# saving output (with a timestamp)
import pickle

# other utils
import time
import datetime
import re

# to handle datasets
import numpy as np
import pandas as pd

# for plotting
from matplotlib import __version__ as mpl_version
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# to open the images
import cv2

# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)

# data preprocessing
from sklearn import __version__ as sk_version
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# evaluate model and separate train and test
from sklearn.metrics import confusion_matrix

# Confirm expected versions (i.e. the versions running in the Kaggle Kernel)
assert platform.python_version() == '3.6.6'
print(f"Python version:\t\t{sys.version}")
assert pd.__version__ == '0.25.3'
print(f"pandas version:\t\t{pd.__version__}")
assert np.__version__ == '1.18.2'
print(f"numpy version:\t\t{np.__version__}")
assert mpl_version == '3.2.1'
print(f"matplotlib version:\t{mpl_version}")
assert sns.__version__ == '0.10.0'
print(f"seaborn version:\t{sns.__version__}")
assert cv2.__version__ == '4.2.0'
print(f"cv2 version:\t\t{cv2.__version__}")
assert sk_version == '0.22.2.post1'
print(f"sklearn version:\t{sk_version}")

In [None]:
# Ignore warnings that can show up, specific to Keras
warnings.filterwarnings(
    "ignore", message="can't resolve package from __spec__ or __package__")
warnings.filterwarnings(
    "ignore", message="unclosed file <_io.TextIOWrapper name='/root/.keras/keras.json'")

# for the convolutional network
from keras import __version__ as keras_version
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Model
from keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from keras.preprocessing import image
from keras.utils import np_utils

# Confirm expected version
assert keras_version == '2.3.1'
print(f"keras version:\t{keras_version}")

In [None]:
# Configuration variables
NOTEBOOK_FOLDER = Path('/')  # Change this to the location of your notebook
DATA_FOLDER = NOTEBOOK_FOLDER / 'kaggle' / 'input' / 'v2-plant-seedlings-dataset'

<p style="text-align: right"><a href="#Contents">Back to Contents</a></p>

## Load images

In [None]:
# each weed class is in a dedicated folder
print('\t'.join(os.listdir(DATA_FOLDER)))

In [None]:
# let's walk over the directory structure, so we understand
# how the images are stored
max_print_subfolders = 4
max_print_files_per_folder = 3
subfolder_counter = 0
for class_folder_path in DATA_FOLDER.iterdir():
    subfolder_counter += 1
    if subfolder_counter > max_print_subfolders:
        print(str(DATA_FOLDER / '...') + "more subfolders in this folder...")
        break
    file_counter = 0
    for image_path in class_folder_path.glob("*.png"):
        file_counter += 1
        if file_counter > max_print_files_per_folder:
            print(str(class_folder_path / '...') + "more files in this folder...\n")
            break
        print(image_path)

In [None]:
# let's create a dataframe:
# the dataframe stores the image file name in one column
# and the class of the weed (the target) in the next column
images_df = pd.DataFrame.from_records([
    (image_file_path.name, image_file_path.parent.name) for 
    image_file_path in DATA_FOLDER.glob("*/*.png")  # Only look one subfolder down
], columns=['image', 'target']).sort_values(['target', 'image'])

def get_image_file_path(images_row, DATA_FOLDER=DATA_FOLDER):
    """Get the file path from a row of images_df"""
    return(DATA_FOLDER / images_row.target / images_row.image)

images_df.head(10)

In [None]:
# how many images do we have per class?
images_df['target'].value_counts()

<p style="text-align: right"><a href="#Contents">Back to Contents</a></p>

## Examine images

In [None]:
# let's isolate a path, for demo
# we want to load the image in this path later
images_df.loc[0, :].agg(get_image_file_path)

In [None]:
# let's visualise a few images
# if the images you see in your notebook are not the same, don't worry

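def plot_single_image(df, image_number):
    im = cv2.imread(str(df.loc[image_number, :].agg(get_image_file_path)))
    plt.title(df.loc[image_number, :].agg(lambda x: f"{x.target}: {x.image}"))
    # cv2.imread returns pixels in BGR order, but plt.imshow expects RGB
    plt.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB))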
    
plot_single_image(images_df, 0)

In [None]:
plot_single_image(images_df, 3000)

In [None]:
plot_single_image(images_df, 1000)

In [None]:
# let's go ahead and plot a bunch of our images together,
# so we get a better feeling for what our images look like

def plot_for_class(df, label):
    # function plots the first 9 images of the given class
    nb_rows = 3
    nb_cols = 3
    fig, axs = plt.subplots(nb_rows, nb_cols, figsize=(10, 10))
    # filter once, outside the loops, and work on a re-indexed copy
    tmp = df[df['target'] == label].reset_index(drop=True)
    n = 0
    for i in range(0, nb_rows):
        for j in range(0, nb_cols):
            im = cv2.imread(str(tmp.loc[n, :].agg(get_image_file_path)))
            axs[i, j].set_title(tmp.loc[n, :].agg(lambda x: f"{x.target}: {x.image}"))
            # convert from OpenCV's BGR order to the RGB order expected by imshow
            axs[i, j].imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB))
            n += 1

In [None]:
plot_for_class(images_df, 'Cleavers')

In [None]:
plot_for_class(images_df, 'Maize')

In [None]:
plot_for_class(images_df, 'Common Chickweed')

<p style="text-align: right"><a href="#Contents">Back to Contents</a></p>

## Separate train and test
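
The split in this section is random rather than stratified, so the cells below check that each class makes up a similar share of the train and test sets. The same check can be sketched with the standard library alone. The labels here are made up, and the helper name `class_proportions` is hypothetical.

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: share of the sample} for a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Toy labels, not taken from the dataset
toy_train = ["Maize"] * 6 + ["Cleavers"] * 4
toy_test = ["Maize"] * 3 + ["Cleavers"] * 2

train_props = class_proportions(toy_train)
test_props = class_proportions(toy_test)

# Largest absolute gap in class share between the two samples
max_gap = max(
    abs(train_props.get(label, 0.0) - test_props.get(label, 0.0))
    for label in set(train_props) | set(test_props)
)
print(f"Largest gap in class proportions: {max_gap:.1%}")
```

If the gap were too large for a given random seed, passing the target to the `stratify` argument of `train_test_split` is one way to force matching proportions.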

In [None]:
# train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    images_df.target + '/' + images_df.image, images_df.target,
    test_size=0.20, random_state=101
)
print(X_train.shape)
print(X_test.shape)

In [None]:
# the indices of the training data are shuffled
# this will cause problems later
X_train.head()

In [None]:
# reset index, because later we iterate over row number
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)

# reset index in target as well
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

print(X_train.head())

In [None]:
y_train.value_counts(normalize=True) - y_test.value_counts(normalize=True)

In [None]:
# the proportion of images in each class should be
# (roughly) the same in the train and test sets
thresh = 1.2e-2
assert (np.abs(
    y_train.value_counts(normalize=True) - y_test.value_counts(normalize=True)
) < thresh).all()
print(f'Proportions are within the threshold of: {thresh:.1%}\n')
y_train.value_counts(normalize=True).to_frame("Proportion of sample") \
.style.format('{:.2%}')

<p style="text-align: right"><a href="#Contents">Back to Contents</a></p>

## Pre-process data for modelling
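
The first step in this section turns the class names into a one-hot matrix: each label is mapped to an integer code (in alphabetical order of the class names), and each code becomes a row with a single 1. The sketch below reproduces that idea in plain Python on a few made-up labels. It is an illustration only; the notebook itself uses `LabelEncoder` and `to_categorical`, and the helper name `one_hot_encode` is hypothetical.

```python
def one_hot_encode(labels):
    """Map labels to integer codes (alphabetical order), then to one-hot rows."""
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0.0] * len(classes)
        row[code_of[label]] = 1.0
        rows.append(row)
    return classes, rows


# Toy labels for illustration
toy_labels = ["Maize", "Cleavers", "Maize", "Fat Hen"]
toy_classes, toy_one_hot = one_hot_encode(toy_labels)

print(toy_classes)
for label, row in zip(toy_labels, toy_one_hot):
    print(f"{label:10s} -> {row}")
```

Each row sums to 1 and has as many columns as there are distinct classes, which is the shape the softmax output layer of the network is compared against.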

In [None]:
# let's prepare the target
# it is a multiclass classification, so we need to make 
# one hot encoding of the target

encoder = LabelEncoder()
encoder.fit(y_train)

train_y = np_utils.to_categorical(encoder.transform(y_train))
test_y = np_utils.to_categorical(encoder.transform(y_test))

print(train_y.shape)
print('')
print(train_y[:10])

In [None]:
# The images in our folders are all different sizes
# Neural networks, however, need all input images to be the same size
# The images will all be resized to this size:

IMAGE_SIZE = 150

In [None]:
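def im_resize(image_location, image_size=IMAGE_SIZE, DATA_FOLDER=DATA_FOLDER):
    """Read an image and resize it to a square of side image_size"""
    return(cv2.resize(
        cv2.imread(str(DATA_FOLDER / image_location)),
        (image_size, image_size)  # use the argument, not the global constant
    ))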

In [None]:
tmp = im_resize(X_train[7])
tmp.shape

In [None]:
# the shape of the datasets needs to be (n1, n2, n3, n4)
# where n1 is the number of observations
# n2 and n3 are the image height and width
# and n4 indicates that it is a color image, so 3 planes per image

def create_dataset(image_locations, **kwargs):
    """**kwargs: Additional arguments to im_resize()"""
    return(np.array([
        im_resize(image_location, **kwargs) for image_location in image_locations
    ]))

In [None]:
%%time
# Took me approx: 45 secs
x_train = create_dataset(X_train, image_size=IMAGE_SIZE)
print(f'Train Dataset Images shape: {x_train.shape}   size: {x_train.size:,}\n')

In [None]:
%%time
# Took me approx: 15 secs
x_test = create_dataset(X_test)
print(f'Test Dataset Images shape: {x_test.shape}   size: {x_test.size:,}')

In [None]:
# number of different classes
y_train.unique().shape[0]

<p style="text-align: right"><a href="#Contents">Back to Contents</a></p>

## CNN: Specify and train
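
Before reading `model.summary()` below, it can help to work out by hand how the image shrinks through the network. The sketch assumes the Keras defaults that the model relies on: `Conv2D` with `'valid'` padding and stride 1, and `MaxPooling2D` with a stride equal to the pool size. The helper names are hypothetical.

```python
def conv_output_size(size, kernel=3):
    """Side length after a 'valid' convolution with stride 1."""
    return size - kernel + 1


def pool_output_size(size, pool=2):
    """Side length after non-overlapping max pooling (any remainder is dropped)."""
    return size // pool


side = 150  # IMAGE_SIZE in this notebook
block_filters = [32, 64, 128]  # first_filters, second_filters, third_filters
for filters in block_filters:
    side = conv_output_size(conv_output_size(side))  # two Conv2D layers
    side = pool_output_size(side)                    # one MaxPooling2D layer
    print(f"After block with {filters} filters: {side} x {side} x {filters}")

flattened_length = side * side * block_filters[-1]
print(f"Flatten output length: {flattened_length:,}")
```

Under these assumptions the three blocks reduce the side length from 150 to 73, then 34, then 15, so the `Flatten` layer feeds 15 × 15 × 128 = 28,800 values into the first `Dense` layer.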

In [None]:
# Specify the cnn
# Source: https://www.kaggle.com/fmarazzi/baseline-keras-cnn-roc-fast-5min-0-8253-lb

# CNN structure parameters
kernel_size = (3,3)
pool_size= (2,2)
first_filters = 32
second_filters = 64
third_filters = 128

dropout_conv = 0.3
dropout_dense = 0.3

model = Sequential()
model.add(Conv2D(first_filters, kernel_size, activation = 'relu', 
                 input_shape = (IMAGE_SIZE, IMAGE_SIZE, 3)))
model.add(Conv2D(first_filters, kernel_size, activation = 'relu'))
model.add(MaxPooling2D(pool_size = pool_size)) 
model.add(Dropout(dropout_conv))

model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(Conv2D(second_filters, kernel_size, activation ='relu'))
model.add(MaxPooling2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(Conv2D(third_filters, kernel_size, activation ='relu'))
model.add(MaxPooling2D(pool_size = pool_size))
model.add(Dropout(dropout_conv))

model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(dropout_dense))
model.add(Dense(12, activation = "softmax"))

model.summary()

In [None]:
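# The target is one-hot encoded over 12 classes with a softmax output,
# so use categorical (not binary) cross-entropy
model.compile(Adam(lr=0.0001), loss='categorical_crossentropy', metrics=['accuracy'])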

In [None]:
# Training parameters
batch_size = 10
epochs = 8
filepath = "model.h5"

In [None]:
# Define callbacks to run after specific epochs
checkpoint = ModelCheckpoint(
    filepath, monitor='accuracy', verbose=1, 
    save_best_only=True, mode='max'
)
reduce_lr = ReduceLROnPlateau(
    monitor='accuracy', factor=0.5, patience=1, 
    verbose=1, mode='max', min_lr=0.00001
)
callbacks_list = [checkpoint, reduce_lr]

In [None]:
%%time
# Fit model
# Took me approx: 45 mins

run_this_command = False  # Set to False to avoid inadvertently running this command
history_filename_base = "fitting_history"
if run_this_command:
    history = model.fit(
        x=x_train, y=train_y,
        batch_size=batch_size, 
        validation_split=0.1,  # fraction held out for validation; must be between 0 and 1
        epochs=epochs,
        verbose=2,
        callbacks=callbacks_list
    )
    # Save history (with timestamp in the filename)
    new_filename = f"{history_filename_base}.pkl"  # fallback name if the timestamp fails
    try:  # It is crucial this does not fail, so I also have written a backup
        ts = time.time()
        st = datetime.datetime.fromtimestamp(ts).strftime('%Y%m%d_%H%M%S')
        new_filename = f"{history_filename_base}_{st}.pkl"
    except Exception:
        pass
    with open(new_filename, "wb") as output_file:
        pickle.dump(history, output_file)
else:
    print("Command has *not* been run\n")
    previous_files = list(Path.cwd().glob(f"{history_filename_base}_*.pkl"))
    if len(previous_files) == 0:
        print("No previous files available. History not loaded.\n")
    else:
        with open(sorted(previous_files)[-1], 'rb') as input_file:
            history = pickle.load(input_file)
        print("Most recent history file reloaded.\n")

In [None]:
# View fitting history
acc = history.history['accuracy']
loss = history.history['loss']

# use a new name so the `epochs` training parameter is not overwritten
epoch_range = range(1, len(acc) + 1)

plt.plot(epoch_range, loss, 'bo', label='Training loss')
plt.title('Training loss')
plt.legend()
plt.show()

plt.plot(epoch_range, acc, 'bo', label='Training acc')
plt.title('Training accuracy')
plt.legend()
plt.show()

<p style="text-align: right"><a href="#Contents">Back to Contents</a></p>

## CNN: Assess model

In [None]:
%%time
# calculate predictions on test set
# Took me approx: 20 secs
predictions = model.predict_classes(x_test, verbose=1)

In [None]:
# inspect predictions
predictions[:50]

We see that the model has simply predicted that every observation is in **one** particular class. Something has gone wrong. As this is not crucial to the course, I have not gone back to debug it.
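
One quick way to confirm this kind of collapse is to count how often each class appears among the predictions. The sketch below uses made-up values standing in for `predictions` and the encoded test labels; it is a diagnostic illustration, not a fix.

```python
from collections import Counter

# Made-up values standing in for `predictions` and the encoded test labels
toy_predictions = [3, 3, 3, 3, 3, 3]
toy_true_codes = [3, 0, 3, 5, 1, 3]

prediction_counts = Counter(toy_predictions)
print(f"Distinct predicted classes: {len(prediction_counts)}")

# If only one class is ever predicted, accuracy is just that class's
# share of the true labels
only_class = next(iter(prediction_counts))
toy_accuracy = (
    sum(p == t for p, t in zip(toy_predictions, toy_true_codes)) / len(toy_true_codes)
)
share_of_only_class = toy_true_codes.count(only_class) / len(toy_true_codes)
print(f"Accuracy: {toy_accuracy:.1%}, share of class {only_class}: {share_of_only_class:.1%}")
```

When a single class is predicted throughout, the accuracy figure reported further down says nothing about what the model has learned: it only reflects how common that class is in the test set.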

In [None]:
# get confusion matrix
cnf_matrix = confusion_matrix(encoder.transform(y_test), predictions)

In [None]:
# create a dict to map back the numbers onto the classes
encoding_dict = dict(zip(range(len(encoder.classes_)), encoder.classes_))
encoding_dict

In [None]:
abbreviation_dict = {}
for code, label in encoding_dict.items():
    label_words = re.split(r"[\s-]", label)
    if len(label_words) == 1:
        abbreviation_dict[code] = label_words[0][:2]
    else:
        abbreviation_dict[code] = ''.join([label_word[0].upper() for label_word in label_words])
abbreviation = pd.DataFrame.from_dict(
    abbreviation_dict, columns=['abbrev'], orient='index'
).sort_index()
abbreviation

In [None]:
fig, ax = plt.subplots(1)
ax = sns.heatmap(cnf_matrix, ax=ax, cmap=plt.cm.Greens, annot=True)
ax.set_xticklabels(abbreviation.abbrev)
ax.set_yticklabels(abbreviation.abbrev)
plt.title('Confusion Matrix')
plt.ylabel('True class')
plt.xlabel('Predicted class')
#fig.savefig('Confusion matrix.png', dpi=300)
plt.show()

In [None]:
accuracy_score(encoder.transform(y_test), predictions, normalize=True, sample_weight=None)

In [None]:
print(classification_report(encoder.transform(y_test), predictions))

<p style="text-align: right"><a href="#Contents">Back to Contents</a></p>


In [None]:
os.cpu_count()

In [None]:
len(os.sched_getaffinity(0))