# Modelling and Evaluating

## Objectives

* Answer business requirement 2:

    - The client wants to develop a system capable of identifying whether a leaf shows signs of powdery mildew infection.

## Inputs 

* Image shape embeddings (pickle file).

* Images from the test, train, validation folders and their subfolders.

    inputs
    └── cherryleaves_dataset
        └── cherry-leaves
            ├── test
            │   ├── healthy
            │   └── powdery_mildew
            ├── train
            │   ├── healthy
            │   └── powdery_mildew
            └── validation
                ├── healthy
                └── powdery_mildew

## Outputs

* Image distribution plots for the train, validation, and test sets.

    - Label distribution - bar chart.
    - Set distribution - pie chart.

* Image augmentation.

    - plot augmented images for each set.

* Class indices to map prediction output back to labels.

* Creation of a machine learning model and display of its summary.

* Model training.

* Save model.

* Learning curve plot for model performance.

    - Model A - separate plots for accuracy and loss.
    - Model B - comprehensive model history plot.
    - Model C - plot model history with plotly.

* Model evaluation on saved file.

    - Calculate accuracy.
    - Plot ROC curve.
    - Calculate classification report (Model A)
        - Model B - classification report with macro avg and weighted avg
        - Model C - synthetic classification report per label

* Plot Confusion Matrix

* Save evaluation pickle file

* Prediction on the random image file.

## Comments | Insights | Conclusions

* The same data was plotted in several versions to accommodate possible client requests for further insight into the data.

* The CNN was built to maximise accuracy while minimising loss and training time.

* The CNN was kept as small as possible without compromising accuracy, while avoiding overfitting.

* Hyperparameter optimisation and the trial-and-error phase are documented in more detail in the readme.md file and in a separate .pdf file.

---

# Import Packages

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns 
from matplotlib.image import imread

## Set working directory

In [2]:
cwd = os.getcwd()

In [3]:
os.chdir('/workspaces/Mildew-detection-in-cherry-leaves')
print("You set a new current directory")

You set a new current directory


In [4]:
work_dir = os.getcwd()
work_dir

'/workspaces/Mildew-detection-in-cherry-leaves'

## Set input directory

* Set train path
* Set validation path
* Set test path

In [5]:
my_data_dir = 'inputs/cherryleaves_dataset/cherry-leaves'
train_path = my_data_dir + '/train' 
val_path = my_data_dir + '/validation'
test_path = my_data_dir + '/test'

## Set output directory

In [6]:
version = 'v1'
file_path = f'outputs/{version}'

if 'outputs' in os.listdir(work_dir) and version in os.listdir(work_dir + '/outputs'):
    print('Old version is already available; create a new version.')
else:
    os.makedirs(name=file_path)

Old version is already available; create a new version.
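
The check above can also be expressed with `os.makedirs(..., exist_ok=True)`, which creates the folder when it is missing and does nothing when it is already there. The following is a minimal sketch, assuming the same `outputs/{version}` layout; the helper name `ensure_output_dir` and its `base_dir` parameter are hypothetical and not part of the notebook.

```python
import os


def ensure_output_dir(version, base_dir="outputs"):
    """Create base_dir/version if it is missing.

    Returns the folder path and a flag that is True only when the
    folder was newly created by this call.
    (Hypothetical helper, not part of the original notebook.)
    """
    path = os.path.join(base_dir, version)
    already_there = os.path.isdir(path)
    # exist_ok=True avoids a FileExistsError when the folder is already present
    os.makedirs(path, exist_ok=True)
    return path, not already_there
```

A call such as `file_path, created = ensure_output_dir('v1')` would then give the same `file_path` as the cell above, with `created` indicating whether an older version was already on disk.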


## Set Label names

In [7]:
labels = os.listdir(train_path)
print('Labels for the images are', labels)

Labels for the images are ['powdery_mildew', 'healthy']
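
`os.listdir` returns entries in arbitrary order and also lists any stray files in the folder, so the label order shown above can differ between machines. A minimal sketch of a more deterministic variant follows; the helper name `list_labels` is hypothetical and not part of the notebook.

```python
import os


def list_labels(split_dir):
    """Return the class sub-folder names of split_dir in alphabetical order.

    (Hypothetical helper, not part of the original notebook.)
    """
    return sorted(
        name for name in os.listdir(split_dir)
        # keep sub-folders only, so stray files such as .DS_Store are ignored
        if os.path.isdir(os.path.join(split_dir, name))
    )
```

With the folder layout listed under Inputs, `list_labels(train_path)` would be expected to return `['healthy', 'powdery_mildew']`. If a later step relies on Keras `flow_from_directory`, which orders classes alphanumerically, a sorted list would line up with those class indices; that the notebook does so is an assumption here.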


## Set image shape

In [8]:
import joblib
version = 'v1'
image_shape = joblib.load(filename=f"outputs/{version}/image_shape.pkl")
image_shape

(256, 256, 3)

---

# Images distribution

## Count number of images per set and label

In [9]:
import plotly.express as px

rows = []
for folder in ['train', 'test', 'validation']:
    for label in labels:
        count = len(os.listdir(os.path.join(my_data_dir, folder, label)))
        rows.append({
            'Set': folder,
            'Label': label,
            'Count': count
        })
        print(f"* {folder} - {label}: {count} images")

df_freq = pd.DataFrame(rows)
print("\n")

* train - powdery_mildew: 1472 images
* train - healthy: 1472 images
* test - powdery_mildew: 422 images
* test - healthy: 422 images
* validation - powdery_mildew: 210 images
* validation - healthy: 210 images
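
The counts above can be turned into the share of the dataset held by each set, which is what a set distribution pie chart would display. The sketch below uses plain Python with the numbers copied from the output, so it runs without the dataset; the variable names `counts`, `set_totals` and `set_share` are illustrative only.

```python
# Counts copied from the output above
counts = {
    ("train", "powdery_mildew"): 1472,
    ("train", "healthy"): 1472,
    ("test", "powdery_mildew"): 422,
    ("test", "healthy"): 422,
    ("validation", "powdery_mildew"): 210,
    ("validation", "healthy"): 210,
}

total = sum(counts.values())

# Aggregate the per-label counts into per-set totals
set_totals = {}
for (set_name, _label), n in counts.items():
    set_totals[set_name] = set_totals.get(set_name, 0) + n

# Share of the whole dataset held by each set, as a percentage
set_share = {name: round(100 * n / total, 1) for name, n in set_totals.items()}

for name, share in set_share.items():
    print(f"* {name}: {set_totals[name]} images ({share}%)")
```

With 4208 images in total, this works out to roughly a 70/20/10 split across train, test and validation, and both labels are equally represented within every set.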


