# Data Visualization Notebook

## Objectives

* Answer Business Requirement 1:
    - The client is interested in carrying out a study that focuses on visually distinguishing between a healthy cherry leaf and one that shows signs of powdery mildew.

## Inputs

Images are taken from the train, validation, and test folders and their subfolders.

    ├── inputs
    │   └──cherryleaves_dataset 
    │      └──cherry-leaves
    │           │
    │           ├── test
    │           │   ├── healthy
    │           │   └── powdery_mildew
    │           │
    │           ├── train
    │           │   ├── healthy
    │           │   └── powdery_mildew
    │           │
    │           └── validation
    │               ├── healthy
    │               └── powdery_mildew
    └──

## Outputs

* Compute the average image size from the training set.
    - Since the CNN will be trained on the training set, all input images must be standardized to the same size. The chosen input size fixes the input shape of the CNN.

* Leverage the saved image shape embeddings (pickle file).
    - Use these embeddings to analyze image dimensions and ensure consistency.

* Visualize image characteristics by label.
    - Plot the mean and variability of image sizes for each class label.

* Highlight class-specific differences.
    - Create plots that emphasize the contrast between healthy cherry leaves and those infected with mildew.

* Address Business Requirement 1.
    - Develop code that not only fulfills this requirement but can also be adapted to generate an image montage for display in the Streamlit dashboard.
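
The first output above, averaging image dimensions, can be sketched as follows. This is a minimal illustration only: it assumes the image shapes have already been collected as (height, width) tuples, and both the helper name `mean_image_shape` and the sample values are hypothetical rather than taken from this notebook.

```python
import numpy as np


def mean_image_shape(shapes):
    """Return the mean (height, width) of a list of image shapes as ints."""
    # Hypothetical helper: each entry is a (height, width) pair,
    # for example imread(path).shape[:2] for one training image.
    arr = np.asarray(shapes, dtype=float)
    if arr.ndim != 2 or arr.shape[1] != 2:
        raise ValueError("expected a non-empty list of (height, width) pairs")
    mean_h, mean_w = arr.mean(axis=0)
    return int(round(mean_h)), int(round(mean_w))


# Made-up shapes, for illustration only
sample_shapes = [(256, 256), (250, 260), (262, 252)]
print(mean_image_shape(sample_shapes))
```

The resulting pair would typically be combined with the channel count to form the shape every image is resized to before training.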

## Comments | Insights | Conclusions

These steps are necessary to understand and prepare the data that will be fed into the CNN. Additionally, the data has been visually arranged to meet the client’s request (Business Requirement 1).

---

# Import libraries

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from matplotlib.image import imread

# Use a plain white background for all seaborn plots
sns.set_style('white')

# Set working directory

In [2]:
cwd = os.getcwd()

In [4]:
os.chdir('/workspaces/Cherry-Powdery-Mildew-Detector')
print("You set a new current directory")

You set a new current directory


In [5]:
work_dir = os.getcwd()
work_dir

'/workspaces/Cherry-Powdery-Mildew-Detector'

# Set input directories

Set train, validation and test paths

In [6]:
my_data_dir = 'inputs/cherryleaves_dataset/cherry-leaves'
train_path = my_data_dir + '/train' 
val_path = my_data_dir + '/validation'
test_path = my_data_dir + '/test'

# Set output directory

In [7]:
version = 'v1'
file_path = f'outputs/{version}'

if 'outputs' in os.listdir(work_dir) and version in os.listdir(work_dir + '/outputs'):
    print('Old version is already available. Create a new version.')
else:
    os.makedirs(name=file_path)
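
The version check in the cell above can also be wrapped in a small function so that re-running it never raises. This is a sketch under stated assumptions: `ensure_output_dir` is a hypothetical helper that is not part of this notebook, and returning a flag instead of printing a message is a choice made here purely for illustration.

```python
import os


def ensure_output_dir(base_dir, version):
    """Create <base_dir>/outputs/<version> if needed; return (path, created)."""
    path = os.path.join(base_dir, "outputs", version)
    already_there = os.path.isdir(path)
    # exist_ok=True keeps the call safe when the folder already exists
    os.makedirs(path, exist_ok=True)
    return path, not already_there
```

Calling `ensure_output_dir(work_dir, version)` would return the output path together with `True` on the first run and `False` on later runs, which the caller can use to decide whether to bump the version.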

# Set label names

In [8]:
labels = os.listdir(train_path)
print('Labels for the images are', labels)

Labels for the images are ['powdery_mildew', 'healthy']
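
`os.listdir` returns entries in arbitrary order and also includes any stray files in the folder (the output above lists `powdery_mildew` before `healthy`). If a stable label order matters later, one option is the sketch below. `get_labels` is a hypothetical helper name, and sorting alphabetically is an assumption made here, not something this notebook requires.

```python
import os


def get_labels(path):
    """Return the subfolder names of `path` in alphabetical order, ignoring files."""
    return sorted(
        entry for entry in os.listdir(path)
        if os.path.isdir(os.path.join(path, entry))
    )
```

With the folder layout shown under Inputs, `get_labels(train_path)` would be expected to yield the same two labels on every machine, in the same order.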
