# **Data Visualisation**
---

## Objectives

* Answer Business Requirement 1:
  * The client is interested in conducting a study to visually differentiate a healthy cherry leaf from one with powdery mildew.

## Inputs

* inputs/cherry_leaves_dataset/cherry-leaves/train
* inputs/cherry_leaves_dataset/cherry-leaves/validation
* inputs/cherry_leaves_dataset/cherry-leaves/test

## Outputs

* Image shape embeddings pickle file
* Plot of the mean and variability of images per label
* Plot showing the contrast between healthy and powdery mildew-infected cherry leaves
* Code that answers Business Requirement 1 and can be used to build an image gallery on the Streamlit dashboard

## Additional Comments

* No comments
---

## **Set Data Directory**
---
### Import Libraries

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from matplotlib.image import imread

sns.set_style("white")

### Set working directory

In [None]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves-p5/jupyter_notebooks'

In [4]:
os.chdir('/workspaces/mildew-detection-in-cherry-leaves-p5')
print('You set a new current directory')

You set a new current directory


In [6]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves-p5'
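The absolute path passed to `os.chdir` above only works inside this particular workspace. As a minimal sketch, assuming the notebook lives in a `jupyter_notebooks` folder directly under the project root, the root can be derived from the current directory instead. The helper name `project_root` is hypothetical and not part of the original notebook.

```python
import os


def project_root(notebook_dir):
    """Return the parent of the notebooks folder.

    Assumes the layout <project root>/jupyter_notebooks.
    """
    # normpath drops any trailing separator so dirname returns the parent
    return os.path.dirname(os.path.normpath(notebook_dir))


# Usage inside the notebook (run once, before any relative paths are used):
# os.chdir(project_root(os.getcwd()))
```

Running the `os.chdir` call a second time would move one level further up, so it is safest to run it only once per kernel session.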

### Set input directories

In [7]:
my_data_dir = 'inputs/cherry_leaves_dataset/cherry-leaves'
train_path = my_data_dir + '/train'
val_path = my_data_dir + '/validation'
test_path = my_data_dir + '/test'

### Set output directory

In [8]:
version = 'v1'
file_path = f'outputs/{version}'

# Check whether the specified version already exists in the outputs folder
if 'outputs' in os.listdir(current_dir) and version in os.listdir(current_dir + '/outputs'):
    print('Old version is already available. Create a new version.')
else:
    os.makedirs(name=file_path)
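The same check can be written without listing directories by hand. The sketch below is one possible alternative, not the notebook's own method: it relies on `os.makedirs(..., exist_ok=True)` and reports whether the folder was already there. The function name `ensure_output_dir` and its `base` parameter are assumptions introduced for illustration.

```python
import os


def ensure_output_dir(version, base='outputs'):
    """Create base/version if it is missing.

    Returns (path, already_existed) so the caller can warn about
    overwriting an existing version.
    """
    path = os.path.join(base, version)
    already_existed = os.path.isdir(path)
    # exist_ok=True makes the call safe to repeat
    os.makedirs(path, exist_ok=True)
    return path, already_existed


# Usage inside the notebook:
# file_path, existed = ensure_output_dir('v1')
# if existed:
#     print('Old version is already available. Create a new version.')
```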

### Set the label names

In [9]:
labels = os.listdir(train_path)
print('Labels for the images are', labels)

Labels for the images are ['healthy', 'powdery_mildew']
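`os.listdir` returns entries in arbitrary order and also includes stray files or hidden folders (for example `.ipynb_checkpoints`) if any are present in the train folder. A minimal sketch of a more defensive label lookup follows, assuming every label is a non-hidden sub-folder of the split directory. The helper name `list_labels` is hypothetical.

```python
import os


def list_labels(split_path):
    """Return the sorted names of the non-hidden sub-folders of split_path."""
    return sorted(
        name
        for name in os.listdir(split_path)
        if os.path.isdir(os.path.join(split_path, name))
        and not name.startswith('.')
    )


# Usage inside the notebook:
# labels = list_labels(train_path)
```

Sorting keeps the label order stable between runs, which helps when the same list is later used to arrange plots.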


## **Data Visualisation of Image Data**
---

### Image Shape

Calculate the mean dimensions (height and width) of the images in the train set.

In [None]:
# Lists that store the height and width of each image
dim1, dim2 = [], []

# Loop through each label (healthy, powdery_mildew)
for label in labels:
    # Loop through each image in the label's folder
    for image_filename in os.listdir(os.path.join(train_path, label)):
        img = imread(os.path.join(train_path, label, image_filename))

        # Gather the dimensions of each image (height, width, colour channels)
        d1, d2, colors = img.shape

        dim1.append(d1)  # image height
        dim2.append(d2)  # image width