# Data Visualization

## Objectives

This notebook addresses **Business Requirement 1**:  
*Provide visual differentiation between healthy and mildew-affected cherry leaves.*

---

## Inputs (datasets)

- `inputs/datasets/raw/cherry-leaves/train`
- `inputs/datasets/raw/cherry-leaves/validation`
- `inputs/datasets/raw/cherry-leaves/test`

---

## Outputs

### Dataset Overview

First, we build a statistical overview of the data. This includes:
- the total image count and class distribution (healthy or infected)
- a montage of random sample images to give a visual impression of the dataset quality

### Image Statistics & Analysis
- Average image dimensions for preprocessing standardization
- Statistical comparison of grayscale intensities between the healthy and infected classes (tentatively via a t-test)
- Visual analysis of class separation using Principal Component Analysis (PCA)
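
The grayscale comparison mentioned in the list above could be sketched as follows. This is a minimal illustration that computes only Welch's t statistic using the standard library; the function name `welch_t_statistic` and the sample values are made up for the example and are not taken from this project. In practice, `scipy.stats.ttest_ind` with `equal_var=False` returns the same statistic together with a p-value.

```python
import math
import statistics


def welch_t_statistic(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    # statistics.variance returns the sample variance (n - 1 in the denominator)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Made-up per-image mean grayscale values, for illustration only
healthy_means = [0.52, 0.55, 0.50, 0.53, 0.54]
infected_means = [0.60, 0.63, 0.59, 0.62, 0.61]

t_stat = welch_t_statistic(healthy_means, infected_means)
print(f"Welch t statistic: {t_stat:.3f}")
```

A large absolute value of the statistic suggests that the mean grayscale intensity differs between the two classes; the sign only reflects the order in which the samples are passed.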

### Generated data
- A `.pkl` file storing the averaged image dimensions, used to standardize input sizes during preprocessing.
- Visualizations: Key plots and graphics saved for later use in the Streamlit dashboard to support interactive data exploration.
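
As a rough sketch of how the averaged image dimensions might be written to and read back from a `.pkl` file, the example below uses Python's built-in `pickle` module. The file name `image_shape.pkl`, the helper names, and the shape value are assumptions made for illustration; the notebook itself may use a different library or layout.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder shape (height, width, channels); the real value comes from the analysis
avg_image_shape = (256, 256, 3)


def save_image_shape(shape, path):
    """Pickle the shape tuple, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as file:
        pickle.dump(shape, file)


def load_image_shape(path):
    """Read the shape tuple back from the pickle file."""
    with Path(path).open("rb") as file:
        return pickle.load(file)


# Round trip inside a temporary folder so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "outputs" / "v1" / "image_shape.pkl"
    save_image_shape(avg_image_shape, pkl_path)
    restored_shape = load_image_shape(pkl_path)

print(restored_shape)
```

Storing the shape once means later preprocessing steps can load the same dimensions instead of recomputing them.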

## Import notebook packages

In [1]:
import os
import sys
import numpy as np

# Add the resource directory to the path so the relevant functions can be imported
sys.path.append('./src')

## Set working directory and file path architecture for notebook
As the notebooks live in a subfolder of this repo, we need to adjust the working directory so files can be accessed properly.

First we check our current working directory.

In [2]:
current_dir = os.getcwd()
current_dir

'e:\\Projects\\Code-I\\vscode-projects\\PP5-predictive_analysis\\jupyter_notebooks'

Now we can change the directory to the parent folder that contains the complete repo. We will also print our new working directory so we can check everything worked out as planned.

In [3]:
# Only change the directory if not already at the repo root
current_dir = os.getcwd()
target_dir = os.path.abspath(os.path.join(current_dir, os.pardir))  # One level up

# Move up one level only if we are still inside the notebooks subfolder
if os.path.basename(current_dir) == 'jupyter_notebooks':
    os.chdir(target_dir)
    current_dir = os.getcwd()
    print(f"Working directory set to: {os.getcwd()}")
else:
    print(f"Current working directory remains: {current_dir}")

Working directory set to: e:\Projects\Code-I\vscode-projects\PP5-predictive_analysis
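
The cell above relies on the notebook folder being named `jupyter_notebooks`. A possible alternative, shown here only as a sketch, is to walk upward until a known marker folder is found. The helper `find_repo_root` and the choice of marker are hypothetical and not part of this project's code.

```python
import os
import tempfile


def find_repo_root(start_dir, marker="jupyter_notebooks"):
    """Walk upward from start_dir until a folder containing `marker` is found.

    Returns the matching directory, or None if the filesystem root is reached.
    """
    current = os.path.abspath(start_dir)
    while True:
        if os.path.isdir(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


# Demo on a throwaway folder tree: <tmp>/repo/jupyter_notebooks
with tempfile.TemporaryDirectory() as tmp:
    expected_root = os.path.abspath(os.path.join(tmp, "repo"))
    notebooks_dir = os.path.join(expected_root, "jupyter_notebooks")
    os.makedirs(notebooks_dir)
    found_root = find_repo_root(notebooks_dir)

print(found_root == expected_root)
```

With such a helper, `os.chdir(find_repo_root(os.getcwd()))` would behave the same whether the cell is run from the notebook folder or from the repo root.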


## Define data directory

In [4]:
# define variable for data directory
data_dir = os.path.join(os.getcwd(), 'inputs/datasets/raw/cherry-leaves')
# define variable for train set directory
train_dir = os.path.join(data_dir, 'train')
# define variable for test set directory
test_dir = os.path.join(data_dir, 'test')
# define variable for validation set directory
val_dir = os.path.join(data_dir, 'validation')

## Check directory integrity

In [5]:
from exploration_visualization import check_structure
# Define the sets and labels for the dataset structure checkup
sets = ['train', 'test', 'validation']
labels = ['healthy', 'diseased']
# Check the integrity of the dataset structure
check_structure(data_dir, sets, labels)

'healthy' in 'train' is valid with 1472 images.
'diseased' in 'train' is valid with 1472 images.
'healthy' in 'test' is valid with 317 images.
'diseased' in 'test' is valid with 317 images.
'healthy' in 'validation' is valid with 315 images.
'diseased' in 'validation' is valid with 315 images.
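
The implementation of `check_structure` lives in `src/exploration_visualization` and is not shown in this notebook. The sketch below is only a guess at what such a check could look like, written to mimic the messages printed above; the function name `check_structure_sketch`, the list of accepted file extensions, and the returned dictionary are assumptions.

```python
import os
import tempfile

# Assumed set of image extensions; the real helper may accept others
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def check_structure_sketch(data_dir, sets, labels):
    """Count images in every <set>/<label> folder and report missing folders.

    Returns a dict mapping (set_name, label) to the image count,
    or to None when the folder does not exist.
    """
    counts = {}
    for set_name in sets:
        for label in labels:
            folder = os.path.join(data_dir, set_name, label)
            if not os.path.isdir(folder):
                print(f"'{label}' in '{set_name}' is missing.")
                counts[(set_name, label)] = None
                continue
            n_images = sum(
                1 for name in os.listdir(folder)
                if name.lower().endswith(IMAGE_EXTENSIONS)
            )
            counts[(set_name, label)] = n_images
            print(f"'{label}' in '{set_name}' is valid with {n_images} images.")
    return counts


# Demo on a throwaway folder tree with one image and no validation set
with tempfile.TemporaryDirectory() as tmp:
    for set_name in ("train", "test"):
        for label in ("healthy", "diseased"):
            os.makedirs(os.path.join(tmp, set_name, label))
    open(os.path.join(tmp, "train", "healthy", "leaf_0.jpg"), "w").close()
    demo_counts = check_structure_sketch(
        tmp, ["train", "test", "validation"], ["healthy", "diseased"]
    )
```

Returning the counts in addition to printing them makes it easy to verify afterwards that the classes are balanced within each set.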


## Define the output directory

Outputs are written to a versioned folder so that results from earlier runs are not overwritten.

In [6]:
# Set the current version
version = 'v1'
# Define the output directory
output_dir = os.path.join(current_dir, 'outputs', version)

# Check if the versioned output directory already exists
if os.path.exists(output_dir):
    print(f"Output directory '{output_dir}' already exists. "
          "Please create a new version.")
else:
    os.makedirs(output_dir)
    print(f"Created output directory: '{output_dir}'")

Created output directory: 'e:\Projects\Code-I\vscode-projects\PP5-predictive_analysis\outputs\v1'
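
When the versioned folder already exists, the cell above asks for a new version to be set by hand. As an optional convenience, the next free version label could be derived from the folders already present. The helper `next_version` below is hypothetical and assumes version folders are named `v1`, `v2`, and so on.

```python
import os
import re
import tempfile


def next_version(outputs_root):
    """Return the next unused 'vN' label under outputs_root."""
    if not os.path.isdir(outputs_root):
        return "v1"
    highest = 0
    for name in os.listdir(outputs_root):
        match = re.fullmatch(r"v(\d+)", name)
        # Only count real folders whose name is exactly 'v' plus digits
        if match and os.path.isdir(os.path.join(outputs_root, name)):
            highest = max(highest, int(match.group(1)))
    return f"v{highest + 1}"


# Demo on a throwaway outputs folder that already holds v1 and v2
with tempfile.TemporaryDirectory() as tmp:
    label_when_missing = next_version(os.path.join(tmp, "outputs"))
    for existing in ("v1", "v2", "notes"):
        os.makedirs(os.path.join(tmp, "outputs", existing))
    label_after_two = next_version(os.path.join(tmp, "outputs"))

print(label_when_missing, label_after_two)
```

Setting the version explicitly, as the notebook does, keeps runs reproducible; a helper like this only removes the manual step of picking the next number.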
