# **DATA VISUALIZATION**

## Objectives

* Conduct a study to visually differentiate healthy cherry leaves from infected ones

## Inputs

* Healthy and infected cherry leaves images
  - inputs/dataset/cherry-leaves/train
  - inputs/dataset/cherry-leaves/validation
  - inputs/dataset/cherry-leaves/test

## Outputs

* Average, variability and difference images
  - outputs/images/avg_varia_healthy.png
  - outputs/images/avg_varia_powdery_mildew.png
  - outputs/images/difference_image.png

* Training dataset as numpy arrays
  - outputs/dataset/train_X.npy
  - outputs/dataset/train_y.npy

---

# Change working directory

In [None]:
import os

os.chdir("./..")  # change to parent directory
working_dir = os.getcwd()
working_dir  # check output for correct directory

# Set dataset and output directory paths

In [None]:
data_dir = working_dir + "/inputs/dataset/cherry-leaves"
train_dir = data_dir + "/train"
validation_dir = data_dir + "/validation"
test_dir = data_dir + "/test"
output_images = working_dir + "/outputs/images"
output_dataset = working_dir + "/outputs/dataset"
# exist_ok=True also creates a missing "outputs" folder and skips folders that already exist
os.makedirs(output_images, exist_ok=True)
os.makedirs(output_dataset, exist_ok=True)

---

# Load images as array data

We already know that all images in the dataset are 200x200 pixels in RGB format, giving each the shape (200, 200, 3). For this project they are resized to 75x75 pixels.

Create functions to load training images as array data

In [None]:
from keras.preprocessing import image
import numpy as np
from matplotlib import pyplot as plt


def load_resize_image_as_array(img_path, width, height):
    # Keras expects target_size as (height, width)
    img = image.load_img(img_path, target_size=(height, width))
    # Scale pixel values from [0, 255] to [0, 1]
    return image.img_to_array(img) / 255


# Uncomment the commented lines to limit image loading to 50 images per category

def images_to_array(directory, width, height):
    # Collect in lists and convert once; np.append copies the whole array on every call
    images = []
    labels = []

    for category in os.listdir(directory):
        # max_images = 0
        for img in os.listdir(directory + "/" + category):
            # if max_images == 50:
            #     break
            # max_images += 1
            images.append(
                load_resize_image_as_array(
                    directory + "/" + category + "/" + img, width, height
                )
            )
            labels.append(category)

    # X has shape (n_images, height, width, 3); y holds one category name per image
    return np.array(images), np.array(labels)

The training set is used to generate the average images and the image variability plots.

In [None]:
X, y = images_to_array(train_dir, 75, 75)
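Before going further it can help to confirm that the arrays have the expected shape and to see how many images each category contributes. The following is a minimal sketch on synthetic stand-in data; `summarize_dataset` is a hypothetical helper, and the category names are assumed from the output file names listed above.

```python
import numpy as np


def summarize_dataset(X, y):
    """Return the image array shape and the number of images per label."""
    labels, counts = np.unique(y, return_counts=True)
    return X.shape, dict(zip(labels.tolist(), counts.tolist()))


# Synthetic stand-in for the real training arrays (not project data)
X_demo = np.zeros((5, 75, 75, 3))
y_demo = np.array(
    ["healthy", "healthy", "powdery_mildew", "healthy", "powdery_mildew"]
)

shape, counts = summarize_dataset(X_demo, y_demo)
print(shape, counts)
```

Calling `summarize_dataset(X, y)` on the real arrays would report the same two pieces of information for the training set.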

We also save the training image data, now held in NumPy arrays, so it can be reused later for machine learning.

In [None]:
np.save(f"{output_dataset}/train_X.npy", X)
np.save(f"{output_dataset}/train_y.npy", y)
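A later notebook can read the saved files back with `np.load`. The sketch below shows the round trip on small synthetic arrays written to a temporary folder, so it does not touch the project's `outputs` directory; the array contents are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-ins for the training arrays (not project data)
X_demo = np.random.default_rng(0).random((4, 75, 75, 3))
y_demo = np.array(["healthy", "healthy", "powdery_mildew", "powdery_mildew"])

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "train_X.npy")
    y_path = os.path.join(tmp_dir, "train_y.npy")

    np.save(x_path, X_demo)
    np.save(y_path, y_demo)

    # np.load restores both the values and the original dtype
    X_loaded = np.load(x_path)
    y_loaded = np.load(y_path)

print(X_loaded.shape, y_loaded.dtype)
```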

# Average image and image variability for healthy and infected leaves

Create a function that returns the average image and the variability image (per-pixel standard deviation) for each category

In [None]:
def get_average_and_variability_images():
    results_arr = []

    for category in np.unique(y):
        category_arr = X[np.where(y == category)]
        average_image = np.mean(category_arr, axis=0)
        variability_image = np.std(category_arr, axis=0)
        results_arr.append([average_image, variability_image, category])

    return results_arr

Store the images and category names in a list

In [None]:
results = get_average_and_variability_images()
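To make the `axis=0` reduction concrete, here is a small sketch on three hypothetical 2x2 single-channel "images". Each output pixel is the mean, or standard deviation, of that same pixel position across all images in the stack.

```python
import numpy as np

# Three made-up 2x2 images stacked along axis 0 (not project data)
stack = np.array([
    [[0.0, 0.5], [1.0, 0.2]],
    [[0.0, 0.5], [0.0, 0.4]],
    [[0.0, 0.5], [0.5, 0.6]],
])

average_image = np.mean(stack, axis=0)      # per-pixel mean across images
variability_image = np.std(stack, axis=0)   # per-pixel standard deviation

print(average_image)
print(variability_image)
```

Pixels that are identical in every image (the top row here) get zero variability, while pixels that change from image to image get a larger value.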

Create a display function for the average and variability images, with an option to save

In [None]:
def display_average_and_variability_images(results, save=False):
    for avg_img, varia_img, category in results:
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 10))

        axes[0].set_title(f"{category} leaf average image")
        axes[0].imshow(avg_img)

        axes[1].set_title(f"{category} leaf variability image")
        axes[1].imshow(varia_img)

        plt.tight_layout()

        if save:
            plt.savefig(output_images + "/avg_varia_" + category + ".png", bbox_inches="tight", pad_inches=0.3)

        # Show inside the loop so the figure for every category is displayed
        plt.show()

Display average and variability images

In [None]:
display_average_and_variability_images(results, save=True)

The average image for powdery mildew leaves is clearly brighter than its healthy counterpart because of the mildew on the leaf surface. For the same reason, the variability image for powdery mildew leaves shows much higher variance towards the center of the image.
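The "higher variance towards the center" observation could also be checked numerically by comparing the mean of a variability image inside a central window with the mean of the surrounding border. This is only a sketch under stated assumptions: `center_and_border_means` is a hypothetical helper, the margin of 25 pixels is arbitrary, and the input is a synthetic image rather than a project result.

```python
import numpy as np


def center_and_border_means(img, margin):
    """Return the mean pixel value of the central region and of the border."""
    center = img[margin:-margin, margin:-margin]
    border_sum = img.sum() - center.sum()
    border_count = img.size - center.size
    return float(center.mean()), float(border_sum / border_count)


# Synthetic variability image (not project data): higher values in the middle
demo_variability = np.full((75, 75, 3), 0.05)
demo_variability[25:50, 25:50] = 0.30

center_mean, border_mean = center_and_border_means(demo_variability, margin=25)
print(center_mean, border_mean)
```

Applied to `results[i][1]` for each category, the same comparison would put a number on what the plots show.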

Let's visualize the difference between the average images

In [None]:
def display_average_images_difference(average_healthy, average_mildew, save=False):
    fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15,10))

    axes[0].set_title("Healthy leaf average image")
    axes[0].imshow(average_healthy)

    axes[1].set_title("Powdery mildew leaf average image")
    axes[1].imshow(average_mildew)

    axes[2].set_title("Difference image")
    axes[2].imshow(np.abs(average_healthy - average_mildew))

    if save:
        plt.savefig(output_images + "/difference_image.png", bbox_inches="tight", pad_inches=0.3)

    plt.show()


In [None]:
# np.unique returns labels in sorted order, so results[0] is the alphabetically first category
display_average_images_difference(results[0][0], results[1][0], save=True)

The difference image indicates that the leaf surface changes when infected.
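The absolute difference shows where the two averages differ but not which one is brighter. As a complementary sketch, a signed difference averaged over the color channels keeps that direction. The toy arrays below are hypothetical values, not project data.

```python
import numpy as np

# Toy 1x2 RGB "average images" (made-up values, not project data)
average_healthy = np.array([[[0.2, 0.5, 0.2], [0.3, 0.6, 0.3]]])
average_mildew = np.array([[[0.4, 0.6, 0.4], [0.3, 0.5, 0.3]]])

# Same operation as in the difference plot: magnitude only, per channel
absolute_difference = np.abs(average_healthy - average_mildew)

# Signed, channel-averaged difference: positive where the mildew average is brighter
signed_difference = (average_mildew - average_healthy).mean(axis=-1)

print(absolute_difference.shape, signed_difference)
```

Here the first pixel comes out positive (mildew brighter) and the second negative (healthy brighter), a distinction the absolute difference alone would hide.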

---

# Conclusion and next steps

The difference between the average images of healthy and infected leaves is clearly visible. This gives grounds for optimism about designing a model that can classify infected leaves.