# **Data preparation**

## Objectives

* Resize the validation and test datasets and save them as numpy arrays (the train dataset has already been prepared the same way)

## Inputs

* Validation and test dataset raw data
  - inputs/dataset/cherry-leaves/validation
  - inputs/dataset/cherry-leaves/test

## Outputs

* Validation dataset as numpy arrays
  - outputs/dataset/validation_X.npy
  - outputs/dataset/validation_y.npy

* Test dataset as numpy arrays
  - outputs/dataset/test_X.npy
  - outputs/dataset/test_y.npy 

## Additional Comments

* These steps save time across work sessions: the lengthy image loading and resizing runs once, and later sessions load the saved arrays instead of repeating it



---

# Change working directory

In [None]:
import os

os.chdir("./..")  # change to parent directory
working_dir = os.getcwd()
working_dir  # check output for correct directory

---

# Set directory paths

In [None]:
data_dir = os.path.join(working_dir, "inputs", "dataset", "cherry-leaves")
validation_dir = os.path.join(data_dir, "validation")
test_dir = os.path.join(data_dir, "test")
output_dataset = os.path.join(working_dir, "outputs", "dataset")

---

# Load images as array data

Reuse the function created in the previous notebook to resize the images and load them into numpy arrays

In [None]:
from keras.preprocessing import image
import numpy as np


def load_resize_image_as_array(img_path, width, height):
    # Keras expects target_size as (height, width)
    img = image.load_img(img_path, target_size=(height, width))
    # Scale pixel values from [0, 255] to [0, 1]
    return image.img_to_array(img) / 255


# Uncomment the max_images lines to limit image loading to 50 images per category

def images_to_array(directory, width, height):
    images = []
    labels = []

    for category in os.listdir(directory):
        category_dir = os.path.join(directory, category)
        # max_images = 0
        for img_name in os.listdir(category_dir):
            # if max_images == 50:
            #     break
            # max_images += 1
            images.append(
                load_resize_image_as_array(
                    os.path.join(category_dir, img_name), width, height
                )
            )
            labels.append(category)

    # Build each array once; np.append inside the loop copies the whole array per image
    X = np.array(images).reshape(-1, height, width, 3)
    y = np.array(labels)

    return X, y

The train set is already prepared and saved; repeat the same process for the validation and test sets

In [None]:
validation_X, validation_y = images_to_array(validation_dir, 75, 75)
test_X, test_y = images_to_array(test_dir, 75, 75)
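
Before saving, a quick sanity check on shapes, value range and class balance can catch loading problems early. The sketch below uses a hypothetical helper, `summarise_dataset` (not part of the original notebook), and small synthetic arrays standing in for `validation_X` and `validation_y`; the label names are illustrative assumptions.

```python
import numpy as np


def summarise_dataset(X, y):
    """Return basic facts about an image array X and its label array y."""
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of samples")
    labels, counts = np.unique(y, return_counts=True)
    return {
        "shape": X.shape,
        "min": float(X.min()),
        "max": float(X.max()),
        "class_counts": dict(zip(labels.tolist(), counts.tolist())),
    }


# Synthetic stand-in data; in the notebook, pass validation_X and validation_y
rng = np.random.default_rng(0)
demo_X = rng.random((6, 75, 75, 3))
demo_y = np.array(["healthy"] * 4 + ["powdery_mildew"] * 2)

summary = summarise_dataset(demo_X, demo_y)
print(summary)
```

After rescaling by 255, the minimum and maximum should fall within 0 and 1, and the shape should end in `(75, 75, 3)` for the sizes used here.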

In [None]:
# Create the output folder if it does not exist yet
os.makedirs(output_dataset, exist_ok=True)

np.save(
    f"{output_dataset}/validation_X.npy",
    validation_X
)

np.save(
    f"{output_dataset}/validation_y.npy",
    validation_y
)

np.save(
    f"{output_dataset}/test_X.npy",
    test_X
)

np.save(
    f"{output_dataset}/test_y.npy",
    test_y
)
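
In a later session the arrays can be read back with `np.load` instead of repeating the image processing. The sketch below demonstrates the round trip in a temporary directory with small synthetic arrays, so it runs without the real dataset; in the notebook, `output_dataset` would take the place of the temporary directory, and the label names are illustrative assumptions.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-ins for validation_X and validation_y
demo_X = np.zeros((4, 75, 75, 3))
demo_y = np.array(["healthy", "healthy", "powdery_mildew", "powdery_mildew"])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save both arrays, then read them straight back
    np.save(os.path.join(tmp_dir, "validation_X.npy"), demo_X)
    np.save(os.path.join(tmp_dir, "validation_y.npy"), demo_y)

    loaded_X = np.load(os.path.join(tmp_dir, "validation_X.npy"))
    loaded_y = np.load(os.path.join(tmp_dir, "validation_y.npy"))

print(loaded_X.shape, loaded_y.shape)
```

String labels are stored as a fixed-width unicode array rather than as Python objects, so `np.load` reads them back without `allow_pickle=True`.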

---

# Next steps

With the data ready, ML modelling is the next step