<center>

### COSC2753 - Machine Learning

# **Data Preprocessing**

<center>────────────────────────────</center>

### README
Due to the **time-consuming nature** of processing this entire file, it is **highly not recommended** to re-run it in its entirety. However, **specific sections** of the file can be executed to verify the output of the **pre-processing steps**.

The **pre-processed data** required for subsequent stages of the project is already available as a CSV dataframe within the **data/processed** folder. This eliminates the need to re-run the entire **pre-processing script** unless **absolutely necessary**.

# I. Introduction

In this notebook, we will apply common data preprocessing techniques to the dataset, building on the analysis conducted during the *exploratory data analysis* (EDA) steps. Data preprocessing is essential in the machine learning pipeline as it helps clean, transform, and prepare the data for the model.

The following preprocessing steps will be implemented:

1. **Image Labeling**: Labels associated with each image will be extracted from the filenames and stored in a designated column within a Pandas DataFrame.

2. **Train-Test Split**: The dataset will be divided into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate its performance on unseen data.

3. **Image Resizing**: All images will be resized to a uniform dimension, ensuring consistency across the dataset.

4. **Data Augmentation**: The images are augmented to increase the size of the dataset and improve the model's generalization.
   
5. **Pixel Normalization**: The pixel values are normalized to a range of [`0`, `1`] to ensure that the model can learn effectively.
   
While the detection and handling of transparent images can, in some cases, be advantageous. This is because it can help to mitigate model errors, such as PNG image formats with transparent backgrounds. This would ultimately lead to improved data quality and, consequently, enhanced model performance. However, based on the exploratory data analysis (EDA) conducted, it is evident that the dataset does not contain any transparent images.

# II. Project Setup

## 1. Import Libraries

In [1]:
# Base configuration for all scripts
import sys
import importlib
from sklearn.model_selection import train_test_split  # Split for train and test
import warnings

sys.path.append("../../")

modules_to_reload = [
    "scripts.leon",
    "scripts.styler",
    "scripts.utils",
    "scripts.constants",
]
[importlib.reload(sys.modules[m]) for m in modules_to_reload if m in sys.modules]

# Import user-defined scripts
from scripts.utils import Utils

# Ignore future warnings as they are not applicable at the moment
warnings.simplefilter(action="ignore", category=FutureWarning)
Utils.import_modules()

>>> os imported
>>> sys imported
>>> importlib imported
>>> inspect imported
>>> pandas imported as pd
>>> numpy imported as np
>>> matplotlib.pyplot imported as plt
>>> seaborn imported as sns
>>> tabulate imported
>>> Error importing scripts.leon: No module named 'keras'
>>> scripts.constants imported as const
>>> scripts.styler imported as styler


## 2. Global Properties

In [2]:
# Define the base directory path
BASE_PATH = const.COLAB_PATH if const.IS_COLAB else const.LOCAL_PATH
RAW_DIR = f"{BASE_PATH}/{const.RAW_DATA_DIR}"

print ("Base configuration loaded successfully!")
print ("Raw data directory: ", RAW_DIR)

Base configuration loaded successfully!
Raw data directory:  ../../data/raw/Furniture_Data/Furniture_Data


# III. Data Preprocessing

## Invalid Image Handling
Exploratory data analysis (**EDA**) showed that there is a presence of an **invalid image** within the dataset. This appears to be an empty folder mistakenly named like an image file. To ensure data integrity, we will proceed to **remove** this invalid image from the dataset.


In [19]:
empty_folder = [
    f"{RAW_DIR}/lamps/Modern/11286modern-lighting.jpg",
    rf"{RAW_DIR}/lamps/Modern/11286modern-lighting.jpg\^J",
]

print ("Empty folders to be removed: ", empty_folder)

# Get the absolute path
for i in range(len(empty_folder)):
    empty_folder[i] = os.path.abspath(empty_folder[i])
    if os.path.exists(empty_folder[i]):
        os.remove(empty_folder[i])
    else:
        print("File resolved already.")

Empty folders to be removed:  ['../../data/raw/Furniture_Data/Furniture_Data/lamps/Modern/11286modern-lighting.jpg', '../../data/raw/Furniture_Data/Furniture_Data/lamps/Modern/11286modern-lighting.jpg\\^J']
File resolved already.
File resolved already.


## Inconsistent Data Format Handling
As we conclude from the *exploratory data analysis* (EDA), there is a mismatch between the file format of `jpgD` and the other `jpg` files. Hence, to ensure that the data is consistent, we will change the file format of `jpgD` to `jpg`. Based on testing and observation, simply changing the file format from `jpgD` to `jpg` does not affect the image quality or integrity.

This adjustment will help maintain consistency across the dataset and prevent any potential issues during the preprocessing and modeling stages.

In [20]:
# Original file path
jpgd_path = f"{RAW_DIR}/dressers/Farmhouse/30826farmhouse-coffee-tables.jpgD"

# Get the absolute path
jpgd_path = os.path.abspath(jpgd_path)

# Check if the file exists
if os.path.exists(jpgd_path):
    # Get the directory and file name
    directory, filename = os.path.split(jpgd_path)

    # Remove the extra characters after ".jpg" in the filename
    new_filename = filename.split(".jpg")[0] + ".jpg"

    # Create the new file path
    new_path = os.path.join(directory, new_filename)

    # Rename the file
    os.rename(jpgd_path, new_path)

    print(f"File renamed from '{jpgd_path}' to '{new_path}'")
else:
    print("File not found or has been resovled already.")

File not found or has been resovled already.


## Image Labeling and Training-Test Split

This stage focuses on processing image filenames to extract **labels**, which will subsequently be stored as new columns within our **Pandas DataFrame**. This approach facilitates convenient access and management of images using Pandas' functionalities.

Firstly, the entire dataset containing the following columns will be loaded:
- **Path**: Relative path to the image file.
- **Category**: The category extracted from the image filename.
- **Style**: The style associated with the image category.
- **Width**: Width of the image in pixels.
- **Height**: Height of the image in pixels.
- **MinValue**: Minimum pixel value in the image.
- **MaxValue**: Maximum pixel value in the image.
- **StdDev**: Standard deviation of the pixel values in the image.

Next, the dataset will be divided into separate **training and testing sets** using an `80/20` split ratio. This allows us to train the model on a subset of the data while evaluating its performance on unseen data.

Since our objective is to classify images and predict their styles, the split will be **stratified** based on the "**Style**" column. This ensures a balanced representation of different styles in both the training and testing sets, leading to a more robust model.

In [6]:
# Get the paths of all images within the raw directory
image_paths = leon.get_image_paths(raw_dir)

# Remove previously augmented images
styler.boxify("Purging non-raw files")
leon.remove_nonraw_files(image_paths)
print(">>> Purging complete.\n")

# Load the data
styler.boxify("Loading data")
print(">>> Data is being loaded... Please wait.\n")
df = leon.load_data_frame(raw_dir)

df.info()

╭───────────────────────╮
│ Purging non-raw files │
╰───────────────────────╯
This is a destructive operation as files will be deleted permanently. Are you sure you want to continue? (y/n)

Please wait and do not interrupt the process.

Removing non-raw files...

>>> Purging complete.

╭──────────────╮
│ Loading data │
╰──────────────╯
>>> Data is being loaded... Please wait.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75698 entries, 0 to 75697
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Path      75698 non-null  object 
 1   Category  75698 non-null  object 
 2   Style     75698 non-null  object 
 3   Width     75698 non-null  int64  
 4   Height    75698 non-null  int64  
 5   MinValue  75698 non-null  uint8  
 6   MaxValue  75698 non-null  uint8  
 7   StdDev    75698 non-null  float64
dtypes: float64(1), int64(2), object(3), uint8(2)
memory usage: 3.6+ MB


In [7]:
# Display the first few rows
df.head(10)

Unnamed: 0,Path,Category,Style,Width,Height,MinValue,MaxValue,StdDev
0,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,350,350,0,255,70.282
1,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,350,350,0,255,73.807
2,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,350,350,0,255,67.194
3,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,350,350,0,255,67.955
4,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,350,350,0,255,95.022
5,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,350,350,0,255,56.056
6,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,350,350,19,255,58.35
7,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,350,350,5,255,40.87
8,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,350,350,0,255,71.595
9,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,350,350,21,255,53.708


In [8]:
# Print count of each style within each category in the train set
style_counts_per_category = df.groupby("Category")["Style"].value_counts()
print("\nCount of each style within each category in the train set:")
print(style_counts_per_category)


Count of each style within each category in the train set:
Category  Style        
beds      Contemporary     1944
          Transitional     1715
          Traditional      1391
          Modern            375
          Rustic            238
          Craftsman         184
          Midcentury        130
          Farmhouse          89
          Victorian          82
          Mediterranean      75
          Industrial         61
          Tropical           55
          Beach              51
          Southwestern       49
          Asian              48
          Scandinavian       33
          Eclectic           22
chairs    Contemporary     4172
          Transitional     4164
          Traditional      3748
          Midcentury       3102
          Modern           1484
          Farmhouse         555
          Industrial        495
          Tropical          425
          Asian             350
          Rustic            303
          Victorian         217
          Scandinavi

In [7]:
from tensorflow.keras.preprocessing.image import (
    ImageDataGenerator,
    img_to_array,
    load_img,
)
import numpy as np

# Initialize ImageDataGenerator for data augmentation
datagen = ImageDataGenerator(
    rotation_range=10,  # Random rotation up to 10 degrees
    width_shift_range=0.1,  # Randomly shift width by up to 10%
    height_shift_range=0.1,  # Randomly shift height by up to 10%
    shear_range=0.2,  # Shear angle in counter-clockwise direction in radians
    zoom_range=0.2,  # Random zoom up to 20%
    horizontal_flip=True,  # Randomly flip images horizontally
    vertical_flip=False,  # Don't flip images vertically
    fill_mode="nearest",  # Strategy for filling in newly created pixels
)


# Define a function to load images from file paths
def load_images(image_paths):
    images = []
    for path in image_paths:
        img = load_img(path, target_size=(256, 256))  # Adjust target_size as needed
        img_array = img_to_array(img)
        images.append(img_array)
    return np.array(images)


# Define a function to handle class imbalance within categories
def oversample_with_augmentation(group):
    # Count occurrences of each class
    class_counts = group["Style"].value_counts()
    max_samples = class_counts.max()

    # Initialize empty list to store augmented images and labels
    augmented_images = []
    augmented_labels = []

    # Apply data augmentation to minority classes
    for style, count in class_counts.items():
        if count < max_samples:
            # Filter data for the minority style
            minority_data = group[group["Style"] == style]

            # Load images
            minority_images_array = load_images(minority_data["Path"])

            # Compute number of additional samples needed
            num_additional_samples = max_samples - count

            # Generate augmented images
            augmented_data_generator = datagen.flow(
                minority_images_array, batch_size=1, shuffle=False
            )
            for _ in range(num_additional_samples):
                augmented_images.append(next(augmented_data_generator)[0])
                augmented_labels.append(style)

    # Create DataFrame for augmented data
    augmented_df = pd.DataFrame(
        {
            "Path": [
                "oversampled_image_{}".format(i) for i in range(len(augmented_images))
            ],
            "Style": augmented_labels,
        }
    )

    # Concatenate the original data with the augmented data
    augmented_group = pd.concat([group, augmented_df])

    return augmented_group

In [8]:
# Assuming df contains 'ImagePath' and 'Style' columns
grouped_df = df.groupby("Category")

# Apply oversampling with augmentation to each group
oversampled_groups = [oversample_with_augmentation(group) for _, group in grouped_df]

# Concatenate oversampled groups into a single DataFrame
oversampled_df = pd.concat(oversampled_groups)

# Save augmented images to disk if needed
for i, img_array in enumerate(augmented_images):
    img = array_to_img(img_array)
    img.save("oversampled/oversampled_image_{}.jpg".format(i))

# Print count of each style within each category in the oversampled dataset
style_counts_per_category = oversampled_df.groupby("Category")["Style"].value_counts()
print("\nCount of each style within each category in the oversampled dataset:")
print(style_counts_per_category)

: 

In [None]:
# Print count of each style within each category in the train set
style_counts_per_category = df.groupby("Category")["Style"].value_counts()
print("\nCount of each style within each category in the train set:")
print(style_counts_per_category)


Count of each style within each category in the train set:
Category  Style        
beds      Contemporary     1944
          Transitional     1715
          Traditional      1391
          Modern            375
          Rustic            238
          Craftsman         184
          Midcentury        130
          Farmhouse          89
          Victorian          82
          Mediterranean      75
          Industrial         61
          Tropical           55
          Beach              51
          Southwestern       49
          Asian              48
          Scandinavian       33
          Eclectic           22
chairs    Contemporary     4172
          Transitional     4164
          Traditional      3748
          Midcentury       3102
          Modern           1484
          Farmhouse         555
          Industrial        495
          Tropical          425
          Asian             350
          Rustic            303
          Victorian         217
          Scandinavi

In [None]:
# Split the data into train and test sets
styler.boxify("Splitting data")
print(">>> Splitting data into train and test sets... Please wait.\n")
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# Display the first few rows of the train set
df_train.info()

# Group by the 'category' column and count the number of unique values
category_counts = df_train["Category"].value_counts()

# Display the category counts
for category, count in category_counts.items():
    print(f"{category}: {count}")

╭────────────────╮
│ Splitting data │
╰────────────────╯
>>> Splitting data into train and test sets... Please wait.

<class 'pandas.core.frame.DataFrame'>
Index: 60559 entries, 58338 to 15795
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Path      60559 non-null  object 
 1   Category  60559 non-null  object 
 2   Style     60559 non-null  object 
 3   Width     60559 non-null  int64  
 4   Height    60559 non-null  int64  
 5   MinValue  60559 non-null  uint8  
 6   MaxValue  60559 non-null  uint8  
 7   StdDev    60559 non-null  float64
dtypes: float64(1), int64(2), object(3), uint8(2)
memory usage: 3.3+ MB
lamps: 18419
chairs: 16144
tables: 12752
dressers: 5278
beds: 5251
sofas: 2715


In [None]:
styler.boxify("First 20 rows of the train set")
df_train.head(10)

╭────────────────────────────────╮
│ First 20 rows of the train set │
╰────────────────────────────────╯


Unnamed: 0,Path,Category,Style,Width,Height,MinValue,MaxValue,StdDev
58338,../../data/raw/Furniture_Data/Furniture_Data/c...,chairs,Traditional,256,256,0,255,75.131
19160,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Contemporary,256,256,6,255,22.291
1410,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,256,256,0,255,60.138
10986,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Modern,256,256,0,255,96.166
17652,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Contemporary,256,256,16,255,35.015
4187,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Midcentury,256,256,10,255,19.738
50134,../../data/raw/Furniture_Data/Furniture_Data/b...,beds,Contemporary,256,256,0,255,68.49
14243,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Asian,256,256,0,255,49.072
73203,../../data/raw/Furniture_Data/Furniture_Data/c...,chairs,Contemporary,256,256,0,255,88.265
13873,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Farmhouse,256,256,8,255,51.032


## 3. Image Resizing

In this step, we will resize all images to a uniform dimension. This step is crucial as it ensures that all images have the same size, which is a requirement for most machine learning models.

While there's no single "**best**" image size for deep learning applications, we will resize our images to `256x256` pixels. This choice balances the need to capture sufficient detail in the images with computational efficiency. Additionally, research by O. Rukundo (Lund University) suggests that `256x256` pixels is a common and effective size for processing medical images, particularly **LGE-MRI** images.

Although our dataset is not related to medical imaging, we can leverage this insight of its reliability to guide our decision.

[Link to Research](https://www.mdpi.com/2079-9292/12/4/985)

In [None]:
styler.boxify("Resizing images to 256x256")

# Resize images to 256x256
for img_path in df_train["Path"]:
    leon.resize_image(img_path, 256, 256)

df_train.head()

╭────────────────────────────╮
│ Resizing images to 256x256 │
╰────────────────────────────╯


Unnamed: 0,Path,Category,Style,Width,Height,MinValue,MaxValue,StdDev
58338,../../data/raw/Furniture_Data/Furniture_Data/c...,chairs,Traditional,256,256,0,255,75.131
19160,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Contemporary,256,256,6,255,22.291
1410,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Traditional,256,256,0,255,60.138
10986,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Modern,256,256,0,255,96.166
17652,../../data/raw/Furniture_Data/Furniture_Data/l...,lamps,Contemporary,256,256,16,255,35.015


## 4. Data Augmentation

In machine learning, data augmentation is a well-established technique employed to artificially expand the size of a dataset. This is achieved by applying various transformations to the existing data points. Data augmentation proves particularly valuable when dealing with limited datasets, as it mitigates the risk of overfitting and enhances the model's ability to generalize to unseen data.

While our image dataset may not be severely restricted in size, data augmentation can still provide significant benefits. We will incorporate the following transformations to augment our dataset:

1. **Random Rotation**: Images will be rotated by a random angle within a predefined range.
2. **Vertical Flip**: Images will be flipped along the vertical axis.
3. **Random Contrast Adjustment**: The contrast of each image will be adjusted by a random factor.

These aforementioned transformations are commonly utilized in image augmentation and demonstrably assist the model in learning robust features from the data [1]. It is important to note that with a dataset exceeding `65,000` images, regarding the training set, employing an excessive number of augmentation techniques would result in a computationally expensive dataset to process. Therefore, a selection of the top `3` prevalent and effective transformations has been chosen.

Following this augmentation step, we will possess a dataset that is approximately **three times** larger than the original dataset (Each image will have `2` augmented versions, plus the original image). This expanded dataset will equip the model with a richer learning experience and improved generalizability.

[1] [Image Data Augmentation for Computer Vision](https://viso.ai/computer-vision/image-data-augmentation-for-computer-vision/#:~:text=Popular%20Types%20and%20Methods%20of%20Data%20Augmentation,-Early%20experiments%20showing&text=Geometric%20transformations%3A%20Augmenting%20image%20data,%2Fdown%2C%20or%20noise%20injection.)

In [None]:
styler.boxify("Augmenting images in the training set")

print("\n>>> Augmenting images in the training set... Please wait.\n")

# Augment images in the training set
for img_path in df_train["Path"]:
    # Check if img_path starts with "aug_"
    if os.path.basename(img_path).strip().startswith("aug_"):
        continue

    # Extract directory name
    directory = os.path.dirname(img_path)

    # Augment images in the directory
    df_train = leon.augment_image(
        image_path=img_path, output_dir=directory, df_train=df_train
    )

# Display the first few rows of the filtered DataFrame
df_train.head()

╭───────────────────────────────────────╮
│ Augmenting images in the training set │
╰───────────────────────────────────────╯

>>> Augmenting images in the training set... Please wait.



KeyboardInterrupt: 

## 5. Pixel Normalization

Normalization is a critical stage in preprocessing image data for deep learning applications. It ensures all pixel values fall within a consistent range, typically between `0` and `1`. This seemingly simple step offers several advantages:

- **Faster Convergence**: By reducing the overall value range, normalization accelerates the convergence of the optimization algorithm used to train the model.
- **Improved Stability**: Normalization creates a more stable training process, mitigating issues like vanishing and exploding gradients:

    - *Vanishing Gradients*: In deep learning, vanishing gradients occur when the gradients (slopes) of the loss function become very small as they propagate backward through the layers of a deep neural network. This can cause the model to learn very slowly or not at all.
    
    - *Exploding Gradients*: Conversely, exploding gradients happen when gradients become excessively large, causing the model to diverge (lose stability) during training.
- **Boosted Performance**: Consistent input data, achieved through normalization, often leads to better model performance and generalization on unseen data.

It's important to acknowledge that the normalization range is not restricted to `[0, 1]`. Other techniques, such as **Z-score normalization**, may be appropriate depending on the specific model and data characteristics. However, due to its simplicity and effectiveness, `[0, 1]` normalization remains a popular choice for preprocessing pipelines.

### Note: The Normalization Will Not Be Performed At This Stage

While the normalization process itself is successful, it is not recommended to save the **normalized images** to disk afterwards. This is due to a data type mismatch. The original images are stored in an unsigned 8-bit integer format (**uint8**), whereas the **normalized images** are in a single-precision floating-point format (**float32**). This conversion can lead to a loss of color information and a reduction in image sharpness.

Consequently, to avoid this issue, the normalization process will be performed during the **model training** phase. During training, the normalization will be applied dynamically to the input data, ensuring compatibility with the model's input requirements without compromising image quality.

# IV. Conclusion

Overall, data preprocessing is a crucial step in the machine learning pipeline. By applying the techniques outlined in this notebook, we have prepared our dataset for model training. The data is now **clean**, **labeled**, **split into training and testing sets**, **resized**, **augmented**, and **normalized**.

These preprocessing steps are essential for ensuring that the model can learn effectively from the data and make accurate predictions.

In [None]:
# Display info of the training set
df_train.info()

In [None]:
# Display the first few rows of the filtered DataFrame
df_train.head(10)

In [None]:
# Write train set to CSV
try:
    styler.boxify("Writing train set to CSV")
    df_train.to_csv("../../data/processed/train.csv", index=False)
except Exception as e:
    print(e)

print (">>> Data saved successfully")