<center>

### COSC2753 - Machine Learning

# **Data Preprocessing**

<center>────────────────────────────</center>

### README
Because processing this entire file is **time-consuming**, re-running it in full is **strongly discouraged**. However, **specific sections** of the file can be executed to verify the output of the **pre-processing steps**.

The **pre-processed data** required for subsequent stages of the project is already available as a CSV dataframe within the **data/processed** folder. This eliminates the need to re-run the entire **pre-processing script** unless **absolutely necessary**.

# I. Introduction

In this notebook, we will apply common data preprocessing techniques to the dataset, building on the analysis conducted during the *exploratory data analysis* (EDA) steps. Data preprocessing is essential in the machine learning pipeline as it helps clean, transform, and prepare the data for the model.

The following preprocessing steps will be implemented:

1. **Image Labeling**: Labels associated with each image will be extracted from the filenames and stored in a designated column within a Pandas DataFrame.

2. **Train-Test Split**: The dataset will be divided into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate its performance on unseen data.

3. **Image Resizing**: All images will be resized to a uniform dimension, ensuring consistency across the dataset.

4. **Data Augmentation**: Images will be augmented to increase the size of the dataset and improve the model's generalization.
   
5. **Pixel Normalization**: Pixel values will be normalized to the range [`0`, `1`] so that the model can learn effectively. As explained in the Pixel Normalization section, this step is deferred to the model training phase.
   
Detecting and handling transparent images can be advantageous in some cases, because images such as PNGs with transparent backgrounds can cause model errors. Handling them improves data quality and, consequently, model performance. However, the exploratory data analysis (EDA) showed that the dataset does not contain any transparent images, so no such step is needed here.

# II. Project Setup

## 1. Import Libraries

In [49]:
# Import necessary packages
import pandas as pd  # Data manipulation
import sys  # System specific parameters and functions
import importlib  # Importing modules
import os  # OS related functions
from sklearn.model_selection import train_test_split  # Split for train and test

# Reload modules
sys.path.append("../../")  # Root directory
modules_to_reload = ["scripts.leon", "scripts.styler"]

# Reload modules if they have been modified
missing_modules = []

for module_name in modules_to_reload:
    if module_name in sys.modules:
        importlib.reload(sys.modules[module_name])
    else:
        missing_modules.append(module_name)

# Recache missing modules
if missing_modules:
    print(f"Modules {missing_modules} not found. Recaching...")

# Import user-defined scripts
from scripts.leon import Leon  # Leon class
from scripts.styler import Styler  # Styler class

# Configuration
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 3)

# Initialize objects
leon = Leon()
styler = Styler()


        @|\@@
       -  @@@@                                                            LEON 1.0.0
      /7   @@@@                                         This is Leon, the friendly lion. He is here to help you
     /    @@@@@@                                     Leon is tailored to manipulate images, data and visualizations
     \-' @@@@@@@@`-_______________                                      Made by: Team X
      -@@@@@@@@@             /    \                                     Version: 1.0.3
 _______/    /_       ______/      |__________-
/,__________/  `-.___/,_____________----------_)



## 2. Global Properties

In [50]:
import warnings

# Ignore future warnings as they are not applicable at the moment
warnings.simplefilter(action="ignore", category=FutureWarning)

# Define the base directory path
raw_dir = "../../data/raw/Furniture_Data/Furniture_Data/"

# III. Data Preprocessing

## Invalid Image Detection

## Invalid Image Handling
Exploratory data analysis (**EDA**) revealed one **invalid image** in the dataset. It appears to be an empty folder mistakenly named like an image file. To ensure data integrity, we will **remove** this invalid entry from the dataset.
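
As an illustration of how such an entry could be detected, the sketch below walks a directory tree and reports folders whose names end in an image extension. It is a minimal sketch, not the EDA code itself: the function name `find_directories_named_like_images`, the extension list, and the temporary demo tree are assumptions made for this example.

```python
import os
import tempfile

# Assumed set of image extensions for this illustration
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def find_directories_named_like_images(root):
    """Return directories under `root` whose names end in an image extension."""
    suspects = []
    for current_dir, dir_names, _ in os.walk(root):
        for name in dir_names:
            if name.lower().endswith(IMAGE_EXTENSIONS):
                suspects.append(os.path.join(current_dir, name))
    return sorted(suspects)


# Demo on a throwaway tree: one folder posing as an image, one real file
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "lamps", "Modern", "fake-image.jpg"))
    with open(os.path.join(root, "lamps", "Modern", "real-image.jpg"), "wb") as handle:
        handle.write(b"\xff\xd8")  # placeholder bytes, not a full JPEG

    suspects = find_directories_named_like_images(root)
    relative_suspects = [os.path.relpath(path, root) for path in suspects]

print(relative_suspects)
```

Only the folder is reported; the regular file with the same extension is left alone.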


In [None]:
# The invalid "image" is an empty folder whose name ends in .jpg
empty_folder = (
    "../../data/raw/Furniture_Data/Furniture_Data/lamps/Modern/11286modern-lighting.jpg"
)

# Get the absolute path
invalid_path = os.path.abspath(empty_folder)

# os.remove() cannot delete a directory, so use os.rmdir() for the empty folder
if os.path.isdir(invalid_path):
    os.rmdir(invalid_path)
elif os.path.exists(invalid_path):
    os.remove(invalid_path)
else:
    print("File not found or has been resolved already.")

## Inconsistent Data Format Handling
As concluded in the *exploratory data analysis* (EDA), the dataset contains a file with the extension `jpgD` instead of `jpg`. To keep the data consistent, we will rename the extension from `jpgD` to `jpg`. Based on testing and observation, this rename does not affect the image quality or integrity.

This adjustment will help maintain consistency across the dataset and prevent any potential issues during the preprocessing and modeling stages.

In [51]:
# Original file path
jpgd_path = "../../data/raw/Furniture_Data/Furniture_Data/dressers/Farmhouse/30826farmhouse-coffee-tables.jpgD"

# Get the absolute path
jpgd_path = os.path.abspath(jpgd_path)

# Check if the file exists
if os.path.exists(jpgd_path):
    # Get the directory and file name
    directory, filename = os.path.split(jpgd_path)

    # Remove the extra characters after ".jpg" in the filename
    new_filename = filename.split(".jpg")[0] + ".jpg"

    # Create the new file path
    new_path = os.path.join(directory, new_filename)

    # Rename the file
    os.rename(jpgd_path, new_path)

    print(f"File renamed from '{jpgd_path}' to '{new_path}'")
else:
    print("File not found or has been resolved already.")

File not found or has been resolved already.


## Image Labeling and Training-Test Split

This stage focuses on processing image filenames to extract **labels**, which will subsequently be stored as new columns within our **Pandas DataFrame**. This approach facilitates convenient access and management of images using Pandas' functionalities.

Firstly, the entire dataset containing the following columns will be loaded:
- **Path**: Relative path to the image file.
- **Category**: The category extracted from the image filename.
- **Style**: The style associated with the image category.
- **Width**: Width of the image in pixels.
- **Height**: Height of the image in pixels.
- **MinValue**: Minimum pixel value in the image.
- **MaxValue**: Maximum pixel value in the image.
- **StdDev**: Standard deviation of the pixel values in the image.
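
To make the labeling step concrete, the sketch below derives `Category` and `Style` from an image path. It assumes the labels correspond to the two parent folders of each file (`.../<category>/<Style>/<file>`), which matches the paths shown in this notebook; `labels_from_path` is a hypothetical helper and not necessarily how `leon.load_data_frame` works internally.

```python
from pathlib import PurePosixPath


def labels_from_path(path):
    """Read Category and Style from the two folders that contain the image."""
    # Normalize Windows-style separators so the same code handles both forms
    parts = PurePosixPath(path.replace("\\", "/")).parts
    if len(parts) < 3:
        raise ValueError(f"Path is too short to contain labels: {path}")
    return {"Category": parts[-3], "Style": parts[-2]}


example = (
    "../../data/raw/Furniture_Data/Furniture_Data/"
    "dressers/Farmhouse/30826farmhouse-coffee-tables.jpg"
)
labels = labels_from_path(example)
print(labels)
```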

Next, the dataset will be divided into separate **training and testing sets** using an `80/20` split ratio. This allows us to train the model on a subset of the data while evaluating its performance on unseen data.

Since our objective is to classify images and predict their styles, the split will be **stratified** based on the "**Style**" column. This ensures a balanced representation of different styles in both the training and testing sets, leading to a more robust model.
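
The idea behind stratification can be sketched without any libraries: split each style separately so that both sets keep the same style proportions. The helper `stratified_split` and the toy rows below are assumptions for illustration only. In practice, scikit-learn's `train_test_split` accepts a `stratify` argument that achieves this in a single call.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, test_fraction=0.2, seed=42):
    """Split rows so that every label keeps roughly the same test fraction."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for label in sorted(groups):
        members = groups[label][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data: 10 "Asian" rows and 20 "Modern" rows
rows = [
    {"Path": f"img_{i}.jpg", "Style": "Asian" if i < 10 else "Modern"}
    for i in range(30)
]
train, test = stratified_split(rows, "Style")
print(len(train), len(test))
```

With an `80/20` ratio, the toy data yields 2 of the 10 "Asian" rows and 4 of the 20 "Modern" rows in the test set, so both styles are represented in proportion.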

In [52]:
# Get the paths of all images within the raw directory
image_paths = leon.get_image_paths(raw_dir)

# Remove previously augmented images
styler.boxify("Purging non-raw files")
leon.remove_nonraw_files(image_paths)
print(">>> Purging complete.\n")

# Load the data
styler.boxify("Loading data")
print(">>> Data is being loaded... Please wait.\n")
df = leon.load_data_frame(raw_dir)

df.info()

╭───────────────────────╮
│ Purging non-raw files │
╰───────────────────────╯
This is a destructive operation as files will be deleted permanently. Are you sure you want to continue? (y/n)

Please wait and do not interrupt the process.

Removing non-raw files...

>>> Purging complete.

╭──────────────╮
│ Loading data │
╰──────────────╯
>>> Data is being loaded... Please wait.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81935 entries, 0 to 81934
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Path      81935 non-null  object 
 1   Category  81935 non-null  object 
 2   Style     81935 non-null  object 
 3   Width     81935 non-null  int64  
 4   Height    81935 non-null  int64  
 5   MinValue  81935 non-null  uint8  
 6   MaxValue  81935 non-null  uint8  
 7   StdDev    81935 non-null  float64
dtypes: float64(1), int64(2), object(3), uint8(2)
memory usage: 3.9+ MB


In [53]:
# Display the first few rows
df.head(10)

Unnamed: 0,Path,Category,Style,Width,Height,MinValue,MaxValue,StdDev
0,..\..\data\raw\Furniture_Data\Furniture_Data\b...,beds,Asian,256,256,0,255,71.341
1,..\..\data\raw\Furniture_Data\Furniture_Data\b...,beds,Asian,256,256,0,255,82.918
2,..\..\data\raw\Furniture_Data\Furniture_Data\b...,beds,Asian,256,256,0,255,89.007
3,..\..\data\raw\Furniture_Data\Furniture_Data\b...,beds,Asian,256,256,0,255,59.835
4,..\..\data\raw\Furniture_Data\Furniture_Data\b...,beds,Asian,256,256,0,255,87.811
5,..\..\data\raw\Furniture_Data\Furniture_Data\b...,beds,Asian,256,256,0,255,87.261
6,..\..\data\raw\Furniture_Data\Furniture_Data\b...,beds,Asian,256,256,0,255,82.231
7,..\..\data\raw\Furniture_Data\Furniture_Data\b...,beds,Asian,256,256,0,255,74.379
8,..\..\data\raw\Furniture_Data\Furniture_Data\b...,beds,Asian,256,256,0,255,100.72
9,..\..\data\raw\Furniture_Data\Furniture_Data\b...,beds,Asian,256,256,0,255,101.753


In [54]:
# Group the data by style
grouped_df = df.groupby("Style")

# Initialize empty DataFrames for train and test sets
df_train = pd.DataFrame(columns=df.columns)
df_test = pd.DataFrame(columns=df.columns)

# Split each group and concatenate train and test sets
for _, group in grouped_df:
    train_group, test_group = train_test_split(group, test_size=0.2, random_state=42)

    # Check if train_group and test_group are not empty before concatenating
    if not train_group.empty and train_group["Category"].notna().all():
        df_train = pd.concat([df_train, train_group])
    if not test_group.empty and test_group["Category"].notna().all():
        df_test = pd.concat([df_test, test_group])

# Write test set to CSV
df_test.to_csv("../../data/test/test.csv", index=False)

# Display info of the training set
df_train.info()

# Print count of each category in the train set
category_counts = df_train["Category"].value_counts()
print()
styler.boxify("Count of each category in the train set:")
print(category_counts)

<class 'pandas.core.frame.DataFrame'>
Index: 65542 entries, 27772 to 60904
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Path      65542 non-null  object 
 1   Category  65542 non-null  object 
 2   Style     65542 non-null  object 
 3   Width     65542 non-null  object 
 4   Height    65542 non-null  object 
 5   MinValue  65542 non-null  object 
 6   MaxValue  65542 non-null  object 
 7   StdDev    65542 non-null  float64
dtypes: float64(1), object(7)
memory usage: 4.5+ MB

╭──────────────────────────────────────────╮
│ Count of each category in the train set: │
╰──────────────────────────────────────────╯
Category
lamps       20905
chairs      16688
tables      13357
dressers     6161
beds         5234
sofas        3197
Name: count, dtype: int64


In [55]:
styler.boxify("First 10 rows of the train set")
df_train.head(10)

╭────────────────────────────────╮
│ First 10 rows of the train set │
╰────────────────────────────────╯


Unnamed: 0,Path,Category,Style,Width,Height,MinValue,MaxValue,StdDev
27772,..\..\data\raw\Furniture_Data\Furniture_Data\d...,dressers,Asian,256,256,0,255,78.264
35626,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,0,255,56.407
27547,..\..\data\raw\Furniture_Data\Furniture_Data\d...,dressers,Asian,256,256,0,255,69.653
35706,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,5,255,84.765
35244,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,2,255,56.765
35917,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,0,255,52.5
6696,..\..\data\raw\Furniture_Data\Furniture_Data\c...,chairs,Asian,256,256,0,255,93.022
27926,..\..\data\raw\Furniture_Data\Furniture_Data\d...,dressers,Asian,256,256,0,255,91.396
65521,..\..\data\raw\Furniture_Data\Furniture_Data\t...,tables,Asian,256,256,28,255,48.064
6722,..\..\data\raw\Furniture_Data\Furniture_Data\c...,chairs,Asian,256,256,0,255,76.749


## 3. Image Resizing

In this step, we will resize all images to a uniform dimension. This is crucial because most machine learning models require inputs of a fixed size.

While there is no single "**best**" image size for deep learning applications, we will resize our images to `256x256` pixels. This choice balances the need to capture sufficient detail with computational efficiency. Additionally, research by O. Rukundo (Lund University) suggests that `256x256` pixels is a common and effective size for processing medical images, particularly **LGE-MRI** images.

Although our dataset is not related to medical imaging, we take this finding as supporting evidence that `256x256` is a reliable choice.

[Link to Research](https://www.mdpi.com/2079-9292/12/4/985)
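
To show what resizing does at the pixel level, here is a dependency-free nearest-neighbor sketch operating on a nested list of pixel values. It is an illustration only: `resize_nearest` is a hypothetical helper, and `leon.resize_image` may use a different interpolation method. In practice, an imaging library such as Pillow (`Image.resize`) would normally handle this.

```python
def resize_nearest(image, new_width, new_height):
    """Resize a 2D grid of pixels by copying the nearest source pixel."""
    old_height = len(image)
    old_width = len(image[0])
    resized = []
    for y in range(new_height):
        src_y = y * old_height // new_height  # nearest source row
        row = [image[src_y][x * old_width // new_width] for x in range(new_width)]
        resized.append(row)
    return resized


image = [
    [1, 2],
    [3, 4],
]
enlarged = resize_nearest(image, 4, 4)
for row in enlarged:
    print(row)
```

Every output image has exactly the requested dimensions regardless of the input size, which is the property the model relies on.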

In [56]:
styler.boxify("Resizing images to 256x256")

# Resize images to 256x256
for img_path in df_train["Path"]:
    leon.resize_image(img_path, 256, 256)

df_train.head()

╭────────────────────────────╮
│ Resizing images to 256x256 │
╰────────────────────────────╯


Unnamed: 0,Path,Category,Style,Width,Height,MinValue,MaxValue,StdDev
27772,..\..\data\raw\Furniture_Data\Furniture_Data\d...,dressers,Asian,256,256,0,255,78.264
35626,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,0,255,56.407
27547,..\..\data\raw\Furniture_Data\Furniture_Data\d...,dressers,Asian,256,256,0,255,69.653
35706,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,5,255,84.765
35244,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,2,255,56.765


## 4. Data Augmentation

In machine learning, data augmentation is a well-established technique employed to artificially expand the size of a dataset. This is achieved by applying various transformations to the existing data points. Data augmentation proves particularly valuable when dealing with limited datasets, as it mitigates the risk of overfitting and enhances the model's ability to generalize to unseen data.

While our image dataset may not be severely restricted in size, data augmentation can still provide significant benefits. We will incorporate the following transformations to augment our dataset:

1. **Random Rotation**: Images will be rotated by a random angle within a predefined range.
2. **Vertical Flip**: Images will be flipped along the vertical axis.
3. **Random Contrast Adjustment**: The contrast of each image will be adjusted by a random factor.

These transformations are commonly used in image augmentation and help the model learn robust features from the data [1]. Because the training set already exceeds `65,000` images, applying a large number of augmentation techniques would make the dataset computationally expensive to process. We therefore limit ourselves to these `3` prevalent and effective transformations.

After this augmentation step, the training set will be approximately **three times** its original size: each image will have `2` augmented versions in addition to the original. This expanded dataset gives the model more varied examples to learn from, which should improve its ability to generalize.

[1] [Image Data Augmentation for Computer Vision](https://viso.ai/computer-vision/image-data-augmentation-for-computer-vision/#:~:text=Popular%20Types%20and%20Methods%20of%20Data%20Augmentation,-Early%20experiments%20showing&text=Geometric%20transformations%3A%20Augmenting%20image%20data,%2Fdown%2C%20or%20noise%20injection.)
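
The sketch below illustrates two of these transformations on a tiny grayscale image stored as a nested list, and checks the expected size of the augmented training set. It rests on several assumptions: the flip is taken to mean reversing the row order, contrast is adjusted by scaling each pixel's distance from the mean, and rotation is omitted for brevity. The helper names are hypothetical and `leon.augment_image` may behave differently.

```python
import random


def flip_vertical(image):
    """Reverse the row order (top-to-bottom flip); assumed interpretation."""
    return [row[:] for row in reversed(image)]


def adjust_contrast(image, factor):
    """Scale each pixel's distance from the mean, clipped to 0-255."""
    flat = [p for row in image for p in row]
    mean = sum(flat) / len(flat)
    return [
        [max(0, min(255, round(mean + factor * (p - mean)))) for p in row]
        for row in image
    ]


rng = random.Random(42)  # fixed seed so the random factor is reproducible
image = [
    [10, 20],
    [200, 250],
]
flipped = flip_vertical(image)
contrasted = adjust_contrast(image, rng.uniform(0.5, 1.5))

# Size check: 65,542 training images, each kept plus 2 augmented versions
original_count = 65542
expected_total = original_count * (1 + 2)
print(flipped, contrasted, expected_total)
```

The computed total of 196,626 matches the row count reported for the final training DataFrame.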

In [57]:
styler.boxify("Augmenting images in the training set")

print("\n>>> Augmenting images in the training set... Please wait.\n")
# Augment images in the training set
for img_path in df_train["Path"]:
    # Check if img_path starts with "aug_"
    if os.path.basename(img_path).strip().startswith("aug_"):
        continue

    # Extract directory name
    directory = os.path.dirname(img_path)

    # Augment images in the directory
    df_train = leon.augment_image(
        image_path=img_path, output_dir=directory, df_train=df_train
    )

# Display the first few rows of the augmented DataFrame
df_train.head()

╭───────────────────────────────────────╮
│ Augmenting images in the training set │
╰───────────────────────────────────────╯

>>> Augmenting images in the training set... Please wait.



Unnamed: 0,Path,Category,Style,Width,Height,MinValue,MaxValue,StdDev
0,..\..\data\raw\Furniture_Data\Furniture_Data\d...,dressers,Asian,256,256,0,255,78.264
1,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,0,255,56.407
2,..\..\data\raw\Furniture_Data\Furniture_Data\d...,dressers,Asian,256,256,0,255,69.653
3,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,5,255,84.765
4,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,2,255,56.765


## 5. Pixel Normalization

Normalization is a critical stage in preprocessing image data for deep learning applications. It ensures all pixel values fall within a consistent range, typically between `0` and `1`. This seemingly simple step offers several advantages:

- **Faster Convergence**: By reducing the overall value range, normalization accelerates the convergence of the optimization algorithm used to train the model.
- **Improved Stability**: Normalization creates a more stable training process, mitigating issues like vanishing and exploding gradients:

    - *Vanishing Gradients*: In deep learning, vanishing gradients occur when the gradients (slopes) of the loss function become very small as they propagate backward through the layers of a deep neural network. This can cause the model to learn very slowly or not at all.
    
    - *Exploding Gradients*: Conversely, exploding gradients happen when gradients become excessively large, causing the model to diverge (lose stability) during training.
- **Boosted Performance**: Consistent input data, achieved through normalization, often leads to better model performance and generalization on unseen data.

It's important to acknowledge that the normalization range is not restricted to `[0, 1]`. Other techniques, such as **Z-score normalization**, may be appropriate depending on the specific model and data characteristics. However, due to its simplicity and effectiveness, `[0, 1]` normalization remains a popular choice for preprocessing pipelines.
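
The two normalization schemes mentioned above can be compared on a handful of pixel values. This is a minimal sketch using only the standard library; the function names are chosen for this example, and a real pipeline would typically apply the same arithmetic to whole image arrays.

```python
from statistics import mean, pstdev


def scale_to_unit(pixels):
    """Map 8-bit pixel values (0-255) to floats in [0, 1]."""
    return [p / 255.0 for p in pixels]


def z_score(pixels):
    """Center on the mean and divide by the population standard deviation."""
    mu = mean(pixels)
    sigma = pstdev(pixels)
    if sigma == 0:
        return [0.0 for _ in pixels]  # constant input: avoid division by zero
    return [(p - mu) / sigma for p in pixels]


row = [0, 51, 102, 255]
unit = scale_to_unit(row)
standardized = z_score(row)
print(unit)
print(standardized)
```

`[0, 1]` scaling depends only on the fixed 8-bit range, whereas Z-score normalization depends on statistics computed from the data itself.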

### Note: The Normalization Will Not Be Performed At This Stage

Although the normalization itself is straightforward, saving the **normalized images** to disk afterwards is not recommended because of a data type mismatch. The original images are stored in an unsigned 8-bit integer format (**uint8**), whereas the **normalized images** are in a single-precision floating-point format (**float32**). Writing them back to disk would require converting them to **uint8** again, which undoes the normalization and can lead to a loss of color information and a reduction in image sharpness.

Consequently, to avoid this issue, the normalization process will be performed during the **model training** phase. During training, the normalization will be applied dynamically to the input data, ensuring compatibility with the model's input requirements without compromising image quality.

# IV. Conclusion

Overall, data preprocessing is a crucial step in the machine learning pipeline. By applying the techniques outlined in this notebook, we have prepared our dataset for model training. The data is now **clean**, **labeled**, **split into training and testing sets**, **resized**, and **augmented**, while **normalization** is deferred to the model training phase.

These preprocessing steps are essential for ensuring that the model can learn effectively from the data and make accurate predictions.

In [58]:
# Display info of the training set
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196626 entries, 0 to 196625
Data columns (total 8 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   Path      196626 non-null  object 
 1   Category  196626 non-null  object 
 2   Style     196626 non-null  object 
 3   Width     196626 non-null  object 
 4   Height    196626 non-null  object 
 5   MinValue  196626 non-null  object 
 6   MaxValue  196626 non-null  object 
 7   StdDev    196626 non-null  float64
dtypes: float64(1), object(7)
memory usage: 12.0+ MB


In [59]:
# Display the first few rows of the augmented DataFrame
df_train.head(10)

Unnamed: 0,Path,Category,Style,Width,Height,MinValue,MaxValue,StdDev
0,..\..\data\raw\Furniture_Data\Furniture_Data\d...,dressers,Asian,256,256,0,255,78.264
1,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,0,255,56.407
2,..\..\data\raw\Furniture_Data\Furniture_Data\d...,dressers,Asian,256,256,0,255,69.653
3,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,5,255,84.765
4,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,2,255,56.765
5,..\..\data\raw\Furniture_Data\Furniture_Data\l...,lamps,Asian,256,256,0,255,52.5
6,..\..\data\raw\Furniture_Data\Furniture_Data\c...,chairs,Asian,256,256,0,255,93.022
7,..\..\data\raw\Furniture_Data\Furniture_Data\d...,dressers,Asian,256,256,0,255,91.396
8,..\..\data\raw\Furniture_Data\Furniture_Data\t...,tables,Asian,256,256,28,255,48.064
9,..\..\data\raw\Furniture_Data\Furniture_Data\c...,chairs,Asian,256,256,0,255,76.749


In [60]:
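# Write train set to CSV
styler.boxify("Writing train set to CSV")
try:
    df_train.to_csv("../../data/processed/train.csv", index=False)
except Exception as e:
    # Report the failure instead of printing a misleading success message
    print(f">>> Failed to save data: {e}")
else:
    print(">>> Data saved successfully")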

╭──────────────────────────╮
│ Writing train set to CSV │
╰──────────────────────────╯
>>> Data saved successfully
