<center>

### COSC2753 - Machine Learning

# **Data Preprocessing**

<center>────────────────────────────</center>
&nbsp;

# I. Introduction

In this notebook, we will apply common data preprocessing techniques to the dataset, building on the analysis conducted during the *exploratory data analysis* (EDA) steps. Data preprocessing is essential in the machine learning pipeline as it helps clean, transform, and prepare the data for the model.

The following preprocessing steps will be implemented:

1. **Image Labeling**: Labels associated with each image will be extracted from the filenames and stored in a designated column within a Pandas DataFrame.
   
2. **Image Resizing**: All images will be resized to a uniform dimension, ensuring consistency across the dataset.

3. **Data Augmentation**: The images will be augmented to increase the size of the dataset and improve the model's generalization.
   
4. **Pixel Normalization**: The pixel values will be normalized to the range [`0`, `1`] so that all inputs share a common scale.
   
5. **Train-Test Split**: The dataset will be divided into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate its performance on unseen data.


# II. Project Setup

## 1. Import Libraries

In [13]:
# Import necessary packages
import pandas as pd  # Data manipulation
import sys  # System specific parameters and functions
import importlib  # Importing modules
import os
from sklearn.model_selection import train_test_split

# Reload modules
sys.path.append("../../")  # Root directory
modules_to_reload = ["scripts.leon", "scripts.styler"]

# Reload modules if they have been modified
missing_modules = []

for module_name in modules_to_reload:
    if module_name in sys.modules:
        importlib.reload(sys.modules[module_name])
    else:
        missing_modules.append(module_name)

# Report modules that were not loaded yet; the imports below load them
if missing_modules:
    print(f"Modules {missing_modules} not found.\nRecaching...")

# Import user-defined scripts
from scripts.leon import Leon  # Leon class
from scripts.styler import Styler  # Styler class

# Configuration
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 3)

# Initialize objects
leon = Leon()
styler = Styler()


        @|\@@
       -  @@@@                                                            LEON 1.0.0
      /7   @@@@                                         This is Leon, the friendly lion. He is here to help you
     /    @@@@@@                                     Leon is tailored to manipulate images, data and visualizations
     \-' @@@@@@@@`-_______________                                      Made by: Team X
      -@@@@@@@@@             /    \                                     Version: 1.0.3
 _______/    /_       ______/      |__________-
/,__________/  `-.___/,_____________----------_)



## 2. Global Properties

In [4]:
# Define the base directory path
raw_dir = "../../data/raw/Furniture_Data/Furniture_Data/"

# III. Data Preprocessing

## 1. Invalid Data Handling

As concluded in the *exploratory data analysis* (EDA), the `jpgD` file extension does not match the `jpg` extension used by the other files. To keep the data consistent, we will rename the `jpgD` extension to `jpg`. Based on testing and observation, renaming the extension alone does not affect image quality or integrity, because the file contents are left unchanged.

This adjustment will help maintain consistency across the dataset and prevent any potential issues during the preprocessing and modeling stages.

In [5]:
# Original file path
jpgd_path = "../../data/raw/Furniture_Data/Furniture_Data/dressers/Farmhouse/30826farmhouse-coffee-tables.jpgD"

# Get the absolute path
jpgd_path = os.path.abspath(jpgd_path)

print(jpgd_path)
# Check if the file exists
if os.path.exists(jpgd_path):
    # Get the directory and file name
    directory, filename = os.path.split(jpgd_path)

    # Remove the extra characters after ".jpg" in the filename
    new_filename = filename.split(".jpg")[0] + ".jpg"

    # Create the new file path
    new_path = os.path.join(directory, new_filename)

    # Rename the file
    os.rename(jpgd_path, new_path)

    print(f"File renamed from '{jpgd_path}' to '{new_path}'")
else:
    print("File not found or has been renamed already.")

c:\Users\huuqu\Academic\RMIT\Machine Learning\Group Assignment\data\raw\Furniture_Data\Furniture_Data\dressers\Farmhouse\30826farmhouse-coffee-tables.jpgD
File not found or has been renamed already.
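To confirm that no other unexpected extensions remain, the extensions under the data directory can be counted. The following is a minimal sketch using only the standard library; the helper name `count_extensions` is hypothetical and not part of `Leon`, and the demo runs on a temporary directory rather than the real dataset.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Count file extensions (case preserved) under root, recursively."""
    return Counter(p.suffix for p in Path(root).rglob("*") if p.is_file())


# Demo on a throwaway directory so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.jpg", "c.jpgD"):
        (Path(tmp) / name).touch()
    ext_counts = count_extensions(tmp)

print(ext_counts)
```

On the real dataset, the same helper could be pointed at `raw_dir`; any key other than `.jpg` would then indicate a file that still needs attention.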


## 2. Image Labeling and Train-Test Split

In this step, we will process the image filenames to extract labels. These labels will be stored in a new column within our Pandas DataFrame. This process will enable us to conveniently access and manage the images using Pandas' built-in functionalities.
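As a minimal sketch of the labeling idea, assuming the `<class>/<style>/<file>` directory layout seen in the paths in this notebook, both labels can be read from the two directories above each file. The helper name `labels_from_path` is hypothetical. How `leon.load_data_frame` derives the labels internally is not shown here, so this is an illustration and not a description of that method.

```python
from pathlib import PureWindowsPath


def labels_from_path(img_path):
    """Return (class, style) taken from the two directories above the file."""
    # PureWindowsPath accepts both "/" and "\\" as separators, so the same
    # helper handles both path styles that appear in this notebook.
    parts = PureWindowsPath(img_path).parts
    if len(parts) < 3:
        raise ValueError(f"Expected <class>/<style>/<file>, got: {img_path}")
    return parts[-3], parts[-2]


example = r"..\..\data\raw\Furniture_Data\Furniture_Data\lamps\Asian\19274asian-table-lamps.jpg"
print(labels_from_path(example))  # ('lamps', 'Asian')
```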

In [6]:
df = leon.load_data_frame(raw_dir)

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81935 entries, 0 to 81934
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Path    81935 non-null  object
 1   Class   81935 non-null  object
 2   Style   81935 non-null  object
 3   Width   81935 non-null  int64 
 4   Height  81935 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 3.1+ MB


In [22]:
df.head()

| | Path | Class | Style | Width | Height |
|---|---|---|---|---|---|
| 0 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 1 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 2 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 3 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 4 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |


In [9]:
grouped_df = df.groupby("Style")

# Initialize empty DataFrames for train and test sets
df_train = pd.DataFrame(columns=df.columns)
df_test = pd.DataFrame(columns=df.columns)

# Split each group and concatenate train and test sets
for _, group in grouped_df:
    train_group, test_group = train_test_split(group, test_size=0.2, random_state=42)
    df_train = pd.concat([df_train, train_group])
    df_test = pd.concat([df_test, test_group])

# Write test set to CSV
df_test.to_csv("../../data/test/test.csv", index=False)

# Display info of the training set
df_train.info()

# Print the number of training images per furniture category
category_counts = df_train["Category"].value_counts()
print("\nCount of each category in the train set:")
print(category_counts)

<class 'pandas.core.frame.DataFrame'>
Index: 65546 entries, 35858 to 60909
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Path      65546 non-null  object
 1   Category  65546 non-null  object
 2   Style     65546 non-null  object
 3   Width     65546 non-null  object
 4   Height    65546 non-null  object
dtypes: object(5)
memory usage: 3.0+ MB

Count of each category in the train set:
Category
lamps       20915
chairs      16681
tables      13361
dressers     6152
beds         5235
sofas        3202
Name: count, dtype: int64
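The per-style loop above keeps roughly the same share of each style in both sets. As a hedged alternative sketch, the `stratify` argument of `train_test_split` achieves a comparable effect in a single call and avoids concatenating onto empty DataFrames, which is what turned `Width` and `Height` into `object` columns in the output above. The demo DataFrame and the helper name `stratified_split` are assumptions for illustration, and per-style counts may differ slightly from the loop because of rounding.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df, label_col="Style", test_size=0.2, seed=42):
    """Split df so that label_col keeps the same proportions in both parts."""
    return train_test_split(
        df, test_size=test_size, random_state=seed, stratify=df[label_col]
    )


# Small synthetic frame standing in for the real image table.
demo = pd.DataFrame(
    {
        "Path": [f"img_{i}.jpg" for i in range(100)],
        "Style": ["Asian"] * 60 + ["Modern"] * 40,
        "Width": [256] * 100,
    }
)
demo_train, demo_test = stratified_split(demo)

print(demo_test["Style"].value_counts())
print(demo_train.dtypes)
```

Because the rows are only selected and never concatenated with an empty frame, the integer columns keep their original dtype.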


## 3. Image Resizing

In this step, we will resize all images to a uniform dimension. This step is crucial as it ensures that all images have the same size, which is a requirement for most machine learning models.

While there's no single "**best**" image size for deep learning applications, we will resize our images to `256x256` pixels. This choice balances the need to capture sufficient detail in the images with computational efficiency. Additionally, research by O. Rukundo (Lund University) suggests that `256x256` pixels is a common and effective size for processing medical images, particularly **LGE-MRI** images. 

Although our dataset is not related to medical imaging, we treat this finding as supporting evidence that `256x256` is a reliable working resolution.

[Link to Research](https://www.mdpi.com/2079-9292/12/4/985)
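The resizing step can be sketched with Pillow as follows. This is an illustration under stated assumptions, not the implementation of `leon.resize_image`, which is not shown in this notebook. The helper name `resize_to`, the choice of the LANCZOS filter, and the conversion to RGB are all assumptions; the demo uses a generated image in a temporary directory.

```python
import tempfile
from pathlib import Path

from PIL import Image


def resize_to(src_path, dst_path, width=256, height=256):
    """Resize an image to width x height, save it, and return the new size."""
    with Image.open(src_path) as img:
        # LANCZOS is a high-quality resampling filter, well suited to downscaling.
        resized = img.convert("RGB").resize((width, height), Image.LANCZOS)
    resized.save(dst_path)
    return resized.size


# Demo on a generated 640x480 image so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src.jpg"
    dst = Path(tmp) / "dst.jpg"
    Image.new("RGB", (640, 480), color=(200, 150, 100)).save(src)
    new_size = resize_to(src, dst)
    with Image.open(dst) as check:
        saved_size = check.size

print(new_size, saved_size)
```

Resizing directly to a fixed square ignores the original aspect ratio, so non-square images are stretched. Padding or cropping before resizing would be an alternative if distortion turns out to matter.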

In [24]:
for img_path in df_train["Path"]:
    leon.resize_image(img_path, 256, 256)

df_train.head()

KeyboardInterrupt: 

## 4. Data Augmentation

In machine learning, data augmentation is a well-established technique employed to artificially expand the size of a dataset. This is achieved by applying various transformations to the existing data points. Data augmentation proves particularly valuable when dealing with limited datasets, as it mitigates the risk of overfitting and enhances the model's ability to generalize to unseen data.

While our image dataset may not be severely restricted in size, data augmentation can still provide significant benefits. We will incorporate the following transformations to augment our dataset:

1. **Random Rotation**: Images will be rotated by a random angle within a predefined range.
2. **Vertical Flip**: Images will be flipped along the vertical axis.
3. **Random Contrast Adjustment**: The contrast of each image will be adjusted by a random factor.

These transformations are widely used in image augmentation and help the model learn more robust features from the data [1]. With more than `80,000` images in the dataset, applying many augmentation techniques would make the augmented dataset computationally expensive to process, so we limit ourselves to three common and effective transformations.

After this augmentation step, the dataset will be approximately **three times** larger than the original, giving the model more varied examples to learn from and improving its ability to generalize.

[1] [Image Data Augmentation for Computer Vision](https://viso.ai/computer-vision/image-data-augmentation-for-computer-vision/#:~:text=Popular%20Types%20and%20Methods%20of%20Data%20Augmentation,-Early%20experiments%20showing&text=Geometric%20transformations%3A%20Augmenting%20image%20data,%2Fdown%2C%20or%20noise%20injection.)
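The three transformations listed above can be sketched with Pillow as shown below. This is a minimal illustration and not the implementation of `leon.augment_image`. The helper name `augment`, the rotation range of ±20 degrees, and the contrast range of 0.8 to 1.2 are assumptions chosen for the example. The flip here is top-to-bottom (`ImageOps.flip`); `ImageOps.mirror` would give the left-to-right variant instead.

```python
import random

from PIL import Image, ImageEnhance, ImageOps


def augment(img, rng, max_angle=20.0, contrast_range=(0.8, 1.2)):
    """Return three augmented copies: rotated, flipped, contrast-adjusted."""
    # Random rotation within [-max_angle, max_angle]; the canvas size is kept.
    rotated = img.rotate(rng.uniform(-max_angle, max_angle))
    # Top-to-bottom flip.
    flipped = ImageOps.flip(img)
    # A factor below 1 lowers the contrast, a factor above 1 raises it.
    contrasted = ImageEnhance.Contrast(img).enhance(rng.uniform(*contrast_range))
    return [rotated, flipped, contrasted]


rng = random.Random(42)  # Seeded so the random choices are reproducible
sample = Image.new("RGB", (256, 256), color=(120, 80, 40))
variants = augment(sample, rng)

print([v.size for v in variants])
```

Each call yields three new images per input, which matches the three `aug_*` files saved per image in the output further below.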

In [10]:
def remove_output_folders_and_files(df_train):
    processed_directories = set()

    for img_path in df_train["Path"]:
        # Extract directory name
        directory = os.path.dirname(img_path)

        # Check if directory has already been processed
        if directory in processed_directories:
            continue

        # Add directory to the set of processed directories
        processed_directories.add(directory)

        # Remove files starting with "aug_" if they exist
        files_in_directory = os.listdir(directory)
        for file_name in files_in_directory:
            if file_name.startswith("aug_"):
                file_path = os.path.join(directory, file_name)
                try:
                    os.remove(file_path)
                    print(f"Removed file: {file_path}")
                except OSError as e:
                    print(f"Error removing file: {file_path}, {e}")

In [14]:
# Remove augmented files left over from previous runs
remove_output_folders_and_files(df_train)

# Augment only the first two images as a quick check
count = 0
for img_path in df_train["Path"]:
    if count == 2:
        break
    # Save the augmented copies next to the source image
    directory = os.path.dirname(img_path)

    # Augment the image; augment_image returns the updated DataFrame
    df_train = leon.augment_image(
        image_path=img_path, output_dir=directory, df_train=df_train
    )

    count += 1

# Display the first few rows of the updated DataFrame
df_train.head()

Removed file: ..\..\data\raw\Furniture_Data\Furniture_Data\lamps\Asian\aug_19274asian-table-lamps.jpg_0.jpg
Removed file: ..\..\data\raw\Furniture_Data\Furniture_Data\tables\Asian\aug_18358asian-plant-stands-and-telephone-tables.jpg_0.jpg
╭───────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Augmenting image: ..\..\data\raw\Furniture_Data\Furniture_Data\lamps\Asian\19274asian-table-lamps.jpg │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────╯
  - Saved augmented image: aug_19274asian-table-lamps.jpg_0.jpg
  - Saved augmented image: aug_19274asian-table-lamps.jpg_1.jpg
  - Saved augmented image: aug_19274asian-table-lamps.jpg_2.jpg
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Augmenting image: ..\..\data\raw\Furniture_Data\Furniture_Data\tables\Asian\18358asian-plant-stands-and-telephone-tables

| | Path | Category | Style | Width | Height |
|---|---|---|---|---|---|
| 0 | `..\..\data\raw\Furniture_Data\Furniture_Data\l...` | lamps | Asian | 256 | 256 |
| 1 | `..\..\data\raw\Furniture_Data\Furniture_Data\t...` | tables | Asian | 256 | 256 |
| 2 | `..\..\data\raw\Furniture_Data\Furniture_Data\t...` | tables | Asian | 256 | 256 |
| 3 | `..\..\data\raw\Furniture_Data\Furniture_Data\t...` | tables | Asian | 256 | 256 |
| 4 | `..\..\data\raw\Furniture_Data\Furniture_Data\c...` | chairs | Asian | 256 | 256 |
