<center>

### COSC2753 - Machine Learning

# **Data Preprocessing**

<center>────────────────────────────</center>
&nbsp;

# I. Introduction

In this notebook, we will apply common data preprocessing techniques to the dataset, building on the analysis conducted during the *exploratory data analysis* (EDA) steps. Data preprocessing is essential in the machine learning pipeline as it helps clean, transform, and prepare the data for the model.

The following preprocessing steps will be implemented:

1. **Image Labeling**: Labels associated with each image will be extracted from the filenames and stored in a designated column within a Pandas DataFrame.
   
2. **Image Resizing**: All images will be resized to a uniform dimension, ensuring consistency across the dataset.

3. **Data Augmentation**: The images will be augmented to increase the size of the dataset and improve the model's generalization.
   
4. **Pixel Normalization**: The pixel values will be normalized to the range [`0`, `1`] so that all inputs share a common scale during training.
   
5. **Train-Test Split**: The dataset will be divided into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate its performance on unseen data.
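
Steps 4 and 5 can be illustrated with a minimal sketch, assuming the images are available as 8-bit NumPy arrays. The helper names `normalize_pixels` and `train_test_split_indices`, the 80/20 ratio, and the seed are hypothetical choices for illustration and are not taken from the project's scripts.

```python
import numpy as np


def normalize_pixels(images: np.ndarray) -> np.ndarray:
    """Scale 8-bit pixel values from [0, 255] to [0, 1] (hypothetical helper)."""
    return images.astype(np.float32) / 255.0


def train_test_split_indices(n_samples: int, test_ratio: float = 0.2, seed: int = 42):
    """Shuffle sample indices and split them into train and test parts.

    The 80/20 ratio and the seed are assumptions, not values from the notebook.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_ratio))
    return indices[n_test:], indices[:n_test]


# Tiny synthetic batch standing in for real images: 10 RGB images of 4x4 pixels
batch = np.random.default_rng(0).integers(0, 256, size=(10, 4, 4, 3), dtype=np.uint8)
scaled = normalize_pixels(batch)
train_idx, test_idx = train_test_split_indices(len(batch))
```

Splitting by index rather than by copying arrays keeps the sketch cheap and makes it easy to apply the same split to a DataFrame of image paths.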


# II. Project Setup

## 1. Import Libraries

In [1]:
# Import necessary packages
import pandas as pd  # Data manipulation
import sys  # System specific parameters and functions
import importlib  # Importing modules
import os  # File system paths and operations

# Reload modules
sys.path.append("../../")  # Root directory
modules_to_reload = ["scripts.leon", "scripts.styler"]

# Reload modules if they have been modified
missing_modules = []

for module_name in modules_to_reload:
    if module_name in sys.modules:
        importlib.reload(sys.modules[module_name])
    else:
        missing_modules.append(module_name)

# Recache missing modules
if missing_modules:
    print(f"Modules {missing_modules} not found.\nRecaching...")

# Import user-defined scripts
from scripts.leon import Leon  # Leon class
from scripts.styler import Styler  # Styler class

# Configuration
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 3)

# Initialize objects
leon = Leon()
styler = Styler()

## 2. Global Properties

In [None]:
# Define the base directory path
raw_dir = "../../data/raw/Furniture_Data/Furniture_Data/"

# III. Data Preprocessing

## 1. Invalid Data Handling

As concluded in the *exploratory data analysis* (EDA), the dataset contains a file with the extension `jpgD`, which does not match the `jpg` extension used by the other images. To keep the data consistent, we will rename this file so that it uses the `jpg` extension. Based on testing and observation, changing only the extension from `jpgD` to `jpg` does not affect the image quality or integrity, since the file contents are left untouched.

This adjustment will help maintain consistency across the dataset and prevent any potential issues during the preprocessing and modeling stages.
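
One way to back up the integrity claim is to check that a file still starts with the JPEG signature regardless of its extension. The following is a minimal sketch using only the standard library; the helper `looks_like_jpeg` and the temporary sample files are hypothetical and are not part of the project's scripts.

```python
import tempfile
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8\xff"  # First three bytes of a JPEG file


def looks_like_jpeg(path) -> bool:
    """Return True if the file starts with the JPEG signature (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(3) == JPEG_MAGIC


# Demonstration on throwaway files, not on the real dataset
with tempfile.TemporaryDirectory() as tmp:
    # A file with JPEG content but an unusual extension
    fake_jpgd = Path(tmp) / "sample.jpgD"
    fake_jpgd.write_bytes(JPEG_MAGIC + b"\xe0" + b"\x00" * 16)

    # A file with the right extension but non-JPEG content
    not_an_image = Path(tmp) / "notes.jpg"
    not_an_image.write_bytes(b"plain text")

    jpgd_is_jpeg = looks_like_jpeg(fake_jpgd)
    text_is_jpeg = looks_like_jpeg(not_an_image)
```

The check depends on the file contents only, which is why renaming the extension leaves the result unchanged.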

In [None]:
# Original file path
jpgd_path = "../../data/raw/Furniture_Data/Furniture_Data/dressers/Farmhouse/30826farmhouse-coffee-tables.jpgD"

# Get the absolute path
jpgd_path = os.path.abspath(jpgd_path)

print(jpgd_path)
# Check if the file exists
if os.path.exists(jpgd_path):
    # Get the directory and file name
    directory, filename = os.path.split(jpgd_path)

    # Remove the extra characters after ".jpg" in the filename
    new_filename = filename.split(".jpg")[0] + ".jpg"

    # Create the new file path
    new_path = os.path.join(directory, new_filename)

    # Rename the file
    os.rename(jpgd_path, new_path)

    print(f"File renamed from '{jpgd_path}' to '{new_path}'")
else:
    print("File not found or has been renamed already.")

c:\Users\huuqu\Academic\RMIT\Machine Learning\Group Assignment\data\raw\Furniture_Data\Furniture_Data\dressers\Farmhouse\30826farmhouse-coffee-tables.jpgD
File not found or has been renamed already.


## 2. Image Labeling

In this step, we will process the image filenames to extract labels. These labels will be stored in a new column within our Pandas DataFrame. This process will enable us to conveniently access and manage the images using Pandas' built-in functionalities.
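
The implementation of `leon.load_data_frame` is not shown in this notebook. As a rough sketch of the idea, assuming the labels follow the folder layout `<class>/<style>/<file>` seen in the dataset paths, the extraction could look like this. The helper `labels_from_path` and the example file name are hypothetical.

```python
from pathlib import PureWindowsPath


def labels_from_path(img_path: str) -> dict:
    """Derive Class and Style from a path shaped like .../<class>/<style>/<file>."""
    path = PureWindowsPath(img_path)  # Accepts both "\\" and "/" separators
    return {
        "Path": img_path,
        "Class": path.parent.parent.name,  # Furniture category folder
        "Style": path.parent.name,  # Style folder that contains the image
    }


# Hypothetical file name, used only for illustration
record = labels_from_path(
    "../../data/raw/Furniture_Data/Furniture_Data/beds/Asian/example.jpg"
)
```

A list of such records can be passed to `pd.DataFrame` to obtain the `Path`, `Class`, and `Style` columns.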

In [None]:
df_train = leon.load_data_frame(raw_dir)

In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81935 entries, 0 to 81934
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Path    81935 non-null  object
 1   Class   81935 non-null  object
 2   Style   81935 non-null  object
 3   Width   81935 non-null  int64 
 4   Height  81935 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 3.1+ MB


In [None]:
df_train.head()

| | Path | Class | Style | Width | Height |
|---|---|---|---|---|---|
| 0 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 1 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 2 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 3 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 4 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |


## 3. Image Resizing

In this step, we will resize all images to a uniform dimension. This step is crucial as it ensures that all images have the same size, which is a requirement for most machine learning models.

While there is no single "**best**" image size for deep learning applications, we will resize our images to `256x256` pixels. This choice balances the need to retain sufficient detail in the images against computational cost. Additionally, research by O. Rukundo (Lund University) suggests that `256x256` pixels is a common and effective size for processing medical images, particularly **LGE-MRI** images.

Although our dataset is not related to medical imaging, we treat this finding as supporting evidence that `256x256` is a reliable choice.

[Link to Research](https://www.mdpi.com/2079-9292/12/4/985)
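
The body of `leon.resize_image` is not shown here. The following is a minimal sketch of resizing with Pillow, under the assumption that images are stretched to the target size rather than padded or cropped. The helper `resize_to_square` is hypothetical, and the synthetic image stands in for a file from the dataset.

```python
from PIL import Image


def resize_to_square(img: Image.Image, size: int = 256) -> Image.Image:
    """Return a copy of the image resized to size x size pixels.

    The aspect ratio is not preserved, so non-square inputs are stretched.
    """
    return img.resize((size, size))


# Synthetic 320x200 image standing in for a file from the dataset
original = Image.new("RGB", (320, 200), color=(120, 80, 40))
resized = resize_to_square(original)
```

`Image.resize` returns a new image and leaves the original untouched, so the result has to be saved explicitly if the file on disk should change.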

In [None]:
for img_path in df_train["Path"]:
    leon.resize_image(img_path, 256, 256)

df_train.head()

| | Path | Class | Style | Width | Height |
|---|---|---|---|---|---|
| 0 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 1 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 2 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 3 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |
| 4 | `..\..\data\raw\Furniture_Data\Furniture_Data\b...` | beds | Asian | 256 | 256 |


## 4. Data Augmentation

Data augmentation is a technique used to artificially increase the size of the dataset by applying various transformations to the existing data. This technique is particularly useful when working with limited data, as it helps prevent overfitting and improves the model's generalization.

For our image dataset, although the data is not severely limited, data augmentation can still be beneficial. We will apply the following transformations to augment our dataset:

- **Random Rotation**: Rotates the image by a random angle within a specified range.
- **Vertical Flip**: Flips the image vertically.
- **Zoom**: Zooms into the image by a specified factor.

The above transformations are commonly used and can help the model learn robust features from the data. [1]

At the end of this step, we will have an augmented dataset that is approximately `3` times larger than the original dataset.

[1] [Image Data Augmentation for Computer Vision](https://viso.ai/computer-vision/image-data-augmentation-for-computer-vision/#:~:text=Popular%20Types%20and%20Methods%20of%20Data%20Augmentation,-Early%20experiments%20showing&text=Geometric%20transformations%3A%20Augmenting%20image%20data,%2Fdown%2C%20or%20noise%20injection.)
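
The three transformations listed above can be sketched with Pillow as follows. This is an illustration only: the function names, the rotation range of 20 degrees, and the zoom factor of 1.2 are assumptions, and the actual behaviour of `leon.augment_images` may differ.

```python
import random

from PIL import Image, ImageOps


def random_rotation(img, max_angle=20.0, rng=random):
    """Rotate by a random angle in [-max_angle, max_angle] degrees."""
    angle = rng.uniform(-max_angle, max_angle)
    return img.rotate(angle)  # Output keeps the original size


def vertical_flip(img):
    """Flip the image top to bottom."""
    return ImageOps.flip(img)


def zoom(img, factor=1.2):
    """Zoom into the centre by cropping and scaling back to the original size."""
    width, height = img.size
    crop_w, crop_h = int(width / factor), int(height / factor)
    left, top = (width - crop_w) // 2, (height - crop_h) // 2
    cropped = img.crop((left, top, left + crop_w, top + crop_h))
    return cropped.resize((width, height))


# Synthetic image with one red pixel in the top-left corner
sample = Image.new("RGB", (256, 256), color=(0, 0, 0))
sample.putpixel((0, 0), (255, 0, 0))

rotated = random_rotation(sample, rng=random.Random(0))
flipped = vertical_flip(sample)
zoomed = zoom(sample)
```

Each function returns a new image of the same size as its input, so the augmented images can be fed to the model without a further resizing step.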

In [None]:
import shutil  # High-level file operations

# Keep track of processed directories
processed_directories = set()

for img_path in df_train["Path"]:
    # Extract directory name
    directory = os.path.dirname(img_path)

    # Check if directory has already been processed
    if directory in processed_directories:
        continue

    # Add directory to the set of processed directories
    processed_directories.add(directory)

    # Remove the output folder and its contents if it exists.
    # shutil.rmtree is used because os.removedirs only removes empty
    # directories and would also prune empty parent directories.
    try:
        output_dir = os.path.join(directory, "output")
        output_dir = os.path.abspath(output_dir)
        print(output_dir)
        shutil.rmtree(output_dir)
    except FileNotFoundError:
        pass

    leon.augment_images(directory)

c:\Users\huuqu\Academic\RMIT\Machine Learning\Group Assignment\data\raw\Furniture_Data\Furniture_Data\beds\Asian\output
c:\Users\huuqu\Academic\RMIT\Machine Learning\Group Assignment\data\raw\Furniture_Data\Furniture_Data\beds\Beach\output
c:\Users\huuqu\Academic\RMIT\Machine Learning\Group Assignment\data\raw\Furniture_Data\Furniture_Data\beds\Contemporary\output
c:\Users\huuqu\Academic\RMIT\Machine Learning\Group Assignment\data\raw\Furniture_Data\Furniture_Data\beds\Craftsman\output
c:\Users\huuqu\Academic\RMIT\Machine Learning\Group Assignment\data\raw\Furniture_Data\Furniture_Data\beds\Eclectic\output
c:\Users\huuqu\Academic\RMIT\Machine Learning\Group Assignment\data\raw\Furniture_Data\Furniture_Data\beds\Farmhouse\output
c:\Users\huuqu\Academic\RMIT\Machine Learning\Group Assignment\data\raw\Furniture_Data\Furniture_Data\beds\Industrial\output
c:\Users\huuqu\Academic\RMIT\Machine Learning\Group Assignment\data\raw\Furniture_Data\Furniture_Data\beds\Mediterranean\output
c:\Users\