# About
This template is designed to efficiently preprocess large datasets, particularly images, with the goal of minimizing memory usage.\
In addition to images, metadata (bounding boxes, labels, etc.) can be preprocessed and converted to the desired format.\
All processed data is then stored in a Pandas DataFrame and saved as a .pkl file to preserve Python data structures.
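
As a minimal sketch of why a .pkl file is used, the snippet below round-trips a one-row DataFrame through a pickle file and checks that the bytes and nested-list values come back unchanged. The sample values and the file name `example.pkl` are hypothetical and chosen only for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample row: raw bytes, a nested list of boxes, an integer label
df = pd.DataFrame({
    "images": [b"\xff\xd8\xff\xe0 placeholder bytes"],
    "annotations": [[[10.0, 20.0, 110.0, 220.0]]],
    "metadata": [3],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")
    df.to_pickle(path)                # serialize the whole DataFrame
    restored = pd.read_pickle(path)   # load it back

# The bytes object and the nested list keep their Python types
print(type(restored.loc[0, "images"]))
print(restored.loc[0, "annotations"])
```

Pickle files can execute code when loaded, so only read .pkl files that come from a source you trust.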

By preprocessing data in this manner, you can significantly reduce training time and enable the use of larger datasets on older or less powerful hardware.\
Additionally, this approach allows you to standardize input data so that model training code can be reused with various datasets.

For examples of potential edge cases and solutions, refer to the Notable Scenarios section at the bottom of this file.

# Imports

In [1]:
# Data Extraction
from PIL import Image
import io
import pandas as pd

# QoL
from tqdm import tqdm

# Create your Preprocessing class
- Initialize your dataframe with desired columns such as 'images', 'annotations', 'metadata'
- Load in your dataset
- Create methods to transform data for each of the desired columns

In [None]:
class Preprocess():

    def __init__(self, DESIRED_IMAGE_SIZE=(400, 400), DESIRED_FILE_NAME='transformed_data.pkl'):
        self.dataframe = pd.DataFrame(columns = ['images', 'annotations', 'metadata'])
        self.dataset = pd.read_pickle('pickleddata.pkl')
        self.DESIRED_IMAGE_SIZE = DESIRED_IMAGE_SIZE
        self.DESIRED_FILE_NAME = DESIRED_FILE_NAME
    
    
    """
    Loads an image, resizes it, and converts it (back) to bytes in JPEG format.
    This template method assumes that you are loading a pickle file with images
    saved as bytes. If you instead work with JPEG or PNG files saved on disk,
    pass the file path to Image.open directly; the io.BytesIO wrapper is unnecessary.
    """
    def transform_image(self, row):
        img = Image.open(io.BytesIO(row['images'])).convert('RGB')
        img = img.resize(self.DESIRED_IMAGE_SIZE, Image.Resampling.LANCZOS)
        img_byte_arr = io.BytesIO()
        img.save(img_byte_arr, format='JPEG')

        return img_byte_arr.getvalue()


    """
    Loops through all known bounding boxes in the given image
    Rescales bounding boxes to match the desired image size
    Assumes boxes use Pascal VOC ordering (xmin, ymin, xmax, ymax) with
    coordinates normalized to [0, 1], since each value is multiplied by the target size
    """
    def transform_annotation(self, row):
        boxes = []
        for box in row['annotations']:
            xmin = box[0] * self.DESIRED_IMAGE_SIZE[0]
            ymin = box[1] * self.DESIRED_IMAGE_SIZE[1]
            xmax = box[2] * self.DESIRED_IMAGE_SIZE[0]
            ymax = box[3] * self.DESIRED_IMAGE_SIZE[1]
            boxes.append([xmin, ymin, xmax, ymax])

        return boxes


    """
    Returns metadata as a number for labelling purposes
    In this example there are 4 possible labels to come across
    Method converts from string data to numbers
    """
    def transform_metadata(self, row):
        metadata = row['metadata']
        # Compare strings with ==; `is` checks object identity and is unreliable here
        if metadata == 'MaskWornCorrectly':
            return 3
        elif metadata == 'MaskWornIncorrectly':
            return 2
        elif metadata == 'NoMaskWorn':
            return 1
        else:
            return 0 # label meaning 'background' or 'no object'


    """
    Loops over all data and calls helper transform methods
    Builds our dataframe using the transformed data
    """
    def preprocess(self):
        rows = []
        # iterrows() yields (index, row) pairs; iterating the DataFrame itself yields column names
        for idx, row in tqdm(self.dataset.iterrows(), total=len(self.dataset)):
            image = self.transform_image(row)
            annotation = self.transform_annotation(row)
            metadata = self.transform_metadata(row)
            rows.append([image, annotation, metadata])

        # Build the DataFrame once instead of concatenating (and copying) on every iteration
        self.dataframe = pd.DataFrame(rows, columns=self.dataframe.columns)
        return self.dataframe
    

    """
    Saves data as a .pkl file
    This maintains complex Python data structures and data types, such as bytes
    """
    def pickle_data(self):
        self.dataframe.to_pickle(self.DESIRED_FILE_NAME)
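
The scaling in `transform_annotation` multiplies each coordinate by the target size, which yields pixel coordinates only when the input boxes are normalized to [0, 1]. As a rough sketch, the two helpers below (both hypothetical names, not part of the template) show that case next to the case where the boxes are already in pixel coordinates of the original image. The image sizes are made-up example values.

```python
def scale_normalized_boxes(boxes, target_size):
    """Scale [xmin, ymin, xmax, ymax] boxes with coordinates in [0, 1] to pixels."""
    width, height = target_size
    return [[xmin * width, ymin * height, xmax * width, ymax * height]
            for xmin, ymin, xmax, ymax in boxes]


def rescale_pixel_boxes(boxes, original_size, target_size):
    """Rescale pixel-coordinate boxes from the original image size to the target size."""
    x_scale = target_size[0] / original_size[0]
    y_scale = target_size[1] / original_size[1]
    return [[xmin * x_scale, ymin * y_scale, xmax * x_scale, ymax * y_scale]
            for xmin, ymin, xmax, ymax in boxes]


target = (400, 400)  # (width, height)

# Case 1: normalized input, as the template method assumes
normalized = scale_normalized_boxes([[0.25, 0.5, 0.75, 1.0]], target)

# Case 2: pixel input from a hypothetical 800x1600 original image
pixels = rescale_pixel_boxes([[200, 400, 600, 1600]], (800, 1600), target)

print(normalized)  # [[100.0, 200.0, 300.0, 400.0]]
print(pixels)      # [[100.0, 100.0, 300.0, 400.0]]
```

If your annotations store absolute pixel values, the second form is the one to adapt, and the original image size has to be read before the image is resized.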
            

# Run the program

In [None]:
DESIRED_IMAGE_SIZE = (360, 500)
DESIRED_FILE_NAME = 'transformed_data.pkl'
pp = Preprocess(DESIRED_IMAGE_SIZE, DESIRED_FILE_NAME)
pp.preprocess()
pp.pickle_data()

# Notable Scenarios:

## Images and Metadata are separated
- Create a list of image file paths and use the idx value from .iterrows() to index into this list
```python
import os
image_files = sorted(os.listdir("archive/train_images/"))  # sorted: listdir order is arbitrary
```
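
A minimal sketch of the pairing step follows. It assumes the metadata rows are in the same order as the sorted file names and that the DataFrame has a default 0..n-1 index. The file names and the `label` column are hypothetical.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as image_dir:
    # Hypothetical empty files standing in for real images
    for name in ["img_002.jpg", "img_000.jpg", "img_001.jpg"]:
        open(os.path.join(image_dir, name), "wb").close()

    # Keep only image files and sort them for a stable order
    image_paths = sorted(
        os.path.join(image_dir, name)
        for name in os.listdir(image_dir)
        if name.lower().endswith((".jpg", ".jpeg", ".png"))
    )

    # Metadata loaded separately; row i is assumed to describe image i
    metadata = pd.DataFrame(
        {"label": ["NoMaskWorn", "MaskWornCorrectly", "MaskWornIncorrectly"]}
    )

    # idx from iterrows() selects the matching path
    pairs = [
        (os.path.basename(image_paths[idx]), row["label"])
        for idx, row in metadata.iterrows()
    ]

print(pairs)
```

If the metadata contains a file-name column, matching on that column is safer than relying on position.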

---
## Data stored in XML File
- Use BeautifulSoup to parse the XML file data
```python
from bs4 import BeautifulSoup

# `file` is the path to one annotation file; the 'xml' parser requires lxml to be installed
with open(file) as f:
    soup = BeautifulSoup(f.read(), 'xml')

for obj in soup.find_all('object'):
    xmin = int(obj.find('xmin').text)  # Example bounding box info
    label = obj.find('name').text  # Example label info (text, not the tag object)
```
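
If installing BeautifulSoup and lxml is not an option, the same extraction can be sketched with the standard library's `xml.etree.ElementTree` instead. This is a swapped-in alternative, not the approach shown above. The inline XML is a hypothetical annotation in the Pascal VOC layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation with two labelled objects
xml_text = """
<annotation>
  <size><width>800</width><height>600</height></size>
  <object>
    <name>MaskWornCorrectly</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>360</xmax><ymax>400</ymax></bndbox>
  </object>
  <object>
    <name>NoMaskWorn</name>
    <bndbox><xmin>500</xmin><ymin>90</ymin><xmax>700</xmax><ymax>380</ymax></bndbox>
  </object>
</annotation>
""".strip()

root = ET.fromstring(xml_text)

records = []
for obj in root.iter("object"):
    box = obj.find("bndbox")
    # Read the four corners in Pascal VOC order
    coords = [int(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
    records.append({"label": obj.find("name").text, "box": coords})

print(records)
```

For files on disk, `ET.parse(path).getroot()` replaces `ET.fromstring(xml_text)`.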

---
## Using HuggingFace
- Load the desired data using the load_dataset function from the datasets library
- Loop through the dataset like a list of dicts, not like a pandas DataFrame
```python
from datasets import load_dataset

# Example with parquet file
data_files = {"train": "train*", "test": "test*"}
dataset_train = load_dataset("parquet", data_dir="C:\\Users\\User\\dataset_name\\data\\", data_files=data_files, split="train[:20%]")

for element in dataset_train:
    pass
```
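
Iterating a loaded split yields one dict per example, keyed by column name. The sketch below uses a plain list of dicts as a stand-in for `dataset_train` so it runs without downloading anything. The column names `image` and `label` are hypothetical, and `dataset_train.column_names` lists the real ones.

```python
# Stand-in for a loaded split; a real datasets.Dataset iterates the same way
dataset_train = [
    {"image": b"raw-bytes-0", "label": "NoMaskWorn"},
    {"image": b"raw-bytes-1", "label": "MaskWornCorrectly"},
]

rows = []
for element in dataset_train:
    # Access fields by key instead of using .iterrows() or .loc
    rows.append([element["image"], element["label"]])

print(len(rows), rows[0][1])
```

The collected `rows` list can then be passed through the same transform methods as the pandas-based loop.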