## Data Preprocessing
This notebook handles the preprocessing of the dataset that is used to train the model.
<br><br>

### __Dataset Details__
The data consists of labeled images of rocks and minerals organized by their respective classes: Igneous, Sedimentary, and Metamorphic. <br><br>
There are two sets of data, the [__training data__](#training-data-processing) and the [__testing data__](#test-data-processing). Each set follows the same structure outlined above, with the only difference coming from how the images are preprocessed. <br><br>

### __Preprocessing Transformations__
#### Resizing
Image size affects several aspects of training a model, chiefly __performance__ and __detail__. Because rocks and minerals are the target of identification, I'll be training the model with an image size of __256px x 256px__. The larger size gives the model more detail to train on, which will hopefully translate to improved accuracy when it is fed sub-optimal images to identify.
<br><br>
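
To put a number on the performance side of this trade-off, here is a rough sketch that counts how many values a single RGB image contributes at two sizes. The helper name `values_per_image` and the 128px comparison size are illustrative assumptions, not part of this notebook.

```python
def values_per_image(side: int, channels: int = 3) -> int:
    """Number of scalar values in a square image tensor of shape (channels, side, side)."""
    return channels * side * side

for side in (128, 256):
    n = values_per_image(side)
    # A float32 tensor stores 4 bytes per value
    print(f"{side}px: {n} values, {n * 4 / 1024:.0f} KiB as float32")
# 128px: 49152 values, 192 KiB as float32
# 256px: 196608 values, 768 KiB as float32
```

Doubling the side length quadruples the data per image, which is the cost paid for the extra detail.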

#### Augmentation
Both the `.RandomHorizontalFlip()` and `.RandomRotation()` transforms are used to augment the images, providing the model with a greater variety of training data from a limited original dataset.
<br><br>
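
As a rough illustration of what these two augmentations do, the sketch below mimics them in plain Python on a tiny nested-list "image". It is not torchvision's implementation. The function names are hypothetical, and the defaults assume torchvision's documented behavior: a flip probability of 0.5, and an angle drawn uniformly from `[-degrees, +degrees]` when a single number is passed to `RandomRotation`.

```python
import random

def random_horizontal_flip(image, p=0.5, rng=random):
    """Mirror every row left-to-right with probability p, else return the image unchanged."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image

def sample_rotation_angle(degrees=10, rng=random):
    """Draw a rotation angle uniformly from [-degrees, +degrees]."""
    return rng.uniform(-degrees, degrees)

tiny_image = [[1, 2, 3],
              [4, 5, 6]]
print(random_horizontal_flip(tiny_image, p=1.0))  # p=1.0 forces the flip
print(sample_rotation_angle(10))                  # some angle between -10 and 10
```

Because the randomness is re-drawn every time an image is loaded, the model sees a slightly different version of each image on every epoch.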

#### Tensor Conversion
...
<br><br>

#### Normalization
...

***

### Step 1: Splitting dataset
The first step is to split the data into a set for training and a set for testing. I'm going with an 80/20 split.

In [1]:
# imports for splitting data into training and testing data
import os
from sklearn.model_selection import train_test_split

In [15]:
# Path to the dataset (os.path.join adds the separator, so none is hard-coded)
dataset_path = 'rock_dataset'

# Arrays storing image paths & labels for use with
# train_test_split function
image_paths = []
image_labels = []

for folder in os.listdir(dataset_path):
    # Top-level folders: igneous, metamorphic, sedimentary
    for subfolder in os.listdir(os.path.join(dataset_path, folder)):
        subfolder_path = os.path.join(dataset_path, folder, subfolder)
        for rock in os.listdir(subfolder_path):
            image_paths.append(os.path.join(subfolder_path, rock))
            # The subfolder name (not the top-level folder) is used as the label
            image_labels.append(subfolder)

# Now we split the data up into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    image_paths,
    image_labels,
    test_size=0.2,
    stratify=image_labels,
    random_state=42,
)
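
Since the split is stratified, each label should make up roughly the same share of the training and testing sets. Here is a minimal sketch of one way to check that. The helper `class_proportions` and the toy labels are hypothetical stand-ins, not part of this notebook.

```python
from collections import Counter

def class_proportions(labels):
    """Return the fraction of samples belonging to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy stand-ins for y_train / y_test (hypothetical labels, 80/20 split)
toy_train = ["granite"] * 8 + ["marble"] * 4
toy_test = ["granite"] * 2 + ["marble"] * 1

print(class_proportions(toy_train))
print(class_proportions(toy_test))
```

In the notebook itself, the same check would be `class_proportions(y_train)` against `class_proportions(y_test)`.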

### Step 2: Processing The Data

In [None]:
# Using PyTorch to preprocess images
from torchvision import transforms
from torchvision import datasets

# Constant for image size (pixels per side)
IMAGE_SIZE = 256

#### Training Data Processing

In [None]:
training_data_processing = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])
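
To see what the last two steps do to pixel values: `ToTensor()` scales 8-bit values from `[0, 255]` to `[0.0, 1.0]`, and `Normalize` then applies `(value - mean) / std` per channel. With a mean and std of 0.5, that maps the range to `[-1.0, 1.0]`. The plain-Python sketch below reproduces the arithmetic for a single value. The function names are illustrative, not torchvision APIs.

```python
def to_unit_range(pixel: int) -> float:
    """Mimic the scaling ToTensor applies to 8-bit pixel values: [0, 255] -> [0.0, 1.0]."""
    return pixel / 255

def normalize(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Apply the per-channel formula used by Normalize: (value - mean) / std."""
    return (value - mean) / std

for pixel in (0, 255):
    print(pixel, normalize(to_unit_range(pixel)))
# 0 -1.0
# 255 1.0
```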

#### Test Data Processing
Test data is processed similarly to the training data, except that no augmentation is applied.

In [None]:
test_data_processing = transforms.Compose([
    transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])
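
Because Step 1 produces lists of paths and labels rather than a folder per split, one possible way to connect them to these pipelines is a small map-style dataset class. What follows is only a sketch under assumptions: the class name `RockImageDataset`, its parameters, and the use of Pillow for loading are all hypothetical and not taken from this notebook.

```python
class RockImageDataset:
    """Map-style dataset over parallel lists of image paths and string labels."""

    def __init__(self, paths, labels, transform=None, loader=None, class_to_idx=None):
        if len(paths) != len(labels):
            raise ValueError("paths and labels must have the same length")
        self.paths = list(paths)
        self.labels = list(labels)
        self.transform = transform
        self.loader = loader or self._load_rgb
        # Sorted so the label -> index mapping is deterministic
        self.class_to_idx = class_to_idx or {
            name: i for i, name in enumerate(sorted(set(self.labels)))
        }

    @staticmethod
    def _load_rgb(path):
        # Imported lazily; assumes Pillow is installed
        from PIL import Image
        with Image.open(path) as img:
            return img.convert("RGB")

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        image = self.loader(self.paths[index])
        if self.transform is not None:
            image = self.transform(image)
        return image, self.class_to_idx[self.labels[index]]
```

Under these assumptions, the training set would be `RockImageDataset(X_train, y_train, transform=training_data_processing)`, and the test set would pass `transform=test_data_processing` along with `class_to_idx=train_dataset.class_to_idx` so that both splits share one label mapping. To use it with PyTorch's data loading utilities, the class could also subclass `torch.utils.data.Dataset`.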