# The Predictive Pioneers Pipeline Instructions

We use a Support Vector Machine (SVM) classifier to make the predictions.

### Step 1

Load the training data and preprocess RGB images.

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from skimage.io import imread
from tqdm import tqdm

# Define constants
IMAGE_SIZE = (32, 32)
RANDOM_STATE = 42
N_COMPONENTS = 100  # Number of principal components to keep

# Get the current directory of the notebook
base_path = os.getcwd()
train_csv_path = os.path.join(base_path, 'train.csv')
test_csv_path = os.path.join(base_path, 'test.csv')
train_ims_path = os.path.join(base_path, 'train_ims')
test_ims_path = os.path.join(base_path, 'test_ims')

# Load training data
train_df = pd.read_csv(train_csv_path)

# Function to load and preprocess RGB images
def load_images_with_color(image_folder, image_filenames, image_size):
    images = []
    for filename in tqdm(image_filenames, desc="Loading images"):
        img_path = os.path.join(image_folder, filename)
        img = imread(img_path)  # Load as RGB
        if img.shape[:2] != image_size:  # Ensure correct size
            raise ValueError(f"Image {filename} has incorrect dimensions {img.shape[:2]}, expected {image_size}")
        images.append(img.flatten())  # Flatten RGB channels into a single vector
    return np.array(images)

# Load training images and labels
train_images = load_images_with_color(train_ims_path, train_df['im_name'], IMAGE_SIZE)
train_labels = train_df['label'].values

Loading images: 100%|██████████| 50000/50000 [00:22<00:00, 2230.28it/s]
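As a quick illustration of what the loader produces, the sketch below flattens a synthetic 32x32 RGB array into a single feature vector of 32 * 32 * 3 = 3072 values. It also shows that a check on `img.shape[:2]` alone would accept a grayscale image, which would flatten to a different length. The random arrays and the `check_rgb_shape` helper are hypothetical and are not part of the pipeline above.

```python
import numpy as np

def check_rgb_shape(img, image_size):
    """Hypothetical helper: True only for an (H, W, 3) array of the expected size."""
    return img.shape == (*image_size, 3)

rng = np.random.default_rng(0)
# Synthetic stand-ins for an RGB image and a grayscale image (assumption: uint8 pixels)
rgb = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
gray = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)

flat = rgb.flatten()  # one row of the feature matrix
print(flat.shape)     # (3072,)

# Both images pass a check on shape[:2]; only the RGB one passes the stricter check
print(rgb.shape[:2] == (32, 32), gray.shape[:2] == (32, 32))             # True True
print(check_rgb_shape(rgb, (32, 32)), check_rgb_shape(gray, (32, 32)))   # True False
```

If the dataset is known to contain only RGB files, the original check is sufficient; the stricter variant only matters when grayscale or RGBA files might be mixed in.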


### Step 2

Standardize the data and apply PCA with 100 components to reduce dimensionality.

In [2]:
# Apply PCA to reduce dimensionality
print("Applying PCA for dimensionality reduction...")
scaler = StandardScaler()
train_images_scaled = scaler.fit_transform(train_images)  # Standardize the data

pca = PCA(n_components=N_COMPONENTS, random_state=RANDOM_STATE)
train_images_pca = pca.fit_transform(train_images_scaled)

Applying PCA for dimensionality reduction...
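One way to judge whether 100 components is a reasonable choice is to look at the cumulative explained variance of the fitted PCA. The sketch below does this on synthetic data, not on the actual images; the array sizes and the 95% threshold are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for train_images: 500 samples, 3072 features, 20 latent factors
latent = rng.normal(size=(500, 20))
mixing = rng.normal(size=(20, 3072))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 3072))

X_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=100, random_state=42).fit(X_scaled)

# Running total of the variance fraction captured by the first k components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"Variance retained by 100 components: {cumulative[-1]:.3f}")

# Smallest k whose cumulative variance reaches 95% (101 would mean "not reached")
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95% variance:", n_95)
```

On the real data, the same two lines applied to the fitted `pca` object would report how much variance the 100 retained components account for.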


### Step 3

Split the training data into training and validation sets in an 80/20 ratio.

In [4]:
# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    train_images_pca, train_labels, test_size=0.2, random_state=RANDOM_STATE)
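A plain random split can leave the validation classes slightly uneven. As an optional variation, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both subsets. The sketch below uses synthetic balanced labels as an assumption; it is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in: 1000 samples, 100 PCA features, 10 balanced classes
X_demo = rng.normal(size=(1000, 100))
y_demo = np.repeat(np.arange(10), 100)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Per-class counts in the validation split
val_counts = np.bincount(y_va)
print(val_counts)
```

With stratification, each class contributes the same share to the validation set, which makes per-class metrics easier to compare.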

### Step 4

Perform a grid search for hyperparameter tuning.
Because the grid search can take a long time (around 16 hours), this step may be skipped and the model initialized directly with the parameters `{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}`.

In [9]:

# Perform grid search for hyperparameter tuning
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.001, 0.01, 0.1],
    'kernel': ['rbf']
}
print("Performing grid search for optimal hyperparameters...")
grid_search = GridSearchCV(SVC(), param_grid, cv=3)
grid_search.fit(X_train, y_train)
print("Best parameters found:", grid_search.best_params_)


Performing grid search for optimal hyperparameters...
Best parameters found: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
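The grid above has 3 x 3 x 1 = 9 parameter combinations, each fitted 3 times by the cross-validation. Two common ways to shorten such a search are to run the fits in parallel with `n_jobs` and to search on a smaller subsample first. The sketch below shows both on a small synthetic dataset; the dataset and its size are assumptions, and the resulting best parameters say nothing about the image data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in for a subsample of (X_train, y_train)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=42)

param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.001, 0.01, 0.1],
    'kernel': ['rbf'],
}

# n_jobs=-1 asks scikit-learn to use all available CPU cores for the fits
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_['params'])
print("Candidates evaluated:", n_candidates)
print("Best parameters on the demo data:", search.best_params_)
```

Parameters found on a subsample are only a starting point; it may still be worth confirming them on the full training set.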


### Step 5

Train the model on the training dataset with the tuned parameters.

In [11]:
# Instantiate the SVM model with the best parameters
svm_pca_rbf_model = SVC(C=grid_search.best_params_['C'],
                        gamma=grid_search.best_params_['gamma'],
                        kernel='rbf', random_state=RANDOM_STATE)
# If Step 4 was skipped, replace the instantiation above with the parameters found earlier:
# svm_pca_rbf_model = SVC(C=10, gamma=0.001, kernel='rbf', random_state=RANDOM_STATE)
svm_pca_rbf_model.fit(X_train, y_train)

### Step 6

Validate the model on the validation dataset.

In [12]:

# Validate the model
print("Validating the model...")
y_val_pred = svm_pca_rbf_model.predict(X_val)
print(classification_report(y_val, y_val_pred))


Validating the model...
              precision    recall  f1-score   support

           0       0.61      0.60      0.61       983
           1       0.64      0.67      0.65      1006
           2       0.45      0.45      0.45       997
           3       0.36      0.42      0.39      1018
           4       0.47      0.41      0.44      1021
           5       0.46      0.45      0.46      1048
           6       0.58      0.52      0.55      1029
           7       0.62      0.61      0.62       978
           8       0.68      0.65      0.66       966
           9       0.57      0.64      0.60       954

    accuracy                           0.54     10000
   macro avg       0.54      0.54      0.54     10000
weighted avg       0.54      0.54      0.54     10000
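The classification report shows per-class scores but not which classes are confused with each other. A confusion matrix adds that detail. The sketch below uses a tiny set of made-up labels, not the validation results above, to show how the matrix is read and how per-class recall follows from it.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2, 3])
y_pred_demo = np.array([0, 1, 1, 1, 2, 3, 2, 3])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Recall per class: correct predictions (diagonal) over all samples of that class (row sum)
recall = cm.diagonal() / cm.sum(axis=1)
print(recall)
```

In the notebook, the same call would be `confusion_matrix(y_val, y_val_pred)`.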



### Step 7

Load the test data, then standardize the test images and apply PCA using the scaler and PCA fitted on the training data. Make predictions on the test images.

In [13]:

# Load test data
test_df = pd.read_csv(test_csv_path)
test_images = load_images_with_color(test_ims_path, test_df['im_name'], IMAGE_SIZE)

# Apply PCA to test images
test_images_scaled = scaler.transform(test_images)  # Use the same scaler as training
test_images_pca = pca.transform(test_images_scaled)

# Make predictions on test images
print("Making predictions on test images...")
test_predictions = svm_pca_rbf_model.predict(test_images_pca)


Loading images: 100%|██████████| 10000/10000 [00:04<00:00, 2135.20it/s]


Making predictions on test images...
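The test images must pass through exactly the same fitted scaler and PCA as the training images. One way to make that hard to get wrong is to bundle the three stages in a scikit-learn pipeline, so that a single `predict` call applies them in order. This is an alternative sketch on synthetic data, not the approach used in this notebook; the array sizes and `n_components=10` are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Synthetic stand-ins for flattened training and test images
X_train_demo = rng.normal(size=(200, 50))
y_train_demo = rng.integers(0, 3, size=200)
X_test_demo = rng.normal(size=(40, 50))

# Scaler and PCA are fitted on the training data only, inside fit()
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, random_state=42),
    SVC(C=10, gamma=0.001, kernel='rbf', random_state=42),
)
model.fit(X_train_demo, y_train_demo)

# predict() reuses the fitted scaler and PCA before calling the SVM
preds = model.predict(X_test_demo)
print(preds.shape)
```

A pipeline like this can also be passed to `GridSearchCV`, in which case the scaler and PCA are refitted inside each cross-validation fold.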


### Step 8

Label the test images with the prediction results and save the output file in the same folder.

In [14]:

# Add predictions to the test DataFrame
test_df['label'] = test_predictions

# Save the updated test DataFrame
output_csv_path = os.path.join(base_path, 'svm_pca_rbf_predictions_n100_color.csv')
test_df.to_csv(output_csv_path, index=False)
print(f"Predictions saved to {output_csv_path}.")


Predictions saved to /Users/zhao0725/Desktop/COMP3314/Assignment3/image-classification-challenge/svm_pca_rbf_predictions_n100_color.csv.
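Before submitting, it can be worth reloading the saved file to confirm that it has the expected columns and one label per image. The sketch below does this round trip on a three-row stand-in DataFrame written to a temporary directory; the file names and labels are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-in for test_df and test_predictions
demo_df = pd.DataFrame({'im_name': ['a.png', 'b.png', 'c.png']})
demo_df['label'] = np.array([3, 0, 7])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_predictions.csv')
    demo_df.to_csv(path, index=False)   # index=False keeps the row index out of the file
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print("Rows:", len(reloaded), "Missing labels:", int(reloaded['label'].isna().sum()))
```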
