In this notebook, we explore several feature reduction techniques as part of our image classification task. These techniques are used to reduce the dimensionality of the data and improve the efficiency and effectiveness of our models.

We have included the following feature reduction techniques:

1. **Simple Standardization**:
   - A preprocessing step using `StandardScaler` to standardize the data. It performs no dimensionality reduction and serves as the baseline for the other techniques.

2. **Principal Component Analysis (PCA)**:
   - Standardization followed by PCA, reducing the dimensionality to 2 components.
   - `PCA` is an unsupervised linear technique that projects the data onto the directions of greatest variance.

3. **Linear Discriminant Analysis (LDA)**:
   - Standardization followed by LDA, reducing the dimensionality to 2 components.
   - `LDA` is a supervised technique that aims to maximize class separability.

4. **Neighborhood Components Analysis (NCA)**:
   - Standardization followed by NCA, reducing the dimensionality to 2 components.
   - `NCA` is another supervised technique. It learns a linear transformation that improves nearest-neighbor classification accuracy in the projected space.
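
The four options above can be sketched side by side. This is a minimal illustration, not the notebook's actual run: it assumes scikit-learn's small 8x8 digits dataset (first 300 samples) as a stand-in for CIFAR-10 so it finishes quickly, and the `max_iter=20` cap on NCA is likewise an assumption made for speed.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the digits dataset (64 features, 10 classes) stands in for CIFAR-10.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

reducers = {
    "scaler": make_pipeline(StandardScaler()),
    "pca": make_pipeline(StandardScaler(), PCA(n_components=2, random_state=0)),
    "lda": make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=2)),
    # max_iter is capped only to keep this sketch fast
    "nca": make_pipeline(
        StandardScaler(),
        NeighborhoodComponentsAnalysis(n_components=2, max_iter=20, random_state=0),
    ),
}

shapes = {}
for name, reducer in reducers.items():
    # Labels are passed to every pipeline; the unsupervised steps ignore them.
    shapes[name] = reducer.fit_transform(X, y).shape
    print(name, shapes[name])
```

Standardization alone keeps all 64 columns, while the other three pipelines return two columns per sample. LDA and NCA need the labels to fit; PCA does not.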

In addition to these feature reduction techniques, the notebook uses the following dataset, model, and optimization setup:

- **Dataset Used**:
   - The dataset used for this exploration is CIFAR-10, a well-known image classification benchmark of 60,000 32x32 color images in 10 classes, split into 50,000 training and 10,000 test images.

- **Classification Model**:
   - We use `KNeighborsClassifier` as our classification model. Because k-nearest neighbors relies on distances between samples, it benefits from standardized, lower-dimensional feature representations.

- **Hyperparameter Optimization**:
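   - `GridSearchCV` is used to search the hyperparameters of our `KNeighborsClassifier` with 5-fold cross-validation. In the current configuration each hyperparameter has a single candidate value, so the search evaluates one setting; adding more values to the grid turns it into a real comparison.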
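
As an illustration of a grid with more than one candidate, here is a hedged sketch. The candidate values, `cv=3`, and the digits dataset used as a small stand-in for CIFAR-10 are all assumptions made for this example; they are not the notebook's settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a 500-sample slice of the digits dataset keeps the search fast.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

pipeline = Pipeline([
    ("preprocessing", StandardScaler()),
    ("classifier", KNeighborsClassifier()),
])

# Hypothetical candidate values: 3 x 2 = 6 combinations
param_grid = {
    "classifier__n_neighbors": [3, 5, 7],
    "classifier__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print("candidates evaluated:", n_candidates)
print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

The `classifier__` prefix routes each parameter to the pipeline step named `classifier`, which is the same naming scheme the notebook's grid uses.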

By including these techniques in our notebook, we aim to compare feature reduction methods and show how preprocessing and hyperparameter tuning affect a k-nearest neighbors classifier on the CIFAR-10 dataset.

In [None]:
import tensorflow as tf
import numpy as np

# Load the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# x_train and x_test contain the images, y_train and y_test contain the labels

# Look at the shape
print("x_train shape:", x_train.shape)
print("y_train shape:", y_train.shape)
print("x_test shape:", x_test.shape)
print("y_test shape:", y_test.shape)

x_train shape: (50000, 32, 32, 3)
y_train shape: (50000, 1)
x_test shape: (10000, 32, 32, 3)
y_test shape: (10000, 1)


In [None]:
from sklearn.model_selection import train_test_split

# Flatten each image into a 3072-value vector and the labels into a 1-D array,
# then split the data into training and holdout sets
X_train_flattened, X_holdout_flattened, Y_train_flattened, Y_holdout_flattened = train_test_split(
    x_train.reshape(len(x_train), -1), y_train.ravel(), test_size=0.2, random_state=42)

# Check the shapes of the resulting arrays
print("X_train_flattened shape:", np.array(X_train_flattened).shape)
print("Y_train_flattened shape:", Y_train_flattened.shape)
print("X_holdout_flattened shape:", np.array(X_holdout_flattened).shape)
print("Y_holdout_flattened shape:", Y_holdout_flattened.shape)
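
Flattening turns each 32x32x3 image into a vector of 32 * 32 * 3 = 3072 values, and with `test_size=0.2` the 50,000 training images split into 40,000 training and 10,000 holdout rows. The sketch below uses a small synthetic array (random values, an assumption standing in for the real images) to show that per-image `flatten()` and a single `reshape` produce the same matrix, and that `ravel()` turns `(n, 1)` labels into the 1-D shape scikit-learn estimators expect.

```python
import numpy as np

# Synthetic stand-in with CIFAR-10's per-image shape (32, 32, 3); values are random.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 32, 32, 3), dtype=np.uint8)
labels = rng.integers(0, 10, size=(10, 1))

# Two equivalent ways to flatten: a Python loop versus one vectorized reshape
flat_loop = np.array([image.flatten() for image in images])
flat_reshape = images.reshape(len(images), -1)

# Column-vector labels (n, 1) become a 1-D array (n,)
labels_1d = labels.ravel()

print(flat_reshape.shape)                       # (10, 3072)
print(labels_1d.shape)                          # (10,)
print(np.array_equal(flat_loop, flat_reshape))  # True
```

The vectorized `reshape` avoids building a Python list of 50,000 separate arrays, which matters at CIFAR-10 scale.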


In [None]:
# Preprocessing

from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.preprocessing import StandardScaler

# Simple standardization (no dimensionality reduction)
scaler = make_pipeline(StandardScaler())

# Reduce dimension with PCA
pca = make_pipeline(StandardScaler(), PCA(n_components=2, random_state=43))

# Reduce dimension with LinearDiscriminantAnalysis
lda = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=2))

# Reduce dimension with NeighborhoodComponentsAnalysis
nca = make_pipeline(StandardScaler(), NeighborhoodComponentsAnalysis(n_components=2))


In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
steps = [
    ('preprocessing', scaler),
    ('classifier', knn),
]

In [None]:
# Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# `knn` and `steps` are defined in the previous cell and reused here.
grid_parameters = {
    'classifier__n_neighbors': [5],
    'classifier__weights': ['uniform'],
    'classifier__leaf_size': [20],
    'classifier__p': [2]
}

pipeline = Pipeline(steps)

model = GridSearchCV(
    estimator=pipeline,
    param_grid=grid_parameters,
    cv=5,
    scoring='accuracy',
    verbose=2,
)

# np.ravel passes 1-D labels, avoiding scikit-learn's column-vector warning
model.fit(X_train_flattened, np.ravel(Y_train_flattened))

Fitting 5 folds for each of 1 candidates, totalling 5 fits
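
The fitted pipeline uses only `scaler` as its preprocessing step, although PCA, LDA, and NCA pipelines are also defined. One possible way to compare them in a single search is to list the candidate preprocessors under the step name in the parameter grid. The sketch below assumes the digits dataset as a small stand-in for CIFAR-10, uses `cv=3`, and leaves NCA out purely to keep the runtime short; none of these choices come from the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a 500-sample slice of the digits dataset replaces CIFAR-10 here.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Candidate preprocessors, built the same way as in the preprocessing cell
candidates = [
    make_pipeline(StandardScaler()),
    make_pipeline(StandardScaler(), PCA(n_components=2, random_state=0)),
    make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=2)),
]

pipeline = Pipeline([
    ("preprocessing", candidates[0]),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])

# Using the step name as a grid key swaps the whole step between candidates
search = GridSearchCV(pipeline, {"preprocessing": candidates}, cv=3, scoring="accuracy")
search.fit(X, y)

mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(search.cv_results_["params"], mean_scores):
    # make_pipeline names steps after the lowercased class name
    last_step = params["preprocessing"].steps[-1][0]
    print(f"{last_step}: {score:.3f}")
```

Each candidate is refit inside every fold, so the reducers never see the validation portion of the data.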


In [None]:
# Scoring
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_validate

scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, average='weighted'),
    'recall': make_scorer(recall_score, average='weighted'),
    'f1': make_scorer(f1_score, average='weighted')
}
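# Cross-validate on the flattened training split; np.ravel gives the 1-D labels scikit-learn expects
results = cross_validate(pipeline, X_train_flattened, np.ravel(Y_train_flattened), cv=5, scoring=scoring, return_train_score=True)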

accuracy_scores = results['test_accuracy']
precision_scores = results['test_precision']
recall_scores = results['test_recall']
f1_scores = results['test_f1']


print(f'Accuracy scores: {accuracy_scores}')
print(f'Precision scores: {precision_scores}')
print(f'Recall scores: {recall_scores}')
print(f'F1 scores: {f1_scores}')
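
Per-fold scores are easier to compare once they are summarized, and the holdout split created earlier gives a final check on data that played no part in cross-validation. The following is a sketch under stated assumptions: the digits dataset stands in for CIFAR-10, and the built-in scorer names (`"precision_weighted"` and so on) are used in place of the `make_scorer` dictionary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a 500-sample slice of the digits dataset replaces CIFAR-10 here.
X, y = load_digits(return_X_y=True)
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X[:500], y[:500], test_size=0.2, random_state=42)

pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scoring = ["accuracy", "precision_weighted", "recall_weighted", "f1_weighted"]
results = cross_validate(pipeline, X_tr, y_tr, cv=5, scoring=scoring)

# Mean and standard deviation across the five folds for each metric
summary = {}
for name in scoring:
    fold_scores = results[f"test_{name}"]
    summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {summary[name][0]:.3f} +/- {summary[name][1]:.3f}")

# cross_validate fits clones, so the pipeline itself must be fitted before holdout scoring
pipeline.fit(X_tr, y_tr)
holdout_accuracy = accuracy_score(y_hold, pipeline.predict(X_hold))
print(f"holdout accuracy: {holdout_accuracy:.3f}")
```

Weighted recall is mathematically equal to accuracy, so those two rows match; precision and F1 carry the additional information.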