# <center> **CIFAR-10 Manifold Learning Comparison** </center>
<br><br>

### <center> Author: Omar Gabr </center>

# **What is the Objective of this Project?**

This project compares manifold learning algorithms on the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), which consists of 60,000 color images of size 32x32x3 divided into 10 classes. Manifold learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. The goal is to project the data into 2D with each algorithm and color each point by the class to which it belongs.

The algorithms to be used are:
1. Principal Component Analysis
2. Isomap Embedding
3. Locally Linear Embedding
4. Multi-Dimensional Scaling
5. Spectral Embedding
6. t-distributed Stochastic Neighbor Embedding

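Additionally, the UMAP algorithm will be used. The six algorithms listed above are available in scikit-learn (`sklearn.decomposition` for PCA, `sklearn.manifold` for the rest). UMAP is provided by the separate `umap-learn` package, and the dataset itself is loaded through Keras.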

# **Why Was this Dataset Chosen?**

The CIFAR-10 dataset was chosen because it poses a challenging image classification problem built from natural color images. Like MNIST it has 10 classes, but its images are 32x32x3 color images rather than 28x28 grayscale digits, which makes it a more realistic representation of real-world image data.

The goal of this project is to evaluate the performance of different manifold learning algorithms on this dataset and compare their ability to capture the underlying structure of the data while preserving class information.

This analysis will provide insight into which algorithm is best suited to this type of dataset, which could in turn help improve the performance of image classification tasks.

# **Train Test Split**

### Importing Dataset into Training and Testing Sets

```python
from keras.datasets import cifar10

# split into train and test sets
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

# verify that training and testing sets have appropriate sizes
print(f"Training Set Shapes: ({X_train.shape}, {y_train.shape})")
print(f"Testing Set Shapes: ({X_test.shape}, {y_test.shape})")
```

```
Training Set Shapes: ((50000, 32, 32, 3), (50000, 1))
Testing Set Shapes: ((10000, 32, 32, 3), (10000, 1))
```


# **Manifold Learning Algorithms**

### Defining a Data-Fitting Function for Manifold Learning Algorithms

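All features should be on the same scale before fitting. Most manifold learning methods are based on a nearest-neighbor search, so features with larger ranges would otherwise dominate the distances and the algorithms may perform poorly.

The training array has shape `(50000, 32, 32, 3)`.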
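A data-fitting function along these lines could be sketched as follows. This is a minimal sketch and not the author's original function: the name `fit_embedding`, the subsampling step, and the use of `StandardScaler` are assumptions, and a random array with the same layout as `X_train` stands in for the real images so the example runs without downloading CIFAR-10.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def fit_embedding(estimator, images, n_samples=500, seed=0):
    """Flatten, subsample, and standardize images, then fit a 2D embedding.

    Hypothetical helper for illustration; returns the embedding and the
    indices of the sampled images so labels can be matched up later.
    """
    rng = np.random.default_rng(seed)
    n_samples = min(n_samples, len(images))
    idx = rng.choice(len(images), size=n_samples, replace=False)
    # Flatten each 32x32x3 image into a 3072-dimensional feature vector.
    flat = images[idx].reshape(n_samples, -1).astype(np.float64)
    # Put every pixel feature on the same scale (zero mean, unit variance).
    scaled = StandardScaler().fit_transform(flat)
    return estimator.fit_transform(scaled), idx


# Random stand-in with the same layout as X_train (not real CIFAR-10 data).
fake_images = np.random.default_rng(0).integers(
    0, 256, size=(1000, 32, 32, 3), dtype=np.uint8
)
embedding, idx = fit_embedding(PCA(n_components=2), fake_images, n_samples=200)
print(embedding.shape)  # (200, 2)
```

Any estimator with a `fit_transform` method can be passed in, so the same helper would work for each algorithm below. Subsampling is included because several of these methods compute pairwise distances and become slow on all 50,000 training images.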

### <center> Principal Component Analysis (PCA) </center>

PCA is the standard linear algorithm for dimensionality reduction and is often used in unsupervised learning. It uses the singular value decomposition (SVD) of the data to project it onto a lower-dimensional space. The input data is centered, but not scaled, for each feature before the SVD is applied.
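The "center, then SVD" description can be checked with a short sketch. The data here is a synthetic stand-in rather than CIFAR-10, and the manual projection is only an illustration of the idea, compared against `sklearn.decomposition.PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 200 samples, 50 features with different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50)) * np.linspace(1, 5, 50)

# Center each feature (no scaling), then take the SVD.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Project onto the first two right singular vectors.
manual_2d = X_centered @ Vt[:2].T

sklearn_2d = PCA(n_components=2, svd_solver="full").fit_transform(X)

# The two projections agree up to the sign of each component.
same = np.allclose(np.abs(manual_2d), np.abs(sklearn_2d), atol=1e-6)
print(same)
```

Absolute values are compared because the sign of each principal component is arbitrary.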

### <center> Isomap Embedding </center>

Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points. Isomap can be viewed as an extension of Multi-dimensional Scaling (MDS) or Kernel PCA.
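A minimal usage sketch with `sklearn.manifold.Isomap` follows. The small digits dataset bundled with scikit-learn stands in for CIFAR-10 so the example runs quickly, and `n_neighbors=15` is an illustrative value rather than a tuned one.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
from sklearn.preprocessing import StandardScaler

# Stand-in data: 300 flattened 8x8 digit images bundled with scikit-learn.
X = StandardScaler().fit_transform(load_digits().data[:300])

# n_neighbors sets the size of the neighborhood graph that is used to
# approximate geodesic distances between points.
isomap = Isomap(n_neighbors=15, n_components=2)
X_isomap = isomap.fit_transform(X)

print(X_isomap.shape)  # (300, 2)
```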

### <center> Locally Linear Embedding (LLE) </center>

LLE seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding.
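A minimal usage sketch with `sklearn.manifold.LocallyLinearEmbedding` follows, again on the bundled digits dataset as a stand-in for CIFAR-10. The neighbor count is illustrative, and the dense eigensolver is chosen here only because it is robust on small inputs.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.preprocessing import StandardScaler

# Stand-in data: 300 flattened 8x8 digit images bundled with scikit-learn.
X = StandardScaler().fit_transform(load_digits().data[:300])

# Each point is reconstructed from its n_neighbors nearest neighbors;
# the embedding keeps those local reconstruction weights intact.
lle = LocallyLinearEmbedding(
    n_neighbors=15, n_components=2, eigen_solver="dense", random_state=0
)
X_lle = lle.fit_transform(X)

print(X_lle.shape)  # (300, 2)
print(lle.reconstruction_error_)
```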

### <center> Multi-Dimensional Scaling (MDS) </center>

MDS seeks a low-dimensional representation of the data in which the distances between points closely match the distances in the original high-dimensional space. It is a technique for analyzing similarity or dissimilarity data, which it attempts to model as distances in a geometric space.
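One way to see how well distances are preserved is to compare pairwise distances before and after the embedding. The sketch below uses `sklearn.manifold.MDS` on the bundled digits dataset as a stand-in for CIFAR-10. The correlation check is an illustrative diagnostic and not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

# Stand-in data: 200 flattened 8x8 digit images bundled with scikit-learn.
X = StandardScaler().fit_transform(load_digits().data[:200])

mds = MDS(n_components=2, n_init=1, max_iter=200, random_state=0)
X_mds = mds.fit_transform(X)

# Compare each pair's distance in the original space and in the embedding.
d_high = pairwise_distances(X)
d_low = pairwise_distances(X_mds)
upper = np.triu_indices_from(d_high, k=1)
corr = np.corrcoef(d_high[upper], d_low[upper])[0, 1]

print(X_mds.shape)  # (200, 2)
print(f"stress: {mds.stress_:.1f}, distance correlation: {corr:.2f}")
```

A correlation close to 1 would indicate that the 2D layout keeps most of the original distance structure.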

### <center> Spectral Embedding </center>

Spectral Embedding is an approach to calculating a non-linear embedding. It builds a graph that approximates the local structure of the data, then finds a low-dimensional representation through a spectral decomposition of that graph's Laplacian.
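The graph Laplacian step can be sketched by hand. This is a simplified illustration on the bundled digits dataset, not a reproduction of `sklearn.manifold.SpectralEmbedding`, which performs a comparable computation with its own normalization choices. The neighbor count and the symmetrization step are assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph

# Stand-in data: 300 flattened 8x8 digit images bundled with scikit-learn.
X = load_digits().data[:300]

# Build a k-nearest-neighbor graph and make it symmetric.
A = kneighbors_graph(X, n_neighbors=10, include_self=False)
A = 0.5 * (A + A.T)

# Normalized graph Laplacian, converted to a dense array for eigh.
L = laplacian(A, normed=True).toarray()

# Eigenvalues come back in ascending order; the smallest is ~0 and its
# eigenvector carries no layout information, so it is skipped.
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, 1:3]

print(embedding.shape)  # (300, 2)
```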


### <center> t-distributed Stochastic Neighbor Embedding (t-SNE) </center>

t-SNE is a technique that helps visualize high-dimensional data by creating a low-dimensional representation. It works by converting the relationships between data points in the high-dimensional space into probabilities.

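In the original space, the relationships are represented by Gaussian joint probabilities, while in the embedded space they are represented by Student's t-distributions. The embedding is then adjusted to minimize the Kullback-Leibler divergence between the two sets of probabilities.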
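A minimal usage sketch with `sklearn.manifold.TSNE` follows, using the bundled digits dataset as a stand-in for CIFAR-10. The perplexity value and PCA initialization are illustrative choices, not settings taken from the original notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Stand-in data: 300 flattened 8x8 digit images bundled with scikit-learn.
X = StandardScaler().fit_transform(load_digits().data[:300])

# perplexity roughly controls how many neighbors each point attends to
# and must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X)

print(X_tsne.shape)  # (300, 2)
# Final value of the objective: KL divergence between the two distributions.
print(tsne.kl_divergence_)
```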


# **Comparison Evaluation**

# **Conclusion**