# HW5: Prob 1: Create a 2D embedding of CIFAR-10 and compare different manifold learning algorithms

The [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) data set resembles MNIST in that it consists of 60,000 small images. Unlike MNIST, they are color images of size 32x32x3 (32 pixels wide, 32 pixels high, 3 color channels: red, green, and blue). Each pixel value is an integer between 0 and 255. The data set is divided into 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The data set is available in the `torchvision.datasets` module. You can load it with the following code:

```python
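from torchvision.datasets import CIFAR10
from torchvision.transforms import ToTensor
# Training split; downloaded into data/ on first use
dataset = CIFAR10(root='data/', download=True, transform=ToTensor())
# Test split; reuses the files downloaded above
test_dataset = CIFAR10(root='data/', train=False, transform=ToTensor())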
```

Your goal is to compare different manifold learning algorithms on this data set. You should use the following algorithms:

* Principal Component Analysis
* Isomap Embedding
* Locally Linear Embedding
* Multidimensional Scaling
* Spectral Embedding
* t-distributed Stochastic Neighbor Embedding

All of these algorithms are available in the sklearn library. In addition, you should also try the UMAP algorithm, which is available for Python as the library [umap](https://umap-learn.readthedocs.io/en/latest/basic_usage.html). In each case, project the data into 2D and color each point by the class to which it belongs.
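
As a rough starting point, the sketch below shows one way to run the six sklearn algorithms through a single loop. It is not a reference solution: the random stand-in array `X`, the sample size of 150, and `n_neighbors=10` are illustrative assumptions, chosen only so the block runs quickly without downloading anything.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE, Isomap, LocallyLinearEmbedding, SpectralEmbedding

# Stand-in data (assumption): random values with the shape of flattened CIFAR-10 images.
# In the notebook, replace X with a subsample of the real flattened images.
rng = np.random.default_rng(0)
X = rng.random((150, 3072))

# Hyperparameters here are untuned, illustrative choices.
embedders = {
    "PCA": PCA(n_components=2),
    "Isomap": Isomap(n_components=2, n_neighbors=10),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10),
    "MDS": MDS(n_components=2, random_state=0),
    "Spectral": SpectralEmbedding(n_components=2, n_neighbors=10),
    "t-SNE": TSNE(n_components=2, random_state=0),
}

# Every estimator exposes fit_transform, so one loop covers all six.
embeddings = {name: model.fit_transform(X) for name, model in embedders.items()}
for name, emb in embeddings.items():
    print(name, emb.shape)
```

Several of these methods scale poorly with the number of samples (MDS and Isomap work from full pairwise distance matrices), so fitting them on a subsample rather than on all of the training images is likely to be necessary. UMAP is left out here because it needs the separate umap-learn package.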

Two references which might help are the [sklearn documentation](https://scikit-learn.org/stable/modules/manifold.html) and Jake Vanderplas' book [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.10-manifold-learning.html).

Practice creating a well-formatted Jupyter notebook. You should have a title, a description of the data set, a description of the algorithms, a description of the results, a discussion of the results, and a conclusion. Your notebook should not have lots of code cells with no explanation. It should also not have lots of text outputs. You may need those during development, but please clean them up before you submit.


In [2]:
%pip install torchvision

In [3]:
from torchvision.datasets import CIFAR10
from torchvision.transforms import ToTensor
dataset = CIFAR10(root='data/', download=True, transform= ToTensor())
test_dataset = CIFAR10(root='data/', train=False, transform= ToTensor())

In [None]:
from matplotlib import pyplot as plt 
import pandas as pd 
import numpy as np  
import seaborn as sns
from sklearn.decomposition import PCA

For an initial Exploratory Data Analysis (EDA) on the CIFAR-10 dataset, I'll start with some basic visualizations and statistics to understand the data. This involves showing a few sample images, checking the balance of classes, and understanding the image format. Here's a basic outline of the steps in code:

In [6]:
# Sample and visualize some images from each class
classes = dataset.classes
fig, ax = plt.subplots(2, 5, figsize=(15, 6))
for i in range(10):
    idx = [j for j, y in enumerate(dataset.targets) if y == i][0]
    ax[i//5, i%5].imshow(dataset.data[idx])
    ax[i//5, i%5].set_title(classes[i])
    ax[i//5, i%5].axis('off')
plt.show() 

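# Check the balance of classes using the stored integer labels
# (avoids loading and transforming every image just to read its label)
class_counts = {class_name: 0 for class_name in classes}
for label in dataset.targets:
    class_counts[classes[label]] += 1
print(class_counts)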
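
The outline above also mentions understanding the image format. A minimal sketch of that check follows. `describe_images` is a hypothetical helper, not part of torchvision, and the random array stands in for `dataset.data`, which is assumed here to be a `uint8` array laid out as (N, 32, 32, 3).

```python
import numpy as np

def describe_images(images):
    """Summarise shape, dtype, and pixel range of an (N, H, W, C) image array."""
    return {
        "shape": images.shape,
        "dtype": str(images.dtype),
        "min": int(images.min()),
        "max": int(images.max()),
    }

# Stand-in for dataset.data (assumption): random uint8 images in CIFAR-10's layout.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)
summary = describe_images(fake_images)
print(summary)
```

In the notebook, calling `describe_images(dataset.data)` should report the raw layout before `ToTensor()` is applied. The transform reorders each image to channels-first and rescales pixel values to the range 0 to 1.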

For PCA, I first have to reshape the data, since PCA expects a 2D array (samples × features) as input. CIFAR-10 images are 32x32 pixels with 3 color channels, so I'll flatten each one into a 3072-dimensional vector (32 × 32 × 3 = 3072). Then I can apply PCA to reduce the dimensionality to 2D for visualization.

In [None]:
# Preprocess data: flatten each (3, 32, 32) image tensor into a 3072-vector
# and stack the vectors into one NumPy array of shape (n_images, 3072)
X = np.array([np.array(x).flatten() for x, _ in dataset])

# Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
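
To color the projection by class, a small plotting helper along the following lines could be used. This is only a sketch: `plot_embedding` is a hypothetical name, and the random demo points and labels are placeholders so the block runs on its own.

```python
import numpy as np
from matplotlib import pyplot as plt

def plot_embedding(X_2d, labels, class_names, title):
    """Scatter a 2D embedding with one colour per class and return the figure."""
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(8, 6))
    for class_idx, class_name in enumerate(class_names):
        mask = labels == class_idx
        ax.scatter(X_2d[mask, 0], X_2d[mask, 1], s=5, alpha=0.6, label=class_name)
    ax.set_title(title)
    ax.set_xlabel("Component 1")
    ax.set_ylabel("Component 2")
    ax.legend(markerscale=3, fontsize="small")
    return fig

# Placeholder data (assumption): random 2D points with random labels for 10 classes.
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(300, 2))
demo_labels = rng.integers(0, 10, size=300)
demo_names = [f"class {i}" for i in range(10)]
fig = plot_embedding(demo_points, demo_labels, demo_names, "Demo embedding")
```

In the notebook, the equivalent call would be `plot_embedding(X_pca, dataset.targets, dataset.classes, "PCA")`. The same helper can be reused for each of the other embeddings so that all plots share one layout and color assignment.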