# PCA: Scikit-learn on Iris dataset
M5U3 - Exercise 1

## What are we going to do?
- Reduce the dimensionality of a dataset by PCA
- Implement PCA with Scikit-learn
- Graphically represent the new dimensions

Remember to follow the guidelines for submitting exercises given in the [Submission instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

## Instructions
### The Iris dataset

Although we have already worked with it, let's recall its main features:
- Dataset with measurements of iris plants, used for classification.
- 3 classes: Iris Setosa, Iris Versicolor and Iris Virginica.
- One of the classes is linearly separable from the other 2, which are not linearly separable from each other.
- 4 dimensions: length and width of sepals and petals, in cm.
- 150 examples, 50 from each of the 3 classes.
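
As a quick check of the features listed above, the following minimal sketch loads the copy of the dataset bundled with Scikit-learn and prints its shape, feature names, class names and class counts. The variable names (`iris`, `X`, `y`) are only illustrative choices, not part of the exercise template:

```python
import numpy as np
from sklearn import datasets

# Load the Iris dataset bundled with Scikit-learn (returns a Bunch object)
iris = datasets.load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (number of examples, number of dimensions)
print(iris.feature_names)  # names of the 4 measured dimensions
print(iris.target_names)   # names of the 3 classes
print(np.bincount(y))      # number of examples in each class
```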

In principle, Iris is not a high-dimensional dataset that needs reduction by PCA, but having only 4 dimensions makes it easier to visualise them.

For this exercise, you can look at the following examples from the Scikit-learn documentation:
- [The Iris Dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).
- [PCA example with Iris Data-set](https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html).
- [Comparison of LDA and PCA 2D projection of Iris dataset](https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html).
- [K-means Clustering](https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html).

In [None]:
# TODO: Import all the necessary libraries here.

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

plot_n = 1

Follow the instructions to download and display the dataset in the following code cell:

*NOTE:* Use a scatter plot for all graphs and include a title, labels for each dimension, a grid and a legend with the 3 classes.
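
As a hedged illustration of the plot elements requested in the note, here is a minimal 2D sketch (not the 3D figure the exercise asks for) that draws one scatter series per class with a title, axis labels, a grid and a legend. The choice of the two petal dimensions and all variable names are assumptions made for this example only:

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

fig, ax = plt.subplots(figsize=(8, 6))

# One scatter call per class, so that each class gets its own colour and legend entry
for class_idx, class_name in enumerate(iris.target_names):
    mask = y == class_idx
    ax.scatter(X[mask, 2], X[mask, 3], label=class_name)

ax.set_title("Iris: petal length vs. petal width")
ax.set_xlabel(iris.feature_names[2])
ax.set_ylabel(iris.feature_names[3])
ax.grid(True)
ax.legend()
```

In a notebook the figure is rendered at the end of the cell; in a script, call `plt.show()` afterwards. The same pattern carries over to 3D axes by adding a third coordinate and a z-axis label.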

In [None]:
# TODO: Download the Iris dataset and represent it graphically

# Download the Iris dataset
X, y = [...]

# Create a figure with four 3D subplots and plot the 3 classes with different labels and colours in each one, using:
# - sepal length, sepal width, petal length
# - sepal length, sepal width, petal width
# - petal length, petal width, sepal length
# - petal length, petal width, sepal width
fig = plt.figure(plot_n, figsize=(4, 3))
ax = Axes3D([...])

[...]

plt.show()

plot_n += 1

*Are you able to separate the 3 classes linearly with a plane? Which classes are closer to or further away from each other?*

## Dimensionality reduction

A 4D dataset is difficult to represent in a single graph. We can split it across 4 different graphs, as we have done; encode the 4th dimension in the same graph with different shapes, sizes or colours; or reduce its dimensionality to 3D or 2D.

We are going to use PCA to transform the dataset into a different dimensional space. In this case we will do it to try to make the classes easier to tell apart visually, as well as to reduce the complexity/dimensionality that a model would have to handle.

To do so, we will reduce the dimensionality of the dataset to its first 3 principal components, or the first 3 dimensions after transforming by PCA:
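
Before choosing how many components to keep, it can help to see how much of the variance each one retains. The sketch below is a side illustration, separate from the exercise cell: it fits a PCA with all 4 components and prints the cumulative explained variance ratio. The names `X_demo`, `pca_full` and `cumulative` are hypothetical and not part of the template:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X_demo, _ = datasets.load_iris(return_X_y=True)

# Keep all 4 components to see how the variance is distributed among them
pca_full = PCA(n_components=4)
pca_full.fit(X_demo)

# Fraction of the total variance retained by the first k components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for k, ratio in enumerate(cumulative, start=1):
    print(f"First {k} component(s): {ratio:.1%} of the variance")
```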

In [None]:
# TODO: Reduce the dimensionality of the dataset to its first 3 principal components

# Reduce the dimensionality of X to its first 3 principal components
pca_3 = PCA([...])

X_pca_3 = pca_3.fit_transform([...])

# Plot the first 3 principal components, using a different colour for each class
fig = plt.figure(plot_n, figsize=(8, 6))

ax = Axes3D([...])

[...]

plt.show()

plot_n += 1

Analyse the results of the dimensionality reduction:

*Is it now easier to tell the 3 classes apart?*

Remember that the 3 dimensions obtained after the reduction do not correspond to the initial dimensions, i.e. the lengths and widths of the petals and sepals in cm.
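
To make this concrete, each principal component is a weighted combination of the 4 original dimensions. The following minimal sketch prints those weights and checks that the transformed data equals the centred data projected onto the components. It assumes the default, non-whitened PCA, and the names `X_demo`, `pca_demo` and `manual` are chosen only for this example:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X_demo = iris.data

pca_demo = PCA(n_components=3)
X_demo_pca = pca_demo.fit_transform(X_demo)

# Each row of components_ holds the weights of the 4 original dimensions
for i, weights in enumerate(pca_demo.components_, start=1):
    terms = " ".join(f"{w:+.2f}*[{name}]" for w, name in zip(weights, iris.feature_names))
    print(f"PC{i} = {terms}")

# The projection is the centred data multiplied by the transposed components
manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T
print(np.allclose(manual, X_demo_pca))
```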

### Dimensionality reduction to 2 principal components

Also try representing the dataset in only 2D, reducing it to its first 2 principal components, and plot the result:

In [None]:
# TODO: Reduce the dimensionality of the dataset to its first 2 principal components

# Reduce the dimensionality of X to its first 2 principal components
pca_2 = PCA([...])

X_pca_2 = pca_2.fit_transform([...])

# Plot the first 2 principal components, using a different colour for each class
plt.figure(plot_n, figsize=(8, 6))

[...]

plt.show()

plot_n += 1

Analyse the results of this last graph again:

*Is it now easier or harder to tell the 3 classes apart than in the graph of the first 3 principal components?*

*And compared with the original graphs, is it still easier or harder?*