# Dimensionality Reduction

In today's exercise you will apply several techniques for dimensionality reduction. We will dive into the popular dimensionality reduction algorithm PCA and the manifold learning algorithm t-SNE.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.graph_objs as go

from sklearn.preprocessing import RobustScaler
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler

from tqdm.notebook import tqdm
#sns.set()

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

%matplotlib inline

## Principal Component Analysis (PCA)
In this section, we explore what is perhaps one of the most widely used unsupervised algorithms, principal component analysis (PCA).
PCA is fundamentally a dimensionality reduction algorithm, but it can also be useful as a tool for **visualization**, for **noise filtering**, for **feature extraction and engineering**, and much more.
After a brief conceptual discussion of the PCA algorithm, we will see a couple of examples of these further applications.

### PCA Introduction
We introduce PCA by looking at a randomly generated two-dimensional dataset.

In [None]:
rng = np.random.RandomState(42)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');

Let us calculate the empirical covariance of our sampled data set. We can use the formula from the lecture.

In [None]:
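n_samples = X.shape[0]
# Note: X^T X / (n - 1) equals the covariance only if X is mean-centered;
# the sampled data here is only approximately centered
covariance = np.dot(X.T, X) / (n_samples-1)
# covariance = np.cov(X, rowvar=False) # numpy alternative; np.cov centers the data itself
covariance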
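The formula above equals the empirical covariance only when every feature has zero mean. The following is a minimal sketch, regenerating the same toy data, that subtracts the mean explicitly and compares the result with `np.cov`. The names `X_centered` and `covariance_centered` are ours and do not come from the lecture.

```python
import numpy as np

rng = np.random.RandomState(42)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

# Subtract the per-feature mean before forming X^T X
X_centered = X - X.mean(axis=0)
covariance_centered = X_centered.T @ X_centered / (X.shape[0] - 1)

# np.cov centers the data internally, so both results should agree
print(covariance_centered)
print(np.allclose(covariance_centered, np.cov(X, rowvar=False)))
```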

We can see that there is a linear relationship between the x and y variables.
Our goal here is different from a regression problem: rather than predicting the y-values from the x-values, we want to learn about the *relationship* between the x and y values.

In PCA, this relationship is quantified by finding a list of the *principal axes* in the data and using those axes to describe the dataset.

In scikit-learn we can do that by using the [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) class. We instantiate a new object from PCA and fit it to the data.

In [None]:
pca = PCA()
pca.fit(X)

We can now access the principal components by means of the `components_` attribute. These correspond to the **eigenvectors** of the covariance matrix.

In [None]:
eigenvectors = pca.components_
eigenvectors

Another important property is the *explained variance* which can be accessed by means of the `explained_variance_` attribute. These are the **eigenvalues** of the covariance matrix.

In [None]:
pca.explained_variance_

In [None]:
eigenvalues = np.diag(pca.explained_variance_)
eigenvalues

The eigenvalue decomposition of the covariance matrix can also be calculated with numpy:

In [None]:
eigenvalues_np, eigenvectors_np = np.linalg.eig(covariance)

eigenvalues_np = np.diag(eigenvalues_np)
eigenvectors_np = eigenvectors_np.T
print("Eigenvectors:\n", eigenvectors_np)
print("\nEigenvalues:\n", eigenvalues_np)

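The results differ slightly, mainly because scikit-learn's PCA subtracts the mean from the data before the decomposition, whereas the covariance matrix above was computed from the uncentered data. In addition, eigenvectors are only defined up to sign, and `np.linalg.eig` does not necessarily return the eigenvalues in sorted order.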
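As a cross-check, the sketch below repeats the eigendecomposition on the mean-centered covariance and compares it with scikit-learn. It uses `np.linalg.eigh`, which is our choice here because the covariance matrix is symmetric. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

pca = PCA().fit(X)

# Centered covariance; eigh returns eigenvalues in ascending order
cov_centered = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov_centered)

# Sort descending and put the eigenvectors in rows, like pca.components_
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order].T

print(np.allclose(eigvals, pca.explained_variance_))
print(np.allclose(np.abs(eigvecs), np.abs(pca.components_)))
```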

#### Visualize eigenvalues and eigenvectors
To see what these numbers mean, let's visualize them as vectors over the input data, using the **eigenvectors** to define the direction of the vector, and the **eigenvalues** to define the squared-length of the vector:

In [None]:
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=2,
                    shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for eigenvalue, eigenvector in zip(pca.explained_variance_, pca.components_):
    v = eigenvector * 3 * np.sqrt(eigenvalue)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');

#### Reconstruct covariance matrix
Let's reconstruct our covariance matrix $A$ from the eigenvectors and eigenvalues, where the rows of $E$ are the eigenvectors and $D$ is the diagonal matrix of the eigenvalues:
$A = E^TDE$

In [None]:
covariance

In [None]:
covariance_reconstructed =  eigenvectors.T.dot(eigenvalues).dot(eigenvectors)
covariance_reconstructed

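With the eigenvectors and eigenvalues from scikit-learn's `PCA` class we don't get exactly the same result, because PCA decomposes the covariance of the mean-centered data, whereas `covariance` was computed without centering.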

In [None]:
covariance_reconstructed_np =  eigenvectors_np.T.dot(eigenvalues_np).dot(eigenvectors_np)
covariance_reconstructed_np

If we reconstruct it with the eigenvectors and eigenvalues that we calculated with numpy, we get the same result.
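The scikit-learn values reconstruct the mean-centered covariance instead. A minimal sketch that checks this against `np.cov`, with illustrative variable names `E`, `D` and `reconstructed`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

pca = PCA().fit(X)
E = pca.components_                    # eigenvectors as rows
D = np.diag(pca.explained_variance_)   # eigenvalues on the diagonal

# A = E^T D E, compared with the centered covariance from np.cov
reconstructed = E.T @ D @ E
print(np.allclose(reconstructed, np.cov(X, rowvar=False)))
```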

#### Using PCA for dimensionality reduction
If we want to use PCA for dimensionality reduction, we drop the principal components with the smallest eigenvalues. The result is a lower-dimensional projection of the data that preserves as much variance as possible. The procedure is:

1. Compute the covariance matrix of the data.
1. Compute the eigenvalues and eigenvectors of this covariance matrix.
1. Keep only the eigenvectors with the largest eigenvalues and project the data onto them to reduce the dimensionality.
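The three steps above can be sketched from scratch with NumPy. The helper name `pca_project` is hypothetical and only serves as an illustration; it is not part of NumPy or scikit-learn.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its top principal axes (hypothetical helper)."""
    X_centered = X - X.mean(axis=0)
    # Step 1: covariance matrix of the centered data
    cov = X_centered.T @ X_centered / (X.shape[0] - 1)
    # Step 2: eigenvalues and eigenvectors (eigh: symmetric input, ascending order)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 3: keep the eigenvectors with the largest eigenvalues and project
    order = np.argsort(eigvals)[::-1][:n_components]
    return X_centered @ eigvecs[:, order]

rng = np.random.RandomState(42)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
X_reduced = pca_project(X, n_components=1)
print(X_reduced.shape)
```

The sign of each projected coordinate may differ from scikit-learn's output, since an eigenvector and its negative describe the same axis.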

In [None]:
n_components = 1
per_feature_mean = np.mean(X, axis=0)
X_pca = np.dot(X - per_feature_mean, eigenvectors[:n_components].T)

print("original shape:   ", X.shape)
print("transformed shape:", X_pca.shape)
print("First data point:", X_pca[0])

With scikit-learn, this can be done using the `transform` method of the `PCA` class.

In [None]:
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:   ", X.shape)
print("transformed shape:", X_pca.shape)
print("First data point:", X_pca[0])

The transformed data has been reduced to a single dimension.
To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:

In [None]:
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)
plt.axis('equal');

The blue points are the original data, while the orange points are the projected version.
The information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance.
The fraction of variance that is cut out (proportional to the spread of points about the line formed in this figure) is roughly a measure of how much "information" is discarded in this reduction of dimensionality.

This reduced-dimension dataset is in some sense "good enough" to encode the most important relationships between the points: despite reducing the dimension of the data by 50%, the overall relationship between the data points is mostly preserved.
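The amount of discarded "information" can be quantified. The sketch below, which regenerates the toy data, reads the retained share from `explained_variance_ratio_` and checks that the reconstruction error matches the variance along the dropped axis. The variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

pca_full = PCA().fit(X)
pca_1d = PCA(n_components=1).fit(X)

# Share of the total variance kept by the first component
kept = pca_1d.explained_variance_ratio_.sum()
print(f"variance kept: {kept:.3f}, discarded: {1 - kept:.3f}")

# The reconstruction error equals the variance along the dropped axis
X_back = pca_1d.inverse_transform(pca_1d.transform(X))
residual_variance = np.sum((X - X_back) ** 2) / (X.shape[0] - 1)
print(np.isclose(residual_variance, pca_full.explained_variance_[1]))
```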

### PCA using the Wine dataset

From our toy dataset we will move to the [wine dataset](http://archive.ics.uci.edu/ml/datasets/wine).

![Wine-chemistry.jpg](attachment:Wine-chemistry.jpg) 

It contains the results of a chemical analysis of wines grown in the same region in Italy but derived from 3 different cultivars. Our goal will be to reveal the presence of clusters in the wine dataset. In other words, we will check whether the 3 cultivars are distinguishable in the dataset.

In [None]:
df_wine = pd.read_csv("wine.csv")
df_wine.head()

The data is already labeled by the column `Customer_Segment`. We separate the label from the features.

In [None]:
X_wine = df_wine.drop(columns=["Customer_Segment"]).values
y_wine = df_wine["Customer_Segment"].values

We apply PCA to our data and plot the explained variance.

In [None]:
pca = PCA().fit(X_wine)
v_ratio = pca.explained_variance_ratio_

data = pd.DataFrame({'# of Features':range(1, len(v_ratio)+1), '% Variance explained':np.cumsum(v_ratio*100)})
data.plot(x=0, y=1, xticks=range(1, len(v_ratio)+1), grid=True, figsize=(10,8))

In this plot we can see that the first 2 components already retain almost 100% of the variance!

Now let us reduce the dimensionality of our dataset to two components and plot the data. We colorize the data points according to the customer segments.

In [None]:
pca = PCA(n_components=2)
X_pca_wine = pca.fit_transform(X_wine)

plt.figure(figsize=(10, 8))
plt.scatter(X_pca_wine[:, 0], X_pca_wine[:, 1], c=y_wine)

We can see that points from the same customer segment are spread quite widely.

Let's redo our dimensionality reduction, but this time we scale our data first. We use a pipeline that first scales the data and then reduces its dimensionality using PCA.


In [None]:
pipe = Pipeline([
    ("scaler", StandardScaler()), 
    ("pca", PCA())
])

pipe.fit(X_wine)

v_ratio = pipe["pca"].explained_variance_ratio_
data = pd.DataFrame({'# of Features':range(1, len(v_ratio)+1), '% Variance explained':np.cumsum(v_ratio*100)})
data.plot(x=0, y=1, xticks=range(1, len(v_ratio)+1), grid=True, figsize=(10,8))

If we want to retain 90% of the variance, we would select the first 8 components. We can also use the elbow method to decide.
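The number of components for a given variance threshold can also be read off programmatically. This is a minimal sketch under one loud assumption: `sklearn.datasets.load_wine()` is used as a stand-in for `wine.csv`, which may differ from the course file. The names `threshold` and `n_needed` are ours.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_wine() stands in for wine.csv
X_wine, y_wine = load_wine(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA())]).fit(X_wine)
cumulative = np.cumsum(pipe["pca"].explained_variance_ratio_)

# First index where the cumulative ratio reaches the threshold (+1 for a count)
threshold = 0.9
n_needed = int(np.argmax(cumulative >= threshold)) + 1
print(n_needed)
```

Passing a float such as `PCA(n_components=0.9)` lets scikit-learn make this selection itself.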

Let us transform our data to two components by setting the parameters of our pipeline accordingly.

In [None]:
params = {"pca__n_components": 2}
pipe.set_params(**params)

X_pca_wine = pipe.fit_transform(X_wine)

plt.figure(figsize=(10, 8))
plt.scatter(X_pca_wine[:, 0], X_pca_wine[:, 1], c=y_wine)

It looks much better now! Again, this should give you a feeling for why scaling is so important in machine learning.
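To make the effect of scaling concrete, the sketch below compares the per-feature variances and the share of variance captured by the first component with and without standardization. It assumes that `sklearn.datasets.load_wine()` is an acceptable stand-in for `wine.csv`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_wine, _ = load_wine(return_X_y=True)  # assumption: stand-in for wine.csv

# Per-feature variances differ by orders of magnitude in the raw data
variances = X_wine.var(axis=0)
print(f"smallest variance: {variances.min():.4f}, largest: {variances.max():.1f}")

X_scaled = StandardScaler().fit_transform(X_wine)
ratio_raw = PCA().fit(X_wine).explained_variance_ratio_[0]
ratio_scaled = PCA().fit(X_scaled).explained_variance_ratio_[0]
print(f"first component, raw data:    {ratio_raw:.3f}")
print(f"first component, scaled data: {ratio_scaled:.3f}")
```

Without scaling, the feature with the largest numeric range dominates the first principal component, regardless of how informative it is.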

#### Customer segment prediction
Now we want to predict the customer segment. We extend our pipeline with a **Logistic Regression** estimator.

In [None]:
split = train_test_split(X_wine, y_wine, test_size=0.2, random_state=3)
(X_train_wine, X_test_wine, y_train_wine, y_test_wine) = split

In [None]:
model = Pipeline([
    ("scaler", StandardScaler()), 
    ("pca", PCA(n_components=0.9)),
    ("clf", LogisticRegression())
])

> Fit the model to our training data and calculate the accuracy on the test set.

*Click on the dots to display the solution*

In [None]:
model.fit(X_train_wine, y_train_wine)

y_pred_wine = model.predict(X_test_wine)
# scikit-learn metrics take the true labels first, then the predictions
accuracy_score(y_test_wine, y_pred_wine)
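As an optional follow-up, the sketch below also reports how many components `PCA(n_components=0.9)` actually kept and adds a macro-averaged F1 score. It assumes that `sklearn.datasets.load_wine()` is an acceptable stand-in for `wine.csv`, so the numbers may differ from those obtained with the course file.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)  # assumption: stand-in for wine.csv
X_train, X_test, y_train, y_test = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=3)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.9)),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# n_components_ holds the number of components PCA actually kept
print("components kept:", model["pca"].n_components_)
print("accuracy:", accuracy_score(y_test, y_pred))
print("macro F1:", f1_score(y_test, y_pred, average="macro"))
```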

## t-Distributed Stochastic Neighbor Embedding (t-SNE) 
t-SNE is a [manifold learning](https://scikit-learn.org/stable/modules/manifold.html) algorithm that reduces the dimensionality while trying to keep similar instances close together and dissimilar instances apart. It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space.
t-SNE constructs a probability distribution $p$ over the dataset $X$ and then another probability distribution $q$ in a lower dimensional data space $Y$, making both distributions as "close" as possible.
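For reference, the standard formulation from the t-SNE literature can be written as follows; it is not derived in this notebook. Here $n$ is the number of samples, and the bandwidth $\sigma_i$ of each point is chosen so that its neighborhood matches the user-defined perplexity.

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

"Closeness" of the two distributions is measured by the Kullback-Leibler divergence, which t-SNE minimizes with respect to the low-dimensional points $y_i$:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$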

We now transform our data using the [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE) class from scikit-learn. As t-SNE is based on nearest neighbor search, it is crucial to normalize our data first.

In [None]:
pipe = Pipeline([
    ("scaler", StandardScaler()), 
    ("tsne", TSNE(n_components=2)),
])

X_tsne_wine = pipe.fit_transform(X_wine)

plt.figure(figsize=(10, 8))
plt.scatter(X_tsne_wine[:, 0], X_tsne_wine[:, 1], c=y_wine)

The visualization looks really nice. We can see what t-SNE tries to do: keep similar instances close together and dissimilar instances apart.
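A t-SNE embedding depends on the `perplexity` parameter and on the random seed. The sketch below fits two embeddings with different perplexity values and a fixed `random_state`. It assumes that `sklearn.datasets.load_wine()` is an acceptable stand-in for `wine.csv`, and the chosen perplexity values are arbitrary examples.

```python
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X_wine, _ = load_wine(return_X_y=True)  # assumption: stand-in for wine.csv
X_scaled = StandardScaler().fit_transform(X_wine)

embeddings = {}
for perplexity in (5, 30):
    # A fixed random_state makes the run repeatable
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_scaled)
    # kl_divergence_ is the final value of the optimized cost function
    print(perplexity, embeddings[perplexity].shape, tsne.kl_divergence_)
```

Plotting both embeddings side by side is a quick way to see how strongly the apparent cluster structure depends on this setting.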

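Please note: t-SNE does not provide a `transform` method. scikit-learn's `TSNE` class only offers `fit` and `fit_transform`, so it cannot project new, unseen samples into an existing embedding. It should therefore not be used as a preprocessing step in front of an estimator.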
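The difference to PCA can be checked directly on the estimator objects. A minimal sketch:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns a reusable linear map; TSNE only embeds the data it was fitted on
pca_has_transform = hasattr(PCA(), "transform")
tsne_has_transform = hasattr(TSNE(), "transform")
print(pca_has_transform, tsne_has_transform)
```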

## Assignment: Using PCA with the autoscout dataset
Your assignment is now to apply the dimensionality reduction techniques to the well-known autoscout dataset.

Here is the dataset from Autoscout24. We reuse the steps that we developed in the regression exercise to read and clean the data:

In [None]:
df = pd.read_csv("cars.csv")
df['Age'] = df.Year-1984
df.drop(['Color', 'Name', 'Registration'], axis='columns', inplace=True)
df.drop_duplicates(inplace=True)
df.drop([17010, 7734, 47002, 44369, 24720, 50574, 36542, 42611,
         22513, 12773, 21501, 2424, 52910, 29735, 43004, 47125], axis='rows', inplace=True)
df.drop(df.index[df.EngineSize > 7500], axis='rows', inplace=True)
df.head()

> Now reduce the dimensionality of the data using **PCA** and answer the questions on ILIAS. Don't forget to scale the data first!