# Dimensionality Reduction

The sheer size of data in the modern age is not only a challenge for computer hardware but also a main bottleneck for the performance of many machine learning algorithms. The main goal of principal component analysis (PCA) is to identify patterns in data by detecting correlations between variables. Attempting to reduce the dimensionality only makes sense if strong correlations between variables exist. In a nutshell, this is what PCA is all about: finding the directions of maximum variance in high-dimensional data and projecting the data onto a lower-dimensional subspace while retaining most of the information.

---
### Imports

In [None]:
import io

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.decomposition import PCA

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

---
### Load Data

In [None]:
iris = sns.load_dataset('iris')

iris.info()
iris.head()

---
### Framework

In [None]:
ax_formatter = {
    'billions': FuncFormatter(lambda x, position: f'{x * 1e-9:.0f}'),
    'millions': FuncFormatter(lambda x, position: f'{x * 1e-6:.0f}'),
    'percent_convert': FuncFormatter(lambda x, position: f'{x * 100:.0f}%'),
    'percent': FuncFormatter(lambda x, position: f'{x:.0f}%'),
    'thousands': FuncFormatter(lambda x, position: f'{x * 1e-3:.0f}'),
}

names = (
    'Sepal Length',
    'Sepal Width',
    'Petal Length',
    'Petal Width',
)

column_names = [x.replace(' ', '_').lower()
                for x in names]

size = {
    'label': 14,
    'legend': 12,
    'title': 20,
    'super_title': 24,
}

---
## Principal Components Analysis

Often, the desired goal is to reduce the dimensions of a **d**-dimensional dataset by projecting it onto a **k**-dimensional subspace (where **k** < **d**) in order to increase the computational efficiency while retaining most of the information. An important question is “what is the size of **k** that represents the data ‘well’?”

Later, we will compute eigenvectors (the principal components) of a dataset and collect them in a projection matrix. Each of those eigenvectors is associated with an eigenvalue, which can be interpreted as the “length” or “magnitude” of the corresponding eigenvector, that is, the amount of variance in the data along that direction. If some eigenvalues have a significantly larger magnitude than others, then reducing the dataset via PCA onto a lower-dimensional subspace by dropping the “less informative” eigenpairs is reasonable.
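To make the idea of eigenpairs concrete, here is a minimal sketch that computes and sorts them for a small synthetic dataset. The toy data, the variable names, and the choice of `np.linalg.eigh` are assumptions made for illustration; they are not part of the exercises.

```python
import numpy as np

# Hypothetical toy data: 200 samples, 3 features. The second feature is
# almost a multiple of the first, so most variance lies along one direction.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
toy = np.column_stack([
    x0,
    2.0 * x0 + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Standardize each column to zero mean and unit variance.
toy_std = (toy - toy.mean(axis=0)) / toy.std(axis=0, ddof=1)

# Covariance matrix of the standardized data (features are columns).
cov = np.cov(toy_std, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in
# ascending order with the matching eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder to descending so the first eigenpair carries the most variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)
```

Because two of the three features are nearly redundant, the smallest eigenvalue should come out close to zero, which is the situation in which dropping an eigenpair loses little information.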

---
## Exercise 1 - Explore the Iris Data Set

[Original Data](https://archive.ics.uci.edu/ml/datasets/Iris). [Background Info](https://en.wikipedia.org/wiki/Iris_flower_data_set).

In [None]:
fig = plt.figure('Iris Violin Plot',
                 figsize=(12, 5), facecolor='white',
                 edgecolor='black')
rows, cols = (1, 2)
ax0 = plt.subplot2grid((rows, cols), (0, 0))
ax1 = plt.subplot2grid((rows, cols), (0, 1), sharey=ax0)

sns.boxplot(data=iris, width=0.4, ax=ax0)
sns.violinplot(data=iris, inner='quartile', ax=ax1)

for ax in (ax0, ax1):
    ax.set_xlabel('Characteristics', fontsize=size['label'])
    ax.set_xticklabels(names)
    ax.set_ylabel('Centimeters ($cm$)', fontsize=size['label'])

plt.suptitle('Iris Dataset', fontsize=size['title']);

In [None]:
fig = plt.figure('Iris Data Distribution Plots', figsize=(10, 15),
                 facecolor='white', edgecolor='black')
rows, cols = (4, 2)
ax0 = plt.subplot2grid((rows, cols), (0, 0))
ax1 = plt.subplot2grid((rows, cols), (0, 1))
ax2 = plt.subplot2grid((rows, cols), (1, 0))
ax3 = plt.subplot2grid((rows, cols), (1, 1))
ax4 = plt.subplot2grid((rows, cols), (2, 0))
ax5 = plt.subplot2grid((rows, cols), (2, 1))
ax6 = plt.subplot2grid((rows, cols), (3, 0))
ax7 = plt.subplot2grid((rows, cols), (3, 1))

n_bins = 40

for n, ax, data in zip(range(4), (ax0, ax2, ax4, ax6), column_names):
    iris[data].plot(kind='hist', alpha=0.5, bins=n_bins, color=f'C{n}',
                    edgecolor='black', label='_nolegend_', ax=ax)
    ax.axvline(iris[data].mean(), color='crimson', label='Mean',
               linestyle='--')
    ax.axvline(iris[data].median(), color='black', label='Median',
               linestyle='-.')
    ax.set_title(names[n])
    ax.set_ylabel('Count', fontsize=size['label'])

for n, ax, data in zip(range(4), (ax1, ax3, ax5, ax7), column_names):
    sns.distplot(iris[data], axlabel=False, bins=n_bins,
                 hist_kws={'alpha': 0.5, 'color': f'C{n}',
                           'edgecolor': 'black'},
                 kde_kws={'color': 'darkblue', 'label': 'KDE'},
                 ax=ax)
    ax.set_title(names[n])
    ax.set_ylabel('Density', fontsize=size['label'])

for ax in (ax0, ax1, ax2, ax3, ax4, ax5, ax6, ax7):
    ax.legend(fontsize=size['legend'])
    ax.set_xlabel('Centimeters ($cm$)', fontsize=size['label'])

plt.tight_layout()
plt.suptitle('Iris Data Distribution Plots',
             fontsize=size['super_title'], y=1.03);

In [None]:
grid = sns.pairplot(iris,
                    diag_kws={'alpha': 0.5, 'edgecolor': 'black'},
                    hue='species', markers=['o', 's', 'D'],
                    plot_kws={'alpha': 0.7})

grid.fig.suptitle('Iris Dataset Correlation',
                  fontsize=size['super_title'], y=1.03)
handles = grid._legend_data.values()
labels = grid._legend_data.keys()
grid._legend.remove()
grid.fig.legend(bbox_to_anchor=(1.02, 0.5), fontsize=size['legend'],
                handles=handles,
                labels=[x.capitalize() for x in labels],
                loc='center right')

for n in range(4):
    grid.axes[3, n].set_xlabel(names[n], fontsize=size['label'])
    grid.axes[n, 0].set_ylabel(names[n], fontsize=size['label'])

---
## Exercise 2 - Build a PCA Class

General Steps for PCA ([walkthrough in R if you get stuck](http://alexhwoods.com/pca/)):
1. Standardize the data.
2. Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition.
3. Sort eigenvalues in descending order and choose the **k** eigenvectors that correspond to the **k** largest eigenvalues where **k** is the number of dimensions of the new feature subspace (**k ≤ d**).
4. Construct the projection matrix **W** from the selected **k** eigenvectors.
5. Transform the original dataset **X** via **W** to obtain a **k**-dimensional feature subspace **Y**.

The class should be able to:
- Calculate the principal components with an optional parameter
- Project onto a 2-dimensional feature space
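One possible shape for such a class is sketched below, following the five steps listed above with an eigendecomposition of the covariance matrix. The class name `SimplePCA`, the `n_components` parameter, and the attribute names are hypothetical choices that loosely mirror scikit-learn's conventions; treat this as a starting point rather than the reference solution.

```python
import numpy as np


class SimplePCA:
    """Illustrative PCA via eigendecomposition of the covariance matrix."""

    def __init__(self, n_components=2):
        # Assumed optional parameter: number of components to keep (k).
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Step 1: standardize each feature.
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0, ddof=1)
        X_std = (X - self.mean_) / self.scale_
        # Step 2: eigenpairs of the (symmetric) covariance matrix.
        cov = np.cov(X_std, rowvar=False)
        eigenvalues, eigenvectors = np.linalg.eigh(cov)
        # Step 3: sort eigenvalues in descending order.
        order = np.argsort(eigenvalues)[::-1]
        self.eigenvalues_ = eigenvalues[order]
        self.explained_variance_ratio_ = (
            self.eigenvalues_ / self.eigenvalues_.sum())
        # Step 4: projection matrix W of shape (d, k).
        self.components_ = eigenvectors[:, order[:self.n_components]]
        return self

    def transform(self, X):
        # Step 5: project the standardized data onto the k components.
        X_std = (np.asarray(X, dtype=float) - self.mean_) / self.scale_
        return X_std @ self.components_

    def fit_transform(self, X):
        return self.fit(X).transform(X)


# Usage on random placeholder data (not the Iris measurements).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 4))
pca_demo = SimplePCA(n_components=2)
Y_demo = pca_demo.fit_transform(X_demo)
print(Y_demo.shape)
```

Storing the mean and scale during `fit` lets `transform` apply the same standardization to new data, which is a design choice borrowed from the fit/transform pattern rather than something the exercise requires.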

---
## Exercise 3 - Try it out on the Iris Data Set

1. Plot the individual explained variance vs. cumulative explained variance.
2. Plot the Iris data set on the new 2-dimensional feature subspace.
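As a starting point for the first plot, the individual and cumulative explained variance can be derived directly from the sorted eigenvalues. This sketch assumes scikit-learn's bundled copy of Iris (`datasets.load_iris`) as a stand-in for the seaborn frame loaded earlier; the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets

# Assumption: use scikit-learn's bundled Iris measurements, shape (150, 4).
X = datasets.load_iris().data

# Standardize, then take the eigenvalues of the covariance matrix.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
# eigvalsh returns ascending eigenvalues; reverse for descending order.
eigenvalues = np.linalg.eigvalsh(np.cov(X_std, rowvar=False))[::-1]

# Share of total variance per component, and the running total.
individual = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(individual)

for n, (ind, cum) in enumerate(zip(individual, cumulative), start=1):
    print(f'PC{n}: individual={ind:.1%}, cumulative={cum:.1%}')
```

The `individual` array suits a bar chart and `cumulative` a step or line plot on the same axes; the `'percent_convert'` formatter defined in the Framework cell can label the y-axis.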


---
## Exercise 4 - Check via Scikit-Learn

The exercises above are purely academic. In practice, you will almost always use an optimized implementation such as scikit-learn's `PCA`.
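A check along these lines might compare a manual projection against `sklearn.decomposition.PCA`. The sketch assumes `StandardScaler` for standardization and scikit-learn's bundled Iris data. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data, standardized per feature.
X = datasets.load_iris().data
X_std = StandardScaler().fit_transform(X)

# Manual route: top-2 eigenvectors of the covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(eigenvalues)[::-1][:2]
manual = X_std @ eigenvectors[:, order]

# Reference route: scikit-learn's SVD-based implementation.
pca = PCA(n_components=2)
reference = pca.fit_transform(X_std)

# Each component may be flipped in sign, so compare magnitudes.
print(np.allclose(np.abs(manual), np.abs(reference)))
# np.cov and PCA both divide by n - 1, so the variances should agree.
print(np.allclose(eigenvalues[order], pca.explained_variance_))
```

If the two routes disagree, the usual suspects are a missing standardization step or eigenvalues that were not sorted before selecting components.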