

# Principal Component Analysis (PCA)

## Content of this notebook
1. Brief introduction to PCA using the wine-quality dataset.
2. We will explore the data set a bit. The correlation between the variables is particularly interesting.
3. We will check graphically how well individual features can separate the 3 classes of wine.
4. We will scale our data and apply PCA.
5. We will check how much information is stored in each newly created principal component and how well the first 2 principal components can separate the 3 classes of wine.
6. Math behind PCA (optional)



## Brief primer and history

PCA was invented in 1901 by [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson) as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by [Harold Hotelling](https://en.wikipedia.org/wiki/Harold_Hotelling) in the 1930s.

Principal component analysis (PCA) is a statistical procedure that uses an [orthogonal transformation](https://en.wikipedia.org/wiki/Orthogonal_transformation) to convert a set of observations of possibly correlated variables into a set of values of [linearly uncorrelated](https://en.wikipedia.org/wiki/Correlation_and_dependence) variables called principal components. 

The number of distinct principal components is equal to the smaller of the number of original variables or the number of observations minus one. 
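
The `observations minus one` part of this limit comes from centering: subtracting the column means removes one degree of freedom, so a centered matrix with $n$ rows has rank at most $n - 1$. A minimal sketch with random data (assumed for illustration, not the wine data):

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 observations, 10 variables: fewer observations than variables
X_small = rng.normal(size=(4, 10))

# Centering each column removes one degree of freedom
X_centered = X_small - X_small.mean(axis=0)

# The rank bounds the number of principal components with non-zero variance
n_components_max = np.linalg.matrix_rank(X_centered)
print(n_components_max)  # min(10, 4 - 1) = 3
```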

This transformation is defined in such a way that the first principal component has the largest possible [variance](https://en.wikipedia.org/wiki/Variance) (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is [orthogonal](https://en.wikipedia.org/wiki/Orthogonal) to the preceding components. The resulting vectors are an uncorrelated [orthogonal basis set](https://en.wikipedia.org/wiki/Orthogonal_basis_set). 
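
These properties can be checked numerically. The sketch below uses synthetic data (an assumption for illustration) and computes the principal directions as eigenvectors of the covariance matrix, which is one possible way to obtain them. The projected data should have a diagonal covariance matrix with decreasing variances.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two strongly correlated variables plus one independent variable
a = rng.normal(size=500)
X_demo = np.column_stack([a, 0.8 * a + 0.2 * rng.normal(size=500), rng.normal(size=500)])
X_demo = X_demo - X_demo.mean(axis=0)

# Eigenvectors of the covariance matrix are the principal directions
cov = np.cov(X_demo, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the data onto the principal directions
scores = X_demo @ eigvecs

print(np.round(np.cov(scores, rowvar=False), 3))   # approximately diagonal
print(np.round(eigvecs.T @ eigvecs, 3))            # identity: orthonormal directions
```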

Keep in mind that PCA is sensitive to the relative scaling of the original variables!
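
To illustrate this sensitivity, the following sketch builds two independent synthetic variables (assumed data), one of them recorded in much larger units. The helper `first_component` is a hypothetical name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent variables; the second is recorded in much larger units
x1 = rng.normal(scale=1.0, size=300)
x2 = rng.normal(scale=1000.0, size=300)
X_raw = np.column_stack([x1, x2])

def first_component(X):
    """Return the unit-length direction of largest variance."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

# Without scaling, the first component points almost entirely along x2
w_raw = first_component(X_raw)

# After standardizing, both variables have unit variance and get equal weight
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
w_std = first_component(X_std)

print(np.round(np.abs(w_raw), 3))
print(np.round(np.abs(w_std), 3))
```

Without scaling, the first component mostly reflects the choice of units rather than any structure in the data.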

If you are interested in the math behind PCA, you will find a section about it at the end of this notebook. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Read in the data and perform basic exploratory analysis

In [None]:
df = pd.read_csv('data/wine_data.csv')
df.head(10)

### Basic statistics

In [None]:
df.iloc[:,1:].describe()

### Box plots by output labels/classes

In [None]:
for c in df.columns[1:]:
    df.boxplot(c,by='Class',figsize=(7,4),fontsize=14)
    plt.title("{}\n".format(c),fontsize=16)
    plt.xlabel("Wine Class", fontsize=16)

**Some features separate the wine classes quite clearly.** For example, Alcalinity, Total Phenols, or Flavanoids produce box plots with well-separated medians, which makes them clearly indicative of the wine class.

Below is an example of class separation using two variables.

In [None]:
plt.figure(figsize=(10,6))
scatter = plt.scatter(df['OD280/OD315 of diluted wines'],df['Flavanoids'],c=df['Class'],edgecolors='k',alpha=0.75,s=150)
plt.grid(True)
classes = ['1', '2', '3']
plt.legend(handles=scatter.legend_elements()[0], labels=classes)
plt.title("Scatter plot of two features showing the \ncorrelation and class separation",fontsize=15)
plt.xlabel("OD280/OD315 of diluted wines",fontsize=15)
plt.ylabel("Flavanoids",fontsize=15)
plt.show()

### Are the features independent? Plot the correlation matrix


In [None]:
def correlation_matrix(df):
    """Plot the pairwise correlation of all columns in df as a heat map."""
    fig = plt.figure(figsize=(16,12))
    ax1 = fig.add_subplot(111)
    # plt.get_cmap is used because cm.get_cmap is no longer available in recent Matplotlib versions
    cmap = plt.get_cmap('jet', 30)
    cax = ax1.imshow(df.corr(), interpolation="nearest", cmap=cmap)
    ax1.grid(True)
    plt.title('Wine data set features correlation\n',fontsize=15)
    labels = df.columns
    # Derive the tick positions from the number of columns instead of hard-coding 14
    ax1.set_xticks(np.arange(len(labels)))
    ax1.set_xticklabels(labels,fontsize=11, rotation=90)
    ax1.set_yticks(np.arange(len(labels)))
    ax1.set_yticklabels(labels,fontsize=11)
    # Add colorbar with ticks every 0.1 across the possible correlation range [-1, 1]
    fig.colorbar(cax, ticks=[0.1*i for i in range(-10,11)])
    plt.show()

correlation_matrix(df)

The plot shows a considerable amount of correlation between the features, i.e. they are not independent of each other. Independence of the variables is a common assumption of algorithms (e.g. Naive Bayes). However, we will still go ahead and apply the classifier to see how it performs.
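
A heat map is convenient for an overview, but it can also help to list the most strongly correlated pairs explicitly. The sketch below is one possible way to do that; `top_correlated_pairs` is a hypothetical helper and the toy data frame is an assumption for illustration. On the wine data it could be called as `top_correlated_pairs(df.drop('Class', axis=1))`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy data (assumed for illustration): 'b' depends on 'a', 'c' is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 2 * a + 0.1 * rng.normal(size=200), "c": rng.normal(size=200)})

top = top_correlated_pairs(toy, n=1)
print(top)
```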

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.drop('Class',axis=1)
y = df['Class']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)


## Principal Component Analysis

### Data scaling
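PCA is sensitive to the scale of the variables, so the features should be standardized (zero mean, unit variance) before applying it. Note that the scaler is fitted on the training data only.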

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
X_train_scaled = scaler.fit_transform(X_train)

In [None]:
df_scaled = pd.DataFrame(data=X_train_scaled,columns=df.columns[1:])

In [None]:
df_scaled.head()

In [None]:
df_scaled.describe()
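
In the summary above, the standard deviation of the scaled columns is typically reported as slightly larger than 1. This is not an error: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas reports the sample standard deviation (`ddof=1`). A minimal sketch with a synthetic column (assumed data) that standardizes by hand:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 50
raw = rng.normal(loc=50.0, scale=5.0, size=n_rows)

# z = (x - mean) / std with the population standard deviation (ddof=0)
z = (raw - raw.mean()) / raw.std()
col = pd.Series(z)

print(col.std(ddof=0))  # 1 up to rounding
print(col.std())        # sqrt(n / (n - 1)), slightly above 1
```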

### PCA class import and analysis

[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA) is already implemented in scikit-learn. Check out the parameters that can be set for PCA and the attributes that are calculated after PCA is performed.

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=None)

In [None]:
# fit() returns the fitted estimator itself, so df_scaled_pca refers to the same object as pca
df_scaled_pca = pca.fit(df_scaled)

In [None]:
# TODO: try out some attributes of pca and check your understanding


#### Plot the _explained variance ratio_ for each principal component


In [None]:
plt.figure(figsize=(10,6))
plt.scatter(x=[i+1 for i in range(len(df_scaled_pca.explained_variance_ratio_))],
            y=df_scaled_pca.explained_variance_ratio_,
            s=200, alpha=0.75,c='orange',edgecolor='k')
plt.grid(True)
plt.title("Explained variance ratio of the \nfitted principal component vector\n",fontsize=25)
plt.xlabel("Principal components",fontsize=15)
plt.xticks([i+1 for i in range(len(df_scaled_pca.explained_variance_ratio_))],fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel("Explained variance ratio",fontsize=15)
plt.show()

The above plot shows that the **first principal component explains about 36%** of the total variance in the data and the **second component explains a further 20%**. Therefore, if we consider just the first two components, they together explain **56%** of the total variance.
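
The same reasoning extends to more components by accumulating the ratios. The sketch below uses made-up ratios (assumed values, not output from the wine data) and an arbitrary threshold of 80%. With the fitted object from this notebook, `pca.explained_variance_ratio_` could be used instead.

```python
import numpy as np

# Illustrative ratios only (assumed values, not output from the wine data)
explained_variance_ratio = np.array([0.36, 0.20, 0.11, 0.08, 0.07, 0.05,
                                     0.04, 0.03, 0.03, 0.02, 0.01])

# Running total of explained variance
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.80
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(np.round(cumulative[:2], 2))  # first two components together: 0.36 and 0.56
print(n_needed)                     # 5
```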

### Showing better class separation using principal components

#### Transform the scaled data set using the fitted PCA object

In [None]:
X_train_scaled_trans = pca.transform(df_scaled)

#### Put it in a data frame

In [None]:
X_train_scaled_trans = pd.DataFrame(data=X_train_scaled_trans)
X_train_scaled_trans.head(10)

#### Plot the first two columns of this transformed data set with the color set to original ground truth class label

In [None]:
plt.figure(figsize=(10,6))
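# Keep the returned object so the legend below is built from this plot, not from the earlier scatter plot
scatter = plt.scatter(X_train_scaled_trans[0],X_train_scaled_trans[1],c=y_train,edgecolors='k',alpha=0.75,s=150)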
classes = ['1', '2', '3']
plt.legend(handles=scatter.legend_elements()[0], labels=classes)
plt.grid(True)
plt.title("Class separation using first two principal components\n",fontsize=20)
plt.xlabel("Principal component-1",fontsize=15)
plt.ylabel("Principal component-2",fontsize=15)
plt.show()

Graphically, the first 2 principal components separate the classes better than the 2 variables most correlated with the target variable.
Let's see whether this visual impression holds when we use a model to predict the class of a wine.
Since we mentioned Naive Bayes before, let's use this classifier for the test.

If you are not familiar with the Naive Bayes classifier, that is not a problem. You can learn more about it on [scikit-learn.org](https://scikit-learn.org/stable/modules/naive_bayes.html), but for this notebook it doesn't really matter.


In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
variables = ["Flavanoids", "OD280/OD315 of diluted wines"]

In [None]:
#TODO: instantiate the model and train it on X_train[variables], y_train (data without any transformations)


In [None]:
#TODO: predict the classes with the model on X_test[variables] and store them in y_pred


In [None]:
# we will use the accuracy score for an easy comparison of results
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

Let's check how well the model performs when using the first and second principal components.
Remember what the transformed data looks like:

In [None]:
X_train_scaled_trans.head(1)

In [None]:
# the first and second component are stored in column 0 and 1
variables = [0,1]

In [None]:
#TODO: instantiate the model and train it on X_train_scaled_trans[variables], y_train (scaled and PCA-transformed data)


Before we can predict on the test data, we need to transform it in the same way as the training data.
Remember, we used the StandardScaler and PCA to transform our training data.
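
One way to keep the training and test transformations consistent is to chain the steps in a scikit-learn pipeline, so that scaling statistics and principal axes are learned from the training split only. The following is a sketch on synthetic stand-in data (generated with `make_classification`, an assumption for illustration), not a solution for the wine data. The exercise below can also be solved step by step without a pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumed for illustration, not the wine data)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.5, random_state=42)

# fit() learns the scaling statistics and principal axes from the training split only
model = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())
model.fit(X_tr, y_tr)

# predict() re-applies the already fitted scaler and PCA before classifying
y_hat = model.predict(X_te)
print(y_hat.shape)
```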

In [None]:
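#TODO: use the already fitted StandardScaler to scale X_test (transform only, do not fit again)

#TODO: transform the scaled test data with the already fitted PCA and store the result in X_test_scaled_trans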


In [None]:
X_test_scaled_trans = pd.DataFrame(data=X_test_scaled_trans)

In [None]:
#TODO: predict y


In [None]:
#TODO: calculate the accuracy


- What are your conclusions with these classifications?
- Which variables yield better results?

Feel free to experiment further, for example:
- test a different classifier, or
- add more variables to your model.


## Mathematical details
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

Consider a data matrix, $\mathbf{X}$, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of feature (say, the results from a particular sensor).

Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights or loadings
$\mathbf{w}_{(k)}=(w_{1},\dots ,w_{p})_{(k)}$ that map each row vector $\mathbf{x}_{(i)}$ of $\mathbf{X}$ to a new vector of principal component scores $\mathbf{t}_{(i)}=(t_{1},\dots ,t_{m})_{(i)}$ given by

$${t_{k}}_{(i)}=\mathbf{x}_{(i)}\cdot \mathbf{w}_{(k)}\qquad \mathrm{for} \qquad i=1,\dots ,n\qquad k=1,\dots ,m$$

in such a way that the individual variables $t_{1},\dots ,t_{m}$ of $\mathbf{t}$ considered over the data set successively inherit the maximum possible variance from $\mathbf{X}$, with each loading vector $\mathbf{w}$ constrained to be a unit vector.

In order to maximize variance, the first loading vector $\mathbf {w} _{(1)}$ thus has to satisfy

$$ {\displaystyle \mathbf {w} _{(1)}={\underset {\Vert \mathbf {w} \Vert =1}{\operatorname {\arg \,max} }}\,\left\{\sum _{i}\left(t_{1}\right)_{(i)}^{2}\right\}={\underset {\Vert \mathbf {w} \Vert =1}{\operatorname {\arg \,max} }}\,\left\{\sum _{i}\left(\mathbf {x} _{(i)}\cdot \mathbf {w} \right)^{2}\right\}}$$

Equivalently, writing this in matrix form gives

$${\displaystyle \mathbf {w} _{(1)}={\underset {\Vert \mathbf {w} \Vert =1}{\operatorname {\arg \,max} }}\,\{\Vert \mathbf {Xw} \Vert ^{2}\}={\underset {\Vert \mathbf {w} \Vert =1}{\operatorname {\arg \,max} }}\,\left\{\mathbf {w} ^{T}\mathbf {X} ^{T}\mathbf {Xw} \right\}}$$

Since $\mathbf {w} _{(1)}$ has been defined to be a unit vector, it equivalently also satisfies
$${\displaystyle \mathbf {w} _{(1)}={\operatorname {\arg \,max} }\,\left\{{\frac {\mathbf {w} ^{T}\mathbf {X} ^{T}\mathbf {Xw} }{\mathbf {w} ^{T}\mathbf {w} }}\right\}}$$
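
The quotient above is a Rayleigh quotient, and its maximum over all non-zero $\mathbf{w}$ is the largest eigenvalue of $\mathbf{X}^T\mathbf{X}$, attained at the corresponding eigenvector. The sketch below checks this numerically on random centered data (assumed for illustration). `rayleigh_quotient` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random correlated data with zero column means (assumed for illustration)
X_rq = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X_rq = X_rq - X_rq.mean(axis=0)

def rayleigh_quotient(w, X):
    """w^T X^T X w / (w^T w): the quantity maximized by the first loading vector."""
    return (w @ X.T @ X @ w) / (w @ w)

# The maximizer is the eigenvector of X^T X with the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(X_rq.T @ X_rq)
w1 = eigvecs[:, -1]            # eigh returns eigenvalues in ascending order

# No randomly drawn direction should give a larger quotient than w1
random_ws = rng.normal(size=(1000, 4))
best_random = max(rayleigh_quotient(w, X_rq) for w in random_ws)

print(rayleigh_quotient(w1, X_rq) >= best_random)
```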

With $\mathbf{w}_{(1)}$ found, the first principal component of a data vector $\mathbf{x}_{(i)}$ can then be given as a score ${t_{1}}_{(i)} = \mathbf{x}_{(i)} \cdot \mathbf{w}_{(1)}$ in the transformed co-ordinates, or as the corresponding vector in the original variables, $\left(\mathbf{x}_{(i)} \cdot \mathbf{w}_{(1)}\right) \mathbf{w}_{(1)}$.

The $k^{th}$ component can be found by subtracting the first $k$ − 1 principal components from $\mathbf{X}$:

$${\displaystyle \mathbf {\hat {X}} _{k}=\mathbf {X} -\sum _{s=1}^{k-1}\mathbf {X} \mathbf {w} _{(s)}\mathbf {w} _{(s)}^{\rm {T}}}$$
and then finding the loading vector which extracts the maximum variance from this new data matrix

$${\displaystyle \mathbf {w} _{(k)}={\underset {\Vert \mathbf {w} \Vert =1}{\operatorname {arg\,max} }}\left\{\Vert \mathbf {\hat {X}} _{k}\mathbf {w} \Vert ^{2}\right\}={\operatorname {\arg \,max} }\,\left\{{\tfrac {\mathbf {w} ^{T}\mathbf {\hat {X}} _{k}^{T}\mathbf {\hat {X}} _{k}\mathbf {w} }{\mathbf {w} ^{T}\mathbf {w} }}\right\}}$$
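
The deflation step for $k = 2$ can be sketched as follows, again on random centered data (assumed for illustration). The result is compared with the second eigenvector of $\mathbf{X}^T\mathbf{X}$, which should give the same direction up to sign. `top_direction` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random correlated data with zero column means (assumed for illustration)
X_df = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))
X_df = X_df - X_df.mean(axis=0)

def top_direction(X):
    """Unit vector maximizing ||X w||^2, i.e. the leading eigenvector of X^T X."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    return eigvecs[:, -1]

# First loading vector from the full data matrix
w1 = top_direction(X_df)

# Deflate: remove the part of X explained by the first component
X_hat = X_df - X_df @ np.outer(w1, w1)

# Second loading vector from the deflated matrix
w2 = top_direction(X_hat)

# Compare with the second eigenvector of X^T X (the sign is arbitrary)
_, eigvecs_full = np.linalg.eigh(X_df.T @ X_df)
w2_direct = eigvecs_full[:, -2]

print(np.isclose(abs(w2 @ w2_direct), 1.0))  # same direction up to sign
print(np.isclose(w1 @ w2, 0.0))              # orthogonal to the first loading vector
```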

Computing the [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) is now the standard way to calculate a principal components analysis from a data matrix.
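
A minimal sketch of the SVD route, using NumPy on random centered data (assumed for illustration): the rows of $\mathbf{V}^T$ are the loading vectors, and the squared singular values divided by $n - 1$ give the variance along each component. The result is cross-checked against the eigenvalues of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random correlated data with zero column means (assumed for illustration)
X_svd = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
X_svd = X_svd - X_svd.mean(axis=0)
n = X_svd.shape[0]

# Thin SVD: X = U diag(s) Vt, the rows of Vt are the loading vectors
U, s, Vt = np.linalg.svd(X_svd, full_matrices=False)

# Variance along each component and its share of the total
explained_variance = s**2 / (n - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Principal component scores, computed in two equivalent ways
scores_svd = U * s
scores_proj = X_svd @ Vt.T

# Cross-check against the eigenvalues of the sample covariance matrix
cov_eigvals = np.linalg.eigvalsh(np.cov(X_svd, rowvar=False))[::-1]

print(np.allclose(scores_svd, scores_proj))
print(np.allclose(explained_variance, cov_eigvals))
```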