# Walk Through Example PCA

(molecule data set)

In this example, we perform the following steps:<br> 
<br>
1) Loading the molecule data set *"molecular_gbc.csv"*<br>
2) Plotting Pearson's correlation coefficient of all features in *"molecular_gbc.csv"* as a heatmap, and creating a UMAP plot<br>
3) Scaling and normalizing the data set before running a PCA<br>

<br>

**0) Loading libraries**

We load our standard libraries:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go

Another useful tool for mapping high-dimensional data into a 2D or 3D plot is UMAP (see the next lecture for more details). 

In [None]:
#pip install umap-learn

In [None]:
import umap.umap_ as umap

Finally, we import the libraries needed for performing PCA and normalization of the dataset.  

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

<br>

**1) Reading the data set**

We read the dataset and separate the features *X* and the labels *Y*.

In [None]:
Data  = pd.read_csv('../Datasets/molecular_gbc.csv')

Y     = Data['label']
X     = Data.drop('label', axis = 1)

In [None]:
Data.head()

In [None]:
X.head()

In [None]:
Y.head()

<br>

**2) Plotting the data set**

Since the dataset has five features, an ordinary 3D plot of the raw data is not possible. However, we can still plot the pairwise correlation values, and we can plot each feature against each other feature.

2a) Pearson's correlation coefficient 

In [None]:
sns.heatmap(X.corr(), annot = True, cmap = sns.color_palette("Blues"))
plt.show()
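
To make clear what `X.corr()` computes, here is a minimal sketch of Pearson's correlation coefficient written out with NumPy. The array `toy` and the helper `pearson_matrix` are hypothetical and only serve as an illustration; they are not part of the molecule data set.

```python
import numpy as np

# Hypothetical toy data (not the molecule data set): 50 samples, 3 features.
# Feature 1 is almost a multiple of feature 0, feature 2 is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
toy = np.column_stack([a,
                       2.0 * a + rng.normal(scale=0.1, size=50),
                       rng.normal(size=50)])

def pearson_matrix(data):
    """Pearson correlation between all pairs of columns of a 2D array."""
    centered = data - data.mean(axis=0)                  # subtract column means
    cov = centered.T @ centered / (data.shape[0] - 1)    # sample covariance
    std = np.sqrt(np.diag(cov))                          # column standard deviations
    return cov / np.outer(std, std)                      # normalize to [-1, 1]

corr = pearson_matrix(toy)
print(np.round(corr, 2))
```

The coefficient is the covariance divided by the product of the two standard deviations, which is why the diagonal is always 1 and why rescaling a feature by a positive factor does not change the heatmap.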

2b) Each feature vs. each other feature

In [None]:
out = sns.pairplot(X, kind = "kde", \
                   plot_kws = {'color':[176/255,224/255,230/255]}, \
                   diag_kws = {'color':'black'})
out.map_offdiag(plt.scatter, color = 'black')
plt.show()

As we can see, the features are highly correlated. Finally, we want to map the 5D dataset into 3D using UMAP (note that this is not an analysis yet).

2c) UMAP plot

Here we use UMAP purely as a plotting tool. Since we can't plot the data in 5D, we use UMAP to map the data to 3D. **How that works exactly will be discussed in the next module**, so don't worry too much about the details.<br> 
First, we scale each feature of the data set to the range [0, 1]:

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1)) 
XS     = scaler.fit_transform(X)
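
As a rough illustration of what min-max scaling to the range [0, 1] does, the following sketch applies the formula (x - min) / (max - min) column by column with NumPy. The array `toy` is a hypothetical example and not taken from the molecule data set.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [4.0, 800.0]])

col_min = toy.min(axis=0)   # per-feature minimum
col_max = toy.max(axis=0)   # per-feature maximum

# Map every feature linearly so that its minimum becomes 0 and its maximum becomes 1
toy_scaled = (toy - col_min) / (col_max - col_min)
print(toy_scaled)
```

After scaling, both columns cover the same range, so no feature dominates the covariance matrix only because of its units.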

In [None]:
X_UMAP = umap.UMAP(n_components = 3).fit_transform(XS) #5D --> 3D

In [None]:
colorsIdx = {'Non-Toxic': 'black', 'Toxic': 'red'}
cols      = Y.map(colorsIdx)

scatter = go.Scatter3d(x = X_UMAP[:,0], y = X_UMAP[:,1], z = X_UMAP[:,2], mode = 'markers', marker = dict(size = 3, color = cols))
fig     = go.Figure(data = [scatter])
fig.update_layout(width = 800, height = 400, margin = dict(r = 100, b = 10, l = 0, t = 0))
fig.show()

In the last plot, we can't really see clusters that correspond to the labels. Therefore, let's run a PCA on the scaled data.

<br>

**3) PCA**

Since the data has been normalized, we can plot the covariance matrix now *before* we apply PCA in order to compare it to the covariance matrix *after* PCA.

In [None]:
sns.heatmap(pd.DataFrame(XS, columns = X.columns).cov(), annot = True, cmap = sns.color_palette("Blues"))
plt.show()

Now, we perform the actual PCA with the *scaled* data using *fit* for setting up our model and *transform* for transforming the data into the eigenspace.

In [None]:
out = PCA(n_components = 5).fit(XS) 

In [None]:
eigenVec = out.components_
eigenVal = out.explained_variance_
eigenX   = out.transform(XS)
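
To connect these three quantities to the underlying linear algebra, here is a minimal NumPy sketch of PCA as an eigendecomposition of the covariance matrix. It uses hypothetical toy data in place of `XS`, and the names `eig_val`, `eig_vec` and `toy_pca` are illustrative stand-ins for `eigenVal`, `eigenVec` and `eigenX`. The signs of individual eigenvectors are arbitrary and may differ from what scikit-learn returns.

```python
import numpy as np

# Hypothetical toy data standing in for the scaled feature matrix
rng = np.random.default_rng(1)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.1]])
toy = rng.normal(size=(200, 3)) @ mixing

# 1) Center the data and compute the sample covariance matrix
centered = toy - toy.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# 2) Eigendecomposition of the symmetric covariance matrix
vals, vecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
order = np.argsort(vals)[::-1]       # sort from largest to smallest
eig_val = vals[order]                # variance along each principal direction
eig_vec = vecs[:, order].T           # one eigenvector per ROW

# 3) Project the centered data onto the eigenvectors
toy_pca = centered @ eig_vec.T

print(np.round(eig_val, 4))
```

Note the orientation chosen here: each row of `eig_vec` is one eigenvector, which matches the shape `(n_components, n_features)` of `components_` in scikit-learn.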

Plotting eigenvalue spectrum:

In [None]:
xplot    = np.arange(1,6)

fig = plt.figure(figsize=(5, 3))
plt.bar(xplot, eigenVal, color = (0.9, 0.9, 0.9), edgecolor = 'black')
plt.xlabel('dimension')
plt.ylabel('eigenvalue')
plt.yscale('log')
plt.xticks(xplot)
plt.show()

We see that three principal components are sufficient to analyze the data, since two of the eigenvalues are significantly smaller than the other three.<br>
Now, we first check the covariance heatmap and then create a plot of the scaled and PCA-transformed data. This time we can create a scatter plot of the actual data, since we only need three dimensions.

In [None]:
sns.heatmap(pd.DataFrame(eigenX).cov(), annot = True, cmap = sns.color_palette("Blues"))
plt.show()
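
The heatmap should be (numerically) diagonal, because PCA rotates the data into a coordinate system in which the components are uncorrelated. The following sketch checks this on hypothetical toy data, using a singular value decomposition instead of the fitted scikit-learn model; `toy`, `scores` and the other names are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical toy data with two strongly correlated features
rng = np.random.default_rng(2)
base = rng.normal(size=(300, 1))
toy = np.hstack([base,
                 0.8 * base + 0.2 * rng.normal(size=(300, 1)),
                 rng.normal(size=(300, 1))])

centered = toy - toy.mean(axis=0)

# PCA via SVD: the rows of vt are the principal directions
_, sing, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt.T             # data expressed in the eigenbasis

cov_before = np.cov(toy, rowvar=False)
cov_after = np.cov(scores, rowvar=False)

# Everything except the diagonal should vanish after the transformation
off_diag_after = cov_after - np.diag(np.diag(cov_after))

print(np.round(cov_before, 3))
print(np.round(cov_after, 3))
```

The remaining diagonal entries are the eigenvalues, i.e. the variances along the principal directions.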

In [None]:
colorsIdx = {'Non-Toxic': 'black', 'Toxic': 'red'}
cols      = Y.map(colorsIdx)

scatter = go.Scatter3d(x = eigenX[:,0], y = eigenX[:,1], z = eigenX[:,2], mode = 'markers', marker = dict(size = 5, color = cols))
fig     = go.Figure(data = [scatter])
fig.update_layout(width = 800, height = 800, margin = dict(r = 10, b = 10, l = 100, t = 10))
fig.show()

Even though there are no well-separated clusters visible, we at least find a tendency for the toxic molecules to lie in a different part of the data space than the non-toxic molecules. 

In order to get an idea of how well the classes *"toxic"* and *"non-toxic"* separate along the different eigendirections, we generate a pair-plot-like grid of scatter plots. 

In [None]:
n_components = 5
_, axes      = plt.subplots(n_components, n_components, figsize=(10, 10))
# Boolean masks select rows by position, independent of the DataFrame index
idxToxic     = (Data['label'] == 'Toxic').to_numpy()
idxNonToxic  = (Data['label'] == 'Non-Toxic').to_numpy()

for i in range(n_components):
    for j in range(n_components):
        if i != j:
            current_ax = axes[i, j]
            x          = eigenX[:,i]
            y          = eigenX[:,j]
            current_ax.scatter(x[idxToxic], y[idxToxic], c = 'red')
            current_ax.scatter(x[idxNonToxic], y[idxNonToxic], c = 'black')
            current_ax.set_xlabel('PC ' + str(i+1))
            current_ax.set_ylabel('PC ' + str(j+1))
        else:
            axes[i, j].axis('off')
plt.tight_layout()
plt.show()

As indicated by the eigenvalues, most of the information (= variance) is explained by the first 2-3 directions.
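
One common way to put a number on this statement is the cumulative explained variance ratio. The sketch below uses made-up eigenvalues, NOT the values of the molecule data set, and an assumed cutoff of 90 %; both are only for illustration.

```python
import numpy as np

# Hypothetical eigenvalues, sorted from largest to smallest
eig_val = np.array([0.50, 0.30, 0.15, 0.04, 0.01])

ratio = eig_val / eig_val.sum()      # fraction of the total variance per direction
cumulative = np.cumsum(ratio)        # variance explained by the first k directions

threshold = 0.90                     # assumed cutoff, choose according to the task
n_keep = int(np.argmax(cumulative >= threshold) + 1)

print(np.round(cumulative, 2))
print(n_keep)
```

For a fitted scikit-learn `PCA` object, the per-direction fractions are available as the attribute `explained_variance_ratio_`.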

Since the new coordinate axes are linear combinations of the old coordinate axes, we can visualize the so-called **loadings**, i.e. the components of each eigenvector with respect to the original variables.

In [None]:
_, axes = plt.subplots(n_components, 1, figsize=(5, 10), sharex=True)
for i in range(n_components):
    current_ax = axes[i]
    # components_ has shape (n_components, n_features): row i is eigenvector i
    current_ax.bar(xplot, eigenVec[i,:], color = (0.9, 0.9, 0.9), edgecolor = 'black')
    current_ax.set_ylabel('loadings')
    current_ax.set_xticks(xplot,X.columns, rotation = 90)
    current_ax.set_title('eigenvector #' + str(i+1))
plt.show()

That gives us an idea of how the eigenvectors are oriented. For example, eigenvector 2 is dominated by the bond length and is therefore almost parallel to this axis.<br>
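
Reading off the dominant variable can also be done programmatically. The sketch below is an illustration on hypothetical toy data with made-up feature names (`feature_a`, `feature_b`, `feature_c`); it assumes, as in scikit-learn's `components_`, that each row of the matrix is one unit-length eigenvector.

```python
import numpy as np

# Hypothetical feature names and toy data (not the molecule data set).
# feature_a has by far the largest variance and is independent of the others.
feature_names = ["feature_a", "feature_b", "feature_c"]
rng = np.random.default_rng(3)
n0, n1, n2 = rng.normal(size=(3, 500))
toy = np.column_stack([3.0 * n0, n1, n1 + 0.1 * n2])

# Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
vals, vecs = np.linalg.eigh(np.cov(toy, rowvar=False))
components = vecs[:, np.argsort(vals)[::-1]].T   # row i = eigenvector i

# For unit-length eigenvectors, |loading| is the cosine of the angle
# between the eigenvector and the corresponding original axis
dominant = [feature_names[int(np.argmax(np.abs(row)))] for row in components]

for i, (row, name) in enumerate(zip(components, dominant), start=1):
    print(f"eigenvector #{i}: loadings {np.round(row, 2)}, dominated by {name}")
```

A loading close to +1 or -1 means the eigenvector is almost parallel to that original axis, while several loadings of similar size indicate a genuine mixture of variables.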