# Dimension Reduction
## and naive Bayes

(molecule data set)

As with the artificial dataset, we first want to walk through the analysis pipeline, which performs the following steps:<br> 
<br>
1) Loading the molecule data sets *"molecular_train_gbc.csv"* and *"molecular_test_gbc.csv"*<br>
2) Creating a heatmap of Pearson's correlation between all features in *"molecular_train_gbc.csv"* and a UMAP plot<br>
3) Scaling and normalizing the dataset, before running a PCA<br>
4) Finally using naive Bayes for classification<br>  

<br>

**0) Loading libraries**

We load our standard libraries:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go

Another useful tool for mapping high-dimensional data into a 2D or 3D plot is UMAP (come to the office hour if you are interested in the details). 

In [None]:
import umap.umap_ as umap

Finally, we import the libraries needed for performing Naive Bayes, PCA and normalization of the dataset.  

In [None]:
from sklearn.naive_bayes import GaussianNB  # only the Gaussian variant is used below
from sklearn.decomposition import PCA
from plot_entropy_and_confusion import *
from sklearn.preprocessing import MinMaxScaler

<br>

**1) Reading the data set**

We read the training dataset for fitting the model and the test dataset for evaluating its performance, then separate the features *X* from the labels *Y*.

In [None]:
Train  = pd.read_csv('molecular_train_gbc.csv')
TrainY = Train['label']
TrainX = Train.drop('label', axis = 1)

Test   = pd.read_csv('molecular_test_gbc.csv')
TestY  = Test['label']
TestX  = Test.drop('label', axis = 1)

In [None]:
Train.head()

In [None]:
Test.head()

<br>

**2) Plotting the data set**

Since the dataset has five features, an ordinary 3D plot of the raw data is not possible. However, we can still plot the correlation values. We also want to plot each feature against each other.

2a) Pearson's Correlation Coefficient 

In [None]:
sns.heatmap(TrainX.corr(), annot=True, cmap = sns.color_palette("Blues"))
plt.show()
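
As a minimal sketch of what the heatmap entries mean, the following computes Pearson's correlation coefficient by hand and compares it with `DataFrame.corr()`. The frame `demo` and its columns `f1`, `f2`, `f3` are synthetic stand-ins invented for illustration; they are not the molecule features.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for TrainX (hypothetical values, not the molecule data)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "f1": a,
    "f2": 2.0 * a + rng.normal(scale=0.1, size=200),  # strongly tied to f1
    "f3": rng.normal(size=200),                        # drawn independently of f1
})

def pearson(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r_manual = pearson(demo["f1"].to_numpy(), demo["f2"].to_numpy())
r_pandas = demo.corr().loc["f1", "f2"]  # pandas uses Pearson's r by default
print(r_manual, r_pandas)
```

Values close to 1 or -1 indicate a nearly linear relationship between two features, which is the situation PCA is meant to address later on.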

2b) Each feature vs. each other

In [None]:
out = sns.pairplot(TrainX, kind = "kde",
                   plot_kws = {'color':[176/255,224/255,230/255]},
                   diag_kws = {'color':'black'})
out.map_offdiag(plt.scatter, color = 'black')
plt.show()

As we can see, the features are highly correlated. Finally, we project the 5D dataset into 3D using UMAP (note that this is only a visualization, not yet an analysis).

2c) UMAP plot

First we scale the data set as before:

In [None]:
scaler  = MinMaxScaler(feature_range=(0, 1)) 
TrainXS = scaler.fit_transform(TrainX)
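
For reference, min-max scaling to the range (0, 1) maps each column via `(x - min) / (max - min)`. The sketch below checks this against `MinMaxScaler` on a tiny made-up array (`X_demo` is hypothetical and unrelated to the molecule data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up feature matrix: 3 samples, 2 features
X_demo = np.array([[1.0, 10.0],
                   [2.0, 30.0],
                   [4.0, 20.0]])

scaled_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

# Same transformation written out per column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
scaled_manual = (X_demo - col_min) / (col_max - col_min)

print(np.allclose(scaled_sklearn, scaled_manual))
```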

In [None]:
TrainX_UMAP = umap.UMAP(n_components = 3).fit_transform(TrainXS) #5D --> 3D

In [None]:
colorsIdx = {'Non-Toxic': 'black', 'Toxic': 'red'}
cols      = TrainY.map(colorsIdx)

scatter = go.Scatter3d(x = TrainX_UMAP[:,0], y = TrainX_UMAP[:,1], z = TrainX_UMAP[:,2], mode = 'markers', marker = dict(size = 3, color = cols))
fig     = go.Figure(data = [scatter])
fig.update_layout(width = 800, height = 800, margin = dict(r = 10, b = 10, l = 10, t = 10))
fig.show()

In the last plot, we cannot really see clusters that follow the labels. Therefore, let's run a PCA on the scaled data.

<br>

**3) Scaling & PCA**

Before performing any analysis, we have to scale the test dataset the same way we scaled the training data.

3a) Scaling

In [None]:
TestXS = scaler.transform(TestX)
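
One consequence of reusing the training scaler is worth illustrating: `transform` applies the minimum and maximum learned from the training data, so scaled test values may fall outside [0, 1]. The arrays below are made up for this sketch.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up one-feature data; the test set exceeds the training range on both sides
train_demo = np.array([[0.0], [5.0], [10.0]])
test_demo = np.array([[-2.0], [5.0], [12.0]])

demo_scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_demo)
test_scaled = demo_scaler.transform(test_demo)  # uses the training min and max

print(test_scaled.ravel())
```

This is expected behavior rather than an error: refitting the scaler on the test set would leak test information into the preprocessing.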

3b) PCA

Now, we perform the actual PCA with the training data using *fit* for setting up our model and *transform* for transforming the data into the eigenspace.

In [None]:
out = PCA(n_components = 5).fit(TrainXS) 

In [None]:
eigenVec    = out.components_
eigenVal    = out.explained_variance_
eigenTrainX = out.transform(TrainXS)
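
To make the terms "eigenvalue" and "eigenspace" concrete, here is a small sketch on synthetic data (`X_demo` is an assumption, not the molecule data). It checks that `components_` and `explained_variance_` agree with the eigenvectors and eigenvalues of the sample covariance matrix, and that `transform` amounts to centering followed by projection.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data: 300 samples, 3 features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3)) @ np.array([[2.0, 0.5, 0.0],
                                               [0.0, 1.0, 0.3],
                                               [0.0, 0.0, 0.2]])

pca_demo = PCA(n_components=3).fit(X_demo)

# Eigen-decomposition of the sample covariance matrix (np.cov uses ddof=1)
cov = np.cov(X_demo, rowvar=False)
vals, vecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order
order = np.argsort(vals)[::-1]        # re-sort to descending, as PCA reports them
vals, vecs = vals[order], vecs[:, order]

# Eigenvectors are only defined up to sign, so compare absolute values
same_axes = np.allclose(np.abs(pca_demo.components_), np.abs(vecs.T))
same_vals = np.allclose(pca_demo.explained_variance_, vals)

# transform = subtract the training mean, then project onto the components
scores_manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T
same_scores = np.allclose(pca_demo.transform(X_demo), scores_manual)

print(same_axes, same_vals, same_scores)
```

Because the projected coordinates have a diagonal covariance matrix, the transformed features are linearly uncorrelated, which suits the independence assumption of naive Bayes better than the raw features do.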

Plotting eigenvalue spectrum:

In [None]:
xplot    = np.arange(1,6)

fig = plt.figure(figsize=(5, 3))
plt.bar(xplot, eigenVal, color = (0.9, 0.9, 0.9), edgecolor = 'black')
plt.xlabel('dimension')
plt.ylabel('eigenvalue')
plt.yscale('log')
plt.xticks(xplot)
plt.show()

We see that three principal components are sufficient to analyze the data, since two of the eigenvalues are significantly smaller than the other three.<br>
Now, we first check the correlation heatmap and then create a plot of the scaled and PCA-transformed data. This time we can create a scatter plot of the actual data, since we only need three dimensions now.

In [None]:
sns.heatmap(pd.DataFrame(eigenTrainX).corr(), annot=True, cmap = sns.color_palette("Blues"))
plt.show()

In [None]:
colorsIdx = {'Non-Toxic': 'black', 'Toxic': 'red'}
cols      = TrainY.map(colorsIdx)

scatter = go.Scatter3d(x = eigenTrainX[:,0], y = eigenTrainX[:,1], z = eigenTrainX[:,2], mode = 'markers', marker = dict(size = 5, color = cols))
fig     = go.Figure(data = [scatter])
fig.update_layout(width = 800, height = 800, margin = dict(r = 10, b = 10, l = 100, t = 10))
fig.show()

Even though no well-separated clusters are visible, we at least find a tendency for the toxic molecules to lie in a different part of the data space than the non-toxic molecules. 
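
The choice of three components made from the eigenvalue spectrum can also be expressed numerically through the cumulative explained variance. The sketch below uses hypothetical eigenvalues (`eigenVal_demo`) and an assumed cutoff of 99 %; neither number comes from the molecule data.

```python
import numpy as np

# Hypothetical eigenvalues, chosen only to illustrate the calculation
eigenVal_demo = np.array([0.50, 0.30, 0.15, 0.004, 0.001])

ratio = eigenVal_demo / eigenVal_demo.sum()  # share of variance per component
cumulative = np.cumsum(ratio)                # share captured by the first k components

threshold = 0.99                             # assumed cutoff, adjust as needed
# index of the first component at which the cumulative share reaches the cutoff
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative, n_keep)
```

For the fitted model in this notebook, `out.explained_variance_ratio_` provides the per-component shares directly.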

<br>

**4) Naive Bayes**

We have scaled and de-correlated the data. In the next step, we can now apply naive Bayes and check the result using the test dataset.

In [None]:
eigenTestX  = out.transform(TestXS)  #project the scaled test data onto the principal components

In [None]:
gnb   = GaussianNB()
Ypred = gnb.fit(eigenTrainX, TrainY).predict(eigenTestX)
Probs = gnb.predict_proba(eigenTestX) #probabilities per class
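
To show what `GaussianNB` does internally, the following sketch recomputes its class probabilities by hand on synthetic two-class data (`X_demo` and `y_demo` are made up). It assumes scikit-learn's default `var_smoothing = 1e-9`, which adds a small constant to every per-class variance.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data: 100 samples per class, 2 features
rng = np.random.default_rng(2)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                    rng.normal(1.5, 1.0, size=(100, 2))])
y_demo = np.array(["Non-Toxic"] * 100 + ["Toxic"] * 100)

gnb_demo = GaussianNB().fit(X_demo, y_demo)
probs_sklearn = gnb_demo.predict_proba(X_demo)

# Variance smoothing term (assumes the default var_smoothing of 1e-9)
eps = 1e-9 * X_demo.var(axis=0).max()

log_joint = []
for c in gnb_demo.classes_:
    Xc = X_demo[y_demo == c]
    mu = Xc.mean(axis=0)                        # per-feature class mean
    var = Xc.var(axis=0) + eps                  # per-feature class variance
    log_prior = np.log(len(Xc) / len(X_demo))   # class frequency as prior
    # "Naive" step: sum the per-feature Gaussian log-densities
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_demo - mu) ** 2 / var).sum(axis=1)
    log_joint.append(log_prior + log_lik)
log_joint = np.column_stack(log_joint)

# Normalize to posteriors (subtract the row maximum for numerical stability)
p = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
probs_manual = p / p.sum(axis=1, keepdims=True)

print(np.allclose(probs_manual, probs_sklearn))
```

Summing the per-feature log-densities is only justified if the features are independent given the class, which is why the data were de-correlated with PCA first.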

In [None]:
acc = (Ypred == TestY).sum()/len(Ypred)
print(acc)

Finally, we generate an entropy plot and a confusion matrix as before:

In [None]:
ClassLabs = ['Non-Toxic', 'Toxic']
#we have only two classes and map "Non-Toxic" to 0 and "Toxic" to 1
Ynum      = [0 if i == 'Non-Toxic' else 1 for i in TestY]
YPrednum  = [0 if i == 'Non-Toxic' else 1 for i in Ypred]

In [None]:
plot_confusion(YPrednum, Ynum, ClassLabs)
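
`plot_confusion` comes from the course's helper module, whose internals are not shown here. As a hedged numerical counterpart, scikit-learn's `confusion_matrix` returns the underlying counts; the label lists below are made up. Note that scikit-learn expects the true labels first, whereas the helper call above passes the predictions first.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = Non-Toxic, 1 = Toxic
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("accuracy:", (tn + tp) / cm.sum())
```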

In [None]:
ClassLabsNum = [0, 1]
plot_entropy(Probs, Ynum, ClassLabs, ClassLabsNum)
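
The helper `plot_entropy` is likewise not shown, so the following is only an assumption about the quantity it visualizes: the Shannon entropy of the predicted class probabilities per sample. The function `prediction_entropy` and the array `probs_demo` are hypothetical names introduced for this sketch.

```python
import numpy as np

# Made-up class probabilities for three samples (columns: Non-Toxic, Toxic)
probs_demo = np.array([[0.99, 0.01],
                       [0.50, 0.50],
                       [0.20, 0.80]])

def prediction_entropy(p, eps=1e-12):
    """Shannon entropy in bits for each row of class probabilities."""
    p = np.clip(p, eps, 1.0)  # avoid log(0) for fully confident predictions
    return -(p * np.log2(p)).sum(axis=1)

H = prediction_entropy(probs_demo)
print(H)
```

With two classes the entropy ranges from 0 bits (fully confident) to 1 bit (a 50/50 prediction), so high values flag samples on which the classifier is uncertain.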