## Example Naive Bayes in Python<br>
<br>

**0) Loading Libraries**<br>
<br>

In [None]:
#standard libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [None]:
from sklearn.naive_bayes import GaussianNB                     #Gaussian Naive Bayes is the only classifier we use from this module
from sklearn.preprocessing import MinMaxScaler                 #we need to scale the data in order to avoid numerical bias
from plot_entropy_and_confusion import *                       #after fitting, we want to evaluate the model using a confusion chart and an entropy plot
from umap import UMAP                                          #we work with a high-dimensional data set. In order to plot the data, we can project it
                                                               #to two or three dimensions using UMAP (run "pip install umap-learn" if needed)

In [None]:
#pip install plotly
import plotly.graph_objects as go                              #finally, we want to use plotly in order to generate an interactive plot

<br>

**1) Loading and Inspecting the Data**<br>
<br>

We load the molecule dataset that we already know. It comes in two files: the training set and the test set.

In [None]:
Train = pd.read_csv('molecular_train_gbc_cat.csv')
Test  = pd.read_csv('molecular_test_gbc_cat.csv')

The dependent variable is categorical. We have two classes: "Toxic" and "Non-Toxic".

In [None]:
Test.head()
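
Since the label is categorical, it can help to check how many samples each class has before fitting anything. The following is a minimal sketch on made-up toy data; only the column name `label` and the two class names are taken from this notebook, all values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; the numbers are made up for illustration
toy = pd.DataFrame({
    'molecular_weight': [120.5, 98.1, 310.0, 45.2, 210.7],
    'label': ['Toxic', 'Non-Toxic', 'Non-Toxic', 'Toxic', 'Non-Toxic'],
})

counts = toy['label'].value_counts()                   # absolute number of rows per class
shares = toy['label'].value_counts(normalize = True)   # relative share per class
print(counts)
print(shares)
```

The same two calls can be applied to `Train['label']` and `Test['label']`. A strongly unbalanced class distribution would make the plain accuracy harder to interpret.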

<br>
We want to plot the data using UMAP as a dimension reduction method. First, we extract X (the features) and Y (the labels).

In [None]:
XTrain = Train.drop('label', axis = 1).values
YTrain = Train['label']

In [None]:
print(YTrain[:10])

In [None]:
print(XTrain[:10,:])

Let us plot histograms of a few features:

In [None]:
sns.histplot(Train[['molecular_weight']], x = "molecular_weight")
plt.show()

In [None]:
sns.histplot(Train[['electronegativity']], x = "electronegativity")
plt.show()

<br>

**2) Plotting the Data**<br>
<br>

Scaling and normalizing the data is important for the fitting procedure. We also want to check whether it affects the UMAP transformation.

In [None]:
scaler   = MinMaxScaler(feature_range = (0, 1)) 
XTrainS  = scaler.fit_transform(XTrain)
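
To make clear what the scaler does: with `feature_range = (0, 1)`, each column is mapped to (x - min) / (max - min), where min and max are taken per feature. The sketch below checks this on a small made-up matrix; the variable names and values are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (made-up values): 4 samples, 2 features on very different scales
X_toy = np.array([[100.0, 0.5],
                  [250.0, 2.0],
                  [400.0, 3.5],
                  [175.0, 1.0]])

# Manual min-max scaling, column by column
col_min  = X_toy.min(axis = 0)
col_max  = X_toy.max(axis = 0)
X_manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
X_sklearn = MinMaxScaler(feature_range = (0, 1)).fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))   # both approaches should agree
```

After scaling, every feature spans the same interval, so distances are no longer dominated by the feature with the largest raw values.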

Next, we run the UMAP transformation from 5D to 3D and then generate a scatter plot. 

In [None]:
fit    = UMAP(n_components = 3)
XTrans = fit.fit_transform(XTrainS)

Plotting in 3D:

In [None]:
colorsIdx = {'Non-Toxic': 'black', 'Toxic': 'red'}
cols      = YTrain.map(colorsIdx)

scatter = go.Scatter3d(x = XTrans[:,0], y = XTrans[:,1], z = XTrans[:,2], mode = 'markers', marker = dict(size = 5, color = cols))
fig     = go.Figure(data = [scatter])
fig.update_layout(width = 800, height = 800, margin = dict(r = 10, b = 10, l = 100, t = 10))
fig.show()

Let us also compare the scaled data to the raw data:

In [None]:
print(XTrainS[:10,:])#scaled

In [None]:
print(XTrain[:10,:])#raw

In [None]:
plt.hist(XTrainS[:,1], 20)
plt.xlabel('electronegativity')
plt.show()

As a test, we can run the projection without having scaled the data first:

In [None]:
fit    = UMAP(n_components = 3)
Xtrans = fit.fit_transform(XTrain)

In [None]:
colorsIdx = {'Non-Toxic': 'black', 'Toxic': 'red'}
cols      = YTrain.map(colorsIdx)

scatter = go.Scatter3d(x = Xtrans[:,0], y = Xtrans[:,1], z = Xtrans[:,2], mode = 'markers', marker = dict(size = 5, color = cols))
fig     = go.Figure(data = [scatter])
fig.update_layout(width = 800, height = 800, margin = dict(r = 10, b = 10, l = 100, t = 10))
fig.show()

<br>

**3) Naive Bayes**<br>
<br>

We now fit a Gaussian Naive Bayes model to the scaled training data set.

In [None]:
gnb = GaussianNB()
Fit = gnb.fit(XTrainS, YTrain)
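
As a rough sketch of what happens inside `GaussianNB`: for each class, the model estimates a prior and a per-feature mean and variance, treats the features as independent Gaussians, and combines them with Bayes' rule. The example below reproduces this by hand on made-up toy data and compares it with scikit-learn. The data and variable names are assumptions for illustration. scikit-learn additionally adds a tiny smoothing term to the variances (`var_smoothing`), so the two results agree only approximately.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (made-up values): 6 samples, 2 features, 2 classes
X_toy = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.30],
                  [0.80, 0.90], [0.90, 0.70], [0.70, 0.85]])
y_toy = np.array(['Non-Toxic', 'Non-Toxic', 'Non-Toxic', 'Toxic', 'Toxic', 'Toxic'])

x_new   = np.array([0.30, 0.40])        # sample to classify
classes = np.unique(y_toy)              # sorted class labels

log_post = []
for c in classes:
    Xc    = X_toy[y_toy == c]
    prior = len(Xc) / len(X_toy)        # P(class)
    mu    = Xc.mean(axis = 0)           # per-feature mean within the class
    var   = Xc.var(axis = 0)            # per-feature variance within the class
    # log of the product of independent Gaussian densities, one per feature
    log_lik = np.sum(-0.5 * np.log(2 * np.pi * var) - (x_new - mu) ** 2 / (2 * var))
    log_post.append(np.log(prior) + log_lik)

log_post    = np.array(log_post)
post_manual = np.exp(log_post - log_post.max())   # subtract the maximum for numerical stability
post_manual = post_manual / post_manual.sum()     # normalize so the probabilities sum to 1

post_sklearn = GaussianNB().fit(X_toy, y_toy).predict_proba(x_new.reshape(1, -1))[0]
print(post_manual, post_sklearn)
```

The "naive" part is the independence assumption: the joint likelihood is simply the product of the one-dimensional densities.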

In the next step, we predict the classes of the test data set with our model and thereby evaluate its quality. Note that we run *scaler.transform* and not *scaler.fit_transform* on the test set! The test data must be scaled with the minimum and maximum learned from the training data.

In [None]:
#extracting X and Y from the test data set
XTest = Test.drop('label', axis = 1).values
YTest = Test['label']

In [None]:
#scaling the test set
XTestS  = scaler.transform(XTest)
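
The difference between the two calls can be seen on a small made-up example. The names and values below are assumptions for illustration; the point is that refitting the scaler on the test set silently changes the meaning of the scaled values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made-up values): one feature, the training range is [0, 10]
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy  = np.array([[2.0], [12.0]])          # 12 lies outside the training range

toy_scaler = MinMaxScaler(feature_range = (0, 1))
toy_scaler.fit(X_train_toy)                      # learns min = 0 and max = 10 from the training set

# Correct: reuse the training minimum and maximum
test_ok = toy_scaler.transform(X_test_toy)

# Incorrect: refitting on the test set maps its own minimum and maximum to 0 and 1
test_refit = MinMaxScaler(feature_range = (0, 1)).fit_transform(X_test_toy)

print(test_ok.ravel())      # values relative to the training range
print(test_refit.ravel())   # values relative to the test range
```

With `transform`, a test value may fall outside [0, 1] if it lies outside the training range. That is expected and keeps the test features on the same scale the model was trained on.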

In [None]:
Ypred = Fit.predict(XTestS)        #predicting classes
Probs = Fit.predict_proba(XTestS)  #calculating probabilities for classes

In [None]:
print(Probs[:30,:])

In [None]:
print(Ypred[:30])

We see that the first column (index = 0) of *Probs* refers to "Non-Toxic" and the second column (index = 1) refers to "Toxic".
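
Instead of inferring the column order from the printed output, it can be read from the fitted model: scikit-learn classifiers store the sorted class labels in `classes_`, and the columns of `predict_proba` follow that order. A minimal sketch on made-up toy data (names and values are assumptions for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (made-up values); the labels are deliberately listed with 'Toxic' first
X_toy = np.array([[0.9], [0.8], [0.1], [0.2]])
y_toy = np.array(['Toxic', 'Toxic', 'Non-Toxic', 'Non-Toxic'])

toy_model = GaussianNB().fit(X_toy, y_toy)

print(toy_model.classes_)                  # column order used by predict_proba
toy_probs = toy_model.predict_proba(np.array([[0.85]]))
for name, p in zip(toy_model.classes_, toy_probs[0]):
    print(name, round(p, 3))
```

In this notebook, the same check would be `print(Fit.classes_)`.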

<br>

**4) Model Evaluation**<br>
<br>

The most straightforward way to evaluate the model is to calculate the accuracy:

In [None]:
acc = (Ypred == YTest).sum()/len(Ypred)
print(acc)

But the accuracy tells us neither whether the errors are biased towards one class nor how sure the model was when it made its decision. Therefore, we generate a confusion chart and an entropy plot.

In [None]:
ClassLabs = ['Non-Toxic', 'Toxic']
#we have only two classes and map "Non-Toxic" to 0 and "Toxic" to 1
Ynum      = [0 if i == 'Non-Toxic' else 1 for i in YTest]
YPrednum  = [0 if i == 'Non-Toxic' else 1 for i in Ypred]

In [None]:
print(Ynum[:30], YPrednum[:30])

In [None]:
plot_confusion(YPrednum, Ynum, ClassLabs)
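
`plot_confusion` is a custom helper whose internals are not shown here. As a hedged alternative for reading off the numbers directly, the counts behind such a chart can be computed with `sklearn.metrics.confusion_matrix`. The toy labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (made-up values): 0 = Non-Toxic, 1 = Toxic
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels = [0, 1])
print(cm)

# Per-class recall: share of each true class that was predicted correctly
recall_per_class = cm.diagonal() / cm.sum(axis = 1)
print(recall_per_class)
```

A large gap between the per-class values would indicate that the model is biased towards one class, which the overall accuracy alone cannot show.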

That is significantly better than the result we got from logistic regression!

In [None]:
ClassLabsNum = [0, 1]
plot_entropy(Probs, Ynum, ClassLabs, ClassLabsNum)
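
`plot_entropy` is also a custom helper, so what exactly it plots is not shown here. Assuming it is based on the Shannon entropy of the predicted class probabilities, the quantity can be sketched as follows. The probabilities are made up for illustration.

```python
import numpy as np

# Toy class probabilities (made-up values), one row per prediction
probs_toy = np.array([[0.50, 0.50],    # maximally uncertain
                      [0.90, 0.10],    # fairly confident
                      [1.00, 0.00]])   # fully confident

# Shannon entropy in bits; clipping avoids log2(0) for zero probabilities
safe_probs  = np.clip(probs_toy, 1e-12, 1.0)
entropy_toy = -np.sum(probs_toy * np.log2(safe_probs), axis = 1)
print(entropy_toy)
```

For two classes, the entropy ranges from 0 bits (the model is certain) to 1 bit (both classes are equally likely). Applied to *Probs*, low entropy on wrong predictions would point to confidently wrong decisions.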