# Naive Bayes Classifier

The inherent assumption of naive Bayes is that the features are conditionally independent of each other given the class, i.e., within a class they have little to no correlation with one another. This might make the model a little problematic in our case, but let's explore it regardless.
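One way to gauge how far the data depart from this assumption is to look at the feature correlations inside each class. The sketch below is only an illustration: the helper `within_class_corr` and the columns `d1`, `d2`, `d3` are hypothetical, and the data are synthetic stand-ins rather than the real distance tables.

```python
import numpy as np
import pandas as pd

def within_class_corr(df, feature_cols, label_col="Class"):
    """Return {class label: Pearson correlation matrix of the features}."""
    return {
        label: group[feature_cols].corr()
        for label, group in df.groupby(label_col)
    }

# Synthetic stand-in data (assumption: NOT the real Dist_*.csv files).
rng = np.random.default_rng(0)
n = 200
base = rng.normal(3.0, 0.1, size=n)
demo = pd.DataFrame({
    "d1": base,
    "d2": base + rng.normal(0.15, 0.02, size=n),  # strongly tied to d1
    "d3": rng.normal(2.8, 0.1, size=n),           # drawn independently of d1
    "Class": rng.integers(0, 2, size=n),
})

corr_by_class = within_class_corr(demo, ["d1", "d2", "d3"])
for label, corr in corr_by_class.items():
    print(f"Class {label}:\n{corr.round(2)}\n")
```

Once `df_shuffle` has been loaded, the same helper could be called as `within_class_corr(df_shuffle, ['S-O1', 'C-O1', 'N-O1'])`. Off-diagonal values close to ±1 would indicate that the independence assumption is strongly violated for that pair of features.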

In [1]:
import numpy as np
import pandas as pd
from sklearn.utils import shuffle
import warnings

warnings.filterwarnings('ignore')

df_AIMD = pd.read_csv("Dist_AIMD.csv") 
df_MD = pd.read_csv('Dist_MD.csv')
df_fin = pd.concat([df_AIMD, df_MD])

df_shuffle = shuffle(df_fin, random_state=0)

from sklearn.model_selection import train_test_split

# Data selection 
# First we shall select the closest oxygens and later add the rest to see the effects of increasing features
# Then we will repeat it for hydrogens
X3 = df_shuffle[['S-O1', 'C-O1', 'N-O1']]
X6 = df_shuffle[['S-O1', 'C-O1', 'N-O1', 'S-O2', 'C-O2', 'N-O2']]

H3 = df_shuffle[['S-H1', 'C-H1', 'N-H1']]
H6 = df_shuffle[['S-H1', 'C-H1', 'N-H1', 'S-H2', 'C-H2', 'N-H2']]
y = df_shuffle['Class']

# Split the data into training (80%) and test (20%) sets

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y, test_size=0.20, random_state=0)
X6_train, X6_test, y6_train, y6_test = train_test_split(X6, y, test_size=0.20, random_state=0)

## Naive Bayes model 

Naive Bayes (NB) classifiers are highly efficient in learning and prediction but tend to generalize worse than more sophisticated models. scikit-learn implements several types of NB classifiers, including:

1. Bernoulli : Useful for binary features
2. Multinomial : Useful for discrete features (word counts)
3. Gaussian : Useful for continuous features

Since our features are continuous, [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB) is the most appropriate choice in our case.
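For reference, the decision rule behind a Gaussian naive Bayes classifier can be written as follows. The symbols are notation introduced here for illustration: $k$ indexes the classes, $i$ the features, and $\mu_{ki}$ and $\sigma^2_{ki}$ are the per-class mean and variance of feature $i$ (what scikit-learn reports as `theta_` and `var_`).

$$
\hat{y} = \arg\max_{k} \left[ \log P(y = k) + \sum_{i} \log \mathcal{N}\!\left(x_i \mid \mu_{ki}, \sigma^2_{ki}\right) \right],
\qquad
\mathcal{N}\!\left(x_i \mid \mu_{ki}, \sigma^2_{ki}\right) = \frac{1}{\sqrt{2\pi\sigma^2_{ki}}} \exp\!\left(-\frac{(x_i - \mu_{ki})^2}{2\sigma^2_{ki}}\right)
$$

The sum over features is where the independence assumption enters: each feature contributes its own one-dimensional Gaussian term, with no cross terms between features.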

In [2]:
from sklearn.naive_bayes import GaussianNB

# We keep the defaults: GaussianNB exposes only `priors` and `var_smoothing`,
# so there is little for us to optimize.
nb = GaussianNB()

# 3-feature model
nb.fit(X3_train, y3_train)

print("Number of features in the model:",nb.n_features_in_)
print("Train accuracy for GaussianNB and 3 features:", nb.score(X3_train, y3_train))
print("Test accuracy for GaussianNB and 3 features:",nb.score(X3_test, y3_test))

print("\nNames of the features: ",nb.feature_names_in_)
print("Mean of the features across each class:\n",nb.theta_)
print("Variance of the features across each class:\n", nb.var_)



Number of features in the model: 3
Train accuracy for GaussianNB and 3 features: 0.8768768768768769
Test accuracy for GaussianNB and 3 features: 0.8809523809523809

Names of the features:  ['S-O1' 'C-O1' 'N-O1']
Mean of the features across each class:
 [[3.02567104 3.1004413  2.72250301]
 [3.12652537 3.33810842 2.82435454]]
Variance of the features across each class:
 [[0.01068197 0.01749986 0.0049353 ]
 [0.01326852 0.03826187 0.01102737]]


In [3]:
# 6-feature model
nb.fit(X6_train, y6_train)

print("Number of features in the model:",nb.n_features_in_)
print("Train accuracy for GaussianNB and 6 features:", nb.score(X6_train, y6_train))
print("Test accuracy for GaussianNB and 6 features:",nb.score(X6_test, y6_test))

print("\nNames of the features: ",nb.feature_names_in_)
print("Mean of the features across each class:\n",nb.theta_)
print("Variance of the features across each class:\n", nb.var_)


Number of features in the model: 6
Train accuracy for GaussianNB and 6 features: 0.918918918918919
Test accuracy for GaussianNB and 6 features: 0.9642857142857143

Names of the features:  ['S-O1' 'C-O1' 'N-O1' 'S-O2' 'C-O2' 'N-O2']
Mean of the features across each class:
 [[3.02567104 3.1004413  2.72250301 3.17551324 3.25285733 2.8031851 ]
 [3.12652537 3.33810842 2.82435454 3.29153321 3.54878417 3.0191084 ]]
Variance of the features across each class:
 [[0.01068197 0.01749986 0.0049353  0.01632006 0.01710282 0.00457682]
 [0.01326852 0.03826187 0.01102737 0.01857891 0.02317821 0.02460831]]


### Naive Bayes classifiers are probabilistic classifiers based on applying Bayes' theorem and the assumption that the features are independent of one another.

For both the 3- and 6-feature O models we observe high accuracy (test accuracy of about 0.88 and 0.96, respectively), better than all other models discussed so far (KNN, SVC, logistic regression, decision trees). This contradicts our earlier expectation that the model might not perform well, but it also makes sense in hindsight.

Gaussian naive Bayes classifiers assume that the continuous values of each feature within a class are distributed according to a normal (Gaussian) distribution. Since the data we are dealing with are bond lengths along a trajectory, it is reasonable that a Gaussian distribution fits this type of dataset well.
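To make the Gaussian assumption concrete, the sketch below recomputes a prediction by hand from the per-class means and variances printed for the 3-feature O model. Two things are assumptions rather than values from the notebook: the class priors are taken as equal (the fitted `class_prior_` is not shown above), and the sample `x_new` is a hypothetical point chosen for illustration. The helper names are also made up for this example.

```python
import numpy as np

# Per-class means and variances copied from the 3-feature O output above
# (rows: classes in the order printed; columns: S-O1, C-O1, N-O1).
theta = np.array([[3.02567104, 3.1004413, 2.72250301],
                  [3.12652537, 3.33810842, 2.82435454]])
var = np.array([[0.01068197, 0.01749986, 0.0049353],
                [0.01326852, 0.03826187, 0.01102737]])

# Assumption: equal class priors (the fitted priors are not shown above).
log_prior = np.log(np.array([0.5, 0.5]))

def joint_log_likelihood(x):
    """log P(y=k) + sum_i log N(x_i | theta[k, i], var[k, i]) for each class k."""
    x = np.asarray(x, dtype=float)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)
    log_quad = -0.5 * np.sum((x - theta) ** 2 / var, axis=1)
    return log_prior + log_norm + log_quad

def predict_index(x):
    """Row index (into theta) of the most probable class."""
    return int(np.argmax(joint_log_likelihood(x)))

x_new = [3.05, 3.15, 2.75]  # hypothetical sample, for illustration only
print(joint_log_likelihood(x_new), predict_index(x_new))
```

With the fitted priors in place of the equal ones, this should follow the same per-feature Gaussian scoring that `GaussianNB` uses, up to the small variance smoothing scikit-learn applies during fitting.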

### Repeating the calculations for the H dataset

In [4]:
# Train test split
X3_train, X3_test, y3_train, y3_test = train_test_split(H3, y, test_size=0.20, random_state=0)
X6_train, X6_test, y6_train, y6_test = train_test_split(H6, y, test_size=0.20, random_state=0)

# GaussianNB classifier
nb = GaussianNB()

# 3 feature model
nb.fit(X3_train, y3_train)

print("Number of features in the model:",nb.n_features_in_)
print("Train accuracy for GaussianNB and 3 features:", nb.score(X3_train, y3_train))
print("Test accuracy for GaussianNB and 3 features:",nb.score(X3_test, y3_test))

print("\nNames of the features: ",nb.feature_names_in_)
print("Mean of the features across each class:\n",nb.theta_)
print("Variance of the features across each class:\n", nb.var_)

Number of features in the model: 3
Train accuracy for GaussianNB and 3 features: 0.8768768768768769
Test accuracy for GaussianNB and 3 features: 0.8809523809523809

Names of the features:  ['S-H1' 'C-H1' 'N-H1']
Mean of the features across each class:
 [[2.13957634 2.27921195 1.7920138 ]
 [2.23389381 2.56780574 1.88962832]]
Variance of the features across each class:
 [[0.01933439 0.01707981 0.00602164]
 [0.02089352 0.03091966 0.01376504]]


In [5]:
# 6 feature model
nb.fit(X6_train, y6_train)

print("Number of features in the model:",nb.n_features_in_)
print("Train accuracy for GaussianNB and 6 features:", nb.score(X6_train, y6_train))
print("Test accuracy for GaussianNB and 6 features:",nb.score(X6_test, y6_test))

print("\nNames of the features: ",nb.feature_names_in_)
print("Mean of the features across each class:\n",nb.theta_)
print("Variance of the features across each class:\n", nb.var_)

Number of features in the model: 6
Train accuracy for GaussianNB and 6 features: 0.9219219219219219
Test accuracy for GaussianNB and 6 features: 0.9404761904761905

Names of the features:  ['S-H1' 'C-H1' 'N-H1' 'S-H2' 'C-H2' 'N-H2']
Mean of the features across each class:
 [[2.13957634 2.27921195 1.7920138  2.33686447 2.45493872 1.88601989]
 [2.23389381 2.56780574 1.88962832 2.46349847 2.76209765 2.11397551]]
Variance of the features across each class:
 [[0.01933439 0.01707981 0.00602164 0.02853373 0.01737959 0.00570643]
 [0.02089352 0.03091966 0.01376504 0.04079456 0.01684757 0.03719242]]


### The accuracy scores are largely similar to those of the O models, which is not a surprise.

Next, let us see how well this model behaves when we use all 12 features.

In [6]:
# Use the full dataset (12 features) for training; y remains the same.
# Columns from index 2 onward are the 12 features listed in the output below.
X = df_shuffle.iloc[:, 2:]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Train Gaussian NB

nb = GaussianNB().fit(X_train, y_train)

print("Number of features in the model:",nb.n_features_in_)
print("Train accuracy for GaussianNB and 12 features:", nb.score(X_train, y_train))
print("Test accuracy for GaussianNB and 12 features:",nb.score(X_test, y_test))

print("\nNames of the features: ",nb.feature_names_in_)
print("Mean of the features across each class:\n",nb.theta_)
print("Variance of the features across each class:\n", nb.var_)

Number of features in the model: 12
Train accuracy for GaussianNB and 12 features: 0.9219219219219219
Test accuracy for GaussianNB and 12 features: 0.9285714285714286

Names of the features:  ['S-O1' 'S-O2' 'S-H1' 'S-H2' 'C-O1' 'C-O2' 'C-H1' 'C-H2' 'N-O1' 'N-O2'
 'N-H1' 'N-H2']
Mean of the features across each class:
 [[3.02567104 3.17551324 2.13957634 2.33686447 3.1004413  3.25285733
  2.27921195 2.45493872 2.72250301 2.8031851  1.7920138  1.88601989]
 [3.12652537 3.29153321 2.23389381 2.46349847 3.33810842 3.54878417
  2.56780574 2.76209765 2.82435454 3.0191084  1.88962832 2.11397551]]
Variance of the features across each class:
 [[0.01068197 0.01632006 0.01933439 0.02853373 0.01749986 0.01710282
  0.01707981 0.01737959 0.0049353  0.00457682 0.00602164 0.00570643]
 [0.01326852 0.01857891 0.02089352 0.04079456 0.03826187 0.02317821
  0.03091966 0.01684757 0.01102737 0.02460831 0.01376504 0.03719242]]


### Finally, let us take a look at the cross-validation scores for this 12-feature model

In [7]:
from sklearn.model_selection import cross_val_score

print("5-fold cross-validation for the GaussianNB with 12 features:")
cv_scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(cv_scores, np.mean(cv_scores))

5-fold cross-validation for the GaussianNB with 12 features:
[0.92857143 0.92857143 0.93975904 0.96385542 0.87951807] 0.9280550774526679


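The mean 5-fold cross-validation accuracy (about 0.928) is essentially the same as the single-split test accuracy of the 12-feature model (about 0.929), while the per-fold scores range from about 0.88 to 0.96. Using all 12 features therefore offers no clear improvement over the 6-feature models.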
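Because single train/test splits can be noisy on a dataset of this size, a fairer way to compare feature subsets is to score each of them on the same cross-validation folds and report the spread as well as the mean. The sketch below shows the idea on synthetic data generated with `make_classification`; the data and the feature subsets (given as column indices) are hypothetical stand-ins, not the real O and H features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the feature table (assumption: not the real data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6, n_redundant=2, random_state=0
)

# Hypothetical feature subsets, given as column indices.
subsets = {
    "first 3": [0, 1, 2],
    "first 6": list(range(6)),
    "all 12": list(range(12)),
}

# A fixed splitter means every subset is scored on identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, cols in subsets.items():
    scores = cross_val_score(GaussianNB(), X_demo[:, cols], y_demo, cv=cv)
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>8}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With the notebook's data, the subsets would instead be the column lists used for `X3`, `X6`, `H3`, `H6`, and the full `X`. Differences between subsets that are smaller than the fold-to-fold standard deviation should not be read as real improvements.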