<div style="background-color:rgba(110, 129, 21, 0.5);">
    <h1><center>Understand the Models You Love</center></h1>
</div>

In this month's TPS (Tabular Playground Series), I am trying out some algorithms I recently came across, starting with Naive Bayes. The goal is to get better at extracting the most from it and to see whether non-NN models are also viable here.

I will vary the important parameters and note their impact. **Please note that each parameter is studied independently of the others, and the observations hold only for the current data**. For speed, I use a simple 30% test split on a 10,000-row sample. References are listed towards the end.

I did a similar experiment last month with Random Forest - https://www.kaggle.com/raahulsaxena/tps-oct-21-understand-random-forest-parameters

**Feel free to run your own experiments and upvote if you find this code useful :)**

# About Naive Bayes

It is a classification technique based on Bayes’ Theorem with an **assumption of independence among predictors**. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Since our dataset has weakly correlated variables, NB might work well here.

![](http://www.analyticsvidhya.com/wp-content/uploads/2015/09/Bayes_rule-300x172-300x172.png)

It performs well with categorical input variables compared to numerical ones. **For numerical variables, a normal distribution is assumed** (a bell curve, which is a strong assumption), hence we will transform the variables first.

The following are the types of Naive Bayes algorithms:
1. **Gaussian** - Assumes that features follow a normal distribution.
2. **Multinomial** - Used for discrete counts
3. **Bernoulli** - Used when features are binary in nature (0s and 1s)

We will **use Gaussian NB** after transforming our features.
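
For reference, the Gaussian variant can be written as follows. Here $\mu_{iy}$ and $\sigma_{iy}^2$ denote the mean and variance of feature $x_i$ within class $y$; this notation is introduced for illustration and does not come from the rest of the notebook.

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^2}} \exp\left(-\frac{(x_i - \mu_{iy})^2}{2\sigma_{iy}^2}\right)
$$

The predicted class is the $y$ that maximizes the right-hand side of the first expression.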

# Importing Packages and Sample Data

In [None]:
import random
random.seed(123)

import pandas as pd
import numpy as np
import datatable as dt
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns

# importing evaluation and data split packages

from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score

# importing modelling packages

from sklearn.naive_bayes import GaussianNB, MultinomialNB

In [None]:
# taking only 10000 rows as sample

train = pd.read_csv(r'../input/tabular-playground-series-nov-2021/train.csv',nrows=10000)

# Splitting Data

In [None]:
X = train.drop(['id','target'],axis=1)
y = train['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=2,stratify=y)

features = X_train.columns

# Making Pipelines with Transformers

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, MinMaxScaler, QuantileTransformer, StandardScaler

pipe_1 = Pipeline([('scaler', StandardScaler()), ('nb', GaussianNB())])
pipe_2 = Pipeline([('scaler', MinMaxScaler()), ('nb', GaussianNB())])
pipe_3 = Pipeline([('scaler', RobustScaler()), ('nb', GaussianNB())])
pipe_4 = Pipeline([('scaler', QuantileTransformer()), ('nb', GaussianNB())])

<div style="background-color:rgba(110, 129, 21, 0.5);">
    <h1><center>Gaussian Naive Bayes</center></h1>
</div>

Methods of Improvement -

![](http://www.baeldung.com/wp-content/ql-cache/quicklatex.com-8ebe947b0538431322197ecd5324bade_l3.svg)

# Analysing the X_train data (Experimenting)

In [None]:
print('No. of rows with all zeroes: ',((X_train == 0).sum(axis=1)==100).sum())

In [None]:
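# fit a normal distribution to a univariate data sample

import scipy.stats as stats
def fit_distribution(data):
    # estimate parameters
    mu = np.mean(data)
    sigma = np.std(data)
    # return a frozen normal distribution; call dist.pdf(x) to evaluate densities
    # (stats.norm.pdf(mu, sigma) would instead return a single density value)
    dist = stats.norm(mu, sigma)
    return dist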

In [None]:
# sort data into classes

# drop returns a new frame, which avoids modifying a slice of train in place
Xy0 = train[train['target'] == 0].drop(['id','target'],axis=1)
Xy1 = train[train['target'] == 1].drop(['id','target'],axis=1)
print(Xy0.shape, Xy1.shape)

# calculate priors

priory0 = len(Xy0) / len(X)
priory1 = len(Xy1) / len(X)
print(priory0, priory1)

In [None]:
# create PDFs

Xy0_pdf = Xy0.apply(lambda x:fit_distribution(x))
Xy1_pdf = Xy1.apply(lambda x:fit_distribution(x))
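
The class priors and per-feature normal distributions are the two ingredients of a Naive Bayes prediction. As a rough sketch of how they could be combined, the snippet below scores rows by log prior plus summed log likelihoods on a small synthetic dataset. The data and the helper names `gaussian_log_pdf`, `fit_class_stats` and `predict_classes` are assumptions made for illustration; they are not part of this notebook or of sklearn.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2), evaluated element-wise at x."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def fit_class_stats(X, y):
    """Return {class: (prior, per-feature mean, per-feature std)}."""
    stats_by_class = {}
    for label in np.unique(y):
        rows = X[y == label]
        stats_by_class[label] = (len(rows) / len(X), rows.mean(axis=0), rows.std(axis=0))
    return stats_by_class

def predict_classes(X, stats_by_class):
    """Pick the class with the largest log prior + summed log likelihoods."""
    labels = np.array(list(stats_by_class))
    scores = np.column_stack([
        np.log(prior) + gaussian_log_pdf(X, mu, sigma).sum(axis=1)
        for prior, mu, sigma in stats_by_class.values()
    ])
    return labels[scores.argmax(axis=1)]

# Synthetic data (hypothetical): two well-separated classes with two features
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                   rng.normal(4.0, 1.0, size=(200, 2))])
y_toy = np.array([0] * 200 + [1] * 200)

class_stats = fit_class_stats(X_toy, y_toy)
toy_pred = predict_classes(X_toy, class_stats)
toy_accuracy = (toy_pred == y_toy).mean()
print('accuracy on toy data:', toy_accuracy)
```

Working in log space avoids the underflow that multiplying many small densities would cause, which matters once there are 100 features.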

# Best Variable Transformation
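QuantileTransformer works best in my runs. Note that with default settings it maps each feature to a uniform distribution; passing `output_distribution='normal'` maps features to a normal distribution instead, which matches the Gaussian NB assumption more directly.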

In [None]:
print('Default Parameters: ',GaussianNB().get_params())
model = GaussianNB()
model.fit(X_train,y_train)
print('ROC score with No Scaler: ',roc_auc_score(y_test,model.predict_proba(X_test)[:,1]))
pipe_1.fit(X_train,y_train)
print('ROC score with StandardScaler: ',roc_auc_score(y_test,pipe_1.predict_proba(X_test)[:,1]))
pipe_2.fit(X_train,y_train)
print('ROC score with MinMaxScaler: ',roc_auc_score(y_test,pipe_2.predict_proba(X_test)[:,1]))
pipe_3.fit(X_train,y_train)
print('ROC score with RobustScaler: ',roc_auc_score(y_test,pipe_3.predict_proba(X_test)[:,1]))
pipe_4.fit(X_train,y_train)
print('ROC score with QuantileTransformer: ',roc_auc_score(y_test,pipe_4.predict_proba(X_test)[:,1]))
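
sklearn's `QuantileTransformer` supports two output distributions through its `output_distribution` parameter. The sketch below compares them on a synthetic right-skewed feature (an assumption for illustration, not the competition data). The `skewness` helper is hypothetical and only used to summarize the shape.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic right-skewed feature (hypothetical data)
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=(2000, 1))

# Default behaviour: values are mapped to a uniform distribution on [0, 1]
qt_uniform = QuantileTransformer(n_quantiles=1000, random_state=0)
as_uniform = qt_uniform.fit_transform(skewed)

# output_distribution='normal' maps the same ranks onto a standard normal
qt_normal = QuantileTransformer(n_quantiles=1000, output_distribution='normal',
                                random_state=0)
as_normal = qt_normal.fit_transform(skewed)

def skewness(a):
    """Sample skewness: the third standardized moment."""
    a = a.ravel()
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

print('raw skew:      ', skewness(skewed))
print('uniform range: ', as_uniform.min(), as_uniform.max())
print('normal skew:   ', skewness(as_normal))
```

A pipeline such as `pipe_4` could be given the normal variant to check whether it changes the AUC on this data; that comparison is left as an experiment.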

# Var Smoothing
`var_smoothing` is the portion of the largest variance of all features that is added to every variance for calculation stability.
It plays a role loosely similar to Laplace Smoothing, a technique for smoothing categorical data that solves the problem of zero probability. Here the added variance keeps near-zero variances from producing extreme likelihoods.

In [None]:
var_smooth = [0.1,0.01,0.001,0.0001,0.00001,0.000001,0.0000001,0.00000001,0.000000001]

for var in var_smooth:
    pipe = Pipeline([('scaler', QuantileTransformer()), ('nb', GaussianNB(var_smoothing=var))])
    pipe.fit(X_train,y_train)
    print('Var Smoothing: ',var," ",'AUC: ',roc_auc_score(y_test,pipe.predict_proba(X_test)[:,1]))
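
To see what the parameter does concretely, here is a small sketch on synthetic data (an assumption for illustration). It relies on the fitted `epsilon_` attribute of `GaussianNB`, which stores the absolute value added to the variances, and includes a constant feature whose variance would otherwise be zero.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data (hypothetical): the second feature is constant, so its variance is zero
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(0.0, 3.0, size=300), np.full(300, 1.0)])
y_demo = (X_demo[:, 0] > 0).astype(int)

smoothing = 1e-2
nb = GaussianNB(var_smoothing=smoothing).fit(X_demo, y_demo)

# The added term is var_smoothing times the largest feature variance in the training data
expected_epsilon = smoothing * X_demo.var(axis=0).max()
print('epsilon_ stored by sklearn:', nb.epsilon_)
print('var_smoothing * max var   :', expected_epsilon)

# With the added term, the zero-variance feature does not break the probabilities
proba = nb.predict_proba(X_demo)
print('all probabilities finite:', np.isfinite(proba).all())
```

Because the term scales with the largest variance, the same `var_smoothing` value has a different absolute effect before and after scaling the features.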

# Permutation Importance

The permutation feature importance is defined to be the **decrease in a model score when the values of a single feature are randomly shuffled**. This procedure breaks the relationship between the feature and the target, thus the drop in the model score is indicative of how much the model depends on the feature.

In [None]:
from sklearn.inspection import permutation_importance

model = GaussianNB()
model.fit(X_train, y_train)
imps = permutation_importance(model, X_test, y_test)
importances = imps.importances_mean
std = imps.importances_std
indices = np.argsort(importances)[::-1]

# Print the feature ranking
#print("Feature ranking:")
#for f in range(X_test.shape[1]):
 #   print("%d. %s (%f)" % (f + 1, features[indices[f]], importances[indices[f]]))
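
To make the mechanics concrete, below is a hand-rolled sketch of the same idea using accuracy as the score. It is not sklearn's implementation; `manual_permutation_importance` and `toy_model` are hypothetical names, and the data is synthetic.

```python
import numpy as np

def manual_permutation_importance(predict_fn, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column is shuffled, one column at a time."""
    rng = np.random.default_rng(seed)
    baseline = (predict_fn(X) == y).mean()
    drops = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffling one column breaks its link to the target; other columns stay intact
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops[col] += baseline - (predict_fn(X_shuffled) == y).mean()
    return drops / n_repeats

# Toy setup (hypothetical): the target depends on column 0 only, column 1 is pure noise
rng = np.random.default_rng(1)
X_perm = rng.normal(size=(500, 2))
y_perm = (X_perm[:, 0] > 0).astype(int)

def toy_model(X):
    """Stand-in for a fitted classifier that only looks at column 0."""
    return (X[:, 0] > 0).astype(int)

toy_importances = manual_permutation_importance(toy_model, X_perm, y_perm)
print(toy_importances)
```

The noise column gets an importance of zero because the stand-in model never reads it, while shuffling the informative column pushes accuracy towards chance.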

In [None]:
# fitting our pipeline on the subset of features ranked most important
# the first 39 features in the ranking had positive importance, so keep those and ignore the rest

important_features = features[indices[:39]]
print('Important Features: ',important_features)

X_train_new = X_train[important_features]
X_test_new = X_test[important_features]

pipe_4.fit(X_train_new,y_train)
print('ROC score with QuantileTransformer and Selected Features: ',
      roc_auc_score(y_test,pipe_4.predict_proba(X_test_new)[:,1]))

# Converting to Discrete

6 bins, without any scaling, gave good results in my run, even better than plain QuantileTransformer scaling on the continuous variables.

In [None]:
bins = [2,3,4,5,6,7,8,9,10]
print('GAUSSIAN WITH DISCRETE, CUT')
model = GaussianNB()
for n_bins in bins:
    # note: train and test are binned separately, so their bin edges can differ
    X_train_binned = X_train.apply(lambda x:pd.cut(x,bins=n_bins,labels=False))
    X_test_binned = X_test.apply(lambda x:pd.cut(x,bins=n_bins,labels=False))
    model.fit(X_train_binned,y_train)
    # score with the model fitted on the binned data, not the earlier `pipe`
    print('No. of Bins: ',n_bins," ",'AUC: ',
          roc_auc_score(y_test,model.predict_proba(X_test_binned)[:,1]))
print('GAUSSIAN WITH DISCRETE, QCUT')
model = GaussianNB()
quantiles = [2,3,4,5,6,7,8,9,10]
for quantile in quantiles:
    X_train_binned = X_train.apply(lambda x:pd.qcut(x,q=quantile,labels=False,precision=0))
    X_test_binned = X_test.apply(lambda x:pd.qcut(x,q=quantile,labels=False,precision=0))
    model.fit(X_train_binned,y_train)
    print('No. of Quantiles: ',quantile," ",'AUC: ',
          roc_auc_score(y_test,model.predict_proba(X_test_binned)[:,1]))
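
One possible refinement of the binning above is to learn the bin edges on the training split only and reuse them on the test split, so that a given bin index covers the same value range in both. A minimal sketch with NumPy on synthetic data follows; `fit_quantile_edges` and `apply_edges` are hypothetical helpers, not pandas or sklearn functions.

```python
import numpy as np

def fit_quantile_edges(train_col, n_bins):
    """Inner bin edges taken from the training column's quantiles."""
    inner_quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(train_col, inner_quantiles)

def apply_edges(col, edges):
    """Map values to bin indices 0..n_bins-1 using edges learned elsewhere."""
    return np.digitize(col, edges)

# Synthetic columns (hypothetical): the test column is shifted on purpose
rng = np.random.default_rng(0)
train_col = rng.normal(size=1000)
test_col = rng.normal(loc=0.5, size=300)

edges = fit_quantile_edges(train_col, n_bins=6)
train_bins = apply_edges(train_col, edges)
test_bins = apply_edges(test_col, edges)

print('edges from train:', np.round(edges, 2))
print('train bin counts:', np.bincount(train_bins, minlength=6))
print('test bin counts: ', np.bincount(test_bins, minlength=6))
```

With shared edges, the shift in the test column shows up as more rows in the higher bins. Binning each split separately would hide that shift, because each split would get its own equal-sized bins.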

In [None]:
from scipy import stats
def percentiler(col):
    ranked = stats.rankdata(col)
    data_percentile = ranked/len(col)*100
    bins_percentile = np.linspace(0,100,6)
    data_binned_indices = np.digitize(data_percentile, bins_percentile, right=True)
    return data_binned_indices

print('Percentiled Bins as per columns')
model = GaussianNB()
X_train_binned = X_train.apply(lambda x:percentiler(x))
X_test_binned = X_test.apply(lambda x:percentiler(x))
model.fit(X_train_binned,y_train)
# score with the model fitted on the binned data, not the earlier `pipe`
print(roc_auc_score(y_test,model.predict_proba(X_test_binned)[:,1]))

# References

Links - 
1. https://www.geeksforgeeks.org/naive-bayes-classifiers/
2. https://towardsdatascience.com/introduction-to-na%C3%AFve-bayes-classifier-fa59e3e24aaf
3. https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
4. https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
5. https://machinelearningmastery.com/better-naive-bayes/
6. https://inblog.in/Feature-Importance-in-Naive-Bayes-Classifiers-5qob5d5sFW

Notebooks -
1. https://www.kaggle.com/rayhanlahdji/tps-1121-naive-bayes-for-naive-souls by Rayhan Lahdji
2. https://www.kaggle.com/markosthabit/tbs-november-naive-bayes by Markos Thabit
3. https://www.kaggle.com/prashant111/naive-bayes-classifier-in-python by Prashant Banerjee

Videos -
1. https://www.youtube.com/watch?v=H3EjCKtlVog on Gaussian Naive Bayes
2. https://www.youtube.com/watch?v=O2L2Uv9pdDA on understanding NB