
In [1]:
# Naive Bayes 

# Introduction
# Naive Bayes is one of the simplest and fastest classification algorithms,
# which makes it a practical choice for large datasets.
# It uses Bayes' theorem to estimate the probability of each class for an unseen observation.
# It handles multi-class classification (more than 2 target classes) natively.
# Logistic Regression is binary in its basic form, although multinomial (softmax)
# and one-vs-rest variants extend it to more than 2 classes.

# Assumption
# It assumes that, within a class, the effect of each feature is independent of the other features.
# For example, whether a loan applicant is desirable depends on income, previous loan
# and transaction history, age, and location.
# Even if these features are interdependent, the model treats each one independently.
# This assumption simplifies computation, which is why the method is called naive.
# The assumption is known as class conditional independence.

# Zero Probability Problem
# Suppose no risky-loan tuple in the training data has a particular feature value (e.g. low income).
# The estimated conditional probability of that value given the class is then zero,
# and because the per-feature probabilities are multiplied together,
# the posterior probability of the class is zero as well, whatever the other features say.
# This is known as the Zero Probability (or Zero Frequency) problem.
# The usual remedy is the Laplacian correction, also called Laplace (add-one) smoothing.
# It is a smoothing technique and is unrelated to the Laplace transform.
# It assumes the dataset is large enough
# that adding one tuple to each count makes a negligible difference to the estimated probabilities,
# while ensuring that no probability is exactly zero.
# For Example: Suppose that for the class loan risky, there are 1000 training tuples in the database. 
# In this database, 
# income column has 0 tuples for low income, 
# 990 tuples for medium income, 
# and 10 tuples for high income. 
# The probabilities of these events, 
# without the Laplacian correction, are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000)
# Now, apply Laplacian correction on the given dataset. 
# Add 1 tuple for each of the 3 income values, so the total becomes 1000 + 3 = 1003. The probabilities become:
# P(Low Income) = 1/1003 = 0.001, P(Medium Income) = 991/1003 = 0.988 & P(High Income) = 11/1003 = 0.011

# Advantages
# It is simple, fast to train, and often accurate enough to serve as a strong baseline.
# It can be used for multi-class prediction problems.
# It performs well on text analytics problems such as spam filtering and document classification.
# When the independence assumption holds, a Naive Bayes classifier can match or outperform
# models like logistic regression, and it typically needs less training data to do so.

# Disadvantages
# It assumes independent features.
# In practice, a set of predictors is almost never entirely independent.
# If a feature value never occurs with a particular class in the training data,
# the posterior probability of that class becomes zero.
# Without smoothing, the model can never predict that class for such an observation.
# This is the Zero Probability/Frequency Problem described above.
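
The Laplacian correction from the loan example can be reproduced in a few lines. This is a minimal sketch in plain Python: the function name `smoothed_probabilities` and the `alpha` parameter are illustrative choices, not part of the notebook, and the counts are the ones quoted in the example (0, 990 and 10 out of 1000).

```python
# Counts of each income value among the 1000 "risky loan" tuples (from the example above).
income_counts = {"low": 0, "medium": 990, "high": 10}


def smoothed_probabilities(counts, alpha=1):
    """Additive (Laplace) smoothing: add `alpha` to every category count.

    `alpha=0` gives the raw relative frequencies; `alpha=1` is the
    classic add-one Laplacian correction.
    """
    total = sum(counts.values()) + alpha * len(counts)
    return {value: (count + alpha) / total for value, count in counts.items()}


unsmoothed = smoothed_probabilities(income_counts, alpha=0)
smoothed = smoothed_probabilities(income_counts, alpha=1)

for value in income_counts:
    print(f"{value:>6}: {unsmoothed[value]:.3f} -> {smoothed[value]:.3f}")
```

With `alpha=1` the denominator is 1000 + 3 = 1003, so the low-income probability moves from 0 to roughly 0.001 while the other two estimates barely change.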

In [2]:
# Libraries

import warnings

import numpy as np
import pandas as pd
from sklearn import datasets
# Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")



In [3]:
# IRIS Dataset
iriss = datasets.load_iris()
iris = pd.DataFrame(iriss.data)
iris.columns = iriss.feature_names
iris['species'] = iriss.target
# Strip spaces and parentheses from the column names, e.g. "sepal length (cm)" -> "sepallengthcm"
iris.columns = iris.columns.str.replace(r"[ ()]", "", regex=True)
iris.head()

   sepallengthcm  sepalwidthcm  petallengthcm  petalwidthcm  species
0            5.1           3.5            1.4           0.2        0
1            4.9           3.0            1.4           0.2        0
2            4.7           3.2            1.3           0.2        0
3            4.6           3.1            1.5           0.2        0
4            5.0           3.6            1.4           0.2        0


In [4]:
# Target Column Distribution
iris['species'].value_counts()

2    50
1    50
0    50
Name: species, dtype: int64

In [5]:
# Mean of each independent column, grouped by the dependent column (species)
iris.groupby('species').mean().round(2)

         sepallengthcm  sepalwidthcm  petallengthcm  petalwidthcm
species
0                 5.01          3.43           1.46          0.25
1                 5.94          2.77           4.26          1.33
2                 6.59          2.97           5.55          2.03
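
Per-class means like these, together with per-class variances and class priors, are the quantities a Gaussian Naive Bayes model works from. The following is a rough sketch of that calculation written with NumPy, compared against scikit-learn's `GaussianNB` on the full Iris data. The helper `log_posterior` and the variable names are hypothetical, and the sketch ignores the small variance-smoothing term scikit-learn adds, so agreement is expected to be very close rather than guaranteed to be exact.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Per-class statistics: one row per class, one column per feature.
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])
priors = np.array([(y == c).mean() for c in classes])


def log_posterior(sample):
    """Unnormalised log P(class | sample), assuming independent Gaussian features."""
    # Summing per-feature log densities is where the "naive" independence
    # assumption enters: the joint likelihood is treated as a product.
    log_likelihood = -0.5 * (
        np.log(2 * np.pi * variances) + (sample - means) ** 2 / variances
    ).sum(axis=1)
    return np.log(priors) + log_likelihood


manual_pred = np.array([classes[np.argmax(log_posterior(row))] for row in X])
sklearn_pred = GaussianNB().fit(X, y).predict(X)

agreement = (manual_pred == sklearn_pred).mean()
print("Share of samples where both predictions agree:", agreement)
```

Working in log space avoids multiplying many small probabilities together, which is a common design choice for Naive Bayes implementations.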


In [6]:
# Independent Variables
Independent_Variable_Base_Set = iris[iris.columns[0:4]]
Independent_Variable_Base_Set.head()

   sepallengthcm  sepalwidthcm  petallengthcm  petalwidthcm
0            5.1           3.5            1.4           0.2
1            4.9           3.0            1.4           0.2
2            4.7           3.2            1.3           0.2
3            4.6           3.1            1.5           0.2
4            5.0           3.6            1.4           0.2


In [7]:
# Dependent Variable
# Keep the target as a one-column DataFrame so that y_test['species'] works later
Dependent_Variable = iris[['species']]
Dependent_Variable.head()

   species
0        0
1        0
2        0
3        0
4        0


In [8]:
# Split the Dataset

# Model 1
# Let's start with a single variable (sepal length)
Independent_Variable_Set_v1 = iris[iris.columns[0:1]]
X_train, X_test, y_train, y_test = train_test_split(Independent_Variable_Set_v1,Dependent_Variable,test_size = 0.3,random_state = 21)

# Create a Gaussian Classifier
model = GaussianNB()

# Train the model using the training set
# (ravel() flattens the one-column target into the 1-D array scikit-learn expects)
result = model.fit(X_train, y_train.values.ravel())

# Model Prediction
print("Sample Prediction of Model 1")
pred = result.predict(X_test)
model_prediction = pd.DataFrame(pred, columns=['Prediction'])
print(model_prediction['Prediction'].head())

# Test Set Target Column Distribution
print("\nTest Set Distribution")
print(y_test['species'].value_counts())

# Predicted Set Target Column Distribution
print("\nPredicted Set Distribution")
print(model_prediction['Prediction'].value_counts())

Sample Prediction of Model 1
0    1
1    0
2    0
3    0
4    1
Name: Prediction, dtype: int64

Test Set Distribution
1    16
2    15
0    14
Name: species, dtype: int64

Predicted Set Distribution
1    19
0    14
2    12
Name: Prediction, dtype: int64


In [9]:
# Model Performance

# Macro averaging: compute the metric for each class separately, then take the unweighted mean
# (the "macro avg" row of the classification report below).

# Micro averaging: pool the true positives, false positives and false negatives of all classes,
# then compute the metric once. For single-label multi-class problems, micro-averaged precision,
# recall and F1 all equal accuracy, which is why the four numbers printed below are identical.

y_pred = model_prediction[['Prediction']]
model_1_accuracy = accuracy_score(y_test,y_pred).round(2)
print("Model 1 Performance")
print("\nModel 1, Accuracy :",model_1_accuracy)
model_1_precision = precision_score(y_test,y_pred, average="micro").round(2)
print("Model 1, Precision :",model_1_precision)
model_1_recall = recall_score(y_test,y_pred, average="micro").round(2)
print("Model 1, Recall :",model_1_recall)
model_1_fscore = f1_score(y_test,y_pred, average="micro").round(2)
print("Model 1, F1 Score :",model_1_fscore)
print("\nConfusion Matrix, Model 1")
model_1_cm = confusion_matrix(y_test,y_pred)
print(model_1_cm)
print("\nClassification Report, Model 1")
model_1_cr = classification_report(y_test, y_pred)
print(model_1_cr)

print("Inference : Moderate fit (class 0 is separated well, classes 1 and 2 are often confused), can we make it better ?")

Model 1 Performance

Model 1, Accuracy : 0.67
Model 1, Precision : 0.67
Model 1, Recall : 0.67
Model 1, F1 Score : 0.67

Confusion Matrix, Model 1
[[13  1  0]
 [ 1 10  5]
 [ 0  8  7]]

Classification Report, Model 1
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        14
           1       0.53      0.62      0.57        16
           2       0.58      0.47      0.52        15

    accuracy                           0.67        45
   macro avg       0.68      0.67      0.67        45
weighted avg       0.67      0.67      0.66        45

Inference : Moderate fit (class 0 is separated well, classes 1 and 2 are often confused), can we make it better ?
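
The difference between micro and macro averaging can be checked by hand from the Model 1 confusion matrix. This is a small sketch using only NumPy; the matrix is copied from the output above, and the variable names are illustrative.

```python
import numpy as np

# Model 1 confusion matrix (rows: true class, columns: predicted class).
cm = np.array([[13, 1, 0],
               [1, 10, 5],
               [0, 8, 7]])

tp = np.diag(cm)              # correct predictions per class
fp = cm.sum(axis=0) - tp      # predicted as the class, but actually another class
fn = cm.sum(axis=1) - tp      # actually the class, but predicted as another class

# Macro averaging: per-class F1 first, then an unweighted mean.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_per_class = 2 * precision * recall / (precision + recall)
macro_f1 = f1_per_class.mean()

# Micro averaging: pool the counts over all classes, then compute once.
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

accuracy = tp.sum() / cm.sum()

print("Per-class F1:", f1_per_class.round(2))
print("Macro F1    :", round(macro_f1, 2))
print("Micro F1    :", round(micro_f1, 2))
print("Accuracy    :", round(accuracy, 2))
```

Every misclassified sample counts once as a false positive and once as a false negative, so the pooled counts are equal and micro precision, recall and F1 all collapse to accuracy.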


In [10]:
# Model 2
# Let's build the model with all four variables

X_train, X_test, y_train, y_test = train_test_split(Independent_Variable_Base_Set,Dependent_Variable,test_size = 0.3,random_state = 21)
# Re-fit the same classifier on the training set with all features
result = model.fit(X_train, y_train.values.ravel())

# Model Prediction
print("Sample Prediction of Model 2")
pred = result.predict(X_test)
model_prediction = pd.DataFrame(pred, columns=['Prediction'])
print(model_prediction['Prediction'].head())

# Test Set Target Column Distribution
print("\nTest Set Distribution")
print(y_test['species'].value_counts())

# Predicted Set Target Column Distribution
print("\nPredicted Set Distribution")
print(model_prediction['Prediction'].value_counts())

Sample Prediction of Model 2
0    1
1    0
2    0
3    0
4    1
Name: Prediction, dtype: int64

Test Set Distribution
1    16
2    15
0    14
Name: species, dtype: int64

Predicted Set Distribution
2    17
1    14
0    14
Name: Prediction, dtype: int64


In [11]:
# Model Performance

# Macro averaging: compute the metric for each class separately, then take the unweighted mean
# (the "macro avg" row of the classification report below).

# Micro averaging: pool the true positives, false positives and false negatives of all classes,
# then compute the metric once. For single-label multi-class problems, micro-averaged precision,
# recall and F1 all equal accuracy, which is why the four numbers printed below are identical.

y_pred = model_prediction[['Prediction']]
model_2_accuracy = accuracy_score(y_test,y_pred).round(2)
print("Model 2 Performance")
print("\nModel 2, Accuracy :",model_2_accuracy)
model_2_precision = precision_score(y_test,y_pred, average="micro").round(2)
print("Model 2, Precision :",model_2_precision)
model_2_recall = recall_score(y_test,y_pred, average="micro").round(2)
print("Model 2, Recall :",model_2_recall)
model_2_fscore = f1_score(y_test,y_pred, average="micro").round(2)
print("Model 2, F1 Score :",model_2_fscore)
print("\nConfusion Matrix, Model 2")
model_2_cm = confusion_matrix(y_test,y_pred)
print(model_2_cm)
print("\nClassification Report, Model 2")
model_2_cr = classification_report(y_test, y_pred)
print(model_2_cr)

print("Inference : Model 2 scores 0.96 on the held-out test set, which is not evidence of overfitting, so choose Model 2")

Model 2 Performance

Model 2, Accuracy : 0.96
Model 2, Precision : 0.96
Model 2, Recall : 0.96
Model 2, F1 Score : 0.96

Confusion Matrix, Model 2
[[14  0  0]
 [ 0 14  2]
 [ 0  0 15]]

Classification Report, Model 2
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       1.00      0.88      0.93        16
           2       0.88      1.00      0.94        15

    accuracy                           0.96        45
   macro avg       0.96      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45

Inference : Model 2 scores 0.96 on the held-out test set, which is not evidence of overfitting, so choose Model 2
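
A single 70/30 split gives only one estimate of how well a model generalises. One common way to get a steadier comparison between the one-feature and all-feature models is k-fold cross-validation. The sketch below is not part of the original notebook: it assumes 5 folds (an arbitrary choice) and loads the Iris data directly as NumPy arrays, with the first column standing in for sepal length.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Column 0 is sepal length, the single feature used for the first model.
one_feature_scores = cross_val_score(GaussianNB(), X[:, :1], y, cv=5)
all_feature_scores = cross_val_score(GaussianNB(), X, y, cv=5)

print("One feature  : mean accuracy %.2f (std %.2f)"
      % (one_feature_scores.mean(), one_feature_scores.std()))
print("All features : mean accuracy %.2f (std %.2f)"
      % (all_feature_scores.mean(), all_feature_scores.std()))
```

Overfitting shows up as a gap between training and held-out performance, so scores that stay high across all folds suggest the model is generalising rather than memorising.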
