<img src="./images/Naive Bayes.png" alt="Text Analysis"/>

# Naive Bayes

Naive Bayes is a probabilistic classifier for problems with two or more classes. It is often used for text classification and other high-dimensional datasets. The algorithm applies Bayes' theorem, using conditional probabilities to make predictions. The classifier assumes that, within a class, the presence of one feature is independent of the presence of any other feature. This assumption rarely holds exactly, which is why the method is called "naive".
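
Written out, the independence assumption lets the class-conditional likelihood factor into one term per feature. A sketch in standard notation, where $c$ is a class and $x_1, \dots, x_n$ are the feature values (symbols chosen here for illustration):

$$
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$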

## DoctorAUS Dataset

We will use Naive Bayes to predict the type of insurance a person has in the DoctorAUS dataset from the pydataset module.

In [2]:
from pydataset import data
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

initiated datasets repo at: /Users/binliao/.pydataset/


Load Dataset and extract features and target variable in separate dataframes.

Data Description

Relevant Years: a cross-section from 1977–1978

number of observations: 5190

observation: individuals

country : Australia

<b>Variables</b>

sex: Gender

age: Age of the individual

income: annual income in tens of thousands of dollars

insurance: insurance contract (
    medlevy : Medibank levy, 
    levyplus : private health insurance, 
    freepoor : government insurance due to low income (coded as freepor in the data), 
    freerepa : government insurance due to old age, disability or veteran status)

illness: number of illness in past 2 weeks

actdays: number of days of reduced activity in past 2 weeks due to illness or injury

hscore: general health score using Goldberg's method (from 0 to 12)

chcond: chronic condition (np : no problem, la : limiting activity, nla : not limiting activity)

doctorco: number of consultations with a doctor or specialist in the past 2 weeks

nondocco: number of consultations with non-doctor health professionals (chemist, optician, physiotherapist, social worker, district community nurse, chiropodist or chiropractor) in the past 2 weeks

hospadmi: number of admissions to a hospital, psychiatric hospital, nursing or convalescent home in the past 12 months (up to 5 or more admissions which is coded as 5)

hospdays: number of nights in a hospital, etc. during most recent admission: taken, where appropriate, as the mid-point of the intervals 1, 2, 3, 4, 5, 6, 7, 8-14, 15-30, 31-60, 61-79 with 80 or more admissions coded as 80. If no admission in past 12 months then equals zero.

medecine: total number of prescribed and nonprescribed medications used in past 2 days

prescrib: total number of prescribed medications used in past 2 days

nonpresc: total number of nonprescribed medications used in past 2 days

Source

Cameron, A.C. and P.K. Trivedi (1986) “Econometric Models Based on Count Data: Comparisons and Applications of Some Estimators and Tests”, Journal of Applied Econometrics, 1, 29-54.

References

Cameron, A.C. and Trivedi P.K. (1998) Regression analysis of count data, Cambridge University Press, http://cameron.econ.ucdavis.edu/racd/racddata.html, chapter 3.

In [3]:
df = data("DoctorAUS")
X = df[['age','income','sex','illness','actdays','hscore','doctorco','nondocco','hospadmi','hospdays','medecine','prescrib']]
X.head()

Unnamed: 0,age,income,sex,illness,actdays,hscore,doctorco,nondocco,hospadmi,hospdays,medecine,prescrib
1,0.19,0.55,1,1,4,1,1,0,0,0,1,1
2,0.19,0.45,1,1,2,1,1,0,0,0,2,1
3,0.19,0.9,0,3,0,0,1,0,1,4,2,1
4,0.19,0.15,0,1,0,0,1,0,0,0,0,0
5,0.19,0.45,0,2,5,1,1,0,0,0,3,1


In [4]:
y = df['insurance']
y.head()

1    levyplus
2    levyplus
3     medlevy
4     medlevy
5     medlevy
Name: insurance, dtype: object

In [5]:
# Create the Train Test split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.3,random_state=0)

## Fitting the Model

Instantiate the Gaussian Naive Bayes Classifier and fit it on the train dataset

In [6]:
clf = GaussianNB()
clf.fit(X_train,y_train)

GaussianNB()

## Model Prediction
Evaluate the model on the test dataset

In [7]:
y_pred = clf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

     freepor       0.29      0.12      0.16        69
    freerepa       0.63      0.52      0.57       321
    levyplus       0.50      0.31      0.39       670
     medlevy       0.48      0.81      0.60       497

    accuracy                           0.51      1557
   macro avg       0.47      0.44      0.43      1557
weighted avg       0.51      0.51      0.48      1557



The accuracy, precision and recall are not very promising. Recall for medlevy is an acceptable 0.81, but overall Gaussian Naive Bayes does not fit this data well.

### Output:
From the classification report, we can learn several things about the performance of the Naive Bayes classifier:

1. The overall accuracy of the classifier is 0.51, which means that it correctly predicts the insurance type for around 51% of the individuals in the test set.

2. The classifier performs better for some insurance types than others. For example, "medlevy" has the highest recall (0.81) and f1-score (0.60), indicating that the classifier finds most individuals with this insurance type. However, its precision is only 0.48, so about half of the individuals predicted as "medlevy" actually have a different insurance type.

3. Similarly, "freerepa" has the highest precision (0.63), so the classifier is correct more often than not when it predicts this insurance type. However, its recall is 0.52, so it misses nearly half of the individuals who have this insurance type. The weakest class is "freepor", with a recall of 0.12 and an f1-score of 0.16.

4. The macro-average f1-score is 0.43, which is lower than the overall accuracy of 0.51. The macro average weights every class equally, so the gap reflects the classifier's poor performance on the smaller classes.

Overall, the results suggest that the classifier may need improvement in order to achieve higher accuracy and performance across all insurance types. It may be useful to explore different feature engineering techniques or try different classification algorithms to see if they can improve the performance of the classifier.
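
To put these numbers in context, the averages can be recomputed from the per-class rows, and the accuracy can be compared with a majority-class baseline. The sketch below only reuses values copied from the report above; the variable names are illustrative.

```python
# Per-class f1-scores and supports copied from the classification report above
f1_scores = {"freepor": 0.16, "freerepa": 0.57, "levyplus": 0.39, "medlevy": 0.60}
support = {"freepor": 69, "freerepa": 321, "levyplus": 670, "medlevy": 497}

total = sum(support.values())

# Macro average: every class counts equally
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1_scores[c] * support[c] for c in f1_scores) / total

# Accuracy of always predicting the most frequent class in the test set
majority_baseline = max(support.values()) / total

print(round(macro_f1, 2), round(weighted_f1, 2), round(majority_baseline, 2))
```

Always predicting "levyplus" would already be right for about 43% of the test set, so an accuracy of 0.51 is only a modest improvement over that baseline.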

# Iris Case Study

In [8]:
import numpy as np
import pandas as pd
from sklearn import datasets

#Read iris data
iris = pd.read_csv('https://raw.githubusercontent.com/dearbharat/datasets/main/iris.csv')


#Display iris header
iris.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


## Data Preparation 

In [9]:
#Descriptive Statistics
iris.describe().round(4)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.8433,3.054,3.7587,1.1987
std,43.4454,0.8281,0.4336,1.7644,0.7632
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [10]:
from sklearn.model_selection import train_test_split

#Split data to X and y Variables
X = iris.iloc[:, 1:-1].values
y = iris.iloc[:, -1].values

#Split data for Training and Testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [11]:
#Count of Training Dataset sample
samples = X_train.shape[0]
samples

105

In [12]:
feature = X_train.shape[1]
feature

4

In [13]:
#Name of Distinct Classes
classes = np.unique(y_train)
classes

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [14]:
# Total No of Classes
number_of_classes = len(classes)
number_of_classes

3

## Initialize Model Parameters

In [15]:
# Initialize mean of Each Feature for Each Class
mean = np.zeros((number_of_classes, feature), dtype=np.float32)
mean

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]], dtype=float32)

In [16]:
# Initialize Standard Deviation Variable
sd = np.zeros((number_of_classes, feature), dtype=np.float32)
sd

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]], dtype=float32)

In [17]:
# Initialize Prior Probabilities
priors = np.zeros(number_of_classes, dtype=np.float32)
priors

array([0., 0., 0.], dtype=float32)

## Estimating the Priors

In [18]:
#Iterate over Each Class
for index,value in enumerate(classes):
  X_Training = X_train[y_train == value]

  #Find Mean of the Class for each feature and store it at Same Index as Class Index 
  mean[index] = X_Training.mean(axis=0)

  #Find Standard Deviation of the Class and store it at Same Index as Class Index 
  sd[index] = X_Training.std(axis=0)

  #To find Prior Probability of Every Class
  # Count of a Class divided by Total Samples in the training dataset
  priors[index] = X_Training.shape[0] / samples

In [19]:
#Calculate Probability using Gaussian Distribution Formula
def gaussianFormula(x, mean, std):
  prob = (np.exp((-1/2)*(np.square((x - mean)/std))))/(std*(np.sqrt(2*np.pi)))
  return prob
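
Note that this formula returns a probability density, not a probability, so its value can exceed 1 when the standard deviation is small. The sketch below illustrates this with a scalar counterpart of `gaussianFormula`; the name `gaussian_pdf` and the input values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Normal density, the same formula as gaussianFormula but for scalars."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Hypothetical values: a narrow distribution evaluated at its mean
narrow = gaussian_pdf(0.2, 0.2, 0.1)
# A standard normal distribution evaluated at its mean
standard = gaussian_pdf(0.0, 0.0, 1.0)

print(round(narrow, 4), round(standard, 4))
```

This is why some of the per-feature values printed during prediction are larger than 1.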

## Final Probabilities

In [20]:
#Initialized variable for storing Predictions
prediction = []

#Iterate over Test Data
for x_Testing in X_test:
  #Store the log-score of each class for this sample
  best_Case_Probability = []
  print(best_Case_Probability)

  #Iterate over Each Class
  for index,value in enumerate(classes):
    #Per-feature Gaussian likelihood of the sample under this class
    post = gaussianFormula(x_Testing, mean[index], sd[index])
    print(post)

    #Log-score of the class: log prior plus log likelihoods
    #Note: the log prior is added to every feature term before summing, so it is counted once per feature
    final = np.sum(np.log(priors[index]) + np.log(post))
    print(final)

    #Add the log-score to the List
    best_Case_Probability.append(final)
    print(best_Case_Probability)

  #Append Name of Class whose log-score is Max among all the Classes
  prediction.append(classes[np.argmax(best_Case_Probability)])

[]
[8.51890309e-002 3.36991156e-001 1.06471844e-143 4.21108034e-112]
-593.7197496329225
[-593.7197496329225]
[7.32853461e-01 1.16375531e+00 1.44388116e-01 1.20464473e-06]
-20.476631357350513
[-593.7197496329225, -20.476631357350513]
[0.25978147 0.99495052 0.48007805 0.55832126]
-6.631198692550944
[-593.7197496329225, -20.476631357350513, -6.631198692550944]
[]
[2.01907553e-02 1.06416194e-02 3.13159135e-70 3.20009652e-14]
-204.06833793017609
[-204.06833793017609]
[0.74434937 0.30773185 0.75908951 0.63581121]
-6.955158178039042
[-204.06833793017609, -6.955158178039042]
[0.37007177 0.08723995 0.01225224 0.00057115]
-19.264659068296794
[-204.06833793017609, -6.955158178039042, -19.264659068296794]
[]
[0.40682628 0.11501681 2.62226146 3.98101541]
-5.2268713484041935
[-5.2268713484041935]
[5.48273627e-01 1.42410367e-04 4.55783518e-08 8.67205878e-07]
-45.07249870062812
[-5.2268713484041935, -45.07249870062812]
[1.28128178e-01 2.52028692e-03 4.69916466e-13 2.32190466e-11]
-64.87197087753154
[-
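
In the loop above, `np.log(priors[index])` is added to each element of `np.log(post)` before the sum, so the log prior enters the score once per feature (four times here) rather than once. When the class priors are of similar size the extra weight is small, but the usual decision rule adds the log prior a single time. A minimal vectorized sketch of that rule follows; `log_scores` and the toy parameters are hypothetical names and values introduced only for illustration. Computing the density directly in log space also avoids very small intermediate values such as the `1.06e-143` seen in the trace.

```python
import numpy as np

def log_scores(x, means, stds, priors):
    """Return one log-score per class: log prior + sum of per-feature log densities."""
    # Gaussian log density, computed directly in log space
    log_pdf = -0.5 * np.square((x - means) / stds) - np.log(stds * np.sqrt(2 * np.pi))
    # The log prior is added once per class, after summing over the features
    return np.log(priors) + log_pdf.sum(axis=1)

# Hypothetical toy parameters: 2 classes, 2 features
toy_means = np.array([[0.0, 0.0], [3.0, 3.0]])
toy_stds = np.array([[1.0, 1.0], [1.0, 1.0]])
toy_priors = np.array([0.5, 0.5])

scores = log_scores(np.array([2.8, 3.1]), toy_means, toy_stds, toy_priors)
predicted_class = int(np.argmax(scores))
print(scores, predicted_class)
```

With the arrays from this section, the call would be `log_scores(x_Testing, mean, sd, priors)` for each test sample.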

## Model Evaluation

In [21]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

#Print Classification Report
print('Classification Report: ')
print(classification_report(y_test, prediction))

#Print Confusion Matrix
print('Confusion Matrix: ')
print(confusion_matrix(y_test, prediction))

from sklearn.metrics import accuracy_score
#Print Accuracy Score
print('Accuracy Score:', round(accuracy_score(y_test, prediction), 4))

Classification Report: 
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        16
Iris-versicolor       1.00      0.94      0.97        18
 Iris-virginica       0.92      1.00      0.96        11

       accuracy                           0.98        45
      macro avg       0.97      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45

Confusion Matrix: 
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
Accuracy Score: 0.9778
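
The figures in the classification report can be read directly off the confusion matrix. The sketch below recomputes them in plain Python from the matrix printed above: rows are true classes, columns are predicted classes, and the single error is one versicolor sample predicted as virginica.

```python
# Confusion matrix from the output above: rows are true classes, columns are predicted classes
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
cm = [
    [16, 0, 0],
    [0, 17, 1],
    [0, 0, 11],
]

for k, label in enumerate(labels):
    true_positives = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)   # column sum
    actually_k = sum(cm[k])                      # row sum (support)
    precision = true_positives / predicted_as_k
    recall = true_positives / actually_k
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy={accuracy:.4f}")
```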


<img src="./images/BayesTheorem.png" alt="Bayesian Thinking"/>

<img src="./images/bayesian coin.png" alt="Bayes Example" width= "800"/>

# Diabetes Case Study
Dataset: pima-indian-diabetes dataset

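The variant of Naive Bayes to use depends on the type of the features:

 -  Regular Naive Bayes: discrete features
 -  Gaussian Naive Bayes: continuous features
 -  Multinomial Naive Bayes: frequency (count) features

The feature space is not always about the occurrence or non-occurrence of an event:

 -  Spam classifier with presence features (does "$" or "lottery" occur: 1 or 0): regular Naive Bayes
 -  Spam classifier with count features (how many times "$" occurs: 0, 1, 2, 3, 4): Multinomial Naive Bayes
 -  Iris flower classifier with the length and width of the petals and sepals: Gaussian Naive Bayes, described by a mean (mu) and a standard deviation (sigma) per feature and class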
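
The difference between presence features and count features can be sketched on a single message. The message and vocabulary below are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical message and vocabulary, chosen only for illustration
message = "win $ $ $ lottery now"
vocabulary = ["$", "lottery", "meeting"]

counts = Counter(message.split())

# Presence features (1 or 0): does the token occur at all?
presence_features = [1 if counts[token] > 0 else 0 for token in vocabulary]

# Count features (0, 1, 2, ...): how many times does the token occur?
count_features = [counts[token] for token in vocabulary]

print(presence_features)  # [1, 1, 0]
print(count_features)     # [3, 1, 0]
```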

## Explore Data
https://raw.githubusercontent.com/dearbharat/NBBayes/main/pima-indians-diabetes.data.csv

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

 -  Pregnancies: Number of times pregnant <br>
 -  Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test <br>
 -  BloodPressure: Diastolic blood pressure (mm Hg) <br>
 -  SkinThickness: Triceps skin fold thickness (mm) <br>
 -  Insulin: 2-Hour serum insulin (mu U/ml) <br>
 -  BMI: Body mass index (weight in kg/(height in m)^2) <br>
 -  DiabetesPedigreeFunction: Diabetes pedigree function <br>
 -  Age: Age (years) <br>
 -  Outcome: Class variable (0 or 1) <br>

In [22]:
# import dependencies
import numpy as np
import pandas as pd

# other dependencies for publishing image in notebook
from IPython.display import Image
from IPython.core.display import HTML 
%matplotlib  inline

In [23]:
column = ["Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin","BMI","DiabetesPedigreeFunction","Age","Outcome"]
data = pd.read_csv('https://raw.githubusercontent.com/dearbharat/NBBayes/main/pima-indians-diabetes.data.csv',names=column)

In [24]:
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## Bayesian formula

![Naive Bayes Formula](https://raw.githubusercontent.com/dearbharat/NBBayes/main/bayes.PNG)

*Where*,
 - P(c|x) is the posterior probability of class c given the predictors (features).
 - P(c) is the prior probability of the class.
 - P(x|c) is the likelihood, which is the probability of the predictors given the class.
 - P(x) is the marginal probability of the predictors.

In a Bayes classifier, we calculate the posterior of every class for each observation and then assign the observation to the class with the largest posterior. We have two outcome classes, so we will calculate two posteriors: one for Outcome 1 and one for Outcome 0.
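
As a small numeric illustration of the formula, the sketch below computes both posteriors for one observation. The priors and likelihood values are made up for the example and are not taken from the dataset.

```python
# Hypothetical priors and likelihoods for a single observation x
prior = {"outcome1": 0.35, "outcome0": 0.65}
likelihood = {"outcome1": 0.020, "outcome0": 0.008}   # P(x | c), made-up values

# Numerator of Bayes' theorem for each class: P(x | c) * P(c)
numerator = {c: likelihood[c] * prior[c] for c in prior}

# Marginal probability P(x): the sum of the numerators over all classes
marginal = sum(numerator.values())

# Posterior of each class, and the class with the largest posterior
posterior = {c: numerator[c] / marginal for c in prior}
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

Here the smaller prior of outcome1 is outweighed by its larger likelihood, so outcome1 has the larger posterior.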

# Gaussian Naive Bayes Classifier

The Outcome column has two classes, Outcome 1 and Outcome 0.
The naive Bayes posterior for Outcome 1 is
![Outcome1](https://raw.githubusercontent.com/dearbharat/NBBayes/main/outcome1.PNG)

and for Outcome 0 it is
![Outcome0](https://raw.githubusercontent.com/dearbharat/NBBayes/main/outcome0.PNG)

Now look at the parts of the equation above:

 - P(Outcome1) is the prior probability. It is simply the probability that an observation is "1": the number of people with outcome 1 in the dataset divided by the total number of people in the dataset. <br>

 - p(Pregnancies∣Outcome1) * p(Glucose∣Outcome1) * p(BloodPressure∣Outcome1)... is the likelihood. Notice that we have unpacked the person’s data, so there is now one term for every feature in the dataset. The “gaussian” and “naive” come from two assumptions present in this likelihood: <br>
 
 -  Naive: each term in the likelihood treats its feature as independent of the other features within a class. That is, Pregnancies is independent of Glucose, BMI and so on. This is obviously not true, and is a “naive” assumption, hence the name “naive Bayes”. <br>

 -  Gaussian: the values of each feature within a class are assumed to be normally distributed. <br>

Following the formula, our work is divided into 5 steps:

1. Calculate Priors
2. Calculate Likelihood
3. Calculate Marginal Probability
4. Apply Bayes Classifier To New Data Point
5. Understand what has just happened


## Calculate Priors

Priors can be either constants or probability distributions. In our example, the prior of each class is simply the fraction of patients with that outcome.

In [25]:
# Number of patients of outcome 1
n_outcome1 = data['Outcome'][data['Outcome'] == 1].count()
n_outcome1

268

In [26]:
# Number of patients of outcome 0
n_outcome0 = data['Outcome'][data['Outcome'] == 0].count()
n_outcome0

500

In [27]:
# Total people
total_ppl = data['Outcome'].count()
total_ppl

768

In [28]:
# Number of people of outcome1 divided by the total people
P_outcome1 = n_outcome1/total_ppl
P_outcome1

0.3489583333333333

In [29]:
# Number of people of outcome0 divided by the total people
P_outcome0 = n_outcome0/total_ppl
P_outcome0

0.6510416666666666
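
The same priors can be computed for every class in one pass. The sketch below uses only the standard library and rebuilds the label column from the counts shown above; on the DataFrame itself, `data['Outcome'].value_counts(normalize=True)` should give the same proportions.

```python
from collections import Counter

# Rebuild the label column from the counts shown above (268 ones, 500 zeros)
outcomes = [1] * 268 + [0] * 500

class_counts = Counter(outcomes)
total = sum(class_counts.values())

# Prior of each class = class count / total count
priors = {label: count / total for label, count in class_counts.items()}
print(priors)
```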

## Calculate Likelihood

 - We assume that the values of each feature within a class (e.g. Pregnancies for Outcome1, Glucose for Outcome1) are normally (Gaussian) distributed. <br><br>
 -  This means that p(Pregnancies∣Outcome1) is calculated by inputting the required parameters into the probability density function of the normal distribution. <br><br>
 - As per the formula for the probability density function, our likelihood will be

![bay1](https://raw.githubusercontent.com/dearbharat/NBBayes/main/bay1.png)
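
In text form, the density used for each feature is the normal probability density function. Here $\mu_y$ and $\sigma_y^2$ denote the mean and variance of the feature within class $y$ (notation assumed here; it matches the `p_x_given_y` function defined later in this notebook):

$$
p(x \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x - \mu_y)^2}{2\sigma_y^2}\right)
$$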

In [30]:
# Now first calculate the means of the data according to outcome

# Group the data by Outcome and calculate the mean of each feature
data_means = data.groupby('Outcome').mean().round(2)

# View the values
data_means

Unnamed: 0_level_0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
Outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,3.3,109.98,68.18,19.66,68.79,30.3,0.43,31.19
1,4.87,141.26,70.82,22.16,100.34,35.14,0.55,37.07


In [31]:
# Second calculate the variance of the data according to outcome

# Group the data by Outcome and calculate the variance of each feature
data_variance = data.groupby('Outcome').var().round(2)

# View the values
data_variance

Unnamed: 0_level_0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
Outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,9.1,683.36,326.27,221.71,9774.35,59.13,0.09,136.13
1,14.0,1020.14,461.9,312.57,19234.67,52.75,0.14,120.3


We now have the means and variances of the data.
The formula below is for a single feature:

![bay1](https://raw.githubusercontent.com/dearbharat/NBBayes/main/bay1.png)

To evaluate it for every feature, we need the mean and variance of each feature for each outcome.

In [32]:
# Means for outcome1 for all features (row label 1 of the grouped means)
outcome1_Pregnancies_mean = data_means.loc[1, 'Pregnancies']
outcome1_Glucose_mean = data_means.loc[1, 'Glucose']
outcome1_BloodPressure_mean = data_means.loc[1, 'BloodPressure']
outcome1_SkinThickness_mean = data_means.loc[1, 'SkinThickness']
outcome1_Insulin_mean = data_means.loc[1, 'Insulin']
outcome1_BMI_mean = data_means.loc[1, 'BMI']
outcome1_DiabetesPedigreeFunction_mean = data_means.loc[1, 'DiabetesPedigreeFunction']
outcome1_Age_mean = data_means.loc[1, 'Age']


# Variance for outcome1 for all features (row label 1 of the grouped variances)
outcome1_Pregnancies_variance = data_variance.loc[1, 'Pregnancies']
outcome1_Glucose_variance = data_variance.loc[1, 'Glucose']
outcome1_BloodPressure_variance = data_variance.loc[1, 'BloodPressure']
outcome1_SkinThickness_variance = data_variance.loc[1, 'SkinThickness']
outcome1_Insulin_variance = data_variance.loc[1, 'Insulin']
outcome1_BMI_variance = data_variance.loc[1, 'BMI']
outcome1_DiabetesPedigreeFunction_variance = data_variance.loc[1, 'DiabetesPedigreeFunction']
outcome1_Age_variance = data_variance.loc[1, 'Age']

# Means for outcome0 for all features (row label 0 of the grouped means)
outcome0_Pregnancies_mean = data_means.loc[0, 'Pregnancies']
outcome0_Glucose_mean = data_means.loc[0, 'Glucose']
outcome0_BloodPressure_mean = data_means.loc[0, 'BloodPressure']
outcome0_SkinThickness_mean = data_means.loc[0, 'SkinThickness']
outcome0_Insulin_mean = data_means.loc[0, 'Insulin']
outcome0_BMI_mean = data_means.loc[0, 'BMI']
outcome0_DiabetesPedigreeFunction_mean = data_means.loc[0, 'DiabetesPedigreeFunction']
outcome0_Age_mean = data_means.loc[0, 'Age']

# Variance for outcome0 for all features (row label 0 of the grouped variances)
outcome0_Pregnancies_variance = data_variance.loc[0, 'Pregnancies']
outcome0_Glucose_variance = data_variance.loc[0, 'Glucose']
outcome0_BloodPressure_variance = data_variance.loc[0, 'BloodPressure']
outcome0_SkinThickness_variance = data_variance.loc[0, 'SkinThickness']
outcome0_Insulin_variance = data_variance.loc[0, 'Insulin']
outcome0_BMI_variance = data_variance.loc[0, 'BMI']
outcome0_DiabetesPedigreeFunction_variance = data_variance.loc[0, 'DiabetesPedigreeFunction']
outcome0_Age_variance = data_variance.loc[0, 'Age']

## Marginal probability

The marginal probability is probably one of the most confusing parts of Bayesian approaches. In some examples it is entirely possible to calculate it.

However, in many real-world cases, it is either extremely difficult or impossible to find the value of the marginal probability (explaining why is beyond the scope of this tutorial). 

This is not as much of a problem for our classifier as you might think. Why? Because we don’t care what the true posterior value is; we only care which class has the highest posterior value.

And because the marginal probability is the same for all classes, we can
1) ignore the denominator <br>
2) calculate only the posterior’s numerator for each class  <br>
3) pick the largest numerator. That is, we make the prediction solely on the relative values of the posterior’s numerators.<br>

## Evaluate Bayes Classifier

In [33]:
# Create an empty dataframe for the new data point that we want to classify
person = pd.DataFrame()

# Create some feature values for this single row
person['Pregnancies'] = [7]
person['Glucose'] = [130]
person['BloodPressure'] = [86]
person['SkinThickness'] = [34]
person['Insulin'] = [0]
person['BMI'] = [33.5]
person['DiabetesPedigreeFunction'] = [0.564]
person['Age'] = [50]
# View the data 
person

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,7,130,86,34,0,33.5,0.564,50


In [34]:
# Create a function that calculates p(x | y):
def p_x_given_y(x, mean_y, variance_y):

    # Input the arguments into a probability density function
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))
    
    # return p
    return p
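
Multiplying eight small densities gives a very small number (on the order of 1e-13 in this notebook), and with many more features the product could underflow to zero. A common alternative is to sum log densities instead. The sketch below is a log-space counterpart of `p_x_given_y`; the name `log_p_x_given_y` is hypothetical, and the example values are the Glucose mean and variance for outcome 1 from the tables above.

```python
import math

def log_p_x_given_y(x, mean_y, variance_y):
    """Natural log of the normal density computed by p_x_given_y."""
    return -0.5 * math.log(2 * math.pi * variance_y) - (x - mean_y) ** 2 / (2 * variance_y)

# Example with the Glucose statistics for outcome 1 from the tables above
log_density = log_p_x_given_y(130, 141.26, 1020.14)

# Exponentiating the log density recovers the ordinary density
density = math.exp(log_density)
print(log_density, density)
```

Because the logarithm is monotonic, comparing sums of log densities (plus the log prior) selects the same class as comparing the products.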

For now we are ignoring the marginal probability P(x), which is the denominator of the formula.

The formula again, for reference:
![Bayes](https://raw.githubusercontent.com/dearbharat/NBBayes/main/bayes.PNG)

In [35]:
# Where,
#      P(c|x) is the posterior probability of class c given the predictors (features).
#      P(c) is the prior probability of the class.
#      P(x|c) is the likelihood, which is the probability of the predictors given the class.
#      P(x) is the marginal probability of the predictors.

![Outcome1](https://raw.githubusercontent.com/dearbharat/NBBayes/main/outcome1.PNG)

In [36]:
# So for now we will only calculate the numerator of the posterior and will predict based on the numerator only

# Numerator of the posterior probability if the unclassified observation is a Outcome1
d_out1 = P_outcome1 * \
p_x_given_y(person['Pregnancies'][0], outcome1_Pregnancies_mean, outcome1_Pregnancies_variance) * \
p_x_given_y(person['Glucose'][0], outcome1_Glucose_mean, outcome1_Glucose_variance) * \
p_x_given_y(person['BloodPressure'][0], outcome1_BloodPressure_mean, outcome1_BloodPressure_variance) * \
p_x_given_y(person['SkinThickness'][0], outcome1_SkinThickness_mean, outcome1_SkinThickness_variance) * \
p_x_given_y(person['Insulin'][0], outcome1_Insulin_mean, outcome1_Insulin_variance) * \
p_x_given_y(person['BMI'][0], outcome1_BMI_mean, outcome1_BMI_variance) * \
p_x_given_y(person['DiabetesPedigreeFunction'][0], outcome1_DiabetesPedigreeFunction_mean, outcome1_DiabetesPedigreeFunction_variance) *\
p_x_given_y(person['Age'][0], outcome1_Age_mean, outcome1_Age_variance) 

In [37]:
d_out1

2.221620194472981e-13

![Outcome0](https://raw.githubusercontent.com/dearbharat/NBBayes/main/outcome0.PNG)

In [38]:
# Numerator of the posterior probability if the unclassified observation is a Outcome0
d_out2 = P_outcome0 * \
p_x_given_y(person['Pregnancies'][0], outcome0_Pregnancies_mean, outcome0_Pregnancies_variance) * \
p_x_given_y(person['Glucose'][0], outcome0_Glucose_mean, outcome0_Glucose_variance) * \
p_x_given_y(person['BloodPressure'][0], outcome0_BloodPressure_mean, outcome0_BloodPressure_variance) * \
p_x_given_y(person['SkinThickness'][0], outcome0_SkinThickness_mean, outcome0_SkinThickness_variance) * \
p_x_given_y(person['Insulin'][0], outcome0_Insulin_mean, outcome0_Insulin_variance) * \
p_x_given_y(person['BMI'][0], outcome0_BMI_mean, outcome0_BMI_variance) * \
p_x_given_y(person['DiabetesPedigreeFunction'][0], outcome0_DiabetesPedigreeFunction_mean, outcome0_DiabetesPedigreeFunction_variance) *\
p_x_given_y(person['Age'][0], outcome0_Age_mean, outcome0_Age_variance) 

In [39]:
d_out2

1.7868492189980746e-13

Comparing the two numerators, the value for outcome1 (about 2.22e-13) is larger than the value for outcome0 (about 1.79e-13), so the classifier assigns the new data point to outcome1.
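
Although the prediction only needs the larger numerator, the two numerators can also be normalized to see how confident the model is. Under the model's own assumptions, the marginal probability is the sum of the numerators over both classes. The sketch below reuses the two values printed above; the variable names are illustrative.

```python
# Numerators computed above for the new data point
numerator_outcome1 = 2.221620194472981e-13
numerator_outcome0 = 1.7868492189980746e-13

# The marginal probability is the sum of the numerators over both classes
marginal = numerator_outcome1 + numerator_outcome0

posterior_outcome1 = numerator_outcome1 / marginal
posterior_outcome0 = numerator_outcome0 / marginal
print(round(posterior_outcome1, 3), round(posterior_outcome0, 3))
```

The posterior for outcome1 comes out at roughly 0.55, so the model favours outcome1 but only by a narrow margin.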

## Naive Bayes using scikit-learn

scikit-learn provides several types of Naive Bayes, including these three:

 - Multinomial. 
 http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
 
 - Bernoulli. 
 http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html
 
 - and finally Gaussian.
 http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
 
 As a quick reminder, we have implemented Gaussian Naive Bayes above.


In [40]:
#first visualise what we have in our hand
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [41]:

X = data.iloc[:,0:-1] # X is the features in our dataset
y = data.iloc[:,-1]   # y is the Labels in our dataset

In [42]:
# divide the dataset in train test using scikit learn
# now the model will train in training dataset and then we will use test dataset to predict its accuracy

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) 

In [43]:
# now preparing our model as per Gaussian Naive Bayesian

from sklearn.naive_bayes import GaussianNB

model = GaussianNB().fit(X_train, y_train) #fitting our model

In [44]:
predicted_y = model.predict(X_test) # predict the labels of the test dataset

In [45]:
from sklearn.metrics import accuracy_score

# compare the predicted values with the y_test values to measure how accurate the model is
# (the result is stored under a new name so that the accuracy_score function is not overwritten)
accuracy = accuracy_score(y_test, predicted_y) 
print (accuracy)

0.7362204724409449


We got about 74% accuracy.

Next, we test the model on the new data point. Remember that the hand-built classifier above concluded that this data point belongs to outcome1.

In [46]:
# the data is stored in Dataframe person
predicted_y = model.predict(person)

In [48]:
print (predicted_y)

[1]

