# Naive Bayes Classifier: (ML Algorithm)


![](https://insightimi.files.wordpress.com/2020/04/unnamed-1.png)

## CONTENTS:
* What is Naive Bayes Classifier?
* Types of Naive Bayes Classifier?
* Project

## What is Naive Bayes Classifier?
* Naive Bayes is a probabilistic, supervised machine learning algorithm used to solve classification problems.
* It is based on Bayes' theorem.
* It assumes that all features (predictors) are mutually independent given the class. This simplifying assumption is why it is called "naive".

![](https://miro.medium.com/max/600/1*aFhOj7TdBIZir4keHMgHOw.png)

**Where,**
* **H = Hypothesis**
* **E = Evidence**
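
To make the theorem concrete, here is a small worked example in Python. The three input probabilities are made-up numbers chosen only for illustration; they do not come from any dataset.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_h = 0.01              # P(H): prior probability of the hypothesis
p_e_given_h = 0.90      # P(E|H): probability of the evidence if H is true
p_e_given_not_h = 0.05  # P(E|not H): probability of the evidence if H is false

# P(E) by the law of total probability
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e

print(f"P(E)   = {p_e:.4f}")
print(f"P(H|E) = {p_h_given_e:.4f}")
```

Even with strong evidence, the posterior stays modest here because the prior P(H) is small.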

### Now we rewrite this formula in machine learning form (y = class label, X = feature vector):

```
          P(X|y) P(y)
P(y|X) =  -----------
             P(X)

X = (x1, x2, x3, ..., xn)

                        P(y) P(x1|y) P(x2|y) ... P(xn|y)
P(y|x1, x2, ..., xn) = ----------------------------------
                               P(x1, x2, ..., xn)

P(y|x1, x2, ..., xn) ∝ P(y) ∏(i=1..n) P(xi|y)

ŷ = argmax over y of  P(y) ∏(i=1..n) P(xi|y)
```
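
The decision rule above can be sketched in plain Python by estimating P(y) and P(xi|y) from counts. This is a minimal illustration on a made-up toy dataset, not how scikit-learn implements it; the names `rows`, `score`, and `predict` are hypothetical, and no smoothing is applied, so an unseen feature value gives a probability of zero.

```python
from collections import Counter, defaultdict

# Toy training data (made up): each row is ((outlook, windy), play)
rows = [
    (("sunny", "no"), "yes"),
    (("sunny", "yes"), "no"),
    (("rainy", "yes"), "no"),
    (("rainy", "no"), "yes"),
    (("sunny", "no"), "yes"),
    (("rainy", "yes"), "no"),
]

# Number of rows per class, used for P(y)
class_counts = Counter(label for _, label in rows)

# feature_counts[(i, value, label)] = rows where feature i == value and class == label
feature_counts = defaultdict(int)
for features, label in rows:
    for i, value in enumerate(features):
        feature_counts[(i, value, label)] += 1


def score(features, label):
    """Unnormalised posterior: P(y) * product over i of P(xi|y)."""
    result = class_counts[label] / len(rows)  # P(y)
    for i, value in enumerate(features):
        # P(xi|y), estimated as a relative frequency within the class
        result *= feature_counts[(i, value, label)] / class_counts[label]
    return result


def predict(features):
    """Return the class with the largest unnormalised posterior (the argmax)."""
    return max(class_counts, key=lambda label: score(features, label))


print(predict(("sunny", "no")))
print(predict(("rainy", "yes")))
```

The denominator P(x1, ..., xn) is never computed because it is the same for every class and does not change which class wins.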

## Types of Naive Bayes Classifier?
**1. Multinomial Naive Bayes:-** Used when the features are discrete counts, such as how many times each word occurs in a document.  
**2. Bernoulli Naive Bayes:-** Similar to Multinomial, but each feature is boolean (for example, whether a word is present or absent).  
**3. Gaussian Naive Bayes:-** Used when the features are continuous numerical values. It assumes that each feature is normally distributed within each class (for a normal distribution the mean, median, and mode are approximately equal).
![](https://prutor.ai/wp-content/uploads/2-15-420x315.png)
![Screenshot_6.jpg](attachment:Screenshot_6.jpg)
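
As a rough sketch of how the three variants map onto feature types, the snippet below fits each scikit-learn estimator on a tiny made-up dataset of the matching kind. The arrays and labels are invented for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # made-up class labels

# Count features (e.g. how often each of three words occurs) -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Binary features (e.g. word present / absent) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)
# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.0, 5.2], [1.3, 4.9], [3.8, 1.1], [4.1, 0.7]])

models = {
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
    "GaussianNB": (GaussianNB(), X_continuous),
}

predictions = {}
for name, (model, X) in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(name, predictions[name])
```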



## Applications of Naive Bayes Classifier:
* Real Time Prediction
* Multi Class Prediction
* Text Classification
    * Spam Filtering
    * Sentiment Analysis
* Recommendation System

**Major Disadvantage:-** Naive Bayes assumes that all features are independent of one another, i.e. that there is no relationship between them. In real data this rarely holds, because some features are usually related in some way.
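
One way to see how far real data can be from independent is to measure the correlation between two features. The sketch below uses the breast cancer dataset from the project section and picks `mean radius` and `mean perimeter` as an assumed example pair, since perimeter is geometrically tied to radius.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
names = list(data.feature_names)

# Pick two features that are geometrically related
radius = data.data[:, names.index("mean radius")]
perimeter = data.data[:, names.index("mean perimeter")]

# Pearson correlation: values near 1 mean the features move together
corr = np.corrcoef(radius, perimeter)[0, 1]
print(f"corr(mean radius, mean perimeter) = {corr:.3f}")
```

A correlation close to 1 means the two features carry almost the same information, which the independence assumption ignores.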

## Project: Breast Cancer DataSet

### Step 1 - Import Libraries

In [1]:
# import libraries
import numpy as np
import pandas as pd

### Step 2 - Load The Dataset

In [2]:
# Load dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

In [3]:
data.data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [4]:
data.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

In [5]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [6]:
data.target_names

array(['malignant', 'benign'], dtype='<U9')

**Note:**
* **Malignant** = The tumor is cancerous (target value `0`)
* **Benign** = The tumor is not cancerous (target value `1`)
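
To check how the target codes map to class names and how many samples each class has, a short sketch like the following can help. It only uses `data.target` and `data.target_names` as loaded above; the variable `counts` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Count how many samples carry each target code (0 and 1)
counts = np.bincount(data.target)

for code, (name, count) in enumerate(zip(data.target_names, counts)):
    print(f"{code} = {name}: {count} samples")
```

Knowing the class balance is useful later, because a model that always predicts the majority class already reaches that class's share as accuracy.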

### Step 3 - Converting the data into a DataFrame

In [7]:
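# Create a DataFrame with the 30 feature columns plus a 'target' column.
# Pass a flat list of names; wrapping it in another list would create MultiIndex columns.
df = pd.DataFrame(np.c_[data.data, data.target], columns=list(data.feature_names) + ['target'])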
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0.0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0.0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0.0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0.0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0.0


In [8]:
df.tail()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
564,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,0.1726,0.05623,...,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,0.0
565,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,0.1752,0.05533,...,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,0.0
566,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,0.159,0.05648,...,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,0.0
567,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,0.2397,0.07016,...,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,0.0
568,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,0.1587,0.05884,...,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,1.0


In [9]:
df.shape

(569, 31)

### Step 4 - Splitting the dataset

In [10]:
x=df.iloc[:,0:-1]
y=df.iloc[:, -1]

In [11]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=2020)

print('Shape of x_train = ', x_train.shape)
print('Shape of y_train = ', y_train.shape)
print('Shape of x_test = ', x_test.shape)
print('Shape of y_test = ', y_test.shape)



Shape of x_train =  (455, 30)
Shape of y_train =  (455,)
Shape of x_test =  (114, 30)
Shape of y_test =  (114,)
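
As an optional variant, the split can be stratified so that both parts keep roughly the same malignant/benign ratio. This is a sketch of an alternative, not the split used for the results in this notebook; it loads the arrays directly with `return_X_y=True` to stay self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the class proportions roughly equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2020, stratify=y
)

# The mean of a 0/1 target is the fraction of class 1 (benign)
print(f"train benign fraction: {y_train.mean():.3f}")
print(f"test benign fraction:  {y_test.mean():.3f}")
```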


### Step 5 - Training and Testing The Naive Bayes Classifier Model
### 1. Gaussian Naive Bayes

In [12]:
from sklearn.naive_bayes import GaussianNB

In [13]:
classifier_g = GaussianNB()
classifier_g.fit(x_train,y_train)



GaussianNB()

In [15]:
classifier_g.score(x_test, y_test)



0.9736842105263158

### Conclusion:
The accuracy of the Gaussian Naive Bayes model on the test set is `97.37%`, which is high.
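
A single train/test split can be lucky or unlucky, so it may be worth checking the score with cross-validation. The sketch below uses 5 folds, which is an arbitrary choice made for illustration, and works on the raw arrays rather than the DataFrame.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: five different train/test partitions, one accuracy each
scores = cross_val_score(GaussianNB(), X, y, cv=5)

print("fold accuracies:", scores.round(3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

If the mean is noticeably lower than the single-split score, the split above was probably a favourable one.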

### 2. Multinomial Naive Bayes

In [17]:
from sklearn.naive_bayes import MultinomialNB

In [18]:
# Training the Multinomial Naive Bayes Classifier Model
classifier_m = MultinomialNB()
classifier_m.fit(x_train, y_train)



MultinomialNB()

In [19]:
# Test our Multinomial Naive Bayes Model
classifier_m.score(x_test,y_test)



0.8947368421052632

### Conclusion:
The accuracy of the Multinomial Naive Bayes model is `89.47%`. This is reasonable, but it is less accurate than Gaussian Naive Bayes on this dataset.

### 3. Bernoulli Naive Bayes Classifier Model

In [20]:
from sklearn.naive_bayes import BernoulliNB

In [21]:
# Training the Bernoulli Naive Bayes Classifier Model
classifier_b = BernoulliNB()
classifier_b.fit(x_train, y_train)



BernoulliNB()

In [22]:
# Testing the Bernoulli Naive Bayes Classifier Model
classifier_b.score(x_test,y_test)



0.5789473684210527

### Conclusion:
The accuracy of the Bernoulli Naive Bayes classifier is only `57.89%`, far below the Gaussian and Multinomial models, so it is not a good fit for this dataset.
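
A likely reason for the low score is that `BernoulliNB` binarizes its input at the default threshold `binarize=0.0`, and almost every measurement in this dataset is positive, so nearly all features collapse to 1. The sketch below checks that and then tries one illustrative alternative, thresholding each feature at its training-set median. The median rule is an assumption made for this example, not a recommended recipe.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2020
)

# With the default threshold (binarize=0.0) almost every entry maps to 1
frac_positive = (X_train > 0.0).mean()
print(f"fraction of training entries > 0: {frac_positive:.3f}")

# Illustrative alternative: threshold each feature at its training-set median
medians = np.median(X_train, axis=0)
X_train_bin = (X_train > medians).astype(int)
X_test_bin = (X_test > medians).astype(int)

# binarize=None tells BernoulliNB that the input is already binary
model = BernoulliNB(binarize=None)
model.fit(X_train_bin, y_train)
acc = model.score(X_test_bin, y_test)
print(f"accuracy with median thresholds: {acc:.3f}")
```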

### Step 7 - Selecting The Model For Production

* After training and testing all three Naive Bayes classifiers, we can conclude that Gaussian Naive Bayes is the best fit for this dataset. This is expected because the features are continuous numerical values. When the data is text, Multinomial or Bernoulli Naive Bayes is usually the better choice.

* Having selected the Gaussian Naive Bayes model, we now apply it to a new patient.

* We extract the 30 features for the new patient and pass them to `classifier_g`.

* `classifier_g` then predicts whether the tumor is malignant or benign.

### Step 8 - Predict Cancer For New Patient

In [23]:
patient_1 = [17.99,
            10.38,
            122.8,
            1001.0,
            0.1184,
            0.2776,
            0.3001,
            0.1471,
            0.2419,
            0.07871,
            1.095,
            0.9053,
            8.589,
            153.4,
            0.006399,
            0.049804,
            0.05373,
            0.01587,
            0.03003,
            0.006193,
            25.38,
            17.33,
            184.6,
            2019.0,
            0.1622,
            0.6656,
            0.7119,
            0.2654,
            0.4601,
            0.1189]

In [24]:
patient_1 = np.array([patient_1])
patient_1

array([[1.7990e+01, 1.0380e+01, 1.2280e+02, 1.0010e+03, 1.1840e-01,
        2.7760e-01, 3.0010e-01, 1.4710e-01, 2.4190e-01, 7.8710e-02,
        1.0950e+00, 9.0530e-01, 8.5890e+00, 1.5340e+02, 6.3990e-03,
        4.9804e-02, 5.3730e-02, 1.5870e-02, 3.0030e-02, 6.1930e-03,
        2.5380e+01, 1.7330e+01, 1.8460e+02, 2.0190e+03, 1.6220e-01,
        6.6560e-01, 7.1190e-01, 2.6540e-01, 4.6010e-01, 1.1890e-01]])

In [27]:
classifier_g.predict(patient_1)

array([0.])

In [28]:
data.target_names

array(['malignant', 'benign'], dtype='<U9')

In [31]:
pred = classifier_g.predict(patient_1)
# Target code 0 is 'malignant' and 1 is 'benign' (see data.target_names)
if pred[0] == 0:
    print('Patient has Cancer (malignant tumor).')
else:
    print('Patient has no Cancer (benign tumor).')

Patient has Cancer (malignant tumor).
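
`predict` returns only the class label. If a probability is wanted instead, `predict_proba` can be used, as in the sketch below. It retrains a Gaussian model on the raw arrays to stay self-contained and uses the first test-set row as a stand-in for a new patient, which is an assumption made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=2020
)
model = GaussianNB().fit(X_train, y_train)

# Stand-in for a new patient: the first test-set row, kept 2-D with shape (1, 30)
patient = X_test[:1]

# predict_proba returns one column per class, ordered as in model.classes_
proba = model.predict_proba(patient)[0]
for code, p in zip(model.classes_, proba):
    print(f"P({data.target_names[code]}) = {p:.4f}")
```

Naive Bayes probabilities are often pushed close to 0 or 1 by the independence assumption, so they are best read as a rough ranking rather than a calibrated risk.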


### Step 9 - Save your work

In [32]:
!pip install jovian --upgrade --quiet

In [33]:
import jovian

In [None]:
jovian.commit(project = 'Naive Bayes Classifer')
