# **Naive Bayes Classifier**
***

**What is Naive Bayes algorithm?**

Naive Bayes is a classification technique based on Bayes’ Theorem (*probability theory*), with the assumption that all the features used to predict the target value are independent of each other given the class. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature in determining the target value.

This assumption is very **naive** when we are dealing with real-world data because, most of the time, features do depend on each other in determining the target. This is why the algorithm gets its name, **Naive Bayes**.

> A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can sometimes perform competitively with far more sophisticated classification methods.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) *(read as the probability of **c** given **x**)* from P(c), P(x) and P(x|c). Look at the equation below:
>
> $$\mathbf{P} \left({c \mid x} \right) = \frac{\mathbf{P} \left ({x \mid c} \right) \mathbf{P} \left({c} \right)}{\mathbf{P} \left( {x} \right)}$$

In the *above* equation,

* P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
* P(c) is the prior probability of class **c**.
* P(x|c) is the likelihood, which is the probability of the predictor (the query **x**) given the class.
* P(x) is the prior probability of predictor **x**.
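
As a quick numeric check of the terms listed above, here is a minimal sketch of Bayes' theorem in Python. The function name `posterior` and the three probabilities are hypothetical values chosen only for illustration; they do not come from the tennis dataset.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x

# Hypothetical numbers, for illustration only.
p_c = 0.6          # P(c): prior probability of the class
p_x_given_c = 0.5  # P(x|c): likelihood of the predictor within the class
p_x = 0.4          # P(x): prior probability of the predictor

p_c_given_x = posterior(p_c, p_x_given_c, p_x)
print(round(p_c_given_x, 2))  # 0.75
```

Seeing **x** raises the probability of class **c** from 0.6 to 0.75 here, because **x** is more common inside the class (0.5) than overall (0.4).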

**Note:** The independence assumption rarely holds exactly, but the classifier often works well in practice.
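
To make the independence assumption concrete: if the predictor is written as a set of features x = (x_1, ..., x_n) (notation introduced here for illustration), the likelihood is treated as a product of per-feature likelihoods, so the posterior is proportional to:

> $$\mathbf{P} \left({c \mid x_1, \dots, x_n} \right) \propto \mathbf{P} \left({c} \right) \prod_{i=1}^{n} \mathbf{P} \left({x_i \mid c} \right)$$

The denominator P(x) is the same for every class, so it can be dropped when we only want to know which class is most probable.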



**Why should we use Naive Bayes?**

* As stated above, it is **_easy_** to build and is particularly useful for **_very large data sets_**.
* It is **extremely fast** for both training and prediction.
* It provides straightforward probabilistic predictions.
* It is often very easily interpretable.
* It has very few (if any) tunable parameters.
* It performs well with categorical input variables compared to numerical ones. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
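
The last point, about numerical variables, can be sketched as follows. This is a minimal illustration of using the normal density as the likelihood P(x|c) for a numeric feature; the function name `gaussian_likelihood` and the per-class means and variances are hypothetical, not taken from any dataset in this tutorial.

```python
import math

def gaussian_likelihood(x, mean, var):
    """Normal density N(x; mean, var), used as P(x|c) for a numeric feature."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class statistics for one numeric feature: class -> (mean, variance)
stats = {"yes": (73.0, 38.0), "no": (75.0, 62.0)}

x = 66.0  # hypothetical query value
for label, (mean, var) in stats.items():
    # A larger density means the value is more typical of that class
    print(label, gaussian_likelihood(x, mean, var))
```

If the feature is far from bell-shaped within a class, these densities can be misleading, which is why the normal assumption is called a strong one.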

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

### Read the dataset:

Now, we read the dataset into a variable named **data** using the pandas library, which we imported above as '*pd*' (data type: DataFrame).


**Note:** *The dataset used here describes tennis games and weather conditions. The target is whether a tennis game is played under the given conditions. The dataset is very small, containing just 14 rows and 5 columns, which suits a tutorial for beginners.*

In [None]:
data = pd.read_csv("../input/tennis.csv")

In [None]:
data.info()

In [None]:
data.columns

In [None]:
data.head(14)

In [None]:
# outlook_count = data.groupby(['outlook', 'play']).size()
# outlook_total = data.groupby(['outlook']).size()
# temp_count = data.groupby(['temp', 'play']).size()
# temp_total = data.groupby(['temp']).size()
# humidity_count = data.groupby(['humidity', 'play']).size()
# humidity_total = data.groupby(['humidity']).size()
# windy_count = data.groupby(['windy', 'play']).size()
# windy_total = data.groupby(['windy']).size()
# print(outlook_count)
# print(outlook_total)
# print(temp_count)
# print(temp_total)
# print(humidity_count)
# print(humidity_total)
# print(windy_count)
# print(windy_total)



In [None]:
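# Counts of each outlook value per class, read from outlook_count
# p_over_yes = outlook_count['overcast','yes']
# p_over_no = 0  # set by hand: no ('overcast', 'no') entry is looked up
# p_rainy_yes = outlook_count['rainy','yes']
# p_rainy_no = outlook_count['rainy','no']
# p_sunny_yes = outlook_count['sunny', 'yes']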
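
The commented-out counts above hint at how the probabilities can be worked out by hand. Below is a minimal sketch of that idea, assuming a small hypothetical table: the rows in `toy` are made up for illustration and are not the contents of `tennis.csv` (only the column names are borrowed), and `naive_bayes_scores` is a helper name introduced here.

```python
import pandas as pd

# Hypothetical toy rows, NOT the tutorial's tennis.csv.
toy = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"],
    "windy":   [False,   True,    False,      False,   True,    True],
    "play":    ["no",    "no",    "yes",      "yes",   "no",    "yes"],
})

def naive_bayes_scores(df, query, target="play"):
    """Return normalized P(class | query) under the independence assumption."""
    priors = df[target].value_counts(normalize=True)  # P(c)
    scores = {}
    for cls, prior in priors.items():
        subset = df[df[target] == cls]
        score = prior
        for col, value in query.items():
            # P(feature = value | class), estimated as a relative frequency
            score *= (subset[col] == value).mean()
        scores[cls] = score
    total = sum(scores.values())
    # Dividing by the total plays the role of P(x) in Bayes' theorem
    return {cls: s / total for cls, s in scores.items()} if total > 0 else scores

result = naive_bayes_scores(toy, {"outlook": "rainy", "windy": False})
print(result)
```

Note that a single zero count (a feature value never seen with a class) zeroes the whole product for that class. Laplace (add-one) smoothing is a common remedy, though it is not applied in this sketch.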


In [None]:
# One-hot encode the categorical predictors; keep the target as a 1-D Series,
# which is the shape scikit-learn expects for y
X_train = pd.get_dummies(data[['outlook', 'temp', 'humidity', 'windy']])
y_train = data['play']

# Inspect the encoded predictors (info() prints its summary itself)
X_train.info()
print(X_train.head())

In [None]:
# Import the Gaussian Naive Bayes model (numpy was already imported above)
from sklearn.naive_bayes import GaussianNB

In [None]:
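# Create a Gaussian Naive Bayes classifier
model = GaussianNB()

# Train the model using the training set
model.fit(X_train, y_train)

# Predict the output for one query row.
# The nine values must follow the column order of X_train;
# check X_train.columns to see which feature each position encodes.
predicted = model.predict([[False, 1, 0, 0, 0, 1, 0, 1, 0]])
print(predicted)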
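
Typing a query row by hand is easy to get wrong, so here is a hedged sketch of one way to encode a query and read class probabilities instead of only the predicted label. It assumes a small hypothetical table (`toy`, made up for illustration and not the contents of `tennis.csv`) and uses `reindex` to line the query up with the training columns.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy rows, NOT the tutorial's dataset.
toy = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"],
    "windy":   [False, True, False, False, True, True],
    "play":    ["no", "no", "yes", "yes", "no", "yes"],
})

# One-hot encode the predictors and cast to float for the Gaussian model
X = pd.get_dummies(toy[["outlook", "windy"]]).astype(float)
y = toy["play"]
clf = GaussianNB().fit(X, y)

# Encode the query the same way, then align it to the training columns
# so the column order matches and unseen dummy columns are filled with 0
query = pd.get_dummies(pd.DataFrame({"outlook": ["rainy"], "windy": [False]}))
query = query.reindex(columns=X.columns, fill_value=0).astype(float)

proba = clf.predict_proba(query)  # one row, one column per class
print(dict(zip(clf.classes_, proba[0])))
```

The columns of `predict_proba` follow the order of `clf.classes_`, so zipping the two gives a readable class-to-probability mapping.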

# In Progress