Naive Bayes (Simplified Explanation):

Imagine you’re a detective trying to solve a mystery. You look at clues and make a decision based on probabilities. For example:

If a suspect has muddy shoes, they are likely to have been outside.
If they also have wet clothes, it’s even more likely they were caught in the rain.
Naive Bayes works like this: it looks at "clues" (features) and predicts the most probable "outcome" (class) based on the data.

What is Naive Bayes?
Naive Bayes is a probabilistic algorithm used for classification tasks. It’s based on Bayes' Theorem, which calculates the probability of an event given prior knowledge of conditions related to the event.

Bayes' Theorem Formula:
\[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \]

Where:

\( P(A|B) \): Posterior probability of \( A \) (class) given \( B \) (features).
\( P(B|A) \): Likelihood of \( B \) (features) given \( A \) (class).
\( P(A) \): Prior probability of \( A \) (class).
\( P(B) \): Evidence, the overall probability of \( B \) (features) across all classes.
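To make the formula concrete, here is a small worked example using the detective story. All the numbers are made up for illustration, and the variable names are just placeholders: A is "the suspect was outside" and B is "the suspect has muddy shoes".

```python
# Made-up numbers for the detective example; none of them come from real data.
p_outside = 0.3               # P(A): prior probability the suspect was outside
p_muddy_given_outside = 0.8   # P(B|A): muddy shoes if they were outside
p_muddy_given_inside = 0.1    # P(B|not A): muddy shoes if they stayed inside

# P(B): overall probability of muddy shoes (law of total probability)
p_muddy = (p_muddy_given_outside * p_outside
           + p_muddy_given_inside * (1 - p_outside))

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_outside_given_muddy = p_muddy_given_outside * p_outside / p_muddy

print(round(p_outside_given_muddy, 3))  # 0.774
```

Seeing muddy shoes raises the probability that the suspect was outside from 0.3 to roughly 0.77.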

Why "Naive"?

It assumes that all features are independent of each other once the class is known (conditional independence). For example, it treats muddy shoes and wet clothes as separate, unrelated clues, even though they might be connected. This assumption is rarely true, but the algorithm often still works well in practice!
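In symbols, writing C for the class and x_1, ..., x_n for the features (this notation is introduced here for illustration), the naive assumption lets the likelihood be split into one factor per feature:

```latex
P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
\quad\Longrightarrow\quad
P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

Each factor can be estimated separately from the training data, which is what keeps the method simple and fast.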

How Naive Bayes Works (Step-by-Step):
Calculate Probabilities:

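For each class, calculate the likelihood of each of the data point's features given that class.
Multiply those likelihoods together with the class's prior probability, following Bayes' Theorem. The denominator P(B) is the same for every class, so it can be ignored when comparing classes.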

Choose the Class:

Assign the class with the highest probability to the data point.
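The two steps above can be sketched in plain Python. This is only a minimal illustration: the toy observations are invented, the function names `train` and `posteriors` are hypothetical, and add-one (Laplace) smoothing is an extra not mentioned above, used here so that a clue never seen with a class does not force its probability to zero.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (muddy_shoes, wet_clothes) -> was the suspect in the rain?
data = [
    (("yes", "yes"), "rain"),
    (("yes", "yes"), "rain"),
    (("yes", "no"), "rain"),
    (("no", "yes"), "rain"),
    (("no", "no"), "dry"),
    (("no", "no"), "dry"),
    (("yes", "no"), "dry"),
    (("no", "no"), "dry"),
]

def train(rows):
    """Count how often each class and each (class, feature value) pair occurs."""
    class_counts = Counter(label for _, label in rows)
    # feature_counts[label][feature_index][value] = number of matching rows
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in rows:
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def posteriors(features, class_counts, feature_counts):
    """Step 1: prior times per-feature likelihoods, then normalise."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(features):
            # P(feature_i = value | class), add-one smoothing over 2 possible values
            score *= (feature_counts[label][i][value] + 1) / (n + 2)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

class_counts, feature_counts = train(data)
probs = posteriors(("yes", "yes"), class_counts, feature_counts)

# Step 2: choose the class with the highest probability
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

For a suspect with both muddy shoes and wet clothes, this toy data gives "rain" a probability of about 0.89.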

Types of Naive Bayes:

Gaussian Naive Bayes: Assumes each feature is normally distributed within each class (used for continuous data).

Multinomial Naive Bayes: Used for discrete data, like word counts in text classification.

Bernoulli Naive Bayes: Used for binary data, like whether a word appears in a document or not.
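The three variants differ mainly in how they model the probability of a feature given a class. Here is a rough sketch of each likelihood in plain Python. The function names and the probabilities passed in are made up for illustration. Libraries such as scikit-learn additionally work with log-probabilities and apply smoothing, both of which are left out here.

```python
import math

def gaussian_likelihood(x, mean, var):
    """Gaussian NB: density of a continuous value x under N(mean, var)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def multinomial_likelihood(counts, word_probs):
    """Multinomial NB: product of P(word|class) ** count.

    The multinomial coefficient is skipped because it is identical
    for every class and does not affect which class wins.
    """
    result = 1.0
    for word, count in counts.items():
        result *= word_probs[word] ** count
    return result

def bernoulli_likelihood(present, word_probs):
    """Bernoulli NB: every vocabulary word counts, whether present or absent."""
    result = 1.0
    for word, p in word_probs.items():
        result *= p if word in present else (1 - p)
    return result

# Made-up parameters for a single class
print(gaussian_likelihood(0.0, mean=0.0, var=1.0))
print(multinomial_likelihood({"free": 2, "money": 1},
                             {"free": 0.5, "money": 0.25, "hello": 0.25}))
print(bernoulli_likelihood({"free"}, {"free": 0.8, "money": 0.5}))
```

Note the difference between the last two: the multinomial version only looks at words that occur (and how often), while the Bernoulli version also penalises a class for words that are absent.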

Real-World Example:
Imagine you're building an email spam filter:

Features: Words like "discount," "free," "money."
Classes: "Spam" or "Not Spam."
Naive Bayes calculates the probability of an email being spam based on the words it contains and assigns the class with the highest probability.
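As a rough sketch of such a filter, the snippet below trains a tiny multinomial-style classifier on four invented emails. The training texts, the helper names `log_score` and `classify`, and the use of add-one smoothing and log-probabilities are all assumptions made for this illustration. A real filter would need far more data.

```python
import math
from collections import Counter

# Hypothetical training emails, invented for illustration.
training = [
    ("free money discount now", "spam"),
    ("discount free offer", "spam"),
    ("meeting schedule for monday", "not spam"),
    ("lunch on monday", "not spam"),
]

# Count words per class and emails per class
word_counts = {"spam": Counter(), "not spam": Counter()}
email_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())
    email_counts[label] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}

def log_score(text, label):
    """log P(class) + sum of log P(word | class), with add-one smoothing."""
    score = math.log(email_counts[label] / sum(email_counts.values()))
    total_words = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocabulary:
            continue  # ignore words never seen during training
        score += math.log((word_counts[label][word] + 1)
                          / (total_words + len(vocabulary)))
    return score

def classify(text):
    """Return the class with the highest (log) probability."""
    return max(word_counts, key=lambda label: log_score(text, label))

print(classify("free discount"))   # spam
print(classify("monday meeting"))  # not spam
```

Adding logarithms instead of multiplying raw probabilities avoids numerical underflow when an email contains many words.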


In [62]:
import pandas as pd

# Load the Titanic passenger data and preview the first five rows
df = pd.read_csv("titanic.csv")
df.head()

   PassengerId                                               Name  Pclass     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked  Survived
0            1                            Braund, Mr. Owen Harris       3    male  22.0      1      0         A/5 21171     7.25    NaN         S         0
1            2  Cumings, Mrs. John Bradley (Florence Briggs Th...       1  female  38.0      1      0          PC 17599  71.2833    C85         C         1
2            3                             Heikkinen, Miss. Laina       3  female  26.0      0      0  STON/O2. 3101282    7.925    NaN         S         1
3            4       Futrelle, Mrs. Jacques Heath (Lily May Peel)       1  female  35.0      1      0            113803     53.1   C123         S         1
4            5                           Allen, Mr. William Henry       3    male  35.0      0      0            373450     8.05    NaN         S         0


In [63]:
# Keep only Pclass, Sex, Age, Fare and the Survived target
df = df.drop(['PassengerId', 'Name', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked'], axis=1)
df.head()

   Pclass     Sex   Age     Fare  Survived
0       3    male  22.0     7.25         0
1       1  female  38.0  71.2833         1
2       3  female  26.0    7.925         1
3       1  female  35.0     53.1         1
4       3    male  35.0     8.05         0


In [64]:
# Separate the features (inputs) from the label we want to predict (target)
inputs = df.drop('Survived', axis='columns')
target = df['Survived']
inputs

     Pclass     Sex   Age     Fare
0         3    male  22.0   7.2500
1         1  female  38.0  71.2833
2         3  female  26.0   7.9250
3         1  female  35.0  53.1000
4         3    male  35.0   8.0500
...     ...     ...   ...      ...
886       2    male  27.0  13.0000
887       1  female  19.0  30.0000
888       3  female   NaN  23.4500
889       1    male  26.0  30.0000


In [65]:
# One-hot encode Sex, since the model needs numeric inputs
dummies = pd.get_dummies(inputs.Sex).astype(int)
dummies.head()

   female  male
0       0     1
1       1     0
2       1     0
3       1     0
4       0     1


In [66]:
# Append the one-hot columns; 'male' is just 1 - 'female', so the two are redundant
inputs = pd.concat([inputs, dummies], axis='columns')

In [67]:
inputs.head()

   Pclass     Sex   Age     Fare  female  male
0       3    male  22.0     7.25       0     1
1       1  female  38.0  71.2833       1     0
2       3  female  26.0    7.925       1     0
3       1  female  35.0     53.1       1     0
4       3    male  35.0     8.05       0     1


In [68]:

inputs.drop('Sex',axis='columns',inplace=True)

In [69]:
inputs.head()

   Pclass   Age     Fare  female  male
0       3  22.0     7.25       0     1
1       1  38.0  71.2833       1     0
2       3  26.0    7.925       1     0
3       1  35.0     53.1       1     0
4       3  35.0     8.05       0     1


In [70]:
# Find the columns that contain NaN values, which must be filled before training
inputs.columns[inputs.isna().any()]

Index(['Age'], dtype='object')

In [71]:
inputs.Age[:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

In [72]:
# Fill missing ages with the mean age (computed over all rows, before the train/test split)
inputs.Age = inputs.Age.fillna(inputs.Age.mean())

In [73]:
inputs.Age[:10]

0    22.000000
1    38.000000
2    26.000000
3    35.000000
4    35.000000
5    29.699118
6    54.000000
7     2.000000
8    27.000000
9    14.000000
Name: Age, dtype: float64

In [74]:
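from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; no random_state is set, so the split differs on every run
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2)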

In [75]:
from sklearn.naive_bayes import GaussianNB

# GaussianNB models each feature as normally distributed within each class
model = GaussianNB()
model.fit(X_train, y_train)

In [76]:
# Mean accuracy on the held-out test set; the exact value changes with each random split
model.score(X_test, y_test)

0.8324022346368715
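For a classifier, the score is simply the fraction of test rows whose predicted class matches the true class. A minimal sketch of that calculation in plain Python follows. The `accuracy` helper and the short label lists are hypothetical and are not taken from the Titanic data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels: 1 = survived, 0 = did not survive
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(accuracy(y_true, y_pred))  # 0.8
```

The value 0.8324 above would correspond to 149 correct predictions out of 179 test rows, assuming the file has the usual 891 rows so that a 20% test split holds 179 of them.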