Naive Bayes Classifiers


Gaussian Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features. The classifier assumes that the features used to describe an observation are conditionally independent, given the class label.

Continuous features are assumed to be normally distributed: if a feature is continuous, its values are assumed to follow a normal distribution within each class.

The Titanic dataset is a classic benchmark for classification tasks, where the goal is typically to predict whether a passenger survived based on various features.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
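As a worked illustration of Bayes' theorem, the sketch below plugs in made-up numbers (they are not computed from the Titanic data). Here A stands for "survived" and B for "female". The function name `posterior` and all three probabilities are assumptions chosen only to make the arithmetic easy to follow.

```python
def posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

# Hypothetical numbers, for illustration only
p_survived = 0.4                 # P(A)
p_female_given_survived = 0.7    # P(B|A)
p_female = 0.35                  # P(B)

p_survived_given_female = posterior(p_female_given_survived, p_survived, p_female)
print(round(p_survived_given_female, 2))  # 0.8
```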


In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We assume that no pair of features is dependent, given the class.
For example, in a weather dataset, the temperature being ‘Hot’ has nothing to do with the humidity, and
the outlook being ‘Rainy’ has no effect on the wind. Hence, the features are assumed to be independent.
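
Written out, with class label $y$ and features $x_1, \dots, x_n$ (symbols introduced here for illustration), the independence assumption lets the likelihood in Bayes' theorem factor into one term per feature:

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\,P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)} \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The denominator is the same for every class, so the predicted class is simply the $y$ that maximizes the product on the right.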


Secondly, each feature is given the same weight (or importance). For example, knowing only the temperature and
humidity is not enough to predict the outcome accurately.
No attribute is treated as irrelevant, and each is assumed to contribute equally to the outcome.


Data Exploration

In [4]:
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare
0,0,3,male,22.0,7.25
1,1,1,female,38.0,71.2833
2,1,3,female,26.0,7.925
3,1,1,female,35.0,53.1
4,0,3,male,35.0,8.05


In [5]:
inputs = df.drop('Survived',axis = 'columns')
target = df.Survived


Converting the Sex column into numbers using dummy variables

In [6]:
# dtype=int yields 0/1 columns directly, so no True/False replacement step is needed
dummies = pd.get_dummies(inputs.Sex, dtype=int)
dummies.head(3)

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0


Adding the two columns to our DataFrame

In [7]:
inputs= pd.concat([inputs,dummies],axis='columns')
inputs.head(3)

Unnamed: 0,Pclass,Sex,Age,Fare,female,male
0,3,male,22.0,7.25,0,1
1,1,female,38.0,71.2833,1,0
2,3,female,26.0,7.925,1,0


I am dropping the male column, along with the original Sex column, because of the dummy variable trap: one column is enough to represent male vs. female.

In [8]:
inputs.drop(['Sex','male'],axis='columns',inplace=True)
inputs.head(3)

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,0
1,1,38.0,71.2833,1
2,3,26.0,7.925,1
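
A more compact way to reach the same single indicator column is sketched below on a small made-up frame (the `toy` DataFrame is hypothetical, not the Titanic data). It compares the `Sex` column directly instead of building both dummy columns and then dropping one.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `inputs`
toy = pd.DataFrame({"Sex": ["male", "female", "female"]})

# One 0/1 column is enough: 1 means female, 0 means male
toy["female"] = (toy["Sex"] == "female").astype(int)
toy = toy.drop(columns="Sex")

print(toy["female"].tolist())  # [0, 1, 1]
```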


Checking for columns with missing values

In [9]:
inputs.columns[inputs.isna().any()]

Index(['Age'], dtype='object')

In [10]:
inputs.Age[:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

Filling the missing values with the median age

In [11]:
inputs.Age = inputs.Age.fillna(inputs.Age.median())
inputs.head()

Unnamed: 0,Pclass,Age,Fare,female
0,3,22.0,7.25,0
1,1,38.0,71.2833,1
2,3,26.0,7.925,1
3,1,35.0,53.1,1
4,3,35.0,8.05,0
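
To see what the median fill does, here is a sketch on a short made-up Series (the values are illustrative, not a slice of the dataset). `median()` skips NaN by default, so the missing entry receives the median of the observed ages.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 35.0])   # hypothetical ages

median_age = ages.median()          # NaN is ignored: median of 22, 35, 38
filled = ages.fillna(median_age)

print(median_age)        # 35.0
print(filled.tolist())   # [22.0, 38.0, 35.0, 35.0]
```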


Training our model

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs,target,test_size=0.3)
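
Because the split above is random, the scores change from run to run. A sketch of a reproducible variant follows. The synthetic `X` and `y` are stand-ins for `inputs` and `target`, and the seed value 42 is an arbitrary choice. Passing `stratify` keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # synthetic stand-in for `inputs`
y = np.array([0] * 60 + [1] * 40)        # synthetic stand-in for `target`

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,   # fixed seed so the split is repeatable
    stratify=y,        # keep the class ratio similar in both parts
)

print(X_train.shape, X_test.shape)
```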

scikit-learn provides several Naive Bayes variants; here we use Gaussian Naive Bayes (GaussianNB).
The reason is that features such as Age and Fare are continuous, and GaussianNB models each continuous feature as normally distributed within each class.
The Gaussian distribution is also known as the bell curve.

A continuous variable is said to be normally distributed if its values follow a Gaussian distribution (also known as a normal distribution). In a normal distribution, the data is symmetrically distributed around the mean, with most values clustered around the mean and fewer values further away.
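
In symbols (the notation $\mu_{i,y}$ and $\sigma_{i,y}^{2}$ is introduced here for illustration), Gaussian Naive Bayes models the value $x_i$ of a continuous feature within class $y$ as:

$$
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{i,y}^{2}}} \exp\!\left(-\frac{(x_i - \mu_{i,y})^{2}}{2\sigma_{i,y}^{2}}\right)
$$

where $\mu_{i,y}$ and $\sigma_{i,y}^{2}$ are the mean and variance of feature $i$, estimated from the training rows that belong to class $y$.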

In [13]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

In [14]:
model.fit(X_train,y_train)

In [15]:
model.score(X_test,y_test)

0.746268656716418
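
For a classifier, `score` returns the mean accuracy, that is, the fraction of test rows predicted correctly. The sketch below checks this on synthetic data (the generated `X` and `y` are assumptions standing in for the Titanic features).

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic two-class data: class 1 is shifted by +2 in every feature
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = GaussianNB().fit(X_train, y_train)

# Fraction of correct predictions, computed by hand
manual = (model.predict(X_test) == y_test).mean()
print(manual, model.score(X_test, y_test), accuracy_score(y_test, model.predict(X_test)))
```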

In [16]:
X_test[0:10]

Unnamed: 0,Pclass,Age,Fare,female
30,1,40.0,27.7208,0
273,1,37.0,29.7,0
170,1,61.0,33.5,0
547,2,28.0,13.8625,0
102,1,21.0,77.2875,0
351,1,28.0,35.0,0
306,1,28.0,110.8833,1
14,3,14.0,7.8542,1
765,1,51.0,77.9583,1
605,3,36.0,15.55,0


In [17]:
y_test[0:10]

30     0
273    0
170    0
547    1
102    0
351    0
306    1
14     0
765    1
605    0
Name: Survived, dtype: int64

In [18]:
model.predict(X_test[0:10])

array([0, 0, 1, 0, 1, 0, 1, 1, 1, 0], dtype=int64)

In [19]:
model.predict_proba(X_test[:10])

array([[0.70923117, 0.29076883],
       [0.71473215, 0.28526785],
       [0.46792237, 0.53207763],
       [0.93031626, 0.06968374],
       [0.37881366, 0.62118634],
       [0.70249541, 0.29750459],
       [0.00293975, 0.99706025],
       [0.36660287, 0.63339713],
       [0.01164498, 0.98835502],
       [0.97006302, 0.02993698]])

Calculate the score using cross-validation

In [20]:
from sklearn.model_selection import cross_val_score
cross_val_score(GaussianNB(),X_train, y_train, cv=5)


array([0.688     , 0.752     , 0.848     , 0.67741935, 0.79032258])
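
The five fold scores are usually summarized by their mean and standard deviation. The sketch below does this for the array printed above, copied in by hand. On a fresh run the numbers will differ, because the train/test split is random.

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above
scores = np.array([0.688, 0.752, 0.848, 0.67741935, 0.79032258])

print(f"mean accuracy: {scores.mean():.3f}")
print(f"std deviation: {scores.std():.3f}")
```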

Advantages of Naive Bayes Classifier
Easy to implement and computationally efficient.
Effective in cases with a large number of features.
Performs well even with limited training data.
It performs well in the presence of categorical features.
For numerical features, the data is assumed to come from normal distributions.

Disadvantages of Naive Bayes Classifier
Assumes that features are independent, which may not always hold in real-world data.
Can be influenced by irrelevant attributes.
May assign zero probability to unseen events, leading to poor generalization.
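
The zero-probability problem arises with count-based (categorical) likelihoods: a feature value never seen together with a class gets an estimated probability of 0, which zeroes the whole product. A common remedy is Laplace (additive) smoothing, sketched below with made-up counts. The function name `smoothed_likelihood` and the numbers are hypothetical.

```python
def smoothed_likelihood(count: int, class_total: int, n_values: int, alpha: float = 1.0) -> float:
    """Estimate P(value | class) with additive smoothing.

    count       -- times the value was seen with this class
    class_total -- number of training rows in this class
    n_values    -- number of distinct values the feature can take
    alpha       -- smoothing strength (alpha=0 gives the raw frequency)
    """
    return (count + alpha) / (class_total + alpha * n_values)

# Hypothetical case: a value never observed among 10 rows of a class,
# for a feature with 3 possible values
unsmoothed = smoothed_likelihood(0, 10, 3, alpha=0.0)
smoothed = smoothed_likelihood(0, 10, 3, alpha=1.0)

print(unsmoothed)  # 0.0
print(smoothed)    # 1/13, roughly 0.077
```

In scikit-learn, the count-based variants such as `MultinomialNB` and `CategoricalNB` expose this idea through their `alpha` parameter.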

Applications of Naive Bayes Classifier
Spam Email Filtering: Classifies emails as spam or non-spam based on features.
Text Classification: Used in sentiment analysis, document categorization, and topic classification.
Medical Diagnosis: Helps in predicting the likelihood of a disease based on symptoms.
Credit Scoring: Evaluates creditworthiness of individuals for loan approval.
Weather Prediction: Classifies weather conditions based on various factors.
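
As a sketch of the spam-filtering use case, the snippet below trains a multinomial Naive Bayes model on word counts. The four training messages and their labels are invented for illustration and are far too few for a real filter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy messages, for illustration only
texts = [
    "win free prize now",
    "free money offer",
    "meeting at noon tomorrow",
    "project report attached",
]
labels = ["spam", "spam", "ham", "ham"]

# Word counts feed a multinomial Naive Bayes model (alpha=1.0 is add-one smoothing)
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
spam_filter.fit(texts, labels)

print(spam_filter.predict(["free prize"]))
```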