# Naive Bayes Classifier

* Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem. It is particularly useful for classification tasks involving high-dimensional data. Despite its simplicity, it often performs surprisingly well in real-world applications such as spam detection, sentiment analysis, and medical diagnosis.

# How does it work?

* Think of an email spam filter. When you receive an email, Naive Bayes looks at the words in it and checks:
  * How often those words appear in spam emails
  * How often they appear in non-spam emails
* Then it calculates the probability that the email is spam and the probability that it is not. The class (spam or not spam) with the higher probability is chosen.
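
The counting step described above can be sketched in a few lines. The toy emails and the `word_counts` helper below are made up for illustration and are not part of the notebook's data.

```python
from collections import Counter

# Hypothetical toy emails, invented for this sketch
spam_emails = ["win free prize now", "free tickets now"]
ham_emails = ["lunch tomorrow", "project meeting now"]

def word_counts(emails):
    """Count how often each word appears across a list of emails."""
    counts = Counter()
    for email in emails:
        counts.update(email.split())
    return counts

spam_counts = word_counts(spam_emails)
ham_counts = word_counts(ham_emails)

# Compare how often a few words appear in each class
for word in ["free", "now", "lunch"]:
    print(word, "| spam:", spam_counts[word], "| ham:", ham_counts[word])
```

A word such as "free" that shows up often in spam and rarely in non-spam pushes the prediction toward spam.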

# Bayes' Theorem (Formula)

* The classifier uses this rule:

                 P(Class | Data) = (P(Data | Class) * P(Class)) / P(Data)

* Where:
  * P(Class | Data) = Probability of a class (e.g., "Spam") given some data (the words in an email).
  * P(Data | Class) = Probability of seeing that data in the given class.
  * P(Class) = Probability of that class occurring in general.
  * P(Data) = Probability of that data occurring across all classes.
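
To make the formula concrete, here is a small worked calculation for a single word. All the probabilities are made-up numbers chosen for illustration, not values estimated from any dataset.

```python
# Made-up numbers for illustration only
p_spam = 0.4               # P(Class): prior probability that an email is spam
p_ham = 1 - p_spam         # prior probability that an email is not spam
p_word_given_spam = 0.30   # P(Data | Class): the word "free" appears in 30% of spam
p_word_given_ham = 0.05    # the word "free" appears in 5% of non-spam

# P(Data): total probability of seeing the word, summed over both classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# Bayes' theorem: P(Class | Data)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))
```

With these assumed numbers, seeing the word raises the probability of spam from the prior of 0.4 to about 0.8.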

# Why is it called Naive?

* It is called "Naive" because it assumes that all features (like the words in an email) are independent of each other given the class. In reality they often are not, but this assumption makes the calculations much easier.
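
The independence assumption means the likelihood of a whole email is simply the product of the per-word likelihoods. The sketch below shows this with hypothetical probabilities (the `likelihoods` and `priors` tables are invented) and sums logarithms instead of multiplying, a common way to avoid very small numbers.

```python
import math

# Hypothetical per-word likelihoods P(word | class), invented for this sketch
likelihoods = {
    "spam": {"win": 0.20, "free": 0.30, "now": 0.25},
    "ham": {"win": 0.02, "free": 0.05, "now": 0.10},
}
priors = {"spam": 0.4, "ham": 0.6}

def log_score(words, label):
    """log P(class) + sum of log P(word | class), using the independence assumption."""
    score = math.log(priors[label])
    for word in words:
        score += math.log(likelihoods[label][word])
    return score

email = ["win", "free", "now"]
scores = {label: log_score(email, label) for label in priors}

# The class with the highest score is the prediction
predicted = max(scores, key=scores.get)
print(predicted)
```

P(Data) is left out here because it is the same for every class and does not change which class scores highest.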

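# Types of Naive Bayes Classifiers

* There are three main types of Naive Bayes classifiers:
  * Gaussian Naive Bayes (GaussianNB)
  * Bernoulli Naive Bayes (BernoulliNB)
  * Multinomial Naive Bayes (MultinomialNB)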

In [2]:
# Import libraries
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.metrics import accuracy_score

In [3]:
# Create a Small Dataset
data = {"Study_hours":[2,3,4,5,6,7,8,9,10,11],
        "Sleep_hours":[8,7,7,6,6,5,5,4,4,3],
        "Pass":[0,0,0,0,1,1,1,1,1,1]}

# convert to DataFrame
df = pd.DataFrame(data)
df.tail()

   Study_hours  Sleep_hours  Pass
5            7            5     1
6            8            5     1
7            9            4     1
8           10            4     1
9           11            3     1


In [4]:
# Split data into features and target
x = df[["Study_hours","Sleep_hours"]]   # Features
y = df["Pass"]   # Target variable (Pass = 1, Fail = 0)

# Split into training (80%) and testing (20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
x

   Study_hours  Sleep_hours
0            2            8
1            3            7
2            4            7
3            5            6
4            6            6
5            7            5
6            8            5
7            9            4
8           10            4
9           11            3


# 1. Gaussian Naive Bayes (GaussianNB)

* GaussianNB is used for classification tasks with continuous features. It assumes that the feature values within each class follow a Gaussian (normal) distribution.
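
To illustrate the Gaussian assumption, the sketch below computes the likelihood of one feature value under a normal distribution fitted to each class. It uses the `Study_hours` values from the dataset above, grouped by `Pass`. The `gaussian_pdf` helper and the 9-hour example are assumptions for illustration; the sketch uses all ten rows and omits the small variance smoothing that scikit-learn applies, so it will not match `GaussianNB` numerically.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

# Study hours from the dataset above, grouped by the Pass label
study_fail = np.array([2, 3, 4, 5])           # Pass = 0
study_pass = np.array([6, 7, 8, 9, 10, 11])   # Pass = 1

x_new = 9  # hypothetical student who studied 9 hours

# Likelihood of 9 study hours under each class's fitted normal distribution
like_fail = gaussian_pdf(x_new, study_fail.mean(), study_fail.var())
like_pass = gaussian_pdf(x_new, study_pass.mean(), study_pass.var())
print(like_pass > like_fail)
```

The full classifier repeats this for every feature, multiplies the likelihoods together with the class prior, and picks the class with the larger result.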

* Train Gaussian Naive Bayes model

In [5]:
# Initialize and train model
gnb = GaussianNB()
gnb.fit(x_train,y_train)

* Make prediction

In [6]:
# Predict on test data
y_pred = gnb.predict(x_test)

# Display Prediction
print("Prediction :", y_pred)

Prediction : [1 0]


* Evaluate Model Performance

In [7]:
accuracy = accuracy_score(y_test,y_pred)
print(f"Model Accuracy: {accuracy}")

Model Accuracy: 1.0


# 2. Bernoulli Naive Bayes (BernoulliNB)

* BernoulliNB is used for binary (0/1) features, such as whether or not a student completed an assignment.
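
The idea behind the Bernoulli model can be sketched by hand: estimate, for each class, the probability that each binary feature equals 1, then use p where a feature is present and (1 - p) where it is absent. The tiny feature matrix and the helper names below are hypothetical, and the add-one smoothing is an assumption for this sketch rather than a statement about library internals.

```python
import numpy as np

# Hypothetical binary features: columns = [homework done, participated in class]
X = np.array([[1, 1], [1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 1, 0, 0])

def feature_probs(X, y, label, alpha=1.0):
    """Smoothed estimate of P(feature = 1 | class) for each column."""
    rows = X[y == label]
    return (rows.sum(axis=0) + alpha) / (len(rows) + 2 * alpha)

def likelihood(x, probs):
    """P(x | class): p where the feature is 1, (1 - p) where it is 0."""
    return np.prod(np.where(x == 1, probs, 1 - probs))

x_new = np.array([1, 0])  # did the homework, did not participate
p1 = feature_probs(X, y, 1)
p0 = feature_probs(X, y, 0)
print(likelihood(x_new, p1), likelihood(x_new, p0))
```

Unlike a count-based model, an absent feature (a 0) contributes evidence here through the (1 - p) term.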

In [8]:
data = {"complete_homework":[1,0,1,1,0,1,0,1,0,1],
        "class_participation":[1,0,0,1,0,1,0,1,1,1],
        "Pass":[1,0,1,1,0,1,0,1,0,1]}  # Pass = 1, Fail = 0

df = pd.DataFrame(data)

In [9]:
x = df[["complete_homework","class_participation"]]
y = df["Pass"]

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [11]:
# Train Bernoulli Naive Bayes Model
bnb = BernoulliNB()
bnb.fit(x_train,y_train)

In [12]:
# Make Prediction
y_pred = bnb.predict(x_test)

# Evaluate Accuracy
accuracy = accuracy_score(y_test,y_pred)
print(f"BernoulliNB Accuracy: {accuracy}")

BernoulliNB Accuracy: 0.5


# 3. Multinomial Naive Bayes (MultinomialNB)

* Used for count-based (discrete) data, such as word frequencies in text classification.
* MultinomialNB is a common choice for text classification.
* Laplace smoothing prevents zero probabilities for words that never appear in a class during training.
* It uses word frequencies, not just presence/absence.
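
Laplace smoothing can be shown with a short calculation. The word counts, vocabulary, and `smoothed_prob` helper below are hypothetical. The point is that a word never seen in a class gets probability 0 without smoothing, which would zero out the whole product.

```python
from collections import Counter

# Hypothetical word counts observed in spam messages
spam_counts = Counter({"free": 3, "win": 2, "now": 1})
vocabulary = ["free", "win", "now", "lunch"]  # "lunch" never appears in spam

def smoothed_prob(word, counts, vocab, alpha=1.0):
    """P(word | class) with add-alpha (Laplace) smoothing."""
    total = sum(counts[w] for w in vocab)
    return (counts[word] + alpha) / (total + alpha * len(vocab))

# Without smoothing (alpha=0), the unseen word gets probability zero
print(smoothed_prob("lunch", spam_counts, vocabulary, alpha=0.0))

# With add-one smoothing, it gets a small non-zero probability
print(smoothed_prob("lunch", spam_counts, vocabulary))
```

In scikit-learn, the `alpha` parameter of `MultinomialNB` controls this smoothing and defaults to 1.0.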

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
data = {"Message":["Win a free iphone now","Congratulations ! You won a lottery",
                   "Hey, let's meet for lunch", "Free tickets available, claim now",
                   "Can we discuss the project tomorrow?","Urgent! you have been selected for a prize"],
        "Label":["spam","spam","ham","spam","ham","spam"]}

df = pd.DataFrame(data)

# Convert labels to binary (spam=1, ham=0)
df["Label"] = df["Label"].map({"spam" :1,"ham":0})
print(df)

                                      Message  Label
0                       Win a free iphone now      1
1         Congratulations ! You won a lottery      1
2                   Hey, let's meet for lunch      0
3           Free tickets available, claim now      1
4        Can we discuss the project tomorrow?      0
5  Urgent! you have been selected for a prize      1


In [30]:
# Convert text to numeric features
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df["Message"])
y = df["Label"]
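
To see what `CountVectorizer` produces, here is a minimal sketch on two made-up messages. The names `docs`, `demo_vectorizer`, and `counts` are hypothetical and are kept separate from the notebook's own `vectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free free prize", "meet for lunch"]  # made-up messages
demo_vectorizer = CountVectorizer()
counts = demo_vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index in the count matrix
print(sorted(demo_vectorizer.vocabulary_))

# Each row is a message, each column is the count of one vocabulary word
print(counts.toarray())
```

Each message becomes a row of word counts, which is the kind of input MultinomialNB expects. With the default settings the text is lowercased and single-character tokens such as "a" are dropped.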

In [20]:
# Note: no random_state is set, so the split (and the accuracy below) can change between runs
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Train the MultinomialNB model
model = MultinomialNB()
model.fit(x_train, y_train)

In [23]:
# Predict on test data
y_pred = model.predict(x_test)

# calculate accuracy
accuracy = accuracy_score(y_test,y_pred)
accuracy

0.5

In [34]:
# Test the model with a new message
new_message = ["Win a brand-new car now"]
new_message_transformed = vectorizer.transform(new_message)
prediction = model.predict(new_message_transformed)
print(f"predicted category: {'spam' if prediction[0] == 1 else 'ham'}")

predicted category: spam
