# What is Naïve Bayes?

Naïve Bayes is a probabilistic machine learning algorithm based on Bayes' Theorem. It is fast, simple, and effective for classification tasks, especially in text classification, spam detection, sentiment analysis, and medical diagnosis.

### Why use Naïve Bayes?

- Works well even with small datasets and limited training data.
- Handles high-dimensional data efficiently (e.g., text classification).
- Fast and scalable (low computation cost for both training and prediction).
- Caveat: it assumes features are conditionally independent given the class. This is often not true in reality, but the classifier still works surprisingly well.

### How does Naïve Bayes work?

**Example: Email Spam Classification**
Let’s say we have an email, and we want to predict whether it is Spam or Not Spam.

- **Prior Probability:**
  - P(Spam) = probability of an email being spam, before looking at its content.
  - P(Not Spam) = probability of an email being not spam, before looking at its content.
- **Likelihood:**
  - P(word | Spam) = probability of seeing a word given that the email is spam.
  - P(word | Not Spam) = probability of seeing a word given that the email is not spam.
- **Posterior Probability:**
  - Compute P(Spam | words) and P(Not Spam | words) using Bayes' Theorem: multiply each prior by the likelihoods of the words in the email, then normalize.
  - Choose the label with the higher probability.
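
The three steps above can be sketched in a few lines of plain Python. The priors and word likelihoods below are made-up numbers chosen only for illustration (they are not estimated from any dataset), and `posterior` is a hypothetical helper name.

```python
import math

# Hypothetical, hand-picked probabilities for illustration only
priors = {"spam": 0.4, "not_spam": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "not_spam": {"free": 0.05, "meeting": 0.20},
}

def posterior(words, priors, likelihoods):
    """Return P(label | words) under the naive independence assumption."""
    # Work in log space so that products of many small probabilities do not underflow
    log_scores = {
        label: math.log(priors[label])
        + sum(math.log(likelihoods[label][w]) for w in words)
        for label in priors
    }
    # Normalize the joint scores so that they sum to 1
    max_log = max(log_scores.values())
    unnormalized = {label: math.exp(s - max_log) for label, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {label: v / total for label, v in unnormalized.items()}

probs = posterior(["free"], priors, likelihoods)
print(probs)
print("Predicted label:", max(probs, key=probs.get))
```

With these numbers, an email containing only the word "free" gets P(Spam | "free") = 0.12 / (0.12 + 0.03) = 0.8, so it is labeled spam.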

## Naïve Bayes for Text Classification

One of the best applications of Naïve Bayes is text classification, e.g., spam detection or sentiment analysis.


### Types of Naïve Bayes Classifiers

- Gaussian Naïve Bayes (for continuous data)
- Multinomial Naïve Bayes (for text classification)
- Bernoulli Naïve Bayes (for binary features, e.g., word presence/absence)
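
As a rough illustration of how the three variants map to feature types, the sketch below fits each scikit-learn class on a tiny made-up dataset. The arrays are invented for this example and far too small to say anything about real accuracy.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

toy_y = np.array([0, 0, 1, 1])  # Two samples per class

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])
gaussian_pred = GaussianNB().fit(X_continuous, toy_y).predict([[1.1, 2.0]])

# Word counts (non-negative integers) -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
multinomial_pred = MultinomialNB().fit(X_counts, toy_y).predict([[2, 0, 1]])

# Binary presence/absence of each word -> BernoulliNB
X_binary = (X_counts > 0).astype(int)
bernoulli_pred = BernoulliNB().fit(X_binary, toy_y).predict([[1, 0, 1]])

print("GaussianNB:", gaussian_pred)
print("MultinomialNB:", multinomial_pred)
print("BernoulliNB:", bernoulli_pred)
```

All three share the same `fit`/`predict` interface, so switching variants is mostly a matter of matching the class to how the features are encoded.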

### Real-World Example: Sentiment Analysis

We will classify IMDB movie reviews as Positive or Negative using Naïve Bayes.

### Implementing Naïve Bayes in Python

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords
import string

In [None]:
pip install nltk

**Load and Explore the Dataset**
- We'll use the IMDB dataset (Movie reviews with sentiments).

In [None]:
%pip install tensorflow tensorflow-datasets


In [None]:
import tensorflow_datasets as tfds

# Load dataset
imdb_data, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

# Convert the training split to Python lists of review texts (X) and labels (y)
X, y = [], []
for text, label in tfds.as_numpy(imdb_data["train"]):
    X.append(text.decode("utf-8"))
    y.append(int(label))

print(f"Loaded {len(X)} training reviews.")
print("Example Review:", X[0])


In [None]:
pip install tensorflow_datasets

### Data Preprocessing

We will clean the text by:

- Converting text to lowercase.
- Removing punctuation.
- Tokenizing on whitespace.
- Removing English stopwords.

In [None]:
nltk.download('stopwords')

# Build the stopword set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words('english'))
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(PUNCT_TABLE)  # Remove punctuation
    words = text.split()  # Tokenize on whitespace
    words = [word for word in words if word not in STOP_WORDS]  # Remove stopwords
    return " ".join(words)

# Apply preprocessing
X_clean = [preprocess_text(doc) for doc in X]


### Convert Text into Numerical Features

We will use `TfidfVectorizer` to convert the cleaned text into a sparse matrix of TF-IDF weights, where each column corresponds to one term in the vocabulary.

In [None]:
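# Split the cleaned text first so the vectorizer never sees the test reviews
X_train_text, X_test_text, y_train, y_test = train_test_split(X_clean, y, test_size=0.2, random_state=42)

# Keep only the 5,000 most frequent terms; fit on the training portion only
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)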


### Train Naïve Bayes Model

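We use `MultinomialNB` because it models how often each term occurs in a document, which suits bag-of-words features. It is designed for integer counts, but fractional values such as TF-IDF weights also work in practice.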

In [None]:
# Train model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict
y_pred = nb_model.predict(X_test)
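
To classify new raw text, the vectorizer and the classifier have to be applied in the same order as during training. One way to keep them together is scikit-learn's `make_pipeline`. The sketch below uses a four-review toy dataset invented for this example, so the `toy_` names and their contents are assumptions, not part of the IMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 1 = positive, 0 = negative
toy_reviews = [
    "a wonderful and moving film",
    "great acting and a great story",
    "boring plot and terrible acting",
    "a dull and awful film",
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then fits the classifier in one call
toy_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_reviews, toy_labels)

# New raw text goes straight into predict(); no manual transform step is needed
new_reviews = ["wonderful story", "terrible and dull"]
toy_predictions = toy_pipeline.predict(new_reviews)
toy_probabilities = toy_pipeline.predict_proba(new_reviews)

for review, label, proba in zip(new_reviews, toy_predictions, toy_probabilities):
    print(f"{review!r} -> label {label}, class probabilities {proba.round(2)}")
```

The same pattern should carry over to the full dataset by fitting the pipeline on the cleaned training reviews instead of the toy list.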


### Model Evaluation

In [None]:
# Accuracy Score
print("Accuracy:", accuracy_score(y_test, y_pred))

# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

# Classification Report
print(classification_report(y_test, y_pred))
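
The numbers in the classification report can be derived directly from the confusion matrix. The sketch below does this by hand on a small set of made-up labels (not the IMDB predictions) and compares the result with scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = positive, 0 = negative
toy_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
toy_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# For binary labels 0 and 1, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = tp / (tp + fp)  # Of the reviews predicted positive, how many really are
recall = tp / (tp + fn)     # Of the truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("Matches sklearn:",
      np.isclose(precision, precision_score(toy_true, toy_pred)),
      np.isclose(recall, recall_score(toy_true, toy_pred)))
```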

### Use Cases of Naïve Bayes

- Spam Detection – Classifying emails as spam or not spam.
- Sentiment Analysis – Analyzing customer reviews, tweets, and comments.
- Medical Diagnosis – Identifying diseases based on symptoms.
- News Categorization – Automatically tagging news articles.
- Credit Scoring – Predicting loan defaults.

### Summary

- Why? Simple, fast, and effective, especially for text data.
- How? Applies Bayes' Theorem under the assumption that features are independent given the class.
- Use cases: Spam detection, sentiment analysis, and medical diagnosis.
- Python implementation: Classified IMDB movie reviews as positive or negative using TF-IDF features and `MultinomialNB`.
