# Naive Bayes
Naive Bayes is based on Bayes' Theorem with the "naive" assumption that all features are conditionally independent given the class label.

Bayes' Theorem:
``` P(y|X) = [P(X|y) * P(y)] / P(X) ```

Where:

P(y|X): Posterior probability (what we want)

P(X|y): Likelihood

P(y): Prior probability

P(X): Evidence (normalization constant)
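
The arithmetic behind the theorem can be sketched with a small worked example. All numbers below are hypothetical, chosen only to make the calculation easy to follow: y is "the email is spam" and X is "the email contains the word free".

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
p_spam = 0.2               # P(y): prior probability that an email is spam
p_word_given_spam = 0.6    # P(X|y): probability the word appears in a spam email
p_word_given_ham = 0.05    # P(X|not y): probability the word appears in a non-spam email

# P(X): evidence, summed over both classes (law of total probability)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(y|X): posterior probability of spam given that the word was seen
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(round(p_word, 3))             # 0.16
print(round(p_spam_given_word, 3))  # 0.75
```

Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75.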


### The "Naive" Assumption

The classifier assumes features are conditionally independent of each other given the class:


``` P(x₁, x₂, ..., xₙ|y) = P(x₁|y) * P(x₂|y) * ... * P(xₙ|y) ```
This simplification makes computation feasible but is rarely true in practice (hence "naive").


### Prediction Formula

For classification, we predict the class with the highest posterior probability. The evidence P(X) is the same for every class, so it can be dropped from the maximization:


``` ŷ = argmax_y P(y) ∏ P(xᵢ|y) ```
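
A minimal sketch of this rule in plain Python, assuming a two-class problem with two binary features. The priors and likelihood tables are hypothetical values, not estimates from real data, and `predict` is an illustrative helper rather than a library function.

```python
# Hypothetical model parameters for a two-class problem with two binary features.
priors = {"spam": 0.4, "ham": 0.6}   # P(y)
likelihoods = {                      # P(x_i = 1 | y) for each feature i
    "spam": [0.8, 0.3],
    "ham": [0.1, 0.5],
}

def predict(x):
    """Return (best_class, posteriors) for a binary feature vector x."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for x_i, p in zip(x, likelihoods[label]):
            # P(x_i|y) is p when the feature is present, 1 - p when it is absent
            score *= p if x_i == 1 else 1 - p
        scores[label] = score
    total = sum(scores.values())     # plays the role of P(X)
    posteriors = {label: s / total for label, s in scores.items()}
    return max(scores, key=scores.get), posteriors

label, posteriors = predict([1, 0])
print(label, posteriors)
```

Dividing by `total` is only needed to report posteriors; the argmax is the same with or without it.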

## Types of Naive Bayes Classifiers

### Gaussian Naive Bayes

Use case: Continuous features assumed to follow a normal distribution within each class

```python
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
# Probability density function: P(xᵢ|y) = (1/√(2πσ²)) * exp(-(xᵢ-μ)²/(2σ²))
# μ and σ² are estimated separately for each feature i and class y
```

### Multinomial Naive Bayes

Use case: Discrete counts (text classification, word counts)

```python
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
# Uses smoothed frequency: P(xᵢ|y) = (N_yi + α) / (N_y + α*n)
# N_yi: count of feature i in class y; N_y: total count of all features in class y; n: number of features
# α (alpha): Smoothing parameter (Laplace smoothing when α=1)
```

### Bernoulli Naive Bayes

Use case: Binary/Boolean features (presence/absence)

```python
from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()
# Models binary outcomes: P(xᵢ|y) = P(i|y) if xᵢ=1, else 1-P(i|y)
# Unlike MultinomialNB, the absence of a feature contributes to the score
```
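
As a rough usage sketch, the three variants can be fitted on tiny hand-made arrays. The data below is invented purely for illustration, and passing `alpha=1.0` just makes the default smoothing explicit.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # class labels for four toy samples

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 4.2]])
gnb = GaussianNB().fit(X_cont, y)

# Word counts -> MultinomialNB (alpha=1.0 is Laplace smoothing)
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
mnb = MultinomialNB(alpha=1.0).fit(X_counts, y)

# Presence/absence -> BernoulliNB (counts thresholded to 0/1)
X_bin = (X_counts > 0).astype(int)
bnb = BernoulliNB(alpha=1.0).fit(X_bin, y)

pred_gnb = gnb.predict([[4.0, 4.1]])
pred_mnb = mnb.predict([[0, 2, 1]])
pred_bnb = bnb.predict([[1, 0, 0]])
print(pred_gnb, pred_mnb, pred_bnb)
```

All three share the same `fit` / `predict` / `predict_proba` interface; only the assumed form of P(xᵢ|y) differs.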


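## Data Preparation Guidelines

Text data: Vectorize with CountVectorizer or TfidfVectorizer

Categorical data: One-hot encode for BernoulliNB, or ordinal-encode and use scikit-learn's CategoricalNB; plain label encoding fed to GaussianNB or MultinomialNB imposes an ordering that is not there

Continuous data: Use GaussianNB, or bin the values (and one-hot encode the bins) for MultinomialNB

Missing values: A missing feature can in principle be left out of the product, and some implementations do this; scikit-learn's Naive Bayes estimators reject NaN, so impute or drop missing values first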
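
For the text case, one possible setup is a scikit-learn pipeline that chains the vectorizer and the classifier. The four-document corpus and its labels are made up for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for this example
texts = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the project report",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each document into word counts; MultinomialNB models those counts
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

prediction = model.predict(["claim your free prize"])[0]
print(prediction)
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned from the training data only, which matters once cross-validation is involved.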


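## Handling Common Issues

Zero Frequency Problem: a feature value never appears with a class in the training data, so its estimated likelihood is 0 and the whole product collapses to 0

Solution: Apply additive (Laplace) smoothing with α > 0

Numerical Underflow: multiplying many small probabilities produces a value too small to represent in floating point

Solution: Sum log probabilities instead of multiplying probabilities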
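
Both issues can be reproduced in a few lines of plain Python. The counts and probabilities are hypothetical, and `smoothed_probability` is an illustrative helper that applies the smoothed-frequency formula (N_yi + α) / (N_y + α*n).

```python
import math

def smoothed_probability(count, class_total, n_features, alpha=1.0):
    """Smoothed estimate (N_yi + alpha) / (N_y + alpha * n)."""
    return (count + alpha) / (class_total + alpha * n_features)

# Zero frequency: a feature never seen in this class (count = 0)
unsmoothed = smoothed_probability(0, 100, 50, alpha=0.0)  # exactly 0, wipes out the product
smoothed = smoothed_probability(0, 100, 50, alpha=1.0)    # 1 / 150, small but nonzero

# Underflow: 500 likelihoods of 1e-3 each
probs = [1e-3] * 500
product = math.prod(probs)                   # too small for a float, becomes 0.0
log_score = sum(math.log(p) for p in probs)  # finite, still usable for the argmax

print(unsmoothed, smoothed)
print(product, log_score)
```

Because the logarithm is monotonic, the class with the largest sum of log probabilities is also the class with the largest product.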

## Strengths & Weaknesses
### Advantages

Extremely fast training and prediction

Works well with high-dimensional data (like text)

Requires relatively little training data to estimate its parameters

Handles both continuous and discrete data

Simple to implement and interpret

Performs well with categorical features

### Disadvantages

Strong independence assumption (rarely true in practice)

Zero frequency problem without smoothing

Not ideal for regression tasks (primarily for classification)

Can be outperformed by more complex models with large datasets

Sensitive to redundant (highly correlated) features, whose evidence is counted more than once; no feature selection built-in

## Real-World Applications
### 1. Text Classification

Spam detection (email filtering)

Sentiment analysis

News categorization

Language detection

### 2. Medical Diagnosis

Disease prediction based on symptoms

Risk assessment

### 3. Recommendation Systems

Product recommendations

Content filtering

### 4. Fraud Detection

Credit card fraud

Insurance claim fraud

## Quick Reference Card
### When to Use Which Variant

GaussianNB: Continuous features, normal distribution assumed

MultinomialNB: Discrete counts (word counts, term frequencies)

BernoulliNB: Binary features (present/absent, true/false)

### Preprocessing Checklist

Handle missing values

Encode categorical variables

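Transform strongly skewed continuous features toward a normal shape (for GaussianNB); standardization alone barely changes its predictions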

Apply text vectorization (for text data)

Balance classes if needed

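### Common Pitfalls to Avoid

Expecting feature scaling to help GaussianNB: it fits a mean and variance per feature and class, so rescaling has almost no effect, while clearly non-Gaussian features do hurt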

Not applying smoothing, so a feature value unseen in a class zeroes out that class

Using raw counts for documents of very different lengths without trying TF-IDF or length normalization

Ignoring feature correlations when they're important

Assuming probabilities are well-calibrated
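
The last two pitfalls are linked, which a small sketch can show. Assume one observed feature is duplicated, so perfectly correlated copies are each treated as independent evidence. The prior and likelihoods are hypothetical, and `posterior_positive` is an illustrative helper.

```python
def posterior_positive(prior, p_pos, p_neg, copies):
    """Posterior of the positive class when one observed feature is counted `copies` times."""
    score_pos = prior * p_pos ** copies
    score_neg = (1 - prior) * p_neg ** copies
    return score_pos / (score_pos + score_neg)

# The feature is mildly more likely under the positive class (0.6 vs 0.4)
single = posterior_positive(0.5, 0.6, 0.4, copies=1)
triple = posterior_positive(0.5, 0.6, 0.4, copies=3)

print(round(single, 3))  # 0.6
print(round(triple, 3))  # 0.771
```

The duplicated copies add no new information, yet the posterior moves from 0.6 to about 0.77. The predicted class is often still right, but the probability is overconfident, so it should not be read as a calibrated estimate.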