### Naive Bayes Classifier

Naive Bayes is a **supervised machine learning algorithm** that uses Bayes' theorem to estimate the probability of each class given the features of an example. It's called "naive" because it assumes the features are conditionally independent of one another given the class, which simplifies the math even though the assumption rarely holds in real data. During training, it learns a class prior and a probability distribution for each feature within each class. At prediction time, it combines these into a posterior probability for every class and selects the class with the highest one. Despite this simplistic assumption, Naive Bayes often works surprisingly well, especially on high-dimensional data such as text.
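
The train-then-predict procedure described above can be sketched in plain Python for the Gaussian case. This is a minimal illustration, not scikit-learn's implementation: the function names (`fit_gaussian_nb`, `log_posteriors`, `predict`), the `var_floor` parameter, and the toy data are all hypothetical and chosen only for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    params = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Population variance, floored so a constant feature cannot divide by zero
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        params[label] = (n / len(X), means, variances)
    return params


def log_posteriors(params, x):
    """Unnormalized log posterior: log P(class) + sum of log P(feature | class)."""
    scores = {}
    for label, (prior, means, variances) in params.items():
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            # Log of the normal density; summing logs is the "naive" product of features
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        scores[label] = score
    return scores


def predict(params, x):
    """Pick the class with the highest log posterior."""
    scores = log_posteriors(params, x)
    return max(scores, key=scores.get)


# Hypothetical toy data: two features, two well-separated classes
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y_toy = ["a", "a", "a", "b", "b", "b"]

toy_params = fit_gaussian_nb(X_toy, y_toy)
print(predict(toy_params, [1.1, 2.0]))  # prints "a"
print(predict(toy_params, [3.1, 4.0]))  # prints "b"
```

Working in log space turns the product of per-feature likelihoods into a sum, which is also what keeps the computation from underflowing when there are many features.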

*   **Use Cases:** Use Naive Bayes when you need a fast algorithm that works well with high-dimensional data, particularly for text classification problems like spam detection or sentiment analysis. It's also effective as a simple baseline when training data is limited. The model assumes that features are conditionally independent given the class label (the "naive" assumption), so each feature contributes to the prediction on its own and interactions between features are ignored. Gaussian Naive Bayes additionally assumes that each feature follows a normal distribution within each class. Naive Bayes works best when the independence assumption is not severely violated, or when the algorithm's simplicity and speed outweigh the cost of that assumption.

*   **Pros:**
    - Extremely fast training and prediction, even with large datasets
    - Works well with high-dimensional data (like text classification)
    - Requires a relatively small amount of training data to estimate parameters
    - Not sensitive to irrelevant features (they tend to "cancel out")
    - Handles multi-class problems naturally
    - Performs surprisingly well despite its simplistic assumptions

*   **Cons:**
    - The "naive" independence assumption is rarely true in real-world data
    - Doesn't capture feature interactions
    - Cannot learn complex relationships between features
    - May be outperformed by more sophisticated models when sufficient data is available
    - May produce poor probability estimates (though classifications may still be correct)
    - Zero-frequency problem: if a feature value appears in the test data but was never seen with a given class during training, the model assigns that class zero probability unless smoothing is applied

    | **Best Practice**                                                                        |
    | ---------------------------------------------------------------------------------------- |
    | Apply Laplace smoothing (`alpha=1.0`) to avoid zero probabilities.                       |
    | Choose correct variant: Gaussian (continuous), Multinomial (counts), Bernoulli (binary). |
    | Use log probabilities to avoid numerical underflow (most libraries do this).             |
    | Set `class_prior` (`priors` in GaussianNB) if you know the true class frequencies.       |
    | Validate with metrics like F1-score, precision, and recall.                              |
    | Apply `var_smoothing` in GaussianNB for numerical stability (e.g., 1e-9).                |
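
To make the smoothing advice concrete, here is a small sketch using scikit-learn's `MultinomialNB` on a made-up word-count matrix. The data, the variable names, and the two `alpha` values are assumptions chosen for illustration. Class 0 never contains the third word during training, so a single occurrence of that word in a new document dominates the prediction when smoothing is close to zero.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: rows are documents, columns are three words.
# Class 0 never contains the third word in training.
X_counts = np.array([
    [2, 1, 0],
    [3, 0, 0],
    [1, 1, 2],
    [1, 2, 3],
])
y_counts = np.array([0, 0, 1, 1])

# A new document dominated by the first word, with one occurrence of the third
doc = np.array([[3, 0, 1]])

results = {}
for alpha in (1e-3, 1.0):
    clf = MultinomialNB(alpha=alpha).fit(X_counts, y_counts)
    predicted = int(clf.predict(doc)[0])
    p_class0 = float(clf.predict_proba(doc)[0][0])
    results[alpha] = (predicted, p_class0)
    print(f"alpha={alpha}: predicted class {predicted}, P(class 0)={p_class0:.3f}")
```

With almost no smoothing, the one unseen word pushes the document to class 1 even though the rest of it looks like class 0. With `alpha=1.0`, the unseen word is only penalized, not ruled out, and the model predicts class 0.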


In [3]:
%pip install --quiet pandas numpy matplotlib seaborn scikit-learn


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Note: you may need to restart the kernel to use updated packages.





In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

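# Load the iris dataset as a feature matrix X and a label vector y
X, y = load_iris(return_X_y=True)
# Hold out 20% of the samples for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit Gaussian Naive Bayes on the training split and predict the held-out labels
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)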

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 1.0
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
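
A perfect score on a 30-sample test set is a noisy estimate, since a different split could give a different result. One way to get a steadier picture, sketched here as an optional follow-up and not part of the original notebook run, is k-fold cross-validation with scikit-learn's `cross_val_score`. The choice of 5 folds and macro-averaged F1 is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each sample is used for testing exactly once.
# For classifiers, an integer cv uses stratified folds.
scores = cross_val_score(GaussianNB(), X, y, cv=5, scoring="f1_macro")

print("Macro F1 per fold:", scores.round(3))
print("Mean macro F1:", round(float(scores.mean()), 3))
```

The spread across folds gives a rough sense of how much the score depends on which samples end up in the test set.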

