# Naive Bayes

Although Naive Bayes is best known for text classification, it is also heavily used in spam detection, search ranking, recommender systems, CTR prediction, fraud detection, medical diagnosis, anomaly detection, and even early computer vision pipelines. Anywhere you have high-dimensional sparse features or small datasets, Naive Bayes is a fast, interpretable, and surprisingly strong baseline.

* Use case:
    * Classifying user reviews (positive/negative) -> Sentiment analysis
    * Click-through prediction
    * Recommender Systems: Estimate probability that a user will like item B given they liked item A.
        * Cold start problem
        * Predict next product category
    * Fraud & Rrik Modelling
        * Credit risk scoring
        * Likelihood of fraud
        * Insurance claim fraud detection
        * AML risk segmentation
    * A/B Testing & Experimentation
        * Bayesian bandits
        * Probabilistic inference of user behaviour

## 1. Definition
Naive Bayes is a probabilistic classification algorithm based on Bayes' Theorem that applies a strong (naive) independence assumption between features to predict the class of a data point (Given feature X, what is the most likely class Y?)

It's a generative model taht calculates the probability of a class given a set of features by counting frquencies in the training data. Naive Bayes solves the computational intractability of calculating full joint probabilities in high-dimensional data (e.g. text) where estimating interactions between every pair of features requires exponentially large datasets.

## 2. Core Idea
To resolve the spam or not spam in email classification problem, we need to calculate paired correlations between words. As vocabulary grows, the combinations become infinite. Therefore, we simply assume the text features are unrelated / independent events so that we can simply calculate their individual probabilities and multiply them.

This assumption is where 'Naive' comes from. It simplifies the math enough to work surprisingly well for classification.

## 3. Mechanism
The model works by decomposing the posterior probability into three learnable components:
* The Prior ($P(y)$): How common is this class in general?
* The Likelihood ($P(X_i|y)$): If the email is Spam, what is the probability of seeing the word 'free'?
* The Evidence ($P(x)$): TThe probability of the data itself (usually ignored during prediction as it is constant for all classes).

Naive Bayes applies Bayes’ Theorem:
$$P(X|Y) = \frac{P(Y|X)P(Y)}{P(X)}$$

where $P(B) != 0$

The naive assumption allows us to factor: 

$$P(X|Y) = \prod_{i=1}^d P(x_i | Y)$$

Workflow:
- Training: Simply count the occurrences of every feature for every class.
- Inference: For a new data point, look up the probabilities for its features, multiply them by the prior, and pick the class with the highest score.

## 4. Mathematical Details / Training
Naive Bayes is distinct because it is not trained via gradient descent. It is trained via Maximum Likelihood Estimation (MLE).
* The Naive Assumption:

$$\hat{y}=\arg\max_y P(Y=y)\prod_{i=1}^d P(x_i |Y=y)$$

Components:
* Prior: $P(Y=y)$ .. Count how many times each class appears
* Likelihood of features:
    - Multinomial NB → counts (text)
	- Bernoulli NB → binary features
	- Gaussian NB → continuous features
* Posterior: $P(Y|X)$

Instead of a complex join distribution, we just multiply the individual probabilities of feature $x_i$.

* Optimization:
There is no "loss function" in the traditional sense (like MSE or Cross-Entropy) to minimize iteratively.
The weights are closed-form solutions derived directly from frequency counts in the data.

* Algorithmic Principle (Log-Sum-Exp): (Avoid underflow using logs)
Since multiplying many small probabilities results in numerical underflow (numbers become too close to zero), we operate in log-space. Multiplication becomes addition:
$$\log(P(y|x)) \propto \log(P(y)) + \sum \log(P(x_i | y))$$


## 5. Pros and Cons
* Pros
    * Speed: Training is O(N)... one pass over data; Inference is O(D) ..linear with number of features
    * Small Data: Performs well even with small datasets because it has high bias but very low variance
    * Works very well for text.
    * Handles high dimensionality.
    * Multiclass: Handles multiclass classification natively without needing 'One-vs-Rest' strategies.
* Cons:
    * The independence assumption: In domains where feature interaction is critical (e.g., image pixels or "Not Good" in sentiment analysis), it fails.
    * Probability Calibration: While the ranking of classes is usually correct (Class A > Class B), the actual predicted probabilities are often inaccurate (pushing towards 0 or 1) due to the independence assumption violating reality.
    * Performs poorly with correlated or non-Gaussian continuous features.
    * Not suitable for complex decision boundaries.
    * Zero Frequency Problem: If a word (e.g., "Casino") never appears in the Training Set, the probability becomes 0. If you multiply by 0, the whole prediction dies. (Solved via Laplace Smoothing).



## 6. Production Consideration
* Scalability: It is trivially parallelizable. You can count word frequencies on 10 different machines and sump them up. It is perfect for streaming data / online learning.
* Latency: Idea for high-frequency/low-latency application, where you cannot afford the inference cost of a Transformer or a Deep Neural Net.
* Model Compression: The model is just a lookup table of probabilities. It requires very little RAM compared to random forests or deep learning models.


## 7. Other Variants
* The zero frequency error - Laplace Smoothing
    * Scenario: The word "Bitcoin" appears frequently in your Spam training data, but it has never appeared in your Ham training data.
    * $$P(\text{"Bitcoin"} | \text{Ham}) = \frac{\text{Count of "Bitcoin" in Ham}}{\text{Total Words in Ham}} = \frac{0}{10,000} = 0$$
    * e.g. "Hey, did you see the news about Bitcoin?" -> 

    $$P(\text{Ham} | \text{Email}) \propto P(\text{Ham}) \times P(\text{"Hey"}|Ham) \times \dots \times \mathbf{P(\text{"Bitcoin"}|Ham)}$$$$P(\text{Ham} | \text{Email}) \propto 0.5 \times 0.05 \times \dots \times \mathbf{0} = \mathbf{0}$$

    * The result: The probability becomes absolute zero. The model becomes infinitely confident that this cannot be Ham, purely because of a lack of data.

* Laplace Smoothing (also called Additive Smoothing) solves this by incorporating a prior belief: "Every word is possible, even if we haven't seen it yet." Mechanically, we add a small "pseudocount" (usually 1) to every word in our vocabulary for every class.

Standard Version:
$$P(w_i | class) = \frac{\text{count}(w_i)}{\text{total words in class}}$$

Smoothed Version:
$$P(w_i | class) = \frac{\text{count}(w_i) + \alpha}{\text{total words in class} + (\alpha \times V)}$$

Where:

* $\text{count}(w_i)$: The actual times we saw the word.
* $\alpha$ (Alpha): The smoothing parameter
    * If $\alpha = 1$, it is called Laplace Smoothing.
    * If 3$0 < \alpha < 1$, it is called Lidstone Smoothing.
* $V$: The size of the Vocabulary (total unique words in the dataset).5

In [1]:
import sys, os
root = os.path.abspath("..")

sys.path.append(root)

from src.naive_bayes import MultinomialNaiveBayes
import numpy as np
import pandas as pd

## Case 1: A/B Testing 

### Business Problem:
At hotel.com (mock scenario), we were running an A/B test to evaluate a new homepage layout intended to increase booking conversions.
While the A/B test gave population-level uplift, leadership wanted:

	1.	User-level conversion probability estimates, and
	2.	Which user segments benefit most from Variant B.


Because the dataset was high-dimensional and included many categorical features (country, device, referral source), We chose Multinomial Naive Bayes as a fast, baseline probabilistic model to predict the likelihood of conversion for each variant.

The task is to
1. Build a conversion probability model to predict likelihood of conversion under Variant A vs B.
2. Use Naive Bayes to identify which segments show the largest uplift.
3. Provide recommendations on whether we should roll out Variant B to all users or only targeted groups.


### Data Problem
The data reflects real A/B test features:
* variant (A/B)
* device_type (mobile/desktop/tablet)
* country (3 example markets)
* previous_bookings (0/1/2/3)
* landing_page_views (count features)
* conversion (0/1)

In [14]:
np.random.seed(42)

N = 8000
df = pd.DataFrame({
    "variant": np.random.choice(["A", "B"], N, p=[0.5, 0.5]),
    "device": np.random.choice(["mobile", "desktop", "tablet"], N),
    "country": np.random.choice(["NL", "US", "UK"], N),
    "prev_bookings": np.random.poisson(1.0, N),
    "landing_views": np.random.poisson(3.0, N)
})

# Create conversion with some realistic pattern:
df["conversion"] = (
    0.05
    + 0.04 * (df["variant"] == "B").astype(int)
    + 0.03 * (df["device"] == "mobile").astype(int)
    + 0.02 * (df["country"] == "US").astype(int)
    + 0.01 * df["prev_bookings"]
    + np.random.normal(0, 0.02, N)
)
df["conversion"] = (df["conversion"] > 0.1).astype(int)

df.head()

Unnamed: 0,variant,device,country,prev_bookings,landing_views,conversion
0,A,desktop,US,0,3,0
1,B,desktop,NL,0,2,1
2,B,desktop,NL,0,2,0
3,B,tablet,US,1,1,1
4,A,mobile,US,2,5,1


In [20]:
## Feature Engineering
df["variant_bin"] = (df["variant"] == "B").astype(int)
X = pd.get_dummies(df[["device", "country"]], drop_first=False)

# Add count-based features
X["variant"] = df["variant_bin"]
X["prev_bookings"] = df["prev_bookings"]
X["landing_views"] = df["landing_views"]
X_numeric = X.astype(float)

y = df["conversion"]

In [21]:
model = MultinomialNaiveBayes(alpha=1.0)
model.fit(X_numeric.values, y.values)

<src.naive_bayes.MultinomialNaiveBayes at 0x123afdeb0>

* Predict Conversion Probability by Variant
* Predict probability of conversion if user is exposed to Variant A vs B

In [24]:
# Create two hypothetical scenarios per user
X_A = X.copy()
X_A["variant"] = 0  # forced version A

X_B = X.copy()
X_B["variant"] = 1  # forced version B

P_A = model.predict_proba(X_A.values)[:, 1]
P_B = model.predict_proba(X_B.values)[:, 1]

df["uplift"] = P_B - P_A

In [25]:
uplift_by_device = df.groupby("device")["uplift"].mean()
uplift_by_country = df.groupby("country")["uplift"].mean()

uplift_by_device, uplift_by_country

(device
 desktop   -2.213409
 mobile    -2.213409
 tablet    -2.213409
 Name: uplift, dtype: object,
 country
 NL   -2.213409
 UK   -2.213409
 US   -2.213409
 Name: uplift, dtype: object)