# Bayesian Intent Estimator: Explained Notebook
This notebook mirrors the `bayesian_intent_estimator.py` script with explanations before each section.

## **1. Introduction to the Bayesian Model**

A **Bayesian Network (BN)** is a probabilistic graphical model that represents how variables influence one another.
Each node is a variable (like `device`, `added_to_cart`, or `intent`), and each directed edge encodes a dependency (for example, `reached_checkout → intent`).
Instead of learning a single global regression function, the BN learns a conditional probability table (CPT) for each node: how likely the node is to take each value, given the values of its parents.
This allows the model to:

1. Work naturally with mixed categorical data.

2. Express uncertainty explicitly.

3. Handle missing evidence gracefully (unobserved variables are marginalized out at inference time).

4. Be inspected and interpreted directly (CPTs are human-readable).


## **2. How This Bayesian Intent Estimator Was Developed**

The implementation (`bayesian_intent_estimator.py`) is a small, self-contained script built on the pgmpy library.
The pipeline is intentionally simple and transparent:

### Imports & pgmpy compatibility

In [10]:
from __future__ import annotations

import argparse
import logging
import warnings
from dataclasses import dataclass
from itertools import product
from pathlib import Path
from typing import Iterable, List, Tuple

import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# pgmpy shuffled names around in newer releases; keep this import robust.
try:
    from pgmpy.models import BayesianNetwork  # <= 0.1.24
except ImportError:
    from pgmpy.models.BayesianNetwork import BayesianNetwork  # >= 0.1.25

from pgmpy.estimators import BayesianEstimator
from pgmpy.inference import VariableElimination

### Configuration & Logging

In [11]:
@dataclass(frozen=True)
class TrainConfig:
    test_size: float = 0.2
    seed: int = 42
    # Small equivalent sample size adds a bit of smoothing to CPTs
    cpd_equiv_n: float = 1.0
    outdir: Path = Path("./bn_out")


def setup_logging(level: str = "INFO") -> None:
    logging.basicConfig(
        level=getattr(logging, level.upper(), logging.INFO),
        format="%(asctime)s | %(levelname)s | %(message)s",
    )

### 🐦‍🔥 Brief Explanation
- `TrainConfig` keeps the model's configuration clean, centralized, and immutable.

- `setup_logging()` gives us clear, timestamped progress messages instead of scattered `print` statements.
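A minimal sketch of the level lookup used by `setup_logging()`, assuming nothing beyond the standard library. The helper name `resolve_level` is hypothetical; it only isolates the `getattr` fallback so the behaviour can be checked without touching the root logger.

```python
import logging


def resolve_level(level: str) -> int:
    """Map a level name to its numeric value, defaulting to INFO (hypothetical helper)."""
    # Same lookup as in setup_logging(): unknown names fall back to logging.INFO.
    return getattr(logging, level.upper(), logging.INFO)


print(resolve_level("debug") == logging.DEBUG)
print(resolve_level("not-a-level") == logging.INFO)
```

If the configured format does not appear in a notebook session, passing `force=True` to `logging.basicConfig()` (available since Python 3.8) replaces handlers that were installed earlier.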

### 🧩 Explanation of `@dataclass` and `setup_logging`

Let me explain what this part of the code does step by step.

---

#### **1️⃣ The `@dataclass(frozen=True)` decorator: what it means**

This part defines a **configuration class** called `TrainConfig`.
It uses Python's `@dataclass` decorator, which automatically creates an initializer (`__init__`) and other helper methods for us.

The `frozen=True` argument means that once I create an instance of this class, I **cannot reassign its fields later**: the instance becomes **immutable**, and any attempt to assign raises `dataclasses.FrozenInstanceError`.
This is really helpful for reproducibility because it prevents accidental changes to important settings during training.

Here's what each parameter inside `TrainConfig` represents:

| Parameter | Type | Default | Description |
|------------|------|----------|-------------|
| `test_size` | `float` | `0.2` | Fraction of the dataset held out for validation. Here it's 20%. |
| `seed` | `int` | `42` | Random seed passed to the train/validation split so that the split is reproducible. |
| `cpd_equiv_n` | `float` | `1.0` | Equivalent sample size of the BDeu prior used to smooth the Conditional Probability Tables (CPTs). It helps avoid zero probabilities when data is sparse. |
| `outdir` | `Path` | `Path("./bn_out")` | Folder where all outputs (like CPTs and predictions) will be saved. |

In short, this small class keeps all important training settings **in one place** and ensures they don't get changed by mistake.
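The effect of `frozen=True` can be seen in a short, self-contained sketch. It redefines `TrainConfig` with the same fields so that it runs on its own; the use of `dataclasses.replace()` to derive a modified copy is a suggestion, not something the script does.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from pathlib import Path


@dataclass(frozen=True)
class TrainConfig:
    test_size: float = 0.2
    seed: int = 42
    cpd_equiv_n: float = 1.0
    outdir: Path = Path("./bn_out")


cfg = TrainConfig()

try:
    cfg.seed = 7  # assignment on a frozen instance is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

# dataclasses.replace() builds a new instance with selected fields changed.
cfg_big_test = replace(cfg, test_size=0.3)

print(mutated, cfg.test_size, cfg_big_test.test_size)
```

The original `cfg` is left untouched, so two experiments with different settings can never share a half-modified configuration.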

---

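#### **2️⃣ The `setup_logging` function**

This function configures **how messages are printed during the run**: it replaces ad-hoc `print()` statements with structured, timestamped logs.
The `getattr(..., logging.INFO)` lookup falls back to `INFO` when the level name is not recognized.
Note that `logging.basicConfig()` does nothing if the root logger already has handlers, which is a likely reason the demo output further down shows the default `INFO:root:` prefix rather than the timestamped format.

Here's what happens:

```python
def setup_logging(level: str = "INFO") -> None:
    logging.basicConfig(
        level=getattr(logging, level.upper(), logging.INFO),
        format="%(asctime)s | %(levelname)s | %(message)s",
    )
```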

## **Step 1 – Data Preparation**

If no dataset is provided, the script generates a small synthetic e-commerce funnel with variables such as:
`traffic_source → used_search → applied_filters → added_to_cart → reached_checkout → intent`.

Continuous values are discretized (split into quartile bins), and booleans/integers are cast to categorical strings so that pgmpy can learn proper CPTs.

### Data helpers (synthetic / load / discretize)

In [12]:
def make_synthetic(n: int = 2500, seed: int = 42) -> pd.DataFrame:
    rng = np.random.default_rng(seed)

    traffic_source = rng.choice(["search", "ads", "direct", "social"], size=n, p=[0.45, 0.25, 0.20, 0.10])
    device = rng.choice(["mobile", "desktop"], size=n, p=[0.65, 0.35])
    prior_purchaser = rng.choice([0, 1], size=n, p=[0.80, 0.20])

    used_search = (traffic_source == "search").astype(int)
    applied_filters = (used_search * (rng.random(n) < 0.6)).astype(int)
    added_to_cart = ((applied_filters | (rng.random(n) < 0.15)) * (rng.random(n) < 0.5)).astype(int)
    reached_checkout = (added_to_cart * (rng.random(n) < 0.55)).astype(int)
    viewed_shipping = ((reached_checkout | (rng.random(n) < 0.1)) * (rng.random(n) < 0.7)).astype(int)

    base = 0.05 + 0.40 * reached_checkout + 0.15 * prior_purchaser + 0.05 * (device == "desktop")
    intent = (rng.random(n) < np.clip(base, 0, 0.95)).astype(int)

    return pd.DataFrame(
        {
            "traffic_source": traffic_source,
            "device": device,
            "prior_purchaser": prior_purchaser,
            "used_search": used_search,
            "applied_filters": applied_filters,
            "added_to_cart": added_to_cart,
            "reached_checkout": reached_checkout,
            "viewed_shipping": viewed_shipping,
            "intent": intent,
        }
    )


def load_or_make(path: str | None) -> pd.DataFrame:
    if path is None:
        logging.info("No --data provided; generating a small synthetic dataset.")
        return make_synthetic()
    df = pd.read_csv(path)
    if "intent" not in df.columns:
        raise ValueError("CSV must contain a binary column named 'intent'.")
    return df


def ensure_discrete(df: pd.DataFrame, target: str) -> pd.DataFrame:
    out = df.copy()
    for c in out.columns:
        if c == target:
            continue
        if pd.api.types.is_integer_dtype(out[c]) or pd.api.types.is_bool_dtype(out[c]):
            out[c] = out[c].astype(int).astype(str)
        elif pd.api.types.is_float_dtype(out[c]):
            try:
                out[c] = pd.qcut(out[c], q=4, duplicates="drop").astype(str)
            except Exception:
                out[c] = out[c].astype(str)
        else:
            out[c] = out[c].astype(str)
    return out

### 🐦‍🔥 Brief Explanation

These three functions handle how the dataset is prepared before training the Bayesian model.
The `make_synthetic()` function creates a small, fake but plausible dataset that simulates a user's online shopping behavior, from visiting a site to reaching checkout and showing purchase intent.
The `load_or_make()` function either loads a real CSV dataset (if a path is provided) or calls `make_synthetic()` to generate data when none is available.
Finally, the `ensure_discrete()` function converts every feature into **discrete (categorical)** values, since the discrete Bayesian Network used here learns one CPT entry per state. It computes quartile bins from whatever frame it is given, so float columns discretized separately for training and validation can end up with different bin labels; the synthetic data has no float columns, so the demo is unaffected.
Together, they make sure the model always has consistent, ready-to-use data, even if no real dataset is provided.
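For real data with float columns, one way to keep quartile bins consistent across splits is to learn the edges on the training data and reuse them everywhere else. The sketch below assumes pandas; `fit_bin_edges` and `apply_bin_edges` are hypothetical helpers and are not part of the script, which calls `pd.qcut` on each frame independently.

```python
import numpy as np
import pandas as pd


def fit_bin_edges(train_col: pd.Series, q: int = 4) -> np.ndarray:
    """Learn quantile bin edges from the training column only (hypothetical helper)."""
    _, edges = pd.qcut(train_col, q=q, retbins=True, duplicates="drop")
    edges = edges.astype(float)
    # Open the outer bins so values outside the training range still get a bin.
    edges[0], edges[-1] = -np.inf, np.inf
    return edges


def apply_bin_edges(col: pd.Series, edges: np.ndarray) -> pd.Series:
    """Bin any split with the edges learned on the training data."""
    return pd.cut(col, bins=edges).astype(str)


train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
val = pd.Series([0.5, 4.5, 9.0])

edges = fit_bin_edges(train)
train_bins = apply_bin_edges(train, edges)
val_bins = apply_bin_edges(val, edges)

# Every validation label is one the model has already seen during training.
print(set(val_bins) <= set(train_bins))
```

With shared edges, a validation row can never present the network with a bin label that has no entry in the learned CPTs.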

### 🧩 Explanation of `make_synthetic`, `load_or_make`, and `ensure_discrete`

Let me explain what these three functions do and how they work together to prepare our data.

---

#### **1️⃣ The `make_synthetic()` function**

This function **creates a small, artificial dataset** that simulates a typical **e-commerce funnel**.
It's helpful when we don't have real-world data but still want to train and test the model.

Let's break it down step by step:

- `rng = np.random.default_rng(seed)`
  Creates a **random number generator** with a fixed seed, so the synthetic data is the same every time we run the code.

- The next few lines create simulated user attributes:
  ```python
  traffic_source = rng.choice(["search", "ads", "direct", "social"], size=n, p=[0.45, 0.25, 0.20, 0.10])
  device = rng.choice(["mobile", "desktop"], size=n, p=[0.65, 0.35])
  prior_purchaser = rng.choice([0, 1], size=n, p=[0.80, 0.20])
  ```

## **Step 2 – DAG Construction**
A hand-crafted, interpretable funnel-shaped DAG defines the main dependencies:

In [13]:
def build_dag(features: List[str], target: str = "intent") -> BayesianNetwork:
    nodes = set(features) | {target}

    def have(*cols: str) -> bool:
        return all(c in nodes for c in cols)

    edges: List[Tuple[str, str]] = []
    if have("traffic_source", "used_search"):
        edges.append(("traffic_source", "used_search"))
    if have("used_search", "applied_filters"):
        edges.append(("used_search", "applied_filters"))
    if have("applied_filters", "added_to_cart"):
        edges.append(("applied_filters", "added_to_cart"))
    if have("added_to_cart", "reached_checkout"):
        edges.append(("added_to_cart", "reached_checkout"))

    for parent in ["reached_checkout", "viewed_shipping", "prior_purchaser", "device"]:
        if have(parent, target):
            edges.append((parent, target))

    for f in features:
        if f != target and not any(p == f and c == target for p, c in edges):
            edges.append((f, target))

    return BayesianNetwork(edges)


def fit_bn(
    train: pd.DataFrame,
    features: List[str],
    target: str,
    equiv_n: float,
) -> tuple[BayesianNetwork, VariableElimination]:
    model = build_dag(features, target)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        model.fit(
            train[features + [target]],
            estimator=BayesianEstimator,
            prior_type="BDeu",
            equivalent_sample_size=equiv_n,
        )
    return model, VariableElimination(model)

### 🐦‍🔥 **Brief Paragraph Explanation**

- The `build_dag()` function defines the structure of the Bayesian Network by connecting features in a logical sequence that represents a customer's journey, from visiting the site to reaching checkout and showing purchase intent. It adds edges between consecutive funnel steps and then connects every feature directly to the target (`intent`) so nothing is left out. Because `intent` gets every feature as a parent, its CPT has one column per combination of all feature states, so the table grows quickly as features are added.
- The `fit_bn()` function then trains this network using pgmpy's `BayesianEstimator`, which learns the conditional probability tables (CPTs) from the data. It applies a BDeu prior with a small equivalent sample size (`cpd_equiv_n`) for smoothing and returns both the trained model and a `VariableElimination` inference object to compute probabilities later. Together, these functions build and train the probabilistic model that underpins the intent prediction system.
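To make the smoothing concrete, here is a hand computation of one BDeu-smoothed CPT column in plain Python. It assumes the standard BDeu posterior mean, where every cell receives a pseudo-count of `alpha / (r * q)`, with `r` the number of child states and `q` the number of parent configurations. `bdeu_column` is a hypothetical helper; pgmpy's own implementation is not reproduced here.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List


def bdeu_column(
    child_values: Iterable[Hashable],
    child_states: List[Hashable],
    n_parent_configs: int,
    alpha: float = 1.0,
) -> Dict[Hashable, float]:
    """Smoothed P(child | one parent configuration) under a BDeu prior."""
    counts = Counter(child_values)            # observed counts for this configuration
    r = len(child_states)
    pseudo = alpha / (r * n_parent_configs)   # pseudo-count added to every cell
    total = sum(counts[s] for s in child_states) + r * pseudo
    return {s: (counts[s] + pseudo) / total for s in child_states}


# Ten rows observed under one parent configuration, none of them with intent = 1.
observed = [0] * 10
col = bdeu_column(observed, child_states=[0, 1], n_parent_configs=4, alpha=1.0)
print(col)
```

Even though `intent = 1` never occurs in this configuration, its smoothed probability stays slightly above zero, and a configuration with no rows at all falls back to a uniform column.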

### 🧩 Explanation of `build_dag()` and `fit_bn()`

Let me explain how these two functions work together to construct and train the Bayesian Network model.

---

#### **1️⃣ The `build_dag()` function**

This function builds the **Directed Acyclic Graph (DAG)**, the structure that defines how variables influence each other in the Bayesian Network.
A DAG is a set of nodes (variables) connected by directed edges that represent dependency relationships, with no directed cycles.
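The "acyclic" requirement can be checked mechanically. The sketch below uses plain Python and Kahn's algorithm on the funnel edges from this notebook; `topological_order` is a hypothetical helper and is not how pgmpy validates graphs internally.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]


def topological_order(edges: List[Edge]) -> Optional[List[str]]:
    """Return a parent-before-child ordering, or None if the graph has a cycle."""
    children: Dict[str, List[str]] = defaultdict(list)
    indegree: Dict[str, int] = defaultdict(int)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        indegree[parent] += 0  # make sure root nodes are registered too

    queue = deque(sorted(n for n, d in indegree.items() if d == 0))
    order: List[str] = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # If some node never reached in-degree zero, it sits on a cycle.
    return order if len(order) == len(indegree) else None


funnel = [
    ("traffic_source", "used_search"),
    ("used_search", "applied_filters"),
    ("applied_filters", "added_to_cart"),
    ("added_to_cart", "reached_checkout"),
    ("reached_checkout", "intent"),
]

print(topological_order(funnel))
print(topological_order(funnel + [("intent", "traffic_source")]))
```

The funnel yields a valid ordering from `traffic_source` down to `intent`, while adding an edge from `intent` back to `traffic_source` creates a cycle and the function returns `None`.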

Here's what happens step-by-step:

- `nodes = set(features) | {target}`
  This collects all the features plus the target (in our case, `intent`) into one set of nodes for the network.

- `have(*cols)`
  This small helper function simply checks if all the mentioned columns exist in the current dataset before trying to connect them.
  It makes the function flexible: if a dataset is missing a column, the corresponding edge is skipped instead of raising an error.

- Then we define the main **edges** (the relationships between features) in a way that mimics a user's online shopping funnel:
  ```text
  traffic_source → used_search → applied_filters → added_to_cart → reached_checkout
  ```

## Inference utilities

In [14]:
def _p1_from_query(q) -> float:
    vals = np.asarray(q.values).ravel()
    names = q.state_names.get(list(q.variables)[0], None)
    if names is None:
        return float(vals[-1])
    if "1" in names:
        return float(vals[names.index("1")])
    if 1 in names:
        return float(vals[names.index(1)])
    try:
        idx = int(np.argmax([float(str(s)) for s in names]))
    except Exception:
        idx = len(names) - 1
    return float(vals[idx])


def predict_proba(infer: VariableElimination, X: pd.DataFrame, target: str) -> np.ndarray:
    X = X.reset_index(drop=True)
    out = np.zeros(len(X), dtype=float)
    for i, row in enumerate(X.itertuples(index=False, name=None)):
        evidence = {c: v for c, v in zip(X.columns, row) if pd.notna(v)}
        q = infer.query([target], evidence=evidence, show_progress=False)
        out[i] = _p1_from_query(q)
    return out

### 🐦‍🔥 Brief Explanation

These two functions work together to calculate the predicted probability of the target (for example, `intent = 1`) using the trained Bayesian Network.

- The `_p1_from_query()` function extracts the probability of the positive class (`1`) from pgmpy's query result, handling different state-name formats safely.
- The `predict_proba()` function loops through each row in the dataset, builds an evidence dictionary from the non-missing values, performs inference using `VariableElimination`, and uses `_p1_from_query()` to record the probability of intent for each observation. One caveat: `ensure_discrete()` casts columns with `astype(str)`, which turns missing values into the literal string `"nan"`, so frames that have passed through it no longer contain values that `pd.notna` would skip.

In short, they convert the Bayesian Network's learned relationships into actual **numerical probability predictions** for every data sample.
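Because every feature is categorical, many rows share exactly the same evidence. A possible optimization, sketched here in plain Python, is to run inference once per distinct evidence combination and reuse the result. `predict_with_cache` and the stand-in `fake_query` are hypothetical; in the notebook the callable would wrap `infer.query(...)` followed by `_p1_from_query(...)`.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Evidence = Dict[str, Hashable]


def predict_with_cache(
    rows: List[Evidence],
    query_fn: Callable[[Evidence], float],
) -> List[float]:
    """Call query_fn once per distinct evidence dict and reuse the result."""
    cache: Dict[Tuple[Tuple[str, Hashable], ...], float] = {}
    out: List[float] = []
    for evidence in rows:
        key = tuple(sorted(evidence.items()))  # order-independent, hashable key
        if key not in cache:
            cache[key] = query_fn(evidence)
        out.append(cache[key])
    return out


call_log: List[Evidence] = []


def fake_query(evidence: Evidence) -> float:
    """Stand-in for real inference; records every invocation."""
    call_log.append(evidence)
    return 0.45 if evidence.get("reached_checkout") == "1" else 0.05


rows = [
    {"device": "mobile", "reached_checkout": "1"},
    {"device": "mobile", "reached_checkout": "0"},
    {"reached_checkout": "1", "device": "mobile"},  # same evidence, different key order
]
probs = predict_with_cache(rows, fake_query)
print(probs, len(call_log))
```

Three rows produce only two inference calls here; on a dataset with a handful of binary features the saving can be much larger, since the number of distinct combinations is bounded by the product of the feature cardinalities.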

### 🧩 Explanation of `_p1_from_query()` and `predict_proba()`

Let's go through these two functions. They work together to calculate the **predicted probability of intent** for each observation.

---

#### **1️⃣ The `_p1_from_query()` function**

This is a **helper function** that extracts the probability of the target being `1` (for example, `intent = 1`) from pgmpy's query output.

When we use pgmpy's inference engine (`VariableElimination`), it returns a **query object** containing:
- All possible states of a variable (e.g., `"0"` and `"1"`)
- Their associated probabilities.

This helper ensures we always pick the correct probability value, even if the states are labeled differently (as strings `"1"`, integers `1`, or in another order).

Here's what it does step by step:
1. `vals = np.asarray(q.values).ravel()` → flattens the array of probabilities returned by the query.
2. `names = q.state_names.get(list(q.variables)[0], None)` → retrieves the possible state names (like `["0", "1"]`).
3. If no state names are available, it returns the **last value** as a fallback.
4. Otherwise it checks several possibilities in order:
   - If `"1"` exists in the list, return its probability.
   - If integer `1` exists, return that.
   - Otherwise, it tries to find the state with the **largest numeric value** (assuming the largest label represents the positive class), and falls back to the last state if the labels are not numeric.

This makes the function **robust** to different output formats and ensures it always returns a single float, `P(target=1)`.
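A small illustration of why the lookup goes by state name rather than by position, assuming only the standard library. `FakeQuery` is a hypothetical stand-in that exposes the three attributes the helper reads (`variables`, `values`, `state_names`); it is not a pgmpy class. `p_positive` is a simplified rendering of the name-based lookup, not a replacement for `_p1_from_query()`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeQuery:
    """Hypothetical stand-in for a pgmpy query result."""
    variables: List[str]
    values: List[float]
    state_names: Dict[str, list]


def p_positive(q: FakeQuery) -> float:
    """Simplified name-based lookup of P(target = 1)."""
    names = q.state_names.get(q.variables[0])
    if names is None:
        return float(q.values[-1])  # no labels: assume the last state is positive
    for label in ("1", 1):
        if label in names:
            return float(q.values[names.index(label)])
    return float(q.values[-1])


usual = FakeQuery(["intent"], [0.8, 0.2], {"intent": ["0", "1"]})
flipped = FakeQuery(["intent"], [0.2, 0.8], {"intent": ["1", "0"]})

print(p_positive(usual), p_positive(flipped))
```

Both objects describe the same distribution with the states stored in a different order, and the lookup returns the same probability for the positive class in each case. Taking `values[-1]` blindly would have returned the wrong number for the second one.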

---

#### **2️⃣ The `predict_proba()` function**

This function uses the trained Bayesian Network's inference engine to **predict probabilities** for each observation in a dataset.

Here's the breakdown:

1. **Reset indices:**
   ```python
   X = X.reset_index(drop=True)
   ```

## Export CPTs (wide & long)

In [15]:
def export_cpt(model: BayesianNetwork, node: str, outdir: Path) -> None:
    cpd = model.get_cpds(node)

    child_states = cpd.state_names.get(node, [])
    child_states = [str(s) for s in child_states] if child_states else [str(i) for i in range(cpd.cardinality[0])]

    # TabularCPD lists the child variable first, followed by its parents (evidence).
    parents = cpd.variables[1:]
    parent_states = [cpd.state_names[p] for p in parents] if parents else []

    if parents:
        cols = pd.MultiIndex.from_tuples(list(product(*parent_states)), names=[str(p) for p in parents])
    else:
        cols = pd.Index(["<no_parents>"])

    vals = cpd.get_values()
    cpt_wide = pd.DataFrame(vals, index=child_states, columns=cols)
    cpt_wide.index.name = f"{node}_state"
    outdir.mkdir(parents=True, exist_ok=True)
    cpt_wide.to_csv(outdir / f"{node}_cpt_wide.csv")

    if parents:
        cpt_long = (
            cpt_wide
            .stack(list(range(len(parents))))
            .rename("prob")
            .reset_index()
        )
        cpt_long.columns = [f"{node}_state"] + [str(p) for p in parents] + ["prob"]
    else:
        cpt_long = pd.DataFrame({f"{node}_state": child_states, "prob": vals.ravel()})

    cpt_long.to_csv(outdir / f"{node}_cpt_long.csv", index=False)

### 🐦‍🔥 **Brief Paragraph Explanation**

- The `export_cpt()` function extracts the Conditional Probability Table (CPT) of a selected node (for example, `intent`) from the trained Bayesian Network and saves it into two CSV files. The first is a **wide format** table with one column per combination of parent states, and the second is a **long (tidy)** format table with one row per (child state, parent combination) pair, suitable for visualization and further analysis. In simple terms, this function converts the Bayesian model's learned probabilities into human-readable tables so we can clearly see how each parent variable influences the target.
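The two layouts can be compared on a toy CPT. The sketch below assumes NumPy and pandas, uses made-up probabilities, and reshapes with `DataFrame.unstack()` instead of the `stack(...)` call in `export_cpt()`; the resulting tidy table has the same rows, though the column order differs.

```python
from itertools import product

import numpy as np
import pandas as pd

# Toy CPT for a binary child with two binary parents (values are made up).
parents = ["reached_checkout", "prior_purchaser"]
parent_states = [["0", "1"], ["0", "1"]]
child_states = ["0", "1"]
vals = np.array(
    [
        [0.95, 0.80, 0.55, 0.40],  # P(intent = 0 | parents)
        [0.05, 0.20, 0.45, 0.60],  # P(intent = 1 | parents)
    ]
)

# Wide layout: one column per parent combination, one row per child state.
cols = pd.MultiIndex.from_tuples(list(product(*parent_states)), names=parents)
wide = pd.DataFrame(vals, index=pd.Index(child_states, name="intent_state"), columns=cols)

# Long layout: a Series indexed by (parent levels..., child state), flattened to rows.
tidy = wide.unstack().rename("prob").reset_index()

print(tidy)
```

A quick sanity check on any exported CPT is that the probabilities within each parent combination sum to one, for example with `tidy.groupby(parents)["prob"].sum()`.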


### 🧩 Explanation of `export_cpt()`

This function exports the **Conditional Probability Table (CPT)** of a specific node (variable) from the trained Bayesian Network into two easy-to-read CSV files: one in a *wide* format and one in a *long* format.
These outputs make it possible to inspect, analyze, or visualize how each parent variable influences the target.

---

#### **Step-by-step explanation**

1. **Access the node's CPD (Conditional Probability Distribution):**
   ```python
   cpd = model.get_cpds(node)
   ```

## Metrics

In [16]:
def eval_probs(y_true: Iterable[int], p: np.ndarray) -> dict:
    y = np.asarray(list(y_true)).astype(int)
    return {
        "AUROC": round(roc_auc_score(y, p), 4),
        "AUPR": round(average_precision_score(y, p), 4),
        "Brier": round(brier_score_loss(y, p), 4),
    }

### 🐦‍🔥 **Brief Paragraph Explanation**

The `eval_probs()` function evaluates how well the model's predicted probabilities match the actual outcomes. It takes the true labels and predicted probabilities, converts the labels into an integer array, and computes three performance metrics: **AUROC** (how well the scores rank positives above negatives), **AUPR** (average precision, which is more informative than AUROC when positives are rare), and **Brier score** (mean squared error between predicted probability and outcome; lower is better). It then returns these scores, rounded to four decimals, providing a quick quantitative summary of how reliable and accurate the model's predictions are.


### 🧩 Explanation of `eval_probs()`

This function evaluates the performance of the Bayesian intent prediction model by comparing the **true labels** (`y_true`) with the **predicted probabilities** (`p`).
It computes three key metrics (**AUROC**, **AUPR**, and **Brier score**) and returns them in a dictionary, each rounded to four decimals.
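For intuition, two of these metrics can be recomputed by hand on a tiny example. The sketch below is plain Python and is only meant to show what the numbers measure; the notebook itself relies on scikit-learn, and the helper names `brier` and `auroc` are illustrative.

```python
from typing import Sequence


def brier(y: Sequence[int], p: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def auroc(y: Sequence[int], p: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
p_hat = [0.1, 0.4, 0.35, 0.8]

print(round(auroc(y_true, p_hat), 4), round(brier(y_true, p_hat), 4))
```

Three of the four positive/negative pairs are ranked correctly, which is where an AUROC of 0.75 comes from. The pairwise loop is quadratic, so it is suitable for illustration rather than for large validation sets.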

---

#### **Step-by-step explanation**

1. **Convert the input labels into a NumPy array:**
   ```python
   y = np.asarray(list(y_true)).astype(int)
   ```

## Demo: train, predict, score on synthetic data

In [17]:
setup_logging("INFO")
cfg = TrainConfig()
df = load_or_make(None)

candidates = [
    "traffic_source", "device", "prior_purchaser", "used_search",
    "applied_filters", "added_to_cart", "reached_checkout", "viewed_shipping",
]
features = [c for c in candidates if c in df.columns]
target = "intent"

X = df[features].copy()
y = df[target].astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=cfg.test_size, random_state=cfg.seed, stratify=y)

train_disc = ensure_discrete(pd.concat([X_tr, y_tr], axis=1), target=target)
val_disc = ensure_discrete(pd.concat([X_va, y_va], axis=1), target=target)

model, infer = fit_bn(train_disc, features, target, equiv_n=cfg.cpd_equiv_n)
p_val = predict_proba(infer, val_disc[features], target=target)
metrics = eval_probs(y_va, p_val)
metrics

INFO:root:No --data provided; generating a small synthetic dataset.


{'AUROC': 0.689, 'AUPR': 0.2948, 'Brier': 0.1207}

### 🐦‍🔥 **Brief Paragraph Explanation**

- This code block runs the full training and evaluation pipeline for the Bayesian intent model. It starts by setting up logging and configuration, then generates the synthetic dataset (no path is passed to `load_or_make()`). The features and target (`intent`) are selected, and the data is split into training and validation sets with stratification on the target. Both sets are discretized for compatibility with the Bayesian Network, which is then trained using `fit_bn()`. After training, `predict_proba()` calculates the probability of purchase intent for each validation record, and `eval_probs()` evaluates the model using AUROC, AUPR, and Brier scores. The final output shows how well the model predicts customer intent on the held-out 20%.
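One way to put the Brier score in context is to compare it with a predictor that always outputs the base rate, whose expected Brier score is `p * (1 - p)`. The sketch below derives the expected positive rate from the parameters of `make_synthetic()` shown earlier, relying on the fact that the random draws in that function are independent. The realized rate in any one sample will differ somewhat, and the variable names are illustrative.

```python
# Expected rates implied by make_synthetic(); the random draws there are independent.
p_search = 0.45                                    # traffic_source == "search"
p_filters = p_search * 0.6                         # applied_filters
p_cart = (1 - (1 - p_filters) * (1 - 0.15)) * 0.5  # added_to_cart
p_checkout = p_cart * 0.55                         # reached_checkout
p_prior = 0.20                                     # prior_purchaser == 1
p_desktop = 0.35                                   # device == "desktop"

# intent is Bernoulli(base); base never reaches the 0.95 clip (its maximum is 0.65).
p_intent = 0.05 + 0.40 * p_checkout + 0.15 * p_prior + 0.05 * p_desktop

# Expected Brier score of a constant prediction equal to the base rate.
baseline_brier = p_intent * (1 - p_intent)

print(round(p_intent, 4), round(baseline_brier, 4))
```

Under these assumptions the base rate is roughly 14% and the constant-predictor Brier score is roughly 0.12, so the reported Brier score of 0.1207 is best read alongside AUROC and AUPR rather than on its own.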


### 🧩 Explanation of the Training, Prediction, and Evaluation Block

This section puts everything together: it prepares the data, trains the Bayesian model, makes predictions, and evaluates the results.
Let's go through it step by step.

---

#### **1️⃣ Setting up logging and configuration**
```python
setup_logging("INFO")
cfg = TrainConfig()
```

## Save predictions & CPTs

In [18]:
outdir = cfg.outdir
outdir.mkdir(parents=True, exist_ok=True)
pd.DataFrame({"p_intent": p_val, "y": y_va.to_numpy()}).to_csv(outdir / "val_predictions.csv", index=False)
export_cpt(model, node=target, outdir=outdir)
print("Saved to", outdir.resolve())

Saved to C:\Users\mebub_9a7jdi8\Desktop\Freelancer\Bayesian Model\bn_out


### 🐦‍🔥 **Brief Paragraph Explanation**

- This block saves the results of the Bayesian model to disk. It first ensures the output directory exists, then writes a CSV file (`val_predictions.csv`) containing the predicted intent probabilities and true labels from the validation set. It also calls `export_cpt()` to save the learned Conditional Probability Table (CPT) for the target variable in both wide and long formats. Finally, it prints a confirmation message showing where all the files were saved, completing the model training and export process.

### 🧩 Explanation of the Output Saving Block

This final block of code saves the key results from the Bayesian Network (the predicted probabilities and the learned Conditional Probability Table for the target) into organized files for later use or analysis.

---

#### **1️⃣ Setting up the output directory**
```python
outdir = cfg.outdir
outdir.mkdir(parents=True, exist_ok=True)
```