# Notebook 6 — Solutions: Probability Models & ML Foundations

Worked solutions and explanations for the tasks in Notebook 6 (Exercises).

## 1. Probability Refresher — Joint, Marginal, Conditional (Task 1.1)

Given the joint probability table for binary variables A and B:

| A | B | P(A,B) |
|---|---|--------|
| 0 | 0 | 0.1    |
| 0 | 1 | 0.3    |
| 1 | 0 | 0.2    |
| 1 | 1 | 0.4    |

Compute:
- Marginal P(A=1), P(B=1)
- Conditional P(A=1 | B=1)


In [None]:
import numpy as np

# Rows index A (0, 1); columns index B (0, 1)
joint_probs = np.array([
    [0.1, 0.3],  # P(A=0, B=0), P(A=0, B=1)
    [0.2, 0.4],  # P(A=1, B=0), P(A=1, B=1)
])

# Marginal P(A=1) = sum over B of P(A=1, B)
P_A1 = joint_probs[1, :].sum()

# Marginal P(B=1) = sum over A of P(A, B=1)
P_B1 = joint_probs[:, 1].sum()

# Conditional P(A=1 | B=1) = P(A=1,B=1) / P(B=1)
P_A1_given_B1 = joint_probs[1, 1] / P_B1

print(f"P(A=1) = {P_A1:.3f}")
print(f"P(B=1) = {P_B1:.3f}")
print(f"P(A=1 | B=1) = {P_A1_given_B1:.3f}")

**Explanation:**
- P(A=1) = 0.2 + 0.4 = 0.6
- P(B=1) = 0.3 + 0.4 = 0.7
- P(A=1 | B=1) = 0.4 / 0.7 ≈ 0.571
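
As a cross-check on the numbers above, the following sketch derives both marginals and the full conditional distribution P(A | B) from the table in one pass, and tests whether A and B are independent. The names `joint`, `P_A`, `P_B`, and `P_A_given_B` are illustrative and not part of the original exercise.

```python
import numpy as np

# Rows index A (0, 1); columns index B (0, 1), same layout as the table above.
joint = np.array([
    [0.1, 0.3],
    [0.2, 0.4],
])

# A valid joint distribution sums to 1.
assert np.isclose(joint.sum(), 1.0)

P_A = joint.sum(axis=1)  # sum over B -> [P(A=0), P(A=1)]
P_B = joint.sum(axis=0)  # sum over A -> [P(B=0), P(B=1)]

# Dividing each column by its marginal gives P(A | B); each column sums to 1.
P_A_given_B = joint / P_B

# A and B are independent only if P(A, B) = P(A) P(B) in every cell.
independent = np.allclose(joint, np.outer(P_A, P_B))

print("P(A) =", P_A)
print("P(B) =", P_B)
print("P(A=1 | B=1) =", round(P_A_given_B[1, 1], 3))
print("Independent?", independent)
```

The independence check fails for this table: P(A=0)P(B=0) = 0.4 × 0.3 = 0.12, while the table gives P(A=0, B=0) = 0.1. That is why conditioning on B changes the probability of A.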

## 2. Bayes' Theorem in an ML Context (Task 2.1)

Compute P(spam | word='offer') given:
- P(spam) = 0.4
- P(offer | spam) = 0.6
- P(offer | not spam) = 0.05

Use Bayes' rule:
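\[ P(\text{spam} \mid \text{offer}) = \frac{P(\text{offer} \mid \text{spam})\,P(\text{spam})}{P(\text{offer})} \]
where \(P(\text{offer}) = P(\text{offer} \mid \text{spam})\,P(\text{spam}) + P(\text{offer} \mid \text{not spam})\,P(\text{not spam})\).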

In [None]:
P_spam = 0.4
P_offer_given_spam = 0.6
P_offer_given_not_spam = 0.05

P_not_spam = 1 - P_spam
P_offer = P_offer_given_spam * P_spam + P_offer_given_not_spam * P_not_spam
P_spam_given_offer = (P_offer_given_spam * P_spam) / P_offer

print(f"P(offer) = {P_offer:.4f}")
print(f"P(spam | offer) = {P_spam_given_offer:.4f}")

**Explanation:**
- P(not spam) = 1 - 0.4 = 0.6
- Numerator: 0.6 × 0.4 = 0.24
- Denominator: 0.6 × 0.4 + 0.05 × 0.6 = 0.24 + 0.03 = 0.27
- So P(spam | offer) = 0.24 / 0.27 ≈ 0.8889
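
The same calculation can be wrapped in a small reusable helper. This is a minimal sketch; `bayes_posterior` and its parameter names are hypothetical, not functions from the exercises. The second call uses an assumed prior of 0.05 purely to illustrate how the prior affects the result.

```python
def bayes_posterior(prior, likelihood, likelihood_complement):
    """Return P(H | E) for a binary hypothesis H given evidence E.

    prior                 -- P(H)
    likelihood            -- P(E | H)
    likelihood_complement -- P(E | not H)
    """
    # The law of total probability gives the evidence term P(E).
    evidence = likelihood * prior + likelihood_complement * (1 - prior)
    return likelihood * prior / evidence


# Values from Task 2.1.
p = bayes_posterior(prior=0.4, likelihood=0.6, likelihood_complement=0.05)
print(f"P(spam | offer) = {p:.4f}")

# Illustrative only: a much rarer spam class with the same likelihoods.
p_rare = bayes_posterior(prior=0.05, likelihood=0.6, likelihood_complement=0.05)
print(f"With P(spam) = 0.05: {p_rare:.4f}")
```

With the rarer prior the posterior drops to about 0.39, even though the word likelihoods are unchanged. The prior matters as much as the evidence.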

## 3. Maximum Likelihood Estimation (Task 3.1 & 3.2)

**Task 3.1 (Bernoulli MLE):** 7 heads out of 10 tosses. The MLE for p is the sample proportion \(\hat{p} = k/n\), where k is the number of heads and n the number of tosses.

**Task 3.2 (Gaussian MLE):** For data from N(μ, σ²), the maximum likelihood estimates are:
- \(\hat{\mu} = \frac{1}{n} \sum x_i\)
- \(\hat{\sigma}^2 = \frac{1}{n} \sum (x_i - \hat{\mu})^2\)  (use 1/n for MLE, not 1/(n-1)).
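
For Task 3.1, the sample-proportion result follows from maximizing the log-likelihood. A short derivation, assuming the tosses are independent with a common heads probability p, and writing \(\ell(p)\) for the log-likelihood (notation introduced here):

\[ \ell(p) = k \log p + (n - k) \log(1 - p) \]
\[ \frac{d\ell}{dp} = \frac{k}{p} - \frac{n - k}{1 - p} = 0 \;\Rightarrow\; k(1 - p) = (n - k)\,p \;\Rightarrow\; \hat{p} = \frac{k}{n} \]

With k = 7 and n = 10 this gives \(\hat{p} = 0.7\).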

In [None]:
# Task 3.1: Bernoulli MLE
heads = 7
n = 10
p_hat = heads / n
print(f"MLE for Bernoulli p_hat = {p_hat:.3f}")

# Task 3.2: Gaussian MLE
np.random.seed(42)
data = np.random.normal(5, 2, 100)
mu_mle = np.mean(data)
var_mle = np.mean((data - mu_mle)**2)  # MLE uses 1/n
print(f"Gaussian MLE: mu = {mu_mle:.4f}, var = {var_mle:.4f}")

**Explanation:**
- Bernoulli: p_hat = 7/10 = 0.7
- Gaussian: the MLE uses the sample mean and the 1/n variance, which is slightly biased downward; the 1/(n-1) sample variance is the unbiased alternative.
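
NumPy exposes both variance conventions through the `ddof` argument of `np.var`, which makes the distinction easy to check. A small sketch, reusing the same seed and sample size as the cell above; the name `var_unbiased` is illustrative.

```python
import numpy as np

np.random.seed(42)
data = np.random.normal(5, 2, 100)
n = data.size

var_mle = np.var(data)               # ddof=0 (default): divides by n, the MLE
var_unbiased = np.var(data, ddof=1)  # ddof=1: divides by n - 1

print(f"MLE variance (1/n):          {var_mle:.4f}")
print(f"Unbiased variance (1/(n-1)): {var_unbiased:.4f}")

# The two differ only by the factor n / (n - 1), which shrinks as n grows.
assert np.isclose(var_unbiased, var_mle * n / (n - 1))
```

At n = 100 the gap is about 1%, so the choice matters mostly for small samples.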

## 4. Maximum A Posteriori (Task 4.1)

We place a Beta(α, β) prior on p. With a Bernoulli likelihood and k heads in n trials, the posterior is Beta(α + k, β + n - k).

The MAP estimate is the mode of the posterior. For a Beta(α_post, β_post) posterior with α_post > 1 and β_post > 1, it is:
\[ p_{MAP} = \frac{\alpha_{post} - 1}{\alpha_{post} + \beta_{post} - 2} \]

Given α_prior=2, β_prior=2, k=7, n=10 → α_post=9, β_post=5.

In [None]:
alpha_prior = 2
beta_prior = 2
heads = 7
n = 10

alpha_post = alpha_prior + heads
beta_post = beta_prior + (n - heads)
p_map = (alpha_post - 1) / (alpha_post + beta_post - 2)

print(f"Posterior Beta(α={alpha_post}, β={beta_post})")
print(f"MAP estimate = {p_map:.4f}")

**Explanation:**
- Posterior Beta(9,5)
- MAP = (9-1)/(9+5-2) = 8/12 = 0.6667
- Note: the MLE was 0.7; the Beta(2, 2) prior is symmetric about 0.5, so it pulls the MAP estimate slightly toward 0.5.
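
The pull of the prior weakens as more data arrives. The sketch below keeps the heads rate at 70% and the prior at Beta(2, 2) while increasing the number of tosses. The helper `beta_bernoulli_map` and the larger sample sizes are assumptions added for illustration, not part of the original task.

```python
def beta_bernoulli_map(alpha_prior, beta_prior, heads, n):
    """MAP estimate of p under a Beta prior and a Bernoulli likelihood.

    Uses the posterior-mode formula, which requires both posterior
    parameters to exceed 1.
    """
    alpha_post = alpha_prior + heads
    beta_post = beta_prior + (n - heads)
    if alpha_post <= 1 or beta_post <= 1:
        raise ValueError("mode formula needs alpha_post > 1 and beta_post > 1")
    return (alpha_post - 1) / (alpha_post + beta_post - 2)


# Same 70% heads rate at increasing sample sizes (integer arithmetic for heads).
estimates = {n: beta_bernoulli_map(2, 2, 7 * n // 10, n) for n in (10, 100, 1000)}

for n, p_map in estimates.items():
    print(f"n = {n:4d}: MAP = {p_map:.4f}  (MLE = 0.7000)")
```

The estimates rise from 8/12 at n = 10 toward 0.7 as n grows, which matches the intuition that the likelihood eventually dominates a fixed prior.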

## 5. Naive Bayes Classifier from Scratch (Task 5.1 & 5.2)

We'll implement a Bernoulli Naive Bayes classifier for binary features, with Laplace smoothing. Steps:
1. Compute class priors P(y=c).
2. For each feature j and class c compute P(x_j=1 | y=c) with Laplace smoothing.
3. For prediction compute log P(y=c) + sum_j log P(x_j | y=c) and take argmax.


In [None]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X_train = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0]
])
y_train = np.array([1, 1, 0, 0])

def bernoulli_nb_train(X, y, alpha=1.0):
    """Estimate class priors and smoothed P(x_j=1 | class) for binary features."""
    # alpha is the Laplace smoothing parameter
    classes = np.unique(y)
    priors = {}
    likelihoods = {}  # dict: class -> array of P(x_j=1 | class)
    for c in classes:
        X_c = X[y == c]
        priors[c] = X_c.shape[0] / X.shape[0]
        # Laplace smoothing: (count + alpha) / (N_c + 2*alpha) for Bernoulli
        counts = X_c.sum(axis=0)
        likelihoods[c] = (counts + alpha) / (X_c.shape[0] + 2 * alpha)
    return classes, priors, likelihoods

def bernoulli_nb_predict(X, classes, priors, likelihoods):
    preds = []
    for x in X:
        class_log_probs = {}
        for c in classes:
            logp = np.log(priors[c])
            p_x_given_c = likelihoods[c]
            # For each feature j: if x_j==1 use p_j, else use (1-p_j)
            logp += np.sum(np.log(p_x_given_c) * (x == 1))
            logp += np.sum(np.log(1 - p_x_given_c) * (x == 0))
            class_log_probs[c] = logp
        # choose class with max log-prob
        pred = max(class_log_probs, key=class_log_probs.get)
        preds.append(pred)
    return np.array(preds)

# Train our implementation
classes, priors, likelihoods = bernoulli_nb_train(X_train, y_train, alpha=1.0)
print("Classes:", classes)
print("Priors:", priors)
print("Likelihoods (P(x_j=1|class)):")
for c in classes:
    print(c, likelihoods[c])

# Predict on training data
preds_manual = bernoulli_nb_predict(X_train, classes, priors, likelihoods)
print("Manual NB predictions:", preds_manual)

# Compare to sklearn BernoulliNB
clf = BernoulliNB(alpha=1.0)
clf.fit(X_train, y_train)
sk_preds = clf.predict(X_train)
print("sklearn predictions:", sk_preds)

print("Are predictions identical?", np.array_equal(preds_manual, sk_preds))

**Explanation & Notes:**
- Priors: classes appear equally often (2/4 each) → priors 0.5 each.
- Likelihoods (with Laplace smoothing) estimate P(feature=1 | class).
- Prediction uses log-probabilities to avoid underflow.
- scikit-learn's `BernoulliNB` uses `alpha=1.0` and `fit_prior=True` by default, so it applies the same add-one smoothing and estimates class priors from the training frequencies; its predictions should match in this small example.
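
Beyond hard predictions, the same log-probabilities can be normalized into class posteriors P(y=c | x). The following is a vectorized sketch using only NumPy, on the same training data; the names `theta`, `log_joint`, and `posterior` are illustrative, and the log-sum-exp normalization is one common choice rather than the only one.

```python
import numpy as np

X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
alpha = 1.0  # Laplace smoothing

classes = np.unique(y)

# Smoothed P(x_j=1 | y=c), one row per class.
theta = np.array([
    (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
    for c in classes
])
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# log P(y=c) + sum_j log P(x_j | y=c), for all samples and classes at once.
log_joint = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + log_prior

# Normalize with the log-sum-exp trick to get P(y=c | x) without underflow.
shift = log_joint.max(axis=1, keepdims=True)
log_norm = shift + np.log(np.exp(log_joint - shift).sum(axis=1, keepdims=True))
posterior = np.exp(log_joint - log_norm)

print("Columns correspond to classes:", classes)
print(np.round(posterior, 3))
```

On this data each training row gets posterior 0.75 for its true class. These values should agree with `clf.predict_proba(X_train)` from the comparison cell, although that check is not run here.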

## Summary & Takeaways
- Computed marginal and conditional probabilities from a joint distribution.
- Applied Bayes' theorem to compute a posterior probability in an ML example.
- Computed MLEs for Bernoulli and Gaussian models and saw how a prior shifts the MAP estimate away from the MLE.
- Implemented a Bernoulli Naive Bayes classifier from scratch and validated it against scikit-learn.

You're now ready to move from probability-based models to the broader **Core Machine Learning Algorithms** section, which covers linear and logistic regression, decision trees, ensembles, clustering, evaluation metrics, and practical model-building.
