# Exam 4th of January 2024, 8.00-13.00 for the course 1MS041 (Introduction to Data Science / Introduktion till dataanalys)

## Instructions:
1. Complete the problems by following instructions.
2. When done, submit this file with your solutions saved, following the instruction sheet.

This exam has 3 problems for a total of 40 points; to pass you need
20 points. The bonus will be added to the score of the exam and rounded afterwards.

## Some general hints and information:
* Try to answer all questions even if you are uncertain.
* Comment your code, so that if you get the wrong answer I can understand how you thought;
this can give you some points even if the code does not run.
* Follow the instruction sheet rigorously.
* This exam is partially autograded, but your code and your free text answers are manually graded anonymously.
* If there are any questions, please ask the exam guards, they will escalate it to me if necessary.

## Tips for free text answers
* Be VERY clear with your reasoning, there should be zero ambiguity in what you are referring to.
* If you want to include math, you can write LaTeX in the Markdown cells, for instance `$f(x)=x^2$` will be rendered as $f(x)=x^2$ and `$$f(x) = x^2$$` will become an equation line, as follows
$$f(x) = x^2$$
Another example is `$$f_{Y \mid X}(y,x) = P(Y = y \mid X = x) = \exp(\alpha \cdot x + \beta)$$` which renders as
$$f_{Y \mid X}(y,x) = P(Y = y \mid X = x) = \exp(\alpha \cdot x + \beta)$$

## Finally some rules:
* You may not communicate with others during the exam, for example:
    * You cannot ask for help on Stack Overflow or other such help forums during the exam.
    * You may not communicate with AIs, for instance ChatGPT.
    * Your on-line and off-line activity is being monitored according to the examination rules.

## Good luck!

In [None]:
# Insert your anonymous exam ID as a string in the variable below
examID="XXX"


---
## Exam vB, PROBLEM 1
Maximum Points = 14


In this problem you will do rejection sampling from complicated distributions. You will also use your samples to compute certain integrals, a method known as Monte Carlo integration. (Keep in mind that choosing a good sampling distribution is often key to avoiding too much rejection.)

1. [4p] Fill in the remaining part of the function `problem1_inversion` in order to produce samples from the below distribution using rejection sampling:

$$
    F[x] = 
    \begin{cases}
        0, & x \leq 0 \\
        \frac{e^{x^2}-1}{e-1}, & 0 < x < 1 \\
        1, & x \geq 1
    \end{cases}
$$

2. [2p] Produce 100000 samples from the above distribution (**use fewer if it times out and you cannot find a solution**), store them in `problem1_samples`, and plot the histogram together with the true density. *(There is a timeout decorator on this function and if it takes more than 10 seconds to generate 100000 samples it will time out and it will count as if you failed to generate.)*
3. [2p] Use the above 100000 samples (`problem1_samples`) to approximately compute the integral

$$
    \int_0^{1} \sin(x) \frac{2e^{x^2} x}{e-1} dx
$$
and store the result in `problem1_integral`.

4. [2p] Use Hoeffding's inequality to produce a 95\% confidence interval of the integral above and store the result as a tuple in the variable `problem1_interval`

5. [4p] Fill in the remaining part of the function `problem1_inversion_2` in order to produce samples from the below distribution using rejection sampling:
$$
    F[x] = 
    \begin{cases}
        0, & x \leq 0 \\
        20xe^{20-1/x}, & 0 < x < \frac{1}{20} \\
        1, & x \geq \frac{1}{20}
    \end{cases}
$$
Hint: this is tricky because if you choose the wrong sampling distribution you reject at least 9 times out of 10. You will get points based on how long your code takes to create a certain number of samples, if you choose the correct sampling distribution you can easily create 100000 samples within 2 seconds.

### Part 1 Explanation: Rejection Sampling

**Problem:** Implement rejection sampling to generate samples from a distribution with CDF:
$$F[x] = \frac{e^{x^2}-1}{e-1} \text{ for } 0 < x < 1$$

**Solution Approach:**
1. **Get the PDF:** Differentiate the CDF to get $f(x) = \frac{2xe^{x^2}}{e-1}$
2. **Choose proposal distribution:** Use Uniform(0,1) with density $g(x) = 1$
3. **Find constant M:** Need $f(x) \leq M \cdot g(x)$ for all x
   - Find max of $f(x)$ by taking derivative: $f'(x) = \frac{2e^{x^2}(1+2x^2)}{e-1} > 0$
   - Since $f(x)$ is increasing, max occurs at $x=1$: $M = f(1) = \frac{2e}{e-1} \approx 3.16$
4. **Accept/Reject:** Sample $x \sim$ Uniform(0,1), accept with probability $\frac{f(x)}{M}$
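
The four steps above can also be sketched in vectorized form, which proposes whole batches at once instead of one point per loop iteration. This is only an illustration: the function name `sample_part1_vectorized`, the batch-size heuristic and the use of `np.random.default_rng` are assumptions of this sketch, not part of the exam template.

```python
import numpy as np

def sample_part1_vectorized(n_samples, rng=None):
    """Rejection sampling for f(x) = 2x e^{x^2}/(e-1) on (0,1) with a Uniform(0,1) proposal."""
    rng = np.random.default_rng() if rng is None else rng
    M = 2 * np.e / (np.e - 1)              # sup of f/g with g = 1, attained at x = 1
    accepted = np.empty(0)
    while accepted.size < n_samples:
        # On average about M proposals are needed per accepted sample
        batch = int((n_samples - accepted.size) * M * 1.1) + 10
        x = rng.uniform(0.0, 1.0, size=batch)   # proposals
        u = rng.uniform(0.0, 1.0, size=batch)   # acceptance thresholds
        f_x = 2 * x * np.exp(x**2) / (np.e - 1)
        accepted = np.concatenate([accepted, x[u <= f_x / M]])
    return accepted[:n_samples]

samples = sample_part1_vectorized(20000, rng=np.random.default_rng(0))
# Compare the empirical CDF at 0.5 with F(0.5) = (e^0.25 - 1)/(e - 1)
print(np.mean(samples <= 0.5), (np.exp(0.25) - 1) / (np.e - 1))
```

Comparing the empirical CDF with $F$ at a few points is a cheap sanity check that the accept/reject step targets the right distribution.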

In [20]:

# Part 1

import numpy as np

def problem1_inversion(n_samples=1):
    # Distribution from part 1
    # write the code in this function to produce samples from the distribution in the assignment
    # Make sure you choose a good sampling distribution to avoid unnecessary rejections

    samples = []
    # The density is f(x) = 2*e^(x^2)*x / (e-1) for 0 < x < 1
    # We use uniform(0,1) as proposal and find the constant M
    # The maximum of f(x) occurs at x=1, where f(1) = 2*e/(e-1)
    M = 2 * np.e / (np.e - 1)
    
    while len(samples) < n_samples:
        # Sample from uniform(0,1)
        x = np.random.uniform(0, 1)
        u = np.random.uniform(0, 1)
        
        # Target density
        f_x = 2 * np.exp(x**2) * x / (np.e - 1)
        
        # Acceptance probability: f(x) / (M * g(x)) where g(x) = 1 for uniform
        if u <= f_x / M:
            samples.append(x)
    
    return np.array(samples)

### Part 2 Explanation: Generate Samples

**Problem:** Generate 100,000 samples from the distribution (must complete in under 10 seconds).

**Solution:** Simply call the function from Part 1 with `n_samples=100000`. The rejection sampling algorithm should have a ~32% acceptance rate ($1/M \approx 0.32$), making it efficient enough to generate 100,000 samples quickly.

In [21]:
# Part 2

problem1_samples = problem1_inversion(n_samples=100000)


### Part 3 Explanation: Monte Carlo Integration

**Problem:** Use samples to approximate the integral:
$$\int_0^{1} \sin(x) \frac{2e^{x^2} x}{e-1} dx$$

**Key Insight:** Notice that $\frac{2e^{x^2} x}{e-1} = f(x)$ is our density! So we're computing:
$$\int_0^{1} \sin(x) \cdot f(x) dx = \mathbb{E}_{X \sim f}[\sin(X)]$$

**Solution Approach - Monte Carlo Method:**
- When you have samples $x_1, ..., x_n$ from density $f(x)$:
$$\int h(x) f(x) dx \approx \frac{1}{n}\sum_{i=1}^{n} h(x_i)$$
- Here $h(x) = \sin(x)$
- **Important:** Evaluate only $h(x) = \sin(x)$ on samples, NOT $\sin(x) \cdot f(x)$
- The density weighting is already built into the sampling process!
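
As a cross-check that does not rely on sampling, the same integral can be approximated with a deterministic midpoint rule. The sketch below is an illustration only; the grid size `n_cells` and the variable names are assumptions. Its result should agree with the Monte Carlo estimate up to the Monte Carlo error.

```python
import numpy as np

def integrand(x):
    # sin(x) * f(x), with f(x) = 2x e^{x^2}/(e-1) the density from Part 1
    return np.sin(x) * 2 * x * np.exp(x**2) / (np.e - 1)

n_cells = 100_000
midpoints = (np.arange(n_cells) + 0.5) / n_cells         # cell midpoints on [0, 1]
quadrature_value = float(np.mean(integrand(midpoints)))  # midpoint rule, interval length 1
print(quadrature_value)
```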

In [26]:
# Part 3
import math

def problem1_integral_function(x):
    # For Monte Carlo integration with samples from f(x), we only need h(x) = sin(x)
    # The samples are already weighted by the density f(x), so we don't multiply by it
    return math.sin(x)

problem1_integral = np.mean([problem1_integral_function(x) for x in problem1_samples])


In [27]:
problem1_integral

0.6527260087748915

In [28]:
def monte_carlo_expectation(samples, h_fn):
    samples = np.asarray(samples)
    vals = np.array([h_fn(x) for x in samples])
    return float(np.mean(vals)), np.asarray(vals, float)

In [29]:
monte_carlo_expectation(problem1_samples, problem1_integral_function)

(0.6527260087748915,
 array([0.56378407, 0.78960286, 0.79544069, ..., 0.60682963, 0.47265607,
        0.58326667]))

### Part 4 Explanation: Confidence Interval using Hoeffding's Inequality

**Problem:** Provide a 95% confidence interval for the integral computed in Part 3.

**Hoeffding's Inequality Formula:**
For bounded random variables $Z_1, ..., Z_n$ where $a \leq Z_i \leq b$, with probability at least $1-\alpha$:
$$\left|\bar{Z} - \mathbb{E}[Z]\right| \leq (b-a)\sqrt{\frac{\ln(2/\alpha)}{2n}}$$

**Solution Approach:**
1. **Identify the random variables:** $Z_i = h(x_i) = \sin(x_i)$ (NOT the $x_i$ values!)
2. **Find bounds:** Since $x \in [0,1]$ and sin is increasing on this interval:
   - $a = \sin(0) = 0$
   - $b = \sin(1) \approx 0.841$
3. **Compute radius:** $r = (b-a)\sqrt{\frac{\ln(2/\alpha)}{2n}}$ with $\alpha=0.05$ for 95% CI
4. **Build interval:** $(mean - r, mean + r)$
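
The radius formula can be turned around to ask how many samples are needed for a given precision: solving $r = (b-a)\sqrt{\ln(2/\alpha)/(2n)}$ for $n$ gives $n \geq (b-a)^2 \ln(2/\alpha) / (2r^2)$. A minimal sketch, with hypothetical helper names `hoeffding_radius` and `hoeffding_sample_size`:

```python
import math

def hoeffding_radius(n, a, b, alpha=0.05):
    """Half-width of the Hoeffding interval for the mean of n values in [a, b]."""
    return (b - a) * math.sqrt(math.log(2 / alpha) / (2 * n))

def hoeffding_sample_size(radius, a, b, alpha=0.05):
    """Smallest n for which the Hoeffding half-width is at most `radius`."""
    return math.ceil((b - a) ** 2 * math.log(2 / alpha) / (2 * radius ** 2))

# Half-width for the setting of this problem: n = 100000, values in [0, sin(1)]
print(hoeffding_radius(100_000, 0.0, math.sin(1.0)))
# Samples needed for a half-width of 0.01 when values lie in [0, 1]
print(hoeffding_sample_size(0.01, 0.0, 1.0))
```

Because the radius shrinks like $1/\sqrt{n}$, halving the interval width requires four times as many samples.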

In [14]:
import math

def hoeffding_ci(values, a, b, alpha=0.05):
    values = np.asarray(values, float)
    n = len(values)
    mean = float(np.mean(values))
    radius = (b - a) * math.sqrt(math.log(2/alpha) / (2*n))
    return (mean - radius, mean + radius), mean, radius

In [30]:
# Part 4

# Compute sin(x) values for all samples - these are what we're taking the mean of
sin_values = np.array([math.sin(x) for x in problem1_samples])

# Bounds for sin(x) on [0,1]: sin is increasing, so min=sin(0)=0, max=sin(1)
problem1_interval = hoeffding_ci(sin_values, 0, math.sin(1), alpha=0.05)[0]
problem1_interval


(0.6491121483150276, 0.6563398692347554)

### Part 5 Explanation: Efficient Rejection Sampling (Challenging!)

**Problem:** Sample from a distribution with CDF:
$$F[x] = 20xe^{20-1/x} \text{ for } 0 < x < \frac{1}{20}$$

**Challenge:** Wrong proposal distribution → 90%+ rejection rate → very slow!

**Solution Approach:**
1. **Get the PDF:** Differentiate to get $f(x) = \frac{20e^{20-1/x}(x+1)}{x}$ for $0 < x < 0.05$
2. **Analyze the density:** It is nearly zero near $x=0$ and rises very steeply as $x \to 1/20$, where it reaches its maximum $f(1/20) = 20 \cdot 21 = 420$
3. **Proposal choice:** Use Beta(2,1) scaled to [0, 1/20]
   - Beta(2,1) has density $2u$ for $u \in [0,1]$, concentrating near 1
   - With $x = u/20$ the change of variables gives $g(x) = 2(20x) \cdot 20 = 800x$ for $x \in [0, 1/20]$
   - Like the target, this puts more mass near the upper end
4. **Find M:** $M = \sup_x \frac{f(x)}{g(x)}$. The ratio is increasing on $(0, 1/20]$, so $M = \frac{f(1/20)}{g(1/20)} = \frac{420}{40} = 10.5$
5. **Accept/Reject:** Accept with probability $\frac{f(x)}{M \cdot g(x)}$

**How well this works:** The acceptance rate is $1/M \approx 9.5\%$, about twice that of Uniform(0, 1/20) ($M = 420/20 = 21$, about 4.8%). Both still reject roughly 9 proposals out of 10, which is the situation the hint warns about; a proposal that rises as steeply as the target near $x=1/20$ is needed to do substantially better.

In [39]:
# Part 5

def problem1_inversion_2(n_samples=1):
    # Distribution from part 5: domain is 0 < x < 1/20
    # PDF: f(x) = 20*e^(20-1/x)*(x+1)/x for 0 < x < 1/20
    # Use Beta(2,1) scaled to [0, 1/20] as the proposal distribution
    samples = []
    
    # Proposal density of Beta(2,1)/20: g(x) = 2*(20x)*20 = 800*x
    # f/g is increasing on (0, 1/20], so M = f(1/20)/g(1/20) = 420/40 = 10.5
    M = 10.5
    
    while len(samples) < n_samples:
        # Sample from Beta(2,1) which concentrates near 1, then scale to [0, 1/20]
        x = np.random.beta(2, 1) / 20
        
        # Avoid division by zero
        if x < 1e-10:
            continue
            
        u = np.random.uniform(0, 1)
        
        # Target density
        f_x = 20 * np.exp(20 - 1/x) * (x + 1) / x
        
        # Proposal density of Beta(2,1)/20 on [0, 1/20]
        g_x = 800 * x
        
        # Acceptance probability f(x) / (M * g(x)), at most 1 by the choice of M
        if u <= f_x / (M * g_x):
            samples.append(x)
    
    return np.array(samples)
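
A tighter proposal for the Part 5 density can be sketched as follows. It is an assumption of this sketch, not something the exam text specifies: take $g(x) = e^{20-1/x}/x^2$ on $(0, 1/20]$, whose CDF $G(x) = e^{20-1/x}$ inverts in closed form to $x = 1/(20 - \ln v)$. Then $f(x)/g(x) = 20x(x+1) \leq 1.05$, so about 95% of proposals are accepted. The function name `sample_part5_tight` and the batching are illustrative.

```python
import numpy as np

def sample_part5_tight(n_samples, rng=None):
    """Rejection sampling for f(x) = 20 e^{20-1/x} (x+1)/x on (0, 1/20]."""
    rng = np.random.default_rng() if rng is None else rng
    M = 1.05                                   # sup of f/g = 20x(x+1), attained at x = 1/20
    accepted = np.empty(0)
    while accepted.size < n_samples:
        batch = int((n_samples - accepted.size) * M * 1.1) + 10
        v = 1.0 - rng.random(batch)            # uniform on (0, 1], avoids log(0)
        x = 1.0 / (20.0 - np.log(v))           # inverse of G(x) = exp(20 - 1/x)
        u = rng.random(batch)
        accept_prob = 20.0 * x * (x + 1.0) / M # equals f(x) / (M g(x))
        accepted = np.concatenate([accepted, x[u <= accept_prob]])
    return accepted[:n_samples]

samples_tight = sample_part5_tight(20000, rng=np.random.default_rng(1))
# Compare the empirical CDF at 0.048 with F(0.048) = 20 * 0.048 * exp(20 - 1/0.048)
print(np.mean(samples_tight <= 0.048), 20 * 0.048 * np.exp(20 - 1 / 0.048))
```

Because the proposal is drawn by inverting its CDF and almost every draw is accepted, the cost per sample is close to that of plain inverse-transform sampling.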


### Deep Dive: Why Beta(2,1) and What Are the Alternatives?

**Why Beta(2,1) Works Well Here:**

The target density $f(x) = \frac{20e^{20-1/x}(x+1)}{x}$ has these characteristics:
- **Domain:** $(0, 1/20)$ - very narrow interval
- **Shape:** Nearly zero near $x=0$, rises steeply to its maximum $f(1/20) = 420$
- **Behavior:** Highly concentrated near the upper bound

Beta(2,1) has density $2u$ on $[0,1]$; substituting $u = 20x$ and multiplying by the Jacobian 20 gives $g(x) = 800x$ on $[0, 1/20]$:
- ✓ Also concentrates near the upper end (linear growth)
- ✓ Matches the "heavy on the right" shape
- ✓ Easy to sample from: `np.random.beta(2,1) / 20`
- ✓ Simple density formula for accept/reject calculation

**General Principles for Choosing Proposal Distributions:**

1. **Shape Matching:** Proposal should have similar shape to target
   - If target increases → use Beta(a,1) with a>1
   - If target decreases → use Beta(1,b) with b>1
   - If target is symmetric → use Beta(a,a) or truncated Normal

2. **Support Matching:** Proposal must cover target's entire domain
   - Target on $(0,1)$ → Uniform(0,1) or Beta
   - Target on $(0,\infty)$ → Exponential, Gamma
   - Target on $(-\infty,\infty)$ → Normal, Cauchy

3. **Efficiency:** Small M means high acceptance rate
   - Goal: $M = \max_x \frac{f(x)}{g(x)}$ should be small
   - Achieved when $g(x)$ closely "wraps" $f(x)$

**Alternative Proposals for This Problem:**

| Proposal | Density | When to Use | Pros | Cons |
|----------|---------|-------------|------|------|
| **Uniform(0,1/20)** | $g(x) = 20$ | Simple baseline | Easy to implement | $M = 21$, about 95% rejection |
| **Beta(2,1)/20** | $g(x) = 800x$ | Target increases toward the upper end | $M = 10.5$, about twice the acceptance of uniform | Still about 90% rejection |
| **Beta(3,1)/20** | $g(x) = 24000x^2$ | Target grows faster than linearly | $M = 7$, about 14% acceptance | Still much flatter than the target |
| **Truncated Exp** | $g(x) \propto e^{-\lambda x}$ on $(0,1/20)$ | Target decays exponentially | Natural for exp-like targets | Wrong shape here (decreasing) |
| **Custom Linear** | $g(x) = cx$ on $(0,1/20)$ | When you want simple linear | Mimics Beta(2,1) behavior | Need to normalize manually ($c = 800$) |

**How to Choose in Practice:**

1. **Plot the target density** $f(x)$ (or estimate its shape)
2. **Identify key features:**
   - Where is the mass concentrated?
   - Is it increasing, decreasing, or unimodal?
   - Does it have heavy tails?
3. **Match the shape:**
   - Concentrated near 0 → Beta(1,b) or Exponential
   - Concentrated near 1 → Beta(a,1)
   - Bell-shaped → Normal or Beta(a,a)
4. **Test with small sample:**
   ```python
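   # Estimate the acceptance rate from 1000 proposals
   # (shown for the Part 1 target with a Uniform(0,1) proposal)
   import numpy as np
   x = np.random.uniform(0, 1, size=1000)      # proposals
   u = np.random.uniform(0, 1, size=1000)
   f_x = 2 * x * np.exp(x**2) / (np.e - 1)     # target density
   M = 2 * np.e / (np.e - 1)                   # bound on f/g with g = 1
   accept = u <= f_x / M
   print(f"Acceptance: {accept.mean():.2%}")
   # Aim for >10% acceptance, ideally >30%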
   ```

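**For This Specific Problem** (the acceptance rate is $1/M$ with $M = \sup_x f(x)/g(x)$):
- Uniform gives $M = 21$, about 4.8% acceptance
- Beta(2,1) gives $M = 10.5$, about 9.5% acceptance
- Beta(3,1) gives $M = 7$, about 14% acceptance
- More generally, Beta(a,1) gives $M = 21/a$ for $a \leq 20$, so Beta(20,1) reaches $M = 1.05$, about 95% acceptance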

The key insight: **Match the shape where the density is highest** to avoid wasting samples in low-probability regions.
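
The constant $M$ for a candidate proposal can be checked numerically before any sampling. The sketch below evaluates $f/g$ on a grid for scaled Beta(a,1) proposals; the helper names, the grid and the chosen values of `a` are assumptions made for illustration. The density of $U/20$ with $U \sim$ Beta(a,1) is $a(20x)^{a-1} \cdot 20$, and $a=1$ is the uniform proposal.

```python
import numpy as np

def target_pdf(x):
    # f(x) = 20 e^{20 - 1/x} (x + 1) / x on (0, 1/20]
    return 20.0 * np.exp(20.0 - 1.0 / x) * (x + 1.0) / x

def scaled_beta_pdf(x, a):
    # Density of U/20 with U ~ Beta(a, 1): a * (20x)^(a-1) * 20
    return a * (20.0 * x) ** (a - 1) * 20.0

grid = np.linspace(1e-3, 0.05, 200_001)   # includes the right endpoint x = 1/20
M_values = {}
for a in (1, 2, 3, 20):
    ratio = target_pdf(grid) / scaled_beta_pdf(grid, a)
    M_values[a] = float(ratio.max())
    print(f"a={a:2d}  M={M_values[a]:.3f}  acceptance={1 / M_values[a]:.1%}")
```

A grid maximum only approximates the supremum in general; here the ratio is increasing for $a \leq 20$, so the maximum sits at the right endpoint, which the grid contains.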

In [40]:
problem1_inversion_2(n_samples=100000)

array([0.04851716, 0.04883521, 0.04887666, ..., 0.04772446, 0.04985963,
       0.04893373])

---
#### Local Test for Exam vB, PROBLEM 1
Evaluate cell below to make sure your answer is valid.                             You **should not** modify anything in the cell below when evaluating it to do a local test of                             your solution.
You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.

In [None]:

# This cell is just to check that you got the correct formats of your answer
import numpy as np
try:
    assert(isinstance(problem1_inversion(10), np.ndarray)) 
except:
    print("Try again. You should return a numpy array from problem1_inversion")
else:
    print("Good, your problem1_inversion returns a numpy array")

try:
    assert(isinstance(problem1_samples, np.ndarray)) 
except:
    print("Try again. your problem1_samples is not a numpy array")
else:
    print("Good, your problem1_samples is a numpy array")

try:
    assert(isinstance(problem1_integral, float)) 
except:
    print("Try again. your problem1_integral is not a float")
else:
    print("Good, your problem1_integral is a float")

try:
    assert(isinstance(problem1_interval, list) or isinstance(problem1_interval, tuple)) , "problem1_interval not a tuple or list"
    assert(len(problem1_interval) == 2) , "problem1_interval does not have length 2, it should have a lower bound and an upper bound"
except Exception as e:
    print(e)
else:
    print("Good, your problem1_interval is a tuple or list of length 2")

try:
    assert(isinstance(problem1_inversion_2(10), np.ndarray)) 
except:
    print("Try again. You should return a numpy array from problem1_inversion_2")
else:
    print("Good, your problem1_inversion_2 returns a numpy array")

---
## Exam vB, PROBLEM 2
Maximum Points = 13


Let us build a proportional model ($\mathbb{P}(Y=1 \mid X) = G(\beta_0+\beta \cdot X)$ where $G$ is the logistic function) for the spam vs not spam data. Here we assume that the features are presence vs not presence of a word, let $X_1,X_2,X_3$ denote the presence (1) or absence (0) of the words $("free", "prize", "win")$.

1. [2p] Load the file `data/spam.csv` and create two numpy arrays: `problem2_X`, which has shape (n_emails,3) and where each feature corresponds to $X_1,X_2,X_3$ from above, and `problem2_Y`, which has shape **(n_emails,)** and consists of a $1$ if the email is spam and $0$ if it is not. Split this data into train, calibration and test sets with the split $40\%$, $20\%$, $40\%$, and put this data in the designated variables in the code cell.

2. [4p] Follow the calculation from the lecture notes where we derive the logistic regression and implement the final loss function inside the class `ProportionalSpam`. You can use the `Test` cell to check that it gives the correct value for a test-point.

3. [4p] Train the model `problem2_ps` on the training data. The goal is to calibrate the probabilities output from the model. Start by creating a new variable `problem2_X_pred` (shape `(n_samples,1)`) which consists of the predictions of `problem2_ps` on the calibration dataset. Then train a calibration model using `sklearn.tree.DecisionTreeRegressor`, store this trained model in `problem2_calibrator`.

4. [3p] Use the trained model `problem2_ps` and the calibrator `problem2_calibrator` to make final predictions on the testing data, store the prediction in `problem2_final_predictions`. Compute the $0-1$ test-loss and store it in `problem2_01_loss` and provide a $99\%$ confidence interval of it, store this in the variable `problem2_interval`, this should again be a tuple as in **problem1**.

### Part 1 Explanation: Data Loading and Splitting

**Problem:** Load spam data and create train/calibration/test splits (40%/20%/40%).

**Solution Approach:**
1. **Load data:** Read `data/spam.csv` containing email text and spam labels
2. **Feature extraction:** Create binary features $X_1, X_2, X_3$ for presence of words "free", "prize", "win"
   - $X_i = 1$ if word is present in email, $X_i = 0$ otherwise
3. **Target variable:** Convert labels to binary: $Y = 1$ for spam, $Y = 0$ for ham (not spam)
4. **Split data:** Use stratified splitting to maintain class balance:
   - Training set: 40% for model fitting
   - Calibration set: 20% for probability calibration
   - Test set: 40% for final evaluation
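
One detail of the feature extraction in step 2 is whether "presence of a word" means a substring match or a whole-word match. The exam text does not say which is intended, so the following is only a sketch of the whole-word variant using the standard `re` module; the function name `word_presence_features` and the example sentences are made up for illustration.

```python
import re

WORDS = ("free", "prize", "win")

def word_presence_features(texts, words=WORDS):
    """Return one row [X1, X2, X3] per text, with 1 if the word occurs as a whole word."""
    patterns = [re.compile(rf"\b{re.escape(w)}\b", re.IGNORECASE) for w in words]
    return [[int(p.search(t) is not None) for p in patterns] for t in texts]

examples = ["Free entry to WIN a prize", "The winner is at the window", "Ok lar"]
features = word_presence_features(examples)
print(features)
```

With substring matching the second example would count as containing "win" (inside "winner" and "window"); with word boundaries it does not.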

In [110]:
import pandas as pd
try:
    df = pd.read_csv('data/spam.csv', encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv('data/spam.csv', encoding='latin-1')

In [111]:
df

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [112]:
from sklearn.model_selection import train_test_split

def train_test_calibration_split(X, y, train_frac=0.4, calib_frac=0.2, test_frac=0.4, random_state=42):
    # Split off the training set first, then divide the remainder into calibration and test sets
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=1-train_frac, random_state=random_state, stratify=y)
    # test_size is the share of the remainder that goes to the test set
    X_cal, X_te, y_cal, y_te = train_test_split(X_tmp, y_tmp, test_size=test_frac/(calib_frac+test_frac), random_state=random_state, stratify=y_tmp)

    return X_tr, X_cal, X_te, y_tr, y_cal, y_te

In [113]:
# Part 1

# Create binary features for presence of words "free", "prize", "win"
text_data = df['v2'].astype(str).str.lower()  # Convert to lowercase for case-insensitive matching

# Create feature matrix: each column is 1 if word is present, 0 otherwise
# Note: str.contains does substring matching, so e.g. "winner" also counts as "win"
df['is_free'] = text_data.str.contains('free').astype(int)
df['is_prize'] = text_data.str.contains('prize').astype(int)
df['is_win'] = text_data.str.contains('win').astype(int)
problem2_X = df[['is_free', 'is_prize', 'is_win']].values

problem2_Y = df['v1'].map({'ham':0, 'spam':1}).values

# The problem asks for a 40% / 20% / 40% train / calibration / test split
problem2_X_train, problem2_X_calib, problem2_X_test, problem2_Y_train, problem2_Y_calib, problem2_Y_test = train_test_calibration_split(problem2_X, problem2_Y, train_frac=0.4, calib_frac=0.2, test_frac=0.4, random_state=42)

print(problem2_X_train.shape,problem2_X_calib.shape,problem2_X_test.shape,problem2_Y_train.shape,problem2_Y_calib.shape,problem2_Y_test.shape)

sum_shapes = (problem2_X_train.shape[0] + problem2_X_calib.shape[0] + problem2_X_test.shape[0],
              problem2_Y_train.shape[0] + problem2_Y_calib.shape[0] + problem2_Y_test.shape[0])

# Fraction of the data in each split, rounded for readability
print(np.round([problem2_X_train.shape[0]/sum_shapes[0],problem2_X_calib.shape[0]/sum_shapes[0],problem2_X_test.shape[0]/sum_shapes[0],problem2_Y_train.shape[0]/sum_shapes[1],problem2_Y_calib.shape[0]/sum_shapes[1],problem2_Y_test.shape[0]/sum_shapes[1]], 4))


(2228, 3) (1114, 3) (2230, 3) (2228,) (1114,) (2230,)
[0.3999 0.1999 0.4002 0.3999 0.1999 0.4002]


### Part 2 Explanation: Logistic Regression Loss Function

**Problem:** Implement the loss function for logistic regression (proportional model).

**Mathematical Background:**
- **Model:** $\mathbb{P}(Y=1 \mid X) = G(\beta_0 + \beta \cdot X)$ where $G(z) = \frac{e^z}{1+e^z}$ (logistic function)
- **Loss:** Negative log-likelihood (mean):
$$L(\beta_0, \beta) = -\frac{1}{n}\sum_{i=1}^{n} \left[y_i \log(p_i) + (1-y_i)\log(1-p_i)\right]$$
where $p_i = G(\beta_0 + \beta \cdot x_i)$ is the predicted probability

**Solution Approach:**
1. **Linear predictor:** Compute $z_i = \beta_0 + \beta \cdot x_i$ for each sample
2. **Apply logistic:** $p_i = G(z_i) = \frac{e^{z_i}}{1+e^{z_i}}$
3. **Numerical stability:** Clip probabilities to $[\epsilon, 1-\epsilon]$ to avoid $\log(0)$
4. **Compute loss:** Calculate the **mean** (not sum) of the log-loss over all samples
   - **Important:** Use `-(1.0/n) * np.sum(...)` to get the average loss per sample
   - This is the standard form used in optimization and matches expected test values
5. **Optimization:** Use scipy's minimize to find optimal $\beta_0, \beta$
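
Steps 2 to 4 can also be combined into a form that needs no clipping. Since $\log p_i = z_i - \log(1+e^{z_i})$ and $\log(1-p_i) = -\log(1+e^{z_i})$, the per-sample loss equals $\log(1+e^{z_i}) - y_i z_i$, and `np.logaddexp(0, z)` evaluates $\log(1+e^{z})$ without overflow. The function below is a sketch of that variant (the name `stable_logistic_loss` is an assumption); it is checked against the test point from the local test cell.

```python
import numpy as np

def stable_logistic_loss(X, Y, coeffs):
    """Mean negative log-likelihood of the logistic model, written as log(1 + e^z) - y*z."""
    X = np.asarray(X, float)
    Y = np.asarray(Y, float)
    coeffs = np.asarray(coeffs, float)
    z = X @ coeffs[1:] + coeffs[0]           # linear predictor beta_0 + beta . x
    return float(np.mean(np.logaddexp(0.0, z) - Y * z))

test_value = stable_logistic_loss([[1, 0, 1], [0, 1, 1]], [1, 0], [1.2, 0.4, 0.3, 0.9])
print(test_value)
```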

In [122]:
# Part 2
import numpy as np


class ProportionalSpam(object):
    def __init__(self):
        self.coeffs = None
        self.result = None
    
    # define the objective/cost/loss function we want to minimise
    def loss(self,X,Y,coeffs):
        # Logistic regression loss function (negative log-likelihood)
        # For binary classification with Y in {0,1}
        # Model: P(Y=1|X) = G(β₀ + β·X) where G is logistic function
        # Loss = -(1/n)*sum[Y*log(p) + (1-Y)*log(1-p)] where p = G(β₀ + β·X)

        G = lambda x: np.exp(x)/(1+np.exp(x))
        n = len(Y)

        # Extract intercept and coefficients
        beta0 = coeffs[0]
        beta = coeffs[1:]

        # Compute predicted probabilities
        linear_pred = np.dot(X, beta) + beta0
        p = G(linear_pred)

        # Compute log-loss (with numerical stability)
        eps = 1e-12
        p = np.clip(p, eps, 1 - eps)
        loss = -(1.0/n) * np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

        return loss

    def fit(self,X,Y):
        import numpy as np
        from scipy import optimize

        #Use the f above together with an optimization method from scipy
        #to find the coefficients of the model
        opt_loss = lambda coeffs: self.loss(X,Y,coeffs)
        initial_arguments = np.zeros(shape=X.shape[1]+1)
        self.result = optimize.minimize(opt_loss, initial_arguments,method='cg')
        self.coeffs = self.result.x
    
    def predict(self,X):
        #Use the trained model to predict Y
        if (self.coeffs is not None):
            G = lambda x: np.exp(x)/(1+np.exp(x))
            return np.round(10*G(np.dot(X,self.coeffs[1:])+self.coeffs[0]))/10 # This rounding is to help you with the calibration


### Part 3 Explanation: Probability Calibration

**Problem:** Train the logistic model and calibrate its probability predictions.

**Why Calibrate?**
- Raw model outputs may not represent true probabilities well
- Calibration adjusts predictions to better match actual frequencies
- Example: If model predicts 70% spam for many emails, ideally ~70% should actually be spam

**Solution Approach:**
1. **Train base model:** Fit `ProportionalSpam` on training data to get $\beta_0, \beta$
2. **Get calibration predictions:** Apply trained model to calibration set
   - These are the "uncalibrated" probabilities: $\hat{p}_i = G(\beta_0 + \beta \cdot x_i)$
3. **Train calibrator:** Use `DecisionTreeRegressor` to learn mapping from predicted $\hat{p}$ to true labels $y$
   - Input: uncalibrated probabilities from calibration set
   - Target: actual labels from calibration set
   - Learns function: $\hat{p} \rightarrow p_{\text{calibrated}}$
4. **Result:** A two-stage predictor: first apply logistic model, then apply calibration tree
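
Because `predict` rounds its output to one decimal, the calibrator only ever sees a handful of distinct input values. For such an input, a regression tree that is grown until every distinct value is separated predicts the mean label within each group, so the idea behind step 3 can be sketched with plain NumPy. This is an illustration under that assumption, with hypothetical function names; a depth-limited tree may merge neighbouring groups and handles unseen values with thresholds rather than nearest neighbours.

```python
import numpy as np

def fit_group_mean_calibrator(pred, y):
    """For each distinct predicted value, store the mean label observed in the calibration set."""
    pred = np.asarray(pred, float).ravel()
    y = np.asarray(y, float)
    levels = np.unique(pred)
    means = np.array([y[pred == v].mean() for v in levels])
    return levels, means

def apply_calibrator(levels, means, pred):
    """Map each prediction to the calibrated value of the nearest level seen during fitting."""
    pred = np.asarray(pred, float).ravel()
    idx = np.abs(pred[:, None] - levels[None, :]).argmin(axis=1)
    return means[idx]

# Toy calibration set (made-up numbers): predictions of 0.1 are right half the time
levels, means = fit_group_mean_calibrator([0.1, 0.1, 0.9, 0.9], [0, 1, 1, 1])
calibrated = apply_calibrator(levels, means, [0.9, 0.1, 0.8])
print(levels, means, calibrated)
```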

In [119]:
# Part 3

from sklearn.tree import DecisionTreeRegressor


problem2_ps = ProportionalSpam()
problem2_ps.fit(problem2_X_train, problem2_Y_train)

# Predictions on the calibration set, reshaped to 2D (required for sklearn)
problem2_X_pred = problem2_ps.predict(problem2_X_calib).reshape(-1, 1)

# Train a decision tree regressor to calibrate the predictions
problem2_calibrator = DecisionTreeRegressor(max_depth=3, random_state=42)
problem2_calibrator.fit(problem2_X_pred, problem2_Y_calib)


### Part 4 Explanation: Test Evaluation and Confidence Interval

**Problem:** Evaluate calibrated model on test set and provide 99% confidence interval for 0-1 loss.

**Solution Approach:**
1. **Make predictions:** Apply two-stage model to test set:
   - Stage 1: Get uncalibrated probabilities from `problem2_ps`
   - Stage 2: Pass through `problem2_calibrator` to get calibrated probabilities
2. **Binary decisions:** Use Bayes classifier threshold of 0.5:
   - Predict spam if $p \geq 0.5$, otherwise predict ham
3. **Compute 0-1 loss:** Fraction of incorrect predictions:
$$L_{0-1} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[\hat{y}_i \neq y_i]$$
4. **Confidence interval:** For 0-1 loss on test set (Bernoulli trials):
   - Can use Hoeffding's inequality with $a=0$, $b=1$ for each error indicator
   - Or use normal approximation: $\hat{L} \pm z_{\alpha/2}\sqrt{\frac{\hat{L}(1-\hat{L})}{n}}$
   - For 99% CI: $z_{0.005} \approx 2.576$
   
**Note:** The 0-1 loss is the misclassification rate (1 minus the accuracy); it does not measure the quality of the predicted probabilities
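
The two interval constructions in step 4 can be compared side by side. The sketch below uses hypothetical inputs (a test loss of 0.1 on 1000 test points) and a hypothetical helper name; the Hoeffding interval holds for any sample size, while the normal interval is an approximation that is usually narrower.

```python
import math
from statistics import NormalDist

def zero_one_loss_intervals(loss, n, alpha=0.01):
    """Return Hoeffding and normal-approximation intervals for a 0-1 loss, clipped to [0, 1]."""
    hoeffding_radius = math.sqrt(math.log(2 / alpha) / (2 * n))   # indicators lie in [0, 1]
    z = NormalDist().inv_cdf(1 - alpha / 2)                       # about 2.576 for alpha = 0.01
    normal_radius = z * math.sqrt(loss * (1 - loss) / n)

    def clip(lo, hi):
        return (max(0.0, lo), min(1.0, hi))

    return {
        "hoeffding": clip(loss - hoeffding_radius, loss + hoeffding_radius),
        "normal": clip(loss - normal_radius, loss + normal_radius),
    }

intervals = zero_one_loss_intervals(loss=0.1, n=1000, alpha=0.01)
print(intervals)
```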

In [120]:
# Part 4

# These are the predicted probabilities
problem2_final_predictions = problem2_calibrator.predict(problem2_ps.predict(problem2_X_test).reshape(-1, 1))


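# In order to compute this loss we first need to convert the predicted probabilities to a decision
# recall the Bayes classifier?
problem2_01_loss = float(np.mean((problem2_final_predictions >= 0.5) != problem2_Y_test))

# 99% confidence interval from Hoeffding's inequality: each error indicator lies in [0, 1]
# Recall the interval is given as a tuple (a,b) or a list [a,b]
n_test = len(problem2_Y_test)
radius = float(np.sqrt(np.log(2 / 0.01) / (2 * n_test)))
problem2_interval = (max(0.0, problem2_01_loss - radius), min(1.0, problem2_01_loss + radius))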

---
#### Local Test for Exam vB, PROBLEM 2
Evaluate cell below to make sure your answer is valid.                             You **should not** modify anything in the cell below when evaluating it to do a local test of                             your solution.
You may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly sometimes (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.

In [123]:
try:
    import numpy as np
    test_instance = ProportionalSpam()
    test_loss = test_instance.loss(np.array([[1,0,1],[0,1,1]]),np.array([1,0]),np.array([1.2,0.4,0.3,0.9]))
    assert (np.abs(test_loss-1.2828629432232497) < 1e-6)
    print("Your loss was correct for a test point")
except:
    print("Your loss was not correct on a test point")

Your loss was correct for a test point


---
## Exam vB, PROBLEM 3
Maximum Points = 13


Consider the following four Markov chains, answer each question for all chains:

<img width="400px" src="pictures/MarkovA.png">Markov chain A</img>
<img width="400px" src="pictures/MarkovB.png">Markov chain B</img>
<img width="400px" src="pictures/MarkovC.png">Markov chain C</img>
<img width="400px" src="pictures/MarkovD.png">Markov chain D</img>

1. [2p] What is the transition matrix?
2. [2p] Is the Markov chain irreducible?
3. [3p] Is the Markov chain aperiodic? What is the period for each state?
4. [3p] Does the Markov chain have a stationary distribution, and if so, what is it?
5. [3p] Is the Markov chain reversible?

### Part 1 Explanation: Transition Matrix

**Problem:** Write the transition matrix for each Markov chain A, B, C, and D.

**Solution Approach:**
1. **Identify states:** List all states (nodes) in the chain: A, B, C, D (plus E for chain C)
2. **Index mapping:** Map states to indices: A→0, B→1, C→2, D→3, E→4
3. **Read transition probabilities:** For each state i:
   - Look at all outgoing edges from state i
   - Record probability $P_{ij}$ for each edge from state i to state j
   - If no edge exists, $P_{ij} = 0$
4. **Build matrix:** Create matrix P where $P_{ij}$ is the probability of transitioning from state i to state j
5. **Verify:** Each row must sum to 1 (total probability leaving each state = 1)

**Key Properties:**
- Matrix shape: $(n \times n)$ where $n$ = number of states
- Row i contains transition probabilities from state i to all other states
- $P_{ij} \geq 0$ for all i, j (probabilities are non-negative)
- $\sum_j P_{ij} = 1$ for each row i (probability axiom)
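
The procedure of reading off edges and filling the matrix can be sketched as a small helper. The function name and the edge-dictionary format are assumptions of this sketch; the edge probabilities are the ones entered for chain A in the answer cell of this notebook.

```python
import numpy as np

def transition_matrix_from_edges(states, edges):
    """Build P from a dict {(source, target): probability} and check that rows sum to 1."""
    index = {s: k for k, s in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for (src, dst), prob in edges.items():
        P[index[src], index[dst]] = prob
    if (P < 0).any() or not np.allclose(P.sum(axis=1), 1.0):
        raise ValueError("entries must be non-negative and each row must sum to 1")
    return P

edges_A = {
    ("A", "A"): 0.8, ("A", "B"): 0.2,
    ("B", "A"): 0.6, ("B", "B"): 0.2, ("B", "C"): 0.2,
    ("C", "B"): 0.4, ("C", "D"): 0.6,
    ("D", "C"): 0.8, ("D", "D"): 0.2,
}
P_A = transition_matrix_from_edges("ABCD", edges_A)
print(P_A)
```

Raising an error when a row does not sum to 1 catches edges that were missed while reading the diagram.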

In [128]:
# PART 1

#------------------------TRANSITION MATRIX -------------------------------
# Answer each one by supplying the transition matrix as a numpy array
# of shape (n_states,n_states), where state (A,B,...) corresponds to index (0,1,...)

problem3_A    = np.array([[0.8, 0.2, 0, 0],
                          [0.6, 0.2, 0.2, 0],
                          [0, 0.4, 0, 0.6],
                          [0, 0, 0.8, 0.2]])
problem3_B    = np.array([[0, 0.2, 0, 0.8],
                          [0, 0, 1, 0],
                          [0, 1, 0, 0],
                          [0.5, 0, 0.5, 0]])
problem3_C    = np.array([[0.2, 0.3, 0, 0, 0.5],
                          [0.2, 0.2, 0.6, 0, 0],
                          [0, 0.4, 0, 0.6, 0],
                          [0, 0, 0, 0.6, 0.4],
                          [0, 0, 0, 0.4, 0.6]])
problem3_D    = np.array([[0.8, 0.2, 0, 0],
                          [0.6, 0.2, 0.2, 0],
                          [0, 0.4, 0, 0.6],
                          [0.1, 0, 0.7, 0.2]])

def is_transition_matrix(P, tol=1e-10):
    P = np.asarray(P, float)
    if (P < -tol).any():
        return False
    return np.allclose(P.sum(axis=1), 1.0, atol=1e-8)


### Part 2 Explanation: Irreducibility

**Problem:** Determine if each Markov chain is irreducible.

**Definition:** A Markov chain is **irreducible** if every state is accessible from every other state. In other words, for every pair of states i and j there is some number of steps n (which may depend on i and j) after which the chain started in i is in j with positive probability.

**Mathematical Condition:**
For all pairs of states $(i,j)$, there exists some $n \geq 0$ such that $P^n_{ij} > 0$

**Solution Approach:**
1. **Graph perspective:** Think of the Markov chain as a directed graph
   - States = nodes
   - Transitions with $P_{ij} > 0$ = directed edges from i to j
2. **Reachability test:** For each state i, perform a graph traversal (BFS/DFS) to find all reachable states
3. **Check condition:** 
   - If from every state you can reach ALL other states → irreducible
   - If there exists any state from which some state is unreachable → reducible

**Interpretation:**
- **Irreducible:** The chain has no separate "compartments" - you can eventually get anywhere from anywhere
- **Reducible:** The chain has isolated groups of states or "absorbing" regions that trap the process

**Examples:**
- Chain with cycle covering all states → likely irreducible
- Chain with one-way transitions creating isolated groups → reducible
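
The reachability test can also be written with matrix powers instead of a graph traversal. The following is a minimal sketch, assuming NumPy; `is_irreducible_via_powers` is a hypothetical helper name and the two-state matrices are made up for illustration. It relies on the fact that an $n$-state chain is irreducible exactly when $(I + A)^{n-1}$ has no zero entry, where $A$ is the 0/1 matrix marking transitions with $P_{ij} > 0$.

```python
import numpy as np

def is_irreducible_via_powers(P):
    """Hypothetical helper: irreducible iff (I + A)^(n-1) is entrywise positive."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    # A[i, j] = 1 when a one-step transition i -> j has positive probability
    A = (P > 0).astype(float)
    # Entry (i, j) of (I + A)^(n-1) is positive iff j is reachable from i
    # in at most n-1 steps, which suffices in a graph with n nodes
    reach = np.linalg.matrix_power(np.eye(n) + A, n - 1)
    return bool((reach > 0).all())

# Two-state examples (made up for illustration)
swap = [[0.0, 1.0], [1.0, 0.0]]        # 0 <-> 1: irreducible
absorbing = [[1.0, 0.0], [0.5, 0.5]]   # state 0 is absorbing: reducible
print(is_irreducible_via_powers(swap), is_irreducible_via_powers(absorbing))
```

For the small chains in this problem both approaches should agree; the traversal scales better to large sparse chains, while the matrix form is shorter to write.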

In [129]:
# PART 2
#------------------------REDUCIBLE -------------------------------
# Answer each one with a True or False

def is_irreducible(P):
    P = np.asarray(P, float)
    n = P.shape[0]
    adj = (P > 0)

    def reachable(start):
        seen = set([start])
        stack = [start]
        while stack:
            u = stack.pop()
            for v in np.where(adj[u])[0]:
                if v not in seen:
                    seen.add(int(v))
                    stack.append(int(v))
        return seen

    for s in range(n):
        if len(reachable(s)) != n:
            return False
    return True

problem3_A_irreducible = is_irreducible(problem3_A)
problem3_B_irreducible = is_irreducible(problem3_B)
problem3_C_irreducible = is_irreducible(problem3_C)
problem3_D_irreducible = is_irreducible(problem3_D)

problem3_A_irreducible, problem3_B_irreducible, problem3_C_irreducible, problem3_D_irreducible


(True, False, False, True)

### Part 3 Explanation: Aperiodicity and Periods

**Problem:** Determine if each Markov chain is aperiodic and find the period of each state.

**Definition - Period of a State:**
The **period** of state i is the greatest common divisor (GCD) of all possible return times:
$$d(i) = \gcd\{n \geq 1 : P^n_{ii} > 0\}$$

In words: If you can return to state i only after 3 steps, 6 steps, 9 steps, etc., then period = gcd(3,6,9,...) = 3

**Definition - Aperiodic:**
- A state is **aperiodic** if its period = 1
- A chain is **aperiodic** if ALL states are aperiodic
- A chain is **periodic** if any state has period > 1

**Intuition:**
- **Period = 1 (aperiodic):** Can return to the state at many different times (no fixed cycle)
- **Period = d > 1 (periodic):** Returns are synchronized - can only return every d steps
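- **Period = ∞ (convention used in the code):** The state never returns to itself, so the set of return times is empty and the period is strictly undefined (only possible for a transient state)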

**Solution Approach:**
1. **For each state i:**
   - Find all return times: steps n where $P^n_{ii} > 0$
   - Compute returns = {n₁, n₂, n₃, ...} by checking powers of P
   - Calculate period = gcd(n₁, n₂, n₃, ...)
2. **Check aperiodicity:**
   - If all periods = 1 → chain is aperiodic
   - If any period > 1 → chain is periodic

**Common Patterns:**
- **Self-loops:** If $P_{ii} > 0$ (self-loop exists) → period = 1 (return in 1 step)
- **Bipartite structure:** States alternate between two groups → period = 2
- **All periods = 1:** Chain is aperiodic → well-behaved long-run behavior

**Why This Matters:**
For a finite chain, aperiodicity (along with irreducibility) guarantees convergence to the stationary distribution:
$$\lim_{n \to \infty} P^n = \begin{bmatrix} \pi \\ \pi \\ \vdots \\ \pi \end{bmatrix}$$
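
To illustrate the limit numerically, here is a small sketch assuming NumPy. It uses chain A and chain B from Part 1, copied in so the block is self-contained; the variable names are chosen for illustration. For the irreducible, aperiodic chain A all rows of a high power of $P$ agree, while for the periodic chain B consecutive powers keep alternating.

```python
import numpy as np

# Chain A and chain B from Part 1
A = np.array([[0.8, 0.2, 0.0, 0.0],
              [0.6, 0.2, 0.2, 0.0],
              [0.0, 0.4, 0.0, 0.6],
              [0.0, 0.0, 0.8, 0.2]])
B = np.array([[0.0, 0.2, 0.0, 0.8],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0]])

A_200 = np.linalg.matrix_power(A, 200)
# Irreducible and aperiodic: every row of P^n approaches the same vector pi
rows_agree = np.allclose(A_200, A_200[0])

# Periodic: P^100 and P^101 differ (states 1 and 2 keep swapping)
B_100 = np.linalg.matrix_power(B, 100)
B_101 = np.linalg.matrix_power(B, 101)
powers_agree = np.allclose(B_100, B_101)

print(rows_agree, powers_agree)
```

The first flag comes out True and the second False, which is the numerical face of the periodicity of chain B.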

In [130]:
# PART 3
#------------------------APERIODIC-------------------------------

# NOTE: relies on state_period, which is defined in a later cell; run that cell first.
def is_aperiodic(P):
    """Check if chain is aperiodic (all states have period 1)."""
    P = np.asarray(P, float)
    n = P.shape[0]
    
    for state in range(n):
        if state_period(P, state) > 1:
            return False
    return True

# Answer each one with a True or False

problem3_A_is_aperiodic = is_aperiodic(problem3_A)
problem3_B_is_aperiodic = is_aperiodic(problem3_B)
problem3_C_is_aperiodic = is_aperiodic(problem3_C)
problem3_D_is_aperiodic = is_aperiodic(problem3_D)



### Detailed Algorithm: Computing State Periods

**Algorithm Steps:**
1. **Initialize:** Set max_steps = 100 (check up to 100 transitions)
2. **Find return times:** For state i:
   ```
   return_times = []
   P_power = P                      (P_power holds P^n, starting at n = 1)
   for n = 1 to max_steps:
       if P_power[i,i] > 0:
           add n to return_times
       P_power = P_power × P        (matrix multiplication, now P^(n+1))
   ```
3. **Compute period:** period(i) = gcd(return_times)

**Example - Chain A State 0:**
Suppose $P^1_{00} = 0.8$ (can return in 1 step due to self-loop)
- Return times = {1, 2, 3, ...} (can return every step)
- period = gcd(1, 2, 3, ...) = 1 ✓ aperiodic

**Example - Bipartite Chain:**
Suppose states alternate between {A,B} ↔ {C,D}
- From A: can only reach A at steps 2, 4, 6, 8, ...
- Return times = {2, 4, 6, 8, ...}
- period = gcd(2, 4, 6, 8, ...) = 2 ✗ periodic

**Implementation Note:**
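The function `state_period(P, state)` implements this with matrix powers and a GCD. Because it only inspects the first `max_steps = 100` powers, the result is exact only when enough return times appear within that horizon, which is the case for the small chains here.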
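
As a quick check of the bipartite example, the following sketch (assuming NumPy; the 4-state matrix and the variable names are made up for illustration) collects the return times to state A on a chain that alternates between {A,B} and {C,D}, then takes their gcd.

```python
import math
from functools import reduce
import numpy as np

# Hypothetical bipartite chain: {A, B} = indices {0, 1}, {C, D} = indices {2, 3}
P = np.array([[0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5, 0.5],
              [0.5, 0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0]])

return_times = []
P_power = P.copy()               # holds P^n, starting at n = 1
for n in range(1, 21):
    if P_power[0, 0] > 0:        # positive probability of being back at A after n steps
        return_times.append(n)
    P_power = P_power @ P        # advance to P^(n+1)

period = reduce(math.gcd, return_times)
print(return_times[:4], period)
```

Only even step counts appear among the return times, so the gcd, and hence the period of state A, is 2.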

In [131]:

import math

def gcd_list(numbers):
    """Compute GCD of a list of numbers."""
    result = numbers[0]
    for num in numbers[1:]:
        result = math.gcd(result, num)
    return result

def state_period(P, state):
    """
    Compute the period of a specific state in a Markov chain.
    
    Parameters:
    -----------
    P : array-like
        Transition matrix
    state : int
        State to check period for
    
    Returns:
    --------
    period : int or float
        Period of the state (1 = aperiodic; float('inf') if no return is found within max_steps)
    """
    P = np.asarray(P, float)
    n = P.shape[0]
    max_steps = 100  # Check up to this many steps
    
    # Find all n where P^n[state, state] > 0
    return_times = []
    P_power = P.copy()
    
    for step in range(1, max_steps + 1):
        if P_power[state, state] > 1e-10:
            return_times.append(step)
        P_power = P_power @ P
    
    if len(return_times) == 0:
        return float('inf')  # Never returns
    
    return gcd_list(return_times)


def all_state_periods(P):
    """
    Compute the period for each state in a Markov chain.
    
    Parameters:
    -----------
    P : array-like
        Transition matrix
    
    Returns:
    --------
    periods : array
        Array where periods[i] is the period of state i
        (period = 1 means aperiodic, period = inf means never returns)
    """
    P = np.asarray(P, float)
    n = P.shape[0]
    
    periods = np.zeros(n)
    for state in range(n):
        periods[state] = state_period(P, state)
    
    return periods

# Answer the following with the period of the states as a numpy array
# of shape (n_states,)

# problem3_A_periods = [state_period(problem3_A, x) for x in range(len(problem3_A))]
# problem3_B_periods = [state_period(problem3_B, x) for x in range(len(problem3_B))]
# problem3_C_periods = [state_period(problem3_C, x) for x in range(len(problem3_C))]
# problem3_D_periods = [state_period(problem3_D, x) for x in range(len(problem3_D))]

problem3_A_periods = all_state_periods(problem3_A)
problem3_B_periods = all_state_periods(problem3_B)
problem3_C_periods = all_state_periods(problem3_C)
problem3_D_periods = all_state_periods(problem3_D)


problem3_A_periods, problem3_B_periods, problem3_C_periods, problem3_D_periods

(array([1., 1., 1., 1.]),
 array([2., 2., 2., 2.]),
 array([1., 1., 1., 1., 1.]),
 array([1., 1., 1., 1.]))

### Part 4 Explanation: Stationary Distribution

**Problem:** Determine if each chain has a stationary distribution, and if so, compute it.

**Definition - Stationary Distribution:**
A probability distribution $\pi = (\pi_0, \pi_1, ..., \pi_{n-1})$ over the $n$ states is **stationary** if:
$$\pi P = \pi$$

In other words: If the chain is in distribution $\pi$ at time t, it remains in distribution $\pi$ at time t+1.

**Properties:**
- $\sum_i \pi_i = 1$ (valid probability distribution)
- $\pi_i \geq 0$ for all i (non-negative probabilities)
- $\pi$ is a **left eigenvector** of P with eigenvalue 1

**Existence and Uniqueness:**
Every finite Markov chain has at least one stationary distribution. Two further properties strengthen this:
1. **Irreducible:** All states communicate with each other → the stationary distribution is **unique**
2. **Aperiodic:** All states have period 1 → together with irreducibility, $P^n$ converges to it

(A chain with both properties is called **ergodic**)

**Solution Approach:**

**Method 1 - Check Conditions (Used in Code):**
1. Test if chain is irreducible (Part 2)
2. Test if chain is aperiodic (Part 3)
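3. If both TRUE → unique stationary distribution exists and $P^n$ converges to it
4. If FALSE → a stationary distribution still exists (the chain is finite), but it may be non-unique (reducible case) or $P^n$ may fail to converge to it (periodic case); the code below answers False in these cases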

**Method 2 - Compute via Eigenvalues:**
1. Find left eigenvector of P corresponding to eigenvalue 1:
   - Solve: $P^T \pi^T = \pi^T$ (the transpose of $\pi P = \pi$, so that $\pi^T$ is an ordinary right eigenvector)
   - That is: find an eigenvector of $P^T$ with eigenvalue 1
2. Take absolute values to fix the sign (an eigenvector is only determined up to a scalar factor)
3. Normalize: $\pi \leftarrow \frac{\pi}{\sum_i \pi_i}$ to make it sum to 1

**Interpretation:**
- $\pi_i$ = long-run proportion of time spent in state i
- Example: If $\pi = (0.2, 0.5, 0.3)$, the chain spends 20% of time in state 0, 50% in state 1, 30% in state 2
- **Convergence:** For ergodic chains: $\lim_{n \to \infty} P^n_{ij} = \pi_j$ (independent of starting state!)

**Special Cases:**
- **Reducible chain:** Still has at least one stationary distribution; it is unique only if there is exactly one closed communicating class
- **Periodic chain:** A stationary distribution exists, but $P^n$ need not converge
- **Both conditions met:** Guaranteed unique stationary distribution AND convergence
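
Method 2 can also be phrased as a linear system: stack the equations $\pi (P - I) = 0$ with the normalization $\sum_i \pi_i = 1$ and solve by least squares. The sketch below assumes NumPy; `stationary_via_lstsq` is a hypothetical helper name, and it returns one solution of the system (if several stationary distributions exist, it simply picks one). It is run on chain A and chain B from Part 1. Since a finite chain always has at least one stationary distribution, the solve also succeeds for the reducible, periodic chain B.

```python
import numpy as np

def stationary_via_lstsq(P):
    """Hypothetical helper: solve pi (P - I) = 0 together with sum(pi) = 1."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    # First n rows: (P^T - I) pi^T = 0; last row: sum(pi) = 1
    M = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(M, b, rcond=None)
    return pi

A = np.array([[0.8, 0.2, 0.0, 0.0], [0.6, 0.2, 0.2, 0.0],
              [0.0, 0.4, 0.0, 0.6], [0.0, 0.0, 0.8, 0.2]])
B = np.array([[0.0, 0.2, 0.0, 0.8], [0.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 0.0], [0.5, 0.0, 0.5, 0.0]])

pi_A = stationary_via_lstsq(A)
pi_B = stationary_via_lstsq(B)
# Verify the defining property pi P = pi for both chains
print(np.allclose(pi_A @ A, pi_A), np.allclose(pi_B @ B, pi_B))
```

For chain B the solution puts all mass on the closed class formed by states B and C, so it is stationary even though $P^n$ does not converge.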

In [132]:
from numpy.linalg import eig


# PART 4
#------------------------STATIONARY DISTRIBUTION-----------------
# Answer each one with a True or False

def has_stationary_distribution(P):
    """
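    Check whether a Markov chain is ergodic (irreducible and aperiodic).
    
    Ergodicity is a sufficient condition for a unique stationary distribution
    that P^n converges to. It is not necessary: every finite chain has at least
    one stationary distribution, so False here does not rule one out.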
    
    Parameters:
    -----------
    P : array-like
        Transition matrix
    
    Returns:
    --------
    has_stationary : bool
        True if chain has a unique stationary distribution
    reason : str
        Explanation of why the chain does or doesn't have stationary distribution
    """
    P = np.asarray(P, float)
    
    # Check if it's a valid transition matrix
    if not is_transition_matrix(P):
        return False, "Not a valid transition matrix"
    
    # Check irreducibility
    if not is_irreducible(P):
        return False, "Chain is not irreducible (some states are not reachable from others)"
    
    # Check aperiodicity
    if not is_aperiodic(P):
        return False, "Chain is not aperiodic (has periodic states)"
    
    return True, "Chain is ergodic (irreducible and aperiodic) - has unique stationary distribution"


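def stationary_distribution(P):
    """Return a stationary distribution as the eigenvalue-1 left eigenvector of P."""
    P = np.asarray(P, dtype=float)
    # Left eigenvectors of P are right eigenvectors of P.T
    w, v = eig(P.T)
    # Pick the eigenvalue closest to 1 (numerically it is rarely exactly 1)
    k = np.argmin(np.abs(w - 1))
    pi = np.real(v[:, k])
    # Eigenvectors are defined up to scale and sign: fix the sign, then normalize
    pi = np.abs(pi)
    pi = pi / pi.sum()
    return pi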



problem3_A_has_stationary = has_stationary_distribution(problem3_A)[0]
problem3_B_has_stationary = has_stationary_distribution(problem3_B)[0]
problem3_C_has_stationary = has_stationary_distribution(problem3_C)[0]
problem3_D_has_stationary = has_stationary_distribution(problem3_D)[0]

# Answer the following with the stationary distribution as a numpy array of shape (n_states,)
# if the Markov chain has a stationary distribution otherwise answer with False

problem3_A_stationary_dist = False if not(problem3_A_has_stationary) else stationary_distribution(problem3_A)
problem3_B_stationary_dist = False if not(problem3_B_has_stationary) else stationary_distribution(problem3_B)
problem3_C_stationary_dist = False if not(problem3_C_has_stationary) else stationary_distribution(problem3_C)
problem3_D_stationary_dist = False if not(problem3_D_has_stationary) else stationary_distribution(problem3_D)

problem3_A_stationary_dist, problem3_B_stationary_dist, problem3_C_stationary_dist, problem3_D_stationary_dist

(array([0.61538462, 0.20512821, 0.1025641 , 0.07692308]),
 False,
 False,
 array([0.64516129, 0.20430108, 0.08602151, 0.06451613]))

### Part 5 Explanation: Reversibility

**Problem:** Determine if each Markov chain is reversible.

**Definition - Reversible (Time-Reversible) Chain:**
A Markov chain with stationary distribution $\pi$ is **reversible** if it satisfies the **detailed balance condition**:
$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all states } i, j$$

**Physical Interpretation:**
- **Forward rate:** $\pi_i P_{ij}$ = steady-state probability of being in i × probability of moving to j
- **Backward rate:** $\pi_j P_{ji}$ = steady-state probability of being in j × probability of moving to i
- **Detailed balance:** Forward and backward transition rates are equal
- **Time reversal:** If you film the chain and play it backwards, it looks statistically identical

**Why It Matters:**
1. **Computational:** Easier to find stationary distribution (solve simpler equations)
2. **MCMC:** Many algorithms (e.g., Metropolis-Hastings) rely on reversibility
3. **Physical systems:** Thermodynamic equilibrium → detailed balance

**Solution Approach:**

**Step 1 - Get Stationary Distribution:**
- If the chain fails the ergodicity check from Part 4 → no $\pi$ is computed and the chain is reported as not reversible
- Otherwise, compute or use given $\pi$

**Step 2 - Check Detailed Balance:**
For all pairs $(i,j)$:
1. Compute $\text{forward\_rate} = \pi_i \cdot P_{ij}$
2. Compute $\text{backward\_rate} = \pi_j \cdot P_{ji}$
3. Check if $|\text{forward\_rate} - \text{backward\_rate}| < \epsilon$ (small tolerance)

**Step 3 - Verdict:**
- If ALL pairs satisfy detailed balance → reversible ✓
- If ANY pair violates it → not reversible ✗

**Examples:**

**Example 1 - Symmetric Random Walk:**
- States: {0, 1, 2}
- Transitions: 0↔1↔2, each move with probability 0.5; the end states 0 and 2 stay put with probability 0.5, so $P$ is symmetric
- Stationary: $\pi = (1/3, 1/3, 1/3)$ (uniform)
- Check: $\pi_0 P_{01} = (1/3)(0.5) = \pi_1 P_{10} = (1/3)(0.5)$ ✓ Reversible!

**Example 2 - Circular Chain:**
- States: A → B → C → A (one-way cycle)
- Stationary: $\pi = (1/3, 1/3, 1/3)$
- Check: $\pi_A P_{AB} = (1/3)(1) = 1/3$ but $\pi_B P_{BA} = (1/3)(0) = 0$
- Violation: $1/3 \neq 0$ ✗ Not reversible (clear directionality!)

**Shortcut Tests:**
1. **One-way edge test:** If $P_{ij} > 0$ but $P_{ji} = 0$ for some pair with $\pi_i > 0$ → NOT reversible (e.g., chain D has $P_{30} = 0.1$ but $P_{03} = 0$)
2. **Symmetry test:** If $P_{ij} = P_{ji}$ for all i,j AND irreducible → definitely reversible
3. **Balance sheet:** For each edge, check if traffic flows equally in both directions

**Connection to Random Walks:**
- **Undirected graphs:** Random walk on undirected graph → always reversible
- **Directed graphs:** Random walk on directed graph → usually NOT reversible (unless special structure)
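
Detailed balance says that the flow matrix $F_{ij} = \pi_i P_{ij}$ is symmetric, which gives a compact vectorized check. The following is a minimal sketch assuming NumPy; `satisfies_detailed_balance` is a hypothetical name. It is applied to the one-way cycle of Example 2 and to chain A from Part 1 with $\pi = (24, 8, 4, 3)/39$, which matches the stationary distribution computed in Part 4.

```python
import numpy as np

def satisfies_detailed_balance(P, pi, tol=1e-8):
    """Hypothetical helper: detailed balance <=> F[i, j] = pi[i] * P[i, j] is symmetric."""
    P = np.asarray(P, dtype=float)
    pi = np.asarray(pi, dtype=float)
    F = pi[:, None] * P          # F[i, j] = pi_i * P_ij (probability flow i -> j)
    return bool(np.allclose(F, F.T, atol=tol))

# Example 2: one-way cycle A -> B -> C -> A with uniform pi
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
cycle_reversible = satisfies_detailed_balance(cycle, [1/3, 1/3, 1/3])

# Chain A from Part 1 (moves only between neighbouring states)
A = [[0.8, 0.2, 0.0, 0.0], [0.6, 0.2, 0.2, 0.0],
     [0.0, 0.4, 0.0, 0.6], [0.0, 0.0, 0.8, 0.2]]
A_reversible = satisfies_detailed_balance(A, np.array([24, 8, 4, 3]) / 39)

print(cycle_reversible, A_reversible)
```

The cycle fails the check and chain A passes it, in line with the examples and the Part 5 answer for chain A.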

In [133]:
def check_detailed_balance(P, pi, tol=1e-8):
    """
    Check if a Markov chain satisfies detailed balance condition.
    
    Parameters:
    -----------
    P : array-like
        Transition matrix
    pi : array-like
        Stationary distribution (or candidate stationary distribution)
    tol : float, default=1e-8
        Numerical tolerance for equality check
    
    Returns:
    --------
    is_reversible : bool
        True if detailed balance holds
    max_violation : float
        Maximum absolute difference |pi_i * P_ij - pi_j * P_ji|
    """
    P = np.asarray(P, float)
    pi = np.asarray(pi, float)
    n = P.shape[0]
    
    # Check detailed balance: pi[i] * P[i,j] = pi[j] * P[j,i] for all i,j
    max_violation = 0.0
    
    for i in range(n):
        for j in range(n):
            forward_rate = pi[i] * P[i, j]
            backward_rate = pi[j] * P[j, i]
            violation = abs(forward_rate - backward_rate)
            max_violation = max(max_violation, violation)
    
    is_reversible = (max_violation < tol)
    
    return is_reversible, float(max_violation)

def is_reversible_chain(P, pi=None, tol=1e-8):
    """
    Check if a Markov chain is reversible.
    
    Parameters:
    -----------
    P : array-like
        Transition matrix
    pi : array-like, optional
        Stationary distribution. If None, computes it automatically
    tol : float, default=1e-8
        Numerical tolerance
    
    Returns:
    --------
    reversible : bool
        True if chain is reversible
    message : str
        Explanation
    """
    P = np.asarray(P, float)
    
    # Check if it's a valid transition matrix
    if not is_transition_matrix(P):
        return False, "Not a valid transition matrix"
    
    # Get stationary distribution if not provided
    if pi is None:
        # Check if chain has unique stationary distribution
        has_stat, reason = has_stationary_distribution(P)
        if not has_stat:
            return False, f"Cannot check reversibility: {reason}"
        pi = stationary_distribution(P)
    
    # Check detailed balance
    is_rev, max_viol = check_detailed_balance(P, pi, tol)
    
    if is_rev:
        return True, f"Chain is reversible (max violation: {max_viol:.2e})"
    else:
        return False, f"Chain is not reversible (max violation: {max_viol:.2e})"

In [134]:
# PART 5
#------------------------REVERSIBLE-----------------
# Answer each one with a True or False

problem3_A_is_reversible = is_reversible_chain(problem3_A)[0]
problem3_B_is_reversible = is_reversible_chain(problem3_B)[0]
problem3_C_is_reversible = is_reversible_chain(problem3_C)[0]
problem3_D_is_reversible = is_reversible_chain(problem3_D)[0]

problem3_A_is_reversible, problem3_B_is_reversible, problem3_C_is_reversible, problem3_D_is_reversible

(True, False, False, False)