# Exam 28th of August 2024 — Course 1MS041 (Introduction to Data Science)

## 1.1 Instructions
1. Complete the problems by following instructions.  
2. When done, submit this file with your solutions saved, following the instruction sheet.

This exam has **3 problems** for a total of **40 points**; you need **20 points** to pass.  
The bonus will be added to the score and rounded afterwards.

## 1.2 Some general hints and information
- Try to answer all questions even if you are uncertain.  
- Comment your code — partial credit is given if your reasoning is clear.  
- Follow the instruction sheet rigorously.  
- The exam is partially autograded, but your code and free text answers are manually graded anonymously.

## 1.3 Tips for free text answers
You can write LaTeX in Markdown cells:

- `$f(x) = x^2$` → inline math  
- `$$f(x)=x^2$$` → centered display math  

Example:

$$f_{Y|X}(y,x) = P(Y = y \mid X = x) = \exp(\alpha x + \beta)$$

## 1.4 Rules
- You may **not communicate with others** during the exam.  
- You may **not use AI systems such as ChatGPT**.  
- Your online and offline activity may be monitored.

## 1.5 Good luck!

---

### Insert your anonymous exam ID below:


In [None]:
examID = "XXX"

## Exam vB, PROBLEM 1  
**Maximum Points: 14**

In this problem you will use rejection sampling to draw from a distribution that is hard to sample directly, and you will use your samples to compute certain integrals, a method known as *Monte Carlo integration*.  
(Keep in mind that a well-chosen proposal distribution is often the key to avoiding too many rejections.)

1. **[4p]** Fill in the remaining part of the function `problem1_rejection` so that it produces samples from the following density using rejection sampling:

$$f(x) = C (\sin x)^{10}, \quad 0 \le x \le \pi$$


where $C$ is a value such that $f$ above is a density (i.e. integrates to one).  
*Hint:* you do not need to know the value of $C$ to perform rejection sampling.

2. **[2p]** Produce 10 000 samples (use fewer if it takes too long) from the above distribution, put the answer in the variable `problem1_samples`, and plot the histogram.

3. **[2p]** Let $X$ be a random variable with the density given in part 1. Define

$$Y = \left(X - \frac{\pi}{2}\right)^2$$

and use the 10 000 samples from part 2 to estimate $\mathbb{E}[Y]$. Store the result in `problem1_expectation`.

4. **[2p]** Use Hoeffding’s inequality to produce a 95% confidence interval of the expectation above and store the result as a tuple in the variable `problem1_interval`.

5. **[4p]** Can you calculate an approximation of the value of $C$ from part 1 using random samples?  
Provide a plot of the histogram from part 2 together with the true density as a curve (this requires the value of $C$).  
Explain what method you used and what answer you got.
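
As a hedged illustration of the two techniques named above (rejection sampling from an unnormalized density, and Monte Carlo estimation of a normalizing constant), the sketch below works on a different toy target, $g(x) = x^2(1-x)$ on $[0, 1]$, whose integral is known to be $1/12$. The names `g_unnormalized`, `rejection_sample`, and `bound`, as well as the uniform proposal, are assumptions made for this example only and are not part of the exam template.

```python
import numpy as np

def g_unnormalized(x):
    # Toy target (assumption): g(x) = x^2 (1 - x) on [0, 1], integral 1/12.
    return x**2 * (1.0 - x)

def rejection_sample(n_samples, g, a, b, bound, rng):
    """Draw n_samples from the density proportional to g on [a, b].

    Proposal: Uniform(a, b). `bound` must satisfy bound >= max of g on [a, b].
    """
    chunks = [np.empty(0)]
    n_accepted = 0
    while n_accepted < n_samples:
        x = rng.uniform(a, b, size=n_samples)         # proposals
        u = rng.uniform(0.0, bound, size=n_samples)   # vertical coordinate
        keep = x[u <= g(x)]                           # accept points under the curve
        chunks.append(keep)
        n_accepted += keep.size
    return np.concatenate(chunks)[:n_samples]

rng = np.random.default_rng(0)
a, b, bound = 0.0, 1.0, 0.15   # max of g is 4/27 (about 0.148), so 0.15 is a valid bound
samples = rejection_sample(10_000, g_unnormalized, a, b, bound, rng)

# Normalizing constant from the acceptance rate of a fixed number of proposals:
# P(accept) = integral(g) / (bound * (b - a)).
n_proposals = 200_000
x = rng.uniform(a, b, size=n_proposals)
u = rng.uniform(0.0, bound, size=n_proposals)
accept_rate = np.mean(u <= g_unnormalized(x))
integral_estimate = accept_rate * bound * (b - a)
C_estimate = 1.0 / integral_estimate   # should be close to 12 for this toy target
```

The same acceptance-rate identity holds for any bounded target on a finite interval with a uniform proposal; a tighter `bound` wastes fewer proposals.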


In [None]:
# Part 1
def problem1_rejection(n_samples=1):
    """
    Use rejection sampling to draw samples from
        f(x) ∝ (sin x)^10   on [0, π]
    Return a numpy array of length n_samples.
    """
    return XXX

In [None]:
# Part 2
problem1_samples = XXX

In [None]:
# Part 3
problem1_expectation = XXX

In [None]:
# Part 4
problem1_interval = (XXX, XXX)

In [None]:
# Part 5 — numeric computation of C
problem1_C = XXX

In [None]:
# Part 5 — plot
# Write your plotting code here
# XXXXX

## Part 5 — Explanation

Double-click to edit:

### Begin explanation

(Write your method and the approximation of C here.)

### End explanation


## Local Test for Exam vB, PROBLEM 1

Evaluate the cell below to verify that your answers have correct format.

Do NOT modify anything in the test cell.


In [None]:
# This cell checks the format, not correctness
import numpy as np

try:
    assert isinstance(problem1_rejection(10), np.ndarray)
except:
    print("Try again. You should return a numpy array from problem1_rejection")
else:
    print("Good, your problem1_rejection returns a numpy array")

try:
    assert isinstance(problem1_samples, np.ndarray)
except:
    print("Try again. your problem1_samples is not a numpy array")
else:
    print("Good, your problem1_samples is a numpy array")

try:
    assert isinstance(problem1_expectation, float)
except:
    print("Try again. your problem1_expectation is not a float")
else:
    print("Good, your problem1_expectation is a float")

try:
    assert (isinstance(problem1_interval, list) or isinstance(problem1_interval, tuple))
    assert len(problem1_interval) == 2
except Exception as e:
    print(e)
else:
    print("Good, your problem1_interval is a tuple or list of length 2")

## 2.1 Exam vB, PROBLEM 2  
**Maximum Points: 13**

Consider the dataset `CORIS.csv` in the `data` folder. The dataset contains cases of coronary heart disease (CHD) and variables associated with the patient’s condition:

- systolic blood pressure (`sbp`)
- yearly tobacco use in kg (`tobacco`)
- low density lipoprotein (`ldl`)
- adiposity
- family history (0 or 1) (`famhist`)
- type A personality score (`typea`)
- obesity (body mass index)
- alcohol use
- age
- diagnosis of CHD (0 or 1) (`chd`)

Here:
- **X** corresponds to the measurements,
- **Y** is a 0–1 label where 1 represents CHD and 0 represents no CHD.

The code to load the data, perform a train–test–validation split, and train a model is already prepared for you.  
The trained model is stored in `problem2_pipe`, which is an `sklearn` `Pipeline`.

---

### **1. [3p]**

Use **Hoeffding’s inequality** to compute **95% confidence intervals** for the **precision** and **recall** of each class on the **test set**.  
Store the intervals in the variables:

- `problem2_precision0`
- `problem2_recall0`
- `problem2_precision1`
- `problem2_recall1`

Each of these should be a **tuple** `(lower, upper)`.
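
One possible way to organize this computation is sketched below on made-up labels. It treats precision for a class as the mean of 0/1 indicators over the samples predicted as that class, and recall as the mean over the samples that truly belong to it, then applies the two-sided Hoeffding bound $\varepsilon = (b-a)\sqrt{\ln(2/\alpha)/(2n)}$. The helper names `hoeffding_interval` and `precision_recall_intervals` and the toy arrays are assumptions for illustration; the sketch also assumes each class is predicted at least once.

```python
import numpy as np

def hoeffding_interval(values, low, high, alpha):
    """Two-sided Hoeffding interval for the mean of values bounded in [low, high]."""
    values = np.asarray(values, dtype=float)
    n = values.size
    eps = (high - low) * np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    mean = values.mean()
    return (float(mean - eps), float(mean + eps))

def precision_recall_intervals(y_true, y_pred, label, alpha=0.05):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Precision: among samples predicted as `label`, how many truly are `label`.
    correct_given_pred = (y_true[y_pred == label] == label)
    # Recall: among samples that truly are `label`, how many were predicted as `label`.
    found_given_true = (y_pred[y_true == label] == label)
    precision = hoeffding_interval(correct_given_pred, 0.0, 1.0, alpha)
    recall = hoeffding_interval(found_given_true, 0.0, 1.0, alpha)
    return precision, recall

# Toy labels (hypothetical, not the CORIS data)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])
precision1, recall1 = precision_recall_intervals(y_true, y_pred, label=1)
```

Note that the sample size `n` differs between precision and recall, and that the raw interval can extend outside $[0, 1]$ for small `n`; clipping it to $[0, 1]$ is a reasonable optional step.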

---

### **2. [3p]**

You are interested in minimizing the **average cost** of your classifier.  
The hospital will use the model as a screening tool:

- If the model predicts **CHD = 1**, the patient is sent for further investigation.
- If the model predicts **CHD = 0**, nothing is done.

You decide to use the following costs:

- True positive (CHD = 1, predicted 1): cost = 0  
- True negative (CHD = 0, predicted 0): cost = 0  
- False positive (CHD = 0, predicted 1): cost = 10  
- False negative (CHD = 1, predicted 0): cost = 300  *(worst case)*

Complete the function `problem2_cost(model, threshold, X, Y)` to compute the **average cost per person** for a given prediction threshold, using `model.predict_proba`.

---

### **3. [4p]**

Select the **threshold** between 0 and 1 that minimizes the **average cost** on the **test set**.  
Check, for example, **100 evenly spaced thresholds** between 0 and 1.

Store:

- the optimal threshold in `problem2_optimal_threshold`
- the cost at this threshold (on the test set) in `problem2_cost_at_optimal_threshold`
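
The cost definition and threshold sweep described in parts 2 and 3 could be organized roughly as follows. This sketch works directly on arrays of labels and predicted probabilities rather than on a fitted model, and the synthetic data, the helper `average_cost`, and the constants `COST_FP` and `COST_FN` are assumptions introduced for illustration (the cost values themselves are the ones listed above).

```python
import numpy as np

COST_FP = 10.0    # false positive: healthy patient sent for investigation
COST_FN = 300.0   # false negative: CHD case missed

def average_cost(y_true, proba, threshold):
    """Average cost per person when predicting 1 iff proba >= threshold."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(proba) >= threshold).astype(int)
    false_pos = np.sum((pred == 1) & (y_true == 0))
    false_neg = np.sum((pred == 0) & (y_true == 1))
    return float((COST_FP * false_pos + COST_FN * false_neg) / y_true.size)

# Synthetic labels and scores (hypothetical stand-ins for predict_proba output)
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
proba = np.clip(0.5 * y_true + rng.uniform(0.0, 0.6, size=500), 0.0, 1.0)

# Sweep 100 evenly spaced thresholds and keep the cheapest one
thresholds = np.linspace(0.0, 1.0, 100)
costs = np.array([average_cost(y_true, proba, t) for t in thresholds])
best = int(np.argmin(costs))
best_threshold = float(thresholds[best])
best_cost = float(costs[best])
```

Because a false negative is 30 times as expensive as a false positive here, the cheapest threshold tends to be low; `np.argmin` returns the first minimizer if several thresholds tie.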

---

### **4. [3p]**

With your newly computed threshold, compute the **cost of putting the model in production** by evaluating the cost on the **validation set**.

Also compute a **99% confidence interval** for this cost using **Hoeffding’s inequality**. Store:

- the validation cost in `problem2_cost_at_optimal_threshold_validation`
- the interval in `problem2_cost_interval`, as a tuple `(lower, upper)`
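
When the bounded quantity is a cost rather than a proportion, the width of the Hoeffding interval scales with the range of the per-person cost, which with the costs above is $[0, 300]$. The sketch below illustrates this on a made-up vector of per-person costs; `hoeffding_half_width` and `per_person_cost` are hypothetical names, and clipping the interval to the known range is an optional choice of this example.

```python
import numpy as np

def hoeffding_half_width(n, value_range, alpha):
    """Half-width eps such that P(|mean - E| >= eps) <= alpha for n bounded samples."""
    return float(value_range * np.sqrt(np.log(2.0 / alpha) / (2.0 * n)))

# Hypothetical per-person costs: each entry is 0, 10 (false positive) or 300 (false negative)
per_person_cost = np.array([0, 0, 10, 0, 300, 0, 0, 10, 0, 0] * 10, dtype=float)

mean_cost = float(per_person_cost.mean())
eps = hoeffding_half_width(per_person_cost.size, value_range=300.0, alpha=0.01)

# Clip to the known range of the cost
interval = (max(0.0, mean_cost - eps), min(300.0, mean_cost + eps))
```

With only about a hundred samples and a range of 300, the half-width is large compared with typical average costs, so a wide interval is to be expected.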


In [None]:
# RUN THIS CELL TO LOAD THE DATA AND SPLIT IT INTO TRAINING, TEST AND VALIDATION SETS
# FINALLY IT TRAINS THE MODEL AS A PIPELINE

import pandas as pd
from sklearn.model_selection import train_test_split

CORISDataset = pd.read_csv("data/CORIS.csv", skiprows=[1, 2])

# Initial data split into features and target
problem2_X = CORISDataset[
    ['sbp', 'tobacco', 'ldl', 'adiposity', 'famhist', 'typea', 'obesity', 'alcohol', 'age']
].values  # Features
problem2_Y = CORISDataset['chd'].values  # Target variable

# Split the data into training, test and validation sets
problem2_X_train, X_tmp, problem2_Y_train, Y_tmp = train_test_split(
    problem2_X, problem2_Y, train_size=0.6, random_state=42
)
problem2_X_test, problem2_X_val, problem2_Y_test, problem2_Y_val = train_test_split(
    X_tmp, Y_tmp, train_size=0.5, random_state=42
)

# Show the shapes of the data
print(
    problem2_X_train.shape,
    problem2_Y_train.shape,
    problem2_X_test.shape,
    problem2_Y_test.shape,
    problem2_X_val.shape,
    problem2_Y_val.shape,
)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create a pipeline with a scaler and a logistic regression model
problem2_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(random_state=42)),
])

# Fit the pipeline to the training data
problem2_pipe.fit(problem2_X_train, problem2_Y_train)


In [None]:
# Part 1
# To make a prediction on a dataset `X` you can use the following code:
#   predictions = problem2_pipe.predict(X)
# As with any sklearn model, you can also use:
#   probas = problem2_pipe.predict_proba(X)

# Each precision and recall should be a tuple, for instance:
#   precision0 = (0.9, 0.95)
# The 0 or 1 in the variable name indicates the class.

problem2_precision0 = XXX
problem2_recall0    = XXX
problem2_precision1 = XXX
problem2_recall1    = XXX

# The code below will check that you supply the proper type
assert(type(problem2_precision0) == tuple)
assert(len(problem2_precision0) == 2)
assert(type(problem2_recall0) == tuple)
assert(len(problem2_recall0) == 2)
assert(type(problem2_precision1) == tuple)
assert(len(problem2_precision1) == 2)
assert(type(problem2_recall1) == tuple)
assert(len(problem2_recall1) == 2)

In [None]:
# Part 2
def problem2_cost(model, threshold, X, Y):
    pred_proba = model.predict_proba(X)[:, 1]
    predictions = (pred_proba >= threshold) * 1

    # Fill in what is missing to compute the cost and return it
    # Note that we are interested in average cost (cost per person)

    return XXX

In [None]:
# Part 3
problem2_optimal_threshold = XXX
problem2_cost_at_optimal_threshold = XXX

In [None]:
# Part 4
problem2_cost_at_optimal_threshold_validation = XXX

# Report the cost interval as a tuple cost_interval = (a, b)
problem2_cost_interval = XXX

In [None]:
# The code below will tell you if you filled in the interval correctly
assert(type(problem2_cost_interval) == tuple)
assert(len(problem2_cost_interval) == 2)

## 2.2 Exam vB, PROBLEM 3  
**Maximum Points: 13**

Consider the following two Markov chains, with **Markov chain A** on the left and **Markov chain B** on the right:

![Markov Chains](exam240828-markovImage.png)


Answer each question for **both chains**:

---

### **1. [2p]**  
What is the **transition matrix**?  
Your answer for each chain should be a NumPy array of shape `(n_states, n_states)`  
where states `(A, B, …)` correspond to indices `(0, 1, …)`.

---

### **2. [1p]**  
Is the Markov chain **irreducible**?  
Answer with `True` or `False` for each chain.

---

### **3. [4p]**  
Is the Markov chain **aperiodic**?  
What is the **period of each state**?

Provide:

- a boolean (`True`/`False`) indicating if the chain is aperiodic
- a NumPy array with the **period of each state**, shape `(n_states,)`

*Hint:* Recall the definition of period:  

$$\text{period}(i) = \gcd\{\, t \ge 1 : P(X_t = i \mid X_0 = i) > 0 \,\}$$
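
The definition above, together with the notion of irreducibility from question 2, can be checked numerically for small chains. The following is a minimal sketch on two hypothetical transition matrices (`cycle` and `lazy`, which are not the chains in the figure); the helper names and the cap of `3 * n` steps on the return times considered are assumptions of this example.

```python
from math import gcd
import numpy as np

def is_irreducible(P):
    """True if every state can reach every other state."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    # (I + A)^(n-1) has a positive (i, j) entry iff j is reachable from i in <= n-1 steps
    step = np.eye(n, dtype=np.int64) + (P > 0)
    reach = np.linalg.matrix_power(step, n - 1)
    return bool(np.all(reach > 0))

def state_periods(P, max_steps=None):
    """gcd of the return times t <= max_steps with (P^t)[i, i] > 0, for each state i.

    A state with no return within max_steps keeps the value 0 (period undefined).
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    if max_steps is None:
        max_steps = 3 * n          # cap assumed sufficient for small chains
    adjacency = (P > 0).astype(int)
    power = np.eye(n, dtype=int)   # pattern of nonzero entries of P^t
    periods = np.zeros(n, dtype=int)
    for t in range(1, max_steps + 1):
        power = ((power @ adjacency) > 0).astype(int)
        for i in range(n):
            if power[i, i]:
                periods[i] = gcd(int(periods[i]), t)
    return periods

# Hypothetical chains
cycle = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)   # deterministic 3-cycle
lazy = np.array([[0.5, 0.5], [1.0, 0.0]])                          # self-loop at state 0

cycle_periods = state_periods(cycle)   # every state has period 3
lazy_periods = state_periods(lazy)     # every state has period 1
```

A chain is aperiodic when every state has period 1; in an irreducible chain all states share the same period, so checking one state is enough in that case.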


---

### **4. [2p]**  
If the chain starts in state A at time 0, what is the probability of being in state B at time 5?  

Store this in:

- `problem3_A_PB5`
- `problem3_B_PB5`
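
For a time-homogeneous chain, the probability of being in state $j$ at time $t$ after starting in state $i$ is the $(i, j)$ entry of the $t$-th power of the transition matrix. A short sketch on a hypothetical two-state chain (not one of the chains in the figure) follows; the matrix `P` and the variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical transition matrix with states (A, B) at indices (0, 1)
P = np.array([[0.5, 0.5],
              [1.0, 0.0]])

P5 = np.linalg.matrix_power(P, 5)

# Probability of being in state B (index 1) at time 5 when starting in A (index 0)
prob_B_at_5 = float(P5[0, 1])   # equals 11/32 = 0.34375 for this toy chain

# Equivalent view: propagate the initial distribution five steps
start = np.array([1.0, 0.0])
dist_at_5 = start @ P5
```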

---

### **5. [4p]**  
Let $T$ be the **first hitting time of state D**, starting from state A:


$$T(\omega) = \inf \{\, t \in \mathbb{N} : X_t(\omega) = D \,\}$$


where the infimum over an empty set is $\infty$.

Compute:

- $P(T = 1)$  
- $P(T = 2)$  
- $P(T = 3)$  
- $P(T = 4)$  
- $P(T = 5)$  
- $P(T = \infty)$  


for both chains A and B, and store them in the provided variables.
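
One way to compute such a distribution is to track the probability mass that has not yet reached the target state. Writing $Q$ for the transition matrix restricted to the non-target states and $r$ for the vector of one-step probabilities into the target, $P(T = t) = e_{\text{start}}^\top Q^{t-1} r$. The sketch below applies this to a hypothetical four-state chain (not one of the chains in the figure) and approximates $P(T = \infty)$ by fixed-point iteration; the helper names, the matrix `P`, and the iteration count are assumptions of this example, and the start state is assumed to differ from the target.

```python
import numpy as np

def hitting_time_pmf(P, start, target, max_t):
    """Return [P(T=1), ..., P(T=max_t)] for the first hitting time of `target`."""
    P = np.asarray(P, dtype=float)
    others = [s for s in range(P.shape[0]) if s != target]
    Q = P[np.ix_(others, others)]   # moves that avoid the target
    r = P[others, target]           # one-step probabilities into the target
    alive = np.zeros(len(others))   # mass that has not hit the target yet
    alive[others.index(start)] = 1.0
    pmf = []
    for _ in range(max_t):
        pmf.append(float(alive @ r))
        alive = alive @ Q
    return pmf

def prob_never_hit(P, start, target, n_iter=1000):
    """Approximate P(T = infinity) by iterating h <- Q h + r from h = 0."""
    P = np.asarray(P, dtype=float)
    others = [s for s in range(P.shape[0]) if s != target]
    Q = P[np.ix_(others, others)]
    r = P[others, target]
    h = np.zeros(len(others))       # h[i] converges to P(hit target | start in i)
    for _ in range(n_iter):
        h = Q @ h + r
    return float(1.0 - h[others.index(start)])

# Hypothetical chain: state 1 is an absorbing trap, state 3 is the target
P = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])

pmf = hitting_time_pmf(P, start=0, target=3, max_t=5)
p_inf = prob_never_hit(P, start=0, target=3)
```

For this toy chain the target can only be reached at even times, and the trap gives $P(T = \infty) = 2/3$. The iteration is used instead of inverting $I - Q$ because that matrix is singular whenever a closed set of states avoids the target.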


In [None]:
# PART 1
# ------------------------ TRANSITION MATRIX -------------------------------

# Supply each transition matrix as a numpy array of shape (n_states, n_states).
# State order must match exam order, typically (A, B, C, D, ...).

problem3_A = XXX
problem3_B = XXX

In [None]:
# PART 2
# ------------------------ IRREDUCIBLE -------------------------------

problem3_A_irreducible = XXX
problem3_B_irreducible = XXX

In [None]:
# PART 3
# ------------------------ APERIODIC -------------------------------

# Answer each with True or False
problem3_A_is_aperiodic = XXX
problem3_B_is_aperiodic = XXX

# A numpy array of shape (n_states,) containing periods for each state
problem3_A_periods = XXX
problem3_B_periods = XXX

In [None]:
# PART 4
# ------------------------ PROBABILITY OF B AFTER 5 STEPS -------------------------------

problem3_A_PB5 = XXX
problem3_B_PB5 = XXX

In [None]:
# PART 5
# ------------------------ HITTING TIME DISTRIBUTION -------------------------------

# Probabilities for T = 1, 2, 3, 4, 5, and ∞ for chain A
problem3_A_PT1 = XXX
problem3_A_PT2 = XXX
problem3_A_PT3 = XXX
problem3_A_PT4 = XXX
problem3_A_PT5 = XXX
problem3_A_PT_inf = XXX

# Probabilities for chain B
problem3_B_PT1 = XXX
problem3_B_PT2 = XXX
problem3_B_PT3 = XXX
problem3_B_PT4 = XXX
problem3_B_PT5 = XXX
problem3_B_PT_inf = XXX