1.6 Exam vB, PROBLEM 1
Maximum Points = 14
A courier company operates a fleet of delivery trucks that make deliveries to different parts of the
city. The trucks are equipped with GPS tracking devices that record the location of each truck
at regular intervals. The locations are divided into three regions: downtown, the suburbs, and
the countryside. The following table shows the probabilities of a truck transitioning between these
regions at each time step:

| Current region | Probability of transitioning to downtown | Probability of transitioning to the suburbs | Probability of transitioning to the countryside |
|---------------|-------------------------------------------|---------------------------------------------|-------------------------------------------------|
| Downtown      | 0.3                                       | 0.7                                         | 0                                               |
| Suburbs       | 0.2                                       | 0.5                                         | 0.3                                             |
| Countryside   | 0                                         | 0.5                                         | 0.5                                             |

1. If a truck is currently downtown, what is the probability that it will be in the countryside
region after 10 time steps? [2p]
2. If a truck is currently downtown, what is the probability that it reaches the countryside
region for the first time after three or more time steps? [2p]
3. Is this Markov chain irreducible? Explain your answer. [3p]
4. What is the stationary distribution? [3p]
5. Advanced question: What is the expected number of steps it takes, starting from the Downtown
region, to first reach the Countryside region and then return to Downtown? Hint: to
get within 1 decimal point, it is enough to compute the probabilities for hitting times below
120. Motivate your answer in detail [4p]. You could also solve this question by simulation,
but this gives you a maximum of [2p].




# Problem 1 – Markov Chain Solutions

We are given the **transition matrix** $P$:

$$
P = 
\begin{bmatrix}
0.3 & 0.7 & 0 \\
0.2 & 0.5 & 0.3 \\
0 & 0.5 & 0.5
\end{bmatrix}
$$

where the rows/columns correspond to `[Downtown, Suburbs, Countryside]`.

---

## Part 1: Probability of being in the countryside after 10 steps

Let the initial state vector be:

$$
\mathbf{v}_0 = [1, 0, 0]
$$

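The distribution after 10 steps is:

$$
\mathbf{v}_{10} = \mathbf{v}_0 P^{10}
$$

The requested probability is the third (Countryside) entry of $\mathbf{v}_{10}$, which evaluates to approximately $0.318$. The eigenvalues of $P$ are $1$, $0.4$, and $-0.1$, so after 10 steps the chain is already within about $10^{-4}$ of its long-run Countryside probability $7/22 \approx 0.318$.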

---

## Part 2: Probability of first reaching the countryside after ≥3 steps

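Let $T$ be the first time the truck is in the countryside, starting from downtown. The chain is finite and irreducible, so $T$ is finite with probability 1 and $\Pr(T \ge 3) = 1 - \Pr(T = 1) - \Pr(T = 2)$.

- Step 1: $\Pr(T = 1) = P(D \to C) = 0$  
- Step 2: $\Pr(T = 2) = P(D \to S \to C) = 0.7 \cdot 0.3 = 0.21$ (the only two-step path from D to C with nonzero probability)

$$
\Pr(T \ge 3) = 1 - 0 - 0.21 = 0.79
$$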
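
As a numerical cross-check, the first-passage probabilities can be computed by restricting $P$ to the states outside the countryside. This is a minimal sketch; the names `Q`, `r`, and `first_hit_prob` are illustrative and not part of the exam template.

```python
import numpy as np

# Transition matrix; rows/columns are [Downtown, Suburbs, Countryside].
P = np.array([[0.3, 0.7, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.5, 0.5]])

# Q: transitions among the states that are not the countryside (D, S).
Q = P[:2, :2]
# r: one-step probabilities of entering the countryside from D and S.
r = P[:2, 2]

def first_hit_prob(n):
    """P(first visit to the countryside is exactly at step n | start in D)."""
    start = np.array([1.0, 0.0])
    return float(start @ np.linalg.matrix_power(Q, n - 1) @ r)

p_first_1 = first_hit_prob(1)   # no direct D -> C transition
p_first_2 = first_hit_prob(2)   # D -> S -> C
p_at_least_3 = 1.0 - p_first_1 - p_first_2
print(p_first_1, p_first_2, p_at_least_3)
```

The same function gives $\Pr(T = n)$ for any $n$, which is also the quantity the hint in Part 5 refers to.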

**Answer:**

```python
problem1_p2 = 0.79
```

## Part 4: Stationary Distribution

We want to solve the stationary distribution $\pi$ such that:

$$
\pi P = \pi \quad \text{and} \quad \pi_D + \pi_S + \pi_C = 1
$$

Step 1: Solve for $\pi_S$ in terms of $\pi_D$ using the first equation:

$$
\pi_D = 0.3 \pi_D + 0.2 \pi_S \implies \pi_S = 3.5 \pi_D
$$

Step 2: Solve for $\pi_C$ in terms of $\pi_S$ using the third equation:

$$
\pi_C = 0.3 \pi_S + 0.5 \pi_C \implies \pi_C = 0.6 \pi_S
$$

Step 3: Use the normalization condition $\pi_D + \pi_S + \pi_C = 1$ to find $\pi_D$:

$$
\pi_D + \pi_S + \pi_C = \pi_D + 3.5 \pi_D + 0.6 \cdot 3.5 \pi_D = 6.6 \pi_D = 1 \implies \pi_D \approx 0.152
$$

Step 4: Compute the remaining components:

$$
\pi_S \approx 0.530, \quad \pi_C \approx 0.318
$$
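
The hand calculation can be checked by solving the same linear system numerically. The sketch below replaces one of the (redundant) balance equations with the normalization condition; the variable names `A`, `b`, and `pi` are illustrative.

```python
import numpy as np

# Transition matrix; rows/columns are [Downtown, Suburbs, Countryside].
P = np.array([[0.3, 0.7, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.5, 0.5]])

# pi P = pi  is equivalent to  (P^T - I) pi^T = 0.
A = P.T - np.eye(3)
# The three balance equations are linearly dependent, so overwrite the last
# one with the normalization constraint pi_D + pi_S + pi_C = 1.
A[-1, :] = 1.0
b = np.array([0.0, 0.0, 1.0])

pi = np.linalg.solve(A, b)
print(pi.round(3))
```

The exact solution is $\pi = (5/33,\ 35/66,\ 7/22)$, which matches the rounded values above.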

**Answer in Python:**

```python
problem1_stationary = [0.152, 0.530, 0.318]
```


## Part 5: Expected Number of Steps D → C → D

Let:

$$
E[D \to C] = x_D, \quad E[S \to C] = x_S, \quad E[C \to C] = 0
$$

We have:

$$
x_S = 1 + 0.2 x_D + 0.5 x_S \implies x_S = 2 + 0.4 x_D
$$  

$$
x_D = 1 + 0.3 x_D + 0.7 x_S \implies x_D \approx 5.714
$$  

$$
x_S \approx 4.286
$$

Next, let:

$$
E[C \to D] = y_C, \quad E[S \to D] = y_S, \quad E[D \to D] = 0
$$

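We have, conditioning on the first step from C and from S:

$$
y_C = 1 + 0.5 y_S + 0.5 y_C \implies y_C = 2 + y_S
$$

$$
y_S = 1 + 0.5 y_S + 0.3 y_C = 1 + 0.5 y_S + 0.3 (2 + y_S) \implies y_S = 8
$$

$$
y_C = 2 + y_S = 10
$$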

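**Total expected steps:** by the Markov property the chain starts afresh when it first reaches C, so the two expected hitting times add:

$$
E[D \to C \to D] = x_D + y_C = 5.714 + 10 \approx 15.7
$$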
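
Both routes to this number can be sketched in code: solving the first-step equations as a linear system, and, following the hint, summing $n \cdot \Pr(T = n)$ over hitting times below 120. The helper names `expected_hitting_time` and `truncated_mean` are hypothetical, introduced only for this illustration.

```python
import numpy as np

# Transition matrix; rows/columns are [Downtown, Suburbs, Countryside].
P = np.array([[0.3, 0.7, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.5, 0.5]])
D, S, C = 0, 1, 2

def expected_hitting_time(P, target):
    """Solve h_i = 1 + sum_{j != target} P[i, j] * h_j for every i != target."""
    others = [i for i in range(P.shape[0]) if i != target]
    Q = P[np.ix_(others, others)]
    h = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
    return dict(zip(others, h))

def truncated_mean(P, start, target, n_max=120):
    """Approximate E[T] by summing n * P(T = n) for n = 1, ..., n_max - 1."""
    others = [i for i in range(P.shape[0]) if i != target]
    Q = P[np.ix_(others, others)]
    r = P[others, target]               # one-step probabilities into the target
    dist = np.zeros(len(others))
    dist[others.index(start)] = 1.0     # mass on paths that have avoided the target
    mean = 0.0
    for n in range(1, n_max):
        mean += n * float(dist @ r)     # P(T = n)
        dist = dist @ Q
    return mean

x = expected_hitting_time(P, C)         # hitting times of Countryside
y = expected_hitting_time(P, D)         # hitting times of Downtown
total = x[D] + y[C]
total_truncated = truncated_mean(P, D, C) + truncated_mean(P, C, D)
print(round(total, 3), round(total_truncated, 3))
```

The truncated sum agrees with the exact value to well within one decimal because the tail probabilities $\Pr(T \ge n)$ decay geometrically.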

```python
problem1_ET = 15.7
```



In [None]:
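# Part 1
# Fill in the answer to part 1 below

import numpy as np

# Transition matrix; rows/columns are [Downtown, Suburbs, Countryside]
P = np.array([[0.3, 0.7, 0],
              [0.2, 0.5, 0.3],
              [0, 0.5, 0.5]])

v0 = np.array([1, 0, 0])                    # start in Downtown
v10 = v0 @ np.linalg.matrix_power(P, 10)    # distribution after 10 steps

# Probability of being in the Countryside after 10 steps (about 0.318)
problem1_p1 = v10[2]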


In [None]:
# Part 2
# Fill in the answer to part 2 below
# Part 2
problem1_p2 = 0.79

1.7 Part 3
Double click this cell to enter edit mode and write your answer for part 3 below this line
Part 3: Is the Markov chain irreducible?

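A chain is irreducible if every state can be reached from every other state in a finite number of steps with positive probability.

Check connectivity:

Downtown ↔ Suburbs: both one-step transitions have positive probability (0.7 and 0.2). ✅

Suburbs ↔ Countryside: both one-step transitions have positive probability (0.3 and 0.5). ✅

Downtown → Countryside: no direct transition, but Downtown → Suburbs → Countryside is possible. ✅

Countryside → Downtown: no direct transition, but Countryside → Suburbs → Downtown is possible. ✅

All states communicate with each other.

✅ So the chain is irreducible: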

problem1_irreducible=True
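
The connectivity check can also be done mechanically. A minimal sketch, using the standard fact that for a chain with $n$ states, state $j$ is reachable from state $i$ exactly when entry $(i, j)$ of $(I + A)^{n-1}$ is positive, where $A$ is the 0/1 adjacency matrix of the allowed transitions (the names `A`, `reach`, and `is_irreducible` are illustrative):

```python
import numpy as np

# Transition matrix; rows/columns are [Downtown, Suburbs, Countryside].
P = np.array([[0.3, 0.7, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.5, 0.5]])
n = P.shape[0]

# Adjacency matrix: 1 where a one-step transition has positive probability.
A = (P > 0).astype(int)

# Entry (i, j) of (I + A)^(n-1) is positive iff j is reachable from i
# in at most n - 1 steps.
reach = np.linalg.matrix_power(np.eye(n, dtype=int) + A, n - 1)
is_irreducible = bool((reach > 0).all())
print(is_irreducible)
```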

In [None]:
 # Part 3
# Fill in the answer to part 3 below as a boolean
problem1_irreducible = True


In [None]:
# Part 4
# Fill in the answer to part 4 below
# the answer should be a numpy array of length 3
# make sure that the entries sums to 1!
import numpy as np
problem1_stationary = np.array([0.152, 0.530, 0.318])


1.8 Part 5
Double click this cell to enter edit mode and write your answer for part 5 below this line

In [None]:
# Part 5
# Fill in the answer to part 5 below
# That is, the expected number of steps
problem1_ET = 15.7

1.9 Exam vB, PROBLEM 2
Maximum Points = 13
You are given a “Data Science Salaries” dataset found in data/salaries.csv, which contains
employment information of data scientists up to 2023 and the salary obtained. Your task is to
train a linear regression model to predict the salary of a data scientist based on the employment
information.
To evaluate your model, you will split the dataset into a training set and a testing set. You will
use the training set to train your model, and the testing set to evaluate its performance.
Experience level: 0 = Entry Level, 1 = Mid Level, 2 = Senior Level, 3 = Executive Level.
Employment type: 0 = Part Time, 1 = Full Time, 2 = Contractor, 3 = Freelancer
1. Load the data into a pandas dataframe problem2_df. Based on the column names, figure
out what are the features and the target and fill in the answer in the correct cell below. [1p]
2. Split the data into train and test. [1p]
3. Train the model. [1p]
4. Come up with a reasonable metric and compute it. Provide plots that show the performance
of the model. Reason about the performance. [4p]
5. Predict the 2023 salary of a data scientist that works full time (1) at mid employment level
(1) with 0 remote ratio. Then, looking at the output of problem2_model.coef_, which are
the coefficients of the linear model, would a higher remote ratio result in a higher predicted
salary or vice versa? [3p]
6. Advanced question: On the test set, plot the empirical distribution function of the residual
with confidence bands (i.e. using the DKW inequality and 95% confidence). What does the
confidence band tell us? What can the confidence band be used for? [3p]


In [None]:
# Part 1
# Let problem2_df be the pandas dataframe that contains the data from the file
# data/salaries.csv
import pandas as pd

# Load the dataset
problem2_df = pd.read_csv("data/salaries.csv")

# Inspect first few rows
problem2_df.head()



In [None]:
# Part 1
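# Fill in the features as a list of strings of the names of the columns
# Fill in the target as a string with the correct column name
# NOTE: the column names below are assumed; check them against problem2_df.columns.
# Part 5 asks for a 2023 prediction, so if the data has a year column it
# probably belongs among the features as well.
problem2_features = ["experience_level", "employment_type", "remote_ratio"]
problem2_target = "salary"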


In [None]:
# Part 2
# Split the data into train and test using train_test_split
# keep the train size as 0.8 and use random_state=42
from sklearn.model_selection import train_test_split

X = problem2_df[problem2_features]
y = problem2_df[problem2_target]

problem2_X_train, problem2_X_test, problem2_y_train, problem2_y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)


In [None]:
# Part 3
# Include the necessary imports
from sklearn.linear_model import LinearRegression

# Initialize the model
problem2_model = LinearRegression()

# Train the model
problem2_model.fit(problem2_X_train, problem2_y_train)


1.10 Part 4
Double click this cell to enter edit mode and write your answer for part 4 below this line.

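Metric: RMSE on the test set (the typical prediction error, in the same units as salary) together with R² (the fraction of the variance in salary explained by the model).

Reasoning about performance:

In the predicted vs actual plot, points close to the diagonal line indicate a good fit; systematic deviation from the diagonal indicates bias.

The residual plot should show no clear pattern and be roughly centered at zero; a funnel shape or a trend suggests that a linear model on these features misses structure in the data.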

In [None]:
# Part 4
# Write the code to diagnose your model

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

# Predictions
y_pred = problem2_model.predict(problem2_X_test)

# Compute metrics
rmse = np.sqrt(mean_squared_error(problem2_y_test, y_pred))
r2 = r2_score(problem2_y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.2f}")

# Plot predicted vs actual
plt.scatter(problem2_y_test, y_pred, alpha=0.5)
lims = [problem2_y_test.min(), problem2_y_test.max()]
plt.plot(lims, lims, 'r--')  # diagonal = perfect prediction
plt.xlabel("Actual Salary")
plt.ylabel("Predicted Salary")
plt.title("Predicted vs Actual Salary")
plt.show()

# Plot residuals
residuals = problem2_y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel("Predicted Salary")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()



1.11 Part 5
Double click this cell to enter edit mode and write your answer for part 5 below this line.

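The coefficient for remote_ratio is the change in predicted salary per unit increase in remote ratio, with the other features held fixed.

If the coefficient is positive → a higher remote ratio gives a higher predicted salary.

If it is negative → a higher remote ratio gives a lower predicted salary.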

In [None]:
# Part 5
# Put the code for part 5 below this line

# New data point
new_data = pd.DataFrame({
    "experience_level": [1],
    "employment_type": [1],
    "remote_ratio": [0]
})



In [None]:
# Part 5
problem2_predicted_salary = problem2_model.predict(new_data)[0]
print(f"Predicted salary: {problem2_predicted_salary:.2f}")

print("Coefficients:", problem2_model.coef_)
coef_df = pd.DataFrame({
    "feature": problem2_features,
    "coefficient": problem2_model.coef_
})
print(coef_df)


1.12 Part 6
Double click this cell to enter edit mode and write your answer for part 6 below this line.

Interpretation:

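The DKW band is a simultaneous band: with probability at least 95%, the true CDF of the residuals lies between the lower and upper curves at every point at once. Its half-width, $\sqrt{\ln(2/\alpha)/(2n)}$, depends only on the test-set size $n$ and shrinks as $n$ grows.

It can be used to bound probabilities and quantiles of the prediction error (for example, the probability that the residual stays below a given threshold), and to check a hypothesized residual distribution such as a normal: if that CDF leaves the band, it is rejected at the 5% level.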
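
To illustrate one use of the band, the sketch below bounds the probability that a residual falls within a threshold. It runs on synthetic residuals, not the exam data; `residuals_demo`, the threshold `t`, and the normal distribution used to generate the data are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic residuals for illustration only (NOT the salary data).
residuals_demo = rng.normal(loc=0.0, scale=50_000.0, size=500)

n = len(residuals_demo)
alpha = 0.05
epsilon = np.sqrt(np.log(2 / alpha) / (2 * n))   # DKW half-width

def ecdf_at(t):
    """Empirical CDF of the synthetic residuals evaluated at t."""
    return float(np.mean(residuals_demo <= t))

t = 100_000.0
# P(-t < residual <= t) = F(t) - F(-t). The band holds at both points
# simultaneously, so the difference is off by at most 2 * epsilon.
p_hat = ecdf_at(t) - ecdf_at(-t)
p_lower = max(p_hat - 2 * epsilon, 0.0)
p_upper = min(p_hat + 2 * epsilon, 1.0)
print(f"epsilon = {epsilon:.4f}, P(|residual| <= t) in [{p_lower:.3f}, {p_upper:.3f}]")
```

Because the band is simultaneous, the same `epsilon` can be reused for any number of thresholds without a further multiple-testing correction.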


In [None]:
# Part 6
# Put the code for part 6 below this line

# Compute ECDF
residuals_sorted = np.sort(residuals)
n = len(residuals_sorted)
ecdf = np.arange(1, n+1) / n

# DKW confidence bands
alpha = 0.05
epsilon = np.sqrt(np.log(2/alpha)/(2*n))
lower = np.maximum(ecdf - epsilon, 0)
upper = np.minimum(ecdf + epsilon, 1)

# Plot ECDF with confidence bands
plt.step(residuals_sorted, ecdf, where='post', label='ECDF')
plt.step(residuals_sorted, lower, color='red', linestyle='--', label='95% CI lower')
plt.step(residuals_sorted, upper, color='red', linestyle='--', label='95% CI upper')
plt.xlabel("Residuals")
plt.ylabel("ECDF")
plt.title("Empirical CDF of Residuals with 95% Confidence Band")
plt.legend()
plt.show()


1.13 Exam vB, PROBLEM 3
Maximum Points = 13
For this problem we have the Diabetes dataset, I have encoded the categorical features
using One-Hot encoding, namely the following ['smoking_No Info', 'smoking_current',
'smoking_ever', 'smoking_former', 'smoking_never', 'smoking_not current',
'sex_Female', 'sex_Male', 'sex_Other'].
Treating this as a classification problem, we will train a logistic regression model to predict whether
the patient has diabetes or not. Then the task is to evaluate the model and use it to draw some
conclusions.
Instructions:
1. Load the file data/diabetes.csv into the pandas dataframe problem3_df. Decide what
should be features and target, give motivations for your choices. [3p]
2. Create the problem3_X and the problem3_y as numpy arrays with problem3_X being the
features and problem3_y being the target. Do the standard train-test split with 80% training
data and 20% testing data. Store these in the variables defined in the cells. [2p]
3. Now train a Logistic regression model on the training data. Using
sklearn.linear_model.LogisticRegression. Hint: If you use many of the One-Hot
encoded features you will probably see a warning about max iterations reached, adjust the
hyperparameter C (this is the penalization) when you create your LogisticRegression.[2p]
4. Evaluation: Calculate the precision and recall for class 0 and 1 with 95% confidence bounds.
Explain their meaning [3p]
5. Advanced question: Come up with a way to define the one-hot encoded feature that is most
important for the prediction. Motivate your choice. [3p]

1.14 Part 1
Double click this cell to enter edit mode and write your answer for part 1 below this line

What features are reasonable?

In regards to how much data we have, how many features do you think we should aim
for?

What other features would you have liked to use that were not collected?

Discussion

Reasoning:

The categorical variables smoking and sex are already one-hot encoded in the dataset.

Reasonable features: the one-hot encoded columns, plus any numeric health indicators available in the file (for example BMI, age, or blood pressure).

Target: whether the patient has diabetes (0 = no, 1 = yes).

Number of features:

One-hot encoding increases the number of columns quickly. With limited data, keeping the feature count small relative to the number of rows (on the order of tens of features at most) reduces the risk of overfitting.

Additional useful features that were not collected might include: family history of diabetes, exercise frequency, diet, cholesterol, HbA1c.

In [None]:
# Part 1
# Let problem3_df be the pandas dataframe that contains the data from the file
# data/diabetes.csv
import pandas as pd

# Load the dataset
problem3_df = pd.read_csv("data/diabetes.csv")




In [None]:
# Part 1
# Fill in the features as a list of strings of the names of the columns
# Fill in the target as a string with the correct column name

# Features and target
problem3_features = [
    'smoking_No Info', 'smoking_current', 'smoking_ever', 'smoking_former',
    'smoking_never', 'smoking_not current', 'sex_Female', 'sex_Male', 'sex_Other'
]
problem3_target = 'diabetes'  # assuming column is named 'diabetes'

In [None]:
# Part 2
# Fill in your X and y below
# Split the data into train and test using train_test_split
# keep the train size as 0.8 and use random_state=42
import numpy as np
from sklearn.model_selection import train_test_split

# Features and target arrays
problem3_X = problem3_df[problem3_features].values
problem3_y = problem3_df[problem3_target].values

# Train-test split
problem3_X_train, problem3_X_test, problem3_y_train, problem3_y_test = train_test_split(
    problem3_X, problem3_y, train_size=0.8, random_state=42
)


In [None]:
# Part 3
# Initialize your LogisticRegression model
# Fit your initialized model on the training data

from sklearn.linear_model import LogisticRegression

# C is the inverse regularization strength (C=1.0 is the scikit-learn default);
# lower it for a stronger penalty if a convergence warning appears.
problem3_model = LogisticRegression(max_iter=500, C=1.0)

# Fit model
problem3_model.fit(problem3_X_train, problem3_y_train)


1.15 Part 4
Double click this cell to enter edit mode and write your answer for part 4 below this line

Interpretation:

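Precision for a class: of the test points predicted as that class, the fraction that truly belong to it.

Recall for a class: of the test points that truly belong to that class, the fraction the model predicted as that class.

Each tuple is a 95% confidence interval: it reflects the uncertainty in the metric that comes from evaluating on a finite test sample.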
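
A small worked example may make the two definitions concrete. The labels below are toy data, not the diabetes set, and the Hoeffding-based interval is shown only as one possible alternative to a bootstrap interval; the helper names `precision_recall` and `hoeffding_interval` are hypothetical.

```python
import numpy as np

# Toy labels for illustration only (NOT the exam data).
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for the given class label."""
    tp = np.sum((y_pred == label) & (y_true == label))
    predicted = np.sum(y_pred == label)   # points predicted as `label`
    actual = np.sum(y_true == label)      # points that truly are `label`
    return tp / predicted, tp / actual

def hoeffding_interval(p_hat, n, alpha=0.05):
    """Two-sided (1 - alpha) interval for a proportion estimated from n points."""
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))
    return max(p_hat - eps, 0.0), min(p_hat + eps, 1.0)

prec_1, rec_1 = precision_recall(y_true_demo, y_pred_demo, 1)
prec_0, rec_0 = precision_recall(y_true_demo, y_pred_demo, 0)

# Precision for class 1 is a proportion among the 5 points predicted as 1.
lo_1, hi_1 = hoeffding_interval(prec_1, n=5)
print(prec_1, rec_1, prec_0, rec_0, (lo_1, hi_1))
```

With only five predicted positives the interval covers all of $[0, 1]$, which shows how strongly the width depends on the number of points behind each metric.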

In [None]:
# Part 4
# Give the answer for each of the following quantities in the form of a tuple
# Example, if we want to say that the precision for class 0 is between 0.31 and 0.69
# then we would answer
# problem3_precision_0 = (0.31,0.69)
# The intervals below are percentile bootstrap intervals over the test set.

from sklearn.metrics import precision_score, recall_score

# Predictions
y_pred = problem3_model.predict(problem3_X_test)

# Number of bootstrap samples
n_boot = 1000
rng = np.random.default_rng(42)

precision_0_samples = []
recall_0_samples = []
precision_1_samples = []
recall_1_samples = []

for _ in range(n_boot):
    idx = rng.choice(len(problem3_y_test), len(problem3_y_test), replace=True)
    y_true_sample = problem3_y_test[idx]
    y_pred_sample = y_pred[idx]
    
    # zero_division=0: score an undefined metric (empty denominator) as 0 without a warning
    precision_0_samples.append(precision_score(y_true_sample, y_pred_sample, pos_label=0, zero_division=0))
    recall_0_samples.append(recall_score(y_true_sample, y_pred_sample, pos_label=0, zero_division=0))
    precision_1_samples.append(precision_score(y_true_sample, y_pred_sample, pos_label=1, zero_division=0))
    recall_1_samples.append(recall_score(y_true_sample, y_pred_sample, pos_label=1, zero_division=0))

# 95% confidence intervals
problem3_precision_0 = (np.percentile(precision_0_samples, 2.5), np.percentile(precision_0_samples, 97.5))
problem3_recall_0 = (np.percentile(recall_0_samples, 2.5), np.percentile(recall_0_samples, 97.5))
problem3_precision_1 = (np.percentile(precision_1_samples, 2.5), np.percentile(precision_1_samples, 97.5))
problem3_recall_1 = (np.percentile(recall_1_samples, 2.5), np.percentile(recall_1_samples, 97.5))

print("Precision 0:", problem3_precision_0)
print("Recall 0:", problem3_recall_0)
print("Precision 1:", problem3_precision_1)
print("Recall 1:", problem3_recall_1)



1.16 Part 5
Double click this cell to enter edit mode and write your answer for part 5 below this line.

Motivation:

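Each coefficient is the change in the log-odds of diabetes when that feature switches from 0 to 1, with the other features held fixed. All one-hot features share the same 0/1 scale, so their coefficients can be compared directly, and the feature with the largest absolute coefficient is taken as the most important.

A large positive coefficient → strongly increases the predicted probability of diabetes.

A large negative coefficient → strongly decreases the predicted probability.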

In [None]:
# Part 5
# Put whatever calculations you need here

import pandas as pd

coef_df = pd.DataFrame({
    "feature": problem3_features,
    "coefficient": problem3_model.coef_[0]
}).sort_values(by="coefficient", key=abs, ascending=False)

# Most important feature
most_important_feature = coef_df.iloc[0]
print(most_important_feature)
