<a href="https://colab.research.google.com/github/Rocking-Priya/704-fall-projects-2025/blob/main/Week_6_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DX 704 Week 6 Project
This project will develop a treatment plan for a fictitious illness "Twizzleflu".
Twizzleflu is a mild illness caused by a virus.
The main symptoms are a mild fever, fidgeting, and kicking the blankets off the bed or couch.
Mild dehydration has also been reported in more severe cases.
These symptoms typically last 1-2 weeks without treatment.
Word on the internet says that Twizzleflu can be cured faster by drinking copious amounts of orange juice, but this has not been supported by evidence so far.
You will be provided with a theoretical model of Twizzleflu modeled as a Markov decision process.
Based on the model, you will compute optimal treatment plans to optimize different criteria, and compare patient discomfort with the different plans.

The full project description, a template notebook, and raw data are available on GitHub: [Project 6 Materials](https://github.com/bu-cds-dx704/dx704-project-06).

We will model Twizzleflu as a Markov decision process.
The model transition probabilities are provided in the file "twizzleflu-transitions.tsv" and the expected rewards are in "twizzleflu-rewards.tsv".
The goal for Twizzleflu is to minimize the expected discomfort of the patient, which is expressed as negative rewards in the file.
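
As a rough illustration of how long-format tables like these can be turned into dense arrays, the sketch below builds a reward array `R[a, s]` and a transition array `P[a, s, s']` from two tiny tables. The tables, the action name `wait`, the state names, and all `toy_` variables are hypothetical stand-ins, not the contents of the project files.

```python
import numpy as np
import pandas as pd

# Hypothetical two-state, one-action tables in the same long format as the TSV files.
toy_rewards = pd.DataFrame({
    "action": ["wait", "wait"],
    "state": ["sick", "recovered"],
    "reward": [-1.0, 0.0],
})
toy_transitions = pd.DataFrame({
    "action": ["wait", "wait", "wait"],
    "state": ["sick", "sick", "recovered"],
    "next_state": ["sick", "recovered", "recovered"],
    "probability": [0.5, 0.5, 1.0],
})

toy_actions = sorted(toy_rewards["action"].unique())
toy_states = sorted(set(toy_rewards["state"]) | set(toy_transitions["next_state"]))
action_index = {a: i for i, a in enumerate(toy_actions)}
state_index = {s: i for i, s in enumerate(toy_states)}

# R[a, s]: expected reward for taking action a in state s.
toy_R = np.zeros((len(toy_actions), len(toy_states)))
toy_R[toy_rewards["action"].map(action_index).to_numpy(),
      toy_rewards["state"].map(state_index).to_numpy()] = toy_rewards["reward"].to_numpy()

# P[a, s, s2]: probability of moving from s to s2 when taking action a.
toy_P = np.zeros((len(toy_actions), len(toy_states), len(toy_states)))
toy_P[toy_transitions["action"].map(action_index).to_numpy(),
      toy_transitions["state"].map(state_index).to_numpy(),
      toy_transitions["next_state"].map(state_index).to_numpy()] = toy_transitions["probability"].to_numpy()

# Every (action, state) row of P should be a probability distribution.
rows_are_distributions = np.allclose(toy_P.sum(axis=2), 1.0)
```

Checking that each row of the transition array sums to 1 is a cheap way to catch a missing or misnamed state before running any value calculation.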

## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
df_rewards = pd.read_csv("twizzleflu-rewards.tsv", sep="\t")
df_rewards

Unnamed: 0,action,state,reward
0,do-nothing,exposed-1,0.0
1,do-nothing,exposed-2,0.0
2,do-nothing,exposed-3,0.0
3,do-nothing,symptoms-1,-0.5
4,do-nothing,symptoms-2,-1.0
5,do-nothing,symptoms-3,-0.5
6,do-nothing,recovered,0.0
7,drink-oj,exposed-1,0.0
8,drink-oj,exposed-2,0.0
9,drink-oj,exposed-3,0.0


In [None]:
df_transitions = pd.read_csv("twizzleflu-transitions.tsv", sep="\t")
df_transitions

Unnamed: 0,action,state,next_state,probability
0,do-nothing,exposed-1,exposed-2,0.8
1,do-nothing,exposed-1,recovered,0.2
2,do-nothing,exposed-2,exposed-3,0.8
3,do-nothing,exposed-2,recovered,0.2
4,do-nothing,exposed-3,symptoms-1,0.8
5,do-nothing,exposed-3,recovered,0.2
6,do-nothing,symptoms-1,symptoms-1,0.7
7,do-nothing,symptoms-1,symptoms-2,0.3
8,do-nothing,symptoms-2,symptoms-2,0.7
9,do-nothing,symptoms-2,symptoms-3,0.3


## Part 1: Evaluate a Do Nothing Plan

One of the treatment actions is to do nothing.
Calculate the expected discomfort (not rewards) of a policy that always does nothing.

Hint: for this value calculation and later ones, use value iteration.
The analytical solution has difficulties in practice when there is no discount factor.
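
To see why the hint matters, here is a minimal sketch on a hypothetical two-state chain (the numbers are made up and are not the Twizzleflu model). Without a discount factor, the analytical solution requires inverting `I - P`, which is singular whenever the chain has an absorbing state, while repeated value updates still converge as long as the absorbing state has zero reward.

```python
import numpy as np

# Hypothetical chain under a fixed policy: state 0 is "sick", state 1 is "recovered" (absorbing).
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
R_pi = np.array([-1.0, 0.0])

# With gamma = 1, the analytical solution needs (I - P_pi) to be invertible.
# The absorbing row of P_pi produces a row of zeros in I - P_pi, so it is not.
A = np.eye(2) - P_pi
rank = np.linalg.matrix_rank(A)   # smaller than the number of states

# Repeating v <- R_pi + P_pi @ v from zeros still converges here,
# because no reward is collected once the absorbing state is reached.
v = np.zeros(2)
for _ in range(1000):
    v_next = R_pi + P_pi @ v
    delta = np.max(np.abs(v_next - v))
    v = v_next
    if delta < 1e-12:
        break

expected_discomfort = -v   # about 2 for the sick state in this toy chain
```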

In [None]:
# YOUR CHANGES HERE



# --- helper functions copied/adapted from the video examples ---
def compute_qT_once(R, P, gamma, v):
    # R shape: (A, nS); P shape: (A, nS, nS); v shape: (nS,)
    return R + gamma * (P @ v)

def iterate_values_once(R, P, gamma, v):
    # R shape (A, nS) -> returns new v (nS,)
    return np.max(compute_qT_once(R, P, gamma, v), axis=0)

def value_iteration(R, P, gamma, max_iterations=10000, tolerance=1e-8):
    v_old = np.zeros(R.shape[-1])
    for i in range(max_iterations):
        v_new = iterate_values_once(R, P, gamma, v_old)
        if np.max(np.abs(v_new - v_old)) < tolerance:
            return v_new
        v_old = v_new
    return v_old

# --- build action/state indices from your dfs ---
actions = sorted(df_rewards['action'].unique(), key=str)
# gather states from both rewards and transitions to be safe
states = sorted(pd.unique(pd.concat([df_rewards['state'],
                                     df_transitions['state'],
                                     df_transitions['next_state']])), key=str)

a2i = {a:i for i,a in enumerate(actions)}
s2i = {s:i for i,s in enumerate(states)}
nA = len(actions)
nS = len(states)

# --- build R and P arrays like in the videos (shaped A x nS, A x nS x nS) ---
R = np.zeros((nA, nS), dtype=float)
for _, row in df_rewards.iterrows():
    a = row['action']; s = row['state']; r = float(row['reward'])
    R[a2i[a], s2i[s]] = r

P = np.zeros((nA, nS, nS), dtype=float)
expected_cols = {'action','state','next_state','probability'}
if expected_cols.issubset(set(df_transitions.columns)):
    for _, row in df_transitions.iterrows():
        a = row['action']; s_from = row['state']; s_to = row['next_state']; p = float(row['probability'])
        P[a2i[a], s2i[s_from], s2i[s_to]] = p
else:
    raise RuntimeError("df_transitions format unexpected. Expected columns: action,state,next_state,probability")

# --- pick the do-nothing action label: adjust if your action name differs ---
possible_labels = ["do-nothing","do_nothing","do nothing","noop","no-op","nothing"]
do_nothing_label = None
for lbl in possible_labels:
    if lbl in a2i:
        do_nothing_label = lbl
        break
if do_nothing_label is None:
    # none of the candidate labels matched: show the available actions and stop
    print("Available actions:", actions)
    raise RuntimeError("Could not auto-detect the do-nothing action; set 'do_nothing_label' to the correct action string.")

a_do_nothing = a2i[do_nothing_label]

# --- build deterministic policy pi that picks do-nothing everywhere ---
pi = np.full(nS, a_do_nothing, dtype=int)

# --- factor policy to single-action rewards/transitions R_pi, P_pi ---
R_pi = R[pi, np.arange(nS)]        # shape (nS,)
P_pi = P[pi, np.arange(nS), :]     # shape (nS, nS)

# reshape to mimic arrays with one action, as in the videos
R_pi_shaped = R_pi.reshape(1, -1)         # shape (1, nS)
P_pi_shaped = P_pi.reshape(1, nS, nS)     # shape (1, nS, nS)

# --- choose gamma ---
# The problem is undiscounted (gamma = 1.0). This cell approximates it with
# gamma_to_use = 0.999, so the reported values are slightly smaller in magnitude
# than the undiscounted ones.
gamma = 1.0   # conceptual discount factor for the problem

if gamma == 1.0:
    gamma_to_use = 0.999   # approximation used for the iterations below
else:
    gamma_to_use = gamma

# --- compute values via value-iteration style iterations on the single-action MDP ---
v_pi = value_iteration(R_pi_shaped, P_pi_shaped, gamma_to_use, max_iterations=20000, tolerance=1e-9)

# --- interpret discomfort ---
# The rewards file expresses discomfort as negative rewards,
# so expected discomfort is the negated value function.
expected_discomfort_per_state = -v_pi   # expected discomfort per state

# print a sample / starting-state value
start_state = "exposed-1"   # change to the initial state name you want
if start_state in s2i:
    print("Expected discomfort from start state", start_state, "=", expected_discomfort_per_state[s2i[start_state]])
else:
    print("Start state not found. Here are per-state expected discomforts (index -> state):")
    for i,s in enumerate(states):
        print(i, s, expected_discomfort_per_state[i])

Expected discomfort from start state exposed-1 = 3.3838999386427893


Save the expected discomfort by state to a file "do-nothing-discomfort.tsv" with columns state and expected_discomfort.
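
A minimal sketch of writing a two-column table as tab-separated values and reading it back to confirm the column names. The data, the file name `toy-discomfort.tsv`, and the temporary directory are hypothetical; only the `sep="\t"` and `index=False` arguments mirror what this notebook uses.

```python
import os
import tempfile

import pandas as pd

# Hypothetical values, only to illustrate the file layout.
toy_values = pd.DataFrame({
    "state": ["sick", "recovered"],
    "expected_discomfort": [2.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-discomfort.tsv")
    # index=False keeps the row index out of the file, so only the two named columns are written.
    toy_values.to_csv(path, sep="\t", index=False)
    round_trip = pd.read_csv(path, sep="\t")

columns_written = list(round_trip.columns)
```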

In [None]:
# YOUR CHANGES HERE

df_discomfort = pd.DataFrame({'state': states, 'expected_discomfort': expected_discomfort_per_state})
df_discomfort.to_csv("do-nothing-discomfort.tsv", sep="\t", index=False)

print("Saved expected discomfort values to do-nothing-discomfort.tsv")

Saved expected discomfort values to do-nothing-discomfort.tsv


In [None]:
print(df_discomfort)

        state  expected_discomfort
0   exposed-1             3.383900
1   exposed-2             4.234109
2   exposed-3             5.297934
3   recovered            -0.000000
4  symptoms-1             6.629047
5  symptoms-2             4.982831
6  symptoms-3             1.662787


Submit "do-nothing-discomfort.tsv" in Gradescope.

In [None]:
from google.colab import files
files.download("do-nothing-discomfort.tsv")

## Part 2: Compute an Optimal Treatment Plan

Compute an optimal treatment plan for Twizzleflu.
It should minimize the expected discomfort (maximize the rewards).
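
The idea can be sketched on a small hypothetical MDP (two states, two actions, all names and numbers made up): value iteration finds the best achievable values, and the plan is read off by picking, in each state, the action with the highest one-step lookahead value.

```python
import numpy as np

# Hypothetical MDP: state 0 = "sick", state 1 = "recovered" (absorbing).
# Action 0 = "rest", action 1 = "medicine".
R = np.array([[-1.0, 0.0],     # rest
              [-1.5, 0.0]])    # medicine: more discomfort per day
P = np.array([[[0.5, 0.5], [0.0, 1.0]],    # rest: 50% chance to recover each day
              [[0.1, 0.9], [0.0, 1.0]]])   # medicine: 90% chance to recover each day
gamma = 1.0

# Value iteration: repeatedly take the best action value in every state.
v = np.zeros(2)
for _ in range(10_000):
    q = R + gamma * (P @ v)          # q[a, s]
    v_next = q.max(axis=0)
    delta = np.max(np.abs(v_next - v))
    v = v_next
    if delta < 1e-12:
        break

# Greedy plan: the action with the highest q-value in each state.
q = R + gamma * (P @ v)
greedy_policy = q.argmax(axis=0)
expected_discomfort = -v
```

In this toy model the medicine costs more per day but shortens the illness enough to win overall, so the greedy plan picks it in the sick state. Ties, such as in the recovered state, go to the lowest action index because of how `argmax` breaks ties.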

In [None]:
# YOUR CHANGES HERE


# ---------- helper functions (from videos) ----------
def compute_qT_once(R, P, gamma, v):
    return R + gamma * (P @ v)

def iterate_values_once(R, P, gamma, v):
    return np.max(compute_qT_once(R, P, gamma, v), axis=0)

def value_iteration(R, P, gamma, max_iterations=20000, tolerance=1e-9):
    v_old = np.zeros(R.shape[-1])
    for i in range(max_iterations):
        v_new = iterate_values_once(R, P, gamma, v_old)
        if np.max(np.abs(v_new - v_old)) < tolerance:
            return v_new
        v_old = v_new
    return v_old

def iterative_policy_evaluation(R, P, gamma, pi, max_iterations=20000, tolerance=1e-9, warmstart=None):
    # deterministic policy version (from video)
    n = R.shape[-1]
    R_pi = R[pi, np.arange(n)]
    P_pi = P[pi, np.arange(n),:]
    # reshape so we can reuse iterate_values_once which expects an action axis
    R_pi = R_pi.reshape(1, *R_pi.shape)
    P_pi = P_pi.reshape(1, *P_pi.shape)
    v_old = warmstart if warmstart is not None else np.zeros(n)
    for i in range(max_iterations):
        v_new = iterate_values_once(R_pi, P_pi, gamma, v_old)
        if np.max(np.abs(v_new - v_old)) < tolerance:
            return v_new
        v_old = v_new
    return v_old

# ---------- build actions/states and R, P (same as Part 1) ----------
actions = sorted(df_rewards['action'].unique(), key=str)
states = sorted(pd.unique(pd.concat([df_rewards['state'],
                                     df_transitions['state'],
                                     df_transitions['next_state']])), key=str)

a2i = {a:i for i,a in enumerate(actions)}
s2i = {s:i for i,s in enumerate(states)}
nA = len(actions)
nS = len(states)

R = np.zeros((nA, nS), dtype=float)
for _, row in df_rewards.iterrows():
    a = row['action']; s = row['state']; r = float(row['reward'])
    R[a2i[a], s2i[s]] = r

P = np.zeros((nA, nS, nS), dtype=float)
expected_cols = {'action','state','next_state','probability'}
if expected_cols.issubset(set(df_transitions.columns)):
    for _, row in df_transitions.iterrows():
        a = row['action']; s_from = row['state']; s_to = row['next_state']; p = float(row['probability'])
        P[a2i[a], s2i[s_from], s2i[s_to]] = p
else:
    raise RuntimeError("df_transitions format unexpected. Expected columns: action,state,next_state,probability")

# ---------- gamma handling ----------
# If problem statement sets gamma=1.0 (no discount), use gamma_to_use < 1 for numerical stability as the video suggests
gamma = 1.0   # replace with actual gamma if given in the dataset / problem
gamma_to_use = 0.999 if gamma == 1.0 else gamma

# ---------- Value iteration to get optimal values ----------
v_star = value_iteration(R, P, gamma_to_use)

# ---------- Greedy policy extraction (optimal treatment plan) ----------
# compute q-values for each action/state using v_star:
qT = compute_qT_once(R, P, gamma_to_use, v_star)   # shape (nA, nS)
pi_star_idx = np.argmax(qT, axis=0)  # integer action index per state

# convert indices to action labels
pi_star_actions = [actions[a] for a in pi_star_idx]

# ---------- Evaluate the optimal policy (optional but useful) ----------
v_pi_star = iterative_policy_evaluation(R, P, gamma_to_use, pi_star_idx)

# ---------- Convert to expected discomfort if rewards are negative discomforts ----------
expected_discomfort_optimal = -v_pi_star

# ---------- Save outputs ----------
df_policy = pd.DataFrame({'state': states, 'action': pi_star_actions})
df_policy.to_csv("optimal-treatment-plan.tsv", sep="\t", index=False)

df_discomfort_opt = pd.DataFrame({'state': states, 'expected_discomfort_optimal': expected_discomfort_optimal})
df_discomfort_opt.to_csv("optimal-treatment-discomfort.tsv", sep="\t", index=False)

# ---------- Quick prints for verification ----------
print("Saved optimal-treatment-plan.tsv and optimal-treatment-discomfort.tsv")
# show sample for the start state if present
start_state = "exposed-1"
if start_state in s2i:
    print("Optimal action at", start_state, "=", df_policy.loc[s2i[start_state],'action'])
    print("Expected discomfort from", start_state, "under optimal plan =", expected_discomfort_optimal[s2i[start_state]])
else:
    print("Start state", start_state, "not in state list.")

Saved optimal-treatment-plan.tsv and optimal-treatment-discomfort.tsv
Optimal action at exposed-1 = sleep-8
Expected discomfort from exposed-1 under optimal plan = 0.7425455221220195


Save the optimal actions for each state to a file "minimum-discomfort-actions.tsv" with columns state and action.

In [None]:
# YOUR CHANGES HERE

df_policy.to_csv("minimum-discomfort-actions.tsv", sep="\t", index=False)

print("Saved optimal actions to minimum-discomfort-actions.tsv")

Saved optimal actions to minimum-discomfort-actions.tsv


In [None]:
print(df_policy)

        state      action
0   exposed-1     sleep-8
1   exposed-2     sleep-8
2   exposed-3     sleep-8
3   recovered  do-nothing
4  symptoms-1    drink-oj
5  symptoms-2    drink-oj
6  symptoms-3    drink-oj


Submit "minimum-discomfort-actions.tsv" in Gradescope.

In [None]:
files.download("minimum-discomfort-actions.tsv")

## Part 3: Expected Discomfort

Using your previous optimal policy, compute the expected discomfort for each state.
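
A minimal sketch of evaluating a fixed policy, using a hypothetical two-state, two-action MDP rather than the Twizzleflu data. The policy array selects one reward and one transition row per state, which reduces the problem to a plain Markov chain. Adding `0.0` at the end is an optional touch that turns a negative zero into a plain zero, which avoids values printing as `-0.000000`.

```python
import numpy as np

# Hypothetical MDP: state 0 = "sick", state 1 = "recovered" (absorbing).
R = np.array([[-1.0, 0.0],     # action 0 = "rest"
              [-1.5, 0.0]])    # action 1 = "medicine"
P = np.array([[[0.5, 0.5], [0.0, 1.0]],
              [[0.1, 0.9], [0.0, 1.0]]])
pi = np.array([1, 0])   # hypothetical policy: medicine when sick, rest when recovered

n = R.shape[-1]
R_pi = R[pi, np.arange(n)]       # reward of the chosen action in each state
P_pi = P[pi, np.arange(n), :]    # transition row of the chosen action in each state

v = np.zeros(n)
for _ in range(10_000):
    v_next = R_pi + P_pi @ v     # gamma = 1 is fine for this absorbing chain
    delta = np.max(np.abs(v_next - v))
    v = v_next
    if delta < 1e-12:
        break

discomfort = -v + 0.0   # adding 0.0 turns a negative zero into a plain 0.0
```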

In [None]:
# YOUR CHANGES HERE

# assumes you've already built R (nA x nS), P (nA x nS x nS),
# computed pi_star_idx (length nS int array) and gamma_to_use (e.g. 0.999)

# iterative_policy_evaluation from the videos (deterministic policy version)
def iterative_policy_evaluation(R, P, gamma, pi, max_iterations=20000, tolerance=1e-9, warmstart=None):
    n = R.shape[-1]
    R_pi = R[pi, np.arange(n)]        # immediate reward per state under policy
    P_pi = P[pi, np.arange(n), :]     # transition matrix under policy (nS x nS)
    # iterate value updates for the fixed policy
    v_old = warmstart if warmstart is not None else np.zeros(n)
    for i in range(max_iterations):
        # one-step update for the single-action MDP induced by the policy
        v_new = R_pi + gamma * (P_pi @ v_old)
        if np.max(np.abs(v_new - v_old)) < tolerance:
            return v_new
        v_old = v_new
    return v_old

# Evaluate the optimal policy computed earlier
v_pi_star = iterative_policy_evaluation(R, P, gamma_to_use, pi_star_idx)

# Convert to expected discomfort if rewards are negative discomforts:
expected_discomfort = -v_pi_star

# Save to TSV
df_expected_opt = pd.DataFrame({'state': states, 'expected_discomfort': expected_discomfort})
df_expected_opt.to_csv("optimal-policy-discomfort.tsv", sep="\t", index=False)

# Print a quick summary
for s, val in zip(states, expected_discomfort):
    print(f"{s}: expected discomfort = {val:.6f}")

print("Saved expected discomfort per state to optimal-policy-discomfort.tsv")


exposed-1: expected discomfort = 0.742546
exposed-2: expected discomfort = 1.486578
exposed-3: expected discomfort = 2.976131
recovered: expected discomfort = -0.000000
symptoms-1: expected discomfort = 5.958221
symptoms-2: expected discomfort = 4.480576
symptoms-3: expected discomfort = 1.495513
Saved expected discomfort per state to optimal-policy-discomfort.tsv


In [None]:
print(df_rewards.head())


       action       state  reward
0  do-nothing   exposed-1     0.0
1  do-nothing   exposed-2     0.0
2  do-nothing   exposed-3     0.0
3  do-nothing  symptoms-1    -0.5
4  do-nothing  symptoms-2    -1.0


In [None]:
sums = df_transitions.groupby(['action','state'])['probability'].sum()
print(sums[sums.round(8) != 1.0])


Series([], Name: probability, dtype: float64)


In [None]:
n_groups = df_transitions.groupby(['action','state'])['probability'].size().shape[0]
expected = len(df_rewards['action'].unique()) * len(pd.concat([df_rewards['state'], df_transitions['state'], df_transitions['next_state']]).unique())
print("groups present:", n_groups, "expected (actions*states):", expected)


groups present: 21 expected (actions*states): 21


In [None]:
bad = sums[ (sums - 1.0).abs() > 1e-8 ]
print(bad)


Series([], Name: probability, dtype: float64)


In [None]:
print(df_policy)


        state      action
0   exposed-1     sleep-8
1   exposed-2     sleep-8
2   exposed-3     sleep-8
3   recovered  do-nothing
4  symptoms-1    drink-oj
5  symptoms-2    drink-oj
6  symptoms-3    drink-oj


Save your results in a file "minimum-discomfort-values.tsv" with columns state and expected_discomfort.

In [None]:
# YOUR CHANGES HERE

# Save to TSV
df_expected_opt.to_csv("minimum-discomfort-values.tsv", sep="\t", index=False)

print("Saved expected discomfort per state to minimum-discomfort-values.tsv")

Saved expected discomfort per state to minimum-discomfort-values.tsv


In [None]:
print(df_expected_opt)

        state  expected_discomfort
0   exposed-1             0.742546
1   exposed-2             1.486578
2   exposed-3             2.976131
3   recovered            -0.000000
4  symptoms-1             5.958221
5  symptoms-2             4.480576
6  symptoms-3             1.495513


In [None]:
from google.colab import files
files.download("minimum-discomfort-values.tsv")

Submit "minimum-discomfort-values.tsv" in Gradescope.

## Part 4: Minimizing Twizzleflu Duration

Modify the Markov decision process to minimize the days until the Twizzleflu is over.
To do so, change the reward function to always be -1 if the current state corresponds to being sick and 0 if the current state corresponds to being better.
To be clear, the action does not matter for this reward function.
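
One way to build such a reward table is sketched below, on a hypothetical four-row table. It assumes that `recovered` is the only state that counts as being better. Matching against an explicit set of state names is used here as an alternative to substring matching, which could accidentally match an unrelated state name.

```python
import numpy as np
import pandas as pd

# Hypothetical rewards table in the same long format as "twizzleflu-rewards.tsv".
toy_rewards = pd.DataFrame({
    "action": ["do-nothing", "do-nothing", "drink-oj", "drink-oj"],
    "state": ["symptoms-1", "recovered", "symptoms-1", "recovered"],
    "reward": [-0.5, 0.0, -0.6, 0.0],
})

# Assumption: "recovered" is the only state that counts as being better.
healthy_states = {"recovered"}

# -1 for every sick state and 0 for every healthy state, whatever the action.
duration_rewards = toy_rewards.copy()
duration_rewards["reward"] = np.where(
    duration_rewards["state"].isin(healthy_states), 0.0, -1.0
)

# The reward should now depend on the state only.
rewards_per_state = duration_rewards.groupby("state")["reward"].nunique()
```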


In [None]:
# YOUR CHANGES HERE

# Assumes you already have:
# df_rewards, df_transitions, actions, states, a2i, s2i, R, P
# and the helper functions: compute_qT_once, iterate_values_once, value_iteration, iterative_policy_evaluation



# 1) Decide which state names count as "healthy/recovered"
# Best-effort: treat any state whose name contains these substrings as healthy.
healthy_substrings = ["recover", "done", "healthy", "well", "term", "terminal"]
# you can extend with exact names if you know them, for example: "recovered"
healthy_substrings += ["recovered"]

is_healthy = []
for s in states:
    s_low = s.lower()
    healthy = any(sub in s_low for sub in healthy_substrings)
    is_healthy.append(healthy)
is_healthy = np.array(is_healthy)  # boolean array length nS

# 2) Build new R_min_duration (A x nS) where reward = -1 for sick states, 0 for healthy
nA = len(actions)
nS = len(states)
R_duration = np.zeros((nA, nS), dtype=float)
# set -1 for sick states for every action
sick_mask = ~is_healthy
R_duration[:, sick_mask] = -1.0
R_duration[:, is_healthy] = 0.0

# 3) Choose gamma: problem conceptually wants gamma=1 (undiscounted expected duration).
# Use gamma slightly less than 1 for numerical stability as recommended in the videos.
gamma = 1.0
gamma_to_use = 0.999 if gamma == 1.0 else gamma

# 4) Use value iteration to compute optimal values (maximize rewards => minimize duration)
v_star_duration = value_iteration(R_duration, P, gamma_to_use, max_iterations=20000, tolerance=1e-9)

# 5) Extract greedy policy (min-duration action per state)
qT_duration = compute_qT_once(R_duration, P, gamma_to_use, v_star_duration)  # shape (nA, nS)
pi_star_idx_duration = np.argmax(qT_duration, axis=0)
pi_star_actions_duration = [actions[a] for a in pi_star_idx_duration]

# 6) Evaluate the resulting policy (compute its values) and convert to expected duration
v_pi_star_duration = iterative_policy_evaluation(R_duration, P, gamma_to_use, pi_star_idx_duration)
expected_duration_per_state = -v_pi_star_duration   # since reward = -1 per sick day

# 7) Save / print
df_policy_duration = pd.DataFrame({'state': states, 'action': pi_star_actions_duration})
df_duration = pd.DataFrame({'state': states, 'expected_duration': expected_duration_per_state})

df_policy_duration.to_csv("min-duration-actions.tsv", sep="\t", index=False)
df_duration.to_csv("min-duration-expected.tsv", sep="\t", index=False)

print("Saved min-duration-actions.tsv and min-duration-expected.tsv")
# quick look
print(df_policy_duration)
print(df_duration)


Saved min-duration-actions.tsv and min-duration-expected.tsv
        state      action
0   exposed-1     sleep-8
1   exposed-2     sleep-8
2   exposed-3     sleep-8
3   recovered  do-nothing
4  symptoms-1  do-nothing
5  symptoms-2  do-nothing
6  symptoms-3  do-nothing
        state  expected_duration
0   exposed-1           2.988223
1   exposed-2           3.980426
2   exposed-3           5.966818
3   recovered          -0.000000
4  symptoms-1           9.943579
5  symptoms-2           6.640088
6  symptoms-3           3.325574


Save your new reward function in a file "duration-rewards.tsv" in the same format as "twizzleflu-rewards.tsv".

In [None]:
# YOUR CHANGES HERE

# define which state names count as healthy/recovered
healthy_substrings = ["recover", "done", "healthy", "well", "term", "terminal", "recovered"]

def is_healthy_state(s):
    s_low = str(s).lower()
    return any(sub in s_low for sub in healthy_substrings)

# set new reward: -1 for sick states, 0 for healthy states (action-independent)
df_out = df_rewards.copy()
df_out['reward'] = df_out['state'].apply(lambda s: 0.0 if is_healthy_state(s) else -1.0)

# Define output path
output_path = "duration-rewards.tsv"

# save in same format (TSV with same columns)
df_out.to_csv(output_path, sep="\t", index=False)
print(f"Wrote {output_path}")

Wrote duration-rewards.tsv


Submit "duration-rewards.tsv" in Gradescope.

In [None]:
print(df_out)

        action       state  reward
0   do-nothing   exposed-1    -1.0
1   do-nothing   exposed-2    -1.0
2   do-nothing   exposed-3    -1.0
3   do-nothing  symptoms-1    -1.0
4   do-nothing  symptoms-2    -1.0
5   do-nothing  symptoms-3    -1.0
6   do-nothing   recovered     0.0
7     drink-oj   exposed-1    -1.0
8     drink-oj   exposed-2    -1.0
9     drink-oj   exposed-3    -1.0
10    drink-oj  symptoms-1    -1.0
11    drink-oj  symptoms-2    -1.0
12    drink-oj  symptoms-3    -1.0
13    drink-oj   recovered     0.0
14     sleep-8   exposed-1    -1.0
15     sleep-8   exposed-2    -1.0
16     sleep-8   exposed-3    -1.0
17     sleep-8  symptoms-1    -1.0
18     sleep-8  symptoms-2    -1.0
19     sleep-8  symptoms-3    -1.0
20     sleep-8   recovered     0.0


## Part 5: Optimize for Shorter Twizzleflu

Compute an optimal policy to minimize the duration of Twizzleflu.

In [14]:
# YOUR CHANGES HERE

# R_duration: shape (nA, nS), P: shape (nA,nS,nS), actions, states exist
gamma_to_use = 0.999

v_star = value_iteration(R_duration, P, gamma_to_use)
qT = compute_qT_once(R_duration, P, gamma_to_use, v_star)
pi_star_idx = np.argmax(qT, axis=0)
pi_star_actions = [actions[i] for i in pi_star_idx]

v_pi_star = iterative_policy_evaluation(R_duration, P, gamma_to_use, pi_star_idx)
expected_duration = -v_pi_star



Save the optimal actions for each state to a file "minimum-duration-actions.tsv" with columns state and action.

In [13]:
# YOUR CHANGES HERE

pd.DataFrame({'state':states, 'action':pi_star_actions}).to_csv("minimum-duration-actions.tsv", sep="\t", index=False)
pd.DataFrame({'state':states, 'expected_duration':expected_duration}).to_csv("minimum-duration-expected.tsv", sep="\t", index=False)
print("Saved minimum-duration-actions.tsv and minimum-duration-expected.tsv")

Saved minimum-duration-actions.tsv and minimum-duration-expected.tsv


In [15]:
print(pd.DataFrame({'state':states, 'action':pi_star_actions}))

        state      action
0   exposed-1     sleep-8
1   exposed-2     sleep-8
2   exposed-3     sleep-8
3   recovered  do-nothing
4  symptoms-1  do-nothing
5  symptoms-2  do-nothing
6  symptoms-3  do-nothing


In [17]:
from google.colab import files
files.download("minimum-duration-actions.tsv")

Submit "minimum-duration-actions.tsv" in Gradescope.

## Part 6: Shorter Twizzleflu?

Compute the expected number of days sick for each state and save it to a file.
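
The code in this notebook approximates the undiscounted problem with a discount factor of 0.999. The sketch below, on a hypothetical sick/recovered chain with made-up probabilities, gives a feel for how much that approximation shifts an expected day count. The function name `expected_sick_days` and its parameters are illustrative only.

```python
import numpy as np

def expected_sick_days(stay_sick_prob, gamma, tolerance=1e-12, max_iterations=100_000):
    """Iterate v <- R + gamma * P @ v for a hypothetical sick/recovered chain."""
    P_pi = np.array([[stay_sick_prob, 1.0 - stay_sick_prob],
                     [0.0, 1.0]])
    R_pi = np.array([-1.0, 0.0])   # -1 per sick day, 0 once recovered
    v = np.zeros(2)
    for _ in range(max_iterations):
        v_next = R_pi + gamma * (P_pi @ v)
        delta = np.max(np.abs(v_next - v))
        v = v_next
        if delta < tolerance:
            break
    return -v[0]   # expected sick days starting from the sick state

days_undiscounted = expected_sick_days(0.9, gamma=1.0)    # close to 1 / (1 - 0.9)
days_discounted = expected_sick_days(0.9, gamma=0.999)    # close to 1 / (1 - 0.999 * 0.9)
```

In this toy chain the discounted figure comes out slightly below the undiscounted one, and the gap grows as the expected illness gets longer.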

In [26]:
# YOUR CHANGES HERE

# --- helper functions (value-iteration style from videos) ---
def compute_qT_once(R, P, gamma, v):
    return R + gamma * (P @ v)

def iterate_values_once(R, P, gamma, v):
    return np.max(compute_qT_once(R, P, gamma, v), axis=0)

def value_iteration(R, P, gamma, max_iterations=20000, tolerance=1e-9):
    v_old = np.zeros(R.shape[-1])
    for i in range(max_iterations):
        v_new = iterate_values_once(R, P, gamma, v_old)
        if np.max(np.abs(v_new - v_old)) < tolerance:
            return v_new
        v_old = v_new
    return v_old

def iterative_policy_evaluation(R, P, gamma, pi, max_iterations=20000, tolerance=1e-9, warmstart=None):
    n = R.shape[-1]
    R_pi = R[pi, np.arange(n)]
    P_pi = P[pi, np.arange(n), :]
    R_pi = R_pi.reshape(1, *R_pi.shape)
    P_pi = P_pi.reshape(1, *P_pi.shape)
    v_old = warmstart if warmstart is not None else np.zeros(n)
    for i in range(max_iterations):
        v_new = iterate_values_once(R_pi, P_pi, gamma, v_old)
        if np.max(np.abs(v_new - v_old)) < tolerance:
            return v_new
        v_old = v_new
    return v_old

# --- build actions/states and mapping (same pattern as earlier parts) ---
actions = sorted(df_rewards['action'].unique(), key=str)
states = sorted(pd.unique(pd.concat([df_rewards['state'],
                                     df_transitions['state'],
                                     df_transitions['next_state']])), key=str)

a2i = {a:i for i,a in enumerate(actions)}
s2i = {s:i for i,s in enumerate(states)}
nA = len(actions)
nS = len(states)

# build R and P arrays from the provided TSVs
R = np.zeros((nA, nS), dtype=float)
for _, row in df_rewards.iterrows():
    a = row['action']; s = row['state']; r = float(row['reward'])
    R[a2i[a], s2i[s]] = r

P = np.zeros((nA, nS, nS), dtype=float)
expected_cols = {'action','state','next_state','probability'}
if expected_cols.issubset(set(df_transitions.columns)):
    for _, row in df_transitions.iterrows():
        a = row['action']; s_from = row['state']; s_to = row['next_state']; p = float(row['probability'])
        P[a2i[a], s2i[s_from], s2i[s_to]] = p
else:
    raise RuntimeError("df_transitions missing expected columns: action,state,next_state,probability")

# --- quick sanity: check outgoing prob sums are (approximately) 1 ---
sums = df_transitions.groupby(['action','state'])['probability'].sum()
bad = sums[sums.round(8) != 1.0]
if len(bad) > 0:
    print("Warning: some (action,state) outgoing probabilities do not sum to 1.0:", bad)
else:
    print("All (action,state) outgoing probabilities sum to 1 (within numerical tolerance).")

# --- create the duration reward: -1 per sick state, 0 for healthy states ---
# choose which names indicate 'healthy' — extend this list if needed
healthy_substrings = ["recover", "done", "healthy", "well", "term", "terminal", "recovered"]

is_healthy = np.array([any(sub in s.lower() for sub in healthy_substrings) for s in states])
R_duration = np.zeros((nA, nS), dtype=float)
R_duration[:, ~is_healthy] = -1.0   # sick states -> -1
R_duration[:, is_healthy]  =  0.0  # healthy -> 0

# --- gamma selection: approximate undiscounted with gamma slightly < 1 ---
gamma = 1.0
gamma_to_use = 0.999 if gamma == 1.0 else gamma

# --- compute optimal values and greedy policy under duration reward ---
v_star = value_iteration(R_duration, P, gamma_to_use)
qT = compute_qT_once(R_duration, P, gamma_to_use, v_star)  # shape (nA, nS)
pi_star_idx = np.argmax(qT, axis=0)
pi_star_actions = [actions[a] for a in pi_star_idx]

# --- evaluate the greedy policy to get expected (discounted) value, then convert to expected duration ---
v_pi_star = iterative_policy_evaluation(R_duration, P, gamma_to_use, pi_star_idx)
expected_sick_days = -v_pi_star   # because reward = -1 per sick day



All (action,state) outgoing probabilities sum to 1 (within numerical tolerance).


Save the expected sick days for each state to a file "minimum-duration-days.tsv" with columns state and expected_sick_days.

In [27]:
# YOUR CHANGES HERE

# --- save results ---
df_out = pd.DataFrame({'state': states, 'expected_sick_days': expected_sick_days})
df_out.to_csv("minimum-duration-days.tsv", sep="\t", index=False)
print("Wrote minimum-duration-days.tsv. Sample:")
print(df_out)

Wrote minimum-duration-days.tsv. Sample:
        state  expected_sick_days
0   exposed-1            2.988223
1   exposed-2            3.980426
2   exposed-3            5.966818
3   recovered           -0.000000
4  symptoms-1            9.943579
5  symptoms-2            6.640088
6  symptoms-3            3.325574


In [28]:
files.download("minimum-duration-days.tsv")

Submit "minimum-duration-days.tsv" in Gradescope.

## Part 7: Speed vs Pampering

Compute the expected discomfort using the policy to minimize days sick, and compare the results to the expected discomfort when optimizing to minimize discomfort.
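
The comparison can be sketched on a hypothetical model in which the two objectives disagree. Both policies are scored with the same discomfort rewards, and the results are placed side by side. The action names, the numbers, and the two hard-coded policies are made up for illustration; in the project they would come from the earlier parts.

```python
import numpy as np
import pandas as pd

states = ["sick", "recovered"]
# Hypothetical actions: 0 = "pamper" (gentle but slow), 1 = "medicine" (unpleasant but fast).
R_discomfort = np.array([[-0.2, 0.0],
                         [-1.0, 0.0]])
P = np.array([[[0.75, 0.25], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])

def evaluate_policy(R, P, pi, tolerance=1e-12, max_iterations=100_000):
    """Undiscounted iterative evaluation of a deterministic policy."""
    n = R.shape[-1]
    R_pi = R[pi, np.arange(n)]
    P_pi = P[pi, np.arange(n), :]
    v = np.zeros(n)
    for _ in range(max_iterations):
        v_next = R_pi + P_pi @ v
        delta = np.max(np.abs(v_next - v))
        v = v_next
        if delta < tolerance:
            break
    return v

# Assumed to have been computed already; hard-coded here for the toy model.
pi_min_discomfort = np.array([0, 0])
pi_min_duration = np.array([1, 0])

# Score both policies with the discomfort rewards.
comparison = pd.DataFrame({
    "state": states,
    "min_discomfort_policy": -evaluate_policy(R_discomfort, P, pi_min_discomfort) + 0.0,
    "min_duration_policy": -evaluate_policy(R_discomfort, P, pi_min_duration) + 0.0,
})
comparison["extra_discomfort"] = (
    comparison["min_duration_policy"] - comparison["min_discomfort_policy"]
)
```

The `extra_discomfort` column is never negative when the first policy is truly optimal for discomfort, which makes it a handy sanity check on the real results.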

In [29]:
# YOUR CHANGES HERE
# --- helper functions (same style as the videos) ---
def compute_qT_once(R, P, gamma, v):
    return R + gamma * (P @ v)

def iterate_values_once(R, P, gamma, v):
    return np.max(compute_qT_once(R, P, gamma, v), axis=0)

def value_iteration(R, P, gamma, max_iterations=20000, tolerance=1e-9):
    v_old = np.zeros(R.shape[-1])
    for _ in range(max_iterations):
        v_new = iterate_values_once(R, P, gamma, v_old)
        if np.max(np.abs(v_new - v_old)) < tolerance:
            return v_new
        v_old = v_new
    return v_old

def iterative_policy_evaluation(R, P, gamma, pi, max_iterations=20000, tolerance=1e-9, warmstart=None):
    n = R.shape[-1]
    R_pi = R[pi, np.arange(n)]
    P_pi = P[pi, np.arange(n), :]
    R_pi = R_pi.reshape(1, *R_pi.shape)
    P_pi = P_pi.reshape(1, *P_pi.shape)
    v_old = warmstart if warmstart is not None else np.zeros(n)
    for _ in range(max_iterations):
        v_new = iterate_values_once(R_pi, P_pi, gamma, v_old)
        if np.max(np.abs(v_new - v_old)) < tolerance:
            return v_new
        v_old = v_new
    return v_old

# --- build actions, states, mappings, R and P arrays ---
actions = sorted(df_rewards['action'].unique(), key=str)
states = sorted(pd.unique(pd.concat([df_rewards['state'],
                                     df_transitions['state'],
                                     df_transitions['next_state']])), key=str)

a2i = {a:i for i,a in enumerate(actions)}
s2i = {s:i for i,s in enumerate(states)}
nA = len(actions)
nS = len(states)

# R: (nA, nS)
R = np.zeros((nA, nS), dtype=float)
for _, row in df_rewards.iterrows():
    R[a2i[row['action']], s2i[row['state']]] = float(row['reward'])

# P: (nA, nS, nS)
P = np.zeros((nA, nS, nS), dtype=float)
expected_cols = {'action','state','next_state','probability'}
if not expected_cols.issubset(set(df_transitions.columns)):
    raise RuntimeError("df_transitions missing expected columns: action,state,next_state,probability")
for _, row in df_transitions.iterrows():
    P[a2i[row['action']], s2i[row['state']], s2i[row['next_state']]] = float(row['probability'])

# --- gamma handling: the problem is conceptually undiscounted (gamma = 1); approximate it with 0.999
# so value iteration converges reliably. This discounts the reported values slightly. ---
gamma = 1.0
gamma_to_use = 0.999 if gamma == 1.0 else gamma

# --- build duration reward (R_duration): -1 per step in any sick state, 0 in healthy states ---
# A state counts as healthy if its name contains one of these substrings ("recovered" in this dataset).
healthy_substrings = ["recover", "done", "healthy", "well", "term"]
is_healthy = np.array([any(sub in s.lower() for sub in healthy_substrings) for s in states])
R_duration = np.zeros((nA, nS), dtype=float)
R_duration[:, ~is_healthy] = -1.0   # sick states; healthy states keep the initial 0

# -------------------------
# 1) Compute policy that minimizes discomfort (original reward R)
# -------------------------
v_star_discomfort = value_iteration(R, P, gamma_to_use)
qT_discomfort = compute_qT_once(R, P, gamma_to_use, v_star_discomfort)
pi_discomfort_idx = np.argmax(qT_discomfort, axis=0)   # action index per state (greedy)
# evaluate that policy under the original reward R to get values
v_pi_discomfort = iterative_policy_evaluation(R, P, gamma_to_use, pi_discomfort_idx)
# Rewards in df_rewards are negative discomfort, so expected discomfort is the negated value: -v.
# (If the rewards were stored as positive discomfort, the negation would be dropped.)
expected_discomfort_minimize = -v_pi_discomfort

# -------------------------
# 2) Compute policy that minimizes duration (using R_duration)
# -------------------------
v_star_duration = value_iteration(R_duration, P, gamma_to_use)
qT_duration = compute_qT_once(R_duration, P, gamma_to_use, v_star_duration)
pi_duration_idx = np.argmax(qT_duration, axis=0)
# evaluate that *duration-minimizing policy* under the original discomfort reward R to find its expected discomfort
v_pi_duration_under_R = iterative_policy_evaluation(R, P, gamma_to_use, pi_duration_idx)
expected_discomfort_speed = -v_pi_duration_under_R




Save the results to a file "policy-comparison.tsv" with columns state, speed_discomfort, and minimize_discomfort.

In [30]:
# YOUR CHANGES HERE

df_compare = pd.DataFrame({
    'state': states,
    'speed_discomfort': expected_discomfort_speed,
    'minimize_discomfort': expected_discomfort_minimize
})
df_compare.to_csv("policy-comparison.tsv", sep="\t", index=False)
print("Wrote policy-comparison.tsv. Sample:")
print(df_compare)


Wrote policy-comparison.tsv. Sample:
        state  speed_discomfort  minimize_discomfort
0   exposed-1          0.826147             0.742546
1   exposed-2          1.653949             1.486578
2   exposed-3          3.311209             2.976131
3   recovered         -0.000000            -0.000000
4  symptoms-1          6.629047             5.958221
5  symptoms-2          4.982831             4.480576
6  symptoms-3          1.662787             1.495513
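
One sanity check on the table above: the discomfort-minimizing policy should never have higher expected discomfort than the speed policy in any state. A small sketch of that check, using the values printed above (the names `df_check` and `extra_discomfort` are hypothetical and not part of the required output file):

```python
import pandas as pd

# Values copied from the printed policy-comparison table above.
df_check = pd.DataFrame({
    "state": ["exposed-1", "exposed-2", "exposed-3", "recovered",
              "symptoms-1", "symptoms-2", "symptoms-3"],
    "speed_discomfort": [0.826147, 1.653949, 3.311209, 0.0,
                         6.629047, 4.982831, 1.662787],
    "minimize_discomfort": [0.742546, 1.486578, 2.976131, 0.0,
                            5.958221, 4.480576, 1.495513],
})

# Extra discomfort a patient accepts by choosing the fastest recovery.
df_check["extra_discomfort"] = (
    df_check["speed_discomfort"] - df_check["minimize_discomfort"]
)

# The discomfort-optimal policy can never be worse on its own objective.
all_consistent = bool((df_check["extra_discomfort"] >= -1e-9).all())
worst_state = df_check.set_index("state")["extra_discomfort"].idxmax()

print(df_check[["state", "extra_discomfort"]])
print("minimize_discomfort <= speed_discomfort everywhere:", all_consistent)
print("largest gap in state:", worst_state)
```

If the check ever came out false, the usual suspects would be a swapped column or a policy evaluated under the wrong reward array.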


In [31]:
files.download("policy-comparison.tsv")


Submit "policy-comparison.tsv" in Gradescope.

## Part 8: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.

## Part 9: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.