# CS 640 - 2025 Fall - Homework 5

In this homework, you will define a Markov decision process describing a cat moving on a 32 x 32 grid looking for a fish.
This will extend last week's Markov reward process by giving the cat agency to seek out the fish.

## Instructions

1. Follow the instructions below to construct and analyze the Markov decision process.
2. Run all the cells so that all the check cells are updated.
3. Answer the question at the bottom.
4. Submit your notebook in Gradescope.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch

## State Descriptions

You will construct a Markov decision process based on the following specification (same as homework 4).
1. The states are numbered from 0 to 1024 (inclusive).
2. The state 1024 is a special done state.
3. For state s from 0 to 1023, the state number encodes coordinates as follows.
  * x = s % 32
  * y = s // 32 (integer division)
  * x and y represent the location of a cat in a 32x32 grid.
4. At the state corresponding to x=16,y=16, there is a fish unless the cat is also there.
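The encoding above can be sketched as a pair of helper functions. The names `GRID`, `DONE`, `encode` and `decode` are hypothetical and are not required by the assignment; this is only one way to convert between state numbers and coordinates.

```python
GRID = 32            # grid width and height, from the specification above
DONE = GRID * GRID   # 1024, the special done state


def encode(x, y):
    """Return the state number for grid coordinates (x, y)."""
    assert 0 <= x < GRID and 0 <= y < GRID
    return y * GRID + x


def decode(s):
    """Return (x, y) for a grid state s in 0..1023."""
    assert 0 <= s < DONE
    return s % GRID, s // GRID


print(encode(16, 16))  # 528, the fish location
print(decode(69))      # (5, 2)
```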

The following function `print_state` will visualize the state.

In [None]:
def print_state(s):
    assert 0 <= s <= 1024

    print("STATE", s)

    if s < 1024:
        # normal state indicating y, x coordinates
        output = ['🪨' for _ in range(1024)]
        output[16*32+16] = '🐟'
        output[s] = '🐱'
        for i in range(0, 1024, 32):
            print(''.join(output[i:i+32]))
    else:
        print("DONE")

    print("")

for s in (0, 2 * 32 + 5, 16*32+16, 1023, 1024):
    print_state(s)

## Actions

In any state, the cat will have 4 possible actions.
1. Up. The cat tries to change its $y$ coordinate by -1.
2. Down. The cat tries to change its $y$ coordinate by +1.
3. Left. The cat tries to change its $x$ coordinate by -1.
4. Right. The cat tries to change its $x$ coordinate by +1.

If the up/down descriptions do not make sense, consider that the first row when printing corresponds to $y=0$ and the last row printed corresponds to $y=31$.
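As a minimal sketch, the four actions can be written as coordinate offsets. The names `ACTIONS` and `intended_target` are hypothetical helpers for illustration, and returning `None` for an off-grid move is an assumption of this sketch, not part of the assignment.

```python
# Each action maps to a (dx, dy) offset; y grows downward when printed.
ACTIONS = {
    "up": (0, -1),
    "down": (0, 1),
    "left": (-1, 0),
    "right": (1, 0),
}


def intended_target(s, action, grid=32):
    """Return the state the cat tries to reach, or None if it is off the grid."""
    x, y = s % grid, s // grid
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < grid and 0 <= ny < grid:
        return ny * grid + nx
    return None


print(intended_target(69, "up"))   # 37, i.e. (x=5, y=1)
print(intended_target(0, "left"))  # None, the move would leave the grid
```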

## State Transitions

For each action $a$, construct a transition matrix $P_a$ based on the following rules.

1. $P_a$ should be 1025x1025.
2. $P_a[i,j]$ should hold the probability of transitioning from state i to j in one step.
3. If the state is 1024 (the done state), the state stays the same with probability 1.
4. If the state is 528 (cat is on the fish), the state changes to 1024 with probability 1.
5. If the action $a$ corresponds to moving off the grid, then the state stays the same with probability 1.
6. Otherwise, for any other state $s$,
    * For any state $s'$ representing a location that is horizontally or vertically adjacent to the cat, other than the state the cat intends to reach, there is a 5% chance of transitioning from $s$ to $s'$.
    * The state to which the cat intends to move has the remaining probability (this should be 0.85, 0.90 or 0.95, depending on whether 3, 2 or 1 other adjacent locations are on the grid).
    * The probability of transitioning to a state representing a non-adjacent location is zero.
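One literal reading of these rules can be sketched on a small grid. The function `build_transitions`, the dictionary `MOVES`, the parameter `slip`, and the 4x4 grid with a fish at x=2, y=2 are all assumptions made for illustration; the assignment leaves the choice of data structures to you, and this is not the only valid construction.

```python
import numpy as np

# Hypothetical action table: name -> (dx, dy).
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def build_transitions(n, fish, slip=0.05):
    """Return {action: matrix} for an n x n grid plus one done state."""
    done = n * n
    P = {a: np.zeros((done + 1, done + 1)) for a in MOVES}
    for a, (dx, dy) in MOVES.items():
        for s in range(done + 1):
            if s == done or s == fish:
                # Rules 3 and 4: done is absorbing, fish state leads to done.
                P[a][s, done] = 1.0
                continue
            x, y = s % n, s // n
            tx, ty = x + dx, y + dy
            if not (0 <= tx < n and 0 <= ty < n):
                # Rule 5: the intended move leaves the grid, so stay put.
                P[a][s, s] = 1.0
                continue
            target = ty * n + tx
            # Rule 6: every other on-grid neighbor gets probability `slip`.
            others = [
                (y + my) * n + (x + mx)
                for mx, my in MOVES.values()
                if 0 <= x + mx < n
                and 0 <= y + my < n
                and (y + my) * n + (x + mx) != target
            ]
            for o in others:
                P[a][s, o] = slip
            # The intended state receives whatever probability remains.
            P[a][s, target] = 1.0 - slip * len(others)
    return P


P_small = build_transitions(n=4, fish=2 * 4 + 2)
print(P_small["right"][5])  # row for the cat at x=1, y=1 moving right
```

A quick sanity check for any construction is that every row of every matrix sums to 1.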


In [None]:
# YOUR CHANGES HERE

# pick your data structures as you see appropriate

# P = ...

## State Rewards

Construct a reward vector R based on the following rules.
The reward does not depend on the action.

1. R should be 1025x1.
2. R[i] should hold the reward after state i.
3. The reward after state 528 (cat is on fish) is 100.
4. The reward for all other states is 0.

In [None]:
# YOUR CHANGES HERE

R = ...

## Optimal State Values

Use **value iteration** to compute the value function $v_*$ for each state using $\gamma=0.9$ and save it in $v$.
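As a hedged illustration of the technique, here is value iteration on a toy 3-state problem rather than the cat grid. It assumes the reward-after-state convention used in this homework, so the update is $v(s) \leftarrow R[s] + \gamma \max_a \sum_{s'} P_a[s,s'] \, v(s')$. The names `value_iteration`, `P_toy`, `R_toy` and `v_toy`, the tolerance, and the toy problem itself are made up for this sketch.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """P has shape (A, S, S) and R has shape (S,). Returns v of shape (S,)."""
    v = np.zeros(R.shape[0])
    for _ in range(max_iters):
        # P @ v has shape (A, S): expected next value for each action and state.
        v_new = R + gamma * (P @ v).max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v


# Toy problem: state 0 is the start, state 1 pays a reward, state 2 is done.
P_toy = np.array([
    # action 0 ("stay"): remain in state 0
    [[1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]],
    # action 1 ("go"): reach state 1 with probability 0.9
    [[0.1, 0.9, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]],
])
R_toy = np.array([0.0, 10.0, 0.0])

v_toy = value_iteration(P_toy, R_toy, gamma=0.9)
print(v_toy)
```

The loop stops once the largest change between sweeps falls below `tol`; since each sweep shrinks the error by a factor of at most $\gamma$, this happens after a few hundred iterations here.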

In [None]:
# YOUR CHANGES HERE

v = [i / 1024 for i in range(1025)]

### Check $v_*$ values.

Run these cells without changing their code.

In [None]:
# done state
v[1024]

In [None]:
# cat arrived at fish
v[528]

In [None]:
# cat next to fish
v[529]

In [None]:
# cat farther from fish
v[530]

## Visualize $v_*$.

Run this cell without changing its code.

In [None]:
plt.imshow(np.asarray(v)[:1024].reshape(32, 32));

## Optimal State-Action Values

Use **Q-learning** to compute the state-action value function $q_*$ using $\gamma=0.9$ and save it in `q`.
`q` should be constructed so that `q[s][a]` returns the value $q_*(s,a)$ where $s$ is an integer from 0 to 1024 (inclusive) and $a$ is one of the strings "up", "down", "left" or "right".
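The sketch below shows a tabular Q-learning update on a toy 3-state problem, returning a list of dictionaries so that `q_toy[s][a]` works with string actions. Several things here are assumptions for illustration only: the toy problem, the names `q_learning`, `P_toy`, `R_toy` and `q_toy`, the constant step size `alpha`, and the choice to sample each state-action pair uniformly at random instead of following epsilon-greedy episodes.

```python
import numpy as np

# Toy problem: state 0 is the start, state 1 pays a reward, state 2 is done.
P_toy = {
    "stay": np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]),
    "go": np.array([[0.1, 0.9, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]),
}
R_toy = np.array([0.0, 10.0, 0.0])


def q_learning(P, R, gamma, steps=20_000, alpha=0.1, seed=0):
    """Return q as a list of {action: value} dicts, one per state."""
    rng = np.random.default_rng(seed)
    n_states = R.shape[0]
    actions = list(P)
    q = [{a: 0.0 for a in actions} for _ in range(n_states)]
    for _ in range(steps):
        # Simplification: pick a random state and action to update.
        s = int(rng.integers(n_states))
        a = actions[int(rng.integers(len(actions)))]
        # Sample the next state from the transition row for (s, a).
        s_next = int(rng.choice(n_states, p=P[a][s]))
        # Reward is received after leaving state s, independent of the action.
        target = R[s] + gamma * max(q[s_next].values())
        q[s][a] += alpha * (target - q[s][a])
    return q


q_toy = q_learning(P_toy, R_toy, gamma=0.9)
print(q_toy[0]["go"], q_toy[0]["stay"])
```

Because the updates use sampled transitions and a constant step size, the estimates fluctuate slightly around $q_*$ rather than matching it exactly.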

In [None]:
# YOUR CHANGES HERE

q = ...

### Check $q_*$ values.

Run these cells without changing their code.

In [None]:
# done state

(q[1024]["up"], q[1024]["down"], q[1024]["left"], q[1024]["right"])

In [None]:
# cat arrived at fish

(q[528]["up"], q[528]["down"], q[528]["left"], q[528]["right"])

In [None]:
# cat next to fish

(q[529]["up"], q[529]["down"], q[529]["left"], q[529]["right"])

In [None]:
# cat farther from fish

(q[530]["up"], q[530]["down"], q[530]["left"], q[530]["right"])

## Extract Optimal Policy from $q_*$.

Construct an optimal policy based on $q_*$.
Save it in a variable `pi`.
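A greedy policy picks, in each state, an action with the largest state-action value. The sketch below assumes `q` is a list of dictionaries keyed by action name; `greedy_policy` and `q_example` are hypothetical names with made-up values, and ties are broken by dictionary order.

```python
def greedy_policy(q):
    """Return a list whose entry s is an action maximizing q[s][a]."""
    return [max(q_s, key=q_s.get) for q_s in q]


# Made-up values for three states, only to show the format.
q_example = [
    {"up": 0.0, "down": 1.0, "left": 0.5, "right": 0.2},
    {"up": 2.0, "down": 1.0, "left": 0.5, "right": 0.2},
    {"up": 0.0, "down": 0.0, "left": 0.0, "right": 0.0},
]

print(greedy_policy(q_example))  # ['down', 'up', 'up']
```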

In [None]:
# YOUR CHANGES HERE

# dummy policy as an example of the format and to test visualization below.
pi = ["up" for _ in range(1025)]

### Check $\pi$ values

Run these cells without changing their code.

In [None]:
# done state
pi[1024]

In [None]:
# cat arrived at fish
pi[528]

In [None]:
# cat next to fish
pi[529]

In [None]:
# cat farther from fish
pi[530]

## Visualize $\pi$.

Run this cell without changing its code.

In [None]:
pi_visualized = ["🔥" for _ in range(1024)]

for i in range(1024):
    if pi[i] == "up":
        pi_visualized[i] = "⬆️"
    elif pi[i] == "down":
        pi_visualized[i] = "⬇️"
    elif pi[i] == "left":
        pi_visualized[i] = "⬅️"
    elif pi[i] == "right":
        pi_visualized[i] = "➡️"
    else:
        raise Exception(f"Unknown action : {pi[i]!r}")

pi_visualized[16*32+16] = '🐟'

for i in range(0, 1024, 32):
    print(''.join(pi_visualized[i:i+32]))