# Homework 4: Petting a warg

Wargs do not make good pets. They are vicious creatures that populate Middle-earth, the world described in the novels of John Ronald Reuel Tolkien. They tend to show up at the worst possible moment. They eat humans, hobbits, elves and wizards (when they can get them).

![A warg, getting ready for breakfast w:300px](figures/Gundabad_Wargs.jpg)

Your relationship with a warg can be in the following states:
```
SleepingWarg
AngryWarg
FuriousWarg
ApoplecticWarg
Safe
Sorry 
```

![tes](figures/WargStates.jpg)

Your actions are limited to petting the warg or striking it with your sword. The transitions are shown in the following picture. The Safe and Sorry states are terminal: no further actions can be taken in them. Entering them yields a reward of +10 and -10, respectively. All other transitions have a reward of -1.

The discount factor is $\gamma=0.9$.

![](figures/PetAWarg.jpg)


# How to solve this homework
You can solve the following problems either by hand or with the help of an LLM.

* If you are solving by hand, add enough comments to make the code understandable.
* If you are solving using an LLM, add the following as comments:
    * the LLM used (at the first use instance)
    * the prompt used to elicit the code
    * any modifications you had to make to the code

For example:

```
# --- LLM used: ChatGPT 4.5
# --- LLM prompt
# Write a python class to encapsulate the least common multiple algorithm
# --- End of LLM prompt
```

The programming language should be Python.

## P1: MDP implementation 

Write a class to implement an MDP. Do not include value or policy iteration in the class.

In [2]:
class WargMDP:
    def __init__(self):
        # Define the states
        self.states = [
            "SleepingWarg",
            "AngryWarg",
            "FuriousWarg",
            "ApoplecticWarg",
            "Safe",
            "Sorry",
        ]

        # Possible actions
        self.actions = ["pet", "strike"]

        # Transition probabilities and rewards
        # transitions[state][action] = [(next_state, probability, reward), ...]
        self.transitions = {
            "SleepingWarg": {
                "pet": [("AngryWarg", 0.95, -1), ("Safe", 0.05, 10)],
                "strike": [("AngryWarg", 1.0, -1)],
            },
            "AngryWarg": {
                "pet": [("Sorry", 1.0, -10)],
                "strike": [("FuriousWarg", 1.0, -1)],
            },
            "FuriousWarg": {
                "pet": [("Sorry", 1.0, -10)],
                "strike": [("ApoplecticWarg", 1.0, -1)],
            },
            "ApoplecticWarg": {
                "pet": [("Sorry", 1.0, -10)],
                "strike": [("Safe", 0.2, 10), ("Sorry", 0.8, -10)],
            },
            "Safe": {},
            "Sorry": {},
        }

        # Discount factor
        self.gamma = 0.9

    # Returns all of the states
    def get_states(self):
        return self.states

    # Returns all of the actions for a given state
    def get_actions(self, state):
        # If the state is terminal, no actions are possible
        if state in ["Safe", "Sorry"]:
            return []
        return self.actions

    # Returns the list of (next_state, probability, reward) for a given state and action
    def get_transitions(self, state, action):
        if state in self.transitions and action in self.transitions[state]:
            return self.transitions[state][action]
        return []

    # Checks if a state is terminal
    def is_terminal(self, state):
        return state in ["Safe", "Sorry"]

# Test
mdp = WargMDP()
print("States:", mdp.get_states())
print("Actions in 'SleepingWarg':", mdp.get_actions("SleepingWarg"))
print("Transitions for petting 'SleepingWarg':", mdp.get_transitions("SleepingWarg", "pet"))


States: ['SleepingWarg', 'AngryWarg', 'FuriousWarg', 'ApoplecticWarg', 'Safe', 'Sorry']
Actions in 'SleepingWarg': ['pet', 'strike']
Transitions for petting 'SleepingWarg': [('AngryWarg', 0.95, -1), ('Safe', 0.05, 10)]
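
Because the transition table is typed in by hand, a quick consistency check can catch typos before any solver runs. The sketch below assumes the `transitions[state][action] = [(next_state, probability, reward), ...]` layout used in `WargMDP`. The helper name `check_transition_probabilities` and the two-state excerpt are illustrative and not part of the assignment.

```python
import math

def check_transition_probabilities(transitions, tol=1e-9):
    """Return (state, action, total) for every action whose outcome probabilities do not sum to 1."""
    problems = []
    for state, actions in transitions.items():
        for action, outcomes in actions.items():
            total = sum(probability for _, probability, _ in outcomes)
            if not math.isclose(total, 1.0, abs_tol=tol):
                problems.append((state, action, total))
    return problems

# Excerpt of the table from WargMDP (same layout, fewer states)
example_transitions = {
    "SleepingWarg": {
        "pet": [("AngryWarg", 0.95, -1), ("Safe", 0.05, 10)],
        "strike": [("AngryWarg", 1.0, -1)],
    },
    "Safe": {},  # terminal states have no actions, so there is nothing to check
}

print(check_transition_probabilities(example_transitions))
```

An empty list means every listed action has a proper probability distribution over its outcomes.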


## P2: Warg as an MDP
Implement the WargPettingGame as an MDP using the implementation from above. 

In [20]:
class WargPettingGame:
    def __init__(self):
        self.mdp = WargMDP()
        self.current_state = "SleepingWarg"

    # Resets the game to the initial state
    def reset(self):
        self.current_state = "SleepingWarg"
        return self.current_state

    def step(self, action):
        # Check if the current state is terminal
        if self.mdp.is_terminal(self.current_state):
            raise ValueError("Cannot take an action in a terminal state.")

        # Check if the action is valid in the current state
        transitions = self.mdp.get_transitions(self.current_state, action)
        if not transitions:
            raise ValueError(f"Invalid action '{action}' in state '{self.current_state}'.")

        # Get the next state and reward based on the transition probabilities
        next_state, reward = self.probability_transition(transitions)
        self.current_state = next_state

        # Mark the episode as done if the next state is terminal
        done = self.mdp.is_terminal(next_state)

        return next_state, reward, done

    # Returns the available actions in the current state
    def get_available_actions(self):
        return self.mdp.get_actions(self.current_state)

    # Samples the next (state, reward) pair according to the transition probabilities
    def probability_transition(self, transitions):
        import random

        rand_val = random.random()
        cumulative_probability = 0.0

        # Simulate probabilistic state transitions
        for next_state, probability, reward in transitions:
            cumulative_probability += probability
            if rand_val < cumulative_probability:
                return next_state, reward

        # Fallback in case of numerical issues
        return transitions[-1][0], transitions[-1][2]

# Test
game = WargPettingGame()
state = game.reset()
total = 0
done = False

print(f"Initial state: {state}, Reward: {total}, Done: {done}")

while not done:
    actions = game.get_available_actions()
    print(f"Available actions: {actions}")

    # Choose the first available action
    action = actions[0]
    print(f"Taking action: {action}")

    state, reward, done = game.step(action)
    total += reward
    print(f"\nNext state: {state}, Reward: {total}, Done: {done}")


Initial state: SleepingWarg, Reward: 0, Done: False
Available actions: ['pet', 'strike']
Taking action: pet

Next state: AngryWarg, Reward: -1, Done: False
Available actions: ['pet', 'strike']
Taking action: pet

Next state: Sorry, Reward: -11, Done: True


## P3: Value iteration

Implement value iteration as a separate function that uses the MDP implementation from P1.

In [None]:
def value_iteration(mdp, epsilon=1e-6):
    # Initialize value function for all states to 0
    value_function = {state: 0 for state in mdp.get_states()}
    policy = {}

    # Repeat until convergence of the value function
    while True:
        delta = 0
        new_value_function = value_function.copy()

        # Iterate over all states
        for state in mdp.get_states():
            if mdp.is_terminal(state):
                continue

            best_value = float("-inf")
            best_action = None

            # Find the best action for a state
            for action in mdp.get_actions(state):
                action_value = 0

                # Calculate the expected value of an action
                for next_state, probability, reward in mdp.get_transitions(state, action):
                    action_value += probability * (reward + mdp.gamma * value_function[next_state])

                if action_value > best_value:
                    best_value = action_value
                    best_action = action

            # Update the value function and policy
            new_value_function[state] = best_value
            policy[state] = best_action

            # Update the maximum change for convergence check
            delta = max(delta, abs(new_value_function[state] - value_function[state]))

        value_function = new_value_function

        # Check for convergence
        if delta < epsilon:
            break

    return policy, value_function
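
The loops in `value_iteration` carry out what is usually called the Bellman optimality backup. Writing $T(s,a,s')$ for the transition probability and $R(s,a,s')$ for the reward (this notation is an assumption made here; the assignment does not fix one), a single sweep computes

$$
V_{k+1}(s) = \max_{a} \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma\, V_k(s')\bigr]
$$

for every non-terminal state $s$, while the terminal states `Safe` and `Sorry` keep the value 0. The iteration stops once $\max_s |V_{k+1}(s) - V_k(s)| < \epsilon$.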


## P4: Using value iteration
Find the V* values of the WargPettingGame using the implementation above. Print out the V* values for each state in the form
    V(state) = number

In [None]:
mdp = WargMDP()
optimal_policy, optimal_value_function = value_iteration(mdp)

print("\nOptimal Value Function:")
for state, value in optimal_value_function.items():
    print(f"  V({state}) = {value:.2f}")
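
As a cross-check on the printed numbers, the values can also be worked out by hand. This is possible because, assuming the transition table in `WargMDP` matches the figure, the game never returns to an earlier state, so one backward pass from the terminal states suffices. In each $\max$ below the first entry is the value of petting and the second the value of striking:

$$
\begin{aligned}
V^*(\text{ApoplecticWarg}) &= \max\{-10,\; 0.2\cdot 10 + 0.8\cdot(-10)\} = -6 \\
V^*(\text{FuriousWarg}) &= \max\{-10,\; -1 + 0.9\cdot(-6)\} = -6.4 \\
V^*(\text{AngryWarg}) &= \max\{-10,\; -1 + 0.9\cdot(-6.4)\} = -6.76 \\
V^*(\text{SleepingWarg}) &= \max\{0.95\,(-1 + 0.9\cdot(-6.76)) + 0.05\cdot 10,\; -1 + 0.9\cdot(-6.76)\} \\
&= \max\{-6.2298,\; -7.084\} = -6.2298
\end{aligned}
$$

The output of the cell above should agree with these values up to the two-decimal rounding in the format string.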

## P5:  Policy extraction

Find the policy $\pi(s)$ from the V values obtained in the previous step. Remember that you need to do one step of expectimax.
Print out the policy for each state in a readable way, e.g.
    pi(ApoplecticWarg) = Pet



In [None]:
mdp = WargMDP()
optimal_policy, optimal_value_function = value_iteration(mdp)

print("Optimal Policy:")
for state, action in optimal_policy.items():
    print(f"  π({state}) = {action}")
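
The cell above reuses the policy that `value_iteration` tracks as a by-product. The one-step expectimax mentioned in the task can also be written as a separate function that needs only the V values. The sketch below is one possible way to do that. It works directly on a transition dictionary in the `WargMDP` layout, and the function name `extract_policy`, the two-state excerpt, and the hand-computed values are assumptions made for illustration.

```python
def extract_policy(values, transitions, gamma=0.9):
    """One-step lookahead: for each non-terminal state pick the action with the best expected backup."""
    policy = {}
    for state, actions in transitions.items():
        if not actions:  # terminal state: nothing to choose
            continue
        q_values = {
            action: sum(p * (r + gamma * values[s_next]) for s_next, p, r in outcomes)
            for action, outcomes in actions.items()
        }
        policy[state] = max(q_values, key=q_values.get)
    return policy

# Excerpt of the WargMDP table (last two non-terminal states only)
transitions = {
    "FuriousWarg": {
        "pet": [("Sorry", 1.0, -10)],
        "strike": [("ApoplecticWarg", 1.0, -1)],
    },
    "ApoplecticWarg": {
        "pet": [("Sorry", 1.0, -10)],
        "strike": [("Safe", 0.2, 10), ("Sorry", 0.8, -10)],
    },
    "Safe": {},
    "Sorry": {},
}
# Values worked out by hand for this excerpt; terminal states are 0
values = {"FuriousWarg": -6.4, "ApoplecticWarg": -6.0, "Safe": 0.0, "Sorry": 0.0}

for state, action in extract_policy(values, transitions).items():
    print(f"pi({state}) = {action.capitalize()}")
```

With the full game, the same function could be called with `mdp.transitions` and the value dictionary returned by `value_iteration`.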

## P6: Policy iteration
Implement policy iteration as a separate function that uses the MDP defined above.
Apply it to the MDP that defines the warg petting game.
Print out the resulting policy for each state in a readable way.

In [18]:
def policy_iteration(mdp, epsilon=1e-6):
    # Initialize policy arbitrarily with the pet action
    policy = {state: "pet" for state in mdp.get_states() if not mdp.is_terminal(state)}
    value_function = {state: 0 for state in mdp.get_states()}

    while True:
        # Update the value function using the current policy
        while True:
            delta = 0
            new_value_function = value_function.copy()

            for state in mdp.get_states():
                if mdp.is_terminal(state):
                    continue

                # Get the action from the current policy
                action = policy[state]
                action_value = 0

                # Calculate the expected value of an action from the current policy
                for next_state, probability, reward in mdp.get_transitions(state, action):
                    action_value += probability * (reward + mdp.gamma * value_function[next_state])

                new_value_function[state] = action_value
                delta = max(delta, abs(new_value_function[state] - value_function[state]))

            value_function = new_value_function

            if delta < epsilon:
                break

        # Update the policy based on the updated value function if necessary
        policy_stable = True
        for state in mdp.get_states():
            if mdp.is_terminal(state):
                continue

            best_action = None
            best_value = float("-inf")

            # Find the best action for a state
            for action in mdp.get_actions(state):
                action_value = 0

                # Calculate the expected value of an action
                for next_state, probability, reward in mdp.get_transitions(state, action):
                    action_value += probability * (reward + mdp.gamma * value_function[next_state])

                if action_value > best_value:
                    best_value = action_value
                    best_action = action

            # Update the policy if the best action has changed
            if best_action != policy[state]:
                policy_stable = False
                policy[state] = best_action

        if policy_stable:
            break

    return policy


# Test
mdp = WargMDP()
optimal_policy = policy_iteration(mdp)

for state, action in optimal_policy.items():
    print(f"π({state}) = {action.capitalize() if action else 'Terminal'}")


π(SleepingWarg) = Pet
π(AngryWarg) = Strike
π(FuriousWarg) = Strike
π(ApoplecticWarg) = Strike


## P7: Trajectory sampling
Implement a function that generates trajectories in the form of (s,a,r,s') tuples from the MDP for a specific policy. The trajectory ends when it reaches a terminal state. 

Generate 100 trajectories for a __random__ policy. 

In [68]:
import random

# Generate a number of trajectories in the form of (s, a, r, s') using a random policy
def generate_random_trajectories(mdp, num_trajectories=100):
    trajectories = []

    for _ in range(num_trajectories):
        # Start from the initial state
        trajectory = []
        state = "SleepingWarg"

        while not mdp.is_terminal(state):
            # Choose an action randomly
            actions = mdp.get_actions(state)
            action = random.choice(actions)

            # Get the transition probabilities and rewards for the chosen action
            transitions = mdp.get_transitions(state, action)

            # Determine the next state and reward based on probabilities
            rand_val = random.random()
            cumulative_probability = 0.0

            for next_state, probability, reward in transitions:
                cumulative_probability += probability
                if rand_val < cumulative_probability:
                    trajectory.append((state, action, reward, next_state))
                    state = next_state
                    break

        # Add the trajectory to the list
        trajectories.append(trajectory)

    return trajectories

# Test
mdp = WargMDP()
random_trajectories = generate_random_trajectories(mdp, 100)

print("Sample Trajectories")
for i, trajectory in enumerate(random_trajectories, start=1):
    print(f"Trajectory {i}:")
    for step in trajectory:
        print(f"  {step}")
    print()


Sample Trajectories
Trajectory 1:
  ('SleepingWarg', 'strike', -1, 'AngryWarg')
  ('AngryWarg', 'pet', -10, 'Sorry')

Trajectory 2:
  ('SleepingWarg', 'strike', -1, 'AngryWarg')
  ('AngryWarg', 'pet', -10, 'Sorry')

Trajectory 3:
  ('SleepingWarg', 'pet', -1, 'AngryWarg')
  ('AngryWarg', 'pet', -10, 'Sorry')

Trajectory 4:
  ('SleepingWarg', 'pet', -1, 'AngryWarg')
  ('AngryWarg', 'pet', -10, 'Sorry')

Trajectory 5:
  ('SleepingWarg', 'pet', -1, 'AngryWarg')
  ('AngryWarg', 'strike', -1, 'FuriousWarg')
  ('FuriousWarg', 'strike', -1, 'ApoplecticWarg')
  ('ApoplecticWarg', 'strike', 10, 'Safe')

Trajectory 6:
  ('SleepingWarg', 'pet', -1, 'AngryWarg')
  ('AngryWarg', 'pet', -10, 'Sorry')

Trajectory 7:
  ('SleepingWarg', 'strike', -1, 'AngryWarg')
  ('AngryWarg', 'pet', -10, 'Sorry')

Trajectory 8:
  ('SleepingWarg', 'pet', -1, 'AngryWarg')
  ('AngryWarg', 'strike', -1, 'FuriousWarg')
  ('FuriousWarg', 'pet', -10, 'Sorry')

Trajectory 9:
  ('SleepingWarg', 'pet', -1, 'AngryWarg')
  ('An
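
The task statement asks for trajectories "for a specific policy", while the cell above hard-codes the random choice. A possible generalization is to pass the policy in as a function from state to action. The sketch below does this on a transition dictionary in the `WargMDP` layout and uses `random.choices` for the weighted draw. The names `sample_trajectory`, `random_policy` and `always_strike` are hypothetical helpers, not part of the assignment.

```python
import random

def sample_trajectory(transitions, policy, start_state, rng=random):
    """Roll out one episode as a list of (s, a, r, s') tuples, following policy(state) -> action."""
    trajectory = []
    state = start_state
    while transitions[state]:  # terminal states have an empty action table
        action = policy(state)
        outcomes = transitions[state][action]
        weights = [probability for _, probability, _ in outcomes]
        # Weighted draw of one (next_state, probability, reward) outcome
        next_state, _, reward = rng.choices(outcomes, weights=weights, k=1)[0]
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory

# Excerpt of the WargMDP table (last two non-terminal states only)
transitions = {
    "FuriousWarg": {
        "pet": [("Sorry", 1.0, -10)],
        "strike": [("ApoplecticWarg", 1.0, -1)],
    },
    "ApoplecticWarg": {
        "pet": [("Sorry", 1.0, -10)],
        "strike": [("Safe", 0.2, 10), ("Sorry", 0.8, -10)],
    },
    "Safe": {},
    "Sorry": {},
}

def random_policy(state):
    # Uniformly random over the actions listed for this state
    return random.choice(list(transitions[state]))

def always_strike(state):
    return "strike"

print(sample_trajectory(transitions, always_strike, "FuriousWarg"))
print(sample_trajectory(transitions, random_policy, "FuriousWarg"))
```

Passing a seeded `random.Random` instance as `rng` makes the outcome draws reproducible, which helps when comparing Q-learning runs on the same trajectory database.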

## P8: Implement Q-learning 

Create an implementation of Q-learning which takes the trajectory database and updates a Q-table.
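
One way to approach this is a tabular update that replays the stored `(s, a, r, s')` tuples several times. The following is a minimal sketch under stated assumptions, not a reference solution. The learning rate `alpha`, the number of passes `epochs`, and the function name `q_learning` are choices made here, and the two example trajectories are copied from the sample output in P7.

```python
def q_learning(trajectories, actions, alpha=0.1, gamma=0.9, epochs=50):
    """Tabular Q-learning over a fixed database of (s, a, r, s') tuples."""
    q_table = {}  # maps (state, action) -> estimated value; missing entries count as 0
    for _ in range(epochs):
        for trajectory in trajectories:
            for state, action, reward, next_state in trajectory:
                # Terminal states are never updated, so their lookups stay at the default 0
                best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
                old = q_table.get((state, action), 0.0)
                q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q_table

# Two trajectories taken from the P7 sample output
trajectories = [
    [("SleepingWarg", "strike", -1, "AngryWarg"), ("AngryWarg", "pet", -10, "Sorry")],
    [
        ("SleepingWarg", "pet", -1, "AngryWarg"),
        ("AngryWarg", "strike", -1, "FuriousWarg"),
        ("FuriousWarg", "strike", -1, "ApoplecticWarg"),
        ("ApoplecticWarg", "strike", 10, "Safe"),
    ],
]

q_table = q_learning(trajectories, actions=["pet", "strike"])
for (state, action), value in q_table.items():
    print(f"Q({state}, {action}) = {value:.2f}")
```

Note that state-action pairs that never occur in the database keep the initial value 0. In this game all true values of non-terminal states are negative, so an unvisited pair can look better than it is, and a larger trajectory database reduces that effect.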

## P9: Run Q-learning 

Run your implementation of Q-learning on the warg petting game. Print out the Q values in the form 

Q(state, action) = number


## P10: Policy implied by Q-values

Write a function that extracts a policy from Q-values.
Apply it to the Q-table obtained at P9. Print out the resulting policy in a readable way. 
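
A greedy extraction picks, for each state, the action with the largest Q-value. The sketch below assumes the Q-table is a dictionary keyed by `(state, action)` tuples. The function name `policy_from_q` is hypothetical, and the numbers in `example_q` are made-up placeholders for illustration, not results of P9.

```python
def policy_from_q(q_table):
    """Greedy policy: for each state in the table, the action with the largest Q-value."""
    best = {}  # state -> (best action so far, its Q-value)
    for (state, action), value in q_table.items():
        if state not in best or value > best[state][1]:
            best[state] = (action, value)
    return {state: action for state, (action, _) in best.items()}

# Placeholder Q-values (illustrative only, not computed by any run)
example_q = {
    ("SleepingWarg", "pet"): -6.2,
    ("SleepingWarg", "strike"): -7.1,
    ("AngryWarg", "pet"): -10.0,
    ("AngryWarg", "strike"): -6.8,
}

for state, action in policy_from_q(example_q).items():
    print(f"pi({state}) = {action.capitalize()}")
```

States that have no entry in the Q-table, such as the terminal states, do not appear in the resulting policy.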