In [1]:
# Farhan Mahbub
# CAP5636 - Advanced AI
# November 24, 2024
# Homework 5: Petting a deep Q warg

# Homework 5: Petting a deep Q warg

This homework builds on the same game as homework 4. 

![](figures/PetAWarg.jpg)

# How to solve this homework
You can solve the following problems either with the help of an LLM or by hand.

* If you are solving by hand, add enough comments to make the code understandable.
* If you are solving using an LLM, add the following as comments:
    * the LLM used (at the first use instance)
    * the prompt used to elicit the code
    * modifications that had to be made to the code

For example:

```
# --- LLM used: ChatGPT 4.5
# --- LLM prompt
# Write a python class to encapsulate the least common multiple algorithm
# --- End of LLM prompt
```

The programming language should be Python.

You can reuse code from your submission for homework 4. 

# P1: Model the game as an environment in gymnasium

gymnasium (https://gymnasium.farama.org/index.html) is a fork of the OpenAI gym library. It is a library that allows you to easily build reinforcement learning environments.

Model the PetAWarg game as an environment in gymnasium. You don't have to create a visual framework: it is enough to implement the render function to print the current state.

NOTE: If you are using an LLM, you should be able to ask it to convert your previous implementation into a gymnasium implementation.


In [2]:
# ChatGPT-4o
# Provided the previous code of the WargMDP class and asked it to convert that into a Gym environment implementation
# Had to swap the original OpenAI Gym library with Gymnasium

import gymnasium as gym
from gymnasium import spaces
import numpy as np

class WargEnv(gym.Env):
    # Gymnasium reads the supported render modes from the "render_modes" key
    metadata = {"render_modes": ["human"]}

    def __init__(self):
        super().__init__()
        
        # Define states
        self.states = [
            "SleepingWarg",
            "AngryWarg",
            "FuriousWarg",
            "ApoplecticWarg",
            "Safe",
            "Sorry",
        ]
        self.state_index = {state: i for i, state in enumerate(self.states)}
        self.current_state = "SleepingWarg"
        
        # Define actions
        self.actions = ["pet", "strike"]
        self.action_space = spaces.Discrete(len(self.actions))
        
        # Transition probabilities and rewards
        # transitions[state][action] = [(next_state, probability, reward), ...]
        self.transitions = {
            "SleepingWarg": {
                "pet": [("AngryWarg", 0.95, -1), ("Safe", 0.05, 10)],
                "strike": [("AngryWarg", 1.0, -1)],
            },
            "AngryWarg": {
                "pet": [("Sorry", 1.0, -10)],
                "strike": [("FuriousWarg", 1.0, -1)],
            },
            "FuriousWarg": {
                "pet": [("Sorry", 1.0, -10)],
                "strike": [("ApoplecticWarg", 1.0, -1)],
            },
            "ApoplecticWarg": {
                "pet": [("Sorry", 1.0, -10)],
                "strike": [("Safe", 0.2, 10), ("Sorry", 0.8, -10)],
            },
            "Safe": {},
            "Sorry": {},
        }

        # Observation space (discrete for states)
        self.observation_space = spaces.Discrete(len(self.states))
    
    def reset(self, seed=None, options=None):
        # Reset to initial state
        # Ensure seed compatibility with Gymnasium
        super().reset(seed=seed)
        self.current_state = "SleepingWarg"
        return self.state_index[self.current_state], {}

    def step(self, action):
        # Map action index to action
        action = self.actions[action]
        
        # Get possible transitions
        transitions = self.transitions.get(self.current_state, {}).get(action, [])
        if not transitions:
            raise ValueError(f"No valid transitions for action '{action}' in state '{self.current_state}'")

        # Sample a transition by index using the environment's seeded RNG,
        # so that reset(seed=...) makes episodes reproducible
        next_states, probabilities, rewards = zip(*transitions)
        idx = self.np_random.choice(len(transitions), p=probabilities)
        next_state, reward = next_states[idx], rewards[idx]
        
        # Update the state
        self.current_state = next_state
        
        # Check if terminal
        done = self.current_state in ["Safe", "Sorry"]
        return self.state_index[self.current_state], reward, done, False, {}

    def render(self):
        print(f"Current state: {self.current_state}")


# P2: Pet, strike, pet, strike, pet

Using the environment class implemented above, create an instance of the environment. Print out its state (by calling render()). 

Then, perform the actions: pet, strike, pet, strike, pet. After each action, print out the state.  

In [3]:
# Test out the environment
env = WargEnv()
# Gymnasium's reset() returns an (observation, info) tuple
state, _ = env.reset()

# [pet, strike, pet, strike, pet]
actions = [0, 1, 0, 1, 0]
done = False

# Perform the actions
for action in actions:
    if done:
        break
    
    env.render()
    state, reward, done, _, _ = env.step(action)
    print(f"Action: {env.actions[action]}, Reward: {reward}")

env.render()


Current state: SleepingWarg
Action: pet, Reward: -1
Current state: AngryWarg
Action: strike, Reward: -1
Current state: FuriousWarg
Action: pet, Reward: -10
Current state: Sorry


# P3: DQN

Install the stable_baselines3 library. Using the DQN implementation from that library, train an MlpPolicy policy for playing the PetAWarg game. 

https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html
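
As background for the `gamma` parameter used in the training cell, the following is a minimal sketch of the one-step target that DQN regresses its Q-values toward. It is not the stable_baselines3 implementation; the function name `td_target` and the example Q-values are hypothetical and chosen only for illustration.

```python
def td_target(reward, next_q_values, done, gamma=0.90):
    """One-step TD target: r + gamma * max_a' Q_target(s', a').

    At a terminal step there is no next state, so the target is just r.
    """
    if done:
        return float(reward)
    return reward + gamma * max(next_q_values)


# Hypothetical target-network outputs for FuriousWarg, ordered [pet, strike]
next_q = [-10.0, -6.4]

# AngryWarg --strike--> FuriousWarg (reward -1, episode continues)
print(td_target(-1, next_q, done=False))

# AngryWarg --pet--> Sorry (reward -10, episode ends)
print(td_target(-10, [0.0, 0.0], done=True))  # -10.0
```

A smaller `gamma` shrinks the contribution of the bootstrapped term, so the agent weighs the immediate reward more heavily than later outcomes.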

In [4]:
# ChatGPT-4o
# Provided the previous code of the WargEnv class and asked it to generate the training code using the stable_baselines3 DQN implementation
# Tweaked some of the parameters, such as gamma and the learning rate

from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy

# Create the environment
env = WargEnv()

# Define the DQN agent with an MlpPolicy
model = DQN(
    "MlpPolicy",
    env,
    learning_rate=1e-3,
    buffer_size=50000,
    learning_starts=1000,
    batch_size=32,
    gamma=0.90,
    target_update_interval=500,
    train_freq=4,
    verbose=1,
)

# Train the agent
model.learn(total_timesteps=10000)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

# Save the model
model.save("dqn_pet_a_warg")


Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 2.25     |
|    ep_rew_mean      | -11.2    |
|    exploration_rate | 0.991    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 2738     |
|    time_elapsed     | 0        |
|    total_timesteps  | 9        |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 2.62     |
|    ep_rew_mean      | -11.6    |
|    exploration_rate | 0.98     |
| time/               |          |
|    episodes         | 8        |
|    fps              | 2990     |
|    time_elapsed     | 0        |
|    total_timesteps  | 21       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 2.58     |
|    ep_rew_mean      | -11.6 
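
The `exploration_rate` values in the log can be reproduced with a linear decay schedule. The sketch below assumes the stable_baselines3 DQN defaults that the training cell does not override (`exploration_fraction=0.1`, `exploration_initial_eps=1.0`, `exploration_final_eps=0.05`); these may differ in other library versions. The function name `linear_epsilon` is hypothetical.

```python
def linear_epsilon(step, total_timesteps=10_000, fraction=0.1, start=1.0, end=0.05):
    """Linearly decay epsilon from `start` to `end` over the first
    `fraction` of training, then hold it at `end`."""
    progress = step / total_timesteps
    if progress > fraction:
        return end
    return start + progress * (end - start) / fraction


# Timesteps 9 and 21 are the ones reported in the training log above
for step in (9, 21, 1000, 5000):
    print(step, round(linear_epsilon(step), 3))
```

Under these assumptions the schedule gives 0.991 at timestep 9 and 0.98 at timestep 21, matching the logged values, and it reaches the final value of 0.05 after 1000 of the 10000 timesteps.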



# P4: Print out the policy learned by DQN

Print out the policy learned by DQN in the previous step. You can assume that the policy is deterministic. In this case, the policy can be printed out by iterating over all the states and printing out the action generated by the policy. 

In [8]:
# Print the learned policies
def print_dqn_policy(env, model):
    print("Learned Policy", "\n")
    for state_name in env.states:
        if state_name in ["Safe", "Sorry"]:
            # Terminal states have no valid actions
            print(f"π({state_name}) = None")
            continue

        # Set the current state manually for evaluation
        env.current_state = state_name

        # Get the action from the model (deterministic policy)
        state_idx = env.state_index[state_name]
        action, _ = model.predict(state_idx, deterministic=True)
        # predict() returns a NumPy value; convert it to a plain int index
        action_name = env.actions[int(action)]

        print(f"π({state_name}) = {action_name}")

# Instantiate the environment and DQN agent model
env = WargEnv()
model = DQN.load("dqn_pet_a_warg")

# Print the policy
print_dqn_policy(env, model)


Learned Policy 

π(SleepingWarg) = pet
π(AngryWarg) = strike
π(FuriousWarg) = strike
π(ApoplecticWarg) = strike
π(Safe) = None
π(Sorry) = None
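
Because the game has only six states, the learned policy can be cross-checked against an exact solution. The following is a minimal sketch of value iteration over the same transition table, assuming the same discount factor (`gamma=0.90`) as the DQN cell. It is an independent sanity check, not part of the DQN method, and the helper names are hypothetical.

```python
GAMMA = 0.90  # same discount factor as the DQN training cell

# transitions[state][action] = [(next_state, probability, reward), ...]
TRANSITIONS = {
    "SleepingWarg": {
        "pet": [("AngryWarg", 0.95, -1), ("Safe", 0.05, 10)],
        "strike": [("AngryWarg", 1.0, -1)],
    },
    "AngryWarg": {
        "pet": [("Sorry", 1.0, -10)],
        "strike": [("FuriousWarg", 1.0, -1)],
    },
    "FuriousWarg": {
        "pet": [("Sorry", 1.0, -10)],
        "strike": [("ApoplecticWarg", 1.0, -1)],
    },
    "ApoplecticWarg": {
        "pet": [("Sorry", 1.0, -10)],
        "strike": [("Safe", 0.2, 10), ("Sorry", 0.8, -10)],
    },
    "Safe": {},
    "Sorry": {},
}


def q_value(values, state, action, gamma=GAMMA):
    """Expected return of taking `action` in `state`, then following `values`."""
    return sum(p * (r + gamma * values[nxt])
               for nxt, p, r in TRANSITIONS[state][action])


def value_iteration(gamma=GAMMA, tol=1e-9):
    """Sweep over the states until the largest value change falls below `tol`."""
    values = {state: 0.0 for state in TRANSITIONS}  # terminal states stay at 0
    while True:
        delta = 0.0
        for state, actions in TRANSITIONS.items():
            if not actions:
                continue  # terminal state: nothing to update
            best = max(q_value(values, state, a, gamma) for a in actions)
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values


def greedy_policy(values, gamma=GAMMA):
    """Pick the action with the highest expected return in each non-terminal state."""
    return {state: max(actions, key=lambda a: q_value(values, state, a, gamma))
            for state, actions in TRANSITIONS.items() if actions}


values = value_iteration()
for state, action in greedy_policy(values).items():
    print(f"π({state}) = {action}, V = {values[state]:.4f}")
```

With this table and discount factor, the greedy policy is pet in SleepingWarg and strike in the three awake states, which agrees with the policy printed for the DQN agent in this run.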
