# Q-Learning in TextWorld
## Overview
*Text Adventure Games* are games in which the player interacts with a rich world only through text. Text adventure games predate computers with graphics. However, in many ways they are more complex than conventional video games because they can involve complicated interactions (e.g., "build a rope bridge") that require a fair amount of imagination. Indeed, text adventure games are used as [research testbeds](https://arxiv.org/abs/1909.05398) for natural language processing agents.

The canonical text adventure game is [Zork](https://en.wikipedia.org/wiki/Zork), in which the player discover an abandoned underworld realm full of treasure. You can find online playable versions.

A text game is made up of individual locations--also called "rooms", though they need not be indoor enclosed spaces as the term might imply. The agent can move between rooms and interact with objects by typing in short commands like "move north" and "take lamp".

In this assignment, we will use a special package that implements text worlds for testing agents: [TextWorld-Express](https://github.com/cognitiveailab/TextWorldExpress). Textworld-Express simplifies text worlds in a few ways: it uses a reduced set of text commands, and rooms laid out in a grid.
TextWorld-Express also implements a few different game objectives, such as cooking, and searching for coins.
TextWorld-Express generates world configurations, so we will need to implement algorithms that are able to complete different game objectives in different world configurations.

In this part of the assignment, our agents will play two different games:
- **Coin Game**: a game in which the agent must search for and pick up a single coin.
- **Map Reader**: a game in which the agent must find a coin and return it to a box at the starting location.

**We will be implementing the tabular Q-learning algorithm** (as opposed to neural Q-learning).

## Important Notes and Guidelines
- You are **only** allowed to use a restricted set of libraries for this assignment. All packages that come with the default Python installation are permitted, as well as any imports we have already provided for you. You may not use any other libraries than the ones we have provided. If you attempt to use other libraries, the autograder will not be able to run your code.
- TextWorldExpress requires Java 1.8 or higher to be installed on your system. For more information, see the [TextWorld-Express README](https://github.com/cognitiveailab/TextWorldExpress).
- Do not modify any function signatures or the global variables provided in the notebook. You may add additional helper functions as needed - do not add them to separate cells, as they will not be exported in the autograder. **Any helper functions should be nested within the function that uses them.**

## Helpful Tips
- If you break execution of a cell running the game engine, you may put TextWorld-Express in an un-recoverable state. If this happens, you will need to reset your kernel/runtime.
- In the Map Reader game, you cannot use the map information (it isn't helpful anyway).
- You cannot (and shouldn't) filter any actions. We've already filtered out the actions that we don't want your agent to have to consider. For example, the "take map" action is never helpful, but you must explore it. Your implementations should quickly realize that that action creates a state self-loop and disregard it.

# Installations and Imports

In [13]:
%pip install gymnasium
%pip install textworld-express

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [14]:
# export - DO NOT MODIFY THIS CELL

from textworld_express import TextWorldExpressEnv
import gymnasium
from typing import Union
import re
import copy
import json
import random

In [15]:
# export - DO NOT MODIFY OR MOVE THIS LINE 
# Add any additional imports (from the Python Standard Library only) here


# Load a Game

Set the random seed for repeatability, initialize the game environment (`ENV` - a global variable that encapsulates the environment). Set the game generator to load a particular game (coin game or map reader game). 

In [16]:
SEED = 3
env = TextWorldExpressEnv(envStepLimit=100)

# change the game type and parameters here to test different environments in depth
# only one of the two options below should be active (i.e. not commented out)

### THESE TWO LINES ENABLE THE "COIN" GAME
game_type = "coin"
game_params = "numLocations=5,includeDoors=1,numDistractorItems=0"

### THESE TWO LINES ENABLE THE "MAPREADER" GAME
# GAME_TYPE="mapreader"
# GAME_PARAMS="numLocations=5,maxDistanceApart=3,includeDoors=0,maxDistractorItemsPerLocation=0"


env.load(gameName=game_type, gameParams=game_params)

# MDP Helper Functions

**Observation Parsing Functions**

- `parse_inventory()` attempts to pull the inventory line out of an observation.
- `obs_location()` attempts to pull the name of the location of the agent out of the observations.
- `hash_state()` converts an observation to a hash code a string of unique numbers.
- `parse_things()` attempts to pull out all the objects in an observation.
- `parse_doors()` attempts to pull out information about all the doors in an observation. It returns a list of tuples containing `"name_of_door (direction)"` and whether it is `'open'` or `'closed'`.

In [17]:
def obs_with_inventory(obs: str, inv: str) -> str:
    """
    Add the inventory to the world observation text.
    
    This function combines the observation text with the inventory
    to provide a more complete text observation.
    
    Args:
        obs (str): The observation text.
        inv (str): The inventory text.
    
    Returns:
        str: The combined observation and inventory text.
    """
    return obs + '\n' + inv

def parse_inventory(obs: str) -> str:
    """
    Pull the inventory items out of the observation text that includes
    the inventory (from obs_with_inventory()).
    
    This function searches for the inventory in the observation
    and returns the items if they are not empty.
    
    Args:
        obs (str): The observation text containing inventory information.
    
    Returns:
        str: The inventory items or 'empty' if there are no items.
    """
    m = re.search(r'Inventory[a-zA-Z0-9 \(\)]*:\s*([a-zA-Z0-9 \.\n]+)', obs)
    if m is not None:
        if 'empty' not in m.group(1):
            return m.group(1).replace('\n', '')
    return 'empty'

def obs_location(obs: str) -> str:
    """
    Pull the location out of the observation text.
    
    This function extracts the location from the first sentence
    of the observation text.
    
    Args:
        obs (str): The observation text.
    
    Returns:
        str: The location extracted from the observation.
    """
    first_sentence = obs.split('.')[0].split(' ')
    start = first_sentence.index('the') + 1
    return ' '.join(first_sentence[start:])

def hash_state(state: dict) -> str:
    """
    Produces an identifier for a state dictionary.
    
    This function generates a hash for the given state dictionary,
    which is not guaranteed to be unique but should suffice for
    identifying the state.
    
    Args:
        state (dict): The state dictionary to hash.
    
    Returns:
        str: The string representation of the hash.
    """
    return str(abs(hash(json.dumps(state))))

def parse_things(obs: str) -> list[Union[str, list[str]]]:
    """
    Parse the objects out of an observation.
    
    This function extracts various objects mentioned in the observation
    text and returns them as a list.
    
    Args:
        obs (str): The observation text.
    
    Returns:
        list[Union[str, list[str]]]: A list of objects found in the observation.
    """
    things1 = re.findall(r'[yY]ou \w*\s*see [aA]? ([a-zA-Z0-9\- ]+)\,? that ([a-zA-Z0-9\-, ]+).', obs)
    things2 = re.findall(r'[tT]here is \w*\s*([a-zA-Z0-9\- ]+)\,? that ([a-zA-Z0-9\-, ]+).', obs)
    things3 = re.findall(r'[tT]here is \w*\s*([a-zA-Z0-9\- ]+)\.', obs)
    things3 = list(filter(lambda s: 'that' not in s, things3))
    things4 = re.findall(r'[yY]ou \w*\s*see a ([a-zA-Z0-9\- ]+)\.', obs)
    things4 = list(filter(lambda s: 'door' not in s and 'that' not in s, things4))
    return list(map(lambda x: list(x) if type(x) is tuple else x,
                    things1 + things2 + things3 + things4))

def parse_doors(obs: str, location: str) -> list[tuple[str, str]]:
    """
    Parse doors out of an observation.
    
    This function identifies open and closed doors mentioned in the
    observation text and returns them along with their status.
    
    Args:
        obs (str): The observation text.
        location (str): The current location of the observer.
    
    Returns:
        list[tuple[str, str]]: A list of tuples containing door descriptions and their statuses.
    """
    sentences = obs.split('.')
    doors = []
    dirs = re.compile('west|east|south|north')
    for sentence in sentences:
        m_open = re.search(r'open ([a-z\- ]*door)', sentence)
        m_closed = re.search(r'closed ([a-z\- ]*door)', sentence)
        dir = dirs.search(sentence.lower())
        if dir is not None:
            if m_open is not None:
                doors.append((m_open[1] + ' (' + location + ') ' + dir[0], 'open'))
            elif m_closed is not None:
                doors.append((m_closed[1] + ' (' + location + ') ' + dir[0], 'closed'))
    return doors

def parse_room(obs: str) -> dict[str, list[Union[str, list[str]]]]:
    """
    Parse room information from an observation.
    
    This function extracts and sorts the objects found in the room
    from the observation text.
    
    Args:
        obs (str): The observation text for the room.
    
    Returns:
        dict[str, list[Union[str, list[str]]]]: A dictionary containing the things in the room.
    """
    things = sorted(parse_things(obs), key=lambda x: x[0] if type(x) is list else x)
    return {'things': things}

**Environment Interaction Functions**

These functions allow the agent to interact with the environment in a slightly more friendly way than the default `env.reset()` and `env.step()` functions. They wrap those funtions and do some processing on the data to bundle it in a way that will be easier to work with.

`reset_mdp()` takes an environment (e.g., `ENV`) and returns the starting state id and valid actions in the starting state, as a tuple (look in the code below for more details).

`do_action_mdp()` takes the name of an action and a pointer to the environment. It returns the state that results from executing the action. It returns 4 values:
- state_id: a string that identifies the current state
- reward: a floating point number
- termination: a boolean indicating whether the episode has ended
- infos: a dictionary containing observation, inventory, and valid actions (as above).

**Use these functions instead of `env.reset()` and `env.step()`.**

In [18]:
def reset_mdp(env: gymnasium.Env, seed: int) -> tuple[str, list[str]]:
    """
    Reset the environment and the Markov process.

    This function resets the environment for new agent runs

    Args:
        env (Env): The Environment instance

    Returns:
        state_id str: A string of the current state id
        valid_actions list[str]: List of valid actions to take in the state state_id
    """
    _, infos = env.reset(seed=seed, gameFold="train", generateGoldPath=True)
    valids: list[str] = infos["validActions"]
    if "inventory" in valids:
      valids.remove("inventory")
    if "look around" in valids:
      valids.remove("look around")
    state_id = hash_state(
        obs_with_inventory(infos["look"], parse_inventory(infos["inventory"]))
    )
    return state_id, valids

def do_action_mdp(
    action: str, env: gymnasium.Env
) -> tuple[str, float, bool, list[str]]:
    """
    Take a step in the environment.

    Args:
        action (str): the choosen action to take in the "env"
        env (Env): The Environment instance

    Returns:
        state_id str: A string of the current state id
        reward float: A state's reward
        termination boolean: Whether the episode has terminated
        valid_actions list[str]: List of valid actions to take in the state state_id
    """
    _, reward, done, infos = env.step(action)
    valid_actions = infos["validActions"]
    if "inventory" in valid_actions:
      valid_actions.remove("inventory")
    if "look around" in valid_actions:
      valid_actions.remove("look around")
    state_id = hash_state(
        obs_with_inventory(infos["look"], parse_inventory(infos["inventory"]))
    )
    return state_id, reward, done, valid_actions

# Implement Q-Learning

**Step 1.** Implement the `q_learning()` function. This function takes the following parameters:
- env: a pointer to the environment (`ENV`).
- num_episodes: the number of episodes to run before termination of the entire algorithm.
- threshold: the number of steps in an episode before terminating a single episode.
- learning_rate: a number between 0 and 1 controlling how fast the policy is allowed to change.
- gamma: the Bellman equation horizon parameter (between 0 and 1).
- epsilon: (optional) if epsilon greedy is implemented, this number (0..1) determines the ratio of random to policy-guided actions. A value of 1.0 indicates purely random, and a value of 0.0 indicates purely on-policy.

The `q_learning()` algorithm should return a single value: the policy. The policy will be a dictionary-of-dictionaries where the outermost dictionary has a key for each state visited. Each state points to a separate inner dictionary where the keys are actions and the values are q-values. For example:
```py
{
    state1: {
        'move north': 0.1,
        'move south': 0.0,
        'move east': 0.8,
        'move west': 0.4
    },
    state2: {
        'move north': 0.01,
        'take coin': 1.0,
        'move east': 0.05,
        'move west': 0.2
    },
    ...
}
```

You will interact with the environment through `reset_mdp()` and `do_action_mdp()`. Please use the state ids we return in both of these functions to uniquely identify a particular state.

We recommend you track your algorithm's performance by tracking the total reward of each episode, and the number of step in each episode (fewer is better). If you use purely random action selection, you will see a lot of variance in your total episode reward. If you implement epsilon-greedy, you will see a trend toward more consistent achievement of maximum reward as episode number increases.

In [112]:
# export - DO NOT MODIFY THIS CELL

def q_learning(
    env: gymnasium.Env,
    num_episodes: int,
    max_episode_length: int,
    learning_rate: float,
    gamma: float,
    seed: int,
    epsilon: float = 1.0,
) -> dict[str, dict[str, float]]:
    """
    Build a Q-Learning policy

    Args:
        env (Env): The Environment instance
        num_episodes (int): The number of episodes to build the table from
        max_episode_length (int): The maximum length of an episode to prevent infinite loops
        learning_rate (float): A hyperparameter denoting how quickly the agent "learns" reward values
        gamma (float): The discount rate
        epsilon (float): The probability with which you should select a random
        action instead of following a greedy policy

    Returns:
        dict[str, dict[str, float]]: A dictionary of dictionaries mapping a state and action
        to a specific reward value. This is what you will build in this algorithm
    """

    # Set up q-table
    q_table: dict[str, dict[str, float]] = {}
    random.seed(seed)

    ### YOUR CODE BELOW HERE

    def ensure_state_in_qtable(state_id: str, valid_actions: list[str]) -> None:
        """
        Ensure that the given state is present in the Q-table with all valid actions initialized.
        """
        if state_id not in q_table:
            q_table[state_id] = {action: 0.0 for action in valid_actions}
        else:
            # If some new valid actions appear, add them with an initial value of 0.0.
            for action in valid_actions:
                if action not in q_table[state_id]:
                    q_table[state_id][action] = 0.0

    for episode in range(num_episodes):
        # Reset the environment using reset_mdp() to obtain the initial state and valid actions.
        state_id, valid_actions = reset_mdp(env, seed)
        ensure_state_in_qtable(state_id, valid_actions)

        for step in range(max_episode_length):
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                # Explore: randomly select one of the valid actions.
                action = random.choice(valid_actions)
            else:
                # Exploit: choose the action with the highest Q-value for this state.
                best_action = None
                best_q = float('-inf')
                for a in valid_actions:
                    if q_table[state_id][a] > best_q:
                        best_q = q_table[state_id][a]
                        best_action = a
                # In the rare case that best_action is None, fallback to random choice.
                action = best_action if best_action is not None else random.choice(valid_actions)

            # Take the selected action using do_action_mdp().
            next_state_id, reward, done, next_valid_actions = do_action_mdp(action, env)
            ensure_state_in_qtable(next_state_id, next_valid_actions)

            # Q-learning update:
            # Q(s, a) = Q(s, a) + learning_rate * (reward + gamma * max_a' Q(s', a') - Q(s, a))
            current_q = q_table[state_id][action]
            if next_valid_actions:
                max_next_q = max(q_table[next_state_id][a] for a in next_valid_actions)
            else:
                max_next_q = 0.0
            new_q = current_q + learning_rate * (reward + gamma * max_next_q - current_q)
            q_table[state_id][action] = new_q

            # Move to the next state.
            state_id = next_state_id
            valid_actions = next_valid_actions

            if done:
                break

    ### YOUR CODE ABOVE HERE

    return q_table

**Step 2.** Set the parameters for your q-learning algorithm. You can change these values. We've provided some completely random values to get you started. 

**Remember to change the set_parameters() function below to the correct values or your code will not work in the autograder!**

In [113]:
NUM_EPISODES = 500
MAX_EPISODE_LENGTH = 50
LEARNING_RATE = 0.5
GAMMA = 0.95
EPSILON = 0.3

Test your q-learning implementation.

In [114]:
q_table = q_learning(
    env,
    num_episodes=NUM_EPISODES,
    max_episode_length=MAX_EPISODE_LENGTH,
    learning_rate=LEARNING_RATE,
    gamma=GAMMA,
    seed=SEED,
    epsilon=EPSILON
)

In [115]:
q_table

{'6297002076517109080': {'close door to west': 0.7743460838641085,
  'move west': 0.7750779386531563,
  'open door to south': 0.8145062686672617,
  'open door to west': 0.7367014123277151,
  'move south': 0.7753846601919039,
  'close door to south': 0.7738676586389353},
 '1551779566118938349': {'move west': 0.34743875956392256,
  'move south': 0.5532501579549125,
  'close door to west': 0.7745579803539361,
  'close door to south': 0.6440220865987014,
  'open door to west': 0.7095190239700806,
  'open door to south': 0.6471130357144705},
 '5698157482587418034': {'open door to east': 0.23586272434304925,
  'close door to east': 0.037568844901436846,
  'move east': 0.6798338617309115},
 '8522919990782962926': {'open door to south': 0.8149373886404445,
  'move south': 0.8573750003375206,
  'move west': 0.8159378413369047,
  'close door to west': 0.8150178539421569,
  'open door to west': 0.7003387588818528,
  'close door to south': 0.7751011614868055},
 '5438051632063686572': {'open door t

# Implement Code to Run a Policy

**Step 3.** Implement code to run the policy. This function takes the following parameters:
- q_table: your q-table, as specified in step 1.
- env: pointer to the environment (e.g., `ENV`).
- threshold: the maximum number of steps to take before terminating.

Your function should run a single episode from the initial state and return:
- A list of actions taken during the episode (e.g., `[act_1, act_2, ... act_n]`).
- The total sum reward of all actions taken as a float.

Your function will interact with the environment through `reset_mdp()` and `do_action_mdp()`. Be sure to reset the environment before running, and terminate the episode if `do_action_mdp()` indicates the termination boolean.

In [116]:
# export - DO NOT MODIFY THIS CELL

def run_policy(
    q_table: dict[str, dict[str, float]],
    env: gymnasium.Env,
    seed: int,
    max_policy_length: int = 25,
) -> tuple[list[str], float]:
    """
    Run a policy from a built Q-Table

    Args:
        q_table (dict[str, dict[str, float]]): The built Q-Table dictionary
        env (gymnasium.Env): The environment in which to run the policy
        seed (int): The seed to use
        max_policy_length (int): The maximum length of the policy to run
        
    Returns:
        list[str]: The sequence of actions that the policy performed
        float: The sum total reward gained from the environment
    """
    actions = []  # Store the entire sequence of actions here
    total_reward = 0.0  # Store the total sum reward of all actions executed here

    ### YOUR CODE BELOW HERE

    state_id, valid_actions = reset_mdp(env, seed)

    for _ in range(max_policy_length):
        if not valid_actions:
            break

        if state_id in q_table:
            valid_q = {a: q for a, q in q_table[state_id].items() if a in valid_actions}
            if valid_q:
                best_q = max(valid_q.values())
                best_actions = [a for a, q in valid_q.items() if q == best_q]
                action = random.choice(best_actions)
            else:
                action = random.choice(valid_actions)
        else:
            action = random.choice(valid_actions)

        actions.append(action)
        next_state_id, reward, done, next_valid_actions = do_action_mdp(action, env)
        total_reward += reward
        state_id, valid_actions = next_state_id, next_valid_actions
        if done:
            break

    ### YOUR CODE ABOVE HERE

    return actions, total_reward

**Step 4.** Set the threshold value for episode length during policy execution (test time threshold).

In [117]:
MAX_POLICY_LENGTH = 50

**Step 5.** Run the policy. You should aim to have a total reward of 1.0

In [121]:
plan, total_reward = run_policy(q_table, env, SEED, max_policy_length = MAX_POLICY_LENGTH,)
print("plan:", plan)
print("Total reward:", total_reward)

plan: ['open door to south', 'move south', 'open door to east', 'move east', 'take coin']
Total reward: 1.0


# New Environment: Stochastic Actions

The following creates a new type of environment called `StochasticTextWorldExpressEnv`. This environment is the same as the previous environment type, except that some percentage of the time, the action that the agent chooses is not executed and a randomly chosen action is executed instead.

When the environment is created the percentage of action randomness (between 0 and 1) is set, where 0.0 means no randomness, and 1.0 means that actions are executed purely randomly.

Otherwise, this environment works the same as previously.

Passing `debug = True` to the `step()` function will print the action the environment will *really* execute!

**NOTE:** The agent is never able to know whether the action it chose was executed or if a different action was executed.  

In [122]:
class StochasticTextWorldExpressEnv(TextWorldExpressEnv):
    def __init__(self, serverPath=None, envStepLimit=100, stochasticity=0.0):
        # Call the super constructor
        super().__init__(serverPath, envStepLimit)
        # Store the valid actions and stochasticity
        self.valid_actions = []
        self.stochasticity = stochasticity
        self.never_pick = set(["look around", "inventory"])

    def reset(
        self,
        seed=None,
        gameFold=None,
        gameName=None,
        gameParams=None,
        generateGoldPath=False,
    ):
        # Call the super method
        observation, infos = super().reset(
            seed, gameFold, gameName, gameParams, generateGoldPath
        )
        # Update the valid actions
        self.valid_actions = infos["validActions"]
        return observation, infos

    def step(self, action: str, debug=False):
        # If a random value is less than the stochasticity target, choose a random action
        if random.random() < self.stochasticity:
            temp_valids = copy.deepcopy(self.valid_actions)
            # Remove inventory and look around from valid actions to choose from
            temp_valids = list(set(self.valid_actions).difference(self.never_pick))
            # Pick a random action from whatever remains
            action = random.choice(temp_valids)
        # If debugging flag is on, print the real action that will be executed
        if debug:
            print("[[action]]:", action)
        # Call the super class with either the action passed in or the randomly chosen one
        observation, reward, isCompleted, infos = super().step(action)
        # Update the valid actions
        self.valid_actions = infos["validActions"]
        return observation, reward, isCompleted, infos

Create the new environment type.

In [123]:
gymnasium.register(id='TextWorldExpress-StochasticTextWorldExpressEnv-v0',
                   entry_point='__main__:StochasticTextWorldExpressEnv')
SENV = StochasticTextWorldExpressEnv(envStepLimit=100, stochasticity=0.25)

Create a game with this environment type and reset the environment (same as before).

In [124]:
game_type = "coin"
game_params = "numLocations=5,includeDoors=1,numDistractorItems=0"
SENV.load(gameName=game_type, gameParams=game_params)
SENV.reset(seed=SEED, gameFold="train", generateGoldPath=True)

('You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see a dishwasher that is closed. In one part of the room you see a dining chair, that has nothing on it. \nTo the South you see a closed sliding patio door. To the West you see a closed frosted-glass door. ',
 {'observation': 'You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see

Train in the stochastic Text World environment.

In [125]:
q_table = q_learning(
    SENV,
    num_episodes=NUM_EPISODES,
    max_episode_length=MAX_EPISODE_LENGTH,
    learning_rate=LEARNING_RATE,
    gamma=GAMMA,
    seed = SEED,
    epsilon=EPSILON
)

Test the policy. You should aim to have a total reward of 1.0

In [126]:
plan, total_reward = run_policy(q_table, SENV, SEED, max_policy_length = MAX_POLICY_LENGTH)
print("plan:", plan)
print("total reward:", total_reward)

plan: ['close door to south', 'close door to south', 'close door to south', 'close door to south', 'close door to south', 'close door to south', 'close door to south', 'close door to south', 'move south', 'open door to east', 'move east', 'move east', 'take coin', 'take coin']
total reward: 1.0


# New Environment: Negative Reward

The following creates a new type of environment called `PunishmentTextWorldExpressEnv`. This environment is the same as the previous environment type, except that the agent receives negative reward when it performs actions that are illegal or do not change the world state. For example, trying to close a door that is already closed, or move in a direction that is illegal.

Otherwise, this environment works the same as previously.

In [127]:
class PunishmentTextWorldExpressEnv(TextWorldExpressEnv):
    def __init__(self, serverPath=None, envStepLimit=100, punishment=0.0):
        # Call the super constructor
        super().__init__(serverPath, envStepLimit)
        # Store the punishment
        self.punishment = punishment
        # Store the previous observation
        self.previous_observation = None

    def step(self, action: str):
        # Call the super method
        observation, reward, isCompleted, infos = super().step(action)
        # If the current look is the same as the previous look, then we have performed an illegal action
        if infos["look"] == self.previous_observation:
            reward = self.punishment
        # Store the previous observation
        self.previous_observation = infos["look"]
        return observation, reward, isCompleted, infos

Register and create the new environment type.

In [128]:
gymnasium.register(id='TextWorldExpress-PunishmentTextWorldExpressEnv-v0',
                   entry_point='__main__:PunishmentTextWorldExpressEnv')
PENV = PunishmentTextWorldExpressEnv(envStepLimit=100, punishment=-1.0)

Create a game with this environment type and reset the environment.

In [129]:
game_type = "coin"
game_params = "numLocations=5,includeDoors=1,numDistractorItems=0"
PENV.load(gameName=game_type, gameParams=game_params)
obs, infos = PENV.reset(seed=SEED, gameFold="train", generateGoldPath=True)

Train in the punishment Text World environment.

In [130]:
q_table = q_learning(
    PENV,
    num_episodes=NUM_EPISODES,
    max_episode_length=MAX_EPISODE_LENGTH,
    learning_rate=LEARNING_RATE,
    gamma=GAMMA,
    seed=SEED,
    epsilon=EPSILON,
)

Test the policy. You should aim to have a total reward of 1.0

In [131]:
plan, total_reward = run_policy(q_table, SENV, SEED, max_policy_length = MAX_POLICY_LENGTH)
print("plan:", plan)
print("total reward:", total_reward)

plan: ['open door to south', 'open door to south', 'move south', 'move south', 'open door to south', 'move south', 'move south', 'open door to east', 'move east', 'take coin']
total reward: 1.0


# Testing Suite

This function will run  your agent on an environment, game type, game parameters/configurations, and a seed.

In [132]:
def run_environment(env: gymnasium.Env, game_type: str, game_params: str, seed: int, parameters: dict[str, int | float]):
  print(f"TESTING {type(env)}, {game_type}, {game_params}, {seed}")
  # Run the q learner and get the policy

  # load the environment
  env.load(gameName=game_type, gameParams=game_params)
  env.reset(seed=seed, gameFold="train", generateGoldPath=True)
  
  q_table = q_learning(
      env,
      num_episodes=parameters["NUM_EPISODES"],
      max_episode_length=parameters["MAX_EPISODE_LENGTH"],
      learning_rate=parameters["LEARNING_RATE"],
      gamma=parameters["GAMMA"],
      epsilon=parameters["EPSILON"],
      seed=seed
  )
  # run the policy to get the plan
  plan, total_reward = run_policy(
      q_table, env, seed, max_policy_length=parameters["MAX_POLICY_LENGTH"]
  )
  # Store the plan in the results
  return plan, total_reward

This code will run your agent on all provided environment configurations and report the results.

In [133]:
def run_all(environments: list[gymnasium.Env], games: dict[str, list[str]], seeds: list[int], parameters: dict[str, int | float]) -> tuple[dict, dict]:
    # Results will contain a key (env type, game type, game params, seed) and values will be plans and total_rewards
    plans: dict = {}
    rewards: dict = {}
    # Iterate through all environments given
    for env in environments:
        # Iterate through all game types, the keys of the games dict
        for game_type in games:
            # Iterate through all game parameters for the given game type in game dict
            for params in games[game_type]:
                # Iterate through all seeds
                for seed in seeds:
                    # Store the plan in the results
                    plan, reward = run_environment(env, game_type, params, seed, parameters)
                    plans[type(env), game_type, params, seed] = plan
                    rewards[type(env), game_type, params, seed] = reward
    return plans, rewards

Set parameters. These are the **final parameters that will be used in the autograder to benchmark your solution**. We have filled in some random values to get you started - you should change these values. 

In [134]:
# export - DO NOT MODIFY THIS CELL

def set_parameters() -> dict[str, int | float]:
    parameters = {}

    ## FILL IN PARAMETERS BELOW
    NUM_EPISODES = 100
    MAX_EPISODE_LENGTH = 15
    LEARNING_RATE = 0.1
    GAMMA = 0.35
    EPSILON = 0.80
    MAX_POLICY_LENGTH = 20
    ## FILL IN PARAMETERS ABOVE
    
    
    parameters["NUM_EPISODES"] = NUM_EPISODES
    parameters["MAX_EPISODE_LENGTH"] = MAX_EPISODE_LENGTH
    parameters["LEARNING_RATE"] = LEARNING_RATE
    parameters["GAMMA"] = GAMMA
    parameters["EPSILON"] = EPSILON
    parameters["MAX_POLICY_LENGTH"] = MAX_POLICY_LENGTH
    return parameters

Run the cell below to execute all tests we have given you. You should aim for a total reward of 1.0 on each environment. 

In [135]:
## CHANGE THIS CELL AT YOUR OWN PERIL - WE HAVE TESTED IT WORKS AS IS
seeds = list(range(5))
environments = [
    TextWorldExpressEnv(envStepLimit=100),
    StochasticTextWorldExpressEnv(envStepLimit=100, stochasticity=0.25),
    PunishmentTextWorldExpressEnv(envStepLimit=100, punishment=-1.0),
]
games = {
    "coin": [
        "numLocations=5,includeDoors=1,numDistractorItems=0",
        "numLocations=6,includeDoors=1,numDistractorItems=0",
        "numLocations=7,includeDoors=1,numDistractorItems=0",
        "numLocations=10,includeDoors=1,numDistractorItems=0",
    ],
    "mapreader": [
        "numLocations=5,maxDistanceApart=3,includeDoors=0,maxDistractorItemsPerLocation=0",
        "numLocations=8,maxDistanceApart=4,includeDoors=0,maxDistractorItemsPerLocation=0",
        "numLocations=11,maxDistanceApart=5,includeDoors=0,maxDistractorItemsPerLocation=0",
        "numLocations=15,maxDistanceApart=8,includeDoors=0,maxDistractorItemsPerLocation=0",
    ],
}
parameters = set_parameters()

plans, rewards = run_all(environments, games, seeds, parameters)

print("All Plans")
print(plans)

print("Rewards for each configuration")
print(rewards)

print("Total Reward")
print(sum(list(rewards.values())))

print("Max Reward")
print(len(list(rewards.values())))

TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=5,includeDoors=1,numDistractorItems=0, 0
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=5,includeDoors=1,numDistractorItems=0, 1
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=5,includeDoors=1,numDistractorItems=0, 2
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=5,includeDoors=1,numDistractorItems=0, 3
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=5,includeDoors=1,numDistractorItems=0, 4
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=6,includeDoors=1,numDistractorItems=0, 0
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=6,includeDoors=1,numDistractorItems=0, 1
TESTING <class 'textworld_express.textworld_express.Tex

Run the cell below with a specific environment to see how your agent performs. Feel free to experiment with different environments and seeds.

In [136]:
## Change the game configuration below (any of the config options in the cell above should work)
environment = TextWorldExpressEnv(envStepLimit=100)
game_type = "coin"
game_params = "numLocations=7,includeDoors=1,numDistractorItems=0"
seed = 4
## Change the game configuration above


parameters = set_parameters()

plan, reward = run_environment(environment, game_type, game_params, seed, parameters)
print("Environment: ", type(environment))
print("Game Type: ", game_type)
print("Game Parameters: ", game_params)
print("Plan: ", plan)
print("Reward: ", reward)

TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=7,includeDoors=1,numDistractorItems=0, 4
Environment:  <class 'textworld_express.textworld_express.TextWorldExpressEnv'>
Game Type:  coin
Game Parameters:  numLocations=7,includeDoors=1,numDistractorItems=0
Plan:  ['move west', 'open door to west', 'move west', 'take coin']
Reward:  1.0


# Grading

Grading will consist of testing all environments (regular, stochastic, punishment), all games (coin, mapreader), with a variety of parameters per game, and different seeds. There will be 240 tests in total. The code we have provided you will run 30 of these tests. The rest are hidden tests, which run the same code but with different seeds and configurations. You have the ability to load ANY configuration from the TextWorld environment, so you are highly encouraged to test your code with a variety of configurations to ensure it succeeds in the hidden tests.

**Grading:**

The maximum grade for this portion of the assignment is **60 points.** Each test (run on one environment configuration that is loaded) is worth 0.25 points. The autograder will run all 30 tests that we have provided you in this notebook, and 210 more tests. We have provided you with all the code you need to test any possible configuration, so you should be able to test your code with a variety of configurations to ensure it works well on the hidden tests.

## Important Details
Grading will be conducted by visual inspection of the code and autograder results. The autograder will display "sanity check" results to help you verify that your code behaves the same in the autograder as it does locally. These tests are a subset of the full autograder, and will test the some of the same configurations that we have provided. It is your responsibility to test your code and verify its correctness, and you should use the provided resources to do so. 

We will also inspect the entire notebook to check if your algorithm implementations include details that are inconsistent with the assignment (e.g., hard-coding values or actions to pass tests) and to make sure no cells were altered to provide unearned grading results. Doing so will result in a grade of 0 for the entire assignment, and may be reported to the Office of Student Integrity.

Your submissions are also subject to plagiarism checks - as a reminder, all code must be written by yourself, and no one else (classmates, excessive internet resources, LLMs, etc.). You are permitted to use course resources to help you complete the assignment. Any violations of this will receive a 0 and may be reported to the Office of Student Integrity for further investigation. 

# Submission Instructions

Upload this notebook with the name `submission.ipynb` file to Gradescope. The autograder will **only** run successfully if your file is named this way. You must ensure that you have removed all print statements from **your** code, or the autograder may fail to run. Excessive print statements will also result in muddled test case outputs, which makes it more difficult to interpret your score. 

We've added appropriate comments to the top of certain cells for the autograder to export (`# export`). You do NOT have to do anything (e.g. remove print statements) to cells we have provided - anything related to those have been handled for you. You are responsible for ensuring your own code has no syntax errors or unnecessary print statements. You ***CANNOT*** modify the export comments at the top of the cells, or the autograder will fail to run on your submission.

You should ***not*** add any cells that your code requires to the notebook when submitting. You're welcome to add any code as you need to extra cells when testing, but they will not be graded. Only the provided cells will be graded. As mentioned in the top of the notebook, **any helper functions that you add should be nested within the function that uses them.**

If you encounter any issues with the autograder, please feel free to make a post on Ed Discussion. We highly recommend making a public post to clarify any questions, as it's likely that other students have the same questions as you! If you have a question that needs to be private, please make a private post.