# Instructions

*Text Adventure Games* are games in which the player interacts with a rich world only through text. Text adventure games predate computers with graphics. However, in many ways they are more complex than conventional video games because they can involve complicated interactions (e.g., "build a rope bridge") that require a fair amount of imagination. Indeed, text adventure games are used as [research testbeds](https://arxiv.org/abs/1909.05398) for natural language processing agents.

The canonical text adventure game is [Zork](https://en.wikipedia.org/wiki/Zork), in which the player discovers an abandoned underworld realm full of treasure. Playable versions are available online.

A text game is made up of individual locations--also called "rooms", though they need not be indoor enclosed spaces as the term might imply. The agent can move between rooms and interact with objects by typing in short commands like "move north" and "take lamp".

In this assignment, we will use a special package that implements text worlds for testing agents: [TextWorld-Express](https://github.com/cognitiveailab/TextWorldExpress). TextWorld-Express simplifies text worlds in a few ways: it uses a reduced set of text commands, and rooms are laid out in a grid.
TextWorld-Express also implements a few different game objectives, such as cooking, and searching for coins.
TextWorld-Express generates world configurations, so we will need to implement algorithms that are able to complete different game objectives in different world configurations.

In this assignment, our agents will play two different games:
- Coin Game: a game in which the agent must search for and pick up a single coin.
- Map Reader: a game in which the agent must find a coin and return it to a box at the starting location.

**We will be implementing the tabular Q-learning algorithm** (as opposed to neural Q-learning).

You are prohibited from using any pre-existing Python package with built-in graph functions such as Dijkstra's algorithm or Prim's algorithm (e.g., networkx and SciPy). You are prohibited from using any Python package that does not come with Python by default, except textworld-express, graphviz, and pydot, which are loaded as part of this notebook.

**Notes:**
- If you interrupt execution of a cell running the game engine, you may put TextWorld-Express in an unrecoverable state. If this happens, you will need to reset your kernel/runtime.
- In the Map Reader game, you must use a single search loop (you cannot run a search to the coin and then a separate search to the box). You cannot write specialized code for handling the Map Reader game. You cannot memorize the path to the coin and then reverse it.
- In the Map Reader game, you cannot use the map information.
- You cannot filter any actions. We've already filtered out the actions that we don't want your agent to have to consider. For example, the "take map" action is never helpful, but you must explore it.
- ***DO NOT REMOVE ANY COMMENTS THAT HAVE `# EXPORT` IN THEM. THE GRADING SCRIPT USES THESE COMMENTS TO EVALUATE YOUR FUNCTIONS. WE WILL NOT AUDIT SUBMISSIONS TO ADD THESE. IF THE AUTOGRADER FAILS TO RUN DUE TO YOUR MODIFICATION OF THESE COMMENTS, YOU WILL NOT RECEIVE CREDIT.***

# Install

Install the TextWorld-Express engine, and graphviz and pydot for visualization.

In [184]:
# %pip install gymnasium
# %pip install textworld-express
# %pip install graphviz
# %pip install pydot

# Imports

In [185]:
# export
from textworld_express import TextWorldExpressEnv
import gymnasium
import graphviz
import pydot
from IPython.display import Image
from IPython.display import display
import matplotlib.pyplot as plt
from collections import namedtuple
import re
import os
import copy
import json
import math
import random
import networkx as nx
import numpy as np
from itertools import combinations
from collections import defaultdict

# Load a Game

Set the random seed for repeatability.

In [186]:
# export
SEED = 3

Initialize the game environment. `ENV` is a global that encapsulates the environment.

In [187]:
ENV = TextWorldExpressEnv(envStepLimit=100)

Set the game generator to generate a particular game (coin game or map reader).

In [188]:
GAME_TYPE = "coin"
GAME_PARAMS = "numLocations=5,includeDoors=1,numDistractorItems=0"
ENV.load(gameName=GAME_TYPE, gameParams=GAME_PARAMS)

# TextWorld API Primer

This section gives the basics of the TextWorld API.

**Reset the game engine.** `ENV.reset()` provides an observation, the text of the current local world, and a data structure called `infos`, which contains a variety of additional information about the current local world.

In [189]:
obs, infos = ENV.reset(seed=SEED, gameFold="train", generateGoldPath=True)
print(obs)
print(infos)

You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see a dishwasher that is closed. In one part of the room you see a dining chair, that has nothing on it. 
To the South you see a closed sliding patio door. To the West you see a closed frosted-glass door. 
{'observation': 'You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see a dis

**Valid actions.** The actions that an agent can perform are part of the `infos` dictionary.

In [190]:
infos['validActions']

['look around',
 'close door to west',
 'move west',
 'open door to south',
 'open door to west',
 'inventory',
 'move south',
 'close door to south']

**Execute an action.** Actions are executed using `ENV.step()`, which returns the new observation for the agent's new state, a reward value, a boolean indicating if the agent has reached the end of the game, and the `infos` for the current state. Here is the code to choose a random valid action and execute it.

In [191]:
# Pick a random action
random_action = random.choice(infos['validActions'])
print("action:", random_action)
# Execute the action
obs, reward, done, infos = ENV.step(random_action)
print(obs)
print("reward:", reward)
print("done?", done)
print("infos:", infos)

action: look around
You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see a dishwasher that is closed. In one part of the room you see a dining chair, that has nothing on it. 
To the South you see a closed sliding patio door. To the West you see a closed frosted-glass door. 
reward: 0.0
done? False
infos: {'observation': 'You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that i

The environment knows the "gold path", which is a solution sequence of actions. It is not guaranteed to be optimal. *You are prohibited from calling this function as part of your code solutions to the homework problems.*

In [192]:
ENV.getGoldActionSequence()

['look around',
 'open door to south',
 'open door to west',
 'move south',
 'open door to east',
 'move north',
 'move south',
 'move east',
 'take coin']

# MDP Helper Functions

**Observation Parsing Functions**

- `parse_inventory()` attempts to pull the inventory line out of an observation.
- `obs_location()` attempts to pull the name of the location of the agent out of the observations.
- `hash_state()` converts a state to a hash code, returned as a string of digits.
- `parse_things()` attempts to pull out all the objects in an observation.
- `parse_doors()` attempts to pull out information about all the doors in an observation. It returns a list of tuples containing `"name_of_door (direction)"` and whether it is `'open'` or `'closed'`.

In [193]:
# export
### Pull the inventory items out of the observation text that includes
### the inventory (from obs_with_inventory())
def parse_inventory(obs):
  m = re.search(r'Inventory[a-zA-Z0-9 \(\)]*:\s*([a-zA-Z0-9 \.\n]+)', obs)
  if m is not None:
    if 'empty' not in m.group(1):
      return m.group(1).replace('\n', '')
  return 'empty'

### Pull the location out of the observation text.
def obs_location(obs):
  first_sentence = obs.split('.')[0].split(' ')
  start = first_sentence.index('the') + 1
  return ' '.join(first_sentence[start:])
  # return obs.split('.')[0].split(' ')[-1]

def hash_state(state):
  return str(abs(hash(json.dumps(state))))

def parse_things(obs):
  things1 = re.findall(r'[yY]ou \w*\s*see [aA]? ([a-zA-Z0-9\- ]+)\,? that ([a-zA-Z0-9\-, ]+).', obs)
  things2 = re.findall(r'[tT]here is \w*\s*([a-zA-Z0-9\- ]+)\,? that ([a-zA-Z0-9\-, ]+).', obs)
  things3 = re.findall(r'[tT]here is \w*\s*([a-zA-Z0-9\- ]+)\.', obs)
  things3 = list(filter(lambda s:'that' not in s, things3))
  things4 = re.findall(r'[yY]ou \w*\s*see a ([a-zA-Z0-9\- ]+)\.', obs)
  things4 = list(filter(lambda s:'door' not in s and 'that' not in s, things4))
  return list(map(lambda x: list(x) if type(x) is tuple else x,
                  things1 + things2 + things3 + things4))

def parse_doors(obs):
  sentences = obs.split('.')
  doors = []
  dirs = re.compile('west|east|south|north')
  for sentence in sentences:
    m_open = re.search(r'open ([a-z\- ]*door)', sentence)
    m_closed = re.search(r'closed ([a-z\- ]*door)', sentence)
    direction = dirs.search(sentence.lower())
    if direction is not None:
      if m_open is not None:
        doors.append((m_open[1] + ' (' + direction[0] + ')', 'open'))
      elif m_closed is not None:
        doors.append((m_closed[1] + ' (' + direction[0] + ')', 'closed'))
  return doors
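
As a quick check of how the door parsing behaves, the following sketch applies the same two patterns used in `parse_doors()` to a single sentence taken from the sample kitchen observation. The variable names are illustrative only and are not part of the assignment code.

```python
import re

# One sentence from the sample kitchen observation shown earlier.
sentence = "To the South you see a closed sliding patio door"

# Same patterns as parse_doors(): the door name follows "closed",
# and the direction is the first compass word in the lowercased sentence.
m_closed = re.search(r'closed ([a-z\- ]*door)', sentence)
direction = re.search('west|east|south|north', sentence.lower())

door = (m_closed[1] + ' (' + direction[0] + ')', 'closed')
print(door)  # ('sliding patio door (south)', 'closed')
```

The direction is appended to the door name so that two doors of the same type in one room remain distinguishable.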

**Environment Interaction Functions**

These functions allow the agent to interact with the environment in a slightly more friendly way than the default `env.reset()` and `env.step()` functions. They wrap those functions and do some processing on the data to bundle it in a way that will be easier to work with.

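`reset_mdp()` takes an environment (e.g., `ENV`) and returns a dictionary containing the observation, inventory, and valid actions, plus a `'modified_obs'` entry that combines the observation and inventory via `obs_with_inventory()`. For example (with the `'modified_obs'` entry omitted):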
```
{'observation': 'You are in the kitchen...',
 'inventory': 'Inventory (maximum capacity is 2 items):...',
 'valid actions': ['move north', 'move south', 'take coin'...]
}
```

`do_action_mdp()` takes the name of an action and a pointer to the environment. It returns the state that results from executing the action. It returns 4 values:
- observation: a string
- reward: a floating point number
- termination: a boolean indicating whether the episode has ended
- infos: a dictionary containing observation, inventory, and valid actions (as above).

**Use these functions instead of `env.reset()` and `env.step()`.**

In [194]:
# export
def reset_mdp(env):
  obs, infos = env.reset(seed=SEED, gameFold="train", generateGoldPath=True)
  valids = infos['validActions']
  valids.remove('inventory')
  valids.remove('look around')
  inv = infos['inventory']
  modified_obs = obs_with_inventory(infos['look'], inv)
  # return make_state_mdp(infos['look'], parse_inventory(infos['inventory'])), valids
  return {'observation': infos['look'],
          'inventory': infos['inventory'],
          'valid actions': valids,
          'modified_obs': modified_obs}


def do_action_mdp(action, env):
  obs, reward, done, infos = env.step(action)
  #obs_look, reward_look, done_look, infos_look = env.step('look around')
  valid_actions = infos['validActions']
  valid_actions.remove('inventory')
  valid_actions.remove('look around')
  # return make_state_mdp(infos['look'], parse_inventory(infos['inventory'])), reward, done, valid_actions
  return infos['look'], reward, done, {'observation': infos['look'],
                                       'inventory': infos['inventory'],
                                       'valid actions': valid_actions}

# Important Notes for this Assignment


*   A successful episode from the MDP will give a reward of 1.0
*   A partially successful episode from an MDP environment will give a reward of 0.5
*   If you increase NUM_EPISODES too high, it will take too long in the autograder.
*   We will be checking for hard coded values / outputs, so please don't take any shortcuts.



# Implement Q-Learning

**Step 1.** Implement the `q_learning()` function. This function takes the following parameters:
- env: a pointer to the environment (`ENV`).
- num_episodes: the number of episodes to run before termination of the entire algorithm.
- threshold: the number of steps in an episode before terminating a single episode.
- learning_rate: a number between 0 and 1 controlling how fast the policy is allowed to change.
- gamma: the discount factor in the Bellman equation (between 0 and 1), which controls how much future rewards count relative to immediate ones.
- epsilon: (optional) if epsilon-greedy is implemented, this number (0..1) is the probability of choosing a random action instead of the greedy, policy-guided one. A value of 1.0 indicates purely random action selection, and a value of 0.0 indicates purely greedy selection.
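
For reference, the standard tabular Q-learning update that these parameters feed into can be written as follows, with $\alpha$ standing for `learning_rate` and $\gamma$ for `gamma`. The symbols $s$, $a$, $r$, and $s'$ (state, action, reward, next state) are notation introduced here, not names from the assignment.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

When $s'$ is terminal or has never been visited, the $\max$ term is usually taken to be 0.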

The `q_learning()` algorithm should return a single value: the policy. The policy will be a dictionary-of-dictionaries where the outermost dictionary has a key for each state visited. Each state points to a separate inner dictionary where the keys are actions and the values are q-values. For example:
```
{state1: {'move north': 0.1,
          'move south': 0.0,
          'move east': 0.8,
          'move west': 0.4},
 state2: {'move north': 0.01,
          'take coin': 1.0,
          'move east': 0.05,
          'move west': 0.2},
 ...
}
```

You will interact with the environment through `reset_mdp()` and `do_action_mdp()`. You can choose your own representation for states, though it must be hashable (string, number, tuple, etc.).

We recommend you monitor your algorithm's performance by tracking the total reward of each episode and the number of steps in each episode (fewer is better). If you use purely random action selection, you will see a lot of variance in your total episode reward. If you implement epsilon-greedy, you will see a trend toward more consistent achievement of maximum reward as the episode number increases.
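
Epsilon-greedy selection can be sketched in isolation as below. This is a minimal illustration, not the required implementation: the function name `epsilon_greedy`, its arguments, and the toy Q-values are all assumptions made for the example.

```python
import random

def epsilon_greedy(q_values, valid_actions, epsilon, rng=random):
    """Pick a random valid action with probability epsilon, else a greedy one.

    q_values: dict mapping action -> Q-value for the current state
              (missing actions are treated as 0.0).
    """
    if rng.random() < epsilon:
        return rng.choice(valid_actions)
    best = max(q_values.get(a, 0.0) for a in valid_actions)
    # Break ties randomly so the agent does not always favor the first action.
    return rng.choice([a for a in valid_actions if q_values.get(a, 0.0) == best])

# Toy Q-values for one state (illustrative numbers, not from a real run).
q = {'move north': 0.1, 'move east': 0.8, 'move west': 0.4}
actions = ['move north', 'move east', 'move west']
print(epsilon_greedy(q, actions, epsilon=0.0))  # move east
```

With `epsilon=0.0` the random branch is never taken, so the call always returns the highest-valued action. With `epsilon=1.0` every call returns a uniformly random valid action.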

In [195]:
# export
# please print out obs and inventory and find a way to combine them together
# this is much simpler to implement than it seems (one line of code)
def obs_with_inventory(obs, inv):
  ### YOUR CODE BELOW HERE
  return obs + inv
  ### YOUR CODE ABOVE HERE

def q_learning(env, num_episodes, threshold, learning_rate, gamma, epsilon = 1.0):
  # Set up q-table
  q_table = defaultdict(lambda: defaultdict(float))
  
  for episode in range(num_episodes):
    reset_result = reset_mdp(env)
    obs = reset_result["modified_obs"]
    actions = reset_result["valid actions"]
    for step in range(threshold):
      # initialize state in q-table
      for action in actions:
        if action not in q_table[obs]:
          q_table[obs][action] = 0
      
      if np.random.rand() < epsilon:
        new_action = random.choice(actions)
      else:
        # Greedy choice: compute the best Q-value once, then break ties randomly
        best_value = max(q_table[obs].values())
        best_actions = [action for action, value in q_table[obs].items() if value == best_value]
        new_action = random.choice(best_actions)
      new_obs, reward, done, info = do_action_mdp(new_action, env)
      
      new_obs = obs_with_inventory(info['observation'], info['inventory'])
      actions = info['valid actions']
      
      update = learning_rate*(reward + gamma*max(q_table[new_obs].values(), default=0) - q_table[obs][new_action])
      q_table[obs][new_action] = q_table[obs][new_action] + update
      
      obs = new_obs
      
      if done:
        break

  return q_table

**Step 2.** Set the parameters for your q-learning algorithm. You can change these values.

**These might need to be changed from the default values. These variables are just for the simple test below. The autograder will use the variables you set in `set_parameters` below.**

In [196]:
NUM_EPISODES = 100
THRESHOLD = 25
LEARNING_RATE = 0.5
GAMMA = 0.5
EPSILON = 0.1
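
To build intuition for `GAMMA`, the sketch below lists how much a final reward of 1.0 is worth when it is still `k` steps away, assuming no intermediate rewards and a deterministic path to the goal. With `GAMMA = 0.5` the value halves with every extra step, which is one reason Q-values far from the goal are small.

```python
GAMMA = 0.5

# Discounted value of a reward of 1.0 received k steps in the future.
discounted = [GAMMA ** k for k in range(6)]
print(discounted)  # [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
```

A larger `GAMMA` keeps distant rewards visible for longer, which may matter in worlds with more locations.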

Run the q-learning algorithm.

In [197]:
q_table = q_learning(ENV,
                     num_episodes = NUM_EPISODES,
                     threshold = THRESHOLD,
                     learning_rate = LEARNING_RATE,
                     gamma = GAMMA,
                     epsilon = EPSILON)
q_table

defaultdict(<function __main__.q_learning.<locals>.<lambda>()>,
            {'You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see a dishwasher that is closed. In one part of the room you see a dining chair, that has nothing on it. \nTo the South you see a closed sliding patio door. To the West you see a closed frosted-glass door. Inventory (maximum capacity is 2 items): \n  Your inventory is currently empty.\n': defaultdict(float,
                         {'close door to west': 0.006155967712402344,
                          'move west': 0.0,
                          'open door to south': 0.031249999999999327,
                          'open door to west'

# Implement Code to Run a Policy

**Step 3.** Implement code to run the policy. This function takes the following parameters:
- q_table: your q-table, as specified in step 1.
- env: pointer to the environment (e.g., `ENV`).
- threshold: the maximum number of steps to take before terminating.

Your function should run a single episode from the initial state and return:
- A list of actions taken during the episode (e.g., `[act_1, act_2, ... act_n]`).
- The total sum reward of all actions taken as a float.

Your function will interact with the environment through `reset_mdp()` and `do_action_mdp()`. Be sure to reset the environment before running, and terminate the episode if `do_action_mdp()` indicates the episode has terminated.

**IMPORTANT:** It is possible that your agent encounters states in `run_policy()` that it never saw while building the Q-table. Please handle this by choosing a random action from the available actions.
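
One way to handle unseen states is sketched below. The helper name `choose_action` and the toy table are hypothetical. The sketch assumes the Q-table is a dictionary-of-dictionaries as described in step 1, and it uses `.get()` so that looking up a missing state does not add a new entry to the table.

```python
import random

def choose_action(q_table, state, valid_actions):
    """Greedy action for `state`, or a random valid action if the state is unseen."""
    q_values = q_table.get(state)
    if not q_values:
        # State never visited during training: fall back to a random action.
        return random.choice(valid_actions)
    best = max(q_values.values())
    return random.choice([a for a, v in q_values.items() if v == best])

# Toy table with a single known state (hypothetical keys and values).
toy_q = {'s0': {'move north': 0.0, 'take coin': 1.0}}
print(choose_action(toy_q, 's0', ['move north', 'take coin']))  # take coin
```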

In [198]:
# export
def run_policy(q_table, env, threshold = 25):
  actions = [] # Store the entire sequence of actions here
  total_reward = 0.0 # Store the total sum reward of all actions executed here
  reset_result = reset_mdp(env)
  obs = reset_result["modified_obs"]
  curr_actions = reset_result["valid actions"]
  
  for step in range(threshold):
    # initialize state in q-table
    for action in curr_actions:
      if action not in q_table[obs]:
        q_table[obs][action] = 0
    
    # Greedy choice: compute the best Q-value once, then break ties randomly
    best_value = max(q_table[obs].values())
    best_actions = [action for action, value in q_table[obs].items() if value == best_value]
    new_action = random.choice(best_actions)
    actions.append(new_action)
    obs, reward, done, info = do_action_mdp(new_action, env)
    total_reward+=reward
    obs = obs_with_inventory(info['observation'], info['inventory'])
    curr_actions = info['valid actions']
        
    if done:
      break
  return actions, total_reward

**Step 4.** Set the threshold value for episode length during policy execution (test time threshold).

In [199]:
# export
TEST_THRESHOLD = 25

**Step 5.** Run the policy.

In [200]:
plan, total_reward = run_policy(q_table, ENV, threshold = TEST_THRESHOLD)
print("plan:", plan)
print("Total reward:", total_reward)

plan: ['open door to south', 'move south', 'close door to north', 'open door to east', 'move east', 'take coin']
Total reward: 1.0


# New Environment: Stochastic Actions

The following creates a new type of environment called `StochasticTextWorldExpressEnv`. This environment is the same as the previous environment type, except that some percentage of the time, the action that the agent chooses is not executed and a randomly chosen action is executed instead.

When the environment is created, the level of action randomness (`stochasticity`, between 0 and 1) is set, where 0.0 means no randomness and 1.0 means that actions are executed purely randomly.

Otherwise, this environment works the same as previously.

If you set the global variable `ENV_VERBOSE = True` then the environment will print the action that is executed, regardless of whether it is the one selected or a random action. This is for debugging purposes only.

**NOTE:** The agent is never able to know whether the action it chose was executed or if a different action was executed.  

In [201]:
NEVER_PICK_ACTIONS = set(['look around', 'inventory'])
ENV_VERBOSE = False

In [202]:
class StochasticTextWorldExpressEnv(TextWorldExpressEnv):

  def __init__(self, serverPath=None, envStepLimit=100, stochasticity = 0.0):
    # Call the super constructor
    super().__init__(serverPath, envStepLimit)
    # Store the valid actions and stochasticity
    self.valid_actions = []
    self.stochasticity = stochasticity

  def reset(self, seed=None, gameFold=None, gameName=None, gameParams=None, generateGoldPath=False):
    # Call the super method
    observation, infos = super().reset(seed, gameFold, gameName, gameParams, generateGoldPath)
    # Update the valid actions
    self.valid_actions = infos['validActions']
    return observation, infos

  def step(self, action:str):
    # If a random value is less than the stochasticity target, choose a random action
    if random.random() < self.stochasticity:
      # Keep the valid actions in their original order, excluding inventory and look around
      temp_valids = [a for a in self.valid_actions if a not in NEVER_PICK_ACTIONS]
      # Pick a random action from whatever remains
      action = random.choice(temp_valids)
    # If debugging flag is on, print the action that will be executed
    if ENV_VERBOSE:
      print("[[action]]:", action)
    # Call the super class with either the action passed in or the randomly chosen one
    observation, reward, isCompleted, infos = super().step(action)
    # Update the valid actions
    self.valid_actions = infos['validActions']
    return observation, reward, isCompleted, infos
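
One consequence of this design is that the randomly substituted action can coincide with the action the agent chose. The sketch below estimates how often the chosen action actually runs, assuming the chosen action is one of the candidates and that the replacement is drawn uniformly. The function name is made up for illustration.

```python
def prob_chosen_action_runs(stochasticity, num_candidates):
    # With probability (1 - stochasticity) the chosen action runs directly;
    # otherwise a uniformly random candidate runs, which matches the chosen
    # action with probability 1 / num_candidates.
    return (1.0 - stochasticity) + stochasticity / num_candidates

# Starting kitchen: 8 valid actions, minus 'look around' and 'inventory'.
print(prob_chosen_action_runs(0.25, 6))
```

With a stochasticity of 0.25 and six candidate actions, the chosen action runs a little under 80% of the time, slightly more than the 75% one might first expect.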

New environments must be registered through the Gymnasium API.

In [203]:
gymnasium.register(id='TextWorldExpress-StochasticTextWorldExpressEnv-v0',
                   entry_point='__main__:StochasticTextWorldExpressEnv')


Create the new environment type.

In [204]:
SENV = StochasticTextWorldExpressEnv(envStepLimit=100, stochasticity=0.25)

Create a game with this environment type (same as before).

In [205]:
GAME_TYPE = "coin"
GAME_PARAMS = "numLocations=5,includeDoors=1,numDistractorItems=0"
SENV.load(gameName=GAME_TYPE, gameParams=GAME_PARAMS)

Reset the environment (same as before).

In [206]:
obs, infos = SENV.reset(seed=SEED, gameFold="train", generateGoldPath=True)
print(obs)
print(infos)

You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see a dishwasher that is closed. In one part of the room you see a dining chair, that has nothing on it. 
To the South you see a closed sliding patio door. To the West you see a closed frosted-glass door. 
{'observation': 'You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see a dis

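Execute a step. If `ENV_VERBOSE=True` then the action actually executed will be printed. In the sample output below, the observation ("You can't move there, the door is closed.") indicates that a random move action was executed in place of the requested `open door to south`.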

In [207]:
obs, reward, done, infos = SENV.step('open door to south')
print(obs)
print(reward)
print(done)
print(infos)

You can't move there, the door is closed. 
0.0
False
{'observation': "You can't move there, the door is closed. ", 'look': 'You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see a dishwasher that is closed. In one part of the room you see a dining chair, that has nothing on it. \nTo the South you see a closed sliding patio door. To the West you see a closed frosted-glass door. ', 'inventory': 'Inventory (maximum capacity is 2 items): \n  Your inventory is currently empty.\n', 'validActions': ['close door to west', 'move west', 'open door to south', 'inventory', 'close door to south', 'move south', 'open door to west', 'look around'], 'scoreRaw': 0.0, 'score'

Train in the stochastic Text World environment.

In [208]:
q_table = q_learning(SENV,
                     num_episodes = NUM_EPISODES,
                     threshold = THRESHOLD,
                     learning_rate = LEARNING_RATE,
                     gamma = GAMMA,
                     epsilon = EPSILON)

Test the policy.

In [209]:
plan, total_reward = run_policy(q_table, SENV, threshold = TEST_THRESHOLD)
print("plan:", plan)
print("total reward:", total_reward)

plan: ['open door to south', 'move south', 'open door to east', 'move east', 'take coin', 'take coin', 'move east', 'take coin']
total reward: 1.0


# New Environment: Negative Reward

The following creates a new type of environment called `PunishmentTextWorldExpressEnv`. This environment is the same as the previous environment type, except that the agent receives a negative reward when it performs actions that are illegal or do not change the world state, for example, trying to close a door that is already closed or to move in a direction that is blocked.

Otherwise, this environment works the same as previously.

In [210]:
class PunishmentTextWorldExpressEnv(TextWorldExpressEnv):

  def __init__(self, serverPath=None, envStepLimit=100, punishment = 0.0):
    # Call the super constructor
    super().__init__(serverPath, envStepLimit)
    # Store the punishment
    self.punishment = punishment
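    # Store the previous observation
    self.previous_observation = None

  def reset(self, seed=None, gameFold=None, gameName=None, gameParams=None, generateGoldPath=False):
    # Call the super method
    observation, infos = super().reset(seed, gameFold, gameName, gameParams, generateGoldPath)
    # Remember the initial look so that a no-op first action is punished and
    # nothing carries over from the previous episode
    self.previous_observation = infos['look']
    return observation, infos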

  def step(self, action:str):
    # Call the super method
    observation, reward, isCompleted, infos = super().step(action)
    # If the current look is the same as the previous look, then we have performed an illegal action
    if infos['look'] == self.previous_observation:
      reward = self.punishment
    # Store the previous observation
    self.previous_observation = infos['look']
    return observation, reward, isCompleted, infos
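
The reward rule used in `step()` can be isolated as a small pure function. The name `shaped_reward` and the short placeholder strings standing in for room descriptions are illustrative assumptions. Note that the punishment replaces the environment's reward rather than being added to it.

```python
def shaped_reward(reward, look, previous_look, punishment):
    """Replace the reward with `punishment` when the room description did not change."""
    return punishment if look == previous_look else reward

# Closing an already-closed door leaves the description unchanged.
print(shaped_reward(0.0, 'kitchen, door closed', 'kitchen, door closed', -1.0))  # -1.0
# Opening the door changes the description, so the original reward is kept.
print(shaped_reward(0.0, 'kitchen, door open', 'kitchen, door closed', -1.0))    # 0.0
```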

New environments must be registered through the Gymnasium API.

In [211]:
gymnasium.register(id='TextWorldExpress-PunishmentTextWorldExpressEnv-v0',
                   entry_point='__main__:PunishmentTextWorldExpressEnv')


Create the new environment type.

In [212]:
PENV = PunishmentTextWorldExpressEnv(envStepLimit=100, punishment=-1.0)

Create a game with this environment type (same as before).

In [213]:
GAME_TYPE = "coin"
GAME_PARAMS = "numLocations=5,includeDoors=1,numDistractorItems=0"
PENV.load(gameName=GAME_TYPE, gameParams=GAME_PARAMS)

Reset the environment (same as before).

In [214]:
obs, infos = PENV.reset(seed=SEED, gameFold="train", generateGoldPath=True)
print(obs)
print(infos)

You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see a dishwasher that is closed. In one part of the room you see a dining chair, that has nothing on it. 
To the South you see a closed sliding patio door. To the West you see a closed frosted-glass door. 
{'observation': 'You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see a dis

Execute a step.

In [215]:
obs, reward, done, infos = PENV.step('open door to south')
print(obs)
print(reward)
print(done)
print(infos)

You open the sliding patio door, revealing the backyard. 
0.0
False
{'observation': 'You open the sliding patio door, revealing the backyard. ', 'look': 'You are in the kitchen. In one part of the room you see a stove. There is also an oven. You also see a fridge that is closed. In another part of the room you see a counter, that has nothing on it. In one part of the room you see a kitchen cupboard that is closed. There is also a cutlery drawer that is closed. You also see a trash can that is closed. In another part of the room you see a dishwasher that is closed. In one part of the room you see a dining chair, that has nothing on it. \nThrough an open sliding patio door, to the South you see the backyard. To the West you see a closed frosted-glass door. ', 'inventory': 'Inventory (maximum capacity is 2 items): \n  Your inventory is currently empty.\n', 'validActions': ['close door to west', 'move west', 'open door to south', 'inventory', 'close door to south', 'move south', 'open door

Train in the punishment Text World environment.

In [216]:
q_table = q_learning(PENV,
                     num_episodes = NUM_EPISODES,
                     threshold = THRESHOLD,
                     learning_rate = LEARNING_RATE,
                     gamma = GAMMA,
                     epsilon = EPSILON)

Test the policy.

In [217]:
plan, total_reward = run_policy(q_table, PENV, threshold = TEST_THRESHOLD)
print("plan:", plan)
print("total reward:", total_reward)

plan: ['open door to south', 'move south', 'open door to east', 'move east', 'take coin']
total reward: 1.0


# Testing Suite

This function will run all environments, all game types, all game parameters, and all seeds.

In [218]:
def run_all(environments, games, seeds):
  global ENV, GAME_TYPE, GAME_PARAMS, SEED
  # Results will contain a key (env type, game type, game params, seed) and values will be plans and total_rewards
  results = {}
  # Iterate through all environments given
  for env in environments:
    # set global environment
    ENV = env
    # Iterate through all game types, the keys of the games dict
    for game_type in games:
      # Set the global game type
      GAME_TYPE = game_type
      # Iterate through all game parameters for the given game type in game dict
      for params in games[game_type]:
        # set the global game params
        GAME_PARAMS = params
        # load the environment
        ENV.load(gameName=GAME_TYPE, gameParams=GAME_PARAMS)
        # Iterate through all seeds
        for seed in seeds:
          print(f"TESTING {type(ENV)}, {GAME_TYPE}, {GAME_PARAMS}, {seed}")
          # set the global seed
          SEED = seed
          # Run the q learner and get the policy
          q_table = q_learning(ENV,
                               num_episodes = NUM_EPISODES,
                               threshold = THRESHOLD,
                               learning_rate = LEARNING_RATE,
                               gamma = GAMMA,
                               epsilon = EPSILON)
          # run the policy to get the plan
          plan, total_reward = run_policy(q_table, ENV, threshold = TEST_THRESHOLD)
          # Store the plan in the results
          results[(type(ENV), GAME_TYPE, GAME_PARAMS, SEED)] = (plan, total_reward)
  return results

In [219]:
seeds = list(range(5))
environments = [TextWorldExpressEnv(envStepLimit=100),
                StochasticTextWorldExpressEnv(envStepLimit=100, stochasticity=0.25),
                PunishmentTextWorldExpressEnv(envStepLimit=100, punishment=-1.0)]
games = {'coin':      ['numLocations=5,includeDoors=1,numDistractorItems=0',
                       'numLocations=6,includeDoors=1,numDistractorItems=0',
                       'numLocations=7,includeDoors=1,numDistractorItems=0',
                       'numLocations=10,includeDoors=1,numDistractorItems=0'],
         'mapreader': ['numLocations=5,maxDistanceApart=3,includeDoors=0,maxDistractorItemsPerLocation=0',
                       'numLocations=8,maxDistanceApart=4,includeDoors=0,maxDistractorItemsPerLocation=0',
                       'numLocations=11,maxDistanceApart=5,includeDoors=0,maxDistractorItemsPerLocation=0',
                       'numLocations=15,maxDistanceApart=8,includeDoors=0,maxDistractorItemsPerLocation=0']}

Set parameters. Do not alter this cell other than changing the numeric values.

**You might need to change these parameters to get a good result on the harder environments**

Please note that increasing `NUM_EPISODES` will increase the time it takes to run the cell below. Your code MUST finish running within the time limit (20 min) set by Gradescope.

In [220]:
# export
def set_parameters():
    global NUM_EPISODES, THRESHOLD, LEARNING_RATE, GAMMA, EPSILON, TEST_THRESHOLD
    NUM_EPISODES = 200
    THRESHOLD = 50
    LEARNING_RATE = 0.5
    GAMMA = 0.75
    EPSILON = 0.1
    TEST_THRESHOLD = 50

    return {
      'NUM_EPISODES': NUM_EPISODES,
      'THRESHOLD': THRESHOLD,
      'LEARNING_RATE': LEARNING_RATE,
      'GAMMA': GAMMA,
      'EPSILON': EPSILON,
      'TEST_THRESHOLD': TEST_THRESHOLD
    }

In [221]:
set_parameters()

{'NUM_EPISODES': 200,
 'THRESHOLD': 50,
 'LEARNING_RATE': 0.5,
 'GAMMA': 0.75,
 'EPSILON': 0.1,
 'TEST_THRESHOLD': 50}
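
To get a feel for why these values may need tuning on larger maps, here is a minimal sketch of how quickly a discount factor shrinks a distant reward. It assumes a single reward of 1.0 at the goal and no other rewards, which will not hold exactly in the punishment environment; `discounted_value` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical illustration: the discounted worth of a reward that is
# `steps` discounting steps away, using the default GAMMA from this notebook.
GAMMA = 0.75

def discounted_value(reward, steps, gamma=GAMMA):
    """Return gamma**steps * reward."""
    return (gamma ** steps) * reward

for steps in (1, 5, 10, 20):
    # Print how much of a 1.0 reward survives after `steps` discounts.
    print(steps, round(discounted_value(1.0, steps), 4))
```

With `GAMMA = 0.75`, a reward ten steps away is worth only about 0.056, so values far from the goal can be small and slow to separate from zero. A larger `GAMMA` or more episodes may help on the harder configurations.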

Run all tests

In [222]:
results = run_all(environments, games, seeds)
print(results)

TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=5,includeDoors=1,numDistractorItems=0, 0
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=5,includeDoors=1,numDistractorItems=0, 1
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=5,includeDoors=1,numDistractorItems=0, 2
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=5,includeDoors=1,numDistractorItems=0, 3
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=5,includeDoors=1,numDistractorItems=0, 4
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=6,includeDoors=1,numDistractorItems=0, 0
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=6,includeDoors=1,numDistractorItems=0, 1
TESTING <class 'textworld_express.textworld_express.Tex

TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=6,includeDoors=1,numDistractorItems=0, 3
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=6,includeDoors=1,numDistractorItems=0, 4
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=7,includeDoors=1,numDistractorItems=0, 0
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=7,includeDoors=1,numDistractorItems=0, 1
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=7,includeDoors=1,numDistractorItems=0, 2
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=7,includeDoors=1,numDistractorItems=0, 3
TESTING <class 'textworld_express.textworld_express.TextWorldExpressEnv'>, coin, numLocations=7,includeDoors=1,numDistractorItems=0, 4
TESTING <class 'textworld_express.textworld_express.Tex
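
The printed `results` dictionary is long, so it can help to summarize it. The sketch below averages total reward per (environment, game type) pair. The entries in `example_results` are made-up placeholders that only mimic the key and value shape described in `run_all`, and `mean_reward_by_group` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up placeholder entries with the same shape as `results`:
# (env type, game type, game params, seed) -> (plan, total_reward).
# Strings stand in for the environment classes used as keys in run_all.
example_results = {
    ("RegularEnv", "coin", "numLocations=5", 0): (["move south", "take coin"], 1.0),
    ("RegularEnv", "coin", "numLocations=5", 1): (["move east"], 0.0),
    ("RegularEnv", "mapreader", "numLocations=5", 0): (["take key"], 0.5),
}

def mean_reward_by_group(results):
    """Average total_reward for each (env type, game type) pair."""
    rewards = defaultdict(list)
    for (env_type, game_type, _params, _seed), (_plan, total_reward) in results.items():
        rewards[(env_type, game_type)].append(total_reward)
    return {group: sum(vals) / len(vals) for group, vals in rewards.items()}

summary = mean_reward_by_group(example_results)
for group, mean_reward in summary.items():
    print(group, mean_reward)
```

Calling `mean_reward_by_group(results)` on the real dictionary should work the same way, since `run_all` stores `type(ENV)` in the first key position.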

# Grading

Grading will consist of testing all three environments (regular, stochastic, punishment), two games per environment (coin, mapreader), four sets of parameters per game, and five seeds. There will be a total of 120 tests: 1 point for each correct plan per algorithm.
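
As a quick check of the count above, the following sketch enumerates every combination. The string labels are placeholders for the actual environment objects and parameter strings.

```python
from itertools import product

# Placeholder labels; the real suite uses environment objects and
# parameter strings, but only the counts matter here.
environment_names = ["regular", "stochastic", "punishment"]
game_names = ["coin", "mapreader"]
param_set_ids = range(4)   # four parameter sets per game
seed_ids = range(5)        # five seeds

tests = list(product(environment_names, game_names, param_set_ids, seed_ids))
print(len(tests))  # 3 * 2 * 4 * 5 = 120
```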

Please note that since these environments can be stochastic, we will give some leeway with the results for plans and reward values. The final grade for Part 1 will be determined by the number of correct plans, as shown in the table below.

**Grading:**

Maximum total points: 50

| # correct plans | Score |
|-----------------|-------|
| >= 110          | 50    |
| 109             | 49    |
| 108             | 48    |
| ...             | ...   |
| 101             | 41    |
| 100             | 40    |
| 99              | 39    |
| ...             | ...   |
| <= 60           | 0     |
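
The visible rows of the table are consistent with a simple linear rule: the score is the number of correct plans minus 60, clipped to the range 0 to 50. The sketch below encodes that reading. It assumes the elided rows follow the same pattern, and `part1_score` is a hypothetical helper, not the autograder's code.

```python
def part1_score(num_correct):
    """Map a count of correct plans (0 to 120) to a Part 1 score (0 to 50).

    Assumption: the rows hidden behind "..." follow the same
    one-point-per-plan pattern as the rows that are shown.
    """
    return max(0, min(50, num_correct - 60))

for n in (120, 110, 109, 100, 99, 60, 30):
    print(n, part1_score(n))
```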


*Note* Grading will be conducted by visual inspection and by the autograder on Gradescope. We will compare your plans and rewards to our rubric and reference implementations. We will add cells to your notebook at grading time to load and test our hidden world configuration files.

*Note* We will visually inspect the entire notebook to check if your algorithm implementations include details that are inconsistent with the assignment (e.g., hard-coding values or actions to pass tests) and to make sure no cells were altered to provide unearned grading results.

# Submission

Upload this notebook to Gradescope with the file name `hw2_part1.ipynb`. The autograder will only run successfully if your file is named this way.

We've added appropriate comments to the top of certain cells for the autograder to export (`# export`). You do NOT have to do anything (e.g. remove print statements) to cells we have provided; anything related to those cells has been handled for you. You are responsible for ensuring your own code has no syntax errors or unnecessary print statements. You ***CANNOT*** modify the export comments at the top of the cells, or the autograder will fail to run on your submission.

You should ***not*** add any cells to the notebook when submitting. You're welcome to add extra cells with any code you need while testing, but you must remove them before submitting.

If you identify an issue with the autograder, please feel free to reach out to us on Piazza, or email bok004@ucsd.edu.