<a href="https://colab.research.google.com/github/AISG-Technology-Team/Diner-Dash-Workshop/blob/master/Challenge_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Diner Dash Challenge**

---

## Objective:

Using Reinforcement Learning (RL) algorithms, maximise the average reward over 100 games (episodes) of Diner Dash.

## Instructions and Expectations:

1. Please use Google Colab for all computing needs (installing dependencies, training and testing models, generating the submission, etc.). This ensures fairness in the competition. You can run multiple notebooks, but please take note of the constraints on GPU usage.

2. You are required to submit this Google Colab notebook as well as action lists (in .json format) for each seeded environment given. Hence, please do not change the code under "Testing of Policies and Verification of Submission" apart from the part marked as the changeable area. For more information about the submission, please refer to the [workshop repo](https://github.com/AISG-Technology-Team/Diner-Dash-Workshop).

3. Please update the group member names as well as the names of the algorithms used in the "Details of Submission" section.

4. We expect to see that the models are learning during training.

5. If you have any questions, please discuss within your groups first. Otherwise, please check whether the issue already exists on the [workshop repo](https://github.com/AISG-Technology-Team/Diner-Dash-Workshop/issues) and raise one if it does not.

## Advice on approach to challenge

1. Spend some time to read up about the various RL algos, especially easily implementable baselines

2. Split the shortlisted algos among the group

3. If necessary, tune the hyperparameters to ensure learning

4. Have fun!

## Important Resources:

1. [Diner Dash repo](https://github.com/AdaCompNUS/diner-dash-simulator)

2. [Workshop repo](https://github.com/AISG-Technology-Team/Diner-Dash-Workshop)

3. [Stable Baselines](https://github.com/hill-a/stable-baselines)

## Things to note:

1. Please change the runtime type to GPU when you need one. In the menu bar, click Runtime > Change runtime type and select GPU in the Hardware accelerator dropdown.

2. If an "Error: A module (diner_dash) was specified for the environment but was not found, make sure the package is installed with `pip install` before calling `gym.make()`" error is raised, please restart the runtime and rerun the installation of the diner dash simulator.

3. Please ensure a strong internet connection throughout this challenge to avoid disconnecting from the Colab GPUs.

4. Do not leave your computer idle, as Colab automatically disconnects GPUs if the idle time is too long.

5. GPUs run on CUDA 10.1.

For other FAQs, refer to this [link](https://research.google.com/colaboratory/faq.html).
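
To confirm that a GPU runtime is actually attached before starting a long training run, a quick check along the following lines can help. This is a minimal sketch that assumes the `nvidia-smi` tool is available on the PATH of GPU runtimes; `gpu_visible` is a hypothetical helper name, not part of Colab or this workshop.

```python
import shutil
import subprocess

def gpu_visible():
    """Return True if `nvidia-smi` exists and lists devices without error."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not installed: most likely a CPU-only runtime
        return False
    try:
        result = subprocess.run(
            [exe, "-L"],               # -L lists the attached GPUs
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

print("GPU visible:", gpu_visible())
```

If this prints `False` on a runtime that should have a GPU, re-check the runtime type as described in note 1.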

---

# Details of Submission [Please Edit]

### Names of Group Members:
John, Mary, Bryan

### Names of Algorithms Used:
Random Agent, PPO


# Python Version

In [None]:
!python --version

Python 3.6.9


# Mounting Google Drive

To store trained models

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Create Project Directory

In [None]:
from os import path, chdir, getcwd, mkdir

# Choose a project name
projectName = "DinerDashChallenge"

# Project directory is in My Drive
projectDirectory = "/content/drive/My Drive/" + projectName

# Checks if cwd is in content folder
if getcwd() == "/content":
  # Makes project directory if it does not exist
  if not path.isdir(projectDirectory):
    mkdir(projectDirectory)
    print(f"Project {projectName} has been created!")
  # Changes to project directory
  chdir(projectDirectory)

print(f"The current working directory is {getcwd()}")

The current working directory is /content/drive/My Drive/DinerDashChallenge


# Installing Dependencies

Downloading relevant project dependencies

## Dependencies for [diner dash simulator](https://github.com/AdaCompNUS/diner-dash-simulator)

In [None]:
from os import path, getcwd

repoName = "diner-dash-simulator"

# Clones repo if it does not exist
if not path.isdir(repoName):
  !git clone https://github.com/AdaCompNUS/diner-dash-simulator.git
  print(f"Diner Dash repo has been cloned to {getcwd()}")
else:
  print(f"Diner Dash repo is already available at {path.join(getcwd(), repoName)}")

Diner Dash repo is already available at /content/drive/My Drive/DinerDashChallenge/diner-dash-simulator


In [None]:
!pip install -e diner-dash-simulator/DinerDashEnv

Obtaining file:///content/drive/My%20Drive/DinerDashChallenge/diner-dash-simulator/DinerDashEnv
Installing collected packages: diner-dash
  Found existing installation: diner-dash 0.0.1
    Can't uninstall 'diner-dash'. No files were found to uninstall.
  Running setup.py develop for diner-dash
Successfully installed diner-dash


In [None]:
import gym

# Test make environment
def testEnv():
  env = gym.make('diner_dash:DinerDash-v0').unwrapped
  env.flash_sim = False
  env.close()
  return True

if testEnv():
  print("Installation of diner dash simulator is successful!")

Installation of diner dash simulator is successful!


## Dependencies for Policy

In [None]:
!pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [None]:
# Stable Baselines only supports tensorflow 1.x for now
%tensorflow_version 1.x
!pip install stable-baselines[mpi]==2.10.0

TensorFlow 1.x selected.
Collecting stable-baselines[mpi]==2.10.0
  Downloading https://files.pythonhosted.org/packages/e5/fe/db8159d4d79109c6c8942abe77c7ba6b6e008c32ae55870a35e73fa10db3/stable_baselines-2.10.0-py3-none-any.whl (248kB)
Installing collected packages: stable-baselines
  Found existing installation: stable-baselines 2.2.1
    Uninstalling stable-baselines-2.2.1:
      Successfully uninstalled stable-baselines-2.2.1
Successfully installed stable-baselines-2.10.0


# Helper Functions

For easier debugging

In [None]:
def getAction(actionID):
    actionIDtoName = {
        0 : "None",
        1 : "Move to Table 1",
        2 : "Move to Table 2",
        3 : "Move to Table 3",
        4 : "Move to Table 4",
        5 : "Move to Table 5",
        6 : "Move to Table 6",
        7 : "Move to Counter",
        8 : "Pick Food for Table 1",
        9 : "Pick Food for Table 2",
        10 : "Pick Food for Table 3",
        11 : "Pick Food for Table 4",
        12 : "Pick Food for Table 5",
        13 : "Pick Food for Table 6",
        14 : "Move to Food Collection",
        15 : "Pick Table 1 for Group 1",
        16 : "Pick Table 2 for Group 1",
        17 : "Pick Table 3 for Group 1",
        18 : "Pick Table 4 for Group 1",
        19 : "Pick Table 5 for Group 1",
        20 : "Pick Table 6 for Group 1",
        21 : "Pick Table 1 for Group 2",
        22 : "Pick Table 2 for Group 2",
        23 : "Pick Table 3 for Group 2",
        24 : "Pick Table 4 for Group 2",
        25 : "Pick Table 5 for Group 2",
        26 : "Pick Table 6 for Group 2",
        27 : "Pick Table 1 for Group 3",
        28 : "Pick Table 2 for Group 3",
        29 : "Pick Table 3 for Group 3",
        30 : "Pick Table 4 for Group 3",
        31 : "Pick Table 5 for Group 3",
        32 : "Pick Table 6 for Group 3",
        33 : "Pick Table 1 for Group 4",
        34 : "Pick Table 2 for Group 4",
        35 : "Pick Table 3 for Group 4",
        36 : "Pick Table 4 for Group 4",
        37 : "Pick Table 5 for Group 4",
        38 : "Pick Table 6 for Group 4",
        39 : "Pick Table 1 for Group 5",
        40 : "Pick Table 2 for Group 5",
        41 : "Pick Table 3 for Group 5",
        42 : "Pick Table 4 for Group 5",
        43 : "Pick Table 5 for Group 5",
        44 : "Pick Table 6 for Group 5",
        45 : "Pick Table 1 for Group 6",
        46 : "Pick Table 2 for Group 6",
        47 : "Pick Table 3 for Group 6",
        48 : "Pick Table 4 for Group 6",
        49 : "Pick Table 5 for Group 6",
        50 : "Pick Table 6 for Group 6",
        51 : "Pick Table 1 for Group 7",
        52 : "Pick Table 2 for Group 7",
        53 : "Pick Table 3 for Group 7",
        54 : "Pick Table 4 for Group 7",
        55 : "Pick Table 5 for Group 7",
        56 : "Pick Table 6 for Group 7",
    }
    return actionIDtoName[actionID]
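
The mapping above follows a regular pattern, so the same names can be generated rather than typed out. The sketch below is a hypothetical alternative (`action_name` is an illustrative name, not part of the simulator) that reproduces the entries of `getAction`, assuming 6 tables and 7 groups exactly as listed there.

```python
NUM_TABLES = 6   # tables 1..6, as in the mapping above
NUM_GROUPS = 7   # groups 1..7, as in the mapping above

def action_name(action_id):
    """Return the readable name for an action ID in the range 0..56."""
    if action_id == 0:
        return "None"
    if 1 <= action_id <= 6:
        return f"Move to Table {action_id}"
    if action_id == 7:
        return "Move to Counter"
    if 8 <= action_id <= 13:
        return f"Pick Food for Table {action_id - 7}"
    if action_id == 14:
        return "Move to Food Collection"
    if 15 <= action_id <= 14 + NUM_TABLES * NUM_GROUPS:
        # IDs 15..56 enumerate groups in blocks of six tables each
        group, table = divmod(action_id - 15, NUM_TABLES)
        return f"Pick Table {table + 1} for Group {group + 1}"
    raise ValueError(f"Unknown action ID: {action_id}")

print(action_name(26))  # Pick Table 6 for Group 2
```

Unlike a dictionary lookup, this version raises a `ValueError` with the offending ID, which can make debugging out-of-range actions slightly easier.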

# Policies

In [None]:
import time
import gym

env = gym.make('diner_dash:DinerDash-v0').unwrapped
env.flash_sim = False
state = env.reset()

## Random Agent

In [None]:
from random import randint

In [None]:
# Randomly select an action from the action space
def testRA(env, state):
    # init variables
    done = False
    sumReward = 0
    actionList = []

    while not done:
        action = randint(0, 56)
        actionList.append(action)
        state, reward, done, _ = env.step(action)
        sumReward += reward

    return sumReward, actionList

## [PPO](https://github.com/nikhilbarhate99/PPO-PyTorch)

In [None]:
import torch
import torch.nn as nn
import numpy as np
from torch.distributions import Categorical

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
class Memory:
    def __init__(self):
        self.actions = []
        self.states = []
        self.logprobs = []
        self.rewards = []
        self.is_terminals = []
    
    def clear_memory(self):
        del self.actions[:]
        del self.states[:]
        del self.logprobs[:]
        del self.rewards[:]
        del self.is_terminals[:]

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, n_latent_var):
        super(ActorCritic, self).__init__()

        # actor
        self.action_layer = nn.Sequential(
                nn.Linear(state_dim, n_latent_var),
                nn.Tanh(),
                nn.Linear(n_latent_var, n_latent_var),
                nn.Tanh(),
                nn.Linear(n_latent_var, action_dim),
                nn.Softmax(dim=-1)
                )
        
        # critic
        self.value_layer = nn.Sequential(
                nn.Linear(state_dim, n_latent_var),
                nn.Tanh(),
                nn.Linear(n_latent_var, n_latent_var),
                nn.Tanh(),
                nn.Linear(n_latent_var, 1)
                )
        
    def forward(self):
        raise NotImplementedError
        
    def act(self, state, memory):
        state = torch.from_numpy(state).float().to(device) 
        action_probs = self.action_layer(state)
        dist = Categorical(action_probs)
        action = dist.sample()
        
        memory.states.append(state)
        memory.actions.append(action)
        memory.logprobs.append(dist.log_prob(action))
        
        return action.item()
    
    def evaluate(self, state, action):
        action_probs = self.action_layer(state)
        dist = Categorical(action_probs)
        
        action_logprobs = dist.log_prob(action)
        dist_entropy = dist.entropy()
        
        state_value = self.value_layer(state)
        
        return action_logprobs, torch.squeeze(state_value), dist_entropy
        
class PPO:
    def __init__(self, state_dim, action_dim, n_latent_var, lr, betas, gamma, K_epochs, eps_clip):
        self.lr = lr
        self.betas = betas
        self.gamma = gamma
        self.eps_clip = eps_clip
        self.K_epochs = K_epochs
        
        self.policy = ActorCritic(state_dim, action_dim, n_latent_var).to(device)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr, betas=betas)
        self.policy_old = ActorCritic(state_dim, action_dim, n_latent_var).to(device)
        self.policy_old.load_state_dict(self.policy.state_dict())
        
        self.MseLoss = nn.MSELoss()
    
    def update(self, memory):   
        # Monte Carlo estimate of state rewards:
        rewards = []
        discounted_reward = 0
        for reward, is_terminal in zip(reversed(memory.rewards), reversed(memory.is_terminals)):
            if is_terminal:
                discounted_reward = 0
            discounted_reward = reward + (self.gamma * discounted_reward)
            rewards.insert(0, discounted_reward)
        
        # Normalizing the rewards:
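        # Force float32 so mean/std and the MSE loss work even if rewards are integers
        rewards = torch.tensor(rewards, dtype=torch.float32).to(device)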
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-5)
        
        # convert list to tensor
        old_states = torch.stack(memory.states).to(device).detach()
        old_actions = torch.stack(memory.actions).to(device).detach()
        old_logprobs = torch.stack(memory.logprobs).to(device).detach()
        
        # Optimize policy for K epochs:
        for _ in range(self.K_epochs):
            # Evaluating old actions and values :
            logprobs, state_values, dist_entropy = self.policy.evaluate(old_states, old_actions)
            
            # Finding the ratio (pi_theta / pi_theta__old):
            ratios = torch.exp(logprobs - old_logprobs.detach())
                
            # Finding Surrogate Loss:
            advantages = rewards - state_values.detach()
            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 1-self.eps_clip, 1+self.eps_clip) * advantages
            loss = -torch.min(surr1, surr2) + 0.5*self.MseLoss(state_values, rewards) - 0.01*dist_entropy
            
            # take gradient step
            self.optimizer.zero_grad()
            loss.mean().backward()
            self.optimizer.step()
        
        # Copy new weights into old policy:
        self.policy_old.load_state_dict(self.policy.state_dict())
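
Two steps inside `update` are easier to follow on small numbers: the backward pass that turns stored rewards into discounted returns (resetting at episode boundaries), and the clipped surrogate term. The sketch below re-implements both in plain Python for illustration only; `discounted_returns` and `clipped_surrogate` are hypothetical names, and the tensor version in the class above remains the one used for training.

```python
def discounted_returns(rewards, is_terminals, gamma):
    """Walk the buffer backwards, resetting the running return at episode ends."""
    returns = []
    running = 0.0
    for reward, done in zip(reversed(rewards), reversed(is_terminals)):
        if done:
            running = 0.0
        running = reward + gamma * running
        returns.insert(0, running)
    return returns

def clipped_surrogate(ratio, advantage, eps_clip):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps_clip, min(ratio, 1.0 + eps_clip))
    return min(ratio * advantage, clipped_ratio * advantage)

# Two episodes stored back to back: rewards [1, 1, 1] and then [2, 2]
print(discounted_returns([1, 1, 1, 2, 2], [False, False, True, False, True], gamma=0.5))
# -> [1.75, 1.5, 1.0, 3.0, 2.0]

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps
print(clipped_surrogate(ratio=1.5, advantage=2.0, eps_clip=0.2))   # -> 2.4
# Negative advantage: the unclipped (more pessimistic) term is kept
print(clipped_surrogate(ratio=1.5, advantage=-2.0, eps_clip=0.2))  # -> -3.0
```

The loss in `update` negates this surrogate because the optimizer minimises, then adds the value loss and subtracts the entropy bonus.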

### Training PPO

In [None]:
def trainPPO():
    ############## Hyperparameters ##############
    env_name = "diner_dash:DinerDash-v0"
    # creating environment
    env = gym.make(env_name).unwrapped
    state_dim = env.observation_space.shape[0]
    action_dim = 57
    render = False
    solved_reward = 500         # stop training if avg_reward > solved_reward
    log_interval = 1000         # print avg reward in the interval
    max_episodes = int(1e8)     # max training episodes
    max_timesteps = int(1e8)    # max timesteps in one episode
    n_latent_var = 64           # number of variables in hidden layer
    update_timestep = 2000      # update policy every n timesteps
    lr = 0.002
    betas = (0.9, 0.999)
    gamma = 0.99                # discount factor
    K_epochs = 4                # update policy for K epochs
    eps_clip = 0.2              # clip parameter for PPO
    random_seed = None          # do NOT train with a random seed
    #############################################
    
    if random_seed:
        torch.manual_seed(random_seed)
        env.seed(random_seed)
    
    memory = Memory()
    ppo = PPO(state_dim, action_dim, n_latent_var, lr, betas, gamma, K_epochs, eps_clip)
    print(f"Device used: {device}")
    print(f"Learning rate: {lr}, Betas: {betas}")
    
    # logging variables
    running_reward = 0
    avg_length = 0
    timestep = 0
    
    # training loop
    for i_episode in range(1, max_episodes+1):
        state = env.reset()
        state = np.array(state)
        
        for t in range(max_timesteps):
          timestep += 1

          # Running policy_old:
          action = ppo.policy_old.act(state, memory)
          state, reward, done, _ = env.step(action)
          state = np.array(state)
          
          # Saving reward and is_terminal:
          memory.rewards.append(reward)
          memory.is_terminals.append(done)
          
          # update if its time
          if timestep % update_timestep == 0:
              ppo.update(memory)
              memory.clear_memory()
              timestep = 0
          
          running_reward += reward
          if render:
              env.render()
          if done:
              break
                
        avg_length += t
        
        # stop training if avg_reward > solved_reward
        if running_reward > (log_interval*solved_reward):
            print("########## Solved! ##########")
            torch.save(ppo.policy.state_dict(), 'PPO_{}_{}.pth'.format(env_name, solved_reward))
            break
            
        # logging
        if i_episode % log_interval == 0:
            avg_length = int(avg_length/log_interval)
            running_reward = int((running_reward/log_interval))
            
            print('Episode {} \t avg length: {} \t reward: {}'.format(i_episode, avg_length, running_reward))
            running_reward = 0
            avg_length = 0
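
Note that `trainPPO` writes a `.pth` file only when the `solved_reward` threshold is crossed, so a run that is interrupted earlier leaves nothing for `testPPO` to load. One hedged option is to also save whenever the logged average improves. `BestRewardTracker` below is a hypothetical helper, not part of the original PPO code; the comment marks where a `torch.save(...)` call could go inside the logging branch.

```python
class BestRewardTracker:
    """Remember the best average reward seen so far (hypothetical helper)."""

    def __init__(self):
        self.best = float("-inf")

    def improved(self, avg_reward):
        """Return True and update the record if avg_reward beats the best so far."""
        if avg_reward > self.best:
            self.best = avg_reward
            return True
        return False

tracker = BestRewardTracker()
# Average rewards as logged every 1000 episodes in this notebook's training run
for avg in [-960, -369, 50, 163, 249, 318, 267, 235, 273, 310]:
    if tracker.improved(avg):
        # In trainPPO, torch.save(ppo.policy.state_dict(), path) could be called here
        print(f"new best average reward: {avg}")
```

With the logged values above, the last improvement is at 318, so later dips would not overwrite the best checkpoint.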

In [None]:
trainPPO()

Device used: cuda:0
Learning rate: 0.002, Betas: (0.9, 0.999)
Episode 1000 	 avg length: 137 	 reward: -960
Episode 2000 	 avg length: 137 	 reward: -369
Episode 3000 	 avg length: 139 	 reward: 50
Episode 4000 	 avg length: 141 	 reward: 163
Episode 5000 	 avg length: 141 	 reward: 249
Episode 6000 	 avg length: 143 	 reward: 318
Episode 7000 	 avg length: 141 	 reward: 267
Episode 8000 	 avg length: 142 	 reward: 235
Episode 9000 	 avg length: 142 	 reward: 273
Episode 10000 	 avg length: 142 	 reward: 310

### Testing PPO

In [None]:
def testPPO(env, state):
    ############## Hyperparameters ##############
    env_name = "diner_dash:DinerDash-v0"
    # creating environment
    # env = gym.make(env_name).unwrapped
    state_dim = 40
    action_dim = 57
    exp_reward = 500            # expected reward to load saved model     
    n_latent_var = 64           # number of variables in hidden layer
    lr = 0.0007
    betas = (0.9, 0.999)
    gamma = 0.99                # discount factor
    K_epochs = 4                # update policy for K epochs
    eps_clip = 0.2              # clip parameter for PPO
    #############################################
    
    filename = "PPO_{}_{}.pth".format(env_name, exp_reward)
    
    memory = Memory()
    ppo = PPO(state_dim, action_dim, n_latent_var, lr, betas, gamma, K_epochs, eps_clip)
    
    # map_location lets weights saved on a GPU load on a CPU-only runtime
    ppo.policy_old.load_state_dict(torch.load(filename, map_location=device))

    ep_reward = 0
    state = np.array(state)
    done = False

    while not done:
      action = ppo.policy_old.act(state, memory)
      state, reward, done, _ = env.step(action)
      state = np.array(state)
      ep_reward += reward

    actionList = [action.item() for action in memory.actions]

    return ep_reward, actionList
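
`testPPO` selects actions through `act`, which samples from the categorical distribution, so repeated runs on the same seeded environment can yield different action lists. A deterministic alternative is to take the most probable action. The plain-Python sketch below contrasts the two choices on a toy probability vector; `sample_action` and `greedy_action` are hypothetical names, and whether greedy selection scores better on Diner Dash is not something this notebook establishes.

```python
import random

def sample_action(action_probs, rng):
    """Draw an action index in proportion to its probability (as `act` does)."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]

def greedy_action(action_probs):
    """Pick the most probable action index; deterministic for a fixed policy."""
    return max(range(len(action_probs)), key=lambda i: action_probs[i])

probs = [0.1, 0.6, 0.3]
rng = random.Random(0)
print([sample_action(probs, rng) for _ in range(5)])  # depends on the RNG state
print(greedy_action(probs))                           # always the same index
```

In the PyTorch policy, the greedy counterpart would be `torch.argmax(action_probs).item()` applied to the output of `action_layer`.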

## Stable Baselines

### Check Env setup for Stable Baselines

In [None]:
from stable_baselines.common.env_checker import check_env

In [None]:
error = check_env(gym.make('diner_dash:DinerDash-v0').unwrapped)
if error is None:
  print("Diner Dash environment is compatible with Stable-Baselines!")

Diner Dash environment is compatible with Stable-Baselines!


### Save or Load Models

### Vanilla DQN

In [None]:
# Same as before we instantiate the agent along with the environment
from stable_baselines import DQN

In [None]:
# Deactivate all the DQN extensions to have the original version
# In practice, it is recommended to have them activated
kwargs = {'double_q': False, 'prioritized_replay': False, 'policy_kwargs': dict(dueling=False)}

# Initialise environment
env = gym.make('diner_dash:DinerDash-v0').unwrapped
env.flash_sim = False

# Note that the MlpPolicy of DQN is different from the one of PPO
# but stable-baselines handles that automatically if you pass a string
dqn_model = DQN('MlpPolicy', env, verbose=1, **kwargs)

start_time = time.time()
# Train the agent for 1e6 timesteps
dqn_model.learn(total_timesteps=int(1e6), log_interval=300)
print(f"--- {(time.time() - start_time)//60} minutes ---")







---------------------------------------
| % time spent exploring  | 58        |
| episodes                | 300       |
| mean 100 episode reward | -1.31e+03 |
| steps                   | 41902     |
---------------------------------------
---------------------------------------
| % time spent exploring  | 18        |
| episodes                | 600       |
| mean 100 episode reward | -1.28e+03 |
| steps                   | 82799     |
---------------------------------------
---------------------------------------
| % time spent exploring  | 2         |
| episodes                | 900       |
| mean 100 episode reward | -1.05e+03 |
| steps                   | 123282    |
---------------------------------------
--------------------------------------


In [None]:
from stable_baselines.common.evaluation import evaluate_policy

mean_reward, std_reward = evaluate_policy(dqn_model, env, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:-581.60 +/- 469.14
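
The "% time spent exploring" column in the DQN training log can be reproduced by hand. The sketch below assumes the Stable Baselines 2.10 defaults for DQN (`exploration_fraction=0.1`, `exploration_final_eps=0.02`, starting from 1.0), which the training cell does not override; `exploration_rate` is a hypothetical helper written for this illustration, not a library function.

```python
def exploration_rate(step, total_timesteps=int(1e6), exploration_fraction=0.1,
                     initial_eps=1.0, final_eps=0.02):
    """Linearly anneal epsilon over the first `exploration_fraction` of training."""
    schedule_steps = exploration_fraction * total_timesteps
    fraction = min(step / schedule_steps, 1.0)
    return initial_eps + fraction * (final_eps - initial_eps)

# Step counts taken from the three log entries of the DQN run
for step in (41902, 82799, 123282):
    print(step, int(100 * exploration_rate(step)))
# prints 41902 58, then 82799 18, then 123282 2
```

Under these assumptions the agent is almost fully greedy after the first 100,000 of the 1,000,000 timesteps, so a longer exploration phase (a larger `exploration_fraction`) is one hyperparameter worth trying if learning stalls.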


### PPO2

In [None]:
from stable_baselines.common import make_vec_env
from stable_baselines import PPO2

# Vectorize environment
env = make_vec_env("diner_dash:DinerDash-v0")

model = PPO2('MlpPolicy', env, verbose=1)
start_time = time.time()
model.learn(total_timesteps=int(5e6), log_interval=300)
print(f"--- Time taken to train model = {(time.time() - start_time)//60} minutes ---")

print("Saving Model...")
modelDirectory = "./"
modelName = "dinerDash"
model.save(modelDirectory + modelName)
print(f"Model saved as {modelDirectory + modelName}")

del model # remove to demonstrate saving and loading





--------------------------------------
| approxkl           | 0.00016157987 |
| clipfrac           | 0.0           |
| ep_len_mean        | 120           |
| ep_reward_mean     | -1.85e+03     |
| explained_variance | -0.00125      |
| fps                | 58            |
| n_updates          | 1             |
| policy_entropy     | 4.042927      |
| policy_loss        | -0.00868512   |
| serial_timesteps   | 128           |
| time_elapsed       | 4.96e-05      |
| total_timesteps    | 128           |
| value_loss         | 60121.566     |
--------------------------------------
--------------------------------------
| approxkl           | 5.398751e-06  |
| clipfrac           | 0.0           |
| ep_len_mean        | 139           |
| ep_reward_mean     
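
The first PPO2 log entry shows `total_timesteps` of 128 at `n_updates` 1, which is consistent with one environment and a rollout length of 128 steps per update. Assuming those values (`n_steps=128` for PPO2 and `n_envs=1` for `make_vec_env`, neither of which is set explicitly in the cell above), the size of the training run can be worked out as follows; `ppo2_update_count` is a hypothetical helper for this arithmetic only.

```python
def ppo2_update_count(total_timesteps, n_steps=128, n_envs=1):
    """Number of policy updates: one per rollout of n_steps * n_envs timesteps."""
    return total_timesteps // (n_steps * n_envs)

updates = ppo2_update_count(int(5e6))
print(updates)         # 39062 updates for 5e6 timesteps
print(updates // 300)  # 130 log entries at log_interval=300, besides the first update
```

Raising `n_envs` in `make_vec_env` would collect more timesteps per update and reduce the number of updates for the same `total_timesteps`.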

In [None]:
from stable_baselines import PPO2
from stable_baselines.common.evaluation import evaluate_policy

# Load saved model
PPO_model = PPO2.load("dinerDash")

mean_reward, std_reward = evaluate_policy(PPO_model, env, n_eval_episodes=100)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

Loading a model without an environment, this model cannot be trained until it has a valid environment.




mean_reward:-35.80 +/- 130.91


In [None]:
def testPPO2(env, obs):
  from stable_baselines import PPO2

  # Load saved model
  PPO_model = PPO2.load("dinerDash")

  done = False
  sum_rewards = 0
  action_list = []

  while not done:
    action, _states = PPO_model.predict(obs)
    action_list.append(action.item())
    obs, rewards, done, info = env.step(action)
    sum_rewards += rewards

  return sum_rewards, action_list

# Testing of Policies and Verification of Submission

In [None]:
from random import randint
import json
from os import getcwd

# Sample test
def test():
    # Initialise environment
    env = gym.make('diner_dash:DinerDash-v0').unwrapped
    env.flash_sim = False

    ############################ CHANGEABLE AREA ##############################
    # Changeable parameters
    numEpisodes = 100                             # num of test episodes
    algos = [testPPO2]           # Add or remove algos (must have unique names)
    saveJson = False                              # Whether to save actions_dict
    group_members = "john, mary, bryan"           # String of group members
    fileDirectory = "./"                          # Path of saved json file
    fileName = "submission.json"                  # Name of json file

    ### Replace the list of randomSeeds with that given for submission
    # e.g. randomSeeds = [1, 2, 3]
    randomSeeds = [randint(0, int(1e8)) for i in range(numEpisodes)]

    ############################################################################

    rewards_dict = {algo.__name__ : [] for algo in algos}
    actions_dict = {algo.__name__ : [] for algo in algos}

    # Test begins
    for seed in randomSeeds:
        # Sets random seed
        env.seed(seed)
        
        # Resets the environment based on random seed
        state = env.reset()

        for algo in algos:
            # create copy of environment for testing
            t_env = env.env.duplicate()

            # Given an environment and initial state
            # Returns the sum of rewards for that episode and the actions list
            rewards, actions = algo(t_env, state)

            rewards_dict[algo.__name__].append(rewards)
            actions_dict[algo.__name__].append(actions)

    # Print average rewards from n episodes for each algo
    avgReward_dict = {algo : int(sum(rewards)/len(rewards)) for algo, rewards in rewards_dict.items()}
    print(f"Average Rewards for each algo: {avgReward_dict}")

    # Print an action dict containing actions list for each random seed env for each algo
    print(f"Actions list for each env for each algo: {actions_dict}")
    
    submission_dict = {
        "names": group_members,
        "actionDict": actions_dict}

    if saveJson:
      print("Saving Json file...")
      with open(fileDirectory + fileName, "w") as write_file:
          json.dump(submission_dict, write_file)
          print(f"{fileName} was saved in {getcwd()}")
      
      print("-" * 100)
      
      print(f"Verifying {fileName}...")
      print(f"Group Members include: {submission_dict['names']}")
      print(f"Names of Algos used: {list(submission_dict['actionDict'].keys())}")
      for val in submission_dict['actionDict'].values():
        submissionEpisodes = len(val)
        if submissionEpisodes != len(randomSeeds):
          raise ValueError("Number of episodes in submission does not match the number of random seeds!")
      print(f"Number of episodes(random seeds): {submissionEpisodes}")
      print("Number of episodes in submission matches the number of random seeds")
      print("Verification Complete! Please double check the verification results")
    
    return None
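
The verification inside `test()` checks the in-memory dictionary, not the file on disk. As an extra safeguard, the saved JSON can be reloaded and checked independently. The sketch below mirrors only the structure that `test()` writes (`"names"` plus `"actionDict"` mapping each algorithm name to one action list per seed); `validate_submission` is a hypothetical helper, and the official requirements remain those in the workshop repo.

```python
import json
import os
import tempfile

NUM_ACTIONS = 57  # action IDs 0..56, as in the helper mapping

def validate_submission(path, num_seeds):
    """Reload a saved submission and check its shape; raise ValueError on problems."""
    with open(path) as f:
        submission = json.load(f)
    if not isinstance(submission.get("names"), str):
        raise ValueError("'names' must be a string of group members")
    action_dict = submission.get("actionDict")
    if not isinstance(action_dict, dict) or not action_dict:
        raise ValueError("'actionDict' must map algorithm names to action lists")
    for algo, episodes in action_dict.items():
        if len(episodes) != num_seeds:
            raise ValueError(f"{algo}: expected {num_seeds} episodes, found {len(episodes)}")
        for actions in episodes:
            if not all(isinstance(a, int) and 0 <= a < NUM_ACTIONS for a in actions):
                raise ValueError(f"{algo}: action IDs must be integers in 0..{NUM_ACTIONS - 1}")
    return True

# Usage with a tiny made-up submission (2 seeds, 1 algorithm)
example = {"names": "john, mary, bryan", "actionDict": {"testPPO2": [[0, 7, 15], [1, 8]]}}
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "submission.json")
    with open(example_path, "w") as f:
        json.dump(example, f)
    is_valid = validate_submission(example_path, num_seeds=2)
print(is_valid)
```

For a real submission, `num_seeds` would be the length of the `randomSeeds` list provided by the organisers.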

In [None]:
test()

Loading a model without an environment, this model cannot be trained until it has a valid environment.