# NGSchool2019
## Reinforcement Learning Tutorial

### Adapted from a tutorial by [Katja Hofmann](https://www.microsoft.com/en-us/research/people/kahofman/)

### Presented by Robert Loftin


### Overview

This tutorial demonstrates how to build a deep RL agent that learns to navigate: first in FourRooms, a task we implement from scratch, then in a MineRL task in Minecraft.

In this tutorial we will learn how to implement a deep reinforcement learning agent, namely, Deep Q-Networks (DQN) [Mnih et al. 2015](https://www.nature.com/articles/nature14236/), and train it to navigate in a simple 2D environment.  In an optional section at the end of this tutorial, you will have the opportunity to train your deep RL agent in the [MineRL](http://minerl.io/) environment, built around the popular video game [Minecraft](https://www.minecraft.net/en-us/).

1. [Setup](#Setup) **Tip: run this section before the start of the tutorial, to make sure you're ready to get started.**
1. [RL Components](#RL-Components): **Learn how to implement the core components of an RL experiment: environment, agent, and the experiment itself.**
  1. [Agent](#Agent)
  1. [Experiment](#Experiment)
  1. [Experiment 1: Random Agent on MountainCar](#Experiment-1:-Random-Agent-on-MountainCar)
  1. [Environment: Four Rooms](#Environment:-Four-Rooms)
  1. [Experiment 2: Random Agent on FourRooms](#Experiment-2:-Random-Agent-on-FourRooms)

1. [DQN Agent Implementation](#DQN-Agent-Implementation): **Learn how to implement a DQN Agent**
  1. [Q-Network](#Q-Network)
  1. [Replay Memory](#Replay-Memory)
  1. [Exploration Strategy](#Exploration-Strategy)
  1. [QLearning Agent](#QLearning-Agent)
1. [Experiment 3: Deep Q-Networks on FourRooms](#Experiment-3:-Deep-Q-Networks-on-FourRooms): **Experiment with a DQN Agent**

1. [Further Reading](#Further-Reading)
1. [(Optional) DQN on MineRL](#(Optional)-DQN-on-MineRL)

## Setup

Before we start, you will need to make sure that all of the Python dependencies needed for this tutorial are installed and working properly.  If possible, I recommend running these sections before the tutorial session, so you will have your environment ready to go when we start.

### Install requirements

Install all of the dependencies for this tutorial in your Python environment.  You won't need to run this again once everything is installed properly.

In [None]:
# install required packages
!pip install --upgrade gym matplotlib==3.0.3 numpy torch

### Import dependencies

Import all of the packages we will need for this tutorial.  This needs to be run whenever you reload this notebook or restart the IPython kernel.

In [None]:
# Tells the notebook to enable its matplotlib backend so we can draw stuff
%matplotlib nbagg

# Dependencies for visualizing environments and plotting
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

# Dependencies for our environments and learning algorithms
import gym
import numpy as np
import torch
import torch.nn as nn
import time

## RL Components

In this section we will build some of the elements we will need to run and visualize the RL experiments in this notebook.  The main components will be:

**Agent:** interacts with an environment by taking actions and receiving observations and rewards. To start, we will just define the abstract template for an RL agent, and implement an agent which just takes random actions.  Once we have everything set up, we will implement an actual deep RL agent: a Deep Q-Network agent.

**Environment:** defines an interactive task. Our environments will implement the [OpenAI gym interface](https://gym.openai.com/).  We will start with a classic RL environment, MountainCar, which is already implemented as part of gym.  We will also implement a simple custom environment to illustrate gym functionality.

In the optional section at the end of the tutorial we look at MineRL, a much more complex 3D environment based on the popular video game [Minecraft](https://www.minecraft.net/en-us/).

**Experiment:** connects our agents to our environments.  This class will allow us to easily run and visualize our reinforcement learning experiments.

### Agent

Run the cell below to define the abstract agent class, and an implementation of this class which samples actions uniformly at random.  Notice that agents will only implement a single method, step(), which takes in the most recent observation and reward, and outputs the next action for the agent to take.  Also notice that this method takes a parameter 'done', which indicates the end of an episode.

**Note:** The 'info' parameter is part of the OpenAI gym interface, and allows an environment to return additional information beyond the current observation.  We won't use this here, but include it for completeness.
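
To make the `step()` contract concrete, here is a minimal sketch of the interaction loop in plain Python.  `CountdownEnv` and `CoinFlipAgent` are hypothetical stand-ins invented for this illustration (they are not part of gym or of this tutorial's classes); they only mimic the `reset()`/`step()` and `step(obs, reward, done, info)` signatures described above.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment with a gym-style reset()/step() interface.

    The observation is the number of steps left in the episode.  Every step
    yields a reward of 1.0, and the episode ends when the counter hits zero.
    """

    def __init__(self, length=3):
        self.length = length
        self.remaining = length

    def reset(self):
        self.remaining = self.length
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        done = self.remaining == 0
        return self.remaining, 1.0, done, None


class CoinFlipAgent:
    """Stand-in agent exposing the same step() signature as the Agent class."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.episodes_seen = 0

    def step(self, obs, reward, done, info):
        if done:
            self.episodes_seen += 1  # the previous episode has just ended
        return self.rng.choice([0, 1])


def run(env, agent, num_steps):
    obs, reward, done, info = env.reset(), 0.0, False, None
    total = 0.0
    for _ in range(num_steps):
        if done:
            obs = env.reset()  # new episode; the agent still sees done=True once
        action = agent.step(obs, reward, done, info)
        obs, reward, done, info = env.step(action)
        total += reward
    return total


agent = CoinFlipAgent()
total_reward = run(CountdownEnv(length=3), agent, num_steps=7)
```

Note that the agent is told about the end of an episode on the first call after it happens, which is what lets a learning agent record the final transition.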

In [None]:
class Agent(object):
    '''Agent base class'''

    def __init__(self, observation_space, action_space):
        self.observation_space = observation_space
        self.action_space = action_space

    def step(self, obs, reward, done, info):
        raise NotImplementedError


class RandomAgent(Agent):
    '''Agent that samples actions uniformly at random'''

    def __init__(self, observation_space, action_space):
        super(RandomAgent, self).__init__(observation_space, action_space)
    
    def step(self, obs, reward, done, info):
        '''Sample a random action from the action space.  Gym defines an action space 
        interface, which implements the sample() method for generating random actions'''
        
        return self.action_space.sample()

### Experiment

Run the cells below to define the Experiment class, which allows us to configure, run, and visualize our RL experiments.
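
Because single-step rewards can be noisy, the Experiment class plots a rolling average of the reward rather than the raw signal.  The sketch below shows that computation on its own, in plain Python; the function name `rolling_average` is an illustrative choice, and the example rewards are made up.

```python
def rolling_average(rewards, window_size):
    """Return the mean of the last `window_size` rewards after each step."""
    averages = []
    for t in range(1, len(rewards) + 1):
        window = rewards[max(0, t - window_size):t]
        averages.append(sum(window) / len(window))
    return averages


# A single large reward is smeared over the following `window_size` steps
rewards = [0.0, 0.0, 100.0, 0.0, 0.0]
averages = rolling_average(rewards, window_size=2)
```

A larger window gives a smoother curve but reacts more slowly to changes in the agent's behaviour.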

In [None]:
class Experiment(object):

    def __init__(self, env, agent, window_size=100, r_min=0.0, r_max=0.0):
        self.env = env
        self.agent = agent

        # Since the reward signal may have high variance, we plot a rolling average
        self.rolling_average = np.array([0.0])
        self.window_size = window_size

        self.r_min = r_min
        self.r_max = r_max
        
        # prepare visuals
        self.fig = plt.figure(figsize=(9, 5))
        gs = gridspec.GridSpec(1, 2)
        env_fig = plt.subplot(gs[0, 0])
        env_fig.title.set_text('Environment Visualization')
        env_fig.xaxis.set_visible(False)
        env_fig.yaxis.set_visible(False)
        self.env_img = env_fig.imshow(np.random.random((64,64)), interpolation='none', cmap='viridis')
        
        self.reward_fig = plt.subplot(gs[0, 1])
        self.reward_fig.title.set_text('Rolling average reward')
        self.reward_line, = self.reward_fig.plot(range(len(self.rolling_average)), self.rolling_average)
        
        plt.tight_layout()
        plt.show()

    def run(self, num_steps, display_frequency=1):
        observation = self.env.reset()  # Initialize environment
        reward = 0.0
        done = False
        info = None
        steps = 0
        rewards = np.array([])
        self.reward_fig.set_xlim(0, max(100, num_steps))
        self.update_display()
        
        while steps < num_steps:
            
            # if the episode is finished, reset the environment
            if done:
                observation = self.env.reset() 
            
            steps += 1
            
            action = self.agent.step(observation, reward, done, info)  # Get the agent's next action
            observation, reward, done, info = self.env.step(action)  # Take a step in the environment
                
            # Update the rolling average reward
            rewards = np.append(rewards, reward)
            self.rolling_average = np.append(self.rolling_average, np.mean(rewards[-self.window_size:]))

            # Update the visualization
            if steps % display_frequency == 0:
                self.update_display()
      
    def update_display(self):
        
        # Draw the environment
        self.env_img.set_data(self.env.render(mode='rgb_array'))
        
        # Plot the average reward
        self.reward_line.set_data(range(len(self.rolling_average)), self.rolling_average)
        self.reward_fig.set_ylim(min(self.r_min, min(self.rolling_average)-0.1), 
                                 max(self.r_max, max(self.rolling_average)+0.1))

        self.fig.canvas.draw()

### Experiment 1: Random Agent on MountainCar

We are now ready to run our first simple experiment.  This will just illustrate the process of connecting an agent to an environment and running an experiment.  Before we build our own custom environment, we will use a classic RL environment, MountainCar, which is already implemented in gym.  Run the cells below to launch our first experiment.

In [None]:
# experiment setup
env = gym.make('MountainCar-v0')
random_agent = RandomAgent(env.observation_space, env.action_space)
experiment = Experiment(env, random_agent)

In [None]:
# Run experiment
experiment.run(num_steps=200, display_frequency=1)

Congratulations! You have just run your first RL experiment.  Of course, your agent hasn't actually learned anything yet (it can't make it to the top of the mountain) but this is how you will run all of the experiments in this tutorial.  Take a step back to familiarize yourself with the code up to this point.

### Environment: Four Rooms

Before we get to the actual RL algorithms, we will first try implementing a simple RL environment of our own.  This will give us a chance to go through the process of turning a decision-making problem we want to solve into an RL environment which we can actually train in.

Our environment will be a 4x4 grid world, divided by walls into four different rooms.  The agent can move up, down, left or right, but cannot move through walls.  The objective for the agent is to reach a random goal location.
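
One way to encode such a grid world is a lookup table mapping each (state, action) pair to the next state, with walls represented by moves that leave the state unchanged.  The sketch below replays a few moves through the same table used by the FourRooms class.  The action names `UP`, `DOWN`, `LEFT`, `RIGHT` are an assumption based on the order given in the text, and the rendering may orient the grid differently.

```python
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3  # assumed meaning of the four action indices

# P[state][action] -> next state; a blocked move maps a state to itself
P = [
    [0, 4, 0, 1], [1, 5, 0, 2], [2, 6, 1, 3], [3, 7, 2, 3],
    [0, 8, 4, 5], [1, 5, 4, 5], [2, 6, 6, 7], [3, 11, 6, 7],
    [4, 12, 8, 9], [9, 13, 8, 9], [10, 14, 10, 11], [7, 15, 10, 11],
    [8, 12, 12, 13], [9, 13, 12, 14], [10, 14, 13, 15], [11, 15, 14, 15],
]


def one_hot(state, num_states=16):
    """Encode a state index as a vector with a single 1.0 entry."""
    return [1.0 if s == state else 0.0 for s in range(num_states)]


def rollout(start, actions):
    """Follow a sequence of actions and return every state visited."""
    states = [start]
    for action in actions:
        states.append(P[states[-1]][action])
    return states


# From state 5 the second DOWN is blocked by a wall, so the state repeats
path = rollout(0, [RIGHT, DOWN, DOWN, LEFT])
observation = one_hot(path[-1])
```

Since every entry of the table is a single next state, the transitions in this environment are deterministic.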

In [None]:
class FourRooms(gym.Env):

    def __init__(self):
        self.observation_space = gym.spaces.Box(0.0, 1.0, (16,))
        self.action_space = gym.spaces.Discrete(4)
        self.rewards = np.zeros(16)
        self.rewards[np.random.randint(0, 16)] = 100.0
        
        self.state = 0
        self.num_steps = 0
        
        self.P = [None] * 16
        self.P[0] = [0, 4, 0, 1]
        self.P[1] = [1, 5, 0, 2]
        self.P[2] = [2, 6, 1, 3]
        self.P[3] = [3, 7, 2, 3]
        self.P[4] = [0, 8, 4, 5]
        self.P[5] = [1, 5, 4, 5]
        self.P[6] = [2, 6, 6, 7]
        self.P[7] = [3, 11, 6, 7]
        self.P[8] = [4, 12, 8, 9]
        self.P[9] = [9, 13, 8, 9]
        self.P[10] = [10, 14, 10, 11]
        self.P[11] = [7, 15, 10, 11]
        self.P[12] = [8, 12, 12, 13]
        self.P[13] = [9, 13, 12, 14]
        self.P[14] = [10, 14, 13, 15]
        self.P[15] = [11, 15, 14, 15]

        self.max_episode_length = 40  # End episode automatically after 40 steps
        self._background = self._render_maze()  # The maze and goal are fixed, so just render them once

    def step(self, action):
        if action < 0 or action > 3:
            raise ValueError('Unknown action', action)
        
        self.state = self.P[self.state][action]
        self.num_steps += 1
        
        obs = self._one_hot(self.state)
        reward = self.rewards[self.state]
        done = (reward != 0.0) or (self.num_steps >= self.max_episode_length)
        
        return obs, reward, done, None

    def reset(self):
        self.state = np.random.randint(16)
        self.num_steps = 0
        
        # Don't spawn in a reward state
        while self.rewards[self.state] != 0.0:
            self.state = np.random.randint(16)

        return self._one_hot(self.state)

    def _one_hot(self, state):
        obs = np.zeros(16, dtype=np.float32)
        obs[state] = 1.0
        return obs
    
    def _render_coords(self, s):
        return ((s % 4) * 4, int(s / 4) * 4)

    def _render_maze(self):
        maze = np.zeros((17, 17))
        for x in range(0, 17, 4):
            maze[x, :] = .2
        for y in range(0, 17, 4):
            maze[:, y] = .2

        for s in range(16):
            x, y = self._render_coords(s)
            if self.P[s][0] == s:
                maze[x:x+5, y] = .5
            if self.P[s][1] == s:
                maze[x:x+5, y+4] = .5
            if self.P[s][2] == s:
                maze[x, y:y+5] = .5
            if self.P[s][3] == s:
                maze[x+4, y:y+5] = .5
            if self.rewards[s] != 0:
                maze[x+1:x+4, y+1:y+4] = self.rewards[s]
        return maze

    def render(self, mode='rgb_array'):
        assert mode == 'rgb_array', 'Unknown mode: %s' % mode
        img = np.array(self._background, copy=True)
        x, y = self._render_coords(self.state)
        img[x+1:x+4, y+1:y+4] = .8
        return img


### Experiment 2: Random Agent on FourRooms

We can now run the random agent in our new environment.  Run the cell below to set up a new experiment with the FourRooms environment.

In [None]:
class PolicyAgent(Agent):
    '''An agent that follows a fixed policy'''

    def __init__(self, observation_space, action_space):
        super(PolicyAgent, self).__init__(observation_space, action_space)
        # One action per state; Box spaces have no `n`, so use the observation size
        self._policy = [0] * observation_space.shape[0]
    
        # TODO: create a policy that will reach a fixed goal
    
    def step(self, obs, reward, done, info):
        for s in range(len(obs)):
            if obs[s] > 0.0:
                return self._policy[s]
        
        return self.action_space.sample()

env = FourRooms()
random_agent = RandomAgent(env.observation_space, env.action_space)
experiment = Experiment(env, random_agent, window_size=100, r_max=50)

In [None]:
experiment.run(num_steps=300, display_frequency=1)

**Exercise 1:** Consider the implementation of the FourRooms environment.  For RL to work, this environment must correspond to a Markov Decision Process (MDP).  Can you identify the key elements of an MDP here?  What are the state and action spaces for this environment?  How many states are there? What are the transition probabilities? What is the reward function?

**Exercise 2:** Now modify the environment so the goal location is fixed instead of random.  Take a look at the PolicyAgent class above.  Can you define a policy which allows this agent to reach the fixed goal?

## DQN Agent Implementation

At long last, we are finally ready to implement an actual reinforcement learning agent. In this section, we will implement a Deep Q-Network (DQN) agent (see [Mnih et al. 2015](https://www.nature.com/articles/nature14236/) for details).  Our implementation will have four components:

**Q-Network:** the deep network representing the state-action value function (the Q-function).

**Replay Buffer:** stores the agent's experience as (state, action, reward, next state) tuples.

**Exploration Strategy:** selects an action for the agent to take based on the predicted Q-values for those actions.

**DQN Agent:** the agent itself, which selects actions and performs the actual learning update.

These components are implemented in turn in the cells below.

**Note:** some of these implementations are incomplete - you will need to fill in the missing pieces in the exercises below.

### Q-Network

Here we represent our Q-function with a fully connected, two-layer network implemented in PyTorch.
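
To show what such a network computes, here is a minimal sketch of the forward pass written in plain Python rather than PyTorch: a linear layer, a ReLU, and a second linear layer that produces one value per action.  The weights below are hand-picked toy numbers for illustration, not anything the tutorial's network would learn.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(weights, bias, inputs):
    """Compute weights @ inputs + bias, with one row of weights per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def q_values(obs, hidden_w, hidden_b, output_w, output_b):
    """Two-layer forward pass: obs -> hidden (ReLU) -> one Q-value per action."""
    hidden = relu(linear(hidden_w, hidden_b, obs))
    return linear(output_w, output_b, hidden)


# Toy parameters: 3 observation features, 2 hidden units, 2 actions
hidden_w = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0]]
hidden_b = [0.0, -1.0]
output_w = [[1.0, 1.0], [-1.0, 0.5]]
output_b = [0.0, 0.25]

q = q_values([0.0, 1.0, 0.0], hidden_w, hidden_b, output_w, output_b)
```

The output list has the same length as the number of actions, whatever the observation is.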

**Exercise 3:** You do not need to add anything to this class, but you should take a moment to familiarize yourself with the code. Notice that the network does not take an action as input.  How do we get the Q-value prediction for a specific action?  How would we represent the Q-function for a continuous action space?

In [None]:
class QNetwork(nn.Module):
    
    def __init__(self, obs_size, num_actions, num_hidden):
        super(QNetwork, self).__init__()
        
        hidden = nn.Linear(obs_size, num_hidden)
        nn.init.kaiming_uniform_(hidden.weight.data, nonlinearity="relu")
        nn.init.uniform_(hidden.bias.data, a=-1e-4, b=1e-4)
        
        output = nn.Linear(num_hidden, num_actions)
        nn.init.kaiming_uniform_(output.weight.data, nonlinearity="relu")
        nn.init.uniform_(output.bias.data, a=-1e-4, b=1e-4)
        
        self._layers = nn.Sequential(hidden, nn.ReLU(), output)

    def forward(self, obs):
        return self._layers(torch.as_tensor(obs, dtype=torch.float32))

### Replay Memory

This class stores a collection of recent experiences, and samples mini-batches of experiences to train the Q-Network on.  You do not need to add anything here.
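
The key idea is a fixed-size ring buffer: once the memory is full, each new experience overwrites the oldest one.  The sketch below illustrates just that mechanism in plain Python.  `RingBuffer` is a hypothetical simplified class; unlike the ReplayMemory class it stores arbitrary items and does not look up the following observation when sampling.

```python
import random


class RingBuffer:
    """Fixed-capacity store that overwrites its oldest entry when full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.num_observed = 0
        self.items = [None] * max_size

    def observe(self, item):
        # The write position wraps around once max_size items have been seen
        index = self.num_observed % self.max_size
        self.items[index] = item
        self.num_observed += 1

    def __len__(self):
        return min(self.num_observed, self.max_size)

    def sample(self, batch_size, rng=random):
        """Sample uniformly, with replacement, from the stored items."""
        return [self.items[rng.randrange(len(self))] for _ in range(batch_size)]


buffer = RingBuffer(max_size=3)
for step in range(5):
    buffer.observe(step)

# Items 0 and 1 have been overwritten by items 3 and 4
batch = buffer.sample(4, random.Random(0))
```

Sampling at random from the buffer, instead of training on experiences in the order they occurred, reduces the correlation between consecutive training examples.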

In [None]:
class ReplayMemory(object):
    """Implements basic replay memory"""

    def __init__(self, obs_size, max_size=1000):
        self._max_size = max_size
        self._num_observed = 0
        
        self.samples = {
            'obs': np.zeros((max_size, obs_size), dtype=np.float32),
            'action': np.zeros(max_size, dtype=np.int64),
            'reward': np.zeros(max_size, dtype=np.float32),
            'done': np.zeros(max_size, dtype=np.float32)
        }
    
    def observe(self, obs, action, reward, done):
        index = self._num_observed % self._max_size
        self._num_observed += 1
        
        self.samples['obs'][index, :] = np.asarray(obs)
        self.samples['action'][index] = action
        self.samples['reward'][index] = reward
        self.samples['done'][index] = 1.0 if done else 0.0
    
    def sample_minibatch(self, minibatch_size):
        indices = np.random.randint(min(self._num_observed, self._max_size) - 1, size=minibatch_size)
        obs = torch.as_tensor(self.samples['obs'][indices, :])
        actions = torch.as_tensor(self.samples['action'][indices])
        rewards = torch.as_tensor(self.samples['reward'][indices])
        done = torch.as_tensor(self.samples['done'][indices])
        next_obs = torch.as_tensor(self.samples['obs'][indices + 1, :])

        return obs, actions, rewards, done, next_obs

### Exploration Strategy

These classes define different exploration strategies the DQN agent could use.  Exploration is an essential part of reinforcement learning.  Without trying new actions, the agent may miss opportunities to learn more about its environment and improve its policy.

**Exercise 4:** The cell below defines two classes, each describing a different exploration strategy, but they are incomplete.  Implement the epsilon-greedy exploration strategy.  (Optional) Implement the softmax exploration strategy.

**Hint:** Each exploration strategy just takes in the Q-values for the current state, and outputs an action for the agent to actually perform.
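
Separately from the exercise, it can help to see how the amount of exploration changes over time.  The QLearningAgent class later in this notebook multiplies epsilon by 0.95 after every epoch and never lets it drop below 0.05.  The sketch below reproduces that schedule on its own; the function name `epsilon_schedule` is an illustrative choice.

```python
def epsilon_schedule(initial, decay, floor, num_epochs):
    """Return epsilon before training and after each of `num_epochs` epochs."""
    epsilons = [initial]
    for _ in range(num_epochs):
        epsilons.append(max(floor, epsilons[-1] * decay))
    return epsilons


# Values taken from the agent's defaults: start at 0.1, decay by 0.95, floor 0.05
epsilons = epsilon_schedule(initial=0.1, decay=0.95, floor=0.05, num_epochs=20)
```

With these values epsilon halves within roughly fourteen epochs and then stays at the floor, so the agent keeps exploring a little for the rest of training.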

In [None]:
class EpsilonGreedyExplorer(object):
    """Implements an epsilon greedy exploration strategy"""
    
    def __init__(self, num_actions, epsilon=0.1):
        self.epsilon = epsilon
        self.num_actions = num_actions

    def next_action(self, numpy_q_values_for_state):

        # TODO: implement epsilon-greedy exploration - see Exercise 4

        return action_index
    

class SoftmaxExplorer(object):
    """Implements a softmax exploration strategy"""
    
    def __init__(self, num_actions, beta=1.0):
        self.beta = beta
        self.num_actions = num_actions

    def next_action(self, numpy_q_values_for_state):
        
        # TODO (Optional): implement softmax exploration - see Exercise 4

        return action_index

### QLearning Agent

Finally, we implement the DQN agent itself, using the Q-network, replay buffer, and exploration strategies we have defined above.  This implementation is not quite complete, however.

**Exercise 5:**  Familiarize yourself with the QLearningAgent class.  Notice that there is a TODO comment in the update_model function.  Here you will need to add the code to compute the target Q-values that the model will learn to predict.

**Hint:**  You should be able to do this with one line of code.
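
For reference, the standard DQN target for a sampled transition $(s_i, a_i, r_i, d_i, s'_i)$ can be written as follows.  Here $\hat{Q}$ denotes the target network, $\gamma$ the discount factor, and $d_i$ the done flag.  Using $(1 - d_i)$ to drop the bootstrap term at the end of an episode is the usual convention and is assumed here; turning the formula into PyTorch code is left as the exercise.

$$
y_i = r_i + \gamma \, (1 - d_i) \, \max_{a'} \hat{Q}(s'_i, a')
$$

The model network is then trained so that $Q(s_i, a_i)$ moves towards $y_i$, with the Huber loss measuring the difference.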

In [None]:
class QLearningAgent(Agent):
    """Q-Learning agent with function approximation."""

    def __init__(self, observation_space, action_space, **kwargs):
        super(QLearningAgent, self).__init__(observation_space, action_space)

        obs_size = observation_space.shape[0]
        num_actions = action_space.n
        
        self.model_network = QNetwork(obs_size, num_actions, kwargs.get('num_hidden', 128))
        self.target_network = QNetwork(obs_size, num_actions, kwargs.get('num_hidden', 128))
        self.target_network.load_state_dict(self.model_network.state_dict())
    
        self.explorer = kwargs.get('explorer', EpsilonGreedyExplorer(action_space.n, 0.1))
        self.memory = ReplayMemory(obs_size, kwargs.get('memory_size', 5000))
        self.optimizer = torch.optim.Adam(self.model_network.parameters(), kwargs.get('learning_rate', 0.01))

        self.gamma = kwargs.get('gamma', .99)
        self.minibatch_size = kwargs.get('minibatch_size', 32)
        self.epoch_length = kwargs.get('epoch_length', 100)
        
        self.num_steps = 0
        self.prev_action = None
        self.prev_obs = None
        
    def step(self, obs, reward, done, info):
        if self.num_steps > 0:
            self.memory.observe(self.prev_obs, self.prev_action, reward, done)

        action = self.explorer.next_action(self.model_network(obs).detach().numpy())

        # start training after 1 epoch
        if self.num_steps > self.epoch_length:
            self.update_model()

        self.num_steps += 1
        self.prev_action = action
        self.prev_obs = obs

        if self.num_steps % self.epoch_length == 0:
            self.target_network.load_state_dict(self.model_network.state_dict()) # Update target network
            
            # If epsilon greedy - decay epsilon after each epoch
            if isinstance(self.explorer, EpsilonGreedyExplorer):
                self.explorer.epsilon = max(0.05, self.explorer.epsilon * .95)

        return action
    
    def update_model(self):
        # sample minibatch
        obs, actions, rewards, dones, next_obs = self.memory.sample_minibatch(self.minibatch_size)
        
        # TODO: implement Q-target computation (max_a' Q_hat(s_next, a')), see Exercise 5
        # Q_targets = None

        # Get the current Q-values
        Q_values = self.model_network(obs)
        Q_values = torch.gather(Q_values, 1, actions.unsqueeze(-1)).squeeze(-1)
        
        # compute Huber loss - a smoothed L1 loss
        loss = torch.nn.functional.smooth_l1_loss(Q_values, Q_targets.detach())

        # perform model update
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

## Experiment 3: Deep Q-Networks on FourRooms

Now run the cells below to launch an experiment with DQN in our 'four rooms' environment.

In [None]:
env = FourRooms()
q_agent = QLearningAgent(env.observation_space, env.action_space)
experiment = Experiment(env, q_agent, window_size=100, r_max=50)

In [None]:
experiment.run(5000, 10)

Congratulations!  If you have reached this point in the tutorial, you will have successfully implemented and run your first experiment with deep RL.

**Exercise 6:** As you may have noticed, our DQN agent has a large number of configuration options (these are called **hyperparameters** in machine learning). These include the learning rate, the discount factor gamma, the number of hidden nodes in the Q-network, and the choice of exploration strategy.

Try changing some of these values and running the experiment again.  Which values have the biggest impact on performance, and what are the best values for these parameters?  How do you compare one set of hyperparameters against another?  What is the performance measure?

**(Optional) Exercise 7:** Try using your DQN agent to solve the 'MountainCar-v0' task we saw earlier in the tutorial.

## Further Reading

Congratulations! You have completed this RL tutorial.

Hopefully, this tutorial has piqued your interest in reinforcement learning. Here are a few more resources to help you get started.

**The RL Book:** For an in-depth treatment of RL, I highly recommend the Sutton and Barto book, now in its second edition: http://incompleteideas.net/book/the-book-2nd.html

**Code:** In this tutorial we implemented a DQN agent from scratch. A wide range of RL baselines and state-of-the-art algorithms are implemented in [chainerrl](https://github.com/chainer/chainerrl). Other popular RL implementations include OpenAI's [Spinning Up RL](https://spinningup.openai.com/en/latest/), the ray project's [RLLib](https://ray.readthedocs.io/en/latest/rllib.html), and Google's [Dopamine](https://github.com/google/dopamine).

**Conferences:**
For recent research in reinforcement learning, check out the topics discussed at [RLDM](http://rldm.org/) - an interdisciplinary conference on Reinforcement Learning and Decision Making. Other key conferences with a large portion of RL research are [ICML](https://www.icml.cc/), [ICLR](https://iclr.cc) and [NeurIPS](https://neurips.cc/). A popular event in Europe is the European Workshop on Reinforcement Learning [EWRL](https://ewrl.wordpress.com).

## (Optional) DQN on MineRL

Deep reinforcement learning has been able to solve extremely difficult control problems, including learning to play modern video games.  In this section, we will apply our DQN agent to the [MineRL](http://minerl.io/) environment, built around the popular video game [Minecraft](https://www.minecraft.net/en-us/).  MineRL was developed by a team led by [William H. Guss](http://wguss.ml/) and [Brandon Houghton](https://github.com/brandonhoughton) for the NeurIPS 2019 MineRL competition, hosted by AICrowd and sponsored by Microsoft. MineRL is based on [Project Malmo](https://www.microsoft.com/en-us/research/project/project-malmo/), developed at [Microsoft Research](https://www.microsoft.com/en-us/research/theme/game-intelligence/).

### Install MineRL and Dependencies

Install the MineRL package.  More detailed, platform-specific installation instructions can be found at: http://minerl.io/docs/tutorials/index.html

In [None]:
!pip install --upgrade minerl

In [None]:
!pip install --upgrade opencv-python-headless

Import the MineRL package and the OpenCV dependency.

In [None]:
# environments
import minerl
import cv2

# Uncomment in case Minecraft fails to start, to help with debugging:
#import sys
#import logging
#logger = logging.getLogger("minerl")
#logger.setLevel(logging.DEBUG)
#logger.addHandler(logging.StreamHandler(sys.stdout))

Run the cell below to test that you can create a MineRL environment and interact with it.

In [None]:
# Start Minecraft and create a MineRL environment - be patient, this will take several minutes
minerl_env = gym.make('MineRLNavigateDense-v0')

# If you turned on debugging above, quieten things down by uncommenting the below:
#logger.setLevel(logging.INFO)

# Test that you can reset and interact with the MineRL environment
minerl_env.reset()

for _ in range(100):
    action = minerl_env.action_space.sample()
    minerl_env.step(action)
    
minerl_env.close()

### Wrap the MineRL Environment

Training an RL agent on the full MineRL environment would take hours, so we will simplify things to make training feasible within a few minutes.  Run the cells below to build a discrete action wrapper for the MineRL navigation task.
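
The core of such a wrapper is a mapping from a small set of integer actions to the dictionary-style actions the underlying environment expects.  The sketch below shows that mapping without starting Minecraft.  `noop_action` is a hypothetical stand-in for the environment's no-op action, and its keys cover only the fields used in this tutorial.

```python
def noop_action():
    """Hypothetical stand-in for the environment's 'do nothing' action dictionary."""
    return {'forward': 0, 'back': 0, 'jump': 0, 'attack': 0, 'camera': [0.0, 0.0]}


def to_dict_action(action, compass_angle, turn_gain=0.03):
    """Translate a discrete action index into a dictionary action."""
    base = noop_action()
    base['jump'] = 1    # always jump and attack, as in the wrapper below
    base['attack'] = 1

    if action == 0:
        base['forward'] = 1  # move forward
    elif action == 1:
        # turn towards the compass direction, proportionally to the angle
        base['camera'] = [0.0, turn_gain * compass_angle]
    elif action == 2:
        base['back'] = 1  # move back
    else:
        raise ValueError('Unknown action: %r' % (action,))
    return base


forward = to_dict_action(0, compass_angle=0.0)
turn = to_dict_action(1, compass_angle=100.0)
```

Keeping the action set this small means the Q-network only has to output three values, which is a large part of what makes training feasible in a few minutes.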

In [None]:
class DiscreteMinecraftNavigation(gym.Env):
    '''Wrap the MineRL navigation environment to discretize actions and simplify observations'''

    def __init__(self):
        self.env = gym.make('MineRLNavigateDense-v0')
        # Bounds assumed: pixel values lie in [0, 255], the compass angle in [-180, 180]
        self.observation_space = gym.spaces.Box(-180.0, 255.0, (28,))
        self.action_space = gym.spaces.Discrete(3)
        self.steps_this_episode = 0

    def reset(self):
        self.obs, _ = self.env.reset()
        self.steps_this_episode = 0
        return self._convert_obs(self.obs)

    def step(self, action):
        self.steps_this_episode += 1
        self.obs, self.reward, self.done, self.info = self.env.step(self._convert_action(action))
        
        # Make the reward signal more dense
        if action == 0:
            if self.obs['compassAngle'] < 1:
                self.reward = .5
            else:
                self.reward = .1
        else:
            self.reward = -.3
        
        return self._convert_obs(self.obs), self.reward, self.done, self.info

    def _convert_obs(self, obs):
        # constructs obs of size 3 x 3 x 3 + 1 = 28
        low_res = cv2.resize(obs['pov'], dsize=(3, 3), interpolation=cv2.INTER_NEAREST)
        return np.float32(np.hstack([low_res.flatten(), obs['compassAngle']]))

    def _convert_action(self, action):
        base_action =  self.env.action_space.noop()
        base_action['jump'] = 1
        base_action['attack'] = 1

        if action == 0:
            # move forward
            base_action['forward'] = 1
        elif action == 1:
            # turn towards the compass direction
            base_action['camera'] = [0, 0.03 * self.obs['compassAngle']]
        elif action == 2:
            # move back
            base_action['back'] = 1
        else:
            raise NotImplementedError('Action %d is not implemented.' % action)

        return base_action

    def render(self, mode='rgb_array'):
        return self.obs['pov']
    
    def close(self):
        self.env.close()

### Experiment 4: DQN on MineRL

Now we're ready to test our DQN agent on our discretized Minecraft navigation task. Again, if everything is implemented correctly, the reward should go up in fewer than 3000 training steps.

In [None]:
nav_env = DiscreteMinecraftNavigation()
minerl_agent = QLearningAgent(nav_env.observation_space, 
                              nav_env.action_space,
                              learning_rate=0.5,
                              memory_size=10000,
                              num_hidden=512)
experiment = Experiment(nav_env, minerl_agent, window_size=100, r_max=1)

In [None]:
# Run the experiment for 5000 steps, visualize every 10 steps
# If implemented correctly, DQN should learn to exceed a reward of 0 in fewer than 5000 steps
experiment.run(5000, 10)