# Installing the necessary packages

(These instructions are also in the `setup.md` file.)

We recommend using [Anaconda](https://docs.conda.io/en/latest/miniconda.html) to manage Python environments. Follow the installation instructions at that link.

Create a new environment: `conda create -n rl python=3.7`

Make sure you change your conda environment to the new `rl` environment you created above: `conda activate rl`

Then install OpenAI's [gym package](http://gym.openai.com/docs/): `pip install gym`

Install jupyter to use notebooks: `pip install jupyter`

Make sure you also install these: `pip install numpy seaborn matplotlib`

Clone this reinforcement learning repository: `git clone https://github.com/wingillis/reinforcement-learning.git`. Note where you saved the repository (e.g., run `pwd` inside it to get the full folder path), because you will use this location in the notebooks.


# How to use gym

[Link](http://gym.openai.com) to more documentation

[Link](http://gym.openai.com/envs/#atari) to some default environments

`gym` is a package developed by OpenAI that provides reproducible, complex
environments for benchmarking reinforcement learning algorithms.
It is designed so that the RL agent needs as little built-in knowledge as possible
about how to interact with its environment, and the gym API reflects that.

It contains many default environments, including:

- simple text-based environments, such as `Taxi-v3` or `Blackjack-v0`
- simple graphical environments, such as `CartPole-v1` and `Pendulum-v0`
- Atari games, such as `Breakout-v0`

It's easy to extend the environment interface and make your own. Today we're going
to use a custom environment that reflects the Sutton and Barto textbook.

`gym` is simple to use. Its main components allow you to:

- specify which environment you want to use
- render the environment
- take an action

You can define any function you like, and use any python package you want
(e.g. tensorflow, pytorch, numpy, autograd) to decide which actions
to take given the current state of the environment.
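
As a minimal sketch of such a decision function, the snippet below picks an action from a table of estimated action values using an epsilon-greedy rule. The names `epsilon_greedy` and `q_values`, and the numbers in the table, are made up for illustration; they are not part of gym or of this repository.

```python
import random


def epsilon_greedy(q_values, state, epsilon=0.1, rng=random):
    """Pick an action for `state` from a table of estimated action values.

    q_values is a hypothetical dict mapping state -> list of values,
    one value per action.
    """
    n_actions = len(q_values[state])
    if rng.random() < epsilon:
        # explore: pick any action uniformly at random
        return rng.randrange(n_actions)
    # exploit: pick the action with the highest estimated value
    return max(range(n_actions), key=lambda a: q_values[state][a])


q_values = {20: [0.5, -1.0, -2.0, -2.0]}  # made-up values for one state
action = epsilon_greedy(q_values, 20, epsilon=0.0)  # epsilon=0 is purely greedy
print(action)  # 0
```

Any function with this shape (state in, action out) can be plugged into the interaction loop, whether it is backed by a lookup table or a neural network.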

Let's set up an example to understand the basics.

In [1]:
import sys
import gym

# add the reinforcement learning repository you cloned to your system path
# (replace this with the folder path on your machine)
sys.path.append('/Users/wingillis/dev/reinforcement-learning')

# import a basic gridworld environment
from lib.envs.gridworld import GridworldEnv

Here, we are instantiating a new gridworld environment, containing two reward locations:
one in the top left corner, and one in the bottom right corner.

In [2]:
# instantiate the environment
env = GridworldEnv(shape=(5, 5))

Every time you want to begin a new episode of training, you'll want to reset the environment.

Here, resetting the environment will place the agent in a random spot in the environment.

In [3]:
# you can deterministically start in the same place by setting the seed.
env.seed(1)

position = env.reset()
print(position)

20


Let's render the current environment.

Here, the two `T` symbols represent the goals, while the `x`
in the bottom left corner represents the agent's position.

In [4]:
env.render('human')

T  o  o  o  o
o  o  o  o  o
o  o  o  o  o
o  o  o  o  o
x  o  o  o  T
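
The state returned by `reset()` is a single integer, while the rendering is a grid. The sketch below converts between the two, assuming the cells are numbered row by row starting from the top-left corner (which is consistent with state 20 being drawn in the bottom-left corner). The helper names `state_to_position` and `position_to_state` are hypothetical.

```python
def state_to_position(state, shape=(5, 5)):
    """Convert a flat state index to (row, column), assuming row-major order."""
    _, n_cols = shape
    return divmod(state, n_cols)


def position_to_state(row, col, shape=(5, 5)):
    """Convert (row, column) back to the flat state index."""
    _, n_cols = shape
    return row * n_cols + col


print(state_to_position(20))    # (4, 0): last row, first column
print(position_to_state(4, 0))  # 20
```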


This is how you get the number of actions and states the current environment supports:

In [5]:
print('# actions:', env.nA)
print('# states:', env.nS)

# actions: 4
# states: 25


Now let's perform an action. In the gridworld environment, we can perform
one of four different actions:

- 0: up
- 1: right
- 2: down
- 3: left
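
To make the action numbering concrete, here is a rough sketch of how each action could change the agent's state index. It assumes row-major state numbering and that moving into a wall leaves the agent where it is; it also ignores the special handling of the goal states. `ACTION_DELTAS` and `move` are hypothetical names, not the repository's implementation.

```python
# (row change, column change) for each action index
ACTION_DELTAS = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}


def move(state, action, shape=(5, 5)):
    """Return the state reached by taking `action` in `state`."""
    n_rows, n_cols = shape
    row, col = divmod(state, n_cols)
    d_row, d_col = ACTION_DELTAS[action]
    # clamp to the grid so that walking into a wall leaves the agent in place
    row = min(max(row + d_row, 0), n_rows - 1)
    col = min(max(col + d_col, 0), n_cols - 1)
    return row * n_cols + col


print(move(20, 0))  # 15: up from the bottom-left corner
print(move(20, 3))  # 20: moving left into the wall changes nothing
```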

When we perform an action, we get back an observation, a reward, a flag indicating
whether the episode has ended, and any extra information about the environment.

In [6]:
env.step?

Signature: env.step(a)
Docstring:
Run one timestep of the environment's dynamics. When end of
episode is reached, you are responsible for calling `reset()`
to reset this environment's state.

Accepts an action and returns a tuple (observation, reward, done, info).

Args:
    action (object): an action provided by the agent

Returns:
    observation (object): agent's observation of the current environment
    reward (float) : amount of reward returned after previous action
    done (bool): whether the episode has ended, in which case further step() calls will return undefined results
    info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
File:      ~/miniconda3/envs/rl/lib/python3.7/site-packages/gym/envs/toy_text/discrete.py
Type:      method


In [7]:
next_state, rew, done, info = env.step(0)  # let's move up one
env.render('human')

T  o  o  o  o
o  o  o  o  o
o  o  o  o  o
x  o  o  o  o
o  o  o  o  T


In [8]:
print(next_state, rew, done, info)

15 -1.0 False {'prob': 1.0}


In this specific environment, you can inspect the entire model of the environment through the `P` attribute, a dictionary keyed first by state and then by action (the output below is truncated).

In [9]:
env.P

{0: {0: [(1.0, 0, 0.0, True)],
  1: [(1.0, 0, 0.0, True)],
  2: [(1.0, 0, 0.0, True)],
  3: [(1.0, 0, 0.0, True)]},
 1: {0: [(1.0, 1, -1.0, False)],
  1: [(1.0, 2, -1.0, False)],
  2: [(1.0, 6, -1.0, False)],
  3: [(1.0, 0, -1.0, True)]},
 2: {0: [(1.0, 2, -1.0, False)],
  1: [(1.0, 3, -1.0, False)],
  2: [(1.0, 7, -1.0, False)],
  3: [(1.0, 1, -1.0, False)]},
 3: {0: [(1.0, 3, -1.0, False)],
  1: [(1.0, 4, -1.0, False)],
  2: [(1.0, 8, -1.0, False)],
  3: [(1.0, 2, -1.0, False)]},
 4: {0: [(1.0, 4, -1.0, False)],
  1: [(1.0, 4, -1.0, False)],
  2: [(1.0, 9, -1.0, False)],
  3: [(1.0, 3, -1.0, False)]},
 5: {0: [(1.0, 0, -1.0, True)],
  1: [(1.0, 6, -1.0, False)],
  2: [(1.0, 10, -1.0, False)],
  3: [(1.0, 5, -1.0, False)]},
 6: {0: [(1.0, 1, -1.0, False)],
  1: [(1.0, 7, -1.0, False)],
  2: [(1.0, 11, -1.0, False)],
  3: [(1.0, 5, -1.0, False)]},
 7: {0: [(1.0, 2, -1.0, False)],
  1: [(1.0, 8, -1.0, False)],
  2: [(1.0, 12, -1.0, False)],
  3: [(1.0, 6, -1.0, False)]},
 8: {0: [(1.0, 
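
Each entry `P[state][action]` is a list of `(probability, next_state, reward, done)` tuples. As a small sketch of how to read them, the snippet below copies the transitions for state 1 from the output above and summarizes them. The helper names `expected_reward` and `terminal_actions` are hypothetical.

```python
# transitions for state 1, copied from the output above;
# each tuple is (probability, next_state, reward, done)
P_state_1 = {
    0: [(1.0, 1, -1.0, False)],
    1: [(1.0, 2, -1.0, False)],
    2: [(1.0, 6, -1.0, False)],
    3: [(1.0, 0, -1.0, True)],
}


def expected_reward(transitions):
    """Probability-weighted reward of one (state, action) pair."""
    return sum(prob * reward for prob, _, reward, _ in transitions)


def terminal_actions(state_transitions):
    """Actions that have at least one transition ending the episode."""
    return [
        action
        for action, transitions in state_transitions.items()
        if any(done for _, _, _, done in transitions)
    ]


print(expected_reward(P_state_1[3]))  # -1.0
print(terminal_actions(P_state_1))    # [3]: moving left from state 1 reaches the goal
```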

Now, let's take some random actions and see how our agent moves.

In [10]:
import time
import random
from IPython.display import clear_output

In [11]:
random.seed(23)
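
Seeding makes the "random" actions repeatable from run to run. The sketch below illustrates this with an independent `random.Random` generator rather than the module-level one; `sample_actions` is a hypothetical helper used only for this demonstration.

```python
import random


def sample_actions(seed, n_steps=5, n_actions=4):
    """Draw a sequence of random actions from a freshly seeded generator."""
    rng = random.Random(seed)  # separate generator, leaves the global state alone
    return [rng.randint(0, n_actions - 1) for _ in range(n_steps)]


first = sample_actions(23)
second = sample_actions(23)
print(first == second)  # True: the same seed gives the same action sequence
```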

In [12]:
def select_action():
    # ignore the state and pick one of the four actions uniformly at random
    return random.randint(0, 3)

In [13]:
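n_steps = 30
for t in range(n_steps):
    action = select_action()
    # env.P shows state 0 mapping every action back to itself, so stepping
    # after the agent reaches that goal leaves it in place
    _ = env.step(action)
    # wait=True delays clearing until new output arrives, which reduces flicker
    clear_output(wait=True)
    env.render('human')
    time.sleep(1)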

x  o  o  o  o
o  o  o  o  o
o  o  o  o  o
o  o  o  o  o
o  o  o  o  T
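
The loop above always runs for a fixed number of steps. A more typical pattern is to stop as soon as `done` is returned and to keep track of the total reward. The sketch below shows that pattern with `TinyGridworld`, a hypothetical stand-in written only for this example so that it runs without the repository; it mimics the `reset()`/`step()` interface and the rewards seen in the `env.P` output, but it is not the actual `GridworldEnv` implementation.

```python
import random


class TinyGridworld:
    """Hypothetical stand-in for GridworldEnv, written only for this sketch."""

    # (row change, column change) for up, right, down, left
    DELTAS = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

    def __init__(self, shape=(5, 5), seed=None):
        self.shape = shape
        self.n_states = shape[0] * shape[1]
        self.terminals = {0, self.n_states - 1}  # top-left and bottom-right
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        # start each episode in a uniformly random cell
        self.state = self.rng.randrange(self.n_states)
        return self.state

    def step(self, action):
        # terminal states are absorbing and give no further reward
        if self.state in self.terminals:
            return self.state, 0.0, True, {}
        n_rows, n_cols = self.shape
        row, col = divmod(self.state, n_cols)
        d_row, d_col = self.DELTAS[action]
        row = min(max(row + d_row, 0), n_rows - 1)
        col = min(max(col + d_col, 0), n_cols - 1)
        self.state = row * n_cols + col
        return self.state, -1.0, self.state in self.terminals, {}


def run_episode(env, policy, max_steps=1000):
    """Run one episode; return the total reward and the number of steps taken."""
    state = env.reset()
    total_reward, n_steps = 0.0, 0
    while n_steps < max_steps:
        state, reward, done, _ = env.step(policy(state))
        total_reward += reward
        n_steps += 1
        if done:
            break
    return total_reward, n_steps


rng = random.Random(23)
env = TinyGridworld(seed=1)
total_reward, n_steps = run_episode(env, lambda state: rng.randint(0, 3))
print(total_reward, n_steps)
```

With a reward of -1 per move, the total reward of an episode is the negative of the number of moves made, so a policy that reaches a goal faster scores higher.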
