# [Click here for workshop instructions](https://msu-ai.notion.site/Workshop-Instructions-eb2c76481d4c4a1f92824ee5bfd80536)

# Step 1: Installation and Setup

To make OpenAI Gym work in Google Colab, we need to install some special packages and set up an imaginary monitor, since the Google Colab servers don't have any displays. The following block of code installs everything you need automatically. So just run it, without making any changes, using the play button:

In [None]:
!apt-get install -y xvfb python-opengl x11-utils
!pip install gym==0.23.0 pyvirtualdisplay==0.2.5 pygame mujoco_py

Next, run the following code to set up a "virtual display" so that OpenAI Gym can render the environment to the screen, even though the python server doesn't have a *real* screen.

In [None]:
from pyvirtualdisplay import Display

# Use a separate name so the IPython `display` import below doesn't overwrite it
virtual_display = Display(visible=0, size=(400, 300))
virtual_display.start()

# Define a show_video function that will be useful later
import io
from base64 import b64encode
from IPython.display import display, HTML

def show_video(filename, width=400):
  # Read the video file and embed it in the notebook as a base64 data URL
  with io.open(filename, "rb") as file:
    video = file.read()
  data_url = "data:video/mp4;base64," + b64encode(video).decode()
  print(filename)
  display(HTML(f"""<video width="{width}" controls autoplay><source src="{data_url}" type="video/mp4"></video>"""))

Excellent! Assuming all went well, you are now ready to begin using OpenAI Gym right here in Google Colab.

# Step 2: Welcome to the Environment

OpenAI Gym works by providing "environments", which are like scenarios that you hope the computer can learn how to navigate. That's why they call it "gym". It's a wonderland full of many different challenges for the computer to train on.

<center>
  <img alt="Six different OpenAI Gym environments" src="https://blog.paperspace.com/content/images/size/w1750/2020/11/openaigym.jpg" width="600" />
  <div><i>OpenAI Gym provides many different environments in which to train.</i></div>
</center>

In particular, OpenAI Gym is designed to be a training ground for **reinforcement learning**. Reinforcement learning is generally used when an *agent* (such as the player in a video game) needs to learn how to navigate an *environment*.

If you haven't seen it before, I highly recommend checking out this short video from OpenAI about using reinforcement learning to train robots to play hide-and-seek:

<center>
  <a href="https://youtu.be/kopoLzvh5jY" target="_blank">
    <img src="https://i.imgur.com/3OVIi32.png" alt="Multi-Agent Hide and Seek on Youtube" width="400" />
  </a>
</center>

In the above video, the agents are playing in a hide-and-seek *environment*. Designing these environments is tricky and can take some time, so OpenAI has kindly created "OpenAI Gym", an open source python package that provides a wide variety of ready-to-use environments in which to experiment.

Today, we'll be using the `"CartPole-v1"` environment, although the techniques we use generalize to a large number of different environments. CartPole is a 2D physics simulation of a cart that can move left and right, with a swinging pendulum attached. The goal of the agent is to drive the cart and keep the swinging pole balanced. It's a little bit like trying to balance a broom on your hand.

Let's begin by exploring the environment.

First, run the following code to create a `"CartPole-v1"` environment. (You can see the full list of environments [here](https://gym.openai.com/envs/#classic_control), but this is what we're going to work with for now. In the future, you might want to come back and try again with a different environment.)

Just run this code with no changes:

In [None]:
import gym
env = gym.make("CartPole-v1")

print(env)

Great! Now we have the environment stored in a variable called `env`. Next, to make sure things are set up and ready to go, let's run `env.reset()`, and store the result in a variable called `observation`. More on that later.

In [None]:
observation = env.reset()

Okay. With the environment set up, we're finally ready to view what's going on. The environment has a method called `env.render()` that displays the environment on-screen. Usually just calling `env.render()` opens a popup window showing the environment, but that won't work in Google Colab. So we have to get a bit more fancy.

The following code gets the environment image as a numpy array and then uses matplotlib to display it. (If you were here for our OpenCV workshop, you can imagine how you might draw your own graphics on top of the default output.)

Alright... Run the following code to render the environment as we've described:

In [None]:
import matplotlib.pyplot as plt
plt.imshow(env.render(mode="rgb_array"))

Amazing! If all went well, you should see the cart with a brown pole sticking up out of it.

One important thing to notice is that **every time you call `env.reset()`, the environment is reset to a slightly different, random starting state**. Sometimes the pole is tilted slightly right. Other times it's tilted slightly left. And sometimes it's very nearly vertical, so you can't tell that it's tilted at all. The position of the cart also varies slightly, as do the velocity values, but those are harder to see.

Check that this is true by going back, re-running the code that says `observation = env.reset()`, and then re-running the rendering code. If you do this a bunch of times, you'll see that it looks slightly different each time.

---

Rendering the environment is awesome, but if we want to write code, we really need to understand the environment in numbers. What are the x position of the cart and the pendulum's angle? And what are the two corresponding velocities?

Well, when we ran `env.reset()`, we stored the result in a variable called `observation`. That variable contains all the relevant data.

Run this code to view the `observation` values:

In [None]:
print(observation)

As you can see, `observation` contains four numbers, as expected. But these numbers aren't labelled, so we don't know what they mean! Which one is position? Velocity? Angle? Angular velocity? We don't know.

This is intentional. We want the computer to learn how to balance the pole from scratch, without us telling it anything. So all it will see is the four numbers, without any explanation of what they mean or how they change over time.

But for our own sake, it would be nice to know which is which.

**Question:** Which number corresponds to the angle of the pendulum?

**Hint:** To figure this out, you'll need to go back and reset the environment a bunch of times. After each reset, render the environment and then look at the observation values. When the pendulum is leaning right, that means its angle is positive. When it's leaning left, the angle is negative. With this information in mind, can you figure out which number is the angle?

---

(Scroll down for spoilers.)

---

.  
.  
.  
.  

As you probably figured out, the order is...

0. Cart position
1. Cart velocity
2. **Pendulum angle**
3. Pendulum angular velocity

OpenAI Gym uses the term "observation space" to describe the set of possible observation values. Thinking in terms of these "spaces" makes it possible to create reinforcement learning algorithms that work in a wide variety of environments.

If you want to learn more about the possible observation values for an environment, you can use `env.observation_space`. For example, the following code uses the built-in `.sample()` method to get a random possible observation value:

In [None]:
print(env.observation_space.sample())

Run the above code multiple times and you'll get a different random sample each time.

You can also find the minimum and maximum possible observation values using the code below:

In [None]:
print("Minimum allowed values:", env.observation_space.low)
print("Maximum allowed values:", env.observation_space.high)

So the cart position will always be between -4.8 and 4.8, and so on.

So this "spaces" idea lets us represent the possible observation values. But as we continue our simulation beyond the first frame, our agent is going to do more than *observe*. We want to *act*! This is where the `action_space` comes in. Let's see what `env.action_space` is:

In [None]:
env.action_space

There's not a lot of information here. `Discrete(2)` just means that there are two discrete possible actions (as opposed to something continuous, like all the real numbers between 0 and 1).

Let's take a random sample to see what those values can be:

In [None]:
env.action_space.sample()

If you run the above code a few times, you'll see that the possible actions are `0` or `1`. These correspond to moving the cart **left** or moving the cart **right** respectively.

In our simulation, we advance forward one frame using either `env.step(0)` or `env.step(1)`. If we pass in the `0` action, the cart will move a little bit to the left and then the pendulum will respond accordingly. And if we pass in `1`, the cart will move a little bit to the right.

The following code takes a step, always moving the cart left, and then re-renders. Run it a bunch of times and watch how the cart slowly moves left (and thus the pendulum tips right):

In [None]:
# Step forward, moving the cart to the left
env.step(0)

# Render the environment to see what changed
plt.imshow(env.render(mode="rgb_array"))

If you go long enough, the pendulum will fall completely. When that happens, run the following code to reset. Then you can go back and do it again.

In [None]:
env.reset()

Try resetting and then changing the above code so that instead of moving left, the cart moves right. Watch what happens in that case. (Hopefully it is exactly what you expect.)

---

Alright. Remember how `env.reset()` gives back an `observation` value? Well, every time we take a step, we want to update `observation` to reflect the new state of the world.

Fortunately, `env.step()` also returns information that we can read. But it returns more than just `observation`. It also returns values called `reward`, `done`, and `info`.

Run the following code to see how that works:

In [None]:
# Reset the environment and print the results:
observation = env.reset()

print("Observation (from reset):", observation)
print()

# Next, take a step and print the results:
observation, reward, done, info = env.step(0)

print("Observation (from step):", observation)
print("Reward (from step):", reward)
print("Done (from step):", done)
print("Info (from step):", info)

As you can see, the observation values change slightly after the step. (This makes sense because the cart and pendulum have moved slightly.)

But we also get `reward`, `done`, and `info`. What are they? We'll talk about them in reverse order:
- `info` provides extra information about the environment. In some of the more complicated environments that OpenAI Gym provides, this is important. But the `"CartPole-v1"` that we're using doesn't provide any additional information, so this is totally useless to us.
- `done` becomes `True` when the simulation is complete. In the `"CartPole-v1"` environment, the simulation is done when the pole tips too far or the cart drives too far off to one side (i.e. you lose the game), or when the episode reaches its time limit. As we write more complicated code, we're supposed to call `env.reset()` as soon as `done` becomes `True`. (We haven't been doing this so far, so you might have seen warnings earlier complaining about this.)


And finally, `reward`. `reward` is really important, because it lies at the heart of reinforcement learning.

**What is reward?** Reinforcement learning works a lot like training a dog. The "agent" (computer) doesn't know what to do or what the goal is, so it begins by just trying random things. (This is called *exploration*.) Then, when the computer does something good, we give it a treat. And when it does something bad, we give it a punishment. Over time, the computer remembers what works and tries to do it again. Eventually, it knows how to earn lots and lots of treats, so that's what it does. (This is called *exploitation*.) So if we can give treats and punishments based on how the computer is doing, it should learn what to do eventually.

The `reward` value is the treat or punishment for the computer. A positive number is a treat (bigger is better) and a negative number is a punishment. The computer's goal is to get the biggest `reward` possible over time. In the case of `"CartPole-v1"`, since the goal is just to survive as long as possible, the reward is always `1.0` (a single point for each frame you stay alive). After you die, and `done` becomes `True`, then `reward` becomes `0.0`.
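
To make the `reward` and `done` bookkeeping concrete, here is a minimal sketch. It uses a made-up stand-in environment (`ToyEnv` and `run_episode` are hypothetical names, not part of OpenAI Gym) whose `step()` returns the same four values described above. It hands out `1.0` per frame and reports `done` after a fixed number of frames, so the total reward is simply the number of frames survived.

```python
class ToyEnv:
    """Hypothetical stand-in for a Gym environment (not part of OpenAI Gym)."""

    def __init__(self, frames_until_done=9):
        self.frames_until_done = frames_until_done
        self.frame = 0

    def reset(self):
        self.frame = 0
        return [0.0, 0.0, 0.0, 0.0]  # placeholder observation

    def step(self, action):
        self.frame += 1
        done = self.frame >= self.frames_until_done
        # Same return shape as env.step(): observation, reward, done, info
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}


def run_episode(env):
    """Step until done and add up every reward along the way."""
    env.reset()
    total_reward = 0.0
    while True:
        _, reward, done, _ = env.step(0)
        total_reward += reward
        if done:
            return total_reward


total = run_episode(ToyEnv(frames_until_done=9))
print("Total reward:", total)
```

A longer-lived episode earns a larger total, which is exactly what the agent is trying to maximize.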

---

One last thing before we move on... Right now we are re-running the same code block over and over again, stepping the environment forward by one frame at a time. This was fine for testing, but we really want to run the entire simulation at once and then watch a video of what happened. To do this, we'll put the `env.step()` updates inside of a `while True:` loop, and then we'll `break` out of the loop whenever `done` becomes `True` (i.e. the simulation is finished).

Additionally, we'll use the built-in video recorder to record what happened and then play it back after the fact.

The following code does most of this, but there are a few blanks to fill in. Refer to the previous code to remember how to do it:

In [None]:
import gym
from gym.wrappers.monitoring.video_recorder import VideoRecorder

# TODO: Create the environment
env = # ???

# Create the video recorder
video = VideoRecorder(env, "moving-left.mp4")

# TODO: Reset the environment
observation = # ???

while True:
  # Capture the current frame and save it in the video
  video.capture_frame()
  
  # TODO: Take a step, moving the cart to the left:
  observation, reward, done, info = # ???

  if done:
    break

video.close()
env.close()

# Show the recorded video to see what happened
show_video("moving-left.mp4")

If all went well, you should see a very short video (less than 1 second) of the cart moving left and the pendulum tipping right. The video is extremely short because once the pendulum begins to fall, `done` becomes `True`, so the "episode" of the environment simulation is complete.

If the computer was more successful at balancing the pendulum, the video would be longer.

# Step 3: Manual strategy coding
Just moving left constantly is a terrible strategy for balancing the pendulum. Just about the worst possible strategy, in fact. It would be much more effective if we just made random moves (left or right) each frame instead.

And we can do that! Remember `env.action_space.sample()`? It returns a random possible action (so either a `0` or a `1`).

Try using that function to pick a random action (left or right) on each frame, and generate a video of the computer doing that:

In [None]:
import gym
from gym.wrappers.monitoring.video_recorder import VideoRecorder

env = gym.make("CartPole-v1")
video = VideoRecorder(env, "before-training.mp4")

observation = env.reset()

while True:
  video.capture_frame()
  
  # TODO: Choose an action (either 0 or 1) randomly
  action = # ???

  # Take a step using the chosen action
  # (This line of code is already correct)
  observation, reward, done, info = env.step(action)

  if done:
    break

video.close()
env.close()

# Show the recorded video to see what happened
show_video("before-training.mp4")

If your code is correct (choosing a random action each frame), then it will obviously work out a bit differently each time. But if you happen to get lucky, the computer might alternate right/left/right/left for a while, and survive with the pendulum up for at least a few frames.

But perhaps we can design a better strategy.

**Here's an idea:** When the pendulum is tipping to the right, move right. And when the pendulum is tipping to the left, move left.

Can you code this? (See [the instructions](https://msu-ai.notion.site/Workshop-Instructions-eb2c76481d4c4a1f92824ee5bfd80536#e4eea15ff7b445da819cabfdd232e88e) for hints.)

In [None]:
import gym
from gym.wrappers.monitoring.video_recorder import VideoRecorder

env = gym.make("CartPole-v1")
video = VideoRecorder(env, "before-training.mp4")

observation = env.reset()

while True:
  video.capture_frame()
  
  # TODO: Choose whether to move left (0) or right (1)
  # depending on the current `observation` data
  if # ????
    # Move the cart left
    action = 0
  else:
    # Move the cart right
    action = 1

  # Take a step using the chosen action
  # (This line of code is already correct)
  observation, reward, done, info = env.step(action)

  if done:
    break

video.close()
env.close()

# Show the recorded video to see what happened
show_video("before-training.mp4")

This strategy actually seems like it's close to working! The game doesn't last super long (still under 1 second, usually), but it feels like the computer's behavior is at least beginning to be intelligent.

If you want to, you can try designing an even smarter strategy that takes into account the velocity of the pendulum (`observation[3]`). It is possible to hand-craft a pretty much perfect strategy this way.

But what we really want is for the computer to learn a winning strategy all on its own. And this is where reinforcement learning comes in.

# Step 4: Reinforcement Learning

We want the computer to learn all on its own, so reinforcement learning is the right tool for the job. There are many different methods of reinforcement learning, but today we're going to apply one of the very simplest: **Q-learning**.

As discussed in the presentation, Q-learning works by building a big table of "states" (situations you can be in) and actions, where each cell describes whether the action is a good or bad idea in the given state.

But there's a problem: The table is discrete (it has a finite number of columns), but the actual observation space is continuous. The pendulum angle, for example, could be *any number* between -0.419rad and 0.419rad (which is -24deg to 24deg). We can't make a new table column for *every possible angle/velocity combination*. Then our table would be absurdly huge.

The trick here is to "discretize" the observation space by splitting it into buckets. We could create one bucket for angles between -24deg and -16deg, then another for angles between -16deg and -8deg, and so on. Then we would have six buckets that look like this:

<center>
  <img alt="Splitting the possible angles into six discrete buckets" src="https://i.imgur.com/bF5AJU9.png" />
  <div>
    <i>We can split the possible angles into discrete buckets.</i>
  </div>
</center>

Then we make one column for each bucket, and say that everything in the purple range, for example, is basically the same.
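
As a rough illustration of the bucketing idea, the sketch below maps an angle in degrees to one of six equal-width buckets using plain arithmetic. The function name `angle_bucket` and the numbering of buckets from 0 to 5 (left to right) are assumptions made for this example, not something provided by OpenAI Gym.

```python
import math

def angle_bucket(angle_deg, low=-24.0, high=24.0, n_buckets=6):
    """Return the index (0 to n_buckets - 1) of the equal-width bucket holding angle_deg."""
    width = (high - low) / n_buckets  # 8 degrees per bucket with the defaults
    index = math.floor((angle_deg - low) / width)
    # Clamp so out-of-range angles land in the first or last bucket
    return max(0, min(n_buckets - 1, index))

for angle in (-20.0, -10.0, 0.0, 5.0, 23.9):
    print(angle, "->", angle_bucket(angle))
```

Every angle inside the same 8-degree slice gets the same index, which is what lets a finite table stand in for a continuous range.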

But of course, we don't *only* care about the angle of the pendulum. The velocity of the pendulum is also important.

(And in fact, `observation` *also* gives us the position and velocity of the cart. If we were being strict about forcing the computer to learn everything on its own, we would feed it all that information (and create table columns for every possible combination), and ask the computer to sift through it. But the cart position and velocity are not very important, so to make the training process (much) faster, we are going to ignore them and build our Q table using just the angle and angular velocity.)

---

So we want to create buckets for angle and buckets for angular velocity. The python library `sklearn` has a tool called `KBinsDiscretizer` that does this for us, but we need to provide it with the minimum and maximum allowed values along with the number of buckets to create.

What are the minimum and maximum allowed values? `env.observation_space` can tell us!

In [None]:
print("Lower bounds:", env.observation_space.low)
print("Upper bounds:", env.observation_space.high)

We only care about the last two values (because those correspond to the pendulum angle and angular velocity). The values for angle, -0.419 and 0.419, make sense, because that is -24 to 24 degrees. But the values for angular velocity are extremely large! ($3.4 \cdot 10^{38}$) These bounds are telling us that, technically, there's no limit on velocity. But we can't make that many bins! And in practice, the pendulum will never swing that fast anyway. So let's just assume that the angular velocity will always be between -1.0 and 1.0 radians per second (rather than the super huge numbers we're getting).

The following code creates a `KBinsDiscretizer` called `bucketer` that will split our values into buckets. But the code is not quite complete. Can you finish it?

In [None]:
import math
from sklearn.preprocessing import KBinsDiscretizer

# 6 buckets for angle, 12 buckets for angular velocity
number_of_buckets = (6, 12)

# Use the lower bounds from the observation space (but use -1.0 for angular velocity)
lower_bounds = [env.observation_space.low[2], -1.0]

# TODO: Use the upper bounds from the observation space (but use 1.0 for angular velocity)
upper_bounds = # ???

# Create a KBinsDiscretizer called `bucketer` using our settings from above
bucketer = KBinsDiscretizer(n_bins=number_of_buckets, encode="ordinal", strategy="uniform")
bucketer.fit([lower_bounds, upper_bounds])

Hopefully you have a `bucketer` set up with the correct bounds. Now let's try using it! The bucketer allows us to plug in an angle and an angular velocity, and get back two bucket numbers.

In [None]:
bucketer.transform([[-0.415, 0.97]])

**Question:** What is the number corresponding to the blue bucket in the following picture? What about the green bucket?

<center>
  <img alt="Splitting the possible angles into six discrete buckets" src="https://i.imgur.com/bF5AJU9.png" />
</center>

Now that we have the `bucketer`, let's create a function called `discretizer` that takes an `observation` and gives back two bucket numbers, for the angle and angular velocity.

The function is almost complete. Can you finish it?

In [None]:
# Create the "discretizer" function
def discretizer(observation):
  # Get the angle value from the observation
  angle = observation[2]

  # TODO: Get the angular velocity value from the observation
  angular_velocity = # ???

  bucket_numbers = bucketer.transform([[angle, angular_velocity]])[0]
  bucket_numbers = map(int, bucket_numbers)
  return tuple(bucket_numbers)

# Try discretizing an observation
observation = env.reset()
discretizer(observation)

If you run the above code multiple times, you should see that the angle bucket is always 2 or 3 and the angular velocity bucket is always 5 or 6.

---

Alright. Having these discrete buckets means we can now create our `Q_table`. So let's go ahead and do that. We'll begin by filling it with zeroes. Run this code to create the table:

In [None]:
import numpy as np

Q_table = np.zeros((6, 12, 2))

print(Q_table)

As you can see, the Q-table is full of zeroes. And its size is $6 \times 12 \times 2$ because there are 6 angle buckets and 12 angular velocity buckets, and for each possible state (combination of angle + angular velocity), there are 2 possible actions: move right or move left.

So each cell in the `Q_table` will tell us whether a particular action is a good idea in a particular state. But of course, it's all filled with zeroes right now. We still need to learn the correct values for the table.
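
To picture how the table is read once it holds real values, here is a small sketch that uses a plain dictionary as a stand-in for two rows of the Q-table. The numbers are made up, and `toy_q_table` and `best_action` are hypothetical names used only for this illustration.

```python
# A tiny dictionary stand-in for two rows of the Q-table (values are made up).
# Keys are (angle bucket, angular velocity bucket); each list holds the
# value of action 0 (left) followed by the value of action 1 (right).
toy_q_table = {
    (2, 5): [0.4, 1.7],
    (3, 6): [2.1, 0.3],
}

def best_action(q_table, state):
    """Return the action with the highest value in the given state."""
    values = q_table[state]
    return max(range(len(values)), key=lambda action: values[action])

print("Best action in state (2, 5):", best_action(toy_q_table, (2, 5)))
print("Best action in state (3, 6):", best_action(toy_q_table, (3, 6)))

# The full table has one cell per (angle bucket, velocity bucket, action)
number_of_cells = 6 * 12 * 2
print("Cells in the full table:", number_of_cells)
```

In state `(2, 5)` the higher value belongs to action `1`, so "move right" looks better there, while in state `(3, 6)` action `0` wins.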

---

During this learning process, we want the computer to start out by *exploring* a lot (trying random stuff), and then eventually learn what's best and do the things it knows are a good idea (*exploitation*). So we want to have an `exploration_rate` that decreases over time.

Your job is to create an `exploration_rate` function where at episode 0, `n = 0`, the exploration rate is 1, and at episode 200, `n = 200`, the exploration rate is 0.1. For now, it probably makes sense to choose a simple function like a line. Try creating your own function below, and then check that the results make sense.

In [None]:
def exploration_rate(n):
  # TODO: Make the rate decrease slowly over time
  rate = # ???

  if rate < 0.1:
    rate = 0.1
  
  if rate > 1.0:
    rate = 1.0
  
  return rate

# Check the values of the function
print("Exploration rate at n = 0:", exploration_rate(0), "(should be 1.0)")
print("Exploration rate at n = 200:", exploration_rate(200), "(should be 0.1)")

# Graph the function
import matplotlib.pyplot as plt
plt.plot([exploration_rate(x) for x in range(1000)])
plt.ylabel('Exploration rate')
plt.show()

Hopefully your graph looks like it's decreasing from $(0, 1.0)$ to $(200, 0.1)$ and then becoming a flat 0.1 beyond that point. (There's always room to experiment, but this is a solid starting place.)

---

Now, we also want the learning rate to change over time. At first, when the computer doesn't know anything, it should adapt very strongly based on any new information it sees (i.e. the learning rate should be high). But after a while, when it has played many games already, one new measurement shouldn't affect its strategy too much. So later on, the learning rate should be low.

Create another function, `learning_rate`, similar to the last. It should have a learning rate of `1.0` at `n = 0`, and then it should bottom out at a learning rate of `0.01` somewhere around `n = 200`.

In [None]:
# Decaying learning rate
def learning_rate(n):
  # TODO: Make the rate decrease slowly over time
  rate = # ???

  if rate < 0.01:
    rate = 0.01
  
  if rate > 1.0:
    rate = 1.0
  
  return rate

# Check the values of the function
print("Learning rate at n = 0:", learning_rate(0), "(should be 1.0)")
print("Learning rate at n = 200:", learning_rate(200), "(should be 0.01)")

# Graph the function
import matplotlib.pyplot as plt
plt.plot([learning_rate(x) for x in range(1000)])
plt.ylabel('Learning rate')
plt.show()

Again, hopefully your graph starts at $(0, 1.0)$ and drops down to about $(200, 0.01)$ before leveling out.

---

Alright... With all these helper functions in place, we're finally ready to perform Q-learning. We are going to run the environment for 1,000 episodes (remember, an episode ends when the pendulum falls). Throughout the entire process, we will update the `Q_table` continuously based on the `reward` received after each step. (To learn more about exactly how the table is updated, see the instructions document.)
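
As a rough sketch of what a single table update does, the helper below blends the old cell value with a new target (the reward plus the best value available in the next state), weighted by the learning rate. The function name `q_update` and the `discount` parameter are names chosen here for illustration, and the numbers are made up.

```python
def q_update(old_value, reward, best_next_value, lr, discount=1.0):
    """Blend the old estimate with the new target, weighted by the learning rate."""
    target = reward + discount * best_next_value
    return (1 - lr) * old_value + lr * target

# Made-up numbers: the old estimate is 2.0 and the new target is 1.0 + 3.0 = 4.0
halfway = q_update(old_value=2.0, reward=1.0, best_next_value=3.0, lr=0.5)
replace = q_update(old_value=2.0, reward=1.0, best_next_value=3.0, lr=1.0)
ignore = q_update(old_value=2.0, reward=1.0, best_next_value=3.0, lr=0.0)

print("lr = 0.5:", halfway)
print("lr = 1.0:", replace)
print("lr = 0.0:", ignore)
```

A learning rate of `1.0` throws away the old estimate entirely, while a rate near `0` barely moves it. That is why a decaying learning rate makes early episodes count for a lot and later ones only fine-tune the table.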

The following code is mostly complete, but inside the if-else statement (explore or exploit), the explore branch is *supposed* to choose a random action (0 or 1), but the code is not there. Can you fill it in? (Remember, we chose a random action from `env.action_space` previously.)

Finish the code and then run it. It will take a while to complete (~3 minutes).

In [None]:
# This will take about 3 minutes to run

for episode_number in range(1000):
  if episode_number % 25 == 0:
    print(f"Beginning episode {episode_number}...")

  observation = env.reset()
  
  while True:
    current_state = discretizer(observation)

    # Choose an action (either explore or exploit based on exploration_rate)
    if np.random.random() < exploration_rate(episode_number):
      # TODO: Explore (random action)
      action = # ???
    else:
      # Exploit (best action according to Q_table)
      action = np.argmax(Q_table[current_state])
    
    # Take a physics step in the environment
    observation, reward, done, _ = env.step(action)
    new_state = discretizer(observation)
    
    # Update Q_table
    lr = learning_rate(episode_number)
    old_value = Q_table[current_state][action]
    new_value = reward + 1 * np.max(Q_table[new_state])
    Q_table[current_state][action] = (1-lr)*old_value + lr*new_value

    # If this episode is done, stop the loop so we can begin the next one
    if done:
      break

env.close()

print("Done. Q_table is now ready to be used.")

Amazing! Now that the `Q_table` is built, the computer can use it to play. Let's record a video of the computer playing the game using the `Q_table` to decide which action is best:

In [None]:
# Now that we've trained, play using the new Q-Table and record a video.
import gym
import numpy as np
from gym.wrappers.monitoring.video_recorder import VideoRecorder

env = gym.make("CartPole-v1")
video = VideoRecorder(env, "after-training.mp4")

observation = env.reset()

for _ in range(300):
  video.capture_frame()

  # Always exploit (choose best possible action based on Q_table)
  current_state = discretizer(observation)
  action = np.argmax(Q_table[current_state])

  observation, reward, done, _ = env.step(action)

  # Stop recording once the episode is over
  if done:
    break

video.close()
env.close()

show_video("after-training.mp4")

Fantastic! It works! 🎉

# Step 5 (Bonus): Try a New Environment

If you are done early, a fantastic next challenge is to explore [a new environment](https://gym.openai.com/envs/#classic_control).

1. Get the environment up and running.
2. Manually code a strategy like we did in step 3. Can you write code that plays the game successfully?
3. **Extra challenge:** Apply Q-learning to the new environment.

If you successfully run a new environment, send a video in the Discord! 🤠 (For MuJoCo installation instructions, see [the instructions document](https://msu-ai.notion.site/Workshop-Instructions-eb2c76481d4c4a1f92824ee5bfd80536#0592b475cc2940be8b348d4607dc8019).)

Here is some code to get you started with `"MountainCar-v0"`:

In [None]:
import gym
import matplotlib.pyplot as plt

env = gym.make("MountainCar-v0")

observation = env.reset()

plt.imshow(env.render(mode="rgb_array"))