# Machine Learning Foundation

## Course 5, Part i: Reinforcement Learning DEMO

## Reinforcement Learning Example

In this example from Reinforcement Learning, the task is to use tools from Machine Learning to predict how an agent should act. We will then use those predictions to drive the behavior of the agent. Ideally, our intelligent agent should get a much better score than a random agent.

## Key concepts:

- **Observation**: The state of the game. It describes where the agent currently is.
- **Action**: These are the moves that the agent makes.
- **Episode**: One full game played from beginning (`env.reset()`) to end (when `done == True`).
- **Step**: Part of a game that includes one action. The game transitions from one observation to the next.
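
To make these terms concrete, here is a minimal sketch of the episode/step loop. It uses a hypothetical `ToyCorridor` class that is not part of Gym; it only mimics the `reset()` / `step()` shape used later in this notebook, so treat it as an illustration rather than the real environment.

```python
import random


class ToyCorridor:
    """Hypothetical 1-D world (not part of Gym): tiles 0..4, goal at the last tile."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.position = 0
        return self.position

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped at the edges).
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done, {}


def run_episode(env, rng, max_steps=1000):
    """Play one episode with random actions; return (number of steps, total reward)."""
    observation = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = rng.choice([0, 1])                          # one action ...
        observation, reward, done, info = env.step(action)   # ... is one step
        total_reward += reward
        steps += 1
    return steps, total_reward


steps, total_reward = run_episode(ToyCorridor(), random.Random(0))
print(f"episode finished after {steps} steps with total reward {total_reward}")
```

Each pass through the `while` loop is one step, and the whole call to `run_episode` is one episode.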

## Setup

This example uses the Python library [OpenAI Gym](https://gym.openai.com/docs/).

If you want to install everything (Gym can also run Atari games), follow [these instructions](https://github.com/openai/gym#installing-everything).

Now we can build an environment using OpenAI Gym.

In [1]:
import gym
import pandas
import numpy as np

# The first part of the game uses the environment FrozenLake-v0

This is a small world with 16 tiles. 

    SFFF
    FHFH
    FFFH
    HFFG

The game starts at the S tile. The object of the game is to get to the goal (G) without landing in a hole (H).

In [2]:
# Build an environment with gym.make()
env = gym.make('FrozenLake-v0') # build a fresh environment

# Start a new game with env.reset()
current_observation = env.reset() # this starts a new "episode" and returns the initial observation

#the current observation is just the current location
print(current_observation) # observations are just a number

0


In [3]:
# we can print the environment if we want to look at it
env.render() 


[41mS[0mFFF
FHFH
FFFH
HFFG


In [4]:
# the action space for this environment includes four discrete actions

print(f"our action space: {env.action_space}")

new_action = env.action_space.sample() # we can randomly sample actions

print(f"our new action: {new_action}") # run this cell a few times to get an idea of the action space
# what does it look like?

our action space: Discrete(4)
our new action: 3


In [5]:
# now we act! do this with the step function

new_action = env.action_space.sample()

observation, reward, done, info = env.step(new_action)

# here's a look at what we get back
print(f"observation: {observation}, reward: {reward}, done: {done}, info: {info}")

env.render() 

observation: 1, reward: 0.0, done: False, info: {'prob': 0.3333333333333333}
  (Up)
S[41mF[0mFF
FHFH
FFFH
HFFG


In [6]:
# we can put this process into a for-loop and see how the game progresses

current_observation = env.reset() # start a new game

for i in range(5): # run 5 moves

    new_action = env.action_space.sample() # sample a new action

    observation, reward, done, info = env.step(new_action) # step through the action and get the outputs

    # here's a look at what we get back
    print(f"observation: {observation}, reward: {reward}, done: {done}, info: {info}")

    env.render() 

observation: 1, reward: 0.0, done: False, info: {'prob': 0.3333333333333333}
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
observation: 1, reward: 0.0, done: False, info: {'prob': 0.3333333333333333}
  (Left)
S[41mF[0mFF
FHFH
FFFH
HFFG
observation: 5, reward: 0.0, done: True, info: {'prob': 0.3333333333333333}
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
observation: 5, reward: 0, done: True, info: {'prob': 1.0}
  (Up)
SFFF
F[41mH[0mFH
FFFH
HFFG
observation: 5, reward: 0, done: True, info: {'prob': 1.0}
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG


Now we can guess what each of the outputs means.

**Observation** refers to the number of the tile. The tiles appear to be numbered

    0 1 2 3
    4 5 ...
    
**Reward** refers to the outcome of the game. We get 1 if we win, zero otherwise.

**Done** tells us whether the game has ended. It becomes `True` when we win or fall into a hole.

**info** gives extra info about the world. Here, it's probabilities. Can you guess what this means here? Perhaps the world is a bit noisy.
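
As a small sketch of the tile numbering guessed above, the snippet below maps an observation number back to a row, a column, and a map letter. It assumes row-major numbering on the 4x4 map, which is what the printed observations suggest; `tile_at` is a hypothetical helper, not a Gym function.

```python
# The 4x4 map from this notebook, one string per row.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = 4


def tile_at(observation):
    """Return (row, col, letter) for a tile number, assuming row-major numbering."""
    row, col = divmod(observation, N_COLS)
    return row, col, LAKE_MAP[row][col]


for obs in (0, 1, 5, 15):
    print(obs, tile_at(obs))
```

Under this assumption tile 5 is a hole, which matches the runs above where `done` became `True` at observation 5.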

In [7]:
# Here's how to simulate an entire episode
# We're going to stop rendering it every time to save space
# try running this a few times. Does it ever win?

current_observation = env.reset()
done = False

while not done:   
    print("Before action: ")
    env.render()
    new_action = env.action_space.sample()
    new_observation, reward, done, info = env.step(new_action)
#     print(f"action:{new_action} observation: {new_observation}, reward: {reward}, done: {done}, info: {info}")
    print("After action: ")
    env.render()

Before action: 

[41mS[0mFFF
FHFH
FFFH
HFFG
After action: 
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
Before action: 
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
After action: 
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
Before action: 
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
After action: 
  (Left)
SFFF
FHFH
FFFH
[41mH[0mFFG


Things to think about:
- What things do you notice about how the environment and actions work?
- What do you think the actions mean?
- When the agent performs the same action from the same place (same observation), does the same outcome happen every time?

The environment has some squares that always end the game (`H` in the render), some that don't (`F`), and one (`G`) that presumably gives the reward if you reach it.

The actions seem to be up, down, left, and right, but they also appear to be stochastic: each action seems to have a 1/3 chance of moving the agent in each of 3 different directions.
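
The noise can be sketched without Gym. The snippet below is an assumption about how the slippery lake behaves, not a copy of the library code: it assumes the action encoding 0 = Left, 1 = Down, 2 = Right, 3 = Up, and that the agent moves in the intended direction or in one of the two perpendicular directions with probability 1/3 each. Holes and the goal are not modeled here.

```python
from collections import Counter
from fractions import Fraction

N_ROWS, N_COLS = 4, 4
# Assumed action encoding: 0 = Left, 1 = Down, 2 = Right, 3 = Up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def move(tile, direction):
    """Deterministic move on the grid; walking into a wall leaves the agent in place."""
    row, col = divmod(tile, N_COLS)
    d_row, d_col = MOVES[direction]
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    return row * N_COLS + col


def slippery_outcomes(tile, action):
    """Assumed noise model: intended direction or either perpendicular one, 1/3 each."""
    outcomes = Counter()
    for direction in ((action - 1) % 4, action, (action + 1) % 4):
        outcomes[move(tile, direction)] += Fraction(1, 3)
    return dict(outcomes)


# Choosing "Down" on the start tile can leave the agent on three different tiles.
print(slippery_outcomes(0, 1))
```

Under these assumptions, a "slip" into a wall keeps the agent where it is, which would explain runs where the observation does not change after an action.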

# Step 1: Gather data

We want to build an intelligent actor but first we have to gather data on which actions are useful.

Use the above code as reference. Run a *random* agent through 1,000 or more episodes and collect data on each step.

I recommend storing this data in a pandas DataFrame, with one row per step. Your columns should include the following, or something similar:

- `observation` the observation at the beginning of the step (before acting!)
- `action` the action randomly sampled
- `current_reward` the reward received after the action was performed

After you generate this data, it is recommended that you compute a column (e.g. `total_reward`) that holds the total reward for the entire episode.

At the end of the data gathering, you should be able to use pandas (or similar) to calculate the average total reward *per episode* of the random agent. The average score should be 1-2%, meaning that the agent very rarely wins.
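
The bookkeeping described above can be sketched in plain Python before moving to pandas. The step records below are made up and abbreviated for illustration; only the column names follow the list above.

```python
from collections import defaultdict

# Made-up, abbreviated step records shaped like the rows described above.
steps = [
    {"episode": 0, "observation": 0, "action": 2, "current_reward": 0.0},
    {"episode": 0, "observation": 1, "action": 1, "current_reward": 0.0},
    {"episode": 1, "observation": 13, "action": 2, "current_reward": 0.0},
    {"episode": 1, "observation": 14, "action": 2, "current_reward": 1.0},
]

# Sum the step rewards within each episode.
episode_totals = defaultdict(float)
for step in steps:
    episode_totals[step["episode"]] += step["current_reward"]

# Copy the episode total back onto every step of that episode.
for step in steps:
    step["total_reward"] = episode_totals[step["episode"]]

# Average over episodes, not over steps.
average_total_reward = sum(episode_totals.values()) / len(episode_totals)
print(average_total_reward)
```

Averaging the `total_reward` column directly would weight long episodes more heavily than short ones, so the per-episode average is computed from one total per episode.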


## Hints

- `initial_observation = env.reset()` starts a new episode and returns the initial observation.
- `new_observation, reward, done, info = env.step(new_action)` executes one action and returns the next observation along with the reward, the `done` flag, and extra info. You may look at the documentation for the step method if you are curious about what it does.
- `done` stays `False` until the game is finished.
- We are trying to maximize the reward *per episode*. Our first game gives 0 reward unless the agent reaches the goal.
- `env.action_space.n` gives the number of possible actions in the environment. `env.action_space.sample()` allows the agent to randomly sample an action.
- `env.observation_space.n` gives the number of possible states in the environment. 

In [8]:
import datetime
import pandas as pd

now = datetime.datetime.now
t = now()

env = gym.make('FrozenLake-v0')

num_episodes = 70000

life_memory = []
for i in range(num_episodes):
    
    # start a new episode and record all the memories
    old_observation = env.reset()
    done = False
    tot_reward = 0
    ep_memory = []
    while not done:
        new_action = env.action_space.sample()
        observation, reward, done, info = env.step(new_action)
        tot_reward += reward
        
        ep_memory.append({
            "observation": old_observation,
            "action": new_action,
            "reward": reward,
            "episode": i,
        })
        old_observation = observation
        
    # incorporate total reward
    num_steps = len(ep_memory)
    # use a separate loop variable so the episode counter `i` is not shadowed
    for step, ep_mem in enumerate(ep_memory):
        ep_mem["tot_reward"] = tot_reward
        ep_mem["decay_reward"] = step*tot_reward/num_steps
        
    life_memory.extend(ep_memory)

print(f"Training time {now() - t}s")    
memory_df = pandas.DataFrame(life_memory)

Training time 0:00:20.344418s


In [9]:
memory_df

Unnamed: 0,observation,action,reward,episode,tot_reward,decay_reward
0,0,3,0.0,0,0.0,0.0
1,0,3,0.0,0,0.0,0.0
2,1,0,0.0,0,0.0,0.0
3,0,3,0.0,1,0.0,0.0
4,1,3,0.0,1,0.0,0.0
...,...,...,...,...,...,...
536022,0,2,0.0,69999,0.0,0.0
536023,0,3,0.0,69999,0.0,0.0
536024,1,1,0.0,69999,0.0,0.0
536025,2,2,0.0,69999,0.0,0.0


In [10]:
for i, ep_mem in enumerate(ep_memory):
    print(i, ep_mem)

0 {'observation': 0, 'action': 2, 'reward': 0.0, 'episode': 69999, 'tot_reward': 0.0, 'decay_reward': 0.0}
1 {'observation': 0, 'action': 3, 'reward': 0.0, 'episode': 69999, 'tot_reward': 0.0, 'decay_reward': 0.0}
2 {'observation': 1, 'action': 1, 'reward': 0.0, 'episode': 69999, 'tot_reward': 0.0, 'decay_reward': 0.0}
3 {'observation': 2, 'action': 2, 'reward': 0.0, 'episode': 69999, 'tot_reward': 0.0, 'decay_reward': 0.0}
4 {'observation': 6, 'action': 0, 'reward': 0.0, 'episode': 69999, 'tot_reward': 0.0, 'decay_reward': 0.0}


In [11]:
memory_df.describe()

Unnamed: 0,observation,action,reward,episode,tot_reward,decay_reward
count,536027.0,536027.0,536027.0,536027.0,536027.0,536027.0
mean,2.228974,1.502307,0.001821,35066.78525,0.023719,0.010949
std,2.996511,1.118274,0.042632,20195.101145,0.152172,0.083088
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,17587.5,0.0,0.0
50%,1.0,2.0,0.0,35102.0,0.0,0.0
75%,4.0,3.0,0.0,52654.0,0.0,0.0
max,14.0,3.0,1.0,69999.0,1.0,0.979167


In [12]:
memory_df.shape

(536027, 6)

In [13]:
# memory_df.groupby("episode").apply(print)

In [14]:
memory_df[memory_df.tot_reward == 1]

Unnamed: 0,observation,action,reward,episode,tot_reward,decay_reward
412,0,3,0.0,57,1.0,0.000000
413,1,2,0.0,57,1.0,0.111111
414,2,0,0.0,57,1.0,0.222222
415,2,3,0.0,57,1.0,0.333333
416,1,3,0.0,57,1.0,0.444444
...,...,...,...,...,...,...
535919,8,1,0.0,69988,1.0,0.642857
535920,8,3,0.0,69988,1.0,0.714286
535921,9,3,0.0,69988,1.0,0.785714
535922,10,0,0.0,69988,1.0,0.857143


In [15]:
memory_df.observation

0         0
1         0
2         1
3         0
4         1
         ..
536022    0
536023    0
536024    1
536025    2
536026    6
Name: observation, Length: 536027, dtype: int64

In [16]:
memory_df.groupby("episode").reward.sum()

episode
0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
        ... 
69995    0.0
69996    0.0
69997    0.0
69998    0.0
69999    0.0
Name: reward, Length: 70000, dtype: float64

In [17]:
memory_df.groupby("episode").reward.sum().mean()

0.013942857142857142

# Step 2: Predict

Now that you have a bunch of data, put it into a format that you can model. The goal here is to guide the behavior of our agent. Our agent will be given an observation and need to decide between the possible actions given that observation and the prediction of the model. 

Remember, you're a data scientist! Be creative. 

It might be helpful to work backwards. Ultimately, you will write something like:

```
def convert_to_row(obs, act):
    # expertly written code
    return row_of_obs_act
    
rows = [convert_to_row(current_obs, act) for act in possible_actions]

pred_outcome = model.predict(rows)
```

So, you will need to design a quantity that you can ask your model to predict for every possible action-observation pair. Think a bit about what this quantity should be. Should the model try to predict the immediate reward for each action? If so, how would it know where to go at the beginning of each episode, when all moves give zero reward but some moves bring it closer to the goal than others?
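
One candidate, sketched here as an assumption rather than a recommended answer, is to spread the episode's final outcome back over its steps so that later steps receive more credit. The helper name `ramped_credit` is hypothetical; the formula `i * tot_reward / num_steps` is the one used for `decay_reward` in the data-gathering cell earlier.

```python
def ramped_credit(step_rewards):
    """Give step i the share i * total / num_steps of the episode's total reward."""
    total = sum(step_rewards)
    num_steps = len(step_rewards)
    return [i * total / num_steps for i in range(num_steps)]


winning_episode = [0.0, 0.0, 0.0, 1.0]   # reward arrives only on the last step
losing_episode = [0.0, 0.0, 0.0]         # no reward at all

print(ramped_credit(winning_episode))
print(ramped_credit(losing_episode))
```

With a target like this, early moves in a winning episode get a small nonzero value even though their immediate reward was zero, which gives the model something to distinguish them from moves in losing episodes.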

In [18]:
# from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
# from sklearn.svm import SVR

# model = ExtraTreesRegressor(n_estimators=50)
# # model = SVR()
# y = 0.5*memory_df.reward + 0.1*memory_df.decay_reward + memory_df.tot_reward
# x = memory_df[["observation", "action"]]
# model.fit(x, y)

# Step 3: Act

Now that you have a model that predicts the desired behavior, let's act on it! Modify the code you used to gather data so that you replace the random decision with an intelligent one.
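
The decision rule itself can be sketched independently of any particular model. In the snippet below, `choose_action` and `toy_predict_value` are hypothetical names; `toy_predict_value` stands in for a fitted model, and `random_per` plays the same role as in the cells that follow: the fraction of moves that stay random.

```python
import random


def choose_action(observation, predict_value, n_actions, random_per, rng):
    """With probability random_per act randomly; otherwise take the best-scoring action."""
    if rng.random() < random_per:
        return rng.randrange(n_actions)
    scores = [predict_value(observation, action) for action in range(n_actions)]
    # Ties go to the lowest-numbered action.
    return max(range(n_actions), key=lambda action: scores[action])


def toy_predict_value(observation, action):
    # Stand-in for a fitted model: pretends action 2 is always best.
    return 1.0 if action == 2 else 0.0


rng = random.Random(0)
print(choose_action(0, toy_predict_value, n_actions=4, random_per=0.0, rng=rng))
```

Setting `random_per` to 0 makes the agent purely greedy; a small positive value keeps some exploration in the data it gathers.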

We started out winning ~1.5% of the games with the random agent. How well can you do? You should be able to get your model to do at least 10x better (so 15%). Can you get ~50%?

If you're having trouble, tune your model. Try different representations of the observation and action spaces. Try different models. 
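
As one example of a different representation, the sketch below one-hot encodes the tile and the action instead of passing them as two integers. Whether this helps depends on the model; it is offered as something to try, not as the representation used elsewhere in this notebook. `one_hot_row` is a hypothetical helper.

```python
N_STATES = 16    # FrozenLake tiles
N_ACTIONS = 4


def one_hot_row(observation, action):
    """Encode an (observation, action) pair as 16 + 4 indicator features."""
    row = [0] * (N_STATES + N_ACTIONS)
    row[observation] = 1          # which tile the agent is on
    row[N_STATES + action] = 1    # which action is being scored
    return row


# One candidate row per action for the agent standing on tile 5.
rows = [one_hot_row(5, action) for action in range(N_ACTIONS)]
print(rows[0])
```

This removes the implied ordering between tile numbers, which can matter for models that treat inputs as continuous quantities.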

In [19]:
import datetime
now = datetime.datetime.now
t = now()
print(f"Training time: {now() - t}s")   

Training time: 0:00:00s


In [20]:
# now = datetime.datetime.now
# t = now()

# model = RandomForestRegressor()
# y = 1*memory_df.reward + memory_df.tot_reward + .1*memory_df.decay_reward
# x = memory_df[["observation", "action"]]
# model.fit(x, y)

# num_episodes = 3000
# random_per = 0

# life_memory = []
# for i in range(num_episodes):
    
#     # start a new episode and record all the memories
#     old_observation = env.reset()
#     done = False
#     tot_reward = 0
#     ep_memory = []
#     while not done:      
#         if np.random.rand() < random_per:
#             new_action = env.action_space.sample()
#         else:
#             pred_in = [[old_observation,i] for i in range(4)]
#             new_action = np.argmax(model.predict(pred_in))
#         observation, reward, done, info = env.step(new_action)
#         tot_reward += reward
        
#         ep_memory.append({
#             "observation": old_observation,
#             "action": new_action,
#             "reward": reward,
#             "episode": i,
#         })
#         old_observation = observation
        
#     # incorporate total reward
#     for ep_mem in ep_memory:
#         ep_mem["tot_reward"] = tot_reward
        
#     life_memory.extend(ep_memory)
# print(f"Training time: {now() - t}")    
# memory_df2 = pandas.DataFrame(life_memory)

# # rf.fit(memory_df[["observation", "action"]], memory_df["comb_reward"])

# # score
# # much better!
# memory_df2.groupby("episode").reward.sum().mean()

In [21]:
# Practice
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.svm import SVR

now = datetime.datetime.now
t = now()

model = RandomForestRegressor()
y = 1*memory_df.reward + memory_df.tot_reward + .1*memory_df.decay_reward
x = memory_df[["observation", "action"]]
model.fit(x, y)

num_episodes = 500
random_per = 0
life_memory = []

for i in range(num_episodes):
    old_observation = env.reset()
    done = False
    tot_reward = 0
    ep_memory = []
    while not done:
        if np.random.rand() < random_per:
            new_action = env.action_space.sample()
        else:
            pred_in = [[old_observation, i] for i in range(4)]
            new_action = np.argmax(model.predict(pred_in))
        observation, reward, done, info = env.step(new_action)
        tot_reward += reward
        
        ep_memory.append({
            "observation": old_observation,
            "action": new_action, 
            "reward": reward,
            "episode": i,
        })
        old_observation = observation
        
    for ep_mem in ep_memory:
        ep_mem["tot_reward"] = tot_reward
    
    life_memory.extend(ep_memory)

print(f"Training time {now() - t}s")
memory_df2 = pd.DataFrame(life_memory)

np.mean(memory_df2.groupby("episode").reward.sum())
# memory_df2.groupby("episode").reward.sum().mean()

Training time 0:05:44.834588s


0.704

In [22]:
memory_df2

Unnamed: 0,observation,action,reward,episode,tot_reward
0,0,0,0.0,0,0.0
1,0,0,0.0,0,0.0
2,0,0,0.0,0,0.0
3,4,0,0.0,0,0.0
4,8,3,0.0,0,0.0
...,...,...,...,...,...
19455,8,3,0.0,499,1.0
19456,8,3,0.0,499,1.0
19457,9,1,0.0,499,1.0
19458,13,2,0.0,499,1.0


In [23]:
y = .1*memory_df.reward + 1*memory_df.decay_reward + 1*memory_df.tot_reward

# Extension: Pole cart

If time permits, try your hand at pole cart (`env = gym.make('CartPole-v0')`).

Notice that the observation space is quite different. It's no longer discrete--instead we have 4 continuous values. You'll have to store these differently from how you did with FrozenLake.
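
One way to store a multi-value observation is to spread it across named columns, one per component. The sketch below uses a hypothetical helper `observation_to_row` and illustrative numbers; the column names `obs0` to `obs3` are a convention, not something Gym provides.

```python
def observation_to_row(observation, action):
    """Spread a multi-value observation across named columns, plus the chosen action."""
    row = {f"obs{k}": float(value) for k, value in enumerate(observation)}
    row["action"] = action
    return row


example_observation = [-0.01, 0.05, -0.02, 0.02]   # illustrative values only
print(observation_to_row(example_observation, action=0))
```

A list of such dictionaries can be passed to `pandas.DataFrame` in the same way as the FrozenLake step records.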

My random actor actually does surprisingly well (avg ~22). But my intelligent agent is able to score ~99. Can you beat me? 

# Pole cart

In [24]:
import gym
import pandas
import numpy as np

env = gym.make('CartPole-v0')

In [41]:
env.env?

In [26]:
# now we can build a toy world!
num_episodes = 1000

life_memory = []
for i in range(num_episodes):
    
    # start a new episode and record all the memories
    old_observation = env.reset()
    done = False
    tot_reward = 0
    ep_memory = []
    while not done:
        new_action = env.action_space.sample()
        observation, reward, done, info = env.step(new_action)
        tot_reward += reward
        
        ep_memory.append({
            "obs0": old_observation[0],
            "obs1": old_observation[1],
            "obs2": old_observation[2],
            "obs3": old_observation[3],
            "action": new_action,
            "reward": reward,
            "episode": i,
        })
        old_observation = observation
        
    # incorporate total reward
    for ep_mem in ep_memory:
        ep_mem["tot_reward"] = tot_reward
        
    life_memory.extend(ep_memory)
    
memory_df = pandas.DataFrame(life_memory)

memory_df.groupby("episode").reward.sum().mean()

21.751

In [27]:
memory_df

Unnamed: 0,obs0,obs1,obs2,obs3,action,reward,episode,tot_reward
0,-0.009359,0.047097,-0.022057,0.015836,0,1.0,0,21.0
1,-0.008417,-0.147701,-0.021740,0.301479,0,1.0,0,21.0
2,-0.011371,-0.342507,-0.015711,0.587227,1,1.0,0,21.0
3,-0.018221,-0.147168,-0.003966,0.289637,1,1.0,0,21.0
4,-0.021164,0.048010,0.001827,-0.004294,0,1.0,0,21.0
...,...,...,...,...,...,...,...,...
21746,0.094145,-0.354251,-0.199186,0.136198,1,1.0,999,25.0
21747,0.087060,-0.156916,-0.196462,-0.212125,0,1.0,999,25.0
21748,0.083921,-0.348766,-0.200704,0.012725,1,1.0,999,25.0
21749,0.076946,-0.151417,-0.200450,-0.335969,1,1.0,999,25.0


In [28]:
memory_df.describe()

Unnamed: 0,obs0,obs1,obs2,obs3,action,reward,episode,tot_reward
count,21751.0,21751.0,21751.0,21751.0,21751.0,21751.0,21751.0,21751.0
mean,-0.000117,-0.018099,0.001839,0.020843,0.495517,1.0,497.780148,28.161142
std,0.086541,0.535357,0.093025,0.791504,0.499991,0.0,289.625136,15.604871
min,-0.9146,-2.256016,-0.209377,-2.76646,0.0,1.0,0.0,8.0
25%,-0.039968,-0.370245,-0.05244,-0.488695,0.0,1.0,250.5,17.0
50%,0.002352,-0.010875,0.000935,0.01156,0.0,1.0,494.0,23.0
75%,0.041763,0.345232,0.056567,0.550996,1.0,1.0,747.0,36.0
max,0.7643,2.308623,0.209434,2.760805,1.0,1.0,999.0,95.0


In [29]:
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, ExtraTreesRegressor

model = ExtraTreesRegressor(n_estimators=50)

memory_df["comb_reward"] = .5*memory_df.reward + memory_df.tot_reward
model.fit(memory_df[["obs0", "obs1", "obs2", "obs3", "action"]], memory_df.comb_reward)

ExtraTreesRegressor(n_estimators=50)

In [42]:
memory_df

Unnamed: 0,obs0,obs1,obs2,obs3,action,reward,episode,tot_reward,comb_reward
0,-0.009359,0.047097,-0.022057,0.015836,0,1.0,0,21.0,21.5
1,-0.008417,-0.147701,-0.021740,0.301479,0,1.0,0,21.0,21.5
2,-0.011371,-0.342507,-0.015711,0.587227,1,1.0,0,21.0,21.5
3,-0.018221,-0.147168,-0.003966,0.289637,1,1.0,0,21.0,21.5
4,-0.021164,0.048010,0.001827,-0.004294,0,1.0,0,21.0,21.5
...,...,...,...,...,...,...,...,...,...
21746,0.094145,-0.354251,-0.199186,0.136198,1,1.0,999,25.0,25.5
21747,0.087060,-0.156916,-0.196462,-0.212125,0,1.0,999,25.0,25.5
21748,0.083921,-0.348766,-0.200704,0.012725,1,1.0,999,25.0,25.5
21749,0.076946,-0.151417,-0.200450,-0.335969,1,1.0,999,25.0,25.5


In [43]:
env.env?

In [30]:
import datetime

n = datetime.datetime.now
t = n()

num_episodes = 100
random_per = 0

life_memory = []
for i in range(num_episodes):
    
    # start a new episode and record all the memories
    old_observation = env.reset()
    done = False
    tot_reward = 0
    ep_memory = []
    while not done:
        
        
        if np.random.rand() < random_per:
            new_action = env.action_space.sample()
        else:
            pred_in = [list(old_observation)+[i] for i in range(2)]
            new_action = np.argmax(model.predict(pred_in))
        observation, reward, done, info = env.step(new_action)
        tot_reward += reward
        
        ep_memory.append({
            "obs0": old_observation[0],
            "obs1": old_observation[1],
            "obs2": old_observation[2],
            "obs3": old_observation[3],
            "action": new_action,
            "reward": reward,
            "episode": i,
        })
        old_observation = observation
        
    # incorporate total reward
    for ep_mem in ep_memory:
        ep_mem["tot_reward"] = tot_reward
        
    life_memory.extend(ep_memory)
    
memory_df2 = pandas.DataFrame(life_memory)
memory_df2["comb_reward"] = memory_df2.reward + memory_df2.tot_reward

# score
# much better!
memory_df2.groupby("episode").reward.sum().mean()

print(f"training time: {n() - t}s")

training time: 0:01:41.950757s


In [31]:
memory_df2

Unnamed: 0,obs0,obs1,obs2,obs3,action,reward,episode,tot_reward,comb_reward
0,0.009780,0.009512,0.036042,-0.044225,1,1.0,0,88.0,89.0
1,0.009970,0.204099,0.035158,-0.325322,0,1.0,0,88.0,89.0
2,0.014052,0.008495,0.028651,-0.021762,1,1.0,0,88.0,89.0
3,0.014222,0.203194,0.028216,-0.305270,0,1.0,0,88.0,89.0
4,0.018286,0.007682,0.022111,-0.003823,0,1.0,0,88.0,89.0
...,...,...,...,...,...,...,...,...,...
11609,-0.730055,-1.531049,0.079357,0.951753,0,1.0,99,71.0,72.0
11610,-0.760676,-1.727144,0.098392,1.268276,0,1.0,99,71.0,72.0
11611,-0.795219,-1.923375,0.123757,1.590080,0,1.0,99,71.0,72.0
11612,-0.833686,-2.119730,0.155559,1.918652,1,1.0,99,71.0,72.0


In [32]:
memory_df2.groupby("episode").apply(print)

        obs0      obs1      obs2      obs3  action  reward  episode  \
0   0.009780  0.009512  0.036042 -0.044225       1     1.0        0   
1   0.009970  0.204099  0.035158 -0.325322       0     1.0        0   
2   0.014052  0.008495  0.028651 -0.021762       1     1.0        0   
3   0.014222  0.203194  0.028216 -0.305270       0     1.0        0   
4   0.018286  0.007682  0.022111 -0.003823       0     1.0        0   
..       ...       ...       ...       ...     ...     ...      ...   
83 -0.558270 -1.314114  0.085314  1.076235       0     1.0        0   
84 -0.584553 -1.510253  0.106839  1.394425       1     1.0        0   
85 -0.614758 -1.316611  0.134727  1.136968       0     1.0        0   
86 -0.641090 -1.513213  0.157467  1.468689       0     1.0        0   
87 -0.671354 -1.709872  0.186840  1.806132       0     1.0        0   

    tot_reward  comb_reward  
0         88.0         89.0  
1         88.0         89.0  
2         88.0         89.0  
3         88.0         89.0

[94 rows x 9 columns]
          obs0      obs1      obs2      obs3  action  reward  episode  \
940  -0.022899 -0.002000 -0.026444  0.033869       0     1.0        8   
941  -0.022939 -0.196733 -0.025767  0.318092       0     1.0        8   
942  -0.026874 -0.391479 -0.019405  0.602539       1     1.0        8   
943  -0.034704 -0.196091 -0.007354  0.303808       0     1.0        8   
944  -0.038625 -0.391107 -0.001278  0.594162       1     1.0        8   
...        ...       ...       ...       ...     ...     ...      ...   
1013 -0.129182  0.226606 -0.103824 -1.001849       1     1.0        8   
1014 -0.124650  0.422950 -0.123861 -1.325249       0     1.0        8   
1015 -0.116191  0.229591 -0.150366 -1.073754       1     1.0        8   
1016 -0.111599  0.426345 -0.171841 -1.409597       1     1.0        8   
1017 -0.103072  0.623131 -0.200033 -1.750701       0     1.0        8   

      tot_reward  comb_reward  
940         78.0         79.0  
941         78.0         79.0  
942  

[66 rows x 9 columns]
          obs0      obs1      obs2      obs3  action  reward  episode  \
2298 -0.047685  0.000311  0.025963 -0.008997       0     1.0       21   
2299 -0.047679 -0.195173  0.025784  0.291763       1     1.0       21   
2300 -0.051583 -0.000428  0.031619  0.007322       0     1.0       21   
2301 -0.051591 -0.195989  0.031765  0.309811       1     1.0       21   
2302 -0.055511 -0.001334  0.037961  0.027313       0     1.0       21   
...        ...       ...       ...       ...     ...     ...      ...   
2401  0.148183  0.593448 -0.046847 -1.059161       1     1.0       21   
2402  0.160052  0.789158 -0.068030 -1.366172       1     1.0       21   
2403  0.175835  0.985063 -0.095353 -1.679335       1     1.0       21   
2404  0.195537  1.181152 -0.128940 -2.000124       1     1.0       21   
2405  0.219160  1.377364 -0.168943 -2.329801       1     1.0       21   

      tot_reward  comb_reward  
2298       108.0        109.0  
2299       108.0        109.0  
2300 

[92 rows x 9 columns]
          obs0      obs1      obs2      obs3  action  reward  episode  \
3443  0.032532 -0.045749  0.006925  0.034954       0     1.0       30   
3444  0.031617 -0.240970  0.007624  0.329814       1     1.0       30   
3445  0.026798 -0.045957  0.014221  0.039545       0     1.0       30   
3446  0.025879 -0.241280  0.015011  0.336681       1     1.0       30   
3447  0.021053 -0.046375  0.021745  0.048769       1     1.0       30   
...        ...       ...       ...       ...     ...     ...      ...   
3593 -0.560558 -1.189690  0.138864  1.209872       1     1.0       30   
3594 -0.584352 -0.996607  0.163061  0.963731       1     1.0       30   
3595 -0.604284 -0.804007  0.182336  0.726391       1     1.0       30   
3596 -0.620364 -0.611811  0.196864  0.496187       1     1.0       30   
3597 -0.632600 -0.419930  0.206788  0.271429       1     1.0       30   

      tot_reward  comb_reward  
3443       155.0        156.0  
3444       155.0        156.0  
3445 

[98 rows x 9 columns]
          obs0      obs1      obs2      obs3  action  reward  episode  \
5227 -0.032191  0.011321  0.015230  0.037914       1     1.0       44   
5228 -0.031964  0.206221  0.015988 -0.249925       0     1.0       44   
5229 -0.027840  0.010875  0.010990  0.047758       1     1.0       44   
5230 -0.027622  0.205838  0.011945 -0.241438       1     1.0       44   
5231 -0.023506  0.400787  0.007116 -0.530329       0     1.0       44   
...        ...       ...       ...       ...     ...     ...      ...   
5310 -0.577542 -1.309108  0.087352  1.088022       0     1.0       44   
5311 -0.603724 -1.505266  0.109112  1.406787       1     1.0       44   
5312 -0.633829 -1.311654  0.137248  1.150111       0     1.0       44   
5313 -0.660062 -1.508274  0.160250  1.482492       0     1.0       44   
5314 -0.690228 -1.704946  0.189900  1.820633       0     1.0       44   

      tot_reward  comb_reward  
5227        88.0         89.0  
5228        88.0         89.0  
5229 

5776        51.0         52.0  
          obs0      obs1      obs2      obs3  action  reward  episode  \
5777  0.022076  0.009062 -0.017992 -0.049172       0     1.0       50   
5778  0.022257 -0.185797 -0.018975  0.237781       1     1.0       50   
5779  0.018541  0.009591 -0.014220 -0.060826       0     1.0       50   
5780  0.018733 -0.185324 -0.015436  0.227336       1     1.0       50   
5781  0.015026  0.010015 -0.010890 -0.070176       0     1.0       50   
...        ...       ...       ...       ...     ...     ...      ...   
5892  0.750358  2.069133 -0.092745 -1.370727       0     1.0       50   
5893  0.791741  1.875285 -0.120160 -1.108433       1     1.0       50   
5894  0.829247  2.071764 -0.142328 -1.436268       1     1.0       50   
5895  0.870682  2.268324 -0.171054 -1.769831       1     1.0       50   
5896  0.916048  2.464915 -0.206450 -2.110458       1     1.0       50   

      tot_reward  comb_reward  
5777       120.0        121.0  
5778       120.0        121

          obs0      obs1      obs2      obs3  action  reward  episode  \
6232 -0.032723  0.017031 -0.044400  0.019595       0     1.0       55   
6233 -0.032382 -0.177427 -0.044008  0.297945       1     1.0       55   
6234 -0.035930  0.018294 -0.038049 -0.008286       0     1.0       55   
6235 -0.035565 -0.176262 -0.038215  0.272154       0     1.0       55   
6236 -0.039090 -0.370819 -0.032772  0.552543       1     1.0       55   
...        ...       ...       ...       ...     ...     ...      ...   
6359 -0.666544 -1.675209  0.098938  1.250017       1     1.0       55   
6360 -0.700048 -1.481484  0.123939  0.989892       0     1.0       55   
6361 -0.729678 -1.678027  0.143736  1.318790       0     1.0       55   
6362 -0.763239 -1.874644  0.170112  1.652787       0     1.0       55   
6363 -0.800732 -2.071295  0.203168  1.993279       1     1.0       55   

      tot_reward  comb_reward  
6232       132.0        133.0  
6233       132.0        133.0  
6234       132.0        133

[114 rows x 9 columns]
          obs0      obs1      obs2      obs3  action  reward  episode  \
7192  0.009179  0.025572  0.041008 -0.016774       1     1.0       64   
7193  0.009690  0.220083  0.040673 -0.296241       0     1.0       64   
7194  0.014092  0.024406  0.034748  0.008986       1     1.0       64   
7195  0.014580  0.219012  0.034928 -0.272534       1     1.0       64   
7196  0.018960  0.413619  0.029477 -0.553999       0     1.0       64   
...        ...       ...       ...       ...     ...     ...      ...   
7324  0.667936  1.549375 -0.048380 -1.539159       1     1.0       64   
7325  0.698923  1.745044 -0.079163 -1.846538       1     1.0       64   
7326  0.733824  1.940944 -0.116094 -2.162717       1     1.0       64   
7327  0.772643  2.136995 -0.159348 -2.488869       1     1.0       64   
7328  0.815383  2.333044 -0.209125 -2.825866       1     1.0       64   

      tot_reward  comb_reward  
7192       137.0        138.0  
7193       137.0        138.0  
7194

[100 rows x 9 columns]
          obs0      obs1      obs2      obs3  action  reward  episode  \
8859  0.005589  0.028147 -0.030174  0.000101       0     1.0       77   
8860  0.006152 -0.166530 -0.030172  0.283113       1     1.0       77   
8861  0.002822  0.029009 -0.024509 -0.018931       1     1.0       77   
8862  0.003402  0.224474 -0.024888 -0.319245       0     1.0       77   
8863  0.007891  0.029715 -0.031273 -0.034514       1     1.0       77   
...        ...       ...       ...       ...     ...     ...      ...   
8944 -0.693955 -1.289452  0.090229  0.987791       0     1.0       77   
8945 -0.719744 -1.485658  0.109985  1.307395       1     1.0       77   
8946 -0.749457 -1.292089  0.136132  1.051065       0     1.0       77   
8947 -0.775299 -1.488728  0.157154  1.383194       0     1.0       77   
8948 -0.805073 -1.685422  0.184818  1.720613       0     1.0       77   

      tot_reward  comb_reward  
8859        90.0         91.0  
8860        90.0         91.0  
8861

[115 rows x 9 columns]
           obs0      obs1      obs2      obs3  action  reward  episode  \
10694  0.044687  0.003892  0.025017  0.022721       0     1.0       92   
10695  0.044765 -0.191579  0.025471  0.323191       1     1.0       92   
10696  0.040933  0.003171  0.031935  0.038648       1     1.0       92   
10697  0.040997  0.197821  0.032708 -0.243790       1     1.0       92   
10698  0.044953  0.392461  0.027832 -0.525979       1     1.0       92   
...         ...       ...       ...       ...     ...     ...      ...   
10889 -0.080212 -0.922433 -0.085771  0.400685       0     1.0       92   
10890 -0.098660 -1.116240 -0.077757  0.665142       0     1.0       92   
10891 -0.120985 -1.310199 -0.064454  0.932364       1     1.0       92   
10892 -0.147189 -1.114270 -0.045807  0.620143       1     1.0       92   
10893 -0.169475 -0.918539 -0.033404  0.313393       0     1.0       92   

       tot_reward  comb_reward  
10694       200.0        201.0  
10695       200.0     

In [38]:
memory_df2.groupby("episode").reward.sum().mean()

116.14

In [37]:
(memory_df2.groupby("episode").reward.sum() >= 200).value_counts()

False    90
True     10
Name: reward, dtype: int64

---
### Machine Learning Foundation (C) 2020 IBM Corporation