<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#CartPole,-aka-Inverted-Pendulum" data-toc-modified-id="CartPole,-aka-Inverted-Pendulum-1">CartPole, aka Inverted Pendulum</a></span></li><li><span><a href="#OpenAI's-CartPole-Environment" data-toc-modified-id="OpenAI's-CartPole-Environment-2">OpenAI's CartPole Environment</a></span></li><li><span><a href="#Define-Environment" data-toc-modified-id="Define-Environment-3">Define Environment</a></span></li><li><span><a href="#Define-Neural-Network-" data-toc-modified-id="Define-Neural-Network--4">Define Neural Network </a></span></li><li><span><a href="#Define-Agent" data-toc-modified-id="Define-Agent-5">Define Agent</a></span></li><li><span><a href="#Train-the-model" data-toc-modified-id="Train-the-model-6">Train the model</a></span></li><li><span><a href="#Test-the-model" data-toc-modified-id="Test-the-model-7">Test the model</a></span></li><li><span><a href="#Grading-Submission-Notes" data-toc-modified-id="Grading-Submission-Notes-8">Grading Submission Notes</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-9">Bonus Material</a></span></li></ul></div>

<center><h2>CartPole, aka Inverted Pendulum</h2></center>
<br>
<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/00/Cart-pendulum.svg/300px-Cart-pendulum.svg.png" width="35%"/></center>
<br><br><br>
<center><a href="https://fluxml.ai/experiments/cartPole/">Demo!</a></center>



<center><h2>OpenAI's CartPole Environment</h2></center>

A pole is attached to a cart by an un-actuated joint, and the cart moves along a frictionless track. 

The system is controlled by applying a force of +1 or -1 to the cart (i.e., pushing it right or left). 

The pendulum starts upright, and the goal is to prevent it from falling over. 

A reward of +1 is provided for every time-step that the pole remains upright. 

The episode ends when:

- The pole is more than 15 degrees from the vertical.
- The cart moves more than 2.4 units from the center.
- The episode reaches 200 time-steps.
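
The three termination conditions above can be sketched as a single check. This is only an illustration of the rules as described in the text: the function and constant names are hypothetical, the thresholds are copied from the list above, and the environment's own source code remains the authority on the exact values it uses.

```python
import math

# Thresholds copied from the description above (illustrative names).
MAX_ANGLE_DEG = 15.0     # pole angle limit, in degrees from vertical
MAX_CART_OFFSET = 2.4    # cart position limit, in units from the center
MAX_STEPS = 200          # episode length limit, in time-steps


def episode_done(cart_position, pole_angle_rad, step):
    """Return True if any of the three termination conditions holds."""
    pole_fell = abs(math.degrees(pole_angle_rad)) > MAX_ANGLE_DEG
    cart_left_track = abs(cart_position) > MAX_CART_OFFSET
    out_of_time = step >= MAX_STEPS
    return pole_fell or cart_left_track or out_of_time
```

Because the reward is +1 per time-step, the total reward of an episode equals the number of steps survived before `episode_done` would return `True`.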


In [13]:
reset -fs

Define Environment
----

In [14]:
import gym 
import numpy as np

In [15]:
env = gym.make('CartPole-v0')

In [16]:
# Let's see how the RL problem is formulated
n_states = env.observation_space.shape
print('State size:  ',  n_states[0])
n_actions = env.action_space.n
print('Action size: ', n_actions)

State size:   4
Action size:  2


Let's read the docs:
https://github.com/openai/gym/wiki/CartPole-v0

Let's read the code: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

Define Neural Network 
-----

In [17]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten
from tensorflow.keras.optimizers import Adam

In [18]:
# Define a sample model
model = Sequential()
model.add(Flatten(input_shape=(1,) + n_states))
model.add(Dense(24, activation='sigmoid'))
model.add(Dense(n_actions, activation='linear'))

In [19]:
# Define your model

# TODO: Write your model with comments





print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_2 (Dense)              (None, 24)                120       
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 50        
Total params: 170
Trainable params: 170
Non-trainable params: 0
_________________________________________________________________
None
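
The parameter counts in the summary above can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper below is a hypothetical name used only for this arithmetic check.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


hidden = dense_param_count(4, 24)   # 4 state values -> 24 hidden units
output = dense_param_count(24, 2)   # 24 hidden units -> 2 actions
print(hidden, output, hidden + output)  # 120 50 170
```

The `Flatten` layer only reshapes its input, so it contributes no parameters.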


Define Agent
----

In [20]:
# We are using keras-rl2 because we are using TensorFlow 2.
# Please use the updated environment to get the correct version.

In [21]:
# Define memory

from rl.memory import SequentialMemory

memory = SequentialMemory(limit=50_000, 
                          window_length=1)
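
Conceptually, a replay memory with a `limit` is a bounded buffer of past transitions from which training batches are drawn at random. The sketch below is a simplified stand-in written for illustration, not keras-rl's `SequentialMemory` implementation; `TinyReplayMemory` and its methods are hypothetical names.

```python
import random
from collections import deque


class TinyReplayMemory:
    """Bounded buffer of transitions; the oldest entries are dropped first."""

    def __init__(self, limit):
        self.buffer = deque(maxlen=limit)

    def append(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sample without replacement.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

With a training budget of 1,500 steps, a limit of 50,000 is never reached, so in this lab no transition would be evicted.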

In [22]:
# Define sample policy

from rl.policy import GreedyQPolicy

policy = GreedyQPolicy()
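
A greedy policy always picks the action with the highest estimated Q-value, so it never explores. One common alternative is epsilon-greedy, which takes a random action with probability `eps`. The plain-Python sketch below only illustrates the idea; the function names are hypothetical and it is not keras-rl code. (keras-rl's `rl.policy` module provides exploring policies such as `EpsGreedyQPolicy`; check the installed version for the exact options.)

```python
import random


def greedy_action(q_values):
    """Index of the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, eps, rng=random):
    """With probability eps pick a random action, otherwise act greedily."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)
```

With `eps=0.0` this reduces to the greedy policy; with `eps=1.0` every action is chosen uniformly at random.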

In [23]:
# Define your policy

# TODO: Write your policy with comments



In [24]:
# Define your agent

from rl.agents.dqn import DQNAgent

dqn = DQNAgent(model=model, 
               nb_actions=env.action_space.n, 
               memory=memory,
               nb_steps_warmup=15,
               target_model_update=1e-2,
               policy=policy)

dqn.compile(Adam(learning_rate=1e-3), 
            metrics=['mae'])
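
As far as I understand keras-rl, a `target_model_update` value below 1 is treated as a soft-update rate: after each training step the target network's weights move a small fraction of the way toward the online network's weights. The sketch below shows that rule on plain lists of numbers; `soft_update` and `tau` are illustrative names, and this is an assumption about the behavior rather than the library's code.

```python
def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau toward the online weight."""
    return [
        tau * online + (1.0 - tau) * target
        for target, online in zip(target_weights, online_weights)
    ]


target = [0.0, 2.0]
online = [1.0, 2.0]
target = soft_update(target, online, tau=1e-2)
```

A small `tau` keeps the target network changing slowly, which tends to stabilize the Q-value targets during training.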



In [25]:
# Note: 

# You may get a `TypeError` from TensorFlow when running the last cell 
# I could not figure it out 😦
# Just run the cell again and it goes away 🤯

Train the model
----

In [26]:
# Train model
dqn.fit(env, 
        nb_steps=1_500,  # Do not change this. You have a fixed training budget!
        visualize=False, # Set to True to watch the model live. WARNING: it might crash your notebook.
        verbose=1        # 0 = nothing, 1 = progress bar, 2 = episode logging
       );

Training for 1500 steps ...
Interval 1 (0 steps performed)
    8/10000 [..............................] - ETA: 9:30 - reward: 1.0000 



 1499/10000 [===>..........................] - ETA: 4:36 - reward: 1.0000done, took 48.820 seconds


Test the model
----

In [27]:
# Test model
test_results = dqn.test(env, 
                        nb_episodes=5, 
                        visualize=False, 
                        );

Testing for 5 episodes ...
Episode 1: reward: 9.000, steps: 9
Episode 2: reward: 9.000, steps: 9
Episode 3: reward: 8.000, steps: 8
Episode 4: reward: 10.000, steps: 10
Episode 5: reward: 10.000, steps: 10


In [28]:
# The max is 200 steps per episode.
# The goal of the assignment is to train an agent that averages about 180 steps.

from statistics import mean

# Remove the worst run
test_results.history['episode_reward'].remove(min(test_results.history['episode_reward']))

# Take the average of the remaining runs
test_performance = mean(test_results.history['episode_reward'])

print(f"The current model gets an average of {test_performance:.2f} steps.")

The current model gets an average of 9.50 steps.
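
The score above is the mean of the episode rewards after the single worst episode is dropped. The sketch below reproduces that calculation on the five test rewards printed earlier, without mutating the input list; `mean_without_worst` is a hypothetical helper name.

```python
from statistics import mean


def mean_without_worst(rewards):
    """Average of the rewards after dropping one lowest value."""
    trimmed = sorted(rewards)[1:]  # sorted() copies, so the input is unchanged
    return mean(trimmed)


rewards = [9.0, 9.0, 8.0, 10.0, 10.0]
print(mean_without_worst(rewards))  # 9.5
```

Unlike calling `.remove()` on the history list, this version can be re-run on the same results without discarding another episode each time.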


In [29]:
# 5 points for over 100 steps per episode.

assert test_performance > 100.00

AssertionError: 

In [None]:
# 5 points for over 150 steps per episode.

assert test_performance > 150.00

In [None]:
# 5 points for over 175 steps per episode.

assert test_performance > 175.00

Grading Submission Notes
-------

This is a high-variance problem: different runs give widely different results. Brian tried to fix the random seeds with no luck 😉 (reproducibility is a known issue with OpenAI Gym).

You can submit a "good" run of your model. If there is output, I'll grade your submitted lab without running the notebook. If there is __no__ output, I'll run the notebook to get output to grade. 

Bonus Material
-----

Learn more about CartPole from the physics and control-model perspective [here](https://danielpiedrahita.wordpress.com/portfolio/cart-pole-control/).

A more advanced version of the problem is "Cart-Pole Swing-Up", shown [here](https://www.youtube.com/watch?v=XiigTGKZfks).

<br>
<br> 
<br>

----