## Work in Progress
Please update and use the code below.

## Create conda environment
Note that you must use Python 2 for compatibility with keras-rl. Everything else supports Python 3.

`yes | conda install numpy scipy matplotlib jupyter h5py`


## Now install gym:
- https://github.com/openai/gym 

`git clone https://github.com/openai/gym.git`

`cd gym`

`pip install -e .`

`brew update && brew install cmake boost boost-python sdl2 swig wget`

`pip install -e '.[all]'`

## Now install TensorFlow
`pip install tensorflow`

## Some other dependencies
`pip install seaborn argparse`

## Install keras-rl:
- https://github.com/matthiasplappert/keras-rl

`pip install keras-rl`

___

## Test Environment Usage

In [1]:
import gym
env = gym.make('CartPole-v0')
env.reset()
env.render()

[2017-10-23 10:36:41,614] Making new env: CartPole-v0


In [2]:
from __future__ import print_function  # __future__ imports belong at the top
import numpy as np

# reset cartpole to upright in the middle
env.reset()
reward_total = 0
for _ in range(50):
    # each environment defines its own set of valid actions;
    # CartPole takes a discrete action, 0 or 1, for the direction of the force applied to the cart
    action_to_take = int(np.random.rand()*2) # randomly choose an action
    
    # simulate the action we have taken in the physical world
    physics_state, reward, is_done, _ = env.step(action=action_to_take) 
    # the state of this dynamic system is physics_state = (x, x_dot, theta, theta_dot)
    
    if not is_done:
        reward_total += reward # the pole is still up! joy!
        
    env.render()  

print('Successfully Completed', reward_total, 'steps before falling!')

[2017-10-23 10:36:43,124] You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.


Successfully Completed 14.0 steps before falling!
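
The warning above shows up because the loop keeps calling `step()` after the episode has already returned `done = True`. A minimal sketch of the usual pattern, which stops as soon as the episode ends, is shown below. `ToyEnv` and `run_episode` are hypothetical names: `ToyEnv` only mimics the 4-tuple returned by `step()` so the snippet runs without gym, and its dynamics have nothing to do with CartPole.

```python
import random


class ToyEnv(object):
    """Hypothetical stand-in for a gym environment; not part of gym."""

    def __init__(self, episode_length=14):
        self.episode_length = episode_length
        self.steps_taken = 0

    def reset(self):
        self.steps_taken = 0
        return 0.0  # dummy observation

    def step(self, action):
        # Mimic the (observation, reward, done, info) tuple used above.
        self.steps_taken += 1
        is_done = self.steps_taken >= self.episode_length
        return 0.0, 1.0, is_done, {}


def run_episode(env, max_iterations=50):
    """Step until the episode ends or max_iterations is reached."""
    env.reset()
    reward_total = 0.0
    for _iteration in range(max_iterations):
        action_to_take = random.randint(0, 1)  # random action, 0 or 1
        _obs, reward, is_done, _info = env.step(action_to_take)
        reward_total += reward  # count the reward of every step taken
        if is_done:
            break  # never call step() again without a reset()
    return reward_total


toy_env = ToyEnv()
total = run_episode(toy_env)
print('Episode ended after {} steps'.format(toy_env.steps_taken))
```

With a real gym environment the same `break` (or a call to `env.reset()`) in the loop should avoid the warning.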


In [3]:
env.close()

We eventually want a controller that can interpret the state of the system and decide how best to keep the cartpole from falling. Our reward will be the number of time steps achieved. 

Following the cartpole physics example: https://github.com/matthiasplappert/keras-rl/blob/master/examples/dqn_cartpole.py

![CartPole](https://raw.githubusercontent.com/matthiasplappert/keras-rl/master/assets/cartpole.gif)

## TODO: Add more explanation of the system

In [4]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

Using TensorFlow backend.


In [5]:
ENV_NAME = 'CartPole-v0'

# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

[2017-10-23 10:36:47,143] Making new env: CartPole-v0


In [6]:
print(env.observation_space)
print(env.action_space)
print(env.metadata)


Box(4,)
Discrete(2)
{'render.modes': ['human', 'rgb_array'], 'video.frames_per_second': 50}


In [7]:
# Next, we build a very simple model.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                80        
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
activation_2 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 16)                272       
_________________________________________________________________
activation_3 (Activation)    (None, 16)                0         
__________
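
The `Param #` column in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per unit. The sketch below recomputes the counts for the layer sizes used above. `dense_params` is a hypothetical helper, and the count for the final layer (not visible in the truncated summary) assumes `nb_actions = 2`, as reported by `Discrete(2)`.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# 4 observations -> 16 -> 16 -> 16 -> 2 actions
layer_sizes = [4, 16, 16, 16, 2]
counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # per-layer parameter counts
print(sum(counts))  # total trainable parameters
```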

In [8]:
# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!
memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()

dqn = DQNAgent(model=model, 
               nb_actions=nb_actions, 
               memory=memory, 
               nb_steps_warmup=10,
               target_model_update=1e-2, 
               policy=policy)

dqn.compile(Adam(lr=1e-3), metrics=['mae'])
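
Two of the settings above are easier to follow with a small example. `BoltzmannQPolicy` samples actions with probabilities given by a softmax over the Q-values, and a `target_model_update` value below 1 is, as far as I understand keras-rl, treated as a soft update that moves the target network a small fraction toward the online network at every step. The sketch below illustrates both ideas in plain Python. The function names are hypothetical and this is not keras-rl's own implementation.

```python
import math
import random


def boltzmann_probabilities(q_values, tau=1.0):
    """Softmax over Q-values with temperature tau."""
    max_q = max(q_values)  # subtracting the max avoids overflow in exp
    exp_values = [math.exp((q - max_q) / tau) for q in q_values]
    total = sum(exp_values)
    return [value / total for value in exp_values]


def sample_action(q_values, tau=1.0):
    """Draw an action index with the Boltzmann probabilities."""
    threshold = random.random()
    cumulative = 0.0
    for action, probability in enumerate(boltzmann_probabilities(q_values, tau)):
        cumulative += probability
        if threshold < cumulative:
            return action
    return len(q_values) - 1  # guard against floating-point round-off


def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight a fraction tau toward the online weight."""
    return [tau * online + (1.0 - tau) * target
            for target, online in zip(target_weights, online_weights)]


probabilities = boltzmann_probabilities([1.0, 2.0])
action = sample_action([1.0, 2.0])
target = soft_update([0.0, 0.0], [1.0, -1.0])
print(probabilities)
print(action)
print(target)
```

A higher `tau` in the policy flattens the probabilities (more exploration). A smaller `tau` in the soft update makes the target network change more slowly.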

In [9]:
# Okay, now it's time to learn something! Visualizing the training slows it down quite a lot,
# so we leave it off here (visualize=False). You can always safely abort the training
# prematurely using Ctrl + C.
dqn.fit(env, nb_steps=10000, visualize=False, verbose=0)



<keras.callbacks.History at 0x119cbaf60>

In [10]:
# Finally, evaluate our algorithm for 3 episodes.
dqn.test(env, nb_episodes=3, visualize=True)

Testing for 3 episodes ...
Episode 1: reward: 162.000, steps: 162
Episode 2: reward: 184.000, steps: 184
Episode 3: reward: 178.000, steps: 178


<keras.callbacks.History at 0x12e35cb70>

____
## Atari Example
This example also comes from the keras-rl examples.
- https://github.com/matthiasplappert/keras-rl/blob/master/examples/dqn_atari.py

In [11]:
from __future__ import division

from PIL import Image
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Convolution2D, Permute
from keras.optimizers import Adam
import keras.backend as K

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, BoltzmannQPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor


INPUT_SHAPE = (84, 84)
WINDOW_LENGTH = 4

In [12]:
class AtariProcessor(Processor):
    def process_observation(self, observation):
        assert observation.ndim == 3  # (height, width, channel)
        img = Image.fromarray(observation)
        img = img.resize(INPUT_SHAPE).convert('L')  # resize and convert to grayscale
        processed_observation = np.array(img)
        assert processed_observation.shape == INPUT_SHAPE
        return processed_observation.astype('uint8')  # saves storage in experience memory

    def process_state_batch(self, batch):
        # We could perform this processing step in `process_observation`. In this case, however,
        # we would need to store a `float32` array instead, which is 4x more memory intensive than
        # an `uint8` array. This matters if we store 1M observations.
        processed_batch = batch.astype('float32') / 255.
        return processed_batch

    def process_reward(self, reward):
        return np.clip(reward, -1., 1.)
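
The comment in `process_state_batch` about memory is worth a quick back-of-the-envelope check. The sketch below estimates the raw storage needed for one million 84x84 grayscale frames as `uint8` versus `float32`. It is a rough estimate that assumes one stored frame per step and ignores any overhead in the replay memory itself.

```python
frame_pixels = 84 * 84         # INPUT_SHAPE
stored_frames = 1000000        # 1M observations

bytes_uint8 = frame_pixels * 1 * stored_frames    # 1 byte per pixel
bytes_float32 = frame_pixels * 4 * stored_frames  # 4 bytes per pixel

print('uint8:   {:.3f} GB'.format(bytes_uint8 / 1e9))
print('float32: {:.3f} GB'.format(bytes_float32 / 1e9))
```

Either way this is a lot of RAM for a laptop, which is one reason to keep the frames as `uint8` and convert to `float32` only per batch.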

In [13]:
# Get the environment and extract the number of actions.
env = gym.make('Breakout-v0')
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

[2017-10-23 10:37:56,282] Making new env: Breakout-v0


In [14]:
# Next, we build our model. We use the same model that was described by Mnih et al. (2015).
input_shape = (WINDOW_LENGTH,) + INPUT_SHAPE
model = Sequential()
if K.image_dim_ordering() == 'tf':
    # (width, height, channels)
    model.add(Permute((2, 3, 1), input_shape=input_shape))
elif K.image_dim_ordering() == 'th':
    # (channels, width, height)
    model.add(Permute((1, 2, 3), input_shape=input_shape))
else:
    raise RuntimeError('Unknown image_dim_ordering.')

In [15]:
# Keras 2 call style: kernel size as a tuple and `strides` instead of `subsample`
model.add(Convolution2D(32, (8, 8), strides=(4, 4)))
model.add(Activation('relu'))
model.add(Convolution2D(64, (4, 4), strides=(2, 2)))
model.add(Activation('relu'))
model.add(Convolution2D(64, (3, 3), strides=(1, 1)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
permute_1 (Permute)          (None, 84, 84, 4)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 20, 20, 32)        8224      
_________________________________________________________________
activation_5 (Activation)    (None, 20, 20, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 9, 9, 64)          32832     
_________________________________________________________________
activation_6 (Activation)    (None, 9, 9, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 7, 7, 64)          36928     
_________________________________________________________________
activation_7 (Activation)    (None, 7, 7, 64)          0         
__________
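
The output shapes and parameter counts in the summary follow from the kernel sizes and strides. The sketch below recomputes them, assuming no padding ('valid' convolutions) and square kernels. `conv_output_size` and `conv_params` are hypothetical helpers, not Keras functions.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of a convolution without padding."""
    return (input_size - kernel_size) // stride + 1


def conv_params(kernel_size, in_channels, filters):
    """One kernel per (input channel, filter) pair plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters


# (filters, kernel size, stride) for the three convolution layers
conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, channels = 84, 4  # 84x84 frames, WINDOW_LENGTH = 4 stacked frames
rows = []
for filters, kernel, stride in conv_layers:
    size = conv_output_size(size, kernel, stride)
    rows.append((size, filters, conv_params(kernel, channels, filters)))
    channels = filters

for out_size, filters, params in rows:
    print('output {0}x{0}x{1}, params {2}'.format(out_size, filters, params))
```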


In [16]:
# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!
memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)
processor = AtariProcessor()

# Select a policy. We use eps-greedy action selection, which means that a random action is selected
# with probability eps. We anneal eps from 1.0 to 0.1 over the course of 1M steps. This is done so that
# the agent initially explores the environment (high eps) and then gradually sticks to what it knows
# (low eps). We also set a dedicated eps value that is used during testing. Note that we set it to 0.05
# so that the agent still performs some random actions. This ensures that the agent cannot get stuck.
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.05,
                              nb_steps=1000000)

# The trade-off between exploration and exploitation is difficult and an ongoing research topic.
# If you want, you can experiment with the parameters or use a different policy. Another popular one
# is Boltzmann-style exploration:
# policy = BoltzmannQPolicy(tau=1.)
# Feel free to give it a try!

dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor, nb_steps_warmup=50000, gamma=.99, target_model_update=10000,
               train_interval=4, delta_clip=1.)
dqn.compile(Adam(lr=.00025), metrics=['mae'])


runs = 0
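
To make the annealing schedule described in the comments above concrete, here is a minimal sketch of a linearly annealed eps-greedy policy. The defaults mirror the values passed to `LinearAnnealedPolicy` above. The function names are hypothetical and the code is an illustration in plain Python, not keras-rl's implementation.

```python
import random


def annealed_eps(step, value_max=1.0, value_min=0.1, nb_steps=1000000):
    """Linearly decay eps from value_max to value_min over nb_steps, then hold."""
    slope = -(value_max - value_min) / float(nb_steps)
    return max(value_min, slope * step + value_max)


def eps_greedy_action(q_values, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda action: q_values[action])


for step in (0, 500000, 1000000, 2000000):
    print('step {:>7}: eps = {:.2f}'.format(step, annealed_eps(step)))
```

Early in training almost every action is random. After 1M steps one action in ten still is, which keeps some exploration going.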

In [24]:
%%time
# Okay, now it's time to learn something! keras-rl catches the interrupt exception, so training
# can be aborted early with Ctrl + C. fit() also accepts Keras callbacks, although none are used here.
steps_per_run = 1000000
runs += steps_per_run
dqn.fit(env, nb_steps=steps_per_run, visualize=True, verbose=0)
print(runs)


2000000
CPU times: user 2min 19s, sys: 45.4 s, total: 3min 4s
Wall time: 4min 20s


In [21]:
dqn.save_weights('large_data/breakout_LinAnnPol_GreedyQ.h5f', overwrite=True)

In [22]:
dqn.test(env, nb_episodes=1, visualize=True)

Testing for 1 episodes ...


KeyboardInterrupt: 

In [25]:
env.close()

Be sure to check out many other pre-trained implementations at https://gym.openai.com/envs/

## Run an agent that was trained on a modest laptop for 16 hours

In [None]:
# optionally load the example I trained (for 16 hours) and render it
from __future__ import division

from PIL import Image
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Convolution2D, Permute
from keras.optimizers import Adam
import keras.backend as K

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, BoltzmannQPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor


INPUT_SHAPE = (84, 84)
WINDOW_LENGTH = 4

env = gym.make('Breakout-v0')
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

class AtariProcessor(Processor):
    def process_observation(self, observation):
        assert observation.ndim == 3  # (height, width, channel)
        img = Image.fromarray(observation)
        img = img.resize(INPUT_SHAPE).convert('L')  # resize and convert to grayscale
        processed_observation = np.array(img)
        assert processed_observation.shape == INPUT_SHAPE
        return processed_observation.astype('uint8')  # saves storage in experience memory

    def process_state_batch(self, batch):
        # We could perform this processing step in `process_observation`. In this case, however,
        # we would need to store a `float32` array instead, which is 4x more memory intensive than
        # an `uint8` array. This matters if we store 1M observations.
        processed_batch = batch.astype('float32') / 255.
        return processed_batch

    def process_reward(self, reward):
        return np.clip(reward, -1., 1.)

# Next, we build our model. We use the same model that was described by Mnih et al. (2015).
input_shape = (WINDOW_LENGTH,) + INPUT_SHAPE
model = Sequential()
if K.image_dim_ordering() == 'tf':
    # (width, height, channels)
    model.add(Permute((2, 3, 1), input_shape=input_shape))
elif K.image_dim_ordering() == 'th':
    # (channels, width, height)
    model.add(Permute((1, 2, 3), input_shape=input_shape))
else:
    raise RuntimeError('Unknown image_dim_ordering.')

# Keras 2 call style: kernel size as a tuple and `strides` instead of `subsample`
model.add(Convolution2D(32, (8, 8), strides=(4, 4)))
model.add(Activation('relu'))
model.add(Convolution2D(64, (4, 4), strides=(2, 2)))
model.add(Activation('relu'))
model.add(Convolution2D(64, (3, 3), strides=(1, 1)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))

memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)
processor = AtariProcessor()
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.05,
                              nb_steps=1000000)
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor, nb_steps_warmup=50000, gamma=.99, target_model_update=10000,
               train_interval=4, delta_clip=1.)
dqn.compile(Adam(lr=.00025), metrics=['mae'])

# no more training, just load it up and see it go!
dqn.load_weights('large_data/breakout_LinAnnPol_GreedyQ.h5f')


In [None]:
dqn.test(env, nb_episodes=1, visualize=True)

In [None]:
env.close()