# Approximate Q-learning

In this notebook you will teach a neural network to do Q-learning. The assignment text was written for lasagne; the code in this copy uses Keras with the TensorFlow backend.

__Frameworks__ - we'll accept this homework in any deep learning framework. For example, it translates to TensorFlow almost line for line. However, we recommend that you stick to theano/lasagne unless you're certain about your skills in the framework of your choice.

In [1]:
import keras.backend as K
from keras.models import Model
from keras.layers import Dense, Input, Lambda
from keras.optimizers import RMSprop, Adam

Using TensorFlow backend.


In [4]:
from pycrayon import CrayonClient
client = CrayonClient(hostname='localhost')
crayon = client.create_experiment("keras-1")

In [5]:
import gym
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
env = gym.make("CartPole-v0").env
env.reset()
n_actions = env.action_space.n
state_dim = env.observation_space.shape

print("n_actions={}, state_dim={}".format(n_actions, state_dim))
#plt.imshow(env.render("rgb_array"))

[2017-04-21 15:59:38,775] Making new env: CartPole-v0


n_actions=2, state_dim=(4,)


# Approximate (deep) Q-learning: building the network

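In this section we will build and train a naive Q-learning agent. The code below uses Keras rather than theano/lasagne.

The first step is to set the hyperparameters: the hidden layer sizes and the discount factor `gamma`. Note that `L2_SIZE` is unused while the second hidden layer stays commented out.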

In [7]:
L1_SIZE = 50
L2_SIZE = 50
gamma = 0.99

In [8]:
# input - observation, output - qvalues
in_t = Input(shape=state_dim, name='input')
l1_t = Dense(L1_SIZE, activation='relu', name='l1')(in_t)
#l2_t = Dense(L2_SIZE, activation='relu', name='l2')(l1_t)
l_out_t = Dense(n_actions, name='out')(l1_t)
model = Model(inputs=[in_t], outputs=[l_out_t])
model.summary()

model.compile(optimizer=Adam(lr=0.0005), loss='mse')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 4)                 0         
_________________________________________________________________
l1 (Dense)                   (None, 50)                250       
_________________________________________________________________
out (Dense)                  (None, 2)                 102       
Total params: 352
Trainable params: 352
Non-trainable params: 0
_________________________________________________________________
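
The parameter counts in the summary follow from the layer shapes: a dense layer with `n_in` inputs and `n_out` units holds `n_in * n_out` weights plus `n_out` biases. A quick arithmetic check, where `dense_params` is a helper name introduced here purely for illustration:

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

l1_params = dense_params(4, 50)    # state_dim=(4,) -> 50 hidden units
out_params = dense_params(50, 2)   # 50 hidden units -> n_actions=2
total_params = l1_params + out_params
print(l1_params, out_params, total_params)  # 250 102 352
```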


### Playing the game

In [9]:
epsilon = 0.25 #initial epsilon

def generate_session(t_max=1000):
    """play env with approximate q-learning agent and train it at the same time"""
    
    total_reward = 0
    s = env.reset()
    q_vals_means = []
    losses = []

    for t in range(t_max):
        q_vals = model.predict_on_batch(np.array([s]))[0]
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(q_vals)
        
        new_s,r,done,info = env.step(a)

        new_q = model.predict_on_batch(np.array([new_s]))[0]
        valid_q = np.array(q_vals)
        if done:
            valid_q[a] = r
        else:
            valid_q[a] = r + gamma * new_q.max()
 
        l = model.train_on_batch(np.array([s]), np.array([valid_q]))
        losses.append(l)
        q_vals_means.append(q_vals.mean())
        total_reward+=r
        
        s = new_s
        if done: break

    crayon.add_scalar_value("q_mean", float(np.mean(q_vals_means)))
    crayon.add_scalar_value("reward", total_reward)
    crayon.add_scalar_value("loss", float(np.mean(losses)))
    
    return total_reward
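
The two core steps of the loop above, epsilon-greedy action selection and building the training target, can be sketched in plain NumPy without the network or the environment. The function names `epsilon_greedy` and `td_target` and the toy Q-values are assumptions made for this illustration, not part of the notebook:

```python
import numpy as np

def epsilon_greedy(q_vals, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.rand() < epsilon:
        return int(rng.randint(len(q_vals)))
    return int(np.argmax(q_vals))

def td_target(q_vals, a, r, next_q_vals, done, gamma=0.99):
    """Copy the current Q-values and overwrite only the taken action's entry."""
    target = np.array(q_vals, dtype=float)
    # Terminal transitions have no future value, so the target is just r.
    target[a] = r if done else r + gamma * np.max(next_q_vals)
    return target

rng = np.random.RandomState(0)
q = np.array([1.0, 2.0])        # toy Q(s, .) values
next_q = np.array([0.5, 3.0])   # toy Q(s', .) values

greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)  # epsilon=0: always argmax
target = td_target(q, a=1, r=1.0, next_q_vals=next_q, done=False)
terminal = td_target(q, a=1, r=1.0, next_q_vals=next_q, done=True)
```

Because the entries for the actions that were not taken equal the network's own predictions, their squared error is zero and only the Q-value of the taken action is pushed toward `r + gamma * max Q(s', .)`.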
        

In [10]:
for i in range(100):   
    rewards = [generate_session() for _ in range(100)] #generate new sessions
    
    epsilon*=0.95
    
    print ("%d: mean reward:%.3f\tepsilon:%.5f"%(i, np.mean(rewards),epsilon))

    if np.mean(rewards) > 300:
        print ("You Win!")
        break
        
    assert epsilon!=0, "Please explore environment"

0: mean reward:10.810	epsilon:0.23750
1: mean reward:10.640	epsilon:0.22562
2: mean reward:10.550	epsilon:0.21434
3: mean reward:12.200	epsilon:0.20363
4: mean reward:15.080	epsilon:0.19345
5: mean reward:15.360	epsilon:0.18377
6: mean reward:16.660	epsilon:0.17458
7: mean reward:17.970	epsilon:0.16586
8: mean reward:24.700	epsilon:0.15756
9: mean reward:23.850	epsilon:0.14968
10: mean reward:20.150	epsilon:0.14220
11: mean reward:16.250	epsilon:0.13509
12: mean reward:36.790	epsilon:0.12834
13: mean reward:35.340	epsilon:0.12192
14: mean reward:58.170	epsilon:0.11582
15: mean reward:93.050	epsilon:0.11003
16: mean reward:61.760	epsilon:0.10453
17: mean reward:76.250	epsilon:0.09930
18: mean reward:131.920	epsilon:0.09434
19: mean reward:85.730	epsilon:0.08962
20: mean reward:99.740	epsilon:0.08514
21: mean reward:144.220	epsilon:0.08088
22: mean reward:12.180	epsilon:0.07684
23: mean reward:37.100	epsilon:0.07300
24: mean reward:214.280	epsilon:0.06935
25: mean reward:228.480	epsilon:
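
The epsilon column in the log follows a plain geometric decay: after outer iteration `i` the value is `0.25 * 0.95 ** (i + 1)`. A minimal sketch that reproduces the schedule without running any training (`epsilon_schedule` is a name introduced here for illustration):

```python
def epsilon_schedule(initial=0.25, decay=0.95, n_iters=100):
    """Epsilon after each outer iteration: initial * decay ** (i + 1)."""
    eps, values = initial, []
    for _ in range(n_iters):
        eps *= decay
        values.append(eps)
    return values

schedule = epsilon_schedule()
print("after iteration 0: %.5f" % schedule[0])  # after iteration 0: 0.23750
print("smallest value over 100 iterations: %.5f" % min(schedule))
```

Since the decay is multiplicative, epsilon shrinks but stays positive over these 100 iterations, which is why the `assert epsilon!=0` check in the training loop does not fire.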

### Video

In [None]:
epsilon=0 #Don't forget to reset epsilon back to initial value if you want to go on training

In [None]:
#record sessions
import gym.wrappers
env = gym.wrappers.Monitor(env,directory="videos",force=True)
sessions = [generate_session() for _ in range(100)]
env.close()
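#unwrap back to the raw environment (env was already unwrapped once at creation,
#so chaining .env.env may fail; .unwrapped works at any wrapper depth)
env = env.unwrapped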
#upload to gym
#gym.upload("./videos/",api_key="<your_api_key>") #you'll need me later

#Warning! If you keep seeing an error that reads something like "DoubleWrapError",
#run env=gym.make("CartPole-v0");env.reset();

In [None]:
#show video
from IPython.display import HTML
import os

video_names = list(filter(lambda s:s.endswith(".mp4"),os.listdir("./videos/")))

HTML("""
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format("./videos/"+video_names[-1])) #this may or may not be _last_ video. Try other indices
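
`os.listdir` returns names in arbitrary order, which is why `video_names[-1]` may not be the latest recording. One possible way to pick the newest file is to compare modification times. This is a sketch under that assumption; `latest_video` is a hypothetical helper, and the demo uses dummy files in a temporary directory rather than real recordings:

```python
import os
import tempfile

def latest_video(directory, suffix=".mp4"):
    """Return the most recently modified file ending in suffix, or None."""
    paths = [os.path.join(directory, name)
             for name in os.listdir(directory) if name.endswith(suffix)]
    return max(paths, key=os.path.getmtime) if paths else None

# Self-contained demo: dummy files with explicit, increasing mtimes.
demo_dir = tempfile.mkdtemp()
for i, name in enumerate(["b.mp4", "a.mp4", "notes.json"]):
    path = os.path.join(demo_dir, name)
    open(path, "w").close()
    os.utime(path, (1000 + i, 1000 + i))

newest = latest_video(demo_dir)
print(os.path.basename(newest))  # a.mp4
```

In the notebook, the result of `latest_video("./videos/")` could be passed to `.format(...)` in place of `"./videos/"+video_names[-1]`.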