# Deep Reinforcement Learning - Project
### Algorithm Efficiency Exploration

##### Idea
In this project, we compare how different modern and, potentially, traditional control algorithms perform on one or more of the OpenAI Gym environments. Our core task is to implement “traditional” deep Q-learning efficiently on one of the Atari environments. Depending on how complex that turns out to be, we then plan to either apply the same algorithm to a separate environment or try additional algorithms on the same environment. From there, we hope to compare performance and training details across the different control models (how quickly a given level of performance was reached, whether the compute required differed substantially, etc.). We think this is a valuable project because it lets us explore the implementation of the “core” deep reinforcement learning algorithm and then dive into alternatives based on our progress.

Link to environments: https://gym.openai.com/envs/#atari



##### Code

In [1]:
import gym
import random

import numpy as np

from collections import deque

from keras.layers import Dense
from keras.models import Sequential
from keras.models import model_from_json
from keras.optimizers import Adam



Using TensorFlow backend.


In [2]:
ENVIRONMENT = gym.make("CartPole-v1")
EPISODES = 5000
STEPS = 1000
EPSILON = 1.0
EPSILON_MIN = 0.01
EPSILON_DECAY = 0.995
GAMMA = 0.95
LEARNING_RATE = 0.001
BATCH_SIZE = 32
OBSERVATION_SIZE = 4
ACTION_SIZE = 2
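
To get a feel for the exploration schedule these constants imply, the sketch below counts how many multiplicative decay steps it takes for `EPSILON` to drop below `EPSILON_MIN`. It assumes exactly one decay per episode, which holds in this notebook's training loop only once the replay memory contains at least `BATCH_SIZE` transitions. The helper name `decay_steps_until_min` is hypothetical and not part of the notebook.

```python
def decay_steps_until_min(epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Count multiplicative decays until epsilon is no longer above epsilon_min."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        steps += 1
    return steps


print(decay_steps_until_min())  # 919 with the constants above
```

Under that assumption, the agent keeps a noticeable share of random actions for roughly the first 900 episodes; a smaller `EPSILON_DECAY` would shorten that phase.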

The environment we are using has two discrete actions (push the cart left, push the cart right). Each action step yields an observation consisting of 4 continuous values, each ranging between a lower and an upper bound. My understanding is that we cannot build a tabular Q-matrix here, because a continuous state space cannot be enumerated in a 2- or 3-dimensional matrix without discretizing it. Instead, we train a neural network (or some other machine learning component) as a function approximator that takes the observation as input and computes a Q-value for each of the two actions. Over the course of training, we expect the network's outputs to approach the true Q-values. In short, the network takes a single input (the observation) and produces multiple outputs (one Q-value per action).
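
As a rough illustration of why a tabular approach scales poorly here, the sketch below counts the entries a Q-table would need if each of the 4 observation dimensions were split into a fixed number of bins. Discretization is an assumption made only for this illustration (the notebook does not do it), and `q_table_entries` is a hypothetical helper.

```python
def q_table_entries(bins_per_dim, observation_size=4, action_size=2):
    """Number of Q-values in a table over a uniformly discretized observation space."""
    return (bins_per_dim ** observation_size) * action_size


# The table grows with the fourth power of the per-dimension resolution.
for bins in (10, 100, 1000):
    print(bins, q_table_entries(bins))
```

Even a coarse grid of 10 bins per dimension needs 20,000 entries, and 100 bins per dimension already needs 200 million, which is why a small network that generalizes across nearby observations is the more practical choice.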

In [3]:
print("Action Space:")
print(ENVIRONMENT.action_space)

print("\nObservation Space:")
print(ENVIRONMENT.observation_space)
print(ENVIRONMENT.observation_space.low)
print(ENVIRONMENT.observation_space.high)

Action Space:
Discrete(2)

Observation Space:
Box(4,)
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]


In [4]:
# Neural Net for Deep Q Learning
# Sequential() creates the foundation of the layers.
model = Sequential()
# Input Layer of state size(4) and Hidden Layer with 24 nodes
model.add(Dense(24, input_dim=OBSERVATION_SIZE, activation='relu'))
# Hidden layer with 24 nodes
model.add(Dense(24, activation='relu'))
# Output Layer with # of actions: 2 nodes (left, right)
model.add(Dense(ACTION_SIZE, activation='linear'))
# Create the model based on the information above
model.compile(loss='mse', optimizer=Adam(lr=LEARNING_RATE))



In [5]:
# Setup the memory for training the neural network
memory = deque(maxlen=2000)

average_steps = 0

for e in range(EPISODES):
    # Prepare everything for a new episode
    observation = ENVIRONMENT.reset()
    observation = np.reshape(observation, (1,4))
    done = False
    t = 0
    
    while t < STEPS and not done:
        # Render the environment based on the update
        ENVIRONMENT.render()
        
        # Take a new action either randomly or based on Q
        r = np.random.rand(1)
        if r < EPSILON:
            action = ENVIRONMENT.action_space.sample()
        else:
            action = np.argmax(model.predict(observation)[0])
            
        # Simulate the action based on the current state/observation
        next_observation, reward, done, info = ENVIRONMENT.step(action)
        next_observation = np.reshape(next_observation, (1,4))
        
        # Remember the last observations, actions and rewards
        memory.append((observation, action, reward, next_observation, done))
        
        observation = next_observation
        t += 1
    
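    # Accumulate episode lengths for the periodic progress report
    average_steps += t
    if e % 50 == 0:
        # NOTE: the sum covers 50 episodes (only 1 at e == 0) but is divided by 100,
        # so the printed value is half the true mean episode length
        print(e, EPISODES, average_steps/100)
        average_steps = 0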
                
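    # Experience replay: once enough transitions are stored, fit on a random minibatch
    if len(memory) >= BATCH_SIZE:
        minibatch = random.sample(memory, BATCH_SIZE)
        for observation, action, reward, next_observation, done in minibatch:
            # Terminal transitions use the reward alone as the target
            target = reward
            if not done:
                # Otherwise bootstrap from the best predicted Q-value of the next state
                prediction = model.predict(next_observation)
                target = reward + GAMMA * np.amax(prediction[0])
            # Only the Q-value of the action actually taken is changed
            target_f = model.predict(observation)
            target_f[0][action] = target
            model.fit(observation, target_f, epochs=1, verbose=0)
        # Decay the exploration rate once per episode while it is above the minimum
        if EPSILON > EPSILON_MIN:
            EPSILON = EPSILON * EPSILON_DECAY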
        
# Close the environment
ENVIRONMENT.close()

0 5000 0.38
50 5000 11.2
100 5000 8.29
150 5000 15.1
200 5000 41.81
250 5000 39.59
300 5000 42.41
350 5000 47.3
400 5000 48.06
450 5000 45.29
500 5000 84.54
550 5000 136.47
600 5000 143.16
650 5000 185.4
700 5000 186.09
750 5000 166.07
800 5000 207.65
850 5000 242.67
900 5000 207.08
950 5000 226.39
1000 5000 245.12
1050 5000 63.83
1100 5000 176.93
1150 5000 68.18
1200 5000 179.24
1250 5000 225.32
1300 5000 236.31
1350 5000 230.79
1400 5000 250.0
1450 5000 191.41
1500 5000 234.7
1550 5000 199.68
1600 5000 224.63
1650 5000 184.9
1700 5000 225.86
1750 5000 242.38
1800 5000 245.11
1850 5000 239.67
1900 5000 197.28
1950 5000 94.39
2000 5000 192.06
2050 5000 232.14
2100 5000 147.17
2150 5000 197.06
2200 5000 165.56
2250 5000 209.38
2300 5000 180.0
2350 5000 99.23
2400 5000 114.48
2450 5000 159.3
2500 5000 193.6
2550 5000 60.04
2600 5000 80.41
2650 5000 67.67
2700 5000 40.96
2750 5000 63.26
2800 5000 78.01
2850 5000 97.53
2900 5000 107.32
2950 5
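
The replay step in the training loop above regresses the network's prediction for the taken action toward a one-step Q-learning target: the reward alone for terminal transitions, otherwise the reward plus the discounted maximum Q-value of the next state. Below is a minimal, framework-free sketch of that target. `td_target` is a hypothetical helper, and the Q-values are plain Python lists standing in for the output of `model.predict`.

```python
GAMMA = 0.95  # discount factor, same value as in the notebook


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step Q-learning target for a single transition."""
    if done:
        # No future return after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


# Non-terminal transition: bootstrap from the best next action.
print(td_target(1.0, [0.5, 2.0], done=False))
# Terminal transition: the target is just the reward.
print(td_target(1.0, [0.5, 2.0], done=True))
```

Keeping the target in a small pure function like this makes it easy to check by hand before wiring it into the Keras update.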

In [6]:
# Save the weights
model.save_weights('model_weights.h5')

# Save the model architecture
with open('model_architecture.json', 'w') as f:
    f.write(model.to_json())

In [None]:
# Model reconstruction from JSON file
with open('model_architecture.json', 'r') as f:
    model = model_from_json(f.read())

# Load weights into the new model
model.load_weights('model_weights.h5')