# Deep Reinforcement Learning - Project
### Algorithm Efficiency Exploration

##### Idea
In this project, we would like to compare how different modern and potentially traditional control algorithms perform on one or more of the OpenAI Gym environments. At the start, we hope to efficiently implement “traditional” deep Q-learning on one of the Atari environments. This is our core task. Then, depending on the complexity of this task, we plan to either attempt the same algorithm on a separate environment, or try additional algorithms on the same environment. From there, we hope to compare performance and training details for our different control models (how quickly performance was achieved, whether there were substantial differences in the compute needed, etc.). We think this is a very valuable project because it lets us explore the implementation of the “core” deep reinforcement learning algorithm and then dive into alternatives based on our progress.

Link to environments: https://gym.openai.com/envs/#atari



##### Code

In [6]:
import gym
import random

import numpy as np

from collections import deque

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

In [2]:
ENVIRONMENT = gym.make("CartPole-v1")
EPISODES = 10000
STEPS = 1000
EPSILON = 1.0
EPSILON_MIN = 0.01
EPSILON_DECAY = 0.995
GAMMA = 0.95
LEARNING_RATE = 0.001
BATCH_SIZE = 32
OBSERVATION_SIZE = 4
ACTION_SIZE = 2
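
To get a feel for the exploration schedule these constants imply, the short sketch below counts how many multiplicative decay steps it takes for epsilon to fall from `EPSILON` to `EPSILON_MIN`. The helper name `decays_until_floor` is hypothetical and only used for illustration; it assumes one decay per training round, as in the training loop further down.

```python
def decays_until_floor(epsilon=1.0, epsilon_min=0.01, decay=0.995):
    """Count multiplicative decay steps until epsilon is no longer above epsilon_min."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= decay
        steps += 1
    return steps, epsilon


# Defaults mirror EPSILON, EPSILON_MIN and EPSILON_DECAY from the cell above
steps, final_epsilon = decays_until_floor()
print(steps, final_epsilon)
```

Since the training loop decays epsilon at most once per episode, the agent keeps exploring noticeably for roughly the first 900 to 1,000 episodes under these settings.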

The environment that we are using has two discrete actions (move cart left, move cart right). Each action step results in an observation consisting of 4 continuous values (cart position, cart velocity, pole angle, pole angular velocity), each of which ranges between a lower and an upper bound. My understanding is that we cannot build a Q-matrix here, because continuous observations cannot be enumerated as the rows of a finite 2- or 3-dimensional table. Therefore, we need to train a neural network (or some other machine learning component) that takes the observation as input and calculates the Q-value for each of the two actions. Over the course of training, we expect the network to converge toward the true Q-function. All in all, our neural network takes a single input (the observation) and produces multiple outputs (one Q-value per action).
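
As a minimal sketch of what "converging to the real Q-values" means in practice, the snippet below computes the standard Q-learning regression targets for a small batch using NumPy only. The function name `q_learning_targets` and the example numbers are assumptions made for illustration; the training cell later in this notebook applies the same rule one sample at a time.

```python
import numpy as np


def q_learning_targets(q_current, q_next, actions, rewards, dones, gamma=0.95):
    """Return a copy of q_current where the taken action's entry holds the TD target."""
    q_current = np.asarray(q_current, dtype=float)
    q_next = np.asarray(q_next, dtype=float)
    actions = np.asarray(actions, dtype=int)
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=bool)

    # Discounted best next-state value; zero when the episode ended on this step
    bootstrap = np.where(dones, 0.0, gamma * q_next.max(axis=1))

    targets = q_current.copy()
    # Only the Q-value of the action actually taken is changed
    targets[np.arange(len(actions)), actions] = rewards + bootstrap
    return targets


# Two made-up transitions: the first continues, the second is terminal
example = q_learning_targets(
    q_current=[[1.0, 2.0], [0.5, 0.3]],
    q_next=[[3.0, 4.0], [9.0, 9.0]],
    actions=[0, 1],
    rewards=[1.0, 1.0],
    dones=[False, True],
)
print(example)
```

The untouched entries keep the network's own predictions, so fitting against these targets produces zero error for the actions that were not taken.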

In [3]:
print("Action Space:")
print(ENVIRONMENT.action_space)

print("\nObservation Space:")
print(ENVIRONMENT.observation_space)
print(ENVIRONMENT.observation_space.low)
print(ENVIRONMENT.observation_space.high)

Action Space:
Discrete(2)

Observation Space:
Box(4,)
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]


In [4]:
# Neural Net for Deep Q Learning
# Sequential() creates the foundation of the layers.
model = Sequential()
# Input Layer of state size(4) and Hidden Layer with 24 nodes
model.add(Dense(24, input_dim=OBSERVATION_SIZE, activation='relu'))
# Hidden layer with 24 nodes
model.add(Dense(24, activation='relu'))
# Output Layer with # of actions: 2 nodes (left, right)
model.add(Dense(ACTION_SIZE, activation='linear'))
# Create the model based on the information above
model.compile(loss='mse', optimizer=Adam(lr=LEARNING_RATE))

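
For a sense of how small this network is, the following sketch counts the trainable parameters (weights plus biases) of a fully connected network from its layer sizes. The helper `dense_param_count` is a hypothetical name introduced here, not part of Keras; `model.summary()` should report the same total for the model defined above.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # n_in * n_out weights plus one bias per output unit
        total += n_in * n_out + n_out
    return total


# 4 observation inputs, two hidden layers of 24 units, 2 action outputs
param_count = dense_param_count([4, 24, 24, 2])
print(param_count)
```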


In [5]:
# Setup the memory for training the neural network
memory = deque(maxlen=2000)

average_steps = 0

for e in range(EPISODES):
    # Prepare everything for a new episode
    observation = ENVIRONMENT.reset()
    observation = np.reshape(observation, (1,4))
    done = False
    t = 0
    
    while t < STEPS and not done:
        # Render the environment based on the update
        ENVIRONMENT.render()
        
        # Take a new action either randomly or based on Q
        r = np.random.rand(1)
        if r < EPSILON:
            action = ENVIRONMENT.action_space.sample()
        else:
            action = np.argmax(model.predict(observation)[0])
            
        # Simulate the action based on the current state/observation
        next_observation, reward, done, info = ENVIRONMENT.step(action)
        next_observation = np.reshape(next_observation, (1,4))
        
        # Remember the last observations, actions and rewards
        memory.append((observation, action, reward, next_observation, done))
        
        observation = next_observation
        t += 1
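        if done:
            average_steps += t
            if e % 50 == 0:
                # Mean episode length over the 50 episodes since the last report.
                # The recorded run below divided by 100 instead, so its logged
                # values are half the true averages.
                print(e, EPISODES, average_steps / 50)
                average_steps = 0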
        
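    # Experience replay: after each episode, train on a random minibatch of stored transitions
    if len(memory) >= BATCH_SIZE:
        minibatch = random.sample(memory, BATCH_SIZE)
        # Use separate names here so the episode-loop variables are not overwritten
        for obs, act, rew, next_obs, terminal in minibatch:
            # Terminal transitions have no future reward to bootstrap from
            target = rew
            if not terminal:
                prediction = model.predict(next_obs)
                target = rew + GAMMA * np.amax(prediction[0])
            # Only the Q-value of the action actually taken is moved toward the target
            target_f = model.predict(obs)
            target_f[0][act] = target
            model.fit(obs, target_f, epochs=1, verbose=0)
        # Decay the exploration rate once per training round until it reaches its floor
        if EPSILON > EPSILON_MIN:
            EPSILON = EPSILON * EPSILON_DECAY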
        
# Close the environment
ENVIRONMENT.close()

0 10000 0.1
50 10000 10.81
100 10000 11.9
150 10000 14.23
200 10000 20.62
250 10000 26.25
300 10000 51.57
350 10000 70.61
400 10000 81.03
450 10000 70.13
500 10000 174.92
550 10000 232.58
600 10000 174.36
650 10000 234.61
700 10000 183.48
750 10000 212.24
800 10000 174.63
850 10000 74.65
900 10000 150.19
950 10000 83.94
1000 10000 90.23
1050 10000 211.02
1100 10000 225.4
1150 10000 241.1
1200 10000 128.23
1250 10000 153.65
1300 10000 144.51
1350 10000 124.2
1400 10000 215.2
1450 10000 198.16
1500 10000 153.1
1550 10000 179.18
1600 10000 155.01
1650 10000 151.56
1700 10000 102.6
1750 10000 176.83
1800 10000 218.23
1850 10000 189.63
1900 10000 201.57
1950 10000 55.09
2000 10000 118.55
2050 10000 221.1
2100 10000 230.68
2150 10000 187.1
2200 10000 229.05
2250 10000 228.22
2300 10000 237.46
2350 10000 194.83
2400 10000 213.8
2450 10000 87.96
2500 10000 119.76
2550 10000 250.0
2600 10000 215.4
2650 10000 236.95
2700 10000 238.6
2750 10000 179.

KeyboardInterrupt: 