# <font color=#3876f1  >Introducing DEEP Reinforcement Learning Playing ATARI GAMES</font>
![Atari Breakthrough](https://cdn-images-1.medium.com/max/1600/1*bq0g9O26bO4nLvRwNw9mpw.gif)
## <font color=green>[MSTC](http://www.mstc.ssr.upm.es/big-data-track) : Applications Project Course</font>

---
## <font color=#d35400>Remember from...</font>
## [*Human-level control through deep reinforcement learning*, Volodymyr Mnih, et al.](https://www.nature.com/articles/nature14236/)
  Nature volume 518, pages 529–533 (26 February 2015)
  
<img src=https://ocr.space/blog/images/posts/old/ai-nature-cover.jpg height="200" width="200">
  
**Editorial Summary**: *Self-taught AI agent masters Atari arcade games*


>   - For an artificial agent to be considered truly intelligent it needs to excel at a variety of tasks considered challenging for humans. 

>   - To date, it has only been possible … to master a single discipline for example, IBM's Deep Blue beat the human world champion at chess…

>   - Now a team working at Google's DeepMind subsidiary has developed an artificial agent dubbed a deep Q-network that learns to play 49 classic Atari 2600 'arcade' games directly from sensory experience, achieving performance on a par with that of an expert human player. 


By **combining reinforcement learning** (selecting actions that maximize reward, in this case the game score) **with deep learning** (multilayered feature extraction from high-dimensional data, in this case the pixels), **the game-playing agent takes artificial intelligence a step nearer the goal of systems capable of learning a diversity of challenging tasks from scratch.**



---

## We will use the library: <font color=magenta size=5>[OpenAI Gym](https://gym.openai.com/docs/)</font>

- OpenAI is a company co-founded by Elon Musk (*co-founder of PayPal, Tesla Motors, SpaceX, Hyperloop, SolarCity, The Boring Company, OpenAI*), among others, that has been doing research in deep reinforcement learning.

![OpenAI](https://cdn-images-1.medium.com/max/800/1*pPxE7B-vSNOMtv_veQz4Tw.png)

---
Adapted from several sources:
>   **MAINLY Tatsuya** (tokb23) : https://github.com/tokb23/dqn/blob/master/dqn.py

>   https://github.com/llSourcell/deep_q_learning/blob/master/03_PlayingAgent.ipynb

>   https://becominghuman.ai/lets-build-an-atari-ai-part-1-dqn-df57e8ff3b26

---


- # <font color=green>FIRST: let's install everything</font>

In [0]:
! apt-get update

In [0]:
! apt-get install build-essential -y

In [0]:
! apt-get install -y python-numpy python-dev cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig > /dev/null

In [0]:
! pip install --upgrade pip

In [0]:
! pip install gym[atari]


---

- # <font color=green>Now let's get to know OpenAI Gym</font>

### There are two basic concepts in reinforcement learning:
- ### the environment (namely, the outside world)
- ### and the agent (namely, the algorithm you are writing).

      The agent sends actions to the environment, and the environment replies with observations (state) and rewards (that is, a score).




---
## <font color=magenta>State</font>
### The simplest approximation of a $state$ is simply the current frame in your Atari game.
![Several frames](http://o7ie0tcjk.bkt.clouddn.com/rl/games/croped-breakout-image.png)

- ### <font color=red>How many states do we have?... What type of dynamics?</font>


---

## <font color=magenta>Action</font>
**A total of 18 actions can be performed with the joystick**: doing nothing, pressing the action button, going in one of 8 directions (up, down, left and right as well as the 4 diagonals) and going in any of these directions while pressing the button.

- ### In Breakout, only 4 actions apply:

> 1.- do nothing

> 2.- Fire: “asking for a ball” at the beginning of the game by pressing the button

> 3.- going left

> 4.- going right



---

## <font color=magenta>Reward</font>
In Atari, **rewards simply correspond to changes in score**.
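
As a small illustration of "reward = change in score", the sketch below derives per-step rewards from a running score. The score values are made up for the example; the clipping in the last line mirrors what the agent later in this notebook does with `np.clip(reward, -1, 1)`.

```python
import numpy as np

# Hypothetical running score after each of six steps (made-up numbers).
scores = np.array([0, 0, 1, 1, 5, 5])

# Reward at each step = change in score since the previous step.
rewards = np.diff(scores, prepend=0)
print(rewards)  # [0 0 1 0 4 0]

# Rewards clipped to [-1, 1], as the agent does before storing them.
clipped = np.clip(rewards, -1, 1)
print(clipped)  # [0 0 1 0 1 0]
```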

---
## <font color=darkblue>The GYM interface</font>
[OpenAI Gym Documentation](https://github.com/openai/gym#id15)
      
**The core gym interface is Env**, which is the unified environment interface.

The following are some **Env methods**:


*   **env = gym.make('Breakout-v0')** # creates an instance
*   **reset(self)**: Reset the environment's state. Returns observation.
*   **step(self, action)**: Step the environment by one timestep. Returns observation, reward, done, info.
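
To see the shape of this interface without installing Atari, here is a minimal sketch of an environment that follows the same `reset`/`step` contract. `CoinFlipEnv` and its rules are hypothetical, invented only for illustration; they are not part of Gym.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment mimicking the reset/step contract."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0
        self.coin = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        # Reward 1 if the action matches the coin currently shown, else 0.
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)  # next observation
        done = self.t >= self.episode_length
        return self.coin, reward, done, {}


env = CoinFlipEnv()
observation = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.randint(0, 1)  # a random agent
    observation, reward, done, info = env.step(action)
    total_reward += reward
print('Total reward:', total_reward)
```

The loop has the same structure as the Atari examples that follow: only the environment and the way the action is chosen change.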



---
*To be discussed:*

*   **render(self, mode='human', close=False)**: Render one frame of the environment. The default mode will do something human friendly, such as pop up a window. Passing the close flag signals the renderer to close any such windows.

---


<font color=darkblue size=4>**Let's see some examples:**</font>



In [0]:
import time
import gym
from IPython import display
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('Breakout-v0')

env.reset()

plt.imshow(env.render(mode='rgb_array'))
action=0
observation, reward, done, info = env.step(action)


In [0]:
plt.imshow(observation)

In [0]:
print('Reward:' ,reward)
print('Observation: ', observation.shape)

---

## <font color=magenta>A RANDOM Player</font>
**Watch the game being played with randomly selected actions**.

In [0]:
import time
import gym
from IPython import display
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('Breakout-v4')
#env = gym.make('Assault-ram-v0')
#env = gym.make('MsPacman-ram-v0')

env.reset()
for _ in range(20):
    plt.imshow(env.render(mode='rgb_array'))
    display.display(plt.gcf())
    time.sleep(1)
    display.clear_output(wait=True)
    action = env.action_space.sample() # select a random action

    observation, reward, done, info = env.step(action)

In [0]:
print("Action space (i.e. no. actions): ", env.action_space)
print("Observation space: ", env.observation_space)
print(action)



---
<font color=darkorange>All our efforts from now on will be about **replacing the random action selection** in the code above with something more *sensible!*</font>
---



---

- # <font color=green>We need some THEORETICAL KNOWLEDGE!</font>

##  Read DeepMind papers:
- ## [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) which introduces the notion of a Deep Q-Network.
- ## [Human-level control through deep reinforcement learning](https://www.nature.com/articles/nature14236)

## And follow our [REINFORCEMENT LEARNING Course](http://www.mstc.ssr.upm.es/images/enrollment2017-2018/GA_09AT_93000948_2S_2017-18.pdf)



---
## <font color=magenta>Revisiting: State</font>
      We said: The simplest approximation of a $state$ is simply the current frame in your Atari game.

### Unfortunately, this is not always sufficient: given the image on the left, you probably cannot tell whether the ball is going up or going down, <font color=brown>which breaks the Markov property!</font>
![Several frames](https://cdn-images-1.medium.com/max/600/1*Me-kiRwUc1b5_GZszN2UaQ.gif)

- ###  A simple trick to deal with this is to bring some of the previous history into your state (which is perfectly acceptable under the Markov property). DeepMind chose to use the past 4 frames <font color=red>*(do you know why?)* </font>
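
A minimal sketch of this frame-stacking idea, using tiny made-up 2×2 "frames" instead of real game images. `push_frame` is a hypothetical helper name; the sliding-window update is the same `np.append(state[:, :, 1:], ..., axis=2)` pattern used by the agent later in this notebook.

```python
import numpy as np

STATE_LENGTH = 4  # number of most recent frames kept in a state


def push_frame(state, frame):
    """Drop the oldest frame and append the newest along the last axis."""
    return np.append(state[:, :, 1:], frame[:, :, np.newaxis], axis=2)


# Hypothetical 2x2 "frames" whose pixels all equal the frame index.
frames = [np.full((2, 2), i, dtype=np.uint8) for i in range(6)]

# Initial state: the first frame repeated STATE_LENGTH times.
state = np.stack([frames[0]] * STATE_LENGTH, axis=2)
for frame in frames[1:]:
    state = push_frame(state, frame)

print(state.shape)     # (2, 2, 4)
print(state[0, 0, :])  # [2 3 4 5]
```

After six frames the state holds only the four most recent ones, so the network can infer direction and speed from the differences between them.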

### <font color=1334F2>Preprocessing</font>

    Working directly with raw Atari frames, which are 210×160 pixel images with a 128 color palette, can be computationally demanding, so ...

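- We convert each frame to grayscale and resize it to an **84×84** image (after taking the pixel-wise maximum with the previous frame). The implementation of 2D convolutions that we are going to be using could handle rectangular inputs just as easily, but 84×84 is the input size used in the DeepMind papers.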

**See the code below (do not execute it!)**

In [0]:
ENV_NAME = 'Breakout-v0'  # Environment name
FRAME_WIDTH = 84  # Resized frame width
FRAME_HEIGHT = 84  # Resized frame height

def preprocess(observation, last_observation):
    processed_observation = np.maximum(observation, last_observation)
    processed_observation = np.uint8(resize(rgb2gray(processed_observation), (FRAME_WIDTH, FRAME_HEIGHT)) * 255)
    return np.reshape(processed_observation, (FRAME_WIDTH, FRAME_HEIGHT,1))
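
To experiment with these steps without `skimage`, the pipeline can be approximated in plain NumPy. This is only a sketch under stated assumptions: `preprocess_numpy` is a hypothetical name, and the luminance weights and the nearest-neighbour resize are stand-ins chosen for illustration, so the output will not match `rgb2gray` and `resize` pixel for pixel.

```python
import numpy as np

FRAME_WIDTH = 84
FRAME_HEIGHT = 84


def preprocess_numpy(observation, last_observation):
    # Pixel-wise maximum of two consecutive RGB frames.
    frame = np.maximum(observation, last_observation).astype(np.float32)
    # Grayscale as a weighted sum of the R, G, B channels (assumed weights).
    gray = frame @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # Nearest-neighbour resize: pick evenly spaced rows and columns.
    rows = np.arange(FRAME_HEIGHT) * gray.shape[0] // FRAME_HEIGHT
    cols = np.arange(FRAME_WIDTH) * gray.shape[1] // FRAME_WIDTH
    small = gray[rows][:, cols]
    return small.astype(np.uint8).reshape(FRAME_HEIGHT, FRAME_WIDTH, 1)


# Two random frames with the raw Atari shape (210, 160, 3).
rng = np.random.default_rng(0)
frame_a = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
frame_b = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
processed = preprocess_numpy(frame_a, frame_b)
print(processed.shape)  # (84, 84, 1)
```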

## <font color=1334F2>SOME CODING INFORMATION</font>

> ### 1.- We are going to use both KERAS and TENSORFLOW
> ### 2.- To make the code more readable we will use a [Python CLASS](https://docs.python.org/2/tutorial/classes.html) for our agent
<br>

<font color=darkorange size=4>**We will follow**:<br>
 **MAINLY Tatsuya** (tokb23) : https://github.com/tokb23/dqn/blob/master/dqn.py</font>



(Python classes provide all the standard features of Object Oriented Programming: the class inheritance mechanism allows multiple base classes, a derived class can override any methods of its base class or classes, and a method can call the method of a base class with the same name.)
> ### 3.- ...and we are going to have a LARGE set of configuration parameters!



---

In [0]:
import os
import gym
import random
import numpy as np
import tensorflow as tf
from collections import deque
from skimage.color import rgb2gray
from skimage.transform import resize
from keras.models import Sequential
from keras.layers import Convolution2D, Flatten, Dense

In [0]:
ENV_NAME = 'Breakout-v0'  # Environment name
FRAME_WIDTH = 84  # Resized frame width
FRAME_HEIGHT = 84  # Resized frame height
STATE_LENGTH = 4  # Number of most recent frames to produce the input to the network

#NUM_EPISODES = 12000  # Number of episodes the agent plays
NUM_EPISODES = 2000  # Number of episodes the agent plays

GAMMA = 0.99  # Discount factor
EXPLORATION_STEPS = 1000000  # Number of steps over which the initial value of epsilon is linearly annealed to its final value
INITIAL_EPSILON = 1.0  # Initial value of epsilon in epsilon-greedy
FINAL_EPSILON = 0.1  # Final value of epsilon in epsilon-greedy

#INITIAL_REPLAY_SIZE = 20000  # Number of steps to populate the replay memory before training starts
#NUM_REPLAY_MEMORY = 400000  # Maximum number of transitions kept in the replay memory

INITIAL_REPLAY_SIZE = 5000  # Number of steps to populate the replay memory before training starts
NUM_REPLAY_MEMORY = 40000  # Maximum number of transitions kept in the replay memory

BATCH_SIZE = 32  # Mini batch size
#TARGET_UPDATE_INTERVAL = 10000  # The frequency with which the target network is updated
TARGET_UPDATE_INTERVAL = 1000  # The frequency with which the target network is updated

TRAIN_INTERVAL = 4  # The agent selects 4 actions between successive updates
LEARNING_RATE = 0.00025  # Learning rate used by RMSProp
MOMENTUM = 0.95  # Momentum used by RMSProp
MIN_GRAD = 0.01  # Constant added to the squared gradient in the denominator of the RMSProp update

SAVE_INTERVAL = 300000  # The frequency with which the network is saved
NO_OP_STEPS = 30  # Maximum number of "do nothing" actions to be performed by the agent at the start of an episode
LOAD_NETWORK = False
TRAIN = True
SAVE_NETWORK_PATH = 'saved_networks/' + ENV_NAME
SAVE_SUMMARY_PATH = 'summary/' + ENV_NAME

NUM_EPISODES_AT_TEST = 30 # Number of episodes the agent plays at test time


In [0]:
class Agent():
    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.epsilon = INITIAL_EPSILON
        self.epsilon_step = (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORATION_STEPS
        self.t = 0

        # Parameters used for summary
        self.total_reward = 0
        self.total_q_max = 0
        self.total_loss = 0
        self.duration = 0
        self.episode = 0

        # Create replay memory
        self.replay_memory = deque()

        # Create q network
        self.s, self.q_values, q_network = self.build_network()
        q_network_weights = q_network.trainable_weights

        # Create target network
        self.st, self.target_q_values, target_network = self.build_network()
        target_network_weights = target_network.trainable_weights

        # Define target network update operation
        self.update_target_network = \
            [target_network_weights[i].assign(q_network_weights[i]) for i in range(len(target_network_weights))]

        # Define loss and gradient update operation
        self.a, self.y, self.loss, self.grads_update = self.build_training_op(q_network_weights)

        self.sess = tf.InteractiveSession()
        self.saver = tf.train.Saver(q_network_weights)
        self.summary_placeholders, self.update_ops, self.summary_op = self.setup_summary()
        self.summary_writer = tf.summary.FileWriter(SAVE_SUMMARY_PATH, self.sess.graph)

        if not os.path.exists(SAVE_NETWORK_PATH):
            os.makedirs(SAVE_NETWORK_PATH)

        self.sess.run(tf.global_variables_initializer())

        # Load network
        if LOAD_NETWORK:
            self.load_network()

        # Initialize target network
        self.sess.run(self.update_target_network)

    def build_network(self):
        model = Sequential()
        model.add(Convolution2D(32, 8, 8, subsample=(4, 4), activation='relu', \
                                input_shape=(FRAME_WIDTH, FRAME_HEIGHT, STATE_LENGTH)))
        
        # FOR THEANO : input_shape=(STATE_LENGTH, FRAME_WIDTH, FRAME_HEIGHT)))
        model.add(Convolution2D(64, 4, 4, subsample=(2, 2), activation='relu'))
        model.add(Convolution2D(64, 3, 3, subsample=(1, 1), activation='relu'))
        model.add(Flatten())
        model.add(Dense(512, activation='relu'))
        model.add(Dense(self.num_actions))

        # FOR THEANO: s = tf.placeholder(tf.float32, [None, STATE_LENGTH, FRAME_WIDTH, FRAME_HEIGHT])
        s = tf.placeholder(tf.float32, [None, FRAME_WIDTH, FRAME_HEIGHT,STATE_LENGTH])
        q_values = model(s)

        return s, q_values, model

    def build_training_op(self, q_network_weights):
        a = tf.placeholder(tf.int64, [None])
        y = tf.placeholder(tf.float32, [None])

        # Convert action to one hot vector
        a_one_hot = tf.one_hot(a, self.num_actions, 1.0, 0.0)
        q_value = tf.reduce_sum(tf.multiply(self.q_values, a_one_hot), reduction_indices=1)

        # Clip the error, the loss is quadratic when the error is in (-1, 1), and linear outside of that region
        error = tf.abs(y - q_value)
        quadratic_part = tf.clip_by_value(error, 0.0, 1.0)
        linear_part = error - quadratic_part
        loss = tf.reduce_mean(0.5 * tf.square(quadratic_part) + linear_part)

        optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE, momentum=MOMENTUM, epsilon=MIN_GRAD)
        grads_update = optimizer.minimize(loss, var_list=q_network_weights)

        return a, y, loss, grads_update

    def get_initial_state(self, observation, last_observation):
        processed_observation = np.maximum(observation, last_observation)
        processed_observation = np.uint8(resize(rgb2gray(processed_observation), (FRAME_WIDTH, FRAME_HEIGHT)) * 255)
        state = [processed_observation for _ in range(STATE_LENGTH)]
        return np.stack(state, axis=2)
        # FOR THEANO: return np.stack(state, axis=0)

    def get_action(self, state):
        if self.epsilon >= random.random() or self.t < INITIAL_REPLAY_SIZE:
            action = random.randrange(self.num_actions)
        else:
            action = np.argmax(self.q_values.eval(feed_dict={self.s: [np.float32(state / 255.0)]}))

        # Anneal epsilon linearly over time
        if self.epsilon > FINAL_EPSILON and self.t >= INITIAL_REPLAY_SIZE:
            self.epsilon -= self.epsilon_step

        return action

    def run(self, state, action, reward, terminal, observation):
        # `observation` is the already preprocessed frame, shape (FRAME_WIDTH, FRAME_HEIGHT, 1)
        # FOR THEANO: next_state = np.append(state[1:, :, :], observation, axis=0)
        next_state = np.append(state[:, :, 1:], observation, axis=2)

        # Clip all positive rewards at 1 and all negative rewards at -1, leaving 0 rewards unchanged
        reward = np.clip(reward, -1, 1)

        # Store transition in replay memory
        self.replay_memory.append((state, action, reward, next_state, terminal))
        if len(self.replay_memory) > NUM_REPLAY_MEMORY:
            self.replay_memory.popleft()

        if self.t >= INITIAL_REPLAY_SIZE:
            # Train network
            if self.t % TRAIN_INTERVAL == 0:
                self.train_network()

            # Update target network
            if self.t % TARGET_UPDATE_INTERVAL == 0:
                self.sess.run(self.update_target_network)

            # Save network
            if self.t % SAVE_INTERVAL == 0:
                save_path = self.saver.save(self.sess, SAVE_NETWORK_PATH + '/' + ENV_NAME, global_step=self.t)
                print('Successfully saved: ' + save_path)

        self.total_reward += reward
        self.total_q_max += np.max(self.q_values.eval(feed_dict={self.s: [np.float32(state / 255.0)]}))
        self.duration += 1

        if terminal:
            # Write summary
            if self.t >= INITIAL_REPLAY_SIZE:
                stats = [self.total_reward, self.total_q_max / float(self.duration),
                        self.duration, self.total_loss / (float(self.duration) / float(TRAIN_INTERVAL))]
                for i in range(len(stats)):
                    self.sess.run(self.update_ops[i], feed_dict={
                        self.summary_placeholders[i]: float(stats[i])
                    })
                summary_str = self.sess.run(self.summary_op)
                self.summary_writer.add_summary(summary_str, self.episode + 1)

            # Debug
            if self.t < INITIAL_REPLAY_SIZE:
                mode = 'random'
            elif INITIAL_REPLAY_SIZE <= self.t < INITIAL_REPLAY_SIZE + EXPLORATION_STEPS:
                mode = 'explore'
            else:
                mode = 'exploit'
            print('EPISODE: {0:6d} / TIMESTEP: {1:8d} / DURATION: {2:5d} / EPSILON: {3:.5f} / TOTAL_REWARD: {4:3.0f} / AVG_MAX_Q: {5:2.4f} / AVG_LOSS: {6:.5f} / MODE: {7}'.format(
                self.episode + 1, self.t, self.duration, self.epsilon,
                self.total_reward, self.total_q_max / float(self.duration),
                self.total_loss / (float(self.duration) / float(TRAIN_INTERVAL)), mode))

            self.total_reward = 0
            self.total_q_max = 0
            self.total_loss = 0
            self.duration = 0
            self.episode += 1

        self.t += 1

        return next_state

    def train_network(self):
        state_batch = []
        action_batch = []
        reward_batch = []
        next_state_batch = []
        terminal_batch = []
        y_batch = []

        # Sample random minibatch of transition from replay memory
        minibatch = random.sample(self.replay_memory, BATCH_SIZE)
        for data in minibatch:
            state_batch.append(data[0])
            action_batch.append(data[1])
            reward_batch.append(data[2])
            next_state_batch.append(data[3])
            terminal_batch.append(data[4])

        # Convert True to 1, False to 0
        terminal_batch = np.array(terminal_batch) + 0

        target_q_values_batch = self.target_q_values.eval(feed_dict={self.st: np.float32(np.array(next_state_batch) / 255.0)})
        y_batch = reward_batch + (1 - terminal_batch) * GAMMA * np.max(target_q_values_batch, axis=1)

        loss, _ = self.sess.run([self.loss, self.grads_update], feed_dict={
            self.s: np.float32(np.array(state_batch) / 255.0),
            self.a: action_batch,
            self.y: y_batch
        })

        self.total_loss += loss

    def setup_summary(self):
        episode_total_reward = tf.Variable(0.)
        tf.summary.scalar(ENV_NAME + '/Total_Reward/Episode', episode_total_reward)
        episode_avg_max_q = tf.Variable(0.)
        tf.summary.scalar(ENV_NAME + '/Average_Max_Q/Episode', episode_avg_max_q)
        episode_duration = tf.Variable(0.)
        tf.summary.scalar(ENV_NAME + '/Duration/Episode', episode_duration)
        episode_avg_loss = tf.Variable(0.)
        tf.summary.scalar(ENV_NAME + '/Average_Loss/Episode', episode_avg_loss)
        summary_vars = [episode_total_reward, episode_avg_max_q, episode_duration, episode_avg_loss]
        summary_placeholders = [tf.placeholder(tf.float32) for _ in range(len(summary_vars))]
        update_ops = [summary_vars[i].assign(summary_placeholders[i]) for i in range(len(summary_vars))]
        summary_op = tf.summary.merge_all()
        return summary_placeholders, update_ops, summary_op

    def load_network(self):
        checkpoint = tf.train.get_checkpoint_state(SAVE_NETWORK_PATH)
        if checkpoint and checkpoint.model_checkpoint_path:
            self.saver.restore(self.sess, checkpoint.model_checkpoint_path)
            print('Successfully loaded: ' + checkpoint.model_checkpoint_path)
        else:
            print('Training new network...')

    def get_action_at_test(self, state):
        if random.random() <= 0.05:
            action = random.randrange(self.num_actions)
        else:
            action = np.argmax(self.q_values.eval(feed_dict={self.s: [np.float32(state / 255.0)]}))

        self.t += 1

        return action


def preprocess(observation, last_observation):
    processed_observation = np.maximum(observation, last_observation)
    processed_observation = np.uint8(resize(rgb2gray(processed_observation), (FRAME_WIDTH, FRAME_HEIGHT)) * 255)
    return np.reshape(processed_observation, (FRAME_WIDTH, FRAME_HEIGHT,1))





---
## <font color=1334F2>Some Simple Tests to better understand the code</font>
---

- ### Instantiate the environment (Atari game) and the Reinforcement Learning agent: it will be based on Deep Q-Networks (DQN agent)

In [0]:
# REMEMBER: in parameters:
# ENV_NAME = 'Breakout-v0'  # Environment name

env = gym.make(ENV_NAME)
agent = Agent(num_actions=env.action_space.n)

In [0]:
# GENERATE SOME INITIAL NO_OP steps for INITIALIZATION:

observation = env.reset()
for _ in range(random.randint(1, NO_OP_STEPS)):
  last_observation = observation
  observation, _, _, _ = env.step(0)  # Do nothing

In [0]:
observation.shape

In [0]:
# Plot an OBSERVATION: is a Game Frame:

import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(observation)

In [0]:
# A "state" consists of (in this case) 4 frames of size 84 x 84 (resized, grayscale images)

# REMEMBER: in parameters:
#FRAME_WIDTH = 84  # Resized frame width
#FRAME_HEIGHT = 84  # Resized frame height
#STATE_LENGTH = 4  # Number of most recent frames to produce the input to the network

state = agent.get_initial_state(observation, last_observation)
print('State dimensions', state.shape)


In [0]:
state = agent.get_initial_state(observation, last_observation)

print('State dimensions', state.shape)
plt.subplot(2,2,1)
plt.imshow(state[:,:,0])
plt.subplot(2,2,2)
plt.imshow(state[:,:,1])
plt.subplot(2,2,3)
plt.imshow(state[:,:,2])
plt.subplot(2,2,4)
plt.imshow(state[:,:,3])

In [0]:
last_observation = observation
action = agent.get_action(state)
print('Action: ',action)
observation, reward, terminal, _ = env.step(action)

In [0]:
# SEE the "internal" operation
processed_observation = np.maximum(observation, last_observation)
plt.subplot(1,3,1)
plt.imshow(observation)
plt.subplot(1,3,2)
plt.imshow(last_observation)
plt.subplot(1,3,3)
plt.imshow(processed_observation)

In [0]:
processed_observation = preprocess(observation, last_observation)

In [0]:
print('Processed observation shape',processed_observation.shape)
plt.imshow(processed_observation.reshape((84,84)))

In [0]:
state.shape

In [0]:
processed_observation.shape

In [0]:
state[:, :, 1:].shape

In [0]:
# SEE HOW a new PROCESSED observation is added as next_state (4 frames)

next_state = np.append(state[:, :, 1:], processed_observation, axis=2)

state=next_state
print('State dimensions', state.shape)
plt.subplot(2,2,1)
plt.imshow(state[:,:,0])
plt.subplot(2,2,2)
plt.imshow(state[:,:,1])
plt.subplot(2,2,3)
plt.imshow(state[:,:,2])
plt.subplot(2,2,4)
plt.imshow(state[:,:,3])


---
## <font color=1334F2>NOW DO THE TRAINING...</font>
<font color=1334F2 size=4> ..and study [Human-level control through deep reinforcement learning - Presented by Bowen Xu](http://www.teach.cs.toronto.edu/~csc2542h/fall/material/csc2542f16_dqn.pdf)</font>


- ### Understand main steps:

> ### Process a new observation and feed it into "agent.run", which:

> 1.- appends it to the current state to build the next state

> 2.- stores the transition in REPLAY MEMORY (dropping the oldest one when the memory exceeds NUM_REPLAY_MEMORY)

> 3.- ONCE at least INITIAL_REPLAY_SIZE steps have been taken, trains the Q network every TRAIN_INTERVAL steps

> 4.- EVERY TARGET_UPDATE_INTERVAL steps, copies the Q network weights into the target network


> 5.- SAVES THE NETWORK every SAVE_INTERVAL steps
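
The replay-memory part of these steps can be sketched in isolation. `ReplayMemory` is a hypothetical helper written for illustration; the `Agent` class above stores transitions directly in a `deque` and trims it by hand, whereas this sketch lets `maxlen` do the trimming.

```python
import random
from collections import deque


class ReplayMemory:
    """Hypothetical minimal replay memory (not the class used by the agent)."""

    def __init__(self, capacity):
        # With maxlen set, the oldest transition is dropped automatically.
        self.buffer = deque(maxlen=capacity)

    def append(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size):
        # Uniform random minibatch without replacement.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=5)
for t in range(8):
    # Integers stand in for real states in this toy example.
    memory.append(t, 0, 0.0, t + 1, False)

print(len(memory))                                      # 5
print([transition[0] for transition in memory.buffer])  # [3, 4, 5, 6, 7]
batch = memory.sample(3)
```

Sampling at random from a large memory, rather than training on consecutive frames, breaks the strong correlation between successive transitions.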

---

---
## <font color=red>Training will take time!</font>

<font color=darkorange size=4>**See figures from**:<br>
 **Tatsuya** (tokb23) : https://github.com/tokb23/dqn/blob/master/dqn.py</font>

![See Figures](https://raw.githubusercontent.com/tokb23/dqn/master/assets/result.png)


In [0]:
# TRAINING: for NUM_EPISODES--------------------

for _ in range(NUM_EPISODES):
  
  terminal = False
  observation = env.reset()
  for _ in range(random.randint(1, NO_OP_STEPS)):
    last_observation = observation
    observation, _, _, _ = env.step(0)  # Do nothing
  
  state = agent.get_initial_state(observation, last_observation)

  # While an episode is not finished
  while not terminal:
    last_observation = observation
    action = agent.get_action(state)
    observation, reward, terminal, _ = env.step(action)
    # env.render()
    processed_observation = preprocess(observation, last_observation)
  
    # Process a new observation and feed it into "agent.run", which:
    #  1.- appends it to the current state to build the next state
    #  2.- stores the transition in REPLAY MEMORY
    #  3.- ONCE at least INITIAL_REPLAY_SIZE steps have been taken, trains
    #      the Q network every TRAIN_INTERVAL steps
    #  4.- EVERY TARGET_UPDATE_INTERVAL steps, copies the Q network weights
    #      into the target network
    #  5.- SAVES THE NETWORK every SAVE_INTERVAL steps
    
    state = agent.run(state, action, reward, terminal, processed_observation)

---
## <font color=1334F2>FINALLY we can do some "informal" tests... </font>


- ### <font color=red>but see more formal tests! </font>
- ### <font color=red>Hyperparameters tuning! etc.</font>


In [0]:
import time
from IPython import display
import matplotlib.pyplot as plt
%matplotlib inline

i_render = True

NUM_EPISODES_AT_TEST=1

for _ in range(NUM_EPISODES_AT_TEST):
  tot_reward = 0.0
  terminal = False
  observation = env.reset()
  for _ in range(random.randint(1, NO_OP_STEPS)):
    last_observation = observation
    observation, _, _, _ = env.step(0)  # Do nothing
    
  state = agent.get_initial_state(observation, last_observation)
  while not terminal:
    last_observation = observation
    action = agent.get_action_at_test(state)
    observation, reward, terminal, _ = env.step(action)
    
    #  env.render()
    if i_render:
      print('Predicted action: ',env.unwrapped.get_action_meanings()[action])
      #env.render()
      plt.imshow(env.render(mode='rgb_array'))
      display.display(plt.gcf())
      time.sleep(1)
      display.clear_output(wait=True)
    
    processed_observation = preprocess(observation, last_observation)
    state = np.append(state[:, :, 1:], processed_observation, axis=2)
    
    tot_reward += reward
  print('Game ended! Total reward: {}'.format(tot_reward))
# env.monitor.close()

![Epsilon schedule](https://cdn-images-1.medium.com/max/800/1*KqNaTE9W58I-Y1KLViuoxQ.png)
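
The linear epsilon schedule used by `get_action` can be written in closed form. The sketch below is an approximation of what the agent does step by step (it ignores off-by-one details); `epsilon_at` is a hypothetical helper, and the constants are copied from the configuration cell above.

```python
INITIAL_EPSILON = 1.0
FINAL_EPSILON = 0.1
EXPLORATION_STEPS = 1000000
INITIAL_REPLAY_SIZE = 5000


def epsilon_at(t):
    """Approximate epsilon after t environment steps under linear annealing."""
    # No annealing while the replay memory is being filled with random play.
    annealing_steps = min(max(t - INITIAL_REPLAY_SIZE, 0), EXPLORATION_STEPS)
    fraction = annealing_steps / EXPLORATION_STEPS
    return INITIAL_EPSILON - fraction * (INITIAL_EPSILON - FINAL_EPSILON)


for t in (0, 5000, 505000, 1005000, 2000000):
    print(t, round(epsilon_at(t), 3))
```

Epsilon stays at 1.0 during the initial random phase, decreases linearly, and then remains at `FINAL_EPSILON` for the rest of training.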

## Q-learning intuition.
## [Deep reinforcement learning: where to start (playing Catch)](https://medium.freecodecamp.org/deep-reinforcement-learning-where-to-start-291fb0058c01)

A good way to understand Q-learning is to compare playing Catch with playing chess. In both games you are given a state s (chess: positions of the figures on the board, Catch: location of the fruit and the basket), on which you have to take an action a (chess: move a figure, Catch: move the basket to the left, right, or stay where you are). As a result there will be some reward r and a new state s’. The problem with both Catch and chess is that the rewards will not appear immediately after you have taken the action. In Catch, you only earn rewards when the fruits hit the basket or fall on the floor, and in chess you only earn a reward when you win or lose the game. Rewards are _sparsely distributed_: most of the time, r will be 0. When there is a reward, it is not always a result of the action taken immediately before. Some action taken long before might have caused the victory. Figuring out which action is responsible for the reward is often referred to as the _credit assignment problem_.

Because rewards are delayed, good chess players do not choose their plays only by the immediate reward, but by the _expected future reward_. They do not only think about whether they can eliminate an opponent's figure in the next move, but how taking a certain action now will help them in the long run.
In Q-learning, we choose our action based on the highest expected future reward. While in state s, we estimate the future reward for each possible action a. We assume that after we have taken action a and moved to the next state s’, everything works out perfectly. Like in finance, we discount future rewards, since they are uncertain.
The expected future reward Q(s,a) given a state s and an action a is therefore the reward r that directly follows from a plus the expected future reward Q(s’,a’) if the optimal action a’ is taken in the following state s’, discounted by the discount factor gamma.

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$

Good chess players are very good at estimating future rewards in their head. In other words, their function Q(s,a) is very precise. Most chess practice revolves around developing a better Q function. Players peruse many old games to learn how specific moves played out in the past, and how likely a given action is to lead to victory.

But how could we estimate a good function Q? This is where neural networks come into play.

## Regression after all

When playing, we generate lots of experiences consisting of the initial state s, the action taken a, the reward earned r and the state that followed s’. These experiences are our training data. We can frame the problem of estimating Q(s,a) as a simple regression problem. Given an input vector consisting of s and a, the neural net is supposed to predict a value of Q(s,a) equal to the target: r + gamma * max Q(s’,a’). (In the DQN agent above, the network takes only s as input and outputs one Q-value per action.) If we are good at predicting Q(s,a) for different states s and actions a, we have a good approximation of Q. Note that Q(s’,a’) is _also_ a prediction of the neural network we are training.

Given a batch of experiences < s, a, r, s’ >, the training process then looks as follows:
1. For each possible action a’ (left, right, stay), predict the expected future reward Q(s’,a’) using the neural net
2. Choose the highest value of the three predictions max Q(s’,a’)
3. Calculate r + gamma * max Q(s’,a’). This is the target value for the neural net
4. Train the neural net using the loss function 1/2(predicted_Q(s,a) - target)^2
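
The four steps above can be sketched for a small batch in NumPy. All numbers are made up for illustration, and the Q-values stand in for network predictions. The `(1 - terminal)` factor is not in the list; it follows `train_network` in the agent above, which drops the future reward when s’ ends the episode.

```python
import numpy as np

GAMMA = 0.99  # discount factor

# Hypothetical batch of three experiences.
rewards = np.array([0.0, 1.0, 0.0])
terminal = np.array([0, 0, 1])  # 1 marks the end of an episode

# Step 1: made-up predictions Q(s', a') for three actions in each next state.
next_q = np.array([[0.5, 1.0, 0.2],
                   [0.0, 0.3, 0.1],
                   [2.0, 1.5, 0.7]])

# Step 2: highest predicted value in each next state.
max_next_q = np.max(next_q, axis=1)

# Step 3: target = r + gamma * max Q(s', a'), with no future term at episode end.
targets = rewards + (1 - terminal) * GAMMA * max_next_q
print(targets)  # approximately 0.99, 1.297, 0.0

# Step 4: squared-error loss against made-up predictions Q(s, a).
predicted_q = np.array([1.0, 1.0, 0.5])
loss = np.mean(0.5 * (predicted_q - targets) ** 2)
print(loss)
```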

During gameplay, all the experiences are stored in a replay memory. In this notebook, that is the `replay_memory` deque of the `Agent` class above.

The `run` method simply appends each experience to the replay memory.
The `train_network` method samples a minibatch, performs steps 1 to 3 of the list above to build the targets, and then runs the training step.
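
Step 4 of the list uses a plain squared error. The `Agent` class in this notebook instead clips the error in `build_training_op`, so that the loss is quadratic when the error is in (-1, 1) and linear outside that region. A NumPy sketch of the same computation (`clipped_error_loss` is a hypothetical name):

```python
import numpy as np


def clipped_error_loss(y, q_value):
    """Quadratic for |error| <= 1, linear beyond, averaged over the batch."""
    error = np.abs(y - q_value)
    quadratic_part = np.clip(error, 0.0, 1.0)
    linear_part = error - quadratic_part
    return np.mean(0.5 * np.square(quadratic_part) + linear_part)


# Small error: behaves like 0.5 * error^2.
print(clipped_error_loss(np.array([0.5]), np.array([0.0])))  # 0.125
# Large error: grows only linearly, 0.5 + (3 - 1).
print(clipped_error_loss(np.array([3.0]), np.array([0.0])))  # 2.5
```

Limiting the growth of the loss for large errors keeps single outlier targets from producing very large gradient steps.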

---
## <font color=magenta>Now you can go much further!</font>

- ### MuJoCo
![MuJoCo](https://blog.openai.com/content/images/2017/05/image7.gif)

- ### Robotics
![Robotics](http://blog.otoro.net/assets/20171109/jpeg/kuka_img.jpeg)

- ### Starcraft II: PySC2 is DeepMind's Python component of the StarCraft II Learning Environment (SC2LE) intended to develop StarCraft II into a rich environment for RL research.
![Starcraft2](http://i.dailymail.co.uk/i/pix/2016/11/28/10/3AD1E4E900000578-0-image-a-1_1480329576701.jpg)

- ### Intel® Nervana™ - Artificial Intelligence Products Group
![Nervana Coach (CARLA)](https://raw.githubusercontent.com/NervanaSystems/coach/master/img/carla.gif)


- ### Self-driving Cars
![Self-driving cars](https://i.ytimg.com/vi/NdbOqNQtAjk/maxresdefault.jpg)

- # ... and much more, in many different application areas...