## Cart-Pole Balancing using DQNs
In this assignment we will balance a cart-pole using deep learning. We will build an agent that, given the current state of the environment, predicts which action will lead to the best outcome. We are going to implement the two core pieces of DQNs: the epsilon-greedy algorithm and memory replay.
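
As a rough illustration of those two pieces, the sketch below shows epsilon-greedy selection over a made-up vector of Q-values and a small fixed-size replay buffer. The names `epsilon_greedy`, `q_values`, and `buffer` are assumptions chosen for this example and are not part of the assignment code.

```python
import random
from collections import deque

import numpy as np


def epsilon_greedy(q_values, epsilon, rng=np.random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.rand() < epsilon:
        return int(rng.randint(len(q_values)))
    return int(np.argmax(q_values))


# Hypothetical Q-values for a two-action problem (illustration only).
q_values = np.array([0.2, 0.7])
greedy_action = epsilon_greedy(q_values, epsilon=0.0)  # never explores
random_action = epsilon_greedy(q_values, epsilon=1.0)  # always explores

# Replay buffer: a bounded deque drops the oldest transition when full.
buffer = deque(maxlen=3)
for step in range(5):
    # (state, action, reward, next_state, done) filled with dummy values
    buffer.append((step, 0, 1.0, step + 1, False))

# Uniform sample without replacement from the stored transitions.
minibatch = random.sample(list(buffer), 2)
```

With `epsilon=0.0` the greedy action is always returned, and with `epsilon=1.0` the choice is always random. Sampling past transitions at random, rather than training on them in order, is what the replay memory is for.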

In this assignment we will use the OpenAI Gym library to set up the game environment. Most of the game-playing interface is already provided by Gym. Our task is to implement the agent and complete the training loop. As the agent plays the game, you should see its score increase in the training loop. The goal is a score of 100 or above.

In [1]:
# If you are running this practice on your machine, make sure to install gym and gym[atari]. Depending on your python 
# env, this could be done using pip install, or conda install etc. 
!pip3 install --user gym gym[atari]

only teacher can use pip3


In [2]:
from collections import deque
import numpy as np
import random

import gym
from keras.layers import Dense, Activation, Flatten
from keras.models import Sequential
from keras.optimizers import Adam

Using TensorFlow backend.


In [3]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95   # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential()
        ############################# TODO #####################################
        # Create a "simple" network, with 2-4 layers. Your network
        # takes in the observations, so the input_dim should match
        # the size of the observation space. Your network should output an
        # estimated Q-value for each action, so its output size should match
        # the action_size. Keep the last layer's activation as linear and use
        # mean squared error for loss. Return your model after compiling it.
        ########################### END TODO ###################################
        model.add(Dense(4, input_shape=(self.state_size,)))
        model.add(Activation('relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        
        return model

    def remember(self, state, action, reward, next_state, done):
        ############################# TODO #####################################
        # Create a tuple of state, action, reward, next_state and done 
        # and append this tuple to the memory.
        ########################### END TODO ###################################
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # In this function we calculate and return the next action.
        # We are going to implement epsilon greedy logic here. 
        # With probability epsilon, return a random action.
        # With probability 1-epsilon, return the action that the model predicts.
        if np.random.rand() <= self.epsilon:
            return np.random.randint(self.action_size)
        else:
            return np.argmax(self.model.predict(state))
            

    def replay(self, batch_size):
        # Sample a random minibatch of stored transitions from memory
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            # Calculate the total discounted reward according to the Q-Learning formula:
            # target = current_reward + discounted maximum value obtained by next state.
            # For a terminal transition there is no next state, so the target is just the reward.
            target = reward
            if not done:
                target = reward + self.gamma * self.model.predict(next_state).max()
                
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
            
        # Decay the epsilon value 
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)
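
The two pieces of arithmetic inside `replay` can be checked by hand on a tiny example. The sketch below uses made-up arrays in place of `model.predict` output, and `q_target` is a hypothetical helper written only for this illustration. The discount and epsilon settings are the ones the agent above is initialized with.

```python
import numpy as np

GAMMA = 0.95  # discount rate, matching the agent's self.gamma


def q_target(reward, next_q_values, done, gamma=GAMMA):
    """Return reward + gamma * max_a Q(next_state, a), or just reward if terminal."""
    if done:
        return reward
    return reward + gamma * float(np.max(next_q_values))


# Made-up stand-ins for model.predict(state) and model.predict(next_state).
current_q = np.array([[0.5, 1.5]])
next_q = np.array([[2.0, 4.0]])

# Non-terminal step: 1.0 + 0.95 * 4.0 = 4.8
target = q_target(1.0, next_q, done=False)

# Only the entry for the action actually taken is overwritten, so the
# other action contributes zero error when fitting with MSE.
action = 0
target_f = current_q.copy()
target_f[0][action] = target

# Terminal step: the target is the (penalized) reward alone.
terminal_target = q_target(-10, next_q, done=True)

# Epsilon decay: count how many replay() calls it takes for epsilon to
# fall from 1.0 to the 0.01 floor with a 0.995 multiplier.
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
decay_steps = 0
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    decay_steps += 1
# decay_steps is 919 under these settings
```

Since epsilon only decays inside `replay`, and `replay` is only called once the memory holds more than `batch_size` transitions, epsilon stays at 1.0 for the first few episodes of training.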

In [4]:
if __name__ == "__main__":
    env = gym.make('CartPole-v1')
    
    # TODO What's the state size for CartPole game?
    state_size = 4
    # TODO What's the action size for CartPole game?
    action_size = 2
    
    agent = DQNAgent(state_size, action_size)
    done = False
    batch_size = 128 # Feel free to play with these 
    EPISODES = 40   # You shouldn't really need more than 100 episodes to get a score of 100

    
    for eps in range(EPISODES):
        state = env.reset()
        state = np.reshape(state, [1, state_size])
        for time in range(500):
            
            # TODO Get an action from the agent
            action = agent.act(state)
            # TODO Send this action to the env and get the next_state, reward, done values
            next_state, reward, done, _ = env.step(action)
            
            # DO NOT CHANGE THE FOLLOWING 2 LINES 
            reward = reward if not done else -10
            next_state = np.reshape(next_state, [1, state_size])
            
            # TODO Tell the agent to remember this memory
            agent.remember(state, action, reward, next_state, done)
            
            # DO NOT CHANGE BELOW THIS LINE
            state = next_state
            if done:
                print("episode: {}/{}, score: {}, eps: {:.2}".format(eps, EPISODES, time, agent.epsilon))
                break
            if len(agent.memory) > batch_size:
                agent.replay(batch_size)
        if eps % 10 == 0:
            agent.save("./cartpole-dqn.h5")

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
episode: 0/40, score: 10, eps: 1.0
episode: 1/40, score: 27, eps: 1.0
episode: 2/40, score: 32, eps: 1.0
episode: 3/40, score: 25, eps: 1.0
episode: 4/40, score: 24, eps: 1.0
episode: 5/40, score: 21, eps: 0.92
episode: 6/40, score: 10, eps: 0.88
episode: 7/40, score: 26, eps: 0.77
episode: 8/40, score: 25, eps: 0.68
episode: 9/40, score: 18, eps: 0.62
episode: 10/40, score: 41, eps: 0.51
episode: 11/40, score: 12, eps: 0.48
episode: 12/40, score: 12, eps: 0.45
episode: 13/40, score: 13, eps: 0.42
episode: 14/40, score: 12, eps: 0.4
episode: 15/40, score: 13, eps: 0.37
episode: 16/40, score: 8, eps: 0.36
episode: 17/40, score: 13, eps: 0.33
episode: 18/40, score: 9, eps: 0.32
episode: 19/40, score: 11, eps: 0.3
episode: 20/40, score: 35, eps: 0.25
episode: 21/40, score: 57, eps: 0.19
episode: 22/40, score: 41, eps: 0.15
episode: 23/40, score: 64, eps: 0.11
episode: 24/40, score: 