#### CartPole-v1
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pole starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical or the cart moves more than 2.4 units from the center; CartPole-v1 also caps an episode at 500 timesteps.

    Action:
        Type: Discrete(2)
        Num   Action
        0     Push cart to the left
        1     Push cart to the right

    Observation:
        Type: Box(4)
        Num   Observation             Min     Max
        0     Cart Position           -4.8    4.8
        1     Cart Velocity           -Inf    Inf
        2     Pole Angle              -24°    24°
        3     Pole Velocity At Tip    -Inf    Inf
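
The termination rule described above can be sketched as a small check on a single observation. This is an illustrative helper, not part of gym: the name `episode_should_end` is hypothetical, and the 12-degree and 2.4-unit limits are assumed from gym's CartPole implementation, where the observation bounds in the table (4.8 and 24°) are twice these limits.

```python
import math

# Assumed termination limits (taken from gym's CartPole implementation).
X_LIMIT = 2.4                     # cart position limit, in track units
THETA_LIMIT = math.radians(12)    # pole angle limit, in radians


def episode_should_end(observation):
    """Return True if the cart or the pole has left the allowed range.

    observation = [cart position, cart velocity, pole angle (rad), pole velocity at tip]
    """
    cart_position, _, pole_angle, _ = observation
    return abs(cart_position) > X_LIMIT or abs(pole_angle) > THETA_LIMIT


print(episode_should_end([0.0, 0.0, 0.05, 0.0]))   # pole tilted about 2.9 degrees
print(episode_should_end([0.0, 0.0, 0.25, 0.0]))   # pole tilted about 14.3 degrees
print(episode_should_end([2.5, 0.0, 0.0, 0.0]))    # cart beyond the 2.4 limit
```

An observation can therefore lie well inside the Box bounds and still end the episode.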

### Visualization of the CartPole Environment for 1000 time steps without any training

In [1]:
import gym
env = gym.make('CartPole-v1')
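env.reset()
for _ in range(1000):
    env.render()
    # take a random action; this gym API returns (observation, reward, done, info)
    _, _, done, _ = env.step(env.action_space.sample())
    if done:
        env.reset()   # start a new episode once the pole falls or the cart leaves the track
env.close()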



### Importing required libraries

In [2]:
import gym
import tensorflow as tf
import numpy as np
from tensorflow import keras

from collections import deque
import time
import random

### Defining the Deep Q Network Architecture

In [3]:
def MyModel(state_shape, action_shape):
    """ 
    Build the Q-network. It maps a batch of states to one Q-value per action (2 in this case).
    e.g. for a single state s the neural network output may be [.1, .7]
    These outputs are the Q-values of that state: 0.1 = q(s, a1) and 0.7 = q(s, a2),
    where a1 = [1, 0] (action 0) and a2 = [0, 1] (action 1)

    Neural network input = (batch_size, state_shape)
    Neural network output = Dense layer with one unit per possible action (2 for the CartPole env)

    ARGUMENTS:
    state_shape - (4, ) i.e. the 4 state variables
    action_shape - number of possible actions, which is 2 in this case

    RETURNS: compiled neural network
    """
    learning_rate = 0.001
    init = tf.keras.initializers.he_uniform()    # neural network weight initializers
    model = keras.Sequential()
    model.add(keras.layers.Dense(24, input_shape=state_shape, activation='relu', kernel_initializer=init))
    model.add(keras.layers.Dense(12, activation='relu', kernel_initializer=init))
    model.add(keras.layers.Dense(action_shape, activation='linear', kernel_initializer=init))
    # Adam takes `learning_rate`; the older `lr` alias is deprecated
    model.compile(loss=tf.keras.losses.Huber(), optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), metrics=['accuracy'])
    
    return model
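
To get a feel for how small this network is, its trainable parameters can be counted by hand. The sketch below does that arithmetic in plain Python; `dense_param_count` is a hypothetical helper written for illustration, not a Keras function.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the input size followed by the units of each Dense layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per unit
    return total


# 4 state variables -> 24 -> 12 -> 2 Q values, as in MyModel above
print(dense_param_count([4, 24, 12, 2]))   # 446
```

The three layers contribute 120, 300 and 26 parameters respectively.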

### Function to train the previously compiled DQN

In [4]:
def train(env, replay_memory, train_model, target_model, done):
    """
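    Sample a mini-batch from the replay memory, build Q-value labels with the Bellman equation,
    and fit the training network on them for one pass.
    """
    learning_rate = 0.001    # step size for moving the old Q value toward the Bellman target (separate from the optimizer's learning rate)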
    discount_factor = 0.89   # gamma
    batch_size = 64 * 2
    min_experiences = 200   # minimum experiences to collect before starting to train
    
    if len(replay_memory) < min_experiences:
        return
    
    mini_batch = random.sample(replay_memory, batch_size)
    
    # Each replay_memory entry = [state, action, reward, new_state, done]
    #                     e.g. = [[1, 2, 3, 4], 1, 1.0, [1, 2, 3, 4], True]
    # We predict Q values of the current state, i.e. experience[0] of every experience in mini_batch, using the training network
    # We predict Q values of the next state, i.e. experience[3], using the target network so as to plug them into the Bellman equation
    current_states = []
    for experience in mini_batch:
        current_states.append((experience[0]))  # creating a batch of current states
        
    current_states = np.array(current_states)   # converting to array since predict function does not accept a list input
    current_qs_list = train_model.predict(current_states)  # Q values of states at time t1

    new_states = []
    for experience in mini_batch:
        new_states.append((experience[3]))  # creating a batch of states after a particular current state
        
    new_states = np.array(new_states)
    new_qs_list = target_model.predict(new_states) # Q values of states at time t2

    X = []   # list to create training data i.e. batches of states
    Y = []   # list to create corresponding Q value labels using Bellman equation
    for index, (state, action, reward, new_state, done) in enumerate(mini_batch):
        if not done:
            max_future_q = reward + discount_factor * np.max(new_qs_list[index])
        else:
            max_future_q = reward

        current_qs = current_qs_list[index]    # list of Q values for each sample 
        # here action is either 0 or 1, we update the q value of the action taken using bellman equation
        current_qs[action] = (1 - learning_rate) * current_qs[action] + learning_rate * max_future_q   # bellman equation

        X.append(state)
        Y.append(current_qs)

    train_model.fit(np.array(X), np.array(Y), batch_size=batch_size, verbose=0, shuffle=True)
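
The label construction inside the loop above can be worked through on a single transition. The sketch below repeats the same arithmetic in plain Python; `updated_q_values` is a hypothetical helper and the Q values are made-up numbers chosen only for illustration.

```python
def updated_q_values(current_qs, action, reward, next_qs, done,
                     learning_rate=0.001, discount_factor=0.89):
    """Return a copy of current_qs with the taken action's entry moved toward the Bellman target."""
    if done:
        target = reward                                    # no future reward after a terminal state
    else:
        target = reward + discount_factor * max(next_qs)   # reward plus discounted best next Q value
    new_qs = list(current_qs)
    new_qs[action] = (1 - learning_rate) * new_qs[action] + learning_rate * target
    return new_qs


print(updated_q_values([0.1, 0.7], action=1, reward=1.0, next_qs=[0.2, 0.5], done=False))
print(updated_q_values([0.1, 0.7], action=1, reward=1.0, next_qs=[0.2, 0.5], done=True))
```

In the first call the target is 1 + 0.89 * 0.5 = 1.445, yet with a step size of 0.001 the label for the taken action moves only slightly away from 0.7. The entry of the action that was not taken keeps its predicted value, so it contributes no error to the fit.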

### Function to choose action for every time step

In [5]:
def choose_action(train_model, state, epsilon):
    """
    Choose an action for ONE state with the epsilon-greedy strategy: with probability epsilon take a random
    action (explore), otherwise take the action with the highest predicted Q value (exploit).
    The reshaping is done so that the dimensions of the state match the input the network expects
    example: state = [ 0.03983203  0.04474036 -0.0341696  -0.00012871]
             state.shape = (4, )
             state.reshape([1, state.shape[0]]) = array([[ 0.03983203,  0.04474036, -0.0341696 , -0.00012871]])
             reshaped shape = (1, 4)
             since the predict function takes input of shape (batch_size, state_shape)
    ARGUMENTS:
    train_model - the training network used to predict Q values
    state - the state for which we want to choose an action
    epsilon - probability of taking a random action

    RETURNS: the chosen action (0 or 1)
    """
    # Implementing Epsilon Greedy Exploration Strategy
    random_number = random.uniform(0, 1)
    if random_number <= epsilon:
        # Explore: sample uniformly over the network's outputs instead of relying on a global `env`
        action = random.randrange(train_model.output_shape[-1])
    else:
        predict = train_model.predict(state.reshape([1, state.shape[0]]))[0]
        action = np.argmax(predict)         # Exploit best known action
        
    return action
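
The same decision rule can be tried without a neural network by passing the Q values in directly. This is a minimal sketch: `epsilon_greedy` and its `rng` parameter are hypothetical names, and the Q values are made up.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the index of the largest Q value."""
    if rng.uniform(0, 1) <= epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


rng = random.Random(0)   # seeded only to make the sketch repeatable
greedy_picks = [epsilon_greedy([0.1, 0.7], epsilon=0.0, rng=rng) for _ in range(100)]
random_picks = [epsilon_greedy([0.1, 0.7], epsilon=1.0, rng=rng) for _ in range(100)]
print(set(greedy_picks))           # only the action with the larger Q value
print(sorted(set(random_picks)))   # both actions appear
```

With epsilon at 0 the rule always exploits, and with epsilon at 1 it always explores; values in between mix the two.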

### Main Function to loop over each time step of each episode for DQN training

In [9]:
def main():
    
    # setting and getting acquainted with the environment
    env = gym.make('CartPole-v1')
    print("Action Space: {}".format(env.action_space))
    print("State space: {}".format(env.observation_space))
    state_shape = env.observation_space.shape   # equals value (4, )
    action_shape = env.action_space.n           # equals value 2
    
    # Initializing values for Epsilon-greedy algorithm 
    epsilon = 1     
    max_epsilon = 1 
    min_epsilon = 0.01  
    decay = 0.01

    # Build the Training and Target models
    train_model = MyModel(state_shape, action_shape)
    target_model = MyModel(state_shape, action_shape)
    
    # initializing and setting target model parameters
    counter_to_update_target_model = 0        # counter
    steps_to_update_target_model = 100        # update after every 100 steps
    
    # setting experience memory parameters
    # deque: Queue data structure that allows insert and delete at both ends
    replay_memory = deque(maxlen=50_000)      
    
    # setting other required parameters
    max_episodes = 200          # maximum number of episodes to play
    rewards_all_episodes = []   # list to keep track of rewards per episode for final visualizations
    episode_number = 0          # initialized for visualization purposes
    
    for episode in range(max_episodes):    # loop to iterate over each episode
        episode_number+=1
        print('Executing episode:'+ str(episode_number) + ' with epsilon value:'+ str(epsilon))
        reward_current_episode = 0         # variable to keep track of rewards for every new episode
        done = False
        state = env.reset()
        
        while not done:     # loop to iterate over each time step of an episode
            action = choose_action(train_model, state, epsilon)
                
            # implement the chosen action and after every action, append the experience into replay memory
            new_state, reward, done, _ = env.step(action)
            replay_memory.append([state, action, reward, new_state, done])
            
            state = new_state
            reward_current_episode += reward

            # Train the training network using the Bellman Equation
            train(env, replay_memory, train_model, target_model, done)
            counter_to_update_target_model += 1
            
            if counter_to_update_target_model == steps_to_update_target_model:
                print('Copying main network weights to the target network weights')
                target_model.set_weights(train_model.get_weights())
                counter_to_update_target_model = 0   # keep playing: the episode ends only when done is True
        
        # reduce the epsilon value after every episode so as to favor exploitation with time
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay * episode)
        rewards_all_episodes.append(reward_current_episode)
    
    # Results Visualizations
    rewards_per_twenty_episodes = np.split(np.array(rewards_all_episodes), max_episodes // 20)   # integer number of sections
    count = 20
    print('Average rewards per twenty episodes')

    for r in rewards_per_twenty_episodes:
        print(count, ':', str(np.mean(r)))
        count += 20
        
    env.close()
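
The epsilon schedule used at the end of each episode can be evaluated on its own to see how quickly exploration fades. The sketch below uses the same constants as `main()`; the function name `epsilon_for_episode` is hypothetical.

```python
import math


def epsilon_for_episode(episode, min_epsilon=0.01, max_epsilon=1.0, decay=0.01):
    """Exponentially decay epsilon from max_epsilon toward min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay * episode)


for episode in (0, 50, 100, 199):
    print(episode, round(epsilon_for_episode(episode), 3))
```

With decay = 0.01, epsilon is still about 0.37 after 100 episodes and about 0.15 at the last of the 200 episodes, so the agent keeps exploring noticeably throughout this run.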

In [None]:
if __name__ == '__main__':
    main()

Action Space: Discrete(2)
State space: Box(4,)
Executing episode: 1
Executing episode: 2
Executing episode: 3
Executing episode: 4
Copying main network weights to the target network weights
Executing episode: 5
Executing episode: 6
Executing episode: 7
Executing episode: 8
Executing episode: 9
Executing episode: 10
Copying main network weights to the target network weights
Executing episode: 11
Executing episode: 12
Executing episode: 13
Copying main network weights to the target network weights
Executing episode: 14
Executing episode: 15
Executing episode: 16
Executing episode: 17
Copying main network weights to the target network weights
Executing episode: 18
Executing episode: 19
Executing episode: 20
Executing episode: 21
Executing episode: 22
Executing episode: 23
Executing episode: 24
Copying main network weights to the target network weights
Executing episode: 25
Executing episode: 26
Executing episode: 27
Executing episode: 28
Executing episode: 29
Copying main network weights 

#### Interpreting the above result

Initially, 4-7 episodes pass between two updates of the target network, i.e. the 100 time steps after which the target network's weights are refreshed span 4-7 episodes. By around the 80th episode, 100 time steps are covered in just 2-3 episodes, so each episode lasts longer: the pole stays balanced for more time steps, which indicates that the agent is learning.
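
This reasoning can be made concrete with a rough estimate: if the 100 steps between two target-network updates span n episodes, each episode lasts about 100 / n steps on average. The helper name below is hypothetical, and the inputs are only the approximate episode counts quoted above.

```python
def average_episode_length(steps_between_updates, episodes_between_updates):
    """Rough average number of time steps per episode between two target-network updates."""
    return steps_between_updates / episodes_between_updates


for episodes in (7, 4, 3, 2):
    print(episodes, "episodes ->", round(average_episode_length(100, episodes), 1), "steps per episode")
```

Going from 4-7 episodes per update to 2-3 corresponds to average episode lengths rising from roughly 14-25 steps to roughly 33-50 steps.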