#### CartPole-v1
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pole starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical, when the cart moves more than 2.4 units from the center, or, in CartPole-v1, after 500 timesteps.

    Action:
        Type: Discrete(2)
        Num   Action
        0     Push cart to the left
        1     Push cart to the right

    Observation:
        Type: Box(4)
        Num   Observation             Min     Max
        0     Cart Position           -4.8    4.8
        1     Cart Velocity           -Inf    Inf
        2     Pole Angle              -24°    24°
        3     Pole Velocity At Tip    -Inf    Inf

### Visualization of the CartPole Environment for 1000 time steps of random actions, without any training

In [None]:
import gym
env = gym.make('CartPole-v1')
env.reset()
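for _ in range(1000):
    env.render()
    # Take a random action; restart the episode once the pole falls or the cart leaves the track
    _, _, done, _ = env.step(env.action_space.sample())
    if done:
        env.reset()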
env.close()



### Importing required libraries

In [1]:
import gym
import tensorflow as tf
import numpy as np
from tensorflow import keras

from collections import deque
import time
import random

### Defining the Deep Q Network Architecture

In [2]:
def MyModel(state_shape, action_shape):
    """ 
    The agent maps a batch of states to Q-values, one per action (2 in this case).
    e.g. the neural network output for one state is [.1, .7]
    Each output is the Q-value of the action at that index.
    The highest Q-value (0.7) is at index 1, so the greedy action is action # 1.
    
    ARGUMENTS:
    state_shape - (4, ) i.e. the 4 observation values that make up a state
    action_shape - number of possible actions, which is 2 in this case
    
    RETURNS: Compiled Neural Network
    """
    learning_rate = 0.001
    init = tf.keras.initializers.he_uniform()    # neural network weight initializers
    model = keras.Sequential()
    model.add(keras.layers.Dense(24, input_shape=state_shape, activation='relu', kernel_initializer=init))
    model.add(keras.layers.Dense(12, activation='relu', kernel_initializer=init))
    model.add(keras.layers.Dense(action_shape, activation='linear', kernel_initializer=init))
    model.compile(loss=tf.keras.losses.Huber(), optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), metrics=['accuracy'])
    
    return model
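
The docstring's example of reading an action off the network output can be checked in a few lines. This is only a sketch in plain Python: `greedy_action` is a hypothetical helper (the notebook itself uses `np.argmax`), and `[0.1, 0.7]` is the illustrative output from the docstring, not a value produced by the trained network.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


q_values = [0.1, 0.7]             # illustrative network output for one state
action = greedy_action(q_values)
print(action)                     # 1, i.e. "push cart to the right"
```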

### Function to train the previously compiled DQN

In [7]:
def train(env, replay_memory, train_model, target_model, done):
    """
    Sample a mini-batch from the replay memory, build the Bellman targets in the format the network expects,
    and fit the previously compiled training network on them
    """
    learning_rate = 0.001     # Learning rate
    discount_factor = 0.89   # gamma
    batch_size = 64 * 2
    min_experiences = 200   # minimum experiences to collect before starting to train
    
    if len(replay_memory) < min_experiences:
        return
    
    mini_batch = random.sample(replay_memory, batch_size)
    
    # Each replay_memory entry = [state, action, reward, new_state, done]
    #                     e.g. = [[1, 2, 3, 4], 1, 1.0, [1, 2, 3, 4], True]
    # Predict Q values of the current states (experience[0]) with the training network
    # Predict Q values of the next states (experience[3]) with the target network, to plug into the Bellman equation
    current_states = []
    for experience in mini_batch:
        current_states.append((experience[0]))  # creating a batch of current states
    current_states = np.array(current_states)   # converting to array since predict function does not accept a list input
    current_qs_list = train_model.predict(current_states)  # Q values of states at time t1

    new_current_states = []
    for experience in mini_batch:
        new_current_states.append((experience[3]))  # creating a batch of states after a particular current state
    new_current_states = np.array(new_current_states)
    future_qs_list = target_model.predict(new_current_states) # Q values of states at time t2

    X = []
    Y = []
    for index, (state, action, reward, new_state, done) in enumerate(mini_batch):
        if not done:
            max_future_q = reward + discount_factor * np.max(future_qs_list[index])
        else:
            max_future_q = reward

        current_qs = current_qs_list[index]
        current_qs[action] = (1 - learning_rate) * current_qs[action] + learning_rate * max_future_q   # bellman equation

        X.append(state)
        Y.append(current_qs)

    train_model.fit(np.array(X), np.array(Y), batch_size=batch_size, verbose=0, shuffle=True)
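
To make the target computation inside `train` concrete, here is a small worked example in plain Python. It is a sketch, not part of the notebook: `td_target` and `blended_q` are hypothetical helper names, and the reward and Q-values are made-up numbers. The defaults mirror `discount_factor = 0.89` and `learning_rate = 0.001` from the function above.

```python
def td_target(reward, next_q_values, done, discount_factor=0.89):
    """Bellman target: the reward alone at a terminal step, otherwise
    reward plus the discounted best Q-value of the next state."""
    if done:
        return reward
    return reward + discount_factor * max(next_q_values)


def blended_q(current_q, target, learning_rate=0.001):
    """Move the current Q-value a fraction `learning_rate` toward the target."""
    return (1 - learning_rate) * current_q + learning_rate * target


# Made-up numbers for a single non-terminal transition
target = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=False)
new_q = blended_q(current_q=1.0, target=target)
print(target)   # about 2.78
print(new_q)    # about 1.00178
```

With `learning_rate = 0.001`, the label handed to `fit` moves only 0.1% of the way from the current prediction toward the Bellman target on each update.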

### Helper Function to get Q values of a particular state

In [8]:
def get_qs(model, state):
    """
    Pass ONE state to a trained model's predict function to get the corresponding Q values. This function is used by the
    epsilon-greedy strategy when exploiting.
    The state is reshaped so that its dimensions match the input dimensions of the training network
    example: state = [ 0.03983203  0.04474036 -0.0341696  -0.00012871]
             state.shape = (4, )
             state.reshape([1, state.shape[0]]) = array([[ 0.03983203,  0.04474036, -0.0341696 , -0.00012871]])
             reshaped shape = (1, 4)
             since the predict function expects input of shape (batch_size, state_shape)
    ARGUMENTS:
    model - trained model
    state - the state for which we want to predict the Q values corresponding to each action
    
    RETURNS: predicted Q values for a particular state
    """
    return model.predict(state.reshape([1, state.shape[0]]))

### Main Function to loop over each time step of each episode for DQN training

In [9]:
def main():
    
    # setting and getting acquainted with the environment
    env = gym.make('CartPole-v1')
    print("Action Space: {}".format(env.action_space))
    print("State space: {}".format(env.observation_space))
    state_shape = env.observation_space.shape   # equals value (4, )
    action_shape = env.action_space.n           # equals value 2
    
    # Initializing values for Epsilon-greedy algorithm 
    epsilon = 1     
    max_epsilon = 1 
    min_epsilon = 0.01  
    decay = 0.01

    # Build the Training and Target models
    train_model = MyModel(state_shape, action_shape)
    target_model = MyModel(state_shape, action_shape)
    # initializing and setting target model parameters
    steps_to_update_target_model = 0  # counter
    update_target_model = 100         # update after every 100 steps
    
    # setting experience memory parameters
    replay_memory = deque(maxlen=50_000)   # deque: Queue data structure that allows insert and delete at both ends
    
    # setting other required parameters
    train_episodes = 200        # maximum number of episodes to play
    rewards_all_episodes = []   # total reward of every episode, used afterwards to check whether the average reward improves
    episode_number = 0
    
    for episode in range(train_episodes):    # loop to iterate over each episode
        episode_number+=1
        print('Executing episode:'+ str(episode_number) + ' with epsilon value:'+ str(epsilon))
        reward_current_episode = 0           # total reward of the current episode, reset to zero at the start of each episode
        state = env.reset()
        done = False
        
        while not done:     # loop to iterate over each step of an episode
            # Implementing Epsilon Greedy Exploration Strategy
            random_number = random.uniform(0, 1)
            if random_number <= epsilon: 
                action = env.action_space.sample()  # Explore    
            else:
                predict_maxQ_for_given_state = get_qs(train_model, state).flatten()
                action = np.argmax(predict_maxQ_for_given_state)       # Exploit best known action
                
            # implement the chosen action and after every action, append the experience into replay memory
            new_state, reward, done, _ = env.step(action)
            replay_memory.append([state, action, reward, new_state, done])
            
            state = new_state
            reward_current_episode += reward

            # Train the training network using the Bellman Equation
            train(env, replay_memory, train_model, target_model, done)
            steps_to_update_target_model += 1
            
            if steps_to_update_target_model >= update_target_model:
                print('Copying main network weights to the target network weights')
                target_model.set_weights(train_model.get_weights())
                steps_to_update_target_model = 0   # keep playing: the episode ends only when done is True
        
        # reduce the epsilon value after every episode so as to favor exploitation with time
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay * episode)
        rewards_all_episodes.append(reward_current_episode)
        
    # Average reward over each consecutive block of 20 episodes
    rewards_per_twenty_episodes = np.split(np.array(rewards_all_episodes), train_episodes // 20)
    count = 20
    print('Average rewards per twenty episodes')

    for r in rewards_per_twenty_episodes:
        print(count, ':', str(np.mean(r)))
        count += 20
        
    env.close()

In [10]:
if __name__ == '__main__':
    main()

Action Space: Discrete(2)
State space: Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
Executing episode:1 with epsilon value:1
Executing episode:2 with epsilon value:1.0
Executing episode:3 with epsilon value:0.9901493354116764
Executing episode:4 with epsilon value:0.9803966865736877
Executing episode:5 with epsilon value:0.970741078213023
Executing episode:6 with epsilon value:0.9611815447608
Copying main network weights to the target network weights
Executing episode:7 with epsilon value:0.9517171302557069
Executing episode:8 with epsilon value:0.9423468882484062
Executing episode:9 with epsilon value:0.9330698817068888
Executing episode:10 with epsilon value:0.9238851829227694
Executing episode:11 with epsilon value:0.9147918734185159
Executing episode:12 with epsilon value:0.9057890438555999
Copying main network weights to the target network weights
Executing episode:13 with epsilon value:0.896875793943563
Executing episode:14 with epsilon value:0.888051232349
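
The epsilon values in this log follow the exponential schedule applied at the end of each episode in `main`. The sketch below reproduces it in plain Python; `epsilon_after` is a hypothetical helper name, and its defaults are the `min_epsilon`, `max_epsilon` and `decay` values set in `main`. Because the loop index starts at 0, the value computed after index 0 is the one printed for episode 2, and so on.

```python
import math


def epsilon_after(episode, min_epsilon=0.01, max_epsilon=1.0, decay=0.01):
    """Epsilon computed at the end of the episode with 0-based index `episode`."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay * episode)


# Index 1 gives the value the log prints for "episode:3"
for episode in range(3):
    print(episode, epsilon_after(episode))
```

As `episode` grows, the exponential term vanishes and epsilon approaches `min_epsilon`, so the agent never stops exploring entirely.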

#### Interpreting the above result

Initially, it takes 4-7 episodes to accumulate the 100 time steps after which the weights of the target network are updated. By around the 70th episode, 100 time steps are covered in just 2-3 episodes, and after the 95th episode a single episode is enough. The pole is therefore staying balanced for more time steps per episode, which indicates that the agent is learning.
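
The link between episode length and the spacing of target-network updates can be sketched with a step counter. This is only an illustration: `sync_episodes` is a hypothetical helper, and the episode lengths are made-up numbers, not values from the run above.

```python
def sync_episodes(episode_lengths, sync_every=100):
    """Return the 1-based episode number in which each target-network sync fires,
    given how many time steps each episode lasted."""
    sync_at = []
    steps = 0
    for episode, length in enumerate(episode_lengths, start=1):
        for _ in range(length):
            steps += 1
            if steps >= sync_every:
                sync_at.append(episode)
                steps = 0          # restart the counter after every sync
    return sync_at


# Short early episodes: several are needed to reach 100 steps
print(sync_episodes([20, 15, 25, 18, 30, 22]))   # [5]
# Long later episodes: one episode can trigger one or more syncs
print(sync_episodes([120, 250]))                 # [1, 2, 2]
```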

The average reward per 20 episodes also increases over the 200 episodes, which again shows that the agent is learning over time.
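
For reference, the per-block average is simple to compute without NumPy. The sketch below uses a hypothetical helper, `block_averages`, and made-up rewards; it is not the output of the training run.

```python
def block_averages(rewards, block_size=20):
    """Average reward over each consecutive block of `block_size` episodes."""
    if len(rewards) % block_size != 0:
        raise ValueError("number of episodes must be a multiple of block_size")
    return [
        sum(rewards[start:start + block_size]) / block_size
        for start in range(0, len(rewards), block_size)
    ]


# Made-up rewards: 20 short episodes followed by 20 longer ones
rewards = [10.0] * 20 + [30.0] * 20
print(block_averages(rewards))   # [10.0, 30.0]
```

A rising sequence of block averages is the signal described above; a flat or falling one would suggest the agent is not improving.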