Copyright 2020 Abhishek Dabas

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CO

## Abstract:
There are 2 different types of reinforcement learning methods:
- Value based
- Policy based

Both of these have some drawbacks, which motivates a hybrid approach called the "Actor Critic Method". In this notebook we will go through this method and try to implement it in one of the Gym environments.

# Policy Based Methods:
**Policy:** A policy is defined as the probability distribution of actions given a state:
$$P(A|S)$$
In policy-based methods, there is no need to learn a value function; the agent selects an action without using one. Instead, we directly optimize the policy function $\pi$.
- $\pi$ is the probability distribution over actions, parameterized by $\theta$:
$$\pi_\theta(a|s) = P(a|s; \theta)$$
## There are 2 types of Policy:
1. **Deterministic:**
- It maps a state to an action. The policy returns a single action to take.
- It is used in deterministic environments, e.g. chess.
1. **Stochastic:**
- It outputs a probability distribution over actions, so there is some probability of taking a different action in the same state.
- It is used when the environment is uncertain.
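The difference between the two policy types can be sketched in a few lines. This is a minimal illustration using only the standard library; the per-action `scores` are hypothetical numbers standing in for whatever a real model would output for one state, and the softmax is one common (assumed) way to turn scores into $\pi_\theta(a|s)$.

```python
import math
import random

def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deterministic_policy(scores):
    """Always return the single highest-scoring action."""
    return max(range(len(scores)), key=lambda a: scores[a])

def stochastic_policy(scores, rng):
    """Sample an action from the softmax distribution over the scores."""
    probs = softmax(scores)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]

rng = random.Random(0)
scores = [1.0, 2.0, 0.5]  # hypothetical per-action scores for one state
greedy_action = deterministic_policy(scores)
sampled_actions = [stochastic_policy(scores, rng) for _ in range(1000)]
```

The deterministic policy returns the same action every time for this state, while the stochastic one mostly picks the best-scoring action but still visits the others.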
## Advantages:
- They have ``better convergence properties.`` Value based methods can oscillate a lot during training. In policy based methods we follow the policy gradient to find the best parameters. Because we follow the gradient in small steps, the updates are smooth and we typically converge to a local maximum (in the worst case) or the global maximum (in the best case).
- Policy based methods are ``better in high dimensional action spaces.`` They also work better when the action space is continuous. In DQN we try to assign a score to each definite action at each time step, but when the action space is continuous this becomes very complicated, e.g. driving a car, where wheel angles of 15, 15.1, 15.2, etc. are all possibilities. Policy methods adjust the parameters directly.
- Policy based methods can ``learn a stochastic policy``. We do not need to hand-craft an exploration/exploitation tradeoff here: with a stochastic policy the agent explores the state space without always taking the same action. The output here is a probability distribution over actions.
## Disadvantages:
- They take ``a lot of time to converge, often getting stuck in a local maximum rather than reaching the global optimum.`` They improve slowly, step by step.
- ``Evaluating a policy is typically inefficient and has high variance.``
## Check if the policy is Good or Not
To measure how good a policy is we use a function called the ``objective function``, which calculates the expected reward of the policy. In policy based methods we are trying to find the best parameters ($\theta$). $J(\theta)$ tells us how good the policy is, and policy gradient ascent helps us find the policy parameters that maximize the probability of good actions.
$$J(\theta) = E_{\pi_\theta}\left[\sum_t \gamma^t r_t\right]$$
- We want to check the quality of the policy $\pi$ with a score function $J(\theta)$.
- We use policy gradient ascent to find the best parameters $\theta$ that improve $\pi$.
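As a rough sketch of what the score function measures, the expectation can be approximated by averaging the discounted return over sampled episodes. The reward lists below are made-up examples, and `discounted_return` and `estimate_objective` are illustrative helper names, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one episode, accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of J(theta): the mean discounted return over episodes."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)

# Hypothetical reward sequences collected by running one policy twice.
episodes = [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
score = estimate_objective(episodes, gamma=0.5)
```

With `gamma=0.5` the first episode is worth 1 + 0.5 + 0.25 and the second only 0.25, which shows how discounting favors rewards that arrive early.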
### Policy gradient Ascent 
Once we know how good our policy is, we want to find the parameters $\theta$ that maximize the score function. Maximizing this score function means finding the optimal policy. To maximize the score function $J(\theta)$, we do gradient ascent on the policy parameters. Gradient ascent is just the inverse of gradient descent: we step in the direction of steepest ascent. We compute the gradient of the score with respect to the parameters of the current policy $\pi$, update the parameters in the direction of greatest increase, and then iterate.
$$\text{Policy}: \pi_\theta$$
$$\text{Objective function}: J(\theta)$$
$$\text{Gradient}: \nabla_\theta J(\theta)$$
$$\text{Update}: \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
We want to find the policy parameters that maximize the score:
$$\theta^* = \arg\max_\theta J(\theta) = \arg\max_\theta E_{\pi_\theta}\left[\sum_t \gamma^t R(s_t,a_t)\right]$$
which is the expected total (discounted) reward under the policy, with $R(s_t,a_t)$ the reward at step $t$. So we want to differentiate the score function $J(\theta)$ with respect to $\theta$.
- Example: [Cartpole](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Cartpole/Cartpole%20REINFORCE%20Monte%20Carlo%20Policy%20Gradients.ipynb)
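The update rule $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$ can be tried on a toy problem where the gradient is available in closed form. The sketch below assumes a single state with two actions, a softmax policy over preferences `theta`, and hypothetical fixed rewards. For that setup $J(\theta) = \sum_a \pi_\theta(a) r(a)$ and $\partial J / \partial \theta_k = \pi_\theta(k)\,(r(k) - J(\theta))$. It is not the sampled estimator used in the linked Cartpole notebook, only an illustration of gradient ascent itself.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def objective(theta, rewards):
    """J(theta): expected reward under the softmax policy."""
    probs = softmax(theta)
    return sum(p * r for p, r in zip(probs, rewards))

def objective_gradient(theta, rewards):
    """Exact gradient of J for a softmax policy: pi_k * (r_k - J)."""
    probs = softmax(theta)
    j = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j) for p, r in zip(probs, rewards)]

rewards = [1.0, 0.0]  # hypothetical: action 0 pays 1, action 1 pays 0
theta = [0.0, 0.0]    # start from a uniform policy
alpha = 0.5           # step size

history = [objective(theta, rewards)]
for _ in range(200):
    grad = objective_gradient(theta, rewards)
    theta = [t + alpha * g for t, g in zip(theta, grad)]  # ascent step
    history.append(objective(theta, rewards))
```

Each step raises the score, and the policy shifts its probability mass toward the better action.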

<img src= "images/epolicy.png">

- Value based: Here we learn a value function that maps each state (or state-action pair) to an expected return, and act by picking the action with the highest value. It is useful when we have a finite action space.
- Policy based: Here we directly try to learn the optimal policy without using a value function. It is useful when we have continuous or stochastic actions.
# Actor Critic Method
A hybrid between value-based and policy-based algorithms.

### What are the Actor and the Critic
1. The **Critic** estimates the value function. This could be either the action-value (Q-value) or the state-value (V-value).
$$\hat{q}(s,a,w)$$
1. The **Actor** updates the policy distribution in the direction suggested by the critic (as in policy gradients).
$$\pi(s,a,\theta)$$
As we can see, we have 2 neural networks here: one with weights $w$ for the critic and one with parameters $\theta$ for the actor.
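Before moving to neural networks, the interplay of the two roles can be sketched with plain numbers. The example below is a simplified, tabular one-step actor-critic on a hypothetical one-state task, not the network-based implementation that follows: the critic is a single value estimate `value`, the actor is a softmax over two preferences `theta`, and the learning rates `alpha` and `beta` are arbitrary choices.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
rewards = [1.0, 0.0]    # hypothetical task: action 0 pays 1, action 1 pays 0
theta = [0.0, 0.0]      # actor parameters (action preferences)
value = 0.0             # critic estimate of the state's value
alpha, beta = 0.5, 0.1  # actor and critic learning rates

for _ in range(2000):
    probs = softmax(theta)
    action = rng.choices([0, 1], weights=probs, k=1)[0]
    reward = rewards[action]

    # Critic: how much better or worse was the outcome than expected?
    # Each episode ends after one step, so there is no bootstrapped next value.
    td_error = reward - value
    value += beta * td_error

    # Actor: move the preferences in the direction the critic suggests,
    # using grad log pi(action) = one_hot(action) - probs for a softmax policy.
    for a in range(2):
        indicator = 1.0 if a == action else 0.0
        theta[a] += alpha * td_error * (indicator - probs[a])

final_probs = softmax(theta)
```

The critic's error signal replaces the raw return in the policy update, which is the core idea the two networks implement at scale.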

## Imports

In [3]:
import gym
import numpy as np 
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Input, Add, Multiply
from keras.optimizers import Adam
import keras.backend as K

import tensorflow as tf

import random
from collections import deque

Using TensorFlow backend.


### Creating the ActorCritic Class
- Chain rule: find the gradient of changing the actor network parameters so as to get closest to the final value network predictions, i.e. de/dA.
- Calculate de/dA = de/dC * dC/dA, where e is the error, C the critic, and A the actor.
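The chain rule in the bullets above can be checked numerically on a toy pair of functions. In this sketch the "actor" is a single weight multiplying the state and the "critic" is a hand-written quadratic that peaks at an action of 2. Both are hypothetical stand-ins for the networks, chosen so the derivatives can be written by hand and compared with a finite-difference estimate.

```python
def actor(weight, state):
    """Toy actor: the action is a linear function of the state."""
    return weight * state

def critic(action):
    """Toy critic: the score is highest when the action equals 2."""
    return -(action - 2.0) ** 2

def critic_grad_wrt_action(action):
    """d(critic)/d(action), derived by hand."""
    return -2.0 * (action - 2.0)

def actor_grad_wrt_weight(state):
    """d(action)/d(weight), derived by hand."""
    return state

state, weight = 0.5, 1.0
action = actor(weight, state)

# Chain rule: d(critic)/d(weight) = d(critic)/d(action) * d(action)/d(weight)
chain_grad = critic_grad_wrt_action(action) * actor_grad_wrt_weight(state)

# Central finite difference through the composed function, as a sanity check.
eps = 1e-6
numeric_grad = (critic(actor(weight + eps, state))
                - critic(actor(weight - eps, state))) / (2 * eps)
```

The two gradients agree, and a positive value means that increasing the actor's weight raises the critic's score for this state.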

In [4]:
class ActorCritic:
    def __init__(self, env, sess):
        self.env  = env
        self.sess = sess

        self.learning_rate = 0.001
        self.epsilon = 1.0
        self.epsilon_decay = .995
        self.gamma = .95
        self.tau   = .125


        self.memory = deque(maxlen=2000)
        self.actor_state_input, self.actor_model = self.create_actor_model()
        _, self.target_actor_model = self.create_actor_model()
        
        # placeholder for the gradient that is fed in from the critic (de/dC)

        self.actor_critic_grad = tf.placeholder(tf.float32, 
            [None, self.env.action_space.shape[0]])

        actor_model_weights = self.actor_model.trainable_weights
        self.actor_grads = tf.gradients(self.actor_model.output, 
            actor_model_weights, -self.actor_critic_grad) # dC/dA (from actor)
        grads = zip(self.actor_grads, actor_model_weights)
        self.optimize = tf.train.AdamOptimizer(self.learning_rate).apply_gradients(grads)

        # the critic model evaluates the performance of the actor

        self.critic_state_input, self.critic_action_input, \
            self.critic_model = self.create_critic_model()
        _, _, self.target_critic_model = self.create_critic_model()

        self.critic_grads = tf.gradients(self.critic_model.output, 
            self.critic_action_input) # where we calculate de/dC for feeding above

        # Initialize for later gradient calculations
        # (tf.initialize_all_variables is deprecated in favor of this call)
        self.sess.run(tf.global_variables_initializer())

    ## Model definitions

    # actor model
    # given the current state, what is the best action?
    def create_actor_model(self):
        state_input = Input(shape=self.env.observation_space.shape)
        h1 = Dense(24, activation='relu')(state_input)
        h2 = Dense(48, activation='relu')(h1)
        h3 = Dense(24, activation='relu')(h2)
        # note: a relu output layer can only produce non-negative actions
        output = Dense(self.env.action_space.shape[0], activation='relu')(h3)

        # Keras 2 uses the `inputs`/`outputs` keyword names
        model = Model(inputs=state_input, outputs=output)
        adam  = Adam(lr=0.001)
        model.compile(loss="mse", optimizer=adam)
        return state_input, model

    # critic model
    # the Q scores are calculated separately in the critic model
    # it takes the state and the action as inputs and outputs the value
    def create_critic_model(self):
        state_input = Input(shape=self.env.observation_space.shape)
        state_h1 = Dense(24, activation='relu')(state_input)
        state_h2 = Dense(48)(state_h1)

        action_input = Input(shape=self.env.action_space.shape)
        action_h1    = Dense(48)(action_input)
        
        # a layer in the middle to merge the two
        merged    = Add()([state_h2, action_h1])
        merged_h1 = Dense(24, activation='relu')(merged)
        # note: a relu output layer cannot represent negative values
        output = Dense(1, activation='relu')(merged_h1)
        # Keras 2 uses the `inputs`/`outputs` keyword names
        model  = Model(inputs=[state_input, action_input], outputs=output)

        adam  = Adam(lr=0.001)
        model.compile(loss="mse", optimizer=adam)
        return state_input, action_input, model

    ## Model training
    # the updates happen at every time step

    # this is our replay memory
    def remember(self, cur_state, action, reward, new_state, done):
        self.memory.append([cur_state, action, reward, new_state, done])

    # train the actor
    def _train_actor(self, samples):
        for sample in samples:
            cur_state, action, reward, new_state, _ = sample
            predicted_action = self.actor_model.predict(cur_state)
            grads = self.sess.run(self.critic_grads, feed_dict={
                self.critic_state_input:  cur_state,
                self.critic_action_input: predicted_action
            })[0]

            self.sess.run(self.optimize, feed_dict={
                self.actor_state_input: cur_state,
                self.actor_critic_grad: grads
            })

    # train the critic
    def _train_critic(self, samples):
        for sample in samples:
            cur_state, action, reward, new_state, done = sample
            if not done:
                target_action = self.target_actor_model.predict(new_state)
                future_reward = self.target_critic_model.predict(
                    [new_state, target_action])[0][0]
                reward += self.gamma * future_reward
            self.critic_model.fit([cur_state, action], reward, verbose=0)

    def train(self):
        batch_size = 32
        if len(self.memory) < batch_size:
            return

        samples = random.sample(self.memory, batch_size)
        self._train_critic(samples)
        self._train_actor(samples)

    ## Target model updating
    # the target models are synced by copying the weights of the trained models
    # (the actor itself is trained toward the change in its parameters that
    #  gives the largest increase in the Q value predicted by the critic model)

    def _update_actor_target(self):
        actor_model_weights  = self.actor_model.get_weights()
        actor_target_weights = self.target_actor_model.get_weights()

        for i in range(len(actor_target_weights)):
            actor_target_weights[i] = actor_model_weights[i]
        self.target_actor_model.set_weights(actor_target_weights)

    def _update_critic_target(self):
        critic_model_weights  = self.critic_model.get_weights()
        critic_target_weights = self.target_critic_model.get_weights()

        for i in range(len(critic_target_weights)):
            critic_target_weights[i] = critic_model_weights[i]
        self.target_critic_model.set_weights(critic_target_weights)

    def update_target(self):
        self._update_actor_target()
        self._update_critic_target()

    ## Model predictions
    # epsilon-greedy exploration, similar to DQN
    def act(self, cur_state):
        self.epsilon *= self.epsilon_decay
        if np.random.random() < self.epsilon:
            return self.env.action_space.sample()
        return self.actor_model.predict(cur_state)
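
The class stores `self.tau = .125`, but the target-update methods copy the weights across directly. One common alternative in actor-critic implementations with target networks is a soft (Polyak) update, where the target weights move only a fraction `tau` of the way toward the trained weights on each call. The sketch below shows that rule on plain lists of floats. `soft_update` is an illustrative helper, not a Keras function, and whether the original code intended this use of `tau` is an assumption.

```python
def soft_update(online_weights, target_weights, tau):
    """Move each target weight a fraction tau of the way toward the online weight."""
    return [tau * w + (1.0 - tau) * t
            for w, t in zip(online_weights, target_weights)]

online = [1.0, -2.0, 0.5]  # hypothetical trained weights
target = [0.0, 0.0, 0.0]   # hypothetical target-network weights
tau = 0.125                # same value the class stores in self.tau

# One update moves the target 12.5% of the way toward the online weights.
first_step = soft_update(online, target, tau)

# Repeated updates make the target track the online weights smoothly.
for _ in range(50):
    target = soft_update(online, target, tau)
```

With Keras models, the same rule could be applied elementwise to the arrays returned by `get_weights()` before calling `set_weights()`.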


### Environment
We will be using the `Pendulum-v0` environment from Gym.
The inverted pendulum swingup problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright.
The Pendulum environment has a continuous action space, so the number of possible actions is infinite.
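To make the reward values printed during training easier to read, here is a sketch of how the Pendulum reward is computed, as I understand the Gym classic-control implementation: the torque is clipped to [-2, 2] and the reward is the negative of a cost built from the angle, the angular velocity, and the torque. The constants are an assumption worth checking against the installed Gym version, and `pendulum_reward` is an illustrative helper, not a Gym function.

```python
import math

MAX_TORQUE = 2.0  # assumed torque limit of the environment

def angle_normalize(theta):
    """Wrap an angle into the interval [-pi, pi)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi

def pendulum_reward(theta, theta_dot, torque):
    """Negative cost: angle error, angular velocity and torque are all penalized."""
    torque = max(-MAX_TORQUE, min(MAX_TORQUE, torque))
    cost = (angle_normalize(theta) ** 2
            + 0.1 * theta_dot ** 2
            + 0.001 * torque ** 2)
    return -cost

upright = pendulum_reward(0.0, 0.0, 0.0)      # balanced at the top, no effort
hanging = pendulum_reward(math.pi, 0.0, 0.0)  # hanging straight down, at rest
worst = pendulum_reward(math.pi, 8.0, 2.0)    # far from upright, fast, full torque
```

Under these assumptions the reward is never positive and is best (zero) when the pendulum is upright, motionless, and no torque is applied.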

In [6]:
# determines how to assign values to each state, i.e. takes the state
# and action (two-input model) and determines the corresponding value

def main():
    sess = tf.Session()
    K.set_session(sess)
    env = gym.make("Pendulum-v0")
    actor_critic = ActorCritic(env, sess)
    
    # Hyper parameters
    num_trials = 1000
    trial_len  = 500

    cur_state = env.reset()
    
    # sample random actions
    action = env.action_space.sample()
    
    while True:
        
#         env.render()
        
        # current state 
        cur_state = cur_state.reshape((1, env.observation_space.shape[0]))
        
        # the actor chooses the next action
        action = actor_critic.act(cur_state)
        action = action.reshape((1, env.action_space.shape[0]))
        
        new_state, reward, done, _ = env.step(action)
        new_state = new_state.reshape((1, env.observation_space.shape[0]))
        print("Action: ", action, "Reward: ", reward)
        
        actor_critic.remember(cur_state, action, reward, new_state, done)
        actor_critic.train()

        cur_state = new_state
        
        if done:
            break
main()



Action:  [[1.4859121]] Reward:  [-3.8274097]
Action:  [[-0.53925604]] Reward:  [-3.874733]
Action:  [[1.7384005]] Reward:  [-4.1783094]
Action:  [[0.79900706]] Reward:  [-4.572535]
Action:  [[-1.101666]] Reward:  [-5.1743703]
Action:  [[-1.339058]] Reward:  [-6.159809]
Action:  [[-0.6874918]] Reward:  [-7.46648]
Action:  [[1.7752327]] Reward:  [-8.936607]
Action:  [[-1.9174204]] Reward:  [-10.128922]
Action:  [[-1.8442848]] Reward:  [-11.905601]
Action:  [[-0.3645895]] Reward:  [-10.792521]
Action:  [[-0.7449808]] Reward:  [-9.361925]
Action:  [[-1.5841984]] Reward:  [-8.004094]
Action:  [[-0.68719226]] Reward:  [-6.795425]
Action:  [[0.]] Reward:  [-5.599224]
Action:  [[-1.9638232]] Reward:  [-4.5368032]
Action:  [[-1.933766]] Reward:  [-3.8076043]
Action:  [[-1.4386117]] Reward:  [-3.207256]
Action:  [[-0.30811134]] Reward:  [-2.7241197]
Action:  [[-1.2116411]] Reward:  [-2.3834014]
Action:  [[0.8065858]] Reward:  [-2.2379467]
Action:  [[-0.10481944]] Reward:  [-2.3048043]
Action:  [

Action:  [[-1.9295152]] Reward:  [-0.45431828]
Action:  [[0.13504395]] Reward:  [-0.49356747]
Action:  [[0.21508798]] Reward:  [-0.71515334]
Action:  [[0.41877177]] Reward:  [-1.0460356]
Action:  [[0.4753767]] Reward:  [-1.5458674]
Action:  [[0.61697686]] Reward:  [-2.2694473]
Action:  [[0.7761104]] Reward:  [-3.308586]
Action:  [[0.89899606]] Reward:  [-4.762735]
Action:  [[1.007802]] Reward:  [-6.714568]
Action:  [[-0.90380746]] Reward:  [-9.20199]
Action:  [[1.1698376]] Reward:  [-11.640985]
Action:  [[-1.4752415]] Reward:  [-14.653433]
Action:  [[-1.2874469]] Reward:  [-15.437494]
Action:  [[1.1239733]] Reward:  [-12.814799]


## Conclusion
- A critic measures how good the action taken is ("value-based"), whereas an actor controls how the agent behaves ("policy-based").

## Resources:
1. https://www.freecodecamp.org/news/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f/
1. https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d
1. https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69