# Solving CartPole using the Actor Critic Model

Let's start by installing OpenAI Gym to get the CartPole environment.

In [1]:
!pip install gym



Now we load all the packages we need.

In [2]:
import gym
import numpy as np
import random
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import Adam
import matplotlib.pyplot as plt
%matplotlib inline

Using TensorFlow backend.


## Actor Critic Agent

The `CriticActorAgent` class houses all the logic needed for a Deep Actor-Critic agent. Since there's a lot going on here, this section is longer than the others.

### Parameters

Several parameters are hard-coded into the model. They should be tweaked when applying it to different problems to see whether they affect performance. We describe each parameter briefly here.



* Epsilon: The exploration rate, i.e. how often the agent chooses a random action during training instead of relying on what it has already learned. This helps the agent go down paths it normally wouldn't, in the hope of higher long-term rewards.
  - Epsilon Decay: The factor epsilon is multiplied by after each action the agent takes
  - Epsilon Min: The lowest rate of exploration we'll allow during training
* Gamma: The discount rate. This tells us how much we prioritize current versus future rewards.
* Tau: The fraction by which the target models' weights move toward the trained models' weights on each update.
* Learning Rate: The step size used by the Adam optimizer when training the neural networks.
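
To get a feel for the exploration schedule, here is a minimal sketch that counts how many multiplicative decay steps it takes for epsilon to fall from its starting value to its floor. It assumes the default values used by the agent later in this notebook; `decay_steps` is a helper name invented for this illustration.

```python
import math

# Values mirror the agent's hard-coded defaults shown later in this notebook.
epsilon = 1.0
epsilon_min = 0.001
epsilon_decay = 0.995

def decay_steps(start, minimum, decay):
    """Count multiplicative decay steps until epsilon is no longer above the minimum."""
    steps = 0
    eps = start
    while eps > minimum:
        eps *= decay
        steps += 1
    return steps

steps_needed = decay_steps(epsilon, epsilon_min, epsilon_decay)

# The same count from the closed form: smallest n with start * decay**n <= minimum.
closed_form = math.ceil(math.log(epsilon_min / epsilon) / math.log(epsilon_decay))

print(steps_needed, closed_form)
```

With these defaults the agent stops exploring after somewhat fewer than 1,400 actions, so a slower decay may be worth trying on harder problems.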



### Fixed Q-Targets

In Q-Learning we update our Q-table through the following rule, where $\gamma$ is the discount rate (Gamma) from the parameter list:

$Q_{TableEntry}(state, action) = Reward + \gamma \max_{action'} Q_{TableEntry}(nextState, action')$

Since the update depends on the table itself, the targets shift whenever the entries change, and we can start to get correlated entries. This can cause oscillations during training.
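
As a rough illustration of the table update, the sketch below applies it once to a tiny hand-made table. The states `"s0"` and `"s1"`, their values, and the helper `q_update` are hypothetical and exist only for this example; the discount rate matches the agent's default.

```python
gamma = 0.95  # discount rate, matching the agent's default

# Hypothetical two-state, two-action table used only for illustration.
q_table = {
    "s0": [0.0, 0.0],
    "s1": [1.0, 3.0],
}

def q_update(table, state, action, reward, next_state, done):
    """Overwrite one entry with the reward plus the discounted best next-state value."""
    target = reward
    if not done:
        target += gamma * max(table[next_state])
    table[state][action] = target
    return target

# Taking action 1 in "s0" yields reward 1.0 and lands in "s1".
new_value = q_update(q_table, "s0", 1, reward=1.0, next_state="s1", done=False)
print(new_value)
```

The new entry is 1 + 0.95 × 3 = 3.85, and it is computed from another entry of the same table, which is exactly the dependence that motivates a separate target model.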

To combat this, we implemented a target model. It essentially is a copy of the original model, except that the values do not update as rapidly. The rate at which the target model updates is dependent upon `Tau` in our parameter list.
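
The effect of `Tau` can be sketched with plain lists standing in for network weights. This is an illustration under simplified assumptions (two scalar weights, a helper named `soft_update` made up for this example), not the agent's actual update method.

```python
tau = 0.1  # matches the agent's default

def soft_update(source, target, tau):
    """Return target weights moved a fraction tau toward the source weights."""
    return [tau * s + (1 - tau) * t for s, t in zip(source, target)]

# Hypothetical weights: the trained model sits at [1, -2], the target starts at zero.
source_weights = [1.0, -2.0]
target_weights = [0.0, 0.0]

for step in range(3):
    target_weights = soft_update(source_weights, target_weights, tau)
    print(step + 1, target_weights)
```

After n updates the target has closed a fraction 1 - 0.9^n of the gap, so a smaller `Tau` gives a steadier but slower-moving target.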

### Agent Workflow

1. Perform actions and record the results in the agent's memory
2. After every action, once the memory holds more than `batch_size` entries, perform what's called a replay: sample a random batch of `batch_size` memories to train on.
3. During training on each experience, do the following:
  - For the critic, update the value of the current state by taking the reward and adding the discounted value of the next state as determined by the critic.
  - For the actor, get the action values for the current state predicted by the actor, and update it with the action taken and the value of the next state as determined by the critic.
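
The two training targets in step 3 can be sketched without any neural network, using plain numbers in place of model predictions. The function names and the sample values below are assumptions made for illustration only.

```python
gamma = 0.95  # discount rate, matching the agent's default

def critic_target(reward, next_state_value, done):
    """Critic target: reward plus discounted next-state value (reward only at episode end)."""
    return reward if done else reward + gamma * next_state_value

def actor_target(action_values, action, reward, next_state_value, done):
    """Copy the actor's action values and overwrite only the entry for the action taken."""
    updated = list(action_values)
    updated[action] = critic_target(reward, next_state_value, done)
    return updated

# Hypothetical numbers standing in for network predictions.
print(critic_target(1.0, 10.0, done=False))
print(actor_target([0.2, 0.4], action=1, reward=-10, next_state_value=10.0, done=True))
```

Only the entry for the action actually taken changes in the actor's target, so fitting the actor leaves its estimates for the other actions where they were.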


In [0]:
class CriticActorAgent:
  def __init__(self, state_size, action_size):
    self.state_size = state_size
    self.action_size = action_size
    # The deque will only contain the last 2000 entries
    self.memory = deque(maxlen=2000)
    self.gamma = 0.95 # Discount Rate
    self.epsilon = 1.0 # Exploration Rate
    self.epsilon_min = 0.001
    self.epsilon_decay = 0.995
    self.learning_rate = 0.001 # For the neural net optimizer
    self.critic_model = self._build_critic_model()
    self.actor_model = self._build_actor_model()
    # Semi-Fixed Q-Targets 
    self.target_critic_model = self._build_critic_model()
    self.target_actor_model = self._build_actor_model()
    self.tau = 0.1 # Update the target model by 10% each iteration
    
    
  # What is the value for any given state?
  def _build_critic_model(self):
    model = Sequential()
    model.add(Dense(self.state_size, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(1, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
    return model
    
  # What are the action values for the possible actions in a given state?
  def _build_actor_model(self):
    model = Sequential()
    model.add(Dense(self.state_size, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(self.action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
    return model
  
  def update_target_model(self, model, target_model):
    for layer_target, layer_src in zip(target_model.layers, model.layers):
      weights = layer_src.get_weights()
      target_weights = layer_target.get_weights()
      
      # Adjust the weights of the target to be tau proportion closer to the current
      for i in range(len(weights)):
        target_weights[i] = self.tau * weights[i] + (1 - self.tau) * target_weights[i]
      
      layer_target.set_weights(target_weights)
      
  def update_actor_target(self):
    self.update_target_model(self.actor_model, self.target_actor_model)
    
  def update_critic_target(self):
    self.update_target_model(self.critic_model, self.target_critic_model)
  
  def remember(self, state, action, reward, next_state, done):
    self.memory.append((state, action, reward, next_state, done))
    
  def act_random(self):
    return random.randrange(self.action_size)
  
  def best_act(self, state):
    # Choose the best action based on what we already know
    # If all the action values for a given state are the same, then act randomly
    action_values = self.target_actor_model.predict(state)[0]
    action = self.act_random() if np.all(action_values[0] == action_values) else np.argmax(action_values)
    return action
    
  def act(self, state):
    action = self.act_random() if np.random.rand() <= self.epsilon else self.best_act(state)
    if self.epsilon > self.epsilon_min:
      self.epsilon *= self.epsilon_decay
    return action
  
  
  def replay(self, batch_size):
    minibatch = random.sample(self.memory, batch_size)
    self._train_critic(minibatch)
    self._train_actor(minibatch)
  
  # Calculate the value for the current state by taking the reward and discounted future rewards
  # and fit the calculation into the critic model.
  # This effectively updates what the value for the current state is.
  # Think about how this fits with the whole max(x) policy deal
  def _train_critic(self, minibatch):
    for state, action, reward, next_state, done in minibatch:
      target = reward
      if not done:
        future_value = self.target_critic_model.predict(next_state)
        target = reward + self.gamma * future_value
      target = np.array(target)
      target = np.reshape(target, [1, 1])
      self.critic_model.fit(state, target, epochs = 1, verbose = 0)
    self.update_critic_target()
    
  
  # Grab the action values for the current state and update the one for the action taken
  # with the reward and the discounted future value predicted from the critic
  def _train_actor(self, minibatch):
    for state, action, reward, next_state, done in minibatch:
      action_values = self.target_actor_model.predict(state)
      target = reward
      if not done:
        next_state_value = self.target_critic_model.predict(next_state)
        target = reward + self.gamma * next_state_value
      action_values[0][action] = target
      self.actor_model.fit(state, action_values, epochs = 1, verbose = 0)
    self.update_actor_target()
  
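  # Note: only the critic's weights are loaded; the actor is not restored
  def load(self, name):
    self.critic_model.load_weights(name)
    
  # Note: only the critic's weights are saved; the actor is not persisted
  def save(self, name):
    self.critic_model.save_weights(name)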
    

## Training

We will now train our Actor-Critic agent on CartPole by simulating many runs through the environment.

In [5]:
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
batch_size = 32
SAVE_DIR = "./"  # Assumption: path prefix for saved weights; the original never defines it

agent = CriticActorAgent(state_size, action_size)
EPISODES = 30
for episode_num in range(1, EPISODES + 1):
  state = env.reset()
  state = np.reshape(state, [1, state_size])
  
  score = 0
  done = False
  while not done:
    action = agent.act(state)
    next_state, reward, done, _ = env.step(action)
    
    # If the episode is over that means you failed :(
    reward = reward if not done else -10
    
    next_state = np.reshape(next_state, [1, state_size])
    
    agent.remember(state, action, reward, next_state, done)
    
    state = next_state
    
    # Replay less?
    if len(agent.memory) > batch_size:
      agent.replay(batch_size)
    
    if done:
      print("episode: {}/{}, score: {}, epsilon: {:.2}"
          .format(episode_num, EPISODES, score, agent.epsilon))
    
    # Made it to the next frame :)
    score += 1
  
  # Save model every 100 episodes (never triggers while EPISODES is below 100)
  if episode_num % 100 == 0:
    print("SAVING CURRENT MODEL")
    agent.save(SAVE_DIR + str(episode_num))



episode: 1/30, score: 14, epsilon: 0.93
episode: 2/30, score: 12, epsilon: 0.87
episode: 3/30, score: 19, epsilon: 0.79
episode: 4/30, score: 46, epsilon: 0.62
episode: 5/30, score: 9, epsilon: 0.59
episode: 6/30, score: 10, epsilon: 0.56
episode: 7/30, score: 41, epsilon: 0.45
episode: 8/30, score: 63, epsilon: 0.33
episode: 9/30, score: 18, epsilon: 0.3
episode: 10/30, score: 24, epsilon: 0.26
episode: 11/30, score: 38, epsilon: 0.22
episode: 12/30, score: 31, epsilon: 0.18
episode: 13/30, score: 30, epsilon: 0.16
episode: 14/30, score: 50, epsilon: 0.12
episode: 15/30, score: 125, epsilon: 0.065
episode: 16/30, score: 119, epsilon: 0.036
episode: 17/30, score: 133, epsilon: 0.018
episode: 18/30, score: 109, epsilon: 0.01
episode: 19/30, score: 114, epsilon: 0.0059
episode: 20/30, score: 120, epsilon: 0.0032
episode: 21/30, score: 118, epsilon: 0.0018
episode: 22/30, score: 124, epsilon: 0.001
episode: 23/30, score: 137, epsilon: 0.001
episode: 24/30, score: 94, epsilon: 0.001
...

## Evaluate Performance

To test how our agent performs, we run more episodes, except this time we set epsilon to 0 so the agent never chooses randomly and relies only on what it has learned.

In [6]:
## Show performance
trials = 1000
agent.epsilon = 0
scores_list = []
for episode_num in range(1, trials + 1):
  state = env.reset()
  state = np.reshape(state, [1, state_size])
  
  score = 0
  done = False
  while not done:
    action = agent.act(state)
    next_state, reward, done, _ = env.step(action)
    
    # If the episode is over that means you failed :(
    reward = reward if not done else -10
    
    next_state = np.reshape(next_state, [1, state_size])
    state = next_state
    
    if done and (episode_num % 10 == 0):
      print("episode: {}/{}, score: {}"
          .format(episode_num, trials, score))
    
    # Made it to the next frame :)
    score += 1
  
  scores_list.append(score)


episode: 10/1000, score: 117
episode: 20/1000, score: 120
episode: 30/1000, score: 106
episode: 40/1000, score: 120
episode: 50/1000, score: 159
episode: 60/1000, score: 142
episode: 70/1000, score: 97
episode: 80/1000, score: 140
episode: 90/1000, score: 142
episode: 100/1000, score: 112
episode: 110/1000, score: 123
episode: 120/1000, score: 94
episode: 130/1000, score: 117
episode: 140/1000, score: 124
episode: 150/1000, score: 103
episode: 160/1000, score: 158
episode: 170/1000, score: 114
episode: 180/1000, score: 165
episode: 190/1000, score: 115
episode: 200/1000, score: 117
episode: 210/1000, score: 272
episode: 220/1000, score: 262
episode: 230/1000, score: 113
episode: 240/1000, score: 103
episode: 250/1000, score: 201
episode: 260/1000, score: 140
episode: 270/1000, score: 112
episode: 280/1000, score: 133
episode: 290/1000, score: 102
episode: 300/1000, score: 113
episode: 310/1000, score: 150
episode: 320/1000, score: 95
episode: 330/1000, score: 124
...

## Analysis of Performance

Let us load the matplotlib library to visualize the results.


In [0]:
import matplotlib.pyplot as plt

In [0]:
scores = np.array(scores_list)

In [0]:
print("Mean Score: {}, Standard Deviation of Scores {}".format(scores.mean(), scores.std()))

In [0]:
plt.hist(scores)
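
A histogram hides any trend across episodes, so a moving average can be a useful companion. The sketch below uses only the standard library and the first few evaluation scores printed above; `moving_average` is a helper name made up for this example, and the window size is an arbitrary choice.

```python
from statistics import mean

def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

# Scores copied from the first evaluation lines printed above.
sample_scores = [117, 120, 106, 120, 159, 142]
smoothed = moving_average(sample_scores, window=3)
print(smoothed)
```

The same helper could be applied to `scores_list` and passed to `plt.plot` to see whether performance drifts over the evaluation run.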