# Mountain Car Continuous - Hill Climbing Practice

This notebook experiments with the hill climbing techniques from the Udacity DRL course for finding an optimal policy.  The code is adapted from the CEM notebook provided in class.  The goal is to solve OpenAI Gym's mountain car environment with continuous control inputs.

In [1]:
import gym
import math
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

!python -m pip install pyvirtualdisplay
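from pyvirtualdisplay import Display
# named virtual_display so that the IPython `display` import below does not shadow it
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()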

is_ipython = 'inline' in plt.get_backend()
if is_ipython:
    from IPython import display

plt.ion()

Collecting pyvirtualdisplay
  Using cached https://files.pythonhosted.org/packages/ad/05/6568620fed440941b704664b9cfe5f836ad699ac7694745e7787fbdc8063/PyVirtualDisplay-2.0-py2.py3-none-any.whl
Collecting EasyProcess (from pyvirtualdisplay)
  Using cached https://files.pythonhosted.org/packages/48/3c/75573613641c90c6d094059ac28adb748560d99bd27ee6f80cce398f404e/EasyProcess-0.3-py2.py3-none-any.whl
Installing collected packages: EasyProcess, pyvirtualdisplay
Successfully installed EasyProcess-0.3 pyvirtualdisplay-2.0


### Create the MountainCarContinuous environment & detect computing platform

In [2]:
env = gym.make('MountainCarContinuous-v0')
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

### Create the agent

In [6]:
class Agent(nn.Module):
    
    # Takes in the game environment so that it can size the NN layers to match states (inputs)
    # and actions (outputs).
    # env:  the game environment (assumes OpenAI Gym API)
    # h_size:  the number of neurons in the hidden layer
    
    def __init__(self, env, h_size=16):
        super(Agent, self).__init__()
        self.env = env
        
        # state, hidden layer, action sizes
        self.s_size = env.observation_space.shape[0]
        self.h_size = h_size
        self.a_size = env.action_space.shape[0]
        
        # define layers
        self.fc1 = nn.Linear(self.s_size, self.h_size)
        self.fc2 = nn.Linear(self.h_size, self.a_size)
        
        
    # Unmarshals and stores the weights & biases in preparation for computation
    # weights:  a list of all the weights & biases in the model; the first part of the list
    #           is all the fc1 weights, then all the fc1 biases, then all the fc2 weights
    #           and finally all the fc2 biases
    
    def set_weights(self, weights):
        s = self.s_size
        h = self.h_size
        a = self.a_size
        
        # separate the weights & biases for each layer
        fc1_end = s*h + h
        fc1_W = torch.from_numpy(weights[:s*h].reshape(s, h))
        fc1_b = torch.from_numpy(weights[s*h:fc1_end])
        fc2_W = torch.from_numpy(weights[fc1_end:fc1_end+(h*a)].reshape(h, a))
        fc2_b = torch.from_numpy(weights[fc1_end+(h*a):])
        
        # set the weights for each layer
        self.fc1.weight.data.copy_(fc1_W.view_as(self.fc1.weight.data))
        self.fc1.bias.data.copy_(  fc1_b.view_as(self.fc1.bias.data))
        
        self.fc2.weight.data.copy_(fc2_W.view_as(self.fc2.weight.data))
        self.fc2.bias.data.copy_(  fc2_b.view_as(self.fc2.bias.data))
    
    
    # Returns the length of the marshalled weights list
    def get_weights_dim(self):
        return (self.s_size+1)*self.h_size + (self.h_size+1)*self.a_size
    
    
    # Performs the forward pass computation of the NN
    # x:  the vector of input data (environment states)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        # torch.tanh replaces the deprecated F.tanh; it keeps the action within [-1, 1]
        x = torch.tanh(self.fc2(x))
        return x.cpu().data
    
    
    # Plays one episode of the game in order to evaluate the return generated by the given policy.
    # The NN embodies the policy definition, therefore its weights encode this policy.
    # weights:  a list of the NN's weights & biases
    # gamma:  the discount factor
    # max_t:  max number of time steps allowed in an episode
    
    def evaluate(self, weights, gamma=1.0, max_t=5000):
        self.set_weights(weights)
        episode_return = 0.0
        state = self.env.reset()
        for t in range(max_t):
            state = torch.from_numpy(state).float().to(device)
            action = self.forward(state)
            state, reward, done, _ = self.env.step(action)
            episode_return += reward * math.pow(gamma, t)
            if done:
                break
        return episode_return
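
The flat weight vector that `set_weights` unpacks can be hard to picture, so here is a minimal NumPy-only sketch of the same slicing. The helper name `split_flat_weights` is hypothetical (it is not part of the notebook), and the sizes assume MountainCarContinuous-v0's 2-element observation and 1-element action with the default `h_size=16`.

```python
import numpy as np

def split_flat_weights(weights, s, h, a):
    """Split a flat parameter vector into (fc1_W, fc1_b, fc2_W, fc2_b) blocks.

    Hypothetical helper that mirrors the slicing order used in set_weights:
    fc1 weights, fc1 biases, fc2 weights, fc2 biases.
    """
    fc1_end = s * h + h
    fc2_w_end = fc1_end + h * a
    fc1_W = weights[:s * h].reshape(s, h)
    fc1_b = weights[s * h:fc1_end]
    fc2_W = weights[fc1_end:fc2_w_end].reshape(h, a)
    fc2_b = weights[fc2_w_end:]
    return fc1_W, fc1_b, fc2_W, fc2_b

# assumed sizes: 2 state values (position, velocity), 16 hidden units, 1 action
s, h, a = 2, 16, 1
dim = (s + 1) * h + (h + 1) * a          # same formula as get_weights_dim
flat = np.arange(dim, dtype=np.float64)  # stand-in for a sampled weight vector
fc1_W, fc1_b, fc2_W, fc2_b = split_flat_weights(flat, s, h, a)
print(dim, fc1_W.shape, fc1_b.shape, fc2_W.shape, fc2_b.shape)
# 65 (2, 16) (16,) (16, 1) (1,)
```

Every element of the flat vector lands in exactly one block, which is why the search code can treat the whole policy as a single point in a 65-dimensional space.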


### Training the model

Each of the hill climbing methods begins with a baseline policy, then samples one or more random perturbations around that baseline.  The difference is in how those samples are used to create a new baseline for the next learning iteration.

The cross-entropy method gathers many samples, then chooses the best few of these (with the highest returns on a single episode) and averages them together to form the new baseline.
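
As a rough illustration of that update, the sketch below averages the top few samples. The function name `cem_update` and the parameter `n_elite` are assumptions made for this example, and the toy samples and returns are made up.

```python
import numpy as np

def cem_update(samples, returns, n_elite):
    """Return the plain average of the n_elite samples with the highest returns."""
    samples = np.asarray(samples, dtype=np.float64)
    returns = np.asarray(returns, dtype=np.float64)
    elite_idx = np.argsort(returns)[-n_elite:]  # indices of the best n_elite returns
    return samples[elite_idx].mean(axis=0)

# four toy 2-parameter samples and their (made-up) episode returns
samples = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [4.0, 2.0]])
returns = np.array([-90.0, -50.0, -10.0, -20.0])
new_baseline = cem_update(samples, returns, n_elite=2)
print(new_baseline)  # [3. 3.], the mean of the two best samples
```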

The evolution method uses several samples (possibly fewer than for cross-entropy), then uses a weighted average of all of them to form the new baseline, the weighting being proportional to the return from each sample.  Thus, the samples "higher up the hill" have more influence in moving the baseline.
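
A minimal sketch of that weighted average follows. Because mountain car returns are often negative, weights directly proportional to the raw returns would not be valid, so this example shifts the returns so the worst one is zero before normalizing. That shift, the uniform fallback, and the name `evolution_update` are assumptions of the sketch, not something the course code prescribes.

```python
import numpy as np

def evolution_update(samples, returns):
    """Return a return-weighted average of all samples."""
    samples = np.asarray(samples, dtype=np.float64)
    returns = np.asarray(returns, dtype=np.float64)
    shifted = returns - returns.min()  # assumption: make all weights non-negative
    total = shifted.sum()
    if total == 0.0:
        # all returns equal: fall back to a plain average
        weights = np.full(len(returns), 1.0 / len(returns))
    else:
        weights = shifted / total
    return weights @ samples

samples = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
returns = np.array([-30.0, -20.0, -10.0])
new_baseline = evolution_update(samples, returns)
print(new_baseline)  # weights are [0, 1/3, 2/3]
```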

In my view, cross-entropy is a generalized improvement over the evolution method: by throwing away the lower-performing samples, it effectively assigns them a weight of zero.  The downside is that by averaging the highest-performing samples equally, it ignores the differences in their returns, possibly watering down the potential gradient ascent.  Therefore, the approach below is a hybrid of the two:  throw out any samples whose return is no higher than that achieved by the baseline, then use a weighted average of the remaining samples, if any.
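
One way this hybrid could look in code is sketched below. Weighting each surviving sample by its improvement over the baseline return is an assumption of this example (the text only says "weighted average"), and `hybrid_update` is a hypothetical name.

```python
import numpy as np

def hybrid_update(baseline, baseline_return, samples, returns):
    """Keep only samples that beat the baseline, then average them by improvement.

    Returns (new_baseline, improved). The baseline is returned unchanged
    when no sample beats it.
    """
    samples = np.asarray(samples, dtype=np.float64)
    returns = np.asarray(returns, dtype=np.float64)
    better = returns > baseline_return
    if not better.any():
        return baseline, False
    gains = returns[better] - baseline_return  # strictly positive by construction
    weights = gains / gains.sum()
    return weights @ samples[better], True

baseline = np.array([0.0, 0.0])
samples = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
returns = np.array([-40.0, -20.0, -90.0])
new_baseline, improved = hybrid_update(baseline, -50.0, samples, returns)
print(new_baseline, improved)  # [0.25 0.75] True
```

The third sample is discarded because its return is below the baseline's, and the second sample pulls the baseline three times as hard as the first because its gain is three times larger.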

In addition to this more aggressive hill climbing, the code uses two more techniques to help avoid getting stuck at a local maximum:
1. Using multiple random starting points within the state space
2. Adaptive noise scaling, which changes the size of the sample perturbations based on the recent level of success
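
Adaptive noise scaling in its simplest form can be sketched as follows: shrink the perturbation size after an improvement and grow it after a failure, within fixed bounds. The function name `adapt_sigma` and every constant here are illustrative assumptions, not values taken from the notebook.

```python
def adapt_sigma(sigma, improved, shrink=0.5, grow=2.0, sigma_min=1e-3, sigma_max=2.0):
    """Return the next noise level given whether the last epoch improved."""
    if improved:
        # success: search closer to the current baseline
        return max(sigma_min, sigma * shrink)
    # failure: widen the search to escape a possible local maximum
    return min(sigma_max, sigma * grow)

sigma = 0.5
sigmas = []
for improved in [True, False, False, False, False]:
    sigma = adapt_sigma(sigma, improved)
    sigmas.append(sigma)
print(sigmas)  # [0.25, 0.5, 1.0, 2.0, 2.0]
```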

In [9]:
# Train the model
# agent:       the agent model to be trained
# max_epochs:  max number of epochs to train
# winning_score: the game score above which the agent is considered to be adequately trained
# gamma:       time step discount factor
# print_every: number of epochs between status reporting
# num_samples: number of randomly perturbed samples to be evaluated around the baseline
# init_sigma:  the initial standard deviation of the sample perturbations
# Return:  list of scores of the baseline model from each epoch

def train(agent, max_epochs=1000, winning_score=100.0, gamma=1.0, print_every=10, 
          num_samples=10, init_sigma=0.5):

    # store the most recent 100 epoch returns
    scores_deque = deque(maxlen=100)
    scores = []
    
    # get initial random weights & biases for the model and loop on epochs
    sigma = init_sigma
    min_sigma = init_sigma
    weight_size = agent.get_weights_dim()
    baseline = sigma*np.random.randn(weight_size)
    for epoch in range(max_epochs):
        
        # evaluate the baseline and store its return
        reward = agent.evaluate(baseline, gamma=gamma)
        scores_deque.append(reward)
        scores.append(reward)
        print("Baseline reward = ", reward)
        
        # print status
        if epoch % print_every == 0:
            print('Episode {}\tAverage Score: {:.2f}'.format(epoch, np.mean(scores_deque)))

        # if average scores are high enough, end the training
        if np.mean(scores_deque) >= winning_score:
            print('\nEnvironment solved in {:d} epochs!\tAvg Score: {:.2f}'.format(epoch, 
                                                                                   np.mean(scores_deque)))
            break

        # generate random samples around the baseline
        samples = [baseline + sigma*np.random.randn(weight_size) for i in range(num_samples)]
        
        # evaluate each sample and store its return if it is better than the baseline return
        sample_rewards = []
        sample_weights = []
        for s in range(num_samples):
            r = agent.evaluate(samples[s], gamma=gamma)
            print("Sample {} reward = {:.3f}".format(s, r))
            if r > reward:
                sample_rewards.append(r)
                sample_weights.append(samples[s])
        
        # if at least one sample performed better than the baseline then
        if len(sample_rewards) > 0:
            print("Found {} samples better than the baseline".format(len(sample_rewards)))
        
            # get the weighted average of all better samples and consider this the new baseline
            
            # reduce the noise magnitude and store it as the new minimum noise
            
        # else (no samples performed as well as the baseline; we may have found the top of the hill)
        else:
            print("No better samples found.")
        
            # if we are still in adaptive noise phase then
        
                # increment the counter of no improvement
            
                # if we haven't seen improvement for several epochs then
                
                    # set noise to min used thus far and indicate transition to fine tuning phase
                    # since it appears we are close to the global maximum

                # increase the noise magnitude
                
            # else (in fine tuning phase)
            
                # reduce the noise
                
                # increment the final phase counter and terminate after enough with no improvement

    # return the per-epoch baseline scores, as promised in the function header
    return scores
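
The commented outline at the end of `train` describes a two-phase noise schedule that is not implemented yet. The sketch below is one possible reading of those comments, kept separate from the training loop so it can be tried on its own. The class name `NoiseSchedule`, the parameters `shrink`, `grow`, `patience`, and `final_patience`, their default values, and the choice to reset the counters after an improvement are all assumptions.

```python
class NoiseSchedule:
    """Two-phase noise schedule: adaptive search first, then fine tuning."""

    def __init__(self, init_sigma=0.5, shrink=0.9, grow=1.5, patience=5, final_patience=10):
        self.sigma = init_sigma
        self.min_sigma = init_sigma
        self.shrink = shrink
        self.grow = grow
        self.patience = patience              # failures tolerated before fine tuning
        self.final_patience = final_patience  # failures tolerated during fine tuning
        self.fine_tuning = False
        self.no_improvement = 0
        self.final_count = 0

    def step(self, improved):
        """Update sigma after one epoch; return True when training should stop."""
        if improved:
            # reduce the noise and remember it as the new minimum
            self.sigma *= self.shrink
            self.min_sigma = min(self.min_sigma, self.sigma)
            self.no_improvement = 0
            self.final_count = 0
            return False
        if not self.fine_tuning:
            self.no_improvement += 1
            if self.no_improvement >= self.patience:
                # likely near a maximum: drop to the smallest noise used so far
                self.sigma = self.min_sigma
                self.fine_tuning = True
            else:
                self.sigma *= self.grow
            return False
        # fine tuning phase: keep shrinking and count failures toward termination
        self.sigma *= self.shrink
        self.final_count += 1
        return self.final_count >= self.final_patience

sched = NoiseSchedule(init_sigma=1.0, shrink=0.5, grow=2.0, patience=2, final_patience=2)
history = []
for improved in [True, False, False, False, False]:
    stop = sched.step(improved)
    history.append((sched.sigma, sched.fine_tuning, stop))
print(history)
```

Inside `train`, `sched.step(len(sample_rewards) > 0)` could be called once per epoch, with `sched.sigma` used when drawing the next batch of samples and the loop ending when `step` returns `True`.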
        

In [10]:
env.seed(101)
np.random.seed(101)
agent = Agent(env).to(device)

train(agent)

Baseline reward =  -53.72886239719742
Episode 0	Average Score: -53.73
Sample 0 reward = -99.246
Sample 1 reward = -41.204
Sample 2 reward = -29.902
Sample 3 reward = -83.281
Sample 4 reward = -93.246
Sample 5 reward = -96.460
Sample 6 reward = -7.604
Sample 7 reward = -99.011
Sample 8 reward = -73.191
Sample 9 reward = -18.530
Found 4 samples better than the baseline
Baseline reward =  -53.68761720014443
Sample 0 reward = -85.806
Sample 1 reward = -75.100
Sample 2 reward = -98.815
Sample 3 reward = -74.443
Sample 4 reward = -90.806
Sample 5 reward = -4.071
Sample 6 reward = -4.727
Sample 7 reward = -11.394
Sample 8 reward = -98.387
Sample 9 reward = -99.662
Found 3 samples better than the baseline
Baseline reward =  -53.796730922058764
Sample 0 reward = -27.778
Sample 1 reward = -70.008
Sample 2 reward = -87.477
Sample 3 reward = -91.850
Sample 4 reward = -94.230
Sample 5 reward = -43.049
Sample 6 reward = -96.744
Sample 7 reward = -87.006
Sample 8 reward = -83.737
Sample 9 reward = -7

KeyboardInterrupt: 