# Mountain Car Continuous - Hill Climbing Practice

This notebook experiments with hill-climbing techniques from the Udacity DRL course for finding an optimal policy.  The code is adapted from the CEM notebook provided in class.  The goal is to solve OpenAI Gym's mountain car environment with continuous control inputs.

In [85]:
import gym
import math
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

!python -m pip install pyvirtualdisplay
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

is_ipython = 'inline' in plt.get_backend()
if is_ipython:
    from IPython import display

plt.ion()

Collecting pyvirtualdisplay
  Using cached https://files.pythonhosted.org/packages/ad/05/6568620fed440941b704664b9cfe5f836ad699ac7694745e7787fbdc8063/PyVirtualDisplay-2.0-py2.py3-none-any.whl
Collecting EasyProcess (from pyvirtualdisplay)
  Using cached https://files.pythonhosted.org/packages/48/3c/75573613641c90c6d094059ac28adb748560d99bd27ee6f80cce398f404e/EasyProcess-0.3-py2.py3-none-any.whl
Installing collected packages: EasyProcess, pyvirtualdisplay
Successfully installed EasyProcess-0.3 pyvirtualdisplay-2.0


### Create the MountainCarContinuous environment & detect computing platform

In [86]:
env = gym.make('MountainCarContinuous-v0')
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

### Create the agent

In [112]:
class Agent(nn.Module):
    
    # Takes in the game environment so that it can size the NN layers to match states (inputs)
    # and actions (outputs).
    # env:  the game environment (assumes OpenAI Gym API)
    # h_size:  the number of neurons in the hidden layer
    
    def __init__(self, env, h_size=16):
        super(Agent, self).__init__()
        self.env = env
        
        # state, hidden layer, action sizes
        self.s_size = env.observation_space.shape[0]
        self.h_size = h_size
        self.a_size = env.action_space.shape[0]
        
        # define layers
        self.fc1 = nn.Linear(self.s_size, self.h_size)
        self.fc2 = nn.Linear(self.h_size, self.a_size)
        
        
    # Unmarshals and stores the weights & biases in preparation for computation
    # weights:  a 1D tensor of all the weights & biases in the model; the first part of the list
    #           is all the fc1 weights, then all the fc1 biases, then all the fc2 weights
    #           and finally all the fc2 biases
    
    def set_weights(self, weights):
        s = self.s_size
        h = self.h_size
        a = self.a_size
        
        # separate the weights & biases for each layer
        fc1_end = (s+1)*h
        fc1_w = weights[: s*h].reshape(s, h)
        fc1_b = weights[s*h : fc1_end]
        fc2_w = weights[fc1_end : fc1_end + h*a].reshape(h, a)
        fc2_b = weights[fc1_end + h*a :]
        
        # set the weights for each layer
        self.fc1.weight.data.copy_(fc1_w.view_as(self.fc1.weight.data))
        self.fc1.bias.data.copy_(  fc1_b.view_as(self.fc1.bias.data))
        
        self.fc2.weight.data.copy_(fc2_w.view_as(self.fc2.weight.data))
        self.fc2.bias.data.copy_(  fc2_b.view_as(self.fc2.bias.data))
    
    
    # Returns the length of the marshalled weights list
    def get_weights_dim(self):
        return (self.s_size+1)*self.h_size + (self.h_size+1)*self.a_size
    
    
    # Performs the forward pass computation of the NN
    # x:  the vector of input data (environment states)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = torch.tanh(self.fc2(x))  # F.tanh is deprecated in favor of torch.tanh
        return x.cpu().data
    
    
    # Plays one episode of the game in order to evaluate the return generated by the given policy.
    # The NN embodies the policy definition, therefore its weights encode this policy.
    # weights:  a 1D tensor of the NN's weights & biases
    # gamma:  the discount factor
    # max_t:  max number of time steps allowed in an episode
    
    def evaluate(self, weights, gamma=1.0, max_t=5000):
        self.set_weights(weights)
        episode_return = 0.0
        state = self.env.reset()
        for t in range(max_t):
            state = torch.from_numpy(state).float().to(device)
            action = self.forward(state)
            state, reward, done, _ = self.env.step(action)
            episode_return += reward * math.pow(gamma, t)
            if done:
                break
        return episode_return
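
To make the flat weight layout described in `set_weights` concrete, here is a minimal NumPy sketch of the same slicing. The sizes `s=2`, `h=16`, `a=1` are assumptions matching MountainCarContinuous-v0 (two state values, one action) and the default hidden size; `split_weights` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def split_weights(flat, s, h, a):
    """Split a flat parameter vector into (fc1_w, fc1_b, fc2_w, fc2_b)."""
    fc1_end = (s + 1) * h
    # nn.Linear stores its weight as (out_features, in_features)
    fc1_w = flat[: s * h].reshape(h, s)
    fc1_b = flat[s * h : fc1_end]
    fc2_w = flat[fc1_end : fc1_end + h * a].reshape(a, h)
    fc2_b = flat[fc1_end + h * a :]
    return fc1_w, fc1_b, fc2_w, fc2_b

s, h, a = 2, 16, 1                      # assumed sizes (see lead-in)
dim = (s + 1) * h + (h + 1) * a         # same formula as get_weights_dim()
flat = np.arange(dim, dtype=np.float32)
fc1_w, fc1_b, fc2_w, fc2_b = split_weights(flat, s, h, a)
print(dim, fc1_w.shape, fc1_b.shape, fc2_w.shape, fc2_b.shape)
```

Under these assumed sizes the whole policy is a single vector of 65 numbers, which is what the training loop perturbs.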


### Training the model

Each of the hill climbing methods begins with a baseline policy, then samples one or more random perturbations around that baseline.  The difference is in how those samples are used to create a new baseline for the next learning iteration.
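
The sampling step common to all of these methods can be sketched as follows. This is an illustration only: the vector length, sample count and seed are assumed values, and NumPy's `default_rng` is used here instead of the notebook's `np.random.randn`.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for repeatability
weight_size = 65                 # assumed length of the flat weight vector
num_samples = 10
sigma = 0.5                      # standard deviation of the perturbations

# Baseline policy, then Gaussian perturbations centred on it
baseline = sigma * rng.standard_normal(weight_size)
noise = sigma * rng.standard_normal((num_samples, weight_size))
samples = baseline + noise       # broadcasting adds the baseline to every row

print(samples.shape)
```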

The cross-entropy method gathers many samples, then chooses the best few of these (with the highest returns on a single episode) and averages them together to form the new baseline.
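
A minimal sketch of that selection step, assuming the samples and their single-episode returns have already been collected. `cem_update` and `n_elite` are hypothetical names used only for this illustration.

```python
import numpy as np

def cem_update(samples, returns, n_elite):
    """Average the n_elite samples with the highest returns."""
    returns = np.asarray(returns, dtype=float)
    elite_idx = np.argsort(returns)[-n_elite:]   # indices of the best few
    return samples[elite_idx].mean(axis=0)

# Toy data: four 2-D "weight vectors" and their returns
samples = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [3.0, 9.0]])
returns = [-5.0, 10.0, 30.0, 20.0]
new_baseline = cem_update(samples, returns, n_elite=2)
print(new_baseline)
```

Here the two elite samples are the ones with returns 30 and 20, and each contributes equally to the new baseline regardless of the gap between them.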

The evolution method uses several samples (possibly fewer than for cross-entropy), then uses a weighted average of all of them to form the new baseline, the weighting being proportional to the return from each sample.  Thus, the samples "higher up the hill" have more influence in moving the baseline.
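
The return-weighted average can be sketched like this. The sketch assumes all returns are positive so that the weights are positive and sum to one; `weighted_update` is a hypothetical helper name.

```python
import numpy as np

def weighted_update(samples, returns):
    """Average all samples, weighting each in proportion to its return."""
    returns = np.asarray(returns, dtype=float)
    weights = returns / returns.sum()   # assumes positive returns
    return weights @ samples

samples = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 8.0]])
returns = [1.0, 1.0, 2.0]               # third sample gets half the weight
new_baseline = weighted_update(samples, returns)
print(new_baseline)
```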

In my view, cross-entropy is a generalized improvement over the evolution method:  by throwing away the lower-performing samples it effectively assigns them a weight of zero.  The downside is that, by averaging the highest-performing samples equally, it ignores the differences in their returns, possibly watering down the potential gradient ascent.  Therefore, the approach below is a hybrid of the two:  throw out any samples whose return does not exceed that achieved by the baseline, then use a return-weighted average of the remaining samples, if any.
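
The hybrid rule just described can be sketched in NumPy as below. `hybrid_update` is a hypothetical helper written for illustration, and the toy returns are positive by assumption.

```python
import numpy as np

def hybrid_update(baseline, baseline_return, samples, returns):
    """Keep samples that beat the baseline, then return-weight their average."""
    returns = np.asarray(returns, dtype=float)
    better = returns > baseline_return
    if not better.any():
        return baseline                      # nothing better: keep the baseline
    r = returns[better]
    return (r[:, None] * samples[better]).sum(axis=0) / r.sum()

baseline = np.array([0.0, 0.0])
samples = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 0.0]])
returns = [3.0, 10.0, 30.0]                  # only the last two beat 5.0
new_baseline = hybrid_update(baseline, 5.0, samples, returns)
unchanged = hybrid_update(baseline, 50.0, samples, returns)
print(new_baseline, unchanged)
```

One thing worth checking when applying this to an environment with negative returns: `r / r.sum()` then gives the largest weight to the most negative surviving return, so shifting the returns (for example by subtracting the baseline return) may be worth trying.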

In addition to this more aggressive hill climbing, the code uses two more techniques to help avoid getting stuck at a local maximum.
1. Using multiple random starting points within the state space
2. Adaptive noise scaling, which changes the size of the sample perturbations based on the recent level of success
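
The adaptive noise idea in item 2 can be sketched in a few lines: shrink the noise slightly after a successful epoch, and widen it after an unsuccessful one. `next_sigma` is a hypothetical helper, and the multipliers are assumed values taken from the constants in the training cell.

```python
import math

def next_sigma(sigma, improved, shrink=0.995, grow=1.5):
    """Shrink the noise after a success, widen it after a failure."""
    return sigma * (shrink if improved else grow)

sigma = 0.5
history = [sigma]
for improved in [True, False, False, True]:   # made-up sequence of outcomes
    sigma = next_sigma(sigma, improved)
    history.append(sigma)

print(history)
```

The asymmetry (a small shrink, a large growth) means a couple of failed epochs quickly widen the search, which is what lets the baseline jump away from a local maximum.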

In [125]:
# Train the model
# agent:       the agent model to be trained
# max_epochs:  max number of epochs to train
# winning_score: the game score above which the agent is considered to be adequately trained
# gamma:       time step discount factor
# print_every: number of epochs between status reporting
# num_samples: number of randomly perturbed samples to be evaluated around the baseline
# init_sigma:  the initial standard deviation of the sample perturbations
# Return:  list of scores of the baseline model from each epoch

def train(agent, max_epochs=1000, winning_score=100.0, gamma=1.0, print_every=10, 
          num_samples=10, init_sigma=0.5):
    
    SIGMA_REDUCTION = 0.995            # noise multiplier if better samples are found
    SIGMA_INCREASE  = 1.5              # noise multiplier if no better samples are found
    MIN_USABLE_SIGMA = 0.001           # lowest acceptable noise value (triggers end of training)
    MAX_EPOCHS_WITHOUT_IMPROVEMENT = 9 # num epochs we will continue to try to find an improvement
    
    adaptive_noise = True
    epochs_without_improvement = 0

    # store the most recent 100 epoch returns
    scores_deque = deque(maxlen=100)
    scores = []
    
    # get initial random weights & biases for the model and loop on epochs
    sigma = init_sigma
    min_sigma = init_sigma
    weight_size = agent.get_weights_dim()
    baseline = torch.from_numpy(sigma*np.random.randn(weight_size)) #float
    for epoch in range(max_epochs):
        
        # evaluate the baseline and store its return
        reward = agent.evaluate(baseline, gamma=gamma)
        scores_deque.append(reward)
        scores.append(reward)
        
        # print status
        if epoch % print_every == 0:
            print('Episode {}\tAverage Score = {:.2f}, sigma = {:.3f}.   '
                  .format(epoch, np.mean(scores_deque), sigma), end='')

        # if average scores are high enough, end the training
        if np.mean(scores_deque) >= winning_score:
            print('\nEnvironment solved in {:d} epochs!\tAvg Score: {:.2f}'.format(epoch, 
                                                                                   np.mean(scores_deque)))
            break

        # generate random samples around the baseline
        # (np.random.randn returns float64, so both terms are cast to float32 to match `samples`)
        samples = torch.FloatTensor(num_samples, weight_size)
        for i in range(num_samples):
            samples[i] = baseline.float() + torch.from_numpy(sigma*np.random.randn(weight_size)).float()
        
        # evaluate each sample and store its return if it is better than the baseline return
        sample_rewards = []
        sample_weights = []
        for s in range(num_samples):
            r = agent.evaluate(samples[s], gamma=gamma)
            if r > reward:
                sample_rewards.append(r)
                sample_weights.append(samples[s])
        
        # if at least one sample performed better than the baseline then
        num_elite_samples = len(sample_rewards)
        if num_elite_samples > 0:
            print("{:3} better samples".format(num_elite_samples))
            
            # reset the not-found counter
            epochs_without_improvement = 0
        
            # get the weighted average of all better samples and consider this the new baseline
            r = torch.FloatTensor(sample_rewards).resize_(num_elite_samples, 1)
            w = torch.zeros(num_elite_samples, weight_size).float()
            for s in range(num_elite_samples):
                w[s][:] = sample_weights[s]
            baseline = (r*w).sum(dim=0) / r.sum(dim=0)
            
            # reduce the noise magnitude and store it as the new minimum noise
            sigma *= SIGMA_REDUCTION
            if sigma < min_sigma:
                min_sigma = sigma
            
        # else (no samples performed better than the baseline; we may have found the top of the hill)
        else:
            print("*** No better samples found.")
        
            # increment the counter of no improvement
            epochs_without_improvement += 1
            
            # if we are still in adaptive noise phase then
            if adaptive_noise:
        
                # if we haven't seen improvement for several epochs then
                if epochs_without_improvement > MAX_EPOCHS_WITHOUT_IMPROVEMENT:
                
                    # set noise to min used thus far and indicate transition to fine tuning phase
                    # since it appears we are close to the global maximum
                    sigma = min_sigma
                    adaptive_noise = False
                    print("Turning off adaptive noise: sigma = ", sigma)
                    
                    # reset the counter so it can be used for the fine tuning phase
                    epochs_without_improvement = 0
                    
                # else (still adapting the noise level)
                else:

                    # increase the noise magnitude
                    sigma *= SIGMA_INCREASE
                
            # else (in fine tuning phase, near a peak)
            else:
            
                # reduce the noise
                sigma *= SIGMA_REDUCTION
                
                # if we've hit the smallest noise we care to deal with then terminate
                if sigma < MIN_USABLE_SIGMA:
                    print("Fine tuning is at minimum noise. Terminating search.")
                    break
                
                # increment the final phase counter and terminate after enough with no improvement
                if epochs_without_improvement > MAX_EPOCHS_WITHOUT_IMPROVEMENT:
                    print("{} epochs of fine tuning without improvement. Terminating search.".
                         format(epochs_without_improvement))
                    break
        
    return scores
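
As a rough check on how the constants listed in `train()` interact, the sketch below computes how many noise reductions it would take for sigma to fall below `MIN_USABLE_SIGMA`. It assumes the default `init_sigma=0.5` and the idealized case where sigma only ever shrinks, which is a simplification of the real loop.

```python
import math

init_sigma = 0.5
SIGMA_REDUCTION = 0.995
MIN_USABLE_SIGMA = 0.001

# Smallest n with init_sigma * SIGMA_REDUCTION**n below MIN_USABLE_SIGMA
n = math.ceil(math.log(MIN_USABLE_SIGMA / init_sigma) / math.log(SIGMA_REDUCTION))

# Sigma after 1000 reductions, i.e. one per epoch at the default max_epochs
sigma_after_1000 = init_sigma * SIGMA_REDUCTION ** 1000

print(n, sigma_after_1000)
```

Since each epoch applies at most one reduction, these particular constants suggest the minimum-noise termination would need more than the default 1000 epochs to trigger.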

In [126]:
seed = 101
env.seed(seed)
np.random.seed(seed)
agent = Agent(env).to(device)

scores = train(agent, init_sigma=0.5, num_samples=50, print_every=1, winning_score=90.0)

Episode 0	Average Score = -98.53, sigma = 1.000.    16 better samples
Episode 1	Average Score = -89.56, sigma = 0.990.     5 better samples
Episode 2	Average Score = -93.01, sigma = 0.980.    12 better samples
Episode 3	Average Score = -93.89, sigma = 0.970.     2 better samples
Episode 4	Average Score = -95.09, sigma = 0.961.    20 better samples
Episode 5	Average Score = -95.89, sigma = 0.951.    10 better samples
Episode 6	Average Score = -96.46, sigma = 0.941.     7 better samples
Episode 7	Average Score = -96.88, sigma = 0.932.     7 better samples
Episode 8	Average Score = -97.13, sigma = 0.923.     5 better samples
Episode 9	Average Score = -97.10, sigma = 0.914.     6 better samples
Episode 10	Average Score = -97.35, sigma = 0.904.    15 better samples
Episode 11	Average Score = -97.57, sigma = 0.895.    13 better samples
Episode 12	Average Score = -87.67, sigma = 0.886.     2 better samples
Episode 13	Average Score = -80.20, sigma = 0.878.     2 better samples
...

Episode 112	Average Score = -13.86, sigma = 0.345.    29 better samples
Episode 113	Average Score = -14.04, sigma = 0.341.   *** No better samples found.
Episode 114	Average Score = -14.86, sigma = 0.338.    15 better samples
Episode 115	Average Score = -15.64, sigma = 0.334.   *** No better samples found.
Episode 116	Average Score = -16.54, sigma = 0.331.     1 better samples
Episode 117	Average Score = -17.45, sigma = 0.328.     9 better samples
Episode 118	Average Score = -18.29, sigma = 0.324.    21 better samples
Episode 119	Average Score = -19.18, sigma = 0.321.     1 better samples
Episode 120	Average Score = -20.05, sigma = 0.318.    22 better samples
Episode 121	Average Score = -19.10, sigma = 0.315.    10 better samples
Episode 122	Average Score = -18.13, sigma = 0.312.     7 better samples
Episode 123	Average Score = -17.14, sigma = 0.309.     1 better samples
Episode 124	Average Score = -16.16, sigma = 0.305.     5 better samples
Episode 125	Average Score = -15.17, sigma = 

Episode 226	Average Score = -4.03, sigma = 0.110.    42 better samples
Episode 227	Average Score = -4.00, sigma = 0.108.   *** No better samples found.
Episode 228	Average Score = -4.04, sigma = 0.107.    35 better samples
Episode 229	Average Score = -4.02, sigma = 0.106.     1 better samples
Episode 230	Average Score = -3.98, sigma = 0.105.    22 better samples
Episode 231	Average Score = -3.97, sigma = 0.104.    29 better samples
Episode 232	Average Score = -3.98, sigma = 0.103.    26 better samples
Episode 233	Average Score = -4.05, sigma = 0.102.    42 better samples
Episode 234	Average Score = -4.04, sigma = 0.101.    10 better samples
Episode 235	Average Score = -4.05, sigma = 0.100.     4 better samples
Episode 236	Average Score = -4.09, sigma = 0.099.    34 better samples
Episode 237	Average Score = -4.09, sigma = 0.098.    18 better samples
Episode 238	Average Score = -4.11, sigma = 0.097.    21 better samples
Episode 239	Average Score = -4.13, sigma = 0.096.    18 better samples

Episode 341	Average Score = -5.31, sigma = 0.034.    14 better samples
Episode 342	Average Score = -5.39, sigma = 0.034.    44 better samples
Episode 343	Average Score = -5.46, sigma = 0.034.    35 better samples
Episode 344	Average Score = -5.47, sigma = 0.033.    12 better samples
Episode 345	Average Score = -5.51, sigma = 0.033.    28 better samples
Episode 346	Average Score = -5.53, sigma = 0.033.    26 better samples
Episode 347	Average Score = -5.61, sigma = 0.032.    46 better samples
Episode 348	Average Score = -5.63, sigma = 0.032.    25 better samples
Episode 349	Average Score = -5.66, sigma = 0.032.    13 better samples
Episode 350	Average Score = -5.76, sigma = 0.032.    41 better samples
Episode 351	Average Score = -5.75, sigma = 0.031.    20 better samples
Episode 352	Average Score = -5.72, sigma = 0.031.    29 better samples
Episode 353	Average Score = -5.66, sigma = 0.031.    24 better samples
Episode 354	Average Score = -5.65, sigma = 0.030.     5 better samples
...

Episode 456	Average Score = -5.56, sigma = 0.011.    17 better samples
Episode 457	Average Score = -5.59, sigma = 0.011.    45 better samples
Episode 458	Average Score = -5.65, sigma = 0.011.    44 better samples
Episode 459	Average Score = -5.66, sigma = 0.011.    16 better samples
Episode 460	Average Score = -5.65, sigma = 0.010.    16 better samples
Episode 461	Average Score = -5.63, sigma = 0.010.    37 better samples
Episode 462	Average Score = -5.61, sigma = 0.010.    37 better samples
Episode 463	Average Score = -5.59, sigma = 0.010.    41 better samples
Episode 464	Average Score = -5.59, sigma = 0.010.    32 better samples
Episode 465	Average Score = -5.65, sigma = 0.010.    50 better samples
Episode 466	Average Score = -5.61, sigma = 0.010.     2 better samples
Episode 467	Average Score = -5.62, sigma = 0.010.    31 better samples
Episode 468	Average Score = -5.58, sigma = 0.010.     3 better samples
Episode 469	Average Score = -5.53, sigma = 0.010.    23 better samples
...

Episode 571	Average Score = -5.21, sigma = 0.003.    34 better samples
Episode 572	Average Score = -5.27, sigma = 0.003.    31 better samples
Episode 573	Average Score = -5.30, sigma = 0.003.    37 better samples
Episode 574	Average Score = -5.27, sigma = 0.003.    12 better samples
Episode 575	Average Score = -5.28, sigma = 0.003.    41 better samples
Episode 576	Average Score = -5.26, sigma = 0.003.    43 better samples
Episode 577	Average Score = -5.25, sigma = 0.003.    14 better samples
Episode 578	Average Score = -5.29, sigma = 0.003.    42 better samples
Episode 579	Average Score = -5.24, sigma = 0.003.     1 better samples
Episode 580	Average Score = -5.22, sigma = 0.003.     9 better samples
Episode 581	Average Score = -5.16, sigma = 0.003.     7 better samples
Episode 582	Average Score = -5.17, sigma = 0.003.    33 better samples
Episode 583	Average Score = -5.25, sigma = 0.003.    41 better samples
Episode 584	Average Score = -5.24, sigma = 0.003.     2 better samples
...

Episode 686	Average Score = -5.34, sigma = 0.001.    42 better samples
Episode 687	Average Score = -5.41, sigma = 0.001.    39 better samples
Episode 688	Average Score = -5.37, sigma = 0.001.    37 better samples
Episode 689	Average Score = -5.41, sigma = 0.001.    22 better samples
Episode 690	Average Score = -5.45, sigma = 0.001.    46 better samples
Episode 691	Average Score = -5.47, sigma = 0.001.    12 better samples
Episode 692	Average Score = -5.43, sigma = 0.001.    34 better samples
Episode 693	Average Score = -5.45, sigma = 0.001.    39 better samples
Episode 694	Average Score = -5.39, sigma = 0.001.     1 better samples
Episode 695	Average Score = -5.40, sigma = 0.001.    18 better samples
Episode 696	Average Score = -5.50, sigma = 0.001.    43 better samples
Episode 697	Average Score = -5.46, sigma = 0.001.    19 better samples
Episode 698	Average Score = -5.42, sigma = 0.001.     1 better samples
Episode 699	Average Score = -5.39, sigma = 0.001.    36 better samples
...

[-98.52519241092766,
 -80.5967546255252,
 -99.8999999999986,
 -96.54051618297267,
 -99.89993443498258,
 -99.88538371218559,
 -99.86551326034002,
 -99.84704243614857,
 -99.15185484365989,
 -96.77669018436401,
 -99.89944496857798,
 -99.89766136254174,
 31.079676410668668,
 16.862489862435027,
 75.9431636982766,
 77.72132110502855,
 88.98233026182473,
 87.85679361528155,
 76.42584428652646,
 88.61587061531996,
 78.02670393165982,
 -99.14364876397339,
 -99.8999999999986,
 -99.8999999999986,
 -99.8999999999986,
 -99.8999728918251,
 -99.8999999999986,
 -99.89997285605588,
 -97.36662489303578,
 -99.89999948739877,
 -99.8992625158487,
 -95.29681621600827,
 -99.8999999999986,
 -99.71355964705971,
 -90.30708989417792,
 -92.8544359860491,
 -83.58102914174768,
 -35.4987534721847,
 -7.048283864498838,
 -12.40252428948995,
 -3.058933488252052,
 -5.474946706785512,
 -5.506555522839676,
 -0.8071996871067222,
 -4.511508583841994,
 -1.949712945657982,
 -2.7770888179000695,
 -3.505334068472989,
 -7.31915