# Using Reinforcement Learning to Control the Rocket Hopper Demonstrator


# Reinforcement Learning Algorithm - DDPG


# Discussion and Technical Details

## Physical Simulation

### Description

The Rocket Hopper Demonstrator has a single thruster with no gimballing, so the only control parameter is the engine throttle. The thrust produced is a function of the mass flow admitted to the nozzle through the main valve. The agent therefore controls how far the valve is open.

At the **start** of the simulation, the demonstrator is at an altitude of *0 m* above the ground with an initial velocity of *0 m/s*. The objective is for the Rocket Hopper to launch, hold a specified altitude for a set number of seconds, and then land smoothly.
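
The scenario above can be sketched as a one-dimensional point-mass model. This is a minimal sketch, not the demonstrator's actual simulation: `hopper_step`, the constant `mass`, the `max_thrust` value and the linear relation between valve opening and thrust are all assumptions made for illustration.

```python
G0 = 9.81  # gravitational acceleration [m/s^2]


def hopper_step(altitude, velocity, valve_opening, dt=0.01,
                mass=3.0, max_thrust=60.0):
    """Advance a 1-D hopper by one timestep (semi-implicit Euler).

    mass and max_thrust are hypothetical placeholder values, and thrust is
    assumed to scale linearly with valve_opening in [0, 1].
    """
    valve_opening = min(max(valve_opening, 0.0), 1.0)
    thrust = valve_opening * max_thrust
    acceleration = thrust / mass - G0
    velocity = velocity + acceleration * dt
    altitude = altitude + velocity * dt
    if altitude <= 0.0:
        # Ground contact: the vehicle cannot sink below the pad.
        altitude = 0.0
        velocity = max(velocity, 0.0)
    return altitude, velocity, acceleration


# Usage: full throttle for one simulated second, starting at rest on the ground.
h, v = 0.0, 0.0
for _ in range(100):
    h, v, a = hopper_step(h, v, valve_opening=1.0)
print(f"altitude = {h:.2f} m, velocity = {v:.2f} m/s")
```

In this model the hover condition is simply the valve opening at which thrust equals weight, i.e. `mass * G0 / max_thrust`.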

### Limitations

- The atmosphere is neglected for the control of the agent. Even though the test campaign will occur within an enclosed building with no wind or gusts, there is still an aerodynamic force to be considered when 

- The valve is assumed to open and close instantaneously, although in reality it has a characteristic opening time and a characteristic closing time.
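
To illustrate what the instantaneous-valve assumption leaves out, a finite actuation time could be approximated with a first-order lag. The sketch below is an assumption, not part of the current simulation: `valve_step` and the `time_constant` value are hypothetical, and a real valve may need separate opening and closing constants.

```python
def valve_step(opening, command, dt=0.01, time_constant=0.05):
    """First-order lag: the opening relaxes towards the commanded value.

    time_constant is a hypothetical value chosen only for illustration.
    """
    command = min(max(command, 0.0), 1.0)
    # Implicit-Euler blend factor for d(opening)/dt = (command - opening) / tau;
    # it always lies in (0, 1), so the update is stable for any dt.
    alpha = dt / (time_constant + dt)
    return opening + alpha * (command - opening)


# Usage: command a fully open valve and record the response over 0.5 s.
opening = 0.0
history = []
for _ in range(50):
    opening = valve_step(opening, command=1.0)
    history.append(opening)
print(f"after 1 step: {history[0]:.3f}, after 50 steps: {history[-1]:.3f}")
```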

### Implementation

The agent receives TBD observations at each timestep, which are floating-point values associated with the position, velocity and acceleration.

The agent then chooses, at each timestep, among TBD possible actions.

# General Settings

Import the libraries used by the reinforcement learning algorithm.

In [1]:
import gym
import torch
from torch import nn                # needed for building neural networks
import torch.nn.functional as F     # needed for activation functions
import torch.optim as opt           # needed for optimisation
from tqdm.notebook import tqdm      # progress bars (tqdm_notebook is deprecated in favour of tqdm.notebook)
import random
from copy import copy, deepcopy
from collections import deque
import numpy as np
print("Using torch version: {}".format(torch.__version__))

Using torch version: 2.1.1+cpu


### Hyperparameter Settings

The configuration initially follows the [supplementary information section](https://arxiv.org/pdf/1509.02971.pdf?ref=blog.paperspace.com) on page 11 of the DDPG paper.

In [2]:
BUFFER_SIZE=1000000     # Buffer size of 1 million entries
BATCH_SIZE=64   # Sampling from memory - This can be 128 for more complex tasks such as Hopper
GAMMA=0.9       # Discount factor (note: the DDPG paper uses 0.99)
TAU=0.001       #Target Network HyperParameters (soft update)
LRA=0.0001      #LEARNING RATE - ACTOR
LRC=0.001       #LEARNING RATE - CRITIC
H1=400   #neurons of 1st layers
H2=300   #neurons of 2nd layers

MAX_EPISODES=50000  #number of episodes of the training
MAX_STEPS=200       #max steps per episode. An episode ends early if a break condition is met (e.g. excessive
                    #joint-angle amplitude or a failure)

buffer_start = 100
epsilon = 1
epsilon_decay = 1./100000 #epsilon weights the exploration noise applied to the action, and epsilon_decay sets how
                     #quickly that weight shrinks. This value is fine for a simple task like the inverted pendulum.
                     #For more complex tasks like Hopper the exploration should not vanish, so the decay
                     #would be set to zero.
PRINT_EVERY = 10 #Print info about average reward every PRINT_EVERY

ENV_NAME = "Pendulum-v0" # For the hopper put "Hopper-v2" 
#check other environments to play with at https://gym.openai.com/envs/
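
As a rough illustration of what `TAU` and `epsilon_decay` control, the sketch below applies the soft target update from the DDPG paper, θ' ← τθ + (1 − τ)θ', to plain lists of floats instead of network parameters. The helper names `soft_update` and `decayed_epsilon` are hypothetical, and the linear, zero-floored epsilon schedule is an assumption, since this section does not show how the decay is applied.

```python
TAU = 0.001
EPSILON_DECAY = 1.0 / 100000


def soft_update(target_params, source_params, tau=TAU):
    """Return tau * source + (1 - tau) * target, element by element."""
    return [tau * s + (1.0 - tau) * t
            for s, t in zip(source_params, target_params)]


def decayed_epsilon(step, epsilon_start=1.0, decay=EPSILON_DECAY):
    """Linearly decayed exploration weight, floored at zero (assumed schedule)."""
    return max(epsilon_start - decay * step, 0.0)


# Usage: the target values creep slowly towards the source values.
target = [0.0, 0.0]
source = [1.0, -2.0]
for _ in range(1000):
    target = soft_update(target, source)
print(target)  # after 1000 updates the target has moved about 63% of the way

print(decayed_epsilon(50000))
```

With `TAU=0.001` the target networks lag the learned networks by roughly a thousand updates, which is what keeps the critic's bootstrap targets slowly varying.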

### Replay Buffer
Implement a simple replay buffer without prioritisation: experiences are sampled uniformly at random.

In [3]:
class replayBuffer(object):
    def __init__(self, buffer_size, name_buffer=''):
        self.buffer_size=buffer_size  #choose buffer size
        self.num_exp=0
        self.buffer=deque()

    def add(self, s, a, r, t, s2):
        experience=(s, a, r, t, s2)
        if self.num_exp < self.buffer_size:
            self.buffer.append(experience)
            self.num_exp +=1
        else:
            self.buffer.popleft()
            self.buffer.append(experience)

    def size(self):
        return self.buffer_size

    def count(self):
        return self.num_exp

    def sample(self, batch_size):
        if self.num_exp < batch_size:
            batch=random.sample(self.buffer, self.num_exp)
        else:
            batch=random.sample(self.buffer, batch_size)

        s, a, r, t, s2 = map(np.stack, zip(*batch))

        return s, a, r, t, s2

    def clear(self):
        self.buffer = deque()
        self.num_exp=0
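
The eviction logic above can also be delegated to `deque(maxlen=...)`, which discards the oldest entry automatically once the buffer is full. The class below is a hypothetical compact variant for illustration, not a replacement for the cell above: it returns field-wise tuples rather than stacked NumPy arrays.

```python
import random
from collections import deque


class BoundedReplayBuffer:
    """Hypothetical variant: deque(maxlen=...) evicts the oldest entry itself."""

    def __init__(self, buffer_size):
        self.buffer = deque(maxlen=buffer_size)

    def add(self, s, a, r, t, s2):
        self.buffer.append((s, a, r, t, s2))

    def count(self):
        return len(self.buffer)

    def sample(self, batch_size):
        # Never request more samples than are currently stored.
        n = min(batch_size, len(self.buffer))
        batch = random.sample(self.buffer, n)
        # Transpose [(s, a, r, t, s2), ...] into
        # (states, actions, rewards, terminals, next_states).
        return tuple(zip(*batch))


# Usage: a capacity-3 buffer keeps only the three most recent experiences.
buf = BoundedReplayBuffer(buffer_size=3)
for i in range(5):
    buf.add(s=i, a=0.0, r=float(i), t=False, s2=i + 1)
states, actions, rewards, terminals, next_states = buf.sample(2)
print(buf.count(), sorted(exp[0] for exp in buf.buffer))
```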

### Network Architecture

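The network architecture is defined as follows:

- **Actor** is composed of:
    - 3 FC (fully-connected) layers, with ReLU activations on the two hidden layers
    - Hyperbolic tangent as the output activation function -> bounds the action to the *[-1, 1]* value range

- **Critic** is composed of:
    - Input: *state* & *action* (the action is concatenated with the output of the first layer)
    - Output: Q-value after 3 FC layers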


In [4]:
def fanin_(size):
    # nn.Linear stores its weight as (out_features, in_features), so the fan-in is size[1]
    fan_in = size[1]
    weight = 1./np.sqrt(fan_in)
    return torch.Tensor(size).uniform_(-weight, weight)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, h1=H1, h2=H2, init_w=3e-3):
        super(Critic, self).__init__()
                
        ## Neural Network of 3 Fully-Connected Layers and ReLU Activation Function
        # FC1:       
        self.linear1 = nn.Linear(state_dim, h1)
        self.linear1.weight.data = fanin_(self.linear1.weight.data.size())
        # FC2:    (Action dimensions added to hidden layer dimensions)
        self.linear2 = nn.Linear(h1+action_dim, h2)
        self.linear2.weight.data = fanin_(self.linear2.weight.data.size())
        # FC3  
        self.linear3 = nn.Linear(h2, 1)
        self.linear3.weight.data.uniform_(-init_w, init_w)
        # ReLU
        self.relu = nn.ReLU()
        
    def forward(self, state, action):
        """
        The state is passed through the neural network layers, as well as the action.
        """
        x = self.linear1(state)
        x = self.relu(x)
        x = self.linear2(torch.cat([x,action],1))
        x = self.relu(x)
        x = self.linear3(x)
        
        return x
    

class Actor(nn.Module): 
    def __init__(self, state_dim, action_dim, h1=H1, h2=H2, init_w=0.003):
        super(Actor, self).__init__()        

        ## Neural Network of 3 Fully-Connected Layers and TanH Activation Function
        # FC1:
        self.linear1 = nn.Linear(state_dim, h1)
        self.linear1.weight.data = fanin_(self.linear1.weight.data.size())
        # FC2:
        self.linear2 = nn.Linear(h1, h2)
        self.linear2.weight.data = fanin_(self.linear2.weight.data.size())
        # FC3:        
        self.linear3 = nn.Linear(h2, action_dim)
        self.linear3.weight.data.uniform_(-init_w, init_w)
        # ReLU + Hyperbolic Tangent Activation Function:
        self.relu = nn.ReLU()
        self.tanh = nn.Tanh()
        
    def forward(self, state):
        """
        The state is passed on through the neural network.
        """
        x = self.linear1(state)
        x = self.relu(x)
        x = self.linear2(x)
        x = self.relu(x)
        x = self.linear3(x)
        x = self.tanh(x)
        return x
    
    def get_action(self, state):
        # NOTE: relies on a global `device` (a torch.device) being defined before this is called
        state  = torch.FloatTensor(state).unsqueeze(0).to(device)
        action = self.forward(state)
        return action.detach().cpu().numpy()[0]
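
Because the actor's final tanh bounds each action to [-1, 1], the output has to be rescaled before it can drive a physical actuator. The sketch below assumes the valve opening is expressed as a fraction in [0, 1]; `action_to_valve_opening` is a hypothetical helper, and the linear mapping is only one possible choice.

```python
def action_to_valve_opening(action, low=0.0, high=1.0):
    """Map a tanh-bounded action in [-1, 1] linearly onto [low, high]."""
    # Guard against exploration noise pushing the action out of range.
    action = min(max(action, -1.0), 1.0)
    return low + 0.5 * (action + 1.0) * (high - low)


# Usage: the extremes and the midpoint of the actor's output range.
for a in (-1.0, 0.0, 1.0):
    print(a, "->", action_to_valve_opening(a))
```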


### Exploration 

To ensure **exploration**, we have to add noise to the action. An **Ornstein-Uhlenbeck** process is chosen to add noise in a smooth way, suitable for continuous control tasks.

In [5]:
# Based on http://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
class OrnsteinUhlenbeckActionNoise:
    def __init__(self, mu=0, sigma=0.2, theta=.15, dt=1e-2, x0=None):
        self.theta = theta
        self.mu = np.asarray(mu, dtype=float)  # accept scalars as well as arrays (self.mu.shape is used in __call__)
        self.sigma = sigma
        self.dt = dt
        self.x0 = x0
        self.reset()

    def __call__(self):
        x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
        self.x_prev = x
        return x

    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)

    def __repr__(self):
        return 'OrnsteinUhlenbeckActionNoise(mu={}, sigma={})'.format(self.mu, self.sigma)
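
The update in `__call__` is one Euler-Maruyama step of the Ornstein-Uhlenbeck equation dx = θ(μ − x)dt + σ dW. The scalar sketch below reproduces that step with the standard library only and shows one way the noise might be combined with an action. `ou_step` and `noisy_action` are hypothetical helpers, and weighting the noise by epsilon and clipping to [-1, 1] are assumptions about how the noise is applied.

```python
import math
import random


def ou_step(x_prev, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, rng=random):
    """One Euler-Maruyama step of dx = theta * (mu - x) dt + sigma dW."""
    drift = theta * (mu - x_prev) * dt
    diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x_prev + drift + diffusion


def noisy_action(action, noise, epsilon=1.0, low=-1.0, high=1.0):
    """Add epsilon-weighted noise, then clip to the actor's output range."""
    return min(max(action + epsilon * noise, low), high)


# Usage: successive noise samples are correlated, so the perturbation drifts
# smoothly instead of jumping independently at every timestep.
rng = random.Random(0)
x = 0.0
for _ in range(5):
    x = ou_step(x, rng=rng)
    print(f"noise = {x:+.4f}, action = {noisy_action(0.9, x):+.4f}")
```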