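For the continuous action space version of the lunar lander problem you can no longer use the DQN.
The DQN outputs a Q value prediction for every action at each state, and the agent picks the action with the largest value.

With a continuous action space there are infinitely many possible actions, so you can no longer output one Q value per action.

Instead, here is a policy gradient implementation.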
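
To make the contrast concrete, here is a minimal sketch of the two output styles. The single linear layers are illustrative assumptions, not the networks used later in this notebook; the sizes match LunarLander (8 state values, 4 discrete actions, 2 continuous action values).

```python
import torch
from torch import nn

state_dim = 8      # LunarLander observations have 8 components
n_discrete = 4     # the discrete LunarLander has 4 actions
action_dim = 2     # the continuous version takes 2 values (main and side engines)

state = torch.zeros(1, state_dim)  # a dummy batch containing one state

# DQN style: one Q value per discrete action, then pick the largest.
q_net = nn.Linear(state_dim, n_discrete)  # illustrative stand-in for a Q network
discrete_action = q_net(state).argmax(dim=1)

# Policy style: the network outputs the continuous action values directly.
actor = nn.Sequential(nn.Linear(state_dim, action_dim), nn.Tanh())
continuous_action = actor(state)
```

The argmax only works because the DQN has a finite list of outputs to compare. The actor skips that step by emitting the action itself.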

# Deep Deterministic Policy Gradient

Predict action given the input state

Start an episode

Use policy network with the state as input to predict the next action

Take the action to get to the new state.

Store these transitions in a memory buffer (the replay buffer).

Have a critic network calculate the Q value for the state and action.
Have a target critic network calculate the Q value for the new state (using the action at the new state chosen by the actor_target network).

Train the critic network on the difference between these two, with the reward added to the target network's value. This is temporal difference learning. Update the target networks every so often.
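
Written out, the critic step above looks roughly like this. Here $Q$ is the critic, $Q'$ the target critic, $\mu'$ the actor_target network, $r$ the reward and $s'$ the new state. The discount factor $\gamma$ is an assumption on my part, since the notes above don't mention one; setting $\gamma = 1$ gives the undiscounted version.

$$
y = r + \gamma \, Q'\big(s', \mu'(s')\big)
$$

$$
L_{\text{critic}} = \big(Q(s, a) - y\big)^2
$$

The target $y$ is treated as a constant when differentiating, so only the critic $Q$ is trained by this loss.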

Create the objective function. In this case you're trying to maximise the output of the critic.

Update the actor.
You differentiate the critic's output with respect to the actor's parameters and do gradient ascent (in practice, gradient descent on the negative of the critic's output).
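
As a rough illustration of that actor update, the sketch below takes one gradient step on the negative of the critic's output and checks that the mean Q value goes up. The tiny networks, the fixed critic weights and the learning rate are all assumptions made so the example is small and reproducible; they are not the networks from this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)
state_dim, action_dim = 8, 2

# Toy actor and critic (hypothetical stand-ins).
actor = nn.Sequential(nn.Linear(state_dim, action_dim), nn.Tanh())
critic = nn.Linear(state_dim + action_dim, 1)
nn.init.constant_(critic.weight, 0.5)  # fixed weights so the result is reproducible
nn.init.zeros_(critic.bias)

actor_optimiser = torch.optim.SGD(actor.parameters(), lr=0.01)
states = torch.randn(16, state_dim)

def mean_q():
    # Mean critic output for the actions the actor currently picks.
    with torch.no_grad():
        return critic(torch.cat([states, actor(states)], dim=1)).mean().item()

q_before = mean_q()

# Minimising -Q is the same as gradient ascent on Q.
actor_loss = -critic(torch.cat([states, actor(states)], dim=1)).mean()
actor_optimiser.zero_grad()
actor_loss.backward()   # gradients flow through the critic back into the actor
actor_optimiser.step()  # only the actor's parameters are updated

q_after = mean_q()
```

Note that the action fed to the critic comes from `actor(states)` inside the loss, not from a stored action. That is what connects the critic's output to the actor's parameters in the graph.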


Best blog on PyTorch and the fundamentals of backward() and the graph
https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/

https://medium.com/@thechrisyoon/deriving-policy-gradients-and-implementing-reinforce-f887949bd63

https://towardsdatascience.com/deep-deterministic-policy-gradients-explained-2d94655a9b7b

https://towardsdatascience.com/solving-lunar-lander-openaigym-reinforcement-learning-785675066197

A summary of all the policy gradient approaches
https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html

Good blog on DDPG
https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html

In [88]:
import random
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch import nn
import gym

from torch.nn.functional import mse_loss
from torch import optim

In [89]:
%matplotlib inline

In [90]:
env = gym.make('LunarLanderContinuous-v2')

In [91]:
# Experience replay

class ReplayBuffer(object):
    '''
    This is a collection of transitions (state, action, reward, next state sets).
    input
    size - defines the size of the storage.
    attributes
    memory - is the list that contains the transitions.
    pointer - this identifies the oldest transition, which is the next one to be overwritten.
    '''
    
    def __init__(self,size):
        self.size = size
        self.memory = []
        self.pointer = 0
        
    def add_to_memory(self, transition):
        if len(self.memory) == self.size:
            self.memory[int(self.pointer)] = transition 
            self.pointer = (self.pointer + 1)% self.size
        else:
            self.memory.append(transition) 
            
    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)
    
    def __len__(self):
        return len(self.memory)
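
A quick usage sketch of the buffer, showing that the oldest transitions get overwritten once it is full. The class is repeated in compact form so the snippet runs on its own, and the string states and actions are placeholders.

```python
import random

class ReplayBuffer:
    def __init__(self, size):
        self.size = size
        self.memory = []
        self.pointer = 0

    def add_to_memory(self, transition):
        if len(self.memory) == self.size:
            # Buffer is full: overwrite the oldest entry and advance the pointer.
            self.memory[self.pointer] = transition
            self.pointer = (self.pointer + 1) % self.size
        else:
            self.memory.append(transition)

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(size=3)
for step in range(5):
    # Placeholder transition: (state, action, reward, next_state)
    buffer.add_to_memory((f"s{step}", f"a{step}", float(step), f"s{step + 1}"))

stored_rewards = [t[2] for t in buffer.memory]  # [3.0, 4.0, 2.0]
batch = buffer.sample(2)  # two transitions drawn without replacement
```

Steps 3 and 4 replaced steps 0 and 1, so the buffer never grows past `size`. `sample` raises a `ValueError` if `batch_size` is larger than the number of stored transitions, so check `len(buffer)` first.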
        

In [92]:
# The policy network

class policy_net(nn.Module):
    '''
    state_dim - dimension of the state
    action_dim - dimension of the action (note this doesn't mean 
    number of discrete actions, it means number of continuous values that represent the action space)
    max_action - if you need to clip the action to a certain range
    '''
    def __init__(self, state_dim, action_dim, max_action):
        super(policy_net, self).__init__()
        self.max_action = max_action
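        self.main = nn.Sequential(
        nn.Linear(state_dim, 400),
        # ReLU between the linear layers; without a non-linearity
        # the stack collapses to a single linear map
        nn.ReLU(),
        nn.Linear(400, 300),
        nn.ReLU(),
        nn.Linear(300, action_dim),
        nn.Tanh()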
        # as tanh is between -1 and 1 if you multiply by max action you'll be between -max and +max action
        )
    
    def forward(self, input):
        output = self.main(input)
        output = output*self.max_action
        return output
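
A small check of the tanh scaling used above. The `max_action` value of 2.0 is a hypothetical bound picked so the scaling is visible; the notebook itself uses 1.

```python
import torch

max_action = 2.0  # hypothetical bound, only for this illustration

# Raw network outputs can be any size, including very large values.
raw = torch.tensor([-100.0, -1.0, 0.0, 1.0, 100.0])

# tanh squashes into [-1, 1]; multiplying rescales to [-max_action, max_action].
scaled = torch.tanh(raw) * max_action
```

However large the raw output gets, the scaled action stays inside the allowed range, so no separate clipping is needed.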
    

In [73]:
# critic network

class critic_net(nn.Module):
    
    def __init__(self, state_dim, action_dim):
        super(critic_net, self).__init__()
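        self.main = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400),
            # ReLU between the linear layers; without a non-linearity
            # the stack collapses to a single linear map
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.ReLU(),
            # for a combined action and state just output one Q value.
            nn.Linear(300, 1)
        )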
        
    def forward(self, state, action):
        # join state and action along the feature dimension (dim 1),
        # so both need a leading batch dimension
        input_combined = torch.cat([state, action], 1)
        output = self.main(input_combined)
        return output 
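
A shape check for the concatenation the critic relies on. The batch size of 4 is arbitrary; the 8 and 2 are the LunarLander state and action sizes.

```python
import torch

batch_size, state_dim, action_dim = 4, 8, 2

states = torch.zeros(batch_size, state_dim)
actions = torch.ones(batch_size, action_dim)

# Concatenating along dim 1 gives one row per transition with state and action side by side.
combined = torch.cat([states, actions], dim=1)   # shape (4, 10)

# A single state straight from the environment has no batch dimension,
# so it needs one added before it can be concatenated this way.
single_state = torch.zeros(state_dim).unsqueeze(0)    # shape (1, 8)
single_action = torch.ones(action_dim).unsqueeze(0)   # shape (1, 2)
single_combined = torch.cat([single_state, single_action], dim=1)  # shape (1, 10)
```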

In [75]:
action_dim = len(env.action_space.sample())
state_dim = len(env.observation_space.sample())
max_action = 1
replay_buffer_size = 10000
batch_size = 128

In [86]:
# create a policy network
actor = policy_net(state_dim, action_dim, max_action)
actor_target = policy_net(state_dim, action_dim, max_action)

actor_target.load_state_dict(actor.state_dict())

critic = critic_net(state_dim, action_dim)
critic_target = critic_net(state_dim, action_dim)

critic_target.load_state_dict(critic.state_dict())
# Need to copy over the original parameters from critic and actor to the targets.

# critic and actor optimisers
critic_optimiser = optim.Adam(critic.parameters(), lr=0.001)
actor_optimiser = optim.Adam(actor.parameters(), lr = 0.001)

# create the replay buffer
replay_buffer = ReplayBuffer(replay_buffer_size)
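
The target networks start as exact copies and are then nudged towards the trained networks using polyak averaging. Below is a minimal sketch of that soft update on a toy layer. The `soft_update` helper name and the `tau` value of 0.1 are assumptions for illustration only.

```python
import torch
from torch import nn

def soft_update(online, target, tau):
    # Hypothetical helper: target <- tau * online + (1 - tau) * target
    with torch.no_grad():
        for param, target_param in zip(online.parameters(), target.parameters()):
            target_param.mul_(1 - tau).add_(tau * param)

online = nn.Linear(3, 1)
target = nn.Linear(3, 1)
target.load_state_dict(online.state_dict())  # target starts as an exact copy

# Pretend training moved every online parameter by 1.0.
with torch.no_grad():
    for p in online.parameters():
        p.add_(1.0)

soft_update(online, target, tau=0.1)

# The target has moved 10% of the way, so 90% of the gap remains.
gap = (online.weight - target.weight).abs().max().item()
```

A small `tau` keeps the target values slow moving, which is the point of having target networks in the first place.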

In [85]:
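# Main training 

episodes = 1
max_steps = 1000   # cap on steps per episode (assumed value)
gamma = 0.99       # discount factor (assumed value)
tau = 0.005        # polyak averaging rate for the target networks (assumed value)

for episode in range(episodes):
    
    # reset and get the first state
    # state is a numpy array. mutable.
    state = env.reset()
    
    for steps in range(max_steps): 
        # Predict the action
        # Need to convert state to a float tensor before going into network.
        state_tensor = torch.as_tensor(state, dtype=torch.float32)

        # No gradients are needed when only choosing an action.
        with torch.no_grad():
            action = actor(state_tensor)
        # convert to numpy to put into the environment. 
        numpy_action = action.numpy()

        # Find the next state.
        next_state, reward, done, _ = env.step(numpy_action)

        # store transitions into the replay buffer
        # (numpy values rather than tensors, plus the done flag)
        transition = [state, numpy_action, reward, next_state, float(done)]
        replay_buffer.add_to_memory(transition)

        if done:
            break

        # Move onto the next state
        state = next_state
    
    
    # Optimise networks once an episode, once the buffer can fill a batch
    if len(replay_buffer) < batch_size:
        continue
    
    # sample from buffer and stack into batched tensors
    batch = replay_buffer.sample(batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    actions = torch.as_tensor(np.array(actions), dtype=torch.float32)
    rewards = torch.as_tensor(np.array(rewards), dtype=torch.float32).unsqueeze(1)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    dones = torch.as_tensor(np.array(dones), dtype=torch.float32).unsqueeze(1)
    
    #Critic
    # here we input the states and the actions to get the q values
    q_value = critic(states, actions)
    
    # train the critic with temporal difference learning:
    # target = reward + discounted q_value at next state (zero if the episode ended)
    # no gradients flow through the target networks
    with torch.no_grad():
        next_action = actor_target(next_states)
        q_value_next_state = critic_target(next_states, next_action)
        y = rewards + gamma * (1 - dones) * q_value_next_state
    # critic_loss - mean squared difference between y and q_value. 
    critic_loss = mse_loss(q_value, y)
    
    critic_optimiser.zero_grad()
    critic_loss.backward()
    critic_optimiser.step()
    
    #Actor
    # differentiate the critic with respect to the actor and ascend gradient.
    # the action must come from the actor here so the graph links back to its parameters
    actor_loss = -critic(states, actor(states)).mean()
    actor_optimiser.zero_grad()
    actor_loss.backward()
    actor_optimiser.step()
    
    #Critic Target
    # polyak averaging towards the critic
    for param, target_param in zip(critic.parameters(), critic_target.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
    
    #Actor Target
    # polyak averaging towards the actor
    for param, target_param in zip(actor.parameters(), actor_target.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)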
    
    # Add in the ability to save a pretrain model. This might be useful when
    # training and testing etc. 
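
For the note about saving a pretrained model, one possible approach is to save the networks' `state_dict`s with `torch.save` and restore them with `load_state_dict`. This is a sketch only: the stand-in linear layers and the file name are hypothetical, and in the notebook you would pass the real actor and critic instead.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in networks (hypothetical); replace with the trained actor and critic.
actor = nn.Linear(8, 2)
critic = nn.Linear(10, 1)

# Hypothetical file name, written to a temporary directory for this example.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "ddpg_checkpoint.pt")

# Save the parameters of both networks in a single file.
torch.save({"actor": actor.state_dict(), "critic": critic.state_dict()}, checkpoint_path)

# Later: build a network of the same shape and load the saved parameters into it.
restored_actor = nn.Linear(8, 2)
checkpoint = torch.load(checkpoint_path)
restored_actor.load_state_dict(checkpoint["actor"])
```

To resume training rather than only test, you would likely want to save the target networks and the optimiser states in the same dictionary.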
    