# Danny's Lunar Lander

## Overview

In this project, I looked at LunarLander-v2, an OpenAI Gym environment that simulates a very basic 2D spacecraft. The goal is to land the spacecraft between the flags at an appropriate orientation and speed. In the spirit of the most recent unit, I looked into using a deep Q-network (DQN) to solve this and, after further research, found that it is well suited to this problem. I used OpenAI's Gym to interact with the environment, Keras to create my model, and NumPy for various computational tasks.

### Environment Setup
The agent is a lunar spacecraft with 4 possible actions: fire the right-facing thruster, fire the left-facing thruster, fire the bottom-facing thruster, or fire no thruster at all. These actions correspond to the integers 0-3 inclusive, which we will call the action space. The state space is a bit larger and more involved. Each observation is a state vector of length 8 in the following format: [x_coord, y_coord, x_velocity, y_velocity, lander_angle, angular_velocity, isLeftLegDown, isRightLegDown]. At first I was intimidated by this vector and thought that there might be a lot of math involved, but reinforcement learning takes care of that for us. The end goal is to land on the ground between the two flags. The reward system is as follows. You get 100-140 points for moving from the top of the screen down to the landing pad, and those points are lost again if the lander moves away from it. If you land and come to rest, you get 100 points, but if you crash, you lose 100 points. Firing the main engine costs 0.3 points per frame, so there is an incentive to finish quickly, and 10 points are awarded for each leg that makes contact with the ground.
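
To make the two spaces concrete, here is a small sketch that gives names to the action integers and to the 8 observation values. The names `Action`, `LanderState`, and `unpack_observation` are made up for illustration, and the integer order of the actions is an assumption based on the Gym source listed in the references, so it is worth checking against your installed version.

```python
from enum import IntEnum
from typing import NamedTuple, Sequence


class Action(IntEnum):
    # Assumed ordering, taken from the Gym LunarLander-v2 source
    NOOP = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


class LanderState(NamedTuple):
    x: float
    y: float
    x_velocity: float
    y_velocity: float
    angle: float
    angular_velocity: float
    left_leg_down: bool
    right_leg_down: bool


def unpack_observation(obs: Sequence[float]) -> LanderState:
    """Label the 8 raw observation values; the leg flags arrive as 0.0 or 1.0."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    *motion, left, right = obs
    return LanderState(*map(float, motion), bool(left), bool(right))


# Made-up observation, for illustration only
example = unpack_observation([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 1.0])
print(example.y_velocity, example.right_leg_down)
```

The network itself never needs these labels; they are only a convenience when inspecting what the agent sees.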

### Model Setup
To begin, I recalled the lecture where we talked about DQNs. To solve the problem of correlated samples, a DQN uses a replay buffer to store transitions that it can then sample from randomly. I created a class called ReplayBuffer that stores this information and provides simple methods to add to it and sample from it.
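
Once a batch has been sampled, it is turned into regression targets for the network. The following is a minimal NumPy sketch of that step, assuming the standard one-step Q-learning target and the discount factor of 0.993 used in my agent. The function `td_targets` and its argument names are illustrative, and the Q-values are passed in as plain arrays rather than predicted by a model.

```python
import numpy as np


def td_targets(q_states, q_succs, actions, rewards, dones, gamma=0.993):
    """Build regression targets for one sampled batch.

    q_states, q_succs: (batch, num_actions) Q-value predictions
    actions: (batch,) integer actions that were taken
    rewards: (batch,) rewards that were received
    dones:   (batch,) True where the episode ended on that step
    """
    targets = q_states.copy()
    # Bootstrapped future value; zeroed where the episode has ended
    future = gamma * q_succs.max(axis=1) * (1.0 - dones.astype(float))
    # Overwrite only the entry of the action actually taken in each row
    rows = np.arange(len(actions))
    targets[rows, actions] = rewards + future
    return targets


# Toy batch of two transitions with three actions (made-up numbers)
q_s = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
q_next = np.array([[0.0, 10.0, 0.0], [7.0, 0.0, 0.0]])
out = td_targets(q_s, q_next, np.array([0, 2]), np.array([1.0, -100.0]),
                 np.array([False, True]), gamma=0.5)
print(out)
```

Pairing `rows` with `actions` is what keeps each sample's target in its own row; every other entry is left at the network's current prediction, so it contributes no error when the model is fit.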

Next, I used Keras to create the neural network that estimates Q-values, using the Sequential() model. The architecture was influenced by my research but was also partly guess and check. I settled on 4 dense layers: three hidden layers of 200, 140, and 90 units, with an input dimension of 8 to match an observation vector, followed by an output layer. Each hidden layer uses ReLU as its activation function, and the output layer uses a linear activation because it has to produce an unbounded Q-value estimate for each action. The output layer therefore has 4 units, the size of the action space. I chose a learning rate of 0.001, following the suggestion in previous lectures to keep it small.
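
To get a feel for the size and shape of this network without Keras, here is a plain NumPy sketch of the same layer widths. It uses random weights as stand-ins for the trained ones, so the Q-values it prints are meaningless; `init_params` and `forward` are illustrative helpers, not part of my agent.

```python
import numpy as np

# Layer widths: 8 inputs, hidden layers of 200/140/90 units, 4 outputs
LAYER_SIZES = [8, 200, 140, 90, 4]


def init_params(sizes, seed=0):
    """Random weights and zero biases; stand-ins for trained Keras weights."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(params, states):
    """ReLU on the hidden layers, linear output: one Q-value per action."""
    x = states
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x


params = init_params(LAYER_SIZES)
n_params = sum(w.size + b.size for w, b in params)
q_values = forward(params, np.ones((1, 8)))
print(n_params, q_values.shape, int(np.argmax(q_values[0])))
```

The greedy action is simply the index of the largest of the 4 outputs, which is why the output width has to match the action space.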

I also use an epsilon-greedy approach, starting with an epsilon of 1 and shrinking it by a factor of 0.996 after each training update until it reaches about 0.01. It works by generating a random number between 0 and 1. If the number is at or below epsilon, I take a random sample from the action space and use that as my action. Otherwise, I use my network to predict the Q-values and take the action with the highest one. Because epsilon never goes to zero, the agent still explores occasionally even late in training.
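
Here is a minimal, dependency-free sketch of that selection rule and of the decay schedule, assuming the decay rate of 0.996 and the floor of 0.01 from my agent. The helpers `choose_action` and `decay` are illustrative names rather than the methods used in the code cells.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy one."""
    if rng.random() <= epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay(epsilon, rate=0.996, floor=0.01):
    """Shrink epsilon multiplicatively, but only while it is above the floor."""
    return epsilon * rate if epsilon > floor else epsilon


# Count how many decay steps it takes to reach the floor from 1.0
epsilon, steps = 1.0, 0
while epsilon > 0.01:
    epsilon = decay(epsilon)
    steps += 1
print(steps, round(epsilon, 4))
```

With these values epsilon reaches its floor after roughly 1,150 updates, so how quickly exploration fades depends heavily on whether the decay is applied per training step or per episode.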

### Training Procedure
In the training process, I let my model run for 400 episodes and cap each episode at 3000 steps to avoid infinite hovering. The model begins with terrible performance and, as is the way with reinforcement learning, slowly begins to perform better. It doesn't do particularly well until around 80-100 episodes. At that point, it begins to land more consistently and gets closer to the desired landing pad. The model takes about 2 hours to run fully, so it was a bit of a pain to test because it is hard to tell early on whether it is learning properly.

### Troubleshooting
There has been PLENTY of troubleshooting! First, just getting the data formatted correctly took me some time. Initially, I stored my replay buffer as a 2D numpy array, which allowed for quick slicing and indexing when sampling. However, that gave me data dimension troubles that simply weren't worth the amount of spaghetti code necessary to fix them. So I went back to storing it as a list of tuples, which I sample from at random and then loop over to build my desired numpy arrays of states, rewards, etc.
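
The dimension trouble most likely comes from mixing length-8 state vectors with scalar actions and rewards in a single array, which NumPy cannot store as a regular 2D numeric array. One way to avoid both that and the explicit loop is to regroup the sampled tuples by field with `zip` and stack each field on its own. This is a sketch with made-up transitions, and `sample_batch` is a hypothetical helper rather than the method in my ReplayBuffer class.

```python
import random

import numpy as np


def sample_batch(memory, batch_size):
    """Sample transitions and stack each field into its own array."""
    batch = random.sample(memory, batch_size)
    # zip(*batch) regroups [(s, a, r, s', d), ...] into one tuple per field
    states, actions, rewards, succs, dones = (np.array(f) for f in zip(*batch))
    return states, actions, rewards, succs, dones


# Made-up transitions: 8-value states, integer actions, float rewards
memory = [(np.full(8, float(i)), i % 4, float(i), np.full(8, i + 1.0), i == 9)
          for i in range(10)]
states, actions, rewards, succs, dones = sample_batch(memory, 4)
print(states.shape, actions.shape, dones.dtype)
```

Each field keeps its own dtype and shape this way, so the states come out as a `(batch_size, 8)` float array while the actions stay integers.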

Additionally, in my training loop, I forgot to set my current state equal to the successor state at the end of each iteration. As a result, my rocket started off acting randomly and then learned to engage only the right or left engine. This caused the lander to do flips the whole way down which, while pretty cool, was neither efficient nor the purpose of this exercise. It took me a while to fix because it was such an easy bug to overlook in my code.

### Visualization

At the beginning of training, it is all random actions being chosen so we see behavior like this:  

![SegmentLocal](RL_Learning_Gif_2.gif "segment")

As the lander acts like this and receives large negative rewards, the weights are adjusted and it learns that it shouldn't be doing this.

But later after about 200 episodes, we see the lander aiming itself and learning how to land. This is about as good as I got with my model.  

![SegmentLocal](Project_gif_2.gif "segment")

And after 300 episodes:  

![SegmentLocal](LunarLanderGif4.gif "segment")

### Conclusion and Results

It seems that my model did work, but I would love to experiment more with it. For instance, I believe that changing the architecture of the network would be interesting. Perhaps I would add or remove a hidden layer, or change the sizes of the hidden layers. I would also like to look into using Huber loss instead of MSE in my model because, based on what I found online, it also seems like an appropriate loss function.
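
To show why Huber loss might help, here is a small NumPy comparison of the two losses on some made-up errors. This is only a sketch of the formulas with a threshold `delta` of 1.0 chosen for illustration; in Keras the corresponding loss class is `tf.keras.losses.Huber`, which I have not tried in this project.

```python
import numpy as np


def mse_loss(errors):
    """Mean of the squared errors."""
    return np.mean(np.square(errors))


def huber_loss(errors, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_err = np.abs(errors)
    quadratic = 0.5 * np.square(abs_err)
    linear = delta * (abs_err - 0.5 * delta)
    return np.mean(np.where(abs_err <= delta, quadratic, linear))


# Made-up errors; the last one mimics the surprise of a -100 crash penalty
errors = np.array([0.5, -0.5, 1.0, -100.0])
print(mse_loss(errors), huber_loss(errors))
```

Because the large error only grows linearly under Huber loss, a single crash would pull the weights far less than it does under MSE, which could make training steadier.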

My model took around 200 episodes to really see success, and I wonder what I could change to get faster convergence. I would also like to reimplement my ReplayBuffer with NumPy arrays, as I think that would speed up the training process, at least through faster computation.
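
One possible shape for that reimplementation is to preallocate a separate array for each field instead of a single mixed 2D array. The sketch below is untested inside my training loop; the class name `ArrayReplayBuffer` and the fixed `capacity` (with the oldest transitions overwritten once it is full) are assumptions of mine, since my current buffer grows without limit.

```python
import numpy as np


class ArrayReplayBuffer:
    """Fixed-capacity buffer with one preallocated array per field."""

    def __init__(self, capacity, state_dim=8):
        self.states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.succs = np.zeros((capacity, state_dim), dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=bool)
        self.capacity = capacity
        self.size = 0
        self.next_idx = 0

    def add(self, state, action, reward, successor, done):
        i = self.next_idx
        self.states[i], self.actions[i], self.rewards[i] = state, action, reward
        self.succs[i], self.dones[i] = successor, done
        # Wrap around so the oldest transition is overwritten when full
        self.next_idx = (i + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, rng=None):
        rng = rng or np.random.default_rng()
        idx = rng.choice(self.size, size=batch_size, replace=False)
        return (self.states[idx], self.actions[idx], self.rewards[idx],
                self.succs[idx], self.dones[idx])


buf = ArrayReplayBuffer(capacity=5)
for t in range(7):  # two more than capacity, so the oldest two are overwritten
    buf.add(np.full(8, t), t % 4, float(t), np.full(8, t + 1), t == 6)
states, actions, rewards, succs, dones = buf.sample(3)
print(buf.size, states.shape, sorted(buf.rewards.tolist()))
```

Sampling then becomes a single fancy-indexing operation per field, with no Python loop over the batch.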

Overall, I think that this was an interesting project, and I appreciate OpenAI's efforts to make reinforcement learning environments accessible. I learned a lot about RL in the process of completing this project, and I hope to continue with it to learn even more.

In [24]:
import gym
from tensorflow.keras.layers import Dense
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np
import random

In [25]:
# ORIGINAL REPLAY BUFFER USING NUMPY THAT DID NOT WORK
# class ReplayBuffer():

#     def __init__(self):

#         self.memory = np.array([[[0,0,0,0,0,0,0,0],0,0,[0,0,0,0,0,0,0,0],0]])
#         self.currSize = 0

#     def sample_from_buffer(self, batch_size):

#         if self.currSize < batch_size:
#             return None, None, None, None, None

#         currentSample = self.memory[np.random.choice(self.memory.shape[0], batch_size, replace=False)]

#         states = currentSample[:,0]
#         actions = currentSample[:,1]
#         rewards = currentSample[:,2]
#         succs = currentSample[:,3]
#         dones = currentSample[:,4]


#         return states, actions, rewards, succs, dones

#     def add_to_buffer(self, state, action, reward, successor, done):

#         if self.currSize == 0:
#             self.memory[0,:] = (state, action, reward, successor, done)
#         else:
#             self.memory = np.vstack([self.memory,(state, action, reward, successor, done)])
#         self.currSize += 1

#     def get_size(self):
#         return self.currSize

#     def print_buffer(self):
#         [print(i) for i in self.memory]

In [26]:
class ReplayBuffer():

    def __init__(self):

        self.memory = []
        self.currSize = 0

    def sample_from_buffer(self, batch_size):
        # If the size of memory is not enough to fill a batch, then we return Nones
        if self.currSize < batch_size:
            return None, None, None, None, None

        # Take sample from our memory of size batch_size
        currentSample = random.sample(self.memory, batch_size)

        # Initialize all of our lists
        s, a, r, s_prime, d = [],[],[],[],[]
        
        # Fill them with the appropriate data
        for samp in currentSample:

            s.append(samp[0])
            a.append(samp[1])
            r.append(samp[2])
            s_prime.append(samp[3])
            d.append(samp[4])

        # Turn all of these lists into numpy arrays (once, after the loop has finished)
        states, actions, rewards, succs, dones = np.array(s), np.array(a), np.array(r), np.array(s_prime), np.array(d)

        # Remove single-dimensional entries
        states = np.squeeze(states)
        succs = np.squeeze(succs)

        return states, actions, rewards, succs, dones

    def add_to_buffer(self, state, action, reward, successor, done):
        
        # Adding one entry which holds all of the data associated with a single step
        self.memory.append((state, action, reward, successor, done))
        # Increment size to keep track of it
        self.currSize += 1

    def get_size(self):
        # Returns size of memory
        return self.currSize

    def print_buffer(self):
        # Prints each row from memory
        for entry in self.memory:
            print(entry)


In [27]:
class GreedyEpsilonWrapper(gym.ActionWrapper):

    def __init__(self, env, model, epsilon):
        super(GreedyEpsilonWrapper, self).__init__(env)
        self.epsilon = epsilon
        self.model = model
        self.env = env

    def get_action(self, state):
        
        # Use epsilon-greedy approach, with probability epsilon, choose a random action
        if np.random.uniform() <= self.epsilon:
            return self.env.action_space.sample()
        
        else:
            # Otherwise, predict our action
            acts = self.model.predict(state)
            return np.argmax(acts[0])

In [28]:
class DQN_Agent():

    def __init__(self, actions, states):
        # Initialize all of our relevant parameters and environment
        self.gamma = 0.993
        self.epsilon = 1.0
        self.replay_buf = ReplayBuffer()
        self.num_actions = actions
        self.num_states = states
        self.DQN_model = self.create_model()
        self.env=gym.make('LunarLander-v2')
        
        
    def create_model(self):
        # Create a config list which will define our NN architecture when passed to Sequential constructor
        config = [Dense(200,input_dim=self.num_states, activation='relu'),
                  Dense(140, activation='relu'),
                  Dense(90, activation='relu'),
                  Dense(self.num_actions, activation='linear')]
        # Create model
        model = Sequential(config)
        
        # Specify Adam as the optimizer with a specified learning rate
        opt = Adam(learning_rate=0.001)
        
        # Compile the model with Mean Squared Error as the loss function and Adam as the optimizer
        model.compile(loss='mse', optimizer=opt)
        
        return model

    def eps_decrease(self):
        
        # Decrease epsilon until a minimum epsilon
        if self.epsilon > 0.01:
            self.epsilon *= 0.996

    def run_target_network(self):
        
        # If there isn't enough data for a batch, return
        if self.replay_buf.get_size() < 64:
            return
        
        # Sample from my Replay Buffer using sample method defined in above class
        states, actions, rewards, succs, isDones = self.replay_buf.sample_from_buffer(64)

        # If isDones is None, this means there wasn't enough data to sample
        if isDones is None:
            return

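        # Predict Q-values for the current batch of states (the same network also supplies the targets)
        all_targs = self.DQN_model.predict_on_batch(states)
        
        # Compute the one-step target for each sample; the future term is dropped when the episode is done
        targets = rewards + self.gamma*(np.amax(self.DQN_model.predict_on_batch(succs), axis=1))*(1-isDones)
        
        # For each row, overwrite only the Q-value of the action that was actually taken
        all_targs[np.arange(len(actions)), actions] = targets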

        # Fit model on updated targets (Hence the "moving target" problem)
        self.DQN_model.fit(states, all_targs, epochs=1, verbose=0)
        
        # Decrease epsilon so that less random actions are taken
        self.eps_decrease()

    def take_action(self, state):
        # Instantiate a GreedyEpsilonWrapper which helps us use epsilon-greedy approach
        choices = GreedyEpsilonWrapper(self.env, self.DQN_model, self.epsilon)
        
        # Return choice
        return choices.get_action(state)
     
    def fix_shape(self,state):
        
        # Modify shape of input state to be appropriate 
        state = np.reshape(state,(1,8))
        
        return state

    def train_model(self, num_episodes):
        
        # Initialize scores list
        scores = []
        
        # Loop over number of episodes
        for episode in range(num_episodes):

            # Reset environment for new episode
            state = self.env.reset()
            
            # Fix shape of array so that (8,) turns into (1,8)
            state = self.fix_shape(state)
            
            # Run a single episode, capped at 3000 steps
            total = self.run_episode(state, 3000)
            scores.append(total)
            print('Episode: ', episode)
            print('Score: ', total, '\n')

    def run_episode(self, state, max_iters):
        # Initialize total and count
        total = 0
        count = 0
        
        # Loop until we hit max_iters
        while count < max_iters:

            # Get an action
            action = self.take_action(state)
            
            # Render the environment
            self.env.render()
            
            # Get successor state, reward, done, and info dictionary by calling step with my action
            succ, reward, isDone, info = self.env.step(action)
            # Fix shape of successor state vector
            succ = np.reshape(succ, (1,8))

            # Add all state information to memory
            self.replay_buf.add_to_buffer(state[0], action, reward, succ[0], isDone)

            # Update this episode's score
            total += reward

            # Update state to be successor
            state = succ
            count += 1

            # Run one training update of the network on a sampled batch
            self.run_target_network()

            # if episode is done, return
            if isDone:
                return total

        return total

In [29]:
NUM_EPISODES = 400

env = gym.make('LunarLander-v2')
action_space = env.action_space.n
state_space = env.observation_space.shape[0]

model = DQN_Agent(action_space, state_space)
model.train_model(NUM_EPISODES)

## References

https://medium.com/@jonathan_hui/rl-dqn-deep-q-network-e207751f7ae4

https://towardsdatascience.com/solving-lunar-lander-openaigym-reinforcement-learning-785675066197 

https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py

https://becominghuman.ai/beat-atari-with-deep-reinforcement-learning-part-2-dqn-improvements-d3563f665a2c

https://github.com/shivaverma/OpenAIGym