<a href="https://colab.research.google.com/github/ThomasL642/AI-ML-Projects/blob/main/Training_a_Lunar_Lander_to_Land_with_a_DQN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Importing Libraries and Display Setup**

###**Installing Packages for Video Display**

In [None]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

###**Install Packages for Setup**

In [None]:
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools > /dev/null 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[Box2d] > /dev/null 2>&1

###**Importing Packages for Display, the Model, and the Environment**



In [None]:
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #set the logger level to 40 (ERROR) so we only see error messages
import math
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay

import tensorflow as tf
import keras
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
import Box2D
from keras.layers import Activation, Dense
from keras import Sequential
from collections import deque
from keras.optimizers import Adam
from keras.activations import relu, linear

###**Defining the Display**

In [None]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900)) #create a virtual screen that is 1400 pixels wide and 900 pixels tall
display.start() #start the display

###**Display**



In [None]:
#This cell is not my code; it is just the helper code that displays the recorded video for OpenAI Gym environments

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

#**Deep Q Learning time!!**

In this cell we are **creating the environment** with our video wrapper, both so we can watch the environment and so we can use it to train our agent.

In [None]:
env = wrap_env(gym.make("LunarLander-v2"))
env.seed(111)
np.random.seed(111) #seed NumPy's random generator so the results are more repeatable

###**Define the Agent**

In this cell we define the agent class. I'll explain each part in depth with comments.
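
Before the full class, here is a minimal sketch of how the agent's memory behaves, assuming a tiny capacity of 3 and placeholder transitions (the real agent below uses a capacity of 1,000,000). The names `stored_steps`, `batch` and `too_large` are only for illustration and are not part of the notebook.

```python
import random
from collections import deque

random.seed(0)  # only so this illustration is repeatable

memory = deque(maxlen=3)  # tiny capacity so that eviction is easy to see
for step in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    memory.append((step, 0, 0.0, step + 1, False))

# Only the newest three transitions remain; steps 0 and 1 were evicted.
stored_steps = [transition[0] for transition in memory]
print(stored_steps)

# Uniform sampling without replacement, as the agent's replay step does.
batch = random.sample(list(memory), 2)

# Asking for more samples than are stored raises ValueError, which is why
# the agent skips training until the memory holds at least one full batch.
try:
    random.sample(list(memory), 4)
    too_large = False
except ValueError:
    too_large = True
print(too_large)
```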

In [None]:
class DQN:

	def __init__(self, action_space, state_space): #__init__ is the constructor; it runs every time a DQN object is created

		self.action_space = action_space #the number of discrete actions the agent can choose from
		self.state_space = state_space #the number of values in one observation of the environment
		self.epsilon = 1.1 #epsilon is the exploration rate: the probability of taking a random action. The higher the value, the more the agent explores; while it is at or above 1, every action is random.
		self.gamma = 0.99 #gamma is the discount factor, a value from 0 to 1. The higher the value, the more the agent values long-term rewards.
		self.batch_size = 50 #the batch size is how many transitions are sampled from memory for each training step
		self.epsilon_min = 0.02 #epsilon stops decaying once it reaches this value
		self.lr = 0.001 #the learning rate sets how large each weight update is. Higher values learn faster but can overshoot and converge on poor solutions.
		self.epsilon_decay = 0.995 #epsilon is multiplied by this value after each training step, so values closer to 1 decay more slowly
		self.memory = deque(maxlen = 1000000) #the replay memory is a log of (state, action taken, reward received because of the action, next state, done) holding at most 1,000,000 entries
		self.model = self.DQN_model() #build the neural network with the DQN_model function below
  
	def DQN_model(self): #making the model
   
		model = Sequential() #a Sequential model stacks layers in the order they are added
		model.add(Dense(150, input_dim = self.state_space, activation=relu)) #Dense layer 1: takes the state as input, 150 neurons, relu sets negative values to 0
		model.add(Dense(120, activation=relu)) #Dense layer 2: relu again, 120 neurons
		model.add(Dense(self.action_space, activation=linear)) #output layer: one neuron per action, linear activation so the outputs are unbounded Q-values
		model.compile(loss="mse", optimizer=Adam(learning_rate=self.lr)) #compile the model with mean squared error as the loss; learning_rate replaces the deprecated lr argument
		return model

	def new_memory(self, state, action, reward, next_state, done):
		self.memory.append((state, action, reward, next_state, done))
		#new_memory appends one transition (state, action taken, reward received, next state, and whether the episode is done) to the memory

	def predict(self, state): #the predict function chooses which action to take in the given state (epsilon-greedy)
		#with probability epsilon, explore by returning a random action
		if np.random.rand() <= self.epsilon:
			return random.randrange(self.action_space)
		predict_values = self.model.predict(state) #otherwise, ask the model for the Q-value of each action
		return np.argmax(predict_values[0]) #return the index of the action with the highest predicted Q-value

	def replay(self): #the replay function trains the model on a random batch from memory

		#random.sample cannot draw more transitions than the memory holds, so skip training until there is at least one full batch
		if len(self.memory) < self.batch_size:
			return

		sample = random.sample(self.memory, self.batch_size) #draw batch_size random transitions from memory, without duplicates
		
		states = np.array([i[0] for i in sample]) #the state of each sampled transition
		actions = np.array([i[1] for i in sample]) #the action taken in each sampled transition
		rewards = np.array([i[2] for i in sample]) #the reward received in each sampled transition
		next_states = np.array([i[3] for i in sample]) #the next state of each sampled transition
		dones = np.array([i[4] for i in sample]) #whether each sampled transition ended the episode

		states = np.squeeze(states) #convert the 3D states array (batch, 1, 8) to a 2D array (batch, 8)
		next_states = np.squeeze(next_states) #convert the 3D next_states array to a 2D array
		Qtargets = rewards + self.gamma*(np.amax(self.model.predict_on_batch(next_states), axis=1))*(1-dones) #Qtargets = reward + discounted max Q-value over all actions in the next state; (1-dones) drops the future term when the episode has ended
		Qtarget = self.model.predict_on_batch(states) #Qtarget = the model's current Q-value predictions for the sampled states
		batch_size_array = np.arange(self.batch_size) #an array from 0 to batch_size - 1, so a batch size of 5 gives [0, 1, 2, 3, 4]
		Qtarget[batch_size_array, actions] = Qtargets #in each row, replace the Q-value of the action that was taken with its target

		self.model.fit(states, Qtarget, epochs=1, verbose=0) #train the model for one pass (epoch) over the batch; verbose=0 hides the progress bar
		if self.epsilon > self.epsilon_min: #while epsilon is above the minimum, shrink it by multiplying it by epsilon_decay
			self.epsilon *= self.epsilon_decay



To connect the dots: in this cell we created a DQN agent that

- Has hyperparameters for learning, such as epsilon, gamma, epsilon decay, and the learning rate
- Builds the neural network
- Starts with an empty memory of (state, action, reward, next state, done) tuples and can add to it
- Can choose the action to take in a given state
- Trains on targets made of the reward plus the discounted maximum Q-value of the next state
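
The last bullet can be made concrete with a small worked example. This is a minimal sketch, assuming a batch of three transitions with two possible actions and made-up Q-values standing in for the output of `self.model.predict_on_batch`. The names `q_next`, `q_current` and `q_fit` are illustrative and not part of the notebook.

```python
import numpy as np

gamma = 0.99  # same discount factor as the agent

# Hypothetical batch of 3 transitions with 2 possible actions.
rewards = np.array([1.0, -1.0, 100.0])
dones = np.array([False, False, True])
actions = np.array([0, 1, 1])

# Made-up network outputs standing in for model.predict_on_batch(...).
q_next = np.array([[0.5, 2.0], [1.0, 0.0], [3.0, 4.0]])     # Q(next_state, a)
q_current = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])  # Q(state, a)

# Target: reward + gamma * max over actions of Q(next_state, a),
# with the future term zeroed for transitions that ended the episode.
targets = rewards + gamma * np.amax(q_next, axis=1) * (1 - dones)

# Overwrite only the entry of the action that was taken. The other entries
# keep the network's own prediction, so they contribute no error to the loss.
q_fit = q_current.copy()
q_fit[np.arange(len(actions)), actions] = targets

print(targets)
print(q_fit)
```

In the third transition the episode has ended, so the target is just the reward of 100 and the next state's Q-values are ignored.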

###**Defining Train**

In [None]:
def train_dqn(episode): #define the training function; episode is the number of episodes to train for

	loss = [] #a list of each episode's total score (despite the name, it holds scores, not losses)

	agent = DQN(env.action_space.n, env.observation_space.shape[0]) #create the agent with the number of actions and the size of an observation, both integers
	for current_episode in range(episode): #the training loop: run once per episode
		state = env.reset() #reset the environment and get the starting state
		state = np.reshape(state, (1, 8)) #reshape the state to (1, 8), a batch of one, which is the shape the model expects
		score = 0 #the total reward for this episode, starting at 0
		max_actions = 3000 #the maximum number of actions per episode
		for i in range(max_actions): #action loop
			action = agent.predict(state) #choose an action with the predict function defined earlier
			env.render() #render the environment
			next_state, reward, done, ii = env.step(action) #take the action in the environment; step returns the next state, the reward, whether the episode is done, and extra info
			score += reward #add the reward to the score
			next_state = np.reshape(next_state, (1, 8)) #reshape the next state to (1, 8) as well
			agent.new_memory(state, action, reward, next_state, done) #store the state, action, reward, next_state and done with the new_memory function defined earlier
			state = next_state #the next state becomes the current state
			agent.replay() #train on a random batch with the replay function defined earlier
            
			if done: #if the episode is done, print its score and leave the action loop
				print(f"Episode {current_episode}/{episode}, Score: {score}")
				break
		loss.append(score) #add the episode's score to the list

		solved = np.mean(loss[-100:]) #the mean score over the last 100 episodes (or fewer, early in training)
		if solved > 210: #if that mean is greater than 210, report success and stop training
			print(f"We have Landed on Episode #{current_episode}")
			break
		if current_episode > 99:
			print(f"Average Score Over the Past 100 Episodes: {solved}") #once more than 100 episodes have run, the mean covers exactly 100 of them
		else:
			print(f"Average Score Over the Past {current_episode + 1} Episodes: {solved}") #otherwise the mean covers every episode so far
	return loss #return the list of scores to the caller

To connect the dots, we create the `loss` list (which holds scores) and the agent, then for each episode we:
- reset the environment
- reshape the starting state
- reset the score
- set the maximum number of actions

Then for each action we:
- choose an action
- render the environment
- take that action
- add the state, action, reward, next_state and done to memory
- train the agent

When the episode is done we leave the action loop, then:
- append the score
- print information
- stop training if the environment is solved or all episodes are done
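
One consequence of training after every action is that epsilon decays once per action, not once per episode. The sketch below counts how many decay steps it takes to reach the minimum, assuming the agent's values (epsilon 1.1, decay 0.995, minimum 0.02) and that the memory already holds a full batch so every training call decays epsilon. `steps_until_min` is a hypothetical helper, not part of the notebook.

```python
def steps_until_min(epsilon, decay, epsilon_min):
    """Count the decay multiplications applied before epsilon stops shrinking."""
    steps = 0
    while epsilon > epsilon_min:  # same condition the agent checks before decaying
        epsilon *= decay
        steps += 1
    return steps

steps = steps_until_min(epsilon=1.1, decay=0.995, epsilon_min=0.02)
print(steps)
```

With these values the count comes out at roughly 800 actions, so exploration bottoms out early in training, well before the 1000-episode budget is used up. A decay value closer to 1 would spread exploration over more of the run.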


###**Training Time!**

Here we print information, set the number of episodes to train for, train the agent, plot the results, and show the video.

In [None]:
print(env.observation_space) #print observation space
print(env.action_space) #print action space
episodes = 1000 #train for at most this many episodes
loss = train_dqn(episodes) #run training; loss holds the total score of each episode
plt.plot([i+1 for i in range(0, len(loss), 2)], loss[::2]) #plot the score of every second episode against the episode number
plt.show() #show graph
show_video() #show the video
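
Per-episode scores are noisy, so a rolling mean can make the trend easier to read. This is a minimal sketch, assuming the same 100-episode window that the training function uses for its solved check. `rolling_mean` is a hypothetical helper and the scores are made up for illustration.

```python
from collections import deque

def rolling_mean(scores, window=100):
    """Mean of the most recent `window` scores after each episode."""
    recent = deque(maxlen=window)  # older scores fall off automatically
    means = []
    for score in scores:
        recent.append(score)
        means.append(sum(recent) / len(recent))
    return means

# Made-up scores for illustration only.
example_scores = [-200.0, -100.0, 0.0, 100.0]
smoothed = rolling_mean(example_scores, window=2)
print(smoothed)
```

In the notebook, the list returned by `train_dqn` could be passed in the same way, for example `plt.plot(rolling_mean(loss))`.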