# Deep Reinforcement Learning Beginners Tutorial (2) - Practice
#### by Julian Bernhart, Robin Guth

## Table of contents

1. [Requirements](#requirements)
1. [Targets](#targets)
1. [Introduction](#intro)
1. [OpenAi Gym](#openaigym)
    1. [Install OpenAi Gym - required!](#installgym)
    1. [Introduction to cartpole](#icartpole)
    1. [Make epochs visible - required!](#epochvisible)
    1. [OpenAi Gym and rendering](#rendering)
1. [Creating a RL Agent](#rlAgent)
    1. [Import necessary packages](#imports)
    1. [Exercise: Implementing the Agent](#exercise)
    1. [Summary of the created Agent](#summary)
1. [Initialize Agent](#initAgent)
    1. [Preparation for training](#prep)
    1. [Interact with Environment](#interact)
1. [Expected Result](#result)
1. [Outlook](#outlook)
1. [Sources](#sources)


### Requirements: <a name="requirements"></a>
- Basic knowledge and understanding of the content of the first notebook
- Basic knowledge about Python
- Knowledge regarding neural networks (NN)
- (Optional) Basic knowledge about PIP
- (Optional) Knowledge regarding convolutional neural networks (CNN)

### Targets: <a name="targets"></a>

- Applying the knowledge of the first notebook 
- Introduction to games as environments
- Visualizing the training process
- Implementation of a RL agent
- Configuration and training of the agent
- Discussion of further steps after this tutorial

## Introduction <a name="intro"></a>
After taking a deeper look at the theory behind _reinforcement learning_, we will now build our first RL agent. Most code for the agent is based on this [video [4]](#sources) about RL.

In the second part of the tutorial series, we will finally implement an RL agent with the help of the theory provided in notebook 1. There will be general information about game environments, which are well suited as benchmarks for AI. We will also visualize the training process as a gif, so we can watch it and better understand the effect of changes in our code. The reader is guided step by step through the creation of an agent, which could be converted into a universal RL agent. After its creation, we will configure and train our very first agent and watch the results. Finally, we will provide some good starting points to continue our way into the world of reinforcement learning.

## OpenAi Gym <a name="openaigym"></a>

As we discussed before, reinforcement learning can be used to solve a range of different problems. _Machine learning_ algorithms are often hard to understand, especially for beginners. Furthermore, it is important to be able to compare the performance of different iterations of our algorithm in order to improve it.

So basically we need an environment that we can use to test and train our RL agent and that fulfills the following requirements:

- repeatable test/training epochs
- finite set of inputs
- finite set of actions
- easy state representation
- easy to control agent
- deliver a score for a given state

In practice, not all of these points will be fulfilled, but as this is a beginner's guide, we will start with a simple environment. Luckily, many video games make quite good environments for machine learning purposes. Many implementations of RL are tested with games as benchmarks, and there are some good reasons for this. Developing a whole test environment would be labour-intensive and would require dedicated work towards a usable simulator. With an existing game, results can also be compared to human performance, which makes the evaluation of different algorithms easier. Another important point is the size of the sets of possible inputs and actions. The AI replaces the human player: depending on the game, the input for our agent is an image, just as a human player would see it, and the set of actions is a combination of different buttons which can be pressed on a controller. Finally, games are fun and most people can relate to them. It is also easier to understand what we want to accomplish, because we can transfer aspects of our human play style to the behaviour of an AI.

### Install OpenAi Gym - required! <a name="installgym"></a>

We have to install OpenAi Gym by executing the following code. This will install Gym itself and all required dependencies, but we may have to restart the kernel afterwards. After that we need to import Gym.

In [None]:
pip install gym

In [None]:
import gym
from gym import wrappers

### Introduction to cartpole <a name="icartpole"></a>

As a first step in the creation of an AI, we should always look at our environment to better understand what we want to achieve with the algorithm. The game which will be used as the environment is called Cartpole. The following information is based on source [[1]](#sources). It involves a pretty simple task: the player tries to balance a pole in a 2D world without letting it tip over. We can play this game in the real world with something like a broomstick. This may seem trivial at first, but the task gets much harder if the pole is short. If we try the same with a pen, for example, we will likely fail to balance it for a longer period of time. In the game, the difficulty comes from the pole being very sensitive: as soon as it is not perpendicular to the ground, it accelerates very quickly. Instead of on our hand, the pole rests on a small cart that the player can move right or left. Our RL algorithm will replace the player completely and will have to do all tasks a human player would need to do. An image of Cartpole is shown below.

![Image of Cartpole](img/cartpole01.PNG "Exemplary Cartpole")

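Cartpole has no winning condition, and there are only two ways to lose: either the angle of the pole becomes greater than 15° or the cart moves further than 2.4 units away from its origin. The 15° figure follows the Gym description; the threshold in the environment's source code is 12°, and the `CartPole-v0` environment used below additionally ends an episode after 200 steps. Basically, the algorithm will learn to prevent both losing conditions. To achieve this, there are two different sets of inputs that could be used as an input for an AI: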

1. picture of the game
2. angle, angular velocity of the pole along with position and velocity of the cart

This is our state representation, which gives the algorithm information about its surroundings. After each action we take, this information is updated. The first case is the closest to the human perspective. The algorithm just receives a stream of pictures and must return useful actions to perform well at the game. The AI must find the important features, that is, the connection between an input and the matching action, by itself. In the second case, we, as developers, have already decided which features are useful. We already know that the AI has enough information to decide on an action with just these four values. We will be using this input for the first example. This is done for the sake of simplicity; it should not be done in a real-life use case. Humans are naturally pretty bad at abstraction in comparison to a computer. We are also biased most of the time, and this may sometimes prevent us from recognizing useful features. An AI, on the other hand, will just look at the data and find the best patterns, but it needs the freedom to do so. In some cases this yields unexpected results, revealing strange dependencies in the data.
The set of actions contains just two movements: we can either move the cart left or right. Normally this would be done by pressing a button on a controller, but now our AI will do this for us.
Finally, our score is the time that our AI manages to balance the pole. Longer times will result in higher scores. More information on this environment can be found at the [OpenAi Gym website](https://gym.openai.com/envs/CartPole-v1/).
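The two losing conditions can be written down as a small check. The following is only a minimal sketch and not Gym's actual implementation: the function name `is_terminal`, the constant names and the assumed order of the four state values are chosen for illustration, and the thresholds are the ones quoted in the text above.

```python
import math

# Thresholds as quoted in the text above (assumed for illustration;
# the environment's source code may use different values).
MAX_ANGLE_RAD = math.radians(15.0)
MAX_POSITION = 2.4


def is_terminal(state):
    """Return True if the game is lost in the given state.

    `state` is assumed to be (cart position, cart velocity,
    pole angle in radians, pole angular velocity).
    """
    position, _velocity, angle, _angular_velocity = state
    return abs(position) > MAX_POSITION or abs(angle) > MAX_ANGLE_RAD


upright = (0.0, 0.1, math.radians(2.0), -0.05)
tipped = (0.5, 0.0, math.radians(20.0), 0.3)
off_track = (2.5, 0.0, 0.0, 0.0)

print(is_terminal(upright))    # False
print(is_terminal(tipped))     # True
print(is_terminal(off_track))  # True
```

Note that only the position and the angle decide whether the game is lost; the two velocities merely help the agent to anticipate where the cart and the pole are heading.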

In this chapter, we took a look at our first environment and its rules. In the next step, we will build our first Reinforcement Learning agent, but at first we visualize the environment.

### Make epochs visible - required! <a name="epochvisible"></a>

For performance reasons, we do not want to have to wait and watch every training epoch our agent completes. But to be honest, just looking at numbers is boring and not really helpful if our agent does not perform as expected. It is most interesting to watch some chosen epochs like the first, some in between and finally the last. We can do this by collecting all frames which are produced by the game and creating a gif afterwards. The idea for this solution was found at [[3]](https://stackoverflow.com/questions/40195740/how-to-run-openai-gym-render-over-a-server). The following code shows an example. First off, we will have to install two dependencies with the following code.

In [None]:
pip install pyvirtualdisplay

In [None]:
pip install git+https://github.com/jakevdp/JSAnimation.git

We will import some things we need for this. The function "display_frames_as_gif" will create a gif for us out of a list of all rendered images.

In [None]:
#needed imports for gif creation
from JSAnimation.IPython_display import display_animation
import matplotlib.pyplot as plt
from IPython.display import display
from matplotlib import animation

# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

In [None]:
def display_frames_as_gif(frames):
    """
    Displays a list of frames as a gif, with controls
    """
    plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi = 72)
    patch = plt.imshow(frames[0])
    plt.axis('off')

    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames = len(frames), interval=50)
    display(display_animation(anim, default_mode='loop'))

### OpenAi Gym and rendering <a name="rendering"></a>

This is an example of a run, while performing random actions in the environment without an agent learning. We will use this later, to watch training episodes of our newly created agent.

In [None]:
# Creating the environment
env = gym.make('CartPole-v0')
env = wrappers.Monitor(env, 'modelOutput/test', video_callable=False ,force=True)
# Run a demo of the environment
observation = env.reset()
cum_reward = 0
frames = []
for t in range(5000):
    # Render into buffer. 
    frames.append(env.render(mode = 'rgb_array'))
    # This is where the agent will intervene later
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    # Accumulate the reward collected during this run
    cum_reward += reward
    if done:
        break
env.render()
# Creating the gif out of the rendered frames
display_frames_as_gif(frames)

## Creating a RL Agent <a name="rlAgent"></a>

In the first notebook we dealt with the formulas and the basics of RL. For the implementation of the agent, however, we need additional components so that it can fulfill its task of navigating through the environment. As previously discussed, we use a DQN that makes the decisions for the agent. This neural network needs data for its training. As we know, the agent has to gather information about its environment and then prefer actions with a positive effect while avoiding those with a negative one. For this, the agent has to continue training after each completed round of play in order to be able to adapt (RL). To collect this data, a memory is needed that helps us to train the model. It is also important that we define the way the agent interacts with the environment. At first the memory is empty. This means that the agent has to perform random actions and remember the consequences. Once knowledge has been built up, the agent must also be able to apply it. Thus, we know that we need a way of interacting that distinguishes between exploration and exploitation. Furthermore, it makes sense to define parameters with which the learning of the DQN can be influenced. This way we can adjust the agent by further optimizing the performance of the neural network. Finally, we implement a way to retain our trained model.

In summary we need for the agent:
- a memory for storing training information from past run-throughs.
- a model for decision-making.
- an ability to interact depending on the ratio between exploration and exploitation.
- an ability to train the model by using the memory.
- an ability to retain our trained DQN.
- parameters to influence the training.
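To make the idea of the memory more concrete, here is a minimal sketch using `collections.deque`, the same container the agent below uses. The tiny `maxlen` and the made-up transitions are assumptions chosen only to keep the output readable.

```python
import random
from collections import deque

# A bounded memory: once `maxlen` is reached, the oldest entry is dropped.
memory = deque(maxlen=5)

# Store hypothetical transitions (state, action, reward, next_state, done).
for step in range(8):
    memory.append((step, step % 2, 1.0, step + 1, False))

oldest_state = memory[0][0]
print(len(memory))      # 5
print(oldest_state)     # 3

# Draw a random minibatch (without replacement) for training.
random.seed(0)
mini_batch = random.sample(list(memory), 3)
print(len(mini_batch))  # 3
```

Sampling at random instead of replaying the transitions in order means that one training batch mixes experiences from different points in time.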

### Important Configuration Values
The three following variables are very important for our agent, as they control the way the agent trains. 

- exploration rate $\epsilon$: determines the ratio between exploring the environment through random actions and choosing an action based on previously gained knowledge.
- learning rate $\alpha$: a step size, which determines how strongly new experiences change the Q-values. This prevents sudden changes caused by unusual experiences. With a smaller value, new experiences will influence the agent's knowledge less, but the agent will also be more resistant to abnormal experiences.
- discount factor $\gamma$: determines how strongly future rewards are weighted when calculating the Q-values. This value controls how far our agent looks ahead to determine its path. With a smaller value, our agent will be more short-sighted and be less affected by states that are further away.
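A small worked example may help to see how $\alpha$ and $\gamma$ act. The sketch below assumes the standard tabular Q-learning update; the DQN in this notebook only approximates it, with the optimizer's learning rate playing the role of $\alpha$. The function name `q_update` and all numbers are hypothetical.

```python
def q_update(q_old, reward, max_q_next, alpha, gamma):
    """One tabular Q-learning update step (assumed standard form)."""
    # Target: immediate reward plus the discounted best future value
    target = reward + gamma * max_q_next
    # Move the old estimate a fraction alpha towards the target
    return q_old + alpha * (target - q_old)


# Hypothetical numbers, chosen only for illustration.
q_old, reward, max_q_next = 2.0, 1.0, 4.0

small_alpha = q_update(q_old, reward, max_q_next, alpha=0.1, gamma=0.95)
large_alpha = q_update(q_old, reward, max_q_next, alpha=0.9, gamma=0.95)
short_sighted = q_update(q_old, reward, max_q_next, alpha=0.1, gamma=0.0)

print(round(small_alpha, 2))    # 2.28
print(round(large_alpha, 2))    # 4.52
print(round(short_sighted, 2))  # 1.9
```

With the small $\alpha$ the estimate moves only slightly towards the target of 4.8, with the large $\alpha$ it almost jumps there, and with $\gamma = 0$ the future value is ignored entirely.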

### Import necessary packages <a name="imports"></a>

In [None]:
import random
import gym
from gym import wrappers
import numpy as np
import os # for creating directories

from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

## Exercise: Implementing the Agent <a name="exercise"></a>

Following are three tasks to help with the creation of a DQN Agent. The agent is already mostly written, we will just need to implement some code based on the three tasks. The code is commented in order to clarify every step taken.

### Requirements: 
- Knowledge about the neural network basics.
- "Deep Reinforcement Learning - Theory" notebook, knowledge about deep reinforcement learning


### Task 1: Build a model for the DQN.
>In the previous notebook we discussed the theory for this task with an exemplary neural network. We also discussed our input and output above.

### Task 2: Implement a method in order to select an action.
>The action of the agent has to depend on whether it will be a random one or one based on knowledge. We defined an epsilon in order to do this. The epsilon functions as a probability value. 

### Task 3: Train the DQN
>We chose the action with the highest Q-value before and have to pay attention to the value range of epsilon.

In [None]:
class DQNAgent:
    
    
    def __init__(self, stateSize, actionSize):
        
        # the size and shape of the input
        self.stateSize = stateSize
        # the size of the output
        self.actionSize = actionSize
        
        # initialization of the memory
        self.memory = deque(maxlen = 2000)
        
        # the discount factor for our Bellman equation
        self.gamma = .95
        
        # value that defines the percentage of exploration
        # 100% to exploration 0% to exploitation at the moment
        self.epsilon = 1.0 
        # the decaying rate
        # decreases exploration and increases exploitation
        self.epsilonDecay = .9965
        # the minimal percentage value for exploration
        self.epsilonMin = .001
        
        self.learningRate = .001
        
        self.model = self.buildModel()
        
    
    # build the model for the DQN
    def buildModel(self):
        
        model = Sequential()
        
        # Solution 1 --- START
        model.add(Dense(24, input_dim = self.stateSize, activation = 'relu'))
        model.add(Dense(24, activation = 'relu'))
        model.add(Dense(self.actionSize, activation = 'linear'))
        # Solution 1 --- END
        
        # use the mean squared error like discussed in the theory notebook
        model.compile(loss = 'mse', optimizer = Adam(lr = self.learningRate))
        
        return model
    
    # remember all the relevant information in order to train the DQN
    def remember(self, state, action, reward, nextState, done):
        
        self.memory.append((state, action, reward, nextState, done))
        
    
    # choose a suitable action
    def act(self, state):
        
        # Solution 2 --- START
        # the agent needs to take action according to the epsilon
        if np.random.rand() <= self.epsilon:
            # select an (explorational) random action
            return random.randrange(self.actionSize)
        
        # the exploitational actions
        actValue = self.model.predict(state)
        
        # return the suitable action
        return np.argmax(actValue[0])
        # Solution 2 --- END
        
    
    # train the model with the acquired data
    def replay(self, batchSize):
        
        # Solution 3 --- START -> 
        # select a sample of data for training
        miniBatch = random.sample(self.memory, batchSize)
        
        for state, action, reward, nextState, done in miniBatch:
            # select the actual reward as the target
            target = reward
            if not done:
                # use the Bellman equation for calculating the Q-value
                target = (reward + self.gamma * np.amax(self.model.predict(nextState)[0]))
                
            # use the temporal difference error to train the model 
            # predict the future reward, we need the Q-values
            targetF = self.model.predict(state)
            # map the target reward to the predicted reward
            targetF[0][action] = target
            # train the model so that its prediction moves towards the target
            self.model.fit(state, targetF, epochs = 1, verbose = 0)
            
        # In order to shift the behaviour of the agent more and more to 
        # exploitation of its current knowledge, adjust the epsilon.
        # condition for the epsilon decrease
        if self.epsilon > self.epsilonMin:
            # decrease the epsilon
            self.epsilon *= self.epsilonDecay
        # Solution 3 --- END  
        
    
    # load a trained model
    def load(self, name):
        self.model.load_weights(name)
       
    
    # save a trained model
    def save(self, name):
        self.model.save_weights(name)

## Summary of the created Agent<a name="summary"></a>

We built an agent which can interact with its environment via a DQN. Its components and abilities are the following.

- Memory: The agent has a cyclic memory to remember related information about environment and reward. This memory then contains data with which the DQN can be trained. 

- DQN: The model for predicting the Q-values and choosing an action.

- Interaction: The agent can interact with the environment by choosing either an explorational, random action or an exploitational one based on knowledge. 

- Training: We use Q-values and the states in order to train the DQN. The data needed is stored in the memory during earlier interactions with the environment. This way the network learns which actions should be preferred in a given state and which should be avoided. 

- Save and Load: We can save and load a trained DQN.
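The training step can be illustrated without a neural network. The following sketch mirrors the target construction in the agent's `replay` method, but takes the predicted Q-values as plain lists; the function name `build_target` and the example numbers are assumptions for illustration.

```python
def build_target(q_state, q_next_state, action, reward, done, gamma=0.95):
    """Return the training target for one remembered transition.

    `q_state` and `q_next_state` stand for the network's predicted
    Q-values (one per action) for the state and the following state.
    """
    target_value = reward
    if not done:
        # Bellman equation: reward plus discounted best future value
        target_value = reward + gamma * max(q_next_state)
    # Only the entry of the action that was actually taken is changed,
    # so the other outputs contribute no error during training.
    target = list(q_state)
    target[action] = target_value
    return target


ongoing = build_target([0.5, 0.2], [1.0, 2.0], action=0, reward=1.0, done=False)
terminal = build_target([0.5, 0.2], [1.0, 2.0], action=1, reward=-10, done=True)

print([round(v, 2) for v in ongoing])   # [2.9, 0.2]
print([round(v, 2) for v in terminal])  # [0.5, -10]
```

For a terminal transition there is no future to look at, so the target is just the reward itself.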


# Initialize Agent <a name="initAgent"></a>

With our agent being complete, we now want to test it.

## Preparation for training<a name="prep"></a>
To be able to use the agent, we have to prepare some things. At first we need to create a Cartpole environment, similar to the [render example](#rendering), to train our agent to control the cart.

We need to configure the following values:

- stateSize: The size of our states. In our case this is an array with the variables mentioned in the cartpole game introduction.  
- actionSize: The amount of actions we can execute.

This information is needed to build the neural network and can be retrieved directly from the environment. 

Now we need to specify some general conditions:

- episodes: The number of episodes (epochs) to play. One episode is one game cycle from start to finish or death. After each episode the game is reset and a new round is started.
- batchSize: The number of remembered transitions that are sampled from the memory for each replay. A replay is only executed once the memory holds more entries than this.

First, we create the environment:

In [None]:
environment = gym.make('CartPole-v0')

environment = wrappers.Monitor(environment, 'modelOutput/test', video_callable=False ,force=True)

As discussed before, a DQN would normally receive the individual frames of the game as input. This gives the network the greatest possible freedom to learn; for example, it can find and exploit weaknesses of a game or specific dependencies between data to get a higher score. In this example, we instead use the abstracted form of the game state that OpenAi Gym gives us. It consists of four values which represent the state the game is in: the position and velocity of the cart, and the angle and angular velocity of the pole.

In [None]:
stateSize = environment.observation_space.shape[0]
stateSize

As the desired output we have just two values, which state whether the cart has to move left or right.

In [None]:
actionSize = environment.action_space.n
actionSize

We choose a size for our training batch and the number of episodes.

In [None]:
batchSize = 32
episodes = 2000 #better results with 4000 episodes, but takes more time to train

Lastly, we make sure that a folder is created to save our DQN.

In [None]:
outputDirectory = 'modelOutput/cartpole'

if not os.path.exists(outputDirectory):
    os.makedirs(outputDirectory)

The agent we created needs to be initialized:

In [None]:
agent = DQNAgent(stateSize, actionSize)

We want to be able to watch some of the interesting episodes, so we now use the gif-creating function. To watch different or more episodes, we can just add the episode number to the following list: 

In [None]:
viewableEpochs = [1, episodes // 2, episodes - 2, episodes - 1, episodes]

## Interact with Environment <a name="interact"></a>

Finally, it is time to train our agent. 

For every episode we want to play, we will complete the following steps:

1. Choose an appropriate action based on the current state.
1. Advance the environment one step with the chosen input, which delivers us the new state, the reward and our status.
1. If the episode has ended, we set the reward to -10 to discourage the agent from actions that end the game early.
1. The environment returns the state as a one-dimensional array. Our neural network expects a batch of inputs, so we reshape it into a batch that contains a single state.
1. To be able to replay some data later, we add the transition to the memory.
1. Set the state of the agent to our reshaped new state.
1. If the agent died or won the game, we abort the episode and print some information on the screen.
1. If the memory holds more entries than the batch size, we replay a random sample of our knowledge to train the network. The memory itself is not emptied; its oldest entries are overwritten once it is full.
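The steps above can be tried out without Gym or Keras. The following sketch runs the same loop against a hypothetical stand-in environment (`ToyEnv`, invented here only to make the loop runnable) and with random actions instead of a trained agent. The reshaping step is left out because the toy state is a plain list, and the replay step only draws the sample instead of training a network.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical stand-in for a Gym environment (4-tuple step API).

    The state is a single counter and the episode ends after a random
    number of steps. It has nothing to do with Cartpole's physics.
    """

    def reset(self):
        self.steps_left = random.randint(3, 6)
        self.position = 0
        return [self.position]

    def step(self, action):
        self.position += 1 if action == 1 else -1
        self.steps_left -= 1
        done = self.steps_left == 0
        return [self.position], 1.0, done, {}


random.seed(1)
env = ToyEnv()
memory = deque(maxlen=2000)
batch_size = 4
scores = []

for episode in range(3):
    state = env.reset()
    for t in range(5000):
        # 1. choose an action (random here, the agent would decide)
        action = random.randrange(2)
        # 2. advance the environment one step
        next_state, reward, done, _ = env.step(action)
        # 3. negative reward for the step that ends the episode
        reward = reward if not done else -10
        # 5. remember the transition
        memory.append((state, action, reward, next_state, done))
        # 6. advance the state
        state = next_state
        # 7. abort if done
        if done:
            scores.append(t)
            break
    # 8. sample a batch once the memory holds enough transitions
    if len(memory) > batch_size:
        mini_batch = random.sample(list(memory), batch_size)

print(scores)
print(len(memory))
```

Every episode contributes exactly one transition with the reward -10, namely the step that ended it.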

If we execute the following code, our agent will train the set amount of episodes. 

In [None]:
done = False

# buffer for rgb arrays to create a gif later on
frames = []

for e in range(episodes):
    state = environment.reset()
    state = np.reshape(state, [1, stateSize])
    
    for time in range(5000):
        #1. choose an appropriate action
        action = agent.act(state)
        
        #2. advance the environment one step
        nextState, reward, done, _ = environment.step(action)
        
        #3. negative reward if done
        reward = reward if not done else -10
        
        #4. reshape the new state into a batch of one for the network
        nextState = np.reshape(nextState, [1, stateSize])
        
        #5. add information from this step to memory
        agent.remember(state, action, reward, nextState, done)
        
        #6 advance the state for the agent
        state = nextState
        
        if e+1 in viewableEpochs:
            frames.append(environment.render(mode = 'rgb_array'))
        
        #7 abort if done
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}".format(e+1, episodes, time, agent.epsilon))
            
            if e+1 in viewableEpochs:
                print("Displaying episode {} ".format(e+1))
                display_frames_as_gif(frames)
            
            frames.clear()
            break
        
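    #8 if the memory holds enough samples, replay a random batch
    if len(agent.memory) > batchSize:
        agent.replay(batchSize)
    
    if e % 50 == 0:
        # join directory and file name, otherwise the path separator is missing
        agent.save(os.path.join(outputDirectory, "weights_" + '{:04d}'.format(e) 
                   + ".hdf5"))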

environment.close()

## Expected Result <a name="result"></a>

If we watch the gifs created during training, we will notice the improvements our agent makes over several epochs. While the first one shows just a random set of movements, the agent is able to keep the pole steady in the end. Congratulations, we created our first working RL agent.

### Improvements:

If the agent does not work as expected, this may have some of the following reasons: 

- Short training: We need a proper number of epochs for training. We recommend at least 4000 iterations, but more will further improve the agent.
- Bad $\epsilon$: If our exploration ratio is not properly set up, the agent may not explore enough to really know its environment. We can reduce $\epsilon$ more slowly per round. 
- Bad $\gamma$: If our agent is too short-sighted, it might not be able to recognize the state of the pole and might fail to react properly. We can increase the discount factor to improve this. Be aware that we will need longer training because of the increased sight of the agent.
- Bad $\alpha$: If our $\alpha$ is too low, our agent will need many epochs to learn enough to make educated decisions. If our $\alpha$ is too high, abnormal experiences like a death will impact the knowledge of the agent too strongly and it will not be able to find better paths after several failures.
- Bad model (neural net): If our model is too small, the agent may not be able to find a good function to approximate the Q-values. If the model is too big, training will take longer.
- The formula to calculate the Q-values might be incorrect or not suited for this problem. There are several versions of the formula, which can perform differently for different problems.

With these tips, we will be able to further improve our agent by fine-tuning some values. We highly recommend that you try to change the values and understand their meaning for the agent. You can also use the provided code as a base to create your own additions.
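To get a feeling for the effect of the decay rate on $\epsilon$, we can count how many replay calls it takes until the exploration rate reaches its minimum. This is a minimal sketch that mirrors the decay rule of the agent above; the function name `decays_until_minimum` and the alternative decay rates 0.99 and 0.999 are assumptions for illustration.

```python
import math


def decays_until_minimum(epsilon_decay, epsilon=1.0, epsilon_min=0.001):
    """Count the replay calls until epsilon stops decaying.

    Mirrors the rule `if epsilon > epsilon_min: epsilon *= epsilon_decay`.
    """
    count = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        count += 1
    return count


for decay in (0.99, 0.9965, 0.999):
    # Closed-form estimate: solve decay ** n = epsilon_min for n
    estimate = math.log(0.001) / math.log(decay)
    print(decay, decays_until_minimum(decay), round(estimate, 1))
```

With the values used in this notebook, it takes on the order of two thousand replay calls before the minimum is reached. Since the training loop calls the replay once per episode, the agent keeps exploring during almost the whole run of 2000 episodes.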

### Universal AI

In the beginning, we stated that _reinforcement learning_ is a step towards a universal AI. The agent we created is basically universal. This means that we do not need to tell our agent many things about its environment; it will be able to learn these things by itself. As stated above, we only need to configure a few values. This is a special attribute of RL. In reality, the agent does not even know what it is doing; it just follows patterns in the data it is fed. 

If we want to use our agent to learn a different game, we only need to change a few things like the environment, some values and the model.

# Outlook <a name="outlook"></a>

This concludes this beginner's tutorial. In the second notebook, we finally wrote our first simple agent to play the game Cartpole for us. We wrote our own implementation of an RL agent for this, which works well in this simple context. As always in programming, there are already plenty of libraries that provide neat functions, so we do not have to write everything ourselves. Widely used libraries are also tested by many developers, which usually makes them more reliable than self-written code. 

The following link shows an example for a DQN Agent created with a library called "Keras RL" with the Atari Game "Breakout" as environment. If we take a closer look at the code, we will see, that it accomplishes the same things we did with our implementation, but with less self-written code. 

https://github.com/keras-rl/keras-rl/blob/master/examples/dqn_atari.py

We highly recommend that you take a look at one of the following libraries and maybe even use them to create further projects. We only took a quick look at the possibilities with RL and there is definitely more to see. There are also plenty of more advanced tutorials online, so we can continue our journey there.

- https://github.com/keras-rl/keras-rl - Keras RL: implementations of different RL algorithms
- https://github.com/google/dopamine - Dopamine: Google's take on an RL library
- https://github.com/openai/baselines - OpenAi: open source implementation of different RL algorithms 

# Sources <a name="sources"></a>

This notebook is based on the following articles/blogposts/tutorials:

[1] https://gym.openai.com/ - information about OpenAi Gym

[2] https://gym.openai.com/envs/CartPole-v1/ - information about the Cartpole environment of OpenAi Gym

[3] https://stackoverflow.com/questions/40195740/how-to-run-openai-gym-render-over-a-server - solutions for rendering epochs on a headless server

[4] https://www.youtube.com/watch?v=OYhFoMySoVs - code for the DQN