# Reinforcement Learning Beginner's Tutorial





## Targets

## OpenAI Gym

As discussed before, Reinforcement Learning can be used to solve a range of different problems. Machine Learning algorithms are often hard to understand, especially for beginners. Furthermore, to improve our algorithm we need to be able to compare the performance of its different iterations.

So basically, we need an environment for training and testing our RL agent that fulfills the following requirements:

- repeatable test/training epochs
- finite set of inputs
- finite set of actions
- easy state representation
- easy to control agent
- deliver a score for a given state

In practice, not every environment fulfills all of these points, but since this is a beginner's guide, we will start with a simple one. Luckily, many video games make quite good environments for machine learning purposes, and many RL implementations are benchmarked on games.
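
To make the requirements above more concrete, here is a minimal sketch of a toy environment. It is not part of Gym: the class `LineWalkEnv`, the helper `run_episode`, and the `reset`/`step` methods are hypothetical and only mirror the general shape of the interface used later in this tutorial. The episode is repeatable, the sets of states and actions are finite, the state is a single integer, and every step delivers a score.

```python
import random


class LineWalkEnv:
    """Hypothetical toy environment: walk along a short line to its right end."""

    ACTIONS = (0, 1)  # finite set of actions: 0 = step left, 1 = step right

    def __init__(self, length=5):
        self.length = length  # finite set of states: positions 0 .. length - 1
        self.position = 0

    def reset(self):
        # Every episode starts from the same position, so runs are repeatable.
        self.position = self.length // 2
        return self.position

    def step(self, action):
        if action not in self.ACTIONS:
            raise ValueError("unknown action: {}".format(action))
        self.position += 1 if action == 1 else -1
        reached_goal = self.position >= self.length - 1
        done = reached_goal or self.position <= 0
        reward = 1.0 if reached_goal else 0.0  # score delivered for the new state
        return self.position, reward, done


def run_episode(env, rng, max_steps=100):
    """Play one episode with random actions and return the total score."""
    env.reset()
    total = 0.0
    for _ in range(max_steps):
        _, reward, done = env.step(rng.choice(env.ACTIONS))
        total += reward
        if done:
            break
    return total


# The same seed reproduces the same episode and therefore the same score.
print(run_episode(LineWalkEnv(), random.Random(1)) ==
      run_episode(LineWalkEnv(), random.Random(1)))  # True
```

Cartpole follows the same pattern, only with a richer state and a physics simulation behind `step`.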


### Introduction to Cartpole

As a first step in creating an AI, we should always look at our environment to better understand what we want to achieve with the algorithm. The game that will be used as the environment is called Cartpole. It involves a pretty simple task: the player tries to balance a pole in a 2D world without letting it tip over. We can play this game in the real world with something like a broomstick. This may seem trivial at first, but the task gets much harder if the pole is short. If we try the same with a pen, for example, we will likely fail to balance it for long. In the game, the difficulty comes from the pole being very sensitive to not being perpendicular to the ground and accelerating very quickly once it tips. Instead of on our hand, the pole rests on a small cart that the player can move left or right. Our RL algorithm will replace the player completely and will have to do everything a human player would need to do. An image of Cartpole is shown below.

![Image of Cartpole](img/cartpole01.PNG "Exemplary Cartpole")

Cartpole is an endless game, and there are only two ways to lose: either the pole tilts too far from the vertical (the Gym description states 15°, while the implementation itself uses a threshold of 12°), or the cart moves further than 2.4 units away from its origin. Basically, the algorithm will learn to prevent both conditions. To achieve this, there are two different sets of inputs that could be used for an AI:

1. picture of the game
2. numeric state values: cart position, cart velocity, pole angle, and pole angular velocity

This is our state representation, which gives the algorithm information about its surroundings. It is updated after each action we take. The first case is closest to the human perspective: the algorithm just receives a stream of pictures and must return useful actions to perform well at the game. The AI must find the important features, that is, the connection between an input and the appropriate action, by itself. In the second case we, as developers, have already decided which features are useful, and we know that these values give the AI enough information to decide on an action. We will use this input for the first example.

This is done for the sake of simplicity and should not be done in a real-life use case. Compared to a computer, humans are naturally rather bad at abstraction. We are also biased most of the time, which can keep us from recognizing useful features. An AI, on the other hand, will just look at the data and find the best patterns, but it needs the freedom to do so. In some cases this yields unexpected results that reveal strange dependencies in the data.

The set of actions contains just two moves: we can move the cart either left or right. Normally this would be done by pressing a button on a controller, but now our AI will do it for us.

Finally, our score is the time that our AI manages to balance the pole. Longer times result in higher scores.
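
The state, the actions, and the losing conditions described above can be written down in a few lines. The following is only a sketch under assumptions: the type `CartPoleState` and the functions `episode_failed` and `lean_policy` are hypothetical helpers, the angle is kept in degrees to match the text (Gym itself reports radians), a positive angle is assumed to mean that the pole leans to the right, and the default thresholds are simply the 15° and 2.4 figures quoted in the text, passed as parameters so they can be changed.

```python
from collections import namedtuple

# Assumed layout of the numeric state (hypothetical helper type).
CartPoleState = namedtuple(
    "CartPoleState",
    ["cart_position", "cart_velocity", "pole_angle_deg", "pole_angular_velocity"],
)

LEFT, RIGHT = 0, 1  # the two available actions


def episode_failed(state, max_angle_deg=15.0, max_position=2.4):
    """True if one of the two losing conditions is met."""
    return (abs(state.pole_angle_deg) > max_angle_deg
            or abs(state.cart_position) > max_position)


def lean_policy(state):
    """Hand-written baseline: push the cart towards the side the pole leans to."""
    return RIGHT if state.pole_angle_deg > 0 else LEFT


state = CartPoleState(cart_position=0.3, cart_velocity=0.0,
                      pole_angle_deg=4.0, pole_angular_velocity=0.0)
print(lean_policy(state) == RIGHT)                          # True
print(episode_failed(state))                                # False
print(episode_failed(state._replace(cart_position=-2.5)))   # True
```

A rule like `lean_policy` is exactly the kind of hand-picked feature logic mentioned above. The agent built in the next step has to discover a better rule from the score alone.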

In this chapter, we took a look at our first environment and its rules. In the next step, we will build our first Reinforcement Learning agent.

## Universal AI

# Install & import dependencies

### Install gym

In [None]:
pip install gym

### Install some dependencies to render epochs

In [None]:
pip install pyvirtualdisplay

In [None]:
pip install git+https://github.com/jakevdp/JSAnimation.git

### Import

In [None]:
import random
import gym
from gym import wrappers
import numpy as np
import os # for creating directories

from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


# needed for gif
from JSAnimation.IPython_display import display_animation
import matplotlib.pyplot as plt
from IPython.display import display
from matplotlib import animation

## Set parameters

In [None]:
environment = gym.make('CartPole-v0')

# wrap the environment to record episode statistics (video recording is disabled)
environment = wrappers.Monitor(environment, 'modelOutput/test', video_callable=False, force=True)

In [None]:
stateSize = environment.observation_space.shape[0]

In [None]:
actionSize = environment.action_space.n

In [None]:
batchSize = 32

In [None]:
episodes = 4000

In [None]:
outputDirectory = 'modelOutput/cartpole'

In [None]:
# create the output directory if it does not exist yet
os.makedirs(outputDirectory, exist_ok=True)

# Define Gif Making Method

In [None]:
def display_frames_as_gif(frames):
    """
    Displays a list of frames as a gif, with controls
    """
    plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi = 72)
    patch = plt.imshow(frames[0])
    plt.axis('off')

    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames = len(frames), interval=50)
    display(display_animation(anim, default_mode='loop'))


# Define Agent

In [None]:
class DQNAgent:
    
    
    def __init__(self, stateSize, actionSize):
        
        self.stateSize = stateSize
        self.actionSize = actionSize
        
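        self.memory = deque(maxlen = 2000) # replay memory; the oldest experiences are dropped first
        
        self.gamma = .95 # discount factor for future rewards
        
        self.epsilon = 1.0 # exploration rate: start with 100% exploration, 0% exploitation
        self.epsilonDecay = .9965 # factor applied to epsilon after each replay
        self.epsilonMin = .001 # lower bound for the exploration rate
        
        self.learningRate = .001 # learning rate of the Adam optimizer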
        
        self.model = self.buildModel()
        
        
    def buildModel(self):
        
        model = Sequential()
        
        model.add(Dense(24, input_dim = self.stateSize, activation = 'relu'))
        model.add(Dense(24, activation = 'relu'))
        model.add(Dense(self.actionSize, activation = 'linear')) # linear output: one estimated value per action, not a probability
        
        model.compile(loss = 'mse', optimizer = Adam(lr = self.learningRate))
        
        return model
    
    
    def remember(self, state, action, reward, nextState, done):
        
        self.memory.append((state, action, reward, nextState, done))
        
        
    def act(self, state):
        
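        # explore: with probability epsilon, pick a random action
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.actionSize)
        
        # exploit: pick the action with the highest predicted value
        actValue = self.model.predict(state)
        
        return np.argmax(actValue[0])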
    
    
    def replay(self, batchSize):
        
        miniBatch = random.sample(self.memory, batchSize)
        
        for state, action, reward, nextState, done in miniBatch:
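            target = reward # for a terminal step, the target is just the reward
            if not done:
                # otherwise add the discounted best predicted value of the next state
                target = (reward + self.gamma * np.amax(self.model.predict(nextState)[0]))
                
            targetF = self.model.predict(state) # current predictions for all actions in this state
            targetF[0][action] = target # overwrite only the value of the action that was taken
            
            self.model.fit(state, targetF, epochs = 1, verbose = 0)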
            
        if self.epsilon > self.epsilonMin:
            #print("Before: " + str(self.epsilon))
            self.epsilon *= self.epsilonDecay
            #print("After: " + str(self.epsilon))
    
    def load(self, name):
        self.model.load_weights(name)
        
    def save(self, name):
        self.model.save_weights(name)
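
The `replay` method above builds its training target from the reward plus the discounted best predicted value of the next state. The sketch below reproduces only that arithmetic, with a plain Python list in place of the Keras model's prediction. The function name `q_target` and the example numbers are made up for illustration.

```python
def q_target(reward, next_q_values, done, gamma=0.95):
    """Target for the action taken: r if terminal, else r + gamma * max Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Non-terminal step: reward 1, predicted next-state values [0.5, 2.0]
print(q_target(1.0, [0.5, 2.0], done=False))   # 1 + 0.95 * 2.0, i.e. about 2.9

# Terminal step with the -10 penalty used in the training loop
print(q_target(-10.0, [0.5, 2.0], done=True))  # -10.0
```

Only the entry of the chosen action is replaced by this target. The other outputs keep their current predictions, so the network is not pushed in any direction for actions that were not tried.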

# Initialize Agent

In [None]:
agent = DQNAgent(stateSize, actionSize)
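
The agent starts fully exploratory (`epsilon = 1.0`) and multiplies epsilon by `epsilonDecay` on every `replay` call until it is no longer above `epsilonMin`. As a rough sketch, the following estimates how many such calls that takes with the values used in the class. The helper name `decay_steps` is made up.

```python
import math


def decay_steps(epsilon_start=1.0, epsilon_min=0.001, decay=0.9965):
    """Number of multiplications by `decay` until epsilon is no longer above epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon_start) / math.log(decay))


steps = decay_steps()
print(steps)                      # a little under 2000
print(0.9965 ** steps <= 0.001)   # True
```

Because `replay` runs once per episode in the training loop, exploration fades out over roughly the first half of the 4000 episodes.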

## Interact with Environment

In [None]:
done = False

# buffer for rgb arrays to create a gif later on
frames = []

for e in range(episodes):
    state = environment.reset()
    state = np.reshape(state, [1, stateSize])
    
    for time in range(5000):
        action = agent.act(state)
        
        nextState, reward, done, _ = environment.step(action)
        
        # penalize the step that ends the episode
        reward = reward if not done else -10
        
        nextState = np.reshape(nextState, [1, stateSize])
        
        agent.remember(state, action, reward, nextState, done)
        
        state = nextState
        
        if e > episodes-6:
            frames.append(environment.render(mode = 'rgb_array'))
        
        
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}".format(e+1, episodes, time, agent.epsilon))
            
            if e > episodes-6:
                display_frames_as_gif(frames)
                frames.clear()
            break
        
    if len(agent.memory) > batchSize:
        agent.replay(batchSize)
    
    if e % 50 == 0:
        # os.path.join adds the path separator that plain string concatenation was missing
        agent.save(os.path.join(outputDirectory, "weights " + '{:04d}'.format(e)
                   + ".hdf5"))

environment.close()
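
The loop above prints one score per episode, which is quite noisy. To compare different iterations of the agent, a moving average is often easier to read. This is a small sketch with a hypothetical helper `moving_average` and made-up scores, not output from an actual training run.

```python
from collections import deque


def moving_average(scores, window=3):
    """Average of the last `window` scores, computed for every episode."""
    recent = deque(maxlen=window)
    averages = []
    for s in scores:
        recent.append(s)
        averages.append(sum(recent) / len(recent))
    return averages


print(moving_average([10, 20, 30, 40], window=2))  # [10.0, 15.0, 25.0, 35.0]
```

Collecting `time` into a list whenever an episode ends would provide the input for such a helper.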

# Outlook

# Sources

This notebook is based on the following articles/blogposts/tutorials:

[1] https://gym.openai.com/envs/CartPole-v1/ - information about the Cartpole environment of OpenAI Gym