# Pipple Lecture #12 - Reinforcement Learning
By now you have seen quite a lot about Reinforcement Learning. In this notebook, you get the chance to program your own Deep Reinforcement Learning model. Well, at least to tune its parameters. The game environment, state transitions, reward calculations and model training have already been prepared for you. Your job is to focus on one task and one task only: keep your pole upright!

During the lecture, we could not discuss every element of a DRL model, as many aspects can be tuned to perfection (or far from it). Additional explanation is given in the notebook where necessary, but don't be shy about asking for more!

## 0. Clone git-repo
Clone the necessary data and install missing packages. This may take a few minutes, but only has to be run once.

In [None]:
!git clone https://github.com/PippleNL/Lecture_RL.git
!pip install wandb
!pip install tensorflow==1.14

Set system path so the program understands where to find the relevant packages.

In [None]:
import sys

root_path = '/content/Lecture_RL/keras-rl'
if root_path not in sys.path:
  sys.path.append(root_path)

## 1. Importing relevant modules
Let's get started. First, import the necessary modules (and suppress some unwanted warnings). The 'gym' package is imported to create a Cart Pole environment for you to play with. In addition, 'keras' provides the neural network, while 'keras-rl' contains a whole bunch of interesting Reinforcement Learning functions.

In [None]:
import numpy as np
import gym
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

## 2. Setting variables
Then, set the relevant variables. Get the environment and extract the number of actions available in the Cart Pole problem. The seed settings can be useful for comparing your results over different runs. However, both the neural network and the RL framework itself still contain a high level of randomness, which may make it difficult to compare distinct runs. Keep this in mind when trying different parameter settings.

In [None]:
env = gym.make('CartPole-v0')
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

## 3. Set up your neural network
Next, build a neural network model. Initially, it is set to a simple feed-forward neural net with a single hidden layer of 2 nodes. Try different settings yourself to find your optimal set-up! Unfortunately, to this day there are no clear rules for choosing how many layers or nodes to use. Google may give you some idea, but most decisions still follow the famous method of trial and error.

Try tuning the number of hidden layers, the number of nodes per hidden layer, and the type of activation functions in the hidden and output layers. Generally used activation functions are 'softmax', 'relu', 'tanh', 'sigmoid' and 'linear'.

Use 'model.summary()' to get an overview of the complexity of your model.

In [None]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(2))
model.add(Activation('softmax'))
#model.add(Dense(4))
#model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('sigmoid'))
model.summary()  # prints the overview itself; wrapping it in print() would also print "None"
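
To get a feeling for what the parameter counts in the summary mean, the sketch below counts the trainable parameters of a fully connected network by hand. It is plain Python and does not use Keras; the function name `dense_param_count` and the larger layout are made up for illustration. It assumes the Cart Pole observation has 4 values and that there are 2 actions.

```python
def dense_param_count(layer_sizes):
    """Count trainable parameters of a fully connected network.

    layer_sizes lists the width of each layer, input first, e.g. [4, 2, 2].
    Each Dense layer has (inputs * outputs) weights plus one bias per output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


# 4 observation values -> 2 hidden nodes -> 2 actions (the starting network)
small = dense_param_count([4, 2, 2])
# A hypothetical larger alternative with two hidden layers of 16 nodes
large = dense_param_count([4, 16, 16, 2])
print(small, large)
```

More parameters give the network more capacity, but they also make it slower to train on the limited number of steps you have.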

## 4. Create your learning agent

Now, configure and compile your agent. The memory is set to Sequential Memory, storing the result of performed actions and obtained rewards. Settings you can tune:

* **policy**: the way in which actions are selected over time, following some balancing method. This RL-concept is very important, incorporating a trade-off between exploring unknown parts of the environment, and exploiting known information. (possible policies: EpsGreedyQPolicy, LinearAnnealedPolicy, SoftmaxPolicy, GreedyQPolicy, BoltzmannQPolicy, MaxBoltzmannQPolicy, BoltzmannGumbelQPolicy)
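* **memory limit**: the maximum number of recent transitions (observation, action, reward) kept in memory. Once the limit is reached, the oldest entries are overwritten. The agent samples its training batches from this memory.
* **window_length**: the number of consecutive observations that are stacked to form one state for the network. With a value of 1, only the current observation is used. It has to match the first dimension of the 'input_shape' in the Flatten layer of your model (the '(1,)' part), so keep it at 1 unless you change both.
* **target_model_update**: controls how the target network, which supplies the target Q-values, follows the network being trained. A value of 1 or larger means the weights are copied once every that many steps. A value below 1 is used as a soft-update factor: after each training step, the target weights move that fraction of the way towards the trained weights. This is not the learning rate $\alpha$; the learning rate is set separately through the optimizer ('lr' in Adam).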
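
To make the exploration versus exploitation trade-off concrete, here is a minimal sketch of epsilon-greedy action selection in plain Python. It is not the keras-rl implementation; the function name `eps_greedy_action` and the Q-values are made up for illustration.

```python
import random


def eps_greedy_action(q_values, eps=0.1, rng=random):
    """Pick a random action with probability eps, otherwise the best-valued one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))  # explore: any action
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)   # fixed seed so the run is repeatable
q_values = [0.2, 0.7]    # hypothetical Q-values for "left" and "right"
picks = [eps_greedy_action(q_values, eps=0.1, rng=rng) for _ in range(1000)]
share_best = picks.count(1) / len(picks)
print(share_best)  # expected to be near 0.95: greedy picks plus half of the random ones
```

A larger `eps` makes the agent try more random moves, which helps early on but hurts once a good strategy has been found.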

In [None]:
policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=1000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, target_model_update=0.25, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
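
Because 'target_model_update' is below 1 in the cell above, keras-rl uses it as a soft-update factor for the target network. The sketch below shows that idea on two plain Python lists instead of real network weights; the function name `soft_update` and the numbers are made up for illustration.

```python
def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau of the way towards the online weight."""
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_weights, online_weights)]


target = [0.0, 0.0]    # hypothetical target-network weights
online = [1.0, -2.0]   # hypothetical weights of the network being trained
for step in range(3):
    target = soft_update(target, online, tau=0.25)
    print(step, target)
```

With a small factor the target values change slowly, which tends to stabilise training. With a factor close to 1 the target network almost mirrors the trained network.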

## 5. The long process of learning...
Now it's time to learn something! If you haven't already... There are three settings you can consider changing, but only one of them (nb_steps) affects your training performance:

* **nb_steps**: the larger, the more time your bot gets for trying to find a good strategy, but the longer you'll have to wait.
* **verbose**: printing running status. 0 for no logging, 1 for interval logging, 2 for episode logging
* **log_interval**: if verbose=1, the number of steps that are considered to be an interval

In [None]:
dqn.fit(env, nb_steps=1000, verbose=1, log_interval=10)

## 6. How well do you perform?
Run the code below to test your DRL model. The larger the reward and number of steps per episode, the better your model performs. Running about 10 episodes gives you a reasonable overall picture. Unfortunately, visualization only works when running locally.

In [None]:
dqn.test(env, nb_episodes=10, visualize=False)
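
To compare runs, it helps to boil the episode rewards down to a few numbers. The sketch below does this for a list of rewards in plain Python. The function name `summarize_rewards` and the example values are made up; it assumes CartPole-v0 ends an episode after at most 200 steps with a reward of 1 per step. As far as I know, 'dqn.test' returns a history object whose 'history["episode_reward"]' holds such a list, but treat that as an assumption and check it in your own run.

```python
from statistics import mean


def summarize_rewards(episode_rewards, max_reward=200.0):
    """Return mean, min and max reward, plus the share of episodes at the cap."""
    at_cap = sum(1 for r in episode_rewards if r >= max_reward)
    return {
        "mean": mean(episode_rewards),
        "min": min(episode_rewards),
        "max": max(episode_rewards),
        "share_at_cap": at_cap / len(episode_rewards),
    }


example_rewards = [23.0, 200.0, 187.0, 200.0, 95.0]  # hypothetical test results
summary = summarize_rewards(example_rewards)
print(summary)
```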

## 7. Happy?
If you are happy with your performance, save your model! Send it to lennart@pipple.nl, so it can be publicly evaluated.

In [None]:
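import pickle

# Replace [enter_team_name] with your team name before running.
# The with-block closes the file automatically, even if dumping fails.
with open('model_[enter_team_name].obj', 'wb') as file_pkl:
    pickle.dump(model, file_pkl)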