# Training an agent to play Super Mario using player-recorded data

In this exercise you will learn how to use player-generated data to train a neural network to play Super Mario. The results will be evaluated against those from the Q_Learner exercise.

## 1. Generating data
First, you will have to generate some data for the neural network to train on.
Playing is most fun with a USB controller, but if you do not have one, set the following variable to `False` to use the keyboard instead:

In [None]:
USE_GAMEPAD = True

// TODO: Explain the controls
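
The recording cell below stores one entry per time step in four parallel lists (observations, actions, rewards, terminals), starting with the initial observation and a reward of 0. The following is a minimal sketch of that bookkeeping, assuming a gym-style `reset`/`step` interface; `DummyEnv` and `record_episode` are hypothetical stand-ins for illustration and are not part of the exercise code.

```python
import numpy as np


class DummyEnv:
    """Hypothetical stand-in for the Mario environment (gym-style API)."""

    def __init__(self, episode_length=3):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(2)

    def step(self, action):
        self.t += 1
        observation = np.full(2, float(self.t))
        reward = 1.0
        done = self.t >= self.episode_length
        return observation, reward, done, {}


def record_episode(env, read_action):
    """Record one episode into four arrays of equal length."""
    observation = env.reset()
    done = False
    action = read_action()
    # First entry: initial observation, no reward yet, not terminal
    observations, actions, rewards, terminals = [observation], [action], [0.0], [done]
    while not done:
        observation, reward, done, info = env.step(action)
        action = read_action()
        observations.append(observation)
        actions.append(action)
        rewards.append(reward)
        terminals.append(done)
    return (np.asarray(observations), np.asarray(actions),
            np.asarray(rewards), np.asarray(terminals))


obs, act, rew, term = record_episode(DummyEnv(), read_action=lambda: 0)
print(obs.shape, act.shape, rew.shape, term.shape)
```

All four arrays have the same length, and only the last entry of `terminals` is `True`, which is how the episode boundary is marked in the stored data.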

In [None]:
import os
import subprocess
from d3rlpy.dataset import MDPDataset
from gym_setup import Env
from gamepad_controller import GamepadController
from keyboard_controller import KeyboardController
import numpy as np

level = os.path.join("levels", "RoughTerrainLevel.lvl") # TODO: Use a very easy level for this exercise

try:
    with subprocess.Popen(['java', '-jar', 'server.jar']) as server:
        env = Env(visible=True, port=8080, level=level, run_server=False).env
        if USE_GAMEPAD:
            controller = GamepadController(env)
        else:
            controller = KeyboardController(env)
        while True:
            observation = env.reset()
            done = False
            action = controller.read()

            observations = [observation]
            actions = [action]
            rewards = [0]  # No reward at first time step, because no action was taken yet
            terminals = [done]

            while not done:
                observation, reward, done, info = env.step(action)
                action = controller.read()

                observations.append(observation)
                actions.append(action)
                rewards.append(reward)
                terminals.append(done)

            dataset_path = os.path.join("data", "datasets", os.path.split(level)[1] + ".h5")
            if os.path.isfile(dataset_path):
                dataset = MDPDataset.load(dataset_path)
                dataset.append(np.asarray(observations), np.asarray(actions), np.asarray(rewards),
                                np.asarray(terminals))
            else:
                dataset = MDPDataset(np.asarray(observations), np.asarray(actions), np.asarray(rewards),
                                        np.asarray(terminals), discrete_action=True)
            dataset.dump(dataset_path)
            stats = dataset.compute_stats()
            mean = stats['return']['mean']
            std = stats['return']['std']
            print(f'mean: {mean}, std: {std}')
except ConnectionResetError:
    # Finish
    pass


## 2. Use the generated data to train a policy
Now that you have generated some data for the neural network to train on, let's begin with the training.
For this exercise we will use d3rlpy, a Python library for offline RL.

### 2.1 Choosing an algorithm
// TODO: Why we chose DQN
![DQN](data/dqn.PNG)
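
As a reminder of what DQN learns, here is a minimal NumPy sketch of the standard one-step target, `r + gamma * max_a Q_target(s', a)`, with the bootstrap term dropped on terminal steps. It is an illustration only, not d3rlpy's internal code; the function name `dqn_targets` and the example numbers are assumptions made for this sketch.

```python
import numpy as np


def dqn_targets(rewards, next_q_values, terminals, gamma=0.99):
    """One-step DQN targets for a batch of transitions.

    next_q_values holds the target network's Q-values for the next
    observation, with shape (batch, n_actions).
    """
    rewards = np.asarray(rewards, dtype=float)
    terminals = np.asarray(terminals, dtype=float)
    best_next = np.max(np.asarray(next_q_values, dtype=float), axis=1)
    # No bootstrapping from the next state when the episode has ended
    return rewards + gamma * (1.0 - terminals) * best_next


targets = dqn_targets(
    rewards=[1.0, 0.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],
    terminals=[False, True],
    gamma=0.99,
)
print(targets)
```

The Q-network is trained to match these targets on batches drawn from the recorded dataset.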



### 2.2 Set up the training
Let's set up some parameters before the training:

In [None]:
from data.datasets.getDatasets import getDataset
from gym_setup import Env
import d3rlpy
import pathlib
from d3rlpy.metrics.scorer import evaluate_on_environment
from sklearn.model_selection import train_test_split
import copy
import socket
from contextlib import closing

MODEL_DIR = pathlib.Path("data", "models")
if not MODEL_DIR.exists():
    MODEL_DIR.mkdir(parents=True)

# Environment settings
port = 8081
run_server = True
visible = False

# Training parameters
gamma = 0.99
learning_rate = 0.0003
target_update_interval = 3000
n_epochs = 1000
test_size = 0.1
batch_size = 2
n_frames = 1
n_steps = 40
use_gpu = True

In [None]:
### 2.3 Training time!
To start the training, run the next cell. To follow the training progress, you can additionally open a new terminal and run ``pipenv run board`` to open TensorBoard.
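
The training cell calls `find_free_port()`, which is not defined anywhere in this notebook. The `socket` and `closing` imports in the setup cell suggest a helper that asks the operating system for an unused TCP port. A minimal sketch under that assumption (the implementation is a guess, not the original helper):

```python
import socket
from contextlib import closing


def find_free_port():
    """Ask the operating system for a currently unused TCP port."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.bind(("", 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]
```

The port is released again before the function returns, so another process could in principle claim it before the game server starts. For a single local training run this is usually acceptable.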

In [None]:
    env = Env(visible=visible, port=find_free_port(), level=level, run_server=run_server).env

    dataset = getDataset()

    train_episodes, test_episodes = train_test_split(dataset, test_size=test_size)

    dqn = d3rlpy.algos.DQN(learning_rate=learning_rate, gamma=gamma, use_gpu=use_gpu,
                           target_update_interval=target_update_interval, batch_size=batch_size)

    # initialize the networks from the dataset (no training happens here)
    dqn.build_with_dataset(dataset)
    # set environment in scorer function
    evaluate_scorer = evaluate_on_environment(env)
    # evaluate algorithm on the environment
    rewards = evaluate_scorer(dqn)
    name = 'marioai_%s_%s_%s_%s_%s' % (pathlib.Path(level).name, gamma, learning_rate, target_update_interval, n_epochs)
    currentMax = -100000
    dqn_max = copy.deepcopy(dqn)

    for epoch, metrics in dqn.fitter(
            train_episodes,
            eval_episodes=test_episodes,
            tensorboard_dir='runs',
            experiment_name=name,
            n_epochs=n_epochs,
            scorers={'environment': evaluate_scorer}):
        if metrics.get("environment") > currentMax:
            currentMax = metrics.get("environment")
            dqn_max.copy_q_function_from(dqn)
        else:
            dqn.copy_q_function_from(dqn_max)

    model_file = pathlib.Path(MODEL_DIR, name + ".pt")
    dqn.save_model(model_file)

In [None]:
### 2.4 See what worked
Now let's see if the training did something:

In [None]:
env = Env(visible=True, level=level, port=8082).env
dqn = d3rlpy.algos.DQN()
dqn.build_with_dataset(getDataset())
dqn.load_model(model_file)  # model saved by the training cell above

while True:
    observation = env.reset()
    done = False
    total_reward = 0
    while not done:
        observation, reward, done, info = env.step(dqn.predict([observation])[0])
        total_reward += reward

    print(f'finished episode, total_reward: {total_reward}')

## 3. Offline RL vs Online RL
Now we want to compare the results from exercise 1, where an online Q_Learner was used, with the results of the offline RL approach.

### 3.1 Easy level

### 3.2 Medium level

### 3.3 Hard level

### 3.4 Conclusion