# SUPER MARIO AI #

## Welcome to this Exercise! ##



![](https://media.giphy.com/media/EpB8oRhHSQcnu/giphy.gif)

### <font color='red'>Let's train an agent to play Super Mario!</font>
##### We'll be using the Deep Q-Learning (DQN) algorithm.
##### Follow this step-by-step guide and feel free to play around with the code. Maybe you'll be able to give Mario an upgrade. Have fun!

In [None]:
# Install dependencies.
# macOS users: check the README before installing `keyboard`.
%pip install keyboard
%pip install gym
%pip install d3rlpy
# Alternatively: conda install d3rlpy


Let's gather data first. After installing the necessary dependencies, run the next cell to collect player data and get a feel for the game yourself.
Before running the cell, you need to connect to the game server. To do so, open a terminal and run `java -jar ./marioai-server-0.1-jar-with-dependencies.jar` in the `gym_marioai/gym_marioai/server` folder.


We encourage you to generate datasets at different levels. These levels include:
+ cliffLevel
+ coinLevel
+ earlyCliffLevel
+ easyLevel
+ enemyLevel
+ flatLevel
+ hardLevel
+ oneCliffLevel

The following command-line options may help you generate the data:

+ `python play-for-training.py -u <int>` or `python play-for-training.py --user <int>` sets the user ID when collecting data for training. `0` is the test user, whose data we will ignore; all other values are user IDs.
+ `python play-for-training.py` runs the default seed 0.
+ `python play-for-training.py -l coinLevel` or `python play-for-training.py --level coinLevel` runs the specified level, here `coinLevel`.
+ `python play-for-training.py -s 188` or `python play-for-training.py --seed 188` runs the specified seed, here 188.
+ `python play-for-training.py -s random` or `python play-for-training.py --seed random` runs a new random seed for each episode.
+ `python play-for-training.py --level coinLevel --seed 188`: when a level is specified, the script runs that level and ignores the seed.
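The flag behaviour listed above could be reproduced with `argparse` roughly as follows. This is only a sketch: the actual `play-for-training.py` may parse its arguments differently, `build_parser` and `resolve_level_source` are hypothetical names, and the default of 0 for `--user` is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags described above."""
    parser = argparse.ArgumentParser(description="Collect Mario training data")
    parser.add_argument("-u", "--user", type=int, default=0,
                        help="user ID; 0 is the test user (default is assumed)")
    parser.add_argument("-l", "--level", default=None,
                        help="named level, e.g. coinLevel")
    parser.add_argument("-s", "--seed", default="0",
                        help="integer seed, or 'random' for a new seed per episode")
    return parser


def resolve_level_source(args):
    """Return ('level', name) or ('seed', value); a named level wins over a seed."""
    if args.level is not None:
        return ("level", args.level)
    if args.seed == "random":
        return ("seed", "random")
    return ("seed", int(args.seed))


# A level and a seed are both given, so the seed is ignored.
args = build_parser().parse_args(["--level", "coinLevel", "--seed", "188"])
print(resolve_level_source(args))
```

Keeping the "level beats seed" rule in one small function makes the precedence easy to check in isolation.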

In [None]:
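import getpass
import subprocess

password = getpass.getpass()
# Adjust the command to your needs, e.g. append "--level", "coinLevel".
command = ["sudo", "-S", "python", "play-for-training.py"]
# sudo -S reads the password from stdin, so it never appears on a shell command line.
subprocess.run(command, input=password + "\n", text=True)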

In order to enrich your dataset, you can run random-for-training.py to add random data to the dataset.

+ `!python random-for-training.py` runs the default seed 0
+ `!python random-for-training.py -l coinLevel` or `!python random-for-training.py --level coinLevel` runs specified level 'coinLevel'
+ `!python random-for-training.py -s 188` or `!python random-for-training.py --seed 188` runs specified seed 188
+ `!python random-for-training.py -s random` or `!python random-for-training.py --seed random` runs new random seed for each episode.

The generated data will be stored in `/data` as an `.npz` file.

In [None]:
!python random-for-training.py

Now you need to run `preprocess_data.py` to turn your played games into a Markov Decision Process (MDP) dataset.
However, training success largely depends on the amount of data collected.

No worries, you can use one of the prepared datasets!

![](https://media.giphy.com/media/S5uMJDmtnATLbjjw3h/giphy.gif)

In [None]:
!python preprocess_data.py
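An MDP dataset is essentially a collection of transitions $(s, a, r, s', \text{done})$. As a rough illustration of the idea (not the logic of `preprocess_data.py`, which is not shown here), a recorded episode could be cut into transitions like this. The `Transition` type, the function name, and the toy episode are made up for the example.

```python
from typing import List, NamedTuple, Sequence


class Transition(NamedTuple):
    state: Sequence[float]
    action: int
    reward: float
    next_state: Sequence[float]
    done: bool


def episode_to_transitions(states, actions, rewards) -> List[Transition]:
    """Pair each step with its successor state.

    Expects one more state than actions/rewards, i.e. the final
    state is included but has no action attached to it.
    """
    if not (len(states) == len(actions) + 1 == len(rewards) + 1):
        raise ValueError("need one more state than actions/rewards")
    last = len(actions) - 1
    return [
        Transition(states[t], actions[t], rewards[t], states[t + 1], t == last)
        for t in range(len(actions))
    ]


# Toy episode with three states and two steps (all values are invented).
toy_states = [[0.0], [1.0], [2.0]]
toy_actions = [3, 5]
toy_rewards = [0.5, 10.0]
for transition in episode_to_transitions(toy_states, toy_actions, toy_rewards):
    print(transition)
```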


## Basics of Deep Q-Learning (DQN)  

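In deterministic environments, DQN approximates the return of a state-action pair with a function $Q(s,a;\theta)$. The update is based on the Bellman equation, which the optimal action-value function satisfies:

$Q(s,a;\theta) = r + \gamma \max_{a'} Q(s',a';\tilde{\theta})$

Here $\tilde{\theta}$ denotes the parameters of the target network.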


The TD error is the difference between the predicted value $Q(s,a;\theta)$ and the target $r + \gamma \max_{a'} Q(s',a';\tilde{\theta})$:

$\delta = Q(s,a;\theta) - (r + \gamma \max_{a'} Q(s',a';\tilde{\theta}))$
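To make the TD error concrete, here is a small plain-Python sketch with invented numbers. The `terminal` flag, which drops the bootstrap term after the last step of an episode, is a common convention and an assumption here; it does not appear in the formula above.

```python
def td_target(reward, next_q_values, gamma, terminal=False):
    """r + gamma * max_a' Q(s', a'); no bootstrap term after a terminal step."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def td_error(q_value, reward, next_q_values, gamma, terminal=False):
    """delta = Q(s, a) - (r + gamma * max_a' Q(s', a'))."""
    return q_value - td_target(reward, next_q_values, gamma, terminal)


# Invented numbers: Q(s, a) = 2.0, r = 1.0, gamma = 0.8,
# and three target-network values for the next state s'.
delta = td_error(q_value=2.0, reward=1.0, next_q_values=[0.5, 2.5, 1.0], gamma=0.8)
print(delta)
```

Here the target is $1 + 0.8 \cdot 2.5 = 3$, so $\delta = 2 - 3 = -1$: the network currently underestimates this state-action pair.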


To minimize the TD error, we use the Huber loss as our loss function, which is more robust to outliers than the squared error.

$L(\delta) = \begin{cases} \frac{1}{2}\delta^2 & \text{for } |\delta| \leq 1 \\ |\delta| - \frac{1}{2} & \text{otherwise} \end{cases}$
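The robustness claim can be checked numerically. The following plain-Python sketch works on a single scalar $\delta$ and assumes a threshold of 1, the point where the quadratic branch $\frac{1}{2}\delta^2$ and the linear branch $|\delta| - \frac{1}{2}$ meet. It is an illustration only, not the PyTorch solution to the exercise, and the `threshold` parameter is a hypothetical generalization.

```python
def huber(delta, threshold=1.0):
    """Quadratic near zero, linear in the tails.

    With threshold = 1 this is 0.5 * delta**2 for |delta| <= 1
    and |delta| - 0.5 otherwise.
    """
    abs_delta = abs(delta)
    if abs_delta <= threshold:
        return 0.5 * delta ** 2
    return threshold * (abs_delta - 0.5 * threshold)


# Compare the Huber loss with the plain squared loss for small and large errors.
for d in (0.1, 1.0, 10.0):
    print(d, huber(d), 0.5 * d ** 2)
```

For a large error such as $\delta = 10$, the squared loss grows to 50 while the Huber loss stays at 9.5, so a single outlier dominates the gradient far less.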


In [8]:
from exercise_dqn import DQN 
from d3rlpy.dataset import MDPDataset
from constants import DATAPATH
import torch

### TODO: Please implement the Huber Loss Function from above. Note that 'value' describes the actual cumulated reward and 'target' the predicted cumulated reward.###

def huber_loss(beta, gamma, rewards, target, value):

  

  
  loss = torch.where() #TODO 
  return loss

Now, let's load the dataset and run the DQN algorithm with your implemented loss function

In [None]:
### Now, load the dataset and run the DQN algorithm with your implemented loss function ###
import os
import glob
import numpy as np
import d3rlpy
import gym
import gym_marioai
from gym_marioai import levels
import copy
import matplotlib.pyplot as plt 
from exercise_dqn import DQN
from d3rlpy.dataset import MDPDataset
from constants import DATAPATH
from sklearn.model_selection import train_test_split
from d3rlpy.metrics.scorer import td_error_scorer
from d3rlpy.metrics.scorer import evaluate_on_environment

dataset = MDPDataset.load(DATAPATH) #Choose Dataset here

dqn = DQN(huber_loss = huber_loss, gamma=0.8, batch_size=128) #TODO: Feel free to experiment with hyperparameters
log_dir="d3rlpy_logs"

train_episodes, test_episodes = train_test_split(dataset, test_size=0.1) 

all_actions = (0,1,2,3,4,5,6,7,8,9,10,11,12)

env = gym.make('Marioai-v0', render=False,
               level_path=levels.coin_level,
               compact_observation=False, #this must stay false for proper saving in dataset
               enabled_actions=all_actions,
               rf_width=20, rf_height=10)

evaluate_scorer = evaluate_on_environment(env)

#TODO: experiment with the number of epochs, shuffling, size of the test set, ...
fitter = dqn.fitter(train_episodes, eval_episodes=test_episodes, n_epochs=20, shuffle=True, scorers={'environment': evaluate_scorer, 'td_error': td_error_scorer})


metr = []
for epoch, metrics in fitter:
  metr.append(metrics.get('environment'))
  #Stop training when a reward over 160 is reached
  if metrics.get('environment') > 160:
    break
  
  
#fetch the most recently modified log directory
latest_logs = max(glob.glob(os.path.join(log_dir, '*/')), key=os.path.getmtime)

#fetch the latest model in that directory
latest_model = max(glob.iglob(os.path.join(latest_logs, '*.pt')), key=os.path.getctime)
print(latest_model)
#to load a specific model (not the latest), change this file path
dqn.load_model(latest_model)
dqn.save_policy(os.path.join(latest_logs, 'policy.pt'))



After running the previous cell, you should find a new folder in the `log_dir` directory. It contains a model for each epoch and some logs, such as the total rewards in `environment.csv`. Let's try it out!
Choose a model and run the next cell to evaluate the agent.
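One way to choose a model is to look for the epoch with the highest environment reward. The sketch below assumes that `environment.csv` holds header-less rows of the form `epoch,step,value`; please check your own log file, since the layout may differ between d3rlpy versions. `best_epoch` is a hypothetical helper and the sample rows are invented.

```python
import csv
import io


def best_epoch(csv_file):
    """Return (epoch, step, value) of the row with the highest value.

    Assumes header-less rows of the form epoch,step,value.
    """
    rows = [
        (int(row[0]), int(row[1]), float(row[2]))
        for row in csv.reader(csv_file)
        if row  # skip blank lines
    ]
    if not rows:
        raise ValueError("no rows found")
    return max(rows, key=lambda row: row[2])


# Invented sample standing in for an opened environment.csv file.
sample = io.StringIO("1,100,35.0\n2,200,162.5\n3,300,120.0\n")
print(best_epoch(sample))
```

With a real log, the `io.StringIO` object would be replaced by `open(...)` on the `environment.csv` file inside the log folder.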

In [None]:
import gym
from gym_marioai import levels
from d3rlpy.dataset import MDPDataset
from constants import DATAPATH
from exercise_dqn import DQN

dataset = MDPDataset.load(DATAPATH)


### Evaluation of your model on d3rlpy DQN###
dqn = DQN(huber_loss = huber_loss)


#use this instead of dqn.fit when dqn.fit() has already been run
dqn.build_with_dataset(dataset)

#TODO choose your model here
dqn.load_model('')

all_actions = (0,1,2,3,4,5,6,7,8,9,10,11,12)

env = gym.make('Marioai-v0', render=True, # turn this off for fast training without video
               level_path=levels.coin_level,
               compact_observation=False, #this must stay false for proper saving in dataset
               enabled_actions=all_actions,
               rf_width=20, rf_height=10)


# Loops forever; interrupt the kernel to stop.
while True:
    observation = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = dqn.predict([observation])[0]
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print(f'finished episode, total_reward: {total_reward}')
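To compare models, it can help to run a fixed number of episodes and average the total reward instead of watching single runs. The sketch below uses a made-up `StubEnv` in place of the Mario environment so that it runs on its own; `StubEnv` and `evaluate` are hypothetical names and the rewards are random placeholders.

```python
import random


class StubEnv:
    """Hypothetical stand-in for the Mario environment (old gym API:
    reset() -> observation, step() -> (observation, reward, done, info))."""

    def __init__(self, episode_length=5, seed=0):
        self._length = episode_length
        self._rng = random.Random(seed)
        self._t = 0

    def reset(self):
        self._t = 0
        return [0.0]

    def step(self, action):
        self._t += 1
        reward = self._rng.uniform(0.0, 1.0)  # placeholder reward
        done = self._t >= self._length
        return [float(self._t)], reward, done, {}


def evaluate(env, policy, n_episodes=10):
    """Run a fixed number of episodes and return the list of total rewards."""
    totals = []
    for _ in range(n_episodes):
        observation, done, total = env.reset(), False, 0.0
        while not done:
            observation, reward, done, _info = env.step(policy(observation))
            total += reward
        totals.append(total)
    return totals


totals = evaluate(StubEnv(), policy=lambda obs: 0, n_episodes=3)
print(sum(totals) / len(totals))
```

With the trained agent, the policy could be `lambda obs: dqn.predict([obs])[0]` and `env` the Mario environment from the cell above.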

Great! But can you do better?
Thanks for the visit!

![](https://tenor.com/view/mario-pipe-byebye-gif-5530137.gif)
