# SUPER MARIO AI #

## Welcome to this Exercise! ##



![](https://media.giphy.com/media/EpB8oRhHSQcnu/giphy.gif)

### <font color='red'>Let's train an agent to play Super Mario!</font>
##### We'll be using the Convergent DQN algorithm, a more stable extension of DQN.
##### Follow this step-by-step guide and feel free to play around with the code. Maybe you'll be able to give Mario an upgrade. Have fun!

In [None]:
#install dependencies

%pip install keyboard #check out the README if you're using macOS
%pip install gym
%pip install d3rlpy 
#alternatively: conda install d3rlpy 


Let's gather data first. After installing the necessary dependencies, run the next cell to collect player data and get a feel for the game yourself.
To run the following cell, you need to connect to the game server. To do so, open a terminal and run 'java -jar ./marioai-server-0.1-jar-with-dependencies.jar' in the gym_marioai/gym_marioai/server folder.

In [None]:
import os
from random import randint
import keyboard
import argparse
import time

import gym
import numpy as np
from gym_marioai import levels


all_actions = (0,1,2,3,4,5,6,7,8,9,10,11,12)

#create gym environment
env = gym.make('Marioai-v0', render=True,
               compact_observation=False, #this must stay false for proper saving in dataset
               enabled_actions=all_actions,
               rf_width=20, rf_height=10)

''' 
0 LEFT
1 RIGHT
3 DOWN
4 JUMP
5 SPEED_JUMP
6 SPEED_RIGHT
7 SPEED_LEFT
8 JUMP_RIGHT
9 JUMP_LEFT
10 SPEED_JUMP_RIGHT
11 SPEED_JUMP_LEFT
12 NOTHING
'''

# programmed actions - feel free to change the keys or add actions by yourself
def get_action():
    if keyboard.is_pressed('up'):
        return env.JUMP #4
    elif keyboard.is_pressed('right'):
        return env.SPEED_RIGHT #6
    elif keyboard.is_pressed('left'):
        return env.SPEED_LEFT #7
    elif keyboard.is_pressed('down'):
        return env.DOWN #3
   
    elif keyboard.is_pressed('d'):
        return env.SPEED_JUMP_RIGHT #10
    elif keyboard.is_pressed('a'):
        return env.SPEED_JUMP_LEFT #11
    
    else:
        return env.NOTHING #12


level = levels.coin_level #change to: cliff_level, hard_level, easy_level, coin_level, one_cliff_level, early_cliff_level

counter = 0
#play loop: execute actions from keyboard input in the environment
while True:
    #new episode
    state = env.reset(level_path=level) 
    # you can also choose a specific seed, or a random seed
    ''' 
    seed = np.random.randint(0,1000)
    state = env.reset(seed=seed)
    '''
    
    done = False
    total_reward = 0
    
    #initialize data arrays with initial states for each episode
    observations = [state]
    actions = [12] #nothing
    rewards = [0]
    terminals = [done]

    while not done:
        action = get_action()
        print(action)
        next_state, reward, done, info = env.step(action)
        
        observations.append(next_state)
        actions.append(action)
        rewards.append(reward)
        terminals.append(done)
     
        total_reward += reward
        
    #save the collected episode as raw arrays; preprocess_data.py can later turn them into an MDP dataset
    datafile_name = str(level) + "_" + "reward" + str(int(total_reward)) + "_" + str(round(time.time())) 
    datapath = os.path.join("../data", datafile_name)
    
    #np.savez writes <datapath>.npz and returns None, so there is nothing to assign
    np.savez(datapath, observations=observations, actions=actions, rewards=rewards, terminals=terminals)
    counter += 1
    print(f'finished episode {counter}, total_reward: {total_reward}')


You could run preprocess_data.py to turn your played games into a Markov Decision Process dataset. 
However, training success largely depends on the amount of collected data. No worries, you can use one of the prepared datasets! 
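
The script preprocess_data.py is not shown in this notebook, so the following is only a sketch of the core idea, not its actual implementation: several recorded episodes are concatenated into one flat list per field, with the `terminals` flags marking where each episode ends. The function name `merge_episodes` and the toy episodes are made up for illustration; the field names follow the `np.savez` call in the play loop.

```python
def merge_episodes(episodes):
    """Concatenate per-episode lists into one flat list per field."""
    merged = {"observations": [], "actions": [], "rewards": [], "terminals": []}
    for episode in episodes:
        n_steps = len(episode["observations"])
        for key in merged:
            # every field must describe the same number of steps
            if len(episode[key]) != n_steps:
                raise ValueError(f"field '{key}' does not have {n_steps} entries")
            merged[key].extend(episode[key])
    return merged


# Two made-up episodes; real observations would be receptive-field arrays.
episode_a = {"observations": ["s0", "s1", "s2"], "actions": [12, 6, 4],
             "rewards": [0, 1, 5], "terminals": [False, False, True]}
episode_b = {"observations": ["s0", "s1"], "actions": [12, 10],
             "rewards": [0, -10], "terminals": [False, True]}

merged = merge_episodes([episode_a, episode_b])
print(len(merged["observations"]), sum(merged["terminals"]))
```

Flat arrays of this shape (observations, actions, rewards, terminals) are what an MDP dataset is typically built from.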

![](https://media.giphy.com/media/S5uMJDmtnATLbjjw3h/giphy.gif)

For training, we will use the Convergent DQN.


#### Basics of Deep Q-Learning (DQN)  


##### Bellman equation:

$Q(s,a;\theta) = r + \gamma \max_{a'} Q(s',a';\tilde{\theta})$


##### Temporal difference (TD) error: 
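The TD error is the difference between the predicted value $Q(s,a;\theta)$ and the target $r + \gamma \max_{a'} Q(s',a';\tilde{\theta})$, which combines the reward actually received with the estimated value of the next state.

$\delta = Q(s,a;\theta) - (r + \gamma \max_{a'} Q(s',a';\tilde{\theta}))$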
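
To make the two formulas concrete, here is a small sketch in plain Python with made-up numbers. The function names are hypothetical, and the `done` flag (no bootstrapping after a terminal step) is a common convention that the formulas above do not spell out.

```python
def bellman_target(reward, next_q_values, gamma, done=False):
    """Return r + gamma * max_a' Q(s', a'); only r after a terminal step."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def td_error(q_value, reward, next_q_values, gamma, done=False):
    """Return delta = Q(s, a) - (r + gamma * max_a' Q(s', a'))."""
    return q_value - bellman_target(reward, next_q_values, gamma, done)


# Made-up numbers: the network predicted Q(s, a) = 2.0, the step gave r = 1.0,
# and the target network rates the actions in s' as 0.5, 3.0 and 1.5.
target = bellman_target(1.0, [0.5, 3.0, 1.5], gamma=0.5)
delta = td_error(2.0, 1.0, [0.5, 3.0, 1.5], gamma=0.5)
print(target, delta)  # 2.5 -0.5
```

A negative TD error like this one means the network underestimated the value of the action it took.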


##### Huber Loss:
To minimize the TD error, we use the Huber loss as our loss function. It is quadratic for small errors and linear for large ones, which makes it more robust to outliers than the squared error.

$L(\delta) = \begin{cases} \frac{1}{2} \delta^2 & \text{for } |\delta| \leq 1 \\ |\delta| - \frac{1}{2} & \text{otherwise} \end{cases}$
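
As a quick sanity check, here is a scalar sketch of the Huber loss in plain Python. The switch point is exposed as a hypothetical parameter `beta`; with `beta = 1` the linear branch becomes $|\delta| - \frac{1}{2}$ and both branches meet at $|\delta| = 1$. This works on single numbers only and is not the tensor version asked for in the exercise.

```python
def huber(delta, beta=1.0):
    """Quadratic for |delta| <= beta, linear beyond that."""
    abs_delta = abs(delta)
    if abs_delta <= beta:
        return 0.5 * delta ** 2
    return beta * (abs_delta - 0.5 * beta)


# Small errors are squared, large errors only grow linearly.
for delta in (0.5, 1.0, 3.0, -3.0):
    print(delta, huber(delta))
```

For comparison, the plain squared error $\frac{1}{2}\delta^2$ at $\delta = 3$ would be 4.5, while the Huber loss is only 2.5.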


### Convergent DQN (CDQN)

DQN is a rather simple algorithm that doesn't always converge. The Convergent DQN (https://arxiv.org/pdf/2106.15419.pdf) ensures that the loss converges by taking, for each sample, the maximum of two losses: l_DQN, computed with the target network, and l_MSBE (Mean Squared Bellman Error), computed with the current network. 

But what does the loss actually mean? And why should it converge?

The loss indicates how good or bad the model's prediction was on a sample. If the prediction was perfect, the loss is 0. Otherwise, the loss is larger than 0. 
Training the model, that is, finding a suitable set of weights and biases, should therefore lower the loss on average over all samples. 
Note: The loss is a relative metric that depends on your data. A loss of 0.5 can be low for some problems, but large for others. 

CDQN loss is convergent and performs well in practice. Compared to DQN, CDQN is more stable, independent of data structure. It is defined as follows:


$ l_{\mathrm{DQN}} = (Q(s,a;\theta) - (r + \gamma \max_{a'} Q(s',a';\tilde{\theta})))^2 $

$ l_{\mathrm{MSBE}} = (Q(s,a;\theta) - (r + \gamma \max_{a'} Q(s',a';\theta)))^2 $


$ l_{\mathrm{CDQN}} = \mathbb{E}[\max(l_{\mathrm{DQN}}, l_{\mathrm{MSBE}})] $
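
The following plain-Python sketch shows how the three quantities fit together on a toy batch. It assumes squared errors, as the name Mean Squared Bellman Error suggests, and it approximates the expectation by the batch mean; the function name and the sample layout are made up for illustration and are not taken from cdqn.py.

```python
def cdqn_loss(batch, gamma):
    """Batch mean of max(l_DQN, l_MSBE).

    Each sample is (q_value, reward, next_q_target, next_q_current), where
    next_q_target holds Q(s', a') from the target network and
    next_q_current holds Q(s', a') from the current network.
    """
    total = 0.0
    for q_value, reward, next_q_target, next_q_current in batch:
        l_dqn = (q_value - (reward + gamma * max(next_q_target))) ** 2
        l_msbe = (q_value - (reward + gamma * max(next_q_current))) ** 2
        total += max(l_dqn, l_msbe)  # keep the larger loss per sample
    return total / len(batch)


# Made-up batch of two samples.
batch = [
    (2.0, 1.0, [3.0, 1.0], [2.0, 0.0]),  # l_DQN = 0.25, l_MSBE = 0.0
    (1.0, 0.0, [1.0], [4.0]),            # l_DQN = 0.25, l_MSBE = 1.0
]
print(cdqn_loss(batch, gamma=0.5))  # 0.625
```

In the first sample the DQN term dominates, in the second the MSBE term does, so both networks contribute to the final loss.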

In [None]:
from exercise_dqn import DQN 
from d3rlpy.dataset import MDPDataset
from constants import DATAPATH
import torch

### TODO: Please implement the Huber loss function from above. Note that 'value' describes the actual cumulative reward and 'target' the predicted cumulative reward. ###

def huber_loss(beta, gamma, rewards, target, value):
  
  loss = torch.where(...)  # TODO: replace ... with the condition and both branches of the Huber loss
  
  return loss

In [None]:
### Now, load the dataset and run the DQN algorithm with your implemented loss function ###
import os
import glob
import numpy as np
import d3rlpy
import gym
import gym_marioai
from gym_marioai import levels
import copy
import matplotlib.pyplot as plt 
from exercise_dqn import DQN
from d3rlpy.dataset import MDPDataset
from constants import DATAPATH
from sklearn.model_selection import train_test_split
from d3rlpy.metrics.scorer import td_error_scorer
from d3rlpy.metrics.scorer import evaluate_on_environment

dataset = MDPDataset.load(DATAPATH) #TODO: Choose Dataset here

cdqn = DQN(huber_loss=huber_loss, gamma=0.8, batch_size=128) #named cdqn because the rest of this cell refers to it by that name. TODO: Feel free to experiment with hyperparameters
log_dir="d3rlpy_logs"

train_episodes, test_episodes = train_test_split(dataset, test_size=0.1) 

all_actions = (0,1,2,3,4,5,6,7,8,9,10,11,12)

env = gym.make('Marioai-v0', render=False, seed=0,
               compact_observation=False, #this must stay false for proper saving in dataset
               enabled_actions=all_actions,
               rf_width=20, rf_height=10)

evaluate_scorer = evaluate_on_environment(env)

#TODO: experiment with the number of epochs, shuffling, size of the test set...
fitter = cdqn.fitter(train_episodes, eval_episodes=test_episodes, n_epochs=50, shuffle=True, scorers={'environment': evaluate_scorer, 'td_error': td_error_scorer})


metr = []
for epoch, metrics in fitter:
  metr.append(metrics.get('environment'))
  #Stop training when a reward over 160 is reached
  if metrics.get('environment') > 160:
    break
  
  
#fetch the most recent log directory
latest_logs = max(glob.glob(os.path.join(log_dir, '*/')), key=os.path.getmtime)

#fetch latest model
latest_model = max(glob.iglob(os.path.join(latest_logs, '*.pt')), key=os.path.getctime)
print(latest_model)
#to get specific model (not the latest), change this file path
cdqn.load_model(latest_model)
cdqn.save_policy(latest_logs +'/policy.pt')



After running the previous cell, you should find a new folder in the log_dir directory. It contains a model for each epoch and some logs, like total rewards in environment.csv. Let's try it out! 
Choose a model and run the next cell to evaluate the agent. 
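
If you are unsure which model to pick, the reward log can help. The sketch below assumes that each line of environment.csv has the form `epoch,step,value` without a header row, and that the checkpoint files are named after the step count (as in model_15980.pt); please check both against your own log folder. The function name and the example numbers are made up.

```python
import csv
import io


def best_row(csv_text):
    """Return (epoch, step, reward) of the row with the highest reward."""
    rows = [
        (int(row[0]), int(row[1]), float(row[2]))
        for row in csv.reader(io.StringIO(csv_text))
        if row  # skip empty lines
    ]
    return max(rows, key=lambda row: row[2])


# Made-up log content; in practice, read the text from environment.csv.
example_log = "1,320,12.5\n2,640,87.0\n3,960,54.25\n"
epoch, step, reward = best_row(example_log)
print(f"best epoch: {epoch}, candidate file: model_{step}.pt, reward: {reward}")
```

Keep in mind that a single evaluation reward can be noisy, so it is worth trying a few of the top-scoring models.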

In [None]:
import gym
from gym_marioai import levels
from d3rlpy.dataset import MDPDataset
from constants import DATAPATH
from cdqn import CDQN

dataset = MDPDataset.load(DATAPATH)


### Evaluation of our implemented Convergent DQN algorithm based on the d3rlpy DQN ###
cdqn = CDQN()


#use this instead of cdqn.fit() when the model has already been trained
cdqn.build_with_dataset(dataset)

#choose your model here
cdqn.load_model('../evaluations/hard_level/CDQN/model_15980.pt') #forward slashes avoid invalid escape sequences and work on all platforms

all_actions = (0,1,2,3,4,5,6,7,8,9,10,11,12)

env = gym.make('Marioai-v0', render=True, # turn this off for faster evaluation without video
               level_path=levels.hard_level,
               compact_observation=False, #this must stay false for proper saving in dataset
               enabled_actions=all_actions,
               rf_width=20, rf_height=10)


while True:
    observation = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = cdqn.predict([observation])[0]
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print(f'finished episode, total_reward: {total_reward}')