### Deep Reinforcement Learning for Super Mario World and the potential of RL models in Nuclear Fusion

#### Project Overview
The main goal of this project is to use deep reinforcement learning (DRL) to train an agent to play Super Mario World on the SNES.  
The project will demonstrate how an RL agent takes observations (game frames) as input and learns, through the reward function, to avoid bad outcomes.  
Additionally, the project will explore the potential of DRL for controlling IoT devices, specifically in the context of nuclear fusion (scope still to be determined).  
The project is divided into two parts: (1) training a DRL model to play Super Mario World, and (2) exploring the potential of DRL for controlling IoT devices.  

#### Data Description
The data for this project will come from the OpenAI Gym Retro environment, which provides an emulator for Super Mario World.  
The dataset consists of frames from the game, along with actions taken by the model and the corresponding rewards.  
The frames will need to be preprocessed (converted to grayscale and stacked in groups of four) before they are passed to the DRL model.  
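
The preprocessing step can be illustrated without the emulator. The following is a minimal NumPy sketch, not the notebook's actual pipeline (which relies on `GrayScaleObservation` and `VecFrameStack`). The helper names `to_grayscale` and `FrameStack`, the luma weights, and the 224 x 256 frame size are assumptions made for illustration.

```python
import numpy as np
from collections import deque


def to_grayscale(frame):
    """Convert an (H, W, 3) RGB frame to (H, W, 1) using BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = frame.astype(np.float32) @ weights
    return gray.astype(np.uint8)[..., None]


class FrameStack:
    """Keep the last n grayscale frames, stacked along the channel axis."""

    def __init__(self, n=4):
        self.n = n
        self.frames = deque(maxlen=n)

    def reset(self, frame):
        gray = to_grayscale(frame)
        # Pad with empty frames so the stack always has n channels
        for _ in range(self.n - 1):
            self.frames.append(np.zeros_like(gray))
        self.frames.append(gray)
        return self.observation()

    def step(self, frame):
        # The oldest frame is dropped automatically by the bounded deque
        self.frames.append(to_grayscale(frame))
        return self.observation()

    def observation(self):
        return np.concatenate(list(self.frames), axis=-1)


# Hypothetical random frame standing in for an emulator frame (assumed 224 x 256 RGB)
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(224, 256, 3), dtype=np.uint8)

stack = FrameStack(n=4)
obs = stack.reset(frame)
obs = stack.step(frame)
print(obs.shape)  # (224, 256, 4)
```

Stacking several consecutive frames gives the policy access to motion information (direction and speed) that a single frame cannot convey.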

#### Methodology
The DRL model will be trained using the Proximal Policy Optimization (PPO) algorithm.  
The DRL model will be evaluated using a set of metrics, including the average score achieved and the number of deaths.  
To explore DRL for IoT device control, the project will use a simulated environment to test the model's ability to control the parameters of the system.  
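
For readers unfamiliar with PPO, its core idea is a clipped surrogate objective that limits how far a single update can move the policy. The sketch below is a simplified NumPy illustration of that objective only (no value loss or entropy bonus), not the Stable-Baselines3 implementation. The function name `ppo_clip_loss` and the example numbers are made up for illustration.

```python
import numpy as np


def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_range=0.2):
    """Clipped surrogate policy loss (negated, so lower is better)."""
    # Probability ratio between the updated policy and the data-collecting policy
    ratio = np.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Take the pessimistic (minimum) estimate per sample, then average
    return -np.mean(np.minimum(unclipped, clipped))


# Two hypothetical samples: one action became more likely, one less likely
log_prob_old = np.log(np.array([0.5, 0.5]))
log_prob_new = np.log(np.array([0.75, 0.25]))
advantages = np.array([1.0, -1.0])

loss = ppo_clip_loss(log_prob_new, log_prob_old, advantages)
print(loss)  # approximately -0.2
```

In the first sample the ratio of 1.5 is clipped to 1.2, so the policy gets no extra credit for moving further than the clip range allows.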

#### Timeline
Week 1: Collect and preprocess data for Super Mario World  
Week 2: Train and evaluate the DRL model on Super Mario World  
Week 3: Visualize the model's progress  
Week 4: Explore the potential of DRL in controlling IoT devices  

#### Results Interpretation
The results of the Super Mario World model will be interpreted in terms of the average score achieved and the number of deaths.  
The results of the IoT exploration will be interpreted in terms of the model's ability to control the parameters of the system and prevent failures.  
The results will be presented in a dashboard with graphs and visualizations that make them easier to interpret.  
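
As a sketch of how the two headline metrics could be computed, the snippet below aggregates per-episode records into an average score and a death count. The record layout (`score`, `died`) and the numbers are placeholders for illustration, not results from this project.

```python
from statistics import mean


def summarize_episodes(episodes):
    """Aggregate per-episode records into the metrics used for interpretation."""
    scores = [ep["score"] for ep in episodes]
    deaths = sum(1 for ep in episodes if ep["died"])
    return {
        "episodes": len(episodes),
        "average_score": mean(scores),
        "deaths": deaths,
    }


# Placeholder episode records (hypothetical values)
episodes = [
    {"score": 1200, "died": True},
    {"score": 3400, "died": False},
    {"score": 800, "died": True},
]

summary = summarize_episodes(episodes)
print(summary)
```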

#### Deployment and Delivery
The DRL model will be deployed as a standalone application that can be run on a desktop computer.  
The application will be delivered with a user manual that provides instructions on how to use the application.  
Additionally, a dashboard will be provided to showcase the results of the project.  

#### Conclusion
The project will demonstrate the ability of DRL models to learn from experience and avoid bad outcomes.  
Exploring DRL for IoT device control will provide insights into how this technology can be applied in real-world scenarios.  
Overall, this project will showcase the potential of DRL for solving complex problems and preventing failures in complex systems.  

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import os
import time

import gym
import retro

import torch

import gym.envs.classic_control as control
from gym.wrappers import GrayScaleObservation

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.vec_env import VecFrameStack, DummyVecEnv
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.atari_wrappers import MaxAndSkipEnv

In [None]:
%load_ext tensorboard

In [None]:
torch.cuda.is_available()

In [None]:
class Discretizer(gym.ActionWrapper):

    def __init__(self, env, combos):
        super().__init__(env)
        assert isinstance(env.action_space, gym.spaces.MultiBinary)
        buttons = env.unwrapped.buttons
        self._decode_discrete_action = []
        for combo in combos:
            arr = np.array([False] * env.action_space.n)
            for button in combo:
                arr[buttons.index(button)] = True
            self._decode_discrete_action.append(arr)

        self.action_space = gym.spaces.Discrete(len(self._decode_discrete_action))

    def action(self, act):
        return self._decode_discrete_action[act].copy()


class MaRLioDiscretizer(Discretizer):

    def __init__(self, env):
        super().__init__(env=env, 
                         combos=[
            [],
            ["RIGHT"],
            ["RIGHT","Y"],
            ["RIGHT","A"],
            ["RIGHT","Y","A"],
            ["RIGHT","Y","B"],
            ["LEFT"],
            ["LEFT","Y"],
            ["LEFT","A"],
            ["LEFT","Y","A"],
            ["LEFT","Y","B"],
            ["A"],
            ["B"]
            ]
        )
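
To make the discretization concrete, the sketch below shows how one button combo is turned into the boolean array the emulator expects, without needing the emulator itself. The button order in `BUTTONS` is an assumption about the SNES layout in Gym Retro; the authoritative list is `env.unwrapped.buttons`. The helper `combo_to_mask` is hypothetical and only mirrors the loop inside `Discretizer.__init__`.

```python
import numpy as np

# Assumed SNES button order; verify against env.unwrapped.buttons
BUTTONS = ["B", "Y", "SELECT", "START", "UP", "DOWN",
           "LEFT", "RIGHT", "A", "X", "L", "R"]


def combo_to_mask(combo, buttons=BUTTONS):
    """Return a boolean array with True at the index of every pressed button."""
    mask = np.zeros(len(buttons), dtype=bool)
    for button in combo:
        mask[buttons.index(button)] = True
    return mask


mask = combo_to_mask(["RIGHT", "Y", "B"])
print(mask.astype(int))  # [1 1 0 0 0 0 0 1 0 0 0 0]
```

Restricting the agent to 13 meaningful combos instead of all 2^12 button patterns shrinks the action space the policy has to explore.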

In [None]:
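# Run only ONE of the next three cells: Gym Retro allows a single emulator
# instance per process (call env.close() before creating another env).

# for training: gameplay is recorded to ./recordings/train/
env = retro.make(game="SuperMarioWorld-Snes", state="YoshiIsland3", record="./recordings/train/")

In [None]:
# for prediction: gameplay is recorded to ./recordings/pred/

env = retro.make(game="SuperMarioWorld-Snes", state="YoshiIsland3", record="./recordings/pred/")

In [None]:
# for testing: no recording

env = retro.make(game="SuperMarioWorld-Snes", state="YoshiIsland3")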

In [None]:
combos = [
    [],
    ["RIGHT"],
    ["RIGHT","Y"],
    ["RIGHT","A"],
    ["RIGHT","Y","A"],
    ["RIGHT","Y","B"],
    ["LEFT"],
    ["LEFT","Y"],
    ["LEFT","A"],
    ["LEFT","Y","A"],
    ["LEFT","Y","B"],
    ["A"],
    ["B"]
    ]

In [None]:
disc_env = MaRLioDiscretizer(env)

In [None]:
obs = disc_env.reset()

In [None]:
action = disc_env.action_space.sample()
print(action)
combos[action]

In [None]:
obs, reward, terminated, info = disc_env.step(action)
print(f"score: {reward}\nterminated: {terminated}\ninfo: {info}")

In [None]:
plt.matshow(obs)

In [None]:
# game loop with random actions
state = disc_env.reset()

done = False
while not done:
    action = disc_env.action_space.sample()
    state, reward, done, info = disc_env.step(action)
    disc_env.render()
    time.sleep(0.005)
disc_env.render(close=True)

In [None]:
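# preprocess: grayscale, vectorize and frame stacking
disc_env = GrayScaleObservation(disc_env, keep_dim=True)
# MaxAndSkipEnv is a gym.Wrapper, so if enabled it must wrap the env here, before vectorizing
# disc_env = MaxAndSkipEnv(disc_env, 4)
gray_env = disc_env  # bind the wrapped env so the lambda does not depend on later reassignment
disc_env = DummyVecEnv([lambda: gray_env])
disc_env = VecFrameStack(disc_env, 4, channels_order="last")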

In [None]:
state = disc_env.reset()

In [None]:
# show only the most recent frame of the stack (the last channel)
plt.matshow(state[0][:, :, -1])

In [None]:
state.shape

In [None]:
state, reward, done, info = disc_env.step([disc_env.action_space.sample()])

In [None]:
plt.figure(figsize=(20,16))
for idx in range(state.shape[3]):
    plt.subplot(1,4,idx+1)
    plt.imshow(state[0][:,:,idx])
plt.show()

In [None]:
state = disc_env.reset()

In [None]:
# callback helper: periodic model checkpointing
class TrainAndLoggingCallback(BaseCallback):
    """Save the model to save_path every check_freq callback steps."""

    def __init__(self, check_freq, save_path, verbose=1):
        super().__init__(verbose)
        self.check_freq = check_freq
        self.save_path = save_path

    def _init_callback(self):
        # create the checkpoint directory once, before training starts
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)

    def _on_step(self):
        # n_calls counts how many times the callback has been invoked
        if self.save_path is not None and self.n_calls % self.check_freq == 0:
            model_path = os.path.join(self.save_path, "best_model_{}".format(self.n_calls))
            self.model.save(model_path)

        # returning True keeps training running
        return True

In [None]:
CHECKPOINT_DIR = "./train/"
LOG_DIR = "./logs/"

In [None]:
callback = TrainAndLoggingCallback(check_freq=10_000, save_path=CHECKPOINT_DIR)

In [None]:
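# linear schedules: SB3 calls these with the remaining training progress f,
# which goes from 1.0 (start) to 0.0 (end), so both values decay linearly to zero

lr = lambda f: f * 0.0003
clipr = lambda f: f * 0.2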
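
The effect of such a schedule can be checked in isolation. Below is a small sketch using a named factory function instead of a lambda; `linear_schedule` and `lr_schedule` are illustrative names, not part of the notebook.

```python
def linear_schedule(initial_value):
    """Return a schedule that decays linearly from initial_value to 0."""

    def schedule(progress_remaining):
        # progress_remaining is 1.0 at the start of training and 0.0 at the end
        return progress_remaining * initial_value

    return schedule


lr_schedule = linear_schedule(3e-4)
for progress in (1.0, 0.5, 0.0):
    print(progress, lr_schedule(progress))
```

A named factory like this can be passed to `PPO` in the same way as the lambdas and is easier to reuse for both the learning rate and the clip range.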

In [None]:
model = PPO(
    "CnnPolicy", 
    disc_env, 
    learning_rate=lr, 
    n_steps=256, 
    batch_size=64, 
    n_epochs=10, 
    gamma=0.99, 
    gae_lambda=0.95, 
    clip_range=clipr, 
    clip_range_vf=None, 
    normalize_advantage=True, 
    ent_coef=0.01, 
    vf_coef=0.5, 
    max_grad_norm=0.5, 
    use_sde=False, 
    sde_sample_freq=-1, 
    target_kl=None,  
    tensorboard_log=LOG_DIR, 
    policy_kwargs=None, 
    verbose=0, 
    seed=42, 
    device='auto', 
    _init_setup_model=True
    )
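
The `gamma` and `gae_lambda` arguments control how rewards are turned into advantage estimates. The following is a simplified sketch of generalized advantage estimation (GAE) for a single rollout with no episode boundary inside it. It is an illustration under those assumptions, not the Stable-Baselines3 code, and `compute_gae` is a hypothetical helper name.

```python
import numpy as np


def compute_gae(rewards, values, last_value, gamma=0.99, gae_lambda=0.95):
    """Compute GAE advantages for one rollout without episode terminations."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    next_value = last_value
    # Walk the rollout backwards so each step can reuse the next step's estimate
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * gae_lambda * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


# Hypothetical two-step rollout with zero value estimates
adv = compute_gae(rewards=[1.0, 1.0], values=[0.0, 0.0], last_value=0.0)
print(adv)
```

With `gae_lambda` close to 1 the estimate leans on long-horizon returns (lower bias, higher variance); smaller values lean on the value function instead.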

In [None]:
model.learn(
    total_timesteps=100_000, 
    progress_bar=True, 
    tb_log_name="ppo",
    callback=[
    callback
    ]
)

In [None]:
marlio = "best_model_60000.zip"
model = PPO.load(f"./train/{marlio}")
print(f"using {marlio}")

In [None]:
state = disc_env.reset()

In [None]:
# game loop for predict: print the button combo whenever the chosen action changes

done = False

last_action = 0
while not done:

    action, _ = model.predict(state)
    if action[0] != last_action:
        last_action = action[0]
        print(combos[last_action])
    state, reward, done, info = disc_env.step(action)

    disc_env.render()
    time.sleep(0.004)

In [None]:
env.data.list_variables()

In [None]:
# playback of a recorded .bk2 movie
# (close any open env first: Gym Retro allows one emulator instance per process)

movie = retro.Movie("./recordings/train/SuperMarioWorld-Snes-YoshiIsland3-000008.bk2")
movie.step()

In [None]:
env = retro.make(
    game=movie.get_game(),
    state=None,
    # bk2s can contain any button presses, so allow everything
    use_restricted_actions=retro.Actions.ALL,
    players=movie.players,
)
env.initial_state = movie.get_state()
env.reset()

In [None]:
while movie.step():
    keys = []
    for p in range(movie.players):
        for i in range(env.num_buttons):
            keys.append(movie.get_key(i, p))
    env.step(keys)
    env.render()
    # time.sleep(0.004)

In [None]:
import imageio


def record_video(env, model, out_path, fps=30):
    """Roll out one episode with an SB3 model on a VecEnv and save the frames to out_path."""
    images = []
    done = False
    state = env.reset()
    images.append(env.render(mode="rgb_array"))
    while not done:
        # deterministic=True takes the most likely action instead of sampling
        action, _ = model.predict(state, deterministic=True)
        state, reward, done, info = env.step(action)
        images.append(env.render(mode="rgb_array"))
    imageio.mimsave(out_path, [np.array(img) for img in images], fps=fps)