# Reinforcement Learning baseline in Python with stable-baselines3
This notebook will make your life easier if you want to try Reinforcement Learning (RL) on Kaggle's Kore 2022 challenge.
One of the (many) difficulties of RL is getting a clean implementation. You can of course build one of the RL models
described in the literature yourself, but chances are that you will spend more time debugging your model than actually competing.

[stable-baselines3](https://stable-baselines3.readthedocs.io/en/master/#) is a powerful RL library with a number of very nice features for this competition:
- It implements the most popular modern deep RL algorithms
- It is simple and elegant to use
- It is rather well documented
- There are plenty of tutorials and examples

In other words, it's a fantastic starting point. Alas, it requires an environment compatible with OpenAI-gym and the kore environment is not. What you'll find in this notebook is **KoreGymEnv**, a wrapper around the kore env that makes it play nice with stable-baselines3. It includes very simple feature and action engineering, so the only thing you need to care about is building upon them, choosing a model and throwing yourself into the cold, unforgiving and yet very rewarding reinforcement learning waters ;)

As a bonus, this notebook also demonstrates the end-to-end process that you need to follow to submit any model with external dependencies. Click on submit and you're good to go.

#### Notes:

- In stable-baselines3, states and actions are numpy arrays. In the kore environment, states are lists of dicts and actions are dicts with shipyard ids as keys and shipyard actions as values. Thus, we need an interface to "translate" them. This interface is effectively where you implement your state & action engineering. You'll find more details in the KoreGymEnv class.
- In the ideal case, you would use self-play and let your agent play a very large number of games against itself, improving at every step. Unfortunately, [it is not clear how to implement self-play in the kore env](https://www.kaggle.com/competitions/kore-2022/discussion/323382). So we have to train against static opponents. In this baseline, we'll use the starter bot. Of course, nothing prevents you from implementing pseudo-self-play and training against ever-improving versions of your agent.
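
To make the first note concrete, here is a minimal sketch of such a translation layer for actions. The function name `decode_action`, the shipyard ids and the `MAX_FLEET_SIZE` cap are hypothetical and only serve the illustration. The action strings follow the `SPAWN_<n>` / `LAUNCH_<n>_<plan>` form that, as far as I can tell, kore's `ShipyardAction` serializes to. The actual `KoreGymEnv` works on `Board` objects instead of raw strings.

```python
from typing import Dict, List, Sequence

MAX_FLEET_SIZE = 150  # hypothetical cap for this sketch
DIRECTIONS = "NESW"   # index 0 = North, 1 = East, 2 = South, 3 = West


def decode_action(gym_action: Sequence[float], shipyard_ids: List[str]) -> Dict[str, str]:
    """Map a 2-element action in [-1, 1] to a {shipyard_id: action_string} dict."""
    magnitude = min(abs(gym_action[0]), 1.0)
    # Linear map from [0, 1] to [1, MAX_FLEET_SIZE]
    n_ships = int(1 + magnitude * (MAX_FLEET_SIZE - 1))

    if gym_action[0] > 0:
        # Clamp the second component, then map [-1, 1] to an index in 0..3
        clamped = min(max(gym_action[1], -1.0), 1.0)
        direction = DIRECTIONS[round((clamped + 1) * 1.5)]
        command = f"LAUNCH_{n_ships}_{direction}"
    elif gym_action[0] < 0:
        command = f"SPAWN_{n_ships}"
    else:
        return {}  # Wait: no shipyard gets an action

    # Broadcast the same command to every shipyard
    return {shipyard_id: command for shipyard_id in shipyard_ids}


print(decode_action([0.5, -1.0], ["0-1"]))
print(decode_action([-1.0, 0.0], ["0-1", "0-2"]))
```

Note that this sketch does not check whether a shipyard has enough ships or the player enough kore. A real interface has to validate that.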

## tl;dr

```python
# Train a PPO agent
from environment import KoreGymEnv
from stable_baselines3 import PPO

kore_env = KoreGymEnv()
model = PPO('MlpPolicy', kore_env, verbose=1)
model.learn(total_timesteps=100000)
```

# Dependencies

In [None]:
!pip install --target=lib --no-deps stable-baselines3 gym

#### A note on dependencies
The kaggle notebook environment and the actual competition environment are different. I couldn't find any documentation on the differences other than through comments from more experienced kagglers. So let's take a minute to understand the cell below. I hope that this information saves fellow competitors a lot of time and trial-and-error!

`stable-baselines3` is not (yet) a part of the kaggle docker environment, so we have to install it manually. In the notebook environment, you start at `/kaggle/working/`, so the cell above installs the libraries into `/kaggle/working/lib/`. We now have two options to load the library: `import lib.stable_baselines3`, or add `/kaggle/working/lib/` to [sys.path](https://docs.python.org/3/library/sys.html#sys.path), which tells Python where to look for modules.

When you submit your agent as an archive, however, your code is unzipped to `/kaggle_simulations/agent/`, _but the working directory remains `/kaggle/working/`_. In the competition env, neither of the options above works, because `lib` is no longer at `/kaggle/working/lib`: it has been unzipped with the rest of your code to `/kaggle_simulations/agent/lib`. Surprise!

The code below then checks whether we are in the simulation environment, and adds the right location of the external dependencies to `sys.path`.

Additionally, there is a limit on the submission size, which is why we install with `--no-deps`.

In [None]:
import os
import sys
KAGGLE_AGENT_PATH = "/kaggle_simulations/agent/"
if os.path.exists(KAGGLE_AGENT_PATH):
    sys.path.insert(0, os.path.join(KAGGLE_AGENT_PATH, 'lib'))
else:
    sys.path.insert(0, os.path.join(os.getcwd(), 'lib'))
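
Since the submission size is limited, it can help to check how much space the `lib` folder takes before packaging. The helper below is a small sketch using only the standard library. The name `directory_size_mb` is made up for this example, and the number it reports is the uncompressed size, so the archive itself will be smaller.

```python
import os


def directory_size_mb(path: str) -> float:
    """Return the total size of all regular files under `path`, in megabytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):  # Skip broken symlinks
                total_bytes += os.path.getsize(file_path)
    return total_bytes / 1e6


# Only report if the folder exists, e.g. after the pip install cell has run
if os.path.isdir("lib"):
    print(f"lib takes {directory_size_mb('lib'):.1f} MB before compression")
```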

# Utils

### Config

In [None]:
%%writefile config.py
import numpy as np
from kaggle_environments import make

# Read env specification
ENV_SPECIFICATION = make('kore_fleets').specification
SHIP_COST = ENV_SPECIFICATION.configuration.spawnCost.default
SHIPYARD_COST = ENV_SPECIFICATION.configuration.convertCost.default
GAME_CONFIG = {
    'episodeSteps':  ENV_SPECIFICATION.configuration.episodeSteps.default,  # You might want to start with smaller values
    'size': ENV_SPECIFICATION.configuration.size.default,
    'maxLogLength': None
}

# Define your opponent. We'll use the starter bot in the notebook environment for this baseline.
OPPONENT = 'opponent.py'
GAME_AGENTS = [None, OPPONENT]

# Define our parameters
N_FEATURES = 4
ACTION_SIZE = (2,)
DTYPE = np.float64
MAX_OBSERVABLE_KORE = 500
MAX_OBSERVABLE_SHIPS = 200
MAX_ACTION_FLEET_SIZE = 150
MAX_KORE_IN_RESERVE = 40000
WIN_REWARD = 1000

In [None]:
%%writefile opponent.py
from kaggle_environments.envs.kore_fleets.helpers import *


# This is just the starter bot. Replace it with the agent of your choice.
def agent(obs, config):
    board = Board(obs, config)

    me = board.current_player
    turn = board.step
    spawn_cost = board.configuration.spawn_cost
    kore_left = me.kore

    for shipyard in me.shipyards:
        if shipyard.ship_count > 10:
            direction = Direction.from_index(turn % 4)
            action = ShipyardAction.launch_fleet_with_flight_plan(2, direction.to_char())
            shipyard.next_action = action
        elif kore_left > spawn_cost * shipyard.max_spawn:
            action = ShipyardAction.spawn_ships(shipyard.max_spawn)
            shipyard.next_action = action
            kore_left -= spawn_cost * shipyard.max_spawn
        elif kore_left > spawn_cost:
            action = ShipyardAction.spawn_ships(1)
            shipyard.next_action = action
            kore_left -= spawn_cost

    return me.next_actions

### Reward utilities

In [None]:
%%writefile reward_utils.py
from config import GAME_CONFIG, SHIP_COST, SHIPYARD_COST
from kaggle_environments.envs.kore_fleets.helpers import Board
import numpy as np
from math import floor

# Compute weight constants -- See get_board_value's docstring
_max_steps = GAME_CONFIG['episodeSteps']
_end_of_asset_value = floor(.5 * _max_steps)
_weights_assets = np.linspace(start=1, stop=0, num=_end_of_asset_value)
_weights_kore = np.linspace(start=0, stop=1, num=_end_of_asset_value)
WEIGHTS_ASSETS = np.append(_weights_assets, np.zeros(_max_steps - _end_of_asset_value))
WEIGHTS_KORE = np.append(_weights_kore, np.ones(_max_steps - _end_of_asset_value))
WEIGHTS_MAX_SPAWN = {x: (x+3)/4 for x in range(1, 11)}  # Value multiplier of a shipyard as a function of its max spawn
WEIGHTS_KORE_IN_FLEETS = WEIGHTS_KORE * WEIGHTS_ASSETS/2  # Always equal or smaller than either, almost always smaller


def get_board_value(board: Board) -> float:
    """Computes the board value for the current player.

    The board value captures how we are currently performing, compared to the opponent. Each player's partial board
    value assesses the player's situation, taking into account their current kore, ship count, shipyard count
    (including their max spawn) and kore carried by fleets. We then define the board value as the difference between
    the players' partial board values.
    Flight plans and the positioning of fleet and shipyards do not flow into the board value (yet).

    To keep things simple, we'll take a weighted sum as the partial board value. We need weighting since
    the importance of each item changes over time. We don't need to have the most kore at the beginning of the game,
    but we do at the end. Ship count won't help us win games in the latter stages, but it is crucial in the beginning.
    Fleets and shipyards will be accounted for proportionally to their kore cost.

    For efficiency, the weight factors are pre-computed at module level. Here is the logic behind the weighting:
    WEIGHTS_KORE: Applied to the player's kore count. Increases linearly from 0 to 1. It reaches one before
        the maximum game length is reached.
    WEIGHTS_ASSETS: Applied to fleets and shipyards. Decreases linearly from 1 to 0 and reaches zero before the maximum
        length. It emphasizes the need of having ships over kore at the beginning of the game.
    WEIGHTS_MAX_SPAWN: Shipyard value is multiplied by its max spawn. This captures the idea that long-held shipyards
        are more valuable.
    WEIGHTS_KORE_IN_FLEETS: Kore in fleets should be valued, too. But its value must be upper-bounded by WEIGHTS_KORE
        (it can never be better to have kore in cargo than home) and it must decrease in time, since it doesn't
        count towards the end kore count.

    Args:
        board: The board for which we want to compute the value.

    Returns:
        The value of the board.
    """
    board_value = 0
    if not board:
        return board_value

    # Get the weights as a function of the current game step
    step = board.step
    weight_kore, weight_assets, weight_cargo = WEIGHTS_KORE[step], WEIGHTS_ASSETS[step], WEIGHTS_KORE_IN_FLEETS[step]

    # Compute the partial board values
    for player in board.players.values():
        player_fleets, player_shipyards = list(player.fleets), list(player.shipyards)

        value_kore = weight_kore * player.kore

        value_fleets = weight_assets * SHIP_COST * (
                sum(fleet.ship_count for fleet in player_fleets)
                + sum(shipyard.ship_count for shipyard in player_shipyards)
        )

        value_shipyards = weight_assets * SHIPYARD_COST * (
            sum(shipyard.max_spawn * WEIGHTS_MAX_SPAWN[shipyard.max_spawn] for shipyard in player_shipyards)
        )

        value_kore_in_cargo = weight_cargo * sum(fleet.kore for fleet in player_fleets)

        # Add (or subtract) the partial values to the total board value. The current player is always us.
        modifier = 1 if player.is_current_player else -1
        board_value += modifier * (value_kore + value_fleets + value_shipyards + value_kore_in_cargo)

    return board_value
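
To get a feeling for the weighting described in `get_board_value`'s docstring, the sketch below rebuilds the three schedules with plain Python lists and prints them at a few steps. It assumes an episode length of 400 steps (in the module above the value is read from the environment specification), and the helper names `linspace` and `weight_schedules` are only used here.

```python
from math import floor
from typing import List, Tuple


def linspace(start: float, stop: float, num: int) -> List[float]:
    """Evenly spaced values from start to stop, both included (like np.linspace)."""
    return [start + (stop - start) * i / (num - 1) for i in range(num)]


def weight_schedules(max_steps: int) -> Tuple[List[float], List[float], List[float]]:
    """Return the (kore, assets, cargo) weights for every game step."""
    ramp_end = floor(0.5 * max_steps)
    tail = max_steps - ramp_end
    kore = linspace(0, 1, ramp_end) + [1.0] * tail    # Ramps up, then stays at 1
    assets = linspace(1, 0, ramp_end) + [0.0] * tail  # Ramps down, then stays at 0
    cargo = [k * a / 2 for k, a in zip(kore, assets)]
    return kore, assets, cargo


kore, assets, cargo = weight_schedules(max_steps=400)  # 400 is an assumed episode length
for step in (0, 100, 199, 399):
    print(f"step {step:3d}: kore={kore[step]:.3f} assets={assets[step]:.3f} cargo={cargo[step]:.3f}")
```

Under this assumption the cargo weight never exceeds 0.125, peaks around step 100 and is zero both at step 0 and from step 199 onwards, so kore carried by fleets only contributes to the board value during the first half of the game.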

# The KoreGymEnv wrapper

In [None]:
%%writefile environment.py
import gym
import numpy as np
from gym import spaces
from math import floor
from kaggle_environments import make
from kaggle_environments.envs.kore_fleets.helpers import ShipyardAction, Board, Direction
from typing import Union, Tuple, Dict
from reward_utils import get_board_value
from config import (
    N_FEATURES,
    ACTION_SIZE,
    GAME_AGENTS,
    GAME_CONFIG,
    DTYPE,
    MAX_OBSERVABLE_KORE,
    MAX_OBSERVABLE_SHIPS,
    MAX_ACTION_FLEET_SIZE,
    MAX_KORE_IN_RESERVE,
    WIN_REWARD,
)


class KoreGymEnv(gym.Env):
    """An openAI-gym env wrapper for kaggle's kore environment. Can be used with stable-baselines3.

    There are three fundamental components to this class which you would want to customize for your own agents:
        The action space is defined by `action_space` and `gym_to_kore_action()`
        The state space (observations) is defined by `observation_space` and `obs_as_gym_state()`
        The reward is computed with `compute_reward()`

    Note that the action and state spaces define the inputs and outputs to your model *as numpy arrays*. Use the
    functions mentioned above to translate these arrays into actual kore environment observations and actions.

    The rest is basically boilerplate and makes sure that the kaggle environment plays nicely with stable-baselines3.

    Usage:
        >>> from stable_baselines3 import PPO
        >>>
        >>> kore_env = KoreGymEnv()
        >>> model = PPO('MlpPolicy', kore_env, verbose=1)
        >>> model.learn(total_timesteps=100000)
    """

    def __init__(self, config=None, agents=None, debug=None):
        super(KoreGymEnv, self).__init__()

        if not config:
            config = GAME_CONFIG
        if not agents:
            agents = GAME_AGENTS
        if debug is None:
            debug = True

        self.agents = agents
        self.env = make("kore_fleets", configuration=config, debug=debug)
        self.config = self.env.configuration
        self.trainer = None
        self.raw_obs = None
        self.previous_obs = None

        # Define the action and state space
        # Change these to match your needs. Normalization to the [-1, 1] interval is recommended. See:
        # https://araffin.github.io/slides/rlvs-tips-tricks/#/13/0/0
        # See https://www.gymlibrary.ml/content/spaces/ for more info on OpenAI-gym spaces.
        self.action_space = spaces.Box(
            low=-1,
            high=1,
            shape=ACTION_SIZE,
            dtype=DTYPE
        )

        self.observation_space = spaces.Box(
            low=-1,
            high=1,
            shape=(self.config.size ** 2 * N_FEATURES + 3,),
            dtype=DTYPE
        )

        self.strict_reward = config.get('strict', False)

        # Debugging info - Enable or disable as needed
        self.reward = 0
        self.n_steps = 0
        self.n_resets = 0
        self.n_dones = 0
        self.last_action = None
        self.last_done = False

    def reset(self) -> np.ndarray:
        """Resets the trainer and returns the initial observation in state space.

        Returns:
            self.obs_as_gym_state: the current observation encoded as a state in state space
        """
        # agents = self.agents if np.random.rand() > .5 else self.agents[::-1]  # Randomize starting position
        self.trainer = self.env.train(self.agents)
        self.raw_obs = self.trainer.reset()
        self.n_resets += 1
        return self.obs_as_gym_state

    def step(self, action: np.ndarray) -> Tuple[np.ndarray, float, bool, Dict]:
        """Execute action in the trainer and return the results.

        Args:
            action: The action in action space, i.e. the output of the stable-baselines3 agent

        Returns:
            self.obs_as_gym_state: the current observation encoded as a state in state space
            reward: The agent's reward
            done: If True, the episode is over
            info: A dictionary with additional debugging information
        """
        kore_action = self.gym_to_kore_action(action)
        self.previous_obs = self.raw_obs
        self.raw_obs, _, done, info = self.trainer.step(kore_action)  # Ignore trainer reward, which is just delta kore
        self.reward = self.compute_reward(done)

        # Debugging info
        # with open('logs/tmp.log', 'a') as log:
        #    print(kore_action.action_type, kore_action.num_ships, kore_action.flight_plan, file=log)
        #    if done:
        #        print('done', file=log)
        #    if info:
        #        print('info', file=log)
        self.n_steps += 1
        self.last_done = done
        self.last_action = kore_action
        self.n_dones += 1 if done else 0

        return self.obs_as_gym_state, self.reward, done, info

    def render(self, **kwargs):
        self.env.render(**kwargs)

    def close(self):
        pass

    @property
    def board(self):
        return Board(self.raw_obs, self.config)

    @property
    def previous_board(self):
        return Board(self.previous_obs, self.config)

    def gym_to_kore_action(self, gym_action: np.ndarray) -> Dict[str, str]:
        """Decode an action in action space as a kore action.

        In other words, transform a stable-baselines3 action into an action compatible with the kore environment.

        This method is central - It defines how the agent output is mapped to kore actions.
        You can modify it to suit your needs.

        Let's start with an übereasy mapping. Our gym_action is a 1-dimensional vector of size 2 (as defined in
        self.action_space). We will interpret the values as follows:
        if gym_action[0] > 0 launch a fleet, elif < 0 build ships, else wait.
        abs(gym_action[0]) encodes the number of ships to build/launch.
        gym_action[1] represents the direction in which to launch the fleet.

        Notes: The same action is sent to all shipyards, though we make sure that the actions are valid.

        Args:
            gym_action: The action produced by our stable-baselines3 agent.

        Returns:
            The corresponding kore environment actions. Shipyards that should wait get no action.

        """
        action_launch = gym_action[0] > 0
        action_build = gym_action[0] < 0
        # Mapping the number of ships is an interesting exercise. Here we chose a linear mapping to the interval
        # [1, MAX_ACTION_FLEET_SIZE], but you could use something else. With a linear mapping, all values are
        # evenly spaced. An exponential mapping, however, would space out lower values, making them easier for the agent
        # to distinguish and choose, at the cost of needing more precision to accurately select higher values.
        number_of_ships = int(
            clip_normalize(
                x=abs(gym_action[0]),
                low_in=0,
                high_in=1,
                low_out=1,
                high_out=MAX_ACTION_FLEET_SIZE
            )
        )

        # Broadcast the same action to all shipyards
        board = self.board
        me = board.current_player
        kore_left = me.kore  # Track spending so that later shipyards don't overspend
        for shipyard in me.shipyards:
            action = None
            if action_build:
                # Limit the number of ships to the maximum that can be actually built.
                # Use a per-shipyard variable so that one shipyard's limit doesn't shrink the request for the next.
                max_spawn = shipyard.max_spawn
                max_purchasable = floor(kore_left / self.config["spawnCost"])
                ships_to_build = min(number_of_ships, max_spawn, max_purchasable)
                if ships_to_build:
                    action = ShipyardAction.spawn_ships(number_ships=ships_to_build)
                    kore_left -= ships_to_build * self.config["spawnCost"]

            elif action_launch:
                # Limit the number of ships to the amount that is actually present in the shipyard
                ships_to_launch = min(number_of_ships, shipyard.ship_count)
                if ships_to_launch:
                    direction = int(round((gym_action[1] + 1) * 1.5))  # int between 0 (North) and 3 (West)
                    action = ShipyardAction.launch_fleet_in_direction(number_ships=ships_to_launch,
                                                                      direction=Direction.from_index(direction))
            shipyard.next_action = action

        return me.next_actions

    @property
    def obs_as_gym_state(self) -> np.ndarray:
        """Return the current observation encoded as a state in state space.

        In other words, transform a kore observation into a stable-baselines3-compatible np.ndarray.

        This property is central - It defines how the kore board is mapped to our state space.
        You can modify it to include as many features as you see convenient.

        Let's start with something easy: Define a 21x21x4+3 state (size x size x n_features and 3 extra features).
        # Feature 0: How much kore there is in a cell
        # Feature 1: How many ships there are in a cell (>0: friendly, <0: enemy)
        # Feature 2: Fleet direction
        # Feature 3: Is a shipyard present? (1: friendly, -1: enemy, 0: no)
        # Feature 4: Progress - What turn is it?
        # Feature 5: How much kore do I have?
        # Feature 6: How much kore does the opponent have?

        We'll make sure that all features are in the range [-1, 1] and as close to a normal distribution as possible.

        Note: This mapping doesn't tackle a critical issue in kore: How to encode (full) flight plans?
        """
        # Init output state
        gym_state = np.zeros(shape=(self.config.size, self.config.size, N_FEATURES), dtype=DTYPE)

        # Get our player ID
        board = self.board
        our_id = board.current_player_id

        for point, cell in board.cells.items():
            # Feature 0: How much kore
            gym_state[point.y, point.x, 0] = cell.kore

            # Feature 1: How many ships (>0: friendly, <0: enemy)
            # Feature 2: Fleet direction
            fleet = cell.fleet
            if fleet:
                modifier = 1 if fleet.player_id == our_id else -1
                gym_state[point.y, point.x, 1] = modifier * fleet.ship_count
                gym_state[point.y, point.x, 2] = fleet.direction.value
            else:
                # The current cell has no fleet
                gym_state[point.y, point.x, 1] = gym_state[point.y, point.x, 2] = 0

            # Feature 3: Shipyard present (1: friendly, -1: enemy)
            shipyard = cell.shipyard
            if shipyard:
                gym_state[point.y, point.x, 3] = 1 if shipyard.player_id == our_id else -1
            else:
                # The current cell has no shipyard
                gym_state[point.y, point.x, 3] = 0

        # Normalize features to interval [-1, 1]
        # Feature 0: Logarithmic scale, kore in range [0, MAX_OBSERVABLE_KORE]
        gym_state[:, :, 0] = clip_normalize(
            x=np.log2(gym_state[:, :, 0] + 1),
            low_in=0,
            high_in=np.log2(MAX_OBSERVABLE_KORE)
        )

        # Feature 1: Ships in range [-MAX_OBSERVABLE_SHIPS, MAX_OBSERVABLE_SHIPS]
        gym_state[:, :, 1] = clip_normalize(
            x=gym_state[:, :, 1],
            low_in=-MAX_OBSERVABLE_SHIPS,
            high_in=MAX_OBSERVABLE_SHIPS
        )

        # Feature 2: Fleet direction in range (1, 4)
        gym_state[:, :, 2] = clip_normalize(
            x=gym_state[:, :, 2],
            low_in=1,
            high_in=4
        )

        # Feature 3 is already as normal as it gets

        # Flatten the input (recommended by stable_baselines3.common.env_checker.check_env)
        output_state = gym_state.flatten()

        # Extra Features: Progress, how much kore do I have, how much kore does opponent have
        player = board.current_player
        opponent = board.opponents[0]
        progress = clip_normalize(board.step, low_in=0, high_in=GAME_CONFIG['episodeSteps'])
        my_kore = clip_normalize(np.log2(player.kore+1), low_in=0, high_in=np.log2(MAX_KORE_IN_RESERVE))
        opponent_kore = clip_normalize(np.log2(opponent.kore+1), low_in=0, high_in=np.log2(MAX_KORE_IN_RESERVE))

        return np.append(output_state, [progress, my_kore, opponent_kore])

    def compute_reward(self, done: bool, strict=False) -> float:
        """Compute the agent reward. Welcome to the fine art of RL.

        We'll compute the reward as the change in board value since the previous step, plus a final bonus if the
        episode is over. If the player wins the episode, the bonus increases with shorter time-to-victory.
        If the player loses, we'll subtract that bonus.

        Args:
            done: True if the episode is over
            strict: If True, count only wins/losses (Useful for evaluating a trained agent)

        Returns:
            The agent's reward
        """
        board = self.board
        previous_board = self.previous_board

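        if strict or self.strict_reward:  # Also honor the 'strict' flag passed via the env config
            if done:
                # Who won?
                # Ugly but 99% sure correct, see https://www.kaggle.com/competitions/kore-2022/discussion/324150#1789804
                agent_reward = self.raw_obs.players[0][0]
                opponent_reward = self.raw_obs.players[1][0]
                if agent_reward is None or opponent_reward is None:
                    return 0  # A missing reward counts as a loss, as in the non-strict branch
                return int(agent_reward > opponent_reward)
            else:
                return 0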
        else:
            if done:
                # Who won?
                agent_reward = self.raw_obs.players[0][0]
                opponent_reward = self.raw_obs.players[1][0]
                if agent_reward is None or opponent_reward is None:
                    we_won = -1
                else:
                    we_won = 1 if agent_reward > opponent_reward else -1
                win_reward = we_won * (WIN_REWARD + 5 * (GAME_CONFIG['episodeSteps'] - board.step))
            else:
                win_reward = 0

            return get_board_value(board) - get_board_value(previous_board) + win_reward


def clip_normalize(x: Union[np.ndarray, float],
                   low_in: float,
                   high_in: float,
                   low_out=-1.,
                   high_out=1.) -> Union[np.ndarray, float]:
    """Clip values in x to the interval [low_in, high_in] and then MinMax-normalize to [low_out, high_out].

    Args:
        x: The array of float to clip and normalize
        low_in: The lowest possible value in x
        high_in: The highest possible value in x
        low_out: The lowest possible value in the output
        high_out: The highest possible value in the output

    Returns:
        The clipped and normalized version of x

    Raises:
        AssertionError if the limits are not consistent

    Examples:
        >>> clip_normalize(50, low_in=0, high_in=100)
        0.0

        >>> clip_normalize(np.array([-1, .5, 99]), low_in=-1, high_in=1, low_out=0, high_out=2)
        array([0. , 1.5, 2. ])
    """
    assert high_in > low_in and high_out > low_out, "Wrong limits"

    # Clip outliers
    try:
        x[x > high_in] = high_in
        x[x < low_in] = low_in
    except TypeError:
        x = high_in if x > high_in else x
        x = low_in if x < low_in else x

    # y = ax + b
    a = (high_out - low_out) / (high_in - low_in)
    b = high_out - high_in * a

    return a * x + b
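
The comment in `gym_to_kore_action()` mentions an exponential mapping as an alternative to the linear one for the number of ships. The sketch below compares both on a few inputs. The function names are hypothetical, `MAX_ACTION_FLEET_SIZE` is copied from `config.py`, and the geometric formula `max_size ** x` is just one possible choice, not something the baseline uses.

```python
MAX_ACTION_FLEET_SIZE = 150  # Same value as in config.py


def linear_fleet_size(x: float, max_size: int = MAX_ACTION_FLEET_SIZE) -> int:
    """Linear map from [0, 1] to [1, max_size], as in the baseline."""
    x = min(max(x, 0.0), 1.0)
    return int(1 + x * (max_size - 1))


def exponential_fleet_size(x: float, max_size: int = MAX_ACTION_FLEET_SIZE) -> int:
    """Geometric map from [0, 1] to [1, max_size]: equal steps in x multiply the size."""
    x = min(max(x, 0.0), 1.0)
    return int(round(max_size ** x))


for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x={x:.2f} linear={linear_fleet_size(x):3d} exponential={exponential_fleet_size(x):3d}")
```

With the exponential variant, the lower half of the input range covers fleet sizes 1 to 12, whereas the linear mapping spends it on sizes 1 to 75. Small fleets become easier to select precisely, large ones harder.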

### Check that we have a valid environment

In [None]:
# The bad news: this check will fail in the kaggle docker environment. The most likely reason is a version mismatch between packages.
# The good news: That's alright since everything else works! We're doing some unconventional dependency management here, so we'll have to live with
# a failed check.

#from stable_baselines3.common.env_checker import check_env
#from environment import KoreGymEnv

#env = KoreGymEnv()
#check_env(env)

# Train the agent!

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from environment import KoreGymEnv

In [None]:
kore_env = KoreGymEnv(config=dict(randomSeed=997269658))  # TODO: This seed is not enough. Seed everything!
monitored_env = Monitor(env=kore_env)
model = PPO('MlpPolicy', monitored_env, verbose=1)

In [None]:
%%time
# For serious training, likely many more iterations will be needed, as well as hyperparameter tuning!
# Even so, sometimes training will still fail. RL is like that. Try a couple times with the same config before giving up!
model.learn(total_timesteps=50000)  

In [None]:
# Watch it mercilessly beat the baseline bot - Note: The current episode might not be over yet
kore_env.render(mode="ipython", width=1000, height=800)

In [None]:
model.save("baseline_agent")

# Evaluate agent performance

In [None]:
import numpy as np

eval_env = KoreGymEnv(config=dict(strict=True))  # The 'strict' flag sets rewards to 1 if the agent won the episode and 0 otherwise. Useful for evaluation.
monitored_env = Monitor(env=eval_env)
model_loaded = PPO.load('baseline_agent')

def evaluate(model, num_episodes=1):
    """
    Evaluate an RL agent - Adapted from
    https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/stable_baselines_getting_started.ipynb
    :param model: (BaseRLModel object) the RL Agent
    :param num_episodes: (int) number of episodes to evaluate it
    :return: (float) Mean reward for the last num_episodes
    """
    all_episode_rewards = []
    for i in range(num_episodes):
        episode_rewards = []
        done = False
        obs = monitored_env.reset()
        while not done:
            action, _ = model.predict(obs)
            obs, _, done, info = monitored_env.step(action)
            if done:
                agent_reward = monitored_env.env.raw_obs.players[0][0]
                opponent_reward = monitored_env.env.raw_obs.players[1][0]
                reward = agent_reward > opponent_reward
            else:
                reward = 0
            # print(reward)
            # monitored_env.render(mode='ipython', height=400, width=300)
            episode_rewards.append(reward)

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    print("Mean reward:", mean_episode_reward, "Num episodes:", num_episodes)

    return mean_episode_reward

evaluate(model_loaded, 20)
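
With 0/1 rewards, the mean reward returned by `evaluate` is simply the win rate, and over 20 episodes that estimate is quite noisy. As a rough sanity check, one can put a Wilson score interval around it. The sketch below is not part of the original evaluation; the function name is made up and the 14 wins are a hypothetical outcome, not a measured result.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% confidence."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z ** 2 / games
    center = (p + z ** 2 / (2 * games)) / denom
    half_width = z * sqrt(p * (1 - p) / games + z ** 2 / (4 * games ** 2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(wins=14, games=20)  # Hypothetical: 14 wins out of 20 episodes
print(f"win rate {14 / 20:.2f}, interval [{low:.2f}, {high:.2f}]")
```

For this hypothetical 14 out of 20, the interval spans roughly 0.48 to 0.85, so more episodes are needed before concluding that one agent is better than another.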

### Prepare the submission

In [None]:
%%writefile main.py
# All this syspath wrangling is needed to make sure that the agent runs on the target environment and can load both the external dependencies
# and the saved model. Dear kaggle, if possible, please make this easier!
import os
import sys
KAGGLE_AGENT_PATH = "/kaggle_simulations/agent/"
if os.path.exists(KAGGLE_AGENT_PATH):
    # We're in the kaggle target system
    sys.path.insert(0, os.path.join(KAGGLE_AGENT_PATH, 'lib'))
    agent_path = os.path.join(KAGGLE_AGENT_PATH, 'baseline_agent')
else:
    # We're somewhere else
    sys.path.insert(0, os.path.join(os.getcwd(), 'lib'))
    agent_path = 'baseline_agent'

# Now for the actual agent
from stable_baselines3 import PPO
from environment import KoreGymEnv

model = PPO.load(agent_path)
kore_env = KoreGymEnv()

def agent(obs, config):
    kore_env.raw_obs = obs
    state = kore_env.obs_as_gym_state
    action, _ = model.predict(state)
    return kore_env.gym_to_kore_action(action)

In [None]:
%%capture
# This is for debugging purposes only before submitting - Are there any errors?
from kaggle_environments import make
from config import OPPONENT
env = make("kore_fleets", debug=True)
env.run(['main.py', OPPONENT])

In [None]:
!tar -czf submission.tar.gz main.py config.py environment.py reward_utils.py baseline_agent.zip lib
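
Before submitting, it can be worth checking that the archive really contains everything at the top level, since that is where the files end up after being unzipped to `/kaggle_simulations/agent/`. The following is a small sketch with the standard `tarfile` module. The helper name `missing_from_archive` is hypothetical and the file list is copied from the `tar` command above.

```python
import os
import tarfile
from typing import List


def missing_from_archive(archive_path: str, required: List[str]) -> List[str]:
    """Return the entries of `required` that are not members of the archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # normpath turns './main.py' into 'main.py' and 'lib/' into 'lib'
        names = {os.path.normpath(name) for name in archive.getnames()}
    return [name for name in required if name not in names]


if os.path.exists("submission.tar.gz"):
    required = ["main.py", "config.py", "environment.py", "reward_utils.py", "baseline_agent.zip", "lib"]
    print("Missing from archive:", missing_from_archive("submission.tar.gz", required))
```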