# [ConnectX with Reinforcement Learning](https://www.kaggle.com/c/connectx)

## Description

We’re excited to announce a beta version of a brand-new type of ML competition called Simulations. In Simulation Competitions, you’ll compete against a set of rules, rather than against an evaluation metric. To enter, [accept the rules](https://www.kaggle.com/c/connectx/rules) and create a Python submission file that can “play” against a computer, or another user.

### The Challenge

In this game, your objective is to get a certain number of your checkers in a row horizontally, vertically, or diagonally on the game board before your opponent. When it's your turn, you “drop” one of your checkers into one of the columns at the top of the board. Then, let your opponent take their turn. This means each move may be trying to either win for you, or trying to stop your opponent from winning. The default is four in a row, with other options coming soon.

### Background History

For the past 10 years, our competitions have been mostly focused on supervised machine learning. The field has grown, and we want to continue to provide the data science community cutting-edge opportunities to challenge themselves and grow their skills.

So, what’s next? Reinforcement learning is clearly a crucial piece in the next wave of data science learning. We hope that Simulation Competitions will provide the opportunity for Kagglers to practice and hone this burgeoning skill.

### How is this Competition Different?

Instead of submitting a CSV file, or a Kaggle Notebook, you will submit a Python .py file (more submission options are in development). You’ll also notice that the leaderboard is not based on how accurate your model is but rather how well you’ve performed against other users. See [Evaluation](https://www.kaggle.com/c/connectx/overview/evaluation) for more details.

### We’d Love Your Feedback

This competition is a low-stakes, trial-run introduction. We’re considering this a beta launch – there are complicated new mechanics in play and we’re still working on refining the process. We’d love your help testing the experience and want to hear your feedback.

Please note that we may make changes throughout the competition that could include things like resetting the leaderboard, invalidating episodes, making changes to the interface, or changing the environment configuration (e.g. modifying the number of columns, rows, or tokens in a row required to win, etc).

## Evaluation

Each Submission has an estimated Skill Rating which is modeled by a Gaussian N(μ, σ²) where μ is the estimated skill and σ represents our uncertainty of that estimate.

When you upload a Submission, we first play a Validation Episode where that Submission plays against itself to make sure it works properly. If the Episode fails, the Submission is marked as Error. Otherwise, we initialize the Submission with μ0=600 and it joins the pool of All Submissions for ongoing evaluation.

We repeatedly run Episodes from the pool of All Submissions, and try to pick Submissions with similar ratings for fair matches. We aim to run ~8 Episodes a day per Submission, with an additional slight rate increase for newer Episodes to give you feedback faster.

After an Episode finishes, we'll update the Rating estimate for both Submissions. If one Submission won, we'll increase its μ and decrease its opponent's μ -- if the result was a draw, then we'll move the two μ values closer towards their mean. The updates will have magnitude relative to the deviation from the expected result based on the previous μ values, and also relative to each Submission's uncertainty σ. We also reduce the σ terms relative to the amount of information gained by the result.

So all valid Submissions will continually play more matches and have dynamically changing scores as the pool increases. The Leaderboard will show the μ value of each Team's best Submission.
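
The exact update formula is not given here. As a rough illustration of the behaviour described above, the sketch below assumes an Elo-style expected score derived from the previous μ values and scales each step by the Submission's own σ. The function name `update_ratings`, the logistic scale of 400, the constant shrink factor, and the starting σ of 200 are all hypothetical choices, not the competition's actual parameters.

```python
def update_ratings(mu_a, sigma_a, mu_b, sigma_b, result_a, scale=400.0, shrink=0.98):
    """Hypothetical rating update; result_a is 1 (A wins), 0.5 (draw) or 0 (A loses)."""
    # Expected score of A from the previous mu values (logistic curve, an assumption).
    expected_a = 1.0 / (1.0 + 10 ** ((mu_b - mu_a) / scale))
    # Deviation of the actual result from the expected result.
    surprise = result_a - expected_a
    # Each step is proportional to the surprise and to that Submission's uncertainty.
    new_mu_a = mu_a + sigma_a * surprise
    new_mu_b = mu_b - sigma_b * surprise
    # Simplification: every result shrinks both uncertainties by the same factor.
    return new_mu_a, sigma_a * shrink, new_mu_b, sigma_b * shrink


# Two fresh Submissions at mu0 = 600 (sigma = 200 is an assumed value); A wins.
print(update_ratings(600, 200, 600, 200, result_a=1))
```

With a draw between unequal ratings, the higher μ drops and the lower μ rises, which matches the "closer towards their mean" behaviour described above.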

## Getting Started

**TL;DR**

Create `submission.py` with the following source and submit!

```python
def act(observation, configuration):
    board = observation.board
    columns = configuration.columns
    return [c for c in range(columns) if board[c] == 0][0]
```

**Starter Notebook**

Fork the [ConnectX Starter Notebook](https://www.kaggle.com/ajeffries/connectx-getting-started) and submit the generated `submission.py` file.

**Client Library**

Read the [README](https://github.com/Kaggle/kaggle-environments/blob/master/README.md) for the [kaggle-environments](https://pypi.org/project/kaggle-environments/) Python package and check out the [ConnectX Notebook](https://github.com/Kaggle/kaggle-environments/blob/master/kaggle_environments/envs/connectx/connectx.ipynb).

```bash
pip install kaggle-environments
```

## Environment Rules

### Episode Objective

Use your Agent to get a certain number of your checkers in a row horizontally, vertically, or diagonally on the game board before your opponent.

### How To Play

Player 1 will take the first turn. When it's your turn, you add, or “drop”, one of your checkers into the top of a column on the board and the checker will land in the last empty row in that column. The following can occur after dropping your checker in a column:

1. If the column you chose has no empty rows or is out of range of the number of columns, you lose the episode.
2. If the checker placed creates an "X-in-a-row", you win the episode. X represents the number specified in the parameters, for example 4, and to be “in a row”, the checkers can be in a row horizontally, vertically, or diagonally.
3. If there are no empty cells, you tie the episode.
4. Otherwise, it's your opponent’s turn.

The episode continues until a win, loss, or tie occurs.
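
The drop rule can be sketched in a few lines. This is an illustrative helper, not part of `kaggle-environments`: the name `drop_checker` is hypothetical, and it assumes the board is a flat list stored row by row with the top row first, as described under Writing Agents.

```python
def drop_checker(board, column, mark, rows=6, columns=7):
    """Return a new board with `mark` dropped into `column`, or None if the move is invalid."""
    # An out-of-range column or a full column (top cell occupied) is an invalid move.
    if not 0 <= column < columns or board[column] != 0:
        return None
    new_board = list(board)
    # Scan from the bottom row upwards for the first empty cell in the column.
    for row in range(rows - 1, -1, -1):
        index = row * columns + column
        if new_board[index] == 0:
            new_board[index] = mark
            return new_board


empty = [0] * 42
after_first = drop_checker(empty, 3, 1)         # lands in the bottom row
after_second = drop_checker(after_first, 3, 2)  # lands on top of the first checker
```

Returning a copy rather than mutating the input keeps the helper safe to use when exploring candidate moves.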

### Writing Agents

An Agent will receive the following parameters:

1. The episode configuration:
    - Number of Columns on the board.
    - Number of Rows on the board.
    - How many checkers, X, "in a row" are required to win.
2. The current state of the board (serialized grid of cells; rows by cols).
    - Empty cells are represented by "0".
    - Player 1's checkers are represented by "1".
    - Player 2's checkers are represented by "2".
3. Which player you are ("1" or "2").

An Agent should return which column to place a checker in. The column is an integer in the range [0, configuration.columns), counting columns from left to right. Rows are likewise indexed by an integer in the range [0, configuration.rows), counting from top to bottom.

Here’s what that looks like as code:

```python
def agent(observation, configuration):
    # Number of Columns on the Board.
    columns = configuration.columns
    # Number of Rows on the Board.
    rows = configuration.rows
    # Number of Checkers "in a row" needed to win.
    inarow = configuration.inarow
    # The current serialized Board (rows x columns).
    board = observation.board
    # Which player the agent is playing as (1 or 2).
    mark = observation.mark

    # Return which column to drop a checker (action).
    return 0
```
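
As an illustration of how `rows`, `columns`, and `inarow` work together with the serialized board, the sketch below checks whether a player has completed a line. The helper name `has_won` is hypothetical and is not part of the environment's API; it assumes the row-major layout described above.

```python
def has_won(board, mark, rows, columns, inarow):
    """Return True if `mark` has `inarow` checkers in a line on the serialized board."""
    # Directions as (row step, column step): right, down, and the two diagonals.
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for row in range(rows):
        for col in range(columns):
            for d_row, d_col in directions:
                end_row = row + (inarow - 1) * d_row
                end_col = col + (inarow - 1) * d_col
                # Skip lines that would run off the board.
                if not (0 <= end_row < rows and 0 <= end_col < columns):
                    continue
                if all(board[(row + i * d_row) * columns + (col + i * d_col)] == mark
                       for i in range(inarow)):
                    return True
    return False


board = [0] * 42
for col in range(4):
    board[5 * 7 + col] = 1  # four checkers along the bottom row
print(has_won(board, 1, rows=6, columns=7, inarow=4))
```

Checking the end point of each candidate line before reading cells prevents a horizontal run from "wrapping" onto the next row of the flat list.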

### Agent Rules

1. Your Submission must be an “Agent”.
2. An Agent may only use modules from "The Python Standard Library", "numpy", "gym", "pytorch", and "scipy".
3. An Agent’s sole purpose is to generate an action. Activities/code which do not directly contribute to this will be considered malicious and handled according to the Rules.
4. An Agent's file size must not exceed 1 MB.
5. An Agent must return an action within 5 seconds of being invoked. If the Agent does not, it will lose the episode and may be invalidated.
6. An Agent which throws errors or returns an invalid action will lose the episode and may be invalidated.
7. An Agent cannot store information between invocations.

## Reinforcement Learning

Reinforcement learning algorithms generally use a similar process to produce an agent:

- Initially, the weights are set to random values.
- As the agent plays the game, the algorithm continually tries out new values for the weights, to see how the cumulative reward is affected, on average. Over time, after playing many games, we get a good idea of how the weights affect cumulative reward, and the algorithm settles towards weights that performed better.
    - Of course, we have glossed over the details here, and there's a lot of complexity involved in this process. For now, we focus on the big picture!


- This way, we'll end up with an agent that tries to win the game (so it gets the final reward of +1, and avoids the -1 and -10) and tries to make the game last as long as possible (so that it collects the 1/42 bonus as many times as it can). These reward values are defined in the Setup section below.
    - You might argue that it doesn't really make sense to want the game to last as long as possible -- this might result in a very inefficient agent that doesn't play obvious winning moves early in gameplay. And, your intuition would be correct -- this will make the agent take longer to play a winning move! The reason we include the 1/42 bonus is to help the algorithms we'll use to converge better. Further discussion is outside of the scope of this course, but you can learn more by reading about the "temporal credit assignment problem" and "reward shaping".
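
As a toy illustration of this loop (and not the algorithm Stable Baselines uses), the sketch below perturbs a weight vector at random and keeps a change only when it raises the average reward. `average_reward` is a hypothetical stand-in for "play many games and average the cumulative reward"; here it simply scores how close the weights are to an arbitrary target.

```python
import random


def average_reward(weights):
    """Hypothetical stand-in for playing many games and averaging the cumulative reward."""
    target = [0.5, -0.25, 1.0]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


def random_search(num_steps=2000, step_size=0.1, seed=0):
    rng = random.Random(seed)
    # Initially, the weights are set to random values.
    weights = [rng.uniform(-1, 1) for _ in range(3)]
    best = average_reward(weights)
    for _ in range(num_steps):
        # Try out new values for the weights.
        candidate = [w + rng.gauss(0, step_size) for w in weights]
        score = average_reward(candidate)
        # Settle towards the weights that performed better.
        if score > best:
            weights, best = candidate, score
    return weights, best


weights, best = random_search()
print(weights, best)
```

Real algorithms such as A2C estimate gradients instead of guessing blindly, but the overall shape (random start, repeated evaluation, gradual improvement) is the same.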

## Install `kaggle-environments` and other dependencies

```python
import sys

!{sys.executable} -m pip install 'kaggle-environments>=0.1.6' gym stable-baselines3
```



## Create ConnectX Environment

There are a lot of great implementations of reinforcement learning algorithms online. In this course, we'll use Stable Baselines.

Stable Baselines 2.x is not compatible with TensorFlow 2.0, so the original instructions begin by downgrading to TensorFlow 1.x.

**_Obs.:_** We ran into several errors with `Tensorflow 1.x` and `Stable Baselines 2.x`. After almost an entire day of work, we got things working with `Tensorflow 2.2.0` and `Stable Baselines 3 (beta)`, which is the setup used in the rest of this notebook. Stable Baselines3 is built on PyTorch, so the TensorFlow version is checked below only for the record.

```python
# Check version of tensorflow
import tensorflow as tf
tf.__version__
```

```
'2.2.0'
```

## Setup

After each move, we give the agent a reward that tells it how well it did:

  - _**If**_ the agent wins the game in that move, we give it a reward of +1.
  - _**Else if**_ the agent plays an invalid move (which ends the game), we give it a reward of -10.
  - _**Else if**_ the opponent wins the game in its next move (i.e., the agent failed to prevent its opponent from winning), we give the agent a reward of -1.
  - _**Else**_ , the agent gets a reward of 1/42.
  
At the end of each game, the agent adds up its reward. We refer to the sum of rewards as the agent's **cumulative reward**.

  - For instance, if the game lasted 8 moves (each player played four times), and the agent ultimately won, then its cumulative reward is `3*(1/42) + 1`.
  - If the game lasted 11 moves (and the opponent went first, so the agent played five times), and the opponent won in its final move, then the agent's cumulative reward is `4*(1/42) - 1`.
  - If the game ends in a draw, then the agent played exactly 21 moves, and it gets a cumulative reward of `21*(1/42)`.
  - If the game lasted 7 moves and ended with the agent selecting an invalid move, the agent gets a cumulative reward of `3*(1/42) - 10`.
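
The arithmetic in these examples can be checked with exact fractions. This is only a worked calculation of the reward scheme as described in the list above; `cumulative_reward` and `STEP_BONUS` are illustrative names.

```python
from fractions import Fraction

STEP_BONUS = Fraction(1, 42)  # reward for a move that does not end the game


def cumulative_reward(agent_moves, final_reward):
    """Sum the rewards when the agent's last move earns `final_reward`
    and each of its earlier moves earns the 1/42 bonus."""
    return (agent_moves - 1) * STEP_BONUS + final_reward


win = cumulative_reward(4, 1)        # 8-move game, agent wins on its 4th move
loss = cumulative_reward(5, -1)      # 11-move game, opponent wins after the agent's 5th move
invalid = cumulative_reward(4, -10)  # 7-move game, agent's 4th move is invalid
draw = 21 * STEP_BONUS               # draw: all 21 agent moves earn the bonus

print(win, loss, invalid, draw)  # 15/14 -19/21 -139/14 1/2
```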
  
**Our goal** is to find the weights of the neural network that (on average) maximize the agent's cumulative reward.

There's a bit of extra work that we need to do to make the environment compatible with Stable Baselines. For this, we define the `ConnectFourGym` class below. This class implements ConnectX as an [OpenAI Gym environment](http://gym.openai.com/docs/) and uses several methods:

 - `reset()` will be called at the beginning of every game. It returns the starting game board as a numpy array of shape `(6, 7, 1)`: 6 rows, 7 columns, and a single channel.
 - `change_reward()` customizes the rewards that the agent receives. (The competition already has its own system for rewards that are used to rank the agents, and this method changes the values to match the rewards system we designed.)
 - `step()` is used to play the agent's choice of action (supplied as `action`), along with the opponent's response. It returns:
    - the resulting game board (as a numpy array),
    - the agent's reward (from the most recent move only: one of `+1`, `-10`, `-1`, or `1/42`),
    - whether or not the game has ended (if the game has ended, `done=True`; otherwise, `done=False`), and
    - an `info` dictionary with any extra details from the environment.

To learn more about how to define environments, check out the documentation [here](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html).

```python
import numpy as np
from kaggle_environments import evaluate, make, utils
from gym import spaces

class ConnectFourGym:
    def __init__(self, agent2="random"):
        ks_env = make("connectx", debug=True)
        # The agent being trained takes the first seat (None); agent2 is the opponent
        self.env = ks_env.train([None, agent2])
        self.rows = ks_env.configuration.rows
        self.columns = ks_env.configuration.columns
        # Learn about spaces here: http://gym.openai.com/docs/#spaces
        self.action_space = spaces.Discrete(self.columns)
        # np.int has been removed from recent NumPy releases; the builtin int is equivalent
        self.observation_space = spaces.Box(low=0,
                                            high=2,
                                            shape=(self.rows, self.columns, 1),
                                            dtype=int)
        # Tuple corresponding to the min and max possible rewards
        self.reward_range = (-10, 1)
        # StableBaselines throws error if these are not defined
        self.spec = None
        self.metadata = None

    def reset(self):
        self.obs = self.env.reset()
        return np.array(self.obs['board']).reshape(self.rows, self.columns, 1)

    def change_reward(self, old_reward, done):
        # The agent won the game
        if old_reward == 1:
            return 1
        # The game ended without an agent win: the opponent won
        # (note that a draw also reaches this branch)
        elif done:
            return -1
        # Reward 1/42
        else:
            return 1/(self.rows*self.columns)

    def step(self, action):
        # Check if agent's move is valid (the top cell of the chosen column is empty)
        is_valid = (self.obs['board'][int(action)] == 0)
        # Play the move
        if is_valid:
            self.obs, old_reward, done, info = self.env.step(int(action))
            reward = self.change_reward(old_reward, done)
        # End the game and penalize agent
        else:
            reward, done, info = -10, True, {}
        board = np.array(self.obs['board']).reshape(self.rows, self.columns, 1)
        return board, reward, done, info
```

In this notebook, we'll train an agent to beat the random agent. We specify this opponent in the `agent2` argument below. The "random" agent selects (uniformly) at random from the set of **valid moves.**

```python
import numpy as np

# Create ConnectFour environment
env = ConnectFourGym(agent2="random")
```

The `Monitor` class lets us watch how the agent's performance gradually improves, as it plays more and more games.

```python
import os
from stable_baselines3.common.monitor import Monitor

# Create directory for logging training information
log_dir = "log/"
os.makedirs(log_dir, exist_ok=True)

# Logging progress
monitor_env = Monitor(env, log_dir, allow_early_resets=True)
```

Stable Baselines requires us to work with ["vectorized" environments](https://stable-baselines.readthedocs.io/en/master/guide/vec_envs.html). 

>_"Vectorized Environments are a method for stacking multiple independent environments into a single environment. Instead of training an RL agent on 1 environment per step, it allows us to train it on n environments per step."_

For this, we can use the `DummyVecEnv` class.

```python
from stable_baselines3.common.vec_env import DummyVecEnv

# Create a vectorized environment
vec_env = DummyVecEnv([lambda: monitor_env])
```
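
To make the idea of a vectorized environment concrete, here is a minimal pure-Python sketch of what such a wrapper does. It is not Stable Baselines' implementation: `CounterEnv` and `SimpleVecEnv` are hypothetical classes written only to show several environments being stepped together, with finished ones restarted automatically.

```python
class CounterEnv:
    """Tiny stand-in environment: the observation is a running counter."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def reset(self):
        self.count = 0
        return self.count

    def step(self, action):
        self.count += action
        done = self.count >= self.limit
        reward = 1.0 if done else 0.0
        return self.count, reward, done, {}


class SimpleVecEnv:
    """Hypothetical wrapper that steps several environments in lockstep."""

    def __init__(self, env_fns):
        # Each entry is a zero-argument function that builds one environment.
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        results = []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)
            if done:
                # Finished environments restart, so the batch never stalls.
                obs = env.reset()
            results.append((obs, reward, done, info))
        observations, rewards, dones, infos = zip(*results)
        return list(observations), list(rewards), list(dones), list(infos)


vec = SimpleVecEnv([lambda: CounterEnv(2), lambda: CounterEnv(3)])
print(vec.reset())
print(vec.step([1, 1]))
```

With a single environment in the list, as in this notebook, the wrapper mainly adds the batch dimension that the training code expects.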

The next step is to specify the architecture of the neural network. In this case, we use a fully connected network (`MlpPolicy`) with hidden layers of 128, 256, and 512 units and ReLU activations. To learn more about how to specify architectures with Stable Baselines, check out the documentation [here](https://stable-baselines3.readthedocs.io/en/master/).

```python
import torch as th
from stable_baselines3 import A2C

# Notes on rewards and the policy network (which maps board observations to actions):
#  - Sparse rewards: in complex environments, the network may rarely or never receive a reward.
#  - Reward shaping: manually designing a reward function to guide the policy towards a desired behavior.
#    - Pro: faster convergence
#    - Con: needs a custom design for every environment, so it does not scale
#    - Con (alignment problem): the agent may find a way to collect a lot of reward
#      without generalizing to the intended behavior

policy_kwargs = {'activation_fn': th.nn.ReLU,
                 'net_arch': [128, 256, 512]}

model = A2C('MlpPolicy', vec_env, verbose=0, policy_kwargs=policy_kwargs)
```

## Train Agent

In the next code cell, we "train the agent", which is just another way of saying that we find weights of the neural network that are likely to result in the agent selecting good moves.

We then plot a rolling average of the cumulative reward that the agent received during training. An upward trend in this curve indicates that the agent is gradually learning to perform better by playing the game.

```python
# 100 million timesteps; reduce this value for a quick trial run
model.learn(total_timesteps=100000000)
```

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot cumulative reward
with open(os.path.join(log_dir, "monitor.csv"), 'rt') as fh:
    # The first line of the monitor file is a '#'-prefixed header, not CSV data
    firstline = fh.readline()
    assert firstline[0] == '#'
    df = pd.read_csv(fh, index_col=None)['r']
df.rolling(window=1000).mean().plot()
plt.show()
```

## Create an Agent

To create the submission, an agent function should be fully encapsulated (no external dependencies).

When your agent is being evaluated against others, it will not have access to the Kaggle docker image. Only the following can be imported: Python Standard Library Modules, gym, numpy, scipy, pytorch (1.3.1, cpu only), and more may be added later.

`obs` contains two pieces of information:

 - `obs.board`: the game board (a Python list with one item for each grid location)
 - `obs.mark`: the piece assigned to the agent (either 1 or 2)
 
`obs.board` is a Python list that shows the locations of the discs, where the first row appears first, followed by the second row, and so on. We use `1` to track player 1's discs, and `2` to track player 2's discs. For instance, a game board partway through a game might be serialized as: `[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 1, 2, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 2, 1, 2, 0, 2, 0]`
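
To see the grid hidden in that flat list, it can be split into rows of `config.columns` cells each. The helper name `board_rows` is illustrative only.

```python
def board_rows(board, columns=7):
    """Split the flat, row-major board list into one list per row (top row first)."""
    return [board[i:i + columns] for i in range(0, len(board), columns)]


board = [0, 0, 0, 0, 0, 0, 0,
         0, 0, 1, 1, 0, 0, 0,
         0, 0, 2, 2, 0, 0, 0,
         0, 2, 1, 2, 0, 0, 0,
         0, 1, 1, 1, 0, 0, 0,
         0, 2, 1, 2, 0, 2, 0]

for row in board_rows(board):
    print(row)
```

The cell at row `r` and column `c` is therefore `board[r * config.columns + c]`, which is why `board[c] == 0` tests whether column `c` still has room at the top.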

`config` contains three pieces of information:

  - `config.columns`: number of columns in the game board (7 for Connect Four)
  - `config.rows`: number of rows in the game board (6 for Connect Four)
  - `config.inarow`: number of pieces a player needs to get in a row in order to win (4 for Connect Four)

```python
def my_agent(obs, config):
    # Use the trained model to select a column (`model` and `np` are notebook-level names)
    col, _ = model.predict(np.array(obs['board']).reshape(config.rows, config.columns, 1))
    # Imported here so the function carries its own import when written to a file
    import random
    # Check if selected column is valid
    is_valid = (obs['board'][int(col)] == 0)
    # If not valid, select a random valid move
    if is_valid:
        return int(col)
    else:
        return random.choice([c for c in range(config.columns) if obs['board'][c] == 0])
```

## Test your Agent

In the next code cell, we see the outcome of one game round against a random agent.

```python
# Create the game environment
env = make("connectx")

# my_agent plays one game round against the random agent
env.run([my_agent, "random"])

# Show the game
env.render(mode="ipython", width=500, height=450)
```

## Evaluate your Agent

Next, we estimate how it performs on average against the random agent and against the built-in negamax agent.

```python
def mean_reward(rewards):
    return sum(r[0] for r in rewards) / float(len(rewards))

# Run multiple episodes to estimate its performance.
print("My Agent vs Random Agent:", mean_reward(evaluate("connectx", [my_agent, "random"], num_episodes=10)))
print("My Agent vs Negamax Agent:", mean_reward(evaluate("connectx", [my_agent, "negamax"], num_episodes=10)))
```

```
My Agent vs Random Agent: 1.0
My Agent vs Negamax Agent: -1.0
```


It's important to note that the agent that we've created here was only trained to beat the random agent, because all of its gameplay experience has been with the random agent as opponent.

If we want to produce an agent that reliably performs better than many other agents, we have to expose our agent to these other agents during training. To learn more about how to do this, you can read about [self-play](https://openai.com/blog/competitive-self-play/).

## Play your Agent

Click on any column to place a checker there ("manually select action").

```python
# "None" represents which agent you'll manually play as (first or second player).
env.play([None, my_agent], width=500, height=450)
```

## Write Submission File

```python
import inspect
import os

def write_agent_to_file(function, file):
    # Appends if the file already exists, so re-running adds another copy of the function
    with open(file, "a" if os.path.exists(file) else "w") as f:
        f.write(inspect.getsource(function))
        print(function, "written to", file)

write_agent_to_file(my_agent, "submission.py")
```

```
<function my_agent at 0x7f1b74509280> written to submission.py
```


## Validate Submission

Play your submission against itself. This is the first episode the competition will run to weed out erroneous agents.

Why validate? This roughly verifies that your submission is fully encapsulated and can be run remotely.

```python
# Note: Stdout replacement is a temporary workaround.
import sys
out = sys.stdout
submission = utils.read_file("submission.py")
agent = utils.get_last_callable(submission)
sys.stdout = out

env = make("connectx", debug=True)
env.run([agent, agent])
print("Success!" if env.state[0].status == env.state[1].status == "DONE" else "Failed...")
```

```
Error: ['Traceback (most recent call last):\n', '  File "/opt/conda/lib/python3.8/site-packages/kaggle_environments/agent.py", line 90, in run_agent\n    message.action = agent(*args)\n', '  File "<string>", line 3, in my_agent\n', "NameError: name 'model' is not defined\n"]
Failed...
```

The validation fails here because `submission.py` contains only the source of `my_agent`, which refers to the notebook-level `model` variable. Until the trained weights are bundled into the file, this submission would be marked as Error.
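
One possible way to avoid depending on a notebook variable such as `model` is to serialize the learned parameters into the submission source itself. The sketch below shows only the embedding mechanism, under heavy assumptions: a plain dictionary of column scores stands in for real network weights, and `build_submission_source` and `WEIGHTS_B64` are hypothetical names. A real agent would also need code that rebuilds the network from the embedded weights, and the resulting file must stay under the 1 MB limit.

```python
import base64
import json


def build_submission_source(weights):
    """Return Python source for an agent that carries its own (toy) weights."""
    encoded = base64.b64encode(json.dumps(weights).encode("utf-8")).decode("ascii")
    return f'''import base64
import json

WEIGHTS_B64 = "{encoded}"
WEIGHTS = json.loads(base64.b64decode(WEIGHTS_B64))

def my_agent(obs, config):
    # Toy policy: among columns that are not full, pick the highest stored score.
    board = obs["board"]
    scores = WEIGHTS["column_scores"]
    valid = [c for c in range(config["columns"]) if board[c] == 0]
    return max(valid, key=lambda c: scores[c])
'''


source = build_submission_source({"column_scores": [0, 1, 2, 3, 2, 1, 0]})
print(source)
```

Writing `source` to `submission.py` yields a file whose only imports come from the Python Standard Library, so it no longer relies on anything defined in the notebook session.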


## Submit to Competition

1. Commit this kernel.
2. View the committed version.
3. Go to the "Data" section and find the `submission.py` file.
4. Click "Submit to Competition".
5. Go to My Submissions to view your score and episodes being played.