# Connect X
#### Author: James Coffey
#### Date: 2025-08-31
#### Challenge URL: [Connect X](https://www.kaggle.com/competitions/connectx)

* I implemented **two agents**:

  * A fast **minimax + alpha–beta** baseline for analysis (not submitted here).
  * A **Deep RL** agent (PPO + CNN) that learns by playing and generalizing
    patterns.

* **Leaderboard honesty:** my minimax agent achieved **768.8**, while this RL
  agent scored **312.3**. I’m intentionally submitting the **RL agent** to
  demonstrate mastery of **RL training, architecture, evaluation, and deployment
  constraints** (exporting to pure NumPy so it runs inside the Kaggle match
  runner without extra dependencies).

## Imports used in this notebook (not in submission)

In [1]:
import random
import numpy as np
from kaggle_environments import make, evaluate


## Training setup (RL vs. strong opponent)

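We train PPO against a non-trivial opponent rather than a random one, on the assumption that stronger opposition gives a more informative learning signal. The training opponent is a depth-4 **minimax (alpha–beta)** agent, run inside a lightweight Gym wrapper.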

### Minimax sparring opponent (training-time only)

In [2]:
def minimax_agent(obs, config):
    import numpy as np, random

    rows, cols, k = config.rows, config.columns, config.inarow

    def as_grid(b):
        return np.asarray(b).reshape(rows, cols)

    def valid_moves(g):
        return [c for c in range(cols) if g[0, c] == 0]

    def drop(g, c, m):
        g2 = g.copy()
        for r in range(rows - 1, -1, -1):
            if g2[r, c] == 0:
                g2[r, c] = m
                return g2
        return g

    def has_k(arr, m):
        cnt = 0
        for v in arr:
            cnt = cnt + 1 if v == m else 0
            if cnt >= k:
                return True
        return False

    def win(g, m):
        for r in range(rows):
            if has_k(g[r, :], m):
                return True
        for c in range(cols):
            if has_k(g[:, c], m):
                return True
        for r in range(rows - k + 1):
            for c in range(cols - k + 1):
                if has_k(np.diag(g[r : r + k, c : c + k]), m):
                    return True
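        for r in range(rows - k + 1):
            for c in range(cols - k + 1):
                # Anti-diagonal: mirror the k x k window, then take its diagonal.
                # (A reversed slice g[r : r - k : -1] is empty when r - k == -1,
                # which would skip every anti-diagonal touching the top rows.)
                if has_k(np.diag(np.fliplr(g[r : r + k, c : c + k])), m):
                    return True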
        return False

    def full(g):
        return not np.any(g[0, :] == 0)

    def score(g, me):
        opp = 1 if me == 2 else 2
        center_bonus = np.count_nonzero(g[:, cols // 2] == me) * 3
        return center_bonus - (np.count_nonzero(g[:, cols // 2] == opp) * 2)

    def mm(g, d, a, b, maxing, me):
        opp = 1 if me == 2 else 2
        if win(g, me):
            return 10**9, None
        if win(g, opp):
            return -(10**9), None
        if d == 0 or full(g):
            return score(g, me), None
        moves = valid_moves(g)
        center = cols // 2
        moves.sort(key=lambda c: abs(c - center))
        if maxing:
            best, move = -1e18, random.choice(moves)
            for c in moves:
                sc, _ = mm(drop(g, c, me), d - 1, a, b, False, me)
                if sc > best:
                    best, move = sc, c
                a = max(a, best)
                if a >= b:
                    break
            return best, move
        else:
            best, move = 1e18, random.choice(moves)
            oppm = 1 if me == 2 else 2
            for c in moves:
                sc, _ = mm(drop(g, c, oppm), d - 1, a, b, True, me)
                if sc < best:
                    best, move = sc, c
                b = min(b, best)
                if a >= b:
                    break
            return best, move

    g = as_grid(obs.board)
    me = obs.mark
    vm = [c for c in range(config.columns) if g[0, c] == 0]
    if not vm:
        return 0
    # instant win/block
    opp = 1 if me == 2 else 2
    for c in vm:
        if win(drop(g, c, me), me):
            return c
    for c in vm:
        if win(drop(g, c, opp), opp):
            return c
    _, move = mm(g, 4, -1e18, 1e18, True, me)
    return (
        move
        if move is not None
        else min(vm, key=lambda c: abs(c - (config.columns // 2)))
    )
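
The win check is the part of this opponent that is easiest to get subtly wrong, so here is a minimal standalone sketch of the same idea that can be tested in isolation. The names `has_run` and `is_win` are illustrative, not the notebook's helpers, and handling the anti-diagonal by mirroring each window with `np.fliplr` is one possible approach rather than the only one.

```python
import numpy as np


def has_run(line, mark, k):
    """Return True if `line` contains k consecutive cells equal to `mark`."""
    count = 0
    for v in line:
        count = count + 1 if v == mark else 0
        if count >= k:
            return True
    return False


def is_win(grid, mark, k=4):
    """Check rows, columns, and both diagonal directions for a run of k."""
    rows, cols = grid.shape
    lines = [grid[r, :] for r in range(rows)]
    lines += [grid[:, c] for c in range(cols)]
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            window = grid[r:r + k, c:c + k]
            lines.append(np.diag(window))             # top-left to bottom-right
            lines.append(np.diag(np.fliplr(window)))  # top-right to bottom-left
    return any(has_run(line, mark, k) for line in lines)


# Anti-diagonal for player 1 that touches the top row:
# cells (0, 3), (1, 2), (2, 1), (3, 0).
board = np.zeros((6, 7), dtype=int)
for i in range(4):
    board[i, 3 - i] = 1
print(is_win(board, 1), is_win(board, 2))
```

A board like this one, with the winning line touching row 0, is a useful regression case because reversed-slice indexing tends to fail exactly there.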

### Gym environment for SB3 (training-time only)

In [3]:
import gym
from gym import spaces


class ConnectFourGym(gym.Env):
    def __init__(self, agent2=minimax_agent):
        ks_env = make("connectx", debug=True)
        self.env = ks_env.train([None, agent2])
        self.rows = ks_env.configuration.rows
        self.columns = ks_env.configuration.columns
        self.action_space = spaces.Discrete(self.columns)
        self.observation_space = spaces.Box(
            low=0, high=2, shape=(1, self.rows, self.columns), dtype=np.int64
        )
        self.reward_range = (-10, 1)
        self.spec, self.metadata = None, None

    def reset(self):
        self.obs = self.env.reset()
        return np.array(self.obs["board"]).reshape(1, self.rows, self.columns)

    def _reward_shaping(self, env_reward, done):
        if env_reward == 1:  # win
            return 1.0
        if done:  # loss or draw (any terminal state that is not a win)
            return -1.0
        return 1.0 / (self.rows * self.columns)  # small step bonus

    def step(self, action):
        valid = self.obs["board"][int(action)] == 0
        if valid:
            self.obs, env_r, done, info = self.env.step(int(action))
            r = self._reward_shaping(env_r, done)
        else:
            r, done, info = -10.0, True, {}
        return (
            np.array(self.obs["board"]).reshape(1, self.rows, self.columns),
            r,
            done,
            info,
        )
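
To make the reward scale concrete, here is a small sketch of the undiscounted return an episode yields under the shaping above. `episode_return` and the outcome labels are hypothetical names used only for this illustration. It assumes the episode ends on the agent's n-th move, so the first n - 1 moves each earn the step bonus.

```python
ROWS, COLS = 6, 7
STEP_BONUS = 1.0 / (ROWS * COLS)  # same per-step bonus as the wrapper


def episode_return(n_agent_moves, outcome):
    """Undiscounted return for an episode ending on the agent's n-th move.

    outcome is one of "win", "loss_or_draw", "invalid" (illustrative labels).
    """
    terminal = {"win": 1.0, "loss_or_draw": -1.0, "invalid": -10.0}[outcome]
    return (n_agent_moves - 1) * STEP_BONUS + terminal


for outcome in ("win", "loss_or_draw", "invalid"):
    print(outcome, round(episode_return(6, outcome), 3))
```

The step bonus can add at most 20/42 (about 0.48) over a full game as first player, so it never outweighs the terminal reward, while a single invalid move costs roughly ten losses.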

## PPO + CNN policy (training-time only)

In [4]:
# Install (Kaggle usually has these in the RL course images; keep for robustness)
# !pip -q install "stable-baselines3>=2.4.0" torch --upgrade

import torch as th
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.common.vec_env import DummyVecEnv


class CustomCNN(BaseFeaturesExtractor):
    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 256):
        super().__init__(observation_space, features_dim)
        n_in = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_in, 64, 3, 1, 0),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, 1, 0),
            nn.ReLU(),
            nn.Flatten(),
        )
        with th.no_grad():
            samp = th.as_tensor(observation_space.sample()[None]).float()
            n_flat = self.cnn(samp).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flat, features_dim), nn.ReLU())

    def forward(self, obs: th.Tensor) -> th.Tensor:
        obs = obs.float() / 2.0
        return self.linear(self.cnn(obs))


policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    features_extractor_kwargs=dict(features_dim=256),
    net_arch=dict(pi=[256, 256], vf=[256, 256]),
)


def make_env():
    return ConnectFourGym(agent2=minimax_agent)


vec_env = make_vec_env(make_env, n_envs=8, vec_env_cls=DummyVecEnv)

model = PPO(
    "CnnPolicy",
    vec_env,
    policy_kwargs=policy_kwargs,
    device="cuda" if th.cuda.is_available() else "cpu",
    n_steps=128,
    batch_size=1024,
    n_epochs=10,
    learning_rate=2.5e-4,
    verbose=1,
)



Using cuda device
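
As a quick check on the architecture, the feature-map sizes and the parameter count on the policy path can be worked out by hand. The sketch below assumes the standard 6x7 board and the layer sizes in the cell above. `conv_out` and `layers` are names introduced here for illustration, and the value-stream layers are left out because they are not needed at inference.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


rows, cols = 6, 7
h1, w1 = conv_out(rows), conv_out(cols)  # after the first 3x3 conv
h2, w2 = conv_out(h1), conv_out(w1)      # after the second 3x3 conv
n_flat = 128 * h2 * w2                   # input width of the first Linear layer

# (weights, biases) for each layer on the policy path
layers = {
    "conv1": (64 * 1 * 3 * 3, 64),
    "conv2": (128 * 64 * 3 * 3, 128),
    "linear": (n_flat * 256, 256),
    "pi0": (256 * 256, 256),
    "pi1": (256 * 256, 256),
    "action": (7 * 256, 7),
}
n_params = sum(w + b for w, b in layers.values())
print((h1, w1), (h2, w2), n_flat, n_params, n_params * 4)
```

The policy path comes to 404,743 parameters, or about 1.62 MB as float32. Base64 inflates that by 4/3, which is in line with the roughly 2.16 million characters reported by the export step later in this notebook.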


### Train (~50k steps used for the score reported)

In [5]:
# Adjust timesteps if you want a faster dev run
TOTAL_STEPS = 50_000
_ = model.learn(total_timesteps=TOTAL_STEPS)



---------------------------------
| rollout/           |          |
|    ep_len_mean     | 5.59     |
|    ep_rew_mean     | -3.14    |
| time/              |          |
|    fps             | 25       |
|    iterations      | 1        |
|    time_elapsed    | 40       |
|    total_timesteps | 1024     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 5.6          |
|    ep_rew_mean          | -2.69        |
| time/                   |              |
|    fps                  | 25           |
|    iterations           | 2            |
|    time_elapsed         | 79           |
|    total_timesteps      | 2048         |
| train/                  |              |
|    approx_kl            | 0.0042360947 |
|    clip_fraction        | 0.00488      |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.94        |
|    explained_variance   | -0.0787      |
|    learning_r

## Export to **pure NumPy** weights (for a dependency-free submission)

We extract exactly the layers used at inference: 2 convs → flatten → dense(256)
→ policy MLP(256,256) → logits(7). Then we serialize with `np.savez` and
base64-encode into a single string.

In [7]:
import io, base64

def export_numpy_policy(model):
    sd = model.policy.state_dict()

    # Features extractor (our CustomCNN)
    conv1_w = sd["features_extractor.cnn.0.weight"].cpu().numpy()
    conv1_b = sd["features_extractor.cnn.0.bias"].cpu().numpy()
    conv2_w = sd["features_extractor.cnn.2.weight"].cpu().numpy()
    conv2_b = sd["features_extractor.cnn.2.bias"].cpu().numpy()
    lin0_w  = sd["features_extractor.linear.0.weight"].cpu().numpy()
    lin0_b  = sd["features_extractor.linear.0.bias"].cpu().numpy()

    # Policy MLP (only the policy stream)
    pi0_w = sd["mlp_extractor.policy_net.0.weight"].cpu().numpy()
    pi0_b = sd["mlp_extractor.policy_net.0.bias"].cpu().numpy()
    pi1_w = sd["mlp_extractor.policy_net.2.weight"].cpu().numpy()
    pi1_b = sd["mlp_extractor.policy_net.2.bias"].cpu().numpy()

    # Action head
    act_w = sd["action_net.weight"].cpu().numpy()
    act_b = sd["action_net.bias"].cpu().numpy()

    # Pack into a single bytes blob
    buf = io.BytesIO()
    np.savez(
        buf,
        conv1_w=conv1_w, conv1_b=conv1_b,
        conv2_w=conv2_w, conv2_b=conv2_b,
        lin0_w=lin0_w,   lin0_b=lin0_b,
        pi0_w=pi0_w,     pi0_b=pi0_b,
        pi1_w=pi1_w,     pi1_b=pi1_b,
        act_w=act_w,     act_b=act_b,
    )
    buf.seek(0)
    return base64.b64encode(buf.read()).decode("utf-8")

# Create a base64 string you can embed in the submission
SUBMISSION_WEIGHTS = export_numpy_policy(model)
len(SUBMISSION_WEIGHTS), SUBMISSION_WEIGHTS[:60] + "..."

(2162560, 'UEsDBC0AAAAAAAAAIQB0CUE1//////////8LABQAY29udjFfdy5ucHkBABAA...')
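
Before embedding a blob like this in a submission, it is worth confirming that it round-trips losslessly. The sketch below does that on toy arrays. `pack` and `unpack` are hypothetical helper names that mirror the export logic above on a small scale; they are not functions from the notebook.

```python
import base64
import io

import numpy as np


def pack(arrays):
    """Serialize a dict of arrays with np.savez, then base64-encode it."""
    buf = io.BytesIO()
    np.savez(buf, **arrays)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


def unpack(b64):
    """Decode a base64 string back into a dict of NumPy arrays."""
    with np.load(io.BytesIO(base64.b64decode(b64)), allow_pickle=False) as data:
        return {k: data[k] for k in data.files}


rng = np.random.default_rng(0)
toy = {
    "w": rng.standard_normal((4, 3)).astype(np.float32),
    "b": np.zeros(4, dtype=np.float32),
}
restored = unpack(pack(toy))
ok = all(
    np.array_equal(toy[k], restored[k]) and toy[k].dtype == restored[k].dtype
    for k in toy
)
print(ok)
```

The same check could be run on `SUBMISSION_WEIGHTS` against `model.policy.state_dict()` to catch a missing or misnamed layer before submitting.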

## **Pure NumPy inference agent** (what we’ll write to `submission.py`)

This agent:

* Loads the CNN/MLP weights from the base64 blob,
* Implements `conv2d` (valid, 3×3, stride 1) and dense layers in NumPy,
* Picks the **best valid** column by descending logits,
* Falls back to a simple center-biased policy if a non-standard board size is
  used.
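
The convolution is the most expensive part of a pure-NumPy forward pass. As a possible alternative to nested Python loops, the sketch below vectorizes a valid, stride-1 convolution with `sliding_window_view` and `einsum`, and checks it against a straightforward loop. It assumes NumPy 1.20 or newer is available in the match runner, which I have not verified. Both function names are illustrative. Like PyTorch's `Conv2d`, this computes a cross-correlation (no kernel flip).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_valid_loop(x, w, b):
    """Reference implementation: x (C_in, H, W), w (C_out, C_in, kh, kw)."""
    c_out, _, kh, kw = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - kh + 1, wd - kw + 1), dtype=np.float32)
    for oc in range(c_out):
        for i in range(h - kh + 1):
            for j in range(wd - kw + 1):
                out[oc, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * w[oc]) + b[oc]
    return out


def conv2d_valid_fast(x, w, b):
    """Same result, computed with one einsum over all sliding windows."""
    kh, kw = w.shape[2:]
    # patches has shape (C_in, H - kh + 1, W - kw + 1, kh, kw)
    patches = sliding_window_view(x, (kh, kw), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", patches, w) + b[:, None, None]


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 6, 7)).astype(np.float32)
w = rng.standard_normal((5, 3, 3, 3)).astype(np.float32)
b = rng.standard_normal(5).astype(np.float32)
print(np.allclose(conv2d_valid_loop(x, w, b), conv2d_valid_fast(x, w, b), atol=1e-4))
```

The loop version has no dependency on a NumPy version; the vectorized one trades that for far fewer Python-level operations per move.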


In [8]:
SUBMISSION_TEMPLATE = r'''
import numpy as np, base64, io, random

# ==== Serialized NumPy weights (set below) ====
_WEIGHTS_B64 = """__B64_WEIGHTS__"""

# Lazy weight loader
_WEIGHTS = None
def _load():
    global _WEIGHTS
    if _WEIGHTS is None:
        raw = base64.b64decode(_WEIGHTS_B64.encode("utf-8"))
        buf = io.BytesIO(raw)
        data = np.load(buf)
        _WEIGHTS = {k: data[k] for k in data.files}
    return _WEIGHTS

# ---- Small NumPy NN runtime ----
def _relu(x): 
    return np.maximum(x, 0.0)

def _conv2d_valid(x, w, b):
    """
    x: (C_in, H, W)
    w: (C_out, C_in, 3, 3)
    b: (C_out,)
    returns: (C_out, H-2, W-2)
    """
    C_out, C_in, kh, kw = w.shape
    _, H, W = x.shape
    out = np.empty((C_out, H - kh + 1, W - kw + 1), dtype=np.float32)
    for oc in range(C_out):
        acc = np.zeros(out.shape[1:], dtype=np.float32)
        for ic in range(C_in):
            for i in range(out.shape[1]):
                for j in range(out.shape[2]):
                    acc[i, j] += np.sum(x[ic, i:i+kh, j:j+kw] * w[oc, ic])
        out[oc] = acc + b[oc]
    return out

def _dense(x, w, b):
    # x: (D,), w: (out, D), b: (out,)
    return w @ x + b

def _forward(board, weights, rows=6, cols=7):
    # Prepare input: (C=1, H=rows, W=cols), scale 0..2 -> 0..1
    x = board.reshape(1, rows, cols).astype(np.float32) / 2.0

    # conv -> ReLU -> conv -> ReLU -> flatten
    x = _conv2d_valid(x, weights["conv1_w"], weights["conv1_b"])
    x = _relu(x)
    x = _conv2d_valid(x, weights["conv2_w"], weights["conv2_b"])
    x = _relu(x).reshape(-1)

    # linear(->256) -> ReLU
    x = _relu(_dense(x, weights["lin0_w"], weights["lin0_b"]))

    # policy MLP: 256 -> 256 -> 256
    x = _relu(_dense(x, weights["pi0_w"], weights["pi0_b"]))
    x = _relu(_dense(x, weights["pi1_w"], weights["pi1_b"]))

    # logits for 7 columns
    logits = _dense(x, weights["act_w"], weights["act_b"])
    return logits

def agent(obs, config):
    rows, cols = getattr(config, "rows", 6), getattr(config, "columns", 7)

    # This policy was trained for 6x7; if dimensions differ, use robust fallback
    if rows != 6 or cols != 7:
        grid_top = [c for c in range(cols) if obs["board"][c] == 0]
        if not grid_top:
            return 0
        center = cols // 2
        return min(grid_top, key=lambda c: abs(c - center))

    weights = _load()
    grid = np.asarray(obs["board"]).reshape(rows, cols)

    # Compute logits
    logits = _forward(np.asarray(obs["board"]), weights, rows, cols)

    # Choose best valid move by descending score
    order = np.argsort(-logits)
    for c in order:
        if grid[0, int(c)] == 0:
            return int(c)

    # Fallback: any valid move (or 0)
    valid = [c for c in range(cols) if grid[0,c]==0]
    return int(random.choice(valid)) if valid else 0
'''
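
The move-selection step can also be written as a masked argmax, which makes the "best valid column" rule explicit. This is a minimal sketch with a hypothetical helper name, `best_valid_column`. It assumes the board is the flat row-major list the environment provides and that a column is playable when its top cell is 0.

```python
import numpy as np


def best_valid_column(logits, board, rows=6, cols=7):
    """Return the highest-logit column whose top cell is empty (0 if none)."""
    grid = np.asarray(board).reshape(rows, cols)
    valid = grid[0] == 0
    if not valid.any():
        return 0
    # Full columns get -inf so argmax can never select them.
    masked = np.where(valid, logits, -np.inf)
    return int(np.argmax(masked))


# Column 3 has the largest logit but is full, so the next best column wins.
grid = np.zeros((6, 7), dtype=int)
grid[:, 3] = [1, 2, 1, 2, 1, 2]
logits = np.array([0.1, 0.5, 0.2, 2.0, 1.0, -1.0, 0.0])
print(best_valid_column(logits, grid.reshape(-1)))
```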

## Quick local validation vs. `random`

In [9]:
def rl_agent_local(obs, config):
    # Use the same NumPy forward locally (without writing a file yet)
    weights = np.load(
        io.BytesIO(base64.b64decode(SUBMISSION_WEIGHTS.encode())), allow_pickle=False
    )
    W = {k: weights[k] for k in weights.files}
    rows, cols = config.rows, config.columns
    if rows != 6 or cols != 7:
        valid = [c for c in range(cols) if obs.board[c] == 0]
        return min(valid, key=lambda c: abs(c - (cols // 2))) if valid else 0
    grid = np.asarray(obs.board).reshape(rows, cols)

    # forward
    def relu(x):
        return np.maximum(x, 0.0)

    def conv2d(x, w, b):
        C_out, C_in, _, _ = w.shape
        _, H, W = x.shape
        out = np.empty((C_out, H - 2, W - 2), dtype=np.float32)
        for oc in range(C_out):
            acc = np.zeros((H - 2, W - 2), dtype=np.float32)
            for ic in range(C_in):
                for i in range(H - 2):
                    for j in range(W - 2):
                        acc[i, j] += np.sum(x[ic, i : i + 3, j : j + 3] * w[oc, ic])
            out[oc] = acc + b[oc]
        return out

    x = np.asarray(obs.board).reshape(1, rows, cols).astype(np.float32) / 2.0
    x = relu(conv2d(x, W["conv1_w"], W["conv1_b"]))
    x = relu(conv2d(x, W["conv2_w"], W["conv2_b"])).reshape(-1)
    x = relu(W["lin0_w"] @ x + W["lin0_b"])
    x = relu(W["pi0_w"] @ x + W["pi0_b"])
    x = relu(W["pi1_w"] @ x + W["pi1_b"])
    logits = W["act_w"] @ x + W["act_b"]
    for c in np.argsort(-logits):
        if grid[0, int(c)] == 0:
            return int(c)
    valid = [c for c in range(cols) if grid[0, c] == 0]
    return random.choice(valid) if valid else 0


def get_win_percentages(agent1, agent2, n_rounds=40):
    cfg = {"rows": 6, "columns": 7, "inarow": 4}
    outcomes = evaluate("connectx", [agent1, agent2], cfg, [], n_rounds // 2)
    outcomes += [
        [b, a]
        for [a, b] in evaluate(
            "connectx", [agent2, agent1], cfg, [], n_rounds - n_rounds // 2
        )
    ]
    a1 = outcomes.count([1, -1]) / len(outcomes)
    a2 = outcomes.count([-1, 1]) / len(outcomes)
    inv1 = outcomes.count([None, 0])
    inv2 = outcomes.count([0, None])
    print(
        f"Agent1 Win%: {a1:.2f} | Agent2 Win%: {a2:.2f} | Invalid A1/A2: {inv1}/{inv2}"
    )


get_win_percentages(rl_agent_local, "random", 40)

Agent1 Win%: 0.93 | Agent2 Win%: 0.07 | Invalid A1/A2: 0/0
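
Forty games is a small sample, so the printed win rate carries real uncertainty. The sketch below computes a Wilson score interval for a binomial proportion. The count of 37 wins is an assumed value chosen to be consistent with a printed 0.93 over 40 games, not a number read from the run, and `wilson_interval` is a helper written for this illustration.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate (z=1.96 gives roughly 95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 37 is a hypothetical win count, used only to illustrate the interval width.
lo, hi = wilson_interval(37, 40)
print(f"{lo:.3f} .. {hi:.3f}")
```

With a sample this size the interval spans well over ten percentage points, so a larger `n_rounds` is advisable before comparing two checkpoints.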


In [13]:
env = make("connectx", debug=True)
env.reset()
# Play as the first agent against default "random" agent.
env.run([rl_agent_local, "random"])
env.render(mode="ipython", width=500, height=450)

## Write the **submission file** (`submission.py`)

This writes the NumPy-only agent with embedded weights.

In [10]:
import os, textwrap


def write_submission(weights_b64, path="submission.py"):
    code = SUBMISSION_TEMPLATE.replace("__B64_WEIGHTS__", weights_b64)
    code = textwrap.dedent(code)
    with open(path, "w") as f:
        f.write(code)
    print(f"Wrote {path} ({len(code.splitlines())} lines)")


write_submission(SUBMISSION_WEIGHTS, "submission.py")

Wrote submission.py (91 lines)
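
A cheap way to catch template mistakes is to load the written file and call its `agent` once before uploading. The sketch below shows the fill, write, and load cycle on a toy template. `TOY_TEMPLATE`, `__PAYLOAD__`, and `write_and_load` are illustrative stand-ins, not part of the notebook, and loading via `exec` is just one simple option.

```python
import os
import tempfile
import textwrap

TOY_TEMPLATE = r'''
_PAYLOAD = """__PAYLOAD__"""

def agent(obs, config):
    # Toy agent: always plays the column encoded in the payload.
    return int(_PAYLOAD)
'''


def write_and_load(template, payload, path):
    """Fill the placeholder, write the file, and return its `agent` function."""
    code = textwrap.dedent(template.replace("__PAYLOAD__", payload))
    with open(path, "w") as f:
        f.write(code)
    namespace = {}
    with open(path) as f:
        exec(compile(f.read(), path, "exec"), namespace)
    return namespace["agent"]


with tempfile.TemporaryDirectory() as tmp:
    toy_agent = write_and_load(
        TOY_TEMPLATE, "3", os.path.join(tmp, "toy_submission.py")
    )
    print(toy_agent(None, None))
```

Applied to the real `submission.py`, the loaded `agent` could then be passed to `get_win_percentages` in place of `rl_agent_local`, so the file being tested is the one that gets uploaded.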


## Results & discussion

* **What I’m submitting:** a **Deep RL agent** (PPO + CNN) trained against a
  fixed **minimax** sparring partner (not self-play), then **exported to pure
  NumPy** so it runs without PyTorch at evaluation time.
* **Why RL if it scored lower?**

  * My **minimax** agent (alpha–beta with tactical win/block) scored **768.8**
    on the leaderboard.
  * This **RL** agent scored **312.3**.
  * I’m intentionally featuring RL to showcase the full ML engineering loop:
    **environment design, reward shaping, a strong training opponent,
    architecture, training dynamics, evaluation**, and, crucially, **deployment
    constraints** (no PyTorch at evaluation time), solved via **NumPy export**
    and a hand-rolled forward pass.

* **What’s sophisticated here**

  * CNN feature extractor + policy/value heads (SB3 PPO)
  * Reward shaping & invalid-move penalties
  * Training against a **non-trivial opponent** (minimax) rather than random
  * **Model export** to NumPy tensors and a minimal **inference runtime**
    (conv + dense) to meet sandbox constraints
  * Robust fallbacks (board-dimension check, highest-logit valid column, center bias)

* **Future improvements**

  * Longer curriculum (start vs random → negamax → deeper minimax).
  * Data augmentation (board symmetries).
  * Value-guided move filtering to reduce branching in tricky states.
  * Distillation: train a small policy on minimax rollouts for higher score at
    RL speed.
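
The board-symmetry idea can be sketched briefly. Connect X is symmetric under a left-right mirror, so each (board, action) pair has a mirrored twin that is equally valid training data. `mirror_sample` is a hypothetical helper, and how it would be wired into the PPO rollout buffer is left open here.

```python
import numpy as np


def mirror_sample(board, action, rows=6, cols=7):
    """Mirror a flat board left-to-right and remap the column index."""
    grid = np.asarray(board).reshape(rows, cols)
    return np.fliplr(grid).reshape(-1), cols - 1 - action


# A single piece in the bottom-left corner, with the agent playing column 0.
board = np.zeros(42, dtype=int)
board[35] = 1
mirrored_board, mirrored_action = mirror_sample(board, 0)
print(int(np.flatnonzero(mirrored_board)[0]), mirrored_action)
```

Mirroring twice returns the original sample, which makes a convenient unit test for the transform.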