# Introduction

So far, our agents have relied on detailed information about how to play the game.  The heuristic really provides a lot of guidance about how to select moves!

In this tutorial, you'll learn how to use **reinforcement learning** to build an intelligent agent without the use of a heuristic.  Instead, we will gradually refine the agent's strategy over time, simply by playing the game and trying to maximize the winning rate.

In this brief notebook, we won't be able to explore this complex field in detail, but you'll learn about the big picture and explore code that you can use to train your own agent.

# Neural Networks

It's difficult to come up with a perfect heuristic.  Improving the heuristic generally entails playing the game many times, to determine specific cases where the agent could have made better choices.  And, it can prove challenging to interpret what exactly is going wrong, and ultimately to fix old mistakes without accidentally introducing new ones.

Wouldn't it be much easier if we had a more systematic way of improving the agent with gameplay experience?  

Towards this goal, in this tutorial we'll replace the heuristic with a neural network, called a **value network**.

The value network accepts the current board as input.  And, it outputs a probability for each possible move.

<center>
<img src="https://i.imgur.com/KgAliYQ.png" width=90%><br/>
</center>

Then, the agent selects a move by sampling from these probabilities.  For instance, for the game board in the image above, the agent selects column 4 with 50% probability.

This way, to encode a good gameplay strategy, we need only amend the weights of the value network so that _for every possible game board_, it assigns higher probabilities to better moves.

At least in theory, that's our goal.  In practice, we won't actually check if that's the case -- remember that Connect Four has over 4 trillion possible game boards!
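
To make the sampling step concrete, here is a minimal sketch of drawing a move from a vector of move probabilities.  The probabilities below are made up for illustration (they are not read from the image), and `sample_move` is a hypothetical helper, not part of any library.

```python
import numpy as np

def sample_move(probs, rng):
    """Sample a column index according to the given move probabilities."""
    probs = np.asarray(probs, dtype=float)
    # Normalize defensively so the probabilities sum to exactly 1
    return int(rng.choice(len(probs), p=probs / probs.sum()))

rng = np.random.default_rng(0)

# Hypothetical network output for a 7-column board: index 4 gets 50%
probs = [0.05, 0.10, 0.10, 0.15, 0.50, 0.05, 0.05]

# Sample many moves and count how often each column is chosen
moves = [sample_move(probs, rng) for _ in range(10000)]
counts = np.bincount(moves, minlength=len(probs))
print(counts / counts.sum())
```

Over many samples, the fraction of times each column is chosen approaches its assigned probability, so better moves get played more often without the weaker ones being ruled out entirely.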

# Reinforcement Learning

How can we approach the task of amending the weights of the value network, in practice?  Here's the approach we'll take in this lesson:
- After each move, we give the agent a **reward** that tells it how well it did:
  - **_If_** the agent wins the game in that move, we give it a reward of `+1`.
  - **_Else if_** the opponent wins the game in its next move (i.e., the agent failed to prevent its opponent from winning), we give the agent a reward of `-1`.
  - **_Else_**, the agent gets a reward of `1/42`.
  
  
- At the end of each game, the agent adds up its rewards.  We refer to the sum of rewards as the agent's **cumulative reward**.  
  - For instance, if the game lasted 8 moves (each player played four times), and the agent ultimately won, then its cumulative reward is `3*(1/42) + 1`.
  - If the game lasted 11 moves (and the opponent went first, so the agent played five times), and the opponent won in its final move, then the agent's cumulative reward is `4*(1/42) - 1`.
  - If the game ends in a draw, then the agent played exactly 21 moves, and its cumulative reward is `21*(1/42) = 0.5`.
  
  
- Our goal is to find the weights of the neural network that (on average) maximize the agent's cumulative reward.  
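
The three worked examples above can be checked with a few lines of code.  This is only a sketch of the arithmetic: `cumulative_reward` and `STEP_BONUS` are hypothetical names introduced here, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

STEP_BONUS = Fraction(1, 42)  # reward for a move that neither wins nor loses

def cumulative_reward(agent_moves, outcome):
    """Sum of rewards for a game in which the agent played `agent_moves` times."""
    if outcome == "win":
        # Every move but the last earns the bonus; the last move earns +1
        return (agent_moves - 1) * STEP_BONUS + 1
    if outcome == "loss":
        # The agent's last move is followed by the opponent's win, so it earns -1
        return (agent_moves - 1) * STEP_BONUS - 1
    if outcome == "draw":
        # Every move earns the bonus
        return agent_moves * STEP_BONUS
    raise ValueError("outcome must be 'win', 'loss' or 'draw'")

print(cumulative_reward(4, "win"))    # 8-move game, agent wins
print(cumulative_reward(5, "loss"))   # 11-move game, opponent wins
print(cumulative_reward(21, "draw"))  # full board
```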

Initially, the weights are set to random values.
- As the agent plays the game, we continually try out new values for the weights, to see how the cumulative reward is affected, on average.  Over time, after playing many games, as we get a better idea of how the weights affect cumulative reward, we can start to settle towards weights that performed better.  
    - _Of course, we have glossed over the details here, and there's a lot of complexity involved in this process.  For now, we focus on the big picture!_
    
    
- This way, we'll end up with an agent that tries to win the game (so it gets the final reward of `+1`) and tries to make the game last as long as possible (so that it collects the `1/42` bonus as many times as it can).
    - _You might argue that it doesn't really make sense to want the game to last as long as possible -- this might result in a very inefficient agent that doesn't play obvious winning moves early in gameplay.  And, your intuition would be correct -- this will make the agent less efficient!  The reason we include the `1/42` bonus is to help the algorithms we'll use converge better.  Further discussion is outside of the scope of this course, but you can learn more by reading about the "temporal credit assignment problem" and "reward shaping"._
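
The idea of "trying out new weights and keeping the ones that perform better" can be illustrated on a toy problem.  The sketch below is **not** the algorithm used later in this notebook; it is a simple random-search hill climber.  The function `average_reward` is a made-up stand-in for "play many games and average the cumulative reward", and all names here are hypothetical.

```python
import numpy as np

TARGET = np.array([0.5, -0.3])  # made-up "ideal" weights for the toy problem

def average_reward(weights, rng, n_games=200):
    """Toy stand-in for playing many games: a noisy reward that peaks at TARGET."""
    noise = rng.normal(0.0, 0.1, size=n_games)
    return float(np.mean(-np.sum((weights - TARGET) ** 2) + noise))

def hill_climb(n_iters=300, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=2)           # initially, the weights are random
    best = average_reward(weights, rng)
    for _ in range(n_iters):
        # Try out slightly different weights
        candidate = weights + step * rng.normal(size=2)
        score = average_reward(candidate, rng)
        # Keep them only if the average reward improved
        if score > best:
            weights, best = candidate, score
    return weights, best

weights, best = hill_climb()
print(weights, best)
```

Practical reinforcement learning algorithms use gradient information rather than blind perturbations, but the loop has the same shape: play, measure the average reward, and adjust the weights.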
    
# Code

There are a lot of great implementations of reinforcement learning algorithms online.  In this course, we'll use [Stable Baselines](https://github.com/hill-a/stable-baselines).

There's a bit of extra work that we need to do to make the environment compatible with Stable Baselines.  For this, we define the class below.

In [None]:
#$HIDE_INPUT$
import random
import numpy as np
import pandas as pd

In [None]:
from kaggle_environments import make, evaluate
from gym import spaces

class ConnectFourGym:
    def __init__(self, agent2="random"):
        ks_env = make("connectx", debug=True)
        ks_env._Environment__get_space = self.__get_space
        self.env = ks_env.gym([None, agent2])
        self.rows = ks_env.configuration.rows
        self.columns = ks_env.configuration.columns
        self.action_space = spaces.Discrete(self.columns)
        self.observation_space = spaces.Box(low=0, high=2, 
                                            shape=(self.rows,self.columns,1), dtype=int)
        self.metadata = None
        self.reward_range = (-1, 1)
        self.spec = None
    def reset(self):
        # reset() returns only the initial observation
        obs = self.env.reset()
        return np.array(obs['board']).reshape(self.rows,self.columns,1)
    def step(self, action):
        obs, reward, done, info = self.env.step(int(action))
        return np.array(obs['board']).reshape(self.rows,self.columns,1), reward, done, info
    def __get_space(self, spec):
        return 

# Create ConnectFour environment
env = ConnectFourGym()
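
To see what the `reshape` calls in `reset` and `step` do, here is a small standalone sketch.  It assumes the board arrives as a flat list of 42 entries in row-major order with the top row first (0 for empty, 1 and 2 for the two players); the board contents are made up.

```python
import numpy as np

rows, columns = 6, 7

# Hypothetical flat board: two discs stacked in column 3
flat_board = [0] * (rows * columns)
flat_board[5 * columns + 3] = 1   # bottom row, column 3: player 1
flat_board[4 * columns + 3] = 2   # one row up, column 3: player 2

# Same reshape as in the class above: rows x columns x 1 channel
grid = np.array(flat_board).reshape(rows, columns, 1)

print(grid.shape)
print(grid[:, :, 0])
```

The trailing dimension of size 1 plays the role of an image channel, which is the layout a convolutional network expects.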

Stable Baselines requires us to work with "vectorized" environments.  For this, we can use the `DummyVecEnv` class.

In [None]:
import os
from stable_baselines.bench import Monitor 
from stable_baselines.common.vec_env import DummyVecEnv

# Create directory for logging training information
log_dir = "/kaggle/working/log/"
os.makedirs(log_dir, exist_ok=True)

# Wrap the environment so that episode rewards are logged to log_dir
monitor_env = Monitor(env, log_dir, allow_early_resets=True)

# Create a vectorized environment
vec_env = DummyVecEnv([lambda: monitor_env])
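
If "vectorized" is unfamiliar, the following simplified sketch shows the idea: a wrapper holds a list of environments, steps them together, and returns batched arrays.  `TinyVecEnv` and `CountingEnv` are hypothetical toy classes written for this illustration; they are not the Stable Baselines implementation.

```python
import numpy as np

class CountingEnv:
    """Toy environment: the observation is a step counter; episodes last 3 steps."""
    def reset(self):
        self.t = 0
        return np.array([self.t])
    def step(self, action):
        self.t += 1
        done = self.t >= 3
        return np.array([self.t]), float(done), done, {}

class TinyVecEnv:
    """Simplified stand-in for a vectorized environment wrapper."""
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]
    def reset(self):
        # Stack observations so the first axis indexes the environment
        return np.stack([env.reset() for env in self.envs])
    def step(self, actions):
        results = []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)
            if done:
                obs = env.reset()  # start a new episode automatically
            results.append((obs, reward, done, info))
        obs, rewards, dones, infos = zip(*results)
        return np.stack(obs), np.array(rewards), np.array(dones), list(infos)

toy_vec_env = TinyVecEnv([lambda: CountingEnv()])
first_obs = toy_vec_env.reset()
print(first_obs.shape)  # one environment, one-element observation
```

With a single environment, as in this notebook, the only visible change is an extra leading axis of size 1 on observations, rewards and done flags.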

Our next step is to specify the architecture of the neural network that will be used to predict the action values.  

In [None]:
import tensorflow as tf
from stable_baselines import DQN 
from stable_baselines.a2c.utils import conv, linear, conv_to_fc
from stable_baselines.deepq.policies import CnnPolicy

# Neural network for predicting action values
def modified_cnn(scaled_images, **kwargs):
    activ = tf.nn.relu
    layer_1 = activ(conv(scaled_images, 'c1', n_filters=32, filter_size=3, stride=1, 
                         init_scale=np.sqrt(2), **kwargs))
    layer_2 = activ(conv(layer_1, 'c2', n_filters=64, filter_size=3, stride=1, 
                         init_scale=np.sqrt(2), **kwargs))
    layer_2 = conv_to_fc(layer_2)
    return activ(linear(layer_2, 'fc1', n_hidden=512, init_scale=np.sqrt(2)))  

class CustomCnnPolicy(CnnPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomCnnPolicy, self).__init__(*args, **kwargs, cnn_extractor=modified_cnn)
        
# Initialize agent
model = DQN(CustomCnnPolicy, vec_env, verbose=0)
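
It can help to trace how the board shrinks as it passes through the two convolutional layers.  The sketch below assumes the `conv` helper applies no padding (this is an assumption about its default, so treat the numbers as indicative); `conv_output_size` is a hypothetical helper.

```python
def conv_output_size(size, filter_size, stride):
    """Spatial size after a convolution with no padding."""
    return (size - filter_size) // stride + 1

rows, columns = 6, 7           # standard Connect Four board
h, w = rows, columns
for n_filters in (32, 64):     # the two conv layers in modified_cnn
    h = conv_output_size(h, filter_size=3, stride=1)
    w = conv_output_size(w, filter_size=3, stride=1)
    print(f"{n_filters} filters -> feature map {h} x {w} x {n_filters}")

# Number of features fed into the 512-unit fully connected layer
flattened = h * w * 64
print("flattened features:", flattened)
```

Under this assumption, a third 3x3 layer would not fit the 2 x 3 feature map, which is one reason to keep the network shallow on such a small board.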

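Note that this algorithm involves a lot of parameters.  Here, we keep the default values, which perform fine.  Next, we train the agent, periodically checking the logged rewards and saving the best model so far.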

In [None]:
from stable_baselines.results_plotter import load_results, ts2xy

# How often to check for model improvement
check_every = 2000

# Initialize training information
best_mean_reward, n_steps = -np.inf, 0

# Track training progress and save best model
def callback(_locals, _globals):
    global n_steps, best_mean_reward
    if (n_steps + 1) % check_every == 0:
        x, y = ts2xy(load_results(log_dir), 'timesteps')
        if len(x) > 0:
            mean_reward = np.mean(y[-check_every:])
            print(x[-1], 'timesteps')
            print("Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}".format(best_mean_reward, mean_reward))
            if mean_reward > best_mean_reward:
                best_mean_reward = mean_reward
                print("Saving new best model")
                _locals['self'].save(log_dir + 'best_model.pkl')
    n_steps += 1
    return True

# Train agent
model.learn(total_timesteps=500000, callback=callback)

Next, we plot the training progress from the log file.

In [None]:
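# Read the per-episode rewards (column 'r') logged by the Monitor wrapper
with open(os.path.join(log_dir, "monitor.csv"), 'rt') as fh:    
    firstline = fh.readline()
    # The first line of the log is a comment line
    assert firstline[0] == '#'
    df = pd.read_csv(fh, index_col=None)['r']
# Map rewards of -1 to 0, then plot the rolling mean over 1000 episodes
df.replace(-1, 0).rolling(window=1000).mean().plot()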
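
The `replace` and `rolling` calls are easier to follow on a tiny example.  The rewards below are made up, and a window of 2 is used instead of 1000 so the result can be read by eye.

```python
import pandas as pd

# Hypothetical episode rewards: +1 for a win, -1 for a loss
rewards = pd.Series([1, -1, 1, 1, -1, 1])

# Map losses to 0, then average over a sliding window of 2 episodes
win_rate = rewards.replace(-1, 0).rolling(window=2).mean()
print(win_rate)
```

The first entry is `NaN` because the window is not yet full.  After the replacement, each later value is the fraction of wins within the window, assuming the rewards are only `+1` and `-1`.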

...

In [None]:
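def agent1(obs, config):
    # Load the best model
    best_model = DQN.load(log_dir + 'best_model.pkl')
    # Use the best model to select a column
    board = np.array(obs['board']).reshape(config.rows, config.columns, 1)
    col, _ = best_model.predict(board)
    return int(col)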
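
A learned model may sometimes select a column that is already full.  One possible guard, sketched here with hypothetical helper names, is to fall back to a random valid column.  It assumes the board is a flat list in row-major order with the top row first, so `board[c]` is the top cell of column `c`.

```python
import random

def valid_columns(board, columns):
    """Columns whose top cell is still empty."""
    return [c for c in range(columns) if board[c] == 0]

def safe_move(preferred, board, columns, rng=random):
    """Return the preferred column if it is playable, else a random valid one."""
    valid = valid_columns(board, columns)
    if preferred in valid:
        return preferred
    return rng.choice(valid)

# Hypothetical board in which column 3 is full (its top cell is occupied)
board = [0] * 42
board[3] = 1

print(valid_columns(board, 7))
print(safe_move(2, board, 7))   # column 2 is playable, so it is kept
print(safe_move(3, board, 7))   # column 3 is full, so another column is returned
```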

Next, the agent plays one game against a random agent.

In [None]:
# Create the game environment
env = make("connectx")

# The trained agent plays one game round against a random agent
env.run([agent1, "random"])

# Show the game
env.render(mode="ipython")

# Your turn

tbd ...