# Introduction

So far, our agents have relied on detailed information about how to play the game.  In particular, the heuristic provides a lot of guidance about how to select moves.  

In this tutorial, you'll learn how to use **reinforcement learning** to train an intelligent agent without the use of a heuristic.  Instead, your agent will gradually develop its own strategy over time, simply by playing the game and trying to maximize its winning rate.

# (Deep) Reinforcement Learning

We'll use a Deep Q-Network (DQN) as a jumping-off point.

A DQN is a neural network that maps a game state to an estimated value for each possible action.

since 
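
To make the idea of action values concrete, here is a minimal sketch of how a vector of estimated values could be turned into a move. It uses epsilon-greedy selection, a common exploration scheme for DQN-style agents. The helper name `select_action`, the `epsilon` parameter, and the example values are illustrative assumptions, not part of any library used later in this tutorial.

```python
import numpy as np

def select_action(q_values, valid_actions, epsilon, rng):
    """Epsilon-greedy choice among valid actions (hypothetical helper)."""
    if rng.random() < epsilon:
        # Explore: pick a valid action uniformly at random
        return int(rng.choice(valid_actions))
    # Exploit: pick the valid action with the highest estimated value
    best = max(valid_actions, key=lambda a: q_values[a])
    return int(best)

# Made-up action values for a board with seven columns
q_values = np.array([0.1, -0.3, 0.8, 0.2, 0.0, -0.5, 0.4])
rng = np.random.default_rng(0)

# With epsilon=0 the choice is purely greedy
greedy_move = select_action(q_values, list(range(7)), epsilon=0.0, rng=rng)
```

With `epsilon=0.0` the function always returns the column with the largest value; raising `epsilon` makes the agent try other columns more often.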

# Code

There are a lot of great implementations of reinforcement learning algorithms online.  In this course, we'll use [Stable Baselines](https://github.com/hill-a/stable-baselines).

There's a bit of extra work that we need to do to ensure that the environment is compatible with Stable Baselines.  For this, we define the class below.

In [None]:
#$HIDE_INPUT$
import random
import numpy as np
import pandas as pd

In [None]:
import kaggle_simulations as ks
from gym import spaces

class ConnectFourGym:
    def __init__(self, agent2="random"):
        ks_env = ks.make("connectx")
        ks_env._Environment__get_space = self.__get_space
        self.env = ks_env.gym([None, agent2])
        self.rows = ks_env.configuration.rows
        self.columns = ks_env.configuration.columns
        self.action_space = spaces.Discrete(self.columns)
        self.observation_space = spaces.Box(low=0, high=2, shape=(self.rows,self.columns,1), dtype=int)
        self.metadata = None
        self.reward_range = (-1, 1)
        self.spec = None
    def reset(self):
        obs, reward, done, info = self.env.reset()
        return np.array(obs['board']).reshape(self.rows,self.columns,1)
    def step(self, action):
        obs, reward, done, info = self.env.step(int(action))
        return np.array(obs['board']).reshape(self.rows,self.columns,1), reward, done, info
    def __get_space(self, spec):
        return 

# Create ConnectFour environment
env = ConnectFourGym()
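
The `reset` and `step` methods above turn the flat board into a grid with a trailing channel axis, which is the image-like shape a convolutional network expects. The sketch below shows that reshape on its own. The 6x7 board size and the row-major layout (top row first) are assumptions made for illustration.

```python
import numpy as np

rows, columns = 6, 7  # assumed default board size

# Flat board: rows*columns integers (0 = empty, 1 and 2 = the two players)
flat_board = [0] * (rows * columns)

# Place one piece in the bottom row, middle column (assuming row-major order)
flat_board[5 * columns + 3] = 1

# Same reshape as in ConnectFourGym: (rows, columns, 1)
grid = np.array(flat_board).reshape(rows, columns, 1)
print(grid.shape)
```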

Stable Baselines requires us to work with "vectorized" environments.  For this, we can use the `DummyVecEnv` class.

In [None]:
import os
from stable_baselines.bench import Monitor 
from stable_baselines.common.vec_env import DummyVecEnv

# Create directory for logging training information
log_dir = "/kaggle/working/log/"
os.makedirs(log_dir, exist_ok=True)

# Wrap the environment in a Monitor that logs episode rewards to log_dir
monitor_env = Monitor(env, log_dir, allow_early_resets=True)

# Create a vectorized environment
vec_env = DummyVecEnv([lambda: monitor_env])
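
To illustrate what "vectorized" means here, the following is a conceptual sketch of a wrapper that puts a batch axis in front of observations, rewards, and done flags. `ToyEnv` and `TinyVecEnv` are hypothetical stand-ins written for this example; they are not the Stable Baselines implementation, and they omit details such as automatically resetting finished environments.

```python
import numpy as np

class ToyEnv:
    """Stand-in environment with a 2x2 board (hypothetical, for illustration only)."""
    def reset(self):
        return np.zeros((2, 2, 1), dtype=int)

    def step(self, action):
        obs = np.zeros((2, 2, 1), dtype=int)
        obs[0, action, 0] = 1
        return obs, 0.0, False, {}

class TinyVecEnv:
    """Batches several environments behind one interface (conceptual sketch)."""
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        # Stack per-environment observations along a new leading axis
        return np.stack([env.reset() for env in self.envs])

    def step(self, actions):
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        obs, rewards, dones, infos = zip(*results)
        return np.stack(obs), np.array(rewards), np.array(dones), list(infos)

vec = TinyVecEnv([lambda: ToyEnv()])
batch_obs = vec.reset()
next_obs, rewards, dones, infos = vec.step([1])
```

Even with a single environment, every array gains a leading dimension of size 1, so the learning algorithm can treat one environment and many environments the same way.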

Our next step is to specify the architecture of the neural network that will be used to predict the action values.  

In [None]:
import tensorflow as tf
from stable_baselines import DQN 
from stable_baselines.a2c.utils import conv, linear, conv_to_fc
from stable_baselines.deepq.policies import CnnPolicy

# Neural network for predicting action values
def modified_cnn(scaled_images, **kwargs):
    activ = tf.nn.relu
    layer_1 = activ(conv(scaled_images, 'c1', n_filters=32, filter_size=3, stride=1, init_scale=np.sqrt(2), **kwargs))
    layer_2 = activ(conv(layer_1, 'c2', n_filters=64, filter_size=3, stride=1, init_scale=np.sqrt(2), **kwargs))
    layer_2 = conv_to_fc(layer_2)
    return activ(linear(layer_2, 'fc1', n_hidden=512, init_scale=np.sqrt(2)))  

class CustomCnnPolicy(CnnPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomCnnPolicy, self).__init__(*args, **kwargs, cnn_extractor=modified_cnn)
        
# Initialize agent
model = DQN(CustomCnnPolicy, vec_env, verbose=0)
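
As a sanity check on the architecture, the arithmetic below estimates the feature-map sizes produced by the two convolutional layers. It is a sketch under two assumptions: the board is 6x7, and the convolutions apply no padding. If `conv` pads its input, the sizes will differ.

```python
def conv_output_size(size, filter_size, stride):
    """Output length along one axis for a convolution without padding."""
    return (size - filter_size) // stride + 1

rows, columns = 6, 7  # assumed default board size

# First layer: 3x3 filters, stride 1
h1, w1 = conv_output_size(rows, 3, 1), conv_output_size(columns, 3, 1)

# Second layer: 3x3 filters, stride 1, 64 output channels
h2, w2 = conv_output_size(h1, 3, 1), conv_output_size(w1, 3, 1)

# Number of features handed to the fully connected layer
flat_features = h2 * w2 * 64
print((h1, w1), (h2, w2), flat_features)  # prints (4, 5) (2, 3) 384
```

Under these assumptions the board shrinks quickly, which is one reason a small 3x3 filter with stride 1 is a reasonable fit for such a small input.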

In [None]:
from stable_baselines.results_plotter import load_results, ts2xy

# How often to check for model improvement
check_every = 2000

# Initialize training information
best_mean_reward, n_steps = -np.inf, 0

# Track training progress and save best model
def callback(_locals, _globals):
    global n_steps, best_mean_reward
    if (n_steps + 1) % check_every == 0:
        x, y = ts2xy(load_results(log_dir), 'timesteps')
        if len(x) > 0:
            mean_reward = np.mean(y[-check_every:])
            print(x[-1], 'timesteps')
            print("Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}".format(best_mean_reward, mean_reward))
            if mean_reward > best_mean_reward:
                best_mean_reward = mean_reward
                print("Saving new best model")
                _locals['self'].save(log_dir + 'best_model.pkl')
    n_steps += 1
    return True

# Train agent
model.learn(total_timesteps=500000, callback=callback)
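
The callback above keeps its state in global variables. As an alternative, the same "remember the best mean reward" bookkeeping can be kept inside a closure. The sketch below shows only that bookkeeping on made-up reward lists; `make_best_tracker` and its `window` parameter are hypothetical names, and the code does not interact with Stable Baselines.

```python
import numpy as np

def make_best_tracker(window=100):
    """Return a function reporting whether the recent mean reward is a new best."""
    best = -np.inf

    def update(episode_rewards):
        nonlocal best
        if len(episode_rewards) == 0:
            return False, best
        # Mean reward over the most recent `window` episodes
        mean_reward = float(np.mean(episode_rewards[-window:]))
        improved = mean_reward > best
        if improved:
            best = mean_reward
        return improved, best

    return update

update = make_best_tracker(window=2)
first = update([1, -1])                  # mean of last two episodes is 0.0
second = update([1, -1, 1, 1])           # mean of last two episodes is 1.0
third = update([1, -1, 1, 1, -1, -1])    # mean drops to -1.0, best stays 1.0
```

In a training callback, a model would be saved only when the first returned value is `True`.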

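To check training progress, we plot a rolling average of the episode rewards recorded by the monitor, with losses (reward -1) counted as 0.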

In [None]:
with open(os.path.join(log_dir, "monitor.csv"), 'rt') as fh:    
    firstline = fh.readline()
    assert firstline[0] == '#'
    df = pd.read_csv(fh, index_col=None)['r']
df.replace(-1, 0).rolling(window=1000).mean().plot()
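
To see what the plotted quantity means, the sketch below applies the same transformation to a short, made-up series of episode rewards with a window of 2 instead of 1000. Assuming rewards only take the values 1, 0, and -1, each rolling mean is the fraction of games in the window that ended with reward 1.

```python
import pandas as pd

# Made-up episode rewards: 1 = win, -1 = loss
rewards = pd.Series([1, -1, 1, 1, -1, 1])

# Count losses as 0, then average over a sliding window of 2 episodes
win_rate = rewards.replace(-1, 0).rolling(window=2).mean()
print(win_rate.tolist())
```

The first entry is `NaN` because the window is not yet full, which is also why the plot above starts only after the first 1000 episodes.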

...

In [None]:
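def agent1(obs, config):
    # Load the best model saved by the training callback
    best_model = DQN.load(log_dir + 'best_model.pkl')
    # Use the best model to select a column
    board = np.array(obs['board']).reshape(config.rows, config.columns, 1)
    col, _ = best_model.predict(board)
    return int(col)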
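
A learned policy can occasionally pick a column that is already full, which is an invalid move. One possible safeguard, sketched below, is to fall back to a random open column. The helper `ensure_valid_column` is hypothetical, and the check assumes a flat, row-major board whose first `columns` entries form the top row.

```python
import random

def ensure_valid_column(col, board, columns):
    """Return col if its top cell is empty, else a random open column (hypothetical helper)."""
    # The first `columns` entries of the flat board are assumed to be the top row
    if board[col] == 0:
        return col
    open_columns = [c for c in range(columns) if board[c] == 0]
    return random.choice(open_columns)

columns = 7
board = [0] * 42
board[3] = 1  # top cell of column 3 is occupied, so column 3 is full

kept = ensure_valid_column(2, board, columns)      # column 2 is open, so it is kept
replaced = ensure_valid_column(3, board, columns)  # column 3 is full, so another is chosen
```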

Next, the trained agent plays one game against a random agent.

In [None]:
# Create the game environment
env = ks.make("connectx")

# The trained agent plays one game round against a random agent
env.run([agent1, "random"])

# Show the game
env.render(mode="ipython")

# Your turn

tbd ...