# Reinforcement Learning Chess 
Reinforcement Learning Chess is a series of notebooks where I implement Reinforcement Learning algorithms to develop a chess AI. I start off with simpler versions (environments) that can be tackled with simple methods and gradually expand on those concepts until I have a full-fledged chess AI.

[**Notebook 1: Policy Iteration**](https://www.kaggle.com/arjanso/reinforcement-learning-chess-1-policy-iteration)  
[**Notebook 3: Q-networks**](https://www.kaggle.com/arjanso/reinforcement-learning-chess-3-q-networks)  
[**Notebook 4: Policy Gradients**](https://www.kaggle.com/arjanso/reinforcement-learning-chess-4-policy-gradients)  
[**Notebook 5: Monte Carlo Tree Search**](https://www.kaggle.com/arjanso/reinforcement-learning-chess-5-tree-search)  

# Notebook II: Model-free control
In this notebook I use the same move-chess environment as in notebook 1. In that notebook I mentioned that policy evaluation calculates the state value by backing up the successor state values, weighted by the transition probabilities to those states. The problem is that these probabilities are usually unknown in real-world problems. Luckily there are control techniques that work in such unknown environments. These techniques don't leverage any prior knowledge about the environment's dynamics: they are model-free.
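
For reference, the policy evaluation backup can be written as below. This uses the standard textbook notation (policy $\pi$, transition probabilities $P$, reward $R$, discount $\gamma$), which is my assumption here and not something taken from the RLC package:

$$
v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \, v_k(s') \right]
$$

The term $P(s' \mid s, a)$ is the part we cannot evaluate without a model. Model-free methods replace this expectation with transitions sampled by actually acting in the environment.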

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import inspect

In [None]:
!pip install --upgrade git+https://github.com/arjangroen/RLC.git  # RLC is the Reinforcement Learning package

In [None]:
from RLC.move_chess.environment import Board
from RLC.move_chess.agent import Piece
from RLC.move_chess.learn import Reinforce

### The environment
- The state space is an 8 by 8 grid
- The starting state S is the top-left square (0,0)
- The terminal state F is square (5,7). 
- Every move from state to state gives a reward of minus 1
- Naturally the best policy for this environment is to move from S to F in the fewest possible moves.
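
As a quick sanity check on what "fewest moves" means, here is a small sketch for the king. It assumes an empty board and standard king movement (one square in any direction), so the minimum number of moves is the Chebyshev distance. The helper `king_distance` is a hypothetical name for illustration and is not part of RLC.

```python
def king_distance(start, goal):
    """Minimum number of king moves between two squares (Chebyshev distance)."""
    return max(abs(goal[0] - start[0]), abs(goal[1] - start[1]))


start, terminal = (0, 0), (5, 7)
n_moves = king_distance(start, terminal)
best_return = -1 * n_moves  # reward of -1 per move, undiscounted

print(n_moves, best_return)  # 7 -7
```

Under these assumptions, an optimal king policy should reach F in 7 moves, for an undiscounted return of -7 from the starting square.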

In [None]:
env = Board()
env.render()
env.visual_board

### The agent
- The agent is a chess Piece (king, queen, rook, knight or bishop)
- The agent has a behavior policy determining what the agent does in what state

In [None]:
p = Piece(piece='king')

### Reinforce
- The reinforce object contains the algorithms for solving move chess
- The agent and the environment are attributes of the Reinforce object

In [None]:
r = Reinforce(p,env)

# 2.1 Monte Carlo Control

**Theory**  
The basic intuition is:
* We do not know the environment, so we sample an episode from beginning to end by running our current policy
* We try to estimate the action-values rather than the state values. This is because we are working model-free, so just knowing state values won't help us select the best actions.
* The value of a state-action pair is estimated as the average return following the first visit to that state-action pair in each episode
* Based on this we can improve our policy and repeat the process until the algorithm converges

![](http://incompleteideas.net/book/first/ebook/pseudotmp5.png)
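
To make the first-visit idea concrete, here is a minimal standalone sketch that computes the undiscounted return following the first visit of each state-action pair in one episode. The function name `first_visit_returns` and the toy episode are made up for illustration. The averaging over many episodes is left out.

```python
import numpy as np


def first_visit_returns(states, actions, rewards):
    """Map each (state, action) pair to the undiscounted return after its first visit."""
    returns = {}
    for idx, (state, action) in enumerate(zip(states, actions)):
        key = (state, action)
        if key in returns:  # only the first visit in the episode counts
            continue
        returns[key] = float(np.sum(rewards[idx:]))
    return returns


# Hypothetical 3-step episode in which the pair ((0, 0), 1) is visited twice
states = [(0, 0), (0, 0), (1, 1)]
actions = [1, 1, 4]
rewards = [-1, -1, -1]

print(first_visit_returns(states, actions, rewards))
# {((0, 0), 1): -3.0, ((1, 1), 4): -1.0}
```

The second visit to `((0, 0), 1)` is ignored, so that pair is credited with the full return of -3 from its first occurrence.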

**Implementation**

In [None]:
#print(inspect.getsource(r.agent.apply_policy))
def apply_policy(self, state, epsilon):
        """
        Apply the policy of the agent
        Args:
            state: tuple of length 2
            epsilon: exploration probability, 0 for greedy behavior, 1 for pure exploration

        Returns:
            the selected action for the state under the current policy

        """
        greedy_action_value = np.max(self.policy[state[0], state[1], :])
        greedy_indices = [i for i, a in enumerate(self.policy[state[0], state[1], :]) if
                          a == greedy_action_value]
        action_index = np.random.choice(greedy_indices)
        if np.random.uniform(0, 1) < epsilon:
            action_index = np.random.choice(range(len(self.action_space)))
        return action_index

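func_type = type(r.agent.apply_policy) # method not function
# bind the function defined above to the agent, since it uses the agent's policy and action_space
r.agent.apply_policy = func_type(apply_policy, r.agent)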

In [None]:
#print(inspect.getsource(r.play_episode))

def play_episode(self, state, max_steps=1e3, epsilon=0.1):
    """
    Play an episode of move chess
    :param state: tuple describing the starting state on 8x8 matrix
    :param max_steps: integer, maximum amount of steps before terminating the episode
    :param epsilon: exploration parameter
    :return: tuple of lists describing states, actions and rewards in a episode
    """
    self.env.state = state
    states = []
    actions = []
    rewards = []
    episode_end = False

    # Play out an episode
    count_steps = 0
    while not episode_end:
        count_steps += 1
        states.append(state)
        action_index = self.agent.apply_policy(state, epsilon)  # get the index of the next action
        action = self.agent.action_space[action_index]
        actions.append(action_index)
        reward, episode_end = self.env.step(action)
        state = self.env.state
        rewards.append(reward)

        #  avoid infinite loops
        if count_steps > max_steps:
            episode_end = True

    return states, actions, rewards

func_type = type(r.play_episode) # method not function
r.play_episode = func_type(play_episode, r)

In [None]:
#print(inspect.getsource(r.monte_carlo_learning))

def monte_carlo_learning(self, epsilon=0.1):
        """
        Learn move chess through monte carlo control
        :param epsilon: exploration rate
        :return:
        """
        state = (0, 0)
        self.env.state = state

        # Play out an episode
        states, actions, rewards = self.play_episode(state, epsilon=epsilon)
        
        #print(states, actions, rewards)
        first_visits = []
        for idx, state in enumerate(states):
            action_index = actions[idx]
            if (state, action_index) in first_visits:
                continue
            r = np.sum(rewards[idx:])
            if (state, action_index) in self.agent.Returns.keys():
                self.agent.Returns[(state, action_index)].append(r)
            else:
                self.agent.Returns[(state, action_index)] = [r]
            self.agent.action_function[state[0], state[1], action_index] = \
                np.mean(self.agent.Returns[(state, action_index)])
            first_visits.append((state, action_index))
        
        #print(len(self.agent.Returns.keys()))
        #print(self.agent.action_function.shape)
        # Update the policy. In Monte Carlo Control, this is greedy behavior with respect to the action function
        self.agent.policy = self.agent.action_function.copy()

        
func_type = type(r.monte_carlo_learning) # method not function
r.monte_carlo_learning = func_type(monte_carlo_learning, r)

**Demo**  
We do 100 iterations of monte carlo learning while maintaining a high exploration rate of 0.5:

In [None]:
for k in range(100):
    eps = 0.5
    r.monte_carlo_learning(epsilon=eps)

In [None]:
for k in range(1):
    eps = 0.1
    r.monte_carlo_learning(epsilon=eps)
    print("------------")
    r.visualize_policy()

Best action value for each state:

In [None]:
r.agent.action_function.max(axis=2).astype(int)

# 2.2 Temporal Difference Learning 

**Theory**
* Like Policy Iteration, we can back up state-action values from the successor state-action value without waiting for the episode to end.
* We update our state-action value in the direction of the reward plus the discounted successor state-action value.
* The algorithm is called SARSA: State-Action-Reward-State-Action.
* Epsilon is gradually lowered (the GLIE property)
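
A single SARSA update can be sketched in isolation as follows. The function `sarsa_update` and the numbers are hypothetical and only illustrate the arithmetic. `alpha` is the learning rate and `gamma` the discount factor.

```python
def sarsa_update(q_sa, reward, q_next, alpha=0.2, gamma=0.9):
    """Return the updated Q(s, a) after one SARSA (TD0) step."""
    td_target = reward + gamma * q_next  # one-step bootstrapped estimate of the return
    td_error = td_target - q_sa
    return q_sa + alpha * td_error


# Hypothetical numbers: Q(s, a) = -2.0, reward = -1, Q(s', a') = -1.0
new_q = sarsa_update(q_sa=-2.0, reward=-1, q_next=-1.0)
print(round(new_q, 3))  # -1.98
```

Here the target is -1 + 0.9 * (-1.0) = -1.9, which is slightly better than the current estimate of -2.0, so the estimate moves up by 0.2 * 0.1 = 0.02.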

**Implementation**

In [None]:
p = Piece(piece='king')
env = Board()
r = Reinforce(p,env)

In [None]:
# print(inspect.getsource(r.sarsa_td))

def sarsa_td(self, n_episodes=1000, alpha=0.01, gamma=0.9):
        """
        Run the sarsa control algorithm (TD0), finding the optimal policy and action function
        :param n_episodes: int, amount of episodes to train
        :param alpha: learning rate
        :param gamma: discount factor of future rewards
        :return: finds the optimal policy for move chess
        """
        print("-----")
        for k in range(n_episodes):
            state = (0, 0)
            self.env.state = state
            episode_end = False
            # decay epsilon over the episodes so that behavior becomes less random
            epsilon = max(1 / (1 + k), 0.05)
            while not episode_end:
                state = self.env.state
                action_index = self.agent.apply_policy(state, epsilon)
                action = self.agent.action_space[action_index]
                reward, episode_end = self.env.step(action)
                successor_state = self.env.state
                successor_action_index = self.agent.apply_policy(successor_state, epsilon)

                action_value = self.agent.action_function[state[0], state[1], action_index]

                successor_action_value = self.agent.action_function[successor_state[0],
                                                                    successor_state[1], successor_action_index]
                q_update = alpha * (reward + gamma * successor_action_value - action_value)

                self.agent.action_function[state[0], state[1], action_index] += q_update
                
                #print(q_update)
                    
                self.agent.policy = self.agent.action_function.copy()

                
func_type = type(r.sarsa_td) # method not function
r.sarsa_td = func_type(sarsa_td, r)
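
The exploration schedule used inside `sarsa_td` can be inspected on its own. The sketch below copies the expression `max(1 / (1 + k), 0.05)` into a hypothetical helper called `epsilon_schedule`.

```python
def epsilon_schedule(k, floor=0.05):
    """Exploration rate for episode k: decays as 1 / (1 + k) down to a floor."""
    return max(1 / (1 + k), floor)


print([round(epsilon_schedule(k), 3) for k in (0, 1, 9, 19, 100)])
# [1.0, 0.5, 0.1, 0.05, 0.05]
```

Note that because of the floor, epsilon never reaches zero, so strictly speaking this schedule only approximates the GLIE condition. In practice the floor keeps some exploration going in later episodes.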

**Demonstration**

In [None]:
r.sarsa_td(n_episodes=1,alpha=0.2,gamma=0.9)

In [None]:
r.visualize_policy()

# 2.3 TD-lambda
**Theory**  
In Monte Carlo we do a full-depth backup while in Temporal Difference Learning we do a 1-step backup. You could also choose a depth in-between: backup by n steps. But what value to choose for n?
* TD lambda combines all n-step returns, weighting each n-step return in proportion to lambda^(n-1)
* The resulting weighted average is called the lambda-return
* TD-lambda uses an eligibility-trace to keep track of the previously encountered states
* This way action-values can be updated in retrospect
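
The effect of the eligibility trace can be seen on a toy problem. The sketch below is a hypothetical 3-step chain (one action per state, reward -1 per step, terminal after the third step) and is not the move-chess environment. The names `q`, `trace` and `lamb` are chosen for illustration.

```python
import numpy as np

gamma, lamb, alpha = 0.9, 0.8, 0.2

q = np.zeros(3)      # action values of the 3 state-action pairs
trace = np.zeros(3)  # eligibility trace, one entry per pair

for s in range(3):
    q_next = q[s + 1] if s + 1 < 3 else 0.0  # terminal successor has value 0
    delta = -1.0 + gamma * q_next - q[s]     # TD error for this step
    trace[s] += 1                            # mark the visited pair as eligible
    q += alpha * delta * trace               # update every pair in proportion to its trace
    trace *= gamma * lamb                    # decay all traces

print(q.round(3).tolist())  # [-0.448, -0.344, -0.2]
```

With a plain 1-step update, every pair would end this first episode at -0.2. With traces, the pairs visited earlier also receive part of the later TD errors, so the first pair already reflects that it is further away from the terminal state.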

**Implementation**

In [None]:
p = Piece(piece='king')
env = Board()
r = Reinforce(p,env)

In [None]:
#  print(inspect.getsource(r.sarsa_lambda))
def sarsa_lambda(self, n_episodes=1000, alpha=0.05, gamma=0.9, lamb=0.8):
        """
        Run the sarsa control algorithm (TD lambda), finding the optimal policy and action function
        :param n_episodes: int, amount of episodes to train
        :param alpha: learning rate
        :param gamma: discount factor of future rewards
        :param lamb: lambda parameter describing the decay over n-step returns
        :return: finds the optimal move chess policy
        """
        for k in range(n_episodes):
            # reset the eligibility trace so that credit is only backed up within the same episode
            self.agent.E = np.zeros(shape=self.agent.action_function.shape)
            state = (0, 0)
            self.env.state = state
            episode_end = False
            epsilon = max(1 / (1 + k), 0.2)
            action_index = self.agent.apply_policy(state, epsilon)
            action = self.agent.action_space[action_index]
            #print(action_index, action)
            while not episode_end:
                reward, episode_end = self.env.step(action)
                successor_state = self.env.state
                successor_action_index = self.agent.apply_policy(successor_state, epsilon)                
                action_value = self.agent.action_function[state[0], state[1], action_index]
                if not episode_end:
                    successor_action_value = self.agent.action_function[successor_state[0],
                                                                        successor_state[1], successor_action_index]
                else:
                    successor_action_value = 0
                delta = reward + gamma * successor_action_value - action_value                  
                self.agent.E[state[0], state[1], action_index] += 1
                self.agent.action_function = self.agent.action_function + alpha * delta * self.agent.E
                self.agent.E = gamma * lamb * self.agent.E
                state = successor_state
                action = self.agent.action_space[successor_action_index]
                action_index = successor_action_index
                #print(action)
                self.agent.policy = self.agent.action_function.copy()
                
func_type = type(r.sarsa_lambda) # method not function
r.sarsa_lambda = func_type(sarsa_lambda, r)

**Demonstration**

In [None]:
r.sarsa_lambda(n_episodes=1,alpha=0.2,gamma=0.9)

In [None]:
r.agent.E.shape

In [None]:
r.visualize_policy()

# 2.4 Q-learning

**Theory**
* In SARSA/TD0, we back up our action values with the successor action value
* In SARSA-max/Q-learning, we back up using the maximum action value of the successor state.
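
The difference between the two backup targets can be sketched side by side. The helpers `sarsa_target` and `q_learning_target` and the successor values are hypothetical, purely to show the arithmetic.

```python
import numpy as np


def sarsa_target(reward, q_next_row, next_action, gamma=0.9):
    """Target using the action actually selected in the successor state."""
    return float(reward + gamma * q_next_row[next_action])


def q_learning_target(reward, q_next_row, gamma=0.9):
    """Target using the best action value in the successor state."""
    return float(reward + gamma * np.max(q_next_row))


# Hypothetical action values in the successor state
q_next_row = np.array([-3.0, -1.0, -2.0])

# Suppose exploration picked action 0 in the successor state
print(round(sarsa_target(-1, q_next_row, next_action=0), 2))  # -3.7
print(round(q_learning_target(-1, q_next_row), 2))            # -1.9
```

SARSA's target depends on the (possibly exploratory) action that was sampled, while the Q-learning target ignores it and always backs up the greedy value.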

**Implementation**

In [None]:
p = Piece(piece='king')
env = Board()
r = Reinforce(p,env)

In [None]:
#print(inspect.getsource(r.q_learning))

def q_learning(self, n_episodes=1000, alpha=0.05, gamma=0.9):
        """
        Run Q-learning (also known as sarsa-max), finding the optimal policy and value function
        :param n_episodes: int, amount of episodes to train
        :param alpha: learning rate
        :param gamma: discount factor of future rewards
        :return: finds the optimal move chess policy
        """
        for k in range(n_episodes):
            state = (0, 0)
            self.env.state = state
            episode_end = False
            epsilon = max(1 / (k + 1), 0.1)
            while not episode_end:
                action_index = self.agent.apply_policy(state, epsilon)
                action = self.agent.action_space[action_index]
                reward, episode_end = self.env.step(action)
                successor_state = self.env.state
                # no randomness
                successor_action_index = self.agent.apply_policy(successor_state, -1)

                action_value = self.agent.action_function[state[0], state[1], action_index]
                
                # a terminal successor state has no future value, so it is backed up as 0
                if not episode_end:
                    successor_action_value = self.agent.action_function[successor_state[0],
                                                                        successor_state[1], successor_action_index]
                else:
                    successor_action_value = 0
                    
                # same update form as SARSA (TD0), but using the greedy successor action value
                av_new = self.agent.action_function[state[0], state[1], action_index] + alpha * (reward +
                                                                                                 gamma *
                                                                                                 successor_action_value
                                                                                                 - action_value)
                self.agent.action_function[state[0], state[1], action_index] = av_new
                self.agent.policy = self.agent.action_function.copy()
                state = successor_state
                
                
func_type = type(r.q_learning) # method not function
r.q_learning = func_type(q_learning, r)

**Demonstration**

In [None]:
r.q_learning(n_episodes=1000,alpha=0.2,gamma=0.9)

In [None]:
r.visualize_policy()

In [None]:
r.agent.action_function.max(axis=2).round().astype(int)

# References
1. Reinforcement Learning: An Introduction  
   Richard S. Sutton and Andrew G. Barto  
   1st Edition  
   MIT Press, March 1998
2. RL Course by David Silver: Lecture playlist  
   https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ