In [None]:
from Scripts.Othello.Board import Board
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, random_split
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## Replay Memory

We will use experience replay memory to train our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. By sampling from it at random, the transitions that make up a batch are decorrelated. This has been shown to greatly stabilize and improve the DQN training procedure.

For this, we will need two classes:

- **Transition** - a named tuple representing a single transition in our environment. It essentially maps (state, action) pairs to their (next_state, reward) result, with the state being the screen difference image as described later on.
- **ReplayMemory** - a cyclic buffer of bounded size that holds the most recently observed transitions. It also implements a `.sample()` method for selecting a random batch of transitions for training.


In [None]:
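import random
from collections import namedtuple

# One step of experience: what the agent saw, what it did, and what happened next.
Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))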


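class ReplayMemory(object):
    """Fixed-size cyclic buffer of the most recently observed transitions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0  # index of the next slot to write

    def push(self, *args):
        """Save a transition, overwriting the oldest one once the buffer is full."""
        if len(self.memory) < self.capacity:
            self.memory.append(None)  # grow the list until capacity is reached
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity  # wrap around

    def sample(self, batch_size):
        """Return batch_size transitions drawn uniformly without replacement."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)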
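
The eviction and sampling behaviour described above can be sketched in a few lines. This is not the notebook's `ReplayMemory` class: to keep the snippet runnable on its own it swaps in a `collections.deque` with `maxlen`, which drops the oldest entry when full, as the cyclic buffer does. The integer states and zero rewards are made-up placeholders.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "next_state", "reward"))

# Stand-in for ReplayMemory (assumption): a deque with maxlen evicts the oldest item.
memory = deque(maxlen=3)
for step in range(5):
    # Dummy integers stand in for real board states; rewards are placeholders.
    memory.append(Transition(state=step, action=0, next_state=step + 1, reward=0.0))

# Capacity is 3, so only the transitions from steps 2, 3 and 4 remain.
stored_states = [t.state for t in memory]

# Draw a random batch without replacement.
batch = random.sample(list(memory), 2)

# Transpose a list of Transitions into one Transition holding tuples per field,
# which is a convenient shape for building batched tensors later.
columns = Transition(*zip(*batch))
```

After the loop, `stored_states` is `[2, 3, 4]`: the two oldest transitions were dropped once the capacity of 3 was exceeded.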

## DQN algorithm

Our environment is deterministic, so all equations presented here are also formulated deterministically for the sake of simplicity. In the reinforcement learning literature, they would also contain expectations over stochastic transitions in the environment.

Our aim will be to train a policy that tries to maximize the discounted, cumulative reward $R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t - t_0} r_t$, where $R_{t_0}$ is also known as the return. The discount, $\gamma$, should be a constant between 0 and 1 that ensures the sum converges. It makes rewards from the uncertain far future less important for our agent than the ones in the near future that it can be fairly confident about.
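
As a small illustration of the return, the sketch below computes the discounted sum for a finite list of rewards. The function name `discounted_return` and the reward values are hypothetical; the notebook itself never computes the return explicitly.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        # Each step back in time discounts everything that comes after it once more.
        total = reward + gamma * total
    return total


# Made-up episode: no reward for two steps, then a reward of 1.
rewards = [0.0, 0.0, 1.0]
ret = discounted_return(rewards, gamma=0.9)
```

Here the only reward arrives two steps in the future, so the return is about $0.9^2 \cdot 1 = 0.81$. With $\gamma$ closer to 0 the same reward would count for even less.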

The main idea behind Q-learning is that if we had a function $Q^*: \text{State} \times \text{Action} \rightarrow \mathbb{R}$ that could tell us what our return would be if we were to take an action in a given state, then we could easily construct a policy that maximizes our rewards:

$$\pi^*(s) = \arg\max_a Q^*(s, a)$$
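
To make the argmax concrete, here is a minimal sketch of a greedy policy over a tabular Q-function. The table `q_table`, its state and action names, and its values are invented for illustration; in the notebook the Q-values would come from a neural network rather than a dictionary.

```python
def greedy_policy(q_table, state):
    """Return the action with the highest Q-value in the given state."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


# Hypothetical Q-values for two states and two actions.
q_table = {
    "s0": {"left": 0.1, "right": 0.5},
    "s1": {"left": 0.7, "right": 0.3},
}

action_s0 = greedy_policy(q_table, "s0")
action_s1 = greedy_policy(q_table, "s1")
```

With these values the policy picks `"right"` in `s0` and `"left"` in `s1`.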

However, we don't know everything about the world, so we don't have access to $Q^*$. But, since neural networks are universal function approximators, we can simply create one and train it to resemble $Q^*$.

For our training update rule, we'll use the fact that every $Q$ function for some policy obeys the Bellman equation:

$$Q^{\pi}(s, a) = r + \gamma Q^{\pi}(s', \pi(s'))$$

The difference between the two sides of the equality is known as the temporal difference error, $\delta$:

$$\delta = Q(s, a) - (r + \gamma \max_a Q(s', a))$$
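
A worked example of the temporal difference error for a single transition may help. The helper `td_error` and all numbers are hypothetical. Treating an empty list of next-state values as a terminal state with zero future value is a common convention assumed here; it is not stated in the text above.

```python
def td_error(q_sa, reward, next_q_values, gamma):
    """delta = Q(s, a) - (r + gamma * max_a Q(s', a))."""
    # Assumption: an empty list marks a terminal next state with no future value.
    best_next = max(next_q_values) if next_q_values else 0.0
    target = reward + gamma * best_next
    return q_sa - target


# Made-up numbers: current estimate 0.5, reward 1.0, next-state Q-values 0.2 and 0.4.
delta = td_error(q_sa=0.5, reward=1.0, next_q_values=[0.2, 0.4], gamma=0.9)

# Same transition, but ending the episode: the target is the reward alone.
delta_terminal = td_error(q_sa=0.5, reward=1.0, next_q_values=[], gamma=0.9)
```

In the first case the target is $1.0 + 0.9 \cdot 0.4 = 1.36$, so $\delta \approx -0.86$: the current estimate is too low and should be pushed up.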

To minimize this error, we will use the Huber loss. The Huber loss acts like the mean squared error when the error is small, but like the mean absolute error when the error is large. This makes it more robust to outliers when the estimates of $Q$ are very noisy. We calculate this over a batch of transitions, $B$, sampled from the replay memory:

$$\mathcal{L} = \frac{1}{|B|} \sum_{(s, a, s', r) \in B} \mathcal{L}(\delta)$$

$$\text{where} \quad \mathcal{L}(\delta) = \begin{cases} \frac{1}{2}\delta^2 & \text{for } |\delta| \le 1, \\ |\delta| - \frac{1}{2} & \text{otherwise.} \end{cases}$$
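
The piecewise definition can be checked with a plain-Python sketch. The names `huber` and `batch_loss` and the sample errors are hypothetical. In PyTorch, `torch.nn.functional.smooth_l1_loss` with its default `beta=1.0` should compute the same quantity on tensors.

```python
def huber(delta):
    """Huber loss for one TD error, with the threshold fixed at 1."""
    if abs(delta) <= 1.0:
        return 0.5 * delta ** 2   # quadratic region: behaves like squared error
    return abs(delta) - 0.5       # linear region: behaves like absolute error


def batch_loss(td_errors):
    """Mean Huber loss over a batch of TD errors."""
    return sum(huber(d) for d in td_errors) / len(td_errors)


# Made-up batch: one small error and one large (outlier) error.
loss = batch_loss([0.5, -2.0])
```

The small error contributes $0.5 \cdot 0.5^2 = 0.125$ and the outlier contributes $2 - 0.5 = 1.5$, giving a mean of $0.8125$. Under a pure squared-error loss the outlier alone would contribute $2.0$, so it would weigh more heavily on the update.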