# Guess the ELO Transformer

In this notebook, we begin training a transformer to predict the ELO ratings of the two participants in a chess game given the move order. ELO ratings attempt to quantify and rank the ability of chess players; here they are the ratings assigned by Lichess, whose rating system is Glicko-2 (https://lichess.org/page/rating-systems).

We use games selected from a single month of the Lichess open database (https://database.lichess.org/). Only rapid time control games are included. Data is stored in .pgn format, a standard file format for recording chess games. Each .pgn file contains information for multiple games. For each game in the .pgn file, we extract the ELO of the white and black players, the move order in algebraic notation (https://en.wikipedia.org/wiki/Algebraic_notation_(chess)), and the result of the game ("1-0" for a white win, "1/2-1/2" for a draw, and "0-1" for a black win), which is appended to the end of the move history. These are then written to a .csv file for easier processing.

We train a RoBERTa model from the Hugging Face library (https://huggingface.co/docs/transformers/index) on the bidirectional, dynamically masked language modeling task in order to obtain an informative fixed-size representation of the move history. To our knowledge, no transformer model has been trained on a large corpus of chess games in algebraic notation, so transfer learning is not available and training begins from randomly initialized weights. After training on the masked language modeling task, we fine-tune the model on the specific problem of predicting player ELO ratings from the move history. We organize our models this way to mirror standard NLP practice, where a large pretraining step on the masked language modeling task is followed by fine-tuning for a particular use case.

In [11]:
import chess.pgn # For parsing pgn file
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
import sklearn.metrics  # importing the submodule ensures sklearn.metrics.mean_squared_error resolves
import matplotlib.pyplot
import os, sys
import logging
import tqdm

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from pathlib import Path

from transformers import LineByLineTextDataset, DataCollatorForLanguageModeling
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast
from transformers import Trainer, TrainingArguments
from datasets import Dataset

os.environ["WANDB_DISABLED"] = "true"

FORMAT = '%(asctime)s %(message)s'
logging.basicConfig(format=FORMAT, level=logging.INFO)

tokenizer_data = 'all_moves.txt'
transformer_data = 'mlm_training.txt'
model_dir = './inBERTnational_master'

num_games = 200000
num_test = 10000
vocab_size = 2000
max_tokens=200
train_split = .7

## Data Parsing

Data is originally stored in .pgn format (the standard format for storing chess games). In this cell, we use python-chess to parse the .pgn file and extract, for each game, the moves plus result as a string and the ELO of the white and black players. The original Lichess .pgn files are quite large (256 GB). This is too large for our current resources and is likely more data than needed, given that standard algebraic notation is a fairly constrained "language." We parse games until 1 GB of games are stored in a .csv file, which fits our computational resources.

In [2]:
def read_one_game(game):
    white_elo = game.headers["WhiteElo"]
    black_elo = game.headers["BlackElo"]
    result = game.headers["Result"]
    board = chess.Board()
    move_history = board.variation_san(game.mainline_moves()).split()
    # Tokens ending in "." are move numbers (1. 2. 3. ...); they are just "grammar"
    # and carry no additional information, so they are removed.
    move_history = " ".join(mv for mv in move_history if not mv.endswith("."))
    move_history += " " + result
    return move_history, white_elo, black_elo

# Downloaded from Lichess database linked above
large_pgn_file = '/media/sql/Samsung_T5/lichess_db_standard_rated_2021-03.pgn'
small_pgn_file_size = 1 # in GB
small_pgn_file = f'/home/sql/Documents/elo_guesser/data_file_{small_pgn_file_size}GB.csv'
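The move-number filter used in `read_one_game` can be checked in isolation without python-chess. The sketch below assumes the SAN text has the usual `1. e4 e5 2. Nf3 ...` shape; `strip_move_numbers` is a hypothetical helper name used only for this illustration.

```python
def strip_move_numbers(san_text, result):
    """Drop move-number tokens such as '1.' and append the game result."""
    moves = [tok for tok in san_text.split() if not tok.endswith(".")]
    return " ".join(moves + [result])


# Hypothetical example game, not taken from the dataset.
example = strip_move_numbers("1. e4 e5 2. Nf3 Nc6 3. Bb5 a6", "1-0")
print(example)  # e4 e5 Nf3 Nc6 Bb5 a6 1-0
```

Each line written to the .csv file then consists of such a string followed by the two ratings.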

In [57]:
with open(large_pgn_file) as large_pgn:
    with open(small_pgn_file, 'a') as small_pgn: 
        game = chess.pgn.read_game(large_pgn)
        while game is not None and os.path.getsize(small_pgn_file) < small_pgn_file_size * 1000000000:
            if all([head in game.headers for head in ['WhiteElo', 'BlackElo', 'Result','Event']]) and game.headers['Event']=='Rated Rapid game':
                move_history, white_elo, black_elo = read_one_game(game)
                small_pgn.write(",".join([move_history, white_elo, black_elo]) + "\n")
            game = chess.pgn.read_game(large_pgn)
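The filter condition in the loop above can be expressed as a small predicate on the header mapping. The sketch below uses plain dictionaries in place of python-chess header objects, and the header values are made up for illustration; `is_rated_rapid` is a hypothetical name.

```python
REQUIRED_HEADERS = ("WhiteElo", "BlackElo", "Result", "Event")


def is_rated_rapid(headers):
    """True when all required headers exist and the event is exactly 'Rated Rapid game'."""
    has_all = all(key in headers for key in REQUIRED_HEADERS)
    return has_all and headers["Event"] == "Rated Rapid game"


# Made-up headers for illustration only.
rapid = {"WhiteElo": "1500", "BlackElo": "1480", "Result": "1-0", "Event": "Rated Rapid game"}
blitz = dict(rapid, Event="Rated Blitz game")
missing_elo = {"Result": "1-0", "Event": "Rated Rapid game"}

print(is_rated_rapid(rapid), is_rated_rapid(blitz), is_rated_rapid(missing_elo))
```

Because the comparison is an exact string match, any game whose `Event` header differs from `Rated Rapid game` in any way is skipped.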

In [4]:
# Convert CSV to plain text for MLM
X = pd.read_csv('data_file_1GB.csv', sep=',', names = ['moves','white_elo', 'black_elo'])

logging.info(f'Number of games total: {X.shape[0]}')
# Used to train tokenizer
if os.path.exists(tokenizer_data):
    os.remove(tokenizer_data)
with open(tokenizer_data, 'a') as f:
    moves_as_str = X['moves']
    moves_as_str = moves_as_str.to_csv(sep='\n', header=False, index=False)
    f.write(moves_as_str)

# Used to train model
if os.path.exists(transformer_data):
    os.remove(transformer_data)
with open(transformer_data, 'a') as f:
    moves_as_str = X['moves'].iloc[:num_games]
    moves_as_str = moves_as_str.to_csv(sep='\n', header=False, index=False)
    f.write(moves_as_str)

2022-01-02 20:36:45,773 Number of games total: 3596879


### Tokenization

In [66]:
# Train a tokenizer for standard algebraic notation from scratch using the Hugging Face library
mlm_data_files = [tokenizer_data]

tokenizer = Tokenizer(BPE())
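# Special tokens are listed in this order so that their ids (<s>=0, <pad>=1, </s>=2) match the RobertaConfig defaults.
trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])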
tokenizer.pre_tokenizer = Whitespace()
tokenizer.enable_truncation(max_length=max_tokens)
tokenizer.train(mlm_data_files, trainer)
tokenizer.model.save(model_dir)

['./inBERTnational_master/vocab.json', './inBERTnational_master/merges.txt']
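To get a feel for what the BPE model sees, it helps to look at how the `Whitespace` pre-tokenizer cuts up SAN moves. According to the tokenizers documentation it splits on the pattern `\w+|[^\w\s]+`; the sketch below reproduces that pattern with Python's `re` module, which should behave the same for ASCII input but is only an approximation and not the library's own code.

```python
import re

# Pattern documented for the Whitespace pre-tokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text):
    """Approximate the Whitespace pre-tokenizer on an ASCII string."""
    return WHITESPACE_PATTERN.findall(text)


pieces = pre_tokenize("Nf3 O-O e8=Q+ 1-0")
print(pieces)
```

Moves such as `Nf3` stay whole, while castling, promotions, check markers, and results are cut at the punctuation. As far as we understand, BPE merges are learned within these pieces, so a move like `O-O` is represented by several tokens rather than one.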

### Masked Language Model

We build a transformer to solve the (dynamic) masked language modeling problem for the "language" of standard algebraic notation. This section relies heavily on the Hugging Face library and, in particular, borrows code and ideas from the Google Colab notebook associated with this blog post (https://huggingface.co/blog/how-to-train). We cannot use pretrained Hugging Face models because no pretrained model exists for standard algebraic notation.
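As a rough picture of what "dynamic" masking means, the sketch below draws a fresh mask every time a game is seen. It follows the usual BERT-style recipe (select about 15% of positions; of those, 80% become `<mask>`, 10% become a random token, 10% stay unchanged). It works on move strings rather than token ids and ignores special tokens, so it is an illustration under those assumptions and not the `DataCollatorForLanguageModeling` implementation; `dynamic_mask` is a hypothetical name.

```python
import random

MASK_TOKEN = "<mask>"


def dynamic_mask(tokens, vocab, mlm_probability=0.15, rng=random):
    """Return (inputs, labels); labels are None where no prediction is scored."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_TOKEN
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)
        # otherwise the original token is left in place
    return inputs, labels


# Made-up game for illustration.
game = "e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6 1-0".split()
vocab = sorted(set(game))
rng = random.Random(0)
first_pass, first_labels = dynamic_mask(game, vocab, rng=rng)
second_pass, second_labels = dynamic_mask(game, vocab, rng=rng)
print(first_pass)
print(second_pass)
```

Calling the function twice on the same game generally yields different masked positions, which is the point of masking at batch time instead of once during preprocessing.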

It's worth noting that the amount of data here is very large; we have stored nearly 3.6 million games. As a result, the transformer takes a long time to train, especially since we don't currently have access to a local GPU. Training is done separately in a Google Colab notebook, and the trained model will be available here. The 12-hour runtime limit is worked around by frequent checkpointing and restarting the Colab kernel. Memory limits are worked around by splitting the source text into multiple smaller text files and looping over them. These implementation details complicate the code, so they are shown only in the Colab notebook.
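The Colab code is not reproduced here, but the idea of splitting the source text into smaller files can be sketched with the standard library alone. Everything below is an assumption made for illustration: the function name `split_into_shards`, the shard file naming, and the shard size are hypothetical and may differ from what the Colab notebook does.

```python
import tempfile
from itertools import islice
from pathlib import Path


def split_into_shards(src_path, out_dir, lines_per_shard):
    """Write consecutive groups of lines from src_path into numbered shard files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    with open(src_path) as src:
        while True:
            chunk = list(islice(src, lines_per_shard))
            if not chunk:
                break
            path = out_dir / f"shard_{len(shard_paths):04d}.txt"
            path.write_text("".join(chunk))
            shard_paths.append(path)
    return shard_paths


# Tiny demonstration in a temporary directory: five one-line "games", two per shard.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "mlm_training.txt"
    src.write_text("".join(f"game {i}\n" for i in range(5)))
    shards = split_into_shards(src, Path(tmp) / "shards", lines_per_shard=2)
    shard_sizes = [len(p.read_text().splitlines()) for p in shards]
    rejoined = "".join(p.read_text() for p in shards)
    original = src.read_text()

print(shard_sizes)
```

A training loop could then build one dataset per shard and call the trainer on each in turn, saving a checkpoint between shards so that a restarted kernel can resume.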

In [67]:
# First train RoBERTa for masked language modeling to get a rich representation of the input string.
config = RobertaConfig(
    vocab_size=vocab_size,
    max_position_embeddings=514,
    num_attention_heads=6,
    num_hidden_layers=3,
    type_vocab_size=1,
    num_labels=2,
)

tokenizer = RobertaTokenizerFast.from_pretrained(model_dir, max_len=max_tokens)
model = RobertaForMaskedLM(config=config)

dataset =  LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=transformer_data,
    block_size=128,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./trained_model",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    save_steps=500,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model(model_dir)

Didn't find file ./inBERTnational_master/tokenizer.json. We won't load it.
Didn't find file ./inBERTnational_master/added_tokens.json. We won't load it.
Didn't find file ./inBERTnational_master/special_tokens_map.json. We won't load it.
Didn't find file ./inBERTnational_master/tokenizer_config.json. We won't load it.
loading file ./inBERTnational_master/vocab.json
loading file ./inBERTnational_master/merges.txt
loading file None
loading file None
loading file None
loading file None
loading configuration file ./inBERTnational_master/config.json
Model config RobertaConfig {
  "_name_or_path": "./inBERTnational_master",
  "architectures": [
    "RobertaForSequenceMultiTargetRegression"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 51

Step,Training Loss
500,3.3484
1000,2.5597
1500,1.8439
2000,1.659
2500,1.5413
3000,1.4701
3500,1.4254
4000,1.385
4500,1.3571
5000,1.3349


Saving model checkpoint to ./trained_model/checkpoint-500
Configuration saved in ./trained_model/checkpoint-500/config.json
Model weights saved in ./trained_model/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./trained_model/checkpoint-1000
Configuration saved in ./trained_model/checkpoint-1000/config.json
Model weights saved in ./trained_model/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ./trained_model/checkpoint-1500
Configuration saved in ./trained_model/checkpoint-1500/config.json
Model weights saved in ./trained_model/checkpoint-1500/pytorch_model.bin
Deleting older checkpoint [trained_model/checkpoint-500] due to args.save_total_limit
Saving model checkpoint to ./trained_model/checkpoint-2000
Configuration saved in ./trained_model/checkpoint-2000/config.json
Model weights saved in ./trained_model/checkpoint-2000/pytorch_model.bin
Deleting older checkpoint [trained_model/checkpoint-1000] due to args.save_total_limit
Saving model checkpoint to ./train

In [7]:
tokenizer = RobertaTokenizerFast.from_pretrained(model_dir, max_len=max_tokens)
model = RobertaForMaskedLM.from_pretrained(model_dir)

### Fine-Tuning

We fine-tune the standard algebraic notation masked language model on the specific problem of predicting the ELO ratings of the two players. The setup is the same as before, with two differences:

The data now also contains the ELO ratings of the two players under the "targets" key.

The objective is a mean squared error (L2) loss on the predicted player ratings.
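To make the objective concrete: with its default mean reduction, a mean squared error loss averages the squared errors over every element, so here it averages over both players and over all games in the batch. The pure-Python sketch below mirrors that computation on made-up ratings; it is an illustration and not the PyTorch implementation.

```python
def mean_squared_error(predictions, targets):
    """Average squared error over all games and both rating columns."""
    squared = [
        (p - t) ** 2
        for pred_row, target_row in zip(predictions, targets)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(squared) / len(squared)


# Made-up (white_elo, black_elo) pairs for two games.
predicted = [[1500.0, 1400.0], [1800.0, 1750.0]]
actual = [[1600.0, 1400.0], [1800.0, 1650.0]]
loss = mean_squared_error(predicted, actual)
print(loss)  # 5000.0
```

Two errors of 100 rating points and two exact predictions give (10000 + 0 + 0 + 10000) / 4 = 5000, which shows why raw loss values on unscaled ratings are so large.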

Hugging Face provides a RobertaForSequenceClassification wrapper, which can also be used for regression by setting config.num_labels to 1. However, we need to perform multi-target regression, so the provided class must be modified. This is done in the cell below, which is very similar to the RobertaForSequenceClassification class, but tweaks the relevant lines to allow for multi-target regression. 

In [5]:
from transformers import BertPreTrainedModel, RobertaModel, RobertaForSequenceClassification

class RobertaForSequenceMultiTargetRegression(BertPreTrainedModel):
    config_class = RobertaConfig
    base_model_prefix = "roberta"

    def __init__(self, config):
        super().__init__(config)
        self.n_outputs = config.num_labels
        self.roberta = RobertaModel(config)
        self.regressor = RobertaRegressionHead(config)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        targets=None,
    ):
        outputs = self.roberta(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )
        sequence_output = outputs[0]
        head_output = self.regressor(sequence_output)

        outputs = (head_output,) + outputs[2:]

        # Compute the loss only when targets are supplied, so the model can also be called for inference.
        if targets is not None:
            loss_fct = torch.nn.MSELoss()
            # Ratings arrive as integers; MSELoss expects floating-point targets.
            loss = loss_fct(head_output.view(-1, self.n_outputs),
                            targets.float().view(-1, self.n_outputs))
            outputs = (loss,) + outputs
        return outputs  # (loss), predictions, (hidden_states), (attentions)

    
class RobertaRegressionHead(nn.Module):
    """Head for sentence-level Regression tasks."""

    def __init__(self, config):
        super().__init__()
        self.relu = torch.nn.ReLU()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
    
# The loss function is changed by redefining the compute_loss() function in the Trainer class.
class MultiTargetTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        targets = inputs['targets']
        outputs = model(**inputs)
        preds = outputs[1]
        loss_fct = nn.MSELoss()
        loss = loss_fct(preds.view(-1, self.model.config.num_labels),
                        targets.float().view(-1, self.model.config.num_labels))
        return (loss, outputs) if return_outputs else loss

In [10]:
def compute_metrics(pred):
    truth = pred.label_ids
    preds = pred.predictions
    L2 = sklearn.metrics.mean_squared_error(truth, preds)
    return {"loss": L2}

model = RobertaForSequenceMultiTargetRegression.from_pretrained(model_dir)

data_dict = {'moves': X['moves'].iloc[:num_games], 'targets': X[['white_elo', 'black_elo']].iloc[:num_games].values}
all_data =  Dataset.from_dict(data_dict)
train_data = Dataset.from_dict(all_data[:int(train_split * num_games)])
test_data = Dataset.from_dict(all_data[int(train_split * num_games):])

def tokenization(batched_text):
    return tokenizer(batched_text['moves'], padding = True, truncation=True)

train_data = train_data.map(tokenization, batched = True, batch_size = len(train_data))
test_data = test_data.map(tokenization, batched = True, batch_size = len(test_data))

training_args = TrainingArguments(
    output_dir = './finetuned_model',
    num_train_epochs=1,
    per_device_train_batch_size = 32,
    gradient_accumulation_steps = 2,    
    per_device_eval_batch_size= 32,
    evaluation_strategy = "steps",
    disable_tqdm = False, 
    load_best_model_at_end=True,
    save_steps=500,
    eval_steps=500,
    weight_decay=0.01,
    save_strategy = "steps",
    dataloader_num_workers = 8,
    label_names=['targets'],
    run_name = 'roberta-regression'
)

trainer = MultiTargetTrainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=test_data
)

trainer.train()
trainer.save_model('./finetuned_model')

loading configuration file ./inBERTnational_master/config.json
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 6,
  "num_hidden_layers": 3,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.15.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 2000
}

loading weights file ./inBERTnational_master/pytorch_model.bin
Some weights of the model checkpoint at ./inBERTnational_master were not used when initializing RobertaForSequenceMultiTargetRegression: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.wei

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceMultiTargetRegression.forward` and have been ignored: moves.
***** Running training *****
  Num examples = 140000
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 2
  Total optimization steps = 2187
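The example and step counts in the log above follow directly from the notebook settings. The sketch below reproduces the arithmetic, assuming a single device and that an incomplete final accumulation group does not count as a step (which is consistent with the 2187 reported in the log).

```python
import math

# Values taken from the notebook cells above.
num_games = 200000
train_split = 0.7
per_device_train_batch_size = 32
gradient_accumulation_steps = 2

n_train = int(train_split * num_games)
n_eval = num_games - n_train
batches_per_epoch = math.ceil(n_train / per_device_train_batch_size)
optimizer_steps = batches_per_epoch // gradient_accumulation_steps

print(n_train, n_eval, batches_per_epoch, optimizer_steps)
```

With evaluation every 500 steps, this leaves four evaluation points (steps 500 through 2000) within the single epoch.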


Step,Training Loss,Validation Loss
500,2221194.752,1878776.282972
1000,1474099.2,1079000.315457
1500,781082.496,538470.03547
2000,419089.664,327692.268279
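Because the loss is a mean squared error on raw ratings, its square root is easier to interpret: it is a typical prediction error in rating points, averaged over both players. The sketch below converts the validation losses from the table above; the numbers are copied from the table and nothing else is assumed.

```python
import math

# Validation MSE by training step, copied from the table above.
val_mse_by_step = {
    500: 1878776.282972,
    1000: 1079000.315457,
    1500: 538470.03547,
    2000: 327692.268279,
}

val_rmse_by_step = {step: math.sqrt(mse) for step, mse in val_mse_by_step.items()}
for step, rmse in val_rmse_by_step.items():
    print(f"step {step}: RMSE of about {rmse:.0f} rating points")
```

At step 2000 this corresponds to a root mean squared error of roughly 572 rating points, and the values are still falling steadily at the end of the single epoch.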


The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceMultiTargetRegression.forward` and have been ignored: moves.
***** Running Evaluation *****
  Num examples = 60000
  Batch size = 32
Saving model checkpoint to ./finetuned_model/checkpoint-500
Configuration saved in ./finetuned_model/checkpoint-500/config.json
Model weights saved in ./finetuned_model/checkpoint-500/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceMultiTargetRegression.forward` and have been ignored: moves.
***** Running Evaluation *****
  Num examples = 60000
  Batch size = 32
Saving model checkpoint to ./finetuned_model/checkpoint-1000
Configuration saved in ./finetuned_model/checkpoint-1000/config.json
Model weights saved in ./finetuned_model/checkpoint-1000/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceMultiTargetRegress

In [None]:
# Analyze on testing data