In [1]:
import numpy as np
import torch
import time
import tqdm
import pickle

# Load the move dictionary created during data processing
with open("../dataset/processed/test_elite/test_move_dictionary.p", "rb") as f:
    MOVE_DICTIONARY = pickle.load(f)

# Initialize Dataset and Dataloader
## Dataset
Our dataset consists of around 10M moves played by humans with a Lichess Elo rating of at least 2100, a threshold chosen to ensure high-quality moves. It is pre-processed in `data_processing.ipynb`.
The dataset is stored in CSV files with just two columns: `bitmaps`, which represents the game state, and `movePlayed`, which represents the move a human played in that game state.
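
As an illustration of what a `bitmaps` game state can look like, here is a minimal sketch that turns the piece-placement field of a FEN string into 12 binary 8x8 planes. The helper `fen_to_planes`, the plane order in `PIECES`, and the use of FEN as input are assumptions made for this example; the actual encoding produced by `data_processing.ipynb` may differ.

```python
import numpy as np

PIECES = "PNBRQKpnbrqk"  # assumed plane order: 6 white piece types, then 6 black


def fen_to_planes(fen: str) -> np.ndarray:
    """Encode the piece-placement field of a FEN string as an 8x8x12 array of 0/1."""
    planes = np.zeros((8, 8, 12), dtype=np.float32)
    placement = fen.split()[0]
    for row, rank in enumerate(placement.split("/")):  # row 0 is rank 8
        col = 0
        for ch in rank:
            if ch.isdigit():
                col += int(ch)  # a digit means that many empty squares
            else:
                planes[row, col, PIECES.index(ch)] = 1.0
                col += 1
    return planes


start = fen_to_planes("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")
print(start.shape, int(start.sum()))  # (8, 8, 12) 32
```

Each of the 32 pieces of the starting position sets exactly one cell, so every square has at most one active plane.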

Because our dataset contains a very large number of examples, we load it in batches of `NUM_EXAMPLES_TO_LOAD_PER_FETCH`, normally 640k examples at a time.
This is necessary because fetching 1.2M examples uses around 5 GB of RAM, and, since training ran on a GPU with 6 GB of VRAM, we could not load the entire dataset at once.

Each loaded batch is shuffled so that any inherent order or pattern in the data does not affect training, which matters here because our data consists of moves from games that are read in order.
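
The load-then-shuffle idea can be sketched as follows. This is a simplified illustration, not the code in `ChessDataset.py`: the generator `iter_shuffled_chunks` and its parameters are hypothetical, and it assumes a CSV file with a header row.

```python
import csv
import io
import random
from itertools import islice


def iter_shuffled_chunks(csv_file, chunk_size, seed=0):
    """Yield lists of CSV rows, at most `chunk_size` rows each, shuffled within the chunk.

    Hypothetical helper: only one chunk is held in memory at a time.
    """
    rng = random.Random(seed)
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        rng.shuffle(chunk)  # break up the move order of consecutive games
        yield chunk


# Tiny in-memory stand-in for the real CSV file
demo = io.StringIO("bitmaps,movePlayed\n" + "\n".join(f"b{i},m{i}" for i in range(10)))
chunks = list(iter_shuffled_chunks(demo, chunk_size=4))
print([len(c) for c in chunks])  # chunk sizes: 4, 4, 2
```

Note that this shuffles only within a chunk: examples from different chunks are never mixed, which is the trade-off for bounded memory use.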

## Dataloader
The `DataLoader` fetches the data from our dataset and divides the examples into batches of `TRAINING_BATCH_SIZE` elements.
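
To make the batching concrete, here is a small sketch that uses a `TensorDataset` filled with stand-in data in place of `ChessEvalDataset`. The tensor shapes (12x8x8 positions, one class index per example) are assumptions chosen for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 200 positions of shape (12, 8, 8) and one class index each
features = torch.zeros(200, 12, 8, 8)
targets = torch.randint(0, 1800, (200, 1))

loader = DataLoader(TensorDataset(features, targets), batch_size=64, shuffle=False)
batch_sizes = [x.shape[0] for x, _ in loader]
print(len(loader), batch_sizes)  # 4 batches: 64, 64, 64 and a final partial batch of 8
```

The last batch is smaller whenever the number of examples is not a multiple of the batch size, which is worth keeping in mind when averaging per-batch metrics.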

In [2]:
from torch.utils.data import DataLoader
from ChessDataset import ChessEvalDataset

DATASET_PATH = '../dataset/processed/test_elite/results_black.csv'
# IMPORTANT: this dictates how much RAM is used and how much data is loaded per fetch.
# 640_000 loads around 5 GB; don't push this too high, as the kernel will crash if RAM is depleted.
# NUM_EXAMPLES_TO_LOAD_PER_FETCH = 1_280_000
NUM_EXAMPLES_TO_LOAD_PER_FETCH = 320_000
TRAINING_BATCH_SIZE = 64

HEADERS = ("bitmaps", "movePlayed")
dataset = ChessEvalDataset(
    file = DATASET_PATH, 
    validation_size = 25_000,
    load_batch_size = NUM_EXAMPLES_TO_LOAD_PER_FETCH, 
    headers = HEADERS
)
dataloader = DataLoader(dataset, batch_size=TRAINING_BATCH_SIZE, shuffle=False)  # Shuffling is done manually inside the dataset

# Model
The model is defined in `./model/models/architecture_2Conv/model.py`; keeping each architecture in its own folder gave us a simple versioning system for our models.

The final model consists of two convolutional layers and two fully connected layers. It receives an 8x8x12 tensor that represents the game state (8x8 squares, with one plane for each of the 6 white and 6 black piece types).
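
A minimal sketch of such an architecture is shown below. The class `TwoConvSketch` is hypothetical and is not the `ChessModel` from `model.py`: the 3x3 kernels, ReLU activations, and layer widths are all illustrative assumptions. PyTorch convolutions expect channels-first input, so the 8x8x12 state is assumed to be passed as a (12, 8, 8) tensor.

```python
import torch
from torch import nn


class TwoConvSketch(nn.Module):
    """Hypothetical stand-in: 2 convolutional layers + 2 fully connected layers."""

    def __init__(self, num_moves: int, channels: int = 32, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(12, channels, kernel_size=3, padding=1),  # 12 piece planes in
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * 8 * 8, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_moves),  # one logit per move class
        )

    def forward(self, boards: torch.Tensor) -> torch.Tensor:
        # boards: (batch, 12, 8, 8), channels-first
        return self.head(self.conv(boards))


sketch = TwoConvSketch(num_moves=1800)  # 1800 move classes, as reported in this notebook
logits = sketch(torch.zeros(4, 12, 8, 8))
print(logits.shape)  # torch.Size([4, 1800])
```

Padding of 1 keeps the 8x8 spatial size through both convolutions, so the first fully connected layer sees `channels * 64` features.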

It has 1800 outputs, each representing one move that was played in our dataset.
We didn't use all possible moves (64x63 = 4032 from-square/to-square pairs) because not all of them are represented in our dataset, and with the full output space our model wasn't converging to acceptable values (40% validation accuracy).
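
The difference between the full move space and a dictionary built from observed moves can be sketched like this. The helper `build_move_dictionary` and the UCI-style move strings (for example `e7e5`) are assumptions for illustration; the real dictionary is the one produced during data processing.

```python
from itertools import permutations

SQUARES = [f + r for r in "12345678" for f in "abcdefgh"]

# Every ordered (from, to) pair of distinct squares: the 64 x 63 upper bound
all_pairs = [a + b for a, b in permutations(SQUARES, 2)]
print(len(all_pairs))  # 4032


def build_move_dictionary(moves_played):
    """Map each distinct move string to a class index, in order of first appearance."""
    dictionary = {}
    for move in moves_played:
        dictionary.setdefault(move, len(dictionary))
    return dictionary


# Hypothetical sample of the `movePlayed` column
sample = ["e7e5", "g8f6", "e7e5", "d7d5"]
move_dictionary = build_move_dictionary(sample)
print(move_dictionary)  # {'e7e5': 0, 'g8f6': 1, 'd7d5': 2}
```

With this approach the number of output classes equals `len(move_dictionary)`, so the network never has to allocate outputs for moves it has no examples of.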

In [3]:
import gc

# from models.architecture_batchnorm_2Conv.model import CompleteChessBotNetwork
from models.architecture_2Conv_classes.model import ChessModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# One output class per entry in the move dictionary
model = ChessModel(len(MOVE_DICTIONARY)).to(device)

# Training Loop
Our training loop, written in pseudocode:
```
For each epoch:
    For batch in dataloader:
        bitmaps, expected_moves = batch
        optimizer.zero_grad()
        predictions = model(bitmaps)

        loss = CrossEntropyLoss(predictions, expected_moves)
        loss.backward()
        clip_gradient_norm(model, max_norm=1.0)
        optimizer.step()
    scheduler.step()

    validation_dataset = get_validation_dataset()
    loss, accuracy = evaluate_accuracy(validation_dataset)

    if epoch % 5 == 0:
        save_model(model)
```
We also save the weights of our model every 5 epochs.

## Optimizer and Scheduler
We use the Adam (Adaptive Moment Estimation) optimizer, which adapts the learning rate of each parameter automatically during training and works well with large datasets and complex models while keeping memory requirements modest. On top of that, a `StepLR` scheduler halves the base learning rate every 30 epochs.
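
The training cell in this notebook pairs Adam (`lr=1e-4`) with `StepLR(step_size=30, gamma=0.5)`. A step-decay schedule has a simple closed form, sketched below in plain Python; the function `step_lr` is a hypothetical helper for illustration, and `epoch` counts scheduler steps since the scheduler was created (which restarts from zero when training is resumed with a fresh scheduler).

```python
def step_lr(epoch: int, base_lr: float = 1e-4, step_size: int = 30, gamma: float = 0.5) -> float:
    """Learning rate after `epoch` scheduler steps under a step-decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 60, 90):
    print(epoch, step_lr(epoch))
```

The base rate is held for the first 30 steps, then halved at step 30, again at step 60, and so on.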

## Loss Function
Since move prediction is a classification problem, we use Cross Entropy Loss to calculate the loss of each batch.
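
For intuition about the loss values printed during training, the following NumPy sketch computes the mean cross-entropy from raw logits by hand. The function `cross_entropy` is written for this example and is not the PyTorch implementation, although it follows the same definition (log-softmax followed by the negative log-likelihood of the target class).

```python
import numpy as np


def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy of integer class targets given unnormalized logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


# Uniform logits over 1800 move classes: the loss of a model that knows nothing
uniform = np.zeros((2, 1800))
print(round(cross_entropy(uniform, np.array([3, 7])), 4))  # ln(1800) ≈ 7.4955
```

A loss near ln(1800) therefore corresponds to guessing uniformly among the move classes, and lower values mean more probability mass is placed on the move actually played.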

## Model Accuracy Evaluation
To evaluate our model, we set aside a validation set at the beginning (`validation_size`, set to 25,000 examples in the configuration above) that is never used in the training phase, which lets us see how well our model generalizes.
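
The hold-out idea and the accuracy metric can be sketched as follows. Both helpers are hypothetical and simplified: `split_validation` assumes a random split, whereas the real dataset class may simply take the first examples, and `top1_accuracy` mirrors the `argmax` comparison used in the training cell.

```python
import numpy as np


def split_validation(num_examples: int, validation_size: int, seed: int = 0):
    """Return (train_idx, val_idx): disjoint index arrays; val_idx is never trained on."""
    order = np.random.default_rng(seed).permutation(num_examples)
    return order[validation_size:], order[:validation_size]


def top1_accuracy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Percentage of rows whose highest-scoring class equals the target class."""
    return 100.0 * float((logits.argmax(axis=1) == targets).mean())


train_idx, val_idx = split_validation(num_examples=1_000, validation_size=100)
print(len(train_idx), len(val_idx))  # 900 100

logits = np.array([[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [0.9, 0.8, 0.1]])
print(top1_accuracy(logits, np.array([1, 0, 2, 1])))  # 75.0
```

Because the two index sets are disjoint, the validation accuracy measures performance on positions the model has not been fitted to.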

In [None]:
NUM_EPOCHS = 100
OUTPUT_PATH = './models/architecture_2Conv_classes/blackOnly_batchnorm'

# Continue from pretrained weights (map_location keeps this working on CPU-only machines)
model.load_state_dict(torch.load("./models/architecture_2Conv_classes/blackOnly_batchnorm/epoch-10.pth", map_location=device))
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

print(torch.cuda.is_available())
print("Using device: ", device)
for epoch in range(10, NUM_EPOCHS+1):
    model.train()
    t0 = time.time()
    avg_loss = 0.0
    correct = 0
    for board_tensor, target_eval in (pbar := tqdm.tqdm(dataloader)):        
        board_tensor, target_eval = board_tensor.to(device), target_eval.to(device)  # Move data to GPU
        optimizer.zero_grad()
        pred = model(board_tensor)

        # Compute the cross-entropy loss against the move actually played
        loss = loss_fn(pred, target_eval.squeeze(1))
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()
        avg_loss += loss.item()

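        # Count correct top-1 predictions; divide by the actual batch size so a smaller final batch is reported correctly
        batch_correct = (pred.argmax(dim=1) == target_eval[:, 0]).sum().item()
        correct += batch_correct
        pbar.set_description(f"Batch Accuracy: {batch_correct*100 / target_eval.size(0):.2f}%")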
    scheduler.step()
    
    # Validation set
    model.eval()
    validation_features, validation_targets = dataset.get_validation_set()
    validation_features = validation_features.to(device)
    validation_targets = validation_targets.to(device)
    with torch.no_grad():
        pred = model(validation_features)
        validation_set_loss = loss_fn(pred, validation_targets.squeeze(1))
        validation_set_correct = (pred.argmax(dim=1) == validation_targets[:, 0]).sum().item()
        validation_set_accuracy = 100 * validation_set_correct / len(validation_targets)
        pred = pred.cpu()
        validation_features = validation_features.cpu()
        validation_targets = validation_targets.cpu()

    accuracy = 100 * correct / (len(dataloader) * TRAINING_BATCH_SIZE)
    tf = time.time()
    print(f"Epoch {epoch} - {avg_loss / len(dataloader):.4f} | Training Accuracy: {accuracy:.2f}%| Time: {tf-t0}")
    print(f"Validation set - accuracy: {validation_set_accuracy:.2f}% | loss: {validation_set_loss:.4f}\n")

    # Free GPU memory
    del validation_features, validation_targets, validation_set_loss, validation_set_accuracy
    gc.collect()
    torch.cuda.empty_cache()
    
    if epoch % 5 == 0:
        torch.save(model.state_dict(), f"{OUTPUT_PATH}/epoch-{epoch}.pth")

True
Using device:  cuda


Batch Accuracy: 48.44%: 100%|██████████| 157405/157405 [14:10<00:00, 185.14it/s]


Epoch 10 - 1.8637 | Training Accuracy: 43.79%| Time: 850.9807343482971
Validation set - accuracy: 40.09% | loss: 2.0427



Batch Accuracy: 35.94%: 100%|██████████| 157405/157405 [14:59<00:00, 174.92it/s]


Epoch 11 - 1.8538 | Training Accuracy: 44.01%| Time: 900.6456513404846
Validation set - accuracy: 39.97% | loss: 2.0430



Batch Accuracy: 45.31%: 100%|██████████| 157405/157405 [16:03<00:00, 163.30it/s]


Epoch 12 - 1.8458 | Training Accuracy: 44.20%| Time: 964.7244350910187
Validation set - accuracy: 40.08% | loss: 2.0432



Batch Accuracy: 43.75%: 100%|██████████| 157405/157405 [16:45<00:00, 156.62it/s] 


Epoch 13 - 1.8386 | Training Accuracy: 44.36%| Time: 1005.9593245983124
Validation set - accuracy: 40.02% | loss: 2.0425



Batch Accuracy: 37.50%: 100%|██████████| 157405/157405 [16:08<00:00, 162.59it/s]


Epoch 14 - 1.8325 | Training Accuracy: 44.52%| Time: 968.8990986347198
Validation set - accuracy: 40.17% | loss: 2.0416



Batch Accuracy: 45.31%: 100%|██████████| 157405/157405 [16:02<00:00, 163.51it/s] 


Epoch 15 - 1.8268 | Training Accuracy: 44.65%| Time: 963.496907711029
Validation set - accuracy: 40.20% | loss: 2.0452



Batch Accuracy: 48.44%: 100%|██████████| 157405/157405 [15:47<00:00, 166.15it/s]


Epoch 16 - 1.8219 | Training Accuracy: 44.77%| Time: 948.1530303955078
Validation set - accuracy: 39.96% | loss: 2.0444



Batch Accuracy: 48.44%: 100%|██████████| 157405/157405 [15:59<00:00, 164.01it/s]


Epoch 17 - 1.8174 | Training Accuracy: 44.87%| Time: 960.5497415065765
Validation set - accuracy: 39.95% | loss: 2.0436



Batch Accuracy: 42.19%: 100%|██████████| 157405/157405 [16:13<00:00, 161.73it/s]


Epoch 18 - 1.8135 | Training Accuracy: 44.99%| Time: 974.0424044132233
Validation set - accuracy: 40.16% | loss: 2.0427



Batch Accuracy: 53.12%: 100%|██████████| 157405/157405 [15:01<00:00, 174.67it/s]


Epoch 19 - 1.8099 | Training Accuracy: 45.06%| Time: 901.8765630722046
Validation set - accuracy: 40.08% | loss: 2.0463



Batch Accuracy: 40.62%: 100%|██████████| 157405/157405 [15:04<00:00, 173.96it/s]


Epoch 20 - 1.8062 | Training Accuracy: 45.15%| Time: 905.5936632156372
Validation set - accuracy: 40.04% | loss: 2.0491



Batch Accuracy: 51.56%: 100%|██████████| 157405/157405 [15:25<00:00, 170.10it/s]


Epoch 21 - 1.8032 | Training Accuracy: 45.23%| Time: 926.1206521987915
Validation set - accuracy: 40.22% | loss: 2.0466



Batch Accuracy: 42.19%:  29%|██▊       | 44999/157405 [03:44<07:52, 238.11it/s] 