# Data Exploration
In this notebook, I will take a look at the different datasets that are available.
I will test how well they can be transformed into a format that is usable for this project and, lastly, store the transformed data.

I use the excellent python-chess library to process data in the PGN format.

In [None]:
!pip install python-chess

The first dataset I want to look at is the FICS Games Database: https://www.ficsgames.org/download.html

This database contains millions of games, the earliest ones being from 1999.
One subcategory consists of games from players with a rating above 2000.
I will only consider these games.

All games from one year can be downloaded in bulk as a single PGN file.
Below, I will write functions to convert the PGN format into board states or into a list of move tokens ("words").

In [150]:
import chess
import chess.pgn 
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
import re

def dataExtractor(path):
    """
    Extracts the moves and FENs of each game as strings.
    path - the path to the PGN file in which the games are stored
    returns a dict with the keys 'moves' and 'fens'; each holds one list per game
    """
    
    re0 = re.compile(r"{.*?}", re.MULTILINE)
    re1 = re.compile(r"{.*}", re.MULTILINE)
    re2 = re.compile(r"\d+\..", re.MULTILINE)
    re3 = re.compile(r"\.", re.MULTILINE)
    re4 = re.compile(r"\$\d+", re.MULTILINE)
    
    pgn = open(path)
    data = {
        'moves': [],
        'fens': []
    }
    pgn.readlines()
    with tqdm(total=pgn.tell()-1) as pbar:
        pgn.seek(0)
        while True:
            pbar.n = pgn.tell()
            pbar.refresh()
            game = chess.pgn.read_game(pgn)

            if game is None:
                break

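            string = str(game.mainline_moves())
            string = re.sub(re0, "", string) # Remove comments (non-greedy, within one line)
            string = re.sub(re1, "", string) # Remove anything left between an outer pair of braces
            string = re.sub(re2, "", string) # Remove move numbers such as "12." plus the next character
            string = re.sub(re3, "", string) # Remove leftover dots, e.g. from "12..."
            string = re.sub(re4, "", string) # Remove numeric annotation glyphs such as "$1"

            # Skip games that still contain an unmatched or multi-line comment
            if "{" in string:
                continue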
            
            fens = []
            
            board = game.board()

            for move in game.mainline_moves():
                board.push(move)
                fen = board.board_fen()
                fens.append(str(fen))
                
            data['moves'].append(list(filter(None, string.split(" "))))
            data['fens'].append(fens)
            
        return data
    
path = 'data/sample.pgn'

games = dataExtractor(path)

KeyboardInterrupt: 
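
To make the cleaning step concrete, here is a minimal sketch that applies the same five regular expressions as `dataExtractor` to a hand-written movetext string. The sample string and the helper name `clean_movetext` are illustrative assumptions, not part of the dataset or of the pipeline in this notebook.

```python
import re

# Same patterns as in dataExtractor
RE_COMMENT_LAZY = re.compile(r"{.*?}", re.MULTILINE)
RE_COMMENT_GREEDY = re.compile(r"{.*}", re.MULTILINE)
RE_MOVE_NUMBER = re.compile(r"\d+\..", re.MULTILINE)
RE_DOT = re.compile(r"\.", re.MULTILINE)
RE_NAG = re.compile(r"\$\d+", re.MULTILINE)

def clean_movetext(movetext):
    """Hypothetical helper: reduce PGN movetext to a list of SAN tokens."""
    for pattern in (RE_COMMENT_LAZY, RE_COMMENT_GREEDY, RE_MOVE_NUMBER, RE_DOT, RE_NAG):
        movetext = pattern.sub("", movetext)
    # Splitting on single spaces leaves empty strings, which are dropped here
    return [token for token in movetext.split(" ") if token]

# Hand-written sample containing a comment, a glyph and a "2..." continuation
sample = "1. c4 Nf6 2. d4 $1 { a solid setup } 2... c6 3. Nf3 d5"
tokens = clean_movetext(sample)
print(tokens)
```

The order of the substitutions matters: move numbers are removed before stray dots, so a black-move marker such as `2...` disappears in two steps.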

In [104]:
games['moves'][0]

['c4',
 'Nf6',
 'd4',
 'c6',
 'Nf3',
 'd5',
 'cxd5',
 'cxd5',
 'Nc3',
 'a6',
 'Bf4',
 'Nc6',
 'Rc1',
 'Bf5',
 'e3',
 'Rc8',
 'Be2',
 'e6',
 'O-O',
 'Nd7',
 'Na4',
 'h6',
 'Bd3',
 'Bxd3',
 'Qxd3',
 'Be7',
 'Qb3',
 'Na5',
 'Qd1',
 'O-O',
 'Rxc8',
 'Qxc8',
 'Qe1',
 'Nc6',
 'Qd2',
 'b5',
 'Nc3',
 'Qb7',
 'Ne2',
 'Nb6',
 'Rc1',
 'Nc4',
 'Qc2',
 'Rc8',
 'b3',
 'Nb4',
 'Qb1',
 'Na3',
 'Qa1',
 'g5',
 'Bg3',
 'Nbc2',
 'Qb2',
 'f6',
 'Nc3',
 'Bb4',
 'Ne2',
 'Qh7',
 'Rd1',
 'Nxe3',
 'Rc1',
 'Nec2',
 'Kh1',
 'Kf7',
 'Nfg1',
 'h5',
 'f4',
 'h4',
 'Bf2',
 'gxf4',
 'Nxf4',
 'Qe4',
 'Nxd5',
 'exd5',
 'Rf1',
 'Rg8',
 'Bg3',
 'hxg3',
 'h3',
 'Ne3',
 'Rc1',
 'Nac2']

I also wrote a function that converts FEN notation into a matrix representation.
However, to save space, I will not store the data in this format.

In [59]:
import numpy as np

def indexToArray(i, len_ = 12):
    '''
    Converts an index into a one-hot-encoded vector.
    i - the index of the 1.
    len_ - (optional) the len of the one-hot-vector. Default is 12.
    returns a vector of length len_
    '''
    
    array = [0] * len_
    
    if 0 <= i < len_:
        array[i] = 1
        
    return array

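def fenToMatrix(fen):
    '''
    Converts a FEN string to an 8x8x12 matrix (one channel per piece type and color).
    fen - the piece-placement field of a FEN string, as returned by board.board_fen()
    returns an 8x8x12 matrix as nested lists
    '''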
    
    # 'rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR'
    pieces = {
        'r': indexToArray(0),
        'n': indexToArray(1),
        'b': indexToArray(2),
        'q': indexToArray(3),
        'k': indexToArray(4),
        'p': indexToArray(5),
        'P': indexToArray(6),
        'R': indexToArray(7),
        'N': indexToArray(8),
        'B': indexToArray(9),
        'Q': indexToArray(10),
        'K': indexToArray(11),
    }
    
    matrix = []
    row = []
    
    for c in fen:
        try:
            cInt = int(c)
            
            for _ in range(cInt):
                row.append(indexToArray(-1))
        except ValueError: # c cannot be cast to an integer
            if c == '/':
                matrix.append(row)
                row = []
            else:
                row.append(pieces[c]) 
    matrix.append(row)
                
    return matrix
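
As a quick sanity check of the encoding, the sketch below re-implements the conversion in compact form and measures the resulting dimensions for the starting position. The function name `fen_to_planes` and the constant `PIECE_ORDER` are assumptions made for this illustration; the channel order mirrors the `pieces` dictionary in `fenToMatrix`.

```python
# Channel index of each piece symbol, in the same order as the pieces dictionary
PIECE_ORDER = "rnbqkpPRNBQK"

def fen_to_planes(board_fen):
    """Hypothetical compact variant: piece-placement FEN -> nested 8x8x12 list."""
    matrix = []
    for rank in board_fen.split("/"):
        row = []
        for c in rank:
            if c.isdigit():
                # A digit encodes that many consecutive empty squares
                row.extend([0] * 12 for _ in range(int(c)))
            else:
                one_hot = [0] * 12
                one_hot[PIECE_ORDER.index(c)] = 1
                row.append(one_hot)
        matrix.append(row)
    return matrix

start = fen_to_planes("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR")
shape = (len(start), len(start[0]), len(start[0][0]))
# Every occupied square contributes exactly one 1 to the sum
occupied = sum(sum(square) for row in start for square in row)
print(shape, occupied)
```

Empty squares are all-zero vectors, so summing every entry counts the pieces on the board.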

Lastly, I save the data as JSON.

In [60]:
import json

with open('data/data.json', 'w') as outfile:  
    json.dump(games, outfile)
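
The space argument for storing FEN strings rather than matrices can be checked with a rough sketch. It compares the JSON-serialized size of one piece-placement string with that of an 8x8x12 nested list. The all-zero matrix is a stand-in, which is safe here because every entry serializes to a single character regardless of its value.

```python
import json

fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
# Stand-in matrix: zeros and ones both take one character in JSON
matrix = [[[0] * 12 for _ in range(8)] for _ in range(8)]

fen_size = len(json.dumps(fen))
matrix_size = len(json.dumps(matrix))
print(fen_size, matrix_size, round(matrix_size / fen_size))
```

For a position like this one, the matrix form is larger by well over an order of magnitude, which is why only the strings are written to disk.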

## Datasets with annotation
Next, I will look at the ~4000 games with annotations sourced from http://www.angelfire.com/games3/smartbridge/

The cell below logs parsing errors. I handle these by skipping the affected games; the tracebacks are still displayed, but they can be ignored.
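
If the tracebacks are too noisy, one option is to raise the level of the logger that emits them. The sketch below uses only the standard `logging` module. The logger name `chess.pgn` is an assumption based on python-chess creating one logger per module, so it should be verified against the installed version.

```python
import logging

# Assumption: python-chess reports parse errors through a logger named "chess.pgn"
pgn_logger = logging.getLogger("chess.pgn")
previous_level = pgn_logger.level

# Only CRITICAL records pass, so error-level parsing messages are hidden
pgn_logger.setLevel(logging.CRITICAL)
errors_hidden = not pgn_logger.isEnabledFor(logging.ERROR)
print(errors_hidden)

# Restore the previous level afterwards if the messages are needed again:
# pgn_logger.setLevel(previous_level)
```

This only changes what is printed. The check on `game.errors` in the loop is still needed to filter out the affected games.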

In [102]:
from os import listdir

annotated_games = []

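for file in tqdm(listdir('data/annotated')):
    # Close each file once all of its games have been read
    with open('data/annotated/' + file) as pgn:
        while True:
            game = chess.pgn.read_game(pgn)

            if game is None: # end of file reached
                break

            if game.errors: # skip games that could not be parsed cleanly
                continue

            annotated_games.append(game)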
    
annotated_games

error during pgn parsing
Traceback (most recent call last):
  File "C:\Users\Robin\Anaconda3\lib\site-packages\chess\pgn.py", line 1271, in read_game
    move = visitor.parse_san(board_stack[-1], token)
  File "C:\Users\Robin\Anaconda3\lib\site-packages\chess\pgn.py", line 709, in parse_san
    return board.parse_san(san)
  File "C:\Users\Robin\Anaconda3\lib\site-packages\chess\__init__.py", line 2744, in parse_san
    raise ValueError("illegal san: {!r} in {}".format(san, self.fen()))
ValueError: illegal san: 'g1-f3' in rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
error during pgn parsing
Traceback (most recent call last):
  File "C:\Users\Robin\Anaconda3\lib\site-packages\chess\pgn.py", line 1271, in read_game
    move = visitor.parse_san(board_stack[-1], token)
  File "C:\Users\Robin\Anaconda3\lib\site-packages\chess\pgn.py", line 709, in parse_san
    return board.parse_san(san)
  File "C:\Users\Robin\Anaconda3\lib\site-packages\chess\__init__.py", line 2744, in parse_s

ValueError: illegal san: 'Bxg5' in rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
error during pgn parsing
Traceback (most recent call last):
  File "C:\Users\Robin\Anaconda3\lib\site-packages\chess\pgn.py", line 1271, in read_game
    move = visitor.parse_san(board_stack[-1], token)
  File "C:\Users\Robin\Anaconda3\lib\site-packages\chess\pgn.py", line 709, in parse_san
    return board.parse_san(san)
  File "C:\Users\Robin\Anaconda3\lib\site-packages\chess\__init__.py", line 2739, in parse_san
    raise ValueError("ambiguous san: {!r} in {}".format(san, self.fen()))
ValueError: ambiguous san: 'Nd7' in rnbqk2r/ppp1bppp/4pn2/3p2B1/2PP4/2N2N2/PP2PPPP/R2QKB1R b KQkq - 5 5
error during pgn parsing
Traceback (most recent call last):
  File "C:\Users\Robin\Anaconda3\lib\site-packages\chess\pgn.py", line 1271, in read_game
    move = visitor.parse_san(board_stack[-1], token)
  File "C:\Users\Robin\Anaconda3\lib\site-packages\chess\pgn.py", line 709, in parse_san
    return board.pa




[<Game at 0x14cf3508e10 ('Karjakin, Sergey' vs. 'Shirov, Alexei', '2002.11.29')>,
 <Game at 0x14cf458e5c0 ('Shirov, Alexei' vs. 'Karpov, Anatoly', '2002.11.29')>,
 <Game at 0x14cf477f470 ('San Segundo Carrillo, P.' vs. 'Polgar, Judit', '2002.11.29')>,
 <Game at 0x14cf46b32b0 ('Ponomariov, Ruslan' vs. 'Karjakin, Sergey', '2002.11.29')>,
 <Game at 0x14cf4682518 ('Polgar, Judit' vs. 'Shirov, Alexei', '2002.11.29')>,
 <Game at 0x14cf46b3fd0 ('Ponomariov, Ruslan' vs. 'Polgar, Judit', '2002.11.29')>,
 <Game at 0x14cf45f7b70 ('Karjakin, Sergey' vs. 'Karpov, Anatoly', '2002.11.29')>,
 <Game at 0x14cf4651048 ('Polgar, Judit' vs. 'Karjakin, Sergey', '2002.11.30')>,
 <Game at 0x14cf469d208 ('Karpov, Anatoly' vs. 'Polgar, Judit', '2002.11.30')>,
 <Game at 0x14cf4693438 ('Polgar, Judit' vs. 'Granero Roca, A.', '2002.11.30')>,
 <Game at 0x14cf47ca710 ('Psakhis, Lev' vs. 'Shirov, Alexei', '2002.11.30')>,
 <Game at 0x14cf4784e80 ('Polgar, Judit' vs. 'Paunovic, D.', '2002.11.30')>,
 <Game at 0x14cf45b8

In [137]:
len(annotated_games)

3683

In [139]:
for node in annotated_games[0].mainline():
    print(str(node.move) + " " + node.comment)

e2e4 
c7c5 Although Shirov has played a variety of defenses, I think he's been playing the Petroff's Defense and perhaps the Caro-Kann most recently.
g1f3 
b8c6 
d2d4 
c5d4 
f3d4 
e7e5 Neo-Sveshnikov or Lowenthal leaves d5 (and perhaps d6) weak. More common is 4...Nf6, forcing White to defend Pe4 and threatening ...d5.
d4b5 
d7d6 
b1c3 He threatens Nc3-d5 and Nc7+, so Black must nip that idea in the bud.
a7a6 
b5a3 
b7b5 Instead of White dominating square c4 it's Black who controls it, for the moment. Eventually piece play should dominate a position more than pawns. There is also the immediate threat of ...b4.
c3d5 White could continue with Bc1-e3-b6 and Nc7+, so Black has to react quickly.
c6e7 
c2c4 
e7d5 
e4d5 This makes the pawns a bit unbalanced with White advancing on the queen-side. Aside from Pb5 Black shouldn't have much difficulty completing his development.
b5c4 
a3c4 Now White controls c4!
g8f6 
c1e3 
a8b8 
f1e2 
f8e7 
a2a4 aiming for a4-a5 and Nb6 or Bb6 to cramp Black, bu

In [146]:
def dataExtractor2(games):
    """
    Extracts the commented moves and FENs of each game.
    games - a list of parsed chess.pgn.Game objects
    returns a dict with the keys 'moves' and 'fens'; each holds, per game,
    a list of [value, comment] pairs, one for every commented move
    """
    
    counter = 0
    
    re0 = re.compile(r"{.*?}", re.MULTILINE)
    re1 = re.compile(r"{.*}", re.MULTILINE)
    re2 = re.compile(r"\d+\..", re.MULTILINE)
    re3 = re.compile(r"\.", re.MULTILINE)
    re4 = re.compile(r"\$\d+", re.MULTILINE)
    
    data = {
        'moves': [],
        'fens': []
    }
    for game in tqdm(games):
        result1 = [] # Stores all the commented moves for this game
        result2 = [] # Stores all the commented fens for this game
        
        string = str(game.mainline_moves())
        string = re.sub(re0, "", string) # Remove comments
        string = re.sub(re1, "", string)
        string = re.sub(re2, "", string)
        string = re.sub(re3, "", string)
        string = re.sub(re4, "", string)
        
        if "{" in string:
            continue
        
        moves = list(filter(None, string.split(" ")))
        mainline = list(game.mainline())
                
        helper = []
        
        board = game.board()                
        
        # zip stops at the shorter sequence, so a token mismatch cannot raise an IndexError
        for move, node in zip(moves, mainline):
            comment = node.comment
            helper.append(move)
            board.push(node.move)
            
            if comment: # comment exists  
                counter += 1
                result1.append([helper.copy(), comment])
                result2.append([str(board.board_fen()), comment])
            
        data['moves'].append(result1)
        data['fens'].append(result2)
                               
    print(counter, 'commented moves found')        
    return data
                               
data = dataExtractor2(annotated_games)

17615 commented moves found
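
The structure that `dataExtractor2` produces is easier to see on a toy example. The sketch below pairs each commented move with the full move prefix leading up to it, using a hand-written list of (move, comment) tuples instead of parsed games. The sample data and the name `commented_prefixes` are illustrative assumptions.

```python
def commented_prefixes(annotated_moves):
    """Hypothetical helper: return [move_prefix, comment] for every commented move."""
    prefix = []
    pairs = []
    for san, comment in annotated_moves:
        prefix.append(san)
        if comment:
            # Copy the prefix so that later moves do not alter stored entries
            pairs.append([prefix.copy(), comment])
    return pairs

# Toy game: only the second and fourth moves carry a comment
toy_game = [
    ("e4", ""),
    ("c5", "A sharp reply."),
    ("Nf3", ""),
    ("Nc6", "Develops a piece."),
]
pairs = commented_prefixes(toy_game)
print(pairs)
```

Because every entry stores the whole prefix, the `moves` output can grow quadratically with game length when many moves are commented, whereas the `fens` output stores a single position per comment.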


In [148]:
data['fens'][0]

[['rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR',
  "Although Shirov has played a variety of defenses, I think he's been playing the Petroff's Defense and perhaps the Caro-Kann most recently."],
 ['r1bqkbnr/pp1p1ppp/2n5/4p3/3NP3/8/PPP2PPP/RNBQKB1R',
  'Neo-Sveshnikov or Lowenthal leaves d5 (and perhaps d6) weak. More common is 4...Nf6, forcing White to defend Pe4 and threatening ...d5.'],
 ['r1bqkbnr/pp3ppp/2np4/1N2p3/4P3/2N5/PPP2PPP/R1BQKB1R',
  'He threatens Nc3-d5 and Nc7+, so Black must nip that idea in the bud.'],
 ['r1bqkbnr/5ppp/p1np4/1p2p3/4P3/N1N5/PPP2PPP/R1BQKB1R',
  "Instead of White dominating square c4 it's Black who controls it, for the moment. Eventually piece play should dominate a position more than pawns. There is also the immediate threat of ...b4."],
 ['r1bqkbnr/5ppp/p1np4/1p1Np3/4P3/N7/PPP2PPP/R1BQKB1R',
  'White could continue with Bc1-e3-b6 and Nc7+, so Black has to react quickly.'],
 ['r1bqkbnr/5ppp/p2p4/1p1Pp3/2P5/N7/PP3PPP/R1BQKB1R',
  "This makes the pawns

In [149]:
with open('data/data2.json', 'w') as outfile:  
    json.dump(data, outfile)
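
When reading the file back, the nested per-game lists can be flattened into plain (position, comment) pairs. This is a sketch under the assumption that the file has the layout written by the cell above; an in-memory string with toy values stands in for `data/data2.json` so that the example runs on its own.

```python
import json

# Stand-in for the contents of data/data2.json (same layout, toy values)
raw = json.dumps({
    "moves": [[[["e4", "c5"], "A sharp reply."]]],
    "fens": [[["rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR", "A sharp reply."]]],
})

loaded = json.loads(raw)
# One (fen, comment) tuple per commented position, across all games
flat = [(fen, comment) for game in loaded["fens"] for fen, comment in game]
print(len(flat), flat[0][1])
```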