# Mid-Course Project | Data Preparation

The code below loads a .pgn chess database into a pandas dataframe. Before running it, I saved the .pgn files with a .txt extension.

Existing Python libraries can already do this, but for this project I wrote the code myself.

Chess games are stored in a standard format (.pgn), although the set of headings varies between sources. Each game takes the form:

#### (an example from ficsgames.org)

[Event "FICS rated blitz game"]

[Site "FICS freechess.org"]

[FICSGamesDBGameNo "513225677"]

[White "MrVali"]

[Black "IFDStock"]

[WhiteElo "1729"]

[BlackElo "2346"]

[WhiteRD "23.1"]

[BlackRD "39.2"]

[BlackIsComp "Yes"]

[TimeControl "300+0"]

[Date "2022.12.31"]

[Time "13:08:00"]

[WhiteClock "0:05:00.000"]

[BlackClock "0:05:00.000"]

[ECO "D00"]

[PlyCount "18"]

[Result "0-1"]

1. d4 d5 2. f4 Bf5 3. Be3 Nf6 4. c3 e6 5. Bf2 c5 6. e3 Qb6 7. Qc1 cxd4 8. exd4 Bd6 9. g3 Be4 {White resigns} 0-1

---

#### (another example from Lichess.org)

[Event "Rated Blitz game"]

[Site "https://lichess.org/pgnm3ztm"]

[White "ASTROSCEPTRE"]

[Black "danilober"]

[Result "0-1"]

[UTCDate "2013.11.30"]

[UTCTime "23:00:16"]

[WhiteElo "1253"]

[BlackElo "1586"]

[WhiteRatingDiff "-6"]

[BlackRatingDiff "+3"]

[ECO "C42"]

[Opening "Russian Game: Urusov Gambit"]

[TimeControl "300+2"]

[Termination "Normal"]

1. e4 e5 2. Bc4 Nf6 3. Nf3 Nxe4 4. Bxf7+ Kxf7 5. Nxe5+ Kg8 6. O-O d6 7. Nc4 b5 8. Ne3 Bb7 9. d3 Nf6 10. c3 g6 11. f4 Bg7 12. f5 Nc6 13. Qb3+ Kf8 14. fxg6 hxg6 15. Ng4 Ne5 16. Nxf6 Bxf6 17. d4 Ng4 18. Bg5 Kg7 19. Bxf6+ Nxf6 20. Qe6 Rhe8 21. Qh3 Re2 22. Na3 Rxg2+ 0-1

---



Both databases capture a similar but not identical set of headings.

In [1]:
import pandas as pd
import numpy as np

The FICS datasets came from https://www.ficsgames.org/download.html

The Lichess data came from https://database.lichess.org/#standard_games

In [2]:
data_fics = pd.read_csv('./Raw Data/ficsgamesdb_2022_CvH_nomovetimes_304198 - Copy.txt', names = ['text'])
data_fics

Unnamed: 0,text
0,"[Event ""FICS rated standard game""]"
1,"[Site ""FICS freechess.org""]"
2,"[FICSGamesDBGameNo ""530000657""]"
3,"[White ""IFDStock""]"
4,"[Black ""Aromas""]"
...,...
721292,"[BlackClock ""0:05:00.000""]"
721293,"[ECO ""A25""]"
721294,"[PlyCount ""37""]"
721295,"[Result ""1-0""]"
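`pd.read_csv` treats commas as field separators, so a line that happens to contain a comma may not arrive as a single `text` value. As a hedged alternative, the sketch below reads the file line by line with plain Python and skips blank lines. The helper name `read_pgn_lines` and the `utf-8` default encoding are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def read_pgn_lines(filepath, encoding="utf-8"):
    """Read a PGN text file into a one-column dataframe, one row per non-blank line.

    Hypothetical helper; the default encoding is an assumption.
    """
    with open(filepath, encoding=encoding) as handle:
        # Strip surrounding whitespace and drop the blank separator lines
        lines = [line.strip() for line in handle if line.strip()]
    return pd.DataFrame({"text": lines})
```

The returned frame has the same single `text` column as `data_fics`, so the later steps should apply to it unchanged.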


First, we extract the future column headings and the content of each line from the text string with the form:

[heading "content"]

The exception is the Movetext, the sequence of moves played, which is entirely content and has no heading.

In [4]:
# regex=False treats "[" as a literal character and avoids the pandas FutureWarning
data_fics['heading'] = np.where(data_fics['text'].str[0]=='[', data_fics['text'].str.split(' ').str[0].str.replace("[","", regex=False), 'Movetext')
data_fics


Unnamed: 0,text,heading
0,"[Event ""FICS rated standard game""]",Event
1,"[Site ""FICS freechess.org""]",Site
2,"[FICSGamesDBGameNo ""530000657""]",FICSGamesDBGameNo
3,"[White ""IFDStock""]",White
4,"[Black ""Aromas""]",Black
...,...,...
721292,"[BlackClock ""0:05:00.000""]",BlackClock
721293,"[ECO ""A25""]",ECO
721294,"[PlyCount ""37""]",PlyCount
721295,"[Result ""1-0""]",Result


The lambda expression below extracts the string between the quote marks as 'content'. Movetext rows have no quote-delimited value, so they keep their full text.

In [5]:
data_fics['content'] = np.where(data_fics['text'].str[0]=='[', data_fics['text'].apply(lambda st: st[st.find(' "')+2:st.find('"]')]), data_fics['text'])
data_fics


Unnamed: 0,text,heading,content
0,"[Event ""FICS rated standard game""]",Event,FICS rated standard game
1,"[Site ""FICS freechess.org""]",Site,FICS freechess.org
2,"[FICSGamesDBGameNo ""530000657""]",FICSGamesDBGameNo,530000657
3,"[White ""IFDStock""]",White,IFDStock
4,"[Black ""Aromas""]",Black,Aromas
...,...,...,...
721292,"[BlackClock ""0:05:00.000""]",BlackClock,0:05:00.000
721293,"[ECO ""A25""]",ECO,A25
721294,"[PlyCount ""37""]",PlyCount,37
721295,"[Result ""1-0""]",Result,1-0
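The two steps above can also be done in one pass with a regular expression. This is a sketch on toy rows rather than the notebook's method: the pattern and the intermediate `parts` frame are my own assumptions, and the pattern expects a heading made of word characters followed by a quoted value.

```python
import pandas as pd

# Toy rows in the same shape as the 'text' column
data = pd.DataFrame({"text": [
    '[Event "FICS rated blitz game"]',
    '[WhiteElo "1729"]',
    "1. d4 d5 2. f4 Bf5 {White resigns} 0-1",
]})

# Named groups become columns; rows that are not [heading "content"] yield NaN
parts = data["text"].str.extract(r'^\[(?P<heading>\w+)\s+"(?P<content>.*)"\]\s*$')

# Rows without a heading are Movetext and keep their full text as content
data["heading"] = parts["heading"].fillna("Movetext")
data["content"] = parts["content"].fillna(data["text"])
```

A single pattern keeps the heading and content rules in one place, at the cost of being harder to read than the two `np.where` calls.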


We now need to group the rows into games. Nothing in the table yet states that, for example, rows 2 and 3 belong to the same game.

We can exploit the repeating structure of the .pgn file.

A new game starts every 19 rows in the FICS file (and every 16 rows in the Lichess file), always with an "Event" header. The code below does not rely on those counts: it initialises a 'game' number and increases it by 1 each time it reaches a new "Event" row.

In [6]:
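def game_counter(data):
    """Number each row with the game it belongs to, starting from 0."""

    # Every row starts in game 0; the loop assumes the default 0..n-1 integer index
    data['game'] = 0

    for i in range(1, len(data)):
        if data.loc[i,'heading'] == 'Event':
            # An "Event" row opens a new game
            data.loc[i,'game'] = 1 + data.loc[i-1,'game']
        else:
            # Any other row belongs to the same game as the row above
            data.loc[i,'game'] = data.loc[i-1,'game']

    return data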

In [7]:
game_counter(data_fics)
data_fics

Unnamed: 0,text,heading,content,game
0,"[Event ""FICS rated standard game""]",Event,FICS rated standard game,0
1,"[Site ""FICS freechess.org""]",Site,FICS freechess.org,0
2,"[FICSGamesDBGameNo ""530000657""]",FICSGamesDBGameNo,530000657,0
3,"[White ""IFDStock""]",White,IFDStock,0
4,"[Black ""Aromas""]",Black,Aromas,0
...,...,...,...,...
721292,"[BlackClock ""0:05:00.000""]",BlackClock,0:05:00.000,37962
721293,"[ECO ""A25""]",ECO,A25,37962
721294,"[PlyCount ""37""]",PlyCount,37,37962
721295,"[Result ""1-0""]",Result,1-0,37962
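The row-by-row loop is easy to follow but slow on large files. As a hedged alternative, the same numbering can be sketched with a cumulative sum over the "Event" rows. The toy frame below is illustrative, and the sketch assumes the first row of the data is an "Event" row.

```python
import pandas as pd

# Toy 'heading' column covering two games
data = pd.DataFrame({
    "heading": ["Event", "Site", "Movetext", "Event", "Site", "Movetext"],
})

# True on each "Event" row, so the running total counts games 1, 2, 3, ...
# Subtracting 1 matches the zero-based numbering produced by game_counter
data["game"] = (data["heading"] == "Event").cumsum() - 1
```

Because the comparison and the cumulative sum both run over the whole column at once, this avoids the per-row `.loc` lookups.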


Once the rows are grouped, a pivot creates the dataframe needed to begin the analysis, with one row per game and one column per heading.

In [8]:
data_fics_prepared = data_fics.pivot(index='game',columns='heading',values='content').reset_index()
data_fics_prepared.to_csv('./Processed Data/data_fics_prepared.csv', index=False)
data_fics_prepared

heading,game,Black,BlackClock,BlackElo,BlackIsComp,BlackRD,Date,ECO,Event,FICSGamesDBGameNo,...,PlyCount,Result,Site,Time,TimeControl,White,WhiteClock,WhiteElo,WhiteIsComp,WhiteRD
0,0,Aromas,0:15:00.000,1979,,25.0,2022.12.31,A00,FICS rated standard game,530000657,...,41,1-0,FICS freechess.org,22:03:00,900+0,IFDStock,0:15:00.000,2526,Yes,51.5
1,1,slaran,0:02:00.000,1593,,47.6,2022.12.31,A56,FICS rated blitz game,530000656,...,29,1-0,FICS freechess.org,21:56:00,120+12,exeComp,0:02:00.000,2509,Yes,93.8
2,2,scalaQueen,0:01:00.000,2116,Yes,27.9,2022.12.31,D04,FICS rated lightning game,530000441,...,110,0-1,FICS freechess.org,21:00:00,60+0,ManOOwar,0:01:00.000,2111,,30.2
3,3,ManOOwar,0:01:00.000,2103,,30.3,2022.12.31,B40,FICS rated lightning game,530000432,...,124,0-1,FICS freechess.org,20:58:00,60+0,scalaQueen,0:01:00.000,2124,Yes,27.9
4,4,scalaQueen,0:01:00.000,2124,Yes,28.0,2022.12.31,D04,FICS rated lightning game,530000423,...,92,1/2-1/2,FICS freechess.org,20:56:00,60+0,ManOOwar,0:01:00.000,2103,,30.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37958,37958,chesspickle,0:05:00.000,1928,Yes,0.0,2022.01.01,B30,FICS rated blitz game,510001352,...,61,1-0,FICS freechess.org,00:26:00,300+0,Geforce,0:05:00.000,2086,,0.0
37959,37959,Geforce,0:05:00.000,2090,,0.0,2022.01.01,C65,FICS rated blitz game,510001337,...,98,1/2-1/2,FICS freechess.org,00:17:00,300+0,chesspickle,0:05:00.000,1924,Yes,0.0
37960,37960,chesspickle,0:05:00.000,1929,Yes,0.0,2022.01.01,B30,FICS rated blitz game,510001310,...,65,1-0,FICS freechess.org,00:08:00,300+0,Geforce,0:05:00.000,2085,,0.0
37961,37961,konozrout,0:05:00.000,2024,Yes,0.0,2022.01.01,B50,FICS rated blitz game,510001293,...,61,1-0,FICS freechess.org,00:01:00,300+0,Geforce,0:05:00.000,2078,,0.0
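`DataFrame.pivot` raises a `ValueError` when a game contains the same heading twice, which would happen if a file wrapped the Movetext across several lines. The files shown here keep the Movetext on one line, so this is an assumption about other sources. The sketch below, on toy data, joins repeated headings with a space before reshaping.

```python
import pandas as pd

# Toy long-format rows: one game whose Movetext is split over two lines
long_df = pd.DataFrame({
    "game": [0, 0, 0, 0],
    "heading": ["Event", "White", "Movetext", "Movetext"],
    "content": ["Rated Blitz game", "ASTROSCEPTRE", "1. e4 e5", "2. Bc4 Nf6 0-1"],
})

wide = (
    long_df.groupby(["game", "heading"])["content"]
    .agg(" ".join)        # merge repeated headings within a game, in row order
    .unstack("heading")   # one column per heading
    .reset_index()
)
```

For data without repeated headings, this gives the same one-row-per-game layout as the pivot.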


This process can be summarised with a generic function.

In [9]:
def prepare(filepath):
    """Read a .pgn file saved as .txt and return a dataframe with one row per game."""
    # Local imports keep the function self-contained
    import pandas as pd
    import numpy as np

    data = pd.read_csv(filepath, names = ['text'])

    # Header rows start with '['; every other row is Movetext
    is_header = data['text'].str[0]=='['
    data['heading'] = np.where(is_header, data['text'].str.split(' ').str[0].str.replace("[","", regex=False), 'Movetext')
    data['content'] = np.where(is_header, data['text'].apply(lambda st: st[st.find(' "')+2:st.find('"]')]), data['text'])

    # Number the games: each "Event" row starts a new one
    data['game'] = 0
    for i in range(1, len(data)):
        if data.loc[i,'heading'] == 'Event':
            data.loc[i,'game'] = 1 + data.loc[i-1,'game']
        else:
            data.loc[i,'game'] = data.loc[i-1,'game']

    # One row per game, one column per heading
    data_prepared = data.pivot(index='game',columns='heading',values='content').reset_index()

    return data_prepared

For my project, I ran the function on a larger sample set, which took much longer to run:

In [5]:
# fics2 = prepare('./Raw Data/ficsgamesdb_201801_chess_nomovetimes_304468.txt')


In [6]:
# fics2.to_csv('./Processed Data/data_fics2_prepared.csv', index=False)
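Every value in the prepared dataframe is still a string, because it was sliced out of the raw text. A possible next step, sketched here on a toy frame rather than taken from the original notebook, is to convert the numeric and date columns. The column names come from the FICS output above; the choice of `errors="coerce"` is an assumption about how to treat unparseable values.

```python
import pandas as pd

# Toy frame mimicking a few columns of the prepared FICS output
games = pd.DataFrame({
    "WhiteElo": ["2526", "2509"],
    "BlackElo": ["1979", "1593"],
    "PlyCount": ["41", "29"],
    "Date": ["2022.12.31", "2022.12.31"],
})

for col in ["WhiteElo", "BlackElo", "PlyCount"]:
    # errors="coerce" turns values that cannot be parsed into NaN
    games[col] = pd.to_numeric(games[col], errors="coerce")

# PGN dates use dots as separators, e.g. 2022.12.31
games["Date"] = pd.to_datetime(games["Date"], format="%Y.%m.%d", errors="coerce")
```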

### Optional: pushing to an SQL database

In [8]:
# import mysql.connector
# from getpass import getpass
# password = getpass()
 
# dataBase = mysql.connector.connect(
#   host ="localhost",
#   user ="root",
#   passwd =password
# )

In [9]:
# # preparing a cursor object
# cursorObject = dataBase.cursor()
 
# # creating database
# cursorObject.execute("CREATE DATABASE chessFICS")

In [7]:
# import mysql.connector
# from getpass import getpass
# password = getpass()

# dataBase = mysql.connector.connect(
#                      host = "localhost",
#                      user = "root",
#                      passwd = password,
#                      database = "chessFICS" ) 
 
# # preparing a cursor object
# cursorObject = dataBase.cursor()
 
# # creating table 
# create_games = """CREATE TABLE GAMES (
#                    game MEDIUMINT,
#                    Black VARCHAR(50),
#                    BlackElo MEDIUMINT,
#                    BlackIsComp VARCHAR(5),
#                    BlackRD FLOAT,
#                    Date DATE,
#                    ECO VARCHAR(5),
#                    Event VARCHAR(50),
#                    Movetext VARCHAR(10000),
#                    Result VARCHAR(10),
#                    Site VARCHAR(20),
#                    Time TIME,
#                    TimeControl VARCHAR(10),
#                    White VARCHAR(50),
#                    WhiteElo MEDIUMINT,
#                    WhiteRD FLOAT,
#                    BlackClock VARCHAR(20),
#                    FICSGamesDBGameNo VARCHAR(10),
#                    PlyCount SMALLINT,
#                    Variant VARCHAR(50),
#                    WhiteClock VARCHAR(20),
#                    WhiteIsComp VARCHAR(5)
#                    )"""
 
# # table created
# cursorObject.execute(create_games) 
 
# # disconnecting from server
# dataBase.close()
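The cells above only create the database and an empty table. As a rough sketch of loading rows, the example below writes a toy dataframe with `DataFrame.to_sql` and reads it back. It swaps in an in-memory SQLite database from the standard library so that it runs without a server. This is an assumption for illustration; for MySQL, pandas expects an SQLAlchemy connection rather than a raw `mysql.connector` one.

```python
import sqlite3

import pandas as pd

# Toy rows with a few of the columns from the GAMES table
games = pd.DataFrame({
    "game": [0, 1],
    "White": ["IFDStock", "exeComp"],
    "Black": ["Aromas", "slaran"],
    "Result": ["1-0", "1-0"],
})

# In-memory SQLite database standing in for the MySQL server (assumption)
connection = sqlite3.connect(":memory:")

# Create the table from the dataframe, replacing it if it already exists
games.to_sql("GAMES", connection, if_exists="replace", index=False)

# Read the rows back to confirm the round trip
stored = pd.read_sql_query("SELECT * FROM GAMES", connection)
connection.close()
```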

