# **Popular Chess Openings based on Rating and Time Control**




This Python notebook identifies the most popular chess openings by player rating and time control, so that players can see which openings are played most often at their own level and time format. It analyzes a large dataset of online chess games to find patterns in opening choices across skill levels (e.g., beginner, intermediate, advanced) and time controls (e.g., classical, rapid, blitz). Knowing which openings come up most often lets players prepare for them, add them to their own repertoire, and improve their overall performance.

## **Importing libraries**

In [6]:
import pandas as pd
import numpy as np
import chess.pgn



## **Parsing the PGN file**

The games in the PGN file were extracted from the FICS (Free Internet Chess Server) games database, which can be downloaded [here](https://www.ficsgames.org/download.html). The extract covers the whole of 2023 and includes every time control and every player rating. The parsing cell below is left commented out, since its result is saved to a `.csv` file and reloaded in the next section.

In [7]:
# # Initialize an empty dictionary to store the game data
# game_data = {
#     "Result": [],
#     "WhiteElo": [], 
#     "BlackElo": [],
#     "TimeControl": [], 
#     "ECO": []
# }

# num_games = 1

# with open("./dataset/2023_fics_games.pgn") as pgn_file:
#     # Parse the first game in the file
#     game = chess.pgn.read_game(pgn_file)
#     print(game.headers)

#     # Iterate through all games in the file
#     while game is not None:
#         print(f"Processing game {num_games}", end="\r")
#         # Append game headers (metadata) to the game_data dictionary
#         for key in game_data.keys():
#             if key in game.headers:
#                 game_data[key].append(game.headers[key])
#             else:
#                 game_data[key].append(None)    
#         # Read the next game in the file
#         game = chess.pgn.read_game(pgn_file)
#         num_games += 1
#         if num_games > 1000000:
#             break
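# # num_games starts at 1 and is incremented after every game, so subtract 1 for the true total
# print(f"Done parsing the games. Total games: {num_games - 1}")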

# # Convert the game_data dictionary into a DataFrame
# df = pd.DataFrame(game_data)
# df.head()

Before we proceed, let's save this dataframe into a `.csv` file. 

In [8]:
# df.to_csv("./dataset/2023_fics_games.csv", index=False)

## **Preprocessing the dataset**

The dataset used for this project comprises approximately 1,000,000 chess games from the FICS games database (ficsgames.org). Each game record includes the result, time control, player ratings, and opening ECO code.

In [9]:
df = pd.read_csv("./dataset/2023_fics_games.csv") 
df.head()

Unnamed: 0,Result,WhiteElo,BlackElo,TimeControl,ECO
0,1-0,1491,1554,blitz,D58
1,0-1,1520,1458,rapid,B02
2,1-0,1613,1553,rapid,C00
3,1-0,1338,1411,blitz,A40
4,0-1,1504,1608,blitz,D02


Let's see the shape of the dataset. 

In [10]:
df.shape

(999990, 5)

Let's also look at the columns and their data types. 

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999990 entries, 0 to 999989
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Result       999990 non-null  object
 1   WhiteElo     999990 non-null  int64 
 2   BlackElo     999990 non-null  int64 
 3   TimeControl  999990 non-null  object
 4   ECO          999990 non-null  object
dtypes: int64(2), object(3)
memory usage: 38.1+ MB


The output shows that no column contains null values. As a safeguard, let's still drop any rows with a missing `ECO` value. 

In [12]:
df.dropna(subset=['ECO'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999990 entries, 0 to 999989
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Result       999990 non-null  object
 1   WhiteElo     999990 non-null  int64 
 2   BlackElo     999990 non-null  int64 
 3   TimeControl  999990 non-null  object
 4   ECO          999990 non-null  object
dtypes: int64(2), object(3)
memory usage: 38.1+ MB


Note that I have grouped the original `TimeControl` values (e.g., 180+0, 120+12) into `bullet`, `blitz`, `rapid`, and `classical` for better readability. The grouping follows the one used by lichess.org, which is described [here](https://lichess.org/faq#time-controls). 

The estimated game duration is computed using this formula: 

`estimated game duration = (clock initial time) + 40 × (clock increment)`

The resulting duration is then grouped as follows:
* < 179s = `bullet`
* < 479s = `blitz`
* < 1500s = `rapid`
* ≥ 1500s = `classical`
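
The grouping code itself is not shown in this notebook, so the following is only a minimal sketch of how the formula and thresholds above could be applied. The function names `estimated_duration` and `classify_time_control` are hypothetical, and the sketch assumes the raw values look like `initial+increment` in seconds (e.g. `180+0`).

```python
from typing import Optional


def estimated_duration(time_control: str) -> Optional[int]:
    """Return (initial time) + 40 * (increment) in seconds, or None if unparseable."""
    try:
        initial, increment = time_control.split("+")
        return int(initial) + 40 * int(increment)
    except ValueError:
        # Raised for values without exactly one "+" or with non-numeric parts
        return None


def classify_time_control(time_control: str) -> Optional[str]:
    """Map a raw time control string to the four groups listed above."""
    duration = estimated_duration(time_control)
    if duration is None:
        return None
    if duration < 179:
        return "bullet"
    if duration < 479:
        return "blitz"
    if duration < 1500:
        return "rapid"
    return "classical"
```

For example, `120+12` gives an estimated duration of 120 + 40 × 12 = 600 seconds, which falls in the `rapid` group, while `180+0` gives 180 seconds and falls in `blitz`.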

Before we proceed, let's see the number of games played per time control. 

In [13]:
df['TimeControl'].value_counts()

TimeControl
blitz        759410
rapid        209144
bullet        24106
classical      7330
Name: count, dtype: int64

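Now, let's group the ratings based on the criteria found in this [forum](https://lichess.org/forum/general-chess-discussion/what-is-a-good-lichess-rating) on Lichess:
* < 1300 = `beginner`
* < 2000 = `intermediate`
* < 2400 = `advanced`
* < 2700 = `master`
* ≥ 2700 = `grandmaster`

The cell below applies a coarser version of these bands: `advanced` and `master` are merged into a single `advanced` level covering ratings from 2000 up to (but not including) 2700.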

In [14]:
white_rating_conditions = [
    (df['WhiteElo'] < 1300), 
    (df['WhiteElo'] >= 1300) & (df['WhiteElo'] < 2000),
    (df['WhiteElo'] >= 2000) & (df['WhiteElo'] < 2700),
    (df['WhiteElo'] >= 2700)
]

black_rating_conditions = [
    (df['BlackElo'] < 1300), 
    (df['BlackElo'] >= 1300) & (df['BlackElo'] < 2000),
    (df['BlackElo'] >= 2000) & (df['BlackElo'] < 2700),
    (df['BlackElo'] >= 2700)
]

rating_choices = ['beginner', 'intermediate', 'advanced', 'grandmaster']

df['WhiteLevel'] = np.select(white_rating_conditions, rating_choices, default='unknown')
df['BlackLevel'] = np.select(black_rating_conditions, rating_choices, default='unknown')
df.head()

Unnamed: 0,Result,WhiteElo,BlackElo,TimeControl,ECO,WhiteLevel,BlackLevel
0,1-0,1491,1554,blitz,D58,intermediate,intermediate
1,0-1,1520,1458,rapid,B02,intermediate,intermediate
2,1-0,1613,1553,rapid,C00,intermediate,intermediate
3,1-0,1338,1411,blitz,A40,intermediate,intermediate
4,0-1,1504,1608,blitz,D02,intermediate,intermediate
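
As an alternative to writing the conditions out twice, the same banding could be expressed once with `pd.cut`. This is a sketch rather than the notebook's method; `RATING_BINS`, `RATING_LABELS`, and `rating_level` are hypothetical names, and the bin edges mirror the conditions used in the cell above.

```python
import numpy as np
import pandas as pd

# Bin edges mirror the conditions above: <1300, 1300-1999, 2000-2699, >=2700
RATING_BINS = [-np.inf, 1300, 2000, 2700, np.inf]
RATING_LABELS = ["beginner", "intermediate", "advanced", "grandmaster"]


def rating_level(elo: pd.Series) -> pd.Series:
    """Label each rating with its level; right=False makes bins left-inclusive."""
    levels = pd.cut(elo, bins=RATING_BINS, labels=RATING_LABELS, right=False)
    return levels.astype(str)


# Usage on a toy frame with the same column names as the dataset
toy = pd.DataFrame({"WhiteElo": [1299, 1300, 2000], "BlackElo": [1999, 2699, 2700]})
toy["WhiteLevel"] = rating_level(toy["WhiteElo"])
toy["BlackLevel"] = rating_level(toy["BlackElo"])
```

Because the bins are left-inclusive, a rating of exactly 1300 is labeled `intermediate` and a rating of exactly 2700 is labeled `grandmaster`, matching the `>=` comparisons in the original conditions.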


Now that we have the `WhiteLevel` and `BlackLevel` columns, we can remove the `WhiteElo` and `BlackElo` columns. 

In [16]:
df = df.drop(["WhiteElo", "BlackElo"], axis=1)
df.head()

Unnamed: 0,Result,TimeControl,ECO,WhiteLevel,BlackLevel
0,1-0,blitz,D58,intermediate,intermediate
1,0-1,rapid,B02,intermediate,intermediate
2,1-0,rapid,C00,intermediate,intermediate
3,1-0,blitz,A40,intermediate,intermediate
4,0-1,blitz,D02,intermediate,intermediate


Let's keep only the games in which White and Black belong to the same rating level. 

In [18]:
df_same_rating = df.loc[df['WhiteLevel'] == df['BlackLevel']]
df_same_rating.shape

(841570, 5)

Let's see the number of games played depending on the level of the players. 

In [19]:
df_same_rating['WhiteLevel'].value_counts()

WhiteLevel
intermediate    800189
beginner         28772
advanced         11368
grandmaster       1241
Name: count, dtype: int64

`grandmaster` has too few games. Let's merge it into `advanced`. 

In [24]:
# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
df_same_rating = df_same_rating.copy()
df_same_rating['WhiteLevel'] = df_same_rating['WhiteLevel'].where(df_same_rating['WhiteLevel'] != "grandmaster", "advanced")
df_same_rating['BlackLevel'] = df_same_rating['BlackLevel'].where(df_same_rating['BlackLevel'] != "grandmaster", "advanced")
df_same_rating['WhiteLevel'].value_counts()


WhiteLevel
intermediate    800189
beginner         28772
advanced         12609
Name: count, dtype: int64

Let's also see the number of games played based on the opening. 

In [34]:
df_games_played_per_opening = df_same_rating['ECO'].value_counts().reset_index()
df_games_played_per_opening.columns = ['ECO', 'Count']
df_games_played_per_opening.sort_values(by='ECO', inplace=True)
df_games_played_per_opening

Unnamed: 0,ECO,Count
0,A00,57622
23,A01,8272
35,A02,6334
44,A03,4681
14,A04,11205
...,...,...
389,E95,24
492,E96,1
234,E97,159
373,E98,30


Finally, let's reorder the columns and rows for better readability. 

In [35]:
df_preprocessed = df_same_rating.reindex(columns=['Result', 'WhiteLevel', 'BlackLevel', 'TimeControl', 'ECO'])
df_preprocessed.reset_index(drop=True, inplace=True)
df_preprocessed.head()

Unnamed: 0,Result,WhiteLevel,BlackLevel,TimeControl,ECO
0,0-1,intermediate,intermediate,blitz,A00
1,1-0,intermediate,intermediate,blitz,A00
2,1-0,intermediate,intermediate,blitz,A00
3,1-0,intermediate,intermediate,blitz,A00
4,0-1,intermediate,intermediate,blitz,A00


The data preprocessing is now done! Let's save it into a new `.csv` file. 

In [37]:
df_preprocessed.to_csv("./dataset/cleaned_dataset.csv", index=False)

## **Data Exploration**

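Now that the dataset is cleaned, we can explore it and see what useful information we can extract. Note that the cells in this section still reference columns (`opening_name`, `opening_eco`, `winner`, `time_control`) that the cleaned FICS dataset does not contain, and the game counts in their outputs are far smaller than the FICS totals above, so these outputs do not reflect the FICS games. First, let's list every opening together with its overall usage. 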

In [None]:
df_opening_usage = df.groupby(['opening_name', 'opening_eco']).size().reset_index(name='count')
df_opening_usage.sort_values('count', inplace=True, ascending=False)
df_opening_usage.reset_index(drop=True, inplace=True)
df_opening_usage

Unnamed: 0,opening_name,opening_eco,count
0,Sicilian Defense: Bowdler Attack,B20,256
1,French Defense: Knight Variation,C00,218
2,Van't Kruijs Opening,A00,210
3,Queen's Pawn Game: Mason Attack,D00,196
4,Scandinavian Defense: Mieses-Kotroc Variation,B01,192
...,...,...,...
1387,Polish Opening: Bugayev Advance Variation,A00,1
1388,Polish Opening: Czech Defense,A00,1
1389,Polish Opening: Dutch Defense,A00,1
1390,Polish Opening: Grigorian Variation,A00,1


Now, let's compute each opening's win rate for White, expressed as a fraction of the games played with that opening. 

In [None]:
df_white_wins = df[df['winner'] == 'white']
white_wins_per_opening = df_white_wins.groupby(['opening_name', 'opening_eco']).size().reset_index(name='white_wins')
df_opening_usage = df_opening_usage.merge(white_wins_per_opening, on=['opening_name', 'opening_eco'], how='left')
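# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_opening_usage['white_wins'] = df_opening_usage['white_wins'].fillna(0)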
df_opening_usage['win_percentage_as_white'] = df_opening_usage['white_wins'] / df_opening_usage['count']
df_opening_usage.head()

Unnamed: 0,opening_name,opening_eco,count,white_wins,win_percentage_as_white
0,Sicilian Defense: Bowdler Attack,B20,256,103.0,0.402344
1,French Defense: Knight Variation,C00,218,108.0,0.495413
2,Van't Kruijs Opening,A00,210,68.0,0.32381
3,Queen's Pawn Game: Mason Attack,D00,196,95.0,0.484694
4,Scandinavian Defense: Mieses-Kotroc Variation,B01,192,112.0,0.583333


Let's do the same for black. 

In [None]:
df_black_wins = df[df['winner'] == 'black']
black_wins_per_opening = df_black_wins.groupby(['opening_name', 'opening_eco']).size().reset_index(name='black_wins')
df_opening_usage = df_opening_usage.merge(black_wins_per_opening, on=['opening_name', 'opening_eco'], how='left')
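# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_opening_usage['black_wins'] = df_opening_usage['black_wins'].fillna(0)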
df_opening_usage['win_percentage_as_black'] = df_opening_usage['black_wins'] / df_opening_usage['count']
df_opening_usage.head()

Unnamed: 0,opening_name,opening_eco,count,white_wins,win_percentage_as_white,black_wins,win_percentage_as_black
0,Sicilian Defense: Bowdler Attack,B20,256,103.0,0.402344,141.0,0.550781
1,French Defense: Knight Variation,C00,218,108.0,0.495413,96.0,0.440367
2,Van't Kruijs Opening,A00,210,68.0,0.32381,130.0,0.619048
3,Queen's Pawn Game: Mason Attack,D00,196,95.0,0.484694,89.0,0.454082
4,Scandinavian Defense: Mieses-Kotroc Variation,B01,192,112.0,0.583333,76.0,0.395833


Finally, let's determine the number of draws and the draw rate per opening. 

In [None]:
df_draw = df[df['winner'] == 'draw']
draw_per_opening = df_draw.groupby(['opening_name', 'opening_eco']).size().reset_index(name='draw')
df_opening_usage = df_opening_usage.merge(draw_per_opening, on=['opening_name', 'opening_eco'], how='left')
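# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_opening_usage['draw'] = df_opening_usage['draw'].fillna(0)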
df_opening_usage['draw_percentage'] = df_opening_usage['draw'] / df_opening_usage['count']
df_opening_usage.head()

Unnamed: 0,opening_name,opening_eco,count,white_wins,win_percentage_as_white,black_wins,win_percentage_as_black,draw,draw_percentage
0,Sicilian Defense: Bowdler Attack,B20,256,103.0,0.402344,141.0,0.550781,12.0,0.046875
1,French Defense: Knight Variation,C00,218,108.0,0.495413,96.0,0.440367,14.0,0.06422
2,Van't Kruijs Opening,A00,210,68.0,0.32381,130.0,0.619048,12.0,0.057143
3,Queen's Pawn Game: Mason Attack,D00,196,95.0,0.484694,89.0,0.454082,12.0,0.061224
4,Scandinavian Defense: Mieses-Kotroc Variation,B01,192,112.0,0.583333,76.0,0.395833,4.0,0.020833
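
The three filter-and-merge steps above could also be condensed into a single cross-tabulation. The sketch below is written against the cleaned dataset's `ECO` and `Result` columns rather than the column names used in the cells above. The function name `outcome_rates` and the output column names are hypothetical, and it assumes `Result` uses the standard PGN values `1-0`, `0-1`, and `1/2-1/2`.

```python
import pandas as pd


def outcome_rates(games: pd.DataFrame) -> pd.DataFrame:
    """Per-opening game count plus white-win, black-win, and draw rates."""
    counts = pd.crosstab(games["ECO"], games["Result"])
    # Keep only decisive and drawn results; add a zero column if one never occurs
    counts = counts.reindex(columns=["1-0", "0-1", "1/2-1/2"], fill_value=0)
    total = counts.sum(axis=1)
    rates = counts.div(total, axis=0)
    rates.columns = ["white_win_rate", "black_win_rate", "draw_rate"]
    rates.insert(0, "count", total)
    return rates.sort_values("count", ascending=False)


# Usage on a toy frame shaped like the cleaned dataset
toy_games = pd.DataFrame({
    "ECO": ["A00", "A00", "A00", "A00", "B01", "B01"],
    "Result": ["1-0", "0-1", "0-1", "1/2-1/2", "1-0", "1-0"],
})
toy_rates = outcome_rates(toy_games)
```

Note that games with any other `Result` value (for example unfinished games) are excluded from both the counts and the rates in this sketch.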


We can now see the top 10 most used openings. 

In [None]:
df_opening_usage.head(10)

Unnamed: 0,opening_name,opening_eco,count,white_wins,win_percentage_as_white,black_wins,win_percentage_as_black,draw,draw_percentage
0,Sicilian Defense: Bowdler Attack,B20,256,103.0,0.402344,141.0,0.550781,12.0,0.046875
1,French Defense: Knight Variation,C00,218,108.0,0.495413,96.0,0.440367,14.0,0.06422
2,Van't Kruijs Opening,A00,210,68.0,0.32381,130.0,0.619048,12.0,0.057143
3,Queen's Pawn Game: Mason Attack,D00,196,95.0,0.484694,89.0,0.454082,12.0,0.061224
4,Scandinavian Defense: Mieses-Kotroc Variation,B01,192,112.0,0.583333,76.0,0.395833,4.0,0.020833
5,Horwitz Defense,A40,161,87.0,0.540373,71.0,0.440994,3.0,0.018634
6,Italian Game: Anti-Fried Liver Defense,C55,151,79.0,0.523179,65.0,0.430464,7.0,0.046358
7,Scandinavian Defense,B01,150,64.0,0.426667,77.0,0.513333,9.0,0.06
8,Philidor Defense #3,C41,148,94.0,0.635135,51.0,0.344595,3.0,0.02027
9,Philidor Defense #2,C41,146,67.0,0.458904,71.0,0.486301,8.0,0.054795


Before we break the data down further, let's determine how many games are `bullet`, `blitz`, `rapid`, and `classical`. 

In [None]:
time_control_games = df['time_control'].value_counts()
time_control_games

time_control
bullet       9533
blitz        3842
rapid        1420
classical     221
Name: count, dtype: int64
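
The breakdown by rating level and time control, which is the stated goal of this notebook, might be sketched as follows on the cleaned dataset's `WhiteLevel`, `TimeControl`, and `ECO` columns. The function name `top_openings` and its parameter `n` are hypothetical. Since the cleaned dataset keeps only games where both players share a level, `WhiteLevel` stands in for the level of the game.

```python
import pandas as pd


def top_openings(games: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Return the n most played ECO codes for each (level, time control) pair."""
    counts = (
        games.groupby(["WhiteLevel", "TimeControl", "ECO"])
        .size()
        .reset_index(name="count")
    )
    # Order each (level, time control) group from most to least played
    counts = counts.sort_values(
        ["WhiteLevel", "TimeControl", "count"], ascending=[True, True, False]
    )
    # head(n) on the grouped frame keeps the first n rows of every group
    top = counts.groupby(["WhiteLevel", "TimeControl"]).head(n)
    return top.reset_index(drop=True)


# Usage on a toy frame shaped like the cleaned dataset
toy_games = pd.DataFrame({
    "WhiteLevel": ["beginner"] * 6 + ["intermediate"] * 3,
    "TimeControl": ["blitz"] * 6 + ["rapid"] * 3,
    "ECO": ["A00", "A00", "A00", "B01", "B01", "C00", "C00", "C00", "A00"],
})
toy_top = top_openings(toy_games, n=2)
```

On the real data, the call would be something like `top_openings(df_preprocessed, n=10)`, giving one short ranked list per level and time control.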

## **References**
* https://lichess.org/faq#time-controls
* https://lichess.org/forum/general-chess-discussion/what-is-a-good-lichess-rating
* https://www.kaggle.com/datasets/datasnaek/chess