# **Popular Chess Openings based on Rating and Time Control**




This Python notebook identifies the most popular chess openings by player rating and time control, so that players can see which openings are played most often at their level and in their preferred time format. By analyzing a large dataset of games from the Free Internet Chess Server (FICS), the project looks for patterns in opening choices among players of different skill levels (e.g., beginner, intermediate, advanced) and across different time controls (e.g., classical, rapid, blitz). Knowing which openings they are most likely to face lets players prepare for them and add them to their own repertoire.

## **Importing libraries**

In [70]:
import pandas as pd
import numpy as np
import chess.pgn



## **Parsing the PGN file**

The games in the PGN file were extracted from the FICS (Free Internet Chess Server) games database, which can be found [here](https://www.ficsgames.org/download.html). The extracted games cover the entire year of 2023 and include every time control and every player rating.

In [71]:
# # Initialize an empty dictionary to store the game data
# game_data = {
#     "Result": [],
#     "WhiteElo": [], 
#     "BlackElo": [],
#     "TimeControl": [], 
#     "ECO": []
# }

# num_games = 1

# with open("./dataset/2023_fics_games.pgn") as pgn_file:
#     # Parse the first game in the file
#     game = chess.pgn.read_game(pgn_file)
#     print(game.headers)

#     # Iterate through all games in the file
#     while game is not None:
#         print(f"Processing game {num_games}", end="\r")
#         # Append game headers (metadata) to the game_data dictionary
#         for key in game_data.keys():
#             if key in game.headers:
#                 game_data[key].append(game.headers[key])
#             else:
#                 game_data[key].append(None)    
#         # Read the next game in the file
#         game = chess.pgn.read_game(pgn_file)
#         num_games += 1
#         if num_games > 1000000:
#             break
# # num_games starts at 1 and is incremented after every game, so subtract 1
# print(f"Done parsing the games. Total games: {num_games - 1}")

# # Convert the game_data dictionary into a DataFrame
# df = pd.DataFrame(game_data)
# df.head()
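
The header-collection loop above can be tried on a tiny in-memory sample before committing to the full file. The sketch below is not the notebook's parser: it swaps `chess.pgn.read_game` for a plain regular expression over PGN tag-pair lines, and the two sample games and the helper names `iter_headers` and `collect_headers` are made up for illustration. It keeps the same convention of storing `None` when a game lacks one of the wanted tags.

```python
import re

# Matches a PGN tag pair such as: [ECO "D58"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')
WANTED = ("Result", "WhiteElo", "BlackElo", "TimeControl", "ECO")


def iter_headers(lines):
    """Yield one dict of tag pairs per game from an iterable of PGN lines."""
    headers, in_movetext = {}, False
    for line in lines:
        text = line.strip()
        match = TAG_RE.match(text)
        if match:
            if in_movetext:
                # A tag pair after movetext marks the start of a new game
                yield headers
                headers, in_movetext = {}, False
            headers[match.group(1)] = match.group(2)
        elif text and headers:
            in_movetext = True
    if headers:
        yield headers


def collect_headers(lines, wanted=WANTED):
    """Build a dict of lists, using None for tags a game does not have."""
    data = {key: [] for key in wanted}
    for headers in iter_headers(lines):
        for key in wanted:
            data[key].append(headers.get(key))
    return data


# Made-up sample: the second game deliberately has no ECO tag
sample = """[Event "sample game 1"]
[WhiteElo "1491"]
[BlackElo "1554"]
[TimeControl "180+0"]
[ECO "D58"]
[Result "1-0"]

1. d4 d5 2. c4 e6 1-0

[Event "sample game 2"]
[WhiteElo "1520"]
[BlackElo "1458"]
[TimeControl "900+0"]
[Result "0-1"]

1. e4 Nf6 0-1
"""

game_data = collect_headers(sample.splitlines())
print(game_data["ECO"])
```

The second game has no `ECO` tag, so its entry is `None`, mirroring the `else` branch of the original loop. For real files, python-chess also provides `chess.pgn.read_headers`, which reads only the headers and may be worth trying when move data is not needed.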

Before we proceed, let's save this DataFrame to a `.csv` file.

In [72]:
# df.to_csv("./dataset/2023_fics_games.csv", index=False)

## **Preprocessing the dataset**

The dataset used for this project comprises approximately 1,000,000 chess games that were played on FICS and archived at ficsgames.org. Each game record includes details such as the result, time control, player ratings, and opening ECO code.

In [73]:
df = pd.read_csv("./dataset/2023_fics_games.csv") 
df.head()

Unnamed: 0,Result,WhiteElo,BlackElo,TimeControl,ECO
0,1-0,1491,1554,blitz,D58
1,0-1,1520,1458,rapid,B02
2,1-0,1613,1553,rapid,C00
3,1-0,1338,1411,blitz,A40
4,0-1,1504,1608,blitz,D02


Let's see the shape of the dataset. 

In [74]:
df.shape

(999990, 5)

Let's also see the columns and their data types.

In [75]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999990 entries, 0 to 999989
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Result       999990 non-null  object
 1   WhiteElo     999990 non-null  int64 
 2   BlackElo     999990 non-null  int64 
 3   TimeControl  999990 non-null  object
 4   ECO          999990 non-null  object
dtypes: int64(2), object(3)
memory usage: 38.1+ MB


The output above shows no null values in any column, including `ECO`. As a safeguard, let's still drop any rows with a missing `ECO` value; the row count should stay at 999,990.

In [76]:
df.dropna(subset=['ECO'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999990 entries, 0 to 999989
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Result       999990 non-null  object
 1   WhiteElo     999990 non-null  int64 
 2   BlackElo     999990 non-null  int64 
 3   TimeControl  999990 non-null  object
 4   ECO          999990 non-null  object
dtypes: int64(2), object(3)
memory usage: 38.1+ MB


Note that I have grouped the original values in `TimeControl` (e.g., 180+0, 120+12) into `classical`, `blitz`, `rapid`, and `bullet` for better readability. The grouping follows the one used by lichess.org, which can be seen [here](https://lichess.org/faq#time-controls).

The estimated game duration is computed using this formula:

`estimated game duration = (clock initial time) + 40 × (clock increment)`

The resulting duration is grouped as follows:
* ≤ 179s = `bullet`
* ≤ 479s = `blitz`
* < 1500s = `rapid`
* ≥ 1500s = `classical`
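
The grouping step itself is not shown in this notebook, so here is a minimal sketch of how it could look. It assumes raw values of the form `initial+increment` in seconds and the inclusive Lichess boundaries (at most 179 s for bullet, at most 479 s for blitz, under 1500 s for rapid); the function names `estimated_duration` and `classify_time_control` are hypothetical.

```python
def estimated_duration(time_control: str) -> int:
    """Return initial time + 40 * increment, both in seconds."""
    initial, increment = (int(part) for part in time_control.split("+"))
    return initial + 40 * increment


def classify_time_control(time_control: str) -> str:
    """Map a raw 'initial+increment' string to a time-control group."""
    seconds = estimated_duration(time_control)
    if seconds <= 179:
        return "bullet"
    if seconds <= 479:
        return "blitz"
    if seconds < 1500:
        return "rapid"
    return "classical"


for raw in ("60+0", "180+0", "120+12", "1800+0"):
    print(raw, estimated_duration(raw), classify_time_control(raw))
```

For instance, `120+12` has an estimated duration of 120 + 40 × 12 = 600 seconds, which places it in `rapid` even though its initial time is shorter than that of `180+0`.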

Before we proceed, let's see the number of games played using a specific time control. 

In [77]:
df['TimeControl'].value_counts()

TimeControl
blitz        759410
rapid        209144
bullet        24106
classical      7330
Name: count, dtype: int64

Now, let's group the ratings based on the criteria found in this [forum](https://lichess.org/forum/general-chess-discussion/what-is-a-good-lichess-rating) thread on Lichess:
* < 1300 = `beginner`
* < 2000 = `intermediate`
* < 2400 = `advanced`
* < 2700 = `master`
* ≥ 2700 = `grandmaster`

Note that the code below does not create a separate `master` group: ratings from 2000 up to 2699 are all labeled `advanced`.

In [78]:
white_rating_conditions = [
    (df['WhiteElo'] < 1300), 
    (df['WhiteElo'] >= 1300) & (df['WhiteElo'] < 2000),
    (df['WhiteElo'] >= 2000) & (df['WhiteElo'] < 2700),
    (df['WhiteElo'] >= 2700)
]

black_rating_conditions = [
    (df['BlackElo'] < 1300), 
    (df['BlackElo'] >= 1300) & (df['BlackElo'] < 2000),
    (df['BlackElo'] >= 2000) & (df['BlackElo'] < 2700),
    (df['BlackElo'] >= 2700)
]

rating_choices = ['beginner', 'intermediate', 'advanced', 'grandmaster']

df['WhiteLevel'] = np.select(white_rating_conditions, rating_choices, default='unknown')
df['BlackLevel'] = np.select(black_rating_conditions, rating_choices, default='unknown')
df.head()

Unnamed: 0,Result,WhiteElo,BlackElo,TimeControl,ECO,WhiteLevel,BlackLevel
0,1-0,1491,1554,blitz,D58,intermediate,intermediate
1,0-1,1520,1458,rapid,B02,intermediate,intermediate
2,1-0,1613,1553,rapid,C00,intermediate,intermediate
3,1-0,1338,1411,blitz,A40,intermediate,intermediate
4,0-1,1504,1608,blitz,D02,intermediate,intermediate
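
Since the white and black condition lists differ only in the column name, the same binning can be written once. This is a minimal sketch using `pd.cut` with left-closed bins on a small made-up frame; the helper name `rating_level` is hypothetical, and the cut-offs are the four used in the cell above.

```python
import pandas as pd

BINS = [float("-inf"), 1300, 2000, 2700, float("inf")]
LABELS = ["beginner", "intermediate", "advanced", "grandmaster"]


def rating_level(elo: pd.Series) -> pd.Series:
    """Label each rating with its level."""
    # right=False makes every bin [low, high), so 1300 is 'intermediate'
    return pd.cut(elo, bins=BINS, labels=LABELS, right=False).astype(str)


# Made-up ratings chosen to sit on and around the boundaries
demo = pd.DataFrame({
    "WhiteElo": [1299, 1300, 1999, 2000, 2700],
    "BlackElo": [1554, 1250, 2400, 2699, 2850],
})
demo["WhiteLevel"] = rating_level(demo["WhiteElo"])
demo["BlackLevel"] = rating_level(demo["BlackElo"])
print(demo)
```

Compared with `np.select`, this keeps the boundaries in one place, so changing a cut-off cannot leave the white and black rules out of sync.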


Now that we have the `WhiteLevel` and `BlackLevel` columns, we can remove the `WhiteElo` and `BlackElo` columns.

In [79]:
df = df.drop(["WhiteElo", "BlackElo"], axis=1)
df.head()

Unnamed: 0,Result,TimeControl,ECO,WhiteLevel,BlackLevel
0,1-0,blitz,D58,intermediate,intermediate
1,0-1,rapid,B02,intermediate,intermediate
2,1-0,rapid,C00,intermediate,intermediate
3,1-0,blitz,A40,intermediate,intermediate
4,0-1,blitz,D02,intermediate,intermediate


Let's keep only the games in which White and Black belong to the same rating group.

In [80]:
df_same_rating = df.loc[df['WhiteLevel'] == df['BlackLevel']]
df_same_rating.shape

(841570, 5)

Let's see the number of games played depending on the level of the players. 

In [81]:
df_games_played_per_level = df_same_rating['WhiteLevel'].value_counts().reset_index()
df_games_played_per_level.columns = ['Level', 'Count']
df_games_played_per_level

Unnamed: 0,Level,Count
0,intermediate,800189
1,beginner,28772
2,advanced,11368
3,grandmaster,1241


`grandmaster` has too few games played. Let's combine it with `advanced`.

In [82]:
df_same_rating['WhiteLevel'] = df_same_rating['WhiteLevel'].where(df_same_rating['WhiteLevel'] != "grandmaster", "advanced")
df_same_rating['BlackLevel'] = df_same_rating['BlackLevel'].where(df_same_rating['BlackLevel'] != "grandmaster", "advanced")
df_games_played_per_level = df_same_rating['WhiteLevel'].value_counts().reset_index()
df_games_played_per_level.columns = ['Level', 'Count']
df_games_played_per_level

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_same_rating['WhiteLevel'] = df_same_rating['WhiteLevel'].where(df_same_rating['WhiteLevel'] != "grandmaster", "advanced")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_same_rating['BlackLevel'] = df_same_rating['BlackLevel'].where(df_same_rating['BlackLevel'] != "grandmaster", "advanced")


Unnamed: 0,Level,Count
0,intermediate,800189
1,beginner,28772
2,advanced,12609


Let's also see the number of games played based on the opening. 

In [83]:
df_games_played_per_opening = df_same_rating['ECO'].value_counts().reset_index()
df_games_played_per_opening.columns = ['ECO', 'Count']
df_games_played_per_opening.sort_values(by='ECO', inplace=True)
df_games_played_per_opening

Unnamed: 0,ECO,Count
0,A00,57622
23,A01,8272
35,A02,6334
44,A03,4681
14,A04,11205
...,...,...
387,E95,24
487,E96,1
234,E97,159
374,E98,30


The `Result` column is encoded as `1-0` when white wins, `0-1` when black wins, and `1/2-1/2` when the game is a draw. Let's change that to indicate who won or if it's a draw to make it more readable. 

In [84]:
win_conditions = [
    (df_same_rating['Result'] == '1-0'),
    (df_same_rating['Result'] == '0-1'),
    (df_same_rating['Result'] == '1/2-1/2')
]

win_choices = ['white', 'black', 'draw']

df_same_rating['Result'] = np.select(win_conditions, win_choices, default='unknown')
df_same_rating.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_same_rating['Result'] = np.select(win_conditions, win_choices, default='unknown')


Unnamed: 0,Result,TimeControl,ECO,WhiteLevel,BlackLevel
0,white,blitz,D58,intermediate,intermediate
1,black,rapid,B02,intermediate,intermediate
2,white,rapid,C00,intermediate,intermediate
3,white,blitz,A40,intermediate,intermediate
4,black,blitz,D02,intermediate,intermediate


Finally, let's reorder the columns and reset the row index for better readability.

In [85]:
df_preprocessed = df_same_rating.reindex(columns=['Result', 'WhiteLevel', 'BlackLevel', 'TimeControl', 'ECO'])
df_preprocessed.reset_index(drop=True, inplace=True)
df_preprocessed.head()

Unnamed: 0,Result,WhiteLevel,BlackLevel,TimeControl,ECO
0,white,intermediate,intermediate,blitz,D58
1,black,intermediate,intermediate,rapid,B02
2,white,intermediate,intermediate,rapid,C00
3,white,intermediate,intermediate,blitz,A40
4,black,intermediate,intermediate,blitz,D02


The data preprocessing is now done! Let's save it into a new `.csv` file. 

In [86]:
df_preprocessed.to_csv("./dataset/cleaned_dataset.csv", index=False)

## **Data Exploration**

Now that the dataset is cleaned, we can explore the data and see if we can extract useful information from it. First, let's gather the numbers.

In [87]:
df_games_played_per_opening = df_preprocessed['ECO'].value_counts().reset_index()
df_games_played_per_opening.columns = ['ECO', 'Count']
df_games_played_per_opening.sort_values(by='ECO', inplace=True)
df_games_played_per_opening

Unnamed: 0,ECO,Count
0,A00,57622
23,A01,8272
35,A02,6334
44,A03,4681
14,A04,11205
...,...,...
387,E95,24
487,E96,1
234,E97,159
374,E98,30


In [88]:
df_games_played_per_level = df_preprocessed['WhiteLevel'].value_counts().reset_index()
df_games_played_per_level.columns = ['Level', 'Count']
df_games_played_per_level

Unnamed: 0,Level,Count
0,intermediate,800189
1,beginner,28772
2,advanced,12609


In [89]:
df_games_played_per_time_control = df_preprocessed['TimeControl'].value_counts().reset_index()
df_games_played_per_time_control.columns = ['TimeControl', 'Count']
df_games_played_per_time_control

Unnamed: 0,TimeControl,Count
0,blitz,674929
1,rapid,144421
2,bullet,16295
3,classical,5925


### **Time Control and Player Level**

Let's now see how many `blitz`, `rapid`, `bullet`, and `classical` games are played by `beginner`, `intermediate`, and `advanced` players. 

#### **Time Control: Blitz**

In [90]:
df_blitz_beginner = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'blitz') & (df_preprocessed['WhiteLevel'] == 'beginner')]
df_blitz_beginner.shape

(10998, 5)

In [91]:
df_blitz_intermediate = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'blitz') & (df_preprocessed['WhiteLevel'] == 'intermediate')]
df_blitz_intermediate.shape

(656859, 5)

In [92]:
df_blitz_advanced = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'blitz') & (df_preprocessed['WhiteLevel'] == 'advanced')]
df_blitz_advanced.shape

(7072, 5)

#### **Time Control: Rapid**

In [93]:
df_rapid_beginner = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'rapid') & (df_preprocessed['WhiteLevel'] == 'beginner')]
df_rapid_beginner.shape

(17694, 5)

In [94]:
df_rapid_intermediate = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'rapid') & (df_preprocessed['WhiteLevel'] == 'intermediate')]
df_rapid_intermediate.shape

(123458, 5)

In [95]:
df_rapid_advanced = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'rapid') & (df_preprocessed['WhiteLevel'] == 'advanced')]
df_rapid_advanced.shape

(3269, 5)

#### **Time Control: Bullet**

In [96]:
df_bullet_beginner = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'bullet') & (df_preprocessed['WhiteLevel'] == 'beginner')]
df_bullet_beginner.shape

(42, 5)

In [97]:
df_bullet_intermediate = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'bullet') & (df_preprocessed['WhiteLevel'] == 'intermediate')]
df_bullet_intermediate.shape

(15013, 5)

In [98]:
df_bullet_advanced = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'bullet') & (df_preprocessed['WhiteLevel'] == 'advanced')]
df_bullet_advanced.shape

(1240, 5)

#### **Time Control: Classical**

In [99]:
df_classical_beginner = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'classical') & (df_preprocessed['WhiteLevel'] == 'beginner')]
df_classical_beginner.shape

(38, 5)

In [100]:
df_classical_intermediate = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'classical') & (df_preprocessed['WhiteLevel'] == 'intermediate')]
df_classical_intermediate.shape

(4859, 5)

In [101]:
df_classical_advanced = df_preprocessed.loc[(df_preprocessed['TimeControl'] == 'classical') & (df_preprocessed['WhiteLevel'] == 'advanced')]
df_classical_advanced.shape

(1028, 5)
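
The twelve filters above can also be summarized in a single table. This is a minimal sketch with `pd.crosstab` on a small made-up frame; the column names match `df_preprocessed`, but the rows are illustrative only.

```python
import pandas as pd

# Made-up rows using the same column names as df_preprocessed
demo = pd.DataFrame({
    "TimeControl": ["blitz", "blitz", "rapid", "bullet", "blitz", "classical"],
    "WhiteLevel": ["beginner", "intermediate", "intermediate",
                   "advanced", "intermediate", "advanced"],
})

# Rows are time controls, columns are player levels, cells are game counts
counts = pd.crosstab(demo["TimeControl"], demo["WhiteLevel"])
print(counts)

# A single cell can be read with .loc[row_label, column_label]
blitz_intermediate = int(counts.loc["blitz", "intermediate"])
```

Applied to `df_preprocessed`, the same call would give every time-control and level combination at once, which makes sparse cells such as `bullet`/`beginner` easy to spot.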

## **References**
* https://lichess.org/faq#time-controls
* https://lichess.org/forum/general-chess-discussion/what-is-a-good-lichess-rating
* https://www.kaggle.com/datasets/datasnaek/chess