# **Popular Chess Openings based on Rating and Time Control**




This Python notebook identifies the most popular chess openings by player rating and time control, so that players can see which openings are played most often at their own level and in their preferred time format. Using a dataset of chess games from Lichess.org, it looks for patterns in opening choices among players of different skill levels (e.g., beginner, intermediate, advanced) and across different time controls (e.g., bullet, blitz, rapid, classical). Knowing which openings they are most likely to face lets players prepare for them and add them to their own repertoire.

## **Importing libraries**

In [1]:
import pandas as pd
import numpy as np



## **Preprocessing the dataset**

The dataset used by this notebook was compiled by Mitchell J and is available on [Kaggle](https://www.kaggle.com/datasets/datasnaek/chess). It comprises just over 20,000 chess games from Lichess.org. Each game record includes the game ID, rated status, start and end times, number of turns, game status, winner, time increment, player IDs and ratings, all moves in standard notation, opening ECO code, opening name, and opening ply count.

Let's load the dataset into a dataframe and see its first few rows. 

In [2]:
df = pd.read_csv("./dataset/games.csv")
df.head()

|   | id | rated | created_at | last_move_at | turns | victory_status | winner | increment_code | white_id | white_rating | black_id | black_rating | moves | opening_eco | opening_name | opening_ply |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TZJHLljE | False | 1504210000000.0 | 1504210000000.0 | 13 | outoftime | white | 15+2 | bourgris | 1500 | a-00 | 1191 | d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5... | D10 | Slav Defense: Exchange Variation | 5 |
| 1 | l1NXvwaE | True | 1504130000000.0 | 1504130000000.0 | 16 | resign | black | 5+10 | a-00 | 1322 | skinnerua | 1261 | d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6... | B00 | Nimzowitsch Defense: Kennedy Variation | 4 |
| 2 | mIICvQHh | True | 1504130000000.0 | 1504130000000.0 | 61 | mate | white | 5+10 | ischia | 1496 | a-00 | 1500 | e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc... | C20 | King's Pawn Game: Leonardis Variation | 3 |
| 3 | kWKvrqYL | True | 1504110000000.0 | 1504110000000.0 | 61 | mate | white | 20+0 | daniamurashov | 1439 | adivanov2009 | 1454 | d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O... | D02 | Queen's Pawn Game: Zukertort Variation | 3 |
| 4 | 9tXo1AUZ | True | 1504030000000.0 | 1504030000000.0 | 95 | mate | white | 30+3 | nik221107 | 1523 | adivanov2009 | 1469 | e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N... | C41 | Philidor Defense | 5 |


Let's also look at the columns and their data types.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20058 entries, 0 to 20057
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              20058 non-null  object 
 1   rated           20058 non-null  bool   
 2   created_at      20058 non-null  float64
 3   last_move_at    20058 non-null  float64
 4   turns           20058 non-null  int64  
 5   victory_status  20058 non-null  object 
 6   winner          20058 non-null  object 
 7   increment_code  20058 non-null  object 
 8   white_id        20058 non-null  object 
 9   white_rating    20058 non-null  int64  
 10  black_id        20058 non-null  object 
 11  black_rating    20058 non-null  int64  
 12  moves           20058 non-null  object 
 13  opening_eco     20058 non-null  object 
 14  opening_name    20058 non-null  object 
 15  opening_ply     20058 non-null  int64  
dtypes: bool(1), float64(2), int64(4), object(9)
memory usage: 2.3+ MB


Let's drop the unnecessary columns and retain only `winner`, `increment_code`, `white_rating`, `black_rating`, `opening_eco`, and `opening_name`. We should have exactly 6 columns after this.

In [4]:
df = df.drop(["id", "rated", "created_at", "last_move_at", "turns", "victory_status", "moves", "white_id", "black_id", "opening_ply"], axis=1)
df.head()

|   | winner | increment_code | white_rating | black_rating | opening_eco | opening_name |
|---|---|---|---|---|---|---|
| 0 | white | 15+2 | 1500 | 1191 | D10 | Slav Defense: Exchange Variation |
| 1 | black | 5+10 | 1322 | 1261 | B00 | Nimzowitsch Defense: Kennedy Variation |
| 2 | white | 5+10 | 1496 | 1500 | C20 | King's Pawn Game: Leonardis Variation |
| 3 | white | 20+0 | 1439 | 1454 | D02 | Queen's Pawn Game: Zukertort Variation |
| 4 | white | 30+3 | 1523 | 1469 | C41 | Philidor Defense |
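
As an alternative to loading every column and then dropping most of them, the needed columns can be selected at read time. The sketch below uses the `usecols` parameter of `pd.read_csv` on a small in-memory sample. The sample rows and the names `sample_csv`, `keep_cols`, and `sample_df` are illustrative assumptions, not part of the notebook.

```python
import io

import pandas as pd

# Tiny stand-in for games.csv (hypothetical subset of its columns).
sample_csv = io.StringIO(
    "id,rated,winner,increment_code,white_rating,black_rating,opening_eco,opening_name\n"
    "g1,True,white,15+2,1500,1191,D10,Slav Defense: Exchange Variation\n"
    "g2,True,black,5+10,1322,1261,B00,Nimzowitsch Defense: Kennedy Variation\n"
)

# Columns the rest of the notebook actually uses.
keep_cols = ["winner", "increment_code", "white_rating",
             "black_rating", "opening_eco", "opening_name"]

# usecols tells pandas to parse only the listed columns.
sample_df = pd.read_csv(sample_csv, usecols=keep_cols)
print(sample_df.columns.tolist())
```

With the real file, the same call would take the path `"./dataset/games.csv"` in place of `sample_csv`.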


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20058 entries, 0 to 20057
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   winner          20058 non-null  object
 1   increment_code  20058 non-null  object
 2   white_rating    20058 non-null  int64 
 3   black_rating    20058 non-null  int64 
 4   opening_eco     20058 non-null  object
 5   opening_name    20058 non-null  object
dtypes: int64(2), object(4)
memory usage: 940.3+ KB


Now, let's see the shape of the dataset. 

In [6]:
df.shape

(20058, 6)

Let's group the values in `increment_code` into `bullet`, `blitz`, `rapid`, and `classical` for better readability. The grouping follows the categories used by Lichess, which are described [here](https://lichess.org/faq#time-controls).

First, the estimated game duration will be computed using this formula: 

`estimated game duration = (clock initial time) + 40 × (clock increment)`

In [7]:
df['initial_time'] = df['increment_code'].str.split('+').str[0]
df['increment'] = df['increment_code'].str.split('+').str[1]
df['estimated_time'] = (df['initial_time'].astype(int) + df['increment'].astype(int) * 40)
df.head()

|   | winner | increment_code | white_rating | black_rating | opening_eco | opening_name | initial_time | increment | estimated_time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | white | 15+2 | 1500 | 1191 | D10 | Slav Defense: Exchange Variation | 15 | 2 | 95 |
| 1 | black | 5+10 | 1322 | 1261 | B00 | Nimzowitsch Defense: Kennedy Variation | 5 | 10 | 405 |
| 2 | white | 5+10 | 1496 | 1500 | C20 | King's Pawn Game: Leonardis Variation | 5 | 10 | 405 |
| 3 | white | 20+0 | 1439 | 1454 | D02 | Queen's Pawn Game: Zukertort Variation | 20 | 0 | 20 |
| 4 | white | 30+3 | 1523 | 1469 | C41 | Philidor Defense | 30 | 3 | 150 |
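
The cell above adds the two parts of `increment_code` as they are. Lichess time controls are normally written as initial time in minutes plus increment in seconds. If that convention holds for this dataset (an assumption worth checking against the data), the initial time would need to be converted to seconds before it is compared with thresholds expressed in seconds. A minimal sketch of that variant, using the codes from the rows shown above; the names `codes`, `parts`, and `estimated_seconds` are illustrative:

```python
import pandas as pd

# Increment codes taken from the first rows shown above.
codes = pd.Series(["15+2", "5+10", "20+0", "30+3"], name="increment_code")

# Split each code once into two integer columns.
parts = codes.str.split("+", expand=True).astype(int)
parts.columns = ["initial_minutes", "increment_seconds"]

# Assumption: the first number is in minutes, so convert it to seconds
# before adding 40 moves' worth of increment.
estimated_seconds = parts["initial_minutes"] * 60 + parts["increment_seconds"] * 40
print(estimated_seconds.tolist())  # [980, 700, 1200, 1920]
```

Under this assumption a `15+2` game is estimated at 980 seconds rather than 95, which changes the category it falls into.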


This is how the resulting duration will be grouped:
* < 179 s = `bullet`
* 179 s to < 479 s = `blitz`
* 479 s to < 1500 s = `rapid`
* ≥ 1500 s = `classical`

In [8]:
time_control_conditions = [
    (df['estimated_time'] < 179), 
    (df['estimated_time'] >= 179) & (df['estimated_time'] < 479),
    (df['estimated_time'] >= 479) & (df['estimated_time'] < 1500),
    (df['estimated_time'] >= 1500)
]

time_control_choices = ['bullet', 'blitz', 'rapid', 'classical']

df['time_control'] = np.select(time_control_conditions, time_control_choices, default='unknown')
df.head()

|   | winner | increment_code | white_rating | black_rating | opening_eco | opening_name | initial_time | increment | estimated_time | time_control |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | white | 15+2 | 1500 | 1191 | D10 | Slav Defense: Exchange Variation | 15 | 2 | 95 | bullet |
| 1 | black | 5+10 | 1322 | 1261 | B00 | Nimzowitsch Defense: Kennedy Variation | 5 | 10 | 405 | blitz |
| 2 | white | 5+10 | 1496 | 1500 | C20 | King's Pawn Game: Leonardis Variation | 5 | 10 | 405 | blitz |
| 3 | white | 20+0 | 1439 | 1454 | D02 | Queen's Pawn Game: Zukertort Variation | 20 | 0 | 20 | bullet |
| 4 | white | 30+3 | 1523 | 1469 | C41 | Philidor Defense | 30 | 3 | 150 | bullet |
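
The same binning can also be written with `pd.cut`, which keeps the thresholds and labels in two short lists. This is only a sketch with made-up durations; `durations`, `bins`, and `labels` are illustrative names. Passing `right=False` makes each bin include its lower edge and exclude its upper edge, matching the conditions in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical estimated durations, chosen to sit on and around the thresholds.
durations = pd.Series([20, 95, 178, 179, 405, 479, 1499, 1500])

# Left-closed bins: [0, 179) bullet, [179, 479) blitz,
# [479, 1500) rapid, [1500, inf) classical.
bins = [0, 179, 479, 1500, np.inf]
labels = ["bullet", "blitz", "rapid", "classical"]

time_control = pd.cut(durations, bins=bins, labels=labels, right=False)
print(time_control.tolist())
```

Unlike `np.select` with `default='unknown'`, `pd.cut` returns a categorical column and leaves values outside all bins as missing.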


Now that we have a `time_control` column, the `increment_code`, `initial_time`, `increment`, and `estimated_time` columns can be dropped.

In [9]:
df = df.drop(["increment_code", "initial_time", "increment", "estimated_time"], axis=1)
df.head()

|   | winner | white_rating | black_rating | opening_eco | opening_name | time_control |
|---|---|---|---|---|---|---|
| 0 | white | 1500 | 1191 | D10 | Slav Defense: Exchange Variation | bullet |
| 1 | black | 1322 | 1261 | B00 | Nimzowitsch Defense: Kennedy Variation | blitz |
| 2 | white | 1496 | 1500 | C20 | King's Pawn Game: Leonardis Variation | blitz |
| 3 | white | 1439 | 1454 | D02 | Queen's Pawn Game: Zukertort Variation | bullet |
| 4 | white | 1523 | 1469 | C41 | Philidor Defense | bullet |


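Now, let's group the ratings based on the criteria found in this [forum thread](https://lichess.org/forum/general-chess-discussion/what-is-a-good-lichess-rating) on Lichess:
* < 1300 = `beginner`
* 1300 to < 2000 = `intermediate`
* 2000 to < 2400 = `advanced`
* 2400 to < 2700 = `master`
* ≥ 2700 = `grandmaster`

The cell below uses only four labels: every rating from 2000 up to (but not including) 2700 is labeled `advanced`, so there is no separate `master` group.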

In [10]:
white_rating_conditions = [
    (df['white_rating'] < 1300), 
    (df['white_rating'] >= 1300) & (df['white_rating'] < 2000),
    (df['white_rating'] >= 2000) & (df['white_rating'] < 2700),
    (df['white_rating'] >= 2700)
]

black_rating_conditions = [
    (df['black_rating'] < 1300), 
    (df['black_rating'] >= 1300) & (df['black_rating'] < 2000),
    (df['black_rating'] >= 2000) & (df['black_rating'] < 2700),
    (df['black_rating'] >= 2700)
]

rating_choices = ['beginner', 'intermediate', 'advanced', 'grandmaster']

df['white_rating_group'] = np.select(white_rating_conditions, rating_choices, default='unknown')
df['black_rating_group'] = np.select(black_rating_conditions, rating_choices, default='unknown')
df.head()

|   | winner | white_rating | black_rating | opening_eco | opening_name | time_control | white_rating_group | black_rating_group |
|---|---|---|---|---|---|---|---|---|
| 0 | white | 1500 | 1191 | D10 | Slav Defense: Exchange Variation | bullet | intermediate | beginner |
| 1 | black | 1322 | 1261 | B00 | Nimzowitsch Defense: Kennedy Variation | blitz | intermediate | beginner |
| 2 | white | 1496 | 1500 | C20 | King's Pawn Game: Leonardis Variation | blitz | intermediate | intermediate |
| 3 | white | 1439 | 1454 | D02 | Queen's Pawn Game: Zukertort Variation | bullet | intermediate | intermediate |
| 4 | white | 1523 | 1469 | C41 | Philidor Defense | bullet | intermediate | intermediate |
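
The criteria list names five tiers, while the cell above assigns four labels. If a separate `master` tier is wanted, one possible version is sketched below with `pd.cut`. The sample ratings and the names `ratings`, `bins`, `labels`, and `rating_group` are illustrative, and this variant would change the group counts in the steps that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings placed on and around each threshold.
ratings = pd.Series([1191, 1300, 1999, 2000, 2399, 2400, 2699, 2700])

# Left-closed bins, one per tier in the criteria list.
bins = [0, 1300, 2000, 2400, 2700, np.inf]
labels = ["beginner", "intermediate", "advanced", "master", "grandmaster"]

rating_group = pd.cut(ratings, bins=bins, labels=labels, right=False)
print(rating_group.tolist())
```

Because the bins and labels are defined once, the same call can be applied to both `white_rating` and `black_rating` without duplicating the conditions.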


Now that we have the `white_rating_group` and `black_rating_group` columns, we can remove the `white_rating` and `black_rating` columns.

In [11]:
df = df.drop(["white_rating", "black_rating"], axis=1)
df.head()

|   | winner | opening_eco | opening_name | time_control | white_rating_group | black_rating_group |
|---|---|---|---|---|---|---|
| 0 | white | D10 | Slav Defense: Exchange Variation | bullet | intermediate | beginner |
| 1 | black | B00 | Nimzowitsch Defense: Kennedy Variation | blitz | intermediate | beginner |
| 2 | white | C20 | King's Pawn Game: Leonardis Variation | blitz | intermediate | intermediate |
| 3 | white | D02 | Queen's Pawn Game: Zukertort Variation | bullet | intermediate | intermediate |
| 4 | white | C41 | Philidor Defense | bullet | intermediate | intermediate |


Let's keep only the games in which White and Black belong to the same rating group.

In [12]:
df = df.loc[df['white_rating_group'] == df['black_rating_group']]
df.shape

(15016, 6)
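
This filter reduces the dataset from 20058 to 15016 games, so 5042 games (about a quarter) are discarded. The sketch below shows the same boolean-mask filter on a made-up four-game frame, keeping the mask in a variable so the number of kept and dropped rows can be reported. The names `games`, `same_group`, and `matched` are illustrative.

```python
import pandas as pd

# Hypothetical games with the rating group of each side.
games = pd.DataFrame({
    "white_rating_group": ["beginner", "intermediate", "intermediate", "advanced"],
    "black_rating_group": ["beginner", "beginner", "intermediate", "intermediate"],
})

# True where both players fall in the same rating group.
same_group = games["white_rating_group"] == games["black_rating_group"]

# Keep only the matched games and renumber the rows.
matched = games.loc[same_group].reset_index(drop=True)
print(f"kept {same_group.sum()} of {len(games)} games")
```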

Finally, let's reorder the columns and rows for better readability. 

In [13]:
df = df.reindex(columns=['winner', 'opening_eco', 'opening_name', 'time_control', 'black_rating_group', 'white_rating_group'])
df.sort_values('opening_name', inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

|   | winner | opening_eco | opening_name | time_control | black_rating_group | white_rating_group |
|---|---|---|---|---|---|---|
| 0 | white | B03 | Alekhine Defense | blitz | intermediate | intermediate |
| 1 | white | B03 | Alekhine Defense | bullet | intermediate | intermediate |
| 2 | black | B02 | Alekhine Defense | blitz | intermediate | intermediate |
| 3 | black | B03 | Alekhine Defense | bullet | intermediate | intermediate |
| 4 | black | B02 | Alekhine Defense | rapid | beginner | beginner |


The data preprocessing is now done! Let's save it into a new `.csv` file. 

In [14]:
df.to_csv("./dataset/cleaned_dataset.csv", index=False)

## **Data Exploration**

Now that the dataset is cleaned, we can explore it and see what useful information we can extract. First, let's see how often each opening is played overall.

In [15]:
df_opening_usage = df.groupby(['opening_name', 'opening_eco']).size().reset_index(name='count')
df_opening_usage.sort_values('count', inplace=True, ascending=False)
df_opening_usage.reset_index(drop=True, inplace=True)
df_opening_usage

|   | opening_name | opening_eco | count |
|---|---|---|---|
| 0 | Sicilian Defense: Bowdler Attack | B20 | 256 |
| 1 | French Defense: Knight Variation | C00 | 218 |
| 2 | Van't Kruijs Opening | A00 | 210 |
| 3 | Queen's Pawn Game: Mason Attack | D00 | 196 |
| 4 | Scandinavian Defense: Mieses-Kotroc Variation | B01 | 192 |
| ... | ... | ... | ... |
| 1387 | Polish Opening: Bugayev Advance Variation | A00 | 1 |
| 1388 | Polish Opening: Czech Defense | A00 | 1 |
| 1389 | Polish Opening: Dutch Defense | A00 | 1 |
| 1390 | Polish Opening: Grigorian Variation | A00 | 1 |
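
The group, count, and sort steps can also be expressed with `DataFrame.value_counts`, which counts unique combinations of the given columns and returns them in descending order. A minimal sketch on made-up rows; `games` and `usage` are illustrative names:

```python
import pandas as pd

# Hypothetical games: two Philidor Defense games and one Alekhine Defense game.
games = pd.DataFrame({
    "opening_name": ["Philidor Defense", "Alekhine Defense", "Philidor Defense"],
    "opening_eco": ["C41", "B03", "C41"],
})

# Count each (name, ECO) pair; the result is already sorted by frequency.
usage = (
    games.value_counts(["opening_name", "opening_eco"])
    .rename("count")   # name the counts column explicitly
    .reset_index()
)
print(usage)
```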


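Now, let's compute how often White wins with each opening. Despite its name, the `win_percentage_as_white` column holds a fraction between 0 and 1 rather than a percentage.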

In [16]:
df_white_wins = df[df['winner'] == 'white']
white_wins_per_opening = df_white_wins.groupby(['opening_name', 'opening_eco']).size().reset_index(name='white_wins')
df_opening_usage = df_opening_usage.merge(white_wins_per_opening, on=['opening_name', 'opening_eco'], how='left')
# Openings that White never won have no match in the merge; count them as 0.
df_opening_usage['white_wins'] = df_opening_usage['white_wins'].fillna(0)
df_opening_usage['win_percentage_as_white'] = df_opening_usage['white_wins'] / df_opening_usage['count']
df_opening_usage.head()

|   | opening_name | opening_eco | count | white_wins | win_percentage_as_white |
|---|---|---|---|---|---|
| 0 | Sicilian Defense: Bowdler Attack | B20 | 256 | 103.0 | 0.402344 |
| 1 | French Defense: Knight Variation | C00 | 218 | 108.0 | 0.495413 |
| 2 | Van't Kruijs Opening | A00 | 210 | 68.0 | 0.32381 |
| 3 | Queen's Pawn Game: Mason Attack | D00 | 196 | 95.0 | 0.484694 |
| 4 | Scandinavian Defense: Mieses-Kotroc Variation | B01 | 192 | 112.0 | 0.583333 |


Let's do the same for black. 

In [17]:
df_black_wins = df[df['winner'] == 'black']
black_wins_per_opening = df_black_wins.groupby(['opening_name', 'opening_eco']).size().reset_index(name='black_wins')
df_opening_usage = df_opening_usage.merge(black_wins_per_opening, on=['opening_name', 'opening_eco'], how='left')
# Openings that Black never won have no match in the merge; count them as 0.
df_opening_usage['black_wins'] = df_opening_usage['black_wins'].fillna(0)
df_opening_usage['win_percentage_as_black'] = df_opening_usage['black_wins'] / df_opening_usage['count']
df_opening_usage.head()

|   | opening_name | opening_eco | count | white_wins | win_percentage_as_white | black_wins | win_percentage_as_black |
|---|---|---|---|---|---|---|---|
| 0 | Sicilian Defense: Bowdler Attack | B20 | 256 | 103.0 | 0.402344 | 141.0 | 0.550781 |
| 1 | French Defense: Knight Variation | C00 | 218 | 108.0 | 0.495413 | 96.0 | 0.440367 |
| 2 | Van't Kruijs Opening | A00 | 210 | 68.0 | 0.32381 | 130.0 | 0.619048 |
| 3 | Queen's Pawn Game: Mason Attack | D00 | 196 | 95.0 | 0.484694 | 89.0 | 0.454082 |
| 4 | Scandinavian Defense: Mieses-Kotroc Variation | B01 | 192 | 112.0 | 0.583333 | 76.0 | 0.395833 |


Finally, let's count the draws for each opening and compute the corresponding draw rate.

In [21]:
df_draw = df[df['winner'] == 'draw']
draw_per_opening = df_draw.groupby(['opening_name', 'opening_eco']).size().reset_index(name='draw')
df_opening_usage = df_opening_usage.merge(draw_per_opening, on=['opening_name', 'opening_eco'], how='left')
# Openings that never ended in a draw have no match in the merge; count them as 0.
df_opening_usage['draw'] = df_opening_usage['draw'].fillna(0)
df_opening_usage['draw_percentage'] = df_opening_usage['draw'] / df_opening_usage['count']
df_opening_usage.head()

|   | opening_name | opening_eco | count | white_wins | win_percentage_as_white | black_wins | win_percentage_as_black | draw | draw_percentage |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Sicilian Defense: Bowdler Attack | B20 | 256 | 103.0 | 0.402344 | 141.0 | 0.550781 | 12.0 | 0.046875 |
| 1 | French Defense: Knight Variation | C00 | 218 | 108.0 | 0.495413 | 96.0 | 0.440367 | 14.0 | 0.06422 |
| 2 | Van't Kruijs Opening | A00 | 210 | 68.0 | 0.32381 | 130.0 | 0.619048 | 12.0 | 0.057143 |
| 3 | Queen's Pawn Game: Mason Attack | D00 | 196 | 95.0 | 0.484694 | 89.0 | 0.454082 | 12.0 | 0.061224 |
| 4 | Scandinavian Defense: Mieses-Kotroc Variation | B01 | 192 | 112.0 | 0.583333 | 76.0 | 0.395833 | 4.0 | 0.020833 |
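
The three filter-group-merge passes above can also be collapsed into a single cross-tabulation of opening against outcome. The sketch below uses `pd.crosstab` on made-up results; the outcomes and the names `games`, `counts`, and `rates` are illustrative assumptions, not figures from the dataset.

```python
import pandas as pd

# Hypothetical results: four games in one opening, two in another.
games = pd.DataFrame({
    "opening_name": ["Philidor Defense"] * 4 + ["Alekhine Defense"] * 2,
    "winner": ["white", "white", "black", "draw", "black", "black"],
})

# One row per opening, one column per outcome.
counts = pd.crosstab(games["opening_name"], games["winner"])

# Make sure all three outcome columns exist even if one never occurs.
outcomes = ["white", "black", "draw"]
counts = counts.reindex(columns=outcomes, fill_value=0)

# Total games per opening, then the share of each outcome.
counts["count"] = counts.sum(axis=1)
rates = counts[outcomes].div(counts["count"], axis=0)
print(rates)
```

Since missing outcomes are filled with integer zeros, the counts stay integers instead of turning into floats after a left merge.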


Let's see the top 10 most used openings. 

In [22]:
df_opening_usage.head(10)

|   | opening_name | opening_eco | count | white_wins | win_percentage_as_white | black_wins | win_percentage_as_black | draw | draw_percentage |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Sicilian Defense: Bowdler Attack | B20 | 256 | 103.0 | 0.402344 | 141.0 | 0.550781 | 12.0 | 0.046875 |
| 1 | French Defense: Knight Variation | C00 | 218 | 108.0 | 0.495413 | 96.0 | 0.440367 | 14.0 | 0.06422 |
| 2 | Van't Kruijs Opening | A00 | 210 | 68.0 | 0.32381 | 130.0 | 0.619048 | 12.0 | 0.057143 |
| 3 | Queen's Pawn Game: Mason Attack | D00 | 196 | 95.0 | 0.484694 | 89.0 | 0.454082 | 12.0 | 0.061224 |
| 4 | Scandinavian Defense: Mieses-Kotroc Variation | B01 | 192 | 112.0 | 0.583333 | 76.0 | 0.395833 | 4.0 | 0.020833 |
| 5 | Horwitz Defense | A40 | 161 | 87.0 | 0.540373 | 71.0 | 0.440994 | 3.0 | 0.018634 |
| 6 | Italian Game: Anti-Fried Liver Defense | C55 | 151 | 79.0 | 0.523179 | 65.0 | 0.430464 | 7.0 | 0.046358 |
| 7 | Scandinavian Defense | B01 | 150 | 64.0 | 0.426667 | 77.0 | 0.513333 | 9.0 | 0.06 |
| 8 | Philidor Defense #3 | C41 | 148 | 94.0 | 0.635135 | 51.0 | 0.344595 | 3.0 | 0.02027 |
| 9 | Philidor Defense #2 | C41 | 146 | 67.0 | 0.458904 | 71.0 | 0.486301 | 8.0 | 0.054795 |
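
The ranking above covers all games together. To break it down by rating group and time control, as the introduction sets out to do, the same counting can be done per group. The sketch below is one possible approach on made-up rows, not the notebook's own method. The names `games`, `group_cols`, `counts`, `top_n`, and `top_openings` are illustrative. Because only games between players of the same rating group were kept, `white_rating_group` alone identifies the group.

```python
import pandas as pd

# Hypothetical cleaned games (same columns as the cleaned dataset).
games = pd.DataFrame({
    "opening_name": ["Philidor Defense", "Philidor Defense", "Scandinavian Defense",
                     "Horwitz Defense", "Horwitz Defense", "Scandinavian Defense"],
    "time_control": ["blitz", "blitz", "blitz", "rapid", "rapid", "rapid"],
    "white_rating_group": ["beginner"] * 3 + ["intermediate"] * 3,
})

group_cols = ["white_rating_group", "time_control"]

# Count games per (rating group, time control, opening), then sort so the
# most played opening comes first within each group.
counts = (
    games.groupby(group_cols + ["opening_name"])
    .size()
    .reset_index(name="count")
    .sort_values(group_cols + ["count"], ascending=[True, True, False])
)

top_n = 1  # hypothetical cut-off; use e.g. 10 on the full dataset
top_openings = counts.groupby(group_cols).head(top_n).reset_index(drop=True)
print(top_openings)
```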


## **References**
* https://lichess.org/faq#time-controls
* https://lichess.org/forum/general-chess-discussion/what-is-a-good-lichess-rating
* https://www.kaggle.com/datasets/datasnaek/chess