# **Popular Chess Openings based on Rating and Time Control**




This Python notebook identifies the most popular chess openings by player rating and time control, so that players can see which openings are played most often at their own level and time format. It analyzes a dataset of more than 20,000 Lichess games to find patterns in opening choices among players of different skill levels (e.g., beginner, intermediate, advanced) and across different time controls (e.g., classical, rapid, blitz). Knowing which openings they are likely to encounter lets players prepare for them, add them to their own repertoire, and improve their overall results.

## **Importing libraries**

In [24]:
import pandas as pd
import numpy as np



## **Preprocessing the dataset**

This notebook uses the [Lichess games dataset](https://www.kaggle.com/datasets/datasnaek/chess) published on Kaggle by Mitchell J. It comprises 20,058 chess games from Lichess.org. Each game record includes the game ID, rated status, start and end times, number of turns, game status, winner, time control (`increment_code`), player IDs and ratings, all moves in standard notation, opening ECO code, and opening name.

Let's load the dataset into a dataframe and see its first few rows. 

In [25]:
df = pd.read_csv("./dataset/games.csv")
df.head()

| | id | rated | created_at | last_move_at | turns | victory_status | winner | increment_code | white_id | white_rating | black_id | black_rating | moves | opening_eco | opening_name | opening_ply |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | TZJHLljE | False | 1504210000000.0 | 1504210000000.0 | 13 | outoftime | white | 15+2 | bourgris | 1500 | a-00 | 1191 | d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5... | D10 | Slav Defense: Exchange Variation | 5 |
| 1 | l1NXvwaE | True | 1504130000000.0 | 1504130000000.0 | 16 | resign | black | 5+10 | a-00 | 1322 | skinnerua | 1261 | d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6... | B00 | Nimzowitsch Defense: Kennedy Variation | 4 |
| 2 | mIICvQHh | True | 1504130000000.0 | 1504130000000.0 | 61 | mate | white | 5+10 | ischia | 1496 | a-00 | 1500 | e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc... | C20 | King's Pawn Game: Leonardis Variation | 3 |
| 3 | kWKvrqYL | True | 1504110000000.0 | 1504110000000.0 | 61 | mate | white | 20+0 | daniamurashov | 1439 | adivanov2009 | 1454 | d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O... | D02 | Queen's Pawn Game: Zukertort Variation | 3 |
| 4 | 9tXo1AUZ | True | 1504030000000.0 | 1504030000000.0 | 95 | mate | white | 30+3 | nik221107 | 1523 | adivanov2009 | 1469 | e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N... | C41 | Philidor Defense | 5 |


Let's also check the number of columns and their data types.

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20058 entries, 0 to 20057
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              20058 non-null  object 
 1   rated           20058 non-null  bool   
 2   created_at      20058 non-null  float64
 3   last_move_at    20058 non-null  float64
 4   turns           20058 non-null  int64  
 5   victory_status  20058 non-null  object 
 6   winner          20058 non-null  object 
 7   increment_code  20058 non-null  object 
 8   white_id        20058 non-null  object 
 9   white_rating    20058 non-null  int64  
 10  black_id        20058 non-null  object 
 11  black_rating    20058 non-null  int64  
 12  moves           20058 non-null  object 
 13  opening_eco     20058 non-null  object 
 14  opening_name    20058 non-null  object 
 15  opening_ply     20058 non-null  int64  
dtypes: bool(1), float64(2), int64(4), object(9)
memory usage: 2.3+ MB


Let's drop the unnecessary columns and retain only `winner`, `increment_code`, `white_rating`, `black_rating`, `opening_eco`, and `opening_name`. Six columns should remain after this step.

In [27]:
df = df.drop(["id", "rated", "created_at", "last_move_at", "turns", "victory_status", "moves", "white_id", "black_id", "opening_ply"], axis=1)
df.head()

| | winner | increment_code | white_rating | black_rating | opening_eco | opening_name |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | white | 15+2 | 1500 | 1191 | D10 | Slav Defense: Exchange Variation |
| 1 | black | 5+10 | 1322 | 1261 | B00 | Nimzowitsch Defense: Kennedy Variation |
| 2 | white | 5+10 | 1496 | 1500 | C20 | King's Pawn Game: Leonardis Variation |
| 3 | white | 20+0 | 1439 | 1454 | D02 | Queen's Pawn Game: Zukertort Variation |
| 4 | white | 30+3 | 1523 | 1469 | C41 | Philidor Defense |
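
An equivalent way to express this step is to list the columns to keep rather than the ones to drop, which stays correct even if the source file gains extra columns. The following is a minimal sketch on a small stand-in frame; the names `raw`, `keep`, and `trimmed` are illustrative and not part of the notebook.

```python
import pandas as pd

# Stand-in frame with a subset of the dataset's columns (one row, for illustration)
raw = pd.DataFrame({
    "id": ["TZJHLljE"],
    "turns": [13],
    "winner": ["white"],
    "increment_code": ["15+2"],
    "white_rating": [1500],
    "black_rating": [1191],
    "opening_eco": ["D10"],
    "opening_name": ["Slav Defense: Exchange Variation"],
})

# Columns the analysis needs; everything else is discarded
keep = ["winner", "increment_code", "white_rating",
        "black_rating", "opening_eco", "opening_name"]
trimmed = raw[keep]

print(trimmed.shape)
```

Selecting by name also raises a `KeyError` if one of the expected columns is missing, which can surface a schema change early.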


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20058 entries, 0 to 20057
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   winner          20058 non-null  object
 1   increment_code  20058 non-null  object
 2   white_rating    20058 non-null  int64 
 3   black_rating    20058 non-null  int64 
 4   opening_eco     20058 non-null  object
 5   opening_name    20058 non-null  object
dtypes: int64(2), object(4)
memory usage: 940.3+ KB


Now, let's see the shape of the dataset. 

In [29]:
df.shape

(20058, 6)

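Let's group the values in `increment_code` into `bullet`, `blitz`, `rapid`, and `classical` for better readability. The grouping follows the categories used by Lichess.org, which are described [here](https://lichess.org/faq#time-controls).

First, the estimated game duration in seconds is computed with this formula:

`estimated game duration = (clock initial time) + 40 × (clock increment)`

Both terms are in seconds. Because `increment_code` is written as minutes+seconds (for example, `15+2` means 15 minutes plus a 2-second increment per move), the initial time has to be converted to seconds first: a `15+2` game has an estimated duration of 15 × 60 + 40 × 2 = 980 seconds.

The resulting duration is then grouped as follows (Lichess's separate UltraBullet category, 29s or less, is counted as `bullet` here):
* ≤ 179s = `bullet`
* ≤ 479s = `blitz`
* ≤ 1499s = `rapid`
* ≥ 1500s = `classical`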
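
As a quick sanity check of the formula and thresholds, the classification of a single time-control string can be sketched as a small standalone function. The helper names `estimated_duration` and `classify_time_control` are hypothetical (they are not part of the notebook), and the sketch assumes that `increment_code` is written as minutes+seconds and that 179s, 479s, and 1499s are inclusive upper limits.

```python
def estimated_duration(increment_code: str) -> int:
    """Estimated game duration in seconds for a code such as "5+3"."""
    minutes, increment = increment_code.split("+")
    # Assumption: initial time is in minutes, increment is in seconds per move.
    return int(minutes) * 60 + 40 * int(increment)


def classify_time_control(increment_code: str) -> str:
    """Map a time-control code to bullet, blitz, rapid, or classical."""
    seconds = estimated_duration(increment_code)
    if seconds <= 179:
        return "bullet"
    if seconds <= 479:
        return "blitz"
    if seconds <= 1499:
        return "rapid"
    return "classical"


for code in ["1+0", "5+3", "15+2", "30+3"]:
    print(code, estimated_duration(code), classify_time_control(code))
```

Checking a handful of familiar time controls this way (for instance, that `5+3` comes out as blitz) is a cheap guard against unit mistakes before the rule is applied to the whole dataframe.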

In [30]:
# increment_code is "<minutes>+<seconds>": split it into its two parts
df['initial_time'] = df['increment_code'].str.split('+').str[0]
df['increment'] = df['increment_code'].str.split('+').str[1]
# Convert the initial time from minutes to seconds before applying the formula
df['estimated_time'] = (df['initial_time'].astype(int) * 60 + df['increment'].astype(int) * 40)
df.head()

| | winner | increment_code | white_rating | black_rating | opening_eco | opening_name | initial_time | increment | estimated_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | white | 15+2 | 1500 | 1191 | D10 | Slav Defense: Exchange Variation | 15 | 2 | 980 |
| 1 | black | 5+10 | 1322 | 1261 | B00 | Nimzowitsch Defense: Kennedy Variation | 5 | 10 | 700 |
| 2 | white | 5+10 | 1496 | 1500 | C20 | King's Pawn Game: Leonardis Variation | 5 | 10 | 700 |
| 3 | white | 20+0 | 1439 | 1454 | D02 | Queen's Pawn Game: Zukertort Variation | 20 | 0 | 1200 |
| 4 | white | 30+3 | 1523 | 1469 | C41 | Philidor Defense | 30 | 3 | 1920 |


In [31]:
# One boolean mask per time-control category, in ascending order of duration
conditions = [
    (df['estimated_time'] <= 179),
    (df['estimated_time'] > 179) & (df['estimated_time'] <= 479),
    (df['estimated_time'] > 479) & (df['estimated_time'] <= 1499),
    (df['estimated_time'] >= 1500)
]

choices = ['bullet', 'blitz', 'rapid', 'classical']

df['time_control'] = np.select(conditions, choices, default='unknown')
df.head()

| | winner | increment_code | white_rating | black_rating | opening_eco | opening_name | initial_time | increment | estimated_time | time_control |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | white | 15+2 | 1500 | 1191 | D10 | Slav Defense: Exchange Variation | 15 | 2 | 980 | rapid |
| 1 | black | 5+10 | 1322 | 1261 | B00 | Nimzowitsch Defense: Kennedy Variation | 5 | 10 | 700 | rapid |
| 2 | white | 5+10 | 1496 | 1500 | C20 | King's Pawn Game: Leonardis Variation | 5 | 10 | 700 | rapid |
| 3 | white | 20+0 | 1439 | 1454 | D02 | Queen's Pawn Game: Zukertort Variation | 20 | 0 | 1200 | rapid |
| 4 | white | 30+3 | 1523 | 1469 | C41 | Philidor Defense | 30 | 3 | 1920 | classical |
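
The same binning can also be written with `pd.cut`, which keeps the thresholds and labels together in one place. This is an alternative sketch rather than the notebook's own method: it runs on a small stand-in frame (`sample`, a hypothetical name) with a few illustrative time-control codes, and it assumes `increment_code` is minutes+seconds with 179s, 479s, and 1499s as inclusive upper limits.

```python
import numpy as np
import pandas as pd

# Stand-in frame with a few illustrative time-control codes
sample = pd.DataFrame({"increment_code": ["15+2", "5+10", "20+0", "30+3", "1+0"]})

# Split "<minutes>+<seconds>" into two integer columns (labelled 0 and 1)
parts = sample["increment_code"].str.split("+", expand=True).astype(int)
sample["estimated_time"] = parts[0] * 60 + parts[1] * 40

# Right-closed bins: (-inf, 179], (179, 479], (479, 1499], (1499, inf)
bins = [-np.inf, 179, 479, 1499, np.inf]
labels = ["bullet", "blitz", "rapid", "classical"]
sample["time_control"] = pd.cut(sample["estimated_time"], bins=bins, labels=labels)

print(sample)
```

Unlike `np.select`, `pd.cut` returns an ordered categorical column, so later sorting or grouping follows the bullet-to-classical order instead of alphabetical order.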


## **References**
* https://lichess.org/faq#time-controls
* https://www.kaggle.com/datasets/datasnaek/chess