# Analysis of Chess Dataset

### Metadata
- Dataset name: games.csv
- Source: tbd
- Date acquired: 2025-10-24
- Access path: https://github.com/Titaniel3/ASDA_2025_Group_1_Portfolio/blob/5fa1a748946e7faf7d4ee6f6deb02c5cb9c91272/additional_material/games.csv
- MD5: 22a652ce21b9ecc314b979cf6b28d463
- Description: detailed records of roughly 20,000 chess games (20,058 rows, 16 columns)
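
The MD5 value listed above can be used to check that a local copy of the file is the same one that was analysed. The following is a minimal sketch, assuming the file sits at the relative path used for the import in this notebook; the helper name `md5_of_file` is hypothetical and not part of the original notebook.

```python
import hashlib
from pathlib import Path

EXPECTED_MD5 = "22a652ce21b9ecc314b979cf6b28d463"  # value from the metadata list


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumed location of the local copy (same relative path as the import cell)
csv_path = Path("../additional_material/games.csv")
if csv_path.exists():
    print("checksum matches:", md5_of_file(csv_path) == EXPECTED_MD5)
else:
    print("file not found:", csv_path)
```

Reading in chunks keeps memory use constant, which matters little for a file of this size but makes the helper reusable for larger datasets.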

### Import of libraries and file
Import the file and create a dataframe from it.

In [16]:
import pandas as pd

# Jupyter notebooks display only the result of the last expression in a cell; display() lets us show several results from one cell
from IPython.display import display

In [None]:
df = pd.read_csv('../additional_material/games.csv')

# Display df head to verify that df was created correctly
df.head()

In [6]:
# df.dtypes shows the column names and their data types in a single step
df.dtypes


id                 object
rated                bool
created_at        float64
last_move_at      float64
turns               int64
victory_status     object
winner             object
increment_code     object
white_id           object
white_rating        int64
black_id           object
black_rating        int64
moves              object
opening_eco        object
opening_name       object
opening_ply         int64
dtype: object
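
`created_at` and `last_move_at` are stored as `float64`. Their magnitude (around 1.5e12 in the summary statistics of this notebook) would be consistent with Unix timestamps in milliseconds, but the source does not confirm the unit, so that is an assumption. Under that assumption, a conversion to datetimes could look like the sketch below. The frame `sample` and the helper `to_datetimes` are hypothetical, and the values are illustrative rather than actual rows.

```python
import pandas as pd

# Illustrative values of the same magnitude as the dataset's timestamps
sample = pd.DataFrame({
    "created_at": [1376772000000.0, 1504493000000.0],
    "last_move_at": [1376772000000.0, 1504494000000.0],
})


def to_datetimes(frame, cols=("created_at", "last_move_at")):
    """Return a copy with the given columns parsed as epoch milliseconds (assumed unit)."""
    out = frame.copy()
    for col in cols:
        out[col] = pd.to_datetime(out[col], unit="ms")
    return out


converted = to_datetimes(sample)
# Game duration as a Timedelta column
converted["duration"] = converted["last_move_at"] - converted["created_at"]
print(converted)
```

In the notebook itself, `to_datetimes(df)` would be the corresponding call. If the resulting dates look implausible, the unit assumption should be revisited.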

In [18]:
# df.count() returns the number of non-null values in each column; df.shape is shown alongside it for comparison with the total row count

display(df.shape)
display(df.count())

(20058, 16)

id                20058
rated             20058
created_at        20058
last_move_at      20058
turns             20058
victory_status    20058
winner            20058
increment_code    20058
white_id          20058
white_rating      20058
black_id          20058
black_rating      20058
moves             20058
opening_eco       20058
opening_name      20058
opening_ply       20058
dtype: int64
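
The comparison of `df.count()` with `df.shape` can also be made explicit by counting missing values directly. A minimal sketch, using a small hypothetical frame (`demo`) instead of the real data:

```python
import pandas as pd

# Hypothetical toy frame; the notebook would use df instead of demo
demo = pd.DataFrame({
    "winner": ["white", "black", None],
    "turns": [13, 16, 61],
})

missing_per_column = demo.isna().sum()        # number of NaN/None cells per column
has_missing = bool(missing_per_column.any())  # True if any column has gaps

print(missing_per_column)
print("any missing values:", has_missing)
```

For the dataset above, every column count equals the 20058 rows, so `df.isna().sum()` would be zero for each column.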

In [20]:
# Since there are no NaN cells, NaN cannot add an extra unique value, so we can omit dropna=False

df.nunique()

id                19113
rated                 2
created_at        13151
last_move_at      13186
turns               211
victory_status        4
winner                3
increment_code      400
white_id           9438
white_rating       1516
black_id           9331
black_rating       1521
moves             18920
opening_eco         365
opening_name       1477
opening_ply          23
dtype: int64
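
The output shows 19113 unique `id` values for 20058 rows, so some ids occur more than once. Whether these are exact duplicate rows or different records that share an id is not established here. A sketch for inspecting that, on a hypothetical toy frame (`demo`) with made-up values:

```python
import pandas as pd

# Hypothetical toy frame; the notebook would use df instead of demo
demo = pd.DataFrame({
    "id": ["a1", "a1", "b2", "c3", "c3"],
    "turns": [13, 13, 16, 61, 95],
})

# All rows whose id appears more than once (every occurrence kept for inspection)
repeated_ids = demo[demo.duplicated(subset="id", keep=False)]

# Rows that are identical in every column to an earlier row
exact_duplicates = demo[demo.duplicated(keep="first")]

# One possible cleanup: drop exact duplicates only and keep conflicting rows
deduplicated = demo.drop_duplicates()

print(len(repeated_ids), len(exact_duplicates), len(deduplicated))
```

Dropping only exact duplicates is the more cautious choice; rows that share an id but differ elsewhere may need a manual look before anything is removed.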

In [21]:
# for insight into the numerical columns

df.describe()


| | created_at | last_move_at | turns | white_rating | black_rating | opening_ply |
|---|---|---|---|---|---|---|
| count | 20058.0 | 20058.0 | 20058.0 | 20058.0 | 20058.0 | 20058.0 |
| mean | 1483617000000.0 | 1483618000000.0 | 60.465999 | 1596.631868 | 1588.831987 | 4.816981 |
| std | 28501510000.0 | 28501400000.0 | 33.570585 | 291.253376 | 291.036126 | 2.797152 |
| min | 1376772000000.0 | 1376772000000.0 | 1.0 | 784.0 | 789.0 | 1.0 |
| 25% | 1477548000000.0 | 1477548000000.0 | 37.0 | 1398.0 | 1391.0 | 3.0 |
| 50% | 1496010000000.0 | 1496010000000.0 | 55.0 | 1567.0 | 1562.0 | 4.0 |
| 75% | 1503170000000.0 | 1503170000000.0 | 79.0 | 1793.0 | 1784.0 | 6.0 |
| max | 1504493000000.0 | 1504494000000.0 | 349.0 | 2700.0 | 2723.0 | 28.0 |


In [22]:
# for insights into categorical columns

df.describe(include=['object', 'bool'])


| | id | rated | victory_status | winner | increment_code | white_id | black_id | moves | opening_eco | opening_name |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 20058 | 20058 | 20058 | 20058 | 20058 | 20058 | 20058 | 20058 | 20058 | 20058 |
| unique | 19113 | 2 | 4 | 3 | 400 | 9438 | 9331 | 18920 | 365 | 1477 |
| top | XRuQPSzH | True | resign | white | 10+0 | taranga | taranga | e4 e5 | A00 | Van't Kruijs Opening |
| freq | 5 | 16155 | 11147 | 10001 | 7721 | 72 | 82 | 27 | 1007 | 368 |
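
`describe()` reports only the most frequent value per column, for example `white` with 10001 of 20058 rows for `winner` (about 49.9%). To see the full distribution of a categorical column, `value_counts()` can be used. The sketch below runs on a hypothetical toy series with made-up values, not on the real column.

```python
import pandas as pd

# Hypothetical toy column; the notebook would use df["winner"] instead
winner = pd.Series(["white", "white", "black", "draw"], name="winner")

counts = winner.value_counts()                # absolute frequencies
shares = winner.value_counts(normalize=True)  # relative frequencies, summing to 1

# Both results share the same index, so they align into one table
summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(summary)
```

The same two calls work for `victory_status`, `rated`, or `opening_eco`; for columns with many distinct values, `.head(10)` keeps the output readable.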
