## Some exploratory analyses of the `appearances.csv` dataset

In short, the `appearances.csv` dataset captures who played, when, for which club, and what they did on the pitch.

From now on, we will use `.parquet` files, which are easier to handle and take up much less disk space than `.csv` files.
Accordingly, `appearances.csv` was previously converted to `appearances_FULL.parquet`.
`appearances_FULL.parquet` is stored in the `data` directory inside the repository.
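
The conversion itself is not shown in this notebook. A minimal sketch of how it could be done with pandas follows, assuming a parquet engine such as `pyarrow` is installed. The helper name `csv_to_parquet` and its arguments are hypothetical; this is not necessarily the code that was used for this repository.

```python
from pathlib import Path

import pandas as pd


def csv_to_parquet(csv_path, parquet_path):
    """Read a CSV file and write it back out as parquet (hypothetical helper)."""
    csv_path, parquet_path = Path(csv_path), Path(parquet_path)
    frame = pd.read_csv(csv_path)
    # index=False avoids storing the default row index as an extra column
    frame.to_parquet(parquet_path, index=False)
    return parquet_path
```

Under these assumptions, a call such as `csv_to_parquet(DATA / "appearances.csv", DATA / "appearances_FULL.parquet")` would produce the file used below, provided the CSV sits in the same `data` folder.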

#### Prior to loading `appearances_FULL.parquet`, we set the path

In [8]:
# Build the path to appearances_FULL.parquet in the data folder

from pathlib import Path

# project root is one level up from notebooks/
ROOT = Path.cwd().parent
DATA = ROOT / "data"

file_path = DATA / "appearances_FULL.parquet"
print(file_path)

/Users/andresdevegili/Desktop/Data Viz/messi/Messi-outlier/data/appearances_FULL.parquet
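
Note that `Path.cwd().parent` only points at the project root when the notebook is run from `notebooks/`. As a hedged alternative, the sketch below walks up from a starting directory until it finds a folder that contains `data/`. The helper name `find_data_dir` is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_data_dir(start=None, name="data"):
    """Return the first `name` directory found in `start` or any of its parents."""
    start = Path(start) if start is not None else Path.cwd()
    start = start.resolve()
    # Check the starting folder first, then each parent up to the filesystem root
    for folder in (start, *start.parents):
        candidate = folder / name
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"No '{name}' directory found at or above {start}")
```

With this helper, `file_path = find_data_dir() / "appearances_FULL.parquet"` would resolve to the same file regardless of which subfolder the notebook is launched from.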


#### Now we load `appearances_FULL.parquet`. Hereafter `df`.

In [9]:
import pandas as pd

# Load the parquet file into a DataFrame
df = pd.read_parquet(file_path)

#### Exploratory analyses

In [14]:
# Basic info 
print("Shape:", df.shape)         # rows, columns
print(df.dtypes)                  # types of each column

# First look
display(df.head(10))                # first 10 rows

Shape: (1706806, 13)
appearance_id             object
game_id                    int64
player_id                  int64
player_club_id             int64
player_current_club_id     int64
date                      object
player_name               object
competition_id            object
yellow_cards               int64
red_cards                  int64
goals                      int64
assists                    int64
minutes_played             int64
dtype: object


|   | appearance_id | game_id | player_id | player_club_id | player_current_club_id | date | player_name | competition_id | yellow_cards | red_cards | goals | assists | minutes_played |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2231978_38004 | 2231978 | 38004 | 853 | 235 | 2012-07-03 | Aurélien Joachim | CLQ | 0 | 0 | 2 | 0 | 90 |
| 1 | 2233748_79232 | 2233748 | 79232 | 8841 | 2698 | 2012-07-05 | Ruslan Abyshov | ELQ | 0 | 0 | 0 | 0 | 90 |
| 2 | 2234413_42792 | 2234413 | 42792 | 6251 | 465 | 2012-07-05 | Sander Puri | ELQ | 0 | 0 | 0 | 0 | 45 |
| 3 | 2234418_73333 | 2234418 | 73333 | 1274 | 6646 | 2012-07-05 | Vegar Hedenstad | ELQ | 0 | 0 | 0 | 0 | 90 |
| 4 | 2234421_122011 | 2234421 | 122011 | 195 | 3008 | 2012-07-05 | Markus Henriksen | ELQ | 0 | 0 | 0 | 1 | 90 |
| 5 | 2234421_146889 | 2234421 | 146889 | 195 | 2778 | 2012-07-05 | Peter Ankersen | ELQ | 1 | 0 | 0 | 0 | 90 |
| 6 | 2235539_28716 | 2235539 | 28716 | 282 | 7185 | 2012-07-05 | Adi Adilovic | ELQ | 0 | 0 | 0 | 0 | 90 |
| 7 | 2235539_69445 | 2235539 | 69445 | 282 | 19771 | 2012-07-05 | Ivan Sesar | ELQ | 1 | 0 | 0 | 1 | 90 |
| 8 | 2235545_19409 | 2235545 | 19409 | 317 | 200 | 2012-07-05 | Willem Janssen | ELQ | 0 | 0 | 0 | 0 | 45 |
| 9 | 2235545_30003 | 2235545 | 30003 | 317 | 317 | 2012-07-05 | Wout Brama | ELQ | 0 | 0 | 0 | 0 | 90 |
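
The `date` column is reported as `object` rather than a datetime type, so date arithmetic and time-based grouping are not directly available. A minimal sketch of the conversion is shown here on a small stand-in frame copied from the preview, not on the full `df`. It assumes the values are ISO-style dates like those in the preview.

```python
import pandas as pd

# Stand-in for df: three rows copied from the preview
sample = pd.DataFrame(
    {
        "player_name": ["Aurélien Joachim", "Ruslan Abyshov", "Sander Puri"],
        "date": ["2012-07-03", "2012-07-05", "2012-07-05"],
    }
)

# By default, to_datetime raises an error if a value cannot be parsed as a date
sample["date"] = pd.to_datetime(sample["date"])

print(sample.dtypes)
print(sample["date"].dt.year.unique())
```

Applied to the full data, the equivalent call would be `df["date"] = pd.to_datetime(df["date"])`, after which accessors such as `.dt.year` become available.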


#### Summary

- 1,706,806 rows × 13 columns  
- Each row represents a player's appearance in a single match.
- Columns:
  - `appearance_id`: Unique identifier for the appearance (player × game). In the rows previewed above it has the form `<game_id>_<player_id>`.
  - `game_id`: Match identifier.
  - `player_id`: Player identifier.
  - `player_club_id`: Club the player appeared for in the match.
  - `player_current_club_id`: Player's club at the time the dataset snapshot was taken.
  - `date`: Date of the match.
  - `player_name`: Player's full name.
  - `competition_id`: Competition identifier code (e.g., `CLQ` and `ELQ` in the preview above).
  - `yellow_cards`: Number of yellow cards in the match.
  - `red_cards`: Number of red cards in the match.
  - `goals`: Goals scored in the match.
  - `assists`: Assists made in the match.
  - `minutes_played`: Minutes played in the match.
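
Two points in this summary can be checked directly: that `appearance_id` is unique, and that it identifies a player × game pair. The sketch below runs both checks on a small stand-in frame copied from the preview. The `<game_id>_<player_id>` composition is inferred from those sample rows, not from any documentation, and the helper name `check_appearance_ids` is hypothetical.

```python
import pandas as pd


def check_appearance_ids(frame):
    """Return (ids_are_unique, ids_match_game_and_player) for an appearances frame."""
    ids_are_unique = bool(frame["appearance_id"].is_unique)
    # Rebuild the id as "<game_id>_<player_id>" and compare it with the stored value
    rebuilt = frame["game_id"].astype(str) + "_" + frame["player_id"].astype(str)
    ids_match = bool((rebuilt == frame["appearance_id"]).all())
    return ids_are_unique, ids_match


# Stand-in for df: rows 0, 4 and 5 of the preview (rows 4 and 5 share a game_id)
sample = pd.DataFrame(
    {
        "appearance_id": ["2231978_38004", "2234421_122011", "2234421_146889"],
        "game_id": [2231978, 2234421, 2234421],
        "player_id": [38004, 122011, 146889],
    }
)

print(check_appearance_ids(sample))
```

Running `check_appearance_ids(df)` on the full data would show whether both properties hold across all 1,706,806 rows, which the ten-row preview alone cannot confirm.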