# **Messi outlier**
## DATASET 1
### "Football Data from Transfermarkt"

## Exploratory analyses of `appearances.csv`

The dataset `appearances.csv` is part of a collection of datasets called "Football Data from Transfermarkt" by David Cariboo. The full data is available on [Kaggle](https://www.kaggle.com/datasets/davidcariboo/player-scores) and [GitHub](https://github.com/dcaribou/transfermarkt-datasets).

`appearances.csv` is a vast list of football player appearances that captures who played, when, for which club, and what they did on the pitch.

The aim of this notebook is to examine whether `appearances.csv` is suitable for comparing Messi's performance with that of other players.

From now on, we use `.parquet` files. Accordingly, `appearances.csv` was converted to `appearances_FULL.parquet` beforehand.

#### Route to `appearances_FULL.parquet`

In [4]:
# Build the path from the notebook folder to appearances_FULL.parquet in the data folder
from pathlib import Path

# project root is one level up from notebooks/
ROOT = Path.cwd().parent
DATA = ROOT / "data"

file_path = DATA / "appearances_FULL.parquet"

#### Load `appearances_FULL.parquet`, hereafter `df`.

In [5]:
# Load the parquet file into a DataFrame
import pandas as pd

df = pd.read_parquet(file_path)

#### Exploratory analyses

In [14]:
# Basic info 
print("Shape:", df.shape)         # rows, columns
print(df.dtypes)                  # types of each column

# Time span
date_min = df["date"].min()
date_max = df["date"].max()
print(f"\nDate span: {date_min} → {date_max}")

# First look 
display(df.head(10))                # first 10 rows

Shape: (1706806, 13)
appearance_id             object
game_id                    int64
player_id                  int64
player_club_id             int64
player_current_club_id     int64
date                      object
player_name               object
competition_id            object
yellow_cards               int64
red_cards                  int64
goals                      int64
assists                    int64
minutes_played             int64
dtype: object

Date span: 2012-07-03 → 2025-04-10


|   | appearance_id | game_id | player_id | player_club_id | player_current_club_id | date | player_name | competition_id | yellow_cards | red_cards | goals | assists | minutes_played |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2231978_38004 | 2231978 | 38004 | 853 | 235 | 2012-07-03 | Aurélien Joachim | CLQ | 0 | 0 | 2 | 0 | 90 |
| 1 | 2233748_79232 | 2233748 | 79232 | 8841 | 2698 | 2012-07-05 | Ruslan Abyshov | ELQ | 0 | 0 | 0 | 0 | 90 |
| 2 | 2234413_42792 | 2234413 | 42792 | 6251 | 465 | 2012-07-05 | Sander Puri | ELQ | 0 | 0 | 0 | 0 | 45 |
| 3 | 2234418_73333 | 2234418 | 73333 | 1274 | 6646 | 2012-07-05 | Vegar Hedenstad | ELQ | 0 | 0 | 0 | 0 | 90 |
| 4 | 2234421_122011 | 2234421 | 122011 | 195 | 3008 | 2012-07-05 | Markus Henriksen | ELQ | 0 | 0 | 0 | 1 | 90 |
| 5 | 2234421_146889 | 2234421 | 146889 | 195 | 2778 | 2012-07-05 | Peter Ankersen | ELQ | 1 | 0 | 0 | 0 | 90 |
| 6 | 2235539_28716 | 2235539 | 28716 | 282 | 7185 | 2012-07-05 | Adi Adilovic | ELQ | 0 | 0 | 0 | 0 | 90 |
| 7 | 2235539_69445 | 2235539 | 69445 | 282 | 19771 | 2012-07-05 | Ivan Sesar | ELQ | 1 | 0 | 0 | 1 | 90 |
| 8 | 2235545_19409 | 2235545 | 19409 | 317 | 200 | 2012-07-05 | Willem Janssen | ELQ | 0 | 0 | 0 | 0 | 45 |
| 9 | 2235545_30003 | 2235545 | 30003 | 317 | 317 | 2012-07-05 | Wout Brama | ELQ | 0 | 0 | 0 | 0 | 90 |


#### Summary

- 1,706,806 rows × 13 columns  
- Each row represents a player's appearance in a single match.
- Date span is 2012-07-03 → 2025-04-10
- Columns:
  - `appearance_id`: unique identifier for the appearance (in the sample rows, `game_id` and `player_id` joined by an underscore).
  - `game_id`: match identifier.
  - `player_id`: player identifier.
  - `player_club_id`: the player's club at the time of the match.
  - `player_current_club_id`: the player's club at the time of the dataset snapshot.
  - `date`: date of the match (stored as a string, dtype `object`).
  - `player_name`: the player's name.
  - `competition_id`: competition identifier (e.g., `ES1`, `CL`, `ELQ`).
  - `yellow_cards`: number of yellow cards in the match.
  - `red_cards`: number of red cards in the match.
  - `goals`: goals scored in the match.
  - `assists`: assists made in the match.
  - `minutes_played`: minutes played in the match.
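
The claim that each row is one player's appearance in one match can be checked directly. The sketch below is a minimal example under assumptions: `check_appearance_grain` is a hypothetical helper, the `"<game_id>_<player_id>"` key format is inferred from the sample rows only, and the small `toy` frame stands in for `df`.

```python
import pandas as pd

# Toy stand-in for `df`; the values are copied from the sample rows shown above.
toy = pd.DataFrame(
    {
        "appearance_id": ["2231978_38004", "2233748_79232", "2234421_122011"],
        "game_id": [2231978, 2233748, 2234421],
        "player_id": [38004, 79232, 122011],
        "minutes_played": [90, 90, 90],
    }
)


def check_appearance_grain(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: check that each row is one player in one game."""
    # Assumption: appearance_id is "<game_id>_<player_id>", as the sample suggests.
    rebuilt = frame["game_id"].astype(str) + "_" + frame["player_id"].astype(str)
    return {
        "appearance_id_unique": bool(frame["appearance_id"].is_unique),
        "pair_unique": not frame.duplicated(["game_id", "player_id"]).any(),
        "id_matches_pair": bool((rebuilt == frame["appearance_id"]).all()),
        "missing_values": int(frame.isna().sum().sum()),
    }


grain_report = check_appearance_grain(toy)
print(grain_report)
```

Calling `check_appearance_grain(df)` on the full frame would show whether these properties hold for all rows, not just the first ten.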

#### **WARNING**
Date span is 2012-07-03 → 2025-04-10.\
The dataset does not include Messi's "absolute peak" season, 2011-2012.\
However, it does include Messi's "second prime" in 2014-2015, when he formed the legendary MSN trio (Messi-Suárez-Neymar).
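
The `date` column has dtype `object`, so the span above comes from comparing strings; that gives the right answer only because `YYYY-MM-DD` strings sort chronologically. A minimal sketch of a safer check, assuming every date uses that format (`date_span` is a hypothetical helper, and `toy` stands in for `df`):

```python
import pandas as pd

# Toy stand-in for `df`; only the `date` column matters here.
toy = pd.DataFrame({"date": ["2012-07-03", "2025-04-10", "2014-09-21"]})


def date_span(frame: pd.DataFrame, column: str = "date"):
    """Hypothetical helper: parse the column and return (min, max, n_bad)."""
    # Assumption: dates are ISO "YYYY-MM-DD"; anything else becomes NaT.
    parsed = pd.to_datetime(frame[column], format="%Y-%m-%d", errors="coerce")
    n_bad = int(parsed.isna().sum())  # counts missing and unparseable values
    return parsed.min(), parsed.max(), n_bad


start, end, n_bad = date_span(toy)
print(f"Date span: {start.date()} → {end.date()} ({n_bad} missing or unparseable)")
```

A non-zero count of missing or unparseable values on the real data would mean the string-based span deserves a second look.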

#### To examine how much of Messi's career is covered by `df`, we filter `player_name` by "Messi"

In [7]:
# Search for any player_name that contains "Messi"
df[df["player_name"].str.contains("Messi", case=False, na=False)]["player_name"].unique()

array(['Lionel Messi', 'Messias', 'Junior Messias'], dtype=object)
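
Because the substring search also matches other players, and because names are not guaranteed to be unique, it can be safer to resolve the name to a single `player_id` and filter on that. The sketch below is one possible approach, not the notebook's method: `filter_by_exact_name` is a hypothetical helper, and the ids `11111` and `22222` in the toy frame are invented placeholders (only `28003` appears in this dataset's Messi rows).

```python
import pandas as pd

# Toy stand-in for `df`. 11111 and 22222 are made-up ids for illustration.
toy = pd.DataFrame(
    {
        "player_id": [28003, 28003, 11111, 22222],
        "player_name": ["Lionel Messi", "Lionel Messi", "Messias", "Junior Messias"],
        "goals": [2, 1, 0, 1],
    }
)


def filter_by_exact_name(frame: pd.DataFrame, name: str) -> pd.DataFrame:
    """Hypothetical helper: keep rows of the single player with this exact name."""
    ids = frame.loc[frame["player_name"] == name, "player_id"].unique()
    if len(ids) != 1:
        # Zero ids: name not found. Several ids: different players share the name.
        raise ValueError(f"Expected one player_id for {name!r}, found {len(ids)}")
    return frame[frame["player_id"] == ids[0]]


messi_toy = filter_by_exact_name(toy, "Lionel Messi")
print(messi_toy["player_id"].unique(), len(messi_toy))
```

Raising on an ambiguous name makes a silent mix-up between namesakes impossible, at the cost of having to handle the error explicitly.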

In [13]:
# Apply filter 'Lionel Messi'
messi_df = df[df["player_name"] == "Lionel Messi"]
# Basic structure
n_rows, n_cols = messi_df.shape
print(f"Rows (appearances): {n_rows}")
print(f"Columns: {n_cols}")

# Date range
date_min = messi_df["date"].min()
date_max = messi_df["date"].max()
print(f"Date span: {date_min} → {date_max}")

# Show the first 10 rows of messi_df
messi_df.head(10)

Rows (appearances): 522
Columns: 13
Date span: 2012-08-19 → 2023-06-03


|   | appearance_id | game_id | player_id | player_club_id | player_current_club_id | date | player_name | competition_id | yellow_cards | red_cards | goals | assists | minutes_played |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8135 | 2244378_28003 | 2244378 | 28003 | 131 | 583 | 2012-08-19 | Lionel Messi | ES1 | 0 | 0 | 2 | 0 | 90 |
| 8929 | 2244388_28003 | 2244388 | 28003 | 131 | 583 | 2012-08-22 | Lionel Messi | SUC | 0 | 0 | 1 | 0 | 90 |
| 11912 | 2242828_28003 | 2242828 | 28003 | 131 | 583 | 2012-08-26 | Lionel Messi | ES1 | 0 | 0 | 2 | 0 | 90 |
| 12883 | 2244389_28003 | 2244389 | 28003 | 131 | 583 | 2012-08-29 | Lionel Messi | SUC | 0 | 0 | 1 | 0 | 90 |
| 15970 | 2242881_28003 | 2242881 | 28003 | 131 | 583 | 2012-09-02 | Lionel Messi | ES1 | 0 | 0 | 0 | 1 | 90 |
| 18056 | 2242910_28003 | 2242910 | 28003 | 131 | 583 | 2012-09-15 | Lionel Messi | ES1 | 0 | 0 | 2 | 0 | 32 |
| 20210 | 2262160_28003 | 2262160 | 28003 | 131 | 583 | 2012-09-19 | Lionel Messi | CL | 0 | 0 | 2 | 0 | 90 |
| 21888 | 2242904_28003 | 2242904 | 28003 | 131 | 583 | 2012-09-22 | Lionel Messi | ES1 | 0 | 0 | 0 | 1 | 90 |
| 26474 | 2242837_28003 | 2242837 | 28003 | 131 | 583 | 2012-09-29 | Lionel Messi | ES1 | 0 | 0 | 0 | 2 | 90 |
| 28593 | 2262168_28003 | 2262168 | 28003 | 131 | 583 | 2012-10-02 | Lionel Messi | CL | 0 | 0 | 0 | 2 | 90 |


#### **NOTE**
Dataset 1, "Football Data from Transfermarkt", does not cover an appropriate time span for comparing Messi's performance with that of other players. Specifically, Messi's prime in 2011/2012 is not included: his appearances in `df` run from 2012-08-19 to 2023-06-03.
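
The coverage gap can be made explicit by counting appearances per season. The sketch below is illustrative only. It assumes a season runs from July to June, which is a simplification and not something the dataset defines. `season_label` and `season_coverage` are hypothetical helpers, and the `toy` rows are made-up values, not real match records.

```python
import pandas as pd

# Toy rows for illustration only; on the real data this would be `messi_df`.
toy = pd.DataFrame(
    {
        "date": ["2012-08-19", "2013-05-12", "2014-09-21", "2015-06-06"],
        "goals": [2, 1, 0, 0],
        "minutes_played": [90, 90, 90, 90],
    }
)


def season_label(dates: pd.Series, start_month: int = 7) -> pd.Series:
    """Hypothetical helper: map dates to labels such as '2014/15'."""
    parsed = pd.to_datetime(dates)
    # Assumption: matches before `start_month` belong to the previous season.
    start_year = parsed.dt.year.where(
        parsed.dt.month >= start_month, parsed.dt.year - 1
    )
    return start_year.astype(str) + "/" + (start_year + 1).astype(str).str[-2:]


def season_coverage(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: appearances, goals and minutes per season."""
    return (
        frame.assign(season=season_label(frame["date"]))
        .groupby("season")
        .agg(
            appearances=("date", "size"),
            goals=("goals", "sum"),
            minutes=("minutes_played", "sum"),
        )
    )


coverage = season_coverage(toy)
print(coverage)
```

Applied to `messi_df`, a table like this would list which seasons are present at all; under the July-June assumption, no `2011/12` row can appear because the earliest date in the dataset is 2012-07-03.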