# Task 1 – Dataset Creation and Preprocessing for xG Prediction

This notebook is dedicated to the **creation** and **preprocessing** of structured datasets for the xG analysis task.  
The objective is to build **progressive versions of the dataset**, starting from a minimal shot-only baseline and then gradually adding contextual information such as passes, possessions, and freeze frames. Each stage is versioned separately to ensure reproducibility and to allow the evaluation of the marginal contribution of additional features.


## DS0: Shot only

The first dataset is constructed as a **clean baseline limited to the shot event itself**, focusing exclusively on information available **prior to the outcome**. The following features are included as model inputs:

- **`location`** → decomposed into `loc_x` and `loc_y`, representing the on-pitch coordinates of the shot. From these, two additional geometric variables are derived:  
  - `distance_to_goal`: Euclidean distance from the shooting location to the center of the goal.  
  - `angle_to_goal`: the angle subtended by the goalposts relative to the shooting position.  

- **`shot_end_location`** → decomposed into `end_shot_x` and `end_shot_y`, capturing the coordinates where the shot attempt ended (e.g., goal, block, off target).  

- **`shot_type`** → categorical variable describing whether the attempt was taken as a penalty, free kick, open play shot, or another category.  

- **`shot_technique`** → categorical variable describing the technical execution of the shot, such as volley, half-volley, or normal shot.  

- **`shot_body_part`** → categorical variable identifying the body surface used, such as right foot, left foot, or head.  

- **`play_pattern`** → categorical variable indicating the pattern of play that led to the shot, e.g., regular play, counterattack, corner, or free kick.  

- **`under_pressure`** → binary flag indicating whether the player was under defensive pressure at the moment of the attempt.  

- **`shot_first_time`** → binary flag indicating whether the shot was taken directly without control.  

- **`shot_one_on_one`** → binary flag identifying one-on-one situations with the goalkeeper.  

- **`shot_statsbomb_xg`** → continuous variable representing the expected goals (xG) value provided by StatsBomb. This is the **target variable** for regression.  

Some variables are present in the raw dataset but are **not included as predictive features** because they introduce leakage or lack intrinsic predictive value. In particular:
- **`shot_outcome`** is excluded since it represents the realized result of the attempt (goal, saved, blocked, off target), and using it would constitute **data leakage** given that `shot_statsbomb_xg` is an **ex-ante probability**.

- **`shot_saved_to_post`** is excluded because it describes an event that occurs only after the shot has been taken, making it unusable as a predictor.  

Regarding temporal information, variables such as **`minute`, `second`,** and **`period`** provide context about when the shot occurred within the match. They are retained for descriptive purposes but are **not used as predictors**, as they do not directly influence the intrinsic probability of a shot being converted.  

> **NOTE**  
> The following identifier columns are kept only as **service keys** to allow merging/enrichment in further datasets.  
> They **MUST NOT** be used as predictive features in the model:
>
> - **`id`** : unique identifier of the event (primary key in StatsBomb events)  
> - **`match_id`** : identifier of the match (needed to retrieve competition/season/gender)  
> - **`team_id`** : identifier of the team performing the event  
> - **`player_id`** : identifier of the player performing the event  
> - **`possession_team_id`** : identifier of the team in possession during the event  

In [4]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

# Load the basic shots dataset
shots_df = pd.read_csv("../task1_xg/data/shots_df.csv")

# Columns to keep for DS0
cols_keep = [
    "id", "match_id", "team_id", "player_id", "possession_team_id",
    "minute", "second", "period", 
    "location", "shot_end_location",
    "shot_type", "shot_technique", "shot_body_part", "play_pattern",
    "under_pressure", "shot_first_time", "shot_one_on_one",
    "shot_statsbomb_xg"   # <-- target variable
]

# Build DS0
ds0 = shots_df[cols_keep].copy()

# Basic Info
print("="*50)
print("DATASET DS0 - BASIC INFO")
print("="*50)
print(f"Shape: {ds0.shape[0]} rows, {ds0.shape[1]} columns\n")

print("Column data types:")
print(ds0.dtypes)

# Missing values
print("\n" + "="*50)
print("MISSING VALUES")
print("="*50)
print(ds0.isna().sum())

# Numeric ranges
print("\n" + "="*50)
print("NUMERIC RANGES")
print("="*50)
num_cols = ["minute", "second", "period", "shot_statsbomb_xg"]
for col in num_cols:
    print(f"{col}: min={ds0[col].min()}  max={ds0[col].max()}")

print("\nshot_statsbomb_xg distribution:")
print(ds0["shot_statsbomb_xg"].describe())


# Categorical features
print("\n" + "="*50)
print("CATEGORICAL FEATURES")
print("="*50)
categorical = ["shot_type", "shot_technique", "shot_body_part", "play_pattern"]
for col in categorical:
    print(f"\n--- {col.upper()} ---")
    print("Unique values:", ds0[col].unique())
    print("\nValue counts:")
    print(ds0[col].value_counts(dropna=False))


# Binary flags
print("\n" + "="*50)
print("BINARY FLAGS")
print("="*50)
bin_flags = ["under_pressure", "shot_first_time", "shot_one_on_one"]
for col in bin_flags:
    print(f"\n--- {col.upper()} ---")
    print(ds0[col].value_counts(dropna=False))
    print("Unique values:", ds0[col].unique())

# Event Id Sample
print("\n" + "="*50)
print("EVENT ID SAMPLE")
print("="*50)
print(ds0["id"].sample(3).tolist())

# Verify id uniqueness and NaN
if ds0["id"].is_unique:
    print("\nAll IDs are unique")
if ds0["id"].isna().any():
    print("\nThere are NaN values in 'id' column")

# Location Fields
print("\n" + "="*50)
print("LOCATION FIELDS (Preview)")
print("="*50)
print("Sample 'location':", ds0["location"].head(3).tolist())
print("Sample 'shot_end_location':", ds0["shot_end_location"].head(3).tolist())


DATASET DS0 - BASIC INFO
Shape: 88023 rows, 18 columns

Column data types:
id                     object
match_id                int64
team_id                 int64
player_id             float64
possession_team_id      int64
minute                  int64
second                  int64
period                  int64
location               object
shot_end_location      object
shot_type              object
shot_technique         object
shot_body_part         object
play_pattern           object
under_pressure         object
shot_first_time        object
shot_one_on_one        object
shot_statsbomb_xg     float64
dtype: object

MISSING VALUES
id                        0
match_id                  0
team_id                   0
player_id                 0
possession_team_id        0
minute                    0
second                    0
period                    0
location                  0
shot_end_location         0
shot_type                 0
shot_technique            0
shot_body_part     

### Data Situation

The initial inspection of the **DS0 shot dataset** reveals the following:

- Categorical fields (`shot_type`, `shot_technique`, `shot_body_part`, `play_pattern`) are clean and have no missing values.

- Binary flags (`under_pressure`, `shot_first_time`, `shot_one_on_one`) contain a high proportion of `NaN`. The flags appear to be recorded only when the condition holds, so a missing value most likely means `False`.

- `minute`, `second`, and `period` ranges are valid (e.g., `period` goes up to 5, which covers extra time and penalty shootouts).

- `shot_statsbomb_xg` (our regression target) spans from ~0.0002 to ~0.995, with a realistic distribution (median ~0.055).

- Coordinate fields (`location`, `shot_end_location`) are stored as strings representing lists (e.g., `"[100.4, 35.1]"`), needing parsing into numeric `x`, `y` components.
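
The assumption that a missing flag means `False` can be checked before filling. The following is a minimal sketch on a small made-up frame standing in for `ds0` (the values and the helper name `flag_summary` are illustrative, not taken from the dataset). If every non-missing entry is `True`, the flag is only recorded when the condition holds, and `NaN` can reasonably be read as `False`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ds0: flags are either True or missing (illustrative values)
toy = pd.DataFrame({
    "under_pressure": [True, np.nan, np.nan, True],
    "shot_first_time": [np.nan, True, np.nan, np.nan],
    "shot_one_on_one": [np.nan, np.nan, np.nan, True],
})

def flag_summary(df, cols):
    """Per flag: share of missing values and whether all observed values are True."""
    rows = []
    for col in cols:
        observed = df[col].dropna()
        rows.append({
            "flag": col,
            "missing_share": df[col].isna().mean(),
            "only_true_observed": bool(observed.eq(True).all()),
        })
    return pd.DataFrame(rows).set_index("flag")

summary = flag_summary(toy, toy.columns)
print(summary)
```

If `only_true_observed` were `False` for some flag, filling with `False` would hide a real distinction between "recorded as false" and "not recorded".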

### Cleaning Pipeline for DS0

To prepare **DS0** for modeling, the following transformations are applied:

1. **Rename identifier column**
   - Column: `id`
   - Renamed to `event_id` for clarity


2. **Cast categorical columns**  
   - Columns: `shot_type`, `shot_technique`, `shot_body_part`, `play_pattern`  
   - Converted to pandas `category` dtype to optimize memory usage and facilitate encoding

3. **Fill missing binary flags with `False`**  
   - Columns: `under_pressure`, `shot_first_time`, `shot_one_on_one`  
   - Missing values are filled with `False` and the columns are cast to `bool`, since here `NaN` indicates that the condition did not occur

4. **Parse coordinate fields**  
   - Convert `location` into numeric columns `loc_x`, `loc_y`
   - Convert `shot_end_location` into `end_shot_x`, `end_shot_y`
   - Drop the original string-based columns afterward

5. **Validate numeric ranges**  
   - Ensure features such as pitch coordinates, seconds, and period values fall within expected limits

6. **Verify missing values**  
   - Confirm dataset completeness before proceeding
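
Step 4 can also be done without a row-wise `apply`. The sketch below is one possible vectorized variant, shown on a made-up Series. The helper name `split_coords` is hypothetical, and it assumes every entry is a bracketed, comma-separated string (a third component, if present, is ignored).

```python
import pandas as pd

def split_coords(series, names):
    """Split '[x, y, ...]' strings into float columns, keeping the first len(names) parts."""
    parts = (
        series.str.strip("[]")              # drop the surrounding brackets
              .str.split(",", expand=True)  # one column per component
              .iloc[:, :len(names)]         # keep only the leading components
    )
    # Strip whitespace, then convert; non-numeric tokens become NaN instead of raising
    parts = parts.apply(lambda col: pd.to_numeric(col.str.strip(), errors="coerce"))
    parts.columns = names
    return parts

toy = pd.Series(["[100.4, 35.1]", "[114.6, 33.5]", "[118.1, 35.7, 0.4]"])
coords = split_coords(toy, ["loc_x", "loc_y"])
print(coords)
```

The result has the same index as the input, so it can be assigned back with `df[["loc_x", "loc_y"]] = split_coords(df["location"], ["loc_x", "loc_y"])`.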

Columns such as `event_id`, `match_id`, `team_id`, `player_id`, and `possession_team_id` are retained only as service keys for merging and enrichment with external datasets, so no preprocessing is applied to them.
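
One way to make sure the service keys never reach a model is to split them off explicitly before training. The following is a minimal sketch on a made-up frame; the helper name `split_keys_features` and the toy values are illustrative and not part of this notebook's pipeline.

```python
import pandas as pd

KEY_COLS = ["event_id", "match_id", "team_id", "player_id", "possession_team_id"]

def split_keys_features(df, target, key_cols=KEY_COLS):
    """Return (keys, X, y) so that identifier columns never end up in X."""
    present = [c for c in key_cols if c in df.columns]
    keys = df[present]
    X = df.drop(columns=present + [target])
    y = df[target]
    return keys, X, y

# Toy frame with invented values
toy = pd.DataFrame({
    "event_id": ["a", "b"],
    "match_id": [1, 1],
    "team_id": [10, 20],
    "player_id": [100.0, 200.0],
    "possession_team_id": [10, 20],
    "loc_x": [100.4, 114.6],
    "shot_statsbomb_xg": [0.05, 0.14],
})
keys, X, y = split_keys_features(toy, target="shot_statsbomb_xg")
print(list(X.columns))
```

Keeping `keys` alongside `X` preserves the row alignment, so predictions can still be joined back to events and matches afterwards.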

In [5]:
import numpy as np
import pandas as pd

# 0) Reload the dataset for reproducibility

# Load the basic shots dataset
shots_df = pd.read_csv("../task1_xg/data/shots_df.csv")

# Columns to keep for DS0
cols_keep = [
    "id", "match_id", "team_id", "player_id", "possession_team_id",
    "minute", "second", "period", 
    "location", "shot_end_location",
    "shot_type", "shot_technique", "shot_body_part", "play_pattern",
    "under_pressure", "shot_first_time", "shot_one_on_one",
    "shot_statsbomb_xg"   # <-- target variable
]

# Build DS0
ds0 = shots_df[cols_keep].copy()

In [6]:
# 1) Rename "id" column into "event_id"
ds0 = ds0.rename(columns={"id": "event_id"})

# 2) Cast categorical columns
# This allows for more efficient memory usage and faster operations
categorical_cols = ["shot_type", "shot_technique", "shot_body_part", "play_pattern"]
ds0[categorical_cols] = ds0[categorical_cols].astype("category")                        

# 3) Fill missing binary flags with False
bin_flags = ["under_pressure", "shot_first_time", "shot_one_on_one"]
for col in bin_flags:
    ds0[col] = ds0[col].fillna(False).astype(bool)              # Binary flags: NaN -> False

# 4) Parse coordinate fields (location and shot_end_location)
def parse_coords_basic(s, n=2):
    """Parse a '[x, y]' string into a list of n floats (NaN on failure)."""
    try:
        vals = s.strip("[]").split(",")   # remove brackets, split by comma
        vals = [float(v) for v in vals]   # convert to floats
        return vals[:n]
    except (AttributeError, ValueError):  # non-string input or non-numeric token
        return [np.nan] * n

# location -> loc_x, loc_y
ds0[["loc_x", "loc_y"]] = ds0["location"].apply(lambda x: pd.Series(parse_coords_basic(x, 2)))

# shot_end_location -> end_shot_x, end_shot_y
ds0[["end_shot_x", "end_shot_y"]] = ds0["shot_end_location"].apply(lambda x: pd.Series(parse_coords_basic(x, 2)))

# Drop original string columns
ds0 = ds0.drop(columns=["location", "shot_end_location"])

# 5) Validate numeric ranges
def check_range(col, min_val, max_val):
    bad = ds0[(ds0[col] < min_val) | (ds0[col] > max_val)]
    if not bad.empty:
        print(f"{col} out of range values found:\n", bad[[col]].head())
    else:
        print(f"{col} within expected range [{min_val}, {max_val}]")

# Example checks
check_range("loc_x", 0, 120)
check_range("loc_y", 0, 80)
check_range("end_shot_x", 0, 120)
check_range("end_shot_y", 0, 80)
check_range("second", 0, 59)
check_range("period", 1, 5)

print("\nDS0 processed - shape:", ds0.shape)


loc_x out of range values found:
        loc_x
4682   120.2
5351   120.4
14663  120.1
61833  120.2
81525  120.1
loc_y within expected range [0, 80]
end_shot_x within expected range [0, 120]
end_shot_y within expected range [0, 80]
second within expected range [0, 59]
period within expected range [1, 5]

DS0 processed - shape: (88023, 20)


> **NOTE**: During the range check, a few `loc_x` values were found slightly above the nominal pitch limit of 120. These cases occur when an event is registered just beyond the field boundary (e.g. a shot taken or ending out of play).  
> It has been decided **not to clip these values**, since the deviation is very small (<1 meter) and may carry meaningful information about the context of the action (e.g. an off-target attempt). Leaving them as is preserves the fidelity of the raw data while remaining compatible with the modeling stage.
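
To judge how far such values actually stray, the overshoot beyond the pitch limit can be summarized before deciding whether to clip. The sketch below uses the five `loc_x` values printed above plus one in-range value. The constant `PITCH_X_MAX` and the helper `overshoot` are names introduced here for illustration.

```python
import pandas as pd

PITCH_X_MAX = 120  # nominal pitch length in StatsBomb units

def overshoot(values, upper=PITCH_X_MAX):
    """Amount by which each value exceeds `upper` (0 for in-range values)."""
    return (values - upper).clip(lower=0)

loc_x = pd.Series([120.2, 120.4, 120.1, 120.2, 120.1, 100.4])
over = overshoot(loc_x)
print("rows beyond the limit:", int((over > 0).sum()))
print("largest overshoot:", round(over.max(), 2))
```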


In [7]:
# 6) Verify DS0
print("DTYPE SUMMARY")
print(ds0.dtypes)

print("\nMISSING VALUES")
print(ds0.isna().sum())

ds0.head(3)

DTYPE SUMMARY
event_id                object
match_id                 int64
team_id                  int64
player_id              float64
possession_team_id       int64
minute                   int64
second                   int64
period                   int64
shot_type             category
shot_technique        category
shot_body_part        category
play_pattern          category
under_pressure            bool
shot_first_time           bool
shot_one_on_one           bool
shot_statsbomb_xg      float64
loc_x                  float64
loc_y                  float64
end_shot_x             float64
end_shot_y             float64
dtype: object

MISSING VALUES
event_id              0
match_id              0
team_id               0
player_id             0
possession_team_id    0
minute                0
second                0
period                0
shot_type             0
shot_technique        0
shot_body_part        0
play_pattern          0
under_pressure        0
shot_first_time       0


|  | event_id | match_id | team_id | player_id | possession_team_id | minute | second | period | shot_type | shot_technique | shot_body_part | play_pattern | under_pressure | shot_first_time | shot_one_on_one | shot_statsbomb_xg | loc_x | loc_y | end_shot_x | end_shot_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | c577e730-b9f5-44f2-9257-9e7730c23d7b | 3895302 | 176 | 8826.0 | 176 | 6 | 48 | 1 | Open Play | Normal | Right Foot | From Free Kick | False | True | False | 0.056644 | 100.4 | 35.1 | 101.6 | 35.2 |
| 1 | bbc2c68d-c096-483d-abf4-32c0175a0f55 | 3895302 | 904 | 38004.0 | 904 | 7 | 40 | 1 | Open Play | Normal | Left Foot | Regular Play | True | True | False | 0.143381 | 114.6 | 33.5 | 118.1 | 35.7 |
| 2 | 12b5206b-9ed0-4b1e-9ec3-f2028187e09f | 3895302 | 176 | 51769.0 | 176 | 11 | 8 | 1 | Open Play | Normal | Left Foot | From Free Kick | False | True | False | 0.038188 | 106.2 | 55.8 | 113.4 | 46.8 |


### Feature Engineering: Shot Distance and Shot Angle

Before the preprocessing pipeline, the dataset is enriched with two engineered features derived from `loc_x`, `loc_y`:

- **Shot Distance**: the Euclidean distance between the shot location and the center of the goal (x=120, y=40). This feature captures how far the shooter was from the target. 

- **Shot Angle**: the angle under which the shooter sees the two goalposts (left post at x=120, y=36; right post at x=120, y=44). This represents the scoring angle. In general, wider angles correspond to higher scoring probabilities.
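
Written out for a shot at $(x, y)$, with the goal line at $x = 120$ and the posts at $y = 36$ and $y = 44$, the two definitions above correspond to:

$$
d = \sqrt{(120 - x)^2 + (40 - y)^2}
$$

$$
\theta = \left|\,\operatorname{atan2}(44 - y,\ 120 - x) - \operatorname{atan2}(36 - y,\ 120 - x)\,\right|
$$

As a worked example, a shot from $(108, 40)$, directly in front of goal, gives $d = \sqrt{12^2 + 0^2} = 12$ and $\theta = 2\arctan(4/12) \approx 0.6435$ rad. In the implementation, $\theta$ is set to $0$ whenever $x \geq 120$, i.e. when the shot location lies on or beyond the goal line.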

In [8]:
# Define pitch and goal dimensions (StatsBomb units: 120x80)
GOAL_X = 120
GOAL_Y_TOP = 44
GOAL_Y_BOTTOM = 36
GOAL_CENTER_Y = 40

# Compute distance to the center of the goal
# Formula: d = sqrt((GOAL_X - x)^2 + (GOAL_CENTER_Y - y)^2)
def compute_distance(x, y):
    """Euclidean distance from shot location to the goal center."""
    return np.sqrt((GOAL_X - x) ** 2 + (GOAL_CENTER_Y - y) ** 2)

# Compute shooting angle
def compute_angle(x, y):
    """
    Compute the visible angle of the goal from the shot location.
    
    Idea:
    - From the shooter's point (x,y), we draw two lines: one to the top post and one to the bottom post
    - The wider the separation between these two lines, the larger the goal appears
    - If the player is behind the goal line, the angle is set to zero

    Why atan2?
    - atan2(dy, dx) gives the angle of a line with respect to the x-axis
    - Unlike arctan, atan2 considers the signs of dx and dy, so it works correctly in all directions
    - We compute the angle to the top and bottom posts separately and then take their difference

    Range of resulting angles: [0, π]
    """
    dx = GOAL_X - x
    if dx <= 0:  # if the shot is taken beyond or on the goal line
        return 0.0
    angle_top = np.arctan2((GOAL_Y_TOP - y), dx)
    angle_bottom = np.arctan2((GOAL_Y_BOTTOM - y), dx)
    return abs(angle_top - angle_bottom)

# Add new features using loc_x and loc_y
ds0["distance_to_goal"] = ds0.apply(lambda row: compute_distance(row["loc_x"], row["loc_y"]), axis=1)
ds0["angle_to_goal"] = ds0.apply(lambda row: compute_angle(row["loc_x"], row["loc_y"]), axis=1)

# Display results
print("New features: distance to goal & angle to goal")
print(ds0[["loc_x", "loc_y", "distance_to_goal", "angle_to_goal"]].head(10))

New features: distance to goal & angle to goal
   loc_x  loc_y  distance_to_goal  angle_to_goal
0  100.4   35.1         20.203218       0.380357
1  114.6   33.5          8.450444       0.662204
2  106.2   55.8         20.978084       0.254675
3  113.9   47.4          9.590099       0.570985
4   89.2   42.5         30.901294       0.256650
5  110.2   32.6         12.280065       0.526782
6  105.4   45.1         15.465122       0.482167
7  108.0   40.0         12.000000       0.643501
8  101.5   47.5         19.962465       0.369187
9  116.3   46.0          7.049113       0.720865


In [9]:
# Min and Max values for distance and angle
print("Min and Max values for distance to goal:")
print(ds0["distance_to_goal"].min(), ds0["distance_to_goal"].max())

print("\nMin and Max values for angle to goal:")
print(ds0["angle_to_goal"].min(), ds0["angle_to_goal"].max())

Min and Max values for distance to goal:
0.3999999999999986 92.80086206496145

Min and Max values for angle to goal:
0.0 2.9427443102583553


### Preprocessing Pipeline for DS0 

In the final preprocessing step, the remaining features are rescaled and encoded so that all model inputs are numeric and on comparable scales.  

1. **Normalize coordinate and spatio-temporal features**  
   - Columns: `loc_x`, `loc_y`, `end_shot_x`, `end_shot_y`, `minute`, `second`, `distance_to_goal`, `angle_to_goal`  
   - Scaled into the `[0,1]` range using `MinMaxScaler` to ensure comparability.  

2. **One-Hot Encode categorical features**  
   - Columns: `period`, `shot_type`, `shot_technique`, `shot_body_part`, `play_pattern`  
   - Expanded into 0/1 indicator variables suitable for machine learning models.  

3. **Convert Boolean features to integers (0/1)**  
   - Guarantees consistency with numerical and encoded features.  

4. **Confirm final structure of DS0**  
   - Contains:  
     - **Numerical features**: normalized `minute` and `second`, normalized coordinates, normalized `distance_to_goal` and `angle_to_goal`, binary flags.  
     - **Encoded categorical features**: 0/1 variables (including `period`).  
     - **Target variable**: `shot_statsbomb_xg` (expected goals).  
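
For reference, min-max scaling maps each value to `(x - min) / (max - min)` within its column. The sketch below applies that formula with NumPy to a made-up `minute` column. The bounds 0 and 139 are an assumption, chosen so that minute 6 maps to roughly 0.0432 as in the preview of the processed data, and the helper `min_max_scale` is a name introduced here rather than part of the pipeline.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to [0, 1] using its own min and max."""
    values = np.asarray(values, dtype="float64")
    lo, hi = values.min(), values.max()
    if hi == lo:                      # constant column: avoid division by zero
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

minute = np.array([0, 6, 7, 11, 139])   # illustrative values only
scaled = min_max_scale(minute)
print(np.round(scaled, 6))
```

Because the bounds come from the data itself, a single extreme value (such as a very late minute) compresses all other values toward zero.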

In [10]:
from sklearn.preprocessing import MinMaxScaler

# 1) Normalize numeric features (keep them as float32)
numeric_cols = ["loc_x", "loc_y", "end_shot_x", "end_shot_y", 
              "minute", "second", "distance_to_goal", "angle_to_goal"]

scaler = MinMaxScaler()
ds0[numeric_cols] = scaler.fit_transform(ds0[numeric_cols]).astype("float32")

print("Numeric ranges after normalization:")
print(ds0[numeric_cols].describe().T)

# 2) One-Hot Encode categorical features (including 'period')
categorical_cols = ["period", "shot_type", "shot_technique", "shot_body_part", "play_pattern"]
ds0 = pd.get_dummies(ds0, columns=categorical_cols)

# 2.1) Remove spaces in column names (replace with underscore)
ds0.columns = ds0.columns.str.replace(" ", "_")

# 2.2) Convert categories (one-hot) to int8
# Match on the column-name prefix rather than on any substring
prefixes = tuple(f"{c}_" for c in categorical_cols)
cat_cols = [c for c in ds0.columns if c.startswith(prefixes)]
ds0[cat_cols] = ds0[cat_cols].astype("int8")

# 3) Convert binary flags to int8
bin_flags = ["under_pressure", "shot_first_time", "shot_one_on_one"]
ds0[bin_flags] = ds0[bin_flags].astype("int8")

# 4) Final check
print("\nDTYPE SUMMARY AFTER PREPROCESSING")
print(ds0.dtypes)
print("\nDS0 shape:", ds0.shape)

ds0.head(3)


Numeric ranges after normalization:
                    count      mean       std  min       25%       50%  \
loc_x             88023.0  0.812739  0.097400  0.0  0.743079  0.828350   
loc_y             88023.0  0.495153  0.124037  0.0  0.404015  0.498118   
end_shot_x        88023.0  0.917267  0.125201  0.0  0.880374  0.981308   
end_shot_y        88023.0  0.499401  0.088085  0.0  0.451815  0.499374   
minute            88023.0  0.353498  0.196186  0.0  0.187050  0.352518   
second            88023.0  0.499131  0.294107  0.0  0.237288  0.491525   
distance_to_goal  88023.0  0.203250  0.094672  0.0  0.125544  0.195097   
angle_to_goal     88023.0  0.150679  0.093504  0.0  0.089166  0.117194   

                       75%  max  
loc_x             0.890365  1.0  
loc_y             0.585947  1.0  
end_shot_x        1.000000  1.0  
end_shot_y        0.546934  1.0  
minute            0.517986  1.0  
second            0.745763  1.0  
distance_to_goal  0.271589  1.0  
angle_to_goal     0.18705

|  | event_id | match_id | team_id | player_id | possession_team_id | minute | second | under_pressure | shot_first_time | shot_one_on_one | ... | shot_body_part_Right_Foot | play_pattern_From_Corner | play_pattern_From_Counter | play_pattern_From_Free_Kick | play_pattern_From_Goal_Kick | play_pattern_From_Keeper | play_pattern_From_Kick_Off | play_pattern_From_Throw_In | play_pattern_Other | play_pattern_Regular_Play |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | c577e730-b9f5-44f2-9257-9e7730c23d7b | 3895302 | 176 | 8826.0 | 176 | 0.043165 | 0.813559 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | bbc2c68d-c096-483d-abf4-32c0175a0f55 | 3895302 | 904 | 38004.0 | 904 | 0.05036 | 0.677966 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 12b5206b-9ed0-4b1e-9ec3-f2028187e09f | 3895302 | 176 | 51769.0 | 176 | 0.079137 | 0.135593 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


### Final Step – Target Definition and Save

At this stage, we define the target variable for our models.  
The **expected goals (xG)** value, provided by StatsBomb as `shot_statsbomb_xg`, is renamed to **`target_xg`** for clarity and consistency.  

The final preprocessed dataset is then saved as `DS0.csv` to ensure reproducibility and easy access for model training.


In [11]:
# Rename target column
ds0 = ds0.rename(columns={"shot_statsbomb_xg": "target_xg"})

# Save final dataset
output_path = "../task1_xg/data/DS0.csv"
ds0.to_csv(output_path, index=False)

print(f"Dataset saved successfully as {output_path}")
print("Final shape:", ds0.shape)


Dataset saved successfully as ../task1_xg/data/DS0.csv
Final shape: (88023, 47)


## DS1: Enrichment with Competition Gender

To construct **DS1**, DS0 is enriched with **contextual information** about each match, specifically the **gender** of the competition.
Including the `gender` category lets the model capture systematic differences between men's and women's competitions (such as tactical patterns, pace of play, or physical dynamics) that may influence shot outcomes and expected goals (xG).

The workflow operates as follows:

1. DS0 already includes `match_id` for every shot event.
2. For each competition-season pair, retrieve match data via `sb.matches()`. Each returned match record includes `match_id`, as well as text fields `competition` and `season`.
3. The `competition` field in the `matches` output appears in a format like `"Spain - La Liga"`. To align it with the competition names in `sb.competitions()` (column `competition_name`), the code splits the string on the dash (`-`), keeps the second part (e.g. `"La Liga"`), strips surrounding whitespace, and renames the column to `competition_name`.
4. After standardizing the competition and season names, merge with the `sb.competitions()` metadata (which includes `competition_gender`) using `[competition_name, season_name]` as keys.
5. The result of this join yields a mapping `match_id → gender`, which is then merged back into DS0 to produce **DS1**, containing the additional column `gender`.

As before, `match_id` is used solely as an **enrichment key** and is not included as a predictive feature in the model.
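
The join logic in steps 2 to 5 can be illustrated offline. The sketch below replaces the `sb.matches()` and `sb.competitions()` calls with small hand-made frames (all ids, names, and seasons are invented) and adds `validate="many_to_one"` as an extra safeguard that is not part of the original workflow.

```python
import pandas as pd

# Stand-ins for sb.competitions() and sb.matches(); values are invented
competitions = pd.DataFrame({
    "competition_name": ["La Liga", "FA Women's Super League"],
    "season_name": ["2020/2021", "2020/2021"],
    "competition_gender": ["male", "female"],
})
matches = pd.DataFrame({
    "match_id": [1, 2, 3],
    "competition": ["Spain - La Liga", "Spain - La Liga",
                    "England - FA Women's Super League"],
    "season": ["2020/2021", "2020/2021", "2020/2021"],
})
shots = pd.DataFrame({"match_id": [1, 1, 3], "target_xg": [0.05, 0.30, 0.12]})

# Step 3: "Country - Name" -> "Name", then align column names
matches["competition_name"] = (
    matches["competition"].str.split("-").str[1].str.strip()
)
matches = matches.rename(columns={"season": "season_name"})

# Step 4: attach gender via (competition_name, season_name)
match_gender = (
    matches.merge(competitions, on=["competition_name", "season_name"],
                  how="left", validate="many_to_one")
           [["match_id", "competition_gender"]]
           .rename(columns={"competition_gender": "gender"})
           .drop_duplicates()
)

# Step 5: merge the match_id -> gender mapping back onto the shots
enriched = shots.merge(match_gender, on="match_id", how="left",
                       validate="many_to_one")
print(enriched)
```

With `validate="many_to_one"`, pandas raises an error if a key is duplicated on the right-hand side, which would otherwise silently multiply shot rows.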


In [44]:
import pandas as pd
from statsbombpy import sb

# Load DS0
ds0 = pd.read_csv("../task1_xg/data/DS0.csv")

# Retrieve all competitions data, keeping only the relevant columns
competitions = sb.competitions()
competitions = competitions[[
    "competition_id", 
    "season_id", 
    "competition_name", 
    "season_name", 
    "competition_gender"
]]

# Extract the set of unique match_ids from DS0
unique_match_ids = set(ds0["match_id"].unique())

# Container for match metadata
match_data = []

# Iterate over each competition-season pair
for _, row in competitions.iterrows():
    comp_id = row["competition_id"]
    season_id = row["season_id"]

    try:
        # Retrieve all matches metadata for this competition-season
        matches = sb.matches(competition_id=comp_id, season_id=season_id)

        # Important step: StatsBomb sometimes encodes competition names as "country - name"
        # Split once on the " - " separator and keep only the competition name,
        # so that hyphens inside the name itself are preserved
        # Example: "Spain - La Liga" ==> "La Liga"
        matches["competition"] = matches["competition"].str.split(" - ", n=1).str[1].str.strip()

        # Rename columns to align with competitions DataFrame
        matches = matches.rename(columns={
            "competition": "competition_name",
            "season": "season_name",
        })

        # Keep only the matches that are relevant for our DS0 dataset
        matches = matches[matches["match_id"].isin(unique_match_ids)]

        if not matches.empty:
            # Merge with competitions metadata to attach the gender
            # The join is done on competition_name and season_name,
            # because match_id is unique but does not directly exist in competitions
            matches = matches.merge(
                competitions,
                on=["competition_name", "season_name"],
                how="left"
            )
            match_data.append(matches)
    except Exception as e:
        print(f"Error fetching comp={comp_id}, season={season_id}: {e}")

# Concatenate all collected matches into a single DataFrame
df_matches = pd.concat(match_data, ignore_index=True)

# Keep only match_id and gender, removing duplicates
df_matches = df_matches[["match_id", "competition_gender"]].drop_duplicates()

# Rename column for clarity
df_matches = df_matches.rename(columns={"competition_gender": "gender"})

# Merge gender information into DS0 → producing DS1
ds1 = ds0.merge(df_matches, on="match_id", how="left")

print("DS1 shape:", ds1.shape)
print()
print(ds1[["match_id", "gender"]].head())


DS1 shape: (88023, 48)

   match_id gender
0   3895302   male
1   3895302   male
2   3895302   male
3   3895302   male
4   3895302   male


In [None]:
# Verify if there are NaN values in gender
missing_gender = ds1["gender"].isna().sum()
print(f"Number of NaN values in gender: {missing_gender}")

# Show rows with missing gender (if any)
if missing_gender > 0:
    display(ds1[ds1["gender"].isna()])


Number of NaN values in gender: 0


#### Normalization of Gender Column

To facilitate its use in **predictive models**, the categorical variable `gender` (with values `"male"` or `"female"`) is transformed into a **binary numerical feature**:

- `"male"` --> **1**  
- `"female"` -->  **0**

The new column is stored as `gender_binary` with data type `int8` to keep memory usage low.  
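
Because `Series.map` returns `NaN` for any label missing from the dictionary, and casting `NaN` to `int8` fails, it can help to check the labels first. The following is a small sketch on invented data; the helper name `encode_gender` is hypothetical.

```python
import pandas as pd

GENDER_MAP = {"male": 1, "female": 0}

def encode_gender(series, mapping=GENDER_MAP):
    """Map gender labels to 0/1, failing loudly on unexpected or missing labels."""
    normalized = series.str.lower().str.strip()
    unknown = set(normalized.dropna().unique()) - set(mapping)
    if unknown or normalized.isna().any():
        raise ValueError(f"Unmapped gender values: {sorted(unknown)} (or missing entries)")
    return normalized.map(mapping).astype("int8")

toy = pd.Series(["male", "Female", "male"])
gender_binary = encode_gender(toy)
print(gender_binary.tolist())
```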

In [54]:
# Normalize gender column to binary 0/1
# Let's map Male → 1, Female → 0
ds1["gender_binary"] = ds1["gender"].str.lower().map({
    "male": 1,
    "female": 0
}).astype("int8")

ds1.drop(columns=["gender"], inplace=True)

# Verify the distribution
print(ds1["gender_binary"].value_counts(dropna=False))

gender_binary
1    73874
0    14149
Name: count, dtype: int64


In [55]:
# Save DS1 to disk
ds1.to_csv("../task1_xg/data/DS1.csv", index=False)