# Preprocessing

Load World Cup 2018 data from StatsBomb JSON, filter to shots, and create the cleaned shot-level dataset with engineered features (x, y, distance, angle, etc.). Finally, save the result to CSV for use in the modeling notebook.

## Imports and configuration

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import json

pd.set_option('display.max_columns', None)


## Load matches and events for World Cup 2018

This uses your local copy of the StatsBomb open-data repository. Adjust `BASE` if the path is different on your machine.

In [None]:
# Base path to StatsBomb open-data on your machine
BASE = Path(r"C:\Users\traik\Desktop\Final project data\open-data-master\data")

# World Cup 2018: competition_id=43, season_id=3
matches_file = BASE / "matches" / "43" / "3.json"

with open(matches_file, "r", encoding="utf-8") as f:
    matches_wc = json.load(f)

match_ids_wc = [m["match_id"] for m in matches_wc]

events_folder = BASE / "events"
all_events_wc = []

for mid in match_ids_wc:
    fp = events_folder / f"{mid}.json"
    with open(fp, "r", encoding="utf-8") as f:
        all_events_wc.extend(json.load(f))

df_wc = pd.json_normalize(all_events_wc, sep="_")
df_wc.head()
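
The per-match event files are identified only by their file name, so once all events are concatenated it can be hard to tell which match a shot came from. A minimal sketch of a loader that tags every event with its match, assuming the same directory layout as above; `load_events` and its parameters are hypothetical names for illustration, not part of any StatsBomb tooling:

```python
import json
from pathlib import Path

import pandas as pd


def load_events(base, competition_id, season_id):
    """Load all events for one competition season into a flat DataFrame.

    Hypothetical helper: assumes the open-data layout used above
    (matches/<competition_id>/<season_id>.json and events/<match_id>.json).
    """
    base = Path(base)
    matches_file = base / "matches" / str(competition_id) / f"{season_id}.json"
    with open(matches_file, "r", encoding="utf-8") as f:
        matches = json.load(f)

    events = []
    for match in matches:
        fp = base / "events" / f"{match['match_id']}.json"
        if not fp.exists():
            # Skip matches without an event file instead of failing mid-loop
            continue
        with open(fp, "r", encoding="utf-8") as f:
            for event in json.load(f):
                # Tag each event so shots can be traced back to their match
                event["match_id"] = match["match_id"]
                events.append(event)

    return pd.json_normalize(events, sep="_")
```

With this helper, the cell above would reduce to `df_wc = load_events(BASE, 43, 3)`.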


## Filter to shots only

In [None]:
df_shots = df_wc[df_wc["type_name"] == "Shot"].copy()
df_shots.shape


## Select relevant columns

These are the columns used for modeling in the original notebook.

In [None]:
cols = [
    "location",
    "counterpress",
    "shot_statsbomb_xg",
    "shot_end_location",
    "shot_type_id",
    "shot_technique_id",
    "shot_outcome_id",
    "shot_body_part_id",
    "shot_open_goal",
    "shot_first_time",
    "shot_one_on_one",
    "shot_aerial_won",
]

shots = df_shots[cols].copy()
shots.head()
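
Plain indexing with `df_shots[cols]` raises a `KeyError` if any listed column is absent, which could happen for sparse flags when a different competition is loaded. A small sketch of a more forgiving selection using `DataFrame.reindex`, shown on toy data (the `toy` frame and its values are made up for illustration):

```python
import pandas as pd

# Toy frame that lacks the "shot_open_goal" column entirely
toy = pd.DataFrame({
    "location": [[100.0, 40.0], [95.0, 30.0]],
    "shot_first_time": [True, None],
})

wanted = ["location", "shot_first_time", "shot_open_goal"]

# reindex keeps the requested order and fills absent columns with NaN
selected = toy.reindex(columns=wanted)

# Report which requested columns were not in the source frame
missing = [c for c in wanted if c not in toy.columns]
print(missing)
```

Whether a silently added all-`NaN` column is preferable to a hard failure depends on the use case; printing `missing` keeps the gap visible.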


## Clean boolean shot flags

In `shot_open_goal`, `shot_first_time`, `shot_one_on_one`, and `shot_aerial_won`, missing values (`NaN`) are replaced with 0 and the columns are cast to integer 0/1.

In [None]:
binary_cols = [
    "shot_open_goal",
    "shot_first_time",
    "shot_one_on_one",
    "shot_aerial_won",
]

for col in binary_cols:
    shots[col] = shots[col].fillna(0).astype(int)

shots[binary_cols].head()
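
To see what the cast does, here is a toy column shaped like one of these flags, assuming the value is `True` where the flag was recorded and missing otherwise (values invented for illustration). Comparing against `True` is shown as an equivalent form that does not rely on `fillna`:

```python
import numpy as np
import pandas as pd

# Toy flag column: True where the flag was recorded, NaN elsewhere
flag = pd.Series([True, np.nan, True, np.nan], dtype=object)

# Approach used above: missing -> 0, then cast to int
as_int = flag.fillna(0).astype(int)

# Alternative sketch: anything that is not literally True becomes 0
as_int_alt = flag.eq(True).astype(int)

print(as_int.tolist())
print(as_int_alt.tolist())
```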


## Drop unreliable / unused columns

Earlier exploration showed that `counterpress` is mostly missing and therefore unreliable, so we drop it here. `shot_end_location` is also excluded from the modeling features, since xG should not depend on where the ball ended up; it is dropped later, when the final dataframe is built.

In [None]:
# Drop counterpress if present
if "counterpress" in shots.columns:
    shots.drop(["counterpress"], axis=1, inplace=True)

shots.isna().sum()
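
How much of a column is missing can be quantified before deciding to drop it. A sketch on toy data; the values and the 50% threshold are arbitrary choices for illustration, not figures from this dataset:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "counterpress": [True, np.nan, np.nan, np.nan],
    "shot_statsbomb_xg": [0.10, 0.05, 0.30, 0.02],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns above the illustrative 50% threshold
to_drop = missing_share[missing_share > 0.5].index.tolist()

# errors="ignore" makes the drop safe to re-run on an already cleaned frame
toy = toy.drop(columns=to_drop, errors="ignore")
```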


## Extract shot location (x, y)

We convert the StatsBomb `location` list `[x, y]` into separate numeric `x` and `y` columns, then drop the original `location` column.

In [None]:
shots["x"] = shots["location"].apply(lambda loc: loc[0])
shots["y"] = shots["location"].apply(lambda loc: loc[1])

shots.drop("location", axis=1, inplace=True)
shots.head()
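
The two `apply` calls can also be replaced by a single expansion of the list column. A sketch on toy data, assuming every `location` holds exactly two numbers (a three-element entry would make the `columns=["x", "y"]` argument fail):

```python
import pandas as pd

toy = pd.DataFrame({"location": [[108.0, 40.0], [95.5, 28.3]]})

# Expand the [x, y] lists into two columns in one pass, keeping the index aligned
xy = pd.DataFrame(toy["location"].tolist(), index=toy.index, columns=["x", "y"])

# Swap the list column for the two numeric columns
toy = toy.drop(columns="location").join(xy)
```

Passing `index=toy.index` matters after filtering, because the shots frame no longer has a contiguous 0..n index.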


## Create goal label `is_goal`

In this dataset, `shot_outcome_id == 97` corresponds to a goal.

In [None]:
shots["is_goal"] = (shots["shot_outcome_id"] == 97).astype(int)
shots["is_goal"].value_counts()
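
The id-to-outcome mapping can be cross-checked against the outcome name, which should still be available in `df_shots` as `shot_outcome_name` given the `sep="_"` flattening used earlier. A sketch on toy data; the id/name pairs below are assumptions about the StatsBomb coding, not values read from the files:

```python
import pandas as pd

# Toy shots; id/name pairs are assumed, not taken from the real data
toy = pd.DataFrame({
    "shot_outcome_id": [97, 100, 98, 97],
    "shot_outcome_name": ["Goal", "Saved", "Off T", "Goal"],
})

by_id = (toy["shot_outcome_id"] == 97).astype(int)
by_name = (toy["shot_outcome_name"] == "Goal").astype(int)

# The two labels should agree on every row if 97 really means "Goal"
assert by_id.tolist() == by_name.tolist()

# Cross-tabulation of outcome name against the id-based label
print(pd.crosstab(toy["shot_outcome_name"], by_id))
```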


In [None]:
# Drop the outcome_id now that we have the label
shots.drop("shot_outcome_id", axis=1, inplace=True)
shots.head()


## Add geometry features: distance and angle

We compute:
- `distance`: straight-line distance, in pitch units, from the shot location to the centre of the goal
- `angle`: angle, in radians, between the lines from the shot location to the two goalposts

Pitch and goal coordinates follow the StatsBomb convention: x ∈ [0,120], y ∈ [0,80], goal centered at (120, 40).

In [None]:
# Goal coordinates (StatsBomb pitch); the goal mouth is 8 units wide,
# so the posts sit at y = 36 and y = 44
goal_x = 120
goal_y = 40
left_post_y = 36
right_post_y = 44

# Distance to goal centre
shots["distance"] = np.sqrt((goal_x - shots["x"])**2 + (goal_y - shots["y"])**2)

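def calc_angle(row):
    """Return the angle (radians) subtended by the two goalposts at the shot location."""
    x = row["x"]
    y = row["y"]
    # Bearing from the shot location to each post
    angle_left = np.arctan2(left_post_y - y, goal_x - x)
    angle_right = np.arctan2(right_post_y - y, goal_x - x)
    return abs(angle_right - angle_left)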

shots["angle"] = shots.apply(calc_angle, axis=1)
shots[["x", "y", "distance", "angle"]].head()
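
The same geometry can be computed without a row-wise `apply`, which is usually faster on larger frames. A sketch with a hypothetical `goal_angle` function; the post positions are parameters, and the defaults assume posts at y = 36 and y = 44 (an 8-unit goal mouth), so pass other values if different constants are used:

```python
import numpy as np
import pandas as pd


def goal_angle(x, y, goal_x=120.0, post_low_y=36.0, post_high_y=44.0):
    """Angle in radians subtended by the goal mouth at (x, y).

    Hypothetical helper; the default post positions are an assumption.
    """
    dx = goal_x - np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Difference between the bearings to the two posts
    return np.abs(np.arctan2(post_high_y - y, dx) - np.arctan2(post_low_y - y, dx))


# Three central shots at increasing and decreasing range (toy values)
toy = pd.DataFrame({"x": [108.0, 114.0, 90.0], "y": [40.0, 40.0, 40.0]})
toy["distance"] = np.hypot(120.0 - toy["x"], 40.0 - toy["y"])
toy["angle"] = goal_angle(toy["x"], toy["y"])
toy["angle_deg"] = np.degrees(toy["angle"])
print(toy)
```

For a central shot the result reduces to `2 * arctan(half_goal_width / distance)`, which gives a quick sanity check: the closer the shot, the wider the angle.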


## Create final featured shots dataframe

We drop `shot_end_location` so the features only describe the chance **before** the shot outcome. This final dataframe `shots_featured` will be saved to CSV and used in the modeling notebook.

In [None]:
shots_featured = shots.copy()

if "shot_end_location" in shots_featured.columns:
    shots_featured.drop(["shot_end_location"], axis=1, inplace=True)

shots_featured.head()


## Save preprocessed shots to CSV

In [None]:
output_path = Path("shots_featured_wc2018.csv")
shots_featured.to_csv(output_path, index=False)
print(f"Saved preprocessed shots to {output_path.resolve()}")
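
Before moving on to the modeling notebook, it can be worth checking that the file reads back as expected. A round-trip sketch on toy data written to a temporary directory; the file name and values are hypothetical:

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "shot_statsbomb_xg": [0.12, 0.76],
    "shot_first_time": [1, 0],
    "x": [108.0, 95.5],
    "y": [40.0, 28.3],
    "is_goal": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "shots_roundtrip.csv"  # hypothetical file name
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same columns, same row count, and values that survive the round trip
assert list(reloaded.columns) == list(toy.columns)
assert len(reloaded) == len(toy)
pd.testing.assert_frame_equal(reloaded, toy)
```

Writing with `index=False`, as above, is what keeps an extra unnamed index column from appearing on reload.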
