## Expected Goals (xG) Model - Preprocessing

This notebook showcases the preprocessing stage, the starting point of the xG model project once the data has been imported. <br>
For a detailed description of the selected features, see the <a href="Features selected.md">Features selected.md</a> file.

In [1]:
import numpy as np
import pandas as pd
import math
import ast
import pyspark
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import DataFrame, SparkSession
from pyspark.ml.feature import VectorAssembler

In [2]:
import sys
import os

# Add project root to path for custom modules
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# file holding all the constants used in this project
from src.constants import *

In [4]:
# Initialize Spark Session
spark = SparkSession.builder \
    .appName("PreprocessingDemo") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

print("✅ Spark Session initialized successfully!")
print(f"Spark Version: {spark.version}")

✅ Spark Session initialized successfully!
Spark Version: 4.0.0


In [14]:
# Reading data
df = spark.read.csv('../data/split_data/events/Europe - Champions League.csv',  # Load the CSV file containing Champions League events data
                    sep=';', # Specify semicolon as separator (common in European CSVs)
                    inferSchema=True, # Automatically infer column data types (e.g., int, string, double)
                    header=True) # Use the first row as column headers
df.show(n=10,truncate=False)


### Preprocessing global schema
The preprocessing starts with `__init__`, which:
- Initializes the parameters (`spark`, `df`, `season`, `events`, etc.)
- Filters by season, if one is specified
- Keeps only shots and passes leading to shots
- Creates the UDFs for shot angle and player counting
- If `full_pp=True`, calls `preprocess()` immediately to execute the full preprocessing pipeline
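
The two filtering rules above can be illustrated on plain Python dictionaries, without Spark. This is only a minimal sketch: the helper `keep_event`, the sample events, and the season labels are hypothetical and exist purely for illustration; the notebook itself applies the same conditions with `DataFrame.filter`.

```python
from typing import Any, Optional


def keep_event(event: dict, season: Optional[Any] = None) -> bool:
    """Return True if an event passes the season filter and the shot/assist filter."""
    # Season filter: only applied when a season is specified
    if season is not None and event.get("season") != season:
        return False
    # Keep shots, and passes that carry the id of the shot they assisted
    is_shot = event.get("type") == "Shot"
    assists_shot = event.get("pass_assisted_shot_id") is not None
    return is_shot or assists_shot


# Hypothetical sample events (season labels are made up)
events = [
    {"type": "Shot", "season": "A", "pass_assisted_shot_id": None},
    {"type": "Pass", "season": "A", "pass_assisted_shot_id": "some-shot-id"},
    {"type": "Pass", "season": "A", "pass_assisted_shot_id": None},
    {"type": "Shot", "season": "B", "pass_assisted_shot_id": None},
]

kept = [e for e in events if keep_event(e, season="A")]
print([e["type"] for e in kept])
```

Only the first two events survive: the third is a pass that did not lead to a shot, and the fourth belongs to another season.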

In [15]:
# 1. initializing
df = df.filter(df.season == SEASON)
df = df.filter((df.type == 'Shot') | (df.pass_assisted_shot_id.isNotNull()))\
                         .select(EVENTS)
df.show(n=5, truncate=False)

+------------------------------------+--------+------+------+----+------+-------------+---------------+---------+--------------------+---------------------+--------------+--------------+--------------+---------+-----------------+--------------+---------------+---------------+---------------+--------------+--------------------+-----------------+------------+--------------+-----------------+------------------------------------+-----------+-----------+----------+---------------+----------+-------------+-----------+-----------------+---------------+----------------+-------------+-------------+
|id                                  |match_id|minute|second|type|period|location     |team           |player_id|player              |position             |play_pattern  |shot_body_part|shot_technique|shot_type|shot_freeze_frame|under_pressure|shot_aerial_won|shot_first_time|shot_one_on_one|shot_open_goal|shot_follows_dribble|shot_statsbomb_xg|shot_outcome|pass_body_part|shot_end_location|pass_assist

The `preprocess` function drives the whole pipeline by calling the preprocessing methods sequentially.<br>
The first method called is `spatial_data`, which works as follows:

In [None]:
# 2. Location Splitting Demonstration
def demonstrate_split_location(df):
    """Show how location strings are split into X,Y coordinates"""
    print("📍 LOCATION SPLITTING DEMONSTRATION")
    print("=" * 50)
    
    print("🔍 Original location format:")
    df.select("id", "location").show(truncate=False)
    
    print("\n🔧 Extracting X coordinates...")
    df_with_x = df.withColumn("shot_location_x",
                             F.regexp_extract(F.col("location"), r'\[(.*?),', 1).cast("float"))
    
    print("🔧 Extracting Y coordinates...")
    # Allow any amount of whitespace after the comma, so "[90.0,29.0]" parses too
    df_with_xy = df_with_x.withColumn("shot_location_y",
                                     F.regexp_extract(F.col("location"), r',\s*(.*?)\]', 1).cast("float"))
    
    print("\n✅ Result:")
    df_with_xy.select("id", "location", "shot_location_x", "shot_location_y").show()
        
    return df_with_xy

# Run the demonstration
df_with_coords = demonstrate_split_location(df.limit(5))

📍 LOCATION SPLITTING DEMONSTRATION
🔍 Original location format:
+------------------------------------+-------------+
|id                                  |location     |
+------------------------------------+-------------+
|ae487d2d-c72a-422c-b257-cf1b1a8a9752|[120.0, 79.0]|
|572f93c1-e160-4405-b6f5-35b005ce86ee|[100.0, 63.0]|
|080b5b50-dbcc-4fbd-8f9f-7c9c1e2e67a8|[90.0, 29.0] |
|6fa1c5be-0b71-4a4d-8e89-55ef6e75e4dc|[106.0, 36.0]|
|26e6fb35-605e-4ade-9975-42698778faec|[51.0, 58.0] |
+------------------------------------+-------------+


🔧 Extracting X coordinates...
🔧 Extracting Y coordinates...

✅ Result:
+--------------------+-------------+---------------+---------------+
|                  id|     location|shot_location_x|shot_location_y|
+--------------------+-------------+---------------+---------------+
|ae487d2d-c72a-422...|[120.0, 79.0]|          120.0|           79.0|
|572f93c1-e160-440...|[100.0, 63.0]|          100.0|           63.0|
|080b5b50-dbcc-4fb...| [90.0, 29.0]|      
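
The splitting step can also be reproduced in plain Python to check the parsing logic on a single string. The sketch below is not the notebook's implementation: `split_location` and `LOCATION_PATTERN` are hypothetical names, and the pattern is a slightly more tolerant variant of the two regular expressions used above. Returning `None` for unparseable input is meant to mirror the null that Spark produces when the extracted text cannot be cast to a float.

```python
import re
from typing import Optional, Tuple

# Captures the first two comma-separated values inside the brackets
LOCATION_PATTERN = re.compile(r"\[\s*([^,\]]+),\s*([^,\]]+)")


def split_location(location: Optional[str]) -> Tuple[Optional[float], Optional[float]]:
    """Split a location string such as '[120.0, 79.0]' into (x, y) floats."""
    if location is None:
        return None, None
    match = LOCATION_PATTERN.search(location)
    if match is None:
        return None, None
    try:
        return float(match.group(1)), float(match.group(2))
    except ValueError:
        # Text was found between the brackets but it is not numeric
        return None, None


print(split_location("[120.0, 79.0]"))
print(split_location("[90.0,29.0]"))
print(split_location(None))
```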

Having the exact coordinates of a shot will allow us to calculate the distance and angle to goal.

In [None]:
# 3. Distance to Goal Calculation
def demonstrate_distance_calculation(df):
    """Show how distance to goal is calculated"""
    print("📏 DISTANCE TO GOAL CALCULATION")
    print("=" * 50)
        
    print(f"🥅 Goal center position: ({GOAL_X}, {(GOAL_Y1 + GOAL_Y2) / 2})")
    
    # Calculate distance using Euclidean formula
    df_with_distance = df.withColumn("distance_to_goal",
                                    F.round(F.sqrt(
                                        F.pow(F.col("shot_location_x") - F.lit(GOAL_X), 2) +
                                        F.pow(F.col("shot_location_y") - F.lit((GOAL_Y1 + GOAL_Y2) / 2), 2)
                                    ), 2))
    
    print("\n📊 Distance calculations:")
    result = df_with_distance.select("id", "shot_location_x", "shot_location_y", "distance_to_goal").collect()
    
    for row in result:
        if row.shot_location_x is not None:
            print(f"  Shot {row.id}: Position ({row.shot_location_x}, {row.shot_location_y}) → Distance: {row.distance_to_goal}")
    
    return df_with_distance

# Run the demonstration
df_with_distance = demonstrate_distance_calculation(df_with_coords)

📏 DISTANCE TO GOAL CALCULATION
🥅 Goal center position: (120, 40.0)

📊 Distance calculations:
  Shot ae487d2d-c72a-422c-b257-cf1b1a8a9752: Position (120.0, 79.0) → Distance: 39.0
  Shot 572f93c1-e160-4405-b6f5-35b005ce86ee: Position (100.0, 63.0) → Distance: 30.48
  Shot 080b5b50-dbcc-4fbd-8f9f-7c9c1e2e67a8: Position (90.0, 29.0) → Distance: 31.95
  Shot 6fa1c5be-0b71-4a4d-8e89-55ef6e75e4dc: Position (106.0, 36.0) → Distance: 14.56
  Shot 26e6fb35-605e-4ade-9975-42698778faec: Position (51.0, 58.0) → Distance: 71.31
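
The distances printed above can be cross-checked with a few lines of plain Python. This is a minimal sketch: the helper name `distance_to_goal` is hypothetical, and the goal constants are redefined locally with the values this notebook uses (goal line at x = 120, posts at y = 36 and y = 44), so that the block runs on its own.

```python
import math

# Goal geometry as used in this notebook (redefined here so the block is self-contained)
GOAL_X = 120.0
GOAL_Y1, GOAL_Y2 = 36.0, 44.0


def distance_to_goal(x: float, y: float) -> float:
    """Euclidean distance from (x, y) to the goal center, rounded to 2 decimals."""
    goal_center_y = (GOAL_Y1 + GOAL_Y2) / 2
    return round(math.hypot(x - GOAL_X, y - goal_center_y), 2)


# Positions taken from the output above
for x, y in [(120.0, 79.0), (100.0, 63.0), (90.0, 29.0), (106.0, 36.0), (51.0, 58.0)]:
    print(f"({x}, {y}) -> {distance_to_goal(x, y)}")
```

For example, the second position gives sqrt(20² + 23²) = sqrt(929) ≈ 30.48, which matches the Spark result.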


In [28]:
# Goal geometry in pitch coordinates: goal line at x = 120, posts at y = 36 and y = 44
GOAL_X = 120
GOAL_Y1 = 36
GOAL_Y2 = 44

def shot_angle(shot_x, shot_y, GOAL_X=120, GOAL_Y1=36, GOAL_Y2=44):
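    """
    Calculate the angle (in degrees) between the vectors from the shot
    location to each goal post. Returns None when a coordinate is missing.
    """
    # Events without a parsed location reach the UDF as None
    if shot_x is None or shot_y is None:
        return None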
    print(f"  🎯 Calculating shot angle for position ({shot_x}, {shot_y})")
    print(f"  📍 Goal posts at ({GOAL_X}, {GOAL_Y1}) and ({GOAL_X}, {GOAL_Y2})")
    
    # Vectors from shot to each goal post
    u_x = GOAL_X - shot_x
    u_y = GOAL_Y1 - shot_y
    v_y = GOAL_Y2 - shot_y
    
    print(f"  📐 Vector to lower post: ({u_x}, {u_y})")
    print(f"  📐 Vector to upper post: ({u_x}, {v_y})")
    
    # Calculate dot product and magnitudes
    dot_product = u_x ** 2 + u_y * v_y
    magnitude_u = math.sqrt(u_x ** 2 + u_y ** 2)
    magnitude_v = math.sqrt(u_x ** 2 + v_y ** 2)

    if magnitude_u == 0 or magnitude_v == 0:
        return 0.0
    
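    # Calculate angle; clamp the cosine to [-1, 1] so that floating-point
    # rounding cannot push it outside the domain of acos
    cos_angle = max(-1.0, min(1.0, dot_product / (magnitude_u * magnitude_v)))
    angle_radians = math.acos(cos_angle)
    angle_degrees = math.degrees(angle_radians)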
    
    print(f"  📊 Final angle: {angle_degrees:.2f}°\n")
    return angle_degrees

# Test the shot angle calculation
print("🎯 TESTING SHOT ANGLE CALCULATION")
print("=" * 50)

shot_angle_udf = F.udf(lambda shot_x, shot_y: shot_angle(shot_x, shot_y, GOAL_X, GOAL_Y1, GOAL_Y2), T.FloatType())

df = df_with_distance.limit(3).withColumn("shot_angle",
                                     shot_angle_udf(df_with_distance.shot_location_x, df_with_distance.shot_location_y))
df.show()

🎯 TESTING SHOT ANGLE CALCULATION
+--------------------+--------+------+------+----+------+-------------+---------------+---------+--------------------+--------------+--------------+--------------+--------------+---------+-----------------+--------------+---------------+---------------+---------------+--------------+--------------------+-----------------+------------+--------------+-----------------+---------------------+-----------+-----------+----------+---------------+----------+-------------+-----------+-----------------+---------------+----------------+-------------+-------------+---------------+---------------+----------------+----------+
|                  id|match_id|minute|second|type|period|     location|           team|player_id|              player|      position|  play_pattern|shot_body_part|shot_technique|shot_type|shot_freeze_frame|under_pressure|shot_aerial_won|shot_first_time|shot_one_on_one|shot_open_goal|shot_follows_dribble|shot_statsbomb_xg|shot_outcome|pass_body_pa

  🎯 Calculating shot angle for position (120.0, 79.0)
  📍 Goal posts at (120, 36) and (120, 44)
  📐 Vector to lower post: (0.0, -43.0)
  📐 Vector to upper post: (0.0, -35.0)
  📊 Final angle: 0.00°

  🎯 Calculating shot angle for position (100.0, 63.0)
  📍 Goal posts at (120, 36) and (120, 44)
  📐 Vector to lower post: (20.0, -27.0)
  📐 Vector to upper post: (20.0, -19.0)
  📊 Final angle: 9.94°

  🎯 Calculating shot angle for position (90.0, 29.0)
  📍 Goal posts at (120, 36) and (120, 44)
  📐 Vector to lower post: (30.0, 7.0)
  📐 Vector to upper post: (30.0, 15.0)
  📊 Final angle: 13.43°
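
The angle between the two post vectors can also be computed with `atan2` on their cross and dot products instead of `acos` on the normalized dot product. The sketch below is an alternative formulation, not the function used in this notebook; the name `shot_angle_atan2` and the lower-case parameter names are hypothetical. It needs no division, so positions that coincide with a post fall out as 0 without a special case.

```python
import math


def shot_angle_atan2(shot_x: float, shot_y: float,
                     goal_x: float = 120.0,
                     goal_y1: float = 36.0,
                     goal_y2: float = 44.0) -> float:
    """Angle in degrees subtended by the goal mouth, seen from (shot_x, shot_y)."""
    # Vectors from the shot location to each goal post
    u_x, u_y = goal_x - shot_x, goal_y1 - shot_y
    v_x, v_y = goal_x - shot_x, goal_y2 - shot_y

    dot = u_x * v_x + u_y * v_y
    cross = u_x * v_y - u_y * v_x  # signed area spanned by the two vectors

    # atan2(|cross|, dot) is the unsigned angle between u and v, in [0, pi]
    return math.degrees(math.atan2(abs(cross), dot))


# Positions taken from the output above
for x, y in [(120.0, 79.0), (100.0, 63.0), (90.0, 29.0)]:
    print(f"({x}, {y}) -> {shot_angle_atan2(x, y):.2f}°")
```

For the three positions shown above this gives the same angles as the `acos` version (0.00°, 9.94°, and 13.43°).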

