## Expected Goals (xG) Model - Preprocessing

This notebook walks through the preprocessing steps of the xG model project, starting right after the data has been imported. <br>
For a detailed description of the selected features, see the <a href="Features selected.md">Features selected.md</a> file.

In [1]:
import sys
import os

# Add project root to path for custom modules
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# file holding all the constants used in this project
from src.constants import *

In [2]:
import numpy as np
import pandas as pd
import math
import ast
import pyspark
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import DataFrame, SparkSession
from pyspark.ml.feature import VectorAssembler

In [4]:
# Initialize Spark Session
spark = SparkSession.builder \
    .appName("PreprocessingDemo") \
    .getOrCreate()

print("✅ Spark Session initialized successfully!")
print(f"Spark Version: {spark.version}")

✅ Spark Session initialized successfully!
Spark Version: 4.0.0


In [5]:
# Reading data
df = spark.read.csv('../data/split_data/events/Europe - Champions League.csv',  # Load the CSV file containing events data
                    sep=';', # Specify the separator
                    inferSchema=True, # Automatically infer column data types (e.g., int, string, double)
                    header=True) # Use the first row as column headers

df.show(n=10)


Preprocessing starts in `__init__`, which:
- Initializes the parameters (`spark`, `df`, `season`, `events`, etc.)
- Filters by season, if one is specified
- Keeps only shots and the passes that led to a shot
- Creates the UDFs for the shot angle and for player counting
- Calls `preprocess()` immediately to run the full preprocessing pipeline when `full_pp=True`

In [6]:
# 1. initializing
df = df.filter(df.season == SEASON)
df = df.filter((df.type == 'Shot') | (df.pass_assisted_shot_id.isNotNull()))\
                         .select(EVENTS)
df.show(n=10, truncate=False)

+------------------------------------+--------+------+------+----+------+-------------+---------------+---------+--------------------------+------------------------+--------------+--------------+--------------+---------+-----------------+--------------+---------------+---------------+---------------+--------------+--------------------+-----------------+------------+--------------+-----------------+------------------------------------+-----------+-----------+-----------+---------------+----------+-------------+-----------+-----------------+---------------+----------------+-------------+-------------+
|id                                  |match_id|minute|second|type|period|location     |team           |player_id|player                    |position                |play_pattern  |shot_body_part|shot_technique|shot_type|shot_freeze_frame|under_pressure|shot_aerial_won|shot_first_time|shot_one_on_one|shot_open_goal|shot_follows_dribble|shot_statsbomb_xg|shot_outcome|pass_body_part|shot_end_l

### Spatial data
The first method called is `spatial_data`, which works as follows:
1. Split `location` into x and y coordinates
2. Calculate the shot's distance to the goal
3. Calculate the shot's angle to the goal
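
As a quick sanity check of the three steps above, here is a minimal pure-Python sketch that runs without Spark. The helper names (`split_location`, `distance_to_goal`, `angle_to_goal`) are illustrative and not part of the project. The goal geometry (goal line at x = 120, posts at y = 36 and y = 44) is taken from the values printed by the cells in this notebook.

```python
import math

# Goal geometry as printed by the notebook cells (assumed to match src.constants)
GOAL_X, GOAL_Y1, GOAL_Y2 = 120, 36, 44


def split_location(location):
    """Turn a string such as '[100.0, 63.0]' into an (x, y) tuple of floats."""
    x, y = location.strip("[]").split(",")
    return float(x), float(y)


def distance_to_goal(x, y):
    """Euclidean distance from the shot to the center of the goal."""
    goal_center_y = (GOAL_Y1 + GOAL_Y2) / 2
    return math.hypot(GOAL_X - x, goal_center_y - y)


def angle_to_goal(x, y):
    """Angle in degrees between the vectors from the shot to each post."""
    ux, uy, vy = GOAL_X - x, GOAL_Y1 - y, GOAL_Y2 - y
    norm = math.hypot(ux, uy) * math.hypot(ux, vy)
    if norm == 0:
        return 0.0  # shot taken exactly from a post
    # Clamp so that float rounding cannot push the cosine outside [-1, 1]
    cos_angle = max(-1.0, min(1.0, (ux * ux + uy * vy) / norm))
    return math.degrees(math.acos(cos_angle))


x, y = split_location("[100.0, 63.0]")
print(round(distance_to_goal(x, y), 2))  # 30.48
print(round(angle_to_goal(x, y), 2))     # 9.94
```

Both printed values agree with the Spark outputs shown for the shot at (100.0, 63.0) further down.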

In [7]:
# 2. Location Splitting Demonstration

def demonstrate_split_location(df):
    """Show how location strings are split into X,Y coordinates"""
    print("📍 LOCATION SPLITTING DEMONSTRATION")
    print("=" * 50)
    
    print("🔍 Original location format:")
    df.select("id", "location").show(truncate=False)
    
    print("\n🔧 Extracting X coordinates...")
    df_with_x = df.withColumn("shot_location_x",
                             F.regexp_extract(F.col("location"), r'\[(.*?),', 1).cast("float"))
    
    print("🔧 Extracting Y coordinates...")
    df_with_xy = df_with_x.withColumn("shot_location_y",
                                     F.regexp_extract(F.col("location"), r', (.*?)\]', 1).cast("float"))
    
    print("\n✅ Result:")
    df_with_xy.select("id", "location", "shot_location_x", "shot_location_y").show()
        
    return df_with_xy

# Run the demonstration
df_with_coords = demonstrate_split_location(df.limit(10))

📍 LOCATION SPLITTING DEMONSTRATION
🔍 Original location format:
+------------------------------------+-------------+
|id                                  |location     |
+------------------------------------+-------------+
|ae487d2d-c72a-422c-b257-cf1b1a8a9752|[120.0, 79.0]|
|572f93c1-e160-4405-b6f5-35b005ce86ee|[100.0, 63.0]|
|080b5b50-dbcc-4fbd-8f9f-7c9c1e2e67a8|[90.0, 29.0] |
|6fa1c5be-0b71-4a4d-8e89-55ef6e75e4dc|[106.0, 36.0]|
|26e6fb35-605e-4ade-9975-42698778faec|[51.0, 58.0] |
|ecf8f42e-806c-46a7-931e-7f805a102669|[77.0, 22.0] |
|acc634bb-1f28-403d-a9f1-2c58bee509d1|[69.0, 62.0] |
|aadc8803-02d9-41c6-8605-25ead4d9db06|[74.0, 31.0] |
|5ab38751-0e6b-49c2-97f1-ad456b5285fa|[120.0, 1.0] |
|8fdd982e-051c-4b40-b844-66723d79eb11|[66.0, 34.0] |
+------------------------------------+-------------+


🔧 Extracting X coordinates...
🔧 Extracting Y coordinates...

✅ Result:
+--------------------+-------------+---------------+---------------+
|                  id|     location|shot_location_x|s

Having the exact coordinates of a shot will allow us to calculate the distance and angle to goal.
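
In formulas, write the shot position as $(x, y)$, the goal line as $x = x_g$, and the two posts as $(x_g, y_1)$ and $(x_g, y_2)$. These symbols are introduced here for illustration only; they correspond to `GOAL_X`, `GOAL_Y1` and `GOAL_Y2`. The two quantities are then:

$$
d = \sqrt{(x - x_g)^2 + \left(y - \tfrac{y_1 + y_2}{2}\right)^2}
$$

$$
\theta = \arccos\!\left(\frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}\right),
\qquad \mathbf{u} = (x_g - x,\; y_1 - y),\quad \mathbf{v} = (x_g - x,\; y_2 - y)
$$

For example, a shot from $(100, 63)$ with the goal center at $(120, 40)$ gives $d = \sqrt{20^2 + 23^2} = \sqrt{929} \approx 30.48$.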

In [8]:
# 3. Distance to Goal Calculation

def demonstrate_distance_calculation(df):
    """Show how distance to goal is calculated"""
    print("📏 DISTANCE TO GOAL CALCULATION")
    print("=" * 50)
        
    print(f"🥅 Goal center position: ({GOAL_X}, {(GOAL_Y1 + GOAL_Y2) / 2})")
    
    # Calculate distance using Euclidean formula
    df_with_distance = df.withColumn("distance_to_goal",
                                    F.round(F.sqrt(
                                        F.pow(F.col("shot_location_x") - F.lit(GOAL_X), 2) +
                                        F.pow(F.col("shot_location_y") - F.lit((GOAL_Y1 + GOAL_Y2) / 2), 2)
                                    ), 2))
    
    print("\n📊 Distance calculations:")
    result = df_with_distance.select("id", "shot_location_x", "shot_location_y", "distance_to_goal").collect()
    
    for row in result:
        if row.shot_location_x is not None:
            print(f"  Shot Position ({row.shot_location_x}, {row.shot_location_y}) → Distance: {row.distance_to_goal}")
    
    return df_with_distance

# Run the demonstration
df_with_distance = demonstrate_distance_calculation(df_with_coords)

📏 DISTANCE TO GOAL CALCULATION
🥅 Goal center position: (120, 40.0)

📊 Distance calculations:
  Shot Position (120.0, 79.0) → Distance: 39.0
  Shot Position (100.0, 63.0) → Distance: 30.48
  Shot Position (90.0, 29.0) → Distance: 31.95
  Shot Position (106.0, 36.0) → Distance: 14.56
  Shot Position (51.0, 58.0) → Distance: 71.31
  Shot Position (77.0, 22.0) → Distance: 46.62
  Shot Position (69.0, 62.0) → Distance: 55.54
  Shot Position (74.0, 31.0) → Distance: 46.87
  Shot Position (120.0, 1.0) → Distance: 39.0
  Shot Position (66.0, 34.0) → Distance: 54.33


In [9]:
# 4. Shot angle

def shot_angle(shot_x, shot_y,
               GOAL_X=120, GOAL_Y1=36,
               GOAL_Y2=44):
    """
    Calculate the angle between vectors from shot to each goal post
    """
    print(f"  🎯 Calculating shot angle for position ({shot_x}, {shot_y})")
    print(f"  📍 Goal posts at ({GOAL_X}, {GOAL_Y1}) and ({GOAL_X}, {GOAL_Y2})")
    
    # Vectors from shot to each goal post
    u_x = GOAL_X - shot_x
    u_y = GOAL_Y1 - shot_y
    v_y = GOAL_Y2 - shot_y
    
    print(f"  📐 Vector to lower post: ({u_x}, {u_y})")
    print(f"  📐 Vector to upper post: ({u_x}, {v_y})")
    
    # Calculate dot product and magnitudes
    dot_product = u_x ** 2 + u_y * v_y
    magnitude_u = math.sqrt(u_x ** 2 + u_y ** 2)
    magnitude_v = math.sqrt(u_x ** 2 + v_y ** 2)

    if magnitude_u == 0 or magnitude_v == 0:
        return 0.0
    
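    # Calculate angle (clamp the cosine so float rounding cannot leave acos's [-1, 1] domain)
    cos_angle = max(-1.0, min(1.0, dot_product / (magnitude_u * magnitude_v)))
    angle_radians = math.acos(cos_angle)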
    angle_degrees = math.degrees(angle_radians)
    
    print(f"  📊 Final angle: {angle_degrees:.2f}°\n")
    return angle_degrees

# Test the shot angle calculation
print("🎯 TESTING SHOT ANGLE CALCULATION")
print("=" * 50)

shot_angle_udf = F.udf(lambda shot_x, shot_y: shot_angle(shot_x, shot_y,
                                                         GOAL_X, GOAL_Y1,
                                                         GOAL_Y2),
                       T.FloatType())

df_with_shot_angle = df_with_distance.withColumn("shot_angle",
                                     shot_angle_udf(df_with_distance.shot_location_x,
                                                    df_with_distance.shot_location_y))
df_with_shot_angle.show()

🎯 TESTING SHOT ANGLE CALCULATION


  🎯 Calculating shot angle for position (120.0, 79.0)                          
  📍 Goal posts at (120, 36) and (120, 44)
  📐 Vector to lower post: (0.0, -43.0)
  📐 Vector to upper post: (0.0, -35.0)
  📊 Final angle: 0.00°

  🎯 Calculating shot angle for position (100.0, 63.0)
  📍 Goal posts at (120, 36) and (120, 44)
  📐 Vector to lower post: (20.0, -27.0)
  📐 Vector to upper post: (20.0, -19.0)
  📊 Final angle: 9.94°

  🎯 Calculating shot angle for position (90.0, 29.0)
  📍 Goal posts at (120, 36) and (120, 44)
  📐 Vector to lower post: (30.0, 7.0)
  📐 Vector to upper post: (30.0, 15.0)
  📊 Final angle: 13.43°

  🎯 Calculating shot angle for position (106.0, 36.0)
  📍 Goal posts at (120, 36) and (120, 44)
  📐 Vector to lower post: (14.0, 0.0)
  📐 Vector to upper post: (14.0, 8.0)
  📊 Final angle: 29.74°

  🎯 Calculating shot angle for position (51.0, 58.0)
  📍 Goal posts at (120, 36) and (120, 44)
  📐 Vector to lower post: (69.0, -22.0)
  📐 Vector to upper post: (69.0, -14.0)
  📊 Fin

+--------------------+--------+------+------+----+------+-------------+---------------+---------+--------------------+--------------------+--------------+--------------+--------------+---------+-----------------+--------------+---------------+---------------+---------------+--------------+--------------------+-----------------+------------+--------------+-----------------+---------------------+-----------+-----------+-----------+---------------+----------+-------------+-----------+-----------------+---------------+----------------+-------------+-------------+---------------+---------------+----------------+----------+
|                  id|match_id|minute|second|type|period|     location|           team|player_id|              player|            position|  play_pattern|shot_body_part|shot_technique|shot_type|shot_freeze_frame|under_pressure|shot_aerial_won|shot_first_time|shot_one_on_one|shot_open_goal|shot_follows_dribble|shot_statsbomb_xg|shot_outcome|pass_body_part|shot_end_location

### Preferred foot
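
The cell that follows labels a player as left- or right-footed when at least 66% of their footed passes and shots use that foot, and as two-footed otherwise. The rule can be sketched in plain Python as below. The function name `preferred_foot` and the `None` return for players without any footed action are assumptions made for this illustration; in the Spark version such players simply do not appear in the result.

```python
from collections import Counter


def preferred_foot(body_parts, threshold=0.66):
    """Classify a player from the body parts used in their passes and shots."""
    # Only footed actions count; headers and other body parts are ignored
    counts = Counter(bp for bp in body_parts if bp in ("Left Foot", "Right Foot"))
    total = counts["Left Foot"] + counts["Right Foot"]
    if total == 0:
        return None  # assumption: no footed actions means no label
    if counts["Left Foot"] / total >= threshold:
        return "Left Foot"
    if counts["Right Foot"] / total >= threshold:
        return "Right Foot"
    return "Two-Footed"


print(preferred_foot(["Right Foot"] * 3))                                # Right Foot
print(preferred_foot(["Left Foot", "Right Foot"]))                       # Two-Footed
print(preferred_foot(["Left Foot", "Left Foot", "Right Foot", "Head"]))  # Left Foot
```

Note that a 2-to-1 split (about 0.667) is just enough to cross the 0.66 threshold.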

In [10]:
def demonstrate_preferred_foot_analysis(df):
    """Show how preferred foot is determined"""
    print("🦶 PREFERRED FOOT ANALYSIS")
    print("=" * 50)
    
    print("📊 Analyzing body parts used in passes and shots...")
    
    # Extract pass body parts
    pass_bp = df.filter(F.col('type') == 'Pass') \
                .select('player_id', F.col('pass_body_part').alias('body_part')) \
                .filter(F.col('body_part').isin('Right Foot', 'Left Foot'))
    
    print("Pass body parts:")
    pass_bp.show()
    
    # Extract shot body parts
    shot_bp = df.filter(F.col('type') == 'Shot') \
                .select('player_id', F.col('shot_body_part').alias('body_part')) \
                .filter(F.col('body_part').isin('Right Foot', 'Left Foot'))
    
    print("Shot body parts:")
    shot_bp.show()
    
    # Combine datasets
    bp = pass_bp.union(shot_bp)
    
    print("Combined body part usage:")
    bp.show()
    
    # Convert to numerical indicators
    bp_mapped = bp.withColumn('left_foot', (F.col('body_part') == 'Left Foot').cast('int')) \
                  .withColumn('right_foot', (F.col('body_part') == 'Right Foot').cast('int')) \
                  .drop('body_part')
    
    print("Numerical indicators:")
    bp_mapped.show()
    
    # Aggregate by player
    foot_counts = bp_mapped.groupBy('player_id') \
                          .agg(F.sum('left_foot').alias('left_foot'),
                               F.sum('right_foot').alias('right_foot'))
    
    foot_counts = foot_counts.withColumn("total_actions", F.col("left_foot") + F.col("right_foot"))
    
    print("Foot usage counts:")
    foot_counts.show()
    
    # Determine preferred foot
    foot_counts = foot_counts.withColumn("preferred_foot",
                                        F.when((F.col("left_foot") / F.col("total_actions")) >= 0.66, "Left Foot")
                                        .when((F.col("right_foot") / F.col("total_actions")) >= 0.66, "Right Foot")
                                        .otherwise("Two-Footed"))
    
    print("✅ Preferred foot determination:")
    foot_counts.select("player_id", "preferred_foot").show()
    
    return foot_counts.select("player_id", "preferred_foot")

# Run the demonstration
preferred_foot_df = demonstrate_preferred_foot_analysis(df_with_shot_angle)

🦶 PREFERRED FOOT ANALYSIS
📊 Analyzing body parts used in passes and shots...
Pass body parts:
+---------+----------+
|player_id| body_part|
+---------+----------+
|   6381.0|Right Foot|
|   6399.0| Left Foot|
|   6399.0| Left Foot|
|   5463.0|Right Foot|
|   5199.0|Right Foot|
|   5199.0|Right Foot|
|   6384.0|Right Foot|
|   5199.0|Right Foot|
|   5463.0|Right Foot|
+---------+----------+

Shot body parts:
+---------+---------+
|player_id|body_part|
+---------+---------+
+---------+---------+

Combined body part usage:
+---------+----------+
|player_id| body_part|
+---------+----------+
|   6381.0|Right Foot|
|   6399.0| Left Foot|
|   6399.0| Left Foot|
|   5463.0|Right Foot|
|   5199.0|Right Foot|
|   5199.0|Right Foot|
|   6384.0|Right Foot|
|   5199.0|Right Foot|
|   5463.0|Right Foot|
+---------+----------+

Numerical indicators:
+---------+---------+----------+
|player_id|left_foot|right_foot|
+---------+---------+----------+
|   6381.0|        0|         1|
|   6399.0|        1

In [11]:
df_shot_freeze_frame = df.filter(df.type=='Shot').limit(5)

In [12]:
def demonstrate_freeze_frame_processing(df):
    """Show how freeze frame data is processed"""
    print("🎬 FREEZE FRAME DATA PROCESSING")
    print("=" * 50)
    
    # Filter for shots with freeze frame data
    shots_with_frames = df.filter(F.col('shot_type') != 'Penalty') \
                         .select('id', 'shot_freeze_frame') \
                         .filter(F.col('shot_freeze_frame').isNotNull())
    
    print("Shots with freeze frame data:")
    shots_with_frames.show(truncate=False)
    
    # Process freeze frame data (simplified version)
    print("\n🔧 Processing freeze frame data...")
    
    import re  # imported once here rather than on every loop iteration
    
    processed_frames = []
    for row in shots_with_frames.collect():
        shot_id = row.id
        frame_data = row.shot_freeze_frame
        
        print(f"\n📊 Processing shot {shot_id}:")
        print(f"Raw data: {frame_data}")
        
        # Simplified parsing (a real implementation would need to be more robust)
        if frame_data:
            # Extract every "[x, y]" coordinate pair from the raw string
            coordinates = re.findall(r'\[([\d.]+), ([\d.]+)\]', frame_data)
            
            print(f"Extracted coordinates: {coordinates}")
            
            for i, (x, y) in enumerate(coordinates):
                processed_frames.append({
                    'Shot_id': shot_id,
                    'player_num': i,
                    'x': float(x),
                    'y': float(y),
                    'teammate': i % 2 == 0  # Placeholder: alternating flag, NOT the real teammate value from the freeze frame
                })
    
    # Create DataFrame from processed data
    if processed_frames:
        frame_df = spark.createDataFrame(processed_frames)
        print("\n✅ Processed freeze frame data:")
        frame_df.show()
        return frame_df
    else:
        print("⚠️  No freeze frame data to process")
        return None

# Run the demonstration
freeze_frame_df = demonstrate_freeze_frame_processing(df_shot_freeze_frame)

🎬 FREEZE FRAME DATA PROCESSING
Shots with freeze frame data:
+------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                                                                                

+--------------------+----------+--------+-----+----+
|             Shot_id|player_num|teammate|    x|   y|
+--------------------+----------+--------+-----+----+
|129d6ccd-ef7d-44e...|         0|    true|113.0|51.0|
|129d6ccd-ef7d-44e...|         1|   false|120.0|79.0|
|129d6ccd-ef7d-44e...|         2|    true|112.0|47.0|
|129d6ccd-ef7d-44e...|         3|   false|119.0|43.0|
|129d6ccd-ef7d-44e...|         4|    true|103.0|47.0|
|129d6ccd-ef7d-44e...|         5|   false|103.0|64.0|
|129d6ccd-ef7d-44e...|         6|    true|107.0|56.0|
|129d6ccd-ef7d-44e...|         7|   false|119.0|77.0|
|129d6ccd-ef7d-44e...|         8|    true|114.0|41.0|
|129d6ccd-ef7d-44e...|         9|   false|120.0|80.0|
|129d6ccd-ef7d-44e...|        10|    true|108.0|62.0|
|129d6ccd-ef7d-44e...|        11|   false|108.0|41.0|
|855d5fee-a32c-481...|         0|    true|106.0|56.0|
|855d5fee-a32c-481...|         1|   false|115.0|38.0|
|855d5fee-a32c-481...|         2|    true|113.0|37.0|
|855d5fee-a32c-481...|      
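
The regex-based demonstration above only recovers coordinates, and its `teammate` column is an alternating placeholder. A somewhat more robust approach could parse the whole structure with `ast.literal_eval`. The sketch below assumes that `shot_freeze_frame` is stored as the string form of a Python list of dictionaries, each with a `location` pair and a boolean `teammate` key. This layout is an assumption and should be checked against the actual CSV. The function name `parse_freeze_frame` and the sample string are hypothetical.

```python
import ast


def parse_freeze_frame(raw):
    """Parse one freeze-frame string into a list of flat player records."""
    if not raw:
        return []
    # literal_eval only evaluates Python literals, so it is safer than eval
    players = ast.literal_eval(raw)
    return [
        {
            "x": float(p["location"][0]),
            "y": float(p["location"][1]),
            "teammate": bool(p["teammate"]),
        }
        for p in players
    ]


# Hypothetical sample in the assumed layout, not taken from the dataset
sample = (
    "[{'location': [113.0, 51.0], 'teammate': False}, "
    "{'location': [112.0, 47.0], 'teammate': True}]"
)
for record in parse_freeze_frame(sample):
    print(record)
```

Unlike the alternating pattern, this keeps the teammate flag recorded for each player, provided the assumed layout holds.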

In [18]:
def demonstrate_dummy_creation(df):
    """Show how categorical variables are converted to dummy variables"""
    print("🎛️ DUMMY VARIABLE CREATION")
    print("=" * 50)
    
    print("📊 Original categorical data:")
    df.select("id", "play_pattern", "shot_type", "shot_technique").show()
    
    # Create dummy variables using the same logic as the class
    df_with_dummies = df
    
    # Process each categorical column in DUMMIES dictionary
    for col_name, mapping in DUMMIES.items():
        print(f"\n🔧 Creating dummy variables for {col_name}...")
        for value, dummy_col in mapping.items():
            df_with_dummies = df_with_dummies.withColumn(dummy_col,
                                                       F.when(F.col(col_name) == value, 1).otherwise(0))
            print(f"  ✅ Created {dummy_col} for {value}")
    
    # Create special columns (goal and header) as done in the class
    print("\n🔧 Creating special binary columns...")
    df_with_dummies = df_with_dummies.withColumn('goal',
                                                F.when(F.col('shot_outcome') == 'Goal', 1).otherwise(0))
    print("  ✅ Created 'goal' column")
    
    df_with_dummies = df_with_dummies.withColumn('header',
                                                F.when(F.col('shot_body_part') == 'Head', 1).otherwise(0))
    print("  ✅ Created 'header' column")
    
    # Show results with dummy variables
    print("\n✅ Result with dummy variables:")
    
    # Collect all dummy column names from DUMMIES
    dummy_cols = ['id']
    for mapping in DUMMIES.values():
        dummy_cols.extend(mapping.values())
    
    # Add special columns
    dummy_cols.extend(['goal', 'header'])
    
    # Show only columns that exist in the DataFrame
    existing_cols = [col for col in dummy_cols if col in df_with_dummies.columns]
    df_with_dummies.select(existing_cols).show()
    
    return df_with_dummies

# Run the demonstration
df_with_dummies = demonstrate_dummy_creation(df.filter(df.type=='Shot'))

🎛️ DUMMY VARIABLE CREATION
📊 Original categorical data:
+--------------------+--------------+---------+--------------+
|                  id|  play_pattern|shot_type|shot_technique|
+--------------------+--------------+---------+--------------+
|129d6ccd-ef7d-44e...| From Throw In|Open Play|        Volley|
|855d5fee-a32c-481...|From Free Kick|Open Play|        Volley|
|f17c2cb5-cc03-499...|From Free Kick|Open Play|   Half Volley|
|158ca4ad-9fd2-4f8...|  Regular Play|Open Play|        Normal|
|2bf5883e-1d85-441...|From Free Kick|Open Play|        Volley|
|793a58d0-0e35-498...|  Regular Play|Open Play|        Normal|
|97411bc2-5346-408...|  Regular Play|Open Play|   Half Volley|
|2dd5f4d5-75b0-43a...| From Throw In|Open Play|        Normal|
|67243207-864c-489...| From Throw In|Open Play|        Normal|
|7d2addef-35f1-442...|  Regular Play|Open Play|        Normal|
|e64ae9d5-81cf-47a...|         Other|  Penalty|        Normal|
|622f5af7-f231-4a7...|   From Corner|Open Play|        Normal|
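
The loop over `DUMMIES` above expects a nested mapping of the form `{column: {category value: dummy column name}}`. The sketch below applies the same logic to a single record in plain Python. The mapping `DUMMIES_EXAMPLE` and its column names are a hypothetical excerpt made up for illustration; the real mapping lives in `src.constants`.

```python
# Hypothetical excerpt; the real DUMMIES dictionary is defined in src.constants
DUMMIES_EXAMPLE = {
    "shot_technique": {"Volley": "volley", "Half Volley": "half_volley"},
    "shot_type": {"Penalty": "pk_type"},
}


def encode_row(row, dummies):
    """Return the 0/1 dummy columns for one record given as a plain dict."""
    encoded = {}
    for col_name, mapping in dummies.items():
        for value, dummy_col in mapping.items():
            # 1 when the record has this category, 0 otherwise (including missing)
            encoded[dummy_col] = int(row.get(col_name) == value)
    return encoded


print(encode_row({"shot_technique": "Volley", "shot_type": "Open Play"}, DUMMIES_EXAMPLE))
# {'volley': 1, 'half_volley': 0, 'pk_type': 0}
```

Categories that are absent from the mapping, such as `Normal` in this excerpt, get all zeros and so act as the reference level.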

In [None]:
def demonstrate_bool_to_int(df):
    """Show how boolean columns are converted to integers"""
    print("🔄 BOOLEAN TO INTEGER CONVERSION")
    print("=" * 50)
    
    print("📊 Original boolean data:")
    df.select("id", "under_pressure", "shot_one_on_one").show()
    
    print("\n🔧 Converting boolean columns to integers...")
    df_converted = df
    
    for col_name in ['under_pressure', 'shot_one_on_one']:
        df_converted = df_converted.withColumn(col_name,
                                              F.when(F.col(col_name).isNull(), 0)
                                              .otherwise(F.col(col_name).cast('int')))
        print(f"  ✅ Converted {col_name} to integer")
    
    # Special case: when pk_type == 1, force shot_one_on_one to 1
    df_converted = df_converted.withColumn('shot_one_on_one',
                                          F.when(F.col('pk_type') == 1, 1)
                                          .otherwise(F.col('shot_one_on_one')))
    
    print("\n✅ Result after conversion:")
    df_converted.select("id", "under_pressure", "shot_one_on_one", "pk_type").show()
    
    return df_converted

# Run the demonstration
df_final = demonstrate_bool_to_int(df_with_dummies)

🔄 BOOLEAN TO INTEGER CONVERSION
📊 Original boolean data:
+--------------------+--------------+---------------+
|                  id|under_pressure|shot_one_on_one|
+--------------------+--------------+---------------+
|129d6ccd-ef7d-44e...|          NULL|           NULL|
|855d5fee-a32c-481...|          NULL|           NULL|
|f17c2cb5-cc03-499...|          NULL|           NULL|
|158ca4ad-9fd2-4f8...|          NULL|           NULL|
|2bf5883e-1d85-441...|          NULL|           NULL|
|793a58d0-0e35-498...|          NULL|           NULL|
|97411bc2-5346-408...|          NULL|           NULL|
|2dd5f4d5-75b0-43a...|          NULL|           NULL|
|67243207-864c-489...|          NULL|           NULL|
|7d2addef-35f1-442...|          NULL|           NULL|
|e64ae9d5-81cf-47a...|          NULL|           NULL|
|622f5af7-f231-4a7...|          NULL|           NULL|
|80f48391-c0c4-41d...|          NULL|           NULL|
|44d0c095-b3a4-438...|          NULL|           NULL|
|d3f03900-52fa-401...|   