# Building Features

Some of the groundwork for basic game-state features was laid in the previous notebook. In this notebook we:

- Finish deriving game-state metrics (score differential, player differential, cards, minutes/time-interval features).
- Compute xT-based team strength metrics and running xT differentials.
- Aggregate recent match xT for each team and incorporate those as features.
- Save the final processed dataset for modeling.

In [1]:
import json
from glob import glob
import pandas as pd; pd.set_option("display.max_columns", None)
import numpy as np
from tqdm import tqdm

In [5]:
from pathlib import Path
files = glob(r"../processed-data/*.csv")
# File names look like <matchId>_<homeTeamId>_<awayTeamId>.csv; .stem drops directory and extension on any OS
name_parts = [Path(file).stem.split("_") for file in files]
match_ids = [parts[0] for parts in name_parts]
home_team_ids = [parts[1] for parts in name_parts]
away_team_ids = [parts[2] for parts in name_parts]

In [6]:
all_team_match_files_dict = {}
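for team_id in sorted(set(home_team_ids)):
    # Compare against the parsed IDs; a substring test on the path can match a different ID.
    all_team_match_files_dict[team_id] = [f for f, h, a in zip(files, home_team_ids, away_team_ids)
                                          if team_id in (h, a)]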

I have not used any AI assistance in the cell above.

The variable `n_prev_matches` is the number of previous matches we will consider to build our team strength metrics. 

In the following two code cells, we iterate through all the matches and calculate the xT accumulated by both teams in each match, counting only actions with a non-negative xT value. We then save that data. I'm not sure this is the most efficient way of doing it, and it might be possible to get the same totals in our main processing loop, but this was the most straightforward way, so that's what I did.

In [7]:
n_prev_matches = 4
create_xt_matches = True
n_matches = 38

In [8]:
xt_matches_dict = {}
if create_xt_matches:
    for file, match_id, home_team_id, away_team_id in tqdm(zip(files, match_ids, home_team_ids, away_team_ids)):
        d = pd.read_csv(file)
        grouped_xt_df = d.query("xt_value >= 0").groupby("teamId").agg(sum_fwd_xt= ("xt_value", "sum")).reset_index()
        xt_matches_dict[match_id] = dict(zip(grouped_xt_df.teamId.astype(str), grouped_xt_df.sum_fwd_xt))
    with open("../pre_xt_matches.json", "w") as f:
        json.dump(xt_matches_dict, f)
else:
    with open("../pre_xt_matches.json") as f:
        xt_matches_dict = json.load(f)
    

380it [00:01, 233.42it/s]


I have not used any AI assistance in the cell above.
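To make the role of `n_prev_matches` concrete, here is a minimal sketch of how a team's pre-match strength can be derived from a per-match xT dictionary shaped like `xt_matches_dict`. The toy data, the helper name `prev_avg_xt`, and the assumption that each team's match list is in chronological order are all illustrative and not part of the original pipeline.

```python
import numpy as np

# Hypothetical toy data: match_id -> {team_id: xT accumulated in that match}
toy_xt_matches = {
    "m1": {"A": 1.0, "B": 0.5},
    "m2": {"A": 2.0, "C": 1.5},
    "m3": {"A": 3.0, "B": 2.5},
}
# Each team's matches, assumed to be listed in chronological order
toy_team_matches = {"A": ["m1", "m2", "m3"]}


def prev_avg_xt(team_id, match_id, n_prev, team_matches, xt_matches):
    """Mean xT over the n_prev matches played before match_id, or None if too few."""
    idx = team_matches[team_id].index(match_id)
    if idx < n_prev:
        return None  # not enough history yet, so the match would be skipped
    window = team_matches[team_id][idx - n_prev: idx]
    return float(np.mean([xt_matches[m][team_id] for m in window]))


print(prev_avg_xt("A", "m3", 2, toy_team_matches, toy_xt_matches))  # 1.5
```

Matches without enough history return `None` here, which mirrors the idea of only building features once both teams have played at least `n_prev_matches` games.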

In [9]:
cols = ["matchId", 
        "teamId", 
        "score_differential", 
        "goals_scored",
        "player_differential", 
        "own_yellow_cards", 
        "opposition_yellow_cards", 
        "is_home_team", 
        "avg_team_xt",
        "avg_opp_xt",
        "minutes_remaining",
        "time_interval",
        "time_intervals_remaining",
        "running_xt_differential",
        "scored_goal_after" ##target
       ]

In [10]:
def get_running_xt(args):
    """Return the xT difference for the event's minute, from the acting team's perspective.

    Relies on the globals `home_team_id` and `home_away_running_xt_dict`
    (minute -> home xT minus away xT), which are set in the main processing loop.
    """
    team_id, minute = args
    if team_id == home_team_id:
        return home_away_running_xt_dict[minute]
    else:
        # Flip the sign for the away team; the explicit 0 avoids returning -0.0
        return -home_away_running_xt_dict[minute] if home_away_running_xt_dict[minute] != 0 else 0

def get_running_xt_differential(vals):
    """Mean xT difference for a team over the preceding 10 minutes of a period.

    Essentially a proxy for game flow. Relies on the global `pdf` (the current period's events).
    """
    n_minutes_period = 10
    team_id, minute, match_period = vals
    return pdf.query("matchPeriod == @match_period & teamId == @team_id & (@minute-@n_minutes_period<=minutes_round<=@minute)")["minute_xt_difference"].mean()

Citation : The above code snippet was generated using GPT 5 on 11/18/25 at 3:25p. 
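As an illustration of what `minute_xt_difference` represents, the sketch below computes the per-minute home-minus-away xT on a few made-up events and then flips the sign for the away team, without relying on global variables. The toy events and the names `home_id`, `away_id`, and `per_minute` are assumptions for this example only; it is one possible vectorised formulation, not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy events: team 1 is the home side, team 2 the away side
events = pd.DataFrame({
    "teamId":        [1, 1, 2, 2, 1],
    "minutes_round": [0, 0, 0, 1, 2],
    "xt_value":      [0.02, 0.01, 0.01, 0.03, 0.05],
})
home_id, away_id = 1, 2

# Total xT per team and minute, reshaped to one column per team
per_minute = (
    events.groupby(["teamId", "minutes_round"])["xt_value"].sum()
    .unstack("teamId", fill_value=0.0)
)

# Home-minus-away xT for every minute of the period
home_minus_away = per_minute[home_id] - per_minute[away_id]

# Attach the difference to each event from the acting team's perspective
sign = events["teamId"].map({home_id: 1.0, away_id: -1.0})
events["minute_xt_difference"] = events["minutes_round"].map(home_minus_away) * sign

print(events)
```

Each home event carries the home-minus-away difference for its minute, and each away event carries the same quantity with the opposite sign.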

In [11]:
data = []
for file, match_id, home_team_id, away_team_id in tqdm(zip(files, match_ids, home_team_ids, away_team_ids)):
    
    home_file_idx = all_team_match_files_dict[home_team_id].index(file)
    away_file_idx = all_team_match_files_dict[away_team_id].index(file)
    
    if home_file_idx >= n_prev_matches and away_file_idx >= n_prev_matches:
    
        home_xt_files = all_team_match_files_dict[home_team_id][home_file_idx - n_prev_matches: home_file_idx]
        away_xt_files = all_team_match_files_dict[away_team_id][away_file_idx - n_prev_matches: away_file_idx]

        home_vals = []
        for home_xt_file in home_xt_files:
            # Take the file name (either path separator), then the match ID before the first underscore
            hmid = home_xt_file.replace("\\", "/").split("/")[-1].split("_")[0]
            home_vals.append(xt_matches_dict[hmid][home_team_id])

        away_vals = []
        for away_xt_file in away_xt_files:
            amid = away_xt_file.replace("\\", "/").split("/")[-1].split("_")[0]
            away_vals.append(xt_matches_dict[amid][away_team_id])

        pre_home_avg_xt_value = np.mean(home_vals)
        pre_away_avg_xt_value = np.mean(away_vals)

        home_team_id = int(home_team_id)
        away_team_id = int(away_team_id)

        df = pd.read_csv(file)

        ## minutes remaining
        df["minutes"] = df["eventSec"]/60
        df["minutes_round"] = df["minutes"].astype(int)
        h1_last_minute = df.query("matchPeriod == '1H'").minutes_round.max()
        h2_last_minute = df.query("matchPeriod == '2H'").minutes_round.max()

        ##running xt differential
        pdfs = []
        for period in ["1H", "2H"]:
            pdf = df.query("matchPeriod == @period").copy()
            p = pdf.groupby(["teamId", "minutes_round"]).\
                agg(xt=("xt_value", "sum")).\
                reset_index().\
                pivot(index="minutes_round", columns="teamId", values="xt").fillna(0)
            home_away_running_xt_dict = dict(zip(p.index, p[home_team_id] - p[away_team_id]))

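            pdf["minute_xt_difference"] = pdf[['teamId', 'minutes_round']].apply(get_running_xt, axis=1)
            # Rolling mean over the previous 10 event rows of this period (rows, not minutes);
            # the first 9 rows have no full window and are filled with 0.
            pdf["running_xt_differential"] = pdf["minute_xt_difference"].rolling(10).mean().fillna(0)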

            if period == "1H":
                pdf["minutes_remaining"] = pdf['minutes_round'].apply(lambda minute: (h1_last_minute - minute) + h2_last_minute)
            else:
                pdf["minutes_remaining"] = pdf['minutes_round'].apply(lambda minute: h2_last_minute - minute)

            pdfs.append(pdf)
        df = pd.concat(pdfs, ignore_index=False)
        df['time_interval'] = 100 - pd.cut(df["minutes_remaining"], bins=100, labels=False) ## time intervals come directly from the blog
        df['time_intervals_remaining'] = 100 - df['time_interval']
        df['score_differential'] = np.where(df["teamId"] == home_team_id, 
                                            df['home_goals']-df['away_goals'], 
                                            df['away_goals']-df['home_goals'])

        df['goals_scored'] = np.where(df["teamId"] == home_team_id, 
                                      df['home_goals'], 
                                      df['away_goals'])

        df['player_differential'] = np.where(df["teamId"] == home_team_id, 
                                             df['home_number_of_players']-df['away_number_of_players'], 
                                             df['away_number_of_players']-df['home_number_of_players'])

        df['own_yellow_cards'] = np.where(df["teamId"] == home_team_id, 
                                          df['home_number_of_yellows'], 
                                          df['away_number_of_yellows'])

        df['opposition_yellow_cards'] = np.where(df["teamId"] == home_team_id, 
                                                 df['away_number_of_yellows'], 
                                                 df['home_number_of_yellows'])

        df['is_home_team'] = np.where(df["teamId"] == home_team_id, 1, 0)

        df['avg_team_xt'] = np.where(df["teamId"] == home_team_id, pre_home_avg_xt_value, pre_away_avg_xt_value)
        df['avg_opp_xt'] = np.where(df["teamId"] == home_team_id, pre_away_avg_xt_value, pre_home_avg_xt_value)


        ##
        ## target - goals scored after        
        df["home_num_goals_scored_after"] = df['home_goals'].max()
        df["home_num_goals_scored_after"] = df["home_num_goals_scored_after"] - df["home_goals"]
        
        df["away_num_goals_scored_after"] = df['away_goals'].max()
        df["away_num_goals_scored_after"] = df["away_num_goals_scored_after"] - df["away_goals"]


        df['scored_goal_after'] = np.where(df["teamId"] == home_team_id, df['home_num_goals_scored_after'], df['away_num_goals_scored_after'])
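        # The final interval has time_intervals_remaining == 0; mask it to NaN so the division
        # cannot produce inf (the fillna(0) below then sets those rows to 0).
        df["scored_goal_after"] = df["scored_goal_after"]/df["time_intervals_remaining"].where(df["time_intervals_remaining"] > 0)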
        
        ##save 
        data.append(df.fillna(0))        

380it [00:04, 81.36it/s] 


Citation : The above code snippet was generated using GPT 5 on 11/20/25 at 2:14p. 
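The time-interval features and the target are easier to follow on a small example. The sketch below bins `minutes_remaining` into 100 intervals with `pd.cut`, derives the intervals remaining, and turns goals still to come into a per-interval rate. The toy values and the `goals_after` column are made up for illustration, and masking the zero-interval rows before dividing is just one possible way to guard against division by zero.

```python
import pandas as pd

# Hypothetical toy rows: minutes left in the match and goals the team still scores
toy = pd.DataFrame({
    "minutes_remaining": [95, 50, 10, 0],
    "goals_after":       [2, 1, 1, 0],
})

# pd.cut with labels=False returns integer bin codes 0..99 (0 = fewest minutes left)
toy["time_interval"] = 100 - pd.cut(toy["minutes_remaining"], bins=100, labels=False)
toy["time_intervals_remaining"] = 100 - toy["time_interval"]

# Goals per remaining interval; rows with zero intervals remaining are masked
# to NaN first and then set to 0, so the result never contains inf
remaining = toy["time_intervals_remaining"]
toy["scored_goal_after"] = (toy["goals_after"] / remaining.where(remaining > 0)).fillna(0)

print(toy)
```

The first row lands in interval 1 with 99 intervals remaining, while the last row lands in interval 100 with none remaining, which is exactly the case the guard handles.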

In [12]:
final_df = pd.concat(data, ignore_index=True)

In [13]:
final_df.head()

Unnamed: 0,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,matchPeriod,eventSec,subEventId,id,tags_list,home_goals,away_goals,home_number_of_yellows,away_number_of_yellows,home_number_of_players,away_number_of_players,xt_value,minutes,minutes_round,minute_xt_difference,running_xt_differential,minutes_remaining,time_interval,time_intervals_remaining,score_differential,goals_scored,player_differential,own_yellow_cards,opposition_yellow_cards,is_home_team,avg_team_xt,avg_opp_xt,home_num_goals_scored_after,away_num_goals_scored_after,scored_goal_after
0,8,Simple pass,[{'id': 1801}],10252,"[{'y': 51, 'x': 50}, {'y': 40, 'x': 31}]",2500048,Pass,10531,1H,3.050446,85.0,240887221,[1801],0,0,0,0,11,11,-0.003522,0.050841,0,0.011892,0.0,97,1,99,0,0,0,0,0,0,1.059127,1.430297,1,1,0.010101
1,8,Simple pass,[{'id': 1801}],246866,"[{'y': 40, 'x': 31}, {'y': 88, 'x': 29}]",2500048,Pass,10531,1H,3.468107,85.0,240887222,[1801],0,0,0,0,11,11,-0.001547,0.057802,0,0.011892,0.0,97,1,99,0,0,0,0,0,0,1.059127,1.430297,1,1,0.010101
2,8,High pass,[{'id': 1802}],284,"[{'y': 88, 'x': 29}, {'y': 42, 'x': 69}]",2500048,Pass,10531,1H,8.349323,83.0,240887223,[1802],0,0,0,0,11,11,0.0,0.139155,0,0.011892,0.0,97,1,99,0,0,0,0,0,0,1.059127,1.430297,1,1,0.010101
3,8,Head pass,[{'id': 1801}],5281,"[{'y': 58, 'x': 31}, {'y': 20, 'x': 42}]",2500048,Pass,1627,1H,10.104679,82.0,240886924,[1801],0,0,0,0,11,11,0.002522,0.168411,0,-0.011892,0.0,97,1,99,0,0,0,0,0,1,1.430297,1.059127,1,1,0.010101
4,8,Head pass,[{'id': 1802}],8370,"[{'y': 20, 'x': 42}, {'y': 44, 'x': 72}]",2500048,Pass,1627,1H,11.599694,82.0,240886925,[1802],0,0,0,0,11,11,0.0,0.193328,0,-0.011892,0.0,97,1,99,0,0,0,0,0,1,1.430297,1.059127,1,1,0.010101


In [14]:
final_df.to_csv("../final_processed_data.csv", index=False) ##this is the data we will use for modelling purposes