# Extracting all the tackling data

## Context and objectives

We want to analyse tackles. We may want to look at different aspects at different points in time, but we do not want to re-read every individual game file each time. We therefore need a structured approach to data extraction: we go through the games once and save everything we might need.

As always, the data is organised with one folder per competition:

In [2]:
!ls ../opta_data/Mens/

AutumnNationsCup        NPC                     SuperRugbyAU
ChallengeCup            PacificNationsCup       SuperRugbyAotearoa
ChampionsCup            Premiership             SuperRugbyPacific
CurrieCup               ProD2                   SuperRugbyTranstasman
International           RWC                     TRC
JapanRugbyLeagueOneD1   RainbowCup              Top14
JapanTopLeague          RainbowCupSA            URC
Lions                   RugbyEuropeChampionship
MLR                     SixNations


Then for each competition, we have a subfolder per season:

In [4]:
!ls ../opta_data/Mens/SuperRugbyPacific/

2021-22 2022-23 2023-24 2024-25


We specify which competition and which season we want to process.

If we want to process all games, we set both variables to '*'.

In [6]:
comp = "SuperRugbyPacific"
season = "2024-25"
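To illustrate how the `'*'` setting behaves, here is a minimal sketch that builds the same kind of glob pattern we use further down and runs it against a throwaway folder tree. The helper `build_game_pattern`, the competition names and the file names are all hypothetical and only serve the illustration.

```python
import glob
import os
import tempfile

def build_game_pattern(root, comp, season):
    # Same layout as our data: <root>/<competition>/<season>/**/*.csv
    return f"{root}/{comp}/{season}/**/*.csv"

with tempfile.TemporaryDirectory() as root:
    # Hypothetical mini data tree, for illustration only
    for comp_name, season_name, file_name in [
        ("CompA", "2023-24", "g1.csv"),
        ("CompA", "2024-25", "g2.csv"),
        ("CompB", "2024-25", "g3.csv"),
    ]:
        folder = os.path.join(root, comp_name, season_name)
        os.makedirs(folder, exist_ok=True)
        open(os.path.join(folder, file_name), "w").close()

    # One competition and one season
    one_season = glob.glob(build_game_pattern(root, "CompA", "2024-25"), recursive=True)
    # '*' for both variables matches every competition and every season
    all_games = glob.glob(build_game_pattern(root, "*", "*"), recursive=True)

print(len(one_season), len(all_games))
```

With `recursive=True`, the `**` part matches zero or more subfolders, so files stored directly in the season folder are picked up as well.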


## Data extraction

We generate the list of the games we want to process.

In [9]:
import glob

def check_duplicates(my_list):
    seen = set()
    for x in my_list:
        if x in seen:
            print(f"Game {x} has been seen before")
        seen.add(x)



list_games = glob.glob(f"../opta_data/Mens/{comp}/{season}/**/*.csv",recursive=True)

print(f"{len(list_games)} games to process.")
check_duplicates(list_games)

20 games to process.
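A note on the duplicate check: `glob` normally returns each path only once, so comparing full paths is unlikely to flag a game that has been stored in two different folders. A minimal sketch of a stricter check is given below, assuming that the file name alone identifies a game (this is an assumption, not something the data guarantees). The function `find_duplicate_games` and the example paths are hypothetical.

```python
import os
from collections import Counter

def find_duplicate_games(paths):
    """Return the file names that appear more than once across folders."""
    counts = Counter(os.path.basename(p) for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical paths, for illustration only
example_paths = [
    "../opta_data/Mens/CompA/2024-25/game_1.csv",
    "../opta_data/Mens/CompB/2024-25/game_1.csv",
    "../opta_data/Mens/CompA/2024-25/game_2.csv",
]
print(find_duplicate_games(example_paths))
```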


For each game in the list, we extract and save all the tackle data.

In [11]:
from tqdm import tqdm

tackle_list = []

for f in tqdm(list_games):
    
    # DEBUG: uncomment to print the file name or to process a single game
    # print(f)
    # if "945282" not in f:
    #     continue
        
    with open(f,'r') as inFile:
        lines = inFile.readlines()

    # Read the header line
    header = lines[0].strip().split(',')
    
    # Determine the index for all the columns we want to use
    homeTeamName = header.index('homeTeamName')
    awayTeamName = header.index('awayTeamName')
    competitionName = header.index('competitionName')
    competitionSeason = header.index('season')
    FXID = header.index('FXID')
    datePlayed = header.index('datePlayed')
    action = header.index('action')
    actionName = header.index('actionName')
    ActionType = header.index('ActionType')
    ActionTypeName = header.index('ActionTypeName')
    Actionresult = header.index('Actionresult')
    ActionResultName = header.index('ActionResultName')
    qualifier3Name = header.index('qualifier3Name')
    qualifier4Name = header.index('qualifier4Name')
    qualifier5Name = header.index('qualifier5Name')
    qualifier6Name = header.index('qualifier6Name')
    ps_timestamp = header.index('ps_timestamp')
    ps_endstamp = header.index('ps_endstamp')
    MatchTime = header.index('MatchTime')
    period = header.index('period')
    x_coord = header.index('x_coord')
    y_coord = header.index('y_coord')
    x_coord_end = header.index('x_coord_end')
    y_coord_end = header.index('y_coord_end')
    teamName = header.index('teamName')
    PLID = header.index('PLID')
    playerName = header.index('playerName')
    playerpositionID = header.index('playerpositionID')
    assoc_playerName = header.index('assoc_playerName')
    PlayNum = header.index('PlayNum')
    ID = 0 # hardcoding this one, because of annoying special characters
  
    # we start by extracting the values that are constant for the whole game...
    temp_array = lines[1].split(",")
    
    # home and away teams
    home_team = temp_array[homeTeamName]
    away_team = temp_array[awayTeamName]
    
    # competition name and season
    competition_name = temp_array[competitionName]
    competition_season = temp_array[competitionSeason]

    # game ID
    game_ID = temp_array[FXID]
    
    # date
    game_date = temp_array[datePlayed]
    
    # then we can process each line in the file
    i=1
    while i<len(lines):
        
        temp_array = lines[i].split(",")

        j=i # we will use j if we need to process additional lines

        play_ID = temp_array[ID]
    
        # we extract the details of the action
        action_ID = temp_array[action]
        
        # we ignore everything that is not a tackle
        if action_ID!="2":

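            # sanity checks: a 'Missed Tackle' line (action 12) should directly
            # follow a tackle line (action 2) recorded for the same player
            if action_ID=="12":
                previous_array = lines[i-1].split(",")
                previous_action = previous_array[action]
                if previous_action!="2":
                    print(temp_array[0:10],action_ID,previous_action)
                    print(lines[i-1].rstrip())
                    print(lines[i].rstrip())
                previous_tackler = previous_array[playerName]
                if previous_tackler!=temp_array[playerName]:
                    print("Different tackler on the 'Missed Tackle' line: ",game_ID,i)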
            
            i+=1
            continue

        tackle_descriptor = temp_array[ActionTypeName]
        tackle_outcome = temp_array[ActionResultName]
        tackle_qualifier = temp_array[qualifier3Name]
        tackle_dominance = temp_array[qualifier4Name]
        tackler_number = temp_array[qualifier5Name] # sometimes multiple players tackle the same player
        tackle_area = temp_array[qualifier6Name]

        # we extract the tackling player and their position
        tackler_player_name = temp_array[playerName]
        tackler_player_position = temp_array[playerpositionID]

        # we extract the player carrying the ball
        carrier_player_name = temp_array[assoc_playerName]

        # we extract information about the current time
        time = f"{temp_array[period]}_{temp_array[MatchTime].zfill(5)}" # time is captured as x_mmmss

        # we extract the phase
        phase_number = temp_array[PlayNum]

        # we extract the coordinates recorded for the tackle
        x_start = temp_array[x_coord]
        y_start = temp_array[y_coord]

        # we extract details about the teams
        tackling_team = temp_array[teamName]
        carrying_team = home_team
        if tackling_team==home_team:
            carrying_team = away_team

        # these variables will stay empty strings unless they are used
        miss_category = ""
        miss_outcome = ""

        if tackle_outcome=="Missed":
            # if the tackle is missed, the following line should always contain details
            # (we guard against a missed tackle on the very last line of the file)
            next_array = lines[i+1].split(",") if i+1<len(lines) else None
            if next_array is None or next_array[action]!="12":
                print("We were expecting a 'Missed Tackle' line here: ",game_ID,i)

            else:
                miss_category = next_array[ActionTypeName]
                miss_outcome = next_array[ActionResultName]
                
        tackle_list.append([game_ID, competition_name, competition_season, game_date,
                            home_team, away_team, tackling_team, carrying_team,
                            time, phase_number, x_start, y_start,
                            tackle_descriptor, tackle_outcome, tackle_qualifier,
                            tackle_dominance, tackler_number, tackle_area,
                            tackler_player_name, tackler_player_position, carrier_player_name,
                            miss_category, miss_outcome
                            ])
        i+=1


100%|██████████████████████████████████████████| 20/20 [00:00<00:00, 134.91it/s]
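The loop above splits each line on commas, which works as long as no field contains a comma. As a hedged alternative, the sketch below uses the standard `csv` module, which handles quoted fields and lets us look up columns by name instead of by index. The sample rows and the player name are made up for illustration; the column names are the ones used above.

```python
import csv
import io

# Hypothetical three-column extract, for illustration only
sample = io.StringIO(
    'action,playerName,teamName\n'
    '2,"Smith, John",Team A\n'
    '12,"Smith, John",Team A\n'
)

# DictReader uses the header line to map column names to values
rows = list(csv.DictReader(sample))

# A plain split breaks the quoted name into two fields
naive = '2,"Smith, John",Team A'.split(",")

print(len(naive))              # one field more than the three columns
print(rows[0]["playerName"])   # the quoted name is kept in one piece
```

When reading from a real file, the `csv` documentation recommends opening it with `newline=""`. If the special characters in the first column name turn out to be a byte-order mark (an assumption we have not checked), opening the file with `encoding="utf-8-sig"` should remove them.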


## Inspecting and saving the results

Just an initial inspection, to make sure everything looks reasonable.

In [13]:
import pandas as pd

df_tackle = pd.DataFrame(tackle_list,
                         columns=['Game ID', 'Competition', 'Season', 'Date',
                                  'Home team', 'Away team', 'Tackling team', 'Carrying team',
                                  'Time', 'Phase', 'x_start', 'y_start',
                                  'Tackle descriptor', 'Tackle outcome', 'Tackle qualifier',
                                  'Tackle dominance', 'Tackler number', 'Tackle area',
                                  'Tackler name', 'Tackler position', 'Attacking player name',
                                  'Missed tackle category', 'Missed tackle outcome'
                                  ])
df_tackle

Unnamed: 0,Game ID,Competition,Season,Date,Home team,Away team,Tackling team,Carrying team,Time,Phase,...,Tackle outcome,Tackle qualifier,Tackle dominance,Tackler number,Tackle area,Tackler name,Tackler position,Attacking player name,Missed tackle category,Missed tackle outcome
0,945268,Super Rugby Pacific,2025,15/02/2025,Western Force,Moana Pasifika,Moana Pasifika,Western Force,1_00005,1,...,Passive,,Ineffective Tackle,1st Tackler,Upper Torso,Sione Havili Talitui,7,Vaiolini Ekuasi,,
1,945268,Super Rugby Pacific,2025,15/02/2025,Western Force,Moana Pasifika,Moana Pasifika,Western Force,1_00007,1,...,Passive,Assist,Ineffective Tackle,2nd Tackler,Upper Torso,Lalomilo Lalomilo,12,Vaiolini Ekuasi,,
2,945268,Super Rugby Pacific,2025,15/02/2025,Western Force,Moana Pasifika,Western Force,Moana Pasifika,1_00028,1,...,Missed,,Ineffective Tackle,1st Tackler,,Tom Robertson,3,Ardie Savea,Stepped,Clean Break
3,945268,Super Rugby Pacific,2025,15/02/2025,Western Force,Moana Pasifika,Western Force,Moana Pasifika,1_00029,1,...,Missed,Assist,Ineffective Tackle,2nd Tackler,,Nic Dolly,2,Ardie Savea,Positional,Clean Break
4,945268,Super Rugby Pacific,2025,15/02/2025,Western Force,Moana Pasifika,Western Force,Moana Pasifika,1_00033,1,...,Complete,,Neutral Tackle,1st Tackler,Upper Torso,Ben Donaldson,10,Lalomilo Lalomilo,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7829,945271,Super Rugby Pacific,2025,22/02/2025,Hurricanes,Fijian Drua,Hurricanes,Fijian Drua,2_07913,17,...,Complete,,Neutral Tackle,1st Tackler,Upper Torso,Pouri Rakete-Stones,1,Isoa Nasilasila,,
7830,945271,Super Rugby Pacific,2025,22/02/2025,Hurricanes,Fijian Drua,Hurricanes,Fijian Drua,2_07913,17,...,Complete,Assist,Neutral Tackle,2nd Tackler,Lower Torso,Hugo Plummer,5,Isoa Nasilasila,,
7831,945271,Super Rugby Pacific,2025,22/02/2025,Hurricanes,Fijian Drua,Hurricanes,Fijian Drua,2_07922,18,...,Complete,,Neutral Tackle,1st Tackler,Lower Torso,Raymond Tuputupu,2,Inia Tabuavou,,
7832,945271,Super Rugby Pacific,2025,22/02/2025,Hurricanes,Fijian Drua,Hurricanes,Fijian Drua,2_07938,19,...,Missed,,Ineffective Tackle,1st Tackler,Lower Torso,Bailyn Sullivan,13,Iosefo Masi,Bumped Off,Tackled
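The `Time` column combines the period and the zero-padded match time, so sorting it as a string keeps the events of a game in chronological order. If we later need numbers again, a small helper can split the key. This is a minimal sketch that takes the `x_mmmss` layout described in the extraction code at face value; the function name `decode_time_key` is hypothetical.

```python
def decode_time_key(key):
    """Split a key such as '2_07913' into (period, minutes, seconds)."""
    period, clock = key.split("_")
    # the last two digits are the seconds, everything before is the minutes
    return int(period), int(clock[:-2]), int(clock[-2:])

print(decode_time_key("1_00005"))
print(decode_time_key("2_07913"))
```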


We save the extracted tackles to a CSV file.

In [15]:
suffix = ""

if comp=="*":
    suffix+="AllComps_"
else:
    suffix+=f"{comp}_"
    
if season=="*":
    suffix+="AllSeasons"
else:
    suffix+=f"{season}"

outfile = f"./data/tackles_{suffix}.csv"
print(f"Saving results to {outfile}")
df_tackle.to_csv(outfile, index=False) 

Saving results to ./data/tackles_SuperRugbyPacific_2024-25.csv
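Saving will fail if the `data` folder does not exist yet. The sketch below shows one way to build the same file name and create the folder first. The helper `build_outfile` is hypothetical, and the example writes into a temporary folder so that it does not touch our real results.

```python
import os
import tempfile

def build_outfile(comp, season, out_dir="./data"):
    """Build the output path and make sure the target folder exists."""
    comp_part = "AllComps" if comp == "*" else comp
    season_part = "AllSeasons" if season == "*" else season
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder is already there
    return os.path.join(out_dir, f"tackles_{comp_part}_{season_part}.csv")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data")
    path = build_outfile("*", "2024-25", out_dir)
    created = os.path.isdir(out_dir)
    name = os.path.basename(path)

print(name, created)
```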


Done.