# ETL for the Lower Body Injury Data 

Each part of the data for this analysis exceeds 50 million rows, so I'm using Polars instead of Pandas to keep memory usage under control; many of these steps are very slow in Pandas because of its unoptimized resource usage. I'm using SQLAlchemy to push and pull data from the database. However, because of the size of the tracking data, I'm loading it from the CSV files of any previously filtered data using the lazy loading feature of Polars (in the case of these data, the injury table was not filtered, so it is the native table).

In [118]:
import polars as pl
import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2

In [119]:
# Make connection to the database
from config import db_password
uri = f"postgresql://postgres:{db_password}@127.0.0.1:5432/nfl"
del db_password
query = "SELECT * FROM qualitative"

quals = pl.read_database_uri(query=query, uri=uri)


The qualitative table contains all of the player, weather, stadium, and injury data. It needs cleaning, and the changes are extensive enough that they are better handled programmatically in Python than in SQL. The stadium types and weather need major changes, and the nulls in the remaining columns are easily handled in Polars.

I have removed PlayerDay, PlayerGame, PlayerGamePlay, Position, and PositionGroup since most of these will artificially identify the specific player instead of the conditions that led to the injury. It may be true that more injuries occur later in the season, but this is unreliable since some of the early injured players may return to the field and get a secondary injury.

In [120]:
quals.head()

playkey,rosterposition,stadiumtype,fieldtype,temperature,weather,playtype,bodypart,dm_m1,dm_m7,dm_m28,dm_m42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,


In [121]:
# scan = pl.scan_csv("F:/Data/nfl-playing-surface-analytics/PlayerTrackData.csv", infer_schema_length=10000)
# tracker = scan.collect(streaming=True)

What I am considering for the Event column is filling null values within a play with that play's event name. So for all rows of 26624-1-1, each null would be filled with "huddle_start_offense," but I need to verify that there is only one event per play-player. To check this, I used SQL, grouping the distinct events and looking at the playkeys that have them. Most of the events are null; however, each of the 250 players has key events marked, with the timepoints in between left null. I'm considering splitting the time between two events in half and labeling the halves post-{} and pre-{} of the adjacent events, or maybe just cutting the middle regions in half and attaching the prior/following event names to fill the null values.

The previous analysis ignored the events, so I will be removing the events for the positional analysis to reduce the amount of data. 

In [122]:
# tracker.head()

## Cleaning the Stadiums
First thing, the column headers are ugly, so I want to capitalize them. Second, the stadium type is a problem. Literally, every stadium is described differently, including 7 unique spellings of the word 'Outdoor.'

In [123]:
quals.columns

['playkey',
 'rosterposition',
 'stadiumtype',
 'fieldtype',
 'temperature',
 'weather',
 'playtype',
 'bodypart',
 'dm_m1',
 'dm_m7',
 'dm_m28',
 'dm_m42']

In [124]:
def column_capitalizer(quals):
    columns = {
        'playkey': "PlayKey"
        , 'rosterposition': 'Position'
        , 'stadiumtype': 'StadiumType'
        , 'fieldtype': 'FieldType'
        , 'temperature': 'Temperature'
        , 'weather': 'Weather'
        , 'playtype': 'PlayType'
        , 'bodypart': 'BodyPart'
        , 'dm_m1': 'DM_1'
        , 'dm_m7': 'DM_7'
        , 'dm_m28': 'DM_28'
        , 'dm_m42': 'DM_42'
    }

    quals = quals.rename(columns)

    return quals



In [125]:
quals = column_capitalizer(quals)
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,


Let's take a look at the number of unique values in each of the columns: 

In [126]:
quals.select(pl.all().n_unique())

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
267005,10,30,2,79,64,12,4,2,3,3,3


Now to clean the stadiums...

In [127]:
quals["StadiumType"].unique().to_list()

['Oudoor',
 'Ourdoor',
 'Open',
 'Retr. Roof-Closed',
 'Outside',
 'Retr. Roof - Closed',
 'Dome, closed',
 'Outdor',
 'Domed, open',
 'Domed, closed',
 'Indoor, Roof Closed',
 'Indoor, Open Roof',
 'Outdoors',
 'Retr. Roof-Open',
 'Closed Dome',
 'Retr. Roof - Open',
 None,
 'Outdoor',
 'Dome',
 'Domed, Open',
 'Cloudy',
 'Indoors',
 'Indoor',
 'Outddors',
 'Heinz Field',
 'Outdoor Retr Roof-Open',
 'Retr. Roof Closed',
 'Retractable Roof',
 'Bowl',
 'Domed']

In [128]:
quals["StadiumType"].value_counts()

StadiumType,count
str,u32
"""Retr. Roof-Closed""",2015
,16910
"""Indoor""",6892
"""Open""",4124
"""Retr. Roof - Open""",486
…,…
"""Outdoors""",32956
"""Indoor, Roof Closed""",547
"""Domed, open""",1779
"""Outdoor""",145032


In [129]:
quals["StadiumType"].null_count()

16910

Since most of the stadiums in the NFL are outdoor, and the 11 stadiums with domes that would be considered indoor are already accounted for, the null values will be filled in as Outdoor.

In [130]:
quals = quals.with_columns(pl.col("StadiumType").fill_null("Outdoor"))

In [131]:
quals["StadiumType"].null_count()

0

For the rest of them, I will convert each stadium type to either Indoor or Outdoor. If the stadium is domed and the dome was open, the game is considered Outdoor, whereas games recorded with the dome closed are considered Indoor.

In [132]:
stadium_dict = {'Outdoor': 'Outdoor',
        'Indoors': 'Indoor',
        'Oudoor': 'Outdoor',
        'Outdoors': 'Outdoor',
        'Open': 'Outdoor',
        'Closed Dome': 'Indoor',
        'Domed, closed': 'Indoor',
        'Dome': 'Indoor',
        'Indoor': 'Indoor',
        'Domed': 'Indoor',
        'Retr. Roof-Closed': 'Indoor',
        'Outdoor Retr Roof-Open': 'Outdoor',
        'Retractable Roof': 'Indoor',
        'Ourdoor': 'Outdoor',
        'Indoor, Roof Closed': 'Indoor',
        'Retr. Roof - Closed': 'Indoor',
        'Bowl': 'Outdoor',
        'Outddors': 'Outdoor',
        'Retr. Roof-Open': 'Outdoor',
        'Dome, closed': 'Indoor',
        'Indoor, Open Roof': 'Outdoor',
        'Domed, Open': 'Outdoor',
        'Domed, open': 'Outdoor',
        'Heinz Field': 'Outdoor',
        'Cloudy': 'Outdoor',
        'Retr. Roof - Open': 'Outdoor',
        'Retr. Roof Closed': 'Indoor',
        'Outdor': 'Outdoor',
        'Outside': 'Outdoor'}


In [133]:
quals = quals.with_columns(pl.col("StadiumType").replace(stadium_dict))
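Polars' `replace` leaves any value without a key in the mapping unchanged, so a spelling missing from the dictionary would pass through silently. A small check like the one below could run just before the `replace` call; `unmapped_labels` is a hypothetical helper, and the sample labels are a subset chosen for illustration.

```python
def unmapped_labels(observed, mapping):
    """Return observed labels (ignoring None) that have no key in the mapping."""
    return sorted({label for label in observed if label is not None} - set(mapping))

# Hypothetical subset of the stadium labels and mapping.
sample_mapping = {"Outdoor": "Outdoor", "Oudoor": "Outdoor", "Dome": "Indoor"}
sample_observed = ["Outdoor", "Oudoor", "Dome", None, "Outdor"]

missing = unmapped_labels(sample_observed, sample_mapping)
print(missing)  # ['Outdor']
```

In the notebook, the equivalent call would be `unmapped_labels(quals["StadiumType"].unique().to_list(), stadium_dict)`, and an empty list means every spelling is covered.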

In [134]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,


In [135]:
quals.select(pl.all().n_unique())

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
267005,10,2,2,79,64,12,4,2,3,3,3


## Getting Clear Weather
There are even more weather conditions than there were stadium types, which isn't particularly helpful. I'm collapsing the weather into 7 categories: 6 that cover the main weather events for outdoor stadiums (Clear, Cloudy, Rain, Snow, Hazy/Fog, and Windy), plus Indoor. Indoor stadiums get their own label because the sun has no impact on visibility there, which may be a factor on clear days.

In [136]:
weather_dict = {'Clear and warm': 'Clear',
                'Mostly Cloudy': 'Cloudy',
                'Sunny': 'Clear',
                'Clear': 'Clear',
                'Cloudy': 'Cloudy',
                'Cloudy, fog started developing in 2nd quarter': 'Hazy/Fog',
                'Rain': 'Rain',
                'Partly Cloudy': 'Cloudy',
                'Mostly cloudy': 'Cloudy',
                'Cloudy and cold': 'Cloudy',
                'Cloudy and Cool': 'Cloudy',
                'Rain Chance 40%': 'Rain',
                'Controlled Climate': 'Indoor',
                'Sunny and warm': 'Clear',
                'Partly cloudy': 'Cloudy',
                'Clear and Cool': 'Cloudy',
                'Clear and cold': 'Cloudy',
                'Sunny and cold': 'Clear',
                'Indoor': 'Indoor',
                'Partly Sunny': 'Clear',
                'N/A (Indoors)': 'Indoor',
                'Mostly Sunny': 'Clear',
                'Indoors': 'Indoor',
                'Clear Skies': 'Clear',
                'Partly sunny': 'Clear',
                'Showers': 'Rain',
                'N/A Indoor': 'Indoor',
                'Sunny and clear': 'Clear',
                'Snow': 'Snow',
                'Scattered Showers': 'Rain',
                'Party Cloudy': 'Cloudy',
                'Clear skies': 'Clear',
                'Rain likely, temps in low 40s.': 'Rain',
                'Hazy': 'Hazy/Fog',
                'Partly Clouidy': 'Cloudy',
                'Sunny Skies': 'Clear',
                'Overcast': 'Cloudy',
                'Cloudy, 50% change of rain': 'Cloudy',
                'Fair': 'Clear',
                'Light Rain': 'Rain',
                'Partly clear': 'Clear',
                'Mostly Coudy': 'Cloudy',
                '10% Chance of Rain': 'Cloudy',
                'Cloudy, chance of rain': 'Cloudy',
                'Heat Index 95': 'Clear',
                'Sunny, highs to upper 80s': 'Clear',
                'Sun & clouds': 'Cloudy',
                'Heavy lake effect snow': 'Snow',
                'Mostly sunny': 'Clear',
                'Cloudy, Rain': 'Rain',
                'Sunny, Windy': 'Windy',
                'Mostly Sunny Skies': 'Clear',
                'Rainy': 'Rain',
                '30% Chance of Rain': 'Rain',
                'Cloudy, light snow accumulating 1-3"': 'Snow',
                'cloudy': 'Cloudy',
                'Clear and Sunny': 'Clear',
                'Coudy': 'Cloudy',
                'Clear and sunny': 'Clear',
                'Clear to Partly Cloudy': 'Clear',
                'Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.': 'Windy',
                'Rain shower': 'Rain',
                'Cold': 'Clear'}


quals = quals.with_columns(pl.col("Weather").replace(weather_dict))
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,


In [137]:
quals["Weather"].value_counts()

Weather,count
str,u32
"""Windy""",713
"""Cloudy""",112306
"""Indoor""",20277
"""Clear""",96985
"""Rain""",14280
,18691
"""Snow""",1945
"""Hazy/Fog""",1809


Since I don't know whether the nulls are Indoor or Outdoor, I am going to change the weather from all indoor stadiums to "Indoor," and hopefully if those nulls are indoor, they'll all disappear!

In [138]:
quals = quals.with_columns(
                pl.when(pl.col("StadiumType") == "Indoor")
                .then(pl.col("Weather").fill_null("Indoor"))
                .otherwise(pl.col("Weather"))
                .alias("Weather")
)

In [139]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,


In [140]:
quals["Weather"].value_counts().sort("count", descending=True)

Weather,count
str,u32
"""Cloudy""",112306
"""Clear""",96985
"""Indoor""",33862
"""Rain""",14280
,5106
"""Snow""",1945
"""Hazy/Fog""",1809
"""Windy""",713


In [141]:
quals.filter(pl.col("Weather").is_null()).select(["Weather", "StadiumType", "Temperature"]).unique()

Weather,StadiumType,Temperature
str,str,i32
,"""Outdoor""",70
,"""Outdoor""",72
,"""Outdoor""",63
,"""Outdoor""",79
,"""Outdoor""",77
,"""Outdoor""",58
,"""Outdoor""",71
,"""Outdoor""",40
,"""Outdoor""",68
,"""Outdoor""",39


It appears that there are 10 games where the weather was not recorded, and based on the EDA each was played at a different stadium. Since cloudy and clear games have a fairly similar distribution, I don't want to assign all of the roughly 5,000 null rows (5,106) to one of them, so games warmer than 70 degrees will become Clear and the rest will become Cloudy. This isn't perfect, but without datetime data it's impossible to determine which games these were and what the conditions actually were.

In [142]:
quals = quals.with_columns(
                pl.when(pl.col("Temperature") > 70)
                .then(pl.col("Weather").fill_null("Clear"))
                .otherwise(pl.col("Weather"))
                .alias("Weather")
)

In [143]:
quals.filter(pl.col("Weather").is_null()).select(["Weather", "StadiumType", "Temperature"]).unique()

Weather,StadiumType,Temperature
str,str,i32
,"""Outdoor""",63
,"""Outdoor""",68
,"""Outdoor""",39
,"""Outdoor""",40
,"""Outdoor""",70
,"""Outdoor""",58


In [144]:
quals = quals.with_columns(pl.col("Weather").fill_null("Cloudy"))

In [145]:
quals["Weather"].value_counts().sort("count", descending=True)

Weather,count
str,u32
"""Cloudy""",115504
"""Clear""",98893
"""Indoor""",33862
"""Rain""",14280
"""Snow""",1945
"""Hazy/Fog""",1809
"""Windy""",713


## Helping the Uninjured
The join with the injury data left a whole lot of null values. Each null in the DM columns should be 0, since the player did not sustain an injury of that duration. For BodyPart, we can add a placeholder such as "NoInjury"; however, this will ultimately get encoded, so I will need an additional binary column, IsInjured, and then each remaining body part will become its own column holding a 1 or 0 for whether the injury sustained was to the Knee, Foot, or Ankle.

In [146]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,


First, let's change all of the null Body Parts to No Injury

In [147]:
quals = quals.with_columns(pl.col("BodyPart").fill_null("NoInjury"))

In [148]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",,,,


In [149]:
quals.null_count()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,367,0,266929,266929,266929,266929


What I would like to do is fill all remaining nulls with a 0, but only in the last 4 columns; however, I'm seeing that there are still 367 play types that are unknown. I will just label these as Unknown for now, and if possible I can determine whether the events give us an indication upon merging later.

In [150]:
quals = quals.with_columns(pl.col("PlayType").fill_null("Unknown"))

In [151]:
quals.null_count()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,266929,266929,266929,266929


In [152]:
quals = quals.with_columns(
    pl.col(["DM_1", "DM_7", "DM_28", "DM_42"]).fill_null(0))

In [153]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",0,0,0,0
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",0,0,0,0
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0


In [154]:
quals.null_count()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0,0


This final check looks at whether any of the Unknown plays led to an injury. If none did, I suspect the label will act as a false identifier: whenever a play comes up as Unknown, an ML model will learn to predict no injury, even though this is just an artifact of bad record-keeping.

In [155]:
quals.filter((pl.col("PlayType") == "Unknown") & (pl.col("BodyPart") != "NoInjury"))

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32


As seen, this is the case: no Unknown play led to an injury. Now that these data are clean, how much data will be lost if we remove those rows?

In [77]:
quals.filter(pl.col("PlayType") == "Unknown").count()/quals.count()*100

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745


By removing these data, we will be losing 0.14% of the dataset, which will not be a lot to lose for purposes of keeping the PlayType as part of the analysis. 
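The percentage can be reproduced directly from the counts reported earlier. The total row count here is the number of rows before the Unknown plays are removed, taken from this notebook's run.

```python
unknown_rows = 367      # null PlayType count reported by null_count()
total_rows = 267_006    # rows in quals before removing the Unknown plays

pct_lost = unknown_rows / total_rows * 100
rows_kept = total_rows - unknown_rows

print(f"{pct_lost:.5f}% of rows lost, {rows_kept} rows kept")
```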

In [156]:
# PlayType nulls were already filled with "Unknown" above, so filter on that label;
# is_not_null() would keep every row at this point.
quals = quals.filter(pl.col("PlayType") != "Unknown")

In [157]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",0,0,0,0
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",0,0,0,0
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0


In [158]:
quals.count()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
266639,266639,266639,266639,266639,266639,266639,266639,266639,266639,266639,266639


The final step of this process is to rewrite this code for production, so that future runs or modifications don't need to go through a Jupyter notebook.

In [None]:
import polars as pl
import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2

# Make connection to the database
from config import db_password
uri = f"postgresql://postgres:{db_password}@127.0.0.1:5432/nfl"
del db_password
query = "SELECT * FROM qualitative"

quals = pl.read_database_uri(query=query, uri=uri)

In [None]:
def column_capitalizer(quals):          # Renames the all-lowercase column headers to PascalCase
    columns = {
        'playkey': "PlayKey"
        , 'rosterposition': 'Position'
        , 'stadiumtype': 'StadiumType'
        , 'fieldtype': 'FieldType'
        , 'temperature': 'Temperature'
        , 'weather': 'Weather'
        , 'playtype': 'PlayType'
        , 'bodypart': 'BodyPart'
        , 'dm_m1': 'DM_1'
        , 'dm_m7': 'DM_7'
        , 'dm_m28': 'DM_28'
        , 'dm_m42': 'DM_42'
    } 
    quals = quals.rename(columns)

    return quals


In [None]:
def stadium_cleaner(quals):         # This changes stadiums to either Indoor or Outdoor per game records - some of the dome stadiums have a roof that can open, if open the game is considered outdoor.
    stadium_dict = {'Outdoor': 'Outdoor',
        'Indoors': 'Indoor',
        'Oudoor': 'Outdoor',
        'Outdoors': 'Outdoor',
        'Open': 'Outdoor',
        'Closed Dome': 'Indoor',
        'Domed, closed': 'Indoor',
        'Dome': 'Indoor',
        'Indoor': 'Indoor',
        'Domed': 'Indoor',
        'Retr. Roof-Closed': 'Indoor',
        'Outdoor Retr Roof-Open': 'Outdoor',
        'Retractable Roof': 'Indoor',
        'Ourdoor': 'Outdoor',
        'Indoor, Roof Closed': 'Indoor',
        'Retr. Roof - Closed': 'Indoor',
        'Bowl': 'Outdoor',
        'Outddors': 'Outdoor',
        'Retr. Roof-Open': 'Outdoor',
        'Dome, closed': 'Indoor',
        'Indoor, Open Roof': 'Outdoor',
        'Domed, Open': 'Outdoor',
        'Domed, open': 'Outdoor',
        'Heinz Field': 'Outdoor',
        'Cloudy': 'Outdoor',
        'Retr. Roof - Open': 'Outdoor',
        'Retr. Roof Closed': 'Indoor',
        'Outdor': 'Outdoor',
        'Outside': 'Outdoor'}
    
    quals = quals.with_columns(pl.col("StadiumType").fill_null("Outdoor")) # Since most stadiums are outdoor and the percentage of games played indoor is already met by the known indoor games those seasons, all unknown games were set to outdoor
    quals = quals.with_columns(pl.col("StadiumType").replace(stadium_dict)) # This uses the dict to assign naming conventions

    return quals

In [50]:
def weather_cleaner(quals):
     weather_dict = {'Clear and warm': 'Clear',
                'Mostly Cloudy': 'Cloudy',
                'Sunny': 'Clear',
                'Clear': 'Clear',
                'Cloudy': 'Cloudy',
                'Cloudy, fog started developing in 2nd quarter': 'Hazy/Fog',
                'Rain': 'Rain',
                'Partly Cloudy': 'Cloudy',
                'Mostly cloudy': 'Cloudy',
                'Cloudy and cold': 'Cloudy',
                'Cloudy and Cool': 'Cloudy',
                'Rain Chance 40%': 'Rain',
                'Controlled Climate': 'Indoor',
                'Sunny and warm': 'Clear',
                'Partly cloudy': 'Cloudy',
                'Clear and Cool': 'Cloudy',
                'Clear and cold': 'Cloudy',
                'Sunny and cold': 'Clear',
                'Indoor': 'Indoor',
                'Partly Sunny': 'Clear',
                'N/A (Indoors)': 'Indoor',
                'Mostly Sunny': 'Clear',
                'Indoors': 'Indoor',
                'Clear Skies': 'Clear',
                'Partly sunny': 'Clear',
                'Showers': 'Rain',
                'N/A Indoor': 'Indoor',
                'Sunny and clear': 'Clear',
                'Snow': 'Snow',
                'Scattered Showers': 'Rain',
                'Party Cloudy': 'Cloudy',
                'Clear skies': 'Clear',
                'Rain likely, temps in low 40s.': 'Rain',
                'Hazy': 'Hazy/Fog',
                'Partly Clouidy': 'Cloudy',
                'Sunny Skies': 'Clear',
                'Overcast': 'Cloudy',
                'Cloudy, 50% change of rain': 'Cloudy',
                'Fair': 'Clear',
                'Light Rain': 'Rain',
                'Partly clear': 'Clear',
                'Mostly Coudy': 'Cloudy',
                '10% Chance of Rain': 'Cloudy',
                'Cloudy, chance of rain': 'Cloudy',
                'Heat Index 95': 'Clear',
                'Sunny, highs to upper 80s': 'Clear',
                'Sun & clouds': 'Cloudy',
                'Heavy lake effect snow': 'Snow',
                'Mostly sunny': 'Clear',
                'Cloudy, Rain': 'Rain',
                'Sunny, Windy': 'Windy',
                'Mostly Sunny Skies': 'Clear',
                'Rainy': 'Rain',
                '30% Chance of Rain': 'Rain',
                'Cloudy, light snow accumulating 1-3"': 'Snow',
                'cloudy': 'Cloudy',
                'Clear and Sunny': 'Clear',
                'Coudy': 'Cloudy',
                'Clear and sunny': 'Clear',
                'Clear to Partly Cloudy': 'Clear',
                'Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.': 'Windy',
                'Rain shower': 'Rain',
                'Cold': 'Clear'}
     
     quals = quals.with_columns(pl.col("Weather").replace(weather_dict)) # Standardizes the weather to a few main types

     quals = quals.with_columns(             # Null handling - all null weather conditions for indoor stadiums are filled "indoor"
                pl.when(pl.col("StadiumType") == "Indoor")
                .then(pl.col("Weather").fill_null("Indoor"))
                .otherwise(pl.col("Weather"))
                .alias("Weather")
                )
     
     # For the non-indoor games with null values for weather, to maintain the percentage of games that were clear/cloudy, temperature was used as a divider, above and below 70 degrees
     quals = quals.with_columns(
                pl.when(pl.col("Temperature") > 70)
                .then(pl.col("Weather").fill_null("Clear"))
                .otherwise(pl.col("Weather"))
                .alias("Weather")
                )
     quals = quals.with_columns(pl.col("Weather").fill_null("Cloudy"))

     return quals


In [None]:
def injury_cleaner(quals):
    quals = quals.filter(pl.col('PlayType').is_not_null()) # 0.14% of rows did not have a play type, and ALL of these were non-injury plays, so they were removed

    quals = quals.with_columns(pl.col("BodyPart").fill_null("NoInjury")) # This fills all null from the join with No Injury

    quals = quals.with_columns(
    pl.col(["DM_1", "DM_7", "DM_28", "DM_42"]).fill_null(0)) # This fills the nulls from the Join with 0s, since there were no injuries.

    return quals




In [1]:
import polars as pl
import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2
from CleaningFunctions import *

# Make connection to the database
from config import db_password
uri = f"postgresql://postgres:{db_password}@127.0.0.1:5432/nfl"
del db_password
query = "SELECT * FROM qualitative"

quals = pl.read_database_uri(query=query, uri=uri)



In [4]:
from CleaningFunctions import *

In [5]:
quals = clean_injuries('quals')


In [7]:
quals = data_loader('clean_quals')

In [8]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",0,0,0,0
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",0,0,0,0
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0


## Shrinking the Size of the Table

I want to reduce the size of this table before exporting to optimize speed and time for additional data manipulation. This won't be as critical with the qualitative data as the tracking, but I may as well do it for both!

In [10]:
import numpy as np
import polars as pl

def data_shrinker(df, verbose=True):
    """
    Optimize memory usage of a Polars dataframe for both categorical and numeric data.
    """
    start_mem = df.estimated_size("mb")
    if verbose:
        print(f'Memory usage of dataframe is {start_mem:.2f} MB')
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type in [pl.Int8, pl.Int16, pl.Int32, pl.Int64, pl.Float32, pl.Float64]:
            c_min = df[col].min()
            c_max = df[col].max()
            
            if col_type.is_integer():
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df = df.with_columns(pl.col(col).cast(pl.Int8))
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df = df.with_columns(pl.col(col).cast(pl.Int16))
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df = df.with_columns(pl.col(col).cast(pl.Int32))
                else:
                    df = df.with_columns(pl.col(col).cast(pl.Int64))
            else:
                if c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df = df.with_columns(pl.col(col).cast(pl.Float32))
                else:
                    df = df.with_columns(pl.col(col).cast(pl.Float64))
        
        elif col_type == pl.Utf8:
            if df[col].n_unique() / len(df) < 0.5:  # If less than 50% unique values
                df = df.with_columns(pl.col(col).cast(pl.Categorical))

    end_mem = df.estimated_size("mb")
    if verbose:
        print(f'Memory usage after optimization is: {end_mem:.2f} MB')
        print(f'Decreased by {100 * (start_mem - end_mem) / start_mem:.1f}%')
    
    return df

In [11]:
optimized = data_shrinker(quals)

Memory usage of dataframe is 19.22 MB
Memory usage after optimization is: 10.29 MB
Decreased by 46.5%
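
To make the downcasting rule concrete, here is a small sketch in plain Python (standard library only). The helper `smallest_int_dtype` is hypothetical and only mirrors the integer branch of `data_shrinker`, and the example ranges are made up. The `struct` round trip shows what storing a value as a 32-bit float does to it, which is why a coordinate such as 87.45 shows up as 87.449997 once a column has been cast to f32.

```python
import struct

# Signed integer ranges, checked in the same order as data_shrinker
INT_RANGES = [
    ("Int8", -2**7, 2**7 - 1),
    ("Int16", -2**15, 2**15 - 1),
    ("Int32", -2**31, 2**31 - 1),
    ("Int64", -2**63, 2**63 - 1),
]

def smallest_int_dtype(c_min, c_max):
    """Return the name of the narrowest signed integer type that holds [c_min, c_max]."""
    for name, lo, hi in INT_RANGES:
        if c_min >= lo and c_max <= hi:
            return name
    raise OverflowError("values do not fit in a signed 64-bit integer")

def as_float32(value):
    """Round-trip a Python float through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", value))[0]

print(smallest_int_dtype(0, 1))        # a 0/1 indicator column fits in Int8
print(smallest_int_dtype(-999, 97))    # made-up range that needs Int16
print(as_float32(87.45))               # close to, but not exactly, 87.45
```

The integer casts are lossless as long as the range check passes; the float cast trades a small rounding error for half the memory.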


There will be quantitative summaries from the tracking data that I would like to add to the qualitative data. These can be merged on the PlayKey: because the summaries are per-play maxima and means, there is exactly one row per PlayKey. This gives descriptive statistics from the quantitative data that can be used alongside the qualitative data in the models.

## Handling the differences between orientation and direction
This becomes an issue because the angles are all recorded in degrees, with 0 degrees facing the Visitor sideline. A number of artifacts arise when the direction and orientation cross the 0 line, and simply applying a mod that shifts everything by 180 does not help, because the same issue then appears at the new boundaries.

To handle this, I will treat each angle as a unit vector and work with its sine and cosine components. I can then take the sine and cosine of the difference and use an arctan transform to recover the angle between the two.
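
As a minimal sketch of this idea, using only the standard `math` module: the sine and cosine of the difference are passed to `atan2`, which returns the smallest signed angle without any special casing at the 0/360 boundary. The function name `signed_angle_between` is hypothetical and is not part of the notebook.

```python
import math

def signed_angle_between(a_deg, b_deg):
    """Smallest signed angle from a_deg to b_deg, in degrees, within [-180, 180]."""
    diff = math.radians(b_deg - a_deg)
    return math.degrees(math.atan2(math.sin(diff), math.cos(diff)))

naive = 10 - 350                           # -340: an artifact of crossing the 0 line
wrapped = signed_angle_between(350, 10)    # the actual 20 degree separation
print(naive, round(wrapped, 6))
```

Taking the absolute value of the result gives an unsigned separation between 0 and 180 degrees.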

In [170]:
import polars as pl

data = {
    "PlayKey": ["26624-1-1", "26624-1-1", "26624-1-1", "26624-1-2", "26624-1-2"],
    "time": [0.0, 0.1, 0.2, 0.0, 0.1],
    "event": ["huddle_start_offense", None, None, "cheesecake_served", None],
    "x": [87.46, 87.45, 87.44, 80.0, 80.1],
    "y": [28.93, 28.92, 28.92, 20.0, 20.1],
    "dir": [15, 350, 340, 180, 190],
    "dis": [0.01, 0.01, 0.01, 0.01, 0.01],
    "o": [10, 350, 340, 180, 190],
    "s": [0.13, 0.12, 0.12, 0.10, 0.09]
}

df = pl.DataFrame(data)
print(df)

shape: (5, 9)
┌───────────┬──────┬──────────────────────┬───────┬───┬─────┬──────┬─────┬──────┐
│ PlayKey   ┆ time ┆ event                ┆ x     ┆ … ┆ dir ┆ dis  ┆ o   ┆ s    │
│ ---       ┆ ---  ┆ ---                  ┆ ---   ┆   ┆ --- ┆ ---  ┆ --- ┆ ---  │
│ str       ┆ f64  ┆ str                  ┆ f64   ┆   ┆ i64 ┆ f64  ┆ i64 ┆ f64  │
╞═══════════╪══════╪══════════════════════╪═══════╪═══╪═════╪══════╪═════╪══════╡
│ 26624-1-1 ┆ 0.0  ┆ huddle_start_offense ┆ 87.46 ┆ … ┆ 15  ┆ 0.01 ┆ 10  ┆ 0.13 │
│ 26624-1-1 ┆ 0.1  ┆ null                 ┆ 87.45 ┆ … ┆ 350 ┆ 0.01 ┆ 350 ┆ 0.12 │
│ 26624-1-1 ┆ 0.2  ┆ null                 ┆ 87.44 ┆ … ┆ 340 ┆ 0.01 ┆ 340 ┆ 0.12 │
│ 26624-1-2 ┆ 0.0  ┆ cheesecake_served    ┆ 80.0  ┆ … ┆ 180 ┆ 0.01 ┆ 180 ┆ 0.1  │
│ 26624-1-2 ┆ 0.1  ┆ null                 ┆ 80.1  ┆ … ┆ 190 ┆ 0.01 ┆ 190 ┆ 0.09 │
└───────────┴──────┴──────────────────────┴───────┴───┴─────┴──────┴─────┴──────┘


In [154]:
# df = df.drop(["event", "dis", "s"]) # Drop the event column and the dis and s columns, which are distance and speed determined by the NGS, but not on the positional data
# df = df.sort(["PlayKey", "time"]) # We need to group by PlayKey and then sort by time, so as not to sort only by time and merge all of the playkeys. 

# # This will change the angles from -180 to 180, which allows for more accurate angular calculations compared to that of 0 to 360, which introduces fringe artifacts
# df = df.with_columns([
#     ((pl.col("dir") + 180) % 360 - 180).alias("dir"),
#     ((pl.col("o") + 180) % 360 - 180).alias("o")
# ])


# print(df)

The following code works for calculating dynamics parameters. 

In [None]:
# # Dynamics and Calculus
# df = df.with_columns([
#     ((((pl.col("x") - pl.col("x").shift(1)).over("PlayKey"))**2 + ((pl.col("y") - pl.col("y").shift(1)).over("PlayKey"))**2)**0.5).alias("displacement"), # Calculate displacement (dx^2 + dy^2)^0.5
#     (((((pl.col("x") - pl.col("x").shift(1)).over("PlayKey"))**2 + ((pl.col("y") - pl.col("y").shift(1)).over("PlayKey"))**2)**0.5)/(pl.col("time") - pl.col("time").shift(1)).over("PlayKey")).alias("speed"), # Speed = displacement/time
#     (np.degrees(np.arctan2((pl.col("x") - pl.col("x").shift(1)).over("PlayKey"), (pl.col("y") - pl.col("y").shift(1)).over("PlayKey")))).alias("direction"), # direction is arctangent of the x and y vectors 
#     ((pl.col("x") - pl.col("x").shift(1)).over("PlayKey")/(pl.col("time") - pl.col("time").shift(1)).over("PlayKey")).alias("vx"), # Velocity component in the x direction
#     ((pl.col("y") - pl.col("y").shift(1)).over("PlayKey")/(pl.col("time") - pl.col("time").shift(1)).over("PlayKey")).alias("vy"), # Velocity component in the y direction
#     ((pl.col("dir") - pl.col("dir").shift(1)).over("PlayKey").alias("d_dir")/(pl.col("time") - pl.col("time").shift(1)).over("PlayKey").alias("dt")).alias("omega_dir"), # Angular velocity of change in direction of motion
#     ((pl.col("o") - pl.col("o").shift(1)).over("PlayKey").alias("d_o")/(pl.col("time") - pl.col("time").shift(1)).over("PlayKey").alias("dt")).alias("omega_o"), # Angular velocity of the change in orientation
#     (calculate_angle_difference(df["dir"], df["o"])).alias("angle_diff") # Calculate the difference between the angles 
# ])

I entered this code into AI to see the best way to optimize it and got similar results from Gemini and from Claude Sonnet. The following is a modification of the Sonnet output. It precalculates the shifted values in a single operation and then chains the subsequent calculations using the with_columns() operator.

In [256]:
def dynamics_calculator(df):
    """
    Using the (x, y) and time columns, perform calculations based on the
    difference between two rows to find displacement, speed, direction
    of motion, velocity in x and y components, and the angular velocities
    of the direction of motion and orientation.
    """
    import numpy as np
    import polars as pl

    df = df.with_columns([
        # Pre-calculate shifted values
        pl.col("x").shift(1).over("PlayKey").alias("prev_x"),
        pl.col("y").shift(1).over("PlayKey").alias("prev_y"),
        pl.col("time").shift(1).over("PlayKey").alias("prev_time"),
        pl.col("dir").shift(1).over("PlayKey").alias("prev_dir"),
        pl.col("o").shift(1).over("PlayKey").alias("prev_o")
    ]).with_columns([
        # Calculate time difference
        (pl.col("time") - pl.col("prev_time")).alias("dt"),
        # Calculate x and y differences
        (pl.col("x") - pl.col("prev_x")).alias("dx"),
        (pl.col("y") - pl.col("prev_y")).alias("dy")
    ]).with_columns([
        # Calculate displacement
        ((pl.col("dx")**2 + pl.col("dy")**2)**0.5).alias("dist")
    ]).with_columns([
        # Calculate speed
        (pl.col("dist") / pl.col("dt")).alias("speed"),
        # Calculate direction
        (np.degrees(np.arctan2(pl.col("dx"), pl.col("dy")))).alias("direction"),
        # Calculate velocity components
        (pl.col("dx") / pl.col("dt")).alias("vx"),
        (pl.col("dy") / pl.col("dt")).alias("vy"),
        # Calculate angular velocities
        ((pl.col("dir") - pl.col("prev_dir")) / pl.col("dt")).alias("omega_dir"),
        ((pl.col("o") - pl.col("prev_o")) / pl.col("dt")).alias("omega_o")
    ]).with_columns([
        ((pl.col("omega_dir") - pl.col("omega_o")).abs()).alias("d_omega")
    ]).drop([
        "prev_x", "prev_y", "prev_time", "prev_dir", "prev_o", "dt", "dx", "dy"
    ]).drop_nulls()


    return df
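
To illustrate what `shift(1).over("PlayKey")` is doing, here is a rough equivalent in plain Python using the toy rows from earlier. It is only a sketch of the grouping logic, not how Polars implements it, and the helper name `per_play_dx` is hypothetical. The key point is that the "previous" value is reset at the start of each play, so the first row of every play has no difference and is later removed by `drop_nulls()`.

```python
from itertools import groupby

# (PlayKey, time, x) rows, already ordered by PlayKey and then time
rows = [
    ("26624-1-1", 0.0, 87.46),
    ("26624-1-1", 0.1, 87.45),
    ("26624-1-1", 0.2, 87.44),
    ("26624-1-2", 0.0, 80.0),
    ("26624-1-2", 0.1, 80.1),
]

def per_play_dx(rows):
    """Difference in x from the previous sample, restarting at every new PlayKey."""
    out = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        prev_x = None                      # no previous sample at the start of a play
        for _play, _time, x in group:
            out.append(None if prev_x is None else x - prev_x)
            prev_x = x
    return out

print(per_play_dx(rows))
```

Without the per-play grouping, the first row of play 26624-1-2 would be differenced against the last row of play 26624-1-1, producing a spurious jump of more than 7 units in x.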

In [173]:
df = dynamics_calculator(df)
print(df)

shape: (3, 13)
┌───────────┬──────┬───────────┬───────┬───┬───────────┬───────────┬───────────┬─────────┐
│ PlayKey   ┆ time ┆ x         ┆ y     ┆ … ┆ vx        ┆ vy        ┆ omega_dir ┆ omega_o │
│ ---       ┆ ---  ┆ ---       ┆ ---   ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---     │
│ cat       ┆ f32  ┆ f32       ┆ f32   ┆   ┆ f32       ┆ f32       ┆ f32       ┆ f32     │
╞═══════════╪══════╪═══════════╪═══════╪═══╪═══════════╪═══════════╪═══════════╪═════════╡
│ 26624-1-1 ┆ 0.1  ┆ 87.449997 ┆ 28.92 ┆ … ┆ -0.100021 ┆ -0.100002 ┆ 3350.0    ┆ 3400.0  │
│ 26624-1-1 ┆ 0.2  ┆ 87.440002 ┆ 28.92 ┆ … ┆ -0.099945 ┆ 0.0       ┆ -100.0    ┆ -100.0  │
│ 26624-1-2 ┆ 0.1  ┆ 80.099998 ┆ 20.1  ┆ … ┆ 0.999985  ┆ 1.000004  ┆ 100.0     ┆ 100.0   │
└───────────┴──────┴───────────┴───────┴───┴───────────┴───────────┴───────────┴─────────┘
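
The 3350.0 and 3400.0 in the first row come from the angles crossing the 0 line between samples (15 to 350 degrees for dir, 10 to 350 for o) rather than from a real 335 degree turn in 0.1 s. One possible way to guard against this, sketched here in plain Python and not taken from the notebook, is to wrap the angle difference into [-180, 180) before dividing by the time step. The helper names `wrap_degrees` and `angular_velocity` are hypothetical.

```python
def wrap_degrees(delta):
    """Map an angle difference in degrees onto the interval [-180, 180)."""
    return (delta + 180.0) % 360.0 - 180.0

def angular_velocity(prev_deg, curr_deg, dt):
    """Angular velocity in degrees per second, using the shortest rotation."""
    return wrap_degrees(curr_deg - prev_deg) / dt

dt = 0.1
print((350 - 15) / dt)                 # raw difference across the 0 line
print(angular_velocity(15, 350, dt))   # shortest rotation for dir
print(angular_velocity(10, 350, dt))   # shortest rotation for o
```

Remapping the angles themselves onto -180 to 180 removes the artifact at the 0 line but moves the discontinuity to the ±180 boundary, whereas wrapping the difference avoids it at any heading.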


The next function calculates the difference between the dir and o angles to get a sense of the relationship between the direction of motion and the direction the player is facing. I would also like to know how that angle changes over time.

In [229]:
def calculate_angle_difference(angle1, angle2):
    """
    Calculate the smallest angle difference between two angles 
    using trigonometric functions, accounting for edge cases.
    """
    import numpy as np
    sin_diff = np.sin(np.radians(angle2 - angle1))
    cos_diff = np.cos(np.radians(angle2 - angle1))
    return np.degrees(np.arctan2(sin_diff, cos_diff))

In [244]:
def angle_corrector(df):
    """
    Make corrections to angles to reduce fringe errors at 360
    """
    import polars as pl

    df = df.with_columns([
        ((pl.col("dir") + 180) % 360 - 180).alias("dir"),
        ((pl.col("o") + 180) % 360 - 180).alias("o")
    ]).with_columns(
        (calculate_angle_difference(pl.col("dir"), pl.col("o"))).abs().round(2).alias("angle_diff")
    )

    return df

In [124]:
df = df.drop_nulls() # This will drop the first row from every Play, which cannot be used to calculate the differences. 
print(df)

shape: (3, 14)
┌───────────┬──────┬───────┬───────┬───┬──────┬───────────┬─────────┬────────────┐
│ PlayKey   ┆ time ┆ x     ┆ y     ┆ … ┆ vy   ┆ omega_dir ┆ omega_o ┆ angle_diff │
│ ---       ┆ ---  ┆ ---   ┆ ---   ┆   ┆ ---  ┆ ---       ┆ ---     ┆ ---        │
│ str       ┆ f64  ┆ f64   ┆ f64   ┆   ┆ f64  ┆ f64       ┆ f64     ┆ f64        │
╞═══════════╪══════╪═══════╪═══════╪═══╪══════╪═══════════╪═════════╪════════════╡
│ 26624-1-1 ┆ 0.1  ┆ 87.45 ┆ 28.92 ┆ … ┆ -0.1 ┆ -250.0    ┆ -200.0  ┆ 0.0        │
│ 26624-1-1 ┆ 0.2  ┆ 87.44 ┆ 28.92 ┆ … ┆ 0.0  ┆ -100.0    ┆ -100.0  ┆ 0.0        │
│ 26624-1-2 ┆ 0.1  ┆ 80.1  ┆ 20.1  ┆ … ┆ 1.0  ┆ 100.0     ┆ 100.0   ┆ 0.0        │
└───────────┴──────┴───────┴───────┴───┴──────┴───────────┴─────────┴────────────┘


In [125]:
df.columns


['PlayKey',
 'time',
 'x',
 'y',
 'dir',
 'o',
 'displacement',
 'speed',
 'direction',
 'vx',
 'vy',
 'omega_dir',
 'omega_o',
 'angle_diff']

Now to combine all of these steps into a series of smaller functions: 

In [205]:
import polars as pl

data = {
    "PlayKey": ["26624-1-1", "26624-1-1", "26624-1-1", "26624-1-2", "26624-1-2"],
    "time": [0.0, 0.1, 0.2, 0.0, 0.1],
    "event": ["huddle_start_offense", None, None, "cheesecake_served", None],
    "x": [87.46, 87.45, 87.44, 80.0, 80.1],
    "y": [28.93, 28.92, 28.92, 20.0, 20.1],
    "dir": [15, 350, 340, 180, 190],
    "dis": [0.01, 0.01, 0.01, 0.01, 0.01],
    "o": [10, 350, 340, 180, 190],
    "s": [0.13, 0.12, 0.12, 0.10, 0.09]
}

df = pl.DataFrame(data)


In [206]:
df.columns

['PlayKey', 'time', 'event', 'x', 'y', 'dir', 'dis', 'o', 's']

I want to create a new table that gives the total distance traveled as well as the displacement between the player's start and finish positions for each PlayKey.
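
The distinction matters because a player who cuts back and forth can cover a lot of ground while ending close to where they started. As a rough illustration in plain Python (the sample path is made up, and `path_metrics` is a hypothetical helper rather than the notebook's `path_calculator`):

```python
import math

def path_metrics(points):
    """Return (distance, displacement) for a list of (x, y) samples in time order."""
    # Total distance: sum of the straight-line steps between consecutive samples
    distance = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    # Displacement: straight line from the first sample to the last
    displacement = math.dist(points[0], points[-1])
    return distance, displacement

# Made-up path: 3 units right, 4 units up, then 3 units back left
path = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
distance, displacement = path_metrics(path)
print(distance, displacement, distance - displacement)
```

The difference between the two is what the `path_diff` column captures: it is zero for a perfectly straight run and grows as the path becomes less direct.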

In [304]:
def path_calculator(df):
    import polars as pl
    # Calculate total distance and displacement for each PlayKey
    result = df.select([
        "PlayKey",
        pl.col("dist").sum().over("PlayKey").alias("distance"),
        pl.col("x").first().over("PlayKey").alias("start_x"),
        pl.col("y").first().over("PlayKey").alias("start_y"),
        pl.col("x").last().over("PlayKey").alias("end_x"),
        pl.col("y").last().over("PlayKey").alias("end_y"), 
        pl.col("angle_diff").max().over("PlayKey").alias("max_angle_diff"),
        pl.col("angle_diff").mean().over("PlayKey").alias("mean_angle_diff"), 
        pl.col("speed").max().over("PlayKey").alias("max_speed"),
        pl.col("speed").mean().over("PlayKey").alias("mean_speed"),
        pl.col("omega_dir").max().over("PlayKey").alias("max_omega_dir"),
        pl.col("omega_dir").mean().over("PlayKey").alias("mean_omega_dir"),
        pl.col("omega_o").max().over("PlayKey").alias("max_omega_o"),
        pl.col("omega_o").mean().over("PlayKey").alias("mean_omega_o"), 
        pl.col("d_omega").max().over("PlayKey").alias("max_d_omega"),
        pl.col("d_omega").mean().over("PlayKey").alias("mean_d_omega") 
    ]).unique(subset=["PlayKey"])

    # Calculate the displacement
    result = result.with_columns([
        (((pl.col("end_x") - pl.col("start_x"))**2 + 
          (pl.col("end_y") - pl.col("start_y"))**2)**0.5)
        .alias("displacement")
    ]).with_columns([
        (pl.col("distance") - pl.col("displacement")).alias("path_diff")
    ])

     
    # Select only the required columns
    result = result.select(['PlayKey',
        'distance',
        'displacement',
        'path_diff',
        'max_angle_diff',
        'mean_angle_diff',
        'max_speed',
        'mean_speed',
        'max_omega_dir',
        'mean_omega_dir',
        'max_omega_o',
        'mean_omega_o',
        'max_d_omega',
        'mean_d_omega']).sort("PlayKey")

    return result

# # Assuming 'df' has already been through angle_corrector and dynamics_calculator
# play_summary = path_calculator(df)
# print(play_summary)

In [203]:
df = track_cleaner(df)
df = angle_corrector(df)
df = dynamics_calculator(df)
summary = path_calculator(df)

print(df)
print(summary)

Memory usage of dataframe is 0.00 MB
Memory usage after optimization is: 0.00 MB
Decreased by 51.8%
shape: (3, 14)
┌───────────┬──────┬───────────┬───────┬───┬───────────┬───────────┬───────────┬─────────┐
│ PlayKey   ┆ time ┆ x         ┆ y     ┆ … ┆ vx        ┆ vy        ┆ omega_dir ┆ omega_o │
│ ---       ┆ ---  ┆ ---       ┆ ---   ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---     │
│ cat       ┆ f32  ┆ f32       ┆ f32   ┆   ┆ f32       ┆ f32       ┆ f32       ┆ f32     │
╞═══════════╪══════╪═══════════╪═══════╪═══╪═══════════╪═══════════╪═══════════╪═════════╡
│ 26624-1-1 ┆ 0.1  ┆ 87.449997 ┆ 28.92 ┆ … ┆ -0.100021 ┆ -0.100002 ┆ -250.0    ┆ -200.0  │
│ 26624-1-1 ┆ 0.2  ┆ 87.440002 ┆ 28.92 ┆ … ┆ -0.099945 ┆ 0.0       ┆ -100.0    ┆ -100.0  │
│ 26624-1-2 ┆ 0.1  ┆ 80.099998 ┆ 20.1  ┆ … ┆ 0.999985  ┆ 1.000004  ┆ 100.0     ┆ 100.0   │
└───────────┴──────┴───────────┴───────┴───┴───────────┴───────────┴───────────┴─────────┘
shape: (2, 4)
┌───────────┬──────────┬──────────────┬───────────┐


Now to import a sample of the full tracking table to test this out. 

In [298]:
df = pl.read_csv("F:/Data/nfl-playing-surface-analytics/PlayerTrackData.csv", n_rows=10000, columns=['PlayKey', 'time', 'x', 'y', 'dir', 'o'])

In [299]:
df.head()

PlayKey,time,x,y,dir,o
str,f64,f64,f64,f64,f64
"""26624-1-1""",0.0,87.46,28.93,288.24,262.33
"""26624-1-1""",0.1,87.45,28.92,283.91,261.69
"""26624-1-1""",0.2,87.44,28.92,280.4,261.17
"""26624-1-1""",0.3,87.44,28.92,278.79,260.66
"""26624-1-1""",0.4,87.44,28.92,275.44,260.27


In [305]:
ngs = data_shrinker(df)
ngs = angle_corrector(ngs)
ngs = dynamics_calculator(ngs)
paths = path_calculator(ngs)
ngs.head()
# paths.head()

Memory usage of dataframe is 0.47 MB
Memory usage after optimization is: 0.23 MB
Decreased by 51.7%


PlayKey,time,x,y,dir,o,angle_diff,dist,speed,direction,vx,vy,omega_dir,omega_o,d_omega
cat,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32
"""26624-1-1""",0.1,87.449997,28.92,-76.089996,-98.309998,22.219999,0.014144,0.141438,-134.994522,-0.100021,-0.100002,-43.299866,-6.399841,36.900024
"""26624-1-1""",0.2,87.440002,28.92,-79.600006,-98.829987,19.23,0.009995,0.099945,-90.0,-0.099945,0.0,-35.100098,-5.19989,29.900208
"""26624-1-1""",0.3,87.440002,28.92,-81.209991,-99.339996,18.129999,0.0,0.0,0.0,0.0,0.0,-16.099852,-5.100097,10.999754
"""26624-1-1""",0.4,87.440002,28.92,-84.559998,-99.730011,15.17,0.0,0.0,0.0,0.0,0.0,-33.500065,-3.900147,29.599918
"""26624-1-1""",0.5,87.449997,28.92,-89.940002,-99.920013,9.98,0.009995,0.099945,90.0,0.099945,0.0,-53.800053,-1.900025,51.900028


In [306]:
paths.head()

PlayKey,distance,displacement,path_diff,max_angle_diff,mean_angle_diff,max_speed,mean_speed,max_omega_dir,mean_omega_dir,max_omega_o,mean_omega_o,max_d_omega,mean_d_omega
cat,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32,f32
"""26624-1-1""",16.944927,5.915897,11.02903,178.550003,81.936852,4.825988,0.568622,3555.986572,2.728592,3579.00293,9.189604,3610.202881,150.995209
"""26624-1-2""",23.554543,8.247253,15.307289,178.600006,92.665977,5.103921,0.909442,3472.719971,-6.991522,3584.403076,9.550516,3599.503174,167.596008
"""26624-1-3""",11.040586,3.217531,7.823056,179.830002,77.508858,4.237926,0.374257,3458.952881,-3.191952,3590.886475,8.691579,3653.38623,167.961563
"""26624-1-4""",7.034525,3.077176,3.957349,137.759995,73.762772,3.465546,0.558296,1214.399902,-0.088782,3574.286377,-0.773946,3563.303223,119.459381
"""26624-1-5""",26.218555,18.091106,8.127449,163.669998,78.897873,4.517696,1.0123,3537.186523,1.0382,3551.753906,12.263534,3562.654053,167.059769


In [303]:
paths.columns

['PlayKey',
 'distance',
 'start_x',
 'start_y',
 'end_x',
 'end_y',
 'max_angle_diff',
 'mean_angle_diff',
 'max_speed',
 'mean_speed',
 'max_omega_dir',
 'mean_omega_dir',
 'max_omega_o',
 'mean_omega_o',
 'max_d_omega',
 'mean_d_omega',
 'displacement',
 'path_diff']

## Merge the Summary with Quals

## Push the Merged Table Back to the SQL Database for Future Use in Tableau

In [None]:
# Make connection to the database
from sqlalchemy import create_engine

from config import db_password
uri = f"postgresql://postgres:{db_password}@127.0.0.1:5432/nfl"
del db_password

# Pushing the Polars dataframe directly to the DB was failing, so convert to Pandas and write through SQLAlchemy
quals_p = quals.to_pandas()

# Write table to database
engine = create_engine(uri)
quals_p.to_sql("qualitative_clean", engine, if_exists='replace', index=False)