# ETL for the Lower Body Injury Data 

Since the data for this analysis exceeds 50 million rows for each part, I'm using Polars instead of Pandas for memory optimization, since many of the steps using Pandas are very slow due to unoptimized resource usage. I'm using SQLAlchemy to push and pull data from the database, however, due to the size of the tracking data, I'm using the csv files from any previously filtered data (in the case of these data, the injury table was not filtered, so it is the native table) to load using the lazy loading feature of Polars.

In [1]:
import polars as pl
import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2

In [2]:
# Make connection to the database
from config import db_password
uri = f"postgresql://postgres:{db_password}@127.0.0.1:5432/nfl"
del db_password
query = "SELECT * FROM qualitative"

quals = pl.read_database_uri(query=query, uri=uri)


The qualitative table contains all of the player, weather, stadium, and injury data. This will have to be cleaned due to the excessive changes that are better suited for programmatic coding on python rather than in SQL. The Stadium Types and Weather need major changes, and the null handling for the remaining values are easily handled in Polars.

I have removed PlayerDay, PlayerGame, PlayerGamePlay, Position, and PositionGroup since most of these will artificially identify the specific player instead of the conditions that led to the injury. It may be true that more injuries occur later in the season, but this is unreliable since some of the early injured players may return to the field and get a secondary injury.

In [3]:
quals.head()

playkey,rosterposition,stadiumtype,fieldtype,temperature,weather,playtype,bodypart,dm_m1,dm_m7,dm_m28,dm_m42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,


In [64]:
# scan = pl.scan_csv("F:/Data/nfl-playing-surface-analytics/PlayerTrackData.csv")
# tracker = scan.collect(streaming=True, infer_schema_length=10000)

What I am considering doing for the Event column in is filling null values from the same play with the same event naming. So for all 26624-1-1, each null would be filled with "huddle_start_offense," but i need to verify that there is only 1 event per play-player. To check this, I used SQL, applying a query to group distinct events and look at the playkeys with such events. Most of the events are null, however, each of the 250 players has key events highlighted with the timepoints in between as null. I'm considering taking 1/2 of the time between and listing it as post-{} and pre-{} of the included event, or maybe even just cut the middle regions in half and attach the prior/following events to interpolate the null values. 

The previous analysis ignored the events, so I will be removing the events for the positional analysis to reduce the amount of data. 

In [65]:
# tracker.head()

## Cleaning the Stadiums
First thing, the column headers are ugly, so I want to capitalize them. Second, the stadium type is a problem. Literally, every stadium is described differently, including 7 unique spellings of the word 'Outdoor.'

In [4]:
quals.columns

['playkey',
 'rosterposition',
 'stadiumtype',
 'fieldtype',
 'temperature',
 'weather',
 'playtype',
 'bodypart',
 'dm_m1',
 'dm_m7',
 'dm_m28',
 'dm_m42']

In [5]:
def column_capitalizer(quals):
    columns = {
        'playkey': "PlayKey"
        , 'rosterposition': 'Position'
        , 'stadiumtype': 'StadiumType'
        , 'fieldtype': 'FieldType'
        , 'temperature': 'Temperature'
        , 'weather': 'Weather'
        , 'playtype': 'PlayType'
        , 'bodypart': 'BodyPart'
        , 'dm_m1': 'DM_1'
        , 'dm_m7': 'DM_7'
        , 'dm_m28': 'DM_28'
        , 'dm_m42': 'DM_42'
    }

    quals = quals.rename(columns)

    return quals



In [6]:
quals = column_capitalizer(quals)
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,


Let's take a look at the number of unique values in each of the columns: 

In [7]:
quals.select(pl.all().n_unique())

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
267005,10,30,2,79,64,12,4,2,3,3,3


Now to clean the stadiums...

In [8]:
quals["StadiumType"].unique().to_list()

['Retr. Roof Closed',
 'Dome',
 'Retractable Roof',
 'Indoor, Roof Closed',
 'Outside',
 'Heinz Field',
 'Retr. Roof-Closed',
 'Indoor, Open Roof',
 'Dome, closed',
 'Ourdoor',
 'Domed',
 'Retr. Roof-Open',
 'Domed, closed',
 'Bowl',
 'Outdoor',
 'Domed, open',
 'Retr. Roof - Closed',
 'Outdor',
 'Outdoors',
 'Oudoor',
 'Cloudy',
 'Indoor',
 'Indoors',
 'Domed, Open',
 None,
 'Retr. Roof - Open',
 'Open',
 'Outdoor Retr Roof-Open',
 'Outddors',
 'Closed Dome']

In [9]:
quals["StadiumType"].value_counts()

StadiumType,count
str,u32
"""Retr. Roof - Closed""",2235
"""Outdor""",356
"""Outddors""",595
"""Domed, Open""",807
"""Domed, closed""",3076
…,…
"""Open""",4124
,16910
"""Domed, open""",1779
"""Domed""",985


In [10]:
quals["StadiumType"].null_count()

16910

Since most of the stadiums in the NFL are outdoor, and the 11 stadiums with domes that would be considered indoor are already accounted for, the Null values will be interpolated as Outdoor Stadiums. 

In [11]:
quals = quals.with_columns(pl.col("StadiumType").fill_null("Outdoor"))

In [12]:
quals["StadiumType"].null_count()

0

For the rest of them, I will be converting the stadiums to be either indoor or outdoor. If the stadium is a domed stadium and the dome is open, it will be considered Outdoor, whereas for the games that were recorded with the dome closed, it will be considered Indoor

In [13]:
stadium_dict = {'Outdoor': 'Outdoor',
        'Indoors': 'Indoor',
        'Oudoor': 'Outdoor',
        'Outdoors': 'Outdoor',
        'Open': 'Outdoor',
        'Closed Dome': 'Indoor',
        'Domed, closed': 'Indoor',
        'Dome': 'Indoor',
        'Indoor': 'Indoor',
        'Domed': 'Indoor',
        'Retr. Roof-Closed': 'Indoor',
        'Outdoor Retr Roof-Open': 'Outdoor',
        'Retractable Roof': 'Indoor',
        'Ourdoor': 'Outdoor',
        'Indoor, Roof Closed': 'Indoor',
        'Retr. Roof - Closed': 'Indoor',
        'Bowl': 'Outdoor',
        'Outddors': 'Outdoor',
        'Retr. Roof-Open': 'Outdoor',
        'Dome, closed': 'Indoor',
        'Indoor, Open Roof': 'Outdoor',
        'Domed, Open': 'Outdoor',
        'Domed, open': 'Outdoor',
        'Heinz Field': 'Outdoor',
        'Cloudy': 'Outdoor',
        'Retr. Roof - Open': 'Outdoor',
        'Retr. Roof Closed': 'Indoor',
        'Outdor': 'Outdoor',
        'Outside': 'Outdoor'}


In [14]:
quals = quals.with_columns(pl.col("StadiumType").replace(stadium_dict))

In [15]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear and warm""","""Pass""",,,,,


In [16]:
quals.select(pl.all().n_unique())

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
267005,10,2,2,79,64,12,4,2,3,3,3


## Getting Clear Weather
There are even more weather conditions than there were with the stadium types, which isn't particularly helpful. I'm changing the weather into 7 different conditions that encompass the main weather events for outdoor stadiums. Indoor stadiums will be labeled indoor, since there isn't an impact from the sun on visibility, which may be a factor with clear days. 

In [17]:
weather_dict = {'Clear and warm': 'Clear',
                'Mostly Cloudy': 'Cloudy',
                'Sunny': 'Clear',
                'Clear': 'Clear',
                'Cloudy': 'Cloudy',
                'Cloudy, fog started developing in 2nd quarter': 'Hazy/Fog',
                'Rain': 'Rain',
                'Partly Cloudy': 'Cloudy',
                'Mostly cloudy': 'Cloudy',
                'Cloudy and cold': 'Cloudy',
                'Cloudy and Cool': 'Cloudy',
                'Rain Chance 40%': 'Rain',
                'Controlled Climate': 'Indoor',
                'Sunny and warm': 'Clear',
                'Partly cloudy': 'Cloudy',
                'Clear and Cool': 'Cloudy',
                'Clear and cold': 'Cloudy',
                'Sunny and cold': 'Clear',
                'Indoor': 'Indoor',
                'Partly Sunny': 'Clear',
                'N/A (Indoors)': 'Indoor',
                'Mostly Sunny': 'Clear',
                'Indoors': 'Indoor',
                'Clear Skies': 'Clear',
                'Partly sunny': 'Clear',
                'Showers': 'Rain',
                'N/A Indoor': 'Indoor',
                'Sunny and clear': 'Clear',
                'Snow': 'Snow',
                'Scattered Showers': 'Rain',
                'Party Cloudy': 'Cloudy',
                'Clear skies': 'Clear',
                'Rain likely, temps in low 40s.': 'Rain',
                'Hazy': 'Hazy/Fog',
                'Partly Clouidy': 'Cloudy',
                'Sunny Skies': 'Clear',
                'Overcast': 'Cloudy',
                'Cloudy, 50% change of rain': 'Cloudy',
                'Fair': 'Clear',
                'Light Rain': 'Rain',
                'Partly clear': 'Clear',
                'Mostly Coudy': 'Cloudy',
                '10% Chance of Rain': 'Cloudy',
                'Cloudy, chance of rain': 'Cloudy',
                'Heat Index 95': 'Clear',
                'Sunny, highs to upper 80s': 'Clear',
                'Sun & clouds': 'Cloudy',
                'Heavy lake effect snow': 'Snow',
                'Mostly sunny': 'Clear',
                'Cloudy, Rain': 'Rain',
                'Sunny, Windy': 'Windy',
                'Mostly Sunny Skies': 'Clear',
                'Rainy': 'Rain',
                '30% Chance of Rain': 'Rain',
                'Cloudy, light snow accumulating 1-3"': 'Snow',
                'cloudy': 'Cloudy',
                'Clear and Sunny': 'Clear',
                'Coudy': 'Cloudy',
                'Clear and sunny': 'Clear',
                'Clear to Partly Cloudy': 'Clear',
                'Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.': 'Windy',
                'Rain shower': 'Rain',
                'Cold': 'Clear'}


quals = quals.with_columns(pl.col("Weather").replace(weather_dict))
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,


In [18]:
quals["Weather"].value_counts()

Weather,count
str,u32
"""Cloudy""",112306
"""Clear""",96985
"""Indoor""",20277
"""Windy""",713
"""Hazy/Fog""",1809
"""Rain""",14280
"""Snow""",1945
,18691


Since I don't know whether the nulls are Indoor or Outdoor, I am going to change the weather from all indoor stadiums to "Indoor," and hopefully if those nulls are indoor, they'll all disappear!

In [19]:
quals = quals.with_columns(
                pl.when(pl.col("StadiumType") == "Indoor")
                .then(pl.col("Weather").fill_null("Indoor"))
                .otherwise(pl.col("Weather"))
                .alias("Weather")
)

In [20]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,


In [21]:
quals["Weather"].value_counts().sort("count", descending=True)

Weather,count
str,u32
"""Cloudy""",112306
"""Clear""",96985
"""Indoor""",33862
"""Rain""",14280
,5106
"""Snow""",1945
"""Hazy/Fog""",1809
"""Windy""",713


In [22]:
quals.filter(pl.col("Weather").is_null()).select(["Weather", "StadiumType", "Temperature"]).unique()

Weather,StadiumType,Temperature
str,str,i32
,"""Outdoor""",77
,"""Outdoor""",40
,"""Outdoor""",70
,"""Outdoor""",79
,"""Outdoor""",68
,"""Outdoor""",39
,"""Outdoor""",58
,"""Outdoor""",72
,"""Outdoor""",71
,"""Outdoor""",63


It appears that there are 10 games where the weather was not recorded, and each of them are different stadiums based on the EDA. Since there is a pretty similar distribution of cloudy and clear games, I don't want to assign all 5000 values to one of them, so I will be making all games greater than 70 degrees Clear and the rest will be Cloudy. This isn't perfect, but without datetime data, it's impossible to determine what game and temperatures. 

In [23]:
quals = quals.with_columns(
                pl.when(pl.col("Temperature") > 70)
                .then(pl.col("Weather").fill_null("Clear"))
                .otherwise(pl.col("Weather"))
                .alias("Weather")
)

In [24]:
quals.filter(pl.col("Weather").is_null()).select(["Weather", "StadiumType", "Temperature"]).unique()

Weather,StadiumType,Temperature
str,str,i32
,"""Outdoor""",39
,"""Outdoor""",40
,"""Outdoor""",63
,"""Outdoor""",58
,"""Outdoor""",68
,"""Outdoor""",70


In [25]:
quals = quals.with_columns(pl.col("Weather").fill_null("Cloudy"))

In [26]:
quals["Weather"].value_counts().sort("count", descending=True)

Weather,count
str,u32
"""Cloudy""",115504
"""Clear""",98893
"""Indoor""",33862
"""Rain""",14280
"""Snow""",1945
"""Hazy/Fog""",1809
"""Windy""",713


## Helping the Uninjured
The data from the join using injury data left a whole lot of null values. Each of the null values from teh DMs should be 0, since the player did not sustain an injury for that duration. For Body Part, we can add "None", however, this will ultimately get encoded, so I will need to add an additional column that just denotes IsInjured as a binary, and then the remining will each become their own column denoting a 1 or 0 whether the injury sustained was a Knee, Foot, or Ankle. 

In [30]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""",,,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""",,,,,


First, let's change all of the null Body Parts to No Injury

In [31]:
quals = quals.with_columns(pl.col("BodyPart").fill_null("NoInjury"))

In [32]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",,,,
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",,,,
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",,,,
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",,,,
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",,,,


In [33]:
quals.null_count()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,367,0,266929,266929,266929,266929


What I would like to do is fill all remaining nulls with a 0, but only in the last 4 columns; however, I'm seeing that there are still 367 playtypes that are unknown. I will actually just label these as Unknonwn for now, and if possible I can determine whether the events give us an indication upon merging later. 

In [34]:
quals = quals.with_columns(pl.col("PlayType").fill_null("Unknown"))

In [35]:
quals.null_count()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,266929,266929,266929,266929


In [36]:
quals = quals.with_columns(
    pl.col(["DM_1", "DM_7", "DM_28", "DM_42"]).fill_null(0))

In [38]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",0,0,0,0
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Rush""","""NoInjury""",0,0,0,0
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""","""Pass""","""NoInjury""",0,0,0,0


In [39]:
quals.null_count()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0,0


This final line just examines to see if any of the Unknown plays led to an injury. I suspect that if none of them led to an injury that this will give a false identifier - if a play does come up as unknown, and ML model will determine that there will not be an injury, though this is just an artifact of bad record-keeping.  

In [43]:
quals.filter((pl.col("PlayType") == "Unknown") & (pl.col("BodyPart") != "NoInjury"))

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,str,str,i32,i32,i32,i32


As seen, this is the case. Now that these data are clean, how much data will be lost if we remove those rows? 

In [47]:
quals.filter(pl.col("PlayType") == "Unknown").count()/quals.count()*100

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745,0.13745


By removing these data, we will be losing 0.14% of the dataset, which will not be a lot to lose for purposes of keeping the PlayType as part of the analysis. 

In [48]:
quals = quals.filter(pl.col("PlayType") != "Unknown")

In [49]:
quals.count()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
266639,266639,266639,266639,266639,266639,266639,266639,266639,266639,266639,266639


The final step of this process is to rewrite this code for production, so that it doesn't need to be pushed through a jupyter file for any future runs or modifications 

In [None]:
import polars as pl
import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2

# Make connection to the database
from config import db_password
uri = f"postgresql://postgres:{db_password}@127.0.0.1:5432/nfl"
del db_password
query = "SELECT * FROM qualitative"

quals = pl.read_database_uri(query=query, uri=uri)

In [None]:
def column_capitalizer(quals):          # Changes the all lower-case to Capitalized PascalCase column headers 
    columns = {
        'playkey': "PlayKey"
        , 'rosterposition': 'Position'
        , 'stadiumtype': 'StadiumType'
        , 'fieldtype': 'FieldType'
        , 'temperature': 'Temperature'
        , 'weather': 'Weather'
        , 'playtype': 'PlayType'
        , 'bodypart': 'BodyPart'
        , 'dm_m1': 'DM_1'
        , 'dm_m7': 'DM_7'
        , 'dm_m28': 'DM_28'
        , 'dm_m42': 'DM_42'
    } 
    quals = quals.rename(columns)

    return quals


In [None]:
def stadium_cleaner(quals):         # This changes stadiums to either Indoor or Outdoor per game records - some of the dome stadiums have a roof that can open, if open the game is considered outdoor.
    stadium_dict = {'Outdoor': 'Outdoor',
        'Indoors': 'Indoor',
        'Oudoor': 'Outdoor',
        'Outdoors': 'Outdoor',
        'Open': 'Outdoor',
        'Closed Dome': 'Indoor',
        'Domed, closed': 'Indoor',
        'Dome': 'Indoor',
        'Indoor': 'Indoor',
        'Domed': 'Indoor',
        'Retr. Roof-Closed': 'Indoor',
        'Outdoor Retr Roof-Open': 'Outdoor',
        'Retractable Roof': 'Indoor',
        'Ourdoor': 'Outdoor',
        'Indoor, Roof Closed': 'Indoor',
        'Retr. Roof - Closed': 'Indoor',
        'Bowl': 'Outdoor',
        'Outddors': 'Outdoor',
        'Retr. Roof-Open': 'Outdoor',
        'Dome, closed': 'Indoor',
        'Indoor, Open Roof': 'Outdoor',
        'Domed, Open': 'Outdoor',
        'Domed, open': 'Outdoor',
        'Heinz Field': 'Outdoor',
        'Cloudy': 'Outdoor',
        'Retr. Roof - Open': 'Outdoor',
        'Retr. Roof Closed': 'Indoor',
        'Outdor': 'Outdoor',
        'Outside': 'Outdoor'}
    
    quals = quals.with_columns(pl.col("StadiumType").fill_null("Outdoor")) # Since most stadiums are outdoor and the percentage of games played indoor is already met by the known indoor games those seasons, all unknown games were set to outdoor
    quals = quals.with_columns(pl.col("StadiumType").replace(stadium_dict)) # This uses the dict to assign naming conventions

    return quals

In [50]:
def weather_cleaner(quals):
     weather_dict = {'Clear and warm': 'Clear',
                'Mostly Cloudy': 'Cloudy',
                'Sunny': 'Clear',
                'Clear': 'Clear',
                'Cloudy': 'Cloudy',
                'Cloudy, fog started developing in 2nd quarter': 'Hazy/Fog',
                'Rain': 'Rain',
                'Partly Cloudy': 'Cloudy',
                'Mostly cloudy': 'Cloudy',
                'Cloudy and cold': 'Cloudy',
                'Cloudy and Cool': 'Cloudy',
                'Rain Chance 40%': 'Rain',
                'Controlled Climate': 'Indoor',
                'Sunny and warm': 'Clear',
                'Partly cloudy': 'Cloudy',
                'Clear and Cool': 'Cloudy',
                'Clear and cold': 'Cloudy',
                'Sunny and cold': 'Clear',
                'Indoor': 'Indoor',
                'Partly Sunny': 'Clear',
                'N/A (Indoors)': 'Indoor',
                'Mostly Sunny': 'Clear',
                'Indoors': 'Indoor',
                'Clear Skies': 'Clear',
                'Partly sunny': 'Clear',
                'Showers': 'Rain',
                'N/A Indoor': 'Indoor',
                'Sunny and clear': 'Clear',
                'Snow': 'Snow',
                'Scattered Showers': 'Rain',
                'Party Cloudy': 'Cloudy',
                'Clear skies': 'Clear',
                'Rain likely, temps in low 40s.': 'Rain',
                'Hazy': 'Hazy/Fog',
                'Partly Clouidy': 'Cloudy',
                'Sunny Skies': 'Clear',
                'Overcast': 'Cloudy',
                'Cloudy, 50% change of rain': 'Cloudy',
                'Fair': 'Clear',
                'Light Rain': 'Rain',
                'Partly clear': 'Clear',
                'Mostly Coudy': 'Cloudy',
                '10% Chance of Rain': 'Cloudy',
                'Cloudy, chance of rain': 'Cloudy',
                'Heat Index 95': 'Clear',
                'Sunny, highs to upper 80s': 'Clear',
                'Sun & clouds': 'Cloudy',
                'Heavy lake effect snow': 'Snow',
                'Mostly sunny': 'Clear',
                'Cloudy, Rain': 'Rain',
                'Sunny, Windy': 'Windy',
                'Mostly Sunny Skies': 'Clear',
                'Rainy': 'Rain',
                '30% Chance of Rain': 'Rain',
                'Cloudy, light snow accumulating 1-3"': 'Snow',
                'cloudy': 'Cloudy',
                'Clear and Sunny': 'Clear',
                'Coudy': 'Cloudy',
                'Clear and sunny': 'Clear',
                'Clear to Partly Cloudy': 'Clear',
                'Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.': 'Windy',
                'Rain shower': 'Rain',
                'Cold': 'Clear'}
     
     quals = quals.with_columns(pl.col("Weather").replace(weather_dict)) # Standardizes the weather to a few main types

     quals = quals.with_columns(             # Null handling - all null weather conditions for indoor stadiums are filled "indoor"
                pl.when(pl.col("StadiumType") == "Indoor")
                .then(pl.col("Weather").fill_null("Indoor"))
                .otherwise(pl.col("Weather"))
                .alias("Weather")
                )
     
     # For the non-indoor games with null values for weather, to maintain the percentage of games that were clear/cloudy, temperature was used as a divider, above and below 70 degrees
     quals = quals.with_columns(
                pl.when(pl.col("Temperature") > 70)
                .then(pl.col("Weather").fill_null("Clear"))
                .otherwise(pl.col("Weather"))
                .alias("Weather")
                )
     quals = quals.with_columns(pl.col("Weather").fill_null("Cloudy"))

     return quals


In [None]:
def injury_cleaner(quals):
    quals = quals.with_columns(pl.col('PlayType').is_not_null()) # 0.14% of rows did not have a play type, and ALL of these were non-injury plays, so they were removed

    quals = quals.with_columns(pl.col("BodyPart").fill_null("NoInjury")) # This fills all null from the join with No Injury

    quals = quals.with_columns(
    pl.col(["DM_1", "DM_7", "DM_28", "DM_42"]).fill_null(0)) # This fills the nulls from the Join with 0s, since there were no injuries.

    return quals




In [1]:
import polars as pl
import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2
from CleaningFunctions import *

# Make connection to the database
from config import db_password
uri = f"postgresql://postgres:{db_password}@127.0.0.1:5432/nfl"
del db_password
query = "SELECT * FROM qualitative"

quals = pl.read_database_uri(query=query, uri=uri)



In [1]:
from CleaningFunctions import *

In [4]:
quals = clean_injuries('quals')


In [5]:
quals.head()

PlayKey,Position,StadiumType,FieldType,Temperature,Weather,PlayType,BodyPart,DM_1,DM_7,DM_28,DM_42
str,str,str,str,i32,str,bool,str,i32,i32,i32,i32
"""26624-1-1""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""",True,"""NoInjury""",0,0,0,0
"""26624-1-2""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""",True,"""NoInjury""",0,0,0,0
"""26624-1-3""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""",True,"""NoInjury""",0,0,0,0
"""26624-1-4""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""",True,"""NoInjury""",0,0,0,0
"""26624-1-5""","""Quarterback""","""Outdoor""","""Synthetic""",63,"""Clear""",True,"""NoInjury""",0,0,0,0
