# ETL for the Concussion Data

There is a lot of overlap with the Lower Body Data, so some of the functions should work for both datasets. I will continue to use polars here because the data is still huge: the ngs data has 60 million rows.




## game_data
Starting with gamekey, season year, season type, week, game_day, and game_date: the game date provides all of the necessary date information, and the additional columns could skew any machine learning findings.

Likewise, team and team codes are redundant. The team codes are less likely to have spelling errors. Will maintain hometeamcode and visitteamcode. 

As for stadium locations, the redundant pairs are Mexico City - Estadio Azteca and Mexico - Estadio Azteca, and London - Wembley and Wembley - Wembley Stadium.

The game listed in Indianapolis at Tom Benson, which is actually in Canton, was canceled, as verified online.
Ralph Wilson Stadium is the same location as New Era Field.
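
The duplicate labels above can be collapsed with a small alias table. This is a minimal plain-Python sketch, not the actual `stadium_cleaner` from `CleaningFunctions`: the name `canonical_stadium` is hypothetical, and which spelling of each pair counts as canonical is an assumption.

```python
# Hypothetical alias table: the pairs come from the notes above, but the
# choice of which spelling is kept as canonical is an assumption.
STADIUM_ALIASES = {
    "Mexico - Estadio Azteca": "Mexico City - Estadio Azteca",
    "Wembley - Wembley Stadium": "London - Wembley",
    "Ralph Wilson Stadium": "New Era Field",
}


def canonical_stadium(name):
    """Return the canonical label for a stadium, leaving unknown names as-is."""
    if name is None:
        return None
    cleaned = name.strip()  # tolerate stray leading/trailing whitespace
    return STADIUM_ALIASES.get(cleaned, cleaned)


print(canonical_stadium("Mexico - Estadio Azteca"))
print(canonical_stadium("Ralph Wilson Stadium"))
```

The same dictionary could be passed to a polars `replace` call on the relevant column, in the same way the turf mapping is applied further down.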

The outdoor weather is recorded so inconsistently that there are 134 different weather conditions, and most don't actually include weather types, whereas gameweather has only 10 types, including null.
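
A cardinality check like the one behind those counts can be sketched in plain Python. The rows below are invented for illustration only (they are not taken from the dataset), and the column name `outdoor_weather` and the helper `distinct_counts` are assumptions.

```python
def distinct_counts(rows, columns):
    """Count distinct values per column, treating None (null) as its own value."""
    return {col: len({row.get(col) for row in rows}) for col in columns}


# Toy rows, invented for illustration only (not taken from the dataset).
rows = [
    {"gameweather": "Clear", "outdoor_weather": "Sunny, 72F, wind 5 mph"},
    {"gameweather": "Clear", "outdoor_weather": "clear skies"},
    {"gameweather": "Rain", "outdoor_weather": "Light rain at kickoff"},
    {"gameweather": None, "outdoor_weather": None},
]

counts = distinct_counts(rows, ["gameweather", "outdoor_weather"])
print(counts)
```

A column whose distinct count grows almost as fast as the row count is usually free text, which is the reason for preferring the lower-cardinality column here.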

In [None]:
SELECT gamekey
    , game_date
    , game_site
    , start_time
    , hometeamcode
    , visitteamcode
    , stadiumtype
    , turf
    , gameweather
    , temperature 

FROM game_data
LIMIT 10;


## ngs_data

Again, season_year is unnecessary, since it is included in time. I would like to see if the GSISID numbers are the same as those from the other database. We will need to keep the rest of the information. 


In [None]:
SELECT gamekey ,
       playid ,
       gsisid ,
       time ,
       x,
       y,
       dis ,
       o ,
       dir ,
       event
FROM ngs_data
LIMIT 10 ;

## play_information

This has the same kind of redundant columns that can be removed, as does game_data. This will end up joined with game_data, so we'll need to keep enough to connect them.

Season_year, season_type, week can disappear. 
Play description isn't going to be particularly helpful, as every one of them is completely different. We should already have the home_team_visit_team information from the joined table game_data, so this column may be dropped later.

In [None]:
SELECT gamekey ,
       game_date ,
       playid,
       game_clock ,
       yardline ,
       quarter ,
       play_type ,
       poss_team ,
       home_team_visit_team ,
       score_home_visiting
FROM play_information
LIMIT 10 ;

## punt_data

Everything here is necessary - gsisid, number, and position (code)

## role_data

season_year is unnecessary, since the gamekeys don't overlap, and are already chronologically established 

In [None]:
SELECT gamekey , playid , gsisid , role FROM role_data
LIMIT 10 ;

## video_review

There are no turnovers in the entire table, so this parameter can be removed. 
In terms of this analysis - there isn't a real reason to keep the primary_partner_gsisid, since we're not looking for the individuals responsible, though it will be nice to have for the visualizations.  
friendly_fire isn't something that we can control biomechanically, so there isn't anything that I can really use from this information. 


In [None]:
SELECT gamekey, playid, gsisid, player_activity_derived, primary_impact_type, primary_partner_activity_derived, primary_partner_gsisid FROM video_review
LIMIT 10 ;

# Extract and Clean

In [None]:
import polars as pl
import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2

In [None]:
from DataHandler import data_loader, data_shrinker
from CleaningFunctions import column_capitalizer, stadium_cleaner, weather_cleaner, turf_cleaner

In [None]:
concussion = data_loader(database='nfl_concussion', dataset='concussion')
concussion.head()

In [None]:
concussion = column_capitalizer(concussion, 'concussion')
concussion.head()

In [None]:
concussion = stadium_cleaner(concussion, 'concussion')
concussion.head()

In [None]:
concussion = weather_cleaner(concussion, 'concussion')
concussion.head()

In [None]:
concussion.null_count()

In [None]:
concussion = turf_cleaner(concussion)
concussion.head()

In [None]:
concussion.null_count()

In [None]:
filtered = concussion.filter(pl.col('Game_Date').is_null())
result = filtered.select(['Game_Date', 'HomeTeamCode', 'StadiumType', 'FieldType', 'Weather'])

In [None]:
# Print the selected columns for the rows with a missing game date
print(result)

In [None]:
# def turf_cleaner(df):
#     import polars as pl

#     turf_dict = {
#         'Grass': 'Natural',
#         'Field Turf': 'Synthetic', 
#         'Natural Grass': 'Natural',
#         'grass': 'Natural',
#         'Artificial': 'Synthetic',
#         'FieldTurf': 'Synthetic',
#         'DD GrassMaster': 'Synthetic',
#         'A-Turf Titan': 'Synthetic',
#         'UBU Sports Speed S5-M': 'Synthetic',
#         'UBU Speed Series S5-M': 'Synthetic',
#         'Artifical': 'Synthetic',
#         'UBU Speed Series-S5-M': 'Synthetic',
#         'FieldTurf 360': 'Synthetic',
#         'Natural grass': 'Natural',
#         'Field turf': 'Synthetic',
#         'Natural': 'Natural',
#         'Natrual Grass': 'Natural',
#         'Synthetic': 'Synthetic',
#         'Natural Grass ': 'Natural',
#         'Naturall Grass': 'Natural',
#         'FieldTurf360': 'Synthetic',
#         None: 'Natural' # The only field with null values is Miami Gardens, which has Natural
#         }
    
#     df = df.with_columns(pl.col("FieldType").replace(turf_dict))
#     return df

In [None]:
concussion = turf_cleaner(concussion)

In [None]:
concussion.filter(pl.col('FieldType').is_null())

## New Test for the clean_concussion() function

In [1]:
import polars as pl
import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2

from CleaningFunctions import *
from DataHandler import data_loader, data_writer, data_shrinker


In [2]:
# df = clean_concussions()
# df.head()

# Next - the transformation
Need to create a playkey in SQL:


In [None]:
-- ALTER TABLE ngs_data
-- ADD COLUMN PlayKey VARCHAR(15);

The processing of this in SQL timed out after 15 minutes, so I'm going to try to use polars to actually do the transformation.

In [2]:
track = data_loader(dataset='ngs_data', database='nfl_concussion')

In [3]:
track = data_shrinker(track)

Memory usage of dataframe is 2565.14 MB
Memory usage after optimization is: 2330.61 MB
Decreased by 9.1%


In [4]:
track = track.with_columns([
        pl.concat_str([
            pl.col('gsisid').cast(pl.Int32).cast(pl.Utf8)
            , pl.lit('-')
            , pl.col('gamekey').cast(pl.Utf8)
            , pl.lit('-')
            , pl.col('playid').cast(pl.Utf8)
        ]).alias('PlayKey')
])

This took 7.9 seconds, compared to over 13 minutes in SQL.
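
The reason for casting `gsisid` to an integer before building the key shows up even in plain Python: the column is stored as a float, so a direct string conversion would keep the trailing `.0`. The helper name `make_playkey` below is hypothetical, and returning `None` for a missing `gsisid` is an assumption about how such rows should be handled.

```python
def make_playkey(gsisid, gamekey, playid):
    """Build a 'gsisid-gamekey-playid' key, dropping the float's trailing '.0'."""
    if gsisid is None:
        return None  # assumption: a missing player id gives a missing key
    return f"{int(gsisid)}-{gamekey}-{playid}"


# Values from the first row of the track.head() output.
key = make_playkey(32449.0, 512, 3623)
naive = f"{32449.0}-{512}-{3623}"  # what skipping the integer cast would give

print(key)
print(naive)
```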

In [5]:
track.head()

gamekey,playid,gsisid,time,x,y,dis,o,dir,event,PlayKey
i16,i16,f32,datetime[ns],f32,f32,f32,f32,f32,enum,str
512,3623,32449.0,2017-10-29 19:48:49.800,93.230003,24.440001,0.0,175.630005,100.169998,,"""32449-512-3623"""
512,3623,27978.0,2017-10-29 19:48:49.800,86.470001,25.1,0.01,98.730003,330.899994,,"""27978-512-3623"""
512,3623,32688.0,2017-10-29 19:48:49.800,56.509998,5.6,0.3,97.400002,197.529999,,"""32688-512-3623"""
512,3623,30160.0,2017-10-29 19:48:49.800,85.139999,23.370001,0.07,287.670013,62.099998,,"""30160-512-3623"""
512,3623,33260.0,2017-10-29 19:48:49.800,85.300003,23.75,0.0,261.279999,279.519989,,"""33260-512-3623"""


In [6]:
track = track.select([
    'PlayKey'
    , 'time'
    , 'x'
    , 'y'
    , 'dis'
    , 'o'
    , 'dir'
    , 'event'
])

track.head()

PlayKey,time,x,y,dis,o,dir,event
str,datetime[ns],f32,f32,f32,f32,f32,enum
"""32449-512-3623""",2017-10-29 19:48:49.800,93.230003,24.440001,0.0,175.630005,100.169998,
"""27978-512-3623""",2017-10-29 19:48:49.800,86.470001,25.1,0.01,98.730003,330.899994,
"""32688-512-3623""",2017-10-29 19:48:49.800,56.509998,5.6,0.3,97.400002,197.529999,
"""30160-512-3623""",2017-10-29 19:48:49.800,85.139999,23.370001,0.07,287.670013,62.099998,
"""33260-512-3623""",2017-10-29 19:48:49.800,85.300003,23.75,0.0,261.279999,279.519989,


In [7]:
import numpy as np
import polars as pl

def calculate_angle_difference(angle1, angle2):
    """
    Calculate the smallest angle difference between two angles 
    using trigonometric functions, accounting for edge cases.
    """
    # atan2 of the sine and cosine of the raw difference wraps the result into [-180, 180]
    sin_diff = np.sin(np.radians(angle2 - angle1))
    cos_diff = np.cos(np.radians(angle2 - angle1))
    return np.degrees(np.arctan2(sin_diff, cos_diff))

def angle_corrector(df):
    """
    Make corrections to angles to reduce fringe errors at 360
    """
    # Map both angle columns from [0, 360) to [-180, 180), then take the
    # absolute wrapped difference between direction and orientation
    df = df.with_columns([
        ((pl.col("dir") + 180) % 360 - 180).alias("dir")
        , ((pl.col("o") + 180) % 360 - 180).alias("o")
        ]).with_columns(
            (calculate_angle_difference(pl.col("dir"), pl.col("o"))).abs().round(2).alias("angle_diff")
        )
    
    return df
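
As a sanity check on the wrapping logic, the same arithmetic can be run on scalars with the standard library. This is a sketch using `math` instead of numpy and polars; the function names `wrap_degrees` and `angle_difference` are hypothetical, and the inputs are the rounded `dir` and `o` values from the second row of the `track.head()` output above.

```python
import math


def wrap_degrees(angle):
    """Map an angle in degrees onto the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def angle_difference(angle1, angle2):
    """Signed smallest difference angle2 - angle1, in degrees."""
    delta = math.radians(angle2 - angle1)
    return math.degrees(math.atan2(math.sin(delta), math.cos(delta)))


dir_wrapped = wrap_degrees(330.9)                 # about -29.1
diff = abs(angle_difference(dir_wrapped, 98.73))  # about 127.83
edge = angle_difference(350.0, 10.0)              # about 20, not -340

print(round(dir_wrapped, 2), round(diff, 2), round(edge, 2))
```

The last case is the fringe error at 360: a naive subtraction would report -340 degrees for two headings that are only 20 degrees apart.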


In [8]:
track = angle_corrector(track)
track.head()

PlayKey,time,x,y,dis,o,dir,event,angle_diff
str,datetime[ns],f32,f32,f32,f32,f32,enum,f32
"""32449-512-3623""",2017-10-29 19:48:49.800,93.230003,24.440001,0.0,175.630005,100.169983,,75.459999
"""27978-512-3623""",2017-10-29 19:48:49.800,86.470001,25.1,0.01,98.730011,-29.100006,,127.830002
"""32688-512-3623""",2017-10-29 19:48:49.800,56.509998,5.6,0.3,97.399994,-162.470001,,100.129997
"""30160-512-3623""",2017-10-29 19:48:49.800,85.139999,23.370001,0.07,-72.329987,62.100006,,134.429993
"""33260-512-3623""",2017-10-29 19:48:49.800,85.300003,23.75,0.0,-98.720001,-80.480011,,18.24


In [9]:
import numpy as np
import polars as pl

def dynamics_calculator(df):
    """
    Calculate dynamics based on (X,Y) and time columns per PlayKey.
    """
    
    df = df.lazy()
    df = df.sort(['PlayKey', 'time']).with_columns([
        # Pre-calculate shifted values
        pl.col("x").shift(1).over("PlayKey").alias("prev_x"),
        pl.col("y").shift(1).over("PlayKey").alias("prev_y"),
        pl.col("time").shift(1).over("PlayKey").alias("prev_time"),
        pl.col("dir").shift(1).over("PlayKey").alias("prev_dir"),
        pl.col("o").shift(1).over("PlayKey").alias("prev_o")
    ]).with_columns([
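        # Calculate time difference in seconds; total_seconds() truncates to whole
        # seconds, so use milliseconds to keep sub-second steps from collapsing to 0
        ((pl.col("time") - pl.col("prev_time")).dt.total_milliseconds() / 1000).alias("dt"),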
        # Calculate x and y differences
        (pl.col("x") - pl.col("prev_x")).alias("dx"),
        (pl.col("y") - pl.col("prev_y")).alias("dy")
    ]).with_columns([
        # Calculate displacement
        ((pl.col("dx")**2 + pl.col("dy")**2)**0.5).alias("dist")
    ]).with_columns([
        # Calculate speed (yards per second)
        (pl.col("dist") / pl.col("dt")).alias("speed"),
        # Calculate direction (degrees)
        (np.degrees(pl.arctan2(pl.col("dy"), pl.col("dx")))).alias("direction"),
        # Calculate velocity components (yards per second)
        (pl.col("dx") / pl.col("dt")).alias("vx"),
        (pl.col("dy") / pl.col("dt")).alias("vy"),
        # Calculate angular velocities (degrees per second)
        ((pl.col("dir") - pl.col("prev_dir")) / pl.col("dt")).alias("omega_dir"),
        ((pl.col("o") - pl.col("prev_o")) / pl.col("dt")).alias("omega_o")
    ]).with_columns([
        ((pl.col("omega_dir") - pl.col("omega_o")).abs()).alias("d_omega")
    ]).drop([
        "prev_x", "prev_y", "prev_time", "prev_dir", "prev_o", "dt", "dx", "dy"
    ])

    return df.collect()


In [10]:
track = dynamics_calculator(track)
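
The per-step quantities can be checked by hand on a single pair of consecutive samples. This is a plain-Python sketch with invented sample values, and `step_dynamics` is a hypothetical helper. It also wraps the heading change into [-180, 180) before dividing by the time step, which is an assumption added here so that a turn across the ±180 boundary is not read as a near-full rotation.

```python
import math
from datetime import datetime


def step_dynamics(prev, curr):
    """Speed, velocity components, heading and turn rate between two samples."""
    dt = (curr["time"] - prev["time"]).total_seconds()  # fractional seconds
    dx = curr["x"] - prev["x"]
    dy = curr["y"] - prev["y"]
    dist = math.hypot(dx, dy)
    # Assumption: wrap the heading change so 179 -> -179 counts as +2 degrees.
    d_dir = (curr["dir"] - prev["dir"] + 180.0) % 360.0 - 180.0
    return {
        "speed": dist / dt,                               # yards per second
        "vx": dx / dt,
        "vy": dy / dt,
        "direction": math.degrees(math.atan2(dy, dx)),    # degrees
        "omega_dir": d_dir / dt,                          # degrees per second
    }


# Invented samples 0.1 s apart, for illustration only.
prev = {"time": datetime(2017, 10, 29, 19, 48, 49, 800000), "x": 10.0, "y": 20.0, "dir": 179.0}
curr = {"time": datetime(2017, 10, 29, 19, 48, 49, 900000), "x": 10.3, "y": 20.4, "dir": -179.0}

step = step_dynamics(prev, curr)
print({k: round(v, 2) for k, v in step.items()})
```

A displacement of 0.5 yards in 0.1 s gives a speed of about 5 yards per second, and the 2-degree wrapped turn gives about 20 degrees per second.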
