# ETL for the Concussion Data

There is a lot of overlap with the Lower Body Data, so some of the functions should work for both datasets. I'll continue to use polars here since the data is still huge: the ngs data alone has about 60 million rows.




## game_data
Starting with gamekey, season year, season type, week, game_day, and game_date: the game date already provides all of the necessary calendar information, and the additional redundant columns could skew any machine learning findings. Only gamekey and game_date are kept.

Likewise, team names and team codes are redundant. The codes are less likely to contain spelling errors, so hometeamcode and visitteamcode are kept.

Among the stadium locations, two pairs are duplicates: Mexico City - Estadio Azteca and Mexico - Estadio Azteca, and London - Wembley and Wembley - Wembley Stadium.

One game is listed in Indianapolis at Tom Benson, which is actually in Canton. This game was canceled, as verified online.
Ralph Wilson Stadium is the same location as New Era Field.

The outdoor weather is recorded so inconsistently that there are 134 different weather conditions, most of which don't actually include a weather type, whereas gameweather has only 10 distinct values, including null.

In [None]:
SELECT gamekey
    , game_date
    , game_site
    , start_time
    , hometeamcode
    , visitteamcode
    , stadiumtype
    , turf
    , gameweather
    , temperature 

FROM game_data
LIMIT 10;


## ngs_data

Again, season_year is unnecessary, since it is included in time. I would like to see if the GSISID numbers are the same as those from the other database. We will need to keep the rest of the information. 


In [None]:
SELECT gamekey ,
       playid ,
       gsisid ,
       time ,
       x,
       y,
       dis ,
       o ,
       dir ,
       event
FROM ngs_data
LIMIT 10 ;

## play_information

This has the same kind of redundant columns as game_data, and they can be removed. This table will end up joined with game_data, so we'll need to keep enough to connect them.

season_year, season_type, and week can be dropped.
The play description isn't going to be particularly helpful, as every one of them is completely different. We should already have the home_team_visit_team information from the joined game_data table, so this column may be dropped later.

In [None]:
SELECT gamekey ,
       game_date ,
       playid,
       game_clock ,
       yardline ,
       quarter ,
       play_type ,
       poss_team ,
       home_team_visit_team ,
       score_home_visiting
FROM play_information
LIMIT 10 ;

## punt_data

Everything here is necessary - gsisid, number, and position (code)

## role_data

season_year is unnecessary, since the gamekeys don't overlap between seasons and are already in chronological order.

In [None]:
SELECT gamekey , playid , gsisid , role FROM role_data
LIMIT 10 ;

## video_review

There are no turnovers in the entire table, so this parameter can be removed.
For this analysis there is no real reason to keep primary_partner_gsisid, since we're not looking for the individuals responsible, though it will be nice to have for the visualizations.
friendly_fire isn't something we can control biomechanically, so there is nothing I can really use from it.


In [None]:
SELECT gamekey ,
       playid ,
       gsisid ,
       player_activity_derived ,
       primary_impact_type ,
       primary_partner_activity_derived ,
       primary_partner_gsisid
FROM video_review
LIMIT 10 ;

# Extract and Clean

In [1]:
import polars as pl
import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2

In [2]:
from DataHandler import data_loader

In [3]:
game_data = data_loader(database='nfl_concussion', dataset='game_data')
game_data.head()

gamekey,game_date,game_site,start_time,hometeamcode,visitteamcode,stadiumtype,turf,gameweather,temperature
i32,datetime[ns],str,str,str,str,str,str,str,f32
1,2016-08-07 00:00:00,"""Indianapolis""","""20:00""","""IND""","""GB""","""Outdoor""","""Turf""",,
2,2016-08-13 00:00:00,"""Los Angeles""","""17:00""","""LA""","""DAL""","""Outdoor""","""Grass""","""Sunny""",79.0
3,2016-08-11 00:00:00,"""Baltimore""","""19:30""","""BLT""","""CAR""","""Outdoor""","""Natural Grass""","""Party Cloudy""",94.0
4,2016-08-12 00:00:00,"""Green Bay""","""19:00""","""GB""","""CLV""","""Outdoor""","""DD GrassMaster""",,73.0
5,2016-08-11 00:00:00,"""Chicago""","""19:00""","""CHI""","""DEN""","""Outdoor""","""Grass""","""Partly Cloudy, Chance of Rain …",88.0


In [4]:
play_information = data_loader(database='nfl_concussion', dataset='play_information')
play_information.head()

gamekey,game_date,playid,yardline,quarter,play_type,poss_team,score_home_visiting
i32,date,i32,str,i32,str,str,str
2,2016-08-13,191,"""LA 47""",1,"""Punt""","""LA""","""0 - 7"""
2,2016-08-13,1132,"""LA 29""",2,"""Punt""","""LA""","""7 - 21"""
2,2016-08-13,1227,"""DAL 18""",2,"""Punt""","""DAL""","""7 - 21"""
2,2016-08-13,1864,"""DAL 46""",2,"""Punt""","""LA""","""7 - 24"""
2,2016-08-13,2247,"""DAL 15""",3,"""Punt""","""DAL""","""14 - 24"""


In [7]:
punt_data = data_loader(database='nfl_concussion', dataset='punt_data')
punt_data.head()

gsisid,number,position
i32,str,str
32069,"""36""","""SS"""
30095,"""11""","""WR"""
31586,"""22""","""FS"""
29520,"""35""","""SS"""
30517,"""51""","""OLB"""


In [8]:
role_data = data_loader(database='nfl_concussion', dataset='role_data')
role_data.head()

gamekey,playid,gsisid,role
i32,i32,i32,str
414,188,33704,"""PDL2"""
414,1107,33704,"""PDL2"""
424,1113,33704,"""PDR3"""
424,1454,33704,"""PLR2"""
424,644,33704,"""PRG"""


In [10]:
video_review = data_loader(database='nfl_concussion', dataset='video_review')
video_review.head()

gamekey,playid,gsisid,player_activity_derived,primary_impact_type,primary_partner_activity_derived,primary_partner_gsisid
i32,i32,i32,str,str,str,str
5,3129,31057,"""Tackling""","""Helmet-to-body""","""Tackled""","""32482"""
21,2587,29343,"""Blocked""","""Helmet-to-helmet""","""Blocking""","""31059"""
29,538,31023,"""Tackling""","""Helmet-to-body""","""Tackled""","""31941"""
45,1212,33121,"""Tackling""","""Helmet-to-body""","""Tackled""","""28249"""
54,1045,32444,"""Blocked""","""Helmet-to-body""","""Blocked""","""31756"""


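Row counts and available join keys per table (G = gamekey, P = playid, Gs = gsisid):

- role_data: 146573 rows, keys G P Gs
- play_information: 6681 rows, keys G P
- punt_data: 3259 rows, keys Gs
- video_review: 37 rows, keys G P Gs
- game_data: 666 rows, keys G
- ngs_data: 60716164 rows, keys G P Gs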

In [None]:
qualitative_df = role_data.join(
    play_information
    , on=['gamekey', 'playid']
    , how='left'
)