# ETL for the Concussion Data

There is a lot of overlap with the Lower Body Data, so some of the functions should work for both datasets. I'm going to continue using polars here since the data is still huge: there are 60 million rows in the ngs data.




## game_data
Starting with gamekey, season year, season type, week, game_day, and game_date: the game date already provides all of the necessary date information, and anything additional can skew any machine learning findings.
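To illustrate why the date alone is enough, here is a minimal sketch that recovers the season year from a game date. The helper name and the March rollover are assumptions (NFL seasons run from late summer into the following January or February); neither comes from the project's own modules.

```python
from datetime import date


def season_year_from_date(game_date: date, rollover_month: int = 3) -> int:
    """Hypothetical helper: games played before `rollover_month` are
    assumed to belong to the season that started the previous year."""
    if game_date.month < rollover_month:
        return game_date.year - 1
    return game_date.year


# A September game and a January game from the same (assumed) season.
print(season_year_from_date(date(2016, 9, 11)))
print(season_year_from_date(date(2017, 1, 1)))
```

Week and season type could be derived in a similar way if they are ever needed again, which is why dropping them loses nothing.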

Likewise, team and team codes are redundant. The team codes are less likely to have spelling errors, so I will keep hometeamcode and visitteamcode.

As for stadium locations, the redundant entries are Mexico City - Estadio Azteca and Mexico - Estadio Azteca, and London - Wembley and Wembley - Wembley Stadium.

There is an issue with the game listed in Indianapolis at Tom Benson, which is in Canton - this game was actually canceled as verified online. 
Ralph Wilson Stadium is the same location as New Era Field.
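The duplicates listed above can be collapsed with a small alias table. This is only a sketch: the names `STADIUM_ALIASES` and `normalize_stadium` are hypothetical, and which spelling counts as canonical is an arbitrary choice on my part.

```python
# Hypothetical alias table. Keys are the redundant spellings noted above;
# the canonical value on the right is an arbitrary choice.
STADIUM_ALIASES = {
    "Mexico - Estadio Azteca": "Mexico City - Estadio Azteca",
    "Wembley - Wembley Stadium": "London - Wembley",
    "Ralph Wilson Stadium": "New Era Field",
}


def normalize_stadium(name):
    """Return the canonical spelling for a known alias, else the input unchanged."""
    return STADIUM_ALIASES.get(name, name)


sites = ["Mexico - Estadio Azteca", "London - Wembley", "Ralph Wilson Stadium"]
print([normalize_stadium(s) for s in sites])
```

A dictionary like this could also be passed to a polars `replace` expression, the same way the turf mapping is applied later in this notebook.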

The outdoor weather is recorded so inconsistently that there are 134 different weather conditions, and most of them don't actually include weather types, whereas there are only 10 types of gameweather, including null.

In [None]:
SELECT gamekey
    , game_date
    , game_site
    , start_time
    , hometeamcode
    , visitteamcode
    , stadiumtype
    , turf
    , gameweather
    , temperature 

FROM game_data
LIMIT 10;


## ngs_data

Again, season_year is unnecessary, since it is included in time. I would like to see if the GSISID numbers are the same as those from the other database. We will need to keep the rest of the information. 
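A quick way to compare the GSISID values against the other database is a set comparison. The sketch below is hypothetical: the function name is mine and the IDs are toy values, not real player IDs.

```python
def compare_ids(concussion_ids, lower_body_ids):
    """Hypothetical helper: split two ID collections into shared and unique sets."""
    first, second = set(concussion_ids), set(lower_body_ids)
    return {
        "shared": first & second,       # IDs present in both sources
        "only_first": first - second,   # IDs only in the first source
        "only_second": second - first,  # IDs only in the second source
    }


# Toy values for illustration only.
report = compare_ids([101, 102, 103], [103, 104])
print(report)
```

If `shared` comes back empty on the real data, the two databases most likely use different ID schemes and the players can't be matched directly.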


In [None]:
SELECT gamekey ,
       playid ,
       gsisid ,
       time ,
       x,
       y,
       dis ,
       o ,
       dir ,
       event
FROM ngs_data
LIMIT 10 ;

## play_information

Like game_data, this table has redundant columns that can be removed. It will end up joined with game_data, so we'll need to keep enough to connect them.

Season_year, season_type, week can disappear. 
Play description isn't going to be particularly helpful, as every one of them is completely different. We should already have the home_team_visit_team information from the joined table game_data, so this column may be dropped later.

In [None]:
SELECT gamekey ,
       game_date ,
       playid,
       game_clock ,
       yardline ,
       quarter ,
       play_type ,
       poss_team ,
       home_team_visit_team ,
       score_home_visiting
FROM play_information
LIMIT 10 ;

## punt_data

Everything here is necessary: gsisid, number, and position (code).

## role_data

season_year is unnecessary, since the gamekeys don't overlap between seasons and are already in chronological order.
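The claim that gamekeys don't overlap is worth checking before dropping season_year. A minimal sketch, assuming the rows are available as (season_year, gamekey) pairs; the function name and the toy rows are hypothetical.

```python
from collections import defaultdict


def gamekeys_in_multiple_seasons(rows):
    """Hypothetical check: return gamekeys that appear under more than one season."""
    seasons_by_key = defaultdict(set)
    for season_year, gamekey in rows:
        seasons_by_key[gamekey].add(season_year)
    return {key: sorted(seasons)
            for key, seasons in seasons_by_key.items()
            if len(seasons) > 1}


# Toy rows for illustration only.
print(gamekeys_in_multiple_seasons([(2016, 1), (2016, 2), (2017, 3)]))
print(gamekeys_in_multiple_seasons([(2016, 1), (2017, 1)]))
```

An empty result on the real table would confirm that gamekey alone identifies a game.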

In [None]:
SELECT gamekey , playid , gsisid , role FROM role_data
LIMIT 10 ;

## video_review

There are no turnovers in the entire table, so this parameter can be removed. 
In terms of this analysis - there isn't a real reason to keep the primary_partner_gsisid, since we're not looking for the individuals responsible, though it will be nice to have for the visualizations.  
friendly_fire isn't something that we can control biomechanically, so there isn't anything that I can really use from this information. 


In [None]:
SELECT gamekey, playid, gsisid, player_activity_derived, primary_impact_type, primary_partner_activity_derived, primary_partner_gsisid FROM video_review
LIMIT 10 ;

# Extract and Clean

In [None]:
import polars as pl
import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2

In [None]:
from DataHandler import data_loader, data_shrinker
from CleaningFunctions import column_capitalizer, stadium_cleaner, weather_cleaner, turf_cleaner

In [None]:
concussion = data_loader(database='nfl_concussion', dataset='concussion')
concussion.head()

In [None]:
concussion = column_capitalizer(concussion, 'concussion')
concussion.head()

In [None]:
concussion = stadium_cleaner(concussion, 'concussion')
concussion.head()

In [None]:
concussion = weather_cleaner(concussion, 'concussion')
concussion.head()

In [None]:
concussion.null_count()

In [None]:
concussion = turf_cleaner(concussion)
concussion.head()

In [None]:
concussion.null_count()

In [None]:
filtered = concussion.filter(pl.col('Game_Date').is_null())
result = filtered.select(['Game_Date', 'HomeTeamCode', 'StadiumType', 'FieldType', 'Weather'])

In [None]:
# Print the narrowed-down columns selected above rather than the full frame
print(result)

In [None]:
# def turf_cleaner(df):
#     import polars as pl

#     turf_dict = {
#         'Grass': 'Natural',
#         'Field Turf': 'Synthetic', 
#         'Natural Grass': 'Natural',
#         'grass': 'Natural',
#         'Artificial': 'Synthetic',
#         'FieldTurf': 'Synthetic',
#         'DD GrassMaster': 'Synthetic',
#         'A-Turf Titan': 'Synthetic',
#         'UBU Sports Speed S5-M': 'Synthetic',
#         'UBU Speed Series S5-M': 'Synthetic',
#         'Artifical': 'Synthetic',
#         'UBU Speed Series-S5-M': 'Synthetic',
#         'FieldTurf 360': 'Synthetic',
#         'Natural grass': 'Natural',
#         'Field turf': 'Synthetic',
#         'Natural': 'Natural',
#         'Natrual Grass': 'Natural',
#         'Synthetic': 'Synthetic',
#         'Natural Grass ': 'Natural',
#         'Naturall Grass': 'Natural',
#         'FieldTurf360': 'Synthetic',
#         None: 'Natural' # The only field with null values is Miami Gardens, which has Natural
#         }
    
#     df = df.with_columns(pl.col("FieldType").replace(turf_dict))
#     return df
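The commented-out mapping above needs a separate key for every capitalization and trailing-space variant. As a possible alternative, the sketch below trims and lowercases each value first so fewer keys are needed. The names are hypothetical, and it is written in plain Python rather than polars so the logic is easy to test on its own. Note that a simple keyword match on "grass" would not work, since 'DD GrassMaster' is mapped to Synthetic.

```python
# Normalized (stripped, lowercased) spellings taken from the mapping above.
NATURAL = {"grass", "natural grass", "natural", "natrual grass", "naturall grass"}
SYNTHETIC = {
    "field turf", "fieldturf", "fieldturf 360", "fieldturf360",
    "artificial", "artifical", "synthetic", "dd grassmaster", "a-turf titan",
    "ubu sports speed s5-m", "ubu speed series s5-m", "ubu speed series-s5-m",
}


def normalize_turf(value):
    """Hypothetical helper: map a raw turf label to 'Natural' or 'Synthetic'."""
    if value is None:
        # Per the note above, the only null field is Miami Gardens (natural).
        return "Natural"
    key = value.strip().casefold()
    if key in NATURAL:
        return "Natural"
    if key in SYNTHETIC:
        return "Synthetic"
    return value  # leave unknown labels untouched so they stay visible


print([normalize_turf(v) for v in ["Natural Grass ", "DD GrassMaster", None]])
```

Returning unknown labels unchanged means a new misspelling shows up in the output instead of being silently misclassified.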

In [None]:
concussion = turf_cleaner(concussion)

In [None]:
concussion.filter(pl.col('FieldType').is_null())

## New Test for the clean_concussion() function

In [1]:
import polars as pl
import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2

from CleaningFunctions import *
from DataHandler import data_loader, data_writer

In [2]:
df = clean_concussions()
df.head()

AttributeError: 'NoneType' object has no attribute 'head'
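One common cause of this error is a function that builds its result but never returns it, so the call evaluates to `None`. The body of `clean_concussions()` isn't shown here, so this is only a guess; the two toy functions below are hypothetical and just reproduce the symptom.

```python
def clean_without_return(rows):
    """Builds a cleaned list but forgets to return it."""
    cleaned = [r.strip() for r in rows]
    # No return statement: Python implicitly returns None.


def clean_with_return(rows):
    """Same cleaning step, with the result handed back to the caller."""
    cleaned = [r.strip() for r in rows]
    return cleaned


print(clean_without_return([" a "]))  # None
print(clean_with_return([" a "]))     # ['a']
```

If `clean_concussions()` does end with a `return`, the next things to check would be whether an early branch exits without one, or whether the last step is a call to a writer-style function that itself returns nothing.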