# Post-Analysis - Turf Data Cleaning

- This notebook removes the features that the previous analyses showed to be flawed
- It contains both the data cleaning and the feature analysis
- The latter portion of this file produces the data used for the ML analysis and for the visualizations (each in a separate notebook)

---
# Dependencies

In [1]:
import pandas as pd
import numpy as np
from PostAnalysisCleaningFunctions import *

import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2

## Import the Data from the NFL Turf Kaggle data

### Connect and Clean Playlist.csv

The playlist table contains every play, including the exact plays that match the injury list. Any column that appears in both tables, with the exception of PlayerKey, should be kept on this DataFrame so that the non-injury plays do not lose their values.

In [2]:
# Connect to the Database using the postgres server and sqlalchemy
from config import db_password

db_string = f"postgresql://postgres:{db_password}@127.0.0.1:5433/NFL_Turf"
engine = db.create_engine(db_string)
conn = engine.connect()
metadata = db.MetaData()


del db_password

# Read in the specific table - this is done from the same connection
# autoload_with alone triggers reflection; the separate autoload flag is not accepted by SQLAlchemy 2.0
table = db.Table('playlist', metadata, autoload_with=engine)
query = db.select(table)
Results = conn.execute(query).fetchall()


In [3]:
# Create the new dataframe and set the keys
playlist = pd.DataFrame(Results)
playlist.columns = Results[0]._fields  # Row.keys() is not available in SQLAlchemy 2.0

playlist.head(2)

Unnamed: 0,playerkey,gameid,playkey,rosterposition,playerday,playergame,stadiumtype,fieldtype,temperature,weather,playtype,playergameplay,position,postiongroup
0,26624,26624-1,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB
1,26624,26624-1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB
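
As an alternative to building the DataFrame from fetched rows, pandas can run the query itself and name the columns in one step. The sketch below is only an illustration: it uses an in-memory SQLite database with a tiny made-up `playlist` table as a stand-in for the Postgres server, so the table contents and the `demo` name are assumptions. `pd.read_sql_query` also accepts a SQLAlchemy connection such as the `conn` created above.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the Postgres server so the sketch runs anywhere.
stand_in = sqlite3.connect(":memory:")
stand_in.execute(
    "CREATE TABLE playlist (playerkey INTEGER, playkey TEXT, playtype TEXT)"
)
stand_in.executemany(
    "INSERT INTO playlist VALUES (?, ?, ?)",
    [(26624, "26624-1-1", "Pass"), (26624, "26624-1-2", "Pass")],
)

# read_sql_query returns a DataFrame whose columns are already named.
demo = pd.read_sql_query("SELECT * FROM playlist", stand_in)
stand_in.close()

print(demo.columns.tolist())
```

With this approach there is no need to copy the column names from the first result row.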


Because the data was read through sqlalchemy, the columns lost the capitalization of the original data, so the column_capitalizer function from the cleaning functions module was used to restore it:

In [4]:
playlist = column_capitalizer(playlist, 'playlist')
playlist.head(2)

Unnamed: 0,PlayerKey,GameID,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup
0,26624,26624-1,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB
1,26624,26624-1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB
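
The body of `column_capitalizer` is not shown in this notebook. A minimal sketch of the idea, assuming a plain lookup from the lowercased database names to the original names, could look like the following. The function name `capitalize_columns` and the `PLAYLIST_COLUMNS` lookup are hypothetical and cover only a few columns.

```python
import pandas as pd

# Hypothetical lookup from the lowercased database names to the original names.
PLAYLIST_COLUMNS = {
    "playerkey": "PlayerKey",
    "gameid": "GameID",
    "playkey": "PlayKey",
    "rosterposition": "RosterPosition",
    # The table output above spells this column "postiongroup".
    "postiongroup": "PositionGroup",
}


def capitalize_columns(df: pd.DataFrame, lookup: dict) -> pd.DataFrame:
    """Return a copy of df with the known lowercase column names restored."""
    return df.rename(columns=lookup)


raw = pd.DataFrame(
    {"playerkey": [26624], "playkey": ["26624-1-1"], "postiongroup": ["QB"]}
)
restored = capitalize_columns(raw, PLAYLIST_COLUMNS)
print(list(restored.columns))
```

A lookup table also handles renames that are more than a change of case, such as the misspelled position group column.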


PlayKey will be used as the key to merge the datasets, so PlayerKey and GameID can be removed. Temperature produced a misleading feature importance, because games in indoor stadiums are recorded at 68 degrees and that single value dominates the column. While temperature should be continuous, it behaved as a binary feature: either 68-70 or not. Weather also showed very little impact on injuries in these data, so it will be removed in this modified analysis as well.

- PlayKeys represent all plays, not only those where injuries occurred, and will be used to merge the tables
- FieldType has only 2 values, Natural or Synthetic, and can easily be changed to binary values for the ML
- StadiumType is inconsistent, with 29 unique stadium types. These can likely be grouped into a smaller set of categories.
- RosterPosition, Position, and PositionGroup are all similar and need to be investigated
    - PositionGroup will be removed
    - RosterPosition and Position will be encoded, but only one of the two columns will be used per evaluation, since they are not independent features
- PlayType should be encoded, as it is categorical (pass, rush, kick, ...)

### Create a coded surface variable for the Machine Learning

In [6]:
playlist = surface_coder(playlist)
playlist.head(3)

Unnamed: 0,PlayerKey,GameID,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup,SyntheticField
0,26624,26624-1,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB,1
1,26624,26624-1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB,1
2,26624,26624-1,26624-1-3,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,QB,1
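
The implementation of `surface_coder` lives in the imported module and is not shown here. Since FieldType has only two values, one plausible sketch is a boolean comparison cast to an integer. The function name `code_surface` below is hypothetical.

```python
import pandas as pd


def code_surface(df: pd.DataFrame) -> pd.DataFrame:
    """Add a SyntheticField column: 1 for Synthetic turf, 0 otherwise."""
    out = df.copy()
    out["SyntheticField"] = (out["FieldType"] == "Synthetic").astype(int)
    return out


demo = pd.DataFrame({"FieldType": ["Synthetic", "Natural", "Synthetic"]})
coded = code_surface(demo)
print(coded)
```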


### Fix the Stadium Types

Nearly every stadium in the NFL is listed with its own stadium type, so the function stadium_coder() changes all stadiums to either indoor or outdoor. For retractable-roof stadiums, the classification depends on whether the roof was open or closed during the specific game that the plays were recorded from.

In [7]:
playlist = stadium_coder(playlist)
playlist.StadiumType.nunique()

2
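
The 29 raw labels are not listed in this notebook, so the sketch below only illustrates the grouping idea with a keyword rule. The example labels, the `INDOOR_KEYWORDS` tuple, and the function name `code_stadium` are all assumptions, and the real `stadium_coder` may use a different rule. The binary `Outdoor` column is included because a column with that name appears in later outputs.

```python
import pandas as pd

# Assumed keywords that mark a stadium as enclosed during the game.
INDOOR_KEYWORDS = ("indoor", "dome", "closed")


def code_stadium(label: str) -> str:
    """Collapse a free-text stadium label to 'Indoor' or 'Outdoor'."""
    text = str(label).lower()
    return "Indoor" if any(word in text for word in INDOOR_KEYWORDS) else "Outdoor"


# Illustrative labels only; they are not copied from the data set.
demo = pd.DataFrame(
    {
        "StadiumType": [
            "Outdoor",
            "Indoors",
            "Retr. Roof - Closed",
            "Retr. Roof - Open",
            "Dome",
        ]
    }
)
demo["StadiumType"] = demo["StadiumType"].map(code_stadium)
demo["Outdoor"] = (demo["StadiumType"] == "Outdoor").astype(int)
print(demo)
```

A retractable roof recorded as closed is treated as indoor here, and one recorded as open as outdoor, which matches the rule described above.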

### Fix the positions
- Positions and roster positions will all be converted to their 1-2 letter abbreviations
- Positions for the ML model will be encoded numerically, in separate columns as shown below

In [8]:
playlist = position_coder(playlist)
playlist.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,SyntheticField,Outdoor,RosterPosition_Num,Position_Num
0,26624,26624-1,26624-1-1,QB,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,1,1,0,0
1,26624,26624-1,26624-1-2,QB,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,1,1,0,0
2,26624,26624-1,26624-1-3,QB,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,1,1,0,0
3,26624,26624-1,26624-1-4,QB,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,4,QB,1,1,0,0
4,26624,26624-1,26624-1-5,QB,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,5,QB,1,1,0,0
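
A rough sketch of how the abbreviation and numeric encoding could work is given below. The `ABBREVIATIONS` lookup lists only three positions and the numeric codes are simply the order of that lookup, so both are assumptions rather than the codes produced by `position_coder`.

```python
import pandas as pd

# Hypothetical, partial lookup from long roster names to abbreviations.
ABBREVIATIONS = {"Quarterback": "QB", "Running Back": "RB", "Wide Receiver": "WR"}


def code_positions(df: pd.DataFrame) -> pd.DataFrame:
    """Abbreviate RosterPosition and add numeric codes for both position columns."""
    out = df.copy()
    out["RosterPosition"] = out["RosterPosition"].replace(ABBREVIATIONS)
    # Assumed coding: the position of each abbreviation in the lookup.
    codes = {abbr: number for number, abbr in enumerate(ABBREVIATIONS.values())}
    out["RosterPosition_Num"] = out["RosterPosition"].map(codes)
    out["Position_Num"] = out["Position"].map(codes)
    return out


demo = pd.DataFrame(
    {"RosterPosition": ["Quarterback", "Wide Receiver"], "Position": ["QB", "WR"]}
)
coded = code_positions(demo)
print(coded)
```

Using one shared code table for both columns keeps the numbers comparable when either column is swapped in for an evaluation.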


### Fix the Plays

Most plays are either passing plays or rushing plays. The rest are kicking plays. Because our other dataset looks specifically at punt plays, we differentiate between Kick plays and Punt plays.

In [9]:
playlist = play_coder(playlist)
playlist.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,SyntheticField,Outdoor,RosterPosition_Num,Position_Num,PlayCode
0,26624,26624-1,26624-1-1,QB,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,1,1,0,0,0.0
1,26624,26624-1,26624-1-2,QB,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,1,1,0,0,0.0
2,26624,26624-1,26624-1-3,QB,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,1,1,0,0,1.0
3,26624,26624-1,26624-1-4,QB,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,4,QB,1,1,0,0,1.0
4,26624,26624-1,26624-1-5,QB,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,5,QB,1,1,0,0,0.0


In [10]:
playlist.PlayType.value_counts()

Pass    138079
Rush     92606
Kick     23973
Punt     11701
0          279
Name: PlayType, dtype: int64
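
The grouping into four play types could be sketched with a keyword rule followed by a code lookup. In the outputs above Pass is coded 0 and Rush is coded 1; the codes for Kick and Punt, the raw example labels, and the function name `simplify_play` are assumptions made for this illustration.

```python
import pandas as pd

# Pass and Rush codes match the output above; Kick and Punt codes are assumed.
PLAY_CODES = {"Pass": 0, "Rush": 1, "Kick": 2, "Punt": 3}


def simplify_play(raw):
    """Map a raw play description to Pass, Rush, Kick, or Punt."""
    text = str(raw).lower()
    if "punt" in text:
        return "Punt"
    if "kick" in text or "field goal" in text or "extra point" in text:
        return "Kick"
    if "pass" in text:
        return "Pass"
    if "rush" in text:
        return "Rush"
    return None  # unrecognised descriptions are left for manual review


# Illustrative raw labels only.
demo = pd.DataFrame(
    {"PlayType": ["Pass", "Rush", "Kickoff Returned", "Punt Not Returned", "Field Goal"]}
)
demo["PlayType"] = demo["PlayType"].map(simplify_play)
demo["PlayCode"] = demo["PlayType"].map(PLAY_CODES)
print(demo)
```

Punt is tested before Kick so that punt plays are never absorbed into the broader kicking group.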

### Fix the PlayerDay

The minimum PlayerDay value is -62. Additionally, the days run continuously across 2 seasons, but not all players played in both seasons. To fix this, all player days will be adjusted so that each player's first player day is 1, per season.

In [11]:
print(min(playlist.PlayerDay), max(playlist.PlayerDay))

-62 480


In [12]:
playlist = playerday_adjuster(playlist)
print(min(playlist.DaysPlayed), max(playlist.DaysPlayed))


1 207
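
One way to re-base the days is a grouped transform: find each player's first day within a season and subtract it. The notebook does not say how the seasons are separated, so the `SEASON_BREAK` cutoff below is purely an assumption, as are the function name `adjust_player_days` and the toy data.

```python
import pandas as pd

SEASON_BREAK = 300  # assumed PlayerDay cutoff between the two seasons


def adjust_player_days(df: pd.DataFrame, season_break: int = SEASON_BREAK) -> pd.DataFrame:
    """Add DaysPlayed so each player's first day in each season is 1."""
    out = df.copy()
    # Season 1 up to the cutoff, season 2 after it (assumption).
    out["Season"] = (out["PlayerDay"] > season_break).astype(int) + 1
    first_day = out.groupby(["PlayerKey", "Season"])["PlayerDay"].transform("min")
    out["DaysPlayed"] = out["PlayerDay"] - first_day + 1
    return out


demo = pd.DataFrame(
    {"PlayerKey": [1, 1, 1, 2, 2], "PlayerDay": [-62, -55, 365, 1, 8]}
)
adjusted = adjust_player_days(demo)
print(adjusted)
```

In this toy data, player 1 starts at day -62 and is shifted to day 1, and the same player's second-season play restarts at day 1.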


### Before Merging, Drop Appropriate Columns

Since the visualizations need to keep the character columns, while the machine learning can only use the numerical columns, the next two functions automate the full cleaning process and return the DataFrame with the appropriate columns. The cell below first rebuilds the raw playlist DataFrame.

In [13]:
# Create the new dataframe and set the keys
playlist = pd.DataFrame(Results)
playlist.columns = Results[0]._fields  # Row.keys() is not available in SQLAlchemy 2.0
del Results
playlist.head(2)


Unnamed: 0,playerkey,gameid,playkey,rosterposition,playerday,playergame,stadiumtype,fieldtype,temperature,weather,playtype,playergameplay,position,postiongroup
0,26624,26624-1,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB
1,26624,26624-1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB


---

# Connect and Clean the Injuries Data

In [16]:
# Read in the specific table - this can be done on the same connection:
# autoload_with alone triggers reflection; the separate autoload flag is not accepted by SQLAlchemy 2.0
injuries_sql = db.Table('injuries', metadata, autoload_with=engine)
query = db.select(injuries_sql)
Results = conn.execute(query).fetchall()

# Create the new dataframe and set the keys
injuries = pd.DataFrame(Results)
injuries.columns = Results[0]._fields  # Row.keys() is not available in SQLAlchemy 2.0
conn.close()
# del Results, metadata, conn, engine, query, table, db_string
injuries.head()


Unnamed: 0,playerkey,gameid,playkey,bodypart,fieldtype,dm_m1,dm_m7,dm_m28,dm_m42
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1


### Capitalize the Columns

In [17]:
injuries = column_capitalizer(injuries, 'injuries')
injuries.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,Surface,DM_M1,DM_M7,DM_M28,DM_M42
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1


- The PlayKey column is the only column with NA values, and these are only associated with toe injuries, likely because those injuries were discovered after the game
- Every row has DM_M1 set, indicating that everyone in this list is injured
- Surface can be dropped, since it is already in the merge table
- The DM columns will be converted into a single column called InjuryDuration with the values 1, 7, 28, or 42; after merging, non-injury plays receive 0
- BodyPart will be encoded for ML, so an additional coded column will be added, while the string is kept for the visualizations
- GameID is already part of the PlayKey, and since it isn't in any chronological order, this column is dropped
- Since we want to be able to classify whether a player is injured, we will create an IsInjured column, which should be TRUE for all players in this table
- We need all info from this table, so all rows with NaN values must be dropped before merging
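
The merge described in these notes can be sketched as a left join on PlayKey, with the injury columns filled with 0 for plays that have no matching injury. The play keys and values below are made up, and the integer IsInjured flag is an assumption about how the column is stored.

```python
import pandas as pd

# Toy stand-ins for the cleaned playlist and injuries tables.
plays = pd.DataFrame(
    {
        "PlayKey": ["10000-1-1", "10000-1-2", "10000-1-3"],
        "PlayType": ["Pass", "Rush", "Pass"],
    }
)
injured = pd.DataFrame(
    {
        "PlayKey": ["10000-1-2", None],
        "BodyPart": ["Knee", "Toes"],
        "InjuryDuration": [42, 7],
    }
)

# Rows without a PlayKey cannot be matched to a play, so drop them first.
injured = injured.dropna(subset=["PlayKey"]).assign(IsInjured=1)

# A left join keeps every play; non-injury plays get NaN, then 0.
merged = plays.merge(injured, on="PlayKey", how="left")
for column in ("InjuryDuration", "IsInjured"):
    merged[column] = merged[column].fillna(0).astype(int)

print(merged)
```

Dropping the unmatched injury before the merge is what keeps the row count of the merged table equal to the number of plays.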

In [19]:
injuries = injury_duration_coder(injuries, 'vis')
injuries.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,Surface,DM_M1,DM_M7,DM_M28,DM_M42,InjuryDuration,SevereInjury
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1,42,Severe
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0,7,Mild
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1,42,Severe
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0,1,
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1,42,Severe


In the above code, the second argument selects the process, either 'vis' or 'ml', since the outputs will be different for each.
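
The duration column can be derived by weighting each DM flag by its day count and keeping the largest value reached. This is a sketch of that idea only; the function name `code_injury_duration` is hypothetical, and the SevereInjury labels are left out because the notebook does not state how every duration is labelled.

```python
import pandas as pd

# Day count represented by each flag column.
DM_DAYS = {"DM_M1": 1, "DM_M7": 7, "DM_M28": 28, "DM_M42": 42}


def code_injury_duration(df: pd.DataFrame) -> pd.DataFrame:
    """Add InjuryDuration: the longest day threshold flagged for each injury."""
    out = df.copy()
    # Multiply each 0/1 flag by its day count, then take the row maximum.
    weighted = out[list(DM_DAYS)] * pd.Series(DM_DAYS)
    out["InjuryDuration"] = weighted.max(axis=1)
    return out


demo = pd.DataFrame(
    {
        "DM_M1": [1, 1, 1],
        "DM_M7": [1, 1, 0],
        "DM_M28": [1, 0, 0],
        "DM_M42": [1, 0, 0],
    }
)
coded = code_injury_duration(demo)
print(coded["InjuryDuration"].tolist())
```

The three toy rows use the same flag patterns as rows 0, 1, and 3 of the output above.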

In [20]:
injuries = bodypart_coder(injuries)
injuries.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,Surface,DM_M1,DM_M7,DM_M28,DM_M42,InjuryDuration,SevereInjury,InjuryType
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1,42,Severe,3.0
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0,7,Mild,3.0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1,42,Severe,2.0
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0,1,,2.0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1,42,Severe,2.0
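
A dictionary lookup is one simple way to produce a coded body part column. In the output above Ankle is coded 2 and Knee is coded 3; every other entry in `BODYPART_CODES` is an assumption for illustration, as is the "Shoulder" label used to show an unmapped value.

```python
import pandas as pd

# Only the Ankle and Knee codes are taken from the output above; the rest are assumed.
BODYPART_CODES = {"Toes": 0, "Foot": 1, "Ankle": 2, "Knee": 3, "Heel": 4}

demo = pd.DataFrame({"BodyPart": ["Knee", "Ankle", "Shoulder"]})
# Labels missing from the lookup become NaN, which makes the column float.
demo["InjuryType"] = demo["BodyPart"].map(BODYPART_CODES)
print(demo)
```

Checking the coded column for NaN after mapping is a quick way to spot labels the lookup does not cover.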


### Injury Import Function

Similar to the vis and ML functions above, there are corresponding vis and ml functions that remove the appropriate columns for the injuries table as well.

In [21]:
df = pd.DataFrame(Results)
df.columns = Results[0]._fields  # Row.keys() is not available in SQLAlchemy 2.0
df.head()

Unnamed: 0,playerkey,gameid,playkey,bodypart,fieldtype,dm_m1,dm_m7,dm_m28,dm_m42
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1


---