# Preliminary Clean and Analysis - Lower Body Injuries
- **Preliminary Data Cleaning and Feature Analysis**
- **Preliminary Machine Learning Models**

This is a combined file containing the full preliminary analysis, which led to several changes implemented in the final data cleaning and machine learning models.

---
## Dependencies

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from ColumnCapitals import column_capitalizer

import sqlalchemy as db
from sqlalchemy.orm import Session
from sqlalchemy import create_engine
import psycopg2

- The data cleaning itself only required pandas and NumPy.
- The scikit-learn library was used for the validation splits and for encoding categorical data as dummy variables.
- SQLAlchemy was used to connect to our database for both data retrieval and data exports.

---

## Exploratory Data Analysis - Lower Body Injury Data
### PlayList.csv

The data provided by Kaggle in the .csv file titled "PlayList.csv" was imported into our PostgreSQL database named "NFL_Turf". The code below connects to the database, retrieves the table, and stores it as a dataframe titled "playlist".

The first thing to note is that this list contains all of the plays, including the exact plays that match the injury list. Therefore, anything that appears in both tables, with the exception of the PlayerKey, should be kept on THIS dataframe so that we don't lose data for the non-injury plays.

In [2]:
# Make connection to the database
from config import db_password
db_string = f"postgresql://postgres:{db_password}@127.0.0.1:543/NFL_Turf"
engine = db.create_engine(db_string)
conn = engine.connect()
metadata = db.MetaData()

del db_password

# Read in the specific table - this can be done on the same connection:
table = db.Table('playlist', metadata,
                        autoload=True, autoload_with=engine)
query = db.select(table)
Results = conn.execute(query).fetchall()

# Create the new dataframe and set the keys
playlist = pd.DataFrame(Results)
playlist.columns = Results[0].keys()

playlist.head()

Unnamed: 0,playerkey,gameid,playkey,rosterposition,playerday,playergame,stadiumtype,fieldtype,temperature,weather,playtype,playergameplay,position,postiongroup
0,26624,26624-1,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB
1,26624,26624-1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB
2,26624,26624-1,26624-1-3,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,QB
3,26624,26624-1,26624-1-4,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,4,QB,QB
4,26624,26624-1,26624-1-5,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,5,QB,QB
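
The retrieval step above can also be written with `pandas.read_sql_query`, which runs the query and builds the dataframe (column names included) in one call. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the PostgreSQL "NFL_Turf" database, and the table contents and the `demo_` names are made up for the example.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for PostgreSQL here,
# so the example runs without credentials or a server.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE playlist (playkey TEXT, fieldtype TEXT, temperature INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO playlist VALUES (?, ?, ?)",
    [("26624-1-1", "Synthetic", 63), ("26624-1-2", "Synthetic", 63)],
)

# read_sql_query executes the query and returns a DataFrame with named columns.
demo_playlist = pd.read_sql_query("SELECT * FROM playlist", demo_conn)
demo_conn.close()

print(demo_playlist)
```

With a SQLAlchemy engine, the same call should accept the engine (or a connection from it) in place of `demo_conn`.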


Note that the column names retrieved from the database are all lowercase, whereas the raw data uses capitalized names. To maintain fidelity with the original data, the function column_capitalizer from ColumnCapitals.py changes the imported column names back to their original format.

In [3]:
playlist = column_capitalizer(playlist, 'playlist')
playlist.head()


Unnamed: 0,PlayerKey,GameID,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup
0,26624,26624-1,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB
1,26624,26624-1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB
2,26624,26624-1,26624-1-3,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,QB
3,26624,26624-1,26624-1-4,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,4,QB,QB
4,26624,26624-1,26624-1-5,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,5,QB,QB
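
ColumnCapitals.py is not shown in this file, so the following is only a guess at the kind of logic such a helper could use: a lookup from lowercase names to the original names, applied with `DataFrame.rename`. The mapping and the function name `restore_column_names` are hypothetical and are not the actual `column_capitalizer` implementation.

```python
import pandas as pd

# Hypothetical mapping; the real one lives in ColumnCapitals.py and is not shown here.
PLAYLIST_COLUMN_NAMES = {
    "playerkey": "PlayerKey",
    "gameid": "GameID",
    "playkey": "PlayKey",
    "fieldtype": "FieldType",
}


def restore_column_names(df, name_map):
    """Return a copy of df with lowercase column names mapped back to their original casing."""
    return df.rename(columns=name_map)


lowercase_df = pd.DataFrame(
    {
        "playerkey": [26624],
        "gameid": ["26624-1"],
        "playkey": ["26624-1-1"],
        "fieldtype": ["Synthetic"],
    }
)
restored_df = restore_column_names(lowercase_df, PLAYLIST_COLUMN_NAMES)
print(restored_df.columns.tolist())
```

Columns that are missing from the mapping are left unchanged by `rename`, so an incomplete mapping would show up as leftover lowercase names.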


PlayKey will be used as the key to merge the datasets, so PlayerKey and GameID can be removed. While the FieldType information is also in the Surface column of the injuries table, we need to keep it here so we don't lose that data for the rows without injuries.

In [4]:
playlist.drop(columns=['PlayerKey', 'GameID'], inplace=True)
playlist.head()


Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup
0,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB
1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB
2,26624-1-3,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,QB
3,26624-1-4,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,4,QB,QB
4,26624-1-5,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,5,QB,QB


In [5]:
playlist.nunique()

PlayKey           267005
RosterPosition        10
PlayerDay            215
PlayerGame            32
StadiumType           29
FieldType              2
Temperature           79
Weather               63
PlayType              11
PlayerGamePlay       102
Position              23
PositionGroup         10
dtype: int64

In [6]:
objects = playlist.dtypes[playlist.dtypes == 'object'].index.tolist()
objects

['PlayKey',
 'RosterPosition',
 'StadiumType',
 'FieldType',
 'Weather',
 'PlayType',
 'Position',
 'PositionGroup']

- PlayKey covers all plays, not only those where injuries occurred; it will be used to merge the tables
- FieldType only has 2 values, Natural or Synthetic, and can easily be changed to binary values
- StadiumType has 29 unique values, which is more than expected; these can likely be grouped into fewer categories
- Weather has 63 unique values, which is odd and needs a closer look
- RosterPosition, Position, and PositionGroup are all similar and need to be investigated
- PlayType is categorical (pass, rush, kick, ...) and should be encoded

---
### Change the Field Types to Binary Values

In [7]:
# Creates a function to change the surface values
def surface_code(row):
    surface = row['FieldType']
    coded_surface = 0
    if surface == 'Natural':
        coded_surface = 0
    elif surface == 'Synthetic':
        coded_surface = 1

    return coded_surface

This function is later grouped into the InjuryCleaning.py file.

In [8]:
playlist['CodedSurface'] = playlist.apply(surface_code, axis=1)
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup,CodedSurface
0,26624-1-1,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,1,QB,QB,1
1,26624-1-2,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,2,QB,QB,1
2,26624-1-3,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,3,QB,QB,1
3,26624-1-4,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Rush,4,QB,QB,1
4,26624-1-5,Quarterback,1,1,Outdoor,Synthetic,63,Clear and warm,Pass,5,QB,QB,1


In [9]:
# The code above worked; now change FieldType to the coded values and remove the redundant column
playlist['FieldType'] = playlist['CodedSurface']
playlist.drop(columns='CodedSurface', inplace=True)
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup
0,26624-1-1,Quarterback,1,1,Outdoor,1,63,Clear and warm,Pass,1,QB,QB
1,26624-1-2,Quarterback,1,1,Outdoor,1,63,Clear and warm,Pass,2,QB,QB
2,26624-1-3,Quarterback,1,1,Outdoor,1,63,Clear and warm,Rush,3,QB,QB
3,26624-1-4,Quarterback,1,1,Outdoor,1,63,Clear and warm,Rush,4,QB,QB
4,26624-1-5,Quarterback,1,1,Outdoor,1,63,Clear and warm,Pass,5,QB,QB
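
As an alternative to the row-wise `apply`, the same binary coding can be sketched with a vectorized `Series.map`. The toy frame and the names `surface_demo` and `surface_codes` are made up for illustration.

```python
import pandas as pd

# Toy data standing in for the FieldType column.
surface_demo = pd.DataFrame({"FieldType": ["Synthetic", "Natural", "Synthetic"]})

# Natural -> 0, Synthetic -> 1, matching the coding used by surface_code.
surface_codes = {"Natural": 0, "Synthetic": 1}
surface_demo["FieldType"] = surface_demo["FieldType"].map(surface_codes)

print(surface_demo)
```

One difference in behavior: `surface_code` silently assigns 0 to any unrecognized value, whereas `map` leaves NaN for values missing from the dictionary, which makes unexpected entries easier to spot.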


---
### Reduce the Number of Stadium Types to Something Meaningful

It turns out that there are a lot of misspelled stadium types. There are 7 unique spellings of the word 'Outdoor' alone. Also, the people of Pittsburgh seemed pretty confused as to the meaning of Stadium Type, as there are MANY entries listing the stadium type as Heinz Field. 

In [10]:
stadiums = playlist.StadiumType.unique().tolist()
stadiums

['Outdoor',
 'Indoors',
 'Oudoor',
 'Outdoors',
 'Open',
 'Closed Dome',
 'Domed, closed',
 None,
 'Dome',
 'Indoor',
 'Domed',
 'Retr. Roof-Closed',
 'Outdoor Retr Roof-Open',
 'Retractable Roof',
 'Ourdoor',
 'Indoor, Roof Closed',
 'Retr. Roof - Closed',
 'Bowl',
 'Outddors',
 'Retr. Roof-Open',
 'Dome, closed',
 'Indoor, Open Roof',
 'Domed, Open',
 'Domed, open',
 'Heinz Field',
 'Cloudy',
 'Retr. Roof - Open',
 'Retr. Roof Closed',
 'Outdor',
 'Outside']

In [11]:
# How many Stadium Types are missing? 
playlist.StadiumType.isna().sum()

16910

In [12]:
# Since most stadiums are outdoor stadiums, change any NaN stadiums to Outdoor for now
playlist.StadiumType.fillna('Outdoor', inplace=True)
playlist.StadiumType.isna().sum()


0

Grouping all stadiums into Outdoor or Indoor using a dictionary

In [13]:
stadium_dict = {'Outdoor': 'Outdoor',
                'Indoors': 'Indoor',
                'Oudoor': 'Outdoor',
                'Outdoors': 'Outdoor',
                'Open': 'Outdoor',
                'Closed Dome': 'Indoor',
                'Domed, closed': 'Indoor',
                'Dome': 'Indoor',
                'Indoor': 'Indoor',
                'Domed': 'Indoor',
                'Retr. Roof-Closed': 'Indoor',
                'Outdoor Retr Roof-Open': 'Outdoor',
                'Retractable Roof': 'Indoor',
                'Ourdoor': 'Outdoor',
                'Indoor, Roof Closed': 'Indoor',
                'Retr. Roof - Closed': 'Indoor',
                'Bowl': 'Outdoor',
                'Outddors': 'Outdoor',
                'Retr. Roof-Open': 'Outdoor',
                'Dome, closed': 'Indoor',
                'Indoor, Open Roof': 'Outdoor',
                'Domed, Open': 'Outdoor',
                'Domed, open': 'Outdoor',
                'Heinz Field': 'Outdoor',
                'Cloudy': 'Outdoor',
                'Retr. Roof - Open': 'Outdoor',
                'Retr. Roof Closed': 'Indoor',
                'Outdor': 'Outdoor',
                'Outside': 'Outdoor'}


# Named stadium_dict rather than dict to avoid shadowing the Python builtin
playlist.StadiumType.replace(stadium_dict, inplace=True)
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup
0,26624-1-1,Quarterback,1,1,Outdoor,1,63,Clear and warm,Pass,1,QB,QB
1,26624-1-2,Quarterback,1,1,Outdoor,1,63,Clear and warm,Pass,2,QB,QB
2,26624-1-3,Quarterback,1,1,Outdoor,1,63,Clear and warm,Rush,3,QB,QB
3,26624-1-4,Quarterback,1,1,Outdoor,1,63,Clear and warm,Rush,4,QB,QB
4,26624-1-5,Quarterback,1,1,Outdoor,1,63,Clear and warm,Pass,5,QB,QB
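
Because `replace` leaves any value that is not a dictionary key untouched, it can be worth checking that nothing slipped through the grouping. The following is a small sketch of such a check on toy data; 'Stadium X' is an invented value used only to trigger the check.

```python
import pandas as pd

# Toy data: 'Stadium X' is a made-up entry that the mapping does not cover.
stadium_demo = pd.Series(["Outdoor", "Oudoor", "Dome", "Heinz Field", "Stadium X"])
stadium_map_demo = {
    "Outdoor": "Outdoor",
    "Oudoor": "Outdoor",
    "Dome": "Indoor",
    "Heinz Field": "Outdoor",
}

cleaned = stadium_demo.replace(stadium_map_demo)

# Anything outside the two target groups was missed by the dictionary.
unmapped = sorted(set(cleaned.unique()) - {"Outdoor", "Indoor"})
print(unmapped)
```

An empty list means every entry was grouped as intended.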


---
### Group Stadium Types as Outdoor or Not Outdoor in a New Column, OutdoorStadium, for the Supervised Learning

In [14]:
# This uses the numpy where to classify anything that meets the True condition as 1, denoting Outdoor Stadium, and False becomes 0, for all other non-outdoor stadiums
playlist['OutdoorStadium'] = np.where(playlist['StadiumType']=='Outdoor', 1, 0)
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup,OutdoorStadium
0,26624-1-1,Quarterback,1,1,Outdoor,1,63,Clear and warm,Pass,1,QB,QB,1
1,26624-1-2,Quarterback,1,1,Outdoor,1,63,Clear and warm,Pass,2,QB,QB,1
2,26624-1-3,Quarterback,1,1,Outdoor,1,63,Clear and warm,Rush,3,QB,QB,1
3,26624-1-4,Quarterback,1,1,Outdoor,1,63,Clear and warm,Rush,4,QB,QB,1
4,26624-1-5,Quarterback,1,1,Outdoor,1,63,Clear and warm,Pass,5,QB,QB,1


---
### Dealing with the Weather Situation

There were a lot of different entries meaning the same thing; these were grouped in a dictionary the same way the stadiums were, and can be adjusted if necessary 

In [15]:
weather_dict = {'Clear and warm': 'Clear',
                'Mostly Cloudy': 'Cloudy',
                'Sunny': 'Clear',
                'Clear': 'Clear',
                'Cloudy': 'Cloudy',
                'Cloudy, fog started developing in 2nd quarter': 'Hazy/Fog',
                'Rain': 'Rain',
                'Partly Cloudy': 'Cloudy',
                'Mostly cloudy': 'Cloudy',
                'Cloudy and cold': 'Cloudy',
                'Cloudy and Cool': 'Cloudy',
                'Rain Chance 40%': 'Rain',
                'Controlled Climate': 'Indoor',
                'Sunny and warm': 'Clear',
                'Partly cloudy': 'Cloudy',
                'Clear and Cool': 'Cloudy',
                'Clear and cold': 'Cloudy',
                'Sunny and cold': 'Clear',
                'Indoor': 'Indoor',
                'Partly Sunny': 'Clear',
                'N/A (Indoors)': 'Indoor',
                'Mostly Sunny': 'Clear',
                'Indoors': 'Indoor',
                'Clear Skies': 'Clear',
                'Partly sunny': 'Clear',
                'Showers': 'Rain',
                'N/A Indoor': 'Indoor',
                'Sunny and clear': 'Clear',
                'Snow': 'Snow',
                'Scattered Showers': 'Rain',
                'Party Cloudy': 'Cloudy',
                'Clear skies': 'Clear',
                'Rain likely, temps in low 40s.': 'Rain',
                'Hazy': 'Hazy/Fog',
                'Partly Clouidy': 'Cloudy',
                'Sunny Skies': 'Clear',
                'Overcast': 'Cloudy',
                'Cloudy, 50% change of rain': 'Cloudy',
                'Fair': 'Clear',
                'Light Rain': 'Rain',
                'Partly clear': 'Clear',
                'Mostly Coudy': 'Cloudy',
                '10% Chance of Rain': 'Cloudy',
                'Cloudy, chance of rain': 'Cloudy',
                'Heat Index 95': 'Clear',
                'Sunny, highs to upper 80s': 'Clear',
                'Sun & clouds': 'Cloudy',
                'Heavy lake effect snow': 'Snow',
                'Mostly sunny': 'Clear',
                'Cloudy, Rain': 'Rain',
                'Sunny, Windy': 'Windy',
                'Mostly Sunny Skies': 'Clear',
                'Rainy': 'Rain',
                '30% Chance of Rain': 'Rain',
                'Cloudy, light snow accumulating 1-3': 'Snow',
                'cloudy': 'Cloudy',
                'Clear and Sunny': 'Clear',
                'Coudy': 'Cloudy',
                'Clear and sunny': 'Clear',
                'Clear to Partly Cloudy': 'Clear',
                'Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.': 'Windy',
                'Rain shower': 'Rain',
                'Cold': 'Clear'}

playlist.Weather.replace(weather_dict, inplace=True)
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup,OutdoorStadium
0,26624-1-1,Quarterback,1,1,Outdoor,1,63,Clear,Pass,1,QB,QB,1
1,26624-1-2,Quarterback,1,1,Outdoor,1,63,Clear,Pass,2,QB,QB,1
2,26624-1-3,Quarterback,1,1,Outdoor,1,63,Clear,Rush,3,QB,QB,1
3,26624-1-4,Quarterback,1,1,Outdoor,1,63,Clear,Rush,4,QB,QB,1
4,26624-1-5,Quarterback,1,1,Outdoor,1,63,Clear,Pass,5,QB,QB,1


Assess whether the NaN rows are from indoor stadiums; if so, change them to Indoor, otherwise remove them

In [16]:
playlist.Weather.value_counts()

Cloudy      112306
Clear        96985
Indoor       20276
Rain         14280
Snow          1945
Hazy/Fog      1809
Windy          713
Name: Weather, dtype: int64

In [17]:
playlist.Weather.isna().sum()

18691

Since it is impossible to predict the outdoor weather, we will have to drop the NaN values associated with outdoor, but the indoor NaN values can be filled with Indoor weather conditions

In [18]:
# This line of code identifies from the plays table, where the stadium type is 'Indoor' and then fills NaN values in the 'Weather' column with 'Indoor'.
playlist.loc[playlist.StadiumType == 'Indoor',
             'Weather'] = playlist.loc[playlist.StadiumType == 'Indoor', 'Weather'].fillna('Indoor')

In [19]:
# This added about 13,600 values to the Indoor category (20,276 -> 33,861)
playlist.Weather.value_counts()

Cloudy      112306
Clear        96985
Indoor       33861
Rain         14280
Snow          1945
Hazy/Fog      1809
Windy          713
Name: Weather, dtype: int64

In [20]:
# The remaining ~ 5,000 were outdoor with no weather - going to remove these since it's impossible to predict the weather conditions
playlist.Weather.isna().sum()

5106

Now, instead of dropping over 18,000 rows, we only have to drop about 5,100 rows due to NaN.
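
The fill-then-drop logic can be illustrated on a toy frame. This is a sketch with made-up data rather than the notebook's exact code: missing weather is filled only where the stadium is indoor, and whatever is still missing afterwards is removed.

```python
import numpy as np
import pandas as pd

# Toy data: two indoor rows (one missing weather) and two outdoor rows (one missing).
weather_demo = pd.DataFrame(
    {
        "StadiumType": ["Indoor", "Indoor", "Outdoor", "Outdoor"],
        "Weather": [np.nan, "Indoor", np.nan, "Rain"],
    }
)

# Fill only the rows that are both indoor and missing a weather value.
indoor_missing = (weather_demo["StadiumType"] == "Indoor") & weather_demo["Weather"].isna()
weather_demo.loc[indoor_missing, "Weather"] = "Indoor"

# The remaining missing values belong to outdoor games and are dropped.
weather_demo = weather_demo[weather_demo["Weather"].notna()]
print(weather_demo)
```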

In [21]:
# It's possible to look up the weather on those days if absolutely necessary; the dropped rows are about 1.9% of the data
playlist = playlist.loc[playlist.Weather.notna()]
playlist.Weather.isna().sum()

0

In [22]:
# Weather has been reduced from 63 different values to 7
playlist.Weather.nunique()

7

Now that the Weather has been reduced to fewer than 10, it is ready to be encoded.

---
### Encoding the Weather in a new column called Precipitation
Most of the documented material from online sources suggests that the only weather with a large effect on play is precipitation, in the form of rain or snow.

Weather is therefore encoded as a binary flag: Clear = 0, Indoor = 0, Cloudy = 0, Windy = 0, Hazy/Fog = 0, Rain = 1, Snow = 1

In [23]:
precipitation = {
    'Indoor': 0, 
    'Clear': 0, 
    'Cloudy': 0,
    'Windy': 0,
    'Hazy/Fog': 0, 
    'Rain': 1, 
    'Snow': 1 
}

playlist['Precipitation'] = playlist.Weather.map(precipitation)
playlist.head()


Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup,OutdoorStadium,Precipitation
0,26624-1-1,Quarterback,1,1,Outdoor,1,63,Clear,Pass,1,QB,QB,1,0
1,26624-1-2,Quarterback,1,1,Outdoor,1,63,Clear,Pass,2,QB,QB,1,0
2,26624-1-3,Quarterback,1,1,Outdoor,1,63,Clear,Rush,3,QB,QB,1,0
3,26624-1-4,Quarterback,1,1,Outdoor,1,63,Clear,Rush,4,QB,QB,1,0
4,26624-1-5,Quarterback,1,1,Outdoor,1,63,Clear,Pass,5,QB,QB,1,0
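
Since the flag only distinguishes rain and snow from everything else, an equivalent and shorter formulation is an `isin` test. This sketch uses toy values, and `precipitation_flag` is a name chosen for the example.

```python
import pandas as pd

# Toy weather values covering both outcomes.
weather_values = pd.Series(["Clear", "Rain", "Indoor", "Snow", "Windy"])

# True where the weather is Rain or Snow, converted to 1/0.
precipitation_flag = weather_values.isin(["Rain", "Snow"]).astype(int)
print(precipitation_flag.tolist())
```

The dictionary with `map` has one advantage over this version: a weather value missing from the dictionary becomes NaN instead of silently becoming 0.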


---
### Looking at the Temperature Values - the PCA showed that some temperatures were... aberrant

In [24]:
playlist.Temperature.value_counts()

-999    24170
 68     13588
 61      6744
 72      6513
 48      6068
        ...  
 34       418
 32       383
 10       292
 26       243
 9        210
Name: Temperature, Length: 79, dtype: int64

In [25]:
playlist['Temperature'] = np.where(
    (playlist['Temperature'] == -999) & (playlist['StadiumType'] == 'Indoor'), 70, playlist.Temperature)
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup,OutdoorStadium,Precipitation
0,26624-1-1,Quarterback,1,1,Outdoor,1,63,Clear,Pass,1,QB,QB,1,0
1,26624-1-2,Quarterback,1,1,Outdoor,1,63,Clear,Pass,2,QB,QB,1,0
2,26624-1-3,Quarterback,1,1,Outdoor,1,63,Clear,Rush,3,QB,QB,1,0
3,26624-1-4,Quarterback,1,1,Outdoor,1,63,Clear,Rush,4,QB,QB,1,0
4,26624-1-5,Quarterback,1,1,Outdoor,1,63,Clear,Pass,5,QB,QB,1,0


In [26]:
playlist.Temperature.value_counts()

70    28170
68    13588
61     6744
72     6513
48     6068
      ...  
34      418
32      383
10      292
26      243
9       210
Name: Temperature, Length: 79, dtype: int64

Note that 24,170 temperatures were recorded as -999 degrees, which did impact the analysis.
Almost all of the -999 degree measurements were from indoor stadiums, most of which are set to 70 degrees, so -999 was set to 70 for all indoor stadiums.

Only 807 of the -999 rows remain, from outdoor stadiums, and these will be removed from the dataset.
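
One way to check where the -999 placeholder values come from is to count them per stadium type. The sketch below does this on invented toy data; the counts it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy data: three indoor rows and one outdoor row carry the -999 placeholder.
temp_demo = pd.DataFrame(
    {
        "StadiumType": ["Indoor", "Indoor", "Indoor", "Outdoor", "Outdoor"],
        "Temperature": [-999, -999, -999, -999, 63],
    }
)

# Count the placeholder rows for each stadium type.
missing_by_stadium = temp_demo.loc[
    temp_demo["Temperature"] == -999, "StadiumType"
].value_counts()
print(missing_by_stadium)
```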

In [27]:
playlist = playlist[playlist['Temperature'] != -999]
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup,OutdoorStadium,Precipitation
0,26624-1-1,Quarterback,1,1,Outdoor,1,63,Clear,Pass,1,QB,QB,1,0
1,26624-1-2,Quarterback,1,1,Outdoor,1,63,Clear,Pass,2,QB,QB,1,0
2,26624-1-3,Quarterback,1,1,Outdoor,1,63,Clear,Rush,3,QB,QB,1,0
3,26624-1-4,Quarterback,1,1,Outdoor,1,63,Clear,Rush,4,QB,QB,1,0
4,26624-1-5,Quarterback,1,1,Outdoor,1,63,Clear,Pass,5,QB,QB,1,0


---
### Addressing the Positions Issue

RosterPosition is similar to PositionGroup, only not abbreviated, so the RosterPosition values will need to be recoded first. PositionGroup can be dropped, since it is nearly identical to the roster and actual positions.

In [28]:
playlist.RosterPosition.unique()

array(['Quarterback', 'Wide Receiver', 'Linebacker', 'Running Back',
       'Defensive Lineman', 'Tight End', 'Safety', 'Cornerback',
       'Offensive Lineman', 'Kicker'], dtype=object)

In [29]:
playlist.Position.unique()

array(['QB', 'Missing Data', 'WR', 'ILB', 'RB', 'DE', 'TE', 'FS', 'CB',
       'G', 'T', 'OLB', 'DT', 'SS', 'MLB', 'C', 'NT', 'DB', 'K', 'LB',
       'S', 'HB', 'P'], dtype=object)

Going to change the positions to numerical dummy values for the machine learning analysis instead of using OneHotEncoder

In [30]:
position_dict = {
    'Quarterback': 0,
    'Wide Receiver': 1,
    'Linebacker': 2,
    'Running Back': 3,
    'Defensive Lineman': 4,
    'Tight End': 5,
    'Safety': 6,
    'Cornerback': 7,
    'Offensive Lineman': 8,
    'Kicker': 9
}

playlist.RosterPosition.replace(position_dict, inplace=True)
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,PositionGroup,OutdoorStadium,Precipitation
0,26624-1-1,0,1,1,Outdoor,1,63,Clear,Pass,1,QB,QB,1,0
1,26624-1-2,0,1,1,Outdoor,1,63,Clear,Pass,2,QB,QB,1,0
2,26624-1-3,0,1,1,Outdoor,1,63,Clear,Rush,3,QB,QB,1,0
3,26624-1-4,0,1,1,Outdoor,1,63,Clear,Rush,4,QB,QB,1,0
4,26624-1-5,0,1,1,Outdoor,1,63,Clear,Pass,5,QB,QB,1,0


In [31]:
playlist.Position.unique()

array(['QB', 'Missing Data', 'WR', 'ILB', 'RB', 'DE', 'TE', 'FS', 'CB',
       'G', 'T', 'OLB', 'DT', 'SS', 'MLB', 'C', 'NT', 'DB', 'K', 'LB',
       'S', 'HB', 'P'], dtype=object)

In [32]:
# Drop the Position Group column
playlist = playlist.drop(columns='PositionGroup')
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,OutdoorStadium,Precipitation
0,26624-1-1,0,1,1,Outdoor,1,63,Clear,Pass,1,QB,1,0
1,26624-1-2,0,1,1,Outdoor,1,63,Clear,Pass,2,QB,1,0
2,26624-1-3,0,1,1,Outdoor,1,63,Clear,Rush,3,QB,1,0
3,26624-1-4,0,1,1,Outdoor,1,63,Clear,Rush,4,QB,1,0
4,26624-1-5,0,1,1,Outdoor,1,63,Clear,Pass,5,QB,1,0


In [33]:
playlist.Position[playlist.Position == "Missing Data"].value_counts()

Missing Data    45
Name: Position, dtype: int64

In [34]:
# This code identifies "Missing Data" from the Position and replaces the missing value with the RosterPosition
playlist['Position'] = np.where(playlist['Position'] == 'Missing Data', playlist['RosterPosition'], playlist['Position'])

# Verify that the missing Data values have been replaced
playlist.Position[playlist.Position == "Missing Data"].value_counts()

Series([], Name: Position, dtype: int64)

In [35]:
playlist.Position.value_counts()
# This is binned into more than 10 groups and may not produce reliable results

WR     42281
OLB    32946
CB     28929
FS     21077
G      17027
T      16661
SS     15057
DT     13828
DE     13513
C      12614
RB     11181
ILB     8191
TE      7623
QB      6895
MLB     5420
LB      2589
NT      2587
DB      1280
K        521
S        444
HB       213
P        170
0         17
3          5
1          4
8          4
2          4
7          3
6          3
5          3
4          2
Name: Position, dtype: int64

In [36]:
playlist.RosterPosition.value_counts()

2    49065
8    46306
1    42215
6    38520
4    30015
7    28273
3    11469
5     7626
0     6912
9      691
Name: RosterPosition, dtype: int64

The values above show how many recorded plays were logged for each position across all of the data. The positions are categorical and will be encoded using OneHotEncoder, changing them to binary columns. RosterPosition is the general class, so keeping both Position and RosterPosition is redundant.

Position was tested initially, and only WR and OLB had a high impact, which was related to how frequently those positions appear.

In [37]:
# Something weird happened when trying to do a Naive Bayes... it found negative values... 
min(playlist.PlayerDay)

-62

In [38]:
playlist = playlist.assign(DaysPlayed = lambda x: x['PlayerDay'] + 63)
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerDay,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,OutdoorStadium,Precipitation,DaysPlayed
0,26624-1-1,0,1,1,Outdoor,1,63,Clear,Pass,1,QB,1,0,64
1,26624-1-2,0,1,1,Outdoor,1,63,Clear,Pass,2,QB,1,0,64
2,26624-1-3,0,1,1,Outdoor,1,63,Clear,Rush,3,QB,1,0,64
3,26624-1-4,0,1,1,Outdoor,1,63,Clear,Rush,4,QB,1,0,64
4,26624-1-5,0,1,1,Outdoor,1,63,Clear,Pass,5,QB,1,0,64


In [39]:
min(playlist.DaysPlayed)

1

In [40]:
playlist.drop(columns='PlayerDay', inplace=True)
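
The offset of 63 follows from the minimum PlayerDay of -62. As a sketch, the offset can also be derived from the data so that the smallest value always becomes 1; the toy values and the name `days_demo` are made up for illustration.

```python
import pandas as pd

# Toy data including the observed minimum of -62.
days_demo = pd.DataFrame({"PlayerDay": [-62, 1, 29]})

# Shift so the smallest day becomes 1: offset = 1 - (-62) = 63.
offset = 1 - days_demo["PlayerDay"].min()
days_demo["DaysPlayed"] = days_demo["PlayerDay"] + offset

print(days_demo)
```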

---
## Cleaning The Injuries Dataset

In [41]:
# Read in the specific table - this can be done on the same connection:
injuries_sql = db.Table('injuries', metadata,
                        autoload=True, autoload_with=engine)
query = db.select(injuries_sql)
Results = conn.execute(query).fetchall()

# Create the new dataframe and set the keys
injuries = pd.DataFrame(Results)
injuries.columns = Results[0].keys()
conn.close()
del Results, metadata, conn, engine, query, table, db_string
injuries.head()

Unnamed: 0,playerkey,gameid,playkey,bodypart,fieldtype,dm_m1,dm_m7,dm_m28,dm_m42
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1


In [42]:
injuries = column_capitalizer(injuries, 'injuries')
injuries.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,Surface,DM_M1,DM_M7,DM_M28,DM_M42
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1


Evaluate all columns for na values

In [43]:
# The PlayKey column is the only one that has NaN values
injuries['PlayKey'].isna().sum()

28

In [44]:
# Drop the NaN values, since we won't be able to correlate these with the other tables
injuries = injuries.dropna(subset = ['PlayKey'])
injuries.nunique()

PlayerKey    74
GameID       76
PlayKey      76
BodyPart      3
Surface       2
DM_M1         1
DM_M7         2
DM_M28        2
DM_M42        2
dtype: int64

Note: there is only 1 unique value for DM_M1, which means that every player on this list was injured for at least 1 day.

The Surface column is the same as the FieldType column from the other table, so it can be dropped. 
Note: Anyone whose injury is in the DM_M42 list is also in all of the prior lists, so the lower values will be overrepresented due to the encoding. Going to change this to a single column with values of 1, 7, 28, and 42.

---
### Group the DM columns into a single Injury Duration column

In [45]:
def injury_duration(row):
    injury_duration = 0
    if row["DM_M42"] == 1:
        injury_duration = 42
    else:
        if row["DM_M28"] == 1:
            injury_duration = 28
        else:
            if row["DM_M7"] == 1:
                injury_duration = 7
            else: 
                injury_duration = 1
    
    return injury_duration

# Apply the function to all rows
injuries['InjuryDuration'] = injuries.apply(injury_duration, axis=1)
injuries.head()


Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,Surface,DM_M1,DM_M7,DM_M28,DM_M42,InjuryDuration
0,39873,39873-4,39873-4-32,Knee,Synthetic,1,1,1,1,42
1,46074,46074-7,46074-7-26,Knee,Natural,1,1,0,0,7
2,36557,36557-1,36557-1-70,Ankle,Synthetic,1,1,1,1,42
3,46646,46646-3,46646-3-30,Ankle,Natural,1,0,0,0,1
4,43532,43532-5,43532-5-69,Ankle,Synthetic,1,1,1,1,42
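
The nested if/else in `injury_duration` can also be expressed without a row-wise `apply`. The sketch below uses `numpy.select`, which takes the first matching condition in order, on a toy frame with one row per duration class.

```python
import numpy as np
import pandas as pd

# Toy data: one row for each of the four duration classes.
dm_demo = pd.DataFrame(
    {
        "DM_M1": [1, 1, 1, 1],
        "DM_M7": [1, 1, 0, 1],
        "DM_M28": [1, 0, 0, 1],
        "DM_M42": [1, 0, 0, 0],
    }
)

# Conditions are checked from the longest duration down; the first match wins.
conditions = [dm_demo["DM_M42"] == 1, dm_demo["DM_M28"] == 1, dm_demo["DM_M7"] == 1]
dm_demo["InjuryDuration"] = np.select(conditions, [42, 28, 7], default=1)

print(dm_demo["InjuryDuration"].tolist())
```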


In [46]:
# Remove the DM columns and the redundant Surface column
injuries.drop(columns=['DM_M1', 'DM_M7', 'DM_M28', 'DM_M42', 'Surface'], inplace=True)
injuries.head()

Unnamed: 0,PlayerKey,GameID,PlayKey,BodyPart,InjuryDuration
0,39873,39873-4,39873-4-32,Knee,42
1,46074,46074-7,46074-7-26,Knee,7
2,36557,36557-1,36557-1-70,Ankle,42
3,46646,46646-3,46646-3-30,Ankle,1
4,43532,43532-5,43532-5-69,Ankle,42


Analyze the BodyPart of injury to verify it's ready for encoding

In [47]:
# The body parts are categorical and will be encoded, but since each injury was logged as unique, going to use the occurrence frequency as the numerical coding instead of arbitrary numbers
knee_freq = injuries.BodyPart.value_counts()['Knee']
ankle_freq = injuries.BodyPart.value_counts()['Ankle']
foot_freq = injuries.BodyPart.value_counts()['Foot']
injuries.BodyPart.value_counts()

Knee     36
Ankle    35
Foot      6
Name: BodyPart, dtype: int64

In [48]:
# There are 74 known individual players that have been injured for at least 1 day 
injuries.PlayerKey.nunique()

74

In [49]:
# This outputs 76 unique plays for only 74 players, so only 2 players were reinjured at different times of the season
injuries.PlayKey.nunique()

76

Every GameID and PlayKey is unique, meaning that once a
particular player was injured during a specific game at a specific play,
they didn't return to the field. Since the GameID numbers are not in any 
chronological order and offer no information beyond what the PlayKey provides, this column can be dropped.

In [50]:
injuries.GameID.nunique()

76

Since the PlayerKey, GameID, and play number are all contained within the PlayKey, the GameID and PlayerKey columns can be dropped.

In [51]:
injuries.drop(columns=['GameID', 'PlayerKey'], inplace=True)
injuries.head()

Unnamed: 0,PlayKey,BodyPart,InjuryDuration
0,39873-4-32,Knee,42
1,46074-7-26,Knee,7
2,36557-1-70,Ankle,42
3,46646-3-30,Ankle,1
4,43532-5-69,Ankle,42


For the supervised analysis, the injuries will need to be recorded as numerical values. We will create 2 columns:
- 'IsInjured' where 0 is not injured and 1 is injured
- 'InjuryType' where the Injury Type will be encoded by the frequency of occurrence, Knee = 36, Ankle = 35, and Foot = 6

Depending on the type of analysis: if we're trying to predict a binary outcome (whether or not there will be an injury), we will use 'IsInjured'. If we're trying to predict the type of injury, we need the numerical code for each injury type.

These changes cannot be made until this table is merged with the other table, which contains the plays without injuries.
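
The frequency-based coding described above can be sketched by building the code dictionary directly from `value_counts`. The body-part values below are toy data, so the resulting codes differ from the 36/35/6 counts in the real table.

```python
import pandas as pd

# Toy injury list: Knee appears 3 times, Ankle 2 times, Foot once.
body_parts = pd.Series(["Knee", "Ankle", "Knee", "Foot", "Knee", "Ankle"])

# Each body part is coded by how often it occurs; NoInjury is added by hand as 0.
frequency_codes = body_parts.value_counts().to_dict()
frequency_codes["NoInjury"] = 0

# Apply the codes to a merged-style column that also contains non-injury plays.
merged_body_parts = pd.Series(["NoInjury", "Knee", "Foot"])
injury_type_codes = merged_body_parts.map(frequency_codes)

print(frequency_codes)
```

A limitation of frequency coding in general is that two categories with the same count would receive the same code and become indistinguishable.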

---
## Merge the 2 dataframes

In [52]:
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerGame,StadiumType,FieldType,Temperature,Weather,PlayType,PlayerGamePlay,Position,OutdoorStadium,Precipitation,DaysPlayed
0,26624-1-1,0,1,Outdoor,1,63,Clear,Pass,1,QB,1,0,64
1,26624-1-2,0,1,Outdoor,1,63,Clear,Pass,2,QB,1,0,64
2,26624-1-3,0,1,Outdoor,1,63,Clear,Rush,3,QB,1,0,64
3,26624-1-4,0,1,Outdoor,1,63,Clear,Rush,4,QB,1,0,64
4,26624-1-5,0,1,Outdoor,1,63,Clear,Pass,5,QB,1,0,64


- Drop the categorical columns that have been encoded for the supervised analysis
- Play Type and RosterPosition will be encoded with OneHotEncoder

In [53]:
playlist.drop(columns=['StadiumType', 'Weather', 'Position'], inplace=True)
playlist.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerGame,FieldType,Temperature,PlayType,PlayerGamePlay,OutdoorStadium,Precipitation,DaysPlayed
0,26624-1-1,0,1,1,63,Pass,1,1,0,64
1,26624-1-2,0,1,1,63,Pass,2,1,0,64
2,26624-1-3,0,1,1,63,Rush,3,1,0,64
3,26624-1-4,0,1,1,63,Rush,4,1,0,64
4,26624-1-5,0,1,1,63,Pass,5,1,0,64
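
To illustrate what one-hot encoding does to a column such as PlayType, here is a small sketch. It uses `pandas.get_dummies` in place of scikit-learn's OneHotEncoder to keep the example short; the toy frame and the name `encoded_demo` are made up.

```python
import pandas as pd

# Toy data with two play types.
play_type_demo = pd.DataFrame(
    {"PlayKey": ["26624-1-1", "26624-1-3", "26624-1-5"], "PlayType": ["Pass", "Rush", "Pass"]}
)

# Each category becomes its own 0/1 column; the original PlayType column is removed.
encoded_demo = pd.get_dummies(play_type_demo, columns=["PlayType"], dtype=int)
print(encoded_demo)
```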


If we want, we can switch out the RosterPosition for the played position to see if there is a difference; the actual position is more specific to the play, which may make it a better indicator.

In [54]:
injuries.head()

Unnamed: 0,PlayKey,BodyPart,InjuryDuration
0,39873-4-32,Knee,42
1,46074-7-26,Knee,7
2,36557-1-70,Ankle,42
3,46646-3-30,Ankle,1
4,43532-5-69,Ankle,42


In [55]:
play_injuries = pd.merge(playlist, injuries, on='PlayKey', how='outer')
play_injuries_inner = pd.merge(playlist, injuries, on='PlayKey', how='inner')
play_injuries.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerGame,FieldType,Temperature,PlayType,PlayerGamePlay,OutdoorStadium,Precipitation,DaysPlayed,BodyPart,InjuryDuration
0,26624-1-1,0,1,1,63,Pass,1,1,0,64,,
1,26624-1-2,0,1,1,63,Pass,2,1,0,64,,
2,26624-1-3,0,1,1,63,Rush,3,1,0,64,,
3,26624-1-4,0,1,1,63,Rush,4,1,0,64,,
4,26624-1-5,0,1,1,63,Pass,5,1,0,64,,
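
To see how the join type affects the result, here is a minimal sketch on made-up data (the frames `toy_plays` and `toy_inj` are hypothetical). It uses the `indicator=True` option of `pd.merge` to label where each row came from.

```python
import pandas as pd

# Hypothetical plays and injuries; '9-9-9' has no matching play
toy_plays = pd.DataFrame({
    'PlayKey': ['1-1-1', '1-1-2', '1-1-3'],
    'PlayType': ['Pass', 'Rush', 'Pass'],
})
toy_inj = pd.DataFrame({
    'PlayKey': ['1-1-2', '9-9-9'],
    'BodyPart': ['Knee', 'Ankle'],
})

# Outer join keeps every row from both sides; _merge records the origin
outer = pd.merge(toy_plays, toy_inj, on='PlayKey', how='outer', indicator=True)

# Inner join keeps only plays that have an injury record
inner = pd.merge(toy_plays, toy_inj, on='PlayKey', how='inner')

# Left join keeps every play and drops unmatched injury rows
left = pd.merge(toy_plays, toy_inj, on='PlayKey', how='left')

counts = outer['_merge'].value_counts()
print(counts)
```

Any `right_only` rows in an outer join are injury records without a matching play; they carry NaN in every play column, which is worth checking before the missing values are filled or dropped.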


---
### Fill in values for BodyPart and InjuryDuration: change NaN to 'NoInjury' for BodyPart and to 0 for InjuryDuration

In [56]:
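# Assign the result back; inplace=True on a selected column may not update the frame in newer pandas
play_injuries['BodyPart'] = play_injuries['BodyPart'].fillna('NoInjury')
play_injuries['InjuryDuration'] = play_injuries['InjuryDuration'].fillna(0)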
play_injuries.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerGame,FieldType,Temperature,PlayType,PlayerGamePlay,OutdoorStadium,Precipitation,DaysPlayed,BodyPart,InjuryDuration
0,26624-1-1,0,1,1,63,Pass,1,1,0,64,NoInjury,0.0
1,26624-1-2,0,1,1,63,Pass,2,1,0,64,NoInjury,0.0
2,26624-1-3,0,1,1,63,Rush,3,1,0,64,NoInjury,0.0
3,26624-1-4,0,1,1,63,Rush,4,1,0,64,NoInjury,0.0
4,26624-1-5,0,1,1,63,Pass,5,1,0,64,NoInjury,0.0


Add a binary column for injury/no_injury

In [57]:
play_injuries['IsInjured'] = play_injuries['BodyPart'].apply(lambda x: 0 if x == 'NoInjury'  else 1)
play_injuries.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerGame,FieldType,Temperature,PlayType,PlayerGamePlay,OutdoorStadium,Precipitation,DaysPlayed,BodyPart,InjuryDuration,IsInjured
0,26624-1-1,0,1,1,63,Pass,1,1,0,64,NoInjury,0.0,0
1,26624-1-2,0,1,1,63,Pass,2,1,0,64,NoInjury,0.0,0
2,26624-1-3,0,1,1,63,Rush,3,1,0,64,NoInjury,0.0,0
3,26624-1-4,0,1,1,63,Rush,4,1,0,64,NoInjury,0.0,0
4,26624-1-5,0,1,1,63,Pass,5,1,0,64,NoInjury,0.0,0
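
The same binary flag can also be built without `apply`, using a vectorized comparison. This is an alternative sketch on a made-up frame (`toy` is hypothetical), not the notebook's method.

```python
import pandas as pd

# Hypothetical BodyPart column after the NaN values were filled
toy = pd.DataFrame({'BodyPart': ['NoInjury', 'Knee', 'NoInjury', 'Foot']})

# Compare the whole column at once, then cast True/False to 1/0
toy['IsInjured'] = (toy['BodyPart'] != 'NoInjury').astype(int)
```

On a frame with a few hundred thousand rows, the vectorized form usually runs faster than a row-by-row lambda while producing the same values.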


Add the numerical frequency-based column for the InjuryTypes

In [58]:
# Map each body part to its frequency-based code
injury_type = {
    'Knee': knee_freq, 
    'Ankle': ankle_freq,
    'Foot': foot_freq, 
    'NoInjury': 0
}

play_injuries['InjuryType'] = play_injuries.BodyPart.map(injury_type)
play_injuries.head()

Unnamed: 0,PlayKey,RosterPosition,PlayerGame,FieldType,Temperature,PlayType,PlayerGamePlay,OutdoorStadium,Precipitation,DaysPlayed,BodyPart,InjuryDuration,IsInjured,InjuryType
0,26624-1-1,0,1,1,63,Pass,1,1,0,64,NoInjury,0.0,0,0
1,26624-1-2,0,1,1,63,Pass,2,1,0,64,NoInjury,0.0,0,0
2,26624-1-3,0,1,1,63,Rush,3,1,0,64,NoInjury,0.0,0,0
3,26624-1-4,0,1,1,63,Rush,4,1,0,64,NoInjury,0.0,0,0
4,26624-1-5,0,1,1,63,Pass,5,1,0,64,NoInjury,0.0,0,0


In [59]:
play_injuries.InjuryType.value_counts()

0     261016
36        36
35        35
6          6
Name: InjuryType, dtype: int64

In [60]:
# Drop the BodyPart column and PlayKey
play_injuries.drop(columns=['PlayKey','BodyPart'], inplace=True)
play_injuries.dtypes

RosterPosition      int64
PlayerGame          int64
FieldType           int64
Temperature         int64
PlayType           object
PlayerGamePlay      int64
OutdoorStadium      int32
Precipitation       int64
DaysPlayed          int64
InjuryDuration    float64
IsInjured           int64
InjuryType          int64
dtype: object

In [61]:
play_injuries.isna().sum()

RosterPosition      0
PlayerGame          0
FieldType           0
Temperature         0
PlayType          363
PlayerGamePlay      0
OutdoorStadium      0
Precipitation       0
DaysPlayed          0
InjuryDuration      0
IsInjured           0
InjuryType          0
dtype: int64

There are no NaN values in any column except PlayType, which has 363. Dropping those rows removes only a small fraction of the data, so we drop the remaining NaN values.

In [62]:
play_injuries = play_injuries.dropna()
play_injuries.head()

Unnamed: 0,RosterPosition,PlayerGame,FieldType,Temperature,PlayType,PlayerGamePlay,OutdoorStadium,Precipitation,DaysPlayed,InjuryDuration,IsInjured,InjuryType
0,0,1,1,63,Pass,1,1,0,64,0.0,0,0
1,0,1,1,63,Pass,2,1,0,64,0.0,0,0
2,0,1,1,63,Rush,3,1,0,64,0.0,0,0
3,0,1,1,63,Rush,4,1,0,64,0.0,0,0
4,0,1,1,63,Pass,5,1,0,64,0.0,0,0


After the first run, the Pass and Rush plays had a large impact, while each of the other play types had a minimal one. The other play types are effectively all kicking plays, so we group them into a single 'Kick' category to reduce the number of features. The unlabeled '0' plays are placed in the same group.

In [63]:
play_injuries.PlayType.value_counts()

Pass                    135015
Rush                     90683
Extra Point               5769
Kickoff                   5581
Punt                      5533
Field Goal                4845
Kickoff Not Returned      4518
Punt Not Returned         3419
Kickoff Returned          2679
Punt Returned             2421
0                          267
Name: PlayType, dtype: int64

In [64]:
play_type = {
    'Pass': 'Pass',
    'Rush': 'Rush',
    'Extra Point': 'Kick',
    'Kickoff': 'Kick',
    'Punt': 'Kick',
    'Field Goal': 'Kick',
    'Kickoff Not Returned': 'Kick',
    'Punt Not Returned': 'Kick',
    'Kickoff Returned': 'Kick',
    'Punt Returned': 'Kick',
    '0': 'Kick'
}

play_injuries['PlayType'] = play_injuries.PlayType.map(play_type)
play_injuries.head()

Unnamed: 0,RosterPosition,PlayerGame,FieldType,Temperature,PlayType,PlayerGamePlay,OutdoorStadium,Precipitation,DaysPlayed,InjuryDuration,IsInjured,InjuryType
0,0,1,1,63,Pass,1,1,0,64,0.0,0,0
1,0,1,1,63,Pass,2,1,0,64,0.0,0,0
2,0,1,1,63,Rush,3,1,0,64,0.0,0,0
3,0,1,1,63,Rush,4,1,0,64,0.0,0,0
4,0,1,1,63,Pass,5,1,0,64,0.0,0,0


In [65]:
play_injuries.PlayType.value_counts()

Pass    135015
Rush     90683
Kick     35032
Name: PlayType, dtype: int64
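
An equivalent grouping can be written without listing every kicking play, by keeping Pass and Rush and relabeling everything else. This is an alternative sketch on a made-up series (`toy` and `keep` are hypothetical names), not the notebook's method.

```python
import pandas as pd

# Hypothetical sample of play types, including the unlabeled '0'
toy = pd.Series(['Pass', 'Rush', 'Punt', 'Field Goal', '0'], name='PlayType')

# Keep Pass and Rush as they are; replace every other value with 'Kick'
keep = {'Pass', 'Rush'}
grouped = toy.where(toy.isin(keep), 'Kick')
```

The two approaches differ on unexpected values: `map` with a dictionary returns NaN for any label missing from the dictionary, while `where` silently relabels it (including NaN) as 'Kick'. The dictionary version is therefore the stricter choice when new labels should be noticed.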

In [66]:
play_injuries.isna().sum()

RosterPosition    0
PlayerGame        0
FieldType         0
Temperature       0
PlayType          0
PlayerGamePlay    0
OutdoorStadium    0
Precipitation     0
DaysPlayed        0
InjuryDuration    0
IsInjured         0
InjuryType        0
dtype: int64

---
## Encode the Play Type using OneHotEncoder

At this point, we need to encode the categorical data to be able to run a machine learning analysis. We opted for OneHotEncoder because a category may be missing from the testing data, which would create a dimensional mismatch. OneHotEncoder remembers the categories seen during fitting and always produces the same columns, while get_dummies only encodes the categories present in the data it is given. On the full dataset, the two give the same result.
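
The dimensional mismatch can be illustrated with a small sketch on made-up data (the `train` and `test` frames are hypothetical). It assumes the encoder is fitted on the training split only, and it passes `handle_unknown='ignore'` so that categories never seen during fitting are encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split where the test set contains no 'Kick' plays
train = pd.DataFrame({'PlayType': ['Pass', 'Rush', 'Kick', 'Pass']})
test = pd.DataFrame({'PlayType': ['Pass', 'Rush']})

# get_dummies only sees the categories present in each frame
dummies_train = pd.get_dummies(train['PlayType'])
dummies_test = pd.get_dummies(test['PlayType'])

# The encoder stores the categories found during fit and reuses them
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train[['PlayType']])

# The default output is a sparse matrix; convert it to a dense array
encoded_test = enc.transform(test[['PlayType']]).toarray()

print(dummies_train.shape, dummies_test.shape, encoded_test.shape)
```

Here the two get_dummies frames end up with different numbers of columns, while the encoder output for the test set keeps one column per category learned from the training set.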

In [67]:
# Gather the categorical variables
play_cat = play_injuries.dtypes[play_injuries.dtypes == 'object'].index.tolist()

# Create the Encoder Instance with dense output
# (in scikit-learn 1.2 and later this argument is named sparse_output)
enc = OneHotEncoder(sparse=False)

# Fit and transform the categorical data
encode_df = pd.DataFrame(enc.fit_transform(play_injuries[play_cat]))

# Add the encoded variable names to the dataframe
encode_df.columns = enc.get_feature_names_out(play_cat)
encode_df.head()

Unnamed: 0,PlayType_Kick,PlayType_Pass,PlayType_Rush
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,0.0,0.0,1.0
3,0.0,0.0,1.0
4,0.0,1.0,0.0


In [68]:
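# Merge encoded features and drop the original columns.
# encode_df is numbered from 0, while play_injuries keeps gaps in its index
# after dropna, so reset the index first to keep the rows aligned
clean_play_injuries = play_injuries.reset_index(drop=True).merge(encode_df, left_index=True, right_index=True)
clean_play_injuries.drop(columns=play_cat, inplace=True)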
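
A merge on the index only lines up when both frames share the same index labels. After a `dropna`, the original frame keeps its gapped labels, while a frame built from a plain array is numbered from 0. The sketch below shows the effect on made-up data; `toy` and `encoded` are hypothetical, and the hand-written array stands in for the encoder output.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing PlayType; dropna leaves index 0, 2, 3
toy = pd.DataFrame({
    'PlayType': ['Pass', np.nan, 'Rush', 'Pass'],
    'Temperature': [63, 70, 55, 48],
}).dropna()

# Stand-in for the encoded array: one row per remaining play, index 0, 1, 2
encoded = pd.DataFrame(
    np.array([[1, 0], [0, 1], [1, 0]]),
    columns=['PlayType_Pass', 'PlayType_Rush'],
)

# Joining on mismatched index labels drops rows and pairs the wrong ones
misaligned = toy.merge(encoded, left_index=True, right_index=True)

# Resetting the index first restores a row-for-row match
aligned = toy.reset_index(drop=True).merge(encoded, left_index=True, right_index=True)
```

In the misaligned result, one play is lost and a Rush play is paired with the Pass encoding. Passing `index=play_injuries.index` when building the encoded frame would be another way to keep the labels in step.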

In [69]:
clean_play_injuries.head()

Unnamed: 0,RosterPosition,PlayerGame,FieldType,Temperature,PlayerGamePlay,OutdoorStadium,Precipitation,DaysPlayed,InjuryDuration,IsInjured,InjuryType,PlayType_Kick,PlayType_Pass,PlayType_Rush
0,0,1,1,63,1,1,0,64,0.0,0,0,0.0,1.0,0.0
1,0,1,1,63,2,1,0,64,0.0,0,0,0.0,1.0,0.0
2,0,1,1,63,3,1,0,64,0.0,0,0,0.0,0.0,1.0
3,0,1,1,63,4,1,0,64,0.0,0,0,0.0,0.0,1.0
4,0,1,1,63,5,1,0,64,0.0,0,0,0.0,1.0,0.0


## Export the Data for Analysis

This exports the cleaned data to a new database called NFL_Injuries, which holds all of the processed data and keeps it separate from the original datasets.

In [71]:
# Make connection to the database
from config import db_password
db_string = f"postgresql://postgres:{db_password}@127.0.0.1:5433/NFL_Injuries"
engine = db.create_engine(db_string)

del db_string, db_password
# Write table to database
# clean_play_injuries.to_sql(name='clean_play_injuries', con=engine, index=False)
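
The export step can be tried out without a PostgreSQL server. The sketch below swaps in an in-memory SQLite database purely for illustration; the frame `toy` is made up, and the real notebook writes to PostgreSQL through the SQLAlchemy engine instead.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the cleaned data
toy = pd.DataFrame({'Temperature': [63, 70], 'IsInjured': [0, 1]})

# In-memory SQLite database used here in place of PostgreSQL
conn = sqlite3.connect(':memory:')

# if_exists='replace' overwrites the table if it already exists
toy.to_sql(name='clean_play_injuries', con=conn, index=False, if_exists='replace')

# Read the table back to confirm the write
round_trip = pd.read_sql('SELECT * FROM clean_play_injuries', conn)
conn.close()
```

By default `to_sql` uses `if_exists='fail'`, which raises an error when the table is already present, so rerunning the notebook would stop at the export step unless `'replace'` or `'append'` is passed.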