# **Week 1**

#### **Objective: Evaluate the Data Set using Polynomial and Interaction Terms**

Each dataset has already been explored through univariate, bivariate, and multivariate analysis, including assessments of variable interactions and correlations with the target outcome.

In this section, we extend that analysis by explicitly evaluating polynomial transformations and interaction terms (both categorical and numeric). The goal is to capture potential nonlinear relationships and combined effects between variables, and to assess how these features contribute to predicting the target.
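As a quick illustration of what "polynomial" and "interaction" terms mean in practice, here is a minimal sketch on made-up data. The column names mirror two numeric fields from the plays data, but the values and the new column names (`yardsToGo_sq`, `down_x_yardsToGo`) are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values), not rows from the real datasets
toy = pd.DataFrame({"down": [1, 2, 3], "yardsToGo": [10, 6, 2]})

# Polynomial term: lets a model bend with yardsToGo instead of fitting a straight line
toy["yardsToGo_sq"] = toy["yardsToGo"] ** 2

# Interaction term: lets the effect of yardsToGo depend on which down it is
toy["down_x_yardsToGo"] = toy["down"] * toy["yardsToGo"]

print(toy)
```

For many columns at once, scikit-learn's `PolynomialFeatures` generates the same kinds of terms; the manual version is shown here only because it makes each new column explicit.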

___

#### **Package Imports**

In [19]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import math
import matplotlib.pyplot as plt
import seaborn as sns


# VIF Imports
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import OneHotEncoder

#### **Import Datasets**

In [20]:
BDB_All_Plays = pd.read_csv("../../AFL_Final_Project/BDB_All_Plays.csv") # Big Data Bowl Dataset
FNF_All_Plays = pd.read_csv("../../AFL_Final_Project/FNF_All_Plays.csv") # First and Future Dataset
PDA_All_Plays = pd.read_csv('../../AFL_Final_Project/PDA_All_Plays.csv') # Punt Data Analytics

#### **Important Functions**

In [21]:
# =======================================================
# Taken from Module 3 Milestone 1
# # Link: https://github.com/LeeMcFarling/Module_3_Milestone_1/blob/main/Milestone_01.ipynb
# 
# Split df into Numeric and Categorical Datasets, so that 
# visualizations can be catered accordingly. 
# =======================================================

# Numeric Columns
def numerify(df):
    numeric_cols = df.select_dtypes(include='number').columns
    filtered_cols = [col for col in numeric_cols if 'id' not in col.lower()] # Filter OUT 'id' columns
    df_numeric = df[filtered_cols]
    return df_numeric

# Categorical Columns
def categorify(df):
    pot_id_cols = ('gameId','playId','nflId','playerId','teamId','stadiumId')
    valid_id_columns = [c for c in pot_id_cols if c in df.columns]

    df_categorical_cols = df.select_dtypes(exclude=['number']).columns.tolist()

    combined_cat_cols = df_categorical_cols + valid_id_columns
    df_categorical = df[combined_cat_cols]
    return df_categorical



In [22]:
# ========================================================================================
# Taken from Module 3 - Final Project Milestone 1
# Link: https://github.com/LeeMcFarling/Module_3_Milestone_1/blob/main/Milestone_01.ipynb
# 
# Purpose:
# This is meant to consolidate the 'show_null_counts_features' function from before with 
# another with 'value' and 'unique' counts later on in this analysis. 
# ========================================================================================

def profile_dataset(df):
    # Identify feature types
    feature_types = df.dtypes.apply(lambda x: 'Numeric' if np.issubdtype(x, np.number) else 'Categorical')

    # Build a summary DataFrame
    summary = pd.DataFrame({
        'Feature': df.columns,
        'Type': feature_types.values,
        'Null Values': df.isnull().sum().values,
        'Null %': (df.isnull().mean() * 100).round(2).values,
        'Count (Non-Null)': df.count().values,
        'Unique Values': df.nunique().values
    })

    # Sort Values in Summary by % of null values
    summary = summary.sort_values(by='Null %', ascending=False).reset_index(drop=True)

    # Add dataset shape info above the table
    print(f"This dataset contain {df.shape[0]} rows")
    print(f"This dataset contain {df.shape[1]} columns")

    # Display the summary
    return summary

In [23]:
# ========================================================================================
# Taken from Module 3 - Final Project Milestone 1
# Link: https://github.com/LeeMcFarling/Module_3_Milestone_1/blob/main/Milestone_01.ipynb
# 
# Function purpose is to intake a variety of related columns (Foul 1, Foul 2, etc.) and create
# an indicator flag from it. i.e. 'Did penalty occur? (Y/N)'
# ========================================================================================


def create_indicator_from_columns(df, columns, new_column_name):
    # Initialize a boolean series with False for all rows.
    indicator = pd.Series(False, index=df.index)
    
    for col in columns:
        try:
            if col in df.columns:
                indicator = indicator | df[col].notnull()
            else:
                print(f"Warning: Column '{col}' not found. Skipping.")
        except Exception as e:
            print(f"Error processing column '{col}': {e}")
    
    # Assign the indicator as an integer column to the DataFrame
    df[new_column_name] = indicator.astype(int)
    return df

#### **DataFrame Functions**

In [24]:
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None )


____

# **Big Data Bowl PreAnalysis**

In [25]:
BDB_All_Plays.head(1)

Unnamed: 0,gameId,playId,playDescription,quarter,down,yardsToGo,possessionTeam,defensiveTeam,yardlineSide,yardlineNumber,gameClock,preSnapHomeScore,preSnapVisitorScore,passResult,penaltyYards,prePenaltyPlayResult,playResult,foulName1,foulNFLId1,foulName2,foulNFLId2,foulName3,foulNFLId3,absoluteYardlineNumber,offenseFormation,personnelO,defendersInBox,personnelD,dropBackType,pff_playAction,pff_passCoverage,pff_passCoverageType,Inj_Occured
0,2021090900,97,(13:33) (Shotgun) T.Brady pass incomplete deep right to C.Godwin.,1,3,2,TB,DAL,TB,33,13:33,0,0,I,,0,0,,,,,,,43.0,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,"4 DL, 2 LB, 5 DB",TRADITIONAL,0,Cover-1,Man,0


Let's break the dataset down into numeric and categorical data

In [26]:
BDB_All_Plays_Numeric = numerify(BDB_All_Plays)
BDB_All_Plays_Categorical = categorify(BDB_All_Plays)

Here is a profile of the numeric data

In [27]:
profile_dataset(BDB_All_Plays_Numeric)

This dataset contain 8557 rows
This dataset contain 13 columns


| | Feature | Type | Null Values | Null % | Count (Non-Null) | Unique Values |
|---|---|---|---|---|---|---|
| 0 | penaltyYards | Numeric | 7801 | 91.17 | 756 | 60 |
| 1 | defendersInBox | Numeric | 7 | 0.08 | 8550 | 11 |
| 2 | absoluteYardlineNumber | Numeric | 1 | 0.01 | 8556 | 99 |
| 3 | quarter | Numeric | 0 | 0.0 | 8557 | 5 |
| 4 | down | Numeric | 0 | 0.0 | 8557 | 5 |
| 5 | yardsToGo | Numeric | 0 | 0.0 | 8557 | 32 |
| 6 | yardlineNumber | Numeric | 0 | 0.0 | 8557 | 50 |
| 7 | preSnapHomeScore | Numeric | 0 | 0.0 | 8557 | 42 |
| 8 | preSnapVisitorScore | Numeric | 0 | 0.0 | 8557 | 38 |
| 9 | prePenaltyPlayResult | Numeric | 0 | 0.0 | 8557 | 98 |


From looking at the data, it looks like we need to impute a 0 in the penaltyYards column when there was no penalty, and either drop the rows with missing absoluteYardlineNumber and defendersInBox values or impute a placeholder for them.

In [28]:
profile_dataset(BDB_All_Plays_Categorical)

This dataset contain 8557 rows
This dataset contain 17 columns


| | Feature | Type | Null Values | Null % | Count (Non-Null) | Unique Values |
|---|---|---|---|---|---|---|
| 0 | foulName3 | Categorical | 8556 | 99.99 | 1 | 1 |
| 1 | foulName2 | Categorical | 8527 | 99.65 | 30 | 15 |
| 2 | foulName1 | Categorical | 7821 | 91.4 | 736 | 29 |
| 3 | dropBackType | Categorical | 528 | 6.17 | 8029 | 8 |
| 4 | yardlineSide | Categorical | 125 | 1.46 | 8432 | 32 |
| 5 | offenseFormation | Categorical | 7 | 0.08 | 8550 | 7 |
| 6 | personnelD | Categorical | 1 | 0.01 | 8556 | 29 |
| 7 | personnelO | Categorical | 1 | 0.01 | 8556 | 30 |
| 8 | gameId | Numeric | 0 | 0.0 | 8557 | 122 |
| 9 | pff_passCoverageType | Categorical | 0 | 0.0 | 8557 | 3 |


There are a lot of null values in the foul / penalty fields. Let's roll them up into an indicator field to make things simpler to follow:

- First, let's roll up the foul ID fields into the flag we need.
- Second, let's consolidate Foul2 and Foul3.

First, we'll use the foul ID fields to make a flag called 'foul_on_play'.

In [29]:
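columns = ['foulNFLId1', 'foulNFLId2', 'foulNFLId3']
new_column_name = 'foul_on_play'
# Note: create_indicator_from_columns adds the new column to BDB_All_Plays in place
BDB_All_Plays_Clean = create_indicator_from_columns(BDB_All_Plays, columns, new_column_name)

# Drop the raw foul ID columns now that the flag exists.
# Note: index=1 also drops the row labeled 1, which is why later profiles show 8556 rows
# and the index skips from 0 to 2. Omit index=1 if only the columns should be removed.
BDB_All_Plays_Clean = BDB_All_Plays_Clean.drop(columns=columns, index=1)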

Next, because injuries are so rare compared to the total number of plays, and because the number of plays with both a foul and an injury is smaller still, we should do some quick analysis to see whether granular foul-name information is worth keeping or whether a simple foul flag will suffice.

The reasoning is that if a certain foul occurred only twice in the whole sample, we cannot reliably identify patterns from so few events. Keeping this level of detail risks the temptation to bootstrap an exceedingly small sample, which could amplify incomplete or misleading interactions within the dataset as a whole.

In [30]:
# Build the masks first; reusing single quotes inside an f-string only parses on Python 3.12+
inj_mask = BDB_All_Plays_Clean['Inj_Occured'] == 1
foul_mask = BDB_All_Plays_Clean['foul_on_play'] == 1

print(f" Number of Plays in which Injuries & Fouls Occured: {(inj_mask & foul_mask).sum()}")
print(f" Number of Plays in which Injuries Occured: {inj_mask.sum()}")
print(f" Number of Total Plays: {len(BDB_All_Plays)}")

 Number of Plays in which Injuries & Fouls Occured: 25
 Number of Plays in which Injuries Occured: 209
 Number of Total Plays: 8557


As suspected, the number of plays in which both a foul and an injury occurred is 25 out of 8557, a rate of incidence of about 0.0029 (roughly 0.29%).

Investigating the data further, we can see that there are 29 foul types in foulName1, 15 in foulName2, and so on.
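To put that figure in context, here is the arithmetic written out using only the three counts printed above. The variable names are just labels for this sketch.

```python
n_total = 8557        # total plays
n_injury = 209        # plays with an injury
n_injury_foul = 25    # plays with both an injury and a foul

joint_rate = n_injury_foul / n_total              # share of all plays with injury AND foul
injury_rate = n_injury / n_total                  # share of all plays with an injury
foul_share_of_injuries = n_injury_foul / n_injury # share of injury plays that also had a foul

print(f"joint rate:             {joint_rate:.4f}")
print(f"injury rate:            {injury_rate:.4f}")
print(f"foul share of injuries: {foul_share_of_injuries:.4f}")
```

The joint rate is tiny mostly because injuries themselves are rare, so any further split by foul type would be working with a handful of plays at most.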


Example: 

In [31]:
for col in ['foulName1', 'foulName2', 'foulName3']:
    print(f"\n--- {col} ---")
    print(f'{BDB_All_Plays_Clean[col].dropna().unique()}')


--- foulName1 ---
['Illegal Use of Hands' 'Taunting' 'Defensive Pass Interference'
 'Defensive Holding' 'Offensive Holding' 'Illegal Block Above the Waist'
 'Intentional Grounding' 'Offensive Pass Interference'
 'Unsportsmanlike Conduct' 'Defensive Offside' 'Illegal Formation'
 'Roughing the Passer' 'Unnecessary Roughness' 'Illegal Touch Pass'
 'Face Mask (15 Yards)' 'Ineligible Downfield Pass' 'Illegal Contact'
 'Disqualification' 'Illegal Blindside Block'
 'Lowering the Head to Initiate Contact' 'Chop Block' 'Low Block'
 'Illegal Shift' 'Tripping' 'Illegal Forward Pass' 'Illegal Substitution'
 'Illegal Motion' 'Horse Collar Tackle' 'Clipping']

--- foulName2 ---
['Unnecessary Roughness' 'Face Mask (15 Yards)' 'Tripping'
 'Defensive Offside' 'Roughing the Passer' 'Defensive Pass Interference'
 'Offensive Holding' 'Unsportsmanlike Conduct' 'Defensive Holding'
 'Illegal Use of Hands' 'Taunting' 'Disqualification'
 'Intentional Grounding' 'Offensive Pass Interference' 'Illegal Contact']

As such, the foul data is too granular at this sample size to justify combining the columns and creating dummy variables for 29 or more categories. The effect of foul type on injury rates is worth investigating further, but that would need a sample larger than 8 weeks of one NFL season.

For now, a foul_on_play flag will be deemed sufficient, and the extra granular information will be removed to avoid any risk of confounding the model.
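One way to back up a decision like this is to count how often each category appears at all, and how often it coincides with the target. The sketch below uses made-up rows, and `min_count` is an arbitrary illustrative threshold rather than a value from this analysis.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for foulName1 and the injury flag
toy = pd.DataFrame({
    "foulName1": ["Offensive Holding", "Offensive Holding", "Tripping", None,
                  None, "Clipping", "Offensive Holding", None],
    "Inj_Occured": [0, 1, 0, 0, 1, 0, 0, 0],
})

# How often does each foul type appear at all? (nulls are ignored by value_counts)
counts = toy["foulName1"].value_counts()

# How often does each foul type coincide with an injury?
injury_counts = toy.loc[toy["Inj_Occured"] == 1, "foulName1"].value_counts()

# Categories seen fewer than min_count times are too sparse to model on their own
min_count = 2
rare = counts[counts < min_count].index.tolist()

print(counts, injury_counts, rare, sep="\n\n")
```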

In [32]:
drop_cols = ['foulName1', 'foulName2', 'foulName3']
BDB_All_Plays_Clean.drop(columns=drop_cols, inplace=True)

In [33]:
profile_dataset(BDB_All_Plays_Clean)

This dataset contain 8556 rows
This dataset contain 28 columns


| | Feature | Type | Null Values | Null % | Count (Non-Null) | Unique Values |
|---|---|---|---|---|---|---|
| 0 | penaltyYards | Numeric | 7800 | 91.16 | 756 | 60 |
| 1 | dropBackType | Categorical | 528 | 6.17 | 8028 | 8 |
| 2 | yardlineSide | Categorical | 125 | 1.46 | 8431 | 32 |
| 3 | defendersInBox | Numeric | 7 | 0.08 | 8549 | 11 |
| 4 | offenseFormation | Categorical | 7 | 0.08 | 8549 | 7 |
| 5 | personnelD | Categorical | 1 | 0.01 | 8555 | 29 |
| 6 | personnelO | Categorical | 1 | 0.01 | 8555 | 30 |
| 7 | absoluteYardlineNumber | Numeric | 1 | 0.01 | 8555 | 99 |
| 8 | prePenaltyPlayResult | Numeric | 0 | 0.0 | 8556 | 98 |
| 9 | Inj_Occured | Numeric | 0 | 0.0 | 8556 | 2 |


Now let's check up on the features that still have null fields. 

In [34]:
for col in ['penaltyYards', 'dropBackType', 'yardlineSide', 'defendersInBox', 'offenseFormation']:
    print(f"\n--- {col} ---")
    print(f'{BDB_All_Plays_Clean[col].unique()}')
    print()


--- penaltyYards ---
[ nan   0.  14.   5.  26. -10.  16.  10.  -5. -14.   3.  19. -12.   8.
  15.  13.  11. -15.  17.  35.   2.   9.  -2.   6.  -4.  32.  21. -11.
  27.  36.   4.  12. -18.  25.  48.  -7.  28.  -3.  -6.  24.   7.   1.
  18.  43.  22.  23.  -9.  31.  20.  45.  47.  41.  33.  -8.  50.  39.
  46.  40.  38.  42. -13.]


--- dropBackType ---
['TRADITIONAL' 'SCRAMBLE_ROLLOUT_RIGHT' 'DESIGNED_ROLLOUT_RIGHT' nan
 'SCRAMBLE' 'DESIGNED_ROLLOUT_LEFT' 'UNKNOWN' 'DESIGNED_RUN'
 'SCRAMBLE_ROLLOUT_LEFT']


--- yardlineSide ---
['TB' 'DAL' nan 'ATL' 'PHI' 'PIT' 'BUF' 'NYJ' 'CAR' 'MIN' 'CIN' 'DET' 'SF'
 'HOU' 'JAX' 'IND' 'SEA' 'TEN' 'ARI' 'LAC' 'WAS' 'CLE' 'KC' 'MIA' 'NE'
 'NO' 'GB' 'NYG' 'DEN' 'CHI' 'LA' 'LV' 'BAL']


--- defendersInBox ---
[ 6.  7.  5.  4.  8.  3.  9. 10. nan 11.  2.  1.]


--- offenseFormation ---
['SHOTGUN' 'SINGLEBACK' 'EMPTY' 'I_FORM' 'JUMBO' 'PISTOL' nan 'WILDCAT']



Of the five features with missing values, four are easily imputed with 0 or an unknown value. The reasoning is as follows:

- penaltyYards: If there was no penalty, there are 0 penalty yards.
- dropBackType: An 'UNKNOWN' type already exists, so missing values are assumed unknown.
- yardlineSide: This is a string-type categorical variable, so missing values are easily imputed as 'UNK' with little change to the overall nature of the field.
- offenseFormation: Again, as this is a string-type categorical field, missing values are easily imputed as 'UNKNOWN'.

In [35]:
BDB_All_Plays_Clean = BDB_All_Plays_Clean.fillna({
    'penaltyYards': 0,
    'dropBackType': 'UNKNOWN',
    'yardlineSide': 'UNK',
    'offenseFormation': 'UNKNOWN'  # the column is named offenseFormation, not offensiveFormation
})

Because defendersInBox is numeric rather than categorical, adding an 'unknown' category would change the nature of the feature itself. We also cannot impute 0 here, because 0 is a real value distinct from 'unknown', and doing so could confound the variable.

To further investigate, let's query based on this field specifically

In [36]:
BDB_All_Plays_Clean[BDB_All_Plays_Clean['defendersInBox'].isna()].index

Index([916, 1570, 1654, 4887, 6874, 6899, 7912], dtype='int64')

Judging from the Inj_Occured field, no injuries occurred in this (extremely small) set of 7 rows. In this case, it's safe to simply drop the records, since doing so will not have a significant impact on our target.
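That check can be made explicit before dropping anything. Below is a minimal sketch on made-up rows; it only drops the incomplete records when none of them carries a positive target, which matters when the positive class is this rare.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), not rows from the real dataset
toy = pd.DataFrame({
    "defendersInBox": [6.0, np.nan, 7.0, np.nan, 5.0],
    "Inj_Occured": [0, 0, 1, 0, 0],
})

missing = toy["defendersInBox"].isna()
injuries_in_missing = int(toy.loc[missing, "Inj_Occured"].sum())
print(f"rows missing defendersInBox: {missing.sum()}, injuries among them: {injuries_in_missing}")

# Only drop when no positive-class rows would be lost
if injuries_in_missing == 0:
    toy = toy.dropna(subset=["defendersInBox"])

print(toy)
```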

In [37]:
BDB_dropable_records = BDB_All_Plays_Clean[BDB_All_Plays_Clean['defendersInBox'].isna()].index
BDB_All_Plays_Clean = BDB_All_Plays_Clean.drop(BDB_dropable_records)

And let's re-profile the dataset to see how we're doing:

In [38]:
profile_dataset(BDB_All_Plays_Clean)

This dataset contain 8549 rows
This dataset contain 28 columns


| | Feature | Type | Null Values | Null % | Count (Non-Null) | Unique Values |
|---|---|---|---|---|---|---|
| 0 | gameId | Numeric | 0 | 0.0 | 8549 | 122 |
| 1 | playId | Numeric | 0 | 0.0 | 8549 | 3761 |
| 2 | Inj_Occured | Numeric | 0 | 0.0 | 8549 | 2 |
| 3 | pff_passCoverageType | Categorical | 0 | 0.0 | 8549 | 3 |
| 4 | pff_passCoverage | Categorical | 0 | 0.0 | 8549 | 12 |
| 5 | pff_playAction | Numeric | 0 | 0.0 | 8549 | 2 |
| 6 | dropBackType | Categorical | 0 | 0.0 | 8549 | 8 |
| 7 | personnelD | Categorical | 0 | 0.0 | 8549 | 29 |
| 8 | defendersInBox | Numeric | 0 | 0.0 | 8549 | 11 |
| 9 | personnelO | Categorical | 0 | 0.0 | 8549 | 30 |


Great, we're almost ready to investigate the polynomial terms. All that's left is to drop the ID-type fields, make dummy variables for the categorical features, and then do some last-minute checks to make sure we don't have perfect multicollinearity.

Regarding the ID fields:

In [39]:
ID_Fields = ['gameId', 'playId', 'playDescription']
BDB_All_Plays_Clean.drop(columns=ID_Fields, inplace=True)

And now let's handle the dummy variables.

We are mostly good to go, except for the gameClock variable, which has 898 unique values. Let's convert it from its current MM:SS format (14:58, etc.) into a floating-point number, as nobody needs the negativity of 898 extra one-hot encoded columns in their lives.

In [40]:
BDB_All_Plays_Clean['gameClock']

0       13:33
2       12:23
3       09:56
4       09:46
5       08:53
6       08:24
7       08:20
8       07:53
9       07:30
10      06:13
11      05:26
12      04:15
13      02:45
14      02:22
15      01:43
16      00:59
17      00:11
18      00:05
19      14:21
20      12:07
21      11:29
22      11:13
23      09:53
24      09:48
25      09:09
26      09:04
27      08:16
28      07:27
29      06:46
30      05:43
31      05:37
32      04:25
33      03:30
34      03:26
35      03:08
36      02:45
37      02:38
38      02:00
39      01:54
40      01:50
41      01:43
42      01:16
43      01:11
44      00:35
45      00:28
46      00:15
47      00:09
48      00:06
49      14:57
50      14:27
51      12:44
52      12:14
53      12:10
54      11:26
55      10:32
56      09:08
57      09:04
58      09:00
59      08:44
60      08:31
61      07:12
62      06:28
63      05:59
64      04:49
65      03:26
66      02:43
67      02:03
68      00:39
69      00:33
70      13:38
71      12:11
72    

To keep things consistent, let's convert this into the fraction of the quarter that has elapsed. Since gameClock counts down the time remaining, a clock reading of 7:30 in a 15:00 quarter would be 0.5.

In [41]:
minutes = BDB_All_Plays_Clean['gameClock'].str[:2].astype(int) # convert minutes to int
seconds = BDB_All_Plays_Clean['gameClock'].str[-2:].astype(int) # convert seconds to int

numerator = (minutes * 60 + seconds) # our data points in seconds
denominator = (60 * 15) # amount of seconds in 15 minutes

BDB_All_Plays_Clean['frac_quarter_elapsed'] = 1 - (numerator / denominator).round(2)
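The same conversion can be checked by hand on a few clock values. The sketch below is a standalone variant, not the cell above: it splits on the colon instead of slicing fixed positions, and it rounds after subtracting. The helper name and the `quarter_seconds` default are assumptions for illustration, and the default assumes every period lasts 15 minutes.

```python
import pandas as pd

def frac_quarter_elapsed(clock, quarter_seconds=15 * 60):
    """Convert 'MM:SS' time remaining into the fraction of the quarter already played."""
    parts = clock.str.split(":", expand=True).astype(int)
    remaining = parts[0] * 60 + parts[1]  # seconds left on the clock
    return (1 - remaining / quarter_seconds).round(2)

# Hand-picked clock readings: start of quarter, halfway, the first play shown earlier, end of quarter
clock = pd.Series(["15:00", "07:30", "13:33", "00:00"])
result = frac_quarter_elapsed(clock)
print(result.tolist())
```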


And let's check if it worked:

In [42]:
BDB_All_Plays_Clean.head(1)

Unnamed: 0,quarter,down,yardsToGo,possessionTeam,defensiveTeam,yardlineSide,yardlineNumber,gameClock,preSnapHomeScore,preSnapVisitorScore,passResult,penaltyYards,prePenaltyPlayResult,playResult,absoluteYardlineNumber,offenseFormation,personnelO,defendersInBox,personnelD,dropBackType,pff_playAction,pff_passCoverage,pff_passCoverageType,Inj_Occured,foul_on_play,frac_quarter_elapsed
0,1,3,2,TB,DAL,TB,33,13:33,0,0,I,0.0,0,0,43.0,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,"4 DL, 2 LB, 5 DB",TRADITIONAL,0,Cover-1,Man,0,0,0.1


It worked. Now let's drop the gameClock variable. 

In [43]:
BDB_All_Plays_Clean.drop(columns='gameClock', inplace=True)

In [46]:
BDB_All_Plays_Clean.head(5)

Unnamed: 0,quarter,down,yardsToGo,possessionTeam,defensiveTeam,yardlineSide,yardlineNumber,preSnapHomeScore,preSnapVisitorScore,passResult,penaltyYards,prePenaltyPlayResult,playResult,absoluteYardlineNumber,offenseFormation,personnelO,defendersInBox,personnelD,dropBackType,pff_playAction,pff_passCoverage,pff_passCoverageType,Inj_Occured,foul_on_play,frac_quarter_elapsed
0,1,3,2,TB,DAL,TB,33,0,0,I,0.0,0,0,43.0,SHOTGUN,"1 RB, 1 TE, 3 WR",6.0,"4 DL, 2 LB, 5 DB",TRADITIONAL,0,Cover-1,Man,0,0,0.1
2,1,2,6,DAL,TB,DAL,34,0,0,C,0.0,5,5,76.0,SHOTGUN,"0 RB, 2 TE, 3 WR",6.0,"3 DL, 3 LB, 5 DB",TRADITIONAL,0,Cover-3,Zone,0,0,0.17
3,1,1,10,DAL,TB,TB,39,0,0,I,0.0,0,0,49.0,SINGLEBACK,"1 RB, 2 TE, 2 WR",6.0,"4 DL, 3 LB, 4 DB",TRADITIONAL,1,Cover-3,Zone,0,0,0.34
4,1,3,15,DAL,TB,TB,44,0,0,I,0.0,0,0,54.0,SHOTGUN,"1 RB, 1 TE, 3 WR",7.0,"3 DL, 4 LB, 4 DB",TRADITIONAL,0,Cover-3,Zone,0,0,0.35
5,1,2,5,TB,DAL,TB,11,0,0,C,0.0,10,10,21.0,EMPTY,"1 RB, 1 TE, 3 WR",6.0,"4 DL, 2 LB, 5 DB",TRADITIONAL,0,Cover-1,Man,0,0,0.41


Next, before one-hot encoding and running VIF, let's make sure we know which columns to drop:


In [47]:
BDB_All_Plays_Clean = BDB_All_Plays_Clean.drop(columns='pff_passCoverage')

In [48]:
BDB_All_Plays_Clean_Categorical = categorify(BDB_All_Plays_Clean)
profile_dataset(BDB_All_Plays_Clean_Categorical)

This dataset contain 8549 rows
This dataset contain 9 columns


| | Feature | Type | Null Values | Null % | Count (Non-Null) | Unique Values |
|---|---|---|---|---|---|---|
| 0 | possessionTeam | Categorical | 0 | 0.0 | 8549 | 32 |
| 1 | defensiveTeam | Categorical | 0 | 0.0 | 8549 | 32 |
| 2 | yardlineSide | Categorical | 0 | 0.0 | 8549 | 33 |
| 3 | passResult | Categorical | 0 | 0.0 | 8549 | 5 |
| 4 | offenseFormation | Categorical | 0 | 0.0 | 8549 | 7 |
| 5 | personnelO | Categorical | 0 | 0.0 | 8549 | 30 |
| 6 | personnelD | Categorical | 0 | 0.0 | 8549 | 29 |
| 7 | dropBackType | Categorical | 0 | 0.0 | 8549 | 8 |
| 8 | pff_passCoverageType | Categorical | 0 | 0.0 | 8549 | 3 |


Now let's make dummy variables:

In [49]:
BDB_Dummies = pd.get_dummies(BDB_All_Plays_Clean_Categorical, drop_first=True)

print(BDB_Dummies.shape)

BDB_Dummies.head(1)

(8549, 170)


Unnamed: 0,possessionTeam_ATL,possessionTeam_BAL,possessionTeam_BUF,possessionTeam_CAR,possessionTeam_CHI,possessionTeam_CIN,possessionTeam_CLE,possessionTeam_DAL,possessionTeam_DEN,possessionTeam_DET,possessionTeam_GB,possessionTeam_HOU,possessionTeam_IND,possessionTeam_JAX,possessionTeam_KC,possessionTeam_LA,possessionTeam_LAC,possessionTeam_LV,possessionTeam_MIA,possessionTeam_MIN,possessionTeam_NE,possessionTeam_NO,possessionTeam_NYG,possessionTeam_NYJ,possessionTeam_PHI,possessionTeam_PIT,possessionTeam_SEA,possessionTeam_SF,possessionTeam_TB,possessionTeam_TEN,possessionTeam_WAS,defensiveTeam_ATL,defensiveTeam_BAL,defensiveTeam_BUF,defensiveTeam_CAR,defensiveTeam_CHI,defensiveTeam_CIN,defensiveTeam_CLE,defensiveTeam_DAL,defensiveTeam_DEN,defensiveTeam_DET,defensiveTeam_GB,defensiveTeam_HOU,defensiveTeam_IND,defensiveTeam_JAX,defensiveTeam_KC,defensiveTeam_LA,defensiveTeam_LAC,defensiveTeam_LV,defensiveTeam_MIA,defensiveTeam_MIN,defensiveTeam_NE,defensiveTeam_NO,defensiveTeam_NYG,defensiveTeam_NYJ,defensiveTeam_PHI,defensiveTeam_PIT,defensiveTeam_SEA,defensiveTeam_SF,defensiveTeam_TB,defensiveTeam_TEN,defensiveTeam_WAS,yardlineSide_ATL,yardlineSide_BAL,yardlineSide_BUF,yardlineSide_CAR,yardlineSide_CHI,yardlineSide_CIN,yardlineSide_CLE,yardlineSide_DAL,yardlineSide_DEN,yardlineSide_DET,yardlineSide_GB,yardlineSide_HOU,yardlineSide_IND,yardlineSide_JAX,yardlineSide_KC,yardlineSide_LA,yardlineSide_LAC,yardlineSide_LV,yardlineSide_MIA,yardlineSide_MIN,yardlineSide_NE,yardlineSide_NO,yardlineSide_NYG,yardlineSide_NYJ,yardlineSide_PHI,yardlineSide_PIT,yardlineSide_SEA,yardlineSide_SF,yardlineSide_TB,yardlineSide_TEN,yardlineSide_UNK,yardlineSide_WAS,passResult_I,passResult_IN,passResult_R,passResult_S,offenseFormation_I_FORM,offenseFormation_JUMBO,offenseFormation_PISTOL,offenseFormation_SHOTGUN,offenseFormation_SINGLEBACK,offenseFormation_WILDCAT,"personnelO_0 RB, 1 TE, 4 WR","personnelO_0 RB, 2 TE, 3 WR","personnelO_0 RB, 3 TE, 2 WR","personnelO_1 RB, 0 TE, 4 
WR","personnelO_1 RB, 1 TE, 2 WR,1 LB","personnelO_1 RB, 1 TE, 3 WR","personnelO_1 RB, 2 TE, 2 WR","personnelO_1 RB, 3 TE, 1 WR","personnelO_1 RB, 4 TE, 0 WR","personnelO_2 QB, 1 RB, 0 TE, 3 WR","personnelO_2 QB, 1 RB, 1 TE, 2 WR","personnelO_2 QB, 1 RB, 2 TE, 1 WR","personnelO_2 QB, 1 RB, 3 TE, 0 WR","personnelO_2 QB, 2 RB, 0 TE, 2 WR","personnelO_2 QB, 2 RB, 1 TE, 1 WR","personnelO_2 QB, 6 OL, 1 RB, 1 TE, 1 WR","personnelO_2 RB, 0 TE, 3 WR","personnelO_2 RB, 1 TE, 2 WR","personnelO_2 RB, 2 TE, 1 WR","personnelO_2 RB, 3 TE, 0 WR","personnelO_3 RB, 0 TE, 2 WR","personnelO_6 OL, 1 RB, 0 TE, 3 WR","personnelO_6 OL, 1 RB, 1 TE, 2 WR","personnelO_6 OL, 1 RB, 2 TE, 1 WR","personnelO_6 OL, 1 RB, 3 TE, 0 WR","personnelO_6 OL, 2 RB, 0 TE, 2 WR","personnelO_6 OL, 2 RB, 1 TE, 1 WR","personnelO_6 OL, 2 RB, 2 TE, 0 WR","personnelO_7 OL, 1 RB, 0 TE, 2 WR","personnelD_0 DL, 5 LB, 6 DB","personnelD_1 DL, 2 LB, 8 DB","personnelD_1 DL, 3 LB, 7 DB","personnelD_1 DL, 4 LB, 6 DB","personnelD_1 DL, 5 LB, 5 DB","personnelD_2 DL, 2 LB, 7 DB","personnelD_2 DL, 3 LB, 6 DB","personnelD_2 DL, 4 LB, 5 DB","personnelD_2 DL, 5 LB, 4 DB","personnelD_3 DL, 1 LB, 7 DB","personnelD_3 DL, 2 LB, 6 DB","personnelD_3 DL, 3 LB, 5 DB","personnelD_3 DL, 4 LB, 4 DB","personnelD_3 DL, 5 LB, 3 DB","personnelD_4 DL, 1 LB, 6 DB","personnelD_4 DL, 2 LB, 5 DB","personnelD_4 DL, 3 LB, 4 DB","personnelD_4 DL, 4 LB, 3 DB","personnelD_4 DL, 5 LB, 2 DB","personnelD_4 DL, 6 LB, 1 DB","personnelD_5 DL, 1 LB, 5 DB","personnelD_5 DL, 2 LB, 4 DB","personnelD_5 DL, 3 LB, 3 DB","personnelD_5 DL, 5 LB, 1 DB","personnelD_6 DL, 1 LB, 4 DB","personnelD_6 DL, 2 LB, 3 DB","personnelD_6 DL, 3 LB, 2 DB","personnelD_6 DL, 4 LB, 1 DB",dropBackType_DESIGNED_ROLLOUT_RIGHT,dropBackType_DESIGNED_RUN,dropBackType_SCRAMBLE,dropBackType_SCRAMBLE_ROLLOUT_LEFT,dropBackType_SCRAMBLE_ROLLOUT_RIGHT,dropBackType_TRADITIONAL,dropBackType_UNKNOWN,pff_passCoverageType_Other,pff_passCoverageType_Zone
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False


The dummy columns come back as booleans (True/False, as shown above), which caused some issues, so let's cast the boolean columns as integers.

In [50]:
bool_cols = BDB_Dummies.select_dtypes(include='bool').columns
BDB_Dummies[bool_cols] = BDB_Dummies[bool_cols].astype(int)
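The cast can also be folded into the encoding step, since `pd.get_dummies` accepts a `dtype` argument. This is a minimal sketch on a made-up column, showing 0/1 integers coming out directly and which level `drop_first=True` removes.

```python
import pandas as pd

# Toy data (hypothetical values) using formation labels that appear in the dataset
toy = pd.DataFrame({"offenseFormation": ["SHOTGUN", "EMPTY", "I_FORM", "SHOTGUN"]})

# drop_first=True drops the first level in sorted order (EMPTY here), which becomes the baseline;
# dtype=int returns 0/1 integers instead of booleans
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies)
print(dummies.dtypes)
```

A row of all zeros then means the baseline category, which is worth keeping in mind when reading coefficients later.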

And then let's concatenate the dummy variables with the numeric features.

In [51]:
BDB_All_Plays_Clean_Numeric = numerify(BDB_All_Plays_Clean)
BDB_All_Plays_Model_Ready = pd.concat([BDB_All_Plays_Clean_Numeric, BDB_Dummies], axis=1)

print(BDB_All_Plays_Model_Ready.shape)
BDB_All_Plays_Model_Ready.head(5)

(8549, 185)


Unnamed: 0,quarter,down,yardsToGo,yardlineNumber,preSnapHomeScore,preSnapVisitorScore,penaltyYards,prePenaltyPlayResult,playResult,absoluteYardlineNumber,defendersInBox,pff_playAction,Inj_Occured,foul_on_play,frac_quarter_elapsed,possessionTeam_ATL,possessionTeam_BAL,possessionTeam_BUF,possessionTeam_CAR,possessionTeam_CHI,possessionTeam_CIN,possessionTeam_CLE,possessionTeam_DAL,possessionTeam_DEN,possessionTeam_DET,possessionTeam_GB,possessionTeam_HOU,possessionTeam_IND,possessionTeam_JAX,possessionTeam_KC,possessionTeam_LA,possessionTeam_LAC,possessionTeam_LV,possessionTeam_MIA,possessionTeam_MIN,possessionTeam_NE,possessionTeam_NO,possessionTeam_NYG,possessionTeam_NYJ,possessionTeam_PHI,possessionTeam_PIT,possessionTeam_SEA,possessionTeam_SF,possessionTeam_TB,possessionTeam_TEN,possessionTeam_WAS,defensiveTeam_ATL,defensiveTeam_BAL,defensiveTeam_BUF,defensiveTeam_CAR,defensiveTeam_CHI,defensiveTeam_CIN,defensiveTeam_CLE,defensiveTeam_DAL,defensiveTeam_DEN,defensiveTeam_DET,defensiveTeam_GB,defensiveTeam_HOU,defensiveTeam_IND,defensiveTeam_JAX,defensiveTeam_KC,defensiveTeam_LA,defensiveTeam_LAC,defensiveTeam_LV,defensiveTeam_MIA,defensiveTeam_MIN,defensiveTeam_NE,defensiveTeam_NO,defensiveTeam_NYG,defensiveTeam_NYJ,defensiveTeam_PHI,defensiveTeam_PIT,defensiveTeam_SEA,defensiveTeam_SF,defensiveTeam_TB,defensiveTeam_TEN,defensiveTeam_WAS,yardlineSide_ATL,yardlineSide_BAL,yardlineSide_BUF,yardlineSide_CAR,yardlineSide_CHI,yardlineSide_CIN,yardlineSide_CLE,yardlineSide_DAL,yardlineSide_DEN,yardlineSide_DET,yardlineSide_GB,yardlineSide_HOU,yardlineSide_IND,yardlineSide_JAX,yardlineSide_KC,yardlineSide_LA,yardlineSide_LAC,yardlineSide_LV,yardlineSide_MIA,yardlineSide_MIN,yardlineSide_NE,yardlineSide_NO,yardlineSide_NYG,yardlineSide_NYJ,yardlineSide_PHI,yardlineSide_PIT,yardlineSide_SEA,yardlineSide_SF,yardlineSide_TB,yardlineSide_TEN,yardlineSide_UNK,yardlineSide_WAS,passResult_I,passResult_IN,passResult_R,passResult_S,offenseFormation_I_FORM,offenseFormation_JUMBO,off
enseFormation_PISTOL,offenseFormation_SHOTGUN,offenseFormation_SINGLEBACK,offenseFormation_WILDCAT,"personnelO_0 RB, 1 TE, 4 WR","personnelO_0 RB, 2 TE, 3 WR","personnelO_0 RB, 3 TE, 2 WR","personnelO_1 RB, 0 TE, 4 WR","personnelO_1 RB, 1 TE, 2 WR,1 LB","personnelO_1 RB, 1 TE, 3 WR","personnelO_1 RB, 2 TE, 2 WR","personnelO_1 RB, 3 TE, 1 WR","personnelO_1 RB, 4 TE, 0 WR","personnelO_2 QB, 1 RB, 0 TE, 3 WR","personnelO_2 QB, 1 RB, 1 TE, 2 WR","personnelO_2 QB, 1 RB, 2 TE, 1 WR","personnelO_2 QB, 1 RB, 3 TE, 0 WR","personnelO_2 QB, 2 RB, 0 TE, 2 WR","personnelO_2 QB, 2 RB, 1 TE, 1 WR","personnelO_2 QB, 6 OL, 1 RB, 1 TE, 1 WR","personnelO_2 RB, 0 TE, 3 WR","personnelO_2 RB, 1 TE, 2 WR","personnelO_2 RB, 2 TE, 1 WR","personnelO_2 RB, 3 TE, 0 WR","personnelO_3 RB, 0 TE, 2 WR","personnelO_6 OL, 1 RB, 0 TE, 3 WR","personnelO_6 OL, 1 RB, 1 TE, 2 WR","personnelO_6 OL, 1 RB, 2 TE, 1 WR","personnelO_6 OL, 1 RB, 3 TE, 0 WR","personnelO_6 OL, 2 RB, 0 TE, 2 WR","personnelO_6 OL, 2 RB, 1 TE, 1 WR","personnelO_6 OL, 2 RB, 2 TE, 0 WR","personnelO_7 OL, 1 RB, 0 TE, 2 WR","personnelD_0 DL, 5 LB, 6 DB","personnelD_1 DL, 2 LB, 8 DB","personnelD_1 DL, 3 LB, 7 DB","personnelD_1 DL, 4 LB, 6 DB","personnelD_1 DL, 5 LB, 5 DB","personnelD_2 DL, 2 LB, 7 DB","personnelD_2 DL, 3 LB, 6 DB","personnelD_2 DL, 4 LB, 5 DB","personnelD_2 DL, 5 LB, 4 DB","personnelD_3 DL, 1 LB, 7 DB","personnelD_3 DL, 2 LB, 6 DB","personnelD_3 DL, 3 LB, 5 DB","personnelD_3 DL, 4 LB, 4 DB","personnelD_3 DL, 5 LB, 3 DB","personnelD_4 DL, 1 LB, 6 DB","personnelD_4 DL, 2 LB, 5 DB","personnelD_4 DL, 3 LB, 4 DB","personnelD_4 DL, 4 LB, 3 DB","personnelD_4 DL, 5 LB, 2 DB","personnelD_4 DL, 6 LB, 1 DB","personnelD_5 DL, 1 LB, 5 DB","personnelD_5 DL, 2 LB, 4 DB","personnelD_5 DL, 3 LB, 3 DB","personnelD_5 DL, 5 LB, 1 DB","personnelD_6 DL, 1 LB, 4 DB","personnelD_6 DL, 2 LB, 3 DB","personnelD_6 DL, 3 LB, 2 DB","personnelD_6 DL, 4 LB, 1 
DB",dropBackType_DESIGNED_ROLLOUT_RIGHT,dropBackType_DESIGNED_RUN,dropBackType_SCRAMBLE,dropBackType_SCRAMBLE_ROLLOUT_LEFT,dropBackType_SCRAMBLE_ROLLOUT_RIGHT,dropBackType_TRADITIONAL,dropBackType_UNKNOWN,pff_passCoverageType_Other,pff_passCoverageType_Zone
0,1,3,2,33,0,0,0.0,0,0,43.0,6.0,0,0,0,0.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,1,2,6,34,0,0,0.0,5,5,76.0,6.0,0,0,0,0.17,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
3,1,1,10,39,0,0,0.0,0,0,49.0,6.0,1,0,0,0.34,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
4,1,3,15,44,0,0,0.0,0,0,54.0,7.0,0,0,0,0.35,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
5,1,2,5,11,0,0,0.0,10,10,21.0,6.0,0,0,0,0.41,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


___

# **Big Data Bowl: Variance Inflation Factor**

Next, we need to check the features for multicollinearity before evaluating the polynomial and interaction terms. We will use the variance inflation factor (VIF) method, as detailed at the link below.

#### **VIF Interpretation**

- Values near 1 mean that the feature is essentially independent of the others
- Values between 1 and 5 show moderate correlation
- Values > 10 show problematic levels of multicollinearity


Source: https://www.geeksforgeeks.org/python/detecting-multicollinearity-with-vif-python/
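
To make the thresholds above concrete, here is a minimal sketch of what a VIF is: regress one feature on all the others, take the R² of that regression, and compute 1 / (1 - R²). The helper name `vif_by_hand` and the synthetic columns `a`, `b`, `c` are made up for illustration and are not part of the project data. One caveat, as far as I understand the library: statsmodels' `variance_inflation_factor` uses the design matrix exactly as it is passed in, so if an intercept is wanted the caller has to add that column. The sketch below adds the intercept explicitly.

```python
import numpy as np

def vif_by_hand(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other features
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()                # R^2 of the auxiliary regression
    return 1.0 / (1.0 - r2)

# Synthetic example (hypothetical data, not the BDB dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                  # independent of a
c = a + 0.1 * rng.normal(size=500)        # nearly a copy of a
X = np.column_stack([a, b, c])

vifs = [vif_by_hand(X, i) for i in range(X.shape[1])]
print(vifs)
```

In this toy setup `b` should land close to 1, while `a` and `c` should come out well above 10 because each one is almost fully explained by the other.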

In [52]:
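def VIF_Analyze(df):
    # Keep numeric and boolean columns; anything else cannot go into the regression
    df_ready = df.select_dtypes(include=[np.number, 'bool', 'boolean'])
    # Report only the columns that were actually left out (bools are included above)
    df_not_ready = df.drop(columns=df_ready.columns)

    # Cast to float so mixed bool/numeric columns do not produce an object array
    X = df_ready.astype(float).values

    VIF_data = pd.DataFrame()
    VIF_data['feature'] = df_ready.columns
    VIF_data['VIF'] = [variance_inflation_factor(X, i) for i in range(X.shape[1])]

    print(f'These Columns were not formatted correctly. Could not include in analysis \n {df_not_ready.columns}')
    return VIF_data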

In [70]:
VIF_data = VIF_Analyze(BDB_All_Plays_Model_Ready)
VIF_data.head(5)

These Columns were not formatted correctly. Could not include in analysis 
 Index([], dtype='object')


Unnamed: 0,feature,VIF
0,quarter,23.813044
1,down,7.656916
2,yardsToGo,6.985012
3,yardlineNumber,8.292387
4,preSnapHomeScore,6.117451


In [73]:
BDB_All_Plays_Clean.columns

Index(['quarter', 'down', 'yardsToGo', 'possessionTeam', 'defensiveTeam',
       'yardlineSide', 'yardlineNumber', 'preSnapHomeScore',
       'preSnapVisitorScore', 'passResult', 'penaltyYards',
       'prePenaltyPlayResult', 'playResult', 'absoluteYardlineNumber',
       'offenseFormation', 'defendersInBox', 'dropBackType', 'pff_playAction',
       'pff_passCoverageType', 'Inj_Occured', 'foul_on_play',
       'frac_quarter_elapsed'],
      dtype='object')

#### **Notes on the following VIF Analysis**

(I put the notes up here because I figured it would be more intuitive than scrolling to the bottom of a big data frame.)

So it looks like the PFF pass coverage and offensive formation dummies are exhibiting perfect multicollinearity, even after we drop one of the dummy variables. My guess would be that offenseFormation (the offensive formation) and personnelO (the specific offensive personnel on the field) are perfectly correlated. Intuitively, this makes sense, and it might even be reflected in the defensive formations.


So we'll start by dropping the PFF pass coverage column, and then we'll re-run VIF and go from there.
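
As a rough illustration of how perfect multicollinearity can survive `drop_first=True`, here is a small sketch. The toy frame and its values are hypothetical (not taken from the BDB data): the `personnel` column fully determines the `formation` column, so one dummy ends up being an exact copy of another. Comparing the matrix rank to the number of columns is one quick way to spot this before running VIF.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'personnel' fully determines 'formation'
toy = pd.DataFrame({
    "formation": ["SHOTGUN", "SHOTGUN", "I_FORM", "I_FORM", "JUMBO", "JUMBO"],
    "personnel": ["11", "11", "21", "21", "13", "13"],
})

# One-hot encode both columns, dropping the first level of each
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

n_cols = dummies.shape[1]
rank = np.linalg.matrix_rank(dummies.to_numpy())

# rank < n_cols means at least one column is a linear combination of the others
print(n_cols, rank)
```

When the rank is lower than the column count, the auxiliary regression behind VIF fits at least one column perfectly, which is why the VIF blows up for those features.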

In [64]:
VIF_data[VIF_data['VIF'] > 10].sort_values(by='VIF', ascending=False)

Unnamed: 0,feature,VIF
134,"personnelD_4 DL, 2 LB, 5 DB",79.506766
10,defendersInBox,75.195898
152,dropBackType_TRADITIONAL,48.313112
126,"personnelD_2 DL, 4 LB, 5 DB",46.059335
130,"personnelD_3 DL, 3 LB, 5 DB",36.021212
8,playResult,33.489379
7,prePenaltyPlayResult,32.115694
125,"personnelD_2 DL, 3 LB, 6 DB",24.954811
0,quarter,24.435793
135,"personnelD_4 DL, 3 LB, 4 DB",24.020934


In [65]:
# BDB_All_Plays_Clean = BDB_All_Plays_Clean.drop(columns='pff_passCoverage')

# Added after VIF a second time
# BDB_All_Plays_Clean = BDB_All_Plays_Clean.drop(columns='personnelO')

# Added after VIF a third time
BDB_All_Plays_Clean = BDB_All_Plays_Clean.drop(columns='personnelD')


In [66]:
BDB_All_Plays_Clean_Categorical = categorify(BDB_All_Plays_Clean)
profile_dataset(BDB_All_Plays_Clean_Categorical)

This dataset contain 8549 rows
This dataset contain 7 columns


Unnamed: 0,Feature,Type,Null Values,Null %,Count (Non-Null),Unique Values
0,possessionTeam,Categorical,0,0.0,8549,32
1,defensiveTeam,Categorical,0,0.0,8549,32
2,yardlineSide,Categorical,0,0.0,8549,33
3,passResult,Categorical,0,0.0,8549,5
4,offenseFormation,Categorical,0,0.0,8549,7
5,dropBackType,Categorical,0,0.0,8549,8
6,pff_passCoverageType,Categorical,0,0.0,8549,3


Now let's make dummy variables:

In [67]:
BDB_Dummies = pd.get_dummies(BDB_All_Plays_Clean_Categorical, drop_first=True)

print(BDB_Dummies.shape)

BDB_Dummies.head(1)

(8549, 113)


Unnamed: 0,possessionTeam_ATL,possessionTeam_BAL,possessionTeam_BUF,possessionTeam_CAR,possessionTeam_CHI,possessionTeam_CIN,possessionTeam_CLE,possessionTeam_DAL,possessionTeam_DEN,possessionTeam_DET,possessionTeam_GB,possessionTeam_HOU,possessionTeam_IND,possessionTeam_JAX,possessionTeam_KC,possessionTeam_LA,possessionTeam_LAC,possessionTeam_LV,possessionTeam_MIA,possessionTeam_MIN,possessionTeam_NE,possessionTeam_NO,possessionTeam_NYG,possessionTeam_NYJ,possessionTeam_PHI,possessionTeam_PIT,possessionTeam_SEA,possessionTeam_SF,possessionTeam_TB,possessionTeam_TEN,possessionTeam_WAS,defensiveTeam_ATL,defensiveTeam_BAL,defensiveTeam_BUF,defensiveTeam_CAR,defensiveTeam_CHI,defensiveTeam_CIN,defensiveTeam_CLE,defensiveTeam_DAL,defensiveTeam_DEN,defensiveTeam_DET,defensiveTeam_GB,defensiveTeam_HOU,defensiveTeam_IND,defensiveTeam_JAX,defensiveTeam_KC,defensiveTeam_LA,defensiveTeam_LAC,defensiveTeam_LV,defensiveTeam_MIA,defensiveTeam_MIN,defensiveTeam_NE,defensiveTeam_NO,defensiveTeam_NYG,defensiveTeam_NYJ,defensiveTeam_PHI,defensiveTeam_PIT,defensiveTeam_SEA,defensiveTeam_SF,defensiveTeam_TB,defensiveTeam_TEN,defensiveTeam_WAS,yardlineSide_ATL,yardlineSide_BAL,yardlineSide_BUF,yardlineSide_CAR,yardlineSide_CHI,yardlineSide_CIN,yardlineSide_CLE,yardlineSide_DAL,yardlineSide_DEN,yardlineSide_DET,yardlineSide_GB,yardlineSide_HOU,yardlineSide_IND,yardlineSide_JAX,yardlineSide_KC,yardlineSide_LA,yardlineSide_LAC,yardlineSide_LV,yardlineSide_MIA,yardlineSide_MIN,yardlineSide_NE,yardlineSide_NO,yardlineSide_NYG,yardlineSide_NYJ,yardlineSide_PHI,yardlineSide_PIT,yardlineSide_SEA,yardlineSide_SF,yardlineSide_TB,yardlineSide_TEN,yardlineSide_UNK,yardlineSide_WAS,passResult_I,passResult_IN,passResult_R,passResult_S,offenseFormation_I_FORM,offenseFormation_JUMBO,offenseFormation_PISTOL,offenseFormation_SHOTGUN,offenseFormation_SINGLEBACK,offenseFormation_WILDCAT,dropBackType_DESIGNED_ROLLOUT_RIGHT,dropBackType_DESIGNED_RUN,dropBackType_SCRAMBLE,dropBackType_SCRAMBLE_ROLLOUT_LEFT,dropBackType_SCRAMBLE_ROLLOUT_RIGHT,dropBackType_TRADITIONAL,dropBackType_UNKNOWN,pff_passCoverageType_Other,pff_passCoverageType_Zone
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False


The dummy columns come back as booleans (the True/False values above), and that caused some issues, so let's cast the boolean columns to integers.

In [68]:
bool_cols = BDB_Dummies.select_dtypes(include='bool').columns
BDB_Dummies[bool_cols] = BDB_Dummies[bool_cols].astype(int)
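
As an aside, the cast can also be folded into the encoding step, since `pd.get_dummies` accepts a `dtype` argument. The sketch below uses a tiny hypothetical frame (not the BDB data) to show that both routes give the same result.

```python
import pandas as pd

# Hypothetical sample column, for illustration only
toy = pd.DataFrame({"passResult": ["C", "I", "S", "C"]})

# Route 1: encode first, then cast (what the cell above does)
as_bool = pd.get_dummies(toy, drop_first=True)
cast_after = as_bool.astype(int)

# Route 2: ask get_dummies for integer columns directly
as_int = pd.get_dummies(toy, drop_first=True, dtype=int)

print(as_int.dtypes)
print(cast_after.equals(as_int))
```

Either route works; the second just saves the extra `select_dtypes` and `astype` step.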

Then let's concatenate the dummy variables with the numeric features.

In [69]:
BDB_All_Plays_Clean_Numeric = numerify(BDB_All_Plays_Clean)
BDB_All_Plays_Model_Ready = pd.concat([BDB_All_Plays_Clean_Numeric, BDB_Dummies], axis=1)

print(BDB_All_Plays_Model_Ready.shape)
BDB_All_Plays_Model_Ready.head(5)

(8549, 128)


Unnamed: 0,quarter,down,yardsToGo,yardlineNumber,preSnapHomeScore,preSnapVisitorScore,penaltyYards,prePenaltyPlayResult,playResult,absoluteYardlineNumber,defendersInBox,pff_playAction,Inj_Occured,foul_on_play,frac_quarter_elapsed,possessionTeam_ATL,possessionTeam_BAL,possessionTeam_BUF,possessionTeam_CAR,possessionTeam_CHI,possessionTeam_CIN,possessionTeam_CLE,possessionTeam_DAL,possessionTeam_DEN,possessionTeam_DET,possessionTeam_GB,possessionTeam_HOU,possessionTeam_IND,possessionTeam_JAX,possessionTeam_KC,possessionTeam_LA,possessionTeam_LAC,possessionTeam_LV,possessionTeam_MIA,possessionTeam_MIN,possessionTeam_NE,possessionTeam_NO,possessionTeam_NYG,possessionTeam_NYJ,possessionTeam_PHI,possessionTeam_PIT,possessionTeam_SEA,possessionTeam_SF,possessionTeam_TB,possessionTeam_TEN,possessionTeam_WAS,defensiveTeam_ATL,defensiveTeam_BAL,defensiveTeam_BUF,defensiveTeam_CAR,defensiveTeam_CHI,defensiveTeam_CIN,defensiveTeam_CLE,defensiveTeam_DAL,defensiveTeam_DEN,defensiveTeam_DET,defensiveTeam_GB,defensiveTeam_HOU,defensiveTeam_IND,defensiveTeam_JAX,defensiveTeam_KC,defensiveTeam_LA,defensiveTeam_LAC,defensiveTeam_LV,defensiveTeam_MIA,defensiveTeam_MIN,defensiveTeam_NE,defensiveTeam_NO,defensiveTeam_NYG,defensiveTeam_NYJ,defensiveTeam_PHI,defensiveTeam_PIT,defensiveTeam_SEA,defensiveTeam_SF,defensiveTeam_TB,defensiveTeam_TEN,defensiveTeam_WAS,yardlineSide_ATL,yardlineSide_BAL,yardlineSide_BUF,yardlineSide_CAR,yardlineSide_CHI,yardlineSide_CIN,yardlineSide_CLE,yardlineSide_DAL,yardlineSide_DEN,yardlineSide_DET,yardlineSide_GB,yardlineSide_HOU,yardlineSide_IND,yardlineSide_JAX,yardlineSide_KC,yardlineSide_LA,yardlineSide_LAC,yardlineSide_LV,yardlineSide_MIA,yardlineSide_MIN,yardlineSide_NE,yardlineSide_NO,yardlineSide_NYG,yardlineSide_NYJ,yardlineSide_PHI,yardlineSide_PIT,yardlineSide_SEA,yardlineSide_SF,yardlineSide_TB,yardlineSide_TEN,yardlineSide_UNK,yardlineSide_WAS,passResult_I,passResult_IN,passResult_R,passResult_S,offenseFormation_I_FORM,offenseFormation_JUMBO,offenseFormation_PISTOL,offenseFormation_SHOTGUN,offenseFormation_SINGLEBACK,offenseFormation_WILDCAT,dropBackType_DESIGNED_ROLLOUT_RIGHT,dropBackType_DESIGNED_RUN,dropBackType_SCRAMBLE,dropBackType_SCRAMBLE_ROLLOUT_LEFT,dropBackType_SCRAMBLE_ROLLOUT_RIGHT,dropBackType_TRADITIONAL,dropBackType_UNKNOWN,pff_passCoverageType_Other,pff_passCoverageType_Zone
0,1,3,2,33,0,0,0.0,0,0,43.0,6.0,0,0,0,0.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
2,1,2,6,34,0,0,0.0,5,5,76.0,6.0,0,0,0,0.17,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
3,1,1,10,39,0,0,0.0,0,0,49.0,6.0,1,0,0,0.34,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1
4,1,3,15,44,0,0,0.0,0,0,54.0,7.0,0,0,0,0.35,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
5,1,2,5,11,0,0,0.0,10,10,21.0,6.0,0,0,0,0.41,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
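
Since the row index above skips a value (0, 2, 3, ...), it is worth remembering that `pd.concat(..., axis=1)` aligns on the index rather than on row position. Here both pieces come from the same frame, so they line up, but a quick sanity check is cheap. The frames below are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical frames with a gap in the index (row 1 was filtered out earlier)
numeric = pd.DataFrame({"down": [3, 2, 1]}, index=[0, 2, 3])
dummies = pd.DataFrame({"passResult_I": [0, 1, 0]}, index=[0, 2, 3])

# Matching indexes: rows line up one-to-one
combined = pd.concat([numeric, dummies], axis=1)
aligned_ok = combined.shape[0] == len(numeric) and not combined.isna().any().any()

# For contrast: resetting only one index breaks the alignment,
# which adds rows and introduces NaNs
misaligned = pd.concat([numeric, dummies.reset_index(drop=True)], axis=1)

print(aligned_ok, combined.shape, misaligned.shape)
```

A row count that matches the inputs and no NaNs in the result are a reasonable sign that the concatenation did what was intended.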
