# **Week 3**
### **Forward and Backward Feature Selection, PCA/R, PLCR**

Quick note here: 

Again, the main task of our models and datasets is to see whether we can meaningfully predict if an injury will occur on a given play in the NFL, and then do a deep dive on which conditions are most likely to lead to those injuries. 

We have three datasets looking at different types of injuries in the NFL: 
- **First and Future:** Which looks at lower extremity injuries
- **Punt Data Analytics:** Which looks at head injuries during punt plays
- **Big Data Bowl:** Which looks at a variety of injuries in the NFL

In [101]:
# Standard Libraries
import os
import re
import time
import math
import io
import zipfile
import requests
from urllib.parse import urlparse
from itertools import chain, combinations
from joblib import Memory
import warnings
from sklearn.exceptions import ConvergenceWarning

# Data Science Libraries
import numpy as np
import pandas as pd
import seaborn as sns

# Visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.ticker as mticker  # Optional: axis tick formatting

# SK Learn
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.decomposition import PCA

#### **Helpful Functions**

In [132]:
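def list_strip(names):
    """Drop the ColumnTransformer prefix (e.g. 'num__', 'cat__') from each feature name."""
    return [re.sub(r'^[^_]+__', '', n) for n in names]


# Note: catch_warnings() only silences warnings raised inside its block,
# so this cell on its own suppresses nothing; most fit calls below are
# wrapped in their own blocks instead.
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)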


#### **Import Datasets**

In [None]:
BDB_All_Plays_Model_Ready = pd.read_csv("BDB_All_Plays_Model_Ready.csv") # Big Data Bowl Dataset
PDA_Model_Ready = pd.read_csv("PDA_Model_Ready.csv") # Punt Data Analytics
FNF_Model_Ready = pd.read_csv("FNF_Model_Ready.csv") # First and Future

## **Forward Feature Selection**

We used boilerplate forward feature selection in our previous modules, but that was written specifically for regression. This attempts to use scikit-learn's built-in functionality to tailor it to our classification task. 


First, we need to separate our numeric data from our one-hot encoded categorical data and the target variable. Then we feed that into a pipeline that scales the numeric features and runs the feature selection. 
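
As a quick illustration of that split, here is a minimal sketch on a made-up toy frame (the column names and values are hypothetical, not from our datasets). It applies the same `nunique()` rule: more than two unique values counts as numeric, exactly two counts as one-hot.

```python
import pandas as pd

# Toy frame with made-up values, only to show how the nunique() rule sorts columns.
toy = pd.DataFrame({
    "speed": [1.2, 3.4, 2.2, 5.1],   # continuous: more than 2 unique values
    "is_synthetic": [0, 1, 1, 0],    # binary / one-hot: exactly 2 unique values
    "always_zero": [0, 0, 0, 0],     # constant: only 1 unique value
})

nuniques = toy.nunique(dropna=True)
numeric_cols = nuniques.index[nuniques > 2].tolist()
onehot_cols = nuniques.index[nuniques == 2].tolist()
leftover = nuniques.index[nuniques < 2].tolist()

print(numeric_cols)  # ['speed']
print(onehot_cols)   # ['is_synthetic']
print(leftover)      # ['always_zero']
```

One thing to keep in mind: a constant column lands in neither list, and since `ColumnTransformer` defaults to `remainder='drop'`, it silently disappears from the model inputs. A numeric column that happens to take only two values would also be treated as one-hot and left unscaled.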

In [79]:
# ==========================================================
# Looked up a Standard Scaler to Log Regression here: 
# https://www.google.com/search?q=sklearn+pipeline+and+forward+feature+selection&sca_esv=fd2d5e0aca235bbf&rlz=1C5CHFA_enUS1112US1112&ei=szTtaMqHNPiq0PEP3p3x0QI&oq=sklearn+pipeline+and+forward&gs_lp=Egxnd3Mtd2l6LXNlcnAiHHNrbGVhcm4gcGlwZWxpbmUgYW5kIGZvcndhcmQqAggAMgUQIRigATIFECEYoAEyBRAhGKABMgUQIRigATIFECEYnwVIk2FQiAlYhlVwA3gBkAEAmAG9AaAB4RuqAQQwLjI4uAEDyAEA-AEBmAIfoAL_HMICChAAGLADGNYEGEfCAg0QABiABBiRAhiKBRgKwgIKEAAYgAQYQxiKBcICDhAuGIAEGLEDGNEDGMcBwgIWEC4YgAQYsQMY0QMYQxiDARjHARiKBcICExAuGIAEGLEDGNEDGEMYxwEYigXCAg0QABiABBixAxhDGIoFwgIIEAAYgAQYsQPCAgUQLhiABMICBRAAGIAEwgIOEAAYgAQYkQIYsQMYigXCAgsQABiABBiRAhiKBcICDBAAGIAEGEMYigUYCsICBhAAGBYYHsICCBAAGKIEGIkFwgIFEAAY7wXCAggQABiABBiiBMICCxAAGIAEGIYDGIoFwgIHECEYoAEYCpgDAIgGAZAGCJIHBDMuMjigB76yAbIHBDAuMji4B_EcwgcHMC4xMC4yMcgHbQ&sclient=gws-wiz-serp
# 
# Pre-Processor was added with help from ChatGPT: 
# https://chatgpt.com/share/68ed4580-9b74-800f-b5d0-f817ffafccaa
# ==========================================================

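# NOTE: X must already be defined (run one of the dataset cells first).
# The column lists are tied to that dataset, so re-run this cell after switching datasets.
nuniques = X.nunique(dropna=True)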
numeric_cols = nuniques.index[nuniques > 2].tolist()
onehot_cols  = nuniques.index[nuniques == 2].tolist()

pre = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', 'passthrough', onehot_cols)
])

logreg = LogisticRegression(
    class_weight='balanced', solver='liblinear', C=0.2,
    penalty='l2', max_iter=2000, tol=1e-2, random_state=42
)

cv = StratifiedKFold(3, shuffle=True, random_state=42)

sfs = SequentialFeatureSelector(
    estimator=logreg,
    n_features_to_select=30,   # Selecting all was yielding run times of > 1 hr
    direction='forward',
    scoring='roc_auc',
    cv=cv,
    n_jobs=-1
)

pipe = Pipeline([
    ('pre', pre),
    ('sfs', sfs),
    ('model', logreg)
])

And now we'll fit that pipeline on the Big Data Bowl dataset. 

In [49]:
X = BDB_All_Plays_Model_Ready.drop(columns=['Inj_Occured'])
y = BDB_All_Plays_Model_Ready['Inj_Occured']

pipe.fit(X, y)
mask = pipe.named_steps['sfs'].get_support()
feat_names = pipe.named_steps['pre'].get_feature_names_out()
bdb_forward_selected = [f for f, m in zip(feat_names, mask) if m]
print(bdb_forward_selected)




['num__preSnapHomeScore', 'num__playResult', 'cat__possessionTeam_CIN', 'cat__possessionTeam_CLE', 'cat__possessionTeam_LV', 'cat__possessionTeam_WAS', 'cat__defensiveTeam_CHI', 'cat__defensiveTeam_DAL', 'cat__defensiveTeam_JAX', 'cat__defensiveTeam_LV', 'cat__defensiveTeam_NO', 'cat__defensiveTeam_NYJ', 'cat__defensiveTeam_PIT', 'cat__yardlineSide_JAX', 'cat__yardlineSide_KC', 'cat__yardlineSide_MIA', 'cat__yardlineSide_MIN', 'cat__yardlineSide_NYJ', 'cat__yardlineSide_PIT', 'cat__yardlineSide_TEN', 'cat__yardlineSide_UNK', 'cat__passResult_R', 'cat__passResult_S', 'cat__offenseFormation_JUMBO', 'cat__offenseFormation_SHOTGUN', 'cat__offenseFormation_WILDCAT', 'cat__dropBackType_DESIGNED_ROLLOUT_RIGHT', 'cat__dropBackType_DESIGNED_RUN', 'cat__dropBackType_SCRAMBLE_ROLLOUT_RIGHT', 'cat__pff_passCoverageType_Other']


In [None]:
bdb_forward_selected

And forward selection for the First and Future dataset: 

In [78]:
X = FNF_Model_Ready.drop(columns=['Inj_Occured'])
y = FNF_Model_Ready['Inj_Occured']

In [80]:
X = FNF_Model_Ready.drop(columns=['Inj_Occured'])
y = FNF_Model_Ready['Inj_Occured']

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)

    pipe.fit(X, y)

mask = pipe.named_steps['sfs'].get_support()
feat_names = pipe.named_steps['pre'].get_feature_names_out()
fnf_forward_selected = [f for f, m in zip(feat_names, mask) if m]  
print()  
print(fnf_forward_selected)


['num__PlayerDay', 'num__Temperature', 'num__PlayerGamePlay', 'num__x', 'num__speed', 'cat__StadiumType_Outdoors', 'cat__FieldType_Synthetic', 'cat__Weather_Fog', 'cat__Weather_N/A (Indoors)', 'cat__Weather_Partly Cloudy', 'cat__Weather_Rain', 'cat__Weather_Snow', 'cat__PlayType_Field Goal', 'cat__PlayType_Kickoff', 'cat__PlayType_Pass', 'cat__PlayType_Punt', 'cat__PlayType_Rush', 'cat__PlayType_Unknown', 'cat__Position_CB', 'cat__Position_FS', 'cat__Position_G', 'cat__Position_HB', 'cat__Position_K', 'cat__Position_MLB', 'cat__Position_Missing Data', 'cat__Position_NT', 'cat__Position_P', 'cat__Position_QB', 'cat__Position_S', 'cat__Position_T']


And then for the Punt Data Analytics Dataset: 

In [75]:
X = PDA_Model_Ready.drop(columns=['Inj_Occured'])
y = PDA_Model_Ready['Inj_Occured']

In [77]:
X = PDA_Model_Ready.drop(columns=['Inj_Occured'])
y = PDA_Model_Ready['Inj_Occured']

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)

    pipe.fit(X, y)
    
mask = pipe.named_steps['sfs'].get_support()
feat_names = pipe.named_steps['pre'].get_feature_names_out()
pda_forward_selected = [f for f, m in zip(feat_names, mask) if m]
print(pda_forward_selected)

['num__yardline_100', 'cat__Game_Day_Wednesday', 'cat__Start_Time_13:00', 'cat__Start_Time_17:00', 'cat__Start_Time_20:00', 'cat__Visit_Team_Atlanta Falcons', 'cat__Visit_Team_Cleveland Browns', 'cat__Visit_Team_Detroit Lions', 'cat__Visit_Team_Houston Texans', 'cat__Visit_Team_Indianapolis Colts', 'cat__Visit_Team_Los Angeles Rams', 'cat__Visit_Team_Miami Dolphins', 'cat__Visit_Team_Minnesota Vikings', 'cat__Visit_Team_New England Patriots', 'cat__Visit_Team_New Orleans Saints', 'cat__Visit_Team_New York Jets', 'cat__Visit_Team_Philadelphia Eagles', 'cat__Visit_Team_Pittsburgh Steelers', 'cat__Visit_Team_Seattle Seahawks', 'cat__Visit_Team_Tampa Bay Buccaneers', 'cat__Visit_Team_Washington Redskins', 'cat__StadiumType_Outdoors', 'cat__GameWeather_Fog', 'cat__GameWeather_N/A (Indoors)', 'cat__GameWeather_Rain', 'cat__GameWeather_Snow', 'cat__month_February', 'cat__month_November', 'cat__month_October', 'cat__month_September']


#### **Final Forward-Selected Datasets**

We'll subset each dataset to its selected features and then use these in future weeks.

In [None]:

bdb_cols_clean = list_strip(bdb_forward_selected)
pda_cols_clean = list_strip(pda_forward_selected)
fnf_cols_clean = list_strip(fnf_forward_selected)

# now subset with cleaned names
BDB_Forward_Features = BDB_All_Plays_Model_Ready[bdb_cols_clean]
PDA_Forward_Features = PDA_Model_Ready[pda_cols_clean]
FNF_Forward_Features = FNF_Model_Ready[fnf_cols_clean]
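
To make the name cleaning concrete, here is a small standalone check of what the `list_strip` regex does to a few sample feature names (the helper is repeated so the snippet runs on its own):

```python
import re


def list_strip(names):
    # Same regex as the helper defined earlier in this notebook.
    return [re.sub(r'^[^_]+__', '', n) for n in names]


sample = ['num__speed', 'cat__PlayType_Punt', 'num_pca__pca0']
cleaned = list_strip(sample)
print(cleaned)  # ['speed', 'PlayType_Punt', 'num_pca__pca0']
```

The pattern only removes a prefix that contains no underscore before the double underscore, so `num__` and `cat__` are stripped while a transformer name like `num_pca` is left in place.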

## **Backward Feature Selection**

Now we'll do the same thing in the opposite direction. We'll start with the code from earlier and flip the direction in the sequential feature selector. 

In [95]:
# ==========================================================
# Looked up a Standard Scaler to Log Regression here: 
# https://www.google.com/search?q=sklearn+pipeline+and+forward+feature+selection&sca_esv=fd2d5e0aca235bbf&rlz=1C5CHFA_enUS1112US1112&ei=szTtaMqHNPiq0PEP3p3x0QI&oq=sklearn+pipeline+and+forward&gs_lp=Egxnd3Mtd2l6LXNlcnAiHHNrbGVhcm4gcGlwZWxpbmUgYW5kIGZvcndhcmQqAggAMgUQIRigATIFECEYoAEyBRAhGKABMgUQIRigATIFECEYnwVIk2FQiAlYhlVwA3gBkAEAmAG9AaAB4RuqAQQwLjI4uAEDyAEA-AEBmAIfoAL_HMICChAAGLADGNYEGEfCAg0QABiABBiRAhiKBRgKwgIKEAAYgAQYQxiKBcICDhAuGIAEGLEDGNEDGMcBwgIWEC4YgAQYsQMY0QMYQxiDARjHARiKBcICExAuGIAEGLEDGNEDGEMYxwEYigXCAg0QABiABBixAxhDGIoFwgIIEAAYgAQYsQPCAgUQLhiABMICBRAAGIAEwgIOEAAYgAQYkQIYsQMYigXCAgsQABiABBiRAhiKBcICDBAAGIAEGEMYigUYCsICBhAAGBYYHsICCBAAGKIEGIkFwgIFEAAY7wXCAggQABiABBiiBMICCxAAGIAEGIYDGIoFwgIHECEYoAEYCpgDAIgGAZAGCJIHBDMuMjigB76yAbIHBDAuMji4B_EcwgcHMC4xMC4yMcgHbQ&sclient=gws-wiz-serp
# 
# Pre-Processor was added with help from ChatGPT: 
# https://chatgpt.com/share/68ed4580-9b74-800f-b5d0-f817ffafccaa
# ==========================================================

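# NOTE: X must already be defined (run one of the dataset cells first).
# The column lists are tied to that dataset, so re-run this cell after switching datasets.
nuniques = X.nunique(dropna=True)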
numeric_cols = nuniques.index[nuniques > 2].tolist()
onehot_cols  = nuniques.index[nuniques == 2].tolist()

pre = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', 'passthrough', onehot_cols)
])

logreg = LogisticRegression(
    class_weight='balanced', solver='liblinear',
    penalty='l2', max_iter=50_000, tol=1e-2, random_state=42
)

cv = StratifiedKFold(3, shuffle=True, random_state=42)

sfs = SequentialFeatureSelector(
    estimator=logreg,
    n_features_to_select='auto',
    direction='backward',
    scoring='roc_auc',
    cv=cv,
    n_jobs=-1
)

pipe = Pipeline([
    ('pre', pre),
    ('sfs', sfs),
    ('model', logreg)
])

And then we'll call that for the Big Data Bowl set: 

In [None]:
X = BDB_All_Plays_Model_Ready.drop(columns=['Inj_Occured'])
y = BDB_All_Plays_Model_Ready['Inj_Occured']

In [90]:
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)

    pipe.fit(X, y)

mask = pipe.named_steps['sfs'].get_support()
feat_names = pipe.named_steps['pre'].get_feature_names_out()
bdb_backward_selected = [f for f, m in zip(feat_names, mask) if m]
print(bdb_backward_selected)

['num__down', 'num__yardlineNumber', 'num__preSnapHomeScore', 'num__penaltyYards', 'num__playResult', 'num__absoluteYardlineNumber', 'cat__foul_on_play', 'cat__possessionTeam_BAL', 'cat__possessionTeam_BUF', 'cat__possessionTeam_CAR', 'cat__possessionTeam_CHI', 'cat__possessionTeam_CLE', 'cat__possessionTeam_DAL', 'cat__possessionTeam_DET', 'cat__possessionTeam_MIA', 'cat__possessionTeam_NE', 'cat__possessionTeam_SEA', 'cat__possessionTeam_WAS', 'cat__defensiveTeam_ATL', 'cat__defensiveTeam_BAL', 'cat__defensiveTeam_CHI', 'cat__defensiveTeam_CLE', 'cat__defensiveTeam_GB', 'cat__defensiveTeam_JAX', 'cat__defensiveTeam_KC', 'cat__defensiveTeam_LAC', 'cat__defensiveTeam_LV', 'cat__defensiveTeam_MIA', 'cat__defensiveTeam_NO', 'cat__defensiveTeam_NYJ', 'cat__defensiveTeam_PIT', 'cat__defensiveTeam_SF', 'cat__yardlineSide_CHI', 'cat__yardlineSide_DAL', 'cat__yardlineSide_DET', 'cat__yardlineSide_GB', 'cat__yardlineSide_JAX', 'cat__yardlineSide_KC', 'cat__yardlineSide_LV', 'cat__yardlineSide_

And then the First and Future set: 

In [91]:
X = FNF_Model_Ready.drop(columns=['Inj_Occured'])
y = FNF_Model_Ready['Inj_Occured']

In [None]:
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)

    pipe.fit(X, y)
    
mask = pipe.named_steps['sfs'].get_support()
feat_names = pipe.named_steps['pre'].get_feature_names_out()
fnf_backward_selected = [f for f, m in zip(feat_names, mask) if m]
print(fnf_backward_selected)

['num__PlayerDay', 'num__Temperature', 'num__PlayerGamePlay', 'num__x', 'num__speed', 'cat__StadiumType_Outdoors', 'cat__FieldType_Synthetic', 'cat__Weather_Fog', 'cat__Weather_N/A (Indoors)', 'cat__Weather_Partly Cloudy', 'cat__Weather_Rain', 'cat__Weather_Snow', 'cat__PlayType_Kickoff', 'cat__PlayType_Pass', 'cat__PlayType_Punt', 'cat__PlayType_Rush', 'cat__Position_CB', 'cat__Position_G', 'cat__Position_HB', 'cat__Position_K', 'cat__Position_NT', 'cat__Position_P', 'cat__Position_QB', 'cat__Position_S']


And then the Punt Data Analytics set: 

In [94]:
X = PDA_Model_Ready.drop(columns=['Inj_Occured'])
y = PDA_Model_Ready['Inj_Occured']

In [96]:
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)

    pipe.fit(X, y)

mask = pipe.named_steps['sfs'].get_support()
feat_names = pipe.named_steps['pre'].get_feature_names_out()
pda_backward_selected = [f for f, m in zip(feat_names, mask) if m]
print(pda_backward_selected)

['num__yardline_100', 'cat__Game_Day_Wednesday', 'cat__Start_Time_13:00', 'cat__Start_Time_14:00', 'cat__Start_Time_15:00', 'cat__Start_Time_16:00', 'cat__Start_Time_19:00', 'cat__Start_Time_20:00', 'cat__Visit_Team_Atlanta Falcons', 'cat__Visit_Team_Cincinnati Bengals', 'cat__Visit_Team_Cleveland Browns', 'cat__Visit_Team_Detroit Lions', 'cat__Visit_Team_Houston Texans', 'cat__Visit_Team_Los Angeles Rams', 'cat__Visit_Team_Miami Dolphins', 'cat__Visit_Team_Minnesota Vikings', 'cat__Visit_Team_New England Patriots', 'cat__Visit_Team_New Orleans Saints', 'cat__Visit_Team_New York Jets', 'cat__Visit_Team_Pittsburgh Steelers', 'cat__Visit_Team_San Francisco 49ers', 'cat__Visit_Team_Seattle Seahawks', 'cat__Visit_Team_Tampa Bay Buccaneers', 'cat__Visit_Team_Washington Redskins', 'cat__StadiumType_Outdoors', 'cat__GameWeather_Fog', 'cat__GameWeather_N/A (Indoors)', 'cat__GameWeather_Rain', 'cat__GameWeather_Snow', 'cat__month_December', 'cat__month_February', 'cat__month_November', 'cat__mo

And let's throw 'em into dataframes as well: 

In [None]:

bdb_back_cols_clean = list_strip(bdb_backward_selected)
pda_back_cols_clean = list_strip(pda_backward_selected)
fnf_back_cols_clean = list_strip(fnf_backward_selected)

BDB_back_Features = BDB_All_Plays_Model_Ready[bdb_back_cols_clean]
PDA_back_Features = PDA_Model_Ready[pda_back_cols_clean]
FNF_back_Features = FNF_Model_Ready[fnf_back_cols_clean]

_____


## **Principal Component ~~Regression~~ Classification**

The idea here is to run PCA on the numeric columns, concatenate the components back with the one-hot encodings, and then run a logistic regression on the resulting dataset.

We can rework the pipeline from earlier for this: we remove the bits with cross-validation, so the pipeline only has the scaling step and the PCA step (set to retain 95% of the variance).


So the steps are: split the full feature set into categorical (one-hot) columns and numeric columns, scale the numeric columns and run PCA on them, keep the components that explain about 95% of the variance, drop the rest, and then tack the result back onto the one-hot columns. 
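
To show what "keep about 95% of the variance" means in practice, here is a minimal sketch on synthetic numeric data (six correlated made-up columns, not our features). Passing a float to `n_components` tells `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
# Six columns built from three underlying signals plus small noise, so they are correlated.
X_num = np.hstack([base, base + 0.05 * rng.normal(size=(500, 3))])

X_scaled = StandardScaler().fit_transform(X_num)
pca = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

# Cumulative share of variance explained by the retained components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative.round(3))
```

Scaling first matters here: without it, columns measured on larger scales would dominate the leading components.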


In previous weeks we used a run_model_classifier() wrapper function to run cross-validation on these. So once we have the dataframes, we can train/test split them and run them through that wrapper to see the relative effect of each. 
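
One caveat worth flagging: if the scaler and PCA are fit on the full dataset before splitting, they see rows that later land in the test fold. A possible alternative is to keep the preprocessing inside the model pipeline so it is refit on each training fold. The sketch below uses `cross_val_score` as a stand-in for run_model_classifier() (which isn't shown here) and synthetic data from `make_classification`; both are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: six continuous columns plus one binary flag.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=2, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_demo["flag"] = (X_demo["f0"] > 0).astype(int)

numeric_cols = [c for c in X_demo.columns if X_demo[c].nunique() > 2]
onehot_cols = [c for c in X_demo.columns if X_demo[c].nunique() == 2]

num_branch = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95, svd_solver="full")),
])
pre = ColumnTransformer([
    ("num_pca", num_branch, numeric_cols),
    ("cat", "passthrough", onehot_cols),
])

# Preprocessing lives inside the pipeline, so each CV fold fits it on training rows only.
clf = Pipeline([
    ("pre", pre),
    ("model", LogisticRegression(class_weight="balanced", solver="liblinear")),
])

cv = StratifiedKFold(3, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(scores.mean())
```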

In [None]:
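# NOTE: X must already be defined (run one of the dataset cells first).
# The column lists are tied to that dataset, so re-run this cell after switching datasets.
nuniques = X.nunique(dropna=True)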
numeric_cols = nuniques.index[nuniques > 2].tolist()
onehot_cols  = nuniques.index[nuniques == 2].tolist()

# Pipeline
num_branch = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95, svd_solver='full', random_state=42))
])


pre = ColumnTransformer([
    ('num_pca', num_branch, numeric_cols),
    ('cat', 'passthrough', onehot_cols)
], remainder='drop')


And then let's call that and get the new features: 

For the Big Data Bowl: 

In [110]:
target = 'Inj_Occured'
X = BDB_All_Plays_Model_Ready.drop(columns=['Inj_Occured'])
y = BDB_All_Plays_Model_Ready['Inj_Occured']

In [112]:

Z = pre.fit_transform(X)
feat_names = pre.get_feature_names_out()

cols_clean = list_strip(feat_names)

Z_df = pd.DataFrame(Z, columns=cols_clean, index=X.index)
BDB_PCA_Features = Z_df.join(y)

In [113]:
BDB_PCA_Features.head()

Unnamed: 0,num_pca__pca0,num_pca__pca1,num_pca__pca2,num_pca__pca3,num_pca__pca4,num_pca__pca5,num_pca__pca6,num_pca__pca7,num_pca__pca8,pff_playAction,...,offenseFormation_WILDCAT,dropBackType_DESIGNED_ROLLOUT_RIGHT,dropBackType_DESIGNED_RUN,dropBackType_SCRAMBLE,dropBackType_SCRAMBLE_ROLLOUT_LEFT,dropBackType_SCRAMBLE_ROLLOUT_RIGHT,dropBackType_UNKNOWN,pff_passCoverageType_Other,pff_passCoverageType_Zone,Inj_Occured
0,-2.331976,-1.447836,0.94766,-0.161869,0.888649,-0.666966,0.036838,-0.577322,0.078585,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,-2.046137,-0.133238,0.181787,0.789121,-0.163614,-0.285742,0.199552,-0.508909,0.057577,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
2,-1.594351,0.760391,-1.129086,-0.081369,0.616551,-0.00918,-0.035066,-0.872549,0.007977,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3,-1.671998,0.479033,-0.564723,0.608799,1.3516,0.145544,-1.442641,1.033279,0.181504,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
4,-1.754725,-0.599952,0.36275,-2.15009,-0.107079,0.100393,0.750206,0.004444,-0.0285,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


And for First and Future: 

In [127]:
target = 'Inj_Occured'
X = FNF_Model_Ready.drop(columns=['Inj_Occured'])
y = FNF_Model_Ready['Inj_Occured']

In [129]:
Z = pre.fit_transform(X)
feat_names = pre.get_feature_names_out()

cols_clean = list_strip(feat_names)

Z_df = pd.DataFrame(Z, columns=cols_clean, index=X.index)
FNF_PCA_Features = Z_df.join(y)

In [130]:
FNF_PCA_Features.head()

Unnamed: 0,num_pca__pca0,num_pca__pca1,num_pca__pca2,num_pca__pca3,num_pca__pca4,num_pca__pca5,num_pca__pca6,num_pca__pca7,StadiumType_Outdoors,StadiumType_Unknown,...,Position_OLB,Position_P,Position_QB,Position_RB,Position_S,Position_SS,Position_T,Position_TE,Position_WR,Inj_Occured
0,-1.254648,1.181344,0.683035,-0.847518,0.372132,-0.796001,-1.06699,-1.04628,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,-0.603785,1.170289,1.225442,-0.685322,0.413524,-0.220237,-1.201717,-0.975325,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,-1.650825,1.133402,0.625745,-0.771552,0.415045,-0.407799,-1.199267,-1.014177,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,-1.027815,0.642432,-0.054762,0.187475,0.035836,-0.720262,-1.461661,-1.877251,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,-0.425107,1.066156,0.827199,-0.677716,-0.247306,0.508857,-1.33516,-0.691741,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0


And for Punt Data Analytics:

In [121]:
target = 'Inj_Occured'
X = PDA_Model_Ready.drop(columns=['Inj_Occured'])
y = PDA_Model_Ready['Inj_Occured']

In [123]:
Z = pre.fit_transform(X)
feat_names = pre.get_feature_names_out()

cols_clean = list_strip(feat_names)

Z_df = pd.DataFrame(Z, columns=cols_clean, index=X.index)
PDA_PCA_Features = Z_df.join(y)

In [124]:
PDA_PCA_Features.head()

Unnamed: 0,num_pca__pca0,num_pca__pca1,num_pca__pca2,num_pca__pca3,Game_Day_Monday,Game_Day_Saturday,Game_Day_Sunday,Game_Day_Thursday,Game_Day_Wednesday,Start_Time_13:00,...,GameWeather_Snow,GameWeather_Sunny,GameWeather_Unknown,month_December,month_February,month_January,month_November,month_October,month_September,Inj_Occured
0,-1.70507,-0.383385,1.184063,0.63119,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.159773,-1.040541,0.118457,1.287283,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,0.22382,-0.968049,-0.776093,1.292554,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,0.670837,2.126908,0.473424,1.558459,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,1.361204,-1.307729,-0.825215,1.015366,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


____

## **Partial Least Squares Regression**

Partial Least Squares Regression is a *regression* method: it builds components that maximize the covariance between the features and the target, whereas PCA (which we'll use for logistic regression) maximizes the variance of the features alone. As a regression method, it's not directly applicable to a binary classification task. 


While I was able to write a wrapper in week 2 to adapt other regression models to work as binary classifiers, they performed abysmally, and I figured the juice was not worth the squeeze in this case either: forward, backward, and PCA (on numeric) features were already used, and this would not contribute further to the analysis. 