# **Week 4 - Capstone Development**

#### **Logistic Regression and Feature Scaling:**

So this is awkward, but I already covered Logistic Regression and feature scaling in Week 2. Furthermore, in Week 3, I ran a forward feature selection and a backward feature selection, and I also built a joined dataset: the numerical columns were separated out, a PCA was performed on them, and the resulting components were paired back up with the one-hot encoded data.

In this week, we will find the feature set to rule them all :) 
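
For reference, the Week 3 "PCA on the numeric columns, then rejoin with the one-hot columns" step can be sketched roughly as below. This is a minimal illustration, not the actual Week 3 pipeline: the helper name `pca_numeric_then_rejoin`, the toy column names, and the choice of two components are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_numeric_then_rejoin(df, numeric_cols, n_components=2):
    """Hypothetical helper: PCA on the numeric columns only, then rejoin the rest."""
    # Scale first so no single numeric column dominates the components
    scaled = StandardScaler().fit_transform(df[numeric_cols])
    components = PCA(n_components=n_components).fit_transform(scaled)
    pca_df = pd.DataFrame(
        components,
        columns=[f"PC{i + 1}" for i in range(n_components)],
        index=df.index,
    )
    # One-hot columns and the target pass through unchanged
    passthrough = df.drop(columns=numeric_cols)
    return pd.concat([pca_df, passthrough], axis=1)


# Toy data with made-up column names (only Inj_Occured matches the real target name)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "speed": rng.normal(size=50),
    "distance": rng.normal(size=50),
    "temperature": rng.normal(size=50),
    "surface_turf": rng.integers(0, 2, size=50),
    "Inj_Occured": rng.integers(0, 2, size=50),
})
reduced = pca_numeric_then_rejoin(toy, ["speed", "distance", "temperature"])
print(reduced.columns.tolist())
```

The three numeric columns collapse into `PC1` and `PC2`, while the one-hot column and the target keep their original values.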

In [2]:
# Standard Libraries
import os
import time
import math
import io
import zipfile
import requests
from urllib.parse import urlparse
from itertools import chain, combinations

# Data Science Libraries
import numpy as np
import pandas as pd
import seaborn as sns

# Visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.ticker as mticker  # Optional: Format y-axis labels as dollars

# Scikit-learn (Machine Learning)
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV,
    RepeatedStratifiedKFold,
    RepeatedKFold
)

from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, root_mean_squared_error, accuracy_score, f1_score, roc_auc_score, balanced_accuracy_score
from sklearn.feature_selection import SequentialFeatureSelector, f_regression, SelectKBest
from sklearn.linear_model import LogisticRegression, Lasso, RidgeClassifier, ElasticNet
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_predict, KFold

# Progress Tracking
from tqdm import tqdm

# =============================
# Global Variables
# =============================
random_state = 42

#### **Dataset Imports**

In [11]:
BDB_All_Plays_Model_Ready = pd.read_csv("/Users/leemcfarling/projects/ds_homework/AFL_Project/AFL_Final_Project/BDB_All_Plays_Model_Ready.csv") # Big Data Bowl Dataset

In [None]:
# === PCA Feature Sets ===
PDA_PCA_Features = pd.read_csv('../../Feature_Subsets/PDA_PCA_Features.csv')
FNF_PCA_Features = pd.read_csv('../../Feature_Subsets/FNF_PCA_Features.csv')
BDB_PCA_Features = pd.read_csv('../../Feature_Subsets/BDB_PCA_Features.csv')

# === Backward Feature Sets ===
FNF_back_Features = pd.read_csv('../../Feature_Subsets/FNF_back_Features.csv')
PDA_back_Features = pd.read_csv('../../Feature_Subsets/PDA_back_Features.csv')
BDB_back_Features = pd.read_csv('../../Feature_Subsets/BDB_back_Features.csv')

# === Forward Feature Sets ===
FNF_Forward_Features = pd.read_csv('../../Feature_Subsets/FNF_Forward_Features.csv')
PDA_Forward_Features = pd.read_csv('../../Feature_Subsets/PDA_Forward_Features.csv')
BDB_Forward_Features = pd.read_csv('../../Feature_Subsets/BDB_Forward_Features.csv')

____

#### **Function Definitions**

Function to take a provided dataframe and split that dataframe into feature and target columns. 

In [15]:
# ===========================================================================================
# Function taken from Module 3 Final Project
# https://github.com/LeeMcFarling/Final_Project_Writeup/blob/main/Final_Project_Report.ipynb
# ===========================================================================================

def train_test_split_data(df, target_col):
    X = df.drop(columns=target_col)
    y = df[target_col]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    return X_train, X_test, y_train, y_test
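
Because the injury class is so rare, a purely random 80/20 split can leave the test set with a slightly different injury rate than the training set. One possible variant, shown here only as a sketch, passes `stratify=y` so both halves keep the same class proportions. The name `stratified_split_data` and the toy data are assumptions for illustration; the notebook itself uses the unstratified function above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split_data(df, target_col, test_size=0.2, random_state=42):
    """Hypothetical variant of train_test_split_data that preserves class ratios."""
    X = df.drop(columns=target_col)
    y = df[target_col]
    # stratify=y keeps the share of each class equal in the train and test sets
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )


# Toy frame: 10 positives out of 100 rows
toy = pd.DataFrame({"feature": range(100), "Inj_Occured": [1] * 10 + [0] * 90})
X_tr, X_te, y_tr, y_te = stratified_split_data(toy, "Inj_Occured")
print(y_tr.mean(), y_te.mean())
```

Both printed rates match the overall 10% positive rate in the toy frame.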

____

#### **Lists to split Data into Numeric and Categorical Data**

Because the categorical variables are already one-hot encoded, here are lists to separate numeric and categorical data. 

#### **Standardization Function**

So, we would typically apply a standardization function here, but because all of our feature-subset datasets are already pre-processed (they were standardized before the feature selection and PCA pipelines), there is really no need to. Now we just need to train test split each one and go from there. 

#### **Run Model Classifier**

In [16]:
# =============================================================================================
# Taken from Mod 3 Week 8:
# https://github.com/waysnyder/Module-3-Assignments/blob/main/Homework_08.ipynb
# 
# Global dataframe logic taken from mod 3 final project: 
# https://github.com/LeeMcFarling/Final_Project_Writeup/blob/main/Final_Project_Report.ipynb
# 
# Final Function was developed in Week 2 of this Module
# =============================================================================================

def run_model_classifier(model, X_train, y_train, X_test, y_test, n_repeats=10, n_jobs=-1, run_comment=None, return_model=False, concat_results=False, **model_params):

    global combined_results
    # Remove extra key used to store error metric, if it was added to the parameter dictionary
    if 'accuracy_found' in model_params:
        model_params = model_params.copy()
        model_params.pop('accuracy_found', None)  
        
    # Instantiate the model if a class is provided
    if isinstance(model, type):
        model = model(**model_params)
    else:                                    
        model.set_params(**model_params)    

    # The model is always an instance by this point, so read the name from its class
    model_name = model.__class__.__name__

    # Use RepeatedStratifiedKFold for classification to preserve class distribution
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats, random_state=42)
    
    # Perform repeated stratified 5-fold cross-validation using accuracy as the scoring metric
    cv_scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=n_jobs)
    
    mean_cv_accuracy = np.mean(cv_scores)
    std_cv_accuracy  = np.std(cv_scores)
    
    # Fit the model on the full training set
    model.fit(X_train, y_train)
    
    # Compute training and testing accuracy
    train_preds    = model.predict(X_train)
    test_preds     = model.predict(X_test)

    # Normal Accuracy 
    train_accuracy = accuracy_score(y_train, train_preds)
    test_accuracy  = accuracy_score(y_test, test_preds)

    # Balanced Accuracy Metrics
    balanced_train_accuracy = balanced_accuracy_score(y_train, train_preds)
    balanced_test_accuracy = balanced_accuracy_score(y_test, test_preds)

    results_df = pd.DataFrame([{
        'model': model_name, 
        'model_params': model.get_params(),
        'mean_cv_accuracy': mean_cv_accuracy,
        'std_cv_accuracy': std_cv_accuracy,
        'train_accuracy': train_accuracy, 
        'test_accuracy': test_accuracy,
        'balanced_train_accuracy' : balanced_train_accuracy,
        'balanced_test_accuracy': balanced_test_accuracy,
        'run_comment': run_comment
    }])
    
    if concat_results:
        try:
            combined_results = pd.concat([combined_results, results_df], ignore_index=True)
        except NameError:
            combined_results = results_df

    return (results_df, model) if return_model else results_df

_____

#### **Prepare Data**

In the following cell we will do two things: 

1- Train Test Split the data - which will be stored in variables, X_train, X_test, etc. 

2- Standardize the data and **then** train test split it: This will be stored in variables X_train_scaled, X_test_scaled, etc. 
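
Step 2 relies on a `standardize_features` helper that is not defined in this notebook. A minimal sketch of what such a helper might look like is below. It assumes that any numeric feature with more than two distinct values should be scaled and that binary (one-hot) columns should be left alone; the real helper may differ. Note also that scaling before the split lets test rows influence the mean and standard deviation, so fitting the scaler on the training rows only is the stricter alternative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def standardize_features(df, target_column, numeric_columns=None):
    """Hypothetical sketch: z-score the numeric feature columns of a dataframe."""
    out = df.copy()
    if numeric_columns is None:
        # Assumption: columns with more than two distinct values are "numeric";
        # binary one-hot columns are left untouched
        candidates = out.drop(columns=target_column).select_dtypes(include="number")
        numeric_columns = [c for c in candidates.columns if candidates[c].nunique() > 2]
    if numeric_columns:
        out[numeric_columns] = StandardScaler().fit_transform(out[numeric_columns])
    return out


# Toy data with made-up feature names
toy = pd.DataFrame({
    "speed": [1.0, 2.0, 3.0, 4.0],
    "surface_turf": [0, 1, 0, 1],
    "Inj_Occured": [0, 0, 0, 1],
})
scaled = standardize_features(toy, target_column="Inj_Occured")
print(scaled)
```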

In [None]:
# Non Standardized Data
X_train, X_test, y_train, y_test = train_test_split_data(BDB_All_Plays_Model_Ready, 'Inj_Occured')


# Standardized Numeric Data
# Note: standardize_features is not defined in this notebook, so it must be defined or imported before this cell runs
BDB_All_Plays_Standardized = standardize_features(BDB_All_Plays_Model_Ready, target_column='Inj_Occured')
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split_data(BDB_All_Plays_Standardized, 'Inj_Occured')

In [None]:
# put all your feature sets into a dictionary
feature_sets = {
    "PDA_PCA": PDA_PCA_Features,
    "FNF_PCA": FNF_PCA_Features,
    "BDB_PCA": BDB_PCA_Features,
    "FNF_back": FNF_back_Features,
    "PDA_back": PDA_back_Features,
    "BDB_back": BDB_back_Features,
    "FNF_forward": FNF_Forward_Features,
    "PDA_forward": PDA_Forward_Features,
    "BDB_forward": BDB_Forward_Features
}

# dictionary to store the train/test splits for each feature set
splits = {}

# loop through and split each
# (loop-local names are used so the X_train/X_test from the full dataset above are not overwritten)
for name, df in feature_sets.items():
    print(f"Splitting {name}...")
    X_tr, X_te, y_tr, y_te = train_test_split_data(df, 'Inj_Occured')
    splits[name] = {
        'X_train': X_tr,
        'X_test': X_te,
        'y_train': y_tr,
        'y_test': y_te
    }
    print(f"Done: {X_tr.shape[0]} train rows, {X_te.shape[0]} test rows")

print("\n All splits complete!")

Splitting PDA_PCA...
Done: 5344 train rows, 1337 test rows
Splitting FNF_PCA...
Done: 213604 train rows, 53402 test rows
Splitting BDB_PCA...
Done: 6839 train rows, 1710 test rows

 All splits complete!


____

# **Modeling**

NOTE: As previously discussed in Semester 2, the primary goal of this analysis and modeling exercise is to classify whether a particular play will result in an injury, and to determine the factors that are most likely to cause this injury. Furthermore, as was *also* previously discussed, this dataset has some extreme imbalance issues (injury occurrence < 2%), and as such high performance on these baseline models is NOT expected. 
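
To see why plain accuracy is misleading at this level of imbalance, and why `run_model_classifier` also reports balanced accuracy, consider a toy case with a 2% injury rate. The labels below are made up for illustration and are not drawn from any of the datasets.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 100 plays, 2 injuries (a 2% positive rate)
y_true = np.array([1] * 2 + [0] * 98)

# A "model" that never predicts an injury
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)            # share of all plays labeled correctly
bal = balanced_accuracy_score(y_true, y_pred)   # mean of the per-class recalls
print(acc, bal)
```

The do-nothing model scores 0.98 on accuracy but only 0.5 on balanced accuracy, because its recall on the injury class is zero.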

## **Baseline Logistic Regression**

#to do: **put baseline here**

So, in order to help convergence, we cranked up the max_iter and tol hyperparameters. 
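
As a rough sketch of that adjustment: `max_iter` and `tol` are both constructor arguments of scikit-learn's `LogisticRegression` (defaults 100 and 1e-4). The specific values below, and the synthetic imbalanced data from `make_classification`, are assumptions for illustration only and are not the settings or data used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with roughly 2% positives (not the real dataset)
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.98, 0.02], random_state=42
)

# Illustrative values: allow more solver iterations and loosen the stopping tolerance
model = LogisticRegression(max_iter=5000, tol=1e-3, random_state=42)
model.fit(X, y)

# n_iter_ reports how many iterations the solver actually used
n_iter = int(model.n_iter_[0])
print(n_iter)
```

If `n_iter` comes back below `max_iter`, the solver stopped because it met the tolerance rather than because it ran out of iterations.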

Great, no more convergence issues. And now, again as in Week 2, we use the scaled feature set to see whether it yields any improvement in performance below. 

#### **Effect of Standardizing Data**

____