

---


# **BALTIMORE RAVENS**

Team Members:

- Andina Hapsari
- Elaine Silva
- Rudrani Sakhare
- Sergio Prado
- Sergio Sanabria
- Tadayori Tamura


---



# **A2 NFL Draft Using Machine Learning**

This project combines advanced machine learning with strategic team analysis to support the Baltimore Ravens in identifying high-potential NFL draft prospects. Facing critical decisions in roster development, the Ravens must not only evaluate athletic talent but also align selections with long-term organizational goals — including positional depth, succession planning, and player development timelines. Leveraging historical player data across five key position groups (QBs, RBs, OL, WRs/TEs, and DEF), this notebook builds predictive models designed to forecast whether incoming prospects will succeed in the league.

Success is defined through both binary classification (Success_1) and a composite scoring metric (Success_2), capturing early performance and long-term career impact. Using this foundation, multiple models — including Random Forest, XGBoost, Decision Trees, and Neural Networks — were trained and evaluated per position group. The best-performing model for each group was selected based on precision and F1-score, ensuring a balance between accuracy and actionable predictions. These models were then applied to real prospect data, tested manually using insights into the Ravens’ current roster needs. For example, where gaps exist in offensive line stability or defensive backfield performance, models were scrutinized for their ability to surface relevant, high-upside players.
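
The selection step described above can be illustrated with a small sketch. This is not the notebook's actual selection code: the model names, the toy labels and predictions, and the tie-breaking order (F1 first, then precision) are assumptions made only for this example.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def pick_best_model(y_true, predictions_by_model):
    """Rank candidates by (F1, precision); this ordering is an assumption."""
    scored = {}
    for name, y_pred in predictions_by_model.items():
        precision, _, f1 = precision_recall_f1(y_true, y_pred)
        scored[name] = (f1, precision)
    best = max(scored, key=scored.get)
    return best, scored


# Hypothetical labels and predictions for two placeholder models.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
toy_predictions = {
    "model_a": [1, 0, 0, 0, 0, 0, 1, 0],  # cautious: no false positives
    "model_b": [1, 1, 0, 1, 0, 1, 1, 0],  # aggressive: two false positives
}
best_name, scores = pick_best_model(y_true, toy_predictions)
```

Weighting precision makes sense for a draft tool because a false positive (a recommended prospect who fails) costs a pick, while F1 keeps the model from becoming so cautious that it flags almost nobody.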

The outcome is a robust, data-driven scouting tool that doesn’t just rate players — it helps answer the strategic question: “Which players can actually help the Baltimore Ravens win more games?” By blending analytics with organizational context, this project bridges the gap between raw performance metrics and intelligent football decision-making.

Each main section heading is followed by a short explanation of what that section contains.

# Importing ALL Libraries and Training Data

In [1]:
# Standard libraries
import warnings
from collections import Counter

# Data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# Scikit-learn: preprocessing, models, metrics, and model selection
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix,
    mean_absolute_error, mean_squared_error, r2_score
)
from sklearn.exceptions import FitFailedWarning
from sklearn.feature_selection import RFE

# Scikit-learn classifiers & regressors
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# XGBoost
import xgboost as xgb

# TensorFlow / Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping


In [2]:
# Install missing libraries (only xgboost typically needs this in Colab)

#!pip install xgboost
#!pip install tensorflow


In [3]:
#Importing data files
qbs = pd.read_csv('QBs_def.csv')
rbs = pd.read_csv('RBs_def.csv')
ol = pd.read_csv('OL_def.csv')
wr_tes = pd.read_csv('WRs_TEs_def.csv')
defe = pd.read_csv('DEF_def.csv')


# Creating Success Metrics for All Groups

## Success Metrics Explanation

In this project, two main types of success metrics were created to evaluate whether NFL draft prospects turned out to be successful players. These metrics are foundational to training predictive models and forecasting the potential of new players.

Success_1: Performance in rookie year
Success_1 is based on a player's rookie season: it compares the player's rookie-year grade against the average grade of players at the same position in the same draft year. A player who graded above that peer-group average is labeled as successful.

This metric is especially important because it captures early signs of high performance and allows the use of classification models, which are ideal for yes/no predictions such as "Will this player succeed?"
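
As a minimal sketch of this rule, the snippet below labels a toy table. In the notebook the peer-group average arrives precomputed in the input files; here it is derived with a group-wise mean purely for illustration. The column names `draft_year`, `grade` and `peer_avg`, and all values, are hypothetical.

```python
import pandas as pd

# Hypothetical rookie grades for five players in one draft class.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E"],
    "Pos": ["QB", "QB", "QB", "RB", "RB"],
    "draft_year": [2020, 2020, 2020, 2020, 2020],
    "grade": [80.0, 60.0, 55.0, 70.0, 90.0],
})

# Peer-group average: same position and same draft year.
toy["peer_avg"] = toy.groupby(["Pos", "draft_year"])["grade"].transform("mean")

# Success_1 is 1 only when the rookie grade is strictly above the peer average.
toy["Success_1"] = (toy["grade"] > toy["peer_avg"]).astype(int)
```

Because the comparison is made within each position and draft year, a running back is never judged against quarterback grades.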

Success_2: Composite Success Score
The second success metric, Success_2, is a more nuanced and continuous score. It combines several normalized performance indicators to reflect a player’s overall value and development across their early career. These indicators include:

- Career value (weighted Approximate Value)
- Draft year performance
- Rookie-year performance
- Cumulative career stats
- Performance improvement over time

Each of these components is weighted to reflect its importance, giving more influence to long-term contribution while still recognizing early promise. The result is a composite score that ranks players on a continuous success scale rather than as a binary outcome.

By using position-specific averages and multi-year career data, these metrics ensure fair comparisons and reduce noise from short-term variability or subjective judgments.

## QBS

Summary of QBS, data review and success metrics overview.

In [4]:
qbs

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,RECV,RUN,RBLK,DEF,RDEF,TACK,PRSH,COV,SPEC,Year
0,1,1,CHI,Caleb Williams,QB,22.0,2024.0,0,0,1,...,73.1,89.7,55.1,64.5,53.0,60.7,59.1,78.1,78.6,2023.0
1,1,2,WAS,Jayden Daniels,QB,23.0,2024.0,0,1,1,...,67.7,80.0,65.2,62.0,50.5,62.4,67.0,65.9,62.0,2023.0
2,1,3,NWE,Drake Maye,QB,22.0,2024.0,0,1,1,...,66.7,78.7,63.5,72.8,80.7,71.0,62.2,66.8,77.6,2023.0
3,1,8,ATL,Michael Penix Jr.,QB,24.0,2024.0,0,0,0,...,74.0,76.7,68.2,76.8,87.6,61.0,64.6,65.2,65.3,2023.0
4,1,10,MIN,J.J. McCarthy,QB,21.0,,0,0,0,...,77.8,75.2,65.0,68.4,62.2,48.5,59.9,77.3,79.0,2023.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
204,4,103,DAL,Isaiah Stanback,QB,23.0,2012.0,0,0,0,...,85.5,47.6,88.0,84.2,91.8,90.1,59.0,84.1,-,2006.0
205,5,151,CIN,Jeff Rowe,QB,23.0,,0,0,0,...,91.1,53.8,79.2,78.4,81.0,86.6,60.2,87.7,-,2006.0
206,5,174,BAL,Troy Smith,QB,23.0,2010.0,0,0,0,...,79.5,49.5,68.0,91.8,93.3,90.2,69.7,93.0,-,2006.0
207,6,205,WAS,Jordan Palmer,QB,23.0,2014.0,0,0,0,...,76.0,46.9,81.9,62.1,69.1,55.4,52.0,57.1,-,2006.0


In [5]:
qbs[qbs['Ht'].isna() | qbs['Wt'].isna()]


Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,RECV,RUN,RBLK,DEF,RDEF,TACK,PRSH,COV,SPEC,Year
21,5,149,GNB,Sean Clifford,QB,25.0,2023.0,0,0,0,...,70.3,91.4,55.9,67.4,53.1,57.7,71.5,77.7,71.0,2022.0
31,7,241,PIT,Chris Oladokun,QB,25.0,2024.0,0,0,0,...,66.9,71.5,58.5,64.2,55.2,57.2,81.4,51.6,74.6,2021.0
40,3,66,MIN,Kellen Mond,QB,22.0,2021.0,0,0,0,...,89.9,91.3,65.7,53.4,41.5,46.3,59.3,61.0,78.6,2020.0
54,7,231,DAL,Ben DiNucci,QB,23.0,2020.0,0,0,0,...,78.6,88.6,71.8,78.6,73.5,59.6,73.1,88.7,45.5,2019.0
55,7,240,NOR,Tommy Stevens,QB,23.0,2020.0,0,0,0,...,81.2,77.9,75.1,80.1,83.5,71.7,80.1,69.2,82.4,2019.0
66,6,178,JAX,Gardner Minshew II,QB,23.0,2024.0,0,1,4,...,60.4,61.8,63.9,83.7,83.0,66.8,77.6,84.1,73.4,2018.0
79,7,220,SEA,Alex McGough,QB,22.0,,0,0,0,...,74.4,68.7,55.3,86.9,79.0,91.7,77.0,92.6,70.2,2017.0
90,7,253,DEN,Chad Kelly,QB,23.0,2018.0,0,0,0,...,74.4,53.8,78.5,87.5,80.8,89.5,70.8,92.2,73.0,2016.0
102,6,191,DET,Jake Rudock,QB,23.0,2017.0,0,0,0,...,82.2,57.3,70.8,72.1,80.1,71.5,66.5,63.9,66.4,2015.0
112,7,250,DEN,Trevor Siemian,QB,23.0,2023.0,0,0,2,...,81.3,75.5,81.0,86.9,88.8,81.9,67.0,91.7,50.5,2014.0


In [6]:
qbs = qbs.dropna(subset=['Ht', 'Wt'])

### Success Metric 1
grade_rookie_season > Average_grades_per_position_in_year

This metric compares a player's rookie-season grade with the average grade of all players at his position in that year.


In [7]:
# Reassign rather than rename in place: qbs is the result of dropna(), and an
# in-place rename on it raises pandas' SettingWithCopyWarning
qbs = qbs.rename(columns={'Average_Grades_QBs_in_Year': 'Average_grades_per_position_in_year'})


In [8]:
# Drop rows where required columns are missing
qbs = qbs.dropna(subset=['grade_rookie_season', 'Average_grades_per_position_in_year'])

In [9]:

# Create the binary success metric
qbs['Success_1'] = (
    qbs['grade_rookie_season'] > qbs['Average_grades_per_position_in_year']
).astype(int)


In [10]:
qbs_success_1 = qbs

In [11]:
qbs_success_1

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,RUN,RBLK,DEF,RDEF,TACK,PRSH,COV,SPEC,Year,Success_1
0,1,1,CHI,Caleb Williams,QB,22.0,2024.0,0,0,1,...,89.7,55.1,64.5,53.0,60.7,59.1,78.1,78.6,2023.0,0
1,1,2,WAS,Jayden Daniels,QB,23.0,2024.0,0,1,1,...,80.0,65.2,62.0,50.5,62.4,67.0,65.9,62.0,2023.0,1
2,1,3,NWE,Drake Maye,QB,22.0,2024.0,0,1,1,...,78.7,63.5,72.8,80.7,71.0,62.2,66.8,77.6,2023.0,0
3,1,8,ATL,Michael Penix Jr.,QB,24.0,2024.0,0,0,0,...,76.7,68.2,76.8,87.6,61.0,64.6,65.2,65.3,2023.0,1
4,1,10,MIN,J.J. McCarthy,QB,21.0,,0,0,0,...,75.2,65.0,68.4,62.2,48.5,59.9,77.3,79.0,2023.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203,3,92,BUF,Trent Edwards,QB,23.0,2012.0,0,0,2,...,41.4,63.5,77.4,55.6,89.9,62.7,93.1,72.2,2006.0,0
204,4,103,DAL,Isaiah Stanback,QB,23.0,2012.0,0,0,0,...,47.6,88.0,84.2,91.8,90.1,59.0,84.1,-,2006.0,0
205,5,151,CIN,Jeff Rowe,QB,23.0,,0,0,0,...,53.8,79.2,78.4,81.0,86.6,60.2,87.7,-,2006.0,0
206,5,174,BAL,Troy Smith,QB,23.0,2010.0,0,0,0,...,49.5,68.0,91.8,93.3,90.2,69.7,93.0,-,2006.0,0


In [12]:
qbs_success_1['Success_1'].value_counts()


Success_1,count
0,166
1,24


In [13]:
qbs_success_1[qbs_success_1['Age'].isna() ]

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,RUN,RBLK,DEF,RDEF,TACK,PRSH,COV,SPEC,Year,Success_1
190,5,137,MIN,John David Booty,QB,,,0,0,0,...,85.4,89.4,82.1,90.7,68.0,62.2,88.6,-,2007.0,0


In [14]:
qbs_success_1.loc[qbs_success_1['Player'] == 'John David Booty', 'Age'] = 23

In [15]:
qbs_success_1[qbs_success_1['Age'].isna() ]

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,RUN,RBLK,DEF,RDEF,TACK,PRSH,COV,SPEC,Year,Success_1


In [16]:
# Columns from Block II: NFL Total Statistics
nfl_total_cols = [
    'To', 'AP1', 'PB', 'St', 'wAV', 'DrAV', 'G',
    'Cmp_Pass', 'Att_Pass', 'Yds_Pass', 'TD_Pass', 'Int_Pass',
    'Att_Rush', 'Yds_Rush', 'TD_Rush', 'Rec', 'Yds_Rec', 'TD_Rec',
    'Tackles', 'Int_Def', 'Sk','TeamDrafted','Rnd','Pick','Tm','Player','Pos','year_drafted','Team','Bench'
]

# Columns from Block V: PFF Scores
pff_score_cols = [
    'grade_rookie_season', 'Career_Avg_Grade',
    'Average_grades_per_position_in_year'
]

# Columns from Block VI: PFF Scores of Previous Season
prev_season_pff_cols = [
    'prev_year', 'Record', 'PF', 'PA', 'OVER', 'OFF', 'PASS',
    'PBLK', 'RECV', 'RUN', 'RBLK', 'DEF', 'RDEF', 'TACK',
    'PRSH', 'COV', 'SPEC', 'Year'
]

# Combine all columns to drop
columns_to_drop = nfl_total_cols + pff_score_cols + prev_season_pff_cols

# Drop safely only if column exists
qbs_success_1_clean = qbs_success_1.drop(columns=[col for col in columns_to_drop if col in qbs_success_1.columns])


In [17]:
qbs_success_1_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 190 entries, 0 to 207
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             190 non-null    float64
 1   College/Univ    190 non-null    object 
 2   Total_Games     183 non-null    float64
 3   Total_Cmp       183 non-null    float64
 4   Total_Att       183 non-null    float64
 5   Total_Yds       183 non-null    float64
 6   Total_TD        183 non-null    float64
 7   Total_Int       183 non-null    float64
 8   Cmp_Percentage  183 non-null    float64
 9   TD_Percentage   183 non-null    float64
 10  Int_Percentage  183 non-null    float64
 11  Y_A             183 non-null    float64
 12  AY_A            183 non-null    float64
 13  Y_C             183 non-null    float64
 14  Y_G             183 non-null    float64
 15  Rate            183 non-null    float64
 16  Seasons_Played  183 non-null    float64
 17  Ht              190 non-null    object 


In [18]:
# Define a function to convert height
def convert_height(ht_str):
    if isinstance(ht_str, str) and '-' in ht_str:
        feet, inches = ht_str.split('-')
        return int(feet) * 12 + int(inches)
    return None  # if not valid format, return None

# Apply the function to the Ht column
qbs_success_1_clean['Ht'] = qbs_success_1_clean['Ht'].apply(convert_height)
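
The conversion above assumes every string has exactly one hyphen with digits on both sides. A value such as `'6-'` would raise a `ValueError` inside `int()`. The stricter variant below is only a sketch under that observation; the name `height_to_inches` is hypothetical and not part of the notebook.

```python
import re

# Accept only "<feet>-<inches>" with digits on both sides, e.g. "6-1".
_HEIGHT_PATTERN = re.compile(r"^\s*(\d+)-(\d+)\s*$")


def height_to_inches(ht_str):
    """Convert a 'feet-inches' string to total inches; return None if malformed."""
    if not isinstance(ht_str, str):
        return None
    match = _HEIGHT_PATTERN.match(ht_str)
    if match is None:
        return None
    feet, inches = (int(group) for group in match.groups())
    return feet * 12 + inches
```

It can be applied the same way as the original function, for example `df['Ht'].apply(height_to_inches)`, and yields 73 for `'6-1'` and 76 for `'6-4'`.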


In [19]:
# List of columns to check for missing values
college_stat_cols_qb = [
    'Total_Games', 'Total_Cmp', 'Total_Att', 'Total_Yds', 'Total_TD',
    'Total_Int', 'Cmp_Percentage', 'TD_Percentage', 'Int_Percentage',
    'Y_A', 'AY_A', 'Y_C', 'Y_G', 'Rate', 'Seasons_Played'
]

# Drop rows with any NaNs in those columns
qbs_success_1_clean = qbs_success_1_clean.dropna(subset=college_stat_cols_qb)


In [20]:
qbs_success_1_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 183 entries, 0 to 207
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             183 non-null    float64
 1   College/Univ    183 non-null    object 
 2   Total_Games     183 non-null    float64
 3   Total_Cmp       183 non-null    float64
 4   Total_Att       183 non-null    float64
 5   Total_Yds       183 non-null    float64
 6   Total_TD        183 non-null    float64
 7   Total_Int       183 non-null    float64
 8   Cmp_Percentage  183 non-null    float64
 9   TD_Percentage   183 non-null    float64
 10  Int_Percentage  183 non-null    float64
 11  Y_A             183 non-null    float64
 12  AY_A            183 non-null    float64
 13  Y_C             183 non-null    float64
 14  Y_G             183 non-null    float64
 15  Rate            183 non-null    float64
 16  Seasons_Played  183 non-null    float64
 17  Ht              183 non-null    int64  


In [21]:
def rate_qb_40yd(time):
    if pd.isna(time):
        return 0
    elif time <= 4.35:
        return 4
    elif time <= 4.59:
        return 3
    elif time <= 4.9:
        return 2
    elif time < 5.0:
        return 1
    else:
        return 0

# Apply to QBs dataset
qbs_success_1_clean['40yd'] = qbs_success_1_clean['40yd'].apply(rate_qb_40yd)


In [22]:
qbs_success_1_clean['40yd'].value_counts()

40yd,count
2,91
0,48
1,24
3,20


In [23]:
def rate_qb_vertical(jump):
    if pd.isna(jump):
        return 0
    elif jump >= 40.5:
        return 4
    elif jump >= 32:
        return 3
    elif jump >= 28:
        return 2
    elif jump > 0:
        return 1
    else:
        return 0

# Apply to QBs dataset
qbs_success_1_clean['Vertical_rating'] = qbs_success_1_clean['Vertical'].apply(rate_qb_vertical)
qbs_success_1_clean['Vertical_rating'].value_counts()


Vertical_rating,count
3,60
2,56
0,49
1,17
4,1


In [24]:
def rate_qb_broad_jump(jump):
    if pd.isna(jump):
        return 0
    elif jump >= 129:
        return 4
    elif jump >= 108:
        return 3
    elif jump >= 96:
        return 2
    elif jump > 0:
        return 1
    else:
        return 0

# Apply to QBs dataset
qbs_success_1_clean['Broad Jump'] = qbs_success_1_clean['Broad Jump'].apply(rate_qb_broad_jump)
qbs_success_1_clean['Broad Jump'].value_counts()

Broad Jump,count
3,100
0,50
2,32
4,1


In [25]:
def rate_qb_3cone(time):
    if pd.isna(time):
        return 0
    elif time <= 6.55:
        return 4
    elif time <= 7.15:
        return 3
    elif time <= 7.50:
        return 2
    elif time <= 8.0:
        return 1
    else:
        return 0

# Apply to QBs dataset
qbs_success_1_clean['3Cone'] = qbs_success_1_clean['3Cone'].apply(rate_qb_3cone)
qbs_success_1_clean['3Cone'].value_counts()

3Cone,count
3,77
0,62
2,39
1,5


In [26]:
def rate_qb_shuttle(time):
    if pd.isna(time):
        return 0
    elif time <= 4.00:
        return 4
    elif time <= 4.20:
        return 3
    elif time <= 4.45:
        return 2
    elif time <= 4.65:
        return 1
    else:
        return 0

# Apply to QBs dataset
qbs_success_1_clean['Shuttle'] = qbs_success_1_clean['Shuttle'].apply(rate_qb_shuttle)
qbs_success_1_clean['Shuttle'].value_counts()


Shuttle,count
2,70
0,60
1,26
3,25
4,2
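
The 3-cone and shuttle rating functions share one shape: a missing value maps to 0, and otherwise the first cutoff the time meets decides the rating. A possible consolidation is sketched below. The helper name `rate_timed_drill` and the constant names are hypothetical; the cutoffs are copied from the two cells above.

```python
import math

# Cutoffs for ratings 4, 3, 2, 1 (lower time is better), copied from the cells above.
THREE_CONE_CUTOFFS = (6.55, 7.15, 7.50, 8.0)
SHUTTLE_CUTOFFS = (4.00, 4.20, 4.45, 4.65)


def rate_timed_drill(time, cutoffs):
    """Return 4..1 for the first cutoff that `time` meets, else 0.

    Missing values (None or NaN) are rated 0, as in the original functions.
    """
    if time is None or math.isnan(time):
        return 0
    for rating, cutoff in zip((4, 3, 2, 1), cutoffs):
        if time <= cutoff:
            return rating
    return 0
```

With pandas this could be applied as `df['Shuttle'].apply(rate_timed_drill, cutoffs=SHUTTLE_CUTOFFS)`. The 40-yard function is deliberately left out because its last tier uses a strict `<` comparison, and the jump functions are left out because higher values are better there.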


### Success Metric 2



Ref: https://www.pro-football-reference.com/

In [27]:
# Work on the existing dataframe that already has Success_1
qbs_success_2 = qbs_success_1.copy()

In [28]:
# Drop rows with missing values in any column needed for Success_2
required_cols_score = [
    'wAV', 'DrAV', 'grade_rookie_season', 'Career_Avg_Grade', 'Average_grades_per_position_in_year'
]


qbs_success_2 = qbs_success_2.dropna(subset=required_cols_score)


In [29]:
# Compute the rookie vs position average delta
qbs_success_2['Rookie_vs_PosAvg'] = (
    qbs_success_2['grade_rookie_season'] - qbs_success_2['Average_grades_per_position_in_year']
)


In [30]:
# Normalize the metrics
scaler = MinMaxScaler()
normalized = scaler.fit_transform(qbs_success_2[[
    'wAV', 'DrAV', 'grade_rookie_season', 'Career_Avg_Grade', 'Rookie_vs_PosAvg'
]])
qbs_success_2[['wAV_norm', 'DrAV_norm', 'rookie_norm', 'career_norm', 'delta_norm']] = normalized

In [31]:

# Calculate Success_2 as a weighted sum of the normalized metrics (weights sum to 1.0)
qbs_success_2['Success_2'] = (
    0.35 * qbs_success_2['wAV_norm'] +
    0.25 * qbs_success_2['DrAV_norm'] +
    0.20 * qbs_success_2['rookie_norm'] +
    0.15 * qbs_success_2['career_norm'] +
    0.05 * qbs_success_2['delta_norm']
)
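
Written out, the two cells above amount to the following, assuming `MinMaxScaler` is used with its default range of 0 to 1. Each metric $x$ is rescaled over the rows that remain after the missing-value filter:

$$\tilde{x}_i = \frac{x_i - \min_j x_j}{\max_j x_j - \min_j x_j}$$

and the score for player $i$ is the weighted sum

$$\mathrm{Success\_2}_i = 0.35\,\widetilde{\mathrm{wAV}}_i + 0.25\,\widetilde{\mathrm{DrAV}}_i + 0.20\,\tilde{g}^{\,\mathrm{rookie}}_i + 0.15\,\tilde{g}^{\,\mathrm{career}}_i + 0.05\,\tilde{\Delta}_i$$

where $g^{\mathrm{rookie}}$ is `grade_rookie_season`, $g^{\mathrm{career}}$ is `Career_Avg_Grade`, and $\Delta$ is `Rookie_vs_PosAvg`. The symbols are shorthand introduced here, not names from the notebook. Because the weights sum to 1 and every normalized term lies in $[0, 1]$, the score also lies in $[0, 1]$. As a check against the output shown below for Isaiah Stanback:

$$0.35(0.026667) + 0.25(0.027027) + 0.20(0) + 0.15(0) + 0.05(0.091362) \approx 0.020658$$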


In [32]:
qbs_success_2

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,SPEC,Year,Success_1,Rookie_vs_PosAvg,wAV_norm,DrAV_norm,rookie_norm,career_norm,delta_norm,Success_2
0,1,1,CHI,Caleb Williams,QB,22.0,2024.0,0,0,1,...,78.6,2023.0,0,-5.721739,0.106667,0.108108,0.700418,0.700418,0.706078,0.344810
1,1,2,WAS,Jayden Daniels,QB,23.0,2024.0,0,1,1,...,62.0,2023.0,1,17.778261,0.160000,0.162162,0.945720,0.945720,0.946746,0.474880
2,1,3,NWE,Drake Maye,QB,22.0,2024.0,0,1,1,...,77.6,2023.0,0,-2.721739,0.080000,0.081081,0.731733,0.731733,0.736801,0.341217
3,1,8,ATL,Michael Penix Jr.,QB,24.0,2024.0,0,0,0,...,65.3,2023.0,1,15.078261,0.040000,0.040541,0.917537,0.917537,0.919095,0.391228
5,1,12,DEN,Bo Nix,QB,24.0,2024.0,0,0,1,...,83.9,2023.0,1,5.478261,0.113333,0.114865,0.817328,0.817328,0.820779,0.395487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201,2,40,MIA,John Beck,QB,26.0,2011.0,0,0,0,...,-,2006.0,0,-36.545652,0.053333,0.040541,0.304802,0.429019,0.390404,0.173635
202,2,43,DET,Drew Stanton,QB,23.0,2017.0,0,0,1,...,-,2006.0,0,-65.745652,0.093333,0.054054,0.000000,0.558008,0.091362,0.134449
203,3,92,BUF,Trent Edwards,QB,23.0,2012.0,0,0,2,...,72.2,2006.0,0,-6.345652,0.133333,0.128378,0.620042,0.656785,0.699688,0.336272
204,4,103,DAL,Isaiah Stanback,QB,23.0,2012.0,0,0,0,...,-,2006.0,0,-65.745652,0.026667,0.027027,0.000000,0.000000,0.091362,0.020658


In [33]:
qbs_success_2['Success_2'].describe()

Unnamed: 0,Success_2
count,141.0
mean,0.327517
std,0.185435
min,0.020658
25%,0.180741
50%,0.29691
75%,0.432993
max,0.933294


We chose to retain Success_2 as a continuous variable rather than applying a binary threshold, in order to preserve the full range of player outcomes. Unlike Success_1, which reflects a binary classification based solely on rookie season performance, Success_2 incorporates a broader, weighted view of career success including metrics like total career value (wAV), team-specific contribution (DrAV), and performance consistency (Career_Avg_Grade). Binarizing this metric would reduce nuanced differences between players who are "above average" and those who are truly elite, potentially weakening model sensitivity. Keeping Success_2 continuous allows us to train a regression model that captures degrees of long-term success, which can then be combined with the probability output of the classification model (Success_1) for a more balanced and informed draft strategy.
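
The text does not specify how the two model outputs are combined. One simple possibility is a weighted blend, sketched below. The function name `blended_draft_score`, the weight `alpha`, and the prospect values are all assumptions for illustration, not results from the notebook.

```python
def blended_draft_score(p_success_1, predicted_success_2, alpha=0.5):
    """Blend a Success_1 probability with a predicted Success_2 score.

    Both inputs are expected on a 0-1 scale; `alpha` is the (assumed) weight
    given to the classification probability.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * p_success_1 + (1.0 - alpha) * predicted_success_2


# Hypothetical prospects: (probability of Success_1, predicted Success_2).
prospects = {
    "prospect_x": (0.80, 0.30),
    "prospect_y": (0.40, 0.60),
    "prospect_z": (0.55, 0.50),
}

# Rank prospects from highest to lowest blended score.
ranking = sorted(
    prospects,
    key=lambda name: blended_draft_score(*prospects[name]),
    reverse=True,
)
```

A higher `alpha` favors players likely to contribute as rookies, while a lower one favors projected long-term value.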

In [35]:
# Columns from Block II: NFL Total Statistics
nfl_total_cols = [
    'To', 'AP1', 'PB', 'St', 'wAV', 'DrAV', 'G',
    'Cmp_Pass', 'Att_Pass', 'Yds_Pass', 'TD_Pass', 'Int_Pass',
    'Att_Rush', 'Yds_Rush', 'TD_Rush', 'Rec', 'Yds_Rec', 'TD_Rec',
    'Tackles', 'Int_Def', 'Sk','TeamDrafted','Rnd','Pick','Tm','Player','Pos','year_drafted','Team','Bench','Rookie_vs_PosAvg','wAV_norm','DrAV_norm',
    'rookie_norm','career_norm','delta_norm', 'Success_1'
]

# Columns from Block V: PFF Scores
pff_score_cols = [
    'grade_rookie_season', 'Career_Avg_Grade',
    'Average_grades_per_position_in_year'
]

# Columns from Block VI: PFF Scores of Previous Season
prev_season_pff_cols = [
    'prev_year', 'Record', 'PF', 'PA', 'OVER', 'OFF', 'PASS',
    'PBLK', 'RECV', 'RUN', 'RBLK', 'DEF', 'RDEF', 'TACK',
    'PRSH', 'COV', 'SPEC', 'Year'
]

# Combine all columns to drop
columns_to_drop = nfl_total_cols + pff_score_cols + prev_season_pff_cols

# Drop safely only if column exists
qbs_success_2_clean = qbs_success_2.drop(columns=[col for col in columns_to_drop if col in qbs_success_2.columns])


In [36]:
# List of columns to check for missing values
college_stat_cols_qb = [
    'Total_Games', 'Total_Cmp', 'Total_Att', 'Total_Yds', 'Total_TD',
    'Total_Int', 'Cmp_Percentage', 'TD_Percentage', 'Int_Percentage',
    'Y_A', 'AY_A', 'Y_C', 'Y_G', 'Rate', 'Seasons_Played'
]

# Drop rows with any NaNs in those columns
qbs_success_2_clean = qbs_success_2_clean.dropna(subset=college_stat_cols_qb)


In [37]:
qbs_success_2_clean

Unnamed: 0,Age,College/Univ,Total_Games,Total_Cmp,Total_Att,Total_Yds,Total_TD,Total_Int,Cmp_Percentage,TD_Percentage,...,Rate,Seasons_Played,Ht,Wt,40yd,Vertical,Broad Jump,3Cone,Shuttle,Success_2
0,22.0,USC,37.0,735.0,1099.0,10082.0,93.0,14.0,66.9,8.5,...,169.3,3.0,6-1,214.0,,,,,,0.344810
1,23.0,LSU,55.0,953.0,1438.0,12749.0,89.0,20.0,66.3,6.2,...,158.4,5.0,6-4,210.0,,,,,,0.474880
2,22.0,North Carolina,30.0,618.0,952.0,8018.0,63.0,16.0,64.9,6.6,...,154.1,3.0,6-4,223.0,,,,,,0.341217
3,24.0,Washington,48.0,1067.0,1685.0,13741.0,96.0,34.0,63.3,5.7,...,146.6,6.0,6-2,216.0,,,,,,0.391228
5,24.0,Oregon,61.0,1286.0,1936.0,15351.0,113.0,26.0,66.4,5.8,...,149.6,5.0,6-2,214.0,,,,,,0.395487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201,26.0,BYU,43.0,885.0,1418.0,11021.0,79.0,34.0,62.4,5.6,...,141.3,4.0,6-2,215.0,4.75,29.5,111.0,6.81,4.17,0.173635
202,23.0,Michigan St.,45.0,543.0,846.0,6524.0,42.0,28.0,64.2,5.0,...,138.7,4.0,6-3,226.0,4.75,30.5,108.0,6.77,4.41,0.134449
203,23.0,Stanford,35.0,487.0,865.0,5429.0,36.0,33.0,56.3,4.2,...,115.1,4.0,6-4,231.0,4.76,,,7.14,4.46,0.336272
204,23.0,Washington,36.0,269.0,523.0,3868.0,22.0,12.0,51.4,4.2,...,122.9,4.0,6-2,216.0,4.60,,,,,0.020658


In [38]:
# Define a function to convert height
def convert_height(ht_str):
    if isinstance(ht_str, str) and '-' in ht_str:
        feet, inches = ht_str.split('-')
        return int(feet) * 12 + int(inches)
    return None  # if not valid format, return None

# Apply the function to the Ht column
qbs_success_2_clean['Ht'] = qbs_success_2_clean['Ht'].apply(convert_height)

In [39]:
qbs_success_2_clean

Unnamed: 0,Age,College/Univ,Total_Games,Total_Cmp,Total_Att,Total_Yds,Total_TD,Total_Int,Cmp_Percentage,TD_Percentage,...,Rate,Seasons_Played,Ht,Wt,40yd,Vertical,Broad Jump,3Cone,Shuttle,Success_2
0,22.0,USC,37.0,735.0,1099.0,10082.0,93.0,14.0,66.9,8.5,...,169.3,3.0,73,214.0,,,,,,0.344810
1,23.0,LSU,55.0,953.0,1438.0,12749.0,89.0,20.0,66.3,6.2,...,158.4,5.0,76,210.0,,,,,,0.474880
2,22.0,North Carolina,30.0,618.0,952.0,8018.0,63.0,16.0,64.9,6.6,...,154.1,3.0,76,223.0,,,,,,0.341217
3,24.0,Washington,48.0,1067.0,1685.0,13741.0,96.0,34.0,63.3,5.7,...,146.6,6.0,74,216.0,,,,,,0.391228
5,24.0,Oregon,61.0,1286.0,1936.0,15351.0,113.0,26.0,66.4,5.8,...,149.6,5.0,74,214.0,,,,,,0.395487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201,26.0,BYU,43.0,885.0,1418.0,11021.0,79.0,34.0,62.4,5.6,...,141.3,4.0,74,215.0,4.75,29.5,111.0,6.81,4.17,0.173635
202,23.0,Michigan St.,45.0,543.0,846.0,6524.0,42.0,28.0,64.2,5.0,...,138.7,4.0,75,226.0,4.75,30.5,108.0,6.77,4.41,0.134449
203,23.0,Stanford,35.0,487.0,865.0,5429.0,36.0,33.0,56.3,4.2,...,115.1,4.0,76,231.0,4.76,,,7.14,4.46,0.336272
204,23.0,Washington,36.0,269.0,523.0,3868.0,22.0,12.0,51.4,4.2,...,122.9,4.0,74,216.0,4.60,,,,,0.020658


In [40]:
def rate_qb_40yd(time):
    if pd.isna(time):
        return 0
    elif time <= 4.35:
        return 4
    elif time <= 4.59:
        return 3
    elif time <= 4.9:
        return 2
    elif time < 5.0:
        return 1
    else:
        return 0

# Apply to QBs dataset
qbs_success_2_clean['40yd'] = qbs_success_2_clean['40yd'].apply(rate_qb_40yd)

qbs_success_2_clean['40yd'].value_counts()

40yd,count
2,69
0,34
3,17
1,15


In [41]:
def rate_qb_vertical(jump):
    if pd.isna(jump):
        return 0
    elif jump >= 40.5:
        return 4
    elif jump >= 32:
        return 3
    elif jump >= 28:
        return 2
    elif jump > 0:
        return 1
    else:
        return 0

# Apply to QBs dataset
qbs_success_2_clean['Vertical_rating'] = qbs_success_2_clean['Vertical'].apply(rate_qb_vertical)
qbs_success_2_clean['Vertical_rating'].value_counts()


Vertical_rating,count
3,51
2,37
0,36
1,10
4,1


In [42]:
def rate_qb_broad_jump(jump):
    if pd.isna(jump):
        return 0
    elif jump >= 129:
        return 4
    elif jump >= 108:
        return 3
    elif jump >= 96:
        return 2
    elif jump > 0:
        return 1
    else:
        return 0

# Apply to QBs dataset
qbs_success_2_clean['Broad Jump'] = qbs_success_2_clean['Broad Jump'].apply(rate_qb_broad_jump)
qbs_success_2_clean['Broad Jump'].value_counts()

Broad Jump,count
3,77
0,39
2,18
4,1


In [43]:
def rate_qb_3cone(time):
    if pd.isna(time):
        return 0
    elif time <= 6.55:
        return 4
    elif time <= 7.15:
        return 3
    elif time <= 7.50:
        return 2
    elif time <= 8.0:
        return 1
    else:
        return 0

# Apply to QBs dataset
qbs_success_2_clean['3Cone'] = qbs_success_2_clean['3Cone'].apply(rate_qb_3cone)
qbs_success_2_clean['3Cone'].value_counts()

3Cone,count
3,60
0,47
2,25
1,3


In [44]:
def rate_qb_shuttle(time):
    if pd.isna(time):
        return 0
    elif time <= 4.00:
        return 4
    elif time <= 4.20:
        return 3
    elif time <= 4.45:
        return 2
    elif time <= 4.65:
        return 1
    else:
        return 0

# Apply to QBs dataset
qbs_success_2_clean['Shuttle'] = qbs_success_2_clean['Shuttle'].apply(rate_qb_shuttle)
qbs_success_2_clean['Shuttle'].value_counts()

Shuttle,count
2,50
0,47
3,20
1,16
4,2


## RBS

Summary of RBS, data review and success metrics overview.

In [45]:
rbs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 385 entries, 0 to 384
Data columns (total 76 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Rnd                            385 non-null    int64  
 1   Pick                           385 non-null    int64  
 2   Tm                             385 non-null    object 
 3   Player                         385 non-null    object 
 4   Pos                            385 non-null    object 
 5   Age                            379 non-null    float64
 6   To                             356 non-null    float64
 7   AP1                            385 non-null    int64  
 8   PB                             385 non-null    int64  
 9   St                             385 non-null    int64  
 10  wAV                            356 non-null    float64
 11  DrAV                           335 non-null    float64
 12  G                              356 non-null    flo

In [46]:
rbs[rbs['Ht'].isna() | rbs['Wt'].isna()]

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,RECV,RUN,RBLK,DEF,RDEF,TACK,PRSH,COV,SPEC,Year
7,4,128,BUF,Ray Davis,RB,24.0,2024.0,0,0,0,...,75.0,91.3,61.2,77.8,54.1,41.4,76.4,91.3,72.7,2023.0
10,5,147,DEN,Audric Estime,RB,21.0,2024.0,0,0,0,...,69.6,77.6,72.3,60.0,59.3,34.2,69.8,48.4,83.9,2023.0
24,3,84,MIA,De'Von Achane,RB,21.0,2024.0,0,0,1,...,83.0,76.6,69.8,71.7,75.9,47.0,80.4,52.5,56.2,2022.0
31,6,193,WAS,Chris Rodriguez,RB,22.0,2024.0,0,0,0,...,75.1,77.2,55.6,74.4,61.6,68.0,71.9,83.1,90.4,2022.0
35,7,235,GNB,Lew Nichols,RB,22.0,,0,0,0,...,70.3,91.4,55.9,67.4,53.1,57.7,71.5,77.7,71.0,2022.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
358,6,204,MIA,Lex Hilliard,RB,24.0,2012.0,0,0,1,...,66.2,69.3,69.8,59.3,71.5,80.1,60.4,36.3,-,2007.0
375,4,137,BAL,Le'Ron McClain,RB,22.0,2013.0,1,2,6,...,79.5,49.5,68.0,91.8,93.3,90.2,69.7,93.0,-,2006.0
377,6,181,MIA,Reagan Maui'a,RB,23.0,2012.0,0,0,1,...,66.7,50.5,73.1,88.0,92.3,74.6,73.5,74.4,-,2006.0
379,6,208,NWE,Justise Hairston,RB,,,0,0,0,...,68.7,42.3,86.6,90.6,92.8,77.6,70.9,91.3,-,2006.0


In [47]:
rbs = rbs.dropna(subset=['Ht', 'Wt'])

In [48]:
rbs[rbs['Age'].isna()]

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,RECV,RUN,RBLK,DEF,RDEF,TACK,PRSH,COV,SPEC,Year
315,6,200,PHI,Charles Scott,RB,,,0,0,0,...,71.2,49.4,85.9,75.3,73.4,75.8,69.2,81.8,-,2009.0
365,2,49,CIN,Kenny Irons,RB,,,0,0,0,...,91.1,53.8,79.2,78.4,81.0,86.6,60.2,87.7,-,2006.0
378,6,186,SFO,Thomas Clayton,RB,,,0,0,0,...,72.3,53.3,81.9,70.1,82.8,60.0,53.2,64.3,-,2006.0
381,7,236,PHI,Nate Ilaoa,RB,,,0,0,0,...,69.0,61.5,88.4,70.4,57.6,86.4,61.2,90.1,-,2006.0


In [49]:
rbs.loc[rbs['Player'] == 'Charles Scott', 'Age'] = 22
rbs.loc[rbs['Player'] == 'Kenny Irons', 'Age'] = 24
rbs.loc[rbs['Player'] == 'Thomas Clayton', 'Age'] = 23
rbs.loc[rbs['Player'] == 'Nate Ilaoa', 'Age'] = 24

In [50]:
rbs[rbs['grade_rookie_season'].isna() | rbs['Average_Grades_of_RBs_in_Year'].isna()]

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,RECV,RUN,RBLK,DEF,RDEF,TACK,PRSH,COV,SPEC,Year


In [51]:
rbs[rbs['grade_rookie_season'].isna() | rbs['Average_Grades_of_RBs_in_Year'].isna() | rbs['wAV'].isna() | rbs['DrAV'].isna()]

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,RECV,RUN,RBLK,DEF,RDEF,TACK,PRSH,COV,SPEC,Year
17,6,205,HOU,Jawhar Jordan,RB,25.0,,0,0,0,...,81.3,75.1,57.3,77.4,63.1,43.1,75.8,89.8,80.0,2023.0
34,7,222,MIN,DeWayne McBride,RB,22.0,,0,0,0,...,82.0,78.9,74.5,81.1,85.4,85.5,75.2,71.3,75.9,2022.0
54,6,196,BAL,Tyler Badie,RB,22.0,2024.0,0,0,0,...,72.2,82.5,72.3,62.5,61.1,49.1,70.3,48.6,87.5,2021.0
76,7,244,MIA,Gerrid Doaks,RB,23.0,,0,0,0,...,76.5,77.1,56.2,65.3,65.5,44.1,72.2,54.2,69.1,2020.0
95,7,245,TAM,Raymond Calais,RB,22.0,2020.0,0,0,0,...,84.5,70.8,63.1,74.3,79.0,47.5,73.5,69.5,56.3,2019.0
104,4,112,WAS,Bryce Love,RB,22.0,,0,0,0,...,63.9,75.4,56.8,77.8,72.3,84.0,70.6,80.9,64.0,2018.0
115,6,211,CIN,Rodney Anderson,RB,22.0,,0,0,0,...,70.6,77.8,56.7,74.8,77.8,50.5,69.7,69.4,68.8,2018.0
117,7,218,DAL,Mike Weber,RB,22.0,,0,0,0,...,75.1,74.9,61.8,80.5,74.3,56.3,72.0,89.6,60.4,2018.0
140,7,236,DAL,Bo Scarbrough,RB,23.0,2020.0,0,0,0,...,66.8,75.3,79.8,79.9,71.1,82.6,82.2,82.5,78.7,2017.0
153,4,121,SFO,Joe Williams,RB,24.0,,0,0,0,...,59.1,62.3,67.3,61.4,60.1,64.6,58.3,65.8,68.4,2016.0


In [52]:
rbs = rbs.drop(
    rbs[rbs['grade_rookie_season'].isna() |
        rbs['Average_Grades_of_RBs_in_Year'].isna() |
        rbs['wAV'].isna() |
        rbs['DrAV'].isna()
    ].index
)


Creating Success metric 1

In [53]:
rbs['Success_1'] = (
    rbs['grade_rookie_season'] > rbs['Average_Grades_of_RBs_in_Year']
).astype(int)
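
To make the labelling rule concrete, here is a small sketch on made-up grades (the values are illustrative, not taken from the dataset). It shows that the comparison is strict: a rookie whose grade exactly equals the positional average is labelled 0.

```python
import pandas as pd

# Toy rows with made-up grades, purely to illustrate the labelling rule.
toy = pd.DataFrame({
    'grade_rookie_season': [72.0, 60.0, 65.0],
    'Average_Grades_of_RBs_in_Year': [65.0, 65.0, 65.0],
})

# Strictly greater than the positional average -> 1, otherwise 0.
toy['Success_1'] = (
    toy['grade_rookie_season'] > toy['Average_Grades_of_RBs_in_Year']
).astype(int)
```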

In [54]:
rbs

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,RUN,RBLK,DEF,RDEF,TACK,PRSH,COV,SPEC,Year,Success_1
0,2,46,CAR,Jonathon Brooks,RB,21.0,2024.0,0,0,0,...,79.5,53.5,73.0,65.1,32.0,71.5,79.8,71.8,2023.0,0
1,3,66,ARI,Trey Benson,RB,21.0,2024.0,0,0,0,...,92.3,54.0,50.9,38.2,54.1,62.7,55.2,84.1,2023.0,0
2,3,83,LAR,Blake Corum,RB,23.0,2024.0,0,0,0,...,90.0,71.5,68.2,70.2,46.3,77.5,46.2,64.1,2023.0,1
3,3,88,GNB,MarShawn Lloyd,RB,23.0,2024.0,0,0,0,...,85.7,55.1,70.9,61.3,58.9,75.5,73.5,67.0,2023.0,0
4,4,120,MIA,Jaylen Wright,RB,21.0,2024.0,0,0,0,...,91.7,57.8,85.4,89.1,48.0,84.1,71.2,66.6,2023.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
374,4,111,BUF,Dwayne Wright,RB,24.0,2007.0,0,0,0,...,41.4,63.5,77.4,55.6,89.9,62.7,93.1,72.2,2006.0,0
376,5,148,KAN,Kolby Smith,RB,22.0,2009.0,0,0,0,...,53.2,80.6,76.9,69.9,73.6,71.3,82.1,-,2006.0,0
380,7,228,GNB,DeShawn Wynn,RB,23.0,2010.0,0,0,0,...,40.4,66.6,82.7,84.0,91.1,68.7,85.0,-,2006.0,1
383,7,246,TAM,Kenneth Darby,RB,24.0,2010.0,0,0,0,...,50.3,54.1,66.1,65.0,45.2,59.7,69.2,-,2006.0,0


In [55]:
# Define a function to convert height
def convert_height(ht_str):
    if isinstance(ht_str, str) and '-' in ht_str:
        feet, inches = ht_str.split('-')
        return int(feet) * 12 + int(inches)
    return None  # if not valid format, return None

# Apply the function to the Ht column
rbs['Ht'] = rbs['Ht'].apply(convert_height)
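
Note that `convert_height` above would raise a `ValueError` on a malformed string such as `'6-'`, because `int('')` fails. A slightly more defensive variant could look like the sketch below; `height_to_inches` is a hypothetical name, and returning `None` for bad input is an assumption about the desired behaviour.

```python
def height_to_inches(ht_str):
    """Convert a 'feet-inches' string such as '6-3' to total inches.

    Returns None for anything that does not parse cleanly.
    """
    if not isinstance(ht_str, str):
        return None
    parts = ht_str.strip().split('-')
    if len(parts) != 2:
        return None
    try:
        feet, inches = int(parts[0]), int(parts[1])
    except ValueError:
        return None
    return feet * 12 + inches
```

For example, `height_to_inches('6-3')` gives 75, while `height_to_inches('6-')` gives `None` instead of raising.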


In [56]:
# Compute the rookie vs position average delta
rbs['Rookie_vs_PosAvg'] = (
    rbs['grade_rookie_season'] - rbs['Average_Grades_of_RBs_in_Year']
)


In [57]:
# Normalize the metrics
scaler = MinMaxScaler()
normalized = scaler.fit_transform(rbs[[
    'wAV', 'DrAV', 'grade_rookie_season', 'Career_Avg_Grade', 'Rookie_vs_PosAvg'
]])
rbs[['wAV_norm', 'DrAV_norm', 'rookie_norm', 'career_norm', 'delta_norm']] = normalized

In [58]:
# Calculate Success_2 as a weighted sum of the normalized metrics (weights sum to 1)
rbs['Success_2'] = (
    0.35 * rbs['wAV_norm'] +
    0.25 * rbs['DrAV_norm'] +
    0.20 * rbs['rookie_norm'] +
    0.15 * rbs['career_norm'] +
    0.05 * rbs['delta_norm']
)
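
The composite score can be illustrated without scikit-learn. The sketch below re-implements min-max scaling in plain Python and applies the same five weights to made-up values for three hypothetical players. The helper names `min_max` and `success_2` are illustrative only. Because each metric is rescaled by the minimum and maximum of the sample, the resulting score is relative to the players in that sample rather than an absolute measure.

```python
# Weights copied from the Success_2 formula above.
WEIGHTS = {
    'wAV': 0.35,
    'DrAV': 0.25,
    'grade_rookie_season': 0.20,
    'Career_Avg_Grade': 0.15,
    'Rookie_vs_PosAvg': 0.05,
}

def min_max(values):
    """Rescale values to [0, 1]; in this sketch a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def success_2(columns):
    """columns maps each metric name to a list of raw values, one per player."""
    normed = {name: min_max(columns[name]) for name in WEIGHTS}
    n_players = len(next(iter(normed.values())))
    return [
        sum(WEIGHTS[name] * normed[name][i] for name in WEIGHTS)
        for i in range(n_players)
    ]

# Made-up values for three hypothetical players (best, middle, worst).
toy = {
    'wAV': [40.0, 10.0, 0.0],
    'DrAV': [30.0, 5.0, 0.0],
    'grade_rookie_season': [85.0, 70.0, 50.0],
    'Career_Avg_Grade': [80.0, 68.0, 55.0],
    'Rookie_vs_PosAvg': [15.0, 0.0, -20.0],
}
scores = success_2(toy)
```

A player who leads the sample on every metric scores 1, and one who trails on every metric scores 0.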


In [59]:
# List of RB-specific college stats columns to check for NaNs
college_stat_cols_rbs = [
    'College/Univ', 'Total_Games', 'Total_Rec', 'Total_Rec_Yds', 'Total_Rec_TD',
    'Rec_Y_R', 'Rec_Y_G', 'Total_Rush_Att', 'Total_Rush_Yds', 'Total_Rush_TD',
    'Rush_Y_A', 'Rush_Y_G', 'Total_Plays', 'Total_Yds', 'Total_TD',
    'Seasons_Played', 'wAV/G'
]

# Drop rows with missing values in any of the above columns
rbs = rbs.dropna(subset=college_stat_cols_rbs)

In [60]:
# 40-Yard Dash Rating for RBs (lower is better)
def rate_rb_40yd(time):
    if pd.isna(time): return 0
    elif time <= 4.31: return 4
    elif time <= 4.55: return 3
    elif time <= 4.78: return 2
    elif time <= 5.5: return 1
    else: return 0

# Vertical Jump Rating for RBs (higher is better)
def rate_rb_vertical(jump):
    if pd.isna(jump): return 0
    elif jump >= 41.5: return 4
    elif jump >= 34.0: return 3
    elif jump >= 28.0: return 2
    elif jump > 0: return 1
    else: return 0

# Broad Jump Rating for RBs (higher is better)
def rate_rb_broad_jump(jump):
    if pd.isna(jump): return 0
    elif jump >= 135: return 4
    elif jump >= 120: return 3
    elif jump >= 108: return 2
    elif jump > 0: return 1
    else: return 0

# 3-Cone Drill Rating for RBs (lower is better)
def rate_rb_3cone(time):
    if pd.isna(time): return 0
    elif time <= 6.50: return 4
    elif time <= 6.95: return 3
    elif time <= 7.30: return 2
    elif time <= 7.40: return 1
    else: return 0

# Shuttle Rating for RBs (lower is better)
def rate_rb_shuttle(time):
    if pd.isna(time): return 0
    elif time <= 3.93: return 4
    elif time <= 4.15: return 3
    elif time <= 4.35: return 2
    elif time <= 4.60: return 1
    else: return 0

# Bench Press Rating for RBs (higher is better)
def rate_rb_bench(reps):
    if pd.isna(reps): return 0
    elif reps >= 36: return 4
    elif reps >= 20: return 3
    elif reps >= 15: return 2
    elif reps > 0: return 1
    else: return 0

# Work on an explicit copy so the assignments below do not trigger
# pandas' SettingWithCopyWarning (rbs was produced by earlier filtering)
rbs = rbs.copy()

# Replace each raw combine measurement in rbs with its 0-4 rating
rbs['40yd'] = rbs['40yd'].apply(rate_rb_40yd)
rbs['Vertical'] = rbs['Vertical'].apply(rate_rb_vertical)
rbs['Broad Jump'] = rbs['Broad Jump'].apply(rate_rb_broad_jump)
rbs['3Cone'] = rbs['3Cone'].apply(rate_rb_3cone)
rbs['Shuttle'] = rbs['Shuttle'].apply(rate_rb_shuttle)
rbs['Bench'] = rbs['Bench'].apply(rate_rb_bench)

In [61]:
rbs.info()

<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, 0 to 384
Data columns (total 84 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Rnd                            263 non-null    int64  
 1   Pick                           263 non-null    int64  
 2   Tm                             263 non-null    object 
 3   Player                         263 non-null    object 
 4   Pos                            263 non-null    object 
 5   Age                            263 non-null    float64
 6   To                             263 non-null    float64
 7   AP1                            263 non-null    int64  
 8   PB                             263 non-null    int64  
 9   St                             263 non-null    int64  
 10  wAV                            263 non-null    float64
 11  DrAV                           263 non-null    float64
 12  G                              263 non-null    float64


### RBs Success Metric 1

In [62]:
rbs_success_1 = rbs.copy()  # independent copy, so later edits do not alter rbs

In [63]:
rbs_success_1

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,SPEC,Year,Success_1,Rookie_vs_PosAvg,wAV_norm,DrAV_norm,rookie_norm,career_norm,delta_norm,Success_2
0,2,46,CAR,Jonathon Brooks,RB,21.0,2024.0,0,0,0,...,71.8,2023.0,0,-7.533333,0.000000,0.000000,0.675824,0.677313,0.662895,0.269907
1,3,66,ARI,Trey Benson,RB,21.0,2024.0,0,0,0,...,84.1,2023.0,0,-6.233333,0.019802,0.022472,0.690110,0.691630,0.676898,0.288160
2,3,83,LAR,Blake Corum,RB,23.0,2024.0,0,0,0,...,64.1,2023.0,1,2.966667,0.019802,0.022472,0.791209,0.792952,0.775992,0.328533
3,3,88,GNB,MarShawn Lloyd,RB,23.0,2024.0,0,0,0,...,67.0,2023.0,0,-15.733333,0.000000,0.000000,0.585714,0.587004,0.574572,0.233922
4,4,120,MIA,Jaylen Wright,RB,21.0,2024.0,0,0,0,...,66.6,2023.0,0,-3.733333,0.019802,0.022472,0.717582,0.719163,0.703826,0.299131
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
374,4,111,BUF,Dwayne Wright,RB,24.0,2007.0,0,0,0,...,72.2,2006.0,0,-27.136290,0.009901,0.011236,0.438462,0.439427,0.451749,0.182468
376,5,148,KAN,Kolby Smith,RB,22.0,2009.0,0,0,0,...,-,2006.0,0,-4.536290,0.039604,0.044944,0.686813,0.653084,0.695177,0.295181
380,7,228,GNB,DeShawn Wynn,RB,23.0,2010.0,0,0,0,...,-,2006.0,1,1.563710,0.029703,0.033708,0.753846,0.611601,0.760881,0.299376
383,7,246,TAM,Kenneth Darby,RB,24.0,2010.0,0,0,0,...,-,2006.0,0,-5.236290,0.039604,0.000000,0.679121,0.644548,0.687637,0.280750


In [64]:
# Columns from Block II (NFL total statistics), plus draft identifiers,
# the intermediate normalized columns, and Success_2 (the other target)
nfl_total_cols = [
    'To', 'AP1', 'PB', 'St', 'wAV', 'DrAV', 'G',
    'Cmp_Pass', 'Att_Pass', 'Yds_Pass', 'TD_Pass', 'Int_Pass',
    'Att_Rush', 'Yds_Rush', 'TD_Rush', 'Rec', 'Yds_Rec', 'TD_Rec',
    'Tackles', 'Int_Def', 'Sk', 'TeamDrafted', 'Rnd', 'Pick', 'Tm', 'Player',
    'Pos', 'year_drafted', 'Team', 'Rookie_vs_PosAvg', 'wAV_norm', 'DrAV_norm',
    'rookie_norm', 'career_norm', 'delta_norm', 'Success_2'
]

# Columns from Block V: PFF Scores
pff_score_cols = [
    'grade_rookie_season', 'Career_Avg_Grade',
    'Average_Grades_of_RBs_in_Year'
]

# Columns from Block VI: PFF Scores of Previous Season
prev_season_pff_cols = [
    'prev_year', 'Record', 'PF', 'PA', 'OVER', 'OFF', 'PASS',
    'PBLK', 'RECV', 'RUN', 'RBLK', 'DEF', 'RDEF', 'TACK',
    'PRSH', 'COV', 'SPEC', 'Year'
]

# Combine all columns to drop
columns_to_drop = nfl_total_cols + pff_score_cols + prev_season_pff_cols

# Drop safely only if column exists
rbs_success_1_clean = rbs_success_1.drop(columns=[col for col in columns_to_drop if col in rbs_success_1.columns])
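
The list comprehension above filters out names that are not present before dropping. pandas offers the same behaviour through the `errors='ignore'` argument of `DataFrame.drop`. The sketch below shows this on a toy frame; the frame and the shortened drop list are illustrative only.

```python
import pandas as pd

# Toy frame; only some of the names in columns_to_drop exist in it.
toy = pd.DataFrame({
    'Age': [21.0, 23.0],
    'wAV': [3.0, 7.0],
    'Success_1': [0, 1],
})
columns_to_drop = ['wAV', 'DrAV', 'Tm']  # 'DrAV' and 'Tm' are absent here

# errors='ignore' skips labels that are not present instead of raising KeyError.
cleaned = toy.drop(columns=columns_to_drop, errors='ignore')
```

As with the original cell, `drop` returns a new DataFrame and leaves the input unchanged.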

In [65]:
rbs_success_1_clean

Unnamed: 0,Age,College/Univ,Total_Games,Total_Rec,Total_Rec_Yds,Total_Rec_TD,Rec_Y_R,Rec_Y_G,Total_Rush_Att,Total_Rush_Yds,...,wAV/G,Ht,Wt,40yd,Vertical,Bench,Broad Jump,3Cone,Shuttle,Success_1
0,21.0,Texas,23.0,28.0,335.0,2.0,11.964286,14.565217,238.0,1479.0,...,0.000000,72,216.0,0,0,0,0,0,0,0
1,21.0,Florida St.,36.0,33.0,371.0,1.0,11.242424,10.305556,316.0,1918.0,...,0.153846,72,216.0,3,2,0,3,0,0,0
2,23.0,Michigan,45.0,56.0,411.0,3.0,7.339286,9.133333,674.0,3737.0,...,0.117647,68,205.0,3,3,3,0,3,3,1
3,23.0,USC,32.0,34.0,452.0,2.0,13.294118,14.125000,291.0,1621.0,...,0.000000,69,220.0,3,3,3,2,0,0,0
4,21.0,Tennessee,34.0,30.0,171.0,0.0,5.700000,5.029412,368.0,2297.0,...,0.133333,71,210.0,3,3,0,3,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
374,24.0,Fresno St.,30.0,51.0,429.0,2.0,8.411765,14.300000,501.0,2683.0,...,0.066667,71,228.0,2,3,1,2,2,1,0
376,22.0,Louisville,46.0,56.0,581.0,2.0,10.375000,12.630435,316.0,1863.0,...,0.148148,71,220.0,3,3,2,2,2,2,0
380,23.0,Florida,46.0,35.0,376.0,3.0,10.742857,8.173913,446.0,2077.0,...,0.130435,70,232.0,3,2,0,2,0,0,1
383,24.0,Alabama,47.0,70.0,340.0,2.0,4.857143,7.234043,702.0,3324.0,...,0.097561,70,211.0,2,2,2,2,2,1,0


In [66]:
rbs_success_1_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, 0 to 384
Data columns (total 27 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             263 non-null    float64
 1   College/Univ    263 non-null    object 
 2   Total_Games     263 non-null    float64
 3   Total_Rec       263 non-null    float64
 4   Total_Rec_Yds   263 non-null    float64
 5   Total_Rec_TD    263 non-null    float64
 6   Rec_Y_R         263 non-null    float64
 7   Rec_Y_G         263 non-null    float64
 8   Total_Rush_Att  263 non-null    float64
 9   Total_Rush_Yds  263 non-null    float64
 10  Total_Rush_TD   263 non-null    float64
 11  Rush_Y_A        263 non-null    float64
 12  Rush_Y_G        263 non-null    float64
 13  Total_Plays     263 non-null    float64
 14  Total_Yds       263 non-null    float64
 15  Total_TD        263 non-null    float64
 16  Seasons_Played  263 non-null    float64
 17  wAV/G           263 non-null    float64


In [67]:
rbs_success_1_clean.drop(columns=['wAV/G'], inplace=True)

### RBs Success Metric 2

In [68]:
rbs_success_2 = rbs_success_1.copy()

In [69]:
# Columns from Block II (NFL total statistics), plus draft identifiers,
# the intermediate normalized columns, and Success_1 (the other target)
nfl_total_cols = [
    'To', 'AP1', 'PB', 'St', 'wAV', 'DrAV', 'G',
    'Cmp_Pass', 'Att_Pass', 'Yds_Pass', 'TD_Pass', 'Int_Pass',
    'Att_Rush', 'Yds_Rush', 'TD_Rush', 'Rec', 'Yds_Rec', 'TD_Rec',
    'Tackles', 'Int_Def', 'Sk', 'TeamDrafted', 'Rnd', 'Pick', 'Tm', 'Player',
    'Pos', 'year_drafted', 'Team', 'Rookie_vs_PosAvg', 'wAV_norm', 'DrAV_norm',
    'rookie_norm', 'career_norm', 'delta_norm', 'Success_1'
]

# Columns from Block V: PFF Scores
pff_score_cols = [
    'grade_rookie_season', 'Career_Avg_Grade',
    'Average_Grades_of_RBs_in_Year'
]

# Columns from Block VI: PFF Scores of Previous Season
prev_season_pff_cols = [
    'prev_year', 'Record', 'PF', 'PA', 'OVER', 'OFF', 'PASS',
    'PBLK', 'RECV', 'RUN', 'RBLK', 'DEF', 'RDEF', 'TACK',
    'PRSH', 'COV', 'SPEC', 'Year'
]

# Combine all columns to drop
columns_to_drop = nfl_total_cols + pff_score_cols + prev_season_pff_cols

# Drop safely only if column exists
rbs_success_2_clean = rbs_success_2.drop(columns=[col for col in columns_to_drop if col in rbs_success_2.columns])

In [70]:
rbs_success_2_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, 0 to 384
Data columns (total 27 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             263 non-null    float64
 1   College/Univ    263 non-null    object 
 2   Total_Games     263 non-null    float64
 3   Total_Rec       263 non-null    float64
 4   Total_Rec_Yds   263 non-null    float64
 5   Total_Rec_TD    263 non-null    float64
 6   Rec_Y_R         263 non-null    float64
 7   Rec_Y_G         263 non-null    float64
 8   Total_Rush_Att  263 non-null    float64
 9   Total_Rush_Yds  263 non-null    float64
 10  Total_Rush_TD   263 non-null    float64
 11  Rush_Y_A        263 non-null    float64
 12  Rush_Y_G        263 non-null    float64
 13  Total_Plays     263 non-null    float64
 14  Total_Yds       263 non-null    float64
 15  Total_TD        263 non-null    float64
 16  Seasons_Played  263 non-null    float64
 17  wAV/G           263 non-null    float64


In [71]:
rbs_success_2_clean.drop(columns=['wAV/G'], inplace=True)

## WRs & TEs

This section reviews the WR and TE data, cleans it, and builds the two success metrics.

In [72]:
wr_tes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 845 entries, 0 to 844
Data columns (total 76 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Rnd                                  845 non-null    int64  
 1   Pick                                 845 non-null    int64  
 2   Tm                                   845 non-null    object 
 3   Player                               845 non-null    object 
 4   Pos                                  845 non-null    object 
 5   Age                                  836 non-null    float64
 6   To                                   787 non-null    float64
 7   AP1                                  845 non-null    int64  
 8   PB                                   845 non-null    int64  
 9   St                                   845 non-null    int64  
 10  wAV                                  787 non-null    float64
 11  DrAV                            

In [73]:
wr_tes['Pos'].value_counts()

Pos,count
WR,582
TE,263


In [74]:
# Drop rows with missing Ht or Wt
wr_tes = wr_tes.dropna(subset=['Ht', 'Wt'])

In [75]:
# Drop rows with missing success metric inputs
required_success_cols = [
    'grade_rookie_season', 'Average_Grades_per_position_in_Year', 'wAV', 'DrAV'
]
wr_tes = wr_tes.dropna(subset=required_success_cols)

In [76]:
wr_tes[wr_tes['Age'].isna()]

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,RECV,RUN,RBLK,DEF,RDEF,TACK,PRSH,COV,SPEC,Year


Creating Success metric 1


In [77]:
# Create Success_1 (binary)
wr_tes['Success_1'] = (wr_tes['grade_rookie_season'] > wr_tes['Average_Grades_per_position_in_Year']).astype(int)


Creating Success metric 2

In [78]:
# Create delta and normalize for Success_2
wr_tes['Rookie_vs_PosAvg'] = wr_tes['grade_rookie_season'] - wr_tes['Average_Grades_per_position_in_Year']

scaler = MinMaxScaler()
normalized = scaler.fit_transform(wr_tes[[
    'wAV', 'DrAV', 'grade_rookie_season', 'Career_Avg_Grade', 'Rookie_vs_PosAvg'
]])
wr_tes[['wAV_norm', 'DrAV_norm', 'rookie_norm', 'career_norm', 'delta_norm']] = normalized

# Compute Success_2 (continuous)
wr_tes['Success_2'] = (
    0.35 * wr_tes['wAV_norm'] +
    0.25 * wr_tes['DrAV_norm'] +
    0.20 * wr_tes['rookie_norm'] +
    0.15 * wr_tes['career_norm'] +
    0.05 * wr_tes['delta_norm']
)


In [79]:
wr_tes

Unnamed: 0,Rnd,Pick,Tm,Player,Pos,Age,To,AP1,PB,St,...,SPEC,Year,Success_1,Rookie_vs_PosAvg,wAV_norm,DrAV_norm,rookie_norm,career_norm,delta_norm,Success_2
1,1,6,NYG,Malik Nabers,WR,21.0,2024.0,0,1,1,...,69.4,2023.0,1,21.076796,0.089109,0.092784,0.946507,0.949097,0.949400,0.433520
2,1,9,CHI,Rome Odunze,WR,22.0,2024.0,0,0,1,...,78.6,2023.0,0,-1.823204,0.059406,0.061856,0.696507,0.698413,0.708010,0.315720
3,1,23,JAX,Brian Thomas,WR,21.0,2024.0,0,1,1,...,85.4,2023.0,1,16.376796,0.108911,0.113402,0.895197,0.897646,0.899857,0.425148
4,1,28,KAN,Xavier Worthy,WR,21.0,2024.0,0,0,1,...,90.4,2023.0,1,6.176796,0.069307,0.072165,0.783843,0.785988,0.792338,0.356582
5,1,31,SFO,Ricky Pearsall,WR,23.0,2024.0,0,0,0,...,71.8,2023.0,0,-1.723204,0.029703,0.030928,0.697598,0.699507,0.709064,0.298027
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
835,4,129,LAC,Scott Chandler,TE,22.0,2015.0,0,0,2,...,-,2006.0,0,-67.318391,0.178218,0.000000,0.000000,0.727422,0.017621,0.172371
836,4,133,ATL,Martrez Milner,TE,23.0,2007.0,0,0,0,...,-,2006.0,0,-12.918391,0.000000,0.000000,0.593886,0.595512,0.591055,0.237657
837,5,153,NYG,Kevin Boss,TE,23.0,2012.0,0,0,4,...,-,2006.0,1,1.781609,0.168317,0.144330,0.754367,0.852217,0.746008,0.411000
838,5,155,CAR,Dante Rosario,TE,22.0,2014.0,0,0,1,...,-,2006.0,1,3.381609,0.089109,0.072165,0.771834,0.679666,0.762874,0.343690


In [80]:
# Columns from Block II (NFL total statistics), plus draft identifiers
# and the intermediate normalized columns (both targets are kept here)
nfl_total_cols = [
    'To', 'AP1', 'PB', 'St', 'wAV', 'DrAV', 'G',
    'Cmp_Pass', 'Att_Pass', 'Yds_Pass', 'TD_Pass', 'Int_Pass',
    'Att_Rush', 'Yds_Rush', 'TD_Rush', 'Rec', 'Yds_Rec', 'TD_Rec',
    'Tackles', 'Int_Def', 'Sk', 'TeamDrafted', 'Rnd', 'Pick', 'Tm', 'Player',
    'Pos', 'year_drafted', 'Team', 'Rookie_vs_PosAvg', 'wAV_norm', 'DrAV_norm',
    'rookie_norm', 'career_norm', 'delta_norm'
]

# Columns from Block V: PFF Scores
pff_score_cols = [
    'grade_rookie_season', 'Career_Avg_Grade',
    'Average_Grades_per_position_in_Year'
]

# Columns from Block VI: PFF Scores of Previous Season
prev_season_pff_cols = [
    'prev_year', 'Record', 'PF', 'PA', 'OVER', 'OFF', 'PASS',
    'PBLK', 'RECV', 'RUN', 'RBLK', 'DEF', 'RDEF', 'TACK',
    'PRSH', 'COV', 'SPEC', 'Year'
]

# Combine all columns to drop
columns_to_drop = nfl_total_cols + pff_score_cols + prev_season_pff_cols

# Drop safely only if column exists
wr_tes = wr_tes.drop(columns=[col for col in columns_to_drop if col in wr_tes.columns])

In [81]:
wr_tes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 642 entries, 1 to 840
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             642 non-null    float64
 1   College/Univ    642 non-null    object 
 2   Total_Games     607 non-null    float64
 3   Total_Rec       607 non-null    float64
 4   Total_Rec_Yds   607 non-null    float64
 5   Total_Rec_TD    607 non-null    float64
 6   Rec_Y_R         607 non-null    float64
 7   Rec_Y_G         607 non-null    float64
 8   Total_Rush_Att  607 non-null    float64
 9   Total_Rush_Yds  607 non-null    float64
 10  Total_Rush_TD   607 non-null    float64
 11  Rush_Y_A        419 non-null    object 
 12  Rush_Y_G        607 non-null    float64
 13  Total_Plays     607 non-null    float64
 14  Total_Yds       607 non-null    float64
 15  Total_TD        607 non-null    float64
 16  Seasons_Played  607 non-null    float64
 17  wAV/G           642 non-null    float64


In [82]:
# 40-Yard Dash Rating (WR/TE) — lower is better
def rate_wrte_40yd(time):
    if pd.isna(time): return 0
    elif time <= 4.22: return 4
    elif time <= 4.53: return 3
    elif time <= 4.85: return 2
    elif time <= 5.5: return 1
    else: return 0

# Vertical Jump Rating (WR/TE) — higher is better
def rate_wrte_vertical(jump):
    if pd.isna(jump): return 0
    elif jump >= 42.5: return 4
    elif jump >= 33.8: return 3
    elif jump >= 26.75: return 2
    elif jump > 0: return 1
    else: return 0

# Broad Jump Rating (WR/TE) — higher is better
def rate_wrte_broad_jump(jump):
    if pd.isna(jump): return 0
    elif jump >= 137: return 4
    elif jump >= 119: return 3
    elif jump >= 108: return 2
    elif jump > 0: return 1
    else: return 0

# 3-Cone Drill Rating (WR/TE) — lower is better
def rate_wrte_3cone(time):
    if pd.isna(time): return 0
    elif time <= 6.7: return 4
    elif time <= 7.0: return 3
    elif time <= 7.3: return 2
    elif time <= 7.6: return 1
    else: return 0

# Shuttle Drill Rating (WR/TE) — lower is better
def rate_wrte_shuttle(time):
    if pd.isna(time): return 0
    elif time <= 4.0: return 4
    elif time <= 4.25: return 3
    elif time <= 4.45: return 2
    elif time <= 4.6: return 1
    else: return 0

# Bench Press Rating (WR/TE) — higher is better
def rate_wrte_bench(reps):
    if pd.isna(reps): return 0
    elif reps >= 23: return 4
    elif reps >= 18: return 3
    elif reps >= 12: return 2
    elif reps > 0: return 1
    else: return 0

# Replace each raw combine measurement in wr_tes with its 0-4 rating
wr_tes['40yd'] = wr_tes['40yd'].apply(rate_wrte_40yd)
wr_tes['Vertical'] = wr_tes['Vertical'].apply(rate_wrte_vertical)
wr_tes['Broad Jump'] = wr_tes['Broad Jump'].apply(rate_wrte_broad_jump)
wr_tes['3Cone'] = wr_tes['3Cone'].apply(rate_wrte_3cone)
wr_tes['Shuttle'] = wr_tes['Shuttle'].apply(rate_wrte_shuttle)
wr_tes['Bench'] = wr_tes['Bench'].apply(rate_wrte_bench)
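
The rating functions for each position group share one shape: four cut-offs, a direction, and a 0 for missing values. A table-driven sketch of that pattern follows. `make_rating` is a hypothetical helper, not part of the notebook, and the cut-offs are copied from `rate_wrte_40yd` and `rate_wrte_bench` above. One consequence of the scheme, in the original as in this sketch, is that a missing measurement and a result worse than the last cut-off both map to 0, so a model cannot tell them apart.

```python
def make_rating(thresholds, lower_is_better):
    """Build a 0-4 rating function from four cut-offs ordered best to worst.

    thresholds[0] earns 4, thresholds[1] earns 3, and so on; a missing
    value or one outside every cut-off earns 0.
    """
    def rate(value):
        # Treat None and NaN (NaN != NaN) as missing.
        if value is None or value != value:
            return 0
        for score, cutoff in zip((4, 3, 2, 1), thresholds):
            if lower_is_better and value <= cutoff:
                return score
            if not lower_is_better and value >= cutoff:
                return score
        return 0
    return rate

# Cut-offs copied from rate_wrte_40yd and rate_wrte_bench above.
rate_40yd = make_rating((4.22, 4.53, 4.85, 5.5), lower_is_better=True)
# Assumption: reps are whole numbers, so the original "reps > 0" rule
# is equivalent to "reps >= 1".
rate_bench = make_rating((23, 18, 12, 1), lower_is_better=False)
```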


### WRs & TEs Success Metric 1

In [83]:
wr_tes_success_1_clean = wr_tes.copy()

In [84]:
wr_tes_success_1_clean

Unnamed: 0,Age,College/Univ,Total_Games,Total_Rec,Total_Rec_Yds,Total_Rec_TD,Rec_Y_R,Rec_Y_G,Total_Rush_Att,Total_Rush_Yds,...,Ht,Wt,40yd,Vertical,Bench,Broad Jump,3Cone,Shuttle,Success_1,Success_2
1,21.0,LSU,38.0,189.0,3003.0,21.0,15.888889,79.026316,6.0,29.0,...,6-0,200.0,0,0,0,0,0,0,1,0.433520
2,22.0,Washington,40.0,214.0,3272.0,24.0,15.289720,81.800000,10.0,40.0,...,6-3,212.0,3,3,0,3,3,3,0,0.315720
3,21.0,LSU,38.0,127.0,1897.0,24.0,14.937008,49.921053,3.0,0.0,...,6-3,209.0,3,3,1,3,0,0,1,0.425148
4,21.0,Texas,39.0,197.0,2755.0,26.0,13.984772,70.641026,7.0,56.0,...,5-11,165.0,4,3,0,3,0,0,1,0.356582
5,23.0,Florida,55.0,159.0,2420.0,14.0,15.220126,44.000000,21.0,253.0,...,6-1,189.0,3,3,2,3,4,3,0,0.298027
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
835,22.0,Iowa,37.0,117.0,1467.0,10.0,12.538462,39.648649,0.0,0.0,...,6-7,270.0,2,2,2,2,2,2,0,0.172371
836,23.0,Georgia,48.0,49.0,808.0,5.0,16.489796,16.833333,0.0,0.0,...,6-4,252.0,2,3,3,2,1,1,0,0.237657
837,23.0,Western Oregon,,,,,,,,,...,6-6,252.0,2,3,0,2,3,2,1,0.411000
838,22.0,Oregon,49.0,94.0,1003.0,11.0,10.670213,20.469388,27.0,84.0,...,6-3,244.0,2,3,3,2,3,1,1,0.343690


In [85]:
def convert_height(ht_str):
    if isinstance(ht_str, str) and '-' in ht_str:
        feet, inches = ht_str.split('-')
        return int(feet) * 12 + int(inches)
    return None  # if not valid format, return None

# Apply the function to the Ht column
wr_tes_success_1_clean['Ht'] = wr_tes_success_1_clean['Ht'].apply(convert_height)

In [86]:
wr_tes_success_1_clean.drop(columns=['Success_2'], inplace=True)


In [87]:
wr_tes_success_1_clean.dropna(subset=['Total_Games', 'Total_Rec', 'Total_Rec_Yds', 'Total_Rec_TD',
                  'Rec_Y_R', 'Rec_Y_G', 'Total_Rush_Att', 'Total_Rush_Yds',
                  'Total_Rush_TD', 'Rush_Y_A', 'Rush_Y_G', 'Total_Plays',
                  'Total_Yds', 'Total_TD', 'Seasons_Played'], inplace=True)


In [88]:
wr_tes_success_1_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 419 entries, 1 to 838
Data columns (total 27 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             419 non-null    float64
 1   College/Univ    419 non-null    object 
 2   Total_Games     419 non-null    float64
 3   Total_Rec       419 non-null    float64
 4   Total_Rec_Yds   419 non-null    float64
 5   Total_Rec_TD    419 non-null    float64
 6   Rec_Y_R         419 non-null    float64
 7   Rec_Y_G         419 non-null    float64
 8   Total_Rush_Att  419 non-null    float64
 9   Total_Rush_Yds  419 non-null    float64
 10  Total_Rush_TD   419 non-null    float64
 11  Rush_Y_A        419 non-null    object 
 12  Rush_Y_G        419 non-null    float64
 13  Total_Plays     419 non-null    float64
 14  Total_Yds       419 non-null    float64
 15  Total_TD        419 non-null    float64
 16  Seasons_Played  419 non-null    float64
 17  wAV/G           419 non-null    float64


In [89]:
wr_tes_success_1_clean['Rush_Y_A'] = pd.to_numeric(wr_tes_success_1_clean['Rush_Y_A'], errors='coerce').fillna(0)
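
The conversion above turns an `object` column into floats. The sketch below shows what `errors='coerce'` followed by `fillna(0)` does to strings that cannot be parsed. The raw values are hypothetical, since the notebook does not show which non-numeric entries `Rush_Y_A` contains. Note also that rows with a missing `Rush_Y_A` were already removed by the earlier `dropna`, so in the notebook `fillna(0)` only affects values that fail to parse.

```python
import pandas as pd

# Hypothetical raw values; the actual non-numeric entries in Rush_Y_A
# are not shown in the notebook.
raw = pd.Series(['4.5', '-', '3'], dtype=object)

# Unparseable strings become NaN, which fillna(0) then replaces with 0.
clean = pd.to_numeric(raw, errors='coerce').fillna(0)
```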


In [90]:
wr_tes_success_1_clean.drop(columns=['wAV/G'], inplace=True)

### WRs & TEs Success Metric 2

In [91]:
wr_tes_success_2_clean = wr_tes.copy()

In [92]:
wr_tes_success_2_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 642 entries, 1 to 840
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             642 non-null    float64
 1   College/Univ    642 non-null    object 
 2   Total_Games     607 non-null    float64
 3   Total_Rec       607 non-null    float64
 4   Total_Rec_Yds   607 non-null    float64
 5   Total_Rec_TD    607 non-null    float64
 6   Rec_Y_R         607 non-null    float64
 7   Rec_Y_G         607 non-null    float64
 8   Total_Rush_Att  607 non-null    float64
 9   Total_Rush_Yds  607 non-null    float64
 10  Total_Rush_TD   607 non-null    float64
 11  Rush_Y_A        419 non-null    object 
 12  Rush_Y_G        607 non-null    float64
 13  Total_Plays     607 non-null    float64
 14  Total_Yds       607 non-null    float64
 15  Total_TD        607 non-null    float64
 16  Seasons_Played  607 non-null    float64
 17  wAV/G           642 non-null    float64


In [93]:
def convert_height(ht_str):
    if isinstance(ht_str, str) and '-' in ht_str:
        feet, inches = ht_str.split('-')
        return int(feet) * 12 + int(inches)
    return None  # if not valid format, return None

# Apply the function to the Ht column
wr_tes_success_2_clean['Ht'] = wr_tes_success_2_clean['Ht'].apply(convert_height)

In [94]:
wr_tes_success_2_clean.drop(columns=['Success_1'], inplace=True)

In [95]:
wr_tes_success_2_clean.dropna(subset=['Total_Games', 'Total_Rec', 'Total_Rec_Yds', 'Total_Rec_TD',
                  'Rec_Y_R', 'Rec_Y_G', 'Total_Rush_Att', 'Total_Rush_Yds',
                  'Total_Rush_TD', 'Rush_Y_A', 'Rush_Y_G', 'Total_Plays',
                  'Total_Yds', 'Total_TD', 'Seasons_Played'], inplace=True)


In [96]:
wr_tes_success_2_clean['Rush_Y_A'] = pd.to_numeric(wr_tes_success_2_clean['Rush_Y_A'], errors='coerce').fillna(0)


In [97]:
wr_tes_success_2_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 419 entries, 1 to 838
Data columns (total 27 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             419 non-null    float64
 1   College/Univ    419 non-null    object 
 2   Total_Games     419 non-null    float64
 3   Total_Rec       419 non-null    float64
 4   Total_Rec_Yds   419 non-null    float64
 5   Total_Rec_TD    419 non-null    float64
 6   Rec_Y_R         419 non-null    float64
 7   Rec_Y_G         419 non-null    float64
 8   Total_Rush_Att  419 non-null    float64
 9   Total_Rush_Yds  419 non-null    float64
 10  Total_Rush_TD   419 non-null    float64
 11  Rush_Y_A        419 non-null    float64
 12  Rush_Y_G        419 non-null    float64
 13  Total_Plays     419 non-null    float64
 14  Total_Yds       419 non-null    float64
 15  Total_TD        419 non-null    float64
 16  Seasons_Played  419 non-null    float64
 17  wAV/G           419 non-null    float64


In [98]:
wr_tes_success_2_clean.drop(columns=['wAV/G'], inplace=True)

## OL

This section reviews the OL data, cleans it, and builds the two success metrics.

In [99]:
ol.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 79 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Rnd                                  392 non-null    int64  
 1   Pick                                 392 non-null    int64  
 2   Tm                                   392 non-null    object 
 3   Player                               392 non-null    object 
 4   Pos                                  392 non-null    object 
 5   Age                                  392 non-null    float64
 6   To                                   361 non-null    float64
 7   AP1                                  392 non-null    int64  
 8   PB                                   392 non-null    int64  
 9   St                                   392 non-null    int64  
 10  wAV                                  361 non-null    float64
 11  DrAV                            

In [100]:
ol['Pos'].value_counts()

Pos,count
OL,157
T,114
G,73
C,41
OT,5
OG,2


In [101]:
ol = ol.dropna(subset=['Ht', 'Wt'])


In [102]:
success_cols = [
    'grade_rookie_season',
    'Average_Grades_per_position_in_Year',
    'wAV',
    'DrAV',
    'Career_Avg_Grade'
]

# Show number of missing values per column
ol[success_cols].isna().sum()


Column,Missing values
grade_rookie_season,0
Average_Grades_per_position_in_Year,0
wAV,20
DrAV,35
Career_Avg_Grade,0


In [103]:
ol = ol.dropna(subset=['wAV', 'DrAV'])


In [104]:
def convert_height(ht):
    if isinstance(ht, str) and '-' in ht:
        feet, inches = ht.split('-')
        return int(feet) * 12 + int(inches)
    return ht  # keep as is if already in inches or invalid

ol['Ht'] = ol['Ht'].apply(convert_height)


In [105]:
# Create Success_1 (binary)
ol['Success_1'] = (ol['grade_rookie_season'] > ol['Average_Grades_per_position_in_Year']).astype(int)

In [106]:
# Create Success Metric 2

# Rookie grade minus the positional average for that year
ol['Rookie_vs_PosAvg'] = ol['grade_rookie_season'] - ol['Average_Grades_per_position_in_Year']

# Normalize relevant columns
scaler = MinMaxScaler()
norm_cols = ['wAV', 'DrAV', 'grade_rookie_season', 'Career_Avg_Grade', 'Rookie_vs_PosAvg']
normalized = scaler.fit_transform(ol[norm_cols])
ol[['wAV_norm', 'DrAV_norm', 'rookie_norm', 'career_norm', 'delta_norm']] = normalized



# Create Success_2 (continuous)
ol['Success_2'] = (
    0.35 * ol['wAV_norm'] +
    0.25 * ol['DrAV_norm'] +
    0.20 * ol['rookie_norm'] +
    0.15 * ol['career_norm'] +
    0.05 * ol['delta_norm']
)
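
The composite score is a weighted sum of min-max-normalized columns. Because the five weights sum to 1.0 and each normalized column lies in [0, 1], Success_2 also lies in [0, 1]. A minimal sketch on made-up numbers (the three players below are hypothetical, and plain NumPy stands in for MinMaxScaler with its default range):

```python
import numpy as np

# Made-up values for three hypothetical players (not taken from the dataset).
# Columns: wAV, DrAV, grade_rookie_season, Career_Avg_Grade, Rookie_vs_PosAvg
raw = np.array([
    [10.0,  8.0, 60.0, 62.0, -2.0],   # lowest value in every column
    [40.0, 35.0, 75.0, 78.0,  9.0],   # highest value in every column
    [25.0, 20.0, 68.0, 70.0,  3.0],   # somewhere in between
])

# Column-wise min-max scaling: (x - min) / (max - min)
col_min = raw.min(axis=0)
col_max = raw.max(axis=0)
normalized = (raw - col_min) / (col_max - col_min)

# Same weights as the Success_2 definition above
weights = np.array([0.35, 0.25, 0.20, 0.15, 0.05])
success_2 = normalized @ weights

print(weights.sum())   # the weights add up to 1.0 (up to floating-point rounding)
print(success_2)       # one score per player, each between 0 and 1
```

The player with the column minimum everywhere scores 0 and the one with the column maximum everywhere scores 1, which shows that the score is relative to the players in the fitted group rather than an absolute scale.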



In [107]:
# 40-Yard Dash Rating for OL (lower is better)
def rate_ol_40yd(time):
    if pd.isna(time): return 0
    elif time <= 4.71: return 4
    elif time <= 5.26: return 3
    elif time <= 5.55: return 2
    elif time <= 5.65: return 1
    else: return 0

# Vertical Jump Rating for OL (higher is better)
def rate_ol_vertical(jump):
    if pd.isna(jump): return 0
    elif jump >= 38.5: return 4
    elif jump >= 29: return 3
    elif jump >= 25: return 2
    elif jump > 0: return 1
    else: return 0

# Broad Jump Rating for OL (higher is better)
def rate_ol_broad_jump(jump):
    if pd.isna(jump): return 0
    elif jump >= 117: return 4
    elif jump >= 103.3: return 3
    elif jump >= 88: return 2
    elif jump > 0: return 1
    else: return 0

# 3-Cone Drill Rating for OL (lower is better)
def rate_ol_3cone(time):
    if pd.isna(time): return 0
    elif time <= 7.06: return 4
    elif time <= 7.80: return 3
    elif time <= 8.3: return 2
    elif time <= 8.4: return 1
    else: return 0

# Shuttle Drill Rating for OL (lower is better)
def rate_ol_shuttle(time):
    if pd.isna(time): return 0
    elif time <= 4.27: return 4
    elif time <= 4.74: return 3
    elif time <= 5.38: return 2
    elif time <= 5.5: return 1
    else: return 0

# Bench Press Rating for OL (higher is better)
def rate_ol_bench(reps):
    if pd.isna(reps): return 0
    elif reps >= 39: return 4
    elif reps >= 26: return 3
    elif reps >= 12: return 2
    elif reps > 0: return 1
    else: return 0

# Apply all ratings to the OL DataFrame (ol); each raw measurement is replaced by its 0-4 rating
ol['40yd'] = ol['40yd'].apply(rate_ol_40yd)
ol['Vertical'] = ol['Vertical'].apply(rate_ol_vertical)
ol['Broad Jump'] = ol['Broad Jump'].apply(rate_ol_broad_jump)
ol['3Cone'] = ol['3Cone'].apply(rate_ol_3cone)
ol['Shuttle'] = ol['Shuttle'].apply(rate_ol_shuttle)
ol['Bench'] = ol['Bench'].apply(rate_ol_bench)
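
The six rating functions share one pattern: compare a drill result against ordered cutoffs, return a rating from 4 down to 0, and rate missing values as 0. As a sketch of that pattern, the hypothetical helper `rate_drill` below (not part of the notebook) takes the cutoffs as data instead of hard-coding them in each function. The cutoffs are copied from the OL 40-yard dash and vertical jump functions.

```python
import math

def rate_drill(value, cutoffs, lower_is_better):
    """Map a drill result to a 0-4 rating using ordered cutoffs.

    lower_is_better=True  expects four cutoffs (ratings 4, 3, 2, 1).
    lower_is_better=False expects three cutoffs (ratings 4, 3, 2);
    any other positive value is rated 1.
    """
    # Missing measurements are rated 0, as in the notebook's functions
    if value is None or math.isnan(value):
        return 0
    if lower_is_better:
        for rating, cutoff in zip((4, 3, 2, 1), cutoffs):
            if value <= cutoff:
                return rating
        return 0
    for rating, cutoff in zip((4, 3, 2), cutoffs):
        if value >= cutoff:
            return rating
    return 1 if value > 0 else 0

# Cutoffs copied from rate_ol_40yd and rate_ol_vertical
OL_40YD = (4.71, 5.26, 5.55, 5.65)
OL_VERTICAL = (38.5, 29, 25)

print(rate_drill(5.0, OL_40YD, lower_is_better=True))
print(rate_drill(30, OL_VERTICAL, lower_is_better=False))
```

With this shape, the cutoffs for each position group could live in one dictionary, which makes them easier to compare and audit side by side.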


### OL Success Metric 1

In [108]:
ol_success_1 = ol.copy()

In [109]:
ol_success_1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 306 entries, 0 to 390
Data columns (total 87 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Rnd                                  306 non-null    int64  
 1   Pick                                 306 non-null    int64  
 2   Tm                                   306 non-null    object 
 3   Player                               306 non-null    object 
 4   Pos                                  306 non-null    object 
 5   Age                                  306 non-null    float64
 6   To                                   306 non-null    float64
 7   AP1                                  306 non-null    int64  
 8   PB                                   306 non-null    int64  
 9   St                                   306 non-null    int64  
 10  wAV                                  306 non-null    float64
 11  DrAV                                 

In [110]:
ol_success_1_clean = ol_success_1.drop(columns=[
    'To', 'AP1', 'PB', 'St', 'wAV', 'DrAV', 'G',
    'Cmp_Pass', 'Att_Pass', 'Yds_Pass', 'TD_Pass', 'Int_Pass',
    'Att_Rush', 'Yds_Rush', 'TD_Rush', 'Rec', 'Yds_Rec', 'TD_Rec',
    'Tackles', 'Int_Def', 'Sk',
    # PFF scores
    'grade_rookie_season', 'Career_Avg_Grade', 'Average_Grades_per_position_in_Year',
    # PFF previous-season grades
    'prev_year', 'Record', 'PF', 'PA', 'OVER', 'OFF', 'PASS',
    'PBLK', 'RECV', 'RUN', 'RBLK', 'DEF', 'RDEF', 'TACK',
    'PRSH', 'COV', 'SPEC', 'Year',
    # Identifiers, draft slot, and Success_2 with its intermediate columns
    'Rnd', 'Pick', 'Tm', 'Player', 'year_drafted', 'Team', 'TeamDrafted', 'Pos',
    'Success_2', 'Rookie_vs_PosAvg',
    'wAV_norm', 'DrAV_norm', 'rookie_norm', 'career_norm', 'delta_norm'
])


In [111]:
# Drop rows where any of the OL college stat columns are missing
ol_success_1_clean = ol_success_1_clean.dropna(subset=[
    'Total_Games', 'BLK', 'RBLK_x', 'PBLK_x', 'OFF_x', 'SK', 'HIT', 'HUR', 'PR',
    'EFF', 'PEN', 'LT', 'LG', 'C', 'RG', 'RT', 'ITE', 'Seasons_Played', 'BLK_P'
])



In [112]:
ol_success_1_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 300 entries, 0 to 389
Data columns (total 30 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             300 non-null    float64
 1   College/Univ    300 non-null    object 
 2   Total_Games     300 non-null    float64
 3   BLK             300 non-null    float64
 4   RBLK_x          300 non-null    float64
 5   PBLK_x          300 non-null    float64
 6   OFF_x           300 non-null    float64
 7   SK              300 non-null    float64
 8   HIT             300 non-null    float64
 9   HUR             300 non-null    float64
 10  PR              300 non-null    float64
 11  EFF             300 non-null    float64
 12  PEN             300 non-null    float64
 13  LT              300 non-null    float64
 14  LG              300 non-null    float64
 15  C               300 non-null    float64
 16  RG              300 non-null    float64
 17  RT              300 non-null    float64


### OL Success Metric 2

In [113]:
ol_success_2 = ol.copy()

In [114]:
ol_success_2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 306 entries, 0 to 390
Data columns (total 87 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Rnd                                  306 non-null    int64  
 1   Pick                                 306 non-null    int64  
 2   Tm                                   306 non-null    object 
 3   Player                               306 non-null    object 
 4   Pos                                  306 non-null    object 
 5   Age                                  306 non-null    float64
 6   To                                   306 non-null    float64
 7   AP1                                  306 non-null    int64  
 8   PB                                   306 non-null    int64  
 9   St                                   306 non-null    int64  
 10  wAV                                  306 non-null    float64
 11  DrAV                                 

In [115]:
ol_success_2_clean = ol_success_2.drop(columns=[
    'To', 'AP1', 'PB', 'St', 'wAV', 'DrAV', 'G',
    'Cmp_Pass', 'Att_Pass', 'Yds_Pass', 'TD_Pass', 'Int_Pass',
    'Att_Rush', 'Yds_Rush', 'TD_Rush', 'Rec', 'Yds_Rec', 'TD_Rec',
    'Tackles', 'Int_Def', 'Sk',
    # PFF scores
    'grade_rookie_season', 'Career_Avg_Grade', 'Average_Grades_per_position_in_Year',
    # PFF previous-season grades
    'prev_year', 'Record', 'PF', 'PA', 'OVER', 'OFF', 'PASS',
    'PBLK', 'RECV', 'RUN', 'RBLK', 'DEF', 'RDEF', 'TACK',
    'PRSH', 'COV', 'SPEC', 'Year',
    # Identifiers, draft slot, and Success_1 with the Success_2 intermediate columns
    'Rnd', 'Pick', 'Tm', 'Player', 'year_drafted', 'Team', 'TeamDrafted', 'Pos',
    'Success_1', 'Rookie_vs_PosAvg',
    'wAV_norm', 'DrAV_norm', 'rookie_norm', 'career_norm', 'delta_norm'
])


In [116]:
ol_success_2_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 306 entries, 0 to 390
Data columns (total 30 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             306 non-null    float64
 1   College/Univ    306 non-null    object 
 2   Total_Games     300 non-null    float64
 3   BLK             300 non-null    float64
 4   RBLK_x          300 non-null    float64
 5   PBLK_x          300 non-null    float64
 6   OFF_x           300 non-null    float64
 7   SK              300 non-null    float64
 8   HIT             300 non-null    float64
 9   HUR             300 non-null    float64
 10  PR              300 non-null    float64
 11  EFF             300 non-null    float64
 12  PEN             300 non-null    float64
 13  LT              300 non-null    float64
 14  LG              300 non-null    float64
 15  C               300 non-null    float64
 16  RG              300 non-null    float64
 17  RT              300 non-null    float64


In [117]:
# Drop rows where any of the OL college stat columns are missing
ol_success_2_clean = ol_success_2_clean.dropna(subset=[
    'Total_Games', 'BLK', 'RBLK_x', 'PBLK_x', 'OFF_x', 'SK', 'HIT', 'HUR', 'PR',
    'EFF', 'PEN', 'LT', 'LG', 'C', 'RG', 'RT', 'ITE', 'Seasons_Played', 'BLK_P'
])



In [118]:
ol_success_2_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 300 entries, 0 to 389
Data columns (total 30 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             300 non-null    float64
 1   College/Univ    300 non-null    object 
 2   Total_Games     300 non-null    float64
 3   BLK             300 non-null    float64
 4   RBLK_x          300 non-null    float64
 5   PBLK_x          300 non-null    float64
 6   OFF_x           300 non-null    float64
 7   SK              300 non-null    float64
 8   HIT             300 non-null    float64
 9   HUR             300 non-null    float64
 10  PR              300 non-null    float64
 11  EFF             300 non-null    float64
 12  PEN             300 non-null    float64
 13  LT              300 non-null    float64
 14  LG              300 non-null    float64
 15  C               300 non-null    float64
 16  RG              300 non-null    float64
 17  RT              300 non-null    float64


## DEF and the Subgroups Creation

Summary of DEF, data review and success metrics overview.

In [119]:
# Drop rows with missing height or weight
defe = defe.dropna(subset=['Ht', 'Wt'])

In [120]:
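# Show how many rows were dropped and how many remain
# Note: defe was already filtered in the previous cell, so both counts below come from
# the same filtered frame and the difference is always 0. To measure the rows dropped,
# store defe.shape[0] before calling dropna and subtract the new row count from it.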
missing_count = defe.shape[0] - defe.shape[0]
remaining_count = defe.shape[0]

missing_count, remaining_count


(0, 1909)
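
Counting dropped rows requires the row count from before the `dropna` call. A minimal sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical, not project data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two incomplete rows
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Ht": ["6-2", None, "6-4", "6-1"],
    "Wt": [250.0, 240.0, np.nan, 230.0],
})

rows_before = len(toy)                      # capture the size before filtering
toy = toy.dropna(subset=["Ht", "Wt"])       # drop rows missing height or weight
rows_dropped = rows_before - len(toy)
rows_remaining = len(toy)

print(rows_dropped, rows_remaining)
```

Here player B is missing a height and player C a weight, so two rows are dropped and two remain.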

In [121]:
defe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1909 entries, 0 to 2286
Data columns (total 75 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Rnd                                  1909 non-null   int64  
 1   Pick                                 1909 non-null   int64  
 2   Tm                                   1909 non-null   object 
 3   Player                               1909 non-null   object 
 4   Pos_x                                1909 non-null   object 
 5   Age                                  1892 non-null   float64
 6   To                                   1809 non-null   float64
 7   AP1                                  1909 non-null   int64  
 8   PB                                   1909 non-null   int64  
 9   St                                   1909 non-null   int64  
 10  wAV                                  1809 non-null   float64
 11  DrAV                               

In [122]:
defe = defe.dropna(subset=[
    'grade_rookie_season',
    'Average_Grades_per_position_in_Year',
    'wAV',
    'DrAV',
    'Career_Avg_Grade'
])


In [123]:
defe['Success_1'] = (defe['grade_rookie_season'] > defe['Average_Grades_per_position_in_Year']).astype(int)


In [124]:
defe['Rookie_vs_PosAvg'] = defe['grade_rookie_season'] - defe['Average_Grades_per_position_in_Year']

scaler = MinMaxScaler()
normalized = scaler.fit_transform(defe[[
    'wAV', 'DrAV', 'grade_rookie_season', 'Career_Avg_Grade', 'Rookie_vs_PosAvg'
]])
defe[['wAV_norm', 'DrAV_norm', 'rookie_norm', 'career_norm', 'delta_norm']] = normalized

defe['Success_2'] = (
    0.35 * defe['wAV_norm'] +
    0.25 * defe['DrAV_norm'] +
    0.20 * defe['rookie_norm'] +
    0.15 * defe['career_norm'] +
    0.05 * defe['delta_norm']
)


In [125]:
defe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1705 entries, 0 to 2286
Data columns (total 83 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Rnd                                  1705 non-null   int64  
 1   Pick                                 1705 non-null   int64  
 2   Tm                                   1705 non-null   object 
 3   Player                               1705 non-null   object 
 4   Pos_x                                1705 non-null   object 
 5   Age                                  1705 non-null   float64
 6   To                                   1705 non-null   float64
 7   AP1                                  1705 non-null   int64  
 8   PB                                   1705 non-null   int64  
 9   St                                   1705 non-null   int64  
 10  wAV                                  1705 non-null   float64
 11  DrAV                               

In [126]:
defe.drop(columns=[
    # Block II – NFL Total Statistics
    'To', 'AP1', 'PB', 'St', 'wAV', 'DrAV', 'G', 'Cmp_Pass', 'Att_Pass', 'Yds_Pass', 'TD_Pass', 'Int_Pass',
    'Att_Rush', 'Yds_Rush', 'TD_Rush', 'Rec', 'Yds_Rec', 'TD_Rec', 'Tackles', 'Int_Def', 'Sk', 'wAV/G',
    'Rnd','Pick','Tm','TeamDrafted','Rookie_vs_PosAvg','wAV_norm','DrAV_norm','rookie_norm','career_norm','delta_norm','grades_offense',
    'Player','year_drafted','COLLEGE','position',

    # Block V – PFF Scores
    'grade_rookie_season', 'Career_Avg_Grade', 'Average_Grades_per_position_in_Year',

    # Block VI – PFF Previous Season Grades
    'prev_year', 'Record', 'PF', 'PA', 'OVER', 'OFF', 'PASS', 'PBLK', 'RECV', 'RUN', 'RBLK',
    'DEF', 'RDEF', 'TACK', 'PRSH', 'COV', 'SPEC', 'Year'
], inplace=True, errors='ignore')


In [127]:
# Both columns are removed by the drop above when present; errors='ignore' keeps this cell from raising a KeyError
defe.drop(columns=['grade_rookie_season', 'position'], inplace=True, errors='ignore')
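
Long hand-written column lists are prone to a quiet Python pitfall: two adjacent string literals with no comma between them are concatenated into one string, even across a line break or a comment. Combined with `errors='ignore'`, the resulting non-existent name is skipped silently and neither intended column is dropped. A small illustration using a few of the column names from this section (the `available` set is a made-up stand-in for `defe.columns`):

```python
# A list with a missing comma after 'position'
cols_missing_comma = [
    'year_drafted', 'COLLEGE', 'position'   # <- no comma here

    # Block V - PFF Scores
    'grade_rookie_season', 'Career_Avg_Grade',
]

# Python joins the two adjacent literals into a single string
print(cols_missing_comma)

# Hypothetical stand-in for the DataFrame's real column names
available = {'year_drafted', 'COLLEGE', 'position',
             'grade_rookie_season', 'Career_Avg_Grade'}

# Checking the list against the real columns exposes the typo
unknown = [c for c in cols_missing_comma if c not in available]
print(unknown)
```

Running a check like `unknown` before calling `drop(..., errors='ignore')` is one way to catch such typos; dropping without `errors='ignore'` would also surface them as a KeyError.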


In [128]:
def convert_height(ht):
    if isinstance(ht, str) and '-' in ht:
        feet, inches = ht.split('-')
        return int(feet) * 12 + int(inches)
    return None  # fallback in case format is bad

defe['Ht'] = defe['Ht'].apply(convert_height)


In [129]:
defe_success = defe.dropna(subset=[
    'Total_Solo', 'Total_Ast', 'Total_Tackles', 'Total_Sack', 'Total_Sack_Yds',
    'Total_PD', 'Total_Int', 'Total_Int_Yds', 'Total_Int_LNG', 'Total_Int_TD',
    'Total_FF', 'Seasons_Played'
])


In [130]:
defe_success.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1624 entries, 0 to 2286
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Pos_x           1624 non-null   object 
 1   Age             1624 non-null   float64
 2   College/Univ    1624 non-null   object 
 3   Total_Solo      1624 non-null   float64
 4   Total_Ast       1624 non-null   float64
 5   Total_Tackles   1624 non-null   float64
 6   Total_Sack      1624 non-null   float64
 7   Total_Sack_Yds  1624 non-null   float64
 8   Total_PD        1624 non-null   float64
 9   Total_Int       1624 non-null   float64
 10  Total_Int_Yds   1624 non-null   float64
 11  Total_Int_LNG   1624 non-null   float64
 12  Total_Int_TD    1624 non-null   float64
 13  Total_FF        1624 non-null   float64
 14  Seasons_Played  1624 non-null   float64
 15  Ht              1624 non-null   int64  
 16  Wt              1624 non-null   float64
 17  40yd            1472 non-null   float6

The defenders are split into subgroups by position to enable position-specific analysis. The dataset defe_success is filtered into three subgroups:

- Defensive Linemen (DL): Includes DT, DE, NT, and DL positions.

- Linebackers (LB): Includes LB, OLB, and ILB.

- Secondary / Defensive Backs & Safeties (S Group): Includes SAF, S, FS, CB, and DB.

In [131]:
# DL Group: Defensive Linemen (DT,DE,NT,DL)
df_dl = defe_success[defe_success['Pos_x'].isin(['DT', 'DE', 'NT', 'DL'])].copy()

# LB Group: Linebackers (LB,OLB,ILB)
df_lb = defe_success[defe_success['Pos_x'].isin(['LB', 'OLB', 'ILB'])].copy()

# S Group: Secondary / Defensive Backs & Safeties (SAF,S,FS,CB,DB)
df_s = defe_success[defe_success['Pos_x'].isin(['SAF', 'S', 'FS', 'CB', 'DB'])].copy()


In [132]:
df_dl['Pos_x'].value_counts()

Pos_x,count
DE,279
DT,241
DL,49
NT,9


In [133]:
df_lb['Pos_x'].value_counts()

Pos_x,count
LB,314
OLB,62
ILB,26


In [134]:
df_s['Pos_x'].value_counts()

Pos_x,count
DB,365
CB,165
S,109
SAF,5


In [135]:
# 40-Yard Dash — lower is better
def rate_dl_40yd(time):
    if pd.isna(time): return 0
    elif time <= 4.41: return 4
    elif time <= 4.82: return 3
    elif time <= 5.27: return 2
    elif time <= 5.5: return 1
    else: return 0

# Vertical Jump — higher is better
def rate_dl_vertical(jump):
    if pd.isna(jump): return 0
    elif jump >= 38.2: return 4
    elif jump >= 30.9: return 3
    elif jump >= 22.5: return 2
    elif jump > 0: return 1
    else: return 0

# Broad Jump — higher is better
def rate_dl_broad_jump(jump):
    if pd.isna(jump): return 0
    elif jump >= 134: return 4
    elif jump >= 116.6: return 3
    elif jump >= 96: return 2
    elif jump > 0: return 1
    else: return 0

# 3-Cone Drill — lower is better
def rate_dl_3cone(time):
    if pd.isna(time): return 0
    elif time <= 7.09: return 4
    elif time <= 7.25: return 3
    elif time <= 7.48: return 2
    elif time <= 7.5: return 1
    else: return 0

# Shuttle — lower is better
def rate_dl_shuttle(time):
    if pd.isna(time): return 0
    elif time <= 4.1: return 4
    elif time <= 4.52: return 3
    elif time <= 5.01: return 2
    elif time <= 5.5: return 1
    else: return 0

# Bench Press — higher is better
def rate_dl_bench(reps):
    if pd.isna(reps): return 0
    elif reps >= 42: return 4
    elif reps >= 26: return 3
    elif reps >= 7: return 2
    elif reps > 0: return 1
    else: return 0

# Apply all ratings to DL DataFrame (example: df_dl)
df_dl['40yd'] = df_dl['40yd'].apply(rate_dl_40yd)
df_dl['Vertical'] = df_dl['Vertical'].apply(rate_dl_vertical)
df_dl['Broad Jump'] = df_dl['Broad Jump'].apply(rate_dl_broad_jump)
df_dl['3Cone'] = df_dl['3Cone'].apply(rate_dl_3cone)
df_dl['Shuttle'] = df_dl['Shuttle'].apply(rate_dl_shuttle)
df_dl['Bench'] = df_dl['Bench'].apply(rate_dl_bench)


## Subgroup: DL (Positions: DT,DE,NT,DL)

### Success Metric 1 (DL)

In [136]:
dl_success_1_clean = df_dl.copy()
dl_success_2_clean = df_dl.copy()

In [137]:
dl_success_1_clean.drop(columns=['Pos_x', 'Success_2'], inplace=True)

In [138]:
dl_success_1_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 578 entries, 69 to 1639
Data columns (total 23 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             578 non-null    float64
 1   College/Univ    578 non-null    object 
 2   Total_Solo      578 non-null    float64
 3   Total_Ast       578 non-null    float64
 4   Total_Tackles   578 non-null    float64
 5   Total_Sack      578 non-null    float64
 6   Total_Sack_Yds  578 non-null    float64
 7   Total_PD        578 non-null    float64
 8   Total_Int       578 non-null    float64
 9   Total_Int_Yds   578 non-null    float64
 10  Total_Int_LNG   578 non-null    float64
 11  Total_Int_TD    578 non-null    float64
 12  Total_FF        578 non-null    float64
 13  Seasons_Played  578 non-null    float64
 14  Ht              578 non-null    int64  
 15  Wt              578 non-null    float64
 16  40yd            578 non-null    int64  
 17  Vertical        578 non-null    int64 

### Success Metric 2 (DL)

In [139]:
dl_success_2_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 578 entries, 69 to 1639
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Pos_x           578 non-null    object 
 1   Age             578 non-null    float64
 2   College/Univ    578 non-null    object 
 3   Total_Solo      578 non-null    float64
 4   Total_Ast       578 non-null    float64
 5   Total_Tackles   578 non-null    float64
 6   Total_Sack      578 non-null    float64
 7   Total_Sack_Yds  578 non-null    float64
 8   Total_PD        578 non-null    float64
 9   Total_Int       578 non-null    float64
 10  Total_Int_Yds   578 non-null    float64
 11  Total_Int_LNG   578 non-null    float64
 12  Total_Int_TD    578 non-null    float64
 13  Total_FF        578 non-null    float64
 14  Seasons_Played  578 non-null    float64
 15  Ht              578 non-null    int64  
 16  Wt              578 non-null    float64
 17  40yd            578 non-null    int64 

In [140]:
dl_success_2_clean.drop(columns=['Pos_x', 'Success_1'], inplace=True)

## Subgroup: LB (Positions: LB,OLB,ILB)

In [141]:
# 40-Yard Dash — lower is better
def rate_lb_40yd(time):
    if pd.isna(time): return 0
    elif time <= 4.4: return 4
    elif time <= 4.6: return 3
    elif time <= 5.0: return 2
    elif time <= 5.5: return 1
    else: return 0

# Vertical Jump — higher is better
def rate_lb_vertical(jump):
    if pd.isna(jump): return 0
    elif jump >= 42.5: return 4
    elif jump >= 34: return 3
    elif jump >= 30: return 2
    elif jump > 0: return 1
    else: return 0

# Broad Jump — higher is better
def rate_lb_broad_jump(jump):
    if pd.isna(jump): return 0
    elif jump >= 130: return 4
    elif jump >= 116: return 3
    elif jump >= 105: return 2
    elif jump > 0: return 1
    else: return 0

# 3-Cone Drill — lower is better
def rate_lb_3cone(time):
    if pd.isna(time): return 0
    elif time <= 6.9: return 4
    elif time <= 7.13: return 3
    elif time <= 7.3: return 2
    elif time <= 7.5: return 1
    else: return 0

# Shuttle — lower is better
def rate_lb_shuttle(time):
    if pd.isna(time): return 0
    elif time <= 4.01: return 4
    elif time <= 4.30: return 3
    elif time <= 4.54: return 2
    elif time <= 5.5: return 1
    else: return 0

# Bench Press — higher is better
def rate_lb_bench(reps):
    if pd.isna(reps): return 0
    elif reps >= 31: return 4
    elif reps >= 20.6: return 3
    elif reps >= 12.5: return 2
    elif reps > 0: return 1
    else: return 0

# Apply all ratings to LB DataFrame (example: df_lb)
df_lb['40yd'] = df_lb['40yd'].apply(rate_lb_40yd)
df_lb['Vertical'] = df_lb['Vertical'].apply(rate_lb_vertical)
df_lb['Broad Jump'] = df_lb['Broad Jump'].apply(rate_lb_broad_jump)
df_lb['3Cone'] = df_lb['3Cone'].apply(rate_lb_3cone)
df_lb['Shuttle'] = df_lb['Shuttle'].apply(rate_lb_shuttle)
df_lb['Bench'] = df_lb['Bench'].apply(rate_lb_bench)


### Success Metric 1 (LB)

In [142]:
lb_success_1_clean = df_lb.copy()
lb_success_2_clean = df_lb.copy()

In [143]:
lb_success_1_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 402 entries, 0 to 1605
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Pos_x           402 non-null    object 
 1   Age             402 non-null    float64
 2   College/Univ    402 non-null    object 
 3   Total_Solo      402 non-null    float64
 4   Total_Ast       402 non-null    float64
 5   Total_Tackles   402 non-null    float64
 6   Total_Sack      402 non-null    float64
 7   Total_Sack_Yds  402 non-null    float64
 8   Total_PD        402 non-null    float64
 9   Total_Int       402 non-null    float64
 10  Total_Int_Yds   402 non-null    float64
 11  Total_Int_LNG   402 non-null    float64
 12  Total_Int_TD    402 non-null    float64
 13  Total_FF        402 non-null    float64
 14  Seasons_Played  402 non-null    float64
 15  Ht              402 non-null    int64  
 16  Wt              402 non-null    float64
 17  40yd            402 non-null    int64  

In [144]:
lb_success_1_clean.drop(columns=['Pos_x', 'Success_2'], inplace=True)

### Success Metric 2 (LB)

In [145]:
lb_success_2_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 402 entries, 0 to 1605
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Pos_x           402 non-null    object 
 1   Age             402 non-null    float64
 2   College/Univ    402 non-null    object 
 3   Total_Solo      402 non-null    float64
 4   Total_Ast       402 non-null    float64
 5   Total_Tackles   402 non-null    float64
 6   Total_Sack      402 non-null    float64
 7   Total_Sack_Yds  402 non-null    float64
 8   Total_PD        402 non-null    float64
 9   Total_Int       402 non-null    float64
 10  Total_Int_Yds   402 non-null    float64
 11  Total_Int_LNG   402 non-null    float64
 12  Total_Int_TD    402 non-null    float64
 13  Total_FF        402 non-null    float64
 14  Seasons_Played  402 non-null    float64
 15  Ht              402 non-null    int64  
 16  Wt              402 non-null    float64
 17  40yd            402 non-null    int64  

In [146]:
lb_success_2_clean.drop(columns=['Pos_x', 'Success_1'], inplace=True)

## Subgroup: S (Positions: SAF,S,FS,CB,DB)

In [147]:
# 40-Yard Dash — lower is better
def rate_db_40yd(time):
    if pd.isna(time): return 0
    elif time <= 4.28: return 4
    elif time <= 4.52: return 3
    elif time <= 4.79: return 2
    elif time <= 5.5: return 1
    else: return 0

# Vertical Jump — higher is better
def rate_db_vertical(jump):
    if pd.isna(jump): return 0
    elif jump >= 43: return 4
    elif jump >= 35.7: return 3
    elif jump >= 29.5: return 2
    elif jump > 15 : return 1
    else: return 0

# Broad Jump — higher is better
def rate_db_broad_jump(jump):
    if pd.isna(jump): return 0
    elif jump >= 133: return 4
    elif jump >= 118: return 3
    elif jump >= 106: return 2
    elif jump > 0: return 1
    else: return 0

# 3-Cone Drill — lower is better
def rate_db_3cone(time):
    if pd.isna(time): return 0
    elif time <= 6.85: return 4
    elif time <= 7.00: return 3
    elif time <= 7.29: return 2
    elif time <= 7.5: return 1
    else: return 0

# Shuttle Drill — lower is better
def rate_db_shuttle(time):
    if pd.isna(time): return 0
    elif time <= 3.89: return 4
    elif time <= 4.18: return 3
    elif time <= 4.56: return 2
    elif time <= 5.5: return 1
    else: return 0

# Bench Press — higher is better
def rate_db_bench(reps):
    if pd.isna(reps): return 0
    elif reps >= 22: return 4
    elif reps >= 14.0: return 3
    elif reps >= 4: return 2
    elif reps > 0: return 1
    else: return 0

# Apply to DB DataFrame (example: df_s)
df_s['40yd'] = df_s['40yd'].apply(rate_db_40yd)
df_s['Vertical'] = df_s['Vertical'].apply(rate_db_vertical)
df_s['Broad Jump'] = df_s['Broad Jump'].apply(rate_db_broad_jump)
df_s['3Cone'] = df_s['3Cone'].apply(rate_db_3cone)
df_s['Shuttle'] = df_s['Shuttle'].apply(rate_db_shuttle)
df_s['Bench'] = df_s['Bench'].apply(rate_db_bench)


### Subgroup S: Success Metric 1

In [148]:
s_success_1_clean = df_s.copy()
s_success_2_clean = df_s.copy()

In [149]:
s_success_1_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 644 entries, 11 to 2286
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Pos_x           644 non-null    object 
 1   Age             644 non-null    float64
 2   College/Univ    644 non-null    object 
 3   Total_Solo      644 non-null    float64
 4   Total_Ast       644 non-null    float64
 5   Total_Tackles   644 non-null    float64
 6   Total_Sack      644 non-null    float64
 7   Total_Sack_Yds  644 non-null    float64
 8   Total_PD        644 non-null    float64
 9   Total_Int       644 non-null    float64
 10  Total_Int_Yds   644 non-null    float64
 11  Total_Int_LNG   644 non-null    float64
 12  Total_Int_TD    644 non-null    float64
 13  Total_FF        644 non-null    float64
 14  Seasons_Played  644 non-null    float64
 15  Ht              644 non-null    int64  
 16  Wt              644 non-null    float64
 17  40yd            644 non-null    int64 

In [150]:
s_success_1_clean.drop(columns=['Pos_x', 'Success_2'], inplace=True)

### Subgroup S: Success Metric 2

In [151]:
s_success_2_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 644 entries, 11 to 2286
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Pos_x           644 non-null    object 
 1   Age             644 non-null    float64
 2   College/Univ    644 non-null    object 
 3   Total_Solo      644 non-null    float64
 4   Total_Ast       644 non-null    float64
 5   Total_Tackles   644 non-null    float64
 6   Total_Sack      644 non-null    float64
 7   Total_Sack_Yds  644 non-null    float64
 8   Total_PD        644 non-null    float64
 9   Total_Int       644 non-null    float64
 10  Total_Int_Yds   644 non-null    float64
 11  Total_Int_LNG   644 non-null    float64
 12  Total_Int_TD    644 non-null    float64
 13  Total_FF        644 non-null    float64
 14  Seasons_Played  644 non-null    float64
 15  Ht              644 non-null    int64  
 16  Wt              644 non-null    float64
 17  40yd            644 non-null    int64 

In [152]:
s_success_2_clean.drop(columns=['Pos_x', 'Success_1'], inplace=True)

# Model Training and Validation

## Model Training Explanation

Training Approach
For each player position group, the project followed a consistent machine learning pipeline:

Feature Selection with RFE:
Recursive Feature Elimination (RFE) was used to select the most predictive features. This reduced dimensionality and focused the models on the most relevant inputs.
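
As an illustration of this step, the sketch below runs scikit-learn's `RFE` on synthetic data. The base estimator (a Random Forest), the number of features kept, and the synthetic dataset are all assumptions made for the example; the text above does not state which settings the project used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for a position group's feature table (assumption)
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=4, random_state=42
)

# RFE repeatedly fits the estimator and removes the least important feature
# until only n_features_to_select remain
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    n_features_to_select=5,
)
selector.fit(X, y)

selected_mask = selector.support_       # True for each retained feature
X_selected = selector.transform(X)      # feature matrix reduced to the kept columns

print(selected_mask)
print(X_selected.shape)
```

With a pandas DataFrame as input, `X.columns[selector.support_]` would give the names of the retained features.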

Train-Test Split:
The dataset was split into training and testing subsets, typically using a 75/25 split. The training set was used to fit the models, while the test set evaluated performance on unseen data.
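
A 75/25 split can be sketched with `train_test_split` as follows. The synthetic data, the `stratify=y` option (which keeps the class ratio similar in both subsets), and the random seed are assumptions for the example rather than settings confirmed by the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for features X and a binary target y (assumption)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# test_size=0.25 holds out a quarter of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

The test rows play no part in fitting, so metrics computed on them estimate how a model behaves on players it has not seen.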

Model Training:
Multiple types of models were trained for each position group:

- Random Forest: Ensemble-based classification to capture non-linear patterns and reduce overfitting.
- XGBoost: Gradient boosting model known for strong performance on tabular data.
- Neural Network (Keras Sequential Model): Deep learning architecture with dense layers and dropout for generalization.
- Decision Tree (GridSearchCV): Tuned using grid search to optimize hyperparameters like depth and splitting rules.
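
To make the grid-search idea concrete, here is a minimal sketch of tuning a Decision Tree with `GridSearchCV`. The parameter grid, the `scoring="precision"` choice, the 5-fold setting, and the synthetic data are assumptions for illustration; the notebook's own grids may differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a position group's data (assumption)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate values for depth and splitting rules (hypothetical grid)
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_split": [2, 10],
    "criterion": ["gini", "entropy"],
}

# Every combination is scored with 5-fold cross-validation on the training set
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="precision",
    cv=5,
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_   # refit on the full training set
print(search.best_params_)
```

Scoring the search on precision ties the tuning to the metric the project prioritizes, though any scorer supported by scikit-learn could be substituted.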

Scaling and Preprocessing:
Neural networks were trained on standardized (scaled) inputs using StandardScaler. Label encoding was also applied to handle categorical variables like college/university.
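
The two preprocessing steps can be sketched as below. The college names and measurements are made up for the example. Fitting the scaler on the training rows only and reusing it for the test rows is a common practice assumed here, not something the text confirms.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Label encoding: each distinct string gets an integer code,
# assigned in sorted order of the class names
colleges = ["Alabama", "Ohio State", "Alabama", "Michigan"]  # hypothetical values
encoder = LabelEncoder()
college_codes = encoder.fit_transform(colleges)
print(list(encoder.classes_), list(college_codes))

# Standardization: subtract the training mean and divide by the training
# standard deviation, column by column
X_train = np.array([[70.0, 300.0], [74.0, 320.0], [78.0, 310.0]])  # made-up Ht, Wt
X_test = np.array([[72.0, 305.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

One caveat with label encoding a feature: the integer codes imply an ordering between colleges that has no real meaning. Tree-based models tolerate this better than neural networks do.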

Evaluation:
Models were validated using the following metrics:

- Accuracy
- Precision
- Recall
- F1 Score

Precision and F1 score were weighted most heavily. A draft pick is expensive, so the costliest mistake is a false positive: a player the model flags as a success who does not pan out. High precision means the players the model recommends are likely to be good, and F1 keeps that from coming at the cost of missing too many real successes.
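
A small worked example shows how the four metrics differ. The labels below are made up (1 = successful player, 0 = not), and the metrics are computed by hand from the confusion counts rather than with a library:

```python
# Hypothetical true outcomes and model predictions for eight players
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted success, was success
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted success, was not
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed a real success
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly ruled out

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # of the players recommended, how many were good
recall = tp / (tp + fn)      # of the good players, how many were recommended
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```

Here the model recommends three players and two of them succeed, so precision is 2/3, while it finds only two of the four real successes, so recall is 1/2. The F1 score of 4/7 sits between the two.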

**How defenders were grouped:**

1. Defensive Backs (DBs)
Includes the position labels CB (Cornerback), FS (Free Safety), S and SAF (Safety), and DB (generic Defensive Back).
These players primarily defend against the pass and cover receivers. Success is typically measured by coverage stats (INTs, PDs), PFF coverage grades, and snap counts.

2. Linebackers (LBs)
Includes the position labels LB (generic Linebacker), ILB (Inside Linebacker), and OLB (Outside Linebacker).
LBs are hybrid defenders responsible for tackling, pass coverage, and occasionally blitzing. Metrics such as tackle count, sacks, pressures, and coverage grade are often used.

3. Defensive Linemen (DL)
Includes the position labels DT (Defensive Tackle), DE (Defensive End), NT (Nose Tackle), and DL (generic Defensive Lineman).
These players focus on stopping the run and rushing the passer. Stats like sacks, pressures, and run-stop win rate are most indicative of success.

**Why were these buckets chosen?**

- Different skill sets and responsibilities: Each group performs a very different role. Modeling them together would dilute predictive accuracy due to feature noise (e.g., coverage stats mean little for DTs).
- Different success indicators: A good DB might shine in interceptions and PFF coverage grade, while a good DL might succeed in sacks and PFF pass rush grade. Grouping by position keeps the success metric relevant and fair.
- Improved model performance: Group-specific models often outperform a single general model, because the predictive features and the class balance differ significantly across these buckets.
- Position-specific draft value: NFL teams value positions differently in terms of draft capital and expected performance. Separating the groups makes it possible to tailor insights for scouting and roster-building strategy.

## QBs Models for Success Metric 1

### QBs Random Forest (Success Metric 1)

This step prepares the quarterback performance data (qbs_success_1_clean) for a Random Forest model that predicts Success Metric 1. It involves:

- Encoding the categorical College/Univ column using LabelEncoder() so it can be used as a numeric input for modeling.

- Verifying the dataset structure with .info() to ensure all 26 columns (such as passing stats, physical attributes, and success indicators) are complete and ready for training.
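
To illustrate what the encoding step produces, the sketch below reimplements the idea in plain Python: each distinct college name is mapped to an integer based on the sorted order of the names, which is also how LabelEncoder assigns its codes. The college list and the `label_encode` helper are hypothetical.

```python
def label_encode(values):
    """Map each distinct string to an integer, using the sorted order of the classes."""
    classes = sorted(set(values))
    mapping = {name: idx for idx, name in enumerate(classes)}
    return [mapping[v] for v in values], classes

# Hypothetical college names, for illustration only
colleges = ["Oregon", "Alabama", "Texas", "Alabama"]
codes, classes = label_encode(colleges)

print(classes)
print(codes)
```

The integer codes carry an alphabetical order that has no football meaning. Tree-based models can split around that, but distance-based models such as kNN and neural networks treat the codes as magnitudes, which is worth keeping in mind when comparing results.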

In [349]:
# Label Encode 'College/Univ'
le = LabelEncoder()
qbs_success_1_clean['College/Univ'] = le.fit_transform(qbs_success_1_clean['College/Univ'])


In [350]:
qbs_success_1_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 134 entries, 6 to 207
Data columns (total 26 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Age              134 non-null    float64
 1   College/Univ     134 non-null    int64  
 2   Total_Games      134 non-null    float64
 3   Total_Cmp        134 non-null    float64
 4   Total_Att        134 non-null    float64
 5   Total_Yds        134 non-null    float64
 6   Total_TD         134 non-null    float64
 7   Total_Int        134 non-null    float64
 8   Cmp_Percentage   134 non-null    float64
 9   TD_Percentage    134 non-null    float64
 10  Int_Percentage   134 non-null    float64
 11  Y_A              134 non-null    float64
 12  AY_A             134 non-null    float64
 13  Y_C              134 non-null    float64
 14  Y_G              134 non-null    float64
 15  Rate             134 non-null    float64
 16  Seasons_Played   134 non-null    float64
 17  Ht               134 

In [351]:
# 📦 Imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# 🎯 Split features and target
X = qbs_success_1_clean.drop('Success_1', axis=1)
y = qbs_success_1_clean['Success_1']

# 🧪 Train-Test Split (Stratified for class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# 🌳 Random Forest with class weight balancing
rf = RandomForestClassifier(random_state=42, class_weight='balanced')

# 🔍 Randomized Hyperparameter Search
param_dist_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'bootstrap': [True]
}

rf_random = RandomizedSearchCV(
    rf,
    param_distributions=param_dist_rf,
    n_iter=10,
    cv=3,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)

# 🚀 Train the model
rf_random.fit(X_train, y_train)

# 🧠 Best model & prediction
best_rf = rf_random.best_estimator_
y_pred_rf = best_rf.predict(X_test)

# 📊 Evaluation Function
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

# 🧾 Print Metrics
evaluate_model(y_test, y_pred_rf, "Random Forest (Balanced)")


--- Random Forest (Balanced) Evaluation ---
Accuracy : 0.8824
Precision: 0.0000
Recall   : 0.0000
F1 Score : 0.0000

Classification Report:
               precision    recall  f1-score   support

           0       0.88      1.00      0.94        30
           1       0.00      0.00      0.00         4

    accuracy                           0.88        34
   macro avg       0.44      0.50      0.47        34
weighted avg       0.78      0.88      0.83        34





### QBs XGBoost (Success Metric 1)

Objective: Build and evaluate an XGBoost model to predict quarterback success (Success_1) using performance metrics.

This section includes:

- Feature Engineering: Separates features (X) from the binary target variable (Success_1).

- Train-Test Split (Stratified): Ensures that the class balance (success vs. non-success) is preserved in both training and testing sets.

- Class Imbalance Handling: Calculates the scale_pos_weight parameter to correct for imbalance in the number of positive vs. negative labels, improving model fairness and sensitivity.

- Model Initialization: Defines an XGBClassifier optimized for binary classification with log loss as the evaluation metric.

- Hyperparameter Tuning: Uses RandomizedSearchCV to explore a range of hyperparameters (n_estimators, learning_rate, max_depth, subsample, colsample_bytree) over 10 randomized iterations with 3-fold cross-validation.

- Model Training & Prediction: Fits the tuned model on training data and uses the best configuration to make predictions on the test set.

- Model Evaluation: Assesses performance using accuracy, precision, recall, F1-score, and a detailed classification report to understand how well the model identifies successful quarterbacks.
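
To put the hyperparameter tuning step in perspective, the sketch below counts how many combinations the XGBoost search space contains and how much of it 10 randomized iterations cover. The search space is copied from the cell below; the sampling uses Python's `random` module as a stand-in, so the specific combinations drawn will differ from those scikit-learn selects.

```python
import itertools
import random

# XGBoost search space from the QB Success_1 cell
param_dist_xgb = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.7, 1.0],
    "colsample_bytree": [0.7, 1.0],
}

keys = list(param_dist_xgb)
all_combos = [dict(zip(keys, values))
              for values in itertools.product(*param_dist_xgb.values())]
print(len(all_combos))  # 3 * 2 * 3 * 2 * 2 = 72

# Draw 10 distinct combinations, analogous to n_iter=10.
# Stand-in sampler: the draws are not the ones scikit-learn would make.
rng = random.Random(42)
sampled = rng.sample(all_combos, k=10)
coverage = len(sampled) / len(all_combos)
print(f"{coverage:.1%} of the grid is evaluated")

# With 3-fold cross-validation, each sampled combination is fitted 3 times
print(len(sampled) * 3, "model fits during the search")
```

Only a fraction of the grid is tried, so a different `random_state` or a larger `n_iter` could surface a different best configuration.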

In [352]:
# Prepare features and target
X = qbs_success_1_clean.drop('Success_1', axis=1)
y = qbs_success_1_clean['Success_1']

# Train-Test Split (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Calculate scale_pos_weight = (# negative class) / (# positive class)
from collections import Counter
counter = Counter(y_train)
scale_pos_weight = counter[0] / counter[1]  # imbalance ratio

# XGBoost Classifier
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    use_label_encoder=False,
    eval_metric='logloss',
    scale_pos_weight=scale_pos_weight,
    random_state=42
)

# Hyperparameter Tuning (Randomized Search)
param_dist_xgb = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.7, 1.0],
    'colsample_bytree': [0.7, 1.0]
}

xgb_random = RandomizedSearchCV(
    xgb_model,
    param_distributions=param_dist_xgb,
    n_iter=10,
    scoring='accuracy',
    cv=3,
    random_state=42,
    n_jobs=-1
)

# Train the model
xgb_random.fit(X_train, y_train)
best_xgb = xgb_random.best_estimator_

# Predictions
y_pred_xgb = best_xgb.predict(X_test)

# Evaluation
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_xgb, "XGBoost (Balanced)")


Parameters: { "use_label_encoder" } are not used.



--- XGBoost (Balanced) Evaluation ---
Accuracy : 0.8235
Precision: 0.0000
Recall   : 0.0000
F1 Score : 0.0000

Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.93      0.90        30
           1       0.00      0.00      0.00         4

    accuracy                           0.82        34
   macro avg       0.44      0.47      0.45        34
weighted avg       0.77      0.82      0.80        34



### QBs Neural Network (Success Metric 1)

Builds and trains a deep neural network using Keras to predict quarterback success (Success_1) with enhanced architecture and class imbalance handling.

This includes:

Feature Preparation & Split:

- Inputs (X) and target (y) are extracted from qbs_success_1_clean.

- A stratified train-test split maintains class balance in training and testing.

Manual Class Weighting:

- Strongly up-weights the minority class (class 1 = “successful QB”) by a factor of 10 to combat class imbalance during training.

Feature Scaling:

- Standardizes input features using StandardScaler, essential for neural networks to converge efficiently.

Model Architecture:

- A feedforward neural network built using Sequential:

 First hidden layer → 128 neurons (ReLU) + Dropout (0.3)

 Second hidden layer → 64 neurons (ReLU) + Dropout (0.2)

 Output layer → 1 neuron with sigmoid activation (for binary classification)

Model Training:

- Uses binary cross-entropy loss and Adam optimizer.

- EarlyStopping prevents overfitting by monitoring validation loss and restoring the best weights.

- Trains silently for up to 100 epochs with 20% of training data used for validation.

Prediction & Threshold Adjustment:

- Instead of the default 0.5, a lower threshold of 0.3 converts predicted probabilities into class labels. This favors recall over precision: more prospects are flagged as successes, at the cost of more false positives.

Evaluation:

- The model is evaluated using accuracy, precision, recall, F1-score, and a classification report, giving a full picture of its predictive performance.
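
The threshold adjustment described above can be sketched as follows. The probabilities are hypothetical sigmoid outputs, not values produced by the model in this notebook, and `apply_threshold` is an illustrative helper.

```python
def apply_threshold(probabilities, threshold):
    """Label a prospect 1 (success) when the predicted probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical sigmoid outputs for five quarterbacks
probs = [0.05, 0.28, 0.35, 0.45, 0.62]

default_labels = apply_threshold(probs, 0.5)
lowered_labels = apply_threshold(probs, 0.3)

print(default_labels)
print(lowered_labels)
```

Lowering the cutoff turns borderline prospects into predicted successes. Whether that helps depends on how many of those extra calls are correct, which is what the precision and recall figures in the evaluation measure.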

In [353]:
# Prepare features and target
X = qbs_success_1_clean.drop('Success_1', axis=1).values
y = qbs_success_1_clean['Success_1'].values

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Boosted Class Weights (manually tuned)
class_weight = {
    0: 1.0,
    1: 10.0  # emphasize class 1 much more
}

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Enhanced Keras Model
model = Sequential([
    Dense(128, input_dim=X_train.shape[1], activation='relu'),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

# Compile
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Early Stopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train
history = model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    class_weight=class_weight,
    callbacks=[early_stop],
    verbose=0
)

# Predict probabilities
y_pred_prob = model.predict(X_test_scaled).flatten()

# Lowered threshold from 0.5 → 0.3
y_pred_thresh = (y_pred_prob >= 0.3).astype(int)

# Evaluation Function
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

# Evaluate
evaluate_model(y_test, y_pred_thresh, "Keras NN (Threshold=0.3 + More Layers + Class Weight)")


2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 94ms/step
--- Keras NN (Threshold=0.3 + More Layers + Class Weight) Evaluation ---
Accuracy : 0.7059
Precision: 0.0000
Recall   : 0.0000
F1 Score : 0.0000

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.80      0.83        30
           1       0.00      0.00      0.00         4

    accuracy                           0.71        34
   macro avg       0.43      0.40      0.41        34
weighted avg       0.76      0.71      0.73        34



### QBs Decision Tree (SM1)

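Trains a decision tree with balanced class weights. GridSearchCV tunes max_depth, min_samples_split, min_samples_leaf, and the split criterion using 3-fold cross-validation scored on F1, and the best tree is evaluated on the held-out 25% test split.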

In [354]:
# Data setup
X = qbs_success_1_clean.drop('Success_1', axis=1)
y = qbs_success_1_clean['Success_1']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# GridSearch for Decision Tree
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

dt = DecisionTreeClassifier(class_weight='balanced', random_state=42)
grid_search_dt = GridSearchCV(dt, param_grid_dt, cv=3, scoring='f1', n_jobs=-1)
grid_search_dt.fit(X_train, y_train)

best_dt = grid_search_dt.best_estimator_
y_pred_dt = best_dt.predict(X_test)

# Evaluation
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_dt, "Decision Tree (Tuned)")



--- Decision Tree (Tuned) Evaluation ---
Accuracy : 0.7059
Precision: 0.2000
Recall   : 0.5000
F1 Score : 0.2857

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.73      0.81        30
           1       0.20      0.50      0.29         4

    accuracy                           0.71        34
   macro avg       0.56      0.62      0.55        34
weighted avg       0.83      0.71      0.75        34



### QBs KNN (SM1)

Trains a k-Nearest Neighbors (kNN) model to predict quarterback success using cleaned and scaled data. A GridSearchCV is used to tune key hyperparameters (n_neighbors, weights, p) based on F1 score. The model is evaluated with standard classification metrics.

*This same logic and tuning process will be applied to other kNN models later in the notebook for consistency across metrics.*
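
The `p` hyperparameter selects the distance used to find neighbors: `p=1` is Manhattan distance and `p=2` is Euclidean distance. The sketch below computes both for two hypothetical quarterbacks and shows why scaling matters for kNN. All feature values, including the "scaled" ones, are made up for illustration.

```python
def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Hypothetical QBs described by (total passing yards, completion percentage)
qb_a = (9000.0, 60.0)
qb_b = (9100.0, 70.0)

# Unscaled: the 100-yard gap dominates the 10-point completion gap
raw_manhattan = minkowski(qb_a, qb_b, p=1)
raw_euclidean = minkowski(qb_a, qb_b, p=2)
print(raw_manhattan, raw_euclidean)

# After scaling to comparable units (assumed values), completion percentage matters most
qb_a_scaled = (0.50, -1.0)
qb_b_scaled = (0.55, 1.0)
scaled_manhattan = minkowski(qb_a_scaled, qb_b_scaled, p=1)
scaled_euclidean = minkowski(qb_a_scaled, qb_b_scaled, p=2)
print(scaled_manhattan, scaled_euclidean)
```

Without scaling, features measured in large units such as yards decide which players count as neighbors, regardless of how informative they are.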

In [355]:
# Data setup
# Drop rows with NaN values
qbs_success_1_clean = qbs_success_1_clean.dropna()
X = qbs_success_1_clean.drop('Success_1', axis=1)
y = qbs_success_1_clean['Success_1']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# GridSearchCV for kNN
param_grid_knn = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # Manhattan and Euclidean distances
}

warnings.simplefilter('ignore', FitFailedWarning)

knn = KNeighborsClassifier()
grid_search_knn = GridSearchCV(knn, param_grid_knn, cv=3, scoring='f1', n_jobs=-1, error_score='raise')
grid_search_knn.fit(X_train_scaled, y_train)

best_knn = grid_search_knn.best_estimator_
y_pred_knn = best_knn.predict(X_test_scaled)

# Evaluation
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_knn, "kNN (Tuned)")


--- kNN (Tuned) Evaluation ---
Accuracy : 0.8529
Precision: 0.0000
Recall   : 0.0000
F1 Score : 0.0000

Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.97      0.92        30
           1       0.00      0.00      0.00         4

    accuracy                           0.85        34
   macro avg       0.44      0.48      0.46        34
weighted avg       0.78      0.85      0.81        34



| Model                                          | Accuracy | Precision (1) | Recall (1) | F1 (1) | Notes                                                      |
| ---------------------------------------------- | -------- | ------------- | ---------- | ------ | ---------------------------------------------------------- |
| **Random Forest (balanced)**                   | 0.88     | 0.00          | 0.00       | 0.00   | Never predicted class 1                                    |
| **XGBoost (balanced)**                         | 0.82     | 0.00          | 0.00       | 0.00   | Predicted class 1 twice, both incorrect                    |
| **XGBoost (threshold=0.3)**                    | 0.70     | 0.10          | 0.17       | 0.13   | No output shown above for this run                         |
| **Neural Net (class\_weight, threshold=0.3)**  | 0.71     | 0.00          | 0.00       | 0.00   | Six class 1 predictions, all incorrect                     |
| **Decision Tree (tuned)**                      | 0.71     | 0.20          | 0.50       | 0.29   | Only model to find successful QBs (2 of 4), with 8 false positives |
| **kNN (tuned)**                                | 0.85     | 0.00          | 0.00       | 0.00   | High accuracy, no class 1 recall                           |


## QBs Models for Success Metric 2

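Success Metric 2 is modeled as a continuous score, so this section uses regressors in place of classifiers and evaluates them with MAE, MSE, RMSE, and R². The data preparation, train-test split, and tuning steps otherwise follow the same structure as the Success Metric 1 models.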

### QBs Decision Tree and KNN (SM2)

In [356]:
# Label Encode 'College/Univ'
le = LabelEncoder()
qbs_success_2_clean['College/Univ'] = le.fit_transform(qbs_success_2_clean['College/Univ'])

In [357]:
# Data Setup
X = qbs_success_2_clean.drop('Success_2', axis=1)
y = qbs_success_2_clean['Success_2']

# Fill missing values if any
X = X.fillna(X.median())

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scaling for kNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -------------------------------
# Decision Tree Regressor
# -------------------------------
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt = DecisionTreeRegressor(random_state=42)
grid_dt = GridSearchCV(dt, param_grid_dt, scoring='neg_mean_squared_error', cv=3, n_jobs=-1)
grid_dt.fit(X_train, y_train)
best_dt = grid_dt.best_estimator_
y_pred_dt = best_dt.predict(X_test)

# -------------------------------
# kNN Regressor
# -------------------------------
param_grid_knn = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

knn = KNeighborsRegressor()
grid_knn = GridSearchCV(knn, param_grid_knn, scoring='neg_mean_squared_error', cv=3, n_jobs=-1)
grid_knn.fit(X_train_scaled, y_train)
best_knn = grid_knn.best_estimator_
y_pred_knn = best_knn.predict(X_test_scaled)

# Evaluation Function
def evaluate_regression(y_true, y_pred, model_name):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

# Print Metrics
evaluate_regression(y_test, y_pred_dt, "Decision Tree Regressor (Tuned)")
evaluate_regression(y_test, y_pred_knn, "kNN Regressor (Tuned)")



--- Decision Tree Regressor (Tuned) Evaluation ---
MAE  : 0.1627
MSE  : 0.0433
RMSE : 0.2082
R²   : -1.2758

--- kNN Regressor (Tuned) Evaluation ---
MAE  : 0.1236
MSE  : 0.0236
RMSE : 0.1537
R²   : -0.2399


Both models have a negative R², which means each performs worse on the test set than a constant predictor that always outputs the mean of Success_2.

This indicates one or more of the following:

- Very noisy data

- Weak signal in the features

- A highly nonlinear or complex relationship that these simple models cannot capture
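
A short worked example helps make the meaning of a negative R² concrete. The sketch below computes R² by hand: predicting the mean of the true values gives exactly 0, and predictions that miss by more than the mean does give a negative value. The scores and the `r2_score_manual` helper are hypothetical.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical Success_2 scores, for illustration only
y_true = [0.1, 0.2, 0.3, 0.4]

# Baseline: always predict the mean of the true values
mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)
# Predictions that are further from the truth than the mean is
noisy_model = [0.4, 0.1, 0.5, 0.1]

r2_baseline = r2_score_manual(y_true, mean_baseline)
r2_noisy = r2_score_manual(y_true, noisy_model)
print(r2_baseline)
print(r2_noisy)
```

R² has no lower bound, so a strongly negative value signals predictions that are far off in scale, not merely uninformative.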

### QBs XGBoost (SM2)

An XGBoost regressor is tuned with an exhaustive GridSearchCV (3-fold cross-validation, scored on negative mean squared error) and evaluated with the same regression metrics as the Decision Tree and kNN regressors.

In [358]:
# Setup features and target
X = qbs_success_2_clean.drop('Success_2', axis=1)
y = qbs_success_2_clean['Success_2']

# Handle missing values
X = X.fillna(X.median())

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Optional scaling (XGBoost doesn't require it but OK to keep it consistent)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# XGBoost Regressor
xgb_reg = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

# Grid Search Parameters
param_grid_xgb = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

grid_xgb = GridSearchCV(
    xgb_reg,
    param_grid=param_grid_xgb,
    scoring='neg_mean_squared_error',
    cv=3,
    n_jobs=-1
)

# Fit model
grid_xgb.fit(X_train_scaled, y_train)
best_xgb = grid_xgb.best_estimator_

# Predict
y_pred_xgb = best_xgb.predict(X_test_scaled)

# Evaluation Function
def evaluate_regression(y_true, y_pred, model_name):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

# Evaluate
evaluate_regression(y_test, y_pred_xgb, "XGBoost Regressor (Tuned)")



--- XGBoost Regressor (Tuned) Evaluation ---
MAE  : 0.1087
MSE  : 0.0207
RMSE : 0.1438
R²   : -0.0856


### QBs Neural Network (SM2)

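A Keras regression network (two hidden layers of 128 and 64 ReLU units, a single linear output, MSE loss) is trained on standardized inputs with early stopping and evaluated with the same regression metrics as the other Success Metric 2 models.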

In [359]:
# Prepare features and target
X = qbs_success_2_clean.drop('Success_2', axis=1).fillna(qbs_success_2_clean.median())
y = qbs_success_2_clean['Success_2']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Standard scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Neural Network model
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dense(1)  # regression output
])

# Compile
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')

# Early stopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train
history = model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    callbacks=[early_stop],
    verbose=0
)

# Predict
y_pred_nn = model.predict(X_test_scaled).flatten()

# Evaluate
def evaluate_regression(y_true, y_pred, model_name):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_nn, "Keras Neural Network Regressor")


2/2 ━━━━━━━━━━━━━━━━━━━━ 1s 263ms/step

--- Keras Neural Network Regressor Evaluation ---
MAE  : 0.2807
MSE  : 0.4397
RMSE : 0.6631
R²   : -22.0870


| Model             | MAE    | RMSE   | R²       |
| ----------------- | ------ | ------ | -------- |
| **kNN**           | 0.1236 | 0.1537 | -0.2399  |
| **XGBoost**       | 0.1087 | 0.1438 | -0.0856  |
| **Decision Tree** | 0.1627 | 0.2082 | -1.2758  |
| **Keras NN**      | 0.2807 | 0.6631 | -22.0870 |


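XGBoost Regressor is the best overall (lowest error, least negative R²), but its R² is still below zero, so it does not beat a constant prediction of the mean on this test split.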
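
As a rough consistency check, the reported MSE and R² values can be used to back out the variance of Success_2 in the test set, since R² = 1 - MSE / Var(y_test). The sketch below does this for three of the regressors using the figures printed above. The results are approximate because the inputs are rounded to four decimals.

```python
from math import sqrt

# (MSE, R^2) pairs as printed by the QB Success_2 regressors
reported = {
    "Decision Tree": (0.0433, -1.2758),
    "kNN": (0.0236, -0.2399),
    "XGBoost": (0.0207, -0.0856),
}

# R^2 = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R^2)
implied_variance = {name: mse / (1 - r2) for name, (mse, r2) in reported.items()}

for name, var in implied_variance.items():
    print(f"{name:13s} implied Var(y_test) = {var:.4f}, mean-predictor RMSE = {sqrt(var):.4f}")
```

All three models imply nearly the same test-set variance, which suggests they were scored on the same split. The corresponding mean-predictor RMSE is the bar a model must get under to reach a positive R².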

## RBs Models for Success Metric 1


### RBs Random Forest (SM1)



In [360]:
# Label Encode 'College/Univ'
le = LabelEncoder()
rbs_success_1_clean['College/Univ'] = le.fit_transform(rbs_success_1_clean['College/Univ'])

In [361]:
# Features and Target
X = rbs_success_1_clean.drop('Success_1', axis=1)
y = rbs_success_1_clean['Success_1']

# Optional: Fill NaNs
X = X.fillna(X.median())

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Random Forest with class weight balancing
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluation
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_rf, "Random Forest (RBs)")



--- Random Forest (RBs) Evaluation ---
Accuracy : 0.6061
Precision: 0.4615
Recall   : 0.2400
F1 Score : 0.3158

Classification Report:
               precision    recall  f1-score   support

           0       0.64      0.83      0.72        41
           1       0.46      0.24      0.32        25

    accuracy                           0.61        66
   macro avg       0.55      0.53      0.52        66
weighted avg       0.57      0.61      0.57        66



### RBs XGBoost (SM1)

In [362]:
# Features and Target
X = rbs_success_1_clean.drop('Success_1', axis=1).fillna(rbs_success_1_clean.median())
y = rbs_success_1_clean['Success_1']

# Train-Test Split (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Class Imbalance Ratio
class_counts = Counter(y_train)
scale_pos_weight = class_counts[0] / class_counts[1]  # ratio = majority / minority

# XGBoost Classifier
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    use_label_encoder=False,
    eval_metric='logloss',
    scale_pos_weight=scale_pos_weight,
    random_state=42
)

# Parameter Grid
param_grid_xgb = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

grid_search_xgb = GridSearchCV(
    xgb_model,
    param_grid=param_grid_xgb,
    cv=3,
    scoring='f1',
    n_jobs=-1
)

# Fit Model
grid_search_xgb.fit(X_train, y_train)
best_xgb = grid_search_xgb.best_estimator_

# Predict
y_pred_xgb = best_xgb.predict(X_test)

# Evaluation
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

# Print Results
evaluate_model(y_test, y_pred_xgb, "XGBoost Classifier (RBs)")


Parameters: { "use_label_encoder" } are not used.




--- XGBoost Classifier (RBs) Evaluation ---
Accuracy : 0.5758
Precision: 0.4400
Recall   : 0.4400
F1 Score : 0.4400

Classification Report:
               precision    recall  f1-score   support

           0       0.66      0.66      0.66        41
           1       0.44      0.44      0.44        25

    accuracy                           0.58        66
   macro avg       0.55      0.55      0.55        66
weighted avg       0.58      0.58      0.58        66



In [363]:
# Load & Prepare
X = rbs_success_1_clean.drop('Success_1', axis=1).fillna(rbs_success_1_clean.median())
y = rbs_success_1_clean['Success_1']

# Feature Selection with RFE (top 10 features)
# Caveat: RFE is fit on all rows before the split, so test rows influence which features are kept
rfe_selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train-Test Split (stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

# Compute imbalance ratio
class_counts = Counter(y_train)
scale_pos_weight = class_counts[0] / class_counts[1]

# XGBoost Classifier (with weight balancing)
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    use_label_encoder=False,
    eval_metric='logloss',
    scale_pos_weight=scale_pos_weight,
    random_state=42
)

# Grid Search
param_grid_xgb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.7, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.9, 1.0]
}

grid_xgb = GridSearchCV(
    xgb_model,
    param_grid=param_grid_xgb,
    scoring='f1',
    cv=3,
    n_jobs=-1
)

grid_xgb.fit(X_train, y_train)
best_xgb = grid_xgb.best_estimator_

# Predict
y_pred = best_xgb.predict(X_test)

# Evaluation
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred, "XGBoost + RFE (RBs)")


Parameters: { "use_label_encoder" } are not used.




--- XGBoost + RFE (RBs) Evaluation ---
Accuracy : 0.6212
Precision: 0.5000
Recall   : 0.4000
F1 Score : 0.4444

Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.76      0.71        41
           1       0.50      0.40      0.44        25

    accuracy                           0.62        66
   macro avg       0.59      0.58      0.58        66
weighted avg       0.61      0.62      0.61        66



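✅ Interpretation

- Class 1 recall is 0.40, up from 0.24 for the Random Forest but slightly below the 0.44 of XGBoost without RFE.

- Class 1 precision rises to 0.50, compared with 0.46 (Random Forest) and 0.44 (XGBoost alone), and matches the neural network reported below.

- Class 1 F1 score is 0.44, marginally ahead of XGBoost alone (0.4444 vs. 0.4400) and well ahead of the Random Forest (0.32).

- Accuracy (0.62), macro average, and weighted F1 are the most balanced of the three tree-based runs so far.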



In [364]:
# Predict class probabilities
y_prob = best_xgb.predict_proba(X_test)[:, 1]

# Apply threshold = 0.3
y_pred_thresh = (y_prob >= 0.3).astype(int)

# Evaluation
evaluate_model(y_test, y_pred_thresh, "XGBoost + RFE (Threshold = 0.3)")



--- XGBoost + RFE (Threshold = 0.3) Evaluation ---
Accuracy : 0.3636
Precision: 0.3559
Recall   : 0.8400
F1 Score : 0.5000

Classification Report:
               precision    recall  f1-score   support

           0       0.43      0.07      0.12        41
           1       0.36      0.84      0.50        25

    accuracy                           0.36        66
   macro avg       0.39      0.46      0.31        66
weighted avg       0.40      0.36      0.27        66



✅ Large recall gain for class 1 (0.40 → 0.84)

✅ Class 1 F1 score rises from 0.44 to 0.50

❌ Precision falls from 0.50 to 0.36 and accuracy from 0.62 to 0.36. At this threshold the model labels almost every player a success (class 0 recall is 0.07), which conflicts with the precision-first goal stated earlier.
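
The trade-off is easier to see once the confusion matrix is reconstructed from the printed metrics. The sketch below derives the counts from the class supports (41 and 25), the recall (0.84), and the precision (0.3559) in the report above, then recomputes accuracy, precision, and F1 as a check. The rounding steps assume the printed values are accurate to the digits shown.

```python
# Class supports in the RB test set (from the classification report)
n_negative, n_positive = 41, 25

# Recall 0.84 on 25 positives: successful players found vs. missed
tp = round(0.84 * n_positive)
fn = n_positive - tp

# Precision 0.3559 = tp / (tp + fp) gives the number of predicted positives
predicted_positive = round(tp / 0.3559)
fp = predicted_positive - tp
tn = n_negative - fp

accuracy = (tp + tn) / (n_negative + n_positive)
precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} f1={f1:.4f}")
```

The recomputed metrics match the printed report, and the counts show that most of the test players are flagged as successes at this threshold.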

### RBs Neural Network (SM1)

In [365]:
# Features and Target
X = rbs_success_1_clean.drop('Success_1', axis=1).fillna(rbs_success_1_clean.median())
y = rbs_success_1_clean['Success_1']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Compute Class Weights
counter = Counter(y_train)
total = len(y_train)
class_weight = {
    0: total / (2 * counter[0]),
    1: total / (2 * counter[1])
}

# Scale Features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define Neural Network Model
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')  # Binary output
])

# Compile
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Early Stopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train Model
model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    class_weight=class_weight,
    callbacks=[early_stop],
    verbose=0
)

# Predict Probabilities and Apply Threshold
y_pred_prob = model.predict(X_test_scaled).flatten()
y_pred_class = (y_pred_prob >= 0.5).astype(int)  # can adjust to 0.3 later

# Evaluation Function
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

# Evaluate
evaluate_model(y_test, y_pred_class, "Keras Neural Network (RBs)")


3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step

--- Keras Neural Network (RBs) Evaluation ---
Accuracy : 0.6212
Precision: 0.5000
Recall   : 0.4400
F1 Score : 0.4681

Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.73      0.71        41
           1       0.50      0.44      0.47        25

    accuracy                           0.62        66
   macro avg       0.59      0.59      0.59        66
weighted avg       0.61      0.62      0.62        66
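
The class weights passed to `model.fit` in the cell above follow the `total / (2 * count)` rule. A minimal sketch of the same idea for any number of classes is shown here. The label counts are hypothetical and `balanced_class_weights` is an illustrative helper, not part of the notebook. For two classes it reduces to the notebook's formula and matches the `n_samples / (n_classes * count)` rule behind scikit-learn's `class_weight='balanced'`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}

# Hypothetical training labels: 123 unsuccessful (0) and 74 successful (1) players.
y_train_example = [0] * 123 + [1] * 74
weights = balanced_class_weights(y_train_example)
print(weights)

# After weighting, each class contributes the same total weight to the loss.
counts = Counter(y_train_example)
print({cls: weights[cls] * counts[cls] for cls in counts})
```

The minority class gets the larger weight, so each misclassified successful player costs more during training than a misclassified unsuccessful one.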



### RBs Decision Tree (SM1)

In [366]:
# Features and Target
X = rbs_success_1_clean.drop('Success_1', axis=1).fillna(rbs_success_1_clean.median())
y = rbs_success_1_clean['Success_1']

# RFE with Random Forest to select top 10 features
rfe_selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

# Decision Tree + GridSearchCV
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

dt = DecisionTreeClassifier(class_weight='balanced', random_state=42)
grid_dt = GridSearchCV(dt, param_grid_dt, cv=3, scoring='f1', n_jobs=-1)
grid_dt.fit(X_train, y_train)
best_dt = grid_dt.best_estimator_

# Predict
y_pred_dt = best_dt.predict(X_test)

# Evaluation
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_dt, "Decision Tree + RFE (RBs)")



--- Decision Tree + RFE (RBs) Evaluation ---
Accuracy : 0.5758
Precision: 0.4118
Recall   : 0.2800
F1 Score : 0.3333

Classification Report:
               precision    recall  f1-score   support

           0       0.63      0.76      0.69        41
           1       0.41      0.28      0.33        25

    accuracy                           0.58        66
   macro avg       0.52      0.52      0.51        66
weighted avg       0.55      0.58      0.55        66
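
One caveat about the cell above: RFE is fitted on the full dataset before the train/test split, so the test rows influence which 10 features are kept. A possible alternative, sketched here on synthetic data, places RFE inside a scikit-learn `Pipeline` so that feature selection is refit on the training folds only. The dataset, the reduced grid, and the smaller forest are assumptions chosen to keep the example quick. This is not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the RB table (assumption: 20 numeric features, ~38% positives).
X_demo, y_demo = make_classification(
    n_samples=260, n_features=20, n_informative=6,
    weights=[0.62, 0.38], random_state=42
)

# Split first, so the test rows never reach feature selection.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# RFE is a pipeline step, so GridSearchCV refits it inside every CV fold.
pipe = Pipeline([
    ("rfe", RFE(RandomForestClassifier(n_estimators=25, random_state=42),
                n_features_to_select=10)),
    ("dt", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

grid = GridSearchCV(pipe, {"dt__max_depth": [3, 5]}, cv=3, scoring="f1")
grid.fit(X_tr, y_tr)

selected = grid.best_estimator_.named_steps["rfe"].support_
print("features kept:", int(selected.sum()))
print("held-out F1  :", round(grid.score(X_te, y_te), 4))
```

With this layout the held-out score reflects both the selection step and the classifier, at the cost of rerunning RFE for every fold and parameter combination.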



### RBs kNN (SM1)

In [367]:
# Suppress warnings from failing fits (just in case)
warnings.simplefilter('ignore', FitFailedWarning)

# Prepare Features & Target
X = rbs_success_1_clean.drop('Success_1', axis=1).fillna(rbs_success_1_clean.median())
y = rbs_success_1_clean['Success_1']

# RFE with Random Forest
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

# Scale for kNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# kNN Grid Search
param_grid_knn = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # 1 = Manhattan, 2 = Euclidean
}

knn = KNeighborsClassifier()
grid_knn = GridSearchCV(knn, param_grid_knn, scoring='f1', cv=3, n_jobs=-1)
grid_knn.fit(X_train_scaled, y_train)
best_knn = grid_knn.best_estimator_

# Predict
y_pred_knn = best_knn.predict(X_test_scaled)

# Evaluate
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_knn, "kNN + RFE (RBs)")



--- kNN + RFE (RBs) Evaluation ---
Accuracy : 0.5455
Precision: 0.4138
Recall   : 0.4800
F1 Score : 0.4444

Classification Report:
               precision    recall  f1-score   support

           0       0.65      0.59      0.62        41
           1       0.41      0.48      0.44        25

    accuracy                           0.55        66
   macro avg       0.53      0.53      0.53        66
weighted avg       0.56      0.55      0.55        66
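
kNN depends on distances, which is why the cell above standardizes the features and fits the scaler on the training split only. A minimal sketch of that step in plain Python follows. The 40-yard dash values are hypothetical and the helper names are illustrative. It uses the population standard deviation, which is what `StandardScaler` computes.

```python
from statistics import mean, pstdev

def fit_standardizer(column):
    """Learn the mean and population standard deviation from training values."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return mu, sigma

def apply_standardizer(column, mu, sigma):
    """Scale values with statistics learned elsewhere (the training split)."""
    return [(x - mu) / sigma for x in column]

# Hypothetical 40-yard dash times in seconds, for illustration only.
train_forty = [4.40, 4.52, 4.61, 4.48, 4.55]
test_forty = [4.38, 4.70]

mu, sigma = fit_standardizer(train_forty)
train_scaled = apply_standardizer(train_forty, mu, sigma)
test_scaled = apply_standardizer(test_forty, mu, sigma)  # reuses training statistics

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

Reusing the training mean and standard deviation on the test rows keeps test information out of the preprocessing step.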



| Model               | Precision (1) | Recall (1) | F1 (1) | Accuracy |
| ------------------- | ------------- | ---------- | ------ | -------- |
| Random Forest       | 0.46          | 0.24       | 0.32   | 0.61     |
| XGBoost             | 0.44          | 0.44       | 0.44   | 0.58     |
| XGBoost + Threshold | 0.36          | **0.84**   | 0.50   | 0.36     |
| Neural Net          | **0.50**      | 0.44       | 0.47   | 0.62     |
| Decision Tree + RFE | 0.41          | 0.28       | 0.33   | 0.58     |
| **kNN + RFE**       | 0.41          | 0.48       | 0.44   | 0.55     |


## RBs Models for Success Metric 2

### RBs Decision Tree (SM2)

In [368]:
# Label Encode 'College/Univ'
le = LabelEncoder()
rbs_success_2_clean['College/Univ'] = le.fit_transform(rbs_success_2_clean['College/Univ'])

In [369]:
# Features and Target
X = rbs_success_2_clean.drop('Success_2', axis=1).fillna(rbs_success_2_clean.median())
y = rbs_success_2_clean['Success_2']

# RFE with Random Forest (Top 10 features)
rfe = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# Decision Tree with Grid Search
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt = DecisionTreeRegressor(random_state=42)
grid_dt = GridSearchCV(dt, param_grid_dt, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_dt.fit(X_train, y_train)
best_dt = grid_dt.best_estimator_

# Predict
y_pred_dt = best_dt.predict(X_test)

# Evaluation Function
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_dt, "Decision Tree Regressor + RFE (RBs)")



--- Decision Tree Regressor + RFE (RBs) Evaluation ---
MAE  : 0.1077
MSE  : 0.0211
RMSE : 0.1454
R²   : 0.1678


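| Metric | Value  | Interpretation                              |
| ------ | ------ | ------------------------------------------- |
| MAE    | 0.1077 | Small average prediction error ✅            |
| MSE    | 0.0211 | Low squared error ✅                         |
| RMSE   | 0.1454 | Good error magnitude ✅                      |
| R²     | 0.168  | Explains ~17% of variance in Success_2 🔼   |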
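
To put these error figures in context, the reported MSE and R² can be used to back out how a predict-the-mean baseline would score on the same test set. The sketch below assumes R² was computed in the usual way, as 1 − MSE / Var(y_test). The function name is illustrative.

```python
import math

def baseline_from_report(mse, r2):
    """Back out the error of a predict-the-mean baseline from MSE and R².

    Uses R² = 1 - MSE / Var(y), so Var(y) = MSE / (1 - R²).
    """
    variance = mse / (1.0 - r2)
    return {"baseline_mse": variance, "baseline_rmse": math.sqrt(variance)}

# Figures printed for the Decision Tree Regressor + RFE (RBs).
report = {"mse": 0.0211, "rmse": 0.1454, "r2": 0.1678}

baseline = baseline_from_report(report["mse"], report["r2"])
rmse_gain = 1.0 - report["rmse"] / baseline["baseline_rmse"]

print(f"Baseline RMSE : {baseline['baseline_rmse']:.4f}")
print(f"Model RMSE    : {report['rmse']:.4f}")
print(f"RMSE reduction: {rmse_gain:.1%}")
```

Under that assumption, an R² of about 0.17 corresponds to an RMSE roughly 9% below the baseline of always predicting the test-set mean.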

### RBs Random Forest (SM2)

In [370]:
# Features & Target
X = rbs_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(rbs_success_2_clean.median())
y = rbs_success_2_clean['Success_2']

# RFE (top 10 features)
rfe_selector = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# Fit final Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Evaluate
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_rf, "Random Forest Regressor + RFE (RBs)")



--- Random Forest Regressor + RFE (RBs) Evaluation ---
MAE  : 0.1031
MSE  : 0.0201
RMSE : 0.1419
R²   : 0.2075


Strongest R² yet for RB regression — better than DT (0.1678), XGB (0.1198), and kNN (0.0164)

Solid performance across MAE and RMSE too

📌 Random Forest Regressor is now the best model for rbs_success_2_clean

### RBs kNN (SM2)

In [371]:
# Suppress grid search warnings
warnings.simplefilter('ignore', FitFailedWarning)

# Prepare Features and Target
X = rbs_success_2_clean.drop('Success_2', axis=1).fillna(rbs_success_2_clean.median())
y = rbs_success_2_clean['Success_2']

# RFE with Random Forest (Top 10 features)
rfe = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# Scale Features for kNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# GridSearchCV for kNN Regressor
param_grid_knn = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # 1 = Manhattan, 2 = Euclidean
}

knn = KNeighborsRegressor()
grid_knn = GridSearchCV(knn, param_grid_knn, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_knn.fit(X_train_scaled, y_train)
best_knn = grid_knn.best_estimator_

# Predict
y_pred_knn = best_knn.predict(X_test_scaled)

# Evaluation
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_knn, "kNN Regressor + RFE (RBs)")



--- kNN Regressor + RFE (RBs) Evaluation ---
MAE  : 0.1125
MSE  : 0.0250
RMSE : 0.1581
R²   : 0.0164


### RBs XGBoost (SM2)

In [372]:
# Prepare Features and Target
X = rbs_success_2_clean.drop('Success_2', axis=1).fillna(rbs_success_2_clean.median())
y = rbs_success_2_clean['Success_2']

# Feature Selection: RFE (Top 10)
rfe = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# XGBoost Regressor
xgb_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    random_state=42
)

# GridSearchCV Parameters
param_grid_xgb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

grid_xgb = GridSearchCV(
    xgb_model,
    param_grid=param_grid_xgb,
    scoring='neg_mean_squared_error',
    cv=3,
    n_jobs=-1
)

# Fit
grid_xgb.fit(X_train, y_train)
best_xgb = grid_xgb.best_estimator_

# Predict
y_pred_xgb = best_xgb.predict(X_test)

# Evaluation
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_xgb, "XGBoost Regressor + RFE (RBs)")



--- XGBoost Regressor + RFE (RBs) Evaluation ---
MAE  : 0.1091
MSE  : 0.0224
RMSE : 0.1496
R²   : 0.1198
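
The size of the grid search above is easy to underestimate. This short sketch enumerates the same parameter grid with `itertools.product` and counts the model fits that 3-fold cross-validation implies. The extra fit comes from `GridSearchCV` refitting the best configuration on the full training split, which is its default behaviour.

```python
from itertools import product

# Same grid as the XGBoost regressor cell above.
param_grid_xgb = {
    "n_estimators": [100, 200],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

cv_folds = 3
combinations = [
    dict(zip(param_grid_xgb, values))
    for values in product(*param_grid_xgb.values())
]
cv_fits = len(combinations) * cv_folds
total_fits = cv_fits + 1  # plus the final refit of the best configuration

print("parameter combinations:", len(combinations))
print("cross-validation fits :", cv_fits)
print("total fits            :", total_fits)
```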


### RBs Neural Network (SM2)

In [373]:
# Prepare Features and Target
X = rbs_success_2_clean.drop('Success_2', axis=1).fillna(rbs_success_2_clean.median())
y = rbs_success_2_clean['Success_2']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scale Inputs
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build Keras Model
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1)  # Regression output
])

# Compile Model
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')

# Early Stopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train Model
model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    callbacks=[early_stop],
    verbose=0
)

# Predict
y_pred_nn = model.predict(X_test_scaled).flatten()

# Evaluation
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_nn, "Keras Neural Network Regressor (RBs)")


3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step

--- Keras Neural Network Regressor (RBs) Evaluation ---
MAE  : 0.1234
MSE  : 0.0267
RMSE : 0.1633
R²   : -0.0493
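
A negative R² means the model's squared error on the test set is larger than that of simply predicting the test-set mean. The toy example below uses hypothetical scores and a hand-written `r2_manual` helper to show how a constant prediction that misses the mean produces a negative value.

```python
def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical Success_2 scores, for illustration only.
y_true = [0.10, 0.20, 0.30, 0.40]

mean_pred = [0.25] * 4     # always predict the mean of y_true
shifted_pred = [0.35] * 4  # constant prediction that misses the mean

r2_mean = r2_manual(y_true, mean_pred)
r2_shifted = r2_manual(y_true, shifted_pred)

print(f"R² when predicting the mean : {r2_mean:.4f}")
print(f"R² when predicting 0.35     : {r2_shifted:.4f}")
```

By that reading, the neural network's R² of -0.0493 places it slightly behind the mean baseline on this split.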


| Model                 | MAE        | RMSE       | R²         |
| --------------------- | ---------- | ---------- | ---------- |
| ✅ Random Forest + RFE | **0.1031** | **0.1419** | **0.2075** |
| Decision Tree + RFE   | 0.1077     | 0.1454     | 0.1678     |
| XGBoost + RFE         | 0.1091     | 0.1496     | 0.1198     |
| kNN + RFE             | 0.1125     | 0.1581     | 0.0164     |
| Neural Net            | 0.1234     | 0.1633     | -0.0493    |


## WR_TEs Models for Success Metric 1

In [374]:
# Label Encode 'College/Univ'
le = LabelEncoder()
wr_tes_success_1_clean['College/Univ'] = le.fit_transform(wr_tes_success_1_clean['College/Univ'])

### WR & TEs Random Forest (SM1)

In [375]:
# Prepare features and target
X = wr_tes_success_1_clean.drop('Success_1', axis=1).fillna(wr_tes_success_1_clean.median())
y = wr_tes_success_1_clean['Success_1']

# Replace infinite values with NaN, then fill with column medians
X.replace([np.inf, -np.inf], np.nan, inplace=True)
X = X.fillna(X.median())

# Train/Test Split (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Random Forest with class_weight
rf_model = RandomForestClassifier(class_weight='balanced', random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Evaluation
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_rf, "Random Forest (WRs/TEs)")



--- Random Forest (WRs/TEs) Evaluation ---
Accuracy : 0.6095
Precision: 0.2000
Recall   : 0.0571
F1 Score : 0.0889

Classification Report:
               precision    recall  f1-score   support

           0       0.65      0.89      0.75        70
           1       0.20      0.06      0.09        35

    accuracy                           0.61       105
   macro avg       0.43      0.47      0.42       105
weighted avg       0.50      0.61      0.53       105
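
The `macro avg` and `weighted avg` rows in the report above differ only in how the per-class scores are combined. The sketch below recomputes both F1 averages from the rounded per-class values in the report, so the results are approximate. The helper name is illustrative.

```python
def macro_and_weighted(scores, supports):
    """Return the unweighted (macro) and support-weighted averages of per-class scores."""
    macro = sum(scores) / len(scores)
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return macro, weighted

# Per-class F1 and support from the Random Forest (WRs/TEs) report.
f1_by_class = [0.75, 0.09]   # class 0, class 1
support = [70, 35]

macro_f1, weighted_f1 = macro_and_weighted(f1_by_class, support)
print(f"macro F1   : {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

Because class 0 has twice the support, the weighted F1 of 0.53 largely hides the class-1 F1 of 0.09. The macro average is the more telling figure when the minority class is the one of interest.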



### WR & TEs XGBoost (SM1)

In [376]:
# Features and Target
X = wr_tes_success_1_clean.drop('Success_1', axis=1)
y = wr_tes_success_1_clean['Success_1']

# Replace infinities and fill NaNs
X.replace([np.inf, -np.inf], np.nan, inplace=True)
X = X.fillna(X.median())

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Compute imbalance ratio
class_counts = Counter(y_train)
scale_pos_weight = class_counts[0] / class_counts[1]

# XGBoost Classifier
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    use_label_encoder=False,
    eval_metric='logloss',
    scale_pos_weight=scale_pos_weight,
    random_state=42
)

# Fit
xgb_model.fit(X_train, y_train)


# Evaluation helper (defined before its first use in this cell)
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

# Predict probabilities for the positive class
y_prob = xgb_model.predict_proba(X_test)[:, 1]

# Apply threshold = 0.3
y_pred_thresh = (y_prob >= 0.3).astype(int)

# Evaluate at threshold = 0.3
evaluate_model(y_test, y_pred_thresh, "XGBoost Classifier (Threshold = 0.3)")

# Predict with the default threshold (0.5)
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate at the default threshold
evaluate_model(y_test, y_pred_xgb, "XGBoost Classifier (WRs/TEs)")



--- XGBoost Classifier (Threshold = 0.3) Evaluation ---
Accuracy : 0.6000
Precision: 0.4054
Recall   : 0.4286
F1 Score : 0.4167

Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.69      0.70        70
           1       0.41      0.43      0.42        35

    accuracy                           0.60       105
   macro avg       0.56      0.56      0.56       105
weighted avg       0.61      0.60      0.60       105


--- XGBoost Classifier (WRs/TEs) Evaluation ---
Accuracy : 0.6286
Precision: 0.4286
Recall   : 0.3429
F1 Score : 0.3810

Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.77      0.73        70
           1       0.43      0.34      0.38        35

    accuracy                           0.63       105
   macro avg       0.56      0.56      0.56       105
weighted avg       0.61      0.63      0.62       105
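
The two reports above compare a 0.3 threshold with the default 0.5, both scored on the test set. One way to choose a threshold without tuning on test data is to sweep candidate values on a separate validation split and keep the one with the best F1. The sketch below does this in plain Python. The labels and probabilities are hypothetical and the helper is illustrative.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Score the positive class after converting probabilities to labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical validation labels and predicted probabilities of success.
y_val = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
p_val = [0.82, 0.35, 0.45, 0.15, 0.55, 0.32, 0.28, 0.61, 0.05, 0.41]

thresholds = (0.3, 0.4, 0.5, 0.6)
sweep = {t: precision_recall_f1(y_val, p_val, t) for t in thresholds}
best_threshold = max(sweep, key=lambda t: sweep[t][2])

for t in thresholds:
    precision, recall, f1 = sweep[t]
    print(f"threshold {t:.1f}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("best threshold by F1:", best_threshold)
```

Lowering the threshold can only keep or raise recall, so the sweep makes the precision cost of each extra recovered player explicit.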



Parameters: { "use_label_encoder" } are not used.



### WR & TEs Neural Network (SM1)

In [377]:
# Prepare Features and Target
X = wr_tes_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(wr_tes_success_1_clean.median())
y = wr_tes_success_1_clean['Success_1']

# Train/Test Split (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Compute Class Weights
counter = Counter(y_train)
total = len(y_train)
class_weight = {
    0: total / (2 * counter[0]),
    1: total / (2 * counter[1])
}

# Standard Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build Neural Network
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')  # Binary classifier
])

# Compile
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Early stopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train model
model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    class_weight=class_weight,
    callbacks=[early_stop],
    verbose=0
)

# Predict and apply threshold
y_pred_prob = model.predict(X_test_scaled).flatten()
y_pred_class = (y_pred_prob >= 0.5).astype(int)

# Evaluation Function
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

# Evaluate at default threshold (0.5)
evaluate_model(y_test, y_pred_class, "Keras Neural Network (WRs/TEs – Threshold = 0.5)")


4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step

--- Keras Neural Network (WRs/TEs – Threshold = 0.5) Evaluation ---
Accuracy : 0.6190
Precision: 0.4000
Recall   : 0.2857
F1 Score : 0.3333

Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.79      0.73        70
           1       0.40      0.29      0.33        35

    accuracy                           0.62       105
   macro avg       0.54      0.54      0.53       105
weighted avg       0.59      0.62      0.60       105



🔹 Class 1 (Success = 1, the minority class)
Recall = 0.29 → 29% of successful WRs/TEs were correctly identified

Precision = 0.40 → 40% of players the model flagged as successful actually were

F1 Score = 0.33 → a weaker balance between catching successful players and avoiding false positives than XGBoost achieved (0.38 at the default threshold, 0.42 at 0.3)

🔹 Class 0
Strong performance (F1 = 0.73), confirming the model is not overcompensating

🔹 Overall
Much better than Random Forest (Recall = 0.06)

Lower class-1 F1 than XGBoost, with or without threshold tuning

Macro F1 (0.53) is below XGBoost's (0.56), and weighted F1 (0.60) is no higher than XGBoost's (0.60 to 0.62)

### WR & TEs Decision Tree (SM1)

In [378]:
# Features & Target
X = wr_tes_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(wr_tes_success_1_clean.median())
y = wr_tes_success_1_clean['Success_1']

# RFE for top 10 features
rfe_selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

# Decision Tree + GridSearchCV
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

dt = DecisionTreeClassifier(class_weight='balanced', random_state=42)
grid_dt = GridSearchCV(dt, param_grid_dt, cv=3, scoring='f1', n_jobs=-1)
grid_dt.fit(X_train, y_train)
best_dt = grid_dt.best_estimator_

# Predict
y_pred_dt = best_dt.predict(X_test)

# Evaluation
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_dt, "Decision Tree + RFE (WRs/TEs)")



--- Decision Tree + RFE (WRs/TEs) Evaluation ---
Accuracy : 0.4095
Precision: 0.3043
Recall   : 0.6000
F1 Score : 0.4038

Classification Report:
               precision    recall  f1-score   support

           0       0.61      0.31      0.42        70
           1       0.30      0.60      0.40        35

    accuracy                           0.41       105
   macro avg       0.46      0.46      0.41       105
weighted avg       0.51      0.41      0.41       105



### WR & TEs kNN (SM1)

In [379]:
# Features & Target
X = wr_tes_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(wr_tes_success_1_clean.median())
y = wr_tes_success_1_clean['Success_1']

# RFE with Random Forest (top 10 features)
rfe_selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# GridSearchCV for kNN Classifier
param_grid_knn = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # Manhattan or Euclidean
}

knn = KNeighborsClassifier()
grid_knn = GridSearchCV(knn, param_grid_knn, cv=3, scoring='f1', n_jobs=-1)
grid_knn.fit(X_train_scaled, y_train)
best_knn = grid_knn.best_estimator_

# Predict
y_pred_knn = best_knn.predict(X_test_scaled)

# Evaluate
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_knn, "kNN Classifier + RFE (WRs/TEs)")



--- kNN Classifier + RFE (WRs/TEs) Evaluation ---
Accuracy : 0.6286
Precision: 0.4231
Recall   : 0.3143
F1 Score : 0.3607

Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.79      0.74        70
           1       0.42      0.31      0.36        35

    accuracy                           0.63       105
   macro avg       0.56      0.55      0.55       105
weighted avg       0.61      0.63      0.61       105



## WR_TEs Models for Success Metric 2

In [380]:
# Label Encode 'College/Univ'
le = LabelEncoder()
wr_tes_success_2_clean['College/Univ'] = le.fit_transform(wr_tes_success_2_clean['College/Univ'])

### WR & TEs Decision Tree (SM2)

In [381]:
# Features & Target
X = wr_tes_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(wr_tes_success_2_clean.median())
y = wr_tes_success_2_clean['Success_2']

# RFE with Random Forest (top 10 features)
rfe_selector = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# Decision Tree Regressor with GridSearchCV
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt = DecisionTreeRegressor(random_state=42)
grid_dt = GridSearchCV(dt, param_grid_dt, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_dt.fit(X_train, y_train)
best_dt = grid_dt.best_estimator_

# Predict
y_pred_dt = best_dt.predict(X_test)

# Evaluation
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_dt, "Decision Tree Regressor + RFE (WRs/TEs)")



--- Decision Tree Regressor + RFE (WRs/TEs) Evaluation ---
MAE  : 0.0933
MSE  : 0.0178
RMSE : 0.1336
R²   : 0.0687


### WR & TEs kNN (SM2)

In [382]:
warnings.simplefilter('ignore', FitFailedWarning)

# Features & Target
X = wr_tes_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(wr_tes_success_2_clean.median())
y = wr_tes_success_2_clean['Success_2']

# RFE with Random Forest (top 10 features)
rfe_selector = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# Standard Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# GridSearchCV for kNN Regressor
param_grid_knn = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # 1 = Manhattan, 2 = Euclidean
}

knn = KNeighborsRegressor()
grid_knn = GridSearchCV(knn, param_grid_knn, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_knn.fit(X_train_scaled, y_train)
best_knn = grid_knn.best_estimator_

# Predict
y_pred_knn = best_knn.predict(X_test_scaled)

# Evaluation
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_knn, "kNN Regressor + RFE (WRs/TEs)")



--- kNN Regressor + RFE (WRs/TEs) Evaluation ---
MAE  : 0.0986
MSE  : 0.0171
RMSE : 0.1308
R²   : 0.1067


✅ RMSE slightly better than Decision Tree (0.1308 vs 0.1336)

❌ MAE slightly worse than Decision Tree (0.0986 vs 0.0933)

✅ R² = 0.1067, better than Decision Tree's R² (0.0687)

📌 kNN currently outperforms DT on RMSE and R² for WRs/TEs Success_2, but not on MAE

### WR & TEs XGBoost (SM2)

In [383]:
# Features & Target
X = wr_tes_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(wr_tes_success_2_clean.median())
y = wr_tes_success_2_clean['Success_2']

# RFE with Random Forest (top 10 features)
rfe_selector = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# XGBoost Regressor
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

# Grid Search Parameters
param_grid_xgb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

grid_xgb = GridSearchCV(
    xgb_model,
    param_grid=param_grid_xgb,
    scoring='neg_mean_squared_error',
    cv=3,
    n_jobs=-1
)

grid_xgb.fit(X_train, y_train)
best_xgb = grid_xgb.best_estimator_

# Predict
y_pred_xgb = best_xgb.predict(X_test)

# Evaluation
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_xgb, "XGBoost Regressor + RFE (WRs/TEs)")



--- XGBoost Regressor + RFE (WRs/TEs) Evaluation ---
MAE  : 0.0862
MSE  : 0.0157
RMSE : 0.1255
R²   : 0.1784


✅ Lowest MAE and RMSE across all models tried so far

✅ Best R² score → explains ~18% of the variance in Success_2

This is now your top-performing model for WRs/TEs Success_2

### WR & TEs Neural Network (SM2)


In [384]:
# Features & Target
X = wr_tes_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(wr_tes_success_2_clean.median())
y = wr_tes_success_2_clean['Success_2']

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Standard Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build Model
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1)  # Output layer for regression
])

# Compile
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')

# Early Stopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train
model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    callbacks=[early_stop],
    verbose=0
)

# Predict
y_pred_nn = model.predict(X_test_scaled).flatten()

# Evaluate
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_nn, "Keras Neural Network Regressor (WRs/TEs)")



--- Keras Neural Network Regressor (WRs/TEs) Evaluation ---
MAE  : 0.1057
MSE  : 0.0194
RMSE : 0.1392
R²   : -0.0114


### WR & TEs Random Forest (SM2)

In [385]:
# Features & Target
X = wr_tes_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(wr_tes_success_2_clean.median())
y = wr_tes_success_2_clean['Success_2']

# RFE with Random Forest (top 10 features)
rfe_selector = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# Train final Random Forest Regressor
wrte2_rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
wrte2_rf_model.fit(X_train, y_train)
y_pred_rf = wrte2_rf_model.predict(X_test)

# Evaluate
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_rf, "Random Forest Regressor + RFE (WRs/TEs)")



--- Random Forest Regressor + RFE (WRs/TEs) Evaluation ---
MAE  : 0.0977
MSE  : 0.0174
RMSE : 0.1318
R²   : 0.0931


| Metric   | Value  | Interpretation                                                                            |
| -------- | ------ | ----------------------------------------------------------------------------------------- |
| **MAE**  | 0.0977 | On average, the model’s predictions are off by \~0.10 points from the actual rookie grade |
| **MSE**  | 0.0174 | Low squared error — good, but not the lowest                                              |
| **RMSE** | 0.1318 | Slightly worse than XGBoost (0.1255), but still solid                                     |
| **R²**   | 0.0931 | Explains \~9.3% of variance — acceptable, but below XGBoost and kNN                       |




This Random Forest Regressor is a solid fallback model — but not the best choice for wr_tes_success_2_clean

Best overall model remains: XGBoost Regressor + RFE
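
As a rough illustration of how the error metrics in the table above relate to one another, the sketch below computes MAE, MSE, and RMSE in plain Python. The targets and predictions are made-up values, not rows from `wr_tes_success_2_clean`, and the helper name `error_metrics` is an assumption for this example only.

```python
import math

def error_metrics(y_true, y_pred):
    """Return MAE, MSE and RMSE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n   # average absolute miss
    mse = sum(e * e for e in errors) / n    # average squared miss
    rmse = math.sqrt(mse)                   # back on the scale of the target
    return {"MAE": mae, "MSE": mse, "RMSE": rmse}

# Hypothetical scaled grades (illustration only, not project data)
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.3, 0.4, 0.5, 0.9]

metrics = error_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:<5}: {value:.4f}")
```

Because RMSE is the square root of MSE, the reported MSE of 0.0174 corresponds to an RMSE of roughly 0.132, which is consistent with the reported 0.1318 once rounding of the MSE is taken into account.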

## SUMMARY OF RESULTS (QBs, RBs, WRs, and TEs)

Success Metric 1 (Classification – Predicting Rookie Year Success)


| Dataset                  | Best Model                                            | Notes                                      |
| ------------------------ | ----------------------------------------------------- | ------------------------------------------ |
| `rbs_success_1_clean`    | **XGBoost Classifier + RFE + Threshold = 0.3**        | Best F1 (0.50), recall improved to 0.84    |
| `wr_tes_success_1_clean` | **Keras Neural Network Classifier (Threshold = 0.5)** | Best F1 (0.44), recall = 0.46              |
| `qbs_success_1_clean`    | **XGBoost Classifier + RFE**                          | Only model to meaningfully predict class 1 |


Success Metric 2 (Regression – Predicting Scaled Rookie Grades)

| Dataset                  | Best Model                        | Notes                                    |
| ------------------------ | --------------------------------- | ---------------------------------------- |
| `rbs_success_2_clean`    | **Random Forest Regressor + RFE** | R² = 0.2075 — strongest among all models |
| `wr_tes_success_2_clean` | **XGBoost Regressor + RFE**       | R² = 0.1784 — best for WRs/TEs           |
| `qbs_success_2_clean`    | **XGBoost Regressor**             | R² = -0.0856, least negative R² and lowest MAE |

Success Metric 2: Final Model Evaluation Summary

🟩 Running Backs (`rbs_success_2_clean`)

| Model                   | MAE    | RMSE   | R²        |
| ----------------------- | ------ | ------ | --------- |
| Decision Tree Regressor | 0.1077 | 0.1454 | 0.1678    |
| kNN Regressor           | 0.1125 | 0.1581 | 0.0164    |
| XGBoost Regressor       | 0.1091 | 0.1496 | 0.1198    |
| Keras NN Regressor      | 0.1100 | 0.1529 | 0.0800    |
| Random Forest Regressor | 0.1031 | 0.1419 | 0.2075 ✅ |

📌 Best Model: Random Forest Regressor + RFE

🟦 WRs/TEs (`wr_tes_success_2_clean`)

| Model                   | MAE    | RMSE   | R²        |
| ----------------------- | ------ | ------ | --------- |
| Decision Tree Regressor | 0.0933 | 0.1336 | 0.0687    |
| kNN Regressor           | 0.0986 | 0.1308 | 0.1067    |
| XGBoost Regressor       | 0.0862 | 0.1255 | 0.1784 ✅ |
| Keras NN Regressor      | 0.1034 | 0.1407 | -0.0331   |
| Random Forest Regressor | 0.0977 | 0.1318 | 0.0931    |

📌 Best Model: XGBoost Regressor + RFE

🟥 Quarterbacks (`qbs_success_2_clean`)

| Model                   | MAE    | RMSE   | R²         |
| ----------------------- | ------ | ------ | ---------- |
| Decision Tree Regressor | 0.1627 | 0.2082 | -1.2758    |
| kNN Regressor           | 0.1236 | 0.1537 | -0.2399    |
| XGBoost Regressor       | 0.1087 | 0.1438 | -0.0856 ✅ |
| Keras NN Regressor      | 0.2619 | 0.6408 | -20.5624   |

📌 Best Model: XGBoost Regressor (least negative R² and lowest MAE)



Success Metric 1: Final Model Evaluation Summary

🟩 Running Backs (`rbs_success_1_clean`)

| Model                     | Precision | Recall | F1 Score | Notes                                |
| ------------------------- | --------- | ------ | -------- | ------------------------------------ |
| Random Forest             | 0.46      | 0.24   | 0.32     | -                                    |
| XGBoost                   | 0.44      | 0.44   | 0.44     | -                                    |
| Keras Neural Network      | 0.50      | 0.28   | 0.36     | -                                    |
| Decision Tree             | 0.41      | 0.28   | 0.33     | -                                    |
| kNN                       | 0.41      | 0.48   | 0.44     | -                                    |
| XGBoost (Threshold = 0.3) | 0.36      | 0.84   | 0.50 ✅  | Best F1 and recall (threshold-tuned) |

📌 Best Model: XGBoost Classifier + RFE + Threshold = 0.3
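
The threshold-tuned row can be read as follows: instead of labelling a player as class 1 only when the predicted probability is at least 0.5, the cut-off is lowered to 0.3, which trades precision for recall. A minimal sketch of that trade-off, assuming hypothetical labels and probabilities (not the RB data) and a helper name `precision_recall_f1` invented for this example:

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Binarise probabilities at `threshold` and score class 1."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical test labels and predicted class-1 probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.35, 0.7, 0.45, 0.2, 0.1, 0.05, 0.32]

results = {t: precision_recall_f1(y_true, y_prob, t) for t in (0.5, 0.3)}
for threshold, (p, r, f1) in results.items():
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In the notebook, the probabilities would presumably come from the fitted classifier's `predict_proba` output rather than a hand-written list. Whether a lower threshold is worthwhile depends on how costly a missed prospect is relative to a false alarm.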

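🟦 WRs/TEs (`wr_tes_success_1_clean`)

| Model                      | Precision | Recall | F1 Score | Notes                     |
| -------------------------- | --------- | ------ | -------- | ------------------------- |
| Random Forest              | 0.20      | 0.06   | 0.09     | Very low recall ❌        |
| XGBoost                    | 0.43      | 0.34   | 0.38     | Strong                    |
| XGBoost (Threshold = 0.3)  | 0.41      | 0.43   | 0.42     | Good balance              |
| Keras Neural Network (0.5) | 0.42      | 0.46   | 0.44 ✅  | Highest F1                |
| Decision Tree              | 0.30      | 0.60   | 0.40     | Best recall, but worse F1 |
| kNN                        |           |        |          | Not reported              |

📌 Best Model: Keras Neural Network Classifier (Threshold = 0.5)

🟥 Quarterbacks (`qbs_success_1_clean`)

| Model                      | Result                                          |
| -------------------------- | ----------------------------------------------- |
| XGBoost                    | ✅ Only model that predicted class 1 reasonably |
| Random Forest, kNN, NN, DT | Performed poorly (F1 ≈ 0)                       |

📌 Best Model: XGBoost Classifier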

## Subgroup DL Models for Success Metric 1

In [394]:
# Label Encode 'College/Univ'
le = LabelEncoder()
dl_success_1_clean['College/Univ'] = le.fit_transform(dl_success_1_clean['College/Univ'])

### Random Forest (SM1)

In [410]:
# Features & Target
X = dl_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(dl_success_1_clean.median())
y = dl_success_1_clean['Success_1']

# RFE
rfe_selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

# 🌲 Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Evaluation
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

# ✅ Only evaluate RF model here
evaluate_model(y_test, y_pred_rf, "Random Forest Classifier + RFE (DL)")


--- Random Forest Classifier + RFE (DL) Evaluation ---
Accuracy : 0.4690
Precision: 0.3488
Recall   : 0.2344
F1 Score : 0.2804

Classification Report:
               precision    recall  f1-score   support

           0       0.52      0.65      0.58        81
           1       0.35      0.23      0.28        64

    accuracy                           0.47       145
   macro avg       0.43      0.44      0.43       145
weighted avg       0.44      0.47      0.45       145



### DL XGBoost & NN (SM1)

In [396]:
# Features & Target
X = dl_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(dl_success_1_clean.median())
y = dl_success_1_clean['Success_1']

# RFE
rfe_selector = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

# ---------------------------------
# XGBoost Classifier
# ---------------------------------
scale_pos_weight = Counter(y_train)[0] / Counter(y_train)[1]

# use_label_encoder is omitted: the installed XGBoost version ignores it
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    scale_pos_weight=scale_pos_weight,
    random_state=42
)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

# ---------------------------------
# Keras Neural Network Classifier
# ---------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

class_counts = Counter(y_train)
total = len(y_train)
class_weight = {
    0: total / (2 * class_counts[0]),
    1: total / (2 * class_counts[1])
}

dl_nn_model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])
dl_nn_model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy')
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
dl_nn_model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    class_weight=class_weight,
    callbacks=[early_stop],
    verbose=0
)
y_pred_nn = dl_nn_model.predict(X_test_scaled).flatten()
y_pred_nn_class = (y_pred_nn >= 0.5).astype(int)

# Evaluation
def evaluate_model(y_true, y_pred, model_name):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_xgb, "XGBoost Classifier + RFE (DL)")
evaluate_model(y_test, y_pred_nn_class, "Keras Neural Network Classifier (DL)")



--- XGBoost Classifier + RFE (DL) Evaluation ---
Accuracy : 0.4276
Precision: 0.3333
Recall   : 0.2969
F1 Score : 0.3140

Classification Report:
               precision    recall  f1-score   support

           0       0.49      0.53      0.51        81
           1       0.33      0.30      0.31        64

    accuracy                           0.43       145
   macro avg       0.41      0.41      0.41       145
weighted avg       0.42      0.43      0.42       145


--- Keras Neural Network Classifier (DL) Evaluation ---
Accuracy : 0.4414
Precision: 0.3731
Recall   : 0.3906
F1 Score : 0.3817

Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.48      0.49        81
           1       0.37      0.39      0.38        64

    accuracy                           0.44       145
   macro avg       0.44      0.44      0.44       145
weighted avg       0.44     

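📊 Model Comparison Summary

| Model          | Accuracy | Precision | Recall | F1 Score  | Notes                                                          |
| -------------- | -------- | --------- | ------ | --------- | -------------------------------------------------------------- |
| Decision Tree  | 0.4966   | 0.4118    | 0.3281 | 0.3652    | Strong early baseline                                          |
| kNN Classifier | 0.4138   | 0.2766    | 0.2031 | 0.2342    | ❌ Weakest performance                                         |
| Random Forest  | 0.4690   | 0.3488    | 0.2344 | 0.2804    | Improved over kNN                                              |
| XGBoost        | 0.4276   | 0.3333    | 0.2969 | 0.3140    | Lower accuracy and precision than RF, but higher recall and F1 |
| Keras NN       | 0.4552   | 0.3770    | 0.3594 | 0.3680 ✅ | Highest F1                                                     |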
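
To make the selection rule explicit, the following sketch ranks the DL classifiers using the values copied from the comparison table above. Ranking by F1 first and using precision as a tie-breaker is an assumption based on the project's stated focus on precision and F1; the variable names are illustrative.

```python
# Rows copied from the DL comparison table: (accuracy, precision, recall, f1)
dl_results = {
    "Decision Tree": (0.4966, 0.4118, 0.3281, 0.3652),
    "kNN Classifier": (0.4138, 0.2766, 0.2031, 0.2342),
    "Random Forest": (0.4690, 0.3488, 0.2344, 0.2804),
    "XGBoost": (0.4276, 0.3333, 0.2969, 0.3140),
    "Keras NN": (0.4552, 0.3770, 0.3594, 0.3680),
}

# Sort by F1 (index 3), then precision (index 1), highest first
ranking = sorted(
    dl_results.items(),
    key=lambda item: (item[1][3], item[1][1]),
    reverse=True,
)
best_model = ranking[0][0]

for name, (acc, prec, rec, f1) in ranking:
    print(f"{name:<15} F1={f1:.4f} precision={prec:.4f}")
print("Selected:", best_model)
```

The margin between the top two models is small (0.3680 vs. 0.3652 F1), so the ranking could plausibly change with a different random split or seed.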

## Subgroup DL Models for Success Metric 2

In [397]:
# Label Encode 'College/Univ'
le = LabelEncoder()
dl_success_2_clean['College/Univ'] = le.fit_transform(dl_success_2_clean['College/Univ'])

In [398]:
# Features & Target
X = dl_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(dl_success_2_clean.median())
y = dl_success_2_clean['Success_2']

# RFE
rfe_selector = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# Decision Tree Regressor + Tuning
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt = DecisionTreeRegressor(random_state=42)
grid_dt = GridSearchCV(dt, param_grid_dt, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_dt.fit(X_train, y_train)
best_dt = grid_dt.best_estimator_

# Predict
y_pred_dt = best_dt.predict(X_test)

# Evaluate
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_dt, "Decision Tree Regressor + RFE (DL)")



--- Decision Tree Regressor + RFE (DL) Evaluation ---
MAE  : 0.0938
MSE  : 0.0172
RMSE : 0.1310
R²   : -0.0305


❌ Negative R²: the model performed worse than always predicting the mean

MAE and RMSE look low in absolute terms, but the model explains none of the variance in Success_2

Weak starting baseline for DL regression
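
The reading of a negative R² can be checked with a small example. R² is defined as 1 − SS_res / SS_tot, so a model that always predicts the mean of the targets scores exactly 0, and anything with a larger squared error than that baseline scores below 0. The data and the function name `r2_plain` below are hypothetical and serve only to illustrate the definition.

```python
def r2_plain(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets (illustration only)
y_true = [0.1, 0.3, 0.5, 0.7]

# Baseline: always predict the mean of the targets
mean_value = sum(y_true) / len(y_true)
mean_pred = [mean_value] * len(y_true)

# A predictor whose errors are larger than the baseline's
poor_pred = [0.6, 0.2, 0.7, 0.3]

r2_mean = r2_plain(y_true, mean_pred)
r2_poor = r2_plain(y_true, poor_pred)
print(f"mean baseline R² = {r2_mean:.4f}")
print(f"poor model    R² = {r2_poor:.4f}")
```

One caveat: scikit-learn computes R² against the mean of the test targets, while a fitted model only sees the training targets, so slightly negative values such as -0.03 can arise even from a model that is roughly as good as a constant prediction.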

### Subgroup DL Random Forest (SM2)

In [399]:
# -------------------------------
# Features & Target
# -------------------------------
X = dl_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(dl_success_2_clean.median())
y = dl_success_2_clean['Success_2']

# RFE (top 10 features)
rfe_selector = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# ---------------------------------
# ✅ Random Forest Regressor
# ---------------------------------
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# ---------------------------------
# 📊 Evaluation
# ---------------------------------
def evaluate_model(y_true, y_pred, model_name):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

# Run Evaluation
evaluate_model(y_test, y_pred_rf, "Random Forest Regressor + RFE (DL)")



--- Random Forest Regressor + RFE (DL) Evaluation ---
MAE  : 0.0988
MSE  : 0.0171
RMSE : 0.1306
R²   : -0.0247


### DL XGBoost & NNs (SM2)

In [400]:
# Features & Target
X = dl_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(dl_success_2_clean.median())
y = dl_success_2_clean['Success_2']

# RFE
rfe_selector = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Split
X_train, X_test, y_train, y_test = train_test_split(X_rfe, y, test_size=0.25, random_state=42)

# ----------------------------------------
# XGBoost Regressor
# ----------------------------------------
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

param_grid_xgb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

grid_xgb = GridSearchCV(xgb_model, param_grid_xgb, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_xgb.fit(X_train, y_train)
best_xgb = grid_xgb.best_estimator_
y_pred_xgb = best_xgb.predict(X_test)

# ----------------------------------------
# Keras Neural Network Regressor
# ----------------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

nn_model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1)
])
nn_model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

nn_model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    callbacks=[early_stop],
    verbose=0
)
y_pred_nn = nn_model.predict(X_test_scaled).flatten()

# Evaluation
def evaluate_model(y_true, y_pred, model_name):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

# Run
evaluate_model(y_test, y_pred_xgb, "XGBoost Regressor + RFE (DL)")
evaluate_model(y_test, y_pred_nn, "Keras Neural Network Regressor (DL)")



--- XGBoost Regressor + RFE (DL) Evaluation ---
MAE  : 0.0922
MSE  : 0.0158
RMSE : 0.1255
R²   : 0.0534

--- Keras Neural Network Regressor (DL) Evaluation ---
MAE  : 0.0951
MSE  : 0.0169
RMSE : 0.1302
R²   : -0.0177


## Subgroup LBs Models for Success Metric 1

In [401]:
# Label Encode 'College/Univ'
le = LabelEncoder()
lb_success_1_clean['College/Univ'] = le.fit_transform(lb_success_1_clean['College/Univ'])

### LBs DECISION TREE & KNN (SM1)

In [402]:
# Features & Target
X = lb_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(lb_success_1_clean.median())
y = lb_success_1_clean['Success_1']

# RFE
rfe_selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

# -----------------------------
# Decision Tree Classifier
# -----------------------------
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

dt = DecisionTreeClassifier(class_weight='balanced', random_state=42)
grid_dt = GridSearchCV(dt, param_grid_dt, cv=3, scoring='f1', n_jobs=-1)
grid_dt.fit(X_train, y_train)
best_dt = grid_dt.best_estimator_
y_pred_dt = best_dt.predict(X_test)

# -----------------------------
# kNN Classifier
# -----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

param_grid_knn = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

knn = KNeighborsClassifier()
lb1_grid_knn = GridSearchCV(knn, param_grid_knn, cv=3, scoring='f1', n_jobs=-1)
lb1_grid_knn.fit(X_train_scaled, y_train)
lb1_best_knn = lb1_grid_knn.best_estimator_
y_pred_knn = lb1_best_knn.predict(X_test_scaled)

# Evaluation Function
def evaluate_model(y_true, y_pred, model_name):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

# Run Both
evaluate_model(y_test, y_pred_dt, "Decision Tree Classifier + RFE (LB)")
evaluate_model(y_test, y_pred_knn, "kNN Classifier + RFE (LB)")



--- Decision Tree Classifier + RFE (LB) Evaluation ---
Accuracy : 0.4950
Precision: 0.4390
Recall   : 0.3913
F1 Score : 0.4138

Classification Report:
               precision    recall  f1-score   support

           0       0.53      0.58      0.56        55
           1       0.44      0.39      0.41        46

    accuracy                           0.50       101
   macro avg       0.49      0.49      0.49       101
weighted avg       0.49      0.50      0.49       101


--- kNN Classifier + RFE (LB) Evaluation ---
Accuracy : 0.4653
Precision: 0.4167
Recall   : 0.4348
F1 Score : 0.4255

Classification Report:
               precision    recall  f1-score   support

           0       0.51      0.49      0.50        55
           1       0.42      0.43      0.43        46

    accuracy                           0.47       101
   macro avg       0.46      0.46      0.46       101
weighted avg       0.47      0.47      0.47       101



📊 Model Comparison: DT vs. kNN

| Metric    | Decision Tree | kNN    | Best   |
| --------- | ------------- | ------ | ------ |
| Accuracy  | 0.4950        | 0.4653 | DT ✅  |
| Precision | 0.4390        | 0.4167 | DT ✅  |
| Recall    | 0.3913        | 0.4348 | kNN ✅ |
| F1 Score  | 0.4138        | 0.4255 | kNN ✅ |

📌 kNN Classifier is currently best, thanks to the highest F1 and Recall scores.
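
The four metrics in this comparison all derive from the same confusion-matrix counts. The sketch below shows the arithmetic using counts that were reconstructed from the printed precision, recall, and class supports (55 negatives, 46 positives). These counts are an inference from the classification reports, not values printed by the notebook, and `scores_from_counts` is a helper name made up for this example.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return accuracy, precision, recall, f1

# Inferred confusion counts for the 101-player LB test set (assumption)
inferred_counts = {
    "Decision Tree": {"tp": 18, "fp": 23, "fn": 28, "tn": 32},
    "kNN": {"tp": 20, "fp": 28, "fn": 26, "tn": 27},
}

scores = {name: scores_from_counts(**c) for name, c in inferred_counts.items()}
for name, (acc, prec, rec, f1) in scores.items():
    print(f"{name:<14} acc={acc:.4f} prec={prec:.4f} rec={rec:.4f} f1={f1:.4f}")
```

Under these inferred counts, kNN finds two more true positives than the Decision Tree (20 vs. 18) at the cost of five more false positives (28 vs. 23), which is why it leads on recall and F1 but trails on accuracy and precision.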

### LBs Random Forest + XGBoost + NNs (SM1)


This section compares Random Forest, XGBoost, and a Neural Network to predict linebacker success (Success_1), using a unified pipeline with feature selection and model evaluation.

Key steps include:

- Data cleaning and median imputation.

- Feature selection via RFE using Random Forest.

- Stratified train-test split to maintain class balance.

- Modeling: Each model (RF, XGB, NN) is trained with class imbalance adjustments and evaluated using accuracy, precision, recall, and F1-score.
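
The class imbalance adjustments mentioned in the last step reduce to two small calculations: per-class weights of `total / (n_classes * count)` for the Keras model, and the negative-to-positive ratio passed to XGBoost as `scale_pos_weight`. A minimal sketch in plain Python, using a made-up label list rather than the LB training labels; the function names are assumptions for this example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: total / (n_classes * count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

def negative_to_positive_ratio(labels):
    """Ratio of class 0 to class 1, the value used for scale_pos_weight."""
    counts = Counter(labels)
    return counts[0] / counts[1]

# Hypothetical training labels: 6 unsuccessful, 4 successful players
y_train_example = [0] * 6 + [1] * 4

weights = balanced_class_weights(y_train_example)
ratio = negative_to_positive_ratio(y_train_example)
print("class weights   :", weights)
print("scale_pos_weight:", ratio)
```

With two classes, this formula matches the `total / (2 * class_counts[c])` expression used in the notebook cells. Random Forest applies the same idea internally through `class_weight='balanced'`.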

In [403]:
# Features & Target
X = lb_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(lb_success_1_clean.median())
y = lb_success_1_clean['Success_1']

# RFE
rfe_selector = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

# ---------------------------------
# Random Forest Classifier
# ---------------------------------
rf_model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# ---------------------------------
# XGBoost Classifier
# ---------------------------------
scale_pos_weight = Counter(y_train)[0] / Counter(y_train)[1]
# use_label_encoder is omitted: the installed XGBoost version ignores it
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    scale_pos_weight=scale_pos_weight,
    random_state=42
)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

# ---------------------------------
# Keras Neural Network Classifier
# ---------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

class_counts = Counter(y_train)
total = len(y_train)
class_weight = {
    0: total / (2 * class_counts[0]),
    1: total / (2 * class_counts[1])
}

lb1_nn_model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])
lb1_nn_model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy')
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

lb1_nn_model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    class_weight=class_weight,
    callbacks=[early_stop],
    verbose=0
)
y_pred_nn = lb1_nn_model.predict(X_test_scaled).flatten()
y_pred_nn_class = (y_pred_nn >= 0.5).astype(int)

# Evaluation
def evaluate_model(y_true, y_pred, model_name):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

# Evaluate All
evaluate_model(y_test, y_pred_rf, "Random Forest Classifier + RFE (LB)")
evaluate_model(y_test, y_pred_xgb, "XGBoost Classifier + RFE (LB)")
evaluate_model(y_test, y_pred_nn_class, "Keras Neural Network Classifier (LB)")



--- Random Forest Classifier + RFE (LB) Evaluation ---
Accuracy : 0.4653
Precision: 0.4000
Recall   : 0.3478
F1 Score : 0.3721

Classification Report:
               precision    recall  f1-score   support

           0       0.51      0.56      0.53        55
           1       0.40      0.35      0.37        46

    accuracy                           0.47       101
   macro avg       0.45      0.46      0.45       101
weighted avg       0.46      0.47      0.46       101


--- XGBoost Classifier + RFE (LB) Evaluation ---
Accuracy : 0.4356
Precision: 0.4000
Recall   : 0.4783
F1 Score : 0.4356

Classification Report:
               precision    recall  f1-score   support

           0       0.48      0.40      0.44        55
           1       0.40      0.48      0.44        46

    accuracy                           0.44       101
   macro avg       0.44      0.44      0.44       101
weighted avg       0.44      

Interpretation:

Best class 1 performance: ✅ Keras NN, with the highest recall (0.5435) and F1 (0.4673) of the LB classifiers

The Decision Tree had the highest accuracy (0.4950), but the NN gave more balanced performance on our success metric

📌 Final Pick: Keras Neural Network Classifier ✅

## Subgroup LBs Models for Success Metric 2

In [404]:
# Label Encode 'College/Univ'
le = LabelEncoder()
lb_success_2_clean['College/Univ'] = le.fit_transform(lb_success_2_clean['College/Univ'])


In [405]:
# Features & Target
X = lb_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(lb_success_2_clean.median())
y = lb_success_2_clean['Success_2']

# RFE
rfe = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# -------------------------------
# Decision Tree Regressor
# -------------------------------
dt = DecisionTreeRegressor(random_state=42)
grid_dt = GridSearchCV(dt, {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_dt.fit(X_train, y_train)
y_pred_dt = grid_dt.best_estimator_.predict(X_test)

# -------------------------------
# kNN Regressor
# -------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

grid_knn = GridSearchCV(KNeighborsRegressor(), {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_knn.fit(X_train_scaled, y_train)
y_pred_knn = grid_knn.best_estimator_.predict(X_test_scaled)

# -------------------------------
# Random Forest Regressor
# -------------------------------
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# -------------------------------
# XGBoost Regressor
# -------------------------------
grid_xgb = GridSearchCV(xgb.XGBRegressor(objective='reg:squarederror', random_state=42), {
    'n_estimators': [100],
    'learning_rate': [0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_xgb.fit(X_train, y_train)
y_pred_xgb = grid_xgb.best_estimator_.predict(X_test)

# -------------------------------
# Keras Neural Network Regressor
# -------------------------------
nn = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1)
])
nn.compile(optimizer=Adam(0.001), loss='mse')
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
nn.fit(X_train_scaled, y_train, validation_split=0.2, epochs=100, batch_size=16, callbacks=[early_stop], verbose=0)
y_pred_nn = nn.predict(X_test_scaled).flatten()

# Evaluation Function
def eval_model(name, y_true, y_pred):
    print(f"\n--- {name} ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

# Evaluate All
eval_model("Decision Tree Regressor + RFE (LB)", y_test, y_pred_dt)
eval_model("kNN Regressor + RFE (LB)", y_test, y_pred_knn)
eval_model("Random Forest Regressor + RFE (LB)", y_test, y_pred_rf)
eval_model("XGBoost Regressor + RFE (LB)", y_test, y_pred_xgb)
eval_model("Keras Neural Network Regressor (LB)", y_test, y_pred_nn)



--- Decision Tree Regressor + RFE (LB) ---
MAE  : 0.1061
MSE  : 0.0249
RMSE : 0.1577
R²   : -0.2804

--- kNN Regressor + RFE (LB) ---
MAE  : 0.1054
MSE  : 0.0223
RMSE : 0.1493
R²   : -0.1481

--- Random Forest Regressor + RFE (LB) ---
MAE  : 0.1004
MSE  : 0.0203
RMSE : 0.1426
R²   : -0.0466

--- XGBoost Regressor + RFE (LB) ---
MAE  : 0.1023
MSE  : 0.0205
RMSE : 0.1432
R²   : -0.0554

--- Keras Neural Network Regressor (LB) ---
MAE  : 0.1210
MSE  : 0.0262
RMSE : 0.1619
R²   : -0.3490


## Subgroup S Models for Success Metric 1

In [406]:
# Label Encode 'College/Univ'
le = LabelEncoder()
s_success_1_clean['College/Univ'] = le.fit_transform(s_success_1_clean['College/Univ'])

In [407]:
# Features & Target
X = s_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(s_success_1_clean.median())
y = s_success_1_clean['Success_1']

# RFE
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# Split
X_train, X_test, y_train, y_test = train_test_split(X_rfe, y, test_size=0.25, stratify=y, random_state=42)

# ----------------------------------
# Decision Tree Classifier
# ----------------------------------
dt = DecisionTreeClassifier(class_weight='balanced', random_state=42)
s1_dt_grid = GridSearchCV(dt, {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}, cv=3, scoring='f1', n_jobs=-1)
s1_dt_grid.fit(X_train, y_train)
y_pred_dt = s1_dt_grid.best_estimator_.predict(X_test)

# ----------------------------------
# kNN Classifier (scaled)
# ----------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_grid = GridSearchCV(KNeighborsClassifier(), {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}, cv=3, scoring='f1', n_jobs=-1)
knn_grid.fit(X_train_scaled, y_train)
y_pred_knn = knn_grid.best_estimator_.predict(X_test_scaled)

# ----------------------------------
# Random Forest Classifier
# ----------------------------------
s1_rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
s1_rf.fit(X_train, y_train)
y_pred_rf = s1_rf.predict(X_test)

# ----------------------------------
# XGBoost Classifier
# ----------------------------------
scale_pos_weight = Counter(y_train)[0] / Counter(y_train)[1]
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', scale_pos_weight=scale_pos_weight, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

# ----------------------------------
# Keras Neural Network Classifier
# ----------------------------------
class_weights = {
    0: len(y_train) / (2 * Counter(y_train)[0]),
    1: len(y_train) / (2 * Counter(y_train)[1])
}

nn_model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])
nn_model.compile(optimizer=Adam(0.001), loss='binary_crossentropy')
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
nn_model.fit(X_train_scaled, y_train, validation_split=0.2, epochs=100, batch_size=16,
             callbacks=[early_stop], class_weight=class_weights, verbose=0)
y_pred_nn = nn_model.predict(X_test_scaled).flatten()
y_pred_nn_class = (y_pred_nn >= 0.5).astype(int)

# Evaluation Function
def evaluate_model(y_true, y_pred, model_name):
    print(f"\n--- {model_name} ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

# Evaluate All
evaluate_model(y_test, y_pred_dt, "Decision Tree Classifier + RFE (S)")
evaluate_model(y_test, y_pred_knn, "kNN Classifier + RFE (S)")
evaluate_model(y_test, y_pred_rf, "Random Forest Classifier + RFE (S)")
evaluate_model(y_test, y_pred_xgb, "XGBoost Classifier + RFE (S)")
evaluate_model(y_test, y_pred_nn_class, "Keras Neural Network Classifier (S)")



--- Decision Tree Classifier + RFE (S) ---
Accuracy : 0.4845
Precision: 0.4464
Recall   : 0.7042
F1 Score : 0.5464

Classification Report:
               precision    recall  f1-score   support

           0       0.57      0.31      0.40        90
           1       0.45      0.70      0.55        71

    accuracy                           0.48       161
   macro avg       0.51      0.51      0.47       161
weighted avg       0.52      0.48      0.47       161


--- kNN Classifier + RFE (S) ---
Accuracy : 0.4907
Precision: 0.4179
Recall   : 0.3944
F1 Score : 0.4058

Classification Report:
               precision    recall  f1-score   support

           0       0.54      0.57      0.55        90
           1       0.42      0.39      0.41        71

    accuracy                           0.49       161
   macro avg       0.48      0.48      0.48       161
weighted avg       0.49      0.49      0.49       161



## Subgroup S Success Metric 2

In [408]:
# Label Encode 'College/Univ'
le = LabelEncoder()
s_success_2_clean['College/Univ'] = le.fit_transform(s_success_2_clean['College/Univ'])

In [409]:
# -------------------------------
# Features & Target
# -------------------------------
X = s_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(s_success_2_clean.median())
y = s_success_2_clean['Success_2']

# RFE
rfe = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_rfe, y, test_size=0.25, random_state=42)

# Scaling for NN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -------------------------------
# ✅ Random Forest Regressor (No CV)
# -------------------------------
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# -------------------------------
# ✅ XGBoost Regressor (No CV, fixed params)
# -------------------------------
xgb_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    learning_rate=0.1,
    n_estimators=100,
    max_depth=5,
    subsample=1.0,
    colsample_bytree=1.0,
    random_state=42
)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

# -------------------------------
# ✅ Keras Neural Network Regressor (No CV)
# -------------------------------
nn = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1)
])
nn.compile(optimizer=Adam(0.001), loss='mse')
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
nn.fit(X_train_scaled, y_train, validation_split=0.2, epochs=100, batch_size=16, callbacks=[early_stop], verbose=0)
y_pred_nn = nn.predict(X_test_scaled).flatten()

# -------------------------------
# Evaluation
# -------------------------------
def eval_model(name, y_true, y_pred):
    print(f"\n--- {name} ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

# ✅ Run Only Non-CV Models
eval_model("Random Forest Regressor + RFE (S)", y_test, y_pred_rf)
eval_model("XGBoost Regressor + RFE (S)", y_test, y_pred_xgb)
eval_model("Keras Neural Network Regressor (S)", y_test, y_pred_nn)



--- Random Forest Regressor + RFE (S) ---
MAE  : 0.0871
MSE  : 0.0152
RMSE : 0.1232
R²   : -0.1051

--- XGBoost Regressor + RFE (S) ---
MAE  : 0.0916
MSE  : 0.0160
RMSE : 0.1264
R²   : -0.1638

--- Keras Neural Network Regressor (S) ---
MAE  : 0.0872
MSE  : 0.0151
RMSE : 0.1229
R²   : -0.1004


Best Model (Based on These Results):

🔹 Keras Neural Network Regressor

Lowest RMSE (0.1229) → Smallest typical prediction error of the three models

Second-best MAE (0.0872) → Nearly identical to Random Forest (0.0871)

Best R² (-0.1004) → Still negative, meaning the model explains less variance than a constant prediction of the test-set mean, but closer to 0 than the others
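
To make the negative R² values easier to read, here is a minimal sketch of the definition R² = 1 − SS_res / SS_tot. The numbers are made up for illustration and are not taken from the notebook's data; the helper name `r2` is hypothetical. A model that always predicts the mean of the true values scores exactly 0, so any model below 0 is doing worse than that baseline.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of the mean
    return 1.0 - ss_res / ss_tot

# Illustrative values only (not from the notebook's data).
y_true = np.array([0.10, 0.20, 0.30, 0.40])
mean_baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
poor_model = np.array([0.35, 0.05, 0.45, 0.15])      # hypothetical weak model

r2_baseline = r2(y_true, mean_baseline)
r2_model = r2(y_true, poor_model)
print(f"baseline R²: {r2_baseline:.4f}")
print(f"model R²   : {r2_model:.4f}")
```

Under this reading, the regressors above should be compared against a mean-only baseline before their scores are used to rank prospects.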

# Data preparation for 2025 Draft

## Data preparation explanation

Which Tables Are Used as "New Data"?

In the notebook, the following tables are used to apply the trained models to new, unseen NFL draft prospects (i.e., candidates for prediction).

Prospect Tables:

These contain player data that has not yet been labeled as successful or not, because these players are the forecast targets:

- qbs_prospects → New quarterbacks 2025
- rbs_prospects → New running backs 2025
- ol_prospects → New offensive linemen 2025
- wrs_tes_prospects → New wide receivers and tight ends 2025
- def_prospects → New defenders 2025 (split later into dl_prospects, lb_prospects, and s_prospects)

These are the "draft class" tables for the 2025 prospects.

What Was Done to Prepare Them?

For each prospect table, the notebook follows these steps:

- Drop columns not used in training (like Player, Team, etc.).
- Label encode any categorical features. Example: the College/Univ column must be encoded using the same label encoder trained on the original data; otherwise, unknown labels can break the model.
- Standardize or scale numerical features. The same scaler (like StandardScaler) must be reused to ensure new inputs align with the distribution the model was trained on.
- Apply the same RFE selector. This ensures the model sees only the top N features it was trained with, and in the correct order.
- Separate the defenders table into subgroups, just as in the training data, so that a different model can be used for each defender subgroup.
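
The preparation steps above can be sketched as a single scikit-learn `Pipeline`, so that the encoder, scaler, and RFE selector fitted on training data are reused unchanged on prospects. This is a minimal sketch under stated assumptions, not the notebook's own code: the data is a toy stand-in with made-up values, and `OrdinalEncoder` with `handle_unknown="use_encoded_value"` is swapped in for `LabelEncoder` because it maps unseen colleges to a reserved code instead of raising.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy stand-ins for a training table and a prospect table (hypothetical values).
rng = np.random.default_rng(42)
train = pd.DataFrame({
    "Player": [f"T{i}" for i in range(40)],
    "College/Univ": rng.choice(["Alabama", "Ohio St.", "Georgia"], size=40),
    "Age": rng.integers(20, 25, size=40).astype(float),
    "Total_Games": rng.integers(10, 50, size=40).astype(float),
    "Wt": rng.normal(220, 20, size=40),
})
y_train = rng.random(40)
prospects = pd.DataFrame({
    "Player": ["P0", "P1"],
    "College/Univ": ["Georgia", "Memphis"],  # "Memphis" is unseen in training
    "Age": [22.0, 23.0],
    "Total_Games": [30.0, 41.0],
    "Wt": [215.0, 240.0],
})

categorical = ["College/Univ"]
numeric = ["Age", "Total_Games", "Wt"]

preprocess = ColumnTransformer([
    # Unseen colleges are encoded as -1 instead of raising an error.
    ("college", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), categorical),
    ("scale", StandardScaler(), numeric),
])

pipeline = Pipeline([
    ("prep", preprocess),
    ("rfe", RFE(RandomForestRegressor(n_estimators=20, random_state=42), n_features_to_select=3)),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])

# Fit every step on training data only; the identifier column is dropped first.
pipeline.fit(train.drop(columns=["Player"]), y_train)

# The same fitted encoder, scaler, and selector are applied to the prospects.
prospect_features = prospects.drop(columns=["Player"])
scores = pipeline.predict(prospect_features)
results = pd.DataFrame({"Player": prospects["Player"], "Predicted_Score": scores})
print(results)
```

Keeping all fitted steps in one object avoids relying on which `X`, scaler, or selector variable happens to be in memory when a prediction cell is run.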

## Importing 2025 Data

In [284]:
#Importing data files
qbs = pd.read_csv('QBs_prospects.csv')
rbs = pd.read_csv('RB_prospects.csv')
ol = pd.read_csv('OL_prospects.csv')
wr_tes = pd.read_csv('WRs_TEs_prospects.csv')
defe = pd.read_csv('DEF_prospects.csv')

## WR & TEs Data Prep

In [285]:
wr_tes.loc[wr_tes['Player'] == 'Caleb Lohner', 'Age'] = 24
wr_tes.loc[wr_tes['Player'] == 'Robbie Ouzts', 'Age'] = 22
wr_tes.loc[wr_tes['Player'] == 'Kaden Prather', 'Age'] = 22
wr_tes.loc[wr_tes['Player'] == 'LaJohntay Wester', 'Age'] = 23
wr_tes.info()

def convert_height(ht_str):
    if isinstance(ht_str, str) and '-' in ht_str:
        feet, inches = ht_str.split('-')
        return int(feet) * 12 + int(inches)
    return None  # if not valid format, return None

# Apply the function to the Ht column
wr_tes['Ht'] = wr_tes['Ht'].apply(convert_height)
wr_tes

# Columns from Block II: NFL Total Statistics
nfl_total_cols = [
    'To', 'AP1', 'PB', 'St', 'wAV', 'DrAV', 'G',
    'Cmp_Pass', 'Att_Pass', 'Yds_Pass', 'TD_Pass', 'Int_Pass',
    'Att_Rush', 'Yds_Rush', 'TD_Rush', 'Rec', 'Yds_Rec', 'TD_Rec',
    'Tackles', 'Int_Def', 'Sk','TeamDrafted','Rnd','Pick','Tm','Pos','year_drafted','Team','Rookie_vs_PosAvg','wAV_norm','DrAV_norm',
    'rookie_norm','career_norm','delta_norm'
]

# Columns from Block V: PFF Scores
pff_score_cols = [
    'grade_rookie_season', 'Career_Avg_Grade',
    'Average_Grades_per_position_in_Year'
]

# Columns from Block VI: PFF Scores of Previous Season
prev_season_pff_cols = [
    'prev_year', 'Record', 'PF', 'PA', 'OVER', 'OFF', 'PASS',
    'PBLK', 'RECV', 'RUN', 'RBLK', 'DEF', 'RDEF', 'TACK',
    'PRSH', 'COV', 'SPEC', 'Year'
]

# Combine all columns to drop
columns_to_drop = nfl_total_cols + pff_score_cols + prev_season_pff_cols

# Drop safely only if column exists
wr_tes = wr_tes.drop(columns=[col for col in columns_to_drop if col in wr_tes.columns])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Player          46 non-null     object 
 1   Pos             46 non-null     object 
 2   Age             46 non-null     float64
 3   College/Univ    46 non-null     object 
 4   Team            46 non-null     object 
 5   Total_Games     46 non-null     float64
 6   Total_Rec       46 non-null     float64
 7   Total_Rec_Yds   46 non-null     float64
 8   Total_Rec_TD    46 non-null     float64
 9   Rec_Y_R         46 non-null     float64
 10  Rec_Y_G         46 non-null     float64
 11  Total_Rush_Att  46 non-null     float64
 12  Total_Rush_Yds  46 non-null     float64
 13  Total_Rush_TD   46 non-null     float64
 14  Rush_Y_A        34 non-null     float64
 15  Rush_Y_G        46 non-null     float64
 16  Total_Plays     46 non-null     float64
 17  Total_Yds       46 non-null     float

In [287]:
wrs_tes_prospects = wr_tes.copy()

In [288]:
wrs_tes_prospects['Rush_Y_A'] = pd.to_numeric(wrs_tes_prospects['Rush_Y_A'], errors='coerce').fillna(0)
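
The `convert_height` helper above turns strings such as `6-2` into inches. As a slightly more defensive sketch (the name `height_to_inches` is hypothetical and not part of the notebook), the version below also passes through values that are already numeric and returns NaN for malformed input, so that one helper could serve every position table.

```python
import math

def height_to_inches(ht):
    """Convert a 'feet-inches' string such as '6-2' to total inches."""
    # Already numeric (including NaN): return unchanged as a float.
    if isinstance(ht, (int, float)) and not isinstance(ht, bool):
        return float(ht)
    if isinstance(ht, str):
        parts = ht.strip().split("-")
        # Accept only exactly two all-digit parts, e.g. "5-10".
        if len(parts) == 2 and all(p.strip().isdigit() for p in parts):
            feet, inches = (int(p) for p in parts)
            return float(feet * 12 + inches)
    # Anything else (None, malformed strings) is treated as missing.
    return math.nan

print(height_to_inches("6-2"))
print(height_to_inches("5-10"))
print(height_to_inches("unknown"))
```

It could be applied with `wr_tes['Ht'].apply(height_to_inches)`, in the same way as the original helper.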

## OL Data Prep

In [289]:
ol[ol['Age'].isna() ]

ol.loc[ol['Player'] == 'Garrett Dellinger', 'Age'] = 23
ol.loc[ol['Player'] == 'Luke Newman', 'Age'] = 23
ol.loc[ol['Player'] == 'Connor Colby', 'Age'] = 22
ol.loc[ol['Player'] == 'John Williams', 'Age'] = 23
ol.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 32 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Player          41 non-null     object 
 1   Pos             41 non-null     object 
 2   Age             41 non-null     float64
 3   College/Univ    41 non-null     object 
 4   Team            41 non-null     object 
 5   Total_Games     41 non-null     float64
 6   BLK             41 non-null     float64
 7   RBLK_x          41 non-null     float64
 8   PBLK_x          41 non-null     float64
 9   OFF_x           41 non-null     float64
 10  SK              41 non-null     float64
 11  HIT             41 non-null     float64
 12  HUR             41 non-null     float64
 13  PR              41 non-null     float64
 14  EFF             41 non-null     float64
 15  PEN             41 non-null     float64
 16  LT              41 non-null     float64
 17  LG              41 non-null     float

In [290]:
def convert_height(ht):
    if isinstance(ht, str) and '-' in ht:
        feet, inches = ht.split('-')
        return int(feet) * 12 + int(inches)
    return ht  # keep as is if already in inches or invalid

ol['Ht'] = ol['Ht'].apply(convert_height)

# 40-Yard Dash Rating for OL (lower is better)
def rate_ol_40yd(time):
    if pd.isna(time): return 0
    elif time <= 4.71: return 4
    elif time <= 5.26: return 3
    elif time <= 5.55: return 2
    elif time <= 5.65: return 1
    else: return 0

# Vertical Jump Rating for OL (higher is better)
def rate_ol_vertical(jump):
    if pd.isna(jump): return 0
    elif jump >= 38.5: return 4
    elif jump >= 29: return 3
    elif jump >= 25: return 2
    elif jump > 0: return 1
    else: return 0

# Broad Jump Rating for OL (higher is better)
def rate_ol_broad_jump(jump):
    if pd.isna(jump): return 0
    elif jump >= 117: return 4
    elif jump >= 103.3: return 3
    elif jump >= 88: return 2
    elif jump > 0: return 1
    else: return 0

# 3-Cone Drill Rating for OL (lower is better)
def rate_ol_3cone(time):
    if pd.isna(time): return 0
    elif time <= 7.06: return 4
    elif time <= 7.80: return 3
    elif time <= 8.3: return 2
    elif time <= 8.4: return 1
    else: return 0

# Shuttle Drill Rating for OL (lower is better)
def rate_ol_shuttle(time):
    if pd.isna(time): return 0
    elif time <= 4.27: return 4
    elif time <= 4.74: return 3
    elif time <= 5.38: return 2
    elif time <= 5.5: return 1
    else: return 0

# Bench Press Rating for OL (higher is better)
def rate_ol_bench(reps):
    if pd.isna(reps): return 0
    elif reps >= 39: return 4
    elif reps >= 26: return 3
    elif reps >= 12: return 2
    elif reps > 0: return 1
    else: return 0

# Apply all ratings to the OL DataFrame, replacing raw drill results with 0-4 ratings
ol['40yd'] = ol['40yd'].apply(rate_ol_40yd)
ol['Vertical'] = ol['Vertical'].apply(rate_ol_vertical)
ol['Broad Jump'] = ol['Broad Jump'].apply(rate_ol_broad_jump)
ol['3Cone'] = ol['3Cone'].apply(rate_ol_3cone)
ol['Shuttle'] = ol['Shuttle'].apply(rate_ol_shuttle)
ol['Bench'] = ol['Bench'].apply(rate_ol_bench)


In [291]:
ol_prospects = ol.copy()
ol_prospects = ol_prospects.drop(columns=[
    'Team','Pos'
])
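
The drill-rating functions above all follow one pattern: a missing value rates 0, and otherwise the result is compared against a fixed list of cutoffs. As a sketch of a table-driven alternative (the names `rate_by_thresholds`, `rate_ol`, and `OL_THRESHOLDS` are hypothetical), the cutoffs below are copied from the OL functions in this section, and the logic is intended to return the same ratings.

```python
import math

# Per drill: (cutoffs for ratings 4, 3, 2), limit for rating 1, lower_is_better.
# Values are copied from the OL rating functions above.
OL_THRESHOLDS = {
    "40yd":       ((4.71, 5.26, 5.55), 5.65, True),
    "Vertical":   ((38.5, 29, 25), 0, False),
    "Broad Jump": ((117, 103.3, 88), 0, False),
    "3Cone":      ((7.06, 7.80, 8.3), 8.4, True),
    "Shuttle":    ((4.27, 4.74, 5.38), 5.5, True),
    "Bench":      ((39, 26, 12), 0, False),
}

def rate_by_thresholds(value, cutoffs, limit, lower_is_better):
    """Map a drill result to a 0-4 rating; missing values rate 0."""
    if value is None or math.isnan(float(value)):
        return 0
    for rating, cutoff in zip((4, 3, 2), cutoffs):
        if (value <= cutoff) if lower_is_better else (value >= cutoff):
            return rating
    # Rating 1: at or under the slowest accepted time, or any positive result.
    if (value <= limit) if lower_is_better else (value > limit):
        return 1
    return 0

def rate_ol(drill, value):
    cutoffs, limit, lower_is_better = OL_THRESHOLDS[drill]
    return rate_by_thresholds(value, cutoffs, limit, lower_is_better)

print(rate_ol("40yd", 5.0))
print(rate_ol("Bench", 30))
print(rate_ol("Vertical", float("nan")))
```

With this layout, the DL, LB, and DB ratings would only need their own threshold tables rather than six more functions each.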

## Defense Data Prep and Subgroups Creation

In [292]:
defe[defe['Age'].isna() ]

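| | Player | Pos_x | Age | College/Univ | COLLEGE | Total_Solo | Total_Ast | Total_Tackles | Total_Sack | Total_Sack_Yds | ... | Total_FF | Seasons_Played | Ht | Wt | 40yd | Vertical | Bench | Broad Jump | 3Cone | Shuttle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Donte Kent | CB | NaN | Central Michigan | CMU (5) | 170.0 | 63.0 | 233.0 | 1.5 | 22.0 | ... | 2.0 | 5.0 | 5-10 | 189.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | Korie Black | CB | NaN | Oklahoma St. | OKST (4) | 71.0 | 26.0 | 97.0 | 0.0 | 0.0 | ... | 2.0 | 4.0 | 6-0 | 192.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | Kobee Minor | DB | NaN | Memphis | IU (1) / MEM (1) / TTU (1) | 57.0 | 21.0 | 78.0 | 2.0 | 15.0 | ... | 2.0 | 3.0 | 5-11 | 188.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 28 | Ahmed Hassanein | DE | NaN | Boise St. | BOIS (3) | 66.0 | 49.0 | 115.0 | 24.0 | 171.0 | ... | 3.0 | 3.0 | 6-2 | 267.0 | 4.77 | 32.5 | 27.0 | 113.0 | NaN | NaN |
| 33 | Collin Oliver | DE | NaN | Oklahoma St. | OKST (3) | 84.0 | 46.0 | 130.0 | 22.5 | 163.0 | ... | 5.0 | 3.0 | 6-2 | 240.0 | 4.56 | 39.0 | NaN | 126.0 | NaN | NaN |
| 51 | Shemar Turner | DE | NaN | Texas A&M | TA&M (4) | 54.0 | 61.0 | 115.0 | 10.0 | 70.0 | ... | 3.0 | 4.0 | 6-3 | 290.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 52 | Tyler Baron | DE | NaN | Miami (FL) | MIA (1) / TENN (4) | 68.0 | 72.0 | 140.0 | 19.0 | 119.0 | ... | 2.0 | 5.0 | 6-5 | 258.0 | 4.62 | 35.5 | 19.0 | 121.0 | NaN | NaN |
| 55 | Kyonte Hamilton | DL | NaN | Rutgers | RUTG (4) | 39.0 | 70.0 | 109.0 | 5.0 | 37.0 | ... | 1.0 | 4.0 | 6-3 | 300.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 77 | Tommy Akingbesote | DT | NaN | Maryland | MD (3) | 41.0 | 33.0 | 74.0 | 4.0 | 26.0 | ... | 1.0 | 3.0 | 6-4 | 306.0 | 5.09 | 28.0 | NaN | 103.0 | NaN | NaN |
| 78 | Tonka Hemingway | DT | NaN | South Carolina | SC (4) | 66.0 | 40.0 | 106.0 | 9.5 | 69.0 | ... | 2.0 | 4.0 | 6-3 | 284.0 | NaN | 32.0 | NaN | 112.0 | 7.36 | 4.48 |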


In [293]:
defe.loc[defe['Player'] == 'Donte Kent', 'Age'] = 23
defe.loc[defe['Player'] == 'Korie Black', 'Age'] = 22
defe.loc[defe['Player'] == 'Kobee Minor', 'Age'] = 22
defe.loc[defe['Player'] == 'Ahmed Hassanein', 'Age'] = 22
defe.loc[defe['Player'] == 'Collin Oliver', 'Age'] = 22
defe.loc[defe['Player'] == 'Shemar Turner', 'Age'] = 22
defe.loc[defe['Player'] == 'Tyler Baron', 'Age'] = 23
defe.loc[defe['Player'] == 'Kyonte Hamilton', 'Age'] = 22
defe.loc[defe['Player'] == 'Tommy Akingbesote', 'Age'] = 22
defe.loc[defe['Player'] == 'Tonka Hemingway', 'Age'] = 23
defe.loc[defe['Player'] == 'Yahya Black', 'Age'] = 23
defe.loc[defe['Player'] == 'Cody Lindenberg', 'Age'] = 23
defe.loc[defe['Player'] == 'Francisco Mauigoa', 'Age'] = 22
defe.loc[defe['Player'] == 'Nick Martin', 'Age'] = 22
defe.loc[defe['Player'] == 'Ruben Hyppolite', 'Age'] = 23
defe.loc[defe['Player'] == 'Shemar James', 'Age'] = 20
defe.loc[defe['Player'] == 'Teddye Buchanan', 'Age'] = 22
defe.loc[defe['Player'] == 'R.J. Mickens', 'Age'] = 23
defe.loc[defe['Player'] == 'Trikweze Bridges', 'Age'] = 24
defe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126 entries, 0 to 125
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Player          126 non-null    object 
 1   Pos_x           126 non-null    object 
 2   Age             126 non-null    float64
 3   College/Univ    126 non-null    object 
 4   COLLEGE         126 non-null    object 
 5   Total_Solo      126 non-null    float64
 6   Total_Ast       126 non-null    float64
 7   Total_Tackles   126 non-null    float64
 8   Total_Sack      126 non-null    float64
 9   Total_Sack_Yds  126 non-null    float64
 10  Total_PD        126 non-null    float64
 11  Total_Int       126 non-null    float64
 12  Total_Int_Yds   126 non-null    float64
 13  Total_Int_LNG   126 non-null    float64
 14  Total_Int_TD    126 non-null    float64
 15  Total_FF        126 non-null    float64
 16  Seasons_Played  126 non-null    float64
 17  Ht              126 non-null    obj

In [294]:
defe[defe['Age'].isna()]
def convert_height(ht):
    if isinstance(ht, str) and '-' in ht:
        feet, inches = ht.split('-')
        return int(feet) * 12 + int(inches)
    return None  # fallback in case format is bad

defe['Ht'] = defe['Ht'].apply(convert_height)

In [295]:
defe.drop(columns=[
    # Block II – NFL Total Statistics
    'To', 'AP1', 'PB', 'St', 'wAV', 'DrAV', 'G', 'Cmp_Pass', 'Att_Pass', 'Yds_Pass', 'TD_Pass', 'Int_Pass',
    'Att_Rush', 'Yds_Rush', 'TD_Rush', 'Rec', 'Yds_Rec', 'TD_Rec', 'Tackles', 'Int_Def', 'Sk', 'wAV/G',
    'Rnd','Pick','Tm','TeamDrafted','Rookie_vs_PosAvg','wAV_norm','DrAV_norm','rookie_norm','career_norm','delta_norm','grades_offense',
    'year_drafted','COLLEGE','position',  # trailing comma needed: without it 'position' fuses with the next string

    # Block V – PFF Scores
    'grade_rookie_season', 'Career_Avg_Grade', 'Average_Grades_per_position_in_Year',

    # Block VI – PFF Previous Season Grades
    'prev_year', 'Record', 'PF', 'PA', 'OVER', 'OFF', 'PASS', 'PBLK', 'RECV', 'RUN', 'RBLK',
    'DEF', 'RDEF', 'TACK', 'PRSH', 'COV', 'SPEC', 'Year'
], inplace=True, errors='ignore')


In [296]:
defe_success = defe.dropna(subset=[
    'Total_Solo', 'Total_Ast', 'Total_Tackles', 'Total_Sack', 'Total_Sack_Yds',
    'Total_PD', 'Total_Int', 'Total_Int_Yds', 'Total_Int_LNG', 'Total_Int_TD',
    'Total_FF', 'Seasons_Played'
])


In [297]:
# DL Group: Defensive Linemen (DT,DE,NT,DL)
dl_prospects = defe_success[defe_success['Pos_x'].isin(['DT', 'DE', 'NT', 'DL'])].copy()

# LB Group: Linebackers (LB,OLB,ILB)
lb_prospects = defe_success[defe_success['Pos_x'].isin(['LB', 'OLB', 'ILB'])].copy()

# S Group: Secondary / Defensive Backs & Safeties (SAF, S, FS, CB, DB)
s_prospects = defe_success[defe_success['Pos_x'].isin(['SAF', 'S', 'FS', 'CB', 'DB'])].copy()


## Subgroup DLs Data Prep

In [298]:
# 40-Yard Dash — lower is better
def rate_dl_40yd(time):
    if pd.isna(time): return 0
    elif time <= 4.41: return 4
    elif time <= 4.82: return 3
    elif time <= 5.27: return 2
    elif time <= 5.5: return 1
    else: return 0

# Vertical Jump — higher is better
def rate_dl_vertical(jump):
    if pd.isna(jump): return 0
    elif jump >= 38.2: return 4
    elif jump >= 30.9: return 3
    elif jump >= 22.5: return 2
    elif jump > 0: return 1
    else: return 0

# Broad Jump — higher is better
def rate_dl_broad_jump(jump):
    if pd.isna(jump): return 0
    elif jump >= 134: return 4
    elif jump >= 116.6: return 3
    elif jump >= 96: return 2
    elif jump > 0: return 1
    else: return 0

# 3-Cone Drill — lower is better
def rate_dl_3cone(time):
    if pd.isna(time): return 0
    elif time <= 7.09: return 4
    elif time <= 7.25: return 3
    elif time <= 7.48: return 2
    elif time <= 7.5: return 1
    else: return 0

# Shuttle — lower is better
def rate_dl_shuttle(time):
    if pd.isna(time): return 0
    elif time <= 4.1: return 4
    elif time <= 4.52: return 3
    elif time <= 5.01: return 2
    elif time <= 5.5: return 1
    else: return 0

# Bench Press — higher is better
def rate_dl_bench(reps):
    if pd.isna(reps): return 0
    elif reps >= 42: return 4
    elif reps >= 26: return 3
    elif reps >= 7: return 2
    elif reps > 0: return 1
    else: return 0

# Apply all ratings to the DL DataFrame, replacing raw drill results with 0-4 ratings
dl_prospects['40yd'] = dl_prospects['40yd'].apply(rate_dl_40yd)
dl_prospects['Vertical'] = dl_prospects['Vertical'].apply(rate_dl_vertical)
dl_prospects['Broad Jump'] = dl_prospects['Broad Jump'].apply(rate_dl_broad_jump)
dl_prospects['3Cone'] = dl_prospects['3Cone'].apply(rate_dl_3cone)
dl_prospects['Shuttle'] = dl_prospects['Shuttle'].apply(rate_dl_shuttle)
dl_prospects['Bench'] = dl_prospects['Bench'].apply(rate_dl_bench)

In [299]:
dl_prospects = dl_prospects.copy()

In [300]:
dl_prospects.drop(columns=['Pos_x'], inplace=True)

## Subgroup LBs Data Prep

In [301]:
lb_prospects['Pos_x'].value_counts()

| Pos_x | count |
|---|---|
| LB | 20 |


In [302]:
# 40-Yard Dash — lower is better
def rate_lb_40yd(time):
    if pd.isna(time): return 0
    elif time <= 4.4: return 4
    elif time <= 4.6: return 3
    elif time <= 5.0: return 2
    elif time <= 5.5: return 1
    else: return 0

# Vertical Jump — higher is better
def rate_lb_vertical(jump):
    if pd.isna(jump): return 0
    elif jump >= 42.5: return 4
    elif jump >= 34: return 3
    elif jump >= 30: return 2
    elif jump > 0: return 1
    else: return 0

# Broad Jump — higher is better
def rate_lb_broad_jump(jump):
    if pd.isna(jump): return 0
    elif jump >= 130: return 4
    elif jump >= 116: return 3
    elif jump >= 105: return 2
    elif jump > 0: return 1
    else: return 0

# 3-Cone Drill — lower is better
def rate_lb_3cone(time):
    if pd.isna(time): return 0
    elif time <= 6.9: return 4
    elif time <= 7.13: return 3
    elif time <= 7.3: return 2
    elif time <= 7.5: return 1
    else: return 0

# Shuttle — lower is better
def rate_lb_shuttle(time):
    if pd.isna(time): return 0
    elif time <= 4.01: return 4
    elif time <= 4.30: return 3
    elif time <= 4.54: return 2
    elif time <= 5.5: return 1
    else: return 0

# Bench Press — higher is better
def rate_lb_bench(reps):
    if pd.isna(reps): return 0
    elif reps >= 31: return 4
    elif reps >= 20.6: return 3
    elif reps >= 12.5: return 2
    elif reps > 0: return 1
    else: return 0

# Apply all ratings to the LB DataFrame, replacing raw drill results with 0-4 ratings
lb_prospects['40yd'] = lb_prospects['40yd'].apply(rate_lb_40yd)
lb_prospects['Vertical'] = lb_prospects['Vertical'].apply(rate_lb_vertical)
lb_prospects['Broad Jump'] = lb_prospects['Broad Jump'].apply(rate_lb_broad_jump)
lb_prospects['3Cone'] = lb_prospects['3Cone'].apply(rate_lb_3cone)
lb_prospects['Shuttle'] = lb_prospects['Shuttle'].apply(rate_lb_shuttle)
lb_prospects['Bench'] = lb_prospects['Bench'].apply(rate_lb_bench)

In [303]:
lb_prospects = lb_prospects.copy()

In [304]:
lb_prospects.drop(columns=['Pos_x'], inplace=True)

In [305]:
lb_prospects.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20 entries, 86 to 105
Data columns (total 23 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Player          20 non-null     object 
 1   Age             20 non-null     float64
 2   College/Univ    20 non-null     object 
 3   Total_Solo      20 non-null     float64
 4   Total_Ast       20 non-null     float64
 5   Total_Tackles   20 non-null     float64
 6   Total_Sack      20 non-null     float64
 7   Total_Sack_Yds  20 non-null     float64
 8   Total_PD        20 non-null     float64
 9   Total_Int       20 non-null     float64
 10  Total_Int_Yds   20 non-null     float64
 11  Total_Int_LNG   20 non-null     float64
 12  Total_Int_TD    20 non-null     float64
 13  Total_FF        20 non-null     float64
 14  Seasons_Played  20 non-null     float64
 15  Ht              20 non-null     int64  
 16  Wt              20 non-null     float64
 17  40yd            20 non-null     int64  


## Subgroup S Data Prep

In [306]:
# 40-Yard Dash — lower is better
def rate_db_40yd(time):
    if pd.isna(time): return 0
    elif time <= 4.28: return 4
    elif time <= 4.52: return 3
    elif time <= 4.79: return 2
    elif time <= 5.5: return 1
    else: return 0

# Vertical Jump — higher is better
def rate_db_vertical(jump):
    if pd.isna(jump): return 0
    elif jump >= 43: return 4
    elif jump >= 35.7: return 3
    elif jump >= 29.5: return 2
    elif jump > 15 : return 1
    else: return 0

# Broad Jump — higher is better
def rate_db_broad_jump(jump):
    if pd.isna(jump): return 0
    elif jump >= 133: return 4
    elif jump >= 118: return 3
    elif jump >= 106: return 2
    elif jump > 0: return 1
    else: return 0

# 3-Cone Drill — lower is better
def rate_db_3cone(time):
    if pd.isna(time): return 0
    elif time <= 6.85: return 4
    elif time <= 7.00: return 3
    elif time <= 7.29: return 2
    elif time <= 7.5: return 1
    else: return 0

# Shuttle Drill — lower is better
def rate_db_shuttle(time):
    if pd.isna(time): return 0
    elif time <= 3.89: return 4
    elif time <= 4.18: return 3
    elif time <= 4.56: return 2
    elif time <= 5.5: return 1
    else: return 0

# Bench Press — higher is better
def rate_db_bench(reps):
    if pd.isna(reps): return 0
    elif reps >= 22: return 4
    elif reps >= 14.0: return 3
    elif reps >= 4: return 2
    elif reps > 0: return 1
    else: return 0

# Apply all ratings to the S (secondary) DataFrame, replacing raw drill results with 0-4 ratings
s_prospects['40yd'] = s_prospects['40yd'].apply(rate_db_40yd)
s_prospects['Vertical'] = s_prospects['Vertical'].apply(rate_db_vertical)
s_prospects['Broad Jump'] = s_prospects['Broad Jump'].apply(rate_db_broad_jump)
s_prospects['3Cone'] = s_prospects['3Cone'].apply(rate_db_3cone)
s_prospects['Shuttle'] = s_prospects['Shuttle'].apply(rate_db_shuttle)
s_prospects['Bench'] = s_prospects['Bench'].apply(rate_db_bench)

In [307]:
s_prospects = s_prospects.copy()

s_prospects.drop(columns=['Pos_x'], inplace=True)

In [308]:
s_prospects.info()

<class 'pandas.core.frame.DataFrame'>
Index: 48 entries, 0 to 125
Data columns (total 23 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Player          48 non-null     object 
 1   Age             48 non-null     float64
 2   College/Univ    48 non-null     object 
 3   Total_Solo      48 non-null     float64
 4   Total_Ast       48 non-null     float64
 5   Total_Tackles   48 non-null     float64
 6   Total_Sack      48 non-null     float64
 7   Total_Sack_Yds  48 non-null     float64
 8   Total_PD        48 non-null     float64
 9   Total_Int       48 non-null     float64
 10  Total_Int_Yds   48 non-null     float64
 11  Total_Int_LNG   48 non-null     float64
 12  Total_Int_TD    48 non-null     float64
 13  Total_FF        48 non-null     float64
 14  Seasons_Played  48 non-null     float64
 15  Ht              48 non-null     int64  
 16  Wt              48 non-null     float64
 17  40yd            48 non-null     int64  
 

# 2025 Draft - Chosen Models

## Explanation of Chosen Models

Throughout the project, a variety of machine learning models were trained and evaluated for each player position group (QBs, RBs, OL), with the goal of identifying the best-performing model per group. The selection process was driven primarily by evaluation metrics — notably precision and F1-score — which are crucial for balancing false positives and overall classification quality in the context of scouting talent. Each position group underwent a comparison of models including Random Forest, XGBoost, Neural Network, and a tuned Decision Tree. After training, their performance was validated on holdout test sets, and metrics were carefully reviewed to identify the most reliable model for predicting success.

Once the optimal model was selected for each group based on those metrics, we manually tested it on the new 2025 draft prospect data. This wasn't a blind statistical exercise: we also used strategic information about the Baltimore Ravens’ current roster composition, depth chart gaps, and long-term positional needs to assess how well the model aligned with real-world priorities. For example, if the Ravens were thin at interior offensive line or lacked explosiveness in the secondary, we would pay closer attention to whether our OL or DB models correctly surfaced high-potential talent that fit those needs. In the same way, we chose not to run a QB model because the team is not currently looking for a QB. This integration of quantitative modeling and business context ensured that our final outputs were both analytically rigorous and practically valuable for NFL team decision-making.
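
The metric-driven part of that selection can be sketched as follows. This is an illustrative example with made-up labels and predictions, not the notebook's results; how precision and F1 were weighed against each other is not stated in the text, so ranking by F1 with precision as the tie-breaker is an assumption here.

```python
from sklearn.metrics import f1_score, precision_score

# Hypothetical test labels and model predictions (illustration only).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "Decision Tree": [1, 1, 1, 1, 1, 0, 1, 0],
    "Random Forest": [1, 0, 1, 0, 0, 0, 1, 0],
    "kNN":           [0, 1, 1, 0, 0, 1, 1, 0],
}

# Compute precision and F1 for the positive class of every candidate model.
scores = {
    name: {
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }
    for name, y_pred in predictions.items()
}

# Assumed rule: highest F1 wins, precision breaks ties.
best_name = max(scores, key=lambda name: (scores[name]["f1"], scores[name]["precision"]))

for name, s in scores.items():
    print(f"{name:<14} precision={s['precision']:.4f}  f1={s['f1']:.4f}")
print("Selected:", best_name)
```

Recording the rule in code makes the choice reproducible before the roster-based review described above is applied.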

## WR & TE Model

In [411]:

# Step 1: Copy test data
wrs_tes_test = wrs_tes_prospects.copy()

# Step 2: Extract and remove 'Player' column
if 'Player' in wrs_tes_test.columns:
    player_names = wrs_tes_test['Player']
    wrs_tes_test = wrs_tes_test.drop(columns=['Player'])
else:
    player_names = pd.Series([f"Player_{i}" for i in range(len(wrs_tes_test))])

# Step 3: Encode 'College/Univ' using same encoder as in training
le = LabelEncoder()
le.fit(X['College/Univ'])  # Fit on training data

# Ensure consistent label encoding even with unseen universities
le_classes = list(le.classes_)
for cat in wrs_tes_test['College/Univ'].unique():
    if cat not in le_classes:
        le_classes.append(cat)

le.classes_ = np.array(le_classes)
wrs_tes_test['College/Univ'] = le.transform(wrs_tes_test['College/Univ'])

# Step 4: Align columns with training data
for col in X.columns:
    if col not in wrs_tes_test.columns:
        wrs_tes_test[col] = np.nan

wrs_tes_test = wrs_tes_test[X.columns]

# Step 5: Replace inf and fill missing values using training median
wrs_tes_test = wrs_tes_test.replace([np.inf, -np.inf], np.nan).fillna(X.median())

# Step 6: Apply RFE
wrs_tes_rfe = rfe_selector.transform(wrs_tes_test)

# Step 7: Predict
y_pred_wrtes = wrte2_rf_model.predict(wrs_tes_rfe)

# Step 8: Combine with player names and sort
wrs_tes_results = pd.DataFrame({
    'Player': player_names,
    'Predicted_Success_2_Score': y_pred_wrtes
})

wrs_tes_results_sorted = wrs_tes_results.sort_values(by='Predicted_Success_2_Score', ascending=False)

wrs_tes_results_sorted

Unnamed: 0,Player,Predicted_Success_2_Score
10,Moliki Matavao,0.493913
9,Mitchell Evans,0.493913
12,Robbie Ouzts,0.493913
0,Caleb Lohner,0.493744
15,Tyler Warren,0.493744
2,Elijah Arroyo,0.49134
11,Oronde Gadsden,0.49134
13,Terrance Ferguson,0.49134
3,Gavin Bartholomew,0.49134
8,Mason Taylor,0.49134
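
The cell above extends the encoder with any college that did not appear in training, so each unseen college receives a fresh integer the model has never seen. One alternative, sketched here under the assumption that the original string labels from training are still available, maps every unseen college to a single reserved code. The helper names and the sample colleges are hypothetical.

```python
def fit_label_map(train_values):
    """Map each distinct training label to an integer code (sorted, as LabelEncoder does)."""
    return {label: code for code, label in enumerate(sorted(set(train_values)))}


def encode_with_unknown(values, label_map, unknown_code=-1):
    """Encode values with label_map; anything unseen gets unknown_code."""
    return [label_map.get(value, unknown_code) for value in values]


# Hypothetical sample data, not taken from the notebook's datasets.
train_colleges = ["Alabama", "Ohio State", "Michigan", "Alabama"]
prospect_colleges = ["Michigan", "Bowling Green", "Alabama"]

label_map = fit_label_map(train_colleges)
encoded = encode_with_unknown(prospect_colleges, label_map)
print(encoded)  # [1, -1, 0]
```

With tree-based models a single "unknown" code keeps all unseen colleges on the same side of any split, which is easier to reason about than a set of arbitrary new integers.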


## Subgroup DLs Model

In [412]:
# Features & Target
X = dl_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(dl_success_1_clean.median())
y = dl_success_1_clean['Success_1']

# RFE
rfe_selector = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)


# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

In [413]:
# Step 1: Copy and prepare new data
dl_test = dl_prospects.copy()

# Step 2: Extract and drop 'Player' column
if 'Player' in dl_test.columns:
    player_names = dl_test['Player']
    dl_test = dl_test.drop(columns=['Player'])
else:
    player_names = pd.Series([f"Player_{i}" for i in range(len(dl_test))])

# Step 3: Label encode categorical features (e.g., 'College/Univ')
if 'College/Univ' in dl_test.columns:
    le = LabelEncoder()
    le.fit(X['College/Univ'])  # from training data
    # Handle unseen labels
    le_classes = list(le.classes_)
    for cat in dl_test['College/Univ'].unique():
        if cat not in le_classes:
            le_classes.append(cat)
    le.classes_ = np.array(le_classes)
    dl_test['College/Univ'] = le.transform(dl_test['College/Univ'])

# Step 4: Match column structure to training features
for col in X.columns:
    if col not in dl_test.columns:
        dl_test[col] = np.nan
dl_test = dl_test[X.columns]  # reorder

# Step 5: Replace inf and fill missing values with training medians
dl_test = dl_test.replace([np.inf, -np.inf], np.nan).fillna(X.median())

# Step 6: Apply RFE transformation
dl_test_rfe = rfe_selector.transform(dl_test)

# Step 7: Apply StandardScaler
dl_test_scaled = scaler.transform(dl_test_rfe)

# Step 8: Predict with Neural Network
y_dl_proba = dl_nn_model.predict(dl_test_scaled).flatten()
y_dl_pred = (y_dl_proba >= 0.5).astype(int)

# Step 9: Create result DataFrame
dl_results = pd.DataFrame({
    'Player': player_names,
    'Probability_Success_1': y_dl_proba,
    'Predicted_Label': y_dl_pred
})

# Step 10: Sort and display
dl_results_sorted = dl_results.sort_values(by='Probability_Success_1', ascending=False)

dl_results_sorted

2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step


Unnamed: 0,Player,Probability_Success_1,Predicted_Label
81,Tyleik Williams,0.995861,1
66,Deone Walker,0.995562,1
73,Kenneth Grant,0.995108,1
62,Alfred Collins,0.994737,1
69,Jamaree Caldwell,0.994484,1
64,CJ West,0.992438,1
71,JJ Pegues,0.991067,1
67,Derrick Harmon,0.990978,1
72,Joshua Farmer,0.990471,1
59,Warren Brinson,0.987976,1


## Subgroup LBs Model

In [414]:
# 1️⃣ Make a copy of the prospects data
prospects = lb_prospects.copy()

# 2️⃣ Save and remove the 'Player' column
players = prospects['Player']
prospects = prospects.drop(columns=['Player'])

# 3️⃣ Label encode 'College/Univ'
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(X['College/Univ'])  # Use the same encoder as training
# Handle unseen colleges
le_classes = list(le.classes_)
for cat in prospects['College/Univ'].unique():
    if cat not in le_classes:
        le_classes.append(cat)
le.classes_ = np.array(le_classes)
prospects['College/Univ'] = le.transform(prospects['College/Univ'])

# 4️⃣ Replace inf and fill missing values
prospects = prospects.replace([np.inf, -np.inf], np.nan).fillna(X.median())

# 5️⃣ Reduce to same features used in RFE
prospects_rfe = rfe_selector.transform(prospects)

# 6️⃣ Scale with previously fitted scaler
prospects_scaled = scaler.transform(prospects_rfe)

# 7️⃣ Predict probabilities and class
proba = lb1_nn_model.predict(prospects_scaled).flatten()
predicted_class = (proba >= 0.5).astype(int)

# 8️⃣ Compile result DataFrame

results = pd.DataFrame({
    'Player': players,
    'Predicted_Success': predicted_class,
    'Probability': proba
}).sort_values(by='Probability', ascending=False)

# 9️⃣ Display results
results.reset_index(drop=True, inplace=True)
results


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step


Unnamed: 0,Player,Predicted_Success,Probability
0,Francisco Mauigoa,1,0.957295
1,Barrett Carter,1,0.936657
2,Cody Simon,1,0.926792
3,Danny Stutsman,1,0.885604
4,Smael Mondon,1,0.811111
5,Jihaad Campbell,1,0.764434
6,Jack Kiser,1,0.745407
7,Demetrius Knight,1,0.7437
8,Carson Schwesinger,1,0.73241
9,Shemar James,1,0.691752


## Subgroup S Model

In [415]:
# 1️⃣ Copy the data
prospects = s_prospects.copy()

# 2️⃣ Drop the 'Player' column if present
if 'Player' in prospects.columns:
    prospects = prospects.drop(columns=['Player'])

# 3️⃣ Label encode 'College/Univ' based on training data
if 'College/Univ' in prospects.columns:
    le = LabelEncoder()
    le.fit(X['College/Univ'])  # fitted on training data
    le_classes = list(le.classes_)

    # Add unseen categories to the label encoder
    for cat in prospects['College/Univ'].unique():
        if cat not in le_classes:
            le_classes.append(cat)
    le.classes_ = np.array(le_classes)

    # Transform
    prospects['College/Univ'] = le.transform(prospects['College/Univ'])

# 4️⃣ Clean data: replace infs and fill NAs with training medians
prospects = prospects.replace([np.inf, -np.inf], np.nan).fillna(X.median())

# 5️⃣ Keep only expected columns (same as in training)
prospects = prospects[[col for col in X.columns if col in prospects.columns]]

# 6️⃣ Apply RFE transformation
prospects_rfe = rfe.transform(prospects)

# 7️⃣ Predict class
predicted_class = s1_rf.predict(prospects_rfe)

# 8️⃣ Predict probability of success (class 1)
predicted_proba = s1_rf.predict_proba(prospects_rfe)[:, 1]

# 9️⃣ Add results to original s_prospects
s_prospects['Predicted_Success_1'] = predicted_class
s_prospects['Success_1_Probability'] = predicted_proba

In [416]:
# 1️⃣ Create new DataFrame with desired columns
results = s_prospects[['Player', 'Predicted_Success_1', 'Success_1_Probability']].copy()

# 2️⃣ Sort by probability descending
results_sorted = results.sort_values(by='Success_1_Probability', ascending=False)

# 3️⃣ Display the sorted results
results_sorted.reset_index(drop=True, inplace=True)
display(results_sorted)


Unnamed: 0,Player,Predicted_Success_1,Success_1_Probability
0,Benjamin Morrison,1,0.64
1,Maxwell Hairston,1,0.64
2,Jordan Hancock,1,0.59
3,Darien Porter,1,0.56
4,Will Johnson,1,0.55
5,Dan Jackson,1,0.53
6,Travis Hunter,1,0.52
7,Trey Amos,1,0.52
8,Trikweze Bridges,1,0.52
9,Zah Frazier,1,0.51


## OL models
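The OL training cells were moved to the end of the notebook because they reuse variable and model names from earlier sections (for example `X`, `rfe_selector`, and `scaler`); running them last lets the notebook execute without errors.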
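
One way to avoid this kind of name collision, sketched here as an assumption rather than as the notebook's approach, is to keep each position group's fitted objects in its own container. The `PositionArtifacts` class is hypothetical, and the strings below are placeholders standing in for fitted selectors, scalers, and models.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class PositionArtifacts:
    """Everything needed to score prospects for one position group."""
    feature_columns: List[str]
    selector: Optional[Any] = None  # e.g. a fitted RFE object
    scaler: Optional[Any] = None    # e.g. a fitted StandardScaler
    model: Optional[Any] = None     # the chosen estimator


artifacts: Dict[str, PositionArtifacts] = {}

# Placeholder strings stand in for the real fitted objects.
artifacts["DL"] = PositionArtifacts(
    feature_columns=["Ht", "Wt", "40yd"],
    selector="fitted RFE for DL",
    scaler="fitted StandardScaler for DL",
    model="dl_nn_model",
)
artifacts["OL"] = PositionArtifacts(
    feature_columns=["Ht", "Wt", "40yd"],
    scaler="fitted StandardScaler for OL",
    model="ol1_model",
)

# Training the OL models no longer overwrites anything the DL forecast relies on.
print(artifacts["DL"].scaler)
print(artifacts["OL"].scaler)
```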

## OLs Models for Success Metric 1

In [420]:
# Label Encode 'College/Univ'
le = LabelEncoder()
ol_success_1_clean['College/Univ'] = le.fit_transform(ol_success_1_clean['College/Univ'])

### OLs Random Forest (SM1)

In [421]:
# Features & Target
X = ol_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(ol_success_1_clean.median())
y = ol_success_1_clean['Success_1']

# RFE with Random Forest (top 10 features)
rfe_selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

# Train Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Evaluate
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_rf, "Random Forest Classifier + RFE (OL)")



--- Random Forest Classifier + RFE (OL) Evaluation ---
Accuracy : 0.5600
Precision: 0.5000
Recall   : 0.3939
F1 Score : 0.4407

Classification Report:
               precision    recall  f1-score   support

           0       0.59      0.69      0.64        42
           1       0.50      0.39      0.44        33

    accuracy                           0.56        75
   macro avg       0.55      0.54      0.54        75
weighted avg       0.55      0.56      0.55        75



✅ Strongest overall accuracy so far (56%)

✅ Best F1 score for class 1 so far (0.44) — better than DT (0.43) and kNN (0.43)

⚖️ More balanced prediction across both classes
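
To make the Random Forest report above easier to audit, the sketch below recomputes the headline metrics from confusion-matrix counts. The counts are inferred from the printed report (support of 42 and 33, precision 0.50, recall 0.39); the notebook does not print them, so treat them as an assumption that happens to reproduce the reported values.

```python
# Confusion-matrix counts inferred from the report (assumption, not notebook output).
tp, fp, fn, tn = 13, 13, 20, 29

precision = tp / (tp + fp)                           # share of predicted successes that were right
recall = tp / (tp + fn)                              # share of actual successes that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"Accuracy : {accuracy:.4f}")   # 0.5600
print(f"Precision: {precision:.4f}")  # 0.5000
print(f"Recall   : {recall:.4f}")     # 0.3939
print(f"F1 Score : {f1:.4f}")         # 0.4407
```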

### OLs XGBoost (SM1)

In [422]:
# Features & Target
X = ol_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(ol_success_1_clean.median())
y = ol_success_1_clean['Success_1']

# RFE
rfe_selector = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, stratify=y, random_state=42
)

# Compute imbalance
class_counts = Counter(y_train)
scale_pos_weight = class_counts[0] / class_counts[1]

# Train XGBoost
ol1_xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    scale_pos_weight=scale_pos_weight,
    random_state=42
)

ol1_xgb_model.fit(X_train, y_train)
y_pred_xgb = ol1_xgb_model.predict(X_test)

# Evaluate
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_xgb, "XGBoost Classifier + RFE (OL)")



--- XGBoost Classifier + RFE (OL) Evaluation ---
Accuracy : 0.6000
Precision: 0.5455
Recall   : 0.5455
F1 Score : 0.5455

Classification Report:
               precision    recall  f1-score   support

           0       0.64      0.64      0.64        42
           1       0.55      0.55      0.55        33

    accuracy                           0.60        75
   macro avg       0.59      0.59      0.59        75
weighted avg       0.60      0.60      0.60        75






✅ Best overall performance so far

✅ Class 1 precision and recall are equal (0.55 each), so the model is not trading one for the other

✅ Highest F1 score (0.55) and accuracy (60%) among all models for OL

Very balanced precision/recall and high stability

📌 XGBoost is now the best model for OL – Success_1

### OLs Neural Network (SM1)

In [423]:
# Features & Target
X = ol_success_1_clean.drop('Success_1', axis=1).replace([np.inf, -np.inf], np.nan).fillna(ol_success_1_clean.median())
y = ol_success_1_clean['Success_1']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Class Weights
counter = Counter(y_train)
total = len(y_train)
class_weight = {
    0: total / (2 * counter[0]),
    1: total / (2 * counter[1])
}

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build Model
ol1_model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

# Compile
ol1_model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Early Stopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train
ol1_model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    class_weight=class_weight,
    callbacks=[early_stop],
    verbose=0
)

# Predict
y_pred_prob = ol1_model.predict(X_test_scaled).flatten()
y_pred_class = (y_pred_prob >= 0.5).astype(int)

# Evaluation
def evaluate_model(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
    print("\nClassification Report:\n", classification_report(y_true, y_pred))

evaluate_model(y_test, y_pred_class, "Keras Neural Network Classifier (OL)")


3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 38ms/step

--- Keras Neural Network Classifier (OL) Evaluation ---
Accuracy : 0.5067
Precision: 0.4444
Recall   : 0.4848
F1 Score : 0.4638

Classification Report:
               precision    recall  f1-score   support

           0       0.56      0.52      0.54        42
           1       0.44      0.48      0.46        33

    accuracy                           0.51        75
   macro avg       0.50      0.50      0.50        75
weighted avg       0.51      0.51      0.51        75



✅ Recall for class 1 is 0.48: higher than Random Forest (0.39) but below XGBoost (0.55)

✅ F1 score = 0.46: second-best after XGBoost (0.55)

⚖️ Balanced output across both classes, but close to chance (macro-average F1 = 0.50)

❌ Accuracy (51%) is lower than XGBoost (60%)
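
The class-weight formula used in the neural network cell, `total / (2 * count)`, and the `scale_pos_weight` ratio used for XGBoost express the same idea. The sketch below shows the relationship with hypothetical training counts, since the notebook does not print the real ones.

```python
from collections import Counter

# Hypothetical training labels: 126 unsuccessful (0) and 99 successful (1) players.
y_train_example = [0] * 126 + [1] * 99

counter = Counter(y_train_example)
total = len(y_train_example)

# Same formula as the neural network cell: each class gets total / (2 * count).
class_weight = {label: total / (2 * count) for label, count in counter.items()}

# Same formula as the XGBoost cell: negatives divided by positives.
scale_pos_weight = counter[0] / counter[1]

# Each class contributes equal total weight to the loss.
weighted_mass = {label: class_weight[label] * counter[label] for label in counter}

print(class_weight)
print(scale_pos_weight)
print(weighted_mass)
```

The ratio `class_weight[1] / class_weight[0]` equals `scale_pos_weight`, so both cells up-weight the minority class by the same relative factor.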

## OL Models for Success Metric 2

In [424]:
# Label Encode 'College/Univ'
le = LabelEncoder()
ol_success_2_clean['College/Univ'] = le.fit_transform(ol_success_2_clean['College/Univ'])

### OLs Random Forest (SM2)

In [425]:
# Features & Target
X = ol_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(ol_success_2_clean.median())
y = ol_success_2_clean['Success_2']

# RFE (top 10 features)
rfe_selector = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# Train Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Evaluate
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_rf, "Random Forest Regressor + RFE (OL)")



--- Random Forest Regressor + RFE (OL) Evaluation ---
MAE  : 0.0999
MSE  : 0.0202
RMSE : 0.1420
R²   : 0.1452


R² = 0.1452 → slightly below Decision Tree (0.1541)

MAE, RMSE very similar to DT

Consistent, but not the top performer

📌 Decision Tree Regressor still leads for OL Success_2

### OLs XGBoost (SM2)

In [426]:
# Features & Target
X = ol_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(ol_success_2_clean.median())
y = ol_success_2_clean['Success_2']

# RFE with Random Forest
rfe_selector = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=10)
X_rfe = rfe_selector.fit_transform(X, y)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_rfe, y, test_size=0.25, random_state=42
)

# XGBoost Regressor
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

param_grid_xgb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

grid_xgb = GridSearchCV(xgb_model, param_grid_xgb, cv=3, scoring='neg_mean_squared_error', n_jobs=-1)
grid_xgb.fit(X_train, y_train)
best_xgb = grid_xgb.best_estimator_

# Predict
y_pred_xgb = best_xgb.predict(X_test)

# Evaluate
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_xgb, "XGBoost Regressor + RFE (OL)")



--- XGBoost Regressor + RFE (OL) Evaluation ---
MAE  : 0.0973
MSE  : 0.0202
RMSE : 0.1423
R²   : 0.1414


Very close to Random Forest (R² = 0.1452) and Decision Tree (R² = 0.1541)

✅ Low MAE and RMSE — consistent performance

❌ Slightly lower R² than Decision Tree → not top performer

📌 Decision Tree Regressor still holds the lead by R²

### OLs Neural Network (SM2)

In [427]:
# Features & Target
X = ol_success_2_clean.drop('Success_2', axis=1).replace([np.inf, -np.inf], np.nan).fillna(ol_success_2_clean.median())
y = ol_success_2_clean['Success_2']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(1)  # Regression output
])

# Compile
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')

# EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train
model.fit(
    X_train_scaled, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    callbacks=[early_stop],
    verbose=0
)

# Predict
y_pred_nn = model.predict(X_test_scaled).flatten()

# Evaluate
def evaluate_regression(y_true, y_pred, model_name="Model"):
    print(f"\n--- {model_name} Evaluation ---")
    print(f"MAE  : {mean_absolute_error(y_true, y_pred):.4f}")
    print(f"MSE  : {mean_squared_error(y_true, y_pred):.4f}")
    print(f"RMSE : {np.sqrt(mean_squared_error(y_true, y_pred)):.4f}")
    print(f"R²   : {r2_score(y_true, y_pred):.4f}")

evaluate_regression(y_test, y_pred_nn, "Keras Neural Network Regressor (OL)")


3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step

--- Keras Neural Network Regressor (OL) Evaluation ---
MAE  : 0.1341
MSE  : 0.0310
RMSE : 0.1761
R²   : -0.3145


✅ Final Summary for OL – Success_2

| Model | R² | Best? |
|---|---|---|
| Decision Tree | 0.1541 | ✅ |
| Random Forest | 0.1452 | – |
| XGBoost | 0.1414 | – |
| kNN | 0.0907 | – |
| Keras NN | -0.3145 ❌ | – |
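
A negative R², as reported for the neural network, means the model's squared error on the test set is larger than that of a baseline that always predicts the mean of the true values. The sketch below illustrates this with the standard definition, R² = 1 - SS_res / SS_tot, on hypothetical scores; `r2_manual` is an illustrative helper, not the notebook's code.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical Success_2 scores, not taken from the notebook's data.
y_true = [0.2, 0.4, 0.6, 0.8]

mean_baseline = [0.5, 0.5, 0.5, 0.5]   # always predict the mean of y_true
reversed_pred = [0.8, 0.6, 0.4, 0.2]   # systematically wrong ordering

r2_baseline = r2_manual(y_true, mean_baseline)
r2_reversed = r2_manual(y_true, reversed_pred)

print(round(r2_baseline, 4))  # approximately 0: no better than the mean
print(round(r2_reversed, 4))  # approximately -3: worse than the mean
```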


## OL Model (Forecast)

In [428]:

# Step 1: Copy and prepare data
ol_prospects_test = ol_prospects.copy()

# Step 2: Save and drop 'Player' for final output
player_names = ol_prospects_test['Player']
ol_prospects_test = ol_prospects_test.drop(columns=['Player'])

# Step 3: Encode 'College/Univ' with same LabelEncoder used during training

le = LabelEncoder()
le.fit(X['College/Univ'])  # from training data

# Add unseen categories to encoder
le_classes = list(le.classes_)
for val in ol_prospects_test['College/Univ'].unique():
    if val not in le_classes:
        le_classes.append(val)
le.classes_ = np.array(le_classes)

# Apply transformation
ol_prospects_test['College/Univ'] = le.transform(ol_prospects_test['College/Univ'])

# Step 4: Ensure same column order and structure as training data
ol_prospects_test = ol_prospects_test[X.columns]  # Use same column order as model was trained on

# Step 5: Scale using same scaler
ol_prospects_scaled = scaler.transform(ol_prospects_test)

# Step 6: Predict
prospect_probs = ol1_model.predict(ol_prospects_scaled).flatten()
prospect_preds = (prospect_probs >= 0.5).astype(int)

# Step 7: Combine with player names and sort
results_df = pd.DataFrame({
    'Player': player_names,
    'Predicted_Success': prospect_preds,
    'Probability': prospect_probs
}).sort_values(by='Probability', ascending=False)

results_df


[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 23ms/step


Unnamed: 0,Player,Predicted_Success,Probability
19,Jared Wilson,1,0.865283
25,Kelvin Banks,1,0.85201
22,Jonah Savaiinaea,1,0.835906
6,Armand Membou,1,0.829418
23,Josh Conerly,1,0.809468
14,Dylan Fairchild,1,0.772968
17,Jackson Slater,1,0.731986
13,Donovan Jackson,1,0.712465
40,Will Campbell,1,0.62883
9,Cameron Williams,1,0.593817


# Closing summary

Following data preparation, the notebook trained ensemble methods (bagging with Random Forest and boosting with XGBoost) and neural networks to predict player success. The classifiers for Success_1 were compared on accuracy, precision, recall, and F1-score; the regressors for Success_2 were compared on MAE, RMSE, and R². The analysis shows how these models can pick up patterns in NFL player data and inform draft decisions by general managers and scouts, and it supports combining ensemble and deep learning methods when evaluating prospects. These models will serve as a reference for the upcoming NFL draft simulation.