**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Finnegan Sullivan A17672893
- Nabila Afifah Qotrunnada A18500553
- Kalkidan Berhe Gebrekirstos A18484099
- Chi-en Kao A18064210
- Aziza Hussein A16954820

# Research Question

-  Can a quarterback's college statistics and NFL Combine metrics be used to accurately predict whether they will be drafted into the NFL?

-  This project aims to answer: To what extent can a quarterback’s college performance metrics and NFL Combine results predict their likelihood of being drafted into the NFL, and which specific features (e.g., college QBR, rushing statistics, 40-yard dash time) are most strongly associated with this outcome?






## Background and Prior Work

NFL teams rely heavily on college statistics and NFL Combine performance when evaluating quarterback talent. However, it’s unclear which specific metrics actually influence draft outcomes. Several prior studies provide insight into this complex decision-making process.

Craig & Winchester (2021) found that college passing metrics are the strongest indicators of a quarterback being drafted, while rushing performance better predicts future NFL success. They argue that mobility and rushing ability are undervalued by scouts.<a name="#1"> </a><sup>1</sup>

Wolfson et al. (2011) explored the “Quarterback Prediction Problem” and concluded that many draft outcomes do not align with actual NFL performance. They attribute this to a mix of random variation and overemphasis on traits like height and arm strength that may not correlate with success.<a name="#2"> </a><sup>2</sup>

More recent studies like those from Sports Info Solutions (2025) show a substantial drop in on-target passing percentages from college to the NFL, suggesting that translating college success to professional performance is not straightforward.<a name="#3"> </a><sup>3</sup>

Together, these studies point out the need for a more data-driven approach to draft forecasting. Our project builds on this work by using modern datasets and tools to identify which pre-draft traits are most predictive of quarterback draft status.


<a name="ref1">1.</a> Craig, J. D., & Winchester, N. (2021). Predicting the national football league potential of college quarterbacks. European Journal of Operational Research. https://doi.org/10.1016/j.ejor.2021.03.013 ↩
<a name="ref2">2.</a> Wolfson, J., Addona, V., & Schmicker, R. (2011). The Quarterback Prediction Problem. Journal of Quantitative Analysis in Sports. https://doi.org/10.2202/1559-0410.1302 ↩
<a name="ref3">3.</a> Study Comparing College and NFL On-Target Percentage. Sports Info Solutions (2025). https://www.sportsinfosolutions.com/2025/04/08/study-comparing-college-and-nfl-on-target-percentage/ ↩


# Hypothesis


Null Hypothesis (H₀): A quarterback’s college statistics and NFL Combine metrics (such as QBR, rushing yards, and 40-yard dash time) have no association with whether they are drafted into the NFL.

Alternative Hypothesis (H₁): At least one of these features is significantly associated with a quarterback’s likelihood of being drafted.

We predict that quarterbacks with higher college QBRs and faster 40-yard dash times are more likely to be drafted, since current offensive schemes favor dual-threat quarterbacks, which increases their value. This prediction is supported by Craig & Winchester (2021), who found that passing performance is a strong indicator of draft selection, while rushing traits are often undervalued.<sup>1</sup>

1. Craig, J. D., & Winchester, N. (2021). Predicting the national football league potential of college quarterbacks. European Journal of Operational Research. https://doi.org/10.1016/j.ejor.2021.03.013
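
One way to make the null hypothesis concrete for a single feature is a permutation test: if 40-yard dash time had no association with being drafted, shuffling the drafted labels should produce a difference in group means as large as the observed one fairly often. The sketch below uses made-up times (not values from our datasets) and a hypothetical helper, `permutation_p_value`. It illustrates the idea only and is not necessarily the analysis we will run.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling breaks any real link between time and draft label.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical 40-yard dash times in seconds (illustrative only).
drafted_times = [4.62, 4.70, 4.75, 4.81, 4.68, 4.73]
undrafted_times = [4.88, 4.95, 5.01, 4.79, 4.92, 5.05]

p_value = permutation_p_value(drafted_times, undrafted_times)
print(f"Permutation p-value: {p_value:.4f}")
```

A small p-value would count as evidence against H₀ for that one feature. Testing several features this way would require a correction for multiple comparisons.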



# Data

## Data overview


- Dataset #1
  - Dataset Name: NFL Combine Results Dataset 2000–2022
  - Link to the dataset: https://www.kaggle.com/datasets/mitchellweg1/nfl-combine-results-dataset-2000-2022
  - Number of observations: 7,680 rows in the loaded dataframe
  - Number of variables: 11 in the raw files (12 after we add `Year`)
  - Description: This dataset includes NFL Scouting Combine performance metrics from 2000 to 2022, including player positions and schools. It also features attributes such as 40-yard dash time, bench press reps, vertical jump, broad jump, 3-cone drill, shuttle run, height, and weight. This dataset serves as a comprehensive resource that evaluates physical and athletic capabilities of NFL prospects. These performance variables will be used to explore their predictive relationship with draft outcomes and long-term success in the NFL. 
- Dataset #2 
  - Dataset Name: NFL Draft Dataset 2000–2022
  - Link to the dataset: https://www.kaggle.com/datasets/mitchellweg1/nfl-draft-dataset-2000-2022
  - Number of observations: 2,000 players
  - Number of variables: 10
  - Description: This dataset records players drafted between 2000 and 2022, with fields for draft year, round, overall pick number, team, player name, and position. When merged with the NFL Combine dataset, this will enable analyses that assess how measurable pre-draft attributes affect draft position and NFL career outcomes.
- Dataset #3
  - Dataset Name: NCAA College Quarterback Data
  - Link to the dataset: https://www.kaggle.com/datasets/av8ramit/ncaa-college-quarterback-data
  - Number of observations: 156 quarterbacks in the loaded dataframe
  - Number of variables: 19, covering college passing totals (completions, attempts, yards, TDs, INTs), rushing stats, draft year, round, and pick, plus `Heisman` and `Verdict` indicator columns
  - Description: This dataset ties each quarterback's college performance to their eventual draft outcome, letting us validate our model on a cohort described by college statistics alone.



In [None]:
!pip install -q kagglehub

In [None]:
import kagglehub 
from kagglehub import KaggleDatasetAdapter
import pandas as pd
import os
import glob

## Dataset: NFL Combine Results Dataset 2000–2022

In [None]:
# Define dataset and generate file paths dynamically
dataset = "mitchellweg1/nfl-combine-results-dataset-2000-2022"
years = range(2000, 2023)  # Covers 2000 through 2022
file_paths = [f"{year}_combine.csv" for year in years]  # Constructs filenames

# Load all files into a single dataframe
dfs = []
for file_path in file_paths:
    df = kagglehub.load_dataset(KaggleDatasetAdapter.PANDAS, dataset, file_path)
    df["Year"] = file_path[:4]  # Add the year as a column
    dfs.append(df)

# Merge all individual dataframes into one
combine_df = pd.concat(dfs, ignore_index=True)


In [None]:
combine_df.head()

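| | Player | Pos | School | Ht | Wt | 40yd | Vertical | Bench | Broad Jump | 3Cone | Shuttle | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | John Abraham | OLB | South Carolina | 6-4 | 252.0 | 4.55 | NaN | NaN | NaN | NaN | NaN | 2000 |
| 1 | Shaun Alexander | RB | Alabama | 6-0 | 218.0 | 4.58 | NaN | NaN | NaN | NaN | NaN | 2000 |
| 2 | Darnell Alford | OT | Boston Col. | 6-4 | 334.0 | 5.56 | 25.0 | 23.0 | 94.0 | 8.48 | 4.98 | 2000 |
| 3 | Kyle Allamon | TE | Texas Tech | 6-2 | 253.0 | 4.97 | 29.0 | NaN | 104.0 | 7.29 | 4.49 | 2000 |
| 4 | Rashard Anderson | CB | Jackson State | 6-2 | 206.0 | 4.55 | 34.0 | NaN | 123.0 | 7.18 | 4.15 | 2000 |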


In [None]:
print('Columns: ', combine_df.columns)
print('Shape: ', combine_df.shape)
print('Data types: ', combine_df.dtypes)
print('Missing values: ', combine_df.isnull().sum())

Columns:  Index(['Player', 'Pos', 'School', 'Ht', 'Wt', '40yd', 'Vertical', 'Bench',
       'Broad Jump', '3Cone', 'Shuttle', 'Year'],
      dtype='object')
Shape:  (7680, 12)
Data types:  Player         object
Pos            object
School         object
Ht             object
Wt            float64
40yd          float64
Vertical      float64
Bench         float64
Broad Jump    float64
3Cone         float64
Shuttle       float64
Year           object
dtype: object
Missing values:  Player           0
Pos              0
School           0
Ht              29
Wt              24
40yd           474
Vertical      1748
Bench         2584
Broad Jump    1821
3Cone         2888
Shuttle       2785
Year             0
dtype: int64
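
The summary above shows `Ht` stored as text (dtype `object`, values such as `6-4`), so it cannot be used as a numeric predictor yet. A minimal sketch of one way to convert it to inches follows. The helper name `height_to_inches` is our own, and it assumes every valid height uses the `feet-inches` pattern seen in the preview.

```python
def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-4' to total inches.

    Returns None for missing or malformed values so they stay missing.
    """
    if not isinstance(height, str) or "-" not in height:
        return None
    feet, inches = height.split("-", 1)
    try:
        return int(feet) * 12 + int(inches)
    except ValueError:
        return None

# Sample inputs: three valid heights, one missing value, one malformed string.
samples = ["6-4", "6-0", "5-11", None, "n/a"]
converted = [height_to_inches(h) for h in samples]
print(converted)
```

On the real data this could be applied with `combine_df["Ht"].map(height_to_inches)`.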


## Dataset: NFL Draft Dataset 2000–2022

In [None]:
# Define dataset
dataset = "mitchellweg1/nfl-draft-dataset-2000-2022"

# Download the entire dataset
download_path = kagglehub.dataset_download(dataset)

# Find all CSV files in the downloaded directory (sorted for a reproducible row order)
csv_files = sorted(glob.glob(os.path.join(download_path, "*.csv")))

# Combine the per-file tables into one dataframe
draft_df = pd.concat([pd.read_csv(file) for file in csv_files], ignore_index=True)


In [None]:
draft_df.head()

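| | Rnd | Pick | Tm | Player | Pos | Age |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | CLE | Courtney Brown | DE | 22.0 |
| 1 | 1 | 2 | WAS | LaVar Arrington | LB | 22.0 |
| 2 | 1 | 3 | WAS | Chris Samuels | T | 23.0 |
| 3 | 1 | 4 | CIN | Peter Warrick | WR | 23.0 |
| 4 | 1 | 5 | BAL | Jamal Lewis | RB | 21.0 |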


## Dataset: NCAA College Quarterback Data

In [None]:
# Define dataset
dataset = "av8ramit/ncaa-college-quarterback-data"

# Download the entire dataset
download_path = kagglehub.dataset_download(dataset)

# Find all CSV files in the downloaded directory
csv_files = glob.glob(os.path.join(download_path, "*.csv"))
qb_df = pd.concat([pd.read_csv(file) for file in csv_files], ignore_index=True)
qb_df

| | DraftYear | Round | Pick | Age | GamesPlayed | Completions | Attempts | Yards | Touchdowns | Interceptions | RushAttempts | RushYards | RushTouchdowns | Player | College | Conference | Team | Heisman | Verdict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 7 | 237 | 23 | 47 | 649 | 1132 | 8433 | 52 | 39 | 526 | 2068 | 5 | B.J. Daniels | South Florida | Southeastern | SFO | 0 | 0 |
| 1 | 2004 | 1 | 1 | 23 | 43 | 829 | 1363 | 10119 | 81 | 35 | 128 | -135 | 5 | Eli Manning | Mississippi | Southeastern | SDG | 0 | 1 |
| 2 | 2001 | 2 | 32 | 22 | 45 | 1026 | 1678 | 11792 | 90 | 45 | 252 | 900 | 14 | Drew Brees | Purdue | Big Ten | SDG | 0 | 1 |
| 3 | 2001 | 4 | 109 | 23 | 30 | 306 | 587 | 4164 | 18 | 26 | 164 | 660 | 14 | Sage Rosenfels | Iowa St. | Big 12 | WAS | 0 | 0 |
| 4 | 2001 | 1 | 1 | 21 | 22 | 192 | 343 | 3299 | 21 | 11 | 235 | 1299 | 13 | Michael Vick | Virginia Tech | Atlantic Coast | ATL | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 152 | 1999 | 4 | 131 | 23 | 34 | 357 | 651 | 5118 | 33 | 24 | 199 | 547 | 14 | Aaron Brooks | Virginia | Atlantic Coast | GNB | 0 | 1 |
| 153 | 1999 | 7 | 227 | 23 | 23 | 244 | 480 | 4401 | 36 | 13 | 324 | 1314 | 23 | Michael Bishop | Kansas St. | Big 12 | NWE | 0 | 0 |
| 154 | 1999 | 7 | 245 | 23 | 35 | 281 | 497 | 3862 | 26 | 15 | 80 | -245 | 2 | Scott Covington | Miami (FL) | Atlantic Coast | CIN | 0 | 0 |
| 155 | 1998 | 1 | 1 | 22 | 45 | 863 | 1381 | 11201 | 89 | 33 | 153 | -181 | 12 | Peyton Manning | Tennessee | Southeastern | IND | 0 | 1 |


## Merge the Draft Result with Combine Result

In [None]:
# Attach draft results to every combine participant, matching on name and position.
# how='right' keeps combine players with no draft record (their Rnd/Pick become NaN).
merged_df = pd.merge(
    draft_df,
    combine_df,
    on=['Player', 'Pos'],
    how='right'
)
# Keep quarterbacks (QB) only
qb_only_df = merged_df.query('Pos == "QB"')
# Add college statistics; how='left' keeps quarterbacks with no match in the college data
qb_data = pd.merge(
    qb_only_df,
    qb_df,
    on=['Player','Pick','Age'],
    how='left'
)

In [None]:
qb_data.head()

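| | Rnd | Pick | Tm | Player | Pos | Age | School | Ht | Wt | 40yd | ... | Touchdowns | Interceptions | RushAttempts | RushYards | RushTouchdowns | College | Conference | Team | Heisman | Verdict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 199.0 | NWE | Tom Brady | QB | 23.0 | Michigan | 6-4 | 211.0 | 5.28 | ... | 30.0 | 17.0 | 90.0 | -150.0 | 3.0 | Michigan | Big Ten | NWE | 0.0 | 1.0 |
| 1 | NaN | NaN | NaN | Travis Brown | QB | NaN | Northern Arizona | 6-3 | 218.0 | 5.01 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 6.0 | 168.0 | NOR | Marc Bulger | QB | 23.0 | West Virginia | 6-2 | 208.0 | 4.97 | ... | 59.0 | 34.0 | 107.0 | -326.0 | 2.0 | West Virginia | Big 12 | NOR | 0.0 | 1.0 |
| 3 | NaN | NaN | NaN | Bill Burke | QB | NaN | Michigan State | 6-4 | 206.0 | 5.03 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 3.0 | 65.0 | SFO | Giovanni Carmazzi | QB | NaN | Hofstra | 6-3 | 224.0 | 4.74 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |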


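We can't simply drop rows with null values: quarterbacks who were not drafted have no draft record, so their `Rnd` and `Pick` are missing after the merge, and dropping those rows would remove every undrafted player from the data.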
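
A sketch of how that missingness could be turned into the outcome variable instead of being discarded: a non-null `Pick` marks a drafted player. The toy rows and the `Drafted` column name are assumptions for illustration and do not come from our merged table.

```python
import pandas as pd

nan = float("nan")

# Toy rows mimicking the merged table; names and values are illustrative only.
toy = pd.DataFrame({
    "Player": ["QB A", "QB B", "QB C", "QB D"],
    "Rnd": [6.0, nan, 3.0, nan],
    "Pick": [199.0, nan, 65.0, nan],
    "40yd": [5.28, 5.01, 4.74, nan],
})

# A missing pick means the player was not matched to a draft record.
toy["Drafted"] = toy["Pick"].notna().astype(int)

# Drop rows only when a predictor is missing, never because of Rnd/Pick.
model_rows = toy.dropna(subset=["40yd"])
print(model_rows[["Player", "Drafted", "40yd"]])
```

One caveat with this approach is that a drafted player whose name or position is spelled differently across the two sources would also end up with a missing `Pick`, so unmatched rows are worth spot-checking.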

# Ethics & Privacy

The data required to answer our research question (whether a quarterback's college statistics and NFL Combine performance can predict their likelihood of being drafted) comes primarily from publicly available databases maintained largely by the NCAA, the NFL, and third-party sports analytics organizations. These sources provide performance metrics such as college passing efficiency, rushing yards, and NFL Combine results like the 40-yard dash and vertical leap, all of which are legally collected and publicly reported, so we consider the data ethical to use within the bounds of this project. Additionally, previous academic research such as Craig & Winchester (2021) used similar performance data to investigate draft outcomes, lending support to the relevance of these data sources. However, even though the data itself is public, we remain cautious about how combinations of variables like college, height, and combine scores could make certain players identifiable, especially in smaller groups. To mitigate these concerns, we will consider recoding variables to reduce specificity, such as grouping schools by conference or categorizing physical metrics into ranges, and we will avoid drawing attention to specific individuals.
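
As a rough illustration of the recoding idea, the sketch below maps exact weights to coarse ranges. The bin edges, labels, and the helper name `weight_range` are hypothetical; real cut points would be chosen after looking at the distribution in our data.

```python
from bisect import bisect_right

# Hypothetical bin edges in pounds (assumption, not derived from our data).
WEIGHT_EDGES = [200, 215, 230]
WEIGHT_LABELS = ["under 200", "200 to <215", "215 to <230", "230 or more"]

def weight_range(weight_lbs):
    """Return a coarse weight range label, or None if the weight is missing."""
    if weight_lbs is None or weight_lbs != weight_lbs:  # None or NaN
        return None
    return WEIGHT_LABELS[bisect_right(WEIGHT_EDGES, weight_lbs)]

# Illustrative weights, including edge values and a missing entry.
weights = [211.0, 218.0, 199.5, 230.0, None]
print([weight_range(w) for w in weights])
```

Coarser bins make individual players harder to single out, at the cost of discarding some predictive detail.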

Our analysis will focus on identifying performance-related patterns associated with being drafted without implying judgment of players' inherent ability or value. The intention isn’t to reinforce or replicate scouting but to explore whether certain quantifiable features are consistently correlated with draft status. However, as noted in the background of the project, scouting and drafting processes themselves are not always rational or objective. For example, Craig & Winchester (2021) found that rushing ability among quarterbacks is often undervalued despite its significant contribution to NFL success, suggesting the presence of bias in talent evaluations. Wolfson et al. (2011) argued that NFL draft outcomes are affected by both randomness and the overvaluation of certain traits, which can affect long-term success predictions. In addition, the overrepresentation of athletes from elite conferences or the undervaluation of mobile quarterbacks may reflect broader systemic inequities. Therefore, we will audit our data for imbalances, for example between Power Five and non-Power Five conferences, and avoid including features that serve as proxies for race or socioeconomic status unless we are explicitly analyzing their impact on our data. Our goal is not to predict player “worth” but to better understand how past decisions have been made and, if patterns exist, whether they can be used to accurately predict who will be drafted.
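
The conference audit described above could start with something as simple as comparing group sizes and draft shares. In this sketch the records are invented, the helper `draft_rate_by_group` is our own, and the Power Five set is an assumption about how conferences are labeled; the labels in the real data may differ and would need checking.

```python
from collections import defaultdict

# Assumed Power Five membership; conference labels in the real data may differ.
POWER_FIVE = {"Atlantic Coast", "Big Ten", "Big 12", "Pac-12", "Southeastern"}

def draft_rate_by_group(records):
    """Return {group: (n_players, share_drafted)} for Power Five vs. Other."""
    counts = defaultdict(lambda: [0, 0])  # group -> [players, drafted]
    for conference, drafted in records:
        group = "Power Five" if conference in POWER_FIVE else "Other"
        counts[group][0] += 1
        counts[group][1] += int(drafted)
    return {g: (n, d / n) for g, (n, d) in counts.items()}

# Hypothetical (conference, drafted) pairs, not real players.
records = [
    ("Southeastern", 1), ("Big Ten", 1), ("Big 12", 0), ("Atlantic Coast", 1),
    ("Mountain West", 0), ("Mid-American", 1), ("Sun Belt", 0),
]
summary = draft_rate_by_group(records)
for group, (n, rate) in summary.items():
    print(f"{group}: n={n}, drafted share={rate:.2f}")
```

A large gap in either group size or draft share would be a signal to report the imbalance and to interpret conference-related features with care.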

# Team Expectations 


Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* We use Discord for all team communication, and everyone must check and respond within 24 hours.
* Each member must contribute at least two meaningful opinions to each assignment discussion.
* It is acceptable to tag others for reminders.
* If you will be unavailable on the day an assignment is due, inform the group ahead of time.
* Everyone must be responsive and available the day before and the day of each major deadline.


# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/6  |  8 PM | Search for additional datasets  | Choosing final datasets | 
| 5/12  |  8 PM |  Working on data checkpoint | Making sure data checkpoint assignment is on track/ready to submit  | 
| 5/19  | 8 PM  | Work on EDA, import and wrangle data   | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part, thinking about visualizations   |
| 5/26  | 8 PM  | Prepare for EDA submission | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 6/2  | 8 PM  | Working on analysis | Discuss/edit Analysis; plan time for video part of project and what we will need  |
| 6/9  | 8 PM  | Complete analysis; Draft results/conclusion/discussion. Ideally, should be done with recording video and project| Discuss/edit full project and video |
| 6/11  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |