## NFL Combine Statistics Web Scraping

The purpose of this project is to provide functions and a notebook for scraping statistics for players participating in the NFL combine across a user-specified range of years. All data is scraped from pro-football-reference.com, using BeautifulSoup and Selenium.  

Statistics include combine metrics for players participating in the combine as well as their associated college football statistics. Combine metrics include height/weight, 40-yard dash, vertical jump, bench press reps, broad jump, 3-cone drill, 20-yard shuttle, and draft result. College football statistics include passing, rushing, receiving, and defensive statistics. As a demonstration, this notebook scrapes combine and college football statistics for all players who participated in the NFL combine over five years (2019-2023).

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from nfl_combine_statistics_web_scraping import (
    get_soups,
    get_pfr_combines_stats,
    get_pfr_players_stats
)

pd.set_option('display.max_columns', None)

### Scraping 2019-2023 NFL Combine Metric Statistics

First, I will scrape five years' worth (2019-2023) of NFL combine metrics. The resulting dataframe is indexed by combine year and a uniquely generated player id, with one column per combine metric.

On pro-football-reference.com, one column in the combine metrics table, 'college', contains a hyperlink to a given player's college statistics. Rather than keeping the hyperlink text, I extract the link target (the URL) itself.

Note: Most players only have valid values for a subset of combine metrics. All missing combine metric entries have been set to np.nan.
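
As a minimal sketch of the link extraction and missing-value handling described above (not the implementation behind `get_pfr_combines_stats`), the snippet below parses a small hypothetical table fragment with BeautifulSoup. The `data-stat` attribute, the example.com URL, and the `parse_row` helper are assumptions made for illustration; the real page markup may differ.

```python
import numpy as np
from bs4 import BeautifulSoup

# Hypothetical table fragment; the real page markup may differ.
html = """
<table>
  <tr><td data-stat="player">Player A</td>
      <td data-stat="college"><a href="https://example.com/cfb/players/player-a.html">College Stats</a></td>
      <td data-stat="forty_yd">4.45</td></tr>
  <tr><td data-stat="player">Player B</td>
      <td data-stat="college"></td>
      <td data-stat="forty_yd"></td></tr>
</table>
"""


def parse_row(row):
    """Return a dict for one table row, using the href for the college cell."""
    record = {}
    for cell in row.find_all("td"):
        name = cell["data-stat"]
        if name == "college":
            # Keep the link target, not the visible link text.
            link = cell.find("a")
            record[name] = link["href"] if link else np.nan
        else:
            # Empty cells become np.nan instead of empty strings.
            text = cell.get_text(strip=True)
            record[name] = text if text else np.nan
    return record


soup = BeautifulSoup(html, "html.parser")
records = [parse_row(tr) for tr in soup.find_all("tr")]
```

Here `records[0]["college"]` holds the URL rather than the text "College Stats", and the empty cells for Player B are stored as `np.nan`.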

In [3]:
years = np.arange(2019, 2024)

urls = [f"https://aws.pro-football-reference.com/draft/{i}-combine.htm" for i in years]

soups = get_soups(urls)

combine_metrics_df = get_pfr_combines_stats(years, soups)

combine_metrics_df.head(3)

combine_year,player_id,player,pos,school_name,college,height,weight,forty_yd,vertical,bench_reps,broad_jump,cone,shuttle,draft_info
2019,0,Johnathan Abram,S,Mississippi State,https://www.sports-reference.com/cfb/players/j...,5-11,205.0,4.45,,,116.0,,,Oakland Raiders / 1st / 27th pick / 2019
2019,1,Paul Adams,OT,Missouri,https://www.sports-reference.com/cfb/players/p...,6-6,317.0,5.18,27.0,16.0,103.0,7.68,4.74,
2019,2,Nasir Adderley,S,Delaware,,6-0,206.0,,,,,,,Los Angeles Chargers / 2nd / 60th pick / 2019


In [4]:
combine_metrics_df.to_csv("datasets/Combine Metrics 2019-2023.csv")
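
Because the dataframe has a two-level index, reading the saved file back requires telling pandas which columns form the index. The sketch below shows this on a toy frame with made-up values, writing to an in-memory buffer rather than to the `datasets/` folder; `toy_df` and `restored_df` are illustrative names only.

```python
import io

import numpy as np
import pandas as pd

# Toy frame with the same two index levels as combine_metrics_df.
index = pd.MultiIndex.from_tuples(
    [(2019, 0), (2019, 1), (2020, 2)], names=["combine_year", "player_id"]
)
toy_df = pd.DataFrame(
    {"pos": ["S", "OT", "QB"], "forty_yd": [4.45, 5.18, np.nan]}, index=index
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_df.to_csv(buffer)
buffer.seek(0)

# index_col=[0, 1] restores both index levels; empty cells come back as NaN.
restored_df = pd.read_csv(buffer, index_col=[0, 1])
```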

### Scraping College Football Statistics from 2019-2023 NFL Combine Participants

For players who participated in the 2019-2023 combines, I will use the links scraped from the combine metrics tables (see above) to obtain their college football statistics.

The resulting dataframe is a concatenation of passing, rushing/receiving, and defense college football statistics tables. The dataframe is indexed by player id (see above) and college football season year and has columns for each college football statistic (e.g., rush attempts, passing yards).

Note: Most players have valid values for only a subset of college football statistics. All missing college football statistic entries have been set to np.nan.
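
To illustrate how combining per-category tables leaves gaps, here is a small sketch using `pd.concat` on toy tables that share a (player id, season) index. This is one possible approach, not necessarily what `get_pfr_players_stats` does internally; the `make_table` helper and all values are made up.

```python
import pandas as pd


def make_table(rows, columns):
    """Build a small per-season table indexed by (player_id, year_id)."""
    index = pd.MultiIndex.from_tuples(
        [key for key, _ in rows], names=["player_id", "year_id"]
    )
    return pd.DataFrame([vals for _, vals in rows], index=index, columns=columns)


# Hypothetical values: player 0 is a defender, player 1 is a quarterback.
defense = make_table([((0, 2017), [71.0]), ((0, 2018), [99.0])], ["tackles_total"])
rushing = make_table([((1, 2018), [120.0])], ["rush_yds"])
passing = make_table([((1, 2018), [3000.0])], ["pass_yds"])

# Align on the shared index; statistics a player never recorded become NaN.
college_stats = pd.concat([defense, rushing, passing], axis=1).sort_index()
```

The defender's rows have NaN in the rushing and passing columns, and the quarterback's row has NaN in the defensive column, which matches the sparsity pattern in the scraped output.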

In [5]:
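# Keep only players whose 'college' cell holds a link to a college statistics page
valid_link = combine_metrics_df.loc[combine_metrics_df["college"].notna(), "college"]

links = valid_link.values
# player_id is the second level of the (combine_year, player_id) index
player_ids = valid_link.index.get_level_values(1)

soups = get_soups(links)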

college_stats_df = get_pfr_players_stats(player_ids, soups)

college_stats_df.head(3)

player_id,year_id,school_name,conf_abbr,class,pos,g,tackles_solo,tackles_assists,tackles_total,tackles_loss,sacks,def_int,def_int_yds,def_int_yds_per_int,def_int_td,pass_defended,fumbles_rec,fumbles_rec_yds,fumbles_rec_td,fumbles_forced,rush_att,rush_yds,rush_yds_per_att,rush_td,rec,rec_yds,rec_yds_per_rec,rec_td,scrim_att,scrim_yds,scrim_yds_per_att,scrim_td,pass_cmp,pass_att,pass_cmp_pct,pass_yds,pass_yds_per_att,adj_pass_yds_per_att,pass_td,pass_int,pass_rating
0,*2015,Georgia,SEC,FR,S,8.0,11.0,14.0,25.0,1.5,0.0,0.0,0.0,,0.0,0.0,0.0,,,0.0,,,,,,,,,,,,,,,,,,,,,
0,*2017,Mississippi State,SEC,JR,DB,12.0,43.0,28.0,71.0,5.0,2.0,0.0,0.0,,0.0,5.0,0.0,,,2.0,,,,,,,,,,,,,,,,,,,,,
0,*2018,Mississippi State,SEC,SR,S,13.0,53.0,46.0,99.0,9.0,3.0,2.0,9.0,4.5,0.0,5.0,1.0,,,1.0,,,,,,,,,,,,,,,,,,,,,


In [6]:
college_stats_df.to_csv("datasets/College Football Statistics 2019-2023.csv")

Next, I will group by player id and, for each numeric statistic, compute (1) its average across all seasons played and (2) its value in the most recent season played. I will also rename all columns to follow the pattern "{statistic}_{aggregation type}", for example "sacks_avg" and "sacks_last_season".

In [7]:
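# Keep numeric columns only, then aggregate each player's seasons two ways:
# the mean across seasons and the value in the last row of each group
agg_college_stats_df = (college_stats_df
                        .select_dtypes("number")
                        .groupby("player_id")
                        .agg(["mean", lambda x: x.iloc[-1]])
                        # pandas names the lambda "<lambda_0>" in a multi-function agg list
                        .rename(columns={"mean": "avg", "<lambda_0>": "last_season"},
                                level=1))

# Flatten the two-level column index into "{statistic}_{aggregation}" names
agg_college_stats_df.columns = [stat+"_"+agg for stat, agg in agg_college_stats_df.columns]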

agg_college_stats_df.head(3)

player_id,g_avg,g_last_season,tackles_solo_avg,tackles_solo_last_season,tackles_assists_avg,tackles_assists_last_season,tackles_total_avg,tackles_total_last_season,tackles_loss_avg,tackles_loss_last_season,sacks_avg,sacks_last_season,def_int_avg,def_int_last_season,def_int_yds_avg,def_int_yds_last_season,def_int_yds_per_int_avg,def_int_yds_per_int_last_season,def_int_td_avg,def_int_td_last_season,pass_defended_avg,pass_defended_last_season,fumbles_rec_avg,fumbles_rec_last_season,fumbles_rec_yds_avg,fumbles_rec_yds_last_season,fumbles_rec_td_avg,fumbles_rec_td_last_season,fumbles_forced_avg,fumbles_forced_last_season,rush_att_avg,rush_att_last_season,rush_yds_avg,rush_yds_last_season,rush_yds_per_att_avg,rush_yds_per_att_last_season,rush_td_avg,rush_td_last_season,rec_avg,rec_last_season,rec_yds_avg,rec_yds_last_season,rec_yds_per_rec_avg,rec_yds_per_rec_last_season,rec_td_avg,rec_td_last_season,scrim_att_avg,scrim_att_last_season,scrim_yds_avg,scrim_yds_last_season,scrim_yds_per_att_avg,scrim_yds_per_att_last_season,scrim_td_avg,scrim_td_last_season,pass_cmp_avg,pass_cmp_last_season,pass_att_avg,pass_att_last_season,pass_cmp_pct_avg,pass_cmp_pct_last_season,pass_yds_avg,pass_yds_last_season,pass_yds_per_att_avg,pass_yds_per_att_last_season,adj_pass_yds_per_att_avg,adj_pass_yds_per_att_last_season,pass_td_avg,pass_td_last_season,pass_int_avg,pass_int_last_season,pass_rating_avg,pass_rating_last_season
0,11.0,13.0,35.666667,53.0,29.333333,46.0,65.0,99.0,5.166667,9.0,1.666667,3.0,0.666667,2.0,3.0,9.0,4.5,4.5,0.0,0.0,3.333333,5.0,0.333333,1.0,,,,,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,10.75,6.0,50.25,24.0,48.5,19.0,98.75,43.0,7.75,3.0,1.875,1.5,0.25,0.0,7.5,0.0,30.0,,0.25,0.0,1.5,0.0,0.0,0.0,,,,,0.25,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,9.4,12.0,23.4,39.0,28.6,40.0,52.0,79.0,7.3,14.5,2.1,4.0,0.2,0.0,0.0,0.0,0.0,,0.0,0.0,1.0,1.0,0.0,0.0,,,,,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
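
The per-player aggregation can be checked on a toy frame. The sketch below is a variant of the cell above, under a few assumptions: it uses a named helper, `last_season`, in place of the lambda (so no `"<lambda_0>"` rename is needed), sorts the index first so that "last" really means the latest season, and uses made-up values.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(0, 2017), (0, 2018), (1, 2018)], names=["player_id", "year_id"]
)
toy_stats = pd.DataFrame(
    {"g": [12.0, 13.0, 10.0], "sacks": [2.0, 3.0, np.nan], "pos": ["DB", "S", "QB"]},
    index=index,
)


def last_season(series):
    """Return the value in the final row of each player's group."""
    return series.iloc[-1]


agg_df = (
    toy_stats.select_dtypes("number")  # drop non-numeric columns such as pos
    .sort_index()                      # make sure seasons are in order
    .groupby(level="player_id")
    .agg(["mean", last_season])
    .rename(columns={"mean": "avg"}, level=1)
)

# Flatten the two-level column index into "{statistic}_{aggregation}" names.
agg_df.columns = [f"{stat}_{agg}" for stat, agg in agg_df.columns]
```

Note that taking the last row keeps whatever is there, including NaN, so a player with no recorded value in the final season gets NaN for the corresponding `_last_season` column.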


### Merging 2019-2023 NFL Combine Metrics and College Football Statistics 

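Finally, I will merge the combine statistics and the aggregated college statistics into one dataframe. The merge is a left join on player id, so players without college statistics are kept and their college columns are left as np.nan.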

In [8]:
integrated_df = combine_metrics_df.merge(agg_college_stats_df,
                                         how="left",
                                         left_on="player_id",
                                         right_on="player_id")

integrated_df.head(3)

player_id,player,pos,school_name,college,height,weight,forty_yd,vertical,bench_reps,broad_jump,cone,shuttle,draft_info,g_avg,g_last_season,tackles_solo_avg,tackles_solo_last_season,tackles_assists_avg,tackles_assists_last_season,tackles_total_avg,tackles_total_last_season,tackles_loss_avg,tackles_loss_last_season,sacks_avg,sacks_last_season,def_int_avg,def_int_last_season,def_int_yds_avg,def_int_yds_last_season,def_int_yds_per_int_avg,def_int_yds_per_int_last_season,def_int_td_avg,def_int_td_last_season,pass_defended_avg,pass_defended_last_season,fumbles_rec_avg,fumbles_rec_last_season,fumbles_rec_yds_avg,fumbles_rec_yds_last_season,fumbles_rec_td_avg,fumbles_rec_td_last_season,fumbles_forced_avg,fumbles_forced_last_season,rush_att_avg,rush_att_last_season,rush_yds_avg,rush_yds_last_season,rush_yds_per_att_avg,rush_yds_per_att_last_season,rush_td_avg,rush_td_last_season,rec_avg,rec_last_season,rec_yds_avg,rec_yds_last_season,rec_yds_per_rec_avg,rec_yds_per_rec_last_season,rec_td_avg,rec_td_last_season,scrim_att_avg,scrim_att_last_season,scrim_yds_avg,scrim_yds_last_season,scrim_yds_per_att_avg,scrim_yds_per_att_last_season,scrim_td_avg,scrim_td_last_season,pass_cmp_avg,pass_cmp_last_season,pass_att_avg,pass_att_last_season,pass_cmp_pct_avg,pass_cmp_pct_last_season,pass_yds_avg,pass_yds_last_season,pass_yds_per_att_avg,pass_yds_per_att_last_season,adj_pass_yds_per_att_avg,adj_pass_yds_per_att_last_season,pass_td_avg,pass_td_last_season,pass_int_avg,pass_int_last_season,pass_rating_avg,pass_rating_last_season
0,Johnathan Abram,S,Mississippi State,https://www.sports-reference.com/cfb/players/j...,5-11,205.0,4.45,,,116.0,,,Oakland Raiders / 1st / 27th pick / 2019,11.0,13.0,35.666667,53.0,29.333333,46.0,65.0,99.0,5.166667,9.0,1.666667,3.0,0.666667,2.0,3.0,9.0,4.5,4.5,0.0,0.0,3.333333,5.0,0.333333,1.0,,,,,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,Paul Adams,OT,Missouri,https://www.sports-reference.com/cfb/players/p...,6-6,317.0,5.18,27.0,16.0,103.0,7.68,4.74,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,Nasir Adderley,S,Delaware,,6-0,206.0,,,,,,,Los Angeles Chargers / 2nd / 60th pick / 2019,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
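
The effect of the left join can be seen on a toy pair of frames. In this sketch (made-up players and values), `indicator=True` adds a `_merge` column that flags which rows found a match in the college table; that flag is an addition for illustration and is not part of the merge cell above.

```python
import numpy as np
import pandas as pd

# Two combine participants; only player 0 has aggregated college statistics.
combine = pd.DataFrame(
    {"player": ["Player A", "Player B"], "forty_yd": [4.45, 5.18]},
    index=pd.Index([0, 1], name="player_id"),
)
college = pd.DataFrame(
    {"g_avg": [11.0]}, index=pd.Index([0], name="player_id")
)

# how="left" keeps every combine row; unmatched college columns become NaN.
merged = combine.merge(college, how="left", on="player_id", indicator=True)
```

Counting the `left_only` rows in `_merge` is a quick way to see how many combine participants have no college statistics attached.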


In [9]:
integrated_df.to_csv("datasets/Combine Metrics and Aggregated College Football Statistics 2019-2023.csv")