# NCAA to NFL Draft Predictions – Exploratory Data Analysis (EDA)

This notebook explores NCAA player stats, cleans the data, and visualizes key trends related to NFL Draft outcomes.

## Environment Check

In [1]:
import sys
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import sklearn
import statsmodels.api as sm

print("Python version:", sys.version)
print("Pandas:", pd.__version__)
print("NumPy:", np.__version__)
print("Matplotlib:", matplotlib.__version__)
print("Seaborn:", sns.__version__)
print("scikit-learn:", sklearn.__version__)
print("Statsmodels:", sm.__version__)


Python version: 3.10.18 (main, Jun  5 2025, 08:37:47) [Clang 14.0.6 ]
Pandas: 2.3.2
NumPy: 2.0.1
Matplotlib: 3.10.6
Seaborn: 0.13.2
scikit-learn: 1.7.2
Statsmodels: 0.14.5


## 1. Import Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Load Data

In [3]:
passing = pd.read_csv('../data/raw/CFB_Passing_2021.csv')
receiving = pd.read_csv('../data/raw/CFB_Receiving_2021.csv')
rushing = pd.read_csv('../data/raw/CFB_Rushing_2021.csv')
draft = pd.read_csv('../data/raw/NFL_Draft_2023.csv')
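
If one of the four files is missing or misnamed, `pd.read_csv` raises partway through the cell. A minimal sketch of a loader that fails early with a clearer message is shown below. The names `RAW_DIR`, `FILES`, and `load_tables` are hypothetical helpers, not part of the original notebook; the file names are the ones used in the cell above.

```python
from pathlib import Path

import pandas as pd

# Assumption: same relative layout as the cell above.
RAW_DIR = Path("../data/raw")

# Hypothetical mapping from a short table name to the CSV file names used above.
FILES = {
    "passing": "CFB_Passing_2021.csv",
    "receiving": "CFB_Receiving_2021.csv",
    "rushing": "CFB_Rushing_2021.csv",
    "draft": "NFL_Draft_2023.csv",
}


def load_tables(raw_dir=RAW_DIR, files=FILES):
    """Read each CSV into a DataFrame, raising early if a file is missing."""
    tables = {}
    for name, filename in files.items():
        path = Path(raw_dir) / filename
        if not path.exists():
            raise FileNotFoundError(f"Expected {name} data at {path}")
        tables[name] = pd.read_csv(path)
    return tables
```

Usage would be along the lines of `tables = load_tables()` followed by `draft = tables["draft"]`.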

## 3. Inspect Data
- Look at shape, data types, and missing values.
- Summarize numeric stats.

In [4]:
# for df, name in zip([passing, receiving, rushing, draft], ['Passing', 'Receiving', 'Rushing', 'Draft']):
#     print(f"{name} shape:", df.shape)
#     display(df.head(3))

draft.info()
passing.info()
rushing.info()
receiving.info()

draft.isna().sum().head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259 entries, 0 to 258
Data columns (total 30 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rnd           259 non-null    int64  
 1   Pick          259 non-null    int64  
 2   Tm            259 non-null    object 
 3   Player        259 non-null    object 
 4   Pos           259 non-null    object 
 5   Age           258 non-null    float64
 6   To            248 non-null    float64
 7   AP1           259 non-null    int64  
 8   PB            259 non-null    int64  
 9   St            259 non-null    int64  
 10  wAV           248 non-null    float64
 11  DrAV          237 non-null    float64
 12  G             248 non-null    float64
 13  Cmp           248 non-null    float64
 14  Att           248 non-null    float64
 15  Yds           248 non-null    float64
 16  TD            248 non-null    float64
 17  Int           248 non-null    float64
 18  Att.1         248 non-null    

Rnd        0
Pick       0
Tm         0
Player     0
Pos        0
Age        1
To        11
AP1        0
PB         0
St         0
dtype: int64
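
To cover the numeric summary listed for this section, one possible sketch is a small helper that reports dtype and missing values per column, followed by `describe()`. The helper name `summarize` and the toy frame are assumptions for illustration; the toy values are made up and are not taken from the draft data.

```python
import pandas as pd


def summarize(df, name):
    """Return a per-column summary: dtype, missing count, and missing share (%)."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean().mul(100).round(1),
    })
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} columns")
    # Columns with the most missing values come first.
    return summary.sort_values("n_missing", ascending=False)


# Toy frame standing in for `draft`; values are illustrative only.
toy = pd.DataFrame({
    "Pick": [1, 2, 3, 4],
    "Age": [22.0, 21.0, None, 23.0],
    "Pos": ["QB", "QB", "LB", "DB"],
})

print(summarize(toy, "toy"))
print(toy.describe())  # count, mean, std, and quartiles for numeric columns
```

In the notebook, the same helper could be called as `summarize(draft, "Draft")`, and likewise for the other three tables.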

## 4. Data Cleaning
- Handle missing values
- Rename columns for consistency
- Drop duplicates

In [11]:
# Make a copy to avoid overwriting the original
draft_clean = draft.copy()

# Drop empty or irrelevant columns
draft_clean = draft_clean.drop(columns=['Unnamed: 28','-9999'], errors='ignore')

# Rename confusing duplicate columns
draft_clean = draft_clean.rename(columns={
    'Att': 'Pass_Att',
    'Yds': 'Pass_Yds',
    'TD': 'Pass_TD',
    'Int': 'Pass_Int',
    'Att.1': 'Rush_Att',
    'Yds.1': 'Rush_Yds',
    'TD.1': 'Rush_TD',
})

# Check result
draft_clean.head()

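|   | Rnd | Pick | Tm | Player | Pos | Age | To | AP1 | PB | St | ... | Rush_Att | Rush_Yds | Rush_TD | Rec | Yds.2 | TD.2 | Solo | Int.1 | Sk | College/Univ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | CAR | Bryce Young | QB | 22.0 | 2025.0 | 0 | 0 | 1 | ... | 92.0 | 555.0 | 7.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | Alabama |
| 1 | 1 | 2 | HOU | C.J. Stroud | QB | 21.0 | 2025.0 | 0 | 1 | 2 | ... | 108.0 | 492.0 | 3.0 | 1.0 | 0.0 | 0.0 | NaN | NaN | NaN | Ohio St. |
| 2 | 1 | 3 | HOU | Will Anderson | LB | 22.0 | 2025.0 | 0 | 1 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 66.0 | NaN | 21.0 | Alabama |
| 3 | 1 | 4 | IND | Anthony Richardson | QB | 21.0 | 2025.0 | 0 | 0 | 1 | ... | 115.0 | 634.0 | 10.0 | 1.0 | -1.0 | 0.0 | NaN | NaN | NaN | Florida |
| 4 | 1 | 5 | SEA | Devon Witherspoon | DB | 22.0 | 2025.0 | 0 | 2 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 134.0 | 1.0 | 4.5 | Illinois |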
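
The cell above handles the column drops and part of the renaming. The remaining items in this section (the other suffixed columns, duplicates, and missing values) could be handled along the following lines. This is a sketch under stated assumptions: the meanings assigned to `Yds.2`, `TD.2`, and `Int.1` are guesses from their position in the table, filling missing defensive stats with 0 is one possible policy rather than the notebook's chosen one, and `finish_cleaning` is a hypothetical helper. The toy frame reuses a few values from the preview above and adds one artificial duplicate row.

```python
import pandas as pd

# Assumed meanings of the remaining columns (not confirmed by the source data).
EXTRA_RENAMES = {
    "Cmp": "Pass_Cmp",
    "Yds.2": "Rec_Yds",
    "TD.2": "Rec_TD",
    "Int.1": "Def_Int",
}
DEFENSE_COLS = ["Solo", "Def_Int", "Sk"]


def finish_cleaning(df):
    """Rename leftover columns, drop exact duplicate rows, fill defensive NaNs with 0."""
    out = df.rename(columns=EXTRA_RENAMES)
    out = out.drop_duplicates().reset_index(drop=True)
    present = [c for c in DEFENSE_COLS if c in out.columns]
    # Assumption: a missing defensive stat means "none recorded".
    out[present] = out[present].fillna(0)
    return out


nan = float("nan")
toy = pd.DataFrame({
    "Player": ["Bryce Young", "Will Anderson", "Will Anderson"],  # last row is an artificial duplicate
    "Pick": [1, 3, 3],
    "Yds.2": [0.0, 0.0, 0.0],
    "Solo": [nan, 66.0, 66.0],
    "Int.1": [nan, nan, nan],
    "Sk": [nan, 21.0, 21.0],
})

cleaned = finish_cleaning(toy)
print(cleaned)
```

In the notebook this would be applied as `draft_clean = finish_cleaning(draft_clean)`. Whether 0 is the right fill value depends on how the source encodes "no data" versus "no stat", so it is worth checking before modeling.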


## 5. Exploratory Visualizations
- Distribution of key stats
- Correlations between performance and draft position

In [None]:
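# Example histogram (Pass_Yds is the renamed 'Yds' column in draft_clean)
sns.histplot(draft_clean["Pass_Yds"].dropna(), bins=30, kde=True)
plt.title("Distribution of Passing Yards")
plt.show()

# Example correlation heatmap (numeric columns only; text columns such as Tm and Player are excluded)
sns.heatmap(draft_clean.corr(numeric_only=True), cmap="coolwarm", annot=False)
plt.title("Correlation Heatmap")
plt.show()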
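
For the second bullet in this section (correlation between performance and draft position), a full heatmap can be hard to read. One alternative sketch is to rank-correlate every numeric column against `Pick` and sort the result. Spearman correlation is used here as an assumption, on the grounds that pick number is ordinal; `corr_with_pick` and the toy columns `Stat_A` and `Stat_B` are hypothetical, and the toy values are made up.

```python
import pandas as pd


def corr_with_pick(df, target="Pick", method="spearman"):
    """Correlate each numeric column with the target column, sorted ascending."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr(method=method)[target].drop(target)
    # Constant or empty columns give NaN and are dropped.
    return corr.dropna().sort_values()


# Toy data: Stat_A falls steadily as Pick increases, Stat_B mostly rises.
toy = pd.DataFrame({
    "Pick": [1, 2, 3, 4, 5],
    "Stat_A": [50, 40, 30, 20, 10],
    "Stat_B": [1, 3, 2, 5, 4],
    "Pos": ["QB", "QB", "LB", "QB", "DB"],  # non-numeric, ignored
})

print(corr_with_pick(toy))
```

A negative value means the stat tends to be higher for earlier picks, since a lower `Pick` number is an earlier selection. In the notebook the call would be `corr_with_pick(draft_clean)`.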

## 6. Initial Insights
- Note any interesting patterns (e.g., whether higher passing yards correlate with earlier draft rounds).
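
One way to check a pattern like the example above is to summarize a stat by draft round. The sketch below assumes the `Rnd` and `Pass_Yds` column names from the cleaned draft table; `stat_by_round` is a hypothetical helper and the toy values are made up, so the printed numbers say nothing about the real data.

```python
import pandas as pd


def stat_by_round(df, stat, round_col="Rnd"):
    """Count, mean, and median of a stat per draft round (NaNs are skipped)."""
    return df.groupby(round_col)[stat].agg(["count", "mean", "median"])


# Toy data for illustration only.
toy = pd.DataFrame({
    "Rnd": [1, 1, 2, 2, 2],
    "Pass_Yds": [4000.0, 3000.0, 2500.0, None, 1500.0],
})

by_round = stat_by_round(toy, "Pass_Yds")
print(by_round)
```

In the notebook, `stat_by_round(draft_clean[draft_clean["Pos"] == "QB"], "Pass_Yds")` would restrict the comparison to quarterbacks, which is likely the more meaningful view for a passing stat.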