
# Fourth Down Bravery: Logistic Regression on Real NFL Data

Coaches face a game-changing decision every time fourth down rolls around: **should we go for it or play it safe?**
In this notebook we'll use real 2023 NFL play-by-play data to estimate the chance of successfully converting
those nerve-wracking fourth downs. Along the way we'll build intuition for logistic regression and how its
coefficients translate into football insight.



## Step 1: Loading real play-by-play data
We start with a dataset of every fourth-down play from the 2023 season. The data was pulled from the
[nflfastR](https://github.com/nflverse/nflfastR-data) project and pared down to a few useful columns.
Each row is a single fourth-down play, with field position, yards to go, score differential,
time remaining in the half, and whether the play gained a first down. Since not every fourth down
is an attempt to convert, the cell below keeps only run and pass plays.


In [None]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load the real fourth-down data
fourth = pd.read_csv('../data/fourth_down_plays_2023.csv')
# only keep plays where the offense went for it (run or pass)
fourth = fourth[fourth['play_type'].isin(['run','pass'])]
print(f"Loaded {len(fourth):,} fourth-down attempts from {fourth['season'].iloc[0]}.")
fourth.head()



## Step 2: Choosing features and fitting the model
The goal is to predict whether the offense converts the fourth down. We'll use a handful of intuitive features:

* `ydstogo` - yards needed for a first down
* `yardline_100` - distance in yards from the opponent's goal line
* `score_differential` - points the offense is ahead (positive) or behind (negative)
* `half_seconds_remaining` - seconds left in the current half
* `shotgun` and `qb_scramble` - 0/1 indicators for a shotgun formation and a quarterback scramble

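These covariates capture both **context** (field position, time, score) and **how the play was run** (formation and whether the quarterback scrambled). Note that `qb_scramble` is only known once the play unfolds, so unlike the other features it is not something a coach can fix before the snap.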
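
In symbols, a logistic regression on these six features can be written as below. The $\beta$ notation is introduced here for illustration and does not appear in the code; $\beta_0$ corresponds to the fitted intercept and each remaining $\beta_j$ to one entry of the fitted coefficient vector, with $p$ the probability of converting.

```latex
\log\frac{p}{1-p}
  = \beta_0
  + \beta_1\,\texttt{ydstogo}
  + \beta_2\,\texttt{yardline\_100}
  + \beta_3\,\texttt{score\_differential}
  + \beta_4\,\texttt{half\_seconds\_remaining}
  + \beta_5\,\texttt{shotgun}
  + \beta_6\,\texttt{qb\_scramble},
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\texttt{ydstogo} + \dots + \beta_6\,\texttt{qb\_scramble})}}
```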


In [None]:

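feature_cols = ['ydstogo', 'yardline_100', 'score_differential',
                'half_seconds_remaining', 'shotgun', 'qb_scramble']
X = fourth[feature_cols]
y = fourth['first_down']  # target: whether the play gained a first down

# Hold out 20% of the plays for evaluation; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The features are left unscaled, so give the solver extra iterations to converge
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Model trained. Coefficients (log-odds per unit):")
print(f"  {'intercept':22s} {model.intercept_[0]:+.3f}")
for col, coef in zip(feature_cols, model.coef_[0]):
    print(f"  {col:22s} {coef:+.3f}")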
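
Because the coefficients printed above are in log-odds, they can be hard to read directly. A minimal sketch of two common conversions follows: exponentiating a coefficient to get an odds ratio, and pushing a linear predictor through the logistic function to get a probability. The helper names `odds_ratios` and `conversion_probability` are hypothetical, and the example coefficients are made up for illustration; they are not the values fitted in this notebook.

```python
import math

def odds_ratios(feature_names, log_odds_coefs):
    """Convert log-odds coefficients to odds ratios via exp(coef)."""
    return {name: math.exp(coef) for name, coef in zip(feature_names, log_odds_coefs)}

def conversion_probability(intercept, log_odds_coefs, feature_values):
    """Apply the logistic (sigmoid) function to the linear predictor."""
    z = intercept + sum(c * x for c, x in zip(log_odds_coefs, feature_values))
    return 1.0 / (1.0 + math.exp(-z))

# Made-up coefficients for illustration only; NOT the values fitted in this notebook.
example_names = ['ydstogo', 'shotgun']
example_coefs = [-0.20, 0.00]

example_ratios = odds_ratios(example_names, example_coefs)
example_prob = conversion_probability(0.0, example_coefs, [0, 1])

for name, ratio in example_ratios.items():
    print(f"  {name:22s} odds ratio {ratio:.3f}")
print(f"  example probability    {example_prob:.3f}")
```

With the fitted model, the same helpers could be called as `odds_ratios(feature_cols, model.coef_[0])`. An odds ratio below 1 means each extra unit of the feature shrinks the odds of converting; a ratio of exactly 1 means no effect.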



## Step 3: Evaluating the model
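How well does the model predict conversions on unseen plays? The classification report below gives
precision, recall, and F1 for each outcome on the held-out 20% of plays.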


In [None]:

pred = model.predict(X_test)
print(classification_report(y_test, pred, digits=3))
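
One way to put the report above in context is to compare the model's accuracy with a baseline that always predicts the more common outcome. The sketch below uses plain Python and toy labels that are invented for illustration (they are not NFL data); the function names `accuracy` and `majority_baseline_accuracy` are hypothetical helpers, not part of scikit-learn.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of plays where the predicted label matches the outcome."""
    pairs = list(zip(y_true, y_pred))
    return sum(t == p for t, p in pairs) / len(pairs)

def majority_baseline_accuracy(y_train, y_true):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return sum(t == majority_label for t in y_true) / len(y_true)

# Toy labels for illustration only (1 = converted, 0 = failed); not NFL data.
toy_train = [1, 0, 1, 1, 0, 1]
toy_true = [1, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 1, 1]

model_acc = accuracy(toy_true, toy_pred)
baseline_acc = majority_baseline_accuracy(toy_train, toy_true)
print(f"model accuracy:    {model_acc:.3f}")
print(f"baseline accuracy: {baseline_acc:.3f}")
```

In the notebook, the equivalent calls would be `accuracy(y_test, pred)` and `majority_baseline_accuracy(y_train, y_test)`. If the two numbers are close, the model is adding little beyond the overall conversion rate.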



## Step 4: What the numbers tell us
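Logistic regression estimates the effect of each feature on the **log-odds** of a conversion. Negative
coefficients (like the one on `ydstogo`) reduce the predicted chance of success as the feature grows, while
positive ones (like the one on the `shotgun` indicator) increase it. Exponentiating a coefficient gives an
odds ratio: the factor by which the odds of converting change for a one-unit increase in that feature,
holding the others fixed.

Coaches can translate the predicted probability into a decision by comparing expected values. If the
probability of converting times the value of a first down, plus the probability of failing times the value
of turning the ball over on downs, exceeds the value of punting or kicking, going for it becomes the
analytically sound choice.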
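
The decision rule above can be sketched as a small expected-value comparison. Everything numeric here is hypothetical: the values stand in for whatever point-value estimate a team uses for each outcome, and the function names are invented for this example rather than taken from any library.

```python
def go_for_it_value(p_convert, value_if_converted, value_if_failed):
    """Expected value of going for it, weighting both outcomes by their probability."""
    return p_convert * value_if_converted + (1.0 - p_convert) * value_if_failed

def break_even_probability(value_if_converted, value_if_failed, value_of_kicking):
    """Conversion probability at which going for it and kicking are worth the same.

    Assumes converting is worth more than failing (value_if_converted > value_if_failed).
    """
    return (value_of_kicking - value_if_failed) / (value_if_converted - value_if_failed)

def should_go_for_it(p_convert, value_if_converted, value_if_failed, value_of_kicking):
    """True when the expected value of going for it beats punting or kicking."""
    return go_for_it_value(p_convert, value_if_converted, value_if_failed) > value_of_kicking

# Hypothetical outcome values, chosen only to make the arithmetic easy to follow.
example_go = should_go_for_it(0.60, 2.0, -1.0, 0.5)
example_break_even = break_even_probability(2.0, -1.0, 0.5)
print(f"go for it? {example_go} (break-even probability {example_break_even:.2f})")
```

In practice `p_convert` would come from the fitted model, for example the second column of `model.predict_proba(...)`, while the outcome values have to come from a separate estimate that this notebook does not build.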

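Even with a simple model we capture meaningful tendencies from real NFL behavior, which suggests that data
can shed light on those pivotal fourth-down gambles. Keep in mind that the model is fit only to plays where
teams chose to go for it, so its estimates describe those situations rather than every fourth down.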
