Regression Model to Predict Over/Under Outcomes for the Upcoming NFL Season

Goal

Develop a regression model that can recommend to bettors whether a game's total score is likely to go over or under the set line.

ETL

To train our regression model, we used a dataset with every NFL game since 1979, with features including betting lines, game outcomes, weather conditions, and more.
We needed to engineer a lot of new features to augment the data we started with. First, since our data did not include the result of any betting line (e.g. whether a game pushed), we added new columns to identify whether the game's total score went over, pushed, or stayed under the over/under line. To add more context on the difference in quality between teams, we went through the dataset and calculated each team's current record and point differential at the time of each game.
Since 16-game seasons are an awfully small sample size, a team's record can be misleading. Using the point differential we obtained, we calculated each team's Adjusted Pythagorean Expected Win Percentage at the time of each matchup, using Football Outsiders' formula. This gave a more accurate look at the true strength of each team, since it uses points scored and allowed to estimate what the team's win percentage should be.
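As a minimal sketch of the over/under result column described above, the labeling rule can be written as a small function. The function name, argument names, and label strings are illustrative and are not taken from the project's notebooks.

```python
def over_under_result(home_score, away_score, line):
    """Label a game relative to its over/under line (hypothetical helper)."""
    total = home_score + away_score
    if total > line:
        return "over"
    if total < line:
        return "under"
    # The total landed exactly on the line, so all bets are refunded.
    return "push"
```

Applied row by row to the score and line columns, this produces the new result column.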
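For illustration, the basic Pythagorean expectation can be sketched as below. The fixed exponent of 2.37 is an assumption (it is a value commonly cited for the NFL); the adjusted variant used in this project may compute the exponent differently, and the function name is hypothetical.

```python
def pythagorean_win_pct(points_for, points_against, exponent=2.37):
    """Expected win percentage from points scored and allowed.

    exponent=2.37 is an assumed default, not necessarily the value
    used in the project's notebooks.
    """
    if points_for == 0 and points_against == 0:
        # No scoring information yet; treat the team as average.
        return 0.5
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)
```

A team that has scored exactly as many points as it has allowed comes out at 0.5, and the estimate rises as the scoring margin grows.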
Since total points scored in a game is less reliant on the relative strength of each team and more dependent solely on offensive and defensive performance, we calculated for each team their average points scored and allowed per game at the time of each matchup.
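One way to compute these "at the time of each matchup" averages without leaking the current game's score into its own features is sketched here. The input layout (one tuple per team per game, in chronological order within a season) and the function name are assumptions for the example.

```python
from collections import defaultdict

def pregame_scoring_averages(team_games):
    """Return (points per game, points allowed per game) before each game.

    team_games: chronological list of (team, points_for, points_against)
    tuples for a single season (assumed layout). Teams with no earlier
    games get (None, None).
    """
    # Per team: games played, points scored, points allowed so far.
    totals = defaultdict(lambda: [0, 0, 0])
    averages = []
    for team, pf, pa in team_games:
        played, scored, allowed = totals[team]
        if played:
            averages.append((scored / played, allowed / played))
        else:
            averages.append((None, None))
        # Update the running totals only after recording the pre-game value.
        totals[team] = [played + 1, scored + pf, allowed + pa]
    return averages
```

Recording the average before updating the totals is the key design choice: each row only sees games that had already been played.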
While we did not have injury data, which would have clued our model in on when a team's actual performance would be worse than its expected performance, we were able to calculate a rolling win percentage over each team's last four games. This provided a more accurate glimpse of how good or bad the team currently was, rather than its season-to-date performance.
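The rolling four-game win percentage can be sketched with a fixed-length queue. The encoding of results (1 for a win, 0.5 for a tie, 0 for a loss) and the function name are assumptions made for this example.

```python
from collections import deque

def rolling_win_pct(results, window=4):
    """Win percentage over the previous `window` games, before each game.

    results: one team's chronological results, encoded as 1 (win),
    0.5 (tie), or 0 (loss). Returns None before the team's first game.
    """
    recent = deque(maxlen=window)  # old results drop off automatically
    rolling = []
    for result in results:
        rolling.append(sum(recent) / len(recent) if recent else None)
        recent.append(result)
    return rolling
```

Early in a season the window holds fewer than four games, so the value is simply the win percentage over whatever games are available.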

Model and Feature Selection

To narrow down our features, we explored the correlation each had with the Over/Under line, as well as with each other so that we could limit multicollinearity. We ultimately settled on 7 features, including points per game, points allowed per game, temperature, wind, dome (binary 1 or 0), and season (to account for changes between eras).
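A rough sketch of the pairwise check for multicollinearity follows. The 0.8 threshold, the function names, and the sample feature names in the usage note are illustrative assumptions, not values from the project.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def collinear_pairs(features, threshold=0.8):
    """List feature pairs whose absolute correlation meets the threshold.

    features: dict mapping a feature name to its list of values.
    threshold: assumed cutoff for "too correlated" (0.8 is arbitrary).
    """
    flagged = []
    for a, b in combinations(features, 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged
```

For example, passing a dict with keys such as `"ppg"`, `"win_pct"`, and `"wind"` returns only the pairs that move together strongly, which are candidates for dropping one of the two features.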

Correlation heatmap

Regression for Points per Game

For our model, we tested linear, log-linear, and log-log regression models, settling on a linear regression, which fit our data best. Due to the odd distribution of NFL scores (most scoring plays are worth either 3 or 7 points), we applied the Box-Cox power transformation to each of our variables to bring them closer to a normal distribution. Our final regression model had an Adjusted R^2 of 0.697.
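For reference, the two formulas mentioned above can be sketched in plain Python. The function names are hypothetical, and the power parameter `lam` is passed in by hand here; in practice it is usually estimated from the data (for example, `scipy.stats.boxcox` fits it by maximum likelihood). Box-Cox requires strictly positive inputs, so variables that can be zero would need a shift first, which is an assumption about preprocessing rather than something stated in this README.

```python
import math

def box_cox(x, lam):
    """Box-Cox power transform of a single strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)  # limiting case of the power transform
    return (x ** lam - 1) / lam

def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R^2, which penalizes R^2 for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)
```

The adjustment matters most when the number of observations is small relative to the number of features; with decades of games it stays close to the plain R^2.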

Boxcox Transformation for Points per Game

Clustering the Data

In order to train a more accurate regression model, we experimented with clustering our data, classifying games as being one of four types: good offense vs. good defense, good offense vs. bad defense, bad offense vs. bad defense, and bad offense vs. good defense. Our hypothesis was that by clustering the games into these four categories and regressing on each individually, we'd be able to predict Over/Under lines with even higher accuracy. However, after testing this on 4 separate regression models, we found that the regression on each cluster was, in fact, less accurate than the overall regression model, since having a good offense far outweighed having a strong defense.
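To make the four game types concrete, here is a simplified stand-in that labels a matchup by comparing each side to a league average. This threshold rule is an assumption for illustration only; it is not the clustering procedure used in the notebooks, and the function and argument names are hypothetical.

```python
def game_type(offense_ppg, defense_papg, league_avg):
    """Label a matchup as one of the four game types.

    offense_ppg: points per game scored by the offense.
    defense_papg: points allowed per game by the opposing defense.
    league_avg: league-average points per team per game (assumed cutoff).
    """
    offense = "good offense" if offense_ppg > league_avg else "bad offense"
    # A defense is "good" when it allows fewer points than average.
    defense = "good defense" if defense_papg < league_avg else "bad defense"
    return f"{offense} vs. {defense}"
```

Each labeled subset would then get its own regression, which is the per-cluster setup that turned out to be less accurate than the single overall model.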

Data Clustered into 4 Game Types

Week One Predictions

After running our regressions, we plugged in the data from Week 1 of the 2018 season and predicted what the Over/Under line should be for each game. If our line was higher than the actual line, we recommended betting the over, and vice versa for the under. Using Naive Bayes, we then calculated the probability of a game going over or under, given our prediction and the history of past games with the same line.
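The recommendation rule and a small Naive Bayes estimate can be sketched as follows. The choice of features (the model's recommendation and the posted line), the Laplace smoothing, the exclusion of pushes from the history, and all names are assumptions for the example; the notebooks may set this up differently.

```python
def recommend(predicted_line, actual_line):
    """Bet the over when the model's line is above the posted line."""
    if predicted_line > actual_line:
        return "over"
    if predicted_line < actual_line:
        return "under"
    return "no bet"

def naive_bayes_over_prob(history, recommendation, line, alpha=1.0):
    """Estimate P(over | recommendation, line) with categorical Naive Bayes.

    history: list of (recommendation, line, outcome) tuples, where outcome
    is "over" or "under" (pushes are assumed to be excluded).
    alpha: Laplace smoothing constant (assumed value).
    """
    classes = ("over", "under")
    rec_values = {h[0] for h in history} | {recommendation}
    line_values = {h[1] for h in history} | {line}
    scores = {}
    for c in classes:
        rows = [h for h in history if h[2] == c]
        prior = (len(rows) + alpha) / (len(history) + alpha * len(classes))
        p_rec = (sum(h[0] == recommendation for h in rows) + alpha) / (
            len(rows) + alpha * len(rec_values))
        p_line = (sum(h[1] == line for h in rows) + alpha) / (
            len(rows) + alpha * len(line_values))
        # Naive Bayes treats the two features as independent given the class.
        scores[c] = prior * p_rec * p_line
    return scores["over"] / (scores["over"] + scores["under"])
```

Smoothing keeps the estimate defined for lines or recommendations that never appeared with a given outcome in the history.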

Week 1 Predictions

For Week 1, we correctly predicted 9 of 16 games (56.25%). We also ran every prior game through our model to see how accurate it would have been if we had bet on every single game since 1979 (Weeks 5-16). Our model gave good predictions 54.79% of the time, where a good prediction is one that guided the bettor to a win or a push.
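The back-test metric described above, where a win or a push both count as a good prediction, can be sketched like this. The function name and label strings are illustrative.

```python
def good_prediction_rate(recommendations, outcomes):
    """Share of bets that ended in a win or a push.

    recommendations: list of "over"/"under" picks from the model.
    outcomes: list of "over"/"under"/"push" results for the same games.
    """
    good = sum(
        1
        for rec, outcome in zip(recommendations, outcomes)
        if outcome == "push" or rec == outcome
    )
    return good / len(recommendations)
```

Counting pushes as good makes this rate slightly more generous than a pure win rate, which is worth keeping in mind when comparing it to a break-even threshold.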


slieb74/NFL-Betting-Data
