# Ideation

Are there any arbitrage opportunities in NFL over/under betting?

1. Pull in historical NFL games and any relevant information. Maybe use Madden ratings again.
2. Data cleanup. If I use home/away, potentially scramble or average it out for 2020, since games were played without fans. Handle players with duplicate names, and teams that changed location (Raiders) or name (Washington).
3. Which features are most relevant? If nothing is relevant enough, scrap the idea.
4. If some features are relevant, use ML (this needs to be treated as a time series).
5. The model should be a regression rather than a classifier, targeting how many points I expect the teams to score.
6. Get historical over/under betting odds.
7. Decide how much to wager based on how far my prediction differs from the bookmaker's line (bet more when the difference is significant).
8. Backtest the betting methodology and review the results.
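
Steps 7 and 8 could be sketched roughly as follows. This is a minimal sketch, not a tested strategy: the function names `stake_for_edge` and `backtest_totals`, the 3-point threshold, the unit sizes, and the -110 pricing on both sides are all assumptions for illustration, and the three games are made-up toy rows.

```python
import pandas as pd

# Assumption: standard -110 pricing on both sides, i.e. risk 110 to win 100.
DECIMAL_ODDS = 1 + 100 / 110


def stake_for_edge(predicted_total, line, min_edge=3.0, unit=10.0, max_units=3):
    """Return (side, stake): bet more as the prediction moves away from the line.

    Hypothetical rule: no bet below `min_edge` points, then one unit per
    full `min_edge` points of difference, capped at `max_units`.
    """
    edge = predicted_total - line
    if abs(edge) < min_edge:
        return None, 0.0
    units = min(int(abs(edge) // min_edge), max_units)
    side = "over" if edge > 0 else "under"
    return side, units * unit


def backtest_totals(games):
    """Settle each bet against the actual total and return profit per game."""
    profits = []
    for row in games.itertuples():
        side, stake = stake_for_edge(row.predicted_total, row.over_under_line)
        if side is None or row.total_score == row.over_under_line:
            profits.append(0.0)  # no bet, or a push (stake returned)
        elif (side == "over") == (row.total_score > row.over_under_line):
            profits.append(stake * (DECIMAL_ODDS - 1))  # winning bet
        else:
            profits.append(-stake)  # losing bet
    return pd.Series(profits, index=games.index, name="profit")


# Toy rows (made up) just to exercise the functions.
toy_games = pd.DataFrame({
    "predicted_total": [52.0, 40.0, 45.0],
    "over_under_line": [45.0, 47.5, 44.0],
    "total_score": [55.0, 50.0, 44.0],
})
toy_games["profit"] = backtest_totals(toy_games)
print(toy_games)
print("Total profit:", round(toy_games["profit"].sum(), 2))
```

In a real backtest the predictions would have to come from a model trained only on games played before each bet, in line with the time-series point in step 4.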


Other notes:
https://github.com/jp-wright/nfl_betting_market_analysis 

The Spread:
"This finding is shown consistently in this project as the matchup features rank high in importance."
Weighted DVOA (a measure of how well a team has been playing recently) is the second-biggest predictor.

Over Under:
Unsurprisingly the most important features for determining the combined points scored in a game are statistics that relate to how effective a team is at scoring or preventing a score. We see a strong divergence from important features in predicting the spread, with no matchup delta metrics present. This fits common sense as we aren't concerned with how much better Team A is than Team B at something, but rather how good or bad both teams combined are.
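
The combined-versus-delta distinction in the note above could be sketched with the Madden-style columns used later in this notebook (`home_off`, `home_def`, `away_off`, `away_def`). The new feature names and the two toy rows are assumptions for illustration only.

```python
import pandas as pd

# Toy ratings (made up); column names follow the Madden merge further down.
games = pd.DataFrame({
    "home_off": [92, 80], "home_def": [88, 80],
    "away_off": [82, 73], "away_def": [91, 85],
})

# Combined features: how good both teams are together (suited to totals).
games["combined_off"] = games.home_off + games.away_off
games["combined_def"] = games.home_def + games.away_def

# Matchup deltas: how much better one side is (suited to the spread).
games["home_off_vs_away_def"] = games.home_off - games.away_def
games["away_off_vs_home_def"] = games.away_off - games.home_def

print(games)
```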



1. Pulled data from: 
2. Deleted all data from 1966-1978, as it had no betting information. Data that old is not a great predictor anyway (see jp-wright).
3. Weather: no data in 2019/2020. Do I need to fill it in?
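
Before deciding whether to fill in weather, it may help to measure how much is missing per season. A minimal sketch, assuming the column names from `nfl_games_and_bets.csv` as displayed in the cells below; the helper name `missing_share_by_season` and the toy rows are made up for illustration.

```python
import numpy as np
import pandas as pd

WEATHER_COLS = ["weather_temperature", "weather_wind_mph", "weather_humidity"]


def missing_share_by_season(df, cols=WEATHER_COLS):
    """Fraction of games per season with a missing value in each column."""
    return df[cols].isna().groupby(df["schedule_season"]).mean()


# Toy rows (made up) to show the shape of the result.
toy = pd.DataFrame({
    "schedule_season": [2018, 2018, 2020, 2020],
    "weather_temperature": [72.0, 55.0, np.nan, np.nan],
    "weather_wind_mph": [0.0, 9.0, np.nan, 4.0],
    "weather_humidity": [np.nan, 60.0, np.nan, np.nan],
})
print(missing_share_by_season(toy))
```

If a column turns out to be almost entirely missing for the seasons of interest, dropping it may be simpler than imputing it.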

In [1]:
# Imports etc

#!pip install scikit-optimize
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import mpl_toolkits
from functools import reduce
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import ensemble
#from skopt.space import Real, Integer
#from skopt.utils import use_named_args
#from skopt import gp_minimize

%matplotlib inline



In [2]:
# Get data
global_df = pd.read_csv("nfl_games_and_bets.csv")
global_df

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail
0,9/1/1979,1979,1,False,Tampa Bay Buccaneers,31.0,16.0,Detroit Lions,TB,-3.0,30.0,Houlihan's Stadium,False,79.0,9.0,87.0,
1,9/2/1979,1979,1,False,Buffalo Bills,7.0,9.0,Miami Dolphins,MIA,-5.0,39.0,Ralph Wilson Stadium,False,74.0,15.0,74.0,
2,9/2/1979,1979,1,False,Chicago Bears,6.0,3.0,Green Bay Packers,CHI,-3.0,31.0,Soldier Field,False,78.0,11.0,68.0,
3,9/2/1979,1979,1,False,Denver Broncos,10.0,0.0,Cincinnati Bengals,DEN,-3.0,31.5,Mile High Stadium,False,69.0,6.0,38.0,
4,9/2/1979,1979,1,False,Kansas City Chiefs,14.0,0.0,Baltimore Colts,KC,-1.0,37.0,Arrowhead Stadium,False,76.0,8.0,71.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10713,1/9/2021,2021,18,False,Tampa Bay Buccaneers,,,Carolina Panthers,,,,,,,,,
10714,1/9/2021,2021,18,False,Arizona Cardinals,,,Seattle Seahawks,,,,,,,,,
10715,1/9/2021,2021,18,False,Denver Broncos,,,Kansas City Chiefs,,,,,,,,,
10716,1/9/2021,2021,18,False,Las Vegas Raiders,,,Los Angeles Chargers,,,,,,,,,


In [3]:
# Drop 2021 season as it hasn't happened yet
pre_2021_df = global_df.drop(global_df[global_df.schedule_season == 2021].index)

# Drop everything before 2010 as it's probably not that helpful
recent_df = pre_2021_df.drop(pre_2021_df[pre_2021_df.schedule_season < 2010].index)

# Account for team moves
old_to_new_team_name = {"San Diego Chargers": "Los Angeles Chargers", "St. Louis Rams": "Los Angeles Rams", "Washington Redskins" : "Washington Football Team"}

recent_df

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail
7507,9/9/2010,2010,1,False,New Orleans Saints,14.0,9.0,Minnesota Vikings,NO,-5.0,49.5,Louisiana Superdome,False,72.0,0.0,,DOME
7508,9/12/2010,2010,1,False,Buffalo Bills,10.0,15.0,Miami Dolphins,MIA,-3.0,39.0,Ralph Wilson Stadium,False,64.0,7.0,81.0,
7509,9/12/2010,2010,1,False,Chicago Bears,19.0,14.0,Detroit Lions,CHI,-6.5,45.0,Soldier Field,False,75.0,1.0,45.0,
7510,9/12/2010,2010,1,False,Houston Texans,34.0,24.0,Indianapolis Colts,IND,-1.0,48.0,Reliant Stadium,False,89.0,5.0,,DOME (Open Roof)
7511,9/12/2010,2010,1,False,Jacksonville Jaguars,24.0,17.0,Denver Broncos,JAX,-3.0,41.5,EverBank Field,False,91.0,1.0,67.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10441,1/17/2021,2020,Division,True,Kansas City Chiefs,22.0,17.0,Cleveland Browns,KC,-8.0,56.0,Arrowhead Stadium,False,,,,
10442,1/17/2021,2020,Division,True,New Orleans Saints,20.0,30.0,Tampa Bay Buccaneers,NO,-2.5,53.0,Mercedes-Benz Superdome,False,,,,
10443,1/24/2021,2020,Conference,True,Green Bay Packers,26.0,31.0,Tampa Bay Buccaneers,GB,-3.0,53.0,Lambeau Field,False,,,,
10444,1/24/2021,2020,Conference,True,Kansas City Chiefs,38.0,24.0,Buffalo Bills,KC,-3.0,55.0,Arrowhead Stadium,False,,,,
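
The `old_to_new_team_name` mapping in the cell above is defined but never applied. One way to apply it, sketched on toy rows: the Raiders entry is an assumption based on the ideation notes, and `normalise_team_names` is a hypothetical helper name.

```python
import pandas as pd

# Mapping from the cell above, plus a Raiders entry (assumption, see lead-in).
old_to_new_team_name = {
    "San Diego Chargers": "Los Angeles Chargers",
    "St. Louis Rams": "Los Angeles Rams",
    "Washington Redskins": "Washington Football Team",
    "Oakland Raiders": "Las Vegas Raiders",
}


def normalise_team_names(df, mapping=old_to_new_team_name):
    """Return a copy with old franchise names replaced in both team columns."""
    out = df.copy()
    for col in ("team_home", "team_away"):
        out[col] = out[col].replace(mapping)
    return out


# Toy rows (made up) to exercise the helper.
toy = pd.DataFrame({
    "team_home": ["San Diego Chargers", "Buffalo Bills"],
    "team_away": ["Oakland Raiders", "St. Louis Rams"],
})
print(normalise_team_names(toy))
```

If the Madden ratings file keys each team by the name it used in that season, renaming before that merge could break matches, so the order of the two steps matters.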


In [4]:
# How often does the total land far from the over/under line (in either direction)?
# pre_2021_df_with_score is the same object as recent_df, so recent_df gains total_score too
pre_2021_df_with_score = recent_df
pre_2021_df_with_score['total_score'] = pre_2021_df_with_score.score_home + pre_2021_df_with_score.score_away

# Games that missed the line by more than 20 points, and by more than 30 points
line_miss = abs(pre_2021_df_with_score.total_score - pre_2021_df_with_score.over_under_line)
outliers_df = pre_2021_df_with_score[line_miss > 20]
super_outliers_df = pre_2021_df_with_score[line_miss > 30]
outliers_df

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail,total_score
7507,9/9/2010,2010,1,False,New Orleans Saints,14.0,9.0,Minnesota Vikings,NO,-5.0,49.5,Louisiana Superdome,False,72.0,0.0,,DOME,23.0
7529,9/19/2010,2010,2,False,Detroit Lions,32.0,35.0,Philadelphia Eagles,PHI,-6.5,41.0,Ford Field,False,72.0,0.0,,DOME,67.0
7548,9/26/2010,2010,3,False,New England Patriots,38.0,30.0,Buffalo Bills,NE,-14.5,43.0,Gillette Stadium,False,68.0,11.0,,,68.0
7561,10/3/2010,2010,4,False,New York Giants,17.0,3.0,Chicago Bears,NYG,-3.5,44.0,MetLife Stadium,False,59.0,1.0,52.0,,20.0
7571,10/10/2010,2010,5,False,Buffalo Bills,26.0,36.0,Jacksonville Jaguars,BUF,-2.5,41.5,Ralph Wilson Stadium,False,63.0,1.0,52.0,,62.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10417,1/3/2021,2020,17,False,Buffalo Bills,56.0,26.0,Miami Dolphins,MIA,-3.0,42.5,New Era Field,False,,,,,82.0
10424,1/3/2021,2020,17,False,Houston Texans,38.0,41.0,Tennessee Titans,TEN,-7.0,55.5,NRG Stadium,False,72.0,0.0,,DOME,79.0
10437,1/10/2021,2020,Wildcard,True,Pittsburgh Steelers,37.0,48.0,Cleveland Browns,PIT,-5.5,47.5,Heinz Field,False,,,,,85.0
10438,1/10/2021,2020,Wildcard,True,Tennessee Titans,13.0,20.0,Baltimore Ravens,BAL,-3.5,53.5,Nissan Stadium,False,,,,,33.0
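
Because the filter uses `abs(...)`, `outliers_df` mixes games that landed far over and far under the line. A sketch of splitting the two directions with a signed miss; the helper name `outlier_rates` and the toy rows are assumptions for illustration.

```python
import pandas as pd


def outlier_rates(df, threshold=20):
    """Share of games finishing more than `threshold` points over/under the line."""
    line_miss = df["total_score"] - df["over_under_line"]  # positive = went over
    return pd.Series({
        "over": (line_miss > threshold).mean(),
        "under": (line_miss < -threshold).mean(),
    })


# Toy rows to exercise the helper.
toy = pd.DataFrame({
    "total_score": [23.0, 67.0, 45.0, 82.0],
    "over_under_line": [49.5, 41.0, 44.0, 42.5],
})
print(outlier_rates(toy, threshold=20))
```

Running the same helper on `recent_df` with thresholds of 20 and 30 would answer the over and under questions separately.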


In [5]:
recent_df.describe()

Unnamed: 0,schedule_season,score_home,score_away,spread_favorite,over_under_line,weather_temperature,weather_wind_mph,weather_humidity,total_score
count,2939.0,2939.0,2939.0,2939.0,2939.0,2501.0,2499.0,672.0,2939.0
mean,2015.003403,23.912215,21.844505,-5.290915,45.321028,62.834866,4.591437,58.436012,45.75672
std,3.164428,10.289965,9.921303,3.394564,4.42448,15.414497,4.63054,18.927473,13.995296
min,2010.0,0.0,0.0,-26.5,33.0,-6.0,0.0,4.0,6.0
25%,2012.0,17.0,15.0,-7.0,42.5,53.0,0.0,45.0,36.0
50%,2015.0,24.0,21.0,-4.0,45.0,70.0,4.0,57.0,45.0
75%,2018.0,31.0,28.0,-3.0,48.0,72.0,7.0,72.25,54.5
max,2020.0,62.0,59.0,0.0,63.5,97.0,40.0,100.0,105.0


In [6]:
outliers_df.describe()

Unnamed: 0,schedule_season,score_home,score_away,spread_favorite,over_under_line,weather_temperature,weather_wind_mph,weather_humidity,total_score
count,360.0,360.0,360.0,360.0,360.0,311.0,311.0,85.0,360.0
mean,2015.038889,26.819444,24.880556,-5.213889,45.7625,61.733119,4.498392,61.223529,51.7
std,3.083542,15.694087,14.65612,3.393846,4.576628,17.017339,4.568387,19.342389,26.690092
min,2010.0,0.0,0.0,-20.0,33.0,-6.0,0.0,23.0,6.0
25%,2013.0,13.0,10.75,-7.0,42.5,52.0,0.0,47.0,23.0
50%,2015.0,30.0,27.0,-4.0,45.5,69.0,4.0,61.0,65.0
75%,2018.0,39.0,37.0,-3.0,48.5,72.0,7.0,78.0,73.0
max,2020.0,59.0,59.0,0.0,63.5,89.0,23.0,100.0,105.0


In [8]:
super_outliers_df.describe()

Unnamed: 0,schedule_season,score_home,score_away,spread_favorite,over_under_line,weather_temperature,weather_wind_mph,weather_humidity,total_score
count,74.0,74.0,74.0,74.0,74.0,65.0,65.0,22.0,74.0
mean,2014.891892,33.945946,32.621622,-4.912162,45.817568,61.415385,3.846154,52.454545,66.567568
std,3.414584,17.166303,15.236674,2.987827,4.811925,16.944664,4.047494,16.247277,29.933754
min,2010.0,0.0,0.0,-15.5,36.0,18.0,0.0,32.0,6.0
25%,2012.0,28.75,24.5,-6.5,42.5,54.0,0.0,39.5,71.25
50%,2015.5,39.0,35.5,-4.0,45.5,68.0,3.0,48.5,78.5
75%,2018.0,45.0,43.0,-3.0,48.5,72.0,7.0,55.75,85.0
max,2020.0,56.0,59.0,-1.0,63.5,86.0,17.0,90.0,105.0


In [9]:
# Next: use Madden offensive/defensive ratings to see if they make any difference

# Keep the 2017-2020 seasons (2021 has not been played yet)
df_2017_to_2020 = global_df.drop(global_df[global_df.schedule_season == 2021].index)
df_2017_to_2020 = df_2017_to_2020.drop(df_2017_to_2020[df_2017_to_2020.schedule_season < 2017].index)

madden_ratings = pd.read_csv("madden_team_ratings.csv")

# Attach the home team's offense/defense ratings for that season
df_with_madden = pd.merge(df_2017_to_2020, madden_ratings, how='left', left_on=['schedule_season', 'team_home'], right_on=['Year', 'Team']) \
    .drop(columns=['Team', 'Overall', 'Year']) \
    .rename(columns={'Offense': 'home_off', 'Defense' : 'home_def'})

# Attach the away team's offense/defense ratings for that season
df_with_madden = pd.merge(df_with_madden, madden_ratings, how='left', left_on=['schedule_season', 'team_away'], right_on=['Year', 'Team']) \
    .drop(columns=['Team', 'Overall', 'Year']) \
    .rename(columns={'Offense': 'away_off', 'Defense' : 'away_def'})

# Combined points scored by both teams
df_with_madden['total_score'] = df_with_madden.score_home + df_with_madden.score_away
df_with_madden

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,...,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail,home_off,home_def,away_off,away_def,total_score
0,9/7/2017,2017,1,False,New England Patriots,27.0,42.0,Kansas City Chiefs,NE,-9.0,...,False,63.0,7.0,,,92.0,88.0,82.0,91.0,69.0
1,9/10/2017,2017,1,False,Buffalo Bills,21.0,12.0,New York Jets,BUF,-9.5,...,False,61.0,5.0,,,80.0,80.0,73.0,85.0,33.0
2,9/10/2017,2017,1,False,Chicago Bears,17.0,23.0,Atlanta Falcons,ATL,-7.0,...,False,66.0,9.0,,,78.0,82.0,89.0,80.0,40.0
3,9/10/2017,2017,1,False,Cincinnati Bengals,0.0,20.0,Baltimore Ravens,CIN,-3.0,...,False,71.0,8.0,,,79.0,80.0,80.0,85.0,20.0
4,9/10/2017,2017,1,False,Cleveland Browns,18.0,21.0,Pittsburgh Steelers,PIT,-9.0,...,False,67.0,9.0,,,76.0,78.0,89.0,84.0,39.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1065,1/17/2021,2020,Division,True,Kansas City Chiefs,22.0,17.0,Cleveland Browns,KC,-8.0,...,False,,,,,92.0,77.0,82.0,76.0,39.0
1066,1/17/2021,2020,Division,True,New Orleans Saints,20.0,30.0,Tampa Bay Buccaneers,NO,-2.5,...,False,,,,,96.0,86.0,87.0,83.0,50.0
1067,1/24/2021,2020,Conference,True,Green Bay Packers,26.0,31.0,Tampa Bay Buccaneers,GB,-3.0,...,False,,,,,89.0,80.0,87.0,83.0,57.0
1068,1/24/2021,2020,Conference,True,Kansas City Chiefs,38.0,24.0,Buffalo Bills,KC,-3.0,...,False,,,,,92.0,77.0,77.0,84.0,62.0


In [10]:
df_with_madden_outliers_over = df_with_madden[df_with_madden.total_score - df_with_madden.over_under_line > 20]
df_with_madden_outliers_under = df_with_madden[df_with_madden.over_under_line - df_with_madden.total_score > 20]
df_with_madden_outliers_over

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,...,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail,home_off,home_def,away_off,away_def,total_score
0,9/7/2017,2017,1,False,New England Patriots,27.0,42.0,Kansas City Chiefs,NE,-9.0,...,False,63.0,7.0,,,92.0,88.0,82.0,91.0,69.0
26,9/17/2017,2017,2,False,Oakland Raiders,45.0,20.0,New York Jets,LVR,-14.0,...,False,69.0,6.0,,,86.0,79.0,73.0,85.0,65.0
31,9/21/2017,2017,3,False,San Francisco 49ers,39.0,41.0,Los Angeles Rams,LAR,-2.0,...,False,65.0,11.0,,,75.0,76.0,74.0,86.0,80.0
41,9/24/2017,2017,3,False,New England Patriots,36.0,33.0,Houston Texans,NE,-13.0,...,False,76.0,8.0,,,92.0,88.0,77.0,90.0,69.0
54,10/1/2017,2017,4,False,Houston Texans,57.0,14.0,Tennessee Titans,TEN,-2.0,...,False,72.0,0.0,,DOME,77.0,90.0,86.0,79.0,71.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1025,12/25/2020,2020,16,False,New Orleans Saints,52.0,33.0,Minnesota Vikings,NO,-6.5,...,False,72.0,0.0,,DOME,96.0,86.0,80.0,81.0,85.0
1032,12/27/2020,2020,16,False,Houston Texans,31.0,37.0,Cincinnati Bengals,HOU,-7.5,...,False,72.0,0.0,,DOME,78.0,79.0,71.0,85.0,68.0
1041,1/3/2021,2020,17,False,Buffalo Bills,56.0,26.0,Miami Dolphins,MIA,-3.0,...,False,,,,,77.0,84.0,69.0,78.0,82.0
1048,1/3/2021,2020,17,False,Houston Texans,38.0,41.0,Tennessee Titans,TEN,-7.0,...,False,72.0,0.0,,DOME,78.0,79.0,77.0,81.0,79.0


In [11]:
df_with_madden.describe()

Unnamed: 0,schedule_season,score_home,score_away,spread_favorite,over_under_line,weather_temperature,weather_wind_mph,weather_humidity,home_off,home_def,away_off,away_def,total_score
count,1070.0,1070.0,1070.0,1070.0,1070.0,666.0,665.0,1.0,1062.0,1062.0,1062.0,1062.0,1070.0
mean,2018.502804,23.790654,22.557944,-5.458879,46.038318,63.995495,4.47218,78.0,81.293785,81.629944,81.298493,81.515066,46.348598
std,1.119389,10.234076,10.117916,3.573522,4.582415,15.303987,5.018,,5.757292,5.246207,5.722319,5.224395,14.361658
min,2017.0,0.0,0.0,-21.5,35.0,10.0,0.0,78.0,66.0,69.0,66.0,69.0,6.0
25%,2018.0,17.0,16.0,-7.0,43.0,55.0,0.0,78.0,78.0,78.0,78.0,78.0,37.0
50%,2019.0,24.0,23.0,-4.5,46.0,72.0,3.0,78.0,81.0,82.0,81.0,82.0,46.0
75%,2020.0,31.0,30.0,-3.0,49.0,72.0,8.0,78.0,85.0,85.0,85.0,85.0,55.0
max,2020.0,57.0,59.0,0.0,63.5,97.0,24.0,78.0,97.0,93.0,97.0,93.0,105.0
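
In the `describe()` output above, each rating column has 1062 non-null values against 1070 games, so a few rows found no match in the ratings file. A sketch for listing the unmatched keys using `indicator=True` on `pd.merge`; the toy frames are made up, and the helper name `unmatched_keys` is hypothetical.

```python
import pandas as pd


def unmatched_keys(games, ratings, team_col):
    """Return (season, team) pairs in `games` with no row in `ratings`."""
    merged = pd.merge(
        games, ratings, how="left",
        left_on=["schedule_season", team_col], right_on=["Year", "Team"],
        indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
    )
    missing = merged.loc[merged["_merge"] == "left_only",
                         ["schedule_season", team_col]]
    return missing.drop_duplicates().reset_index(drop=True)


# Toy frames (made up): the 2020 game uses a name the ratings do not have.
toy_games = pd.DataFrame({
    "schedule_season": [2019, 2020],
    "team_home": ["Oakland Raiders", "Las Vegas Raiders"],
})
toy_ratings = pd.DataFrame({
    "Year": [2019, 2020],
    "Team": ["Oakland Raiders", "Oakland Raiders"],
    "Offense": [80, 81],
    "Defense": [78, 79],
})
print(unmatched_keys(toy_games, toy_ratings, "team_home"))
```

Calling it once with `"team_home"` and once with `"team_away"` would show which team names or seasons need fixing before the ratings are used as features.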


In [12]:
df_with_madden_outliers_under.describe()

Unnamed: 0,schedule_season,score_home,score_away,spread_favorite,over_under_line,weather_temperature,weather_wind_mph,weather_humidity,home_off,home_def,away_off,away_def,total_score
count,62.0,62.0,62.0,62.0,62.0,42.0,42.0,0.0,62.0,62.0,61.0,61.0,62.0
mean,2018.193548,10.5,11.209677,-5.08871,46.725806,64.047619,6.285714,,81.467742,81.580645,82.721311,82.557377,21.709677
std,1.083992,7.05424,6.606348,3.432635,4.460829,18.12686,5.848851,,5.553666,5.271477,5.320185,4.934655,5.781222
min,2017.0,0.0,0.0,-13.5,39.0,10.0,0.0,,67.0,71.0,67.0,72.0,6.0
25%,2017.0,6.0,7.0,-7.5,42.625,50.75,0.0,,79.0,77.0,80.0,79.0,19.25
50%,2018.0,10.0,10.0,-3.5,46.5,72.0,5.0,,81.5,81.0,83.0,83.0,21.5
75%,2019.0,15.75,16.0,-2.5,49.875,72.0,9.0,,85.0,85.0,86.0,86.0,26.0
max,2020.0,27.0,23.0,-1.0,56.5,89.0,18.0,,92.0,93.0,92.0,92.0,33.0
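
To read the two `describe()` tables side by side, the group means can be subtracted directly. A minimal sketch on toy data: `rating_mean_gap` is a hypothetical helper, the toy values are made up, and the 20-point threshold mirrors the filter used for `df_with_madden_outliers_under`.

```python
import pandas as pd

RATING_COLS = ["home_off", "home_def", "away_off", "away_def"]


def rating_mean_gap(df, threshold=20, cols=RATING_COLS):
    """Mean rating in far-under games minus the mean over all games."""
    under = df[df["over_under_line"] - df["total_score"] > threshold]
    return under[cols].mean() - df[cols].mean()


# Toy rows (made up); rows 0 and 3 finish more than 20 points under the line.
toy = pd.DataFrame({
    "over_under_line": [45.0, 46.0, 44.0, 47.0],
    "total_score": [20.0, 50.0, 43.0, 24.0],
    "home_off": [80.0, 90.0, 85.0, 78.0],
    "home_def": [88.0, 80.0, 82.0, 86.0],
    "away_off": [79.0, 88.0, 84.0, 77.0],
    "away_def": [87.0, 79.0, 83.0, 85.0],
})
print(rating_mean_gap(toy))
```

In the tables above the home-side means are nearly identical (for example `home_off` 81.47 in far-under games against 81.29 overall), while `away_off` differs by a little over a point (82.72 against 81.30), so a check like this helps judge whether the ratings separate these games at all.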

