# Beat the Books

#### A data science project by Jonathan Sears

### Project Plan
The main goal of this project is to find a way to profit of of sports betting. There are several reasons I want to do this. First I want to make money, I feel like that is pretty self explanatory. Additionally, sports books and casinos always stack the odds in their favor, so I think it would be cool to find a way to beat them at their own game. 

### The Plan
My approach to beat the books will be as follows:

1) Build a couple different machine learning models. One that given some data we can aquire before a game starts, predicts the winner of a game (moneyline), one that predicts the point differential (spread), and one that predicts the total number of points (over under). 
2) Scrape the odds from many different sportsbooks. Since sportsbooks operate independently, their odds are constantly changing. Exploting these discrepencies to get the best odds for any bet is essential if you want to beat the books. 
3) Using some math I'm going to calculate my predicted expected value for a given bet, if the expected value > 0, then I will classify it as a winning bet. Theoretically, if I place enough bets and my model is accurate, I should be able to beat the books. 

# Math

### Expected Value and the Law of Large Numbers
There is some important math we will need to have a strong understanding of during this project. The plus EV strategy relies on identifying bets with positive expected values, so it's important to have a strong understanding of what expected values are. The expected value of a bet will be:

    EV = P(event A happens)* (Payoff of event A) + P(event B) * (Payoff of event B)

When looking at this we can see how important it is to properly asses the probability of a given event happening is, as that's what will be the real difference between making and losing money.

The second important tool is the law of large numbers, which states that the more bets we place, the higher probability we have of reaching our expected value. This is important because while we might have a positive EV on a bet this doesn't guarantee that it will hit. What the law of large numbers is saying that if we place 1000 bets with 50% probability of hitting, then we have an incredibly high probability of hitting around 500 of them. Basically, the more bets we place, the closer we should get to converging to our expected value.  

### The Kelly Criterion
The Kelly Criterion is a simple mathematical formula we can use to size our bets:

f<sup>*</sup> = p - (1-p)/b 


**f<sup>*</sup>** is the fraction of our bankroll we should put on the bet

**p** is our estimated probability of winning

**b** is the proportion of the bet we stand to win (eg for 2:1 odds b =2)

# Data
The two datasets I found that I think will come in handy for this project are the 538 ELO model dataset and the spreadspoke historical odds dataset. Unfortunately I can't use the 538 ELO model as a predictor when building my own model, as it was discontinued before the 2023 NFL season. However, I think it could still be useful to compare it to my own model to get a gaugue of how well I'm doing. The spreadspoke dataset will likely be one of the most important finds for this whole project as historical odds, spreads, and over under lines will come in extremely handy when building my own model. Finding free data about the NFL turned out to be a lot harder than I thought. I couldn't find any datasets that had historical box scores or anything like that, so instead I decided to make my own by using webscraping. Lastly I needed to find a way to get current odds of NFL games from a wide range of sportsbooks. The Odds API came in extremely handy for this

**Data Sources:**
 
538 NFL ELO: https://github.com/fivethirtyeight/data/tree/master/nfl-elo

Spreadspoke: https://www.kaggle.com/datasets/tobycrabtree/nfl-scores-and-betting-data

Box scores scraped from: https://www.footballdb.com/games/

Odds API: https://the-odds-api.com/ 



Other potential sources:

NFL Data: https://pypi.org/project/nfl-data-py/

PFR webscraper: https://pypi.org/project/pro-football-reference-web-scraper/ 

League Average data by season from PFR: https://www.pro-football-reference.com/years/NFL/index.htm 

# ETL

In [2]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import io
from bs4 import BeautifulSoup
import re 
import json
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error
from random import randint
pd.set_option('display.max_columns',None)

In [None]:
#First lets read in our data
games = pd.read_csv('./Data/spreadspoke_scores.csv')
teams = pd.read_csv('./Data/nfl_teams.csv')
stadiums = pd.read_csv('./Data/nfl_stadiums.csv', encoding="unicode_escape")
past_elo = pd.read_csv('./Data/nfl_elo.csv')
recent_elo = pd.read_csv('./Data/nfl_elo_latest.csv')

In [None]:
#Let's also check our data types to make sure everything looks okay
games.dtypes
past_elo.dtypes
teams.dtypes
stadiums.dtypes

#lets change the neutral site from the elo stasets to a boolean type
past_elo['neutral'] = past_elo['neutral'].astype(bool)
recent_elo['neutral'] = recent_elo['neutral'].astype(bool)

#convert the date columns to datetime object
past_elo.loc[:,'date'] = pd.to_datetime(past_elo['date']).dt.floor('D')
recent_elo.loc[:,'date'] = pd.to_datetime(recent_elo['date']).dt.floor('D')
games.loc[:,'schedule_date'] = pd.to_datetime(games['schedule_date']).dt.floor('D')


In [None]:
# There's a lot of data in the elo datasets and a lot of it won't be useful for us. Lets drop all the data from before the superbowl era
sb_era_elo = past_elo[past_elo['season'] >= 1966]
#and lets add te recent games to this dataset as well
sb_era_elo = pd.concat([sb_era_elo,recent_elo])
#reindex the df
sb_era_elo = sb_era_elo.reindex(index=range(len(sb_era_elo)),copy=False)
#drop a random column that has all NaNs
sb_era_elo.dropna(subset = ['team1','team2','date','elo1_pre','elo2_pre'],inplace=True)

In [None]:
#Lets also drop games without betting data from the scores dataset
games.dropna(subset='spread_favorite', inplace=True)
games.dropna(subset='over_under_line', inplace=True)


Let's make some new columns indicating the winner of the game, who covered the spread, and if the over hit. 
For the over we will use 0 if the over did not hit, 1 if the over did hit, and 2 if the game was a push

In [None]:
def winner(df):
    if df['score_home'] > df['score_away']:
        return df['team_home']
    elif df['score_away'] > df['score_home']:
        return df['team_away']
    else:
        return 'Tie'
    
def over(df):
    if float(df['score_home'] + df['score_away']) > float(df['over_under_line']):
        return 1
    elif float(df['score_home'] + df['score_away']) < float(df['over_under_line']):
        return 0
    else:
        return 2


games['winner'] = games.apply(winner, axis = 1)
games['over'] = games.apply(over, axis = 1)

### Merging Datasets!

Let's merge the games and elo datasets into one massive dataset we can use to build our model

In [None]:
#Define functions to get the abbreviation for each team in the games dataset
def find_home_team_abbrev(df,):
    match = teams[teams['team_name'] == df['team_home']]
    abrev = match.iloc[0]['team_id']
    return abrev
def find_away_team_abbrev(df,):
    match = teams[teams['team_name'] == df['team_away']]
    abrev = match.iloc[0]['team_id']
    return abrev
games['home_abrev'] = games.apply(find_home_team_abbrev,axis = 1)
games['away_abrev'] = games.apply(find_away_team_abbrev,axis = 1)



In [None]:
#Create a gameID from the two date a dame was played, and the two team name abbreviations in alphabetical order
#function made for the teams_df
def make_game_id(teams_df):
    team1 = teams_df['home_abrev']
    team2 = teams_df['away_abrev']
    teams = [team1, team2]
    sorted_teams = sorted(teams)
    date_str = str(teams_df['schedule_date'])
    gameID = date_str + ' ' + teams[0] + ' vs ' + teams[1]
    return gameID
games['gameID'] = games.apply(make_game_id, axis = 1)
#function made for the elo_df
def make_game_id_2(elo_df):
    team1 = elo_df['team1']
    team2 = elo_df['team2']
    teams = [team1, team2]
    sorted_teams = sorted(teams)
    date_str = str(elo_df['date'])
    if type(teams[0]) != str:
        print(teams[0])
    if type(teams[1]) != str:
        print(teams[1])
    gameID = date_str + ' ' + teams[0] + ' vs ' + teams[1]
    return gameID
sb_era_elo['gameID'] = sb_era_elo.apply(make_game_id_2, axis = 1)

Now, we merge!

In [None]:
#merge the datasets
master_df = sb_era_elo.merge(games,on=['gameID'],how='inner')
master_df

This dataframe is massive and has some duplicate data, lets clean it up a bit. First Let's drop the columns we don't need anymore. Since we are using the data to predict the outcome of the game, the only relavent data is the data before the game, so let's drop all the elo adjustments that happen after the game. 

In [None]:
master_df.columns

In [None]:
#since we want to build a predictive model drop al of the posterior values
master_df.drop(['elo1_post',"elo2_post","qb1_value_post","qb2_value_post","qb1_game_value","qb2_game_value","qbelo1_post","qbelo2_post" ], axis=1,inplace=True)

In [None]:

def find_point_diff(df):
    '''
    function to find the actual point differential in a game. 
    define the point differential as favored team points - other team points
    This function is supposed to be applied to a dataframe
    '''
    if df['team_favorite_id'] == df['home_abrev']:
        return df['score_home'] - df['score_away']
    elif df['team_favorite_id'] == df['away_abrev']:
        return df['score_away'] - df['score_home']
    else:
        return np.nan
 
games['point_diff'] = games.apply(find_point_diff,axis=1)
games['point_total'] = games['score_home'] + games['score_away']
games

### Scraping

The boxscores from previous games could be useful information to have. Let's write a scraper to get every boxscore from 1978 to today from footballdb.com

In [2]:
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15'
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15'
    ]

In [6]:
# You don't need to run this code anymore just read from links.txt
years = list(range(1978,2024))
'''
use a different user agent and proxy for each request so we don't get banned since we are sending so many requests
first scrape all of the links to the boxscores
'''
with open('Data/http_proxies.txt', 'r') as file:
    ip_list = file.readlines()
links = []
for year in years:
    header = {"User-Agent": user_agents[randint(0,len(user_agents) - 1)]}
    # proxy = ip_list[randint(0,len(ip_list)-1)]
    # proxies = {
    #         "http":f"'http://{proxy[:-1]}",
    #         "https":f"https://{proxy[:-1]}"
    #         }
    r = requests.get("https://www.footballdb.com/games/index.html",params={"lg":"NFL","yr":year},headers=header)
    soup = BeautifulSoup(r.content,"html.parser")
    tables = soup.find_all("table",class_ = "statistics")
    tables

    for table in tables:
        tbl_links = (table.find_all("a", href = True))
        for link in tbl_links:
            links.append("https://www.footballdb.com/"+link['href'])


In [7]:
#write links to a file so we don't need to scrape multiple times and risk getting IP banned
with open("links.txt",'w') as fp:
    for link in links:
        fp.write(f"{link}\n")
    fp.close()


In [8]:
with open("links.txt",'r') as fp:
    urls = fp.readlines()
    fp.close()
urls
#for url in urls:
header = {"User-Agent": user_agents[randint(0,len(user_agents) - 1)]}
r = requests.get(urls[0],headers=header)


In [35]:
def parse_req(r,gameid):
    #parse the request to get the the box score tables
    soup = BeautifulSoup(r.content,'html.parser')
    stats = soup.find('div',id='divBox_team')
    table = stats.find_all('table',class_ = 'statistics')
    table_str = io.StringIO(str(table))
    tables = pd.read_html(table_str)
    pre_box_score = pd.concat(tables)
    #clean the data up
    pre_box_score.set_index('Unnamed: 0',inplace=True)
    pre_box_score['gamedID'] = gameid
    box_score = pre_box_score.T
    box_score[['passing-attempts','completions','int-thrown']] = box_score['Att-Comp-Int'].split('-', expand = True)
    box_score[['interceptions', 'int-return-yards']] = box_score['Interception Returns'].split('-',expand = True)
    box_score[['fumbles','fumbles-lost']] = box_score['Fumbles - Lost'].split('-',expand = True)
    box_score[['fga','fgm']] = box_score['Field Goals'].split('-',expand = True)
    box_score[['third-down-conv','third-downs','third-down-conv-rate']] = box_score['Third Downs'].split('-',expand = True)
    box_score[['punts','yards-per-punt']] = box_score['Punts - Average'].split('-',expand=True)
    box_score[['penalties','penalty-yards']] = box_score['Penalties - Yards'].split('-'expand=True)
    box_score.rename(columns={"First downs":"total first downs",
                              "Rushing": "rushing-first-downs",
                              "Passing": "passing-first-downs",
                              "Penalty": "penalty-first-downs",
                              "Average Gain": "avg-gain-rushing",
                              "Avg. Yards/Att": "yards-per-att",
                              })
    return 
demo = parse_req(r)
demo

Unnamed: 0,First downs,Rushing,Passing,Penalty,Total Net Yards,Net Yards Rushing,Rushing Plays,Average Gain,Net Yards Passing,Att - Comp - Int,...,Had Blocked,Punt Returns,Kickoff Returns,Interception Returns,Penalties - Yards,Fumbles - Lost,Field Goals,Third Downs,Total Plays,Average Gain.1
NY GiantsNYG,12,6,6,0,238,76,35,2.2,162,25-12-1,...,0,4--11,4-103,3-46,7-64,0-0,2-2,5-18-27%,64,3.7
Tampa BayTB,16,9,4,3,251,165,39,4.2,86,28-10-3,...,0,5-76,5-86,1-3,8-55,4-1,2-2,4-17-23%,68,3.7


In [36]:
demo.columns


Index(['First downs', 'Rushing', 'Passing', 'Penalty', 'Total Net Yards',
       'Net Yards Rushing', 'Rushing Plays', 'Average Gain',
       'Net Yards Passing', 'Att - Comp - Int', 'Sacked - Yds Lost',
       'Gross Yards Passing', 'Avg. Yds/Att', 'Punts - Average', 'Had Blocked',
       'Punt Returns', 'Kickoff Returns', 'Interception Returns',
       'Penalties - Yards', 'Fumbles - Lost', 'Field Goals', 'Third Downs',
       'Total Plays', 'Average Gain'],
      dtype='object', name='Unnamed: 0')

Scraping each book individually is going to be a pain, let's use the odds api to get the odds for every game for every bookmaker this week

In [3]:
#TODO: remove api key from request and add it as a parameter so you can read it in from a file
with open('api-key.txt','r') as api_key_file:
    api_key = api_key_file.read()
    api_key_file.close()

odds_req = requests.get(f"https://api.the-odds-api.com//v4/sports/americanfootball_nfl/odds/?apiKey={api_key}&regions=us,eu&markets=h2h,totals,spreads")

In [5]:

odds_json = json.loads(odds_req.content)
odds_df = pd.json_normalize(odds_json, record_path=['bookmakers','markets','outcomes'], meta=['id','commence_time','home_team','away_team',['bookmakers','title'],['bookmakers','markets','key'],])
odds_df

Unnamed: 0,name,price,point,id,commence_time,home_team,away_team,bookmakers.title,bookmakers.markets.key
0,Dallas Cowboys,1.22,,b2eeb176fc9adfc63b9098b313905792,2023-12-01T01:15:00Z,Dallas Cowboys,Seattle Seahawks,FanDuel,h2h
1,Seattle Seahawks,4.50,,b2eeb176fc9adfc63b9098b313905792,2023-12-01T01:15:00Z,Dallas Cowboys,Seattle Seahawks,FanDuel,h2h
2,Dallas Cowboys,1.91,-8.5,b2eeb176fc9adfc63b9098b313905792,2023-12-01T01:15:00Z,Dallas Cowboys,Seattle Seahawks,FanDuel,spreads
3,Seattle Seahawks,1.91,8.5,b2eeb176fc9adfc63b9098b313905792,2023-12-01T01:15:00Z,Dallas Cowboys,Seattle Seahawks,FanDuel,spreads
4,Over,1.93,47.5,b2eeb176fc9adfc63b9098b313905792,2023-12-01T01:15:00Z,Dallas Cowboys,Seattle Seahawks,FanDuel,totals
...,...,...,...,...,...,...,...,...,...
2875,Under,1.91,44.5,2c4ca7353e2db06d12d583f8a46d38a7,2023-12-12T01:16:00Z,Miami Dolphins,Tennessee Titans,Bovada,totals
2876,Miami Dolphins,1.91,-12.5,2c4ca7353e2db06d12d583f8a46d38a7,2023-12-12T01:16:00Z,Miami Dolphins,Tennessee Titans,BetUS,spreads
2877,Tennessee Titans,1.91,12.5,2c4ca7353e2db06d12d583f8a46d38a7,2023-12-12T01:16:00Z,Miami Dolphins,Tennessee Titans,BetUS,spreads
2878,Over,1.91,44.5,2c4ca7353e2db06d12d583f8a46d38a7,2023-12-12T01:16:00Z,Miami Dolphins,Tennessee Titans,BetUS,totals


# Exploratory Data Analysis

Let's look at how often teams actually cover the spread. In a perfect world (for the bookmakers), the probability of a team covering the spread would be 50/50. However, let's see the actual numbers

In [None]:
games['spread_favorite']
games.dropna(subset = ['spread_favorite','team_favorite_id'],inplace=True)
def covered(row):
    fav_covered = False
    if row['team_favorite_id'] == row['home_abrev']:
        point_diff = row['score_home']-row['score_away']
    else:
        point_diff = row['score_away'] - row['score_home']
    if point_diff > np.abs(row['spread_favorite']):
        return True
    elif point_diff < np.abs(row['spread_favorite']):
        return False
    else:
        return np.nan
games['favorite_covered'] = games.apply(covered,axis=1)
plt.bar(x=["favorite did not cover", "favorite covered"] ,height = games['favorite_covered'].value_counts())

It looks like the spread favorite only covers the spread about 5195 / (5195 + 5570) * 100 = 48.25824431% of the time! which means the books are approximately 1.75% off when calculating their spread probabilities. 
While this may look like a small percentage, it's enough to work with and give me hope that we can in fact beat the books.


Let's look at the relationship between the predicted ELO probabilities and the actual win rates of games

In [None]:
plt.scatter(x=master_df['elo1_pre'],y=master_df['elo2_pre'],)

In [None]:
games['total'] = games['score_away'] + games['score_home']
highest_scoring = games.nlargest(n=250,columns='total')
highest_scoring.describe()

Let's examine some of the highest ELO teams in the Superbowl era. By examining these teams we can try to look at similarities between them and see if we can identify common factors that indicate a team is on the rise. 

In [None]:
# first lets find the stronger opponent heading into a given matchup
sb_era_elo.loc[:,'stronger_team'] = sb_era_elo.loc[:,['elo1_pre','elo2_pre']].max(axis=1).copy()
#Now lets find the 250 strongest rated teams of the superbowl era and look at some summary statistics
strongest_250 = sb_era_elo.nlargest(250, columns='stronger_team')
strongest_250

In [None]:
display(strongest_250.columns)
def find_stronger_qb(row):
    if row['stronger_team'] == row['elo1_pre']:
        return row['qbelo1_pre']
    else:
        return row['qbelo2_pre']
strongest_250['stronger_team_qb_elo'] = strongest_250.apply(find_stronger_qb,axis=1)
strongest_250['stronger_team_qb_elo']


In [None]:
plt.bar(x=["strongest 250 team qb elo", "average qb elo"],height = [strongest_250['stronger_team_qb_elo'].mean(),(sb_era_elo['qbelo1_pre'].mean() + sb_era_elo['qbelo2_pre'].mean())/2])

These stats can give us a good idea of what an elite NFL team looks like. 

We can check out the correlation matrix for the superbowl era to give us a good idea of which variables are strongly related to each other

In [None]:
sb_era_elo.corr(numeric_only=True)

#### Hypothesis
Games played in colder weather will tend to be lower scoring and therefore will not hit the over

In [None]:
#Let's test out our theory
freezing = games[games['weather_temperature'] <= 32]
freezing['over'].value_counts()

It doesn't look like it makes a difference, however there might be another explanation...

In [None]:
display(games['over_under_line'].describe())
display(freezing['over_under_line'].describe())

It looks like bookmakers are already adjusting for the weather. We need to start thinking outside the box and look for factors they haven't though about yet if we want to get an edge. 

Lets look at games that hit the over vs games that did not hit the over

In [None]:

games['over'] = games['over'].map({0:'Red', 1:'Green', 2:'Blue'})
games.plot.scatter(x='over_under_line',y='total',c='over', alpha = .3)

This scatter plot shows us games that hit the over in green, games that hit the under in red, and games that pushed in blue, with total points on the y-axis, and the over under line on the x. This is a good way to visualize the data as we can clearly see where the over-under line is on the plot. 

# Building a Model

Onto the good stuff:
My approach for my model is will be:
- Build a Machine Learning model using the master dataframe that predicts the probability of a team winning a given game (classification)
- Build a second model that predicts point totals of a game (regression)
- potential independent variables for the models will be, Team ELOs, QB Elos, sportsbook odds, season record, momentum score (fraction of x previous games won), weather, injuries, and any other useful statitistics I can find
- Dependent variables for the model will be the winner/win probability for the first model, and the predicted score for each team in the second model
- Test the model using cross validation
- Use the predictions and probabilities from the model, along with the new odds from the Odds API to identify potential positive EV bets
- Test to see if our identified "positive EV" bets are actually profitable. 
- Repeat until we make a model that is profitable

The First Model! Let's start with a logistic regression model to predict scores for each team

In [None]:
master_df.columns

In [None]:
master_df[['elo1_pre','elo2_pre','elo_prob1','elo_prob2','qbelo1_pre','qbelo2_pre','quality','importance','over_under_line','spread_favorite','weather_temperature']]

In [None]:
x_train_dict = master_df[['elo1_pre','elo2_pre','elo_prob1','elo_prob2','qbelo1_pre','qbelo2_pre','quality','importance','over_under_line','spread_favorite',]].dropna(axis=1).to_dict(orient='records')
y_train = master_df['score1']
vec = DictVectorizer(sparse = False)
vec.fit(x_train_dict)
x_train = vec.transform(x_train_dict)
scaler = StandardScaler()
scaler.fit(x_train)
x_train_sc = scaler.transform(x_train)
model = LogisticRegression(solver='newton-cg',max_iter=8000)
model.fit(x_train_sc,y_train)
#let's test on our training data 
y_pred = model.predict(x_train_sc)
y_pred
mean_absolute_error(master_df['score1'],y_pred)

Yikes, that's a big error, and that's only our TRAINING error. We have some work to do...

We need to make sure we are getting the best odds from books, so let's write a function that gives us the best odds in odd_df

In [12]:
def find_best_odds(odds_df):
    '''
    function that performs a line search on the odds dataframe to find and return the best odds for each game
    returns: list of dataframes of the best odds for each market of each game
    '''
    ids = pd.unique(odds_df['id'])
    markets = pd.unique(odds_df['bookmakers.markets.key'])
    games = odds_df.groupby(by=['id','bookmakers.markets.key'])
    all_games_lst = []
    best_bets = []
    for id in ids:
        for market in markets:
            if market in odds_df[odds_df['id'] == id]['bookmakers.markets.key'].tolist():
                game_market_odds = odds_df.iloc[games.groups[(id,market)]]
                all_games_lst.append(game_market_odds)
    for game in all_games_lst:
        h2h_odds = game[game['bookmakers.markets.key'] == 'h2h']
        h2h_lay_odds = game[game['bookmakers.markets.key'] == 'h2h_lay']
        spread_odds = game[game['bookmakers.markets.key'] == 'spread']
        over_under_odds = game[game['bookmakers.markets.key'] == 'totals']

        home_odds_h2h = h2h_odds[h2h_odds['name'] == h2h_odds['home_team']]
        away_odds_h2h = h2h_odds[h2h_odds['name'] == h2h_odds['away_team']]
        best_home_odds_h2h = home_odds_h2h.loc[home_odds_h2h['price'] == home_odds_h2h['price'].max()]
        best_bets.append(best_home_odds_h2h)
        best_away_odds = away_odds_h2h.loc[away_odds_h2h['price'] == away_odds_h2h['price'].max()]
        best_bets.append(best_away_odds)


        home_odds_h2h_lay = h2h_lay_odds[h2h_lay_odds['name'] == h2h_lay_odds['home_team']]
        away_odds_h2h_lay = h2h_lay_odds[h2h_lay_odds['name'] == h2h_lay_odds['away_team']]
        best_home_odds_h2h_lay = home_odds_h2h_lay.loc[home_odds_h2h_lay['price'] == home_odds_h2h_lay['price'].max()]
        best_bets.append(best_home_odds_h2h_lay)
        best_away_odds_h2h_lay = away_odds_h2h_lay.loc[away_odds_h2h_lay['price'] == away_odds_h2h_lay['price'].max()]
        best_bets.append(best_away_odds_h2h_lay)

        home_odds_spread = spread_odds[spread_odds['name'] == spread_odds['home_team']]
        away_odds_spread = spread_odds[spread_odds['name'] == spread_odds['away_team']]
        spreads = pd.unique(spread_odds['point'])
        for spread in spreads:
            home_line_odds_spread = home_odds_spread.loc[home_odds_spread['price'] == spread]
            best_home_odds_for_spread = home_line_odds_spread.loc[home_line_odds_spread['price'] == home_line_odds_spread['price'].max()]
            best_bets.append(best_home_odds_for_spread)

            away_line_odds_spread = away_odds_spread.loc[away_odds_spread['price'] == spread]
            best_away_odds_for_spread = away_line_odds_spread.loc[away_line_odds_spread['price'] == away_line_odds_spread['price'].max()]
            best_bets.append(best_away_odds_for_spread)
     
        
        home_odds_over_under = over_under_odds[over_under_odds['name'] == over_under_odds['home_team']]
        away_odds_over_under = over_under_odds[over_under_odds['name'] == over_under_odds['away_team']]
        lines = pd.unique(over_under_odds['point'])
        for line in lines:
            home_line_odds = home_odds_over_under.loc[home_odds_over_under['price'] == line]
            best_home_line_price = home_line_odds.loc[home_odds_over_under['price'] == home_odds_over_under['price'].max()]
            best_bets.append(best_home_line_price)

            away_line_odds = away_odds_over_under.loc[away_odds_over_under['price'] == line]
            best_away_line_price = away_line_odds.loc[away_odds_over_under['price'] == away_odds_over_under['price'].max()]
            best_bets.append(best_away_line_price)

    return best_bets
    
best = find_best_odds(odds_df)

#remove empty dfs from best
best_odds = []
for df in best:
    if not df.empty:
        best_odds.append(df)
best_odds

[               name  price  point                                id  \
 82   Dallas Cowboys   1.25    NaN  b2eeb176fc9adfc63b9098b313905792   
 102  Dallas Cowboys   1.25    NaN  b2eeb176fc9adfc63b9098b313905792   
 120  Dallas Cowboys   1.25    NaN  b2eeb176fc9adfc63b9098b313905792   
 140  Dallas Cowboys   1.25    NaN  b2eeb176fc9adfc63b9098b313905792   
 
             commence_time       home_team         away_team bookmakers.title  \
 82   2023-12-01T01:15:00Z  Dallas Cowboys  Seattle Seahawks        Matchbook   
 102  2023-12-01T01:15:00Z  Dallas Cowboys  Seattle Seahawks           Unibet   
 120  2023-12-01T01:15:00Z  Dallas Cowboys  Seattle Seahawks         Pinnacle   
 140  2023-12-01T01:15:00Z  Dallas Cowboys  Seattle Seahawks          Betfair   
 
     bookmakers.markets.key  
 82                     h2h  
 102                    h2h  
 120                    h2h  
 140                    h2h  ,
                  name  price  point                                id  \
 141  

In [9]:
display(odds_df[odds_df['id'] == '4275f59e3fbef9a01b3007538c0b61fc']['bookmakers.markets.key'])

2022        h2h
2023        h2h
2024    spreads
2025    spreads
2026     totals
2027     totals
2028        h2h
2029        h2h
2030    spreads
2031    spreads
2032     totals
2033     totals
2034        h2h
2035        h2h
2036    spreads
2037    spreads
2038        h2h
2039        h2h
2040        h2h
2041        h2h
2042    spreads
2043    spreads
2044     totals
2045     totals
2046        h2h
2047        h2h
2048    spreads
2049    spreads
2050     totals
2051     totals
2052        h2h
2053        h2h
2054    spreads
2055    spreads
2056        h2h
2057        h2h
2058    spreads
2059    spreads
2060     totals
2061     totals
2062    spreads
2063    spreads
2064     totals
2065     totals
2066    spreads
2067    spreads
2068     totals
2069     totals
2070        h2h
2071        h2h
2072    spreads
2073    spreads
2074     totals
2075     totals
2076    spreads
2077    spreads
2078     totals
2079     totals
Name: bookmakers.markets.key, dtype: object

Now that we have the best possible odds, we can calculate our expected value of our bet

First use Kelly Criterion to size our bet:

f<sup>*</sup> = p - (1-p)/b 


**f<sup>*</sup>** is the fraction of our bankroll we should put on the bet

**p** is our estimated probability of winning

**b** is the proportion of the bet we stand to win (eg for 2:1 odds b =2)

In [None]:
def size_kelly_bet(bankroll, win_prob,odds):
    return bankroll * (win_prob - (1-win_prob)/odds)

Next let's calculate our expected value

In [None]:
def calc_ev(bet_size,odds,win_prob):
    ev = bet_size*odds*win_prob - bet_size(1-win_prob)
    return ev

and now let's search for positive EV bets!

In [None]:
#Final function should look something like this:
def find_plus_ev(win_prob,bankroll):
    plus_ev = []
    for game in best_odds:
        # win_prob = model.predict(game[cols])
        odds = game['price']
        bet_size = size_kelly_bet(bankroll,win_prob,odds)
        ev = calc_ev(bet_size,odds,win_prob)
        if ev > 0:
            plus_ev.append({'team':game['name'],'sportsbook':game['bookmakers.title'],'bet-size':bet_size,'odds':odds,'point':game['point'],'market':game['bookmakers.markets.key']})
    return plus_ev
