# Beat the Books

#### A data science project by Jonathan Sears

### Goal
The main goal of this project is to find a way to profit from sports betting. I have two reasons for wanting to do this. First, I want to make money, which is self-explanatory. What motivates me more, however, is how detestable I find the gambling industry. Casinos and sportsbooks happily profit off people with gambling addictions and have no problem marketing to underage kids. Speaking from experience, I have watched several of my friends throw their money away on sports betting since high school.

### Approach
I plan on using a few different approaches to try to beat the books.

The first approach is arbitrage. The idea is to find two books offering different odds on the same game, with a gap large enough that I can bet on both outcomes and be guaranteed a profit whichever side wins. This is the safest approach, since the profit is locked in, but it is also the slowest, as these situations are rare. I plan on scraping odds from different sportsbooks and then analyzing the data to identify arbitrage opportunities.

The second approach is plus EV (positive expected value) betting. It is similar to arbitrage but riskier, because it forgoes the hedge bet. Although this is riskier in the short run, it can be more profitable in the long run: no stake is spent on the hedge, and by the law of large numbers the average return over many bets converges to the expected value, provided we assess each bet's win probability accurately.

Lastly, I plan on using data from past games to train a machine learning model that predicts a team's probability of winning a game, and then using those predictions to identify plus EV bets.
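To make the arithmetic behind the arbitrage and plus EV approaches concrete, here is a minimal sketch in plain Python. It assumes decimal odds (total payout per unit staked). The function names and the example odds are hypothetical and chosen only for illustration; they do not come from any sportsbook or from the scraping code.

```python
def implied_probability(decimal_odds):
    """Probability implied by decimal odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds


def arbitrage_stakes(odds_a, odds_b, bankroll):
    """Split a bankroll across two opposing outcomes priced at different books.

    Returns (stake_a, stake_b, guaranteed_profit), or None when the implied
    probabilities sum to 1 or more, meaning no arbitrage exists.
    """
    total_implied = implied_probability(odds_a) + implied_probability(odds_b)
    if total_implied >= 1.0:
        return None
    # Stake in proportion to implied probability so both outcomes pay the same.
    stake_a = bankroll * implied_probability(odds_a) / total_implied
    stake_b = bankroll * implied_probability(odds_b) / total_implied
    payout = stake_a * odds_a  # equal to stake_b * odds_b by construction
    return stake_a, stake_b, payout - bankroll


def expected_value(win_probability, decimal_odds, stake=1.0):
    """Expected profit of a single unhedged bet."""
    profit_if_win = stake * (decimal_odds - 1.0)
    return win_probability * profit_if_win - (1.0 - win_probability) * stake


# Hypothetical odds on opposite sides of the same game at two books.
print(arbitrage_stakes(2.10, 2.05, bankroll=100))
# A bet we believe wins 55% of the time, offered at even money (decimal 2.0).
print(expected_value(0.55, 2.0))
```

With the hypothetical odds of 2.10 and 2.05, the implied probabilities sum to less than 1, so the function returns a stake split with a positive guaranteed profit. With 1.91 on both sides, a typical shape for a single book's two-way line, the sum exceeds 1 and the function returns `None`.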

### Potential Challenges
- Assessing risk
- Accuracy
- Different sports have different factors that affect the game environment and outcomes
- How can we account for teams changing over time?


### Mathematical Toolbox
- Expected Value
- Law Of Large Numbers
- Kelly Criterion
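As a quick illustration of the last item, the Kelly criterion suggests staking the fraction `f = (b*p - q) / b` of the bankroll, where `p` is the estimated win probability, `q = 1 - p`, and `b` is the net profit per unit staked. The sketch below assumes decimal odds (so `b = decimal_odds - 1`) and clamps negative results to zero, meaning "do not bet". The function name is hypothetical.

```python
def kelly_fraction(win_probability, decimal_odds):
    """Fraction of the bankroll to stake under the Kelly criterion.

    Returns 0.0 when the bet has no positive edge.
    """
    b = decimal_odds - 1.0       # net profit per unit staked
    q = 1.0 - win_probability    # probability of losing
    fraction = (b * win_probability - q) / b
    return max(fraction, 0.0)


# Estimated 55% win probability at even money (decimal odds 2.0).
print(kelly_fraction(0.55, 2.0))
# A coin-flip bet at decimal odds 1.91 has negative edge, so the stake is 0.
print(kelly_fraction(0.50, 1.91))
```

The result is only as good as the probability estimate: overestimating `p` leads directly to overbetting, which is one reason assessing risk appears under the challenges above.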


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


First, let's read in our data and drop the rows that have no betting lines, as we won't have much use for them.

In [31]:
games = pd.read_csv('./Data/spreadspoke_scores.csv')
teams = pd.read_csv('./Data/nfl_teams.csv')
stadiums = pd.read_csv('./Data/nfl_stadiums.csv', encoding="unicode_escape")


# Keep only games that have both a point spread and an over/under line.
games.dropna(subset=['spread_favorite', 'over_under_line'], inplace=True)
games.head()

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail
350,1/14/1968,1967,Superbowl,True,Green Bay Packers,33.0,14.0,Oakland Raiders,GB,-13.5,43,Orange Bowl,True,60.0,12.0,74.0,
538,1/12/1969,1968,Superbowl,True,Baltimore Colts,7.0,16.0,New York Jets,IND,-18.0,40,Orange Bowl,True,66.0,12.0,80.0,rain
727,1/11/1970,1969,Superbowl,True,Kansas City Chiefs,23.0,7.0,Minnesota Vikings,MIN,-12.0,39,Tulane Stadium,True,55.0,14.0,84.0,rain
916,1/17/1971,1970,Superbowl,True,Baltimore Colts,16.0,13.0,Dallas Cowboys,IND,-2.5,36,Orange Bowl,True,59.0,11.0,60.0,
1105,1/16/1972,1971,Superbowl,True,Dallas Cowboys,24.0,3.0,Miami Dolphins,DAL,-6.0,34,Tulane Stadium,True,34.0,18.0,40.0,


Let's make some new columns indicating the winner of the game, who covered the spread, and whether the over hit. Any pushes or ties will be recorded as missing values (NaN).

In [36]:
def winner(row):
    """Return the name of the winning team, or None for a tie."""
    if row['score_home'] > row['score_away']:
        return row['team_home']
    elif row['score_away'] > row['score_home']:
        return row['team_away']
    else:
        return None

# Disabled for now: flags whether the combined score went over the line.
# def over(row):
#     total = float(row['score_home'] + row['score_away'])
#     line = float(row['over_under_line'])
#     if total > line:
#         return True
#     elif total < line:
#         return False
#     else:
#         return None  # push

# Unfinished: should flag whether the favorite covered the spread.
# def covered(row):
#     if row['score_home'] - row['score_away'] > row['spread_favorite']:

games['winner'] = games.apply(winner, axis = 1)
# games['over'] = games.apply(over, axis = 1)
games.head()

Unnamed: 0,schedule_date,schedule_season,schedule_week,schedule_playoff,team_home,score_home,score_away,team_away,team_favorite_id,spread_favorite,over_under_line,stadium,stadium_neutral,weather_temperature,weather_wind_mph,weather_humidity,weather_detail,winner
350,1/14/1968,1967,Superbowl,True,Green Bay Packers,33.0,14.0,Oakland Raiders,GB,-13.5,43,Orange Bowl,True,60.0,12.0,74.0,,Green Bay Packers
538,1/12/1969,1968,Superbowl,True,Baltimore Colts,7.0,16.0,New York Jets,IND,-18.0,40,Orange Bowl,True,66.0,12.0,80.0,rain,New York Jets
727,1/11/1970,1969,Superbowl,True,Kansas City Chiefs,23.0,7.0,Minnesota Vikings,MIN,-12.0,39,Tulane Stadium,True,55.0,14.0,84.0,rain,Kansas City Chiefs
916,1/17/1971,1970,Superbowl,True,Baltimore Colts,16.0,13.0,Dallas Cowboys,IND,-2.5,36,Orange Bowl,True,59.0,11.0,60.0,,Baltimore Colts
1105,1/16/1972,1971,Superbowl,True,Dallas Cowboys,24.0,3.0,Miami Dolphins,DAL,-6.0,34,Tulane Stadium,True,34.0,18.0,40.0,,Dallas Cowboys
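The over and cover columns are still disabled in the cell above. One possible way to compute them is sketched below on three rows copied from the output, so it can run on its own. Two things here are assumptions rather than part of the notebook: the helper names (`tri_state`, `add_over_and_cover`), and the `name_to_id` lookup, which stands in for a mapping from full team names to the ids used in `team_favorite_id` (presumably derivable from the `teams` table). Rows whose favorite id matches neither team would need separate handling.

```python
import numpy as np
import pandas as pd

# Three rows copied from the output above. The line is stored as text here
# because it displays without decimals, which suggests it was read as strings;
# pd.to_numeric handles either case.
sample = pd.DataFrame({
    "team_home": ["Green Bay Packers", "Baltimore Colts", "Baltimore Colts"],
    "team_away": ["Oakland Raiders", "New York Jets", "Dallas Cowboys"],
    "score_home": [33.0, 7.0, 16.0],
    "score_away": [14.0, 16.0, 13.0],
    "team_favorite_id": ["GB", "IND", "IND"],
    "spread_favorite": [-13.5, -18.0, -2.5],
    "over_under_line": ["43", "40", "36"],
})

# Assumption: a name -> id lookup; only the home teams above are needed here.
name_to_id = {"Green Bay Packers": "GB", "Baltimore Colts": "IND"}


def tri_state(diff):
    """True where diff > 0, False where diff < 0, NaN for pushes or missing data."""
    out = pd.Series(np.nan, index=diff.index, dtype="object")
    out[diff > 0] = True
    out[diff < 0] = False
    return out


def add_over_and_cover(df, name_to_id):
    """Return a copy of df with 'over' and 'favorite_covered' columns added."""
    df = df.copy()
    total = df["score_home"] + df["score_away"]
    line = pd.to_numeric(df["over_under_line"], errors="coerce")
    df["over"] = tri_state(total - line)

    # Margin of victory from the favorite's point of view.
    home_is_favorite = df["team_home"].map(name_to_id) == df["team_favorite_id"]
    home_margin = df["score_home"] - df["score_away"]
    favorite_margin = home_margin.where(home_is_favorite, -home_margin)
    # The spread is negative for the favorite, so covering means
    # winning by more than its absolute value.
    df["favorite_covered"] = tri_state(favorite_margin + df["spread_favorite"])
    return df


result = add_over_and_cover(sample, name_to_id)
print(result[["team_home", "team_away", "over", "favorite_covered"]])
```

Working on whole columns instead of `apply` with `axis=1` keeps the push handling in one place (`tri_state`) and avoids a Python-level loop over every game.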
