# Baseball Game Predictor and Gambling Program
----------------------------------------------

This notebook walks through the entire process of pulling data, cleaning data, engineering features, and building machine learning models 
to predict games. It closes by showing the results of a betting strategy based on the model's predictions.

## Pulling Data from PyBaseball's API
--------------------------------

The initial raw data is pulled from PyBaseball.  Two functions pull our batting and pitching statistics. There is a little light
cleaning of the initial raw data to drop unneeded columns and format to a datetime index.  These functions are located in the Libs folder inside 
the PyBaseball_data_pull_and_cleaning.py file. After pulling and cleaning the data, the csv files were saved in the Data folder.  DO NOT RUN THESE
CELLS.  The data has already been saved, and the process of pulling this data takes a considerable amount of time.
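One way to make the pull cells safe to re-run is to check for the saved csv before calling the slow pull. The sketch below is only an illustration: `load_or_pull` and `fake_pull` are hypothetical names that do not exist in Libs, and `fake_pull` stands in for `get_batting_data` so the example runs without network access.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_pull(csv_path, pull_fn, *args):
    """Return the cached csv if it exists; otherwise pull the data and cache it."""
    path = Path(csv_path)
    if path.exists():
        return pd.read_csv(path)
    df = pull_fn(*args)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return df


# Hypothetical stand-in for the slow PyBaseball pull; records each call.
calls = []


def fake_pull(start, end):
    calls.append((start, end))
    return pd.DataFrame({"Date": [start, end], "H": [1, 2]})


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Raw_Data" / "raw_batting_data_demo.csv"
    first = load_or_pull(target, fake_pull, "2017-04-02", "2017-10-01")
    second = load_or_pull(target, fake_pull, "2017-04-02", "2017-10-01")

# The second call reads the csv instead of pulling again.
print(len(calls))
```

With a guard like this, re-running the notebook top to bottom would only hit the API for seasons that have not been saved yet.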

In [None]:
# import functions from Libs
from Libs.PyBaseball_data_pull_and_cleaning import get_batting_data, get_pitching_data, clean_batting_data, clean_pitching_data
import pandas as pd

In [None]:
# Pull batting and pitching data for 2016, 2017, 2018, and 2019 
batting_data_2016 = get_batting_data('2016-04-03', '2016-10-02')
batting_data_2017 = get_batting_data('2017-04-02', '2017-10-01')
batting_data_2018 = get_batting_data('2018-03-29', '2018-10-01')
batting_data_2019 = get_batting_data('2019-03-28', '2019-09-29')

pitching_data_2016 = get_pitching_data('2016-04-03','2016-10-02')
pitching_data_2017 = get_pitching_data('2017-04-02', '2017-10-01')
pitching_data_2018 = get_pitching_data('2018-03-29', '2018-10-01')
pitching_data_2019 = get_pitching_data('2019-03-28', '2019-09-29')


In [None]:
# Example of raw batting_data.  These cells can be run.
raw_hitting_data = pd.read_csv('./Data/Batting/Raw_Data/raw_batting_data_2017.csv')
raw_hitting_data

In [None]:
# Example of raw pitching data
raw_pitching_data = pd.read_csv('./Data/Pitching/Raw_Data/raw_pitching_data_2017.csv')
raw_pitching_data.head()

In [None]:
# Initial light cleaning of pulled data 
batting_data_clean_2016 = clean_batting_data(batting_data_2016)
batting_data_clean_2017 = clean_batting_data(batting_data_2017)
batting_data_clean_2018 = clean_batting_data(batting_data_2018)
batting_data_clean_2019 = clean_batting_data(batting_data_2019)
pitching_data_clean_2016 = clean_pitching_data(pitching_data_2016)
pitching_data_clean_2017 = clean_pitching_data(pitching_data_2017)
pitching_data_clean_2018 = clean_pitching_data(pitching_data_2018)
pitching_data_clean_2019 = clean_pitching_data(pitching_data_2019)


In [None]:
# Example of clean batting_data 
clean_hitting_data = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
clean_hitting_data.head()

In [None]:
# Example of clean pitching_data
# Named clean_pitching_df so it does not overwrite the imported clean_pitching_data function
clean_pitching_df = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
clean_pitching_df.head()

## Creating DataFrame for Feature Selection
---------------------------------------------

This section creates a dataframe from our saved batting and pitching csv files and concatenates it with the odds csv
files we downloaded, producing one dataframe for each season.  The functions used for this process are in the Training_DataFrame_creation.py
file. The resulting dataframes have been saved in the Training Data folder.  We experimented with many different features, and
these are the ones we settled upon. A look-back period of 10 days to calculate stats resulted in the best performing model.
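To illustrate what a 10-day look-back means, here is a minimal sketch using a time-based rolling window in pandas. It does not reproduce `df_for_feature_selection`, whose internals are not shown here. The `games` frame, its `SO` and `PA` columns, and the choice to exclude the game day itself (`closed="left"`) are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical daily totals for one team: strikeouts (SO) and plate appearances (PA).
games = pd.DataFrame(
    {
        "SO": [10, 8, 12, 6, 9, 11, 7, 10, 8, 9, 12, 5],
        "PA": [40] * 12,
    },
    index=pd.date_range("2019-04-01", periods=12, freq="D", name="Date"),
)

look_back = 10

# Sum the previous `look_back` days, excluding the current day so that
# a game's own result never feeds its own features.
window = games.rolling(f"{look_back}D", closed="left").sum()
k_pct = window["SO"] / window["PA"]

print(k_pct.tail(3))
```

The first date has no history, so its value is NaN; how such rows are handled at the start of a season is a design choice this sketch leaves open.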

In [None]:
# Import functions for dataframe creation and pandas to read in csv files
import pandas as pd
from Libs.Training_DataFrame_creation import df_for_feature_selection

In [None]:
# Read in necessary data files for batting, pitching, and gambling odds
batting_data_2016 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2017 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2018 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2019 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)

pitching_data_2016 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2017 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2018 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2019 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)

odds_df_2016 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2017 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2018 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2019 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)


In [None]:
batting_data_2016.head()

In [None]:
pitching_data_2016.head()

In [None]:
odds_df_2016.head()

In [None]:
#Create training dataframes for each season
training_df_2016 = df_for_feature_selection(odds_df_2016, batting_data_2016, pitching_data_2016, look_back = 10)
training_df_2017 = df_for_feature_selection(odds_df_2017, batting_data_2017, pitching_data_2017, look_back = 10)
training_df_2018 = df_for_feature_selection(odds_df_2018, batting_data_2018, pitching_data_2018, look_back = 10)
training_df_2019 = df_for_feature_selection(odds_df_2019, batting_data_2019, pitching_data_2019, look_back = 10)


In [None]:
training_df_2016.head()

In [None]:
training_df_2017.head()

In [None]:
training_df_2018.head()

In [None]:
training_df_2019.head()

## Feature Selection and Stat Calculations 

Now that our dataframes are created for each season, we select our features and calculate stats using functions in the Baseball_stats.py
file located in Libs.
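The feature names used later (K%, BB%, OBP, SLG%) correspond to standard rate stats. The sketch below uses the conventional definitions; the exact formulas inside Baseball_stats.py are not shown in this notebook, so `rate_stats` and its argument names are assumptions rather than the project's implementation.

```python
def rate_stats(so, bb, h, hbp, ab, sf, tb, pa):
    """Conventional rate stats from counting stats.

    so: strikeouts, bb: walks, h: hits, hbp: hit by pitch,
    ab: at bats, sf: sacrifice flies, tb: total bases, pa: plate appearances.
    """
    return {
        "K%": so / pa,
        "BB%": bb / pa,
        "OBP": (h + bb + hbp) / (ab + bb + hbp + sf),
        "SLG%": tb / ab,
    }


# Made-up totals for a single look-back window.
example = rate_stats(so=80, bb=30, h=90, hbp=5, ab=340, sf=5, tb=150, pa=380)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

The same formulas apply to the pitching side, read as rates allowed by the pitching staff.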

In [None]:
# import functions from Baseball_stats.py in Libs folder
from Libs.Baseball_stats import baseball_stats_calculator_hitting, baseball_stats_calculator_pitching

In [None]:
# Calculate stats
feature_df_hitting_2016 = baseball_stats_calculator_hitting(training_df_2016)
final_feature_df_2016 = baseball_stats_calculator_pitching(feature_df_hitting_2016)
feature_df_hitting_2017 = baseball_stats_calculator_hitting(training_df_2017)
final_feature_df_2017 = baseball_stats_calculator_pitching(feature_df_hitting_2017)
feature_df_hitting_2018 = baseball_stats_calculator_hitting(training_df_2018)
final_feature_df_2018 = baseball_stats_calculator_pitching(feature_df_hitting_2018)
feature_df_hitting_2019 = baseball_stats_calculator_hitting(training_df_2019)
final_feature_df_2019 = baseball_stats_calculator_pitching(feature_df_hitting_2019)


In [None]:
final_feature_df_2016.head()

In [None]:
final_feature_df_2017.head()

In [None]:
final_feature_df_2018.head()

In [None]:
final_feature_df_2019.head()

## Model creation
---------------------

We tried many different machine learning models such as SVM, RandomForestClassifier, AdaBoostClassifier, and Neural Networks. The AdaBoostClassifier
returned the best model. 
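When comparing several classifiers, it helps to score each one with the same metric on the same held-out rows. The following is a rough sketch on synthetic data from `make_classification`, not the baseball data; the `candidates` dictionary and the smaller `n_estimators` values are assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix and home_win_loss target.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=1)

# First half trains, second half tests, mirroring a chronological split.
split = len(X_demo) // 2
X_tr, X_te = X_demo[:split], X_demo[split:]
y_tr, y_te = y_demo[:split], y_demo[split:]

candidates = {
    "SVC": SVC(kernel="rbf", random_state=1),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=1),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=1),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    pred = estimator.predict(X_te)
    # Record both metrics for every model so the rows are directly comparable.
    scores[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "balanced_accuracy": balanced_accuracy_score(y_te, pred),
    }

for name, metrics in scores.items():
    print(name, metrics)
```

Reporting both metrics side by side avoids ranking one model by plain accuracy and another by balanced accuracy.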

In [12]:
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC




In [13]:
baseball_data_2019 = pd.read_csv('./Training_Data/2019_10_day.csv',index_col = 'Date', infer_datetime_format = True, parse_dates = True)


In [14]:
no_lines = baseball_data_2019[baseball_data_2019['home_open_odds'] == 'NL']
no_lines.head()

Date,home,visitor,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,Home_PitchingK%,Home_PitchingBB%,...,Visitor_PitchingOBP_allowed,Visitor_PitchingSLG%_allowed,Home_HittingK%,Home_HittingBB%,Home_HittingOBP,Home_HittingSLG%,Visitor_HittingK%,Visitor_HittingBB%,Visitor_HittingOBP,Visitor_HittingSLG%
2019-04-13,WAS,PIT,NL,NL,-119,109,1,0,0.266026,0.092949,...,0.25228,0.359736,0.202985,0.122388,0.364458,0.496479,0.25,0.0625,0.306306,0.401294
2019-04-18,TAM,BAL,NL,NL,-250,220,0,1,0.326241,0.070922,...,0.342246,0.537092,0.21365,0.130564,0.404762,0.574913,0.211382,0.089431,0.288828,0.373494
2019-04-20,CLE,ATL,NL,NL,-134,124,1,0,0.250859,0.092784,...,0.358108,0.417969,0.230508,0.105085,0.306122,0.323077,0.278986,0.112319,0.350365,0.458333
2019-04-28,CWS,DET,NL,NL,110,-120,1,0,0.204545,0.068182,...,0.377358,0.485816,0.200704,0.073944,0.352113,0.469466,0.245161,0.093548,0.367742,0.525547
2019-05-01,KAN,TAM,NL,NL,200,-220,1,0,0.171521,0.100324,...,0.303371,0.403433,0.250825,0.085809,0.313531,0.424354,0.258555,0.114068,0.319392,0.385965


In [15]:
baseball_data_2019.head()


Date,home,visitor,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,Home_PitchingK%,Home_PitchingBB%,...,Visitor_PitchingOBP_allowed,Visitor_PitchingSLG%_allowed,Home_HittingK%,Home_HittingBB%,Home_HittingOBP,Home_HittingSLG%,Visitor_HittingK%,Visitor_HittingBB%,Visitor_HittingOBP,Visitor_HittingSLG%
2019-03-30,WAS,NYM,-130,110,-112,102,0,1,0.352941,0.088235,...,0.212121,0.193548,0.424242,0.030303,0.212121,0.193548,0.352941,0.088235,0.235294,0.258065
2019-03-30,PHI,ATL,-145,125,-145,135,1,0,0.25,0.166667,...,0.351351,0.516129,0.243243,0.162162,0.351351,0.516129,0.25,0.166667,0.361111,0.366667
2019-03-30,MIA,COL,125,-145,118,-128,1,0,0.148148,0.08642,...,0.184615,0.305085,0.246154,0.061538,0.184615,0.305085,0.148148,0.08642,0.333333,0.432432
2019-03-30,MIL,STL,-125,105,-132,122,1,0,0.253012,0.108434,...,0.298507,0.508197,0.19403,0.059701,0.298507,0.508197,0.253012,0.108434,0.353659,0.555556
2019-03-30,SDG,SFO,-125,105,-130,120,0,1,0.25,0.058824,...,0.322581,0.418182,0.290323,0.080645,0.322581,0.418182,0.25,0.058824,0.235294,0.269841


In [23]:
baseball_data_2019 = baseball_data_2019[baseball_data_2019['home_open_odds'] != 'NL'] 
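Because the odds columns contain the string 'NL' (presumably "no line"), pandas reads them as text rather than numbers. A possible follow-up to the filter above is to convert the odds to numeric explicitly. The sketch below runs on a small made-up frame and assumes only the two opening-odds columns need converting.

```python
import pandas as pd

# Tiny made-up sample with one 'NL' row, mimicking the raw odds columns.
odds = pd.DataFrame(
    {
        "home_open_odds": ["-130", "NL", "125"],
        "visitor_open_odds": ["110", "NL", "-145"],
    }
)

odds_cols = ["home_open_odds", "visitor_open_odds"]

# Non-numeric entries such as 'NL' become NaN, then those rows are dropped.
numeric = odds[odds_cols].apply(pd.to_numeric, errors="coerce")
clean = odds.copy()
clean[odds_cols] = numeric
clean = clean.dropna(subset=odds_cols)

print(clean)
```

This catches any non-numeric marker, not only 'NL', and leaves the odds ready for arithmetic in the betting step.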

In [24]:
baseball_data_2019.head()

Date,home,visitor,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,Home_PitchingK%,Home_PitchingBB%,...,Visitor_PitchingOBP_allowed,Visitor_PitchingSLG%_allowed,Home_HittingK%,Home_HittingBB%,Home_HittingOBP,Home_HittingSLG%,Visitor_HittingK%,Visitor_HittingBB%,Visitor_HittingOBP,Visitor_HittingSLG%
2019-03-30,WAS,NYM,-130,110,-112,102,0,1,0.352941,0.088235,...,0.212121,0.193548,0.424242,0.030303,0.212121,0.193548,0.352941,0.088235,0.235294,0.258065
2019-03-30,PHI,ATL,-145,125,-145,135,1,0,0.25,0.166667,...,0.351351,0.516129,0.243243,0.162162,0.351351,0.516129,0.25,0.166667,0.361111,0.366667
2019-03-30,MIA,COL,125,-145,118,-128,1,0,0.148148,0.08642,...,0.184615,0.305085,0.246154,0.061538,0.184615,0.305085,0.148148,0.08642,0.333333,0.432432
2019-03-30,MIL,STL,-125,105,-132,122,1,0,0.253012,0.108434,...,0.298507,0.508197,0.19403,0.059701,0.298507,0.508197,0.253012,0.108434,0.353659,0.555556
2019-03-30,SDG,SFO,-125,105,-130,120,0,1,0.25,0.058824,...,0.322581,0.418182,0.290323,0.080645,0.322581,0.418182,0.25,0.058824,0.235294,0.269841


In [25]:
baseball_data_2019.columns.values

array(['home', 'visitor', 'home_open_odds', 'visitor_open_odds',
       'home_close_odds', 'visitor_close_odds', 'home_win_loss',
       'visitor_win_loss', 'Home_PitchingK%', 'Home_PitchingBB%',
       'Home_PitchingOBP_allowed', 'Home_PitchingSLG%_allowed',
       'Visitor_PitchingK%', 'Visitor_PitchingBB%',
       'Visitor_PitchingOBP_allowed', 'Visitor_PitchingSLG%_allowed',
       'Home_HittingK%', 'Home_HittingBB%', 'Home_HittingOBP',
       'Home_HittingSLG%', 'Visitor_HittingK%', 'Visitor_HittingBB%',
       'Visitor_HittingOBP', 'Visitor_HittingSLG%'], dtype=object)

In [26]:
X = baseball_data_2019[['home_open_odds', 'visitor_open_odds',
       'home_close_odds', 'visitor_close_odds', 'Home_PitchingK%', 'Home_PitchingBB%',
       'Home_PitchingOBP_allowed', 'Home_PitchingSLG%_allowed',
       'Visitor_PitchingK%', 'Visitor_PitchingBB%',
       'Visitor_PitchingOBP_allowed', 'Visitor_PitchingSLG%_allowed',
       'Home_HittingK%', 'Home_HittingBB%', 'Home_HittingOBP',
       'Home_HittingSLG%', 'Visitor_HittingK%', 'Visitor_HittingBB%',
       'Visitor_HittingOBP', 'Visitor_HittingSLG%']]
y = baseball_data_2019['home_win_loss']

In [27]:
scaler = StandardScaler()
X_transformed = scaler.fit_transform(X)

In [28]:
len(baseball_data_2019) * 0.50

1198.5

In [29]:
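# Chronological split: earlier games train the model, later games test it.
# Note: row 1199 lands in neither set ([:1199] ends at row 1198, [1200:] starts at row 1200).
X_train = X_transformed[:1199]
X_test = X_transformed[1200:]
y_train = y[:1199]
y_test = y[1200:]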
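A variation worth considering is to split on a single index, so no row is dropped, and to fit the scaler on the training half only, so test-set statistics do not leak into training. The sketch below uses random numbers in place of the real features; `X_raw`, `demo_scaler`, and the other names are illustrative, and the row count of 2397 is taken from the length implied by the 1198.5 printed above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in for the feature matrix (2397 rows, 4 columns).
rng = np.random.default_rng(1)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(2397, 4))

# One split index keeps every row in exactly one of the two sets.
split = len(X_raw) // 2
X_train_raw, X_test_raw = X_raw[:split], X_raw[split:]

# Fit the scaler on training rows only, then apply it to both halves.
demo_scaler = StandardScaler().fit(X_train_raw)
X_train_s = demo_scaler.transform(X_train_raw)
X_test_s = demo_scaler.transform(X_test_raw)

print(split, X_train_s.shape, X_test_s.shape)
```

Scores obtained this way would differ slightly from the ones recorded in this notebook, which were produced with the scaler fitted on the full season.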

In [30]:
model = SVC(kernel = 'rbf', random_state = 1, probability = True)
model.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=True, random_state=1, shrinking=True,
    tol=0.001, verbose=False)

In [31]:
model.score(X_test, y_test)

0.6015037593984962

In [32]:
rf_model = RandomForestClassifier(n_estimators= 1000, random_state= 1)

In [33]:
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [34]:
predictions_rf = rf_model.predict(X_test)

In [36]:
acc_score = accuracy_score(y_test, predictions_rf)

In [37]:
print(acc_score)

0.5822890559732665


In [38]:
clf = AdaBoostClassifier(n_estimators = 2500, random_state = 1)

In [39]:
clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=2500, random_state=1)

In [40]:
predictions_clf = clf.predict(X_test)

In [42]:
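# Note: this is balanced accuracy, so it is not directly comparable to the plain accuracy scores above
acc_score_clf = balanced_accuracy_score(y_test, predictions_clf)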

In [43]:
print(acc_score_clf)

0.5486076098282411


In [44]:
actual_df = pd.DataFrame(y_test)
actual_df.reset_index(inplace = True)

In [45]:
# The random forest predictions are the ones carried into the betting table
predict_df = pd.DataFrame(predictions_rf)


In [46]:
actual_predict_df = pd.concat([actual_df,predict_df], axis = 1, join = 'inner')

In [47]:
odds_df_new = baseball_data_2019[['home','visitor','home_open_odds','visitor_open_odds']][1200:]
odds_df_new.reset_index(inplace = True)
odds_df_new.drop(columns = ['Date'],inplace = True)

In [48]:
df = pd.concat([actual_df,predict_df, odds_df_new], axis = 1, join ='inner')

In [49]:
df.set_index('Date', inplace = True)

In [50]:
df.columns = ['Actual','Predicted','Home','Visitor','Home_Open_Odds','Visitor_Open_Odds']

In [51]:
df.head()

Date,Actual,Predicted,Home,Visitor,Home_Open_Odds,Visitor_Open_Odds
2019-06-29,0,0,SFO,ARI,140,-160
2019-06-29,1,1,SDG,STL,-105,-115
2019-06-29,0,0,BOS,NYY,110,-130
2019-06-29,1,0,BAL,CLE,148,-170
2019-06-29,1,1,TOR,KAN,-140,120


In [52]:
df.to_csv('./predictions_2019.csv')

## Gambling Results
---------------------

Here we experiment with a few different gambling strategies based on our model's results.
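As a starting point, a flat-bet strategy can be sketched as follows: stake the same amount on whichever side the model predicts, and settle each bet at the opening moneyline. This is an assumption about what such a strategy might look like, not the strategy actually evaluated here. The helpers `moneyline_profit` and `flat_bet_result` are hypothetical, and the five rows are the ones shown in `df.head()` above.

```python
import pandas as pd


def moneyline_profit(odds, stake=100.0):
    """Profit on a winning bet at American moneyline odds."""
    odds = int(odds)
    if odds > 0:
        # Positive odds: profit per 100 staked.
        return stake * odds / 100
    # Negative odds: amount that must be staked to win 100.
    return stake * 100 / abs(odds)


# The five rows from df.head(); Actual and Predicted are 1 when the home team wins.
bets = pd.DataFrame(
    {
        "Actual": [0, 1, 0, 1, 1],
        "Predicted": [0, 1, 0, 0, 1],
        "Home_Open_Odds": [140, -105, 110, 148, -140],
        "Visitor_Open_Odds": [-160, -115, -130, -170, 120],
    }
)


def flat_bet_result(row, stake=100.0):
    """Bet `stake` on the predicted side; win at that side's odds or lose the stake."""
    odds = row["Home_Open_Odds"] if row["Predicted"] == 1 else row["Visitor_Open_Odds"]
    if row["Predicted"] == row["Actual"]:
        return moneyline_profit(odds, stake)
    return -stake


bets["Profit"] = bets.apply(flat_bet_result, axis=1)
total_profit = bets["Profit"].sum()
print(round(total_profit, 2))
```

Five games are far too few to judge a strategy; the point is only to show how predictions and moneyline odds combine into a profit column that can be summed over the full test set.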