# Baseball Game Predictor and Gambling Program
----------------------------------------------

This notebook will go through the entire process of pulling data, cleaning data, feature engineering, and machine learning model creation 
to predict games and ultimately show the results of a betting strategy based upon the model's predictions.

## Pulling Data from PyBaseball's API
--------------------------------

The initial raw data will be pulled from PyBaseball.  Two functions are being used that pull our batting and hitting statistics. The is a little light
cleaning of the initial raw data to drop unneeded columns and format to a datetime index.  These function are located in the libs folder inside 
the PyBaseball_data_pull_and_cleaning.py file. After pulling and cleaning data the csv files were saved in the Data folder.  DO NOT RUN THESE
CELLS.  The data has already been saved.  The process of pulling this data takes a considerable amount of time.

In [None]:
# import functions from Libs
from Libs.PyBaseball_data_pull_and_cleaning import get_batting_data, get_pitching_data, clean_batting_data, clean_pitching_data
import pandas as pd

In [None]:
# Pull batting and pitching data for 2016, 2017, 2018, and 2019 
batting_data_2016 = get_batting_data('2016-04-03', '2016-10-02')
batting_data_2017 = get_batting_data('2017-04-02', '2017-10-01')
batting_data_2018 = get_batting_data('2018-03-29', '2018-10-01')
batting_data_2019 = get_batting_data('2019-03-28', '2019-09-29')

pitching_data_2016 = get_pitching_data('2016-04-03','2016-10-02')
pitching_data_2017 = get_pitching_data('2017-04-02', '2017-10-01')
pitching_data_2018 = get_pitching_data('2018-03-29', '2018-10-01')
pitching_data_2019 = get_pitching_data('2019-03-28', '2019-09-29')


In [None]:
# Example of raw batting_data.  These cells can be run.
raw_hitting_data = pd.read_csv('./Data/Batting/Raw_Data/raw_batting_data_2017.csv')
raw_hitting_data

In [None]:
# Example of raw pitching data
raw_pitching_data = pd.read_csv('./Data/Pitching/Raw_Data/raw_pitching_data_2017.csv')
raw_pitching_data.head()

In [None]:
# Initial light cleaning of pulled data 
batting_data_clean_2016 = clean_batting_data(batting_data_2016)
batting_data_clean_2017 = clean_batting_data(batting_data_2017)
batting_data_clean_2018 = clean_batting_data(batting_data_2018)
batting_data_clean_2019 = clean_batting_data(batting_data_2019)
pitching_data_clean_2016 = clean_pitching_data(pitching_data_2016)
pitching_data_clean_2017 = clean_pitching_data(pitching_data_2017)
pitching_data_clean_2018 = clean_pitching_data(pitching_data_2018)
pitching_data_clean_2019 = clean_pitching_data(pitching_data_2019)


In [None]:
# Example of clean batting_data 
clean_hitting_data = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
clean_hitting_data.head()

In [None]:
# Example of clean pitching_data
clean_pitching_data = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
clean_pitching_data.head()

## Creating DataFrame for Feature Selection
---------------------------------------------

This section will create a dataframe from our saved batting and pitching csv files and concatenate it with the odds csv
files we downloaded to create one dataframe for each season.  The functions used for this process are in the Training_DataFrame_creation.py
file. The resulting dataframes have been saved in the Training Data folder.  Many different features were experimented with but ultimately
these are the features we settled upon. A look back period of 10 days to calculate stats resulted in the best performing model.

In [None]:
# Import functions for dataframe creation and pandas to read in csv files
import pandas as pd
from Libs.Training_DataFrame_creation import df_for_feature_selection

In [None]:
# Read in necessary data files for batting, pitching, and gambling odds
batting_data_2016 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2017 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2018 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2019 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)

pitching_data_2016 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2017 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2018 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2019 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)

odds_df_2016 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2017 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2018 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2019 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)


In [None]:
batting_data_2016.head()

In [None]:
pitching_data_2016.head()

In [None]:
odds_df_2016.head()

In [None]:
#Create training dataframes for each season
training_df_2016 = df_for_feature_selection(odds_df_2016, batting_data_2016, pitching_data_2016, look_back = 10)
training_df_2017 = df_for_feature_selection(odds_df_2017, batting_data_2017, pitching_data_2017, look_back = 10)
training_df_2018 = df_for_feature_selection(odds_df_2018, batting_data_2018, pitching_data_2018, look_back = 10)
training_df_2019 = df_for_feature_selection(odds_df_2019, batting_data_2019, pitching_data_2019, look_back = 10)


In [None]:
training_df_2016.head()

In [None]:
training_df_2017.head()

In [None]:
training_df_2018

In [None]:
training_df_2019.head()

## Feature Selection and Stat Calculations 

Now that are dataframes are created for each season, our features are selected and stats are calculated using functions in the Baseball_stats.py
file located in Libs.

In [None]:
# import functions from Baseball_stats.py in Libs folder
from Libs.Baseball_stats import baseball_stats_calculator_hitting, baseball_stats_calculator_pitching

In [None]:
# Calculate stats
feature_df_hitting_2016 = baseball_stats_calculator_hitting(training_df_2016)
final_feature_df_2016 = baseball_stats_calculator_pitching(feature_df_hitting_2016)
feature_df_hitting_2017 = baseball_stats_calculator_hitting(training_df_2017)
final_feature_df_2017 = baseball_stats_calculator_pitching(feature_df_hitting_2017)
feature_df_hitting_2018 = baseball_stats_calculator_hitting(training_df_2018)
final_feature_df_2018 = baseball_stats_calculator_pitching(feature_df_hitting_2018)
feature_df_hitting_2019 = baseball_stats_calculator_hitting(training_df_2019)
final_feature_df_2019 = baseball_stats_calculator_pitching(feature_df_hitting_2019)


In [None]:
final_feature_df_2016.head()

In [None]:
final_feature_df_2017.head()

In [None]:
final_feature_df_2018.head()

In [None]:
final_feature_df_2019.head()

## Model creation
---------------------

We tried many different machine learning models such as SVM, RandomForestClassifier, AdaBoostClassifier, and Neural Networks. The AdaBoostClassifier
returned the best model. 

In [67]:
import numpy as np
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from imblearn.metrics import classification_report_imbalanced




In [36]:
baseball_data_2018 = pd.read_csv('./Training_Data/2018_10_day.csv',index_col = 'Date', infer_datetime_format = True, parse_dates = True)


In [None]:
no_lines = baseball_data_2019[baseball_data_2019['home_open_odds'] == 'NL']
no_lines.head()

In [37]:
baseball_data_2018.head()


Unnamed: 0_level_0,home,visitor,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,Home_PitchingK%,Home_PitchingBB%,...,Visitor_PitchingOBP_allowed,Visitor_PitchingSLG%_allowed,Home_HittingK%,Home_HittingBB%,Home_HittingOBP,Home_HittingSLG%,Visitor_HittingK%,Visitor_HittingBB%,Visitor_HittingOBP,Visitor_HittingSLG%
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2018-04-08,PIT,CIN,-150,135,-155.0,140.0,1,0,0.193939,0.112121,...,0.350365,0.459574,0.153846,0.107692,0.357143,0.441281,0.234201,0.096654,0.334586,0.378723
2018-04-08,PHI,MIA,-185,165,-200.0,175.0,0,1,0.212687,0.08209,...,0.353887,0.455657,0.274368,0.126354,0.334545,0.402542,0.228739,0.087977,0.297059,0.278689
2018-04-08,MIL,CUB,115,-130,104.0,-114.0,0,1,0.222527,0.087912,...,0.319149,0.314685,0.247126,0.063218,0.300578,0.383648,0.238764,0.109551,0.325779,0.370492
2018-04-08,STL,ARI,-145,130,-165.0,145.0,0,1,0.23871,0.109677,...,0.275974,0.35461,0.270903,0.073579,0.304348,0.416974,0.230769,0.132308,0.350932,0.4
2018-04-08,COL,ATL,-140,125,-155.0,140.0,0,1,0.190616,0.111437,...,0.345455,0.402878,0.239193,0.106628,0.318841,0.416938,0.168675,0.105422,0.375385,0.464539


In [38]:
baseball_data_2018 = baseball_data_2018[baseball_data_2018['home_open_odds'] != 'NL'] 

  result = method(y)


In [39]:
baseball_data_2018.head()

Unnamed: 0_level_0,home,visitor,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,Home_PitchingK%,Home_PitchingBB%,...,Visitor_PitchingOBP_allowed,Visitor_PitchingSLG%_allowed,Home_HittingK%,Home_HittingBB%,Home_HittingOBP,Home_HittingSLG%,Visitor_HittingK%,Visitor_HittingBB%,Visitor_HittingOBP,Visitor_HittingSLG%
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2018-04-08,PIT,CIN,-150,135,-155.0,140.0,1,0,0.193939,0.112121,...,0.350365,0.459574,0.153846,0.107692,0.357143,0.441281,0.234201,0.096654,0.334586,0.378723
2018-04-08,PHI,MIA,-185,165,-200.0,175.0,0,1,0.212687,0.08209,...,0.353887,0.455657,0.274368,0.126354,0.334545,0.402542,0.228739,0.087977,0.297059,0.278689
2018-04-08,MIL,CUB,115,-130,104.0,-114.0,0,1,0.222527,0.087912,...,0.319149,0.314685,0.247126,0.063218,0.300578,0.383648,0.238764,0.109551,0.325779,0.370492
2018-04-08,STL,ARI,-145,130,-165.0,145.0,0,1,0.23871,0.109677,...,0.275974,0.35461,0.270903,0.073579,0.304348,0.416974,0.230769,0.132308,0.350932,0.4
2018-04-08,COL,ATL,-140,125,-155.0,140.0,0,1,0.190616,0.111437,...,0.345455,0.402878,0.239193,0.106628,0.318841,0.416938,0.168675,0.105422,0.375385,0.464539


In [40]:
baseball_data_2018.columns.values

array(['home', 'visitor', 'home_open_odds', 'visitor_open_odds',
       'home_close_odds', 'visitor_close_odds', 'home_win_loss',
       'visitor_win_loss', 'Home_PitchingK%', 'Home_PitchingBB%',
       'Home_PitchingOBP_allowed', 'Home_PitchingSLG%_allowed',
       'Visitor_PitchingK%', 'Visitor_PitchingBB%',
       'Visitor_PitchingOBP_allowed', 'Visitor_PitchingSLG%_allowed',
       'Home_HittingK%', 'Home_HittingBB%', 'Home_HittingOBP',
       'Home_HittingSLG%', 'Visitor_HittingK%', 'Visitor_HittingBB%',
       'Visitor_HittingOBP', 'Visitor_HittingSLG%'], dtype=object)

In [85]:
X = baseball_data_2018[['home_open_odds', 'visitor_open_odds','Home_PitchingK%', 'Home_PitchingBB%',
       'Home_PitchingOBP_allowed', 'Home_PitchingSLG%_allowed',
       'Visitor_PitchingK%', 'Visitor_PitchingBB%',
       'Visitor_PitchingOBP_allowed', 'Visitor_PitchingSLG%_allowed',
       'Home_HittingK%', 'Home_HittingBB%', 'Home_HittingOBP',
       'Home_HittingSLG%', 'Visitor_HittingK%', 'Visitor_HittingBB%',
       'Visitor_HittingOBP', 'Visitor_HittingSLG%']]
y = baseball_data_2018['home_win_loss']

In [86]:
scaler = StandardScaler()
X_transformed = scaler.fit_transform(X)

In [87]:
len(baseball_data_2017) * 0.50

1157.5

In [88]:
X.head()

Unnamed: 0_level_0,home_open_odds,visitor_open_odds,Home_PitchingK%,Home_PitchingBB%,Home_PitchingOBP_allowed,Home_PitchingSLG%_allowed,Visitor_PitchingK%,Visitor_PitchingBB%,Visitor_PitchingOBP_allowed,Visitor_PitchingSLG%_allowed,Home_HittingK%,Home_HittingBB%,Home_HittingOBP,Home_HittingSLG%,Visitor_HittingK%,Visitor_HittingBB%,Visitor_HittingOBP,Visitor_HittingSLG%
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2018-04-08,-150,135,0.193939,0.112121,0.341463,0.400697,0.163636,0.109091,0.350365,0.459574,0.153846,0.107692,0.357143,0.441281,0.234201,0.096654,0.334586,0.378723
2018-04-08,-185,165,0.212687,0.08209,0.325758,0.376068,0.218667,0.104,0.353887,0.455657,0.274368,0.126354,0.334545,0.402542,0.228739,0.087977,0.297059,0.278689
2018-04-08,115,-130,0.222527,0.087912,0.333333,0.42284,0.216867,0.108434,0.319149,0.314685,0.247126,0.063218,0.300578,0.383648,0.238764,0.109551,0.325779,0.370492
2018-04-08,-145,130,0.23871,0.109677,0.349673,0.402256,0.275641,0.076923,0.275974,0.35461,0.270903,0.073579,0.304348,0.416974,0.230769,0.132308,0.350932,0.4
2018-04-08,-140,125,0.190616,0.111437,0.338235,0.398649,0.228916,0.138554,0.345455,0.402878,0.239193,0.106628,0.318841,0.416938,0.168675,0.105422,0.375385,0.464539


In [89]:
X_train = X_transformed[:1158]
X_test = X_transformed[1159:]
y_train = y[:1158]
y_test = y[1159:]

In [90]:
model = SVC(kernel = 'rbf', random_state = 1, probability = True)
model.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=True, random_state=1, shrinking=True,
    tol=0.001, verbose=False)

In [91]:
model.score(X_test, y_test)

0.5916594265855778

In [92]:
rf_model = RandomForestClassifier(n_estimators= 1000, random_state= 1)

In [93]:
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [94]:
predictions_rf = rf_model.predict(X_test)

In [95]:
acc_score = accuracy_score(y_test, predictions_rf)

In [96]:
importances = rf_model.feature_importances_

In [97]:
sorted(zip(rf_model.feature_importances_, X.columns), reverse=True)

[(0.06185174560061404, 'Home_PitchingOBP_allowed'),
 (0.058797684825467544, 'Visitor_PitchingK%'),
 (0.05870756949312254, 'Home_PitchingK%'),
 (0.05768967382679909, 'home_open_odds'),
 (0.05680163998003066, 'visitor_open_odds'),
 (0.05590467243592977, 'Home_HittingSLG%'),
 (0.05542503848346521, 'Visitor_HittingSLG%'),
 (0.05503421328373044, 'Visitor_PitchingSLG%_allowed'),
 (0.054748542656268544, 'Home_HittingOBP'),
 (0.0547470021387975, 'Visitor_PitchingOBP_allowed'),
 (0.05459646747564976, 'Visitor_HittingBB%'),
 (0.0544505711789852, 'Home_PitchingBB%'),
 (0.054388598162900516, 'Visitor_HittingOBP'),
 (0.054140997826084704, 'Home_HittingK%'),
 (0.05397506130188192, 'Visitor_HittingK%'),
 (0.053163888904809026, 'Home_HittingBB%'),
 (0.05292707037045212, 'Home_PitchingSLG%_allowed'),
 (0.05264956205501136, 'Visitor_PitchingBB%')]

In [84]:
print(acc_score)

0.5412684622067767


In [68]:
print(classification_report_imbalanced(y_test, predictions_rf))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.54      0.50      0.64      0.52      0.57      0.32       531
          1       0.60      0.64      0.50      0.62      0.57      0.32       620

avg / total       0.57      0.58      0.56      0.57      0.57      0.32      1151



In [53]:
clf = AdaBoostClassifier(n_estimators = 2500, random_state = 1)

In [54]:
clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=2500, random_state=1)

In [55]:
predictions_clf = clf.predict(X_test)

In [56]:
acc_score_clf = balanced_accuracy_score(y_test, predictions_clf)

In [57]:
print(acc_score_clf)

0.5507502581860154


In [58]:
actual_df = pd.DataFrame(y_test)
actual_df.reset_index(inplace = True)

In [59]:
predict_df = pd.DataFrame(predictions_rf)


In [60]:
actual_predict_df = pd.concat([actual_df,predict_df], axis = 1, join = 'inner')

In [61]:
odds_df_new = baseball_data_2019[['home','visitor','home_open_odds','visitor_open_odds']][1160:]
odds_df_new.reset_index(inplace = True)
odds_df_new.drop(columns = ['Date'],inplace = True)

In [62]:
df = pd.concat([actual_df,predict_df, odds_df_new], axis = 1, join ='inner')

In [63]:
df.set_index('Date', inplace = True)

In [64]:
df.columns = ['Actual','Predicted','Home','Visitor','Home_Open_Odds','Visitor_Open_Odds']

In [65]:
df.head()

Unnamed: 0_level_0,Actual,Predicted,Home,Visitor,Home_Open_Odds,Visitor_Open_Odds
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2018-07-04,0,0,TAM,BOS,134,-149
2018-07-04,0,0,MIN,BAL,-156,141
2018-07-04,0,0,SEA,OAK,-144,129
2018-07-04,1,1,DET,SFO,-111,101
2018-07-04,1,0,CLE,SDG,-184,164


In [66]:
df.to_csv('./predictions_2018.csv')

## Gambling Results
---------------------

Here we experiment with a few different gambling strategies based on our model's results