# Baseball Game Predictor and Gambling Program
----------------------------------------------

This notebook will go through the entire process of pulling data, cleaning data, feature engineering, and machine learning model creation 
to predict games and ultimately show the results of a betting strategy based upon the model's predictions.

## Pulling Data from PyBaseball's API
--------------------------------

The initial raw data will be pulled from PyBaseball.  Two functions are being used that pull our batting and hitting statistics. The is a little light
cleaning of the initial raw data to drop unneeded columns and format to a datetime index.  These function are located in the libs folder inside 
the PyBaseball_data_pull_and_cleaning.py file. After pulling and cleaning data the csv files were saved in the Data folder.  DO NOT RUN THESE
CELLS.  The data has already been saved.  The process of pulling this data takes a considerable amount of time.

In [2]:
# import functions from Libs
from Libs.PyBaseball_data_pull_and_cleaning import get_batting_data, get_pitching_data, clean_batting_data, clean_pitching_data
import pandas as pd

In [None]:
# Pull batting and pitching data for 2016, 2017, 2018, and 2019 
batting_data_2016 = get_batting_data('2016-04-03', '2016-10-02')
batting_data_2017 = get_batting_data('2017-04-02', '2017-10-01')
batting_data_2018 = get_batting_data('2018-03-29', '2018-10-01')
batting_data_2019 = get_batting_data('2019-03-28', '2019-09-29')

pitching_data_2016 = get_pitching_data('2016-04-03','2016-10-02')
pitching_data_2017 = get_pitching_data('2017-04-02', '2017-10-01')
pitching_data_2018 = get_pitching_data('2018-03-29', '2018-10-01')
pitching_data_2019 = get_pitching_data('2019-03-28', '2019-09-29')


In [3]:
# Example of raw batting_data.  These cells can be run.
raw_hitting_data = pd.read_csv('./Data/Batting/Raw_Data/raw_batting_data_2017.csv')
raw_hitting_data

Unnamed: 0,Name,Age,#days,Lev,Date,Tm,Unnamed: 7,Opp,G,PA,...,HBP,SH,SF,GDP,SB,CS,BA,OBP,SLG,OPS
0,Nick Ahmed,27,953,MLB-NL,"Apr 2, 2017",Arizona,,San Francisco,1,1,...,0,0,0,0,0,0,1.000,1.00,1.000,2.000
1,Javier Baez,24,953,MLB-NL,"Apr 2, 2017",Chicago,@,St. Louis,1,4,...,0,0,0,0,0,0,0.250,0.25,0.250,0.500
2,Tim Beckham,27,953,MLB-AL,"Apr 2, 2017",Tampa Bay,,New York,1,4,...,0,0,0,0,0,0,0.250,0.25,0.500,0.750
3,Brandon Belt,29,953,MLB-NL,"Apr 2, 2017",San Francisco,@,Arizona,1,5,...,0,0,0,0,0,0,0.000,0.40,0.000,0.400
4,Greg Bird,24,953,MLB-AL,"Apr 2, 2017",New York,@,Tampa Bay,1,5,...,0,0,0,0,0,0,0.000,0.20,0.000,0.200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51016,Christian Yelich,25,771,MLB-NL,"Oct 1, 2017",Miami,,Atlanta,1,5,...,1,0,0,0,0,0,0.667,0.80,0.667,1.467
51017,Chris Young,33,771,MLB-AL,"Oct 1, 2017",Boston,,Houston,1,4,...,0,0,0,0,0,0,0.250,0.25,0.250,0.500
51018,Eric Young Jr.,32,771,MLB-AL,"Oct 1, 2017",Los Angeles,,Seattle,1,2,...,0,0,0,0,0,0,0.500,0.50,2.000,2.500
51019,Ryan Zimmerman,32,771,MLB-NL,"Oct 1, 2017",Washington,,Pittsburgh,1,2,...,0,0,0,0,0,0,0.000,0.50,0.000,0.500


In [4]:
# Example of raw pitching data
raw_pitching_data = pd.read_csv('./Data/Pitching/Raw_Data/raw_pitching_data_2017.csv')
raw_pitching_data.head()

Unnamed: 0,Name,Age,#days,Lev,Date,Tm,Unnamed: 7,Opp,G,GS,...,Str,StL,StS,GB/FB,LD,PU,WHIP,BAbip,SO9,SO/W
0,Chris Archer,28,953,MLB-AL,"Apr 2, 2017",Tampa Bay,,New York,1,1,...,0.61,0.2,0.11,0.55,0.14,0.05,1.143,0.304,6.4,5.0
1,Ty Blach,26,953,MLB-NL,"Apr 2, 2017",San Francisco,@,Arizona,1,0,...,0.6,0.2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,
2,Madison Bumgarner,27,953,MLB-NL,"Apr 2, 2017",San Francisco,@,Arizona,1,1,...,0.75,0.2,0.15,0.5,0.13,0.06,0.857,0.333,14.1,
3,Andrew Chafin,27,953,MLB-NL,"Apr 2, 2017",Arizona,,San Francisco,1,0,...,0.63,0.06,0.13,0.75,0.0,0.0,2.0,0.333,0.0,
4,Alex Colome,28,953,MLB-AL,"Apr 2, 2017",Tampa Bay,,New York,1,0,...,0.62,0.08,0.23,0.0,0.0,0.0,0.0,0.0,9.0,


In [None]:
# Initial light cleaning of pulled data 
batting_data_clean_2016 = clean_batting_data(batting_data_2016)
batting_data_clean_2017 = clean_batting_data(batting_data_2017)
batting_data_clean_2018 = clean_batting_data(batting_data_2018)
batting_data_clean_2019 = clean_batting_data(batting_data_2019)
pitching_data_clean_2016 = clean_pitching_data(pitching_data_2016)
pitching_data_clean_2017 = clean_pitching_data(pitching_data_2017)
pitching_data_clean_2018 = clean_pitching_data(pitching_data_2018)
pitching_data_clean_2019 = clean_pitching_data(pitching_data_2019)


In [5]:
# Example of clean batting_data 
clean_hitting_data = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
clean_hitting_data.head()

Unnamed: 0_level_0,Name,Tm,VH,Opp,G,PA,AB,R,H,2B,...,RBI,BB,IBB,SO,HBP,SH,SF,GDP,SB,CS
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2017-04-02,Nick Ahmed,ARI,0,San Francisco,1,1,1,1,1,0,...,1,0,0,0,0,0,0,0,0,0
2017-04-02,Javier Baez,CUB,1,St. Louis,1,4,4,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2017-04-02,Tim Beckham,TAM,0,New York,1,4,4,1,1,1,...,0,0,0,2,0,0,0,0,0,0
2017-04-02,Brandon Belt,SFO,1,Arizona,1,5,3,0,0,0,...,0,2,0,1,0,0,0,0,0,0
2017-04-02,Greg Bird,NYY,1,Tampa Bay,1,5,4,0,0,0,...,0,1,0,1,0,0,0,0,0,0


In [6]:
# Example of clean pitching_data
clean_pitching_data = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
clean_pitching_data.head()

Unnamed: 0_level_0,Name,Tm,VH,Opp,G,GS,IP,H,R,ER,...,Str,StL,StS,GB/FB,LD,PU,WHIP,BAbip,SO9,SO/W
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2017-04-02,Chris Archer,TAM,0,New York,1,1,7.0,7,2,2,...,0.61,0.2,0.11,0.55,0.14,0.05,1.143,0.304,6.4,5.0
2017-04-02,Ty Blach,SFO,1,Arizona,1,0,0.2,0,0,0,...,0.6,0.2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,
2017-04-02,Madison Bumgarner,SFO,1,Arizona,1,1,7.0,6,3,3,...,0.75,0.2,0.15,0.5,0.13,0.06,0.857,0.333,14.1,
2017-04-02,Andrew Chafin,ARI,0,San Francisco,1,0,1.0,2,1,1,...,0.63,0.06,0.13,0.75,0.0,0.0,2.0,0.333,0.0,
2017-04-02,Alex Colome,TAM,0,New York,1,0,1.0,0,0,0,...,0.62,0.08,0.23,0.0,0.0,0.0,0.0,0.0,9.0,


## Creating DataFrame for Feature Selection
---------------------------------------------

This section will create a dataframe from our saved batting and pitching csv files and concatenate it with the odds csv
files we downloaded to create one dataframe for each season.  The functions used for this process are in the Training_DataFrame_creation.py
file. The resulting dataframes have been saved in the Training Data folder.  Many different features were experimented with but ultimately
these are the features we settled upon. A look back period of 10 days to calculate stats resulted in the best performing model.

In [7]:
# Import functions for dataframe creation and pandas to read in csv files
import pandas as pd
from Libs.Training_DataFrame_creation import df_for_feature_selection

In [8]:
# Read in necessary data files for batting, pitching, and gambling odds
batting_data_2016 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2017 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2018 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2019 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)

pitching_data_2016 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2017 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2018 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2019 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)

odds_df_2016 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2017 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2018 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2019 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)


In [9]:
batting_data_2016.head()

Unnamed: 0_level_0,Name,Tm,VH,Opp,G,PA,AB,R,H,2B,...,RBI,BB,IBB,SO,HBP,SH,SF,GDP,SB,CS
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2016-04-03,Matt Adams,STL,1,Pittsburgh,1,4,4,0,0,0,...,0,0,0,2,0,0,0,0,0,0
2016-04-03,Jose Bautista,TOR,1,Tampa Bay,1,4,2,1,0,0,...,0,2,0,1,0,0,0,0,0,0
2016-04-03,Asdrubal Cabrera,NYM,1,Kansas City,1,4,4,0,1,0,...,0,0,0,1,0,0,0,0,0,0
2016-04-03,Lorenzo Cain,KAN,0,New York,1,4,2,2,1,0,...,0,2,0,1,0,0,0,0,0,0
2016-04-03,Eric Campbell,NYM,1,Kansas City,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
pitching_data_2016.head()

Unnamed: 0_level_0,Name,Tm,VH,Opp,G,GS,IP,H,R,ER,...,Str,StL,StS,GB/FB,LD,PU,WHIP,BAbip,SO9,SO/W
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2016-04-03,Chris Archer,TAM,0,Toronto,1,1,5.0,5,3,2,...,0.62,0.19,0.18,0.44,0.56,0.0,1.6,0.556,21.6,4.0
2016-04-03,Jerry Blevins,NYM,1,Kansas City,1,0,1.0,0,0,0,...,0.64,0.27,0.09,0.67,0.0,0.0,0.0,0.0,0.0,
2016-04-03,Bartolo Colon,NYM,1,Kansas City,1,0,1.1,1,0,0,...,0.65,0.3,0.0,0.75,0.0,0.25,0.75,0.25,6.8,
2016-04-03,Wade Davis,KAN,0,New York,1,0,1.0,1,0,0,...,0.69,0.19,0.08,1.0,0.0,0.0,2.0,0.5,18.0,2.0
2016-04-03,Dana Eveland,TAM,0,Toronto,1,0,1.2,0,0,0,...,0.65,0.18,0.18,0.67,0.0,0.0,0.0,0.0,10.8,


In [11]:
odds_df_2016.head()

Unnamed: 0_level_0,VH,Team,Pitcher,Open,Close,Final
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2016-04-03,V,STL,WAINWRIGHT-R,-115,109,1
2016-04-03,H,PIT,FLIRIANO-L,105,-119,4
2016-04-03,V,TOR,MSTROMAN-R,-115,111,5
2016-04-03,H,TAM,CARCHER-R,105,-121,3
2016-04-03,V,NYM,MHARVEY-R,-119,-120,3


In [12]:
#Create training dataframes for each season
training_df_2016 = df_for_feature_selection(odds_df_2016, batting_data_2016, pitching_data_2016, look_back = 10)
training_df_2017 = df_for_feature_selection(odds_df_2017, batting_data_2017, pitching_data_2017, look_back = 10)
training_df_2018 = df_for_feature_selection(odds_df_2018, batting_data_2018, pitching_data_2018, look_back = 10)
training_df_2019 = df_for_feature_selection(odds_df_2019, batting_data_2019, pitching_data_2019, look_back = 10)


0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [13]:
training_df_2016.head()

Unnamed: 0,home,visitor,home_pitcher,visitor_pitcher,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,...,Visitor_PitchingStr,Visitor_PitchingStL,Visitor_PitchingStS,Visitor_PitchingGB/FB,Visitor_PitchingLD,Visitor_PitchingPU,Visitor_PitchingWHIP,Visitor_PitchingBAbip,Visitor_PitchingSO9,Visitor_PitchingSO/W
2016-04-13,NYM,MIA,LVERRETT-R,ACONLEY-L,-130,115,-115,105,1,0,...,17.14,4.72,3.08,8.27,9.08,2.11,48.104,9.311,232.1,29.67
2016-04-13,WAS,ATL,TROARK-R,MWISLER-R,-240,210,-141,126,1,0,...,22.47,4.85,3.42,15.76,5.44,1.16,61.495,10.888,318.4,23.49
2016-04-13,PHI,SDG,JEICKHOFF-R,CREA-R,-120,110,-115,105,1,0,...,20.9,5.32,3.45,14.53,7.85,3.28,51.958,7.751,296.3,33.08
2016-04-13,CUB,CIN,JLACKEY-R,ASIMON-R,-215,190,-200,175,1,0,...,18.42,5.4,3.49,14.2,6.75,1.8,44.915,5.96,253.5,27.75
2016-04-13,STL,MIL,MLEAKE-R,CANDERSON-R,-170,150,-145,130,0,1,...,20.87,5.37,3.04,14.86,8.17,1.55,60.623,8.784,190.2,16.75


In [14]:
training_df_2017.head()

Unnamed: 0,home,visitor,home_pitcher,visitor_pitcher,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,...,Visitor_PitchingStr,Visitor_PitchingStL,Visitor_PitchingStS,Visitor_PitchingGB/FB,Visitor_PitchingLD,Visitor_PitchingPU,Visitor_PitchingWHIP,Visitor_PitchingBAbip,Visitor_PitchingSO9,Visitor_PitchingSO/W
2017-04-12,COL,SDG,KFREELAND-L,ZLEE-R,-166,146,-153,138,0,1,...,22.66,6.28,3.74,15.46,5.29,3.64,59.32,7.593,340.0,26.5
2017-04-12,PIT,CIN,INOVA-R,AGARRETT-L,-151,136,-147,132,0,1,...,22.26,5.91,4.86,21.83,5.9,1.94,46.448,7.94,353.9,31.47
2017-04-12,PHI,NYM,VVELASQUEZ-R,ZWHEELER-R,-111,101,-118,108,0,1,...,24.88,6.81,4.66,18.46,9.55,1.35,83.464,10.978,381.3,28.33
2017-04-12,WAS,STL,MSCHERZER-R,MLEAKE-R,-201,176,-180,160,0,1,...,20.09,5.75,2.97,12.92,5.93,2.91,60.135,9.782,201.9,25.5
2017-04-12,MIA,ATL,TKOEHLER-R,JGARCIA-L,-126,111,-109,-101,0,1,...,17.8,5.03,2.4,12.27,7.37,1.26,49.749,7.696,124.3,15.33


In [15]:
training_df_2018

Unnamed: 0,home,visitor,home_pitcher,visitor_pitcher,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,...,Visitor_PitchingStr,Visitor_PitchingStL,Visitor_PitchingStS,Visitor_PitchingGB/FB,Visitor_PitchingLD,Visitor_PitchingPU,Visitor_PitchingWHIP,Visitor_PitchingBAbip,Visitor_PitchingSO9,Visitor_PitchingSO/W
2018-04-08,PIT,CIN,JTAILLON-R,TMAHLE-R,-150,135,-155.0,140.0,1,0,...,19.00,5.12,3.61,13.06,7.94,1.34,61.293,7.581,208.2,24.67
2018-04-08,PHI,MIA,JARRIETA-R,TRICHARDS-R,-185,165,-200.0,175.0,0,1,...,26.20,6.12,4.56,15.10,9.09,3.52,63.438,11.517,405.9,40.17
2018-04-08,MIL,CUB,CANDERSON-R,JQUINTANA-L,115,-130,104.0,-114.0,0,1,...,25.37,6.55,5.00,15.47,7.60,2.78,50.047,9.846,391.7,38.18
2018-04-08,STL,ARI,LWEAVER-R,TWALKER-R,-145,130,-165.0,145.0,0,1,...,23.19,5.89,4.76,15.11,6.40,4.07,35.415,7.033,352.5,35.97
2018-04-08,COL,ATL,KFREELAND-L,SNEWCOMB-L,-140,125,-155.0,140.0,0,1,...,25.49,7.04,5.33,14.75,11.13,2.17,69.515,9.942,418.0,37.84
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2018-09-30,MIN,CWS,ZLITTELL-R,DCOVEY-R,-145,130,-132.0,117.0,1,0,...,30.54,7.59,5.47,17.04,14.62,4.47,102.068,15.102,400.6,37.50
2018-09-30,KAN,CLE,ESKOGLUND-L,CCARRASCO-R,190,-215,205.0,-230.0,0,1,...,34.49,9.54,6.12,23.79,10.43,4.94,101.610,15.241,504.4,37.87
2018-09-30,MIL,DET,GGONZALEZ-L,STURNBULL-R,-260,230,-285.0,250.0,1,0,...,28.49,7.36,4.23,15.48,14.66,2.47,67.666,12.057,369.8,27.00
2018-10-01,CUB,MIL,JQUINTANA-L,JCHACIN-R,-130,115,-122.0,112.0,0,1,...,37.63,10.08,7.03,23.33,12.41,3.97,63.468,14.968,576.2,21.50


In [16]:
training_df_2019.head()

Unnamed: 0,home,visitor,home_pitcher,visitor_pitcher,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,...,Visitor_PitchingStr,Visitor_PitchingStL,Visitor_PitchingStS,Visitor_PitchingGB/FB,Visitor_PitchingLD,Visitor_PitchingPU,Visitor_PitchingWHIP,Visitor_PitchingBAbip,Visitor_PitchingSO9,Visitor_PitchingSO/W
2019-03-30,WAS,NYM,SSTRASBURG-R,NSYNDERGAARD-R,-130,110,-112,102,0,1,...,2.59,0.62,0.8,0.66,1.75,0.08,1.0,0.417,51.0,10.0
2019-03-30,PHI,ATL,NPIVETTA-R,BWILSON-R,-145,125,-145,135,1,0,...,3.03,0.53,0.78,2.29,0.0,0.0,9.7,0.523,53.1,4.0
2019-03-30,MIA,COL,PLOPEZ-R,TANDERSON-L,125,-145,118,-128,1,0,...,4.78,1.36,0.77,3.94,0.53,0.05,3.262,0.156,50.7,7.33
2019-03-30,MIL,STL,BWOODRUFF-R,DHUDSON-R,-125,105,-132,122,1,0,...,5.74,1.49,1.04,4.59,2.11,0.13,7.646,1.238,60.5,9.0
2019-03-30,SDG,SFO,NMARGEVICIUS-L,DRODRIGUEZ-R,-125,105,-130,120,0,1,...,4.94,1.21,0.84,3.35,1.86,0.07,7.357,2.052,51.7,12.5


## Feature Selection and Stat Calculations 

Now that are dataframes are created for each season, our features are selected and stats are calculated using functions in the Baseball_stats.py
file located in Libs.

In [18]:
# import functions from Baseball_stats.py in Libs folder
from Libs.Baseball_stats import baseball_stats_calculator_hitting, baseball_stats_calculator_pitching

In [19]:
# Calculate stats
feature_df_hitting_2016 = baseball_stats_calculator_hitting(training_df_2016)
final_feature_df_2016 = baseball_stats_calculator_pitching(feature_df_hitting_2016)
feature_df_hitting_2017 = baseball_stats_calculator_hitting(training_df_2017)
final_feature_df_2017 = baseball_stats_calculator_pitching(feature_df_hitting_2017)
feature_df_hitting_2018 = baseball_stats_calculator_hitting(training_df_2018)
final_feature_df_2018 = baseball_stats_calculator_pitching(feature_df_hitting_2018)
feature_df_hitting_2019 = baseball_stats_calculator_hitting(training_df_2019)
final_feature_df_2019 = baseball_stats_calculator_pitching(feature_df_hitting_2019)


In [20]:
final_feature_df_2016.head()

Unnamed: 0,home,visitor,home_pitcher,visitor_pitcher,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,...,Home_PitchingSLG%_allowed,Visitor_PitchingK%,Visitor_PitchingBB%,Visitor_PitchingOBP_num,Visitor_PitchingOBP_den,Visitor_PitchingOBP_allowed,Visitor_Pitching1B,Visitor_PitchingSLG%_num,Visitor_PitchingSLG%_den,Visitor_PitchingSLG%_allowed
2016-04-13,NYM,MIA,LVERRETT-R,ACONLEY-L,-130,115,-115,105,1,0,...,0.353448,0.234568,0.119342,84.0,240.0,0.35,36.0,87.0,210.0,0.414286
2016-04-13,WAS,ATL,TROARK-R,MWISLER-R,-240,210,-141,126,1,0,...,0.338384,0.213058,0.113402,107.0,289.0,0.370242,49.0,110.0,247.0,0.445344
2016-04-13,PHI,SDG,JEICKHOFF-R,CREA-R,-120,110,-115,105,1,0,...,0.330677,0.219672,0.091803,102.0,305.0,0.334426,38.0,129.0,271.0,0.476015
2016-04-13,CUB,CIN,JLACKEY-R,ASIMON-R,-215,190,-200,175,1,0,...,0.309417,0.209738,0.108614,86.0,262.0,0.328244,32.0,84.0,226.0,0.371681
2016-04-13,STL,MIL,MLEAKE-R,CANDERSON-R,-170,150,-145,130,0,1,...,0.341991,0.164794,0.093633,93.0,266.0,0.349624,36.0,127.0,235.0,0.540426


In [21]:
final_feature_df_2017.head()

Unnamed: 0,home,visitor,home_pitcher,visitor_pitcher,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,...,Home_PitchingSLG%_allowed,Visitor_PitchingK%,Visitor_PitchingBB%,Visitor_PitchingOBP_num,Visitor_PitchingOBP_den,Visitor_PitchingOBP_allowed,Visitor_Pitching1B,Visitor_PitchingSLG%_num,Visitor_PitchingSLG%_den,Visitor_PitchingSLG%_allowed
2017-04-12,COL,SDG,KFREELAND-L,ZLEE-R,-166,146,-153,138,0,1,...,0.437931,0.190769,0.101538,103.0,325.0,0.316923,34.0,130.0,287.0,0.452962
2017-04-12,PIT,CIN,INOVA-R,AGARRETT-L,-151,136,-147,132,0,1,...,0.437008,0.259386,0.129693,87.0,292.0,0.297945,32.0,77.0,253.0,0.304348
2017-04-12,PHI,NYM,VVELASQUEZ-R,ZWHEELER-R,-111,101,-118,108,0,1,...,0.551471,0.240385,0.080128,94.0,310.0,0.303226,48.0,104.0,283.0,0.367491
2017-04-12,WAS,STL,MSCHERZER-R,MLEAKE-R,-201,176,-180,160,0,1,...,0.442857,0.188679,0.078616,110.0,314.0,0.350318,48.0,130.0,280.0,0.464286
2017-04-12,MIA,ATL,TKOEHLER-R,JGARCIA-L,-126,111,-109,-101,0,1,...,0.424242,0.141791,0.100746,92.0,265.0,0.34717,46.0,92.0,232.0,0.396552


In [22]:
final_feature_df_2018.head()

Unnamed: 0,home,visitor,home_pitcher,visitor_pitcher,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,...,Home_PitchingSLG%_allowed,Visitor_PitchingK%,Visitor_PitchingBB%,Visitor_PitchingOBP_num,Visitor_PitchingOBP_den,Visitor_PitchingOBP_allowed,Visitor_Pitching1B,Visitor_PitchingSLG%_num,Visitor_PitchingSLG%_den,Visitor_PitchingSLG%_allowed
2018-04-08,PIT,CIN,JTAILLON-R,TMAHLE-R,-150,135,-155.0,140.0,1,0,...,0.400697,0.163636,0.109091,96.0,274.0,0.350365,37.0,108.0,235.0,0.459574
2018-04-08,PHI,MIA,JARRIETA-R,TRICHARDS-R,-185,165,-200.0,175.0,0,1,...,0.376068,0.218667,0.104,132.0,373.0,0.353887,52.0,149.0,327.0,0.455657
2018-04-08,MIL,CUB,CANDERSON-R,JQUINTANA-L,115,-130,104.0,-114.0,0,1,...,0.42284,0.216867,0.108434,105.0,329.0,0.319149,46.0,90.0,286.0,0.314685
2018-04-08,STL,ARI,LWEAVER-R,TWALKER-R,-145,130,-165.0,145.0,0,1,...,0.402256,0.275641,0.076923,85.0,308.0,0.275974,42.0,100.0,282.0,0.35461
2018-04-08,COL,ATL,KFREELAND-L,SNEWCOMB-L,-140,125,-155.0,140.0,0,1,...,0.398649,0.228916,0.138554,114.0,330.0,0.345455,37.0,112.0,278.0,0.402878


In [23]:
final_feature_df_2019.head()

Unnamed: 0,home,visitor,home_pitcher,visitor_pitcher,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,...,Home_PitchingSLG%_allowed,Visitor_PitchingK%,Visitor_PitchingBB%,Visitor_PitchingOBP_num,Visitor_PitchingOBP_den,Visitor_PitchingOBP_allowed,Visitor_Pitching1B,Visitor_PitchingSLG%_num,Visitor_PitchingSLG%_den,Visitor_PitchingSLG%_allowed
2019-03-30,WAS,NYM,SSTRASBURG-R,NSYNDERGAARD-R,-130,110,-112,102,0,1,...,0.258065,0.424242,0.030303,7.0,33.0,0.212121,4.0,6.0,31.0,0.193548
2019-03-30,PHI,ATL,NPIVETTA-R,BWILSON-R,-145,125,-145,135,1,0,...,0.366667,0.243243,0.162162,13.0,37.0,0.351351,4.0,16.0,31.0,0.516129
2019-03-30,MIA,COL,PLOPEZ-R,TANDERSON-L,125,-145,118,-128,1,0,...,0.432432,0.246154,0.061538,12.0,65.0,0.184615,2.0,18.0,59.0,0.305085
2019-03-30,MIL,STL,BWOODRUFF-R,DHUDSON-R,-125,105,-132,122,1,0,...,0.555556,0.19403,0.059701,20.0,67.0,0.298507,9.0,31.0,61.0,0.508197
2019-03-30,SDG,SFO,NMARGEVICIUS-L,DRODRIGUEZ-R,-125,105,-130,120,0,1,...,0.269841,0.290323,0.080645,20.0,62.0,0.322581,9.0,23.0,55.0,0.418182


## Model creation
---------------------

We tried many different machine learning models such as SVM, RandomForestClassifier, AdaBoostClassifier, and Neural Networks. The AdaBoostClassifier
returned the best model. 

In [59]:
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report




In [45]:
baseball_data_2016 = pd.read_csv('./Training_Data/2016_10_day.csv',index_col = 'Date', infer_datetime_format = True, parse_dates = True)


In [46]:
baseball_data_2016.head()


Unnamed: 0_level_0,home,visitor,home_open_odds,visitor_open_odds,home_close_odds,visitor_close_odds,home_win_loss,visitor_win_loss,Home_PitchingK%,Home_PitchingBB%,...,Visitor_PitchingOBP_allowed,Visitor_PitchingSLG%_allowed,Home_HittingK%,Home_HittingBB%,Home_HittingOBP,Home_HittingSLG%,Visitor_HittingK%,Visitor_HittingBB%,Visitor_HittingOBP,Visitor_HittingSLG%
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2016-04-13,NYM,MIA,-130,115,-115,105,1,0,0.262745,0.062745,...,0.35,0.414286,0.247059,0.109804,0.279528,0.248889,0.218623,0.089069,0.355102,0.442396
2016-04-13,WAS,ATL,-240,210,-141,126,1,0,0.221719,0.095023,...,0.370242,0.445344,0.217391,0.117391,0.346491,0.383838,0.241509,0.120755,0.301527,0.290749
2016-04-13,PHI,SDG,-120,110,-115,105,1,0,0.246429,0.085714,...,0.334426,0.476015,0.251799,0.061151,0.275362,0.366142,0.227425,0.070234,0.298658,0.364964
2016-04-13,CUB,CIN,-215,190,-200,175,1,0,0.239316,0.038462,...,0.328244,0.371681,0.200692,0.131488,0.371528,0.440329,0.190283,0.089069,0.331967,0.422018
2016-04-13,STL,MIL,-170,150,-145,130,0,1,0.221818,0.123636,...,0.349624,0.540426,0.239203,0.106312,0.369128,0.470588,0.273092,0.116466,0.322581,0.396313


In [47]:
baseball_data_2016.columns.values

array(['home', 'visitor', 'home_open_odds', 'visitor_open_odds',
       'home_close_odds', 'visitor_close_odds', 'home_win_loss',
       'visitor_win_loss', 'Home_PitchingK%', 'Home_PitchingBB%',
       'Home_PitchingOBP_allowed', 'Home_PitchingSLG%_allowed',
       'Visitor_PitchingK%', 'Visitor_PitchingBB%',
       'Visitor_PitchingOBP_allowed', 'Visitor_PitchingSLG%_allowed',
       'Home_HittingK%', 'Home_HittingBB%', 'Home_HittingOBP',
       'Home_HittingSLG%', 'Visitor_HittingK%', 'Visitor_HittingBB%',
       'Visitor_HittingOBP', 'Visitor_HittingSLG%'], dtype=object)

In [48]:
X = baseball_data_2016[['home_open_odds', 'visitor_open_odds',
       'home_close_odds', 'visitor_close_odds', 'Home_PitchingK%', 'Home_PitchingBB%',
       'Home_PitchingOBP_allowed', 'Home_PitchingSLG%_allowed',
       'Visitor_PitchingK%', 'Visitor_PitchingBB%',
       'Visitor_PitchingOBP_allowed', 'Visitor_PitchingSLG%_allowed',
       'Home_HittingK%', 'Home_HittingBB%', 'Home_HittingOBP',
       'Home_HittingSLG%', 'Visitor_HittingK%', 'Visitor_HittingBB%',
       'Visitor_HittingOBP', 'Visitor_HittingSLG%']]
y = baseball_data_2016['home_win_loss']

In [49]:
scaler = StandardScaler()
X_transformed = scaler.fit_transform(X)

In [50]:
len(baseball_data_2016) * 0.50

1159.0

In [51]:
X_train = X_transformed[:1159]
X_test = X_transformed[1160:]
y_train = y[:1159]
y_test = y[1160:]

In [52]:
model = SVC(kernel = 'rbf', random_state = 1, probability = True)
model.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=True, random_state=1, shrinking=True,
    tol=0.001, verbose=False)

In [54]:
model.score(X_test, y_test)

0.540587219343696

In [68]:
rf_model = RandomForestClassifier(n_estimators= 1000, random_state= 1)

In [69]:
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [70]:
predictions_rf = rf_model.predict(X_test)

In [71]:
acc_score = accuracy_score(y_test, predictions)

In [72]:
print(acc_score)

0.5552677029360967


In [62]:
clf = AdaBoostClassifier(n_estimators = 2500, random_state = 1)

In [63]:
clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=2500, random_state=1)

In [64]:
predictions_clf = clf.predict(X_test)

In [65]:
acc_score_clf = balanced_accuracy_score(y_test, predictions)

In [66]:
print(acc_score_clf)

0.5437545213407282


In [67]:
actual_df = pd.DataFrame(y_test)
actual_df.reset_index(inplace = True)

In [74]:
predict_df = pd.DataFrame(predictions_rf)


In [75]:
actual_predict_df = pd.concat([actual_df,predict_df], axis = 1, join = 'inner')

In [77]:
odds_df_new = baseball_data_2016[['home','visitor','home_open_odds','visitor_open_odds']][1160:]
odds_df_new.reset_index(inplace = True)
odds_df_new.drop(columns = ['Date'],inplace = True)

In [78]:
df = pd.concat([actual_df,predict_df, odds_df_new], axis = 1, join ='inner')

In [79]:
df.set_index('Date', inplace = True)

In [80]:
df.columns = ['Actual','Predicted','Home','Visitor','Home_Open_Odds','Visitor_Open_Odds']