# Baseball Game Predictor and Gambling Program
----------------------------------------------

This notebook walks through the entire process: pulling data, cleaning data, feature engineering, and creating a machine learning model
to predict games. It ends with the results of a betting strategy based on the model's predictions.

## Pulling Data from PyBaseball's API
--------------------------------

The initial raw data is pulled from PyBaseball. Two functions pull our batting and pitching statistics. There is a little light
cleaning of the initial raw data to drop unneeded columns and convert to a datetime index. These functions are located in the Libs folder inside
the PyBaseball_data_pull_and_cleaning.py file. After pulling and cleaning the data, the csv files were saved in the Data folder. DO NOT RUN THE
CELLS THAT PULL OR CLEAN DATA. The data has already been saved, and pulling it takes a considerable amount of time.

In [None]:
# import functions from Libs
from Libs.PyBaseball_data_pull_and_cleaning import get_batting_data, get_pitching_data, clean_batting_data, clean_pitching_data
import pandas as pd

In [None]:
# Pull batting and pitching data for 2016, 2017, 2018, and 2019 
batting_data_2016 = get_batting_data('2016-04-03', '2016-10-02')
batting_data_2017 = get_batting_data('2017-04-02', '2017-10-01')
batting_data_2018 = get_batting_data('2018-03-29', '2018-10-01')
batting_data_2019 = get_batting_data('2019-03-28', '2019-09-29')

pitching_data_2016 = get_pitching_data('2016-04-03','2016-10-02')
pitching_data_2017 = get_pitching_data('2017-04-02', '2017-10-01')
pitching_data_2018 = get_pitching_data('2018-03-29', '2018-10-01')
pitching_data_2019 = get_pitching_data('2019-03-28', '2019-09-29')


In [None]:
# Example of raw batting_data.  These cells can be run.
raw_hitting_data = pd.read_csv('./Data/Batting/Raw_Data/raw_batting_data_2017.csv')
raw_hitting_data

In [None]:
# Example of raw pitching data
raw_pitching_data = pd.read_csv('./Data/Pitching/Raw_Data/raw_pitching_data_2017.csv')
raw_pitching_data.head()

In [None]:
# Initial light cleaning of pulled data 
batting_data_clean_2016 = clean_batting_data(batting_data_2016)
batting_data_clean_2017 = clean_batting_data(batting_data_2017)
batting_data_clean_2018 = clean_batting_data(batting_data_2018)
batting_data_clean_2019 = clean_batting_data(batting_data_2019)
pitching_data_clean_2016 = clean_pitching_data(pitching_data_2016)
pitching_data_clean_2017 = clean_pitching_data(pitching_data_2017)
pitching_data_clean_2018 = clean_pitching_data(pitching_data_2018)
pitching_data_clean_2019 = clean_pitching_data(pitching_data_2019)


In [None]:
# Example of clean batting_data 
clean_hitting_data = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
clean_hitting_data.head()

In [None]:
# Example of clean pitching_data
# (named clean_pitching_df so it does not shadow the imported clean_pitching_data function)
clean_pitching_df = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
clean_pitching_df.head()

## Creating DataFrame for Feature Selection
---------------------------------------------

This section creates a dataframe from our saved batting and pitching csv files and combines it with the odds csv
files we downloaded, producing one dataframe for each season. The functions used for this process are in the Training_DataFrame_creation.py
file. The resulting dataframes have been saved in the Training_Data folder. Many different features were experimented with, but ultimately
these are the features we settled upon. A look-back period of 10 days for calculating stats resulted in the best-performing model.
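
To make the look-back idea concrete, here is a minimal sketch in plain Python. It is not the code in Training_DataFrame_creation.py; the function name `look_back_totals`, the sample box scores, and the choice to exclude the game day itself from the window are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical daily team batting lines: (game date, hits, at-bats).
daily_lines = [
    (date(2017, 4, 1), 8, 33),
    (date(2017, 4, 5), 10, 35),
    (date(2017, 4, 12), 6, 31),
    (date(2017, 4, 14), 12, 36),
]

def look_back_totals(lines, game_day, look_back=10):
    """Sum hits and at-bats over the `look_back` days before `game_day`.

    The game day itself is excluded (an assumption) so that a game's own
    box score never feeds the features used to predict it.
    """
    window_start = game_day - timedelta(days=look_back)
    in_window = [(h, ab) for d, h, ab in lines if window_start <= d < game_day]
    hits = sum(h for h, _ in in_window)
    at_bats = sum(ab for _, ab in in_window)
    return hits, at_bats

hits, at_bats = look_back_totals(daily_lines, date(2017, 4, 15))
batting_average = hits / at_bats if at_bats else 0.0
```

Only the three games from April 5 onward fall inside the 10-day window for an April 15 game; the April 1 game is too old to count.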

In [None]:
# Import functions for dataframe creation and pandas to read in csv files
import pandas as pd
from Libs.Training_DataFrame_creation import df_for_feature_selection

In [None]:
# Read in necessary data files for batting, pitching, and gambling odds
batting_data_2016 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2017 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2018 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
batting_data_2019 = pd.read_csv('./Data/Batting/Clean_Data/clean_batting_data_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)

pitching_data_2016 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2017 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2018 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
pitching_data_2019 = pd.read_csv('./Data/Pitching/Clean_Data/clean_pitching_data_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)

odds_df_2016 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2016.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2017 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2017.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2018 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2018.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)
odds_df_2019 = pd.read_csv('./Betting_Odds/Clean_Odds/mlb_odds_2019.csv', parse_dates = True, index_col = 'Date', infer_datetime_format = True)


In [None]:
batting_data_2016.head()

In [None]:
pitching_data_2016.head()

In [None]:
odds_df_2016.head()

In [None]:
#Create training dataframes for each season
training_df_2016 = df_for_feature_selection(odds_df_2016, batting_data_2016, pitching_data_2016, look_back = 10)
training_df_2017 = df_for_feature_selection(odds_df_2017, batting_data_2017, pitching_data_2017, look_back = 10)
training_df_2018 = df_for_feature_selection(odds_df_2018, batting_data_2018, pitching_data_2018, look_back = 10)
training_df_2019 = df_for_feature_selection(odds_df_2019, batting_data_2019, pitching_data_2019, look_back = 10)


In [None]:
training_df_2016.head()

In [None]:
training_df_2017.head()

In [None]:
training_df_2018.head()

In [None]:
training_df_2019.head()

## Feature Selection and Stat Calculations
------------------------------------------

Now that our dataframes are created for each season, our features are selected and stats are calculated using functions in the Baseball_stats.py
file located in Libs.
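
The feature names used later (K%, BB%, OBP, SLG%) are standard rate stats. The sketch below shows the usual formulas for them; it is not the code in Baseball_stats.py, and the function name `rate_stats`, its argument names, and the sample totals are hypothetical.

```python
def rate_stats(pa, ab, h, doubles, triples, hr, bb, hbp, sf, so):
    """Return K%, BB%, OBP, and SLG% from raw counting stats."""
    singles = h - doubles - triples - hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    obp_denominator = ab + bb + hbp + sf
    return {
        "K%": so / pa if pa else 0.0,            # strikeouts per plate appearance
        "BB%": bb / pa if pa else 0.0,           # walks per plate appearance
        "OBP": (h + bb + hbp) / obp_denominator if obp_denominator else 0.0,
        "SLG%": total_bases / ab if ab else 0.0,  # total bases per at-bat
    }

# Hypothetical 10-day team totals, for illustration only.
stats = rate_stats(pa=380, ab=340, h=90, doubles=18, triples=2, hr=12,
                   bb=32, hbp=4, sf=4, so=76)
```

The same formulas apply on the pitching side, where the counting stats are those allowed by the pitching staff rather than produced by the hitters.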

In [None]:
# import functions from Baseball_stats.py in Libs folder
from Libs.Baseball_stats import baseball_stats_calculator_hitting, baseball_stats_calculator_pitching

In [None]:
# Calculate stats
feature_df_hitting_2016 = baseball_stats_calculator_hitting(training_df_2016)
final_feature_df_2016 = baseball_stats_calculator_pitching(feature_df_hitting_2016)
feature_df_hitting_2017 = baseball_stats_calculator_hitting(training_df_2017)
final_feature_df_2017 = baseball_stats_calculator_pitching(feature_df_hitting_2017)
feature_df_hitting_2018 = baseball_stats_calculator_hitting(training_df_2018)
final_feature_df_2018 = baseball_stats_calculator_pitching(feature_df_hitting_2018)
feature_df_hitting_2019 = baseball_stats_calculator_hitting(training_df_2019)
final_feature_df_2019 = baseball_stats_calculator_pitching(feature_df_hitting_2019)


In [None]:
final_feature_df_2016.head()

In [None]:
final_feature_df_2017.head()

In [None]:
final_feature_df_2018.head()

In [None]:
final_feature_df_2019.head()

## Model Creation
---------------------

We tried several machine learning models, including SVM, RandomForestClassifier, AdaBoostClassifier, and neural networks. The RandomForestClassifier
returned the best model.
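
The cells below split the season chronologically, training on the first half and testing on the second, after standardizing the features. As a hedged illustration of that idea, the plain-Python sketch here learns the scaling statistics from the training half only, so later games do not influence how earlier ones are scaled. The helper names and the sample rows are hypothetical; this is a variant for comparison, not the notebook's own cells.

```python
from statistics import mean, pstdev

def chronological_split(rows, train_fraction=0.5):
    """Split date-ordered rows into an earlier block and a later block."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def fit_scaler(train_rows):
    """Per-column mean and population std, learned from training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return means, stds

def transform(rows, means, stds):
    """Standardize every row with previously fitted means and stds."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical feature rows in date order: [home_OBP, visitor_OBP].
rows = [[0.310, 0.325], [0.330, 0.300], [0.320, 0.315], [0.340, 0.335]]
train, test = chronological_split(rows)
means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)
```

With scikit-learn, the equivalent pattern is to call `fit_transform` on the training rows and `transform` on the test rows.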

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from imblearn.metrics import classification_report_imbalanced




In [None]:
baseball_data_2017 = pd.read_csv('./Training_Data/2017_10_day.csv',index_col = 'Date', infer_datetime_format = True, parse_dates = True)


In [None]:
no_lines = baseball_data_2017[baseball_data_2017['home_open_odds'] == 'NL']
no_lines.head()

In [None]:
baseball_data_2017.head()


In [None]:
# Drop games with no opening line ('NL') from the 2017 data loaded above
baseball_data_2017 = baseball_data_2017[baseball_data_2017['home_open_odds'] != 'NL']

In [None]:
baseball_data_2017.head()

In [None]:
baseball_data_2017.columns.values

In [None]:
X = baseball_data_2017[['home_open_odds', 'visitor_open_odds','Home_PitchingK%', 'Home_PitchingBB%',
       'Home_PitchingOBP_allowed', 'Home_PitchingSLG%_allowed',
       'Visitor_PitchingK%', 'Visitor_PitchingBB%',
       'Visitor_PitchingOBP_allowed', 'Visitor_PitchingSLG%_allowed',
       'Home_HittingK%', 'Home_HittingBB%', 'Home_HittingOBP',
       'Home_HittingSLG%', 'Visitor_HittingK%', 'Visitor_HittingBB%',
       'Visitor_HittingOBP', 'Visitor_HittingSLG%']]
y = baseball_data_2017['home_win_loss']

In [None]:
scaler = StandardScaler()
X_transformed = scaler.fit_transform(X)

In [None]:
len(baseball_data_2017) * 0.50

In [None]:
X.head()

In [None]:
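# Chronological split near the season midpoint: earlier games train, later games test.
# Note: the test slices start at 1159, so row 1158 falls in neither set.
X_train = X_transformed[:1158]
X_test = X_transformed[1159:]
y_train = y[:1158]
y_test = y[1159:]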

In [None]:
model = SVC(kernel = 'rbf', random_state = 1, probability = True)
model.fit(X_train, y_train)

In [None]:
model.score(X_test, y_test)

In [None]:
rf_model = RandomForestClassifier(n_estimators= 1000, random_state= 1)

In [None]:
rf_model.fit(X_train, y_train)

In [None]:
predictions_rf = rf_model.predict(X_test)

In [None]:
acc_score = accuracy_score(y_test, predictions_rf)

In [None]:
importances = rf_model.feature_importances_

In [None]:
sorted(zip(rf_model.feature_importances_, X.columns), reverse=True)

In [None]:
print(acc_score)

In [None]:
print(classification_report_imbalanced(y_test, predictions_rf))

In [None]:
clf = AdaBoostClassifier(n_estimators = 2500, random_state = 1)

In [None]:
clf.fit(X_train, y_train)

In [None]:
predictions_clf = clf.predict(X_test)

In [None]:
acc_score_clf = balanced_accuracy_score(y_test, predictions_clf)

In [None]:
print(acc_score_clf)

In [None]:
actual_df = pd.DataFrame(y_test)
actual_df.reset_index(inplace = True)

In [None]:
predict_df = pd.DataFrame(predictions_rf)


In [None]:
actual_predict_df = pd.concat([actual_df,predict_df], axis = 1, join = 'inner')

In [None]:
odds_df_new = baseball_data_2017[['home','visitor','home_open_odds','visitor_open_odds']][1159:]
odds_df_new.reset_index(inplace = True)
odds_df_new.drop(columns = ['Date'],inplace = True)

In [None]:
df = pd.concat([actual_df,predict_df, odds_df_new], axis = 1, join ='inner')

In [None]:
df.set_index('Date', inplace = True)

In [None]:
df.columns = ['Actual','Predicted','Home','Visitor','Home_Open_Odds','Visitor_Open_Odds']

In [None]:
df.head()

In [None]:
df.to_csv('./predictions_2017.csv')

## Gambling Results
---------------------

Here we experiment with a few different gambling strategies based on our model's results.
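
For readers unfamiliar with moneyline prices, the sketch below shows how profit on a single bet follows from American odds. It is not the code in Libs/Gambling.py; the function name `moneyline_profit`, the flat 100 stake, and the sample bets are assumptions for illustration.

```python
def moneyline_profit(odds, stake, won):
    """Profit on one moneyline bet at American odds.

    Positive odds (+130) pay `odds` per 100 staked; negative odds (-150)
    require staking `abs(odds)` to win 100. A losing bet forfeits the stake.
    """
    if not won:
        return -stake
    if odds > 0:
        return stake * odds / 100
    return stake * 100 / abs(odds)

# Hypothetical bets: (odds, did the bet win?), each with a flat 100 stake.
bets = [(-150, True), (130, False), (120, True)]
total = sum(moneyline_profit(odds, 100, won) for odds, won in bets)
```

Summing this per-bet profit over every game in a category (home favorites, road underdogs, and so on) gives totals of the kind reported below.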

In [1]:
from Libs.Gambling import find_total_profits

In [24]:
find_total_profits('2016')

Betting on the favorites to win at home: $-875.35    The accuracy of betting on the favorites to win at home is 60.3%
Betting on the favorites to win on the road: $-2637.7    The accuracy of betting on the favorites to win on the road is 50.0%
Betting on the underdogs to win at home: $-10318    The accuracy of betting on the underdogs to win at home is 29.61%
Betting on the dogs to win on the road: $3769    The accuracy of betting on the underdogs to win on the road is 50.94%
Total profits for the second half of the 2016 season: $-10062.05
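
Accuracy alone does not decide profitability. In the output above, home favorites win 60.3% of the time yet lose money, while road underdogs win 50.94% and turn a profit. One way to see why is the break-even win rate implied by a price. The sketch below uses two hypothetical prices, -160 and +140, which are not taken from the odds files.

```python
def break_even_win_rate(odds):
    """Win rate at which a flat bet at American `odds` has zero expected profit."""
    if odds > 0:
        return 100 / (odds + 100)
    return abs(odds) / (abs(odds) + 100)

favorite = break_even_win_rate(-160)  # 160 / 260
underdog = break_even_win_rate(140)   # 100 / 240
```

At these illustrative prices a favorite must win more than about 61.5% of the time to profit, while an underdog needs only about 41.7%.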
