# MLB-Predict

### In this notebook, I will...
* import code from data.py
* instantiate an instance of the LeagueStats class
* collect a range of historical game data with it
* save the historical data to an .xlsx file
* load the historical data into this notebook
* prepare the data for training (i.e. strip unnecessary features)
* divide data into testing and training sets
* fine tune an existing gradient boosting framework
* tune hyperparameters to get best model and then save model weights

In [None]:
from data import LeagueStats, TeamStats

mlb = LeagueStats()

## MLB Data Retrieval Example

This shows how I would retrieve data manually using my LeagueStats class. In reality, I use the data_retriever.py script to collect massive amounts of data (seasons at a time) and have it automatically saved to disk in an organized format. 

In [None]:
# mlb.get_data(start_date="04/01/2023", end_date="08/15/2023", file_path="data/seasons/2023.xlsx")

## Example Data Preparation and Model Training

The code below shows the steps for preparing data and how I would train a model. It is not fully executable because the file paths and data are placeholders. It only illustrates the steps that I eventually consolidate in the prepare_data function. Below the definition of prepare_data begins the actual training for various models, including the discontinued mlb3year model and my new mlb4year model.

## Merging and loading data
Load and merge season data from each xlsx file into a single xlsx file
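The merge step described above can be sketched as a small reusable helper. The name `merge_sheets` and its `suffix` and `reader` parameters are hypothetical and not part of data.py. Collecting the frames in a list and concatenating once avoids copying the accumulated frame on every iteration.

```python
import os
import pandas as pd

def merge_sheets(directory, suffix=".xlsx", reader=pd.read_excel):
    """Read every file in `directory` ending in `suffix` and stack the rows."""
    frames = []
    # sorted() makes the row order independent of the file system's listing order
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(suffix):
            frames.append(reader(os.path.join(directory, filename)))
    if not frames:
        # nothing matched: return an empty frame instead of letting concat raise
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Passing `index=False` to `to_excel` when saving the merged frame would keep the row index from being written out as an extra column.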

In [None]:
import pandas as pd
import os

# path to data sheets to combine data from
directory = '<data_sheets_folder>'

data = pd.DataFrame()

# iterate through directory to combine data
for filename in os.listdir(directory):
    if filename.endswith('.xlsx'):
        path = os.path.join(directory, filename)
        df = pd.read_excel(path)
        data = pd.concat([data, df], ignore_index=True)
        
data.to_excel('<combined_data>.xlsx')

#### *Loading and processing data* 
Load data from the master .xlsx file into a data frame. Prepare data for training

In [None]:
data = pd.read_excel('<combined_data>.xlsx')

# remove the game-id, date, home/away team features
data.drop(columns=['game-id', 'date', 'home-team', 'away-team'], inplace=True)

# drops rows with missing labels
data = data.dropna(subset=['did-home-win'])

# convert 'did-home-win' labels to binary values
data['did-home-win'] = data['did-home-win'].astype(int)

## Data Processing
#### *Experimental Optimization*
Rearrange the order of the features to attempt to optimize the model

* Order 1: Place the most important features first, with each home team statistic immediately followed by the away team's counterpart. This allows for many meaningful comparisons between adjacent features.
* Order 2: Place the most important features first, with all of the home team's statistics appearing before any of the away team's statistics. The stats for each team are in the same order, just completely separated.
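As a rough illustration of the two orders, both can be generated from a single list of stat names ranked by importance. The helper names `interleaved_order` and `grouped_order` are hypothetical, and the three stats are just a short excerpt of the full list.

```python
def interleaved_order(stats, label="did-home-win"):
    """Order 1: each home stat immediately followed by its away counterpart."""
    columns = [label]
    for stat in stats:
        columns += [f"home-{stat}", f"away-{stat}"]
    return columns

def grouped_order(stats, label="did-home-win"):
    """Order 2: all home stats first, then all away stats, in the same stat order."""
    return [label] + [f"home-{s}" for s in stats] + [f"away-{s}" for s in stats]

# short excerpt of the stat names, most important first
stats = ["win-percentage", "starter-season-era", "top5-hr-avg"]
print(interleaved_order(stats))
print(grouped_order(stats))
```

Both orders contain exactly the same columns, so switching between them only changes the column positions the model sees.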

In [None]:
order1 = [
    "did-home-win",
    "home-win-percentage", "away-win-percentage",
    "home-starter-season-era", "away-starter-season-era",
    "home-starter-season-win-percentage", "away-starter-season-win-percentage",
    "home-top5-hr-avg", "away-top5-hr-avg",
    "home-last10-avg-runs", "away-last10-avg-runs",
    "home-last10-avg-ops", "away-last10-avg-ops",
    "home-starter-season-whip", "away-starter-season-whip",
    "home-top5-rbi-avg", "away-top5-rbi-avg",
    "home-last10-avg-runs-allowed", "away-last10-avg-runs-allowed",
    "home-starter-season-avg", "away-starter-season-avg",
    "home-top5-batting-avg", "away-top5-batting-avg",
    "home-starter-season-strike-percentage", "away-starter-season-strike-percentage",
    "home-last10-avg-hits", "away-last10-avg-hits",
    "home-last10-avg-hits-allowed", "away-last10-avg-hits-allowed",
    "home-last10-avg-obp", "away-last10-avg-obp",
    "home-last10-avg-avg", "away-last10-avg-avg",
    "home-last10-avg-rbi", "away-last10-avg-rbi",
    "home-starter-season-runs-per9", "away-starter-season-runs-per9",
    "home-top5-stolenBases-avg", "away-top5-stolenBases-avg",
    "home-top5-totalBases-avg", "away-top5-totalBases-avg",
    "home-last10-avg-strikeouts", "away-last10-avg-strikeouts",
    "home-starter-career-era", "away-starter-career-era",
]

order2 = [
    "did-home-win",
    "home-win-percentage", "home-starter-season-era",
    "home-starter-season-win-percentage", "home-top5-hr-avg",
    "home-last10-avg-runs", "home-last10-avg-ops",
    "home-starter-season-whip", "home-top5-rbi-avg",
    "home-last10-avg-runs-allowed", "home-starter-season-avg",
    "home-top5-batting-avg", "home-starter-season-strike-percentage",
    "home-last10-avg-hits", "home-last10-avg-hits-allowed",
    "home-last10-avg-obp", "home-last10-avg-avg",
    "home-last10-avg-rbi", "home-starter-season-runs-per9",
    "home-top5-stolenBases-avg", "home-top5-totalBases-avg",
    "home-last10-avg-strikeouts", "home-starter-career-era",
    "away-win-percentage", "away-starter-season-era",
    "away-starter-season-win-percentage", "away-top5-hr-avg",
    "away-last10-avg-runs", "away-last10-avg-ops",
    "away-starter-season-whip", "away-top5-rbi-avg",
    "away-last10-avg-runs-allowed", "away-starter-season-avg",
    "away-top5-batting-avg", "away-starter-season-strike-percentage",
    "away-last10-avg-hits", "away-last10-avg-hits-allowed",
    "away-last10-avg-obp", "away-last10-avg-avg",
    "away-last10-avg-rbi", "away-starter-season-runs-per9",
    "away-top5-stolenBases-avg", "away-top5-totalBases-avg",
    "away-last10-avg-strikeouts", "away-starter-career-era",
]

In [None]:
# reorder the columns
data = data[order2]

#### *Drop rows with too many missing features*
Drops all rows that are missing more than THRESHOLD features
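A small toy example of how the threshold maps onto pandas: `dropna(thresh=k)` keeps a row only if it has at least `k` non-missing values, so passing `number_of_columns - THRESHOLD` removes rows that are missing more than THRESHOLD values. The frame and the value `THRESHOLD = 2` below are made up for illustration.

```python
import numpy as np
import pandas as pd

THRESHOLD = 2  # toy value for this example only

toy = pd.DataFrame({
    "a": [1.0, 1.0, np.nan],
    "b": [2.0, np.nan, np.nan],
    "c": [3.0, np.nan, np.nan],
    "d": [4.0, 4.0, 4.0],
})

# row 0 misses 0 values, row 1 misses 2, row 2 misses 3
# thresh is the minimum number of non-missing values a row needs to be kept
kept = toy.dropna(thresh=toy.shape[1] - THRESHOLD)
print(kept.index.tolist())  # [0, 1]
```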

In [None]:
# number of missing features beyond which a row is excluded/removed
THRESHOLD = 10

In [None]:
# drops rows with too many missing features
data = data.dropna(thresh=data.shape[1] - THRESHOLD)

#### *Min-Max Feature Normalization*
Normalize the numeric features to a scale of [0,1]
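A minimal sketch of what the normalization does, using made-up numbers: each column is mapped through (x - min) / (max - min), with min and max taken from the data the scaler was fitted on. The pickle round trip stands in for saving the scaler to disk and reloading it at prediction time.

```python
import pickle
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]])  # made-up values
scaler = MinMaxScaler().fit(train)

# manual min-max, computed per column, matches the fitted scaler
manual = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))
assert np.allclose(manual, scaler.transform(train))

# round-trip through pickle, as when a saved scaler is reloaded later
restored = pickle.loads(pickle.dumps(scaler))
new_game = np.array([[5.0, 40.0]])
scaled_new = restored.transform(new_game)
print(scaled_new)  # [[0.75 1.5 ]]
```

Note that the second value lands outside [0,1]: data seen after fitting can exceed the fitted range, since the scaler does not clip by default.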

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

In [None]:
import pickle

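# apply min-max normalization to the feature columns only;
# assigning back into the DataFrame keeps the column names for the next cell
feature_cols = data.columns.drop('did-home-win')
data[feature_cols] = scaler.fit_transform(data[feature_cols])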

scaler_path = 'models/scalers/'
with open(scaler_path + '<scaler_name>.pkl', 'wb') as file:
    pickle.dump(scaler, file)

#### *Randomize data order, split training and testing data*
Shuffling ensures that the training and testing data aren't chronologically grouped, so each game is treated more independently.
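The shuffle-then-split step can also be sketched on plain NumPy arrays with a fixed seed, assuming reproducible splits are wanted. The function name `shuffled_split` and the `seed` parameter are hypothetical; the 85/15 ratio matches the one used in this notebook.

```python
import numpy as np

def shuffled_split(features, labels, train_fraction=0.85, seed=0):
    """Shuffle rows with a seeded generator, then cut into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    split_index = int(train_fraction * len(features))
    train_idx, test_idx = order[:split_index], order[split_index:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]

# toy data: 100 samples with 2 features each
features = np.arange(200).reshape(100, 2)
labels = np.arange(100) % 2
x_train, x_test, y_train, y_test = shuffled_split(features, labels)
print(x_train.shape, x_test.shape)  # (85, 2) (15, 2)
```

With the same seed, the same rows end up in the test set on every run, which makes accuracy differences between runs easier to attribute to the model rather than to the split.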

In [None]:
import numpy as np

# randomize the order of the rows in the dataframe
data = data.sample(frac=1).reset_index(drop=True)

# separate labels from the features
features = data.drop('did-home-win', axis=1).values
labels = data['did-home-win'].values

# verify shapes features and labels
print("Features shape: ", features.shape)
print("Labels shape: ", labels.shape)

indices = list(range(len(features)))
split_index = int(0.85 * len(features))

train_indices = indices[:split_index]
test_indices  = indices[split_index:]

x_train = features[train_indices]
x_test  = features[test_indices ]
y_train = labels[train_indices]
y_test  = labels[test_indices ]

# verify shapes of training/testing sets
print("Training set shape: ", x_train.shape, y_train.shape)
print("Testing set shape: " , x_test.shape,  y_test.shape )

## Building the model
Use a LightGBM model and define its parameters

In [None]:
import lightgbm as lgb
from sklearn.metrics import accuracy_score

# model parameters 
params = {
    'objective': 'binary',
    'metric': 'accuracy',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
}

In [None]:
# create a LightGBM Dataset with training features and labels
train_data = lgb.Dataset(x_train, label=y_train)

## Training the models

In [None]:
model = lgb.train(params, train_data, num_boost_round=1000)
model.save_model('<model_name>.txt')

## Testing the model
Here I would test the model's accuracy on its test set.
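With a binary objective, the model's `predict` returns probabilities, which are turned into 0/1 predictions by thresholding at 0.5. The sketch below shows that step on made-up numbers, next to a simple reference point: the accuracy of always predicting the more common class. The helper names are hypothetical.

```python
import numpy as np

def accuracy_at_threshold(y_true, y_prob, threshold=0.5):
    """Fraction of samples where the thresholded probability matches the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

def majority_baseline(y_true):
    """Accuracy of always predicting the more frequent class."""
    rate = float(np.mean(y_true))
    return max(rate, 1.0 - rate)

# made-up labels and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 1])
y_prob = np.array([0.9, 0.4, 0.3, 0.6, 0.2, 0.8])
print(accuracy_at_threshold(y_true, y_prob), majority_baseline(y_true))
```

Comparing the model's accuracy against the majority baseline gives a sense of how much the features add beyond simply predicting a home win every time.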

In [None]:
y_pred = model.predict(x_test)
y_pred_binary = (y_pred > 0.5).astype(int)
accuracy = accuracy_score(y_test, y_pred_binary)

print('Accuracy:', accuracy)

## End of Example/Demo Code

Everything above is incomplete demo code that is meant simply to show the data preparation and training process.

## Generalized Data Preparation using `prepare_data()`

Below is the definition of the `prepare_data` function, which generalizes the code shown above. Below the definition begins the actual training of the discontinued mlb3year model and all models created since then.

In [None]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd        
import numpy as np
import pickle
import os

def prepare_data(data_dirs, model_name, order=order2, save_dir=None, missing_data_threshold=10):
    """
    Args:
        data_dirs: list of paths to folders with the .xlsx data sheets
        model_name: name that model should be saved as
        order: order of the data features used for training 
        save_dir: file_path to save merged .xlsx sheet to if desired
            -> must end in .xlsx and be a valid (existing) file path
        missing_data_threshold: max number of acceptable missing features in training sample
    
    Returns:
        x_train, x_test, y_train, y_test
    """
    df = pd.DataFrame()
    # iterate through mlb data directory to retrieve each file
    for dir in data_dirs:
        for filename in os.listdir(dir):
            if filename.endswith('.xlsx'):
                path = os.path.join(dir, filename)
                d = pd.read_excel(path)
                df = pd.concat([df, d], ignore_index=True)
    if save_dir:
        df.to_excel(save_dir)
    
    # remove the game-id, date, home/away team features
    df.drop(columns=['game-id', 'date', 'home-team', 'away-team'], inplace=True)
    # drops rows with missing labels
    df = df.dropna(subset=['did-home-win'])
    # convert 'did-home-win' labels to binary values
    df['did-home-win'] = df['did-home-win'].astype(int)
    # randomize the order of the rows in the dataframe
    df = df.sample(frac=1).reset_index(drop=True)
    # change feature order of df
    df = df[order]
    # drop samples missing more than THRESH values
    df = df.dropna(thresh=df.shape[1] - missing_data_threshold)
    labels = df['did-home-win'].values
    df = df.drop(columns=['did-home-win'])
    scaler = MinMaxScaler()
    # apply min-max normalization to features
    df = scaler.fit_transform(df)
    scaler_path = 'models/scalers/'
    with open(scaler_path + model_name + '_scaler.pkl', 'wb') as file:
        pickle.dump(scaler, file)
    features = df
    # verify shapes features and labels
    print("Features shape: ", df.shape)
    print("Labels shape: ", labels.shape)
    indices = list(range(len(features)))
    split_index = int(0.85 * len(features))
    train_indices = indices[:split_index]
    test_indices  = indices[split_index:]
    x_train = features[train_indices]
    x_test  = features[test_indices ]
    y_train = labels[train_indices]
    y_test  = labels[test_indices ]
    # verify shapes of training/testing sets
    print("Training set shape: ", x_train.shape, y_train.shape)
    print("Testing set shape: " , x_test.shape,  y_test.shape )
    return x_train, x_test, y_train, y_test

## MLB 3 Year (DISCONTINUED)
Using the data_retrieval.py script, I collected game data from the 2021-2023 seasons in the background, and I'm going to pull that data and use it to train a model. The data covers every MLB game during these 3 seasons (the 2023 cutoff is 07/09).

#### *Note*: The mlb3year model uses my original data blend, which included some ELO statistics that I have since moved away from. This model is therefore no longer in use, and it can't be retrained or used for prediction because of newer code in data.py.

In [None]:
import lightgbm as lgb
from sklearn.metrics import accuracy_score

data = ['data/seasons/2021', 'data/seasons/2022', 'data/seasons/2023']
x_train, x_test, y_train, y_test = prepare_data(data_dirs=data, model_name="mlb3year", order=order2, missing_data_threshold=10)

# model parameters 
params = {
    'objective': 'binary',
    'metric': 'accuracy',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
}

data = lgb.Dataset(x_train, label=y_train)

model = lgb.train(params, data, num_boost_round=1000)
model.save_model('models/mlb3year.txt')

y_pred = model.predict(x_test)
y_pred_binary = (y_pred > 0.5).astype(int)
accuracy = accuracy_score(y_test, y_pred_binary)

print('Accuracy:', accuracy)

## Testing different feature orders
In past models, I've used "order1", which creates direct comparisons between adjacent features by placing a home team's statistic directly next to the away team's same statistic. Here I will try a different order ("order2"), which places all of one team's statistics first and then the other team's after. Below I will use the same data as ***mlb3year***.

In [None]:
import lightgbm as lgb
from sklearn.metrics import accuracy_score

data = ['data/seasons/2021', 'data/seasons/2022', 'data/seasons/2023']
x_train, x_test, y_train, y_test = prepare_data(
    data_dirs=data, 
    model_name="mlb3year_test", 
    order=order2, 
    missing_data_threshold=10
)

# model parameters 
params = {
    'objective': 'binary',
    'metric': 'accuracy',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
}

data = lgb.Dataset(x_train, label=y_train)

model = lgb.train(params, data, num_boost_round=1000)
model.save_model('models/mlb3year_test.txt')

y_pred = model.predict(x_test)
y_pred_binary = (y_pred > 0.5).astype(int)
accuracy = accuracy_score(y_test, y_pred_binary)

print('Accuracy:', accuracy)

Output from the code above:

`Features shape:  (5975, 37)`

`Labels shape:  (5975, 37)`

`Training set shape:  (5078, 36) (5078,)`

`Testing set shape:  (897, 36) (897,)`

`[LightGBM] [Info] Number of positive: 2727, number of negative: 2351`

`[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000804 seconds.`

`You can set force_col_wise=true to remove the overhead.`

`[LightGBM] [Info] Total Bins 8551`

`[LightGBM] [Info] Number of data points in the train set: 5078, number of used features: 36`

`[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.537022 -> initscore=0.148361`

`[LightGBM] [Info] Start training from score 0.148361`

`Accuracy: 0.6432552954292085`

### Remarks
After training the model a couple of times with "order2", it appears that the feature order doesn't have a significant impact on the model's testing accuracy. If anything, I've noticed that the training accuracies were on average a little bit higher. Moving forward, I will use order2 as the default feature order.

Additionally, I will retrain the official mlb3year model with order2 so that the saved model is up to date.
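One way to judge whether a difference between two runs is meaningful is the standard error of the accuracy itself. The sketch below uses the binomial approximation, assuming the test games are independent; the accuracy and test-set size are the ones printed in the output above, and the function name is hypothetical.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

se = accuracy_standard_error(0.6432552954292085, 897)
print(f"standard error: {se:.3f}")  # standard error: 0.016
```

Under this approximation, differences of up to roughly three percentage points between two runs on a test set of this size are within the range that random variation alone could produce.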

## MLB 4 Year

This model is the first model to use my new data blend that *no longer* includes ELO statistics.

In [None]:
import lightgbm as lgb
from sklearn.metrics import accuracy_score

data = ['data/seasons/2020', 'data/seasons/2021', 
        'data/seasons/2022', 'data/seasons/2023']
x_train, x_test, y_train, y_test = prepare_data(
    data_dirs=data, 
    model_name="mlb4year", 
    order=order2, 
    missing_data_threshold=10
)

# model parameters 
params = {
    'objective': 'binary',
    'metric': 'accuracy',
    'boosting_type': 'gbdt',
    'num_leaves': 128,
    'learning_rate': 0.005,
    'tree_learner': 'serial',
    'min_data_in_leaf': 20,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
    'scale_pos_weight': 1.0,
}

data = lgb.Dataset(x_train, label=y_train)

model = lgb.train(params, data, num_boost_round=500)
model.save_model('models/mlb4year.txt')

y_pred = model.predict(x_test)
y_pred_binary = (y_pred > 0.5).astype(int)
accuracy = accuracy_score(y_test, y_pred_binary)

print('Accuracy:', accuracy)

Output from the code above:

`Features shape:  (7406, 44)`

`Labels shape:  (7406, 44)`

`Training set shape:  (6295, 44) (6295,)`

`Testing set shape:  (1111, 44) (1111,)`

`[LightGBM] [Info] Number of positive: 3355, number of negative: 2940`

`[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000666 seconds.`

`You can set force_col_wise=true to remove the overhead.`

`[LightGBM] [Info] Total Bins 8318`

`[LightGBM] [Info] Number of data points in the train set: 6295, number of used features: 44`

`[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.532963 -> initscore=0.132042`

`[LightGBM] [Info] Start training from score 0.132042`

`Accuracy: 0.6552655265526552`


## Results of MLB 4 Year 

I am pleased to see that the accuracy has not worsened as a result of my transition away from ELO statistics. My new blend of data features in each sample includes more features that attempt to measure a team's current and season-long momentum, which was meant to make up for the removed ELO statistics.

Additionally, I spent more time tuning the model's hyperparameters, and this resulted in a more consistent and higher accuracy.
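For illustration only, a small grid search over hyperparameters could be organized as below. This is a sketch and not necessarily how the tuning for mlb4year was done. The helpers `param_grid` and `best_params` are hypothetical, and `score_fn` stands for any function that trains a model with the given parameters and returns its accuracy, ideally on a validation set kept separate from the final test set.

```python
from itertools import product

def param_grid(base_params, search_space):
    """Yield one parameter dict per combination of values in search_space."""
    keys = list(search_space)
    for values in product(*(search_space[k] for k in keys)):
        # copy the base parameters and override the searched ones
        yield {**base_params, **dict(zip(keys, values))}

def best_params(candidates, score_fn):
    """Return the candidate with the highest score (hypothetical score_fn)."""
    return max(candidates, key=score_fn)

base = {"objective": "binary", "boosting_type": "gbdt"}
space = {"num_leaves": [31, 128], "learning_rate": [0.05, 0.005]}
candidates = list(param_grid(base, space))
print(len(candidates))  # 4
```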