# Predicting Batting Statistics

## Goal
The main goal of this notebook is to understand machine learning.
It uses baseball statistics to try to predict a batter's offensive performance for the coming year based on performance in past years.

### Things to achieve
- Calculate wOBA
- Group records by player and year.
- Find outliers (Injury, Team change, etc.)
- Project wOBA using a player's past wOBA.

## Load data from CSV file

In [None]:
import pandas as pd

# Enable Copy-on-Write
pd.options.mode.copy_on_write = True

raw_data = pd.read_csv('data/Batting.csv')

## Baseball statistic to quantify offensive production - wOBA

wOBA is a sabermetric statistic used to measure a hitter's overall offensive contributions per plate appearance, assigning more accurate weights to different offensive outcomes.
General Formula:
$$
wOBA = \frac{(wBB \times (BB - IBB)) + (wHBP \times HBP) + (w1B \times 1B) + (w2B \times 2B) + (w3B \times 3B) + (wHR \times HR)}{AB + BB - IBB + SF + HBP}
$$
Where:
- \( wBB, wHBP, w1B, w2B, w3B, wHR \) are weights that change slightly every year.
- \( BB \) = Base on balls (walks)
- \( IBB \) = Intentional walks
- \( HBP \) = Hit by pitch
- \( AB \) = At bats
- \( SF \) = Sacrifice flies
- \( 1B \) = Singles
- \( 2B \) = Doubles
- \( 3B \) = Triples
- \( HR \) = Home runs
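
As a worked instance of the formula, here is a minimal sketch for a single season line. The stat line is made up for illustration, and the weights are the static values this notebook uses further down rather than official yearly weights. Walks in the numerator are unintentional walks (BB - IBB), matching the calculation cell later in the notebook.

```python
# Static weights (same values the notebook defines later; real weights vary by year)
weights = {"BB": 0.69, "HBP": 0.72, "1B": 0.88, "2B": 1.247, "3B": 1.578, "HR": 2.031}

# Hypothetical season line, for illustration only
line = {"AB": 500, "H": 150, "2B": 30, "3B": 3, "HR": 20,
        "BB": 60, "IBB": 5, "HBP": 4, "SF": 5}

# Singles are hits that are not doubles, triples, or home runs
singles = line["H"] - line["2B"] - line["3B"] - line["HR"]
# Intentional walks are excluded from the walk term
unintentional_bb = line["BB"] - line["IBB"]

numerator = (
    weights["BB"] * unintentional_bb
    + weights["HBP"] * line["HBP"]
    + weights["1B"] * singles
    + weights["2B"] * line["2B"]
    + weights["3B"] * line["3B"]
    + weights["HR"] * line["HR"]
)
denominator = line["AB"] + line["BB"] - line["IBB"] + line["SF"] + line["HBP"]

woba = numerator / denominator
print(f"singles={singles}, denominator={denominator}, wOBA={woba:.3f}")
```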


### Calculate Singles

Singles are not included in the dataset.
To derive them, we take total hits ('H') and subtract doubles ('2B'), triples ('3B') and home runs ('HR').

In [11]:
work_df = raw_data.copy()
work_df['1B'] = work_df['H'] - work_df['2B'] - work_df['3B'] - work_df['HR']
work_df['1B'] = work_df['1B'].fillna(0)
work_df['1B'] = work_df['1B'].astype(int)

## wOBA weights
During the prototyping phase, static weights are used.

To improve accuracy, yearly weights can be calculated from the dataset.

In [12]:
wBB = 0.69
wHBP = 0.72
w1B = 0.88
w2B = 1.247
w3B = 1.578
wHR = 2.031
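
If yearly weights are introduced later, one possible shape is a small lookup table joined on yearID. The sketch below assumes such a table exists. The name `yearly_weights`, the two-row toy data, and the weight values are placeholders for illustration, not published or computed weights, and only the walk term is shown to keep it short.

```python
import pandas as pd

# Hypothetical per-year weight table; values are placeholders, not real weights
yearly_weights = pd.DataFrame({
    "yearID": [2009, 2010],
    "wBB": [0.69, 0.70],
})

# Toy batting rows, for illustration only
batting = pd.DataFrame({
    "yearID": [2009, 2010],
    "BB": [50, 60],
    "IBB": [5, 10],
})

# Attach each season's weights to its rows, then apply them column-wise
merged = batting.merge(yearly_weights, on="yearID", how="left")
merged["walk_term"] = merged["wBB"] * (merged["BB"] - merged["IBB"])
print(merged[["yearID", "walk_term"]])
```

The remaining terms (HBP, 1B, 2B, 3B, HR) would follow the same pattern with their own weight columns.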

## Cleaned dataset

Remove records with empty values in any of the required columns.

In [13]:
required_columns = ['BB', 'IBB', 'HBP', '1B', '2B', '3B', 'HR', 'AB', 'SF']
cleaned_df = work_df.dropna(subset=required_columns)
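
To illustrate what `dropna(subset=...)` does here, the toy example below (made-up rows, with a hypothetical `other` column) shows that a row is dropped only when one of the listed columns is missing. Missing values in columns outside the subset are ignored.

```python
import numpy as np
import pandas as pd

# Toy frame, for illustration only
toy = pd.DataFrame({
    "playerID": ["a", "b", "c"],
    "AB": [10, 20, 30],
    "SF": [1, np.nan, 2],          # row "b" is missing a required value
    "other": [np.nan, 5, np.nan],  # not in the subset, so NaN here is ignored
})

kept = toy.dropna(subset=["AB", "SF"])
print(f"dropped {len(toy) - len(kept)} of {len(toy)} rows")
print(kept["playerID"].tolist())
```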

## Save to CSV

Save the cleaned dataframe (cleaned_df) to cleaned.csv.

In [14]:
cleaned_df.to_csv("data/cleaned.csv", index=False)

## Calculate wOBA Numerator & Denominator

In [15]:
# Numerator
cleaned_df['wOBA_num'] = (
    wBB * (cleaned_df['BB'] - cleaned_df['IBB']) +
    wHBP * cleaned_df['HBP'] +
    w1B * cleaned_df['1B'] +
    w2B * cleaned_df['2B'] +
    w3B * cleaned_df['3B'] +
    wHR * cleaned_df['HR']
)

# Denominator
cleaned_df['wOBA_deno'] = (
    cleaned_df['AB'] + cleaned_df['BB'] - cleaned_df['IBB'] +
    cleaned_df['HBP'] + cleaned_df['SF']
)

### Calculate wOBA

In [16]:
cleaned_df['wOBA'] = cleaned_df['wOBA_num'] / cleaned_df['wOBA_deno']
cleaned_df['wOBA'] = cleaned_df['wOBA'].round(3)

## Save wOBA

Save the dataframe with the newly calculated wOBA statistic to woba.csv for archiving purposes.

In [17]:
# Save cleaned_df to woba.csv
cleaned_df.to_csv("data/woba.csv", index=False)

## wOBA Cleaning
After calculating wOBA for all records, some older records were unable to produce a valid wOBA.

In [21]:
# Drop records with missing wOBA value
cleaned_df = cleaned_df.dropna(subset=['wOBA'])
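
One way a record can end up without a valid wOBA is a zero denominator: a row with no plate appearances has a zero numerator as well, and pandas evaluates 0 / 0 as NaN. The toy sketch below shows that case. Whether it accounts for every dropped row in this dataset is an assumption.

```python
import pandas as pd

# Toy rows, for illustration only: the first has no plate appearances
toy = pd.DataFrame({
    "wOBA_num": [0.0, 3.52],
    "wOBA_deno": [0, 10],
})

# 0.0 / 0 evaluates to NaN rather than raising an error
toy["wOBA"] = toy["wOBA_num"] / toy["wOBA_deno"]
print(toy["wOBA"].isna().tolist())

# Dropping on the wOBA column removes exactly those rows
valid = toy.dropna(subset=["wOBA"])
print(len(valid))
```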

## Separate training data and validation data

Find a separation point to divide the data into a set for model training and a set for model validation.

In [23]:
cutoff_year = 2010

training_data = cleaned_df[cleaned_df['yearID'] <= cutoff_year]
validation_data = cleaned_df[cleaned_df['yearID'] > cutoff_year]

print(f"Training data size: {len(training_data)}")
print(f"Validation data size: {len(validation_data)}")

Training data size: 45201
Validation data size: 5072
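
When choosing a separation point, it can help to compare the validation share for a few candidate years. The sketch below wraps the same year-based split in a hypothetical helper, `split_by_year`, and runs it on toy data. The helper name and the toy rows are assumptions for illustration.

```python
import pandas as pd

def split_by_year(df, cutoff_year, year_col="yearID"):
    """Rows up to and including cutoff_year go to training, later rows to validation."""
    train = df[df[year_col] <= cutoff_year]
    val = df[df[year_col] > cutoff_year]
    return train, val

# Toy data, for illustration only
toy = pd.DataFrame({
    "yearID": [2008, 2009, 2010, 2011, 2012],
    "wOBA": [0.310, 0.325, 0.300, 0.340, 0.315],
})

for cutoff in (2009, 2010):
    train, val = split_by_year(toy, cutoff)
    share = len(val) / len(toy)
    print(f"cutoff={cutoff}: train={len(train)}, val={len(val)}, val share={share:.0%}")
```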


## Model training

### Random Forest Regressor
A random forest regression model combines multiple decision trees into a single model.
Each tree in the forest is built from a different bootstrap sample of the training data and makes its own independent prediction.
The final prediction for an input is the average of the individual trees' predictions.
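
The averaging step can be checked directly. The sketch below fits a small forest on synthetic data (made up for illustration) and compares the forest's prediction with the mean of the predictions from its individual trees, which scikit-learn exposes through `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

sample = X[:5]
# One row of predictions per tree, then average across trees
tree_preds = np.stack([tree.predict(sample) for tree in forest.estimators_])
manual_mean = tree_preds.mean(axis=0)
forest_pred = forest.predict(sample)

print(np.allclose(manual_mean, forest_pred))
```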

In [26]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Split into features and target
features = ['BB', 'IBB', 'HBP', '1B', '2B', '3B', 'HR', 'AB', 'SF']
target = 'wOBA'

X_train = training_data[features]
Y_train = training_data[target]
X_val = validation_data[features]
Y_val = validation_data[target]

rf_model = RandomForestRegressor(random_state=41, n_estimators=100)

rf_model.fit(X_train, Y_train)

#### Prediction and Validation

Use the trained model to predict wOBA for the validation data, then compare those predictions with the actual values.

In [None]:
# Prediction and Validation
y_pred = rf_model.predict(X_val)

# Evaluate
mse = mean_squared_error(Y_val, y_pred)
r2 = r2_score(Y_val, y_pred)

print(f"MSE: {mse:.4f}")
print(f"r2: {r2:.4f}")

MSE: 0.0002
r2: 0.9921
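
Note that the features used above are the same-season counting stats from which wOBA is computed, so a high r2 here largely reflects the model recovering the formula. Toward the stated goal of projecting wOBA from a player's past wOBA, a possible next step is to build a previous-season feature per player. The sketch below uses toy data and a hypothetical column name, `wOBA_prev`. It assumes one row per player and year, so multiple records for the same player and year would need to be grouped first.

```python
import pandas as pd

# Toy data, for illustration only: one row per player and year
toy = pd.DataFrame({
    "playerID": ["a", "a", "a", "b", "b"],
    "yearID": [2008, 2009, 2010, 2009, 2010],
    "wOBA": [0.310, 0.330, 0.350, 0.290, 0.300],
})

# Sort so that shifting within each player moves to the previous season's row
toy = toy.sort_values(["playerID", "yearID"])
toy["wOBA_prev"] = toy.groupby("playerID")["wOBA"].shift(1)

# A player's first season has no previous value and cannot be used as a training row
lagged = toy.dropna(subset=["wOBA_prev"])
print(lagged[["playerID", "yearID", "wOBA_prev", "wOBA"]])
```

`shift(1)` takes the previous row for that player, which is the previous season only when no years are skipped. Gaps between seasons would need separate handling.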
