# Regression Model
I chose ridge regression because it handles multicollinearity among the predictors and its regularization reduces the risk of overfitting to a single season of data. Note that predicted OPS is calculated by subtracting the league-average change in OPS from 2023 to 2024 from the regression output. The league-wide decrease in OPS skewed the regression, leading it to predict that 95% of players would have worse outcomes in 2025. The adjusted predictions therefore assume that the league-average change in OPS next season will be 0.
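The league adjustment described above is simple arithmetic, and a small worked example may make it easier to follow. This is a minimal sketch, not the notebook's actual code: the function name `adjusted_ops_projection` and the input numbers are hypothetical, and it assumes the predicted change and the league-average change are both expressed in percent.

```python
def adjusted_ops_projection(ops_2024, predicted_change_pct, league_avg_change_pct):
    """Project next-season OPS after removing the league-wide shift.

    All names and inputs here are illustrative assumptions.
    """
    # Change relative to the league: subtract the league-average change
    relative_change_pct = predicted_change_pct - league_avg_change_pct
    # Apply the relative percentage change to the player's 2024 OPS
    return ops_2024 * (1 + relative_change_pct / 100)


# Hypothetical player: .750 OPS in 2024, raw prediction of -5.0%,
# hypothetical league-average change of -3.6%
projection = adjusted_ops_projection(
    ops_2024=0.750, predicted_change_pct=-5.0, league_avg_change_pct=-3.6
)
print(round(projection, 4))  # 0.7395
```

A player whose raw prediction equals the league-average change ends up with a projection equal to their 2024 OPS, which is what assuming a league-average change of 0 implies.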

In [924]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

df = pd.read_csv('../out.csv')

In [925]:
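# Keep player names and 2024 OPS aside so they can be joined to the predictions later
names = df[['last_name, first_name', 'on_base_plus_slg (2024)']]

# Drop identifier columns that are not model inputs
df.drop(columns=['player_id', 'last_name, first_name'], inplace=True)

# Features: 2023 stat differentials; target: change in OPS from 2023 to 2024
X = df[['wobadiff (2023)', 'xslgdiff (2023)', 'xbadiff (2023)', 'babipdiff (2023)', 'xhrdiff (2023)']]
y = df['ops_change 2023-2024']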

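# Impute missing values, standardize the features, then fit ridge regression
steps = [('imputer', SimpleImputer()),
         ('scaler', StandardScaler()),
         ('ridge', Ridge())]

# Search over the imputation strategy and the ridge penalty (alpha = 0 is ordinary least squares)
parameters = {'imputer__strategy': ['mean', 'median'],
              'ridge__alpha': list(range(100))}

pipeline = Pipeline(steps=steps)
# 5-fold cross-validation; the default score for a regressor is R^2
clf = GridSearchCV(pipeline, param_grid=parameters, cv=5)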

clf.fit(X, y)

# Extract the fitted ridge step; its coefficients apply to the standardized features
best_model = clf.best_estimator_.named_steps['ridge']

print(f"Best Parameters: {clf.best_params_}"
      f"\nBest Score: {clf.best_score_}"
      f"\nIntercept: {best_model.intercept_}")

values = {'Coefficients': best_model.coef_}
values_df = pd.DataFrame.from_dict(values, orient='index', columns=['wOBA-xwOBA', 'SLG-xSLG', 'BA-xBA', 'BABIP Diff', 'HR-xHR'])

display(values_df)

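# Build the 2024 feature matrix and give it the 2023 column names the pipeline was fitted on
stats_2024 = df[['wobadiff (2024)', 'xslgdiff (2024)', 'xbadiff (2024)', 'babipdiff (2024)', 'xhrdiff (2024)']].copy()
stats_2024.columns = X.columns

# Predict with the full fitted pipeline so the 2024 inputs are imputed and scaled like the
# training data; calling predict on the bare Ridge step would skip the scaler
predicted_change = pd.Series(clf.best_estimator_.predict(stats_2024), index=names.index, name='% Change OPS')
predictions = names.join(predicted_change)
predictions['% Change Relative to League Avg Change'] = predictions['% Change OPS'] - df['ops_change 2023-2024'].mean()
predictions['Predicted 2025 OPS'] = predictions['on_base_plus_slg (2024)'] * (1 + predictions['% Change Relative to League Avg Change'] / 100)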

predictions.to_csv('../predictions.csv')

Best Parameters: {'imputer__strategy': 'median', 'ridge__alpha': 70}
Best Score: 0.18479707616244
Intercept: -3.616856416780667


| | wOBA-xwOBA | SLG-xSLG | BA-xBA | BABIP Diff | HR-xHR |
|---|---|---|---|---|---|
| Coefficients | -1.640897 | -1.166776 | -1.501314 | -1.205142 | -1.527901 |
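Because the ridge step sits behind a `StandardScaler`, the coefficients above are in standardized units: each is the change in predicted OPS change per one standard deviation of that feature, not per raw unit. The sketch below shows one way such coefficients could be converted back to raw feature units. It runs on synthetic data with made-up feature scales and a fixed `alpha`, so the numbers it produces have nothing to do with the fitted model in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three features on small, different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=0.0, scale=[0.02, 0.05, 0.03], size=(200, 3))
y_demo = -40 * X_demo[:, 0] - 10 * X_demo[:, 1] + rng.normal(scale=1.0, size=200)

demo_pipeline = Pipeline([("scaler", StandardScaler()),
                          ("ridge", Ridge(alpha=1.0))])
demo_pipeline.fit(X_demo, y_demo)

scaler = demo_pipeline.named_steps["scaler"]
ridge = demo_pipeline.named_steps["ridge"]

# Standardized model: y = sum(coef * (x - mean) / scale) + intercept
# Rearranged into raw units: y = sum((coef / scale) * x) + raw_intercept
raw_coef = ridge.coef_ / scaler.scale_
raw_intercept = ridge.intercept_ - np.sum(ridge.coef_ * scaler.mean_ / scaler.scale_)

# The raw-unit form reproduces the pipeline's predictions on unscaled inputs
manual_pred = X_demo @ raw_coef + raw_intercept
pipeline_pred = demo_pipeline.predict(X_demo)
print(np.allclose(manual_pred, pipeline_pred))  # True
```

The same identity explains why predictions on new data should go through the whole pipeline, or through the raw-unit coefficients, rather than through the ridge step alone applied to unscaled inputs.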
