
Modelling the relationship between a player’s first-time eligible arbitration salary and multiple variables.


Allen-Ho-0302/First-Time-Eligible-Arbitration-Salary-Prediction


**Special thanks to Kuan-Cheng Fu for providing huge help on this project.**

In this project, I worked with a subset of batting data covering each player's career, platform year (the year before the year of their arbitration contract), py-1 (the year before the platform year), and py-2 (two years before the platform year). There were roughly 80 variables, including plate appearances, runs, hits, AVG, OBP, WAR, MVP votes, SS votes, etc. Detailed definitions of the variables can be found in the Excel file. As for the code and writeup, please read the RMarkdown file first for the first part of this project, then move on to the Jupyter Notebook file for the second part. The order is this way because I finished the Jupyter Notebook file a year ago, and this year I decided I should also try some simpler models. I used to be more of a Python user but am now more of an R user because of my job. Hopefully this won't confuse people too much.
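To make the year naming concrete, here is a minimal sketch of how the three reference seasons relate to the arbitration contract year. The function name `reference_years` and the example year are hypothetical and are not taken from the project files.

```python
def reference_years(contract_year):
    """Map an arbitration contract year to the seasons described above.

    Hypothetical helper for illustration; not part of the project code.
    """
    platform_year = contract_year - 1  # the year before the contract year
    return {
        "platform_year": platform_year,
        "py-1": platform_year - 1,  # the year before the platform year
        "py-2": platform_year - 2,  # two years before the platform year
    }


# Illustrative contract year only.
years = reference_years(2020)
```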

The goal of this research was to develop a framework capable of predicting a player’s first-time eligible arbitration salary and placing players into tiers. The framework consists of the following parts: feature preprocessing, model building, hyperparameter tuning, model evaluation, and first-time eligible arbitration salary prediction.

In detail, I first fit a linear regression with stepwise feature selection, as can be seen in the RMarkdown file. Then, in the Jupyter Notebook file, I defined two functions for model building and hyperparameter tuning, and checked the distribution of my response, salary_1te, to see whether it was imbalanced. In total I built two groups of models, each made up of three regressors: k-nearest neighbors (KNN), random forest, and LightGBM. The first group predicts salary_1te without any data transformation or over-sampling. The second group uses the same three regressors, but trained on data over-sampled with SMOGN (Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise). I then evaluated their performance using test MSE (Mean Squared Error) and test MAPE (Mean Absolute Percentage Error).
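For readers unfamiliar with the two metrics, the sketch below shows how test MSE and test MAPE can be computed from true and predicted salaries. It is a plain-Python illustration rather than the notebook's code; the helper names `mse` and `mape` and the sample numbers are made up.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.10 means 10%).

    Assumes no true value is zero, which holds for salaries.
    """
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up salaries in millions of dollars, for illustration only.
y_true = [1.0, 2.0, 4.0]
y_pred = [1.1, 1.8, 5.0]

test_mse = mse(y_true, y_pred)
test_mape = mape(y_true, y_pred)
```

MSE is expressed in squared salary units and is dominated by large misses, while MAPE is scale-free, so the two can rank models differently when a few high salaries are predicted poorly.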

The results show that the random forest regressor outperforms the k-nearest neighbors regressor and the LightGBM regressor on both test MSE and test MAPE. I also compared the distributions of the true salary_1te and the predicted salary_1te for the three regressors. All three are clearly less effective where salary_1te is above about 5M. Again, the goal of this task was to develop a reliable framework capable of predicting first-time eligible arbitration salary for players and placing them into tiers. The SMOGN over-sampled models did not improve on the basic models; their MSE and MAPE were worse. I therefore used the basic random forest model to make the final predictions.
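One way to quantify the weaker performance on high salaries is to compute the error separately on each side of the 5M mark. The sketch below assumes salaries are in dollars and uses made-up values; `mape_by_band` is a hypothetical helper, not a function from the notebook.

```python
def mape_by_band(y_true, y_pred, threshold=5_000_000):
    """Return MAPE (as a fraction) for salaries at or below and above `threshold`.

    A band with no observations is reported as None.
    """
    low, high = [], []
    for t, p in zip(y_true, y_pred):
        err = abs((t - p) / t)
        if t > threshold:
            high.append(err)
        else:
            low.append(err)

    def mean(values):
        return sum(values) / len(values) if values else None

    return {"at_or_below": mean(low), "above": mean(high)}


# Made-up salaries in dollars, for illustration only.
y_true = [1_000_000, 2_000_000, 8_000_000, 10_000_000]
y_pred = [1_100_000, 1_900_000, 6_000_000, 7_000_000]

bands = mape_by_band(y_true, y_pred)
```

A clearly larger value in the `above` band than in the `at_or_below` band would match the pattern seen in the distribution plots.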

Although the basic (non-SMOGN) random forest model with hyperparameter tuning performed best among the models in the Jupyter Notebook file, the linear regression model with stepwise feature selection in the RMarkdown file actually had the best outcome in this project overall.
