GitHub - Allen-Ho-0302/First-Time-Eligible-Arbitration-Salary-Prediction: Modelling the relationship between a player’s first-time eligible arbitration salary and multiple variables.

**Special thanks to Kuan-Cheng Fu for providing huge help on this project.**

In this project, I worked with a subset of batting-level data covering each player's career, platform year (the year before the year of their arbitration contract), py-1 (the year before the platform year), and py-2 (two years before the platform year). There were about 80 variables, including plate appearance, run, hit, avg, obp, war, mvp vote, ss vote, etc. Detailed definitions of the variables can be found in the Excel file. As for the code and writeup, please read the RMarkdown file first for the first part of this project, then move on to the Jupyter Notebook file for the second part. The reason for this order is that I finished the Jupyter Notebook file a year ago, and this year I figured I should at least try some simpler models for this project. I used to be more of a Python user but am now more of an R user because of my job. Hopefully this won't confuse people too much.
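To make the season labels concrete, here is a small sketch of how the three single-season windows relate to the arbitration contract year. The function name `season_windows` and the example year are hypothetical and are not taken from the project files.

```python
def season_windows(contract_year):
    """Map an arbitration contract year to the seasons described above.

    Hypothetical helper for illustration only.
    """
    platform = contract_year - 1  # platform year: the year before the contract year
    return {
        "platform": platform,
        "py-1": platform - 1,  # the year before the platform year
        "py-2": platform - 2,  # two years before the platform year
    }


print(season_windows(2020))
# {'platform': 2019, 'py-1': 2018, 'py-2': 2017}
```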

The goal of this research was to develop a framework capable of predicting a player’s first-time eligible arbitration salary and placing players into tiers. The framework consists of the following parts: feature preprocessing, model building, hyperparameter tuning, model evaluation, and first-time eligible arbitration salary prediction.

In detail, I first fit a linear regression with stepwise feature selection, as can be seen in the RMarkdown file. Then, in the Jupyter Notebook file, I defined two functions for model building and hyperparameter tuning, and checked the distribution of my response, salary_1te, to see whether it is imbalanced. In all, I built two groups of models, each consisting of three regressors: k-nearest neighbors (knn), random forest, and LightGBM. The first group predicts salary_1te without any data transformation or over-sampling. The second group uses the same three regressors together with the over-sampling technique SMOGN (Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise). I then evaluated their performance using test MSE (Mean Squared Error) and test MAPE (Mean Absolute Percentage Error).
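For reference, the two evaluation metrics can be written out directly. This is a minimal pure-Python sketch of the standard definitions, not the notebook's code; the function names and the sample salaries (in millions) are made up for illustration.

```python
from statistics import mean


def mse(y_true, y_pred):
    """Mean Squared Error: average of squared prediction errors."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent.

    Assumes every true value is non-zero, which holds for salaries.
    """
    return 100 * mean(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))


# Made-up salaries in millions of dollars, for illustration only.
true_salary = [1.0, 2.0, 4.0]
pred_salary = [1.1, 1.8, 5.0]

print(mse(true_salary, pred_salary))   # about 0.35
print(mape(true_salary, pred_salary))  # about 15.0
```

MSE penalizes large misses on high salaries much more heavily, while MAPE weighs each player's error relative to their own salary, which is one reason to report both.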

The results show that the random forest regressor outperforms the k-nearest neighbors regressor and the LightGBM regressor on both test MSE and test MAPE. I also compared the distributions of the true salary_1te and the salary_1te predicted by the three regressors. All three regressors are clearly less effective where salary_1te is above about 5M. Again, the goal of this task was to develop a reliable framework capable of predicting players' first-time eligible arbitration salaries and placing them into tiers. The SMOGN over-sampled models did not improve on the basic models: their MSE and MAPE were both worse. I therefore used the basic random forest model to make the final predictions.
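Placing predicted salaries into tiers could look like the sketch below. The README does not state the tier boundaries, so the cut points and tier labels here are hypothetical placeholders, as is the function name `salary_tier`.

```python
from bisect import bisect_right

# Hypothetical tier boundaries in dollars; the project does not specify them.
TIER_BOUNDS = [1_000_000, 2_500_000, 5_000_000]
# One more label than boundaries, ordered from lowest to highest salary band.
TIER_NAMES = ["Tier 4", "Tier 3", "Tier 2", "Tier 1"]


def salary_tier(predicted_salary):
    """Return the tier label whose salary band contains the prediction."""
    return TIER_NAMES[bisect_right(TIER_BOUNDS, predicted_salary)]


for salary in (800_000, 3_000_000, 6_200_000):
    print(salary, salary_tier(salary))
# 800000 Tier 4
# 3000000 Tier 2
# 6200000 Tier 1
```

Because the regressors are weaker above roughly 5M, predictions that land in the top band deserve more caution than those in the lower bands.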

Although the basic (non-SMOGN) random forest model with hyperparameter tuning performed best among the models in the Jupyter Notebook file, the linear regression model with stepwise feature selection in the RMarkdown file actually had the best outcome in this project.
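As a rough illustration of stepwise feature selection, the sketch below does forward selection for an ordinary least squares model, adding one feature at a time while AIC keeps improving. The RMarkdown file is the authoritative version; the search direction, the AIC criterion, the function names, and the synthetic data here are all assumptions made for this example, and it is written in Python with NumPy rather than R.

```python
import numpy as np


def aic(X, y):
    """AIC (up to an additive constant) of an OLS fit; X must include the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k


def forward_stepwise(X, y, names):
    """Greedily add the feature that lowers AIC the most; stop when none does."""
    n = len(y)
    intercept = np.ones((n, 1))
    selected, remaining = [], list(range(X.shape[1]))
    best = aic(intercept, y)  # start from the intercept-only model
    while remaining:
        scores = [
            (aic(np.hstack([intercept, X[:, selected + [j]]]), y), j)
            for j in remaining
        ]
        score, j = min(scores)
        if score >= best:
            break  # no candidate improves AIC
        best = score
        selected.append(j)
        remaining.remove(j)
    return [names[j] for j in selected]


# Synthetic data: the response depends only on the first column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

chosen = forward_stepwise(X, y, ["war", "pa", "hr"])
print(chosen)  # "war" is picked first because the response depends on it
```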

Files in this repository (19 commits):

- Arbitration_1TE_POS_Market_Definitions.xlsx
- First-Time Eligible Arbitration Salary Prediction.ipynb
- Part 1.Rmd
- Part 1.pdf
- README.md
