In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [2]:
batting_url = 'https://raw.githubusercontent.com/SeanG347/LahmanBaseballDatabase/main/Batting.csv'
salary_url = 'https://raw.githubusercontent.com/SeanG347/LahmanBaseballDatabase/main/Salaries.csv'
people_url = 'https://raw.githubusercontent.com/SeanG347/LahmanBaseballDatabase/main/People.csv'
fielding_url = 'https://raw.githubusercontent.com/SeanG347/LahmanBaseballDatabase/main/Fielding.csv'
pitching_url = 'https://raw.githubusercontent.com/SeanG347/LahmanBaseballDatabase/main/Pitching.csv'

batting_df = pd.read_csv(batting_url)
salary_df = pd.read_csv(salary_url)
people_df = pd.read_csv(people_url)
fielding_df = pd.read_csv(fielding_url)
pitching_df = pd.read_csv(pitching_url)

In [3]:
# Add a "season" feature: the ordinal number of each salary record for a player (1 = first recorded year). This is important for distinguishing whether a player is in arbitration or free agency.

salary_df['season'] = salary_df.groupby('playerID').cumcount() + 1
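
One caveat with the cell above: `cumcount` numbers rows in whatever order they appear, so the counter is only chronological if the rows are already sorted by year, and a player with two salary rows in the same year would be counted twice. A minimal sketch of a more defensive variant, using a dense rank of `yearID` within each player; the frame `toy_salary` and the player `examplex01` are made up for illustration:

```python
import pandas as pd

# Toy rows standing in for Salaries.csv, deliberately out of year order.
# "examplex01" is a hypothetical player with two salary rows in 2010.
toy_salary = pd.DataFrame({
    "playerID": ["aardsda01", "aardsda01", "aardsda01",
                 "examplex01", "examplex01", "examplex01"],
    "yearID":   [2008, 2004, 2007, 2010, 2010, 2011],
    "salary":   [403250, 300000, 387500, 500000, 450000, 900000],
})

# Dense rank of yearID within each player: independent of row order,
# and rows that share a year share a season number.
toy_salary["season"] = (
    toy_salary.groupby("playerID")["yearID"]
    .rank(method="dense")
    .astype(int)
)

print(toy_salary)
```

Either way, the counter tracks years with a salary record rather than years played, so it is only a proxy for service time.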

In [4]:
salary_df.head()

Unnamed: 0,yearID,teamID,lgID,playerID,salary,season
0,2004,SFN,NL,aardsda01,300000,1
1,2007,CHA,AL,aardsda01,387500,2
2,2008,BOS,AL,aardsda01,403250,3
3,2009,SEA,AL,aardsda01,419000,4
4,2010,SEA,AL,aardsda01,2750000,5


In [5]:
batting_df.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,G_batting,AB,R,H,...,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP,G_old
0,aardsda01,2004,1,SFN,NL,11,,0,0,0,...,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,
1,aardsda01,2006,1,CHN,NL,45,,2,0,0,...,0.0,0.0,0,0.0,0.0,0.0,1.0,0.0,0.0,
2,aardsda01,2007,1,CHA,AL,25,,0,0,0,...,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,
3,aardsda01,2008,1,BOS,AL,47,,1,0,0,...,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0,
4,aardsda01,2009,1,SEA,AL,73,,0,0,0,...,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,


In [6]:
# Check whether ZR (zone rating) is reported often enough to quantify player defensive ability.

fielding_df[fielding_df['ZR'].notna()]

Unnamed: 0,playerID,yearID,stint,teamID,lgID,POS,G,GS,InnOuts,PO,A,E,DP,PB,WP,SB,CS,ZR
621,adamsdo01,1969,1,CHA,AL,C,4,3.0,78.0,9,2,0.0,0,1.0,0.0,0.0,0.0,0.0
854,adlesda01,1963,1,HOU,NL,C,6,0.0,46.0,8,0,1.0,0,1.0,1.0,1.0,0.0,0.0
855,adlesda01,1964,1,HOU,NL,C,3,2.0,63.0,11,2,0.0,0,0.0,2.0,2.0,2.0,0.0
856,adlesda01,1965,1,HOU,NL,C,13,10.0,246.0,51,5,0.0,1,1.0,3.0,4.0,3.0,0.0
857,adlesda01,1966,1,HOU,NL,C,1,0.0,21.0,11,0,0.0,0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
153405,zimmeje01,1967,1,MIN,AL,C,104,81.0,2077.0,572,44,5.0,7,1.0,20.0,29.0,26.0,3.0
153406,zimmeje01,1968,1,MIN,AL,C,24,18.0,435.0,109,7,1.0,0,0.0,1.0,6.0,5.0,0.0
153620,zupofr01,1957,1,BAL,AL,C,8,1.0,78.0,20,1,2.0,0,1.0,0.0,2.0,1.0,0.0
153621,zupofr01,1958,1,BAL,AL,C,1,0.0,12.0,4,0,0.0,0,0.0,1.0,0.0,0.0,0.0


Since ZR is only sparsely reported in the dataset, we will unfortunately have to use errors to quantify a player's defensive ability.
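
To put a number on how sparse ZR is, one option is to compute the share of rows where it is present, overall and by position. This is only a sketch on a made-up frame (`toy_fielding` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for Fielding.csv; NaN marks rows where ZR is not reported.
toy_fielding = pd.DataFrame({
    "playerID": ["p1", "p2", "p3", "p4"],
    "POS":      ["C", "SS", "1B", "C"],
    "ZR":       [0.0, np.nan, np.nan, 3.0],
})

# Fraction of rows with a reported ZR.
coverage = toy_fielding["ZR"].notna().mean()

# Same fraction, broken out by position.
coverage_by_pos = toy_fielding.groupby("POS")["ZR"].apply(lambda s: s.notna().mean())

print(coverage)
print(coverage_by_pos)
```

Running the same two lines on `fielding_df` would show whether the reported values are spread across positions or concentrated in a few.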

In [7]:
# Finding when salaries began being tracked in the dataset.

salary_df['yearID'].min()

1985

Salary is the target variable, and the dataset only tracks salaries from 1985 onward, so the other datasets will have to be pared down to 1985 and later.

The game plan is to determine which features I want to include and then join the appropriate dataframes on playerID and yearID, resulting in a single aggregated dataset with the important features and the target variable.
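
A rough sketch of that plan on toy data follows. The frames and values are invented for illustration, and collapsing multiple stints into one row per player-season before joining is my assumption about how to handle mid-season trades:

```python
import pandas as pd

# Toy stand-ins for batting_df and salary_df (values are illustrative only).
toy_batting = pd.DataFrame({
    "playerID": ["a01", "a01", "a01", "b01"],
    "yearID":   [1984, 1990, 1990, 1990],
    "stint":    [1, 1, 2, 1],
    "AB":       [400, 200, 150, 500],
    "H":        [100, 60, 40, 150],
    "HR":       [10, 5, 4, 20],
})
toy_salary = pd.DataFrame({
    "playerID": ["a01", "b01"],
    "yearID":   [1990, 1990],
    "salary":   [500000, 1200000],
})

FIRST_SALARY_YEAR = 1985  # salary_df['yearID'].min() in the real data

# Drop seasons that predate salary tracking.
recent = toy_batting[toy_batting["yearID"] >= FIRST_SALARY_YEAR]

# Collapse multiple stints into one row per player-season.
keys = ["playerID", "yearID"]
season_totals = recent.groupby(keys, as_index=False)[["AB", "H", "HR"]].sum()

# An inner join keeps only player-seasons that have both stats and a salary.
merged = season_totals.merge(toy_salary, on=keys, how="inner")

print(merged)
```

If salary rows can also repeat within a player-season, they would need the same kind of aggregation before the merge to avoid duplicated rows.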


### Offensive Categories

* Batting Average: A traditional statistic that has only recently fallen out of favor, and still a solid indicator of a player's overall hitting ability.
* Home Runs: An "eye-popping" statistic that has historically been linked with high-paying contracts and good offensive production.
* OPS: A very good summary stat for offensive production, but since it is relatively new, it may not be a great training attribute for older contracts.
* OBP: One of the two components of OPS (OPS = OBP + SLG), so it does not account for slugging. It may be wise to use batting average, OBP, and home runs as our training features, as together they capture almost everything that has historically been valued in offensive production.
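
None of these rate stats are stored directly in the batting table, so they have to be derived from counting stats. Below is a sketch using the standard formulas; I am assuming the Lahman column names `2B`, `3B`, and `HR`, which are hidden behind the `...` in the `head()` output above, and the numbers are toy values:

```python
import pandas as pd

# Toy player-season totals; the second row mimics a pitcher with no at-bats.
stats = pd.DataFrame({
    "AB":  [500, 0],
    "H":   [150, 0],
    "2B":  [30, 0],
    "3B":  [5, 0],
    "HR":  [20, 0],
    "BB":  [50, 0],
    "HBP": [5, 0],
    "SF":  [5, 0],
})

# Turn zero denominators into NaN so rate stats are missing, not infinite.
ab = stats["AB"].where(stats["AB"] > 0)
obp_denom = stats["AB"] + stats["BB"] + stats["HBP"] + stats["SF"]
obp_denom = obp_denom.where(obp_denom > 0)

# Total bases: a hit counts once, plus extra bases for 2B, 3B, and HR.
total_bases = stats["H"] + stats["2B"] + 2 * stats["3B"] + 3 * stats["HR"]

stats["AVG"] = stats["H"] / ab
stats["OBP"] = (stats["H"] + stats["BB"] + stats["HBP"]) / obp_denom
stats["SLG"] = total_bases / ab
stats["OPS"] = stats["OBP"] + stats["SLG"]

print(stats[["AVG", "OBP", "SLG", "OPS"]])
```

These should be computed after summing stints into season totals, since averaging per-stint rates would weight short stints too heavily.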

### Defensive Categories

* Position: Extremely important, since premium defensive positions get paid more on average. It is still necessary, however, to find a way to distinguish a good defender from a bad one.
* Errors: Since the dataset does not have many other defensive metrics, errors will have to suffice. This is not a horrible choice, as front offices have used errors in the past to judge a player's defensive capabilities.
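
Because the fielding table has one row per player, year, stint, and position, both features need to be reduced to one value per player-season. One possible approach, sketched on toy data, is to take the position with the most games as the primary position and sum errors across all positions; that definition of "primary position" is my assumption:

```python
import pandas as pd

# Toy stand-in for Fielding.csv (values are illustrative only).
toy_fielding = pd.DataFrame({
    "playerID": ["a01", "a01", "b01"],
    "yearID":   [1990, 1990, 1990],
    "POS":      ["SS", "2B", "C"],
    "G":        [120, 20, 100],
    "E":        [15.0, 2.0, 8.0],
})

keys = ["playerID", "yearID"]

# Primary position: the row with the most games in each player-season.
primary = (
    toy_fielding.sort_values("G", ascending=False)
    .drop_duplicates(subset=keys)[keys + ["POS"]]
)

# Total errors across every position played that season.
errors = toy_fielding.groupby(keys, as_index=False)["E"].sum()

defense = errors.merge(primary, on=keys).sort_values(keys).reset_index(drop=True)

print(defense)
```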

### Miscellaneous

* Batting Handedness: This one is more speculative. I plan on training a model with and without batting handedness and comparing performance.
* Age: This would be more important if we were trying to predict the full contract a player will sign. Since we are focusing primarily on season-by-season salary prediction, age will likely not be as important an attribute.
* Season: The number of seasons the player has played, inclusive. This is an extremely important attribute for baseball in particular because of the arbitration system. The model should be able to pick up on the relationship between salary and season number.
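
As a sketch of how the first two features could be built: the column names `birthYear` and `bats` are what I expect in the People table (it is not displayed above, so treat them as assumptions), and age is approximated as a simple year difference. The frames are toy data:

```python
import pandas as pd

# Toy stand-ins for people_df and a per-season feature table.
toy_people = pd.DataFrame({
    "playerID":  ["a01", "b01"],
    "birthYear": [1965, 1960],
    "bats":      ["L", "R"],
})
toy_seasons = pd.DataFrame({
    "playerID": ["a01", "a01", "b01"],
    "yearID":   [1990, 1991, 1990],
    "season":   [1, 2, 5],
})

# Attach biographical columns to every player-season.
features = toy_seasons.merge(toy_people, on="playerID", how="left")

# Approximate age: ignores birth month and day.
features["Age"] = features["yearID"] - features["birthYear"]

# Two feature sets, so models with and without handedness can be compared.
with_bats = pd.get_dummies(features, columns=["bats"])
without_bats = features.drop(columns=["bats"])

print(with_bats.columns.tolist())
print(without_bats.columns.tolist())
```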

In [8]:
# Contract model for batting
# Note: the model must use a normalized salary, as average salaries have varied significantly from year to year.
important_features = ['AVG','OPS','HR','bats','POS','Age','ZR/dWAR', 'season']
# We still have to find a metric for defensive effectiveness.
# Since this model will focus on free-agency contracts, not arbitration, we also have to
# find a way to determine when a player has signed a free-agent contract.
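
One simple way to normalize salary, sketched below on toy numbers, is to divide each salary by the mean salary of the same year. This is just one candidate; dividing by the median or using a within-year z-score would be equally reasonable choices:

```python
import pandas as pd

# Toy salaries from two very different eras (values are illustrative only).
toy_salary = pd.DataFrame({
    "playerID": ["a01", "b01", "c01", "d01"],
    "yearID":   [1985, 1985, 2010, 2010],
    "salary":   [200000, 600000, 1000000, 5000000],
})

# Mean salary of each row's year, aligned to the original rows.
year_mean = toy_salary.groupby("yearID")["salary"].transform("mean")

# 1.0 means "paid exactly the league average for that year".
toy_salary["salary_norm"] = toy_salary["salary"] / year_mean

print(toy_salary)
```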

Ideas for free-agent contract determination:
* The salaries players make before signing in free agency are almost always relatively low, i.e., below the mean. This is especially true in pre-arbitration years. However, some players, e.g., Bo Bichette in 2025, made well over the league-average salary, and this will have to be accounted for.
* Percentage increase from the previous salary: most free agents will make more than they did the previous year (their last year of arbitration).
* We could simply work out when the player would become free-agency eligible based on the arbitration rules, and then only include data from the arbitration/free-agency era (which we should do anyway, as it is more relevant to the current free-agency market).
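
The second and third ideas can be combined into a rough flag. The sketch below assumes the `season` counter is a usable proxy for service time (free agency requires six years of major-league service) and uses a made-up raise threshold, `MIN_RAISE`, that would need tuning; the salary history is toy data:

```python
import pandas as pd

# Toy salary history for one hypothetical player.
toy = pd.DataFrame({
    "playerID": ["a01"] * 8,
    "yearID":   list(range(2000, 2008)),
    "salary":   [300000, 320000, 350000, 1500000,
                 3000000, 5000000, 12000000, 12500000],
})

toy = toy.sort_values(["playerID", "yearID"]).reset_index(drop=True)
toy["season"] = toy.groupby("playerID").cumcount() + 1

# Year-over-year raise within each player (NaN for the first season).
toy["raise_pct"] = toy.groupby("playerID")["salary"].pct_change()

SERVICE_YEARS_FOR_FA = 6   # years of service required before free agency
MIN_RAISE = 0.5            # hypothetical threshold, not derived from data

toy["fa_eligible"] = toy["season"] > SERVICE_YEARS_FOR_FA
toy["likely_fa_contract"] = toy["fa_eligible"] & (toy["raise_pct"] >= MIN_RAISE)

print(toy[["yearID", "season", "raise_pct", "likely_fa_contract"]])
```

This would miss players who sign extensions before reaching free agency, as well as free agents who take a pay cut, so it is a starting point rather than a reliable label.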

Notes:
   * This model needs to take in the player data from the years leading up to a free agent contract, and then extrapolate what it learned from each player to a new player. The goal for this is to have this done before Kyle Tucker and Bo Bichette sign contracts this offseason, and see how it does on those predictions. 
   * Contract term is another factor to incorporate.