# Baseball Player WAR Model

In this project, we'll predict future season stats for baseball players. Specifically, we'll predict the wins above replacement (WAR) a player will generate next season. We'll start from season-level batting data (downloaded with pybaseball) and clean it. We'll then use a sequential feature selector to identify the most promising predictors, train a ridge regression model to predict next-season WAR, measure its error, and improve the model with engineered features.

WAR estimates how many more wins a player contributes than a replacement-level player, meaning a readily available substitute such as a minor-league call-up, rather than a league-average player. The higher the WAR, the more valuable the player.

**Project Steps**

* Download baseball season data
* Clean the data and prepare it for ML
* Run feature selection
* Create a machine learning model and estimate accuracy
* Improve accuracy

In [2]:
import os
import numpy as np

In [3]:
import pandas as pd

batting_stats = "batting.csv"

batting = pd.read_csv(batting_stats)

START = 2002
END = 2022
QUAL_THRESHOLD = 200

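# Keep seasons in [START, END] with at least QUAL_THRESHOLD at-bats.
# Note: the result is only previewed here; the cells below keep working
# with the unfiltered `batting` frame.
filtered_batting = batting[
    (batting['Season'] >= START) &
    (batting['Season'] <= END) &
    (batting['AB'] >= QUAL_THRESHOLD)
]

print(filtered_batting.head())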

   Unnamed: 0   IDfg  Season          Name Team  Age    G   AB   PA    H  ...  \
0           0   1109    2002   Barry Bonds  SFG   37  143  403  612  149  ...   
1           1   1109    2004   Barry Bonds  SFG   39  147  373  617  135  ...   
2          15  13611    2018  Mookie Betts  BOS   25  136  520  614  180  ...   
3           2   1109    2003   Barry Bonds  SFG   38  130  390  550  133  ...   
4          78  10155    2013    Mike Trout  LAA   21  157  589  716  190  ...   

   Barrel%  maxEV  HardHit  HardHit%  Events  CStr%   CSW%  xBA  xSLG  xwOBA  
0      NaN    NaN      NaN       NaN       0  0.127  0.191  NaN   NaN    NaN  
1      NaN    NaN      NaN       NaN       0  0.124  0.164  NaN   NaN    NaN  
2    0.131  110.6    217.0       0.5     434  0.220  0.270  NaN   NaN    NaN  
3      NaN    NaN      NaN       NaN       0  0.135  0.223  NaN   NaN    NaN  
4      NaN    NaN      0.0       NaN       0  0.200  0.266  NaN   NaN    NaN  

[5 rows x 320 columns]


In [4]:
batting = batting.groupby("IDfg", group_keys=False).filter(lambda x: x.shape[0] > 1)

In [5]:
batting

,Unnamed: 0,IDfg,Season,Name,Team,Age,G,AB,PA,H,...,Barrel%,maxEV,HardHit,HardHit%,Events,CStr%,CSW%,xBA,xSLG,xwOBA
0,0,1109,2002,Barry Bonds,SFG,37,143,403,612,149,...,,,,,0,0.127,0.191,,,
1,1,1109,2004,Barry Bonds,SFG,39,147,373,617,135,...,,,,,0,0.124,0.164,,,
2,15,13611,2018,Mookie Betts,BOS,25,136,520,614,180,...,0.131,110.6,217.0,0.500,434,0.220,0.270,,,
3,2,1109,2003,Barry Bonds,SFG,38,130,390,550,133,...,,,,,0,0.135,0.223,,,
4,78,10155,2013,Mike Trout,LAA,21,157,589,716,190,...,,,0.0,,0,0.200,0.266,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7084,6861,1698,2010,Gerald Laird,DET,30,89,270,299,56,...,,,0.0,,0,0.166,0.252,,,
7086,7019,9272,2018,Chris Davis,BAL,32,128,470,522,79,...,0.096,111.8,113.0,0.401,282,0.174,0.316,,,
7087,6655,319,2011,Adam Dunn,CHW,31,122,415,496,66,...,,,0.0,,0,0.169,0.295,,,
7088,6962,620,2002,Neifi Perez,KCR,29,145,554,585,131,...,,,,,0,0.130,0.187,,,


In [6]:
def next_season(player):
    player = player.sort_values("Season")
    player["Next_WAR"] = player["WAR"].shift(-1)
    return player

batting = batting.groupby("IDfg", group_keys=False).apply(next_season)

In [7]:
batting[["Name", "Season", "WAR", "Next_WAR"]]

,Name,Season,WAR,Next_WAR
3934,Alfredo Amezaga,2006,1.1,2.0
2593,Alfredo Amezaga,2007,2.0,1.2
3759,Alfredo Amezaga,2008,1.2,
1019,Garret Anderson,2002,3.7,5.1
427,Garret Anderson,2003,5.1,0.8
...,...,...,...,...
4837,Owen Miller,2022,0.6,
6190,Andrew Vaughn,2021,-0.3,0.5
4887,Andrew Vaughn,2022,0.5,
5038,Ha-seong Kim,2021,0.5,2.6
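The target construction above can be checked on a toy frame. The sketch below uses made-up players and an alternative spelling (`groupby(...).shift(-1)` instead of `apply`); it is an illustration under those assumptions, not the notebook's exact code.

```python
import pandas as pd

# Toy data: two hypothetical players, deliberately out of season order.
toy = pd.DataFrame({
    "IDfg":   [1, 1, 1, 2, 2],
    "Season": [2008, 2006, 2007, 2021, 2022],
    "WAR":    [1.2, 1.1, 2.0, -0.3, 0.5],
})

# Sort so each player's seasons are chronological, then pull the
# following row's WAR up one row within each player.
toy = toy.sort_values(["IDfg", "Season"])
toy["Next_WAR"] = toy.groupby("IDfg")["WAR"].shift(-1)

print(toy)
```

Each player's last recorded season gets `NaN`, which is why `Next_WAR` has missing values. Note that `shift(-1)` takes the next row, not the next calendar year, so a player who skips a season in the data would be paired with a later one.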


In [8]:
null_count = batting.isnull().sum()

In [9]:
null_count

Unnamed: 0       0
IDfg             0
Season           0
Name             0
Team             0
              ... 
CSW%             0
xBA           6737
xSLG          6737
xwOBA         6737
Next_WAR      1174
Length: 321, dtype: int64

In [10]:
complete_cols = list(batting.columns[null_count == 0])

In [11]:
batting = batting[complete_cols + ["Next_WAR"]].copy()

In [12]:
batting

,Unnamed: 0,IDfg,Season,Name,Team,Age,G,AB,PA,H,...,Pull%+,Cent%+,Oppo%+,Soft%+,Med%+,Hard%+,Events,CStr%,CSW%,Next_WAR
3934,5549,1,2006,Alfredo Amezaga,FLA,28,132,334,378,87,...,86,107,113,143,109,63,0,0.188,0.256,2.0
2593,5000,1,2007,Alfredo Amezaga,FLA,29,133,400,448,105,...,92,101,112,109,113,75,0,0.175,0.227,1.2
3759,5243,1,2008,Alfredo Amezaga,FLA,30,125,311,337,82,...,99,101,101,123,111,64,0,0.178,0.244,
1019,1168,2,2002,Garret Anderson,ANA,30,158,638,678,195,...,118,91,80,65,97,129,0,0.137,0.232,5.1
427,866,2,2003,Garret Anderson,ANA,31,159,638,673,201,...,112,101,80,90,99,109,0,0.164,0.252,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4837,5980,24655,2022,Owen Miller,CLE,25,119,391,433,97,...,92,111,99,127,102,82,315,0.191,0.269,
6190,4880,26197,2021,Andrew Vaughn,CHW,23,127,417,469,98,...,87,104,116,84,99,110,321,0.185,0.285,0.5
4887,2097,26197,2022,Andrew Vaughn,CHW,24,118,456,497,132,...,88,108,108,93,99,105,382,0.205,0.287,
5038,6604,27506,2021,Ha-seong Kim,SDP,25,117,267,298,54,...,126,99,59,137,96,88,201,0.216,0.303,2.6
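The column filter above keeps only fully populated columns and then re-adds the target. A minimal sketch on a hypothetical three-row frame (the values are invented for illustration) shows the effect:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "WAR":      [1.1, 2.0, 1.2],
    "xBA":      [np.nan, np.nan, 0.250],   # partly missing, so it is dropped
    "Next_WAR": [2.0, 1.2, np.nan],        # target: missing for the last season
})

null_count = toy.isnull().sum()
complete_cols = list(toy.columns[null_count == 0])

# Keep fully populated columns, plus the target even though it has gaps.
cleaned = toy[complete_cols + ["Next_WAR"]].copy()
print(cleaned.columns.tolist())
```

Dropping every column with any missing value is a blunt rule; it removes the Statcast-era columns such as `xBA` entirely rather than imputing them.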


In [13]:
batting.dtypes

Unnamed: 0      int64
IDfg            int64
Season          int64
Name           object
Team           object
               ...   
Hard%+          int64
Events          int64
CStr%         float64
CSW%          float64
Next_WAR      float64
Length: 133, dtype: object

In [14]:
batting.dtypes[batting.dtypes == "object"]

Name       object
Team       object
Dol        object
Age Rng    object
dtype: object

In [15]:
batting["Dol"]

3934      $5.5
2593     $11.2
3759      $7.2
1019     $14.6
427      $22.0
         ...  
4837      $4.7
6190    ($2.6)
4887      $3.6
5038      $3.9
1892     $21.1
Name: Dol, Length: 6737, dtype: object

In [16]:
del batting["Dol"]

In [17]:
batting["Age Rng"]

3934    28 - 28
2593    29 - 29
3759    30 - 30
1019    30 - 30
427     31 - 31
         ...   
4837    25 - 25
6190    23 - 23
4887    24 - 24
5038    25 - 25
1892    26 - 26
Name: Age Rng, Length: 6737, dtype: object

In [18]:
del batting["Age Rng"]

In [19]:
batting["team_code"] = batting["Team"].astype("category").cat.codes

In [20]:
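batting_full = batting.copy()
# Only Next_WAR can still be null here, so this drops each player's last
# recorded season. Copy so later assignments work on an independent frame.
batting = batting.dropna().copy()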

In [21]:
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import TimeSeriesSplit

rr = Ridge(alpha=1)

split = TimeSeriesSplit(n_splits=3)

sfs = SequentialFeatureSelector(rr, n_features_to_select=20, direction="forward", cv=split, n_jobs=4)

In [22]:
removed_columns = ["Next_WAR", "Name", "Team", "IDfg", "Season"]
selected_columns = batting.columns[~batting.columns.isin(removed_columns)]

In [23]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
batting.loc[:, selected_columns] = scaler.fit_transform(batting[selected_columns])

In [24]:
batting

,Unnamed: 0,IDfg,Season,Name,Team,Age,G,AB,PA,H,...,Cent%+,Oppo%+,Soft%+,Med%+,Hard%+,Events,CStr%,CSW%,Next_WAR,team_code
3934,0.782762,1,2006,Alfredo Amezaga,FLA,0.346154,0.735043,0.312950,0.307958,0.245690,...,0.539326,0.503759,0.662921,0.652174,0.210884,0.000000,0.582979,0.524229,2.0,0.352941
2593,0.705318,1,2007,Alfredo Amezaga,FLA,0.384615,0.743590,0.431655,0.429066,0.323276,...,0.471910,0.496241,0.471910,0.710145,0.292517,0.000000,0.527660,0.396476,1.2,0.352941
1019,0.164762,2,2002,Garret Anderson,ANA,0.423077,0.957265,0.859712,0.826990,0.711207,...,0.359551,0.255639,0.224719,0.478261,0.659864,0.000000,0.365957,0.418502,5.1,0.029412
427,0.122161,2,2003,Garret Anderson,ANA,0.461538,0.965812,0.859712,0.818339,0.737069,...,0.471910,0.255639,0.365169,0.507246,0.523810,0.000000,0.480851,0.506608,0.8,0.029412
4349,0.362957,2,2004,Garret Anderson,ANA,0.500000,0.564103,0.507194,0.475779,0.443966,...,0.494382,0.218045,0.297753,0.608696,0.448980,0.000000,0.531915,0.585903,-0.2,0.029412
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2051,0.269855,23667,2021,Wander Franco,TBR,0.038462,0.205128,0.217626,0.186851,0.219828,...,0.617978,0.390977,0.421348,0.608696,0.394558,0.409015,0.391489,0.352423,1.2,0.911765
4626,0.827056,24618,2021,Ryan Jeffers,MIN,0.192308,0.333333,0.192446,0.160900,0.099138,...,0.415730,0.315789,0.376404,0.347826,0.619048,0.265442,0.514894,0.788546,1.0,0.558824
6861,0.988715,24655,2021,Owen Miller,CLE,0.192308,0.119658,0.055755,0.003460,0.038793,...,0.584270,0.593985,0.331461,0.681159,0.394558,0.230384,0.548936,0.700441,0.6,0.264706
6190,0.688390,26197,2021,Andrew Vaughn,CHW,0.153846,0.692308,0.462230,0.465398,0.293103,...,0.505618,0.526316,0.331461,0.507246,0.530612,0.535893,0.570213,0.651982,0.5,0.205882


In [25]:
batting.describe()

,Unnamed: 0,IDfg,Season,Age,G,AB,PA,H,1B,2B,...,Cent%+,Oppo%+,Soft%+,Med%+,Hard%+,Events,CStr%,CSW%,Next_WAR,team_code
count,5563.0,5563.0,5563.0,5563.0,5563.0,5563.0,5563.0,5563.0,5563.0,5563.0,...,5563.0,5563.0,5563.0,5563.0,5563.0,5563.0,5563.0,5563.0,5563.0,5563.0
mean,0.452243,5346.361136,2011.143268,0.360701,0.653156,0.479133,0.481446,0.366375,0.290768,0.399673,...,0.457554,0.403273,0.410782,0.511008,0.478735,0.172547,0.498866,0.545701,1.787758,0.474051
std,0.2797,5116.526623,5.601356,0.147526,0.255806,0.242278,0.262085,0.182445,0.13871,0.171662,...,0.113984,0.131154,0.121118,0.130367,0.134085,0.273872,0.137239,0.120687,1.989465,0.305009
min,0.0,1.0,2002.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-3.4,0.0
25%,0.20955,1129.0,2006.0,0.269231,0.478632,0.276978,0.259516,0.211207,0.179245,0.258621,...,0.382022,0.315789,0.331461,0.42029,0.387755,0.0,0.408511,0.46696,0.3,0.205882
50%,0.433347,3516.0,2011.0,0.346154,0.709402,0.507194,0.508651,0.37069,0.287736,0.37931,...,0.460674,0.398496,0.404494,0.507246,0.489796,0.0,0.493617,0.546256,1.5,0.470588
75%,0.68296,8722.0,2016.0,0.461538,0.871795,0.688849,0.711073,0.508621,0.391509,0.517241,...,0.52809,0.488722,0.483146,0.594203,0.564626,0.345576,0.591489,0.625551,2.9,0.735294
max,1.0,27506.0,2021.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,11.9,1.0
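The `describe()` output shows every scaled predictor spanning 0 to 1, which is what min-max scaling does column by column. As a small check with hypothetical ages (not taken from the dataset), the scaler's output matches the formula written out by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw ages for five players (a single column).
ages = np.array([[21.0], [25.0], [30.0], [37.0], [39.0]])

scaled = MinMaxScaler().fit_transform(ages)

# The same transform written out: (x - min) / (max - min), per column.
manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

print(np.allclose(scaled, manual))
```

The notebook fits the scaler on all seasons at once, including seasons later used for testing. Fitting it on training seasons only would be a stricter setup, at the cost of more bookkeeping.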


In [26]:
sfs.fit(batting[selected_columns], batting["Next_WAR"])

In [27]:
sfs.get_support()

array([False,  True, False, False, False, False, False, False, False,
       False, False, False, False,  True,  True, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False,
       False,  True,  True, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False,  True, False, False, False, False, False, False,  True,
       False, False,  True, False, False, False, False, False, False,
       False, False,  True, False, False,  True, False, False, False,
        True, False, False,  True, False,  True, False, False, False,
       False])

In [28]:
predictors = list(selected_columns[sfs.get_support()])

In [29]:
predictors

['Age',
 'IBB',
 'SO',
 'SB',
 'BU',
 'BABIP',
 'IFH%',
 'WAR',
 'Spd',
 'PH',
 'CB%',
 'Z-Contact%',
 'SwStr%',
 'wGDP',
 'Oppo%',
 'SLG+',
 'LD+%',
 'Pull%+',
 'Soft%+',
 'Hard%+']
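To see what the forward selection above is doing, here is a minimal sketch on synthetic data. The feature matrix, the choice of two informative columns, and `n_features_to_select=2` are all assumptions made for the illustration; only the estimator, direction, and splitter mirror the notebook.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)

# Synthetic stand-in for the batting table: 300 rows, 6 features,
# where only features 0 and 3 actually drive the target.
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

# Forward selection: start with no features and greedily add the one
# that most improves the cross-validated score, until 2 are chosen.
selector = SequentialFeatureSelector(
    Ridge(alpha=1),
    n_features_to_select=2,
    direction="forward",
    cv=TimeSeriesSplit(n_splits=3),
)
selector.fit(X, y)

chosen = np.flatnonzero(selector.get_support())
print(chosen)
```

`TimeSeriesSplit` splits on row order, so it only respects chronology when the rows are sorted by time. In the notebook the rows are ordered by player and then season, so sorting by `Season` before fitting is one option worth considering.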

In [30]:
sorted(batting["Season"].unique())

[2002,
 2003,
 2004,
 2005,
 2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 2014,
 2015,
 2016,
 2017,
 2018,
 2019,
 2020,
 2021]

In [31]:
def backtest(data, model, predictors, start=5, step=1):
    all_predictions = []
    
    years = sorted(data["Season"].unique())
    
    for i in range(start, len(years), step):
        current_year = years[i]
        
        train = data[data["Season"] < current_year]
        test = data[data["Season"] == current_year]
        
        model.fit(train[predictors], train["Next_WAR"])
        
        preds = model.predict(test[predictors])
        preds = pd.Series(preds, index=test.index)
        combined = pd.concat([test["Next_WAR"], preds], axis=1)
        combined.columns = ["actual", "prediction"]
        
        all_predictions.append(combined)
    return pd.concat(all_predictions)

In [32]:
predictions = backtest(batting, rr, predictors)

In [33]:
predictions

,actual,prediction
2593,1.2,1.514187
3367,1.4,0.804184
4554,-0.1,0.587281
4647,0.6,0.890092
1741,4.8,2.307446
...,...,...
2051,1.2,2.697911
4626,1.0,1.926963
6861,0.6,1.545744
6190,0.5,1.646229


In [71]:
from sklearn.metrics import mean_squared_error

mean_squared_error(predictions["actual"], predictions["prediction"])

2.7671807143292715

In [75]:
batting["Next_WAR"].describe()

count    5563.000000
mean        1.787758
std         1.989465
min        -3.400000
25%         0.300000
50%         1.500000
75%         2.900000
max        11.900000
Name: Next_WAR, dtype: float64

In [77]:
2.7671807143292715 ** .5

1.6634845097954087
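The square root converts the mean squared error into a root mean squared error (RMSE) in WAR units. One rough way to read it is against the standard deviation of `Next_WAR`, since a model that always predicted the mean would have an RMSE close to that value. The sketch below just redoes that arithmetic with the numbers reported above.

```python
import math

mse = 2.7671807143292715   # backtest MSE reported above
target_std = 1.989465      # std of Next_WAR from describe()

rmse = math.sqrt(mse)

# Ratio below 1 means the model beats always predicting the mean.
ratio = rmse / target_std

print(round(rmse, 3), round(ratio, 3))
```

This comparison is approximate: the standard deviation covers all seasons, while the backtest error only covers the seasons from 2007 onward.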

In [83]:
def player_history(df):
    df = df.sort_values("Season")
    
    # Number each player's seasons 0, 1, 2, ... in chronological order
    df["player_season"] = range(0, df.shape[0])
    return df

In [97]:
ga = batting[batting["IDfg"] == 2].copy()

In [99]:
ga

,Unnamed: 0,IDfg,Season,Name,Team,Age,G,AB,PA,H,...,Cent%+,Oppo%+,Soft%+,Med%+,Hard%+,Events,CStr%,CSW%,Next_WAR,team_code
1019,0.164762,2,2002,Garret Anderson,ANA,0.423077,0.957265,0.859712,0.82699,0.711207,...,0.359551,0.255639,0.224719,0.478261,0.659864,0.0,0.365957,0.418502,5.1,0.029412
427,0.122161,2,2003,Garret Anderson,ANA,0.461538,0.965812,0.859712,0.818339,0.737069,...,0.47191,0.255639,0.365169,0.507246,0.52381,0.0,0.480851,0.506608,0.8,0.029412
4349,0.362957,2,2004,Garret Anderson,ANA,0.5,0.564103,0.507194,0.475779,0.443966,...,0.494382,0.218045,0.297753,0.608696,0.44898,0.0,0.531915,0.585903,-0.2,0.029412
6033,0.59148,2,2005,Garret Anderson,LAA,0.538462,0.820513,0.746403,0.697232,0.573276,...,0.213483,0.278195,0.421348,0.478261,0.503401,0.0,0.421277,0.53304,0.1,0.441176
5578,0.560023,2,2006,Garret Anderson,LAA,0.576923,0.811966,0.688849,0.67128,0.525862,...,0.41573,0.300752,0.353933,0.434783,0.591837,0.0,0.442553,0.511013,1.4,0.441176
3367,0.271265,2,2007,Garret Anderson,LAA,0.615385,0.529915,0.46223,0.432526,0.405172,...,0.382022,0.285714,0.44382,0.42029,0.52381,0.0,0.442553,0.480176,1.4,0.441176
3399,0.472281,2,2008,Garret Anderson,LAA,0.653846,0.846154,0.714029,0.679931,0.573276,...,0.303371,0.285714,0.38764,0.565217,0.442177,0.0,0.52766,0.53304,-1.1,0.441176


In [101]:
ga["player_season"] = range(0, ga.shape[0])

In [103]:
ga

,Unnamed: 0,IDfg,Season,Name,Team,Age,G,AB,PA,H,...,Oppo%+,Soft%+,Med%+,Hard%+,Events,CStr%,CSW%,Next_WAR,team_code,player_season
1019,0.164762,2,2002,Garret Anderson,ANA,0.423077,0.957265,0.859712,0.82699,0.711207,...,0.255639,0.224719,0.478261,0.659864,0.0,0.365957,0.418502,5.1,0.029412,0
427,0.122161,2,2003,Garret Anderson,ANA,0.461538,0.965812,0.859712,0.818339,0.737069,...,0.255639,0.365169,0.507246,0.52381,0.0,0.480851,0.506608,0.8,0.029412,1
4349,0.362957,2,2004,Garret Anderson,ANA,0.5,0.564103,0.507194,0.475779,0.443966,...,0.218045,0.297753,0.608696,0.44898,0.0,0.531915,0.585903,-0.2,0.029412,2
6033,0.59148,2,2005,Garret Anderson,LAA,0.538462,0.820513,0.746403,0.697232,0.573276,...,0.278195,0.421348,0.478261,0.503401,0.0,0.421277,0.53304,0.1,0.441176,3
5578,0.560023,2,2006,Garret Anderson,LAA,0.576923,0.811966,0.688849,0.67128,0.525862,...,0.300752,0.353933,0.434783,0.591837,0.0,0.442553,0.511013,1.4,0.441176,4
3367,0.271265,2,2007,Garret Anderson,LAA,0.615385,0.529915,0.46223,0.432526,0.405172,...,0.285714,0.44382,0.42029,0.52381,0.0,0.442553,0.480176,1.4,0.441176,5
3399,0.472281,2,2008,Garret Anderson,LAA,0.653846,0.846154,0.714029,0.679931,0.573276,...,0.285714,0.38764,0.565217,0.442177,0.0,0.52766,0.53304,-1.1,0.441176,6


In [108]:
ga[["player_season", "WAR"]]

,player_season,WAR
1019,0,0.440994
427,1,0.52795
4349,2,0.26087
6033,3,0.198758
5578,4,0.217391
3367,5,0.298137
3399,6,0.298137


In [110]:
ga[["player_season", "WAR"]].expanding().corr()

,,player_season,WAR
1019,player_season,,
1019,WAR,,
427,player_season,1.0,1.0
427,WAR,1.0,1.0
4349,player_season,1.0,-0.661143
4349,WAR,-0.661143,1.0
6033,player_season,1.0,-0.836562
6033,WAR,-0.836562,1.0
5578,player_season,1.0,-0.836312
5578,WAR,-0.836312,1.0


In [114]:
ga[["player_season", "WAR"]].expanding().corr().loc[(slice(None), "player_season"), "WAR"]

1019  player_season         NaN
427   player_season    1.000000
4349  player_season   -0.661143
6033  player_season   -0.836562
5578  player_season   -0.836312
3367  player_season   -0.692192
3399  player_season   -0.595013
Name: WAR, dtype: float64

In [116]:
list(ga[["player_season", "WAR"]].expanding().corr().loc[(slice(None), "player_season"), "WAR"])

[nan,
 1.0,
 -0.6611430912519525,
 -0.8365619976685157,
 -0.8363121929961224,
 -0.6921918007562201,
 -0.5950132649769155]
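The expanding correlation above measures, season by season, how strongly a player's WAR has trended with time so far. As a possible simplification, `Series.expanding().corr(other)` returns one value per row directly, without the MultiIndex lookup. The sketch below re-enters the seven scaled WAR values shown above by hand, so the results agree only up to rounding.

```python
import pandas as pd

# Scaled WAR values for the seven seasons shown above.
ga_toy = pd.DataFrame({
    "player_season": range(7),
    "WAR": [0.440994, 0.52795, 0.26087, 0.198758, 0.217391, 0.298137, 0.298137],
})

# Correlation between season number and WAR over all seasons up to
# and including the current row.
war_corr = ga_toy["player_season"].expanding().corr(ga_toy["WAR"])

print(war_corr.round(6).tolist())
```

The first value is `NaN` because a correlation needs at least two points, and the second is trivially 1 or -1 because two points always lie on a line. That is worth keeping in mind when the missing values are later filled with 1.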

In [118]:
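def player_history(df):
    df = df.sort_values("Season")
    
    # Season counter within each player's career: 0, 1, 2, ...
    df["player_season"] = range(0, df.shape[0])
    # Running correlation between season number and WAR (career trend so far)
    df["war_corr"] = list(df[["player_season", "WAR"]].expanding().corr().loc[(slice(None), "player_season"), "WAR"])
    # Assign the result back instead of using inplace=True on a column
    df["war_corr"] = df["war_corr"].fillna(1)
    
    # Ratio of this season's WAR to the previous season's
    df["war_diff"] = df["WAR"] / df["WAR"].shift(1)
    df["war_diff"] = df["war_diff"].fillna(1)
    
    # A previous WAR of 0 yields inf; treat it as "no change".
    # .loc avoids chained assignment, which may silently fail to write.
    df.loc[df["war_diff"] == np.inf, "war_diff"] = 1
    
    return df

batting = batting.groupby("IDfg", group_keys=False).apply(player_history)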

In [120]:
def group_averages(df):
    return df["WAR"] / df["WAR"].mean()

In [124]:
batting["war_season"] = batting.groupby("Season", group_keys=False).apply(group_averages)
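The `war_season` feature expresses each player's WAR relative to the average WAR of that season. A minimal sketch with invented numbers, using `groupby(...).transform("mean")` as an alternative to the `apply` call above:

```python
import pandas as pd

toy = pd.DataFrame({
    "Season": [2020, 2020, 2021, 2021],
    "WAR":    [1.0, 3.0, 2.0, 6.0],
})

# Mean WAR of each row's own season, aligned to the original index.
season_mean = toy.groupby("Season")["WAR"].transform("mean")
toy["war_season"] = toy["WAR"] / season_mean

print(toy["war_season"].tolist())  # [0.5, 1.5, 0.5, 1.5]
```

A value above 1 means the player was better than the season average. The ratio is only well behaved when the season mean is comfortably above zero, which holds in the notebook because `WAR` has already been scaled to the 0 to 1 range.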

In [126]:
new_predictors = predictors + ["player_season", "war_corr", "war_season", "war_diff"]

In [128]:
predictions = backtest(batting, rr, new_predictors)

In [130]:
mean_squared_error(predictions["actual"], predictions["prediction"])

2.670955561949691

In [132]:
rr.coef_

array([-2.72509987e+00,  1.75754340e+00, -7.07326023e-01,  1.02288463e+00,
       -9.75975450e-01, -1.53447788e+00,  3.80016854e-01, -1.78865572e+00,
        7.17269101e-01, -7.38206446e-01, -2.14705287e-01, -6.95398354e-01,
       -1.04738915e+00, -4.77138335e-01,  6.60879022e-01, -1.22739668e+00,
       -2.23573176e-01, -2.31475367e-01, -1.25689266e+00,  2.25664234e+00,
        5.60662139e-05, -1.22821122e-01,  3.43649720e+00, -5.86508567e-01])

In [138]:
pd.Series(rr.coef_, index=new_predictors)

Age             -2.725100
IBB              1.757543
SO              -0.707326
SB               1.022885
BU              -0.975975
BABIP           -1.534478
IFH%             0.380017
WAR             -1.788656
Spd              0.717269
PH              -0.738206
CB%             -0.214705
Z-Contact%      -0.695398
SwStr%          -1.047389
wGDP            -0.477138
Oppo%            0.660879
SLG+            -1.227397
LD+%            -0.223573
Pull%+          -0.231475
Soft%+          -1.256893
Hard%+           2.256642
player_season    0.000056
war_corr        -0.122821
war_season       3.436497
war_diff        -0.586509
dtype: float64

In [140]:
pd.Series(rr.coef_, index=new_predictors).sort_values()

Age             -2.725100
WAR             -1.788656
BABIP           -1.534478
Soft%+          -1.256893
SLG+            -1.227397
SwStr%          -1.047389
BU              -0.975975
PH              -0.738206
SO              -0.707326
Z-Contact%      -0.695398
war_diff        -0.586509
wGDP            -0.477138
Pull%+          -0.231475
LD+%            -0.223573
CB%             -0.214705
war_corr        -0.122821
player_season    0.000056
IFH%             0.380017
Oppo%            0.660879
Spd              0.717269
SB               1.022885
IBB              1.757543
Hard%+           2.256642
war_season       3.436497
dtype: float64

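The larger a coefficient's absolute value, the more that feature moves the prediction; coefficients near zero contribute little, and the sign gives the direction of the effect. The comparison is only fair between features on the same scale. The original predictors were min-max scaled to the 0 to 1 range, but `player_season`, `war_corr`, `war_season`, and `war_diff` were added after scaling, so their coefficients aren't directly comparable with the rest.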
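Coefficient size depends on a feature's units as well as its importance, which may explain why `player_season` (a raw count) has a coefficient near zero. The sketch below uses synthetic data, not the batting table: two features with equal true effect, one of them re-expressed in units 100 times larger.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Two synthetic features on [0, 1] with equal true effect on the target.
X = rng.uniform(size=(200, 2))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

coef_same_scale = Ridge(alpha=1).fit(X, y).coef_

# Re-express the second feature in units 100x larger (like a raw count).
X_rescaled = X.copy()
X_rescaled[:, 1] *= 100
coef_mixed_scale = Ridge(alpha=1).fit(X_rescaled, y).coef_

print(coef_same_scale.round(3), coef_mixed_scale.round(3))
```

On a common scale the two coefficients come out similar; after rescaling, the second one shrinks by roughly the same factor of 100 even though the feature is just as informative. Scaling the engineered features before fitting would make the coefficient table easier to read.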

In [146]:
diff = predictions["actual"] - predictions["prediction"]

In [148]:
diff

2593   -0.312362
3367    0.923673
4554   -0.548822
4647   -0.306251
1741    2.689615
          ...   
2051   -1.453435
4626   -0.693463
6861   -0.576410
6190   -0.866456
5038    1.556792
Length: 4115, dtype: float64

In [150]:
merged = predictions.merge(batting, left_index=True, right_index=True)

In [154]:
merged["diff"] = (predictions["actual"] - predictions["prediction"]).abs()

In [156]:
merged

,actual,prediction,Unnamed: 0,IDfg,Season,Name,Team,Age,G,AB,...,Events,CStr%,CSW%,Next_WAR,team_code,player_season,war_corr,war_diff,war_season,diff
2593,1.2,1.512362,0.705318,1,2007,Alfredo Amezaga,FLA,0.384615,0.743590,0.431655,...,0.000000,0.527660,0.396476,1.2,0.352941,1,1.000000,1.200000,0.998355,0.312362
3367,1.4,0.476327,0.271265,2,2007,Garret Anderson,LAA,0.615385,0.529915,0.462230,...,0.000000,0.442553,0.480176,1.4,0.441176,5,-0.692192,1.371429,0.887427,0.923673
4554,-0.1,0.448822,0.438426,10,2007,David Eckstein,STL,0.500000,0.606838,0.492806,...,0.000000,0.676596,0.436123,-0.1,0.852941,5,-0.694330,0.836735,0.758010,0.548822
4647,0.6,0.906251,0.815630,11,2007,Darin Erstad,CHW,0.538462,0.350427,0.269784,...,0.000000,0.765957,0.691630,0.6,0.205882,4,-0.828562,0.803922,0.758010,0.306251
1741,4.8,2.110385,0.156298,15,2007,Troy Glaus,TOR,0.423077,0.589744,0.404676,...,0.000000,0.634043,0.704846,4.8,0.970588,5,0.231396,0.897059,1.127772,2.689615
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2051,1.2,2.653435,0.269855,23667,2021,Wander Franco,TBR,0.038462,0.205128,0.217626,...,0.409015,0.391489,0.352423,1.2,0.911765,0,1.000000,1.000000,1.053432,1.453435
4626,1.0,1.693463,0.827056,24618,2021,Ryan Jeffers,MIN,0.192308,0.333333,0.192446,...,0.265442,0.514894,0.788546,1.0,0.558824,0,1.000000,1.000000,0.744667,0.693463
6861,0.6,1.176410,0.988715,24655,2021,Owen Miller,CLE,0.192308,0.119658,0.055755,...,0.230384,0.548936,0.700441,0.6,0.264706,0,1.000000,1.000000,0.435903,0.576410
6190,0.5,1.366456,0.688390,26197,2021,Andrew Vaughn,CHW,0.153846,0.692308,0.462230,...,0.535893,0.570213,0.651982,0.5,0.205882,0,1.000000,1.000000,0.563041,0.866456


In [158]:
merged[["IDfg", "Season", "Name", "WAR", "Next_WAR", "diff"]].sort_values(["diff"])

,IDfg,Season,Name,WAR,Next_WAR,diff
3670,13359,2019,Tyler Naquin,0.285714,1.2,0.000061
3382,2106,2008,Ryan Church,0.298137,1.0,0.001045
1261,11846,2016,Leonys Martin,0.422360,1.9,0.001072
3308,1275,2007,Ivan Rodriguez,0.304348,0.8,0.002014
3204,96,2011,Andruw Jones,0.304348,0.3,0.002228
...,...,...,...,...,...,...
3246,4810,2007,Brian McCann,0.304348,8.6,6.337638
5709,5631,2010,Matt Kemp,0.211180,8.3,6.356862
3555,1875,2009,Josh Hamilton,0.291925,8.4,6.538779
848,9166,2010,Buster Posey,0.459627,10.1,6.653248
