# Player evolution prediction

Goal: predict how many points per game a player will score next season, based on data from past seasons

We use data from seasons 17/18 through 20/21 to predict the 21/22 season

In [41]:
import pandas as pd

In [155]:
import numpy as np

In [53]:
df = pd.read_csv('players_stats_by_season.csv')

In [54]:
df.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,TRB,AST,STL,BLK,TOV,PF,PTS,Player Name,Player Link,Season
0,1,Álex Abrines\abrinal01,SG,24,OKC,75,8,15.1,1.5,3.9,...,1.5,0.4,0.5,0.1,0.3,1.7,4.7,Álex Abrines,https://www.basketball-reference.com/players/a...,season_17_18
1,2,Quincy Acy\acyqu01,PF,27,BRK,70,8,19.4,1.9,5.2,...,3.7,0.8,0.5,0.4,0.9,2.1,5.9,Quincy Acy,https://www.basketball-reference.com/players/a...,season_17_18
2,3,Steven Adams\adamsst01,C,24,OKC,76,76,32.7,5.9,9.4,...,9.0,1.2,1.2,1.0,1.7,2.8,13.9,Steven Adams,https://www.basketball-reference.com/players/a...,season_17_18
3,4,Bam Adebayo\adebaba01,C,20,MIA,69,19,19.8,2.5,4.9,...,5.5,1.5,0.5,0.6,1.0,2.0,6.9,Bam Adebayo,https://www.basketball-reference.com/players/a...,season_17_18
4,5,Arron Afflalo\afflaar01,SG,32,ORL,53,3,12.9,1.2,3.1,...,1.2,0.6,0.1,0.2,0.4,1.1,3.4,Arron Afflalo,https://www.basketball-reference.com/players/a...,season_17_18


In [55]:
df_player_info = pd.read_csv('player_info.csv')

In [56]:
df_player_info.head()

Unnamed: 0,Player Link,Weight in kg,Height in cm,Birth Date,Experience,Player Name
0,https://www.basketball-reference.com/players/a...,90,198,1993-08-01,3 years,Álex Abrines
1,https://www.basketball-reference.com/players/a...,108,201,1990-10-06,7 years,Quincy Acy
2,https://www.basketball-reference.com/players/a...,120,211,1993-07-20,8 years,Steven Adams
3,https://www.basketball-reference.com/players/a...,115,206,1997-07-18,4 years,Bam Adebayo
4,https://www.basketball-reference.com/players/a...,95,196,1985-10-15,11 years,Arron Afflalo


In [57]:
df = df.merge(df_player_info.drop(columns = ['Player Link']), on = 'Player Name')

## Model 1
### Last Year Results + Experience + Info + Lift on the previous year

We use each player's stats from the most recent season plus the lift (relative change) in each metric versus the season before it to predict the following season

We train on 19/20 features to predict 20/21 points per game

The final test uses 20/21 features to predict 21/22 points per game, without training on any 21/22 values

In [88]:
df_y = df.loc[df['Season'] == 'season_20_21',['Player Name','PTS']]

In [89]:
df_y

Unnamed: 0,Player Name,PTS
7,Steven Adams,7.6
12,Bam Adebayo,18.7
19,LaMarcus Aldridge,13.5
24,Jarrett Allen,12.8
33,Al-Farouq Aminu,4.4
...,...,...
2609,Patrick Williams,9.2
2611,Dylan Windler,5.2
2613,Cassius Winston,1.9
2615,James Wiseman,11.5


Because the model uses the previous two seasons to predict the next one, we consider only players who played in all three seasons: 18/19, 19/20 and 20/21. Future work could look at how best to handle players who did not play in every season, e.g. rookies or injured players
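
The three-season filter described above can also be written in one step. This is a minimal sketch, assuming a long-format frame with `Player Name` and `Season` columns like `df`; the toy data and the helper name `players_in_all_seasons` are illustrative and not part of the notebook.

```python
import pandas as pd

def players_in_all_seasons(frame, seasons):
    """Return the sorted names of players that have a row in every listed season."""
    subset = frame[frame['Season'].isin(seasons)]
    # Count distinct seasons per player and keep those present in all of them.
    seasons_played = subset.groupby('Player Name')['Season'].nunique()
    return sorted(seasons_played[seasons_played == len(seasons)].index)

# Toy data: only player A appears in all three seasons.
toy = pd.DataFrame({
    'Player Name': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Season': ['season_18_19', 'season_19_20', 'season_20_21',
               'season_19_20', 'season_20_21', 'season_20_21'],
})
required = ['season_18_19', 'season_19_20', 'season_20_21']
eligible = players_in_all_seasons(toy, required)
print(eligible)
```

Using `nunique` rather than `count` keeps a player with more than one row in a season from being miscounted; whether such duplicate rows exist in this dataset has not been checked here.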

In [90]:
df_X = df.loc[df['Season'].isin(['season_19_20','season_18_19']) & (df['Player Name'].isin(df_y['Player Name'].unique()))]

In [91]:
df_X

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,TOV,PF,PTS,Player Name,Player Link,Season,Weight in kg,Height in cm,Birth Date,Experience
5,4,Steven Adams\adamsst01,C,25,OKC,80,80,33.4,6.0,10.1,...,1.7,2.6,13.9,Steven Adams,https://www.basketball-reference.com/players/a...,season_18_19,120,211,1993-07-20,8 years
6,1,Steven Adams\adamsst01,C,26,OKC,63,63,26.7,4.5,7.6,...,1.5,1.9,10.9,Steven Adams,https://www.basketball-reference.com/players/a...,season_19_20,120,211,1993-07-20,8 years
10,5,Bam Adebayo\adebaba01,C,21,MIA,82,28,23.3,3.4,5.9,...,1.5,2.5,8.9,Bam Adebayo,https://www.basketball-reference.com/players/a...,season_18_19,115,206,1997-07-18,4 years
11,2,Bam Adebayo\adebaba01,PF,22,MIA,72,72,33.6,6.1,11.0,...,2.8,2.5,15.9,Bam Adebayo,https://www.basketball-reference.com/players/a...,season_19_20,115,206,1997-07-18,4 years
17,8,LaMarcus Aldridge\aldrila01,C,33,SAS,81,81,33.2,8.4,16.3,...,1.8,2.2,21.3,LaMarcus Aldridge,https://www.basketball-reference.com/players/a...,season_18_19,113,211,1985-07-19,15 years
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2429,504,Paul Watson\watsopa01,SF,25,TOT,10,0,8.7,1.0,2.6,...,0.3,0.5,3.1,Paul Watson,https://www.basketball-reference.com/players/w...,season_19_20,95,198,1994-12-30,2 years
2432,505,Quinndary Weatherspoon\weathqu01,SG,23,SAS,11,0,7.1,0.5,1.5,...,0.5,0.7,1.1,Quinndary Weatherspoon,https://www.basketball-reference.com/players/w...,season_19_20,92,190,1996-09-10,2 years
2435,507,Coby White\whiteco01,PG,19,CHI,65,1,25.8,4.8,12.2,...,1.7,1.8,13.2,Coby White,https://www.basketball-reference.com/players/w...,season_19_20,88,196,2000-02-16,2 years
2438,511,Grant Williams\willigr01,PF,21,BOS,69,5,15.1,1.3,3.1,...,0.7,2.4,3.4,Grant Williams,https://www.basketball-reference.com/players/w...,season_19_20,107,198,1998-11-30,2 years


In [92]:
# select only the players who played in both the 18/19 and 19/20 seasons
df_players_per_season = pd.DataFrame(df_X.groupby('Player Name')['Season'].count()).reset_index().rename(columns = {'Season':'Season Count'})
df_X = df_X[df_X['Player Name'].isin(df_players_per_season.loc[df_players_per_season['Season Count'] == 2,'Player Name'].unique())]

In [93]:
df_X

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,TOV,PF,PTS,Player Name,Player Link,Season,Weight in kg,Height in cm,Birth Date,Experience
5,4,Steven Adams\adamsst01,C,25,OKC,80,80,33.4,6.0,10.1,...,1.7,2.6,13.9,Steven Adams,https://www.basketball-reference.com/players/a...,season_18_19,120,211,1993-07-20,8 years
6,1,Steven Adams\adamsst01,C,26,OKC,63,63,26.7,4.5,7.6,...,1.5,1.9,10.9,Steven Adams,https://www.basketball-reference.com/players/a...,season_19_20,120,211,1993-07-20,8 years
10,5,Bam Adebayo\adebaba01,C,21,MIA,82,28,23.3,3.4,5.9,...,1.5,2.5,8.9,Bam Adebayo,https://www.basketball-reference.com/players/a...,season_18_19,115,206,1997-07-18,4 years
11,2,Bam Adebayo\adebaba01,PF,22,MIA,72,72,33.6,6.1,11.0,...,2.8,2.5,15.9,Bam Adebayo,https://www.basketball-reference.com/players/a...,season_19_20,115,206,1997-07-18,4 years
17,8,LaMarcus Aldridge\aldrila01,C,33,SAS,81,81,33.2,8.4,16.3,...,1.8,2.2,21.3,LaMarcus Aldridge,https://www.basketball-reference.com/players/a...,season_18_19,113,211,1985-07-19,15 years
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2134,516,Robert Williams\williro04,C,22,BOS,29,1,13.4,2.2,3.0,...,0.7,1.8,5.2,Robert Williams,https://www.basketball-reference.com/players/w...,season_19_20,107,203,1997-10-17,3 years
2137,521,Christian Wood\woodch01,PF,23,TOT,21,2,12.0,2.9,5.6,...,0.8,0.8,8.2,Christian Wood,https://www.basketball-reference.com/players/w...,season_18_19,97,208,1995-09-27,5 years
2138,521,Christian Wood\woodch01,PF,24,DET,62,12,21.4,4.6,8.2,...,1.4,1.6,13.1,Christian Wood,https://www.basketball-reference.com/players/w...,season_19_20,97,208,1995-09-27,5 years
2141,526,Trae Young\youngtr01,PG,20,ATL,81,81,30.9,6.5,15.5,...,3.8,1.7,19.1,Trae Young,https://www.basketball-reference.com/players/y...,season_18_19,74,185,1998-09-19,3 years


In [95]:
# reduce y to the players who also played in both the 18/19 and 19/20 seasons
df_y = df_y[df_y['Player Name'].isin(df_X['Player Name'].unique())]

Now we are going to build our X. It consists of three parts:
- 19/20 season stats per game
- lift from 18/19 to 19/20 in these same stats
- a few other attributes such as age, position and years in the league
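
The lift in the second bullet is the relative change of a stat between two seasons, (current - previous) / previous. Here is a minimal sketch on made-up numbers, assuming one row per player per season. Indexing both seasons by `Player Name` is an assumption of this sketch (the notebook aligns rows by position); it makes the subtraction pair each player with his own earlier row.

```python
import pandas as pd

# Made-up stats; note the two frames list the players in a different order.
prev = pd.DataFrame({'Player Name': ['A', 'B'], 'PTS': [10.0, 4.0], 'AST': [2.0, 0.0]})
curr = pd.DataFrame({'Player Name': ['B', 'A'], 'PTS': [5.0, 12.0], 'AST': [1.0, 3.0]})

prev = prev.set_index('Player Name')
curr = curr.set_index('Player Name')

# pandas aligns on the index, so the row order of the two frames does not matter.
lift = curr.sub(prev).div(prev)
print(lift)
```

Player B's `AST` lift is infinite because the previous value was 0, which is why infinite values have to be handled before fitting a model.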

In [118]:
df_X.columns

Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS',
       'Player Name', 'Player Link', 'Season', 'Weight in kg', 'Height in cm',
       'Birth Date', 'Experience'],
      dtype='object')

In [135]:
num_cols = ['G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']

In [136]:
df_1920 = df_X[df_X['Season'] == 'season_19_20'].reset_index(drop = True)
df_1819 = df_X[df_X['Season'] == 'season_18_19'].reset_index(drop = True)

In [137]:
df_lift = df_1920[num_cols].sub(df_1819[num_cols], axis = 'columns').div(df_1819[num_cols], axis = 'columns')

In [139]:
df_X_final = df_1920.merge(df_lift, left_index = True, right_index = True, suffixes=('','_LIFT'))

In [145]:
#adjusting age
df_X_final['Age'] = df_X_final['Age'] - 1

In [222]:
X = df_X_final.drop(columns = ['Rk','Player','Tm','Player Name','Player Link', 'Season','Birth Date'])

In [223]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 335 entries, 0 to 334
Data columns (total 55 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Pos           335 non-null    object 
 1   Age           335 non-null    int64  
 2   G             335 non-null    int64  
 3   GS            335 non-null    int64  
 4   MP            335 non-null    float64
 5   FG            335 non-null    float64
 6   FGA           335 non-null    float64
 7   FG%           335 non-null    float64
 8   3P            335 non-null    float64
 9   3PA           335 non-null    float64
 10  3P%           325 non-null    float64
 11  2P            335 non-null    float64
 12  2PA           335 non-null    float64
 13  2P%           335 non-null    float64
 14  eFG%          335 non-null    float64
 15  FT            335 non-null    float64
 16  FTA           335 non-null    float64
 17  FT%           333 non-null    float64
 18  ORB           335 non-null    

In [224]:
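# numeric_only=True skips the text columns (Pos, Experience), which recent pandas versions refuse to average
X = X.fillna(X.mean(numeric_only=True))
X = X.replace([np.inf, -np.inf], 0)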
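
One thing to watch in the cell above: a column that still contains `inf` has an infinite mean, so its missing values are filled with `inf` and then turned into 0 by the `replace`. The sketch below uses made-up values to show an alternative ordering (not the notebook's method): turn `inf` into `NaN` first, then fill everything with the finite column mean.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'PTS_LIFT': [0.5, np.inf, np.nan, 1.5]})

# Order used in the notebook: fill with the mean, then zero out infinities.
as_in_notebook = toy.fillna(toy.mean(numeric_only=True)).replace([np.inf, -np.inf], 0)

# Alternative: mask infinities first so the mean is computed on finite values only.
finite = toy.replace([np.inf, -np.inf], np.nan)
alternative = finite.fillna(finite.mean(numeric_only=True))

print(as_in_notebook['PTS_LIFT'].tolist())
print(alternative['PTS_LIFT'].tolist())
```

Which of the two is preferable depends on whether a lift from a zero baseline should count as "no change" or as "typical change"; the notebook's results were produced with the first ordering.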

In [225]:
X.Pos.unique()

array(['C', 'PF', 'SF', 'PG', 'SG', 'SF-SG', 'PF-C', 'SF-PF', 'PF-SF',
       'SG-PG', 'PG-SG'], dtype=object)

In [226]:
X.Pos = X.Pos.replace(['PG','SG','PG-SG','SG-PG','SF','PF','C','SF-SG','PF-C','SF-PF','PF-SF'],
                      ['Guard','Guard','Guard','Guard','Forward','Forward','Forward','Forward','Forward','Forward','Forward'])

In [227]:
X = X.merge(pd.get_dummies(X.Pos), left_index = True, right_index = True)
X = X.drop(columns = ['Pos'])
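
The two replacement lists above follow a simple rule: a position is a Guard when every part of it is `PG` or `SG`, and a Forward otherwise. Below is a minimal sketch of that rule as a function (the name `position_group` is hypothetical). It also covers combinations that are missing from the list, such as `C-PF`.

```python
GUARD_POSITIONS = {'PG', 'SG'}

def position_group(pos):
    """Map a position label such as 'SF-SG' to 'Guard' or 'Forward'."""
    parts = pos.split('-')
    # Guard only when every listed position is a guard position.
    return 'Guard' if all(part in GUARD_POSITIONS for part in parts) else 'Forward'

for label in ['PG', 'SG-PG', 'SF-SG', 'PF-C', 'C']:
    print(label, '->', position_group(label))
```

With pandas this could be applied as `X['Pos'].map(position_group)` before calling `pd.get_dummies`.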

In [228]:
# raw string avoids the invalid-escape warning; expand=False returns a Series
X['Experience'] = X.Experience.str.extract(r'(\d+)', expand=False).astype(int)

In [229]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 335 entries, 0 to 334
Data columns (total 56 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Age           335 non-null    int64  
 1   G             335 non-null    int64  
 2   GS            335 non-null    int64  
 3   MP            335 non-null    float64
 4   FG            335 non-null    float64
 5   FGA           335 non-null    float64
 6   FG%           335 non-null    float64
 7   3P            335 non-null    float64
 8   3PA           335 non-null    float64
 9   3P%           335 non-null    float64
 10  2P            335 non-null    float64
 11  2PA           335 non-null    float64
 12  2P%           335 non-null    float64
 13  eFG%          335 non-null    float64
 14  FT            335 non-null    float64
 15  FTA           335 non-null    float64
 16  FT%           335 non-null    float64
 17  ORB           335 non-null    float64
 18  DRB           335 non-null    

In [230]:
y = df_y.PTS

We are going to split our data into train and test sets. Remember:
- The X dataframe consists of the 19/20 season stats and the lift compared to the previous season
- The y dataframe consists of the points per game in the 20/21 season

We start with a Random Forest Regressor so that we can retrieve the feature importances of the model

In [231]:
from sklearn.model_selection import train_test_split

In [232]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

In [233]:
from sklearn.ensemble import RandomForestRegressor

In [234]:
rfreg = RandomForestRegressor(random_state=27)

In [235]:
rfreg.fit(X_train, y_train)

RandomForestRegressor(random_state=27)

In [236]:
rfreg.predict(X_test)

array([ 6.669, 17.144,  5.048, 16.225, 17.055, 14.794, 15.318, 25.705,
        3.98 , 10.447, 12.19 , 14.705, 16.747, 15.753,  6.61 , 15.529,
       14.609,  7.124,  8.927,  5.864,  8.83 , 14.238,  4.914,  9.73 ,
        7.389, 16.893, 11.148,  4.437, 14.1  ,  4.69 ,  9.12 ,  6.989,
        6.107,  4.372,  6.855,  4.828,  6.92 ,  4.762,  7.294, 19.412,
        6.491, 13.846,  7.106,  6.476, 14.201, 12.395,  9.404,  4.774,
        8.647, 12.592, 18.671,  6.338,  6.477,  4.077,  7.107,  6.617,
        4.006, 19.057, 12.312,  4.345,  6.439, 16.403,  8.625,  7.448,
       11.955,  7.598,  9.881, 19.859,  7.798,  6.106, 10.324, 17.025,
       11.258,  8.239,  7.033,  7.967, 11.696,  7.014,  5.418,  6.066,
        6.785,  7.793, 22.604,  4.953])

In [237]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [238]:
print(f'Our model had a mean absolute error of {mean_absolute_error(y_test,rfreg.predict(X_test))} \nAnd a mean squared error of {mean_squared_error(y_test,rfreg.predict(X_test))}')

Our model had a mean absolute error of 2.787845238095238 
And a mean squared error of 12.510513583333333
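
The mean squared error is in squared points, so its square root is easier to compare with the mean absolute error. A small sketch using the two values printed above:

```python
import math

mae = 2.787845238095238    # mean absolute error printed above
mse = 12.510513583333333   # mean squared error printed above

# RMSE is in points per game, the same unit as the target.
rmse = math.sqrt(mse)
print(f'MAE: {mae:.2f}, RMSE: {rmse:.2f} points per game')
```

An RMSE noticeably larger than the MAE usually means that a few players are missed by a wide margin, since squaring weights large errors more heavily.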


In [255]:
# use the training columns (X); X2 is only defined further down
pd.DataFrame(rfreg.feature_importances_, index = X.columns, columns = ['Feature Importance']).sort_values(by = 'Feature Importance', ascending=False).head(10)

Unnamed: 0,Feature Importance
PTS,0.513017
FG,0.159821
FGA,0.102106
FT,0.015995
2PA,0.011126
2P,0.010515
FTA,0.009393
G,0.007758
MP,0.006755
TRB_LIFT,0.0067


## Conclusion part 1

By far the most important feature for predicting next season's points per game is the previous season's points per game. This makes sense, since a player's scoring is not expected to move far from its current level in a single year. The model may not be telling us anything we did not already know, but the results do confirm our expectations
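
One way to check whether the model adds anything beyond last season's scoring is a persistence baseline, which predicts each player's next-season points per game as exactly his current value. Here is a minimal sketch on made-up numbers (the lists below are illustrative and not taken from the dataset):

```python
def mean_absolute_error_simple(actual, predicted):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

pts_current = [10.0, 20.0, 5.0, 8.0]   # hypothetical PPG in the feature season
pts_next = [12.0, 18.0, 5.0, 11.0]     # hypothetical PPG in the target season
pts_model = [11.0, 19.0, 6.0, 9.0]     # hypothetical model predictions

baseline_mae = mean_absolute_error_simple(pts_next, pts_current)
model_mae = mean_absolute_error_simple(pts_next, pts_model)
print(f'baseline MAE: {baseline_mae:.2f}, model MAE: {model_mae:.2f}')
```

In the notebook the same comparison would use `X_test['PTS']` as the baseline prediction against `y_test`.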

Regardless of the previous results, let's use the same fitted model to predict the 21/22 season for all players who played in the 19/20 and 20/21 seasons

In [241]:
df_y2 = df.loc[df['Season'] == 'season_21_22',['Player Name','PTS']]
df_X2 = df.loc[df['Season'].isin(['season_19_20','season_20_21']) & (df['Player Name'].isin(df_y2['Player Name'].unique()))]

# select only the players who played in both the 19/20 and 20/21 seasons
df_players_per_season2 = pd.DataFrame(df_X2.groupby('Player Name')['Season'].count()).reset_index().rename(columns = {'Season':'Season Count'})
df_X2 = df_X2[df_X2['Player Name'].isin(df_players_per_season2.loc[df_players_per_season2['Season Count'] == 2,'Player Name'].unique())]

# reduce y to the players who also played in both the 19/20 and 20/21 seasons
df_y2 = df_y2[df_y2['Player Name'].isin(df_X2['Player Name'].unique())]

df_1920_2 = df_X2[df_X2['Season'] == 'season_19_20'].reset_index(drop = True)
df_2021_2 = df_X2[df_X2['Season'] == 'season_20_21'].reset_index(drop = True)

df_lift2 = df_2021_2[num_cols].sub(df_1920_2[num_cols], axis = 'columns').div(df_1920_2[num_cols], axis = 'columns')

df_X_final2 = df_2021_2.merge(df_lift2, left_index = True, right_index = True, suffixes=('','_LIFT'))

X2 = df_X_final2.drop(columns = ['Rk','Player','Tm','Player Name','Player Link', 'Season','Birth Date'])

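# numeric_only=True skips the text columns (Pos, Experience), which recent pandas versions refuse to average
X2 = X2.fillna(X2.mean(numeric_only=True))
X2 = X2.replace([np.inf, -np.inf], 0)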

X2.Pos = X2.Pos.replace(['PG','SG','PG-SG','SG-PG','SF','PF','C','SF-SG','PF-C','SF-PF','PF-SF','C-PF','SG-SF'],
                      ['Guard','Guard','Guard','Guard','Forward','Forward','Forward','Forward','Forward','Forward','Forward','Forward','Forward'])

X2 = X2.merge(pd.get_dummies(X2.Pos), left_index = True, right_index = True)
X2 = X2.drop(columns = ['Pos'])

X2['Experience'] = X2.Experience.str.extract(r'(\d+)', expand=False).astype(int)

y2 = df_y2.PTS

In [245]:
print(f'Our model had a mean absolute error of {mean_absolute_error(y2,rfreg.predict(X2))} \nAnd a mean squared error of {mean_squared_error(y2,rfreg.predict(X2))}')

Our model had a mean absolute error of 2.4289083557951487 
And a mean squared error of 8.99491191644205
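
The fitted forest expects `X2` to have the same columns, in the same order, as the training matrix `X`. Below is a minimal sketch of how that could be enforced with `DataFrame.reindex`, using made-up frames. Filling absent columns with 0 is an assumption that suits dummy columns but not necessarily other features.

```python
import pandas as pd

# Made-up stand-ins for the training matrix and a new feature matrix.
X_train_like = pd.DataFrame({'PTS': [10.0], 'Forward': [1], 'Guard': [0]})
X_new = pd.DataFrame({'Guard': [1], 'PTS': [7.5], 'Extra': [3]})

# Keep the training columns in training order; add missing ones as 0, drop unknown ones.
X_new_aligned = X_new.reindex(columns=X_train_like.columns, fill_value=0)
print(X_new_aligned)
```

In the notebook this would correspond to something like `X2.reindex(columns=X.columns, fill_value=0)` before calling `rfreg.predict`.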


### A few other metrics

In [282]:
df_change = df_y2.reset_index(drop = True)

In [283]:
df_change = df_change.merge(df_2021_2[['PTS']], left_index = True, right_index = True, suffixes=('_21_22','_20_21'))

In [284]:
df_change['increased_ppg'] = 1
df_change.loc[df_change['PTS_21_22'] < df_change['PTS_20_21'],'increased_ppg'] = 0

In [285]:
df_change

Unnamed: 0,Player Name,PTS_21_22,PTS_20_21,increased_ppg
0,Steven Adams,6.9,7.6,0
1,Bam Adebayo,19.1,18.7,1
2,LaMarcus Aldridge,12.9,13.5,0
3,Jarrett Allen,16.1,12.8,1
4,Kyle Anderson,7.6,12.4,0
...,...,...,...,...
366,Tremont Waters,3.3,3.8,0
367,Paul Watson,3.4,4.1,0
368,Quinndary Weatherspoon,2.7,2.3,1
369,Coby White,12.7,15.1,0


In [286]:
df_change['PTS_21_22_predicted'] = rfreg.predict(X2)

In [287]:
df_change['increased_ppg_in_pred'] = 1
df_change.loc[df_change['PTS_21_22_predicted'] < df_change['PTS_20_21'],'increased_ppg_in_pred'] = 0

In [302]:
print(f'{df_change.increased_ppg.sum()} players have increased their points per game stats from 20/21 season to 21/22 season. \nOf those, the model predicted a correct increase in {df_change[df_change.increased_ppg == 1].increased_ppg_in_pred.sum()/df_change.increased_ppg.sum()} of the cases')

161 players have increased their points per game stats from 20/21 season to 21/22 season. 
Of those, the model predicted a correct increase in 0.5403726708074534 of the cases


In [305]:
print(f'{df_change.shape[0] - df_change.increased_ppg.sum()} players have decreased their points per game from 20/21 season to 21/22 season. \nOf those, the model predicted a correct decrease in {(df_change[df_change.increased_ppg == 0].shape[0] - df_change[df_change.increased_ppg == 0].increased_ppg_in_pred.sum())/(df_change.shape[0] - df_change.increased_ppg.sum())} of the cases')

210 players have decreased their points per game from 20/21 season to 21/22 season. 
Of those, the model predicted a correct decrease in 0.5904761904761905 of the cases
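
The two rates above can be combined into a single directional accuracy. The small sketch below uses only the printed counts and rates; the comparison with always predicting a decrease is added here for context.

```python
increased = 161                      # players flagged increased_ppg == 1 (printed above)
decreased = 210                      # players flagged increased_ppg == 0 (printed above)
hit_rate_up = 0.5403726708074534     # share of increases the model got right
hit_rate_down = 0.5904761904761905   # share of decreases the model got right

correct_up = round(increased * hit_rate_up)       # correctly predicted increases
correct_down = round(decreased * hit_rate_down)   # correctly predicted decreases
total = increased + decreased

accuracy = (correct_up + correct_down) / total
majority_baseline = max(increased, decreased) / total  # always guess the larger class
print(f'directional accuracy: {accuracy:.3f}, majority baseline: {majority_baseline:.3f}')
```

On these numbers the model's directional accuracy is only slightly above what always predicting a decrease would achieve, so the direction of change appears much harder to predict than the level of points per game.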
