# Predicting NBA MVP - Machine Learning

This is the final notebook in a project that uses machine learning to predict the NBA MVP. The goal is to predict the 5 players who receive the most MVP votes. I'm not worried about picking them in order, only about getting the top 5 players in any order.

## Import all the things

In [77]:
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

In [2]:
stats = pd.read_csv("all_stats.csv", index_col=0)
stats.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,Pts Max,Share,Team,W,L,W/L%,GB,PS/G,PA/G,SRS
0,A.C. Green,PF,27,LAL,82,21,26.4,3.1,6.6,0.476,...,0.0,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73
1,Byron Scott,SG,29,LAL,82,82,32.1,6.1,12.8,0.477,...,0.0,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73
2,Elden Campbell,PF,22,LAL,52,0,7.3,1.1,2.4,0.455,...,0.0,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73
3,Irving Thomas,PF,25,LAL,26,0,4.2,0.7,1.9,0.34,...,0.0,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73
4,James Worthy,SF,29,LAL,78,74,38.6,9.2,18.7,0.492,...,0.0,0.0,Los Angeles Lakers,58,24,0.707,5.0,106.3,99.6,6.73


## Data Prep

### Quick examination of the data

Since I scraped the data and put it all together, I know it's in pretty good shape already. I'd like to take a quick look, though, just to make sure there aren't any issues I missed earlier.

In [3]:
stats.tail()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,Pts Max,Share,Team,W,L,W/L%,GB,PS/G,PA/G,SRS
14087,Spencer Hawes,PF,28,MIL,54,1,14.8,2.5,5.1,0.484,...,0.0,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
14088,Steve Novak,PF,33,MIL,8,0,2.8,0.3,0.9,0.286,...,0.0,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
14089,Terrence Jones,PF,25,MIL,54,12,23.5,4.3,9.1,0.47,...,0.0,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
14090,Thon Maker,C,19,MIL,57,34,9.9,1.5,3.2,0.459,...,0.0,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45
14091,Tony Snell,SG,25,MIL,80,80,29.2,3.1,6.8,0.455,...,0.0,0.0,Milwaukee Bucks,42,40,0.512,9.0,103.6,103.8,-0.45


In [4]:
stats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14092 entries, 0 to 14091
Data columns (total 41 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Player   14092 non-null  object 
 1   Pos      14092 non-null  object 
 2   Age      14092 non-null  int64  
 3   Tm       14092 non-null  object 
 4   G        14092 non-null  int64  
 5   GS       14092 non-null  int64  
 6   MP       14092 non-null  float64
 7   FG       14092 non-null  float64
 8   FGA      14092 non-null  float64
 9   FG%      14042 non-null  float64
 10  3P       14092 non-null  float64
 11  3PA      14092 non-null  float64
 12  3P%      12050 non-null  float64
 13  2P       14092 non-null  float64
 14  2PA      14092 non-null  float64
 15  2P%      14008 non-null  float64
 16  eFG%     14042 non-null  float64
 17  FT       14092 non-null  float64
 18  FTA      14092 non-null  float64
 19  FT%      13630 non-null  float64
 20  ORB      14092 non-null  float64
 21  DRB      140

All the columns are the correct data type, but I see some missing values. Let me count those missing values to see if I can do anything to fix them.

In [5]:
pd.isnull(stats).sum()

Player        0
Pos           0
Age           0
Tm            0
G             0
GS            0
MP            0
FG            0
FGA           0
FG%          50
3P            0
3PA           0
3P%        2042
2P            0
2PA           0
2P%          84
eFG%         50
FT            0
FTA           0
FT%         462
ORB           0
DRB           0
TRB           0
AST           0
STL           0
BLK           0
TOV           0
PF            0
PTS           0
Year          0
Pts Won       0
Pts Max       0
Share         0
Team          0
W             0
L             0
W/L%          0
GB            0
PS/G          0
PA/G          0
SRS           0
dtype: int64

All the missing data is in columns with percentages. It makes sense that 3P% would be missing if there weren't any 3-point attempts, and the same is true for the other percentage columns. If this is what's going on in the data, I can change the missing values to 0.

In [6]:
stats[pd.isnull(stats["FG%"])][["Player", "FG"]].head()

Unnamed: 0,Player,FG
103,Adrian Caldwell,0.0
250,Guy Rucker,0.0
428,Gani Lawal,0.0
1961,Ronny Turiaf,0.0
2240,Lari Ketner,0.0


In [7]:
stats[pd.isnull(stats["3P%"])][["Player", "3PA"]].head()

Unnamed: 0,Player,3PA
2,Elden Campbell,0.0
3,Irving Thomas,0.0
18,Jack Haley,0.0
20,Keith Owens,0.0
30,Benoit Benjamin,0.0


In [8]:
stats[pd.isnull(stats["2P%"])][["Player", "2PA"]].head()

Unnamed: 0,Player,2PA
103,Adrian Caldwell,0.0
250,Guy Rucker,0.0
428,Gani Lawal,0.0
516,Anthony Brown,0.0
798,Josh McRoberts,0.0


In [9]:
stats[pd.isnull(stats["FT%"])][["Player", "FTA"]].head()

Unnamed: 0,Player,FTA
77,John Coker,0.0
92,Jason Sasser,0.0
103,Adrian Caldwell,0.0
119,Bruno Šundov,0.0
158,Jamal Robinson,0.0


It definitely looks like the percentage columns with missing values could be switched to 0. 
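The spot checks above only look at the first five rows for each column. A minimal sketch of an exhaustive check follows; the toy frame stands in for `stats`, and the `pct_to_attempts` mapping and the helper name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `stats`; the real frame uses the same column names.
toy = pd.DataFrame({
    "FGA": [6.6, 0.0, 2.4],
    "FG%": [0.476, np.nan, 0.455],
    "3PA": [0.0, 0.0, 1.2],
    "3P%": [np.nan, np.nan, 0.333],
})

# Assumed pairing of each percentage column with its attempts column.
pct_to_attempts = {"FG%": "FGA", "3P%": "3PA"}

def missing_only_without_attempts(df, mapping):
    """Return True if every missing percentage sits on a row with zero attempts."""
    for pct_col, att_col in mapping.items():
        missing = df[pct_col].isna()
        if not (df.loc[missing, att_col] == 0).all():
            return False
    return True

all_explained = missing_only_without_attempts(toy, pct_to_attempts)
print(all_explained)
```

On the real data the mapping would presumably also pair `2P%` with `2PA`, `eFG%` with `FGA`, and `FT%` with `FTA`.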

In [10]:
stats = stats.fillna(0)

In [11]:
stats.isnull().sum()

Player     0
Pos        0
Age        0
Tm         0
G          0
GS         0
MP         0
FG         0
FGA        0
FG%        0
3P         0
3PA        0
3P%        0
2P         0
2PA        0
2P%        0
eFG%       0
FT         0
FTA        0
FT%        0
ORB        0
DRB        0
TRB        0
AST        0
STL        0
BLK        0
TOV        0
PF         0
PTS        0
Year       0
Pts Won    0
Pts Max    0
Share      0
Team       0
W          0
L          0
W/L%       0
GB         0
PS/G       0
PA/G       0
SRS        0
dtype: int64

### Split data into train and test set

I want to split up the columns into the predictors and the target. Share will be the target I'm trying to predict, and all the predictor columns will be numeric. The Share column is calculated by dividing Pts Won by Pts Max, so those two columns also need to be left out of the predictors. Keeping them would leak the target into the model: it could reconstruct Share directly instead of learning from the player and team stats.
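To make the leakage concern concrete, here is a small sketch of how Share follows from the two point columns. The numbers are illustrative only and are not read from `all_stats.csv`; `vote_share` is a hypothetical helper.

```python
def vote_share(pts_won, pts_max):
    """Share of the maximum possible MVP points that a player received."""
    return pts_won / pts_max

# Illustrative numbers only, not taken from the dataset.
share = vote_share(pts_won=971, pts_max=1010)
print(round(share, 3))
```

Because Share is an exact function of Pts Won and Pts Max, a model that sees both columns has nothing left to learn from the other stats.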

Since this is a time series, I need to be careful not to use the future to predict the past. Since 2021 is the most recent year in the data, I'll use that as my test set and all the other years as the training set. I could also break each year apart and use 1991 to predict 1992, 1992 to predict 1993, etc. I think giving the model more data to train on will allow it to make a better prediction, so I'll start with that.

In [12]:
# get column names and create predictor list of numeric columns
stats.columns

Index(['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P',
       '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB',
       'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'Year',
       'Pts Won', 'Pts Max', 'Share', 'Team', 'W', 'L', 'W/L%', 'GB', 'PS/G',
       'PA/G', 'SRS'],
      dtype='object')

In [13]:
# Take out all the non-numeric columns
# I want to take out the Share (target), Pts Won, Pts Max columns
# Pts Won and Pts Max are used to calculate the Share column, so drop those too
predictors = ["Age", "G", "GS", "MP", "FG", "FGA", 
              'FG%', '3P', '3PA', '3P%', '2P', '2PA', 
              '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 
              'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 
              'PF', 'PTS', 'W', 'L', 'W/L%','GB', 'PS/G',
              'PA/G', 'SRS']

In [14]:
train = stats[~(stats["Year"] == 2021)]
test = stats[stats["Year"] == 2021]

## Modeling

### Baseline Model - Ridge Regression

I'll run a ridge regression model first to get a baseline to see how well the model fits. Ridge regression is a type of linear regression where the coefficients are shrunk toward zero by a penalty whose strength is set by alpha. The higher the alpha, the more the coefficients are shrunk. Scikit-learn has a way to see which alpha is best, so I'll use that.
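For reference, the objective that ridge regression minimizes can be written as follows, where $X$ is the predictor matrix, $y$ the target, $w$ the coefficient vector, and $\alpha \ge 0$ the penalty strength (the intercept is not penalized):

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$

With $\alpha = 0$ this reduces to ordinary least squares; larger values of $\alpha$ pull the coefficients closer to zero.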

#### Use RidgeCV to choose the best alpha

In [15]:
# split the train set into predictor and target columns
train_X = train[predictors]
train_y = train['Share']

In [16]:
# use standard scaler
scaler = StandardScaler()
X_std = scaler.fit_transform(train_X)

In [17]:
# give ridgecv a few options to try
regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0, 50.0])

In [18]:
#fit the linear regression
model_cv = regr_cv.fit(X_std, train_y)

In [19]:
model_cv.alpha_

10.0

#### Ridge Regression Model

In [20]:
# split the test set into predictor and target columns for checking our model
test_X = test[predictors]
test_y = test['Share']

According to RidgeCV the best alpha is 10, so I'll use that.

In [21]:
reg = Ridge(alpha = 10)
#reg = Ridge(alpha = .10)

In [22]:
# fit the training data to the model
# note: this fits on the unscaled train_X, while RidgeCV chose alpha on the standardized X_std
reg.fit(train_X, train_y);

In [23]:
# take a look at the predictions after using the model on the test set
# index = test.index keeps the index from the test set and gives the predictions for those indices
# making it a dataframe makes it print out nicer
predictions = reg.predict(test_X)
predictions = pd.DataFrame(predictions, columns=["predictions"], index=test.index)
predictions

Unnamed: 0,predictions
630,0.016321
631,-0.014708
632,0.002248
633,-0.005017
634,0.012725
...,...
13897,-0.012385
13898,-0.009385
13899,0.017973
13900,-0.015343


In [24]:
# add the predictions from the last cell to the test set (only player and share columns)
# to see how closely the predictions are to the actual Share value
combination = pd.concat([test[["Player", "Share"]], predictions], axis=1)
combination.head(2)

Unnamed: 0,Player,Share,predictions
630,Aaron Gordon,0.0,0.016321
631,Austin Rivers,0.0,-0.014708


In [25]:
# sort the combination dataframe to see the 20 players who got the most votes for MVP
# and compare that with the predictions the model gave
combination.sort_values("Share", ascending=False).head(20)

Unnamed: 0,Player,Share,predictions
641,Nikola Jokić,0.961,0.154064
8624,Joel Embiid,0.58,0.163082
3651,Stephen Curry,0.449,0.147108
9907,Giannis Antetokounmpo,0.345,0.20649
1389,Chris Paul,0.138,0.073669
10997,Luka Dončić,0.042,0.152907
7464,Damian Lillard,0.038,0.120208
3536,Julius Randle,0.02,0.090696
3531,Derrick Rose,0.01,0.036226
11358,Rudy Gobert,0.008,0.096022


There are some predictions that are in the same ballpark, but for the most part it doesn't seem like the model was that accurate. I'll look at the MSE for the predictions to see how close they were overall. The lower the MSE, the closer the predictions, since it is the average of the squared differences between the actual and predicted values.

In [26]:
mean_squared_error(combination["Share"], combination["predictions"])

0.0026954439980941447

Maybe the MSE isn't the right way to judge how good the predictions were. I don't need to know how close the predicted values are to the actual values. What I really want to know is which players got the biggest shares (I'll say top 5) of MVP votes. My predictions don't need to be close to the actual values; they only need to put the same 5 players at the top. Most players have an actual Share of 0 and a predicted value close to 0, so those rows dominate the average and pull the MSE down. So although 0.0027 (ish) _seems_ like it means the model has good accuracy, it's not really telling me how well the model picks the top 5. I'll take a look to see how many zeros I have in the actual Share values just to make sure my thinking is correct.

In [27]:
combination["Share"].value_counts()

0.000    525
0.001      3
0.961      1
0.138      1
0.010      1
0.020      1
0.449      1
0.005      1
0.038      1
0.003      1
0.580      1
0.345      1
0.042      1
0.008      1
Name: Share, dtype: int64
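The counts above are enough to work out what MSE a trivial predictor would get. The sketch below scores a hypothetical "model" that predicts a Share of 0 for every player, using only the values printed in the output above.

```python
# Share values and their counts, copied from the value_counts() output above.
share_counts = {
    0.000: 525, 0.001: 3, 0.961: 1, 0.138: 1, 0.010: 1, 0.020: 1, 0.449: 1,
    0.005: 1, 0.038: 1, 0.003: 1, 0.580: 1, 0.345: 1, 0.042: 1, 0.008: 1,
}

n_players = sum(share_counts.values())

# MSE of a predictor that outputs 0 for every player:
# each squared error is just the squared actual share.
baseline_mse = sum(count * share ** 2 for share, count in share_counts.items()) / n_players

print(n_players, round(baseline_mse, 4))
```

The baseline comes out at roughly 0.003, in the same range as the ridge model's 0.0027, even though predicting 0 for everyone says nothing about who finishes in the top 5.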

The data for 2021 (the test data) has 540 players and 525 of those players received 0 MVP votes. That's definitely going to mess with the MSE value. I need to come up with a better way of determining how accurate my model is. Since I'm interested in the rankings, I'll add a couple of columns to my data to show the actual ranking and the predicted ranking.

In [28]:
# add a column for rank based on the actual shares
combination = combination.sort_values("Share", ascending=False)
combination["Rank"] = list(range(1, combination.shape[0]+1))


In [29]:
# add another column based on the predicted shares
combination = combination.sort_values("predictions", ascending=False)
combination["Predicted_Rank"] = list(range(1,combination.shape[0]+1))

In [30]:
combination[['Player', 'Rank', 'Predicted_Rank']].head()

Unnamed: 0,Player,Rank,Predicted_Rank
9907,Giannis Antetokounmpo,4,1
8624,Joel Embiid,2,2
641,Nikola Jokić,1,3
10997,Luka Dončić,6,4
3736,LeBron James,15,5


Now I see that out of the top 5 players my model predicted, 3 of those were actually in the top 5. One of the 2 that I missed was number 6 and the other was 15, so I got close with 4 of my 5 predictions. That's not too shabby, but I think I can do better.

Although I'm using regression on the data, evaluating the accuracy is more like classification: a player's share of MVP votes either puts him in the top 5 or it doesn't. It doesn't matter if the predicted top player is the actual top player, only that the player is in the top 5 of the actual votes.

In [31]:
# function to find the average precision
# checks to see if the predictions include the top 5 actual players
# if not, then how far down the line of predicted rank is that player

def find_ap(combination):
    
    # only want the top 5 actual players
    # by changing 5 to another number, we can look at the accuracy of picking the top 10, 20, etc.
    actual = combination.sort_values("Share", ascending=False).head(5)
    # need to use all the predicted rankings
    predicted = combination.sort_values("predictions", ascending=False)
    
    ps = []
    found = 0
    seen = 1
    
    # loop through predictions and see if that player is in the actual top 5
    # if so, then add 1 to my accuracy score
    # if not, then penalize myself for how far down my prediction is
    for index,row in predicted.iterrows():
        if row["Player"] in actual["Player"].values:
            found += 1
            ps.append(found / seen)
        seen += 1
     
    # see the accuracy scores for the top 5 predictions
    #print('individual player accuracy score', ps, '\nsum of accuracy score', sum(ps), '\nlength of accuracy score',len(ps))

    return sum(ps) / len(ps)

In [32]:
# find the ap for the predicted values
# value is between 0 and 1 with 1 being the best
ap = find_ap(combination)
ap

0.7636363636363636

According to the average precision, I'm about 76% accurate at picking the 5 players who received the most MVP votes. I changed the numbers (top 10, top 3, etc.) and found that I was most accurate at picking the top 6. It's not very useful to know, but it's interesting all the same.
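To show how the score behaves, here is a stripped-down sketch of the same calculation that takes only the predicted ranks at which the actual top 5 players appear. The helper name is hypothetical. In the second example the first three positions match the ranking table above (predicted ranks 1 to 3); positions 6 and 33 are assumptions, picked because they give a value matching the one above.

```python
def average_precision(hit_positions):
    """Average precision given the 1-based predicted ranks of the relevant players."""
    precisions = []
    for found, seen in enumerate(sorted(hit_positions), start=1):
        # precision at this hit = relevant players found so far / predictions seen so far
        precisions.append(found / seen)
    return sum(precisions) / len(precisions)

# All five actual top players are also the five highest predictions.
perfect = average_precision([1, 2, 3, 4, 5])

# Three hits at the top, two further down the list (last two positions are assumed).
partial = average_precision([1, 2, 3, 6, 33])

print(perfect, round(partial, 4))
```

A hit that shows up late in the predicted ranking adds a small fraction to the sum, which is how the metric penalizes a miss without ignoring it.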

### Backtesting

Now that I have a better way of judging the accuracy of my model, I'll go back and test it on earlier years. I'll set up a pipeline that splits the data by year, uses the model to come up with predictions, and then checks the accuracy. I'll keep at least the first 5 years as the training set and make predictions for 1996 through 2021. This will help me be more confident that my model is fairly accurate.
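The splitting scheme described here is an expanding window: each season is predicted from all the seasons before it. A minimal sketch of just the year bookkeeping, with a hypothetical helper name and no model involved:

```python
def expanding_window_splits(years, min_train_years=5):
    """Yield (train_years, test_year) pairs where training only uses earlier seasons."""
    ordered = sorted(years)
    for i in range(min_train_years, len(ordered)):
        yield ordered[:i], ordered[i]

splits = list(expanding_window_splits(range(1991, 2022)))
first_train, first_test = splits[0]
last_train, last_test = splits[-1]

print(first_train[0], first_train[-1], first_test)
print(last_train[0], last_train[-1], last_test)
print(len(splits))
```

The first split trains on 1991 through 1995 and tests on 1996; the last trains on 1991 through 2020 and tests on 2021, for 26 test seasons in total.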

In [33]:
years = list(range(1991,2022))

In [34]:
# this pipeline will give me a list of dataframes for the precision for every year (1996 to 2021)

aps = []
all_predictions = []

for year in years[5:]:
    train = stats[stats["Year"] < year]
    test = stats[stats["Year"] == year]
    reg.fit(train[predictors],train["Share"])
    predictions = reg.predict(test[predictors])
    predictions = pd.DataFrame(predictions, columns=["predictions"], index=test.index)
    combination = pd.concat([test[["Player", "Share"]], predictions], axis=1)
    all_predictions.append(combination)
    aps.append(find_ap(combination))

In [35]:
#Now to find the mean average precision across all years

sum(aps) / len(aps)

0.7212353680599484

In [36]:
# add the rank, predicted rank, and difference between them into our dataframe to take a closer look
def add_ranks(predictions):
    predictions = predictions.sort_values("predictions", ascending=False)
    predictions["Predicted_Rank"] = list(range(1,predictions.shape[0]+1))
    predictions = predictions.sort_values("Share", ascending=False)
    predictions["Rank"] = list(range(1,predictions.shape[0]+1))
    predictions["Diff"] = (predictions["Rank"] - predictions["Predicted_Rank"])
    return predictions
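As a possible alternative, the same three columns can be built with `Series.rank` instead of sorting twice. This is a sketch on a toy frame, not the function used in the rest of the notebook. With `method="first"`, tied values (such as the many players with a Share of 0) are ranked in row order, so the result does not depend on how the sort handles ties.

```python
import pandas as pd

# Toy stand-in for one season's `combination` frame.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Share": [0.0, 0.9, 0.0, 0.3],
    "predictions": [0.10, 0.20, 0.05, 0.15],
})

def add_ranks_with_rank(df):
    """Variant of add_ranks built on Series.rank instead of sorting twice."""
    out = df.copy()
    # method="first" breaks ties by row order, giving distinct, reproducible ranks
    out["Rank"] = out["Share"].rank(ascending=False, method="first").astype(int)
    out["Predicted_Rank"] = out["predictions"].rank(ascending=False, method="first").astype(int)
    out["Diff"] = out["Rank"] - out["Predicted_Rank"]
    return out.sort_values("Rank")

ranked = add_ranks_with_rank(toy)
print(ranked)
```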

In [37]:
# take a look at 1996 (all_predictions[0])
# 2021 would be all_predictions[25]
add_ranks(all_predictions[0])

Unnamed: 0,Player,Share,predictions,Predicted_Rank,Rank,Diff
10510,Michael Jordan,0.986,0.173589,5,1,-4
9830,David Robinson,0.508,0.209406,1,2,1
7350,Anfernee Hardaway,0.319,0.095968,12,3,-9
4952,Hakeem Olajuwon,0.211,0.203710,3,4,1
10513,Scottie Pippen,0.200,0.066416,19,5,-14
...,...,...,...,...,...,...
9335,Jayson Williams,0.000,0.009531,150,424,274
2035,Travis Best,0.000,0.009532,149,425,276
8235,Charlie Ward,0.000,0.009866,148,426,278
3105,Matt Bullard,0.000,0.010011,147,427,280


In [38]:
# put earlier steps together to create a backtesting function
# change reg to model so I can use this for all models

def backtest(stats, model, years, predictors):
    
    aps = []
    all_predictions = []
    
    for year in years:
        train = stats[stats["Year"] < year]
        test = stats[stats["Year"] == year]
        model.fit(train[predictors],train["Share"])
        predictions = model.predict(test[predictors])
        predictions = pd.DataFrame(predictions, columns=["predictions"], index=test.index)
        combination = pd.concat([test[["Player", "Share"]], predictions], axis=1)
        combination = add_ranks(combination)
        all_predictions.append(combination)
        aps.append(find_ap(combination))
        
    return sum(aps) / len(aps), aps, pd.concat(all_predictions)

In [39]:
mean_ap, aps, all_predictions = backtest(stats, reg, years[5:], predictors)

In [40]:
mean_ap

0.7212353680599484

In [41]:
# Take a look at the most under-predicted players among the actual top 4 (Rank < 5); the most negative Diff comes first
all_predictions[all_predictions["Rank"] < 5].sort_values("Diff").head(10)

Unnamed: 0,Player,Share,predictions,Predicted_Rank,Rank,Diff
1224,Jason Kidd,0.712,0.029001,54,2,-52
5175,Steve Nash,0.839,0.031453,47,1,-46
5193,Steve Nash,0.739,0.05185,36,1,-35
12726,Joakim Noah,0.258,0.04801,35,4,-31
8516,Peja Stojaković,0.228,0.042474,29,4,-25
5208,Steve Nash,0.785,0.070427,23,2,-21
4682,Tim Hardaway,0.207,0.062507,20,4,-16
937,Allen Iverson,0.27,0.073182,14,4,-10
7350,Anfernee Hardaway,0.319,0.095968,12,3,-9
6746,Kobe Bryant,0.291,0.078992,13,4,-9


I may need to take a closer look at which stats differ between the players where my predictions were way off and the players where my predictions were close or exact. But first, I'll look at the coefficients for each column to get an idea of which variables play a bigger part in my model. This will help me figure out why predictions for players like Jason Kidd and Steve Nash are so far off.

In [42]:
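# this is a list of the variables and their coefficients, sorted from the largest coefficient to the smallest
# (the predictors are not standardized here, so coefficient size is only a rough guide to importance)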
stats_coeff = pd.concat([pd.Series(reg.coef_).rename('Coefficient'), pd.Series(predictors).rename('Statistics')], axis=1).sort_values('Coefficient', ascending=False)
stats_coeff

Unnamed: 0,Coefficient,Statistics
18,0.021338,DRB
10,0.013411,2P
21,0.011832,STL
15,0.011578,FTA
22,0.011238,BLK
17,0.008362,ORB
4,0.007892,FG
20,0.007297,AST
25,0.006299,PTS
28,0.005025,W/L%


I'll try adding some ratios to my dataframe to see if that gives better predictions.
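The idea is to express each stat relative to the league average for that season, so that a value of 1.0 means "average for that year". A minimal sketch on a toy frame using `groupby(...).transform`, which is a swapped-in call for illustration and keeps the original row index:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [1991, 1991, 1992, 1992],
    "PTS": [10.0, 30.0, 5.0, 15.0],
})

# Divide each player's points by the mean for that season.
toy["PTS_R"] = toy.groupby("Year")["PTS"].transform(lambda s: s / s.mean())
print(toy)
```

Here the 1991 mean is 20 and the 1992 mean is 10, so both seasons end up with ratios of 0.5 and 1.5 even though the raw point totals differ.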

In [43]:
stat_ratios = stats[["PTS", "AST", "STL", "BLK", "3P", "Year"]].groupby("Year").apply(lambda x: x/x.mean())
stat_ratios

Unnamed: 0,PTS,AST,STL,BLK,3P,Year
0,1.013334,0.420714,0.961127,0.673469,0.508587,1.0
1,1.614653,1.028412,1.647646,0.673469,4.577279,1.0
2,0.311795,0.093492,0.274608,1.571429,0.000000,1.0
3,0.200440,0.186984,0.274608,0.000000,0.000000,1.0
4,2.383005,1.636110,1.784950,0.897959,1.525760,1.0
...,...,...,...,...,...,...
14087,0.735752,0.819562,0.479763,1.528302,0.650951,1.0
14088,0.071202,0.000000,0.000000,0.000000,0.130190,1.0
14089,1.281633,0.601012,1.119447,2.547170,0.520761,1.0
14090,0.474679,0.218550,0.319842,1.273585,0.650951,1.0


In [44]:
stats[["PTS_R", "AST_R", "STL_R", "BLK_R", "3P_R"]] = stat_ratios[["PTS", "AST", "STL", "BLK", "3P"]]

In [45]:
stats.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,W/L%,GB,PS/G,PA/G,SRS,PTS_R,AST_R,STL_R,BLK_R,3P_R
0,A.C. Green,PF,27,LAL,82,21,26.4,3.1,6.6,0.476,...,0.707,5.0,106.3,99.6,6.73,1.013334,0.420714,0.961127,0.673469,0.508587
1,Byron Scott,SG,29,LAL,82,82,32.1,6.1,12.8,0.477,...,0.707,5.0,106.3,99.6,6.73,1.614653,1.028412,1.647646,0.673469,4.577279
2,Elden Campbell,PF,22,LAL,52,0,7.3,1.1,2.4,0.455,...,0.707,5.0,106.3,99.6,6.73,0.311795,0.093492,0.274608,1.571429,0.0
3,Irving Thomas,PF,25,LAL,26,0,4.2,0.7,1.9,0.34,...,0.707,5.0,106.3,99.6,6.73,0.20044,0.186984,0.274608,0.0,0.0
4,James Worthy,SF,29,LAL,78,74,38.6,9.2,18.7,0.492,...,0.707,5.0,106.3,99.6,6.73,2.383005,1.63611,1.78495,0.897959,1.52576


In [46]:
new_predictors = predictors + ["PTS_R", "AST_R", "STL_R", "BLK_R", "3P_R"]

In [47]:
mean_ap, aps, all_predictions = backtest(stats, reg, years[5:], new_predictors)

In [48]:
mean_ap

0.7279975945481169

Adding the ratios improved my model, but only very slightly. Let me change a couple of categorical variables to numeric variables to see if that improves anything. 

### Random Forest Regressor

I'll change Pos (position) and Tm (team) into numerical values. Encoding them as integer codes would cause issues with ridge regression, since a linear model treats the codes as ordered quantities even though there's no particular order to them. I need to change models before I can roll these two columns into the predictors. I'll try random forest first since that's another fairly simple model to work with. A random forest builds a collection of decision trees and averages the predictions from those trees.
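To see why the integer codes carry no real order, here is a small sketch on a toy position column. It also shows one-hot encoding with `pd.get_dummies`, which is one possible alternative for linear models and is not what this notebook does.

```python
import pandas as pd

pos = pd.Series(["PG", "C", "SF", "C"], name="Pos")

# Integer codes follow the alphabetical order of the categories (C=0, PG=1, SF=2),
# so the numbers imply an ordering that has no basketball meaning.
codes = pos.astype("category").cat.codes

# One-hot encoding gives each position its own indicator column instead.
one_hot = pd.get_dummies(pos, prefix="Pos")

print(codes.tolist())
print(list(one_hot.columns))
```

Tree-based models only split on thresholds, so they can cope with arbitrary codes better than a linear model, which would read "SF = 2" as twice "PG = 1".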

In [49]:
stats['NPos'] = stats['Pos'].astype('category').cat.codes
stats['NTm'] = stats['Tm'].astype('category').cat.codes

In [50]:
more_new_predictors = new_predictors + ['NPos', 'NTm']

In [51]:
rf = RandomForestRegressor(n_estimators = 50, random_state = 42, min_samples_split = 5)

In [53]:
# I want to run this quickly to see how it compares to ridge regression
# so I'll set the year starting at 2019 (years[28]) then run the ridge regression for the same years
mean_ap, aps, all_predictions = backtest(stats, rf, years[28:], more_new_predictors)

In [54]:
# check average precision for random forest
mean_ap

0.7952177452177452

In [55]:
# run ridge regression for years 2019 through 2021 to compare with random forest
mean_ap, aps, all_predictions = backtest(stats, reg, years[28:], predictors)

In [56]:
# check average precision for ridge regression
mean_ap

0.7813564213564214

#### Fine tuning Random Forest

So I can see the random forest model gives slightly better accuracy. First I'll use GridSearchCV to choose the best value for n_estimators and run it using only the years 2019 through 2021 to make sure the accuracy gets better. Finally, I'll add all the years to see if that has any effect on accuracy.
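A rough sketch of what a grid search over `n_estimators` could look like with scikit-learn's `GridSearchCV`. The data is synthetic, the candidate values are hypothetical, and the search scores by MSE rather than the average precision used in this notebook, so treat it as a starting point only. `TimeSeriesSplit` is used so that each fold validates on rows that come after its training rows, which assumes the rows are in time order.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic data standing in for the predictors and Share; rows assumed to be in time order.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = X[:, 0] * 0.5 + rng.normal(scale=0.1, size=120)

param_grid = {"n_estimators": [10, 25, 50]}   # hypothetical candidates
search = GridSearchCV(
    RandomForestRegressor(random_state=42, min_samples_split=5),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),           # validation rows always follow training rows
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

To tune against average precision instead, one option would be to loop over the candidate values and call `backtest` for each one, keeping the value with the highest mean average precision.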

In [98]:
rf = RandomForestRegressor(n_estimators = 200, random_state = 42, min_samples_split = 5)

In [99]:
# I want to run this quickly to see how it compares to ridge regression
# I'll increase the number of years after finding the right estimators
mean_ap, aps, all_predictions = backtest(stats, rf, years[10:], more_new_predictors)

In [100]:
# check average precision for random forest
mean_ap

0.7360929150214865

When n_estimators was set above 400, the accuracy started going back down, so I'll leave it there and widen the range of years to give me a better prediction. When I used 1996 to 2021, my accuracy declined to about 72%. I'll use fewer years (starting at 2000) and change n_estimators again to see what happens.