This script tests the linear regression class that I created. It follows the same sequence as the dataTraining script, but this time it predicts only points per game (ppg).
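The class itself is not shown in this script. As a rough illustration of what a call like batch_backpropagation(X, Y, epochs, learning_rate) could be doing, here is a minimal sketch of full-batch gradient descent for linear regression, assuming a mean squared error loss. The function name batch_gradient_descent, its signature, and the toy data are hypothetical and are not taken from linear_regression_class.

```python
import numpy as np

def batch_gradient_descent(X, y, epochs, learning_rate):
    """Fit y ~ X @ weights + bias by full-batch gradient descent on MSE.

    Hypothetical stand-in for the custom class; not the author's code.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    n_samples, n_features = X.shape
    weights = np.zeros((n_features, 1))
    bias = 0.0
    for _ in range(epochs):
        # Residuals over the whole training set, shape (n_samples, 1)
        error = X @ weights + bias - y
        # Gradients of the mean squared error w.r.t. weights and bias
        grad_w = (2.0 / n_samples) * (X.T @ error)
        grad_b = (2.0 / n_samples) * error.sum()
        weights -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weights, bias

# Toy data (made up for illustration): y = 3*x + 2 with no noise
X_toy = np.linspace(0, 1, 50).reshape(-1, 1)
y_toy = 3 * X_toy[:, 0] + 2
w, b = batch_gradient_descent(X_toy, y_toy, epochs=5000, learning_rate=0.1)
print(w.ravel(), b)
```

With unscaled features such as games played or minutes, a much smaller learning rate is typically needed for the updates to stay stable, which is consistent with the small value (5*10**(-5)) used in the training cells here.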

In [192]:
# Import libraries
from linear_regression_class import linear_regression as lin
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd


In [193]:
# Filter data and drop third year columns
df = pd.read_csv('NBA_stats_3_years.csv')
outliers = ((df['FG%_2-3'] > 90) |
            (df['FG%_2-3'] == 0) |
            (df['3P%_2-3'] > 90) |
            (df['3P%_2-3'] == 0) |
            (df['FT%_2-3'] == 0) |
            (df['FG%_1-2'] > 90) |
            (df['FG%_1-2'] == 0) |
            (df['3P%_1-2'] > 90) |
            (df['3P%_1-2'] == 0) |
            (df['FT%_1-2'] == 0) |
            (df['FG%_0-1'] > 90) |
            (df['FG%_0-1'] == 0) |
            (df['3P%_0-1'] > 90) |
            (df['3P%_0-1'] == 0) |
            (df['FT%_0-1'] == 0))

df = df[~outliers]
df1 = df
df1 = df1.drop(['GP_2-3', 'MIN_2-3', 'FGM_2-3', 'FGA_2-3', 'FG%_2-3', '3PM_2-3', '3PA_2-3',
'3P%_2-3', 'FTM_2-3', 'FTA_2-3', 'FT%_2-3', 'OREB_2-3', 'DREB_2-3', 'REB_2-3', 'AST_2-3',
'STL_2-3', 'BLK_2-3', 'TOV_2-3', 'EFF_2-3'], axis = 1)

df1.head()


Unnamed: 0,PTS_2-3,GP_1-2,MIN_1-2,PTS_1-2,FGM_1-2,FGA_1-2,FG%_1-2,3PM_1-2,3PA_1-2,3P%_1-2,...,FTA_0-1,FT%_0-1,OREB_0-1,DREB_0-1,REB_0-1,AST_0-1,STL_0-1,BLK_0-1,TOV_0-1,EFF_0-1
0,13.9,68,30.2,16.3,6.3,11.2,56.4,0.9,2.5,34.7,...,3.1,74.3,1.7,4.2,5.9,2.5,0.6,0.6,1.8,16.7
1,6.6,63,13.4,3.9,1.5,3.5,41.8,0.6,1.4,40.9,...,1.1,86.8,0.4,1.6,1.9,2.4,0.7,0.1,1.1,7.3
2,8.6,63,30.5,9.8,3.6,7.6,47.6,2.3,5.2,44.6,...,1.4,84.2,1.6,6.1,7.7,3.4,0.7,1.3,0.9,17.7
3,21.1,75,28.9,14.8,5.9,10.7,55.3,0.3,0.8,33.3,...,3.2,71.1,1.9,3.5,5.5,2.6,0.8,0.9,2.0,12.6
5,25.9,79,36.0,24.6,8.9,19.5,45.9,2.7,7.3,36.9,...,3.9,78.6,0.9,3.9,4.8,3.8,1.5,0.6,2.6,18.8


In [213]:
# Create model
my_model = lin()
X_train, X_test, Y_train, Y_test = train_test_split(df1.drop(columns = 'PTS_2-3'), df1['PTS_2-3'], test_size=0.2, random_state=370)

# Train model
my_model.batch_backpropagation(X_train, Y_train, 30000, 5*10**(-5))

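# Predict on the test set: X_test @ weights gives a one-column DataFrame
y_pred = np.matmul(X_test, my_model.weights)
# Note: iterating over a DataFrame yields its column labels, so this expression
# evaluates to [0 + bias] and only displays the bias; y_pred itself is unchanged.
[x+my_model.bias for x in y_pred]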

[0.08782623489329892]

In [214]:
print(f"R2 Score: {my_model.r2_score(X_test, Y_test)}")
for i in range(len(Y_test)):
    print(f"actual ppg: {Y_test.iloc[i]} vs predicted ppg: {y_pred.iloc[i,0]}")

R2 Score: 0.8034562208244757
actual ppg: 12.2 vs predicted ppg: 11.6418192092029
actual ppg: 12.6 vs predicted ppg: 14.394096796018017
actual ppg: 9.8 vs predicted ppg: 11.601029696273429
actual ppg: 5.5 vs predicted ppg: 10.012915740421123
actual ppg: 5.3 vs predicted ppg: 4.165899529873137
actual ppg: 10.6 vs predicted ppg: 7.515780135014149
actual ppg: 11.2 vs predicted ppg: 14.778276255064805
actual ppg: 27.0 vs predicted ppg: 24.436198283642995
actual ppg: 19.6 vs predicted ppg: 21.63445170033102
actual ppg: 28.3 vs predicted ppg: 25.858022647835895
actual ppg: 12.1 vs predicted ppg: 9.65384359421702
actual ppg: 7.6 vs predicted ppg: 12.972370413402029
actual ppg: 27.4 vs predicted ppg: 25.314732822058275
actual ppg: 15.0 vs predicted ppg: 16.60242647138774
actual ppg: 18.0 vs predicted ppg: 9.393907125656487
actual ppg: 11.0 vs predicted ppg: 9.15865841043965
actual ppg: 13.7 vs predicted ppg: 11.019999066080159
actual ppg: 11.0 vs predicted ppg: 13.465371890782865
actual ppg: 19

The final R2 score is about 0.80. Trying different learning rates and epoch counts yields a similar number, which suggests that the model is limited by underfitting (high bias) rather than by insufficient training. This may be counteracted by using polynomial regression, adding more features, or using a more complex model.
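To illustrate how polynomial terms can reduce underfitting, here is a minimal sketch on synthetic data (not the NBA data set). It fits an ordinary least-squares line and a quadratic to a curved relationship and compares their R2 scores. The helper names r2_score, fit_least_squares, and predict are hypothetical, and the closed-form least-squares fit is used instead of the custom gradient descent class to keep the example short.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_least_squares(features, y):
    """Solve for coefficients (last entry is the intercept)."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(features, coef):
    design = np.column_stack([features, np.ones(len(features))])
    return design @ coef

# Synthetic curved data (assumption for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x ** 2 + rng.normal(scale=0.5, size=200)

linear_features = x.reshape(-1, 1)            # [x]
poly_features = np.column_stack([x, x ** 2])  # [x, x^2]

r2_linear = r2_score(y, predict(linear_features, fit_least_squares(linear_features, y)))
r2_poly = r2_score(y, predict(poly_features, fit_least_squares(poly_features, y)))
print(f"linear R2: {r2_linear:.3f}, quadratic R2: {r2_poly:.3f}")
```

Whether squared or interaction terms would help on the NBA features is an open question; the sketch only shows the mechanism, and with many columns the expanded feature set grows quickly.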

An R2 score at this level is not unexpected, as many important features and statistics are missing. Players get injured, have personal issues, face minute restrictions, etc. Teams may increase or reduce a player's minutes from year to year, or completely change their rosters, leading to large swings in player performance.

Statistics that cover day-to-day matchups, player-to-player matchups, injury proneness, and more would be necessary to accurately predict statistics in a new season.

However, perhaps we can marginally improve the accuracy of the predicted ppg by giving the model some features from the third season.

Let's see how the model performs when it has the minutes statistic for the third year:

In [196]:
# Same data but do not remove third year minutes
df2 = df
df2 = df2.drop(['GP_2-3', 'FGM_2-3', 'FGA_2-3', 'FG%_2-3', '3PM_2-3', '3PA_2-3',
'3P%_2-3', 'FTM_2-3', 'FTA_2-3', 'FT%_2-3', 'OREB_2-3', 'DREB_2-3', 'REB_2-3', 'AST_2-3',
'STL_2-3', 'BLK_2-3', 'TOV_2-3', 'EFF_2-3'], axis = 1)

df2.head()

Unnamed: 0,MIN_2-3,PTS_2-3,GP_1-2,MIN_1-2,PTS_1-2,FGM_1-2,FGA_1-2,FG%_1-2,3PM_1-2,3PA_1-2,...,FTA_0-1,FT%_0-1,OREB_0-1,DREB_0-1,REB_0-1,AST_0-1,STL_0-1,BLK_0-1,TOV_0-1,EFF_0-1
0,31.5,13.9,68,30.2,16.3,6.3,11.2,56.4,0.9,2.5,...,3.1,74.3,1.7,4.2,5.9,2.5,0.6,0.6,1.8,16.7
1,16.3,6.6,63,13.4,3.9,1.5,3.5,41.8,0.6,1.4,...,1.1,86.8,0.4,1.6,1.9,2.4,0.7,0.1,1.1,7.3
2,26.8,8.6,63,30.5,9.8,3.6,7.6,47.6,2.3,5.2,...,1.4,84.2,1.6,6.1,7.7,3.4,0.7,1.3,0.9,17.7
3,32.5,21.1,75,28.9,14.8,5.9,10.7,55.3,0.3,0.8,...,3.2,71.1,1.9,3.5,5.5,2.6,0.8,0.9,2.0,12.6
5,35.1,25.9,79,36.0,24.6,8.9,19.5,45.9,2.7,7.3,...,3.9,78.6,0.9,3.9,4.8,3.8,1.5,0.6,2.6,18.8


In [211]:
# Create new model
my_model1 = lin()
X_train1, X_test1, Y_train1, Y_test1 = train_test_split(df2.drop(columns = 'PTS_2-3'), df2['PTS_2-3'], test_size=0.2, random_state=370)

# Train model
my_model1.batch_backpropagation(X_train1, Y_train1, 30000, 5*10**(-5))

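# Predict on the test set: X_test1 @ weights gives a one-column DataFrame
y_pred1 = np.matmul(X_test1, my_model1.weights)
# Note: iterating over a DataFrame yields its column labels, so this expression
# evaluates to [0 + bias] and only displays the bias; y_pred1 itself is unchanged.
[x+my_model1.bias for x in y_pred1]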

[0.2963328220707824]

In [212]:
print(f"R2 Score: {my_model1.r2_score(X_test1, Y_test1)}")
for i in range(len(Y_test1)):
    print(f"actual ppg: {Y_test1.iloc[i]} vs predicted ppg: {y_pred1.iloc[i,0]}")

R2 Score: 0.9061159165052037
actual ppg: 12.2 vs predicted ppg: 14.294593131528442
actual ppg: 12.6 vs predicted ppg: 12.486116846197717
actual ppg: 9.8 vs predicted ppg: 11.403060518183601
actual ppg: 5.5 vs predicted ppg: 5.8940606638535105
actual ppg: 5.3 vs predicted ppg: 5.158351685617976
actual ppg: 10.6 vs predicted ppg: 12.406404098380115
actual ppg: 11.2 vs predicted ppg: 13.870572716207649
actual ppg: 27.0 vs predicted ppg: 24.921639296929133
actual ppg: 19.6 vs predicted ppg: 20.41197008256725
actual ppg: 28.3 vs predicted ppg: 27.147855899650935
actual ppg: 12.1 vs predicted ppg: 9.290240542693574
actual ppg: 7.6 vs predicted ppg: 9.092752810838258
actual ppg: 27.4 vs predicted ppg: 24.956726712400364
actual ppg: 15.0 vs predicted ppg: 15.76934319814688
actual ppg: 18.0 vs predicted ppg: 15.280582532832117
actual ppg: 11.0 vs predicted ppg: 11.88865577132141
actual ppg: 13.7 vs predicted ppg: 10.811324327755957
actual ppg: 11.0 vs predicted ppg: 11.053794318949336
actual pp

The R2 score increases by about 0.10 (from roughly 0.80 to 0.91)! From this we can see the clear impact of minutes played on ppg in a season. Third-year minutes are not known before the season starts, so for predictive purposes it may be worth creating a separate linear regression model to predict a player's minutes per game from other statistics.
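One way such a two-stage setup could look is sketched below, under stated assumptions. The data is synthetic and the variable names (prev_min, prev_pts, next_min, next_pts) and helpers are hypothetical. Stage 1 predicts next-season minutes from prior-season statistics; stage 2 is trained with observed minutes but, at prediction time, receives the stage 1 estimate instead.

```python
import numpy as np

# Synthetic stand-ins for player statistics (NOT the NBA data set)
rng = np.random.default_rng(1)
n = 300
prev_min = rng.uniform(10, 36, size=n)                       # prior-season minutes
prev_pts = 0.5 * prev_min + rng.normal(scale=2.0, size=n)    # prior-season ppg
next_min = 0.8 * prev_min + 4 + rng.normal(scale=3.0, size=n)
next_pts = 0.45 * next_min + 0.2 * prev_pts + rng.normal(scale=1.5, size=n)

def fit(features, target):
    """Least-squares fit; the last coefficient is the intercept."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def predict(features, coef):
    return np.column_stack([features, np.ones(len(features))]) @ coef

train, test = slice(0, 240), slice(240, n)
prior = np.column_stack([prev_min, prev_pts])

# Stage 1: estimate next-season minutes from prior-season statistics
min_coef = fit(prior[train], next_min[train])
pred_min_test = predict(prior[test], min_coef)

# Stage 2: train on observed minutes, predict using the estimated minutes
pts_coef = fit(np.column_stack([prior[train], next_min[train]]), next_pts[train])
pred_pts_test = predict(np.column_stack([prior[test], pred_min_test]), pts_coef)
print(pred_pts_test[:5])
```

Because the stage 1 estimate carries its own error, the combined pipeline should be expected to score below the 0.91 obtained with the true third-year minutes; how far below would have to be measured on the real data.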