## NBA Player Points Prediction example

This is a step-by-step guide on how to generate probabilities for the number of points an NBA player will score in a game, using machine learning and various feature engineering approaches.

The primary intention is to give an introduction, with some ideas and inspiration on how to think about modelling NBA player points. This section, however, uses default hyperparameters, so do not expect it to generate predictions accurate enough to compete with bookmakers.

Also note that the dataset used contains only 20% of all game-ids within a period of a few years (in order to keep it relatively fast). When generating your own model, make sure to use your own dataset.


In [1]:
# Load subsampled data.
import pandas as pd
df = pd.read_pickle(r"data/game_player_subsample.pickle")
# Filter out potentially bugged games that do not have exactly 2 distinct team_ids
df = (
    df.assign(team_count=df.groupby("game_id")["team_id"].transform('nunique'))
    .loc[lambda x: x.team_count == 2]
)
df.head()

Unnamed: 0,team_id,start_date,game_id,player_id,team_id_opponent,points,minutes,won,team_count
38953,1610612755,2022-10-18,22200001,202699,1610612738,18.0,34.233,0,2
38954,1610612755,2022-10-18,22200001,200782,1610612738,6.0,33.017,0,2
38955,1610612755,2022-10-18,22200001,203954,1610612738,26.0,37.267,0,2
38956,1610612755,2022-10-18,22200001,1630178,1610612738,21.0,38.2,0,2
38957,1610612755,2022-10-18,22200001,201935,1610612738,35.0,37.267,0,2


In [2]:
df.describe()

Unnamed: 0,team_id,player_id,team_id_opponent,points,minutes,team_count
count,19872.0,19872.0,19872.0,19872.0,19872.0,19872.0
mean,1610613000.0,1270595.0,1610613000.0,8.921548,18.892618,2.0
std,8.635314,620489.8,8.655319,9.177834,13.163723,0.0
min,1610613000.0,2544.0,1610613000.0,0.0,0.0,2.0
25%,1610613000.0,204456.0,1610613000.0,0.0,5.933,2.0
50%,1610613000.0,1629011.0,1610613000.0,7.0,20.15,2.0
75%,1610613000.0,1630227.0,1610613000.0,14.0,30.35,2.0
max,1610613000.0,1631495.0,1610613000.0,71.0,52.733,2.0


In [3]:
print(f"{len(df['game_id'].unique())} number of games. Ranging from {df['start_date'].min()} to {df['start_date'].max()}")

776 number of games. Ranging from 2022-10-18 to 2023-02-01


One thing that perhaps isn't as intuitive is what happens below, which is "clipping".
First we define the target (the column we are trying to predict).
Next, we clip it, meaning we limit its values to between 0 and 40.
Our machine-learning models don't work too well if they need to predict too many unique values. Thus, our machine-learning model will only be able to predict values within the 0-40 range.

In [4]:

from player_performance_ratings import PredictColumnNames
df[PredictColumnNames.TARGET] = df["points"]
df[PredictColumnNames.TARGET] = df[PredictColumnNames.TARGET].clip(0, 40)
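
To make the effect of clipping concrete, here is a minimal sketch on a small made-up series of point totals (the values are illustrative and not taken from the dataset). Anything above 40 collapses into the 40 class, so that class effectively means "40 or more".

```python
import pandas as pd

# Hypothetical point totals, for illustration only.
points = pd.Series([0.0, 12.0, 40.0, 55.0, 71.0])

# Values above 40 are set to 40; values already within [0, 40] are unchanged.
clipped = points.clip(0, 40)
print(clipped.tolist())  # [0.0, 12.0, 40.0, 40.0, 40.0]
```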

The first feature we will experiment with is a rolling mean of past player points. The idea is that the rolling average of a player's past points has predictive power.

In [5]:
from player_performance_ratings.transformation.post_transformers import RollingMeanTransformation
rolling_mean = RollingMeanTransformation(feature_names=["points"], window=15, granularity=["player_id"])
df_rolling_mean = rolling_mean.transform(df)
df_rolling_mean.tail()

Unnamed: 0,team_id,start_date,game_id,player_id,team_id_opponent,points,minutes,won,team_count,__target,rolling_mean_15_points
58820,1610612756,2023-02-01,22200777,1630240,1610612737,6.0,21.483,0,2,6.0,5.066667
58821,1610612756,2023-02-01,22200777,202687,1610612737,2.0,10.0,0,2,2.0,3.666667
58822,1610612756,2023-02-01,22200777,1629006,1610612737,2.0,15.367,0,2,2.0,6.066667
58823,1610612756,2023-02-01,22200777,1630688,1610612737,4.0,18.95,0,2,4.0,3.866667
58824,1610612756,2023-02-01,22200777,1629111,1610612737,6.0,10.517,0,2,6.0,5.866667
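
For intuition, a per-player rolling mean can be sketched in plain pandas. This illustrates the general technique only and is not the implementation of RollingMeanTransformation. The toy data, the window of 3 and the choice to shift by one game (so that a row never sees its own points) are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data: two players, rows already in chronological order.
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 1, 2, 2, 2],
    "points": [10.0, 20.0, 30.0, 40.0, 5.0, 7.0, 9.0],
})

# shift(1) removes the current game from its own window, so each row only
# uses earlier games; min_periods=1 allows windows shorter than 3 games.
toy["rolling_mean_3_points"] = toy.groupby("player_id")["points"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)
print(toy)
```

The first game of each player has no history, so its rolling mean is NaN.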


Before we can train and evaluate a machine-learning model, we split the dataset into train and predict sets. Below we create the split based on start_date, choosing the threshold so that roughly 75% of the rows fall before it (train) and the remaining 25% on or after it (predict).

In [6]:
train_date_threshold = pd.to_datetime(df['start_date']).quantile(0.75)
train_date_threshold

Timestamp('2023-01-06 00:00:00')
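
The same kind of date-based split can be tried on a tiny made-up frame. This is a minimal sketch with hypothetical data; note that the quantile is taken over rows, so the split is by row count rather than by number of games.

```python
import pandas as pd

# Hypothetical toy frame: one row per day for eight days.
toy = pd.DataFrame({
    "start_date": pd.date_range("2023-01-01", periods=8, freq="D").astype(str),
    "points": [3, 8, 0, 12, 7, 21, 4, 9],
})

dates = pd.to_datetime(toy["start_date"])
threshold = dates.quantile(0.75)

# Everything strictly before the threshold is used for training,
# everything on or after it is held out for evaluation.
train_toy = toy[dates < threshold]
test_toy = toy[dates >= threshold]
print(threshold, len(train_toy), len(test_toy))
```

Splitting on time rather than at random keeps future games out of the training data.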

In [7]:
from lightgbm import LGBMClassifier

#Create train and test data
train = df_rolling_mean[pd.to_datetime(df_rolling_mean['start_date'])<train_date_threshold]
test = df_rolling_mean[pd.to_datetime(df_rolling_mean['start_date'])>=train_date_threshold]

#Instantiate LGBM Machine learning model
model = LGBMClassifier(verbose=-100, max_depth=2, n_estimators=300, learning_rate=0.05)

#Train machine learning model using the rolling_mean_feature we created
model.fit(train[rolling_mean.features_created], train['__target'])



So we trained a model. Next, let's check what the predictions look like.
Below we print, for a single row of the test set, the predicted probability of each possible points total.

In [8]:
probs = model.predict_proba(test[rolling_mean.features_created])
row_idx = 500  # Arbitrary test row used to inspect a single prediction
row = test.iloc[row_idx]
for class_idx, points in enumerate(model.classes_):
    print(f"probability for playerid {row['player_id']} to score {points} points in gameid {row['game_id']} is: {probs[row_idx, class_idx]}")


probability for playerid 1628374 to score 0.0 points in gameid 0022200600 is: 0.03401286381997298
probability for playerid 1628374 to score 1.0 points in gameid 0022200600 is: 0.0004056315772471626
probability for playerid 1628374 to score 2.0 points in gameid 0022200600 is: 0.0007411034828695288
probability for playerid 1628374 to score 3.0 points in gameid 0022200600 is: 0.0008726459104589151
probability for playerid 1628374 to score 4.0 points in gameid 0022200600 is: 0.0010616235209828795
probability for playerid 1628374 to score 5.0 points in gameid 0022200600 is: 0.00036204939273700354
probability for playerid 1628374 to score 6.0 points in gameid 0022200600 is: 0.0011035525515961024
probability for playerid 1628374 to score 7.0 points in gameid 0022200600 is: 0.0006411568345926713
probability for playerid 1628374 to score 8.0 points in gameid 0022200600 is: 0.009431644278012628
probability for playerid 1628374 to score 9.0 points in gameid 0022200600 is: 0.0013154162038999458
...

Is the above good or bad? It's hard to say by just looking at a single player for a single game. To evaluate overall model performance we use the log loss metric.
The lower the score, the better our model performs.

In [9]:
from sklearn.metrics import log_loss
probs = model.predict_proba(test[rolling_mean.features_created])
log_loss(test[PredictColumnNames.TARGET], probs)

2.6071138766914728
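
To see what this number measures, the sketch below computes log loss by hand on a tiny made-up example and compares it with scikit-learn. The labels and probabilities are hypothetical. It also prints the log loss of a uniform guess over 41 classes (0 to 40 points), which assumes every class from 0 to 40 is present.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical example: 3 rows and 3 classes (0, 1 and 2 points).
y_true = np.array([0, 2, 1])
toy_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Log loss by hand: the mean of -log(probability assigned to the observed class).
manual = -np.mean(np.log(toy_probs[np.arange(len(y_true)), y_true]))
print(manual, log_loss(y_true, toy_probs, labels=[0, 1, 2]))

# Reference point: log loss of a uniform guess over 41 classes.
uniform_baseline = np.log(41)
print(uniform_baseline)
```

A uniform guess over 41 classes scores about 3.71, so the 2.61 above is better than knowing nothing, but it says little about how far the model is from a strong one.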

Let's see if we can improve the log loss with better feature engineering.
A problem with the rolling mean is that it weights performances from 15 games ago the same as the performance in the most recent game.
To address that problem, we can add lags for the past 5 games and let our machine-learning model identify the importance of each lag.

In [10]:
from player_performance_ratings.transformation.post_transformers import LagTransformation
# Creates 5 lags
lag_transformation = LagTransformation(feature_names=["points"], lag_length=5, granularity=['player_id'])
df_rolling_mean_lag = lag_transformation.transform(df_rolling_mean)
df_rolling_mean_lag.tail()

Unnamed: 0,team_id,start_date,game_id,player_id,team_id_opponent,points,minutes,won,team_count,__target,rolling_mean_15_points,lag_1_points,lag_2_points,lag_3_points,lag_4_points,lag_5_points
58820,1610612756,2023-02-01,22200777,1630240,1610612737,6.0,21.483,0,2,6.0,5.066667,2.0,7.0,0.0,2.0,2.0
58821,1610612756,2023-02-01,22200777,202687,1610612737,2.0,10.0,0,2,2.0,3.666667,0.0,8.0,4.0,6.0,4.0
58822,1610612756,2023-02-01,22200777,1629006,1610612737,2.0,15.367,0,2,2.0,6.066667,3.0,2.0,0.0,9.0,4.0
58823,1610612756,2023-02-01,22200777,1630688,1610612737,4.0,18.95,0,2,4.0,3.866667,7.0,0.0,3.0,2.0,0.0
58824,1610612756,2023-02-01,22200777,1629111,1610612737,6.0,10.517,0,2,6.0,5.866667,4.0,0.0,2.0,15.0,12.0
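
Lag features of this kind can be sketched in plain pandas with a grouped shift. This shows the general technique on made-up data and is not the implementation of LagTransformation; the column naming simply mirrors the output above.

```python
import pandas as pd

# Hypothetical toy data: two players, rows already in chronological order.
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2, 2],
    "points": [10.0, 20.0, 30.0, 5.0, 7.0, 9.0],
})

# lag_k_points holds the points the same player scored k games earlier.
for lag in (1, 2):
    toy[f"lag_{lag}_points"] = toy.groupby("player_id")["points"].shift(lag)
print(toy)
```

Rows without enough history get NaN, which LightGBM can handle natively.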


In [11]:
# Printing out all our new features
features = rolling_mean.features_created + lag_transformation.features_created
features

['rolling_mean_15_points',
 'lag_1_points',
 'lag_2_points',
 'lag_3_points',
 'lag_4_points',
 'lag_5_points']

In [12]:
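# Note: the split date is hard-coded here and differs slightly from train_date_threshold (2023-01-06) computed earlier
train = df_rolling_mean_lag[pd.to_datetime(df_rolling_mean_lag['start_date'])<'2023-01-09']
test = df_rolling_mean_lag[pd.to_datetime(df_rolling_mean_lag['start_date'])>='2023-01-09']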
model = LGBMClassifier(verbose=-100, max_depth=2, n_estimators=300, learning_rate=0.05, reg_alpha=1.5)

model.fit(train[features], train[PredictColumnNames.TARGET])
model.feature_importances_

array([8516, 5179, 5546, 5577, 5437, 5400])

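Above we trained a new machine-learning model with the additional lagged features and printed the importance of each feature, in the same order as the features list.
By default, LightGBM reports the number of times a feature is used to split, so a higher value means the model relies on that feature more often.
The rolling mean has by far the highest importance. The five lags have similar importances (roughly 5,200 to 5,600), so contrary to what one might expect, the model does not clearly favour the most recent games over lags further back.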
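
The raw array is easier to read when each value is paired with its feature name. The sketch below copies the feature names and importances from the outputs above into a pandas Series; in the notebook itself, the same result could be obtained from features and model.feature_importances_.

```python
import pandas as pd

# Feature names and importances copied from the notebook outputs above.
feature_names = [
    "rolling_mean_15_points",
    "lag_1_points",
    "lag_2_points",
    "lag_3_points",
    "lag_4_points",
    "lag_5_points",
]
importances = [8516, 5179, 5546, 5577, 5437, 5400]

# Sort so that the most frequently used feature comes first.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```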

In [13]:
import numpy as np
np.set_printoptions(suppress=True)
probs = model.predict_proba(test[features])
log_loss(test["__target"], probs)

2.5910563903453556

An alternative to lagged features is a concept I call time_weighted_rating.
Instead of relying on artificial cutoffs, it takes all past performances into account and weights recent performances higher.
Below, BayesianTimeWeightedRating is used. It is a Bayesian approach because it starts from a prior "rating" for a player and updates the rating as it gets more data (evidence).

By default, the prior is equal to the average across all players.
The time_weighted_rating_evidence is the time-weighted value of the player. In this case, it is all past points data for a player, with recent performances weighted higher.
The time_weighted_rating_likelihood_ratio takes a value between 0 and 1. The more data we have on the player (and the more recent it is), the closer the value will be to 1.
The final rating is time_weighted_rating, which is calculated as prior * (1 - time_weighted_rating_likelihood_ratio) + time_weighted_rating_likelihood_ratio * time_weighted_rating_evidence.

In [14]:
from player_performance_ratings.ratings.time_weight_ratings import BayesianTimeWeightedRating
from player_performance_ratings import ColumnNames
column_names = ColumnNames(
    team_id='team_id',
    match_id='game_id',
    start_date="start_date",
    player_id="player_id",
    performance="points",
)
df_rolling_mean_lag = df_rolling_mean_lag.sort_values(by=[column_names.start_date, column_names.match_id, column_names.team_id, column_names.player_id])
time_weight_rating = BayesianTimeWeightedRating()
generated_time_weight_ratings = time_weight_rating.generate(df=df_rolling_mean_lag, column_names=column_names)
for rating_feature, values in generated_time_weight_ratings.items():
    print(rating_feature, values[2300:2305])

time_weighted_rating [8.123092527455931, 8.562398337056653, 8.476783072859371, 8.20081217586016, 9.116075673380804]
time_weighted_rating_likelihood_ratio [0.08949740420666664, 0.08949740420666664, 0.08949740420666664, 0.08949740420666664, 0.0724613059075374]
time_weighted_rating_evidence [0.0, 4.908587165123588, 3.951964289228997, 0.8684011463031783, 11.606122305892972]
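
The formula for the final rating can be checked against these numbers with a few lines of Python. This is a sketch, not the library's code: the function name is made up for illustration, and the prior of 8.921548 is an assumption (it is the mean of points from df.describe(), which may not be exactly the prior the library used). The likelihood ratio and evidence are taken from the second printed value of each list above.

```python
def combine_prior_and_evidence(prior, evidence, likelihood_ratio):
    """Blend a prior with evidence; likelihood_ratio in [0, 1] is the weight on the evidence."""
    return prior * (1 - likelihood_ratio) + likelihood_ratio * evidence


# Assumed prior: mean points across all players in the dataset.
prior = 8.921548
# Second printed values of time_weighted_rating_likelihood_ratio and _evidence.
likelihood_ratio = 0.08949740420666664
evidence = 4.908587165123588

rating = combine_prior_and_evidence(prior, evidence, likelihood_ratio)
print(rating)
```

Under that assumed prior, the result agrees with the second printed time_weighted_rating (about 8.5624). With a likelihood ratio this small, the rating stays close to the prior.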


Calling .generate() above returns a dictionary that maps each rating name to a list of rating values, as can be seen above.
We simply printed the ratings for 5 rows.
To add a rating as a feature to our dataframe, follow the steps below:

In [18]:
features = rolling_mean.features_created + lag_transformation.features_created + ["time_weighted_rating"]
df_rolling_mean_lag_time_weighted = df_rolling_mean_lag.copy()
df_rolling_mean_lag_time_weighted["time_weighted_rating"] = generated_time_weight_ratings["time_weighted_rating"]
    
df_rolling_mean_lag_time_weighted.tail()

Unnamed: 0,team_id,start_date,game_id,player_id,team_id_opponent,points,minutes,won,team_count,__target,rolling_mean_15_points,lag_1_points,lag_2_points,lag_3_points,lag_4_points,lag_5_points,hour_number,time_weighted_rating
58815,1610612756,2023-02-01,22200777,1629028,1610612737,20.0,26.533,0,2,20.0,12.066667,22.0,23.0,19.0,0.0,0.0,465336,11.050646
58824,1610612756,2023-02-01,22200777,1629111,1610612737,6.0,10.517,0,2,6.0,5.866667,4.0,0.0,2.0,15.0,12.0,465336,7.667579
58813,1610612756,2023-02-01,22200777,1629661,1610612737,6.0,15.333,0,2,6.0,13.066667,4.0,15.0,22.0,24.0,8.0,465336,9.511736
58820,1610612756,2023-02-01,22200777,1630240,1610612737,6.0,21.483,0,2,6.0,5.066667,2.0,7.0,0.0,2.0,2.0,465336,8.291053
58823,1610612756,2023-02-01,22200777,1630688,1610612737,4.0,18.95,0,2,4.0,3.866667,7.0,0.0,3.0,2.0,0.0,465336,6.779943


In [19]:
features

['rolling_mean_15_points',
 'lag_1_points',
 'lag_2_points',
 'lag_3_points',
 'lag_4_points',
 'lag_5_points',
 'time_weighted_rating']

In [20]:
# Same hard-coded split date as for the previous model
train = df_rolling_mean_lag_time_weighted[pd.to_datetime(df_rolling_mean_lag_time_weighted['start_date'])<'2023-01-09']
test = df_rolling_mean_lag_time_weighted[pd.to_datetime(df_rolling_mean_lag_time_weighted['start_date'])>='2023-01-09']
model = LGBMClassifier(verbose=-100, max_depth=2, n_estimators=300, learning_rate=0.05, reg_alpha=1.5)

model.fit(train[features], train['__target'])
model.feature_importances_

array([6745, 4492, 4772, 4805, 4824, 4660, 5479])

In [21]:
probs = model.predict_proba(test[features])
log_loss(test["__target"], probs)

2.5925631664928903

## Packaging it all together

It's a bit messy to have the various transformations, rating models and machine-learning models as separate components. Wouldn't it be easier if we could run all of them at once?
Luckily, that is exactly what the MatchPredictor class does. Below is an example of how this can be accomplished.

Notice that we no longer need to separate the train and predict datasets, or even specify the predictor. There is default logic for how the features are passed to the predictor and for which predictor is used.
For proper optimization it is recommended to specify the machine-learning model yourself, but the defaults make it possible to get started quickly.

We also add a few more features below: player_points_per_minute and minutes.


In [22]:
from player_performance_ratings import ColumnNames, PredictColumnNames
from player_performance_ratings.predictor.match_predictor import MatchPredictor

df = df.sort_values(by=[column_names.start_date, column_names.match_id, column_names.team_id, column_names.player_id])
# Note: rows with 0 minutes produce NaN (0 / 0) or inf here
df["player_points_per_minute"] = df["points"] / df["minutes"]


post_rating_transformers = [
    LagTransformation(feature_names=["points", "player_points_per_minute"], lag_length=10, granularity=['player_id']), 
    RollingMeanTransformation(feature_names=["points", "player_points_per_minute"], window=15, granularity=["player_id"]),
    RollingMeanTransformation(feature_names=["minutes"], window=10, granularity=["player_id"])    
]

rating_generators = [BayesianTimeWeightedRating()]


match_predictor = MatchPredictor(
    column_names=column_names,
    rating_generators=rating_generators,
    post_rating_transformers=post_rating_transformers,
)
df_with_predictions = match_predictor.generate_historical(df)
probabilities = np.stack(df_with_predictions[match_predictor.predictor.pred_column].values)
print(log_loss(df_with_predictions[PredictColumnNames.TARGET], probabilities))



2.2742979311163913
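
Per-class probabilities like the ones produced throughout this guide can be aggregated into the probability of going over or under a points line, which is the form bookmaker markets usually take. The sketch below uses a made-up probability row over five classes; the function name and the numbers are hypothetical. In the notebook, one row of the predicted probabilities and the classes of the fitted model would take their place.

```python
import numpy as np

# Hypothetical class labels (points) and predicted probabilities for one player.
classes = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
probs_row = np.array([0.10, 0.20, 0.30, 0.25, 0.15])


def prob_over(classes, probs_row, line):
    """Probability that the points total is strictly above `line`."""
    return float(probs_row[classes > line].sum())


p_over = prob_over(classes, probs_row, line=2.5)
print(p_over, 1 - p_over)
```

Because the target was clipped at 40, the top class means "40 or more", so lines above 40 cannot be evaluated this way.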
