## NBA Player Points Prediction example

This is a step-by-step guide on how to generate probabilities for the number of points an NBA player will score in a game, using machine learning and various feature-engineering approaches.

The primary intention is to provide an introduction, with some ideas and inspiration on how to think about modelling NBA player points. This section, however, uses default hyperparameters, so it should not be expected to generate predictions accurate enough to compete with bookmakers.

Also note that the dataset used only contains 20% of all game ids within a period of a few years (in order to keep it relatively fast). When generating your own model, make sure to use your own dataset.


In [14]:
# Load subsampled data.
import pandas as pd
df = pd.read_pickle(r"data/game_player_subsample.pickle")
# Filter away potentially bugged data where there are not exactly 2 different team_ids playing

df = (
    df.assign(team_count=df.groupby("game_id")["team_id"].transform('nunique'))
    .loc[lambda x: x.team_count == 2]
)
df.head()

Unnamed: 0,game_id,team_id,player_id,player_name,plus_minus,points,minutes,free_throws_attempted,free_throws_made,three_pointers_made,...,assists,blocks,defensive_rebounds,offensive_rebounds,game_minutes,start_date,score,score_opponent,won,team_count
113,22100007,1610612739,1628374,Lauri Markkanen,-4.0,10.0,31.467,0.0,0.0,2.0,...,2.0,0.0,7.0,2.0,48.0,2021-10-20,121,132,0,2
114,22100007,1610612739,1630596,Evan Mobley,-18.0,17.0,38.3,3.0,2.0,1.0,...,6.0,1.0,9.0,0.0,48.0,2021-10-20,121,132,0,2
115,22100007,1610612739,1628386,Jarrett Allen,5.0,25.0,28.917,4.0,3.0,0.0,...,1.0,3.0,3.0,1.0,48.0,2021-10-20,121,132,0,2
116,22100007,1610612739,1629012,Collin Sexton,-14.0,17.0,29.367,3.0,1.0,2.0,...,1.0,0.0,0.0,1.0,48.0,2021-10-20,121,132,0,2
117,22100007,1610612739,1629636,Darius Garland,0.0,13.0,31.917,0.0,0.0,3.0,...,12.0,0.0,1.0,0.0,48.0,2021-10-20,121,132,0,2


In [22]:
df.describe()

Unnamed: 0,team_id,player_id,plus_minus,points,minutes,free_throws_attempted,free_throws_made,three_pointers_made,three_pointers_attempted,two_pointers_made,...,steals,assists,blocks,defensive_rebounds,offensive_rebounds,game_minutes,score,score_opponent,team_count,__target
count,11380.0,11380.0,11380.0,11380.0,11380.0,11380.0,11380.0,11380.0,11380.0,11380.0,...,11380.0,11380.0,11380.0,11380.0,11380.0,11380.0,11380.0,11380.0,11380.0,11380.0
mean,1610613000.0,1221869.0,-0.004833,8.393761,18.017833,1.717311,1.340246,0.941213,2.595958,2.114938,...,0.564851,1.853515,0.363269,2.518278,0.776186,48.036661,111.628207,111.684534,2.0,8.377065
std,8.524005,648039.6,9.993032,8.804058,13.203532,2.652131,2.203437,1.402848,2.986012,2.558562,...,0.889893,2.490033,0.751613,2.731973,1.28225,1.951027,14.936647,14.11616,0.0,8.735658
min,1610613000.0,2544.0,-45.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,39.03,0.0,1.0,2.0,0.0
25%,1610613000.0,203933.0,-5.0,0.0,4.083,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,48.0,102.0,102.0,2.0,0.0
50%,1610613000.0,1628973.0,0.0,6.0,19.133,0.0,0.0,0.0,2.0,1.0,...,0.0,1.0,0.0,2.0,0.0,48.0,112.0,112.0,2.0,6.0
75%,1610613000.0,1630168.0,5.0,13.0,29.233,2.0,2.0,2.0,4.0,3.0,...,1.0,3.0,1.0,4.0,1.0,48.0,121.0,121.0,2.0,13.0
max,1610613000.0,1641941.0,45.0,60.0,50.167,27.0,23.0,12.0,18.0,17.0,...,7.0,20.0,9.0,19.0,14.0,58.0,157.0,157.0,2.0,40.0


In [27]:
print(f"{len(df['game_id'].unique())} number of games. Ranging from {df['start_date'].min()} to {df['start_date'].max()}")

425 number of games. Ranging from 2020-12-11 to 2023-11-17


One thing that perhaps isn't as intuitive is what happens below, which is "clipping".
First we define the target (the column we are trying to predict).
Next, we clip it, meaning we limit its values to between 0 and 40.
Our machine-learning models don't work too well if they need to predict too many unique values. Thus, our machine-learning model will only be able to predict values within the 0-40 range; any game with more than 40 points is treated as 40.

In [31]:

from player_performance_ratings import PredictColumnNames
df[PredictColumnNames.TARGET] = df["points"]
df[PredictColumnNames.TARGET] = df[PredictColumnNames.TARGET].clip(0, 40)
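
To make the effect of clipping concrete, here is a minimal sketch on a handful of made-up point totals (the values are illustrative, not taken from the dataset). Anything above 40 collapses into the 40 class, so the classifier has at most 41 classes (0 through 40) to predict.

```python
import pandas as pd

# Hypothetical point totals, including two games above the 40-point cap
points = pd.Series([0, 12, 40, 47, 60])

# clip(lower, upper) bounds every value to the [0, 40] interval
target = points.clip(0, 40)

print(target.tolist())
print(target.nunique())
```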

The first feature we will be experimenting with is a rolling mean of past player points. The idea is that the rolling average of a player's past points has predictive power.

In [36]:
from player_performance_ratings.transformation.post_transformers import RollingMeanTransformation
rolling_mean = RollingMeanTransformation(feature_names=["points"], window=15, granularity=["player_id"])
df_rolling_mean = rolling_mean.transform(df)
df_rolling_mean.tail()


Unnamed: 0,team_id,start_date,game_id,player_id,points,minutes,won,__target,rolling_mean_15_points
113355,1610612741,2023-11-17,22300033,1628470,3.0,25.917,0,3.0,4.4
113356,1610612741,2023-11-17,22300033,1630172,0.0,13.017,0,0.0,10.666667
113357,1610612741,2023-11-17,22300033,1628975,0.0,9.8,0,0.0,4.666667
113358,1610612741,2023-11-17,22300033,1641763,0.0,0.0,0,0.0,2.0
113359,1610612741,2023-11-17,22300033,1630678,0.0,0.0,0,0.0,6.375
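
As a rough illustration of the idea, the sketch below computes a per-player rolling mean with plain pandas. It is not the implementation of RollingMeanTransformation; the toy data, the window of 3, and the choice to shift by one game (so that a game's own points never leak into its feature) are assumptions made for the example.

```python
import pandas as pd

# Toy data: two hypothetical players, rows already sorted by date within player
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 1, 2, 2, 2],
    "points":    [10, 20, 30, 40, 5, 7, 9],
})

# For each player: drop the current game (shift), then average up to 3 prior games
toy["rolling_mean_3_points"] = (
    toy.groupby("player_id")["points"]
    .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)

print(toy)
```

The first game of each player has no history, so its feature is missing.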


Before we can train and evaluate a machine-learning model, we split the dataset into a train set and a predict (test) set. Below we create the separation based on start_date: the threshold is the 75th percentile of start_date, so roughly 75% of the rows go to the train data and the remaining 25% to the predict dataset.

In [37]:
train_date_threshold = pd.to_datetime(df['start_date']).quantile(0.75)
train_date_threshold

Timestamp('2023-01-04 00:00:00')
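
The date-based split can be sketched on toy data as follows. The frame and its dates are made up; the point is only that every training row precedes every test row in time, which avoids training on the future.

```python
import pandas as pd

# Hypothetical game log: eight rows on eight consecutive days
toy = pd.DataFrame({
    "start_date": pd.date_range("2023-01-01", periods=8, freq="D"),
    "points": [3, 8, 15, 0, 22, 9, 11, 4],
})

# The 75th percentile of the dates is the cut-off between train and test
threshold = toy["start_date"].quantile(0.75)

train = toy[toy["start_date"] < threshold]
test = toy[toy["start_date"] >= threshold]

print(threshold, len(train), len(test))
```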

In [42]:
from lightgbm import LGBMClassifier

#Create train and test data
train = df_rolling_mean[pd.to_datetime(df_rolling_mean['start_date'])<train_date_threshold]
test = df_rolling_mean[pd.to_datetime(df_rolling_mean['start_date'])>=train_date_threshold]

#Instantiate LGBM Machine learning model
model = LGBMClassifier(verbose=-100, max_depth=2)

#Train machine learning model using the rolling_mean_feature we created
model.fit(train[rolling_mean.features_created], train['__target'])


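So we trained a model. Next, let's check what the predictions look like.
Below we print the predicted probability of each possible point total for a single player in a single game (row 500 of the test set).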

In [54]:
probs = model.predict_proba(test[rolling_mean.features_created])
# Inspect a single row of the test set: one player in one game
row_idx = 500
row = test.iloc[row_idx]
for class_idx, points in enumerate(model.classes_):
    print(f"probability for playerid {row['player_id']} to score {points} points in gameid {row['game_id']} is: {probs[row_idx, class_idx]}")


probability for playerid 1629028 to score 0.0 points in gameid 0022200762 is: 0.06842785011051888
probability for playerid 1629028 to score 1.0 points in gameid 0022200762 is: 0.00577508185931411
probability for playerid 1629028 to score 2.0 points in gameid 0022200762 is: 0.01658190496311323
probability for playerid 1629028 to score 3.0 points in gameid 0022200762 is: 0.0138159467279666
probability for playerid 1629028 to score 4.0 points in gameid 0022200762 is: 0.01605405769039626
probability for playerid 1629028 to score 5.0 points in gameid 0022200762 is: 0.02619860023991661
probability for playerid 1629028 to score 6.0 points in gameid 0022200762 is: 0.02879730298129993
probability for playerid 1629028 to score 7.0 points in gameid 0022200762 is: 0.031523743875593604
probability for playerid 1629028 to score 8.0 points in gameid 0022200762 is: 0.03096898347388094
probability for playerid 1629028 to score 9.0 points in gameid 0022200762 is: 0.029903751138616985
...
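
A vector of per-class probabilities like the one above can be turned into other quantities, such as the probability of scoring at least a given number of points or the expected point total. The sketch below uses a made-up probability vector over 0 to 40 points, not the model's actual output, and the threshold of 10 is an arbitrary choice.

```python
import numpy as np

# Hypothetical probabilities for 0..40 points (41 classes) that sum to 1
classes = np.arange(41)
raw = np.exp(-0.5 * ((classes - 8) / 5.0) ** 2)
probs = raw / raw.sum()

# Probability of scoring 10 points or more: sum the mass of classes >= 10
prob_at_least_10 = probs[classes >= 10].sum()

# Expected points: probability-weighted average of the class values
expected_points = (classes * probs).sum()

print(round(prob_at_least_10, 3), round(expected_points, 2))
```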

Is the above good or bad? It is hard to say by just looking at a single player in a single game. To evaluate overall model performance we use the log loss metric.
The lower the score, the better our model performs.

In [56]:
from sklearn.metrics import log_loss
probs = model.predict_proba(test[rolling_mean.features_created])
log_loss(test[PredictColumnNames.TARGET], probs)

2.814920937004274
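
For intuition about the scale of this number, log loss is the average negative log of the probability assigned to the outcome that actually happened. The sketch below computes it by hand on made-up values and also shows the score of a uniform guess over the 41 classes (0 to 40), which is ln(41) ≈ 3.71. The helper name multiclass_log_loss is introduced here for illustration only.

```python
import numpy as np

def multiclass_log_loss(y_true, probs):
    """Mean negative log-probability of the true class (classes are 0..K-1)."""
    y_true = np.asarray(y_true)
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(picked)))

# Three hypothetical games with true point totals 0, 6 and 13
y_true = [0, 6, 13]

# Baseline: every one of the 41 classes gets the same probability
uniform = np.full((3, 41), 1 / 41)
print(multiclass_log_loss(y_true, uniform))

# A sharper forecast that puts 30% on the true class scores lower (better)
sharper = np.full((3, 41), 0.7 / 40)
sharper[np.arange(3), y_true] = 0.3
print(multiclass_log_loss(y_true, sharper))
```

The uniform baseline is a useful reference point when reading the log loss scores in this guide.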

Let's see if we can improve the log loss with better feature engineering.
A problem with the rolling mean is that it weights a performance from 15 games ago the same as the performance in the most recent game.
To address that problem we can add lags for the past 10 games and let our machine-learning model identify the importance of each lag.

In [59]:
from player_performance_ratings.transformation.post_transformers import LagTransformation
#Creates 10 lags
lag_transformation = LagTransformation(feature_names=["points"], lag_length=10, granularity=['player_id'])
df_rolling_mean_lag = lag_transformation.transform(df_rolling_mean)
df_rolling_mean_lag.tail()

Unnamed: 0,team_id,start_date,game_id,player_id,points,minutes,won,__target,rolling_mean_15_points,lag_1_points,lag_2_points,lag_3_points,lag_4_points,lag_5_points,lag_6_points,lag_7_points,lag_8_points,lag_9_points,lag_10_points
113355,1610612741,2023-11-17,22300033,1628470,3.0,25.917,0,3.0,4.4,3.0,18.0,20.0,5.0,0.0,0.0,2.0,0.0,0.0,2.0
113356,1610612741,2023-11-17,22300033,1630172,0.0,13.017,0,0.0,10.666667,6.0,10.0,2.0,7.0,10.0,16.0,11.0,17.0,6.0,10.0
113357,1610612741,2023-11-17,22300033,1628975,0.0,9.8,0,0.0,4.666667,6.0,6.0,3.0,0.0,0.0,3.0,0.0,0.0,9.0,0.0
113358,1610612741,2023-11-17,22300033,1641763,0.0,0.0,0,0.0,2.0,2.0,,,,,,,,,
113359,1610612741,2023-11-17,22300033,1630678,0.0,0.0,0,0.0,6.375,5.0,3.0,0.0,11.0,9.0,2.0,15.0,6.0,,
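
The lag features themselves are simple to reproduce on toy data. The following is a sketch with plain pandas, not the LagTransformation implementation; the data and the choice of 2 lags are made up. Players with a short history get missing values, as in the table above.

```python
import pandas as pd

# Toy data: two hypothetical players, rows sorted by date within player
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2],
    "points":    [10, 20, 30, 5, 7],
})

# lag_k_points holds the points scored k games earlier by the same player
for k in (1, 2):
    toy[f"lag_{k}_points"] = toy.groupby("player_id")["points"].shift(k)

print(toy)
```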


In [60]:
# Printing out all our new features
features = rolling_mean.features_created + lag_transformation.features_created
features

['rolling_mean_15_points',
 'lag_1_points',
 'lag_2_points',
 'lag_3_points',
 'lag_4_points',
 'lag_5_points',
 'lag_6_points',
 'lag_7_points',
 'lag_8_points',
 'lag_9_points',
 'lag_10_points']

In [61]:
train = df_rolling_mean_lag[pd.to_datetime(df_rolling_mean_lag['start_date'])<'2023-01-09']
test = df_rolling_mean_lag[pd.to_datetime(df_rolling_mean_lag['start_date'])>='2023-01-09']
model = LGBMClassifier(verbose=-100, max_depth=2)

model.fit(train[features], train[PredictColumnNames.TARGET])
model.feature_importances_

array([1891, 1338, 1220, 1092, 1035,  925, 1011,  946,  888,  868,  813])

Above we trained a new machine-learning model with the additional lagged features, and we also printed out the importance of each feature (in the same order as the features list).
The higher the feature importance, the more impact the feature has on the prediction.
As would be expected, the lags further back generally have a lower feature importance, and thus matter less according to the model.
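
Because the importances come back as a bare array, it helps to pair them with the feature names. The sketch below does that with the numbers printed above, hard-coded so that it runs without the fitted model.

```python
import pandas as pd

features = ["rolling_mean_15_points"] + [f"lag_{k}_points" for k in range(1, 11)]
importances = [1891, 1338, 1220, 1092, 1035, 925, 1011, 946, 888, 868, 813]

# Index the importances by feature name and rank them
ranked = pd.Series(importances, index=features).sort_values(ascending=False)
print(ranked)
```

With a fitted model, the same pattern would be `pd.Series(model.feature_importances_, index=features)`.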

In [62]:
import numpy as np
np.set_printoptions(suppress=True)
probs = model.predict_proba(test[features])
log_loss(test["__target"], probs)

2.885992476651109

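Note that adding the lags did not improve the log loss (2.886 versus 2.815 with the rolling mean alone), although the two runs use slightly different date thresholds and are therefore not strictly comparable.
An alternative to lagged features is a concept I call time_weighted_rating.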
This take

In [12]:
from player_performance_ratings.ratings.time_weight_ratings import BayesianTimeWeightedRating
from player_performance_ratings import ColumnNames
column_names = ColumnNames(
    team_id='team_id',
    match_id='game_id',
    start_date="start_date",
    player_id="player_id",
    performance="points",
)
df_rolling_mean_lag = df_rolling_mean_lag.sort_values(by=[column_names.start_date, column_names.match_id, column_names.team_id, column_names.player_id])
time_weight_rating = BayesianTimeWeightedRating()
generated_time_weight_ratings = time_weight_rating.generate(df=df_rolling_mean_lag, column_names=column_names)
for rating_feature, values in generated_time_weight_ratings.items():
    print(rating_feature, values[2300:2305])


time_weighted_rating [8.076933010814129, 8.647562314157026, 7.794975927243408, 7.816208870644075, 7.920269469147545]
time_weighted_rating_likelihood_ratio [0.0676183352518337, 0.08362847921173143, 0.08362847921173143, 0.08362847921173143, 0.08362847921173143]
time_weighted_rating_evidence [4.440800536139237, 12.010783773126832, 1.8158550245718048, 2.069751108786229, 3.314071219403203]
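
The internals of BayesianTimeWeightedRating are not shown in this guide. Purely to illustrate the general idea of weighting recent games more heavily than old ones, here is a sketch of an exponentially time-decayed average. The function name, the half-life of 30 days, and the data are all assumptions for the example and should not be read as the library's algorithm.

```python
import numpy as np

def time_decayed_mean(points, days_ago, half_life_days=30.0):
    """Weighted mean where a game's weight halves every `half_life_days`."""
    points = np.asarray(points, dtype=float)
    days_ago = np.asarray(days_ago, dtype=float)
    weights = 0.5 ** (days_ago / half_life_days)
    return float(np.sum(weights * points) / np.sum(weights))

# Hypothetical player: 20 points in the latest game, 5 points 60 days earlier
recent_heavy = time_decayed_mean(points=[20, 5], days_ago=[0, 60])
print(recent_heavy)
```

The plain average of the two games would be 12.5; the decayed version sits much closer to the recent game.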


In [13]:
features = rolling_mean.features_created + lag_transformation.features_created
df_rolling_mean_lag_time_weighted = df_rolling_mean_lag.copy()
# Add each generated rating as a column and register it as a model feature
for rating_feature, values in generated_time_weight_ratings.items():
    features.append(rating_feature)
    df_rolling_mean_lag_time_weighted[rating_feature] = values

df_rolling_mean_lag_time_weighted.tail()

Unnamed: 0,game_id,team_id,player_id,player_name,plus_minus,points,minutes,free_throws_attempted,free_throws_made,three_pointers_made,...,lag_5_points,lag_6_points,lag_7_points,lag_8_points,lag_9_points,lag_10_points,hour_number,time_weighted_rating,time_weighted_rating_likelihood_ratio,time_weighted_rating_evidence
113460,22300037,1610612746,1628464,Daniel Theis,-6.0,2.0,10.45,2.0,0.0,0.0,...,7.0,16.0,6.0,23.0,9.0,9.0,472272,6.661967,0.19629,0.272531
113464,22300037,1610612746,1629599,Amir Coffey,0.0,0.0,0.0,0.0,0.0,0.0,...,8.0,0.0,1.0,0.0,6.0,2.0,472272,6.939268,0.201286,1.378569
113458,22300037,1610612746,1629611,Terance Mann,15.0,1.0,21.15,2.0,1.0,0.0,...,7.0,0.0,5.0,12.0,17.0,5.0,472272,7.956237,0.117313,5.063963
113466,22300037,1610612746,1630538,Bones Hyland,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,4.0,17.0,18.0,12.0,17.0,472272,7.755266,0.201038,9.241945
113465,22300037,1610612746,1631217,Moussa Diabate,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,0.0,2.0,0.0,0.0,2.0,472272,4.438815,0.205546,2.589463


In [14]:
train = df_rolling_mean_lag_time_weighted[df_rolling_mean_lag_time_weighted['start_date']<'2023-01-09']
test = df_rolling_mean_lag_time_weighted[df_rolling_mean_lag_time_weighted['start_date']>='2023-01-09']
model = LGBMClassifier(verbose=-100, max_depth=2)

model.fit(train[features], train['__target'])
model.feature_importances_

array([1325, 1091,  785,  641,  607,  644,  683,  635,  740,  641,  654,
        964, 1044, 1767])

In [15]:
probs = model.predict_proba(test[features])
log_loss(test["__target"], probs)

2.5472253787981827

In [16]:
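from player_performance_ratings.ratings.opponent_adjusted_rating.rating_generator import OpponentAdjustedRatingGenerator
# NOTE: TeamRatingGenerator and StartRatingGenerator are used below and must also be imported
# from player_performance_ratings; their import lines are not shown in this section.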
column_names_game_winner = ColumnNames(
    team_id='team_id',
    match_id='game_id',
    start_date="start_date",
    player_id="player_id",
    performance="won",
)


opponent_adjusted_rating_generator = OpponentAdjustedRatingGenerator(
    team_rating_generator=TeamRatingGenerator(
        start_rating_generator=StartRatingGenerator(
            team_weight=0,
        )
    )
)

generated_opponent_adjusted_ratings = opponent_adjusted_rating_generator.generate(df=df_rolling_mean_lag_time_weighted, column_names = column_names_game_winner)
for rating_feature, values in generated_opponent_adjusted_ratings.items():
    print(rating_feature, values[2300:2305])


rating_difference [-98.32912823  98.32912823  98.32912823  98.32912823  98.32912823]
player_league [None, None, None, None, None]
opponent_league [None, None, None, None, None]
player_rating [914.4060581487137, 1012.7351863741532, 1012.7351863741532, 1012.7351863741532, 1012.7351863741532]
player_rating_change [-21.210377929239037, 21.093975255139277, 21.093975255139277, 21.093975255139277, 21.093975255139277]
match_id ['0022000008', '0022000008', '0022000008', '0022000008', '0022000008']
team_rating [914.4060581487137, 1012.7351863741532, 1012.7351863741532, 1012.7351863741532, 1012.7351863741532]
opponent_rating [1012.7351863741532, 914.4060581487137, 914.4060581487137, 914.4060581487137, 914.4060581487137]
rating_mean [963.57062226 963.57062226 963.57062226 963.57062226 963.57062226]
player_predicted_performance [0.36214165275479165, 0.6378583472452084, 0.6378583472452084, 0.6378583472452084, 0.6378583472452084]
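
The player_predicted_performance values printed above are close to what the classic Elo expected-score formula gives for the printed rating differences. The sketch below checks this; the 400-point scale is the standard Elo convention and an assumption here, since the library's exact formula and constants are not shown in this guide, and the match is only approximate.

```python
def elo_expected_score(rating_difference, scale=400.0):
    """Classic Elo expectation for a side that leads by `rating_difference`."""
    return 1.0 / (1.0 + 10.0 ** (-rating_difference / scale))

# rating_difference values taken from the output above
favourite = elo_expected_score(98.32912823)   # library output above: 0.6378583
underdog = elo_expected_score(-98.32912823)   # library output above: 0.3621417

print(favourite, underdog)
```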


In [17]:
df_rolling_mean_lag_time_weighted_game_winner_ratings = df_rolling_mean_lag_time_weighted.copy()

df_rolling_mean_lag_time_weighted_game_winner_ratings["team_rating"] = generated_opponent_adjusted_ratings["team_rating"]
df_rolling_mean_lag_time_weighted_game_winner_ratings["opponent_rating"] = generated_opponent_adjusted_ratings["opponent_rating"]
df_rolling_mean_lag_time_weighted_game_winner_ratings["rating_difference"] = generated_opponent_adjusted_ratings["rating_difference"]
features = rolling_mean.features_created + lag_transformation.features_created +  ["team_rating", "opponent_rating", "rating_difference"]
for rating_feature in generated_time_weight_ratings:
    features.append(rating_feature)


df_rolling_mean_lag_time_weighted_game_winner_ratings.tail()

Unnamed: 0,game_id,team_id,player_id,player_name,plus_minus,points,minutes,free_throws_attempted,free_throws_made,three_pointers_made,...,lag_8_points,lag_9_points,lag_10_points,hour_number,time_weighted_rating,time_weighted_rating_likelihood_ratio,time_weighted_rating_evidence,team_rating,opponent_rating,rating_difference
113460,22300037,1610612746,1628464,Daniel Theis,-6.0,2.0,10.45,2.0,0.0,0.0,...,23.0,9.0,9.0,472272,6.661967,0.19629,0.272531,946.343661,966.759973,-20.416312
113464,22300037,1610612746,1629599,Amir Coffey,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,6.0,2.0,472272,6.939268,0.201286,1.378569,946.343661,966.759973,-20.416312
113458,22300037,1610612746,1629611,Terance Mann,15.0,1.0,21.15,2.0,1.0,0.0,...,12.0,17.0,5.0,472272,7.956237,0.117313,5.063963,946.343661,966.759973,-20.416312
113466,22300037,1610612746,1630538,Bones Hyland,0.0,0.0,0.0,0.0,0.0,0.0,...,18.0,12.0,17.0,472272,7.755266,0.201038,9.241945,946.343661,966.759973,-20.416312
113465,22300037,1610612746,1631217,Moussa Diabate,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,472272,4.438815,0.205546,2.589463,946.343661,966.759973,-20.416312


In [18]:
df_rolling_mean_lag_time_weighted_game_winner_ratings[df_rolling_mean_lag_time_weighted_game_winner_ratings['rating_difference']>0]["won"].mean()


0.6067119288000844

In [19]:
train = df_rolling_mean_lag_time_weighted_game_winner_ratings[df_rolling_mean_lag_time_weighted_game_winner_ratings['start_date']<'2023-01-09']
test = df_rolling_mean_lag_time_weighted_game_winner_ratings[df_rolling_mean_lag_time_weighted_game_winner_ratings['start_date']>='2023-01-09']
model = LGBMClassifier(verbose=-100, max_depth=2)

model.fit(train[features], train['__target'])
model.feature_importances_

array([1162,  962,  633,  522,  495,  534,  591,  516,  591,  515,  569,
        651,  706,  522,  853,  800, 1611])

In [20]:
probs = model.predict_proba(test[features])
log_loss(test["__target"], probs)

2.55150170174948

## Packaging it all together

It's a bit messy to have the various transformations, rating models and machine-learning models as separate components. Wouldn't it be easier if we could run all of these things at once?
Luckily, that is exactly what the MatchPredictor class does. Below is an example of how this can be accomplished.

Notice that we no longer need to separate the train and predict datasets, nor even specify the predictor. There is default logic for how the features are passed to the predictor and for which predictor is used.
For proper optimization it is recommended to specify the machine-learning model yourself, but if you just want to get started quickly, the defaults make that possible.

We also added a few more features below: player_points_per_minute and minutes.
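
One thing to watch with a per-minute feature is that the dataset contains rows with 0 minutes played, where a plain division yields NaN or inf. The sketch below shows one way to guard against that on made-up rows; filling with 0 is an assumption, and leaving the value missing is an equally defensible choice.

```python
import pandas as pd

# Hypothetical rows, including a player who did not play
toy = pd.DataFrame({"points": [12.0, 0.0, 3.0], "minutes": [24.0, 0.0, 6.0]})

# Mask zero minutes before dividing so no inf or 0/0 NaN is produced
safe_minutes = toy["minutes"].where(toy["minutes"] > 0)
toy["player_points_per_minute"] = (toy["points"] / safe_minutes).fillna(0.0)

print(toy)
```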


In [21]:
from player_performance_ratings import ColumnNames, PredictColumnNames
from player_performance_ratings.predictor.match_predictor import MatchPredictor

df = df.sort_values(by=[column_names.start_date, column_names.match_id, column_names.team_id, column_names.player_id])
df["player_points_per_minute"] = df["points"] / df["minutes"]


post_rating_transformers = [
    LagTransformation(feature_names=["points", "player_points_per_minute"], lag_length=10, granularity=['player_id']), 
    RollingMeanTransformation(feature_names=["points", "player_points_per_minute"], window=15, granularity=["player_id"]),
    RollingMeanTransformation(feature_names=["minutes"], window=10, granularity=["player_id"])    
]

rating_generators = [BayesianTimeWeightedRating()]


match_predictor = MatchPredictor(column_names=column_names,rating_generators=rating_generators, post_rating_transformers=post_rating_transformers)
df_with_predictions = match_predictor.generate_historical(df)
probabilities = np.stack(df_with_predictions[match_predictor.predictor.pred_column].values)
print(log_loss(df_with_predictions[PredictColumnNames.TARGET], probabilities))



2.474183009807008
