## NBA Player Points Prediction example

This is a step-by-step guide to generating probabilities for the number of points an NBA player will score in a game, using machine learning and various feature-engineering approaches.

The primary intention is to give an introduction, with some ideas and inspiration on how to think about modelling NBA player points. This section, however, uses default hyperparameters, so it is not intended to generate accurate predictions that can compete with bookmakers.

Also note that the dataset used contains only 20% of all game IDs within a period of a few years (to keep it relatively fast). When building your own model, make sure to use your own dataset.


In [1]:
# Load subsampled data.
from player_performance_ratings.examples.utils import load_nba_subsampled_game_player_data
df = load_nba_subsampled_game_player_data()

# Filter out potentially bugged games that do not have exactly 2 distinct team_ids.

df = (
    df.assign(team_count=df.groupby("game_id")["team_id"].transform('nunique'))
    .loc[lambda x: x.team_count == 2]
)
df.head()


In [3]:
import pandas as pd
pd.to_datetime(df['start_date']).quantile([0.25, 0.5, 0.75])

0.25   2021-05-07
0.50   2022-03-14
0.75   2023-01-09
Name: start_date, dtype: datetime64[ns]

In [4]:
df["__target"] = df["points"]
df["__target"] = df["__target"].clip(0, 40)
df["__target"] 


0         11.0
1         33.0
2          9.0
3         27.0
4         28.0
          ... 
113513     6.0
113514     5.0
113515     6.0
113516     0.0
113517     0.0
Name: __target, Length: 113518, dtype: float64
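
Clipping the target at 40 means a classifier sees at most 41 classes (0 through 40), with every game of 40 or more points falling into the last class. A minimal sketch of that effect on a made-up series (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical point totals, including two games above the cap.
points = pd.Series([11, 33, 9, 52, 0, 41], name="points")

# Same operation as in the cell above: values above 40 collapse into 40.
target = points.clip(0, 40)

print(target.tolist())
print("distinct classes in this toy sample:", target.nunique())
```

One consequence is that class 40 should be read as "40 or more" when interpreting the predicted probabilities.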

In [5]:
from player_performance_ratings.transformation.post_transformers import RollingMeanTransformation
rolling_mean = RollingMeanTransformation(feature_names=["points"], window=15, granularity=["player_id"])
df_rolling_mean = rolling_mean.transform(df)
df_rolling_mean.tail()

Unnamed: 0,game_id,team_id,player_id,player_name,plus_minus,points,minutes,free_throws_attempted,free_throws_made,three_pointers_made,...,offensive_rebounds,score,score_opponent,won,game_minutes,season_id,start_date,team_count,__target,rolling_mean_15_points
113513,22100002,1610612747,1628370,Malik Monk,-10.0,6.0,18.733,0.0,0.0,2.0,...,0.0,114,121,0,48.0,22021,2021-10-19,2,6.0,10.2
113514,22100002,1610612747,2730,Dwight Howard,-7.0,5.0,12.817,4.0,3.0,0.0,...,0.0,114,121,0,48.0,22021,2021-10-19,2,5.0,7.666667
113515,22100002,1610612747,202340,Avery Bradley,1.0,6.0,8.233,0.0,0.0,2.0,...,1.0,114,121,0,48.0,22021,2021-10-19,2,6.0,3.2
113516,22100002,1610612747,1629635,Sekou Doumbouya,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,114,121,0,48.0,22021,2021-10-19,2,0.0,7.133333
113517,22100002,1610612747,1630559,Austin Reaves,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,114,121,0,48.0,22021,2021-10-19,2,0.0,12.266667
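
To make the idea of a per-player rolling mean concrete, here is a rough pandas sketch on toy data. It assumes that only games before the current one should feed the feature (hence the `shift(1)`), so the target does not leak into it. Whether `RollingMeanTransformation` handles this in exactly the same way is an assumption, and the column name `rolling_mean_3_points` is made up for the example.

```python
import pandas as pd

# Toy data: two players with a handful of games each (illustrative values).
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 1, 2, 2, 2],
    "start_date": pd.to_datetime([
        "2023-01-01", "2023-01-03", "2023-01-05", "2023-01-07",
        "2023-01-01", "2023-01-03", "2023-01-05",
    ]),
    "points": [10, 20, 30, 40, 5, 15, 25],
})

# Sort chronologically within each player before computing history-based features.
toy = toy.sort_values(["player_id", "start_date"]).reset_index(drop=True)

# shift(1) drops the current game, then the mean is taken over up to 3 prior games.
toy["rolling_mean_3_points"] = toy.groupby("player_id")["points"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)

print(toy)
```

A player's first game has no history, so the feature is NaN there; LightGBM can handle missing values natively.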


In [6]:
from lightgbm import LGBMClassifier
train = df_rolling_mean[df_rolling_mean['start_date']<'2023-01-09']
test = df_rolling_mean[df_rolling_mean['start_date']>='2023-01-09']
model = LGBMClassifier(verbose=-100, max_depth=2)

model.fit(train[['rolling_mean_15_points']], train['__target'])





In [7]:
from sklearn.metrics import log_loss
probs = model.predict_proba(test[["rolling_mean_15_points"]])
log_loss(test["__target"], probs)

2.611428314905507
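
A log loss of about 2.61 is hard to judge on its own. One rough reference point is a model that assigns equal probability to every class: with 41 classes (0 through 40) that gives ln(41), roughly 3.71. The sketch below computes this with plain NumPy; the helper `multiclass_log_loss` is written for this example and the labels are randomly generated, not taken from the dataset.

```python
import numpy as np

def multiclass_log_loss(y_true: np.ndarray, probs: np.ndarray, eps: float = 1e-15) -> float:
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.log(probs[np.arange(len(y_true)), y_true])))

n_classes = 41  # points 0..40 after clipping
rng = np.random.default_rng(0)
y_true = rng.integers(0, n_classes, size=500)  # made-up labels

# Uniform prediction: every class gets probability 1/41 for every row.
uniform_probs = np.full((len(y_true), n_classes), 1.0 / n_classes)

baseline = multiclass_log_loss(y_true, uniform_probs)
print(baseline, np.log(n_classes))
```

A stronger baseline would predict the empirical class frequencies from the training set, which is what any useful feature set should beat.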

In [8]:
from player_performance_ratings.transformation.post_transformers import LagTransformation
lag_transformation = LagTransformation(feature_names=["points"], lag_length=10, granularity=['player_id'])
df_rolling_mean_lag = lag_transformation.transform(df_rolling_mean)
df_rolling_mean_lag.tail()

Unnamed: 0,game_id,team_id,player_id,player_name,plus_minus,points,minutes,free_throws_attempted,free_throws_made,three_pointers_made,...,lag_1_points,lag_2_points,lag_3_points,lag_4_points,lag_5_points,lag_6_points,lag_7_points,lag_8_points,lag_9_points,lag_10_points
113513,22100002,1610612747,1628370,Malik Monk,-10.0,6.0,18.733,0.0,0.0,2.0,...,20.0,4.0,7.0,8.0,5.0,3.0,20.0,2.0,11.0,11.0
113514,22100002,1610612747,2730,Dwight Howard,-7.0,5.0,12.817,4.0,3.0,0.0,...,5.0,0.0,4.0,4.0,19.0,4.0,10.0,1.0,14.0,19.0
113515,22100002,1610612747,202340,Avery Bradley,1.0,6.0,8.233,0.0,0.0,2.0,...,2.0,7.0,0.0,0.0,3.0,0.0,0.0,6.0,0.0,7.0
113516,22100002,1610612747,1629635,Sekou Doumbouya,0.0,0.0,0.0,0.0,0.0,0.0,...,12.0,5.0,12.0,14.0,20.0,6.0,9.0,14.0,4.0,0.0
113517,22100002,1610612747,1630559,Austin Reaves,0.0,0.0,0.0,0.0,0.0,0.0,...,7.0,9.0,16.0,18.0,15.0,7.0,23.0,20.0,15.0,11.0
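
The lag columns above can be thought of as each player's previous point totals shifted forward in time. A minimal pandas sketch of that idea on toy data follows; it is an illustration of lag features in general, not a claim about how `LagTransformation` is implemented.

```python
import pandas as pd

# Toy data: player 1 has three games, player 2 has two (illustrative values).
toy = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2],
    "start_date": pd.to_datetime([
        "2023-01-01", "2023-01-03", "2023-01-05", "2023-01-01", "2023-01-03",
    ]),
    "points": [10, 20, 30, 5, 15],
})
toy = toy.sort_values(["player_id", "start_date"]).reset_index(drop=True)

# lag_k_points holds the points scored k games earlier by the same player.
for lag in range(1, 3):
    toy[f"lag_{lag}_points"] = toy.groupby("player_id")["points"].shift(lag)

print(toy)
```

Grouping by `player_id` before shifting keeps one player's history from spilling into another player's rows.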


In [9]:
features = rolling_mean.features_created + lag_transformation.features_created
features

['rolling_mean_15_points',
 'lag_1_points',
 'lag_2_points',
 'lag_3_points',
 'lag_4_points',
 'lag_5_points',
 'lag_6_points',
 'lag_7_points',
 'lag_8_points',
 'lag_9_points',
 'lag_10_points']

In [10]:
train = df_rolling_mean_lag[df_rolling_mean_lag['start_date']<'2023-01-09']
test = df_rolling_mean_lag[df_rolling_mean_lag['start_date']>='2023-01-09']
model = LGBMClassifier(verbose=-100, max_depth=2)

model.fit(train[features], train['__target'])
model.feature_importances_

array([2620, 1653, 1136,  952,  841,  835,  883,  800,  941,  735,  814])
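
The importance array is in the same order as the columns passed to `fit`, so it is easier to read when paired with the feature names. The sketch below simply re-enters the values printed above and sorts them; with LightGBM's default settings these numbers count how often each feature is used in a split.

```python
import pandas as pd

# Feature names and importances copied from the outputs above.
features = ["rolling_mean_15_points"] + [f"lag_{i}_points" for i in range(1, 11)]
importances = [2620, 1653, 1136, 952, 841, 835, 883, 800, 941, 735, 814]

importance_by_feature = pd.Series(importances, index=features).sort_values(ascending=False)
print(importance_by_feature)
```

In the notebook itself the same pairing can be built from `model.feature_importances_` and the `features` list.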

In [11]:
import numpy as np
np.set_printoptions(suppress=True)
probs = model.predict_proba(test[features])
log_loss(test["__target"], probs)

2.5573092801648927

In [12]:
from player_performance_ratings.ratings.time_weight_ratings import BayesianTimeWeightedRating
from player_performance_ratings import ColumnNames
column_names = ColumnNames(
    team_id='team_id',
    match_id='game_id',
    start_date="start_date",
    player_id="player_id",
    performance="points",
)
df_rolling_mean_lag = df_rolling_mean_lag.sort_values(by=[column_names.start_date, column_names.match_id, column_names.team_id, column_names.player_id])
time_weight_rating = BayesianTimeWeightedRating()
generated_time_weight_ratings = time_weight_rating.generate(df=df_rolling_mean_lag, column_names=column_names)
for rating_feature, values in generated_time_weight_ratings.items():
    print(rating_feature, values[2300:2305])



time_weighted_rating [8.076933010814129, 8.647562314157026, 7.794975927243408, 7.816208870644075, 7.920269469147545]
time_weighted_rating_likelihood_ratio [0.0676183352518337, 0.08362847921173143, 0.08362847921173143, 0.08362847921173143, 0.08362847921173143]
time_weighted_rating_evidence [4.440800536139237, 12.010783773126832, 1.8158550245718048, 2.069751108786229, 3.314071219403203]


In [13]:
features = rolling_mean.features_created + lag_transformation.features_created
df_rolling_mean_lag_time_weighted = df_rolling_mean_lag.copy()
for rating_feature, values in generated_time_weight_ratings.items():
    features.append(rating_feature)
    df_rolling_mean_lag_time_weighted[rating_feature] = values
    
df_rolling_mean_lag_time_weighted.tail()
    
    

Unnamed: 0,game_id,team_id,player_id,player_name,plus_minus,points,minutes,free_throws_attempted,free_throws_made,three_pointers_made,...,lag_5_points,lag_6_points,lag_7_points,lag_8_points,lag_9_points,lag_10_points,hour_number,time_weighted_rating,time_weighted_rating_likelihood_ratio,time_weighted_rating_evidence
113460,22300037,1610612746,1628464,Daniel Theis,-6.0,2.0,10.45,2.0,0.0,0.0,...,7.0,16.0,6.0,23.0,9.0,9.0,472272,6.661967,0.19629,0.272531
113464,22300037,1610612746,1629599,Amir Coffey,0.0,0.0,0.0,0.0,0.0,0.0,...,8.0,0.0,1.0,0.0,6.0,2.0,472272,6.939268,0.201286,1.378569
113458,22300037,1610612746,1629611,Terance Mann,15.0,1.0,21.15,2.0,1.0,0.0,...,7.0,0.0,5.0,12.0,17.0,5.0,472272,7.956237,0.117313,5.063963
113466,22300037,1610612746,1630538,Bones Hyland,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,4.0,17.0,18.0,12.0,17.0,472272,7.755266,0.201038,9.241945
113465,22300037,1610612746,1631217,Moussa Diabate,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,0.0,2.0,0.0,0.0,2.0,472272,4.438815,0.205546,2.589463


In [14]:
train = df_rolling_mean_lag_time_weighted[df_rolling_mean_lag_time_weighted['start_date']<'2023-01-09']
test = df_rolling_mean_lag_time_weighted[df_rolling_mean_lag_time_weighted['start_date']>='2023-01-09']
model = LGBMClassifier(verbose=-100, max_depth=2)

model.fit(train[features], train['__target'])
model.feature_importances_

array([1325, 1091,  785,  641,  607,  644,  683,  635,  740,  641,  654,
        964, 1044, 1767])

In [15]:
probs = model.predict_proba(test[features])
log_loss(test["__target"], probs)

2.5472253787981827

In [16]:
from player_performance_ratings.ratings.opponent_adjusted_rating.rating_generator import OpponentAdjustedRatingGenerator
column_names_game_winner = ColumnNames(
    team_id='team_id',
    match_id='game_id',
    start_date="start_date",
    player_id="player_id",
    performance="won",
)


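# NOTE: TeamRatingGenerator and StartRatingGenerator are not imported anywhere in this
# notebook; import them from player_performance_ratings before running this cell.
opponent_adjusted_rating_generator = OpponentAdjustedRatingGenerator(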
    team_rating_generator=TeamRatingGenerator(
        start_rating_generator=StartRatingGenerator(
            team_weight=0,
        )
    )
)

generated_opponent_adjusted_ratings = opponent_adjusted_rating_generator.generate(df=df_rolling_mean_lag_time_weighted, column_names = column_names_game_winner)
for rating_feature, values in generated_opponent_adjusted_ratings.items():
    print(rating_feature, values[2300:2305])


rating_difference [-98.32912823  98.32912823  98.32912823  98.32912823  98.32912823]
player_league [None, None, None, None, None]
opponent_league [None, None, None, None, None]
player_rating [914.4060581487137, 1012.7351863741532, 1012.7351863741532, 1012.7351863741532, 1012.7351863741532]
player_rating_change [-21.210377929239037, 21.093975255139277, 21.093975255139277, 21.093975255139277, 21.093975255139277]
match_id ['0022000008', '0022000008', '0022000008', '0022000008', '0022000008']
team_rating [914.4060581487137, 1012.7351863741532, 1012.7351863741532, 1012.7351863741532, 1012.7351863741532]
opponent_rating [1012.7351863741532, 914.4060581487137, 914.4060581487137, 914.4060581487137, 914.4060581487137]
rating_mean [963.57062226 963.57062226 963.57062226 963.57062226 963.57062226]
player_predicted_performance [0.36214165275479165, 0.6378583472452084, 0.6378583472452084, 0.6378583472452084, 0.6378583472452084]
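
The printed values look consistent with a standard Elo expected-score curve: a rating difference of about 98.3 maps to a predicted performance of about 0.638. The sketch below checks this with the usual logistic formula and a scale of 400. This is an observation about the numbers shown above, not a statement about how `OpponentAdjustedRatingGenerator` computes its predictions, and `elo_expected_score` is a helper written for this example.

```python
def elo_expected_score(rating_difference: float, scale: float = 400.0) -> float:
    """Standard Elo expected score for the side with the given rating advantage."""
    return 1.0 / (1.0 + 10.0 ** (-rating_difference / scale))

# Ratings copied from the output above.
team_rating = 1012.7351863741532
opponent_rating = 914.4060581487137
diff = team_rating - opponent_rating

expected_favourite = elo_expected_score(diff)
expected_underdog = elo_expected_score(-diff)

print(round(diff, 4), round(expected_favourite, 4), round(expected_underdog, 4))
```

The two expected scores sum to 1, matching the pair of `player_predicted_performance` values printed for the two teams.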


In [17]:
df_rolling_mean_lag_time_weighted_game_winner_ratings = df_rolling_mean_lag_time_weighted.copy()

df_rolling_mean_lag_time_weighted_game_winner_ratings["team_rating"] = generated_opponent_adjusted_ratings["team_rating"]
df_rolling_mean_lag_time_weighted_game_winner_ratings["opponent_rating"] = generated_opponent_adjusted_ratings["opponent_rating"]
df_rolling_mean_lag_time_weighted_game_winner_ratings["rating_difference"] = generated_opponent_adjusted_ratings["rating_difference"]
features = rolling_mean.features_created + lag_transformation.features_created +  ["team_rating", "opponent_rating", "rating_difference"]
for rating_feature in generated_time_weight_ratings:
    features.append(rating_feature)


df_rolling_mean_lag_time_weighted_game_winner_ratings.tail()

Unnamed: 0,game_id,team_id,player_id,player_name,plus_minus,points,minutes,free_throws_attempted,free_throws_made,three_pointers_made,...,lag_8_points,lag_9_points,lag_10_points,hour_number,time_weighted_rating,time_weighted_rating_likelihood_ratio,time_weighted_rating_evidence,team_rating,opponent_rating,rating_difference
113460,22300037,1610612746,1628464,Daniel Theis,-6.0,2.0,10.45,2.0,0.0,0.0,...,23.0,9.0,9.0,472272,6.661967,0.19629,0.272531,946.343661,966.759973,-20.416312
113464,22300037,1610612746,1629599,Amir Coffey,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,6.0,2.0,472272,6.939268,0.201286,1.378569,946.343661,966.759973,-20.416312
113458,22300037,1610612746,1629611,Terance Mann,15.0,1.0,21.15,2.0,1.0,0.0,...,12.0,17.0,5.0,472272,7.956237,0.117313,5.063963,946.343661,966.759973,-20.416312
113466,22300037,1610612746,1630538,Bones Hyland,0.0,0.0,0.0,0.0,0.0,0.0,...,18.0,12.0,17.0,472272,7.755266,0.201038,9.241945,946.343661,966.759973,-20.416312
113465,22300037,1610612746,1631217,Moussa Diabate,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.0,472272,4.438815,0.205546,2.589463,946.343661,966.759973,-20.416312


In [18]:
df_rolling_mean_lag_time_weighted_game_winner_ratings[df_rolling_mean_lag_time_weighted_game_winner_ratings['rating_difference']>0]["won"].mean()


0.6067119288000844

In [19]:
train = df_rolling_mean_lag_time_weighted_game_winner_ratings[df_rolling_mean_lag_time_weighted_game_winner_ratings['start_date']<'2023-01-09']
test = df_rolling_mean_lag_time_weighted_game_winner_ratings[df_rolling_mean_lag_time_weighted_game_winner_ratings['start_date']>='2023-01-09']
model = LGBMClassifier(verbose=-100, max_depth=2)

model.fit(train[features], train['__target'])
model.feature_importances_

array([1162,  962,  633,  522,  495,  534,  591,  516,  591,  515,  569,
        651,  706,  522,  853,  800, 1611])

In [20]:
probs = model.predict_proba(test[features])
log_loss(test["__target"], probs)

2.55150170174948

## Packaging it all together

It's a bit messy to keep the various transformations, rating models and machine-learning models as separate components. Wouldn't it be easier if we could run all of them at once?
Luckily, that is exactly what the MatchPredictor class does. Below is an example of how this can be accomplished.

Notice that we no longer need to separate the train and predict datasets, or even specify the predictor. Default logic determines how the features are passed to the predictor and which predictor is used.
For proper optimization it is recommended to specify the machine-learning model, but the defaults make it possible to get started quickly.

We also add a few more features below: player_points_per_minute and minutes.


In [21]:
from player_performance_ratings import ColumnNames, PredictColumnNames
from player_performance_ratings.predictor.match_predictor import MatchPredictor

df = df.sort_values(by=[column_names.start_date, column_names.match_id, column_names.team_id, column_names.player_id])
df["player_points_per_minute"] = df["points"] / df["minutes"]


post_rating_transformers = [
    LagTransformation(feature_names=["points", "player_points_per_minute"], lag_length=10, granularity=['player_id']), 
    RollingMeanTransformation(feature_names=["points", "player_points_per_minute"], window=15, granularity=["player_id"]),
    RollingMeanTransformation(feature_names=["minutes"], window=10, granularity=["player_id"])    
]

rating_generators = [BayesianTimeWeightedRating()]


match_predictor = MatchPredictor(column_names=column_names,rating_generators=rating_generators, post_rating_transformers=post_rating_transformers)
df_with_predictions = match_predictor.generate_historical(df)
probabilities = np.stack(df_with_predictions[match_predictor.predictor.pred_column].values)
print(log_loss(df_with_predictions[PredictColumnNames.TARGET], probabilities))



2.474183009807008
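
Each row of the probability matrix is a distribution over the point classes 0 through 40, which is what makes it usable for questions such as "how likely is this player to score more than 20.5 points?". The sketch below shows that calculation on a single made-up probability vector; the bell-shaped weights are purely illustrative and do not come from the model.

```python
import numpy as np

classes = np.arange(41)  # point totals 0..40 (40 means "40 or more" after clipping)

# Hypothetical distribution for one player-game, roughly centred on 18 points.
weights = np.exp(-0.5 * ((classes - 18) / 6.0) ** 2)
probs = weights / weights.sum()

line = 20.5
p_over = probs[classes > line].sum()      # probability of scoring 21 or more
p_under = probs[classes < line].sum()     # probability of scoring 20 or fewer
expected_points = (classes * probs).sum() # mean of the predicted distribution

print(round(p_over, 3), round(p_under, 3), round(expected_points, 2))
```

Because the top class pools every total of 40 or more, the expected value computed this way slightly understates the mean for very high scorers.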
