Imagine we are tasked with predicting the number of points an NBA player will score in a game.
This notebook is a step-by-step guide on how to model that using some of the tools in spforge.



In [1]:
import pandas as pd
from spforge.transformers import LagTransformer, RollingMeanTransformer
df = pd.read_parquet("data/game_player_subsample.parquet")
df.head()

Unnamed: 0,team_id,start_date,game_id,player_id,player_name,start_position,team_id_opponent,points,game_minutes,minutes,won,plus_minus,location,score,score_opponent
38956,1610612755,2022-10-18,22200001,202699,Tobias Harris,F,1610612738,18.0,48.0,34.233,0,-1.0,away,117,126
38957,1610612755,2022-10-18,22200001,200782,P.J. Tucker,F,1610612738,6.0,48.0,33.017,0,-6.0,away,117,126
38958,1610612755,2022-10-18,22200001,203954,Joel Embiid,C,1610612738,26.0,48.0,37.267,0,-13.0,away,117,126
38959,1610612755,2022-10-18,22200001,1630178,Tyrese Maxey,G,1610612738,21.0,48.0,38.2,0,-6.0,away,117,126
38960,1610612755,2022-10-18,22200001,201935,James Harden,G,1610612738,35.0,48.0,37.267,0,1.0,away,117,126


In [2]:
print(f"Data from {df['start_date'].min()} to {df['start_date'].max()} | "
      f"Total rows: {len(df):,} | Unique games: {df['game_id'].nunique():,}")


Data from 2022-10-18 to 2023-02-01 | Total rows: 19,872 | Unique games: 776


To start, let's quickly check whether the data roughly matches what we expect given domain knowledge. Do the highest-scoring NBA players, such as Luka Doncic and Giannis, have the highest points per game?

In [None]:
import importlib.util
import subprocess
import sys

if importlib.util.find_spec("seaborn") is None:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "seaborn"])
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_avg = (
    df.groupby('player_name')['points']
    .mean()
    .reset_index()
    .sort_values(by='points', ascending=False)
).head(20)

sns.set(style="whitegrid")

plt.figure(figsize=(10, 6))
ax = sns.barplot(
    data=df_avg,
    y='player_name',
    x='points',
    palette='viridis'  
)

plt.title('Average Points per Player', fontsize=16, weight='bold')
plt.xlabel('Average Points', fontsize=12)
plt.ylabel('Player', fontsize=12)

sns.despine(left=True, bottom=True)
plt.tight_layout()

plt.show()


Some of the features that could be predictive of future points scored:
* Previous Points scored
* Previous Minutes played
* Previous points_per_minute

In spforge, we can calculate a player's past performance in a few different ways. One of the more straightforward approaches is a RollingMeanTransformer, as seen below.
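
For intuition, the idea behind a per-player rolling mean can be sketched in plain pandas. This is not spforge's implementation: the `past_rolling_mean` helper, the toy data, and the choice to shift by one game (so a row only sees earlier games) are assumptions made for illustration.

```python
import pandas as pd

# Toy data, already sorted chronologically within each player
games = pd.DataFrame({
    "player_id": [1, 1, 1, 1, 2, 2],
    "game_id": [10, 11, 12, 13, 10, 11],
    "points": [10.0, 20.0, 30.0, 40.0, 5.0, 7.0],
})


def past_rolling_mean(frame: pd.DataFrame, column: str, window: int) -> pd.Series:
    """Mean of the previous `window` games per player (hypothetical helper)."""
    return frame.groupby("player_id")[column].transform(
        # shift(1) excludes the current game, so the feature never sees its own target
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )


games["rolling_mean_points2"] = past_rolling_mean(games, "points", window=2)
print(games)
```

Each player's first game has no history, so the feature is missing for that row.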



In [None]:
df['points_per_minute'] = df['points'] / df['minutes']
df = df.sort_values(by=['start_date'])
rm_transformer_window20 = RollingMeanTransformer(
    features=['points', 'points_per_minute', 'minutes'],
    granularity=['player_id'],
    window=20,
    update_column='game_id'
)
df = rm_transformer_window20.transform_historical(df)
df[['player_name',*rm_transformer_window20.features_out]].tail()

Below is Jayson Tatum's 20-window Rolling Mean.

In [None]:
df[df['player_name']=='Jayson Tatum'].set_index('start_date')['rolling_mean_points20'].plot(title="Jayson Tatum Rolling Mean Points")

In [None]:
df[[*rm_transformer_window20.features_out, 'points']].corr()

All 3 features correlate quite strongly with points scored, which indicates that they make sense to use as features in our final machine-learning model.
Below we split into train and test data.

In [None]:
from lightgbm import LGBMRegressor
unique_dates = df['start_date'].unique().tolist()
train_max_date = unique_dates[int(len(unique_dates)*0.7)]
train = df[df['start_date']<=train_max_date]
test = df[df['start_date']>train_max_date]
len(train), len(test)



We use an LGBMRegressor as the machine-learning model with a low max_depth, as we only have 3 features with a relatively straightforward relationship with the target.

In [None]:
features = rm_transformer_window20.features_out
estimator =LGBMRegressor(verbose=-100, max_depth=3)
estimator.fit(train[features], train['points'])
feature_importances = [f/sum(estimator.feature_importances_) for f in estimator.feature_importances_]
importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(8, 4))
plt.bar(importance_df['Feature'], importance_df['Importance'])
# plt.bar puts features on the x-axis and importances on the y-axis
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.tight_layout()
plt.show()


In [None]:
from sklearn.metrics import mean_absolute_error

# Work on a copy so adding a column to the slice does not raise SettingWithCopyWarning
test = test.copy()
test['predicted_points'] = estimator.predict(test[features])
mae = mean_absolute_error(test['points'], test['predicted_points'])
print(f"Mean absolute error: {mae}")

df_avg = (
    test.groupby('player_name')['predicted_points']
    .mean()
    .reset_index()
    .sort_values(by='predicted_points', ascending=False)
).head(20)


sns.set(style="whitegrid")


plt.figure(figsize=(10, 6))
ax = sns.barplot(
    data=df_avg,
    y='player_name',
    x='predicted_points',
    palette='viridis' 
)


plt.title('Average Predicted Points per Player', fontsize=16, weight='bold')
plt.xlabel('Average Predicted Points', fontsize=12)
plt.ylabel('Player', fontsize=12)


sns.despine(left=True, bottom=True)
plt.tight_layout()

plt.show()


Above we can see that the predicted points generally align with what we expected from the entire historical dataset.
This indicates that the model's output isn't completely off.

In [None]:
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(test['points'], test['predicted_points'])
print(f"Mean absolute error: {mae}")

The mean absolute error is 4.27.
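
As a reminder of what this metric measures, the mean absolute error is the average of |actual - predicted| over all rows. The sketch below computes it by hand; the numbers are made up for illustration and are not taken from the dataset.

```python
def mean_absolute_error_by_hand(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up example values
actual_points = [18, 6, 26, 21]
predicted_points = [15, 10, 20, 22]
print(mean_absolute_error_by_hand(actual_points, predicted_points))
```

Read this way, an error of 4.27 means the predictions miss by a little over four points per player-game on average.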

We can probably improve upon that by adding more features. Instead of using only a single rolling-mean window of 20, let's add multiple windows of different lengths, and let's also add the player's past 5 lags.
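
To make the idea of lags concrete, here is a plain-pandas sketch. It is not how `LagTransformer` is implemented; the `add_lags` helper, the toy data, and the `lag_points{k}` column names are hypothetical.

```python
import pandas as pd

# Toy data, already sorted chronologically within each player
games = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2, 2],
    "points": [10.0, 20.0, 30.0, 5.0, 7.0, 9.0],
})


def add_lags(frame: pd.DataFrame, column: str, n_lags: int) -> pd.DataFrame:
    """Add the previous `n_lags` values of `column` per player (hypothetical helper)."""
    out = frame.copy()
    for k in range(1, n_lags + 1):
        # lag k is the value from the player's k-th most recent earlier game
        out[f"lag_{column}{k}"] = out.groupby("player_id")[column].shift(k)
    return out


lagged = add_lags(games, "points", n_lags=2)
print(lagged)
```

Unlike a rolling mean, each lag keeps one past game as its own column, so the model can weight recent games differently from older ones.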

In [None]:
rm_transformer_window10 = RollingMeanTransformer(
    features=['points', 'points_per_minute', 'minutes'],
    granularity=['player_id'],
    window=10,
    update_column='game_id'
)
df = rm_transformer_window10.transform_historical(df)
rm_transformer_window5 = RollingMeanTransformer(
    features=['points', 'points_per_minute', 'minutes'],
    granularity=['player_id'],
    window=5,
    update_column='game_id'
)
df = rm_transformer_window5.transform_historical(df)
rm_transformer_window40 = RollingMeanTransformer(
    features=['points', 'points_per_minute', 'minutes'],
    granularity=['player_id'],
    window=40,
    update_column='game_id'
)
df = rm_transformer_window40.transform_historical(df)

lag_transformer = LagTransformer(
    features=['points', 'points_per_minute', 'minutes'],
    granularity=['player_id'],
    lag_length=5,
    update_column='game_id'
)
df = lag_transformer.transform_historical(df)
df.tail()


In [None]:
all_features = (
    rm_transformer_window20.features_out
    + lag_transformer.features_out
    + rm_transformer_window10.features_out
    + rm_transformer_window5.features_out
    + rm_transformer_window40.features_out
)
train = df[df['start_date']<=train_max_date]
test = df[df['start_date']>train_max_date]
estimator_all_feats =LGBMRegressor(verbose=-100, max_depth=3, random_state=42)
estimator_all_feats.fit(train[all_features], train['points'])
feature_importances = [f/sum(estimator_all_feats.feature_importances_) for f in estimator_all_feats.feature_importances_]
importance_df = pd.DataFrame({
    'Feature': all_features,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(13, 4))
plt.bar(importance_df['Feature'], importance_df['Importance'])
plt.xticks(rotation=90)
# plt.bar puts features on the x-axis and importances on the y-axis
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.tight_layout()
plt.show()

In [None]:
test.loc[:, 'predicted_points_all_features'] = estimator_all_feats.predict(test[all_features])
mae = mean_absolute_error(test['points'], test['predicted_points_all_features'])
print(f"Mean absolute error All Features Model: {mae}")

As expected, we get a better mean absolute error.

However, this process has been a bit tedious. We had to manually add all the features from the various transformers. We had to create the train-test split ourselves. And we didn't even use proper cross-validation with multiple splits. If we had added more splits, it would have been even more complex.

So far we have only explored historical data. How would we go about applying the lag and rolling-mean transformers to future games? To calculate features for future games, the transformers need access to the historical data as well, which increases complexity.
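
One way to picture the extra bookkeeping: to build features for a game that has not been played yet, something has to remember each player's recent results. The sketch below is a hypothetical, standard-library-only illustration of that idea; the `PlayerHistory` class is not part of spforge.

```python
from collections import defaultdict, deque


class PlayerHistory:
    """Keeps the last `window` point totals per player (hypothetical helper)."""

    def __init__(self, window: int):
        self._games = defaultdict(lambda: deque(maxlen=window))

    def update(self, player_id: int, points: float) -> None:
        # Called once a game has been played and its result is known
        self._games[player_id].append(points)

    def rolling_mean(self, player_id: int):
        # Feature for an upcoming game: mean of stored past games, None if no history
        past = self._games.get(player_id)
        if not past:
            return None
        return sum(past) / len(past)


history = PlayerHistory(window=3)
for points in [10.0, 20.0, 30.0, 40.0]:
    history.update(player_id=1, points=points)

print(history.rolling_mean(1))   # mean of the three most recent games
print(history.rolling_mean(99))  # player with no stored games
```

The state has to be stored and updated game by game, which is the part that becomes tedious to manage by hand.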

Luckily, the spforge Pipeline is designed to do all of these things with little input from the user.

In [None]:
from spforge import Pipeline, ColumnNames
from spforge.predictor import SklearnPredictor
# Reload the original dataframe to start again from scratch
df = pd.read_parquet("data/game_player_subsample.parquet")
df['points_per_minute'] = df['points'] / df['minutes']
train = df[df['start_date']<=train_max_date]
test = df[df['start_date']>train_max_date]
column_names = ColumnNames(
    team_id="team_id",
    match_id="game_id",
    start_date="start_date",
    player_id="player_id"    
)
predictor = SklearnPredictor(estimator=LGBMRegressor(max_depth=4,verbose=-100), target='points')

pipeline = Pipeline(
    lag_transformers = [lag_transformer, rm_transformer_window5, rm_transformer_window10, rm_transformer_window20, rm_transformer_window40],
    predictor = predictor,
    column_names=column_names
)

pipeline.train(train)
test = pipeline.predict(test, cross_validation=True, return_features=True)
test.tail()

In [None]:
mae = mean_absolute_error(test['points'], test[pipeline.pred_column])
print(f"Mean absolute error Pipeline Model: {mae}")

As seen below, the same features are used by the LGBMRegressor.

In [None]:
print(pipeline.predictor.features)

More experienced users of machine learning may be aware that performing only a single train-test split is not best practice. Ideally, we perform cross-validation with multiple splits.

spforge has support for that as well. 
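
For intuition, one common way to build several time-ordered splits is an expanding window: train on everything up to a cut-off date and validate on the next block of dates. The sketch below is a generic illustration under that assumption. It does not describe how `MatchKFoldCrossValidator` chooses its folds, and `expanding_date_splits` is a hypothetical helper.

```python
def expanding_date_splits(dates, n_splits, min_train_fraction=0.5):
    """Yield (train_dates, validation_dates) pairs in chronological order."""
    unique_dates = sorted(set(dates))
    first_cut = int(len(unique_dates) * min_train_fraction)
    remaining = len(unique_dates) - first_cut
    if first_cut == 0 or remaining < n_splits:
        raise ValueError("not enough distinct dates for the requested splits")
    fold_size = remaining // n_splits
    for i in range(n_splits):
        start = first_cut + i * fold_size
        # the last fold absorbs any leftover dates
        end = len(unique_dates) if i == n_splits - 1 else start + fold_size
        yield unique_dates[:start], unique_dates[start:end]


dates = [f"2023-01-{day:02d}" for day in range(1, 11)]
for train_dates, val_dates in expanding_date_splits(dates, n_splits=2):
    print(train_dates[-1], "->", val_dates)
```

Every validation block lies strictly after its training dates, so no fold is evaluated on games that precede its training data.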

In [None]:
from spforge.cross_validator import MatchKFoldCrossValidator

cross_validator = MatchKFoldCrossValidator(
    match_id_column_name='game_id',
    date_column_name='start_date',
    predictor = pipeline,
)
df = cross_validator.generate_validation_df(df,  add_train_prediction=True, return_features=True)
df.tail()

Note that we added a parameter to return predictions for the non-validated (training) data split as well. The is_validation column flags whether a prediction comes from the validation split or from the training split.

Returning the training predictions is useful for comparing accuracy on training data against accuracy on validation data. It also helps when the prediction output will be used as an input to another model.
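
As a hedged illustration of the first use case, the sketch below compares the error on the two splits with plain pandas. The toy values are invented, the `prediction` column name is a placeholder, and it assumes `is_validation` equals 1 for validation rows and 0 for training rows, which should be checked against the actual output.

```python
import pandas as pd

# Invented values; `prediction` stands in for the pipeline's prediction column
scored = pd.DataFrame({
    "points": [10.0, 20.0, 30.0, 12.0, 18.0, 25.0],
    "prediction": [11.0, 19.0, 29.0, 16.0, 14.0, 21.0],
    "is_validation": [0, 0, 0, 1, 1, 1],  # assumed encoding: 1 = validation row
})

abs_error = (scored["points"] - scored["prediction"]).abs()
mae_by_split = abs_error.groupby(scored["is_validation"]).mean()
print(mae_by_split)
```

A validation error far above the training error would suggest the model is overfitting.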

To measure the cross-validated mean_absolute_error we need to import an Sklearn Scorer Wrapper. This wrapper will automatically filter out the training-data predictions.

In [None]:
from spforge.scorer import SklearnScorer
mean_absolute_scorer = SklearnScorer(pred_column=pipeline.pred_column, scorer_function=mean_absolute_error, target='points')
cross_validator.cross_validation_score(df, scorer=mean_absolute_scorer )

Some readers may think: but what about the opponent? So far we have only used the player's own historical stats. But if a player is facing a strong team, chances are he will score fewer points.

There are a few ways we can handle that. The simplest approach is to add the opponent as a categorical feature.
Because we already calculated the lag and rolling-mean features, we don't need to recalculate them. The pipeline has the same methods as a normal predictor, so any predictor type can be passed into MatchKFoldCrossValidator in place of the pipeline.

Instead, we pass a plain predictor that uses the opponent as a categorical feature along with the features generated by the pipeline.
We also set convert_cat_features_to_cat_dtype to True. This ensures non-numeric features are converted to categorical.
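
For reference, this is what a categorical dtype looks like in plain pandas. LightGBM treats pandas `category` columns as categorical features. Whether `convert_cat_features_to_cat_dtype` performs exactly this conversion internally is an assumption, and the toy rows are illustrative.

```python
import pandas as pd

games = pd.DataFrame({
    "team_id_opponent": [1610612738, 1610612755, 1610612738],
    "points": [18.0, 6.0, 26.0],
})

# Without this, an integer id would be read as an ordered numeric value
games["team_id_opponent"] = games["team_id_opponent"].astype("category")

print(games.dtypes)
print(games["team_id_opponent"].cat.categories.tolist())
```

Treating the id as a category lets the model learn a separate effect per opponent instead of assuming that larger ids mean anything.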

In [None]:
predictor_cat_feats = SklearnPredictor(estimator=LGBMRegressor(max_depth=4,verbose=-100), target='points', features = ['team_id_opponent',*pipeline.features], 
                                       convert_cat_features_to_cat_dtype=True)

cross_validator_cat_feats = MatchKFoldCrossValidator(
    match_id_column_name='game_id',
    date_column_name='start_date',
    predictor = predictor_cat_feats,
)
df = cross_validator_cat_feats.generate_validation_df(df,  add_train_prediction=True)
cross_validator_cat_feats.cross_validation_score(df, scorer=mean_absolute_scorer )

This shows a small improvement. There are definitely better and more dynamic ways to account for the opponent. Rating models are one way of doing that, but they are beyond the scope of this guide.

Finally, let's assume we have a future match that we want to generate predictions for. So far we have only performed cross-validation; we never trained the pipeline on the entire dataset.


In [None]:
final_predictor = SklearnPredictor(estimator=LGBMRegressor(max_depth=4,verbose=-100), target='points', features = ['team_id_opponent'], 
                                       convert_cat_features_to_cat_dtype=True)
final_pipeline = Pipeline(
    lag_transformers = [lag_transformer, rm_transformer_window5, rm_transformer_window10, rm_transformer_window20, rm_transformer_window40],
    predictor = final_predictor,
    column_names=column_names
)

final_pipeline.train(df)


In [None]:
team_id_1= df['team_id'].iloc[0]
team_id_2= df['team_id_opponent'].iloc[1]
player_id_stephen_curry = df[df['player_name']=='Stephen Curry']['player_id'].iloc[1]
player_id_jayson_tatum = df[df['player_name']=='Jayson Tatum']['player_id'].iloc[1]
player_id_kevin_durant = df[df['player_name']=='Kevin Durant']['player_id'].iloc[1]
player_id_trae_young = df[df['player_name']=='Trae Young']['player_id'].iloc[1]
future_game = pd.DataFrame(
    {
        "game_id": ["99999"]* 4,
        "start_date": [pd.to_datetime(df['start_date'].max()) + pd.Timedelta(days=1)]*4,
        'team_id': [team_id_1, team_id_1, team_id_2, team_id_2 ],
        'team_id_opponent': [team_id_2, team_id_2, team_id_1, team_id_1 ],
        'player_id': [player_id_stephen_curry, player_id_jayson_tatum, player_id_kevin_durant, player_id_trae_young],
        'player_name':['Stephen Curry', 'Jayson Tatum', 'Kevin Durant', 'Trae Young']
    }
)
future_game.head()
final_pipeline.predict(future_game, return_features=True).head()[['player_name', final_pipeline.pred_column]]


The above example is obviously a nonsensical 2-on-2 match. The model knows nothing about the teammates or the number of players on the opposing team. Thus it will generate fairly naive predictions that assume a "normal" game.