In [None]:
import pandas as pd
pd.set_option("display.latex.repr", True)

# Summary

I built a regression model using machine learning techniques to predict the number of minutes a Canadian University basketball player is expected to play in an upcoming game. My final model performed well on an unseen test data set, achieving a mean squared error ($MSE$) of 37.86 and a coefficient of determination ($R^2$) of 0.644. Both metrics improve on the baseline, a player's 5-game average minutes played, which scored an $MSE$ of 69.44 and an $R^2$ of 0.337. The results represent significant value in the context of sports analytics and coaching decisions, and the prediction model could be used as is. However, I note possible areas of further improvement that, if explored, could provide improved predictions.

# Introduction

Sports analytics represents a growing field in both professional and amateur athletics. According to recent trends, data-driven decision making is becoming increasingly important in basketball at all levels. The ability to predict player performance and playing time has significant value for coaches, players, and analysts. For Canadian University basketball specifically, understanding playing time patterns can help coaches optimize rotations, players understand their role expectations, and teams develop strategic insights.

For this project I test various machine learning models to predict the total number of minutes a Canadian University basketball player will play in an upcoming game. Playing time is a crucial metric in basketball as it directly correlates with player impact, team strategy, and overall performance. Therefore, having a more accurate prediction of a player's expected minutes in an upcoming game would provide significant value for coaching decisions, player development, and team strategy.

This work is inspired by and builds upon the NBA Minutes Predictor repository, which demonstrated the value of machine learning approaches in sports prediction. I have adapted their methodology for Canadian University basketball data, showing how similar techniques can be applied to different basketball contexts.

\pagebreak

# Methods

## Data

The data set used in this project is of Canadian University basketball player box scores (2022-2024) containing comprehensive player statistics for each game. Each row in the data set represents a player's box score statistics for a particular game, including points, assists, rebounds, shooting percentages, and other key metrics. The data was collected from Canadian University basketball games across multiple seasons, providing a robust dataset for analysis. There were 39,586 data examples (rows) representing 1,250+ unique players across the 2022-2024 seasons.

## Analysis


I explored predicting with three separate regression models: a simple linear regression model, a random forest regressor, and a light gradient boosting model (LGBM). 
The `Python` programming language and the following Python packages were used to perform the analysis: `pandas`, `numpy`, `docopt`, `requests`, `tqdm`, `scikit-learn`, `matplotlib`, `seaborn`, `lightgbm`, `termcolor`. The code used to perform the analysis and create this report can be found in the repository.
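As a rough illustration of how such a comparison can be set up, the sketch below fits two of the three model types on synthetic data with `scikit-learn`. The features, target, and hyper-parameters here are invented for the example and are not the ones used in this project; LightGBM is left out only to keep the sketch self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): two features loosely resembling
# a rolling minutes average and a player rating.
rng = np.random.default_rng(0)
n = 500
last5_mean = rng.uniform(0, 35, n)
rating = rng.normal(0, 1, n)
X = np.column_stack([last5_mean, rating])
y = np.clip(0.8 * last5_mean + 2.0 * rating + rng.normal(0, 4, n), 0, 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyper-parameters are illustrative; a shallow max_depth is one common way
# to limit overfitting in tree ensembles.
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(
        n_estimators=100, max_depth=5, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # .score() returns the coefficient of determination on held-out data.
    scores[name] = model.score(X_test, y_test)

print(scores)
```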

# Results & Discussion

To reduce the complexity of the models, and to remove noisy features, I did a substantial amount of feature engineering. Additionally, regarding the prediction models, the linear regression model represents the simplest approach and is a good basic benchmark, while the tree models were chosen as they work well for continuous count data like in our dataset. The hyper-parameters for each model were chosen through an iterative approach in order to reduce overfitting.

After all of my feature analysis I was left with several categories of stats for each example: historical minutes played per game, player efficiency metrics, and derived performance indicators. The minutes stats were calculated based on various rolling averages (3, 5, and 10-game windows) and exponential moving averages with different decay rates. The player efficiency metrics included usage rate, true shooting percentage, and effective field goal percentage. Finally, I added per-minute statistics and composite player ratings. All of the features were developed through my knowledge of the game, and by examining correlations with the target, as seen in Figure 1:
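The rolling and exponentially weighted minutes features can be sketched with `pandas` roughly as follows. This is a minimal illustration, not the project's actual pipeline: the input column names (`Player`, `GameDate`, `Mins`), the toy values, and the decay rate `alpha=0.1` are assumptions. Each feature is computed from prior games only, so the target game never leaks into its own features.

```python
import pandas as pd

# Toy box-score rows; column names and values are assumptions for illustration.
box = pd.DataFrame({
    "Player": ["A"] * 6 + ["B"] * 6,
    "GameDate": list(pd.date_range("2023-01-01", periods=6, freq="7D")) * 2,
    "Mins": [20, 24, 28, 30, 26, 32, 5, 8, 0, 12, 10, 15],
})
box = box.sort_values(["Player", "GameDate"]).reset_index(drop=True)

# Shift by one game within each player so features only see past games.
prior = box.groupby("Player")["Mins"].shift(1)
by_player = prior.groupby(box["Player"])

# Rolling means over 3-, 5-, and 10-game windows.
for window in (3, 5, 10):
    box[f"Mins_last{window}_mean"] = by_player.transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )

# Exponentially weighted standard deviation of prior minutes
# (alpha=0.1 is an assumed decay rate).
box["Mins_ewm_std"] = by_player.transform(lambda s: s.ewm(alpha=0.1).std())

print(box.round(2))
```

A player's first game has no history, so its features are `NaN`; such rows would need to be dropped or imputed before training.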

![Feature Correlations](results/EDA-feat_corr_canadian.png)

> _Figure 1. Feature Correlations with Target_

From Figure 1 we can see that the most correlated stats are based on the players' rolling average minutes played with various window sizes ("Mins_last3_mean", "Mins_last5_mean", "Mins_last10_mean"). The player ratings ("PlayerRating_ewm_0.1", "PlayerRating_ewm_0.2") also appear to be highly correlated with the target, but to a lesser degree. We can see that the players' exponentially weighted minutes standard deviation (as noted by "Mins_ewm_std") is negatively correlated with the target, which represents an interesting but sensible insight. Including both positively and negatively correlated features in my training data gave a mix of features that could reasonably be expected to interact well and produce good predictions.

I chose to evaluate my model by comparing my predictions to a player's five-game historical average minutes played (the base model). I chose a five-game window because it represents a reasonable timeframe for recent performance trends, and it makes a good test because it is reasonable to assume that a player's five-game average would be a good indication of their expected minutes in an upcoming game. I compared the prediction errors (mean squared error and $R^2$) of each model and the base model, as follows:
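For reference, the two metrics can be computed as in the sketch below, which scores a baseline and a model against the same targets. The numbers are made up for illustration and are not taken from this project's results; the helper names `mse` and `r2` are my own.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted minutes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative values only (assumptions, not project data).
actual = [30, 10, 22, 18]          # minutes actually played
baseline = [26, 14, 20, 22]        # five-game average used as the prediction
model = [29, 12, 21, 19]           # predictions from a fitted model

for name, pred in {"baseline": baseline, "model": model}.items():
    print(f"{name}: MSE={mse(actual, pred):.2f}, R2={r2(actual, pred):.3f}")
```

Because both sets of predictions are scored against the same targets, a lower $MSE$ always comes with a higher $R^2$.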



In [None]:
print('\n', pd.read_csv('results/modelling-score_table_canadian.csv', index_col=0), '\n')

> _Figure 2. Comparison of Model Fitness_

From Figure 2 we can see that all of the regression models beat the base model in terms of both the mean squared error and the coefficient of determination. The Random Forest model slightly outperformed the other regression models, achieving an $R^2$ of 0.644 compared to Linear Regression's 0.640 and LightGBM's 0.642. In addition to the above analysis, I also examined the prediction residuals for each model to get a visual idea of how the predictions performed throughout the range of minutes played in a game. The residual results are displayed below in Figure 3:

![Residuals Plot](results/residual_plots_canadian.png)

> _Figure 3. Model Residual Error_

We can see from the residuals that all models performed reasonably well across the range of minutes played. The models show some variation at the extremes, but that is expected: predicting very low minutes (due to injuries or coach decisions) and very high minutes (due to overtime or exceptional circumstances) are uniquely challenging problems that cannot be fully modeled with the data present in the data set.

Based on the above analysis and results, I chose to look further into the Random Forest model to identify what were the most important features. I show the resulting feature importance plot in Figure 4 below:

![Feature Importance](results/modelling-gbm_importance_canadian.png)

> _Figure 4. Random Forest Feature Importance_

We can see from Figure 4 that the players' previous minutes played as measured by the rolling averages ("Mins_last3_mean", "Mins_last5_mean") are the most important features, followed by player rating metrics and efficiency statistics. We can see that all of the features had fairly good representation in the model. Additionally, the historical minutes stats are generally more important than the player rating and efficiency stats, which aligns with the intuitive understanding that recent playing time is the strongest predictor of future playing time.

Overall, I demonstrate that using a player's historical average stats to predict minutes in an upcoming game can and does beat a baseline model that uses only their previous 5-game minutes average. The machine learning models, particularly Random Forest, achieve a higher $R^2$ and lower error than the baseline and represent a significant advantage over simple statistical methods.

To further improve this model there are several areas that should be explored, particularly for predicting the extremes of player minutes (i.e. the high end and the low end). Player injuries are tough to predict, but knowing whether a player's teammate will be in or out of the lineup on game day would add significant insight. For example, if a player who normally plays 30 minutes is out of the lineup for the upcoming game, then the missing minutes will be distributed among the remaining players in the lineup, or possibly go to his replacement (by position). Additionally, adding a feature indicating the tightness of games could help identify players who will play more minutes. The thinking goes that in a blow-out game the team will rest its starters, but if the score remains close until the end then the starters might get more minutes. This could be measured via the game context and how evenly matched the teams are expected to be. Finally, more sophisticated approaches could be explored, including model stacking and ensembles, and rigorous cross-validation over the hyper-parameter set. I didn't fully explore these in my report but they could offer additional accuracy gains.
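One way the lineup-availability idea could be turned into a feature is sketched below. This is purely hypothetical and was not part of the analysis: the function name `redistribute_minutes`, the toy roster, and the proportional split (each active player absorbs missing minutes in proportion to their own average) are all assumptions. A position-aware split would be a natural alternative.

```python
def redistribute_minutes(avg_minutes, absent):
    """Spread absent players' average minutes over the active roster.

    avg_minutes: dict of player name -> average minutes per game.
    absent: set of player names not in the lineup.
    Returns a dict of active player -> adjusted expected minutes.
    Assumption: missing minutes are split in proportion to each active
    player's own average.
    """
    missing = sum(avg_minutes[p] for p in absent)
    active = {p: m for p, m in avg_minutes.items() if p not in absent}
    total_active = sum(active.values())
    if total_active == 0:
        # Nothing to scale against; return the active roster unchanged.
        return dict(active)
    return {p: m + missing * m / total_active for p, m in active.items()}

# Hypothetical roster: the 30-minute starter is out for the upcoming game.
roster = {"starter": 30.0, "wing": 20.0, "guard": 20.0, "bench": 10.0}
adjusted = redistribute_minutes(roster, absent={"starter"})
print(adjusted)
```

The adjusted values, or their difference from each player's usual average, could then be supplied to the model as an extra input column.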

## Critical Assessment and Limitations

While the models show significant improvement over baseline methods, the moderate R² score (0.644) indicates substantial room for improvement. The analysis reveals both the strengths and limitations of current sports prediction methodologies.

**Key Limitations Identified:**
- Missing contextual data (injuries, opponent strength, team strategy)
- No academic factors (grades, eligibility, academic standing)
- Limited temporal scope (only 2022-2024 data)
- No team-specific factors (coach preferences, team culture, roster depth)

**Areas for Improvement:**
- Collect additional contextual features
- Implement ensemble methods
- Add hyperparameter optimization
- Include domain-specific evaluation metrics

This work establishes a solid foundation for sports analytics but highlights the complexity of predicting human decisions in sports. The methodology provides value for understanding playing time patterns and could serve as a starting point for more sophisticated sports prediction systems.

\pagebreak