In [None]:
import pandas as pd
pd.set_option("display.latex.repr", True)

# Summary

This project develops a machine learning system to forecast playing time for Canadian University basketball players. Using historical game data from the 2022-2024 seasons, I built predictive models that achieved a mean squared error (MSE) of 37.86 and an R² of 0.644 on test data. These results substantially outperform a simple 5-game rolling average baseline (MSE: 69.44, R²: 0.337), cutting MSE by roughly 45% and demonstrating the value of feature engineering and machine learning in sports analytics. The models can inform coaching decisions and player development strategies.

# Introduction

Basketball analytics has evolved from simple box score analysis to sophisticated predictive modeling. At the university level, understanding playing time patterns offers unique insights into team dynamics, player development, and strategic decision-making. Canadian University basketball presents an interesting case study due to its distinct competitive environment, academic constraints, and developmental focus.

This analysis focuses on predicting individual player minutes in upcoming games using machine learning techniques. Playing time serves as a fundamental metric that influences team strategy, player development trajectories, and overall team performance. Accurate predictions can support coaching decisions, help players understand their roles, and provide insights into team dynamics.

The methodology builds upon established sports analytics principles, adapting techniques from professional basketball analysis to the university context. This work demonstrates how machine learning approaches can be successfully applied to different basketball environments, providing a framework for similar analyses in other collegiate sports.

\pagebreak

# Methods

## Data

The dataset comprises Canadian University basketball game statistics spanning the 2022-2024 seasons. Each observation represents a player's performance in a specific game, including traditional box score metrics (points, rebounds, assists) alongside advanced statistics (shooting percentages, efficiency metrics). The dataset contains 39,586 game records from over 1,250 unique players, providing sufficient variation for robust model development while maintaining the specific characteristics of university-level competition.

## Analysis


The analysis employs three machine learning approaches: linear regression for interpretability, random forest for capturing non-linear relationships, and LightGBM for gradient-boosted trees. The implementation uses Python, with `pandas` for data manipulation, `scikit-learn` for traditional machine learning algorithms, `lightgbm` for gradient boosting, and `matplotlib`/`seaborn` for visualization. The complete analysis pipeline and supporting code are available in the project repository.
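As a rough illustration of the three-model comparison, the sketch below fits each estimator on synthetic data and records test MSE and R². The data, hyperparameters, and 80/20 ordered split are assumptions made for this example and are not the project's actual configuration. `LGBMRegressor` is included only if `lightgbm` is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the engineered feature table (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 20 + 5 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(scale=2, size=500)

# Ordered split: first 80% of rows for training, the rest for testing.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
try:
    from lightgbm import LGBMRegressor  # optional dependency

    models["LightGBM"] = LGBMRegressor(n_estimators=200, random_state=0)
except ImportError:
    pass  # skip LightGBM when the package is unavailable

# Fit every model on the same split and collect the two reported metrics.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "MSE": mean_squared_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    }
```

Keeping all estimators behind the same `fit`/`predict` loop makes it straightforward to evaluate them on an identical split.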

# Results & Discussion

Feature engineering played a crucial role in model development. The analysis created several feature categories: historical playing time patterns, player efficiency metrics, and composite performance indicators. Historical features included rolling averages across different time windows (3, 5, and 10 games) and exponential moving averages with varying decay rates. Efficiency metrics encompassed usage rate, true shooting percentage, and effective field goal percentage. Additional features included per-minute statistics and composite player ratings.
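The rolling and exponentially weighted features described above could be built along the following lines. This is a minimal sketch, not the project's code: the column names `Player`, `Date`, and `Mins` and the feature name `Mins_ewm_0.2` are hypothetical, and the one-game shift is an assumption made so that each row only uses games played before it.

```python
import pandas as pd


def add_minutes_history(games: pd.DataFrame) -> pd.DataFrame:
    """Add lagged rolling and exponentially weighted minutes features.

    Assumes hypothetical columns `Player`, `Date`, and `Mins`.
    """
    out = games.sort_values(["Player", "Date"]).copy()
    # Shift by one game per player so the current game never leaks into its own features.
    prior = out.groupby("Player")["Mins"].shift(1)
    by_player = prior.groupby(out["Player"])
    for window in (3, 5, 10):
        out[f"Mins_last{window}_mean"] = by_player.transform(
            lambda s: s.rolling(window, min_periods=1).mean()
        )
    # alpha=0.2 is an illustrative decay rate, not a tuned value.
    out["Mins_ewm_0.2"] = by_player.transform(lambda s: s.ewm(alpha=0.2).mean())
    return out


# Tiny made-up example: two players, six games.
games = pd.DataFrame(
    {
        "Player": ["A"] * 4 + ["B"] * 2,
        "Date": pd.to_datetime(
            ["2023-01-01", "2023-01-05", "2023-01-08", "2023-01-12",
             "2023-01-01", "2023-01-05"]
        ),
        "Mins": [10.0, 20.0, 30.0, 40.0, 5.0, 15.0],
    }
)
features = add_minutes_history(games)
```

Each player's first game has no history, so its features are missing and would need to be dropped or imputed before modelling.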

Feature selection involved examining correlations with the target variable and domain knowledge of basketball dynamics. The correlation analysis, shown in Figure 1, reveals that recent playing time patterns ("Mins_last3_mean", "Mins_last5_mean", "Mins_last10_mean") exhibit the strongest relationships with future minutes. Player rating metrics ("PlayerRating_ewm_0.1", "PlayerRating_ewm_0.2") show moderate correlations, while volatility measures ("Mins_ewm_std") display negative correlations, indicating that players whose recent minutes fluctuate more tend to play fewer minutes in upcoming games.

![Feature Correlations](results/EDA-feat_corr_canadian.png)

> _Figure 1. Feature Correlations with Target_
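A correlation ranking of this kind can be computed with `DataFrame.corrwith`. The sketch below uses synthetic data whose signs are chosen only to mimic the pattern described above; the helper name `rank_feature_correlations` is hypothetical.

```python
import numpy as np
import pandas as pd


def rank_feature_correlations(features: pd.DataFrame, target: pd.Series) -> pd.Series:
    """Pearson correlation of each numeric feature with the target, strongest first."""
    corr = features.select_dtypes("number").corrwith(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic demo data (assumption): not the project's feature table.
rng = np.random.default_rng(0)
next_mins = pd.Series(rng.uniform(0, 40, size=200))
demo = pd.DataFrame(
    {
        "Mins_last5_mean": next_mins + rng.normal(scale=5, size=200),
        "Mins_ewm_std": -0.2 * next_mins + rng.normal(scale=5, size=200),
    }
)
ranked = rank_feature_correlations(demo, next_mins)
```

Sorting by absolute value keeps strongly negative features, such as a volatility measure, near the top of the ranking rather than at the bottom.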

Model evaluation compared machine learning approaches against a 5-game rolling average baseline. This baseline represents a reasonable expectation for recent performance trends and provides a practical benchmark for assessing model improvement. Performance metrics included mean squared error and coefficient of determination (R²) to capture both absolute and relative prediction accuracy.
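To make the baseline and the two metrics concrete, here is a small NumPy sketch. It assumes the baseline predicts each game as the mean of up to five preceding games for the same player. The function names and the sample minutes are made up for illustration.

```python
import numpy as np


def rolling_baseline(minutes, window=5):
    """Predict each game as the mean of up to `window` preceding games."""
    minutes = np.asarray(minutes, dtype=float)
    preds = np.full(len(minutes), np.nan)  # the first game has no history
    for i in range(1, len(minutes)):
        preds[i] = minutes[max(0, i - window):i].mean()
    return preds


def mse(y_true, y_pred):
    """Mean squared error: average of squared prediction errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


# Made-up minutes for one player across seven games.
minutes = np.array([20.0, 22.0, 18.0, 25.0, 30.0, 28.0, 12.0])
preds = rolling_baseline(minutes)
valid = ~np.isnan(preds)
baseline_mse = mse(minutes[valid], preds[valid])
baseline_r2 = r_squared(minutes[valid], preds[valid])
```

MSE is expressed in squared units of the target, while R² compares the model's squared error with that of always predicting the mean, which is why the two are reported together.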



In [None]:
# Cell metadata: {"tags": ["hide_input"]}
# Load and display the saved model score table.
score_table = pd.read_csv("results/modelling-score_table_canadian.csv", index_col=0)
print("\n", score_table, "\n")

> _Figure 2. Comparison of Model Fitness_

The results show that all three machine learning models substantially outperform the baseline. The Random Forest model achieved the highest R² (0.644), followed closely by LightGBM (0.642) and Linear Regression (0.640), so the three models are separated by less than 0.005. All of them improve markedly on the baseline's R² of 0.337.
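The size of the improvement is easier to judge in relative terms and as root mean squared error (RMSE). The short calculation below uses only the reported test-set figures, taking 37.86 as the best model's MSE and 69.44 as the baseline's. Interpreting RMSE in minutes assumes the target is measured in minutes.

```python
import math

baseline_mse, model_mse = 69.44, 37.86  # reported test-set values

# Relative reduction in squared error versus the rolling-average baseline.
mse_reduction = 1 - model_mse / baseline_mse

# RMSE is on the same scale as the target rather than its square.
baseline_rmse = math.sqrt(baseline_mse)
model_rmse = math.sqrt(model_mse)

print(f"MSE reduction: {mse_reduction:.1%}")  # MSE reduction: 45.5%
print(f"RMSE: {baseline_rmse:.2f} -> {model_rmse:.2f}")  # RMSE: 8.33 -> 6.15
```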

Residual analysis, presented in Figure 3, provides insights into model behavior across the range of playing time values. All models show reasonable performance across typical playing time ranges, with some degradation at the extremes. This pattern reflects the inherent challenges in predicting unusual circumstances such as injuries, foul trouble, or overtime situations.

![Residuals Plot](results/residual_plots_canadian.png)

> _Figure 3. Model Residual Error_
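One way to quantify the degradation at the extremes is to summarize residuals by bucket of actual minutes. The sketch below is illustrative only: the bin edges, the helper name `residual_summary`, and the sample values are assumptions, not results from this project.

```python
import numpy as np
import pandas as pd


def residual_summary(y_true, y_pred, bins=(0, 10, 20, 30, 50)):
    """Mean, spread, and count of residuals (actual - predicted) per minutes bucket."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    frame = pd.DataFrame({"actual": y_true, "residual": y_true - y_pred})
    frame["bucket"] = pd.cut(frame["actual"], bins=list(bins), include_lowest=True)
    return frame.groupby("bucket", observed=True)["residual"].agg(["mean", "std", "count"])


# Made-up predictions that shrink toward the middle of the range.
actual = np.array([2.0, 8.0, 15.0, 25.0, 36.0, 38.0])
predicted = np.array([6.0, 10.0, 15.0, 24.0, 30.0, 31.0])
summary = residual_summary(actual, predicted)
```

In this toy example the lowest bucket has a negative mean residual (over-prediction) and the highest bucket a positive one (under-prediction), which is the typical signature of a model that pulls extreme values toward the average.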

Feature importance analysis for the Random Forest model, shown in Figure 4, reveals that recent playing time patterns dominate the prediction process. Rolling averages of minutes played ("Mins_last3_mean", "Mins_last5_mean") emerge as the most influential features, followed by player rating metrics and efficiency statistics. This hierarchy aligns with basketball intuition, where recent playing time serves as the strongest indicator of future opportunities.

![Feature Importance](results/modelling-gbm_importance_canadian.png)

> _Figure 4. Random Forest Feature Importance_
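For reference, impurity-based importances can be read from a fitted scikit-learn forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which recent minutes dominate by construction. It illustrates the mechanics only and does not reproduce the figure.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features (assumption); the `noise` column carries no signal.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "Mins_last3_mean": rng.uniform(0, 40, 400),
        "PlayerRating_ewm_0.1": rng.normal(size=400),
        "noise": rng.normal(size=400),
    }
)
y = (
    0.9 * X["Mins_last3_mean"]
    + 1.5 * X["PlayerRating_ewm_0.1"]
    + rng.normal(scale=3, size=400)
)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalized to sum to 1 across features.
importance = pd.Series(forest.feature_importances_, index=X.columns).sort_values(
    ascending=False
)
```

Impurity-based importances can be split arbitrarily among highly correlated features such as overlapping rolling averages, so the ordering among those features should be read with some caution.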

The analysis demonstrates that machine learning models provide meaningful improvements over a simple rolling average for predicting playing time. Because the Random Forest model leads Linear Regression by only 0.004 in R² (0.644 versus 0.640), most of the gain appears to come from the engineered features, with non-linear relationships and interaction effects adding a small further improvement. These gains can translate into practical value for coaching decisions and player development strategies.

Future enhancements could address several limitations. Incorporating contextual information such as opponent strength, team injury status, and academic factors could improve prediction accuracy. Further modeling work, such as combining the existing models through stacking and adding hyperparameter optimization, might yield additional improvements. Expanding the temporal scope and including team-specific factors could provide more nuanced insights into playing time patterns.

## Critical Assessment and Limitations

While the models demonstrate significant improvement over baseline methods, the moderate R² score (0.644) indicates substantial room for enhancement. This analysis reveals both the potential and limitations of current sports prediction methodologies.

**Key Limitations Identified:**
- Missing contextual data (injuries, opponent strength, team strategy)
- No academic factors (grades, eligibility, academic standing)
- Limited temporal scope (only 2022-2024 data)
- No team-specific factors (coach preferences, team culture, roster depth)

**Areas for Improvement:**
- Collect additional contextual features
- Combine the fitted models (for example, by stacking or blending)
- Add hyperparameter optimization
- Include domain-specific evaluation metrics
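
As one possible starting point for the hyperparameter optimization item, the sketch below runs a small randomized search with time-ordered cross-validation in scikit-learn. The search space, the number of iterations, and the synthetic data are all assumptions chosen to keep the example quick, not recommendations for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-in data, assumed to be ordered by game date.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 20 + 5 * X[:, 0] + rng.normal(scale=2, size=300)

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_distributions={
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 5, 20],
    },
    n_iter=4,
    cv=TimeSeriesSplit(n_splits=3),  # each fold validates on later rows only
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_  # flip the sign back to a plain MSE
```

`TimeSeriesSplit` is used here so that each validation fold follows its training rows in time, which matches how a forecast for an upcoming game would be made.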

This work establishes a solid foundation for sports analytics but highlights the complexity of predicting human decisions in sports. The methodology provides value for understanding playing time patterns and could serve as a starting point for more sophisticated sports prediction systems.

\pagebreak