# Basketball Statistics Prediction

This project aims to predict basketball game scores and statistics, using data from the French championship (LNB Jeep Elite). This notebook serves as a raw working log, with planning, daily updates, notes, etc.

## Planning

| Phase | Description                                                    |
|-------|----------------------------------------------------------------|
| 1     | Research on the data source, tools and methods                 |
| 2     | Data retrieval                                                 |
| 3     | Exploratory data analysis                                      |
| 4     | Feature engineering                                            |
| 5     | Model for game score prediction                                |

## Phase 1 - Research of datasources, tools and methods

**05/04 - 3:27pm**  
Installation of a conda environment with Jupyter and pandas. Git repository creation.

*Data-source(s) and data retrieval*:
- It seems Jeep Elite data is not openly available.
- We will probably need to do web scraping with **Selenium**. Selenium with Python documentation here: https://selenium-python.readthedocs.io/.
- List of all games here: https://www.lnb.fr/fr/pro-a/calendrier-proa-25.html.  
- Season statistics per player available here: https://www.lnb.fr/fr/pro-a/statistiques-joueurs-39.html (both average and absolute values).
- Game statistics here: https://www.lnb.fr/fr/proa/match/chalons-reims-pau-lacq-orthez-201615620.html#stats (available through list of games).
- Players list here: https://www.lnb.fr/fr/pro-a/joueurs-proa-26.html
- To get injuries: outliers in minutes played for a given player?
- Basketball Reference seems to have ProA statistics (https://www.basketball-reference.com/international/france-lnb-pro-a/) and a Python API (https://sportsreference.readthedocs.io/en/latest/).
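
The injury idea above (outliers in minutes played) could be sketched as follows. This is only a minimal sketch, assuming we have a list of minutes played per game for one player; the function name `flag_low_minute_games` and the threshold of 3.5 are hypothetical choices, not something validated on LNB data. It uses a median-based (MAD) score rather than mean and standard deviation, so that the outliers themselves do not distort the reference level.

```python
from statistics import median


def flag_low_minute_games(minutes, threshold=3.5):
    """Return indices of games where minutes are unusually low for a player.

    `minutes` is a list of minutes played, one entry per game.
    `threshold` is an assumed cut-off on the modified z-score.
    """
    if len(minutes) < 3:
        return []  # not enough history to judge
    med = median(minutes)
    # Median absolute deviation: a robust measure of spread.
    mad = median([abs(m - med) for m in minutes])
    if mad == 0:
        return []  # no spread, so no meaningful score
    flagged = []
    for i, m in enumerate(minutes):
        score = 0.6745 * (m - med) / mad  # modified z-score
        if score < -threshold:
            flagged.append(i)
    return flagged
```

A flagged game is only a hint of an injury: foul trouble or a blowout can also cut minutes, so this would need to be cross-checked against other information.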

Features ideas:
- 3 categories of features: historical matches, player performance indicators, and opposition information.
- Team defense/offense averages.
- Home/Away split.
- Average physical measures of the players in each position (size/weight).
- Injuries.
- Offensive KPIs:
    - Points, 
    - Assists,
    - 3PM and 3PA (attempts to see tendency to play beyond the 3pts line),
    - 2PM and 2PA (attempts to see tendency to play close to the paint),
    - FTM and FTA (attempts to see tendency to play close to the paint and draw fouls),
    - Offensive rebounds,
    - Turnovers.
- Defensive KPIs:
    - Blocks,
    - Steals,
    - Defensive rebounds,
    - Opponents turnovers,
    - Opponents FG%.
- New players added to roster or coach change.
- Four Factors Rating (https://medium.com/@patrickoxford/an-introduction-to-four-factors-rating-3e9ee475ceed).
- Lagged KPIs over past few games.
- Team budget/salary mass.
- Arena attendance.
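
As an illustration of the lagged-KPI idea in the list above, here is a minimal sketch that averages a team stat over its previous games. The function name `lagged_average` and the default window of 3 games are assumptions. The current game is deliberately excluded from its own average, so the feature only uses information available before tip-off.

```python
def lagged_average(values, window=3):
    """For each game, average `values` over up to `window` previous games.

    `values` must be in chronological order for a single team.
    Games with no history get None.
    """
    result = []
    for i in range(len(values)):
        past = values[max(0, i - window):i]  # excludes game i itself
        result.append(sum(past) / len(past) if past else None)
    return result
```

For example, `lagged_average(points_scored, window=5)` would give a "form over the last five games" feature, where `points_scored` is a hypothetical per-game list for one team.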
   

Model:
- Basic model using Random Forest, with a few team statistics as features, applied to college basketball: https://towardsdatascience.com/predict-college-basketball-scores-in-30-lines-of-python-148f6bd71894
- Research paper on basketball game prediction. Importance of feature selection and their relationships rather than model selection. Limit to the prediction task itself. https://arxiv.org/abs/1310.3607v1
- NBA prediction using multiple models and comparing them. Importance of injuries? Using past seasons data? http://dionny.github.io/NBAPredictions/website/index.html#conclusions
- Baseline model could be to select the best ranked team as winner and average points scored on the season.
- Model to predict March Madness outcome with more subtle features such as experience and leadership: https://harvardsportsanalysis.wordpress.com/2012/03/14/survival-of-the-fittest-a-new-model-for-ncaa-tournament-prediction/
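
The baseline mentioned in the list above (best-ranked team wins, season average as predicted score) could look like this. It is a rough sketch: the dictionary keys `name`, `rank` and `avg_points` are hypothetical, and giving ties in rank to the home team is an arbitrary assumption.

```python
def baseline_prediction(home, away):
    """Predict a winner and both scores from season-to-date summaries.

    `home` and `away` are dicts with assumed keys:
    'name', 'rank' (1 = best) and 'avg_points' (season scoring average).
    """
    # Better (lower) rank wins; ties go to the home team by assumption.
    winner = home["name"] if home["rank"] <= away["rank"] else away["name"]
    return {
        "winner": winner,
        "home_score": home["avg_points"],
        "away_score": away["avg_points"],
    }
```

Note that the predicted winner and the predicted scores come from two separate rules, so they can disagree. That is acceptable for a baseline whose only purpose is to give later models something to beat.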

Other links:
- Online forum about basketball statistics: http://apbr.org/metrics/viewforum.php?f=2&sid=ddc8d5a6aece569fcc353bcac518233e

**08/04 - 10:37pm**  

Read the article at https://towardsdatascience.com/predict-college-basketball-scores-in-30-lines-of-python-148f6bd71894. Key takeaways:
- Usage of Basketball reference Python API.  
- Target variables: home/away scores.  
- Straightforward implementation of random forests using scikit-learn.  
- No feature engineering.  
- Could be a good model to start.  
  
Read https://arxiv.org/abs/1310.3607v1. Key takeaways:
- All features are standardized by the number of possessions in the game. However, the number of possessions is usually not recorded live, so it is estimated. As a possession ends when a shot is taken, a rebound is taken, or a turnover occurs, it is usually estimated from those stats. The way they are weighted varies. This article does a nice recap: https://fansided.com/2015/12/21/nylon-calculus-101-possessions/. Here are two formulas used:
    - In the paper -> $Possessions = 0.96 \times (FGA - OR - TO + (0.475 \times FTA))$
    - On NBA.com -> $Possessions = (FGA + 0.44 \times FTA - ORB + TOV) / 2$  
This however raises the question of whether the computation should be different for FIBA basketball.
- The "Four Factors" are four stats said in the literature to be the most relevant to a team's success. They are: *Effective field goal percentage*, *Turnover percentage*, *Offensive rebound percentage* and *Free throw rate*. They should probably be considered as base features.
- Usage of offensive and defensive ratings, adjusted for the opponents' ratings, since putting up good offensive numbers should count for more against a good defense than against a poor one. Adjusted offensive and defensive ratings should also be given more weight.
- A model to estimate win ratio: https://en.wikipedia.org/wiki/Pythagorean_expectation#Use_in_basketball
- They estimated which features were the most relevant to make a prediction (measured by accuracy across all models): "We found the combinations of location and adjusted offensive and defensive efficiencies, and location and Four Factors to work best"
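
To make the Four Factors and the Pythagorean win ratio from the notes above more concrete, here is a minimal sketch using the commonly cited definitions. Several things are assumptions: the function and argument names are hypothetical, the 0.44 free-throw weight is the NBA convention and may not suit FIBA play, free throw rate is taken as FTM/FGA (some sources use FTA/FGA), and the exponent 13.91 is one of the basketball values listed on the Wikipedia page linked above.

```python
def four_factors(two_pm, two_pa, three_pm, three_pa, ftm, fta, tov, orb, opp_drb):
    """Compute the Four Factors for one team from box-score totals."""
    fgm = two_pm + three_pm
    fga = two_pa + three_pa
    return {
        # Effective FG%: a made three counts as 1.5 made field goals.
        "efg_pct": (fgm + 0.5 * three_pm) / fga,
        # Share of plays ending in a turnover (0.44 is an assumed FT weight).
        "tov_pct": tov / (fga + 0.44 * fta + tov),
        # Share of available offensive rebounds the team collected.
        "orb_pct": orb / (orb + opp_drb),
        # Free throws made per field goal attempt.
        "ft_rate": ftm / fga,
    }


def pythagorean_win_ratio(points_for, points_against, exponent=13.91):
    """Expected win ratio from points scored and allowed (assumed exponent)."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)
```

Both functions work on totals, so they can be applied to a single game or to season-to-date sums.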

**06/10 - 04:24pm** 

Very interesting article about how to quantify a coach's impact on a team's win/loss ratio: [https://towardsdatascience.com/quantifying-the-contribution-of-nba-coaches-using-fixed-effects-56f77f22153a](https://towardsdatascience.com/quantifying-the-contribution-of-nba-coaches-using-fixed-effects-56f77f22153a). Although it is not of direct interest for this project, it first shows the importance of the coach, which may be worth quantifying somehow in our features, and second introduces a statistic to quantify a player's quality: the **Value Over Replacement Player (VORP)**. This page ([https://www.basketball-reference.com/about/bpm2.html](https://www.basketball-reference.com/about/bpm2.html)) gives a more in-depth look at how it is computed and how it can be used.



## Phase 2 - Data retrieval

**06/05 - 10:32pm**

Installation of Selenium to pull data from [lnb.fr](https://www.lnb.fr).
First pulled game data, with the following features:
- date, 
- home team name, 
- home team score,
- away team name,
- away team score,
- home team score for each quarter,
- away team score for each quarter,
- home team score for each overtime period, if any,
- away team score for each overtime period, if any,
- whether the game is on TV.

This should be enough to build a first model. Script available in notebook `data_retrieval.ipynb`.
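
One possible way to hold a scraped game in memory is sketched below. This is not the structure used in `data_retrieval.ipynb`: the class name `GameRecord` and all field names are hypothetical, chosen to mirror the features listed above. The `is_consistent` check is an added suggestion for catching scraping errors by comparing final scores with the sum of period scores.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class GameRecord:
    """One scraped game; field names are illustrative only."""
    game_date: date
    home_team: str
    home_score: int
    away_team: str
    away_score: int
    home_quarters: List[int]
    away_quarters: List[int]
    home_overtimes: List[int] = field(default_factory=list)
    away_overtimes: List[int] = field(default_factory=list)
    on_tv: bool = False

    def is_consistent(self) -> bool:
        """Check that period scores add up to the final scores."""
        home_total = sum(self.home_quarters) + sum(self.home_overtimes)
        away_total = sum(self.away_quarters) + sum(self.away_overtimes)
        return (
            home_total == self.home_score
            and away_total == self.away_score
            # Both teams must have played the same number of overtimes.
            and len(self.home_overtimes) == len(self.away_overtimes)
        )
```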

**05/09 - 10:00pm**  
Pulled additional game statistics for each game. Statistics pulled:
- minutes,
- total assists, 
- total defensive rebounds,
- total offensive rebounds,
- total 2-point shot attempts,
- total 2-point shots made,
- total 3-point shot attempts,
- total 3-point shots made,
- total free throw attempts,
- total free throws made,
- total blocks,
- total steals,
- total turnovers,
- total personal fouls,
- total personal fouls drawn.

These were pulled for both the home and away teams of each game retrieved in the previous step.

## Phase 3 - Data preparation

**30/10 - 06:00pm**  
First clean of the data:
- Removing All-Star Game teams,
- Adding the current W/L ratio of both teams at each game.
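
The W/L ratio step could be sketched as below. This is a minimal illustration, not the notebook's code: the function name `add_win_ratios` and the dictionary keys are hypothetical, and "W/L ratio" is interpreted here as wins divided by games played. Each ratio is computed before the game's own result is counted, so the feature does not leak the outcome being predicted.

```python
from collections import defaultdict


def add_win_ratios(games):
    """Add each team's win ratio *before* the game to every game row.

    `games` is a chronological list of dicts with assumed keys:
    'home', 'away', 'home_score', 'away_score'.
    Teams with no game played yet get None.
    """
    wins = defaultdict(int)
    played = defaultdict(int)
    enriched = []
    for game in games:
        home, away = game["home"], game["away"]
        row = dict(game)  # copy so the input is left untouched
        row["home_win_ratio"] = wins[home] / played[home] if played[home] else None
        row["away_win_ratio"] = wins[away] / played[away] if played[away] else None
        enriched.append(row)
        # Update the running totals only after the features are stored.
        winner = home if game["home_score"] > game["away_score"] else away
        wins[winner] += 1
        played[home] += 1
        played[away] += 1
    return enriched
```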

To add more statistics, a good start is to add possessions. The formula to compute possessions seems to still be under discussion. Read a couple more articles on the subject:
- [https://fansided.com/2015/12/21/nylon-calculus-101-possessions/](https://fansided.com/2015/12/21/nylon-calculus-101-possessions/)
- [https://cbbstatshelp.com/efficiency/possessions/](https://cbbstatshelp.com/efficiency/possessions/)
- [https://www.nbastuffer.com/analytics101/possession/](https://www.nbastuffer.com/analytics101/possession/)  
The NBA formula is fairly clear ($FGA + 0.44 \times FTA - ORB + TOV$): it counts every way a possession can end (FGA, TOV, FTA) and subtracts the case where it does not end (ORB). The 0.44 multiplier accounts for the fact that free throws are shot alone ("and one"), in pairs (most cases), or in threes, so on average about 44% of FTA end a possession. It is unclear whether this also applies to FIBA basketball.
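
The formula above translates directly into code. In this minimal sketch the function names are hypothetical, and the free-throw weight is exposed as a parameter (`ft_weight`, defaulting to the NBA value of 0.44) precisely because the right value for FIBA basketball is an open question.

```python
def estimate_possessions(fga, fta, orb, tov, ft_weight=0.44):
    """Estimate one team's possessions from box-score totals.

    FGA, weighted FTA and TOV end a possession; an offensive rebound
    extends it, so ORB is subtracted.
    """
    return fga + ft_weight * fta - orb + tov


def per_100_possessions(stat, possessions):
    """Standardize a raw stat (e.g. points) to a per-100-possessions rate."""
    return 100 * stat / possessions
```

For instance, a team with 70 FGA, 20 FTA, 10 ORB and 12 TOV would be estimated at about 80.8 possessions with the default weight.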

Ideas: 
- To add more data (games), we could add past seasons, but this would require adapting season-dependent stats, such as the W/L ratio.