# Probability of a Player Making All-NBA Team
## Bayesian Network Analysis of NBA Player Data
### By Jaleel Walter Henry Savoy
#### 10/19/2020

## Introduction

The All-NBA Teams are one of the highest individual honors an NBA player can receive based on regular season performance. Since 1988, there have been three All-NBA Teams, each composed of two guards, two forwards, and one center; players receive votes at the position they play most frequently.

In many NBA seasons it is fairly evident which players are likely to make an All-NBA Team, and sometimes it is even clear which of the three teams those players will make. In many other seasons, it is difficult to say exactly which players will make an All-NBA Team and which will not.

The goal of this analysis is to examine the relationship between a set of NBA players' end-of-season statistics and whether each player made an All-NBA Team. To do this, a Bayesian network is fitted using the `bnlearn` package in R; the network is then analyzed to see which statistics contribute to making an All-NBA Team; and lastly, the predictive capabilities of the fitted network are assessed.

## Data

The data used for this analysis is end-of-season data in which the following player statistics have been binned into four equal-width bins: True Shooting Percentage (`ts`), Player Efficiency Rating (`per`), Effective Field Goal Percentage (`efg`), Usage Percentage (`usg_perc`), and Win Shares (`ws`). The data also records whether the player made an All-NBA Team that season (`all.nba`). The player data was binned so that the fitted Bayesian network would be a discrete network.
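
As a rough illustration of the binning step, the following sketch splits a numeric statistic into four equal-width bins with base R's `cut`. The `ts_raw` values are made up for illustration and are not taken from the data set; the bin edges actually used in this analysis are the ones shown in the bin-range figure.

```R
# Hypothetical raw True Shooting Percentage values (not from the real data).
ts_raw <- c(0.48, 0.51, 0.55, 0.57, 0.60, 0.64, 0.68)

# Five edges give four equal-width bins spanning the observed range.
edges <- seq(min(ts_raw), max(ts_raw), length.out = 5)

ts_binned <- cut(ts_raw,
                 breaks = edges,
                 labels = c("Bin 1", "Bin 2", "Bin 3", "Bin 4"),
                 include.lowest = TRUE)  # keep the minimum value in Bin 1

table(ts_binned)
```

Equal-width bins can end up with very different numbers of players in each bin, which is why the bin sizes are reported alongside the bin ranges.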

Those statistics were chosen over the more traditional counting statistics and simpler percentages because they are considered more reliable indicators of player performance. Admittedly, the chosen metrics focus more on offensive performance than defensive performance, but both Win Shares and Player Efficiency Rating incorporate defensive performance.

The Bayesian network will be fitted on player data from the 1988-89 season to the 2018-19 season, while the 2019-20 season data will be held out to assess predictive capabilities of the network. 

Only players who played at least 42 games, started at least 20 games, averaged at least 25 minutes per game, and had a Player Efficiency Rating of at least 15 were included; any player who did not meet these conditions was excluded from the data sets. The decision to exclude those players, and the conditions used to exclude them, were based on subjective prior beliefs about what makes an All-NBA player.
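
The inclusion rule can be sketched as a simple row filter. The data frame and its column names (`g`, `gs`, `mpg`, `per`) are assumptions made for this example, and the rows are invented; only the four thresholds come from the text above.

```R
# Hypothetical player-season rows; column names and values are assumptions.
players <- data.frame(
  player = c("A", "B", "C", "D"),
  g      = c(70, 40, 65, 55),          # games played
  gs     = c(70, 38, 10, 50),          # games started
  mpg    = c(34.1, 30.0, 27.5, 24.0),  # minutes per game
  per    = c(22.3, 18.0, 16.1, 19.5)   # Player Efficiency Rating
)

# Keep only rows that satisfy all four conditions.
eligible <- subset(players, g >= 42 & gs >= 20 & mpg >= 25 & per >= 15)
eligible$player
```

In this toy data only player A survives: B fails the games-played threshold, C the games-started threshold, and D the minutes threshold.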

**The following figure shows the bin ranges for the player metrics:**
![image.png](./bin_ranges.png)

**The following figure shows the bin sizes for the player metrics:**
![image.png](./bin_sizes.png)

## Method

A Bayesian network is a probabilistic graphical model that represents random variables as nodes and their conditional dependencies as edges in a directed acyclic graph; nodes that are connected by an edge are directly dependent on each other, while nodes that are not connected are conditionally independent of each other given some set of other nodes in the network.

A fitted Bayesian network can be used to answer probabilistic queries about its variables; given some observed evidence, the network can give the probability of a certain event occurring.
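
To make the idea of a probabilistic query concrete, here is a toy sketch in base R with two binary variables, where `perf` (a hypothetical stand-in for a performance statistic) is the parent of `sel` (a stand-in for selection). All probabilities are invented for illustration. The joint distribution factorizes as P(perf) × P(sel | perf), and a query in the reverse direction is answered by conditioning on the evidence.

```R
# Toy network: perf -> sel. All numbers are invented for illustration.
p_perf <- c(low = 0.8, high = 0.2)   # P(perf)
p_sel_given_perf <- rbind(           # P(sel | perf), one row per perf state
  low  = c(no = 0.95, yes = 0.05),
  high = c(no = 0.40, yes = 0.60)
)

# Joint distribution from the factorization P(perf) * P(sel | perf).
# The vector recycles down the rows, so row i is scaled by p_perf[i].
joint <- p_perf * p_sel_given_perf

# Query: P(perf = high | sel = yes), conditioning the joint on the evidence.
p_high_given_yes <- joint["high", "yes"] / sum(joint[, "yes"])
p_high_given_yes
```

With only two nodes the query can be answered exactly by enumeration; for larger networks, approximate methods such as sampling are often used instead.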

For this analysis, the Bayesian network's structure was learned using a hill-climbing greedy search and the parameters of the network's local distributions were fitted using Bayesian parameter estimation.
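
The following is a minimal sketch of what Bayesian parameter estimation does for a single conditional probability table, assuming a uniform Dirichlet prior whose total weight is an imaginary sample size spread evenly over the cells of the table. The counts are invented, and the formula is the standard posterior-mean estimate under that assumption rather than a copy of `bnlearn` internals.

```R
# Invented counts: rows = child states, columns = parent configurations.
counts <- matrix(c(90, 10,    # parent in "Bin 1": 90 not selected, 10 selected
                   20, 30),   # parent in "Bin 4": 20 not selected, 30 selected
                 nrow = 2,
                 dimnames = list(child = c("0", "1"),
                                 parent = c("Bin 1", "Bin 4")))

iss <- 100                          # imaginary sample size (prior weight)
prior_cell <- iss / length(counts)  # prior pseudo-count added to every cell

# Posterior mean: add the pseudo-counts, then normalize each column.
posterior <- prop.table(counts + prior_cell, margin = 2)

# Maximum-likelihood estimate for comparison (no pseudo-counts).
mle <- prop.table(counts, margin = 2)

round(posterior, 3)
round(mle, 3)
```

Compared with the maximum-likelihood estimate, the posterior mean is pulled toward a uniform distribution, and the pull is stronger for parent configurations with few observations.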

A denylist was specified to prevent the `all.nba` random variable from being a parent of any other node. The reasoning is that whether or not a player is selected to an All-NBA Team has no influence on their performance during the season; the relationship runs the other way, with a player's performance during the season influencing their selection, or lack thereof, to an All-NBA Team.

## Results

The fitted Bayesian network found that Usage Percentage, Win Shares, and Player Efficiency Rating had direct influence on whether or not a player made an All-NBA Team, while both Effective Field Goal Percentage and True Shooting Percentage had an indirect influence.

**The following figure is the learned structure of the fitted Bayesian network:**
![](fitted_bayesian_network.png)

For 71 of the 81 players in the held-out 2019-20 data, the fitted Bayesian network correctly predicted whether or not they made an All-NBA Team, with an F<sub>1</sub> score of 0.67 for the positive class (making an All-NBA Team) and an F<sub>1</sub> score of 0.924 for the negative class (not making an All-NBA Team). The model had five false positives and five false negatives.
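
As a worked check, the F<sub>1</sub> scores can be reproduced from the confusion counts. The sketch below assumes that all 15 All-NBA selections are among the 81 players, which gives 10 true positives (15 minus the five false negatives) and 61 true negatives (71 correct minus 10); the function name `f1_score` is made up for this example.

```R
# Confusion counts implied by the reported results (see assumptions above).
tp <- 10; fp <- 5; fn <- 5; tn <- 61

# F1 is the harmonic mean of precision and recall.
f1_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

f1_positive <- f1_score(tp, fp, fn)  # positive class: made an All-NBA Team
f1_negative <- f1_score(tn, fn, fp)  # negative class: the counts swap roles

round(c(positive = f1_positive, negative = f1_negative), 3)
```

Because the number of false positives equals the number of false negatives, precision and recall coincide within each class, so each F<sub>1</sub> score equals that shared value: 10/15 for the positive class and 61/66 for the negative class.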

These results show that the model is much better than random guessing, which would have had an F<sub>1</sub> score of 0.27 for the positive class and 0.62 for the negative class.
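
The random-guess baseline can be sketched as follows, assuming a guesser that picks each class with equal probability. Under that assumption the expected precision for a class equals the class's share of the players, and the expected recall is one over the number of classes. The class counts of 15 and 66 assume that all 15 All-NBA selections are among the 81 players; the variable names are made up for this example.

```R
n_players <- 81
n_classes <- 2

# Share of players in each class.
prevalence <- c(all_nba = 15, not_all_nba = 66) / n_players

rg_precision <- prevalence                         # expected precision
rg_recall    <- rep(1 / n_classes, n_classes)      # expected recall
rg_f1 <- 2 * rg_precision * rg_recall / (rg_precision + rg_recall)

round(rg_f1, 2)
```

Simplifying the last expression gives 2p / (n_classes × p + 1) for a class with share p, which is the form used for `rgF1` in the code section.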

**The following table shows any player that made an All-NBA Team or was predicted to make an All-NBA Team for the 2019-20 season, sorted by the predicted probability of making an All-NBA Team:**
![](pred_prob.png)

The model was fairly confident that Trae Young would make an All-NBA Team, predicting an 88% probability, even though he did not make any of the teams (he did receive 13 voting points, from one second-team vote and 10 third-team votes). The model's confidence was due to his high Usage Percentage and fairly high Player Efficiency Rating.

The model was also mildly confident about Joel Embiid (72.3%) and Hassan Whiteside (69.3%) making an All-NBA Team, even though they ultimately did not make one. For Joel Embiid, the model saw the evidence of his fairly high Player Efficiency Rating and True Shooting Percentage. For Hassan Whiteside, it was the fairly high Player Efficiency Rating and Win Shares.

For Karl-Anthony Towns, Bam Adebayo, Rudy Gobert, and Russell Westbrook, the model was more uncertain (the predicted probabilities of making an All-NBA Team ranged from 33% to 68%), which resulted in two false positives (Towns and Adebayo) and two false negatives (Gobert and Westbrook).

The model struggled most with Pascal Siakam, Jayson Tatum, and Chris Paul, all of whom made an All-NBA Team but were given predicted probabilities of only 14%, 14%, and 12.3%, respectively. These players did not stand out in the way the model expected of All-NBA players based on the historical data from which it learned.

## Conclusion

It appears that Usage Percentage, Player Efficiency Rating, and Win Shares are strong, direct influences on whether a player is selected to an All-NBA Team, but there is still a fair amount of variability not accounted for by these metrics. This can be partly explained by the voting process; some voters may rely on a more traditional, highly subjective eye test that is not always reflected in the data. Overall, the fitted Bayesian network, and the inferences that can be derived from it, would make a useful tool for reasoning about and predicting All-NBA selections.

## Code

```R
library(bnlearn)
library(tidyverse)

set.seed(123)
num_of_draws <- 15000000

bin.ranges.df <- read.csv("bin_ranges.csv")
bin.sizes.df <- read.csv("bin_sizes.csv")

nba_player_data_refined <- read.csv("nba_quartiles.csv", stringsAsFactors = TRUE, row.names = "X")
nba_player_data_2020_refined <- read.csv("nba_quartiles_2020.csv", stringsAsFactors = TRUE, row.names = "X")
nba_player_data_2020_refined <- nba_player_data_2020_refined[-1,] # remove dummy row
nba_player_data_refined$all.nba <- as.factor(nba_player_data_refined$all.nba)

summary(nba_player_data_refined)

cols_of_interest <- c(
  "gs","ts","per","efg","usg_perc","ws","all.nba"
)

bin_labels <- c(
  "Bin 1",
  "Bin 2",
  "Bin 3",
  "Bin 4")

node.names <- names(nba_player_data_refined[, cols_of_interest])
all.nodes <- node.names[-grep("all.nba", node.names)]

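# Forbid arcs from all.nba to every other node, so that all.nba
# can never be a parent in the learned structure.
denylist <- data.frame(
  from = c("all.nba"), 
  to = all.nodes)

# Learn the structure with hill-climbing, scoring candidates with K2.
res <- hc(nba_player_data_refined[, cols_of_interest],
          blacklist = denylist,
          score='k2')
plot(res)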

strength = arc.strength(res,
                        nba_player_data_refined[, cols_of_interest],
                        criterion = "k2")
strength.plot(res, strength)
strength[order(strength$strength), ]

fittedbn <- bn.fit(res, data=nba_player_data_refined[, cols_of_interest],
                   method = "bayes",
                   iss=100)

graphviz.plot(fittedbn)
arcs(fittedbn)
print(fittedbn$all.nba$parents)

cpq_probs <- list()

for (i in bin_labels){
  for (j in bin_labels){
    cpq_probs <- append(cpq_probs,
                        cpquery(fittedbn,
                                event = (all.nba == "1"),
                                evidence = (ws  == i &
                                            per == j),
                                n=num_of_draws,
                                method = "ls")
    )
  }
}

# Flatten the list so the result is a numeric matrix (rows: per bins, columns: ws bins).
all.nba.prob <- matrix(unlist(cpq_probs), 4)
colnames(all.nba.prob) <- c(
  "Bin 1 (ws)",
  "Bin 2 (ws)",
  "Bin 3 (ws)",
  "Bin 4 (ws)")
row.names(all.nba.prob) <- c(
  "Bin 1 (per)",
  "Bin 2 (per)",
  "Bin 3 (per)",
  "Bin 4 (per)")
all.nba.prob

predictions <- predict(fittedbn,
                       "all.nba",
                       data = nba_player_data_2020_refined,
                       prob = TRUE)

pred_probs <- attributes(predictions)

pred.df <- data.frame(
  Player = nba_player_data_2020_refined$player,
  Prediction = predictions,
  Actual = "0"
)

all.nba.2020 <- c(
  "Giannis Antetokounmpo", 
  "LeBron James",
  "James Harden",
  "Anthony Davis",
  "Luka Doncic",
  "Kawhi Leonard",
  "Nikola Jokic",
  "Damian Lillard",
  "Chris Paul",
  "Pascal Siakam",
  "Jayson Tatum",
  "Jimmy Butler",
  "Rudy Gobert",
  "Ben Simmons",
  "Russell Westbrook"
)

for (player in all.nba.2020){
  pred.df[which(pred.df$Player == player), "Actual"] <- "1"
}

pred.df$Actual <- as.factor(pred.df$Actual)

# Row 2 of the probability matrix holds the second level of all.nba ("1")
# for every player (one column per player).
pred.df$PredictedProbability <- unname(pred_probs[["prob"]][2, ])

pred.df <- pred.df[order(pred.df$PredictedProbability), ]

all.nba.df <- pred.df[which(pred.df$Actual == "1" | pred.df$Prediction == "1"), ]
all.nba.df <- all.nba.df[order(-all.nba.df$PredictedProbability, all.nba.df$Actual), ]

summary(all.nba.df)
View(all.nba.df)

# Confusion matrix: rows are actual classes, columns are predicted classes.
classification.matrix <- as.matrix(table(Actual = pred.df$Actual, Predicted = pred.df$Prediction))

n = sum(classification.matrix)
nc = nrow(classification.matrix)
diag = diag(classification.matrix)
rowsums = apply(classification.matrix, 1, sum)
colsums = apply(classification.matrix, 2, sum)
p = rowsums / n
q = colsums / n

accuracy_score <- sum(diag) / n
precision <- diag / colsums
recall <- diag / rowsums 
f1 <- (2 * (precision * recall)) / (precision + recall) 

macroPrecision <- mean(precision)
macroRecall <- mean(recall)
macroF1 <- mean(f1)

# Expected confusion matrix and metrics for a uniform random guess.
(n / nc) * matrix(rep(p, nc), nc, nc, byrow=F)
rgAccuracy <- 1 / nc
rgPrecision <- p
rgRecall <- 0*p + 1 / nc
rgF1 <- 2 * p / (nc * p + 1)

data.frame(rgPrecision, rgRecall, rgF1)
data.frame(macroPrecision, macroRecall, macroF1)
data.frame(precision, recall, f1)
classification.matrix
```