# NBA Player Value

This notebook models NBA player salaries by identifying the performance statistics that best predict a player's monetary value (at least in the eyes of NBA general managers).

In [None]:
# Imports + Setup
library(tidyverse)
library(MASS)
library(kableExtra)

In [None]:
# Read in CSV from .py script
nba <- read_csv("./stats_and_salaires.csv")                               

# Custom histogram function
fastHistogram <- function(FEAT, NAME, BINS=30) {
        nba %>% ggplot(aes(x = scale(FEAT))) +
                geom_histogram(color="white", bins = BINS, fill="royalblue4", alpha=0.92) +
                labs(x = paste("Scaled Feature: ", NAME), y = "") +
                theme_minimal()
}

In [None]:
head(nba)

-----------

## Data Cleaning + Transformation

* We'll convert the Salary feature to a numeric value by stripping the **$ sign** and the comma separators

* **Points Per Minute** and **Assists Per Minute** can be considered proxies for a player's relative offensive impact on the floor

In [None]:
nba$Salary <- as.numeric(gsub("[\\$,]", "", nba$Salary))        # Strip "$" and "," then convert salary to numeric

nba <- nba %>% 
        mutate("PPM" = PTS / MP, "APM" = AST/MP)                # Calculate Points and Assists / Minute
        

# Numeric variables only
nba.reduced <- nba %>% 
        dplyr::select(!c("Rk", "Player", "Tm", "Pos")) %>% 
        na.omit()


nba  %>% arrange(desc(Salary))  %>% head()

In [None]:
# Plot Age distribution
fastHistogram(nba$Age, "Age", 25)

After scaling `Age`, it’s evident that this distribution is slightly right-skewed (in other words, the majority of the data falls below the mean). We’ll attempt to log-transform this variable to approximate normality, though if the transformation yields any infinite values, we won’t be able to effectively perform the log-transform.

In [None]:
# Plot 3P%
fastHistogram(nba$`3P%`, "3-Point %", 40)

`3-Point%` is slightly left-skewed, and seems to have several severe outliers. These most likely represent low-volume shooters (for example, a player who takes and makes a single 3-pointer shoots 100%, which is an extreme outlier). We likely will not need to transform this variable, as a log-transformation would not resolve the extreme positive outliers observed here.
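One way to check the low-volume explanation is to flag players below a minimum number of attempts and look at the range of percentages that remains. The sketch below uses made-up rows, not the notebook's data; the column names `X3P` and `X3PA` and the cutoff of 20 attempts are assumptions chosen for illustration.

```r
# Toy data, invented for illustration (not taken from the NBA table)
shooters <- data.frame(
  Player = c("A", "B", "C", "D"),
  X3P    = c(1, 0, 150, 90),     # 3-pointers made (assumed column)
  X3PA   = c(1, 3, 400, 250)     # 3-pointers attempted (assumed column)
)
min_attempts <- 20               # hypothetical cutoff

shooters$pct <- shooters$X3P / shooters$X3PA
shooters$low_volume <- shooters$X3PA < min_attempts

# In this toy data, the 0% and 100% shooters are all low-volume rows
qualified <- shooters[!shooters$low_volume, ]
range(qualified$pct)
```

If the extreme values disappear once low-volume shooters are excluded, that supports treating them as a sample-size artifact rather than a feature of the distribution.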

In [None]:
# Plot salary distribution
fastHistogram(nba$Salary, "Salary", 65)

Salary is observed to be **massively right-skewed**, such that the majority of observations are less than the mean of the distribution. There are noticeable outliers in this distribution, with several observations 3 or more standard deviations from the mean. This is a good indicator that, to obtain a robust result, we’ll need to log-transform player salaries in our linear model.
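To illustrate why a log-transform helps with this kind of skew, here is a minimal sketch on synthetic data. The log-normal draws stand in for salaries and are not the real figures; the `skewness` helper is a simple hand-rolled estimate, not a function from the packages loaded above.

```r
set.seed(1)
# Synthetic right-skewed "salaries" (log-normal draws), not real NBA data
salary <- rlnorm(500, meanlog = 15, sdlog = 1)

# Approximate sample skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(salary)        # strongly positive for right-skewed data
skew_log <- skewness(log(salary))   # close to zero after the transform
c(raw = skew_raw, log = skew_log)
```

A skewness near zero on the log scale suggests the transformed response is closer to symmetric, which is friendlier to a linear model.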

----------

## Modeling

This first model will include **all numeric features**.

In [None]:
base.model <- lm(log(Salary) ~ ., data = nba.reduced)              # Build model with all features
summary(base.model)

The **coefficients** in the base model appear to be weakly predictive, at best. This is likely due in part to including every variable at once, since many box-score statistics overlap with one another. To yield a better, more predictive model, we’re going to employ **AIC Stepwise Variable Selection** to try different combinations of variables, with our final model containing the most robust set of predictors.
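As a rough illustration of what stepwise selection does, the sketch below runs `MASS::stepAIC` on R's built-in `mtcars` data, which stands in for the NBA table here. With `k = 2` the penalty corresponds to AIC, so the procedure adds or drops one term at a time as long as AIC keeps decreasing.

```r
library(MASS)

# Built-in mtcars data is a stand-in for the NBA table
full_model <- lm(mpg ~ ., data = mtcars)
step_model <- stepAIC(full_model, direction = "both", k = 2, trace = FALSE)

# Lower AIC is better; the stepwise model should not be worse than the full one
c(full = AIC(full_model), stepwise = AIC(step_model))

# Terms that survived selection
formula(step_model)
```

The selected model typically keeps far fewer terms than the full one, which is the behaviour we are relying on for the salary model.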

In [None]:
# Log-transform points and 2-point attempts
base.model <- stats::update(base.model, . ~ . -PTS +log(PTS) -`2PA` +log(`2PA`))

# Select optimal variables via AIC stepwise selection
step.model <- MASS::stepAIC(base.model, 
                            direction = "both",
                            k = 2,
                            trace = FALSE, 
                            steps = 1000)

summary(step.model)

Our AIC selection algorithm yields an interesting set of predictors. We observe an R-squared value of 0.4414; put simply, our model accounts for **44.14% of the variance in log-transformed player salary**. Considering the swath of variance that we can’t account for given the present data (injury, market value, salary cap, etc.), this is actually pretty decent! Let’s apply this finished model to our full dataset to see how teams and players fare.
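Because the response is `log(Salary)`, the R-squared from `summary()` describes fit on the log scale rather than in dollars. The sketch below, on synthetic data with made-up coefficients, shows one way to compute a dollar-scale analogue for comparison; `pts` and `salary` here are simulated, not columns from `nba`.

```r
set.seed(42)
# Synthetic data: one predictor and a log-normal response (not real salaries)
n <- 200
pts <- runif(n, 1, 30)
salary <- exp(13 + 0.08 * pts + rnorm(n, sd = 0.8))

fit <- lm(log(salary) ~ pts)

# R-squared as reported by summary(): variance explained in log(salary)
r2_log <- summary(fit)$r.squared

# Dollar-scale analogue: 1 - SSE/SST after back-transforming with exp()
pred_dollars <- exp(fitted(fit))
r2_dollars <- 1 - sum((salary - pred_dollars)^2) /
                  sum((salary - mean(salary))^2)

c(log_scale = r2_log, dollar_scale = r2_dollars)
```

The two numbers generally differ, so the log-scale figure should not be read as a share of variance in raw dollar salaries.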

-----------

## Predictions

Now that we have an optimized model, let’s apply it to our data. This will give us a `PREDICTIONS` column, which represents an estimation of what each player should be paid based on their individual stat lines.

In [None]:
# Predict salaries with our optimal model
nba["PREDICTIONS"] <- exp(predict(step.model, nba))


# Define custom X + Y labels
x_labels <- c("$0", "$10M", "$20M", "$30M", "$40M")
y_labels <- c("$0", "$20M", "$40M", "$60M")


# Plot actual vs. predicted salaries
salary.plot <- nba %>%
                        mutate(phase = ifelse(Salary > PREDICTIONS, 'Overpaid', 'Underpaid'))  %>% 
                        ggplot(aes(x = Salary, y = PREDICTIONS)) +
                        geom_smooth(color="white", alpha=0.65) +
                        geom_point(alpha=0.65, size=(nba$MP / 400), aes(color=phase)) +
                        theme_minimal() +
                        labs(x = "Actual Salary", 
                             y = "Predicted Salary",
                             title = "Salary Prediction Model",
                             subtitle = "Size = Minutes Played",
                             color = "") +
                        theme(plot.title = element_text(hjust = 0.5, face="bold"),
                              plot.subtitle = element_text(hjust = 0.5)) +
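        # Set breaks explicitly so each custom label maps to a known axis value
        scale_x_continuous(breaks = seq(0, 4e7, by = 1e7), labels = x_labels) +
        scale_y_continuous(breaks = seq(0, 6e7, by = 2e7), labels = y_labels) +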
        scale_color_brewer(palette='Set1') + 
        theme(axis.text.x = element_text(hjust = 1),
              axis.text.y = element_text(vjust = -1))


salary.plot

We observe a moderately close fit between the observed data points and the line of best fit in this plot. While there appears to be a general trend of points scored correlating with salary, this model seems to inflate player salaries slightly, such that players’ predicted salaries tend to be higher than their actual salaries. This is most likely due to the lack of control parameters built into this model (for example, the salary cap is not enforced in this environment, so the model has no ceiling to work under).

-----------

# Determining Player Value

*How will this work?*

Using the predicted values from our model, we’ll calculate the differential from actual salaries to create a salary differential variable (named `Salary.Differntial` in the code). If this variable is positive - i.e., the predicted salary is higher than the actual salary - we may assert that this player is **underpaid**. Conversely, if the predicted salary is lower than the actual salary, it would suggest that the player is **overpaid**.
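The labeling rule can be sketched on a few hypothetical rows. The player names and dollar figures below are invented for illustration, and the `Differential` and `Label` column names are placeholders rather than the notebook's own.

```r
# Hypothetical players and dollar figures, for illustration only
toy <- data.frame(
  Player      = c("Player A", "Player B", "Player C"),
  Salary      = c(5e6, 30e6, 12e6),
  PREDICTIONS = c(9e6, 22e6, 12.5e6)
)

# Positive differential: predicted pay exceeds actual pay
toy$Differential <- toy$PREDICTIONS - toy$Salary
toy$Label <- ifelse(toy$Differential < 0, "Overpaid", "Underpaid")
toy
```

Note that with this rule a differential of exactly zero falls into the "Underpaid" group, since only strictly negative values are labeled "Overpaid".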

In [None]:
nba <- nba %>%
        mutate("Salary.Differntial" = PREDICTIONS - Salary,
               "Overpaid" = ifelse(Salary.Differntial < 0, "Overpaid", "Underpaid"))


nba %>% 
        dplyr::filter(!is.na(Overpaid)) %>%                     # Drop players with no prediction
        ggplot(aes(x = Overpaid, fill=Overpaid)) +
        geom_bar(color="white", alpha=0.85) +
        labs(x = "Salary Differential", 
             y ="",
             title = "Salary Differential") +
        theme_minimal() +
        theme(legend.position = "none",
              axis.title.x = element_blank(),
              axis.text.x = element_text(face="bold"),
              plot.title = element_text(face = "bold", hjust = 0.5)) +
        scale_fill_manual(values = c("royalblue4", "dodgerblue3"))

---------

## Underpaid Players

First, let’s look at players whose projected salaries are **higher** than what they are actually paid. We’ll consider these players **underpaid**:

In [None]:
nba %>%
        dplyr::select(Player, Age, Salary, PREDICTIONS, Salary.Differntial) %>% 
        arrange(desc(Salary.Differntial)) %>% 
        head(10)

**Obvious points**: You can’t overpay LeBron James. Whatever the league will allow you to pay him, pay him 150% of that. Similarly, the other players on this list include Luka, Siakam, and Bam Adebayo - in other words, young players who either haven’t hit the bank yet or are playing above their value.



**Less obvious**: Carmelo’s inclusion on this list. Our model indicates that his stat line should have put him around the `$16M / year` range. His actual salary is relatively modest in comparison, which makes him a bit of a bargain.

----------

## Overpaid Players

Here we’ll take the opposite approach, and explore players whose predicted salaries are less than their actual salaries. We’ll filter out **Steph Curry** since he was injured all year and makes lots of money. An outlier if ever there was one.

In [None]:
nba %>%
        dplyr::select(Player, Age, Salary, PREDICTIONS, Salary.Differntial) %>% 
        dplyr::filter(Player != "Stephen Curry") %>% 
        arrange(Salary.Differntial) %>% 
        head(10)

**Obvious points**: Relative to the underpaid player list, the majority of these players are in their late 20s and early 30s. There are two reasons for this - older players are more likely to have higher salaries (i.e., non-rookie deals), and older players are generally less likely to have robust stat lines relative to younger players (see: Blake Griffin on both counts).



**Less obvious**: Westbrook (57 games played) and Kyrie (20 games played). The Westbrook outcome is especially interesting - it’d be worth approaching this question slightly differently, to see how Westbrook stacks up in years when he plays the full season. He’s a former MVP, but he can easily drop to a below-average player at times. Surely this has been quantified by other researchers, but still worth a shot!

---------

## Team Cap Management

Lastly, let’s track how teams look overall - how many of their players are **overpaid** and how many of their players are **underpaid**. We’ll calculate the percentage of each roster that is made up of overpaid players, then we’ll observe the “bottom 10” - i.e., the 10 teams that have the highest percentage of overpaid players.

In [None]:
nba %>% 
        dplyr::filter(Tm != "TOT") %>%                          # Drop combined rows for multi-team players
        group_by(Tm) %>% 
        summarise(Players = n(),
                  Underpaid.Players = sum(Overpaid == "Underpaid", na.rm = TRUE),
                  Overpaid.Players = sum(Overpaid == "Overpaid", na.rm = TRUE),
                  Pct.Overpaid = (Overpaid.Players / Players)) %>% 
        arrange(desc(Pct.Overpaid)) %>%                         # Sort first, then keep the 10 highest
        head(10)

Seems to be a bit of a mixed bag, just by the eye test. Of the bottom-10 teams with a high percentage of overpaid players, some are competitive - Dallas, Boston, Denver - while others are competing for the lottery every year (such as Cleveland and Detroit). Intuitively, it seems like it’s more important to pay the right players, even if those players are large cap expenditures.



Dallas is a good example of this, as they’re able to get amazing output from Luka while only paying him ~$7M annually.

In [None]:
nba %>% 
        dplyr::filter(Tm == "DAL") %>% 
        dplyr::select(Player, Age, Pos, Salary) %>% 
        arrange(desc(Salary)) %>% head(5)

--------

# Future Directions

* Calculate the correlation between “overpaid player percentage” and win total in the regular season

* Fix the dataset to minimize `NA` values in the model (and, in turn, yield a higher R-squared value)


<img src="https://media.newyorker.com/photos/60b10421d60710aaa9f4959a/master/pass/RH-SpikeLee-2560.png" width=50%>