## Title: Predicting NBA player earnings based on the 2022-2023 season

###  Predictive Question

How well do NBA players' statistics predict their salary? We focus on four predictors: age, games played, total minutes played, and true shooting percentage.

### Introduction:
The National Basketball Association (NBA) is a North American professional basketball league consisting of 2 conferences. Each conference has 3 divisions, and each division has 5 teams. The NBA is one of the major professional sports leagues in the United States and Canada and is widely regarded as the premier men's professional basketball league in the world. We use a dataset from basketball-reference.com, which provides detailed statistics for all players. In this project, we investigate the relationship between players' salaries and the following variables during the 2022-23 season:
* Position
* Age = age of players
* GP = number of games played
* GS = number of games started 
* Total Minutes = total minutes played
* PTS = points
* eFG% = effective field goal percentage
* TS% = true shooting percentage
* WS = win shares
* VORP = value over replacement player

### Preliminary exploratory data analysis:

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
install.packages("kknn")
install.packages("gridExtra")
library(kknn)
library(gridExtra)
options(repr.matrix.max.rows = 6)
source("tests.R")
source('cleanup.R')

### Methods:
From the 2022-2023 season dataset, the following variables will be selected as predictors:

* Age: often indicates a player's experience and professional maturity
* Games Played: reflects a player's availability and reliability
* Total Minutes Played: indicates a player's significance to the team
* True Shooting Percentage: measures a player's ability to score efficiently

The selection of these variables rests on the hypothesis that they help determine a player's worth and, consequently, their salary. Observations with missing data will be removed.

Scatterplots will illustrate the relationship between each variable and the player's salary, aiding in the visualization of potential correlations. If any variable shows a weak correlation with salary, it will be replaced. The main modeling technique will be multiple linear regression, with cross-validation used to assess model robustness. Finally, this model could help estimate what a retired player would earn today based on their past performances.

 

### Read the dataset using a relative path:

In [None]:
nba <- read_csv("nba_2022-23.csv")
nba

### Wrangling data:
Since the dataset has a large number of variables, we select only the variables that might be useful for the analysis to simplify our data.

In [None]:
nba_selected <- nba |>
  select("Player Name", Salary, Position, Age, GP, GS, "Total Minutes", PTS, "eFG%", "TS%", WS, VORP) |>
  rename(total_minutes = "Total Minutes", TSP = "TS%", eFGP = "eFG%") # syntactic column names
nba_selected

In [None]:
nba_filtered <- nba_selected |>
                   filter(abs(Salary - mean(Salary)) < 2 * sd(Salary)) |>
                   select(Salary, Age, GP, TSP, total_minutes)|>
                   drop_na()

nba_filtered

### Summary of the dataset:

In [None]:
# is.na() is summed over every cell, so this counts missing values, not rows
summarize(nba_filtered,
          "Number of observations" = nrow(nba_filtered),
          "Missing values" = sum(is.na(nba_filtered)))

results_salary <- nba_filtered |>
    summarize(variable = "Salary", max = max(Salary),min = min(Salary),mean = formatC(mean(Salary),digits=8)) 
results_age <- nba_filtered |>
    summarize(variable = "Age", max = max(Age),min = min(Age),mean = formatC(mean(`Age`),digits=4) )
results_GP <- nba_filtered |>
    summarize(variable = "GP", max = max(GP),min = min(GP),mean = formatC(mean(`GP`),digits=4)) 
results_mins <- nba_filtered |>
    summarize(variable = "Mins", max = max(total_minutes),min = min(total_minutes),mean = formatC(mean(total_minutes),digits=4)) 
results_TSP <- nba_filtered |>
    summarize(variable = "TSP", max = max(TSP),min = min(TSP),mean = formatC(mean(`TSP`),digits=4)) 

Summary_results <- rbind(results_salary,results_age,results_GP,results_mins,results_TSP)
Summary_results

### Visualization of dataset:

In [None]:
nba_salary_hist <- ggplot(nba_filtered, aes(x = Salary)) +
  geom_histogram() +
  labs(x = "Salary", title = "Distribution of Salary", caption = "Figure 1: Salary Distribution") +
  geom_vline(xintercept = 10021090, linetype = "dashed", linewidth = 1)

nba_salary_age_plot <- ggplot(nba_filtered, aes(x = Age, y = Salary)) +
  geom_point() +
  labs(x = "Age of players", y = "Salary in USD", title = "Plot for Age and Salary of NBA players", caption = "Figure 2: Age vs. Salary")

nba_salary_GP_plot <- ggplot(nba_filtered, aes(x = GP, y = Salary)) +
  geom_point() +
  labs(x = "Games played", y = "Salary in USD", title = "Plot for number of games played and Salary of NBA players", caption = "Figure 3: Games Played vs. Salary")

nba_salary_TSP_plot <- ggplot(nba_filtered, aes(x = TSP, y = Salary)) +
  geom_point() +
  labs(x = "True Shooting Percent", y = "Salary in USD", title = "Plot for TS% and Salary for NBA players", caption = "Figure 4: True Shooting % vs. Salary")

nba_salary_mins_plot <- ggplot(nba_filtered, aes(x = total_minutes, y = Salary)) +
  geom_point() +
  labs(x = "Total minutes", y = "Salary in USD", title = "Plot for total minutes played and Salary of NBA players", caption = "Figure 5: Total Minutes Played vs. Salary")

nba_salary_hist
nba_salary_age_plot
nba_salary_GP_plot
nba_salary_TSP_plot
nba_salary_mins_plot

# The histogram visualizes the distribution of salary for NBA players in the 2022-23 season
# A dashed vertical line marks where the mean salary lies

# The scatterplots show the relationship between each predictor and salary
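
The scatterplots give a visual sense of each relationship, and a correlation coefficient can put a number on it. The following is a minimal sketch, not part of the original analysis: `cor_with_salary` is a hypothetical helper, and `toy` is made-up data used only so the block runs on its own. In the notebook, the same call could be made on `nba_filtered`.

```r
# Hypothetical helper (an assumption, not from the original notebook):
# Pearson correlation of every non-Salary column with Salary.
cor_with_salary <- function(df) {
  predictors <- setdiff(names(df), "Salary")
  sapply(predictors, function(v) cor(df[[v]], df$Salary))
}

# Made-up toy data, for illustration only
toy <- data.frame(
  Salary = c(2e6, 5e6, 9e6, 15e6, 30e6),
  Age    = c(21, 24, 27, 30, 33),
  GP     = c(40, 75, 60, 70, 55)
)

toy_cor <- cor_with_salary(toy)
toy_cor
```

Values near 0 would suggest a weak linear relationship, which is the situation the Methods section says would lead to replacing a variable.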

### Perform analysis:

In [None]:
set.seed(1)
nba_split <- initial_split(nba_filtered, prop = 0.8, strata = Salary)
nba_training <- training(nba_split) #training data
nba_testing <- testing(nba_split) #testing data

In [None]:
nba_recipe <- recipe(Salary ~ ., data = nba_training) # create recipe to preprocess data

nba_spec <- linear_reg() |> # model specification
  set_engine("lm") |>
  set_mode("regression")

In [None]:
nba_fit <- workflow() |> # build the workflow and fit the model
  add_recipe(nba_recipe) |>
  add_model(nba_spec) |>
  fit(data = nba_training)
nba_fit
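
The Methods section mentions cross-validation, which the cells here do not run. One way it might look with tidymodels is sketched below, assuming 5 folds. `sim_training` is simulated data standing in for `nba_training` (all numbers are made up), and `sim_workflow` mirrors the recipe and model specification used above.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in for nba_training (made-up values, illustration only)
sim_training <- tibble(
  Age = sample(19:38, 200, replace = TRUE),
  GP = sample(1:82, 200, replace = TRUE),
  TSP = runif(200, 0.4, 0.7),
  total_minutes = GP * runif(200, 5, 36)
) |>
  mutate(Salary = 1e6 + 3e5 * (Age - 19) + 5e3 * total_minutes + rnorm(200, sd = 2e6))

# Same kind of workflow as in the notebook: linear regression on all predictors
sim_workflow <- workflow() |>
  add_recipe(recipe(Salary ~ ., data = sim_training)) |>
  add_model(linear_reg() |> set_engine("lm") |> set_mode("regression"))

# 5-fold cross-validation on the training data, stratified by Salary
folds <- vfold_cv(sim_training, v = 5, strata = Salary)

cv_metrics <- sim_workflow |>
  fit_resamples(resamples = folds) |>
  collect_metrics()
cv_metrics
```

The `mean` column of `cv_metrics` reports the average RMSE and R-squared across folds, which gives an estimate of out-of-sample error without touching the testing set.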

In [None]:
nba_test_result <- nba_fit |> 
  predict(nba_testing) |>
  bind_cols(nba_testing)
nba_test_result

In [None]:
nba_test_results_plot <- ggplot(nba_test_result, aes(x=.pred,y=Salary)) +
                               geom_point()+
                                geom_abline(slope=1,intercept=0)

nba_test_results_plot
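
The plot compares predictions to actual salaries visually. A numeric summary such as RMSE could complement it. The sketch below uses the `metrics()` function from yardstick (loaded with tidymodels) on `toy_result`, a made-up stand-in for `nba_test_result`; it is an illustration, not a result from the notebook.

```r
library(tidymodels)

# Made-up stand-in for nba_test_result: predictions next to true salaries
toy_result <- tibble(
  .pred  = c(4e6, 9e6, 12e6, 20e6),
  Salary = c(5e6, 8e6, 15e6, 18e6)
)

# For a numeric outcome, metrics() returns RMSE, R-squared, and MAE
toy_metrics <- toy_result |>
  metrics(truth = Salary, estimate = .pred)
toy_metrics
```

In the notebook, the same call on `nba_test_result` would give the test-set error in USD, the same unit as `Salary`.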

### Discussion

$\text{Predicted Salary (USD)} = -10778118 + (528004 \times \text{Age}) - (92459 \times \text{GP}) - (379557 \times \text{TSP}) + (7226 \times \text{Total Minutes Played})$
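
To see how the equation is applied, the sketch below plugs in values for a hypothetical player. The coefficients are copied from the equation above; `predict_salary` and the player's numbers are assumptions for illustration, and TSP is assumed to be on a 0 to 1 scale.

```r
# Coefficients copied from the fitted equation in the Discussion
predict_salary <- function(age, gp, tsp, total_minutes) {
  -10778118 + 528004 * age - 92459 * gp - 379557 * tsp + 7226 * total_minutes
}

# Hypothetical player: 27 years old, 70 games, TS% of 0.58, 2200 minutes
example_salary <- predict_salary(age = 27, gp = 70, tsp = 0.58, total_minutes = 2200)
example_salary
```

For these inputs the equation gives roughly 12.7 million USD. Note that the games-played term is negative while the minutes term is positive, so the two partly offset each other for players who appear in many games.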


### Impact:
The findings of this model could be useful for the training of professional basketball players. Coaches could track changes in each variable to judge whether a player is improving and adjust their training methods accordingly.

### Future questions:
Does the team the player is on have an impact on their salary?  
Do certain training methods yield higher salaries on average? 
How significant is the impact of player injuries on salary?