# Predicting NBA All-Star Chance Based on Player Performance

By: Bill Makwae, Ayush Vora, Ray Nguyen, QingRu Kong

## Introduction

Every February, NBA fans get to see their favorite players selected for the All-Star Game. Players are selected by media and fan votes, so popularity is the deciding factor. However, a player's popularity is likely driven by their individual game-to-game performance. This analysis therefore asks: can an NBA player's selection to the All-Star Game be predicted from their annual performance?

To answer this question, we will use two datasets: ["NBA Player Stats" on nba.com](https://www.nba.com/stats/players/traditional/?sort=PTS&dir=-1&Season=2015-16&SeasonType=Regular%20Season) and ["NBA All Stars 2000-2016" from kaggle.com](https://www.kaggle.com/fmejia21/nba-all-star-game-20002016?select=NBA+All+Stars+2000-2016+-+Sheet1.csv). NBA Player Stats contains player statistics for each season from 2010 to 2016, and the All Star dataset covers All-Star selections from 2000 to 2016. Using these datasets, we aim to build a classification model that predicts whether a player will be an All-Star in a given season based on their annual performance.

## Preliminary Exploratory Data Analysis

In [1]:
## RUN THIS FIRST TO LOAD LIBRARIES

library(tidyverse)
library(repr)
library(readxl)
library(dplyr)
library(GGally)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2



In [2]:
## Data set #1: NBA Player Stats
## Filtered from years 2011-2015

players <- read.csv("https://raw.githubusercontent.com/RayNguyent/DSCI-100-project/develop/data/nba_player_stats.csv")

head(players)

players_filtered <- players %>% 
    filter(Year <= 2015 & Year >= 2011) %>% 
    filter(Player != "0") %>%
    # TODO: I'm not exactly sure what we decided on about the variable choice discussion we had earlier, but I think we should only
    #       have one table here, either the full or the cropped one. If we are only using the metrics below, we should probably
    #       use the filtered one, in which case remove head(players).
    select(Year, Player, MIN, PTS, FG., REB, AST)
head(players_filtered)

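| | Year | Player | not_Match_Up | MIN | PTS | not_WL | not_MIN | FG. | not_FGM | not_FGA | ⋯ | AST | not_DREB | not_REB | not_AST | not_STL | BLK | TOV | PF | PlusMinus | Fantasy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<lgl>` | `<lgl>` | `<lgl>` | `<lgl>` | `<lgl>` |
| 1 | 2011 | Kevin Durant | 78 | 38.9 | 27.7 | 9.1 | 19.7 | 46.2 | 1.9 | 5.3 | ⋯ | 2.7 | 1.1 | 1.0 | 2.8 | 24.9 | | | | | |
| 2 | 2011 | LeBron James | 79 | 38.8 | 26.7 | 9.6 | 18.8 | 51.0 | 1.2 | 3.5 | ⋯ | 7.0 | 1.6 | 0.6 | 3.6 | 28.6 | | | | | |
| 3 | 2011 | Carmelo Anthony | 77 | 35.7 | 25.6 | 8.9 | 19.5 | 45.5 | 1.2 | 3.3 | ⋯ | 2.9 | 0.9 | 0.6 | 2.7 | 22.7 | | | | | |
| 4 | 2011 | Dwyane Wade | 76 | 37.1 | 25.5 | 9.1 | 18.2 | 50.0 | 0.8 | 2.7 | ⋯ | 4.6 | 1.5 | 1.1 | 3.1 | 24.8 | | | | | |
| 5 | 2011 | Kobe Bryant | 82 | 33.9 | 25.3 | 9.0 | 20.0 | 45.1 | 1.4 | 4.3 | ⋯ | 4.7 | 1.2 | 0.1 | 3.0 | 21.4 | | | | | |
| 6 | 2011 | Amar'e Stoudemire | 78 | 36.8 | 25.3 | 9.5 | 19.0 | 50.2 | 0.1 | 0.3 | ⋯ | 2.6 | 0.9 | 1.9 | 3.2 | 24.6 | | | | | |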


| | Year | Player | MIN | PTS | FG. | REB | AST |
|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 2011 | Kevin Durant | 38.9 | 27.7 | 46.2 | 6.8 | 2.7 |
| 2 | 2011 | LeBron James | 38.8 | 26.7 | 51.0 | 7.5 | 7.0 |
| 3 | 2011 | Carmelo Anthony | 35.7 | 25.6 | 45.5 | 7.3 | 2.9 |
| 4 | 2011 | Dwyane Wade | 37.1 | 25.5 | 50.0 | 6.4 | 4.6 |
| 5 | 2011 | Kobe Bryant | 33.9 | 25.3 | 45.1 | 5.1 | 4.7 |
| 6 | 2011 | Amar'e Stoudemire | 36.8 | 25.3 | 50.2 | 8.2 | 2.6 |


In [3]:
## Data set #2: NBA All Stars 2000-2016
## Filtered from years 2011-2015

all_stars <- read_csv("https://raw.githubusercontent.com/RayNguyent/DSCI-100-project/develop/data/all_stars_2000_2016.csv")
all_stars_filtered <- all_stars %>% 
    filter(Year <= 2015 & Year >= 2011) %>% 
    select(Year, Player) %>% 
    mutate(Is_All_Star = "All Star")
head(all_stars_filtered)

Parsed with column specification:
cols(
  Year = col_double(),
  Player = col_character(),
  Pos = col_character(),
  HT = col_character(),
  WT = col_double(),
  Team = col_character(),
  `Selection Type` = col_character(),
  `NBA Draft Status` = col_character(),
  Nationality = col_character()
)



| Year | Player | Is_All_Star |
|---|---|---|
| `<dbl>` | `<chr>` | `<chr>` |
| 2015 | LeBron James | All Star |
| 2015 | Dwyane Wade | All Star |
| 2015 | Paul George | All Star |
| 2015 | Carmelo Anthony | All Star |
| 2015 | Kyle Lowry | All Star |
| 2015 | Jimmy Butler | All Star |


In [None]:
## Combined data sets

combined_data <- left_join(players_filtered, all_stars_filtered, by = c("Year", "Player")) %>% 
    # Fill only the label column, so missing values in the stat columns are not overwritten with text
    mutate(Is_All_Star = as_factor(replace_na(Is_All_Star, "Regular")))
head(combined_data)
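
To illustrate what the join-and-label step does, here is a minimal sketch on a few made-up rows. The data frames `toy_players` and `toy_all_stars` and all values in them are hypothetical and are not taken from the real datasets. The sketch fills only the label column with `tidyr::replace_na()`, which is one way to keep missing values in the numeric columns from being turned into text.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows (made-up values, NOT from the real datasets)
toy_players <- data.frame(
  Year   = c(2015, 2015, 2015),
  Player = c("Player A", "Player B", "Player C"),
  PTS    = c(25.1, 8.3, NA),
  stringsAsFactors = FALSE
)
toy_all_stars <- data.frame(
  Year        = 2015,
  Player      = "Player A",
  Is_All_Star = "All Star",
  stringsAsFactors = FALSE
)

toy_combined <- toy_players %>%
  # Keep every player row; players without a match get NA in Is_All_Star
  left_join(toy_all_stars, by = c("Year", "Player")) %>%
  # Replace NA in the label column only, then convert the label to a factor
  mutate(Is_All_Star = replace_na(Is_All_Star, "Regular")) %>%
  mutate(Is_All_Star = factor(Is_All_Star, levels = c("Regular", "All Star")))

# Class balance: how many rows fall into each label
count(toy_combined, Is_All_Star)
```

On the real data, a count like the last line would be worth checking before modeling, since All-Stars are likely to be a small share of all player seasons.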

In [None]:
## Graphs displaying relationships between each variable.

plot <- ggpairs(combined_data, columns = 3:7, 
                ggplot2::aes(color = Is_All_Star, alpha = 0.4),
                # Show correlation coefficients (in smaller text) in the upper panels
                upper = list(continuous = wrap("cor", size = 2.5)))
# TODO: Add a legend to the graph, along with x and y labels. Also, standardize the stats if possible.
plot

## Methods

We will compute each player's annual statistics (points, rebounds, assists, minutes per game, and field goal percentage) for each year and use them as predictors in our classification model. We chose these variables because they are the most indicative of a player's offensive output, which is the main focus of the All-Star Game.

We will visualize the results with scatterplots and density plots of points, rebounds, and assists, colored by All-Star status.
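
The proposal does not name a specific classifier, so the following is only a minimal sketch of what such a model could look like, assuming K-nearest neighbors via `knn()` from the `class` package. All numbers are simulated for illustration and are not real NBA data. The names `toy` and `train_idx`, the 80/20 split, the choice of `k = 5`, and the use of only three of the five predictors are hypothetical choices.

```r
library(class)  # provides knn(); choosing this classifier is an assumption

set.seed(1)  # make the simulated data and the split reproducible

# Simulated season averages (made-up numbers, NOT real NBA data)
n_reg  <- 40
n_star <- 10
toy <- data.frame(
  PTS = c(rnorm(n_reg, mean = 9, sd = 3),   rnorm(n_star, mean = 24, sd = 3)),
  REB = c(rnorm(n_reg, mean = 4, sd = 1.5), rnorm(n_star, mean = 7,  sd = 1.5)),
  AST = c(rnorm(n_reg, mean = 2, sd = 1),   rnorm(n_star, mean = 5,  sd = 1.5)),
  Is_All_Star = factor(rep(c("Regular", "All Star"), times = c(n_reg, n_star)))
)

# Hold out 20% of the rows for testing
train_idx <- sample(nrow(toy), size = round(0.8 * nrow(toy)))
predictors <- c("PTS", "REB", "AST")

# Standardize with the training mean and sd, then apply the same
# transformation to the test rows so both sets share one scale
train_x <- scale(toy[train_idx, predictors])
test_x  <- scale(toy[-train_idx, predictors],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Predict each test row from its 5 nearest training neighbors
pred <- knn(train = train_x, test = test_x,
            cl = toy$Is_All_Star[train_idx], k = 5)

# Compare predictions with the true labels
accuracy <- mean(pred == toy$Is_All_Star[-train_idx])
table(predicted = pred, actual = toy$Is_All_Star[-train_idx])
```

Standardizing matters for a distance-based method like this one because points per game are on a much larger scale than assists, and would otherwise dominate the distance calculation.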

TODO: This looks unfinished, so finish this.

## Expected Outcomes and Significance

Based on each player's annual summary statistics, we expect players with top performances (points, assists, rebounds, …) to have a higher likelihood of being selected for the All-Star Game.

What impact could such findings have?

Our findings could inform how a player's value is assessed, particularly for commercial investment. Companies and sponsors could sign and invest in certain players early in the season based on their chance of being selected for the All-Star Game, in order to maximize their return on investment. Teams could also focus their resources on certain players based on our predictions. Players may also be able to use the model to analyze their weaknesses and improve their overall performance.

What future questions could this lead to?

- When a specific player is traded, can that player's value to a specific team be assessed (for example, whether the player addresses a deficiency in defense, offense, or rebounding that the team may have)?
- How can players improve their performance based on our model?
- Other than improving their performance, how can players maximize fan votes, current member votes, and media votes?
- How much revenue can an All-Star player bring to a team after being selected to play in the All-Star Game?

TODO: This also looks unfinished, so finish/clean up too.

TODO: Before submitting... 
- Check word count and make sure it's under 500.
- Check over rubric and make sure everything is good.
- Read over one last time to make sure everything sounds good.
- Push this into the main branch