# COGS 118A- Project Checkpoint

# Names

- Jonathan Park
- Daniel Lee
- Suebeen Noh
- Franklin Le
- Daniel Renteria

# Abstract 

The goal of our project is to use NBA player data to create a machine learning algorithm that will predict the likelihood of a player becoming an all-star. We will be looking at different player statistics from each regular season from 2011/12 through 2021/22 to create a regression line that predicts the likelihood of a player becoming an all-star. Success will be measured by testing the created model with earlier stats from players who became all-stars and seeing how accurate the predictions are.

# Background

Around halfway through each NBA season, fans and media members vote on their favorite NBA players to play in the All-Star game. Fans are able to vote through various online means such as the NBA app, NBA website, and via Twitter<a name="voting"></a>[<sup>[1]</sup>](#votinginfo), and represent 50% of the overall vote. Media members and current players make up the other 50%. Being selected as an all-star is a prestigious accomplishment, and many players take pride in the number of all-star games they participated in as a mark of their legacy and impact on basketball.

For teams, it is vital to scout and sign players that they believe have the growth potential to become all-star level players, but do not yet command an enormous salary. This is because the NBA has a salary cap system, where there is a maximum amount of money they are able to spend on player salaries in one season. In the 2021/22 season for example, the salary cap was set to 112.4 million dollars<a name="salary"></a>[<sup>[2]</sup>](#salarycapinfo). This sounds like a lot of money, but when top players make upwards of $50 million by themselves, this salary cap gets filled quickly. Because teams are looking for cheaper players with more growth potential, we believe that we can use machine learning to support the scouting systems already in place.

In terms of prior work already done on this subject, there is already a lot of work being done in this particular field. For example, ESPN created a model in 2017 to predict which draft picks are likely to become all-stars <a name="espn"></a>[<sup>[3]</sup>](#espnpredict). NBA teams often have their own analytics departments, and media outlets such as ESPN also recognize the power of using machine learning and analytics to predict which players have the most potential. We will be working to build off the wealth of knowledge afforded to us and create our own model that works as effectively as any others.

# Problem Statement

The problem we are looking to answer is to predict which NBA players have the most potential to become all-stars. We will be using a variety of player stats and variables to create a model that can predict the growth of a player, and use that predicted growth to make an educated guess about the likelihood that they will become an all-star in the future. Given that we will be using 10 seasons of player data to build our player growth model, we feel that the volume of data will be adequate to ensure that the model will have a solid base of relatively unbiased data. This problem will be able to be replicated and expanded quite easily by adding more seasons of player data and including a wider range of players. We will also be able to test the model many times because every season there are new all-stars that are selected, and we can use this information to check the accuracy of the player growth model and making accurate all-star predictions.

# Data

The data we will be using will come from an online NBA data resource called basketball-reference.com. This site includes all of the player data we will need from each season. Given that we will be working with 10 seasons of data from 2011/12 through 2021/22, we expect to have a dataset of about 5,000 - 6,000 observations and we will be looking at around 8 - 10 variables.

- Example: https://www.basketball-reference.com/leagues/NBA_2019_per_game.html
- Each season has 500 - 600 observations, so 10 seasons of data will give 5,000 - 6,000 observations.
- Each observation is a player. Each observation has 28 variables, including Games, Team, Points, Rebounds, etc. We will be reducing the number of variables to only include the most relevant ones in determining all-star selection.
- Some critical variables that will be included (but not limited to): Points per Game, 2-Point Percentage, Position, Assists per Game
- The data we will use from basketball reference is already very clean, but likely we will remove any players that were only in the NBA for 1 season (given no growth to track), players with no NBA minutes played, and reduce the number of variables we will be looking at.

In [1]:
#importing data and packages
import pandas as pd
import numpy as np
#TODO: PUT IN THE Y VALUES IN BASIC SINCE SOME PLAYERS DON'T GET ALL STAR EACH YEAR, EITHER SPLIT UP DATA FROM NBA_ALL_STARS OR USE PANDAS TO MATCH CORRECT VALUES
NBA_Data = pd.DataFrame()
for i in range(13):
    basic = pd.read_csv(f'DATA/NBA_STATS_{2010+i}.csv')
    advanced = pd.read_csv(f'DATA/AD_NBA_STATS_{2010+i}.csv')
    cols_to_use = advanced.columns.difference(basic.columns)
    basic = pd.concat([basic, advanced[cols_to_use]],axis=1)
    basic['Season'] = 2010+i
    basic.drop_duplicates(subset=['Player'], keep='first', inplace = True)
    NBA_Data = pd.concat([NBA_Data, basic], axis=0)

In [2]:
#dropping duplicates, and unnecessary data
NBA_Data.drop(columns = ['Rk','Tm'], inplace = True)
NBA_Data.reset_index(drop=True, inplace = True)

In [3]:
#entering the per-game data
NBA_PerGame = pd.DataFrame()
for i in range(13):
    pergame = pd.read_csv(f'DATA/NBA_PerGame_{2010+i}.csv')
    NBA_PerGame = pd.concat([NBA_PerGame, pergame], axis=0)

In [4]:
NBA_PerGame.columns

Index(['Rk', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
      dtype='object')

In [5]:
#dropping columns not used as a feature
NBA_PerGame = NBA_PerGame.drop(columns=['Rk', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%',
       '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%',
       'ORB', 'DRB', 'PF', 'STL', 'TOV', 'Pos', 'Player']).reset_index().drop(columns=['index'])
NBA_PerGame.columns

Index(['TRB', 'AST', 'BLK', 'PTS'], dtype='object')

In [6]:
#renaming pergame columns
pergame_rename = {'TRB': 'Rebounds Per Game', 'AST': 'Assists Per Game', 'BLK': 'Blocks Per Game',
                 'PTS': 'Points Per Game'}

NBA_PerGame.rename(columns=pergame_rename, inplace=True)

In [7]:
#adding relevant pergame data to the main dataframe
NBA_Data = pd.concat([NBA_Data, NBA_PerGame], axis=1)
NBA_Data

Unnamed: 0,Player,Pos,Age,G,GS,MP,FG,FGA,FG%,3P,...,TS%,USG%,VORP,WS,WS/48,Season,Rebounds Per Game,Assists Per Game,Blocks Per Game,Points Per Game
0,Arron Afflalo\afflaar01,SG,24,82,75,2221.0,272.0,585.0,0.465,108.0,...,0.576,14.0,0.9,4.3,0.092,2010,3.1,1.7,0.4,8.8
1,Alexis Ajinça\ajincal01,C,21,6,0,30.0,5.0,10.0,0.500,0.0,...,0.479,19.3,0.0,0.0,-0.013,2010,0.7,0.0,0.2,1.7
2,LaMarcus Aldridge\aldrila01,PF,24,78,78,2922.0,579.0,1169.0,0.495,5.0,...,0.535,22.9,2.3,8.8,0.145,2010,8.0,2.1,0.6,17.9
3,Joe Alexander\alexajo01,SF,23,8,0,29.0,1.0,6.0,0.167,0.0,...,0.273,11.3,0.0,0.0,0.030,2010,0.6,0.3,0.1,0.5
4,Malik Allen\allenma01,PF,31,51,3,456.0,46.0,116.0,0.397,1.0,...,0.431,14.0,-0.4,0.1,0.009,2010,1.6,0.3,0.1,2.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6516,Thaddeus Young\youngth01,PF,33,52,1,845.0,141.0,272.0,0.518,17.0,...,0.548,17.4,0.9,2.2,0.126,2022,4.0,2.0,0.3,6.2
6517,Trae Young\youngtr01,PG,23,76,76,2652.0,711.0,1544.0,0.460,233.0,...,0.603,34.4,4.8,10.0,0.181,2022,3.7,9.7,0.1,28.4
6518,Omer Yurtseven\yurtsom01,C,23,56,12,706.0,130.0,247.0,0.526,1.0,...,0.546,19.9,0.2,2.1,0.145,2022,5.3,0.9,0.4,5.3
6519,Cody Zeller\zelleco01,C,29,27,0,355.0,51.0,90.0,0.567,0.0,...,0.627,15.9,0.0,1.1,0.143,2022,4.6,0.8,0.2,5.2


In [8]:
# renaming columns
renameColumns = {'2P': '2 Point Field Goals', '2P%': '2 Point Field Goal Percentage', '2PA': '2 Point Field Goal Attempts', '3P': '3 Point Field Goals',
                 '3P%': '3 Point Field Goal Percentage', '3PA': '3 Point Field Goal Attempts', '3PAr': '3 Point Attempt Rate', 'AST': 'Assists',
                 'AST%': 'Assist Percentage', 'BLK': 'Blocks', 'BLK%': 'Block Percentage', 'BPM': 'Box Plus/Minus', 'DBPM': 'Defensive Box Plus/Minus',
                 'DRB': 'Defensive Rebounds', 'DRB%': 'Defensive Rebound Percentage', 'DWS': 'Defensive Win Shares', 'FG': 'Field Goals',
                 'FG%': 'Field Goal Percentage', 'FGA': 'Field Goals Attempts', 'FT': 'Free Throws', 'FT%': 'Free Throw Percentage',
                 'FTA': 'Free Throw Attempts', 'FTr': 'Free Throw Attempt Rate', 'G': 'Games', 'GS': 'Games Started', 'MP': 'Minutes Played',
                 'OBPM': 'Offensive Box Plus/Minus', 'ORB': 'Offensive Rebounds', 'ORB%': 'Offensive Rebound Percentage', 'OWS': 'Offensive Win Shares',
                 'PER': 'Player Efficiency Rating', 'PF': 'Personal Fouls', 'PTS': 'Points', 'STL': 'Steals', 'STL%': 'Steal Percentage', 'TOV': 'Turnovers',
                 'TOV%': 'Turnover Percentage', 'TRB': 'Total Rebounds', 'TRB%': 'Total Rebound Percentage', 'TS%': 'True Shooting Percentage',
                 'USG%': 'Usage Percentage', 'VORP': 'Value over Replacement Player', 'WS': 'Win Shares', 'WS/48': 'Win Shares Per 48 Minutes',
                 'eFG%': 'Effective Field Goal Percentage'
                 }
NBA_Data.rename(columns=renameColumns, inplace=True)


In [9]:
#one hot encoding
np.where(NBA_Data['Pos'].isin(['PF', 'PG','SF','SG', 'C']),
    NBA_Data['Pos'],
    'other')
dummy= pd.get_dummies(NBA_Data['Pos'])
NBA_Data = pd.concat([NBA_Data,dummy], axis=1)

def one_hot_multi(x):
    for i in x['Pos'].split('-'):
        x[i] = 1
    return x

NBA_Data[NBA_Data['Pos'].str.contains('-')] = NBA_Data[NBA_Data['Pos'].str.contains('-')].apply(lambda x: one_hot_multi(x), axis = 1)
NBA_Data = NBA_Data[NBA_Data.columns.drop(list(NBA_Data.filter(regex='-')))]
NBA_Data.drop(columns = 'Pos', inplace = True)
NBA_Data.rename(columns={'C':'Center','PF':'Power Forward','PG':'Point Guard','SF':'Small Forward','SG':'Shooting Guard'}, inplace=True)

In [10]:
def remove_asterisk(name):
    if name[-1]=='*':
        return name[:-1]
    else:
        return name

In [11]:
#filling null values and getting rid of string for player names
NBA_Data.fillna(0, inplace=True)
NBA_Data['Player'] = NBA_Data['Player'].apply(lambda x: x.split('\\')[0])
NBA_Data['Player'] = NBA_Data['Player'].apply(remove_asterisk)

In [12]:
NBA_Data.columns

Index(['Player', 'Age', 'Games', 'Games Started', 'Minutes Played',
       'Field Goals', 'Field Goals Attempts', 'Field Goal Percentage',
       '3 Point Field Goals', '3 Point Field Goal Attempts',
       '3 Point Field Goal Percentage', '2 Point Field Goals',
       '2 Point Field Goal Attempts', '2 Point Field Goal Percentage',
       'Effective Field Goal Percentage', 'Free Throws', 'Free Throw Attempts',
       'Free Throw Percentage', 'Offensive Rebounds', 'Defensive Rebounds',
       'Total Rebounds', 'Assists', 'Steals', 'Blocks', 'Turnovers',
       'Personal Fouls', 'Points', '3 Point Attempt Rate', 'Assist Percentage',
       'Block Percentage', 'Box Plus/Minus', 'Defensive Box Plus/Minus',
       'Defensive Rebound Percentage', 'Defensive Win Shares',
       'Free Throw Attempt Rate', 'Offensive Box Plus/Minus',
       'Offensive Rebound Percentage', 'Offensive Win Shares',
       'Player Efficiency Rating', 'Steal Percentage', 'Turnover Percentage',
       'Total Rebound 

In [13]:
relevant_cols = ['Player', '3 Point Field Goal Percentage', '2 Point Field Goal Percentage', 
                'Player Efficiency Rating', 'True Shooting Percentage', 'Usage Percentage', 
                'Win Shares Per 48 Minutes', 'Season', 'Rebounds Per Game',
                'Assists Per Game', 'Blocks Per Game', 'Points Per Game', 'Center',
                'Power Forward', 'Point Guard', 'Small Forward', 'Shooting Guard']

NBA_Features = NBA_Data[relevant_cols]

In [14]:
NBA_Features['All Star'] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  NBA_Features['All Star'] = 0


In [15]:
All_Stars = pd.read_csv('DATA/NBA_ALL_STARS.csv').drop(columns=['Year'])
All_Stars = All_Stars.fillna(0)
All_Stars

Unnamed: 0,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,LeBron James,Kevin Durant,Kevin Durant,Kevin Durant,Kyrie Irving,James Harden,Stephen Curry,Anthony Davis,LeBron James,LeBron James,Kawhi Leonard,Luka Dončić,Stephen Curry
1,Dwyane Wade,Kobe Bryant,Kobe Bryant,Kobe Bryant,LeBron James,Stephen Curry,Kobe Bryant,James Harden,Kevin Durant,James Harden,Anthony Davis,Stephen Curry,LeBron James
2,Dwight Howard,Chris Paul,Blake Griffin,Blake Griffin,Paul George,Marc Gasol,Kawhi Leonard,Stephen Curry,Russell Westbrook,Kevin Durant,LeBron James,Giannis Antetokounmpo,Giannis Antetokounmpo
3,Joe Johnson,Carmelo Anthony,Chris Paul,Chris Paul,Carmelo Anthony,Klay Thompson,Kevin Durant,Kevin Durant,Kyrie Irving,Kyrie Irving,James Harden,Nikola Jokić,DeMar DeRozan
4,Kevin Garnett,Tim Duncan,Andrew Bynum,Dwight Howard,Dwyane Wade,LaMarcus Aldridge,Russell Westbrook,Kawhi Leonard,Anthony Davis,Kawhi Leonard,Luka Dončić,LeBron James,Nikola Jokić
5,Chris Bosh,Pau Gasol,Russell Westbrook,James Harden,Joakim Noah,Chris Paul,James Harden,Marc Gasol,Paul George,Damian Lillard,Ben Simmons,Chris Paul,Luka Dončić
6,Rajon Rondo,Manu Ginóbili,Kevin Love,Tony Parker,John Wall,Russell Westbrook,Klay Thompson,Russell Westbrook,Andre Drummond,Klay Thompson,Russell Westbrook,Jaylen Brown,Darius Garland
7,Gerald Wallace,Deron Williams,Dirk Nowitzki,Russell Westbrook,DeMar DeRozan,DeMarcus Cousins,Chris Paul,Klay Thompson,Bradley Beal,Bradley Beal,Chris Paul,Paul George,Jarrett Allen
8,Derrick Rose,Blake Griffin,Marc Gasol,David Lee,Paul Millsap,Damian Lillard,Anthony Davis,Gordon Hayward,Victor Oladipo,Ben Simmons,Devin Booker,Damian Lillard,Fred VanVleet
9,Al Horford,Dirk Nowitzki,Tony Parker,Zach Randolph,Roy Hibbert,Tim Duncan,LaMarcus Aldridge,Draymond Green,Kemba Walker,Karl-Anthony Towns,Domantas Sabonis,Domantas Sabonis,Jimmy Butler


In [16]:
#STUCK HERE


# for i in np.arange(2010, 2023):
#     stars_that_season = All_Stars.get(str(i))
#     for j in stars_that_season:
#         if j!=0:
#             player = NBA_Features[NBA_Features.get('Player')==j]
#             player_index = int(player[player.get('Season')==i].index[0])
#             NBA_Features.loc[(player_index, 'All Star')]=1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


IndexError: index 0 is out of bounds for axis 0 with size 0

In [17]:
player = NBA_Features[NBA_Features.get('Player')=='Anthony Davis']
player
# player_index = int(player[player.get('Season')==2021].index[0])
# player_index

Unnamed: 0,Player,3 Point Field Goal Percentage,2 Point Field Goal Percentage,Player Efficiency Rating,True Shooting Percentage,Usage Percentage,Win Shares Per 48 Minutes,Season,Rebounds Per Game,Assists Per Game,Blocks Per Game,Points Per Game,Center,Power Forward,Point Guard,Small Forward,Shooting Guard,All Star
1471,Anthony Davis,0.0,0.521,21.7,0.559,21.8,0.159,2013,8.2,1.0,1.8,13.5,0,1,0,0,0,0
1956,Anthony Davis,0.222,0.522,26.5,0.582,25.2,0.212,2014,10.0,1.6,2.8,20.8,0,1,0,0,0,1
2439,Anthony Davis,0.083,0.54,30.8,0.591,27.8,0.274,2015,10.2,2.2,2.9,24.4,0,1,0,0,0,1
2921,Anthony Davis,0.324,0.511,25.0,0.559,29.6,0.16,2016,10.3,1.9,2.0,24.3,1,0,0,0,0,1
3390,Anthony Davis,0.299,0.524,27.5,0.58,32.6,0.195,2017,11.8,2.1,2.2,28.0,1,0,0,0,0,1
3898,Anthony Davis,0.34,0.558,28.9,0.612,30.0,0.241,2018,11.1,2.3,2.6,28.1,0,1,0,0,0,1
4442,Anthony Davis,0.331,0.547,30.3,0.597,29.5,0.247,2019,12.0,3.9,2.4,25.9,1,0,0,0,0,0
4969,Anthony Davis,0.33,0.546,27.4,0.61,29.3,0.25,2020,9.3,3.2,2.3,26.1,0,1,0,0,0,0
5493,Anthony Davis,0.26,0.536,22.1,0.556,29.2,0.152,2021,7.9,3.1,1.6,21.8,0,1,0,0,0,0
6042,Anthony Davis,0.186,0.571,23.9,0.578,27.1,0.155,2022,9.9,3.1,2.3,23.2,1,0,0,0,0,0


# Proposed Solution

Although the problem statement may still be subject to change, the current proposed problem statement is to create a machine learning program that could predict the success of a player based on their stats, measuring success as being part of an all-star team selection. We will use past all stars as the training subjects and test on this year’s all stars to determine how accurate the program is. One way that we can program this machine learning algorithm is by creating a boundary line that can separate non-allstars from all-stars. 

This might not be a viable solution due to the sheer number of ways players can impact winning basketball, but since nba all-stars historically are chosen due to their offensive impact, creating a boundary that favors offensive statistics like points and assists may help in creating a more accurate boundary. If there happens to be more than the total amount of all-star selections available, the program can choose players that are farther away from the boundary relative to other players and vice versa for the opposite scenario.

# Evaluation Metrics

One metric thet we'll be using is Precision. The reason we'll be using Precision is because there are more NBA Players that do not become All Stars than there are that do make it. Since it's highly selective, we believe that that there will be various false positives in our data sets since the stats could be so similar. The formula for precision is: $$ \frac{True \ Positive}{True \ Positive \ + False \ Positive} $$

We will also be using an F1-Score in order to see where the cut off is in our model. We went against ROC because this data is not balanced. In order to find the F1 score, you must know recall, which is $$ \frac{True \ Positive}{True \ Positive \ + False \ Neative} $$
then you can get F1, which is: $$F_{\beta} = (1+\beta^2)\frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall} $$

# Preliminary results

NEW SECTION!

Please show any preliminary results you have managed to obtain.

Examples would include:
- Analyzing the suitability of a dataset or alogrithm for prediction/solving your problem 
- Performing feature selection or hand-designing features from the raw data. Describe the features available/created and/or show the code for selection/creation
- Showing the performance of a base model/hyper-parameter setting.  Solve the task with one "default" algorithm and characterize the performance level of that base model.
- Learning curves or validation curves for a particular model
- Tables/graphs showing the performance of different models/hyper-parameters



# Ethics & Privacy

Our datasets are from a public datasets with credible sources and not violating any privacy or terms of use agreement. Also, our datasets are not in the format of self-reported nor survey which removes any forms of biases. Most of our data is based on accurate statistical seasonal records of players from the NBA itself and other credible sources that put the focus only on the NBA statistics. The statistical seasonal records of NBA players is not used to expose any personal information nor criticize the players, but to help to analyze the statistics seasonal records of NBA players and predict who would be the next all-star player.

# Team Expectations 

Put things here that cement how you will interact/communicate as a team, how you will handle conflict and difficulty, how you will handle making decisions and setting goals/schedule, how much work you expect from each other, how you will handle deadlines, etc...
* Communicating about when you are unable to do your part for some reason.
* Willing to make time for team meetings.
* Doing the work that you commit yourself to in the team meetings.
* Even splitting of workload.
* Completing tasks in a timely manner.

# Project Timeline Proposal

Update accordingly

| Meeting Date  | Meeting Time | Objectives  | 
|---|---|---|---|
| 4/23  |  11 AM |  Brainstorm project ideas/datasets, communicate group guidelines (forms of communication, schedules, roles), complete Project Proposal  | 
| 4/30  |  11 AM |  Peer review of proposals, do background research, discuss datasets and cleaning, discuss ethics | 
| 5/7  | 11 AM  | data wrangling and possible analytical approaches, combine various datasets to create new views, assign group members to lead each specific part  | 
| 5/14  | 11 AM  | Review/edit data wrangling, discuss analysis plan, edit project code, Checkpoint | 
| 5/21  | 11 AM  | Peer review checkpoint, visualize data, discuss/edit project code | 
| 5/28  | 11 AM  | Discuss/edit full project| 
| 6/4  | 11 AM  | Have project ready for turn in on 6/8, team evaluation survey  | 

# Footnotes
<a name="votinginfo"></a>1.[^](#voting): Greer, J. (20 Jan 2022) NBA All-Star voting 2022: How it works, fan vote end date, latest results & leaders. *The Sporting News*. https://www.sportingnews.com/us/nba/news/nba-all-star-voting-2022-how-it-works-leaders-results-end-date/1ubkauu43tcfq1xoqp80sck14g<br> 
<a name="salarycapinfo"></a>2.[^](#salary): NBA. (2 Aug 2021) Salary cap set at Hundred twelve point four million for 2021-22 season. *NBA*. https://www.nba.com/news/salary-cap-set-at-112-4-million-for-2021-22-season<br>
<a name="espnpredict"></a>3.[^](#espn): Sabin, P. (20 Jun 2017) Analytics help separate the All-Stars from the potential busts. *ESPN*. https://www.espn.com/nba/story/_/id/19681478/most-likely-all-stars-starters-role-players-top-2017-nba-draft<br>