# COGS 118A - Project Checkpoint

# Names

- Jonathan Park
- Daniel Lee
- Suebeen Noh
- Franklin Le
- Daniel Renteria

# Abstract 

The goal of our project is to use NBA player data to build a machine learning model that predicts the likelihood of a player becoming an all-star. We will fit the model on player statistics from each regular season from 2011/12 through 2021/22. Success will be measured by testing the model on earlier stats from players who became all-stars and checking how accurate its predictions are.

# Background

Around halfway through each NBA season, fans and media members vote on their favorite NBA players to play in the All-Star game. Fans are able to vote through various online means such as the NBA app, NBA website, and via Twitter<a name="voting"></a>[<sup>[1]</sup>](#votinginfo), and represent 50% of the overall vote. Media members and current players make up the other 50%. Being selected as an all-star is a prestigious accomplishment, and many players take pride in the number of all-star games they participated in as a mark of their legacy and impact on basketball.

For teams, it is vital to scout and sign players that they believe have the growth potential to become all-star level players, but do not yet command an enormous salary. This is because the NBA has a salary cap system, which sets a maximum amount of money each team can spend on player salaries in one season. In the 2021/22 season, for example, the salary cap was set at 112.4 million dollars<a name="salary"></a>[<sup>[2]</sup>](#salarycapinfo). This sounds like a lot of money, but when top players make upwards of $40 million by themselves, the cap fills quickly. Because teams are looking for cheaper players with more growth potential, we believe that we can use machine learning to support the scouting systems already in place.

A lot of prior work already exists in this field. For example, ESPN created a model in 2017 to predict which draft picks are likely to become all-stars <a name="espn"></a>[<sup>[3]</sup>](#espnpredict). NBA teams often have their own analytics departments, and media outlets such as ESPN also recognize the power of using machine learning and analytics to predict which players have the most potential. We will build on this body of knowledge and aim to create our own model that works as effectively as the others.

# Problem Statement

The problem we are looking to solve is predicting which NBA players have the most potential to become all-stars. We will use a variety of player stats and variables to create a model that predicts the growth of a player, and use that predicted growth to estimate the likelihood that they will become an all-star in the future. Given that we will be using 10 seasons of player data to build our player growth model, we feel that the volume of data will be adequate to give the model a solid base of relatively unbiased data. The problem can be replicated and expanded quite easily by adding more seasons of player data and including a wider range of players. We will also be able to test the model many times, because new all-stars are selected every season, and we can use this information to check the accuracy of the player growth model and of its all-star predictions.

# Data

The data we will be using will come from an online NBA data resource called basketball-reference.com. This site includes all of the player data we will need from each season. Given that we will be working with 10 seasons of data from 2011/12 through 2021/22, we expect to have a dataset of about 5,000 - 6,000 observations and we will be looking at around 8 - 10 variables.

- Example: https://www.basketball-reference.com/leagues/NBA_2019_per_game.html
- Each season has 500 - 600 observations, so 10 seasons of data will give 5,000 - 6,000 observations.
- Each observation is one player's per-game statistics for one season. Each observation has 28 variables, including Games, Team, Points, Rebounds, etc. We will reduce the number of variables to include only those most relevant to all-star selection.
- Critical variables will include, but are not limited to: Points per Game, 2-Point Percentage, Position, Assists per Game
- The data from basketball-reference.com is already very clean, but we will likely remove players who were in the NBA for only 1 season (no growth to track) and players with no NBA minutes played, and we will reduce the number of variables we look at.
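
The planned cleaning steps in the list above could be sketched as follows. This is a minimal sketch on a toy table, not our final pipeline: the `Season` column, the helper name `clean_seasons`, and the sample rows are assumptions made for illustration, while `Player` and `Minutes Played` follow the column names used in our cleaning code.

```python
import pandas as pd

# Toy stand-in for several seasons stacked into one table.
# The rows are made up and the 'Season' column is a hypothetical addition.
seasons = pd.DataFrame({
    "Player": ["A", "A", "B", "C", "C", "D"],
    "Season": [2012, 2013, 2012, 2012, 2013, 2013],
    "Minutes Played": [30.1, 32.4, 5.0, 0.0, 12.3, 8.8],
    "Points": [15.2, 17.0, 1.1, 0.0, 4.2, 3.3],
})

def clean_seasons(df):
    # Drop player-seasons with no minutes played.
    played = df[df["Minutes Played"] > 0]
    # Count the distinct seasons left for each player.
    n_seasons = played.groupby("Player")["Season"].transform("nunique")
    # Keep only players with at least two seasons, so growth can be tracked.
    return played[n_seasons >= 2].reset_index(drop=True)

cleaned = clean_seasons(seasons)
print(cleaned)
```

Filtering on minutes first means a player whose only other season had no minutes is also dropped; whether that ordering is the right one is a choice we would still need to make.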

In [1]:
#importing data and packages
import pandas as pd
import numpy as np
data = pd.read_csv('DATA/NBA_STATS_2011.csv')

In [2]:
#dropping duplicates, and unnecessary data
data.drop(columns = ['Rk','Tm'], inplace = True)
data.drop_duplicates(subset=['Player'], keep='first', inplace = True)
data.reset_index(drop=True, inplace = True)

In [3]:
#renaming columns to descriptive names
renameColumns = {'G':'Games', 'GS':'Games Started', 'MP':'Minutes Played','FG':'Field Goals','FGA':'Field Goals Attempts','FG%':'Field Goal Percentage',
                 '3P':'3 Point Field Goals','3PA':'3 Point Field Goal Attempts', '3P%':'3 Point Field Goal Percentage','2P':'2 Point Field Goals','2PA':'2 Point Field Goal Attempts', 
                 '2P%':'2 Point Field Goal Percentage','eFG%':'Effective Field Goal Percentage','FT':'Free Throws','FTA':'Free Throw Attempts', 'FT%':'Free Throw Percentage',
                 'ORB':'Offensive Rebounds','DRB':'Defensive Rebounds','TRB':'Total Rebounds','AST':'Assists','STL':'Steals','BLK':'Blocks','TOV':'Turnovers','PF':'Personal Fouls','PTS':'Points'
                 } 
data.rename(columns=renameColumns, inplace=True)

In [4]:
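#one hot encoding of positions
#hyphenated multi-position labels get their own dummy column here and are split into single positions below
#(the earlier np.where call was removed: its result was never assigned, so it had no effect)
data2 = pd.get_dummies(data['Pos'], dtype=int)
data = pd.concat([data,data2], axis=1)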

def one_hot_multi(x):
    for i in x['Pos'].split('-'):
        x[i] = 1
    return x

data[data['Pos'].str.contains('-')] = data[data['Pos'].str.contains('-')].apply(lambda x: one_hot_multi(x), axis = 1)
data = data[data.columns.drop(list(data.filter(regex='-')))]
data.drop(columns = 'Pos', inplace = True)
data.rename(columns={'C':'Center','PF':'Power Forward','PG':'Point Guard','SF':'Small Forward','SG':'Shooting Guard'}, inplace=True)

In [5]:
#filling null values with 0; player names keep their basketball-reference ID suffix
data.fillna(0, inplace=True)
data

|  | Player | Age | Games | Games Started | Minutes Played | Field Goals | Field Goals Attempts | Field Goal Percentage | 3 Point Field Goals | 3 Point Field Goal Attempts | ... | Steals | Blocks | Turnovers | Personal Fouls | Points | Center | Power Forward | Point Guard | Small Forward | Shooting Guard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jeff Adrien\adrieje01 | 25 | 8 | 0 | 7.9 | 0.9 | 2.0 | 0.438 | 0.0 | 0.0 | ... | 0.0 | 0.3 | 0.3 | 1.6 | 2.6 | 0 | 1 | 0 | 0 | 0 |
| 1 | Arron Afflalo\afflaar01 | 26 | 62 | 62 | 33.6 | 5.3 | 11.3 | 0.471 | 1.4 | 3.6 | ... | 0.6 | 0.2 | 1.4 | 2.2 | 15.2 | 0 | 0 | 0 | 0 | 1 |
| 2 | Blake Ahearn\ahearbl01 | 27 | 4 | 0 | 7.5 | 1.0 | 3.5 | 0.286 | 0.5 | 2.3 | ... | 0.0 | 0.0 | 1.3 | 1.0 | 2.5 | 0 | 0 | 1 | 0 | 0 |
| 3 | Solomon Alabi\alabiso01 | 23 | 14 | 0 | 8.7 | 0.9 | 2.6 | 0.361 | 0.0 | 0.0 | ... | 0.1 | 0.6 | 0.4 | 0.8 | 2.4 | 1 | 0 | 0 | 0 | 0 |
| 4 | Cole Aldrich\aldrico01 | 23 | 26 | 0 | 6.7 | 0.8 | 1.6 | 0.524 | 0.0 | 0.0 | ... | 0.3 | 0.6 | 0.3 | 0.8 | 2.2 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 473 | Chris Wright\wrighch01 | 23 | 24 | 1 | 7.8 | 1.0 | 1.9 | 0.511 | 0.0 | 0.0 | ... | 0.3 | 0.5 | 0.3 | 0.9 | 2.9 | 0 | 0 | 0 | 1 | 0 |
| 474 | Dorell Wright\wrighdo01 | 26 | 61 | 61 | 27.0 | 3.6 | 8.6 | 0.422 | 1.7 | 4.8 | ... | 1.0 | 0.4 | 0.8 | 1.6 | 10.3 | 0 | 0 | 0 | 1 | 0 |
| 475 | Nick Young\youngni01 | 26 | 62 | 35 | 27.9 | 5.1 | 12.6 | 0.403 | 1.7 | 4.5 | ... | 0.7 | 0.3 | 1.3 | 2.3 | 14.2 | 0 | 0 | 0 | 0 | 1 |
| 476 | Sam Young\youngsa01 | 26 | 35 | 2 | 10.7 | 1.3 | 3.6 | 0.354 | 0.1 | 0.5 | ... | 0.5 | 0.2 | 0.5 | 0.7 | 3.3 | 0 | 0 | 0 | 1 | 0 |


# Proposed Solution

Although the problem statement may still change, our current proposed solution is a machine learning program that predicts the success of a player based on their stats, where success means being selected to an all-star team. We will train on past all-stars and test on this year's all-stars to determine how accurate the program is. One way to build this is to learn a decision boundary that separates all-stars from non-all-stars.

This might not be a viable solution given the sheer number of ways players can impact winning basketball. However, since NBA all-stars have historically been chosen largely for their offensive impact, a boundary that favors offensive statistics like points and assists may be more accurate. If the model predicts more all-stars than there are selections available, the program can keep the players that are farthest from the boundary on the all-star side; if it predicts too few, it can add the non-all-stars that are closest to the boundary.
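
The selection rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the weights, bias, player names, and stat lines are made up, and `signed_score` and `select_all_stars` are hypothetical helpers, not a fitted model. The sketch scores each player against a linear boundary and keeps the `k` players with the largest signed score, which covers both the case of too many and the case of too few predicted all-stars.

```python
def signed_score(stats, weights, bias):
    # Value of w.x + b; a positive value means the all-star side of the boundary.
    # Dividing by the norm of the weights would give the geometric distance,
    # but the ranking of players would be the same.
    return sum(w * x for w, x in zip(weights, stats)) + bias

def select_all_stars(players, weights, bias, k):
    # Rank players from farthest on the all-star side to farthest on the other side.
    ranked = sorted(
        players,
        key=lambda name: signed_score(players[name], weights, bias),
        reverse=True,
    )
    return ranked[:k]

# Made-up stat lines: (points per game, assists per game).
players = {
    "Player A": (27.0, 8.0),
    "Player B": (22.0, 4.0),
    "Player C": (12.0, 3.0),
    "Player D": (19.0, 9.0),
}
weights = (1.0, 1.5)   # assumed weights favoring offensive stats
bias = -30.0           # assumed offset of the boundary

selected = select_all_stars(players, weights, bias, k=2)
print(selected)
```

With a real model, the score would come from the fitted classifier instead of hand-picked weights, but the ranking step would stay the same.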

# Evaluation Metrics

One metric that we'll be using is precision. All-star selection is highly selective: just as far more college athletes fail to reach the NBA than make it, far more NBA players miss all-star selection than earn it. Because the stats of borderline players can be so similar, we believe our model will produce a number of false positives, and precision measures how many of its predicted all-stars are real ones. The formula for precision is: $$ \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $$

We will also use the F1 score to choose the classification cutoff for our model. We decided against ROC curves because this data is not balanced. To find the F1 score, you must know recall, which is $$ \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $$
Precision and recall then combine into the general $F_{\beta}$ score, where $\beta = 1$ gives F1: $$F_{\beta} = (1+\beta^2)\frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} $$
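
As a worked example of the formulas above, the following sketch computes precision, recall, and the F-beta score from confusion counts, then picks the cutoff with the best F1 from a small set of candidates. The labels, scores, candidate thresholds, and helper names are all made up for illustration and do not come from our data.

```python
def precision(tp, fp):
    # TP / (TP + FP); defined as 0 when nothing is predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # TP / (TP + FN); defined as 0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_beta(p, r, beta=1.0):
    # (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives the F1 score.
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0

def confusion_counts(y_true, scores, threshold):
    # Count true positives, false positives, and false negatives at a cutoff.
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
    return tp, fp, fn

def best_threshold(y_true, scores, thresholds):
    # Return the (threshold, F1) pair with the highest F1 among the candidates.
    best = (None, -1.0)
    for t in thresholds:
        tp, fp, fn = confusion_counts(y_true, scores, t)
        f1 = f_beta(precision(tp, fp), recall(tp, fn))
        if f1 > best[1]:
            best = (t, f1)
    return best

# Made-up labels (1 = all-star) and model scores for eight players.
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.55, 0.3, 0.2, 0.1, 0.4, 0.05]

threshold, f1 = best_threshold(y_true, scores, [0.3, 0.5, 0.7])
print(threshold, f1)
```

In practice the cutoff should be chosen on a validation set rather than on the data used for the final evaluation.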

# Preliminary results




# Ethics & Privacy

Our data comes from public datasets with credible sources, and our use of it does not violate any privacy or terms-of-use agreement. The data is also not self-reported or survey-based, which avoids the response biases those formats can introduce. Most of it consists of seasonal statistical records of players from the NBA itself and from other credible sources that focus only on NBA statistics. We are not using these records to expose personal information or to criticize players, but to analyze seasonal performance and predict who could be the next all-star player.

# Team Expectations 

* Communicating about when you are unable to do your part for some reason.
* Willing to make time for team meetings.
* Doing the work that you commit yourself to in the team meetings.
* Even splitting of workload.
* Completing tasks in a timely manner.

# Project Timeline Proposal


| Meeting Date  | Meeting Time | Objectives  | 
|---|---|---|
| 4/23  |  11 AM |  Brainstorm project ideas/datasets, communicate group guidelines (forms of communication, schedules, roles), complete Project Proposal  | 
| 4/30  |  11 AM |  Peer review of proposals, do background research, discuss datasets and cleaning, discuss ethics | 
| 5/7  | 11 AM  | Data wrangling and possible analytical approaches, combine various datasets to create new views, assign group members to lead each specific part  | 
| 5/14  | 11 AM  | Review/edit data wrangling, discuss analysis plan, edit project code, Checkpoint | 
| 5/21  | 11 AM  | Peer review checkpoint, visualize data, discuss/edit project code | 
| 5/28  | 11 AM  | Discuss/edit full project| 
| 6/4  | 11 AM  | Have project ready for turn in on 6/8, team evaluation survey  | 

# Footnotes
<a name="votinginfo"></a>1.[^](#voting): Greer, J. (20 Jan 2022) NBA All-Star voting 2022: How it works, fan vote end date, latest results & leaders. *The Sporting News*. https://www.sportingnews.com/us/nba/news/nba-all-star-voting-2022-how-it-works-leaders-results-end-date/1ubkauu43tcfq1xoqp80sck14g<br> 
<a name="salarycapinfo"></a>2.[^](#salary): NBA. (2 Aug 2021) Salary cap set at \$112.4 million for 2021-22 season. *NBA*. https://www.nba.com/news/salary-cap-set-at-112-4-million-for-2021-22-season<br>
<a name="espnpredict"></a>3.[^](#espn): Sabin, P. (20 Jun 2017) Analytics help separate the All-Stars from the potential busts. *ESPN*. https://www.espn.com/nba/story/_/id/19681478/most-likely-all-stars-starters-role-players-top-2017-nba-draft<br>