# COGS 118A - Project Checkpoint

# Names

- Bobby Baylon
- Kyra Brandt
- Jayson Gutierrez
- Nathaniel Mackler
- Stephen Rabin

# Abstract 
 Load management is a significant issue in the NBA because it involves deciding whether a player should play on game day, in order to avoid injuries and to ensure long-term health and success for both the player and their team. The goal of our project is to create a system for categorizing NBA players' injuries based on performance data. We will use a KNN technique for multi-class classification, trained on datasets that include each player's performance statistics and instances of player injury. Player performance data comes from web scraping with the nba_api, and injury data comes from a Kaggle dataset with 27,106 observations and 5 variables. Because the dataset contains a large number of features, we will use feature selection to identify the numeric performance indicators that best measure player performance.

# Background (Updated, Please Check)


A hot topic within the current NBA world revolves around the idea of load management. The concept of load management dictates if a player should play on game day and, if they do, how many minutes. One hypothetical example is that an NBA organization might sit their star player out for a game halfway through the season because that player has been used at high rates in recent games. That organization might analyze their star's workload, ask the player how their body is responding, and assess that it's time to give their star a rest to prevent any risk of injuries, minor or major. 
Practicing load management seems like an obvious protocol for an organization in order to preserve both the player's and the organization’s future health and success. A study of 17 NBA teams across the 2012-2015 seasons demonstrated that significant factors in player injuries were high game loads and player fatigue <a name="lewis"></a>[<sup>[1]</sup>](#lewisnote). This study reinforces that there’s an interaction between game load and injury risk, so it's reasonable for fans, organizations, and players to assess minutes and games played to determine if a player should sit to prevent injury. Another study using data from the 2017-2019 seasons found that other risk factors for injury were player age and position <a name="cohan"></a>[<sup>[2]</sup>](#cohannote). This study indicates that seasoned players are more prone to injury and that player builds are considerable factors in injuries due to the size expectations of certain positions. A relevant article highlighted a trend of increasing injury severity over 11 seasons, attributing it to causes such as a longer season and an increase in athletic intensity <a name="kosik"></a>[<sup>[3]</sup>](#kosiknote). These findings help us select the specific variables that should be used in a statistical analysis of injury likelihood from player performance.
On the surface, it seems obvious to people outside of the NBA world that you would want to prevent major injuries, some of which could even be career ending, and that forcing any individual to play regardless of their physical status would be unethical and harmful. However, traditional NBA fans and retired players argue that current players, especially star players, shouldn’t sit due to fatigue or minor injuries because it robs fans of their full experience at an NBA game. When the late great Kobe Bryant was asked why he disapproved of modern NBA players sitting out due to fatigue, he stated this reasoning perfectly: “Because you have a lot of people paying a lot of money to come see these athletes play, and they deserve to see that.”


# Problem Statement (UPDATED PLEASE CHECK)

The problem we’re attempting to solve is to classify NBA player injuries based on performance statistics such as usage rating, minutes played, and games played. As discussed in the background section above, being able to better predict player injuries would help teams with load management. Furthermore, predicting injuries is significant because it will help players and coaches modify practices and performance so that players are able to sustain longer, healthier careers and deliver the performance their fans deserve. 

One solution we’re leaning towards is implementing a KNN algorithm to classify the injuries with respect to the performance statistics. One reason we’re considering a KNN implementation is that it is a strong algorithm for multi-class classification if we want to classify injury statuses in a hierarchy of non-injured, minor injury, or major injury. KNN can also handle categorical data such as player position. Because some injuries are more common at certain positions, we’re going to use a one-hot encoding to describe the position feature of players. One might caution against using KNN, as the datasets available to us describe a player across 20+ features, with some of those features being categorical. For those categorical features, one-hot encoding might increase the dimensionality of our data drastically. And because KNN suffers in higher dimensions due to expensive computations, we might need to do some sort of feature selection depending on our implementation. However, we’d be classifying at most thousands or tens of thousands of datapoints per year, across roughly 15 years, so testing would not require extensive amounts of time. 
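To make the dimensionality concern concrete, here is a minimal sketch of one-hot encoding a position feature with pandas. The rows and column names are hypothetical toy values, not our actual schema; the point is only that one categorical column becomes one indicator column per category.

```python
import pandas as pd

# Hypothetical toy rows; column names are illustrative, not the project's schema.
players = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "position": ["G", "F", "C", "G"],
    "minutes": [34.1, 28.5, 22.0, 30.2],
})

# One indicator column per position category replaces the single text column.
encoded = pd.get_dummies(players, columns=["position"], dtype=int)

print(encoded.columns.tolist())
print(f"{players.shape[1]} columns before, {encoded.shape[1]} after encoding")
```

With three position categories the single column expands to three, so a feature with many categories can add many dimensions to the KNN distance computation.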

# Data (ALMOST DONE, NOT FINISHED)

We have combined data from the Kaggle Injury Data 2010-2020 and the nba_api (both of which are described below) to create a final cleaned dataframe pairing individual player performance statistics with whether or not players were injured that particular season. If the player was injured, we also have a column for whether their injury was minor or severe. We will use this merged dataframe as the basis of our model.

The dataframe was constructed from the datasets below, both of which are in the Repo (as is our final dataframe) if you wish to view them:

**Injury Data 2010-2020 (Kaggle):**
- Link: https://www.kaggle.com/datasets/ghopkins/nba-injuries-2010-2018?resource=download 
- **Cleaned Dataset size: GUYS SOMEBODY PUT THE SIZE HERE** (Raw Dataset size: 27,106 X 5 = 27,106 observations and 5 variables)
- In the new, cleaned form, a single observation consists of **season, player name, their injury, and the severity of the injury**. In the raw form, a single observation consists of the name of the injured player, the team they played for while injured, notes detailing the injury, injury leave, and/or return from injury, and the date for which the player either left on injury leave or returned to play. 
- Moving forward, our critical variables are the **player's name**, since we use this to match with the player stats dataframe we scraped from the API, the **season** since this is also important for appropriate matching, their **injury** because it will become the ground truth label for our model, and the **severity** because it will become a class if we do a multiclass model.
- For cleaning, some critical variables were the Acquired, Relinquished, and Notes variables. Acquired and Relinquished (based on whether or not a name is present) indicated whether a player is going on injury leave or returning to the field of play. Both of these variables were represented as string values and are categorical. The Notes variable contained more specific information on the injury (e.g. did not play or day-to-day [which is questionable to play]) and/or indicated the beginning or end of injury leave for a player. The Notes variable was represented as a string. 
- We had to do a fair amount of cleaning and transformations on this dataset. This involved expunging the Acquired observations since they provided very little information, transforming the date of injury into the season and extracting the severity of injury from the Notes column. We also set up the severity columns with additional columns for one-hot encoding to make that process easier later. 
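The cleaning steps described above can be sketched roughly as follows. This is a sketch under assumptions, not our actual cleaning code: the column names, the toy rows, the choice to label a season by its starting year, and the keyword rule in `note_to_severity` are all hypothetical.

```python
import pandas as pd


def date_to_season(date):
    """Map a date to the starting year of its NBA season (assumed September to June)."""
    date = pd.Timestamp(date)
    return date.year if date.month >= 9 else date.year - 1


def note_to_severity(note):
    """Hypothetical keyword rule: day-to-day / did-not-play notes are minor, all others major."""
    text = note.lower()
    minor_markers = ("dtd", "day-to-day", "dnp")
    return "minor" if any(marker in text for marker in minor_markers) else "major"


# Toy rows shaped like the raw injury data (names and notes are made up).
raw = pd.DataFrame({
    "Date": ["2015-11-03", "2015-11-10", "2016-02-20"],
    "Team": ["Team X", "Team X", "Team Y"],
    "Acquired": [None, "Player A", None],
    "Relinquished": ["Player A", None, "Player B"],
    "Notes": [
        "sprained left ankle (DTD)",
        "returned to lineup",
        "torn ACL in right knee (out for season)",
    ],
})

# Keep only injury onsets, i.e. rows where a player was relinquished.
injuries = raw[raw["Relinquished"].notna()].copy()
injuries["season"] = injuries["Date"].map(date_to_season)
injuries["severity"] = injuries["Notes"].map(note_to_severity)

print(injuries[["Relinquished", "season", "severity"]])
```

Both toy injuries fall in the same season here because a February 2016 date belongs to the season that started in fall 2015.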

**nba_api for web scraping (GitHub):**
- Link: https://github.com/swar/nba_api
- Dataset size: 6593 observations x 80 variables.
- An observation consists of **NUMBER OF** performance statistics for a given player (i.e. time played, points scored, number of rebounds, average free throws made, etc.) along with their name, player ID, some demographics (ex. age) and the season.
- Critical variables are the **player's name and season**, because these are important for matching our dataframes, as well as the **performance statistics and demographics**, since these are the features we will feed into our models. Of the many player statistics we have data for, we expect to focus on variables that highlight general player performance, like average time played, average points scored in a game, and average shots made, all of which would be represented as numeric values. A core element of our project moving forward will be to do feature selection from these performance and demographic statistics. 
- **KYRA NOTE: I have less insight into the cleaning that was performed here so I've left this unchanged, but it needs to be updated** As the data is being acquired through web scraping, there will surely be some necessary cleaning of the data. In addition to the constructing of the dataset, the variables will need to be checked for data type and, if necessary, altered accordingly, and the data will likely need to be relabelled for easier comprehension.


# Proposed Solution (UPDATED PLZ CHECK)

K-nearest neighbors is an obvious solution for this use case. Realistically, one would need to predict the injury class of NBA players no more frequently than tens of thousands of times per year. Therefore, the high complexity of testing with KNN is a non-issue for this use case. The curse of dimensionality is also not a significant problem, as we have thousands of data points but only tens of features. However, a large vulnerability would be the risk of useless features affecting our KNN classifier. We don’t know for sure which features will matter. We may try to pick some features based on intuition - but there is no guarantee that features that have predictive power will be intuitively so. If we did choose to use KNN, then we could use built-in SKlearn methods - we would have to be careful to scale all our data by z-score, however. 
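A minimal sketch of the scaled KNN classifier described above, using scikit-learn's `StandardScaler` and `KNeighborsClassifier` in a pipeline. The data is synthetic (generated with `make_classification` to mimic three injury classes and about 20 features), and `n_neighbors=5` is an arbitrary starting value rather than a tuned choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged dataframe: 3 classes, 20 features.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Putting the scaler inside the pipeline means the z-score statistics are
# computed from the training data only and then reused on the test data.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Scaling matters for KNN because distances would otherwise be dominated by whichever feature has the largest numeric range, such as total minutes versus a shooting percentage.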

An alternative solution would be to use an SVM with an RBF kernel (or some other kernel that is appropriate for a data set where we have more observations than features; this is a tunable hyperparameter) to build a classifier. To do feature selection, we could use a grid search CV with C (the “hardness” of the margin) and different sets of features (first selected on intuition and research, as using all possible combinations would result in a ~2^20 by count(values of C) grid, which would easily overfit to noise, be impossible to compute at our scale, and require more data than we have to cross-validate). This would still be vulnerable to the intuition issue, but should be more resistant to overfitting to noise. Alternatively, I read online about recursive feature selection in SVMs, but I would want to talk about this with a TA before considering this technique, as it was not discussed in class. In any case, SVM is well-implemented in SKlearn, so we could use SKlearn for this process. 
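One way the grid search over C and hand-picked feature sets could look is sketched below. `FeatureSetSelector` and `FEATURE_SETS` are hypothetical helpers written for this illustration (they are not part of scikit-learn), the data is synthetic, and the three feature sets and three C values are placeholders.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical candidate feature sets, given as column indices into X.
FEATURE_SETS = {
    "baseline": [0, 1],
    "workload": [0, 1, 2, 3],
    "all": list(range(8)),
}


class FeatureSetSelector(TransformerMixin, BaseEstimator):
    """Keep only the columns that belong to one named feature set."""

    def __init__(self, name="all", feature_sets=None):
        self.name = name
        self.feature_sets = feature_sets

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[:, self.feature_sets[self.name]]


# Synthetic data; with shuffle=False the informative columns come first.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, shuffle=False, random_state=0,
)

pipeline = Pipeline([
    ("subset", FeatureSetSelector(feature_sets=FEATURE_SETS)),
    ("scale", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# 3 feature sets x 3 values of C = 9 candidates, each scored with 5-fold CV.
param_grid = {
    "subset__name": list(FEATURE_SETS),
    "svm__C": [0.1, 1, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
```

Treating the feature set as just another hyperparameter keeps the grid small and lets cross-validation, rather than intuition alone, choose among the candidate sets.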

Previous studies' models, such as those by Lewis or Cohan & Schuster, have used only player age and minutes played, since intuitively, these two factors are by far the most important in predicting whether a player will be injured. For this reason, our benchmark model is a logistic regression using only age and minutes played as input, so if a complex model fails to do significantly better than this model, we can safely dismiss it as not useful. Additionally, in our feature selection, we will use those two features (age, minutes played) as our baseline for determining whether our additional performance features are useful.
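A rough sketch of the benchmark comparison follows. The dataframe is randomly generated (the `age`, `minutes`, `usage`, and `injured` columns and the rule that generates the labels are assumptions made for illustration), so the printed scores say nothing about real players; the sketch only shows how a candidate feature set can be compared against the age-and-minutes baseline under the same cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data; not real player statistics.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(19, 40, n),
    "minutes": rng.uniform(5, 38, n),
    "usage": rng.uniform(10, 35, n),
})
# Assumed label rule: injury odds rise with age and minutes played.
logit = 0.15 * (df["age"] - 29) + 0.08 * (df["minutes"] - 21)
df["injured"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)


def cv_recall(features):
    """5-fold cross-validated recall of a scaled logistic regression."""
    model = make_pipeline(StandardScaler(), LogisticRegression())
    return cross_val_score(model, df[features], df["injured"], cv=5, scoring="recall")


baseline_scores = cv_recall(["age", "minutes"])
candidate_scores = cv_recall(["age", "minutes", "usage"])

print(f"baseline recall:  {baseline_scores.mean():.3f}")
print(f"candidate recall: {candidate_scores.mean():.3f}")
```

A candidate feature would only be kept if it improves on the baseline by more than the fold-to-fold variation.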


# Evaluation Metrics

While there are certainly consequences for a false positive in this case (like potentially hurting a player's mindset or causing them to reduce their performance), the consequences of a false negative (an injury occurring that could have been prevented or whose probability could be reduced through better form/training moderation) are far worse. The obvious error metric for this case would be recall (TP/(TP+FN)), but this metric would give zero weight to false positives. The harm of false positives is not zero, just lower than that of false negatives. Therefore, I think the best metric would be an Fbeta metric. The value of beta is subjective. We will play around with a few values to determine the beta that “feels” right, but it should be over one to weight recall higher.
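As a small worked example of how beta shifts the balance, the sketch below computes F-beta by hand from precision and recall and checks it against scikit-learn's `fbeta_score`. The labels are made up, and the example is binary; for the multi-class case `fbeta_score` additionally needs an `average` argument.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score


def f_beta(precision, recall, beta):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


# Made-up labels: 1 = injured, 0 = not injured.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
# Here TP = 3, FN = 1, FP = 2, so precision = 3/5 and recall = 3/4.

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

f1 = fbeta_score(y_true, y_pred, beta=1)
f2 = fbeta_score(y_true, y_pred, beta=2)

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"F1={f1:.3f} F2={f2:.3f} manual F2={f_beta(precision, recall, 2):.3f}")
```

Because recall (0.75) is higher than precision (0.60) in this example, F2 comes out higher than F1, which is the behavior we want from a metric that penalizes missed injuries more than false alarms.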

We may also use confusion matrices and ROC/AUC in performing our model selection.

# Preliminary results (UNFINISHED)

**DATA CLEANING:** So far, most of our effort has been expended in retrieving and cleaning the data to begin building our model. As mentioned in the data section, we have significantly transformed the injury dataset. We began with five columns, two of which were Acquired and Relinquished. Acquired was not null when the player was returning from injury. Likewise, Relinquished was not null when the player was being injured. Since we were more interested in the onset of injuries and we mainly needed this dataset to get the truth labels for classification, we dropped the rows in which Relinquished was null and then condensed the rows for each player down to one per season. This also meant that we condensed the date column down to only the year, assigning it based on NBA seasons (i.e. September to June). Each resulting observation tells us that the given player got injured in a given season. Using some tokenization, we also extracted the severity of the player's injury from the Notes column. We then had to merge this dataframe with the one we had extracted from the NBA API. This required a long process of string cleaning due to white space and punctuation differences between the two sets of names. With this process complete, we were able to match the dataframes together based on player name and year, thus creating a final dataframe that has a row for each player in each season from 2010-2019 with their demographics, performance statistics, and whether or not they were injured that season. We will use this final dataframe as the data we feed into our models. 
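The name-matching and merge step could look roughly like the sketch below. It is illustrative only: the player names are invented, the column names are assumptions rather than the real nba_api or Kaggle headers, and `normalize_name` is a hypothetical helper standing in for our string cleaning.

```python
import re

import pandas as pd


def normalize_name(name):
    """Lowercase, strip punctuation, and collapse whitespace so both sources agree."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", name).strip()


# Invented rows standing in for the scraped stats and the cleaned injury data.
stats = pd.DataFrame({
    "player": ["A.B. Sample", "Pat  O'Example ", "Chris Placeholder"],
    "season": [2015, 2015, 2015],
    "minutes": [30.1, 18.4, 25.0],
})
injuries = pd.DataFrame({
    "player": ["AB Sample", "Pat O'Example"],
    "season": [2015, 2015],
    "severity": ["minor", "major"],
})

stats["name_key"] = stats["player"].map(normalize_name)
injuries["name_key"] = injuries["player"].map(normalize_name)

# Left merge keeps every player-season from the stats table; players with no
# matching injury row get a missing severity, which we treat as "not injured".
merged = stats.merge(
    injuries[["name_key", "season", "severity"]],
    on=["name_key", "season"],
    how="left",
)
merged["injured"] = merged["severity"].notna().astype(int)
merged["severity"] = merged["severity"].fillna("none")

print(merged[["player", "season", "injured", "severity"]])
```

Merging on a normalized key rather than the raw name is what lets "A.B. Sample" and "AB Sample" be recognized as the same player.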

# Ethics & Privacy

We will be using data that is taken directly from the NBA.com website. We are aware that it is crucial to make sure the data is gathered methodically and objectively. We will be working with player statistics that are made available to the public via the website, and since the data does not contain sensitive information and is publicly available, informed permission is not necessary. We will use prosportstransactions.com to access player injury data for the 2010–2011 season through the 2019–2020 season. We won't falsify the data to advance an objective such as financial gain, and we will take care to portray the data honestly and to guard against unintended uses of it in our research.

# Team Expectations 

* *Communicate with one another using the team groupchat on Discord. Try to respond to messages within 24 hours if not sooner.*
* *We will meet when necessary on Zoom. We will organize meeting times in Discord. Any anticipated absences or schedule conflicts should be brought up with the team beforehand.*
* *We will do our best to divide the work equally between members of the team.*
* *Conflicts between team members should be brought up with the whole team, so that the other team members can mediate and help resolve the conflict.*

# Updated Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/14 | 1 PM | Brainstorm topics/questions (all) | Introduce selves; Choose basic topic/area of interest for project; Begin looking for data sources|
| 2/22  |  10 AM |  Find data/develop deeper understanding of topic (all)  | Divide work on project proposal among team members; Clarify research question | 
| **2/22** | **11:59 PM** | **PROJECT PROPOSAL DUE** | |
| 3/4  |  11:30 AM | Get and clean data from NBA API (Jayson), clean injury dataset and preliminary EDA (Nathaniel, Bobby) | Divide work for project checkpoint; Decide who will clean data and how models will be implemented and who will work on which ones | 
| 3/7  | 1 PM  | Implement (or be close to implementing) most if not all of the models | Edit, finalize, and submit checkpoint; Touch base on problems that arise; Discuss plans for next phase of project |
| **3/8** | **11:59 PM** | **PROJECT CHECKPOINT DUE** | |
| 3/14  | 1 PM  | Do model selection; Calculate metrics | Review progress and tackle issues; Divide up work for final project notebook   |
| 3/20  | 12 PM  | Draft final project notebook | Discuss/edit project code; Proofread final notebook; Complete project |
| **3/22**  | **11:59 PM**  | **FINAL PROJECT DUE**  | Don't forget to fill out team evaluation survey! |

# Footnotes
<a name="lewisnote"></a>1.[^](#lewis): Lewis, M. (2018). It’s a Hard-Knock Life: Game Load, Fatigue, and Injury Risk in the National Basketball Association. Journal of Athletic Training. https://meridian.allenpress.com/jat/article/53/5/503/112788/It-s-a-Hard-Knock-Life-Game-Load-Fatigue-and<br>
<a name="cohannote"></a>2.[^](#cohan): Cohan, A., Schuster, J., & Fernandez, J. (2021). A deep learning approach to injury forecasting in NBA basketball. Journal of Sports Analytics.<br>
<a name="kosiknote"></a>3.[^](#kosik): Kosik, K., Lundquist, K., & McInnis, K. (2021). Temporal Trends and Severity in Injury and Illness Incidence in the National Basketball Association Over 11 Seasons. Journal of Athletic Training, 56(1), 15-23.