# COGS 108 - Project Proposal

## Authors

- Arjun Yadalla: Worked on Research Question, Hypothesis, and Team Expectations sections.
- Aditya Mittal: Worked on Data section, pushing proposal to GitHub.
- Sal Martinez: Worked on Ethics section.
- Sri Gentela: Worked on Background Info and Project Timeline Proposal Section.
- Thaarak Sriram: Worked on Background Info and Project Timeline Proposal Section.

## Research Question

To what extent can NBA player salaries over the past five seasons be predicted by on-court performance metrics, and how closely do current player salaries align with statistically measured contributions to team success?

## Background and Prior Work

Basketball is one of the most popular and widely played sports in the world, with the National Basketball Association (NBA) leading global viewership in professional basketball. The league serves as the premier stage where elite athletes compete annually for the championship title. In recent years, NBA player salaries have skyrocketed, with compensation determined by a complex interplay of factors including individual performance statistics, team success, market dynamics, roster needs, and league-imposed salary cap regulations. For team management, understanding the relationship between a player's on-court contributions and their salary compensation is critically important—the ability to construct an effective roster within salary cap constraints often determines whether a franchise competes for championships or remains mediocre. This project seeks to leverage machine learning techniques to predict NBA player salaries based on statistical performance over the last five years, focusing specifically on quantifying how much players contribute to their teams relative to their compensation.

Several studies have applied machine learning to predict NBA player salaries with considerable success. An analysis using Random Forest and Gradient Boosting models found that minutes per game, points scored, and previous-season salary were the most impactful features in predicting player compensation, with models achieving R² values of roughly 0.85-0.90.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Notably, this analysis found that advanced statistics like Value Over Replacement Player (VORP) and Win Shares were less dominant in salary predictions than basic counting metrics like points and games started. A separate study that enhanced its models with optimization algorithms reported even higher accuracy, with R² values reaching 0.987 during training, which suggests that more sophisticated approaches can capture salary determination patterns, although a training-phase score alone does not show how well a model generalizes.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

However, these studies also reveal important limitations of salary prediction models. Factors beyond statistical performance significantly influence player compensation, including market size, player marketability, injury history, and contract structure constraints such as rookie scale contracts and maximum salary rules.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Models that rely purely on performance statistics therefore cannot fully account for the realities of contract negotiations. Furthermore, an analysis of how teams value different statistics in contract negotiations found that organizations tend to overvalue traditional counting stats like points and blocks while undervaluing efficiency metrics such as effective field goal percentage (eFG%) and Win Shares when determining salaries.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) This disconnect between statistical value and market value presents both a challenge for prediction models and an opportunity for teams seeking competitive advantages.

While machine learning models have demonstrated strong predictive performance, with typical Mean Absolute Errors of around $3-5 million on test datasets, large prediction errors often occur for specific player categories.<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5) Rookie contract players, whose salaries are predetermined by draft position regardless of performance, and superstar players on maximum contracts, whose compensation is capped by collective bargaining agreement rules regardless of statistical dominance, present particular challenges for purely statistics-based models. Additionally, because the NBA's salary cap has risen rapidly, salaries from different seasons are easier to compare when the model predicts salary as a percentage of that season's salary cap rather than as an absolute dollar amount.
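
As a minimal sketch of the cap-normalization idea, the snippet below converts a dollar salary into a share of a season's salary cap. The function name `salary_cap_share` and the cap figures are placeholders chosen for illustration; they are not real NBA values and would be replaced with the actual cap for each season.

```python
def salary_cap_share(salary: float, season_cap: float) -> float:
    """Express a salary as a fraction of that season's salary cap."""
    if season_cap <= 0:
        raise ValueError("season_cap must be positive")
    return salary / season_cap


# Placeholder figures for illustration only; these are not real NBA values.
placeholder_caps = {"season_a": 100_000_000, "season_b": 125_000_000}

# The same dollar amount is a smaller share of a larger cap.
share_a = salary_cap_share(25_000_000, placeholder_caps["season_a"])
share_b = salary_cap_share(25_000_000, placeholder_caps["season_b"])
print(f"season_a: {share_a:.1%}, season_b: {share_b:.1%}")
```

Using the share as the prediction target keeps a contract signed under a smaller cap comparable to one signed under a larger cap.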

This project will build upon existing work by analyzing statistical data from the last five seasons, a period characterized by considerable salary cap growth and evolving team strategies. It aims to develop models that identify relationships between comprehensive player performance metrics and compensation while acknowledging the structural constraints that limit purely statistics-based predictions.

---

**References**

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Towards Data Science. (2025). Predicting NBA Salaries with Machine Learning. https://towardsdatascience.com/predicting-nba-salaries-with-machine-learning-ed68b6f75566/
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Cheng, Y., Song, Y., & Wang, M. (2025). Leveraging Machine Learning for Accurate Prediction of NBA Player Salaries. *Iraqi Journal for Computer Science and Mathematics*, Vol. 6, Iss. 3. https://ijcsm.researchcommons.org/ijcsm/vol6/iss3/1/
3. <a name="cite_note-3"></a> [^](#cite_ref-3) National High School Journal of Science. (2025). Computational Analysis of NBA Players with Machine and Deep Learning. https://nhsjs.com/2025/computational-analysis-of-nba-players-with-machine-and-deep-learning/
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Harvard Sports Analysis Collective. (2023). MoneyB-ball: An Analysis of Over and Undervalued NBA Statistics. https://harvardsportsanalysis.org/2023/05/moneyb-ball-an-analysis-of-over-and-undervalued-nba-statistics/
5. <a name="cite_note-5"></a> [^](#cite_ref-5) Carr, E. (2025). Predicting NBA Contracts Using Statistics. *Medium - INST414: Data Science Techniques*. https://medium.com/inst414-data-science-tech/predicting-nba-contracts-using-statistics-8842f3bd45e3

## Hypothesis


We hypothesize that NBA player salaries over the past five seasons can be predicted with relatively high accuracy using machine learning models trained on on-court performance metrics, but that prediction accuracy will be constrained by non-performance factors embedded in the NBA’s contract system.

We further hypothesize that traditional performance statistics, such as minutes played, points per game, and games started, will be stronger predictors of salary than advanced efficiency metrics like Win Shares and Box Plus/Minus. This expectation aligns with prior research indicating that teams tend to prioritize more visible and easily interpretable statistics during contract negotiations.

Finally, we hypothesize that discrepancies between predicted and actual salaries will be most pronounced for players on rookie-scale contracts and maximum contracts, where league rules fix or cap compensation independently of on-court statistical performance. Such discrepancies would indicate that salary alignment with measured contribution is imperfect, even when strong predictive models are used.
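
One way this last hypothesis could be checked is to compare prediction error across contract types. The sketch below is only an illustration: the helper `mae_by_group`, the contract-type labels, and the numbers (salaries expressed as cap shares) are all made up, and a contract-type label is an assumption, since it is not one of the variables listed in the Data section.

```python
from collections import defaultdict


def mae_by_group(rows):
    """Mean absolute prediction error per contract type.

    rows: iterable of (contract_type, actual, predicted) tuples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for contract_type, actual, predicted in rows:
        totals[contract_type] += abs(actual - predicted)
        counts[contract_type] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical values, not real results: salary as a share of the cap.
example_rows = [
    ("rookie_scale", 0.05, 0.12),
    ("rookie_scale", 0.04, 0.10),
    ("standard", 0.15, 0.16),
    ("max", 0.35, 0.42),
]

errors = mae_by_group(example_rows)
for group, error in sorted(errors.items()):
    print(f"{group}: {error:.3f}")
```

If the hypothesis holds, the rookie-scale and maximum-contract groups would show noticeably larger errors than the remaining players.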

## Data

<h4>Explaining Ideal Dataset</h4>

Our ideal dataset would capture both on-court and off-court variables that help explain player salaries, most of them tied to player performance. The only dependent variable we need is annual player salary, since that is what we are predicting. The ideal independent variables, listed below, were chosen because each one reflects a player’s performance or is otherwise expected to relate to salary:

Box Score Stats: Points (PTS), Assists (AST), Total Rebounds (TRB), Steals (STL), Blocks (BLK), Turnovers (TOV), Field Goal Percentage (FG%), Three-Point Percentage (3P%), Free Throw Percentage (FT%), Minutes Played (MP), Games Played (GP), Games Started (GS).

Advanced Stats: Player Efficiency Rating (PER - overall efficiency per minute), Win Shares (WS - estimated wins contributed), Win Shares per 48 Minutes (WS/48), Value Over Replacement Player (VORP - value compared to a baseline "replacement" player), Box Plus/Minus (BPM - points per 100 possessions above average), True Shooting Percentage (TS% - scoring efficiency accounting for 2s, 3s, and free throws), Usage Rate (USG% - percentage of team plays used by player). 

Additional Stats: Age, Position, Years in the League

We’d like 2,250-3,000 observations. With 450-500 players per season over the past 5 seasons, we expect roughly 2,250-2,500 player-seasons, and the upper bound leaves headroom for roster turnover and mid-season signings.

This data would be collected through public NBA datasets on Kaggle, web-scraping public NBA information from NBA stat websites, and using publicly available NBA stat APIs. 

The data will be stored in a clean and tidy way in a CSV file, with one row per player per season, with each variable as a column. 
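
A quick way to confirm the one-row-per-player-per-season layout is to look for repeated (player, season) keys, which can appear if a source lists a player more than once in a season, for example after a mid-season move. The sketch below assumes rows are dictionaries with hypothetical `player` and `season` fields; the real column names depend on the source files.

```python
from collections import Counter


def duplicate_keys(rows, key_fields=("player", "season")):
    """Return the sorted keys that appear on more than one row."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Placeholder rows; the names and teams are invented for illustration.
example_rows = [
    {"player": "Player A", "season": 2023, "team": "AAA"},
    {"player": "Player A", "season": 2023, "team": "BBB"},
    {"player": "Player B", "season": 2023, "team": "AAA"},
]

repeated = duplicate_keys(example_rows)
print(repeated)
```

Any key reported here would need to be collapsed to a single row (for instance by keeping season totals) before modeling.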

<h4>Real datasets we’re using</h4>

Dataset #1: NBA Stats (1947-present) on Kaggle
This dataset is accessible at: https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats. Because it is a public dataset on Kaggle, all we have to do is download the CSV files containing the necessary information. Every independent variable listed in our ideal dataset is available here except years of experience. However, we can easily calculate years of experience, as the dataset includes each player's first season along with the season of each observation.
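
The experience calculation described above can be sketched as a simple difference. This assumes seasons are identified by a single integer year and that a player's first season counts as zero years of experience; both are conventions we chose for illustration, and the function name `years_of_experience` is hypothetical.

```python
def years_of_experience(season: int, first_season: int) -> int:
    """Seasons elapsed since the player's first season (rookie year = 0)."""
    if season < first_season:
        raise ValueError("season cannot be earlier than first_season")
    return season - first_season


print(years_of_experience(2024, 2020))
print(years_of_experience(2020, 2020))
```

If a rookie should instead count as being in year one, adding 1 to the result gives that convention.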

Dataset #2: NBA Player Salaries on HoopsHype
This dataset is accessible at https://www.hoopshype.com/salaries/players/. We’ll have to web-scrape the site to access the data (likely using BeautifulSoup), as there is no convenient API or CSV download. Since this data is publicly viewable, we do not expect legal issues with scraping it, but we will check the site's terms of use and robots.txt before doing so. The variable we’re using from this dataset is each player's salary per season, which is the final variable we need to match our ideal dataset.
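
Once salaries are collected, they have to be converted to numbers and matched to the stats rows by player and season. The sketch below shows one possible approach using only the standard library. It assumes salaries arrive as full dollar strings such as `$1,234,567` and that names may differ in accents or punctuation between sources; the helper names, field names, and sample rows are all hypothetical.

```python
import re
import unicodedata


def parse_salary(text: str) -> int:
    """Convert a string like '$1,234,567' to an integer dollar amount."""
    digits = re.sub(r"[^\d]", "", text)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


def normalize_name(name: str) -> str:
    """Lowercase a name and strip accents, punctuation, and extra spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z ]", "", ascii_only.lower())
    return " ".join(cleaned.split())


def attach_salaries(stat_rows, salary_rows):
    """Add a 'salary' field to each stats row; None when no match exists."""
    lookup = {
        (normalize_name(row["player"]), row["season"]): parse_salary(row["salary"])
        for row in salary_rows
    }
    merged = []
    for row in stat_rows:
        key = (normalize_name(row["player"]), row["season"])
        merged.append({**row, "salary": lookup.get(key)})
    return merged


# Placeholder rows for illustration only; not real players or salaries.
stats = [
    {"player": "Jose Example", "season": 2023},
    {"player": "Sam Placeholder", "season": 2023},
]
salaries = [{"player": "José Example", "season": 2023, "salary": "$1,234,567"}]

merged_rows = attach_salaries(stats, salaries)
print(merged_rows)
```

Rows left with a salary of `None` point to players whose names or seasons did not line up between the two sources and need manual review.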


## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
> The data we use will most likely persist only as part of the project we list on our resumes. Because it will become outdated in less than a year, we plan to delete it at that point.


### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> Yes. Our documentation will be simple, follow clear steps and guidelines, and record the reasoning behind each step.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> There is a natural statistical bias toward players on strong teams. For example, players on a strong defensive team can have inflated defensive ratings, because when one defender is beaten, a teammate may still stop the play. The same holds on offense: on a team with many good shooters, players get more shots and more assists.
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
> Fairness across groups is difficult because team context matters a great deal: players on better defensive teams receive better defensive ratings, and the same applies to offense and to coaching. Our algorithm can only take in the numbers, but we can adjust for a team that is statistically overperforming, or at least note the anomaly.
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
> Yes, the statistics we use are self-explanatory. If human judgment is necessary, factors such as defensive quality and a player's team situation could be taken into account, though most likely this will not be needed.
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
> We plan to monitor the model by comparing the contracts players actually receive with what we predicted. We will also look at how teams perform after those signings to judge whether the model was correct.
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

* Clear Roles/Responsibilities
  * Every team member will take responsibility for different components of the project. Responsibilities include data collection and cleaning, modeling and analysis, and report writing. While responsibilities will be divided, all members of the team are expected to understand the full project, provide feedback to their teammates, and ask for help when needed.
* Communication/Availability
  * The team must regularly and properly communicate through messages and meet at least once a week, on a date TBD. At this meeting, we will discuss current progress, address challenges that we are facing, plan the next steps of the project, and work on the project and get help. Team members are expected to respond to messages in a timely manner and notify the group if any conflicts arise. 
* Accountability/Deadlines
  * Deadlines will be set ahead of course deadlines to allow time for review and revisions amongst the team. Team members are expected to complete their assigned tasks on time and inform the group ahead of time if they need extra assistance. 
* Collaboration & Respect
  * All team members are expected to contribute respectfully, listen to differing opinions, and engage constructively in discussions. They shouldn’t be distracted by other tasks and instead give their full, undivided attention to the group. Decisions regarding the project will be made collaboratively. 
* Quality & Academic Integrity
  * The team is committed to producing high-quality, reproducible work. All code will be documented and shared through version control, and all sources will be properly cited. Any use of external tools or assistance will comply with course policies. 


## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/28  |  12:30 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; Begin background research | 
| 2/5  |  12:30 PM |  Submit project proposal; understand hypothesis; get datasets and variables | Collect NBA performance and salary datasets; Align player and season data; Assign specific roles | 
| 2/12   | 12:30 PM  | Clean and preprocess data; Conduct exploratory data analysis  | Select final features; Discuss Analysis Plan   |
| 2/19  | 12:30 PM  | Build baseline salary prediction models; Impute Data; Encode Categorical Data | Evaluate model performance; Begin project checkpoint #2 |
| 2/26  | 12:30 PM  | Train machine learning models (RandomForestRegressor, XGBoost); Complete project checkpoint #2 | Review project checkpoint #2; Compare model results; Begin identifying overpaid and underpaid players |
| 3/5  | 12:30 PM  | Interpret results; Create visualizations | Start drafting Final Project; Assign Roles for Final Project|
| 3/12  | 12:30 PM  | Complete assigned final project parts; Format notebook | Review final project notebook; Create video|
| 3/18  | 11 PM  | Final edits & polish final submission | Turn in Final Project & Group Project Surveys|