# COGS 108 - Project Proposal

## Authors

**Jerry Ying:** Conceptualization, Analysis, Methodology

**Jimmy Ouyang:** Software, Visualization

**Zack Chen:** Methodology, Background Research

**Subika Haider:**  Analysis, Data Curation, Experimental Investigation

**Jeremy Wei:** Project Administration, Data Curation

**Everyone:** Writing – original draft, Writing – review & editing

## Research Question

Across NBA seasons 2016-17 to 2023-24, how well do a player’s basic box-score averages (points, rebounds, assists, steals, blocks) per game and selected advanced metrics (True-Shooting %, Player Efficiency Rating, Box Plus-Minus, and Win Shares per 48) explain the share of the salary cap the player earns in the following season?


## Background and Prior Work

NBA player salaries operate under a salary-cap system, which limits how much each team can spend across an entire roster. Because the cap changes from season to season, analysts often express salaries as a share of the cap to make contracts comparable across years. In this project, we ask whether commonly available performance data, namely basic box-score averages (points, rebounds, assists, steals, blocks) and selected advanced metrics (TS%, PER, BPM, and WS/48), can explain how much of the cap a player earns in the following season. Framing the outcome this way helps factor out year-to-year cap growth from the analysis and keeps the focus on how teams appear to value on-court production when they set contracts.<a href="#ref1"><sup>1</sup></a><a href="#ref2"><sup>2</sup></a>

Prior research in sports economics consistently finds that teams still pay heavily for visible box-score output, particularly scoring. Studies that model NBA salaries using traditional performance variables often show that points per game, minutes, and assists explain a substantial portion of salary variation, even after accounting for other factors. One applied analysis of NBA salary determinants finds that scoring remains one of the strongest predictors of pay, suggesting that front offices continue to price offensive volume into contracts even as analytics become more common.<a href="#ref3"><sup>3</sup></a> These findings motivate the baseline model in our study, where box-score averages serve as the starting point for explaining salary share.

At the same time, salary does not track box-score production perfectly. Contracts also factor in reputation, age, injury risk, positional demand, and the timing of free agency, all of which can pull pay away from pure statistical output. To fill these gaps, analysts have increasingly added advanced metrics to salary models in an effort to capture value that might otherwise be missed. Papadaki and Tsagris (2020) model NBA salary share directly using machine-learning methods and show that performance variables can explain a meaningful share of compensation, while also demonstrating that salary outcomes remain noisy and difficult to predict precisely.<a href="#ref4"><sup>4</sup></a> Their work motivates our decision to focus on salary share and to compare how different sets of performance metrics explain it.

More recent academic and student research has followed a similar path by combining traditional and advanced statistics to study player valuation and pay inequality. These projects often find that while stars dominate the top end of the salary distribution, certain efficient or high-impact role players appear underpaid relative to their statistical contribution. One such study categorizes players by role and shows that efficiency-based metrics help explain why some lower-usage players provide strong on-court value without receiving star-level contracts.<a href="#ref5"><sup>5</sup></a> This body of work grounds our hypothesis: advanced metrics may not replace raw scoring as the strongest individual predictor, but they may improve overall model fit and reduce prediction error for non-star players.

The advanced metrics used in this project are well established in public basketball analytics. TS% measures scoring efficiency while accounting for three-point shooting and free throws, whereas BPM and WS/48 aim to condense a player's total impact into a single number that adjusts for playing time and team context. These metrics combine efficiency and impact in one measure, making them especially useful for evaluating players who log fewer minutes but perform well when on the floor.<a href="#ref1"><sup>1</sup></a><a href="#ref6"><sup>6</sup></a><a href="#ref7"><sup>7</sup></a> By comparing a box-score-only model to one that also includes these advanced metrics, our project tests whether teams implicitly reward this type of efficiency when they set future salaries.
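
For reference, the publicly documented True Shooting formula can be written as follows, where PTS is points, FGA is field-goal attempts, and FTA is free-throw attempts. The worked numbers are hypothetical and only illustrate the arithmetic.

$$
\mathrm{TS\%} = \frac{\mathrm{PTS}}{2\,(\mathrm{FGA} + 0.44 \cdot \mathrm{FTA})}
$$

For a hypothetical player averaging 25 points on 18 field-goal attempts and 6 free-throw attempts per game, this gives

$$
\frac{25}{2\,(18 + 0.44 \cdot 6)} = \frac{25}{41.28} \approx 0.606
$$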

Finally, this study contributes by examining multiple seasons (2016–17 through 2023–24) and by linking performance in one season to salary share in the next. This approach better reflects how front offices operate, since teams pay players based on expected future value rather than past production alone. By comparing stars and role players using the same model, we test whether advanced metrics narrow the difference between what players are paid and what their performance predicts, especially for players whose value is not well captured by per-game averages.

<hr>

<h3>References</h3>

<p><a name="ref1"></a>1. <a href="#ref1">^</a> NBA Stats Help Glossary — True Shooting Percentage (TS%) definition and formula. NBA.com. <b>https://www.nba.com/stats/help/glossary</b></p>

<p><a name="ref2"></a>2. <a href="#ref2">^</a> Sports Reference / Basketball-Reference — WS/48 definition (and related advanced-stat glossary context). <b>https://www.basketball-reference.com/about/glossary.html</b></p>

<p><a name="ref3"></a>3. <a href="#ref3">^</a> The Sport Journal (2015). "Determinants of NBA Player Salaries." <b>https://thesportjournal.org/article/determinants-of-nba-player-salaries/</b></p>

<p><a name="ref4"></a>4. <a href="#ref4">^</a> Papadaki, I. & Tsagris, M. (2020). "Estimating NBA players' salary share according to their performance on court: A machine learning approach." arXiv. <b>https://arxiv.org/pdf/2007.14694</b></p>

<p><a name="ref5"></a>5. <a href="#ref5">^</a> Riccardi, N. (2025). "NBA player types and salaries: assessing the disparities in …" (uses box-score + advanced stats to study salary patterns). Syracuse University SURFACE repository (PDF). <b>https://surface.syr.edu/cgi/viewcontent.cgi?article=1068&context=sportmanagement</b></p>

<p><a name="ref6"></a>6. <a href="#ref6">^</a> Basketball-Reference — Box Plus/Minus (BPM) methodology overview. <b>https://www.basketball-reference.com/about/bpm2.html</b></p>

<p><a name="ref7"></a>7. <a href="#ref7">^</a> Basketball-Reference — Win Shares primer (context for how WS is allocated and interpreted). <b>https://www.basketball-reference.com/about/ws.html</b></p>



## Hypothesis


We hypothesize that both basic box-score averages and selected advanced metrics positively correlate with a player’s share of the salary cap in the following season.

While raw box-score averages (points, rebounds, assists, steals, blocks) will remain the strongest individual predictors of salary share (especially points per game), including advanced metrics (such as win shares per 48 minutes) will increase the model’s overall $R^2$ and provide a more accurate valuation of players who receive fewer minutes but perform exceptionally well (i.e., high box-score statistics per minute but relatively low box-score statistics per game).
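
Because in-sample $R^2$ from ordinary least squares cannot decrease when predictors are added, a raw $R^2$ gain alone would not separate the two models. One possible check, sketched below in plain Python, is to compare adjusted $R^2$, which penalizes the number of predictors. The function names and toy numbers are illustrative assumptions, not part of any library or of a finalized analysis plan.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((obs - pred) ** 2 for obs, pred in zip(y, y_hat))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (intercept excluded)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Toy values only (not real salary shares): a perfect fit and a mean-only fit.
y = [1.0, 2.0, 3.0, 4.0]
perfect_fit = r_squared(y, [1.0, 2.0, 3.0, 4.0])
mean_only_fit = r_squared(y, [2.5, 2.5, 2.5, 2.5])

# Hypothetical comparison: the same raw R^2 is penalized more with more predictors.
adj_small_model = adjusted_r_squared(0.5, n=11, k=5)
adj_large_model = adjusted_r_squared(0.5, n=11, k=9)
```

In this sketch, a model with nine predictors must reach a noticeably higher raw $R^2$ than a five-predictor model to end up with the same adjusted value. Held-out prediction error would be another reasonable comparison.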

## Data

### Ideal Dataset

**Unit of observation:**  
One player–season

**Seasons covered:**  
2016–17 through 2023–24 (8 seasons)

**Target sample size:**  
Approximately 30 teams × 15 players × 8 seasons ≈ **3,600 observations**.  

---

### Variables in the Dataset

#### Identification
- Player ID  
- Player Name (First and Last)  
- Season  
- Team  
- Primary Position  

#### Demographics and Control Variables
- Age (at the start of the season)  
- Draft Pick  
- Minutes Per Game  

#### Independent Variables (Season *t*)
- **Basic box score statistics (per game):**
  - Points (PTS)  
  - Rebounds (REB)  
  - Assists (AST)  
  - Steals (STL)  
  - Blocks (BLK)  

- **Selected advanced metrics:**
  - True Shooting Percentage (TS%)  
  - Player Efficiency Rating (PER)  
  - Box Plus/Minus (BPM)  
  - Win Shares per 48 minutes (WS/48)  

#### Dependent Variables (Season *t+1*)
- Salary (USD)  
- League Salary Cap (USD)  
- Cap-Adjusted Salary Ratio  
  - Defined as: salary / league salary cap  
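
The link between season *t* performance and the season *t+1* cap-adjusted salary ratio could be built roughly as follows. This is a minimal sketch in plain Python: the field names, the convention of labeling a season by its ending year, and every number are hypothetical placeholders rather than real data.

```python
# Hypothetical records; a season is labeled by its ending year (2016-17 -> 2017).
stats = [
    {"player_id": "p1", "season": 2017, "pts": 20.0},
    {"player_id": "p1", "season": 2018, "pts": 22.0},
    {"player_id": "p2", "season": 2017, "pts": 8.0},
]
salaries = {("p1", 2018): 25_000_000, ("p1", 2019): 30_000_000}  # made-up USD values
salary_cap = {2018: 100_000_000, 2019: 110_000_000}  # made-up cap values


def cap_share(salary, cap):
    """Cap-adjusted salary ratio: salary / league salary cap."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return salary / cap


def link_next_season(stats, salaries, salary_cap):
    """Attach the season t+1 cap share to each season-t stat row.

    Rows with no salary or cap record for season t+1 are dropped.
    """
    linked = []
    for row in stats:
        next_season = row["season"] + 1
        key = (row["player_id"], next_season)
        if key in salaries and next_season in salary_cap:
            share = cap_share(salaries[key], salary_cap[next_season])
            linked.append({**row, "next_cap_share": share})
    return linked


rows = link_next_season(stats, salaries, salary_cap)
```

Dropping rows without a following-season salary means that players who leave the league are excluded, which is worth noting as a possible source of selection bias.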

#### Additional Supporting Outcome Variables
- Contract type (e.g., rookie scale, maximum, veteran minimum, two-way)  
- Years of NBA playing experience  

---

### Filtering Method
- Apply a minimum games-played criterion. Some players appear in only a small fraction of the 82 games in a season, so their per-game averages depend heavily on those few games and are not fairly comparable with those of other players.
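
A games-played filter of this kind might look like the sketch below. The cutoff of 20 games and the field name `games_played` are hypothetical placeholders, since no threshold has been fixed yet.

```python
MIN_GAMES = 20  # hypothetical cutoff; the actual threshold is still to be decided


def filter_min_games(rows, min_games=MIN_GAMES):
    """Keep only player-season rows with at least `min_games` games played."""
    return [row for row in rows if row["games_played"] >= min_games]


# Made-up player-season rows for illustration.
sample = [
    {"player_id": "p1", "games_played": 72},
    {"player_id": "p2", "games_played": 5},
    {"player_id": "p3", "games_played": 20},
]
kept = filter_min_games(sample)
kept_ids = [row["player_id"] for row in kept]
```

Keeping the cutoff as a parameter makes it easy to check how sensitive the results are to the chosen threshold.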

---

### Real Datasets

#### 1. Basketball-Reference: Player Season Totals & Advanced Statistics  
**Source:** Basketball-Reference, https://www.basketball-reference.com/leagues/NBA_2026_per_game.html 

**Access:** Free, downloadable CSV files  

**Description:**  
This dataset provides comprehensive player-level statistics, including basic per-game box score statistics (e.g., PTS, REB, AST) and advanced metrics (e.g., WS/48), across multiple NBA seasons. It contains the majority of variables required for our analysis.

#### 2. Spotrac: NBA Contract and Salary Data  
**Source:** Spotrac, https://www.spotrac.com/nba/contracts/

**Access:** Free, downloadable CSV files

**Description:** 
This dataset contains each player's salary by season, contract and team notes, and other contract-related data.


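**Difference Between Real and Ideal Datasets:**
Our real datasets are very close to the ideal dataset: together, Basketball-Reference and Spotrac cover the performance and salary variables listed above, and both sources are widely used in NBA research. The two sources still need to be merged by player and season to produce one row per player–season.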




## Ethics

Here is a [list of real-world examples](https://deon.drivendata.org/examples/) for each item in the checklist that we can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

    Although NBA salaries and performance statistics are publicly available, our research involves the financial data of identifiable individuals. The principle of beneficence dictates that our study should focus on aggregate market trends and systemic patterns rather than singling out specific athletes as "outliers" or "overpaid." We should handle the data with professional decorum, ensuring our research serves to advance the understanding of sports economics and labor market efficiency without causing undue reputational harm to the individuals being studied.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

    When analyzing the degree to which box-score and advanced metrics explain salary cap share, we have an ethical obligation to maintain "epistemological integrity." This means clearly distinguishing between statistical explanation (correlation) and causation. As researchers, we must avoid "p-hacking" or manipulating the data range (2016–2024) to find a higher $R^2$ value. Ethically, our findings must be presented transparently, even if the chosen metrics fail to explain a significant portion of the salary variance, to avoid creating a false narrative about how the NBA labor market operates.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

    The selection of independent variables, specifically "all-in" metrics like PER, BPM, and Win Shares, carries ethical weight. These metrics are human-made constructs with inherent biases (e.g., PER’s favoritism toward high-volume shooting). In our research, it is ethically necessary to acknowledge that using these metrics is an audit of the metrics themselves as much as it is an audit of the NBA’s salary structure. As researchers, we must ensure the limitations of these mathematical formulas are disclosed so that the "explanation" provided is not mistaken for an objective truth about a player's total worth.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

    Ethical research requires a high degree of transparency regarding what the data cannot see. By focusing strictly on box scores and selected advanced stats, our study inherently ignores qualitative factors such as leadership, injury history, and defensive "gravity." It is an ethical imperative to frame the results with the caveat that these metrics only capture a portion of a player's professional value. This prevents the research from being misinterpreted as a definitive guide for what a player "should" be paid, which could otherwise be used to unfairly minimize the value of unquantifiable contributions.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

    Plan: We will implement a seasonal monitoring cycle to account for model drift. Because salary cap rules, positional value, and market behaviors shift annually, the model will be re-run each season. We will compare year-over-year performance, audit sample predictions, and generate a stability report to ensure the model remains accurate under new season conditions.
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

    Plan: We have established a response protocol for cases where a player may be reputationally harmed by model results (e.g., being publicly labeled as "overpaid"). We will evaluate these cases by auditing the specific inputs and logic that led to the label and will provide a mechanism to update the analysis if the harm stems from data inaccuracies or biased features.
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

    Plan: We have a "kill switch" and version control system in place. If errors are discovered in our analysis or visualizations after deployment, we can immediately roll back to a previous stable version or take the dashboard offline to prevent the spread of incorrect insights while we fix the underlying issue.
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

    Plan: To prevent misuse, all model outputs will include clear documentation and disclaimers. Specifically, we will explicitly state that the model identifies correlations, not causal relationships, to prevent stakeholders from assuming that changing one specific metric will certainly result in a higher salary. We will also monitor for "shadow" uses where the model might be used outside its intended scope (e.g., for injury prediction).


## Team Expectations

### Communication
- Primary channel: Discord (text + voice). Communicate via email if Discord is not working.
- Response window: ≤ 24 hrs Mon-Fri, ≤ 36 hrs on weekends.
- Tone: “blunt-but-polite”. 
- Guidelines: Use I-statements, assume good intent, and ask clarifying questions. Whenever a conflict arises, focus on addressing the issue and not blaming other groupmates.

### Meetings
- Baseline: meet once per week. We can be flexible about meeting time as long as we meet the baseline. 
- More meetings shall be scheduled if necessary, especially close to a deadline.
- Meetings should preferably be in person. We will meet online if the meeting takes place late at night (e.g., after 9 PM) or if an unexpected circumstance prevents someone from reaching the designated location on time.

### Decision-Making
- Decisions should be made by consensus.
- If there is no consensus after 10 min, decide by simple majority vote. If someone is unable to show up for a particular decision and does not submit a vote virtually, their vote is forfeited.
- When a deadline is < 24 hrs away, a decision can be made with confirmation from at least 3 members. When a deadline is < 10 hrs away, confirmation from at least 2 members is enough. When a deadline is < 2 hrs away, a decision can be made unilaterally, provided that 1) the decision-maker is fully confident that the decision will be more beneficial than harmful to the group; 2) the decision-maker takes full responsibility for the decision after it is made; and 3) the other group members have not responded within a 30-minute window.
- Regarding the previous point, no group member may make more than one unilateral decision. A unilateral decision should be treated as an absolute last resort.

### Deadlines
- Internal deadlines are 48 hrs before the actual assignment deadline.
- Members should agree on the final deliverable before the deadline. Otherwise, extra meeting(s) should be called as soon as possible to discuss refinements.

### Conflict-Resolution Process
- Calm yourself down. A decision shall be made only when no group member is acting on strong emotions.
- Think before you talk! Don’t talk just because you want to win the argument.
- Step into the other party’s shoes. There is often common ground even when there seems to be complete disagreement.
- If absolutely necessary, talk to TAs about this issue and fix the problem together (last resort).
- If a group member submits work late or does not meet team expectations, speak with them directly and help them out while reinforcing team expectations. If this happens repeatedly, reach out to the TAs.

### Inclusivity & Well-Being
- In an online meeting: Cameras optional; mic required.
- Allow religious holidays and accessibility needs.
- No “stacking” late-night deadlines.

### Agreement
By signing below with their full name, each group member confirms they have read, understand, and agree to follow the team expectations.

- Jimmy Ouyang
- Jeremy Wei
- Zack Chen
- Subika Haider
- Jerry Ying



## Project Timeline Proposal

### Team Meeting Schedule

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting | Status |
|-------------|--------------|--------------------------|-------------------|--------|
| 1/26/2026 | 17:00 | Determine the best form of communication; read & think about COGS 108 project expectations; review previous COGS 108 projects | Assign group-member tasks; review and discuss selected COGS 108 projects; analyze and evaluate the projects | ✅ |
| 2/3/2026  | 21:30 | Read through COGS 108 project-proposal documents and critically analyze them | Discuss ideal dataset(s) and ethics; draft project proposal; assign group-member tasks | ✅ |
| 2/4/2026  | 13:00 | Members complete assigned tasks; finish first draft of project proposal | Discuss and complete final project proposal | ✅ |
| 2/6/2026  | 13:00 | N/A | Discuss and confirm tasks assigned to each member until next progress check | ⏳ |
| 2/11/2026 | 13:00 | Group members make progress on assigned tasks | Progress check and peer review | ⏳ |
| 2/17/2026 | 13:00 | Group members complete Checkpoint 1 requirements | Discuss and finalize Checkpoint 1 document; confirm tasks until next progress check | ⏳ |
| 2/25/2026 | 13:00 | Progress on data import, wrangling, and EDA | Progress check and peer review; review/edit wrangling & EDA; discuss analysis plan | ⏳ |
| 3/3/2026  | 13:00 | Finalize wrangling, EDA, and analysis | Discuss and finalize Checkpoint 2 document; complete project check-in; assign next tasks | ⏳ |
| 3/6/2026  | 13:00 | Complete analysis; draft results, conclusion, and discussion | Progress check and peer review; discuss and edit full project | ⏳ |
| 3/13/2026 | 13:00 | Progress on assigned tasks | Progress check and peer review | ⏳ |
| 3/16/2026 | 13:00 | Group members finish assigned parts | Discuss final project and video; finalize team-evaluation survey | ⏳ |
| 3/18/2026 | Before 23:59 | Video and project refined | Final project and video check; submit project on time | ⏳ |

---

### Role & Responsibility Matrix

| # | Tasks | Lead Contributor | Backup Contributor | Support | Notes for Implementation |
|---|--------------------|----------|------------|-------------|-----------|
| 1 | Project administration & timeline tracking | Jeremy | Jerry | Everyone | Sets agendas, posts minutes, updates Kanban, reminds team of deadlines. |
| 2 | Conceptualization & research question | Jerry | Zack | Everyone | Frames hypothesis, defines variables, keeps scope realistic. |
| 3 | Background / related-work section | Zack | Jerry | Everyone | Gathers literature and comparable projects; drafts background text. |
| 4 | Data sourcing & ethics checklist | Subika | Jeremy | Zack | Locates raw datasets, documents licenses/IRB issues, stores files in `/data/raw`. |
| 5 | Data curation & wrangling notebooks | Subika | Jimmy | Jeremy | Cleans, merges, and outputs tidy `player_season.csv`. |
| 6 | Analysis & modeling notebooks | Jerry | Subika | Jimmy | Builds regression models, checks assumptions, saves results tables/figures. |
| 7 | Visualization (EDA + final figs) | Jimmy | Jerry | Subika | Creates clear, colour-blind-friendly plots; exports to `figs/`. |
| 8 | Software engineering / GitOps | Jimmy | Jeremy | Everyone | Maintains repo structure, code style, CI tests, branch protection. |
| 9 | Writing – results & discussion | Jerry | Subika | Zack | Interprets coefficients, links to background, notes limitations. |
|10 | Writing – abstract, intro, methods | Zack | Jerry | Jeremy | Ensures consistency with background & data sections. |
|11 | Editing & proof-reading pass | Everyone | – | – | Two-person review rule before any section is marked “Done.” |
|12 | Video script & slide deck | Jeremy | Jimmy | Everyone | 2-min script locked by 3/13; rehearsals in 3/16 meeting. |
|13 | Video recording & post-production | Jimmy | Jeremy | Zack | Uses OBS + iMovie/DaVinci; exports MP4 < 100 MB. |
|14 | Final QA & submission to Gradescope | Jeremy | Jimmy | Everyone | Runs notebook end-to-end, checks links, submits by 3/18 23:59. |

### Credits
We used AI tools to help with citations, to format tables in the Jupyter Notebook, and to fix grammar. All ideas and analysis are our own.
