# COGS 108 - Data Checkpoint

## Authors

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background Research, Conceptualization, Data Curation, Experimental Investigation, Methodology, Project Administration, Software, Visualization, Writing – Original Draft, Writing – Review & Editing

- Justin Bourdlaies: Background Research, Experimental Investigation
- Zee Avila: Project Administration, Experimental Investigation
- Lance Mendoza: Conceptualization, Visualization, Methodology
- Jefferson Umanzor Urrutia: Data Curation, Software, Writing – Review & Editing
- Majd Abu-Shamiyeh: Writing – Original Draft, Writing – Review & Editing

## Research Question

To what extent does an NBA player’s height (in inches) predict points scored per 36 minutes during the 2025-2026 NBA regular season? After controlling for position and other key performance metrics such as usage rate and field goal attempts, how does the contribution of height, measured by its partial R² within a multiple regression model, vary across player positions and over time?

Additionally, how do scoring patterns, including shot attempts and efficiency, differ across height groups, and has the relationship between height and scoring efficiency changed across recent NBA seasons?
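Because the question measures height in inches, any height recorded in a feet-inches form has to be converted first. The sketch below assumes heights arrive as strings such as `6-7`; that format and the function name `height_to_inches` are assumptions for illustration, not a confirmed property of our sources.

```python
def height_to_inches(height: str) -> int:
    """Convert a 'feet-inches' string such as '6-7' to total inches.

    Assumes the hyphen-separated format; other formats raise ValueError.
    """
    feet_text, inches_text = height.strip().split("-")
    feet, inches = int(feet_text), int(inches_text)
    if not 0 <= inches < 12:
        raise ValueError(f"inches part out of range in {height!r}")
    return feet * 12 + inches


print(height_to_inches("6-7"))  # 79
```

Rejecting values such as `6-12` keeps malformed entries from silently turning into plausible-looking heights.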

## Background and Prior Work

Player physical attributes, particularly height, have long played a central role in how basketball players are evaluated and used at the professional level. In the NBA, height strongly influences positional assignment and on-court responsibilities. Taller players are more likely to occupy interior positions such as center or power forward, where responsibilities emphasize rebounding, rim protection, and screening rather than high-volume scoring. Shorter players, especially guards, are typically more involved in ball handling and shot creation. Because of this specialization, height may be indirectly related to scoring output through role differences rather than scoring ability alone. A clustering study supports this view: players in the top height/weight category with low experience were characterized mostly by "two-point field goals", "offensive and defensive rebounds", "blocks", and "fouls".<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Prior basketball analytics research has shown that scoring output varies substantially by position, which is closely correlated with height. Analyses of NBA data indicate that guards and wings tend to score more points per minute than forwards and centers due to higher usage rates and greater involvement in offensive actions. While the modern NBA has become more positionless, height still affects how players are used offensively, with taller players generally contributing less to scoring volume and more to non-scoring tasks. As stated in the Southwest Journal, "height remains a factor, but not the only one dictating a player's role".<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

Academic research has also examined the relationship between player anthropometrics and performance statistics. Prior work examined NBA player height and weight in relation to box score metrics and found that height was strongly associated with rebounding and shot blocking, but had a weaker and often negative relationship with scoring when controlling for playing time.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)

Modern basketball analytics frequently normalize scoring by playing time using metrics such as points per 36 minutes to allow fair comparisons across players with different minute allocations. NBA statistical documentation recommends per-minute or per-possession metrics when evaluating player production.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) Community-driven analytics projects using publicly available NBA data have applied regression models to examine how physical attributes relate to scoring and often find that variables such as usage rate and offensive role explain much more variance than height alone. We aim to examine how height itself relates to performance, which these previous studies have not fully explored.
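As a minimal sketch of the normalization described above, points per 36 minutes is total points divided by total minutes, multiplied by 36. The function name `points_per_36` and the optional `min_minutes` cutoff are assumptions added for illustration; the cutoff is one possible way to avoid inflated rates for players with very few minutes.

```python
from typing import Optional


def points_per_36(total_points: float, total_minutes: float,
                  min_minutes: float = 0.0) -> Optional[float]:
    """Scale scoring to a 36-minute basis.

    Returns None when the player has no minutes or falls below the
    (hypothetical) minimum-minutes cutoff.
    """
    if total_minutes <= 0 or total_minutes < min_minutes:
        return None
    return total_points / total_minutes * 36


# Illustrative numbers only, not real player data.
print(points_per_36(1500, 2400))                 # 22.5
print(points_per_36(12, 10, min_minutes=100))    # None
```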

This project builds on prior work by focusing specifically on the relationship between player height and points scored per 36 minutes during a single modern NBA season. By treating height as a continuous variable and measuring both statistical significance and variance explained, this analysis aims to determine whether height has a meaningful independent effect on scoring rate or whether its impact is small relative to other factors.
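One common way to quantify "variance explained" by height beyond the other predictors is the partial R², computed from two nested regressions as (SSE_reduced − SSE_full) / SSE_reduced. The sketch below uses NumPy least squares on synthetic data; the variable names, coefficients, and simulated values are all assumptions for illustration and say nothing about the real NBA data.

```python
import numpy as np


def sse(X, y):
    """Sum of squared residuals from an OLS fit of y on X (intercept added)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def partial_r2(controls, predictor, y):
    """Share of the reduced model's residual variance explained by adding predictor."""
    sse_reduced = sse(controls, y)
    sse_full = sse(np.column_stack([controls, predictor]), y)
    return (sse_reduced - sse_full) / sse_reduced


# Synthetic data (assumed, illustrative): scoring driven mostly by usage rate.
rng = np.random.default_rng(0)
n = 300
usage = rng.normal(20, 5, n)
height = rng.normal(78, 3, n)
pts36 = 0.8 * usage - 0.1 * height + rng.normal(0, 2, n)

value = partial_r2(usage.reshape(-1, 1), height, pts36)
print(f"partial R^2 of height given usage: {value:.3f}")
```

Running the same function within each position group, or within each season, would give the per-group comparison the research question asks about.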

References

1. <a name="cite_note-1"></a> [^](#cite_ref-1)
Zhang, S., Lorenzo, A., Gómez, M., Mateus N., Gonçalves, B., Sampaio, J. (20 Apr 2018) Clustering performances in the NBA according to players' anthropometric attributes and playing experience. *PubMed*. https://pubmed.ncbi.nlm.nih.gov/29676222/

2. <a name="cite_note-2"></a> [^](#cite_ref-2)
Ilic S. (12 Feb 2024) Average NBA Height By Position 2024: How They Measure Up?. *Southwest Journal*. https://www.southwestjournal.com/sport/nba/average-nba-height-by-position/

3. <a name="cite_note-3"></a> [^](#cite_ref-3)
Yixiong, C., Liu, F., Bao, D., Liu, H., Zhang, S., Gómez, M. (21 Oct 2019) Key Anthropometric and Physical Determinants for Different Playing Positions During National Basketball Association Draft Combine Test. *Frontiers*. https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.02359/full

4. <a name="cite_note-4"></a> [^](#cite_ref-4)
Wikipedia (19 Jul 2022) Player efficiency rating: Revision history. *Wikipedia*. https://en.wikipedia.org/wiki/Player_efficiency_rating

## Hypothesis


We predict that there will be a significant relationship between an NBA player's height and their points scored per 36 minutes. We expect taller players to score slightly fewer points per 36 minutes on average. This is because taller players have different roles, such as defending and rebounding, which can prevent them from focusing on attacking and scoring. We also predict that height will explain only a small share of the variance in scoring rate.

## Data

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

DATA:

Link: https://www.nba.com/stats/players/bio

Link: https://www.nbastuffer.com/2025-2026-nba-player-stats/

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.
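One possible way to combine a player bio table with a player stats table is to join on a normalized player name, since the same player may be spelled with or without accents or punctuation in different sources. The sketch below is an assumption about how such a join could work; the helper `name_key`, the player names, and the values are invented for illustration.

```python
import unicodedata


def name_key(name: str) -> str:
    """Build a join key: strip accents and punctuation, lowercase, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch for ch in no_accents if ch.isalnum() or ch.isspace())
    return " ".join(kept.lower().split())


# Invented example rows: height in inches, and points per 36 minutes.
bio_heights = {"André Müller": 79, "T.J. Sample": 75}
scoring = {"Andre Muller": 18.4, "TJ Sample": 12.0, "Max Placeholder": 9.5}

bio_by_key = {name_key(n): h for n, h in bio_heights.items()}
stats_by_key = {name_key(n): p for n, p in scoring.items()}

# Inner join on the normalized key, plus a list of names found in only one source.
merged = {k: (bio_by_key[k], stats_by_key[k])
          for k in sorted(bio_by_key.keys() & stats_by_key.keys())}
unmatched = sorted(bio_by_key.keys() ^ stats_by_key.keys())

print(merged)
print(unmatched)
```

Reviewing the unmatched names by hand is worthwhile, because players who share a name or who changed teams mid-season can still collide or drop out under a name-only key.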

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [1]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Successfully downloaded: airline-safety.csv
Successfully downloaded: bad-drivers.csv

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
       
> The data are publicly available athlete performance statistics, collected with no direct interaction with human subjects.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Points per 36 minutes was chosen to represent a substantial amount of playing time (approximately three quarters of a game), but may still inflate scoring rates for players with limited minutes or specific roles. We can begin to mitigate such bias by acknowledging the limitations of points per 36 minutes and interpreting results cautiously rather than as definitive measures of scoring ability.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> We can limit PII exposure by using only publicly available player statistics and collecting no personal information beyond what is necessary for our analysis.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We are not collecting protected attributes (race/gender), so downstream bias testing by protected group is not possible with our data. We will avoid claims about such groups.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> The data are public and not sensitive. We will not store passwords, keys, or any private information in the repo.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> The data collected is publicly available and non-sensitive. However, individual records could be removed from future analyses upon request.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> The data are publicly available and non-sensitive, so they may be retained for reproducibility and future reference.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> We were mindful of potential blindspots in a statistical approach. We confirmed our assumptions using basic basketball context, such as player roles and how scoring opportunities may vary by position and team system.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> The dataset may reflect bias due to imbalanced height distributions across positions and survivorship bias, as only players who reached the NBA are included. We can mitigate potential bias by framing height as one factor among many and avoiding causal claims about its effect on scoring.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> We will avoid misleading graphs and avoid claiming height causes scoring. We will show the full spread of the data and point out outliers.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> We will avoid displaying personal identifiers and instead focus on aggregate statistical relationships rather than individual players.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> The analysis is documented in a version-controlled Jupyter notebook, making the steps reproducible and allowing issues to be identified and corrected if needed.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> Height may act as a proxy for player position or role, which could lead to oversimplified interpretations of scoring ability. We will interpret results carefully and avoid oversimplified claims about height.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> Points per 36 minutes was selected to standardize scoring across players and reflect meaningful playing time, though it assumes linear scaling and may not capture all in-game dynamics.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> The methods used in this analysis are straightforward and interpretable, allowing results to be explained clearly without requiring complex model explanations.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> We made an effort to clearly explain the limitations of the analysis, including potential sources of bias and the fact that results do not necessarily imply causation.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> Since the analysis is not deployed, ongoing monitoring is not applicable. However, future work could reassess results as new data become available.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> In the unlikely event of harm or misuse, we would review the analysis and clarify or correct the findings as needed.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> Since the analysis is not deployed, rollback is not applicable. Results could also be updated or removed if necessary.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> Results could be misinterpreted to suggest that height alone determines scoring ability, so findings are presented as correlational and exploratory.

## Team Expectations 

Justin Bourdlaies, Zee Avila, Lance Mendoza, Jefferson Umanzor Urrutia, Majd Abu-Shamiyeh

1. Check the group chat at least once a day and respond
2. Do your assigned share of work
3. If something comes up, discuss with the group and work can be redistributed accordingly (e.g. one person who misses work one week can help do more research the next week)
4. If there are conflicting plans/ideas for parts of the project, compromise and integrate as much of both as we can

## Project Timeline Proposal

W7: Data Checkpoint 01 due on 18 February
- Export season data and choose a clear cutoff date
- Clean data and compute points per 36
- Save processed dataset for reuse and push notebook

W9: EDA Checkpoint 02 due on 6 March
- Load processed data from data/02-processed
- Create key EDA visuals and document patterns and outliers
- Decide final analysis approach and push notebook

W10: Final Project + Video 03 due on 18 March
- Run final statistical analysis with controls
- Finish figures and write discussion, limitations, and conclusion
- Record video summary and push final notebook