# COGS 108 - Data Checkpoint

## Authors

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background Research, Conceptualization, Data Curation, Experimental Investigation, Methodology, Project Administration, Software, Visualization, Writing – Original Draft, Writing – Review & Editing

- Justin Bourdlaies: Background Research, Experimental Investigation
- Zee Avila: Project Administration, Experimental Investigation
- Lance Mendoza: Conceptualization, Visualization, Methodology
- Jefferson Umanzor Urrutia: Data Curation, Software, Writing – Review & Editing
- Majd Abu-Shamiyeh: Writing – Original Draft, Writing – Review & Editing

## Research Question

To what extent does an NBA player’s height (in inches) predict points scored per 36 minutes during the 2025-2026 NBA regular season? After controlling for position and other key performance metrics such as usage rate and field goal attempts, how does height's contribution, measured as its partial R² within a multiple regression model, vary across player positions and over time?

Additionally, how do scoring patterns, including shot attempts and efficiency, differ across height groups, and has the relationship between height and scoring efficiency changed across recent NBA seasons?
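The partial R² named in the research question can be sketched as a comparison of two least-squares fits: one with the control variables only, and one that also includes height. The block below is a minimal illustration on synthetic data, not the project's final model. The helper names `sse` and `partial_r2`, the simulated coefficients, and the choice of usage rate as the only control are all assumptions made for the example.

```python
import numpy as np

def sse(y, X):
    """Sum of squared residuals from an OLS fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def partial_r2(y, controls, extra):
    """Share of the variance left unexplained by `controls` that `extra` explains."""
    sse_reduced = sse(y, controls)
    sse_full = sse(y, np.column_stack([controls, extra]))
    return (sse_reduced - sse_full) / sse_reduced

# Synthetic data for illustration only (not real NBA values)
rng = np.random.default_rng(0)
n = 500
usage = rng.normal(20, 5, n)       # stand-in for usage rate
height = rng.normal(78.6, 3.3, n)  # stand-in for height in inches
pts36 = 0.8 * usage - 0.1 * (height - 78.6) + rng.normal(0, 3, n)

print(round(partial_r2(pts36, usage, height), 4))
```

Running the same function separately within each position group, or within each season, would give the position-by-position and over-time comparison the question asks about.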

## Background and Prior Work

Player physical attributes, particularly height, have long played a central role in how basketball players are evaluated and used at the professional level. In the NBA, height strongly influences positional assignment and on-court responsibilities. Taller players are more likely to occupy interior positions such as center or power forward, where responsibilities emphasize rebounding, rim protection, and screening rather than high-volume scoring. Shorter players, especially guards, are typically more involved in ball handling and shot creation. Because of this specialization, height may be indirectly related to scoring output through role differences rather than scoring ability alone. This is further supported by the fact that players in the top height/weight category with low experience were mostly categorized by "two-point field goals", "offensive and defensive rebounds", "blocks", and "fouls".<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Prior basketball analytics research has shown that scoring output varies substantially by position, which is closely correlated with height. Analyses of NBA data indicate that guards and wings tend to score more points per minute than forwards and centers due to higher usage rates and greater involvement in offensive actions. While the modern NBA has become more positionless, height still affects how players are used offensively, with taller players generally contributing less to scoring volume and more to non-scoring tasks. As stated in the Southwest Journal, "height remains a factor, but not the only one dictating a player's role".<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

Academic research has also examined the relationship between player anthropometrics and performance statistics. One study examined NBA player height and weight in relation to box score metrics and found that height was strongly associated with rebounding and shot blocking, but had a weaker and often negative relationship with scoring when controlling for playing time.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)

Modern basketball analytics frequently normalize scoring by playing time using metrics such as points per 36 minutes to allow fair comparisons across players with different minute allocations. NBA statistical documentation recommends per-minute or per-possession metrics when evaluating player production.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) Community-driven analytics projects using publicly available NBA data have applied regression models to examine how physical attributes relate to scoring and often find that variables such as usage rate and offensive role explain much more variance than height alone. However, these studies leave height's own contribution comparatively unexplored, and that is the gap we aim to address.

This project builds on prior work by focusing specifically on the relationship between player height and points scored per 36 minutes during a single modern NBA season. By treating height as a continuous variable and measuring both statistical significance and variance explained, this analysis aims to determine whether height has a meaningful independent effect on scoring rate or whether its impact is small relative to other factors.

References

1. <a name="cite_note-1"></a> [^](#cite_ref-1)
Zhang, S., Lorenzo, A., Gómez, M., Mateus, N., Gonçalves, B., Sampaio, J. (20 Apr 2018) Clustering performances in the NBA according to players' anthropometric attributes and playing experience. *PubMed*. https://pubmed.ncbi.nlm.nih.gov/29676222/

2. <a name="cite_note-2"></a> [^](#cite_ref-2)
Ilic, S. (12 Feb 2024) Average NBA Height By Position 2024: How They Measure Up?. *Southwest Journal*. https://www.southwestjournal.com/sport/nba/average-nba-height-by-position/

3. <a name="cite_note-3"></a> [^](#cite_ref-3)
Yixiong, C., Liu, F., Bao, D., Liu, H., Zhang, S., Gómez, M. (21 Oct 2019) Key Anthropometric and Physical Determinants for Different Playing Positions During National Basketball Association Draft Combine Test. *Frontiers*. https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.02359/full

4. <a name="cite_note-4"></a> [^](#cite_ref-4)
Wikipedia (19 Jul 2022) Player efficiency rating: Revision history. *Wikipedia*. https://en.wikipedia.org/wiki/Player_efficiency_rating

## Hypothesis


We predict that there will be a statistically significant relationship between an NBA player's height and their points scored per 36 minutes. We expect taller players to score slightly fewer points per 36 minutes on average. This is because taller players more often fill roles centered on defending and rebounding, which can limit their scoring opportunities. We also predict that height will explain only a small share of the variance in scoring rate.

## Data

### Data overview


DATA:

Link: https://www.nba.com/stats/players/bio

Link: https://www.nbastuffer.com/2025-2026-nba-player-stats/

### Dataset #1: NBA 2025–26 Player Bio and Performance Overview

The data consists of biographical and performance context data for players in the NBA 2025-2026 regular season at the player level. Each row represents one player, identified by player_id, with demographic fields for age, height, weight, college, country, and team abbreviation. Along with biographical data, the dataset contains performance-related statistics, including points per game (pts), rebounds (reb), assists (ast), usage percentage (usg_pct), true shooting percentage (ts_pct), assist percentage (ast_pct), and net rating (net_rating). Height is presented both as a formatted string (feet-inches) and as a numeric variable (player_height_inches), which is measured in inches and can be used directly in calculations.

The data has 532 players and 23 attributes (columns). Because there is exactly one row per player_id, the table is uniquely indexed and already tidy. Missingness is minimal: only player weight (6 values) and draft number (1 value) have missing entries, and the performance variables are complete, which supports player-level analysis. Height values range from 67 to 88 inches, which is realistic for professional basketball players and suggests no extreme outliers in this key variable.

A limitation of this dataset is the absence of minutes played. Because points per 36 minutes requires minutes, a second dataset containing minutes per game (`mpg`) and points per game (`ppg`) will be merged in. That dataset identifies players by name rather than by player_id, so the merge key will need to be a cleaned player name. Additionally, because the data is derived from official NBA sources and limited to the 2025-2026 season, it only reflects players who appeared in that season and may not provide insights into broader historical trends.

For each dataset include the following information
- Dataset #1
  - Dataset Name: NBA.com LeagueDashPlayerBioStats 2025–26 Regular Season
  - Link to the dataset: https://www.nba.com/stats/players/bio
  - Number of observations: 532 rows, one row per player
  - Number of variables: 23 columns
  - Description of the variables most relevant to this project
    - We use `PLAYER_ID` and `PLAYER_NAME` to identify players. Height is provided as `PLAYER_HEIGHT_INCHES` in inches. We also use context variables such as `TEAM_ABBREVIATION`, `AGE`, and performance context variables like `USG_PCT` and `TS_PCT`
  - Descriptions of any shortcomings this dataset has with respect to the project
    - This dataset does not include total minutes played, so we cannot compute points per 36 minutes from it alone. It is also season-to-date unless we specify a cutoff date for when the data were pulled
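- Dataset #2
  - Dataset Name: NBAstuffer 2025-2026 NBA Player Stats
  - Link to the dataset: https://www.nbastuffer.com/2025-2026-nba-player-stats/
  - Number of observations: 588 rows as downloaded
  - Number of variables: 30 columns as downloaded (29 after dropping the empty `rank` column)
  - Description of the variables most relevant to this project
    - We use `name`, `team`, and `pos` to identify players and positions. Scoring and playing time come from `ppg` (points per game) and `mpg` (minutes per game), with `gp` (games played) for context. Role and efficiency context comes from `usgpct` and `tspct`
  - Descriptions of any shortcomings this dataset has with respect to the project
    - This dataset has no `PLAYER_ID` column, so it cannot be joined to Dataset 1 on that key directly. It reports per-game averages rather than season totals, the `cur` column is missing for 56 rows (about 9.5%), and players with very few minutes produce unstable per-36 values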

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.
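- We will compute `points_per_36 = ppg / mpg * 36` from Dataset 2's per-game columns, which equals `PTS / MIN * 36` up to rounding when both averages cover the same games. We will then merge Dataset 2 with Dataset 1 to attach height to the per-36 scoring data. Dataset 2 as loaded identifies players by `name` and has no `PLAYER_ID` column, so the merge will match on a cleaned player name rather than on `PLAYER_ID`. The final merged dataset will be written to nba_2025_26_merged.csv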
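One way the combination step could look is sketched below. Dataset 2 as loaded has a `name` column but no `PLAYER_ID`, so this example joins on a normalized name key instead. The helper `name_key` is hypothetical, and the toy rows use placeholder values rather than figures from the real files. The spelling difference between "Luka Dončić" and "Luka Doncic" stands in for the kind of mismatch that normalization is meant to absorb.

```python
import unicodedata
import pandas as pd

def name_key(name: str) -> str:
    """Lowercase a name and strip accents and punctuation so spelling variants align."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    kept = "".join(ch for ch in ascii_name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

# Toy rows for illustration; numeric values are placeholders, not real data
bio = pd.DataFrame({
    "player_name": ["A.J. Lawson", "Luka Dončić"],
    "player_height_inches": [78, 79],
})
stats = pd.DataFrame({
    "name": ["AJ Lawson", "Luka Doncic", "Unmatched Player"],
    "ppg": [3.8, 32.8, 5.0],
    "mpg": [10.0, 35.5, 12.0],
})

bio["name_key"] = bio["player_name"].map(name_key)
stats["name_key"] = stats["name"].map(name_key)
stats["points_per_36"] = stats["ppg"] / stats["mpg"] * 36

# validate="many_to_one" raises if a key appears more than once on the bio side;
# indicator=True adds a `_merge` column so unmatched rows can be listed and reviewed
merged = stats.merge(
    bio[["name_key", "player_height_inches"]],
    on="name_key",
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = merged.loc[merged["_merge"] == "left_only", "name"].tolist()
print(unmatched)
```

Reviewing the `unmatched` list by hand before dropping anything helps catch names that differ by a suffix or nickname rather than by accents or punctuation.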

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Overall Download Progress:   0%|          | 0/2 [00:00<?, ?it/s]

Overall Download Progress:  50%|█████     | 1/2 [00:00<00:00,  3.14it/s]

Successfully downloaded: nba_com_players_bio_2025_26.json


Overall Download Progress: 100%|██████████| 2/2 [00:00<00:00,  5.46it/s]

Successfully downloaded: bad-drivers.csv





### Dataset #1: NBA.com Player Bio Stats, 2025–26 Regular Season

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.
   - This dataset comes from the NBA stats player bio table for the 2025–26 regular season. Each row represents one player, and the dataset includes identifiers and basic bio attributes along with a small set of performance context metrics.
   - The main variable we need from this dataset is height. Height appears both as a formatted string (feet-inches) and as `PLAYER_HEIGHT_INCHES`, which is already provided in inches and can be used directly in analysis. Other useful context variables include `TEAM_ABBREVIATION`, `AGE`, and role or efficiency proxies such as `USG_PCT` and `TS_PCT`.
   2. If there are any major concerns with the dataset, describe them. 
   - A key limitation for our project is that this bio dataset does not include minutes played. Because points per 36 minutes requires total minutes, we will use a second dataset that includes `MIN` and `PTS` totals and then merge the two datasets using `PLAYER_ID`.
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna of fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you


In [11]:
import json
import pandas as pd
from pathlib import Path

RAW_DIR = Path("data/00-raw")
PROC_DIR = Path("data/02-processed")
PROC_DIR.mkdir(parents=True, exist_ok=True)

def nba_resultset_to_df(json_path, resultset_index=0):
    with open(json_path, "r") as f:
        obj = json.load(f)
    rs = obj["resultSets"][resultset_index]
    df = pd.DataFrame(rs["rowSet"], columns=rs["headers"])
    df.columns = [c.strip().lower() for c in df.columns]
    return df, obj

bio_path = RAW_DIR / "nba_com_players_bio_2025_26.json"
bio, bio_meta = nba_resultset_to_df(bio_path)

print("Dataset 1 shape:", bio.shape)
display(bio.head())

# Tidy check
print("One row per player_id:", bio["player_id"].is_unique)
print("Missing player_name:", bio["player_name"].isna().sum())

# Convert important columns to numeric
num_cols = ["player_height_inches", "age", "gp", "pts", "reb", "ast", "usg_pct", "ts_pct", "ast_pct", "net_rating"]
for c in num_cols:
    if c in bio.columns:
        bio[c] = pd.to_numeric(bio[c], errors="coerce")

# Missingness
missing_counts = bio.isna().sum().sort_values(ascending=False)
missing_fracs = bio.isna().mean().sort_values(ascending=False)

print("Top missing counts")
display(missing_counts.head(10))

print("Top missing fractions")
display(missing_fracs.head(10))

# Outliers and suspicious values
if "player_height_inches" in bio.columns:
    display(bio["player_height_inches"].describe())
    suspicious_height = bio[(bio["player_height_inches"] < 65) | (bio["player_height_inches"] > 90)]
    if len(suspicious_height) > 0:
        display(suspicious_height[["player_name", "player_height", "player_height_inches"]].head(30))

# Save processed dataset 1
bio_out = PROC_DIR / "nba_com_bio_2025_26_processed.csv"
bio.to_csv(bio_out, index=False)
print("Wrote processed Dataset 1 to", bio_out)

# Reload to confirm write succeeded
bio_reload = pd.read_csv(bio_out)
print("Reloaded shape:", bio_reload.shape)

Dataset 1 shape: (532, 23)


|  | player_id | player_name | team_id | team_abbreviation | age | player_height | player_height_inches | player_weight | college | country | ... | gp | pts | reb | ast | net_rating | oreb_pct | dreb_pct | usg_pct | ts_pct | ast_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1630639 | A.J. Lawson | 1610612761 | TOR | 25.0 | 6-6 | 78 | 179 | South Carolina | Canada | ... | 13 | 3.8 | 1.8 | 0.2 | -14.4 | 0.056 | 0.16 | 0.186 | 0.545 | 0.045 |
| 1 | 1631260 | AJ Green | 1610612749 | MIL | 26.0 | 6-4 | 76 | 190 | Northern Iowa | USA | ... | 49 | 10.7 | 2.6 | 2.0 | -0.2 | 0.008 | 0.076 | 0.128 | 0.643 | 0.088 |
| 2 | 1642358 | AJ Johnson | 1610612742 | DAL | 21.0 | 6-5 | 77 | 160 |  | USA | ... | 28 | 2.5 | 1.1 | 0.8 | -3.8 | 0.032 | 0.104 | 0.197 | 0.375 | 0.147 |
| 3 | 203932 | Aaron Gordon | 1610612743 | DEN | 30.0 | 6-8 | 80 | 235 | Arizona | USA | ... | 23 | 17.7 | 6.2 | 2.5 | 14.0 | 0.047 | 0.162 | 0.228 | 0.632 | 0.13 |
| 4 | 1628988 | Aaron Holiday | 1610612745 | HOU | 29.0 | 6-0 | 72 | 185 | UCLA | USA | ... | 35 | 5.7 | 0.9 | 1.0 | 3.8 | 0.012 | 0.049 | 0.173 | 0.57 | 0.103 |


One row per player_id: True
Missing player_name: 0
Top missing counts


player_weight    6
draft_number     1
player_id        0
ts_pct           0
usg_pct          0
dreb_pct         0
oreb_pct         0
net_rating       0
ast              0
reb              0
dtype: int64

Top missing fractions


player_weight    0.011278
draft_number     0.001880
player_id        0.000000
ts_pct           0.000000
usg_pct          0.000000
dreb_pct         0.000000
oreb_pct         0.000000
net_rating       0.000000
ast              0.000000
reb              0.000000
dtype: float64

count    532.000000
mean      78.588346
std        3.292647
min       67.000000
25%       76.000000
50%       79.000000
75%       81.000000
max       88.000000
Name: player_height_inches, dtype: float64

Wrote processed Dataset 1 to data/02-processed/nba_com_bio_2025_26_processed.csv
Reloaded shape: (532, 23)
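Because height appears twice in this table, once as a formatted string and once in inches, the two columns can be cross-checked against each other as an extra sanity test. The sketch below uses the five rows displayed in the cell output; the helper `height_to_inches` is a hypothetical name, and on the full table the same comparison would be run on `bio` directly.

```python
import pandas as pd

def height_to_inches(height: str) -> int:
    """Convert a 'feet-inches' string such as '6-6' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)

# The five rows shown in the displayed head of the bio table
sample = pd.DataFrame({
    "player_name": ["A.J. Lawson", "AJ Green", "AJ Johnson", "Aaron Gordon", "Aaron Holiday"],
    "player_height": ["6-6", "6-4", "6-5", "6-8", "6-0"],
    "player_height_inches": [78, 76, 77, 80, 72],
})

sample["inches_from_string"] = sample["player_height"].map(height_to_inches)
mismatches = sample[sample["inches_from_string"] != sample["player_height_inches"]]
print(len(mismatches))  # prints 0 for these five rows
```

Any row that lands in `mismatches` on the full dataset would be worth flagging as a suspicious entry before the merge.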


### Dataset #2: NBAstuffer 2025–26 Player Stats

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.

In [31]:
# Importing necessary libraries
import pandas as pd

# Dataset 2's file path and dataframe
dataset_two_file_path = "data/00-raw/2025-2026 NBA Player Stats - NBAstuffer.csv"
dataset_two_player_stats = pd.read_csv(dataset_two_file_path)

# Dataset 2's shape and head
print(dataset_two_player_stats.shape)
print(dataset_two_player_stats.head())

(588, 30)
   RANK                     NAME TEAM CUR  POS   AGE  GP   MpG  USG%   TO%  \
0   NaN              Luka Doncic  Lal   *  F-G  27.0  42  35.5  37.9  16.2   
1   NaN  Shai Gilgeous-Alexander  Okc   *    G  27.6  49  33.3  33.5   9.6   
2   NaN          Anthony Edwards  Min   *    G  24.5  46  35.5  31.2  11.6   
3   NaN             Jaylen Brown  Bos   *  G-F  29.3  49  34.2  36.9  13.7   
4   NaN         Donovan Mitchell  Cle   *    G  29.4  51  33.7  32.6  13.0   

   ...  ApG  SpG  BpG  TOpG   P+R   P+A  P+R+A    VI   ORtg   DRtg  
0  ...  8.5  1.5  0.5   4.3  40.7  41.4   49.2  14.6  119.9  110.7  
1  ...  6.4  1.3  0.8   2.1  36.2  38.2   42.7  11.6  134.5  106.4  
2  ...  3.7  1.3  0.8   2.7  34.5  33.0   38.2   9.3  119.4  111.9  
3  ...  4.7  1.0  0.4   3.6  36.1  34.0   40.8  11.5  113.3  107.9  
4  ...  5.9  1.5  0.3   3.1  33.5  34.9   39.4  10.9  120.6  111.3  

[5 rows x 30 columns]


In [32]:
# Clean column names
dataset_two_player_stats.columns = (dataset_two_player_stats.columns.str.lower().str.replace(" ", "_").str.replace("%", "pct"))
print(dataset_two_player_stats.columns)

Index(['rank', 'name', 'team', 'cur', 'pos', 'age', 'gp', 'mpg', 'usgpct',
       'topct', 'fta', 'ftpct', '2pa', '2ppct', '3pa', '3ppct', 'efgpct',
       'tspct', 'ppg', 'rpg', 'apg', 'spg', 'bpg', 'topg', 'p+r', 'p+a',
       'p+r+a', 'vi', 'ortg', 'drtg'],
      dtype='object')


In [33]:
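# Check if data is tidy (drop completely empty columns and remove repeated header rows if any)
dataset_two_player_stats = dataset_two_player_stats.dropna(axis=1, how="all")
# Column names are lowercase by this point, so compare case-insensitively to catch a repeated "NAME" header row
first_col = dataset_two_player_stats.columns[0]
is_repeated_header = dataset_two_player_stats[first_col].astype(str).str.strip().str.lower() == first_col
dataset_two_player_stats = dataset_two_player_stats[~is_repeated_header]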

# Show dataset size after cleaning
print(dataset_two_player_stats.shape)

(588, 29)


In [34]:
# Check missing counts per column and missing percent per column
print(dataset_two_player_stats.isna().sum())
print(dataset_two_player_stats.isna().mean().round(4))

name       0
team       0
cur       56
pos        0
age        0
gp         0
mpg        0
usgpct     0
topct      0
fta        0
ftpct      0
2pa        0
2ppct      0
3pa        0
3ppct      0
efgpct     0
tspct      0
ppg        0
rpg        0
apg        0
spg        0
bpg        0
topg       0
p+r        0
p+a        0
p+r+a      0
vi         0
ortg       0
drtg       0
dtype: int64
name      0.0000
team      0.0000
cur       0.0952
pos       0.0000
age       0.0000
gp        0.0000
mpg       0.0000
usgpct    0.0000
topct     0.0000
fta       0.0000
ftpct     0.0000
2pa       0.0000
2ppct     0.0000
3pa       0.0000
3ppct     0.0000
efgpct    0.0000
tspct     0.0000
ppg       0.0000
rpg       0.0000
apg       0.0000
spg       0.0000
bpg       0.0000
topg      0.0000
p+r       0.0000
p+a       0.0000
p+r+a     0.0000
vi        0.0000
ortg      0.0000
drtg      0.0000
dtype: float64
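The only column with missing values here is `cur`, so a natural follow-up is whether its missingness looks systematic. One simple check is to compare another variable, such as games played, between rows where `cur` is missing and rows where it is present. The sketch below uses invented toy rows and a hypothetical helper `missingness_contrast`; the meaning of `cur` itself should be confirmed against the source's column glossary before interpreting any difference.

```python
import pandas as pd

def missingness_contrast(df, col, by):
    """Summarize `by` separately for rows where `col` is missing vs present."""
    status = df[col].isna().map({True: "missing", False: "present"}).rename(f"{col}_status")
    return df.groupby(status)[by].agg(["count", "mean", "median"])

# Toy rows for illustration only (not taken from the real file)
toy = pd.DataFrame({
    "cur": ["*", None, "*", None, "*", "*"],
    "gp": [49, 3, 42, 5, 51, 46],
})

summary = missingness_contrast(toy, "cur", "gp")
print(summary)
```

A large gap between the two groups would suggest the values are not missing at random, which matters for whether rows with missing `cur` can be kept without adjustment.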


In [35]:
# Convert key columns to numeric values and check their types after
numeric_cols = ["mpg", "ppg"]

for col in numeric_cols:
    if col in dataset_two_player_stats.columns:
        dataset_two_player_stats[col] = pd.to_numeric(dataset_two_player_stats[col], errors="coerce")

print(dataset_two_player_stats[numeric_cols].dtypes)

mpg    float64
ppg    float64
dtype: object


In [36]:
# Compute points_per_36
if "ppg" in dataset_two_player_stats.columns and "mpg" in dataset_two_player_stats.columns:
    dataset_two_player_stats["points_per_36"] = (dataset_two_player_stats["ppg"] / dataset_two_player_stats["mpg"]) * 36

    print(dataset_two_player_stats["points_per_36"].describe())

    print(dataset_two_player_stats.sort_values("points_per_36", ascending=False)[["points_per_36", "ppg", "mpg"]].head(15))

count    588.000000
mean      15.234179
std        6.533287
min        0.000000
25%       11.215812
50%       14.683282
75%       18.523216
max       55.384615
Name: points_per_36, dtype: float64
     points_per_36   ppg   mpg
528      55.384615   2.0   1.3
471      38.571429   3.0   2.8
44       35.108911  19.7  20.2
7        34.520548  28.0  29.2
1        34.378378  31.8  33.3
26       33.311203  22.3  24.1
0        33.261972  32.8  35.5
376      32.727273   5.0   5.5
9        31.284345  27.2  31.3
4        30.979228  29.0  33.7
3        30.842105  29.3  34.2
84       30.830769  16.7  19.5
8        30.621951  27.9  32.8
12       30.594249  26.6  31.3
21       30.289655  24.4  29.0


In [40]:
# Outlier checks
print(dataset_two_player_stats[dataset_two_player_stats["mpg"] > 48])

print(dataset_two_player_stats[dataset_two_player_stats["points_per_36"] > 40])

Empty DataFrame
Columns: [name, team, cur, pos, age, gp, mpg, usgpct, topct, fta, ftpct, 2pa, 2ppct, 3pa, 3ppct, efgpct, tspct, ppg, rpg, apg, spg, bpg, topg, p+r, p+a, p+r+a, vi, ortg, drtg, points_per_36]
Index: []

[0 rows x 30 columns]
            name team cur pos   age  gp  mpg  usgpct  topct  fta  ...  spg  \
528  Isaac Jones  Det   *   F  25.6   1  1.3    63.6    0.0    0  ...  0.0   

     bpg  topg  p+r  p+a  p+r+a   vi  ortg  drtg  points_per_36  
528  0.0   0.0  2.0  2.0    2.0  0.0   0.0   0.0      55.384615  

[1 rows x 30 columns]


In [41]:
# Remove rows with points_per_36 above 40 (here, one player with a single 1.3-minute appearance)
dataset_two_player_stats = dataset_two_player_stats[dataset_two_player_stats["points_per_36"] <= 40]

print(dataset_two_player_stats.shape)

(587, 30)
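Filtering on the outcome itself (`points_per_36 <= 40`) removes the one extreme row but leaves other very low-minute players in place. An alternative worth considering is to filter on playing time instead, so the rule does not depend on the value being studied. The sketch below is one possible version; the thresholds `MIN_GP` and `MIN_TOTAL_MIN` are hypothetical and would need to be chosen and justified by the team. The three rows are taken from the outputs above.

```python
import pandas as pd

MIN_GP = 10          # hypothetical minimum games played
MIN_TOTAL_MIN = 100  # hypothetical minimum total minutes (gp * mpg)

def apply_minutes_filter(df, min_gp=MIN_GP, min_total_min=MIN_TOTAL_MIN):
    """Keep players who meet both the games-played and total-minutes thresholds."""
    total_min = df["gp"] * df["mpg"]
    keep = (df["gp"] >= min_gp) & (total_min >= min_total_min)
    return df[keep].copy()

# Three rows from the Dataset 2 outputs shown above
rows = pd.DataFrame({
    "name": ["Luka Doncic", "Shai Gilgeous-Alexander", "Isaac Jones"],
    "gp": [42, 49, 1],
    "mpg": [35.5, 33.3, 1.3],
    "ppg": [32.8, 31.8, 2.0],
})
rows["points_per_36"] = rows["ppg"] / rows["mpg"] * 36

filtered = apply_minutes_filter(rows)
print(filtered["name"].tolist())
```

With these thresholds the single-game row is excluded because of its playing time, not because of its scoring rate.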


In [42]:
# Save cleaned dataset
output_path = "data/02-processed/NBAstuffer_2025_26_processed.csv"

In [43]:
# Save directly
dataset_two_player_stats.to_csv(output_path, index=False)
print("Saved cleaned dataset to:", output_path)

Saved cleaned dataset to: data/02-processed/NBAstuffer_2025_26_processed.csv


In [44]:
# Reload to confirm
df_check = pd.read_csv(output_path)
print("Reloaded shape:", df_check.shape)

Reloaded shape: (587, 30)


## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
       
> The data collected are publicly available athlete performance statistics, with no direct interaction with human subjects.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Points per 36 minutes was chosen to represent a substantial amount of playing time (approximately three quarters of a game), but may still inflate scoring rates for players with limited minutes or specific roles. We can begin to mitigate such bias by acknowledging the limitations of points per 36 minutes and interpreting results cautiously rather than as definitive measures of scoring ability.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> We can limit PII exposure by using only publicly available player statistics and collecting no personal information beyond what is necessary for our analysis.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We are not collecting protected attributes (race/gender), so downstream bias testing by protected group is not possible with our data. We will avoid claims about such groups.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> The data are public and not sensitive. We will not store passwords, keys, or any private information in the repo.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> The data collected is publicly available and non-sensitive. However, individual records could be removed from future analyses upon request.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> The data are publicly available and non-sensitive, so they may be retained for reproducibility and future reference.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> We were mindful of potential blindspots in a statistical approach. We confirmed our assumptions using basic basketball context, such as player roles and how scoring opportunities may vary by position and team system.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> The dataset may reflect bias due to imbalanced height distributions across positions and survivorship bias, as only players who reached the NBA are included. We can mitigate potential bias by framing height as one factor among many and avoiding causal claims about its effect on scoring.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> We will avoid misleading graphs and avoid claiming height causes scoring. We will show the full spread of the data and point out outliers.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> We will avoid displaying personal identifiers and instead focus on aggregate statistical relationships rather than individual players.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> The analysis is documented in a version-controlled Jupyter notebook, making the steps reproducible and allowing issues to be identified and corrected if needed.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> Height may act as a proxy for player position or role, which could lead to oversimplified interpretations of scoring ability. We will interpret results carefully and avoid oversimplified claims about height.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> Points per 36 minutes was selected to standardize scoring across players and reflect meaningful playing time, though it assumes linear scaling and may not capture all in-game dynamics.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> The methods used in this analysis are straightforward and interpretable, allowing results to be explained clearly without requiring complex model explanations.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> We made an effort to clearly explain the limitations of the analysis, including potential sources of bias and the fact that results do not necessarily imply causation.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> Since the analysis is not deployed, ongoing monitoring is not applicable. However, future work could reassess results as new data is available.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> In the unlikely event of harm or misuse, we would review the analysis and clarify or correct the findings as needed.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> Since the analysis is not deployed, rollback is not applicable. Results could also be updated or removed if necessary.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> Results could be misinterpreted to suggest that height alone determines scoring ability, so findings are presented as correlational and exploratory.

## Team Expectations 

Justin Bourdlaies, Zee Avila, Lance Mendoza, Jefferson Umanzor Urrutia, Majd Abu-Shamiyeh

1. Check the group chat at least once a day and respond
2. Do your assigned share of work
3. If something comes up, discuss with the group and work can be redistributed accordingly (e.g. one person who misses work one week can help do more research the next week)
4. If there are conflicting plans/ideas for parts of the project, compromise and integrate as much of both as we can

## Project Timeline Proposal

W7: Data Checkpoint 01 due on 18 February
- Export season data and choose a clear cutoff date
- Clean data and compute points per 36
- Save processed dataset for reuse and push notebook

W9: EDA Checkpoint 02 due on 6 March
- Load processed data from data/02-processed
- Create key EDA visuals and document patterns and outliers
- Decide final analysis approach and push notebook

W10: Final Project + Video 03 due on 18 March
- Run final statistical analysis with controls
- Finish figures and write discussion, limitations, and conclusion
- Record video summary and push final notebook