# COGS 108 - Data Checkpoint

## Authors

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background Research, Conceptualization, Data Curation, Experimental Investigation, Methodology, Project Administration, Software, Visualization, Writing – Original Draft, Writing – Review & Editing

- Justin Bourdlaies: Background Research, Experimental Investigation
- Zee Avila: Project Administration, Experimental Investigation
- Lance Mendoza: Conceptualization, Visualization, Methodology
- Jefferson Umanzor Urrutia: Data Curation, Software, Writing – Review & Editing
- Majd Abu-Shamiyeh: Writing – Original Draft, Writing – Review & Editing

## Research Question

To what extent does an NBA player’s height (in inches) predict points scored per 36 minutes during the 2025-2026 NBA regular season? After controlling for position and other key performance metrics such as usage rate and field goal attempts, how does the contribution of height, measured by its partial R² within a multiple regression model, vary across player positions and over time?

Additionally, how do scoring patterns, including shot attempts and efficiency, differ across height groups, and has the relationship between height and scoring efficiency changed across recent NBA seasons?
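To make the partial R² in the first question concrete, here is a minimal sketch on synthetic data. It uses one common definition, the share of the reduced model's residual sum of squares that is removed when height is added. The variable names, coefficients, and simulated values are assumptions for illustration only and are not results from our data.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an OLS fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def partial_r2(X_reduced, x_extra, y):
    """Partial R^2 of x_extra: (SSE_reduced - SSE_full) / SSE_reduced."""
    sse_reduced = sse(X_reduced, y)
    sse_full = sse(np.column_stack([X_reduced, x_extra]), y)
    return (sse_reduced - sse_full) / sse_reduced

# Synthetic players (invented numbers, not real NBA data)
rng = np.random.default_rng(0)
n = 200
usage = rng.normal(0.20, 0.05, n)    # stand-in for usage rate
height = rng.normal(78.5, 3.3, n)    # stand-in for height in inches
pts36 = 60 * usage - 0.1 * (height - 78.5) + rng.normal(0, 2, n)

pr2 = partial_r2(usage.reshape(-1, 1), height, pts36)
print(f"partial R^2 of height given usage: {pr2:.4f}")
```

Fitting the same comparison separately within each position group would give one partial R² per position, which is one way the "vary across player positions" part of the question could be examined.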

## Background and Prior Work

Player physical attributes, particularly height, have long played a central role in how basketball players are evaluated and used at the professional level. In the NBA, height strongly influences positional assignment and on-court responsibilities. Taller players are more likely to occupy interior positions such as center or power forward, where responsibilities emphasize rebounding, rim protection, and screening rather than high-volume scoring. Shorter players, especially guards, are typically more involved in ball handling and shot creation. Because of this specialization, height may be indirectly related to scoring output through role differences rather than scoring ability alone. This interpretation is supported by a clustering study in which players in the top height/weight category with low experience were mostly characterized by "two-point field goals", "offensive and defensive rebounds", "blocks", and "fouls".<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Prior basketball analytics research has shown that scoring output varies substantially by position, which is closely correlated with height. Analyses of NBA data indicate that guards and wings tend to score more points per minute than forwards and centers due to higher usage rates and greater involvement in offensive actions. While the modern NBA has become more positionless, height still affects how players are used offensively, with taller players generally contributing less to scoring volume and more to non-scoring tasks. As stated in the Southwest Journal, "height remains a factor, but not the only one dictating a player's role".<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

Academic research has also examined the relationship between player anthropometrics and performance statistics. One study examined NBA player height and weight in relation to box score metrics and found that height was strongly associated with rebounding and shot blocking, but had a weaker and often negative relationship with scoring when controlling for playing time.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)

Modern basketball analytics frequently normalize scoring by playing time using metrics such as points per 36 minutes to allow fair comparisons across players with different minute allocations. NBA statistical documentation recommends per-minute or per-possession metrics when evaluating player production.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) Community-driven analytics projects using publicly available NBA data have applied regression models to examine how physical attributes relate to scoring and often find that variables such as usage rate and offensive role explain much more variance than height alone. These projects, however, rarely isolate the contribution of height itself, which is the gap we aim to address.

This project builds on prior work by focusing specifically on the relationship between player height and points scored per 36 minutes during a single modern NBA season. By treating height as a continuous variable and measuring both statistical significance and variance explained, this analysis aims to determine whether height has a meaningful independent effect on scoring rate or whether its impact is small relative to other factors.

References

1. <a name="cite_note-1"></a> [^](#cite_ref-1)
Zhang, S., Lorenzo, A., Gómez, M., Mateus, N., Gonçalves, B., Sampaio, J. (20 Apr 2018) Clustering performances in the NBA according to players' anthropometric attributes and playing experience. *PubMed*. https://pubmed.ncbi.nlm.nih.gov/29676222/

2. <a name="cite_note-2"></a> [^](#cite_ref-2)
Ilic, S. (12 Feb 2024) Average NBA Height By Position 2024: How They Measure Up? *Southwest Journal*. https://www.southwestjournal.com/sport/nba/average-nba-height-by-position/

3. <a name="cite_note-3"></a> [^](#cite_ref-3)
Yixiong, C., Liu, F., Bao, D., Liu, H., Zhang, S., Gómez, M. (21 Oct 2019) Key Anthropometric and Physical Determinants for Different Playing Positions During National Basketball Association Draft Combine Test. *Frontiers*. https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.02359/full

4. <a name="cite_note-4"></a> [^](#cite_ref-4)
Wikipedia (19 Jul 2022) Player efficiency rating: Revision history. *Wikipedia*. https://en.wikipedia.org/wiki/Player_efficiency_rating

## Hypothesis


We predict that there will be a statistically significant relationship between an NBA player's height and their points scored per 36 minutes, with taller players scoring slightly fewer points per 36 minutes on average. Taller players tend to fill roles centered on defending and rebounding, which can limit their scoring opportunities. We also predict that height will explain only a small share of the variance in scoring rate.
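The two parts of this hypothesis (a significant relationship, but little variance explained) are compatible when the sample is large. The sketch below shows the arithmetic using the standard t statistic for a Pearson correlation. Both `r` and `n` are hypothetical values picked for illustration; they are not estimates from our data.

```python
import math

def t_stat_from_r(r, n):
    """t statistic for testing zero correlation, given Pearson r and sample size n."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = -0.15   # hypothetical correlation between height and points per 36
n = 500     # hypothetical number of players

r_squared = r ** 2          # share of variance explained by height alone
t = t_stat_from_r(r, n)     # compared against a t distribution with n - 2 df
print(f"R^2 = {r_squared:.4f}, t = {t:.2f}")
```

With these assumed inputs the correlation explains roughly 2% of the variance, yet the t statistic is well beyond the usual two-sided 5% cutoff of about 1.96, so a relationship can be statistically significant while still being small.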

## Data

### Data overview

We use two datasets from the 2025–2026 NBA regular season. Dataset 1 provides player height and identifiers. Dataset 2 provides minutes and scoring statistics needed to compute points per 36 minutes. We merge them so each player’s scoring rate can be analyzed against height.

- Dataset #1
  - Dataset Name: NBA.com LeagueDashPlayerBioStats 2025–26 Regular Season
  - Link to the dataset: https://www.nba.com/stats/players/bio
  - Raw file: `data/00-raw/nba_com_players_bio_2025_26.json`
  - Number of observations: 532 rows, one row per player
  - Number of variables: 23 columns
  - Variables most relevant to this project:
    - `PLAYER_ID`, `PLAYER_NAME` (identifiers)
    - `PLAYER_HEIGHT_INCHES` (height in inches)
    - `TEAM_ABBREVIATION`, `AGE`, `USG_PCT`, `TS_PCT` (context)
  - Shortcomings:
    - No minutes played, so we cannot compute points per 36 minutes from this dataset alone

- Dataset #2
  - Dataset Name: NBAstuffer NBA Player Stats 2025–26 Regular Season
  - Link to the dataset: https://www.nbastuffer.com/2025-2026-nba-player-stats/
  - Raw file: `data/00-raw/2025-2026 NBA Player Stats - NBAstuffer.csv`
  - Number of observations: 588 rows, generally one row per player (56 rows repeat a player name)
  - Number of variables: 30 columns
  - Variables most relevant to this project:
    - player name, position, minutes per game (`MPG`), points per game (`PPG`), and efficiency/usage metrics
  - Shortcomings:
    - Third-party aggregation and player naming may not perfectly match NBA.com
    - Height is not consistently available, so we attach height using Dataset 1

How we combine datasets
- We compute `points_per_36 = PPG / MPG * 36` in Dataset 2 and then merge Dataset 2 with Dataset 1 using a cleaned player-name key. The final merged dataset is saved to `data/02-processed/nba_2025_26_merged.csv`.
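The merge described above can be sketched on toy data as follows. The player keys, teams, and values are invented, and the `validate` and `indicator` arguments are an optional way to audit the join rather than something our notebook's merge cell uses.

```python
import pandas as pd

# Toy stand-in for Dataset 2 (stats); "player a" appears twice, "player c" has no bio row.
stats = pd.DataFrame({
    "name_key": ["player a", "player a", "player b", "player c"],
    "team": ["AAA", "BBB", "CCC", "DDD"],
    "points_per_36": [20.0, 18.0, 15.0, 12.0],
})

# Toy stand-in for Dataset 1 (bio); one row per player.
bio = pd.DataFrame({
    "name_key": ["player a", "player b"],
    "player_height_inches": [78, 81],
})

merged = stats.merge(
    bio,
    on="name_key",
    how="left",               # keep every stats row, matched or not
    validate="many_to_one",   # raise if a key repeats on the bio side
    indicator=True,           # adds a _merge column: "both" or "left_only"
)

match_rate = merged["_merge"].eq("both").mean()
unmatched = merged.loc[merged["_merge"] == "left_only", "name_key"].unique().tolist()
print("match rate:", match_rate)
print("unmatched keys:", unmatched)
```

A left join keeps the row count of the stats table, so any player without a height shows up as a missing value instead of silently disappearing.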


In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Raw data files are stored in data/00-raw/ and treated as immutable.
# Dataset 1: data/00-raw/nba_com_players_bio_2025_26.json
# Dataset 2: data/00-raw/2025-2026 NBA Player Stats - NBAstuffer.csv
# This notebook loads raw data, cleans it, and writes processed outputs to data/02-processed/.

### NBA.com Player Bio Stats (2025–26)

This dataset comes from the NBA.com player bio statistics table for the 2025–26 regular season. Each row represents one player and includes identifiers and basic profile information along with some per-game context stats.

The key variable for our project is height. Height is available as `PLAYER_HEIGHT_INCHES`, which is already in inches and can be used directly. We also use identifiers (`PLAYER_ID`, `PLAYER_NAME`) and context variables such as `TEAM_ABBREVIATION`, `AGE`, `USG_PCT`, and `TS_PCT`.
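Because the dataset also carries a feet-inches string (`player_height`), one cheap sanity check is to convert that string and compare it with `PLAYER_HEIGHT_INCHES`. The helper below is a hypothetical sketch, not part of our pipeline; the sample pairs are taken from the first five rows of the dataset preview in this notebook.

```python
def height_str_to_inches(height):
    """Convert a 'feet-inches' string such as '6-6' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)

# (player_height, player_height_inches) pairs from the preview rows
samples = [("6-6", 78), ("6-4", 76), ("6-5", 77), ("6-8", 80), ("6-0", 72)]

# Collect any pair where the two height columns disagree
mismatches = [(h, inches) for h, inches in samples if height_str_to_inches(h) != inches]
print("mismatches:", mismatches)
```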

A key limitation is that this dataset does not include total minutes played, so we cannot compute points per 36 minutes from it alone. We address this by using Dataset 2 (NBAstuffer) to get MPG/PPG, compute points per 36, and then merge height onto those stats.

In [3]:
import json
import pandas as pd
import numpy as np
from pathlib import Path

RAW_DIR = Path("data/00-raw")
PROC_DIR = Path("data/02-processed")
PROC_DIR.mkdir(parents=True, exist_ok=True)

def nba_resultset_to_df(json_path, resultset_index=0):
    with open(json_path, "r") as f:
        obj = json.load(f)
    rs = obj["resultSets"][resultset_index]
    df = pd.DataFrame(rs["rowSet"], columns=rs["headers"])
    df.columns = [c.strip().lower() for c in df.columns]
    return df, obj

bio_path = RAW_DIR / "nba_com_players_bio_2025_26.json"
bio, _ = nba_resultset_to_df(bio_path)

print("Dataset 1 shape:", bio.shape)
display(bio.head())

# Tidy checks
print("player_id unique:", bio["player_id"].is_unique)
print("missing player_name:", bio["player_name"].isna().sum())

# Convert key numeric columns
num_cols = ["player_height_inches", "age", "gp", "pts", "reb", "ast", "usg_pct", "ts_pct", "ast_pct", "net_rating"]
for c in num_cols:
    if c in bio.columns:
        bio[c] = pd.to_numeric(bio[c], errors="coerce")

# Missingness (counts and fractions)
missing_counts = bio.isna().sum().sort_values(ascending=False)
missing_fracs = bio.isna().mean().sort_values(ascending=False)
print("Top missing counts:")
display(missing_counts.head(10))
print("Top missing fractions:")
display(missing_fracs.head(10))

# Outliers / suspicious entries
display(bio["player_height_inches"].describe())
sus_height = bio[(bio["player_height_inches"] < 65) | (bio["player_height_inches"] > 90)]
if len(sus_height) > 0:
    display(sus_height[["player_name", "player_height", "player_height_inches"]].head(30))

# Save processed Dataset 1
bio_out = PROC_DIR / "nba_com_bio_2025_26_processed.csv"
bio.to_csv(bio_out, index=False)
print("Wrote:", bio_out)

# Reload proof
bio_reload = pd.read_csv(bio_out)
print("Reloaded shape:", bio_reload.shape)


Dataset 1 shape: (532, 23)


| | player_id | player_name | team_id | team_abbreviation | age | player_height | player_height_inches | player_weight | college | country | ... | gp | pts | reb | ast | net_rating | oreb_pct | dreb_pct | usg_pct | ts_pct | ast_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1630639 | A.J. Lawson | 1610612761 | TOR | 25.0 | 6-6 | 78 | 179 | South Carolina | Canada | ... | 13 | 3.8 | 1.8 | 0.2 | -14.4 | 0.056 | 0.16 | 0.186 | 0.545 | 0.045 |
| 1 | 1631260 | AJ Green | 1610612749 | MIL | 26.0 | 6-4 | 76 | 190 | Northern Iowa | USA | ... | 49 | 10.7 | 2.6 | 2.0 | -0.2 | 0.008 | 0.076 | 0.128 | 0.643 | 0.088 |
| 2 | 1642358 | AJ Johnson | 1610612742 | DAL | 21.0 | 6-5 | 77 | 160 | | USA | ... | 28 | 2.5 | 1.1 | 0.8 | -3.8 | 0.032 | 0.104 | 0.197 | 0.375 | 0.147 |
| 3 | 203932 | Aaron Gordon | 1610612743 | DEN | 30.0 | 6-8 | 80 | 235 | Arizona | USA | ... | 23 | 17.7 | 6.2 | 2.5 | 14.0 | 0.047 | 0.162 | 0.228 | 0.632 | 0.13 |
| 4 | 1628988 | Aaron Holiday | 1610612745 | HOU | 29.0 | 6-0 | 72 | 185 | UCLA | USA | ... | 35 | 5.7 | 0.9 | 1.0 | 3.8 | 0.012 | 0.049 | 0.173 | 0.57 | 0.103 |


player_id unique: True
missing player_name: 0
Top missing counts:


player_weight    6
draft_number     1
player_id        0
ts_pct           0
usg_pct          0
dreb_pct         0
oreb_pct         0
net_rating       0
ast              0
reb              0
dtype: int64

Top missing fractions:


player_weight    0.011278
draft_number     0.001880
player_id        0.000000
ts_pct           0.000000
usg_pct          0.000000
dreb_pct         0.000000
oreb_pct         0.000000
net_rating       0.000000
ast              0.000000
reb              0.000000
dtype: float64

count    532.000000
mean      78.588346
std        3.292647
min       67.000000
25%       76.000000
50%       79.000000
75%       81.000000
max       88.000000
Name: player_height_inches, dtype: float64

Wrote: data/02-processed/nba_com_bio_2025_26_processed.csv
Reloaded shape: (532, 23)



### NBAstuffer Player Stats (2025–26)

This dataset is a downloadable CSV table from NBAstuffer for the 2025–26 regular season. Each row represents one player entry and includes per-game scoring and playing time metrics; 56 rows repeat a player name, so some players appear more than once.

We use `MPG` (minutes per game) and `PPG` (points per game) to compute `points_per_36 = PPG / MPG * 36`. This provides the scoring-rate outcome variable needed for our research question.
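As a worked example of the formula, the sketch below applies it to two (PPG, MPG) pairs that appear in this notebook's outlier check. The function name and the guard against non-positive minutes are our own illustration.

```python
def points_per_36(ppg, mpg):
    """Scale points per game to a 36-minute rate."""
    if mpg <= 0:
        raise ValueError("mpg must be positive")
    return ppg / mpg * 36

# A player averaging 19.7 points in 20.2 minutes per game
regular_minutes = points_per_36(19.7, 20.2)

# A player averaging 2.0 points in 1.3 minutes per game
tiny_minutes = points_per_36(2.0, 1.3)

print(round(regular_minutes, 2), round(tiny_minutes, 2))
```

The second case shows why very low minutes can inflate the rate: a single basket in a little over a minute of play extrapolates to more than 55 points per 36 minutes.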

A main concern is that NBAstuffer is a third-party source and player naming may not perfectly match NBA.com. We address this by creating a standardized player-name key and reporting the merge match rate and unmatched players.

In [7]:
import pandas as pd
import numpy as np
import re
from pathlib import Path

RAW_DIR = Path("data/00-raw")
PROC_DIR = Path("data/02-processed")
PROC_DIR.mkdir(parents=True, exist_ok=True)

path2 = RAW_DIR / "2025-2026 NBA Player Stats - NBAstuffer.csv"
d2 = pd.read_csv(path2)

print("Dataset 2 shape:", d2.shape)
display(d2.head())

# Clean column names
d2.columns = (d2.columns.astype(str)
              .str.strip()
              .str.lower()
              .str.replace(" ", "_", regex=False)
              .str.replace("%", "pct", regex=False))

# Count rows that repeat a player name (checks the first name-like column found)
name_candidates = [c for c in ["name", "player", "player_name"] if c in d2.columns]
if name_candidates:
    print("Duplicate player-name rows:", d2.duplicated(subset=[name_candidates[0]]).sum())
else:
    print("Duplicate player-name rows:", "no name col found")

# Missingness
print("Top missing counts:")
display(d2.isna().sum().sort_values(ascending=False).head(15))
print("Top missing fractions:")
display(d2.isna().mean().sort_values(ascending=False).head(15))

# Convert key columns
for col in ["mpg", "ppg"]:
    if col in d2.columns:
        d2[col] = pd.to_numeric(d2[col], errors="coerce")

# Compute points_per_36 safely
d2 = d2[d2["mpg"].notna() & (d2["mpg"] > 0)].copy()
d2["points_per_36"] = d2["ppg"] / d2["mpg"] * 36

# Outlier checks
display(d2["mpg"].describe())
display(d2["points_per_36"].describe())
display(d2.sort_values("points_per_36", ascending=False)[["points_per_36", "ppg", "mpg"]].head(15))

# Save processed Dataset 2
d2_out = PROC_DIR / "nbastuffer_2025_26_processed.csv"
d2.to_csv(d2_out, index=False)
print("Wrote:", d2_out)

# Reload proof
d2_reload = pd.read_csv(d2_out)
print("Reloaded shape:", d2_reload.shape)


Dataset 2 shape: (588, 30)


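| | RANK | NAME | TEAM | CUR | POS | AGE | GP | MpG | USG% | TO% | ... | ApG | SpG | BpG | TOpG | P+R | P+A | P+R+A | VI | ORtg | DRtg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | Luka Doncic | Lal | * | F-G | 27.0 | 42 | 35.5 | 37.9 | 16.2 | ... | 8.5 | 1.5 | 0.5 | 4.3 | 40.7 | 41.4 | 49.2 | 14.6 | 119.9 | 110.7 |
| 1 | | Shai Gilgeous-Alexander | Okc | * | G | 27.6 | 49 | 33.3 | 33.5 | 9.6 | ... | 6.4 | 1.3 | 0.8 | 2.1 | 36.2 | 38.2 | 42.7 | 11.6 | 134.5 | 106.4 |
| 2 | | Anthony Edwards | Min | * | G | 24.5 | 46 | 35.5 | 31.2 | 11.6 | ... | 3.7 | 1.3 | 0.8 | 2.7 | 34.5 | 33.0 | 38.2 | 9.3 | 119.4 | 111.9 |
| 3 | | Jaylen Brown | Bos | * | G-F | 29.3 | 49 | 34.2 | 36.9 | 13.7 | ... | 4.7 | 1.0 | 0.4 | 3.6 | 36.1 | 34.0 | 40.8 | 11.5 | 113.3 | 107.9 |
| 4 | | Donovan Mitchell | Cle | * | G | 29.4 | 51 | 33.7 | 32.6 | 13.0 | ... | 5.9 | 1.5 | 0.3 | 3.1 | 33.5 | 34.9 | 39.4 | 10.9 | 120.6 | 111.3 |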


Duplicate player-name rows: 56
Top missing counts:


rank      588
cur        56
efgpct      0
ortg        0
vi          0
p+r+a       0
p+a         0
p+r         0
topg        0
bpg         0
spg         0
apg         0
rpg         0
ppg         0
tspct       0
dtype: int64

Top missing fractions:


rank      1.000000
cur       0.095238
efgpct    0.000000
ortg      0.000000
vi        0.000000
p+r+a     0.000000
p+a       0.000000
p+r       0.000000
topg      0.000000
bpg       0.000000
spg       0.000000
apg       0.000000
rpg       0.000000
ppg       0.000000
tspct     0.000000
dtype: float64

count    588.000000
mean      19.122279
std        9.332692
min        1.200000
25%       11.650000
50%       19.300000
75%       26.900000
max       38.600000
Name: mpg, dtype: float64

count    588.000000
mean      15.234179
std        6.533287
min        0.000000
25%       11.215812
50%       14.683282
75%       18.523216
max       55.384615
Name: points_per_36, dtype: float64

| | points_per_36 | ppg | mpg |
|---|---|---|---|
| 528 | 55.384615 | 2.0 | 1.3 |
| 471 | 38.571429 | 3.0 | 2.8 |
| 44 | 35.108911 | 19.7 | 20.2 |
| 7 | 34.520548 | 28.0 | 29.2 |
| 1 | 34.378378 | 31.8 | 33.3 |
| 26 | 33.311203 | 22.3 | 24.1 |
| 0 | 33.261972 | 32.8 | 35.5 |
| 376 | 32.727273 | 5.0 | 5.5 |
| 9 | 31.284345 | 27.2 | 31.3 |
| 4 | 30.979228 | 29.0 | 33.7 |


Wrote: data/02-processed/nbastuffer_2025_26_processed.csv
Reloaded shape: (588, 31)
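The output above reports 56 duplicate player-name rows, which suggests some players have more than one row, plausibly one per team stint (this is our assumption about the source, not something we have confirmed). If so, one possible way to get a single scoring rate per player is to weight each row by minutes played. The sketch below uses invented toy values and hypothetical helper columns (`total_min`, `total_pts`); because per-game averages are rounded, the reconstructed totals would only be approximate.

```python
import pandas as pd

# Toy rows (invented values): "Player A" has two stints, "Player B" has one.
toy = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B"],
    "gp":   [20, 30, 50],
    "mpg":  [10.0, 30.0, 24.0],
    "ppg":  [4.0, 18.0, 12.0],
})

# Rebuild approximate season totals from per-game averages
toy["total_min"] = toy["gp"] * toy["mpg"]
toy["total_pts"] = toy["gp"] * toy["ppg"]

# Sum totals within each player, then compute one rate per player
per_player = toy.groupby("name", as_index=False)[["gp", "total_min", "total_pts"]].sum()
per_player["points_per_36"] = per_player["total_pts"] / per_player["total_min"] * 36
print(per_player)
```

Summing totals before dividing gives stints with more minutes more weight, which differs from simply averaging the per-row `points_per_36` values.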


### Combining datasets

We merge Dataset 2 (which provides points per 36) with Dataset 1 (which provides height) using a standardized player-name key. We report the match rate and inspect unmatched players.

In [6]:
import pandas as pd
import numpy as np
import re
from pathlib import Path

PROC_DIR = Path("data/02-processed")

bio = pd.read_csv(PROC_DIR / "nba_com_bio_2025_26_processed.csv")
d2 = pd.read_csv(PROC_DIR / "nbastuffer_2025_26_processed.csv")

def norm_name(x):
    if pd.isna(x):
        return np.nan
    s = str(x).lower().strip()
    s = re.sub(r"[^a-z\s\-']", "", s)
    s = re.sub(r"\b(jr|sr|ii|iii|iv)\b", "", s).strip()
    s = re.sub(r"\s+", " ", s)
    return s

bio["name_key"] = bio["player_name"].apply(norm_name)

# Find the player-name column in NBAstuffer
name_col = next((c for c in ["name", "player", "player_name"] if c in d2.columns), None)
if name_col is None:
    raise ValueError(f"Could not find player name column in Dataset 2. Columns: {list(d2.columns)}")

d2["name_key"] = d2[name_col].apply(norm_name)

merged = pd.merge(
    d2,
    bio[["name_key", "player_height_inches", "team_abbreviation", "age", "usg_pct", "ts_pct"]],
    on="name_key",
    how="left",
)

print("Merged shape:", merged.shape)
print("Height match rate:", merged["player_height_inches"].notna().mean())

unmatched = merged[merged["player_height_inches"].isna()][[name_col, "name_key"]].drop_duplicates().head(25)
display(unmatched)

merged_out = PROC_DIR / "nba_2025_26_merged.csv"
merged.to_csv(merged_out, index=False)
print("Wrote:", merged_out)

Merged shape: (588, 37)
Height match rate: 0.9591836734693877


| | name | name_key |
|---|---|---|
| 0 | Luka Doncic | luka doncic |
| 6 | Nikola Jokic | nikola jokic |
| 77 | Alexandre Sarr | alexandre sarr |
| 78 | Kristaps Porzingis | kristaps porzingis |
| 82 | Nikola Vucevic | nikola vucevic |
| 147 | Dennis Schroder | dennis schroder |
| 188 | Jusuf Nurkic | jusuf nurkic |
| 191 | Egor Demin | egor demin |
| 217 | Vit Krejci | vit krejci |
| 235 | Jonas Valanciunas | jonas valanciunas |


Wrote: data/02-processed/nba_2025_26_merged.csv
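Most of the unmatched names above are ones that are often written with diacritics (for example Dončić or Jokić). Since `norm_name` deletes every character outside `a-z`, an accented spelling in one source and a plain spelling in the other would produce different keys. Whether NBA.com actually uses accented spellings for these players is an assumption we have not verified here. Under that assumption, a possible variant of the normalizer folds accents to ASCII before cleaning; `norm_name_ascii` is a hypothetical name and, unlike `norm_name`, it only guards against `None` rather than pandas missing values.

```python
import re
import unicodedata

def norm_name_ascii(name):
    """Lowercase a player name, fold accents to ASCII, and drop suffixes."""
    if name is None:
        return None
    # NFKD splits an accented letter into a base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops the combining marks.
    s = unicodedata.normalize("NFKD", str(name))
    s = s.encode("ascii", "ignore").decode("ascii").lower().strip()
    s = re.sub(r"[^a-z\s\-']", "", s)            # same character filter as norm_name
    s = re.sub(r"\b(jr|sr|ii|iii|iv)\b", "", s)  # same suffix removal as norm_name
    return re.sub(r"\s+", " ", s).strip()

for raw in ["Luka Dončić", "Nikola Jokić", "Dennis Schröder", "Jaren Jackson Jr."]:
    print(raw, "->", norm_name_ascii(raw))
```

Names that differ by more than accents between the two sources (possibly "Alexandre Sarr") would not be fixed by this and would need a small manual alias mapping.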


## Ethics 

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
       
> The data are publicly available athlete performance statistics, and we had no direct interaction with human subjects.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Points per 36 minutes was chosen to represent a substantial amount of playing time (approximately three quarters of a game), but may still inflate scoring rates for players with limited minutes or specific roles. We can begin to mitigate such bias by acknowledging the limitations of points per 36 minutes and interpreting results cautiously rather than as definitive measures of scoring ability.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> We can limit PII exposure by using only publicly available player statistics and collecting no personal information beyond what is necessary for our analysis.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We are not collecting protected attributes (race/gender), so downstream bias testing by protected group is not possible with our data. We will avoid claims about such groups.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> The data are public and not sensitive. We will not store passwords, keys, or any private information in the repo.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> The data collected is publicly available and non-sensitive. However, individual records could be removed from future analyses upon request.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> The data are publicly available and non-sensitive, so they may be retained for reproducibility and future reference.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> We were mindful of potential blindspots in a statistical approach. We confirmed our assumptions using basic basketball context, such as player roles and how scoring opportunities may vary by position and team system.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> The dataset may reflect bias due to imbalanced height distributions across positions and survivorship bias, as only players who reached the NBA are included. We can mitigate potential bias by framing height as one factor among many and avoiding causal claims about its effect on scoring.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> We will avoid misleading graphs and avoid claiming height causes scoring. We will show the full spread of the data and point out outliers.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> We will avoid displaying personal identifiers and instead focus on aggregate statistical relationships rather than individual players.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> The analysis is documented in a version-controlled Jupyter notebook, making the steps reproducible and allowing issues to be identified and corrected if needed.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> Height may act as a proxy for player position or role, which could lead to oversimplified interpretations of scoring ability. We will interpret results carefully and avoid oversimplified claims about height.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> Points per 36 minutes was selected to standardize scoring across players and reflect meaningful playing time, though it assumes linear scaling and may not capture all in-game dynamics.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> The methods used in this analysis are straightforward and interpretable, allowing results to be explained clearly without requiring complex model explanations.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> We made an effort to clearly explain the limitations of the analysis, including potential sources of bias and the fact that results do not necessarily imply causation.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> Since the analysis is not deployed, ongoing monitoring is not applicable. However, future work could reassess results as new data is available.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> In the unlikely event of harm or misuse, we would review the analysis and clarify or correct the findings as needed.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> Since the analysis is not deployed, rollback is not applicable. Results could also be updated or removed if necessary.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> Results could be misinterpreted to suggest that height alone determines scoring ability, so findings are presented as correlational and exploratory.

## Team Expectations 

Justin Bourdlaies, Zee Avila, Lance Mendoza, Jefferson Umanzor Urrutia, Majd Abu-Shamiyeh

1. Check the group chat at least once a day and respond
2. Do your assigned share of work
3. If something comes up, discuss with the group and work can be redistributed accordingly (e.g. one person who misses work one week can help do more research the next week)
4. If there are conflicting plans/ideas for parts of the project, compromise and integrate as much of both as we can

## Project Timeline Proposal

W7: Data Checkpoint 01 due on 18 February
- Export season data and choose a clear cutoff date
- Clean data and compute points per 36
- Save processed dataset for reuse and push notebook

W9: EDA Checkpoint 02 due on 6 March
- Load processed data from data/02-processed
- Create key EDA visuals and document patterns and outliers
- Decide final analysis approach and push notebook

W10: Final Project + Video 03 due on 18 March
- Run final statistical analysis with controls
- Finish figures and write discussion, limitations, and conclusion
- Record video summary and push final notebook