# COGS 108 - Data Checkpoint

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Alice Anderson: Conceptualization, Data curation, Methodology, Writing - original draft
- Bob Barker:  Analysis, Software, Visualization
- Charlie Chang: Project administration, Software, Writing - review & editing
- Dani Delgado: Analysis, Background research, Visualization, Writing - original draft

Ja-Chan Lu: Updated research question, updated data after commentary on data section from project proposal.

Tianlin Situ:

Julian Luan:

Tony Zhang:

## Research Question

Updates to Research Question:

**Defining "Eras":**

***Era 1 (1990-2000 seasons):*** A big-man-dominated era in which most points were scored close to the basket rather than from three-point range (much farther from the basket).

***Era 2 (2001-2013 seasons):*** The league begins to move away from traditional post-ups and big men after the removal of the "illegal defense" rule, which had required defenders to stay attached to the player they were guarding at all times. With that rule abolished, teams were allowed to play zone defense, in which defenders occupy areas of the court and deny them to the other team instead of sticking solely to one player. In a zone, a team could in theory suffocate the paint with defenders and keep anyone from getting to the basket, taking away points scored close to the rim. To counter this, offenses began to space their players around the court, using the limited room available on the perimeter to score from farther away from the basket.

***Era 3 (2014-Present seasons):*** Teams analyzed shot attempts across their seasons and came to the realization that mid-range shots are not as efficient as 3-pointers. By this reasoning, the chance of missing a mid-range shot and a 3-pointer is about the same, but a 3-pointer is worth more points, so teams would much rather shoot 3's going forward. Teams now push all of their players to attempt 3-pointers rather than mid-range jumpers, further increasing the volume of 3-pointers in the league.
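
The efficiency argument for Era 3 can be made concrete with a small expected-value calculation. The sketch below assumes, purely for illustration, that both shot types are made 40% of the time; that percentage and the helper name `expected_points` are hypothetical and do not come from our dataset.

```python
# Expected points per attempt = probability of making the shot * points awarded.
# The 40% make rate is an illustrative assumption, not a measured value.

def expected_points(make_probability: float, shot_value: int) -> float:
    """Return the average number of points produced by one shot attempt."""
    return make_probability * shot_value

mid_range_ev = expected_points(0.40, 2)    # hypothetical mid-range jumper
three_point_ev = expected_points(0.40, 3)  # hypothetical 3-pointer, same make rate

print(f"Mid-range: {mid_range_ev:.2f} points per attempt")
print(f"3-pointer: {three_point_ev:.2f} points per attempt")
```

With equal make rates, the 3-pointer yields 1.5 times the expected points of the mid-range shot, which is the incentive this era definition describes.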

**Editing Research Question:**

***New Research Question:*** How has the correlation between height and 3-point attempts changed across select NBA eras? As a follow-up question: does the increase in 3-point shooting volume apply only to taller positions (power forwards and centers), or does it affect all positions equally?
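
As a rough sketch of how the first part of this question could be computed, the snippet below calculates a Pearson correlation between height and 3-point attempts separately within each era. The rows are invented toy values used only to show the mechanics; the column names follow the ones used in our wrangling code.

```python
import pandas as pd

# Toy player-season rows; values are made up purely to show the mechanics.
toy = pd.DataFrame({
    "ERA": ["1990-2000"] * 4 + ["2014-present"] * 4,
    "HEIGHT_IN": [75, 78, 81, 84, 75, 78, 81, 84],
    "FG3A_PER36": [4.0, 2.5, 1.0, 0.2, 6.0, 5.5, 5.0, 4.0],
})

# Pearson correlation between height and 3PA volume, computed within each era.
corr_by_era = {
    era: group["HEIGHT_IN"].corr(group["FG3A_PER36"])
    for era, group in toy.groupby("ERA")
}

for era, r in corr_by_era.items():
    print(f"{era}: r = {r:.2f}")
```

Comparing the per-era coefficients (and, later, per-position coefficients within each era) is one simple way to describe how the relationship has shifted.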

## Data

### Update to Data Proposal Section

Actual Dataset Description:
Our dataset will come from the swar/nba_api Python package, an open-source client that gives access to the official data APIs on NBA.com. The package is widely used for accessing NBA data, as indicated by its 3,400 GitHub stars.

As mentioned earlier, the data comes directly from NBA.com's database that gives access to player statistics. For this project, we will utilize the playercareerstats endpoint which keeps records of the player information needed, starting from the 1990 NBA season up until the present day.

For our analysis, we will sample 3-4 datasets (seasons) from each era defined above (1990-2000, 2001-2013, and 2014-present day). Under the assumption that each NBA season has around 500 players, we will have roughly 6,000 total observations, where each observation contains a player's id, team_id, height, wingspan, position, points, three_point_attempts, games_played, and minutes_played.

The data in nba_api originates from official NBA box scores and game logs that the league has maintained since the NBA was created. During the 1990-2000 era, statistics were recorded manually by officials, whereas from 2001 to the present day the data has been logged by automated systems. Whether recorded manually or automatically, the data is still reliable, as the NBA continues to verify recorded statistics both during and after games.

Limitations to the data:
1. Records of heights: Throughout all eras, player heights are self-reported and are often rounded up or down rather than being 100% accurate.
2. Position labels: Although every player is given one specific position on paper, roles on the court often change situationally, meaning no player is truly confined to one role throughout the whole season.
3. Participation filter: To ensure accuracy, players with fewer than 41 games played (`GP`) or fewer than 10 minutes per game (`MPG`) in a season will be excluded from the dataset. Their small samples would most likely produce exceptionally high or low statistics that do not truly reflect their abilities as players and would skew the analysis.

### Data overview

Dataset #1: `LeagueDashPlayerStats` (NBA Stats API via `nba_api`)
- Purpose: season-level box score stats for all players
- Key fields used: `PLAYER_ID`, `PLAYER_NAME`, `GP`, `MIN`, `FG3A`, `FGA`
- Coverage in this scaffold: representative seasons across all three eras

Dataset #2: `PlayerIndex` (NBA Stats API via `nba_api`)
- Purpose: player profile metadata for merge keys and covariates
- Key fields used: `PERSON_ID`/`PLAYER_ID`, `HEIGHT`, `POSITION`, `FROM_YEAR`, `TO_YEAR`

Join strategy
- Merge season stats to player metadata using `PLAYER_ID`
- Create derived variables: `HEIGHT_IN`, `MPG`, `FG3A_PER_GAME`, `FG3A_PER36`, `FG3A_RATE`
- Add era labels using season year buckets from our research question
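
One way to sanity-check this join strategy is to let pandas verify the merge shape. The sketch below uses invented toy tables; `validate` and `indicator` are standard `DataFrame.merge` options, but whether to use them in the final pipeline is our assumption rather than something the current code does.

```python
import pandas as pd

# Toy stand-ins for the two tables; values are invented for illustration.
season_stats = pd.DataFrame({
    "PLAYER_ID": [1, 1, 2, 3],
    "SEASON": ["1996-97", "1999-00", "1996-97", "1996-97"],
    "FG3A": [20, 35, 103, 5],
})
player_meta = pd.DataFrame({
    "PLAYER_ID": [1, 2],
    "HEIGHT": ["6-9", "6-5"],
})

# validate="many_to_one" raises if player_meta has duplicate PLAYER_IDs;
# indicator=True adds a _merge column showing which rows found a match.
merged = season_stats.merge(
    player_meta, on="PLAYER_ID", how="left",
    validate="many_to_one", indicator=True,
)

unmatched = merged[merged["_merge"] == "left_only"]
print(f"Rows without metadata: {len(unmatched)} of {len(merged)}")
```

A left join keeps every player-season row, so unmatched players show up as missing metadata instead of silently disappearing.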

Output files produced by code below
- Interim: `data/01-interim/nba_player_season_combined.csv`
- Processed: `data/02-processed/nba_height_3pa_player_season.csv`


In [2]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [3]:
# Setup + reproducible data pull / wrangling scaffold

import os
import time
import warnings
import numpy as np
import pandas as pd

try:
    from nba_api.stats.endpoints import leaguedashplayerstats, playerindex
except ImportError as exc:
    raise ImportError("Missing dependency `nba_api` in the active notebook kernel.") from exc

warnings.filterwarnings("ignore", category=FutureWarning)

SEASONS = [
    "1996-97", "1999-00", "2004-05", "2008-09", "2012-13", "2016-17", "2020-21", "2023-24"
]
REQUEST_TIMEOUT = 120
MAX_RETRIES = 4
BASE_BACKOFF_SECONDS = 2


def season_to_era(season: str) -> str:
    start_year = int(season.split("-")[0])
    if 1990 <= start_year <= 2000:
        return "1990-2000"
    if 2001 <= start_year <= 2013:
        return "2001-2013"
    return "2014-present"


def parse_height_to_inches(height_value):
    if pd.isna(height_value):
        return np.nan
    if isinstance(height_value, (int, float)):
        return float(height_value)

    s = str(height_value).strip().replace('"', "")
    if "-" in s:
        feet, inches = s.split("-", 1)
        if feet.isdigit() and inches.isdigit():
            return int(feet) * 12 + int(inches)

    try:
        return float(s)
    except ValueError:
        return np.nan


def standardize_position(position_value) -> str:
    if pd.isna(position_value):
        return "Unknown"
    p = str(position_value).upper()
    if "C" in p:
        return "Big"
    if "F" in p:
        return "Wing"
    if "G" in p:
        return "Guard"
    return "Unknown"


def fetch_with_retry(fetch_fn, label: str, retries: int = MAX_RETRIES):
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch_fn()
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                wait_s = BASE_BACKOFF_SECONDS ** attempt
                print(f"[{label}] attempt {attempt}/{retries} failed: {exc}")
                print(f"Retrying in {wait_s}s...")
                time.sleep(wait_s)
    raise RuntimeError(f"Failed to fetch {label} after {retries} attempts") from last_error


# Pull player metadata and include historical players
player_meta = fetch_with_retry(
    lambda: playerindex.PlayerIndex(historical_nullable=1, timeout=REQUEST_TIMEOUT).get_data_frames()[0].copy(),
    label="PlayerIndex",
)
id_col = "PERSON_ID" if "PERSON_ID" in player_meta.columns else "PLAYER_ID"
height_col = "HEIGHT" if "HEIGHT" in player_meta.columns else "PLAYER_HEIGHT"
position_col = "POSITION" if "POSITION" in player_meta.columns else "PLAYER_POSITION"

meta_cols = [id_col, height_col, position_col]
for optional_col in ["FROM_YEAR", "TO_YEAR"]:
    if optional_col in player_meta.columns:
        meta_cols.append(optional_col)

player_meta = player_meta[meta_cols].rename(
    columns={
        id_col: "PLAYER_ID",
        height_col: "HEIGHT",
        position_col: "POSITION",
    }
)

# Normalize join key types before merge
player_meta["PLAYER_ID"] = pd.to_numeric(player_meta["PLAYER_ID"], errors="coerce").astype("Int64")
player_meta = player_meta.dropna(subset=["PLAYER_ID"]).drop_duplicates(subset=["PLAYER_ID"], keep="last")

# Pull season-level stats for selected seasons
season_frames = []
for season in SEASONS:
    print(f"Pulling season {season}...")
    sdf = fetch_with_retry(
        lambda s=season: leaguedashplayerstats.LeagueDashPlayerStats(
            season=s,
            season_type_all_star="Regular Season",
            per_mode_detailed="Totals",
            timeout=REQUEST_TIMEOUT,
        ).get_data_frames()[0],
        label=f"LeagueDashPlayerStats {season}",
    )

    required_cols = ["PLAYER_ID", "PLAYER_NAME", "GP", "MIN", "FG3A", "FGA"]
    keep_cols = [c for c in required_cols if c in sdf.columns]
    sdf = sdf[keep_cols].copy()
    sdf["PLAYER_ID"] = pd.to_numeric(sdf["PLAYER_ID"], errors="coerce").astype("Int64")
    sdf["SEASON"] = season
    season_frames.append(sdf)
    time.sleep(1.0)

season_stats = pd.concat(season_frames, ignore_index=True)
season_stats = season_stats.dropna(subset=["PLAYER_ID"])

# Merge + feature engineering
df = season_stats.merge(player_meta, on="PLAYER_ID", how="left")

merge_success = df["HEIGHT"].notna().mean()
print(f"Metadata merge success (non-null HEIGHT): {merge_success:.1%}")

# Fallback if merge quality is unexpectedly low
if merge_success < 0.20:
    raise RuntimeError(
        "Metadata merge success is too low (<20%). Check NBA API response schema or connectivity."
    )

df["HEIGHT_IN"] = df["HEIGHT"].apply(parse_height_to_inches)
df["POSITION_GROUP"] = df["POSITION"].apply(standardize_position)
df["ERA"] = df["SEASON"].apply(season_to_era)

df["MPG"] = df["MIN"] / df["GP"].replace(0, np.nan)
df["FG3A_PER_GAME"] = df["FG3A"] / df["GP"].replace(0, np.nan)
df["FG3A_PER36"] = df["FG3A"] / (df["MIN"] / 36).replace(0, np.nan)
df["FG3A_RATE"] = df["FG3A"] / df["FGA"].replace(0, np.nan)

# Save outputs for reproducibility
os.makedirs("data/01-interim", exist_ok=True)
os.makedirs("data/02-processed", exist_ok=True)

interim_path = "data/01-interim/nba_player_season_combined.csv"
processed_path = "data/02-processed/nba_height_3pa_player_season.csv"

# Interim file: merged table with derived features, before quality filters
df.to_csv(interim_path, index=False)

# Data quality filters from project plan
df = df[(df["GP"] >= 41) & (df["MPG"] >= 10)].copy()

# Processed file: analysis-ready table after quality filters
df.to_csv(processed_path, index=False)

print(f"Saved interim dataset: {interim_path}")
print(f"Saved processed dataset: {processed_path}")
print(f"Rows: {len(df):,} | Columns: {df.shape[1]}")

df.head()


Pulling season 1996-97...
Pulling season 1999-00...
Pulling season 2004-05...
Pulling season 2008-09...
Pulling season 2012-13...
Pulling season 2016-17...
Pulling season 2020-21...
Pulling season 2023-24...
Metadata merge success (non-null HEIGHT): 100.0%
Saved interim dataset: data/01-interim/nba_player_season_combined.csv
Saved processed dataset: data/02-processed/nba_height_3pa_player_season.csv
Rows: 2,523 | Columns: 18


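|   | PLAYER_ID | PLAYER_NAME | GP | MIN | FG3A | FGA | SEASON | HEIGHT | POSITION | FROM_YEAR | TO_YEAR | HEIGHT_IN | POSITION_GROUP | ERA | MPG | FG3A_PER_GAME | FG3A_PER36 | FG3A_RATE |
| - | --------- | ----------- | -- | --- | ---- | --- | ------ | ------ | -------- | --------- | ------- | --------- | -------------- | --- | --- | ------------- | ---------- | --------- |
| 0 | 920 | A.C. Green | 83 | 2494.298333 | 20 | 484 | 1996-97 | 6-9 | F | 1985 | 2000 | 81 | Wing | 1990-2000 | 30.051787 | 0.240964 | 0.288658 | 0.041322 |
| 1 | 243 | Aaron McKie | 83 | 1623.911667 | 103 | 365 | 1996-97 | 6-5 | G | 1994 | 2006 | 77 | Guard | 1990-2000 | 19.565201 | 1.240964 | 2.283375 | 0.282192 |
| 3 | 768 | Acie Earl | 47 | 500.141667 | 5 | 179 | 1996-97 | 6-11 | F-C | 1993 | 1996 | 83 | Big | 1990-2000 | 10.641312 | 0.106383 | 0.359898 | 0.027933 |
| 4 | 228 | Adam Keefe | 62 | 916.788333 | 1 | 160 | 1996-97 | 6-9 | F | 1992 | 2000 | 81 | Wing | 1990-2000 | 14.786909 | 0.016129 | 0.039268 | 0.00625 |
| 5 | 154 | Adrian Caldwell | 45 | 571.608333 | 2 | 92 | 1996-97 | 6-8 | F | 1989 | 1997 | 80 | Wing | 1990-2000 | 12.702407 | 0.044444 | 0.12596 | 0.021739 |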


### Dataset #1: Player-Season Stats + Player Metadata (NBA API)

This checkpoint uses two official NBA Stats API endpoint families exposed through `nba_api`:
1. `LeagueDashPlayerStats` for per-season player statistics (including total 3-point attempts).
2. `PlayerIndex` for player profile attributes (height and listed position).

Wrangling steps implemented in code
- Pull representative seasons across our three eras.
- Keep analysis variables needed for our research question (`FG3A`, `FGA`, `GP`, `MIN`, `HEIGHT`, `POSITION`).
- Merge on player identifier.
- Convert height strings (for example, `6-7`) into inches.
- Create standardized position groups and era labels.
- Create usage-normalized metrics (`FG3A_PER_GAME`, `FG3A_PER36`, and `FG3A_RATE`).
- Apply participation filters (`GP >= 41`, `MPG >= 10`) to reduce noise from tiny samples.

The resulting dataframe is saved to `data/02-processed/nba_height_3pa_player_season.csv` and becomes the canonical analysis table for EDA/modeling.
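
To make the usage-normalized metrics concrete, the short check below recomputes them by hand for a single row, using the season totals shown for A.C. Green (1996-97) in the `df.head()` output. It is only a worked example of the formulas, not part of the pipeline.

```python
# Season totals for A.C. Green, 1996-97, copied from the df.head() output.
gp, minutes, fg3a, fga = 83, 2494.298333, 20, 484

mpg = minutes / gp                   # minutes per game
fg3a_per_game = fg3a / gp            # 3-point attempts per game
fg3a_per36 = fg3a / (minutes / 36)   # 3-point attempts per 36 minutes played
fg3a_rate = fg3a / fga               # share of all shot attempts that are threes

print(round(mpg, 6), round(fg3a_per_game, 6), round(fg3a_per36, 6), round(fg3a_rate, 6))
```

The rounded values should agree with the `MPG`, `FG3A_PER_GAME`, `FG3A_PER36`, and `FG3A_RATE` columns for that row.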


In [4]:
import sys
sys.path.append('./modules')

from data_quality import checkpoint_summary

summary = checkpoint_summary(df)
summary


{'rows': 2523,
 'cols': 18,
 'duplicate_rows': 0,
 'missing_by_col': {'PLAYER_ID': 0,
  'PLAYER_NAME': 0,
  'FG3A_PER36': 0,
  'FG3A_PER_GAME': 0,
  'MPG': 0,
  'ERA': 0,
  'POSITION_GROUP': 0,
  'HEIGHT_IN': 0,
  'TO_YEAR': 0,
  'FROM_YEAR': 0,
  'POSITION': 0,
  'HEIGHT': 0,
  'SEASON': 0,
  'FGA': 0,
  'FG3A': 0,
  'MIN': 0,
  'GP': 0,
  'FG3A_RATE': 0}}
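
The source of `checkpoint_summary` lives in `modules/data_quality.py` and is not reproduced in this notebook. Judging only from the keys in the output above, a helper of this kind might look roughly like the following; this is a hypothetical reimplementation, not the module's actual code.

```python
import pandas as pd

def checkpoint_summary(df: pd.DataFrame) -> dict:
    """Hypothetical sketch: basic shape, duplicate, and missingness counts."""
    return {
        "rows": len(df),
        "cols": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_col": {col: int(n) for col, n in df.isna().sum().items()},
    }

# Toy frame with one fully duplicated row and one missing height.
toy = pd.DataFrame({
    "PLAYER_ID": [1, 2, 2, 3],
    "HEIGHT_IN": [81.0, 77.0, 77.0, None],
})
summary = checkpoint_summary(toy)
print(summary)
```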

### Dataset #2: Optional Extension Dataset (Not Required for This Checkpoint)

For this checkpoint, we use the merged player-season table above as our analysis-ready dataset.
If we later add team-context controls (pace, offensive rating, etc.), they will be joined as a separate dataset in this section.


In [5]:
# Additional validation checks for checkpoint reporting
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print("Top missingness percentages:")
print(missing_pct.head(10))

suspicious_heights = df[(df["HEIGHT_IN"] < 65) | (df["HEIGHT_IN"] > 90)]
print(f"Suspicious height rows (<65 or >90 inches): {len(suspicious_heights)}")

era_position_counts = (
    df.groupby(["ERA", "POSITION_GROUP"], dropna=False)
      .size()
      .reset_index(name="n_player_seasons")
      .sort_values(["ERA", "POSITION_GROUP"])
)
era_position_counts


Top missingness percentages:
PLAYER_ID         0.0
PLAYER_NAME       0.0
FG3A_PER36        0.0
FG3A_PER_GAME     0.0
MPG               0.0
ERA               0.0
POSITION_GROUP    0.0
HEIGHT_IN         0.0
TO_YEAR           0.0
FROM_YEAR         0.0
dtype: float64
Suspicious height rows (<65 or >90 inches): 3


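|   | ERA | POSITION_GROUP | n_player_seasons |
| - | --- | -------------- | ---------------- |
| 0 | 1990-2000 | Big | 156 |
| 1 | 1990-2000 | Guard | 198 |
| 2 | 1990-2000 | Wing | 235 |
| 3 | 2001-2013 | Big | 247 |
| 4 | 2001-2013 | Guard | 322 |
| 5 | 2001-2013 | Wing | 388 |
| 6 | 2014-present | Big | 207 |
| 7 | 2014-present | Guard | 378 |
| 8 | 2014-present | Wing | 392 |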
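
Because the first era contributes two sampled seasons and the other eras three each, raw counts are hard to compare directly. A small sketch like the one below converts the counts above into within-era shares, which gives a quick check on how balanced the position groups are in each era.

```python
# Player-season counts copied from the era/position table above.
counts = {
    ("1990-2000", "Big"): 156, ("1990-2000", "Guard"): 198, ("1990-2000", "Wing"): 235,
    ("2001-2013", "Big"): 247, ("2001-2013", "Guard"): 322, ("2001-2013", "Wing"): 388,
    ("2014-present", "Big"): 207, ("2014-present", "Guard"): 378, ("2014-present", "Wing"): 392,
}

# Total player-seasons per era.
era_totals = {}
for (era, _), n in counts.items():
    era_totals[era] = era_totals.get(era, 0) + n

# Share of each position group within its own era.
shares = {key: n / era_totals[key[0]] for key, n in counts.items()}

for (era, group), share in shares.items():
    print(f"{era:>12} {group:<5} {share:.1%}")
```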


## Ethics

**UPDATED Ethics Section**

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

Our project uses publicly available NBA player statistics and biographical attributes (e.g., height, position) from official/public sports data sources accessed via `nba_api`. We are not collecting new data from human subjects, conducting interventions, or gathering private information. Because this is archival, public, non-sensitive data about public figures in a professional context, informed consent in the traditional research sense is not applicable.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

Because we rely on an existing dataset, we inherit any biases from how the league records data and how positions/heights are defined. For example, “position” labels can be inconsistent across seasons and may reflect team conventions rather than actual on-court roles. We will mitigate this by (1) documenting data sources and definitions, (2) standardizing positions into broad groups (e.g., Guard/Wing/Big or G/F/C), and (3) running sensitivity checks with alternative groupings where feasible.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

Although player names are technically identifying, they are public figures and names are not necessary for our analysis. We will minimize exposure by analyzing at the player-season level using player IDs and excluding unnecessary personal fields (birthdate, birthplace, etc.). If we include example players to illustrate outliers, we will keep it minimal and directly tied to analysis.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

We are not building a system that makes decisions about individuals, and we do not use protected attributes such as race, ethnicity, or gender. As a result, typical “downstream bias” concerns are limited. However, we will avoid making normative claims about what different groups “should” do and will frame findings as descriptive patterns in professional basketball strategy and roles.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

We will store only public, non-sensitive statistical data in our course GitHub repository and local course environment. We will not store secrets (tokens) in the repo. If authentication tokens are required (e.g., GitHub PAT), they will be kept outside version control (environment variables or local credential storage).

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

Because the data are public professional records and we are not maintaining a user-facing database, this is not directly applicable. If needed, we can remove player names or omit individual examples from outputs and present aggregated results.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

We will retain only what is required to reproduce the analysis for the duration of the course. After the course, we can delete local copies and/or make the repository private/archived depending on team preference and course guidance.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

Our analysis concerns professional sports strategy rather than impacted communities. Still, we will check assumptions using domain knowledge (e.g., rule changes, era differences, and the rise of “spacing” offenses) and avoid over-generalizing from basketball to broader claims about bodies or ability.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

Key confounds include era-wide increases in 3PA, changing offensive schemes, minutes played, and role differences within positions. To reduce omitted-variable bias, we will use rate-based outcomes (e.g., 3PA per 36 minutes or 3PA/FGA) and include controls such as minutes/games (and age if available). We will also test interactions with era and position to avoid attributing league-wide trends solely to height.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

We will avoid misleading axes/scales, show uncertainty where appropriate (confidence intervals), and present both overall trends and stratified trends by position/era. We will report effect sizes in interpretable units (e.g., change in 3PA rate per 2 inches) instead of relying only on p-values.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

We will not publish unnecessary personal fields. Player names are not required for the core analysis and will be omitted from most tables/plots. If naming an outlier player is helpful, we will do so sparingly and only when it materially improves interpretation.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

We will keep a reproducible pipeline: data pull script/notebook, cleaning steps with clear comments, fixed season lists/era bins, and saved intermediate datasets. We will include a README describing how to rerun the project from raw pull to final figures.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

Our models are descriptive and use basketball-related features (height, position, era, minutes/games). We will not include protected attributes. We will also avoid using height to draw claims about people outside the context of NBA roles and strategy.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

We are not deploying a predictive model for decisions about people, so formal fairness testing (disparate error rates) is not directly applicable. Our primary goal is explanation/association. If we do include predictive components, we will report performance separately by position/era to ensure results are not dominated by one group.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

We will compare multiple reasonable outcomes (3PA/game, 3PA/36, 3PA/FGA) because “attempts” can be driven by playing time and team context. Using multiple metrics reduces the risk of drawing conclusions that are an artifact of one specific definition.
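
A small illustration of why the metric choice matters: the two players below are hypothetical, with invented numbers, and they rank differently depending on whether attempts are counted per game or per 36 minutes.

```python
# Two hypothetical players; the numbers are invented for illustration.
players = {
    "starter": {"mpg": 34.0, "fg3a_per_game": 5.0},
    "reserve": {"mpg": 12.0, "fg3a_per_game": 3.0},
}

for name, p in players.items():
    # Rescale per-game attempts to a common 36-minute basis.
    p["fg3a_per36"] = p["fg3a_per_game"] / p["mpg"] * 36

by_game = max(players, key=lambda n: players[n]["fg3a_per_game"])
by_per36 = max(players, key=lambda n: players[n]["fg3a_per36"])
print(f"Most 3PA per game: {by_game}; most 3PA per 36 minutes: {by_per36}")
```

The starter leads in raw attempts per game, while the reserve shoots far more often per minute on the floor, so a conclusion drawn from only one of the two metrics could be misleading.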

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

We will use interpretable models (correlations and linear regression with interaction terms) and will translate coefficients into plain language. For example, we will explain how the height–3PA relationship changes across eras and positions rather than presenting only statistical output.
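
A minimal sketch of the kind of interaction model described here, fit with ordinary least squares in NumPy on synthetic, noise-free data. The generating coefficients are assumptions chosen so the result is easy to verify; they are not estimates from our dataset, and the variable names are illustrative.

```python
import numpy as np

# Synthetic, noise-free data so the fitted coefficients are easy to check.
height_c = np.array([-4.0, -2.0, 0.0, 2.0, 4.0] * 2)  # height centered, in inches
modern = np.array([0] * 5 + [1] * 5)                  # 0 = early era, 1 = modern era
# Assumed generating model: slope -0.30 per inch early, -0.30 + 0.20 = -0.10 modern.
fg3a_per36 = 2.0 - 0.30 * height_c + 3.0 * modern + 0.20 * height_c * modern

# Design matrix: intercept, height, era dummy, height x era interaction.
X = np.column_stack([np.ones_like(height_c), height_c, modern, height_c * modern])
coef, *_ = np.linalg.lstsq(X, fg3a_per36, rcond=None)
intercept, b_height, b_modern, b_interaction = coef

# The height slope in the modern era is the base slope plus the interaction term.
early_slope = b_height
modern_slope = b_height + b_interaction
print(f"Early era: {early_slope:+.2f} 3PA/36 per inch of height")
print(f"Modern era: {modern_slope:+.2f} 3PA/36 per inch of height")
```

Reading the output in plain language: in this toy setup each extra inch of height is associated with fewer 3-point attempts in both eras, but the penalty is smaller in the modern era, which is the pattern the interaction coefficient captures.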

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

We will explicitly state that correlation does not imply causation, and that position labels and evolving play styles confound simple interpretations. We will discuss how era-wide strategy changes (not height alone) drive modern 3PA, and we will caution against generalizing results beyond NBA context.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

This project is not being deployed as a real-world system. If we were to extend it, “monitoring” would mean updating with new seasons and checking whether the height–3PA relationship shifts over time (concept drift), which we partially address by explicitly modeling era effects.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

Potential harm is low because this is descriptive sports analytics, but misinterpretation could occur (e.g., implying taller players “shouldn’t” shoot threes). We will reduce this risk through careful framing, clear limitations, and avoiding normative claims. If issues are identified, we can revise wording, remove unnecessary identification of individuals, and add clarifying caveats.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

Not applicable because we are not deploying a production system. Our work exists as a report/notebook; if needed we can remove or revise analyses and update the repository.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

Because we are not deploying, risk is limited. The main unintended use would be over-generalizing results to non-NBA contexts or using findings to support stereotypes about body types. We will prevent this by keeping claims strictly within basketball strategy/role context and explicitly stating boundaries of interpretation.

## Team Expectations 

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Project Timeline Proposal

**UPDATED Timeline Section**
| Meeting Date | Meeting Time    | Agenda / Discuss at Meeting                                                                                                                                                                                                                            | Next Steps                                                                                                                              |
| ------------ | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
| **2/16**     | 6 PM            | Finalize scope: choose eras/seasons (e.g., 90s/00s/10s or specific season ranges); decide unit of analysis (player-season); decide primary metric (3PA/game vs 3PA/36 vs 3PA rate = 3PA/FGA); confirm position grouping plan (G/F/C or Guard/Wing/Big) | Assign roles (data pull, cleaning, EDA, modeling, writing); set repo structure; draft data dictionary + analysis plan outline           |
| **2/18**     | 6 PM            | Data Checkpoint work session: pull data via `nba_api`; verify joins (player_id ↔ season stats ↔ height); check coverage across eras; start cleaning (height standardization, position parsing, era labeling)                                       | Submit Data Checkpoint; export raw dataset v1 + cleaned dataset v1; document endpoints + code workflow in README                        |
| **2/23**     | 6 PM            | Data cleaning review: handle missing height/position; decide filters (min minutes/games); engineer variables (3PA/36, 3PA rate); sanity checks by era/position (counts, ranges)                                                                        | Lock cleaning rules; produce final analysis-ready dataset; write short “Data & Cleaning” methods paragraph                              |
| **2/25**     | 6 PM            | EDA sprint: scatter/hex plots of height vs 3PA metrics; stratify by era and position; check whether correlation differs by group; identify outliers (very tall high-3PA) and decide treatment                                                          | Finalize EDA notebook + 3–5 key figures; draft EDA narrative answering “overall vs by era/position”                                     |
| **3/01**     | 6 PM            | Modeling plan + first results: run correlations and baseline regressions; test interactions (Height×Era, Height×Position, maybe Height×Era×Position); choose controls (minutes, games; age if available)                                               | Select primary model + backup model; run robustness checks (different metrics, thresholds); prep EDA Checkpoint submission polishing    |
| **3/04**     | 6 PM            | Submit EDA Checkpoint (Due 3/04): clean plots, captions, and clear takeaways; confirm reproducibility (run-all); outline what modeling will prove beyond EDA                                                                                       | Begin full analysis write-up; generate final results tables (effect sizes + CIs); start final visualization polishing                   |
| **3/08**     | 6 PM            | Main analysis + interpretation: finalize models; translate coefficients into plain language (e.g., “+2 inches → X change in 3PA/36 in 90s vs 2010s”); finalize figure set (by-era trend lines, coefficient plot)                                       | Draft Results + Discussion; document limitations (position labeling, pace/era confounds, survivorship); update README + methods details |
| **3/11**     | 6 PM            | Full draft assembly: Methods/Results/Discussion integration; ensure claims match evidence; peer review for clarity + logic; confirm citations and formatting                                                                                           | Produce full project draft v1; polish visuals + captions; clean code and add comments; finalize conclusion                              |
| **3/18**     | Before 11:59 PM | Submission checklist: final run-through of notebook(s), report, figures; verify file names and rubric requirements; confirm all group surveys done                                                                                                     | **Turn in Final Project & Group Project Surveys**                                                                                       |
