# COGS 108 - Data Checkpoint

- Boyu Tian: Methodology, Background research, Writing - original draft, Analysis, Writing - review & editing
- Ziming Zhu: Visualization, Project administration, Writing - original draft, Writing - review & editing
- Haoshuo Bi: Methodology, Writing - original draft, Analysis, Writing - review & editing
- Xukuan Wang: Visualization, Writing - original draft, Writing - review & editing, Software, Methodology, Conceptualization, Data curation
- Kevin Pyo: Writing - original draft, Software, Conceptualization, Data curation, Project administration, Analysis, Background research, Writing - review & editing

## Research Question

Based on season-level NBA data, is player height (in inches) associated with on-court production, specifically Player Efficiency Rating (PER) and field goals made (FGM), after controlling for other factors such as minutes played, age/experience, weight, and wingspan (or, alternatively, the wingspan-to-height ratio)? Across both the correlation and multiple regression analyses, does the height–performance relationship weaken or change depending on positional group (guard/forward/center) or season (e.g., through interaction terms or season effects)?

## Background and Prior Work

Basketball performance reflects both learned skill (shooting, decision making) and physical constraints (reach, mass, speed). Because the NBA systematically records player measurements and season-level box-score outcomes, it provides a strong setting for testing whether **height is associated with performance** and whether that association differs by **position**, where roles and typical body types differ (e.g., rim protection vs. playmaking). Prior work on lineups and “small-ball” strategy supports the idea that the value of body size is **context-dependent**, motivating our decision to analyze results **within positions** rather than assuming one global height–performance relationship. <sup>[5](#fn5)</sup>

A key challenge is operationalizing “achievement” in a way that is measurable and comparable across players and seasons. We will use **Player Efficiency Rating (PER)** as our **primary outcome** because it is a widely used, per-minute composite of box-score production normalized to a league average value (≈15). However, PER is not a complete measure of total value: it is **box-score based** and is commonly criticized for **under-representing defense** beyond steals/blocks/rebounds, ignoring off-ball defense and matchup effects, and potentially over-rewarding certain offensive box-score patterns. Because of these limitations, we treat PER as a convenient summary, not a definitive “true value”, and will also report **secondary outcomes** (points, rebounds, assists per game) to show *how* any height association manifests across different skill domains and roles. <sup>[6](#fn6)</sup>

Beyond height alone, prior work suggests that reach-related measures, especially **wingspan relative to height**, can matter for basketball outcomes. For example, reporting on research published in the *Journal of Anthropology of Sport and Physical Education*, a UC Berkeley news summary describes evidence that arm-length-to-height proportions are associated with elite athletic success in the NBA, suggesting that “functional length” may matter beyond listed height. <sup>[7](#fn7)</sup> Complementing this, sport analytics work focusing on wingspan reports potential tradeoffs: longer wingspans may support defensive impact while being linked to reduced shooting accuracy in some analyses, typically evaluated using **regression-style modeling** (i.e., modeling performance outcomes as a function of wingspan while controlling for other factors). <sup>[8](#fn8)</sup> These findings directly shape our project direction: if height effects weaken after controlling for role/position, wingspan (or wingspan-to-height ratio) may better capture “usable reach” relevant to rebounding and defense, so we will include wingspan when available and explicitly acknowledge it as a potential confound or alternative predictor.

Relatedly, draft combine research demonstrates that anthropometric variables differ systematically by position and can be associated with selection outcomes, often using group comparisons and predictive modeling to evaluate discriminative factors. For instance, research on NBA Draft Combine participants identifies which anthropometric and fitness measures best distinguish draft outcomes across positions. <sup>[9](#fn9)</sup> While draft status is not the same as NBA on-court production, this literature reinforces two points that matter for our modeling plan: (1) anthropometrics are intertwined with positional role and selection processes, and (2) any “height effect” should be tested with **controls** and **position-stratified analysis** to reduce misleading aggregate conclusions.

Finally, prior work in this area commonly relies on **correlation** and **multivariable regression** frameworks to quantify associations between physical traits and outcomes (e.g., Pearson correlations for bivariate relationships; linear regression to control for confounds). Because our research question centers on whether the height–performance relationship is **position-dependent**, we will extend the typical approach by explicitly testing a **Height × Position interaction** in a regression model, alongside controls such as minutes played and age/experience when available. This allows us to distinguish (a) overall associations from (b) role-specific patterns—aligning our method directly with what prior work suggests, while addressing the common limitation that many analyses either do not stratify by position or do not formally test whether relationships differ across roles.
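
As a minimal sketch of how a Height × Position interaction could be set up, the example below fits an ordinary least squares model on synthetic toy data (not our NBA data). The column names (`height_in`, `pos_group`, `MP`, `PER`), the choice of guards as the reference group, and the use of `numpy.linalg.lstsq` are all illustrative assumptions rather than our final pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 600

# Synthetic toy data (NOT real NBA data): position-specific height slopes are built in.
pos = rng.choice(["G", "F", "C"], size=n)
base_height = pd.Series(pos).map({"G": 75, "F": 80, "C": 83}).to_numpy()
height = base_height + rng.normal(0, 2, size=n)
minutes = rng.uniform(5, 38, size=n)
true_slope = pd.Series(pos).map({"G": 0.0, "F": 0.3, "C": 0.6}).to_numpy()
per = 10 + true_slope * (height - 79) + 0.1 * minutes + rng.normal(0, 1, size=n)

toy = pd.DataFrame({"PER": per, "height_in": height, "pos_group": pos, "MP": minutes})

def fit_height_by_position(df):
    """OLS of PER on centered height, position dummies, their interaction, and minutes."""
    h = (df["height_in"] - df["height_in"].mean()).to_numpy()
    is_f = (df["pos_group"] == "F").to_numpy(dtype=float)
    is_c = (df["pos_group"] == "C").to_numpy(dtype=float)
    X = np.column_stack([
        np.ones(len(df)), h, is_f, is_c, h * is_f, h * is_c, df["MP"].to_numpy(),
    ])
    names = ["intercept", "height", "F", "C", "height:F", "height:C", "MP"]
    beta, *_ = np.linalg.lstsq(X, df["PER"].to_numpy(), rcond=None)
    return pd.Series(beta, index=names)

coefs = fit_height_by_position(toy)

# Height slope per group = reference (guard) slope plus that group's interaction term
slopes = {
    "G": coefs["height"],
    "F": coefs["height"] + coefs["height:F"],
    "C": coefs["height"] + coefs["height:C"],
}
print(coefs.round(3))
print({k: round(v, 3) for k, v in slopes.items()})
```

Interaction coefficients near zero would suggest one shared height slope across positions, while clearly nonzero terms would indicate role-specific slopes. A full analysis would also need standard errors and significance tests, which this sketch omits.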

---

<a name="fn5"></a>5. Zhang et al. (and related work summarized in): *Clustering Performances in Elite Basketball Matches According to the Anthropometric Features of the Line-ups Based on Big Data Technology.* https://pmc.ncbi.nlm.nih.gov/articles/PMC9309682/

<a name="fn6"></a>6. Wikipedia. “Player efficiency rating.” https://en.wikipedia.org/wiki/Player_efficiency_rating

<a name="fn7"></a>7. University of California. “Study shows wingspan has correlation to athletic prowess in NBA, MMA.” https://www.universityofcalifornia.edu/news/study-shows-wingspan-has-correlation-athletic-prowess-nba-mma

<a name="fn8"></a>8. SportRxiv preprint on wingspan and performance tradeoffs (defense vs. shooting). https://sportrxiv.org/index.php/server/preprint/view/238

<a name="fn9"></a>9. NBA Draft Combine / anthropometrics study (position differences and draft outcomes). https://pmc.ncbi.nlm.nih.gov/articles/PMC6820507/

## Hypothesis


We hypothesize that the height of NBA players (in inches) will have an overall positive association with PER (our primary performance metric), but that this relationship will be position-dependent. Specifically, we expect the height–PER association to be strongest for centers, moderate for forwards, and weak or near-zero for guards, since height is more directly tied to interior roles (e.g., rebounding and rim protection) than perimeter roles.

For our secondary outcomes, we predict height will be positively associated with rebounds per game (RPG) most strongly among centers and forwards, weakly associated with points per game (PPG), and negatively or near-zero associated with assists per game (APG), particularly among guards where playmaking is less height-dependent.
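
One way these position-specific predictions could be checked is by computing the height correlation separately within each position group. The sketch below uses made-up toy rows; the column names `pos_group`, `height_in`, and `RPG` and the helper `height_corr_by_group` are hypothetical, not part of our datasets.

```python
import pandas as pd

# Toy rows (made-up values, NOT real player data)
toy = pd.DataFrame({
    "pos_group": ["G", "G", "G", "G", "C", "C", "C", "C"],
    "height_in": [73, 74, 76, 77, 82, 83, 84, 85],
    "RPG":       [3.0, 2.5, 3.5, 3.0, 7.0, 8.5, 9.0, 11.0],
})

def height_corr_by_group(df, outcome, group_col="pos_group"):
    """Pearson r between height and one outcome, computed separately per group."""
    return pd.Series({
        name: group["height_in"].corr(group[outcome])
        for name, group in df.groupby(group_col)
    })

r_by_group = height_corr_by_group(toy, "RPG")
print(r_by_group.round(2))
```

The same helper could be called with PPG or APG as the outcome to compare how the height association differs across the secondary outcomes.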



## Data

### Data overview

The ideal dataset for our question would be a single player-season level NBA table containing each player's height (in inches), weight, age, minutes played, years of experience, and position, along with Player Efficiency Rating (PER) and traditional box-score statistics such as field goals made and points. It should span multiple NBA seasons and include every player-season in that range, so that correlations and regressions are reliable both overall and within position groups. Ideally, all data would be in CSV format for easier organization.



In [2]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [3]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Successfully downloaded: airline-safety.csv
Successfully downloaded: bad-drivers.csv





### Dataset #1 Players weight, height, position, and PER (Player Efficiency Rating)

  - Dataset Name: NBA Players stats since 1950
  - Link to the dataset: https://www.kaggle.com/datasets/drgilermo/nba-players-stats
  - Number of observations: ~ 29241
  - Number of variables: ~ 60
  - This dataset includes the name, position, PER (Player Efficiency Rating), number of games, height (in centimeters and inches), which is the primary independent variable for the project, and weight (in kilograms and pounds) of NBA players since 1950. The height, position, and PER data can be used to analyze whether the relationship between player height and on-court performance varies by position. This dataset does not require any request or permission to use. It contains a large amount of height and weight data, which will require some filtering before use.
  - This dataset covers player stats since 1950, so many of the older players' rows will probably need to be filtered out: the related game data is old and likely unavailable for further analysis, and the full range from the 1950s to the present is more than this analysis needs. Many older players are also missing values for some variables, because some of the rules related to these variables had not yet been introduced during their time in the NBA; we will drop or ignore these values in the analysis. Finally, the data is split across separate CSV files, with season stats such as position and PER stored apart from player height and weight, so we will need to combine the files before using them for the project.
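
A minimal sketch of how the height table and the season stats could be combined is shown below, using made-up toy rows rather than the real files. The `name` column, the trailing `*` on some player names, and the `player_key` helper column are assumptions for illustration and should be checked against the actual CSVs.

```python
import pandas as pd

# Toy stand-ins for the two files (made-up rows; column names are assumptions)
bio = pd.DataFrame({
    "name": ["Player A", "Player B"],
    "height": ["6-10", "6-3"],
})
seasons = pd.DataFrame({
    "Year": [1995, 1996, 1996],
    "Player": ["Player A*", "Player A*", "Player C"],
    "PER": [20.1, 21.4, 12.0],
})

def to_inches(height):
    """Convert a 'feet-inches' string such as '6-10' to total inches."""
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)

bio["height_in"] = bio["height"].map(to_inches)

# Normalize names before joining: drop any '*' and surrounding whitespace
seasons["player_key"] = seasons["Player"].str.replace("*", "", regex=False).str.strip()
bio["player_key"] = bio["name"].str.strip()

merged = seasons.merge(
    bio[["player_key", "height_in"]],
    on="player_key",
    how="left",
    validate="many_to_one",  # raises if a name appears twice in the bio table
    indicator=True,          # adds a _merge column flagging unmatched rows
)
unmatched = merged.loc[merged["_merge"] == "left_only", "Player"].unique()
print(merged[["Year", "Player", "PER", "height_in"]])
print("Unmatched players:", list(unmatched))
```

Joining on names is fragile: two different players who share a name would trip the `validate` check, so the real merge may need an extra key such as the years a player was active.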



In [8]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
import pandas as pd
import numpy as np
    
url1 = "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Players%20stats%20since%201950/NBA_Players_stats_since_1950_player_data.csv"
url2 = "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Players%20stats%20since%201950/NBA_Players_stats_since_1950_Seasons_Stats.csv" 
df1 = pd.read_csv(url1)
df2 = pd.read_csv(url2)
df1.head()
df2.head()

# Tidy
df1.shape
df1.columns
df1.info()

df2.shape
df2.columns
df2.info()
# the dataset seems tidy, just need to convert some data to a better format for calculation later

# Size of dataset
print("Number of observations:", df1.shape[0])
print("Number of variables:", df1.shape[1])
print("Number of observations:", df2.shape[0])
print("Number of variables:", df2.shape[1]) 

# The number of observations in the first part is 4550, number of variables is 8
# The number of observations in the second part is 24691, number of variables is 53

# Missing/cleaning
df1.isna().sum()
df2.isna().sum()
(df1.isna().sum() / len(df1)) * 100
df1 = df1.dropna(subset = ["height"])
df1 = df1.dropna(subset = ["position"])
df1 = df1.dropna(subset = ["weight"])

(df2.isna().sum() / len(df2)) * 100
df2 = df2.drop(columns = ["blanl", "blank2"])
df2 = df2[["Year", "Player", "Pos", "Age", "Tm", "G", "GS", "MP", "PER", "TRB", "AST","PTS"]]
df2 = df2[df2["Year"] >= 1980]

def height_cleanup(h):
    if pd.isna(h):
        return None
    feet, inches = h.split("-")
    return int(feet)*12 + int(inches)
df1["height_in"] = df1["height"].apply(height_cleanup)
#df1.head()

df1 = df1.reset_index(drop = True)
df2 = df2.reset_index(drop = True)

df1
df2




### Dataset #2 Player position, games and minutes played, and field goal data.

  - Dataset Name: NBA Player Data 2009-2023
  - Link to the dataset: https://www.kaggle.com/datasets/sidlamsal/nba-player-data-2009-2023
  - Number of observations: 8980 (all 14 season files combined)
  - Number of variables: 31
  - This dataset includes the name, position, games played, games started, field goal percentage (shooting accuracy during live play), and field goals attempted (total shots taken from the field during live play) of NBA players from 2009 to 2023. Position and field goal percentage, among other variables, can help us analyze player performance and examine the correlation between players' height and their on-court performance, including how it varies by position. This dataset does not require any request or permission to use and is intended for educational use. It includes many useful variables on NBA players across several years.
  - This dataset stores player data by season in separate CSV files, which will need to be combined with each other and with the other data before analysis. The dataset is also missing some performance variables such as PER, which will need to be obtained from other datasets and merged in later.
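
Because each season lives in its own file, concatenating the files directly loses track of which season a row came from. The sketch below shows one possible way to keep that information by deriving a `Season` column from the file name. It uses in-memory toy CSV text instead of the real files, and the `Season` column and `season_end_year` helper are hypothetical names.

```python
import io
import pandas as pd

# Toy CSV text keyed by file name (made-up rows standing in for the real season files)
files = {
    "9-10stats.csv": "Player,Pos,PTS\nPlayer A,SG,8.8\n",
    "10-11stats.csv": "Player,Pos,PTS\nPlayer A,SG,12.6\nPlayer B,C,4.2\n",
}

def season_end_year(filename):
    """Map a name like '9-10stats.csv' to the calendar year the season ended (2010)."""
    end = filename.replace("stats.csv", "").split("-")[1]
    return 2000 + int(end)

frames = []
for name, text in files.items():
    season_df = pd.read_csv(io.StringIO(text))
    season_df["Season"] = season_end_year(name)  # tag rows before concatenating
    frames.append(season_df)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With the real data, the file name could be taken from the last path segment of each URL. A `Season` column would also make it possible to merge PER from another dataset on player and year rather than on player alone.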



In [60]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
import pandas as pd
import numpy as np

urls = ["https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/9-10stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/10-11stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/11-12stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/12-13stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/13-14stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/14-15stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/15-16stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/16-17stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/17-18stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/18-19stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/19-20stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/20-21stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/21-22stats.csv",
        "https://raw.githubusercontent.com/COGS108/Group039_WI26/refs/heads/master/data/00-raw/NBA%20Player%20Data%202009-2023/22-23stats.csv"]

dfs = []

# combining the data csvs together
for url in urls:
    temp = pd.read_csv(url)
    dfs.append(temp)

df = pd.concat(dfs, ignore_index = True)

# Tidy
df.shape
df.columns
df.info()
# The data seems tidy?

# Size
print("Number of observations:", df.shape[0])
print("Number of variables:", df.shape[1])

# Missing
missing_percent = (df.isna().sum() / len(df)) * 100
missing_percent.sort_values(ascending = False)

df = df.dropna(axis = 1, how = "all")
df = df[["Player", "Pos", "Age", "G", "MP", "TRB", "AST", "PTS"]]

df.describe()
df.isna().sum()
df.shape

df_clean = df.dropna(subset=["Player", "Pos", "Age", "G", "MP", "TRB", "AST", "PTS"])

print("Shape after cleaning:", df_clean.shape)
print("Missing data after cleaning:\n", df_clean.isna().sum())

df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8980 entries, 0 to 8979
Data columns (total 31 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Rk                 8980 non-null   int64  
 1   Player             8980 non-null   object 
 2   Pos                8980 non-null   object 
 3   Age                8980 non-null   int64  
 4   Tm                 8980 non-null   object 
 5   G                  8980 non-null   int64  
 6   GS                 8980 non-null   int64  
 7   MP                 8980 non-null   float64
 8   FG                 8980 non-null   float64
 9   FGA                8980 non-null   float64
 10  FG%                8924 non-null   float64
 11  3P                 8980 non-null   float64
 12  3PA                8980 non-null   float64
 13  3P%                8026 non-null   float64
 14  2P                 8980 non-null   float64
 15  2PA                8980 non-null   float64
 16  2P%                8861 

Unnamed: 0,Player,Pos,Age,G,MP,TRB,AST,PTS
0,Arron Afflalo,SG,24,82,27.1,3.1,1.7,8.8
1,Alexis Ajinça,C,21,6,5.0,0.7,0.0,1.7
2,LaMarcus Aldridge,PF,24,78,37.5,8.0,2.1,17.9
3,Joe Alexander,SF,23,8,3.6,0.6,0.3,0.5
4,Malik Allen,PF,31,51,8.9,1.6,0.3,2.1
...,...,...,...,...,...,...,...,...
8975,Thaddeus Young,PF,34,54,14.7,3.1,1.4,4.4
8976,Trae Young,PG,24,73,34.8,3.0,10.2,26.2
8977,Omer Yurtseven,C,24,9,9.2,2.6,0.2,4.4
8978,Cody Zeller,C,30,15,14.5,4.3,0.7,6.5


### Dataset #3 Players performances in detail combined with position 

  - Dataset Name: NBA Player Data from 2003 to 2022
  - Link to the dataset: https://www.kaggle.com/datasets/dhruvsuryavanshi/nba-player-data-from-2003-to-2022
  - Number of observations: 2052
  - Number of variables: 42
  - This dataset includes position, name, age, and many performance variables such as average field goals made, average field goal percentage, average 3-pointers made, average 2-pointers made, several average rebounds-per-game measures, average personal fouls per game, average points per game, MVP award points won by the player, the share of total points the player received that season, minutes played, and other game data for NBA players from 2003 to 2022. These performance variables can be used to build a fuller picture of player performance when combined with the PER, position, and height data for the project.
  - This dataset includes many performance-related variables, which will require more analysis to determine how they relate to player performance. We may also need to exclude variables that are not important for analyzing player performance or the relationship between player height and performance.



In [12]:
import pandas as pd
players = pd.read_csv('data/00-raw/NBA Player Data 2009-2023/10-11stats.csv')
players
# This loads a single season (2010-11) covering all players. The Kaggle data was already cleaned up pretty well.

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Player-additional
0,1,Jeff Adrien,PF,24,GSW,23,0,8.5,1.0,2.3,...,1.0,1.5,2.5,0.4,0.2,0.2,0.4,1.2,2.5,adrieje01
1,2,Arron Afflalo,SG,25,DEN,69,69,33.7,4.5,9.1,...,0.7,3.0,3.6,2.4,0.5,0.4,1.0,2.2,12.6,afflaar01
2,3,Maurice Ager,SG,26,MIN,4,0,7.3,1.5,2.8,...,0.0,0.5,0.5,0.3,0.3,0.0,1.0,1.0,3.8,agerma01
3,4,Alexis Ajinça,C,22,TOT,34,2,10.0,1.7,3.9,...,0.5,1.8,2.3,0.3,0.3,0.6,0.5,2.1,4.2,ajincal01
4,4,Alexis Ajinça,C,22,DAL,10,2,7.5,1.2,3.2,...,0.5,1.2,1.7,0.2,0.3,0.5,0.1,1.3,2.9,ajincal01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
620,448,Dorell Wright,SF,25,GSW,82,82,38.4,5.9,14.0,...,1.1,4.2,5.3,3.0,1.5,0.8,1.6,2.1,16.4,wrighdo01
621,449,Julian Wright,SF,23,TOR,52,6,14.7,1.6,3.1,...,0.9,1.4,2.3,1.1,0.8,0.4,0.8,0.9,3.6,wrighju01
622,450,Nick Young,SG,25,WAS,64,40,31.8,6.4,14.6,...,0.4,2.3,2.7,1.2,0.7,0.3,1.4,2.3,17.4,youngni01
623,451,Sam Young,SF,25,MEM,78,46,20.2,3.0,6.3,...,0.5,1.9,2.4,0.9,0.9,0.3,0.8,1.5,7.3,youngsa01
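
The output above lists Alexis Ajinça twice for the same season, once with `Tm` equal to `TOT` and once with `DAL`. We read `TOT` as the combined line for a player who changed teams mid-season, which is an assumption worth verifying against the data source. If it holds, one possible cleaning rule is to keep only the `TOT` row for such players so that each player contributes one row per season. The sketch below applies that rule to made-up toy rows; the helper name is hypothetical.

```python
import pandas as pd

# Toy rows mirroring the pattern above (values are illustrative, not real stats)
toy = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player A", "Player B"],
    "Tm":     ["TOT", "DAL", "TOR", "GSW"],
    "G":      [34, 10, 24, 82],
    "PTS":    [4.2, 2.9, 4.7, 16.4],
})

def one_row_per_player(df, key="Player"):
    """Keep the 'TOT' row for players who have one, and the single team row otherwise."""
    traded = set(df.loc[df["Tm"] == "TOT", key])
    keep = (df["Tm"] == "TOT") | ~df[key].isin(traded)
    return df[keep].reset_index(drop=True)

deduped = one_row_per_player(toy)
print(deduped)
```

This rule is meant to be applied within a single season. For multi-season data it would need to run per season, and the `Player-additional` ID column shown in the output may be a safer key than the player name.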


## Ethics

This project uses publicly available NBA player statistics and does not include any private or sensitive personal information. Since the data come from official league sources and public databases, issues related to informed consent and personal privacy are minimal. However, there are still several ethical considerations in how we define performance and interpret our results.

First, we create a composite performance index using standardized values of PER, points per game, rebounds per game, and assists per game. Although this helps combine different performance metrics into one measure, assigning equal weight to these variables is a subjective choice. Different positions in basketball naturally emphasize different skills, so some players may be advantaged or disadvantaged depending on how the index is constructed. In addition, PER itself has known limitations and may not fully represent defensive contributions or team impact. Therefore, our performance index should be understood as an approximation rather than a complete measure of player value.
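
To make the weighting choice concrete, the sketch below builds an equal-weight composite from z-scored columns and then recomputes it with a different, arbitrary set of weights as a sensitivity check. The values are made up, and the column names (`PER`, `PPG`, `RPG`, `APG`), the `composite_index` helper, and the alternative weights are illustrative assumptions.

```python
import pandas as pd

# Toy values (made up) for the four components named above
toy = pd.DataFrame({
    "PER": [12.0, 15.0, 18.0, 21.0],
    "PPG": [6.0, 10.0, 14.0, 22.0],
    "RPG": [2.0, 9.0, 4.0, 7.0],
    "APG": [5.0, 1.0, 3.0, 6.0],
})

def composite_index(df, cols, weights=None):
    """Weighted sum of z-scored columns; equal weights unless others are given."""
    z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    if weights is None:
        weights = {c: 1 / len(cols) for c in cols}
    w = pd.Series(weights)[cols]
    return (z * w).sum(axis=1)

cols = ["PER", "PPG", "RPG", "APG"]
toy["perf_index"] = composite_index(toy, cols)

# Sensitivity check: weight PER more heavily and see whether the ranking changes
alt_weights = {"PER": 0.7, "PPG": 0.1, "RPG": 0.1, "APG": 0.1}
toy["perf_index_alt"] = composite_index(toy, cols, alt_weights)
print(toy.round(2))
```

If player rankings shift noticeably between weightings, that is a sign the conclusions depend on the subjective weighting and should be reported with that caveat.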

Second, our regression analysis assumes a linear relationship between height and the constructed performance index. This may oversimplify the true relationship. There may be nonlinear patterns or interaction effects between height and position that are not fully captured in our model. In addition, other factors such as team role, experience, or injuries are not included in the dataset, which may introduce omitted variable bias. For this reason, our findings describe statistical associations rather than causal conclusions.

Third, there is a possibility of unintended interpretation. A statistical relationship between height and performance does not imply that taller players are inherently superior. Our analysis is exploratory and academic in nature, and we will be careful not to make deterministic or discriminatory claims based on physical attributes.

Finally, we recognize that quantitative statistics cannot capture qualitative aspects of basketball, such as leadership, communication, or defensive positioning. These limitations will be clearly acknowledged when presenting our results.

## Team Expectations 

* *Communication channel & response time: We will use Discord as our primary communication channel. We aim to respond within 24 hours on weekdays. Important decisions and updates will be summarized in the main Discord channel so everyone stays aligned.*
* *Meeting schedule: We will meet at least once per week. Meetings will be held virtually unless the team agrees to meet in person. Each meeting will have a short agenda and end with clear action items and owners.*
* *Task ownership: Every task will have a clear owner + deadline. If someone cannot meet a deadline, they will notify the group at least 24 hours in advance and propose a new plan.*
* *Work quality & reproducibility: Before pushing, members will ensure their work is clear, documented, and the notebook can run from top-to-bottom.*
* *Division of labor: We will split work fairly across writing, coding, and analysis. Everyone will contribute to both content (writing) and technical work as appropriate.*
* *Respect & collaboration: We will communicate respectfully, assume good intent, and give constructive feedback. Disagreements will be discussed calmly with evidence (data/rubric) guiding decisions.*
* *Conflict resolution: If a conflict cannot be resolved within the group, we will escalate early (TA/instructor) rather than waiting until deadlines.*
* *Accountability: We will post a brief progress update in Discord before each weekly meeting. If someone is blocked, we will ask for help early rather than waiting until the deadline.*

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/4  |  1 PM | Finalize ProjectProposal.ipynb draft; add citations/footnotes; delete rubric cell; Restart & Run All  | Final review proposal (RQ, hypothesis, data plan, ethics, timeline); assign roles for Data Checkpoint | 
| 2/9  |  7 PM |  Download/get data; store in data/ (or script to fetch); write initial loading code | Confirm unit of analysis; decide cleaning rules (missing values/outliers); split wrangling tasks | 
| 2/11  | 7 PM  | Clean/wrangle v1; document steps; create codebook; draft 01 sections (dataset description + source citation)  | Review wrangling; check reproducibility; plan remaining tasks for 01-DataCheckpoint.ipynb   |
| 2/14  | 6 PM  | Finish 01-DataCheckpoint.ipynb (data source, ideal vs real dataset, storage/organization, wrangling summary) | Restart & Run All; polish writing; confirm what to submit   |
| 2/18  | 4 PM  | Commit/push current progress | Discuss division of labor and working on assigned parts |
| 2/18  | 9 PM  | Submit Data Checkpoint via git commit/push | Post-submit plan: list EDA questions + required plots; assign EDA plot owners |
| 2/23  | 7 PM | EDA plots v2 + captions; start 02-EDACheckpoint.ipynb write-up | Decide final metrics + baseline method(s); confirm train/test or statistical test plan |
| 3/4  | 7 PM  | Submit EDA Checkpoint via git commit/push | Lock in final analysis pipeline; assign final notebook sections and responsibilities |
| 3/8  | 7 PM  | Run main analysis/model (baseline + improved); save key tables/figures in results/ | Debug issues; check assumptions/validity; decide what goes into Results vs Discussion |
| 3/11  | 7 PM  | Draft 03-FinalProject.ipynb (methods, results, discussion, limitations, ethics) | Peer review writing; ensure narrative matches rubric; plan video outline |
| 3/15  | 7 PM  | Finalize notebook visuals + text; Restart & Run All; clean outputs; draft video script/slides | Final QA checklist; confirm all team contributions; prepare final push |
| 3/18  | 7 PM  | Submit Final project + Video; complete Post-course + Team Evaluation surveys | Confirm submission success; backup final files |