# COGS 108 - Data Checkpoint

## Authors

- Ava: Conceptualization, Writing (RQ, revisions)
- Bryan: Writing (original background), Data Collection/Wrangling/Analysis
- Cielo: Writing (original hypothesis), Data Collection/Wrangling/Analysis
- Austin: Writing (ethical concerns, data descriptions)
- Mary: Writing (schedule, data descriptions)


## Research Question

Among adults aged 18 years and older in the United States, how is average weekly intentional exercise (measured in minutes per week) associated with prevalence of obesity, diabetes, and self-reported mental health outcomes?


## Background and Prior Work

Health outcomes vary across the United States and are influenced by a combination of biological and lifestyle factors. Conditions like diabetes, obesity, and poor mental health cannot be attributed to a single cause: genetics, medical conditions, access to healthcare, and an individual's daily routine all play a role. Physical activity is one of the more modifiable of these factors, so our goal is to examine how intentional physical exercise is associated with adult obesity prevalence, diabetes prevalence, and mental health across the U.S.

Physical activity plays an important role in maintaining both physical and mental health. It includes everyday movements such as walking, running, and stretching, as well as structured workouts. For the purposes of our study, we examine physical activity from intentional exercise (cardio workouts, strength training, yoga, etc.), not activity from other sources such as work or commuting. According to the World Health Organization (WHO), adults should engage in at least 150 minutes of moderate-intensity physical activity per week to support health outcomes <a name="cite_ref-1"></a><a href="#cite_note-1">1</a>. Despite this recommendation, the amount of physical activity varies across individuals in the U.S. due to differences in work schedules, access to resources, and personal circumstances.

Previous research has shown an association between higher levels of physical activity and lower prevalence of obesity and diabetes. The Centers for Disease Control and Prevention (CDC) has reported that obesity and diabetes prevalence have steadily increased across the U.S. over the past decades<a name="cite_ref-2"></a><a href="#cite_note-2"> 2</a>. These trends are influenced by a variety of factors, including diet quality, limited physical activity, metabolic conditions such as thyroid disorders, and individual differences in metabolism. In this project, however, we focus on physical activity as the key variable and examine how it is associated with obesity and diabetes.

Beyond physical health, prior work has also examined the relationship between physical activity and mental health outcomes. Research from the Harvard T.H. Chan School of Public Health found that even modest amounts of physical activity, such as 15 minutes of exercise per day or one hour of walking, are associated with reduced risk of major depression and improved overall mental health outcomes<a name="cite_ref-3"></a><a href="#cite_note-3"> 3</a>. This supports including mental health as an outcome variable in our project, alongside obesity and diabetes, when we examine how physical activity varies across the U.S.

We noticed that existing research focuses mostly on the mental and physical health benefits for an individual. Our goal is to examine how physical activity relates to health at a broader geographic level, namely across the U.S. Differences in physical activity rates across states may help explain the observed variation in obesity prevalence, diabetes prevalence, and mental health indicators. This project aims to identify patterns that may help inform public health strategies and highlight disparities in physical and mental health outcomes related to physical activity.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) World Health Organization. (30 Aug 2020). 
<i>Physical activity</i>. https://www.who.int/initiatives/behealthy/physical-activity

2. <a name="cite_note-2"></a> [^](#cite_ref-2) Centers for Disease Control and Prevention. (28 Jan 2026). 
<i>National Diabetes Statistics Report</i>. https://gis.cdc.gov/grasp/diabetes/diabetesatlas-statsreport.html

3. <a name="cite_note-3"></a> [^](#cite_ref-3) Harvard Health Publishing. (1 May 2019). 
<i>More evidence that exercise can boost mood</i>. https://www.health.harvard.edu/mind-and-mood/more-evidence-that-exercise-can-boost-mood




## Hypothesis


We predict a strong negative correlation between minutes of intentional exercise and obesity rates and diabetes diagnosis rates among U.S. adults. We also predict that higher minutes of exercise will be positively correlated with improved self-reported mental health outcomes. 

We further hypothesize that these relationships will vary by age and gender. Specifically, we predict that men aged 18 to 30 will have higher intentional exercise minutes per week and correspondingly lower obesity and diabetes rates, which will be positively correlated with self-reported mental health outcomes. In contrast, we predict that women aged 40 and older will have lower intentional exercise minutes per week and higher obesity and diabetes rates, which will be negatively correlated with self-reported mental health outcomes.
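
One way the subgroup predictions above could be checked is by computing a correlation separately within each group. The sketch below is only an illustration on made-up toy values: the column `weekly_minutes` is a hypothetical derived column, and Pearson correlation is an assumption (a rank-based measure may suit 1-10 ratings better).

```python
import pandas as pd

# Toy data for illustration only; these values are invented, not from either dataset.
toy = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 4,
    "weekly_minutes": [0, 60, 120, 180, 0, 60, 120, 180],  # hypothetical derived column
    "Stress_Level": [8, 6, 4, 2, 5, 5, 4, 4],
})

# Pearson correlation between exercise minutes and stress, computed per gender.
corr_by_gender = {
    gender: group["weekly_minutes"].corr(group["Stress_Level"])
    for gender, group in toy.groupby("Gender")
}

for gender, r in corr_by_gender.items():
    print(f"{gender}: r = {r:.2f}")
```

A negative coefficient in a subgroup would be consistent with more exercise going along with lower stress in that group, but it would not by itself show causation.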

## Data

### Data overview

Dataset #1
  - Dataset Name: Diabetes Binary Health Indicators (BRFSS 2015)
  - Link to the dataset: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset?select=diabetes_binary_health_indicators_BRFSS2015.csv
  - Number of observations: 253,680
  - Number of variables: 22
  - Description of the variables most relevant to this project: Key variables include Diabetes_binary (binary outcome: 0 = no diabetes, 1 = prediabetes/diabetes), PhysActivity (binary: 0 = no exercise in past 30 days, 1 = yes), BMI (continuous body mass index), GenHlth (self-reported general health, 1-5), MentHlth (days of poor mental health in past 30 days), Age (categorical age groups, e.g., 18-24), and Sex (binary: 0 = female, 1 = male). These allow analysis of exercise associations with diabetes, obesity (via BMI), and mental health.
  - Descriptions of any shortcomings this dataset has with respect to the project: The data is from 2015, so it may not reflect current trends. PhysActivity is binary rather than minutes per week, which limits how precisely we can answer our research question. Self-reported variables introduce possible bias. Class imbalance is present, as only ~13% of respondents have prediabetes or diabetes.

Dataset #2
  - Dataset Name: Mental Health x Physical Activity
  - Link to the dataset: https://www.kaggle.com/datasets/safiyafatima/mental-health-x-physical-activity?resource=download
  - Number of observations: 300
  - Number of variables: 18
  - Description of the variables most relevant to this project: Key variables include Exercise_Frequency (days per week, 0-7), Exercise_Duration (minutes per session), Daily_Steps (average steps per day), Stress_Level/Anxiety_Level/Depression_Level/Happiness_Level (self-reported scales, 1-10), Age (continuous), and Gender (categorical). These allow us to calculate weekly exercise minutes (Frequency × Duration) and directly assess mental health outcomes via stress, anxiety, depression, and happiness levels.
  - Descriptions of any shortcomings this dataset has with respect to the project: The small sample size (n=300) limits generalizability to the broader U.S. adult population. All variables are self-reported, which again risks bias. Mental_Health_Score is based on an unvalidated formula devised by the dataset author (dropped in cleaning). Sampling bias is possible because of the small sample and unclear recruitment. Lastly, Exercise_Type has missing values (those rows are dropped in cleaning).

We plan to analyze the two datasets in several ways to provide a more comprehensive view: 
- aggregating both datasets by demographic subgroups (our target groups: age and gender) to compare exercise against health outcomes across similar populations
- calculating weekly exercise minutes from the mental health dataset's frequency and duration variables, and comparing these with the binary physical activity indicator in the diabetes dataset
- performing individual analyses on each dataset and synthesizing findings through qualitative comparison, such as examining how exercise relates to mental health in one versus physical health in the other, to identify overarching trends in U.S. adults
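
As a rough sketch of the second point above, weekly exercise minutes can be derived from frequency and duration and then reduced to a binary flag that is comparable in spirit to `PhysActivity`. The rows below are made-up toy values, and the derived column names (`weekly_minutes`, `any_exercise`, `meets_150`) are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy rows standing in for the mental health dataset (values are invented).
toy = pd.DataFrame({
    "Exercise_Frequency": [0, 3, 5],   # days per week
    "Exercise_Duration": [0, 30, 45],  # minutes per session
})

# Weekly minutes = sessions per week x minutes per session.
toy["weekly_minutes"] = toy["Exercise_Frequency"] * toy["Exercise_Duration"]

# Binary flag: any exercise vs. none (hypothetical analogue of PhysActivity).
toy["any_exercise"] = (toy["weekly_minutes"] > 0).astype(int)

# Hypothetical flag for the 150 min/week guideline cited in the background.
toy["meets_150"] = toy["weekly_minutes"] >= 150

print(toy)
```

Note that the two flags answer different questions: `any_exercise` mirrors the coarse yes/no measure in the diabetes dataset, while `meets_150` uses the finer information available only in the mental health dataset.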

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

#%load_ext autoreload
#%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
#%pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://drive.google.com/uc?export=download&id=1NJzzTgyM5pExxJJS6YQav2JGtYhBoGGI', 'filename':'mental_health_physical_activity.csv'},
    { 'url': 'https://drive.google.com/uc?export=download&id=1fGvIjW5TaZfea9buV7KzgNEtaCbaFePQ', 'filename':'diabetes.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Diabetes Risk Factors - BRFSS 2015

This dataset is a cleaned and consolidated version of the 2015 Behavioral Risk Factor Surveillance System (BRFSS) survey data, focusing on health indicators related to diabetes. The BRFSS is an annual, state-based, random-digit-dialed telephone survey conducted by the CDC, collecting data from over 400,000 Americans on their health-related risk behaviors, chronic health conditions, and use of preventative services. For this project, the dataset is valuable for exploring the relationship between physical activity (our key variable) and the prevalence of diabetes and obesity.

A. Important Metrics and Their Meaning

The dataset is available in three versions, but they all share the same core set of 21 feature variables. The key metrics relevant to our research question, along with their units and interpretation, are:

*   Diabetes Status (Target Variable): This is the primary outcome variable for diabetes.
    *   In `diabetes_012_health_indicators_BRFSS2015.csv`, this is a 3-class variable: `0` (no diabetes or only during pregnancy), `1` (prediabetes), and `2` (diabetes).
    *   In the two binary files, the variable `Diabetes_binary` is a 2-class variable where `0` indicates no diabetes and `1` indicates prediabetes or diabetes.
*   Physical Activity (`PhysActivity`): This is a binary variable directly relevant to our hypothesis. It indicates whether the respondent engaged in any exercise during the past 30 days, excluding their regular job. The units are binary: `0` (no) or `1` (yes). While it doesn't provide minutes per week, it serves as a high-level indicator of an active versus sedentary lifestyle.
*   Body Mass Index (`BMI`): This is a key metric for our obesity outcome. It is a calculated value based on the respondent's height and weight. With imperial inputs, the standard formula is `(weight in pounds / (height in inches)^2) * 703`, which gives a value in `kg/m²`. BMI is a continuous proxy for body fat based on height and weight, not a direct measurement of it.
*   General Health (`GenHlth`): This is a self-reported, ordinal variable assessing the respondent's perception of their general health on a scale of 1 to 5, where:
    *   `1 = excellent`
    *   `2 = very good`
    *   `3 = good`
    *   `4 = fair`
    *   `5 = poor`
    This can serve as a proxy for overall well-being.
*   Mental Health (`MentHlth`): This variable directly measures the number of days during the past 30 days on which the respondent experienced poor mental health, including stress, depression, and problems with emotions. It is reported as a count from 0 to 30 days. This is a crucial variable for our mental health outcome.
*   Demographic & Socioeconomic Variables: The dataset includes other important variables for stratified analysis, such as:
    *   Age (`Age`): A 13-level categorical variable, not a continuous age. For example, `1 = 18-24`, `9 = 60-64`, `13 = 80 or older`.
    *   Sex (`Sex`): Binary, with `0 = female` and `1 = male`.
    *   Income (`Income`): An 8-level ordinal scale of annual household income (e.g., `1 = less than $10,000`, `8 = $75,000 or more`).
    *   Education (`Education`): A 6-level ordinal scale of education level (e.g., `1 = Never attended school or only kindergarten`, `6 = College 4 years or more`).
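
To make the BMI formula above concrete, here is a small worked example. The function name and the example person are hypothetical, and the cutoff of 30 for obesity is an assumption based on the commonly used adult threshold; it is not defined in the dataset itself.

```python
def bmi_from_imperial(weight_lb: float, height_in: float) -> float:
    """Return BMI in kg/m^2 from pounds and inches, using the 703 conversion factor."""
    return weight_lb / height_in ** 2 * 703


# Assumption: commonly used adult obesity cutoff, not part of the dataset.
OBESITY_CUTOFF = 30

# Hypothetical respondent: 180 lb, 70 in tall.
example_bmi = bmi_from_imperial(180, 70)
is_obese = example_bmi >= OBESITY_CUTOFF

print(round(example_bmi, 1), is_obese)
```

In our analysis a flag like `is_obese` could be derived from the `BMI` column to estimate obesity prevalence, since the dataset stores BMI but not an explicit obesity indicator.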

B. Major Concerns with the Dataset

While this is a large and well-structured dataset, there are several important limitations and concerns to consider for our analysis:

1.  Physical Activity is Not a Continuous Variable: The most significant limitation for our specific research question is that physical activity is recorded as a binary variable (`PhysActivity`). It only indicates whether any activity occurred in the past 30 days, not the frequency, duration, or intensity of that activity. This prevents us from calculating "average weekly intentional exercise (measured in minutes per week)" as stated in our research question and hypothesis. We can only compare those who did any activity versus those who did none.

2.  Data is from 2015: The survey data is nearly a decade old (from 2015). Health behaviors and population health trends can change over time, so the findings may not perfectly reflect current associations.

3.  Self-Reported Data and Associated Biases: All variables, including `BMI`, `MentHlth`, `PhysActivity`, and `GenHlth`, are based on self-report. This introduces potential biases:
    *   Recall Bias: Respondents may not accurately remember their exact number of poor mental health days or physical activity over the past 30 days.
    *   Social Desirability Bias: Respondents may underreport weight (affecting `BMI`) or overreport physical activity to present themselves in a better light.

4.  Class Imbalance: The unbalanced files, including the `diabetes_binary_health_indicators_BRFSS2015.csv` file we use, have a significant class imbalance: far more respondents have no diabetes than have prediabetes or diabetes. This could bias a predictive model if not handled properly. A balanced 50/50 split version is also provided to mitigate this for classification tasks.

5.  Aggregated and Categorical Variables: Some variables are provided in categorical ranges (e.g., `Age`, `Income`) rather than as continuous values. This results in a loss of granularity and statistical power. For instance, we cannot pinpoint an exact age but only a range.

This dataset is excellent for its large, population-based sample from the U.S. and its rich set of health and demographic indicators. However, for our project, we must critically address the binary nature of the `PhysActivity` variable, as it is a crude measure that does not align perfectly with our research question about exercise minutes.
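
Because `PhysActivity` is binary, the most direct comparison available is outcome prevalence among active versus inactive respondents. The sketch below shows that comparison on a tiny made-up table; the numbers are invented and only the column names come from the dataset.

```python
import pandas as pd

# Toy data for illustration; the real file has 253,680 rows.
toy = pd.DataFrame({
    "PhysActivity":    [1, 1, 1, 1, 0, 0, 0, 0],
    "Diabetes_binary": [0, 0, 0, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 column within each group is that group's prevalence.
prevalence = toy.groupby("PhysActivity")["Diabetes_binary"].mean()

# Overall share of the positive class, useful for spotting class imbalance.
overall_positive_share = toy["Diabetes_binary"].mean()

print(prevalence)
print(overall_positive_share)
```

A difference in prevalence between the two groups describes an association only; variables such as age or BMI may account for part of it.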

In [None]:
import pandas as pd
diabetes = pd.read_csv("data/00-raw/diabetes.csv")
diabetes.head()

In [None]:
print("DataFrame shape: ", diabetes.shape)
diabetes.isna().sum()

We can see that the dataset is already cleaned, but has some extra columns we are not interested in for the purposes of this study.

In [None]:
# Columns not needed for our research question
cols_to_drop = [
    "Fruits", "Veggies", "CholCheck", "Smoker", "Stroke",
    "HeartDiseaseorAttack", "HvyAlcoholConsump", "AnyHealthcare",
    "NoDocbcCost", "Education", "Income",
]

clean_diabetes = diabetes.drop(columns=cols_to_drop)
clean_diabetes.columns

In [None]:
clean_diabetes.describe()

From these early statistics, we can see that respondents with prediabetes or diabetes make up only around 13% of the surveyed population. However, nearly 43% of this population has high blood pressure and 42% has high cholesterol, so our analysis will also examine the relationship between exercise and these diabetes risk indicators.

We also noticed that 17% of survey respondents reported difficulty walking, and we are curious whether this is associated with lower levels of physical activity.

Lastly, "Age" is currently coded as an integer from 1 to 13 indicating an age group, not raw age, so let's map each code to its age-range label.

In [None]:
# BRFSS age codes (1-13) mapped to their age-range labels
age_bins = {1: "18-24", 2: "25-29", 3: "30-34", 4: "35-39", 5: "40-44",
            6: "45-49", 7: "50-54", 8: "55-59", 9: "60-64", 10: "65-69",
            11: "70-74", 12: "75-79", 13: "80 or older"}

clean_diabetes["Age"] = clean_diabetes["Age"].map(age_bins)
clean_diabetes.head()
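
Mapping the codes to plain strings discards the fact that the age groups are ordered. One option, sketched below on a few toy codes, is to store the labels as an ordered categorical so that comparisons such as "40 and older" still work. The variable names are hypothetical, and the label list is our reading of the BRFSS age grouping.

```python
import pandas as pd

# Age-range labels in ascending order (assumed BRFSS grouping, codes 1-13).
age_labels = ["18-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54",
              "55-59", "60-64", "65-69", "70-74", "75-79", "80 or older"]

# Toy codes for illustration.
codes = pd.Series([13, 1, 9])

# Map integer codes to labels, then keep the ordering with an ordered categorical.
labels = codes.map(dict(enumerate(age_labels, start=1)))
age_cat = pd.Series(pd.Categorical(labels, categories=age_labels, ordered=True))

# Ordered categoricals support comparisons against a category label.
is_40_plus = (age_cat >= "40-44").tolist()

print(age_cat.tolist(), is_40_plus)
```

An ordered categorical also keeps the groups in age order on plot axes and in grouped summaries.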

In [None]:
import seaborn as sns
sns.histplot(x="BMI", hue="Diabetes_binary", data=clean_diabetes, bins=30)

This is an early look at the counts of respondents by diabetes status across the distribution of BMI. We are curious why there seem to be a few respondents at the tail end of the distribution with BMIs over 80; the summary statistics indicate a maximum of 98. We will continue to explore and evaluate this dataset in our next checkpoint.

In [None]:
clean_diabetes.to_csv('data/02-processed/diabetes_clean.csv', index=False)

### Physical Activity and its connection with Mental Health outcomes for individuals in the U.S.
This dataset contains individual-level information on lifestyle behaviors, physical activity habits, and self-reported mental health. The key physical activity metrics are Daily_Steps (average number of steps per day), Exercise_Duration (average duration per session, in minutes), and Exercise_Frequency (days per week the person exercises, 0-7). From these variables we can estimate the total minutes each participant exercises per week and compare that total with public health recommendations for adults, such as at least 150 minutes per week of moderate-intensity aerobic activity. The lifestyle variables include Sleep_Hours (average sleep per night, in hours), Screen_Time_Hours (non-work screen time per day), Diet_Quality (self-rated diet quality, 1 = poor, 5 = excellent), and Social_Interaction (number of meaningful social interactions per week). These metrics are important indicators of overall physical and mental health, and they provide context when interpreting exercise effects.

Mental health within the dataset is measured on self-reported scales from 1 to 10. The four measures are Stress_Level, Anxiety_Level, Depression_Level, and Happiness_Level, where 10 represents the highest intensity of the condition. The dataset also includes a calculated Mental_Health_Score, defined as (Happiness × 2) − ((Stress + Anxiety + Depression) / 3). This composite score attempts to capture overall well-being by giving more weight to happiness and subtracting the mean of the negative emotional measures. Higher values indicate better mental health, and lower (including negative) values indicate poorer well-being. However, these ratings are subjective and are not clinically validated, so they should be interpreted as simple indicators rather than medical diagnoses.
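
To see what range the composite score can take, the formula above can be evaluated at the extremes of the 1-10 scales. This is a minimal sketch; the function name is hypothetical and simply restates the formula given in the dataset description.

```python
def mental_health_score(happiness, stress, anxiety, depression):
    """Composite score: (Happiness x 2) - mean(Stress, Anxiety, Depression)."""
    return happiness * 2 - (stress + anxiety + depression) / 3


# Worst case on 1-10 scales: minimum happiness, maximum negative measures.
lowest = mental_health_score(1, 10, 10, 10)

# Best case: maximum happiness, minimum negative measures.
highest = mental_health_score(10, 1, 1, 1)

print(lowest, highest)
```

So, if all four inputs stay within 1 to 10, the score is bounded between -8 and 19, and values outside that interval would point to a data problem.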

We noticed several concerns with this dataset. The sample size is relatively small (300 participants), which limits generalizability; this matters because our conclusions are meant to describe the broader U.S. adult population. It is unclear how participants were chosen (randomly or through volunteer participation), so sampling bias is possible. Participants willing to share their health and fitness data may be more inclined to stay healthy, may have greater access to technology, and may differ in socioeconomic status from the general population. Moreover, a considerable share of the variables are self-reported, which introduces potential recall bias and social desirability bias. Additionally, because the dataset author developed their own scale for measuring mental health outcomes without validation, the resulting measure may not be reliable. Given these issues, findings from this dataset should be treated as exploratory and correlational rather than as definitive or causal.


In [None]:
import pandas as pd
activity_mental = pd.read_csv('data/00-raw/mental_health_physical_activity.csv')
activity_mental.head()

In [None]:
activity_mental.columns

In [None]:
print(activity_mental['ID'].duplicated().sum())
activity_mental.shape

After checking the columns and the shape of the dataset, we can see that there are no duplicate IDs and no merged columns that would make the dataset untidy: each row is one participant and each column is one variable. Since the duplicate check returned 0, we treat the dataset as tidy. Next, we check how many missing values it contains.

In [None]:
missing_values_am = activity_mental.isna().sum()
missing_values_am


After checking the missing values in the dataset, we found that 48 individuals did not specify their Exercise_Type. We can safely remove these rows since Exercise_Type is not our primary variable of interest. We may revisit this column later to determine whether a certain Exercise_Type is associated with more positive mental health outcomes. The "Notes" column is irrelevant to this project and will be dropped entirely since all 300 values are missing.

In [None]:
# Drop rows with a missing Exercise_Type, then drop the entirely empty Notes column
clean_am = activity_mental.dropna(subset=['Exercise_Type']).drop(columns='Notes')
clean_am

We now have a clean and tidy dataset relating weekly physical activity to mental health measures. We used drop() on the 'Notes' column because all of its values were missing. 'Exercise_Type' had 48 missing values, so we used dropna() to remove those rows entirely; this keeps the column complete in case we later examine whether a certain exercise type is associated with better mental health outcomes, even though it is not our primary variable of interest.

Let us now inspect the "Mental_Health_Score" column.

In [None]:
mh_score = clean_am['Mental_Health_Score']
mh_score.describe()

The "Mental_Health_Score" values range from -6.0 to 17.7 with a mean of 5.93. We flag this as a concern because these values have no clear clinical interpretation: the dataset provides no established scale for what counts as a good or bad score, and the score is calculated with a formula created by the author rather than a validated mental health instrument. The individual self-reported columns are a safer choice, so we drop this column and use only the self-reported values.

In [None]:
clean_am_sr = clean_am.drop(columns = 'Mental_Health_Score')
clean_am_sr.head()

We now have a clean and tidy table with no merged columns and no missing values, which gives us sufficient information to get started with our project. Below are summary statistics for the main variables we will use to test our hypothesis:

In [None]:
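# Summary statistics for the key variables.
# Note: describe() summarizes only numeric columns by default,
# so a non-numeric Gender column will not appear in the output.
key_cols = ['Age', 'Gender',
            'Exercise_Frequency', 'Exercise_Duration',
            'Stress_Level', 'Anxiety_Level',
            'Depression_Level', 'Happiness_Level']
clean_am_sr[key_cols].describe()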

In [None]:
clean_am_sr.to_csv('data/02-processed/mental_health_physical_activity_clean.csv', index=False)

## Ethics

### A. Data Collection
- [x] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

We do not interact with human subjects or collect any new data for this project. Both datasets are publicly available secondary sources that contain individual-level survey responses, so we rely on the consent procedures of the original data collectors and will report results only in aggregate.

- [x] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

We have acknowledged that the underlying datasets may reflect collection and survey biases (e.g., nonresponse, sampling bias). We will document known limitations of each dataset and avoid over-interpreting state-level aggregates where collection bias may distort the results.

- [x] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

Although the datasets contain individual-level responses, the variables we analyze (e.g., age group, sex, and health indicators) do not include names or contact information. We will not attempt to merge data in ways that could reconstruct individual identities.

- [x] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

Some data sources include demographic variables that allow subgroup comparisons. While our primary analysis is state-level, we recognize that aggregation can conceal disparities. We will note these limitations and, where possible, examine demographic breakdowns to identify patterns that aggregate measures might hide.

### B. Data Storage
- [x] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

All datasets are publicly available and non-sensitive, but we will store copies on password-protected personal and institutional systems and follow standard security practices to reduce risk of unauthorized access.

- [x] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

No individual-level personal data are collected or stored. Because no PII is used, a right-to-be-forgotten mechanism is not required; nevertheless, we will delete any derived files if requested and as described in our retention plan.

- [x] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

We do not plan to retain the project beyond the course. All project files and data will be deleted after the course concludes, unless the team agrees to preserve a cleaned, anonymized copy for future educational purposes and documents that decision.

### C. Analysis
- [x] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

We acknowledge that state-level analyses miss local and marginalized perspectives. We will explicitly call out these blindspots in our write-up and interpret findings conservatively, avoiding claims about individual-level causation.

- [x] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

We will examine missingness patterns, consistency across states, and known survey biases (e.g., recall bias, social desirability) and document these issues. Where feasible, we will avoid analyses that amplify biased signals and will be transparent about remaining biases.

- [x] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

All visualizations and statistics will be produced to accurately reflect the data and the uncertainty in estimates; we will include appropriate labels, scales, and caveats so readers are not misled.

- [x] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

No PII will be used or displayed in any analysis or figures.

- [x] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

We will document all data cleaning, merging, and analysis steps and provide code and notes to enable reproducibility and auditing of our results.

### D. Modeling
- [x] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

We considered whether physical activity and health outcomes could act as proxies for socioeconomic status or access to care and will interpret associations cautiously, avoiding language that blames individuals or states for structural disparities.

- [x] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

Because our analysis is at the state-aggregate level, fairness testing across individual demographic groups is limited. Where subgroup data are available, we will report descriptive comparisons and caveat the limitations of such comparisons.

- [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

We have not defined modeling metrics for a deployed decision system; this item is not applicable for our current exploratory/state-level analyses.

- [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

Not applicable for a deployed decision model; we will, however, ensure that any statistical models used are interpretable and that we explain key relationships in plain language.

- [x] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

We will explicitly communicate limitations of our analysis and potential biases in the report and presentation to avoid overgeneralization or misuse.

### E. Deployment
- [x] **E.1 Monitoring and evaluation**: Do we have a plan to monitor the model and its impacts after it is deployed?

This project is exploratory and not intended for deployment in decision-making. If outputs were to be used beyond this course, additional monitoring and validation procedures would be required and are outside the scope of the current work.

- [x] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results?

No direct users are affected by this class project; nonetheless, if interpretations or visualizations cause misinterpretation, we will correct and clarify our findings promptly in our documentation and public materials.

- [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

Not applicable because we are not deploying a production model.

- [x] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

We recognize that our results could be misused to blame individuals for health outcomes that are shaped by structural factors. To reduce misuse, we will provide clear context, document limitations, and emphasize structural determinants of health in our report.

## Team Expectations 


Ava, Bryan, Cielo, Austin, Mary


Read the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. By including each member’s name above and adding their name to the submission, each member indicates they have read the COGS108 Team Policies, accept these team expectations, and intend to fulfill them.

Team Expectation 1 — Communication
- We will communicate via iMessage and FaceTime. We will communicate respectfully and promptly.
- We expect to meet one to two times per week on FaceTime to review project components and will text throughout the week to check in and coordinate.
- If a member cannot attend a FaceTime meeting, they should review meeting notes and follow up via text as soon as possible.

Team Expectation 2 — Tone and Conflict
- Our tone will be polite, professional, and honest. We prioritize respect first and clear, direct communication second.
- If conflicts arise, we will handle them professionally and directly, clarifying expectations to avoid misunderstandings.

Team Expectation 3 — Decision Making
- Decisions will be made by majority vote (at least 3 of 5 members) to ensure fairness and timely progress.
- Exception: selecting the project idea requires unanimous agreement from all five members.
- If a decision must be made quickly, the majority of members present at that time may decide. If an absent member does not respond promptly via text, we will proceed without them.

Team Expectation 4 — Roles and Learning
- Members will naturally focus on tasks that match their strengths, but we will strive to collaborate so everyone learns all parts of the project.
- We will share knowledge and pair on tasks when feasible to distribute learning.

Team Expectation 5 — Timeline and Documentation
- We will maintain and update the project timeline to stay organized.
- The timeline will be reviewed and adjusted during our weekly FaceTime meetings.
- We will document meetings and decisions so responsibilities and progress are clear.

Team Expectation 6 — Handling Struggles and Deadlines
- If a member struggles to complete their assigned work, they should notify the group as soon as possible via text.
- When time allows, teammates will help. If we face a tight deadline, we will reassign the task to ensure the project stays on schedule.
- Reassigning work is a last-resort solution to avoid falling behind; we aim to distribute work fairly and support members early.

Team Expectation 7 — Accessibility of Expectations
- We have pasted these expectations into our group text chat and shared them as a file so every member can refer to them throughout the quarter.

## Project Timeline Proposal

Weekly meetings are held on FaceTime every Tuesday at 8:00 PM (unless otherwise noted). Ad-hoc communication for scheduling, quick questions, and file sharing happens in the group iMessage. Major milestones and meeting notes will be recorded in the repo (meetings/ or notes/ folder) after each meeting.

| Meeting Date | Meeting Time | Completed Before Meeting | Discussed at Meeting |
|---|---:|---|---|
| 1/26 | 6:00 PM | Looked for CSV files related to fitness and health issues; ensured everyone had the repo on GitHub | Project theme; research question; CSV file availability; repo access |
| 2/03 | 8:00 PM | Formed 3 possible research questions; found CSV files relating to our topic; reviewed two past COGS108 projects | Project proposal scope; split proposal assignments; availability and future meetings |
| 2/04 | 8:00 PM | Completed project proposal; formalized research question; background research | Submission logistics; GitHub branches and pull requests; next steps for data search |
| 2/10 | 8:00 PM | Import & begin wrangling data | Status update; review/edit wrangling; plan for upcoming data checkpoint |
| 2/17 | 8:00 PM | Data checkpoint prep | Data checkpoint; begin detailed data analysis; assess state of the project |
| 2/24 | 8:00 PM | Complete primary data analysis tasks and submit checkpoint deliverables | Finalize data analysis; plan timeline for finals work and presentation availability |
| 3/10 | 8:00 PM | Draft conclusion; prepare analysis results | Plan final project video; ensure all code/notebooks run; finalize remaining items |
| 3/17 | 8:00 PM | N/A | Turn in final project; complete group evaluation forms |

Planned milestone summary
- Week-to-week rhythm: research/wrangling during weekdays, discuss progress and blockers at Tuesday 8 PM meetings, and assign concrete tasks for the following week.
- Checkpoints:
  - Data checkpoint: 2/17 (in-meeting review of cleaned and merged datasets)
  - Analysis checkpoint: 2/24 (key analysis steps completed; draft figures)
  - Final draft & video planning: 3/10
  - Final submission & evaluations: 3/17
- Responsibilities: we will assign owners for data wrangling, EDA, modeling, visualization, write-up, and the final video; owners will update status in the repo prior to each meeting.

Special resources / training needed
- No special external tools beyond Python (pandas, numpy, matplotlib/seaborn, scikit-learn/statsmodels) and Jupyter notebooks are required. If we use wearable/device study data requiring special access, one member will handle the data-access request process and document required approvals.

Communication & conflict resolution
- Primary channels: iMessage for daily coordination; FaceTime for weekly meetings; GitHub issues and PRs for code review and task tracking.
- If conflicts arise, raise the issue in the group chat and bring it to the next FaceTime meeting; if urgent, schedule a short ad-hoc FaceTime. Decisions use majority vote (3 of 5) except for project topic selection which requires unanimity.