# COGS 108 - Project Proposal

## Authors

- Joel Abutin
- Nitika Bhawe
- Gabriel Hilmen
- Arushi Patra
- Ishaanee Roy

## Research Question

Are demographic and biological variables that individuals cannot change (such as age and gender) more strongly correlated with self-rated daytime sleepiness (or sleep quality) than lifestyle variables that individuals can change (such as physical activity level and BMI), and do these two categories of variables interact with one another in predicting self-rated daytime sleepiness?

## Background and Prior Work

Sleep is an important process for cognitive functioning, emotional regulation, and physical health. Hence, understanding the factors that may influence how people sleep is important for both clinical research and public health interventions. Current research has identified certain externally influenceable lifestyle factors that relate to sleep, such as physical activity, screen time, chosen profession, and use of drugs such as alcohol. Through this project, we aim to observe the interaction between sleep quality and such factors.

Xu et al. examined the relationship between physical activity, self-reported screen time, and sleep quantity and quality. The study looks at a sample of 1136 adolescents aged 16-19 from the 2005–2006 National Health and Nutrition Examination Survey (NHANES), an age group that is less commonly studied in such research. Physical activity was estimated with an accelerometer (a wearable device), while screen time, sleep quality, and sleep quantity were self-reported for 30 days. They found that meeting recommended screen time guidelines was associated with significantly lower odds of reporting poor sleep quality, and that adolescents who met both physical activity and screen time guidelines had even lower odds of poor sleep, especially among males [1]. These results illustrate that modifiable behaviors like screen time and physical activity are linked to self-rated sleep quality and may interact differently depending on intrinsic factors such as the sex and behavior of the individual.

Bailey et al. aimed to categorise data from Fitbit devices collected from 30,445 participants in the All of Us Research Program. This program is a national effort to enroll more than 1 million participants for health research. It enables participants to donate Fitbit data, providing a unique dataset for physical activity (PA) and sleep research. For this study, days 15–21 after the consent date were selected for analysis of demographic characteristics, wear days, and wear-time proxy variables (such as heart rate) related to the amount of physical activity [2]. This study demonstrated a way to quantify variations in physical activity and sleep patterns without relying on surveys.

Nelson et al. examined how work demands influence sleep among nearly 3,000 adults from the Midlife in the United States (MIDUS) cohort. The researchers assessed multiple aspects of job demands such as intensity, role conflict, and job control, finding significant linear and quadratic relationships between job demands and sleep outcomes. The linear effects indicated that participants with higher job demands had worse sleep health, such as shorter duration, greater irregularity, greater inefficiency, and more sleep dissatisfaction. The quadratic effects indicated that sleep regularity and efficiency outcomes were best when participants' job demands were moderate rather than too low or too high [3]. These findings illustrate how variables like occupational stress and control may intersect with both internal and external influences on sleep quality in real-world populations.

Studies also show strong co-occurrence between insomnia and alcoholism. Colrain et al. reviewed a number of studies using different research methods, from self-reported data to EEG scans that analyse the brain waves indicating different stages of sleep. Alcohol has a profound impact on sleep, with effects that depend on acute versus chronic use and on dependence. While alcohol is initially sedating, this effect disappears after a few hours due to a decrease in REM sleep. This results in fragmented and disturbed sleep in the second half of the night. Sustained use of alcohol in chronic alcoholism is associated with major sleep problems [4]. Hence, this study shows multiple ways of gathering data to analyse sleep quality after alcohol use.

While these studies provide important insights, most rely on self-reported sleep measures and cross-sectional designs which introduce potential biases [1][3][4]. Nonetheless, they provide a strong foundation for examining how individual characteristics and lifestyle behaviors together influence perceived sleep quality, which is the focus of the present project.

References:

1. Relationship between Physical Activity, Screen Time, and Sleep Quantity and Quality in US Adolescents Aged 16–19 https://pmc.ncbi.nlm.nih.gov/articles/PMC6539318/
2. Fitbit Physical Activity and Sleep Data in the All of Us Research Program: Data Exploration and Processing Considerations for Research https://pmc.ncbi.nlm.nih.gov/articles/PMC12264798/#S22
3. Goldilocks at Work: Just the Right Amount of Job Demands May be Needed for Your Sleep Health https://pmc.ncbi.nlm.nih.gov/articles/PMC9991992/#S24
4. Alcohol and the sleeping brain https://pmc.ncbi.nlm.nih.gov/articles/PMC5821259/



## Hypothesis


Self-rated sleep quality is influenced by both modifiable and non-modifiable factors. Higher levels of modifiable health behaviors (e.g., greater physical activity and healthier BMI) will be associated with more positive self-reported sleep quality. However, this relationship will be moderated by non-modifiable characteristics such as age and gender. Specifically, the strength and direction of the association between modifiable factors and sleep quality will differ across age groups and between genders, indicating an interaction effect between variables that can be changed and variables that cannot be changed.
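One way the moderation described in this hypothesis could be tested is a linear model with an interaction term. The sketch below is illustrative only: it uses NumPy least squares on synthetic data, and the variable names (`activity`, `older`, `sleep_quality`), the helper `fit_interaction_model`, and all effect sizes are assumptions made for the example, not values from any dataset named in this proposal.

```python
import numpy as np

def fit_interaction_model(activity, older, sleep_quality):
    """Ordinary least squares for
    sleep_quality ~ activity + older + activity:older.

    Returns [intercept, activity, older, activity_x_older].
    """
    activity = np.asarray(activity, dtype=float)
    older = np.asarray(older, dtype=float)
    sleep_quality = np.asarray(sleep_quality, dtype=float)
    # Design matrix: intercept, two main effects, and their product.
    X = np.column_stack(
        [np.ones_like(activity), activity, older, activity * older]
    )
    coef, *_ = np.linalg.lstsq(X, sleep_quality, rcond=None)
    return coef

# Synthetic data (made up for illustration, not real participants).
rng = np.random.default_rng(0)
n = 500
activity = rng.uniform(0, 60, n)   # minutes of activity per day
older = rng.integers(0, 2, n)      # 1 = older age group, 0 = younger
# By construction the activity slope is 0.05 for the younger group
# and 0.05 - 0.03 = 0.02 for the older group.
sleep_quality = (
    5 + 0.05 * activity - 0.5 * older - 0.03 * activity * older
    + rng.normal(0, 0.5, n)
)

coef = fit_interaction_model(activity, older, sleep_quality)
for name, value in zip(
    ["intercept", "activity", "older", "activity_x_older"], coef
):
    print(f"{name}: {value:.3f}")
```

A non-zero `activity_x_older` coefficient would indicate that the association between activity and sleep quality differs between the two age groups. A full analysis would also need standard errors and significance tests, which this sketch omits.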



## Data

## 1:

   ### A. What variables are needed?

   We need to be able to test the relationship between demographic and biological variables and sleep quality. An ideal dataset would include quantitative measures of daily physical activity, BMI, gender, and sleep outcomes for adults within a defined age range.

   **Physical activity variables could include:**
   - Minutes of moderate/vigorous activity  
   - Calories burned
   - Sedentary time

   **Sleep variables could include:**
   - Total sleep duration  
   - Daytime sleepiness
   - Sleep quality score  
   - Self-reported restfulness after waking  

   

   ### B. How many observations are needed?

   An ideal dataset would include at least several hundred observations to allow for meaningful statistical analysis and visualization. With fewer observations, the analysis would have limited statistical power to detect the relationships we are interested in. A larger sample size would be very helpful if we are able to get one. If we are not able to find a large dataset, the next step would be to analyze multiple smaller ones.
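As a rough check on the target of several hundred observations, the sketch below estimates how many observations are needed to detect a Pearson correlation of a given size, using the standard Fisher z approximation. The correlation sizes, the 5% two-sided significance level, and the 80% power are assumed values chosen for illustration, and `n_for_correlation` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a Pearson correlation r
    with a two-sided test, based on the Fisher z transformation."""
    if not 0 < abs(r) < 1:
        raise ValueError("r must be non-zero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for power
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

# Assumed effect sizes, from weak to moderate correlations.
for r in (0.1, 0.2, 0.3):
    print(f"r = {r}: about {n_for_correlation(r)} observations")
```

Under these assumptions, weaker correlations require many more observations than moderate ones, which is why a dataset with several hundred rows is preferable for this project.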



   ### C. Who / what / how would these data be collected?

   The data would ideally be collected by wearable sleep monitors such as:
   - Smart watches  
   - Fitness trackers  

   For fitness variables, data should be collected via survey every day (or collected from wearables daily), with data collected over multiple days at a minimum.


   ### D. How would these data be stored/organized?

   The data would ideally be stored in a structured way that allows us to work with them using tools discussed in class, such as CSV files or a relational database. Each row would represent a single observation for one individual, with columns representing specific variables such as sleep quality, age, occupation, and activity level. For extremely large datasets, a database such as PostgreSQL or MySQL would be useful, with separate tables for participant information, activity data, and sleep data, but based on our research so far it is unlikely we will find data in that form.
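The layout described above, with separate tables for participants, activity, and sleep, can be sketched with Python's built-in `sqlite3` module standing in for PostgreSQL or MySQL. The table names, column names, and rows below are hypothetical and made up for illustration; they do not come from any dataset named in this proposal.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE participants (
    participant_id INTEGER PRIMARY KEY,
    age INTEGER,
    gender TEXT,
    bmi REAL
);
CREATE TABLE activity (
    participant_id INTEGER REFERENCES participants(participant_id),
    day TEXT,
    active_minutes REAL,
    PRIMARY KEY (participant_id, day)
);
CREATE TABLE sleep (
    participant_id INTEGER REFERENCES participants(participant_id),
    day TEXT,
    sleep_hours REAL,
    sleep_quality INTEGER,
    PRIMARY KEY (participant_id, day)
);
""")

# Made-up example rows.
conn.executemany(
    "INSERT INTO participants VALUES (?, ?, ?, ?)",
    [(1, 24, "F", 22.1), (2, 41, "M", 27.4)],
)
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?)",
    [(1, "2025-02-01", 45.0), (1, "2025-02-02", 30.0), (2, "2025-02-01", 10.0)],
)
conn.executemany(
    "INSERT INTO sleep VALUES (?, ?, ?, ?)",
    [(1, "2025-02-01", 7.5, 4), (1, "2025-02-02", 6.0, 3), (2, "2025-02-01", 5.5, 2)],
)

# Join the three tables into one row per participant per day.
rows = conn.execute("""
    SELECT p.participant_id, p.age, p.gender, p.bmi,
           a.day, a.active_minutes, s.sleep_hours, s.sleep_quality
    FROM participants AS p
    JOIN activity AS a ON a.participant_id = p.participant_id
    JOIN sleep AS s
      ON s.participant_id = a.participant_id AND s.day = a.day
    ORDER BY p.participant_id, a.day
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

The joined result has the one-row-per-observation shape described above, so it could be exported to a CSV file or loaded into a data frame for analysis.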

---

## 2:

   🔗 https://pmc.ncbi.nlm.nih.gov/articles/PMC6539318/

   This is a journal article hosted on PubMed Central with data tables giving the relationship between sleep quality and physical activity (there are other variables, but we will not need them). Because it is a journal article, it does not natively provide CSV files for the data, but we expect to be able to extract the data tables and export them in CSV format via pandas. This dataset is almost ideal. However, it only includes individuals aged 16–19, so we need to take that limitation into consideration.



   


   🔗 https://physionet.org/content/mmash/1.0.0/

   This is a PhysioNet dataset that gives us direct access to the data and includes many variables. We will be using the variables related to sleep and activity. Both the sleep and activity sections of this dataset are large and contain more specific variables that we will be using, such as:
   - Difference between medium/small activity  
   - Total sleep time  
   - Sleep quality  

   This would be considered an ideal dataset due to its size and number of variables involved.


  

## Ethics

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> NHANES (2024) is conducted on an ongoing basis, with public-use data released in two-year cycles. The sample for each two-year cycle is representative of the non-institutionalized U.S. population. Because participation in NHANES is voluntary, participants gave informed consent.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
> Until 2020, NHANES oversampled certain race, Hispanic-origin, age, and income groups; the sample design was later modified to remove this. In addition, to reduce oversampling of certain age groups (0-12, 12-19, and >70), the in-household survey was modified so that those aged 0-19 and >60 were also eligible, and a system was created to randomly select adults aged 25-59. Bilingual field interviewers were present when interviewing English- and Spanish-speaking respondents. Only Spanish- or English-speaking participants were chosen.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII), for example through anonymization or not collecting information that isn't relevant for analysis?
> The data are taken from the National Center for Health Statistics (NCHS, 2024), part of the Centers for Disease Control and Prevention (CDC). These surveys are conducted under the authority of the Public Health Service Act and are protected by federal confidentiality laws. The data are therefore anonymised and will be used only for statistical analysis. NCHS omits personal data and identifiers to avoid disclosing personal information, and has strict rules against violating these protections.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
> The data include demographic variables such as race/ethnicity. However, because our research question focuses on how other factors relate to sleep health, we may or may not use data on protected groups. If we do, the sample is designed to be representative of the non-institutionalized U.S. population, which should limit sampling bias.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
> As this is a public and de-identified dataset, individuals would have to ask the original data providers to remove their data.
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

## Team Expectations 

* All project members will communicate through Discord and respond to messages preferably within 8-12 hours.
* Meetings will occur at minimum weekly through Discord and tasks will be assigned during these meetings.
* Project members struggling on their tasks will ask for help as soon as possible so other members can provide assistance.
* If a member has not responded for a lengthy period, such as 48 hours or more, a welfare check will be attempted through Discord, email, and phone. If the member still does not respond, the team will contact a TA or the professor about what to do next.

## Project Timeline Proposal

| Meeting Date  | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 2025-02-02 Monday | 2:30 PM | Review the project proposal Jupyter notebook | Assign each project member to complete a section of the project proposal |
| 2025-02-04 Wednesday | Project proposal due | | |
| 2025-02-09 Monday | 3:00 PM | Review the data checkpoint Jupyter notebook | Assign each project member to complete a section of the data checkpoint |
| 2025-02-16 Monday | 3:00 PM | Each member completes as much of their assigned task as possible | Check in on each member's progress, assist other members if necessary |
| 2025-02-18 Wednesday | Data checkpoint due | | |
| 2025-02-23 Monday | 3:00 PM | Review the EDA checkpoint Jupyter notebook | Assign each project member to complete a section of the EDA checkpoint |
| 2025-03-02 Monday | 3:00 PM | Each member completes as much of their assigned task as possible | Check in on each member's progress, assist other members if necessary |
| 2025-03-04 Wednesday | EDA checkpoint due | | |
| 2025-03-09 Monday | 3:00 PM | Review the final project Jupyter notebook | Assign each project member to complete a section of the final project checkpoint |
| 2025-03-16 Monday | 3:00 PM | Each member completes as much of their assigned task as possible | Check in on each member's progress, assist other members if necessary |
| 2025-03-18 Wednesday | Final project due | | |