# COGS 108 - Data Checkpoint

- Joel Abutin
- Nitika Bhawe
- Gabriel Hilmen
- Arushi Patra
- Ishaanee Roy

## Research Question

Are demographic and biological variables that individuals cannot change (such as age and gender) more strongly correlated with self-rated daytime sleepiness (or sleep quality) than lifestyle variables that individuals can change (such as physical activity level and BMI), and do these two categories of variables interact with one another in predicting daytime sleepiness?

## Background and Prior Work

Sleep is an important process for cognitive functioning, emotional regulation, and physical health. Hence, understanding the factors that may influence how people sleep is important for both clinical research and public health interventions. Current research has identified several modifiable lifestyle factors that are associated with sleep, such as physical activity, screen time, and chosen profession.

Xu et al. examined the relationship between physical activity, self-reported screen time, and sleep quantity and quality. The study used a sample of 1136 adolescents aged 16-19 from the 2005–2006 National Health and Nutrition Examination Survey (NHANES), an age group that is less commonly studied in such research. They estimated physical activity with an accelerometer (a wearable device) and used self-reported data for screen time, sleep quality, and sleep quantity for 30 days. They found that meeting recommended screen time guidelines was associated with significantly lower odds of reporting poor sleep quality, and that adolescents who met both physical activity and screen time guidelines had even lower odds of poor sleep, especially among males [^1]. These results illustrate that modifiable behaviors like screen time and physical activity are linked to self‑rated sleep quality and may interact differently depending on intrinsic factors such as the sex and behavior of the individual.

Bailey et al. aimed to categorize data from Fitbit devices collected from 30,445 participants in the All of Us Research Program. This program is a national effort to enroll more than 1 million participants for health research. It enables participants to donate Fitbit data, providing a unique dataset for physical activity (PA) and sleep research. For this study, days 15–21 after the consent date were selected for analysis of demographic characteristics, wear days, and wear-time proxy variables, such as heart rate, for the amount of physical activity [^2]. This study demonstrates a way, other than surveys, to quantify variation in physical activity and possibly sleep patterns.

Nelson et al. examined how work demands influence sleep among nearly 3,000 adults from the Midlife in the United States (MIDUS) cohort. The researchers assessed multiple aspects of job demands, such as intensity, role conflict, and job control, and found significant linear and quadratic relationships between job demands and sleep outcomes. The linear effects indicated that participants with higher job demands had worse sleep health, such as shorter duration, greater irregularity, greater inefficiency, and more sleep dissatisfaction. The quadratic effects indicated that sleep regularity and efficiency were best when participants’ job demands were moderate rather than too low or too high [^3]. These findings illustrate how variables like occupational stress and control may intersect with both internal and external influences on sleep quality in real-world populations.

While these studies provide important insights, most rely on self-reported sleep measures and cross-sectional designs which introduce potential biases [^1][^3]. Nonetheless, they provide a strong foundation for examining how individual characteristics and lifestyle behaviors together influence perceived sleep quality, which is the focus of the present project.

References:

[^1] Relationship between Physical Activity, Screen Time, and Sleep Quantity and Quality in US Adolescents Aged 16–19 https://pmc.ncbi.nlm.nih.gov/articles/PMC6539318/

[^2] Fitbit Physical Activity and Sleep Data in the All of Us Research Program: Data Exploration and Processing Considerations for Research https://pmc.ncbi.nlm.nih.gov/articles/PMC12264798/#S22

[^3] Goldilocks at Work: Just the Right Amount of Job Demands May be Needed for Your Sleep Health https://pmc.ncbi.nlm.nih.gov/articles/PMC9991992/#S24

## Hypothesis


Self-rated sleep quality is influenced by both modifiable and non-modifiable factors. Higher levels of modifiable health behaviors (e.g., greater physical activity and healthier BMI) will be associated with more positive self-reported sleep quality. However, this relationship will be moderated by non-modifiable characteristics such as age and gender. Specifically, the strength and direction of the association between modifiable factors and sleep quality will differ across age groups and between genders, indicating an interaction effect between variables that can be changed and variables that cannot be changed.
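
The interaction effect described in this hypothesis can be illustrated with a small regression that includes a product term. The block below is only a sketch on synthetic data, not an analysis of NHANES: the variable names (`activity`, `older`, `sleepiness`) and the coefficients used to generate the data are assumptions chosen for illustration, and sleepiness is treated as a numeric score even though the real outcome is an ordered category.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data for illustration only; not drawn from NHANES.
activity = rng.normal(size=n)        # standardised modifiable factor
older = rng.integers(0, 2, size=n)   # 0/1 non-modifiable group indicator
noise = rng.normal(scale=0.1, size=n)

# Outcome constructed so that the activity slope is -0.5 in group 0
# and -0.5 + 0.4 = -0.1 in group 1 (the 0.4 is the interaction).
sleepiness = 2.0 - 0.5 * activity + 0.3 * older + 0.4 * activity * older + noise

# Design matrix: intercept, two main effects, and the product term.
X = np.column_stack([np.ones(n), activity, older, activity * older])
coef, *_ = np.linalg.lstsq(X, sleepiness, rcond=None)

names = ["intercept", "activity", "older", "activity:older"]
for name, value in zip(names, coef):
    print(f"{name:>15}: {value: .3f}")
```

A coefficient on the product term that is clearly different from zero would indicate that the association between the modifiable factor and the outcome differs between groups. For the ordered sleepiness categories in the actual dataset, an ordinal model may be more appropriate than ordinary least squares.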

## Data

### Data overview

  - Dataset #1
    - **Dataset Name:** NHANES 2017-2020 Customized Dataset
    - **Link to the dataset:** https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?Cycle=2017-2020
    - **Number of observations:** 680
    - **Number of variables:** 12
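    - **Description of the variables most relevant to this project:** `daytime sleepiness` is the outcome variable. It records how often a respondent reported daytime sleepiness per month (e.g. `2-4/mo`) and serves as the measure of self-rated sleep quality. The main predictors are `gender`, `age (year)`, `BMI`, `walk/bicycle (min/day)`, and `sedentary (min/day)`.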
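    - **Descriptions of any shortcomings this dataset has with respect to the project:** Most variables are self-reported and therefore subject to recall bias, several categorical variables use mixed frequency formats that need normalization, and the dataset lacks contextual variables such as socioeconomic status, occupation, health conditions, and geographic location.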
    - **Description of multiple datasets combined:** The NHANES data is split across multiple XPT files. Each subject has a unique ID called a SEQN (respondent sequence number) and may appear in several of these files. When the files are merged into a single dataset, rows from different files that share the same ID are combined into one row, and only subjects present in every file are kept.
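
The merge on SEQN described above can be sketched with three toy tables. This is a minimal illustration under stated assumptions: the rows are made up, `merge_on_seqn` is a hypothetical helper name, and only the column names (`SEQN`, `RIDAGEYR`, `BMXBMI`, `SLQ120`) come from the notebook.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for three NHANES component files (values are made up).
demo = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [34, 51, 27]})
body = pd.DataFrame({"SEQN": [1, 2, 4], "BMXBMI": [22.5, 31.0, 27.8]})
sleep = pd.DataFrame({"SEQN": [2, 1, 3], "SLQ120": [3, 0, 1]})


def merge_on_seqn(frames):
    """Inner-join a list of DataFrames on the shared SEQN column."""
    return reduce(
        lambda left, right: pd.merge(left, right, on="SEQN", how="inner"),
        frames,
    )


merged = merge_on_seqn([demo, body, sleep])
print(merged)
```

Because the join is an inner join, a subject missing from any one table (SEQN 3 and 4 here) is dropped from the result, which is one reason the merged dataset is much smaller than the individual files.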

### Dataset #1 - NHANES 2017-2020 Customized Dataset

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. The important metrics for Dataset #1 are:
   - Gender (female or male)
   - BMI (Body Mass Index) in kg/m^2,
   - How frequently alcohol is consumed (some of the responses were 1-2 times per week, 5+ times per week, 0 times per week, etc.)
   - Walking/bicycling in minutes per day
   - How many minutes per day the person was sedentary
   - Hours slept on the weekdays and hours slept on the weekend
   - Self-rated sleep quality (recorded as daytime sleepiness in the dataset; high daytime sleepiness corresponds to low sleep quality, and low daytime sleepiness corresponds to high sleep quality). For example, a response of 2-4/month means that the person had 2-4 days of poor-quality sleep that month.
   - How many cigarettes the person smoked per day/week (the responses included never, 1 per day, 2-7 per week)

   2. Some of the major concerns with this dataset are:
   - Self-reported data bias
        - Many of the variables in the dataset (e.g. hours slept, alcohol use, sedentary time, self-rated sleep quality) are self-reported and may not be accurate for each person. This introduces recall bias, which occurs when participants in a study do not accurately remember a past experience.
   - Inconsistent categorical formats
        - For example, the frequency of alcohol consumption uses mixed text formats (e.g., 1/day, 2–3/week, 5–6/year, never), which require extensive normalization before analysis.
   - Missing contextual variables
        - There is no clear information on socioeconomic status, occupation, health conditions, or geographic location, which are important potential confounders when analyzing this dataset.
    
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
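
The mixed frequency labels noted in the dataset concerns (for example `1/day`, `2-3/mo`, `never`) could be placed on a common numeric scale before analysis. The following is a minimal sketch, not part of the notebook's pipeline: `label_to_per_year` and `PERIODS_PER_YEAR` are hypothetical names, and replacing a range by its midpoint and an open-ended label by its lower bound are simplifying assumptions.

```python
import re

# Approximate number of periods per year for each unit used in the labels.
PERIODS_PER_YEAR = {"day": 365, "week": 52, "mo": 12, "year": 1}

# Matches labels such as '1/day', '2-3/mo', '5+/week', '7-11/year'.
_LABEL = re.compile(r"^(\d+)(?:-(\d+))?(\+)?/(day|week|mo|year)$")


def label_to_per_year(label):
    """Convert a frequency label to an approximate count per year.

    Ranges are replaced by their midpoint and open-ended labels such as
    '5+/week' by their lower bound; both are simplifying assumptions.
    """
    label = label.strip().replace("\u2013", "-")  # treat en dashes as hyphens
    if label == "never":
        return 0.0
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognised frequency label: {label!r}")
    low, high, _plus, unit = match.groups()
    count = (int(low) + int(high)) / 2 if high else float(low)
    return count * PERIODS_PER_YEAR[unit]


for label in ["never", "1-2/year", "2-3/mo", "1/week", "5-6/week", "1/day"]:
    print(label, label_to_per_year(label))
```

A numeric scale like this keeps the ordering of the categories while making them usable in correlations or regressions; the ordered categorical dtype used in the cleaning cell is an alternative that avoids committing to specific numeric values.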


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE

from functools import reduce
import numpy as np
import pandas as pd

# --------------------
# Demographics
# --------------------

df_demo = pd.read_csv('data/00-raw/NHANES_2017-2020_DEMO_DEMO.csv') # Import dataset
df_demo = df_demo[['SEQN', 'RIAGENDR', 'RIDAGEYR']]                 # Keep specific variables
df_demo = df_demo.rename(columns = {'SEQN' : 'ID',
                                    'RIAGENDR' : 'gender',
                                    'RIDAGEYR' : 'age (year)'})     # Make columns more readable
df_demo['ID'] = df_demo['ID'].astype(int)                           # Remove decimal from ID

df_demo = df_demo[df_demo['age (year)'].between(0, 80)].copy() # Only ages 0-80; copy so later column assignments do not act on a view
df_demo['age (year)'] = df_demo['age (year)'].astype(int)     # Remove decimal from age
df_demo['gender'] = df_demo['gender'].replace({1 : 'male',
                                               2 : 'female'}) # Convert numbers to string

# --------------------
# Body measures
# --------------------

df_bm = pd.read_csv('data/00-raw/NHANES_2017-2020_EXAM_BM.csv') # Import dataset
df_bm = df_bm[['SEQN', 'BMXBMI']]                               # Keep specific variables
df_bm = df_bm.rename(columns = {'SEQN' : 'ID',
                                'BMXBMI' : 'BMI'})              # Make columns more readable
df_bm['ID'] = df_bm['ID'].astype(int)                           # Remove decimal from ID

df_bm = df_bm[df_bm['BMI'].between(11.9, 92.3)] # Only BMI 11.9-92.3 (result must be assigned back to take effect)

# --------------------
# Alcohol usage
# --------------------

df_al = pd.read_csv('data/00-raw/NHANES_2017-2020_QUES_AL.csv') # Import dataset
df_al = df_al[['SEQN', 'ALQ121']]                               # Keep specific variables
df_al = df_al.rename(columns = {'SEQN' : 'ID',
                                'ALQ121' : 'alcohol usage'})    # Make columns more readable
df_al['ID'] = df_al['ID'].astype(int)                           # Remove decimal from ID

df_al['alcohol usage'] = df_al['alcohol usage'].replace({0 : 'never',
                                                           1 : '1/day',
                                                           2 : '5-6/week',
                                                           3 : '3-4/week',
                                                           4 : '2/week',
                                                           5 : '1/week',
                                                           6 : '2-3/mo',
                                                           7 : '1/mo',
                                                           8 : '7-11/year',
                                                           9 : '3-6/year',
                                                           10 : '1-2/year'}) # Convert numbers to description
order = pd.CategoricalDtype(categories = ['never',
                                          '1-2/year',
                                          '3-6/year',
                                          '7-11/year',
                                          '1/mo',
                                          '2-3/mo',
                                          '1/week',
                                          '2/week',
                                          '3-4/week',
                                          '5-6/week',
                                          '1/day'],
                            ordered = True)                                  # Step 1 of 2: define the category order
df_al['alcohol usage'] = df_al['alcohol usage'].astype(order)                # Step 2 of 2: apply the ordered dtype (unmapped codes become NaN)

# --------------------
# Physical activity
# --------------------

df_pa = pd.read_csv('data/00-raw/NHANES_2017-2020_QUES_PA.csv')      # Import dataset
df_pa = df_pa[['SEQN', 'PAD645', 'PAD680']]                          # Keep specific variables
df_pa = df_pa.rename(columns = {'SEQN' : 'ID',
                                'PAD645' : 'walk/bicycle (min/day)',
                                'PAD680' : 'sedentary (min/day)'})   # Make columns more readable
df_pa['ID'] = df_pa['ID'].astype(int)                                # Remove decimal from ID

df_pa = df_pa[df_pa['walk/bicycle (min/day)'].between(10, 840)]               # Only walk/bicycle 10-840
df_pa = df_pa[df_pa['sedentary (min/day)'].between(0, 1320)].copy()           # Only sedentary 0-1320; copy before assigning columns
df_pa['walk/bicycle (min/day)'] = df_pa['walk/bicycle (min/day)'].astype(int) # Remove decimal from walk/bicycle
df_pa['sedentary (min/day)'] = df_pa['sedentary (min/day)'].astype(int)       # Remove decimal from sedentary

# --------------------
# Sleep disorders
# --------------------

df_sl = pd.read_csv('data/00-raw/NHANES_2017-2020_QUES_SL.csv')             # Import dataset
df_sl = df_sl[['SEQN', 'SLD012', 'SLD013', 'SLQ030', 'SLQ120']]             # Keep specific variables
df_sl = df_sl.rename(columns = {'SEQN' : 'ID',
                                'SLD012' : 'hours slept (weekday)',
                                'SLD013' : 'hours slept (weekend)',
                                'SLQ030' : 'snore',
                                'SLQ120' : 'daytime sleepiness'})           # Make columns more readable
df_sl['ID'] = df_sl['ID'].astype(int)                                       # Remove decimal from ID

df_sl = df_sl[df_sl['hours slept (weekday)'].between(2, 14)] # Only sleep (weekday) 2-14
df_sl = df_sl[df_sl['hours slept (weekend)'].between(2, 14)].copy() # Only sleep (weekend) 2-14; copy before assigning columns

df_sl['snore'] = df_sl['snore'].replace({0 : '0/week',
                                         1 : '1-2/week',
                                         2 : '3-4/week',
                                         3 : '5+/week'}) # Convert numbers to description
order = pd.CategoricalDtype(categories = ['0/week',
                                          '1-2/week',
                                          '3-4/week',
                                          '5+/week'],
                            ordered = True)              # Step 1 of 2: define the category order
df_sl['snore'] = df_sl['snore'].astype(order)            # Step 2 of 2: apply the ordered dtype (unmapped codes become NaN)

df_sl['daytime sleepiness'] = df_sl['daytime sleepiness'].replace({0 : '0/mo',
                                                                   1 : '1/mo',
                                                                   2 : '2-4/mo',
                                                                   3 : '5-15/mo',
                                                                   4 : '16-30/mo'}) # Convert numbers to description
order = pd.CategoricalDtype(categories = ['0/mo',
                                          '1/mo',
                                          '2-4/mo',
                                          '5-15/mo',
                                          '16-30/mo'],
                            ordered = True)                                     # Step 1 of 2: define the category order
df_sl['daytime sleepiness'] = df_sl['daytime sleepiness'].astype(order)         # Step 2 of 2: apply the ordered dtype (unmapped codes become NaN)

# --------------------
# Smoking
# --------------------

df_sm = pd.read_csv('data/00-raw/NHANES_2017-2020_QUES_SM.csv') # Import dataset
df_sm = df_sm[['SEQN', 'SMQ040']]                               # Keep specific variables
df_sm = df_sm.rename(columns = {'SEQN' : 'ID',
                                'SMQ040' : 'cigarette usage'})  # Make columns more readable
df_sm['ID'] = df_sm['ID'].astype(int)                           # Remove decimal from ID

df_sm['cigarette usage'] = df_sm['cigarette usage'].replace({1 : '1/day',
                                                             2 : '2-7/week',
                                                             3 : 'never'})   # Convert numbers to description
order = pd.CategoricalDtype(categories = ['never',
                                          '2-7/week',
                                          '1/day'],
                            ordered = True)                                  # Step 1 of 2: define the category order
df_sm['cigarette usage'] = df_sm['cigarette usage'].astype(order)            # Step 2 of 2: apply the ordered dtype (unmapped codes become NaN)

# --------------------
# Cleanup
# --------------------

df_demo = df_demo.dropna() # Remove rows with NaN data
df_bm = df_bm.dropna()     # Remove rows with NaN data
df_al = df_al.dropna()     # Remove rows with NaN data
df_pa = df_pa.dropna()     # Remove rows with NaN data
df_sl = df_sl.dropna()     # Remove rows with NaN data
df_sm = df_sm.dropna()     # Remove rows with NaN data

# --------------------
# Merge dataset
# --------------------

df_list = [df_demo, df_bm, df_al, df_pa, df_sl, df_sm]
df_final = reduce(lambda left, right: pd.merge(left, right, on='ID', how='inner'), df_list) # Merge rows with same ID

df_final
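
The checkpoint also asks for the size of the data and for where values are missing. The block below is a hedged sketch of one way to report this before `dropna` is applied: `missingness_report` is a hypothetical helper, the toy rows are made up, and treating `7777` and `9999` as "refused" and "don't know" codes is an assumption that should be checked against the NHANES codebook for each variable.

```python
import numpy as np
import pandas as pd


def missingness_report(df, sentinel_codes=()):
    """Summarise missing values per column.

    Values listed in `sentinel_codes` (e.g. survey codes for "refused" or
    "don't know") are counted as missing in addition to NaN.
    """
    is_missing = df.isna() | df.isin(list(sentinel_codes))
    return pd.DataFrame(
        {
            "n_missing": is_missing.sum(),
            "pct_missing": (is_missing.mean() * 100).round(1),
        }
    ).sort_values("pct_missing", ascending=False)


# Toy frame standing in for one raw questionnaire file (values are made up).
raw = pd.DataFrame(
    {
        "SEQN": [1, 2, 3, 4, 5],
        "PAD645": [30, np.nan, 9999, 45, np.nan],
        "PAD680": [480, 600, 300, np.nan, 720],
    }
)

print(raw.shape)  # (rows, columns) before any cleaning
report = missingness_report(raw.drop(columns="SEQN"), sentinel_codes=[7777, 9999])
print(report)

# Rough check for systematic missingness: compare sedentary minutes
# between rows with and without a usable PAD645 value.
pad645_missing = raw["PAD645"].isna() | raw["PAD645"].isin([7777, 9999])
sedentary_by_missing = raw.groupby(pad645_missing)["PAD680"].mean()
print(sedentary_by_missing)
```

Running a report like this on each raw file, and again on `df_final`, would show how many rows each filter and `dropna` call removes and whether the dropped respondents differ from those retained.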

## Ethics

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Team Expectations 

- All project members will communicate through Discord and respond to messages preferably within 8-12 hours.
- Meetings will occur at minimum weekly through Discord and tasks will be assigned during these meetings.
- Project members struggling on their tasks will ask for help as soon as possible so other members can provide assistance.
- If a member has not responded for a lengthy period, such as 48 hours or more, the team will attempt a welfare check through Discord, email, and phone. If the member still has not responded, the team will contact a TA or the professor about what to do next.


## Project Timeline Proposal

| Meeting Date  | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 2025-02-02 Monday | 2:30 PM | Review the project proposal Jupyter notebook | Assign each project member to complete a section of the project proposal |
| 2025-02-04 Wednesday | Project proposal due | | |
| 2025-02-09 Monday | 3:00 PM | Review the data checkpoint Jupyter notebook | Assign each project member to complete a section of the data checkpoint |
| 2025-02-16 Monday | 3:00 PM | Each member completes as much of their assigned task as possible | Check up on each member's progress, assist other members if necessary |
| 2025-02-18 Wednesday | Data checkpoint due | | |
| 2025-02-23 Monday | 3:00 PM | Review the EDA checkpoint Jupyter notebook | Assign each project member to complete a section of the EDA checkpoint |
| 2025-03-02 Monday | 3:00 PM | Each member completes as much of their assigned task as possible | Check up on each member's progress, assist other members if necessary |
| 2025-03-04 Wednesday | EDA checkpoint due | | |
| 2025-03-09 Monday | 3:00 PM | Review the final project Jupyter notebook | Assign each project member to complete a section of the final project |
| 2025-03-16 Monday | 3:00 PM | Each member completes as much of their assigned task as possible | Check up on each member's progress, assist other members if necessary |
| 2025-03-18 Wednesday | Final project due | | |