# COGS 108 - Project Proposal

## Authors

- Saemi Namgung: Conceptualization, Analysis, Project administration, Writing – original draft, Writing – review & editing
- Evelyn Na: Software, Data curation, Visualization, Methodology, Writing – review & editing
- Jookyoung Lee: Conceptualization, Analysis, Project administration, Software, Writing – review & editing
- Dani Delgado: Analysis, Background research, Visualization, Writing – original draft
- Agam Chahal: Background research, Prior Research Analysis, Software
- Alex Sun: Background research, Analysis, Proofreading

## Research Question

Sleep deprivation is increasingly common among university students, potentially undermining the cognitive processes required for academic success. Therefore, we ask: Does average nightly sleep duration ($TotalSleepTime$) significantly predict end-of-semester GPA ($term\_gpa$) for first-year college students when controlling for prior academic performance ($cum\_gpa$) and demographic factors?

## Background and Prior Work

Sleep plays an essential role in cognition, memory, and learning. Research in neuroscience shows that sleep is an active process during which the brain consolidates newly learned information and strengthens memory. These processes are important for attention, executive function, and long-term learning, all of which are necessary for academic success. When sleep is inadequate or disrupted, these cognitive functions can be impaired, making it more difficult for students to perform well academically (<a href="https://www.nature.com/articles/nrn2762" target="_blank">Diekelmann & Born, 2010</a>).

Several empirical studies have examined the relationship between sleep patterns and academic performance in student populations. A longitudinal study of medical students found that ongoing sleep deprivation over a three-month period was associated with declines in academic scores, suggesting that reduced sleep may negatively affect academic outcomes (<a href="https://www.cureus.com/articles/380070-impact-of-sleep-deprivation-on-cognition-and-academic-scores-a-three-month-longitudinal-study-among-indian-medical-students" target="_blank">Impact of Sleep Deprivation on Cognition and Academic Scores, Cureus</a>). In addition, Gaultney (2017) found that among college students, sleep duration and sleep timing were associated with academic performance measures such as GPA (<a href="https://www.tandfonline.com/doi/full/10.1080/10963758.2017.1297713" target="_blank">Gaultney, 2017</a>). Together, these studies provide evidence that sleep behavior is meaningfully related to academic outcomes, though they differ in how thoroughly they control for other influencing factors.

Although the relationship between sleep and academic performance has been studied before, this project does not aim to introduce a new theoretical idea. Instead, it focuses on confirming and extending existing findings using a data science approach and publicly available datasets. Many previous studies do not fully account for prior academic performance, which makes it harder to separate the effect of sleep from a student’s existing ability. By controlling for prior GPA and demographic factors, our analysis examines whether sleep duration remains a meaningful predictor of end-of-semester GPA.

To better understand the specific contribution of sleep to academic success, researchers recommend controlling for prior academic performance and relevant demographic variables. Prior GPA is a strong predictor of future academic outcomes and should be included in analytical models to avoid attributing existing differences in ability to sleep alone. By using regression models that incorporate prior performance and demographic controls, it becomes possible to evaluate whether sleep duration remains a significant predictor of end-of-semester GPA. This project follows that approach by examining the relationship between average nightly sleep duration and semester GPA while accounting for baseline academic performance and demographic factors.
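
One way to write the regression described above is sketched below. The coefficient symbols ($\beta_0$, $\beta_1$, $\beta_2$, $\boldsymbol{\gamma}$), the vector of demographic indicators $\mathbf{d}_i$, and the error term $\varepsilon_i$ are notation we introduce here for illustration; the final specification may differ once the data have been explored.

$$
\mathit{term\_gpa}_i = \beta_0 + \beta_1\,\mathit{TotalSleepTime}_i + \beta_2\,\mathit{cum\_gpa}_i + \boldsymbol{\gamma}^{\top}\mathbf{d}_i + \varepsilon_i
$$

Under this sketch, the research question amounts to asking whether $\beta_1$ differs from zero once $\mathit{cum\_gpa}_i$ and $\mathbf{d}_i$ are held constant.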


## Hypothesis


We hypothesize that average nightly sleep duration ($TotalSleepTime$) will be a positive and significant predictor of end-of-semester GPA ($term\_gpa$), such that an increase in sleep duration is associated with higher academic achievement even when controlling for prior performance ($cum\_gpa$) and demographic factors. This expectation is grounded in the established neurobiological role of sleep in memory consolidation, executive function, and emotional regulation, all of which are critical for the high-level cognitive processing required to succeed in a college environment. By accounting for baseline academic ability and demographic factors, we aim to isolate the association between sleep duration and a student's ability to maintain and improve their grades.
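
As a rough illustration of how this hypothesis could be checked, the sketch below fits an ordinary least squares model with NumPy and reports the coefficient, standard error, and t statistic for each predictor. The data are synthetic and generated only to exercise the code; they are not results from any real dataset. The helper name `fit_ols` and the simulated effect sizes are our own assumptions.

```python
import numpy as np

def fit_ols(X, y):
    """Fit y ~ intercept + X by least squares.

    Returns (coefficients, standard errors, t statistics),
    with the intercept in position 0.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = design.shape[0] - design.shape[1]      # n minus number of parameters
    sigma2 = residuals @ residuals / dof         # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = np.sqrt(np.diag(cov))
    return beta, se, beta / se

# SYNTHETIC data for illustration only (assumed effect sizes, not findings).
rng = np.random.default_rng(0)
n = 300
cum_gpa = rng.normal(3.2, 0.4, n)
sleep_hours = rng.normal(6.8, 0.8, n)
term_gpa = 0.5 + 0.1 * sleep_hours + 0.7 * cum_gpa + rng.normal(0, 0.3, n)

# Demographic indicator columns would be appended to this matrix as well.
predictors = np.column_stack([sleep_hours, cum_gpa])
beta, se, t_stats = fit_ols(predictors, term_gpa)

for name, b, s, t in zip(["intercept", "sleep_hours", "cum_gpa"], beta, se, t_stats):
    print(f"{name:12s} coef={b: .3f}  se={s:.3f}  t={t: .2f}")
```

A positive sleep coefficient with a large t statistic would be consistent with the hypothesis; in the real analysis a regression library that also reports p-values and confidence intervals would likely be more convenient.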

## Data

### 1. Ideal Dataset
To answer the research question—whether average nightly sleep duration predicts end-of-semester GPA for first-year college students while controlling for prior academic performance and demographic factors—the ideal dataset would have the following characteristics:

**A. Variables**
* **Outcome variable:** `term_gpa` (end-of-semester GPA)
* **Primary predictor:** `TotalSleepTime` (average nightly sleep duration, measured in hours or minutes)
* **Control variables:** `cum_gpa` (prior cumulative GPA as a measure of baseline academic performance), demographic variables such as age, sex/gender, race/ethnicity, and residency status (on-campus vs. off-campus).
* **Optional additional controls:** Socioeconomic indicators or academic workload (e.g., credit hours).

**B. Number of Observations Needed**
The ideal dataset would include data from several hundred first-year college students (at least 200–300 observations). This sample size would provide sufficient statistical power to estimate a multiple regression model with several control variables and detect moderate effect sizes.
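
The sample-size target can be sanity-checked with a quick calculation. The sketch below uses the Fisher z approximation for detecting a simple correlation of size `r` in a two-sided test. It ignores the control variables, so it is only a rough lower bound and not a full power analysis for multiple regression. The function name `required_n` and the default `alpha` and `power` values are our own assumptions.

```python
from math import atanh, ceil
from statistics import NormalDist

def required_n(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size needed to detect a correlation of size r.

    Uses the Fisher z approximation for a two-sided test; covariates are
    not accounted for, so treat the result as a rough guide.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)

for r in (0.2, 0.3):
    print(f"r = {r}: about {required_n(r)} students")
```

With these defaults the approximation gives roughly 194 students for r = 0.2 and 85 for r = 0.3, so a sample of 200–300 leaves some margin for the additional predictors.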

**C. Data Collection Method**
Sleep data would be collected continuously throughout the semester using validated methods such as wearable sleep trackers or daily sleep diaries. Academic performance data (term GPA and cumulative GPA) would be obtained from official university records. Demographic information would be collected through institutional records or standardized student surveys.

**D. Data Storage and Organization**
The data would be stored in a student-level dataset, where each row represents an individual student. Sleep data collected nightly would be aggregated to compute average sleep duration per student. Variables would be clearly labeled and accompanied by documentation describing measurement methods.
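
The aggregation step described above could look like the following pandas sketch. The column names `student_id` and `sleep_minutes` and the tiny example table are hypothetical, chosen only to show the reshaping from one row per night to one row per student.

```python
import pandas as pd

def average_sleep_per_student(nightly: pd.DataFrame) -> pd.DataFrame:
    """Collapse nightly sleep records into one row per student.

    Expects hypothetical columns 'student_id' and 'sleep_minutes'.
    """
    return nightly.groupby("student_id", as_index=False).agg(
        TotalSleepTime=("sleep_minutes", "mean"),    # average nightly sleep
        nights_recorded=("sleep_minutes", "count"),  # nights with non-missing data
    )

# Made-up records for illustration only.
nightly = pd.DataFrame(
    {
        "student_id": ["A", "A", "A", "B", "B"],
        "sleep_minutes": [420, 390, 450, 360, 300],
    }
)
per_student = average_sleep_per_student(nightly)
print(per_student)
```

Keeping a count of recorded nights alongside the average makes it possible to flag students whose average rests on very few observations.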

### 2. Real Dataset
One real dataset suitable for this project is the **CMU Sleep Study dataset**.

**A. Location and Access**
The dataset is publicly available through the CMU Statistics Data Repository at: https://cmustatistics.github.io/data-repository/psychology/cmu-sleep.html. No special permissions or applications are required to access or use the data.

**B. Important Variables**
This dataset includes `TotalSleepTime`, `term_gpa`, and `cum_gpa`, which directly correspond to the primary predictor, outcome variable, and key control variable in the research question. It also contains demographic variables (per the dataset documentation, gender, race, and first-generation status), allowing for basic demographic controls.
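
A first loading step might look like the sketch below. It assumes the CSV has already been downloaded from the repository page and only checks for the three columns named above. The function name `load_sleep_data` and the inline sample rows are hypothetical and stand in for the real file.

```python
import io
import pandas as pd

REQUIRED_COLUMNS = ["TotalSleepTime", "term_gpa", "cum_gpa"]

def load_sleep_data(source) -> pd.DataFrame:
    """Read the sleep study CSV and keep rows with complete key variables.

    `source` can be a file path or a file-like object.
    """
    df = pd.read_csv(source)
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    # Drop rows where the predictor, outcome, or key control is missing.
    return df.dropna(subset=REQUIRED_COLUMNS).reset_index(drop=True)

# Stand-in rows for illustration only; the real file would be passed as a path.
sample_csv = """TotalSleepTime,term_gpa,cum_gpa
400.5,3.5,3.4
,3.0,3.1
380.0,3.8,3.7
"""
demo = load_sleep_data(io.StringIO(sample_csv))
print(demo)
```

Failing loudly when an expected column is absent helps catch a renamed or mismatched file before any analysis runs.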

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
> This study involves human subjects and includes sensitive academic and behavioral data such as sleep duration, GPA, and demographic characteristics. Direct student data collection will require informed consent, which includes a detailed explanation of the study's objectives, the data's purpose, and the fact that participation is entirely voluntary. Participants will be told that participation or non-participation will not have an impact on their academic status.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
> Self-reported sleep data may introduce social desirability bias or recall bias. There may also be systematic differences between students who choose to participate and those who do not (e.g., participants may be more academically engaged). The results will be interpreted cautiously, and these limitations will be noted.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
> Personally identifiable information will not be collected or retained. Only the variables necessary for the research question will be included, and all data will be anonymized before analysis.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
> Demographic characteristics are included to adjust for confounding effects rather than to draw inferences about group differences. Their inclusion helps prevent erroneous findings caused by omitted variable bias and permits investigation of whether observed associations differ between groups.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
> Data will be stored on secure, access-restricted systems used for academic research. Only authorized members of the research team will have access to the dataset.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
> If data are collected directly from participants, individuals will be informed that they may request removal of their data prior to anonymization and analysis.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
> Data will be retained only for the duration of the course project and deleted afterward in accordance with course and institutional guidelines.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
> This analysis acknowledges that sleep duration is only one of many factors influencing academic performance. Limitations will be discussed in relation to broader contextual variables such as workload, financial stress, mental health, and institutional support.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
> Prior academic performance (cumulative GPA) is included as a control variable to reduce confounding. Remaining limitations due to unobserved variables will be explicitly acknowledged.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> Results will be presented transparently, emphasizing effect sizes and uncertainty rather than overstating significance.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
> No personally identifiable information will be used or displayed in analysis outputs.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> The analysis workflow will be documented and reproducible, allowing results to be reviewed or re-run if concerns arise.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> The model uses demographic characteristics, sleep duration, and past GPA for statistical adjustment, not to generate predictions or make decisions about individuals. These variables are not used to evaluate any individual student.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
> This study is not primarily concerned with group-level differences. To check that the results are not misleading or driven by a single subgroup, interaction effects and subgroup trends could be investigated descriptively.

 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
> Each coefficient of the regression model can be interpreted as the expected change in GPA associated with a one-unit change in that predictor (e.g., sleep duration), holding the other variables constant.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
> Limitations such as correlational design, potential measurement error, and omitted variables will be clearly stated.

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
> This research is not meant to inform decisions about individual students. However, there is a risk that results could be misused to blame students or enforce rigid behavioral expectations. To reduce this risk, findings will be presented as informative rather than prescriptive, with a focus on supportive, wellness-oriented interpretations.


## Team Expectations 


- **Communication:**
  Our team will communicate primarily through **messaging** for updates and questions, and **Zoom** for meetings. Team members are expected to check messages regularly and respond within 12 hours on weekdays and 24 hours on weekends. We will hold **weekly/bi-weekly Zoom meetings** to discuss progress, assign tasks, and address questions or concerns.

- **Respectful Collaboration:**
  We agree to communicate respectfully and constructively, especially when giving feedback or disagreeing. We assume all feedback is well-intentioned and aimed at improving the project.

- **Equal Contribution and Accountability:**
  All team members are expected to contribute equally in effort across the project. While roles may differ based on strengths (e.g., coding, writing, analysis), everyone will participate in coding, writing, and editing at some point. Team members will complete assigned tasks by agreed-upon deadlines and communicate early if they are struggling.

- **Decision Making:**
  Most decisions will be made through group discussion and consensus or majority agreement. For time-sensitive decisions, the team member responsible for that section may make the decision and update the group afterward.

- **Handling Conflict and Challenges:**
  If conflicts arise, we will address them openly and respectfully as a group. If a team member is unable to meet expectations, they are expected to notify the group as soon as possible so responsibilities can be adjusted. If issues persist, we will follow the course guidelines for addressing problem teammates.

By contributing to this project and adding our names to the submission, we confirm that we have read the COGS108 Team Policies, agree to these expectations, and intend to fulfill them throughout the quarter.

## Project Timeline Proposal



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/4  |  1 PM | Each member worked on assigned sections of the project proposal (research question, data, ethics, team expectations, timeline)  | Discuss and finalize research question; review proposal sections; allocate responsibilities to each team member | 
| 2/11  |  7 PM |  Identify and explore dataset; conduct preliminary data inspection; review background literature | Discuss dataset suitability and ethics; finalize variables; confirm overall analysis plan | 
| 2/18  | 7 PM  | Complete data wrangling and exploratory data analysis (EDA); draft initial visualizations  | Review and edit wrangling/EDA; discuss patterns and potential issues; refine analysis and visualization approach   |
| 3/4  | 7 PM  | Conduct main statistical analysis; update visualizations; draft results section | Discuss analysis results; interpret findings; plan discussion and conclusion sections   |
| 3/11  | 7 PM  | Draft full project (results, discussion, conclusion); integrate feedback | Review full project draft; finalize writing, figures, and interpretations |
| 3/18  | Before 11:59 PM  | Final edits and checks completed | Submit Final Project and complete Group Project Surveys |