# COGS 108 - Data Checkpoint

## Authors
Tony Chen: Background research, Writing - review & editing, Experimental investigation
Toby Jacob: Background research, Writing - original draft
Andrew Liang: Analysis, Conceptualization, Visualization
Celine Nguyen: Background research, Software, Methodology
Shivam Sharma: Background research, Software, Project administration

## Research Question

Is there a statistically significant difference in the average CGPA of university students who report sleeping 7+ hours per night compared to those who report sleeping fewer than 7 hours?


## Background and Prior Work

The relationship between sleep hygiene and academic success is a cornerstone of student health research. 
University students often experience a "sleep debt" due to high cognitive demands and the biological 
tendency for delayed sleep phases in young adults. Research indicates that sleep is not merely a passive 
state of rest but an active period for memory consolidation, where the hippocampus and neocortex interact 
to solidify information acquired during the day.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) 
When students drop below the recommended threshold of rest, they risk impairing executive functions, 
including attention span and decision-making capabilities, which are vital for maintaining a high CGPA.

The American Academy of Sleep Medicine emphasizes that getting enough sleep is vital for college students' academic success, noting that sleep-deprived students are more likely to earn lower grades and perform worse overall.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Although many students rely on "all-nighter" study sessions, regular, sufficiently long sleep provides a stronger foundation for long-term retention and academic endurance.

Furthermore, large-scale longitudinal studies have demonstrated that sleep quality, duration, and consistency are all significantly related to academic performance in college. Specifically, students who maintain a regular sleep schedule and average over seven hours of rest tend to have significantly higher GPAs than those with irregular or short-duration sleep patterns.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)  This suggests that the quality and quantity of rest may be a more powerful predictor of success than the sheer volume of study hours completed.

Using exploratory data analysis and a two-group comparison on student lifestyle data, we aim to test whether students on either side of a 7-hour sleep threshold differ significantly in average CGPA. Organizations like the Sleep Foundation have tracked how sleep affects school performance, providing a baseline that suggests a drop-off in academic metrics when students fall into the "insufficient sleep" category.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) Our project builds on this by isolating the 7-hour threshold to examine how sleep duration relates to academic standing. Because sleep duration is self-reported, any difference we find should be read as an association rather than proof that sleep causes higher grades.

References

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Okano, K., Kaczmarzyk, J.R., Dave, N. et al. (2019). 
   Sleep quality, duration, and consistency are associated with better academic performance in college students. 
   npj Science of Learning.  
   https://www.nature.com/articles/s41539-019-0055-z

2. <a name="cite_note-2"></a> [^](#cite_ref-2) American Academy of Sleep Medicine. (2017). 
   College students: Getting enough sleep is vital to academic success.  
   https://aasm.org/college-students-getting-enough-sleep-is-vital-to-academic-success/

3. <a name="cite_note-3"></a> [^](#cite_ref-3) Maheshwari, G., & Shaukat, F. (2019). 
   Impact of poor sleep quality on the academic performance of medical students. 
   Cureus.  
   https://pmc.ncbi.nlm.nih.gov/articles/PMC7381801/

4. <a name="cite_note-4"></a> [^](#cite_ref-4) National Sleep Foundation. (2023). 
   Sleep and School Performance.  
   https://www.sleepfoundation.org/children-and-sleep/sleep-and-school-performance



## Hypothesis


We predict that there will be a **large, statistically significant, and positive** difference in average CGPA between university students who report sleeping 7+ hours per night and those who report sleeping fewer than 7 hours per night; that is, the 7+ hour group will have the higher average. We believe that students who sleep more will be more alert in class and, as a result, will earn higher scores on homework assignments and exams on average. We therefore expect their CGPAs to be higher than those of students who sleep less.
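
To make "large" and "statistically significant" concrete, the two-group comparison could be sketched as below. This is only an illustration under stated assumptions: the CGPA values are made up (not project data), the variable names are hypothetical, and Welch's unequal-variance t statistic with Cohen's d is one reasonable choice of test and effect size, not necessarily the one the final analysis will use.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two samples."""
    # Squared standard errors, using sample variances (n - 1 denominator)
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Made-up CGPA values for illustration only, NOT project data
cgpa_7_plus = [3.6, 3.4, 3.8, 3.5, 3.7]
cgpa_under_7 = [3.1, 3.3, 2.9, 3.4, 3.0]

t_stat, dof = welch_t(cgpa_7_plus, cgpa_under_7)
d = cohens_d(cgpa_7_plus, cgpa_under_7)
print(f"t = {t_stat:.2f}, df = {dof:.1f}, Cohen's d = {d:.2f}")
```

A positive t statistic and a positive d both mean the 7+ hour group has the higher mean. A p-value can be obtained from the t statistic and degrees of freedom, for example with `scipy.stats.ttest_ind(a, b, equal_var=False)`, assuming SciPy is available in the project environment.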

## Data

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
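
The checklist in step 3 might be sketched as follows. This is a minimal sketch under stated assumptions: the column names `sleep_hours` and `cgpa` are hypothetical, the rows are made-up stand-ins for a file that would really be read from `data/00-raw/`, and dropping incomplete rows is just one possible missingness policy that would need its own justification.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data would be loaded from data/00-raw/
df = pd.DataFrame({
    "sleep_hours": [7.5, 6.0, None, 8.0, 25.0, 5.5],
    "cgpa": [3.6, 3.1, 3.4, None, 3.2, 2.9],
})

# Size of the dataset as (rows, columns)
print(df.shape)

# How much is missing, and in which columns
missing_per_column = df.isna().sum()
print(missing_per_column)

# Flag physically implausible sleep values (outside 0-24 hours) rather than
# silently dropping them; missing values are not counted as suspicious
df["suspicious"] = df["sleep_hours"].notna() & ~df["sleep_hours"].between(0, 24)

# Keep only rows with both analysis variables present and no flag
clean = df.dropna(subset=["sleep_hours", "cgpa"])
clean = clean[~clean["suspicious"]].copy()

# Binary grouping that matches the research question's 7-hour threshold
clean["sleeps_7_plus"] = clean["sleep_hours"] >= 7
print(clean)
```

In the actual notebook, the cleaned frame would then be written to `data/02-processed` (for example with `clean.to_csv(...)`), and the choice between dropping and filling missing values should be justified after checking whether missingness is related to either variable.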


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is to demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Team Expectations 

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Project Timeline Proposal

Instructions: Replace this with your timeline.  **PLEASE UPDATE your Timeline!** No battle plan survives contact with the enemy, so make sure we understand how your plans have changed.  Also, if you lost points on the previous checkpoint, fix the issues that cost you those points.