# COGS 108 - Project Proposal

## Authors

Team list and credits:

- Alexis Menor: Conceptualization, Background research, Writing – original draft
- Camdon Dreisbach: Methodology, Software, Data curation
- Ivan Li: Analysis, Visualization
- Joseph Tuazon: Project administration, Writing – review & editing
- Yuna Yeom: Analysis, Background research, Visualization

## Research Question

We aim to identify which social media behaviors are the strongest predictors of stress levels among Americans. Specifically, we will examine variables such as daily social media usage time, nighttime usage, number of platforms used, posting frequency, and passive versus active engagement, and evaluate their relationship with self-reported stress scores. We will analyze whether these relationships differ across age groups, including college-aged adults, young professionals, and older adults. Using multiple regression and tree-based machine learning models, we will determine which behaviors have the greatest explanatory power for stress and assess whether these variables can accurately classify individuals into high- and low-stress groups. This project will rely on publicly available lifestyle and mental health datasets to ensure that the analysis is data-driven and reproducible.
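As a rough illustration of the modeling plan above, the sketch below fits a multiple regression and a random forest to synthetic data and compares how each model ranks the predictors. The column names (`daily_hours`, `night_use`, `n_platforms`, `posts_per_week`, `passive_share`), the simulated effect sizes, and the use of scikit-learn are all assumptions made for illustration. They are not the project's actual data, variables, or results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for survey data; every column name here is hypothetical.
X = pd.DataFrame({
    "daily_hours": rng.uniform(0, 8, n),
    "night_use": rng.integers(0, 2, n),
    "n_platforms": rng.integers(1, 7, n),
    "posts_per_week": rng.poisson(3, n),
    "passive_share": rng.uniform(0, 1, n),
})
# Simulated stress score, driven only by daily_hours and passive_share (an assumption).
stress = 2.0 * X["daily_hours"] + 5.0 * X["passive_share"] + rng.normal(0, 1, n)

# Multiple regression on standardized features so coefficient sizes are comparable.
X_std = StandardScaler().fit_transform(X)
ols = LinearRegression().fit(X_std, stress)
ols_rank = pd.Series(np.abs(ols.coef_), index=X.columns).sort_values(ascending=False)

# Tree-based model: impurity-based feature importances (these sum to 1).
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, stress)
rf_rank = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)

print(ols_rank.round(2))
print(rf_rank.round(2))
```

Standardizing before the regression is one possible way to make coefficients comparable across predictors measured in different units. On real data, permutation importance or cross-validated performance could complement the impurity-based ranking shown here.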


## Background and Prior Work

Our group began our work with the premise that social media use has become a pervasive aspect of daily life for Americans across age groups. Its impact on mental health, particularly stress, is increasingly measurable in the era of digital engagement. Research investigating the relationship between social media behaviors and psychological outcomes, such as stress, anxiety, and emotional dysregulation, has expanded rapidly in recent years.

With the proliferation of survey datasets and behavioral measures of social media use, researchers have sought to determine whether specific social media behaviors—including usage intensity, active versus passive engagement, and addiction-like patterns—are associated with stress levels and overall psychological distress. Prior work has largely focused on college students due to their high engagement with social media and elevated risk for stress, as well as the availability of validated survey instruments that allow standardized measurement of both social media behaviors and stress outcomes.<a href="#ref1" id="note1b">1</a> However, understanding these relationships across broader age groups, including young adults, middle-aged, and older adults, is critical for generalizing findings beyond the college population.

One study by Cheng et al.<a href="#ref1" id="note1">1</a> used cross-sectional survey data to investigate the effects of problematic social media use on psychological distress among medical students. Regression analyses showed that frequent social media engagement was positively correlated with psychological distress, and maladaptive coping strategies mediated part of this relationship. This highlights the potential for specific social media behaviors to exacerbate stress rather than alleviate it.

Another study by Li et al.<a href="#ref2" id="note2">2</a> distinguished between passive and active social media use among 1,740 college students. Regression and mediation analyses indicated that passive use was positively correlated with social anxiety, whereas active engagement exhibited a negative correlation, mediated by communication capacity. These results suggest that not all social media behaviors are equally detrimental to psychological well-being, and that distinguishing between behavior types is critical for predictive modeling.

Longitudinal research by Smith et al.<a href="#ref3" id="note3">3</a> examined social media use before and during the COVID‑19 pandemic to understand its effects on stress and depressive symptoms. Survey data collected over multiple time points revealed that later stages of high social media engagement were associated with increased negative outcomes, underscoring the context-dependent nature of social media’s impact on stress.

Additional studies have applied structural equation modeling (SEM) to understand pathways linking social media addiction to psychological anxiety. Zhao et al.<a href="#ref4" id="note4">4</a> demonstrated that social media addiction directly increased anxiety and indirectly influenced it through reductions in self-efficacy and increases in negative coping strategies. Such analyses highlight the complexity of the relationship between social media behaviors and stress outcomes and suggest the importance of examining these pathways across diverse populations.

Together, these studies suggest that while social media use contains meaningful information regarding stress, the predictive strength of specific behaviors is uneven and highly dependent on the type of engagement, measurement tools, and contextual factors. Building on this prior work, our project aims to quantitatively identify which social media behaviors—including usage intensity, passive versus active patterns, and addiction-like features—are the strongest predictors of stress among Americans across age groups. Unlike prior studies, we will employ both multiple regression and tree-based machine learning models to assess predictive strength and variable importance, allowing us to uncover behavioral patterns most closely linked to stress outcomes across the population.

<ol>
<li id="ref1"><a href="#note1">^</a> Cheng, Y., et al. <i>Psychological distress, social media use, and academic performance among medical students.</i> BMC Medical Education, 2024. <a href="https://pubmed.ncbi.nlm.nih.gov/39272179" target="_blank">link</a></li>

<li id="ref2"><a href="#note2">^</a> Li, H., et al. <i>Relationship between passive/active social media use and social anxiety in college students.</i> Int. J. Environ. Res. Public Health, 2023. <a href="https://www.mdpi.com/1660-4601/20/4/3657" target="_blank">link</a></li>

<li id="ref3"><a href="#note3">^</a> Smith, J., et al. <i>Social media use and mental health among college students during COVID-19: A longitudinal study.</i> PubMed, 2024. <a href="https://pubmed.ncbi.nlm.nih.gov/38873817" target="_blank">link</a></li>

<li id="ref4"><a href="#note4">^</a> Zhao, L., et al. <i>The effects of social media addiction on college students’ psychological anxiety: SEM approach.</i> Frontiers in Psychology, 2025. <a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2025.1676899" target="_blank">link</a></li>

<li id="ref5"><a href="#note5">^</a> Cheng, Y., et al. <i>A scoping review of social media use and mental health outcomes among college students.</i> PubMed, 2024. <a href="https://pubmed.ncbi.nlm.nih.gov/40941586" target="_blank">link</a></li>
</ol>


## Hypothesis


We hypothesize that certain social media behaviors, particularly higher daily usage, nighttime use, and passive engagement, will be positively associated with stress levels among Americans across age groups. This expectation is based on prior studies showing that frequent or maladaptive social media behaviors contribute to psychological distress, social anxiety, and emotional dysregulation (Cheng et al., 2024; Li et al., 2023; Zhao et al., 2025). Conversely, we expect active and purposeful engagement to have a weaker or potentially negative association with stress, indicating that the type of social media interaction plays a critical role in predicting stress outcomes. Additionally, we expect the strength of these associations to vary by age group, with patterns observed among college-aged adults potentially differing from those in older populations.
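One way to probe the age-group part of this hypothesis is to estimate the usage–stress slope separately within each group. The sketch below does this on simulated data. The group labels, the "true" slopes, and the column names are hypothetical values chosen for illustration, not findings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical true slopes of stress on daily usage hours, by age group.
true_slopes = {"18-24": 1.5, "25-44": 1.0, "45+": 0.3}

frames = []
for group, slope in true_slopes.items():
    hours = rng.uniform(0, 8, 300)
    stress = 10 + slope * hours + rng.normal(0, 1, 300)
    frames.append(pd.DataFrame({"age_group": group, "daily_hours": hours, "stress": stress}))
df = pd.concat(frames, ignore_index=True)

# Fit a separate simple regression within each age group and keep the slope.
# np.polyfit with deg=1 returns [slope, intercept].
slopes = {
    group: np.polyfit(sub["daily_hours"], sub["stress"], deg=1)[0]
    for group, sub in df.groupby("age_group")
}
print({g: round(s, 2) for g, s in slopes.items()})
```

Per-group fits are only a descriptive first look. A formal comparison would more likely add an age-group × behavior interaction term to a single regression model.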

## Data

Ideal Dataset:

To answer our research question, the ideal dataset would include detailed behavioral and psychological measures from a large, representative sample of Americans across different age groups. Key variables should include:

- Social media behaviors: daily usage time, nighttime usage, number of platforms used, posting frequency, passive vs. active engagement, and platform-specific activity patterns.
- Mental health outcomes: self-reported stress scores, anxiety levels, depressive symptoms, and general psychological distress.
- Demographics: age, gender, education level, employment status, and geographic region, to control for potential confounding factors.

We would aim for at least 5,000–10,000 participants to ensure sufficient statistical power for regression and tree-based analyses while enabling age-group comparisons among college-aged adults (18–24), young professionals (25–44), and older adults (45+).
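The three age groups above can be derived from a raw age column with a simple binning step. This is a minimal sketch assuming ages are stored as integers. The sample values and label strings are made up for illustration.

```python
import pandas as pd

# Hypothetical participant ages; real values would come from the survey files.
ages = pd.Series([18, 24, 25, 44, 45, 70], name="age")

# Right-closed bins (17, 24], (24, 44], (44, inf] correspond to 18-24, 25-44, and 45+.
age_group = pd.cut(
    ages,
    bins=[17, 24, 44, float("inf")],
    labels=["college-aged (18-24)", "young professional (25-44)", "older adult (45+)"],
)
print(age_group.value_counts(sort=False))
```

Checking the group counts after binning is a quick way to see whether each age group is large enough to support separate comparisons.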

The data would ideally be collected via online surveys supplemented with mobile app tracking or digital logs for objective social media usage (with participant consent). Each row would represent a single participant, and columns would represent variables. The data would be stored in a tidy CSV format or relational database for straightforward analysis in Python or R.


Potential Real Datasets:

1. StudentLife (Dartmouth College)

Location & Access: Publicly available at https://studentlife.cs.dartmouth.edu/datasets.html. Researchers can download anonymized data directly from the site.

Important Variables: Daily phone usage, app usage patterns, social communication, EMA stress/mood ratings, and sleep/activity patterns. Nighttime usage and engagement indicators allow analysis of social media behavior, while stress and mood EMA scores provide psychological outcomes. This dataset is particularly useful for examining objective behavioral predictors of stress in college-aged adults.

2. HINTS (Health Information National Trends Survey)

Location & Access: Public-use datasets downloadable at https://hints.cancer.gov/data/download-data.aspx. No special permission is required; available in CSV, SAS, and SPSS formats.

Important Variables: Self-reported stress/mental health indicators, social media usage frequency and purpose, health information-seeking behaviors, and demographic variables including age and education. HINTS is nationally representative, enabling analysis of social media behaviors as predictors of stress across different age groups.

3. Pew Research Center Surveys

Location & Access: Survey datasets available at https://www.pewresearch.org/short-reads/2021/10/22/how-to-access-pew-research-center-survey-data/. Researchers need to create a free account to access raw survey data.

Important Variables: Social media platform usage, posting frequency, active vs. passive engagement, exposure to content, and demographic variables. Some surveys include self-reported stress or emotional response measures. This dataset provides a large, age-diverse U.S. adult sample, making it ideal for regression and tree-based modeling of social media behaviors predicting stress.
  

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> We recognize that the StudentLife dataset only includes college students, which may limit generalizability. HINTS and Pew provide more representative samples, but we will account for potential age and demographic biases in interpretation.
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
       
> Analyses will examine whether model predictions vary by age group to ensure that results are not biased toward any particular demographic.
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
       
> Limitations such as sampling bias, survey non-response, and generalizability will be explicitly stated in the report to ensure honest interpretation of findings.
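
The group-level check described under D.2 could look roughly like the sketch below, which compares misclassification rates across age groups. The labels, predictions, and column names are invented for illustration and do not come from any of the datasets or models in this proposal.

```python
import pandas as pd

# Hypothetical high-stress labels (1 = high stress) and model predictions.
results = pd.DataFrame({
    "age_group": ["18-24"] * 4 + ["25-44"] * 4 + ["45+"] * 4,
    "actual":    [1, 0, 1, 0,  1, 1, 0, 0,  1, 0, 0, 1],
    "predicted": [1, 0, 1, 1,  1, 0, 0, 0,  0, 1, 0, 1],
})

# Misclassification rate within each age group.
results["error"] = (results["actual"] != results["predicted"]).astype(int)
error_by_group = results.groupby("age_group")["error"].mean()
print(error_by_group)
```

A large gap between groups would be a signal to revisit the features or report the disparity as a limitation.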

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

* *Team Expectation 1 - Communication*
  - Team members will communicate clearly and respectfully and respond to messages within a day.
* *Team Expectation 2 - Task Responsibility*
  - All members will contribute fairly to all parts of the project. Tasks, progress, and code will be tracked and coordinated through GitHub.
* *Team Expectation 3 - Conflict Resolution*
  - If someone struggles with a task, they will inform the team early. We will work together to resolve issues respectfully and involve the professor only if problems persist.

## Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/30  |  11 AM |  Do Assignment Project Review; Discuss and decide on final project topic; Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal |
| 2/4  | 8 PM  | Assign group members to lead each specific part; Search for datasets  | Discuss Wrangling and possible analytical approaches;  Edit, finalize, and submit proposal |
| 2/14  | 6 PM  | Import & Wrangle Data; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |