# COGS 108 - Project Proposal

## Authors

- Evan Honggo Widjojo: Conceptualization, Methodology
- Ahmad Bin Feizal: Background research
- Nicholas Chan: Data curation
- Fadi Gorgees: Background research, Methodology
- Neenos Yaldiko: Project administration


## Research Question

How are different styles of social media use, passive consumption versus active interaction, associated with wellbeing outcomes among adults aged 18+? We define passive consumption as behaviors like time spent browsing content and viewing posts or short-form videos, and active interaction as behaviors like posting, commenting, liking, and direct messaging. Our primary wellbeing outcomes are self-reported stress and self-reported happiness, and we will also examine sleep-related and basic physical health indicators when available. To reduce confounding in this observational study, we will control for demographic factors, socioeconomic status, and lifestyle variables such as physical activity and work hours. We will use regression-based statistical inference to estimate associations and will avoid causal claims.

## Background and Prior Work

Social media use is now a major component of everyday life. Instagram, Facebook, TikTok, and Snapchat attract daily engagement from a majority of adults. As social media use has grown, research has shifted towards the manner in which individuals engage with it. For this study, we use a widely adopted distinction: active interaction includes posting content, commenting, liking, and direct messaging, whereas passive consumption includes browsing feeds or watching short-form videos. Our initial research suggests that these usage styles may engage different psychological mechanisms, with passive use often linked to social comparison and envy, while active use may be linked to higher perceived support and social connectedness. However, associations with other wellbeing outcomes, including stress, happiness, sleep quality, and physical health, remain mixed and motivate further investigation.

One influential prior effort in this area is the Social Media Activity Questionnaire (SMAQ), which was derived from the Facebook Activity Questionnaire and administered to 1,230 participants. In this study, Ozimek, Brailovskaia, and Bierhoff separated behaviors into active and passive engagement, each rated on a scale from 1 to 5.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) They applied exploratory factor analysis (EFA) to identify behavioral dimensions and assessed mental health outcomes using established scales, including the Bergen Social Media Addiction Scale, the Fear of Missing Out (FoMO) scale, and the Depression Anxiety Stress Scales (DASS). In contrast to our project, this study focused on negative mental health indicators rather than estimating wellbeing directly, which remains our priority. The results showed a stronger association between active use and depression, anxiety, and stress, while passive use was more strongly associated with problematic behavioral tendencies such as addiction and FoMO. The researchers attributed these tendencies to upward social comparison and envy prompted by the content consumed.

One notable limitation of the SMAQ study was its sample, which was disproportionately young and female, limiting generalizability to the broader population. In addition, the data set was limited to Facebook users and geographically restricted to the Ruhr region of Germany. This makes it difficult to extrapolate its findings to older adults, different cultural contexts, and our platform of interest, Instagram.

Another perspective comes from the study “Are active and passive social media use related to mental health, wellbeing, and social support outcomes?”. This meta-analysis synthesizes findings from 141 quantitative studies of correlations between social media use styles and 13 mental health and wellbeing outcomes. Using pooled effect sizes, Godard and Holtzman report that neither active nor passive use strongly predicts wellbeing on average, with most effects being very small.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Active use had modest associations with perceived social support and wellbeing, but also with slightly higher anxiety. Passive use showed weak associations overall but was more often linked to worse emotional outcomes in general contexts. Crucially, this study highlights that demographic characteristics and contextual moderators, such as age and usage setting, have a more substantial influence on outcomes. These findings show the importance of controlling for confounding factors rather than relying on simplistic narratives like “active is good, passive is bad,” and this guides the focus of our project.

A study that treats age as a central contextual moderator is Underwood’s expanded literature review, which is specific to adolescents aged 10-19. Reviewing 16 peer-reviewed studies gathered from MEDLINE and PubMed via keyword search, the review concludes that both active and passive social media use are positively associated with negative mental health outcomes in adolescents.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) While evidence was insufficient to conclude that one style is categorically more harmful, the review underscores the importance of grounding research by age, as observed effects varied across stages of adolescence. Our study aims to limit such moderating factors by narrowing our analysis to Instagram usage among adults aged 18+. We aim to examine both positive and negative wellbeing outcomes, extending beyond happiness and stress to sleep and basic physical health indicators, while explicitly accounting for demographic, socioeconomic, and lifestyle confounders.
1. <a name="cite_note-1"></a> [^](#cite_ref-1) Ozimek, P., Brailovskaia, J., Bierhoff, H-W. (2023). Active and passive behavior in social media: Validating the Social Media Activity Questionnaire (SMAQ). Telematics and Informatics Reports, Volume 10, 100048, ISSN 2772-5030. https://doi.org/10.1016/j.teler.2023.100048
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Godard, R., Holtzman, S. (2024). Are active and passive social media use related to mental health, wellbeing, and social support outcomes? A meta-analysis of 141 studies. Journal of Computer-Mediated Communication, Volume 29, Issue 1, zmad055. https://doi.org/10.1093/jcmc/zmad055
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Underwood, L. (2024). Difference Between the Impact of Active Social Media Use and Passive Social Media Use on Adolescent Mental Health: An Expanded Literature Review. The Eleanor Mann School of Nursing Undergraduate Honors. https://scholarworks.uark.edu/nursuht/211

## Hypothesis


We hypothesize that among adults aged 18+, higher passive social media consumption will be associated with higher self-reported stress and lower self-reported happiness, while higher active interaction will be associated with lower stress and higher happiness. We expect this because passive browsing tends to increase social comparison and rumination, whereas active interaction is more likely to involve social connection and support, which can buffer stress and improve mood.

## Data

## 1) Ideal dataset

### Unit of observation
- One row per person, ideally with a time window attached.
- Best case: repeated measures (same person measured multiple times), because it helps us separate “who they are” from “what they did this week.”

### Variables we need (and why)

#### Unique ID + timestamps
- `user_id` (unique person ID)
- `date` or `week_id` (so we know the time window)

#### Main predictors (social media use style)
- Passive consumption (examples): minutes scrolling/browsing, number of posts viewed, short-form video viewing time
- Active interaction (examples): number of posts made, comments written, likes given, DMs/messages sent
- Ideally we also have platform (TikTok/IG/X/Reddit/etc.), because usage meaning can differ by platform.

#### Outcome variables (wellbeing)
- Primary: self-reported stress, self-reported happiness
- Optional if available: sleep (hours, sleep quality), basic physical health (self-rated health, fatigue), maybe mood indicators

#### Controls (to reduce confounding)
- Demographics: age, gender, education
- Socioeconomic: income or proxy, employment status
- Lifestyle: physical activity, work hours, student status (if relevant)
- Optional but helpful: baseline mental health, personality/introversion (only if the dataset includes it)
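
As a rough illustration of how the predictors, outcomes, and controls listed above could enter a regression, the sketch below fits an ordinary least squares model on simulated data. The variable names (`passive_hours`, `active_count`, and so on), the simulated effect sizes, and the helper `fit_ols` are all assumptions made for the demo. The real analysis would use the columns available in the chosen dataset and may rely on a dedicated statistics package instead of NumPy.

```python
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients and classical standard errors.

    X must already contain an intercept column.
    """
    n, k = X.shape
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # assumes homoskedastic errors
    return beta, np.sqrt(np.diag(cov))

# Simulated data only: the effect sizes below are invented for illustration.
rng = np.random.default_rng(0)
n = 2000
passive_hours = rng.gamma(shape=2.0, scale=1.0, size=n)
active_count = rng.poisson(lam=5, size=n).astype(float)
age = rng.integers(18, 70, size=n).astype(float)
work_hours = rng.normal(38, 8, size=n)
stress = (5.0 + 0.30 * passive_hours - 0.10 * active_count
          + 0.02 * work_hours + rng.normal(0, 1.0, size=n))

# Design matrix: intercept, two usage predictors, two controls.
X = np.column_stack([np.ones(n), passive_hours, active_count, age, work_hours])
names = ["intercept", "passive_hours", "active_count", "age", "work_hours"]
beta, se = fit_ols(X, stress)
for name, b, s in zip(names, beta, se):
    print(f"{name:>14}: {b:+.3f} (SE {s:.3f})")
```

The coefficients on the two usage variables are the quantities of interest; the controls are included so that those coefficients describe associations among otherwise similar people, not causal effects.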

### How many observations we need (rough target + reasoning)
- Goal: enough data so regression estimates are stable after adding controls.
- A practical target:
  - At least ~1,000 people for a basic model (passive + active + a handful of controls).
  - Better: 3,000–10,000 people if we want to test subgroups (e.g., different age groups) or add more controls without the model becoming unstable.
- If we have repeated measures (daily/weekly rows per person), then the “number of rows” can be large even if the number of people is smaller — but we still want a solid number of unique people for generalizable results.
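
To make the sample size reasoning more concrete, one rough check is the smallest correlation a sample of a given size can reliably detect. The sketch below uses the Fisher z approximation for a simple two-variable correlation with a two-sided 5% significance level and 80% power. These thresholds are conventional assumptions rather than values taken from our datasets, and a regression with several controls may need more data than this suggests.

```python
import math
from statistics import NormalDist

def min_detectable_r(n, alpha=0.05, power=0.80):
    """Smallest |r| detectable with a two-sided test (Fisher z approximation)."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.tanh((z_alpha + z_power) / math.sqrt(n - 3))

for n in (200, 1000, 3000, 10000):
    print(f"n = {n:>6}: min detectable |r| ~ {min_detectable_r(n):.3f}")
```

Under these assumptions, about 1,000 people is enough to detect correlations of roughly 0.09, which matters because prior work reports that most effects in this area are small.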

### Ideal data collection method (in a perfect world)
- Best-quality approach is a hybrid:
  - Phone/app logs to capture passive vs active behavior accurately (reduces memory bias), and
  - Short surveys for stress/happiness and sleep/health.
- If logs aren’t possible, then a well-designed survey can still work, but it’s more prone to self-report error.

### How the data would be stored/organized
- Use “tidy” structure:
  - Each row = one person in one time window (person-week or person-day)
  - Columns = predictors/outcomes/controls
- Keys:
  - Primary key could be (user_id, date/week_id) if repeated measures exist.
- Missingness handling plan:
  - Track missing values explicitly (don’t silently drop without checking).
  - If only a small amount is missing, we can use listwise deletion (drop rows) but we must report how many rows we dropped.
  - If missingness is larger, consider simple imputation (only if allowed / appropriate) and run sensitivity checks.
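
The key check and missingness plan above could be sketched as follows. The helper name `summarize_missingness`, the column names, and the toy values are hypothetical; the point is only to show reporting missingness before dropping any rows.

```python
import pandas as pd

def summarize_missingness(df, key_cols, analysis_cols):
    """Check the primary key and report missingness before any rows are dropped."""
    n_duplicate_keys = int(df.duplicated(subset=key_cols).sum())
    missing_share = df[analysis_cols].isna().mean()   # share missing per column
    complete = df.dropna(subset=analysis_cols)        # listwise deletion
    report = {
        "n_rows": len(df),
        "n_duplicate_keys": n_duplicate_keys,
        "missing_share": missing_share.round(3).to_dict(),
        "n_dropped": len(df) - len(complete),
    }
    return report, complete

# Tiny made-up example in person-week format.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "week_id": [1, 2, 1, 2, 1],
    "passive_minutes": [120, 90, None, 200, 45],
    "stress": [6, 5, 7, None, 3],
})
report, complete_cases = summarize_missingness(
    toy,
    key_cols=["user_id", "week_id"],
    analysis_cols=["passive_minutes", "stress"],
)
print(report)
```

Returning both the report and the complete-case table keeps the number of dropped rows attached to the data that was actually analyzed.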

---

## 2) Real dataset sources (5 Sources)

We propose to use Dataset 1 for our main analysis and to treat Datasets 2 to 5 as backup/validation datasets (or to strengthen the “real data sources” part of the proposal).

### Dataset 1 (Main): Kaggle — Social Media User Activity Dataset
This is our primary dataset because it is directly designed around social media activity and includes wellbeing outcomes aligned with our research question. The dataset is expected to let us measure both passive and active social media behaviors and relate them to stress/happiness, while controlling for demographics and lifestyle variables when available. Access is straightforward for students because it is hosted on Kaggle, and we can download it manually or use the Kaggle API after setting up credentials. A key limitation is that Kaggle datasets often rely on self-reported or synthetic/compiled data, so we will clearly describe what the dataset represents once we inspect the included documentation and column names.

- **URL:** https://www.kaggle.com/datasets/sadiajavedd/social-media-user-activity-dataset

#### Access requirements
- Kaggle account login
- Accept Kaggle dataset terms
- Optional: Kaggle API token if downloading in a notebook/script

#### How it maps to our ideal dataset
- Should contain: passive + active measures, wellbeing outcomes (stress/happiness), and at least some controls (we will confirm exact columns when we load the CSV)

#### Main gaps / how we handle
- If missing timestamps: treat it as cross-sectional (one row per person) and avoid time-based claims
- If missing key controls: we’ll use what exists and clearly state remaining confounding risk
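
Because the exact columns are not yet confirmed, a first step after download could be a column audit like the sketch below. Every column name in `EXPECTED` and the inline CSV are placeholders invented for illustration; they would be replaced with the real headers once the file has been inspected.

```python
import io
import pandas as pd

# Hypothetical column names: replace with the real headers after inspecting the file.
EXPECTED = {
    "passive": ["daily_scroll_minutes"],
    "active": ["posts_per_week", "comments_per_week"],
    "outcomes": ["stress_level", "happiness_score"],
    "time": ["date"],
}

def audit_columns(df, expected):
    """For each variable group, list which expected columns are found or missing."""
    present = set(df.columns)
    return {
        group: {
            "found": [c for c in cols if c in present],
            "missing": [c for c in cols if c not in present],
        }
        for group, cols in expected.items()
    }

# Stand-in for reading the downloaded file; this CSV content is made up.
sample_csv = io.StringIO(
    "daily_scroll_minutes,posts_per_week,stress_level,happiness_score\n"
    "95,2,6,5\n"
    "30,7,3,8\n"
)
df = pd.read_csv(sample_csv)
audit = audit_columns(df, EXPECTED)

# No timestamp column found means one row per person and no time-based claims.
is_cross_sectional = not audit["time"]["found"]
print(audit)
print("Treat as cross-sectional:", is_cross_sectional)
```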

---

### Dataset 2: Understanding Society (UK Household Longitudinal Study) — social media “looking” vs “posting”
This dataset is useful because it includes survey measures that separate looking at social networking sites (a passive-like behavior) from posting on social networking sites (an active-like behavior). That matches our “passive vs active” concept well, and it is paired with many wellbeing and background variables commonly used in social science (e.g., mental health and demographics). Access requires registering through the UK Data Service, which is extra steps compared to Kaggle, but still realistic for a university project. A limitation is that it is UK-based and the exact wellbeing measures may differ from “stress/happiness,” so we may need to use related outcomes (e.g., psychological distress or life satisfaction) if we use it for replication.

- **URL:** https://www.understandingsociety.ac.uk/

#### Access requirements
- Registration (typically through the UK Data Service)
- Agree to an End User License (EUL) / academic use conditions

#### Important variables (examples)
- `smlook` (frequency of looking at social networking sites)
- `smpost` (frequency of posting on social networking sites)
- Plus many demographic/SES variables and wellbeing/mental health measures

#### Main gaps / how we handle
- Different country + potentially different outcome measures → treat as “validation/replication,” not a direct replacement for our main dataset

---

### Dataset 3: CDC Youth Risk Behavior Surveillance System (YRBSS) — social media use + health/wellbeing proxies
YRBSS is a strong backup because it is a large, public, well-documented survey and includes youth health and wellbeing-related variables (mental health indicators, sleep, physical activity, etc.). It can support analysis linking amount/frequency of social media use to wellbeing-related outcomes, especially if we want a well-known, credible dataset source. However, it typically won’t separate passive vs active usage in detail, so it is better as a “bigger picture” comparison dataset rather than a perfect match. Access is easy: CDC provides documentation and data downloads.

- **URL:** https://www.cdc.gov/yrbs/index.html

#### Access requirements
- Public access (no API key needed); follow CDC data use guidance

#### Important variables (typical categories)
- Social media use frequency/time (when included in that survey year)
- Sleep and physical activity variables (commonly included)
- Mental health-related indicators (commonly included)

#### Main gaps / how we handle
- Covers high school students rather than our target population of adults aged 18+, and has limited “active vs passive” detail → use for robustness checks only, and keep conclusions narrow

---

### Dataset 4: Kaggle - Social Media and Mental Health
This dataset is helpful because it is directly about social media usage and mental health/wellbeing, and it is easy to access and analyze in a class setting. It can provide alternative measures of wellbeing (mental health indicators) and usage (time spent / habits), which can be used to check whether patterns match our main dataset’s results. Kaggle access is simple (download or API), which makes it “low friction” for the project timeline. A limitation is that the exact definitions of variables may not perfectly match our passive vs active definitions, so we would map the closest available variables and be explicit about the mismatch.

- **URL:** https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health

#### Access requirements
- Kaggle login + terms acceptance
- Optional Kaggle API token for programmatic download

#### Main gaps / how we handle
- If “passive vs active” is not explicit → create proxy variables (e.g., scrolling time as passive; posting/commenting frequency as active) only if the dataset supports it

---

### Dataset 5: GitHub project (data description) — Social Media Usage & Emotional Well-Being (columns include posts/likes/comments/messages)
This source is useful because it clearly lists a clean set of activity variables that match our active vs passive framing: daily usage time (passive-ish exposure) plus posts, likes, comments, and messages (active interaction). It also includes an emotion/wellbeing-related outcome field (dominant emotion), so it can be used as an additional dataset to test similar relationships even if the outcome is not exactly “stress/happiness.” This can strengthen our proposal by showing we have multiple realistic dataset options with the right activity structure. A limitation is that it may not include the exact control variables we want (SES/lifestyle), so it is best as a “supporting dataset,” not the main one.

- **URL:** https://github.com/JamshedAli18/Social-Media-Usage-and-Emotional-Well-Being-Analysis

#### Important variables (as listed)
- `Daily_Usage_Time` (minutes) (passive exposure proxy)
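- `Posts_Per_Day`, `Messages_Sent_Per_Day` (active interaction proxies)
- `Likes_Received_Per_Day`, `Comments_Received_Per_Day` (engagement received from other users; these reflect others' actions, so they are at best weak proxies for active interaction)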

#### Main gaps / how we handle
- Outcome is an emotion category (not stress/happiness) and controls may be limited → use only as secondary/illustrative analysis
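
One way to turn the listed columns into comparable passive and active measures is to standardize them, as in the sketch below. The column names come from the list above, but the choice to build the active index from only `Posts_Per_Day` and `Messages_Sent_Per_Day` (the user's own actions), the helper names, and the toy values are assumptions. Note also that `Daily_Usage_Time` mixes passive and active time, so it is only a rough exposure proxy.

```python
import pandas as pd

ACTIVE_COLS = ["Posts_Per_Day", "Messages_Sent_Per_Day"]  # the user's own actions
PASSIVE_COL = "Daily_Usage_Time"                          # exposure proxy, in minutes

def zscore(series):
    """Standardize a column to mean 0 and standard deviation 1."""
    return (series - series.mean()) / series.std(ddof=0)

def add_usage_indices(df):
    """Add z-scored passive and active indices on a comparable scale."""
    out = df.copy()
    out["passive_index"] = zscore(out[PASSIVE_COL])
    # Average the standardized active columns so neither dominates by scale.
    out["active_index"] = pd.concat(
        [zscore(out[c]) for c in ACTIVE_COLS], axis=1
    ).mean(axis=1)
    return out

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Daily_Usage_Time": [30, 60, 120, 180],
    "Posts_Per_Day": [0, 1, 3, 4],
    "Messages_Sent_Per_Day": [5, 10, 20, 25],
})
indexed = add_usage_indices(toy)
print(indexed[["passive_index", "active_index"]].round(2))
```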

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> Example of how to use the checkbox, and also of how you can put in a short paragraph that discusses the way this checklist item affects your project.  Remove this paragraph and the X in the checkbox before you fill this out for your project

 - [ ] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [ ] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

Instructions: REPLACE the contents of this cell with your work
  
Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1*
* *Team Expectation 2*
* *Team Expectation 3*
* ...

## Project Timeline Proposal

Instructions: REPLACE the contents of this cell with your work

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |