# COGS 108 - Project Proposal

## Authors

- Evan Honggo Widjojo: Conceptualization, Methodology
- Ahmad Bin Feizal: Background research
- Nicholas Chan: Data curation
- Fadi Gorgees: Background research, Methodology
- Neenos Yaldiko: Project administration


## Research Question

How are different styles of social media use, passive consumption versus active interaction, associated with wellbeing outcomes among adults aged 18+? We define passive consumption as behaviors like time spent browsing content and viewing posts or short-form videos, and active interaction as behaviors like posting, commenting, liking, and direct messaging. Our primary wellbeing outcomes are self-reported stress and self-reported happiness, and we will also examine sleep-related and basic physical health indicators when available. To reduce confounding in this observational study, we will control for demographic factors, socioeconomic status, and lifestyle variables such as physical activity and work hours. We will use regression-based statistical inference to estimate associations and will avoid causal claims.
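
To make "regression with controls" concrete, here is a minimal sketch of an ordinary least squares fit in plain Python. The function name `ols_fit`, the variable layout, and the synthetic rows are assumptions for illustration only; they are not taken from any of our datasets, and the actual analysis may use a different tool.

```python
from typing import List, Sequence


def ols_fit(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    An intercept column is added automatically, so the result is
    [intercept, coef_1, ..., coef_k] in the column order of X.
    """
    rows = [[1.0, *map(float, r)] for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot fit")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic rows of (passive_hours, active_actions, age); NOT real data.
X = [(1, 0, 20), (2, 1, 30), (3, 5, 25), (0, 2, 40), (4, 4, 35), (5, 1, 22)]
# Outcome built from known coefficients so the fit can be checked by eye.
y = [2.0 + 0.5 * p - 0.3 * a + 0.1 * age for p, a, age in X]
print([round(c, 3) for c in ols_fit(X, y)])
```

Because age enters the model alongside the two usage variables, the passive and active coefficients are estimated holding age fixed, which is the sense in which we "control for" a variable. A statistics library would additionally report standard errors and confidence intervals, which this sketch omits.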

## Background and Prior Work

Social media use is now a major component of everyday life, and platforms such as Instagram, Facebook, TikTok, and Snapchat see daily engagement from a majority of adults. As social media use has grown, research has shifted toward the manner in which individuals engage with it. For this study, we adopt a widely used distinction: active interaction includes posting content, commenting, liking, and direct messaging, whereas passive consumption includes browsing feeds or watching short-form videos. Our initial research suggests that these usage styles may engage different psychological mechanisms, with passive use often linked to social comparison and envy, while active use may be linked to higher perceived support and social connectedness. However, associations with other wellbeing outcomes, including stress, happiness, sleep quality, and physical health, remain mixed, which motivates further investigation.

One influential prior effort in this area is the Social Media Activity Questionnaire (SMAQ), which was derived from the Facebook Activity Questionnaire and validated in a survey of 1,230 participants. In this study, Ozimek, Brailovskaia, and Bierhoff separated behaviors into active and passive engagement, each rated on a scale from 1 to 5.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Exploratory factor analysis (EFA) was used to identify behavioral dimensions, and mental health outcomes were assessed using established scales, including the Bergen Social Media Addiction Scale, the Fear of Missing Out (FoMO) scale, and the Depression Anxiety Stress Scales (DASS). In contrast to our project, this study focused on negative mental health indicators rather than estimating wellbeing directly, which remains our priority. The results showed a stronger association between active use and depression, anxiety, and stress, while passive use was more strongly associated with problematic behavioral tendencies like addiction and FoMO. The researchers attributed these tendencies to upward social comparison and envy enabled by the content consumed.

One notable limitation of the SMAQ study was its disproportionate sample of mostly young, female participants, which limits generalizability to the broader population. In addition, the data set was limited to Facebook users and geographically restricted to the Ruhr region of Germany. This makes it difficult to extrapolate its findings to older adults, different cultural contexts, and our specific platform of choice, Instagram.

Another perspective is provided by the study “Are active and passive social media use related to mental health, wellbeing, and social support outcomes?”. This meta-analysis synthesizes findings from 141 quantitative studies that examine correlations between social media use styles and 13 mental health and wellbeing outcomes. Using pooled effect sizes, Godard and Holtzman report that neither active nor passive use strongly predicts wellbeing on average, with most effects being very small.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Active use had modest associations with perceived social support and wellbeing, but also with slightly higher anxiety. Passive use showed weak associations overall but was more closely linked to worse emotional outcomes in general contexts. Crucially, this study highlights that demographic characteristics and contextual moderators, such as age and usage setting, have a more substantial influence on outcomes. These findings show the importance of controlling for confounding factors rather than relying on simplistic narratives like “active is good, passive is bad,” which becomes a guiding focus of our project.

A study that foregrounds age as a contextual moderator is Underwood’s expanded literature review, which is specific to adolescents aged 10-19. Reviewing 16 peer-reviewed studies gathered from MEDLINE and PubMed via keyword search, the review concludes that both active and passive social media use are positively associated with negative mental health outcomes in adolescents.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) While the evidence was insufficient to conclude that one style is categorically more harmful, this study shows the importance of grounding research by age, as observed effects varied across different stages of adolescence. Our study aims to limit these moderating factors by narrowing our analysis to Instagram usage among adults aged 18+. We aim to examine both positive and negative wellbeing outcomes, extending beyond happiness and stress to include sleep and basic physical health indicators, while explicitly accounting for demographic, socioeconomic, and lifestyle confounders.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Ozimek, P., Brailovskaia, J., & Bierhoff, H.-W. (2023). Active and passive behavior in social media: Validating the Social Media Activity Questionnaire (SMAQ). Telematics and Informatics Reports, 10, 100048. https://doi.org/10.1016/j.teler.2023.100048
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Godard, R., & Holtzman, S. (2024). Are active and passive social media use related to mental health, wellbeing, and social support outcomes? A meta-analysis of 141 studies. Journal of Computer-Mediated Communication, 29(1), zmad055. https://doi.org/10.1093/jcmc/zmad055
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Underwood, L. (2024). Difference Between the Impact of Active Social Media Use and Passive Social Media Use on Adolescent Mental Health: An Expanded Literature Review. The Eleanor Mann School of Nursing Undergraduate Honors. https://scholarworks.uark.edu/nursuht/211

## Hypothesis


We hypothesize that among adults aged 18+, higher passive social media consumption will be associated with higher self-reported stress and lower self-reported happiness, while higher active interaction will be associated with lower stress and higher happiness. Our reasoning is that passive browsing may increase social comparison and rumination, whereas active interaction is more likely to involve social connection and support, which can buffer stress and improve mood.
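
One way to state this hypothesis formally is as sign predictions on regression coefficients. The notation below is our own shorthand and an assumption about the eventual model form: $i$ indexes people, $\mathbf{X}_i$ collects the demographic, socioeconomic, and lifestyle controls, and $\varepsilon_i$, $\eta_i$ are error terms.

$$
\begin{aligned}
\text{Stress}_i &= \beta_0 + \beta_1\,\text{Passive}_i + \beta_2\,\text{Active}_i + \boldsymbol{\gamma}^{\top}\mathbf{X}_i + \varepsilon_i \\
\text{Happiness}_i &= \alpha_0 + \alpha_1\,\text{Passive}_i + \alpha_2\,\text{Active}_i + \boldsymbol{\delta}^{\top}\mathbf{X}_i + \eta_i
\end{aligned}
$$

Under this formulation, the hypothesis predicts $\beta_1 > 0$ and $\beta_2 < 0$ for stress, and $\alpha_1 < 0$ and $\alpha_2 > 0$ for happiness. These coefficients describe associations conditional on the controls, not causal effects.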

## Data

## 1) Ideal dataset

### Unit of observation
- **One row per person**, ideally with a **time window** attached.
- Best case: **repeated measures** (same person measured multiple times), because it helps us separate “who they are” from “what they did this week.”

### Variables we need (and why)

#### Unique ID + timestamps
- `user_id` (unique person ID)
- `date` or `week_id` (so we know the time window)

#### Main predictors (social media use style)
- **Passive consumption** (examples): minutes scrolling/browsing, number of posts viewed, short-form video viewing time
- **Active interaction** (examples): number of posts made, comments written, likes given, DMs/messages sent
- Ideally we also have **platform** (TikTok/IG/X/Reddit/etc.), because usage meaning can differ by platform.

#### Outcome variables (wellbeing)
- Primary: **self-reported stress**, **self-reported happiness**
- Optional if available: **sleep** (hours, sleep quality), **basic physical health** (self-rated health, fatigue), maybe mood indicators

#### Controls (to reduce confounding)
- **Demographics:** age, gender, education
- **Socioeconomic:** income or proxy, employment status
- **Lifestyle:** physical activity, work hours, student status (if relevant)
- Optional but helpful: baseline mental health, personality/introversion (only if the dataset includes it)

### How many observations we need (rough target + reasoning)
- Goal: enough data so regression estimates are **stable** after adding controls.
- A practical target:
  - At least **~1,000 people** for a basic model (passive + active + a handful of controls).
  - Better: **3,000–10,000 people** if we want to test subgroups (e.g., different age groups) or add more controls without the model becoming unstable.
- If we have repeated measures (daily/weekly rows per person), then the “number of rows” can be large even if the number of people is smaller — but we still want a solid number of **unique people** for generalizable results.
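
As a rough sanity check on these targets, the sketch below approximates how many people are needed to detect a correlation of a given size using the Fisher z approximation. The effect sizes, the 5% significance level, and the 80% power value are assumptions chosen for illustration; they are not estimates from our data, and the function name `required_n` is hypothetical.

```python
from math import atanh, ceil
from statistics import NormalDist


def required_n(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size to detect a correlation of size r (two-sided).

    Fisher z approximation: n = ((z_{1-alpha/2} + z_{power}) / atanh(|r|))^2 + 3.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be nonzero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(abs(r))) ** 2 + 3)


for r in (0.05, 0.10, 0.20):  # assumed effect sizes for illustration
    print(f"r = {r:.2f}: about {required_n(r)} people")
```

Under these assumptions, a correlation near 0.10 needs several hundred people and one near 0.05 needs a few thousand, which is broadly in line with the targets above given that prior work reports mostly small effects. Adding controls or splitting into subgroups pushes the required number higher.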

### Ideal data collection method (in a perfect world)
- Best-quality approach is a **hybrid**:
  - **Phone/app logs** to capture passive vs active behavior accurately (reduces memory bias), and
  - **Short surveys** for stress/happiness and sleep/health.
- If logs aren’t possible, then a well-designed survey can still work, but it’s more prone to self-report error.

### How the data would be stored/organized
- Use “tidy” structure:
  - Each row = one person in one time window (person-week or person-day)
  - Columns = predictors/outcomes/controls
- Keys:
  - Primary key could be **(user_id, date/week_id)** if repeated measures exist.
- Missingness handling plan:
  - Track missing values explicitly (don’t silently drop without checking).
  - If only a small amount is missing, we can use **listwise deletion** (drop rows) but we must report how many rows we dropped.
  - If missingness is larger, consider **simple imputation** (only if allowed / appropriate) and run sensitivity checks.
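
The key check and the listwise-deletion plan above can be sketched as follows. This is a minimal illustration in plain Python; the column names (`user_id`, `week_id`, `passive_min`, `stress`) and the example rows are hypothetical placeholders, not columns from any of our datasets.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def duplicate_keys(rows: List[Row], key: Sequence[str]) -> List[tuple]:
    """Return key values that occur more than once (empty if the key is unique)."""
    counts = Counter(tuple(row[k] for k in key) for row in rows)
    return [k for k, n in counts.items() if n > 1]


def missing_counts(rows: List[Row], columns: Sequence[str]) -> Dict[str, int]:
    """Count missing (None) values per column so missingness is tracked explicitly."""
    return {c: sum(row.get(c) is None for row in rows) for c in columns}


def listwise_delete(rows: List[Row], columns: Sequence[str]) -> Tuple[List[Row], int]:
    """Drop rows missing any required column; also return how many were dropped."""
    kept = [row for row in rows if all(row.get(c) is not None for c in columns)]
    return kept, len(rows) - len(kept)


# Hypothetical person-week rows; None marks a missing value.
rows = [
    {"user_id": 1, "week_id": 1, "passive_min": 120, "stress": 6},
    {"user_id": 1, "week_id": 2, "passive_min": None, "stress": 5},
    {"user_id": 2, "week_id": 1, "passive_min": 45, "stress": None},
    {"user_id": 2, "week_id": 2, "passive_min": 60, "stress": 3},
]
required = ["passive_min", "stress"]

print("duplicate keys:", duplicate_keys(rows, ("user_id", "week_id")))
print("missing per column:", missing_counts(rows, required))
clean, dropped = listwise_delete(rows, required)
print(f"kept {len(clean)} rows, dropped {dropped}")
```

Returning the dropped count alongside the cleaned rows makes it hard to forget to report it.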

---

## 2) Real dataset sources (5 Kaggle sources, most relevant → least relevant)

We propose using **Dataset 1** for our main analysis and treating **Datasets 2–5** as backup/validation datasets, both to show that suitable real data sources exist and to check whether patterns are consistent across datasets.

### Dataset 1 (Main): Kaggle — Social Media User Activity Dataset (sadiajavedd)
This is our primary dataset because it is directly designed around social media activity and includes wellbeing outcomes aligned with our research question. The dataset is expected to let us measure both passive and active social media behaviors and relate them to stress/happiness, while controlling for demographics and lifestyle variables when available. Access is straightforward for students because it is hosted on Kaggle, and we can download it manually or use the Kaggle API after setting up credentials. A key limitation is that Kaggle datasets often rely on self-reported or synthetic/compiled data, so we will clearly describe what the dataset represents once we inspect the included documentation and column names.

- **URL:** https://www.kaggle.com/datasets/sadiajavedd/social-media-user-activity-dataset

#### Access requirements
- Kaggle account login
- Accept Kaggle dataset terms
- Optional: Kaggle API token if downloading in a notebook/script

#### How it maps to our ideal dataset
- Should contain: passive + active measures, wellbeing outcomes (stress/happiness), and at least some controls (we will confirm exact columns when we load the CSV).
- We will restrict to **adults aged 18+** if an age variable is available (otherwise we will clearly state that we cannot enforce the 18+ restriction in that dataset).
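
A minimal sketch of the 18+ restriction, assuming the CSV has a column literally named `age` (we have not yet confirmed the real column names). The helper `adults_only` is hypothetical; it reports whether the restriction could be enforced so that we can state this clearly when it cannot.

```python
import csv
import io
from typing import Dict, List, Tuple


def adults_only(
    rows: List[Dict[str, str]], age_col: str = "age", min_age: int = 18
) -> Tuple[List[Dict[str, str]], bool]:
    """Keep rows with age >= min_age.

    Returns (rows, enforced). If the age column is absent, the rows are
    returned unchanged with enforced=False. Rows whose age is blank or
    not numeric are excluded, since adulthood cannot be verified for them.
    """
    if not rows or age_col not in rows[0]:
        return rows, False
    kept = []
    for row in rows:
        try:
            if float(row[age_col]) >= min_age:
                kept.append(row)
        except (TypeError, ValueError):
            continue
    return kept, True


# Tiny in-memory CSV standing in for the downloaded file (made-up values).
sample = io.StringIO("user_id,age\n1,17\n2,18\n3,\n4,45\n")
rows = list(csv.DictReader(sample))
adults, enforced = adults_only(rows)
print(f"enforced={enforced}, kept {len(adults)} of {len(rows)} rows")
```

Excluding rows with unverifiable ages is a conservative choice; the number of rows removed this way should be reported alongside other dropped rows.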

#### Main gaps / how we handle
- If missing timestamps: treat it as **cross-sectional** (one row per person) and avoid time-based claims.
- If missing key controls: we’ll use what exists and clearly state remaining confounding risk.

---

### Dataset 2: Kaggle — Mental Health & Social Media Balance Dataset (prince7489)
This dataset is highly relevant because it is explicitly about how social media usage relates to **stress, sleep, and happiness**, which directly matches our main outcomes and secondary outcomes. It is useful for validating whether higher usage (passive exposure proxies) is associated with stress/happiness in a way that is consistent with our main dataset. Access is done through Kaggle, so it is easy to download and use in the same workflow as our primary dataset. The main limitation is that it may not include detailed “active interaction” behaviors like posting/commenting/DMs, so it may not support a clean passive-vs-active split.

- **URL:** https://www.kaggle.com/datasets/prince7489/mental-health-and-social-media-balance-dataset

#### Access requirements
- Kaggle account login
- Accept Kaggle dataset terms
- Optional: Kaggle API token for programmatic download

#### How it maps to our ideal dataset
- Likely includes: stress and happiness measures, sleep-related variables, and lifestyle variables (we will confirm exact columns).
- Provides strong support for our outcomes (stress/happiness) and secondary outcomes (sleep/health proxies).

#### Main gaps / how we handle
- If active behaviors (posting/commenting/DMs) are not present, we will use it as a **backup validation dataset** focused mainly on passive exposure proxies (like time spent) rather than the full passive-vs-active comparison.

---

### Dataset 3: Kaggle — Social Media Usage and Emotional Well-Being (emirhanai)
This dataset is useful because it includes both a passive consumption proxy (daily usage time) and multiple active interaction proxies (posts, likes, comments, messages), which matches our definition of passive versus active styles. It also includes demographic fields like age and gender, and platform information, which supports confounding control and allows restricting to adults aged 18+. However, the main outcome is often an “emotion” category rather than numeric stress and happiness scales, so it is better used as a supportive dataset to test whether passive vs active activity patterns relate to a wellbeing-related outcome. We will be explicit that this dataset supports the “usage style” part of our project more strongly than our exact outcome measures.

- **URL:** https://www.kaggle.com/datasets/emirhanai/social-media-usage-and-emotional-well-being

#### Access requirements
- Kaggle account login
- Accept Kaggle dataset terms
- Optional: Kaggle API token for programmatic download

#### How it maps to our ideal dataset
- Passive proxy: daily usage time (minutes)
- Active proxies: posts/likes/comments/messages per day
- Controls: age, gender, platform (and possibly other demographic fields)
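
Because the active proxies are separate counts on different scales, one possible way to combine them into a single active-interaction score is to standardize each count and average the z-scores. This is only a sketch of one option, not a settled part of our method, and the column names below are hypothetical stand-ins for whatever the CSV actually contains.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence

# Hypothetical column names for the four active-interaction counts.
ACTIVE_COLS = ("posts_per_day", "likes_per_day", "comments_per_day", "messages_per_day")


def zscores(values: Sequence[float]) -> List[float]:
    """Standardize values to mean 0 and SD 1 (all zeros if there is no variation)."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def active_index(rows: List[Dict[str, float]], cols: Sequence[str] = ACTIVE_COLS) -> List[float]:
    """Average of per-column z-scores: one active-interaction score per row."""
    per_col = [zscores([float(r[c]) for r in rows]) for c in cols]
    return [mean(col[i] for col in per_col) for i in range(len(rows))]


# Made-up rows for illustration only.
rows = [
    {"posts_per_day": 1, "likes_per_day": 10, "comments_per_day": 2, "messages_per_day": 5},
    {"posts_per_day": 3, "likes_per_day": 30, "comments_per_day": 6, "messages_per_day": 15},
    {"posts_per_day": 2, "likes_per_day": 20, "comments_per_day": 4, "messages_per_day": 10},
]
print([round(s, 2) for s in active_index(rows)])
```

Standardizing first keeps a high-volume behavior such as likes from dominating the score. An alternative is to enter each count as its own predictor, which keeps the behaviors separately interpretable.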

#### Main gaps / how we handle
- Outcome is not exactly “stress” and “happiness” → use as **validation** for the passive-vs-active predictors, and treat its wellbeing outcome as a related but different measure.

---

### Dataset 4: Kaggle — Social Media and Mental Health (souvikahmed071)
This dataset is relevant because it contains survey-style fields on social media use, platform usage, and mental health-related questions, along with demographic information. It is especially useful for strengthening our proposal because it looks like a realistic observational survey study with many potential control variables. Depending on the exact survey questions, it may provide measures that can serve as stress-related outcomes, wellbeing outcomes, or mental health proxies. A limitation is that it may not include detailed “active interaction” counts, so it may not fully match our passive-vs-active definition without using proxies.

- **URL:** https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health

#### Access requirements
- Kaggle account login
- Accept Kaggle dataset terms
- Optional: Kaggle API token for programmatic download

#### How it maps to our ideal dataset
- Likely includes: age and demographics, social media use patterns and platforms, and wellbeing/mental health survey outcomes.
- Can help with controls (demographics and possibly occupation-related variables) and with general usage measures (passive proxies).

#### Main gaps / how we handle
- If it lacks direct active-interaction variables (posting/commenting/DMs), we will use it as a **supporting dataset** to confirm the direction of associations using available proxies (time/frequency/platforms).

---

### Dataset 5: Kaggle — Screen Time vs Mental Wellness Survey (2025) (adharshinikumar)
This dataset is the least directly matched to our “social media passive vs active” framing, but it is still useful because it includes outcomes like stress, sleep, and general wellness indicators. It can serve as an additional validation dataset to see whether general digital screen time relates to stress and sleep in a way that supports our broader claims about passive exposure. Access is straightforward through Kaggle. The main limitation is that it focuses on screen time more generally rather than specific social media behaviors like posting, commenting, and messaging, so it cannot fully support our passive-vs-active comparison.

- **URL:** https://www.kaggle.com/datasets/adharshinikumar/screentime-vs-mentalwellness-survey-2025

#### Access requirements
- Kaggle account login
- Accept Kaggle dataset terms
- Optional: Kaggle API token for programmatic download

#### How it maps to our ideal dataset
- Passive exposure proxy: screen time
- Outcomes: stress/sleep/wellness-related variables (we will confirm exact columns)

#### Main gaps / how we handle
- Not social-media-specific and no active interaction measures → use only as a **broad robustness check** and keep conclusions narrow for this dataset.

## Ethics

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> This project uses secondary data from publicly available sources such as Kaggle. While we do not collect data directly from participants, there is a risk that individuals did not anticipate all future uses of their data. We mitigate this by using the data only for aggregate research purposes and avoiding sensitive individual-level conclusions.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Social media wellbeing datasets often overrepresent younger, more active users or specific regions, which may bias results. If ignored, this could lead to misleading conclusions about broader populations. We will report these limitations and avoid generalizing beyond the sampled groups.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> The datasets we use are de-identified, but indirect identification is still possible from detailed demographics or behavior. To reduce risk, we will drop unnecessary fields, avoid small subgroup reporting, and present only aggregated results.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> If demographic variables are available, we will use them to test whether associations differ across groups (ex: gender, age ranges) and report any differences. We will be careful not to interpret group differences as biological or moral claims, and we will highlight dataset limitations.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> Data will be stored in secure, course-approved environments with access limited to project members. Although the data are public and anonymized, restricting access reduces unnecessary exposure.

 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> Not directly applicable because we are using public, de-identified secondary datasets and do not control data collection. If any dataset includes removals or takedown procedures, we will follow the dataset’s terms and avoid downloading unnecessary copies.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> We will keep datasets only for the course project duration and delete local copies afterward. Keeping fewer copies reduces exposure risk and supports responsible data handling.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> Wellbeing and social media experience vary by culture, age, and context, and the dataset may not capture these differences. We will interpret results cautiously and avoid universal claims, using prior literature to contextualize findings.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> We will look for bias in the dataset, like missing data, uneven group sizes, or important factors that aren’t included. We will report these issues and control for key variables when possible, instead of acting like the results are causal.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> We will avoid misleading visuals and cherry-picking. We will report effect sizes with uncertainty and include null results when they occur.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> We will not display or publish any PII. Results will be reported only in aggregated form, and we will avoid subgroup breakdowns that could enable re-identification.
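
A minimal sketch of how aggregated reporting with small-group suppression might look. The threshold of 10 people per group is an assumed value for illustration, not a course or dataset requirement, and the function and column names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List, Optional


def group_means(
    rows: List[Dict[str, Any]], group_col: str, value_col: str, min_cell: int = 10
) -> Dict[Any, Optional[float]]:
    """Mean of value_col per group; groups smaller than min_cell are suppressed (None)."""
    groups: Dict[Any, List[float]] = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    return {
        g: (mean(vals) if len(vals) >= min_cell else None)
        for g, vals in groups.items()
    }


# Made-up rows; a low threshold is used here only so the example stays short.
rows = [
    {"age_band": "18-24", "stress": 6},
    {"age_band": "18-24", "stress": 4},
    {"age_band": "18-24", "stress": 5},
    {"age_band": "65+", "stress": 3},
]
print(group_means(rows, "age_band", "stress", min_cell=2))
```

Groups below the threshold are returned as `None` rather than dropped, so a report can show that a group existed but was too small to publish.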

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> We will document data cleaning and analysis steps and keep code organized so results can be reproduced. This supports accountability if errors or ethical concerns are found later.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> We will avoid using variables that can “stand in” for protected traits (like race or income) in a way that could be unfair. If we include those variables as controls, we will explain why and be careful how we interpret them.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> We will check whether results look different across demographic groups (when that data is available). If we build a prediction model, we will also compare error rates across groups and report any big gaps.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> We will choose evaluation metrics that match the task and report multiple metrics where relevant. We will avoid optimizing a single metric that could hide subgroup harms.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> We will use simple models like regression so it’s clear why the model gives a result. We will explain the main factors and what they mean in plain language.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> We will explain our findings in plain language and note what the model cannot conclude. We will highlight bias and missing factors and keep conclusions within what the data supports.

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> Not applicable because we are not deploying a model or system. This is a course project analysis only.

 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> Not applicable because our results will not be used to make decisions about individuals. If we find something misleading or harmful, we will revise the analysis and how we report it.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 
> Not applicable because there is no production deployment. If issues are found, we can update or remove the analysis/report.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
 
> People could misuse our results to say social media is always good or always bad, or to shame people with mental health issues. We will reduce this risk by saying our findings are correlational, clearly listing limits, and keeping conclusions only within what the data supports.

## Team Expectations 

### *Team Expectation 1: Communication and Availability*
- Communications will be done through Discord and iMessage as they are the most readily available, and all authors are well-versed in both applications.
- Team members are expected to respond to messages within 48 hours unless there was prior communication that they may be unavailable for personal reasons.
- If a meeting needs to be scheduled, all team members must make time on at least one day of that week to discuss proposals and findings live with the other members. Discord will be used for meetings.

### *Team Expectation 2: Tone and Respectful Interactions*
- All communication should be respectful, constructive, and professional.
- As stated in the COGS108 Team Policies, direct but polite is the agreed upon tone.
- Team members will assume good intentions and view given criticism as a way to improve the work done rather than a personal attack.

### *Team Expectation 3: Decision making*
- Unless in a time crunch with a team member not responding, all decisions should be made through the agreement of all members of the team.
- If all members cannot come to a consensus, a majority vote will be used to make decisions.
- Major decisions that can change deadlines or project direction must be made early to avoid time crunch. They also must be communicated clearly to all members of the team.

### *Team Expectation 4: Task assignment and accountability*
- Tasks will be assigned both voluntarily and based on each member's strengths.
- Although team members may be more involved in specific tasks than others, we will ensure equal overall effort across the entire project.
- Each set task will have an expected deadline and a section “manager”, who will be the owner mainly responsible for that task.

### *Team Expectation 5: Deadlines and Task support*
- All team members are expected to meet the agreed-upon deadlines.
- If someone believes that they may not be able to meet a given deadline, a notification to the rest of the members is expected as soon as possible, preferably 48 to 72 hours in advance of the deadline.
- If a team member is struggling with a certain part of the project, the team will meet to redistribute tasks in a way that better fits each member's strengths and weaknesses.

### *Team Expectation 6: Conflict Resolution*
- Conflicts will be addressed directly but respectfully.
- Conflicts will be addressed early so that they do not stall task completion or harm team morale.
- Open discussion during a meeting or live communication through messaging apps will be used to resolve conflicts.
- If a conflict is not resolved respectfully among team members, reaching out to the professor will be the immediate next step.

### *Team Expectation 7: Handling Non-Participation*
- In the event that a team member is consistently unresponsive or fails to complete an assigned task, the rest of the team will do as follows:
  - Contact the member directly through both Discord and iMessage clearly outlining expectations for improvement, within a reasonable time period, in a respectful manner.
  - If no improvement occurs within the given reasonable time period, the professor will be notified by the other team members about the situation.

### *Team Expectation 8: Team agreement to policies*
- By participating in this project and submitting this proposal, each team member confirms that they have:
  - Read the COGS108 Team Policies
  - Agreed to the expectations listed above
  - Committed to contributing fairly, communicating openly, and working collaboratively

## Project Timeline Proposal

| Discussion Date  | Discussion Time & Place | Completed Before Discussion | What to Discuss |
|---|---|---|---|
| 2/4  | Discord - throughout day  | Edit, finalize, and submit proposal | Confirm datasets to use; discuss task assignment and edit timeline accordingly; outline planned wrangling and analysis approach |
| 2/17  | Discord - throughout day  | Begin data import and wrangling; conduct initial EDA | Review wrangling and EDA progress; refine analysis plan; finalize Data checkpoint assignment |
| 2/29  | Discord - throughout day  | Finalize wrangling and EDA; Begin Analysis | Discuss/edit Analysis; finalize EDA checkpoint assignment |
| 3/13  | Discord - throughout day  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 3/18  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |