# COGS 108 - Data Checkpoint

# Authors

- Aryaman Dayal: Writing - original draft (Dataset search, Methodology and Metric Inspection)
- Dhruv Sehgal: Writing - original draft (Ethics, Team Expectations)
- Prachi Heda: Writing - original draft (Background and Prior Work, Data overview)
- Ricky Zhang: Writing - original draft (Research question, Hypothesis, Outlier and Missingness Inspection)
- Kaylee Viriyavong: Writing - original draft (Project timeline proposal, summary statistics for variables)

## Background and Prior Work

Mental health concerns among college students have increased in recent years, with rising reports of anxiety and depression, coupled with the growth of social media platforms. Because most students engage with social networking sites daily, researchers have become interested in understanding whether online behavior plays a role in psychological well-being. While early discussions often focused on the amount of time spent online, more recent work has shifted toward examining the quality and consequences of social media use, including whether the behavior is compulsive, causes interpersonal conflict, or interferes with daily life. 


Social media use is extremely common among college students, but researchers now separate how much people use social platforms (time or frequency) from how they use them (such as compulsive or conflict-driven behavior). “Problematic social media use,” also described as social media addiction, refers to patterns that are hard to control and that disrupt daily life.<a name="cite_ref-1"></a><sup>1</sup> This distinction matters for our project because the dataset includes both an addiction-like score and relationship-focused variables, such as conflicts. This lets us examine whether problematic use is linked to worse mental health outcomes beyond what can be explained by time spent on social media alone.


Prior research consistently shows that higher levels of problematic social media use are linked to poorer mental health outcomes, including higher depression and anxiety symptoms and lower overall well-being.<a name="cite_ref-2"></a><sup>2</sup><a name="cite_ref-3"></a><sup>3</sup> However, many studies emphasize that causality is unclear and that these effects may depend on individual or social factors. One systematic review of adolescents and young adults found widespread links between problematic use and depression and stress, but noted that most studies were cross-sectional and therefore could not determine whether social media use leads to poorer mental health or vice versa.<a name="cite_ref-2"></a><sup>2</sup> Public health reports argue that risks cannot be captured by screen time alone and that the type of engagement and harmful experiences matter more than total time spent online.<a name="cite_ref-4"></a><sup>4</sup> These findings motivate two directions for our study: first, testing whether addiction-like measures of social media use are associated with mental health and whether that association varies by gender or academic level, and second, examining whether conflict-related experiences predict poorer mental health even after controlling for total usage time and relationship status.


Measurement research is also relevant to our analysis. Many studies assess problematic social media use using validated tools such as the Bergen Social Media Addiction Scale (BSMAS), which is based on core addiction features like salience, mood modification, tolerance, withdrawal, conflict, and relapse.<a name="cite_ref-1"></a><sup>1</sup> Validation studies show that the BSMAS has a stable structure and can reliably measure problematic use in young populations, making it useful for studying links with mental health and related outcomes.<a name="cite_ref-1"></a><sup>1</sup> Although our Kaggle dataset uses its own addiction score, this work helps clarify what this measure is intended to capture. 


Because one potential research direction focuses specifically on interpersonal conflict and relationship factors, earlier work on romantic and social relationships is especially relevant. Studies of college students and young adults have found that heavier social media use is associated with relationship strain.<a name="cite_ref-5"></a><sup>5</sup><a name="cite_ref-6"></a><sup>6</sup> In particular, Clayton et al. reported that greater Facebook use was linked to worse relationship outcomes especially in newer relationships.<a name="cite_ref-5"></a><sup>5</sup> Other research on “Facebook intrusion” reported similar connections between intrusive use, jealousy, and reduced satisfaction in undergraduate romantic relationships.<a name="cite_ref-6"></a><sup>6</sup> These results support our plan to examine whether conflicts over social media predict lower mental health scores, even after accounting for usage time and relationship status.


Finally, prior studies suggest that the relationship between problematic technology use and well-being can differ across demographic groups, which motivates our future analyses by gender and academic level. Research on smartphone addiction among undergraduates, for example, has reported gender differences in prevalence and associated factors, implying that risks and outcomes may not be the same for everyone.<a name="cite_ref-7"></a><sup>7</sup> Therefore, one possible research question for this study is whether the association between social media addiction scores and mental health differs by gender or academic level, rather than assuming a single overall effect for all students.

### References

1. <a name="cite_note-1"></a> [^](#cite_ref-1)  
Lin, C.-Y., Broström, A., Nilsen, P., Griffiths, M. D., & Pakpour, A. H. (2017). *Psychometric validation of the Persian Bergen Social Media Addiction Scale (BSMAS).* Addiction Research & Theory.  
https://pmc.ncbi.nlm.nih.gov/articles/PMC6034942/

2. <a name="cite_note-2"></a> [^](#cite_ref-2)  
Shannon, H., Bush, K., Villeneuve, P. J., Hellemans, K. G. C., & Hartling, L. (2022). *Problematic Social Media Use in Adolescents and Young Adults: Systematic Review and Meta-analysis.* JMIR Mental Health.  
https://mental.jmir.org/2022/4/e33450/

3. <a name="cite_note-3"></a> [^](#cite_ref-3)  
Ahmed, O., et al. (2024). *Social media use, mental health and sleep: A systematic review.* Journal of Affective Disorders.  
https://www.sciencedirect.com/science/article/pii/S0165032724014265

4. <a name="cite_note-4"></a> [^](#cite_ref-4)  
U.S. Surgeon General. (2023). *Social Media and Youth Mental Health: The U.S. Surgeon General’s Advisory.*  
https://www.hhs.gov/sites/default/files/sg-youth-mental-health-social-media-advisory.pdf

5. <a name="cite_note-5"></a> [^](#cite_ref-5)  
Clayton, R. B., Nagurney, A., & Smith, J. R. (2013). *Cheating, breakup, and divorce: Is Facebook use to blame?* Cyberpsychology, Behavior, and Social Networking.  
https://journals.sagepub.com/doi/full/10.1089/cyber.2012.0424

6. <a name="cite_note-6"></a> [^](#cite_ref-6)  
Elphinston, R. A., & Noller, P. (2011). *Time to face it! Facebook intrusion and the implications for romantic jealousy and relationship satisfaction.* Cyberpsychology, Behavior, and Social Networking.  
https://pubmed.ncbi.nlm.nih.gov/21548798/

7. <a name="cite_note-7"></a> [^](#cite_ref-7)  
Chen, B., Liu, F., Ding, S., Ying, X., Wang, L., & Wen, Y. (2017). *Gender differences in factors associated with smartphone addiction: a cross-sectional study among medical college students.* BMC Psychiatry.  
https://pmc.ncbi.nlm.nih.gov/articles/PMC5634822/



## Research Question

For this project, our interest is to answer the following question:

Is there, for students, a statistically significant relationship between social media addiction score and mental health score, and does this relationship differ by gender or academic level?

For context, we define the social media addiction score and mental health score as follows:
* Social Media Addiction Score: a self-reported 10-point Likert scale measure, where:
    *  1 = No signs of addictive behavior (controlled, balanced use)
    *  10 = Severe addictive tendencies (compulsive use, inability to reduce usage, interference with sleep, academics, or relationships)
* Mental Health Score: a self-reported single-item measure on a 10-point Likert scale, where:
    *  1 = Very poor mental health (frequent distress, anxiety, low mood, difficulty coping)
    *  10 = Excellent mental health (emotionally stable, low stress, positive mood, good coping ability)
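
As a rough illustration of what "differs by gender" could mean in practice, the sketch below computes a rank-based (Spearman-style) correlation between the two scores separately for each gender. The rows are a small illustrative sample in the dataset's column format, not the full file, and the helper name `rank_correlation` is our own. The same pattern would apply to `Academic_Level`.

```python
import pandas as pd

# Illustrative rows in the dataset's column format (not the full 705-row file).
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Male",
               "Female", "Male", "Female", "Male"],
    "Addicted_Score": [8, 3, 9, 4, 7, 5, 9, 7, 4],
    "Mental_Health_Score": [6, 8, 5, 7, 6, 7, 4, 6, 8],
})


def rank_correlation(group: pd.DataFrame) -> float:
    """Spearman-style correlation: Pearson correlation computed on ranks."""
    return group["Addicted_Score"].rank().corr(group["Mental_Health_Score"].rank())


# One correlation per gender group.
by_gender = {
    gender: rank_correlation(group)
    for gender, group in sample.groupby("Gender")
}
print(by_gender)
```

A rank-based correlation is used here because both scores are ordinal 1 to 10 ratings. Comparing the per-group values is only descriptive; a formal test of the difference would need an interaction model or similar.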

## Hypothesis


We predict that there is a statistically significant relationship between social media addiction score and mental health score for university students, and that this relationship differs by gender and academic level. Several signs of poor mental health, such as detachment from reality and excessive fear or worry, are commonly attributed to increasingly pervasive social media algorithms. Because posts are often targeted at users of a particular academic level and can promote gender conformity, we expect these factors to interact with the effect of social media use rather than act independently.

## Data

**Ideal Dataset:**
To answer whether social media addiction is related to mental health and whether that relationship differs by gender or academic level, the ideal dataset for our project would be student-level survey data that includes a validated measure of social media addiction, a validated measure of mental health, demographic information on gender, and a clear indicator of academic level such as year in school or undergraduate versus graduate status. We would also want a few important background variables that could affect both social media use and mental health, such as age, typical daily time spent on social media, sleep habits, and basic school or location context so we can account for major differences across groups. In terms of sample size, we need enough students in each gender and academic-level group to reliably detect differences in the relationship, so several hundred students at minimum and ideally closer to one thousand or more is a strong target, especially if group sizes are uneven. These data would ideally be collected through a standardized survey distributed across multiple schools or through stratified sampling to ensure representation across genders and academic levels, using consistent scoring procedures and a documented codebook. We would store the data in a clean, organized table where each row represents one student and each column represents one variable, with clear variable definitions, consistent handling of missing values, and documentation explaining how the addiction and mental health scores were computed and interpreted.

**Dataset 1 ([Link 1](https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships)):**
This dataset is hosted on Kaggle under the title “Students’ Social Media Addiction,” and we would need a Kaggle account to accept the dataset’s terms and download the CSV file for analysis. It includes variables that map directly onto our research question, including an addiction score, a mental health score, gender, and academic level, along with related factors like average daily usage, sleep per night, and other context variables such as platform, relationship status, and social media conflicts. Using this dataset, we could model mental health as a function of addiction and then test whether the strength of that relationship changes across gender and academic level, while optionally controlling for usage and sleep. This makes it a strong match for the full set of analyses we want to run.
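
A minimal sketch of the modeling idea described above, assuming an ordinary least squares fit with an addiction-by-gender interaction term. It uses NumPy's least-squares solver on a few illustrative values in the dataset's format. The variable names and the choice of female as the reference group are our own assumptions, and a full analysis would likely use a statistics package to obtain standard errors and p-values.

```python
import numpy as np

# Illustrative values in the dataset's format (not the full file).
addiction = np.array([8, 3, 9, 4, 7, 5, 9, 7, 4], dtype=float)
mental_health = np.array([6, 8, 5, 7, 6, 7, 4, 6, 8], dtype=float)
is_male = np.array([0, 1, 0, 1, 1, 0, 1, 0, 1], dtype=float)  # 0 = reference group

# Design matrix columns: intercept, addiction, gender indicator, addiction x gender.
X = np.column_stack([
    np.ones_like(addiction),
    addiction,
    is_male,
    addiction * is_male,
])
coef, *_ = np.linalg.lstsq(X, mental_health, rcond=None)

female_slope = coef[1]            # slope for the reference group
male_slope = coef[1] + coef[3]    # reference slope plus the interaction term
print(f"female slope: {female_slope:.3f}, male slope: {male_slope:.3f}")
```

The interaction coefficient `coef[3]` is the quantity of interest: if it is reliably different from zero, the addiction and mental health relationship differs between the two groups.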

**Dataset 2 ([Link 2](https://www.kaggle.com/datasets/zahranusratt/student-social-media-addiction-analysis-dataset)):**
This Kaggle dataset also provides a CSV titled “Students Social Media Addiction.csv,” and we would similarly need Kaggle access and to download the file before we can use it. Public writeups and analyses of this same CSV indicate it contains about 705 rows and 13 columns, and the core fields include gender, academic level, mental health score, and addiction score, plus usage and lifestyle variables such as daily usage and sleep. Functionally, it supports the same main tests we want, including estimating the overall relationship between addiction and mental health and checking whether that relationship differs across subgroups. It is also useful for comparing our results against existing Kaggle notebooks or replications that use the same file structure.
  

### Data overview

After careful review, we decided to use the first dataset described above. Below are the preliminary findings of the dataset.

- **Dataset #1**
  - **Dataset Name**: Students' Social Media Addiction
  - **Link to the dataset**: https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships
  - **Number of observations**: 705
  - **Number of variables**: 13
  - **Description of the variables most relevant to this project**
    - **Academic_Level**: _Categorical_, High School / Undergraduate / Graduate
    - **Gender**: _Categorical_, “Male” or “Female”
    - **Avg_Daily_Usage_Hours**: _Float_, Average hours per day on social media
    - **Most_Used_Platform**: _Categorical_, Instagram, Facebook, TikTok, etc.
    - **Mental_Health_Score**: _Integer_, Self‐rated mental health (1 = poor to 10 = excellent)
    - **Conflicts_Over_Social_Media**: _Integer_, Number of relationship conflicts due to social media
    - **Addicted_Score**: _Integer_, Social Media Addiction Score (1 = low to 10 = high)
  - **Shortcomings**
    - **Self‐Report Bias**: All measures are self‐reported and may be subject to social‐desirability effects.
    - **Cross‐Sectional Design**: One‐time survey prevents causal inference.
    - **Sampling Bias**: Recruitment via online channels may underrepresent students with limited internet access.
    - **Omitted Variables**: The dataset has no socioeconomic status proxies (income, parental education) and no academic performance metric (GPA/grades), so some observed associations may be spurious.



In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
%pip install requests tqdm
%pip install pandas

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import pandas as pd
import get_data # this is where we get the function we need to download data

datafiles = [
    { 'url': 'https://drive.google.com/uc?export=download&id=1OOzFA2-e2j92c0HMrKnJ2z3HTr_oFZmg', 'filename':'Students Social Media Addiction.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Students social media use, addiction score, and relationship factors survey 

This dataset is a cross-sectional student survey with 705 responses and 13 variables that describe demographics, daily social media habits, sleep, mental health ratings, academics, and relationship conflict tied to social media. Each row corresponds to one student, and Student_ID appears to be an anonymized identifier rather than personally identifying information. The respondents are young adults, ages 18 to 24, so the dataset is best interpreted as a snapshot of student experiences rather than a general population sample.

The most important metrics are the measures of time use, well being, and conflict. Avg_Daily_Usage_Hours is in hours per day and reflects how much time a student spends on social media on a typical day. Sleep_Hours_Per_Night is in hours per night and gives a sense of sleep quantity. For context, public health guidance for adults is generally at least 7 hours of sleep per night. The dataset also includes Mental_Health_Score, commonly described in public summaries as a 1 to 10 self rating, and an Addicted_Score that is also treated in some analyses as a 1 to 10 severity style scale. Conflicts_Over_Social_Media is used as a small count or short scale of how often social media is linked to relationship disagreements. Categorical fields such as Gender, Academic_Level, Relationship_Status, Most_Used_Platform, Country, and Affects_Academic_Performance make it possible to compare these outcomes across groups and to look for patterns like heavier use being associated with lower sleep or more reported conflict.
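
The group comparisons mentioned above can be sketched as follows, assuming pandas and a few illustrative rows in the dataset's format (not the full file). The cross-tabulation shows how many respondents fall in each gender by academic level cell, which matters because very small cells make subgroup comparisons unreliable.

```python
import pandas as pd

# Illustrative rows in the dataset's column format (not the full file).
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Male",
               "Female", "Male", "Female", "Male"],
    "Academic_Level": ["Undergraduate", "Graduate", "Undergraduate", "High School",
                       "Graduate", "Undergraduate", "Graduate", "Undergraduate",
                       "Graduate"],
    "Avg_Daily_Usage_Hours": [5.2, 2.1, 6.0, 3.0, 4.5, 4.7, 6.8, 5.6, 4.3],
    "Sleep_Hours_Per_Night": [6.5, 7.5, 5.0, 7.0, 6.0, 7.2, 5.9, 6.7, 7.5],
})

# Number of respondents in each Gender x Academic_Level cell.
cell_counts = pd.crosstab(sample["Gender"], sample["Academic_Level"])

# Mean usage and sleep for each academic level.
group_means = (
    sample.groupby("Academic_Level")[["Avg_Daily_Usage_Hours", "Sleep_Hours_Per_Night"]]
    .mean()
)

print(cell_counts)
print(group_means)
```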

There are a few major concerns to keep in mind. First, nearly all variables are self reported, so usage hours, sleep, mental health, and addiction severity can be affected by memory errors and by how comfortable someone is reporting honestly. Second, the data are cross sectional, meaning they capture one point in time, so you can study associations but you cannot say social media use causes changes in mental health, sleep, or relationship conflict. Third, the sample is likely shaped by who chose to respond, so it may not be representative of all students, and responses across countries can differ based on culture and how people interpret rating scales. Finally, some public writeups describe the score ranges slightly differently, so it is safer to verify the actual distributions directly in the dataset before using strict cutoffs or treating the scales as standardized clinical measures.
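
One way to verify the score distributions directly, as suggested above, is to look at the observed minimum, maximum, and value frequencies before assuming a strict 1 to 10 scale. This is a minimal sketch on a few illustrative values; on the real data the same calls would be applied to the loaded DataFrame.

```python
import pandas as pd

# Illustrative values in the dataset's format (not the full file).
sample = pd.DataFrame({
    "Mental_Health_Score": [6, 8, 5, 7, 6, 7, 4, 6, 8],
    "Addicted_Score": [8, 3, 9, 4, 7, 5, 9, 7, 4],
})
score_columns = ["Mental_Health_Score", "Addicted_Score"]

# Observed minimum and maximum of each score.
observed_range = sample[score_columns].agg(["min", "max"])

# How often each mental health value occurs, ordered by score.
mental_health_counts = sample["Mental_Health_Score"].value_counts().sort_index()

# True only if every score lies inside the nominal 1 to 10 range.
in_range = (sample[score_columns] >= 1) & (sample[score_columns] <= 10)
within_nominal_range = bool(in_range.all().all())

print(observed_range)
print(mental_health_counts)
print(within_nominal_range)
```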

### Summary Statistics for Key Variables 

#### For this project, the most important variables relate to:

**Addiction Score**: The mean addiction score is X, with values ranging from Y to Z. Higher scores indicate stronger signs of problematic social media use.

**Daily Usage Hours**: Participants spend an average of X hours per day on social media, suggesting moderate/high usage in the sample.

**Relationship Satisfaction**: The average satisfaction score is X, indicating generally moderate/high relationship quality.

**Conflict Frequency**: The distribution shows whether conflict levels are typically low or elevated.

**Age**: Provides context about the demographic makeup of the sample.

#### These variables are important because:

Addiction score is the **primary predictor**

Relationship satisfaction/conflict are the **primary outcomes**

Usage hours helps validate addiction scores

Age may act as a **confounder**

##### These variables directly relate to our research question examining the association between social media addiction and relationship quality.
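
The placeholder values (X, Y, Z) above could be filled from a summary table like the one sketched here. The example uses only columns that appear in the `ssma.dtypes` listing and a few illustrative rows, so the printed numbers are not the statistics of the full dataset.

```python
import pandas as pd

# Illustrative rows in the dataset's column format (not the full file).
sample = pd.DataFrame({
    "Age": [19, 22, 20, 18, 21, 20, 23, 21, 24],
    "Avg_Daily_Usage_Hours": [5.2, 2.1, 6.0, 3.0, 4.5, 4.7, 6.8, 5.6, 4.3],
    "Conflicts_Over_Social_Media": [3, 0, 4, 1, 2, 2, 5, 3, 2],
    "Addicted_Score": [8, 3, 9, 4, 7, 5, 9, 7, 4],
})

# Mean, minimum, and maximum of each numeric variable, rounded for reporting.
summary = sample.describe().loc[["mean", "min", "max"]].round(2)
print(summary)
```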


3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`

In [7]:
# Load the raw survey data downloaded to data/00-raw/ and preview it
ssma = pd.read_csv('data/00-raw/Students Social Media Addiction.csv')
ssma

| | Student_ID | Age | Gender | Academic_Level | Country | Avg_Daily_Usage_Hours | Most_Used_Platform | Affects_Academic_Performance | Sleep_Hours_Per_Night | Mental_Health_Score | Relationship_Status | Conflicts_Over_Social_Media | Addicted_Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 19 | Female | Undergraduate | Bangladesh | 5.2 | Instagram | Yes | 6.5 | 6 | In Relationship | 3 | 8 |
| 1 | 2 | 22 | Male | Graduate | India | 2.1 | Twitter | No | 7.5 | 8 | Single | 0 | 3 |
| 2 | 3 | 20 | Female | Undergraduate | USA | 6.0 | TikTok | Yes | 5.0 | 5 | Complicated | 4 | 9 |
| 3 | 4 | 18 | Male | High School | UK | 3.0 | YouTube | No | 7.0 | 7 | Single | 1 | 4 |
| 4 | 5 | 21 | Male | Graduate | Canada | 4.5 | Facebook | Yes | 6.0 | 6 | In Relationship | 2 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 700 | 701 | 20 | Female | Undergraduate | Italy | 4.7 | TikTok | No | 7.2 | 7 | In Relationship | 2 | 5 |
| 701 | 702 | 23 | Male | Graduate | Russia | 6.8 | Instagram | Yes | 5.9 | 4 | Single | 5 | 9 |
| 702 | 703 | 21 | Female | Undergraduate | China | 5.6 | WeChat | Yes | 6.7 | 6 | In Relationship | 3 | 7 |
| 703 | 704 | 24 | Male | Graduate | Japan | 4.3 | Twitter | No | 7.5 | 8 | Single | 2 | 4 |


In [10]:
print(f"There are a total of {ssma.shape[0]} observations and {ssma.shape[1]} variables.")

There are a total of 705 observations and 13 variables.


In [67]:
ssma.dtypes

Student_ID                        int64
Age                               int64
Gender                           object
Academic_Level                   object
Country                          object
Avg_Daily_Usage_Hours           float64
Most_Used_Platform               object
Affects_Academic_Performance     object
Sleep_Hours_Per_Night           float64
Mental_Health_Score               int64
Relationship_Status              object
Conflicts_Over_Social_Media       int64
Addicted_Score                    int64
dtype: object

Given what each variable represents, the data types are as expected and show no abnormalities. Let us inspect the values for closer examination.

In [66]:
# Tidiness and outlier inspection

print("====================== Variable Inspection =============================")
print("\nStudent_ID should be unique, which we verify by comparing the row count with the unique ID count:")
print(f"  The student ID count is {ssma.shape[0]}, while the unique student ID count is {ssma.Student_ID.nunique()}. The count {'matches' if ssma.shape[0] == ssma.Student_ID.nunique() else 'does not match'}.\n")

print("Age should be numeric and plausible, e.g. between 1 and 100 (students are usually between 18 and 24).")
print(f"  The age ranges between {ssma.Age.min()} and {ssma.Age.max()}, which is what we expect.")
print(f"  Unique ages are: {ssma.Age.unique()}.\n")

print("Gender should be categorical.")
print(f" Unique genders are: {ssma.Gender.unique()}.")
print("All genders are expected and no outliers are spotted.\n")

print("Country should also be categorical.")
print(f" Unique countries are: {ssma.Country.unique()}")
print(" While formatting is slightly inconsistent (e.g., USA vs full names), no obvious typographical errors were detected.\n")

print("Avg_Daily_Usage_Hours should be numerical, nonnegative, and reasonable (at most 24 hours).")
print(f" {ssma.query('Avg_Daily_Usage_Hours >= 0 & Avg_Daily_Usage_Hours <= 24').shape[0]} of {ssma.shape[0]} entries match the expectation above.")
print(" The range query above would raise an error on non-numeric values, so every entry is numeric.")
print(f" The Avg_Daily_Usage_Hours ranges between {ssma.Avg_Daily_Usage_Hours.min()} and {ssma.Avg_Daily_Usage_Hours.max()}, which is reasonable.\n")

print("Most_Used_Platform should contain strings with no typos.")
print(f" Unique platforms are: {ssma.Most_Used_Platform.unique()}.")
print(" No values are out of the ordinary, as shown above.\n")

print("Affects_Academic_Performance should be binary (Yes or No).")
print(f" Unique responses to Affects_Academic_Performance are: {ssma.Affects_Academic_Performance.unique()}.")
print(" As expected and shown above, only two responses exist.\n")

print("Sleep_Hours_Per_Night should be numerical, nonnegative, and reasonable (at most 24 hours).")
print(f" {ssma.query('Sleep_Hours_Per_Night >= 0 & Sleep_Hours_Per_Night <= 24').shape[0]} of {ssma.shape[0]} entries match the expectation above.")
print(" The range query above would raise an error on non-numeric values, so every entry is numeric.")
print(f" The Sleep_Hours_Per_Night ranges between {ssma.Sleep_Hours_Per_Night.min()} and {ssma.Sleep_Hours_Per_Night.max()}, which is reasonable.\n")

print("Mental_Health_Score should be numerical and between 1 and 10.")
print(f" {ssma.query('Mental_Health_Score >= 1 & Mental_Health_Score <= 10').shape[0]} of {ssma.shape[0]} entries match the expectation above.")
print(" The range query above would raise an error on non-numeric values, so every entry is numeric.\n")

print("Relationship_Status should be categorical.")
print(f" Unique relationship statuses are: {ssma.Relationship_Status.unique()}")
print(" We see three categories, which are distinct and clear.\n")

print("Conflicts_Over_Social_Media should be a nonnegative numerical count.")
print(f" {ssma.query('Conflicts_Over_Social_Media >= 0').shape[0]} of {ssma.shape[0]} entries match the expectation above.")
print(f" The Conflicts_Over_Social_Media ranges between {ssma.Conflicts_Over_Social_Media.min()} and {ssma.Conflicts_Over_Social_Media.max()}, which is reasonable.\n")

print("Addicted_Score should be numerical and between 1 and 10.")
print(f" {ssma.query('Addicted_Score >= 1 & Addicted_Score <= 10').shape[0]} of {ssma.shape[0]} entries match the expectation above.")
print(" The range query above would raise an error on non-numeric values, so every entry is numeric.\n")


Student_ID should be unique, which we verify by comparing the row count with the unique ID count:
  The student ID count is 705, while the unique student ID count is 705. The count matches.

Age should be numeric and plausible, e.g. between 1 and 100 (students are usually between 18 and 24).
  The age ranges between 18 and 24, which is what we expect.
  Unique ages are: [19 22 20 18 21 23 24].

Gender should be categorical.
 Unique genders are: ['Female' 'Male'].
All genders are expected and no outliers are spotted.

Country should also be categorical.
 Unique countries are: ['Bangladesh' 'India' 'USA' 'UK' 'Canada' 'Australia' 'Germany' 'Brazil'
 'Japan' 'South Korea' 'France' 'Spain' 'Italy' 'Mexico' 'Russia' 'China'
 'Sweden' 'Norway' 'Denmark' 'Netherlands' 'Belgium' 'Switzerland'
 'Austria' 'Portugal' 'Greece' 'Ireland' 'New Zealand' 'Singapore'
 'Malaysia' 'Thailand' 'Vietnam' 'Philippines' 'Indonesia' 'Taiwan'
 'Hong Kong' 'Turkey' 'Israel' 'UAE' 'Egypt' 'Morocco' 'South Afri

From the unique values above, we can see that each row is an observation, each column is a variable, and each cell holds a single value; otherwise we would see cells that pack multiple responses together, such as 'India, USA'. In addition, there are no duplicate entries, as each row corresponds to a unique Student_ID. The formatting is consistent and no extraneous characters were found. Furthermore, no entries are missing, as we will show below. Thus, we can conclude the data is tidy.
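
The tidiness criteria described above can also be expressed as explicit checks. The sketch below is one possible way to do this, run on a few illustrative rows. The variable names and the comma test for multi-valued cells are our own assumptions, not part of the original inspection.

```python
import pandas as pd

# Illustrative rows in the dataset's column format (not the full file).
sample = pd.DataFrame({
    "Student_ID": [1, 2, 3, 4, 5],
    "Gender": ["Female", "Male", "Female", "Male", "Male"],
    "Country": ["Bangladesh", "India", "USA", "UK", "Canada"],
})
text_columns = ["Gender", "Country"]

# Each student should appear exactly once.
ids_unique = sample["Student_ID"].is_unique

# Count rows that are exact duplicates of an earlier row.
duplicate_rows = int(sample.duplicated().sum())

# Count cells that appear to pack several responses together (e.g. "India, USA").
multi_valued = {
    col: int(sample[col].str.contains(",", regex=False).sum())
    for col in text_columns
}

# Count cells with leading or trailing whitespace.
untrimmed = {
    col: int((sample[col] != sample[col].str.strip()).sum())
    for col in text_columns
}

print(ids_unique, duplicate_rows, multi_valued, untrimmed)
```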

On a separate note, through the inspection of unique values, we see no suspicious entries or outliers, considering the type of question and the demographic described on the source website.
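
The inspection above relies on plausible ranges and unique values. As a complementary sketch (a different technique from the one used in this notebook), the common 1.5 x IQR rule can flag numeric values that sit far from the bulk of the data. The helper `iqr_outliers` and the extreme value of 14.0 hours are hypothetical and only illustrate the rule.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Illustrative sleep values in the dataset's format (not the full file).
sleep = pd.Series([6.5, 7.5, 5.0, 7.0, 6.0, 7.2, 5.9, 6.7, 7.5],
                  name="Sleep_Hours_Per_Night")
flags = iqr_outliers(sleep)

# Hypothetical extreme entry appended to show a flagged value.
sleep_with_extreme = pd.concat([sleep, pd.Series([14.0])], ignore_index=True)
flags_extreme = iqr_outliers(sleep_with_extreme)

print(int(flags.sum()), int(flags_extreme.sum()))
```

A flagged value is only a candidate for review; whether it is an error depends on context, such as the plausible range for the variable.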

Finally, because there are no missing values, no imputation or row deletion was required.

In [64]:
# Missingness inspection
print(f"The variables for each observation are {'all present' if ssma.isna().sum().sum() == 0 else 'missing for some'}.\n")
print("The following output counts the missing values in each variable; 0 means the variable has no missing values.")
print(ssma.isna().sum())

print("\nSince there are no missing values, missingness mechanisms are not applicable to this dataset.")

The variables for each observation are all present.

The following output counts the missing values in each variable; 0 means the variable has no missing values.
Student_ID                      0
Age                             0
Gender                          0
Academic_Level                  0
Country                         0
Avg_Daily_Usage_Hours           0
Most_Used_Platform              0
Affects_Academic_Performance    0
Sleep_Hours_Per_Night           0
Mental_Health_Score             0
Relationship_Status             0
Conflicts_Over_Social_Media     0
Addicted_Score                  0
dtype: int64

Since there are no missing values, missingness mechanisms are not applicable to this dataset.
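
For reference, had any values been missing, one simple way to probe whether the missingness looks systematic is to compare other variables between rows with and without the missing value. The sketch below deliberately blanks out some sleep values in a small illustrative sample; the blanked values and the 5.5-hour threshold are hypothetical and do not reflect the real data.

```python
import numpy as np
import pandas as pd

# Illustrative rows in the dataset's column format (not the full file).
sample = pd.DataFrame({
    "Avg_Daily_Usage_Hours": [5.2, 2.1, 6.0, 3.0, 4.5, 4.7, 6.8, 5.6, 4.3],
    "Sleep_Hours_Per_Night": [6.5, 7.5, 5.0, 7.0, 6.0, 7.2, 5.9, 6.7, 7.5],
})

# Hypothetical: remove sleep values for the heaviest users to mimic
# missingness that depends on another variable.
hypothetical = sample.copy()
heavy_users = hypothetical["Avg_Daily_Usage_Hours"] > 5.5
hypothetical.loc[heavy_users, "Sleep_Hours_Per_Night"] = np.nan

# Count of missing values per column.
missing_per_column = hypothetical.isna().sum()

# Label each row by whether its sleep value is missing, then compare usage.
sleep_status = hypothetical["Sleep_Hours_Per_Night"].isna().map(
    {True: "missing", False: "observed"}
)
usage_by_status = hypothetical.groupby(sleep_status)["Avg_Daily_Usage_Hours"].mean()

print(missing_per_column)
print(usage_by_status)
```

A clear gap in mean usage between the two groups would suggest the values are not missing completely at random, which would argue against simply dropping those rows.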


In [71]:
ssma.to_csv("data/02-processed/Students Social Media Addiction_clean.csv", index=False)

For the time being, unless EDA or analysis requires otherwise, we consider the current iteration of the data final.
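
A minimal sketch of a slightly more defensive save step, assuming the output folder may not exist yet. A temporary directory stands in for the project root so the example runs anywhere; in the notebook the root would be the repository folder. Reading the file back is an optional sanity check that the written CSV matches the DataFrame.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny illustrative frame standing in for the cleaned data.
sample = pd.DataFrame({"Student_ID": [1, 2, 3], "Addicted_Score": [8, 3, 9]})

# A temporary directory stands in for the project root in this sketch.
project_root = Path(tempfile.mkdtemp())
processed_dir = project_root / "data" / "02-processed"
processed_dir.mkdir(parents=True, exist_ok=True)  # create missing folders

out_path = processed_dir / "Students Social Media Addiction_clean.csv"
sample.to_csv(out_path, index=False)

# Read the file back and confirm it matches what was written.
round_trip = pd.read_csv(out_path)
matches = round_trip.equals(sample)
print(matches)
```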

## Ethics 

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> Since the data is collected from a survey administered through university mailing lists and social-media platforms, data should be provided voluntarily by participants. If at any point subjects felt uncomfortable, continuing with the survey would be voluntary.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Countries with higher populations are likely to have more rows in the dataset, independent of development level. This can be mitigated by normalization. Since the survey is administered online, there is a bias toward students with reliable internet access, which could in turn also depend on the country; students from more developed countries are more likely to have internet access. For this, we plan to verify any actual correlations and then decide on further steps, possibly analyzing groups of specific countries at a time. Nevertheless, these biases cause students with internet access and living in well-developed communities to be favored, and hence could impose unfair practices (such as restrictions on internet access) on students from communities with fewer readily available resources, affecting their access to information. For example, communities not knowledgeable enough about social media may believe the adverse effects of social media come from using the internet overall, not just social media.
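
One possible reading of "normalization" here is to weight each respondent by the inverse of their country's sample size, so that every country contributes equally to an overall estimate. The sketch below uses hypothetical rows with repeated countries to show the effect; it is one option among several, not the method we have committed to.

```python
import pandas as pd

# Hypothetical rows with uneven country representation.
sample = pd.DataFrame({
    "Country": ["India", "India", "India", "USA", "USA", "Canada"],
    "Addicted_Score": [8, 7, 9, 5, 3, 4],
})

# Weight each row by 1 / (number of respondents from the same country).
country_counts = sample["Country"].value_counts()
sample["weight"] = 1.0 / sample["Country"].map(country_counts)

# Plain mean: dominated by the most represented country.
unweighted_mean = sample["Addicted_Score"].mean()

# Weighted mean: equivalent to averaging the per-country means.
weighted_mean = (
    (sample["Addicted_Score"] * sample["weight"]).sum() / sample["weight"].sum()
)

print(unweighted_mean, weighted_mean)
```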

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> The only identifiable data column in our dataset is the `Student_ID`, which is used to keep track of duplicate submissions. As long as that does not have any connection to the actual student ID of a student, there is no risk of exposure. Considering `Student_ID` is numbered incrementally (1-705), there should be no relation present with the student's identifiable information. However, 79 out of 110 countries only have one submission, which could potentially be used to identify students who filled out the survey. Hence, we would want to be more careful for country-wise analyses, making sure sufficient aggregations are made to hide any identifiable statistics.
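
The aggregation mentioned above could look like the following sketch, which counts countries with a single respondent and folds countries below a minimum cell size into an "Other" category before any country-wise reporting. The sample values and the threshold of 2 are hypothetical.

```python
import pandas as pd

# Hypothetical country column with some single-respondent countries.
country = pd.Series(
    ["India", "India", "India", "USA", "USA", "Canada", "Japan"],
    name="Country",
)

counts = country.value_counts()

# Number of countries represented by exactly one respondent.
singletons = int((counts == 1).sum())

# Fold countries below the minimum cell size into "Other".
min_cell_size = 2  # hypothetical threshold
rare_countries = counts[counts < min_cell_size].index
country_grouped = country.where(~country.isin(rare_countries), "Other")

print(singletons)
print(country_grouped.value_counts())
```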

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

 > Looking at the data, some countries dominate over others, which could be an issue if there are major cultural differences between them. Again, normalization could be a way to fix this. Also, we might do analysis by platform too, since most of the dataset is dominated by Instagram, TikTok, and Facebook (in that order). This would help not skew results, since it is very likely that many mental health concerns are caused by specific popular platforms rather than social media in general. This could ruin the reputation of social media platforms that can actually be useful to people, such as YouTube.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 
> No personally identifiable data is stored. We cannot delete a person's data if we cannot identify them in the first place.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> Since some countries and communities are overrepresented in the dataset (because of factors like internet access and the number of college students in a region), our findings might not generalize well to all regions if they were used to implement policies. Moreover, since some platforms dominate the dataset, findings would be skewed toward the perceived effects of those platforms rather than of social media in general. Since much of the data is self-reported, ratings may cluster around middle values, because people tend to reserve 1 or 10 for extremes, and each person has a different threshold for what counts as extreme. Additionally, depending on the respondent's situation at the time of the survey, they might give higher or lower ratings than they usually would, potentially skewing results. For example, students are much more likely to be stressed and sleep-deprived during exam season, whose timing varies by country and university.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> Even though `Mental_Health_Score` does seem to follow a normal distribution, there might be some bias. Because the survey was completely voluntary, people who are more likely to believe they have mental health problems may also be more likely to fill out the survey. Moreover, the top two countries are India and the USA, both of which contribute a disproportionately large amount of content to the internet, so students from these countries are not only more likely to fill out the survey but may also be more affected by social media, since more of the content they see online connects with their culture.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> All of us have experience with data visualization.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> So far, yes.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> We are using a publicly available dataset. As long as we follow good practices, our results should be reproducible for college students.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> Because certain countries and platforms are overrepresented, normalizing would be the best option for addressing any fairness issues that arise. The goal is to make our findings as generalizable as possible.
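
If we later fit a model, one way to test for the disparate error rates mentioned in D.2 would be to compare a simple error metric across groups. The sketch below assumes a hypothetical model that predicts `Mental_Health_Score`; the function name, the group labels, and all numbers are made up for illustration.

```python
from collections import defaultdict


def mean_abs_error_by_group(y_true, y_pred, groups):
    """Return the mean absolute error of the predictions within each group."""
    errors = defaultdict(list)
    for truth, pred, group in zip(y_true, y_pred, groups):
        errors[group].append(abs(truth - pred))
    return {group: sum(errs) / len(errs) for group, errs in errors.items()}


# Toy values for illustration only (not real scores or real predictions).
y_true = [6, 7, 5, 8]
y_pred = [6, 6, 7, 8]
groups = ["Female", "Female", "Male", "Male"]

mae = mean_abs_error_by_group(y_true, y_pred, groups)
print(mae)
```

A large gap between groups would suggest the model fits some groups worse than others and that the result should be reported with that caveat.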

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> We are taking into account possible biases, and have explained them in this section.


### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> Using techniques such as normalization and aggregation, we hope to account for any overrepresentation present in the dataset. This is to make sure that any policies introduced in schools based on our findings do not cause more harm than good. For example, a boarding school using our findings might ban YouTube for its students, even though many use it as a learning resource and it is not even the most-used platform for most people.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> We will emphasize that our findings cannot be generalized to all regions, and that each institution or government will have to decide for itself, using its own data, what policies should be implemented to improve students' mental health.


## Team Expectations 

* Communication will be done through Discord. Check it as frequently as possible, but at least twice a day.
* If you know you are about to be busy, notify the team *before* you are. That way, we can reallocate the workload for the week in advance and stay on track.
* When providing criticism, make sure to keep it constructive. Specify clearly whether something is your opinion, belief from past experiences, or general advice.
* Decisions will be made after a group discussion, usually after a group call.
* Weekly meetings will be held on Mondays to decide work for the rest of the week.

## Project Timeline Proposal

| Date | Time | Completed Before Meeting | Discuss at Meeting |
|------|------|--------------------------|--------------------|
| **Feb 18** | **4:00–5:00 PM** | **Checkpoint #1: Data Due — research question finalized and clearly operationalized; dataset selected and dataset size (N) reported; ethics section revised (including bias, misuse, and limitations); datasets collected and initial cleaning completed** | **Review finalized research question, dataset size, and ethics revisions; evaluate data quality and missingness; confirm readiness for EDA and finalize preprocessing steps** |
| Feb 20 | 4:00–5:00 PM | Conduct preliminary exploratory data analysis (EDA); examine distributions of addiction and mental health measures; generate visualizations | Discuss EDA results; assess normality and potential need for transformations; refine analysis plan |
| Feb 22 | 4:00–5:00 PM | Revise EDA; analyze correlations; identify potential confounders (e.g., age, gender, usage time) | Confirm statistical methods (e.g., regression models); discuss assumptions and model limitations |
| Feb 25 | 4:00–5:00 PM | Implement core statistical analysis; check model assumptions; document effect sizes | Review early results; distinguish correlation vs. causation; identify robustness checks |
| Feb 27 | 4:00–5:00 PM | Refine analysis; improve visualizations; draft careful interpretation of findings | Discuss interpretation risks; identify potential overgeneralization or misuse |
| **Mar 4** | **4:00–5:00 PM** | **Checkpoint #2: EDA Due — finalized exploratory analysis; clearly document variable definitions, dataset size (N), and preliminary limitations** | **Review EDA findings; confirm direction for final modeling and writing tasks** |
| Mar 6 | 4:00–5:00 PM | Draft Results section; clearly report dataset size, measures used, and statistical findings | Edit for clarity; ensure no causal overstatement; refine explanation of results |
| Mar 8 | 4:00–5:00 PM | Draft expanded Ethics & Limitations section (measurement limits, sampling bias, confounders, generalizability limits, risks of misuse) | Review ethical framing; strengthen discussion of potential misinterpretation or policy misuse |
| Mar 11 | 4:00–5:00 PM | Compile full project notebook with clear variable definitions and dataset description | Conduct full walkthrough; debug code; check clarity and coherence |
| Mar 13 | 4:00–5:00 PM | Revise notebook based on feedback; polish writing and visualizations | Ensure all rubric elements are fully addressed |
| Mar 15 | 4:00–5:00 PM | Final review of project; verify measurement explanations and dataset size are clearly stated | Confirm submission readiness; final checks for completeness |
| **Mar 18** | **By 11:59 PM** | NA | **Submit final project and complete group project surveys** |
