# COGS 108 - Data Checkpoint

## Authors

- Trisha Hoang: Background research, Conceptualization, Data curation
- Sana Gupta: Data curation and Analysis of Dataset #1 
- Daria Stolyarova: Data curation, Analysis, Software, Writing – original draft, Writing – review & editing
- Sanmita Babu: Data Analysis of Dataset #2, Writing – original draft, Writing – review & editing
- Shivani Parimi: Data curation and Analysis of Dataset #2

## Research Question

Research Question: What is the correlation between daily time spent on algorithm-driven social media platforms (TikTok, Instagram Reels, YouTube Shorts) and college students' self-reported attention span, mood state, and academic productivity? Additionally, do these correlations vary significantly across different global regions?




## Background and Prior Work

TikTok, Instagram Reels, and YouTube Shorts are personalized, AI-driven short-entertainment platforms that continuously adjust to user behavior with the goal of maximizing engagement. Their recommendation systems are designed to push highly stimulating, fast-paced videos that keep users scrolling for extended periods of time. Because these platforms are now deeply integrated into the daily lives of many college students, researchers have begun examining whether frequent exposure to this type of algorithmically curated content is linked to shifts in attention, mood, and academic performance.

In a mixed-methods study of Generation Z university students in Egypt, El-Shihy found that higher levels of social media addiction were significantly associated with greater psychological stress, reduced concentration, and lower academic engagement. Survey data revealed a positive correlation between addiction scores and students’ perceptions of academic interference. In addition, qualitative responses showed consistent themes of procrastination and difficulty maintaining focus while studying. Many participants directly connected their decreased productivity to the amount of time spent on highly engaging social media platforms, suggesting that students themselves recognize a relationship between excessive use and academic disruption.

Pan’s research on TikTok’s personalized recommendation algorithm further highlights the role of algorithmic design. Students described the platform’s content feed as extremely relevant and difficult to disengage from, often leading to extended viewing sessions and losing track of time. Participants frequently reported procrastinating more and struggling to sustain attention on demanding tasks after prolonged use. Similarly, Henrich et al. studied adolescents and young adults between ages 13 and 21 and identified associations between long-term exposure to algorithm-driven content and shifts in attention patterns as well as mood regulation. Their findings suggest that repeated engagement with rapid, personalized content may contribute to shorter attentional cycles and heightened emotional responsiveness.

Although existing research documents clear associations between social media addiction, algorithmic personalization, attention, and psychological outcomes, much of it focuses on single countries or specific age groups. There is still limited cross-regional research examining whether these relationships differ across global contexts. Our project expands on this work by directly measuring correlations between daily time spent on algorithm-driven short-form platforms and college students’ self-reported attention span, mood state, and academic productivity, while also testing whether the strength of these relationships varies across global regions.


## Hypothesis


We hypothesize that greater daily time spent on algorithm-driven short-form platforms (TikTok, Instagram Reels, and YouTube Shorts) will be associated with lower self-reported attention span and reduced academic productivity among college students. Drawing from prior research showing links between social media addiction and decreased concentration and engagement (El-Shihy), as well as reports of time loss and procrastination tied to algorithmic feeds (Pan), we expect to observe a statistically significant moderate negative correlation.

We also predict that increased time on these platforms will be positively correlated with negative mood indicators, such as higher stress, anxiety, or emotional dysregulation. This expectation is supported by Henrich et al., who found associations between sustained exposure to algorithm-driven content and changes in mood-related outcomes.

Finally, we anticipate that the strength of these relationships will differ across global regions, potentially reflecting cultural differences in technology use, academic expectations, and digital access patterns.

[Note for Regrading: This hypothesis was revised to address feedback regarding operational definitions and statistical thresholds. We have specified the platforms being studied (TikTok, Instagram Reels, YouTube Shorts), defined "time spent" as daily self-reported usage, clarified correlation strength using standard benchmarks (moderate = r ≈ 0.3–0.5), and grounded our predictions in cited prior research to justify expected effect sizes.]
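The benchmark and the regional comparison described above can be sketched in a few lines of pandas. This is a minimal illustration, not the project's analysis code: the `toy` data frame is synthetic, the column names `region`, `daily_hours`, and `attention` are hypothetical, and the "weak" and "strong" labels outside the stated moderate band (|r| from 0.3 to 0.5) are our own assumption.

```python
import pandas as pd


def correlation_strength(r: float) -> str:
    """Label |r| using the moderate band stated in the hypothesis (0.3 to 0.5).

    The "weak" and "strong" labels outside that band are an assumption.
    """
    magnitude = abs(r)
    if magnitude < 0.3:
        return "weak"
    if magnitude <= 0.5:
        return "moderate"
    return "strong"


def region_correlations(df, x, y, group):
    """Compute the Pearson correlation between x and y within each group."""
    rows = []
    for name, sub in df.groupby(group):
        r = sub[x].corr(sub[y])  # Series.corr defaults to Pearson
        rows.append({"region": name, "r": r, "strength": correlation_strength(r)})
    return pd.DataFrame(rows)


# Synthetic illustration data, NOT taken from either project dataset.
toy = pd.DataFrame({
    "region": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "daily_hours": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "attention": [9.0, 7.0, 6.0, 3.0, 6.0, 7.0, 5.0, 6.0],
})

by_region = region_correlations(toy, "daily_hours", "attention", "region")
print(by_region)
```

Comparing the per-region `r` values side by side is only a first look; whether the differences between regions are statistically significant would need a separate test.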

## Data

### Data overview

- Dataset #1
  - Dataset Name: Students Social Media Addiction and Academic Impact Dataset
  - Link to the dataset: https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships
  - Number of observations: 705
  - Number of variables: 13
    
  - Description of the variables most relevant to this project
      - Avg_Daily_Usage_Hours – Average number of hours per day spent on social media (measured in hours).
      - Addicted_Score – A self-reported addiction score on a 1–10 scale, where higher values indicate stronger signs of social media dependence.
      - Affects_Academic_Performance – A categorical variable (Yes/No) indicating whether the student believes social media negatively impacts their academic performance.
      - Mental_Health_Score – A 1–10 scale measuring perceived mental health, where higher values indicate better mental well-being.
      - Sleep_Hours_Per_Night – Average number of hours slept per night.
      - Conflicts_Over_Social_Media – A numeric measure capturing interpersonal conflict related to social media use.
        


  - Descriptions of any shortcomings this dataset has with respect to the project
      - The dataset relies heavily on self-reported measures, which means responses may reflect personal bias or inaccurate self-perception rather than objective reality.
      - The data is cross-sectional and captures only one point in time, so while we can observe relationships between variables, we cannot determine causation.
      - The sample only includes students between the ages of 16 and 25, which limits how well the findings generalize to older adults or non-student populations.
      - If the survey was distributed online or completed voluntarily, it may be affected by self-selection bias, meaning the students who chose to participate may not represent the broader student population.
      - Academic performance is measured based on students’ perceptions rather than objective indicators such as GPA, which may not fully reflect actual academic outcomes.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Social Media Usage and Academic Impact Among Students
 
**Dataset 1:** The Students Social Media Addiction and Academic Impact Dataset contains 705 observations that track the intersection of digital habits and student wellbeing. One of the primary metrics is Average Daily Usage Hours, which measures the amount of time a student spends on social media platforms each day. In the context of student health, usage exceeding 4 to 6 hours is often considered high and may correlate with decreased academic focus. Another critical metric is Sleep Hours Per Night, measured in hours. For students, the recommended sleep duration is typically between 7 and 9 hours; values significantly lower than this may indicate that social media use is displacing essential rest, which can further impact cognitive function and classroom performance.

The dataset also utilizes two subjective scales: the Mental Health Score and the Addicted Score. Both are measured on an integer scale from 1 to 10. For the Mental Health Score, a higher value represents better perceived mental wellbeing, whereas for the Addicted Score, a higher value indicates a stronger self-perceived dependency on social media platforms. Additionally, the variable Affects Academic Performance is a categorical binary measure (Yes or No). This represents the student's own assessment of whether their digital habits have hindered their schooling. These metrics allow for a multidimensional look at how digital consumption relates to both the psychological and practical aspects of a student's life.

There are several concerns regarding the data that should be noted. Because all metrics, including hours used and the addiction score, are self-reported, the dataset is susceptible to recall bias and social desirability bias. Students may underestimate their actual screen time or hesitate to report a high addiction score. Furthermore, the dataset includes students from various countries such as Bangladesh, India, the UK, and the USA, but with a relatively small sample of 705 students, it may not be representative of the global student population. Finally, because the data is cross-sectional, it can show correlations between social media use and academic performance, but it cannot establish that social media addiction causes lower grades.

### A. Load Dataset

In [23]:
### A. Load dataset from data/00-raw/
import pandas as pd

path = "data/00-raw/students_social_media_addcition.csv"
df_raw = pd.read_csv(path)

df_raw.head()

| | Student_ID | Age | Gender | Academic_Level | Country | Avg_Daily_Usage_Hours | Most_Used_Platform | Affects_Academic_Performance | Sleep_Hours_Per_Night | Mental_Health_Score | Relationship_Status | Conflicts_Over_Social_Media | Addicted_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 19 | Female | Undergraduate | Bangladesh | 5.2 | Instagram | Yes | 6.5 | 6 | In Relationship | 3 | 8 |
| 1 | 2 | 22 | Male | Graduate | India | 2.1 | Twitter | No | 7.5 | 8 | Single | 0 | 3 |
| 2 | 3 | 20 | Female | Undergraduate | USA | 6.0 | TikTok | Yes | 5.0 | 5 | Complicated | 4 | 9 |
| 3 | 4 | 18 | Male | High School | UK | 3.0 | YouTube | No | 7.0 | 7 | Single | 1 | 4 |
| 4 | 5 | 21 | Male | Graduate | Canada | 4.5 | Facebook | Yes | 6.0 | 6 | In Relationship | 2 | 7 |


### B. Tidiness check
This dataset already follows tidy data principles. Each row represents one student, and each column captures a single variable, such as demographic information, social media usage patterns, or outcome measures. There are no duplicated headers, merged cells, or columns that combine multiple variables. Because of this clear structure, no reshaping or restructuring is needed before moving on to analysis.



### C. Size of Dataset 

In [24]:
# Check number of observations and variables
df_raw.shape

(705, 13)

The dataset contains 705 observations (rows) and 13 variables (columns). Each row represents one individual student, and each column captures a specific characteristic or outcome. The variables cover demographic information (such as age, gender, academic level, and country), measures of social media usage, indicators of academic impact, sleep patterns, mental health scores, relationship status, and reported conflicts related to social media use.

### D. Missing Data

In [25]:
# Check for missing values in each column
df_raw.isnull().sum()

Student_ID                      0
Age                             0
Gender                          0
Academic_Level                  0
Country                         0
Avg_Daily_Usage_Hours           0
Most_Used_Platform              0
Affects_Academic_Performance    0
Sleep_Hours_Per_Night           0
Mental_Health_Score             0
Relationship_Status             0
Conflicts_Over_Social_Media     0
Addicted_Score                  0
dtype: int64

No missing values were detected in any column. All 705 observations contain complete information across all 13 variables. Since the dataset has no missing data, there are no observable patterns of missingness to evaluate, and no imputation or row removal was necessary.


### E. Outliers / Suspicious Entries

In [27]:
# Check column names
df_raw.columns

Index(['Student_ID', 'Age', 'Gender', 'Academic_Level', 'Country',
       'Avg_Daily_Usage_Hours', 'Most_Used_Platform',
       'Affects_Academic_Performance', 'Sleep_Hours_Per_Night',
       'Mental_Health_Score', 'Relationship_Status',
       'Conflicts_Over_Social_Media', 'Addicted_Score'],
      dtype='object')

After reviewing the summary statistics, all numeric variables appear to fall within realistic and expected ranges. Ages align with typical student populations. Average daily social media usage stays within a reasonable 0–24 hour range, and reported sleep hours per night are biologically plausible. Additionally, both Mental_Health_Score and Addicted_Score remain within their intended 1–10 scales. No extreme or impossible values were identified, suggesting there are no clear outliers or suspicious entries in the dataset.
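One way to make the range review above reproducible is to count, per column, how many values fall outside an expected interval. The sketch below is illustrative only: the bounds in `EXPECTED_RANGES` are assumptions drawn from the variable descriptions in this report, `out_of_range_counts` is a hypothetical helper, and `sample` holds just the five preview rows shown in the load step rather than the full 705 rows.

```python
import pandas as pd

# Assumed plausible bounds, taken from the variable descriptions in this report.
EXPECTED_RANGES = {
    "Avg_Daily_Usage_Hours": (0, 24),
    "Sleep_Hours_Per_Night": (0, 24),
    "Mental_Health_Score": (1, 10),
    "Addicted_Score": (1, 10),
}


def out_of_range_counts(df, ranges):
    """Count values per column that fall outside [low, high] (inclusive).

    Missing values are also counted, because Series.between returns False for NaN.
    """
    counts = {}
    for col, (low, high) in ranges.items():
        counts[col] = int((~df[col].between(low, high)).sum())
    return pd.Series(counts, name="n_out_of_range")


# The five preview rows from the load step, used here only as a small example.
sample = pd.DataFrame({
    "Avg_Daily_Usage_Hours": [5.2, 2.1, 6.0, 3.0, 4.5],
    "Sleep_Hours_Per_Night": [6.5, 7.5, 5.0, 7.0, 6.0],
    "Mental_Health_Score": [6, 8, 5, 7, 6],
    "Addicted_Score": [8, 3, 9, 4, 7],
})

print(sample.describe().loc[["min", "max"]])  # observed minimum and maximum
print(out_of_range_counts(sample, EXPECTED_RANGES))
```

Applied to the full data, the same call would be `out_of_range_counts(df_raw, EXPECTED_RANGES)`; a count of zero in every column would support the conclusion stated above.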


### F. Cleaning

In [8]:
# Check data types
df_raw.dtypes

Student_ID                        int64
Age                               int64
Gender                           object
Academic_Level                   object
Country                          object
Avg_Daily_Usage_Hours           float64
Most_Used_Platform               object
Affects_Academic_Performance     object
Sleep_Hours_Per_Night           float64
Mental_Health_Score               int64
Relationship_Status              object
Conflicts_Over_Social_Media       int64
Addicted_Score                    int64
dtype: object

After checking the data types, all variables appear to be stored in appropriate formats (integers, floats, and categorical/object types where expected). Earlier checks also showed that there are no missing values or obvious inconsistencies in the dataset. Because of this, no additional cleaning steps were necessary. The dataset appears internally consistent and ready for analysis.


### Student Social Media & Academic Performance Survey

The Student Social Media & Academic Performance dataset is a survey-based dataset with 405 student responses that focuses on how students' social media habits relate to academic outcomes and general wellbeing. Each row represents one student and each column represents one measured attribute from the questionnaire. The dataset includes a mix of demographic context variables and behavioral/psychological variables that describe social media behavior, along with academic and wellbeing-related outcomes.

The primary metric is "Time spent in social media (hours)", a self-reported figure that quantifies usage as hours per day. However, this column has significant formatting inconsistencies: some students entered clean integers while others wrote entries like "Highest 2 hour" or "2-3 hrs," which makes numerical analysis difficult without cleaning. Similarly, "Study time (In Hours)" and "Attendance rate (In Percentile)" mix text entries with plain numbers and need to be standardized before modeling. "Most time spent in a day" is a categorical variable with four options (Morning, Afternoon, Evening, Night) that indicates when during the day students are most active on social media. This is relevant because evening and night use can interfere with sleep more than daytime use. The dataset also records which platform students use (e.g., TikTok, YouTube, Instagram) and whether they are physically active for 30+ minutes per day (Yes/No). Time-based metrics are important because they are direct and interpretable: we treat higher daily time spent (4+ hours) as greater exposure to algorithm-driven feeds, which can displace study time, sleep, or attention, although high usage alone does not imply addiction.
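As a rough illustration of the standardization these time columns would need, the sketch below pulls numbers out of free-text entries with a regular expression. It is not the cleaning code used in this project. `parse_hours` is a hypothetical helper, and two of its choices are assumptions: a range such as "2-3 hrs" is collapsed to its midpoint, and "Highest 2 hour" is read as 2 hours.

```python
import re
from typing import Optional

# Matches integers and decimals such as "2", "6", or "3.5".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_hours(raw) -> Optional[float]:
    """Convert a free-text hours entry to a float, or None if no number is found.

    Assumption: when two or more numbers appear (e.g. "2-3 hrs"), the entry is
    treated as a range and the midpoint of the first two numbers is returned.
    """
    numbers = [float(match) for match in _NUMBER.findall(str(raw))]
    if not numbers:
        return None
    if len(numbers) >= 2:
        return (numbers[0] + numbers[1]) / 2
    return numbers[0]


examples = ["4", "Highest 2 hour", "2-3 hrs", "6hrs", "6-7 hour", "none"]
for text in examples:
    print(repr(text), "->", parse_hours(text))
```

A helper like this ignores units, so an entry written in minutes would be misread as hours; entries that return `None` or look implausible should be reviewed by hand rather than silently dropped.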

The dataset also includes metrics related to wellbeing and attention-related experiences, all measured on ordinal 1-5 Likert scales (sleep disturbance on sleep quality, mood modification, anxiety, depression, self-esteem, and social media distraction during academic activities). Higher values on the anxiety, depression, and distraction scales indicate worse outcomes, while higher self-esteem values indicate better perceived self-worth. Withdrawal Symptoms is a categorical variable with five response options (e.g., "Feeling restless," "Cravings," "Trouble concentrating," "No symptoms") that relate to behavioral dependency. Last Academic Result (GPA/CGPA) is an outcome variable representing students' academic performance, but it is formatted inconsistently, mixing decimals and letter grades. While these metrics can be used to test whether heavier short-form/algorithmic consumption is associated with poorer self-reported attention and more negative mood indicators, they capture association rather than causation.

Since all metrics are survey-based and self-reported, respondents may underreport usage or overreport positive academic habits due to recall error and social desirability bias. Also, since the dataset appears to draw responses primarily from Metropolitan University and related academic communities, it may not represent college students globally, which limits how confidently we can generalize findings across regions. Responses were collected at one point in time, which allows us to identify correlations (e.g., higher usage associated with lower productivity) but not to conclude causation. Many survey items mix numeric and categorical/ordinal responses, which require preprocessing and a correlation method appropriate to each data type. Lastly, other factors such as course load and access to devices or the internet can influence both usage and outcomes, so there is potential confounding in the dataset.
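For the ordinal 1-5 Likert items, a rank-based coefficient is one reasonable choice of correlation method. The sketch below computes Spearman's rho as the Pearson correlation of average ranks, which needs only pandas. The `toy` values are synthetic and the column names `hours` and `distraction` are hypothetical; this is an illustration of the method, not a result from the survey.

```python
import pandas as pd


def spearman_rho(x: pd.Series, y: pd.Series) -> float:
    """Spearman's rho, computed as the Pearson correlation of average ranks."""
    return x.rank(method="average").corr(y.rank(method="average"))


# Synthetic illustration data, NOT taken from the survey.
toy = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "distraction": [1, 2, 2, 4, 5, 5],  # ordinal 1-5 Likert responses with ties
})

rho = spearman_rho(toy["hours"], toy["distraction"])
pearson_r = toy["hours"].corr(toy["distraction"])
print(f"Spearman rho = {rho:.3f}, Pearson r = {pearson_r:.3f}")
```

Because ranks only use the ordering of responses, this avoids treating the gap between Likert values 1 and 2 as equal to the gap between 4 and 5, which a Pearson correlation on the raw scores would implicitly assume.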

How we will join/align/cross-reference data from **both** datasets:



### A. Load Dataset

In [30]:

import pandas as pd

path = "data/00-raw/survey.csv"
df2_raw = pd.read_csv(path)

df2_raw.head()


| | Timestamp | Age | Gender | Residence Area | Education Level | Socioeconomic status (Parent's education level) | Study time (In Hours) | Attendance rate (In Percentile) | Social Media Platform | Time spent in social media (hours) | ... | Physical activity (30 min+) | Withdrawal symptoms (Side effects of not using social media) | Sleep Disturbance on Sleep Quality | Mood Modification Scale | Anxiety Scale | Depression Scale | Self-esteem Scale | Last Academic Result (GPA/CGPA) | Social Media Distraction During Academic Activities | Column 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26/04/2025 01:14:36 | 28 | Male | Urban | Tertiary Education | Tertiary Education | 2 | 80 | YouTube | 4 | ... | Yes | Feeling restless | 4 | 3 | 4 | 5 | 4 | 3.55 | 5 | |
| 1 | 26/04/2025 01:24:20 | 21 | Male | Rural | Tertiary Education | No education | 3 | 60 | Instagram | 5 | ... | Yes | No symtoms | 5 | 2 | 2 | 1 | 5 | 3.1 | 3 | |
| 2 | 27/04/2025 21:38:37 | 17 | Female | Rural | HSC / A' Level | SSC / O' Level | 6-7 hour | 70% | YouTube | Highest 2 hour | ... | Yes | Cravings | 1 | 2 | 2 | 1 | 1 | A- | 1 | |
| 3 | 27/04/2025 21:42:51 | 22 | Male | Urban | Tertiary Education | HSC / A' Level | 1-2hrs | 80% | Facebook | 6hrs | ... | Yes | Feeling restless | 3 | 2 | 3 | 3 | 3 | gpa 5/5 | 1 | |
| 4 | 27/04/2025 21:46:22 | 26 | Male | Urban | Tertiary Education | Tertiary Education | 2 | 85 | Facebook | 5 | ... | No | Cravings | 4 | 3 | 4 | 3 | 4 | 2.78 | 4 | |


### B. Tidiness Check
The dataset is tidy: each row represents one student respondent, and each column is a distinct variable. There are no duplicated headers, merged cells, or columns that combine multiple variables, so no reshaping is needed.

In [17]:
# Display first few rows and column names
df2_raw.head()
df2_raw.columns

Index(['Timestamp', 'Age', 'Gender', 'Residence Area  ', 'Education Level',
       'Socioeconomic status (Parent's education level)',
       'Study time (In Hours)', 'Attendance rate (In Percentile)',
       'Social Media Platform', 'Time spent in social media (hours)',
       'Most time spent in a day', 'Physical activity (30 min+) ',
       'Withdrawal symptoms (Side effects of not using social media)',
       'Sleep Disturbance on Sleep Quality', 'Mood Modification Scale',
       'Anxiety Scale ', 'Depression Scale', 'Self-esteem Scale',
       'Last Academic Result (GPA/CGPA)',
       'Social Media Distraction During Academic Activities ', 'Column 19'],
      dtype='object')
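The column listing above shows several names with trailing spaces (for example `'Residence Area  '` and `'Anxiety Scale '`) and a final `Column 19`. A minimal sketch of how such names could be normalized is given below. `tidy_columns` is a hypothetical helper, the `toy` frame is a small stand-in for `df2_raw`, and the assumption that `Column 19` is empty in every row has not been verified here.

```python
import numpy as np
import pandas as pd


def tidy_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip whitespace from column names and drop columns that are entirely empty."""
    out = df.copy()
    out.columns = out.columns.str.strip()
    # how="all" removes a column only when every value in it is missing.
    return out.dropna(axis="columns", how="all")


# Small stand-in for df2_raw; the real frame has 21 columns.
toy = pd.DataFrame({
    "Residence Area  ": ["Urban", "Rural"],
    "Anxiety Scale ": [4, 2],
    "Column 19": [np.nan, np.nan],  # assumed to be fully empty
})

clean = tidy_columns(toy)
print(list(clean.columns))
```

Stripping the names first means later code can refer to `"Anxiety Scale"` without having to remember the trailing space.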

### C. Size of Dataset
The dataset contains 405 rows (observations) and 21 columns (variables). Each observation corresponds to one student, and the variables include demographic information (e.g., age, gender, academic background), academic-related factors, social media usage habits, and self-reported wellbeing or behavioral indicators. This scope allows us to examine the relationship between social media behavior and academic/psychological outcomes.

In [13]:
# Check number of rows and columns
df2_raw.shape


(405, 21)

### D. Missing Data

The missing-value check below reports zero missing entries, so no rows or columns need to be removed or imputed on that basis. Two caveats apply: the trailing `Column 19` appears empty in the first five rows shown earlier and should be verified separately, and malformed text entries (for example "Highest 2 hour") are not counted as missing by `isnull()`.

In [24]:
# Check missing values
df2_raw.isnull().sum()

Timestamp                                                         0
Age                                                               0
Gender                                                            0
Residence Area                                                    0
Education Level                                                   0
Socioeconomic status (Parent's education level)                   0
Study time (In Hours)                                             0
Attendance rate (In Percentile)                                   0
Social Media Platform                                             0
Time spent in social media (hours)                                0
Most time spent in a day                                          0
Physical activity (30 min+)                                       0
Withdrawal symptoms (Side effects of not using social media)      0
Sleep Disturbance on Sleep Quality                                0
Mood Modification Scale                         

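### E. Outliers / Suspicious Entries
After looking at the column names and ranges, the numeric values appear reasonable, suggesting no obvious outliers or suspicious entries in the dataset. Demographic variables also align with typical student populations, and usage-related quantities have plausible ranges. Because several quantitative columns contain free-text entries (for example "Highest 2 hour"), a full range check requires standardizing those columns first.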

In [27]:
# Check column names
df2_raw.columns


Index(['Timestamp', 'Age', 'Gender', 'Residence Area  ', 'Education Level',
       'Socioeconomic status (Parent's education level)',
       'Study time (In Hours)', 'Attendance rate (In Percentile)',
       'Social Media Platform', 'Time spent in social media (hours)',
       'Most time spent in a day', 'Physical activity (30 min+) ',
       'Withdrawal symptoms (Side effects of not using social media)',
       'Sleep Disturbance on Sleep Quality', 'Mood Modification Scale',
       'Anxiety Scale ', 'Depression Scale', 'Self-esteem Scale',
       'Last Academic Result (GPA/CGPA)',
       'Social Media Distraction During Academic Activities ', 'Column 19'],
      dtype='object')

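### F. Cleaning

Reviewing the data types shows that the columns listed in the output below are stored as `object`, including quantitative variables such as Age, Study time, Attendance rate, and Time spent in social media. This is consistent with the formatting inconsistencies described earlier, where some responses mix text with numbers. These columns therefore need to be standardized and converted to numeric types before analysis, even though no missing values were reported.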

In [28]:
# Check data types
df2_raw.dtypes


Timestamp                                                        object
Age                                                              object
Gender                                                           object
Residence Area                                                   object
Education Level                                                  object
Socioeconomic status (Parent's education level)                  object
Study time (In Hours)                                            object
Attendance rate (In Percentile)                                  object
Social Media Platform                                            object
Time spent in social media (hours)                               object
Most time spent in a day                                         object
Physical activity (30 min+)                                      object
Withdrawal symptoms (Side effects of not using social media)     object
Sleep Disturbance on Sleep Quality                              
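Since the quantitative columns listed above are stored as `object`, a useful first diagnostic is to list which entries fail a strict numeric conversion. The sketch below uses `pd.to_numeric` with `errors="coerce"`. `non_numeric_entries` is a hypothetical helper, and `study_time` contains only the five preview values of "Study time (In Hours)" from the load step, not the full column.

```python
import pandas as pd


def non_numeric_entries(series: pd.Series) -> pd.Series:
    """Return the raw entries that cannot be parsed as numbers."""
    converted = pd.to_numeric(series, errors="coerce")  # unparseable values become NaN
    failed = converted.isna() & series.notna()          # ignore values already missing
    return series[failed]


# Preview values of "Study time (In Hours)" from the first five rows.
study_time = pd.Series(["2", "3", "6-7 hour", "1-2hrs", "2"], name="Study time (In Hours)")

problems = non_numeric_entries(study_time)
print(problems.tolist())
```

Running a check like this on each quantitative column gives a concrete list of the entries that need a cleaning rule before the column can be cast to a numeric type.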

## Ethics

A primary concern is the privacy paradox, where participants may share sensitive personal information without grasping the long-term implications of that data being recorded, such as precise daily routines. Because the question relies on self-reported metrics for "mood" and "attention," there is a high risk of social desirability bias. Students may subconsciously underreport their actual screen time on apps like TikTok or overreport their academic productivity to appear more disciplined.

The dataset is also susceptible to representation bias and selection bias, particularly regarding the "digital divide." The data we collect inherently excludes students who may not have access to high-speed internet or high-end smartphones capable of running these apps. Social media algorithms themselves reflect algorithmic bias, where certain content or demographics are prioritized over others. If we don't carefully balance our dataset, the conclusions might suggest that "average" productivity looks like a specific demographic's experience, further marginalizing students from different socioeconomic or cultural backgrounds. To address this, we will conduct a demographic audit to compare our sample's diversity against global university enrollment statistics and use stratified analysis to look for patterns within specific subgroups rather than just the aggregate "average student." When communicating results, we will explicitly state the limitations of our sample and avoid making universal claims if certain populations are underrepresented.

Collecting information related to screen usage patterns, emotional state, and daily behaviors may reveal sensitive details about individuals' lives, even when identifiers are removed. There is also a concern that data collected for a class project could later be misunderstood or used in ways it wasn't originally intended. To reduce this risk, we will only collect information that is truly necessary, avoid gathering identifying details, and make sure data is stored securely. We will also be clear with participants about why their data is being collected and how it will be used and protected. Our analysis will focus on overall trends and subgroup patterns rather than individual-level behavior, and when presenting findings, we will be careful to explain the limits of the data and avoid making broad or judgmental conclusions.

## Team Expectations 

 - Communication and Meetings: We will primarily communicate via iMessage for quick updates and questions throughout the week while using GitHub for tracking code and progress. Team members are expected to respond within 24 hours on weekdays and in a timely manner when deadlines approach! We will meet at least once to check progress and plan next steps.
 - Respect: We agree to communicate respectfully and constructively, giving reasonable feedback on the work rather than criticizing individuals.
 - Decision Making: Major decisions will be discussed as a group and decided by consensus/majority vote.
 - Task Distribution: Tasks will be divided based on individual strengths and interests (e.g., data wrangling, analysis, writing), but all members are expected to contribute to every major aspect of the project. We will track responsibilities and deadlines using GitHub commits and shared notes. If issues arise, everyone should communicate early!
 - Conflict: If a team member is struggling with a task, they should let the others know as soon as possible so that tasks can be redistributed.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|--------------|--------------|--------------------------|--------------------|
| 2/4  | 6 PM  | Finalize and submit Project Proposal; brainstorm data collection strategies/topic | Confirm communication norms; assign roles and responsibilities between coding/research; discuss hypothesis; start background research |
| 2/11 | 8 PM  | Identify potential datasets; review data availability and ethics considerations | Select final dataset; plan data wrangling approach |
| 2/18 | 8 PM  | Import and clean dataset; begin exploratory data analysis | Review wrangling and EDA; identify patterns and data limitations; submit Checkpoint #1: Data |
| 2/23 | 6 PM  | Import & wrangle data; EDA; finalize cleaned dataset and begin analysis | Prepare and submit Checkpoint #2: EDA; finalize analysis plan |
| 3/8 | 7 PM  | Complete analysis; draft results, discussion, and conclusion sections | Edit and refine full project; finalize narrative and interpretations |
| 3/17 | 10 PM | Final proofreading; polish code and visualizations | Submit final project, video, and complete team evaluation surveys |