# COGS 108 - Data Checkpoint

## Authors

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

- Trisha Hoang: 
- Sana Gupta: 
- Daria Stolyarova: Data curation, Analysis, Software, Writing – original draft, Writing – review & editing
- Sanmita Babu: 
- Shivani Parimi: Data curation and Analysis of Dataset #2

## Research Question

Research Question: What is the correlation between daily time spent on algorithm-driven social media platforms (TikTok, Instagram Reels, YouTube Shorts) and college students' self-reported attention span, mood state, and academic productivity? Additionally, do these correlations vary significantly across different global regions?




## Background and Prior Work

TikTok, Instagram Reels, and YouTube Shorts are personalized, AI-driven short-form video platforms that continuously adapt to user behavior with the goal of maximizing engagement. Their recommendation systems are designed to surface highly stimulating, fast-paced videos that keep users scrolling for extended periods. Because these platforms are now deeply integrated into the daily lives of many college students, researchers have begun examining whether frequent exposure to this type of algorithmically curated content is linked to shifts in attention, mood, and academic performance.

In a mixed-methods study of Generation Z university students in Egypt, El-Shihy found that higher levels of social media addiction were significantly associated with greater psychological stress, reduced concentration, and lower academic engagement. Survey data revealed a positive correlation between addiction scores and students’ perceptions of academic interference. In addition, qualitative responses showed consistent themes of procrastination and difficulty maintaining focus while studying. Many participants directly connected their decreased productivity to the amount of time spent on highly engaging social media platforms, suggesting that students themselves recognize a relationship between excessive use and academic disruption.

Pan’s research on TikTok’s personalized recommendation algorithm further highlights the role of algorithmic design. Students described the platform’s content feed as extremely relevant and difficult to disengage from, often leading to extended viewing sessions and losing track of time. Participants frequently reported procrastinating more and struggling to sustain attention on demanding tasks after prolonged use. Similarly, Henrich et al. studied adolescents and young adults between ages 13 and 21 and identified associations between long-term exposure to algorithm-driven content and shifts in attention patterns as well as mood regulation. Their findings suggest that repeated engagement with rapid, personalized content may contribute to shorter attentional cycles and heightened emotional responsiveness.

Although existing research documents clear associations between social media addiction, algorithmic personalization, attention, and psychological outcomes, much of it focuses on single countries or specific age groups. There is still limited cross-regional research examining whether these relationships differ across global contexts. Our project expands on this work by directly measuring correlations between daily time spent on algorithm-driven short-form platforms and college students’ self-reported attention span, mood state, and academic productivity, while also testing whether the strength of these relationships varies across global regions.


## Hypothesis


We hypothesize that greater daily time spent on algorithm-driven short-form platforms (TikTok, Instagram Reels, and YouTube Shorts) will be associated with lower self-reported attention span and reduced academic productivity among college students. Drawing from prior research showing links between social media addiction and decreased concentration and engagement (El-Shihy), as well as reports of time loss and procrastination tied to algorithmic feeds (Pan), we expect to observe a statistically significant moderate negative correlation.

We also predict that increased time on these platforms will be positively correlated with negative mood indicators, such as higher stress, anxiety, or emotional dysregulation. This expectation is supported by Henrich et al., who found associations between sustained exposure to algorithm-driven content and changes in mood-related outcomes.

Finally, we anticipate that the strength of these relationships will differ across global regions, potentially reflecting cultural differences in technology use, academic expectations, and digital access patterns.
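As a rough sketch of how these predictions could be checked, the snippet below computes a Spearman rank correlation within each group. It is not the final analysis: the helper `spearman_by_group` is hypothetical, the toy values are invented for illustration, and using `Country` as the regional grouping and `Mental_Health_Score` as a mood proxy are assumptions based on the Dataset #1 column names. Spearman is computed here as the Pearson correlation of the ranks, which keeps the example to pandas only.

```python
import pandas as pd

def spearman_by_group(df, x, y, group):
    """Spearman rank correlation between x and y within each level of `group`.

    Implemented as the Pearson correlation of the ranks.
    """
    rhos = {
        name: g[x].rank().corr(g[y].rank())
        for name, g in df.groupby(group)
    }
    return pd.Series(rhos, name=f"spearman({x}, {y})")

# Hypothetical toy data, NOT taken from the real dataset.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Avg_Daily_Usage_Hours": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "Mental_Health_Score": [9, 7, 6, 4, 5, 6, 8, 9],
})

rho_by_country = spearman_by_group(
    toy, "Avg_Daily_Usage_Hours", "Mental_Health_Score", "Country"
)
print(rho_by_country)
```

With real data, groups that contain only a few students would give unstable estimates, so a minimum group size would likely be needed before comparing regions.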


## Data

### Data overview

- Dataset #1
  - Dataset Name: Students Social Media Addiction and Academic Impact Dataset
  - Link to the dataset: https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships
  - Number of observations: 705
  - Number of variables: 13
    
  - Description of the variables most relevant to this project
      - Avg_Daily_Usage_Hours – Average number of hours per day spent on social media (measured in hours).
      - Addicted_Score – A self-reported addiction score on a 1–10 scale, where higher values indicate stronger signs of social media dependence.
      - Affects_Academic_Performance – A categorical variable (Yes/No) indicating whether the student believes social media negatively impacts their academic performance.
      - Mental_Health_Score – A 1–10 scale measuring perceived mental health, where higher values indicate better mental well-being.
      - Sleep_Hours_Per_Night – Average number of hours slept per night.
      - Conflicts_Over_Social_Media – A numeric measure capturing interpersonal conflict related to social media use.
        


  - Descriptions of any shortcomings this dataset has with respect to the project
      - The dataset relies heavily on self-reported measures, which means responses may reflect personal bias or inaccurate self-perception rather than objective reality.
      - The data is cross-sectional and captures only one point in time, so while we can observe relationships between variables, we cannot determine causation.
      - The sample only includes students between the ages of 16 and 25, which limits how well the findings generalize to older adults or non-student populations.
      - If the survey was distributed online or completed voluntarily, it may be affected by self-selection bias, meaning the students who chose to participate may not represent the broader student population.
      - Academic performance is measured based on students’ perceptions rather than objective indicators such as GPA, which may not fully reflect actual academic outcomes.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Social Media Usage and Academic Impact Among Students
 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"

**Dataset 1:**
The Students Social Media Addiction and Academic Impact Dataset contains 705 observations that track the intersection of digital habits and student wellbeing. One of the primary metrics is Average Daily Usage Hours, which measures the amount of time, in hours, that a student spends on social media platforms each day. In the context of student health, usage exceeding 4 to 6 hours is often considered high and may correlate with decreased academic focus. Another critical metric is Sleep Hours Per Night, also measured in hours. For students, the recommended sleep duration is typically between 7 and 9 hours; values significantly lower than this may indicate that social media use is displacing essential rest, which can further affect cognitive function and classroom performance.

The dataset also uses two subjective scales: the Mental Health Score and the Addicted Score. Both are measured on an integer scale from 1 to 10. For the Mental Health Score, a higher value represents better perceived mental wellbeing, whereas for the Addicted Score, a higher value indicates a stronger self-perceived dependency on social media platforms. Additionally, the variable Affects Academic Performance is a binary categorical measure (Yes or No) that records the student's own assessment of whether their digital habits have hindered their schooling. Together, these metrics allow for a multidimensional look at how digital consumption relates to both the psychological and practical aspects of a student's life.

There are several concerns regarding the data that should be noted. Because all metrics, including hours used and the addiction score, are self-reported, the dataset is susceptible to recall bias and social desirability bias. Students may underestimate their actual screen time or hesitate to report a high addiction score. Furthermore, the dataset includes students from various countries such as Bangladesh, India, the UK, and the USA, but with a relatively small sample of 705 students, the data may not be representative of the global student population. Finally, because the data is cross-sectional, it can show correlations between social media use and academic performance, but it cannot establish that social media addiction causes lower grades.

3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and whether it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you


### A. Load Dataset

In [23]:
### A. Load dataset from data/00-raw/
import pandas as pd

path = "data/00-raw/students_social_media_addcition.csv"
df_raw = pd.read_csv(path)

df_raw.head()

| | Student_ID | Age | Gender | Academic_Level | Country | Avg_Daily_Usage_Hours | Most_Used_Platform | Affects_Academic_Performance | Sleep_Hours_Per_Night | Mental_Health_Score | Relationship_Status | Conflicts_Over_Social_Media | Addicted_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 19 | Female | Undergraduate | Bangladesh | 5.2 | Instagram | Yes | 6.5 | 6 | In Relationship | 3 | 8 |
| 1 | 2 | 22 | Male | Graduate | India | 2.1 | Twitter | No | 7.5 | 8 | Single | 0 | 3 |
| 2 | 3 | 20 | Female | Undergraduate | USA | 6.0 | TikTok | Yes | 5.0 | 5 | Complicated | 4 | 9 |
| 3 | 4 | 18 | Male | High School | UK | 3.0 | YouTube | No | 7.0 | 7 | Single | 1 | 4 |
| 4 | 5 | 21 | Male | Graduate | Canada | 4.5 | Facebook | Yes | 6.0 | 6 | In Relationship | 2 | 7 |


### B. Tidiness check
This dataset already follows tidy data principles. Each row represents one student, and each column captures a single variable, such as demographic information, social media usage patterns, or outcome measures. There are no duplicated headers, merged cells, or columns that combine multiple variables. Because of this clear structure, no reshaping or restructuring is needed before moving on to analysis.
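The "one row per student" claim can also be checked in code. The snippet below is a small sketch of such a check; the helper name `tidy_report` is hypothetical, and the toy frame stands in for `df_raw`, with `Student_ID` assumed to be the unique identifier.

```python
import pandas as pd

def tidy_report(df, id_col):
    """Count structural problems that would contradict a tidy layout."""
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "duplicate_columns": int(df.columns.duplicated().sum()),
    }

# Hypothetical toy data standing in for df_raw.
toy = pd.DataFrame({"Student_ID": [1, 2, 3], "Age": [19, 22, 20]})
report = tidy_report(toy, "Student_ID")
print(report)
```

All three counts being zero is consistent with each row describing exactly one student and each column holding one variable.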



### C. Size of Dataset 

In [24]:
# Check number of observations and variables
df_raw.shape

(705, 13)

The dataset contains 705 observations (rows) and 13 variables (columns). Each row represents one individual student, and each column captures a specific characteristic or outcome. The variables cover demographic information (such as age, gender, academic level, and country), measures of social media usage, indicators of academic impact, sleep patterns, mental health scores, relationship status, and reported conflicts related to social media use.

### D. Missing Data

In [25]:
# Check for missing values in each column
df_raw.isnull().sum()

Student_ID                      0
Age                             0
Gender                          0
Academic_Level                  0
Country                         0
Avg_Daily_Usage_Hours           0
Most_Used_Platform              0
Affects_Academic_Performance    0
Sleep_Hours_Per_Night           0
Mental_Health_Score             0
Relationship_Status             0
Conflicts_Over_Social_Media     0
Addicted_Score                  0
dtype: int64

No missing values were detected in any column. All 705 observations contain complete information across all 13 variables. Since the dataset has no missing data, there are no observable patterns of missingness to evaluate, and no imputation or row removal was necessary.


### E. Outliers / Suspicious Entries

In [27]:
# Check column names
df_raw.columns

Index(['Student_ID', 'Age', 'Gender', 'Academic_Level', 'Country',
       'Avg_Daily_Usage_Hours', 'Most_Used_Platform',
       'Affects_Academic_Performance', 'Sleep_Hours_Per_Night',
       'Mental_Health_Score', 'Relationship_Status',
       'Conflicts_Over_Social_Media', 'Addicted_Score'],
      dtype='object')

Based on summary statistics for the numeric columns (the cell above lists only the column names), all numeric variables appear to fall within realistic and expected ranges. Ages align with typical student populations. Average daily social media usage stays within the possible 0–24 hour range, and reported sleep hours per night are biologically plausible. Additionally, both Mental_Health_Score and Addicted_Score remain within their intended 1–10 scales. No extreme or impossible values were identified, suggesting there are no clear outliers or suspicious entries in the dataset.
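The range checks described above could be made explicit along the following lines. This is a sketch under stated assumptions: the helpers `out_of_range_counts` and `iqr_outlier_mask` are hypothetical, the 1–10 bounds come from the variable descriptions, the 0–24 bounds are simply the hours in a day, and the toy values (including the deliberately impossible ones) are invented.

```python
import pandas as pd

# Expected bounds per column (assumptions, see the text above).
EXPECTED_RANGES = {
    "Avg_Daily_Usage_Hours": (0, 24),
    "Sleep_Hours_Per_Night": (0, 24),
    "Mental_Health_Score": (1, 10),
    "Addicted_Score": (1, 10),
}

def out_of_range_counts(df, ranges):
    """Count values outside [low, high] per column (missing values count too)."""
    return {
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }

def iqr_outlier_mask(series, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical toy data with two impossible entries (30 hours, score of 11).
toy = pd.DataFrame({
    "Avg_Daily_Usage_Hours": [2.0, 3.0, 4.0, 5.0, 30.0],
    "Sleep_Hours_Per_Night": [6.5, 7.0, 7.5, 8.0, 7.0],
    "Mental_Health_Score": [6, 8, 5, 7, 6],
    "Addicted_Score": [8, 3, 9, 4, 11],
})

counts = out_of_range_counts(toy, EXPECTED_RANGES)
flags = iqr_outlier_mask(toy["Avg_Daily_Usage_Hours"])
print(counts)
print(toy[flags])
```

Applied to `df_raw`, zero counts in every column would back up the statement that no impossible values are present. The IQR rule is only one common convention for flagging unusual values, and flagged rows would still need a manual look.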


### F. Cleaning

In [8]:
# Check data types
df_raw.dtypes

Student_ID                        int64
Age                               int64
Gender                           object
Academic_Level                   object
Country                          object
Avg_Daily_Usage_Hours           float64
Most_Used_Platform               object
Affects_Academic_Performance     object
Sleep_Hours_Per_Night           float64
Mental_Health_Score               int64
Relationship_Status              object
Conflicts_Over_Social_Media       int64
Addicted_Score                    int64
dtype: object

After checking the data types, all variables appear to be stored in appropriate formats (integers, floats, and categorical/object types where expected). Earlier checks also showed that there are no missing values or obvious inconsistencies in the dataset. Because of this, no additional cleaning steps were necessary. The dataset appears internally consistent and ready for analysis.
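The instructions ask for the final wrangled data to be written to `data/02-processed`. A minimal sketch of that step follows; the helper name `save_processed` and the filename in the commented usage line are hypothetical.

```python
from pathlib import Path

import pandas as pd

def save_processed(df, filename, directory="data/02-processed"):
    """Write df as CSV into `directory`, creating the folder if needed."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    df.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column
    return out_path

# Usage in the notebook (hypothetical filename):
# save_processed(df_raw, "students_social_media_clean.csv")
```

Because no cleaning was needed, the processed file would be a copy of the raw data, but writing it keeps the directory layout consistent for later analysis steps.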


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.

### A. Load Dataset

In [30]:

import pandas as pd

path = "data/00-raw/survey.csv"
df2_raw = pd.read_csv(path)

df2_raw.head()


| | Timestamp | Age | Gender | Residence Area | Education Level | Socioeconomic status (Parent's education level) | Study time (In Hours) | Attendance rate (In Percentile) | Social Media Platform | Time spent in social media (hours) | ... | Physical activity (30 min+) | Withdrawal symptoms (Side effects of not using social media) | Sleep Disturbance on Sleep Quality | Mood Modification Scale | Anxiety Scale | Depression Scale | Self-esteem Scale | Last Academic Result (GPA/CGPA) | Social Media Distraction During Academic Activities | Column 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26/04/2025 01:14:36 | 28 | Male | Urban | Tertiary Education | Tertiary Education | 2 | 80 | YouTube | 4 | ... | Yes | Feeling restless | 4 | 3 | 4 | 5 | 4 | 3.55 | 5 | |
| 1 | 26/04/2025 01:24:20 | 21 | Male | Rural | Tertiary Education | No education | 3 | 60 | Instagram | 5 | ... | Yes | No symtoms | 5 | 2 | 2 | 1 | 5 | 3.1 | 3 | |
| 2 | 27/04/2025 21:38:37 | 17 | Female | Rural | HSC / A' Level | SSC / O' Level | 6-7 hour | 70% | YouTube | Highest 2 hour | ... | Yes | Cravings | 1 | 2 | 2 | 1 | 1 | A- | 1 | |
| 3 | 27/04/2025 21:42:51 | 22 | Male | Urban | Tertiary Education | HSC / A' Level | 1-2hrs | 80% | Facebook | 6hrs | ... | Yes | Feeling restless | 3 | 2 | 3 | 3 | 3 | gpa 5/5 | 1 | |
| 4 | 27/04/2025 21:46:22 | 26 | Male | Urban | Tertiary Education | Tertiary Education | 2 | 85 | Facebook | 5 | ... | No | Cravings | 4 | 3 | 4 | 3 | 4 | 2.78 | 4 | |


### B. Tidiness Check
The dataset is tidy: each row is a student, and each column is a variable. No reshaping is needed.

In [17]:
# Display the column names (the first rows are previewed in section A)
df2_raw.columns

Index(['Timestamp', 'Age', 'Gender', 'Residence Area  ', 'Education Level',
       'Socioeconomic status (Parent's education level)',
       'Study time (In Hours)', 'Attendance rate (In Percentile)',
       'Social Media Platform', 'Time spent in social media (hours)',
       'Most time spent in a day', 'Physical activity (30 min+) ',
       'Withdrawal symptoms (Side effects of not using social media)',
       'Sleep Disturbance on Sleep Quality', 'Mood Modification Scale',
       'Anxiety Scale ', 'Depression Scale', 'Self-esteem Scale',
       'Last Academic Result (GPA/CGPA)',
       'Social Media Distraction During Academic Activities ', 'Column 19'],
      dtype='object')
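Several of the column names listed above carry trailing whitespace (for example `'Residence Area  '` and `'Anxiety Scale '`), which makes them easy to mistype when selecting columns. One possible fix is sketched below; the helper name `strip_column_names` is hypothetical and the toy frame stands in for `df2_raw`.

```python
import pandas as pd

def strip_column_names(df):
    """Return a copy whose column labels have surrounding whitespace removed."""
    return df.rename(columns=lambda c: c.strip() if isinstance(c, str) else c)

# Hypothetical toy data reusing three of the column names shown above.
toy = pd.DataFrame({
    "Residence Area  ": ["Urban"],
    "Anxiety Scale ": [4],
    "Age": [28],
})

clean = strip_column_names(toy)
print(list(clean.columns))
```

`rename` returns a new frame, so the raw data stays untouched and the cleaned copy can be used for all later steps.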

### C. Size of Dataset
The dataset has 405 rows and 21 columns. Variables cover demographics, academic info, social media habits, and wellbeing indicators.

In [13]:
# Check number of rows and columns
df2_raw.shape


(405, 21)

### D. Missing Data

In [24]:
# Check missing values
df2_raw.isnull().sum()

Timestamp                                                         0
Age                                                               0
Gender                                                            0
Residence Area                                                    0
Education Level                                                   0
Socioeconomic status (Parent's education level)                   0
Study time (In Hours)                                             0
Attendance rate (In Percentile)                                   0
Social Media Platform                                             0
Time spent in social media (hours)                                0
Most time spent in a day                                          0
Physical activity (30 min+)                                       0
Withdrawal symptoms (Side effects of not using social media)      0
Sleep Disturbance on Sleep Quality                                0
Mood Modification Scale                         

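### E. Outliers / Suspicious Entries
The cell below only lists column names, so it cannot confirm value ranges on its own. The preview in section A already shows inconsistently formatted entries: study time appears as `2`, `6-7 hour`, and `1-2hrs`; attendance as `80` and `70%`; time spent on social media as `5`, `6hrs`, and `Highest 2 hour`; and last academic result as `3.55`, `A-`, and `gpa 5/5`. These columns need to be parsed into numbers before outliers or impossible values can be assessed.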

In [27]:
# Check column names
df2_raw.columns


Index(['Timestamp', 'Age', 'Gender', 'Residence Area  ', 'Education Level',
       'Socioeconomic status (Parent's education level)',
       'Study time (In Hours)', 'Attendance rate (In Percentile)',
       'Social Media Platform', 'Time spent in social media (hours)',
       'Most time spent in a day', 'Physical activity (30 min+) ',
       'Withdrawal symptoms (Side effects of not using social media)',
       'Sleep Disturbance on Sleep Quality', 'Mood Modification Scale',
       'Anxiety Scale ', 'Depression Scale', 'Self-esteem Scale',
       'Last Academic Result (GPA/CGPA)',
       'Social Media Distraction During Academic Activities ', 'Column 19'],
      dtype='object')
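The free-text hour entries visible in the section A preview (`6-7 hour`, `1-2hrs`, `Highest 2 hour`, `6hrs`) have to become numbers before any range or outlier check is meaningful. The sketch below shows one way to do that. The helper `parse_hours` is hypothetical, replacing a range with its midpoint is an assumption, and the `"not sure"` entry is invented to show the fallback. The rule is meant for the hour columns only; it would misread entries such as `gpa 5/5`.

```python
import re

import pandas as pd

_NUMBER = re.compile(r"\d+(?:\.\d+)?")

def parse_hours(value):
    """Convert entries such as '2', '6hrs', '1-2hrs' or '6-7 hour' to a float.

    A range is replaced by its midpoint (an assumption); an entry without
    any digits becomes NaN.
    """
    numbers = [float(n) for n in _NUMBER.findall(str(value))]
    if not numbers:
        return float("nan")
    return sum(numbers) / len(numbers)

# Entries copied from the preview, plus one invented unparseable value.
toy = pd.Series(["2", "6-7 hour", "1-2hrs", "Highest 2 hour", "6hrs", "not sure"])
parsed = toy.map(parse_hours)
print(parsed)
```

Once the hour columns are numeric, values above 24 or below 0 can be flagged as impossible, and entries that turned into NaN can be reviewed by hand.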

### F. Cleaning

In [28]:
# Check data types
df2_raw.dtypes


Timestamp                                                        object
Age                                                              object
Gender                                                           object
Residence Area                                                   object
Education Level                                                  object
Socioeconomic status (Parent's education level)                  object
Study time (In Hours)                                            object
Attendance rate (In Percentile)                                  object
Social Media Platform                                            object
Time spent in social media (hours)                               object
Most time spent in a day                                         object
Physical activity (30 min+)                                      object
Withdrawal symptoms (Side effects of not using social media)     object
Sleep Disturbance on Sleep Quality                              
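Every column listed above is stored as `object`, including `Age`, so numeric columns need converting before analysis. A possible first cleaning pass is sketched below. The helper `coerce_numeric` is hypothetical and the toy values (including `"twenty"`) are invented. Whether `Column 19` is entirely empty is not confirmed by the output shown, and `dropna(axis=1, how="all")` removes it only if it is.

```python
import pandas as pd

def coerce_numeric(df, columns):
    """Drop all-empty columns and convert `columns` to numeric dtype.

    Returns the cleaned copy and, per column, how many non-missing
    entries could not be parsed (these become NaN).
    """
    out = df.dropna(axis=1, how="all").copy()
    failed = {}
    for col in columns:
        converted = pd.to_numeric(out[col], errors="coerce")
        failed[col] = int((converted.isna() & out[col].notna()).sum())
        out[col] = converted
    return out, failed

# Hypothetical toy data standing in for df2_raw.
toy = pd.DataFrame({
    "Age": ["28", "21", "twenty"],
    "Anxiety Scale": ["4", "2", "3"],
    "Column 19": [None, None, None],
})

clean, failed = coerce_numeric(toy, ["Age", "Anxiety Scale"])
print(clean.dtypes)
print(failed)
```

The `failed` counts show how much information coercion would discard, which helps decide whether a column needs a custom parser instead of a plain conversion.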

## Ethics

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Team Expectations 

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Project Timeline Proposal

Instructions: Replace this with your timeline.  **PLEASE UPDATE your Timeline!** No battle plan survives contact with the enemy, so make sure we understand how your plans have changed.  Also if you have lost points on the previous checkpoint fix them