## Feedback from Assignment 1

This section details the incorporation of Assignment 1 feedback regarding the project's novelty.

**Feedback:** "Novelty: where is your methodology?"

**Incorporation:** This feedback showed that the methodology needed to be stated explicitly. The project's novelty lies not only in the problem but in its methodology, which **integrates multiple data sources with advanced analytical methods** to examine 'brain rot' beyond simple correlations. The current demonstration is a first step toward this broader approach and makes the methodological framework concrete.

## 4. Characterising and Analysing Data

This section details the data pertinent to "The Unspoken Epidemic," outlining sources, characteristics, analysis methods, and a practical demonstration.

### 4.1 Potential Data Sources and Characteristics

Addressing 'brain rot' comprehensively requires diverse data, from individual experiences to macro trends.

* **Primary Data (Surveys):** Self-reported data on social media usage, academic performance, sleep, and mental health (as used in this demonstration). Questions adapted from validated scales (e.g., Bergen Social Media Addiction Scale).
    * **Pros:** Directly addresses specific research questions; captures subjective experiences.
    * **Cons:** Prone to self-report bias, recall issues, social desirability bias; limited for objective behavior or long-term trends.

* **Secondary Data (for Future Expansion & Broader Trends):** Mobile engagement trends (Shorts/Reels), reading trends (average book length), Wikipedia dwell time data, neuroscience data, text corpora (Common Crawl, Google Books Ngram Viewer), and academic score trends. These data types offer objective evidence and support broader trend analysis.

* **Data Characteristics (The 4 V's):**
    The characteristics of big data are commonly described by the "4 V's":
    * **Volume:** Current project is small. Future expansion involves **petabytes**, requiring scalable storage.
    * **Variety:** Current data is structured. Future expansion introduces **high variety**: structured, semi-structured, and unstructured (text, video, neuroimages).
    * **Velocity:** Current data is static. Future social media engagement data would be **high velocity**, necessitating real-time processing.
    * **Veracity:** Current self-reported data is subject to biases. Future data from APIs or web crawls may contain noise; rigorous validation and cleaning are critical (Marr, n.d.).

* **Platforms, Software, and Tools:**
    * **Current Project:** R (analysis, modeling), RStudio, local storage, CSV.
    * **Future Expansion:** Storage: cloud object storage (AWS S3, Google Cloud Storage), NoSQL databases (MongoDB), and data warehouses (Snowflake, BigQuery). Processing: distributed computing frameworks (Apache Spark, Hadoop) and cloud services (AWS Glue, Google Cloud Dataflow). Tools: Python (advanced ML, NLP, deep learning), specialized visualization (Tableau), and workflow orchestration (Apache Airflow). Together these provide the scalability and flexibility needed for diverse, large-scale, high-velocity data.

### 4.2 Data Analysis Techniques and Statistical Methods

This section outlines broader analytical methods applicable to the project, with rationale and expected outcomes.

* **Descriptive Statistics & Visualization:** Methods include means, medians, distributions, time-series plots. Rationale: Summarize data, identify patterns and trends (e.g., usage, book length changes). Expected Outcomes: Baseline understanding of current patterns.
* **Inferential Statistics:** Methods such as t-tests and ANOVA (comparing group means), chi-squared tests (categorical associations), multiple regression analysis, and confidence intervals, with the Variance Inflation Factor (VIF) as a model diagnostic (Selection of Appropriate Statistical Methods for Data Analysis, n.d.). Rationale: Draw statistically sound conclusions and test hypotheses. Multiple regression assesses the simultaneous impact of several predictors, VIF diagnoses multicollinearity among them, and confidence intervals provide plausible ranges for population parameters. Expected Outcomes: Confirm significant differences or associations; understand independent effects of variables; check model robustness.
* **Time-Series Analysis:** Methods include ARIMA models, Prophet. Rationale: Analyze trends over time (e.g., usage, book lengths) for 'brain rot' onset and progression. Expected Outcomes: Detection of temporal patterns, forecasting.
* **Natural Language Processing (NLP):** Methods like N-gram analysis, readability scores (e.g., Flesch-Kincaid) (DataCamp, n.d.). Rationale: Quantitatively assess linguistic simplification in text corpora. Expected Outcomes: Statistical evidence of trends in linguistic complexity.
* **Machine Learning for Prediction:** Methods including Logistic Regression, Linear Regression, Decision Trees/Random Forests, Gradient Boosting Machines (Machine Learning for Social Science, n.d.). Rationale: Build predictive models to identify 'brain rot' drivers and outcomes. Expected Outcomes: Predict academic impact, sleep, mental health; identify influential factors.
* **Clustering (Unsupervised Learning):** Methods such as K-Means, Hierarchical Clustering (Machine Learning for Social Science, n.d.). Rationale: Identify natural groupings in student populations based on behavior. Expected Outcomes: Discover student profiles or 'brain rot' archetypes.
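
Three of the methods listed above (a t-test, a chi-squared test, and K-Means clustering) can be sketched in base R as follows. This is a minimal illustration, not the project's analysis: the data frame `students` is simulated, its column names merely mirror the demonstration dataset, and the choice of three clusters is an arbitrary assumption.

```R
# Simulated stand-in data; every value is randomly generated for illustration.
set.seed(42)
n <- 200
students <- data.frame(
  Avg_Daily_Usage_Hours = round(runif(n, 1, 9), 1),
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
# Sleep is simulated to fall as usage rises, plus random noise.
students$Sleep_Hours_Per_Night <-
  9 - 0.4 * students$Avg_Daily_Usage_Hours + rnorm(n, sd = 0.8)
# Heavier users are simulated as more likely to report an academic impact.
students$Affects_Academic_Performance <- factor(
  ifelse(runif(n) < plogis(students$Avg_Daily_Usage_Hours - 5), "Yes", "No")
)

# Welch two-sample t-test: does mean sleep differ between the Yes/No groups?
t_res <- t.test(Sleep_Hours_Per_Night ~ Affects_Academic_Performance,
                data = students)

# Chi-squared test of independence between two categorical variables.
chi_res <- chisq.test(table(students$Gender,
                            students$Affects_Academic_Performance))

# K-Means on standardized numeric features (3 clusters is an assumption).
features <- scale(students[, c("Avg_Daily_Usage_Hours", "Sleep_Hours_Per_Night")])
km <- kmeans(features, centers = 3, nstart = 25)

print(t_res$p.value)
print(chi_res$p.value)
print(table(km$cluster))
```

Standardizing the features before clustering keeps a variable with a wider numeric range from dominating the distance calculation. In practice the number of clusters would be chosen with a diagnostic such as an elbow plot rather than fixed in advance.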

### 4.3 Demonstration

This section demonstrates the project's feasibility using a real dataset and R.

* **Dataset Identification:** A "Social Media Addiction and Mental Health" dataset from Kaggle was used. It was collected via surveys (university mailing lists, social media) with validation, de-duplication, and anonymization.
    * **Download Link:** https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships
    * **R Markdown File:** The R Markdown file used will be uploaded to Moodle.
* **Data Description & Features:** 645 observations, 15 variables. Key features include `Age`, `Gender`, `Academic_Level`, `Avg_Daily_Usage_Hours`, `Most_Used_Platform`, `Sleep_Hours_Per_Night`, `Mental_Health_Score`, `Affects_Academic_Performance`, `Conflicts_Over_Social_Media`, `Addicted_Score`, `Relationship_Status`, `Country`.
* **Analysis Process (Using R):**
    1.  **Initial Inspection:** Loaded the data and inspected it with `str()`, `summary()`, and `head()`.
    2.  **Cleaning & Preprocessing:** `na.omit()` removed missing data. Variables converted to factors. "LinkedIn" users filtered out. `Most_Used_Platform` grouped, with "Facebook" as reference.
    3.  **Exploratory Data Analysis (EDA):** Visualizations revealed initial patterns. Hybrid box-violin plots were used in place of plain box plots because they show the full shape of each distribution alongside its median and quartiles:
        ```R
        library(ggplot2)

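        # Example: Hybrid Box-Violin plot for Academic Performance Impact by Most Used Platform
        # Note: `Academic_Performance_Impact` must be a numeric column. It is not among the
        # features listed above (the related `Affects_Academic_Performance` is modeled as a
        # binary outcome), so it is assumed here to be a derived numeric score.
        ggplot(data, aes(x = Most_Used_Platform_Grouped, y = Academic_Performance_Impact, fill = Most_Used_Platform_Grouped)) +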
          geom_violin(trim = FALSE) + # Violin plot
          geom_boxplot(width = 0.1, outlier.shape = NA, fill = "white", color = "black") + # Boxplot inside
          labs(title = "Academic Performance Impact by Most Used Platform",
               x = "Most Used Platform",
               y = "Academic Performance Impact Score") +
          theme_minimal() +
          theme(axis.text.x = element_text(angle = 45, hjust = 1))

        # Example: Hybrid Box-Violin plot for Mental Health Score by Academic Performance Impact
        ggplot(data, aes(x = Affects_Academic_Performance, y = Mental_Health_Score, fill = Affects_Academic_Performance)) +
          geom_violin(trim = FALSE) + # Violin plot
          geom_boxplot(width = 0.1, outlier.shape = NA, fill = "white", color = "black") + # Boxplot inside
          labs(title = "Mental Health Score by Academic Performance Impact",
               x = "Academic Performance Affected",
               y = "Mental Health Score") +
          theme_minimal()
        ```
    4.  **Statistical Modeling:** Three regression models were built: a logistic regression for `Affects_Academic_Performance` and linear regressions for `Sleep_Hours_Per_Night` and `Mental_Health_Score`. All three used `Avg_Daily_Usage_Hours`, `Academic_Level`, `Most_Used_Platform_Grouped`, `Age`, `Gender`, and `Relationship_Status` as predictors; the academic performance model added `Sleep_Hours_Per_Night`, and the mental health model added `Conflicts_Over_Social_Media`. VIFs confirmed no problematic multicollinearity.
        The coefficients of these models and their statistical significance can be visualized with coefficient plots, colored on a sequential scale by p-value:
        ```R
        library(ggplot2)
        library(broom) # For tidy model output
        library(dplyr) # For data manipulation

        # Assume model1, model2, model3 are already fitted (e.g., after running the models in R)
        # model1 <- glm(Affects_Academic_Performance ~ Avg_Daily_Usage_Hours + Academic_Level + Most_Used_Platform_Grouped + Age + Gender + Relationship_Status + Sleep_Hours_Per_Night, data = data, family = binomial)
        # model2 <- lm(Sleep_Hours_Per_Night ~ Avg_Daily_Usage_Hours + Academic_Level + Most_Used_Platform_Grouped + Age + Gender + Relationship_Status, data = data)
        # model3 <- lm(Mental_Health_Score ~ Avg_Daily_Usage_Hours + Conflicts_Over_Social_Media + Most_Used_Platform_Grouped + Age + Gender + Relationship_Status + Academic_Level, data = data)

        # Function to create a coefficient plot with custom p-value coloring
        plot_coefficients <- function(model, title) {
          # Logistic models are shown as odds ratios (reference line at 1);
          # linear models as raw coefficients (reference line at 0).
          is_logistic <- family(model)$family == "binomial"

          tidy_model <- tidy(model, conf.int = TRUE, exponentiate = is_logistic) %>%
            filter(term != "(Intercept)") # Remove intercept for cleaner plot

          ggplot(tidy_model, aes(x = estimate, y = reorder(term, estimate), color = p.value)) +
            geom_point() +
            geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
            geom_vline(xintercept = if (is_logistic) 1 else 0, linetype = "dashed", color = "grey") +
            labs(title = title,
                 x = if (is_logistic) "Odds Ratio" else "Coefficient Estimate",
                 y = "Predictor Term") +
            scale_color_gradientn(
              colors = c(
                "#004d00", # Darkest green (p=0)
                "#008000", # Green (p ~ 0.01)
                "#90EE90", # Light green (p=0.05)
                "#FFD700", # Gold/Yellow (p=0.075)
                "#FFA500", # Orange (p=0.1)
                "#FF4500", # OrangeRed (p=0.5)
                "#8B0000"  # Dark red (p=1)
              ),
              values = scales::rescale(c(0, 0.01, 0.05, 0.075, 0.1, 0.5, 1)), # Map p-values to color scale
              limits = c(0, 1), # p-values range from 0 to 1
              name = "P-value"
            ) +
            theme_minimal()
        }

        # Generate plots for each model (uncomment and run in R)
        # plot_coefficients(model1, "Logistic Regression: Academic Performance Impact Coefficients")
        # plot_coefficients(model2, "Linear Regression: Sleep Hours Per Night Coefficients")
        # plot_coefficients(model3, "Linear Regression: Mental Health Score Coefficients")
        ```

* **Analysis Results:** The models were statistically significant overall (p < 2.2e-16), and multicollinearity was low (all $\text{GVIF}^{1/(2 \cdot \text{Df})} < 2.0$).

* **Feasibility Conclusion:** The demonstration identified significant relationships between social media usage, platform choices, and student well-being. Findings highlight that while some platforms are strongly associated with negative outcomes, others present counter-trends. This confirms project feasibility and underscores the importance of detailed, platform-specific analysis.
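
The cleaning and modeling steps described in the analysis process above can be sketched end to end in base R. This is a hedged illustration on simulated data, not the submitted R Markdown file: the objects `raw`, `clean`, `keep`, `model_academic`, and `model_sleep` are hypothetical names, the rule that pools platforms into "Other" is an assumption (the source does not state how platforms were grouped), and the predictor set is reduced for brevity.

```R
# Simulated stand-in for the survey data; all values are randomly generated.
set.seed(1)
n <- 300
platforms <- c("Facebook", "Instagram", "TikTok", "YouTube", "LinkedIn", "Snapchat")
raw <- data.frame(
  Avg_Daily_Usage_Hours = round(runif(n, 1, 9), 1),
  Most_Used_Platform = sample(platforms, n, replace = TRUE),
  Age = sample(18:24, n, replace = TRUE),
  stringsAsFactors = FALSE
)
raw$Sleep_Hours_Per_Night <- 9 - 0.4 * raw$Avg_Daily_Usage_Hours + rnorm(n, sd = 0.8)
raw$Affects_Academic_Performance <-
  ifelse(runif(n) < plogis(raw$Avg_Daily_Usage_Hours - 5), "Yes", "No")
raw$Sleep_Hours_Per_Night[sample(n, 10)] <- NA  # inject a few missing values

# Cleaning: drop incomplete rows and remove LinkedIn users.
clean <- na.omit(raw)
clean <- clean[clean$Most_Used_Platform != "LinkedIn", ]

# Grouping (assumed rule): keep three named platforms, pool the rest as "Other".
keep <- c("Facebook", "Instagram", "TikTok")
clean$Most_Used_Platform_Grouped <-
  ifelse(clean$Most_Used_Platform %in% keep, clean$Most_Used_Platform, "Other")
# Make "Facebook" the reference level for the regression contrasts.
clean$Most_Used_Platform_Grouped <-
  relevel(factor(clean$Most_Used_Platform_Grouped), ref = "Facebook")
clean$Affects_Academic_Performance <-
  factor(clean$Affects_Academic_Performance, levels = c("No", "Yes"))

# Modeling: logistic regression for the binary outcome, linear for sleep.
model_academic <- glm(
  Affects_Academic_Performance ~ Avg_Daily_Usage_Hours + Most_Used_Platform_Grouped + Age,
  data = clean, family = binomial
)
model_sleep <- lm(
  Sleep_Hours_Per_Night ~ Avg_Daily_Usage_Hours + Most_Used_Platform_Grouped + Age,
  data = clean
)

# Odds ratios: exponentiated logistic coefficients.
odds_ratios <- exp(coef(model_academic))

# Ordinary VIF for one numeric predictor: 1 / (1 - R^2) from regressing it
# on the remaining predictors.
r2 <- summary(lm(Avg_Daily_Usage_Hours ~ Most_Used_Platform_Grouped + Age,
                 data = clean))$r.squared
vif_usage <- 1 / (1 - r2)

print(round(odds_ratios, 2))
print(round(coef(summary(model_sleep)), 3))
print(round(vif_usage, 3))
```

Because "Facebook" is the reference level, each platform coefficient is read as a difference (or odds ratio) relative to Facebook users. For factor predictors with more than one degree of freedom, a generalized VIF such as the one reported by `car::vif()` would replace the manual calculation shown here.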

## 5. Standard for Data Science Process, Data Governance and Management

This section outlines methodological standards, data governance, and management, including ethical considerations.

### 5.1 Standard for Data Science Process

The **Cross-Industry Standard Process for Data Mining (CRISP-DM)** was adopted for its structured, iterative approach (Shimaoka et al., 2024).

* **Business Understanding:** Defined 'brain rot' problem and goals.
* **Data Understanding:** Acquired and explored dataset, identifying patterns and quality.
* **Data Preparation:** Cleaned data, handled missing values, converted types, filtered, and engineered features.
* **Modeling:** Selected and applied regression models, performed diagnostic checks (VIF).
* **Evaluation:** Assessed model performance and interpreted findings.
* **Deployment:** Disseminated findings via this report and R Markdown file.

Following CRISP-DM keeps each stage documented and repeatable, which supports the robustness and reliability of the findings.

### 5.2 Data Governance and Management

Data governance sets the policies that protect data integrity, security, and ethical use; data management puts those policies into practice (Information Governance Services, 2023).

* **Data Accessibility:**
    * **Current Project:** Uses a public, anonymized Kaggle dataset, promoting transparency.
    * **Future Expansion:** New primary/sensitive data would have strictly controlled access, limited to authorized personnel via secure platforms with ethical approvals.
* **Data Security & Confidentiality:**
    * **Current Project:** Minimal direct risk due to anonymized public data.
    * **Future Expansion:** For sensitive data, measures include: anonymization/pseudonymization, encryption (at rest/in transit), role-based access, secure compliant storage, and data minimization.
* **Ethical Concerns Related to Data Usage:**
    * **Self-Report and Sampling Bias:** The current dataset is self-reported, cross-sectional, and recruited online; these are acknowledged limitations.
    * **Privacy & Confidentiality (Future Data):** Crucial to obtain explicit informed consent for new individual-level data.
    * **Potential for Misinterpretation/Stigmatization:** Findings must be presented with nuance to avoid unfair generalizations. The statistical associations reported here are correlational and must not be read as causal claims.
    * **Responsible Reporting:** Findings communicated responsibly, highlighting limitations and actionable insights.
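
The pseudonymization and data minimization measures mentioned above could look like the following base R sketch. It is only one possible approach, shown on made-up records: the `survey` data, the `email` column, and the `participant_id` format are all hypothetical, and a random surrogate key is used here where a salted hash would be another option.

```R
# Made-up records with a direct identifier (email); purely illustrative.
set.seed(7)
survey <- data.frame(
  email = c("a@example.edu", "b@example.edu", "a@example.edu", "c@example.edu"),
  Age = c(19, 22, 19, 21),
  Mental_Health_Score = c(6, 4, 7, 5),
  stringsAsFactors = FALSE
)

# Lookup table mapping each distinct identifier to a random surrogate key.
# sample() draws without replacement, so the keys are unique.
ids <- unique(survey$email)
key_table <- data.frame(
  email = ids,
  participant_id = sprintf("P%04d", sample(1000:9999, length(ids))),
  stringsAsFactors = FALSE
)

# Attach the surrogate key, then drop the direct identifier so the
# analysis dataset carries only what the analysis needs.
analysis_data <- merge(survey, key_table, by = "email")
analysis_data$email <- NULL

print(analysis_data)
```

In this design `key_table` is the only link back to individuals, so it would be stored separately from `analysis_data` under role-based access, or destroyed once re-identification is no longer needed.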

Adhering to these practices supports robust insights and responsible data handling.

## References

1.  Boys & Girls Clubs of America. (2025, February 26). *Supporting Digital Well-being: 12 Ways to Help Teens Unplug from Technology*. https://www.bgca.org/news-stories/2025/February/unplugging/
2.  Ch, D. (2025, January 8). *YouTube shorts statistics*. SendShort. https://sendshort.ai/statistics/shorts/
3.  Connell, A. (2025, January 1). *32 Top Instagram Reels Statistics for 2025*. Adam Connell. https://adamconnell.me/instagram-reels-statistics/
4.  Curtis, L. (2025, January 6). *12 habits to prevent "brain rot"*. Health. https://www.health.com/habits-to-prevent-brain-rot-8766150
5.  DataCamp. (n.d.). *What is Natural Language Processing (NLP)? A Beginner's Guide*. Retrieved May 25, 2025, from https://www.datacamp.com/blog/what-is-natural-language-processing
6.  Heaton, B. (2024, December 2). *‘Brain rot’ named Oxford Word of the Year 2024*. Oxford University Press. https://corp.oup.com/news/brain-rot-named-oxford-word-of-the-year-2024/
7.  Information Governance Services. (2023, August 24). *What is data ethics and why does it matter?* https://www.informationgovernanceservices.com/articles/what-is-data-ethics-and-why-does-it-matter/
8.  Kim, I. (2024). Exploring the cognitive and social effects of TikTok on adolescent minds: A study of short-form video consumption. *ierj.in*. https://doi.org/10.21276/IERJ24769489007345
9.  Li, G., Geng, Y., & Wu, T. (2024). Effects of short-form video app addiction on academic anxiety and academic engagement: The mediating role of mindfulness. *Frontiers in Psychology, 15*. https://doi.org/10.3389/fpsyg.2024.1428813
10. Machine Learning for Social Science. (n.d.). *Researcher reasoning meets computational capacity: Machine learning for social science*. Retrieved May 25, 2025, from https://pmc.ncbi.nlm.nih.gov/articles/PMC10893965/
11. Marr, B. (n.d.). *What are the 4 Vs of Big Data?* Bernard Marr. Retrieved May 25, 2025, from https://bernardmarr.com/what-are-the-4-vs-of-big-data/
12. Ortiz-Ospina, E. (2019, September 18). *The rise of social media*. Our World in Data. https://ourworldindata.org/rise-of-social-media
13. Qu, D., Liu, B., Jia, L., Zhang, X., Chen, D., Zhang, Q., Feng, Y., & Chen, R. (2023). The longitudinal relationships between short video addiction and depressive symptoms: A cross-lagged panel network analysis. *Computers in Human Behavior, 152*, 108059. https://doi.org/10.1016/j.chb.2023.108059
14. Selection of Appropriate Statistical Methods for Data Analysis. (n.d.). *Selection of Appropriate Statistical Methods for Data Analysis*. Retrieved May 25, 2025, from https://pmc.ncbi.nlm.nih.gov/articles/PMC6639881/
15. Shimaoka, M., Ferreira, J., & Goldman, A. (2024). The evolution of CRISP-DM for Data Science: Methods, Processes and Frameworks. *Journals-Sol.SBC.Org.Br*. Retrieved May 25, 2025, from https://journals-sol.sbc.org.br/index.php/reviews/article/view/3757
16. Team, B. (2025, February 10). *Social Media Usage & Growth Statistics*. Backlinko. https://backlinko.com/social-media-users
17. *How many users on TikTok? Statistics & Facts (2025)*. (n.d.). https://seo.ai/blog/how-many-users-on-tiktok