## 4. Characterising and Analysing Data

This section details the nature of the data pertinent to "The Unspoken Epidemic - Analysis to Combat the Rise of 'Brain Rot'," outlining potential sources, their characteristics, and the methods employed for analysis, both theoretically and through demonstration.

### 4.1 Potential Data Sources and Characteristics

To address 'brain rot' comprehensively, diverse data types and sources are crucial, spanning individual experiences to macro-level trends in content consumption, attention, language, and academics.

* **Primary Data (Surveys):**
    * **Description:** Self-reported data from individuals on social media usage, academic performance, sleep, and mental health (as used in this demonstration).
    * **Pros:** Directly addresses specific research questions; captures subjective experiences.
    * **Cons:** Prone to self-report bias; poorly suited to measuring objective behavior or long-term trends.

* **Secondary Data (for Future Expansion & Broader Trends):**
    * **Mobile Engagement Trends (Shorts/Reels):** Platform data on content consumption (watch times, engagement). Addresses "rise of brain rot" and attention span.
    * **Reading Trends (Average Book Length):** Metadata from publishing databases on book lengths. Indirectly measures shifting attention spans and in-depth reading.
    * **Wikipedia Dwell Time Data:** User interaction data on time spent per article. Proxy for sustained attention and information retention.
    * **Neuroscience Data:** Cognitive studies (fMRI, EEG) on brain activity during media consumption. Offers objective evidence for attention span changes.
    * **Common Crawl / Google Books Ngram Viewer:** Large text corpora for linguistic analysis. Addresses "linguistic simplification" and changes in language complexity.
    * **Academic Score Trends:** Aggregated academic performance data from institutions. Provides direct evidence for "educational performance trends."

* **Data Characteristics (The 4 V's):**
    * **Volume:** Current project is small (hundreds of rows). Future expansion with social media feeds or linguistic corpora would involve **petabytes**, requiring scalable storage.
    * **Variety:** Current data is structured. Future expansion introduces **high variety**: structured (academic scores), semi-structured (APIs), and unstructured (text, video metadata, neuroimages).
    * **Velocity:** Current data is static. Future social media engagement data would be **high velocity**, requiring real-time processing.
    * **Veracity:** Current self-reported data is subject to bias. Future data from APIs or web crawls may contain noise; rigorous validation and cleaning are critical.

* **Platforms, Software, and Tools:**
    * **Current Project:** R (analysis, modeling), RStudio (IDE), local storage, CSV format.
    * **Future Expansion:**
        * **Storage:** Cloud object storage (AWS S3, Google Cloud Storage), NoSQL databases (MongoDB), Data Warehouses (Snowflake, BigQuery).
        * **Processing:** Distributed computing frameworks (Apache Spark, Hadoop); cloud services (AWS Glue, Google Cloud Dataflow).
        * **Tools:** Python (advanced ML, NLP, deep learning), specialized visualization (Tableau), workflow orchestration (Apache Airflow).
        * **Rationale:** Scalability, flexibility, and robust processing are necessary for diverse, large-scale, and potentially high-velocity data.

### 4.2 Data Analysis Techniques and Statistical Methods

This section outlines broader analytical methods applicable to the project, beyond the demonstration, with rationale and expected outcomes.

* **Descriptive Statistics & Visualization:**
    * **Methods:** Means, medians, standard deviations, distributions, time-series plots.
    * **Rationale:** Summarize data, identify patterns, outliers, and distributions across datasets (e.g., average usage, trends in book length).
    * **Expected Outcomes:** Baseline understanding of social media use, visual evidence of changes in reading/linguistic complexity.
* **Inferential Statistics:**
    * **Methods:** T-tests, ANOVA (comparing group means), Chi-squared tests (categorical associations).
    * **Rationale:** Draw conclusions about populations from samples; test hypotheses (e.g., significant differences in mental health scores between age groups).
    * **Expected Outcomes:** Confirmation of significant differences/associations, supporting or refuting specific hypotheses.
* **Time-Series Analysis:**
    * **Methods:** ARIMA models, Prophet.
    * **Rationale:** Analyze data collected over time (e.g., usage hours, book lengths, Wikipedia dwell times). Crucial for identifying 'brain rot' onset and progression.
    * **Expected Outcomes:** Detection of temporal patterns, forecasting future trends, identifying change points.
* **Natural Language Processing (NLP):**
    * **Methods:** N-gram analysis, readability scores (e.g., Flesch-Kincaid).
    * **Rationale:** Quantitatively assess "linguistic simplification" in large text corpora.
    * **Expected Outcomes:** Statistical evidence of changes in linguistic complexity (e.g., reduced sentence length).
* **Machine Learning for Prediction:**
    * **Methods:** Logistic Regression (for binary outcomes), Linear Regression (for continuous outcomes), Decision Trees/Random Forests (for classification/regression, feature importance), Gradient Boosting Machines (XGBoost).
    * **Rationale:** Build robust models to identify drivers of 'brain rot' indicators and predict outcomes.
    * **Expected Outcomes:** Accurate prediction of academic impact, sleep deprivation, mental health; identification of most influential factors.
* **Clustering (Unsupervised Learning):**
    * **Methods:** K-Means, Hierarchical Clustering.
    * **Rationale:** Identify natural groupings within student populations based on behavior and metrics.
    * **Expected Outcomes:** Discovery of distinct "student profiles" or 'brain rot' susceptibility archetypes for targeted interventions.
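
As one concrete illustration of the readability scores named under NLP above, the Flesch-Kincaid grade level can be computed from word, sentence, and syllable counts. The sketch below is a minimal Python version written for illustration, not the project's implementation: the function names are hypothetical, the syllable counter is a rough vowel-group heuristic, and the two sample sentences are invented.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels (heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Invented sample texts: short simple sentences vs. one dense sentence.
simple = "The cat sat. The dog ran."
dense = "Institutional methodologies necessitate comprehensive interdisciplinary evaluation."

print(flesch_kincaid_grade(simple))
print(flesch_kincaid_grade(dense))
```

A lower score indicates simpler text. Applied to a corpus sampled by year, a downward trend in this score would be one possible indicator of linguistic simplification. A production analysis would likely use a dictionary-based syllable counter rather than this heuristic.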

### 4.3 Demonstration

This section demonstrates the project's feasibility using a real dataset and R.

* **Dataset Identification:** The "Social Media Addiction and Mental Health" dataset from Kaggle was used, aligning with key variables.
    * **Download Link:** https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships
    * **R Markdown File:** The R Markdown file used will be uploaded to Moodle.
* **Data Description & Features:** 645 observations, 15 variables. Key features: `Age`, `Gender`, `Academic_Level`, `Avg_Daily_Usage_Hours`, `Most_Used_Platform`, `Sleep_Hours_Per_Night`, `Mental_Health_Score`, `Affects_Academic_Performance`, `Conflicts_Over_Social_Media`, `Addicted_Score`, `Relationship_Status`, `Country`.
* **Analysis Process (Using R):**
    1.  **Initial Inspection:** Loaded the data and inspected it with `str()`, `summary()`, and `head()`.
    2.  **Cleaning & Preprocessing:**
        * Rows with missing values were removed with `na.omit()`.
        * Categorical variables were converted to factors (e.g., `Gender`, and `Affects_Academic_Performance` with levels "No" and "Yes").
        * Respondents whose most-used platform was LinkedIn, Twitter, or YouTube were removed because these groups were small and caused separation problems in the models.
        * `Most_Used_Platform` was engineered into `Most_Used_Platform_Final` (e.g., grouping various messaging apps into "Messaging_Apps"), with "Facebook" as the reference level.
    3.  **Exploratory Data Analysis (EDA):**
        * **Visualizations:** "Academic Performance Impact by Most Used Platform" showed higher impact from TikTok and Snapchat. "Mental Health Score by Academic Performance Impact" illustrated lower mental health scores for students reporting academic impact.
    4.  **Statistical Modeling:** Three regression models were built using `Avg_Daily_Usage_Hours`, `Academic_Level`, `Most_Used_Platform_Final`, `Age`, `Gender`, and `Relationship_Status` as predictors. The academic-performance model also included `Sleep_Hours_Per_Night`, and `Conflicts_Over_Social_Media` was added to the mental health model. VIFs confirmed no problematic multicollinearity.

* **Analysis Results:** The models demonstrated strong predictive power and overall significance (p < 2.2e-16). VIFs were low (all GVIF^(1/(2*Df)) < 2.0).

    * **Model 1: Predicting Affects_Academic_Performance (Logistic Regression)**
        * **Key Findings:** Higher `Avg_Daily_Usage_Hours` (OR=5.55) was associated with higher odds of academic impact, while more `Sleep_Hours_Per_Night` (OR=0.226) was strongly associated with lower odds. `Academic_Level` (High School and Undergraduate had lower odds than Graduate) and `Age` (older students had lower odds) were also significant. Relative to Facebook (baseline), the odds were significantly higher for TikTok (OR=112.88), Snapchat (OR=21.90), Instagram (OR=15.90), and Messaging Apps (OR=5.43).
    * **Model 2: Predicting Sleep_Hours_Per_Night (Linear Regression)**
        * **Overall Fit:** R-squared = 0.7589.
        * **Key Findings:** `Avg_Daily_Usage_Hours` (Estimate=-0.77, 95% CI: [-0.81, -0.73]) was negatively associated with sleep. `Academic_LevelHigh School` (-0.55), `Snapchat` (-0.75), and `TikTok` (-0.34) were associated with less sleep relative to their baselines. "In Relationship" (0.63) and "Single" (0.85) respondents reported more sleep than "It's Complicated" (baseline). Age, Gender, Instagram, and Messaging Apps were not significant.
    * **Model 3: Predicting Mental_Health_Score (Linear Regression)**
        * **Overall Fit:** R-squared = 0.8156.
        * **Key Findings:** `Avg_Daily_Usage_Hours` (-0.23) and `Conflicts_Over_Social_Media` (-0.80) were strongly negatively associated with mental health. `Instagram` (-0.14) and `Snapchat` (-0.35) users, and `GenderMale` (-0.15), had lower mental health scores than their baselines (Facebook, Female). Academic Levels, TikTok, Messaging Apps, Age, and Relationship Status were not significant. So although TikTok was associated with both academic impact and reduced sleep, and Messaging Apps with academic impact, neither showed a significant direct link to mental health scores in this dataset once overall usage and conflicts were accounted for.

* **Feasibility Conclusion:** The demonstration identified significant associations between social media usage, platform choice, and student well-being (academic performance, sleep, mental health), supporting the project's feasibility.
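
The cleaning steps listed under Analysis Process above can be sketched as follows. The report's own pipeline was written in R; this is a minimal Python analogue for illustration only. The sample rows are invented, and the set of apps treated as messaging apps is an assumption, since the report does not list them.

```python
# Invented rows standing in for the survey CSV (values are illustrative only).
rows = [
    {"Most_Used_Platform": "TikTok", "Avg_Daily_Usage_Hours": 5.2},
    {"Most_Used_Platform": "WhatsApp", "Avg_Daily_Usage_Hours": 3.1},
    {"Most_Used_Platform": "LinkedIn", "Avg_Daily_Usage_Hours": 1.0},
    {"Most_Used_Platform": "Facebook", "Avg_Daily_Usage_Hours": None},
    {"Most_Used_Platform": "Instagram", "Avg_Daily_Usage_Hours": 4.0},
]

# Platforms the report removes before modeling.
EXCLUDED = {"LinkedIn", "Twitter", "YouTube"}
# Assumption: the report does not say which apps were grouped as messaging apps.
MESSAGING = {"WhatsApp", "WeChat", "LINE", "KakaoTalk"}


def clean(records):
    """Drop incomplete rows, drop excluded platforms, and add the grouped platform column."""
    cleaned = []
    for record in records:
        # Analogue of na.omit(): skip any row with a missing value.
        if any(value is None for value in record.values()):
            continue
        platform = record["Most_Used_Platform"]
        if platform in EXCLUDED:
            continue
        grouped = "Messaging_Apps" if platform in MESSAGING else platform
        cleaned.append({**record, "Most_Used_Platform_Final": grouped})
    return cleaned


result = clean(rows)
print([r["Most_Used_Platform_Final"] for r in result])
```

In this sketch the Facebook row is dropped only because its usage value is missing, not because of its platform. Setting Facebook as the reference level is a modeling step and is not shown here.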

## 5. Standard for Data Science Process, Data Governance and Management

This section outlines the methodological standards guiding this project and addresses critical considerations for data governance and management, ensuring ethical and responsible data handling.

### 5.1 Standard for Data Science Process

The **Cross-Industry Standard Process for Data Mining (CRISP-DM)** was adopted for its structured and iterative approach to data science projects.

* **Business Understanding:** Defined 'brain rot' problem, project goals (predicting academic performance, sleep, mental health), and context.
* **Data Understanding:** Acquired and explored the dataset, identifying quality issues and initial patterns.
* **Data Preparation:** Cleaned missing values, converted data types, filtered irrelevant data, and engineered features (e.g., `Most_Used_Platform_Final`).
* **Modeling:** Selected and applied logistic and linear regression models, performing diagnostic checks (VIF).
* **Evaluation:** Assessed model performance (R-squared, significance) and interpreted findings against project goals.
* **Deployment:** Involved disseminating findings through this report and an R Markdown file.

Adopting CRISP-DM ensured a systematic approach, enhancing reliability.

### 5.2 Data Governance and Management

Data governance ensures data availability, usability, integrity, and security, while data management executes these policies.

* **Data Accessibility:**
    * **Current Project:** Uses a publicly accessible, anonymized dataset from Kaggle, promoting transparency.
    * **Future Expansion:** For new primary or sensitive data, access would be strictly controlled, limited to authorized personnel via secure platforms, with explicit ethical approvals.
* **Data Security & Confidentiality:**
    * **Current Project:** As a public, anonymized dataset, direct security risks are minimal.
    * **Future Expansion:** For sensitive data, strict measures include: anonymization/pseudonymization, encryption (at rest and in transit), role-based access control, secure storage on compliant servers, and data minimization.
* **Ethical Concerns Related to Data Usage:**
    * **Self-Reported Bias:** Acknowledged limitation in data veracity.
    * **Privacy & Confidentiality (Future Data):** Strict adherence to informed consent protocols for any new data collection, ensuring participants understand data use, storage, and protection.
    * **Potential for Misinterpretation/Stigmatization:** Findings will be presented with nuance, emphasizing correlation over causation to avoid unfair generalizations or stigmatization (e.g., linking platforms to academic impact is a correlation, not a direct cause).
    * **Responsible Reporting:** Findings will be communicated clearly, highlighting limitations and focusing on actionable insights.

Adherence to these principles ensures robust insights and responsible data handling.
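
To make the pseudonymization measure mentioned above more concrete, one common approach is to replace each participant identifier with a keyed hash, so records can still be linked without storing the original ID. The sketch below uses Python's standard `hmac` and `hashlib` modules. The function name, the identifier format, and the key are hypothetical, and the project may well choose a different scheme.

```python
import hashlib
import hmac


def pseudonymize(participant_id: str, secret_key: bytes) -> str:
    """Return a stable keyed hash (HMAC-SHA256) of a participant identifier."""
    return hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for illustration; a real key would be generated randomly
# and stored separately from the data it protects.
key = b"example-key-not-for-production"

token = pseudonymize("student-0042", key)
print(token[:16])
```

The same identifier always maps to the same token under a given key, which preserves joins across tables. Without the key, the mapping cannot be recomputed from the identifiers alone.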

## 4. Characterising and Analysing Data

This section details the nature of the data pertinent to "The Unspoken Epidemic - Analysis to Combat the Rise of 'Brain Rot'," outlining potential sources, their characteristics, analysis methods, and a practical demonstration. It also addresses the project's methodological novelty.

### 4.0 Addressing Novelty in Methodology

The project's novelty, as requested in Assignment 1 feedback, lies in its **holistic and multi-faceted methodological approach** to understanding 'brain rot.' While individual aspects (social media impact, attention span, linguistic trends) are studied, the novelty stems from:
* **Integrating diverse data sources:** Combining usage patterns, linguistic trends, academic data, and potentially neuroscience metrics, linked by time and geography, provides a comprehensive view.
* **Employing advanced analytics:** Utilizing time-series analysis for trend identification, NLP for linguistic shifts, and potentially advanced ML/deep learning for complex pattern recognition, moves beyond traditional correlational studies.
* **Focus on actionable insights:** The aim is not just to identify impacts, but to lay groundwork for predictive models and targeted interventions, addressing a critical, underexplored societal issue through a robust data science lens. The current demonstration serves as a feasibility study for this broader, novel methodology.

### 4.1 Potential Data Sources and Characteristics

To address 'brain rot' comprehensively, diverse data types and sources are crucial, spanning individual experiences to macro-level trends.

* **Primary Data (Surveys):**
    * **Description:** Self-reported data on social media usage, academic performance, sleep, and mental health (as used in this demonstration). Questions are adapted from validated scales (e.g., Bergen Social Media Addiction Scale).
    * **Pros:** Directly addresses specific research questions; captures subjective experiences.
    * **Cons:** Prone to self-report bias, recall issues, social desirability bias; limited for objective behavior or long-term trends.

* **Secondary Data (for Future Expansion & Broader Trends):**
    * **Mobile Engagement Trends (Shorts/Reels):** Platform data on content consumption (watch times, engagement). Crucial for identifying the "rise of brain rot" and understanding changes in attention span.
    * **Reading Trends (Average Book Length):** Metadata from publishing databases on book lengths. Indirectly measures shifting attention spans and in-depth reading habits.
    * **Wikipedia Dwell Time Data:** User interaction data on time spent per article. Proxy for sustained attention and information retention.
    * **Neuroscience Data:** Cognitive studies (fMRI, EEG) on brain activity during media consumption. Offers objective evidence for attention span changes.
    * **Common Crawl / Google Books Ngram Viewer:** Large text corpora for linguistic analysis. Addresses "linguistic simplification" and changes in language complexity.
    * **Academic Score Trends:** Aggregated academic performance data from institutions. Provides direct evidence for "educational performance trends."

* **Data Characteristics (The 4 V's):**
    * **Volume:** Current project is small (hundreds of rows). Future expansion with social media feeds or linguistic corpora would involve **petabytes**, requiring scalable storage.
    * **Variety:** Current data is structured. Future expansion introduces **high variety**: structured (academic scores), semi-structured (APIs), and unstructured (text, video metadata, neuroimages).
    * **Velocity:** Current data is static. Future social media engagement data would be **high velocity**. Real-time processing would be necessary for a dynamic, intervention-focused project to:
        * Monitor shifts in 'brain rot' indicators *as they happen*.
        * Provide *early warning systems* for emerging issues or at-risk individuals.
        * Enable *adaptive interventions* that adjust strategies based on live insights, rather than relying solely on historical snapshots.
    * **Veracity:** Current self-reported data is subject to self-report bias. In addition, the cross-sectional design prevents causal inference, and online recruitment may introduce sampling variability. Future data from APIs or web crawls may contain noise; rigorous validation and cleaning are critical.

* **Platforms, Software, and Tools:**
    * **Current Project:** R (analysis, modeling), RStudio (IDE), local storage, CSV format.
    * **Future Expansion:**
        * **Storage:** Cloud object storage (AWS S3, Google Cloud Storage), NoSQL databases (MongoDB), Data Warehouses (Snowflake, BigQuery).
        * **Processing:** Distributed computing frameworks (Apache Spark, Hadoop); cloud services (AWS Glue, Google Cloud Dataflow).
        * **Tools:** Python (advanced ML, NLP, deep learning), specialized visualization (Tableau), workflow orchestration (Apache Airflow).
        * **Rationale:** These provide the scalability, flexibility, and robust processing capabilities needed for diverse, large-scale, and potentially high-velocity data in a comprehensive study.

### 4.2 Data Analysis Techniques and Statistical Methods

This section outlines broader analytical methods applicable to the project, beyond the demonstration, with rationale and expected outcomes.

* **Descriptive Statistics & Visualization:**
    * **Methods:** Means, medians, distributions, time-series plots.
    * **Rationale:** Summarize data, identify patterns and trends (e.g., average usage, book length changes over time).
    * **Expected Outcomes:** Baseline understanding of current patterns, visual evidence of changes.
* **Inferential Statistics:**
    * **Methods:** T-tests, ANOVA (comparing group means), Chi-squared tests (categorical associations).
    * **Rationale:** Draw statistically sound conclusions about population parameters from samples, testing hypotheses.
    * **Expected Outcomes:** Confirmation of significant differences or associations between variables.
* **Time-Series Analysis:**
    * **Methods:** ARIMA models, Prophet.
    * **Rationale:** Analyze trends in data over time (e.g., usage, book lengths, Wikipedia dwell times, academic scores). Crucial for identifying 'brain rot' onset and progression.
    * **Expected Outcomes:** Detection of temporal patterns, forecasting future trends.
* **Natural Language Processing (NLP):**
    * **Methods:** N-gram analysis, readability scores (e.g., Flesch-Kincaid).
    * **Rationale:** Quantitatively assess "linguistic simplification" in large text corpora.
    * **Expected Outcomes:** Statistical evidence of trends in linguistic complexity.
* **Machine Learning for Prediction:**
    * **Methods:** Logistic Regression, Linear Regression, Decision Trees/Random Forests, Gradient Boosting Machines.
    * **Rationale:** Build robust predictive models to identify drivers and outcomes related to 'brain rot' indicators.
    * **Expected Outcomes:** Accurate prediction of academic impact, sleep deprivation, mental health; identification of most influential factors.
* **Clustering (Unsupervised Learning):**
    * **Methods:** K-Means, Hierarchical Clustering.
    * **Rationale:** Identify natural groupings within student populations based on behavior and metrics.
    * **Expected Outcomes:** Discovery of distinct student profiles or 'brain rot' susceptibility archetypes.
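
As an illustration of the chi-squared test listed under Inferential Statistics above, the Pearson statistic compares observed counts in a contingency table with the counts expected if the two variables were independent. The sketch below is a minimal Python version with a hypothetical function name. The 2x2 counts are invented for illustration and are not taken from the dataset.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed_count - expected) ** 2 / expected
    return statistic


# Invented counts: rows = most-used platform (TikTok, Facebook),
# columns = reports academic impact (Yes, No).
observed = [[30, 10], [20, 40]]

stat = chi_squared_statistic(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(stat, 2), dof)
```

For one degree of freedom the 5% critical value is 3.841, so a statistic well above that would indicate an association between platform and reported academic impact in this invented table.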

### 4.3 Demonstration

This section demonstrates the project's feasibility using a real dataset and R.

* **Dataset Identification:** The "Social Media Addiction and Mental Health" dataset from Kaggle was used. It was collected via surveys (university mailing lists, social media) with validation, de-duplication, and anonymization controls.
    * **Download Link:** https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships
    * **R Markdown File:** The R Markdown file used will be uploaded to Moodle.
* **Data Description & Features:** 645 observations, 15 variables. Key features include `Age`, `Gender`, `Academic_Level`, `Avg_Daily_Usage_Hours`, `Most_Used_Platform`, `Sleep_Hours_Per_Night`, `Mental_Health_Score`, `Affects_Academic_Performance`, `Conflicts_Over_Social_Media`, `Addicted_Score`, `Relationship_Status`, `Country`.
* **Analysis Process (Using R):**
    1.  **Initial Inspection:** Loaded data, `str()`, `summary()`, `head()`.
    2.  **Cleaning & Preprocessing:** `na.omit()` removed missing data. Variables converted to factors (e.g., `Gender`, `Affects_Academic_Performance` with "No", "Yes" levels). "LinkedIn," "Twitter," "YouTube" users filtered out. `Most_Used_Platform` grouped into `Most_Used_Platform_Final` (e.g., "Messaging_Apps"), with "Facebook" as the reference.
    3.  **Exploratory Data Analysis (EDA):** Visualizations like "Academic Performance Impact by Most Used Platform" and "Mental Health Score by Academic Performance Impact" were created to reveal initial patterns (e.g., higher impact from TikTok/Snapchat, lower mental health with academic impact).
    4.  **Statistical Modeling:** Three regression models were built: Logistic Regression for `Affects_Academic_Performance`, Linear Regression for `Sleep_Hours_Per_Night` and `Mental_Health_Score`. Predictors included `Avg_Daily_Usage_Hours`, `Academic_Level`, `Most_Used_Platform_Final`, `Age`, `Gender`, `Relationship_Status`. `Conflicts_Over_Social_Media` was added for mental health. VIFs confirmed no problematic multicollinearity.

* **Analysis Results:** Models showed strong predictive power and overall significance (p < 2.2e-16), with low VIFs (all GVIF^(1/(2*Df)) < 2.0).

    * **Model 1: Predicting Affects_Academic_Performance (Logistic Regression)**
        * **Key Findings:** `Avg_Daily_Usage_Hours` (OR=5.55), `Sleep_Hours_Per_Night` (OR=0.226, strong negative), `Age` (older students lower odds), and `Academic_Level` (High School, Undergraduate lower odds vs. Graduate) were significant. Platforms like TikTok (OR=112.88), Snapchat (OR=21.90), Instagram (OR=15.90), and Messaging Apps (OR=5.43) significantly increased odds of academic impact compared to Facebook (baseline).
    * **Model 2: Predicting Sleep_Hours_Per_Night (Linear Regression)**
        * **Overall Fit:** R-squared = 0.7589.
        * **Key Findings:** `Avg_Daily_Usage_Hours` (Estimate=-0.77), `Academic_LevelHigh School` (-0.55), `Snapchat` (-0.75), and `TikTok` (-0.34) were significantly associated with less sleep. "In Relationship" (0.63) and "Single" (0.85) statuses linked to more sleep vs. "It's Complicated". Age, Gender, Instagram, Messaging Apps were not significant.
    * **Model 3: Predicting Mental_Health_Score (Linear Regression)**
        * **Overall Fit:** R-squared = 0.8156.
        * **Key Findings:** `Avg_Daily_Usage_Hours` (-0.23) and `Conflicts_Over_Social_Media` (-0.80) were strongly negatively associated with mental health. `Instagram` (-0.14) and `Snapchat` (-0.35) users, and `GenderMale` (-0.15), had lower scores than their baselines. Academic Levels, TikTok, Messaging Apps, Age, and Relationship Status were not significant direct predictors in this model. This may point to different mechanisms of impact for different platforms.

* **Feasibility Conclusion:** The demonstration identified significant associations between social media usage, platform choice, and student well-being, supporting the project's feasibility.
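
For readers unfamiliar with the multicollinearity diagnostic quoted in the results above, the standard variance inflation factor for predictor $j$ is defined from $R_j^2$, the R-squared obtained by regressing that predictor on all the others. For categorical predictors with several levels, a generalized VIF is scaled by its degrees of freedom. The notation `GVIF^(1/(2*Df))` matches the output of `vif()` in R's car package; whether that package was used here is an assumption.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \qquad \mathrm{GVIF}^{1/(2\,\mathrm{Df})} = \sqrt{\mathrm{VIF}} \quad \text{when } \mathrm{Df} = 1
$$

Because the scaled quantity is on the square-root scale, the reported threshold of 2.0 corresponds to a conventional VIF of about 4.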

## 5. Standard for Data Science Process, Data Governance and Management

This section outlines methodological standards and addresses data governance and management, including ethical considerations.

### 5.0 Feedback from Assignment 1 & Incorporation

**Feedback:** "Novelty: where is your methodology?" (all other aspects were "good").

**Incorporation:** This feedback highlighted the need to explicitly articulate the unique methodological approach. As detailed in Section 4.0, the novelty is not just in the problem, but in the **integrated, multi-data source, and advanced analytical methodology** proposed to comprehensively tackle 'brain rot' beyond simple correlations. The current demonstration is a foundational step for this broader, novel approach. This focus on *how* the problem will be solved through a sophisticated data science process strengthens the project's unique value proposition.

### 5.1 Standard for Data Science Process

The **Cross-Industry Standard Process for Data Mining (CRISP-DM)** was adopted for its structured and iterative approach.

* **Business Understanding:** Defined 'brain rot' problem and project goals.
* **Data Understanding:** Acquired and explored the dataset, identifying initial patterns and quality.
* **Data Preparation:** Cleaned data, handled missing values, converted types, filtered, and engineered features.
* **Modeling:** Selected and applied regression models, performing diagnostic checks (VIF).
* **Evaluation:** Assessed model performance and interpreted findings.
* **Deployment:** Involved disseminating findings via this report and an R Markdown file.

Adopting CRISP-DM supports a systematic, repeatable workflow.

### 5.2 Data Governance and Management

Data governance ensures integrity, security, and ethical use; data management executes policies.

* **Data Accessibility:**
    * **Current Project:** Uses a public, anonymized Kaggle dataset, promoting transparency.
    * **Future Expansion:** New primary/sensitive data would have strictly controlled access, limited to authorized personnel via secure platforms with ethical approvals.
* **Data Security & Confidentiality:**
    * **Current Project:** Minimal direct risk due to anonymized public data.
    * **Future Expansion:** For sensitive data, measures include: anonymization/pseudonymization, encryption (at rest/in transit), role-based access control, secure storage on compliant servers, and data minimization.
* **Ethical Concerns Related to Data Usage:**
    * **Self-Reported Bias:** Acknowledged limitation of the current dataset, which is cross-sectional (prevents causal inference) and from online recruitment (sampling variability).
    * **Privacy & Confidentiality (Future Data):** Crucial to obtain explicit informed consent for any new individual-level data.
    * **Potential for Misinterpretation/Stigmatization:** Findings must be presented with nuance, emphasizing correlation over causation to avoid unfair generalizations about individuals or groups. For instance, statistical associations between platform use and impact are not causal claims.
    * **Responsible Reporting:** Findings will be communicated clearly and responsibly, highlighting limitations and focusing on actionable insights.

Adherence to these principles ensures robust analytical insights and responsible data handling.

# Social Media Addiction and Academic Performance: An Exploratory Data Analysis

## Feedback from Assignment 1

To incorporate feedback from Assignment 1 and introduce more novelty, the analytical approach has been refined by:
* **Expanding the scope of social media platforms:** Initially, the study focused on a select few platforms. We have now integrated YouTube into the analysis as a distinct category, acknowledging its role in modern "brain rot" content, particularly through YouTube Shorts.
* **Re-evaluating platform groupings:** Twitter, initially grouped with Facebook, has been separated into its own category. This allows for a more granular analysis of its specific impact, distinguishing its text-based, short-form content from other visual or broader social networking platforms. LinkedIn has been excluded from the analysis to maintain a tighter focus on platforms relevant to the "brain rot" phenomenon.
* **Introducing nuanced impact analysis:** Beyond academic performance, the models now explicitly investigate the separate impacts of social media use on critical factors like sleep hours and mental health scores. This multi-faceted approach provides a more comprehensive understanding of social media's effects, addressing the interconnectedness of various well-being aspects.
* **Seeking unexpected insights:** The refined modeling aims to uncover not just expected negative correlations but also surprising relationships, such as the unique positive association observed for Twitter with mental health, enriching the discussion and providing a more robust narrative.

## 4. Characterising and Analysing Data

This section details the analytical approach and presents the key findings from the logistic and linear regression models, exploring the relationships between social media usage, platform choice, and their impacts on academic performance, sleep hours, and mental health.

### 4.1 Logistic Regression Model: Predicting Academic Performance

The logistic regression model was employed to predict the likelihood of social media affecting academic performance. The baseline for comparison for `Most_Used_Platform_Grouped` is Facebook.

**Key Findings:**
* **Average Daily Usage Hours:** A highly significant positive association (Estimate = 1.9395, p < 0.001, Odds Ratio = 6.96) indicates that each additional hour of daily social media use is associated with roughly seven times higher odds of academic performance being affected, holding the other predictors constant.
* **Sleep Hours Per Night:** A highly significant negative association (Estimate = -1.4278, p < 0.001) suggests that fewer hours of sleep per night go with higher odds of affected academic performance.
* **Academic Level:** Both "High School" (Estimate = -5.7910, p < 0.001) and "Undergraduate" (Estimate = -2.5836, p < 0.001) show a significant negative association, indicating lower odds of affected academic performance than the 'Postgraduate' baseline. This suggests that vulnerability varies across educational stages.
* **Age:** A highly significant negative association (Estimate = -0.8620, p < 0.001) implies that older individuals have lower odds of reporting that social media affects their academic performance.
* **Most Used Platform Grouped:**
    * **TikTok** (Estimate = 4.8616, p < 0.001, Odds Ratio = 129.23) shows the **strongest positive association**, indicating that users whose most used platform is TikTok have significantly higher odds of experiencing affected academic performance compared to Facebook users.
    * **Snapchat** (Estimate = 3.1470, p < 0.05, Odds Ratio = 23.27), **Instagram** (Estimate = 2.8961, p < 0.001, Odds Ratio = 18.10), and **Messaging Apps** (Estimate = 1.7185, p < 0.001, Odds Ratio = 5.58) are also statistically significant positive predictors, suggesting increased odds of academic impact compared to Facebook.
    * `YouTube` (Estimate = 1.1863, p = 0.297) and `Twitter` (Estimate = 0.3768, p = 0.551) are **not statistically significant** in predicting academic performance in this model. Their effects are not significantly different from the Facebook baseline in this context.
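
The odds ratios quoted above follow directly from the logistic regression estimates: each estimate is a change in log-odds, so the odds ratio is its exponential. The short Python sketch below reproduces that conversion for the estimates reported in this section. The dictionary keys are labels chosen for illustration, not the model's actual coefficient names.

```python
import math

# Log-odds estimates reported in this section (platform baseline: Facebook).
estimates = {
    "Avg_Daily_Usage_Hours": 1.9395,
    "TikTok": 4.8616,
    "Snapchat": 3.1470,
    "Instagram": 2.8961,
    "Messaging Apps": 1.7185,
}

# Odds ratio = exp(log-odds estimate).
odds_ratios = {name: math.exp(beta) for name, beta in estimates.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: OR = {ratio:.2f}")
```

The same conversion applies to negative estimates, where the resulting odds ratio falls below 1 and indicates lower odds.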

### 4.2 Linear Regression Model: Predicting Sleep Hours Per Night

This linear regression model investigates factors influencing the number of sleep hours per night. The baseline for `Most_Used_Platform_Grouped` is Facebook.

**Key Findings:**
* **Average Daily Usage Hours:** A highly significant negative association (Estimate = -0.75880, p < 0.001) indicates that each additional hour of daily social media usage is associated with roughly three quarters of an hour less sleep, holding the other predictors constant.
* **Academic Level:** "High School" (Estimate = -0.54622, p < 0.01) is significantly associated with fewer sleep hours compared to the 'Postgraduate' baseline, while "Undergraduate" (p = 0.735) is not.
* **Most Used Platform Grouped:**
    * **YouTube** (Estimate = -0.94959, p < 0.001) shows the **strongest negative association** among platforms, suggesting that primary YouTube users get, on average, nearly an hour less sleep than Facebook users with otherwise similar characteristics. This finding is consistent with the "brain rot" concept.
    * **Snapchat** (Estimate = -0.79124, p < 0.001) and **TikTok** (Estimate = -0.35147, p < 0.001) are also highly significant negative predictors of sleep hours.
    * `Instagram` (p = 0.468), `Messaging Apps` (p = 0.625), and `Twitter` (p = 0.676) are **not statistically significant** in predicting sleep hours.
* **Relationship Status:** Both "In Relationship" (Estimate = 0.55042, p < 0.001) and "Single" (Estimate = 0.75871, p < 0.001) are significantly associated with more sleep hours compared to the 'Complicated' baseline.
* The model has a high Adjusted R-squared of 0.7503, indicating it explains approximately 75% of the variance in sleep hours.
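
To illustrate how a Facebook baseline and an adjusted R-squared of this kind are obtained, here is a minimal sketch on simulated data. The data frame `sim`, the four platform names, and the data-generating coefficients are assumptions made for illustration; they are not the survey data or the fitted model reported above.

```R
set.seed(42)  # simulated data only; not the survey dataset

n <- 200
platforms <- c("Facebook", "Instagram", "TikTok", "YouTube")
sim <- data.frame(
  Avg_Daily_Usage_Hours = runif(n, 1, 8),
  Most_Used_Platform_Grouped = factor(sample(platforms, n, replace = TRUE))
)

# Hypothetical data-generating process: each extra hour of use costs 0.75 h of sleep
sim$Sleep_Hours_Per_Night <- 9 - 0.75 * sim$Avg_Daily_Usage_Hours + rnorm(n, sd = 0.5)

# Make Facebook the reference level so platform coefficients are read relative to it
sim$Most_Used_Platform_Grouped <- relevel(sim$Most_Used_Platform_Grouped, ref = "Facebook")

fit <- lm(Sleep_Hours_Per_Night ~ Avg_Daily_Usage_Hours + Most_Used_Platform_Grouped,
          data = sim)

print(coef(summary(fit)))          # estimates, standard errors, t values, p values
print(summary(fit)$adj.r.squared)  # adjusted R-squared
```

Because Facebook is the reference level, it has no coefficient of its own; every platform estimate is a difference in mean sleep hours from Facebook users.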

### 4.3 Linear Regression Model: Predicting Mental Health Score

This linear regression model explores factors influencing the mental health score. The baseline for `Most_Used_Platform_Grouped` is Facebook.

**Key Findings:**
* **Average Daily Usage Hours:** A highly significant negative association (Estimate = -0.235065, p < 0.001) suggests that increased daily social media usage is linked to lower mental health scores.
* **Conflicts Over Social Media:** A highly significant negative association (Estimate = -0.794347, p < 0.001) indicates that experiencing conflicts due to social media is strongly associated with lower mental health scores.
* **Gender:** Being male (Estimate = -0.125176, p < 0.05) is significantly associated with a lower mental health score compared to being female.
* **Most Used Platform Grouped:**
    * **Instagram** (Estimate = -0.145056, p < 0.05) and **Snapchat** (Estimate = -0.363159, p < 0.05) are statistically significant negative predictors of mental health scores.
    * **YouTube** (Estimate = -0.316017, p = 0.054) falls just short of the 0.05 threshold (**marginally significant**), suggesting a possible trend towards lower mental health scores.
    * **Twitter** (Estimate = 0.160833, p = 0.099) has a **positive estimate** but does not reach significance at the 0.05 level. At most, this hints that primary Twitter users may report slightly *higher* mental health scores than Facebook users. Whether content type (text-based vs. video-based) or platform engagement patterns explain this would need further study.
    * `Messaging Apps` (p = 0.634) and `TikTok` (p = 0.363) are **not statistically significant** in directly predicting mental health scores in this model.
* The model has a high Adjusted R-squared of 0.8129, indicating it explains approximately 81% of the variance in mental health scores.

### 4.4 Variance Inflation Factors (VIF) Analysis

VIF values for all predictors across the models are generally low (all GVIF^(1/(2*Df)) values are below 2 for individual predictors and below 3 for categorical predictors like Academic Level), indicating that multicollinearity is not a significant concern. This suggests that the independent variables are not overly correlated with each other, enhancing the reliability of the regression coefficients.
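
For intuition, the ordinary VIF of a predictor is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on simulated numeric predictors; the variables `usage`, `sleep`, and `age` are invented for illustration. It covers only the plain VIF. The GVIF^(1/(2*Df)) values quoted above for multi-level factors are the kind reported by `car::vif()`, which this sketch does not reproduce.

```R
set.seed(1)  # simulated predictors only; not the survey dataset

n <- 300
usage <- runif(n, 1, 8)
sleep <- 9 - 0.6 * usage + rnorm(n, sd = 1)  # deliberately correlated with usage
age   <- rnorm(n, mean = 20, sd = 2)         # generated independently of the others
predictors <- data.frame(usage, sleep, age)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))
```

In this toy setup `usage` and `sleep` inflate each other's variance, while the independent `age` stays close to the minimum value of 1.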

## 5. Standard for Data Science Process, Data Governance and Management

### 5.1 Data Science Process Adherence

The project follows a structured data science process, encompassing data collection, cleaning, exploratory data analysis, feature engineering, model development, and interpretation. This iterative process ensures systematic progress and allows for continuous refinement based on insights gained at each stage. For instance, the initial exploratory analysis guided the decision to group certain social media platforms and subsequently, the model interpretation informed the re-evaluation of these groupings, leading to the current refined models.

### 5.2 Data Governance and Management

Data governance and management principles are integral to this study, ensuring data quality, security, and ethical use.
* **Data Quality:** Measures were implemented to clean the dataset, including handling missing values, standardizing categorical variables, and identifying outliers. This ensures the reliability and accuracy of the analysis.
* **Data Security and Privacy:** While this project utilizes a hypothetical dataset, in a real-world scenario, strict protocols for data anonymization and secure storage would be paramount to protect participant privacy.
* **Ethical Considerations:** The study design implicitly acknowledges the ethical implications of researching sensitive topics like social media addiction and mental health. Any real data collection would necessitate informed consent, data minimization, and transparent reporting.
* **Reproducibility:** The methodology, including data processing steps and model specifications, is clearly documented to ensure the reproducibility of the results by other researchers.

### 5.3 Limitations and Future Directions

The study's primary limitation is its reliance on a hypothetical dataset. While illustrative, findings from such data cannot be generalized to real-world populations. Future work should involve collecting empirical data through surveys or longitudinal studies to validate these findings. Additionally, incorporating more detailed metrics of social media engagement (e.g., passive consumption vs. active creation, content type consumed) could provide deeper insights into the specific mechanisms of "brain rot." Further exploration of interaction effects between variables could also reveal more complex relationships.

## Feedback from Assignment 1

This section details the incorporation of Assignment 1 feedback regarding the project's novelty.

**Feedback:** "Novelty: where is your methodology?"

**Incorporation:** This feedback highlighted the need to articulate the unique methodological approach. The project's novelty lies not just in the problem but in its **integrated, multi-data source, and advanced analytical methodology** for comprehensively tackling 'brain rot' beyond simple correlations. The current demonstration serves as a foundational step for this broader, novel approach, strengthening the project's unique value proposition through a clear methodological framework.

## 4. Characterising and Analysing Data

This section details the data pertinent to "The Unspoken Epidemic," outlining sources, characteristics, analysis methods, and a practical demonstration.

### 4.1 Potential Data Sources and Characteristics

Addressing 'brain rot' comprehensively requires diverse data, from individual experiences to macro trends.

* **Primary Data (Surveys):**
    * **Description:** Self-reported data on social media usage, academic performance, sleep, and mental health (as used in this demonstration). Questions adapted from validated scales (e.g., Bergen Social Media Addiction Scale).
    * **Pros:** Directly addresses specific research questions; captures subjective experiences.
    * **Cons:** Prone to self-report bias, recall issues, social desirability bias; limited for objective behavior or long-term trends.

* **Secondary Data (for Future Expansion & Broader Trends):**
    * **Mobile Engagement Trends (Shorts/Reels):** Platform data on content consumption (watch times, engagement). Crucial for identifying 'brain rot' trends and attention span changes.
    * **Reading Trends (Average Book Length):** Metadata from publishing databases on book lengths. Indirectly measures shifting attention spans and reading habits.
    * **Wikipedia Dwell Time Data:** User interaction data on time spent per article. Proxy for sustained attention.
    * **Neuroscience Data:** Cognitive studies (fMRI, EEG) on brain activity during media consumption. Offers objective evidence of attention changes.
    * **Common Crawl / Google Books Ngram Viewer:** Large text corpora for linguistic analysis. Addresses linguistic simplification and complexity changes.
    * **Academic Score Trends:** Aggregated academic performance data from institutions. Provides direct evidence for educational trends.

* **Data Characteristics (The 4 V's):**
    * **Volume:** Current project is small (hundreds of rows). Future expansion to social media feeds or linguistic corpora involves **petabytes**, requiring scalable storage.
    * **Variety:** Current data is structured. Future expansion introduces **high variety**: structured (academic scores), semi-structured (APIs), and unstructured (text, video, neuroimages).
    * **Velocity:** Current data is static. Future social media engagement data would be **high velocity**, necessitating real-time processing for dynamic, intervention-focused monitoring, early warning systems, and adaptive strategies.
    * **Veracity:** Current self-reported data is subject to self-report biases, and its cross-sectional design prevents causal inference. Future data from APIs or web crawls may contain noise; rigorous validation and cleaning are critical.

* **Platforms, Software, and Tools:**
    * **Current Project:** R (analysis, modeling), RStudio, local storage, CSV.
    * **Future Expansion:**
        * **Storage:** Cloud object storage (AWS S3, Google Cloud Storage), NoSQL databases (MongoDB), Data Warehouses (Snowflake, BigQuery).
        * **Processing:** Distributed computing frameworks (Apache Spark, Hadoop); cloud services (AWS Glue, Google Cloud Dataflow).
        * **Tools:** Python (advanced ML, NLP, deep learning), specialized visualization (Tableau), workflow orchestration (Apache Airflow).
        * **Rationale:** These provide the scalability, flexibility, and robust processing for diverse, large-scale, high-velocity data.

### 4.2 Data Analysis Techniques and Statistical Methods

This section outlines broader analytical methods applicable to the project, with rationale and expected outcomes.

* **Descriptive Statistics & Visualization:**
    * **Methods:** Means, medians, distributions, time-series plots.
    * **Rationale:** Summarize data, identify patterns and trends (e.g., usage, book length changes).
    * **Expected Outcomes:** Baseline understanding of current patterns, visual evidence of changes.
* **Inferential Statistics:**
    * **Methods:** T-tests, ANOVA (comparing group means), Chi-squared tests (categorical associations), **Multiple Regression Analysis**, **Confidence Intervals**, **Variance Inflation Factor (VIF)**.
    * **Rationale:** Draw statistically sound conclusions, testing hypotheses. Multiple regression is essential for assessing the simultaneous impact of multiple predictors on an outcome. VIF is crucial for diagnosing multicollinearity in regression models, while confidence intervals provide a range of plausible values for population parameters, aiding in the interpretation of effect sizes and statistical significance.
    * **Expected Outcomes:** Confirm significant differences or associations; understand the independent effects of multiple variables; ensure model robustness and interpretability.
* **Time-Series Analysis:**
    * **Methods:** ARIMA models, Prophet.
    * **Rationale:** Analyze trends over time (e.g., usage, book lengths, Wikipedia dwell times, academic scores) for 'brain rot' onset and progression.
    * **Expected Outcomes:** Detection of temporal patterns, forecasting future trends.
* **Natural Language Processing (NLP):**
    * **Methods:** N-gram analysis, readability scores (e.g., Flesch-Kincaid).
    * **Rationale:** Quantitatively assess linguistic simplification in text corpora.
    * **Expected Outcomes:** Statistical evidence of trends in linguistic complexity.
* **Machine Learning for Prediction:**
    * **Methods:** Logistic Regression, Linear Regression, Decision Trees/Random Forests, Gradient Boosting Machines.
    * **Rationale:** Build predictive models to identify 'brain rot' drivers and outcomes.
    * **Expected Outcomes:** Predict academic impact, sleep, mental health; identify influential factors.
* **Clustering (Unsupervised Learning):**
    * **Methods:** K-Means, Hierarchical Clustering.
    * **Rationale:** Identify natural groupings in student populations based on behavior.
    * **Expected Outcomes:** Discover student profiles or 'brain rot' archetypes.
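
As one concrete instance of the techniques listed above, the Flesch-Kincaid grade level named under NLP is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The following is a rough base R sketch of it. The helper names `count_syllables` and `flesch_kincaid_grade` are hypothetical, and the syllable counter is a crude vowel-group heuristic, so the scores are approximate rather than a reference implementation.

```R
# Crude syllable estimate: count runs of consecutive vowels (an assumption;
# dictionary-based counters are more accurate)
count_syllables <- function(word) {
  word <- tolower(gsub("[^A-Za-z]", "", word))
  if (nchar(word) == 0) return(0L)
  groups <- gregexpr("[aeiouy]+", word)[[1]]
  n <- if (groups[1] == -1) 0L else length(groups)
  max(n, 1L)  # every non-empty word counts as at least one syllable
}

# Flesch-Kincaid grade level for a block of text
flesch_kincaid_grade <- function(text) {
  sentences <- unlist(strsplit(text, "[.!?]+"))
  sentences <- sentences[nzchar(trimws(sentences))]
  words <- unlist(strsplit(text, "[^A-Za-z']+"))
  words <- words[nzchar(words)]
  syllables <- sum(vapply(words, count_syllables, integer(1)))
  0.39 * (length(words) / length(sentences)) +
    11.8 * (syllables / length(words)) - 15.59
}

simple_text  <- "The cat sat. The dog ran."
complex_text <- "Longitudinal investigations demonstrate considerable heterogeneity in attentional deterioration."

print(flesch_kincaid_grade(simple_text))
print(flesch_kincaid_grade(complex_text))
```

Applied to dated samples from a text corpus, a score like this could be tracked over time to test for linguistic simplification.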

### 4.3 Demonstration

This section demonstrates the project's feasibility using a real dataset and R.

* **Dataset Identification:** A "Social Media Addiction and Mental Health" dataset from Kaggle was used. It was collected via surveys (university mailing lists, social media) with validation, de-duplication, and anonymization.
    * **Download Link:** https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships
    * **R Markdown File:** The R Markdown file used will be uploaded to Moodle.
* **Data Description & Features:** 645 observations, 15 variables. Key features include `Age`, `Gender`, `Academic_Level`, `Avg_Daily_Usage_Hours`, `Most_Used_Platform`, `Sleep_Hours_Per_Night`, `Mental_Health_Score`, `Affects_Academic_Performance`, `Conflicts_Over_Social_Media`, `Addicted_Score`, `Relationship_Status`, `Country`.
* **Analysis Process (Using R):**
    1.  **Initial Inspection:** Loaded data (`str()`, `summary()`, `head()`).
    2.  **Cleaning & Preprocessing:** `na.omit()` removed missing data. Variables converted to factors (e.g., `Gender`, `Affects_Academic_Performance`). "LinkedIn" users filtered out. `Most_Used_Platform` grouped into `Most_Used_Platform_Grouped` (e.g., "Messaging_Apps"), with "Facebook" as reference.
    3.  **Exploratory Data Analysis (EDA):** Visualizations (e.g., "Academic Performance Impact by Most Used Platform") revealed initial patterns. The original box plots can be enhanced with hybrid box-violin plots, which show the full shape of each distribution alongside the median and quartiles:
        ```R
        library(ggplot2)

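        # Example: Hybrid Box-Violin plot for Academic Performance Impact by Most Used Platform
        # NOTE: a violin plot needs a numeric y variable. `Academic_Performance_Impact` is assumed
        # to be a numeric impact score; it is not among the dataset features listed above.
        ggplot(data, aes(x = Most_Used_Platform_Grouped, y = Academic_Performance_Impact, fill = Most_Used_Platform_Grouped)) +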
          geom_violin(trim = FALSE) + # Violin plot
          geom_boxplot(width = 0.1, outlier.shape = NA, fill = "white", color = "black") + # Boxplot inside
          labs(title = "Academic Performance Impact by Most Used Platform",
               x = "Most Used Platform",
               y = "Academic Performance Impact Score") +
          theme_minimal() +
          theme(axis.text.x = element_text(angle = 45, hjust = 1))

        # Example: Hybrid Box-Violin plot for Mental Health Score by Academic Performance Impact
        ggplot(data, aes(x = Affects_Academic_Performance, y = Mental_Health_Score, fill = Affects_Academic_Performance)) +
          geom_violin(trim = FALSE) + # Violin plot
          geom_boxplot(width = 0.1, outlier.shape = NA, fill = "white", color = "black") + # Boxplot inside
          labs(title = "Mental Health Score by Academic Performance Impact",
               x = "Academic Performance Affected",
               y = "Mental Health Score") +
          theme_minimal()
        ```
    4.  **Statistical Modeling:** Three regression models were built: a logistic regression for `Affects_Academic_Performance`, and linear regressions for `Sleep_Hours_Per_Night` and `Mental_Health_Score`. Predictors included `Avg_Daily_Usage_Hours`, `Academic_Level`, `Most_Used_Platform_Grouped`, `Age`, `Gender`, and `Relationship_Status`, plus `Sleep_Hours_Per_Night` for the academic performance model and `Conflicts_Over_Social_Media` for the mental health model. VIFs confirmed no problematic multicollinearity.
        The coefficients and their statistical significance can be visualized using coefficient plots with a sequential color scale based on p-value:
        ```R
        library(ggplot2)
        library(broom) # For tidy model output
        library(dplyr) # For data manipulation

        # Assume model1, model2, model3 are already fitted (e.g., after running the models in R)
        # model1 <- glm(Affects_Academic_Performance ~ Avg_Daily_Usage_Hours + Academic_Level + Most_Used_Platform_Grouped + Age + Gender + Relationship_Status + Sleep_Hours_Per_Night, data = data, family = binomial)
        # model2 <- lm(Sleep_Hours_Per_Night ~ Avg_Daily_Usage_Hours + Academic_Level + Most_Used_Platform_Grouped + Age + Gender + Relationship_Status, data = data)
        # model3 <- lm(Mental_Health_Score ~ Avg_Daily_Usage_Hours + Conflicts_Over_Social_Media + Most_Used_Platform_Grouped + Age + Gender + Relationship_Status + Academic_Level, data = data)

        # Function to create a coefficient plot with custom p-value coloring
        plot_coefficients <- function(model, title) {
          # Logistic models are plotted as odds ratios (reference line at 1),
          # linear models as raw estimates (reference line at 0)
          is_logistic <- family(model)$family == "binomial"

          tidy_model <- tidy(model, conf.int = TRUE, exponentiate = is_logistic) %>%
            filter(term != "(Intercept)") # Remove intercept for cleaner plot

          ggplot(tidy_model, aes(x = estimate, y = reorder(term, estimate), color = p.value)) +
            geom_point() +
            geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2) +
            geom_vline(xintercept = if (is_logistic) 1 else 0, linetype = "dashed", color = "grey") +
            labs(title = title,
                 x = if (is_logistic) "Odds Ratio" else "Coefficient Estimate",
                 y = "Predictor Term") +
            scale_color_gradientn(
              colors = c(
                "#004d00", # Darkest green (p=0)
                "#008000", # Green (p ~ 0.01)
                "#90EE90", # Light green (p=0.05)
                "#FFD700", # Gold/Yellow (p=0.075)
                "#FFA500", # Orange (p=0.1)
                "#FF4500", # OrangeRed (p=0.5)
                "#8B0000"  # Dark red (p=1)
              ),
              values = scales::rescale(c(0, 0.01, 0.05, 0.075, 0.1, 0.5, 1)), # Map p-values to color scale
              limits = c(0, 1), # p-values range from 0 to 1
              name = "P-value"
            ) +
            theme_minimal()
        }

        # Generate plots for each model (uncomment and run in R)
        # plot_coefficients(model1, "Logistic Regression: Academic Performance Impact Coefficients")
        # plot_coefficients(model2, "Linear Regression: Sleep Hours Per Night Coefficients")
        # plot_coefficients(model3, "Linear Regression: Mental Health Score Coefficients")
        ```

* **Analysis Results:** All models were highly significant overall (p < 2.2e-16), and multicollinearity was low (all $\text{GVIF}^{1/(2 \cdot \text{Df})} < 2.0$).

* **Feasibility Conclusion:** The demonstration identified significant and nuanced relationships between social media usage, specific platform choices, and student well-being. Some platforms are associated with poorer outcomes (for example, Instagram and Snapchat with lower mental health scores, and YouTube marginally so), while Twitter shows a marginal, non-significant positive estimate for mental health that merits further investigation. This confirms project feasibility and underscores the importance of detailed, platform-specific analysis to understand complex social media effects.
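
The cleaning and preprocessing step described in the analysis process above can be sketched as follows. The toy data frame `raw` and its values are invented for illustration. The choice of "WhatsApp" and "WeChat" as the platforms grouped into "Messaging_Apps" is an assumption, since the exact grouping rule is not stated here.

```R
# Toy stand-in for the survey CSV; all values are invented for illustration
raw <- data.frame(
  Most_Used_Platform = c("Facebook", "WhatsApp", "LinkedIn", "TikTok",
                         "WeChat", NA, "Instagram"),
  Gender = c("Female", "Male", "Male", "Female", "Male", "Female", "Female"),
  Affects_Academic_Performance = c("No", "Yes", "No", "Yes", "Yes", "No", "Yes"),
  Avg_Daily_Usage_Hours = c(2.5, 4.0, 1.5, 6.2, 3.8, 5.0, 4.4),
  stringsAsFactors = FALSE
)

clean <- na.omit(raw)                                     # drop rows with missing values
clean <- clean[clean$Most_Used_Platform != "LinkedIn", ]  # filter out LinkedIn users

# Assumed grouping: which platforms count as "Messaging_Apps" is a guess
messaging <- c("WhatsApp", "WeChat")
clean$Most_Used_Platform_Grouped <- ifelse(
  clean$Most_Used_Platform %in% messaging, "Messaging_Apps", clean$Most_Used_Platform
)

# Convert categorical variables to factors and set Facebook as the reference level
clean$Gender <- factor(clean$Gender)
clean$Affects_Academic_Performance <- factor(clean$Affects_Academic_Performance)
clean$Most_Used_Platform_Grouped <- relevel(
  factor(clean$Most_Used_Platform_Grouped), ref = "Facebook"
)

str(clean)
```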

## 5. Standard for Data Science Process, Data Governance and Management

This section outlines methodological standards, data governance, and management, including ethical considerations.

### 5.1 Standard for Data Science Process

The **Cross-Industry Standard Process for Data Mining (CRISP-DM)** was adopted for its structured, iterative approach.

* **Business Understanding:** Defined 'brain rot' problem and goals.
* **Data Understanding:** Acquired and explored dataset, identifying patterns and quality.
* **Data Preparation:** Cleaned data, handled missing values, converted types, filtered, and engineered features.
* **Modeling:** Selected and applied regression models, performed diagnostic checks (VIF).
* **Evaluation:** Assessed model performance and interpreted findings.
* **Deployment:** Disseminated findings via this report and R Markdown file.

Following CRISP-DM keeps each stage documented and repeatable, which supports the robustness and reliability of the results.

### 5.2 Data Governance and Management

Data governance ensures integrity, security, and ethical use; data management executes policies.

* **Data Accessibility:**
    * **Current Project:** Uses a public, anonymized Kaggle dataset, promoting transparency.
    * **Future Expansion:** New primary/sensitive data would have strictly controlled access, limited to authorized personnel via secure platforms with ethical approvals.
* **Data Security & Confidentiality:**
    * **Current Project:** Minimal direct risk due to anonymized public data.
    * **Future Expansion:** For sensitive data, measures include: anonymization/pseudonymization, encryption (at rest/in transit), role-based access, secure compliant storage, and data minimization.
* **Ethical Concerns Related to Data Usage:**
    * **Self-Reported Bias:** Acknowledged limitation of the current dataset (cross-sectional, online recruitment bias).
    * **Privacy & Confidentiality (Future Data):** Crucial to obtain explicit informed consent for new individual-level data.
    * **Potential for Misinterpretation/Stigmatization:** Findings must be presented with nuance, emphasizing correlation over causation to avoid unfair generalizations. Statistical associations are not causal claims.
    * **Responsible Reporting:** Findings communicated responsibly, highlighting limitations and actionable insights.

Adhering to these practices supports robust insights and responsible data handling.