# Report specifications
Assignment 3: Report
Weight: 15% of the unit mark
Submission format: one PDF file and one RMD file (for demonstration in the Characterising and Analysing Data section)
Size: up to 2500 words

This report is your comprehensive analysis of how data science can be used to help solve a significant real-world problem. Please answer the following question on the first page of your Assignment 3 submission:
  ●  Have you selected a topic for Assignment 3 that is different from the one you used for Assignment 1 (i.e., have you rewritten the first three sections of the report)?

Your report should have the following sections:

1.  Introduction
  -  Clear articulation of the specific problem the project aims to solve.
  -  Background and context of the problem.
  -  Importance of the problem (why it matters).
  -  Specific goals of the project.

2.  Related Work
  -  Summary of existing research, projects, or industry solutions related to the problem.
  -  Identification of gaps in current approaches.
  -  Why or how your project should be considered as novel.

3.  Business Model
  -  Analysis about the business/application area the project sits in.
  -  What kind of benefits or values the project can create for the specific business area?
  -  Who are the primary stakeholders and how will they benefit from the project?

4.  Characterising and Analysing Data:
  -  Discuss potential data sources and analyze their characteristics (e.g., the 4 V's), evaluate the required platforms, software, and tools for data processing and storage based on the specific characteristics of the data or consider potential options (e.g., platforms, software, and tools) if your project expands in the future.
  -  Specify the data analysis techniques and statistical methods (e.g., decision tree or regression tree) applicable to the project. Provide a rationale for the selected methods and discuss the expected high-level outcomes. Note: The specification of data analysis and statistical methods should be different from the demonstration below and must be described separately.
  -  Demonstration: identify a usable dataset for the proposed project and perform some basic analysis on the identified dataset to demonstrate the feasibility of the project, using R (e.g., detailing the information/features contained in the dataset, analyse the basic characteristics of the dataset, etc.), and report the analysis process and result in the demonstration section of a final report.

Note: Please include a link to download the dataset in the final report, and upload the R markdown file created for data analysis on Moodle.

5.  Standard for Data Science Process, Data Governance and Management
  -  Describe any standards used in your data science process.
  -  Describe any practices for data governance and management in the project, e.g., how to address key issues such as data accessibility, security, and confidentiality, as well as potential ethical concerns related to data usage.

These sections should apply material from Weeks 1-10 of the unit to your chosen case study.

The maximum word limit for the report (Assignment 3) is 2500 words. It may include some or all of your Assignment 1, modified if needed (counted in the 2500-word total). References at the end of the report (i.e., URLs and academic publications) are not included in the word count. Note that staying within the word limit demonstrates your ability to write concisely.

Assignment 3 report: You will be assessed on your ability to:

  ●  define the problem, provide background and significance, outline specific goals, analyze the business domain and its value creation, identify key stakeholders and their benefits, summarize existing research or industry solutions, highlight gaps in current approaches, and justify the project's novelty and potential impact (You can reuse the content from Assignment 1 for this section);  
  ●  discuss potential data sources and analyze their characteristics (e.g., the 4 V's) and evaluate the required platforms, software, and tools for data processing and storage based on the specific characteristics of the data or consider potential options (e.g., platforms, software, and tools) if your project expands in the future;  
  ●  specify the data analysis techniques and statistical methods (e.g., decision tree or regression tree) applicable to the project. Provide a rationale for the selected methods and discuss the expected high-level outcomes;  
  ●  identify a usable dataset for the proposed project and perform some basic analysis on the identified dataset to demonstrate the feasibility of the project, using R (e.g., detailing the information/features contained in the dataset, analyse the basic characteristics of the dataset, etc.), and report the analysis process and result in the demonstration section of a final report;  
  ●  describe any standards used in your data science process and practices for data governance and management in the project, e.g., how to address key issues such as data accessibility, security, and confidentiality, as well as potential ethical concerns related to data usage;  
  ●  think critically and creatively, providing justification and analysis;  
  ●  produce a good-quality report in terms of structure, expression, grammar, and spelling.

For both assignments, make sure that any resources you use are acknowledged in your report. You may need to review the FIT citation style to make yourself familiar with appropriate citing and referencing for this assessment. Also, review the demystifying citing and referencing guide for help.

Assignment 3: Presentation (Slides + Verbal) + Peer-review Evaluation  
Weight: 10% of the unit mark   
Submission format: one PDF file (Slides)   
Size: a maximum of 10 slides (Slides)   
You need to submit your presentation slides along with your final report. The 4-minute presentation is given in Week 12 during your assigned applied class. After your presentation, the tutor will ask the presenter at least one question (1 minute). You will also be required to review and provide feedback on other students' presentations (peer review) during the applied class in Week 12, using the Google Form provided.

Sections 1-3 can be reused from Assignment 1 (modified if needed). Only Sections 4 and 5 and the presentation slides need to be prepared from scratch.

4. Characterising and Analysing Data

This section requires two distinct parts: a theoretical discussion of data characteristics and methods, followed by your practical demonstration.

4.1 Potential Data Sources and Characteristics

This sub-section addresses: "Discuss potential data sources and analyze their characteristics (e.g., the 4 V's), evaluate the required platforms, software, and tools for data processing and storage based on the specific characteristics of the data or consider potential options (e.g., platforms, software, and tools) if your project expands in the future." 

    Introduction to Data:
        Start by defining the broad type of data relevant to a study on social media impact on academic performance (e.g., behavioral data, survey data, demographic data).
    Potential Data Sources:
        Primary Data (e.g., Surveys):
            Description: Data collected directly from individuals through questionnaires, interviews, or observations. Your current dataset is an example of primary survey data.
            Pros: Directly addresses research questions, specific to context.
            Cons: Time-consuming, resource-intensive, potential for self-report bias.
        Secondary Data (e.g., Social Media APIs, University Records):
            Description: Data collected by others for different purposes but usable for your study.
            Pros: Cost-effective, readily available, large scale.
            Cons: May not perfectly align with research questions, data quality can vary, ethical/privacy concerns with sensitive data.
            Examples for future expansion:
                Social Media APIs (e.g., Facebook Graph API, Twitter API): Collect public posts, engagement metrics, sentiment.
                University Student Records: Academic transcripts, attendance data (requires strict ethical approval and anonymization).
                Wearable Devices/App Usage Trackers: Objective data on screen time, sleep patterns (highly intrusive, significant privacy hurdles).
    Data Characteristics (The 4 V's):
        Volume: The sheer amount of data.
            Current Project: Relatively small (hundreds of rows).
            Future Expansion: Could become very large (terabytes) if integrating social media streams, university records, or longitudinal studies. Requires scalable storage.
        Variety: The different types of data.
            Current Project: Structured (numeric: age, usage; categorical: gender, platform).
            Future Expansion: Could include unstructured (text from posts, images/videos from social media), semi-structured (JSON from APIs). Requires diverse processing capabilities.
        Velocity: The speed at which data is generated and needs to be processed.
            Current Project: Static (collected once).
            Future Expansion: Could be high velocity (real-time social media streams, continuous sensor data). Requires streaming analytics.
        Veracity: The accuracy and trustworthiness of the data.
            Current Project: Self-reported survey data, prone to bias and inaccuracies (e.g., memory recall, social desirability).
            Future Expansion: Data from APIs might have bots/fake accounts; university data generally high veracity but sensitive. Requires robust data validation and cleaning.
    Platforms, Software, and Tools for Processing and Storage:
        Current Project (Basic):
            Tools: R (for analysis and visualization), RStudio (IDE).
            Platforms/Storage: Local machine storage. CSV format.
        Future Expansion (Advanced/Big Data):
            Data Storage: Cloud storage solutions (AWS S3, Google Cloud Storage, Azure Blob Storage for raw data); NoSQL databases (MongoDB, Cassandra for unstructured data); Data Warehouses (Snowflake, BigQuery for structured, analytical data).
            Data Processing: Distributed computing frameworks (Apache Spark, Hadoop MapReduce); Cloud-based data processing services (AWS Glue, Google Dataflow, Azure Data Factory).
            Tools: Python (for more complex data pipelines, machine learning), specialized visualization tools (Tableau, Power BI), workflow orchestration tools (Apache Airflow).
            Rationale for Selection: Choice depends on scale, real-time needs, data variety, and budget. Open-source options (Spark, Hadoop) versus managed cloud services.
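
The 4 V's assessment outlined above can be backed by a quick profile of the dataset in R. The following is a minimal sketch under stated assumptions: the data frame is simulated rather than the real survey file, and `profile_four_vs` is a hypothetical helper written for this illustration, not a function from any package. Column names follow the variable list used in this guide. Velocity is not profiled because it cannot be measured from a single static file.

```r
# Simulated stand-in for the survey CSV (all values are made up for illustration).
set.seed(42)
n <- 200
survey <- data.frame(
  Age = sample(18:24, n, replace = TRUE),
  Gender = sample(c("Female", "Male"), n, replace = TRUE),
  Avg_Daily_Usage_Hours = round(runif(n, 1, 9), 1),
  Most_Used_Platform = sample(c("Facebook", "Instagram", "TikTok"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
survey$Avg_Daily_Usage_Hours[c(5, 17)] <- NA  # inject two missing values

# Hypothetical helper: summarise volume, variety, and veracity indicators.
profile_four_vs <- function(df) {
  list(
    # Volume: how much data there is.
    volume = list(
      rows = nrow(df),
      cols = ncol(df),
      size_bytes = as.numeric(object.size(df))
    ),
    # Variety: how many columns of each storage class.
    variety = table(vapply(df, function(col) class(col)[1], character(1))),
    # Veracity: simple quality indicators (missing values, duplicated rows).
    veracity = list(
      missing_per_column = colSums(is.na(df)),
      duplicate_rows = sum(duplicated(df))
    )
  )
}

profile <- profile_four_vs(survey)
print(profile$volume)
print(profile$variety)
print(profile$veracity$missing_per_column)
```

Reporting these numbers in the text (row count, column types, missing values) gives the "Current Project" entries for Volume, Variety, and Veracity a concrete basis.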

4.2 Data Analysis Techniques and Statistical Methods

This sub-section addresses: "Specify the data analysis techniques and statistical methods (e.g., decision tree or regression tree) applicable to the project. Provide a rationale for the selected methods and discuss the expected high-level outcomes."  Crucially, this should be different from your demonstration below.

    Introduction to Methods:
        Explain that a variety of statistical and machine learning methods can be applied depending on the research questions and data types.
    Types of Analysis:
        Descriptive Statistics:
            Methods: Mean, median, mode, standard deviation, frequency distributions, box plots, histograms.
            Rationale: To summarize and describe the main features of the dataset, identify patterns, outliers, and data distributions.
            Expected Outcomes: Understanding typical social media usage hours, average sleep, mental health scores, and demographic breakdowns of the student population.
        Inferential Statistics (Hypothesis Testing):
            Methods: t-tests and ANOVA (for comparing means across groups), Chi-squared tests (for associations between categorical variables).
            Rationale: To draw conclusions about a population based on sample data and test specific hypotheses (e.g., if there's a significant difference in mental health scores between different age groups).
            Expected Outcomes: Identifying statistically significant differences between groups (e.g., do high social media users sleep less than low users?).
        Predictive Modeling:
            Methods:
                Logistic Regression (as you used):
                    Rationale: To predict a binary outcome (Affects_Academic_Performance - Yes/No) and quantify the association between predictors and the odds of the outcome. Good for interpretability.
                    Expected Outcomes: Identifying key factors (e.g., usage hours, specific platforms) that significantly increase or decrease the odds of academic performance being affected.
                Linear Regression (as you used):
                    Rationale: To predict a continuous outcome (Sleep_Hours_Per_Night, Mental_Health_Score) and identify factors that linearly influence these.
                    Expected Outcomes: Quantifying how much sleep or mental health score changes with a unit increase in usage or specific demographic/social factors.
                Decision Trees/Random Forests (alternative):
                    Rationale: Non-linear models capable of capturing complex interactions and useful for both classification (e.g., Affects_Academic_Performance) and regression (e.g., Sleep_Hours_Per_Night). Provides feature importance.
                    Expected Outcomes: A tree-like structure showing decision rules that lead to different outcomes, and identifying the most important features in predicting academic impact, sleep, or mental health. Could be more robust to outliers than linear models.
                Support Vector Machines (SVM) / Neural Networks (NN) (advanced):
                    Rationale: Powerful for complex classification/regression tasks, especially with large datasets and non-linear relationships. Can capture highly intricate patterns.
                    Expected Outcomes: High predictive accuracy, but potentially less interpretable "black-box" models.
        Clustering (Unsupervised Learning):
            Methods: K-Means, Hierarchical Clustering.
            Rationale: To identify natural groupings (segments) within the student population based on their social media behavior, demographics, and well-being metrics without predefined labels.
            Expected Outcomes: Discovering distinct "student profiles" (e.g., "heavy user, low sleep, affected performance" group vs. "moderate user, good sleep, no impact" group).
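
Several of the method families listed above can be tried in a few lines of base R. The sketch below is illustrative only: the data are simulated, the negative link between usage and sleep is built into the toy data as an assumption, and `Usage_Group` is a hypothetical variable created here by splitting usage at 5 hours. It shows one descriptive summary, two hypothesis tests, and a k-means clustering.

```r
# Simulated data; the usage/sleep relationship is an assumption of this toy example.
set.seed(1)
n <- 300
students <- data.frame(Avg_Daily_Usage_Hours = runif(n, 1, 9))
students$Sleep_Hours_Per_Night <-
  9 - 0.4 * students$Avg_Daily_Usage_Hours + rnorm(n, sd = 0.7)
students$Usage_Group <- factor(
  ifelse(students$Avg_Daily_Usage_Hours > 5, "High", "Low")
)
students$Gender <- factor(sample(c("Female", "Male"), n, replace = TRUE))

# Descriptive statistics: five-number summary plus mean.
print(summary(students$Sleep_Hours_Per_Night))

# Inferential statistics: Welch two-sample t-test comparing mean sleep
# between high and low social media users.
sleep_test <- t.test(Sleep_Hours_Per_Night ~ Usage_Group, data = students)
print(sleep_test$estimate)

# Inferential statistics: chi-squared test of association between
# two categorical variables.
assoc_test <- chisq.test(table(students$Gender, students$Usage_Group))
print(assoc_test$p.value)

# Clustering: k-means on standardised numeric features
# (k = 2 is chosen arbitrarily for the illustration).
features <- scale(students[, c("Avg_Daily_Usage_Hours", "Sleep_Hours_Per_Night")])
clusters <- kmeans(features, centers = 2, nstart = 10)
print(table(clusters$cluster))
```

Because the brief asks for the methods specified here to differ from those in the demonstration, a sketch like this is better suited to planning than to inclusion in Section 4.3.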

4.3 Demonstration

This sub-section addresses: "Identify a usable dataset for the proposed project and perform some basic analysis on the identified dataset to demonstrate the feasibility of the project, using R (e.g., detailing the information/features contained in the dataset, analyse the basic characteristics of the dataset, etc.), and report the analysis process and result in the demonstration section of a final report." 

    Dataset Identification:
        Clearly state the dataset used: "For this demonstration, a publicly available dataset titled 'Social Media Addiction and Mental Health' was utilized."
        Download Link: Provide the exact link where the dataset can be downloaded. You must supply this link yourself.
    Data Description and Features:
        Briefly describe the dataset's content: "This dataset comprises self-reported survey data from XXX students, collected across various countries, detailing their social media usage habits, academic performance, sleep patterns, mental health, and demographic information."
        List key features/variables (columns) and their types:
            Age (Numeric)
            Gender (Categorical)
            Avg_Daily_Usage_Hours (Numeric)
            Sleep_Hours_Per_Night (Numeric)
            Mental_Health_Score (Numeric)
            Affects_Academic_Performance (Binary: Yes/No)
            Most_Used_Platform (Categorical)
            Academic_Level (Categorical)
            Relationship_Status (Categorical)
            Conflicts_Over_Social_Media (Numeric)
            Addicted_Score (Numeric)
            Country (Categorical)
    Analysis Process (Using R):
        Initial Data Loading and Inspection:
            Mention loading the CSV into R.
            Briefly describe using str(), summary(), head() to understand data types and initial distributions.
        Data Cleaning and Preprocessing:
            Handling Missing Values: State how na.omit() was used to remove rows with missing data. Provide the number of rows before and after.
            Type Conversion: Explain converting relevant columns to factors (e.g., Gender, Academic_Level, Affects_Academic_Performance). Mention specifically setting Affects_Academic_Performance to levels = c("No", "Yes") for logistic regression.
            Outlier/Irrelevant Data Removal: Detail your decision to remove users of "LinkedIn", "Twitter", and "YouTube" and the justification (low sample size, lack of significance, separation issues).
            Feature Engineering/Grouping (Most_Used_Platform_Final): Explain the rationale and process for creating Most_Used_Platform_Final by grouping country-specific messaging apps (LINE, KakaoTalk, WeChat, VKontakte, WhatsApp) into "Messaging_Apps", and setting Facebook as the reference level.
        Exploratory Data Analysis (EDA):
            Visualizations:
                Bar Chart (Academic Performance Impact by Most Used Platform): Describe the insights from this chart (e.g., "The initial visualization of academic performance impact by most used platform revealed varying proportions of affected students across different platforms. Notably, platforms like TikTok and Snapchat showed a higher proportion of 'Yes' responses compared to Facebook or Messaging Apps.")
                Box Plot (Mental Health Score by Academic Performance Impact): Describe the insights (e.g., "A box plot comparing Mental Health Score by Academic Performance Impact clearly illustrated that students reporting affected academic performance also exhibited lower median mental health scores and a wider distribution of lower scores compared to their unaffected counterparts.")
            Descriptive Statistics: Mention calculating basic descriptive statistics for key numerical variables (Avg_Daily_Usage_Hours, Sleep_Hours_Per_Night, Mental_Health_Score) to show their distributions.
        Statistical Modeling:
            Rationale for Models:
                Logistic Regression: Explain why it's chosen for Affects_Academic_Performance (binary outcome) to predict odds.
                Linear Regression: Explain why it's chosen for Sleep_Hours_Per_Night and Mental_Health_Score (continuous outcomes) to predict direct changes.
            Model Building Steps: Mention inclusion of VIF checks for multicollinearity. Briefly discuss how you handled separation issues.
    Analysis Results:
        Overview: State the overall significance and R-squared/pseudo R-squared/AUC of each model (e.g., "The models demonstrated strong predictive power for all three outcomes...").
        Key Findings (Summarize your interpretations for each model):
            Academic Performance Impact:
                Highlight Avg_Daily_Usage_Hours, Sleep_Hours_Per_Night, Academic_Level categories, Age, and the specific Most_Used_Platform_Final categories (TikTok, Snapchat, Instagram, Messaging Apps) as significant predictors.
                Emphasize TikTok's extremely high odds ratio.
                Mention Facebook as the baseline with lowest associated impact.
            Sleep Hours Per Night:
                Highlight Avg_Daily_Usage_Hours (strong negative), Academic_LevelHigh School (negative), Snapchat (negative), TikTok (negative).
                Mention the positive effect of Relationship_Status categories ("In Relationship", "Single") compared to "It's Complicated".
                Explicitly state which variables were not significant (Age, Gender, Instagram, Messaging Apps).
            Mental Health Score:
                Highlight Avg_Daily_Usage_Hours (negative), Conflicts_Over_Social_Media (very strong negative), Instagram (negative), Snapchat (negative), GenderMale (negative).
                Explicitly state which variables were not significant (Academic Levels, TikTok, Messaging Apps, Age, Relationship Status categories).
        Consistency/Divergence: Briefly discuss any interesting patterns or divergences between the three models' significant predictors (e.g., TikTok affecting academic performance and sleep but not directly mental health score in this model).
    Feasibility Conclusion: Conclude by reiterating that the analysis demonstrates the feasibility of using such data and methods to gain insights into social media's impact on student well-being.
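
The cleaning and modelling steps described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the actual analysis: the data frame `raw` is simulated in place of the real CSV, the effect of usage on the outcome is built into the toy data, and the variance inflation factor is computed by hand from its definition, 1 / (1 - R^2), as a stand-in for a packaged function such as `car::vif`.

```r
# Simulated stand-in for the survey CSV; the real analysis would load the file instead.
set.seed(123)
n <- 400
platforms <- c("Facebook", "Instagram", "TikTok", "WhatsApp", "WeChat", "LinkedIn")
raw <- data.frame(
  Avg_Daily_Usage_Hours = round(runif(n, 1, 9), 1),
  Sleep_Hours_Per_Night = round(rnorm(n, 7, 1), 1),
  Most_Used_Platform = sample(platforms, n, replace = TRUE),
  stringsAsFactors = FALSE
)
# Toy outcome: odds of "Yes" rise with usage (assumed for illustration only).
p_yes <- plogis(-4 + 0.8 * raw$Avg_Daily_Usage_Hours)
raw$Affects_Academic_Performance <- ifelse(runif(n) < p_yes, "Yes", "No")
raw$Sleep_Hours_Per_Night[sample(n, 10)] <- NA  # inject missing values

# 1. Handle missing values and record the row counts before and after.
rows_before <- nrow(raw)
clean <- na.omit(raw)
rows_after <- nrow(clean)

# 2. Remove platforms excluded from the analysis.
excluded <- c("LinkedIn", "Twitter", "YouTube")
clean <- clean[!(clean$Most_Used_Platform %in% excluded), ]

# 3. Group messaging apps and set Facebook as the reference level.
messaging <- c("LINE", "KakaoTalk", "WeChat", "VKontakte", "WhatsApp")
clean$Most_Used_Platform_Final <- ifelse(
  clean$Most_Used_Platform %in% messaging, "Messaging_Apps", clean$Most_Used_Platform
)
clean$Most_Used_Platform_Final <- relevel(
  factor(clean$Most_Used_Platform_Final), ref = "Facebook"
)

# 4. Fix the outcome's level order so the model predicts the odds of "Yes".
clean$Affects_Academic_Performance <- factor(
  clean$Affects_Academic_Performance, levels = c("No", "Yes")
)

# 5. Logistic regression; exponentiated coefficients are odds ratios.
logit_fit <- glm(
  Affects_Academic_Performance ~ Avg_Daily_Usage_Hours +
    Sleep_Hours_Per_Night + Most_Used_Platform_Final,
  data = clean, family = binomial
)
odds_ratios <- exp(coef(logit_fit))
print(round(odds_ratios, 2))

# 6. Linear regression for a continuous outcome.
sleep_fit <- lm(
  Sleep_Hours_Per_Night ~ Avg_Daily_Usage_Hours + Most_Used_Platform_Final,
  data = clean
)
print(summary(sleep_fit)$r.squared)

# 7. Hand-computed VIF for one numeric predictor: regress it on the other
#    predictors and apply 1 / (1 - R^2).
aux_fit <- lm(
  Avg_Daily_Usage_Hours ~ Sleep_Hours_Per_Night + Most_Used_Platform_Final,
  data = clean
)
vif_usage <- 1 / (1 - summary(aux_fit)$r.squared)
print(vif_usage)
```

Recording `rows_before` and `rows_after` supplies the before/after counts the outline asks for, and setting the factor levels explicitly avoids relying on alphabetical ordering for the reference categories.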

5. Standard for Data Science Process, Data Governance and Management

This section addresses: "Describe any standards used in your data science process" and "Describe any practices for data governance and management in the project, e.g., how to address key issues such as data accessibility, security, and confidentiality, as well as potential ethical concerns related to data usage." 

5.1 Standard for Data Science Process

    CRISP-DM (Cross-Industry Standard Process for Data Mining): This is a widely recognized and robust standard for data science projects.
        Description: Explain that CRISP-DM provides a structured approach to planning and executing data mining/science projects.
        Stages and How You Applied Them (briefly):
            Business Understanding: Defining the problem (social media impact on students), goals (predicting academic performance, sleep, mental health), and success criteria.
            Data Understanding: Initial data collection, exploration (EDA, str(), summary()), quality checks, and identification of relevant variables.
            Data Preparation: Cleaning missing values, converting data types, handling outliers, feature engineering (e.g., Most_Used_Platform_Final), and data filtering. This is where most of your R code fits.
            Modeling: Selecting and applying appropriate statistical methods (logistic and linear regression), building the models, and checking diagnostics (VIF).
            Evaluation: Assessing model performance (R-squared, pseudo R-squared, AUC, significance of coefficients), interpreting findings, and determining if goals are met.
            Deployment: (Future step for a real-world project) Operationalizing the model findings (e.g., policy recommendations, intervention strategies). For your project, it's about reporting and presenting findings.
        Rationale: Emphasize that using a standard like CRISP-DM ensures a systematic, repeatable, and thorough approach to the project, minimizing errors and maximizing the reliability of findings.

5.2 Data Governance and Management

    Introduction: Define data governance (the overall management of the availability, usability, integrity, and security of data) and data management (the execution of governance policies).
    Practices in this Project:
        Data Accessibility:
            Current Project: The dataset is publicly available. This promotes transparency and reproducibility for research purposes.
            Future Expansion: If collecting primary sensitive data, accessibility would be restricted to authorized personnel via secure networks/platforms.
        Data Security and Confidentiality:
            Current Project: As a publicly available and anonymized dataset, direct security measures on the data itself are limited as personal identifiers were already removed.
            Future Expansion (if collecting raw data):
                Anonymization/Pseudonymization: Removing or masking personal identifiers.
                Access Control: Limiting data access to authorized individuals (e.g., researchers with ethical approval).
                Secure Storage: Storing data on encrypted servers, cloud storage with robust security features, and limited physical access.
                Data Transfer: Using secure protocols (SFTP, HTTPS) for data transfer.
        Ethical Concerns Related to Data Usage:
            General: Data science projects, especially those involving human behavior and well-being, carry significant ethical responsibilities.
            Current Project:
                Self-Reported Bias: Acknowledge that self-reported data can be subject to bias (e.g., social desirability, memory issues), which might affect the veracity.
                Generalizability: Discuss limitations in generalizability due to sample demographics or collection method.
                Privacy: While the dataset is anonymized, for any future primary data collection, strict adherence to privacy regulations (e.g., GDPR, HIPAA if applicable) is paramount.
                Misinterpretation/Misuse of Findings: Emphasize the importance of interpreting statistical associations as correlations, not causation, to avoid drawing unsubstantiated conclusions that could unfairly stigmatize users of certain platforms or demographics. For example, your finding that TikTok users have the highest odds of reporting an academic impact is a correlation, not a causal statement that TikTok causes academic issues.
                Informed Consent: For future primary data collection, explicitly state the necessity of informed consent, ensuring participants understand data collection, usage, and anonymization.
    Conclusion: Reiterate the importance of adhering to these standards for responsible and ethical data science practice, ensuring data integrity, privacy, and trustworthy insights.
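
The pseudonymization practice listed under Data Security and Confidentiality can be sketched in base R. This is an illustrative example only: the records and the `Student_ID` column are made up, and `pseudonymise` is a hypothetical helper, not a function from any package. It replaces each distinct identifier with a random token and returns the lookup key separately.

```r
# Hypothetical raw records containing a direct identifier (made-up values).
raw_records <- data.frame(
  Student_ID = c("S1001", "S1002", "S1001", "S1003"),
  Mental_Health_Score = c(6, 7, 5, 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map each distinct identifier to a random token.
pseudonymise <- function(ids) {
  unique_ids <- unique(ids)
  # Shuffle the token numbers so tokens do not reveal the original order.
  tokens <- sprintf("P%04d", sample(length(unique_ids)))
  key <- setNames(tokens, unique_ids)
  list(pseudonyms = unname(key[ids]), key = key)
}

set.seed(7)
result <- pseudonymise(raw_records$Student_ID)

# Dataset that can be shared with analysts: no direct identifiers.
shared <- data.frame(
  Participant = result$pseudonyms,
  Mental_Health_Score = raw_records$Mental_Health_Score,
  stringsAsFactors = FALSE
)
print(shared)

# result$key links tokens back to identities. It should be stored separately
# under access control, or destroyed if re-identification is never needed.
```

The same participant always receives the same token, so repeated measurements can still be linked without exposing the original identifier.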