# COGS 108 - Project Proposal

## Authors

- David Oh: Conceptualization, Methodology, Data Curation, Writing – Original Draft, Project Administration, Writing – Review & Editing
- Nella Sak: Project Administration, Conceptualization, Visualization, Writing – Review & Editing
- Rhett McClurg: Project Administration, Analysis, Writing – Review & Editing
- Jack Park: Background Research, Experimental Investigation, Writing – Original Draft, Software

## Research Question

Which favorite music genres show the strongest positive and negative associations with a person’s self-reported mental health, measured by Anxiety, Depression, and Insomnia scores on a 0-10 scale?

This project uses statistical inference to test whether naming a specific genre as one’s favorite, together with that genre’s average BPM, is associated with significantly higher or lower mental health scores. By controlling for age and hours of daily listening, we aim to separate the relationship between musical preference and mental health from these two likely confounders. The analysis aims to uncover whether certain genre preferences can serve as indicators of psychological distress patterns.

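One way to make “controlling for age and hours of daily listening” concrete is a linear model with one indicator per genre. This is a minimal sketch under that assumption, not a finalized analysis plan; the coefficient names and the reference genre $g_0$ are illustrative notation introduced here.

$$
\text{Score}_i = \beta_0 + \sum_{g \neq g_0} \beta_g \, \mathbf{1}[\text{Genre}_i = g] + \beta_{\text{BPM}} \, \text{BPM}_i + \beta_{\text{Age}} \, \text{Age}_i + \beta_{\text{Hours}} \, \text{Hours}_i + \varepsilon_i
$$

Here $\text{Score}_i$ stands for one of the Anxiety, Depression, or Insomnia scores for respondent $i$, and each $\beta_g$ would be read as the difference in expected score relative to the reference genre after adjusting for BPM, age, and listening time.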

## Background and Prior Work

This project examines the relationship between music listening habits and mental health, an active area of research. Music has grown from a form of entertainment into a tool for emotional regulation and stress management. Recent studies suggest that music can benefit students with high stress levels. For example, a 2025 study titled “Statistical Insights into the Influence of Music on Mental Health” provides quantitative evidence of the impact of musical engagement on psychological well-being [1].


A thesis from the University of Guelph by Parya Abadeh argues that while music is a regular tool for emotional processing, the type of engagement determines whether the outcome is beneficial or not. Abadeh explains that “certain songs-especially those with negative or ambiguous emotional content- can reinforce rumination and other maladaptive emotion-regulation patterns, ultimately elevating stress…” [2].


However, the nature of this relationship is widely debated, especially for genres that differ in intensity and calmness. A 2022 study by Kuo Wang, Sunyu Gao, and Jianhao Huang titled “Learning About Your Mental Health From Your Playlist? Investigating the Correlation Between Music Preference and Mental Health of College Students” found correlations between music genre preferences and mental health status. Their results indicated that the “correlation was significant and positive between pop music and mental health” [3]. They also observed a “significant and inverse relationship between college students’ preference for heavy music and their mental health” [3]. These findings raise a question: does preference for a specific genre actually correlate with patterns of psychological distress?

References:

- [1] SciTePress (2025). Statistical Insights into the Influence of Music on Mental Health. https://www.scitepress.org/Papers/2025/138269/138269.pdf
- [2] Abadeh, P. (2020). Research on Music and Well-being. University of Guelph Atrium Repository. https://atrium.lib.uoguelph.ca/server/api/core/bitstreams/33cdecf0-9f34-4317-96a9-1e672fa4a538/content
- [3] Wang, K., Gao, S., & Huang, J. (2022). Learning About Your Mental Health From Your Playlist? Investigating the Correlation Between Music Preference and Mental Health of College Students. Frontiers in Psychology. https://pmc.ncbi.nlm.nih.gov/articles/PMC9072654/


## Hypothesis


We hypothesize that high-BPM genres such as Rock or Metal will show a significant positive correlation with ‘high arousal’ symptoms, specifically Anxiety and Insomnia scores. We further predict that Depression scores will be elevated at both extremes of tempo, among listeners of the slowest and the fastest genres, reflecting different coping strategies.

Fast auditory stimuli can raise heart rate and may cause restlessness, making it harder to fall asleep. For this reason, we expect listeners of high-BPM music to report higher Anxiety and Insomnia scores. Additionally, people often listen to music that reflects their current mood. Someone with depression might therefore listen to slow music that mirrors their sadness, or pick faster, louder music as a way to release frustration.


## Data

### Ideal Data Set

The ideal dataset for this project would consist of objective data from at least 10,000 diverse participants to ensure representation across all music genres. A dataset of this size would also support analysis of narrower subtopics. For the mental health variables, clinically validated assessments and biometric data from wearable devices would be more reliable than self-reported scores. The music data would be pulled from a music platform’s API to capture precise features such as genre, BPM, energy, and valence. This data would be collected automatically by a background app, which would help reduce survey bias. Ideally, all information would be de-identified to protect privacy and stored securely, for example in a private GitHub repository accessible only to project contributors.

### Real World Data Sets

- Music & Mental Health Survey Results (Kaggle): This dataset is publicly available, so we can download and use it. Its key independent variables are favorite music genre and BPM, its dependent variables are the self-reported Anxiety, Depression, and Insomnia scores, and it includes control variables such as age and hours of listening per day. (https://www.kaggle.com/datasets/catherinerasgaitis/mxmh-survey-results/data)
- 32,000 Songs & Mental Health Classification (Kaggle): This dataset is also publicly available for download. It is a large dataset that maps song audio features to emotional and mental health states. Important variables include BPM, Valence, and Energy, which describe how positive or intense a song is, and Mental Health Labels, which give the emotion or mental health state associated with each song. (https://www.kaggle.com/datasets/ashishbadal18/32000-songs-ragas-mental-health-classification)
- 500K+ Spotify Songs with Lyrics, Emotions & More (Kaggle): This is a publicly downloadable CSV file with information on individual Spotify songs. Useful variables include Emotion (the main emotion extracted from the lyrics), Genre, Tempo, Energy, Danceability, and Positivity. Other potentially useful variables flag whether a song is good for exercise, relaxation, yoga, or driving. These are interesting because one could argue that a person who listens to songs suited for relaxation would tend to have a lower anxiety score. (https://www.kaggle.com/datasets/devdope/900k-spotify)

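As a minimal sketch of how the survey variables listed above could be read and filtered, the example below parses a small in-memory CSV using only the Python standard library. The column names (`Fav genre`, `Hours per day`, and so on), the sample rows, and the 300 BPM plausibility cutoff are all assumptions for illustration and should be checked against the downloaded file.

```python
import csv
import io

# Hypothetical excerpt standing in for the Kaggle CSV. The column names are
# assumed, and the rows are made up for illustration (not real responses).
SAMPLE_CSV = """Age,Hours per day,Fav genre,BPM,Anxiety,Depression,Insomnia
18,3,Rock,132,7,6,4
21,1.5,Classical,84,3,2,1
19,4,Metal,,8,7,9
17,2,Pop,999999,5,4,2
"""

NUMERIC_COLUMNS = ["Age", "Hours per day", "BPM", "Anxiety", "Depression", "Insomnia"]


def load_rows(handle, max_bpm=300):
    """Keep only the analysis columns; drop rows with missing or implausible values."""
    cleaned = []
    for raw in csv.DictReader(handle):
        try:
            row = {col: float(raw[col]) for col in NUMERIC_COLUMNS}
        except (KeyError, TypeError, ValueError):
            continue  # a required value is missing or not numeric
        if not 0 < row["BPM"] <= max_bpm:
            continue  # assumed plausibility cutoff for self-reported tempo
        genre = (raw.get("Fav genre") or "").strip()
        if not genre:
            continue  # favorite genre is required for the genre comparison
        row["Fav genre"] = genre
        cleaned.append(row)
    return cleaned


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), [r["Fav genre"] for r in rows])
```

Dropping incomplete rows is only one option; the actual cleaning decisions (dropping versus imputing, and where to set the BPM cutoff) would be made after exploring the real data.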

## Ethics 


### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?


 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> We have considered that the first dataset, the Music & Mental Health survey, might have selection bias because it was shared on Reddit, Discord, and other social media. These platforms are used mostly by younger people, so older generations might not be accurately represented. To address this, we will use Age as a control variable to check whether mental health scores are being driven by age rather than taste in music.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> Kaggle datasets are usually de-identified, but we are aware that the dataset might still contain PII. To minimize the risk of re-identification, we will work only with variables that are relevant to our analysis and drop all others. Additionally, we will analyze data in groups rather than looking at individual cases so that no single person’s mental health profile is highlighted.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We have considered that our results could be used to stereotype fans of certain music. For example, listeners of high-energy genres such as Metal could be seen as more anxious or unstable than listeners of lower-energy music. We will mitigate this by stating explicitly that we are examining correlations between variables, not causation.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> Even though the data is publicly available, it contains information that could be considered sensitive. We will store the data and working files on password-protected devices and use private repositories to prevent access by unauthorized third parties. This reduces the risk of data leaks.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
> We will use the data only for the duration of the project, which minimizes long-term risk. We will delete all local copies of the Kaggle dataset after the final project is finished. Deleting this data also eliminates the risk of future leaks and unauthorized use.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> To address blindspots in the analysis, we acknowledge the limitations of this dataset, since we cannot engage directly with individuals experiencing mental health struggles. To compensate, we will frame findings as correlational patterns and challenge our assumptions by testing alternative explanations, such as reverse causation. We also acknowledge that our samples are primarily WEIRD (Western, Educated, Industrialized, Rich, Democratic), meaning that the findings may not generalize to music-mental health relationships for all individuals. To address this shortcoming, we will frame our conclusions cautiously and describe findings as correlations that require validation within the target communities.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> One possible source of bias in our data is the survey distribution method, as the survey was posted on social media forums such as Reddit. This can overrepresent younger, more tech-literate populations, and music enthusiasts may have been more inclined to complete a music-oriented survey. Additionally, the mental health scores are self-reported and do not reflect any clinical diagnoses an individual may have, meaning that the results cannot be generalized to clinically diagnosed populations. To mitigate these issues, we can analyze the findings by age group to check whether the correlations hold across demographics. Within our findings, we will also acknowledge that categories such as depression and anxiety are self-perceived by the respondent and are not clinically diagnosed symptoms.


 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> We will strive to represent the underlying data honestly by including error bars in visualizations to show variability. We will also ensure that the axes of our visualizations are not truncated in ways that exaggerate our findings. Finally, we will report effect sizes, such as correlation strength, alongside p-values so that readers can judge both the statistical significance and the practical magnitude of our results.

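As a rough illustration of where such error bars could come from, the sketch below computes a group mean and a normal-approximation 95% confidence interval per genre using only the standard library. The scores are made-up numbers, not survey results, and the helper name `mean_with_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Made-up scores for illustration only; these are not survey results.
anxiety_by_genre = {
    "Rock": [7, 6, 8, 5, 7, 6],
    "Classical": [3, 4, 2, 5, 3, 4],
}


def mean_with_ci(values, confidence=0.95):
    """Return (mean, half-width) of a normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * stdev(values) / sqrt(len(values))
    return mean(values), half_width


summary = {genre: mean_with_ci(scores) for genre, scores in anxiety_by_genre.items()}
for genre, (avg, half_width) in summary.items():
    print(f"{genre}: {avg:.2f} +/- {half_width:.2f}")
```

The half-width is what would be drawn as the error bar. For genres with few respondents, a t-based interval would be wider and is probably the safer choice.
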
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> Yes, the dataset we are using does not contain PII, since all participants were anonymous in the original data collection.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> Yes, all processes will be documented with clear comments explaining our code. Additionally, we will include a README that explains where our data comes from and how to access it to rerun this analysis.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> Our analysis does not rely on variables or proxies that are unfairly discriminatory, since irrelevant demographic variables such as race, gender, and sexuality are not used.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> We will check whether correlations hold across all age groups and report any discrepancies we identify. When conducting the analysis, we will also note whether any group is underrepresented and how this affects the generalizability of our findings.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> Yes, we plan to use correlation coefficients to measure relationships between variables, and to check both linear (e.g., Pearson) and rank-based (e.g., Spearman) correlations so that we do not miss relevant non-linear patterns in the data.

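To show why checking both kinds of correlation can matter, here is a minimal standard-library sketch that computes a linear (Pearson) coefficient and a rank-based (Spearman) coefficient by hand. The data are made up, and the function names are illustrative; in the project itself a statistics library would likely be used instead.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values: the score rises with BPM, but not along a straight line.
bpm = [60, 80, 100, 120, 200]
anxiety = [1, 2, 3, 4, 5]

print(f"Pearson:  {pearson(bpm, anxiety):.3f}")
print(f"Spearman: {spearman(bpm, anxiety):.3f}")
```

Because the made-up relationship is monotonic but not linear, the rank-based coefficient comes out higher than the linear one, which is the kind of pattern a Pearson-only analysis could understate.
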
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> Yes, we intend to utilize correlations that are easy to interpret and complement our findings with data visualizations to make our findings understandable.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> We will dedicate a section of our final report to the shortcomings and biases of our analysis, covering relevant limitations such as possible biases from data collection and overrepresented demographics.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

* *We will communicate via iMessage and meet in-person or via Zoom. Any messages pertaining to the project in the group chat must receive a response within 24 hours.*
* *We will meet at least once a week. It is expected that team members show up on time, ready, and have the necessary materials to work on the project.*
* *Since we all come from different coding backgrounds, it is important to treat each other, especially those who aren’t as experienced, with respect and kindness. If anyone needs help in any way, the team is expected to listen and help them find the right direction.*
* *Every teammate is expected to complete their assigned tasks in a timely manner and, on collaborative portions, to communicate when they have finished. Each team member must acknowledge that it is their responsibility to be proactive.*

## Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/3  |  6 PM | Review project proposal; Read dataset and explore variables| Finalize team communication plan and logistics; Discuss and decide on final project topic; discuss hypothesis; begin background research; Assign data wrangling and EDA tasks | 
| 2/11  |  6 PM |  Import and clean dataset; Perform initial data analyses | Discuss wrangling and possible analytical approaches; Discuss data cleaning decisions & EDA findings; Assign data analysis tasks for each member to complete| 
| 2/18  | 6 PM  | Calculate the correlations between genres and mental health metrics; Run statistical significance tests; Begin creating data visualizations | Review correlation results and visualizations; discuss findings and patterns; plan additional analyses if needed|
| 2/25  | 6 PM  | Finish statistical analyses; Complete data visualizations | Review analysis; Ensure code written is reproducible and well-documented |
| 3/4  | 6 PM  | Draft Introduction, Methods, and Discussion sections; Integrate visualizations into report| Discuss/edit Analysis; Complete project check-in; Assign final edits and prep for video recording|
| 3/11  | 6 PM  | Finalize all report sections; Proofread and polish final report; Record video walking through analysis| Discuss/edit full project; final review of deliverables; divide work for video production |
| 3/18 | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |