#  Data

Public dataset: https://www.kaggle.com/datasets/catherinerasgaitis/mxmh-survey-results

#  Importing

Importing necessary libraries:

In [None]:
import numpy as np
import pandas as pd
from statistics import mean
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

Importing CSV file into a DataFrame:

In [None]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
survey = pd.read_csv(r"C:\Project\survey_results.csv")

An example of web scraping:

In [None]:
# Fetch the page
r = requests.get("https://www.pfizer.com/news/articles/why_and_how_music_moves_us")
soup = BeautifulSoup(r.content, "html.parser")

# Collect the words within paragraphs
text_p = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))
c_p = Counter(x.rstrip(punctuation).lower() for y in text_p for x in y.split())

# Collect the words within divs
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
c_div = Counter(x.rstrip(punctuation).lower() for y in text_div for x in y.split())

# Sum the two counters and list the words from most to least common
total = c_div + c_p
list_most_common_words = total.most_common()
print(list_most_common_words[:10])

Running this code prints the 10 most frequent words on the web page "Why — and How — Music Moves Us". Counting words in Python is useful for analyzing the language and topics of a page and for identifying key terms associated with its content. This information can serve purposes such as search engine optimization, content marketing, or research on a specific topic.
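
As a small offline illustration of the counting step, the sketch below applies the same `Counter` approach to a hard-coded string instead of a live page. The sample text and the tiny `stop_words` set are assumptions made for this example and are not part of the original analysis.

```python
from collections import Counter
from string import punctuation

# Hypothetical sample text standing in for the scraped page content
sample_text = "Music moves us. Music, in fact, moves the brain and the body!"

# Hypothetical minimal stop-word list; a real analysis would use a fuller one
stop_words = {"the", "and", "in", "us"}

# Lower-case each token and strip punctuation from both ends
tokens = [word.strip(punctuation).lower() for word in sample_text.split()]

# Drop empty tokens and stop words before counting
tokens = [t for t in tokens if t and t not in stop_words]

word_counts = Counter(tokens)
top_words = word_counts.most_common(2)
print(top_words)
```

Filtering stop words before counting keeps very common function words from dominating the top of the list.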

# Preparation 

Exploring the dataset:

In [None]:
survey.head()

In [None]:
survey.shape

In [None]:
survey.dtypes

In [None]:
survey.info()

The Timestamp and Permissions columns carry no information relevant to this analysis and can be dropped:

In [None]:
survey = survey.drop(columns=['Timestamp', 'Permissions'])

Dropping duplicate rows, keeping the first occurrence of each:

In [None]:
survey = survey.drop_duplicates(keep='first')

Checking missing values in each column:

In [None]:
survey.isnull().sum()

Replacing the null values in the 'Age' column with the mean of the existing non-null values in that column:

In [None]:
survey['Age'] = survey['Age'].fillna(survey['Age'].mean())

Removing any rows that have null values in the following columns of type object:

In [None]:
survey = survey.dropna(subset=['Primary streaming service','While working', 'Instrumentalist', 'Composer', 'Foreign languages','Music effects'])

Identifying and removing any rows that contained outliers:

In [None]:
print(survey['BPM'].max())
print(survey['BPM'].min())
print(survey['BPM'].mean())
print(survey['BPM'].mode())
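# Keep plausible tempos; rows with a missing BPM are kept so they can be imputed in the next step
survey = survey[survey['BPM'].isna() | ((survey['BPM'] > 1) & (survey['BPM'] < 250))]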

Replacing the remaining missing values with the average BPM for the user's favorite genre:

In [None]:
# Fill each missing BPM with the rounded mean BPM of that respondent's own favourite genre
genre_mean_bpm = survey.groupby('Fav genre')['BPM'].transform('mean').round(0)
survey['BPM'] = survey['BPM'].fillna(genre_mean_bpm)
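
To make the per-genre imputation concrete, here is a minimal sketch on toy data. The DataFrame `toy` and its values are hypothetical; only the column names mirror the survey. It relies on `groupby(...).transform('mean')`, which returns one value per original row, so each gap is filled with the mean of its own genre.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the survey
toy = pd.DataFrame({
    "Fav genre": ["Rock", "Rock", "Rock", "Jazz", "Jazz"],
    "BPM": [120.0, 140.0, np.nan, 90.0, np.nan],
})

# Mean BPM of each row's own genre, aligned to the original index
genre_mean_bpm = toy.groupby("Fav genre")["BPM"].transform("mean").round(0)

# Only the missing entries are replaced; existing values are left untouched
toy["BPM"] = toy["BPM"].fillna(genre_mean_bpm)
print(toy)
```

A plain loop that calls `fillna` once per genre would fill every gap with the first genre's mean, which is why the group-aligned series is used here.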

Checking the age range of respondents:

In [None]:
print(survey['Age'].min())
print(survey['Age'].max())

Adding an age group column with a custom function:

In [None]:
def age_groups(age):
  if age <= 14:
    return "10-14"
  elif age <= 19:
    return "15-19"
  elif age <= 24:
    return "20-24"
  elif age <= 29:
    return "25-29"
  elif age <= 34:
    return "30-34"
  elif age <= 49:
    return "35-49"
  elif age <= 69:
    return "50-69"
  else:
    return "70+"

survey['Age group'] = survey['Age'].apply(age_groups)
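
As an alternative sketch, the same binning can be expressed with `pd.cut`, which keeps the bin edges and labels side by side. The sample ages, the lower edge of 9, and the label list are assumptions made for this example; ages above 69 would come out as missing here rather than as "70+".

```python
import pandas as pd

# Hypothetical sample ages
ages = pd.Series([12, 19, 24, 25, 42, 69])

# Right-inclusive edges: (9, 14], (14, 19], (19, 24], ...; 9 is an assumed lower bound
bin_edges = [9, 14, 19, 24, 29, 34, 49, 69]
bin_labels = ["10-14", "15-19", "20-24", "25-29", "30-34", "35-49", "50-69"]

# pd.cut assigns each age to the interval it falls in and attaches the label
age_group = pd.cut(ages, bins=bin_edges, labels=bin_labels)
print(age_group.tolist())
```

The result is an ordered categorical, so sorting and grouping follow the bin order rather than alphabetical order.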

Checking the number of respondents in each age group:

In [None]:
survey.loc[:,['Age group']].groupby('Age group')['Age group'].agg('count')

The 70+ age group contains only 6 rows, so it is removed before further visualizations and calculations:

In [None]:
# Filter on the label itself so the result always matches the grouping function
survey = survey[survey['Age group'] != '70+']
survey.loc[:,['Age group']].groupby('Age group')['Age group'].agg('count')

Reorganizing the dataframe index:

In [None]:
survey = survey.sort_values('Age group')
survey = survey.reset_index(drop=True)
survey.head()  

Creating a table with the prevalence of depression, in percent, for each age group:

In [None]:
# The survey is already sorted by 'Age group', so unique() returns the labels in ascending order
age_group_labels = survey['Age group'].unique()
list_percent = [4.545230994, 1.233283626, 6.059251462, 4.916381579, 3.228402047, 4.729831871, 6.131866959]
# Building from a dict keeps the percentages numeric; a mixed NumPy array would turn them into strings
prevalence_depression = pd.DataFrame({'Age group': age_group_labels, 'Prevalence depression percent': list_percent})
print(prevalence_depression)

Merging the cleaned dataset with prevalence_depression and saving the resulting dataframe:

In [None]:
survey_full = pd.merge(survey, prevalence_depression, on='Age group')
survey_full.head()
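
The merge attaches one lookup value to every respondent in the same age group. The sketch below shows this on toy data; the frames `respondents` and `lookup` and their numbers are hypothetical. The `how="left"` and `validate="many_to_one"` arguments are optional additions, not used in the original cell, that keep every respondent row and raise an error if the lookup table lists an age group twice.

```python
import pandas as pd

# Hypothetical respondents; the column names mirror the survey
respondents = pd.DataFrame({
    "Age group": ["15-19", "15-19", "30-34"],
    "Hours per day": [3.0, 5.0, 2.0],
})

# Hypothetical lookup table with one row per age group
lookup = pd.DataFrame({
    "Age group": ["15-19", "30-34"],
    "Prevalence depression percent": [1.5, 3.0],
})

# Each respondent row receives the value of its age group
merged = pd.merge(respondents, lookup, on="Age group", how="left", validate="many_to_one")
print(merged)
```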

# Analysis  & Visualisations & Insights 

Exploring trends in music listening and favourite platforms:

In [None]:
survey_by_ages = survey_full.loc[:,['Age group','Primary streaming service','Hours per day']]
survey_by_ages = survey_by_ages.groupby(['Age group','Primary streaming service'])['Hours per day'].agg(['count', 'max', 'mean'])
print(survey_by_ages)

1. Visualizing the primary streaming service of respondents:

In [None]:
plt.figure(figsize=(12,6))
# A bar chart of category counts; plt.hist bins categorical data unevenly
survey_full['Primary streaming service'].value_counts().plot(kind='bar')
plt.ylabel('Number of responses')
plt.xlabel('Primary streaming service')
plt.show()

The chart shows that Spotify is the most used platform among respondents.

2. Visualizing average hours of listening to music per day by age group:

In [None]:
survey_full.loc[:,['Age group','Hours per day']].groupby('Age group')['Hours per day'].mean().plot(kind='bar')
plt.ylabel('Average hours listening to music per day')
plt.title('Average hours listening to music per day by age group')
plt.show()

The graph illustrates that the peak of music listening occurs during teenage years, gradually declining until the age of 50, and then increasing again.

3. Exploring the effects of listening to music:

In [None]:
survey_by_effect = survey_full.loc[:,['Age group','Music effects']].groupby(['Age group','Music effects'])['Music effects'].count()
print(survey_by_effect)

# In a horizontal bar chart the counts run along the x-axis
survey_full.loc[:,['Age group','Music effects']].groupby('Music effects')['Music effects'].count().plot(kind='barh')
plt.xlabel('Number of answers')
plt.title('Effects of listening to music')
plt.show()

survey_by_effect.plot(kind='barh')
plt.xlabel('Number of answers')
plt.title('Effects of listening to music by age group')
plt.show()


Most respondents stated that music improved their mood. This finding was consistent across all age groups, indicating that the positive impact of music on emotions is not restricted to any particular age group.


4. Exploring the correlation between hours of listening to music and disorders:

In [None]:
survey_full.plot(kind ='scatter', x = 'Hours per day', y = 'Anxiety')
plt.show()

survey_full.plot(kind = 'scatter', x = 'Hours per day', y = 'Depression')
plt.show()

survey_full.plot(kind = 'scatter', x = 'Hours per day', y = 'Insomnia')
plt.show()

survey_full.plot(kind = 'scatter', x = 'Hours per day', y = 'OCD')
plt.show()

The scatter plots show no visible correlation between hours of listening to music and the self-reported disorder scores. This could be a good case for further study using machine learning.
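
A visual impression from scatter plots can be backed up with a correlation coefficient. The sketch below computes Pearson correlations with `DataFrame.corr` on hypothetical toy data; the values are invented for illustration and are not survey results. Since the disorder scores are ordinal ratings, passing `method="spearman"` may be a reasonable alternative.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the survey
toy = pd.DataFrame({
    "Hours per day": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Anxiety": [2.0, 4.0, 6.0, 8.0, 10.0],
    "Insomnia": [5.0, 1.0, 4.0, 2.0, 3.0],
})

# Pairwise Pearson correlations, reduced to the column of interest
corr_with_hours = toy.corr()["Hours per day"].drop("Hours per day")
print(corr_with_hours)
```

Values near 0 indicate little linear association, while values near 1 or -1 indicate a strong one. On the real data the same call would be applied to `survey_full` with the relevant columns selected.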

5. Exploring the correlation between BPM and disorders:

In [None]:
survey_full.plot(kind = 'scatter', x = 'BPM', y = 'Anxiety')
plt.show()

survey_full.plot(kind = 'scatter', x = 'BPM', y = 'Depression')
plt.show()

survey_full.plot(kind = 'scatter', x = 'BPM', y = 'Insomnia')
plt.show()

survey_full.plot(kind = 'scatter', x = 'BPM', y = 'OCD')
plt.show()

The scatter plots show no visible correlation between BPM and the self-reported disorder scores. This could also be a good case for further study using machine learning.

6. Exploring the correlation between respondents' favorite genres and disorders:

In [None]:
plt.xlabel('Level of Anxiety')
survey_full.loc[:,['Fav genre','Anxiety']].groupby('Fav genre')['Anxiety'].mean().plot(kind='barh')
plt.show()

plt.xlabel('Level of Depression')
survey_full.loc[:,['Fav genre','Depression']].groupby('Fav genre')['Depression'].mean().plot(kind='barh')
plt.show()

plt.xlabel('Level of Insomnia')
survey_full.loc[:,['Fav genre','Insomnia']].groupby('Fav genre')['Insomnia'].mean().plot(kind='barh')
plt.show()

plt.xlabel('Level of OCD')
survey_full.loc[:,['Fav genre','OCD']].groupby('Fav genre')['OCD'].mean().plot(kind='barh')
plt.show()

Respondents whose favorite genre is gospel report low average levels of OCD and depression, and respondents who prefer rap or country tend to report lower levels of insomnia.

7. Exploring the correlation between the prevalence of depression and the habit of listening to music while working:

In [None]:
plt.ylabel('Mean prevalence depression percent')
plt.xlabel('Listening music while working')
plt.title('Prevalence of depression for people listening to music while working or not')
survey_full.loc[:,['While working','Prevalence depression percent']].groupby('While working')['Prevalence depression percent'].mean().plot(kind='bar')
plt.show()

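The mean prevalence of depression is lower among individuals who listen to music while working. Because the prevalence value is assigned per age group rather than measured per respondent, this difference reflects the age mix of the two groups.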