# Introduction to the data frame - Music Therapy

The post-pandemic era presents unprecedented challenges for individuals managing their mental health. Studies indicate that music has played a significant role in improving both physical and mental well-being (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8566759/). I am eager to explore the depth of music's influence on mental health and extract valuable insights from this study.

We want to identify what, if any, correlations exist between an individual's music taste and their self-reported mental health. Ideally, these findings could contribute to a more informed application of music therapy (MT) or simply provide interesting insights about the mind.

In [None]:
#pip install numpy
#pip install pandas
#pip install matplotlib

In [1]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")  # This is to ignore any warnings that might pop up during execution

In [2]:
# Basic libraries to manipulate data
import matplotlib.pyplot as plt  # Matplotlib for data visualization
import numpy as np  # Numpy for numerical computations
import pandas as pd  # Pandas for data manipulation

In [3]:
np.random.seed(42)  # To ensure all the probabilistic things are reproducible

## Background on Data

#### i. Dataset
For this analysis, we take the data from the Kaggle website (https://www.kaggle.com/datasets/catherinerasgaitis/mxmh-survey-results).

As per the dataset collector and owner, the survey encompassed the following parameters:
**Block 1: Music genres**
Respondents selected Never, Rarely, Sometimes, or Very frequently to indicate the frequency of their listening to 16 music genres.

**Block 2: Mental health**
Respondents rated Anxiety, Depression, Insomnia, and OCD (Obsessive-compulsive disorder) on a scale of 0 to 10, with 10 representing the highest frequency and severity.

#### ii. Scope & Procedures
This analysis is based on data collected from 736 respondents. The survey, conducted from Aug. 27 to Nov. 9, 2022, was open to participants of any age or location. The questionnaire comprised a mix of open-ended and drop-down questions, designed to elicit relatively concise responses and encourage participation.

# On Relative Paths and Well-Structured Projects
As you'll often collaborate within a team, establishing order from the outset is a beneficial practice. This ensures code portability and facilitates bug fixing.

For reading data from files, it's advisable to use a relative path—i.e., a path originating from the project directory. This way, being in the correct directory ensures smooth execution.

Throughout this course, each project will have its own directory. At the top level, notebooks will reside, while all datasets will be stored in a directory named "data."

Consequently, we can define the relative path for data and effortlessly read files by appending the correct file name from there.
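
As a minimal sketch of this convention, the snippet below builds the path to a data file with the standard-library `pathlib` module. The directory name `data` and the file name `survey_music.csv` follow this notebook's layout; the variable names are only illustrative.

```python
from pathlib import Path

# Relative path to the data directory, interpreted against the current working directory
data_dir = Path("data")

# Joining with "/" avoids hand-written separators and works on any operating system
survey_path = data_dir / "survey_music.csv"

print(survey_path.as_posix())   # data/survey_music.csv
print(survey_path.resolve())    # absolute form; depends on where the notebook runs
```

A `Path` object can be passed directly to `pd.read_csv`, so this is a drop-in alternative to concatenating strings.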

If you're using Google Colab, things get a bit trickier. You'll need to upload the data file each time you want to access it in a new session.

Alternatively, you can mount your Google Drive, head to your project directory, and then employ relative paths just like before.

In [1]:
# Mount your drive
# ! This will only work in Google Colab.
#from google.colab import drive
#drive.mount('/content/drive')

# Navigate to your project's directory
# ! This will vary based on each person's drive structure.
#import os
#os.chdir('/content/drive/My Drive/.../data/')

# Define the relative path within
#data_path = "./data/"

# Basic characteristics of the datasets

Now that we've completed the previous steps, let's take a snapshot of the datasets. Understanding the basic characteristics is crucial; without knowing what we have, we can't address the questions we aim to answer.

To kick things off, we'll import our tables into a Pandas dataframe.

In [4]:
# Specify the path to the datasets
data_path = "./data/"

# Specify the filenames of the datasets
survey_filename = "survey_music.csv"

# Read the CSV files and create backup copies
survey_df_data = pd.read_csv(data_path + survey_filename)

# Create working copies of the dataframes for analysis
survey_df = survey_df_data.copy()


Since this table has many rows, printing it in its entirety is impractical.
Various methods are available for a glimpse of the contents.

To view the start and end of the table, we can use `head` and `tail`. However, rows are often stored in a meaningful order (here, for instance, by submission time), so neighboring rows tend to resemble each other and the first or last few may not be representative. Drawing random rows with the `sample` method gives a broader view of the table.
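
The difference between the three methods can be sketched on a small made-up table. The column names and values below are hypothetical and only serve to show that `head` and `tail` see the edges of an ordered table, while `sample` draws from anywhere.

```python
import pandas as pd

# Toy table whose rows are stored in a meaningful order (hypothetical data)
toy_df = pd.DataFrame({
    "respondent": range(10),
    "month": [8, 8, 8, 8, 9, 9, 9, 10, 10, 11],
})

first_rows = toy_df.head(3)   # only sees month 8
last_rows = toy_df.tail(3)    # only sees months 10 and 11
random_rows = toy_df.sample(3, random_state=42)  # rows drawn from anywhere in the table

print(first_rows, last_rows, random_rows, sep="\n\n")
```

Passing `random_state` makes the draw reproducible, which helps when a notebook is shared with teammates.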


In [5]:
# Option to display all columns
pd.set_option('display.max_columns', None)
survey_df.sample(5)

Unnamed: 0,Timestamp,Age,Primary streaming service,Hours per day,While working,Instrumentalist,Composer,Fav genre,Exploratory,Foreign languages,BPM,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects,Permissions
669,9/15/2022 15:30:41,20.0,Spotify,1.0,Yes,No,No,K pop,Yes,Yes,120.0,Sometimes,Never,Rarely,Never,Sometimes,Never,Rarely,Very frequently,Never,Sometimes,Never,Very frequently,Sometimes,Never,Rarely,Never,8.0,7.0,1.0,1.0,No effect,I understand.
33,8/28/2022 10:59:53,17.0,Spotify,4.0,No,No,No,Rock,Yes,Yes,142.0,Rarely,Rarely,Rarely,Very frequently,Never,Sometimes,Rarely,Rarely,Never,Rarely,Rarely,Very frequently,Rarely,Sometimes,Very frequently,Rarely,5.0,6.0,6.0,1.0,Improve,I understand.
549,9/3/2022 16:00:31,18.0,Spotify,2.0,No,Yes,No,Rock,Yes,No,115.0,Never,Never,Never,Never,Never,Rarely,Never,Never,Never,Never,Sometimes,Sometimes,Rarely,Rarely,Very frequently,Never,8.0,3.0,1.0,3.0,Improve,I understand.
199,8/28/2022 20:57:57,28.0,YouTube Music,2.0,Yes,No,No,Rock,Yes,No,172.0,Never,Rarely,Rarely,Rarely,Never,Sometimes,Never,Rarely,Never,Sometimes,Rarely,Sometimes,Very frequently,Rarely,Very frequently,Rarely,10.0,8.0,6.0,2.0,Improve,I understand.
264,8/29/2022 0:39:59,19.0,Spotify,2.0,Yes,No,No,Country,Yes,No,125.0,Rarely,Very frequently,Rarely,Sometimes,Rarely,Never,Never,Never,Rarely,Rarely,Sometimes,Rarely,Never,Never,Sometimes,Never,7.0,2.0,0.0,7.0,No effect,I understand.


In [9]:
survey_df['Frequency [Classical]'].unique()

array(['Rarely', 'Sometimes', 'Never', 'Very frequently'], dtype=object)

We can further characterize our dataset using `info`.


In [6]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 736 entries, 0 to 735
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Timestamp                     736 non-null    object 
 1   Age                           735 non-null    float64
 2   Primary streaming service     735 non-null    object 
 3   Hours per day                 736 non-null    float64
 4   While working                 733 non-null    object 
 5   Instrumentalist               732 non-null    object 
 6   Composer                      735 non-null    object 
 7   Fav genre                     736 non-null    object 
 8   Exploratory                   736 non-null    object 
 9   Foreign languages             732 non-null    object 
 10  BPM                           629 non-null    float64
 11  Frequency [Classical]         736 non-null    object 
 12  Frequency [Country]           736 non-null    object 
 13  Frequ

This method provides a concise summary of our table. Here's a breakdown of the information it offers:

The index comprises 736 entries, ranging from 0 to 735.

A brief description of each column is printed, including its non-null count and data type (dtype).

- **Non-null count** indicates the number of non-missing values in each column. Comparing it with the number of entries shows how complete each column is: here, `BPM` has only 629 non-null values out of 736, while the other columns shown are missing at most four.
- **dtype** denotes the *type of data* present in each column. A list of the most common data types is provided below. Ensuring accurate data types enables efficient operations and allows us to leverage pre-coded manipulations for data analysis.
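
The non-null counts reported by `info` can also be computed directly, which is convenient when we want to sort or filter them. The following is a small sketch on a made-up table; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy table with a few missing entries (hypothetical values)
toy_df = pd.DataFrame({
    "Age": [18.0, np.nan, 21.0, 35.0],
    "BPM": [120.0, np.nan, np.nan, 98.0],
    "Fav genre": ["Rock", "Pop", "Jazz", "Rock"],
})

missing_per_column = toy_df.isna().sum()     # number of missing values per column
non_null_per_column = toy_df.notna().sum()   # the figure that `info` reports
missing_share = toy_df.isna().mean()         # fraction of rows missing per column

print(missing_per_column)
```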

| Pandas dtype      | What It Does                                        | Common Operations                                    | Advantages                                                           | Disadvantages                                                                   |
|-------------------|-----------------------------------------------------|------------------------------------------------------|----------------------------------------------------------------------|---------------------------------------------------------------------------------|
| `object`          | Stores arbitrary Python objects, most often strings | String operations, `.str` accessor                   | Can hold mixed types                                                 | Slow and memory-hungry, not suitable for numerical ops                          |
| `int64`           | Integer numbers                                     | Mathematical operations                              | Exact integer arithmetic, fast operations                            | Cannot represent `NaN`; the nullable `Int64` dtype is needed for missing values |
| `float64`         | Floating-point numbers                              | Mathematical operations                              | Can represent decimals and `NaN`                                     | Subject to floating-point rounding error                                        |
| `bool`            | Boolean values (True/False)                         | Logical operations                                   | Very memory efficient                                                | Limited to boolean logic                                                        |
| `datetime64[ns]`  | Date and time values                                | Date/time operations, `.dt` accessor                 | Nanosecond precision; time zones supported via `datetime64[ns, tz]`  | More memory than simpler types                                                  |
| `timedelta64[ns]` | Differences between time points                     | Time deltas, `.dt` accessor                          | Precise calculations of time spans                                   | More complex to understand and manipulate                                       |
| `category`        | Finite set of repeated values, usually text         | Statistical ops on categorical data, `.cat` accessor | Memory efficient for small sets of unique values                     | Not suitable for large number of unique values                                  |
| `complex128`      | Complex numbers with real and imaginary parts       | Mathematical operations on complex numbers           | Can represent complex numbers                                        | Rarely used, not supported by all libraries                                     |
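
The interaction between integer columns and missing values is a frequent source of surprise, so here is a short sketch of it. This also explains why a column such as `Age` can show up as `float64` even though ages are whole numbers. The series below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Integers without missing values are stored as int64
complete = pd.Series([1, 2, 3])

# A missing value makes pandas fall back to float64
with_gap = pd.Series([1, 2, np.nan])

# The nullable extension dtype "Int64" (capital I) keeps integers and marks the gap as <NA>
nullable = pd.Series([1, 2, None], dtype="Int64")

print(complete.dtype, with_gap.dtype, nullable.dtype)
```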

As the `info` output shows, our column `Timestamp` has a dtype of `object`. This means that its values are strings that happen to contain timestamps. Storing date information this way makes it awkward to work with, as we can observe in the next cell.

In [7]:
# Example of how difficult it is to extract the month from a string column: 
survey_df['Timestamp'].str.split('/').str.get(0)

0       8
1       8
2       8
3       8
4       8
       ..
731    10
732    11
733    11
734    11
735    11
Name: Timestamp, Length: 736, dtype: object

This is not to mention that dates can come in various formats, such as `2023-09-21`, for which the previous expression would return the whole string instead of the month.

Now, we transform the column into a datetime type and retrieve the month in a simpler way. The conversion still has to cope with differently formatted dates, but that problem is then solved once, in one place. Converting is almost always worthwhile: the time spent preprocessing the column is usually much smaller than the time the datetime type saves us later.
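
One way to handle mixed formats explicitly is sketched below, under the assumption that the expected layout is month/day/year followed by a 24-hour time, as in this survey. The `format` and `errors` parameters belong to `pd.to_datetime`; the sample strings are illustrative.

```python
import pandas as pd

# Two strings in the survey's month/day/year layout, plus one in another layout
raw = pd.Series(["8/27/2022 19:29:02", "11/9/2022 1:55:20", "2023-09-21"])

# An explicit format documents the expected layout;
# errors="coerce" turns anything that does not match into NaT instead of raising
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S", errors="coerce")

print(parsed.dt.month)   # month as a number, NaN where parsing failed
print(parsed.isna().sum(), "value(s) could not be parsed")
```

Counting the `NaT` values after the conversion is a cheap way to find out whether a column contains entries in an unexpected format.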

In [8]:
# Compared to using a datetime datatype
pd.to_datetime(survey_df['Timestamp']).dt.month

0       8
1       8
2       8
3       8
4       8
       ..
731    10
732    11
733    11
734    11
735    11
Name: Timestamp, Length: 736, dtype: int64

# Inplace and out-of-place operators.

Most pandas methods do not modify the series or dataframe they are called on; they return a modified copy and leave the original untouched. Some methods accept an `inplace` argument (a boolean flag) that updates the object directly. Alternatively, you can assign the result back to the original name or column to keep the change.

Earlier, we converted `Timestamp` to datetime with `pd.to_datetime`, but we never stored the result. When we check the column's dtype again, we find it is still `object`.
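
The three behaviors can be sketched on a small made-up series, using `sort_values` as an example of a method that accepts `inplace`:

```python
import pandas as pd

scores = pd.Series([3, 1, 2], name="score")

# 1) The method returns a new object; the original is untouched
sorted_copy = scores.sort_values()
print(scores.tolist())             # [3, 1, 2]

# 2) Assigning the result to a name keeps the change
kept = scores.sort_values()
print(kept.tolist())               # [1, 2, 3]

# 3) inplace=True modifies the object itself and returns None
in_place = scores.copy()
result = in_place.sort_values(inplace=True)
print(result, in_place.tolist())   # None [1, 2, 3]
```

Reassignment is generally the easier pattern to read and to chain, which is why the rest of this notebook uses it.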


In [9]:
# We can check the datatype of a column by just calling its values. 
# At the bottom we would find its dtype
survey_df['Timestamp']

0       8/27/2022 19:29:02
1       8/27/2022 19:57:31
2       8/27/2022 21:28:18
3       8/27/2022 21:40:40
4       8/27/2022 21:54:47
              ...         
731    10/30/2022 14:37:28
732     11/1/2022 22:26:42
733     11/3/2022 23:24:38
734     11/4/2022 17:31:47
735      11/9/2022 1:55:20
Name: Timestamp, Length: 736, dtype: object

In [10]:
# We can also find its dtype directly by accessing the dataframe's property
survey_df.dtypes


Timestamp                        object
Age                             float64
Primary streaming service        object
Hours per day                   float64
While working                    object
Instrumentalist                  object
Composer                         object
Fav genre                        object
Exploratory                      object
Foreign languages                object
BPM                             float64
Frequency [Classical]            object
Frequency [Country]              object
Frequency [EDM]                  object
Frequency [Folk]                 object
Frequency [Gospel]               object
Frequency [Hip hop]              object
Frequency [Jazz]                 object
Frequency [K pop]                object
Frequency [Latin]                object
Frequency [Lofi]                 object
Frequency [Metal]                object
Frequency [Pop]                  object
Frequency [R&B]                  object
Frequency [Rap]                  object


To make the change persistent, we need to assign the converted column back to the dataframe.

In [11]:
survey_df['Timestamp'] = pd.to_datetime(survey_df['Timestamp'])

survey_df.dtypes

Timestamp                       datetime64[ns]
Age                                    float64
Primary streaming service               object
Hours per day                          float64
While working                           object
Instrumentalist                         object
Composer                                object
Fav genre                               object
Exploratory                             object
Foreign languages                       object
BPM                                    float64
Frequency [Classical]                   object
Frequency [Country]                     object
Frequency [EDM]                         object
Frequency [Folk]                        object
Frequency [Gospel]                      object
Frequency [Hip hop]                     object
Frequency [Jazz]                        object
Frequency [K pop]                       object
Frequency [Latin]                       object
Frequency [Lofi]                        object
Frequency [Me

# The importance of an index:
- Allows for fast lookups
- Convenient data alignment
- Meaningful labeling

If there is a column that has unique values, it is convenient to use it as the index so that each row is uniquely identified. We can see the number of unique values in each column using `nunique`.
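
As a rough sketch of this check, the snippet below lists the columns that have as many distinct values as the table has rows. The table and its `respondent_id` column are hypothetical; the survey itself has no such column.

```python
import pandas as pd

# Toy table (hypothetical values): "respondent_id" is unique, the other columns are not
toy_df = pd.DataFrame({
    "respondent_id": [101, 102, 103, 104],
    "Fav genre": ["Rock", "Pop", "Rock", "Jazz"],
    "Age": [18, 21, 21, 35],
})

# A column is an index candidate when it has as many distinct values as there are rows
distinct_counts = toy_df.nunique()
candidates = distinct_counts[distinct_counts == len(toy_df)].index.tolist()

print(candidates)   # ['respondent_id']
```

Note that `nunique` ignores missing values by default, so a column with gaps can never pass this test.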


In [12]:
len(survey_df)

736

In [13]:
survey_df.nunique().sort_values()

Permissions                       1
Foreign languages                 2
Composer                          2
Instrumentalist                   2
Exploratory                       2
While working                     2
Music effects                     3
Frequency [Rap]                   4
Frequency [Rock]                  4
Frequency [R&B]                   4
Frequency [Jazz]                  4
Frequency [Video game music]      4
Frequency [Pop]                   4
Frequency [Metal]                 4
Frequency [Lofi]                  4
Frequency [Latin]                 4
Frequency [K pop]                 4
Frequency [Hip hop]               4
Frequency [Folk]                  4
Frequency [EDM]                   4
Frequency [Country]               4
Frequency [Classical]             4
Frequency [Gospel]                4
Primary streaming service         6
Anxiety                          12
Depression                       12
Insomnia                         12
OCD                         

In [14]:
survey_df['Timestamp'].value_counts()

2022-08-28 16:15:08    2
2022-08-27 19:29:02    1
2022-09-01 21:07:33    1
2022-09-01 19:09:32    1
2022-09-01 19:36:54    1
                      ..
2022-08-28 23:34:19    1
2022-08-28 23:34:37    1
2022-08-28 23:40:54    1
2022-08-28 23:42:24    1
2022-11-09 01:55:20    1
Name: Timestamp, Length: 735, dtype: int64

`Timestamp` comes closest to being a unique key, with 735 distinct values across 736 rows, but one value repeats, meaning two responses were recorded in the same second. A good index makes data analysis faster and more robust. Key characteristics of a good index:

- Uniqueness: Each record is uniquely identified, avoiding ambiguity.
- Immutability: Values in the index column remain constant, ensuring data integrity.
- Sequential order: A natural order improves query performance for retrievals and displays.
- Relevance: The index should be pertinent to frequent analyses, allowing quick access to crucial data.
- Minimal size: Compact indexes conserve memory and accelerate data operations.
- Performance: Optimize the index for common use cases, enhancing database or DataFrame performance.
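
The uniqueness requirement can be enforced at the moment the index is set. The sketch below assumes a hypothetical `respondent_id` column; `verify_integrity` is a parameter of `DataFrame.set_index`.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "respondent_id": [101, 102, 103],
    "Fav genre": ["Rock", "Pop", "Jazz"],
})

# verify_integrity=True raises a ValueError if the new index contains duplicates
indexed = toy_df.set_index("respondent_id", verify_integrity=True)

print(indexed.index.is_unique)         # True
print(indexed.loc[102, "Fav genre"])   # label-based lookup -> Pop
```

With a unique index in place, `.loc` returns exactly one row per label, which removes a whole class of ambiguity from later joins and lookups.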


Since we have numeric columns, the `describe` method gives us a quick snapshot of their contents.

In [18]:
survey_df.describe()

Unnamed: 0,Age,Hours per day,BPM,Anxiety,Depression,Insomnia,OCD
count,735.0,736.0,629.0,736.0,736.0,736.0,736.0
mean,25.206803,3.572758,1589948.0,5.837636,4.796196,3.738451,2.637228
std,12.05497,3.028199,39872610.0,2.793054,3.02887,3.088689,2.842017
min,10.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,18.0,2.0,100.0,4.0,2.0,1.0,0.0
50%,21.0,3.0,120.0,6.0,5.0,3.0,2.0
75%,28.0,5.0,144.0,8.0,7.0,6.0,5.0
max,89.0,24.0,1000000000.0,10.0,10.0,10.0,10.0


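Here we can see that the mean age is about 25 while the median is 21, indicating a right-skewed distribution. The `BPM` column has a maximum of 1,000,000,000, an obviously invalid entry that inflates its mean and standard deviation, so its median (120) is the more trustworthy summary. We also observe that anxiety has the highest mean score (5.84) of the four conditions, which may be an indication of where to look for interesting insights.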
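
The `BPM` column in the output above shows how strongly a single extreme value can distort the mean. The sketch below reproduces the effect on made-up tempo values; the 300 BPM cutoff is an assumption chosen purely for illustration, not a rule used in this analysis.

```python
import pandas as pd

# Hypothetical tempo values; the last entry mimics a data-entry error
bpm = pd.Series([100.0, 120.0, 144.0, 1_000_000_000.0])

print(bpm.mean())     # dominated by the single extreme value
print(bpm.median())   # 132.0, barely affected

# One simple option: treat implausible tempos as missing before summarizing
plausible = bpm.where(bpm <= 300)
print(plausible.mean())
```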

# Changing dtypes

All the `Frequency [...]` columns have dtype `object`, meaning their values are stored as strings. However, this data is clearly ordinal: `Never` is less than `Rarely`, which is less than `Sometimes`, and so on. We can replace these values with numbers to better reflect the type of data we have.
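
Replacing the labels with numbers is the approach this notebook takes. As an aside, pandas can also express the ordering without leaving the text labels behind, through an ordered categorical dtype. The sketch below shows that alternative on a few made-up answers; it is not what the following cells do.

```python
import pandas as pd

# Explicit order of the four survey answers, from least to most frequent
levels = ["Never", "Rarely", "Sometimes", "Very frequently"]
freq_dtype = pd.CategoricalDtype(categories=levels, ordered=True)

answers = pd.Series(["Rarely", "Never", "Very frequently", "Sometimes"]).astype(freq_dtype)

print(answers < "Sometimes")        # comparisons respect the declared order
print(answers.cat.codes.tolist())   # [1, 0, 3, 2]
print(answers.max())                # Very frequently
```

The integer codes match the 0 to 3 scale, so both representations carry the same ordering information.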


In [19]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 736 entries, 0 to 735
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Timestamp                     736 non-null    datetime64[ns]
 1   Age                           735 non-null    float64       
 2   Primary streaming service     735 non-null    object        
 3   Hours per day                 736 non-null    float64       
 4   While working                 733 non-null    object        
 5   Instrumentalist               732 non-null    object        
 6   Composer                      735 non-null    object        
 7   Fav genre                     736 non-null    object        
 8   Exploratory                   736 non-null    object        
 9   Foreign languages             732 non-null    object        
 10  BPM                           629 non-null    float64       
 11  Frequency [Classical]         73

First, we select all the columns whose name starts with the word `Frequency` by using the `.str` accessor of the column index. String methods such as `startswith` are available here because the column labels are strings, which again highlights the importance of having the correct dtype.

In [20]:
survey_df.loc[:,survey_df.columns.str.startswith('Frequency')]

Unnamed: 0,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music]
0,Rarely,Never,Rarely,Never,Never,Sometimes,Never,Very frequently,Very frequently,Rarely,Never,Very frequently,Sometimes,Very frequently,Never,Sometimes
1,Sometimes,Never,Never,Rarely,Sometimes,Rarely,Very frequently,Rarely,Sometimes,Rarely,Never,Sometimes,Sometimes,Rarely,Very frequently,Rarely
2,Never,Never,Very frequently,Never,Never,Rarely,Rarely,Very frequently,Never,Sometimes,Sometimes,Rarely,Never,Rarely,Rarely,Very frequently
3,Sometimes,Never,Never,Rarely,Sometimes,Never,Very frequently,Sometimes,Very frequently,Sometimes,Never,Sometimes,Sometimes,Never,Never,Never
4,Never,Never,Rarely,Never,Rarely,Very frequently,Never,Very frequently,Sometimes,Sometimes,Never,Sometimes,Very frequently,Very frequently,Never,Rarely
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
731,Very frequently,Rarely,Never,Sometimes,Never,Sometimes,Rarely,Never,Sometimes,Rarely,Rarely,Very frequently,Never,Rarely,Very frequently,Never
732,Rarely,Rarely,Never,Never,Never,Never,Rarely,Never,Never,Rarely,Never,Very frequently,Never,Never,Sometimes,Sometimes
733,Rarely,Sometimes,Sometimes,Rarely,Rarely,Very frequently,Rarely,Rarely,Rarely,Sometimes,Rarely,Sometimes,Sometimes,Sometimes,Rarely,Rarely
734,Very frequently,Never,Never,Never,Never,Never,Rarely,Never,Never,Never,Never,Never,Never,Never,Never,Sometimes
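
The same selection can also be written with `DataFrame.filter`, which matches column labels against a regular expression. The sketch below compares both approaches on a made-up table with two frequency columns.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Age": [18, 21],
    "Frequency [Rock]": ["Sometimes", "Never"],
    "Frequency [Pop]": ["Rarely", "Very frequently"],
})

# Boolean mask over the column labels, as in the cell above
mask = toy_df.columns.str.startswith("Frequency")
by_mask = toy_df.loc[:, mask]

# Equivalent selection with a regular expression anchored at the start of the label
by_filter = toy_df.filter(regex=r"^Frequency")

print(by_filter.columns.tolist())
```

The boolean mask has the advantage that it can be stored and reused for assignment, which is what the next cells rely on.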


Next, we store the boolean mask in `freq_columns` so that we can reuse it. Once we know which distinct values these columns contain, we can replace them using a dictionary and the `replace` method.

In [21]:
freq_columns=survey_df.columns.str.startswith('Frequency')
freq_columns

array([False, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False, False, False])

All of the frequency columns share the same four values: `Never`, `Rarely`, `Sometimes`, and `Very frequently`. Now, we replace them using a dictionary. We use a step of one between consecutive levels, but the spacing could be larger.
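
Before applying a mapping like this, it can help to know what happens to labels the dictionary does not cover. The sketch below uses the same mapping on a made-up series that contains one unexpected answer, `Often`, and contrasts `replace` with `Series.map`.

```python
import pandas as pd

dict_replace = {"Never": 0, "Rarely": 1, "Sometimes": 2, "Very frequently": 3}

# Hypothetical answers; "Often" is deliberately not part of the mapping
answers = pd.Series(["Never", "Often", "Sometimes"])

replaced = answers.replace(dict_replace)   # unknown labels are left unchanged
mapped = answers.map(dict_replace)         # unknown labels become NaN

print(replaced.tolist())
print(mapped.tolist())
```

With `replace`, a typo in the data silently survives as a string; with `map`, it shows up as a missing value that `isna` can count.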

In [22]:
dict_replace={'Never':0,'Rarely':1,'Sometimes':2,'Very frequently':3}

survey_df.loc[:,freq_columns].replace(dict_replace)

Unnamed: 0,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music]
0,1,0,1,0,0,2,0,3,3,1,0,3,2,3,0,2
1,2,0,0,1,2,1,3,1,2,1,0,2,2,1,3,1
2,0,0,3,0,0,1,1,3,0,2,2,1,0,1,1,3
3,2,0,0,1,2,0,3,2,3,2,0,2,2,0,0,0
4,0,0,1,0,1,3,0,3,2,2,0,2,3,3,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
731,3,1,0,2,0,2,1,0,2,1,1,3,0,1,3,0
732,1,1,0,0,0,0,1,0,0,1,0,3,0,0,2,2
733,1,2,2,1,1,3,1,1,1,2,1,2,2,2,1,1
734,3,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2


This may seem like we're done, but we have to make this change persistent by assigning it to the original dataframe.

In [23]:
survey_df.loc[:,freq_columns]=survey_df.loc[:,freq_columns].replace(dict_replace)

In [24]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 736 entries, 0 to 735
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Timestamp                     736 non-null    datetime64[ns]
 1   Age                           735 non-null    float64       
 2   Primary streaming service     735 non-null    object        
 3   Hours per day                 736 non-null    float64       
 4   While working                 733 non-null    object        
 5   Instrumentalist               732 non-null    object        
 6   Composer                      735 non-null    object        
 7   Fav genre                     736 non-null    object        
 8   Exploratory                   736 non-null    object        
 9   Foreign languages             732 non-null    object        
 10  BPM                           629 non-null    float64       
 11  Frequency [Classical]         73

In [25]:
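# Option 1: If the Frequency columns are not yet floats, convert them explicitly.
# Assigning by column labels replaces the columns, so the float dtype is kept.
freq_names = survey_df.columns[freq_columns]
survey_df[freq_names] = survey_df[freq_names].astype(float)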

Now we can see a snapshot of the listening frequency per genre using `describe`.

In [26]:
survey_df.loc[:,freq_columns].describe()

Unnamed: 0,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music]
count,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0,736.0
mean,1.335598,0.817935,1.023098,1.012228,0.381793,1.384511,1.027174,0.735054,0.607337,1.067935,1.220109,2.03125,1.259511,1.335598,2.070652,1.25
std,0.988442,0.922584,1.048878,1.009405,0.70152,1.031598,0.938559,1.002945,0.864666,1.027912,1.134698,0.934801,1.058451,1.053732,1.034327,1.071587
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
50%,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0
75%,2.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0,3.0,2.0,2.0,3.0,2.0
max,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0


In [1]:
# If it hasn't been converted we will go through it soon, don't worry :)

Gospel and Latin have the lowest mean listening frequency (0.38 and 0.61), while Rock, Pop, and Hip hop have the highest (2.07, 2.03, and 1.38).

There are still more transformations we can make to the dataset. For example, we can convert columns with only a few distinct values to the `category` dtype, which restricts them to a fixed set of values instead of accepting any input, as their `object` dtype would allow.

This change also makes the dataframe take up less space in memory, as we will see in a before-and-after.
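
The size of the saving can be estimated with a small sketch. The series below is made up: three hypothetical answers repeated many times, stored first as `object` and then as `category`. `memory_usage(deep=True)` counts the bytes used by the strings themselves, not just the pointers to them.

```python
import pandas as pd

# Hypothetical column with only three distinct answers repeated many times
effects = pd.Series(["Improve", "No effect", "Worsen"] * 1000, dtype=object)

as_object = effects.memory_usage(deep=True)
as_category = effects.astype("category").memory_usage(deep=True)

print(f"object:   {as_object} bytes")
print(f"category: {as_category} bytes")
```

The categorical version stores each distinct label once plus a small integer code per row, so the saving grows with the number of rows.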