# Introduction to the data frame - Music Therapy

The post-pandemic era presents unprecedented challenges for individuals managing their mental health. Studies indicate that music has played a significant role in improving both physical and mental well-being (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8566759/). I am eager to explore the depth of music's influence on mental health and extract valuable insights from this study.

We want to identify what, if any, correlations exist between an individual's music taste and their self-reported mental health. Ideally, these findings could contribute to a more informed application of music therapy (MT) or simply provide interesting insights about the mind.

In [None]:
#pip install numpy
#pip install pandas
#pip install matplotlib

In [None]:
%pip install matplotlib

In [3]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")  # This is to ignore any warnings that might pop up during execution

In [None]:
# Basic libraries to manipulate data
import matplotlib.pyplot as plt  # Matplotlib for data visualization
import numpy as np  # Numpy for numerical computations
import pandas as pd  # Pandas for data manipulation

In [None]:
np.random.seed(42)  # To ensure all the probabilistic things are reproducible
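
As a minimal sketch of what seeding buys us: re-seeding the global generator restarts the same pseudo-random sequence, so two draws made after the same seed are identical. The variable names below are illustrative only. NumPy also offers a local generator through `np.random.default_rng`, shown here as an alternative that leaves the global state untouched.

```python
import numpy as np

# Legacy global seeding, as used in this notebook
np.random.seed(42)
first = np.random.rand(3)

# Re-seeding with the same value restarts the same sequence
np.random.seed(42)
second = np.random.rand(3)

print(np.array_equal(first, second))  # the two draws are identical

# Alternative: a local generator that does not touch the global state
rng = np.random.default_rng(42)
draw = rng.random(3)
print(draw)
```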

## Background on Data

#### i. Dataset
For this analysis, we take the data from the Kaggle website (https://www.kaggle.com/datasets/catherinerasgaitis/mxmh-survey-results).

As per the dataset collector and owner, the survey encompassed the following parameters:
**Block 1: Music genres**
Respondents selected Never, Rarely, Sometimes, or Very frequently to indicate the frequency of their listening to 16 music genres.

**Block 2: Mental health**
Respondents rated Anxiety, Depression, Insomnia, and OCD (Obsessive-compulsive disorder) on a scale of 0 to 10, with 10 representing the highest frequency and severity.

#### ii. Scope & Procedures
This analysis is based on data collected from 736 respondents. The survey, conducted from Aug. 27 to Nov. 9, 2022, was open to participants of any age or location. The questionnaire comprised a mix of open-ended and drop-down questions, designed to elicit relatively concise responses and encourage participation.

# On Relative Paths and Well-Structured Projects
As you'll often collaborate within a team, establishing order from the outset is a beneficial practice. This ensures code portability and facilitates bug fixing.

For reading data from files, it's advisable to use a relative path, i.e., a path that starts from the project directory rather than from the root of your file system. As long as the code is run from the project directory, the same path works on every collaborator's machine.

Throughout this course, each project will have its own directory. At the top level, notebooks will reside, while all datasets will be stored in a directory named "data."

Consequently, we can define the relative path for data and effortlessly read files by appending the correct file name from there.
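
As a small sketch of this idea, the standard-library `pathlib` module can build the file path from the data directory. The layout (a `data` folder next to the notebook) follows the convention described above; the names `DATA_DIR` and `survey_path` are illustrative only.

```python
from pathlib import Path

# Assumed layout: notebooks at the project root, datasets in "data/"
DATA_DIR = Path("data")

# File name used later in this notebook
survey_path = DATA_DIR / "survey_music.csv"

print(survey_path)           # relative path, resolved against the working directory
print(survey_path.exists())  # False if the notebook is run from another directory
```

Checking `exists()` before reading is a cheap way to catch a wrong working directory early.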

If you're using Google Colab, things get a bit trickier. You'll need to upload the data file again in every new session.

Alternatively, you can mount your Google Drive, head to your project directory, and then employ relative paths just like before.

In [4]:
# Mount your drive
# ! This will only work in Google Colab.
from google.colab import drive
drive.mount('/content/drive')

# Navigate to your project's directory
# ! This will vary based on each person's drive structure.
import os
os.chdir('/content/drive/My Drive/.../data/')

# Define the relative path within
data_path = "./data/"

ModuleNotFoundError: No module named 'google'
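
The error above appears because the `google.colab` package only exists inside Colab. One possible way to keep a single notebook working in both environments is to check for the package before importing it. This is a sketch under that assumption; the helper name `running_in_colab` is hypothetical.

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True if the google.colab package can be found."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ImportError:
        # Raised when the parent package "google" is not installed at all
        return False


if running_in_colab():
    from google.colab import drive
    drive.mount("/content/drive")
    # Then change into your own project directory, as in the cell above

# The relative data path is the same in both environments
data_path = "./data/"
```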

# Basic characteristics of the datasets

Now that we've completed the previous steps, let's take a snapshot of the datasets. Understanding the basic characteristics is crucial; without knowing what we have, we can't address the questions we aim to answer.

To kick things off, we'll import our tables into a Pandas dataframe.

In [4]:
# Specify the path to the datasets
data_path = "./data/"

# Specify the filenames of the datasets
survey_filename = "survey_music.csv"

# Read the CSV files and create backup copies
survey_df_data = pd.read_csv(data_path + survey_filename)

# Create working copies of the dataframes for analysis
survey_df = survey_df_data.copy()


Since this table has many rows, printing it in its entirety is impractical. Several methods give a quick glimpse of the contents.

To view the start and end of the table, we can use `head` and `tail`. However, rows are often stored in a meaningful order (for example, all English-language responses first, followed by the others), so the first and last rows may not be representative. The `sample` method returns randomly chosen rows and therefore gives a broader view.
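
The difference can be sketched on a toy table. The data below is made up for illustration: ten rows stored in two ordered chunks. The `random_state` argument is optional and only fixes which rows are drawn.

```python
import pandas as pd

# Toy table (made up for illustration): rows stored in ordered chunks
toy_df = pd.DataFrame({
    "respondent": range(10),
    "language": ["English"] * 5 + ["Other"] * 5,
})

print(toy_df.head(3))   # first rows: only the "English" chunk
print(toy_df.tail(3))   # last rows: only the "Other" chunk
print(toy_df.sample(4, random_state=42))  # random rows, can come from either chunk
```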


In [5]:
# Option to display all columns
pd.set_option('display.max_columns', None)
survey_df.sample(5)

Unnamed: 0,Timestamp,Age,Primary streaming service,Hours per day,While working,Instrumentalist,Composer,Fav genre,Exploratory,Foreign languages,BPM,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects,Permissions
669,9/15/2022 15:30:41,20.0,Spotify,1.0,Yes,No,No,K pop,Yes,Yes,120.0,Sometimes,Never,Rarely,Never,Sometimes,Never,Rarely,Very frequently,Never,Sometimes,Never,Very frequently,Sometimes,Never,Rarely,Never,8.0,7.0,1.0,1.0,No effect,I understand.
33,8/28/2022 10:59:53,17.0,Spotify,4.0,No,No,No,Rock,Yes,Yes,142.0,Rarely,Rarely,Rarely,Very frequently,Never,Sometimes,Rarely,Rarely,Never,Rarely,Rarely,Very frequently,Rarely,Sometimes,Very frequently,Rarely,5.0,6.0,6.0,1.0,Improve,I understand.
549,9/3/2022 16:00:31,18.0,Spotify,2.0,No,Yes,No,Rock,Yes,No,115.0,Never,Never,Never,Never,Never,Rarely,Never,Never,Never,Never,Sometimes,Sometimes,Rarely,Rarely,Very frequently,Never,8.0,3.0,1.0,3.0,Improve,I understand.
199,8/28/2022 20:57:57,28.0,YouTube Music,2.0,Yes,No,No,Rock,Yes,No,172.0,Never,Rarely,Rarely,Rarely,Never,Sometimes,Never,Rarely,Never,Sometimes,Rarely,Sometimes,Very frequently,Rarely,Very frequently,Rarely,10.0,8.0,6.0,2.0,Improve,I understand.
264,8/29/2022 0:39:59,19.0,Spotify,2.0,Yes,No,No,Country,Yes,No,125.0,Rarely,Very frequently,Rarely,Sometimes,Rarely,Never,Never,Never,Rarely,Rarely,Sometimes,Rarely,Never,Never,Sometimes,Never,7.0,2.0,0.0,7.0,No effect,I understand.


We can further characterize our dataset using `info`.


In [6]:
survey_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 736 entries, 0 to 735
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Timestamp                     736 non-null    object 
 1   Age                           735 non-null    float64
 2   Primary streaming service     735 non-null    object 
 3   Hours per day                 736 non-null    float64
 4   While working                 733 non-null    object 
 5   Instrumentalist               732 non-null    object 
 6   Composer                      735 non-null    object 
 7   Fav genre                     736 non-null    object 
 8   Exploratory                   736 non-null    object 
 9   Foreign languages             732 non-null    object 
 10  BPM                           629 non-null    float64
 11  Frequency [Classical]         736 non-null    object 
 12  Frequency [Country]           736 non-null    object 
 13  Frequ

This method provides a concise summary of our table. Here's a breakdown of the information it offers:

The index comprises 736 entries, ranging from 0 to 735.

A brief description of each column is printed, including its non-null count and data type (dtype).

- **Non-null count** indicates the number of non-missing values in each column. Any column whose count is below 736 contains missing values; for example, `BPM` has 629 non-null entries, so 107 values are missing. Knowing where values are missing is crucial for assessing the completeness of our data.
- **dtype** denotes the *type of data* present in each column. The most common dtypes are listed below. Ensuring accurate data types enables efficient operations and allows us to leverage pre-coded manipulations for data analysis.

| Pandas dtype | What It Does | Common Operations | Advantages | Disadvantages |
|--------------|--------------|-------------------|------------|---------------|
| `object` | Stores mixed types, often strings | String operations, `.str` accessor | Can hold mixed types | Inefficient, not suitable for numerical ops |
| `int64` | Integer numbers | Mathematical operations | Memory efficient, fast operations | Cannot represent `NaN`; missing values require the nullable `Int64` dtype |
| `float64` | Floating-point numbers | Mathematical operations | Can represent decimals and `NaN` | Subject to floating-point rounding error |
| `bool` | Boolean values (True/False) | Logical operations | Very memory efficient | Limited to boolean logic |
| `datetime64[ns]` | Date and time values | Date/time operations, `.dt` accessor | Nanosecond precision; time zones via `datetime64[ns, tz]` | Limited to roughly the years 1677 to 2262 at nanosecond resolution |
| `timedelta64[ns]` | Differences between time points | Time deltas, `.dt` accessor | Precise calculations of time spans | More complex to understand and manipulate |
| `category` | Finite list of text values | Statistical ops on categorical data, `.cat` accessor | Memory efficient for small sets of unique values | Not suitable for large number of unique values |
| `complex128` | Complex numbers with real and imaginary parts | Mathematical operations on complex numbers | Can represent complex numbers | Rarely used, not supported by all libraries |
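
To make the non-null counts and dtypes concrete, here is a small sketch on made-up data that mimics two of the survey's columns. The values and variable names are illustrative only and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy columns (made up for illustration), mimicking the survey's structure
toy_df = pd.DataFrame({
    "Fav genre": ["Rock", "Pop", "Rock", "K pop", "Rock", "Pop"],
    "BPM": [120.0, np.nan, 142.0, 115.0, np.nan, 125.0],
})

# Missing values per column: the complement of info()'s non-null count
print(toy_df.isna().sum())

# Repeated text labels are a natural fit for the category dtype
genre_cat = toy_df["Fav genre"].astype("category")
print(genre_cat.cat.categories)

# Whole numbers with gaps: nullable Int64 keeps integers and marks gaps as <NA>
bpm_int = toy_df["BPM"].astype("Int64")
print(bpm_int.dtype)
```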

As the `info` output above shows, our `Timestamp` column has a dtype of `object`. This means that its values are strings that contain timestamps. Storing date information this way is inconvenient and inefficient, as we can observe in the next cell.

In [7]:
# Example of how difficult it is to extract the month from a string column: 
survey_df['Timestamp'].str.split('/').str.get(0)

0       8
1       8
2       8
3       8
4       8
       ..
731    10
732    11
733    11
734    11
735    11
Name: Timestamp, Length: 736, dtype: object

This is not to mention that dates can come in other formats, such as `2023-09-21`, for which the previous expression would return the whole string instead of the month.

Now, we convert the column to a datetime type and retrieve the month in a simpler way. The issue of differently formatted dates does not disappear, since the parser still has to recognize each format, but it is dealt with once, at conversion time. It is almost always better to convert date columns to this dtype, as the time spent on preprocessing is usually much smaller than the time the conversion saves later.

In [8]:
# Compared to using a datetime datatype

pd.to_datetime(survey_df['Timestamp']).dt.month

0       8
1       8
2       8
3       8
4       8
       ..
731    10
732    11
733    11
734    11
735    11
Name: Timestamp, Length: 736, dtype: int64
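
When a column may contain more than one date format, one option is to state the expected format explicitly and let non-matching values become `NaT` instead of raising an error. The sketch below assumes the month/day/year layout seen in the sample rows above; the input strings are a small hand-made series, not the full column.

```python
import pandas as pd

# Sample strings: three in the survey's format, one ISO-style date
raw = pd.Series([
    "8/28/2022 10:59:53",
    "9/15/2022 15:30:41",
    "8/29/2022 0:39:59",
    "2023-09-21",
])

# Explicit format; values that do not match become NaT instead of raising
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S", errors="coerce")

print(parsed.dt.month)      # month numbers, NaN where parsing failed
print(parsed.isna().sum())  # how many values did not match the format
```

Counting the `NaT` values right after conversion shows how many rows need a second look before any date-based analysis.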