![DataDunkers.ca Banner](https://github.com/Data-Dunkers/lessons/blob/main/images/top-banner.jpg?raw=true)

# Music

<table><tr>
<td style="font-size:8px;"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Treble_a.svg/1920px-Treble_a.svg.png" alt="Musical Staff" style="width: 350px;"/><br><a href="https://en.wikipedia.org/wiki/Musical_note#/media/File:Treble_a.svg">By Dbolton - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=17813642</a></td>
<td> <img src="https://storage.googleapis.com/pr-newsroom-wp/1/2018/11/Spotify_Logo_CMYK_Green.png" alt="Spotify Logo" style="width: 650px;"/> </td>
</tr></table>

Music is an art form loved by people around the world, and it has long been an important part of people's lives.

On a regular day, you might be listening to your favourite artist or trying to play your favourite songs. In this hackathon notebook, let's find out more about the most popular songs and what they have in common. Hopefully you will find some interesting insights that would be difficult to spot otherwise, while learning some new coding skills.

[Spotify](https://en.wikipedia.org/wiki/Spotify), an audio streaming platform, has a huge database of songs and information about them.

### Importing Libraries

Run the cell below to import the required Python libraries and load a dataset of about 40,000 songs that has been [exported from Spotify](https://developer.spotify.com/documentation/web-api).

In [None]:
import pandas as pd
import plotly.express as px
from IPython.display import YouTubeVideo

music = pd.read_csv('music-data.csv')
music

### Data Columns

Let's have a look at the columns in our data set.

In [None]:
for c in music.columns:
    print(c)

Now you know which columns are in the dataset, but what do those columns refer to?

**Danceability**: How suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable.

**Energy**: A perceptual measure of intensity and activity that ranges from 0.0 to 1.0. Typically, energetic tracks feel fast, loud, and noisy.

**Key**: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.

**Loudness**: The average loudness of a track in decibels (dB). Values typically range between -60 and 0 dB.

**Mode**: The modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

**Speechiness**: Indicates the presence of spoken words in a track. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, while values below 0.33 most likely represent music and other non-speech-like tracks.

**Acousticness**: A confidence measure indicating whether the track is acoustic. A value of 1.0 represents the highest confidence.

**Instrumentalness**: Predicts whether a track contains no vocals. The closer the value is to 1.0, the greater likelihood the track contains no vocal content.

**Liveness**: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.

**Valence**: A measure to describe the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

**Tempo**: The overall estimated tempo (speed or pace) of a track in beats per minute (BPM).

**duration_ms**: The duration of the track in milliseconds.

**time_signature**: An estimated overall time signature of a track. The time signature is a notational convention to specify how many beats are in each bar (or measure).
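
The integer codes for `key` and `mode` described above are easier to read once they are translated into names. The cell below is a minimal sketch of that translation, assuming the usual twelve pitch classes from 0 (C) to 11 (B). The `sample` rows and the `key_name` and `mode_name` columns are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Pitch Class notation: index 0 is C, 1 is C♯/D♭, and so on up to 11 (B).
PITCH_NAMES = ['C', 'C♯/D♭', 'D', 'D♯/E♭', 'E', 'F',
               'F♯/G♭', 'G', 'G♯/A♭', 'A', 'A♯/B♭', 'B']

# Made-up sample rows for illustration, not real tracks from the dataset.
sample = pd.DataFrame({'key': [0, 1, 9], 'mode': [1, 0, 1]})

# Translate the integer codes into readable labels.
# Any key value outside 0 to 11 would become a missing value.
sample['key_name'] = sample['key'].map(dict(enumerate(PITCH_NAMES)))
sample['mode_name'] = sample['mode'].map({1: 'major', 0: 'minor'})
sample
```

The same two `map` lines should work on the `music` dataframe if you want readable labels in your hover text.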

## Data Cleaning

### Adding New Columns

We can add a new column to show the duration of the track in seconds instead of milliseconds.

In [None]:
music['duration_s'] = music['duration_ms']/1000
music

We can also add a column of links to the tracks.

In [None]:
music['link'] = 'https://open.spotify.com/track/' + music['track_id']
music

Looking at the `release_date` column, we can see that for some songs it is just the year and for some it is a [standard date](https://www.iso.org/iso-8601-date-and-time-format.html). Let's create a new column called `release_year` that is just the first four characters of the `release_date`.

In [None]:
music['release_year'] = music['release_date'].str[:4].astype(int)
music
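
If a dataset has missing or unusual values in `release_date`, converting straight to `int` can raise an error. The cell below is a sketch of a more forgiving conversion, assuming that unreadable years should simply become missing values. The `dates` dataframe is made up for illustration.

```python
import pandas as pd

# Made-up release dates: one full date, one year only, and one missing value.
dates = pd.DataFrame({'release_date': ['1999-05-17', '1975', None]})

# errors='coerce' turns anything that is not a number into NaN
# instead of stopping with an error.
years = pd.to_numeric(dates['release_date'].str[:4], errors='coerce')

# 'Int64' (with a capital I) is the pandas integer type that allows missing values.
dates['release_year'] = years.astype('Int64')
dates
```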

## Analysis

### Song Duration

Let's visualize the song lengths over the years to see if there is anything strange in our dataset.

In [None]:
px.scatter(music, x='release_year', y='duration_s', title='Song Duration Over Time', hover_data=['artist', 'track', 'link'])

We may want to eliminate some of the outliers, for example songs released before 1950.

In [None]:
new_music = music[music['release_year'] >= 1950]  # keep songs released in 1950 or later
px.scatter(new_music, x='release_year', y='duration_s', title='Song Duration Over Time', hover_data=['artist', 'track', 'link'])

Or only look at songs from the 1990s with a duration less than 10 minutes.

In [None]:
short_90s_music = music[(music['release_year']>1989) & (music['release_year']<2000) & (music['duration_s']<10*60)]
px.scatter(short_90s_music, x='release_year', y='duration_s', title='Song Duration Over Time', hover_data=['artist', 'track', 'link'])
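
Long filter conditions can be easier to read when each condition gets its own name. The cell below sketches one possible way to write the 1990s filter using the pandas `between` method, which includes both end values by default. The `sample` data is made up for illustration.

```python
import pandas as pd

# Made-up sample rows for illustration, not real tracks from the dataset.
sample = pd.DataFrame({
    'release_year': [1989, 1990, 1995, 1999, 2000],
    'duration_s': [200, 700, 240, 599, 180],
})

# between() keeps both end values, so this covers 1990 through 1999.
in_90s = sample['release_year'].between(1990, 1999)

# Ten minutes expressed in seconds.
is_short = sample['duration_s'] < 10 * 60

# Combine the two conditions with & to keep rows where both are true.
short_90s = sample[in_90s & is_short]
short_90s
```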

We can also see if the average `danceability` has changed over time in our dataset.

In [None]:
average_danceability = music.groupby('release_year')['danceability'].mean()
px.line(average_danceability, title='Average Danceability Over Time')

Of course, for years where there are only a few songs in our dataset, the average will not be a reliable value.
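
One way to check how many songs are behind each yearly average is to calculate the count alongside the mean, and then keep only the years with enough songs. The cell below is a sketch of that idea. The `sample` data is made up, and the `MIN_SONGS` threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Made-up sample rows for illustration, not real tracks from the dataset.
sample = pd.DataFrame({
    'release_year': [1960, 2000, 2000, 2000, 2001, 2001],
    'danceability': [0.9, 0.5, 0.6, 0.7, 0.4, 0.6],
})

# Calculate both the average and the number of songs for each year.
by_year = sample.groupby('release_year')['danceability'].agg(['mean', 'count'])

# Arbitrary threshold for this example; a real dataset would need a larger one.
MIN_SONGS = 2

# Keep only the years that have at least MIN_SONGS songs.
reliable = by_year[by_year['count'] >= MIN_SONGS]
reliable
```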

Let's look instead at another dataset containing the top 50 tracks from 2010 to 2019.

In [None]:
music2 = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/hackathon/spotify-top-50-from-2010-2019.csv')
px.line(music2.groupby('year')['danceability'].mean(), title='Top 50 Average Danceability Over Time')

Let's see if there is a relationship between `energy` and `danceability` in either dataset.

In [None]:
px.scatter(music, x='energy', y='danceability', title='Energy vs Danceability', hover_data=['artist', 'track'])

In [None]:
px.scatter(music2, x='energy', y='danceability', title='Energy vs Danceability for Top 50', hover_data=['artist', 'title'])
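
A scatter plot shows a relationship visually, and a correlation coefficient can summarize it as a single number between -1 and 1. The cell below is a sketch using the pandas `corr` method, which calculates the Pearson correlation by default. The `sample` values are made up for illustration and say nothing about the real datasets.

```python
import pandas as pd

# Made-up sample values for illustration, not real tracks from the dataset.
sample = pd.DataFrame({
    'energy': [0.2, 0.4, 0.6, 0.8],
    'danceability': [0.3, 0.5, 0.4, 0.7],
})

# Values near 1 or -1 suggest a strong linear relationship,
# and values near 0 suggest little or no linear relationship.
r = sample['energy'].corr(sample['danceability'])
print(round(r, 2))
```

The same call should work with `music['energy'].corr(music['danceability'])` to compare the two datasets.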

You can also explore and visualize other song features from the datasets.

In [None]:
music.columns

It is also possible to embed a YouTube video in a notebook using the video ID from the link. For example, if the video is at `https://www.youtube.com/watch?v=dQw4w9WgXcQ` then we can use the code below to display it.

In [None]:
YouTubeVideo('dQw4w9WgXcQ')
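
Instead of copying the video ID by hand, we could extract it from the link. The cell below is a sketch that uses the Python standard library and assumes the link has the `watch?v=` form shown above. The function name `youtube_video_id` is made up for this example.

```python
from urllib.parse import urlparse, parse_qs

def youtube_video_id(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    # parse_qs turns 'v=abc&t=10' into {'v': ['abc'], 't': ['10']}.
    query = parse_qs(urlparse(url).query)
    values = query.get('v')
    return values[0] if values else None

video_id = youtube_video_id('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
print(video_id)
```

The result can then be passed to `YouTubeVideo(video_id)`.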

Check out the [next notebook](music-challenge.ipynb) to continue your own analysis.

[![Data Dunkers License](https://github.com/Data-Dunkers/lessons/blob/main/images/bottom-banner.jpg?raw=true)](https://github.com/Data-Dunkers/lessons/blob/main/LICENSE.md)