Solo project for Data Analytics CodeOp Course. I analysed the sound-related features of Spotify songs, built a model to predict the popularity of a song and made an artist recommender system.

AntoniaVillarino/spotify

The data behind Spotify music

Spotify is an audio streaming service with more than 70 million tracks and over 356 million monthly active users. With these numbers, data analysis is a must! The Spotify databases are enriched with lots of features: popularity, danceability, key... You can see them for yourself in its API.

For my analysis, I used this dataset collected with this API by a Kaggle user, Yamac Eren Ay. There are two main datasets: one for songs and another one for artists (the other datasets are derived from the first ones, by aggregation techniques). The songs dataset contains around 600k songs, released between 1922 and 2020.

  • See the slides of the final presentation.

1. Analysis

How has music taste changed over time?

First, I wanted to see how the Spotify features have changed over the years. To do that, I aggregated the songs by year and took the mean or the mode of each feature (see the code in Spotify_analysis.ipynb, or open this nbviewer link to see the interactive plots).
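The aggregation step can be sketched with pandas as follows. This is only an illustration on a toy table: the column names (`year`, `loudness`, `tempo`, `key`) and the helper `most_common` are assumptions, not necessarily what the notebook uses.

```python
import pandas as pd

# Toy stand-in for the songs dataset; the column names are assumptions.
songs = pd.DataFrame({
    "year": [1980, 1980, 1980, 2020, 2020],
    "loudness": [-12.0, -10.0, -11.0, -6.0, -4.0],
    "tempo": [100.0, 110.0, 120.0, 125.0, 135.0],
    "key": [0, 7, 7, 2, 2],
})


def most_common(values: pd.Series):
    """Return the mode of a column (the smallest value if there is a tie)."""
    return values.mode().iloc[0]


# Mean for continuous features, mode for categorical ones such as the key.
by_year = songs.groupby("year").agg(
    loudness=("loudness", "mean"),
    tempo=("tempo", "mean"),
    key=("key", most_common),
)
print(by_year)
```

Each row of `by_year` is one release year, which is the shape needed for a time series plot.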

These are some of the resulting plots:

Mean loudness over the years

The songs are becoming:

  • longer
  • louder
  • and with faster tempos

Mean valence over the years

The valence of the songs (positiveness, see the Spotify API):

  • had more variance before the ’50s
  • peaked in the ’80s
  • went down until recent years

What makes a song popular?

While doing the previous time series analysis, I found a surprising fact: the mean popularity grows over time. In other words, more recent songs are more popular. How can that be? My intuition was that there are lots of old, popular songs... How is Spotify defining "popularity"?

Checking in its API, we find that popularity is based on the total number of plays the song has, and how recent those plays are. So, songs that are being played a lot now are more popular than songs that were played a lot in the past.

With this in mind, I aggregated the songs dataset by popularity, and made bivariate plots to see how the other features were related to it (you can also see the code in Spotify_analysis.ipynb, or in this nbviewer link to see the interactive plots):

Popularity vs. mean loudness

Popularity vs. danceability

  • Recent songs are more popular.
  • Popular songs are more danceable, loud, and not very long (3.5 mins).
  • Live concerts and acoustic songs are less popular.

2. Can popularity be predicted?

Some of the characteristics of the songs are correlated with popularity. Can I use this to predict the popularity of a song? (See the code in the predict_popularity.ipynb notebook.)

Popularity, as defined by Spotify, goes from 0 to 100. To have a more manageable problem, I reshaped the popularity feature into 3 categories, based on its distribution: low (<20), medium (20 to 70) and high popularity (>70).
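A minimal sketch of this binning, using the thresholds stated above. The function name `popularity_class` and the sample scores are hypothetical, and the notebook may implement the step differently.

```python
import pandas as pd


def popularity_class(score: float) -> str:
    """Map a 0-100 popularity score to one of three classes."""
    if score < 20:
        return "low"
    if score <= 70:
        return "medium"
    return "high"


# Sample scores chosen to sit on both sides of each threshold.
popularity = pd.Series([0, 19, 20, 70, 71, 100])
classes = popularity.map(popularity_class)
print(classes.tolist())
# ['low', 'low', 'medium', 'medium', 'high', 'high']
```

Turning the target into three classes makes this a classification problem rather than a regression one.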

I tried different models and ended up using XGBoost, with the following results:

confusion_matrix.png

The importance the model gives to each feature is also very interesting:

feature_importance.png

  • The release year is the most important feature to determine popularity.
  • Instrumentalness, speechiness, liveness and acousticness have a negative impact on the popularity of a song (as we saw in previous plots).
  • Explicitness and valence (positiveness) have a positive impact on popularity (which is consistent with the previous EDA).

3. Building a recommender system

My next step was to build an artist recommender system: when the user searches for an artist, which others should I recommend?

To do that, I used a basic content-based method, cosine similarity (see the code in recommender_system.ipynb).

This method treats each row (each artist) as a vector with multiple components (the features). All the artists form a vector space, and an "angle" similarity between two artists can be obtained by computing the scalar product of their vectors divided by the product of their norms:

Cosine similarity
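As an illustration, cosine similarity between artists can be computed with NumPy as sketched below. The artist names, the feature values and the helpers `cosine_similarity_matrix` and `most_similar` are all hypothetical; the notebook may rely on a library routine instead.

```python
import numpy as np


def cosine_similarity_matrix(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `features`."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = features / norms
    return unit @ unit.T


def most_similar(names, features, query, top_n=2):
    """Return the `top_n` artists most similar to `query`, excluding itself."""
    sims = cosine_similarity_matrix(features)
    i = names.index(query)
    order = np.argsort(-sims[i])  # indices sorted by descending similarity
    return [names[j] for j in order if j != i][:top_n]


# Hypothetical artists with three made-up features each.
names = ["artist_a", "artist_b", "artist_c"]
features = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],  # same direction as artist_a, different magnitude
    [0.0, 1.0, 0.0],  # orthogonal to artist_a
])

print(most_similar(names, features, "artist_a", top_n=1))
```

Because only the angle matters, `artist_b` ranks as the closest match to `artist_a` even though its feature values are twice as large.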

In addition to the sound-related features, I encoded the genres of each artist. It turns out that Spotify uses a gazillion different genres, so I ended up with a very sparse matrix of 3245 columns.
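One way to encode genres is a multi-hot matrix with one column per genre, sketched here with pandas. The `genres` column holding a list per artist and the sample values are assumptions; the encoding used in the notebook may differ.

```python
import pandas as pd

# Hypothetical artists, each with a list of genres.
artists = pd.DataFrame(
    {"genres": [["rock", "hard rock"], ["pop", "r&b"], ["rock"]]},
    index=["artist_a", "artist_b", "artist_c"],
)

# One row per (artist, genre) pair, then one indicator column per genre.
exploded = artists["genres"].explode()
genre_matrix = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(genre_matrix)
```

With thousands of distinct genres, most entries in such a matrix are zero, which is why the result is so sparse.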

Some of the results I obtained:

AC/DC similar artists: Rose Tattoo, Stevie Wright

Beyoncé similar artists: Normani, Selena

4. Next

  • Gather data with more relevant features from AcousticBrainz
  • Use a “superstar” variable to predict popularity: Predictability of success
  • Use Spotify users’ data to build a collaborative filtering model for the recommender system
