Spotify is an audio streaming service with more than 70 million tracks and with over 356 million active users every month. With these numbers, Data Analysis is a must! The Spotify databases are enriched with lots of features: popularity, danceability, key... You can see them for yourself in its API.
For my analysis, I used this dataset collected with this API by a Kaggle user, Yamac Eren Ay. There are two main datasets: one for songs and another one for artists (the other datasets are derived from the first ones, by aggregation techniques). The songs dataset contains around 600k songs, released between 1922 and 2020.
- See the slides of the final presentation.
First, I wanted to see how the Spotify features have changed over the years. To do that, I aggregated the songs by year and took the mean or the mode of each feature (see the code in Spotify_analysis.ipynb, or in this nbviewer link to see the interactive plots)
These are some of the resulting plots:
The songs are becoming:
- longer
- louder
- and with faster tempos
The valence of the songs (positiveness, see the Spotify API):
- had more variance before the ’50s
- peaked in the ’80s
- went down until recent years
When making the previous time series analysis, I found a surprising fact: the mean popularity, grows over time. In other words, more recent songs are more popular. How can that be? My intuition is that there are lots of old, popular songs... How is Spotify defining "popularity"?
Checking in its API, we find that popularity is based on the total number of plays the song has, and how recent those plays are. So, songs that are being played a lot now are more popular than songs that were played a lot in the past.
With this in mind, I aggregated the songs dataset by popularity, and made bivariate plots to see how the other features were related to it (you can also see the code in Spotify_analysis.ipynb, or in this nbviewer link to see the interactive plots):
- Recent songs are more popular.
- Popular songs are more danceable, loud, and not very long (3.5 mins).
- Live concerts and acoustic songs are less popular.
As some of the characteristics of the songs are correlated with popularity. Can I use this to predict the popularity of a song? (see the code in the predict_popularity.ipynb notebook)
Popularity, as defined by Spotify, goes from 0 to 100. To have a more manageable problem, I reshaped the popularity feature into 3 categories, based on its distribution: low (<20), medium (20 to 70) and high popularity (>70).
I tried different models and ended using XGBoost, with the following results:
- Low popularity and medium popularity are well predicted by the XGBoost model.
- The highest popularities are not well predicted. Adding a new variable that takes into account the “superstardom” of the artist may yield better results (see Musical trends and predictability of success in contemporary songs in and out of the top charts, 2017)
The importance the model gives to each feature is also very interesting:
- The release year is the most important feature to determine popularity.
- Instrumentalness, speechiness, liveness and acousticness have a negative impact on the popularity of a song (as we saw in previous plots).
- Explicitness and valence (positiveness) have a positive impact on popularity (which is consistent with the previous EDA)
My next step was to build an artist recommender system: when the user searches for an artist, which others should I recommend?
To do that, I used a basic content-based method, cosine similarity (see the code in recommender_system.ipynb).
This method assumes each row (each artist) is a vector with multiple components (the features). All the artists form a vector space, and an "angle" similarity can be obtained by computing the scalar product of two vectors:
In addition to the sound-related features, I encoded the genres of each artist. It turns out that Spotify uses a gazillion different genres, so I ended with a very sparse matrix of 3245 columns.
Some of the results I obtained:
AC/DC similar artists: Rose Tattoo, Stevie Wright
Beyoncé similar artists: Normani, Selena
- Gather data with more relevant features from AcousticBrainz
- Using a “superstar” variable to predict popularity: Predictability of success
- Use Spotify users’ data to build a collaborative filtering model for the recommender system