## Predicting song popularity using Machine Learning

This notebook uses various Python-based machine learning and data science libraries in an attempt to build a machine learning model capable of predicting the popularity of a song based on the song's attributes.

*Approach:*

* Problem definition
* Data 
* Evaluation
* Features
* Modelling
* Experimentation 

### 1. Problem definition

> Can we predict the popularity of a song based on its attributes?

### 2. Data

This study combines three datasets.

#### Spotify Top 100 Songs of 2010-2019
The first dataset features the top 100 songs on Spotify for each year from 2010 to 2019. It contains 17 attributes, including the predicted attribute.
The dataset is available on Kaggle: https://www.kaggle.com/datasets/muhmores/spotify-top-100-songs-of-20152019

* **title:** Song's title
* **artist:** Song's artist
* **top genre:** Genre of the song
* **year released:** Year the song was released
* **added:** Day the song was added to Spotify's Top Hits playlist
* **bpm:** Beats per minute - The tempo of the song
* **nrgy:** Energy - How energetic the song is
* **dnce:** Danceability - How easy it is to dance to the song
* **dB:** Decibel - How loud the song is
* **live:** Liveness - How likely the song is a live recording
* **val:** Valence - How positive the mood of the song is
* **dur:** Duration of the song
* **acous:** Acousticness - How acoustic the song is
* **spch:** Speechiness - How much the song is focused on spoken word
* **pop:** Popularity of the song (not a ranking)
* **top year:** Year the song was a top hit
* **artist type:** Whether the artist is solo, a duo, a trio, or a band


#### Spotify Top 50 Songs of 2020
The second dataset includes data on the top 50 songs on Spotify from 2020. It contains 12 attributes, including the predicted attribute.
The dataset is available on Kaggle: https://www.kaggle.com/datasets/heminp16/spotify-top-2020-songs

* **Top Genre:** Genre of the song
* **Year:** Release year of the song
* **BPM:** Beats per minute - The tempo of the song
* **Energy:** The energy of the song; the higher the value, the more energetic the song
* **Danceability:** How suitable the track is for dancing; the higher the value, the easier it is to dance to
* **Loudness(dB):** The loudness level in decibels; the higher the value, the louder the song
* **Liveness:** The higher the value, the more likely the song is a live recording
* **Valence:** A measure of the musical positiveness of the track; the higher the value, the more positive the mood
* **Duration(sec):** The duration of the song in seconds
* **Acousticness:** A measure of how acoustic the track is
* **Speechiness:** The higher the value, the more spoken word the track contains
* **Popularity:** The higher the value, the more popular the song is

#### Spotify Top 50 Songs in 2021
The third dataset features the top 50 songs on Spotify from 2021, along with 14 characteristic variables for each song.
The dataset is available on Kaggle: https://www.kaggle.com/datasets/equinxx/spotify-top-50-songs-in-2021


* **id:** Position of the song in the list
* **artist_name:** Name of the artist
* **track_name:** Name of the track
* **track_id:** Unique ID of the track on Spotify
* **popularity:** The higher the value, the more popular the song is
* **danceability:** The higher the value, the easier it is to dance to the song
* **energy:** The energy of the song; the higher the value, the more energetic the song
* **key:** The key the track is in
* **loudness:** The higher the value, the louder the song
* **mode:** The modality (major or minor) of the track
* **speechiness:** The higher the value, the more spoken word the song contains
* **acousticness:** The higher the value, the more acoustic the song is
* **instrumentalness:** The higher the value, the more likely the song contains no vocals
* **liveness:** The higher the value, the more likely the song is a live recording
* **valence:** The higher the value, the more positive the mood of the song
* **tempo:** Tempo of the song
* **duration_ms:** Duration of the song (in ms)
* **time_signature:** An estimated time signature

In the current study, the *pop* attribute (popularity) is the attribute to be predicted.
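
Because the three datasets name and scale their attributes differently, they need a common schema before they can be combined. The following is a minimal sketch of one possible approach, not necessarily the one used later in this notebook. It assumes that the 0-1 features of the 2021 dataset correspond to the 0-100 features of the other two, which the "Blinding Lights" rows present in both the 2020 and 2021 data appear to support. The `RENAME_2021` mapping, `PERCENT_COLS`, and `harmonize_2021` are hypothetical names introduced for illustration.

```python
import pandas as pd

# One-row stand-ins for the real CSV files ("Blinding Lights" appears in both).
df_2020 = pd.DataFrame({
    "title": ["Blinding Lights"], "artist": ["The Weeknd"], "year": [2020],
    "bpm": [171], "nrgy": [73], "dnce": [51], "dB": [-6], "live": [9],
    "val": [33], "dur": [200], "acous": [0], "spch": [6], "pop": [93],
})
df_2021 = pd.DataFrame({
    "track_name": ["Blinding Lights"], "artist_name": ["The Weeknd"],
    "popularity": [93], "danceability": [0.514], "energy": [0.73],
    "loudness": [-5.934], "speechiness": [0.0598], "acousticness": [0.00146],
    "liveness": [0.0897], "valence": [0.334], "tempo": [171.005],
    "duration_ms": [200040],
})

# Hypothetical mapping from the 2021 column names to the short names
# used by the 2010-2019 and 2020 datasets.
RENAME_2021 = {
    "track_name": "title", "artist_name": "artist", "popularity": "pop",
    "danceability": "dnce", "energy": "nrgy", "loudness": "dB",
    "speechiness": "spch", "acousticness": "acous", "liveness": "live",
    "valence": "val", "tempo": "bpm",
}
# Assumption: these features are stored as 0-1 in 2021 and as 0-100 elsewhere.
PERCENT_COLS = ["nrgy", "dnce", "live", "val", "acous", "spch"]


def harmonize_2021(df: pd.DataFrame) -> pd.DataFrame:
    """Rename and rescale the 2021 columns to match the older datasets."""
    out = df.rename(columns=RENAME_2021)  # rename returns a new frame
    for col in PERCENT_COLS:
        out[col] = (out[col] * 100).round()
    out["bpm"] = out["bpm"].round()
    out["dB"] = out["dB"].round()
    out["dur"] = (out["duration_ms"] / 1000).round()  # milliseconds to seconds
    return out.drop(columns="duration_ms")


harmonized_2021 = harmonize_2021(df_2021)

# Keep only the columns both frames share, then stack them.
shared = [c for c in df_2020.columns if c in harmonized_2021.columns]
combined = pd.concat(
    [df_2020[shared], harmonized_2021[shared]], ignore_index=True
)
print(combined)
```

Columns that exist in only one dataset (such as `year` here, or `key` and `mode` in the full 2021 data) are dropped by this sketch; whether to keep them and fill the gaps is a separate modelling decision.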



In [27]:
import pandas as pd
import sklearn
import numpy as np
import matplotlib.pyplot as plot
import seaborn as sns
%matplotlib inline

In [28]:
data_2010_2019 = pd.read_csv("data/Spotify 2010 - 2019 Top 100.csv")
data_2020 = pd.read_csv("data/Top2020.csv")
data_2021 = pd.read_csv("data/spotify_top50_2021.csv")

data_2010_2019

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type
0,STARSTRUKK (feat. Katy Perry),3OH!3,dance pop,2009.0,2022-02-17,140.0,81.0,61.0,-6.0,23.0,23.0,203.0,0.0,6.0,70.0,2010.0,Duo
1,My First Kiss (feat. Ke$ha),3OH!3,dance pop,2010.0,2022-02-17,138.0,89.0,68.0,-4.0,36.0,83.0,192.0,1.0,8.0,68.0,2010.0,Duo
2,I Need A Dollar,Aloe Blacc,pop soul,2010.0,2022-02-17,95.0,48.0,84.0,-7.0,9.0,96.0,243.0,20.0,3.0,72.0,2010.0,Solo
3,Airplanes (feat. Hayley Williams of Paramore),B.o.B,atl hip hop,2010.0,2022-02-17,93.0,87.0,66.0,-4.0,4.0,38.0,180.0,11.0,12.0,80.0,2010.0,Solo
4,Nothin' on You (feat. Bruno Mars),B.o.B,atl hip hop,2010.0,2022-02-17,104.0,85.0,69.0,-6.0,9.0,74.0,268.0,39.0,5.0,79.0,2010.0,Solo
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
998,Strike a Pose (feat. Aitch),Young T & Bugsey,afroswing,2019.0,2020-08-20,138.0,58.0,53.0,-6.0,10.0,59.0,214.0,1.0,10.0,67.0,2019.0,Duo
999,The London (feat. J. Cole & Travis Scott),Young Thug,atl hip hop,2019.0,2020-06-22,98.0,59.0,80.0,-7.0,13.0,18.0,200.0,2.0,15.0,75.0,2019.0,Solo
1000,,,,,,,,,,,,,,,,,
1001,,,,,,,,,,,,,,,,,
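
The last rows of this output (1000, 1001) contain no values at all. A minimal sketch of how such fully empty rows could be removed is shown below, using a small hypothetical `sample` frame in place of `data_2010_2019`. `dropna(how="all")` only drops rows in which every column is missing, so rows with partial data are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the 2010-2019 frame with two empty rows.
sample = pd.DataFrame({
    "title": ["STARSTRUKK (feat. Katy Perry)", "I Need A Dollar", np.nan, np.nan],
    "artist": ["3OH!3", "Aloe Blacc", np.nan, np.nan],
    "pop": [70.0, 72.0, np.nan, np.nan],
})

# Drop rows where every column is NaN, then rebuild a contiguous index.
cleaned = sample.dropna(how="all").reset_index(drop=True)
print(cleaned)
```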


## Exploring the data

In [29]:
data_2010_2019.dtypes

title             object
artist            object
top genre         object
year released    float64
added             object
bpm              float64
nrgy             float64
dnce             float64
dB               float64
live             float64
val              float64
dur              float64
acous            float64
spch             float64
pop              float64
top year         float64
artist type       object
dtype: object
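
The dtypes above show that `added` is stored as plain text and that whole-number columns such as `pop` are stored as floats. One possible cleanup, sketched here on a small hypothetical `sample` frame, is to parse the dates and cast the numeric columns to integers. The integer cast assumes that rows with missing values have already been removed, since `int64` cannot hold NaN.

```python
import pandas as pd

# Hypothetical sample with the same dtypes as the 2010-2019 frame.
sample = pd.DataFrame({
    "added": ["2022-02-17", "2020-08-20"],
    "year released": [2009.0, 2019.0],
    "pop": [70.0, 67.0],
    "artist type": ["Duo", "Duo"],
})

typed = sample.copy()
# Parse the ISO-formatted date strings into real timestamps.
typed["added"] = pd.to_datetime(typed["added"], format="%Y-%m-%d")
# Cast whole-number floats to integers and the label column to a category.
typed = typed.astype({
    "year released": "int64",
    "pop": "int64",
    "artist type": "category",
})
print(typed.dtypes)
```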

In [30]:
data_2020

Unnamed: 0,sel,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
0,1,Blinding Lights,The Weeknd,canadian contemporary r&b,2020,171,73,51,-6,9,33,200,0,6,93
1,2,Watermelon Sugar,Harry Styles,pop,2019,95,82,55,-4,34,56,174,12,5,90
2,3,Someone You Loved,Lewis Capaldi,pop,2019,110,41,50,-6,11,45,182,75,3,89
3,4,lovely (with Khalid),Billie Eilish,art pop,2018,115,30,35,-10,10,12,200,93,3,89
4,5,Mood (feat. iann dior),24kGoldn,cali rap,2021,91,72,70,-4,32,73,141,17,4,89
5,6,Circles,Post Malone,dfw rap,2019,120,76,70,-3,9,55,215,19,4,88
6,7,goosebumps,Travis Scott,rap,2016,130,73,84,-3,15,43,244,8,5,87
7,8,Lucid Dreams,Juice WRLD,chicago rap,2018,84,57,51,-7,34,22,240,35,20,87
8,9,Memories,Maroon 5,pop,2021,91,33,78,-7,8,60,189,84,6,87
9,10,bad guy,Billie Eilish,art pop,2019,135,43,70,-11,10,56,194,33,38,87


In [31]:
data_2021

Unnamed: 0,id,artist_name,track_name,track_id,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,1,Olivia Rodrigo,drivers license,5wANPM4fQCJwkGd4rN57mH,92,0.561,0.431,10,-8.81,1,0.0578,0.768,1.4e-05,0.106,0.137,143.875,242013,4
1,2,Lil Nas X,MONTERO (Call Me By Your Name),1SC5rEoYDGUK4NfG82494W,90,0.593,0.503,8,-6.725,0,0.22,0.293,0.0,0.405,0.71,178.781,137704,4
2,3,The Kid LAROI,STAY (with Justin Bieber),5PjdY0CKGZdEuoNab3yDmX,92,0.591,0.764,1,-5.484,1,0.0483,0.0383,0.0,0.103,0.478,169.928,141806,4
3,4,Olivia Rodrigo,good 4 u,4ZtFanR9U6ndgddUvNcjcG,95,0.563,0.664,9,-5.044,1,0.154,0.335,0.0,0.0849,0.688,166.928,178147,4
4,5,Dua Lipa,Levitating (feat. DaBaby),5nujrmhLynf4yMoMtj8AQF,89,0.702,0.825,6,-3.787,0,0.0601,0.00883,0.0,0.0674,0.915,102.977,203064,4
5,6,Justin Bieber,Peaches (feat. Daniel Caesar & Giveon),4iJyoBOLtHqaGxP12qzhQI,90,0.677,0.696,0,-6.181,1,0.119,0.321,0.0,0.42,0.464,90.03,198082,4
6,7,Doja Cat,Kiss Me More (feat. SZA),3DarAbFujv6eYNliUTyqtz,88,0.764,0.705,8,-3.463,1,0.0284,0.259,8.9e-05,0.12,0.781,110.97,208667,4
7,8,The Weeknd,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,93,0.514,0.73,1,-5.934,1,0.0598,0.00146,9.5e-05,0.0897,0.334,171.005,200040,4
8,9,Glass Animals,Heat Waves,02MWAaffLxlfxAUY7c5dvx,94,0.761,0.525,11,-6.9,1,0.0944,0.44,7e-06,0.0921,0.531,80.87,238805,4
9,10,Måneskin,Beggin',3Wrjm47oTz2sjIgck11l5e,93,0.714,0.8,11,-4.808,0,0.0504,0.127,0.0,0.359,0.589,134.002,211560,4
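
Note that "Blinding Lights" by The Weeknd appears in both the 2020 and the 2021 data, so a combined dataset may contain the same song more than once. The sketch below shows one way to detect such repeats, assuming the frames have already been brought to common `title` and `artist` column names (the 2021 file actually uses `track_name` and `artist_name`). The `source` column and the matching key are hypothetical choices made for illustration.

```python
import pandas as pd

# Small stand-ins for the 2020 and 2021 frames, already using shared names.
songs_2020 = pd.DataFrame({
    "title": ["Blinding Lights", "Watermelon Sugar"],
    "artist": ["The Weeknd", "Harry Styles"],
    "pop": [93, 90],
})
songs_2021 = pd.DataFrame({
    "title": ["drivers license", "Blinding Lights"],
    "artist": ["Olivia Rodrigo", "The Weeknd"],
    "pop": [92, 93],
})

combined = pd.concat(
    [songs_2020.assign(source=2020), songs_2021.assign(source=2021)],
    ignore_index=True,
)

# Build a case-insensitive key from title and artist to spot repeated songs.
key = (
    combined["title"].str.strip().str.casefold()
    + "|"
    + combined["artist"].str.strip().str.casefold()
)
is_repeat = key.duplicated(keep="first")

deduplicated = combined.loc[~is_repeat].reset_index(drop=True)
print(deduplicated)
```

Keeping the first occurrence is an arbitrary choice; keeping the most recent popularity value instead would only require `keep="last"`.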


In [32]:
data_2010_2019.info(), data_2020.info(), data_2021.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1003 entries, 0 to 1002
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          1000 non-null   object 
 1   artist         1000 non-null   object 
 2   top genre      1000 non-null   object 
 3   year released  1000 non-null   float64
 4   added          1000 non-null   object 
 5   bpm            1000 non-null   float64
 6   nrgy           1000 non-null   float64
 7   dnce           1000 non-null   float64
 8   dB             1000 non-null   float64
 9   live           1000 non-null   float64
 10  val            1000 non-null   float64
 11  dur            1000 non-null   float64
 12  acous          1000 non-null   float64
 13  spch           1000 non-null   float64
 14  pop            1000 non-null   float64
 15  top year       1000 non-null   float64
 16  artist type    1000 non-null   object 
dtypes: float64(12), object(5)
memory usage: 133.3+ KB
<c

(None, None, None)
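
The trailing `(None, None, None)` appears because `info()` prints its report and returns `None`, and the cell evaluates a tuple of three such calls. A minimal alternative sketch is to call `info()` in a loop and collect the key figures in one small table. The `frames` dictionary below uses hypothetical miniature frames in place of the three loaded datasets.

```python
import pandas as pd

# Hypothetical miniature frames standing in for the three loaded datasets.
frames = {
    "2010-2019": pd.DataFrame(
        {"title": ["I Need A Dollar", None], "pop": [72.0, None]}
    ),
    "2020": pd.DataFrame({"title": ["Circles"], "pop": [88]}),
    "2021": pd.DataFrame({"track_name": ["Heat Waves"], "popularity": [94]}),
}

for name, frame in frames.items():
    print(f"--- {name} ---")
    frame.info()  # prints the report; the None return value is discarded

# One row per dataset: size and total number of missing cells.
summary = pd.DataFrame({
    name: {
        "rows": len(frame),
        "columns": frame.shape[1],
        "missing_cells": int(frame.isna().sum().sum()),
    }
    for name, frame in frames.items()
}).T
print(summary)
```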