# Songs Popularity Prediction

## Objective
The goal of this project is to build a model that can predict the popularity of a song from some of its attributes, such as genre, duration and release year, together with some attributes of the artist who created it.

## Dataset
The dataset was obtained from https://www.kaggle.com/datasets/conorvaneden/best-songs-on-spotify-for-every-year-2000-2023 and contains data on the 100 most popular songs for each year from 2000 to 2023, for a total of 2385 records.

## Implementation

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Dataset description

First, we load the dataset from the .csv file and display its first rows.

In [4]:
songs = pd.read_csv("songs.csv", sep=";")
songs.head()

Unnamed: 0,title,artist,top genre,year,bpm,energy,danceability,dB,liveness,valence,duration,acousticness,speechiness,popularity
0,Flowers,Miley Cyrus,pop,2023,118,68,71,-4,3,65,200,6,7,98
1,Cupid - Twin Ver.,FIFTY FIFTY,k-pop girl group,2023,120,59,78,-8,35,73,174,44,3,97
2,BESO,ROSALÍA,pop,2023,95,64,77,-7,17,53,195,74,14,96
3,Boy's a liar Pt. 2,PinkPantheress,bronx drill,2023,133,81,70,-8,25,86,131,25,5,96
4,Creepin' (with The Weeknd & 21 Savage),Metro Boomin,rap,2022,98,62,72,-6,8,17,222,42,5,96


Note that the index is a numeric one generated automatically by pandas, because no set of attributes can serve as a primary key in this dataset. The title cannot, since two different songs can obviously share the same title. The pair (title, artist) does not work either, since a song can be released more than once (in theory even within the same year, however unlikely, so adding the year does not help).
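The key argument above can be checked mechanically. The following is a minimal sketch on a made-up toy frame (the rows and the helper name `is_candidate_key` are illustrative, not part of the project); it relies on `DataFrame.duplicated` to test whether a set of columns identifies every row uniquely.

```python
import pandas as pd

# Toy data, invented for illustration: the first and last rows describe
# the same (title, artist, year) triple, as in a same-year re-release.
toy = pd.DataFrame({
    "title": ["Song A", "Song A", "Song B", "Song A"],
    "artist": ["Artist X", "Artist Y", "Artist X", "Artist X"],
    "year": [2020, 2020, 2021, 2020],
})

def is_candidate_key(df, columns):
    """Return True when no two rows share the same values in `columns`."""
    return not df.duplicated(subset=columns).any()

print(is_candidate_key(toy, ["title"]))
print(is_candidate_key(toy, ["title", "artist"]))
print(is_candidate_key(toy, ["title", "artist", "year"]))
```

Running the same helper on the real `songs` frame would show whether the theoretical collision actually occurs in this dataset.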

This is the description of the variables given by the creator of the dataset:
- Genre - the genre of the track
- Year - the release year of the recording. Note that due to vagaries of releases, re-releases, re-issues and general madness, sometimes the release years are not what you'd expect.
- Added - the earliest date you added the track to your collection.
- Beats Per Minute (BPM) - The tempo of the song.
- Energy - The energy of a song - the higher the value, the more energetic the song.
- Danceability - The higher the value, the easier it is to dance to this song.
- Loudness (dB) - The higher the value, the louder the song.
- Liveness - The higher the value, the more likely the song is a live recording.
- Valence - The higher the value, the more positive mood for the song.
- Length - The duration of the song.
- Acousticness - The higher the value the more acoustic the song is.
- Speechiness - The higher the value the more spoken word the song contains.
- Popularity - The higher the value the more popular the song is.
- Duration - The length of the song.

See http://organizeyourmusic.playlistmachinery.com/ for more information.

### Integrating the dataset with other sources

It was deemed necessary to add to the dataset a column containing the lyrics of each song. Suitably processed with Natural Language Processing techniques, the lyrics can be of great help in predicting a song's popularity.
The *lyricsgenius* library was therefore used: through the API of https://genius.com/ it downloads the lyrics of a song given its title and artist.

The (commented-out) code that performs this operation is shown below. Note, however, that it can take a long time to run (at least an hour), because requests often time out. There is no need to run the following cells, since the lyrics have already been downloaded and are stored in the file *songs_lyrics.csv*.

In [None]:
#%pip install lyricsgenius
#import lyricsgenius as lg

In [3]:
#import csv

## Placeholder: supply your own Genius API token and never commit a real one.
#access_token = '<YOUR_GENIUS_ACCESS_TOKEN>'
#genius = lg.Genius(access_token)

#def get_lyrics(song_title, artist_name):
#    # Return the lyrics found on Genius, or an empty string when there is no match.
#    song = genius.search_song(song_title, artist_name)
#    return song.lyrics if song is not None else ""

#songs = pd.read_csv("songs.csv", sep=";")

## Append one "title|artist|lyrics" row per song.
#with open("songs_lyrics.csv", "a", newline="", encoding="utf-8") as csv_file:
#    writer = csv.writer(csv_file, delimiter="|")
#    for _, song in songs.iterrows():
#        is_ok = False
#        # Requests often time out, so retry until one succeeds.
#        while not is_ok:
#            try:
#                lyrics = get_lyrics(song["title"], song["artist"])
#                is_ok = True
#                writer.writerow([song["title"], song["artist"], lyrics])
#            except Exception:
#                continue
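The loop above retries each request indefinitely, so a song that always fails would block the download forever. A bounded retry with exponential backoff is one possible alternative. The sketch below is an assumption rather than the project's code: `fetch_with_retry`, `max_attempts` and `base_delay` are hypothetical names, and `flaky_lookup` merely simulates a lookup that times out twice.

```python
import time

def fetch_with_retry(fetch, *args, max_attempts=5, base_delay=1.0):
    """Call fetch(*args), retrying on any exception.

    Returns the result of the first successful call, or None when all
    max_attempts calls fail.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(*args)
        except Exception:
            # Wait longer after each failure, except after the last one.
            if attempt < max_attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    return None

# Simulated lookup (hypothetical): fails twice, then succeeds.
calls = {"count": 0}

def flaky_lookup(title, artist):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return f"lyrics for {title} by {artist}"

result = fetch_with_retry(flaky_lookup, "Flowers", "Miley Cyrus", base_delay=0)
print(result, calls["count"])
```

In the download loop, a `None` result could be written as an empty lyrics field so that the remaining songs are still processed.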

In [10]:
songs.columns = songs.columns.str.replace('speechiness ', 'speechiness').str.replace('danceability ', 'danceability')

In [15]:
corr = songs.corr(numeric_only=True)
print(abs(corr["popularity"]).sort_values(ascending=False))

popularity         1.000000
songs of artist    0.241812
year               0.205909
energy             0.120484
acousticness       0.100588
duration           0.095590
valence            0.086720
speechiness        0.072670
top genre          0.068774
bpm                0.019961
danceability       0.019491
liveness           0.019009
Name: popularity, dtype: float64


As the correlation coefficients between the features and the label *popularity* show, loudness in decibels (*dB*) does not significantly affect popularity, so the column can be dropped (although it may be worth keeping anyway, since regularization already takes care of weakly informative features). Note that *dB* is absent from the output above because that cell was last executed after the column had already been dropped.
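The ranking used above can be reproduced on a small example. This is a minimal sketch with invented numbers (the column names `strong`, `inverse` and `weak` and the cut-off `threshold` are hypothetical); it shows why the absolute value matters, since a strongly negative correlation is as informative as a positive one.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "popularity": [10, 20, 30, 40, 50],
    "strong": [1, 2, 3, 4, 6],     # rises with popularity
    "inverse": [9, 4, 3, 2, 1],    # falls as popularity rises
    "weak": [3, 1, 4, 1, 5],       # little linear relation
})

# Pearson correlation of every column with the label, ranked by magnitude.
ranked = toy.corr()["popularity"].abs().sort_values(ascending=False)
print(ranked)

# Hypothetical cut-off below which a feature is a candidate for removal.
threshold = 0.5
weak_features = ranked[ranked < threshold]
print(list(weak_features.index))
```

Pearson correlation only captures linear relations, so a low value is a hint rather than proof that a feature is useless.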

In [12]:
songs.drop(["dB"], axis=1, inplace=True)

In [13]:
songs_of_artist = songs.groupby("artist").count().aggregate("max", axis=1)
songs["songs of artist"] = songs["artist"].map(songs_of_artist)
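The per-artist song count computed above can also be obtained more directly with `value_counts`. The sketch below uses a made-up toy frame and assumes the `artist` column has no missing values; it is an equivalent formulation under that assumption, not a change to the project's logic.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "artist": ["X", "Y", "X", "X"],
})

# value_counts() maps each artist to the number of rows they appear in;
# map() then broadcasts that count back onto every song of the artist.
toy["songs of artist"] = toy["artist"].map(toy["artist"].value_counts())
print(toy)
```

The feature counts only the songs present in this dataset, not an artist's full discography.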

In [16]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
songs["top genre"] = label_encoder.fit_transform(songs["top genre"])

In [18]:
# Population variance (ddof=0) of each numeric column, printed in this order
for column in ["year", "bpm", "energy", "danceability", "liveness",
               "valence", "duration", "acousticness", "speechiness", "popularity"]:
    print(songs[column].var(ddof=0))

45.521325896918626
743.5087472805665
261.1850164154899
188.13338009660305
185.67957948217588
504.1540666904
1867.7644600908015
411.0145899643562
90.60791828733919
130.35579007511132


In [19]:
# var() > 200

bpm = songs["bpm"].values.reshape(-1, 1)
energy = songs["energy"].values.reshape(-1, 1)
valence = songs["valence"].values.reshape(-1, 1)
duration = songs["duration"].values.reshape(-1, 1)
acousticness = songs["acousticness"].values.reshape(-1, 1)

############################################################################################################

# var() <= 200

year = songs["year"].values.reshape(-1, 1)
danceability = songs["danceability"].values.reshape(-1, 1)
liveness = songs["liveness"].values.reshape(-1, 1)
speechiness = songs["speechiness"].values.reshape(-1, 1)
popularity = songs["popularity"].values.reshape(-1, 1)

In [20]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()
minMaxScaler = MinMaxScaler()

In [21]:
# var() > 200
songs["bpm"] = standardScaler.fit_transform(bpm)
songs["energy"] = standardScaler.fit_transform(energy)
songs["valence"] = standardScaler.fit_transform(valence)
songs["duration"] = standardScaler.fit_transform(duration)
songs["acousticness"] = standardScaler.fit_transform(acousticness)

############################################################################################################

# var() <= 200
songs["year"] = minMaxScaler.fit_transform(year)
songs["danceability"] = minMaxScaler.fit_transform(danceability)
songs["liveness"] = minMaxScaler.fit_transform(liveness)
songs["speechiness"] = minMaxScaler.fit_transform(speechiness)
songs["popularity"] = minMaxScaler.fit_transform(popularity)
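For reference, the two transformations applied above can be written out by hand. This is a minimal NumPy sketch, assuming the default settings of both scalers: `StandardScaler` subtracts the mean and divides by the population standard deviation, and `MinMaxScaler` rescales to the range [0, 1]. The helper names `standardize` and `min_max` are illustrative, and the sample values are the `bpm` entries of the first five rows shown earlier.

```python
import numpy as np

def standardize(values):
    """Zero mean, unit variance (population std, i.e. ddof=0)."""
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std()

def min_max(values):
    """Linear rescaling so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(values, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

bpm_sample = [118, 120, 95, 133, 98]
z = standardize(bpm_sample)
m = min_max(bpm_sample)
print(z.round(3))
print(m.round(3))
```

Both helpers divide by zero on a constant column, a case the scikit-learn scalers handle internally.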

In [18]:
songs

Unnamed: 0,title,artist,top genre,year,bpm,energy,danceability,dB,liveness,valence,duration,acousticness,speechiness,popularity
0,Flowers,Miley Cyrus,pop,2023,118,68,71,-4,3,65,200,6,7,98
1,Cupid - Twin Ver.,FIFTY FIFTY,k-pop girl group,2023,120,59,78,-8,35,73,174,44,3,97
2,BESO,ROSALÍA,pop,2023,95,64,77,-7,17,53,195,74,14,96
3,Boy's a liar Pt. 2,PinkPantheress,bronx drill,2023,133,81,70,-8,25,86,131,25,5,96
4,Creepin' (with The Weeknd & 21 Savage),Metro Boomin,rap,2022,98,62,72,-6,8,17,222,42,5,96
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2380,Southbound,Beach Blvd,rock,2023,140,88,60,-6,10,49,228,19,4,21
2381,Dance with Somebody - Radio Version,Mando Diao,dalarna indie,2009,150,90,55,-4,36,51,241,0,9,16
2382,Flow,Desire Machines,indie,2022,100,87,60,-7,9,74,255,0,5,15
2383,Scared of the Dark,Everything Brighter,pop,2023,120,80,65,-6,7,61,179,0,6,11


In [11]:
lyrics = pd.read_csv("songs_lyrics.csv", sep="|", header=None)

In [17]:
clear_lyrics = lambda text: text.split("Lyrics")[1].split("Embed")[0]
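The lambda above raises an `IndexError` when the text contains no `Lyrics` marker, which happens for the empty strings written when a song is not found. A more defensive variant might look as follows. This is a sketch under assumptions: `clean_lyrics` is a hypothetical name, the header and footer markers are taken from the lambda itself, and the `You might also like` residue is the one visible in the sample output.

```python
def clean_lyrics(text):
    """Strip the header, footer and page residue from a downloaded lyrics string."""
    # Missing values (for example NaN read back from the CSV) become empty text.
    if not isinstance(text, str):
        return ""
    # Keep what follows the first "Lyrics" marker, if there is one.
    _, marker, after = text.partition("Lyrics")
    body = after if marker else text
    # Drop the trailing "Embed" marker and anything after it.
    body = body.partition("Embed")[0]
    # Remove the residue that the page injects between sections.
    return body.replace("You might also like", "").strip()

# Made-up sample in the same shape as the downloaded text.
sample = "Some Title Lyrics[Verse 1]\nfirst line\nYou might also like[Chorus]\nsecond lineEmbed"
cleaned = clean_lyrics(sample)
print(cleaned)
print(repr(clean_lyrics("")))
```

Unlike `split("Lyrics")[1]`, this version does not truncate songs whose own text contains the word "Lyrics".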

"[Verse 1]\nWe were good, we were gold\nKinda dream that can't be sold\nWe were right 'til we weren't\nBuilt a home and watched it burn\n\n[Pre-Chorus]\nMm, I didn't wanna leave you, I didn't wanna lie\nStarted to cry, but then remembered I\n\n[Chorus]\nI can buy myself flowers\nWrite my name in the sand\nTalk to myself for hours\nSay things you don't understand\nI can take myself dancing\nAnd I can hold my own hand\nYeah, I can love me better than you can\n\n[Post-Chorus]\nCan love me better, I can love me better, baby\nCan love me better, I can love me better, baby\n\n[Verse 2]\nPaint my nails cherry-red\nMatch the roses that you left\nNo remorse, no regret\nI forgive every word you said\nYou might also like[Pre-Chorus]\nOoh, I didn't wanna leave you, baby, I didn't wanna fight\nStarted to cry, but then remembered I\n\n[Chorus]\nI can buy myself flowers\nWrite my name in the sand\nTalk to myself for hours, yeah\nSay things you don't understand\nI can take myself dancing, yeah\nI can 

In [30]:
songs[songs.title == "Tunnel Vision"]

Unnamed: 0,title,artist,top genre,year,bpm,energy,danceability,dB,liveness,valence,duration,acousticness,speechiness,popularity
2278,Tunnel Vision,Kodak Black,florida drill,2017,172,49,50,-8,12,23,268,6,29,50


In [7]:
songs.iloc[1382:]

Unnamed: 0,title,artist,top genre,year,bpm,energy,danceability,dB,liveness,valence,duration,acousticness,speechiness,popularity
1382,Battle Scars (with Guy Sebastian),Lupe Fiasco,chicago rap,2012,168,81,52,-5,11,49,250,19,29,69
1383,Be Without You - Kendu Mix,Mary J. Blige,dance pop,2005,147,70,73,-6,26,67,246,7,10,69
1384,Because I Got High,Afroman,comedy rap,2001,166,34,80,-9,8,85,198,17,49,69
1385,Black Widow,Iggy Azalea,australian hip hop,2014,164,72,74,-4,11,52,209,19,12,69
1386,Blow Me (One Last Kiss),P!nk,dance pop,2012,114,92,60,-3,28,73,256,0,4,69
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2380,Southbound,Beach Blvd,rock,2023,140,88,60,-6,10,49,228,19,4,21
2381,Dance with Somebody - Radio Version,Mando Diao,dalarna indie,2009,150,90,55,-4,36,51,241,0,9,16
2382,Flow,Desire Machines,indie,2022,100,87,60,-7,9,74,255,0,5,15
2383,Scared of the Dark,Everything Brighter,pop,2023,120,80,65,-6,7,61,179,0,6,11
