# 🎵 Beatles Song Popularity Prediction - Random Forest

This notebook trains a random forest regressor to predict a song's **popularity** score.

The dataset contains audio features from Spotify's API, including acousticness, danceability, energy, tempo, and others. The goal is to explore how well these features can predict a song's success.

In [None]:
import pandas as pd
from sklearn.metrics import mean_absolute_error

# 1. Data Loading and Selection
beatles = pd.read_csv('./dados/beatles_spotify.csv')

In [13]:
beatles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 675 entries, 0 to 674
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   album             675 non-null    object 
 1   track_number      675 non-null    int64  
 2   acousticness      675 non-null    float64
 3   danceability      675 non-null    float64
 4   energy            675 non-null    float64
 5   instrumentalness  675 non-null    float64
 6   liveness          675 non-null    float64
 7   loudness          675 non-null    float64
 8   speechiness       675 non-null    float64
 9   tempo             675 non-null    float64
 10  valence           675 non-null    float64
 11  popularity        675 non-null    int64  
 12  duration_ms       675 non-null    int64  
dtypes: float64(9), int64(3), object(1)
memory usage: 68.7+ KB


In [5]:
beatles.head()

Unnamed: 0,album,track_number,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,duration_ms
0,Revolver (Super Deluxe),1,0.00225,0.484,0.771,0.0,0.718,-6.151,0.13,133.603,0.679,57,158266
1,Revolver (Super Deluxe),2,0.853,0.606,0.304,0.0,0.34,-7.485,0.0414,137.891,0.808,62,126466
2,Revolver (Super Deluxe),3,0.0944,0.559,0.479,0.0,0.269,-7.89,0.0281,103.392,0.658,57,180320
3,Revolver (Super Deluxe),4,0.706,0.46,0.6,4.3e-05,0.063,-9.108,0.0472,124.21,0.679,54,179866
4,Revolver (Super Deluxe),5,0.87,0.345,0.304,3.1e-05,0.116,-9.477,0.0297,164.568,0.425,55,144906


In [None]:
# Drop columns that are not used as features or target
beatles = beatles.drop(columns=["Unnamed: 0", "uri", "name", "release_date", "id"])
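The `info()` output earlier already shows the frame without these columns, so running the drop cell a second time would raise a `KeyError`. A minimal sketch of an idempotent drop follows, using a small made-up frame (`demo`, with hypothetical values) in place of the real CSV:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: only two of the five
# columns to drop are actually present.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "name": ["song_a", "song_b"],
    "energy": [0.5, 0.7],
    "popularity": [50, 60],
})

columns_to_drop = ["Unnamed: 0", "uri", "name", "release_date", "id"]

# errors="ignore" skips labels that are not in the frame instead of
# raising KeyError, so the cell can be re-run safely.
demo = demo.drop(columns=columns_to_drop, errors="ignore")
demo = demo.drop(columns=columns_to_drop, errors="ignore")  # second run changes nothing

print(list(demo.columns))
```

The trade-off is that a misspelled column name is also skipped silently, so this suits interactive re-runs better than a strict pipeline.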

In [None]:
from sklearn.model_selection import train_test_split
# Definition of Target (y) and Features (X)
y = beatles.popularity # What we want to predict
numerical_features = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "duration_ms"]
X = beatles[numerical_features]

# 2. Split into Training and Validation Data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
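The split above does not pass `test_size`, so scikit-learn's default of 25% applies; with 675 rows that gives 506 training rows and 169 validation rows. The sketch below checks this on synthetic data of the same length (the `*_demo` names and the random values are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (675).
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "energy": rng.random(675),
    "tempo": rng.uniform(60, 200, 675),
})
y_demo = pd.Series(rng.integers(0, 100, 675), name="popularity")

# Same call shape as in the notebook: no test_size, fixed random_state.
train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=1
)

print(len(train_X_demo), len(val_X_demo))
```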

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Define the model. Set random_state to 1 for reproducibility.
rf_model = RandomForestRegressor(random_state = 1)

# Train the model (fit)
rf_model.fit(train_X, train_y)

# Make validation predictions
rf_model_predictions = rf_model.predict(val_X)

# Calculate the Mean Absolute Error (MAE) of the Random Forest on the validation data.
# scikit-learn's signature is (y_true, y_pred); MAE is symmetric, so the value is unchanged.
rf_val_mae = mean_absolute_error(val_y, rf_model_predictions)

print("MAE validation for the Random Forest model: {:.3f}".format(rf_val_mae))

MAE validation for the Random Forest model: 9.824
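An MAE of 9.824 popularity points is easier to judge next to a trivial baseline that always predicts the training mean. The sketch below shows one way to make that comparison and also computes MAE by hand as the mean of |y − ŷ|. It runs on synthetic data (an assumption, so the block works on its own); in the notebook the same calls would take `train_X`, `val_X`, `train_y`, and `val_y`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the target depends on the
# first feature plus Gaussian noise.
rng = np.random.default_rng(1)
X_demo = rng.random((300, 3))
y_demo = 40 + 30 * X_demo[:, 0] + rng.normal(0, 3, 300)

train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=1
)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(train_X_demo, train_y_demo)
baseline_mae = mean_absolute_error(val_y_demo, baseline.predict(val_X_demo))

# Random forest with the same settings as the notebook's model.
forest = RandomForestRegressor(random_state=1).fit(train_X_demo, train_y_demo)
forest_pred = forest.predict(val_X_demo)
forest_mae = mean_absolute_error(val_y_demo, forest_pred)

# MAE by hand: the average absolute difference between truth and prediction.
manual_mae = np.mean(np.abs(val_y_demo - forest_pred))

print(f"Baseline MAE: {baseline_mae:.3f}")
print(f"Forest MAE:   {forest_mae:.3f} (by hand: {manual_mae:.3f})")
```

If the forest's MAE on the real data is not clearly below the baseline's, the audio features are adding little beyond the average popularity.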


In [None]:
# Testing the Model
# I Me Mine - Remastered 2009 - Popularity: 53
song = pd.DataFrame([{
    "acousticness": 0.179,
    "danceability": 0.291,
    "energy": 0.638,
    "instrumentalness": 0.0,
    "liveness": 0.101,
    "loudness": -7.854,
    "speechiness": 0.0554,
    "tempo": 185.235,
    "valence": 0.525,
    "duration_ms": 145586,
}])

# From Me To You - Mono / Remastered - Popularity: 57
song2 = pd.DataFrame([{
    "acousticness": 0.507,
    "danceability": 0.581,
    "energy": 0.821,
    "instrumentalness": 0.0,
    "liveness": 0.108,
    "loudness": -4.387,
    "speechiness": 0.0318,
    "tempo": 136.145,
    "valence": 0.968,
    "duration_ms": 116160,
}])

# Memphis, Tennessee - Live At The BBC For "Pop Go The Beatles" / 30th July, 1963 - Popularity: 24
song3 = pd.DataFrame([{
    "acousticness": 0.613,
    "danceability": 0.77,
    "energy": 0.411,
    "instrumentalness": 0.00275,
    "liveness": 0.0875,
    "loudness": -11.981,
    "speechiness": 0.0447,
    "tempo": 97.793,
    "valence": 0.538,
    "duration_ms": 135786,
}])

# Boys - Remastered 2009 - Popularity: 47
song4 = pd.DataFrame([{
    "acousticness": 0.607,
    "danceability": 0.402,
    "energy": 0.86,
    "instrumentalness": 0.0,
    "liveness": 0.736,
    "loudness": -10.31,
    "speechiness": 0.0504,
    "tempo": 142.445,
    "valence": 0.822,
    "duration_ms": 146440,
}])

result = rf_model.predict(song3)
print(f'The popularity prediction for this song is: {result[0]:.0f}')

The popularity prediction for this song is: 33
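For "Memphis, Tennessee" the prediction of 33 is 9 points above the listed popularity of 24, which is in line with the validation MAE. Scoring several songs in one call makes such comparisons easier to read. The sketch below is self-contained, so it trains a stand-in model (`demo_model`) on random data; that model and its predictions are placeholders, and in the notebook the fitted `rf_model` would be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

features = ["acousticness", "danceability", "energy", "instrumentalness", "liveness",
            "loudness", "speechiness", "tempo", "valence", "duration_ms"]

# Stand-in model trained on random data (assumption) so the block runs alone.
rng = np.random.default_rng(1)
demo_X = pd.DataFrame(rng.random((200, len(features))), columns=features)
demo_y = rng.integers(20, 80, 200)
demo_model = RandomForestRegressor(random_state=1).fit(demo_X, demo_y)

# Two of the songs defined in the cell above, one row per song.
songs = pd.DataFrame(
    [
        {"acousticness": 0.179, "danceability": 0.291, "energy": 0.638,
         "instrumentalness": 0.0, "liveness": 0.101, "loudness": -7.854,
         "speechiness": 0.0554, "tempo": 185.235, "valence": 0.525,
         "duration_ms": 145586},
        {"acousticness": 0.613, "danceability": 0.77, "energy": 0.411,
         "instrumentalness": 0.00275, "liveness": 0.0875, "loudness": -11.981,
         "speechiness": 0.0447, "tempo": 97.793, "valence": 0.538,
         "duration_ms": 135786},
    ],
    index=["I Me Mine", "Memphis, Tennessee"],
)
# Popularity values taken from the comments in the cell above.
actual = pd.Series([53, 24], index=songs.index)

# One predict call scores every row; selecting `features` keeps the
# column order identical to the one used for training.
predicted = pd.Series(demo_model.predict(songs[features]), index=songs.index)

report = pd.DataFrame({
    "actual": actual,
    "predicted": predicted.round(),
    "abs_error": (actual - predicted).abs(),
})
print(report)
```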
