# Introduction
Welcome to our SENG 550 Final Project!

In this project, we focused on extracting, transforming, and loading data from the Million Song Dataset to train a machine learning model capable of predicting unseen values. We utilized a subset of the dataset containing 10,000 samples. After data cleaning and preprocessing, our final dataset comprised 9,503 samples.

The primary objective of this project is to predict artist popularity based on a carefully selected set of features. From the dataset, we extracted four key features that we hypothesize influence artist popularity:

1. Duration: The length of the track.
2. Loudness: The average decibel level of the track.
3. Tempo: The speed or pace of the music, measured in beats per minute (BPM).
4. Familiarity: A metric representing how well-known the artist is.

This notebook documents our end-to-end process, from data preparation and feature engineering to model training, evaluation, and prediction.
It is divided into three parts: data extraction, data transformation, and data loading with model training.

# PART A: Data Extraction

In [31]:
#Necessary installations
!pip install pyspark
!pip install h5py
#Download the hdf5_getters.py helper module, which provides the getters used for extracting data
!wget -O hdf5_getters.py https://raw.githubusercontent.com/tbertinmahieux/MSongsDB/refs/heads/master/PythonSrc/hdf5_getters.py

--2024-12-19 23:01:34--  https://raw.githubusercontent.com/tbertinmahieux/MSongsDB/refs/heads/master/PythonSrc/hdf5_getters.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21702 (21K) [text/plain]
Saving to: ‘hdf5_getters.py’


2024-12-19 23:01:34 (109 MB/s) - ‘hdf5_getters.py’ saved [21702/21702]



In [32]:
#imports

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, FloatType, StringType, IntegerType

import hdf5_getters as getters
import h5py

import pandas as pd
import numpy as np
import os

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

**IMPORTANT: Please don't run the next 3 code blocks, as they take a very long time to finish running. The final output is retrieved from our GitHub repository with a !wget statement later in the notebook.**

In [None]:
#Downloading the Million Song Subset to the workspace
!wget http://labrosa.ee.columbia.edu/~dpwe/tmp/millionsongsubset.tar.gz -O /content/MillionSongSubset.tar.gz

--2024-12-18 00:47:37--  http://labrosa.ee.columbia.edu/~dpwe/tmp/millionsongsubset.tar.gz
Resolving labrosa.ee.columbia.edu (labrosa.ee.columbia.edu)... 128.59.66.11
Connecting to labrosa.ee.columbia.edu (labrosa.ee.columbia.edu)|128.59.66.11|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1981914968 (1.8G) [application/x-gzip]
Saving to: ‘/content/MillionSongSubset.tar.gz’


2024-12-18 00:50:30 (10.9 MB/s) - ‘/content/MillionSongSubset.tar.gz’ saved [1981914968/1981914968]



In [None]:
#Unzipping the Data and confirming files.
!tar -xvzf /content/MillionSongSubset.tar.gz -C /content/
!ls /content/MillionSongSubset/

Streaming output truncated to the last 5000 lines.
MillionSongSubset/A/I/R/TRAIRNE128F146B1FA.h5
MillionSongSubset/A/I/R/TRAIRWM12903CE75BB.h5
MillionSongSubset/A/I/R/TRAIRAL12903CCE808.h5
MillionSongSubset/A/I/R/TRAIRIK128F92F0212.h5
MillionSongSubset/A/I/R/TRAIRRA128F42788ED.h5
MillionSongSubset/A/I/R/TRAIRGP128F9337847.h5
MillionSongSubset/A/I/R/TRAIRVF128F932B0D6.h5
MillionSongSubset/A/I/R/TRAIRNT128EF33FF36.h5
MillionSongSubset/A/I/R/TRAIRRP128EF3669DC.h5
MillionSongSubset/A/I/R/TRAIRLU128F92D0250.h5
MillionSongSubset/A/I/R/TRAIRSH128F92E4B18.h5
MillionSongSubset/A/I/R/TRAIRGY128F4271023.h5
MillionSongSubset/A/I/B/
MillionSongSubset/A/I/B/TRAIBHW128F146721B.h5
MillionSongSubset/A/I/B/TRAIBXJ128F4260700.h5
MillionSongSubset/A/I/B/TRAIBJP128F932B2FF.h5
MillionSongSubset/A/I/B/TRAIBWR128F42390B3.h5
MillionSongSubset/A/I/B/TRAIBZF128F4275B4A.h5
MillionSongSubset/A/I/B/TRAIBHP128F149C2ED.h5
MillionSongSubset/A/I/B/TRAIBMN128E0780C96.h5
MillionSongSubset/A/I/B/TRAIBXQ128F4

In [None]:
#Initialize Spark session
spark_artist = SparkSession.builder.appName("MillionSongSubset_artist").getOrCreate()

data_dir = "/content/MillionSongSubset"

#Define schema for the Spark DataFrame
schema_artist = StructType([
    StructField("track_id", StringType(), True),
    StructField("artist_id", StringType(), True),
    StructField("artist_name", StringType(), True),
    StructField("release", StringType(), True),
    StructField("title", StringType(), True),
    StructField("duration", FloatType(), True),
    StructField("loudness", FloatType(), True),
    StructField("tempo", FloatType(), True),
    StructField("familiarity", FloatType(), True),
    StructField("hotness", FloatType(), True),
])

#Initialize empty DataFrame
df_artist = spark_artist.createDataFrame([], schema_artist)

#Process files in chunks due to memory limitations
batch_size = 50
file_count = 0

for root, dirs, files in os.walk(data_dir):
    for file in files:
        if file.endswith(".h5"):
            file_path = os.path.join(root, file)
            with h5py.File(file_path, "r") as h5:
                # Extract features
                track_id = h5["analysis"]["songs"]["track_id"][0].decode()
                artist_id = h5["metadata"]["songs"]["artist_id"][0].decode()
                artist_name = h5["metadata"]["songs"]["artist_name"][0].decode()
                release = h5["metadata"]["songs"]["release"][0].decode()
                title = h5["metadata"]["songs"]["title"][0].decode()
                duration = float(h5["analysis"]["songs"]["duration"][0])
                loudness = float(h5["analysis"]["songs"]["loudness"][0])  # Loudness
                tempo = float(h5["analysis"]["songs"]["tempo"][0])  # Tempo
                familiarity = float(h5["metadata"]["songs"]["artist_familiarity"][0])
                hotness = float(h5["metadata"]["songs"]["artist_hotttnesss"][0])



                #Append to DataFrame
                row = [(track_id, artist_id, artist_name, release, title, duration, loudness, tempo, familiarity, hotness)]
                temp_df_artist = spark_artist.createDataFrame(row, schema_artist)
                df_artist = df_artist.union(temp_df_artist)

            file_count += 1

            #Save progress and clear memory every batch
            if file_count % batch_size == 0:
                df_artist.write.mode("append").parquet("songs_chunked_artist.parquet")
                df_artist = spark_artist.createDataFrame([], schema_artist)  # Reset DataFrame

#Save remaining data
df_artist.write.mode("append").parquet("songs_chunked_artist.parquet")

#Confirming all rows were correctly extracted
final_df_artist = spark_artist.read.parquet("songs_chunked_artist.parquet")
final_df_artist.toPandas().to_csv("output_file.csv", index=False)
#Show up to the first 1000 rows
final_df_artist.show(1000)
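
The loop above unions one single-row DataFrame per file, so the Spark query plan grows with every track. A lighter alternative, sketched here in plain Python, is to collect row tuples in a list and hand each full batch to a writer in one call. `iter_batches`, `process`, and `write_batch` are hypothetical names that do not appear in the notebook. The writer is a stand-in for a call such as `spark_artist.createDataFrame(batch, schema_artist).write.mode("append").parquet(...)`, and the demo rows are made up.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple  # one extracted track, e.g. (track_id, ..., hotness)


def iter_batches(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield lists of at most batch_size rows, preserving input order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial, batch
        yield batch


def process(rows: Iterable[Row], batch_size: int,
            write_batch: Callable[[List[Row]], None]) -> int:
    """Pass every batch to write_batch and return the number of rows written."""
    written = 0
    for batch in iter_batches(rows, batch_size):
        write_batch(batch)  # hypothetical sink; in Spark this would be one write per batch
        written += len(batch)
    return written


# Made-up rows standing in for the tuples built from each .h5 file.
fake_rows = [(f"TR{i:03d}", float(i)) for i in range(7)]
sink: List[List[Row]] = []
total = process(fake_rows, 3, sink.append)
batch_lengths = [len(b) for b in sink]
print(total, batch_lengths)
```

Building one DataFrame per batch rather than per file keeps the number of Spark jobs proportional to the number of batches, which is the main reason to prefer this shape.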

# PART B: Data Transformation

In [35]:
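#Download the CSV file from the GitHub repository, load it into a pandas DataFrame, and print the first 5 rows to confirm extraction
#Note: without -O, wget saves the download as output_file.csv.1 when output_file.csv already exists, and read_csv below then reads the existing file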
!wget https://raw.githubusercontent.com/Kazmi20/artist-popularity-prediction/refs/heads/main/output_file.csv
df_extracted = pd.read_csv("output_file.csv")
df_extracted.head(5)

--2024-12-19 23:02:44--  https://raw.githubusercontent.com/Kazmi20/artist-popularity-prediction/refs/heads/main/output_file.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1392783 (1.3M) [text/plain]
Saving to: ‘output_file.csv.1’


2024-12-19 23:02:44 (122 MB/s) - ‘output_file.csv.1’ saved [1392783/1392783]



Unnamed: 0,track_id,artist_id,artist_name,release,title,duration,loudness,tempo,familiarity,hotness
0,TRBBLXX128F92C3D8C,ARK7ZPW1187B99C170,Maria Callas/Eugenio Fernandi/Elisabeth Schwar...,Callas Sings Puccini,Turandot (2008 Digital Remaster)_ Act III - Sc...,172.90404,-13.963,109.721,0.612859,0.360345
1,TRATMXM128F1457ACB,ARK7ZPW1187B99C170,Maria Callas/Giuseppe di Stefano/Ettore Bastia...,Verdi - La Traviata (Highlights),La Traviata - highlights (1990 Digital Remaste...,112.14322,-20.02,96.775,0.612859,0.360345
2,TRATRDP128F4261581,AR5R7791187FB3A8C3,Alfredo Kraus/Paul Plishka/Matteo Manuguerra/S...,Puccini: La Bohème - Highlights,Caro! (Colline/Rodolfo/Schaunard/Children/Town...,34.55955,-22.785,130.319,0.477387,0.363611
3,TRBBQYU128F4269447,ARK7ZPW1187B99C170,Maria Callas/Gianni Raimondi/Gabriella Cartura...,Donizetti: Anna Bolena,Anna Bolena (1997 Digital Remaster): Alcun pot...,102.68689,-14.447,111.643,0.612859,0.360345
4,TRATESI128F932F694,AR1X2H01187B98CCF2,Maurice Chevalier / Françoise Dorin / Marina H...,Christiné: Dede,Dialogue et final si j'avais su,213.55057,-20.541,164.618,0.502453,0.391585


In [36]:
#Get Number of NaN/Null and 0's
nan_count = df_extracted['hotness'].isna().sum()
zero_count = (df_extracted['hotness'] == 0).sum()
print(f"Number of NaN/Null values in 'artist hotttnesss': {nan_count}")
print(f"Number of 0 values in 'artist hotttnesss': {zero_count}")

#Removing rows with 0 in Hotness field and confirming
df_extracted = df_extracted[df_extracted['hotness'] != 0]
zero_count = (df_extracted['hotness'] == 0).sum()
print(f"Number of 0 values in 'artist hotttnesss' after dropping: {zero_count}")

#dropping all rows with nulls in other fields.
df_extracted = df_extracted.dropna()
print(f"Number of rows after dropping nulls: {len(df_extracted)}")

Number of NaN/Null values in 'artist hotttnesss': 0
Number of 0 values in 'artist hotttnesss': 496
Number of 0 values in 'artist hotttnesss' after dropping: 0
Number of rows after dropping nulls: 9503


In [27]:
#Select the feature columns and the target column

Final_columns = ['duration', 'loudness', 'tempo', 'familiarity']
target = 'hotness'

X = df_extracted.loc[:, Final_columns]
y = df_extracted.loc[:, target]

In [28]:
#Scaling the data in the features using a standard scaler to potentially improve model performance

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)

[[-0.5814851  -0.65622104 -0.37893602  0.2274745 ]
 [-1.1210248  -1.78208806 -0.7467947   0.2274745 ]
 [-1.80994688 -2.29604255  0.20635311 -0.6905386 ]
 ...
 [ 4.41517169  0.51500038  1.10426246 -0.34622439]
 [-0.43303057  1.03453122  2.3494013  -0.13501056]
 [ 0.44609914  0.9546034  -0.85465749 -1.18666642]]
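
The cell above fits the scaler on every row before the train/test split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to compute those statistics from the training rows only and reuse them on the test rows. The sketch below shows the idea without any dependencies. `fit_standardizer` and `apply_standardizer` are hypothetical helpers, and the numbers are made up. With scikit-learn, the equivalent is to call `fit_transform` on the training split and `transform` on the test split.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Return per-column means and population standard deviations."""
    columns = list(zip(*rows))
    means = [fmean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Scale rows with statistics that were computed elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Made-up two-feature data (think duration and loudness).
train = [[100.0, -10.0], [200.0, -20.0], [300.0, -30.0]]
test = [[400.0, -40.0]]

means, stds = fit_standardizer(train)                 # statistics from training rows only
train_scaled = apply_standardizer(train, means, stds)
test_scaled = apply_standardizer(test, means, stds)   # test rows reuse the training statistics
print(means, test_scaled)
```

Because the test row lies outside the training range, its scaled values fall well outside the range of the scaled training rows, which is what a model would also see on new data.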


In [29]:
#splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

print("Training Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
print("Training Target Shape:", y_train.shape)
print("Testing Target Shape:", y_test.shape)

Training Features Shape: (7602, 4)
Testing Features Shape: (1901, 4)
Training Target Shape: (7602,)
Testing Target Shape: (1901,)
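
The shapes printed above follow from the row count. With 9,503 rows and `test_size=0.2`, the test set gets the fraction rounded up and the training set gets the remainder. The arithmetic below reproduces those numbers. It assumes the round-up rule that scikit-learn applies to a fractional `test_size`.

```python
import math

n_samples = 9503   # rows left after cleaning
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # fractional test size is rounded up
n_train = n_samples - n_test               # the remaining rows are used for training
print(n_train, n_test)
```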


In [45]:
#Function used for evaluating models

def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)

    print(f"{model_name} Performance:")
    print(f"Mean Squared Error: {mse:.4f}")
    print(f"R2 Score: {r2:.4f}")
    print(f"Mean Absolute Error: {mae:.4f}")
    print("\n")
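
For reference, the three metrics reported by `evaluate_model` can be written out by hand. This is a minimal sketch of their standard definitions on a made-up example, not the scikit-learn implementation. The helper names `mse`, `mae`, and `r2` are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and predictions on the same 0-1 scale as the target column.
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.3, 0.4, 0.5, 0.8]

print(f"MSE: {mse(y_true, y_pred):.4f}")
print(f"MAE: {mae(y_true, y_pred):.4f}")
print(f"R2:  {r2(y_true, y_pred):.4f}")
```

MSE is in squared units of the target, so on a target bounded between 0 and 1 it is always numerically small. MAE and R2 are therefore easier to interpret when comparing the models below.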

# PART C: Data Loading and Model Training

In [46]:
#Model creation and training

#Linear Regression Model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
evaluate_model(linear_model, X_test, y_test, "Linear Regression")

#Ridge Regression Model
ridge = Ridge(alpha=1.0)  # Alpha is the regularization strength
ridge.fit(X_train, y_train)
evaluate_model(ridge, X_test, y_test, "Ridge Regression")

#Lasso Regression Model
lasso = Lasso(alpha=0.01)  # Alpha is the regularization strength
lasso.fit(X_train, y_train)
evaluate_model(lasso, X_test, y_test, "Lasso Regression")

#Random Forest Regressor Model
rf = RandomForestRegressor(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)
best_rf = grid_search.best_estimator_
print("Best Random Forest Parameters:", grid_search.best_params_)
evaluate_model(best_rf, X_test, y_test, "Random Forest Regressor")

Linear Regression Performance:
Mean Squared Error: 0.0045
R2 Score: 0.6868
Mean Absolute Error: 0.0450


Ridge Regression Performance:
Mean Squared Error: 0.0045
R2 Score: 0.6868
Mean Absolute Error: 0.0450


Lasso Regression Performance:
Mean Squared Error: 0.0048
R2 Score: 0.6719
Mean Absolute Error: 0.0455


Best Random Forest Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Random Forest Regressor Performance:
Mean Squared Error: 0.0030
R2 Score: 0.7931
Mean Absolute Error: 0.0380
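
The grid searched above is small enough to enumerate. The sketch below counts the candidate settings and the number of cross-validation fits implied by `cv=3`. It uses only the standard library and copies `param_grid` from the cell above. The variable names `candidates` and `n_cv_fits` are introduced here for illustration.

```python
from itertools import product

# Same grid as in the GridSearchCV cell above.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}
cv_folds = 3

# Every combination of one value per hyperparameter.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_cv_fits = len(candidates) * cv_folds  # one fit per candidate per fold
print(len(candidates), n_cv_fits)
```

With GridSearchCV's default `refit=True`, one additional fit on the full training set produces `best_estimator_`. The selected setting uses the largest `n_estimators` in the grid, so a wider grid might be worth exploring.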


