# Technical Report: Spotify and YouTube Dataset Classification
# Course: AASD 4001 Mathematical Concepts for Machine Learning
# Group: 02
# Date of Submission: September 28, 2025
# Dataset: [Spotify-YouTube Data](https://www.kaggle.com/datasets/rohitgrewal/spotify-youtube-data)

In [46]:
# from google.colab import files
# uploaded = files.upload()

In [47]:
import pandas as pd
from numpy import hstack

df = pd.read_csv('Spotify Youtube Dataset.csv')
display(df.head())

Unnamed: 0.1,Unnamed: 0,Artist,Url_spotify,Track,Album,Album_type,Uri,Danceability,Energy,Key,...,Url_youtube,Title,Channel,Views,Likes,Comments,Description,Licensed,official_video,Stream
0,0,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Feel Good Inc.,Demon Days,album,spotify:track:0d28khcov6AiegSCpG5TuT,0.818,0.705,6.0,...,https://www.youtube.com/watch?v=HyHNuVaZJ-k,Gorillaz - Feel Good Inc. (Official Video),Gorillaz,693555221.0,6220896.0,169907.0,Official HD Video for Gorillaz' fantastic trac...,True,True,1040235000.0
1,1,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Rhinestone Eyes,Plastic Beach,album,spotify:track:1foMv2HQwfQ2vntFf9HFeG,0.676,0.703,8.0,...,https://www.youtube.com/watch?v=yYDmaexVHic,Gorillaz - Rhinestone Eyes [Storyboard Film] (...,Gorillaz,72011645.0,1079128.0,31003.0,The official video for Gorillaz - Rhinestone E...,True,True,310083700.0
2,2,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,New Gold (feat. Tame Impala and Bootie Brown),New Gold (feat. Tame Impala and Bootie Brown),single,spotify:track:64dLd6rVqDLtkXFYrEUHIU,0.695,0.923,1.0,...,https://www.youtube.com/watch?v=qJa-VFwPpYA,Gorillaz - New Gold ft. Tame Impala & Bootie B...,Gorillaz,8435055.0,282142.0,7399.0,Gorillaz - New Gold ft. Tame Impala & Bootie B...,True,True,63063470.0
3,3,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,On Melancholy Hill,Plastic Beach,album,spotify:track:0q6LuUqGLUiCPP1cbdwFs3,0.689,0.739,2.0,...,https://www.youtube.com/watch?v=04mfKJWDSzI,Gorillaz - On Melancholy Hill (Official Video),Gorillaz,211754952.0,1788577.0,55229.0,Follow Gorillaz online:\nhttp://gorillaz.com \...,True,True,434663600.0
4,4,Gorillaz,https://open.spotify.com/artist/3AA28KZvwAUcZu...,Clint Eastwood,Gorillaz,album,spotify:track:7yMiX7n9SBvadzox8T5jzT,0.663,0.694,10.0,...,https://www.youtube.com/watch?v=1V_xRb0x9aw,Gorillaz - Clint Eastwood (Official Video),Gorillaz,618480958.0,6197318.0,155930.0,The official music video for Gorillaz - Clint ...,True,True,617259700.0


# Dataset Description

For this project, we used the Spotify YouTube Dataset: Music Trends Across Platforms, which contains detailed information about 17,841 songs and their performance on Spotify and YouTube. The CSV file loaded here has 20,718 rows and 28 columns, as the df.info() output below shows.

The dataset includes the following key columns:

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20718 entries, 0 to 20717
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        20718 non-null  int64  
 1   Artist            20718 non-null  object 
 2   Url_spotify       20718 non-null  object 
 3   Track             20718 non-null  object 
 4   Album             20718 non-null  object 
 5   Album_type        20718 non-null  object 
 6   Uri               20718 non-null  object 
 7   Danceability      20716 non-null  float64
 8   Energy            20716 non-null  float64
 9   Key               20716 non-null  float64
 10  Loudness          20716 non-null  float64
 11  Speechiness       20716 non-null  float64
 12  Acousticness      20716 non-null  float64
 13  Instrumentalness  20716 non-null  float64
 14  Liveness          20716 non-null  float64
 15  Valence           20716 non-null  float64
 16  Tempo             20716 non-null  float6

Artist - Name of the performer or band of the song.

Url_spotify/Url_youtube - Links to the song on Spotify and YouTube.

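Track - Title of the song.

Album - The album name.

Album_type - Type of release the track belongs to (for example, album or single).

Uri - Spotify URI that identifies the track (for example, spotify:track:0d28khcov6AiegSCpG5TuT).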

Danceability - A score (0–1) of how suitable a track is for dancing. Higher values mean more danceable.

Energy - A score (0–1) representing intensity and activity. High-energy tracks feel fast and loud.

Key - Musical key of the track (integers 0–11, where each number represents a pitch class).

Loudness - Overall loudness of the track in decibels (dB).

Speechiness - Measures the presence of spoken words in the track. Higher values indicate more speech-like recordings.

Acousticness - Probability (0–1) that the track is acoustic.

Instrumentalness - Predicts whether a track contains no vocals. Higher values indicate a greater likelihood that the track is instrumental.

Liveness - Detects the presence of an audience in the recording. Higher values indicate a greater likelihood that the track was performed live.

Valence - Describes the musical positivity of a track. Higher valence means a happier, more cheerful sound.

Tempo - The estimated tempo of the track in beats per minute (BPM).

Duration_ms - Length of the track in milliseconds.

Title - Title of the YouTube video.

Channel - Name of the YouTube channel that published the video.

Views - Total number of views on YouTube.

Likes - Total number of likes.

Comments - Number of user comments on the video.

Description - Text description of the YouTube video.

Licensed - A boolean (True/False) feature indicating whether the track is officially licensed.

official_video - Indicates whether the video is the official music video of the track. (True = official, False = unofficial)

Stream - Number of times the track was streamed on Spotify.
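
As a small illustration of the Key column described above, the sketch below decodes the integer pitch classes into note names using the standard numbering (0 = C, 1 = C#/Db, ..., 11 = B). The Key values are copied from the first three rows shown by df.head(); the names PITCH_CLASSES and Key_name are assumptions introduced for this example and are not part of the dataset.

```python
import pandas as pd

# Standard pitch-class numbering: 0 = C, 1 = C#/Db, ..., 11 = B.
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]

# Key values taken from the first three rows of df.head(); stored as floats in the CSV.
sample = pd.DataFrame({
    "Track": ["Feel Good Inc.", "Rhinestone Eyes",
              "New Gold (feat. Tame Impala and Bootie Brown)"],
    "Key": [6.0, 8.0, 1.0],
})

# Cast to int before indexing into the list of note names.
sample["Key_name"] = sample["Key"].astype(int).map(lambda k: PITCH_CLASSES[k])
print(sample)
```

Because Key is a categorical code rather than a quantity, treating it as a plain number (as the scaling step later does) is a modelling simplification.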

# Target Variable

For this project, we created a new target variable called popularity_class based on the Views column. Songs were divided into Low, Medium, and High popularity at the one-third and two-thirds quantiles of Views (bottom third, middle third, top third), so the three classes are roughly equal in size.
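
The quantile split can be illustrated on a handful of made-up values. This is a minimal sketch: the view counts below are hypothetical and only show how pd.qcut assigns the three labels in ascending order of Views.

```python
import pandas as pd

# Hypothetical view counts, for illustration only (not taken from the dataset).
views = pd.Series([120, 4_500, 98_000, 1_200_000, 35_000_000, 640_000_000])

# q=3 cuts at the 1/3 and 2/3 quantiles; labels are assigned from lowest to highest bin.
popularity = pd.qcut(views, q=3, labels=["Low", "Medium", "High"])

print(popularity.tolist())
print(popularity.value_counts().to_dict())
```

Since the bin edges are computed from the data itself, the class boundaries depend on which rows are present when pd.qcut is called.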

# Data Preprocessing and Preparation

In [49]:
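# Create popularity classes based on quantiles
# .copy() makes the filtered frame independent, so adding a column cannot trigger SettingWithCopyWarning
df = df.dropna(subset=["Views"]).copy()
df['popularity_class'] = pd.qcut(df['Views'], q=3, labels=['Low', 'Medium', 'High'])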

# drop URLs, descriptive fields, and Views
drop_cols = ["Unnamed: 0", "Artist", "Url_spotify", "Track", "Album", "Uri", "Url_youtube", "Title", "Channel", "Description", "Views"]
df = df.drop(drop_cols, axis=1)

# Separate features and target
y = df["popularity_class"]
X = df.drop(columns=["popularity_class"])
numeric_features = X.select_dtypes(include=["int64","float64"]).columns
categorical_features = X.select_dtypes(include=["object","category"]).columns

# Remove duplicates
X = X.drop_duplicates()
y = y.loc[X.index]

In [50]:
print(X.isnull().sum())

Album_type            0
Danceability          1
Energy                1
Key                   1
Loudness              1
Speechiness           1
Acousticness          1
Instrumentalness      1
Liveness              1
Valence               1
Tempo                 1
Duration_ms           1
Likes                71
Comments             98
Licensed              0
official_video        0
Stream              554
dtype: int64


In [51]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# convert data to numeric form
le = LabelEncoder()
X["Album_type"] = le.fit_transform(X["Album_type"])
boolMap = {"True": 1, "False": 0, True: 1, False: 0}
X["Licensed"] = X["Licensed"].map(boolMap)
X["official_video"] = X["official_video"].map(boolMap)

# drop small missing data
X = X.dropna(subset=["Danceability", "Energy", "Key", "Loudness", "Speechiness", "Acousticness", "Instrumentalness", "Liveness", "Valence", "Tempo", "Duration_ms"])
y = y.loc[X.index]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# fill missing value with median
num_imputer = SimpleImputer(strategy="median")
X_train_num = num_imputer.fit_transform(X_train[numeric_features])
X_test_num = num_imputer.transform(X_test[numeric_features])

# fill missing value with mode for categorical True/False
cat_imputer = SimpleImputer(strategy="most_frequent")
X_train_cat = cat_imputer.fit_transform(X_train[categorical_features])
X_test_cat = cat_imputer.transform(X_test[categorical_features])

# scale
scaler = StandardScaler()
X_train_num_scaled = scaler.fit_transform(X_train_num)
X_test_num_scaled = scaler.transform(X_test_num)

# Recombining numerical + categorical data
X_train_final = hstack([X_train_num_scaled, X_train_cat])
X_test_final = hstack([X_test_num_scaled, X_test_cat])
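
The impute, scale, and recombine steps above could also be expressed as a single scikit-learn ColumnTransformer, which keeps the fit-on-train, transform-on-test discipline in one object. The following is a sketch on a toy frame with hypothetical values, not the notebook's actual code; the names toy, numeric_cols, categorical_cols, and preprocess are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame with hypothetical values; column names mirror some of the notebook's features.
toy = pd.DataFrame({
    "Likes": [10.0, np.nan, 30.0, 50.0],
    "Tempo": [100.0, 120.0, 140.0, 160.0],
    "Licensed": [1, 0, 1, 1],
})
numeric_cols = ["Likes", "Tempo"]
categorical_cols = ["Licensed"]

preprocess = ColumnTransformer([
    # Numeric columns: median imputation followed by standardization.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Encoded categorical columns: most-frequent imputation, no scaling.
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

train, test = toy.iloc[:3], toy.iloc[3:]
X_train_toy = preprocess.fit_transform(train)  # statistics learned from train only
X_test_toy = preprocess.transform(test)        # reused unchanged on test
print(X_train_toy.shape, X_test_toy.shape)
```

The output columns follow the order of the transformer list (numeric first, then categorical), which matches the order produced by the hstack calls.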

# Training with All Features

In [52]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, precision_recall_fscore_support

svm = SVC(random_state=42)
svm.fit(X_train_final, y_train)
y_pred = svm.predict(X_test_final)
prec_b, rec_b, f1_b, _ = precision_recall_fscore_support(y_test, y_pred, average='macro', zero_division=0)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print(f"F1: {f1_b:.4f}\n")

              precision    recall  f1-score   support

        High       0.92      0.81      0.86      1328
         Low       0.73      0.77      0.75      1306
      Medium       0.62      0.66      0.64      1314

    accuracy                           0.75      3948
   macro avg       0.76      0.75      0.75      3948
weighted avg       0.76      0.75      0.75      3948

Accuracy: 0.7479736575481256
F1: 0.7509
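
The F1 value printed last is the macro average, that is, the unweighted mean of the per-class F1 scores; in the report above, (0.86 + 0.75 + 0.64) / 3 = 0.75. The sketch below checks that relationship on hypothetical labels; the arrays y_true_demo and y_pred_demo are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels, for illustration only.
classes = ["High", "Low", "Medium"]
y_true_demo = ["High", "High", "Low", "Low", "Medium", "Medium"]
y_pred_demo = ["High", "Medium", "Low", "Low", "Medium", "Low"]

# average=None returns one F1 score per class, in the order given by `labels`.
per_class = f1_score(y_true_demo, y_pred_demo, average=None, labels=classes)
macro = f1_score(y_true_demo, y_pred_demo, average="macro")

print(dict(zip(classes, per_class.round(3))))
print(round(macro, 4), round(per_class.mean(), 4))
```

Macro averaging gives each class equal weight, so a weak Medium class pulls the score down even when High and Low are classified well.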



# Feature Selection

In [53]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_train_selected = selector.fit_transform(X_train_final, y_train)
X_test_selected = selector.transform(X_test_final)

svm = SVC(random_state=42)
svm.fit(X_train_selected, y_train)
y_pred = svm.predict(X_test_selected)
prec_b, rec_b, f1_b, _ = precision_recall_fscore_support(y_test, y_pred, average='macro', zero_division=0)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print(f"F1: {f1_b:.4f}\n")

              precision    recall  f1-score   support

        High       0.91      0.84      0.87      1328
         Low       0.77      0.86      0.81      1306
      Medium       0.70      0.67      0.68      1314

    accuracy                           0.79      3948
   macro avg       0.79      0.79      0.79      3948
weighted avg       0.79      0.79      0.79      3948

Accuracy: 0.7887537993920972
F1: 0.7884
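
The cell above does not report which ten features SelectKBest kept. One way to recover them is get_support(), sketched here on synthetic data; the feature names f0 to f5 and the variables X_demo, y_demo, and selector_demo are hypothetical, and functools.partial is used only to fix the random seed of mutual_info_classif.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data; the feature names are hypothetical.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

# mutual_info_classif is randomized, so the seed is fixed for repeatable scores.
selector_demo = SelectKBest(score_func=partial(mutual_info_classif, random_state=42), k=3)
selector_demo.fit(X_demo, y_demo)

# get_support() returns a boolean mask aligned with the input columns.
selected = [name for name, keep in zip(feature_names, selector_demo.get_support()) if keep]
scores = pd.Series(selector_demo.scores_, index=feature_names).sort_values(ascending=False)

print(selected)
print(scores)
```

In the notebook, the matching name list would presumably be list(numeric_features) + list(categorical_features), since that is the column order produced by hstack.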



In [54]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.001],
    "kernel": ["linear", "rbf", "poly"]
}

grid = GridSearchCV(
    estimator=SVC(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    verbose=2
)

grid.fit(X_train_final, y_train)

print("Best Parameters:", grid.best_params_)
print("Best CV Score:", grid.best_score_)
print("Best Estimator:", grid.best_estimator_)

best_svm = grid.best_estimator_
y_pred = best_svm.predict(X_test_final)
prec_b, rec_b, f1_b, _ = precision_recall_fscore_support(y_test, y_pred, average='macro', zero_division=0)

print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print(f"F1: {f1_b:.4f}\n")

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Parameters: {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}
Best CV Score: 0.8573780873970869
Best Estimator: SVC(C=10, kernel='linear', random_state=42)
              precision    recall  f1-score   support

        High       0.90      0.85      0.88      1328
         Low       0.89      0.92      0.90      1306
      Medium       0.78      0.79      0.78      1314

    accuracy                           0.85      3948
   macro avg       0.85      0.85      0.85      3948
weighted avg       0.85      0.85      0.85      3948

Accuracy: 0.853596757852077
F1: 0.8536
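
One detail of the grid above: SVC ignores gamma when kernel="linear", so the three gamma values produce duplicate linear candidates. A list of dictionaries avoids those redundant fits. The sketch below only counts candidates with ParameterGrid; the names full_grid and split_grid are assumptions for this example.

```python
from sklearn.model_selection import ParameterGrid

# The grid used in the notebook: every combination of C, gamma, and kernel.
full_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", 0.01, 0.001],
    "kernel": ["linear", "rbf", "poly"],
}

# Alternative: gamma is searched only for the kernels that actually use it.
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf", "poly"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]},
]

print(len(ParameterGrid(full_grid)), len(ParameterGrid(split_grid)))
```

GridSearchCV accepts the list form directly as param_grid, so with cv=5 the search would drop from 135 fits to 105 while covering the same distinct models.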



# Drop Engagement Columns (Likes, Comments, Stream)

In [77]:
X_no_engagement = X.drop(columns=['Likes', 'Comments', 'Stream'], errors='ignore').copy()

# drop small missing data
X_no_engagement = X_no_engagement.dropna(subset=["Danceability", "Energy", "Key", "Loudness", "Speechiness", "Acousticness", "Instrumentalness", "Liveness", "Valence", "Tempo", "Duration_ms"])
X_no_engagement = X_no_engagement.drop_duplicates()
y_no_engagement = y.loc[X_no_engagement.index]
numeric_features_no_engagement = ['Danceability', 'Energy', 'Key', 'Loudness', 'Speechiness',
       'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo',
       'Duration_ms']

X_train, X_test, y_train, y_test = train_test_split(X_no_engagement, y_no_engagement, test_size=0.2, random_state=42)

# fill missing value with median
num_imputer = SimpleImputer(strategy="median")
X_train_num = num_imputer.fit_transform(X_train[numeric_features_no_engagement])
X_test_num = num_imputer.transform(X_test[numeric_features_no_engagement])

# fill missing value with mode for categorical True/False
cat_imputer = SimpleImputer(strategy="most_frequent")
X_train_cat = cat_imputer.fit_transform(X_train[categorical_features])
X_test_cat = cat_imputer.transform(X_test[categorical_features])

# scale
scaler = StandardScaler()
X_train_num_scaled = scaler.fit_transform(X_train_num)
X_test_num_scaled = scaler.transform(X_test_num)

# Recombining numerical + categorical data
X_train_final = hstack([X_train_num_scaled, X_train_cat])
X_test_final = hstack([X_test_num_scaled, X_test_cat])

svm = SVC(random_state=42)
svm.fit(X_train_final, y_train)
y_pred = svm.predict(X_test_final)
prec_b, rec_b, f1_b, _ = precision_recall_fscore_support(y_test, y_pred, average='macro', zero_division=0)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print(f"F1: {f1_b:.4f}\n")

              precision    recall  f1-score   support

        High       0.50      0.69      0.58      1173
         Low       0.60      0.65      0.62      1264
      Medium       0.42      0.23      0.30      1289

    accuracy                           0.52      3726
   macro avg       0.50      0.52      0.50      3726
weighted avg       0.50      0.52      0.50      3726

Accuracy: 0.5177133655394525
F1: 0.4989



Hypothesis: 'Likes', 'Comments', and 'Stream' directly affect the popularity class.

Observation:
1. Baseline (All Features, No Tuning)
Accuracy = 74.8%, macro F1 = 0.7509. The model had good precision and recall for High (0.92 precision, 0.81 recall) but struggled with Medium (F1 = 0.64).
2. Tuned Model (GridSearchCV, All Features)
Accuracy improved to 85.36%, macro F1 = 0.8536. Performance across classes became more balanced, although Medium remained the weakest class (F1 = 0.78).
3. Dropped Features: Likes, Comments, Stream
Accuracy dropped to 51.77%, macro F1 = 0.4989. This severe decline is consistent with the hypothesis: without the engagement counts, the remaining features carry much less information about Views.
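
For a side-by-side view, the accuracy and macro F1 values printed by the four experiments can be collected in one table. This is a sketch: the row labels and the lift_over_chance column are assumptions introduced here, and 1/3 is only an approximate chance level for three roughly equal classes.

```python
import pandas as pd

# Accuracy and macro F1 copied from the printed outputs of each experiment.
results = pd.DataFrame(
    {
        "accuracy": [0.7480, 0.7888, 0.8536, 0.5177],
        "macro_f1": [0.7509, 0.7884, 0.8536, 0.4989],
    },
    index=["baseline", "select_k_best", "grid_search", "no_engagement"],
)

# Approximate chance level for three roughly balanced classes.
chance = 1 / 3
results["lift_over_chance"] = results["accuracy"] - chance

print(results.round(4))
```

The last row is not strictly comparable with the others: that experiment re-ran drop_duplicates and split without stratify, so its test set (3,726 rows) differs from the one used earlier (3,948 rows).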