<font>
<div dir=ltr align=center>
<img src='Sharif_logo.png' width=250 height=250> <br>
<font color=0F5298 size=7>
Applied Data Science<br>
<font color=2565AE size=5>
Spring 2025<br>
<font color=3C99D size=5>
HW8 - Multiclass Classification <br>
<font color=696880 size=4>
Ali Mohammadzade Shabestari - 401106482 - Computer Engineering



# 1. Import Libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from collections import Counter
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.decomposition import PCA

# 2. Loading & Preprocessing Dataset

## 2. 1. Loading

This dataset contains Spotify tracks described by metadata and audio features. After the unnamed index column is dropped, each record has 19 columns: `artists`, `album_name`, `track_name`, `popularity`, `duration_ms`, `explicit`, eleven audio descriptors (`danceability`, `energy`, `key`, `loudness`, `mode`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, `valence`, `tempo`), `time_signature`, and the target `track_genre`.

The task is multiclass classification: predict `track_genre` from the remaining non-text features. To keep training time manageable, the notebook works on a random 10% sample of the file (11,400 rows).

[Dataset Source](https://www.kaggle.com/datasets/keyushnisar/dating-app-behavior-dataset?resource=download)

In [None]:
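# Load the CSV and drop its first column (the unnamed row index)
df = pd.read_csv('spotify_dataset.csv').iloc[:,1:]

# Remove stray whitespace from column names
df.columns = df.columns.str.strip()

# Sample 1/10 of the rows to speed up training
df = df.sample(frac=1/10, random_state=42)
df.head()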

Unnamed: 0,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
113186,Hillsong Worship,No Other Name,No Other Name,50,440247,False,0.369,0.598,7,-6.984,1,0.0304,0.00511,0.0,0.176,0.0466,148.014,4,world-music
42819,Internal Rot,Grieving Birth,Failed Organum,11,93933,False,0.171,0.997,7,-3.586,1,0.118,0.00521,0.801,0.42,0.0294,122.223,4,grindcore
59311,Zhoobin Askarieh;Ali Sasha,Noise A Noise 20.4-1,"Save the Trees, Pt. 1",0,213578,False,0.173,0.803,9,-10.071,0,0.144,0.613,0.00191,0.195,0.0887,75.564,3,iranian
91368,Bryan Adams,All I Want For Christmas Is You,Merry Christmas,0,151387,False,0.683,0.511,6,-5.598,1,0.0279,0.406,0.000197,0.111,0.598,109.991,3,rock
61000,Nogizaka46,バレッタ TypeD,月の大きさ,57,236293,False,0.555,0.941,9,-3.294,0,0.0481,0.484,0.0,0.266,0.813,92.487,4,j-idol


In [169]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11400 entries, 113186 to 93748
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artists           11399 non-null  object 
 1   album_name        11399 non-null  object 
 2   track_name        11399 non-null  object 
 3   popularity        11400 non-null  int64  
 4   duration_ms       11400 non-null  int64  
 5   explicit          11400 non-null  bool   
 6   danceability      11400 non-null  float64
 7   energy            11400 non-null  float64
 8   key               11400 non-null  int64  
 9   loudness          11400 non-null  float64
 10  mode              11400 non-null  int64  
 11  speechiness       11400 non-null  float64
 12  acousticness      11400 non-null  float64
 13  instrumentalness  11400 non-null  float64
 14  liveness          11400 non-null  float64
 15  valence           11400 non-null  float64
 16  tempo             11400 non-null  float64

## 2. 2. Preprocessing

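In this notebook, I drop the text columns `artists`, `album_name`, and `track_name`. They are the only columns shown with a missing value in the `df.info()` output above (11,399 non-null out of 11,400), so dropping them also removes those null values.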

In [170]:
df = df.drop(columns=['artists', 'album_name', 'track_name'])

Encode the genre with `LabelEncoder`, which maps each genre name to a single integer. Because there are many genres, `OneHotEncoder` would add one column per genre and greatly increase dimensionality.

In [171]:
label_encoder = LabelEncoder()
df['track_genre_encoded'] = label_encoder.fit_transform(df['track_genre'])
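
As a small illustration of what the encoder does, the sketch below applies `LabelEncoder` to a toy list built from a few genre names that appear in the preview above. The list and the variable names are assumptions for illustration only, not the notebook's data.

```python
from sklearn.preprocessing import LabelEncoder

# Toy list of genre names (illustrative only, not the real column)
genres = ["rock", "iranian", "j-idol", "rock", "grindcore"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genres)

# classes_ holds the sorted unique labels; the code of a label is its index there
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original strings
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

The integer codes carry no ordering between genres, which is fine here because they are used only as the classification target.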

Split the dataframe into the feature matrix `X` and the target vector `y`.

In [172]:
X = df.drop(columns=['track_genre', 'track_genre_encoded'])
y = df['track_genre_encoded']

In [173]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [174]:
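# Number of distinct genres (classes)
n_classes = df['track_genre_encoded'].nunique()
# Target F1: 2.5 times the reciprocal of the number of genres,
# i.e. 2.5 times the accuracy of uniform random guessing
threshold = 2.5 / df['track_genre'].nunique()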

## 2. 3. Split

In [175]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)
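
Note that the scaler above is fit on the full dataset before the split, so test-set statistics influence the training features. One common way to avoid this is to put the scaler and PCA in a `Pipeline` that is fit on the training split only. The sketch below shows the idea on synthetic data; `make_classification`, the variable names, and the logistic regression classifier are assumptions for illustration, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption: not the Spotify data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=8, n_classes=4, random_state=42
)

# Split first, so the test rows never influence any fitted statistic
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling and PCA are fit inside the pipeline, on the training split only
leak_free = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
leak_free.fit(X_tr, y_tr)

test_accuracy = leak_free.score(X_te, y_te)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

The same pipeline object can be passed to `GridSearchCV`, in which case the scaler and PCA are refit inside every cross-validation fold.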

## 2. 4. Metric Function

This helper prints the weighted F1 score of a model and reports whether it exceeds the desired threshold.

In [176]:
def print_f1(model, true, prediction):
    print(f"🚀 {model}")
    f1 = f1_score(true, prediction, average='weighted')
    print(f"F1 Score: {f1:.4f}")
    print(f"Threshold: {threshold:.4f}")
    print(f"Meets threshold: {f1 > threshold}")
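
This notebook reports weighted F1 in `print_f1` but macro F1 in the KNN and grid-search cells, so the numbers are not directly comparable. The sketch below illustrates the difference on a hypothetical imbalanced toy example: macro averaging gives every class equal weight, while weighted averaging weights each class by its number of true samples. The label lists are made up for illustration.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: class 0 is frequent, class 2 is rare and never predicted
y_true_demo = [0, 0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 0, 0, 1, 0, 0]

# Macro: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)

# Weighted: per-class F1 scores averaged with class support as weights
weighted_f1 = f1_score(y_true_demo, y_pred_demo, average="weighted", zero_division=0)

print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

Here the missed rare class pulls the macro score down more than the weighted score, because it counts as one third of the macro average but only one seventh of the weighted one.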

# 3. Classification Tasks

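As a preprocessing step, I apply PCA to reduce the scaled features to 5 principal components, the orthogonal directions that capture the most variance. PCA is fit on the training split and then applied to the test split. According to the explained variance ratios printed below, the five components retain about 54% of the total variance.

All of the following learning algorithms are trained on this PCA projection of the data.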

In [177]:
# Keep the first 5 principal components
pca = PCA(n_components=5)  

# Fit on the training split, then project both splits
X_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Print explained variance ratio
print("Explained Variance Ratio:", pca.explained_variance_ratio_)

Explained Variance Ratio: [0.19911155 0.10277994 0.09127955 0.08039127 0.07021116]
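
To see how much variance the projection keeps in total, the ratios can be accumulated. The short sketch below does this with the values printed above, copied by hand into an array; the variable names are illustrative.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([0.19911155, 0.10277994, 0.09127955, 0.08039127, 0.07021116])

# Running total of variance retained by the first 1, 2, ..., 5 components
cumulative = np.cumsum(ratios)

for n_components, retained in enumerate(cumulative, start=1):
    print(f"{n_components} component(s): {retained:.4f}")
```

If more variance is needed, `n_components` can be raised. scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` (with `svd_solver='full'`) to choose the number of components that reaches a target fraction of variance.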


## 3. 1. Multiclass SVM

In [184]:
# SVM classifier
svm_classifier = SVC(kernel='rbf', decision_function_shape='ovr', random_state=42)

# Train 
svm_classifier.fit(X_pca, y_train)

# Predict
y_pred_svm = svm_classifier.predict(X_test_pca)

# F1 score
print_f1("SVM Classifier", y_test, y_pred_svm)

🚀 SVM Classifier
F1 Score: 0.1050
Threshold: 0.0219
Meets threshold: True


## 3. 2. Multiclass Logistic Regression

### 3. 2. 1. OvR Logistic Regression

In [178]:
# OvR Classifier
ovr_model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)

# Train
ovr_model.fit(X_pca, y_train)

# Predict
y_pred_ovr = ovr_model.predict(X_test_pca)
y_proba_ovr = ovr_model.predict_proba(X_test_pca)

# F1 Score
print_f1("One-vs-Rest (OvR) Logistic Regression", y_test, y_pred_ovr)



🚀 One-vs-Rest (OvR) Logistic Regression
F1 Score: 0.0782
Threshold: 0.0219
Meets threshold: True
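
Recent scikit-learn releases deprecate the `multi_class` argument of `LogisticRegression`. A one-vs-rest model can instead be built with the `OneVsRestClassifier` wrapper, which fits one binary classifier per class. The sketch below is a minimal example on synthetic data; the dataset and variable names are assumptions, and it is not claimed to reproduce the score above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 4-class problem standing in for the PCA features (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# One binary logistic regression per class; the highest-scoring class wins
ovr_wrapper = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_wrapper.fit(X_tr, y_tr)

y_hat = ovr_wrapper.predict(X_te)
ovr_weighted_f1 = f1_score(y_te, y_hat, average="weighted")
print(f"Number of binary models: {len(ovr_wrapper.estimators_)}")
print(f"Weighted F1: {ovr_weighted_f1:.4f}")
```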


### 3. 2. 2. Multinomial Logistic Regression

In [179]:
# Multinomial Classifier
multinomial_model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)

# Train
multinomial_model.fit(X_pca, y_train)

# Predict
y_pred_multi = multinomial_model.predict(X_test_pca)
y_proba_multi = multinomial_model.predict_proba(X_test_pca)

# Metrics
print_f1("Multinomial Logistic Regression", y_test, y_pred_multi)



🚀 Multinomial Logistic Regression
F1 Score: 0.0851
Threshold: 0.0219
Meets threshold: True


## 3. 3. KNN

In [180]:
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

class CustomKNN:
    def __init__(self, n_neighbors=5):
        self.k = n_neighbors
        
    def fit(self, X, y):
        self.X_train = X
        self.y_train = np.array(y)  # Force to numpy array
            
    def predict(self, X):
        return np.array([self._predict(x) for x in X])
    
    def _predict(self, x):
        # Compute distances to all training points
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Find the k nearest samples
        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = self.y_train[k_indices]
        # Majority vote
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]
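
The class above computes distances one pair at a time in Python, which is slow for thousands of points. The sketch below is a vectorized alternative that computes all pairwise squared distances with NumPy in one step. The function name `knn_predict_vectorized` is hypothetical, it assumes non-negative integer labels (as produced by `LabelEncoder`), and it breaks voting ties in favor of the smallest label, which can differ from the `Counter`-based vote above.

```python
import numpy as np

def knn_predict_vectorized(X_train, y_train, X_test, k=5):
    """Predict labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)

    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    sq_dists = (
        np.sum(X_test ** 2, axis=1)[:, None]
        - 2.0 * X_test @ X_train.T
        + np.sum(X_train ** 2, axis=1)[None, :]
    )

    # Indices of the k closest training points for every test point
    nearest = np.argsort(sq_dists, axis=1)[:, :k]
    neighbor_labels = y_train[nearest]

    # Majority vote per row; ties go to the smallest label
    n_labels = int(y_train.max()) + 1
    return np.array(
        [np.bincount(row, minlength=n_labels).argmax() for row in neighbor_labels]
    )

# Toy check on two well-separated clusters (illustrative data)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
y_toy = np.array([0] * 20 + [1] * 20)
toy_pred = knn_predict_vectorized(X_toy, y_toy, np.array([[0.0, 0.0], [5.0, 5.0]]), k=3)
print(toy_pred)
```

Squared distances are enough here because taking the square root does not change the neighbor ordering.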

In [182]:
best_k = None
best_f1 = 0
f1_scores = []

for k in range(1, 8):  # Try K from 1 to 7
    knn = CustomKNN(n_neighbors=k)
    knn.fit(X_pca, y_train)
    y_pred = knn.predict(X_test_pca)
    f1 = f1_score(y_test, y_pred, average='macro')
    f1_scores.append(f1)
    
    if f1 > best_f1:
        best_f1 = f1
        best_k = k

print(f"Best K found: {best_k} with F1 Score (macro): {best_f1:.4f}")
print(f"KNN model meets threshold: {best_f1 > threshold}")

Best K found: 4 with F1 Score (macro): 0.1039
KNN model meets threshold: True


## 3. 4. Decision Tree

In [160]:
# Decision Tree Classifier
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_pca, y_train)

# Predict
y_pred_dt = tree.predict(X_test_pca)

# F1 Score
print_f1("Decision Tree Classifier", y_test, y_pred_dt)

🚀 Decision Tree Classifier
F1 Score: 0.1107
Threshold: 0.0219
Meets threshold: True


## 3. 5. Boosting Techniques 

### 3. 5. 1. XGBoost

I trained an XGBoost classifier with a multiclass softmax objective. Its weighted F1 score of 0.1420 is above the required threshold and is the highest among the models in this notebook.
XGBoost's built-in regularization, which limits overfitting, is a likely contributor to this result.

In [None]:
# XGBoost model
xgb_model = XGBClassifier(objective='multi:softmax', num_class=n_classes, eval_metric='mlogloss')
xgb_model.fit(X_pca, y_train)

# Predict
y_pred_xgb = xgb_model.predict(X_test_pca)

# F1 Score
print_f1("XGBoost", y_test, y_pred_xgb)

🚀 XGBoost
F1 Score: 0.1420
Threshold: 0.0219
Meets threshold: True


### 3. 5. 2. LightGBM

In [None]:
# LightGBM model
lgbm_model = LGBMClassifier(objective='multiclass', num_class=n_classes, verbose=-1)
lgbm_model.fit(X_pca, y_train)

# Predict
y_pred_lgbm = lgbm_model.predict(X_test_pca)

# F1 Score
print_f1("LightGBM", y_test, y_pred_lgbm)

🚀 LightGBM
F1 Score: 0.0906
Threshold: 0.0219
Meets threshold: True


### 3. 5. 3. AdaBoost

In [None]:
# AdaBoost model (with Decision Stumps)
ada_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    random_state=42
)
ada_model.fit(X_pca, y_train)

# Predict
y_pred_ada = ada_model.predict(X_test_pca)

# F1 Score
print_f1("AdaBoost", y_test, y_pred_ada)



🚀 AdaBoost
F1 Score: 0.0353
Threshold: 0.0219
Meets threshold: True


## 3. 6. Grid Search

In [185]:
# XGBoost model
xgb_model = XGBClassifier(objective='multi:softmax', num_class=n_classes, eval_metric='mlogloss')

# Hyperparameters grid
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'n_estimators': [100, 200, 300]
}

# Grid search with cross-validation
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, scoring='f1_macro', verbose=1, n_jobs=-1)
grid_search.fit(X_pca, y_train)

# Best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Predict with the best model
y_pred_xgb = best_model.predict(X_test_pca)

# F1 Score
print_f1("Best XGBoost from Grid Search", y_test, y_pred_xgb)

Fitting 3 folds for each of 27 candidates, totalling 81 fits
🚀 Best XGBoost from Grid Search
F1 Score: 0.1171
Threshold: 0.0219
Meets threshold: True
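
The search above selects hyperparameters by macro F1 (`scoring='f1_macro'`), while `print_f1` reports weighted F1 on the test set. The sketch below shows one way to align the two by using `scoring='f1_weighted'`, and how to inspect the selected parameters and their cross-validated score. It uses a decision tree on synthetic data to stay lightweight; the dataset, grid, and variable names are assumptions, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem (assumption: stands in for the PCA features)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)

# Small illustrative grid
demo_grid = {"max_depth": [2, 4, 8]}

# Select by the same metric that is reported afterwards
demo_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=demo_grid,
    cv=3,
    scoring="f1_weighted",
)
demo_search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the best candidate
print("Best parameters:", demo_search.best_params_)
print(f"Best CV weighted F1: {demo_search.best_score_:.4f}")
```

Printing `best_params_` also makes it easier to compare the tuned model with a default one, since a grid does not necessarily contain the library's default values.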


# 4. Question!

**❓ Q: Please explain how KNN and decision trees can be extended to multi-label classification problems.**

✅ A: Both KNN and Decision Trees can be extended to multi-label classification, either by transforming the problem into several single-label problems or by letting one model output all labels at once.

For KNN, the simplest option is Binary Relevance: each label is treated as an independent binary classification problem, and for each label the prediction is the majority vote among the k nearest neighbors. Since the neighbors of a point are the same for every label, this amounts to finding the neighbors once and voting label by label.

For Decision Trees, Binary Relevance can also be applied by training a separate tree for each label. Another approach is a multi-output decision tree, in which a single tree predicts all labels at once: each leaf stores a value for every label, and splits are chosen to reduce impurity across all labels together.

Binary Relevance ignores dependencies between labels. To capture them, Classifier Chains can be used with either base learner: the classifiers are arranged in a sequence, and each one predicts a single label from the original features plus the predictions for the earlier labels in the chain. This can improve results when labels are correlated, although performance depends on the order of the chain.
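
The approaches described in this answer can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic multi-label data from `make_multilabel_classification`; the data, variable names, and hyperparameters are assumptions, not part of the assignment.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: Y_ml is a binary indicator matrix with one column per label
X_ml, Y_ml = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=4, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(X_ml, Y_ml, test_size=0.25, random_state=0)

# Binary Relevance: one independent KNN classifier per label
br_knn = MultiOutputClassifier(KNeighborsClassifier(n_neighbors=5))
br_knn.fit(X_tr, Y_tr)

# Multi-output decision tree: a single tree that predicts all labels at once
mo_tree = DecisionTreeClassifier(random_state=0)
mo_tree.fit(X_tr, Y_tr)

# Classifier chain: each tree also sees the predictions for earlier labels
chain = ClassifierChain(DecisionTreeClassifier(random_state=0), order="random", random_state=0)
chain.fit(X_tr, Y_tr)

pred_br = br_knn.predict(X_te)
pred_tree = mo_tree.predict(X_te)
pred_chain = chain.predict(X_te)
print(pred_br.shape, pred_tree.shape, pred_chain.shape)
```

Each prediction is a matrix with one row per sample and one column per label, so a sample can receive several labels or none.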