## Problem Statement

The goal is to build an ML model that predicts whether a user will replay a song within a month.
<br>The dataset contains song plays, timestamps, and user-song history.

Target Variable:<br>
1 - The user has replayed the song within a month.<br>
0 - The user has not replayed the song within a month.

## Import Libraries

In [271]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [272]:
# Load dataset
df = pd.read_csv("/kaggle/input/spotify-dataset/song_dataset.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Username,Artist,Track,Album,Date,Time
0,0,Babs_05,Isobel Campbell,The Circus Is Leaving Town,Ballad of the Broken Seas,31 Jan 2021,23:36
1,1,Babs_05,Isobel Campbell,Dusty Wreath,Ballad of the Broken Seas,31 Jan 2021,23:32
2,2,Babs_05,Isobel Campbell,Honey Child What Can I Do?,Ballad of the Broken Seas,31 Jan 2021,23:28
3,3,Babs_05,Isobel Campbell,It's Hard To Kill A Bad Thing,Ballad of the Broken Seas,31 Jan 2021,23:25
4,4,Babs_05,Isobel Campbell,Saturday's Gone,Ballad of the Broken Seas,31 Jan 2021,23:21


In [273]:
# Convert 'Date' and 'Time' into a single datetime column
df['Timestamp'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df = df.sort_values(by=['Username', 'Timestamp']).reset_index(drop=True)

In [274]:
df.head()

Unnamed: 0.1,Unnamed: 0,Username,Artist,Track,Album,Date,Time,Timestamp
0,16217,Babs_05,Eminem,"Lose Yourself - From ""8 Mile"" Soundtrack",8 Mile,01 Jan 2021,00:02,2021-01-01 00:02:00
1,16216,Babs_05,The Game,Westside Story,The Documentary,01 Jan 2021,00:09,2021-01-01 00:09:00
2,16215,Babs_05,Toto,Africa,Toto IV,01 Jan 2021,00:14,2021-01-01 00:14:00
3,16214,Babs_05,Baby D,Let Me Be Your Fantasy - Radio Edit,Let Me Be Your Fantasy EP,01 Jan 2021,00:19,2021-01-01 00:19:00
4,16213,Babs_05,The Notorious B.I.G.,Juicy - 2005 Remaster,Ready To Die (The Remaster),01 Jan 2021,00:23,2021-01-01 00:23:00


In [275]:
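# Create replay target variable
# diff() measures the gap to the PREVIOUS play of the same (user, artist, track),
# so a 1 marks a play that is itself a replay within 30 days of the preceding play.
df['Replay_within_30_days'] = df.groupby(['Username', 'Artist', 'Track'])['Timestamp'].diff().dt.days
df['Replay_within_30_days'] = df['Replay_within_30_days'].fillna(999)  # First play of a track has no previous play; use a large sentinel
df['Replay_within_30_days'] = (df['Replay_within_30_days'] <= 30).astype(int)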
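
The cell above flags a play when the previous play of the same track happened at most 30 days earlier. If the intent is to predict, at the time of a play, whether it will be followed by a replay, one option is to measure the gap to the next play instead. A minimal sketch on a toy frame, assuming the same column names as the notebook; the toy rows and the `next_gap_days` name are made up for illustration:

```python
import pandas as pd

# Toy listening history (hypothetical rows, same column names as the notebook)
toy = pd.DataFrame({
    "Username": ["u1", "u1", "u1", "u1"],
    "Artist": ["a", "a", "a", "b"],
    "Track": ["t1", "t1", "t1", "t2"],
    "Timestamp": pd.to_datetime(
        ["2021-01-01 10:00", "2021-01-10 10:00", "2021-03-15 10:00", "2021-01-05 09:00"]
    ),
}).sort_values(["Username", "Timestamp"]).reset_index(drop=True)

keys = ["Username", "Artist", "Track"]

# Gap in whole days from each play to the NEXT play of the same track by the same user
next_gap_days = (toy.groupby(keys)["Timestamp"].shift(-1) - toy["Timestamp"]).dt.days

# The last play of a track has no next play (NaN), and NaN <= 30 evaluates to False
toy["Replay_within_30_days"] = (next_gap_days <= 30).astype(int)

print(toy[["Track", "Timestamp", "Replay_within_30_days"]])
```

With this variant the label sits on the earlier play, so features computed at that play do not include information from the replay itself.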


In [276]:
(df['Replay_within_30_days'] > 0).sum()

33261

33,261 plays are flagged as a replay within 30 days.

# **Feature Engineering**

In [277]:
df['hour'] = df['Timestamp'].dt.hour
df['day'] = df['Timestamp'].dt.day
df['month'] = df['Timestamp'].dt.month
df['total_user_plays'] = df.groupby('Username')['Track'].transform('count')
df['artist_popularity'] = df.groupby('Artist')['Track'].transform('count')
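# Note: this is the spread between a user's earliest and latest listening hour (0-23) over all plays, not the duration of a single session
df['session_length'] = df.groupby('Username')['hour'].transform(lambda x: x.max() - x.min())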
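
`session_length` as computed above is the difference between the latest and earliest hour of day at which a user has ever played a song, so it tends toward 23 for any active user. If a per-session measure is wanted, one common approach is to start a new session whenever the gap between consecutive plays exceeds a threshold. A sketch, assuming a 30-minute threshold (an arbitrary choice, not taken from the notebook) and the hypothetical column names `session_id` and `session_minutes`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Username": ["u1"] * 4 + ["u2"] * 2,
    "Timestamp": pd.to_datetime([
        "2021-01-01 00:02", "2021-01-01 00:09", "2021-01-01 00:14",  # u1, first session
        "2021-01-01 09:00",                                           # u1, new session
        "2021-01-01 12:00", "2021-01-01 12:20",                       # u2, one session
    ]),
}).sort_values(["Username", "Timestamp"]).reset_index(drop=True)

SESSION_GAP = pd.Timedelta(minutes=30)  # assumed threshold

# A play opens a new session if it is the user's first play or follows a long gap
gap = toy.groupby("Username")["Timestamp"].diff()
new_session = gap.isna() | (gap > SESSION_GAP)

# Running count of session starts gives an id that is unique across users
toy["session_id"] = new_session.cumsum()

# Session length in minutes: last play minus first play within the session
grouped = toy.groupby("session_id")["Timestamp"]
span = grouped.transform("max") - grouped.transform("min")
toy["session_minutes"] = span.dt.total_seconds() / 60

print(toy)
```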

In [278]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,Username,Artist,Track,Album,Date,Time,Timestamp,Replay_within_30_days,hour,day,month,total_user_plays,artist_popularity,session_length
0,16217,Babs_05,Eminem,"Lose Yourself - From ""8 Mile"" Soundtrack",8 Mile,01 Jan 2021,00:02,2021-01-01 00:02:00,0,0,1,1,33695,287,23
1,16216,Babs_05,The Game,Westside Story,The Documentary,01 Jan 2021,00:09,2021-01-01 00:09:00,0,0,1,1,33695,16,23


# Encode categorical variables

In [279]:
from sklearn.preprocessing import LabelEncoder
# Use one encoder per column so each mapping can be inverted later
encoders = {}
for column in ['Username', 'Artist', 'Track']:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

# Select features and target

In [280]:
X = df[['Username', 'Artist', 'Track', 'hour', 'day', 'month', 'total_user_plays', 'artist_popularity', 'session_length']]
y = df['Replay_within_30_days']

# Train-Test Split

In [281]:
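# Split data
# NOTE: X_resampled and y_resampled are not defined in any cell shown in this notebook;
# they must come from a class-rebalancing step applied to X and y before this cell runs.
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)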
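
The split uses `X_resampled` and `y_resampled`, but the cell that creates them is not part of this notebook, so the method actually used is unknown. One possible way to rebalance the classes is random oversampling of the minority class. The sketch below is an assumption, not the original step; the helper name `oversample_minority` and the `ratio` value are hypothetical.

```python
import pandas as pd

def oversample_minority(X, y, ratio=1.0, seed=42):
    """Randomly duplicate minority-class rows until minority/majority is about `ratio`."""
    counts = y.value_counts()
    majority_label, minority_label = counts.idxmax(), counts.idxmin()
    target = int(counts[majority_label] * ratio)
    extra = target - counts[minority_label]
    if extra <= 0:
        # Already balanced enough; return unchanged copies
        return X.copy(), y.copy()
    # Sample minority rows with replacement and append them to the original data
    extra_idx = y[y == minority_label].sample(n=extra, replace=True, random_state=seed).index
    X_res = pd.concat([X, X.loc[extra_idx]], ignore_index=True)
    y_res = pd.concat([y, y.loc[extra_idx]], ignore_index=True)
    return X_res, y_res

# Toy data: 8 negatives, 2 positives (hypothetical)
X = pd.DataFrame({"hour": range(10)})
y = pd.Series([0] * 8 + [1] * 2, name="Replay_within_30_days")

X_resampled, y_resampled = oversample_minority(X, y, ratio=1.0)
print(y_resampled.value_counts())
```

If rows are duplicated before the train-test split, copies of the same play can land in both sets and inflate the test scores. Applying the resampling to the training portion only avoids that.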

In [282]:
X_train.head(3)

Unnamed: 0,Username,Artist,Track,hour,day,month,total_user_plays,artist_popularity,session_length
108097,4,12032,30978,20,30,1,32712,55,23
72032,3,2074,31747,11,18,1,20966,13,23
44397,1,15242,8361,23,29,1,27015,4,23


In [283]:
X_test.head(3)

Unnamed: 0,Username,Artist,Track,hour,day,month,total_user_plays,artist_popularity,session_length
114918,4,14846,19714,14,31,1,32712,24,23
138636,7,11965,62359,20,30,1,17230,46,23
67342,2,16540,38962,3,31,1,10123,219,23


In [284]:
y_train.head(3)

108097    0
72032     0
44397     0
Name: Replay_within_30_days, dtype: int64

In [285]:
y_test.head(3)

114918    1
138636    0
67342     0
Name: Replay_within_30_days, dtype: int64

# Train model - Random Forest

#### Why Random Forest
Works With Imbalanced Data – Combined with resampling or class weights, it can reduce bias toward the majority class (Class 0).<br>
Reduces Overfitting – Averages many decision trees, making it more stable than a single tree.<br>
Captures Non-Linear Patterns – Can learn complex relationships in song replay behavior.<br>
Feature Importance – Helps identify key factors affecting song replays.

In [286]:
rf_model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)


# Predictions and evaluation

In [287]:
y_pred_rf = rf_model.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
report_rf = classification_report(y_test, y_pred_rf)

In [288]:
print("Model Accuracy:", accuracy_rf)
print("Classification Report:\n", report_rf)

Model Accuracy: 0.7099767981438515
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.72      0.73     26578
           1       0.67      0.70      0.68     21263

    accuracy                           0.71     47841
   macro avg       0.71      0.71      0.71     47841
weighted avg       0.71      0.71      0.71     47841



## Analysis of the Classification Report
#### 1. Class 0 (Not Replayed Songs)

Precision: 75% → When the model predicts a song won’t be replayed, it is correct 75% of the time.<br>
Recall: 72% → The model identifies 72% of the "Not Replayed" songs and misclassifies the remaining 28% as replays.<br>
F1-Score: 73% → Reasonable balance between precision and recall.<br>

#### 2. Class 1 (Replayed Songs - The Target Class)

Precision: 67% → When the model predicts a replay, it is correct 67% of the time.<br>
Recall: 70% → The model detects 70% of replayed songs, so about 30% of actual replays are missed.<br>
F1-Score: 68% → Moderate balance between precision and recall, somewhat weaker than for Class 0.<br>

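#### 3. Overall Metrics

Accuracy: 70.99% → The model is right about 7 out of 10 times. The test set is about 56% Class 0 (26,578 of 47,841), so always predicting "Not Replayed" would already score about 56%; the model improves on that baseline by roughly 15 percentage points.<br>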
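
The percentages in the report follow directly from confusion-matrix counts. The sketch below shows the arithmetic for a single class, plus the majority-class baseline implied by the support column (26,578 and 21,263). The confusion-matrix counts in the example are made up for illustration and are not the model's actual counts.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def majority_baseline(supports):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(supports) / sum(supports)

# Hypothetical counts: 70 true positives, 30 false positives, 30 false negatives
p, r, f1 = precision_recall_f1(tp=70, fp=30, fn=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Supports taken from the classification report
baseline = majority_baseline([26578, 21263])
print(f"majority-class baseline accuracy: {baseline:.3f}")
```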


# Recommendation system

In [289]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,Username,Artist,Track,Album,Date,Time,Timestamp,Replay_within_30_days,hour,day,month,total_user_plays,artist_popularity,session_length
0,16217,0,5942,32735,8 Mile,01 Jan 2021,00:02,2021-01-01 00:02:00,0,0,1,1,33695,287,23
1,16216,0,19165,61534,The Documentary,01 Jan 2021,00:09,2021-01-01 00:09:00,0,0,1,1,33695,16,23


In [290]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import explode, col

#  Initialize Spark Session
spark = SparkSession.builder.appName("MusicRecommendation").getOrCreate()

#  Ensure df is a Pandas DataFrame before conversion
if not isinstance(df, pd.DataFrame):  # Check if it's NOT a Pandas DataFrame
    raise TypeError("Expected `df` to be a Pandas DataFrame before converting to Spark DataFrame")
#  Convert Pandas DataFrame to Spark DataFrame
df_spark = spark.createDataFrame(df)

df_spark.printSchema()
df_spark.show(5)


root
 |-- Unnamed: 0: long (nullable = true)
 |-- Username: long (nullable = true)
 |-- Artist: long (nullable = true)
 |-- Track: long (nullable = true)
 |-- Album: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- Timestamp: timestamp (nullable = true)
 |-- Replay_within_30_days: long (nullable = true)
 |-- hour: long (nullable = true)
 |-- day: long (nullable = true)
 |-- month: long (nullable = true)
 |-- total_user_plays: long (nullable = true)
 |-- artist_popularity: long (nullable = true)
 |-- session_length: long (nullable = true)

+----------+--------+------+-----+--------------------+-----------+------+-------------------+---------------------+----+---+-----+----------------+-----------------+--------------+
|Unnamed: 0|Username|Artist|Track|               Album|       Date|  Time|          Timestamp|Replay_within_30_days|hour|day|month|total_user_plays|artist_popularity|session_length|
+----------+--------+------+-----+----

In [291]:
from pyspark.sql.functions import col

# Convert `Username` and `Track` to string type before applying StringIndexer
df_spark = df_spark.withColumn("Username", col("Username").cast("string"))
df_spark = df_spark.withColumn("Track", col("Track").cast("string"))

# Now apply StringIndexer
from pyspark.ml.feature import StringIndexer

indexer_user = StringIndexer(inputCol="Username", outputCol="UserIndex").fit(df_spark)
indexer_song = StringIndexer(inputCol="Track", outputCol="TrackIndex").fit(df_spark)

df_spark = indexer_user.transform(df_spark)
df_spark = indexer_song.transform(df_spark)

# Check if new columns exist
df_spark.printSchema()
df_spark.select("UserIndex", "TrackIndex", "Replay_within_30_days").show(5)


root
 |-- Unnamed: 0: long (nullable = true)
 |-- Username: string (nullable = true)
 |-- Artist: long (nullable = true)
 |-- Track: string (nullable = true)
 |-- Album: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- Timestamp: timestamp (nullable = true)
 |-- Replay_within_30_days: long (nullable = true)
 |-- hour: long (nullable = true)
 |-- day: long (nullable = true)
 |-- month: long (nullable = true)
 |-- total_user_plays: long (nullable = true)
 |-- artist_popularity: long (nullable = true)
 |-- session_length: long (nullable = true)
 |-- UserIndex: double (nullable = false)
 |-- TrackIndex: double (nullable = false)

+---------+----------+---------------------+
|UserIndex|TrackIndex|Replay_within_30_days|
+---------+----------+---------------------+
|      0.0|    2459.0|                    0|
|      0.0|    5910.0|                    0|
|      0.0|     255.0|                    0|
|      0.0|    5183.0|                    0

In [292]:
from pyspark.ml.evaluation import RegressionEvaluator
# Split Data into Training and Test Sets (80-20 split)
(train_data, test_data) = df_spark.randomSplit([0.8, 0.2], seed=42)

# Train ALS Model
als = ALS(
    userCol="UserIndex",
    itemCol="TrackIndex",
    ratingCol="Replay_within_30_days",
    rank=10,  # Number of latent factors
    maxIter=10,  # Number of iterations
    regParam=0.1,  # Regularization parameter
    coldStartStrategy="drop"  # Drop test rows whose user or track was unseen in training (their predictions would be NaN)
)




In [293]:
# Fit the ALS Model
model = als.fit(train_data)

#  Generate Predictions on Test Data
predictions = model.transform(test_data)


In [294]:
#  Evaluate the Model using RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="Replay_within_30_days", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Square Error (RMSE): {rmse}")

Root Mean Square Error (RMSE): 0.4279888188708282
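
An RMSE on a 0/1 label is easier to interpret next to a constant baseline: predicting the positive rate p for every row gives an RMSE of sqrt(p(1 - p)). A small sketch of that calculation; the positive rate of 0.2 is a placeholder, not a figure from the notebook, and should be replaced with the observed share of 1s in the training data.

```python
import math

def constant_baseline_rmse(positive_rate):
    """RMSE of always predicting `positive_rate` for a binary 0/1 label."""
    return math.sqrt(positive_rate * (1.0 - positive_rate))

# Placeholder positive rate (assumption for illustration only)
baseline = constant_baseline_rmse(0.2)
print(f"constant-prediction RMSE: {baseline:.3f}")
```

An ALS RMSE close to or above this baseline would suggest the factor model adds little beyond the base replay rate.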


In [295]:
#  Generate Song Recommendations for Each User
user_recommendations = model.recommendForAllUsers(5)  # Get top 5 recommendations per user

In [296]:
df_spark.printSchema()

root
 |-- Unnamed: 0: long (nullable = true)
 |-- Username: string (nullable = true)
 |-- Artist: long (nullable = true)
 |-- Track: string (nullable = true)
 |-- Album: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Time: string (nullable = true)
 |-- Timestamp: timestamp (nullable = true)
 |-- Replay_within_30_days: long (nullable = true)
 |-- hour: long (nullable = true)
 |-- day: long (nullable = true)
 |-- month: long (nullable = true)
 |-- total_user_plays: long (nullable = true)
 |-- artist_popularity: long (nullable = true)
 |-- session_length: long (nullable = true)
 |-- UserIndex: double (nullable = false)
 |-- TrackIndex: double (nullable = false)



In [297]:
#  Explode Recommendations into Individual Rows
recommendations_exploded = user_recommendations.select(
    "UserIndex", explode("recommendations").alias("recommendation")
)
recommendations_exploded = recommendations_exploded.select(
    "UserIndex", col("recommendation.TrackIndex").alias("TrackIndex"),
    col("recommendation.rating").alias("Rating")
)


In [298]:
#  Display Final Recommendations (UserIndex & TrackIndex)
recommendations_exploded.show(10, truncate=False)

+---------+----------+----------+
|UserIndex|TrackIndex|Rating    |
+---------+----------+----------+
|10       |25980     |0.64942867|
|10       |20712     |0.64942867|
|10       |24446     |0.6372059 |
|10       |22888     |0.6372059 |
|10       |19774     |0.6372059 |
|0        |2575      |0.75424916|
|0        |19548     |0.74741215|
|0        |19487     |0.74741215|
|0        |19433     |0.74741215|
|0        |18384     |0.74741215|
+---------+----------+----------+
only showing top 10 rows
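
The recommendations are listed by index, which is hard to read. One way to attach track names is to keep a lookup of distinct (index, name) pairs and join it onto the recommendations. A pandas sketch with made-up names and indices; in the notebook, such a lookup would have to be captured before `Track` is label-encoded and indexed.

```python
import pandas as pd

# Hypothetical lookup table captured before encoding: one row per track
track_lookup = pd.DataFrame({
    "TrackIndex": [0, 1, 2],
    "Artist": ["Toto", "Eminem", "The Game"],
    "Track": ["Africa", "Lose Yourself", "Westside Story"],
})

# Hypothetical recommendations in the same shape as the exploded Spark output
recs = pd.DataFrame({
    "UserIndex": [0, 0, 1],
    "TrackIndex": [2, 0, 1],
    "Rating": [0.75, 0.74, 0.65],
})

# Left join keeps every recommendation even if a track is missing from the lookup
readable = recs.merge(track_lookup, on="TrackIndex", how="left")
print(readable[["UserIndex", "Artist", "Track", "Rating"]])
```

Calling `toPandas()` on the exploded Spark DataFrame would give a frame shaped like `recs`, assuming the result is small enough to fit in driver memory.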

