<h1> Spotify Skip Prediction Dataset </h1>
This dataset comes in two parts. The first part describes 'sessions': runs of songs a user listens to in one sitting, and which songs were played. The second part describes each song's features. <br>
Our analysis includes only the mini set available on AIcrowd. Our input is an augmented table that combines the user session data with the song feature data. 
There are 167880 entries and 50 total features. Only 47 features will be used. 

<h3>References</h3>
We'd like to acknowledge that, due to the enormity of this dataset and the complexity of how it was stored (in multiple separate and unorganized .csv files), we used online references to decide on our stack and on how to approach the data. <br>
We used the following example: <br> <br>
<li> <a>https://github.com/a-poor/spotify-skip-prediction/blob/master/README.md</a>
<br><i>Used as a template for the tech stack and for reorganizing the dataset. </i>

<h2>Import Libraries and Datasets</h2>

In [20]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# load both csv files into dataframes
log_df = pd.read_csv('log_mini.csv') # user log
tf_df  = pd.read_csv('tf_mini.csv')  # track features
 
# rename the key column so that both data frames share 'track_id'
log_df = log_df.rename(columns={'track_id_clean': 'track_id'})

# perform a merge so that song information is attached to the user information
og_data_df = pd.merge(log_df, tf_df, on='track_id')

# Save the merged DataFrame to a new CSV file
og_data_df.to_csv('merged_file.csv', index=False)
# work on a copy so og_data_df keeps the raw merged columns
data_df = og_data_df.copy()
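
A merge like the one above can silently drop or duplicate rows if the key column does not line up. The sketch below shows one way to check this on toy frames. The frame contents are invented for illustration, and the use of `how="left"`, `validate`, and `indicator` is an assumption on our part, not something the original cell does.

```python
import pandas as pd

# Toy stand-ins for log_mini.csv and tf_mini.csv (hypothetical values)
toy_log = pd.DataFrame({
    "session_id": ["s1", "s1", "s2"],
    "track_id": ["t_a", "t_b", "t_a"],
    "skip_3": [True, False, True],
})
toy_tracks = pd.DataFrame({
    "track_id": ["t_a", "t_b"],
    "valence": [0.15, 0.80],
})

# validate="many_to_one" raises MergeError if a track_id repeats in toy_tracks;
# indicator=True adds a "_merge" column recording where each row came from
merged = pd.merge(toy_log, toy_tracks, on="track_id", how="left",
                  validate="many_to_one", indicator=True)

# Count log rows that did not find matching track features
unmatched = (merged["_merge"] != "both").sum()
print(merged.shape, "unmatched rows:", unmatched)
```

If `unmatched` is zero and the row count equals that of the log frame, the merge attached features to every log row without duplicating any.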

In [21]:
print(data_df.head(1))
print(data_df.shape)

                               session_id  session_position  session_length  \
0  0_00006f66-33e5-4de7-a324-2d18e439fc1e                 1              20   

                                 track_id  skip_1  skip_2  skip_3  \
0  t_0479f24c-27d2-46d6-a00c-7ec928f2b539   False   False   False   

   not_skipped  context_switch  no_pause_before_play  ...  time_signature  \
0         True               0                     0  ...               4   

    valence  acoustic_vector_0  acoustic_vector_1  acoustic_vector_2  \
0  0.152255          -0.815775           0.386409            0.23016   

   acoustic_vector_3 acoustic_vector_4  acoustic_vector_5 acoustic_vector_6  \
0           0.028028         -0.333373           0.015452          -0.35359   

  acoustic_vector_7  
0          0.205826  

[1 rows x 50 columns]
(167880, 50)


# Reorganizing the Data
We decided to select the variable skip_3 as our 'y' variable. The 'skip_3' feature represents when a 

In [22]:
# Making a 'skipped' feature for whether a song has been skipped or not, regardless of how fast
data_df['skipped'] = (data_df.skip_3 | data_df.skip_2 | data_df.skip_1).astype('int32')

# Make 'skipped' column our 'y' value for prediction
y_df = data_df['skipped']

# drop the raw label columns so they cannot leak into the features
data_df = data_df.drop(columns=["skip_1", "skip_2", "skip_3", "not_skipped"])
# Optional: check the remaining features (info() prints its own report)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167880 entries, 0 to 167879
Data columns (total 47 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   session_id                       167880 non-null  object 
 1   session_position                 167880 non-null  int64  
 2   session_length                   167880 non-null  int64  
 3   track_id                         167880 non-null  object 
 4   context_switch                   167880 non-null  int64  
 5   no_pause_before_play             167880 non-null  int64  
 6   short_pause_before_play          167880 non-null  int64  
 7   long_pause_before_play           167880 non-null  int64  
 8   hist_user_behavior_n_seekfwd     167880 non-null  int64  
 9   hist_user_behavior_n_seekback    167880 non-null  int64  
 10  hist_user_behavior_is_shuffle    167880 non-null  bool   
 11  hour_of_day                      167880 non-null  int64  
 12  da
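
The cell above builds the target as the logical OR of the three skip flags. A minimal sketch on toy rows shows what that produces and how to compare it with skip_3 alone. The toy values are invented, and they assume the flags are nested (a skip_1 row is also a skip_2 and skip_3 row); whether that holds for the real data is something to verify with the same comparison.

```python
import pandas as pd

# Hypothetical rows in which the skip flags are nested
toy = pd.DataFrame({
    "skip_1": [True, False, False, False],
    "skip_2": [True, True, False, False],
    "skip_3": [True, True, True, False],
})

# 1 if the track was skipped at any point, 0 otherwise
toy["skipped"] = (toy["skip_1"] | toy["skip_2"] | toy["skip_3"]).astype("int32")

# Share of positive labels, useful for judging class balance
skip_rate = toy["skipped"].mean()

# Does the combined flag differ from skip_3 on its own?
same_as_skip_3 = (toy["skipped"] == toy["skip_3"].astype("int32")).all()
print(skip_rate, same_as_skip_3)
```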

## Dealing with Non-Float Values
The song IDs and session IDs are strings. We've chosen to drop these columns entirely. While it is reasonable to assume they affect the predicted value, we opt to model whether a song will be skipped in general, rather than whether it will be skipped given previous skips and sessions. The mini dataset contains 10,000 sessions, and we are not sure how to handle an identifier with that many values given the resources we have.


In [23]:
# drop id values
data_df = data_df.drop(columns=["session_id", "track_id"])

In [24]:
# split the date column into separate parts; the day of the month is dropped
session_date = pd.to_datetime(data_df['date'])  # parse once, reuse below
data_df['session_year'] = session_date.dt.year
data_df['session_month'] = session_date.dt.month
data_df['session_day_of_week'] = session_date.dt.dayofweek
data_df = data_df.drop(columns='date')
print(data_df.head(3))



   session_position  session_length  context_switch  no_pause_before_play  \
0                 1              20               0                     0   
1                 7              12               0                     0   
2                 6              20               0                     0   

   short_pause_before_play  long_pause_before_play  \
0                        0                       0   
1                        1                       1   
2                        1                       1   

   hist_user_behavior_n_seekfwd  hist_user_behavior_n_seekback  \
0                             0                              0   
1                             0                              0   
2                             0                              0   

   hist_user_behavior_is_shuffle  hour_of_day  ...  acoustic_vector_2  \
0                           True           16  ...            0.23016   
1                          False           17  ...            0
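
Since the printed output is cut off before the new date columns, here is a small sketch of what the date split yields. The dates are invented for illustration. Note that pandas numbers weekdays from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Hypothetical session dates
toy = pd.DataFrame({"date": ["2018-07-15", "2018-07-16", "2018-09-18"]})

parsed = pd.to_datetime(toy["date"])
toy["session_year"] = parsed.dt.year
toy["session_month"] = parsed.dt.month
toy["session_day_of_week"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
toy = toy.drop(columns="date")

print(toy)
```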

In [25]:
# convert 'premium' to an integer column
data_df['premium'] = data_df['premium'].astype(int)
# print(data_df['premium'].head(5))

# convert the boolean shuffle flag to 0/1
data_df['hist_user_behavior_is_shuffle'] = data_df['hist_user_behavior_is_shuffle'].astype(int)
# print(data_df['hist_user_behavior_is_shuffle'].head(5))

# encode the musical mode as 1 (major) or 0 (minor)
data_df['mode'] = data_df['mode'].map({'major':1, 'minor':0})
# print(data_df['mode'].head(5))
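
One thing to watch with `.map` and a dictionary: any value missing from the dictionary becomes NaN rather than raising an error. The sketch below demonstrates this on an invented series that contains a differently capitalized label. It is only an illustration of the behavior, not a claim that the real `mode` column has such values.

```python
import pandas as pd

# Hypothetical column with one label the mapping does not cover
toy_mode = pd.Series(["major", "minor", "Major", "minor"])

mapped = toy_mode.map({"major": 1, "minor": 0})

# Unmapped labels turn into NaN, which also makes the column float
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist(), "unmapped:", n_unmapped)
```

Checking `n_unmapped` right after the mapping catches this before it reaches the model.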


### Categorical Variables
The following variables are categorical in nature:
* time_signature
* key
* context_type
* hist_user_behavior_reason_start
* hist_user_behavior_reason_end
<br><br>Let's look at which values appear in each column to determine whether one-hot encoding or ordinal encoding is more advantageous. 

In [26]:
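# named categorical_cols to avoid shadowing the built-in list
categorical_cols = ['time_signature', 'key', 'context_type', 'hist_user_behavior_reason_start', 'hist_user_behavior_reason_end']

# print the distinct values of each categorical column
for col in categorical_cols:
    unique_values = data_df[col].unique()
    print(col, ": ", unique_values)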

time_signature :  [4 5 3 1 0]
key :  [ 1  7 10  8  6  5  4  2  0  3  9 11]
context_type :  ['editorial_playlist' 'user_collection' 'catalog' 'radio' 'charts'
 'personalized_playlist']
hist_user_behavior_reason_start :  ['trackdone' 'fwdbtn' 'appload' 'playbtn' 'clickrow' 'backbtn' 'remote'
 'endplay' 'trackerror']
hist_user_behavior_reason_end :  ['trackdone' 'endplay' 'fwdbtn' 'backbtn' 'remote' 'logout' 'clickrow']


### Analyzing Unique Values
For *time_signature*, due to its ordered nature, we will use *ordinal* encoding. Its values are already integers, so the column is kept as is.
For *context_type*, *key*, *hist_user_behavior_reason_start*, and *hist_user_behavior_reason_end* we will use *one-hot* encoding, as there seems to be no inherent order to the values. 

Let's make the changes now!
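
To make the difference between the two encodings concrete, here is a small sketch on invented rows. It uses `pd.get_dummies` for brevity, which is a swapped-in alternative to the `OneHotEncoder` used in this notebook.

```python
import pandas as pd

# Hypothetical rows with one ordered and one unordered column
toy = pd.DataFrame({
    "time_signature": [4, 3, 5, 4],
    "context_type": ["radio", "charts", "catalog", "radio"],
})

# Ordinal: time_signature is already an integer whose order is meaningful,
# so it stays as a single numeric column.
# One-hot: context_type becomes one 0/1 column per distinct value.
one_hot = pd.get_dummies(toy, columns=["context_type"], dtype=int)

print(one_hot)
```

Each row has exactly one 1 among the `context_type_*` columns, so no artificial ordering between playlist types is implied.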

## One-Hot Encoding


In [27]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

categorical_features = ['context_type', 'hist_user_behavior_reason_start', 'hist_user_behavior_reason_end', 'key']
# fit the encoder and transform the categorical columns into a sparse 0/1 matrix
encoded_data = encoder.fit_transform(data_df[categorical_features])

# convert to a dense DataFrame with readable names such as 'context_type_radio'
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(categorical_features))

# attach the encoded columns, then drop the original categorical columns
data_df = pd.concat([data_df.reset_index(drop=True), encoded_df], axis=1)
data_df = data_df.drop(columns=categorical_features)

# print(data_df.head(2))
# print(data_df.columns)


### Check for Missing Values

In [28]:
# check for missing values
missing_values = data_df.isnull().sum()

# display columns with missing values and counts
print(missing_values[missing_values > 0])

# remove the target from the feature table; it is already stored in y_df
data_df = data_df.drop(columns=["skipped"])
# print(data_df.columns)

Series([], dtype: int64)


### Split Data into Test and Training Sets

In [29]:
X_train, X_test, y_train, y_test = train_test_split(data_df, y_df, test_size=0.2, random_state=42)
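
The classes here are not balanced (the classification report further down shows roughly 65% skips in the test set). One option, sketched below on synthetic data, is to pass `stratify` so both splits keep the same class proportions. The arrays are invented, and `stratify` is an addition the original split does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with a 65/35 class balance
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 650 + [0] * 350)

# stratify=y_toy keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train skip rate:", y_tr.mean(), "test skip rate:", y_te.mean())
```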

# Model #1: Logistic Regression
Our first model will be a logistic regression model using sklearn's implementation. 


In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline


In [31]:
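# scale features, then fit logistic regression;
# make_pipeline names the second step 'logisticregression'
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# 6 values of C x 2 penalties x 1 solver = 12 candidates
param_grid = {
    'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100],  # inverse regularization strength
    'logisticregression__penalty': ['l1', 'l2'],
    'logisticregression__solver': ['liblinear']  # supports both l1 and l2
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', verbose=1)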

grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

predictions = best_model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print("Best Parameters:", grid_search.best_params_)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, predictions))
probabilities = best_model.predict_proba(X_test)[:,1]
roc_auc = roc_auc_score(y_test, probabilities)
print("ROC-AUC Score:", roc_auc)


Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best Parameters: {'logisticregression__C': 0.1, 'logisticregression__penalty': 'l1', 'logisticregression__solver': 'liblinear'}
Accuracy: 0.9804920181081725
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.96      0.97     11769
           1       0.98      0.99      0.99     21807

    accuracy                           0.98     33576
   macro avg       0.98      0.97      0.98     33576
weighted avg       0.98      0.98      0.98     33576

ROC-AUC Score: 0.988719411471767
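
Scores this high are worth a sanity check for target leakage. One hypothesis, stated as an assumption rather than a finding, is that a column such as hist_user_behavior_reason_end describes how the current track ended and therefore nearly determines the label. The sketch below shows how such a check could look. The rows are invented, so the numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical rows; on the real data, use the column before one-hot encoding
toy = pd.DataFrame({
    "hist_user_behavior_reason_end": [
        "fwdbtn", "trackdone", "fwdbtn", "trackdone", "endplay", "fwdbtn",
    ],
    "skipped": [1, 0, 1, 0, 1, 1],
})

# Skip rate per category: values at or near 0 and 1 suggest
# the feature encodes the outcome itself
rate_by_reason = toy.groupby("hist_user_behavior_reason_end")["skipped"].mean()
print(rate_by_reason)
```

If the real column shows the same pattern, refitting without it would give a more honest estimate of how well the other features predict skips.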


# Model #2: SVM

In [33]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, precision_recall_curve, auc, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


pipeline = Pipeline([
    ('scaler', StandardScaler()),  
    ('svm', SVC(kernel='linear'))
])

param_grid = {
    'svm__C': [0.1, 1, 10],
    # gamma has no effect with kernel='linear', so each C is effectively fit twice
    'svm__gamma': ['scale', 'auto'],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', verbose=1)

grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

svm_predictions = best_model.predict(X_test)


svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_conf_matrix = confusion_matrix(y_test, svm_predictions)
print("Best Parameters:", grid_search.best_params_)
print("SVM Accuracy:", svm_accuracy)
print(classification_report(y_test, svm_predictions))

# SVC has no predict_proba unless probability=True, so rank with decision_function scores
decision_function = best_model.decision_function(X_test)
roc_auc = roc_auc_score(y_test, decision_function)
precision, recall, _ = precision_recall_curve(y_test, decision_function)
pr_auc = auc(recall, precision)
print("Confusion Matrix:\n", svm_conf_matrix)
print("ROC-AUC Score:", roc_auc)
print("Precision-Recall AUC:", pr_auc)



Fitting 5 folds for each of 6 candidates, totalling 30 fits
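
The grid above searches over `gamma`, but `gamma` only applies to the rbf, poly, and sigmoid kernels. A quick sketch on synthetic data shows that two linear-kernel models differing only in `gamma` produce the same decision scores, which means half of the 30 fits are redundant. The data here is generated for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small synthetic binary classification problem
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Same kernel and C, different gamma settings
svm_scale = SVC(kernel="linear", C=1, gamma="scale").fit(X_toy, y_toy)
svm_auto = SVC(kernel="linear", C=1, gamma="auto").fit(X_toy, y_toy)

# With a linear kernel, gamma is never used, so the scores match
same = np.allclose(svm_scale.decision_function(X_toy),
                   svm_auto.decision_function(X_toy))
print("identical decision scores:", same)
```

Dropping `svm__gamma` from the grid, or switching to a kernel that uses it, would make every candidate distinct.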


# Model #3: Neural Network

In [None]:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense



In [None]:
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])


In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


In [None]:
# the held-out test split is also used as validation data during training
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x1537ffa90>

In [None]:
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print("Test Accuracy:", test_accuracy)


Test Accuracy: 0.9799857139587402
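
Unlike the two pipelines above, the network is trained on unscaled features. A minimal sketch of scaling before training follows, using synthetic arrays as stand-ins for `X_train` and `X_test`. The scaler is fit on the training rows only so that no test statistics reach the model. Whether this changes the network's accuracy here is untested.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical values)
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=[200.0, 0.5], scale=[50.0, 0.1], size=(500, 2))
X_test_toy = rng.normal(loc=[200.0, 0.5], scale=[50.0, 0.1], size=(100, 2))

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train_toy)
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```

The scaled arrays would then take the place of the raw ones in `model.fit` and `model.evaluate`.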
