# 🎧 Spotify User Segmentation & Premium Prediction

## 📊 Section Titles

- [`## 1. 📦 Import All Requirements`](#1--import-all-requirements)

- [`## 2. 🗃️ Get and Read the Dataset`](#2-️-get-and-read-the-dataset)

- [`## 3. 📝 Basic Overview`](#3--basic-overview)

- [`## 4. 🚫 Missing Values`](#4--missing-values)

- [`## 5. 💡 Explore Interesting Data Insights`](#5--explore-interesting-data-insights)

- [`## 6. 📊 Show Key Distributions`](#6--show-key-distributions)

- [`## 7. 🔗 Find Correlations`](#7--find-correlations)

- [`## 8. 🔎 Spotting Outliers`](#8--spotting-outliers)

- [`## 9. 🧩 User Segmentation Using K-Means`](#9--user-segmentation-using-k-means)

- [`## 10. 🤖 Using ML Models to Predict User Subscription Willingness`](#10--using-ml-models-to-predict-user-subscription-willingness)

## 1. 📦 Import All Requirements

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

## 2. 🗃️ Get and Read the Dataset

In [None]:
data = pd.read_excel('../data/spotify_data.xlsx')

print(data.shape)
data.head(8)

In [None]:
data.info()

## 3. 📝 Basic Overview

In [None]:
# Basic overview
data.describe(include='all')

## 4. 🚫 Missing Values

In [None]:
# Check for missing values
data.isnull().sum().sort_values(ascending=False)

## 5. 💡 Explore Interesting Data Insights

### How many people are using the Free vs Premium plan?

In [None]:
# Number of people
print('Count:\n', data['spotify_subscription_plan'].value_counts())

print('\n')

# Percentage
print('Percentage:\n', data['spotify_subscription_plan'].value_counts(normalize=True).mul(100).round(2).astype(str) + '%')

data['spotify_subscription_plan'].value_counts().plot(
    kind='pie',
    ylabel='',
    autopct='%1.1f%%',
    shadow=True,
    legend=False,
    startangle=90
)
plt.show()

### All age groups analysis

In [None]:
data['Age'].value_counts().plot(
    kind='bar',
    legend=False,
    ylabel='# of People',
    xlabel='Age Groups'
)
plt.show()

### The most liked music genre for each age group

In [None]:
data.groupby('Age')['fav_music_genre'].value_counts(normalize=True).mul(100).round(2).astype(str) + '%'

### Which age group is most willing to upgrade to Premium?

In [None]:
willing_users = data[data['premium_sub_willingness'] == 'Yes']

print(f'Shape: {willing_users.shape}')
willing_users.head()

In [None]:
willing_users['Age'].value_counts().plot(
    kind='bar',
    legend=False,
    xlabel='Age Group',
    ylabel='# of People',
    title='Willingness to Subscribe by Age Group',
)
plt.show()

In [None]:
(willing_users['Age'].value_counts() / data['Age'].value_counts()).fillna(0).mul(100).round(2).astype(str) + '%'
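
The same per-group share can also be computed without dividing two `value_counts()` results. The following is a minimal sketch on a made-up frame (`demo` and its rows are hypothetical, not taken from the dataset); it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real survey data
demo = pd.DataFrame({
    'Age': ['20-35', '20-35', '12-20', '35-60', '35-60', '35-60'],
    'premium_sub_willingness': ['Yes', 'No', 'No', 'Yes', 'Yes', 'No'],
})

# The mean of a boolean column is the fraction of True values, so grouping
# the "is willing" flag by age gives the share of willing users per group
willing_share = (
    (demo['premium_sub_willingness'] == 'Yes')
    .groupby(demo['Age'])
    .mean()
    .mul(100)
    .round(2)
)

print(willing_share.astype(str) + '%')
```

With this approach, age groups without any willing users show up as 0 on their own, so no `fillna(0)` step is needed.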

### The most preferred Premium plan for willing users

In [None]:
plan_counts = willing_users['preffered_premium_plan'].value_counts()

plan_counts

In [None]:
plan_counts.plot(kind='bar', color='skyblue', title='Preferred Premium Plans')
plt.xlabel('Plan Type')
plt.ylabel('Number of Users')
plt.tight_layout()
plt.show()

## 6. 📊 Show Key Distributions

In [None]:
# Music recommendation system rating
sns.histplot(data['music_recc_rating'], bins=5, discrete=True)
plt.title('Music Recommendation Rating Distribution', fontsize=18)
plt.xlabel('Rating (1 to 5)')
plt.tight_layout()
plt.show()

In [None]:
# Usage time distribution
sns.histplot(data['spotify_usage_period'], bins=4, discrete=True)
plt.title('Spotify Usage Period Distribution', fontsize=18)
plt.xticks(rotation=55, fontsize=12)
plt.xlabel('Usage Time')
plt.tight_layout()
plt.show()

In [None]:
# Favorite music distribution
sns.histplot(data['fav_music_genre'], bins=11, discrete=True)
plt.title('Music Genre Preference Distribution', fontsize=18)
plt.xticks(rotation=65)
plt.xlabel('Music Genre', fontsize=12)
plt.tight_layout()
plt.show()

## 7. 🔗 Find Correlations

In [None]:
# Encode all object types numerically for correlations
df_encoded = data.copy()
for col in df_encoded.columns:
    if df_encoded[col].dtype == 'object':
        df_encoded[col] = pd.factorize(df_encoded[col])[0]

# Correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df_encoded.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

## 8. 🔎 Spotting Outliers

In [None]:
# Visualize outliers in the rating and the encoded listening-frequency features
sns.boxplot(data=df_encoded[['music_recc_rating', 'pod_lis_frequency', 'music_lis_frequency']])
plt.title('Outliers Check: Rating and Listening Frequencies')
plt.tight_layout()
plt.show()

In [None]:
# music_lis_frequency holds multi-choice free text, so split it into one boolean column per listening context
listening_types = ['Office', 'Workout', 'Night', 'Travel', 'Leisure']

for t in listening_types:
    # na=False treats missing answers as "does not listen in this context"
    data[f'listens_{t.lower()}'] = data['music_lis_frequency'].str.contains(t, case=False, na=False)

data.iloc[:, -5:]

## 9. 🧩 User Segmentation Using K-Means

### Select features

In [None]:
# Select specific features
cluster_features = [
    'music_recc_rating',
    'fav_music_genre',
    'music_time_slot',
    'music_lis_frequency',
    'pod_lis_frequency',
    'preferred_listening_content'
]

df_seg = data.copy()

# Encode all string values in numerical representations
for col in cluster_features:
    if df_seg[col].dtype == 'object':
        df_seg[col] = LabelEncoder().fit_transform(df_seg[col].astype(str))

df_seg[cluster_features]

### Init and run K-Means

In [None]:
# Run K-Means clustering
X_seg = df_seg[cluster_features]
kmeans = KMeans(n_clusters=4, random_state=42)
df_seg['segment'] = kmeans.fit_predict(X_seg)

# Analyze segments
segment_summary = df_seg.groupby('segment')[cluster_features].mean()
segment_summary

### Plot the results using t-SNE

In [None]:
# Visualize segments using t-SNE
X_vis = df_seg[cluster_features]

# Run t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_embedded = tsne.fit_transform(X_vis)

# Add to dataframe
df_seg['tsne_x'] = X_embedded[:, 0]
df_seg['tsne_y'] = X_embedded[:, 1]

# Plot it
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_seg, x='tsne_x', y='tsne_y', hue='segment', palette='tab10')
plt.title("t-SNE: Spotify User Segments")
plt.xlabel("")
plt.ylabel("")
plt.grid(True)
plt.show()


The t-SNE projection shows 4 clearly separated clusters of users, each with its own traits.

Segments 1 and 2 sit very close to each other, which means they are similar, yet they differ enough to be split into separate clusters. Both are also compact (low variance), which points to consistent, straightforward choices: their users give almost the same recommendation-system ratings and have similar time spent on Spotify, favorite genres, and so on.

Segment 0 is the farthest from segments 1 and 2, which suggests these groups are almost opposites in time spent on Spotify and other features. It also has the largest variance of all segments, so it contains a wide variety of users, and that spread makes it hard to draw conclusions from it. Increasing `n_clusters` in `KMeans` might split this large segment into smaller, more meaningful ones.

Segment 3 is a transitional segment between the two segments on the left and the largest one on the right. Part of it blends into segment 0.
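
One way to explore other values of `n_clusters` is to compare silhouette scores across several candidates. The following is a minimal sketch, not the method used in this notebook: `X_demo` is synthetic data from `make_blobs` that stands in for `X_seg`, and the range of `k` values and `n_init=10` are arbitrary assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for X_seg: 300 synthetic points around 4 centers
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Fit K-Means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# A higher silhouette score means tighter, better separated clusters
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best k by silhouette:", best_k)
```

To try this on the real data, `X_demo` would be replaced with `X_seg`; the score is only a guide, and the resulting segments still need to be checked for whether they are meaningful.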

## 10. 🤖 Using ML Models to Predict User Subscription Willingness

### Prepare the dataset

In [None]:
# Prepare certain features for ML model
features = [
    'Age',
    'Gender',
    'fav_music_genre',
    'music_time_slot',
    'spotify_subscription_plan'
]

In [None]:
# Encode categorical features for ML model
df_ml = data.copy()
for col in features:
    df_ml[col] = LabelEncoder().fit_transform(df_ml[col])

df_ml[features].head()

In [None]:
# Get features and target variable
X = df_ml[features]
y = data['premium_sub_willingness'].apply(lambda x: 1 if x == 'Yes' else 0)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Resample the training data to handle class imbalance using SMOTE
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

### Logistic Regression model

In [None]:
# Train and test the Logistic Regression model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train_res, y_train_res)
predictions = lr_model.predict(X_test)

# Print classification report for Logistic Regression
print(classification_report(y_test, predictions))

In [None]:
# Collect the learned coefficient of each feature, sorted from most positive to most negative
features_weights = pd.DataFrame(lr_model.coef_[0], index=X.columns, columns=['Weight']).sort_values(by='Weight', ascending=False)
features_weights['Adjusted Weight'] = features_weights['Weight'] + lr_model.intercept_[0]
features_weights

The table shows how **each feature** contributes to the predicted subscription willingness. **Positive weights** push the prediction toward **Yes**, while **negative weights** push it toward **No**.

`spotify_subscription_plan` and `fav_music_genre` have the largest **positive** weights, meaning that users with certain subscription plans or favorite music genres are **more likely to subscribe**. `Age` has a negative weight, meaning that older users are **less likely to subscribe**.
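
To make the sign reading above concrete: in logistic regression, each weight is the change in log-odds for a one-unit increase in its feature, so `exp(weight)` is the factor by which the odds of "Yes" are multiplied. The sketch below uses made-up weights and feature names (`feature_a`, `feature_b`, and the intercept are hypothetical, not the values learned by `lr_model`).

```python
import math

# Hypothetical weights and intercept, NOT the values learned by lr_model
example_weights = {"feature_a": 0.40, "feature_b": -0.25}
intercept = -0.10


def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x):
    """Probability of 'Yes' for a dict of feature values."""
    z = intercept + sum(example_weights[name] * value for name, value in x.items())
    return sigmoid(z)


# exp(weight) > 1 raises the odds of 'Yes'; exp(weight) < 1 lowers them
odds_ratios = {name: math.exp(w) for name, w in example_weights.items()}

p_low = predict_proba({"feature_a": 0, "feature_b": 0})
p_high = predict_proba({"feature_a": 1, "feature_b": 0})

print(odds_ratios)
print(f"P(Yes) with feature_a=0: {p_low:.3f}, with feature_a=1: {p_high:.3f}")
```

Because the features here are label-encoded, a "one-unit increase" means moving to the next integer code, so the size of a weight depends on how the categories happened to be numbered.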

### Decision Tree model

In [None]:
# Train and test the Decision Tree model
dt_model = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_model.fit(X_train_res, y_train_res)
dt_predictions = dt_model.predict(X_test)

# Print classification report for Decision Tree
print(classification_report(y_test, dt_predictions))

In [None]:
# Export the decision tree rules
tree_rules = export_text(dt_model, feature_names=list(X.columns))

print(tree_rules)

In [None]:
plt.figure(figsize=(24, 10))
plot_tree(dt_model, feature_names=X.columns, class_names=['No', 'Yes'], filled=True, rounded=True, fontsize=11)
plt.title("Decision Tree Visualization")
plt.show()