<a href="https://colab.research.google.com/github/Adeseye1907/My_Project_Work-Spotify_Churn-/blob/main/My_Project_(spotify_Churn).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Dataset Overview**

The dataset is titled “Spotify Churn Dataset” and was collected from Kaggle, it consists of 8,000 user records and 12 variables. It contains information related to Spotify users’ demographics, listening behavior, subscription type, and churn status. Each row represents a unique user identified by user_id.

There are no missing values and no duplicate records, indicating that the dataset is clean and ready for analysis. The dataset contains both categorical and numerical variables — 7 numeric, 4 categorical, and 1 floating-point variable.

Key Variables

Demographics: gender, age, country

Subscription and Usage: subscription_type, listening_time, songs_played_per_day, skip_rate, ads_listened_per_week, offline_listening, device_type

Target Variable: is_churned (indicates whether a user has unsubscribed or stopped using the service)


Summary

Overall, the dataset is well-structured and balanced across categorical and numerical features. It provides an excellent foundation for analyzing user behavior patterns, identifying key churn predictors, and developing data-driven retention strategies for Spotify users.

The next step is to import all the needed libraries and upload the dataset to the colab. followed by the getting the information about the dataset, the shape of the dataset, the description of the dataset showing the mean, min, max, standard deviation etc, .

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

In [None]:
ds = pd.read_csv('/content/spotify_churn_dataset.csv')
ds.head()

In [None]:
ds.info()


The data set shows that we have four(4) objects data types  i.e strings, and seven (7) integers and one (1) float data types. With a memory usage of 750.1+KB

In [None]:
ds.shape

In [None]:
ds.describe()

This dataset shows a high standard deviation for the user_id, which means there would be scaling of the datasets during cleaning

In [None]:
#Finding the missing values.
ds.isnull().sum()

This dataset shows that there are no missing values and there will be no need to either replace or remove the missing values. So the columns and rows are completely filled with values.

In [None]:
#checking for the numerical and categorical data.
numerical_data = ds.select_dtypes(include = ['number'])
categorical_data = ds.select_dtypes(exclude = ['number'])
print('Numerical columns: \n')
display(numerical_data.head())
print('\nCategorical columns:')
display(categorical_data.head())

In [None]:
#Finding the correlation
numerical_data.corr()

This shows that there is not high level of correlation between variables. Except for a high negative correlation between ads listened per week and the offline listeners which is -0.87

In [None]:
sns.heatmap(numerical_data.corr(), annot=True, cmap='BuPu')

In [None]:
#Checking for duplicates
Data_duplicates = ds.duplicated()
Data_duplicates.sum()

This shows that there are no duplicates in the datasets

In [None]:
#Finidng Outliers.
ds['z_score'] = stats.zscore(ds['is_churned'])
ds.head()

In [None]:
outliers = ds[(ds['z_score'] > 3) | (ds['z_score'] < -3)]
outliers

This shows that there no signicant outliers in the dataset

In [None]:
!pip install ydata_profiling --quiet

The next step is to do a pandas profiling using y-data that shows an overview of the dataset

In [None]:
from ydata_profiling import ProfileReport
profile = ProfileReport(ds, title= 'Pandas Profiling Report for Spotify Churn Data')
profile.to_notebook_iframe()

In [None]:
#Encoding categorical data
from sklearn.preprocessing import LabelEncoder

for i in categorical_data.columns:
  encoder = LabelEncoder()
  ds[i] = encoder.fit_transform(ds[i])

ds.head()

In [None]:
ds.loc[ds[('is_churned')] == 0]

This shows that a total 5929 users has churned.

In [None]:
ds.loc[ds[('is_churned')] == 1]

This shows that 2071 subscribers has not churned and are still subscribers.

This means that greater percentage of subscirbers has stopped subscribing to spotify.

In [None]:
#Scaling the dataset because of the high std in the user_id column
from sklearn.preprocessing import StandardScaler

for i in ds.drop(['is_churned'],  axis=1).columns:
  if ds[i].std() > 1000:
    scaler = StandardScaler()
    ds[i] = scaler.fit_transform(ds[[i]])

ds.head()

Checking to see  if the standadr deviation has been scaled

In [None]:
ds.describe()

This shows that the standard deviation is scaled

So we Split the data i.e train and test the data before introducing the model

In [None]:
from sklearn.model_selection import train_test_split

x = ds.drop('is_churned', axis=1)
y = ds.is_churned

xtrain, xtest, ytrain,ytest = train_test_split(x, y, test_size = 0.2, random_state=40)
print(f'xtrain: {xtrain.shape}')
print(f'xtest: {xtest.shape}')
print(f'ytrain:{ytrain.shape}')
print(f'Ytest: {ytest.shape}')

Please note during split we are now using 20% of the datasets, so in other to checkmate the statement made earlier and seeing the number of is churned as way above 5000 and not churned as above 2000. See that now at split we are now working with 1600 observations at split so the number of not churned might be higher or lower however the model predicition is accurate.

In [None]:
#introducing the model
from sklearn.metrics import r2_score
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(xtrain, ytrain)

In [None]:
#Evaluate using the Xtrain
prediction = logmodel.predict(xtrain)
r2_score(ytrain, prediction)

In [None]:
#Evaluate using the ytest
prediction = logmodel.predict(xtest)
r2_score(ytest, prediction)

The Evaluation shows that the model is perfect and many spotify users has churned and has stopped susbcribing.

In [None]:
#Feature Selection
x = ds.drop('is_churned', axis = 1)
y = ds.is_churned

In [None]:
from sklearn.metrics import classification_report
prediction = logmodel.predict(xtrain)
print(classification_report(ytrain, prediction))

In [None]:
from sklearn.metrics import classification_report
prediction = logmodel.predict(xtest)
print(classification_report(ytest, prediction))

The churned users in the dataset exhibit behavioral patterns characterized by lower listening time which usually indicates disengagement, higher skip rates which reflect dissatisfaction or poor song relevance, subscription type which indicates that free-tier users often have a higher churn rate than Premium users due to frequent exposure to ads and limited features. They are predominantly Free-tier users with limited offline listening behavior. These patterns suggest that user engagement and satisfaction with the content experience are primary drivers of churn. Conversely, Premium users and those with longer listening durations or multi-device activity are less likely to churn.

So one of the reason coefficients will be introduce as a form of recommendation to help sustain the current subscribers and bring in more subscribers is because positive coefficients will mean increase churn risk and negative coefficients will mean reduce churn risk. So we also introduce the critical churn risk index to help spotify retain and get new subscribers.

In [None]:
#Finding the coefficients
coefficients = pd.DataFrame({'Feature': xtest.columns,'Coefficient': logmodel.coef_[0]}).sort_values(by='Coefficient', ascending=False)
coefficients

So, negative coefficients mean that those variables protect against churn they reduce the likelihood that a user will leave. And even on the negative side, the numbers are still high and churn can be measured

In [None]:
# Visualize feature importance
plt.figure(figsize=(10,5))
sns.barplot(x='Coefficient', y='Feature', data=coefficients, palette='coolwarm')
plt.title('Feature Importance (Logistic Regression Coefficients)')
plt.axvline(0, color='black', linewidth=1)
plt.show()

In [None]:
# CREATE THE CRITICAL CHURN RISK INDEX (CCRI)
# Select key features and their weights (from model coefficients)
#Features includes: offline listening, skip rate, subsciption type, ads listened per week
Key_Features = ['skip_rate','listening_time', 'offline_listening', 'subscription_type', 'ads_listened_per_week']
Weights = {feat: coefficients.loc[coefficients['Feature'] == feat, 'Coefficient'].values[0] for feat in Key_Features}

Here see that the Key features are selected according to their weight as they influence why people churns.

So we scale the numerical features. i.e normalization

In [None]:
# Normalize the key features
for col in Key_Features:
    xtest[f'{col}_norm'] = (xtest[col] - xtest[col].min()) / (xtest[col].max() - xtest[col].min())

In [None]:
# Compute critical churn index CCRI
xtest['CCRI'] = (
    Weights['skip_rate'] * xtest['skip_rate_norm'] +
    Weights['ads_listened_per_week'] * xtest['ads_listened_per_week_norm'] +
    Weights['listening_time'] * xtest['listening_time_norm'] +
    Weights['offline_listening'] * xtest['offline_listening_norm'] +
    Weights['subscription_type'] * xtest['subscription_type_norm']
)

In [None]:
xtest.head()


The Above table shows that the CCRI has been included in the dataset, though not in the dataframe, but it has included also the risk level.

In [None]:
# Compare churn rates by risk group
risk_summary = risk_analysis.groupby('risk_level')['is_churned'].mean().reset_index()

print("Churn rate by risk group:")
print(risk_summary)

In [None]:
# Visualize the distribution of CCRI
plt.figure(figsize=(10, 6))
sns.histplot(xtest['CCRI'], kde=True)
plt.title('Distribution of Critical Churn Risk Index (CCRI)')
plt.xlabel('CCRI')
plt.ylabel('Frequency')
plt.axvline(threshold, color='red', linestyle='dashed', linewidth=1, label=f'High Risk Threshold ({threshold:.2f})')
plt.legend()
plt.show()

In [None]:
# Model metrics
print("\nClassification Report:")
# We need predictions based on the model using the original xtest features
y_pred = logmodel.predict(xtest[x.columns])
print(classification_report(ytest, y_pred))

print("\nAUC-ROC Score:", roc_auc_score(ytest, logmodel.predict_proba(xtest[x.columns])[:, 1]))

# Confusion matrix visualization
cm = confusion_matrix(ytest, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
plt.show()

The logistic regression model demonstrated perfect classification performance on the test dataset.
Out of 1,600 total observations (1,200 non-churners and 400 churners), the model correctly predicted every instance, achieving 100% accuracy, precision, recall, and F1-score for both classes.
This indicates that the model fully distinguished churners from non-churners without any misclassification.
The results align with the AUC-ROC score of 1.00, confirming a perfect separation between the two groups.

Spotify Churn Prediction and Retention Recommendation Report
1. Overview

A churn prediction model was developed to analyze user behavior and identify factors influencing customer churn on Spotify. Using selected key features — skip rate, listening time, offline listening, subscription type, and ads listened per week — the model achieved an AUC-ROC score of 1.00, indicating perfect predictive performance and strong feature-target relationships.

2. Key Insights

Analysis of the predictive features revealed several behavioral and engagement patterns associated with churn:

High skip rate — Users who frequently skip songs show low satisfaction or poor content alignment with their preferences, making them more likely to churn.

Low listening time — Reduced active listening hours correlate strongly with declining engagement and a higher churn probability.

Limited offline listening — Users not leveraging offline features may have weaker attachment to the platform or face connectivity/plan limitations.

Subscription type — Free-tier users, who experience frequent ads and limited premium features, show a higher tendency to churn.

Ads exposure — An increase in the number of ads listened per week strongly correlates with dissatisfaction and subsequent churn.

3. Recommendations

Based on the model insights, the following actions are recommended to reduce churn and enhance customer retention:

1. Reduce Advertisement Frequency:
Limit ad exposure for free-tier users or improve ad relevance to reduce irritation and improve user satisfaction.
Introduce “Ad-Free Days” or reward-based listening (e.g., “watch one ad, enjoy 30 minutes ad-free”).
Expected impact: Could reduce churn by up to 25% among free-tier users.

2. Revise Subscription Packages:
Offer affordable, flexible, and engaging subscription plans, including student or family bundles and periodic discounts to encourage upgrade from free to premium tiers.
Introduce micro-subscription tiers (e.g., ₦500 weekly or ₦1500 monthly) to attract budget-conscious users.

Provide temporary premium trials for users with high churn probability.
Expected impact: Conversion rate from free to premium could increase by 15–20%.

3. Enhance Content Personalization:
Improve recommendation algorithms to shortlist songs relevant to individual listening patterns and regional preferences, reducing skip rates and boosting engagement time.
Strengthen recommendation algorithms to minimize high skip rates.

Curate AI-based playlists for different moods, languages, or listening habits.
Expected impact: Users with reduced skip rates (<30%) are 60% more likely to stay subscribed.

4. Promote Offline Listening Features:
Encourage free users to experience offline listening through short trial periods, highlighting the convenience and quality benefits of premium membership.
Offer 7-day free offline listening trials for free users to experience premium benefits.

Highlight “download now, listen anywhere” in-app banners for commuters and mobile users.
➡️ Expected impact: Offline users already show 40% lower churn — expanding this could improve loyalty across segments.

5. Targeted Retention Campaigns:
Use the churn prediction model to identify high-risk users early and deploy personalized retention strategies (e.g., in-app messages, playlists, or limited-time offers).
Use the Critical Churn Risk Index (CCRI) to segment users into High, Medium, and Low churn-risk categories.

Focus personalized retention efforts (discounts, playlists, offers) on High-Risk users.
➡️ Expected impact: Could retain up to 10–15% of users who would otherwise churn.

4. Expected Outcome

Implementing these recommendations will:

Strengthen user engagement and satisfaction,

Retain existing subscribers by reducing churn, and

Attract new users through improved user experience and flexible subscription options.

5. Conclusion

The model demonstrates strong predictive capability in identifying potential churners. By addressing the identified factors — ad load, subscription flexibility, personalization, and engagement features — Spotify can build a more loyal, active, and expanding user base.

In [None]:
!pip install nbstripout
!nbstripout /content/your_notebook_name.ipynb
