# Netflix Retention Case Study
**Author:** Deni Kurti  
**Created:** 2025-08-28

---


## 1. Introduction & Questions
- What are the retention patterns by first-watch month (cohorts)?
- Which genres/categories drive engagement over time?
- Are there differences by country/region (top 10)?
- Simple churn baseline: who is likely to stop watching next month?

> TODO: state the business goal in one or two sentences and describe the dataset (source, grain, date range).


## 2. Data Loading & Schema Check


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
pd.set_option("display.max_columns", 50)

raw_path = "data/raw/netflix_titles.csv"  # TODO: replace/confirm
df = pd.read_csv(raw_path)
df.head()


In [None]:
df.info()
df.describe(include='all').T


## 3. Cleaning (types, nulls, duplicates)


In [None]:
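# Normalize column names to snake_case.
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
# Drop exact duplicate rows, then count missing values per column.
df = df.drop_duplicates()
nulls = df.isna().sum().sort_values(ascending=False)
nulls  # inspect
# Example: parse dates if present (strip stray whitespace first; unparseable values become NaT)
# df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')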


## 4. Cohort Construction (first-watch month)


In [None]:
# If you have user-level events with 'user_id' & 'watch_date', build cohorts.
# events = pd.read_csv('data/processed/view_events.csv')
# events['watch_date'] = pd.to_datetime(events['watch_date'])
# Cohort = month of each user's first watch, stored as a month-start timestamp.
# events['cohort_month'] = events.groupby('user_id')['watch_date'].transform('min').dt.to_period('M').dt.to_timestamp()
# Distinct active users per (cohort month, activity month).
# retention = (events.assign(activity_month=lambda d: d['watch_date'].dt.to_period('M').dt.to_timestamp())
#                    .groupby(['cohort_month', 'activity_month'])['user_id'].nunique()
#                    .reset_index(name='active_users'))
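The cohort step above can be tried end to end on a few hand-made rows before real event data is available. This is a minimal sketch: the toy `events` table is hypothetical, and the real `view_events.csv` schema (`user_id`, `watch_date`) is an assumption carried over from the commented cell.

```python
import pandas as pd

# Hypothetical toy events; replace with the real view events once available.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "watch_date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-03-02",
        "2024-01-20", "2024-03-15",
        "2024-02-07",
    ]),
})

# Month of each event, as a month-start timestamp.
events["activity_month"] = events["watch_date"].dt.to_period("M").dt.to_timestamp()
# Cohort = month of the user's first watch.
events["cohort_month"] = events.groupby("user_id")["activity_month"].transform("min")

# Distinct active users per (cohort month, activity month).
retention = (
    events.groupby(["cohort_month", "activity_month"])["user_id"]
    .nunique()
    .reset_index(name="active_users")
)
print(retention)
```

Users 1 and 2 land in the January cohort and user 3 in the February cohort, so the result has three rows for January and one for February.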


## 5. Retention Heatmap


In [None]:
# Requires 'retention' from section 4.
# Cohort size = users active in their own cohort month (selected explicitly, not by row order).
# cohort_sizes = (retention.loc[retention['activity_month'] == retention['cohort_month'],
#                               ['cohort_month', 'active_users']]
#                          .rename(columns={'active_users': 'cohort_size'}))
# retention = retention.merge(cohort_sizes, on='cohort_month')
# retention['retention_rate'] = retention['active_users'] / retention['cohort_size']
# pivot = retention.pivot_table(index='cohort_month', columns='activity_month', values='retention_rate')
# plt.figure(figsize=(10, 6))
# sns.heatmap(pivot, annot=False, vmin=0, vmax=1)
# plt.title('Cohort Retention Heatmap'); plt.tight_layout(); plt.show()
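Pivoting on calendar months gives a staggered, triangular heatmap. A common alternative is to index columns by months since first watch so that cohorts line up. The sketch below shows that variant; the `retention` numbers are made up for illustration, and the `period` column is a name introduced here, not part of the original cell.

```python
import pandas as pd

# Hypothetical cohort counts in the shape produced by the cohort step.
retention = pd.DataFrame({
    "cohort_month": pd.to_datetime(["2024-01-01"] * 3 + ["2024-02-01"] * 2),
    "activity_month": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01", "2024-02-01", "2024-03-01"]
    ),
    "active_users": [100, 60, 45, 80, 40],
})

# Whole months elapsed between the cohort month and the activity month.
cohort = retention["cohort_month"]
activity = retention["activity_month"]
retention["period"] = (
    (activity.dt.year - cohort.dt.year) * 12 + (activity.dt.month - cohort.dt.month)
)

# Cohort size = users active in period 0 (their first month).
sizes = retention.loc[retention["period"] == 0].set_index("cohort_month")["active_users"]
retention["retention_rate"] = retention["active_users"] / retention["cohort_month"].map(sizes)

# Rows: cohorts. Columns: months since first watch. Values: share still active.
matrix = retention.pivot(index="cohort_month", columns="period", values="retention_rate")
print(matrix.round(2))
```

Period 0 is 100% by construction, and younger cohorts have missing values in later periods. The resulting `matrix` can be passed to `sns.heatmap(matrix, annot=True, fmt=".0%")` in place of `pivot`.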


## 6. Content Mix Trends (genre/time)


In [None]:
# Example if df has 'listed_in' (comma-separated genres) and 'date_added':
# df['year'] = pd.to_datetime(df['date_added'], errors='coerce').dt.year
# Split inside assign() so df['listed_in'] is not overwritten and the cell can be re-run.
# genre_year = (df.assign(listed_in=df['listed_in'].str.split(', '))
#                 .explode('listed_in')
#                 .groupby(['year', 'listed_in']).size().reset_index(name='count'))
# TODO: plot top genres over years
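One way to approach the plotting TODO is to keep only the most frequent genres and pivot to a year-by-genre table. The sketch below uses a hypothetical five-row catalog. The column names follow the cell above, while `top_n`, `genre`, and `trend` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical catalog rows; the last one has no date and is dropped below.
df = pd.DataFrame({
    "date_added": ["2020-01-01", "2020-03-05", "2021-06-09", "2021-07-02", None],
    "listed_in": ["Dramas, Comedies", "Dramas", "Comedies, Documentaries", "Dramas", "Dramas"],
})

year = pd.to_datetime(df["date_added"], errors="coerce").dt.year

# One row per (title, genre), then count titles per year and genre.
genre_year = (
    df.assign(year=year, genre=df["listed_in"].str.split(", "))
    .dropna(subset=["year"])
    .astype({"year": int})
    .explode("genre")
    .groupby(["year", "genre"])
    .size()
    .reset_index(name="count")
)

# Keep the N most frequent genres overall, then pivot to a year x genre table.
top_n = 2
top_genres = genre_year.groupby("genre")["count"].sum().nlargest(top_n).index
trend = (
    genre_year[genre_year["genre"].isin(top_genres)]
    .pivot(index="year", columns="genre", values="count")
    .fillna(0)
    .astype(int)
)
print(trend)
```

Calling `trend.plot()` then draws one line per genre across years.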


## 7. Country Segmentation (top 10)


In [None]:
# Titles can list several countries, so split and count each one separately.
# country_counts = (df.assign(country=df['country'].str.split(', '))
#                     .explode('country')['country'].value_counts().head(10))
# country_counts.plot(kind='bar'); plt.title('Top 10 Countries by Titles'); plt.tight_layout(); plt.show()


## 8. Simple Churn Baseline (train/test + metrics)


In [None]:
# Assumes a user-level feature matrix (user_features) and a binary target (churned).
# from sklearn.model_selection import train_test_split
# from sklearn.linear_model import LogisticRegression
# from sklearn.metrics import classification_report, roc_auc_score
# X = user_features; y = churned
# stratify=y keeps the churn rate the same in the train and test splits.
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# preds = model.predict(X_test); probs = model.predict_proba(X_test)[:, 1]
# print(classification_report(y_test, preds)); print('ROC AUC:', roc_auc_score(y_test, probs))
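The baseline cell takes a `churned` target as given. One possible definition, matching the question "who is likely to stop watching next month?", is: a user active in month m is churned if they have no activity in month m+1. The sketch below derives that label from a hypothetical user-month activity table; the table, its column names, and the definition itself are assumptions to adapt.

```python
import pandas as pd

# Hypothetical user-month activity: one row per user per active month.
activity = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "month": pd.to_datetime([
        "2024-01-01", "2024-02-01", "2024-03-01",
        "2024-01-01", "2024-03-01",
        "2024-02-01",
    ]),
})

# Shift each row one month back: a row (user, m) here means "active in m+1".
next_month = activity.assign(
    month=activity["month"] - pd.DateOffset(months=1),
    active_next=True,
)

# Left join: rows without a match had no activity in the following month.
labels = activity.merge(next_month, on=["user_id", "month"], how="left")
labels["churned"] = labels["active_next"].isna()

# The last observed month has no "next month" yet, so it cannot be labeled.
labels = labels[labels["month"] < activity["month"].max()].drop(columns="active_next")
print(labels)
```

Under this definition user 2 counts as churned in January even though they return in March. Whether such gaps should count as churn is a modeling choice worth stating in the limits section.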


## 9. Insights (5 bullets)
- [ ] Retention takeaway #1
- [ ] Content mix insight #2
- [ ] Country segmentation insight #3
- [ ] Model/metric observation #4
- [ ] Business recommendation #5


## 10. Limits & Next Steps
- Data limitations / assumptions
- What to collect next
- How to improve segmentation or churn modeling
