# Advanced OTT Streaming Analytics
## User Insights, Churn Prediction & Hybrid Recommendation Engine

This notebook performs an end-to-end analysis of OTT streaming data, including:
- Data generation and cleaning
- Exploratory Data Analysis (EDA)
- Visualization
- Churn prediction modeling
- Content-based recommendations
- Collaborative filtering
- Hybrid recommendation engine


In [1]:
import sys
!{sys.executable} -m pip install pandas numpy scikit-learn matplotlib seaborn plotly



In [2]:
# Import required libraries and setup paths
import sys
import os

# Get project root directory - handle both notebook execution contexts
current_dir = os.getcwd()
if 'notebooks' in current_dir:
    # If we're in notebooks folder, go up one level
    project_root = os.path.dirname(current_dir)
else:
    # Otherwise assume we're at project root
    project_root = current_dir

src_dir = os.path.join(project_root, 'src')

# Verify paths exist
if not os.path.exists(src_dir):
    # Try alternative detection
    script_path = os.path.abspath(__file__ if '__file__' in globals() else '.')
    if 'notebooks' in script_path:
        project_root = os.path.dirname(os.path.dirname(script_path))
        src_dir = os.path.join(project_root, 'src')

# Add src directory to path
if src_dir not in sys.path and os.path.exists(src_dir):
    sys.path.insert(0, src_dir)

# Change to project root for relative paths in scripts
if os.path.exists(project_root):
    os.chdir(project_root)

# Now import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    # Older Matplotlib releases only ship this style under its legacy name
    try:
        plt.style.use('seaborn')
    except OSError:
        pass  # Fall back to the default style

print("✓ Libraries imported successfully")
print(f"✓ Project root: {project_root}")
print(f"✓ Working directory: {os.getcwd()}")
print(f"✓ Python version: {sys.version}")
print(f"✓ Pandas version: {pd.__version__}")

# Force reload modules to ensure we get the latest code
import importlib
print("\n🔄 Forcing module reload to clear cache...")
modules_to_reload = ['churn_model', 'collaborative_filtering', 'hybrid_recommender', 
                     'content_based_recommender', 'data_cleaning', 'eda', 'visualization', 
                     'generate_data', 'path_utils']
for mod_name in modules_to_reload:
    if mod_name in sys.modules:
        importlib.reload(sys.modules[mod_name])
        print(f"  ✓ Reloaded {mod_name}")
print("✓ Module cache cleared!")


✓ Libraries imported successfully
✓ Project root: /Users/mbgirish/StreamPulse-Analytics
✓ Working directory: /Users/mbgirish/StreamPulse-Analytics
✓ Python version: 3.11.14 (main, Oct  9 2025, 16:16:55) [Clang 17.0.0 (clang-1700.4.4.1)]
✓ Pandas version: 2.3.3

🔄 Forcing module reload to clear cache...
✓ Module cache cleared!


## Step 1: Generate Synthetic Datasets


In [3]:
# Generate all datasets
from generate_data import main as generate_main

print("Generating synthetic datasets...")
generate_main()
print("\n✓ Data generation complete!")


Generating synthetic datasets...
Generating OTT Users dataset...
✓ Generated 10000 users

Generating OTT Titles dataset...
✓ Generated 2000 titles

Generating OTT Watch History dataset...
✓ Generated 150000 watch records

Dataset Generation Complete!

Users: 10000
Titles: 2000
Watch Records: 150000

Churn Rate: 14.68%
Average Watch Time: 88.2 minutes

✓ Data generation complete!


## Step 2: Data Cleaning and Preprocessing


In [4]:
# Clean and preprocess datasets
from data_cleaning import main as clean_main

print("Cleaning datasets...")
users_df, watch_df, titles_df, merged_df = clean_main()
print("\n✓ Data cleaning complete!")


Cleaning datasets...
Loading datasets...

Cleaning users data...
✓ Cleaned 10000 users

Cleaning watch history data...
✓ Cleaned 149988 watch records

Cleaning titles data...
✓ Cleaned 2000 titles

Merging datasets...
✓ Merged dataset: 149988 records

Data Cleaning Complete!

Users: 10000
Watch Records: 149988
Titles: 2000
Merged Records: 149988

✓ Data cleaning complete!
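
The output above shows the watch history shrinking from 150000 to 149988 records before the merge. The source of `data_cleaning` is not shown here, so the following is only a minimal sketch of one common cleaning pattern (dropping rows with missing values and exact duplicates, then left-joining onto users and titles). The column names and toy data are assumptions, not the project's schema.

```python
import pandas as pd

# Hypothetical toy frames; column names are illustrative assumptions.
demo_users = pd.DataFrame({"user_id": [1, 2], "plan": ["basic", "premium"]})
demo_titles = pd.DataFrame({"title_id": [10, 11], "genre": ["Action", "Drama"]})
demo_watch = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "title_id": [10, 11, 10, 11],
    "watch_time": [45.0, None, 90.0, 90.0],
})
# Append an exact copy of the last row so there is a duplicate to remove.
demo_watch = pd.concat([demo_watch, demo_watch.iloc[[3]]], ignore_index=True)

# Drop rows with a missing watch time, then drop exact duplicates.
clean_watch = demo_watch.dropna(subset=["watch_time"]).drop_duplicates()

# Left joins keep every cleaned watch record and attach user and title attributes.
demo_merged = (
    clean_watch
    .merge(demo_users, on="user_id", how="left")
    .merge(demo_titles, on="title_id", how="left")
)
print(len(demo_watch), len(clean_watch), len(demo_merged))
```

With left joins on unique keys, the merged frame has the same number of rows as the cleaned watch history, which matches the equal counts (149988) reported above.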


## Step 3: Exploratory Data Analysis (EDA)


In [5]:
# Perform comprehensive EDA
from eda import main as eda_main

print("Performing EDA...")
users_df, watch_df, titles_df = eda_main()
print("\n✓ EDA complete!")


Performing EDA...
Loading cleaned datasets...

EXPLORATORY DATA ANALYSIS

1. Genre Engagement Analysis...
           total_watch_time  avg_watch_time  watch_count  avg_rating  \
genre                                                                  
Animation         1203315.0           88.88        13538        3.20   
Fantasy           1198426.2           86.73        13818        3.20   
Romance           1149646.7           88.40        13005        3.21   
Drama             1139751.6           87.79        12983        3.21   
Thriller          1138306.2           86.71        13128        3.20   
Mystery           1114470.4           88.89        12537        3.20   
Action            1083749.6           86.72        12497        3.21   
Crime             1042493.9           87.08        11972        3.21   
Horror            1033144.9           86.70        11917        3.20   
Comedy            1027307.3           85.97        11950        3.24   

           unique_users  
gen
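
The genre engagement table above has the shape of a grouped aggregation. As a hedged sketch (the actual `eda` module is not shown), the same five columns can be produced with pandas named aggregation. The input frame and its column names (`watch_time`, `rating`, `user_id`) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical merged watch records; values are made up for the example.
demo_records = pd.DataFrame({
    "genre": ["Action", "Action", "Drama", "Drama", "Drama"],
    "user_id": [1, 2, 1, 1, 3],
    "watch_time": [60.0, 90.0, 30.0, 45.0, 120.0],
    "rating": [4, 3, 5, 4, 2],
})

# One row per genre, sorted by total watch time as in the table above.
genre_stats = (
    demo_records.groupby("genre")
    .agg(
        total_watch_time=("watch_time", "sum"),
        avg_watch_time=("watch_time", "mean"),
        watch_count=("watch_time", "count"),
        avg_rating=("rating", "mean"),
        unique_users=("user_id", "nunique"),
    )
    .round(2)
    .sort_values("total_watch_time", ascending=False)
)
print(genre_stats)
```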

## Step 4: Create Visualizations


In [6]:
# Generate all visualizations
from visualization import main as viz_main

print("Generating visualizations...")
viz_main()
print("\n✓ Visualizations saved to outputs/charts/")


Generating visualizations...
Loading datasets...

GENERATING VISUALIZATIONS

1. Creating engagement heatmap...
✓ Saved engagement_heatmap.png

2. Creating genre popularity over time...
✓ Saved genre_popularity_over_time.png

3. Creating rating distribution...
✓ Saved rating_distribution.png

4. Creating device watch patterns...
✓ Saved device_watch_patterns.png

5. Creating churn breakdown...
✓ Saved churn_breakdown.png

6. Creating KMeans cluster visualization...
✓ Saved kmeans_clusters.png

All visualizations saved to outputs/charts/

✓ Visualizations saved to outputs/charts/
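
The last chart relies on KMeans clustering of users. The `visualization` module's code is not shown, so this is only a sketch of the usual approach: scale the per-user features, then fit KMeans. The synthetic features stand in for real per-user aggregates, and the choice of four clusters follows the summary at the end of this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for per-user features (for example total watch time
# and session count); the real feature set is an assumption.
demo_features, _ = make_blobs(
    n_samples=400, n_features=2, centers=4, cluster_std=0.8, random_state=42
)

# KMeans is distance based, so features are standardized first.
demo_scaled = StandardScaler().fit_transform(demo_features)

demo_kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
demo_labels = demo_kmeans.fit_predict(demo_scaled)

# Number of users assigned to each cluster.
cluster_sizes = np.bincount(demo_labels, minlength=4)
print(cluster_sizes)
```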


## Step 5: Churn Prediction Model


In [7]:
# Train churn prediction models
from churn_model import main as churn_main

print("Training churn prediction models...")
results, trained_models, scaler, le_dict = churn_main()
print("\n✓ Churn prediction complete!")


Training churn prediction models...
Loading datasets...

CHURN PREDICTION MODEL

1. Feature Engineering...

2. Preparing Features...
   Features: 13
   Samples: 10000
   Churn Rate: 14.68%

3. Training Models...

Training Logistic Regression...
  Accuracy: 0.9190
  Precision: 0.8474
  Recall: 0.5476
  F1-Score: 0.6653
  ROC-AUC: 0.8868

Training Random Forest...
  Accuracy: 0.9170
  Precision: 0.8265
  Recall: 0.5510
  F1-Score: 0.6612
  ROC-AUC: 0.8649

Training Gradient Boosting...
  Accuracy: 0.9200
  Precision: 0.8221
  Recall: 0.5816
  F1-Score: 0.6813
  ROC-AUC: 0.8844

4. Creating Visualizations...
✓ Saved churn_confusion_matrices.png
✓ Saved churn_roc_curves.png

5. Saving Metrics...
✓ Saved churn_model_metrics.txt

Churn Prediction Complete!

✓ Churn prediction complete!
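
With a churn rate near 15%, accuracy alone can look strong while recall stays modest, which is the pattern in the metrics above. The sketch below shows how the five reported metrics can be computed with scikit-learn and compares accuracy against a trivial "nobody churns" baseline. It uses synthetic data from `make_classification`, not the project's features, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in: about 15% positives and 13 features.
X, y = make_classification(n_samples=2000, n_features=13, n_informative=6,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only to avoid leakage.
demo_scaler = StandardScaler().fit(X_train)
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(demo_scaler.transform(X_train), y_train)

X_test_scaled = demo_scaler.transform(X_test)
y_pred = demo_model.predict(X_test_scaled)
y_prob = demo_model.predict_proba(X_test_scaled)[:, 1]

demo_metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, y_prob),  # uses probabilities, not labels
}

# Accuracy obtained by always predicting the majority class (no churn).
baseline_accuracy = 1 - y_test.mean()
print(demo_metrics, baseline_accuracy)
```

Reading accuracy next to the baseline, and next to recall, gives a fairer picture of how many churners a model actually catches.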


## Step 6: Content-Based Recommender


In [8]:
# Initialize and train content-based recommender
from content_based_recommender import main as content_main

print("Training content-based recommender...")
content_recommender = content_main()
print("\n✓ Content-based recommender ready!")


Training content-based recommender...
Loading datasets...

CONTENT-BASED RECOMMENDER

1. Initializing Recommender...

2. Fitting Model...
Fitting TF-IDF vectorizer...
Computing cosine similarity matrix...
✓ Similarity matrix shape: (2000, 2000)

3. Testing Recommendations...

Recommendations for Title ID 1:
            title_name  genre  similarity_score
 Action Chronicles 190 Action               1.0
    Action Secrets 635 Action               1.0
 Action Chronicles 842 Action               1.0
Action Chronicles 1256 Action               1.0
    Action Legacy 1463 Action               1.0

Recommendations for User ID 2827:
             title_name       genre  similarity_score
    Action Secrets 1670      Action               1.0
      Action Tales 1754      Action               1.0
      Drama Secrets 667       Drama               1.0
   Drama Chronicles 654       Drama               1.0
       Drama Legacy 638       Drama               1.0
      Horror Quest 1674      Horror       
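
The log above mentions a TF-IDF vectorizer and a cosine similarity matrix. The following is a small sketch of that technique on four made-up titles; the `text` column and the helper `similar_titles` are hypothetical and do not come from `content_based_recommender`. A similarity score of exactly 1.0, as in the output above, means the vectorized text of two titles is identical, which would be consistent with titles being described by very few distinct tokens.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue; the descriptive text per title is an assumption.
demo_catalogue = pd.DataFrame({
    "title_id": [1, 2, 3, 4],
    "title_name": ["Action Chronicles", "Action Secrets",
                   "Drama Legacy", "Drama Tales"],
    "text": ["action heist chase", "action spy chase",
             "drama family loss", "drama family court"],
})

tfidf_matrix = TfidfVectorizer().fit_transform(demo_catalogue["text"])
similarity = cosine_similarity(tfidf_matrix)  # shape: (n_titles, n_titles)

def similar_titles(title_id, top_n=2):
    """Return the top_n titles most similar to title_id, excluding itself."""
    idx = demo_catalogue.index[demo_catalogue["title_id"] == title_id][0]
    scores = similarity[idx].copy()
    scores[idx] = -1.0  # push the query title to the bottom of the ranking
    order = np.argsort(scores)[::-1][:top_n]
    result = demo_catalogue.iloc[order].assign(similarity_score=scores[order])
    return result[["title_name", "similarity_score"]]

demo_recs = similar_titles(1)
print(demo_recs)
```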

## Step 7: Collaborative Filtering Recommender


In [9]:
# Initialize and train collaborative filtering recommender
from collaborative_filtering import main as collab_main

print("Training collaborative filtering recommender...")
collab_recommender = collab_main()
print("\n✓ Collaborative filtering recommender ready!")


Training collaborative filtering recommender...
Loading datasets...

COLLABORATIVE FILTERING RECOMMENDER

Note: Using scikit-learn NMF as fallback (Surprise not available)
For best results, install Surprise with: pip install scikit-surprise
(May require Python < 3.14 or building from source)


1. Initializing Recommender...

2. Preparing Data...
Filtered data: 54984 ratings from 9394 users

3. Training Model...

Training collaborative filtering model...
✓ NMF Model Trained (Fallback)
  RMSE: 3.2554
  MAE: 3.0373

4. Testing Recommendations...

Recommendations for User ID 2827:
              title_name       genre  predicted_rating
     Action Chronicles 1      Action                 1
      Drama Journey 1344       Drama                 1
       Crime Legacy 1342       Crime                 1
    Mystery Journey 1341     Mystery                 1
     Comedy Journey 1340      Comedy                 1
    Fantasy Journey 1339     Fantasy                 1
      Drama Journey 1338     
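
The fallback path uses scikit-learn's NMF. As a hedged illustration of that technique (not the project's `collaborative_filtering` code), the sketch below factorizes a tiny user-by-title rating matrix and measures RMSE on the observed entries only. Encoding unobserved ratings as 0 is an assumption of this sketch; on a sparse matrix that encoding pulls predictions toward zero, which is one possible contributor to an RMSE as high as the one reported above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy user x title rating matrix on a 1-5 scale; 0 marks "not rated".
demo_ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Factorize into non-negative user and title factors with two latent dimensions.
demo_nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
user_factors = demo_nmf.fit_transform(demo_ratings)
title_factors = demo_nmf.components_

# Reconstruct ratings and clip them to the valid rating range.
demo_predicted = np.clip(user_factors @ title_factors, 1.0, 5.0)

# Evaluate only where a real rating exists.
observed = demo_ratings > 0
demo_rmse = np.sqrt(np.mean((demo_predicted[observed] - demo_ratings[observed]) ** 2))
print(round(float(demo_rmse), 4))
```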

## Step 8: Hybrid Recommendation Engine


In [10]:
# Create and test hybrid recommender
from hybrid_recommender import main as hybrid_main

print("Creating hybrid recommendation engine...")
hybrid_recommender = hybrid_main()
print("\n✓ Hybrid recommender complete!")


Creating hybrid recommendation engine...
Loading datasets...

HYBRID RECOMMENDER ENGINE

1. Initializing Content-Based Recommender...
Fitting TF-IDF vectorizer...
Computing cosine similarity matrix...
✓ Similarity matrix shape: (2000, 2000)

2. Initializing Collaborative Filtering Recommender...
Filtered data: 54984 ratings from 9394 users

Training collaborative filtering model...
✓ NMF Model Trained (Fallback)
  RMSE: 3.2554
  MAE: 3.0373

3. Creating Hybrid Recommender...

4. Testing Hybrid Recommendations...

User 2827 - Top 3 Recommendations:
          title_name   genre  hybrid_score
 Action Chronicles 1  Action           0.6
   Action Tales 1334  Action           0.6
Mystery Journey 1341 Mystery           0.6

User 2093 - Top 3 Recommendations:
            title_name       genre  hybrid_score
   Action Chronicles 1      Action           0.6
Documentary Quest 1331 Documentary           0.6
  Fantasy Journey 1339     Fantasy           0.6

User 7164 - Top 3 Recommendations:
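
A common way to build a hybrid score is a weighted sum of normalized content-based and collaborative scores. The sketch below illustrates that idea only; the function names, the min-max normalization, and the 0.4/0.6 weights are assumptions, and `hybrid_recommender` may combine scores differently.

```python
import pandas as pd

def min_max(scores):
    """Scale a score Series to [0, 1]; a constant Series maps to all zeros."""
    span = scores.max() - scores.min()
    if span == 0:
        return pd.Series(0.0, index=scores.index)
    return (scores - scores.min()) / span

def hybrid_scores(content, collab, content_weight=0.4):
    """Weighted sum of normalized scores; titles missing from one side get 0."""
    combined = pd.concat(
        [min_max(content).rename("content"), min_max(collab).rename("collab")],
        axis=1,
    ).fillna(0.0)
    total = (content_weight * combined["content"]
             + (1 - content_weight) * combined["collab"])
    return total.sort_values(ascending=False)

# Hypothetical per-title scores for one user.
demo_content = pd.Series({"A": 0.9, "B": 0.5, "C": 0.1})  # cosine similarities
demo_collab = pd.Series({"A": 2.0, "B": 4.0, "D": 3.0})   # predicted ratings
demo_hybrid = hybrid_scores(demo_content, demo_collab)
print(demo_hybrid)
```

When every candidate ends up with the same hybrid score, as in the 0.6 values above, the ranking among those titles is effectively arbitrary, so it is worth checking that both component scores actually vary across titles.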
  

## Step 9: Summary and Insights

### Key Findings:
1. **User Segmentation**: Users clustered into 4 distinct groups based on watch patterns
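2. **Churn Prediction**: All three models reach about 92% accuracy against a 14.68% churn rate. Gradient Boosting has the best F1-score (0.6813) and Logistic Regression the best ROC-AUC (0.8868), but recall stays below 0.60, so more than 40% of churners go undetected
3. **Recommendations**: Hybrid approach combines content-based and collaborative filtering. In this run the NMF fallback reports an RMSE of 3.2554 and predicts a rating of 1 for every listed title, so the collaborative component needs attention before its scores are meaningful
4. **Genre Analysis**: Animation and Fantasy lead in total watch time (about 1.2 million minutes each); Drama ranks fourth and Action seventh among the ten genres listed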
5. **Device Patterns**: TV devices show highest watch time per session

### Next Steps:
- Deploy models to production
- Set up real-time recommendation API
- Implement A/B testing for recommendations
- Monitor model performance over time
