## Problem Statement

**Current Situation:**
- Streamly uses random title selection for recommendations
- Users report poor satisfaction with suggestions
- No consideration for user preferences or content appropriateness

**Objectives (implemented in this notebook and backend):**
The recommendation algorithm implemented here:
- Respects content appropriateness for kids profiles (filters to kids content when requested)
- Considers user preferences (language, genres)
- Loads user demographics (age band) with the profile; the current implementation does not yet use it for filtering or ranking
- Produces a preference-based ranking (by preference match score)
- Is computationally efficient and easily explainable
- Works without historical viewing data (cold start scenario)

## Algorithm Architecture: Content-Based Filtering with Segmentation

### High-Level Flow

```
Input: profile_id (integer)
    ↓
1. Load Profile Characteristics
   - kids_profile flag
   - age_band 
   - preferred_language
   - preferences (comma-separated genres)
    ↓
2. Load Full Title Catalog
   - All titles from database
    ↓
3. Apply Content-Appropriateness Filter
   - If kids_profile == 1: Filter to is_kids_content == 1
   - Else: include all titles (no exclusion)
    ↓
4. Apply Language Preference Filter
   - Keep titles matching preferred_language
   - Also keep titles with NULL language (universal content)
    ↓
5. Apply Genre Preference Scoring
   - If preferences exist: Score titles by matching categories (1 or 0)
   - If no preferences: Default score = 0 (equal priority)
    ↓
6. Sort Results
   - By preference match score (descending)
    ↓
7. Return Top 10 Recommendations
Output: List of recommended titles with metadata
```

## Key Design Decisions

### 1. Kids Content Filtering
**Decision:** Use binary `kids_profile` flag to filter content when the profile indicates a kids profile.

**Rationale:**
- Age-appropriate content is critical for user safety and satisfaction.
- If `kids_profile == 1` we limit recommendations to content flagged as kids content.
- For non-kids profiles we do not exclude titles (implementation intentionally keeps all titles available).

---

### 2. Language Preference Matching
**Decision:** Exact match on `preferred_language`, also keeping titles whose language is NULL.

**Rationale:**
- Users prefer content in languages they understand.
- Titles with a NULL language are treated as universal/multilingual content, so they remain eligible for every profile.
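
A minimal sketch of this filter on a toy catalog (the titles and language codes are made up for illustration). In pandas, comparing a missing value with `==` yields `False`, which is why the explicit `isna()` term is needed to keep language-missing titles.

```python
import pandas as pd

# Toy catalog; the titles and language codes are illustrative assumptions
catalog = pd.DataFrame({
    "title_name": ["Nordic Noir", "Silent Short", "Paris Nights"],
    "language": ["sv", None, "fr"],
})

preferred_language = "sv"

# Comparing a missing value with == yields False, so NULL-language rows
# must be kept explicitly with isna()
mask = (catalog["language"] == preferred_language) | catalog["language"].isna()
filtered = catalog[mask]

print(filtered["title_name"].tolist())
```

Without the `isna()` term, "Silent Short" would be dropped along with "Paris Nights".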

---

### 3. Genre/Category Preference Scoring
**Decision:** Binary score (1 or 0) based on category match.

**Rationale:**
- User preferences are stored as comma-separated genres.
- Score = 1 if title category matches any preference; otherwise 0.
- This keeps the algorithm simple and explainable.
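
The binary score can be sketched on a toy catalog as follows. The categories and the preference string are assumptions for illustration; only the column names `category` and `score` come from this notebook.

```python
import pandas as pd

# Toy data; the categories and preference string are illustrative assumptions
catalog = pd.DataFrame({
    "title_name": ["Deep Space", "Laugh Track", "Cold Case"],
    "category": ["Sci-Fi", "Comedy", "Crime"],
})
preferences = "Comedy, Sci-Fi"  # stored as a comma-separated string

# Strip whitespace so " Sci-Fi" matches "Sci-Fi"
preferred_genres = [g.strip() for g in preferences.split(",")]

# isin() gives a boolean Series; casting to int yields the 1/0 score
scored = catalog.assign(score=catalog["category"].isin(preferred_genres).astype(int))

print(scored[["title_name", "score"]].to_dict(orient="records"))
```

Matching here is exact and case-sensitive, so a stored preference of `comedy` would not match the category `Comedy`.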

---

### 4. Efficiency Considerations
**Decision:** Use straightforward, vectorized operations and keep logic simple to match the production backend.

**Rationale:**
- Implementation reads profile and catalog from SQLite and performs in-memory filtering and vectorized scoring in pandas.
- This keeps latency low and behavior predictable.

In [None]:
import pandas as pd
import numpy as np

In [None]:
import sqlite3

## Algorithm Implementation

In [None]:
def recommend_titles(profile_id, limit=10):
    """
    Content-based recommendation algorithm for Streamly.
    
    Connects to SQLite database and recommends titles based on user profile.
    
    Algorithm Steps:
    1. Load user profile from database
    2. Load all titles from database
    3. Filter titles by age-appropriateness (kids flag)
    4. Filter titles by language preference
    5. Score titles by genre preference match
    6. Sort by score (descending)
    7. Return top N recommendations
    
    Args:
        profile_id (int): User profile identifier
        limit (int): Number of recommendations to return (default: 10)
        
    Returns:
        list: List of recommended title dictionaries with metadata
    """
    # Steps 1-2: Load the profile and the full catalog.
    # try/finally closes the connection even if a query fails.
    conn = sqlite3.connect("streamly.db")
    try:
        # Parameterized query avoids SQL injection through profile_id
        profiles = pd.read_sql(
            "SELECT * FROM profiles WHERE profile_id = ?", conn, params=(profile_id,)
        )
        titles = pd.read_sql("SELECT * FROM titles", conn)
    finally:
        conn.close()

    # Fail with a clear message instead of an IndexError on unknown profiles
    if profiles.empty:
        raise ValueError(f"No profile found with profile_id={profile_id}")
    profile = profiles.iloc[0]

    # Step 3: Filter by kids content
    if profile["kids_profile"] == 1:
        titles = titles[titles["is_kids_content"] == 1]

    # Step 4: Filter by language preference (NULL-language titles are always kept)
    if pd.notna(profile["preferred_language"]):
        titles = titles[
            (titles["language"] == profile["preferred_language"]) |
            (titles["language"].isna())
        ]

    # Copy before adding a column so pandas does not warn about writing to a slice
    titles = titles.copy()

    # Step 5: Score by genre preferences (vectorized 1/0 match)
    if pd.notna(profile["preferences"]):
        preferred_genres = [g.strip() for g in profile["preferences"].split(",")]
        titles["score"] = titles["category"].isin(preferred_genres).astype(int)
    else:
        titles["score"] = 0

    # Step 6: Sort by score (descending); a stable sort keeps catalog order among ties
    titles = titles.sort_values("score", ascending=False, kind="stable")

    # Step 7: Return top recommendations
    return titles.head(limit).to_dict(orient="records")

## Testing & Validation

In [None]:
# Test with sample profiles
print("=" * 60)
print("TESTING RECOMMENDATION ALGORITHM")
print("=" * 60)

# Test Profile 1
try:
    recommendations = recommend_titles(profile_id=1, limit=5)
    
    print("\nTOP 5 RECOMMENDATIONS FOR PROFILE 1:")
    print("-" * 60)
    
    for i, rec in enumerate(recommendations, 1):
        print(f"\n{i}. {rec.get('title_name', 'N/A')}")
        print(f"   Category: {rec.get('category', 'N/A')}")
        print(f"   IMDB Rating: {rec.get('imdb_rating', 'N/A')}")
        print(f"   Age Rating: {rec.get('age_rating', 'N/A')}")
        print(f"   Region: {rec.get('origin_region', 'N/A')}")
        print(f"   Kids Content: {rec.get('is_kids_content', 'N/A')}")
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Test with another profile
try:
    recommendations_2 = recommend_titles(profile_id=5, limit=5)
    
    print("\n" + "=" * 60)
    print("TOP 5 RECOMMENDATIONS FOR PROFILE 5:")
    print("=" * 60)
    
    for i, rec in enumerate(recommendations_2, 1):
        print(f"\n{i}. {rec.get('title_name', 'N/A')}")
        print(f"   Category: {rec.get('category', 'N/A')}")
        print(f"   IMDB Rating: {rec.get('imdb_rating', 'N/A')}")
        print(f"   Kids Content: {rec.get('is_kids_content', 'N/A')}")
except Exception as e:
    print(f"Error testing Profile 5: {e}")

## Algorithm Limitations

### Current Limitations

| Limitation | Impact | Reason |
|-----------|--------|--------|
| **No collaborative filtering** | Profiles with similar tastes do not benefit from each other's behavior | Lacks user-to-user similarity analysis |
| **No viewing history** | Each profile starts from scratch | Director Andreas deleted historical data |
| **No temporal effects** | Trending content not prioritized | Algorithm is static/stateless |
| **No title-to-title similarity** | Doesn't recommend titles similar to watched content | Would require embeddings or manual tagging |
| **Binary genre scoring** | Genres are "match" or "no match" - no partial credit | Simplification for MVP |
| **Cold start for profiles without preferences** | Every title scores 0, so the ranking is not personalized | No baseline preferences |

### Performance Considerations
- **Current approach:** ~50ms per recommendation (load data, filter, sort)
- **Bottleneck:** Loading titles from database on each request
- **Solution:** Cache catalog in memory at startup
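
One way to cache the catalog is sketched below, using `functools.lru_cache` from the standard library. The in-memory database, the `load_catalog` helper, and the `query_count` counter are hypothetical and only illustrate the idea; they are not part of the Streamly backend.

```python
import sqlite3
from functools import lru_cache

# Hypothetical in-memory database standing in for streamly.db
_conn = sqlite3.connect(":memory:")
_conn.execute("CREATE TABLE titles (title_id INTEGER, title_name TEXT)")
_conn.executemany(
    "INSERT INTO titles VALUES (?, ?)", [(1, "Deep Space"), (2, "Cold Case")]
)

query_count = 0  # counts how often the database is actually queried


@lru_cache(maxsize=1)
def load_catalog():
    """Load the catalog once; later calls return the cached tuple."""
    global query_count
    query_count += 1
    rows = _conn.execute("SELECT title_id, title_name FROM titles").fetchall()
    return tuple(rows)  # immutable, so callers cannot corrupt the cache


first = load_catalog()
second = load_catalog()  # served from the cache, no second query
print(query_count)
```

A cached catalog goes stale when titles change, so some invalidation step is needed, for example calling `load_catalog.cache_clear()` after a catalog update.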

## Future Improvements

### Phase 2: Short-term Enhancements (1-2 sprints)
1. **Viewing History Tracking**
   - Record what users have watched
   - Avoid recommending already-seen titles
   - Track completion rates

2. **Weighted Genre Scoring**
   - Replace binary scoring (0/1) with weighted scores
   - Boost primary preferences, secondary preferences get lower weight
   - Example: Primary genre = 1.0, Secondary genre = 0.5

3. **User Feedback Loop**
   - Add "Thumbs Up" / "Thumbs Down" buttons to UI
   - Track positive/negative feedback
   - Adjust recommendations based on feedback
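
The weighted genre scoring idea could look roughly like the sketch below. It uses the 1.0 / 0.5 weights from the example above and assumes that the first genre in the comma-separated string is the primary preference; the notebook does not define how primary and secondary genres are distinguished, so that ordering rule and the `weighted_genre_score` helper are hypothetical.

```python
def weighted_genre_score(category, preferences, primary_weight=1.0, secondary_weight=0.5):
    """Score a title category against an ordered, comma-separated preference string.

    Assumption: the first listed genre is the primary preference.
    """
    genres = [g.strip() for g in preferences.split(",") if g.strip()]
    if not genres or category not in genres:
        return 0.0
    return primary_weight if category == genres[0] else secondary_weight


prefs = "Comedy, Sci-Fi, Crime"
scores = {cat: weighted_genre_score(cat, prefs) for cat in ["Comedy", "Crime", "Drama"]}
print(scores)
```

Because the result is still a single numeric score per title, it can replace the binary score without changing the sort step.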

### Phase 3: Medium-term Improvements (1-2 months)
1. **Collaborative Filtering**
   - Find profiles with similar preferences
   - Recommend titles liked by similar profiles
   - Reduces cold-start problem

2. **Content-Based Similarity**
   - Calculate similarity between titles (same director, cast, genre, etc.)
   - Recommend titles similar to user's watch history
   - Personalize beyond explicit preferences

3. **A/B Testing Framework**
   - Compare algorithm versions
   - Measure click-through rate (CTR)
   - Measure watch-through rate (completion %)
   - Measure user satisfaction (ratings)
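
As a rough illustration of the collaborative filtering item, profiles can be compared by the Jaccard similarity of their stated genres, and titles liked by the nearest profile can be suggested. All data below is made up, and the `liked` mapping assumes that feedback or viewing data is already being collected; `recommend_from_neighbors` is a hypothetical helper.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical data: stated genres and liked titles per profile_id
genres = {
    1: {"Comedy", "Sci-Fi"},
    2: {"Comedy", "Sci-Fi", "Crime"},
    3: {"Documentary"},
}
liked = {
    1: {"Laugh Track"},
    2: {"Laugh Track", "Deep Space"},
    3: {"Ocean Life"},
}


def recommend_from_neighbors(profile_id, k=1):
    """Suggest titles liked by the k most similar profiles that the target has not liked yet."""
    others = [p for p in genres if p != profile_id]
    neighbors = sorted(
        others, key=lambda p: jaccard(genres[profile_id], genres[p]), reverse=True
    )[:k]
    candidates = set()
    for p in neighbors:
        candidates |= liked[p]
    return sorted(candidates - liked[profile_id])


print(recommend_from_neighbors(1))
```

Profile 2 shares two of three genres with profile 1, so it is chosen as the neighbor and contributes the one title profile 1 has not liked.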

### Phase 4: Advanced ML (3+ months)
1. **Matrix Factorization**
   - Decompose user-item interaction matrix
   - Find latent factors (hidden preferences)
   - Techniques: SVD, NMF, ALS

2. **Deep Learning Models**
   - Neural Collaborative Filtering
   - Recurrent Neural Networks for sequential patterns
   - Embeddings for content and users

3. **Real-time Updates**
   - Update recommendations as user watches
   - Streaming data pipeline (e.g., Kafka)
   - Real-time model serving

4. **Analytics Dashboard**
   - Track recommendation quality metrics
   - Monitor algorithm performance
   - Identify user segments for targeting
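
To make the matrix factorization item more concrete, the sketch below applies a truncated SVD to a tiny, made-up profile-by-title interaction matrix with NumPy. This is only an illustration of the decomposition idea; plain SVD treats zeros as observed values, whereas production approaches such as ALS usually handle missing interactions explicitly.

```python
import numpy as np

# Hypothetical profile-by-title interaction matrix (1 = liked, 0 = no signal)
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

# Thin SVD: interactions = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(interactions, full_matrices=False)

k = 2  # number of latent factors kept
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The reconstructed value at a cell with no signal can be read as a predicted affinity
print(round(float(approx[3, 3]), 3))
```

Here the last profile has no interaction with the last title, but the low-rank reconstruction assigns that cell a positive value because a profile with overlapping taste liked it.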

## Conclusion

The **content-based filtering with segmentation** approach provides:
- ✅ **Explainability:** Easy to understand why recommendations were made
- ✅ **Quick Development:** Implemented in days, not months
- ✅ **Personalization:** Respects user preferences (language and genres)
- ✅ **Safety:** Age-appropriate content filtering
- ✅ **Scalability:** Near-linear cost in catalog size (linear filtering and scoring, plus one sort)
- ✅ **MVP Foundation:** Platform for future ML improvements

This algorithm solves Streamly's immediate problem (better than random recommendations) while providing a foundation for more sophisticated approaches as the company grows.