# StreamCart AI Assistant - Capstone Project

**Your Name:** [Enter your name]

**Date:** [Enter date]

---

## Project Overview

Build an end-to-end ML-powered customer support assistant that:
1. Predicts customer churn risk
2. Retrieves relevant help articles
3. Generates helpful responses
4. Ensures quality with guardrails

**Estimated Time:** 8-12 hours

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (roc_auc_score, precision_recall_curve, 
                             classification_report, confusion_matrix)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import json
import re
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

print("Setup complete!")

## Generate StreamCart Datasets

Run this cell to generate the datasets for your project.

In [None]:
# Generate synthetic StreamCart data
np.random.seed(42)

N_CUSTOMERS = 2000

# Customer data
def generate_customers():
    tiers = ['basic', 'premium', 'enterprise']
    tier_weights = [0.5, 0.35, 0.15]
    
    data = {
        'customer_id': [f'CUST-{i:05d}' for i in range(N_CUSTOMERS)],
        'tenure_months': np.random.exponential(18, N_CUSTOMERS).astype(int).clip(1, 60),
        'subscription_tier': np.random.choice(tiers, N_CUSTOMERS, p=tier_weights),
    }
    
    # Correlated features
    tier_multiplier = {'basic': 1, 'premium': 2.5, 'enterprise': 5}
    data['monthly_spend'] = [
        np.random.normal(50 * tier_multiplier[t], 15) 
        for t in data['subscription_tier']
    ]
    data['monthly_spend'] = np.clip(data['monthly_spend'], 10, 500).round(2)
    
    # Churn-predictive features
    data['support_tickets_90d'] = np.random.poisson(2, N_CUSTOMERS)
    data['last_purchase_days'] = np.random.exponential(30, N_CUSTOMERS).astype(int).clip(1, 180)
    data['engagement_score'] = np.random.beta(5, 2, N_CUSTOMERS) * 100
    
    # Generate churn based on features (realistic relationship)
    churn_prob = (
        0.1 +
        0.15 * (data['support_tickets_90d'] > 3) +
        0.2 * (np.array(data['last_purchase_days']) > 60) +
        0.15 * (np.array(data['engagement_score']) < 30) +
        0.1 * (np.array(data['tenure_months']) < 6) -
        0.1 * (np.array(data['subscription_tier']) == 'enterprise')
    )
    churn_prob = np.clip(churn_prob, 0.05, 0.8)
    data['churned'] = (np.random.random(N_CUSTOMERS) < churn_prob).astype(int)
    
    return pd.DataFrame(data)

# Events data
def generate_events(customers_df):
    events = []
    event_types = ['login', 'purchase', 'support_contact', 'page_view', 'feature_use']
    
    for _, cust in customers_df.iterrows():
        n_events = np.random.poisson(20 + cust['engagement_score'] / 5)
        
        for _ in range(n_events):
            event_type = np.random.choice(event_types, p=[0.3, 0.15, 0.1, 0.3, 0.15])
            days_ago = np.random.randint(1, 91)
            
            metadata = {}
            if event_type == 'purchase':
                metadata['amount'] = round(np.random.exponential(50), 2)
            elif event_type == 'page_view':
                metadata['page'] = np.random.choice(['home', 'products', 'account', 'help'])
            
            events.append({
                'customer_id': cust['customer_id'],
                'event_type': event_type,
                'timestamp': (datetime.now() - timedelta(days=days_ago)).isoformat(),
                'metadata': json.dumps(metadata)
            })
    
    return pd.DataFrame(events)

# Support tickets
def generate_tickets(customers_df):
    categories = ['billing', 'technical', 'general', 'shipping']
    
    subjects = {
        'billing': ['Charge question', 'Refund request', 'Invoice needed', 'Payment failed'],
        'technical': ['Login issue', 'App not loading', 'Feature not working', 'Error message'],
        'general': ['Account question', 'How to use', 'Feedback', 'Feature request'],
        'shipping': ['Order status', 'Delivery delay', 'Wrong address', 'Return request']
    }
    
    messages = {
        'billing': [
            "I was charged twice this month. Can you help?",
            "I need a refund for my last order.",
            "Can I get an invoice for tax purposes?",
            "My payment keeps failing even though my card is valid."
        ],
        'technical': [
            "I can't log into my account after resetting my password.",
            "The app crashes every time I try to open it.",
            "The search feature isn't returning any results.",
            "I'm getting an error 500 when trying to checkout."
        ],
        'general': [
            "How do I change my email address?",
            "What's the difference between basic and premium?",
            "I love the product but wish it had dark mode.",
            "Can you add a feature to export my data?"
        ],
        'shipping': [
            "Where is my order? It's been 2 weeks.",
            "The tracking shows delivered but I didn't receive it.",
            "I entered the wrong shipping address, can you change it?",
            "I need to return an item, how do I do that?"
        ]
    }
    
    resolutions = {
        'billing': [
            "Refund processed. Should appear in 3-5 business days.",
            "Duplicate charge reversed. Apologies for the inconvenience.",
            "Invoice sent to your email.",
            "Payment issue resolved. Please try again."
        ],
        'technical': [
            "Password reset. Please clear cache and try logging in.",
            "Please update to the latest version from the app store.",
            "Bug identified and fixed. Should work now.",
            "Server issue resolved. Please try checkout again."
        ],
        'general': [
            "Email changed in your account settings.",
            "Premium includes: no ads, offline access, priority support.",
            "Thanks! Dark mode is on our roadmap for Q2.",
            "Data export feature is available in Account > Privacy."
        ],
        'shipping': [
            "Order shipped yesterday. Tracking: 1Z999AA10123456784",
            "Replacement order shipped. File claim with carrier.",
            "Address updated before shipment.",
            "Return label sent to your email. Drop off at any carrier location."
        ]
    }
    
    tickets = []
    ticket_id = 0
    
    for _, cust in customers_df.iterrows():
        n_tickets = cust['support_tickets_90d']
        
        for _ in range(n_tickets):
            cat = np.random.choice(categories, p=[0.3, 0.3, 0.2, 0.2])
            
            tickets.append({
                'ticket_id': f'TKT-{ticket_id:05d}',
                'customer_id': cust['customer_id'],
                'category': cat,
                'subject': np.random.choice(subjects[cat]),
                'message': np.random.choice(messages[cat]),
                'resolution': np.random.choice(resolutions[cat])
            })
            ticket_id += 1
    
    return pd.DataFrame(tickets)

# Products catalog
def generate_products():
    products = [
        {'product_id': 'PROD-001', 'name': 'StreamCart Basic', 'category': 'subscription', 
         'price': 9.99, 'description': 'Essential features for casual users. Ad-supported.'},
        {'product_id': 'PROD-002', 'name': 'StreamCart Premium', 'category': 'subscription',
         'price': 24.99, 'description': 'No ads, offline access, priority support.'},
        {'product_id': 'PROD-003', 'name': 'StreamCart Enterprise', 'category': 'subscription',
         'price': 99.99, 'description': 'Team features, SSO, dedicated support, API access.'},
        {'product_id': 'PROD-004', 'name': 'Gift Card $25', 'category': 'gift',
         'price': 25.00, 'description': 'StreamCart gift card. Never expires.'},
        {'product_id': 'PROD-005', 'name': 'Gift Card $50', 'category': 'gift',
         'price': 50.00, 'description': 'StreamCart gift card. Never expires.'},
        {'product_id': 'PROD-006', 'name': 'Premium Add-on: Cloud Storage', 'category': 'addon',
         'price': 4.99, 'description': 'Extra 100GB cloud storage for your content.'},
        {'product_id': 'PROD-007', 'name': 'Premium Add-on: Family Plan', 'category': 'addon',
         'price': 9.99, 'description': 'Share your subscription with up to 5 family members.'},
    ]
    return pd.DataFrame(products)

# Generate all datasets
print("Generating StreamCart datasets...")
customers_df = generate_customers()
events_df = generate_events(customers_df)
tickets_df = generate_tickets(customers_df)
products_df = generate_products()

print(f"\n✓ Customers: {len(customers_df)} rows")
print(f"✓ Events: {len(events_df)} rows")
print(f"✓ Tickets: {len(tickets_df)} rows")
print(f"✓ Products: {len(products_df)} rows")
print(f"\nChurn rate: {customers_df['churned'].mean():.1%}")

---

# Part 1: Churn Prediction Model (30%)

Build a model to predict customer churn.

## 1.1 Data Exploration

In [None]:
# TODO: Explore the customers dataset
# - What's the distribution of each feature?
# - How does churn relate to different features?
# - Any data quality issues?

customers_df.head()

In [None]:
# TODO: Create visualizations
# - Churn rate by subscription tier
# - Churn rate by tenure
# - Feature correlations


## 1.2 Feature Engineering

In [None]:
# TODO: Engineer features from events data
# Ideas:
# - Event counts by type (last 30/60/90 days)
# - Days since last activity
# - Purchase frequency and amount
# - Support contact frequency

# Be careful about leakage! Don't use future information.


## 1.3 Model Development

In [None]:
# TODO: Prepare data for modeling
# - Create feature matrix X and target y
# - Split into train/validation/test (suggest: 60/20/20)
# - Scale features if needed


In [None]:
# TODO: Train baseline model (logistic regression)


In [None]:
# TODO: Train improved model (gradient boosting)


## 1.4 Business Metrics & Threshold Selection

In [None]:
# TODO: Select optimal threshold
# Consider:
# - Cost of intervention (retention offer): $20
# - Value of retained customer: $150 annual
# - What threshold maximizes expected value?


---

# Part 2: Knowledge Base Retrieval (25%)

Build a retrieval system for help articles.

## 2.1 Create Embeddings

In [None]:
# TODO: Create embeddings for tickets
# Options: TF-IDF, sentence transformers (if available)
# Combine subject + message + resolution for embedding


## 2.2 Implement Retrieval

In [None]:
# TODO: Implement search function
def search_tickets(query: str, k: int = 5):
    """
    Search for relevant tickets given a query.
    
    Args:
        query: User's question
        k: Number of results to return
    
    Returns:
        List of (ticket, score) tuples
    """
    # Your implementation here
    pass

## 2.3 Evaluate Retrieval

In [None]:
# TODO: Create test queries and evaluate
# Metrics: Recall@K, MRR

test_queries = [
    "How do I get a refund?",
    "App keeps crashing",
    "Where is my order?",
    "Can't log in",
    "Need an invoice"
]

# Your evaluation here


---

# Part 3: Response Generation (25%)

Build an LLM-powered response generator.

## 3.1 RAG Pipeline

In [None]:
# TODO: Implement RAG pipeline
def generate_response(query: str, customer_id: str = None):
    """
    Generate a response using RAG.
    
    Args:
        query: Customer's question
        customer_id: Optional customer ID for personalization
    
    Returns:
        Generated response with confidence
    """
    # Step 1: Retrieve relevant context
    # Step 2: Get customer context (if provided)
    # Step 3: Build prompt
    # Step 4: Generate response (simulated or API call)
    pass

## 3.2 Prompt Engineering

In [None]:
# TODO: Design your prompts
# - System prompt
# - Few-shot examples
# - Context injection

SYSTEM_PROMPT = """
Your system prompt here...
"""

## 3.3 Context Utilization

In [None]:
# TODO: Use customer context to personalize responses
# - Churn risk
# - Subscription tier
# - Ticket history


---

# Part 4: Guardrails & Evaluation (20%)

Ensure your system is production-ready.

## 4.1 Input Validation

In [None]:
# TODO: Implement input validation
def validate_input(text: str):
    """
    Validate user input.
    
    Returns:
        (is_valid, issues)
    """
    pass

## 4.2 Output Safety

In [None]:
# TODO: Implement output filtering
def filter_output(response: str, context: dict):
    """
    Filter and validate output.
    
    Returns:
        (filtered_response, warnings)
    """
    pass

## 4.3 Evaluation Framework

In [None]:
# TODO: Implement evaluation metrics
def evaluate_response(question: str, response: str, context: str = None):
    """
    Evaluate response quality.
    
    Returns:
        Dict with scores for relevance, faithfulness, safety
    """
    pass

## 4.4 Human Review Integration

In [None]:
# TODO: Define human review triggers
def should_review(question: str, response: str, evaluation: dict):
    """
    Determine if response needs human review.
    
    Returns:
        (needs_review, reason)
    """
    pass

---

# Complete Pipeline

In [None]:
# TODO: Integrate all components into a complete pipeline
def handle_support_request(customer_id: str, message: str):
    """
    Complete support request handler.
    
    1. Validate input
    2. Get customer context (including churn risk)
    3. Retrieve relevant knowledge
    4. Generate response
    5. Filter and evaluate
    6. Queue for review if needed
    
    Returns:
        Response object with all metadata
    """
    pass

## End-to-End Testing

In [None]:
# TODO: Test your complete pipeline
test_cases = [
    {"customer_id": "CUST-00001", "message": "Where is my order?"},
    {"customer_id": "CUST-00050", "message": "I want a refund"},
    {"customer_id": "CUST-00100", "message": "Can't log in after password reset"},
]

for case in test_cases:
    result = handle_support_request(case['customer_id'], case['message'])
    print(f"\nCustomer: {case['customer_id']}")
    print(f"Message: {case['message']}")
    print(f"Response: {result}")

---

# Executive Summary

Write a 500-word summary for stakeholders.

## Your Executive Summary

*Write your summary here...*

---

# Self-Assessment

Fill out the self-assessment from the grading rubric.

## Your Self-Assessment

```
### Part 1: Churn Prediction (X/30)
- Data Exploration: X/5
- Feature Engineering: X/8
- Model Development: X/10
- Business Metrics: X/7

Reflection: [What went well? What would you improve?]

### Part 2: Retrieval (X/25)
- Embeddings: X/8
- Retrieval Quality: X/10
- Edge Cases: X/7

Reflection: [...]

### Part 3: Response Generation (X/25)
- RAG Pipeline: X/10
- Prompt Engineering: X/8
- Context Usage: X/7

Reflection: [...]

### Part 4: Guardrails (X/20)
- Input Validation: X/5
- Output Safety: X/5
- Evaluation: X/5
- Human Review: X/5

Reflection: [...]

### Total: X/100

### Key Learnings:
1. [...]
2. [...]
3. [...]
```