# Databricks ML Demo: Customer Value Prediction

## Simple End-to-End Machine Learning Pipeline

**What we'll build**: A model that predicts which customers are high-value for the business

**Tools we'll use**: 
- Databricks SQL for data exploration
- Feature Store for feature management  
- AutoML for model training
- MLflow for experiment tracking

**Time**: 45 minutes | **Difficulty**: Beginner

---

## Business Case
An e-commerce company wants to identify high-value customers to focus marketing efforts and increase revenue.

**Let's start building!** 

# Step 1: Getting Started 

## 📝 Your Task: Import Libraries
First, let's import the basic libraries we need for our demo.

**Try this:**
- Import `pyspark.sql.functions` as `F`
- Import `databricks.feature_store`

*Write your code in the cell below, then check the solution!*

In [0]:
# 🔨 YOUR CODE HERE




# Import PySpark functions
# from pyspark.sql import functions as F





# Import Feature Store
# from databricks.feature_store import FeatureStoreClient

In [0]:
# ✅ SOLUTION: Import Libraries

from pyspark.sql import functions as F
from databricks.feature_store import FeatureStoreClient

# Success! Libraries imported ✨

# Step 2: Create Sample Data

## 📝 Your Task: Create Customer Data
Let's create a simple dataset of e-commerce customers.

**Try this:**
- Create a DataFrame with customer information
- Include: customer_id, age, total_spent, order_count
- Use `spark.createDataFrame()` 
- Display the data with `display()`

*Hint: Start with 5-10 customers to keep it simple*

In [0]:
customers = spark.createDataFrame([
    ("CUST_001", 25, 1200.0, 8),
    ("CUST_002", 35, 2500.0, 15),
    ("CUST_003", 45, 800.0, 5),
    ("CUST_004", 28, 3200.0, 22),
    ("CUST_005", 52, 1800.0, 12),
    ("CUST_006", 31, 950.0, 6),
    ("CUST_007", 29, 4100.0, 28),
    ("CUST_008", 38, 1600.0, 11)
], ["customer_id", "age", "total_spent", "order_count"])

# Display our customer data
display(customers)

# Step 3: Create New Features

## 📝 Your Task: Engineer Features
Let's create some useful features for our model.

**Try this:**
- Calculate average order value: `total_spent / order_count`
- Create age groups: Young (18-30), Middle (31-50), Senior (51+)
- Label high-value customers: those with `total_spent > 2000`

*Use `.withColumn()` to add new columns*
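
Before writing the Spark version, it can help to check the three rules by hand on a couple of rows. The sketch below is plain Python (no Spark needed); the helper name `engineer_features` is hypothetical, and returning `None` for a customer with zero orders is an assumption about how you might want to handle that case.

```python
# Plain-Python check of the three feature rules from the task list above.
# `engineer_features` is a hypothetical helper, used only for illustration.

def engineer_features(age, total_spent, order_count):
    """Return (avg_order_value, age_group, is_high_value) for one customer."""
    # Assumption: a customer with no orders gets no average order value
    avg_order_value = total_spent / order_count if order_count else None

    # Age groups: Young (up to 30), Middle (31-50), Senior (51+)
    if age <= 30:
        age_group = "Young"
    elif age <= 50:
        age_group = "Middle"
    else:
        age_group = "Senior"

    # Target label: total spend strictly above 2000
    is_high_value = "Yes" if total_spent > 2000 else "No"
    return avg_order_value, age_group, is_high_value


# CUST_001: age 25, spent 1200.0 over 8 orders
print(engineer_features(25, 1200.0, 8))        # (150.0, 'Young', 'No')
# CUST_005: age 52, spent 1800.0 over 12 orders
print(engineer_features(52, 1800.0, 12))       # (150.0, 'Senior', 'No')
# CUST_007: age 29, spent 4100.0 over 28 orders (group and label only)
print(engineer_features(29, 4100.0, 28)[1:])   # ('Young', 'Yes')
```

The values printed here are what the matching rows should show once the same rules are expressed with `.withColumn()`.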

In [0]:
# 🔨 YOUR CODE HERE


In [0]:
# ✅ SOLUTION: Create Features

# Create average order value
customers_with_features = customers.withColumn(
    "avg_order_value", 
    F.col("total_spent") / F.col("order_count")
)

# Add age group
customers_with_features = customers_with_features.withColumn(
    "age_group",
    F.when(F.col("age") <= 30, "Young")
    .when(F.col("age") <= 50, "Middle") 
    .otherwise("Senior")
)

# Mark high-value customers (our target!)
customers_with_features = customers_with_features.withColumn(
    "is_high_value",
    F.when(F.col("total_spent") > 2000, "Yes").otherwise("No")
)

# Display our enhanced data
display(customers_with_features)

# Step 4: Save to Feature Store

## 📝 Your Task: Store Features
Let's save our features to Databricks Feature Store for reuse.

**Try this:**
- Create a FeatureStoreClient: `fs = FeatureStoreClient()`
- Add a timestamp column for versioning
- Create a feature table name like `"demo.customer_features"`

*Feature Store lets us reuse features across different models!*
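
The solution keys the feature table on `customer_id`, so each id should identify a single customer row. A quick sanity check before writing is to look for duplicate ids. The sketch below uses plain Python on the ids from Step 2; the helper name `find_duplicate_keys` is hypothetical and not part of the Feature Store API.

```python
from collections import Counter

# customer_id values from the sample data in Step 2 (CUST_001 .. CUST_008)
customer_ids = [f"CUST_{i:03d}" for i in range(1, 9)]


def find_duplicate_keys(keys):
    """Return the sorted list of keys that occur more than once."""
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


print(find_duplicate_keys(customer_ids))                 # []
print(find_duplicate_keys(customer_ids + ["CUST_003"]))  # ['CUST_003']
```

An empty list means the ids are safe to use as a primary key; anything else is worth resolving before saving the table.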

In [0]:
# 🔨 YOUR CODE HERE

# Create a FeatureStoreClient: fs = FeatureStoreClient()




# Add a timestamp column for versioning




# Create a feature table name like "demo.customer_features"

In [0]:
# ✅ SOLUTION: Save to Feature Store

# Create Feature Store client
fs = FeatureStoreClient()

# Add timestamp for versioning
features_with_timestamp = customers_with_features.withColumn(
    "timestamp", F.current_timestamp()
)

# Create database and table
spark.sql("CREATE DATABASE IF NOT EXISTS demo")
table_name = "demo.customer_features"

# Save to Feature Store
try:
    fs.create_table(
        name=table_name,
        primary_keys=["customer_id"],
        timestamp_keys=["timestamp"],
        df=features_with_timestamp,
        description="Customer features for ML demo"
    )
    
    # Success!
    saved_features = fs.read_table(table_name)
    display(saved_features.limit(5))
    
except Exception as e:
    # Table might already exist; in that case, read it back instead
    if "already exists" in str(e):
        saved_features = fs.read_table(table_name)
        display(saved_features.limit(5))
    else:
        # Any other failure is unexpected, so surface it
        raise

# Step 5: Train Model with AutoML

## 📝 Your Task: Use AutoML
Let's use Databricks AutoML to automatically train our model.

**Try this:**
- Import `databricks.automl`
- Prepare data for AutoML (remove non-feature columns)
- Run `automl.classify()` to predict `is_high_value`

*AutoML will automatically try different algorithms and find the best one!*
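
Before handing data to a classifier, it is worth knowing how the target is distributed. The sketch below counts the `is_high_value` labels for the eight sample customers in plain Python, applying the `total_spent > 2000` rule from Step 3. With so few rows, the dataset may be too small for a meaningful model, so treat any result on it as a demonstration only.

```python
from collections import Counter

# total_spent values for CUST_001 .. CUST_008 from Step 2
total_spent = [1200.0, 2500.0, 800.0, 3200.0, 1800.0, 950.0, 4100.0, 1600.0]

# Same labeling rule as Step 3: "Yes" when total spend is above 2000
labels = ["Yes" if amount > 2000 else "No" for amount in total_spent]

counts = Counter(labels)
print(dict(counts))                            # {'No': 5, 'Yes': 3}
print(f"{counts['Yes'] / len(labels):.1%}")    # 37.5%
```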

In [0]:
# 🔨 YOUR CODE HERE

# Import AutoML
# from databricks import automl






# Prepare data (keep only features we want)
# ml_data = customers_with_features.select(
#     "age", "total_spent", "order_count", 
#     "avg_order_value", "age_group", "is_high_value"
# )




# Run AutoML
# automl_run = automl.classify(
#     dataset=ml_data,
#     target_col="is_high_value",
#     timeout_minutes=5
# )

In [0]:
# ✅ SOLUTION: Train Model with AutoML

from databricks import automl
import mlflow

# Prepare data for modeling (keep only the features and the target)
ml_data = customers_with_features.select(
    "age", "total_spent", "order_count", 
    "avg_order_value", "age_group", "is_high_value"
)

# Run AutoML
try:
    automl_run = automl.classify(
        dataset=ml_data,
        target_col="is_high_value",
        timeout_minutes=5
    )
    
    best_run_id = automl_run.best_trial.mlflow_run_id

    spark.sql(f"""
    SELECT 'AutoML completed!' as Status,
           'Best model found' as Result,
           '{best_run_id}' as RunID

    """).display()

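except Exception as e:
    # AutoML could not run; print the reason, then show a placeholder.
    # The accuracy below is illustrative, not a measured result.
    print(f"AutoML did not run: {e}")
    spark.sql("""
    SELECT 'Demo Mode' as Status,
           'AutoML would train multiple models' as Result,
           'Best accuracy: ~85%' as Expected
    """).display()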

# Demo Complete!

## What We Built
- **Customer Data** - Created a simple e-commerce dataset
- **Feature Engineering** - Added useful business features
- **Feature Store** - Saved features for reuse
- **AutoML** - Trained a model automatically

## Business Value
- **Identify High-Value Customers** - Focus marketing on customers whose total spend exceeds $2,000
- **Increase Revenue** - Better targeting can lead to higher conversion rates
- **Fast Development** - Built an end-to-end pipeline in 45 minutes

## Next Steps
- Try with real customer data
- Add more features (geography, product categories, etc.)
- Deploy model for real-time predictions
- Set up monitoring and retraining

**Great job! You've built a complete ML pipeline in Databricks!**