### 1. Training Multiple Models 
- In ML, no single model is best for every dataset. Today, you will run an "audition": train several types of algorithms and see which one fits your e-commerce data best.
- Linear Regression: Good for simple relationships (e.g., "The more clicks, the more spend").
- Decision Trees: Good for capturing "If/Then" logic (e.g., "If they are on a mobile device AND it's the weekend, they buy more").
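- Random Forest / Gradient-Boosted Trees (the family XGBoost belongs to): "Ensemble" models that combine many decision trees to handle the complexity of 42M rows. The code below uses Spark's built-in GBTRegressor for the boosted variant.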
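
An audition needs a scoreboard. The notebook cells further down score each candidate with RMSE (root mean squared error), so here is a minimal, Spark-free sketch of that comparison. The spend values and the three prediction lists are made-up placeholders chosen for easy arithmetic, not results from the e-commerce table.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: lower means predictions sit closer to the truth."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Placeholder data for illustration only (not taken from the real dataset).
actual_spend = [100.0, 200.0, 300.0, 400.0]
candidate_predictions = {
    "LinearRegression": [110.0, 190.0, 310.0, 390.0],
    "RandomForestRegressor": [105.0, 195.0, 305.0, 395.0],
    "GBTRegressor": [120.0, 180.0, 320.0, 380.0],
}

# Score every candidate, then pick the one with the smallest error.
scores = {name: rmse(actual_spend, preds) for name, preds in candidate_predictions.items()}
winner = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {score:.2f}")
print(f"Winner: {winner}")
```

Which model wins here is an artifact of the placeholder numbers. On real data the ranking can come out in any order, which is the point of running the audition.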

### 2. Hyperparameter Tuning (The Fine-Tuning)
- A model is like a radio; you have to turn the knobs (Hyperparameters) to get a clear signal.
- What are they? These are settings you choose before training starts (e.g., max_depth of a tree or learning_rate).
- The Pro Way: Instead of guessing, we use Grid Search or Hyperopt in Databricks. These automatically try many different "knob settings" (say, 20) and tell you which one worked best.
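
To make the "knob settings" idea concrete, here is a small sketch of what a grid search enumerates. The candidate values and the `validation_error` function are assumptions invented for this illustration. In a real run, that function would train a model with the given settings and return its validation RMSE.

```python
from itertools import product

# Candidate values for two knobs (hypothetical choices for illustration).
param_grid = {
    "max_depth": [2, 4, 6, 8, 10],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
}

# Every combination of the values above: 5 x 4 = 20 settings.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]


def validation_error(params):
    """Stand-in for 'train with these params and return the validation RMSE'.

    This toy formula is an assumption so the sketch runs without a cluster;
    it is smallest at max_depth=6, learning_rate=0.1.
    """
    return abs(params["max_depth"] - 6) + abs(params["learning_rate"] - 0.1) * 10


# Grid search: evaluate every combination and keep the one with the lowest error.
best_params = min(combinations, key=validation_error)
print(f"Tried {len(combinations)} settings; best: {best_params}")
```

In Spark ML the same idea is usually expressed with `ParamGridBuilder` and `CrossValidator` from `pyspark.ml.tuning`. Hyperopt takes a different route: it samples the search space adaptively instead of enumerating every combination.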

### 3. Feature Importance (The "Why")
- Once your model is accurate, you need to explain why it's making those predictions. Feature Importance ranks your input data from most to least influential.
- Example: You might find that avg_viewed_price is a huge predictor of spend, while interaction_count doesn't matter as much.
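
As a rough sketch, ranking features comes down to pairing each input column with its importance score and sorting. The scores below are placeholders picked to mirror the example above, not output from a trained model.

```python
# Same column order as the VectorAssembler inputCols used in the tasks below.
feature_names = ["interaction_count", "weekend_ratio", "avg_viewed_price", "category_diversity"]

# Placeholder importance scores (illustrative only; a real model would supply these).
importances = [0.10, 0.05, 0.70, 0.15]

# Pair each feature with its score and sort from most to least influential.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name:<20} {score:.2f}")
```

With a fitted Spark pipeline whose last stage is a tree-based model (Random Forest or GBT), the scores would typically come from `pipeline_model.stages[-1].featureImportances.toArray()`, in the same order as the assembler's `inputCols`. Linear Regression exposes `coefficients` instead, which are not directly comparable.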

### 4. Spark ML Pipelines (The Assembly Line)
- This is the most "Engineering" part of the day. A Pipeline bundles your data cleaning and your model into one single object.
- Transformers: Stages that change the data (e.g., turning "Electronics" into the number 1).
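- Estimators: Stages that learn from the data when you call fit(), such as the actual ML algorithm (e.g., Random Forest). StandardScaler is an Estimator too, because it must learn each column's standard deviation before it can scale anything.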
- Why use it? When you get a new row of raw data tomorrow, you don't have to manually clean it again. You just pass it to the Pipeline, and it handles everything from cleaning to prediction in one go.
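
The fit-once, reuse-later idea can be sketched in plain Python. The classes below (`CategoryIndexer`, `MeanSpendModel`, `ToyPipeline`) are hypothetical stand-ins written for this illustration, not Spark APIs, and the three training rows are made up. One simplification: Spark's `Pipeline.fit` returns a separate `PipelineModel`, while this toy version just returns itself.

```python
class CategoryIndexer:
    """Learns a category -> number mapping during fit, applies it during transform."""

    def fit(self, rows):
        categories = sorted({row["category"] for row in rows})
        self.index = {name: i for i, name in enumerate(categories)}
        return self

    def transform(self, rows):
        return [{**row, "category_idx": self.index[row["category"]]} for row in rows]


class MeanSpendModel:
    """Toy 'algorithm': predicts the average spend seen for each category index."""

    def fit(self, rows):
        spends = {}
        for row in rows:
            spends.setdefault(row["category_idx"], []).append(row["spend"])
        self.means = {idx: sum(values) / len(values) for idx, values in spends.items()}
        return self

    def transform(self, rows):
        return [{**row, "prediction": self.means[row["category_idx"]]} for row in rows]


class ToyPipeline:
    """Runs stages in order: each stage is fitted, then its output feeds the next."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Made-up training rows for illustration.
training_rows = [
    {"category": "Electronics", "spend": 300.0},
    {"category": "Electronics", "spend": 500.0},
    {"category": "Books", "spend": 20.0},
]
pipeline = ToyPipeline([CategoryIndexer(), MeanSpendModel()]).fit(training_rows)

# Tomorrow's raw row: no manual cleaning, just hand it to the fitted pipeline.
result = pipeline.transform([{"category": "Electronics"}])
print(result)
```

Because the mapping and the model were both learned inside one object, the new row goes from raw text to a prediction in a single `transform` call.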

### Task 1 & 3: Build the Spark ML Pipeline

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import RandomForestRegressor, GBTRegressor, LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
import mlflow
import mlflow.spark

# 1. Prepare Data
train_df, test_df = spark.table("ecommerce_prod.gold.user_ml_features").randomSplit([0.8, 0.2], seed=42)

# 2. Define Pipeline Stages
assembler = VectorAssembler(
    inputCols=['interaction_count', 'weekend_ratio', 'avg_viewed_price', 'category_diversity'], 
    outputCol="raw_features"
)
scaler = StandardScaler(inputCol="raw_features", outputCol="features")

# 3. List of Models to "Audition"
models = [
    LinearRegression(labelCol="total_spend", featuresCol="features"),
    RandomForestRegressor(labelCol="total_spend", featuresCol="features", numTrees=50),
    GBTRegressor(labelCol="total_spend", featuresCol="features", maxIter=20)
]

### Task 2: Train and Compare in MLflow

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS ecommerce_prod.ml_staging;
CREATE VOLUME IF NOT EXISTS ecommerce_prod.ml_staging.model_tmp;

In [0]:
import os
import mlflow
import mlflow.spark
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import RandomForestRegressor, GBTRegressor, LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from mlflow.tracking import MlflowClient

# 1. SETUP: Point MLflow's temp directory at a Unity Catalog volume (avoids the "UC Volume Path" error)
os.environ['MLFLOW_DFS_TMP'] = "/Volumes/ecommerce_prod/ml_staging/model_tmp"
user_email = "shrinathrajeshirke@gmail.com"  # Replace with your own workspace user
experiment_path = f"/Users/{user_email}/day13_comparison"
mlflow.set_experiment(experiment_path)

# 2. DATA PREP: Load features from Day 12
data = spark.table("ecommerce_prod.gold.user_ml_features")
train_df, test_df = data.randomSplit([0.8, 0.2], seed=42)

# 3. PIPELINE COMPONENTS: Build the assembly line
assembler = VectorAssembler(
    inputCols=['interaction_count', 'weekend_ratio', 'avg_viewed_price', 'category_diversity'], 
    outputCol="raw_features"
)
scaler = StandardScaler(inputCol="raw_features", outputCol="features")

# 4. MODELS: Define the three candidates
models = [
    LinearRegression(labelCol="total_spend", featuresCol="features"),
    RandomForestRegressor(labelCol="total_spend", featuresCol="features", numTrees=50),
    GBTRegressor(labelCol="total_spend", featuresCol="features", maxIter=20)
]

# 5. EXECUTION: Loop, Train, and Log
print("Starting Model Audition...")
with mlflow.start_run(run_name="Day13_Model_Comparison") as parent_run:
    for model in models:
        model_name = model.__class__.__name__
        
        with mlflow.start_run(run_name=model_name, nested=True):
            # Construct Pipeline
            pipeline = Pipeline(stages=[assembler, scaler, model])
            
            # Train model
            pipeline_model = pipeline.fit(train_df)
            
            # Make Predictions
            predictions = pipeline_model.transform(test_df)
            
            # Evaluate Performance
            evaluator = RegressionEvaluator(labelCol="total_spend", metricName="rmse")
            rmse = evaluator.evaluate(predictions)
            
            # Log results
            mlflow.log_metric("rmse", rmse)
            mlflow.spark.log_model(
                pipeline_model, 
                artifact_path=f"{model_name}_pipeline",
                dfs_tmpdir=os.environ['MLFLOW_DFS_TMP']
            )
            
            print(f"{model_name} completed. RMSE: {rmse}")

# 6. CHAMPION SELECTION: Pick the best model automatically
client = MlflowClient()
experiment_id = client.get_experiment_by_name(experiment_path).experiment_id

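# Find the run with the lowest RMSE among the child runs of this audition.
# Filtering on the parent run ID keeps runs from earlier executions out of the comparison.
runs = client.search_runs(
    experiment_ids=[experiment_id],
    filter_string=f"tags.mlflow.parentRunId = '{parent_run.info.run_id}'",
    order_by=["metrics.rmse ASC"],
    max_results=1
)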

best_run = runs[0]
best_rmse = best_run.data.metrics['rmse']
best_model_run_name = best_run.data.tags['mlflow.runName']

print("-" * 30)
print(f"Best Model: {best_model_run_name}")
print(f"BEST RMSE: {best_rmse}")

# 7. REGISTER: Save the winner to the Model Registry
model_uri = f"runs:/{best_run.info.run_id}/{best_model_run_name}_pipeline"
mlflow.register_model(model_uri, "ecommerce_champion_model")

### Task 4: Programmatically Select the Best Model

In [0]:
import mlflow
from mlflow.entities import ViewType
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name("/Users/shrinathrajeshirke@gmail.com/day13_comparison")

# Search for the best run based on lowest RMSE (active runs only, deleted runs are skipped)
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="",
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=1,
    order_by=["metrics.rmse ASC"]
)[0]

best_model_name = best_run.data.tags['mlflow.runName']
print(f"The best model is: {best_model_name} with RMSE: {best_run.data.metrics['rmse']}")

# Register the winner
model_uri = f"runs:/{best_run.info.run_id}/{best_model_name}_pipeline"
mlflow.register_model(model_uri, "ecommerce_best_predictor")