### MLflow components (tracking, registry, models)

**1. MLflow Tracking (The Experiment Journal)**
- When you train a model, you usually try different settings (hyperparameters). If you don't track them, you'll forget which settings gave you the best results.
- What it tracks:
- **Parameters:** Like the "learning rate" or "number of trees" in a forest.
- **Metrics:** Numeric results such as accuracy, RMSE, or F1 score.
- **Artifacts:** The actual model file, plots, or images.
- **The Benefit:** You can compare 50 different runs side-by-side in a table to see which one is the winner.

**2. MLflow Models (The Universal Package)**
- Once you have a trained model, it needs to be saved. A "Model" in MLflow isn't just a file; it’s a standardized folder.
- What it does: It packages your model so it can be run anywhere (in a Notebook, as a Real-time API, or as a Batch job).
- The Benefit: You don't have to worry about whether the model was made in Scikit-Learn, PyTorch, or SparkML. MLflow handles the "flavor" so it works across different systems.

**3. MLflow Model Registry (The Version Controller)**
- This is like GitHub for Models. It manages the lifecycle of your model from "Birth" to "Retirement."
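- **Stages:** (Newer MLflow versions and Unity Catalog replace fixed stages with aliases and tags, but the lifecycle idea is the same.)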
- None: Just saved.
- Staging: Testing the model to see if it works on real data.
- Production: The model is currently powering your dashboard or website.
- Archived: The model is old and has been replaced.
- The Benefit: If a new model starts making mistakes, you can "roll back" to the previous version with one click.
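
The lifecycle and the rollback can be sketched with a toy in-memory registry. This is not the MLflow API: ToyRegistry and its methods are hypothetical and only illustrate how stage transitions behave. Automatically archiving the old Production version is a simplification chosen for this sketch.

```python
from dataclasses import dataclass, field

STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class ToyRegistry:
    """Toy in-memory stand-in for a model registry; NOT the MLflow API."""

    stages: dict = field(default_factory=dict)  # version number -> stage

    def register(self) -> int:
        # New versions are numbered 1, 2, 3, ... and start in stage "None".
        version = len(self.stages) + 1
        self.stages[version] = "None"
        return version

    def transition(self, version: int, stage: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if stage == "Production":
            # Simplification: only one version serves Production at a time.
            for v, s in self.stages.items():
                if s == "Production":
                    self.stages[v] = "Archived"
        self.stages[version] = stage

    def production_version(self):
        return next((v for v, s in self.stages.items() if s == "Production"), None)


registry = ToyRegistry()
v1 = registry.register()
registry.transition(v1, "Staging")
registry.transition(v1, "Production")

v2 = registry.register()
registry.transition(v2, "Production")  # v1 is archived
registry.transition(v1, "Production")  # rollback: v1 serves again, v2 is archived
print(registry.stages)
```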

**4. How this looks in your Project**
- Imagine you are predicting if a user will buy a Samsung phone:
- Tracking: You run 10 experiments. Run #7 has 92% accuracy. You tag it as "Best Run."
- Models: You save Run #7 as a standardized MLflow model.
- Registry: You move that model to "Production." Now, your Gold layer can use that model to label users as "Likely Buyers" every morning.

**5. Why use it with a SQL Warehouse?**
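Databricks integrates MLflow directly, so a registered model can be called from a SQL query once it has been exposed as a SQL function. In the example below, predict_purchase is a placeholder for such a function (for instance, a UDF built from the model with mlflow.pyfunc.spark_udf).
```sql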
-- Example: Using a registered model in SQL
SELECT 
  user_id,
  predict_purchase(feature1, feature2) as likelihood
FROM ecommerce_prod.gold.user_ml_features;
```

### Experiment tracking

**1. What exactly is being "Tracked"?**
- When you start an "Experiment," MLflow captures four main things:
- **Parameters:** Your input settings (e.g., "I used 100 trees for my Random Forest").
- **Metrics:** Your results (e.g., "The accuracy was 85%").
- **Artifacts:** Your files (e.g., a "Feature Importance" plot or the actual model file).
- **Source:** Which Notebook and which version of the code produced this result.

**2. How to run an Experiment (The Code)**
- Here is how you would track a simple model that predicts if a user will make a purchase.
```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Start a run (X_train, X_test, y_train, y_test are assumed to be defined already)
with mlflow.start_run(run_name="Purchase_Predictor_v1"):
    
    # 2. Log a parameter (your "recipe")
    n_estimators = 100
    mlflow.log_param("trees", n_estimators)
    
    # 3. Train your model
    model = RandomForestClassifier(n_estimators=n_estimators)
    model.fit(X_train, y_train)
    
    # 4. Log a metric (your "score")
    acc = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("accuracy", acc)
    
    # 5. Save the model (the "artifact")
    mlflow.sklearn.log_model(model, "model")

print(f"Run complete! Accuracy: {acc}")
```

**3. Why this is a game-changer for you**
- Imagine you spend all day Friday testing different models. On Monday, your boss asks for the best one.
- Without Tracking: You have to look through messy notebooks or try to remember.
- With Tracking: You open the Experiments sidebar in Databricks, sort by "Accuracy," and the winner is right at the top.

**4. Comparing Runs**
- The most powerful feature of tracking is the Comparison View. You can select 3 or 4 different runs and see a "Parallel Coordinates Plot." 
- This shows you exactly how changing one parameter (like the number of trees) caused the accuracy to go up or down.

### Model logging

**1. What's inside a "Logged Model"?**
- When you use mlflow.sklearn.log_model(), MLflow creates a folder containing:
- **The Model File:** The serialized model object itself (for scikit-learn, a pickle file such as model.pkl).
- **MLmodel file:** A text file that tells MLflow how to load the model (e.g., "This is a Scikit-Learn model").
- **Conda/Requirements:** A list of every library and version (like pandas==2.0.0) needed to make the model work again.

**2. How to Log your Model**
- In your project, after you train your classifier on the 42M row dataset, you log it like this:
```python
import mlflow.sklearn

# Start a run
with mlflow.start_run():
    # ... training code here ...
    
    # Log the model to the current run
    mlflow.sklearn.log_model(
        sk_model=model, 
        artifact_path="purchase-predictor-model"
    )
```

**3. Logged vs. Registered**
- It is important to know the difference:
- **Logging:** Happens during training. It’s like putting a prototype on a shelf in the back of the lab. It is tied to a specific "Run ID."
- **Registering:** Happens after you decide a model is good. It’s like moving that prototype to the showroom and giving it a name like "Production_v1."

**4. Why "Logged Models" are better than local files**
- If you just save a model to your local computer (e.g., model.save()), it often breaks when you move it to a server. With MLflow Logging:
- **Reproducibility:** It remembers exactly which version of Python you used.
- **Ease of Use:** You can load the model back with one line of code: model = mlflow.sklearn.load_model("runs:/<run_id>/purchase-predictor-model")
- **Governance:** You can see who logged the model and when.

### MLflow UI

**1. The Experiments Sidebar**
- On the right-hand side of your Databricks Notebook, you’ll see a beaker icon. Clicking this opens the "Experiment Sidebar."
- It shows a quick list of all recent runs.
- You can see the Date, User, and Status (Success/Fail) at a glance.
- It provides a direct link to the full MLflow UI.

**2. The Main Experiments Page**
- This is the "Control Center" for your Machine Learning project. Here, you can:
- **Filter & Search:** Search for runs where metrics.accuracy > 0.85.
- **Compare Runs:** Select multiple boxes and click "Compare." This opens a side-by-side view that highlights exactly which parameters (like n_estimators) led to better results.
- **Parallel Coordinates Plot:** A visual map showing how different "knobs" you turned (parameters) affected the final score.

**3. The Run Details Page**
- When you click on a specific "Run Name," you go deeper. This page is divided into:
- **Parameters:** Every setting you logged (e.g., learning_rate: 0.01).
- **Metrics:** Your final scores. If you log metrics over time (like during training loops), the UI will automatically draw a Line Chart showing the model getting smarter.
- **Artifacts:** This is the most important part. It shows the Logged Model folder, your requirements files, and any plots (like a Confusion Matrix) you saved.

**4. The Model Registry UI**
- Once you decide a model is ready for the real world, you use this tab to manage its "Career."
- **Versions:** It shows Version 1, Version 2, etc.
- **Stages:** You can see a clear badge for "Staging" or "Production." 
- **Transitions:** You can request to move a model from Staging to Production, and a manager can approve it right here in the UI.

### Task 1: Train a Simple Regression Model and Log It to MLflow

```sql
%sql
SELECT count(*) FROM ecommerce_prod.gold.user_ml_features;
```

```python
# Re-load the data into Pandas
df = spark.table("ecommerce_prod.gold.user_ml_features").toPandas()

# Now check if it works
print(df.columns)
```

```python
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 1. Load your engineered features
# We created this table in the Gold notebook
df = spark.table("ecommerce_prod.gold.user_ml_features").toPandas()

# 2. Prepare Features (X) and Target (y)
# We want to predict 'total_spend' based on behavior
X = df[['interaction_count', 'weekend_ratio', 'avg_viewed_price', 'category_diversity']]
y = df['total_spend']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Start MLflow Tracking
# This creates a "Run" in the Databricks Experiment UI
with mlflow.start_run(run_name="User_Spend_Predictor"):
    
    # Set Hyperparameters
    n_estimators = 100
    max_depth = 10
    
    # LOG PARAMETERS (The Settings)
    mlflow.log_param("num_trees", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    
    # Train Model
    rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth)
    rf.fit(X_train, y_train)
    
    # Make Predictions
    predictions = rf.predict(X_test)
    
    # LOG METRICS (The Results)
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2_score", r2)
    
    # LOG MODEL (The Artifact)
    # This packages the model so it can be used in SQL later
    mlflow.sklearn.log_model(rf, "spend_model")
    
    print(f"✅ Run Completed! R2 Score: {r2}")
```

```python
# Second run: same pipeline with n_estimators=200 and max_depth=15,
# so the two runs can be compared in the Experiment UI.
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# 1. Load your engineered features
# We created this table in the Gold notebook
df = spark.table("ecommerce_prod.gold.user_ml_features").toPandas()

# 2. Prepare Features (X) and Target (y)
# We want to predict 'total_spend' based on behavior
X = df[['interaction_count', 'weekend_ratio', 'avg_viewed_price', 'category_diversity']]
y = df['total_spend']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Start MLflow Tracking
# This creates a "Run" in the Databricks Experiment UI
with mlflow.start_run(run_name="User_Spend_Predictor"):
    
    # Set Hyperparameters
    n_estimators = 200
    max_depth = 15
    
    # LOG PARAMETERS (The Settings)
    mlflow.log_param("num_trees", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    
    # Train Model
    rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth)
    rf.fit(X_train, y_train)
    
    # Make Predictions
    predictions = rf.predict(X_test)
    
    # LOG METRICS (The Results)
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2_score", r2)
    
    # LOG MODEL (The Artifact)
    # This packages the model so it can be used in SQL later
    mlflow.sklearn.log_model(rf, "spend_model")
    
    print(f"✅ Run Completed! R2 Score: {r2}")
```
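
To see what the logged mse and r2_score numbers mean, here is a hand-rolled sketch of the same formulas that mean_squared_error and r2_score use, plus RMSE as the square root of MSE. The spend values are made up, and regression_metrics is a hypothetical helper, not part of MLflow or scikit-learn.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE and R2 from two equal-length lists of numbers."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Sum of squared prediction errors.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Sum of squared deviations from the mean (error of always guessing the mean).
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Made-up spend values: every prediction is off by exactly 10.
actual_spend = [100.0, 200.0, 300.0, 400.0]
predicted_spend = [110.0, 190.0, 310.0, 390.0]

metrics = regression_metrics(actual_spend, predicted_spend)
print(metrics)  # mse = 100.0, rmse = 10.0, r2 = 0.992
```

RMSE is often easier to read than MSE because it is in the same unit as the target (here, spend), while R2 compares the model against always predicting the average.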