# Tutorial 2: Sending Selected Runs to Remote Server

## 🎯 Learning Objectives

By the end of this tutorial, you will:
- Understand how to identify your best local runs
- Learn to retrieve artifacts from local MLflow runs
- Master the process of re-running experiments on a remote server
- Avoid overloading remote servers with unnecessary data

---

## 📋 Prerequisites

- Completed **Tutorial 1** (local MLflow setup)
- At least one successful run in your local MLflow server
- Access credentials to the remote MLflow server (username + password)
- Understanding of the `mlruns/` and `mlartifacts/` directory structure

---

## 🔑 Key Concept: Why Re-run Instead of Transfer?

MLflow doesn't have a "copy run" feature. Instead, we:

1. **Experiment freely locally** (unlimited runs, no server load)
2. **Identify the best model** (using local UI/metrics)
3. **Re-run ONLY that model** on the remote server (with same params, data, model)

This approach:
- ✅ Keeps remote servers clean and organized
- ✅ Reduces bandwidth and storage costs
- ✅ Ensures reproducibility
- ✅ Maintains complete experiment history locally

---


## 1. Setting Up Remote Server Credentials

Before connecting to the remote server, you need to set up authentication.

### 🔐 Set environment variables

In your terminal (or add to your `.bashrc`/`.zshrc`):

```bash
export MLFLOW_TRACKING_USERNAME="your_username"
export MLFLOW_TRACKING_PASSWORD="your_password"
export MLFLOW_TRACKING_URI="https://mlflow-dev.fink-broker.org"
```

⚠️ **Security Note**: Never hardcode credentials in your notebooks or scripts!
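
If you would rather not keep the password in a shell startup file, one option is to prompt for it when the notebook runs. The sketch below assumes an interactive session; the helper name `ensure_credentials` and its `env` and `prompt` parameters are hypothetical and not part of MLflow.

```python
import os
from getpass import getpass


def ensure_credentials(env=os.environ, prompt=getpass):
    """Ask for any missing MLflow credential and keep it in this process only."""
    for var in ("MLFLOW_TRACKING_USERNAME", "MLFLOW_TRACKING_PASSWORD"):
        if not env.get(var):
            # getpass hides the typed value, so it never appears in notebook output
            env[var] = prompt(f"{var}: ")
    return env
```

Calling `ensure_credentials()` at the top of the notebook leaves already-set variables untouched and only prompts for the missing ones.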

---


### Verify environment variables are set


In [None]:
import os

# Check if credentials are set
required_vars = ["MLFLOW_TRACKING_USERNAME", "MLFLOW_TRACKING_PASSWORD", "MLFLOW_TRACKING_URI"]

for var in required_vars:
    if var in os.environ:
        print(f"✅ {var} is set")
    else:
        print(f"❌ {var} is NOT set - please set it before continuing!")
        
# Display the tracking URI (safe to show)
if "MLFLOW_TRACKING_URI" in os.environ:
    print(f"\n🔗 Remote server: {os.environ['MLFLOW_TRACKING_URI']}")


## 2. Identifying Your Best Local Run

First, let's look at your local runs to identify which one to send to the remote server.

### 📁 Understanding the Local Directory Structure

Navigate to where you ran `mlflow server` in Tutorial 1. You should see:

```
your_directory/
├── mlruns/              # Metadata
│   └── <experiment_id>/
│       └── <run_id>/
└── mlartifacts/         # Actual artifacts
    └── <experiment_id>/
        └── <run_id>/
```

### 📸 Example directory structure

Extract from the output of `tree` after Tutorial 1:

```
mlruns/
├── 0 # default experiment
│   └── meta.yaml
├── 256835725489686937 # tutorial experiment
│   ├── 159b8d88e56446dfb595b540960b99be # run
│   │   ├── artifacts
│   │   ├── meta.yaml
│   │   ├── metrics
│   │   │   ├── accuracy
│   │   │   └── f1_score
│   │   ├── outputs
│   │   │   └── m-34aaeb7bdad14f93a3e76ed769f7bf18
│   │   │       └── meta.yaml
│   │   ├── params
│   │   │   ├── learning_rate
│   │   │   └── random_state
│   │   └── tags
│   │       ├── mlflow.runName
│   │       ├── mlflow.source.name
│   │       ├── mlflow.source.type
│   │       └── mlflow.user
│   ├── meta.yaml
│   ├── models # model used for this run
│   │   ├── m-34aaeb7bdad14f93a3e76ed769f7bf18
│   │   │   ├── meta.yaml
│   │   │   ├── metrics
│   │   │   │   ├── accuracy
│   │   │   │   └── f1_score
│   │   │   ├── params
│   │   │   │   ├── learning_rate
│   │   │   │   └── random_state
│   │   │   └── tags
│   │   │       ├── mlflow.source.name
│   │   │       ├── mlflow.source.type
│   │   │       └── mlflow.user
│   └── tags
│       └── mlflow.experimentKind
└── models
mlartifacts/
└── 256835725489686937
    ├── d62f60d9ecde424ea0bdc30352bf7143
    │   └── artifacts
    │       ├── meta.json
    │       ├── X_train.parquet
    │       └── y_train.parquet
    └── models
        ├── m-34aaeb7bdad14f93a3e76ed769f7bf18
        │   └── artifacts
        │       ├── conda.yaml
        │       ├── input_example.json
        │       ├── MLmodel
        │       ├── model.pkl
        │       ├── python_env.yaml
        │       ├── requirements.txt
        │       └── serving_input_example.json
```
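
The layout above can also be walked with plain Python, which is handy when the local server is not running. This is a minimal sketch, not an official MLflow API: the helper `list_local_runs` is hypothetical, and it assumes each metric file holds lines of the form `timestamp value step`, which may differ between MLflow versions.

```python
from pathlib import Path


def list_local_runs(mlruns_dir):
    """Return (experiment_id, run_id, metrics) for every run folder under mlruns/."""
    results = []
    # Run folders sit two levels down: mlruns/<experiment_id>/<run_id>/meta.yaml
    for meta in sorted(Path(mlruns_dir).glob("*/*/meta.yaml")):
        run_dir = meta.parent
        metrics = {}
        metrics_dir = run_dir / "metrics"
        if metrics_dir.is_dir():
            for metric_file in metrics_dir.iterdir():
                lines = metric_file.read_text().strip().splitlines()
                if lines:
                    # Assumed format: "<timestamp> <value> <step>"; keep the last value
                    metrics[metric_file.name] = float(lines[-1].split()[1])
        results.append((run_dir.parent.name, run_dir.name, metrics))
    return results
```

For the tree shown here, `list_local_runs("mlruns")` would report the run under experiment `256835725489686937` together with its `accuracy` and `f1_score` values.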

---


### 🔍 Find your experiment and run IDs

You can find these either:
1. **In the MLflow UI** (URL shows experiment and run IDs)
2. **Programmatically** (as we did in Tutorial 1)
3. **By browsing the file system**

Let's do it programmatically:


In [None]:
import mlflow
from mlflow.tracking import MlflowClient
import pandas as pd

# Connect to LOCAL MLflow server
LOCAL_HOST = "http://127.0.0.1"
LOCAL_PORT = 6969
mlflow.set_tracking_uri(f"{LOCAL_HOST}:{LOCAL_PORT}")

client = MlflowClient()

# Get the experiment
EXPERIMENT_NAME = "tutorial"  # From Tutorial 1
experiment = client.get_experiment_by_name(EXPERIMENT_NAME)

if experiment is None:
    print(f"❌ Experiment '{EXPERIMENT_NAME}' not found!")
    print("Make sure you completed Tutorial 1 first.")
else:
    print(f"✅ Found experiment: {EXPERIMENT_NAME}")
    print(f"📊 Experiment ID: {experiment.experiment_id}")
    
    # Get all runs
    runs = client.search_runs(experiment.experiment_id)
    print(f"📈 Total runs: {len(runs)}\n")


### View all runs and select the best one


In [None]:
# Display all runs with their metrics
run_data = []
for run in runs:
    run_data.append({
        "Run Name": run.data.tags.get("mlflow.runName", "N/A"),
        "Run ID": run.info.run_id,
        "Learning Rate": run.data.params.get("learning_rate", "N/A"),
        # NaN (not "N/A") keeps the metric columns numeric, so sorting cannot fail
        "Accuracy": run.data.metrics.get("accuracy", float("nan")),
        "F1 Score": run.data.metrics.get("f1_score", float("nan")),
        "Status": run.info.status
    })

runs_df = pd.DataFrame(run_data)
print("📊 Available runs:")
display(runs_df.sort_values("Accuracy", ascending=False))


### 🎯 Select the run you want to send


In [None]:
# Select the best run (highest accuracy in this case)
# In practice, you might choose based on different criteria
# Runs that never logged an accuracy rank last
best_run = max(runs, key=lambda run: run.data.metrics.get("accuracy", float("-inf")))

print("🏆 Selected run to send:")
print(f"  - Name: {best_run.data.tags.get('mlflow.runName', 'N/A')}")
print(f"  - Run ID: {best_run.info.run_id}")
print(f"  - Accuracy: {best_run.data.metrics.get('accuracy', 'N/A')}")
print(f"  - F1 Score: {best_run.data.metrics.get('f1_score', 'N/A')}")

# Store IDs for later use
SELECTED_RUN_ID = best_run.info.run_id
EXPERIMENT_ID = experiment.experiment_id
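
Accuracy alone may not separate two good runs. As one possible refinement, the sketch below ranks by a primary metric and falls back to a second one on ties. The helper `pick_best` and its parameters are hypothetical; it works on a plain `{run_id: metrics}` dictionary so it can be tried without a server.

```python
def pick_best(run_metrics, primary="accuracy", tie_breaker="f1_score"):
    """Return the run_id with the highest primary metric; ties go to the tie-breaker."""
    missing = float("-inf")  # runs lacking a metric rank last

    def score(item):
        _, metrics = item
        return (metrics.get(primary, missing), metrics.get(tie_breaker, missing))

    run_id, _ = max(run_metrics.items(), key=score)
    return run_id
```

With the `runs` list from above, the input could be built as `{r.info.run_id: r.data.metrics for r in runs}`. An empty dictionary raises `ValueError`, which is a reasonable signal that there is nothing to send.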


## 3. Retrieving Artifacts from the Local Run

Now we need to gather all the information from this run to reproduce it on the remote server.

### 🗂️ Configure paths

⚠️ **Important**: Update `root_dir` to where YOUR `mlflow server` was running!


In [None]:
import json
from pathlib import Path

# UPDATE THIS PATH to where you ran `mlflow server`
# This directory should contain mlruns/ and mlartifacts/
root_dir = Path().resolve().parent # Assumes parent directory, adjust if needed

print(f"📁 Root directory: {root_dir}")
print(f"   Looking for mlruns/ and mlartifacts/...")

# Verify directories exist
mlruns_dir = root_dir / "mlruns"
mlartifacts_dir = root_dir / "mlartifacts"

if mlruns_dir.exists():
    print(f"   ✅ Found mlruns/")
else:
    print(f"   ❌ mlruns/ not found! Update root_dir")
    
if mlartifacts_dir.exists():
    print(f"   ✅ Found mlartifacts/")
else:
    print(f"   ❌ mlartifacts/ not found! Update root_dir")


### 📦 Load training data from artifacts


In [None]:
# Build paths to the artifacts
artifacts_path = mlartifacts_dir / EXPERIMENT_ID / SELECTED_RUN_ID / "artifacts"

path_X = artifacts_path / "X_train.parquet"
path_y = artifacts_path / "y_train.parquet"

print(f"📂 Artifacts path: {artifacts_path}")

# Load the data
if path_X.exists() and path_y.exists():
    X = pd.read_parquet(path_X)
    y = pd.read_parquet(path_y)
    print(f"✅ Loaded training data:")
    print(f"   - X shape: {X.shape}")
    print(f"   - y shape: {y.shape}")
else:
    print("❌ Training data not found in artifacts!")
    print("   Make sure you logged the data in Tutorial 1")
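
When an artifact is missing, it helps to see exactly which file could not be found. A small sketch of that check, assuming the `mlartifacts/<experiment_id>/<run_id>/artifacts/` layout shown earlier; the helper `find_run_artifacts` is hypothetical.

```python
from pathlib import Path


def find_run_artifacts(root_dir, experiment_id, run_id, names):
    """Map each expected artifact name to its path, or to None if it is missing."""
    base = Path(root_dir) / "mlartifacts" / str(experiment_id) / run_id / "artifacts"
    return {
        name: (base / name if (base / name).is_file() else None)
        for name in names
    }
```

For example, `find_run_artifacts(root_dir, EXPERIMENT_ID, SELECTED_RUN_ID, ["X_train.parquet", "y_train.parquet", "meta.json"])` returns a dictionary in which any `None` value points at the file to investigate.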


### ⚙️ Load model parameters


In [None]:
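import ast

# Load metadata (contains parameters)
meta_path = artifacts_path / "meta.json"

if meta_path.exists():
    with open(meta_path, "r") as f:
        metadata = json.load(f)
    PARAMS = metadata['params']
    print(f"✅ Loaded parameters: {PARAMS}")
else:
    print("❌ meta.json not found!")
    print("   Falling back to run parameters...")
    # MLflow returns run params as strings (e.g. "0.1"), so convert them back
    # to Python values before passing them to the model constructor.
    PARAMS = {}
    for key, value in best_run.data.params.items():
        try:
            PARAMS[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            PARAMS[key] = value  # keep genuine strings as-is
    print(f"   Parameters: {PARAMS}")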


## 4. Generate Environment Dependencies

The remote server needs to know which packages are required to run your preprocessing code.

This project includes a utility to automatically extract dependencies from your code:


In [None]:
import sys, os
sys.path.append(os.path.abspath(".."))

In [None]:
from mlflink.utils.env_utils import generate_requirements_txt_from_imports
from importlib import resources

# Get path to preprocessing module
with resources.path("mlflink", "processing") as module_path:
    PKG_DIR = module_path

print(f"📦 Analyzing dependencies in: {PKG_DIR}")

# Generate requirements.txt
dependencies_path = generate_requirements_txt_from_imports(
    PKG_DIR, 
    "/tmp/requirements.txt", 
    include_self=False
)

print(f"✅ Generated requirements.txt at: {dependencies_path}")

# Display the generated requirements
print("\n📋 Dependencies to be sent to remote server:")
with open(dependencies_path, "r") as f:
    print(f.read())


### 💡 Why This Matters

The remote server (Fink) needs:
1. ✅ Your trained model (for making predictions)
2. ✅ Your preprocessing code (to prepare new data)
3. ✅ Environment dependencies (to run your code)

MLflow automatically handles (1), but we need to explicitly log (2) and (3).
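
To give an idea of how dependencies can be extracted from imports, here is a rough sketch using the standard `ast` module. It is not the implementation of `generate_requirements_txt_from_imports`, whose internals are not shown in this tutorial; the helper `top_level_imports` is hypothetical and only collects module names, without versions.

```python
import ast
import sys
from pathlib import Path


def top_level_imports(package_dir):
    """Collect the top-level module names imported by .py files under package_dir."""
    found = set()
    for py_file in Path(package_dir).rglob("*.py"):
        tree = ast.parse(py_file.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                found.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                # level == 0 skips relative imports such as "from . import x"
                found.add(node.module.split(".")[0])
    # sys.stdlib_module_names exists on Python 3.10+; older versions keep stdlib names
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(found - set(stdlib))
```

Keep in mind that import names and package names can differ (for example, `sklearn` is installed as `scikit-learn`), so a real tool needs an extra mapping step.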

---


## 5. Re-run on Remote Server

Now we have everything we need! Let's re-run the experiment on the remote server.

### 🌐 Switch to remote tracking URI


In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from mlflow.models import infer_signature
import numpy as np

# Get remote server URL from the environment (set in Section 1),
# falling back to the dev server if the variable is missing
REMOTE_URI = os.environ.get("MLFLOW_TRACKING_URI", "https://mlflow-dev.fink-broker.org")

print(f"🔄 Switching from local to remote server...")
print(f"   Local:  {LOCAL_HOST}:{LOCAL_PORT}")
print(f"   Remote: {REMOTE_URI}")

# Switch to remote server
mlflow.set_tracking_uri(REMOTE_URI)
mlflow.set_experiment("tutorial")  # Use same experiment name

client = MlflowClient()

print(f"✅ Connected to remote server!")
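
Before switching, it can be worth checking that the URI really points at a remote HTTPS server and not at the local one from Tutorial 1. A minimal sketch of such a guard follows; the helper `resolve_remote_uri` and its parameters are hypothetical, and the rules it enforces are an assumption about what you want to forbid.

```python
import os
from urllib.parse import urlparse


def resolve_remote_uri(env=os.environ, default="https://mlflow-dev.fink-broker.org"):
    """Return the tracking URI from the environment, refusing non-HTTPS or local targets."""
    uri = env.get("MLFLOW_TRACKING_URI", default)
    parsed = urlparse(uri)
    if parsed.scheme != "https":
        raise ValueError(f"Expected an https:// tracking URI, got: {uri}")
    if parsed.hostname in ("127.0.0.1", "localhost"):
        raise ValueError(f"Tracking URI points at the local machine: {uri}")
    return uri
```

The returned value can then be passed to `mlflow.set_tracking_uri(...)`, so a misconfigured environment fails early instead of silently logging to the wrong server.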


### 🚀 Run the experiment on remote server

This is almost identical to Tutorial 1, but with TWO critical additions for the remote server:
1. Log the preprocessing code
2. Log the requirements.txt


In [None]:
print("🚀 Starting run on REMOTE server...\n")

with mlflow.start_run(run_name=f"remote_LR_{PARAMS['learning_rate']}"):
    
    # ==========================================
    # 1. TRAIN MODEL (same as before)
    # ==========================================
    print("🏋️  Training model...")
    model = HistGradientBoostingClassifier(**PARAMS)
    model.fit(X.values, y)
    y_pred = model.predict(X.values)
    print("✅ Model trained!\n")
    
    # ==========================================
    # 2. LOG PARAMETERS (same as before)
    # ==========================================
    print("📝 Logging parameters...")
    mlflow.log_params(PARAMS)
    
    # ==========================================
    # 3. LOG MODEL (same as before)
    # ==========================================
    print("💾 Logging model...")
    signature = infer_signature(X, y_pred)
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        signature=signature,
        input_example=X.iloc[:1],
    )
    
    # ==========================================
    # 4. LOG METRICS (same as before)
    # ==========================================
    print("📊 Logging metrics...")
    mlflow.log_metric("accuracy", accuracy_score(y, y_pred))
    mlflow.log_metric("precision", precision_score(y, y_pred, zero_division=0))
    mlflow.log_metric("recall", recall_score(y, y_pred, zero_division=0))
    mlflow.log_metric("f1_score", f1_score(y, y_pred, zero_division=0))
    
    # ==========================================
    # 5. LOG DATA (optional - be selective!)
    # ==========================================
    print("💿 Logging training data...")
    mlflow.log_table(X, "X_train.parquet")
    mlflow.log_table(y, "y_train.parquet")
    
    # ==========================================
    # 6. LOG METADATA (same as before)
    # ==========================================
    print("📋 Logging metadata...")
    meta_info = {
        "params": PARAMS,
        "data_info": {
            "n_samples": X.shape[0],
            "n_features": X.shape[1]
        },
        "notes": "Run sent from local to remote server",
        "original_run_id": SELECTED_RUN_ID
    }
    with open("meta.json", "w") as f:
        json.dump(meta_info, f, indent=2)
    mlflow.log_artifact("meta.json")
    
    # ==========================================
    # 7. LOG PREPROCESSING CODE NEW FOR REMOTE 🆕
    # ==========================================
    # This is CRITICAL for deployment! The remote server (Fink) needs your 
    # preprocessing code to transform new incoming data the same way you 
    # transformed your training data. Without this, the model won't work!
    #
    # Why we log the entire processing directory:
    # - Other developers can reproduce your exact preprocessing pipeline
    # - The Fink server can apply the same transformations to live data
    # - Ensures consistency between training and production inference
    # - Makes your model deployment fully reproducible
    print("📦 Logging preprocessing code...")
    with resources.path("mlflink", "processing") as preprocessing_path:
        mlflow.log_artifacts(str(preprocessing_path), artifact_path="code")

    # ==========================================
    # 8. LOG REQUIREMENTS NEW FOR REMOTE 🆕
    # ==========================================
    # The requirements.txt file tells the remote server which Python packages
    # are needed to run your preprocessing code. Without this, your code 
    # might fail due to missing dependencies!
    #
    # Why this matters:
    # - Ensures the remote environment has all necessary libraries
    # - Other developers know exactly which versions you used
    # - Prevents "works on my machine" problems
    # - Critical for the Fink server to execute your preprocessing pipeline
    #
    # Note: This only includes dependencies for preprocessing, not the model
    # (MLflow handles model dependencies automatically)
    print("📋 Logging dependencies...")
    mlflow.log_artifact(dependencies_path)
    
    print("\n✅ Run completed successfully on REMOTE server!")
    print(f"🔗 View at: {REMOTE_URI}")
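
The run above writes `meta.json` into the current working directory, where it stays after the upload. One way to avoid leftover files is to write it into a temporary directory instead. This is a sketch under that assumption: `write_meta_json` is a hypothetical helper and the `learning_rate` value is only illustrative.

```python
import json
import tempfile
from pathlib import Path


def write_meta_json(meta_info, directory):
    """Write meta_info to <directory>/meta.json and return the file path."""
    path = Path(directory) / "meta.json"
    path.write_text(json.dumps(meta_info, indent=2), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = write_meta_json({"params": {"learning_rate": 0.1}}, tmp_dir)
    # Inside an active run, mlflow.log_artifact(str(meta_path)) would go here
    round_trip = json.loads(meta_path.read_text(encoding="utf-8"))
# The temporary directory and meta.json are removed once the block exits
```

The upload has to happen inside the `with` block, because the file no longer exists afterwards.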


## 6. Verify the Remote Run

Let's verify that everything was uploaded correctly:


In [None]:
# Query the remote server
remote_experiment = client.get_experiment_by_name("tutorial")
remote_runs = client.search_runs(remote_experiment.experiment_id, max_results=5)

print("📊 Latest runs on remote server:\n")

for i, run in enumerate(remote_runs[:3], 1):
    print(f"{i}. {run.data.tags.get('mlflow.runName', 'N/A')}")
    print(f"   - Run ID: {run.info.run_id[:8]}...")
    print(f"   - Accuracy: {run.data.metrics.get('accuracy', 'N/A')}")
    print(f"   - Status: {run.info.status}")
    print()
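
Beyond metrics and status, you may also want to confirm that the extra artifacts arrived. The sketch below compares a list of artifact paths against the names logged in Section 5. The helper `missing_artifacts` is hypothetical; the paths could come from something like `[f.path for f in client.list_artifacts(run_id)]`.

```python
def missing_artifacts(found_paths, expected=("code", "requirements.txt", "meta.json")):
    """Return the expected top-level artifact names that are absent from found_paths."""
    # Compare on the first path component so "code/utils.py" counts as "code"
    top_level = {path.split("/")[0] for path in found_paths}
    return [name for name in expected if name not in top_level]
```

An empty list means the preprocessing code, the requirements file, and the metadata are all present on the remote run.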


## 7. What We Sent vs. What We Didn't

### ✅ What we sent to the remote server:
- Trained model (sklearn artifact)
- Model parameters
- Performance metrics
- Training data (X and y) - optional
- Preprocessing code (entire `processing/` directory)
- Requirements.txt (dependencies)
- Metadata (custom info)

### 🙅🏿 What we didn't send:
- All the failed/experimental runs
- Intermediate debugging runs
- Large datasets from experimentation
- Temporary files

This keeps the remote server clean and focused on production-ready models!

---


## 8. Best Practices for Remote Runs

### ✅ DO:
- **Experiment extensively locally first** (unlimited runs)
- **Send only your best/final models** to remote
- **Include preprocessing code** (critical for deployment)
- **Log dependencies** (requirements.txt)
- **Use descriptive run names** (easier to find later)
- **Add metadata** about the original local run

### 🙅🏿 DON'T:
- **Send every experimental run** (overloads server)
- **Log huge datasets** unless absolutely necessary
- **Forget to include preprocessing code** (model won't work without it)
- **Skip the requirements.txt** (environment issues)
- **Send runs with errors or incomplete training**

### 💡 Recommended Workflow:

```
1. Local: Run 50 experiments with different parameters
2. Local: Identify the best 1-3 models
3. Local: Do a final run with those params + full data logging
4. Remote: Send ONLY those 1-3 final runs
5. Remote: Verify they work correctly
```

This approach:
- Maximizes your experimentation freedom locally
- Minimizes remote server load
- Keeps remote experiments organized and meaningful
- Reduces costs (bandwidth, storage)

---


### Why Steps 7 & 8 Are Essential for Remote Deployment

When you send a model to the Fink server, you're not just sending the trained model weights. 
For the model to work in production, Fink needs:

1. **Your preprocessing code** (Step 7):
   - The server receives raw alert data
   - It needs to transform this data exactly as you did during training
   - This is the job of your `make_cut()`, `raw2clean()`, `run_sherlock()`, and `make_X()` functions
   - Without this, the model would receive incorrectly formatted data

2. **Your dependencies** (Step 8):
   - Your preprocessing code might use specific libraries (pandas, numpy, lasair, etc.)
   - The server needs to know which versions to install
   - Prevents runtime errors due to missing or incompatible packages

**Real-world scenario:**
- You train a model locally with your preprocessing pipeline
- Fink receives new alerts and needs to classify them
- Fink loads your model AND your preprocessing code
- For each alert: `raw data → your preprocessing → your model → prediction`
- Other scientists can also download and run your complete pipeline

**Without these steps**, your model would be useless on the remote server! 🚫
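
The `raw data → your preprocessing → your model → prediction` chain can be pictured as simple function composition. The sketch below is purely illustrative: `build_pipeline`, the stand-in steps, the toy model, and the alert fields are all hypothetical and unrelated to the real `make_cut()` or `make_X()` functions or to Fink's alert schema.

```python
def build_pipeline(preprocess_steps, predict):
    """Chain preprocessing functions in order, then hand the result to predict."""
    def run(raw_alert):
        data = raw_alert
        for step in preprocess_steps:
            data = step(data)
        return predict(data)
    return run


# Hypothetical stand-ins for preprocessing steps and a trained model
def drop_missing(alert):
    return {key: value for key, value in alert.items() if value is not None}


def to_features(alert):
    return [alert["magnitude"], alert["duration"]]


def toy_model(features):
    return "transient" if features[0] < 19.0 else "other"


classify = build_pipeline([drop_missing, to_features], toy_model)
label = classify({"magnitude": 18.2, "duration": 3.5, "comment": None})
```

If the preprocessing steps were missing on the server, `toy_model` would receive the raw dictionary instead of a feature list, which is exactly the failure mode Steps 7 and 8 are meant to prevent.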

## 🎓 Summary

Congratulations! You've learned:

✅ How to identify your best local runs  
✅ How to retrieve artifacts from local MLflow storage  
✅ How to generate environment dependencies automatically  
✅ How to re-run experiments on a remote server  
✅ Best practices for avoiding server overload  
✅ The difference between local experimentation and remote deployment  

---

## 🚀 Next Steps

In **Tutorial 3** (optional), you'll learn how to:
- Customize the preprocessing pipeline for your data
- Replace the example model with your own
- Adapt this template for your specific project

---

## 🆘 Troubleshooting

### Problem: Authentication failed
**Solution**: Double-check that your environment variables are set correctly in the shell that launched Jupyter, then restart the notebook kernel so it picks them up.

### Problem: Can't find local artifacts
**Solution**: Make sure `root_dir` points to where you ran `mlflow server`. Look for `mlruns/` and `mlartifacts/` folders.

### Problem: "Permission denied" on remote server
**Solution**: Verify your username has write access. Contact server administrator if needed.

### Problem: Missing preprocessing code on remote
**Solution**: Make sure you logged the preprocessing directory with `mlflow.log_artifacts()`.

---
