# Tutorial 1: MLflow Local Setup and First Run

## 🎯 Learning Objectives

By the end of this tutorial, you will:
- Understand how to set up a local MLflow tracking server
- Learn to log parameters, metrics, models, and artifacts
- Explore the MLflow UI to compare experiments
- Understand best practices for local experimentation

---

## 📋 Prerequisites

- Python ≥ 3.9
- Basic knowledge of machine learning and scikit-learn
- Familiarity with pandas DataFrames

---


## 1. Installation and Setup

First, ensure all dependencies are installed:


In [None]:
# If you haven't installed the package yet, run:
# pip install -e ".[dev]"   (the quotes keep shells such as zsh from expanding the brackets)

# Verify MLflow is installed
import mlflow
print(f"MLflow version: {mlflow.__version__}")


## 2. Starting Your Local MLflow Server

### 🚀 Launch the server

Open a **new terminal** window and run:

```bash
mlflow server --host 127.0.0.1 --port 6969
```

**Important Notes:**
- Keep this terminal running during your experiments
- You can choose any available port (6969 is just an example)
- The server will create `mlruns/` and `mlartifacts/` directories in the directory where you run the command
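
Before choosing a port, it can help to check whether something is already listening on it. The snippet below is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper written for this tutorial, not part of MLflow, and it only tests that a TCP connection succeeds (it does not confirm that the listener is MLflow).

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if is_port_open("127.0.0.1", 6969):
    print("Port 6969 is in use: the MLflow server may already be running.")
else:
    print("Nothing is listening on port 6969 yet.")
```

Running the same check after starting the server is a quick way to confirm it came up before opening the browser.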

### 📸 Screenshot placeholder: MLflow server running in terminal
![image.png](assets/mlfflow_server_run__on_terminal.png)

---

### 🌐 Access the UI

Open your browser and navigate to:
```
http://127.0.0.1:6969
```

You should see the MLflow UI (initially empty).

### 📸 Screenshot placeholder: MLflow UI homepage
![image.png](assets/Empty%20MLflow%20UI%20with%20no%20experiments%20yet.png)

---


## 3. Configure MLflow Connection

Now let's connect our Python code to the local MLflow server:


In [None]:
from mlflow.tracking import MlflowClient
import pandas as pd
import numpy as np

# Configuration
HOST = "http://127.0.0.1"  # Your local host
PORT = 6969                # Must match the port you used in the terminal

# Set the tracking URI (where MLflow will log everything)
mlflow.set_tracking_uri(f"{HOST}:{PORT}")

# Create or set an experiment
# Experiments help organize related runs
EXPERIMENT = "tutorial"
mlflow.set_experiment(EXPERIMENT)

# Instantiate the MLflow client (useful for advanced operations)
client = MlflowClient()

print(f"✅ Connected to MLflow at {HOST}:{PORT}")
print(f"✅ Experiment set to: {EXPERIMENT}")


### 💡 Key Concepts

- **Tracking URI**: Where MLflow stores your experiments (can be local or remote)
- **Experiment**: A collection of related runs (e.g., all runs for a specific project)
- **Run**: A single execution of your training code with specific parameters
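
Since the tracking URI can be local or remote, one option is to resolve it in a single place. The sketch below assumes you want the `MLFLOW_TRACKING_URI` environment variable (which MLflow itself reads) to take priority over the local host and port; `resolve_tracking_uri` is a hypothetical helper, not an MLflow function.

```python
import os


def resolve_tracking_uri(host: str = "http://127.0.0.1", port: int = 6969) -> str:
    """Prefer MLFLOW_TRACKING_URI if it is set, otherwise build a local URI."""
    env_uri = os.environ.get("MLFLOW_TRACKING_URI", "").strip()
    if env_uri:
        return env_uri
    # Drop a trailing slash so the result is "http://host:port", not "http://host/:port".
    return f"{host.rstrip('/')}:{port}"


print(resolve_tracking_uri())
```

The returned string could then be passed to `mlflow.set_tracking_uri(...)` in place of the hard-coded `f"{HOST}:{PORT}"`.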



![tutorial_experiement.png](assets/tutorial_experiement.png)

---


## 4. Understanding the Data Pipeline

Before training, let's understand the preprocessing pipeline. This project uses a structured approach:

```
Raw Alerts → Cut Alerts → Clean Data → Curated Data → Features (X matrix)
```
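
To see how many rows survive each stage of a chain like this, a small runner can record the length after every step. This is only an illustrative sketch: `run_stages` and the two toy stages are hypothetical stand-ins operating on plain lists, not the real `mlflink` preprocessing functions.

```python
from typing import Callable, List, Sequence, Tuple

Stage = Tuple[str, Callable]


def run_stages(data, stages: Sequence[Stage]):
    """Apply each stage in order and record how many rows remain after it."""
    counts: List[Tuple[str, int]] = [("raw", len(data))]
    for name, func in stages:
        data = func(data)
        counts.append((name, len(data)))
    return data, counts


# Hypothetical toy data and stages, used only to exercise the runner.
toy_alerts = [{"mag": 17.2}, {"mag": 21.5}, {"mag": None}, {"mag": 18.9}]
stages = [
    ("drop_missing", lambda rows: [r for r in rows if r["mag"] is not None]),
    ("bright_only", lambda rows: [r for r in rows if r["mag"] < 20.0]),
]
result, counts = run_stages(toy_alerts, stages)
for name, n in counts:
    print(f"{name}: {n} rows")
```

Because the runner only relies on `len()`, the same pattern should also work with functions that take and return pandas DataFrames.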

Let's load some data and process it:


In [None]:
import sys, os
sys.path.append(os.path.abspath(".."))

In [None]:
from mlflink.processing import preprocessing as pp
from importlib import resources

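# Get data from local file, live stream or mocked data stream
# Here: the test file bundled with the package. Read it inside the `with` block,
# because the path is only guaranteed to exist while the context manager is open.
data_file = resources.files("mlflink.data") / "test_alerts.parquet"
with resources.as_file(data_file) as PARQUET_FILE:
    alerts_df = pd.read_parquet(PARQUET_FILE)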
print(f"📊 Loaded {len(alerts_df)} alerts")
print(f"📋 Columns: {alerts_df.columns.tolist()[:5]}...")  # Show first 5 columns


### Step 1: Apply cuts to filter relevant alerts


In [None]:
# Apply quality cuts (defined in preprocessing.py)
# This filters alerts based on criteria like magnitude, classification, etc.
cut_alerts_df = pp.make_cut(alerts_df)
print(f"  After cuts: {len(cut_alerts_df)} alerts remaining")


### Step 2: Extract and clean relevant columns


In [None]:
# Extract only the columns we need for training
clean_df = pp.raw2clean(cut_alerts_df)
print(f"🧹 Clean dataframe shape: {clean_df.shape}")
print(f"📋 Clean columns: {clean_df.columns.tolist()}")


### Step 3: Optional curation (e.g., cross-matching)

⚠️ **Note**: `run_sherlock()` requires a `LASAIR_TOKEN` environment variable. If the variable is not set, the step is skipped gracefully.


In [None]:
# This step adds additional information via cross-matching services
# For this tutorial, it will skip if LASAIR_TOKEN is not set
curated_df = pp.run_sherlock(clean_df)
print(f" Curated dataframe shape: {curated_df.shape}")


### Step 4: Create features matrix (X)


In [None]:
# Transform curated data into ML-ready features
X, ids = pp.make_X(curated_df)

print(f" Features matrix shape: {X.shape}")
print(f" Feature columns: {X.columns.tolist()}")
print("\nFirst few features:")
display(X.head())


### Create mock labels for demonstration

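In a real scenario, you would have actual labels. For this tutorial, we'll create dummy labels. Because every label is 0, the metrics logged in the next section only demonstrate the logging workflow and say nothing about model quality: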


In [None]:
# Create fake labels (in real use, you'd have actual labels)
y = np.array([0] * X.shape[0])
y = pd.DataFrame(y, columns=["labels"])

print(f" Labels shape: {y.shape}")
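
If you would like the dummy labels to contain both classes, so that precision and recall are at least defined, one possible variant is sketched below. `make_mock_labels` is a hypothetical helper; the labels are still random and carry no physical meaning.

```python
import numpy as np
import pandas as pd


def make_mock_labels(n_samples: int, seed: int = 42) -> pd.DataFrame:
    """Return shuffled 0/1 labels that contain both classes when n_samples >= 2."""
    rng = np.random.default_rng(seed)
    labels = np.arange(n_samples) % 2  # 0, 1, 0, 1, ...
    return pd.DataFrame({"labels": rng.permutation(labels)})


y_mock = make_mock_labels(10)
print(y_mock["labels"].value_counts().to_dict())
```

In the notebook, `make_mock_labels(X.shape[0])` would produce a frame with the same `labels` column as `y`.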


---

## 5. Your First MLflow Run

Now comes the exciting part! We'll train a model and log everything with MLflow.

### What we'll log:
1.  **Parameters** (hyperparameters)
2.  **Metrics** (accuracy, precision, recall, F1)
3.  **Model** (the trained model)
4.  **Data** (training data - optional, see warning below)
5.  **Artifacts** (metadata, custom files)
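
To make the four metrics concrete, here is a plain-Python sketch of how they are defined for a binary problem with positive class 1. It is not scikit-learn's implementation; `binary_metrics` is a hypothetical helper that reports undefined ratios as 0.0, which is the behaviour the tutorial requests with `zero_division=0`.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    def ratio(num: float, den: float) -> float:
        # Report an undefined ratio as 0.0 instead of raising.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(correct, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1_score": ratio(2 * precision * recall, precision + recall),
    }


# All-zero labels and predictions: accuracy is 1.0, the other three are 0.0.
print(binary_metrics([0, 0, 0, 0], [0, 0, 0, 0]))
print(binary_metrics([1, 0, 1, 1], [1, 0, 0, 1]))
```

The first call shows why all-zero dummy labels give perfect accuracy but zero precision, recall, and F1: there are no positive examples to find. A dictionary of this shape can be logged in one call with `mlflow.log_metrics(...)`.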

---


### Define model hyperparameters


In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from mlflow.models import infer_signature 
import json

# Define hyperparameters
# 💡 TIP: Use descriptive parameter names that match your model's API
PARAMS = {
    "learning_rate": 0.1,
    "random_state": 42
}

print(" Hyperparameters:")
for key, value in PARAMS.items():
    print(f"  - {key}: {value}")


### Start MLflow run and train the model


In [None]:
# Start an MLflow run
# 💡 TIP: Give runs descriptive names so you can identify them later
with mlflow.start_run(run_name=f"test_LR_{PARAMS['learning_rate']}"):
    
    print("🚀 MLflow run started!\n")
    
    # ==========================================
    # 1. TRAIN YOUR MODEL (as you normally would)
    # ==========================================
    print("🏋️  Training model...")
    model = HistGradientBoostingClassifier(**PARAMS)
    model.fit(X.values, y["labels"])  # pass a 1-D target rather than a one-column DataFrame
    y_pred = model.predict(X.values)
    print("✅ Model trained!\n")
    
    # ==========================================
    # 2. LOG PARAMETERS
    # ==========================================
    print("📝 Logging parameters...")
    mlflow.log_params(PARAMS)
    
    # ==========================================
    # 3. LOG THE MODEL
    # ==========================================
    print("💾 Logging model...")
    signature = infer_signature(X, y_pred)  # Define model input/output schema
    mlflow.sklearn.log_model(
        model,
        name="model",  # Name under which the model is stored in MLflow
        signature=signature,
        input_example=X.iloc[:1],  # Example input for documentation
    )
    
    # ==========================================
    # 4. LOG METRICS
    # ==========================================
    print("📊 Calculating and logging metrics...")
    
    acc = accuracy_score(y, y_pred)
    mlflow.log_metric("accuracy", acc)
    print(f"  - Accuracy: {acc:.4f}")
    
    prec = precision_score(y, y_pred, zero_division=0)
    mlflow.log_metric("precision", prec)
    print(f"  - Precision: {prec:.4f}")
    
    recall = recall_score(y, y_pred, zero_division=0)
    mlflow.log_metric("recall", recall)
    print(f"  - Recall: {recall:.4f}")
    
    f1 = f1_score(y, y_pred, zero_division=0)
    mlflow.log_metric("f1_score", f1)
    print(f"  - F1-score: {f1:.4f}\n")
    
    # ==========================================
    # 5. LOG DATA (OPTIONAL - READ WARNING BELOW)
    # ==========================================
    print("💿 Logging training data...")
    mlflow.log_table(X, "X_train.parquet")
    mlflow.log_table(y, "y_train.parquet")
    
    # ==========================================
    # 6. LOG ADDITIONAL METADATA
    # ==========================================
    print("📋 Logging metadata...")
    meta_info = {
        "params": PARAMS,
        "data_info": {
            "n_samples": X.shape[0],
            "n_features": X.shape[1]
        },
        "notes": "First tutorial run with local MLflow"
    }
    
    with open("meta.json", "w") as f:
        json.dump(meta_info, f, indent=2)
    mlflow.log_artifact("meta.json")
    
    print("\n✅ Run completed successfully!")
    print(f"🔗 View in UI: {HOST}:{PORT}")
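
The run above writes `meta.json` into the current working directory, where it stays after the run ends. If you would rather not leave files behind, one option is to write the metadata into a temporary directory. The sketch below uses only the standard library; `write_metadata` is a hypothetical helper, and the MLflow call is indicated in a comment because it needs an active run.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(meta: dict, directory: str, filename: str = "meta.json") -> Path:
    """Serialise meta as indented JSON inside directory and return the file path."""
    path = Path(directory) / filename
    path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return path


meta_info = {"params": {"learning_rate": 0.1}, "notes": "temp-file sketch"}
with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = write_metadata(meta_info, tmp_dir)
    # Inside an active run, mlflow.log_artifact(str(meta_path)) would go here,
    # while the temporary file still exists.
    round_trip = json.loads(meta_path.read_text(encoding="utf-8"))

print(round_trip == meta_info)
```

MLflow also provides `mlflow.log_dict(meta_info, "meta.json")`, which logs a dictionary as an artifact without an intermediate file.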


### ⚠️ Important Warning: Logging Large Datasets

```python
mlflow.log_table(X, "X_train.parquet")
```

**When to log data:**
- ✅ For final, production-ready models
- ✅ When you need complete reproducibility
- ✅ For small-to-medium datasets (< 100 MB)

**When NOT to log data:**
- ❌ During frequent experimentation
- ❌ With large datasets (> 1 GB)
- ❌ When sending to remote servers (avoid overload)

**💡 Best Practice**: 
- Experiment locally WITHOUT logging data
- Once you have a winning model, do ONE final run WITH data logging
- Only send that final run to remote servers
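
The size guidance above can be turned into a simple guard. This is a sketch under the assumption that 100 MB is your cut-off; `should_log_data` and its `limit_mb` parameter are hypothetical names, not MLflow settings.

```python
def should_log_data(n_bytes: int, limit_mb: float = 100.0) -> bool:
    """Return True when the dataset is at or under the size limit (in megabytes)."""
    return n_bytes <= limit_mb * 1024 * 1024


for size in (5 * 1024**2, 2 * 1024**3):
    verdict = "log it" if should_log_data(size) else "skip data logging"
    print(f"{size / 1024**2:.0f} MB -> {verdict}")
```

For a pandas DataFrame, `int(X.memory_usage(deep=True).sum())` gives the in-memory size in bytes, which may differ from the size of the file MLflow eventually writes.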

---


## 6. Exploring the MLflow UI

Now that we've logged a run, let's explore what MLflow captured!

### 🌐 Open the UI

Go to your browser: `http://127.0.0.1:6969`

You should see:

### 📸 Screenshot placeholder: Experiment view
![tutorial_experiement.png](assets/tutorial_experiement.png)


---

### 🔍 Click on your run

You'll see detailed information:

1. **Parameters tab**: Your hyperparameters
2. **Metrics tab**: Accuracy, precision, recall, F1
3. **Artifacts tab**: Model, data files, metadata

### 📸 Screenshot placeholder: Run details
![first_run.png](assets/first_run.png)

![image.png](assets/overview%20details.png)

![artifact.png](assets/artifact.png)

---

### 📊 Explore the Model

Click on **Artifacts** → **model**

You'll see:
- `MLmodel` file (metadata)
- `model.pkl` (your actual model)
- `conda.yaml` (environment info)
- `requirements.txt` (dependencies)

### 📸 Screenshot placeholder: Model artifacts
![image.png](assets/click%20on%20model.png)

![image.png](assets/model_artifact.png)


---


## 7. Run Multiple Experiments

The real power of MLflow comes from comparing multiple runs. Let's train with different learning rates:


In [None]:
# Try different learning rates
learning_rates = [0.01, 0.05, 0.1, 0.2]

print(" 🏄 Running experiments with different learning rates...\n")

for lr in learning_rates:
    print(f"🏃‍♀️ Running with learning_rate={lr}")
    
    with mlflow.start_run(run_name=f"lr_{lr}"):
        # Update parameters
        params = {"learning_rate": lr, "random_state": 42}
        
        # Train
        model = HistGradientBoostingClassifier(**params)
        model.fit(X.values, y["labels"])  # pass a 1-D target rather than a one-column DataFrame
        y_pred = model.predict(X.values)
        
        # Log
        mlflow.log_params(params)
        mlflow.log_metric("accuracy", accuracy_score(y, y_pred))
        mlflow.log_metric("f1_score", f1_score(y, y_pred, zero_division=0))
        
        signature = infer_signature(X, y_pred)
        mlflow.sklearn.log_model(
            model,
            name="model",
            signature=signature,
            input_example=X.iloc[:1]
        )

    print("Completed\n")

print("🎉 All experiments completed!")
print(f"\n🔗 Compare runs in UI: {HOST}:{PORT}")
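
When more than one hyperparameter varies, the list of runs can be generated rather than written by hand. The following is a minimal sketch using `itertools.product`; `expand_grid` is a hypothetical helper, and `max_iter` is used here only as a second example parameter of `HistGradientBoostingClassifier`.

```python
from itertools import product


def expand_grid(grid: dict) -> list:
    """Return one (run_name, params) pair per combination of grid values."""
    keys = list(grid)
    runs = []
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        # Build a descriptive name such as "learning_rate_0.1_max_iter_50".
        run_name = "_".join(f"{k}_{v}" for k, v in params.items())
        runs.append((run_name, params))
    return runs


grid = {"learning_rate": [0.05, 0.1], "max_iter": [50, 100]}
for run_name, params in expand_grid(grid):
    print(run_name, params)
```

Each pair could drive one iteration of the loop above, with `run_name` passed to `mlflow.start_run(run_name=...)` and `params` passed to both the model and `mlflow.log_params`.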


### 📊 Compare Runs in the UI

1. Go to the MLflow UI
2. Select multiple runs (checkboxes)
3. Click **Compare**
4. View side-by-side parameter and metric comparisons

### 📸 Screenshot placeholder: Run comparison
![compare_run.png](assets/compare_run.png)

![compare_runs_in experiment.png](<assets/compare_runs_in experiment.png>)

---


## 8. Programmatically Query Runs

You can also access run data programmatically:


In [None]:
# Get all runs from the current experiment
experiment = client.get_experiment_by_name(EXPERIMENT)
runs = client.search_runs(experiment_ids=[experiment.experiment_id])  # expects a list of IDs

print(f"📊 Found {len(runs)} runs in experiment '{EXPERIMENT}'\n")

# Display run information
run_data = []
for run in runs:
    run_data.append({
        "Run Name": run.data.tags.get("mlflow.runName", "N/A"),
        "Learning Rate": run.data.params.get("learning_rate", "N/A"),
        # Use None (not "N/A") for missing metrics so the columns can still be sorted
        "Accuracy": run.data.metrics.get("accuracy"),
        "F1 Score": run.data.metrics.get("f1_score"),
        "Run ID": run.info.run_id[:8] + "..."  # Shortened for display
    })

runs_df = pd.DataFrame(run_data)
display(runs_df.sort_values("Accuracy", ascending=False))
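
Once the run summaries are in hand, picking the best one takes only a few lines. The sketch below works on plain dictionaries with made-up numbers so that it runs without a server; `best_run` is a hypothetical helper, and rows whose metric is `None` are skipped.

```python
from typing import List, Optional


def best_run(rows: List[dict], metric: str) -> Optional[dict]:
    """Return the row with the highest value for metric, ignoring rows that lack it."""
    scored = [row for row in rows if row.get(metric) is not None]
    return max(scored, key=lambda row: row[metric]) if scored else None


# Toy summaries with invented values, for illustration only.
toy_rows = [
    {"Run Name": "lr_0.01", "Accuracy": 0.91},
    {"Run Name": "lr_0.1", "Accuracy": 0.95},
    {"Run Name": "failed_run", "Accuracy": None},
]
print(best_run(toy_rows, "Accuracy"))
```

Sorting can also be done on the server side: `client.search_runs` accepts an `order_by` argument such as `["metrics.accuracy DESC"]` together with `max_results=1`.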


## 9. Best Practices Summary

### ✅ DO:
- Give runs descriptive names (e.g., `lr_0.1_dropout_0.3`)
- Log all hyperparameters consistently
- Log multiple metrics (not just accuracy)
- Use experiments to organize related work
- Add metadata/notes for context

### 🙅🏿 DON'T:
- Log large datasets on every run
- Forget to log important parameters
- Mix unrelated experiments in the same experiment name
- Delete `mlruns/` and `mlartifacts/` directories (unless you want to lose everything!)

---


## 10. Where to Find Your Data

MLflow stores everything locally in:

```
your_working_directory/
├── mlruns/              # Metadata, parameters, metrics
│   └── <experiment_id>/
│       └── <run_id>/
│           └── meta.yaml
└── mlartifacts/         # Models, data, artifacts
    └── <experiment_id>/
        └── <run_id>/
            └── artifacts/
                ├── model/
                ├── X_train.parquet
                └── meta.json
```
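
To keep an eye on how much disk space these directories use, you could sum the file sizes below each one. This is a standard-library sketch; `directory_size_bytes` is a hypothetical helper, and the experiment and run IDs in the demo (`0`, `abc123`) are invented.

```python
import tempfile
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files below root (0 if root is missing)."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


# Demonstrate on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "mlartifacts" / "0" / "abc123" / "artifacts"
    run_dir.mkdir(parents=True)
    (run_dir / "meta.json").write_text("{}", encoding="utf-8")
    demo_size = directory_size_bytes(Path(tmp) / "mlartifacts")
    missing_size = directory_size_bytes(Path(tmp) / "mlruns")

print(demo_size, missing_size)
```

In your own working directory, `directory_size_bytes(Path("mlartifacts"))` would report the space taken by logged models and data.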

💡 **Tip**: You'll need these paths in Tutorial 2 when sending runs to a remote server!

---


## 🎓 Summary

Congratulations! You've learned:

✅ How to set up a local MLflow tracking server  
✅ How to log parameters, metrics, models, and artifacts  
✅ How to navigate the MLflow UI  
✅ How to compare multiple runs  
✅ Best practices for local experimentation  

---

## 🚀 Next Steps

In **Tutorial 2**, you'll learn how to:
- Select your best runs locally
- Send **only** the successful runs to a remote MLflow server
- Avoid overloading remote servers with unnecessary data

---

## 🆘 Troubleshooting

### Problem: "Connection refused" error
**Solution**: Make sure the MLflow server is running in a separate terminal.

### Problem: Can't find runs in UI
**Solution**: Check that you're using the correct port and that the server was started from the directory containing your `mlruns/` and `mlartifacts/` folders.

<!-- ### Problem: "Module not found" errors
**Solution**: Install the package: `pip install -e .[dev]` -->

---
