# Set Up CUDA Environment

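This cell sets up the environment variables TensorFlow needs to locate the CUDA libraries for GPU support. On Linux the work is delegated to the local `gpu_setup` module, which runs when it is imported. The setup: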

- Determines the conda environment path and sets it as `CUDA_PATH`.
- On Linux, additionally sets `LD_LIBRARY_PATH` and `XLA_FLAGS` for proper CUDA library access.
- Prints the configured CUDA path for verification.
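
As a rough illustration of the three steps above, the sketch below derives the variables from a conda prefix. The `gpu_setup` module itself is not shown in this notebook, so the helper name `build_cuda_env`, the `lib` subdirectory, and the placeholder prefix are assumptions rather than its actual contents.

```python
import os
import platform


def build_cuda_env(conda_prefix, system=None, existing_ld_path=""):
    """Return CUDA-related variables for a conda prefix (hypothetical helper)."""
    system = system or platform.system()
    env = {"CUDA_PATH": conda_prefix}
    if system == "Linux":
        # Assumption: the CUDA shared libraries live in <prefix>/lib
        lib_dir = os.path.join(conda_prefix, "lib")
        parts = [lib_dir] + ([existing_ld_path] if existing_ld_path else [])
        env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
        # Tell XLA where to look for the CUDA toolkit files
        env["XLA_FLAGS"] = f"--xla_gpu_cuda_data_dir={conda_prefix}"
    return env


# Fall back to a placeholder prefix when no conda environment is active
prefix = os.environ.get("CONDA_PREFIX", "/opt/conda/envs/example")
cuda_env = build_cuda_env(prefix, existing_ld_path=os.environ.get("LD_LIBRARY_PATH", ""))
for key, value in cuda_env.items():
    print(f"{key} = {value}")
```

The helper only returns a dictionary; applying it would mean calling `os.environ.update(cuda_env)` before TensorFlow is imported. Changing `LD_LIBRARY_PATH` inside an already running process generally does not affect the dynamic loader, which may be why the real module reports pre-loading libraries instead.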


In [1]:
# -------------------------
# GPU and Environment Setup
# -------------------------
import os
import platform
import sys

# Always silence TensorFlow logs
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# Only use the gpu_setup module on Linux
if platform.system() == "Linux":
    print("Linux environment detected - loading specialized GPU setup...")
    try:
        import gpu_setup  # configures the CUDA environment as a side effect of import
    except ImportError:
        print("Warning: gpu_setup.py module not found. GPU functionality may be limited.")
else:
    print("Non-Linux environment detected - using standard GPU configuration.")

# Import TensorFlow and print version info
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
print("GPUs detected:", tf.config.list_physical_devices('GPU'))

Linux environment detected - loading specialized GPU setup...
✅ Libraries pre-loaded successfully
✅ GPU setup complete: GPU 0 enabled and configured for training
TensorFlow version: 2.10.1
GPUs detected: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


# Import Machine Learning Libraries and Modules

This cell loads the libraries required for building and training the model: TensorFlow and related modules, MLflow for experiment tracking, and additional libraries for data processing, visualization, and system utilities. Utility functions from `photoz_utils` and `DataMakerPlus` are also imported for custom data handling.

In [2]:
# Import Libraries
import tensorflow_probability as tfp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import h5py
import tensorboard
import mlflow
import mlflow.tensorflow
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Conv2D, MaxPooling2D, Flatten, Input, Concatenate
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler
from tensorboard.plugins.hparams import api as hp
import getpass
from datetime import datetime
import shutil
import sys
import platform
import time
import socket
import subprocess

# Import your local utility functions
from photoz_utils import *
from DataMakerPlus import *

 
 

# MLflow & Parameter Configuration

This cell initializes your training parameters. Feel free to tweak any of these to run multiple experiments and compare how different settings affect model performance:

- **image_size**: size of input images (here, 64×64)  
- **epochs**: number of full passes through the training data (5)  
- **batch_size**: samples per gradient update (256)  
- **learning_rate**: step size for optimizer (0.0001)  
- **experiment_name**: MLflow experiment under which runs will be grouped  
- **run_name**: specific run identifier (automatically generated if `None`)  
- **gpu_id**: index of the GPU to use (0-based)  
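
One way to run several experiments is to expand a small grid of overrides into full parameter dictionaries and train once per dictionary. The sketch below is illustrative only: the `sweep` values and the `expand_sweep` helper are hypothetical and not part of this notebook.

```python
import itertools

base_params = {
    'image_size': 64,
    'epochs': 5,
    'batch_size': 256,
    'learning_rate': 0.0001,
    'experiment_name': "Galaxy_CNN_Redshift_Estimation",
    'run_name': None,
    'gpu_id': 0,
}

# Hypothetical values to compare; not taken from the notebook
sweep = {
    'batch_size': [128, 256],
    'learning_rate': [1e-4, 1e-3],
}


def expand_sweep(base, sweep):
    """Yield one full parameter dict per combination of sweep values."""
    keys = list(sweep)
    for values in itertools.product(*(sweep[k] for k in keys)):
        yield {**base, **dict(zip(keys, values))}


configs = list(expand_sweep(base_params, sweep))
for cfg in configs:
    print(f"batch_size={cfg['batch_size']}, learning_rate={cfg['learning_rate']}")
```

Each resulting dictionary could be assigned to `params` before re-running the training cell, so every combination shows up as a separate run under the same MLflow experiment.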


In [3]:
# -------------------------
# MLflow and Directory Setup
# -------------------------
params = {
    'image_size': 64,             # Set image size to 64
    'epochs': 5,
    'batch_size': 256,
    'learning_rate': 0.0001,
    'experiment_name': "Galaxy_CNN_Redshift_Estimation",
    'run_name': None,             # Auto-generate run name if None
    'gpu_id': 0
}


# MLflow Environment & Directory Setup

This cell prepares your MLflow tracking, artifacts, and local experiment folders:

- **Config Flags:**  
  - `clear_tracking_store`, `clear_experiments_store`, `clear_artifacts`  
  - **True** → delete existing data for a completely fresh run  
  - **False** → keep prior runs/artifacts so you can compare multiple experiments

- **1) Tracking Store:**  
  - Creates `./mlruns` (and the required `.trash` subfolder).  
  - Sets the MLflow tracking URI (platform‐aware file URI).  
  - Selects your experiment by name, with diagnostic prints and a SQLite fallback if it fails.

- **2) Artifact Staging:**  
  - Creates (or recreates) `./MLFlowData` for staging all per-run artifacts, including:  
    - **Training history CSVs** (`<timestamp>_training_history.csv`)  
    - **Training curve images** (`<timestamp>_training_curves.png`)  
    - **Prediction scatter plots** (`<timestamp>_prediction_plot.png`)

- **3) Experiment Directories:**  
  - Under `./experiments`, sets up:  
    - `MLCheckpoints` (model weights)  
    - `MLlogs` (TensorBoard logs)  
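
The platform-aware file URI mentioned in step 1 can also be produced with the standard library, which handles Windows drive letters and POSIX paths in one call. This is a minimal sketch of that alternative, not the approach the cell below takes; `tracking_uri_for` is a hypothetical helper name.

```python
from pathlib import Path


def tracking_uri_for(directory):
    """Return a file:// URI for a local directory (hypothetical helper)."""
    # resolve() makes the path absolute, which as_uri() requires
    return Path(directory).resolve().as_uri()


uri = tracking_uri_for("mlruns")
print(uri)
```

The returned string could be passed to `mlflow.set_tracking_uri()` in place of the hand-built URI.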


In [4]:
# ─── Config options ───────────────────────────────────────────────────────────
clear_tracking_store    = True
clear_experiments_store = True
clear_artifacts         = True

# ─── 1) Tracking store setup ─────────────────────────────────────────────────
mlruns_dir = os.path.abspath("mlruns")
if clear_tracking_store and os.path.exists(mlruns_dir):
    shutil.rmtree(mlruns_dir)
os.makedirs(mlruns_dir, exist_ok=True)

# MLflow expects a ".trash" folder
trash_dir = os.path.join(mlruns_dir, ".trash")
os.makedirs(trash_dir, exist_ok=True)

# Set tracking URI
if os.name == 'nt':  # Windows
    mlruns_uri = f"file:///{mlruns_dir.replace(os.sep, '/')}"
else:                # Linux/Mac
    mlruns_uri = f"file://{mlruns_dir}"

# Reuse the platform-aware URI so the environment variable matches on Windows too
os.environ['MLFLOW_TRACKING_URI'] = mlruns_uri
print(f"MLflow tracking URI → {mlruns_uri}")
mlflow.set_tracking_uri(mlruns_uri)

# Set experiment
try:
    mlflow.set_experiment(params['experiment_name'])
    print(f"Successfully set experiment to {params['experiment_name']}")
except Exception as e:
    print(f"Error setting experiment: {e}")
    # (Optional) diagnostics and fallback
    print("MLruns exists?", os.path.exists(mlruns_dir))
    print("Trash exists?", os.path.exists(trash_dir))
    print("Contents of mlruns:", os.listdir(mlruns_dir))
    # Fallback to SQLite if needed
    sqlite_uri = f"sqlite:///{os.path.abspath('mlflow.db')}"
    print("Falling back to SQLite at", sqlite_uri)
    mlflow.set_tracking_uri(sqlite_uri)
    mlflow.set_experiment(params['experiment_name'])
    print("Experiment set using SQLite backend")

# ─── 2) MLFlowData (artifact staging) ────────────────────────────────────────
mlflow_data_dir = os.path.abspath("MLFlowData")
if clear_artifacts and os.path.exists(mlflow_data_dir):
    shutil.rmtree(mlflow_data_dir)
    print("Deleted entire MLFlowData directory.")
os.makedirs(mlflow_data_dir, exist_ok=True)
print(f"Local MLFlowData directory → {mlflow_data_dir}")

# ─── 3) experiments/ folder (checkpoints & logs) ─────────────────────────────
base_dir = os.path.abspath("experiments")
if clear_experiments_store and os.path.exists(base_dir):
    shutil.rmtree(base_dir)
    print("Deleted entire experiments directory.")
checkpoint_dir = os.path.join(base_dir, "MLCheckpoints")
log_dir        = os.path.join(base_dir, "MLlogs")
os.makedirs(checkpoint_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)
print("experiments/ structure set up:")
print(f"  • checkpoints → {checkpoint_dir}")
print(f"  • tensorboard logs → {log_dir}")

2025/04/28 20:38:16 INFO mlflow.tracking.fluent: Experiment with name 'Galaxy_CNN_Redshift_Estimation' does not exist. Creating a new experiment.


MLflow tracking URI → file:///home/jupyter-jacob/RepoCloned/MLFlow-CNN/mlruns
Successfully set experiment to Galaxy_CNN_Redshift_Estimation
Deleted entire MLFlowData directory.
Local MLFlowData directory → /home/jupyter-jacob/RepoCloned/MLFlow-CNN/MLFlowData
Deleted entire experiments directory.
experiments/ structure set up:
  • checkpoints → /home/jupyter-jacob/RepoCloned/MLFlow-CNN/experiments/MLCheckpoints
  • tensorboard logs → /home/jupyter-jacob/RepoCloned/MLFlow-CNN/experiments/MLlogs


# Hyperparameters and Dataset Paths

This cell selects the training, validation, and test dataset paths (switching to the bundled demo datasets, a single epoch, and a small batch size when a CI or Binder environment is detected) and verifies that the dataset files exist. It then builds the timestamped checkpoint filename, defines the generator arguments (image key, the five `cmodel_mag` magnitude columns, and the `specz_redshift` target), and creates the training, validation, and test data generators. Adjust the paths as needed depending on where you downloaded the datasets.
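
The existence check can be sketched as a small helper that reports every missing file at once rather than stopping at the first one. The helper name `find_missing` and the temporary file names are hypothetical and used only for demonstration.

```python
import os
import tempfile


def find_missing(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Demonstrate with one real file and one absent file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "training.hdf5")
    absent = os.path.join(tmp, "validation.hdf5")
    open(present, "w").close()
    missing = find_missing([present, absent])
    print("Missing:", [os.path.basename(p) for p in missing])
```

In the notebook, a non-empty result would be the point at which to raise `FileNotFoundError` listing all missing datasets.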


In [5]:
# -------------------------
# Dataset Paths and Preprocessing Setup
# -------------------------
import os

# Check if running in CI environment
in_ci = os.path.exists('.ci_mode') or os.environ.get('CI') or os.environ.get('GITHUB_ACTIONS') or os.environ.get('BINDER_SERVICE_HOST')

# Set dataset paths based on environment
if in_ci:
    print("CI/Binder environment detected - using demo datasets")
    TRAIN_PATH = os.path.join(os.getcwd(), 'demo_astrodata/5x64x64_training_with_morphology.hdf5')
    VAL_PATH = os.path.join(os.getcwd(), 'demo_astrodata/5x64x64_validation_with_morphology.hdf5')
    TEST_PATH = os.path.join(os.getcwd(), 'demo_astrodata/5x64x64_testing_with_morphology.hdf5')
    
    # Extreme resource conservation for CI
    params['epochs'] = 1  # Single epoch
    
    # Drastically reduce batch size to avoid memory issues
    original_batch_size = params['batch_size']
    params['batch_size'] = 4  # Use tiny batch size for CI
    
    print(f"CI environment: Reduced epochs to {params['epochs']} and batch size from {original_batch_size} to {params['batch_size']}")
else:
    print("Local environment detected - using original dataset paths")
    TRAIN_PATH = '/shared/astrodata/5x64x64_training_with_morphology.hdf5'
    VAL_PATH = '/shared/astrodata/5x64x64_validation_with_morphology.hdf5'
    TEST_PATH = '/shared/astrodata/5x64x64_testing_with_morphology.hdf5'

# Check if files exist
for path in [TRAIN_PATH, VAL_PATH, TEST_PATH]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"Dataset not found: {path}")

# Prepare model checkpoint filename
username = getpass.getuser()
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
checkpoint_filepath = os.path.join(checkpoint_dir, f"{username}_cp_{timestamp}.weights.h5")

# Define generator arguments (using original preprocessing details)
param_names = []
for band in ['g', 'r', 'i', 'z', 'y']:
    for col in ['cmodel_mag']:
        param_names.append(f"{band}_{col}")

gen_args = {
    'image_key': 'image',
    'numerical_keys': param_names,
    'y_key': 'specz_redshift',
    'scaler': True,             # Data scaling enabled
    'labels_encoding': False,   # No extra label encoding
    'batch_size': params['batch_size'],
    'shuffle': False
}

train_gen = HDF5DataGenerator(TRAIN_PATH, mode='train', **gen_args)
val_gen   = HDF5DataGenerator(VAL_PATH, mode='train', **gen_args)
test_gen  = HDF5DataGenerator(TEST_PATH, mode='test', **gen_args)

# Define the Model Architecture

This cell defines the `create_model()` function, which builds a Keras model with two input branches. One branch (CNN) processes image data, and the other (NN) handles additional numerical features. The outputs from these branches are combined and passed through a final layer to produce a single prediction.
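
To see what the CNN branch hands to its dense layers, the spatial size can be traced by hand. The sketch below uses the filter counts and pooling placement from the `create_model()` cell and assumes the default 64×64 input; it is plain arithmetic, not a call into Keras.

```python
image_size = 64

# (filters, followed_by_2x2_pooling) for the seven convolutional layers
conv_layers = [
    (32, True), (64, True), (128, True), (256, True), (256, True),
    (512, False), (512, False),
]

size = image_size
channels = 5  # five input bands, channels-first
for filters, pooled in conv_layers:
    channels = filters   # 'same' padding keeps height and width unchanged
    if pooled:
        size //= 2       # 2x2 max pooling halves height and width

flattened = channels * size * size
print(f"({channels}, {size}, {size}) -> {flattened}")  # → (512, 2, 2) -> 2048
```

Five pooling steps reduce 64 to 2, so the flatten layer yields 2048 features before the 512-unit dense layer.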

The model is compiled using the Adam optimizer, a custom HSC loss function (instead of the standard mean squared error loss), and RMSE is tracked as a performance metric. This setup allows for a tailored approach to measuring prediction errors while making the architecture reusable throughout the notebook.
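
The custom loss computes `1 - 1 / (1 + (dz / gamma)^2)` with `dz = y_pred - y_true` and `gamma = 0.15`. A scalar version in plain Python, written here only to illustrate its shape (the name `hsc_loss` is an assumption), shows that the loss is 0 for a perfect prediction, 0.5 when the error equals `gamma`, and approaches 1 for large errors.

```python
def hsc_loss(y_true, y_pred, gamma=0.15):
    """Scalar form of the loss used in the notebook's calculate_loss()."""
    dz = y_pred - y_true
    return 1.0 - 1.0 / (1.0 + (dz / gamma) ** 2)


for dz in (0.0, 0.15, 1.5):
    print(f"dz={dz}: loss={hsc_loss(0.0, dz):.4f}")
```

Because the loss saturates near 1, a single badly mispredicted redshift contributes far less than it would under mean squared error.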


In [6]:
def create_model():
    # Define inputs for image (CNN branch) and numerical data (NN branch)
    input_cnn = Input(shape=(5, params['image_size'], params['image_size']))
    input_nn  = Input(shape=(5,))
    
    # CNN branch with 7 convolutional layers and pooling
    conv1 = Conv2D(32, kernel_size=(3, 3), activation='tanh', padding='same', data_format='channels_first')(input_cnn)
    pool1 = MaxPooling2D(pool_size=(2,2), data_format='channels_first')(conv1)
    conv2 = Conv2D(64, kernel_size=(3, 3), activation='tanh', padding='same', data_format='channels_first')(pool1)
    pool2 = MaxPooling2D(pool_size=(2,2), data_format='channels_first')(conv2)
    conv3 = Conv2D(128, kernel_size=(3, 3), activation='tanh', padding='same', data_format='channels_first')(pool2)
    pool3 = MaxPooling2D(pool_size=(2,2), data_format='channels_first')(conv3)
    conv4 = Conv2D(256, kernel_size=(3, 3), activation='tanh', padding='same', data_format='channels_first')(pool3)
    pool4 = MaxPooling2D(pool_size=(2,2), data_format='channels_first')(conv4)
    conv5 = Conv2D(256, kernel_size=(3, 3), activation='tanh', padding='same', data_format='channels_first')(pool4)
    pool5 = MaxPooling2D(pool_size=(2,2), data_format='channels_first')(conv5)
    conv6 = Conv2D(512, kernel_size=(3, 3), activation='relu', padding='same', data_format='channels_first')(pool5)
    conv7 = Conv2D(512, kernel_size=(3, 3), activation='relu', padding='same', data_format='channels_first')(conv6)
    flatten = Flatten()(conv7)
    dense1 = Dense(512, activation='tanh')(flatten)
    dense2 = Dense(128, activation='tanh')(dense1)
    dense3 = Dense(32, activation='tanh')(dense2)
    
    # NN branch: fully connected layers processing numerical inputs
    NUM_DENSE_UNITS = 200
    hidden1 = Dense(NUM_DENSE_UNITS, activation="relu")(input_nn)
    hidden2 = Dense(NUM_DENSE_UNITS, activation="relu")(hidden1)
    hidden3 = Dense(NUM_DENSE_UNITS, activation="relu")(hidden2)
    hidden4 = Dense(NUM_DENSE_UNITS, activation="relu")(hidden3)
    hidden5 = Dense(NUM_DENSE_UNITS, activation="relu")(hidden4)
    hidden6 = Dense(NUM_DENSE_UNITS, activation="relu")(hidden5)
    
    # Concatenate the outputs from both branches and produce the final prediction
    concat = Concatenate()([dense3, hidden6])
    output = Dense(1)(concat)
    model = Model(inputs=[input_cnn, input_nn], outputs=output)
    
    # Define custom HSC loss function
    def calculate_loss(y_true, y_pred):
        dz = y_pred - y_true
        gamma = 0.15
        denominator = 1.0 + tf.square(dz / gamma)
        L = 1 - 1.0 / denominator
        return L
    
    model.compile(
        optimizer=Adam(learning_rate=params['learning_rate']),
        loss=calculate_loss,
        metrics=[tf.keras.metrics.RootMeanSquaredError()]
    )
    return model


# Define Callbacks for Logging Metrics

This cell sets up several callbacks to monitor and manage training:

- **TensorBoard Callback:**  
  Logs training data for visualization in TensorBoard, including histograms of the model's layers.

- **Model Checkpoint Callback:**  
  Saves only the model weights at the end of an epoch when the monitored training loss (`loss`) improves, so the checkpoint file holds the best weights seen so far.

- **Hyperparameter Callback:**  
  Logs key hyperparameters (like the number of dense units, batch size, epochs, learning rate, etc.) to help track the training setup.

- **MLflow Callback (Custom):**  
  At the end of each epoch, it logs training metrics (like loss and RMSE) to MLflow for experiment tracking.
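
The best-only checkpoint rule can be illustrated without TensorFlow. The sketch below is a simplification of the idea and not Keras's implementation; the helper name and the loss values are made up for illustration.

```python
def epochs_that_save(losses):
    """Return the 1-based epochs at which a 'min'-mode best-only checkpoint is written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:   # strictly better than anything seen so far
            best = loss
            saved.append(epoch)
    return saved


# Illustrative loss values, not taken from a real run
print(epochs_that_save([0.36, 0.28, 0.30, 0.25, 0.26]))  # → [1, 2, 4]
```

Epochs 3 and 5 are skipped because their loss is worse than the best value recorded so far.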


In [7]:

# Callback definitions
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='loss',
    mode='min',
    save_freq='epoch',
    save_best_only=True,
    verbose=1
)

hparam_callback = hp.KerasCallback(log_dir, {
    'num_dense_units': 200,
    'batch_size': params['batch_size'],
    'num_epochs': params['epochs'],
    'learning_rate': params['learning_rate'],
    'z_max': 4,
    'data_format': 'channels_first'
})

class MLflowCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        for name, value in logs.items():
            mlflow.log_metric(name, value, step=epoch)


# Training Function with MLflow Logging

This cell defines the `train_model_with_mlflow()` function, which handles the entire training process and logs details using MLflow:

- **Run Setup:**  
  It sets a unique run name and logs key parameters and hyperparameters.

- **Model Training:**  
  The function creates the model, trains it using training and validation data, and uses callbacks (for TensorBoard, checkpointing, hyperparameter logging, and custom MLflow logging) during training.

- **Artifact Logging:**  
  After training, it saves the model, training history, and plots of the training loss. These artifacts are logged to MLflow.

- **Prediction and Evaluation:**  
  It generates and saves a scatter plot comparing true and predicted values, evaluates the model on test data, and logs the test metrics.

- **Final Steps:**  
  The model's summary and package requirements are saved and logged, and the function prints the MLflow Run ID to confirm completion.
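
Every artifact file written by the function shares a prefix built from a timestamp and the first eight characters of the MLflow run ID. A minimal sketch of that naming scheme follows; `make_file_prefix` is a hypothetical helper and a random hex string stands in for a real run ID.

```python
import uuid
from datetime import datetime


def make_file_prefix(run_id, now=None):
    """Combine a timestamp with the first 8 characters of a run ID."""
    now = now or datetime.now()
    return f"{now.strftime('%Y%m%d_%H%M%S')}_{run_id[:8]}"


run_id = uuid.uuid4().hex  # stand-in for mlflow.active_run().info.run_id
prefix = make_file_prefix(run_id)
print(f"{prefix}_training_history.csv")
```

Including part of the run ID keeps files from different runs apart even if two runs finish within the same second.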


In [8]:
# -------------------------
# Training Function with MLflow Logging
# -------------------------
def train_model_with_mlflow():
    run_name = params['run_name'] or f"GalaxyCNN_Size{params['image_size']}_Batch{params['batch_size']}_LR{params['learning_rate']}_Epochs{params['epochs']}_{username}"
    with mlflow.start_run(run_name=run_name):
        mlflow.set_tag("username", username)
        mlflow.log_params({**params, 'num_dense_units': 200, 'z_max': 4, 'data_format': 'channels_first'})
        mlflow.tensorflow.autolog(
            log_models=False,            # don’t auto‐save the model at training end
            log_datasets=False,           
            log_model_signatures=False,   
            silent=True                  # suppress autolog INFO/WARNING messages
        )

        
        model = create_model()
        history = model.fit(
            train_gen,
            epochs=params['epochs'],
            validation_data=val_gen,
            callbacks=[tensorboard_callback, model_checkpoint_callback, hparam_callback, MLflowCallback()],
            verbose=1,
            shuffle=True
        )
        
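        # Log the best-weights checkpoint, then save the final full model to a
        # separate file so it does not overwrite that checkpoint
        mlflow.log_artifact(checkpoint_filepath)
        final_model_path = checkpoint_filepath.replace(".weights.h5", "_final.h5")
        model.save(final_model_path)
        mlflow.log_artifact(final_model_path)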
        
        # Get unique identifiers for this run
        run_id = mlflow.active_run().info.run_id
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        file_prefix = f"{timestamp}_{run_id[:8]}"
        
        # Save training history to MLFlowData directory with timestamp
        history_df = pd.DataFrame(history.history)
        history_csv_path = os.path.join(mlflow_data_dir, f"{file_prefix}_training_history.csv")
        history_df.to_csv(history_csv_path, index=False)
        mlflow.log_artifact(history_csv_path)
        
        # Save training curves to MLFlowData directory with timestamp
        plt.figure(figsize=(8, 4))
        plt.plot(history_df.index, history_df['loss'], label='Training Loss')
        if 'val_loss' in history_df.columns:
            plt.plot(history_df.index, history_df['val_loss'], label='Validation Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('Training Curves')
        plt.legend()
        plt.tight_layout()
        training_curves_path = os.path.join(mlflow_data_dir, f"{file_prefix}_training_curves.png")
        plt.savefig(training_curves_path)
        mlflow.log_artifact(training_curves_path)
        plt.close()
        
        # Save prediction plot to MLFlowData directory with timestamp
        predictions = model.predict(test_gen)
        predictions = predictions.squeeze()
        with h5py.File(TEST_PATH, 'r') as f:
            test_labels = np.asarray(f['specz_redshift'][:])
        test_labels = test_labels.squeeze()
        print("Test labels shape:", test_labels.shape)
        print("Predictions shape:", predictions.shape)

        plt.figure(figsize=(6, 6))
        sc = plt.scatter(test_labels, predictions, c=predictions, cmap='viridis', alpha=0.7, edgecolors='w', s=50)
        plt.plot([test_labels.min(), test_labels.max()], [test_labels.min(), test_labels.max()], 'r--', lw=2)
        plt.xlabel("True Redshift")
        plt.ylabel("Predicted Redshift")
        plt.title("Prediction Scatter Plot")
        plt.colorbar(sc, label="Predicted Value")
        plt.tight_layout()
        prediction_plot_path = os.path.join(mlflow_data_dir, f"{file_prefix}_prediction_plot.png")
        plt.savefig(prediction_plot_path)
        mlflow.log_artifact(prediction_plot_path)
        plt.close()
        
        # Save test metrics to MLFlowData directory with timestamp
        test_loss, test_rmse = model.evaluate(test_gen, verbose=1)
        mlflow.log_metric("test_loss", test_loss)
        mlflow.log_metric("test_rmse", test_rmse)
        test_metrics_path = os.path.join(mlflow_data_dir, f"{file_prefix}_test_metrics.txt")
        with open(test_metrics_path, "w") as f:
            f.write(f"Test Loss: {test_loss}\nTest RMSE: {test_rmse}\n")
        mlflow.log_artifact(test_metrics_path)
        
        mlflow.keras.log_model(model, "model")
        
        # Save model summary to MLFlowData directory with timestamp
        model_summary_lines = []
        model.summary(print_fn=lambda line: model_summary_lines.append(line))
        model_summary_path = os.path.join(mlflow_data_dir, f"{file_prefix}_model_summary.txt")
        with open(model_summary_path, "w") as f:
            f.write("\n".join(model_summary_lines))
        mlflow.log_artifact(model_summary_path)
        
        # Save requirements to MLFlowData directory with timestamp
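        requirements_path = os.path.join(mlflow_data_dir, f"{file_prefix}_requirements.txt")
        # Call pip through the current interpreter and write its output directly,
        # which avoids shell quoting problems if the path contains spaces
        with open(requirements_path, "w") as f:
            subprocess.run([sys.executable, "-m", "pip", "freeze"], stdout=f, check=False)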
        mlflow.log_artifact(requirements_path)
        
        print(f"Training complete. MLflow Run ID: {mlflow.active_run().info.run_id}")
        print(f"Artifacts saved to MLFlowData with prefix: {file_prefix}")

# Run Training

This final cell calls the `train_model_with_mlflow()` function, which starts the training process, logs experiment details with MLflow, saves model checkpoints, and evaluates the model on the test set.


In [9]:
# Simply run this cell to start training the model, log metrics, and save artifacts via MLflow.

train_model_with_mlflow()



Epoch 1/5
Epoch 1: loss improved from inf to 0.35870, saving model to /home/jupyter-jacob/RepoCloned/MLFlow-CNN/experiments/MLCheckpoints/jupyter-jacob_cp_2025-04-28_20-38-34.weights.h5
Epoch 2/5
Epoch 2: loss improved from 0.35870 to 0.28438, saving model to /home/jupyter-jacob/RepoCloned/MLFlow-CNN/experiments/MLCheckpoints/jupyter-jacob_cp_2025-04-28_20-38-34.weights.h5
Epoch 3/5
Epoch 3: loss improved from 0.28438 to 0.25873, saving model to /home/jupyter-jacob/RepoCloned/MLFlow-CNN/experiments/MLCheckpoints/jupyter-jacob_cp_2025-04-28_20-38-34.weights.h5
Epoch 4/5
Epoch 4: loss improved from 0.25873 to 0.22379, saving model to /home/jupyter-jacob/RepoCloned/MLFlow-CNN/experiments/MLCheckpoints/jupyter-jacob_cp_2025-04-28_20-38-34.weights.h5
Epoch 5/5
Epoch 5: loss improved from 0.22379 to 0.20610, saving model to /home/jupyter-jacob/RepoCloned/MLFlow-CNN/experiments/MLCheckpoints/jupyter-jacob_cp_2025-04-28_20-38-34.weights.h5
Test labels shape: (40914,)
Predictions shape: (40914,)



Training complete. MLflow Run ID: 9e955422f6fa4c85b0ca62b3d9a20f7d
Artifacts saved to MLFlowData with prefix: 20250428_204051_9e955422


# MLflow UI Startup Helper

This cell automates launching the MLflow User Interface (UI) so you can easily visualize your experiment tracking results.

**What it Does:**

1.  **Cleans Up:** Stops any old MLflow UI processes that might still be running.
2.  **Configures:** Sets the `MLFLOW_TRACKING_URI` environment variable to point to the local `./mlruns` directory, telling MLflow where your experiment data is stored.
3.  **Launches UI:** Starts the MLflow UI server as a background process. It listens on port `5000` and is configured to be accessible from other machines on the network (host `0.0.0.0`).
4.  **Provides Access Links:** After a short pause (to allow the server to start), it prints several URLs you can use to access the UI:
    * **Local/SSH Tunnel:** `http://localhost:5000`
    * **Direct Network:** `http://<hostname>:5000` (uses your machine's actual hostname)
    * **Jupyter Proxy:** A common URL format for accessing services through Jupyter.
    * It also gives you the `ssh` command needed to create a tunnel from your local machine for secure access if needed.

**How to Use:**

* Run this cell.
* Click one of the generated URLs (usually the `localhost:5000` link if using SSH tunneling or working locally, or the direct/proxy link otherwise) to open the MLflow dashboard in your browser.
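
A fixed pause can be too short on a busy machine. As a hedged alternative, the sketch below polls the port until it accepts connections; `wait_for_port` is a hypothetical helper, and the demonstration uses a throwaway listener rather than a real MLflow server.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.25):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not listening yet; retry shortly
    return False


# Demonstration against a temporary listener on an OS-assigned port
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    demo_port = server.getsockname()[1]
    ready = wait_for_port("127.0.0.1", demo_port, timeout=2.0)
print("ready:", ready)
```

In the cell below, a call such as `wait_for_port("localhost", 5000)` could take the place of the two-second sleep before the URLs are printed.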

In [10]:
# 1) Kill any old MLflow UI
subprocess.run("pkill -f 'mlflow ui' || true", shell=True)

# 2) Point the CLI at your new ./mlruns folder
mlruns_dir = os.path.abspath("mlruns")
os.environ['MLFLOW_TRACKING_URI'] = f"file://{mlruns_dir}"

# 3) Fire up the UI on all interfaces:5000
cmd = [
    sys.executable, "-m", "mlflow", "ui",
    "--host", "0.0.0.0", "--port", "5000",
    "--backend-store-uri", os.environ['MLFLOW_TRACKING_URI']
]
subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# 4) Wait a sec
time.sleep(2)

# 5) Build and print clickable URLs
host = socket.gethostname()
print(f"→ Local (SSH‐tunnel):  http://localhost:5000")
print(f"→ Direct (if open):   http://{host}:5000")
print(f"→ Jupyter proxy:       http://{host}:8888/proxy/5000/")
print("\nTo tunnel from your laptop, run:")
print(f"  ssh -N -L 5000:localhost:5000 your_user@{host}")

→ Local (SSH‐tunnel):  http://localhost:5000
→ Direct (if open):   http://altair:5000
→ Jupyter proxy:       http://altair:8888/proxy/5000/

To tunnel from your laptop, run:
  ssh -N -L 5000:localhost:5000 your_user@altair
