# Set Up CUDA Environment

This sets up environment variables needed for TensorFlow to locate CUDA libraries for GPU support. It:

- Determines the conda environment path and sets it as `CUDA_PATH`.
- On Linux, additionally sets `LD_LIBRARY_PATH` and `XLA_FLAGS` for proper CUDA library access.
- Prints the configured CUDA path for verification.


In [1]:
# Setup Environment and CUDA Configuration
# This cell sets up the environment variables and prints the CUDA path.
import os
import sys
import getpass
from datetime import datetime
import platform

# Get conda environment path and set CUDA_PATH (works for both Linux and Windows)
conda_env_path = os.path.dirname(os.path.dirname(sys.executable))
os.environ['CUDA_PATH'] = conda_env_path
print(f"Set CUDA_PATH to: {conda_env_path}")

# Setup for Linux systems
if platform.system() != "Windows":
    cuda_path = os.path.join(conda_env_path, "lib")
    os.environ['LD_LIBRARY_PATH'] = f"{cuda_path}:{os.environ.get('LD_LIBRARY_PATH', '')}"
    os.environ['XLA_FLAGS'] = f"--xla_gpu_cuda_data_dir={conda_env_path}"
    print(f"Set LD_LIBRARY_PATH and XLA_FLAGS for Linux")

Set CUDA_PATH to: c:\Users\truen\miniconda3\envs


# Import Machine Learning Libraries and Modules

This cell loads the key libraries required for building and training the model. These include TensorFlow (and related modules), MLflow for tracking experiments, and additional libraries for data processing, visualization, and command-line argument parsing. Utility functions from photoz_utils and DataMakerPlus are also imported for custom data handling.

In [2]:
# Import Libraries
import tensorflow as tf
import tensorflow_probability as tfp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import h5py
import tensorboard
import mlflow
import mlflow.tensorflow
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Conv2D, MaxPooling2D, Flatten, Input, Concatenate
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler
from tensorboard.plugins.hparams import api as hp
import os
import getpass
from datetime import datetime
import shutil

# Import your local utility functions
from photoz_utils import *
from DataMakerPlus import *

print(f"TensorFlow version: {tf.__version__}")

TensorFlow version: 2.10.1


# Configure GPU Settings
This cell checks for available GPUs and enables memory growth. Enabling memory growth ensures that TensorFlow only allocates as much GPU memory as needed, rather than grabbing all available memory. This is especially useful when running multiple experiments on the same machine.

In [3]:
# -------------------------
# GPU Setup: List GPUs and enable memory growth
# -------------------------
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    for i, gpu in enumerate(gpus):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            print(f"Memory growth enabled for GPU {i}")
        except RuntimeError as e:
            print(e)
else:
    print("No GPUs found. Running on CPU.")

Memory growth enabled for GPU 0


# Hyperparameters and GPU Configuration

This cell defines a dictionary of key training parameters (e.g., image size, epochs, batch size, learning rate, experiment name) for interactive use. It then checks for multiple GPUs and, if available, sets the specified GPU for training. Finally, it confirms whether a GPU is available or if the training will proceed on CPU.


In [None]:
# -------------------------
# MLflow and Directory Setup
# -------------------------
params = {
    'image_size': 64,             # Set image size to 64
    'epochs': 200,
    'batch_size': 256,
    'learning_rate': 0.0001,
    'experiment_name': "Galaxy_CNN_Redshift_Estimation",
    'run_name': None,             # Auto-generate run name if None
    'gpu_id': 0
}

if len(gpus) > 1:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(params['gpu_id'])
    print(f"Using GPU {params['gpu_id']}")

if gpus:
    with tf.device('/GPU:0'):
        print("GPU is available and will be used for training")
else:
    print("Proceeding with CPU.")

 

GPU is available and will be used for training


### MLflow Environment Reset

- **Cleanup Flags:**  
  Three flags (`clear_tracking_store`, `clear_experiments_store`, and `clear_artifacts`) decide if existing runs, directories, or artifact files should be removed.  
  * Set them each to "True" if you wish to start a new run from scratch and remove previous runs, or set them to "False" if you intend to do multiple runs.

- **MLflow Setup:**  
  It creates the `mlruns` directory, deletes any leftover `.trash` folder, sets the tracking URI, and initializes the experiment using a provided name.  
  * **Note:** To change where MLflow stores its run data, modify the `mlruns_dir` variable (e.g., `mlruns_dir = os.path.abspath("my_custom_mlruns_dir")`).

- **Experiment Directories:**  
  It ensures directories for checkpoints and logs exist, or removes them if a full reset is requested.  
  * **Note:** To change where experiment artifacts are stored, update the `base_dir` variable (e.g., `base_dir = os.path.abspath("my_custom_experiments_dir")`).

- **Artifact Removal:**  
  Optionally, it deletes specific artifact files (e.g., logs, plots, metrics) for a clean slate.


In [11]:
# Optionally clear previous MLflow runs and experiment artifacts
clear_tracking_store = False     
clear_experiments_store = False      
clear_artifacts = False             

mlruns_dir = os.path.abspath("mlruns").replace("\\", "/")
# If clear_tracking_store is False, we keep existing data, but still remove .trash automatically.
os.makedirs(mlruns_dir, exist_ok=True)
trash_path = os.path.join(mlruns_dir, ".trash")
if os.path.exists(trash_path):
    try:
        shutil.rmtree(trash_path)
        print("Deleted .trash directory.")
    except Exception as e:
        print("Error deleting .trash directory:", e)

mlflow.set_tracking_uri(f"file:///{mlruns_dir}")
try:
    mlflow.set_experiment(params['experiment_name'])
except Exception as e:
    # If any issue occurs, attempt to remove .trash and try again
    if os.path.exists(trash_path):
        try:
            shutil.rmtree(trash_path)
            print("Force-deleted .trash directory on retry.")
        except Exception as e:
            print("Failed to delete .trash on retry:", e)
    mlflow.set_experiment(params['experiment_name'])

base_dir = os.path.abspath("experiments")
if clear_experiments_store and os.path.exists(base_dir):
    shutil.rmtree(base_dir)
    print("Deleted entire experiments directory.")
checkpoint_dir = os.path.join(base_dir, "MLCheckpoints")
log_dir = os.path.join(base_dir, "MLlogs")
os.makedirs(checkpoint_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)
print("Experiments directory is set up.")

if clear_artifacts:
    artifact_files = [
        "training_history.csv",
        "training_curves.png",
        "prediction_plot.png",
        "test_metrics.txt",
        "model_summary.txt",
        "requirements.txt"
    ]
    for fname in artifact_files:
        if os.path.exists(fname):
            os.remove(fname)
            print(f"Deleted previous {fname}")


Deleted .trash directory.


Exception: Invalid parent directory 'C:\Users\truen\MLFlow-CNN\mlruns\.trash'

# Hyperparameters and Dataset Paths

This cell extracts key training hyperparameters from the `params` dictionary and defines additional settings (like number of dense units, maximum redshift, and data format). It also stores these hyperparameters in a dictionary (`hparams`) for logging with MLflow, sets the file paths for the training, validation, and test datasets, and verifies that these dataset files exist.


In [6]:
# -------------------------
# Dataset Paths and Preprocessing Setup
# -------------------------
TRAIN_PATH = r'E:\Datasets\5x64x64_training_with_morphology.hdf5'
VAL_PATH   = r'E:\Datasets\5x64x64_validation_with_morphology.hdf5'
TEST_PATH  = r'E:\Datasets\5x64x64_testing_with_morphology.hdf5'

for path in [TRAIN_PATH, VAL_PATH, TEST_PATH]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"Dataset not found: {path}")

# Prepare model checkpoint filename
username = getpass.getuser()
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
checkpoint_filepath = os.path.join(checkpoint_dir, f"{username}_cp_{timestamp}.weights.h5")

# Define generator arguments (using original preprocessing details)
param_names = []
for band in ['g', 'r', 'i', 'z', 'y']:
    for col in ['cmodel_mag']:
        param_names.append(f"{band}_{col}")

gen_args = {
    'image_key': 'image',
    'numerical_keys': param_names,
    'y_key': 'specz_redshift',
    'scaler': True,             # Data scaling enabled
    'labels_encoding': False,   # No extra label encoding
    'batch_size': params['batch_size'],
    'shuffle': False
}

train_gen = HDF5DataGenerator(TRAIN_PATH, mode='train', **gen_args)
val_gen   = HDF5DataGenerator(VAL_PATH, mode='train', **gen_args)
test_gen  = HDF5DataGenerator(TEST_PATH, mode='test', **gen_args)


# Define the Model Architecture

This cell defines the `create_model()` function, which builds a Keras model with two input branches. One branch (CNN) processes image data, and the other (NN) handles additional numerical features. The outputs from these branches are combined and passed through a final layer to produce a single prediction.

The model is compiled using the Adam optimizer, a custom HSC loss function (instead of the standard mean squared error loss), and RMSE is tracked as a performance metric. This setup allows for a tailored approach to measuring prediction errors while making the architecture reusable throughout the notebook.


In [7]:
# -------------------------
# Model Definition: Full CNN Architecture with Custom HSC Loss
# -------------------------
def create_model():
    # Define inputs for image (CNN branch) and numerical data (NN branch)
    input_cnn = Input(shape=(5, params['image_size'], params['image_size']))
    input_nn  = Input(shape=(5,))
    
    # CNN branch with 7 convolutional layers and pooling
    conv1 = Conv2D(32, kernel_size=(3, 3), activation='tanh', padding='same', data_format='channels_first')(input_cnn)
    pool1 = MaxPooling2D(pool_size=(2,2), data_format='channels_first')(conv1)
    conv2 = Conv2D(64, kernel_size=(3, 3), activation='tanh', padding='same', data_format='channels_first')(pool1)
    pool2 = MaxPooling2D(pool_size=(2,2), data_format='channels_first')(conv2)
    conv3 = Conv2D(128, kernel_size=(3, 3), activation='tanh', padding='same', data_format='channels_first')(pool2)
    pool3 = MaxPooling2D(pool_size=(2,2), data_format='channels_first')(conv3)
    conv4 = Conv2D(256, kernel_size=(3, 3), activation='tanh', padding='same', data_format='channels_first')(pool3)
    pool4 = MaxPooling2D(pool_size=(2,2), data_format='channels_first')(conv4)
    conv5 = Conv2D(256, kernel_size=(3, 3), activation='tanh', padding='same', data_format='channels_first')(pool4)
    pool5 = MaxPooling2D(pool_size=(2,2), data_format='channels_first')(conv5)
    conv6 = Conv2D(512, kernel_size=(3, 3), activation='relu', padding='same', data_format='channels_first')(pool5)
    conv7 = Conv2D(512, kernel_size=(3, 3), activation='relu', padding='same', data_format='channels_first')(conv6)
    flatten = Flatten()(conv7)
    dense1 = Dense(512, activation='tanh')(flatten)
    dense2 = Dense(128, activation='tanh')(dense1)
    dense3 = Dense(32, activation='tanh')(dense2)
    
    # NN branch: fully connected layers processing numerical inputs
    NUM_DENSE_UNITS = 200
    hidden1 = Dense(NUM_DENSE_UNITS, activation="relu")(input_nn)
    hidden2 = Dense(NUM_DENSE_UNITS, activation="relu")(hidden1)
    hidden3 = Dense(NUM_DENSE_UNITS, activation="relu")(hidden2)
    hidden4 = Dense(NUM_DENSE_UNITS, activation="relu")(hidden3)
    hidden5 = Dense(NUM_DENSE_UNITS, activation="relu")(hidden4)
    hidden6 = Dense(NUM_DENSE_UNITS, activation="relu")(hidden5)
    
    # Concatenate the outputs from both branches and produce the final prediction
    concat = Concatenate()([dense3, hidden6])
    output = Dense(1)(concat)
    model = Model(inputs=[input_cnn, input_nn], outputs=output)
    
    # Define custom HSC loss function
    def calculate_loss(z_photo, z_spec):
        dz = z_photo - z_spec
        gamma = 0.15
        denominator = 1.0 + tf.square(dz / gamma)
        L = 1 - 1.0 / denominator
        return L
    
    model.compile(optimizer=Adam(learning_rate=params['learning_rate']),
                  loss=calculate_loss,
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model


# Define Callbacks for Logging Metrics

This cell sets up several callbacks to monitor and manage training:

- **TensorBoard Callback:**  
  Logs training data for visualization in TensorBoard, including histograms of the model's layers.

- **Model Checkpoint Callback:**  
  Saves only the model weights at the end of each epoch if there is an improvement (monitored by the loss value). This ensures the best model is saved during training.

- **Hyperparameter Callback:**  
  Logs key hyperparameters (like the number of dense units, batch size, epochs, learning rate, etc.) to help track the training setup.

- **MLflow Callback (Custom):**  
  At the end of each epoch, it logs training metrics (like loss and RMSE) to MLflow for experiment tracking.


In [8]:
# -------------------------
# Callbacks
# -------------------------
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='loss',
    mode='min',
    save_freq='epoch',
    save_best_only=True,
    verbose=True)
hparam_callback = hp.KerasCallback(log_dir, {
    'num_dense_units': 200,
    'batch_size': params['batch_size'],
    'num_epochs': params['epochs'],
    'learning_rate': params['learning_rate'],
    'z_max': 4,
    'data_format': 'channels_first'
})

class MLflowCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        for name, value in logs.items():
            mlflow.log_metric(name, value, step=epoch)

# Training Function with MLflow Logging

This cell defines the `train_model_with_mlflow()` function, which handles the entire training process and logs details using MLflow:

- **Run Setup:**  
  It sets a unique run name and logs key parameters and hyperparameters.

- **Model Training:**  
  The function creates the model, trains it using training and validation data, and uses callbacks (for TensorBoard, checkpointing, hyperparameter logging, and custom MLflow logging) during training.

- **Artifact Logging:**  
  After training, it saves the model, training history, and plots of the training loss. These artifacts are logged to MLflow.

- **Prediction and Evaluation:**  
  It generates and saves a scatter plot comparing true and predicted values, evaluates the model on test data, and logs the test metrics.

- **Final Steps:**  
  The model's summary and package requirements are saved and logged, and the function prints the MLflow Run ID to confirm completion.


In [None]:
# -------------------------
# Training Function with MLflow Logging
# -------------------------
def train_model_with_mlflow():
    run_name = params['run_name'] or f"GalaxyCNN_Size{params['image_size']}_Batch{params['batch_size']}_LR{params['learning_rate']}_Epochs{params['epochs']}_{username}"
    with mlflow.start_run(run_name=run_name):
        mlflow.set_tag("username", username)
        mlflow.log_params({**params, 'num_dense_units': 200, 'z_max': 4, 'data_format': 'channels_first'})
        mlflow.tensorflow.autolog()
        
        model = create_model()
        history = model.fit(
            train_gen,
            epochs=params['epochs'],
            validation_data=val_gen,
            callbacks=[tensorboard_callback, model_checkpoint_callback, hparam_callback, MLflowCallback()],
            verbose=1,
            shuffle=True
        )
        
        model.save(checkpoint_filepath)
        mlflow.log_artifact(checkpoint_filepath)
        
        history_df = pd.DataFrame(history.history)
        history_csv = "training_history.csv"
        history_df.to_csv(history_csv, index=False)
        mlflow.log_artifact(history_csv)
        
        plt.figure(figsize=(8, 4))
        plt.plot(history_df.index, history_df['loss'], label='Training Loss')
        if 'val_loss' in history_df.columns:
            plt.plot(history_df.index, history_df['val_loss'], label='Validation Loss')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('Training Curves')
        plt.legend()
        plt.tight_layout()
        training_curves_path = "training_curves.png"
        plt.savefig(training_curves_path)
        mlflow.log_artifact(training_curves_path)
        plt.close()
        
        # ---- Prediction Plot: Visualization with Colormap ----
        predictions = model.predict(test_gen)
        predictions = predictions.squeeze()
        with h5py.File(TEST_PATH, 'r') as f:
            test_labels = np.asarray(f['specz_redshift'][:])
        test_labels = test_labels.squeeze()
        print("Test labels shape:", test_labels.shape)
        print("Predictions shape:", predictions.shape)

        plt.figure(figsize=(6, 6))
        # Color points by the predicted value using the 'viridis' colormap
        sc = plt.scatter(test_labels, predictions, c=predictions, cmap='viridis', alpha=0.7, edgecolors='w', s=50)
        plt.plot([test_labels.min(), test_labels.max()], [test_labels.min(), test_labels.max()], 'r--', lw=2)
        plt.xlabel("True Redshift")
        plt.ylabel("Predicted Redshift")
        plt.title("Prediction Scatter Plot")
        plt.colorbar(sc, label="Predicted Value")
        plt.tight_layout()
        prediction_plot_path = "prediction_plot.png"
        plt.savefig(prediction_plot_path)
        mlflow.log_artifact(prediction_plot_path)
        plt.close()
        # -----------------------------------------------------------
                        
        test_loss, test_rmse = model.evaluate(test_gen, verbose=1)
        mlflow.log_metric("test_loss", test_loss)
        mlflow.log_metric("test_rmse", test_rmse)
        with open("test_metrics.txt", "w") as f:
            f.write(f"Test Loss: {test_loss}\nTest RMSE: {test_rmse}\n")
        mlflow.log_artifact("test_metrics.txt")
        
        mlflow.keras.log_model(model, "model")
        
        model_summary_lines = []
        model.summary(print_fn=lambda line: model_summary_lines.append(line))
        summary_path = "model_summary.txt"
        with open(summary_path, "w") as f:
            f.write("\n".join(model_summary_lines))
        mlflow.log_artifact(summary_path)
        
        import subprocess
        subprocess.run("pip freeze > requirements.txt", shell=True)
        mlflow.log_artifact("requirements.txt")
        
        print(f"Training complete. MLflow Run ID: {mlflow.active_run().info.run_id}")


# Run Training

This final cell calls the `train_model_with_mlflow()` function, which starts the training process, logs experiment details with MLflow, saves model checkpoints, and evaluates the model on the test set.


In [10]:
# Simply run this cell to start training the model, log metrics, and save artifacts via MLflow.

train_model_with_mlflow()





Epoch 1/5
Epoch 1: loss improved from inf to 0.43571, saving model to c:\Users\truen\MLFlow-CNN\experiments\MLCheckpoints\truen_cp_2025-04-06_18-43-39.weights.h5
Epoch 2/5
Epoch 2: loss improved from 0.43571 to 0.29747, saving model to c:\Users\truen\MLFlow-CNN\experiments\MLCheckpoints\truen_cp_2025-04-06_18-43-39.weights.h5
Epoch 3/5
Epoch 3: loss improved from 0.29747 to 0.27213, saving model to c:\Users\truen\MLFlow-CNN\experiments\MLCheckpoints\truen_cp_2025-04-06_18-43-39.weights.h5
Epoch 4/5
Epoch 4: loss improved from 0.27213 to 0.23782, saving model to c:\Users\truen\MLFlow-CNN\experiments\MLCheckpoints\truen_cp_2025-04-06_18-43-39.weights.h5
Epoch 5/5
Epoch 5: loss improved from 0.23782 to 0.21573, saving model to c:\Users\truen\MLFlow-CNN\experiments\MLCheckpoints\truen_cp_2025-04-06_18-43-39.weights.h5




INFO:tensorflow:Assets written to: C:\Users\truen\AppData\Local\Temp\tmpufbfche_\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\truen\AppData\Local\Temp\tmpufbfche_\model\data\model\assets


Test labels shape: (40914,)
Predictions shape: (40914,)




INFO:tensorflow:Assets written to: C:\Users\truen\AppData\Local\Temp\tmpoys3p6vn\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\truen\AppData\Local\Temp\tmpoys3p6vn\model\data\model\assets


Training complete. MLflow Run ID: 89364fa1bfc94d31ae95373dc76dbade
