In [1]:
import json
import numpy as np
from metaspace.sm_annotation_utils import SMInstance
import os

# 📁 Directory Setup for Running R-GCN

Before running the R-GCN model, we must initialize a standardized folder structure to store raw data, preprocessing results, model outputs, and configuration files.

The following script creates the `MSI/`, `parameters/`, `mass_diff/`, and `Annot_table/` directories (it does not create `models/`) and moves `mass_diff.csv` from `./data/` into `mass_diff/`.

## ✅ Directory Structure Overview

Let’s assume the root project path is `./DS/` (or an absolute path such as `/media/USB2/Massconvnet_data/`). Here's how the structure is organized:

```
DS/
├── MSI/
│   ├── raw_data/
│   │   └── <MSI_dataset_name>/
│   │       ├── <file>.imzML
│   │       └── <file>.ibd
│   └── centroid_data/
│       └── <preprocessing_param_name>/
│           └── <MSI_dataset_name>/
│               ├── spec_<pixel_id>.npy
│               └── graph_<pixel_id>.npy
│
├── parameters/
│   ├── pre_processing/
│   │   └── <preprocessing_param>.json
│   └── network/
│       └── <network_param>.json
│
├── mass_diff/
│   └── mass_diff.csv
│
├── Annot_table/
│
└── models/
    ├── GCN/
    │   └── model.pth.tar
    └── CAM_output/
        └── <MSI_dataset_name>/
            ├── CAM_<spectrum_id>.pt
            ├── OUT_<spectrum_id>.pt
            └── Ion_embedding_<spectrum_id>.pt
```

## 🧾 Explanation of Each Folder

| Path | Description |
|------|-------------|
| `MSI/raw_data/` | Contains subdirectories for each MSI dataset, with `.imzML` and `.ibd` files. |
| `MSI/centroid_data/` | Stores preprocessed spectra and graph files for each pixel. Organized by the name of the preprocessing JSON config used. |
| `parameters/pre_processing/` | Holds `.json` files defining how raw MSI data should be processed (e.g., centroiding, mass filters). |
| `parameters/network/` | Contains `.json` files defining R-GCN network parameters, such as number of layers, edge types, and training options. |
| `mass_diff/` | Includes the `mass_diff.csv` file listing known mass differences used to build relational graphs. |
| `models/GCN/` | Saves trained models as `model.pth.tar`, named using a combination of preprocessing and network parameter file names. |
| `models/CAM_output/` | Stores the outputs of the CAM analysis: class activation scores, node embeddings, and predictions for each spectrum. |
| `Annot_table/` | (Optional) A directory to store any metadata, label annotations, or reference tables associated with MSI datasets. |

> ✅ You can now run the cell below to automatically create this structure and move your `mass_diff.csv` file into place.


In [9]:
import os
import shutil

# Base path for the DS structure
base_path = "./DS"


# Directories to create
directories = [
    "MSI",
    "MSI/raw_data",
    "MSI/centroid_data",
    "parameters",
    "parameters/pre_processing",
    "parameters/network",
    "mass_diff",
    "Annot_table"
]

# Create each directory
for dir_name in directories:
    full_path = os.path.join(base_path, dir_name)
    os.makedirs(full_path, exist_ok=True)
    print(f"Created directory: {full_path}")

# Move the mass_diff.csv file
source_file = "./data/mass_diff.csv"
destination_file = os.path.join(base_path, "mass_diff", "mass_diff.csv")

if os.path.exists(source_file):
    shutil.move(source_file, destination_file)
    print(f"Moved mass_diff.csv to: {destination_file}")
else:
    print(f"⚠️ File not found: {source_file}. Please make sure it exists before running this script.")


Created directory: ./DS/MSI
Created directory: ./DS/MSI/raw_data
Created directory: ./DS/MSI/centroid_data
Created directory: ./DS/parameters
Created directory: ./DS/parameters/pre_processing
Created directory: ./DS/parameters/network
Created directory: ./DS/mass_diff
Created directory: ./DS/Annot_table
⚠️ File not found: ./data/mass_diff.csv. Please make sure it exists before running this script.

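As a quick sanity check after the cell above, the following sketch lists any expected subdirectories that are still missing. The helper name `missing_directories` is hypothetical, and the expected list simply mirrors the leaf entries of the `directories` list used above; adapt both if your layout differs.

```python
from pathlib import Path

# Leaf subdirectories created by the setup cell (mirrors its `directories` list).
EXPECTED_DIRS = [
    "MSI/raw_data",
    "MSI/centroid_data",
    "parameters/pre_processing",
    "parameters/network",
    "mass_diff",
    "Annot_table",
]

def missing_directories(base_path, expected=EXPECTED_DIRS):
    """Return the expected subdirectories that do not exist under base_path."""
    base = Path(base_path)
    return [d for d in expected if not (base / d).is_dir()]
```

Calling `missing_directories("./DS")` should return an empty list once the setup cell has run successfully.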


## Creating the Annot_table CSV

After setting up the directory structure, the next step is to generate an `Annot_table` CSV file for your experiment. This table is essential for linking your MSI data with labels, spatial coordinates, and data usage flags during model training and testing.

The table must include **at least** the following 7 columns:

| Column Name         | Description                                                                 |
|---------------------|-----------------------------------------------------------------------------|
| `MSI name`          | Name of the MSI dataset directory located in `MSI/raw_data/` (string).     |
| `MSI pixel id`      | Unique identifier for the centroided pixel/spectrum (int).                 |
| `Annotations`       | Class label assigned to the spectrum for classification (int).            |
| `Crd X` and `Crd Y` | X and Y coordinates of the pixel in 2D space (int).                        |
| `origianl MSI pixel id` | ID of the pixel in the original imzML image (int).                    |
| `train`             | Indicates whether the spectrum is used for training (`True`) or testing (`False`) (bool). |

Each row corresponds to a spectrum from a specific pixel in an MSI file located under `DS/MSI/raw_data/`.

### Purpose of the Annot_table
- It guides the script `Centroid_from_imzml.py` to identify which pixels to process from the raw data and convert into centroided spectra and graphs.
- It also provides class labels and training/testing flags for the R-GCN model to learn from.

📝 A sample `Annot_table` used in our background/foreground classification experiments is provided in **Supplementary S2**.

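A minimal sketch of how such a table could be written with Python's standard `csv` module. The column names are taken from the table above (including the `origianl MSI pixel id` spelling); the dataset name, coordinates, and labels are made-up placeholder values, and `write_annot_table` is a hypothetical helper, not part of the pipeline.

```python
import csv
import os

# Column names as listed in the table above.
COLUMNS = [
    "MSI name", "MSI pixel id", "Annotations",
    "Crd X", "Crd Y", "origianl MSI pixel id", "train",
]

def write_annot_table(rows, path):
    """Write annotation rows (dicts keyed by COLUMNS) to a CSV file."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)

# Placeholder rows for illustration only.
example_rows = [
    {"MSI name": "example_dataset", "MSI pixel id": 0, "Annotations": 0,
     "Crd X": 10, "Crd Y": 4, "origianl MSI pixel id": 130, "train": True},
    {"MSI name": "example_dataset", "MSI pixel id": 1, "Annotations": 1,
     "Crd X": 11, "Crd Y": 4, "origianl MSI pixel id": 131, "train": False},
]
```

Calling `write_annot_table(example_rows, "./DS/Annot_table/Annot_table.csv")` would write the file into the `Annot_table/` directory created earlier.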


## Downloading MSI Datasets from METASPACE

The following Python script downloads imzML data directly from the [METASPACE](https://metaspace2020.eu/) platform using the METASPACE Python client (`metaspace2020`).

Before running the script, ensure you have installed the required METASPACE Python client:
```bash
pip install metaspace2020
```

In [None]:
from metaspace import SMInstance
import os

# Replace with your actual base path
base_path = "./DS/"
folder_msi = os.path.join(base_path, "MSI", "raw_data")

# List of dataset IDs on METASPACE
metaspace_im = [
    "2020-12-09_02h41m05s",
    "2021-08-16_23h19m04s",
    "2020-01-17_17h36m25s",
    "2019-12-16_15h22m13s"
]

# Initialize connection to METASPACE
sm = SMInstance()

# Download each dataset
for dsid in metaspace_im:
    ds = sm.dataset(id=dsid)
    dest_path = os.path.join(folder_msi, ds.name)
    os.makedirs(dest_path, exist_ok=True)
    ds.download_to_dir(dest_path)


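Each dataset folder should end up with both an `.imzML` and an `.ibd` file, as shown in the directory tree earlier. The following sketch, with a hypothetical helper `incomplete_datasets`, reports folders where one of the two is missing. It only inspects file extensions and does not validate file contents.

```python
import os

def incomplete_datasets(raw_data_dir):
    """Map each dataset folder to the file extensions it is missing."""
    required = {".imzml", ".ibd"}
    problems = {}
    for name in sorted(os.listdir(raw_data_dir)):
        folder = os.path.join(raw_data_dir, name)
        if not os.path.isdir(folder):
            continue
        # Compare extensions case-insensitively (.imzML vs .imzml).
        found = {os.path.splitext(f)[1].lower() for f in os.listdir(folder)}
        missing = sorted(required - found)
        if missing:
            problems[name] = missing
    return problems
```

`incomplete_datasets("./DS/MSI/raw_data")` returns an empty dictionary when every dataset folder contains both files.
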
## Creating the Preprocessing Parameter JSON File

To generate centroided spectra and relational graphs from raw MSI data, we must define a set of preprocessing parameters. These parameters are saved as a `.json` file and used as input to the R-GCN pipeline.

Below is a script that creates the preprocessing parameter file `param_BGFG1.json` and saves it under `DS/parameters/pre_processing/`.

This file is essential for defining:
- The number of peaks to retain per spectrum
- Mass difference values used to construct graph edges
- m/z tolerance for edge generation
- Mass range to consider
- Input MSI directory
- Output path for centroided data
- Associated annotation table

In [5]:
import os
import json
import numpy as np

# Define base folder path
folder_path = "./DS/"

# Load mass difference values from CSV
mass_diff_path = os.path.join(folder_path, "mass_diff", "mass_diff.csv")
mass_diff = np.genfromtxt(mass_diff_path).tolist()

# Define preprocessing parameters
param_dict = {
    "max_peaks": 2000,
    "mass_diff": mass_diff,
    "tolerance": 0.001,
    "mass range": [200, 1400],
    "msi_dir": os.path.join(folder_path, "MSI", "raw_data") + "/",
    "output_dir": os.path.join(folder_path, "MSI", "centroid_data") + "/",
    "file_type": "imzML",
    "annot_table": os.path.join(folder_path, "Annot_table", "Annot_table.csv")
}

# Save the dictionary as a JSON file
output_path = os.path.join(folder_path, "parameters", "pre_processing", "param_BGFG1.json")
os.makedirs(os.path.dirname(output_path), exist_ok=True)

with open(output_path, "w") as f:
    json.dump(param_dict, f, indent=4)

print(f"✅ Preprocessing parameter file saved to: {output_path}")

    

✅ Preprocessing parameter file saved to: ./DS/parameters/pre_processing/param_BGFG1.json


## Running the Preprocessing Script: Generating Spectra and Graphs

Once the annotation table and preprocessing parameters are set, we can generate centroided spectra and their corresponding graph structures using the script `Centroid_from_imzml.py`. This script parses imzML files, applies mass filtering, builds a graph based on mass differences, and saves one `.npy` spectrum file and one `.npy` graph file per pixel.

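One plausible reading of "builds a graph based on mass differences" is sketched below: two peaks are connected when their m/z gap matches one of the listed mass differences within the tolerance, and the index of the matched difference serves as the edge (relation) type. This illustrates the general idea only. The actual logic in `Centroid_from_imzml.py` may differ, and `build_edges` and the sample values are hypothetical.

```python
from itertools import combinations

def build_edges(mz_values, mass_diffs, tolerance):
    """Return (i, j, k) triples where peaks i and j differ by mass_diffs[k]
    to within `tolerance` (all values in m/z units)."""
    edges = []
    for i, j in combinations(range(len(mz_values)), 2):
        gap = abs(mz_values[j] - mz_values[i])
        for k, diff in enumerate(mass_diffs):
            if abs(gap - diff) <= tolerance:
                edges.append((i, j, k))
    return edges

# Illustrative values only: four peaks and two mass differences
# (roughly H2 and O).
mz = [200.0000, 202.0157, 218.0106, 300.5000]
mass_diffs = [2.01565, 15.99491]

edges = build_edges(mz, mass_diffs, tolerance=0.001)
print(edges)  # → [(0, 1, 0), (1, 2, 1)]
```

The pairwise loop is quadratic in the number of peaks, which is acceptable for a sketch but would likely need a sorted or vectorized search for spectra with thousands of peaks.
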
### Script Overview

- Script: `Centroid_from_imzml.py`
- Located in: `MSI_preprocessing/`
- Input: Preprocessing parameter JSON file
- Output:
  - Spectrum and graph `.npy` files per pixel, stored under:
    ```
    DS/MSI/centroid_data/{param_name}/{MSI_name}/
    ```

### Python Launcher Script

Below is an example script to run `Centroid_from_imzml.py` in the background using `nohup`:


In [None]:
import subprocess
import os

# Define base paths
github_dir = "./Massconvnet/"
folder_path = "./DS/"

# Construct script and param file paths
script_path = os.path.join(github_dir, "MSI_preprocessing", "Centroid_from_imzml.py")
param_file = os.path.join(folder_path, "parameters", "pre_processing", "param_BGFG1.json")

# Run the script as a background process using nohup
command = f"nohup python {script_path} -i {param_file} &"
subprocess.run(command, shell=True)

print("✅ Launched Centroid_from_imzml.py as background process with nohup.")

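An alternative sketch that avoids building a shell string: it passes the arguments as a list, so paths containing spaces need no quoting, and it redirects output to a log file. The `-i` flag is the one used in the cell above. The helper names and the log file name `centroid.log` are hypothetical choices, and `sys.executable` is used so the script runs with the same interpreter as the notebook. `start_new_session=True` is POSIX-only and detaches the child from the notebook's session, similar in spirit to `nohup`.

```python
import subprocess
import sys

def build_command(script_path, param_file):
    """Argument list for Centroid_from_imzml.py (same -i flag as above)."""
    return [sys.executable, script_path, "-i", param_file]

def launch_in_background(script_path, param_file, log_path="centroid.log"):
    """Start the script detached from this session, logging to log_path."""
    log = open(log_path, "a")
    process = subprocess.Popen(
        build_command(script_path, param_file),
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # POSIX only
    )
    log.close()  # the child keeps its own copy of the file handle
    return process
```

The returned `Popen` object exposes `process.pid` and `process.poll()`, which can be used to check whether preprocessing is still running.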

## Creating the Network Parameter JSON File

Once the spectra and graphs have been generated, we need to define the parameters that the R-GCN model will use to load and interpret the processed data. This configuration is stored in a JSON file located in the `DS/parameters/network/` directory.

### Purpose of This File

This JSON file provides the R-GCN model with:
- Data loading instructions (e.g., whether to use only training or test data)
- Information about data augmentation strategies
- Annotation and training set identifiers
- Optional signal degradation settings for robustness testing

These parameters control how the model processes the dataset before training or inference.

### Key Parameters Explained

| Parameter                     | Description                                                                                   |
|------------------------------|-----------------------------------------------------------------------------------------------|
| `signal degradation`         | If `True`, applies synthetic noise/distortions to simulate signal degradation.               |
| `only test` / `only train`   | If `True`, loads only test or only training samples, respectively.                           |
| `intensity limitation param` | Scales intensity values; `1` keeps them unchanged.                                            |
| `mass shift param`           | Shifts m/z values to simulate calibration error.                                              |
| `spectral resolution param`  | Reduces spectral resolution to simulate lower-resolution data.                                |
| `random peaks removal param` | If `1`, randomly removes a subset of peaks; if `0`, all peaks are preserved.                  |
| `edge index to remove`       | Manually removes specific edge types in the graph (for ablation or debugging).               |
| `Data augmentation`          | Enables (`1`) or disables (`0`) data augmentation during training.                           |
| `Annotation name`            | Column name in the annotation CSV that contains class labels (e.g., `"Annotations"`).        |
| `training samples`           | Column name in the annotation CSV that flags training samples (e.g., `"train"`).             |
| `kfold seed`                 | Random seed for k-fold cross-validation (if used).                                           |
| `kfold K`                    | Number of folds in k-fold validation; use `0` to disable cross-validation.                   |



### Python Script to Create the File

Below is a Python script that generates this network parameter file:



In [None]:
import os
import json

# Define base folder
folder_path = "./DS/"

# Build dictionary with loader and model behavior parameters
param_dict = {
    "signal degradation": False,
    "only test": False,
    "only train": False,
    "intensity limitation param": 1,
    "mass shift param": 0,
    "spectral resolution param": 0,
    "random peaks removal param": 1,
    "edge index to remove": None,
    "Data augmentation": 0,
    "Annotation name": "Annotations",
    "training samples": "train",
    "kfold seed": 1,
    "kfold K": 0
}

# Save JSON to file
output_path = os.path.join(folder_path, "parameters", "network", "param_1.json")
os.makedirs(os.path.dirname(output_path), exist_ok=True)

with open(output_path, "w") as f:
    json.dump(param_dict, f, indent=4)

print(f"✅ Network parameter file saved to: {output_path}")

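Before training, it can help to confirm that the saved file contains every key listed in the table above. A minimal sketch: `missing_network_keys` is a hypothetical helper, and the key set is copied from the parameter dictionary above, so it may not match exactly what the model code requires.

```python
import json

# Keys copied from the network parameter dictionary above.
EXPECTED_KEYS = {
    "signal degradation", "only test", "only train",
    "intensity limitation param", "mass shift param",
    "spectral resolution param", "random peaks removal param",
    "edge index to remove", "Data augmentation",
    "Annotation name", "training samples", "kfold seed", "kfold K",
}

def missing_network_keys(json_path):
    """Return the expected keys absent from the JSON file, sorted."""
    with open(json_path) as f:
        params = json.load(f)
    return sorted(EXPECTED_KEYS - set(params))
```

For example, `missing_network_keys("./DS/parameters/network/param_1.json")` returns an empty list when the file is complete.
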
## Running the R-GCN Model

Once the preprocessing and parameter JSON files are in place, you can train the R-GCN model and generate Class Activation Maps (CAM) using the following commands.

### 1. Train the R-GCN Model

Run the model training using:

```bash
python ./src/main.py \
  --dataset_path="./DS/" \
  --pre_process_param_name=param_BGFG1 \
  --network_param_name=param_1 \
  --with_intensity \
  --with_masses \
  --normalize \
  --max_epochs 40 \
  --random_state 5
```

To run CAM on a pretrained model, use:


```bash
python ./src/main.py \
  --dataset_path="./DS/" \
  --pre_process_param_name=param_BGFG1 \
  --network_param_name=param_1 \
  --with_intensity \
  --with_masses \
  --normalize \
  --cam_only \
  --batch_size 1
```

To rename or reorganize CAM outputs:

```bash
mv ./DS/models/CAM_output/ \
   ./DS/models/meanspectra_BGFG1_trainint5_fullnet_pos/
```