# Data processing of CHX yeast cells for Leopard-EM manuscript

This notebook walks through the methods used to generate the 2DTM data from CHX-treated yeast cells published in the 2025 Leopard-EM manuscript.

## Running Match Template
Match template jobs were run on the 60S and 40S subunits.
The PDB models used are the same as in the constrained search tutorial and can be found at [10.5281/zenodo.15368246](https://doi.org/10.5281/zenodo.15368246).
These models of the SSU body and LSU were centred and aligned with respect to each other as described in that tutorial.

### Simulating the maps
The pixel size was optimized according to the procedure described in the constrained search tutorial, which gave a pixel size of 1.059 Å.
This was used to simulate maps of both the SSU body and the LSU.
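
As a quick sanity check on these settings, the sketch below works out the physical extent of the simulated box and the Nyquist limit implied by the pixel size. The helper names are illustrative and not part of ttsim3d; the 512-voxel box edge is taken from the simulation cells in this notebook.

```python
def box_extent_angstrom(n_voxels: int, pixel_size: float) -> float:
    """Physical edge length of a cubic box, in Angstroms."""
    return n_voxels * pixel_size


def nyquist_resolution_angstrom(pixel_size: float) -> float:
    """Finest resolvable spacing (Nyquist limit), in Angstroms."""
    return 2.0 * pixel_size


PIXEL_SIZE = 1.059  # Angstroms, the optimized value reported in the text
BOX_EDGE = 512  # voxels, as in volume_shape=(512, 512, 512)

print(f"Box edge: {box_extent_angstrom(BOX_EDGE, PIXEL_SIZE):.1f} A")
print(f"Nyquist limit: {nyquist_resolution_angstrom(PIXEL_SIZE):.3f} A")
```

With these values the box spans roughly 542 Å per edge and the Nyquist limit is 2.118 Å.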


In [1]:
from ttsim3d.models import Simulator, SimulatorConfig

# Instantiate the configuration object
sim_conf = SimulatorConfig(
    voltage=300.0,  # in kV
    apply_dose_weighting=True,
    dose_start=0.0,  # in e-/A^2
    dose_end=50.0,  # in e-/A^2
    dose_filter_modify_signal="rel_diff",
    upsampling=-1,  # auto
    mtf_reference="k3_300kV_FL2",
)

# Instantiate the simulator
sim = Simulator(
    pdb_filepath="../models/60S_aligned_aligned_zero.pdb",
    pixel_spacing=1.059,  # Angstroms
    volume_shape=(512, 512, 512),
    center_atoms=False,
    remove_hydrogens=True,
    b_factor_scaling=0.5,
    additional_b_factor=0,
    simulator_config=sim_conf,
)

# Run the simulation
volume = sim.run()
print("Volume generated")

mrc_filepath = "../maps/60S_map_px1.059_bscale0.5.mrc"
sim.export_to_mrc(mrc_filepath)

Volume generated


In [2]:
from ttsim3d.models import Simulator, SimulatorConfig

# Instantiate the configuration object
sim_conf = SimulatorConfig(
    voltage=300.0,  # in kV
    apply_dose_weighting=True,
    dose_start=0.0,  # in e-/A^2
    dose_end=50.0,  # in e-/A^2
    dose_filter_modify_signal="rel_diff",
    upsampling=-1,  # auto
    mtf_reference="k3_300kV_FL2",
)

# Instantiate the simulator
sim = Simulator(
    pdb_filepath="../models/6q8y_SSU_no_head_aligned_aligned_zero.pdb",
    pixel_spacing=1.059,  # Angstroms
    volume_shape=(512, 512, 512),
    center_atoms=False,
    remove_hydrogens=True,
    b_factor_scaling=0.5,
    additional_b_factor=0,
    simulator_config=sim_conf,
)

# Run the simulation
volume = sim.run()
print("Volume generated")

mrc_filepath = "../maps/SSU-body_map_px1.059_bscale0.5.mrc"
sim.export_to_mrc(mrc_filepath)

Volume generated


### Run match template on both LSU and SSU

We ran match template over all micrographs on our cluster using the following configurations and run scripts.

In [1]:
from IPython.display import Markdown, display

# Read the YAML file
with open("match_template_config_60S_base_H100.yaml") as file:
    yaml_content = file.read()

# Display as markdown code block
display(Markdown(f"```yaml\n{yaml_content}\n```"))

```yaml

computational_config:
  gpu_ids:
  - 0
  - 1
  - 2
  - 3
  - 4
  - 5
  - 6
  - 7
  num_cpus: 16
defocus_search_config:
  defocus_max: 1200.0
  defocus_min: -1200.0
  defocus_step: 200.0
  enabled: true
match_template_result:
  allow_file_overwrite: true
  correlation_average_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_60S/output_correlation_average.mrc
  correlation_variance_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_60S/output_correlation_variance.mrc
  mip_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_60S/output_mip.mrc
  orientation_phi_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_60S/output_orientation_phi.mrc
  orientation_psi_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_60S/output_orientation_psi.mrc
  orientation_theta_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_60S/output_orientation_theta.mrc
  relative_defocus_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_60S/output_relative_defocus.mrc
  scaled_mip_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_60S/output_scaled_mip.mrc
micrograph_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/all_mgraphs/
optics_group:
  label: micrograph_1
  amplitude_contrast_ratio: 0.07
  ctf_B_factor: 150.0
  astigmatism_angle: 39.417260
  defocus_u: 5978.758301
  defocus_v: 5617.462402
  phase_shift: 0.0
  pixel_size: 1.059
  spherical_aberration: 2.7
  voltage: 300.0
orientation_search_config:
  psi_step: 1.5
  base_grid_method: uniform
  theta_step: 2.5
preprocessing_filters:
  bandpass_filter:
    enabled: false
    falloff: 0.05
    high_freq_cutoff: 0.5
    low_freq_cutoff: 0.0
  whitening_filter:
    enabled: true
    do_power_spectrum: true
    max_freq: 1.0
    num_freq_bins: null
template_volume_path: /global/scratch/users/jdickerson/2dtm_test_data/maps/60S_map_px1.059_bscale0.5_k3dqe.mrc
  

```

In [2]:
from IPython.display import Markdown, display

# Read the YAML file
with open("match_template_config_40S_base_H100.yaml") as file:
    yaml_content = file.read()

# Display as markdown code block
display(Markdown(f"```yaml\n{yaml_content}\n```"))

```yaml

computational_config:
  gpu_ids:
  - 0
  - 1
  - 2
  - 3
  - 4
  - 5
  - 6
  - 7
  num_cpus: 16
defocus_search_config:
  defocus_max: 1200.0
  defocus_min: -1200.0
  defocus_step: 200.0
  enabled: true
match_template_result:
  allow_file_overwrite: true
  correlation_average_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_40S-body/output_correlation_average.mrc
  correlation_variance_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_40S-body/output_correlation_variance.mrc
  mip_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_40S-body/output_mip.mrc
  orientation_phi_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_40S-body/output_orientation_phi.mrc
  orientation_psi_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_40S-body/output_orientation_psi.mrc
  orientation_theta_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_40S-body/output_orientation_theta.mrc
  relative_defocus_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_40S-body/output_relative_defocus.mrc
  scaled_mip_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/results_match_tm_40S-body/output_scaled_mip.mrc
micrograph_path: /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/all_mgraphs/
optics_group:
  label: micrograph_1
  amplitude_contrast_ratio: 0.07
  ctf_B_factor: 150.0
  astigmatism_angle: 39.417260
  defocus_u: 5978.758301
  defocus_v: 5617.462402
  phase_shift: 0.0
  pixel_size: 0.936
  spherical_aberration: 2.7
  voltage: 300.0
orientation_search_config:
  psi_step: 1.5
  base_grid_method: uniform
  theta_step: 2.5
preprocessing_filters:
  bandpass_filter:
    enabled: false
    falloff: 0.05
    high_freq_cutoff: 0.5
    low_freq_cutoff: 0.0
  whitening_filter:
    enabled: true
    do_power_spectrum: true
    max_freq: 1.0
    num_freq_bins: null
template_volume_path: /global/scratch/users/jdickerson/2dtm_test_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc
  

```
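
Because the two configurations are nearly identical, a field-by-field comparison is a convenient way to confirm which settings differ between the 60S and 40S runs. The sketch below does this for abbreviated copies of the configurations shown above; `flatten` and `diff_configs` are hypothetical helpers written for this illustration, and in practice the dictionaries would come from `yaml.safe_load` on the two files.

```python
def flatten(config: dict, prefix: str = "") -> dict:
    """Flatten a nested dict into {"section.key": value} form."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    flat_a, flat_b = flatten(a), flatten(b)
    keys = sorted(set(flat_a) | set(flat_b))
    return {
        k: (flat_a.get(k), flat_b.get(k))
        for k in keys
        if flat_a.get(k) != flat_b.get(k)
    }


# Abbreviated copies of the two configurations (paths omitted for brevity).
config_60s = {
    "defocus_search_config": {
        "defocus_min": -1200.0,
        "defocus_max": 1200.0,
        "defocus_step": 200.0,
    },
    "optics_group": {"pixel_size": 1.059, "voltage": 300.0},
    "orientation_search_config": {"psi_step": 1.5, "theta_step": 2.5},
}
config_40s = {
    "defocus_search_config": {
        "defocus_min": -1200.0,
        "defocus_max": 1200.0,
        "defocus_step": 200.0,
    },
    "optics_group": {"pixel_size": 0.936, "voltage": 300.0},
    "orientation_search_config": {"psi_step": 1.5, "theta_step": 2.5},
}

for key, (v60, v40) in diff_configs(config_60s, config_40s).items():
    print(f"{key}: 60S={v60!r}  40S={v40!r}")
```

Among the fields included here, the only difference reported is `optics_group.pixel_size` (1.059 versus 0.936); the output and template paths, which are omitted from the sketch, also differ between the two files.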

In [None]:
#!/bin/bash
#SBATCH --job-name=match_template
#SBATCH --account=pc_lucaslab
#SBATCH --partition=es1
#SBATCH --qos=es_normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=/global/scratch/users/jdickerson/2dtm_test_data/CHX_data/logs/match_template_%A_%a.out
#SBATCH --error=/global/scratch/users/jdickerson/2dtm_test_data/CHX_data/logs/match_template_%A_%a.err
#SBATCH --time=72:00:00
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:H100:8

mkdir -p /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/logs
# Load any necessary modules (adjust for your system)
# Print current shell and environment before activation
echo "=== ENVIRONMENT BEFORE ACTIVATION ==="
echo "Current shell: $SHELL"
echo "Current conda environments:"
conda env list
echo "Current Python: $(which python)"
echo "Current Python version: $(python --version 2>&1)"
echo "======================================"

# Activate leopard-em conda environment 
echo "=== ACTIVATING CONDA ENVIRONMENT ==="
source $(conda info --base)/etc/profile.d/conda.sh
conda activate leopard-em
ACTIVATION_STATUS=$?

# Check if activation succeeded
if [ $ACTIVATION_STATUS -ne 0 ]; then
    echo "ERROR: Failed to activate the leopard-em environment"
    echo "Available environments:"
    conda env list
    exit 1
fi

# Print environment details after activation
echo "=== ENVIRONMENT AFTER ACTIVATION ==="
echo "Active conda environment: $CONDA_PREFIX"
echo "Python interpreter: $(which python)"
echo "Python version: $(python --version 2>&1)"
echo "Conda packages in environment:"
conda list | grep -E 'program|leopard'
echo "======================================"

# Test if the programs module is importable
echo "=== TESTING PYTHON IMPORTS ==="
python -c "
try:
    import programs.match_template
    print('SUCCESS: programs.match_template module found')
except ImportError as e:
    print(f'ERROR: Unable to import programs.match_template: {e}')
    import sys
    print(f'Python path: {sys.path}')
"
echo "======================================"


# Set up paths (update these paths for your project)
PROJECT_DIR="/global/scratch/users/jdickerson/2dtm_test_data/CHX_data"
MICROGRAPHS_DIR="${PROJECT_DIR}/all_mgraphs"
OUTPUT_DIR="${PROJECT_DIR}/results_match_tm_60S_2"
CTFS_DIR="${PROJECT_DIR}/all_ctfs/defocus_list.txt"
TEMPLATE_YAML="${PROJECT_DIR}/match_template_config_60S_base_H100.yaml"
SCRIPT_PATH="${PROJECT_DIR}/process_all_ctf-list.py"

# Create results directory if it doesn't exist
mkdir -p ${OUTPUT_DIR}

# Run the processing script
python ${SCRIPT_PATH} \
  --micrographs-dir "${MICROGRAPHS_DIR}" \
  --template-yaml "${TEMPLATE_YAML}" \
  --defocus-list "${CTFS_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --gpus "0,1,2,3,4,5,6,7" \
  --batch-size 8 \
  --pattern "*.mrc"

echo "All micrographs processed"


In [None]:
#!/bin/bash
#SBATCH --job-name=match_template
#SBATCH --account=pc_lucaslab
#SBATCH --partition=es1
#SBATCH --qos=es_normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=/global/scratch/users/jdickerson/2dtm_test_data/CHX_data/logs/match_template_40S_%A_%a.out
#SBATCH --error=/global/scratch/users/jdickerson/2dtm_test_data/CHX_data/logs/match_template_40S_%A_%a.err
#SBATCH --time=72:00:00
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:H100:8

mkdir -p /global/scratch/users/jdickerson/2dtm_test_data/CHX_data/logs
# Load any necessary modules (adjust for your system)
# Print current shell and environment before activation
echo "=== ENVIRONMENT BEFORE ACTIVATION ==="
echo "Current shell: $SHELL"
echo "Current conda environments:"
conda env list
echo "Current Python: $(which python)"
echo "Current Python version: $(python --version 2>&1)"
echo "======================================"

# Activate leopard-em conda environment 
echo "=== ACTIVATING CONDA ENVIRONMENT ==="
source $(conda info --base)/etc/profile.d/conda.sh
conda activate leopard-em
ACTIVATION_STATUS=$?

# Check if activation succeeded
if [ $ACTIVATION_STATUS -ne 0 ]; then
    echo "ERROR: Failed to activate the leopard-em environment"
    echo "Available environments:"
    conda env list
    exit 1
fi

# Print environment details after activation
echo "=== ENVIRONMENT AFTER ACTIVATION ==="
echo "Active conda environment: $CONDA_PREFIX"
echo "Python interpreter: $(which python)"
echo "Python version: $(python --version 2>&1)"
echo "Conda packages in environment:"
conda list | grep -E 'program|leopard'
echo "======================================"

# Test if the programs module is importable
echo "=== TESTING PYTHON IMPORTS ==="
python -c "
try:
    import programs.match_template
    print('SUCCESS: programs.match_template module found')
except ImportError as e:
    print(f'ERROR: Unable to import programs.match_template: {e}')
    import sys
    print(f'Python path: {sys.path}')
"
echo "======================================"


# Set up paths (update these paths for your project)
PROJECT_DIR="/global/scratch/users/jdickerson/2dtm_test_data/CHX_data"
MICROGRAPHS_DIR="${PROJECT_DIR}/all_mgraphs"
OUTPUT_DIR="${PROJECT_DIR}/results_match_tm_40S-body_2"
CTFS_DIR="${PROJECT_DIR}/all_ctfs/defocus_list.txt"
TEMPLATE_YAML="${PROJECT_DIR}/match_template_config_40S_base_H100.yaml"
SCRIPT_PATH="${PROJECT_DIR}/process_all_ctf-list.py"

# Create results directory if it doesn't exist
mkdir -p ${OUTPUT_DIR}

# Run the processing script
python ${SCRIPT_PATH} \
  --micrographs-dir "${MICROGRAPHS_DIR}" \
  --template-yaml "${TEMPLATE_YAML}" \
  --defocus-list "${CTFS_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --gpus "0,1,2,3,4,5,6,7" \
  --batch-size 8 \
  --pattern "*.mrc"

echo "All micrographs processed"


In [None]:
"""process_all_ctf-list.py"""
import os
import re
import yaml
import sys
import argparse
import glob
import pandas as pd
from leopard_em.pydantic_models.managers import MatchTemplateManager

def load_defocus_list(defocus_list_path):
    """Load CTF parameters from a pipe-delimited defocus list file."""
    # Read the defocus list file into a DataFrame with pipe delimiter
    df = pd.read_csv(defocus_list_path, sep='|', comment='#', skipinitialspace=True, dtype=str)
    
    # Create a dictionary mapping micrograph filenames to defocus parameters
    defocus_dict = {}
    for _, row in df.iterrows():
        # Access columns by position (.iloc) and strip leading/trailing whitespace
        full_path = row.iloc[0].strip()  # First column contains full filename
        micrograph_filename = os.path.basename(full_path)
        defocus_1 = float(row.iloc[1].strip())  # Second column has defocus_u (in Angstroms)
        defocus_2 = float(row.iloc[2].strip())  # Third column has defocus_v (in Angstroms)
        astigmatism_angle = float(row.iloc[3].strip())  # Fourth column has astigmatism angle
        
        defocus_dict[micrograph_filename] = (defocus_1, defocus_2, astigmatism_angle)
    
    return defocus_dict

def create_yaml_for_micrograph(template_yaml_path, micrograph_path, defocus_1, defocus_2, astigmatism_angle, output_path, gpu_ids):
    """Create a custom YAML file for a specific micrograph."""
    # Load the template YAML
    with open(template_yaml_path, 'r') as file:
        config = yaml.safe_load(file)
    
    # Update the config with the specific micrograph info
    config['micrograph_path'] = micrograph_path
    config['optics_group']['defocus_u'] = defocus_1
    config['optics_group']['defocus_v'] = defocus_2
    config['optics_group']['astigmatism_angle'] = astigmatism_angle
    
    # Set GPU IDs
    config['computational_config']['gpu_ids'] = gpu_ids
    
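    # Update output paths to be in the results folder with micrograph-specific names.
    # Use splitext rather than split('.') because micrograph names contain dots
    # (in the timestamp), which would otherwise truncate the basename.
    micrograph_basename = os.path.splitext(os.path.basename(micrograph_path))[0]
    results_dir = os.path.dirname(output_path)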
    
    for key in config['match_template_result']:
        if key != 'allow_file_overwrite':
            original_path = config['match_template_result'][key]
            filename = os.path.basename(original_path)
            new_filename = f"{micrograph_basename}_{filename}"
            config['match_template_result'][key] = os.path.join(results_dir, new_filename)
    
    # Write the updated config to a new YAML file
    with open(output_path, 'w') as file:
        yaml.dump(config, file, default_flow_style=False)
    
    return config

def process_micrograph(micrograph_path, template_yaml, defocus_dict, output_dir, gpu_ids, batch_size):
    """Process a single micrograph with match template using defocus info from dict."""
    # Get just the filename for looking up in the defocus dictionary
    micrograph_filename = os.path.basename(micrograph_path)
    
    # Get defocus parameters from dictionary
    if micrograph_filename not in defocus_dict:
        print(f"Defocus parameters not found for {micrograph_filename}")
        return False
    
    defocus_1, defocus_2, astigmatism_angle = defocus_dict[micrograph_filename]
    
    # Create custom YAML file for this micrograph
    custom_yaml_path = os.path.join(output_dir, f"{os.path.splitext(micrograph_filename)[0]}_config.yaml")
    create_yaml_for_micrograph(
        template_yaml, 
        micrograph_path, 
        defocus_1, 
        defocus_2, 
        astigmatism_angle, 
        custom_yaml_path,
        gpu_ids
    )
    
    # Run match template with the custom config
    try:
        print(f"Running match template for {micrograph_filename} with GPUs {gpu_ids}")
        mt_manager = MatchTemplateManager.from_yaml(custom_yaml_path)
        mt_manager.run_match_template(batch_size)
        
        # Save results to CSV
        df = mt_manager.results_to_dataframe()
        csv_path = os.path.join(output_dir, f"{os.path.splitext(micrograph_filename)[0]}_results.csv")
        df.to_csv(csv_path)
        
        print(f"Successfully processed {micrograph_filename}")
        return True
    except Exception as e:
        print(f"Error processing {micrograph_filename}: {str(e)}")
        return False

def is_already_processed(micrograph_path, output_dir):
    """Check if a micrograph has already been processed by looking for its results in output_dir."""
    micrograph_filename = os.path.basename(micrograph_path)
    results_filename = f"{os.path.splitext(micrograph_filename)[0]}_results.csv"
    results_path = os.path.join(output_dir, results_filename)
    return os.path.exists(results_path)

def main():
    parser = argparse.ArgumentParser(description='Process multiple micrographs with match template')
    parser.add_argument('--micrographs-dir', required=True, help='Directory containing micrograph files')
    parser.add_argument('--template-yaml', required=True, help='Path to the template YAML configuration')
    parser.add_argument('--defocus-list', required=True, help='Path to the defocus list file')
    parser.add_argument('--output-dir', required=True, help='Directory to store results')
    parser.add_argument('--gpus', default='0', help='Comma-separated list of GPU IDs to use')
    parser.add_argument('--batch-size', type=int, default=8, help='Orientation batch size')
    parser.add_argument('--pattern', default='*DWS.mrc', help='File pattern to match micrographs')
    parser.add_argument('--start-idx', type=int, default=None, help='Start index for processing (optional)')
    parser.add_argument('--end-idx', type=int, default=None, help='End index for processing (optional)')
    parser.add_argument('--job-idx', type=int, default=None, help='Job index from SLURM array (optional)')
    parser.add_argument('--jobs-per-array', type=int, default=None, help='Number of micrographs per array job (optional)')
    
    args = parser.parse_args()
    
    # Make sure output directory exists
    os.makedirs(args.output_dir, exist_ok=True)
    
    # Convert GPU IDs string to list of integers
    gpu_ids = [int(gpu_id) for gpu_id in args.gpus.split(',')]
    
    # Load defocus parameters from the defocus list file
    defocus_dict = load_defocus_list(args.defocus_list)
    print(f"Loaded defocus parameters for {len(defocus_dict)} micrographs")
    
    # Get list of micrograph files
    micrograph_pattern = os.path.join(args.micrographs_dir, args.pattern)
    micrograph_files = sorted(glob.glob(micrograph_pattern))
    
    if not micrograph_files:
        print(f"No micrograph files found matching pattern {micrograph_pattern}")
        return 1
    
    print(f"Found {len(micrograph_files)} micrograph files")
    
    # Filter out micrographs that have already been processed
    unprocessed_micrographs = []
    for micrograph_file in micrograph_files:
        if not is_already_processed(micrograph_file, args.output_dir):
            unprocessed_micrographs.append(micrograph_file)
    
    print(f"Found {len(unprocessed_micrographs)} unprocessed micrographs out of {len(micrograph_files)} total")
    
    # Replace the full list with only unprocessed micrographs
    micrograph_files = unprocessed_micrographs
    
    # Check if there are any micrographs to process
    if not micrograph_files:
        print("No unprocessed micrographs to process. Exiting.")
        return 0
    
    # Determine which micrographs to process
    total_unprocessed = len(micrograph_files)  # keep the total for log messages
    if args.job_idx is not None and args.jobs_per_array is not None:
        # Calculate range for this job in the array (job indices are 1-based)
        start_idx = (args.job_idx - 1) * args.jobs_per_array
        end_idx = min(start_idx + args.jobs_per_array, total_unprocessed)
        micrograph_files = micrograph_files[start_idx:end_idx]
        print(f"Processing micrographs {start_idx+1}-{end_idx} out of {total_unprocessed}")
    elif args.start_idx is not None or args.end_idx is not None:
        start_idx = args.start_idx if args.start_idx is not None else 0
        end_idx = args.end_idx if args.end_idx is not None else total_unprocessed
        micrograph_files = micrograph_files[start_idx:end_idx]
        print(f"Processing micrographs {start_idx+1}-{end_idx} out of {total_unprocessed}")
    
    # Process each micrograph
    successful = 0
    for i, micrograph_file in enumerate(micrograph_files):
        print(f"Processing {i+1}/{len(micrograph_files)}: {os.path.basename(micrograph_file)}")
        if process_micrograph(
            micrograph_file, 
            args.template_yaml, 
            defocus_dict, 
            args.output_dir, 
            gpu_ids, 
            args.batch_size
        ):
            successful += 1
    
    print(f"Successfully processed {successful}/{len(micrograph_files)} micrographs")
    return 0

if __name__ == "__main__":
    sys.exit(main())
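
Micrograph names in this dataset contain dots in the timestamp, so it matters how the file extension is stripped when building per-micrograph output names. The sketch below compares `str.split('.')` with `os.path.splitext` on two names of the form seen in the results listings; the `.mrc` extension is assumed from the `--pattern "*.mrc"` argument.

```python
import os

# Micrograph names of the form seen in this dataset (timestamps contain dots).
# The .mrc extension is an assumption based on the --pattern argument.
names = [
    "153_Sep13_11.46.34_261_1.mrc",
    "153_Sep13_11.48.36_263_1.mrc",
]

for name in names:
    by_split = name.split(".")[0]  # cuts at the first dot
    by_splitext = os.path.splitext(name)[0]  # removes only the final extension
    print(f"{name}: split -> {by_split!r}, splitext -> {by_splitext!r}")

# Count how many distinct stems each approach produces.
stems_split = {n.split(".")[0] for n in names}
stems_splitext = {os.path.splitext(n)[0] for n in names}
print(f"Distinct stems: split={len(stems_split)}, splitext={len(stems_splitext)}")
```

Splitting at the first dot maps both names to the same stem, whereas `os.path.splitext` keeps them distinct, which is what per-micrograph output files need.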


## Refine template for the LSU
After running match template, the results were copied to a local workstation for further processing.
The first step was to update the file paths in the results files with the new paths.

In [3]:
#!/usr/bin/env python3

import os
import sys
import glob
import pandas as pd

def replace_paths_in_csv(input_dir):
    """
    Read all _results.csv files in the specified directory, replace path strings,
    and write back to the same files.
    
    Args:
        input_dir (str): Directory containing _results.csv files
    """
    old_path = "/global/scratch/users/jdickerson/2dtm_test_data/"
    new_path = "/home/data/jdickerson/Leopard-EM_paper_data/"
    
    # Find all _results.csv files in the specified directory
    csv_files = glob.glob(os.path.join(input_dir, "**/*_results.csv"), recursive=True)
    
    if not csv_files:
        print(f"No _results.csv files found in {input_dir}")
        return
    
    print(f"Found {len(csv_files)} _results.csv files")
    
    for csv_file in csv_files:
        print(f"Processing: {csv_file}")
        
        try:
            # Read the CSV file using pandas
            df = pd.read_csv(csv_file)
            
            # Replace path strings in all columns
            for col in df.columns:
                if df[col].dtype == 'object':  # Only process string columns
                    # regex=False treats the path as a literal string, not a pattern
                    df[col] = df[col].astype(str).str.replace(old_path, new_path, regex=False)
            
            # Write the modified dataframe back to the file
            df.to_csv(csv_file, index=False)
            print(f"  Successfully updated paths in {csv_file}")
        
        except Exception as e:
            print(f"  Error processing {csv_file}: {e}")
    
    print("Path replacement completed!")

if __name__ == "__main__":
    # The results directory is hard-coded because this cell is run from the
    # notebook rather than from the command line.
    input_dir = "results_match_tm_60S_2/"
    
    if not os.path.isdir(input_dir):
        print(f"Error: {input_dir} is not a valid directory")
        sys.exit(1)
    
    replace_paths_in_csv(input_dir)

Found 45 _results.csv files
Processing: results_match_tm_60S_2/153_Sep13_11.46.34_261_1_results.csv
  Successfully updated paths in results_match_tm_60S_2/153_Sep13_11.46.34_261_1_results.csv
Processing: results_match_tm_60S_2/100_Sep12_18.37.00_181_1_results.csv
  Successfully updated paths in results_match_tm_60S_2/100_Sep12_18.37.00_181_1_results.csv
Processing: results_match_tm_60S_2/95_Sep12_18.17.26_207_1_results.csv
  Successfully updated paths in results_match_tm_60S_2/95_Sep12_18.17.26_207_1_results.csv
Processing: results_match_tm_60S_2/77_Sep12_15.54.58_165_1_results.csv
  Successfully updated paths in results_match_tm_60S_2/77_Sep12_15.54.58_165_1_results.csv
Processing: results_match_tm_60S_2/153_Sep13_11.48.36_263_1_results.csv
  Successfully updated paths in results_match_tm_60S_2/153_Sep13_11.48.36_263_1_results.csv
Processing: results_match_tm_60S_2/130_Sep13_11.13.01_219_1_results.csv
  Successfully updated paths in results_match_tm_60S_2/130_Sep13_11.13.01_219_1_resu

In [4]:
#!/usr/bin/env python3

import os
import sys
import glob
import pandas as pd

def replace_paths_in_csv(input_dir):
    """
    Read all _results.csv files in the specified directory, replace path strings,
    and write back to the same files.
    
    Args:
        input_dir (str): Directory containing _results.csv files
    """
    old_path = "/global/scratch/users/jdickerson/2dtm_test_data/"
    new_path = "/home/data/jdickerson/Leopard-EM_paper_data/"
    
    # Find all _results.csv files in the specified directory
    csv_files = glob.glob(os.path.join(input_dir, "**/*_results.csv"), recursive=True)
    
    if not csv_files:
        print(f"No _results.csv files found in {input_dir}")
        return
    
    print(f"Found {len(csv_files)} _results.csv files")
    
    for csv_file in csv_files:
        print(f"Processing: {csv_file}")
        
        try:
            # Read the CSV file using pandas
            df = pd.read_csv(csv_file)
            
            # Replace path strings in all columns
            for col in df.columns:
                if df[col].dtype == 'object':  # Only process string columns
                    # regex=False treats the path as a literal string, not a pattern
                    df[col] = df[col].astype(str).str.replace(old_path, new_path, regex=False)
            
            # Write the modified dataframe back to the file
            df.to_csv(csv_file, index=False)
            print(f"  Successfully updated paths in {csv_file}")
        
        except Exception as e:
            print(f"  Error processing {csv_file}: {e}")
    
    print("Path replacement completed!")

if __name__ == "__main__":
    # The results directory is hard-coded because this cell is run from the
    # notebook rather than from the command line.
    input_dir = "results_match_tm_40S-body_2/"
    
    if not os.path.isdir(input_dir):
        print(f"Error: {input_dir} is not a valid directory")
        sys.exit(1)
    
    replace_paths_in_csv(input_dir)

Found 45 _results.csv files
Processing: results_match_tm_40S-body_2/153_Sep13_11.46.34_261_1_results.csv
  Successfully updated paths in results_match_tm_40S-body_2/153_Sep13_11.46.34_261_1_results.csv
Processing: results_match_tm_40S-body_2/100_Sep12_18.37.00_181_1_results.csv
  Successfully updated paths in results_match_tm_40S-body_2/100_Sep12_18.37.00_181_1_results.csv
Processing: results_match_tm_40S-body_2/95_Sep12_18.17.26_207_1_results.csv
  Successfully updated paths in results_match_tm_40S-body_2/95_Sep12_18.17.26_207_1_results.csv
Processing: results_match_tm_40S-body_2/77_Sep12_15.54.58_165_1_results.csv
  Successfully updated paths in results_match_tm_40S-body_2/77_Sep12_15.54.58_165_1_results.csv
Processing: results_match_tm_40S-body_2/153_Sep13_11.48.36_263_1_results.csv
  Successfully updated paths in results_match_tm_40S-body_2/153_Sep13_11.48.36_263_1_results.csv
Processing: results_match_tm_40S-body_2/130_Sep13_11.13.01_219_1_results.csv
  Successfully updated paths 

Refine template was then run with the following parameters.

In [1]:
from IPython.display import Markdown, display

# Read the YAML file
with open("refine_template_config_60S.yaml") as file:
    yaml_content = file.read()

# Display as markdown code block
display(Markdown(f"```yaml\n{yaml_content}\n```"))

```yaml
###################################################
### RefineTemplateManager configuration example ###
###################################################
# An example YAML configuration to modify.
# Call `RefineTemplateManager.from_yaml(path)` to load this configuration.
template_volume_path: /home/data/jdickerson/Leopard-EM_paper_data/maps/60S_map_px1.059_bscale0.5_k3dqe.mrc
particle_stack:
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [528, 528]
  original_template_size: [512, 512]
defocus_refinement_config:
  enabled: true
  defocus_max:  100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_min: -100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_step: 20.0   # in Angstroms
orientation_refinement_config:
  enabled: true
  psi_step_coarse:     1.5   # in degrees
  psi_step_fine:       0.1  # in degrees
  theta_step_coarse: 2.5   # in degrees
  theta_step_fine:   0.1  # in degrees
pixel_size_refinement_config:
  enabled: false
  pixel_size_min: -0.005
  pixel_size_max: 0.005
  pixel_size_step: 0.001
preprocessing_filters:
  whitening_filter:
    do_power_spectrum: true
    enabled: true
    max_freq: 1.0  # In terms of Nyquist frequency
    num_freq_bins: null
  bandpass_filter:
    enabled: false
    falloff: null
    high_freq_cutoff: null
    low_freq_cutoff: null
computational_config:
  gpu_ids: 
    - 0
    - 1
    - 2
    - 3
  num_cpus: 8

```
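
To make the refinement search ranges concrete, the sketch below enumerates the relative defocus offsets implied by `defocus_min`, `defocus_max` and `defocus_step`, and the number of in-plane positions available when a 512-pixel template is correlated inside a 528-pixel extracted box. It assumes that the defocus range includes both endpoints and that the correlation is computed in "valid" mode; neither assumption is checked here against the Leopard-EM implementation, and the helper names are illustrative.

```python
def inclusive_range(start: float, stop: float, step: float) -> list:
    """Evenly spaced values from start to stop, including both endpoints."""
    n_steps = int(round((stop - start) / step))
    return [start + i * step for i in range(n_steps + 1)]


def valid_correlation_shape(box: tuple, template: tuple) -> tuple:
    """Shape of a 'valid'-mode cross-correlation of a template within a box."""
    return tuple(b - t + 1 for b, t in zip(box, template))


# Values taken from the refine template configuration shown above.
defocus_offsets = inclusive_range(-100.0, 100.0, 20.0)  # Angstroms, relative
shift_grid = valid_correlation_shape((528, 528), (512, 512))

print(f"{len(defocus_offsets)} defocus offsets: {defocus_offsets}")
print(f"Positions searched per particle: {shift_grid}")
```

Under these assumptions there are 11 defocus offsets per particle and a 17 x 17 grid of positions, corresponding to shifts of up to 8 pixels in either direction.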

In [None]:
# Load any necessary modules (adjust for your system)
# Print current shell and environment before activation
echo "=== ENVIRONMENT BEFORE ACTIVATION ==="
echo "Current shell: $SHELL"
echo "Current conda environments:"
conda env list
echo "Current Python: $(which python)"
echo "Current Python version: $(python --version 2>&1)"
echo "======================================"

# Activate leopard-em conda environment 
echo "=== ACTIVATING CONDA ENVIRONMENT ==="
source $(conda info --base)/etc/profile.d/conda.sh
conda activate leopard-em
ACTIVATION_STATUS=$?

# Check if activation succeeded
if [ $ACTIVATION_STATUS -ne 0 ]; then
    echo "ERROR: Failed to activate the leopard-em environment"
    echo "Available environments:"
    conda env list
    exit 1
fi

# Print environment details after activation
echo "=== ENVIRONMENT AFTER ACTIVATION ==="
echo "Active conda environment: $CONDA_PREFIX"
echo "Python interpreter: $(which python)"
echo "Python version: $(python --version 2>&1)"
echo "Conda packages in environment:"
conda list | grep -E 'program|leopard'
echo "======================================"



# Set up paths (update these paths for your project)
PROJECT_DIR="/home/data/jdickerson/Leopard-EM_paper_data/CHX_data"
MICROGRAPHS_DIR="${PROJECT_DIR}/all_mgraphs"
OUTPUT_DIR="${PROJECT_DIR}/results_refine_tm_60S"
MATCH_RESULTS_DIR="${PROJECT_DIR}/results_match_tm_60S"
TEMPLATE_YAML="${PROJECT_DIR}/refine_template_config_60S.yaml"
TEMPLATE_DIR="/home/data/jdickerson/Leopard-EM_paper_data/maps/60S_map_px1.059_bscale0.5_k3dqe.mrc"
RESULTS_SUFFIX="_results.csv"
SCRIPT_PATH="${PROJECT_DIR}/process_all_micrographs_refine.py"

# Create results directory if it doesn't exist
mkdir -p ${OUTPUT_DIR}

# Run the processing script
python ${SCRIPT_PATH} \
  --micrographs-dir "${MICROGRAPHS_DIR}" \
  --template-yaml "${TEMPLATE_YAML}" \
  --match-results-dir "${MATCH_RESULTS_DIR}" \
  --template-volume "${TEMPLATE_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --results-suffix "${RESULTS_SUFFIX}" \
  --gpus "0,1,2,3" \
  --batch-size 64 \
  --pattern "*.mrc"

echo "All micrographs processed"


In [None]:
#!/usr/bin/env python
import os
import re
import yaml
import sys
import argparse
import glob
import time
import pandas as pd
from leopard_em.pydantic_models.managers import RefineTemplateManager

def extract_micrograph_number(filename):
    """Extract the micrograph number from the filename."""
    match = re.search(r'xenon_(\d+)_(\d+)_', filename)
    if match:
        return f"{match.group(1)}_{match.group(2)}"
    return None

def create_yaml_for_refinement(template_yaml_path, match_results_csv, output_yaml_path, template_volume_path, gpu_ids):
    """Create a custom YAML file for refinement of a specific micrograph's match results."""
    # Load the template YAML
    with open(template_yaml_path, 'r') as file:
        config = yaml.safe_load(file)
    
    # Update the config with the specific match results info
    config['template_volume_path'] = template_volume_path
    config['particle_stack']['df_path'] = match_results_csv
    
    # Set GPU IDs
    config['computational_config']['gpu_ids'] = gpu_ids
    
    # Write the updated config to a new YAML file
    with open(output_yaml_path, 'w') as file:
        yaml.dump(config, file, default_flow_style=False)
    
    return config

def process_micrograph_refinement(micrograph_path, match_results_csv, template_yaml, output_dir, template_volume_path, gpu_ids, batch_size):
    """Process refinement for a single micrograph's match template results."""
    micrograph_basename = os.path.basename(micrograph_path)
    micrograph_number = extract_micrograph_number(micrograph_basename)
    
    if not micrograph_number:
        print(f"Could not extract micrograph number from {micrograph_basename}")
        return False
    
    # Check if match results CSV exists
    if not os.path.exists(match_results_csv):
        print(f"Match template results not found: {match_results_csv}")
        return False
    
    # Check if there are any results to refine
    df = pd.read_csv(match_results_csv)
    if len(df) == 0:
        print(f"No matches to refine in {match_results_csv}")
        return False
    
    # Create custom YAML file for refinement of this micrograph's results
    custom_yaml_path = os.path.join(output_dir, f"{os.path.splitext(micrograph_basename)[0]}_refine_config.yaml")
    create_yaml_for_refinement(
        template_yaml, 
        match_results_csv, 
        custom_yaml_path,
        template_volume_path,
        gpu_ids
    )
    
    # Run refine template with the custom config
    try:
        print(f"Running refine template for {micrograph_basename} results with GPUs {gpu_ids}")
        rt_manager = RefineTemplateManager.from_yaml(custom_yaml_path)
        
        # Define output path for refinement results
        refine_output_csv = os.path.join(output_dir, f"{os.path.splitext(micrograph_basename)[0]}_refined_results.csv")
        
        # Record start time
        start_time = time.time()
        
        # Run refinement
        rt_manager.run_refine_template(refine_output_csv, batch_size)
        
        # Calculate and print elapsed time
        end_time = time.time()
        elapsed_time = end_time - start_time
        elapsed_time_str = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
        print(f"Refinement wall time: {elapsed_time_str}")
        
        print(f"Successfully refined matches for {micrograph_basename}")
        return True
    except Exception as e:
        print(f"Error refining matches for {micrograph_basename}: {str(e)}")
        return False

def main():
    parser = argparse.ArgumentParser(description='Process multiple micrographs with refine template')
    parser.add_argument('--micrographs-dir', required=True, help='Directory containing micrograph files')
    parser.add_argument('--template-yaml', required=True, help='Path to the template refine YAML configuration')
    parser.add_argument('--match-results-dir', required=True, help='Directory containing match template results')
    parser.add_argument('--template-volume', required=True, help='Path to the template volume MRC file')
    parser.add_argument('--output-dir', required=True, help='Directory to store refinement results')
    parser.add_argument('--results-suffix', default='_results.csv', help='Suffix for match template results files (default: "_results.csv")')
    parser.add_argument('--gpus', default='0', help='Comma-separated list of GPU IDs to use')
    parser.add_argument('--batch-size', type=int, default=64, help='Particle batch size for refinement')
    parser.add_argument('--pattern', default='*DWS.mrc', help='File pattern to match micrographs')
    parser.add_argument('--start-idx', type=int, default=None, help='Start index for processing (optional)')
    parser.add_argument('--end-idx', type=int, default=None, help='End index for processing (optional)')
    parser.add_argument('--job-idx', type=int, default=None, help='Job index from SLURM array (optional)')
    parser.add_argument('--jobs-per-array', type=int, default=None, help='Number of micrographs per array job (optional)')
    
    args = parser.parse_args()
    
    # Make sure output directory exists
    os.makedirs(args.output_dir, exist_ok=True)
    
    # Convert GPU IDs string to list of integers
    gpu_ids = [int(gpu_id) for gpu_id in args.gpus.split(',')]
    
    # Get list of micrograph files
    micrograph_pattern = os.path.join(args.micrographs_dir, args.pattern)
    micrograph_files = sorted(glob.glob(micrograph_pattern))
    
    if not micrograph_files:
        print(f"No micrograph files found matching pattern {micrograph_pattern}")
        return 1
    
    print(f"Found {len(micrograph_files)} micrograph files")
    
    # Determine which micrographs to process
    total_micrographs = len(micrograph_files)  # count before slicing, used for logging
    if args.job_idx is not None and args.jobs_per_array is not None:
        # Calculate range for this job in the array (job indices are 1-based)
        start_idx = (args.job_idx - 1) * args.jobs_per_array
        end_idx = min(start_idx + args.jobs_per_array, total_micrographs)
        micrograph_files = micrograph_files[start_idx:end_idx]
        print(f"Processing micrographs {start_idx+1}-{end_idx} out of {total_micrographs}")
    elif args.start_idx is not None or args.end_idx is not None:
        start_idx = args.start_idx if args.start_idx is not None else 0
        end_idx = args.end_idx if args.end_idx is not None else total_micrographs
        micrograph_files = micrograph_files[start_idx:end_idx]
        print(f"Processing micrographs {start_idx+1}-{end_idx} out of {total_micrographs}")
    
    # Process each micrograph's match results for refinement
    successful = 0
    for i, micrograph_file in enumerate(micrograph_files):
        micrograph_basename = os.path.basename(micrograph_file)
        base_name = os.path.splitext(micrograph_basename)[0]
        
        # Find the corresponding match results CSV file using the configurable suffix
        match_results_csv = os.path.join(args.match_results_dir, f"{base_name}{args.results_suffix}")
        
        print(f"Processing {i+1}/{len(micrograph_files)}: {micrograph_basename}")
        if process_micrograph_refinement(
            micrograph_file, 
            match_results_csv,
            args.template_yaml, 
            args.output_dir,
            args.template_volume,
            gpu_ids, 
            args.batch_size
        ):
            successful += 1
    
    print(f"Successfully refined matches for {successful}/{len(micrograph_files)} micrographs")
    return 0

if __name__ == "__main__":
    sys.exit(main()) 
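
The `--job-idx` and `--jobs-per-array` options let the same script be run as a SLURM array job, with each task handling its own block of micrographs. A minimal sketch of that index arithmetic, mirroring the logic in `main()` above (the helper name `array_job_slice` and the placeholder file names are hypothetical, and it assumes the 1-based job index is passed from something like `$SLURM_ARRAY_TASK_ID`):

```python
def array_job_slice(job_idx: int, jobs_per_array: int, n_files: int) -> tuple[int, int]:
    """Return the [start, end) indices handled by a 1-based array job index."""
    start = (job_idx - 1) * jobs_per_array
    # Clamp the end so the last job does not run past the file list
    end = min(start + jobs_per_array, n_files)
    return start, end


# Placeholder file names, for illustration only
files = [f"micrograph_{i:03d}.mrc" for i in range(10)]

for job_idx in (1, 2, 3):
    start, end = array_job_slice(job_idx, jobs_per_array=4, n_files=len(files))
    print(f"job {job_idx}: {files[start:end]}")
```

With ten files and four micrographs per job, the third job receives only the two remaining files.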

## Performing a constrained search for the SSU

The constrained search follows the same four-step process as for the untreated cells, but the angular search ranges were re-optimized for this dataset and therefore differ.
As before, all searches were run with a false-positive rate of 1 per 200 LSUs (`--false-positives 0.005`).
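
As a rough illustration of what such a false-positive setting implies, the sketch below converts an expected number of false positives into a z-score threshold, assuming the correlation scores are independent Gaussian noise. This is not Leopard-EM's internal code: `gaussian_threshold` and the example search sizes are hypothetical, chosen only to show how the threshold rises as more correlations are evaluated per reference particle.

```python
from statistics import NormalDist


def gaussian_threshold(false_positives: float, n_correlations: int) -> float:
    """Z-score above which `false_positives` hits are expected by chance.

    Assumes the n_correlations scores are independent standard-normal noise.
    """
    tail_probability = false_positives / n_correlations
    # inv_cdf of a lower-tail probability is negative; flip the sign
    # to obtain the matching upper-tail threshold.
    return -NormalDist().inv_cdf(tail_probability)


# Hypothetical numbers of correlation values evaluated per reference particle
for n in (1_000, 100_000, 10_000_000):
    print(f"n={n:>10,d}  threshold={gaussian_threshold(0.005, n):.2f}")
```

A smaller constrained search evaluates fewer correlations, so under this assumption the same false-positive rate is reached at a lower threshold.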

Step 1 is a search around the Z-axis.

In [1]:
from IPython.display import Markdown, display

# Read the YAML file
with open("configs/constrained_search_config_SSU-body_step1.yaml") as file:
    yaml_content = file.read()

# Display as markdown code block
display(Markdown(f"```yaml\n{yaml_content}\n```"))

```yaml
###################################################
#### Constrained Search configuration example #####
###################################################
# An example YAML configuration to modify.
# Call `ConstrainedSearchManager.from_yaml(path)` to load this configuration.
template_volume_path: /home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc # Volume of small particle
particle_stack_reference: # This is from the large particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
particle_stack_constrained: # This is from the small particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
centre_vector: [88.023109, 52.080261, 45.528008] # Vector from large particle to small particle in Angstroms
orientation_refinement_config:
  enabled: true
  base_grid_method: uniform 
  psi_step: 1.0   # psi in degrees
  theta_step: 1.0   # theta and phi in degrees
  rotation_axis_euler_angles: [76.2, 60.02, 0.0] # This is the rotation axis
  phi_min: 0.0
  phi_max: 0.0
  theta_min: 0.0
  theta_max: 0.0
  psi_min: -13.0
  psi_max: 2.5
  search_roll_axis: false
  roll_axis: [0.0,1.0] # [x,y] This defines the roll axis (orthogonal to the rotation axis). None means search
  roll_step: 2.0 
defocus_refinement_config:
  enabled: false
  defocus_max:  100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_min: -100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_step: 20.0   # in Angstroms
preprocessing_filters:
  whitening_filter:
    do_power_spectrum: true
    enabled: true
    max_freq: 1.0  # In terms of Nyquist frequency
    num_freq_bins: null
  bandpass_filter:
    enabled: false
    falloff: null
    high_freq_cutoff: null
    low_freq_cutoff: null
computational_config:
  gpu_ids: 
    - 0
    - 1
    - 2
    - 3
  num_cpus: 8

```

In [None]:
#!/bin/bash

# Create logs directory
mkdir -p logs

# Print current shell and environment before activation
echo "=== ENVIRONMENT BEFORE ACTIVATION ==="
echo "Current shell: $SHELL"
echo "Current conda environments:"
conda env list
echo "Current Python: $(which python)"
echo "Current Python version: $(python --version 2>&1)"
echo "======================================"

# Activate leopard-em conda environment 
echo "=== ACTIVATING CONDA ENVIRONMENT ==="
source $(conda info --base)/etc/profile.d/conda.sh
conda activate leopard-em
ACTIVATION_STATUS=$?

# Check if activation succeeded
if [ $ACTIVATION_STATUS -ne 0 ]; then
    echo "ERROR: Failed to activate the leopard-em environment"
    echo "Available environments:"
    conda env list
    exit 1
fi

# Print environment details after activation
echo "=== ENVIRONMENT AFTER ACTIVATION ==="
echo "Active conda environment: $CONDA_PREFIX"
echo "Python interpreter: $(which python)"
echo "Python version: $(python --version 2>&1)"
echo "Conda packages in environment:"
conda list | grep -E 'program|leopard'
echo "======================================"


# Set up paths (update these paths for your project)
PROJECT_DIR="/home/data/jdickerson/Leopard-EM_paper_data/CHX_data"
MICROGRAPHS_DIR="${PROJECT_DIR}/all_mgraphs"
OUTPUT_DIR="${PROJECT_DIR}/results_constrained_step1"
LARGE_RESULTS_DIR="${PROJECT_DIR}/results_refine_tm_60S"
SMALL_RESULTS_DIR="${PROJECT_DIR}/results_match_tm_40S-body"
TEMPLATE_YAML="${PROJECT_DIR}/configs/constrained_search_config_SSU-body_step1.yaml"
TEMPLATE_DIR="/home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc"
LARGE_SUFFIX="_refined_results.csv"
SMALL_SUFFIX="_results.csv"
SCRIPT_PATH="${PROJECT_DIR}/process_all_micrographs_constrained.py"

# Create results directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run the processing script
python "${SCRIPT_PATH}" \
  --micrographs-dir "${MICROGRAPHS_DIR}" \
  --template-yaml "${TEMPLATE_YAML}" \
  --large-results-dir "${LARGE_RESULTS_DIR}" \
  --small-results-dir "${SMALL_RESULTS_DIR}" \
  --template-volume "${TEMPLATE_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --large-suffix "${LARGE_SUFFIX}" \
  --small-suffix "${SMALL_SUFFIX}" \
  --gpus "0,1" \
  --batch-size 64 \
  --false-positives 0.005

echo "All micrographs processed"

In [None]:
#!/usr/bin/env python
import os
import re
import yaml
import sys
import argparse
import glob
import time
import pandas as pd
from leopard_em.pydantic_models.managers import ConstrainedSearchManager

def extract_micrograph_number(filename):
    """Extract the micrograph number from the filename."""
    # The new pattern matches filenames like: 25_Sep12_11.40.05_145_1.mrc
    # Simply returns the base filename without extension
    # This makes it work directly with similar result files: 25_Sep12_11.40.05_145_1_results.csv
    base_name = os.path.splitext(filename)[0]
    return base_name

def create_yaml_for_constrained_search(template_yaml_path, large_results_csv, small_results_csv, output_yaml_path, template_volume_path, gpu_ids):
    """Create a custom YAML file for constrained search of a specific micrograph's match results."""
    # Load the template YAML
    with open(template_yaml_path, 'r') as file:
        config = yaml.safe_load(file)
    
    # Update the config with the specific match results info
    config['template_volume_path'] = template_volume_path
    config['particle_stack_reference']['df_path'] = large_results_csv
    config['particle_stack_constrained']['df_path'] = small_results_csv
    
    # Set GPU IDs
    config['computational_config']['gpu_ids'] = gpu_ids
    
    # Write the updated config to a new YAML file
    with open(output_yaml_path, 'w') as file:
        yaml.dump(config, file, default_flow_style=False)
    
    return config

def process_micrograph_constrained_search(micrograph_path, large_results_csv, small_results_csv, template_yaml, output_dir, template_volume_path, gpu_ids, batch_size, false_positives):
    """Process constrained search for a single micrograph's match template results."""
    micrograph_basename = os.path.basename(micrograph_path)
    base_name = os.path.splitext(micrograph_basename)[0]
    
    # Check if match results CSV files exist
    if not os.path.exists(large_results_csv):
        print(f"Large particle results not found: {large_results_csv}")
        return False
    
    if not os.path.exists(small_results_csv):
        print(f"Small particle results not found: {small_results_csv}")
        return False
    
    # Check if there are any results to process
    df_large = pd.read_csv(large_results_csv)
    if len(df_large) == 0:
        print(f"No large particle matches in {large_results_csv}")
        return False
    
    df_small = pd.read_csv(small_results_csv)
    if len(df_small) == 0:
        print(f"No small particle matches in {small_results_csv}")
        return False
    
    # Create custom YAML file for constrained search of this micrograph's results
    custom_yaml_path = os.path.join(output_dir, f"{base_name}_constrained_config.yaml")
    create_yaml_for_constrained_search(
        template_yaml, 
        large_results_csv, 
        small_results_csv,
        custom_yaml_path,
        template_volume_path,
        gpu_ids
    )
    
    # Run constrained search with the custom config
    try:
        print(f"Running constrained search for {micrograph_basename} with GPUs {gpu_ids}")
        cs_manager = ConstrainedSearchManager.from_yaml(custom_yaml_path)
        
        # Define output path for constrained search results
        constrained_output_csv = os.path.join(output_dir, f"{base_name}_constrained_results.csv")
        
        # Record start time
        start_time = time.time()
        
        # Run constrained search
        cs_manager.run_constrained_search(constrained_output_csv, false_positives, batch_size)
        
        # Calculate and print elapsed time
        end_time = time.time()
        elapsed_time = end_time - start_time
        elapsed_time_str = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
        print(f"Constrained search wall time: {elapsed_time_str}")
        
        print(f"Successfully completed constrained search for {micrograph_basename}")
        return True
    except Exception as e:
        print(f"Error in constrained search for {micrograph_basename}: {str(e)}")
        return False

def is_already_processed(micrograph_path, output_dir):
    """Check if a micrograph has already been processed by looking for its constrained search results in output_dir."""
    micrograph_filename = os.path.basename(micrograph_path)
    results_filename = f"{os.path.splitext(micrograph_filename)[0]}_constrained_results.csv"
    results_path = os.path.join(output_dir, results_filename)
    return os.path.exists(results_path)

def main():
    parser = argparse.ArgumentParser(description='Process multiple micrographs with constrained search')
    parser.add_argument('--micrographs-dir', required=True, help='Directory containing micrograph files')
    parser.add_argument('--template-yaml', required=True, help='Path to the template constrained search YAML configuration')
    parser.add_argument('--large-results-dir', required=True, help='Directory containing large particle results')
    parser.add_argument('--small-results-dir', required=True, help='Directory containing small particle results')
    parser.add_argument('--template-volume', required=True, help='Path to the template volume MRC file (small particle)')
    parser.add_argument('--output-dir', required=True, help='Directory to store constrained search results')
    parser.add_argument('--large-suffix', default='_refined_results.csv', help='Suffix for large particle result files (default: "_refined_results.csv")')
    parser.add_argument('--small-suffix', default='_results.csv', help='Suffix for small particle result files (default: "_results.csv")')
    parser.add_argument('--gpus', default='0', help='Comma-separated list of GPU IDs to use')
    parser.add_argument('--batch-size', type=int, default=80, help='Particle batch size for constrained search')
    parser.add_argument('--pattern', default='*.mrc', help='File pattern to match micrographs')
    parser.add_argument('--false-positives', type=float, default=0.005, help='False positives rate for constrained search')
    parser.add_argument('--start-idx', type=int, default=None, help='Start index for processing (optional)')
    parser.add_argument('--end-idx', type=int, default=None, help='End index for processing (optional)')
    parser.add_argument('--job-idx', type=int, default=None, help='Job index from SLURM array (optional)')
    parser.add_argument('--jobs-per-array', type=int, default=None, help='Number of micrographs per array job (optional)')
    
    args = parser.parse_args()
    
    # Make sure output directory exists
    os.makedirs(args.output_dir, exist_ok=True)
    
    # Convert GPU IDs string to list of integers
    gpu_ids = [int(gpu_id) for gpu_id in args.gpus.split(',')]
    
    # Get list of micrograph files
    micrograph_pattern = os.path.join(args.micrographs_dir, args.pattern)
    micrograph_files = sorted(glob.glob(micrograph_pattern))
    
    if not micrograph_files:
        print(f"No micrograph files found matching pattern {micrograph_pattern}")
        return 1
    
    print(f"Found {len(micrograph_files)} micrograph files")
    
    # Filter out micrographs that have already been processed
    unprocessed_micrographs = []
    for micrograph_file in micrograph_files:
        if not is_already_processed(micrograph_file, args.output_dir):
            unprocessed_micrographs.append(micrograph_file)
    
    print(f"Found {len(unprocessed_micrographs)} unprocessed micrographs out of {len(micrograph_files)} total")
    
    # Replace the full list with only unprocessed micrographs
    micrograph_files = unprocessed_micrographs
    
    # Check if there are any micrographs to process
    if not micrograph_files:
        print("No unprocessed micrographs to process. Exiting.")
        return 0
    
    # Determine which micrographs to process
    total_micrographs = len(micrograph_files)  # count before slicing, used for logging
    if args.job_idx is not None and args.jobs_per_array is not None:
        # Calculate range for this job in the array (job indices are 1-based)
        start_idx = (args.job_idx - 1) * args.jobs_per_array
        end_idx = min(start_idx + args.jobs_per_array, total_micrographs)
        micrograph_files = micrograph_files[start_idx:end_idx]
        print(f"Processing micrographs {start_idx+1}-{end_idx} out of {total_micrographs}")
    elif args.start_idx is not None or args.end_idx is not None:
        start_idx = args.start_idx if args.start_idx is not None else 0
        end_idx = args.end_idx if args.end_idx is not None else total_micrographs
        micrograph_files = micrograph_files[start_idx:end_idx]
        print(f"Processing micrographs {start_idx+1}-{end_idx} out of {total_micrographs}")
    
    # Process each micrograph's match results for constrained search
    successful = 0
    for i, micrograph_file in enumerate(micrograph_files):
        micrograph_basename = os.path.basename(micrograph_file)
        base_name = os.path.splitext(micrograph_basename)[0]
        
        # Find the corresponding results CSV files using the configurable suffixes
        large_results_csv = os.path.join(args.large_results_dir, f"{base_name}{args.large_suffix}")
        small_results_csv = os.path.join(args.small_results_dir, f"{base_name}{args.small_suffix}")
        
        print(f"Processing {i+1}/{len(micrograph_files)}: {micrograph_basename}")
        if process_micrograph_constrained_search(
            micrograph_file, 
            large_results_csv,
            small_results_csv,
            args.template_yaml, 
            args.output_dir,
            args.template_volume,
            gpu_ids, 
            args.batch_size,
            args.false_positives
        ):
            successful += 1
    
    print(f"Successfully completed constrained search for {successful}/{len(micrograph_files)} micrographs")
    return 0

if __name__ == "__main__":
    sys.exit(main()) 

The second search was around the y axis (theta); its config file and output directory are labelled `step3`.

In [2]:
from IPython.display import Markdown, display

# Read the YAML file
with open("configs/constrained_search_config_SSU-body_step3.yaml") as file:
    yaml_content = file.read()

# Display as markdown code block
display(Markdown(f"```yaml\n{yaml_content}\n```"))

```yaml
###################################################
#### Constrained Search configuration example #####
###################################################
# An example YAML configuration to modify.
# Call `ConstrainedSearchManager.from_yaml(path)` to load this configuration.
template_volume_path: /home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc # Volume of small particle
particle_stack_reference: # This is from the large particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
particle_stack_constrained: # This is from the small particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
centre_vector: [88.023109, 52.080261, 45.528008] # Vector from large particle to small particle in Angstroms
orientation_refinement_config:
  enabled: true
  base_grid_method: uniform 
  psi_step: 1.0   # psi in degrees
  theta_step: 1.0   # theta and phi in degrees
  rotation_axis_euler_angles: [76.2, 60.02, 0.0] # This is the rotation axis
  phi_min: 0.0
  phi_max: 0.0
  theta_min: -8.0
  theta_max: 2.0
  psi_min: 0.0
  psi_max: 0.0
  search_roll_axis: false
  roll_axis: [0.0,1.0] # [x,y] This defines the roll axis (orthogonal to the rotation axis). None means search
  roll_step: 2.0 
defocus_refinement_config:
  enabled: false
  defocus_max:  100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_min: -100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_step: 20.0   # in Angstroms
preprocessing_filters:
  whitening_filter:
    do_power_spectrum: true
    enabled: true
    max_freq: 1.0  # In terms of Nyquist frequency
    num_freq_bins: null
  bandpass_filter:
    enabled: false
    falloff: null
    high_freq_cutoff: null
    low_freq_cutoff: null
computational_config:
  gpu_ids: 
    - 0
    - 1
    - 2
    - 3
  num_cpus: 8

```

In [None]:
#!/bin/bash

# Create logs directory
mkdir -p logs

# Print current shell and environment before activation
echo "=== ENVIRONMENT BEFORE ACTIVATION ==="
echo "Current shell: $SHELL"
echo "Current conda environments:"
conda env list
echo "Current Python: $(which python)"
echo "Current Python version: $(python --version 2>&1)"
echo "======================================"

# Activate leopard-em conda environment 
echo "=== ACTIVATING CONDA ENVIRONMENT ==="
source $(conda info --base)/etc/profile.d/conda.sh
conda activate leopard-em
ACTIVATION_STATUS=$?

# Check if activation succeeded
if [ $ACTIVATION_STATUS -ne 0 ]; then
    echo "ERROR: Failed to activate the leopard-em environment"
    echo "Available environments:"
    conda env list
    exit 1
fi

# Print environment details after activation
echo "=== ENVIRONMENT AFTER ACTIVATION ==="
echo "Active conda environment: $CONDA_PREFIX"
echo "Python interpreter: $(which python)"
echo "Python version: $(python --version 2>&1)"
echo "Conda packages in environment:"
conda list | grep -E 'program|leopard'
echo "======================================"


# Set up paths (update these paths for your project)
PROJECT_DIR="/home/data/jdickerson/Leopard-EM_paper_data/CHX_data"
MICROGRAPHS_DIR="${PROJECT_DIR}/all_mgraphs"
OUTPUT_DIR="${PROJECT_DIR}/results_constrained_step3"
LARGE_RESULTS_DIR="${PROJECT_DIR}/results_constrained_step1"
SMALL_RESULTS_DIR="${PROJECT_DIR}/results_match_tm_40S-body"
TEMPLATE_YAML="${PROJECT_DIR}/configs/constrained_search_config_SSU-body_step3.yaml"
TEMPLATE_DIR="/home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc"
LARGE_SUFFIX="_constrained_results.csv"
SMALL_SUFFIX="_results.csv"
SCRIPT_PATH="${PROJECT_DIR}/process_all_micrographs_constrained.py"

# Create results directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run the processing script
python "${SCRIPT_PATH}" \
  --micrographs-dir "${MICROGRAPHS_DIR}" \
  --template-yaml "${TEMPLATE_YAML}" \
  --large-results-dir "${LARGE_RESULTS_DIR}" \
  --small-results-dir "${SMALL_RESULTS_DIR}" \
  --template-volume "${TEMPLATE_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --large-suffix "${LARGE_SUFFIX}" \
  --small-suffix "${SMALL_SUFFIX}" \
  --gpus "0,1" \
  --batch-size 64 \
  --false-positives 0.005

echo "All micrographs processed"

The next step was a finer search around both angles simultaneously.

In [3]:
from IPython.display import Markdown, display

# Read the YAML file
with open("configs/constrained_search_config_SSU-body_step4.yaml") as file:
    yaml_content = file.read()

# Display as markdown code block
display(Markdown(f"```yaml\n{yaml_content}\n```"))

```yaml
###################################################
#### Constrained Search configuration example #####
###################################################
# An example YAML configuration to modify.
# Call `ConstrainedSearchManager.from_yaml(path)` to load this configuration.
template_volume_path: /home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc # Volume of small particle
particle_stack_reference: # This is from the large particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
particle_stack_constrained: # This is from the small particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
centre_vector: [88.023109, 52.080261, 45.528008] # Vector from large particle to small particle in Angstroms
orientation_refinement_config:
  enabled: true
  base_grid_method: uniform 
  psi_step: 0.5   # psi in degrees
  theta_step: 0.5   # theta and phi in degrees
  rotation_axis_euler_angles: [76.2, 60.02, 0.0] # This is the rotation axis
  phi_min: 0.0
  phi_max: 0.0
  theta_min: -5.0
  theta_max: 5.0
  psi_min: -5.0
  psi_max: 5.0
  search_roll_axis: false
  roll_axis: [0.0,1.0] # [x,y] This defines the roll axis (orthogonal to the rotation axis). None means search
  roll_step: 2.0 
defocus_refinement_config:
  enabled: false
  defocus_max:  100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_min: -100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_step: 20.0   # in Angstroms
preprocessing_filters:
  whitening_filter:
    do_power_spectrum: true
    enabled: true
    max_freq: 1.0  # In terms of Nyquist frequency
    num_freq_bins: null
  bandpass_filter:
    enabled: false
    falloff: null
    high_freq_cutoff: null
    low_freq_cutoff: null
computational_config:
  gpu_ids: 
    - 0
    - 1
    - 2
    - 3
  num_cpus: 8

```
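
To get a feel for how the search size changes between steps, the sketch below counts the orientations implied by the `psi` and `theta` ranges in the three configurations shown so far. It assumes the grid is a plain product of evenly spaced values with both endpoints included; Leopard-EM's actual grid construction may differ, and `n_grid_points` is a hypothetical helper.

```python
import math


def n_grid_points(vmin: float, vmax: float, step: float) -> int:
    """Number of samples from vmin to vmax (inclusive) at the given step."""
    if vmax == vmin:
        return 1  # axis is not searched
    # Small tolerance guards against floating-point round-off in the division
    return math.floor((vmax - vmin) / step + 1e-9) + 1


# (psi_min, psi_max, psi_step, theta_min, theta_max, theta_step),
# copied from the step 1, step 3 and step 4 configurations
configs = {
    "step1": (-13.0, 2.5, 1.0, 0.0, 0.0, 1.0),
    "step3": (0.0, 0.0, 1.0, -8.0, 2.0, 1.0),
    "step4": (-5.0, 5.0, 0.5, -5.0, 5.0, 0.5),
}

for name, (p0, p1, dp, t0, t1, dt) in configs.items():
    n = n_grid_points(p0, p1, dp) * n_grid_points(t0, t1, dt)
    print(f"{name}: {n} orientations per reference particle")
```

Under these assumptions the single-axis searches evaluate 16 and 11 orientations, while the combined search at 0.5 degree steps evaluates 441.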

In [None]:
#!/bin/bash

# Print current shell and environment before activation
echo "=== ENVIRONMENT BEFORE ACTIVATION ==="
echo "Current shell: $SHELL"
echo "Current conda environments:"
conda env list
echo "Current Python: $(which python)"
echo "Current Python version: $(python --version 2>&1)"
echo "======================================"

# Activate leopard-em conda environment 
echo "=== ACTIVATING CONDA ENVIRONMENT ==="
source $(conda info --base)/etc/profile.d/conda.sh
conda activate leopard-em
ACTIVATION_STATUS=$?

# Check if activation succeeded
if [ $ACTIVATION_STATUS -ne 0 ]; then
    echo "ERROR: Failed to activate the leopard-em environment"
    echo "Available environments:"
    conda env list
    exit 1
fi

# Print environment details after activation
echo "=== ENVIRONMENT AFTER ACTIVATION ==="
echo "Active conda environment: $CONDA_PREFIX"
echo "Python interpreter: $(which python)"
echo "Python version: $(python --version 2>&1)"
echo "Conda packages in environment:"
conda list | grep -E 'program|leopard'
echo "======================================"


# Set up paths (update these paths for your project)
PROJECT_DIR="/home/data/jdickerson/Leopard-EM_paper_data/CHX_data"
MICROGRAPHS_DIR="${PROJECT_DIR}/all_mgraphs"
OUTPUT_DIR="${PROJECT_DIR}/results_constrained_step4"
LARGE_RESULTS_DIR="${PROJECT_DIR}/results_constrained_step3"
SMALL_RESULTS_DIR="${PROJECT_DIR}/results_match_tm_40S-body"
TEMPLATE_YAML="${PROJECT_DIR}/configs/constrained_search_config_SSU-body_step4.yaml"
TEMPLATE_DIR="/home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc"
LARGE_SUFFIX="_constrained_results.csv"
SMALL_SUFFIX="_results.csv"
SCRIPT_PATH="${PROJECT_DIR}/process_all_micrographs_constrained.py"

# Create results directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run the processing script
python "${SCRIPT_PATH}" \
  --micrographs-dir "${MICROGRAPHS_DIR}" \
  --template-yaml "${TEMPLATE_YAML}" \
  --large-results-dir "${LARGE_RESULTS_DIR}" \
  --small-results-dir "${SMALL_RESULTS_DIR}" \
  --template-volume "${TEMPLATE_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --large-suffix "${LARGE_SUFFIX}" \
  --small-suffix "${SMALL_SUFFIX}" \
  --gpus "0,1" \
  --batch-size 64 \
  --false-positives 0.005

echo "All micrographs processed"
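
The path variables in the submission scripts show how the steps are chained: each constrained search reads the previous step's output directory as its reference (large particle) input, while the small particle input stays fixed. A minimal sketch of that bookkeeping, using the directory names and suffixes from the scripts above (the `StepIO` dataclass and `chain_steps` helper are hypothetical, for illustration only):

```python
from dataclasses import dataclass


@dataclass
class StepIO:
    name: str
    large_results_dir: str  # reference particles read by this step
    large_suffix: str       # file suffix of those reference results
    output_dir: str         # where this step writes its results


def chain_steps(first_input_dir: str, first_suffix: str, step_names: list[str]) -> list[StepIO]:
    """Build the input/output layout where each step feeds the next."""
    steps = []
    prev_dir, prev_suffix = first_input_dir, first_suffix
    for name in step_names:
        out_dir = f"results_constrained_{name}"
        steps.append(StepIO(name, prev_dir, prev_suffix, out_dir))
        # Every constrained search writes files ending in this suffix
        prev_dir, prev_suffix = out_dir, "_constrained_results.csv"
    return steps


steps = chain_steps("results_refine_tm_60S", "_refined_results.csv", ["step1", "step3", "step4"])
for step in steps:
    print(f"{step.name}: {step.large_results_dir}/*{step.large_suffix} -> {step.output_dir}")
```

Only the first step reads the refined 60S results; later steps read `_constrained_results.csv` files from the step before them.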

The final search is finer still in both angles and is combined with a defocus search.

In [4]:
from IPython.display import Markdown, display

# Read the YAML file
with open("configs/constrained_search_config_SSU-body_step5.yaml") as file:
    yaml_content = file.read()

# Display as markdown code block
display(Markdown(f"```yaml\n{yaml_content}\n```"))

```yaml
###################################################
#### Constrained Search configuration example #####
###################################################
# An example YAML configuration to modify.
# Call `RefineTemplateManager.from_yaml(path)` to load this configuration.
template_volume_path: /home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc # Volume of small particle
particle_stack_reference: # This is from the large particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
particle_stack_constrained: # This is from the small particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
centre_vector: [88.023109, 52.080261, 45.528008] # Vector from large particle to small particle in Angstroms
orientation_refinement_config:
  enabled: true
  base_grid_method: uniform 
  psi_step: 0.1   # psi in degrees
  theta_step: 0.1   # theta and phi in degrees
  rotation_axis_euler_angles: [76.2, 60.02, 0.0] # This is the rotation axis
  phi_min: 0.0
  phi_max: 0.0
  theta_min: -0.5
  theta_max: 0.5
  psi_min: -0.5
  psi_max: 0.5
  search_roll_axis: false
  roll_axis: [0.0,1.0] # [x,y] This defines the roll axis (orthogonal to the rotation axis). None means search
  roll_step: 2.0 
defocus_refinement_config:
  enabled: true
  defocus_max:  100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_min: -100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_step: 20.0   # in Angstroms
preprocessing_filters:
  whitening_filter:
    do_power_spectrum: true
    enabled: true
    max_freq: 1.0  # In terms of Nyquist frequency
    num_freq_bins: null
  bandpass_filter:
    enabled: false
    falloff: null
    high_freq_cutoff: null
    low_freq_cutoff: null
computational_config:
  gpu_ids: 
    - 0
    - 1
    - 2
    - 3
  num_cpus: 8

```
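
To get a feel for the size of this final search, the ranges in the configuration above can be turned into a rough count of search points per particle. This is only a back-of-the-envelope sketch: it assumes every range is sampled on a simple inclusive linear grid, which may not be how Leopard-EM's `uniform` base grid actually samples orientations. The helper `n_points` is hypothetical and not part of Leopard-EM. The count actually used for thresholding is the `num_correlations` value written to each `*_results_parameters.csv` file.

```python
def n_points(lo: float, hi: float, step: float) -> int:
    """Number of samples on an inclusive linear grid from lo to hi (assumption)."""
    return int(round((hi - lo) / step)) + 1


# Ranges copied from constrained_search_config_SSU-body_step5.yaml
n_psi = n_points(-0.5, 0.5, 0.1)  # psi_min, psi_max, psi_step
n_theta = n_points(-0.5, 0.5, 0.1)  # theta_min, theta_max, theta_step
n_phi = n_points(0.0, 0.0, 0.1)  # phi_min == phi_max, so a single value
n_defocus = n_points(-100.0, 100.0, 20.0)  # defocus_min, defocus_max, defocus_step

# Naive estimate of cross-correlations per particle under the assumption above
total = n_psi * n_theta * n_phi * n_defocus
print(f"psi={n_psi}, theta={n_theta}, phi={n_phi}, defocus={n_defocus}")
print(f"Naive search points per particle: {total}")
```

Under this assumption the defocus search multiplies the orientation search by the number of defocus planes, which is why the cumulative correlation count (and therefore the z-score threshold) grows with each refinement step.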

In [None]:
#!/bin/bash

# Print current shell and environment before activation
echo "=== ENVIRONMENT BEFORE ACTIVATION ==="
echo "Current shell: $SHELL"
echo "Current conda environments:"
conda env list
echo "Current Python: $(which python)"
echo "Current Python version: $(python --version 2>&1)"
echo "======================================"

# Activate leopard-em conda environment 
echo "=== ACTIVATING CONDA ENVIRONMENT ==="
source $(conda info --base)/etc/profile.d/conda.sh
conda activate leopard-em
ACTIVATION_STATUS=$?

# Check if activation succeeded
if [ $ACTIVATION_STATUS -ne 0 ]; then
    echo "ERROR: Failed to activate the leopard-em environment"
    echo "Available environments:"
    conda env list
    exit 1
fi

# Print environment details after activation
echo "=== ENVIRONMENT AFTER ACTIVATION ==="
echo "Active conda environment: $CONDA_PREFIX"
echo "Python interpreter: $(which python)"
echo "Python version: $(python --version 2>&1)"
echo "Conda packages in environment:"
conda list | grep -E 'program|leopard'
echo "======================================"


# Set up paths (update these paths for your project)
PROJECT_DIR="/home/data/jdickerson/Leopard-EM_paper_data/CHX_data"
MICROGRAPHS_DIR="${PROJECT_DIR}/all_mgraphs"
OUTPUT_DIR="${PROJECT_DIR}/results_constrained_step5"
LARGE_RESULTS_DIR="${PROJECT_DIR}/results_constrained_step4"
SMALL_RESULTS_DIR="${PROJECT_DIR}/results_match_tm_40S-body"
TEMPLATE_YAML="${PROJECT_DIR}/configs/constrained_search_config_SSU-body_step5.yaml"
TEMPLATE_DIR="/home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc"
LARGE_SUFFIX="_constrained_results.csv"
SMALL_SUFFIX="_results.csv"
SCRIPT_PATH="${PROJECT_DIR}/process_all_micrographs_constrained.py"

# Create results directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run the processing script
python "${SCRIPT_PATH}" \
  --micrographs-dir "${MICROGRAPHS_DIR}" \
  --template-yaml "${TEMPLATE_YAML}" \
  --large-results-dir "${LARGE_RESULTS_DIR}" \
  --small-results-dir "${SMALL_RESULTS_DIR}" \
  --template-volume "${TEMPLATE_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --large-suffix "${LARGE_SUFFIX}" \
  --small-suffix "${SMALL_SUFFIX}" \
  --gpus "0,1" \
  --batch-size 64 \
  --false-positives 0.005

echo "All micrographs processed"

We can now process the results sequentially to get the final output particles.
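
The key idea in the sequential processing is that the z-score threshold for each micrograph is recomputed from the cumulative number of correlations over all steps so far, so the threshold rises as the search space grows. The sketch below illustrates this in isolation. It swaps the `scipy.special.erfcinv` expression for the mathematically equivalent standard-library call `statistics.NormalDist().inv_cdf`, so that it runs without SciPy. The first count (1296) matches the step 1 log output, while the counts for the later steps are hypothetical placeholders.

```python
from statistics import NormalDist


def zscore_cutoff(num_ccg: int, false_positives: float = 0.005) -> float:
    """Z-score whose upper-tail probability is false_positives / num_ccg.

    Equivalent to sqrt(2) * erfcinv(2 * false_positives / num_ccg).
    """
    return NormalDist().inv_cdf(1.0 - false_positives / num_ccg)


# 1296 is the step 1 count from the log; later values are hypothetical
per_step_correlations = [1296, 1296, 1296]

cumulative = 0
thresholds = []
for step_num, n_corr in enumerate(per_step_correlations, start=1):
    cumulative += n_corr  # correlations accumulate across steps
    threshold = zscore_cutoff(cumulative)
    thresholds.append(threshold)
    print(f"step {step_num}: {cumulative} correlations, threshold {threshold:.4f}")
```

A particle that passes the threshold in an early step therefore has to clear a slightly higher bar in each later step to be updated with the refined parameters.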

In [5]:
"""Processes results files from multiple directories sequentially."""

import argparse
import glob
import os
from collections import defaultdict

import numpy as np
import pandas as pd
from scipy.special import (
    erfcinv,  # Required for the gaussian_noise_zscore_cutoff function
)

import sys

# Emulate command-line arguments so that main() can be run from inside the notebook
sys.argv = [
    "process_sequential_results.py",
    "results_constrained_step1",
    "results_constrained_step3",
    "results_constrained_step4",
    "results_constrained_step5",
    "--output",
    "results_all_steps_4",
]


def gaussian_noise_zscore_cutoff(num_ccg: int, false_positives: float = 0.005) -> float:
    """Determines the z-score cutoff based on Gaussian noise model and number of pixels.

    NOTE: This procedure assumes that the z-scores (normalized maximum intensity
    projections) are distributed according to a standard normal distribution. Here,
    this model is used to find the cutoff value such that there is at most
    'false_positives' number of false positives in all of the pixels.

    Parameters
    ----------
    num_ccg : int
        Total number of cross-correlograms calculated during template matching. Product
        of the number of pixels, number of defocus values, and number of orientations.
    false_positives : float, optional
        Expected number of false positives to allow over all cross-correlations.
        Default is 0.005, i.e. on average one false positive per 200 searches.

    Returns
    -------
    float
        Z-score cutoff.
    """
    tmp = erfcinv(2.0 * false_positives / num_ccg)
    tmp *= np.sqrt(2.0)

    return float(tmp)


def get_micrograph_id(filename: str) -> str:
    """
    Extract micrograph ID from filename.

    Parameters
    ----------
    filename : str
        Filename to extract micrograph ID from

    Returns
    -------
    micrograph_id : str
        Micrograph ID
    """
    base_name = os.path.basename(filename)
    # Extract the part before _results.csv
    parts = base_name.split("_results.csv")[0]
    return parts


def process_directories_sequentially(
    directory_list: list[str],
    output_base_dir: str,
    false_positive_rate: float = 0.005,
) -> dict[str, pd.DataFrame]:
    """
    Process directories sequentially.

    Parameters
    ----------
    directory_list : list
        Ordered list of directories to process
    output_base_dir : str
        Base directory to store output files
    false_positive_rate : float
        False positive rate to use for threshold calculation

    Returns
    -------
    all_particles : dict
        Dictionary of micrograph IDs as keys and df as values
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_base_dir, exist_ok=True)

    # Dictionary to store particles from all steps
    # Key: micrograph_id
    # Value: DataFrame of particles
    all_particles = {}

    # Dictionary to track which particles were found in which step
    # Key: (micrograph_id, particle_index)
    # Value: last step where this particle was found
    particle_step_map = {}

    # Dictionary to track total correlations per micrograph
    # Key: micrograph_id
    # Value: total correlations for this micrograph across all steps
    micrograph_correlations = defaultdict(int)

    # Dictionary to track thresholds per micrograph
    # Key: micrograph_id
    # Value: threshold for this micrograph in the current step
    micrograph_thresholds = {}

    # Process each directory in order
    for step_idx, directory in enumerate(directory_list):
        step_num = step_idx + 1
        print(f"\nProcessing Step {step_num}: {directory}")

        # Create step output directory
        step_output_dir = os.path.join(output_base_dir, f"step_{step_num}")
        os.makedirs(step_output_dir, exist_ok=True)

        # Find all results.csv files in the directory
        results_files = glob.glob(
            os.path.join(directory, "**", "*_results.csv"), recursive=True
        )

        if not results_files:
            print(f"  Warning: No results files found in {directory}")
            continue

        print(f"  Found {len(results_files)} results files")

        # Dictionary to store parameters from each micrograph
        step_micrograph_parameters = {}

        # First, find and read all parameters files to update correlation counts
        for results_file in results_files:
            micrograph_id = get_micrograph_id(results_file)
            params_file = results_file.replace(
                "_results.csv", "_results_parameters.csv"
            )

            if os.path.exists(params_file):
                try:
                    # Read the parameters file
                    params_df = pd.read_csv(params_file)
                    if not params_df.empty:
                        step_micrograph_parameters[micrograph_id] = params_df.iloc[0]

                        # Add to total correlations for this micrograph
                        if "num_correlations" in params_df.columns:
                            correlations = int(params_df.iloc[0]["num_correlations"])
                            micrograph_correlations[micrograph_id] += correlations
                            print(
                                f"  {micrograph_id}: Added {correlations} correlations "
                                f"(total: {micrograph_correlations[micrograph_id]})"
                            )
                except Exception as e:
                    print(f"  Error reading parameters file {params_file}: {e}")
            else:
                print(f"  Warning: Parameters file not found for {results_file}")

        # Calculate threshold for each micrograph based on its cumulative correlations
        for micrograph_id, total_correlations in micrograph_correlations.items():
            threshold = gaussian_noise_zscore_cutoff(
                total_correlations, false_positive_rate
            )
            micrograph_thresholds[micrograph_id] = threshold
            print(
                f"  Threshold for {micrograph_id} in step {step_num}: {threshold:.4f} "
                f"(based on {total_correlations} total correlations)"
            )

        # Process each results file
        for results_file in results_files:
            micrograph_id = get_micrograph_id(results_file)

            try:
                # Read the results file
                results_df = pd.read_csv(results_file)

                if results_df.empty:
                    print(f"  Warning: Empty results file {results_file}")
                    continue

                # Get the threshold for this micrograph
                if micrograph_id not in micrograph_thresholds:
                    print(
                        f"  Warning: No correlation information for {micrograph_id}, "
                        "using default threshold"
                    )
                    # Try to use correlations from the current step's parameters file
                    if (
                        micrograph_id in step_micrograph_parameters
                        and "num_correlations"
                        in step_micrograph_parameters[micrograph_id]
                    ):
                        correlations = int(
                            step_micrograph_parameters[micrograph_id][
                                "num_correlations"
                            ]
                        )
                        micrograph_correlations[micrograph_id] = correlations
                        threshold = gaussian_noise_zscore_cutoff(
                            correlations, false_positive_rate
                        )
                        micrograph_thresholds[micrograph_id] = threshold
                        print(
                            f"  Using threshold {threshold:.4f} for {micrograph_id} "
                            f"based on {correlations} correlations"
                        )
                    else:
                        # If no information at all, use the median of other thresholds
                        # or a reasonable default
                        if micrograph_thresholds:
                            threshold = np.median(list(micrograph_thresholds.values()))
                            print(
                                f"  Using median threshold {threshold:.4f} for "
                                f"{micrograph_id}"
                            )
                        else:
                            #  Default if no other information is available
                            threshold = 5.0
                            print(
                                f"  Using default threshold {threshold:.4f} for "
                                f"{micrograph_id}"
                            )
                        micrograph_thresholds[micrograph_id] = threshold
                else:
                    threshold = micrograph_thresholds[micrograph_id]

                # Check if refined_scaled_mip column exists
                if "refined_scaled_mip" not in results_df.columns:
                    print(
                        f"  Warning: refined_scaled_mip not found in {results_file},"
                        " using scaled_mip instead"
                    )
                    compare_col = "scaled_mip"
                else:
                    compare_col = "refined_scaled_mip"

                # Filter particles above threshold using the appropriate column
                above_threshold_df = results_df[
                    results_df[compare_col] > threshold
                ].copy()

                if above_threshold_df.empty:
                    print(f"  No particles above threshold in {results_file}")
                    continue

                # Print stats
                print(
                    f"{micrograph_id}: {len(above_threshold_df)} of {len(results_df)}"
                    f" particles above threshold (using {compare_col})"
                )

                # Add a step column to track which step this is from
                above_threshold_df["step"] = step_num

                # If this is the first step, just add all particles above threshold
                if step_num == 1:
                    all_particles[micrograph_id] = above_threshold_df

                    # Update particle step map
                    for idx in above_threshold_df["particle_index"]:
                        particle_step_map[(micrograph_id, idx)] = step_num
                else:
                    # If this micrograph was not seen before, add all particles
                    if micrograph_id not in all_particles:
                        all_particles[micrograph_id] = above_threshold_df

                        # Update particle step map
                        for idx in above_threshold_df["particle_index"]:
                            particle_step_map[(micrograph_id, idx)] = step_num
                    else:
                        # For existing micrographs, handle particles differently
                        existing_df = all_particles[micrograph_id]

                        # Create a new DataFrame to store updated particles
                        updated_df = existing_df.copy()

                        # For each particle in the new results
                        for _, particle in above_threshold_df.iterrows():
                            particle_idx = particle["particle_index"]

                            # Check if this particle exists previously
                            existing_particle = existing_df[
                                existing_df["particle_index"] == particle_idx
                            ]

                            if len(existing_particle) > 0:
                                # Particle exists, update parameters
                                # Find the index in the updated_df
                                idx_to_update = updated_df.index[
                                    updated_df["particle_index"] == particle_idx
                                ].tolist()[0]

                                # Check if original offset columns exist
                                offset_cols = [
                                    "original_offset_phi",
                                    "original_offset_theta",
                                    "original_offset_psi",
                                ]

                                # Add original offset columns from step 1
                                for col in offset_cols:
                                    if col not in updated_df.columns:
                                        updated_df[col] = 0.0

                                # Add offset values from current step
                                for col in offset_cols:
                                    # Add particle's offset to existing offset
                                    if col in particle and pd.notna(particle[col]):
                                        updated_df.at[idx_to_update, col] += particle[
                                            col
                                        ]

                                # Update other parameters
                                for col in particle.index:
                                    if col not in offset_cols and pd.notna(
                                        particle[col]
                                    ):
                                        updated_df.at[idx_to_update, col] = particle[
                                            col
                                        ]

                                # Update step
                                updated_df.at[idx_to_update, "step"] = step_num
                                particle_step_map[(micrograph_id, particle_idx)] = (
                                    step_num
                                )
                            else:
                                # New particle, add it to the DataFrame
                                updated_df = pd.concat(
                                    [updated_df, pd.DataFrame([particle])],
                                    ignore_index=True,
                                )
                                particle_step_map[(micrograph_id, particle_idx)] = (
                                    step_num
                                )

                        # Update the all_particles dictionary
                        all_particles[micrograph_id] = updated_df

            except Exception as e:
                print(f"  Error processing results file {results_file}: {e}")

        # Save intermediate results for this step
        for micrograph_id, particles_df in all_particles.items():
            # Only save particles found or updated in this step
            step_particles = particles_df[particles_df["step"] == step_num]

            if not step_particles.empty:
                output_file = os.path.join(
                    step_output_dir, f"{micrograph_id}_results_above_threshold.csv"
                )
                step_particles.to_csv(output_file, index=False)
                print(
                    f"  Saved {len(step_particles)} particles for {micrograph_id} "
                    f"in step {step_num}"
                )

    # Save final results after all steps
    final_output_dir = os.path.join(output_base_dir, "final_results")
    os.makedirs(final_output_dir, exist_ok=True)

    # Save summary of total particles per micrograph
    summary_data = []

    for micrograph_id, particles_df in all_particles.items():
        output_file = os.path.join(
            final_output_dir, f"{micrograph_id}_results_above_threshold.csv"
        )
        particles_df.to_csv(output_file, index=False)

        # Get the final threshold for this micrograph
        final_threshold = micrograph_thresholds.get(micrograph_id, "N/A")
        total_correlations = micrograph_correlations.get(micrograph_id, 0)

        # Create summary data
        n_particles = len(particles_df)
        summary_data.append(
            {
                "micrograph_id": micrograph_id,
                "total_particles": n_particles,
                "total_correlations": total_correlations,
                "final_threshold": final_threshold,
            }
        )

        print(
            f"Saved {n_particles} final particles for {micrograph_id} "
            f"(threshold: {final_threshold}, correlations: {total_correlations})"
        )

    # Save summary
    summary_df = pd.DataFrame(summary_data)
    summary_df.to_csv(
        os.path.join(final_output_dir, "processing_summary.csv"), index=False
    )

    print(f"\nProcessing complete. Final results saved to {final_output_dir}")

    # Print total particles
    total_particles = sum(len(df) for df in all_particles.values())
    print(f"Total particles across all micrographs: {total_particles}")

    return all_particles


def main() -> None:
    """Main function to process results files sequentially."""
    parser = argparse.ArgumentParser(
        description="Process results files from multiple directories sequentially"
    )
    parser.add_argument(
        "directories", nargs="+", help="Ordered list of directories to process"
    )
    parser.add_argument(
        "--output", "-o", required=True, help="Output directory for results"
    )
    parser.add_argument(
        "--false-positive-rate",
        "-f",
        type=float,
        default=0.005,
        help="False positive rate for threshold calculation (default: 0.005)",
    )

    args = parser.parse_args()

    # Check if all directories exist
    for directory in args.directories:
        if not os.path.exists(directory):
            print(f"Error: Directory {directory} does not exist!")
            return

    # Process directories
    process_directories_sequentially(
        args.directories, args.output, args.false_positive_rate
    )


if __name__ == "__main__":
    main()



Processing Step 1: results_constrained_step1
  Found 188 results files
  158_Sep13_13.03.40_267_1_constrained: Added 1296 correlations (total: 1296)
  140_Sep13_11.09.36_237_1_constrained: Added 1296 correlations (total: 1296)
  113_Sep12_19.20.31_197_1_constrained: Added 1296 correlations (total: 1296)
  151_Sep13_11.42.25_259_1_constrained: Added 1296 correlations (total: 1296)
  107_Sep12_18.57.58_187_1_constrained: Added 1296 correlations (total: 1296)
  142_Sep13_11.01.22_241_1_constrained: Added 1296 correlations (total: 1296)
  79_Sep12_16.03.49_167_1_constrained: Added 1296 correlations (total: 1296)
  100_Sep12_18.37.00_181_1_constrained: Added 1296 correlations (total: 1296)
  163_Sep13_13.21.12_279_1_constrained: Added 1296 correlations (total: 1296)
  153_Sep13_11.48.36_263_1_constrained: Added 1296 correlations (total: 1296)
  141_Sep13_10.57.04_239_1_constrained: Added 1296 correlations (total: 1296)
  82_Sep12_16.17.55_173_1_constrained: Added 1296 correlations (total: 