## Data processing of untreated yeast cells for Leopard-EM manuscript

This notebook walks through the methods used to generate the 2DTM data from untreated cells published in the 2025 Leopard-EM manuscript.

### Running match template

Match template jobs were run on the 60S and 40S subunits.
The maps and models used are the same as for the constrained search tutorial and can be found at [10.5281/zenodo.15720143](https://doi.org/10.5281/zenodo.15720143).

We ran match template on our cluster with the script run_match_template_leopard.sh, which uses the config match_template_config_60S_base.yaml, and the run script process_all_micrographs.py.

In [None]:
sbatch run_match_template_leopard.sh

## Refine template for the LSU
After running match template, the results were copied to a local workstation for further processing.
The first step was to update the file paths in the results files with the new paths.

In [3]:
#!/usr/bin/env python3

import os
import sys
import glob
import pandas as pd

def replace_paths_in_csv(input_dir):
    """
    Read all _results.csv files in the specified directory, replace path strings,
    and write back to the same files.
    
    Args:
        input_dir (str): Directory containing _results.csv files
    """
    old_path = "/global/scratch/users/jdickerson/2dtm_test_data/"
    new_path = "/data/papers/Leopard-EM_paper_data/"
    
    # Find all _results.csv files in the specified directory
    csv_files = glob.glob(os.path.join(input_dir, "**/*_results.csv"), recursive=True)
    
    if not csv_files:
        print(f"No _results.csv files found in {input_dir}")
        return
    
    print(f"Found {len(csv_files)} _results.csv files")
    
    for csv_file in csv_files:
        print(f"Processing: {csv_file}")
        
        try:
            # Read the CSV file using pandas
            df = pd.read_csv(csv_file)
            
            # Replace path strings in all columns
            for col in df.columns:
                if df[col].dtype == 'object':  # Only process string columns
                    # regex=False treats the path as a literal string, not a pattern
                    df[col] = df[col].astype(str).str.replace(old_path, new_path, regex=False)
            
            # Write the modified dataframe back to the file
            df.to_csv(csv_file, index=False)
            print(f"  Successfully updated paths in {csv_file}")
        
        except Exception as e:
            print(f"  Error processing {csv_file}: {e}")
    
    print("Path replacement completed!")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python replace_paths.py <directory_path>")
        sys.exit(1)
    
    input_dir = "results_match_tm_60S_noB/"
    
    if not os.path.isdir(input_dir):
        print(f"Error: {input_dir} is not a valid directory")
        sys.exit(1)
    
    replace_paths_in_csv(input_dir)

Found 79 _results.csv files
Processing: results_match_tm_60S_noB/xenon_276_000_0.0_DWS_results.csv
  Successfully updated paths in results_match_tm_60S_noB/xenon_276_000_0.0_DWS_results.csv
Processing: results_match_tm_60S_noB/xenon_266_000_0.0_DWS_results.csv
  Successfully updated paths in results_match_tm_60S_noB/xenon_266_000_0.0_DWS_results.csv
Processing: results_match_tm_60S_noB/xenon_270_000_0.0_DWS_results.csv
  Successfully updated paths in results_match_tm_60S_noB/xenon_270_000_0.0_DWS_results.csv
Processing: results_match_tm_60S_noB/xenon_222_000_0.0_DWS_results.csv
  Successfully updated paths in results_match_tm_60S_noB/xenon_222_000_0.0_DWS_results.csv
Processing: results_match_tm_60S_noB/xenon_288_000_0.0_DWS_results.csv
  Successfully updated paths in results_match_tm_60S_noB/xenon_288_000_0.0_DWS_results.csv
Processing: results_match_tm_60S_noB/xenon_284_000_0.0_DWS_results.csv
  Successfully updated paths in results_match_tm_60S_noB/xenon_284_000_0.0_DWS_results.csv


In [4]:
#!/usr/bin/env python3

import os
import sys
import glob
import pandas as pd

def replace_paths_in_csv(input_dir):
    """
    Read all _results.csv files in the specified directory, replace path strings,
    and write back to the same files.
    
    Args:
        input_dir (str): Directory containing _results.csv files
    """
    old_path = "/global/scratch/users/jdickerson/2dtm_test_data/"
    new_path = "/data/papers/Leopard-EM_paper_data/"
    
    # Find all _results.csv files in the specified directory
    csv_files = glob.glob(os.path.join(input_dir, "**/*_results.csv"), recursive=True)
    
    if not csv_files:
        print(f"No _results.csv files found in {input_dir}")
        return
    
    print(f"Found {len(csv_files)} _results.csv files")
    
    for csv_file in csv_files:
        print(f"Processing: {csv_file}")
        
        try:
            # Read the CSV file using pandas
            df = pd.read_csv(csv_file)
            
            # Replace path strings in all columns
            for col in df.columns:
                if df[col].dtype == 'object':  # Only process string columns
                    # regex=False treats the path as a literal string, not a pattern
                    df[col] = df[col].astype(str).str.replace(old_path, new_path, regex=False)
            
            # Write the modified dataframe back to the file
            df.to_csv(csv_file, index=False)
            print(f"  Successfully updated paths in {csv_file}")
        
        except Exception as e:
            print(f"  Error processing {csv_file}: {e}")
    
    print("Path replacement completed!")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python replace_paths.py <directory_path>")
        sys.exit(1)
    
    input_dir = "results_match_tm_40S-body_noB/"
    
    if not os.path.isdir(input_dir):
        print(f"Error: {input_dir} is not a valid directory")
        sys.exit(1)
    
    replace_paths_in_csv(input_dir)

Found 79 _results.csv files
Processing: results_match_tm_40S-body_noB/xenon_276_000_0.0_DWS_results.csv
  Successfully updated paths in results_match_tm_40S-body_noB/xenon_276_000_0.0_DWS_results.csv
Processing: results_match_tm_40S-body_noB/xenon_266_000_0.0_DWS_results.csv
  Successfully updated paths in results_match_tm_40S-body_noB/xenon_266_000_0.0_DWS_results.csv
Processing: results_match_tm_40S-body_noB/xenon_270_000_0.0_DWS_results.csv
  Successfully updated paths in results_match_tm_40S-body_noB/xenon_270_000_0.0_DWS_results.csv
Processing: results_match_tm_40S-body_noB/xenon_222_000_0.0_DWS_results.csv
  Successfully updated paths in results_match_tm_40S-body_noB/xenon_222_000_0.0_DWS_results.csv
Processing: results_match_tm_40S-body_noB/xenon_288_000_0.0_DWS_results.csv
  Successfully updated paths in results_match_tm_40S-body_noB/xenon_288_000_0.0_DWS_results.csv
Processing: results_match_tm_40S-body_noB/xenon_284_000_0.0_DWS_results.csv
  Successfully updated paths in resu

We then ran the script run_refine_template_60S.sh, which uses the config refine_template_config_60S.yaml, and the run script process_all_micrographs_refine.py.

In [None]:
./run_refine_template_60S.sh

## Performing a constrained search for the SSU

We used the coordinates from the LSU to perform a constrained search for the SSU in 4 steps.
Since the models used were the same as in the constrained search tutorial, the center vector and rotation axis are the same as in that tutorial.

For the constrained search, we set a false-positive rate of 1 in every 200 LSU positions that we inspect for an SSU.
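
As a worked step, and assuming the Gaussian noise model used by the `gaussian_noise_zscore_cutoff` function later in this notebook, a rate of 1 in 200 corresponds to an expected number of false positives $f = 1/200 = 0.005$. Here $N$ denotes the total number of cross-correlations counted for the search (the symbol names are ours, chosen for illustration). Requiring $N \cdot P(Z > z) = f$ for a standard normal $Z$ gives the z-score cutoff:

$$
P(Z > z) = \tfrac{1}{2}\,\mathrm{erfc}\!\left(\frac{z}{\sqrt{2}}\right) = \frac{f}{N}
\quad\Longrightarrow\quad
z = \sqrt{2}\,\mathrm{erfcinv}\!\left(\frac{2f}{N}\right)
$$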

The first step is to add a placeholder row with the correct file paths to each 40S results file that contains no picks.

In [6]:
#!/usr/bin/env python3

import os
import csv
import pandas as pd
import glob
import re
from pathlib import Path
import sys

sys.argv = ["process_empty_csv_files.py", "/data/papers/Leopard-EM_paper_data/xe30kv/results_match_tm_40S-body_noB/"]

def get_file_paths_from_csv_name(csv_file_path, base_dir):
    """
    Generate the expected file paths based on the CSV filename pattern.
    
    Args:
        csv_file_path: Path to the CSV file
        base_dir: Base directory containing the files
        
    Returns:
        Dictionary with all the expected file paths
    """
    # Extract the base name from CSV filename
    # e.g., "xenon_235_000_0.0_DWS_results.csv" -> "xenon_235_000_0"
    csv_name = os.path.basename(csv_file_path)
    base_name = csv_name.replace("_results.csv", "").replace(".0_DWS", "")
    
    # Base directory for the result files (make it absolute)
    result_dir = os.path.abspath(os.path.dirname(csv_file_path))
    
    # Generate the expected file paths with absolute paths
    file_paths = {
        'micrograph_path': f"/data/papers/Leopard-EM_paper_data/xe30kv/all_mgraphs/{base_name}.0_DWS.mrc",
        'template_path': "/data/papers/Leopard-EM_paper_data/maps2/SSU-body_map_px0.936_bscale0.5.mrc",
        'mip_path': f"{result_dir}/{base_name}_output_mip.mrc",
        'scaled_mip_path': f"{result_dir}/{base_name}_output_scaled_mip.mrc",
        'psi_path': f"{result_dir}/{base_name}_output_orientation_psi.mrc",
        'theta_path': f"{result_dir}/{base_name}_output_orientation_theta.mrc",
        'phi_path': f"{result_dir}/{base_name}_output_orientation_phi.mrc",
        'defocus_path': f"{result_dir}/{base_name}_output_relative_defocus.mrc",
        'correlation_average_path': f"{result_dir}/{base_name}_output_correlation_average.mrc",
        'correlation_variance_path': f"{result_dir}/{base_name}_output_correlation_variance.mrc"
    }
    
    return file_paths

def create_default_row(file_paths, particle_index=0):
    """
    Create a default row with zeros for numerical values and correct file paths.
    
    Args:
        file_paths: Dictionary with file paths
        particle_index: Index for the particle (default 0)
        
    Returns:
        Dictionary representing a default row
    """
    default_row = {
        'Unnamed: 0': particle_index,
        'particle_index': particle_index,
        'mip': 0.0,
        'scaled_mip': 0.0,
        'correlation_mean': 0.0,
        'correlation_variance': 0.0,
        'total_correlations': 0,
        'pos_x': 0,
        'pos_y': 0,
        'pos_x_img': 0,
        'pos_y_img': 0,
        'pos_x_img_angstrom': 0.0,
        'pos_y_img_angstrom': 0.0,
        'phi': 0.0,
        'theta': 0.0,
        'psi': 0.0,
        'relative_defocus': 0.0,
        'defocus_u': 0.0,
        'defocus_v': 0.0,
        'astigmatism_angle': 0.0,
        'pixel_size': 0.936,  # Common value from the example
        'refined_pixel_size': 0.936,
        'voltage': 300.0,
        'spherical_aberration': 2.7,
        'amplitude_contrast_ratio': 0.07,
        'phase_shift': 0.0,
        'ctf_B_factor': 60.0,
        'micrograph_path': file_paths['micrograph_path'],
        'template_path': file_paths['template_path'],
        'mip_path': file_paths['mip_path'],
        'scaled_mip_path': file_paths['scaled_mip_path'],
        'psi_path': file_paths['psi_path'],
        'theta_path': file_paths['theta_path'],
        'phi_path': file_paths['phi_path'],
        'defocus_path': file_paths['defocus_path'],
        'correlation_average_path': file_paths['correlation_average_path'],
        'correlation_variance_path': file_paths['correlation_variance_path']
    }
    
    return default_row

def is_csv_empty(csv_file_path):
    """
    Check if a CSV file is empty (contains only headers).
    
    Args:
        csv_file_path: Path to the CSV file
        
    Returns:
        True if the file is empty (only headers), False otherwise
    """
    try:
        df = pd.read_csv(csv_file_path)
        return len(df) == 0
    except Exception as e:
        print(f"Error reading {csv_file_path}: {e}")
        return False

def is_csv_default_row(csv_file_path):
    """
    Check if a CSV file has exactly 1 row with the first ~5 columns as zeros 
    (indicating it's a default row that needs file path correction).
    
    Args:
        csv_file_path: Path to the CSV file
        
    Returns:
        True if the file has exactly 1 row with first columns as zeros, False otherwise
    """
    try:
        df = pd.read_csv(csv_file_path)
        if len(df) != 1:
            return False
        
        # Check if the first 5 columns are zeros or close to zero
        row = df.iloc[0]
        first_cols = ['Unnamed: 0', 'particle_index', 'mip', 'scaled_mip', 'correlation_mean']
        
        for col in first_cols:
            if col in row:
                if abs(row[col]) > 0.001:  # Allow for small floating point errors
                    return False
        
        return True
    except Exception as e:
        print(f"Error reading {csv_file_path}: {e}")
        return False

def update_file_paths_in_row(csv_file_path, file_paths, dry_run=False):
    """
    Update file paths in a CSV file that has a default row with incorrect relative paths.
    
    Args:
        csv_file_path: Path to the CSV file
        file_paths: Dictionary with correct file paths
        dry_run: If True, only print what would be done without making changes
        
    Returns:
        True if successfully updated, False otherwise
    """
    try:
        df = pd.read_csv(csv_file_path)
        
        if len(df) != 1:
            return False
            
        # Update the file paths in the row
        for path_key, path_value in file_paths.items():
            if path_key in df.columns:
                df.at[0, path_key] = path_value
        
        if not dry_run:
            df.to_csv(csv_file_path, index=False)
            
        return True
        
    except Exception as e:
        print(f"Error updating {csv_file_path}: {e}")
        return False

def process_csv_files(directory_path, dry_run=False):
    """
    Process all CSV files in a directory and add default rows to empty ones,
    or fix file paths for files with default rows.
    
    Args:
        directory_path: Path to the directory containing CSV files
        dry_run: If True, only print what would be done without making changes
    """
    # Find all CSV files ending with _results.csv
    csv_pattern = os.path.join(directory_path, "*_results.csv")
    csv_files = glob.glob(csv_pattern)
    
    print(f"Found {len(csv_files)} CSV files to process...")
    
    processed_count = 0
    
    for csv_file in csv_files:
        print(f"\nProcessing: {os.path.basename(csv_file)}")
        
        if is_csv_empty(csv_file):
            print(f"  - File is empty, adding default row...")
            
            if not dry_run:
                # Get the expected file paths
                file_paths = get_file_paths_from_csv_name(csv_file, directory_path)
                
                # Create a default row
                default_row = create_default_row(file_paths)
                
                # Read the existing CSV to get headers
                try:
                    df = pd.read_csv(csv_file)
                    
                    # Add the default row
                    df = pd.concat([df, pd.DataFrame([default_row])], ignore_index=True)
                    
                    # Write back to CSV
                    df.to_csv(csv_file, index=False)
                    
                    print(f"  - Added default row to {os.path.basename(csv_file)}")
                    processed_count += 1
                    
                except Exception as e:
                    print(f"  - Error processing {csv_file}: {e}")
            else:
                print(f"  - Would add default row to {os.path.basename(csv_file)}")
                processed_count += 1
                
        elif is_csv_default_row(csv_file):
            print(f"  - File has 1 row with zeros, updating file paths...")
            
            # Get the expected file paths
            file_paths = get_file_paths_from_csv_name(csv_file, directory_path)
            
            if update_file_paths_in_row(csv_file, file_paths, dry_run):
                if not dry_run:
                    print(f"  - Updated file paths in {os.path.basename(csv_file)}")
                else:
                    print(f"  - Would update file paths in {os.path.basename(csv_file)}")
                processed_count += 1
            else:
                print(f"  - Failed to update file paths in {os.path.basename(csv_file)}")
                
        else:
            print(f"  - File has data, skipping...")
    
    print(f"\nSummary: {'Would process' if dry_run else 'Processed'} {processed_count} CSV files.")

def main():
    import argparse
    
    parser = argparse.ArgumentParser(description="Process empty CSV result files and add default rows, or fix file paths")
    parser.add_argument("directory", help="Directory containing the CSV files")
    parser.add_argument("--dry-run", action="store_true", help="Show what would be done without making changes")
    
    args = parser.parse_args()
    
    if not os.path.exists(args.directory):
        print(f"Error: Directory '{args.directory}' does not exist.")
        return
    
    print(f"Processing CSV files in: {args.directory}")
    if args.dry_run:
        print("DRY RUN MODE - No changes will be made")
    
    process_csv_files(args.directory, dry_run=args.dry_run)

if __name__ == "__main__":
    main() 

Processing CSV files in: /data/papers/Leopard-EM_paper_data/xe30kv/results_match_tm_40S-body_noB/
Found 79 CSV files to process...

Processing: xenon_276_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_266_000_0.0_DWS_results.csv
  - File is empty, adding default row...
  - Added default row to xenon_266_000_0.0_DWS_results.csv

Processing: xenon_270_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_222_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_288_000_0.0_DWS_results.csv
  - File is empty, adding default row...
  - Added default row to xenon_288_000_0.0_DWS_results.csv

Processing: xenon_284_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_251_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_277_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_216_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_254_000_0.


  - File has data, skipping...

Processing: xenon_240_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_267_000_0.0_DWS_results.csv
  - File is empty, adding default row...
  - Added default row to xenon_267_000_0.0_DWS_results.csv

Processing: xenon_229_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_282_000_0.0_DWS_results.csv
  - File is empty, adding default row...
  - Added default row to xenon_282_000_0.0_DWS_results.csv

Processing: xenon_250_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_287_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_248_000_0.0_DWS_results.csv
  - File is empty, adding default row...
  - Added default row to xenon_248_000_0.0_DWS_results.csv

Processing: xenon_214_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_286_000_0.0_DWS_results.csv
  - File is empty, adding default row...
  - Added default row to xenon_286_000_0.0_DWS_res


  - Added default row to xenon_285_000_0.0_DWS_results.csv

Processing: xenon_232_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_259_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_273_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_252_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_279_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_262_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_281_000_0.0_DWS_results.csv
  - File is empty, adding default row...
  - Added default row to xenon_281_000_0.0_DWS_results.csv

Processing: xenon_234_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_215_000_0.0_DWS_results.csv
  - File has data, skipping...

Processing: xenon_239_000_0.0_DWS_results.csv
  - File is empty, adding default row...
  - Added default row to xenon_239_000_0.0_DWS_results.csv

Processing: xenon_224_


The following scripts are for the successive constrained searches. The details of these scripts, as well as the scripts to find the rotation axis and center vector, are part of the constrained search tutorial. 

All of the following scripts use the run script process_all_micrographs_constrained.py.

The first step is a rotation around the Z axis, which uses the bash script run_constrained_search_step1.sh and the config constrained_search_config_SSU-body_step1.yaml.

In [None]:
./run_scripts/run_constrained_search_step1.sh

We will next run the second step, a search around the Y axis. Since we expect this movement to be small, we will keep this axis as the default axis.

In [None]:
./run_scripts/run_constrained_search_step3.sh

We will now search more finely around the best psi and theta angles that we have determined.

In [None]:
./run_scripts/run_constrained_search_step4.sh

Finally, we run a finer angular search again, combined with a defocus search.

In [None]:
./run_scripts/run_constrained_search_step5.sh

We have performed successive searches, using the previous results as the starting angles for the next one.
We must now collate these results, sequentially summing the number of cross correlations so we can accurately estimate our false positive rate.
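
The effect of summing the correlation counts can be illustrated with a minimal sketch. The per-step counts below are hypothetical, not values from this dataset. The cutoff is computed with the standard library's `NormalDist` rather than `scipy.special.erfcinv`; under the Gaussian noise model the two forms are mathematically equivalent.

```python
from itertools import accumulate
from statistics import NormalDist


def zscore_cutoff(num_ccg: int, false_positives: float = 0.005) -> float:
    """Z-score whose upper-tail probability equals false_positives / num_ccg."""
    # Same quantity as sqrt(2) * erfcinv(2 * false_positives / num_ccg).
    return -NormalDist().inv_cdf(false_positives / num_ccg)


# Hypothetical number of correlations evaluated per step for one micrograph.
correlations_per_step = [360, 100, 200, 500]

# Running total: each step's threshold accounts for all earlier searches too.
cumulative = list(accumulate(correlations_per_step))
thresholds = [zscore_cutoff(n) for n in cumulative]

for step, (n, z) in enumerate(zip(cumulative, thresholds), start=1):
    print(f"step {step}: {n} total correlations -> cutoff {z:.3f}")
```

Because the cumulative count only grows, the cutoff rises with each step, so a peak that cleared the threshold in an earlier step may need to be re-checked against the stricter one.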

In [1]:
"""Processes results files from multiple directories sequentially."""

import argparse
import glob
import os
from collections import defaultdict

import numpy as np
import pandas as pd
from scipy.special import (
    erfcinv,  # Required for the gaussian_noise_zscore_cutoff function
)

import sys
sys.argv = ["process_sequential_results.py", "results_constrained_step1_noB", "results_constrained_step3_noB", "results_constrained_step4_noB", "results_constrained_step5_noB", "--output", "results_all_steps_noB_3"]


def gaussian_noise_zscore_cutoff(num_ccg: int, false_positives: float = 0.005) -> float:
    """Determines the z-score cutoff based on Gaussian noise model and number of pixels.

    NOTE: This procedure assumes that the z-scores (normalized maximum intensity
    projections) are distributed according to a standard normal distribution. Here,
    this model is used to find the cutoff value such that there is at most
    'false_positives' number of false positives in all of the pixels.

    Parameters
    ----------
    num_ccg : int
        Total number of cross-correlograms calculated during template matching. Product
        of the number of pixels, number of defocus values, and number of orientations.
    false_positives : float, optional
        Number of false positives to allow in the image (over all pixels). Default is
        0.005 which corresponds to 0.5% false-positives.

    Returns
    -------
    float
        Z-score cutoff.
    """
    tmp = erfcinv(2.0 * false_positives / num_ccg)
    tmp *= np.sqrt(2.0)

    return float(tmp)


def get_micrograph_id(filename: str) -> str:
    """
    Extract micrograph ID from filename.

    Parameters
    ----------
    filename : str
        Filename to extract micrograph ID from

    Returns
    -------
    micrograph_id : str
        Micrograph ID
    """
    base_name = os.path.basename(filename)
    # Extract the part before _results.csv
    parts = base_name.split("_results.csv")[0]
    return parts


def process_directories_sequentially(
    directory_list: list[str],
    output_base_dir: str,
    false_positive_rate: float = 0.005,
) -> dict[str, pd.DataFrame]:
    """
    Process directories sequentially.

    Parameters
    ----------
    directory_list : list
        Ordered list of directories to process
    output_base_dir : str
        Base directory to store output files
    false_positive_rate : float
        False positive rate to use for threshold calculation

    Returns
    -------
    all_particles : dict
        Dictionary of micrograph IDs as keys and df as values
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_base_dir, exist_ok=True)

    # Dictionary to store particles from all steps
    # Key: micrograph_id
    # Value: DataFrame of particles
    all_particles = {}

    # Dictionary to track which particles were found in which step
    # Key: (micrograph_id, particle_index)
    # Value: last step where this particle was found
    particle_step_map = {}

    # Dictionary to track total correlations per micrograph
    # Key: micrograph_id
    # Value: total correlations for this micrograph across all steps
    micrograph_correlations = defaultdict(int)

    # Dictionary to track thresholds per micrograph
    # Key: micrograph_id
    # Value: threshold for this micrograph in the current step
    micrograph_thresholds = {}

    # Process each directory in order
    for step_idx, directory in enumerate(directory_list):
        step_num = step_idx + 1
        print(f"\nProcessing Step {step_num}: {directory}")

        # Create step output directory
        step_output_dir = os.path.join(output_base_dir, f"step_{step_num}")
        os.makedirs(step_output_dir, exist_ok=True)

        # Find all results.csv files in the directory
        results_files = glob.glob(
            os.path.join(directory, "**", "*_results.csv"), recursive=True
        )

        if not results_files:
            print(f"  Warning: No results files found in {directory}")
            continue

        print(f"  Found {len(results_files)} results files")

        # Dictionary to store parameters from each micrograph
        step_micrograph_parameters = {}

        # First, find and read all parameters files to update correlation counts
        for results_file in results_files:
            micrograph_id = get_micrograph_id(results_file)
            params_file = results_file.replace(
                "_results.csv", "_results_parameters.csv"
            )

            if os.path.exists(params_file):
                try:
                    # Read the parameters file
                    params_df = pd.read_csv(params_file)
                    if not params_df.empty:
                        step_micrograph_parameters[micrograph_id] = params_df.iloc[0]

                        # Add to total correlations for this micrograph
                        if "num_correlations" in params_df.columns:
                            correlations = int(params_df.iloc[0]["num_correlations"])
                            micrograph_correlations[micrograph_id] += correlations
                            print(
                                f"  {micrograph_id}: Added {correlations} correlations "
                                f"(total: {micrograph_correlations[micrograph_id]})"
                            )
                except Exception as e:
                    print(f"  Error reading parameters file {params_file}: {e}")
            else:
                print(f"  Warning: Parameters file not found for {results_file}")

        # Calculate threshold for each micrograph based on its cumulative correlations
        for micrograph_id, total_correlations in micrograph_correlations.items():
            threshold = gaussian_noise_zscore_cutoff(
                total_correlations, false_positive_rate
            )
            micrograph_thresholds[micrograph_id] = threshold
            print(
                f"  Threshold for {micrograph_id} in step {step_num}: {threshold:.4f} "
                f"(based on {total_correlations} total correlations)"
            )

        # Process each results file
        for results_file in results_files:
            micrograph_id = get_micrograph_id(results_file)

            try:
                # Read the results file
                results_df = pd.read_csv(results_file)

                if results_df.empty:
                    print(f"  Warning: Empty results file {results_file}")
                    continue

                # Get the threshold for this micrograph
                if micrograph_id not in micrograph_thresholds:
                    print(
                        f"  Warning: No correlation information for {micrograph_id}, "
                        "using default threshold"
                    )
                    # Try to use correlations from the current step's parameters file
                    if (
                        micrograph_id in step_micrograph_parameters
                        and "num_correlations"
                        in step_micrograph_parameters[micrograph_id]
                    ):
                        correlations = int(
                            step_micrograph_parameters[micrograph_id][
                                "num_correlations"
                            ]
                        )
                        micrograph_correlations[micrograph_id] = correlations
                        threshold = gaussian_noise_zscore_cutoff(
                            correlations, false_positive_rate
                        )
                        micrograph_thresholds[micrograph_id] = threshold
                        print(
                            f"  Using threshold {threshold:.4f} for {micrograph_id} "
                            f"based on {correlations} correlations"
                        )
                    else:
                        # If no information at all, use the median of other thresholds
                        # or a reasonable default
                        if micrograph_thresholds:
                            threshold = np.median(list(micrograph_thresholds.values()))
                            print(
                                f"  Using median threshold {threshold:.4f} for "
                                f"{micrograph_id}"
                            )
                        else:
                            #  Default if no other information is available
                            threshold = 5.0
                            print(
                                f"  Using default threshold {threshold:.4f} for "
                                f"{micrograph_id}"
                            )
                        micrograph_thresholds[micrograph_id] = threshold
                else:
                    threshold = micrograph_thresholds[micrograph_id]

                # Check if refined_scaled_mip column exists
                if "refined_scaled_mip" not in results_df.columns:
                    print(
                        f"  Warning: refined_scaled_mip not found in {results_file},"
                        " using scaled_mip instead"
                    )
                    compare_col = "scaled_mip"
                else:
                    compare_col = "refined_scaled_mip"

                # Filter particles above threshold using the appropriate column
                above_threshold_df = results_df[
                    results_df[compare_col] > threshold
                ].copy()

                if above_threshold_df.empty:
                    print(f"  No particles above threshold in {results_file}")
                    continue

                # Print stats
                print(
                    f"{micrograph_id}: {len(above_threshold_df)} of {len(results_df)}"
                    f" particles above threshold (using {compare_col})"
                )

                # Add a step column to track which step this is from
                above_threshold_df["step"] = step_num

                # If this is the first step, just add all particles above threshold
                if step_num == 1:
                    all_particles[micrograph_id] = above_threshold_df

                    # Update particle step map
                    for idx in above_threshold_df["particle_index"]:
                        particle_step_map[(micrograph_id, idx)] = step_num
                else:
                    # If this micrograph was not seen before, add all particles
                    if micrograph_id not in all_particles:
                        all_particles[micrograph_id] = above_threshold_df

                        # Update particle step map
                        for idx in above_threshold_df["particle_index"]:
                            particle_step_map[(micrograph_id, idx)] = step_num
                    else:
                        # For existing micrographs, handle particles differently
                        existing_df = all_particles[micrograph_id]

                        # Create a new DataFrame to store updated particles
                        updated_df = existing_df.copy()

                        # For each particle in the new results
                        for _, particle in above_threshold_df.iterrows():
                            particle_idx = particle["particle_index"]

                            # Check if this particle exists previously
                            existing_particle = existing_df[
                                existing_df["particle_index"] == particle_idx
                            ]

                            if len(existing_particle) > 0:
                                # Particle exists, update parameters
                                # Find the index in the updated_df
                                idx_to_update = updated_df.index[
                                    updated_df["particle_index"] == particle_idx
                                ].tolist()[0]

                                # Orientation offset columns accumulated across steps
                                offset_cols = [
                                    "original_offset_phi",
                                    "original_offset_theta",
                                    "original_offset_psi",
                                ]

                                # Initialize any missing offset columns to zero
                                for col in offset_cols:
                                    if col not in updated_df.columns:
                                        updated_df[col] = 0.0

                                # Add offset values from current step
                                for col in offset_cols:
                                    # Add particle's offset to existing offset
                                    if col in particle and pd.notna(particle[col]):
                                        updated_df.at[idx_to_update, col] += particle[
                                            col
                                        ]

                                # Update other parameters
                                for col in particle.index:
                                    if col not in offset_cols and pd.notna(
                                        particle[col]
                                    ):
                                        updated_df.at[idx_to_update, col] = particle[
                                            col
                                        ]

                                # Update step
                                updated_df.at[idx_to_update, "step"] = step_num
                                particle_step_map[(micrograph_id, particle_idx)] = (
                                    step_num
                                )
                            else:
                                # New particle, add it to the DataFrame
                                updated_df = pd.concat(
                                    [updated_df, pd.DataFrame([particle])],
                                    ignore_index=True,
                                )
                                particle_step_map[(micrograph_id, particle_idx)] = (
                                    step_num
                                )

                        # Update the all_particles dictionary
                        all_particles[micrograph_id] = updated_df

            except Exception as e:
                print(f"  Error processing results file {results_file}: {e}")

        # Save intermediate results for this step
        for micrograph_id, particles_df in all_particles.items():
            # Only save particles found or updated in this step
            step_particles = particles_df[particles_df["step"] == step_num]

            if not step_particles.empty:
                output_file = os.path.join(
                    step_output_dir, f"{micrograph_id}_results_above_threshold.csv"
                )
                step_particles.to_csv(output_file, index=False)
                print(
                    f"  Saved {len(step_particles)} particles for {micrograph_id} "
                    f"in step {step_num}"
                )

    # Save final results after all steps
    final_output_dir = os.path.join(output_base_dir, "final_results")
    os.makedirs(final_output_dir, exist_ok=True)

    # Save summary of total particles per micrograph
    summary_data = []

    for micrograph_id, particles_df in all_particles.items():
        output_file = os.path.join(
            final_output_dir, f"{micrograph_id}_results_above_threshold.csv"
        )
        particles_df.to_csv(output_file, index=False)

        # Get the final threshold for this micrograph
        final_threshold = micrograph_thresholds.get(micrograph_id, "N/A")
        total_correlations = micrograph_correlations.get(micrograph_id, 0)

        # Create summary data
        n_particles = len(particles_df)
        summary_data.append(
            {
                "micrograph_id": micrograph_id,
                "total_particles": n_particles,
                "total_correlations": total_correlations,
                "final_threshold": final_threshold,
            }
        )

        print(
            f"Saved {n_particles} final particles for {micrograph_id} "
            f"(threshold: {final_threshold}, correlations: {total_correlations})"
        )

    # Save summary
    summary_df = pd.DataFrame(summary_data)
    summary_df.to_csv(
        os.path.join(final_output_dir, "processing_summary.csv"), index=False
    )

    print(f"\nProcessing complete. Final results saved to {final_output_dir}")

    # Print total particles
    total_particles = sum(len(df) for df in all_particles.values())
    print(f"Total particles across all micrographs: {total_particles}")

    return all_particles


def main() -> None:
    """Main function to process results files sequentially."""
    parser = argparse.ArgumentParser(
        description="Process results files from multiple directories sequentially"
    )
    parser.add_argument(
        "directories", nargs="+", help="Ordered list of directories to process"
    )
    parser.add_argument(
        "--output", "-o", required=True, help="Output directory for results"
    )
    parser.add_argument(
        "--false-positive-rate",
        "-f",
        type=float,
        default=0.005,
        help="False positive rate for threshold calculation (default: 0.005)",
    )

    args = parser.parse_args()

    # Check that every input path exists and is a directory
    for directory in args.directories:
        if not os.path.isdir(directory):
            print(f"Error: {directory} does not exist or is not a directory!")
            return

    # Process directories
    process_directories_sequentially(
        args.directories, args.output, args.false_positive_rate
    )


if __name__ == "__main__":
    main()
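
The per-micrograph cutoff in the script above comes from `gaussian_noise_zscore_cutoff(correlations, false_positive_rate)`, which is defined outside this cell. As a rough illustration of why the cutoff rises with the number of correlations, the sketch below assumes the cutoff is the z-score at which pure Gaussian noise would be expected to yield `false_positive_rate` false detections over all `num_correlations` cross-correlations. The name `approx_zscore_cutoff` is hypothetical and the code uses only the standard library; it is a stand-in for illustration, not the Leopard-EM implementation.

```python
from statistics import NormalDist


def approx_zscore_cutoff(num_correlations: int, false_positive_rate: float) -> float:
    """Approximate z-score cutoff for an expected number of false positives.

    Assumption (not taken from the notebook): under the noise-only
    hypothesis, each correlation value is an independent standard-normal
    sample.
    """
    if num_correlations <= 0:
        raise ValueError("num_correlations must be positive")
    # Per-correlation tail probability that gives the requested number of
    # expected false positives over the whole search.
    tail_probability = false_positive_rate / num_correlations
    # Use the lower tail and flip the sign to avoid computing 1 - tiny_number.
    return -NormalDist().inv_cdf(tail_probability)


# 1296 correlations per micrograph, as reported in the step 1 log output.
cutoff_step1 = approx_zscore_cutoff(1296, 0.005)
# A search with 100x more correlations needs a higher cutoff at the same rate.
cutoff_larger = approx_zscore_cutoff(1296 * 100, 0.005)
print(f"{cutoff_step1:.2f} {cutoff_larger:.2f}")
```

Under this assumption, accumulating correlations over successive steps can only raise a micrograph's cutoff, which is consistent with the script recomputing the threshold whenever a step adds correlations.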



Processing Step 1: results_constrained_step1_noB
  Found 69 results files
  xenon_217_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_276_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_261_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_259_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_240_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_255_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_215_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_280_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_248_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_278_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_237_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_269_000_0.0_DWS_constrained: Added 1296 correlations (total: 1296)
  xenon_284_000_0.0_DWS_c