# Data processing of ANS treated yeast cells for Leopard-EM manuscript

This notebook walks through the methods used to generate the 2DTM data from ANS treated yeast cells published in the 2025 Leopard-EM manuscript.

## Running Match Template
The match template step was identical to that used for the CHX treated cells.


## Refine template for the LSU
The refine template step was also identical to that used for the CHX treated cells.

## Performing a constrained search for the SSU

The constrained search follows the same four-step process as for the untreated cells, but with angular ranges re-optimized for the ANS treated cells.
As before, all searches were run with a false positive rate of 1 per 200 LSUs (0.005).
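
As a rough illustration of what this false positive rate implies, the sketch below computes the z-score cutoff from a Gaussian noise model for a few search-space sizes. It mirrors the `gaussian_noise_zscore_cutoff` function used later in this notebook but relies only on the standard library; the helper name `zscore_cutoff` and the correlation counts are made-up values for illustration, not numbers from this dataset.

```python
from statistics import NormalDist


def zscore_cutoff(num_correlations: int, false_positives: float = 0.005) -> float:
    """Z-score above which pure Gaussian noise is expected to give
    `false_positives` hits across `num_correlations` correlations."""
    tail_probability = false_positives / num_correlations
    # By symmetry of the standard normal, the upper-tail quantile is the
    # negated lower-tail quantile; this avoids computing 1 - tiny_number.
    return -NormalDist().inv_cdf(tail_probability)


# Hypothetical search-space sizes (illustrative only)
for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"{n:>13,d} correlations -> cutoff {zscore_cutoff(n):.2f}")
```

The cutoff rises as more correlations are evaluated, which is why the number of correlations from every search step has to be accounted for when the final threshold is set.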

Step 1 is a search around the z-axis (psi).

In [1]:
from IPython.display import Markdown, display

# Read the YAML file
with open("configs/constrained_search_config_SSU-body_step1.yaml") as file:
    yaml_content = file.read()

# Display as markdown code block
display(Markdown(f"```yaml\n{yaml_content}\n```"))

```yaml
###################################################
#### Constrained Search configuration example #####
###################################################
# An example YAML configuration to modify.
# Call `RefineTemplateManager.from_yaml(path)` to load this configuration.
template_volume_path: /home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc # Volume of small particle
particle_stack_reference: # This is from the large particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
particle_stack_constrained: # This is from the small particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
centre_vector: [88.023109, 52.080261, 45.528008] # Vector from large particle to small particle in Angstroms
orientation_refinement_config:
  enabled: true
  base_grid_method: uniform 
  psi_step: 1.0 # psi in degrees
  theta_step: 1.0   # theta and phi in degrees
  rotation_axis_euler_angles: [76.2, 60.02, 0.0] # This is the rotation axis
  phi_min: 0.0
  phi_max: 0.0
  theta_min: 0.0
  theta_max: 0.0
  psi_min: -13.0
  psi_max: 2.5
  search_roll_axis: false
  roll_axis: [0.0,1.0] # [x,y] This defines the roll axis (orthogonal to the rotation axis). None means search
  roll_step: 2.0 
defocus_refinement_config:
  enabled: false
  defocus_max:  100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_min: -100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_step: 20.0   # in Angstroms
preprocessing_filters:
  whitening_filter:
    do_power_spectrum: true
    enabled: true
    max_freq: 1.0  # In terms of Nyquist frequency
    num_freq_bins: null
  bandpass_filter:
    enabled: false
    falloff: null
    high_freq_cutoff: null
    low_freq_cutoff: null
computational_config:
  gpu_ids: 
    - 0
    - 1
    - 2
    - 3
  num_cpus: 4

```

In [None]:
#!/bin/bash
# Print current shell and environment before activation
echo "=== ENVIRONMENT BEFORE ACTIVATION ==="
echo "Current shell: $SHELL"
echo "Current conda environments:"
conda env list
echo "Current Python: $(which python)"
echo "Current Python version: $(python --version 2>&1)"
echo "======================================"

# Activate leopard-em conda environment 
echo "=== ACTIVATING CONDA ENVIRONMENT ==="
source $(conda info --base)/etc/profile.d/conda.sh
conda activate leopard-em
ACTIVATION_STATUS=$?

# Check if activation succeeded
if [ $ACTIVATION_STATUS -ne 0 ]; then
    echo "ERROR: Failed to activate the leopard-em environment"
    echo "Available environments:"
    conda env list
    exit 1
fi

# Print environment details after activation
echo "=== ENVIRONMENT AFTER ACTIVATION ==="
echo "Active conda environment: $CONDA_PREFIX"
echo "Python interpreter: $(which python)"
echo "Python version: $(python --version 2>&1)"
echo "Conda packages in environment:"
conda list | grep -E 'program|leopard'
echo "======================================"


# Set up paths (update these paths for your project)
PROJECT_DIR="/home/data/jdickerson/Leopard-EM_paper_data/ANS_data"
MICROGRAPHS_DIR="${PROJECT_DIR}/mgraphs_first"
OUTPUT_DIR="${PROJECT_DIR}/results_constrained_step1"
LARGE_RESULTS_DIR="${PROJECT_DIR}/results_refine_tm_60S"
SMALL_RESULTS_DIR="${PROJECT_DIR}/results_match_tm_40S-body"
TEMPLATE_YAML="${PROJECT_DIR}/configs/constrained_search_config_SSU-body_step1.yaml"
TEMPLATE_DIR="/home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc"
LARGE_SUFFIX="_refined_results.csv"
SMALL_SUFFIX="_results.csv"
SCRIPT_PATH="${PROJECT_DIR}/process_all_micrographs_constrained.py"

# Create results directory if it doesn't exist
mkdir -p ${OUTPUT_DIR}

# Run the processing script
python ${SCRIPT_PATH} \
  --micrographs-dir "${MICROGRAPHS_DIR}" \
  --template-yaml "${TEMPLATE_YAML}" \
  --large-results-dir "${LARGE_RESULTS_DIR}" \
  --small-results-dir "${SMALL_RESULTS_DIR}" \
  --template-volume "${TEMPLATE_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --large-suffix "${LARGE_SUFFIX}" \
  --small-suffix "${SMALL_SUFFIX}" \
  --gpus "2,3" \
  --batch-size 64 \
  --false-positives 0.005

echo "All micrographs processed"

In [None]:
#!/usr/bin/env python
import os
import re
import yaml
import sys
import argparse
import glob
import time
import pandas as pd
from leopard_em.pydantic_models.managers import ConstrainedSearchManager

def extract_micrograph_number(filename):
    """Extract the micrograph number from the filename."""
    # The new pattern matches filenames like: 25_Sep12_11.40.05_145_1.mrc
    # Simply returns the base filename without extension
    # This makes it work directly with similar result files: 25_Sep12_11.40.05_145_1_results.csv
    base_name = os.path.splitext(filename)[0]
    return base_name

def create_yaml_for_constrained_search(template_yaml_path, large_results_csv, small_results_csv, output_yaml_path, template_volume_path, gpu_ids):
    """Create a custom YAML file for constrained search of a specific micrograph's match results."""
    # Load the template YAML
    with open(template_yaml_path, 'r') as file:
        config = yaml.safe_load(file)
    
    # Update the config with the specific match results info
    config['template_volume_path'] = template_volume_path
    config['particle_stack_reference']['df_path'] = large_results_csv
    config['particle_stack_constrained']['df_path'] = small_results_csv
    
    # Set GPU IDs
    config['computational_config']['gpu_ids'] = gpu_ids
    
    # Write the updated config to a new YAML file
    with open(output_yaml_path, 'w') as file:
        yaml.dump(config, file, default_flow_style=False)
    
    return config

def process_micrograph_constrained_search(micrograph_path, large_results_csv, small_results_csv, template_yaml, output_dir, template_volume_path, gpu_ids, batch_size, false_positives):
    """Process constrained search for a single micrograph's match template results."""
    micrograph_basename = os.path.basename(micrograph_path)
    base_name = os.path.splitext(micrograph_basename)[0]
    
    # Check if match results CSV files exist
    if not os.path.exists(large_results_csv):
        print(f"Large particle results not found: {large_results_csv}")
        return False
    
    if not os.path.exists(small_results_csv):
        print(f"Small particle results not found: {small_results_csv}")
        return False
    
    # Check if there are any results to process
    df_large = pd.read_csv(large_results_csv)
    if len(df_large) == 0:
        print(f"No large particle matches in {large_results_csv}")
        return False
    
    df_small = pd.read_csv(small_results_csv)
    if len(df_small) == 0:
        print(f"No small particle matches in {small_results_csv}")
        return False
    
    # Create custom YAML file for constrained search of this micrograph's results
    custom_yaml_path = os.path.join(output_dir, f"{base_name}_constrained_config.yaml")
    create_yaml_for_constrained_search(
        template_yaml, 
        large_results_csv, 
        small_results_csv,
        custom_yaml_path,
        template_volume_path,
        gpu_ids
    )
    
    # Run constrained search with the custom config
    try:
        print(f"Running constrained search for {micrograph_basename} with GPUs {gpu_ids}")
        cs_manager = ConstrainedSearchManager.from_yaml(custom_yaml_path)
        
        # Define output path for constrained search results
        constrained_output_csv = os.path.join(output_dir, f"{base_name}_constrained_results.csv")
        
        # Record start time
        start_time = time.time()
        
        # Run constrained search
        cs_manager.run_constrained_search(constrained_output_csv, false_positives, batch_size)
        
        # Calculate and print elapsed time
        end_time = time.time()
        elapsed_time = end_time - start_time
        elapsed_time_str = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
        print(f"Constrained search wall time: {elapsed_time_str}")
        
        print(f"Successfully completed constrained search for {micrograph_basename}")
        return True
    except Exception as e:
        print(f"Error in constrained search for {micrograph_basename}: {str(e)}")
        return False

def main():
    parser = argparse.ArgumentParser(description='Process multiple micrographs with constrained search')
    parser.add_argument('--micrographs-dir', required=True, help='Directory containing micrograph files')
    parser.add_argument('--template-yaml', required=True, help='Path to the template constrained search YAML configuration')
    parser.add_argument('--large-results-dir', required=True, help='Directory containing large particle results')
    parser.add_argument('--small-results-dir', required=True, help='Directory containing small particle results')
    parser.add_argument('--template-volume', required=True, help='Path to the template volume MRC file (small particle)')
    parser.add_argument('--output-dir', required=True, help='Directory to store constrained search results')
    parser.add_argument('--large-suffix', default='_refined_results.csv', help='Suffix for large particle result files (default: "_refined_results.csv")')
    parser.add_argument('--small-suffix', default='_results.csv', help='Suffix for small particle result files (default: "_results.csv")')
    parser.add_argument('--gpus', default='0', help='Comma-separated list of GPU IDs to use')
    parser.add_argument('--batch-size', type=int, default=80, help='Particle batch size for constrained search')
    parser.add_argument('--pattern', default='*.mrc', help='File pattern to match micrographs')
    parser.add_argument('--false-positives', type=float, default=0.005, help='False positives rate for constrained search')
    parser.add_argument('--start-idx', type=int, default=None, help='Start index for processing (optional)')
    parser.add_argument('--end-idx', type=int, default=None, help='End index for processing (optional)')
    parser.add_argument('--job-idx', type=int, default=None, help='Job index from SLURM array (optional)')
    parser.add_argument('--jobs-per-array', type=int, default=None, help='Number of micrographs per array job (optional)')
    
    args = parser.parse_args()
    
    # Make sure output directory exists
    os.makedirs(args.output_dir, exist_ok=True)
    
    # Convert GPU IDs string to list of integers
    gpu_ids = [int(gpu_id) for gpu_id in args.gpus.split(',')]
    
    # Get list of micrograph files
    micrograph_pattern = os.path.join(args.micrographs_dir, args.pattern)
    micrograph_files = sorted(glob.glob(micrograph_pattern))
    
    if not micrograph_files:
        print(f"No micrograph files found matching pattern {micrograph_pattern}")
        return 1
    
    print(f"Found {len(micrograph_files)} micrograph files")
    
    # Determine which micrographs to process
    # Record the full count before slicing so the progress message is accurate
    total_micrographs = len(micrograph_files)
    if args.job_idx is not None and args.jobs_per_array is not None:
        # Calculate range for this job in the array (job index is 1-based)
        start_idx = (args.job_idx - 1) * args.jobs_per_array
        end_idx = min(start_idx + args.jobs_per_array, total_micrographs)
        micrograph_files = micrograph_files[start_idx:end_idx]
        print(f"Processing micrographs {start_idx+1}-{end_idx} out of {total_micrographs}")
    elif args.start_idx is not None or args.end_idx is not None:
        start_idx = args.start_idx if args.start_idx is not None else 0
        end_idx = args.end_idx if args.end_idx is not None else total_micrographs
        micrograph_files = micrograph_files[start_idx:end_idx]
        print(f"Processing micrographs {start_idx+1}-{end_idx} out of {total_micrographs}")
    
    # Process each micrograph's match results for constrained search
    successful = 0
    for i, micrograph_file in enumerate(micrograph_files):
        micrograph_basename = os.path.basename(micrograph_file)
        base_name = os.path.splitext(micrograph_basename)[0]
        
        # Find the corresponding results CSV files using the configurable suffixes
        large_results_csv = os.path.join(args.large_results_dir, f"{base_name}{args.large_suffix}")
        small_results_csv = os.path.join(args.small_results_dir, f"{base_name}{args.small_suffix}")
        
        print(f"Processing {i+1}/{len(micrograph_files)}: {micrograph_basename}")
        if process_micrograph_constrained_search(
            micrograph_file, 
            large_results_csv,
            small_results_csv,
            args.template_yaml, 
            args.output_dir,
            args.template_volume,
            gpu_ids, 
            args.batch_size,
            args.false_positives
        ):
            successful += 1
    
    print(f"Successfully completed constrained search for {successful}/{len(micrograph_files)} micrographs")
    return 0

if __name__ == "__main__":
    sys.exit(main()) 
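
The `--job-idx` and `--jobs-per-array` options split the sorted micrograph list into consecutive chunks, one per SLURM array task. A minimal sketch of that arithmetic follows, assuming a 1-based job index as in the script above; the helper name `array_job_slice` and the placeholder file names are illustrative only.

```python
def array_job_slice(job_idx: int, jobs_per_array: int, n_micrographs: int) -> tuple[int, int]:
    """Return the half-open [start, end) range of micrographs for a 1-based job index."""
    start = (job_idx - 1) * jobs_per_array
    end = min(start + jobs_per_array, n_micrographs)
    return start, end


# Example: 10 micrographs, 4 per array task, so three tasks cover everything
files = [f"micrograph_{i:02d}.mrc" for i in range(10)]  # placeholder names
for job_idx in (1, 2, 3):
    start, end = array_job_slice(job_idx, 4, len(files))
    print(job_idx, files[start:end])
```

A job index past the end of the list yields an empty slice, so surplus array tasks simply process nothing.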

The second search was around the y-axis (theta).

In [2]:
from IPython.display import Markdown, display

# Read the YAML file
with open("configs/constrained_search_config_SSU-body_step3.yaml") as file:
    yaml_content = file.read()

# Display as markdown code block
display(Markdown(f"```yaml\n{yaml_content}\n```"))

```yaml
###################################################
#### Constrained Search configuration example #####
###################################################
# An example YAML configuration to modify.
# Call `RefineTemplateManager.from_yaml(path)` to load this configuration.
template_volume_path: /home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc # Volume of small particle
particle_stack_reference: # This is from the large particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
particle_stack_constrained: # This is from the small particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
centre_vector: [88.023109, 52.080261, 45.528008] # Vector from large particle to small particle in Angstroms
orientation_refinement_config:
  enabled: true
  base_grid_method: uniform 
  psi_step: 1.0 # psi in degrees
  theta_step: 1.0   # theta and phi in degrees
  rotation_axis_euler_angles: [76.2, 60.02, 0.0] # This is the rotation axis
  phi_min: 0.0
  phi_max: 0.0
  theta_min: -8.0
  theta_max: 5.0
  psi_min: 0.0
  psi_max: 0.0
  search_roll_axis: false
  roll_axis: [0.0,1.0] # [x,y] This defines the roll axis (orthogonal to the rotation axis). None means search
  roll_step: 2.0 
defocus_refinement_config:
  enabled: false
  defocus_max:  100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_min: -100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_step: 20.0   # in Angstroms
preprocessing_filters:
  whitening_filter:
    do_power_spectrum: true
    enabled: true
    max_freq: 1.0  # In terms of Nyquist frequency
    num_freq_bins: null
  bandpass_filter:
    enabled: false
    falloff: null
    high_freq_cutoff: null
    low_freq_cutoff: null
computational_config:
  gpu_ids: 
    - 0
    - 1
    - 2
    - 3
  num_cpus: 4

```

In [None]:
#!/bin/bash
# Print current shell and environment before activation
echo "=== ENVIRONMENT BEFORE ACTIVATION ==="
echo "Current shell: $SHELL"
echo "Current conda environments:"
conda env list
echo "Current Python: $(which python)"
echo "Current Python version: $(python --version 2>&1)"
echo "======================================"

# Activate leopard-em conda environment 
echo "=== ACTIVATING CONDA ENVIRONMENT ==="
source $(conda info --base)/etc/profile.d/conda.sh
conda activate leopard-em
ACTIVATION_STATUS=$?

# Check if activation succeeded
if [ $ACTIVATION_STATUS -ne 0 ]; then
    echo "ERROR: Failed to activate the leopard-em environment"
    echo "Available environments:"
    conda env list
    exit 1
fi

# Print environment details after activation
echo "=== ENVIRONMENT AFTER ACTIVATION ==="
echo "Active conda environment: $CONDA_PREFIX"
echo "Python interpreter: $(which python)"
echo "Python version: $(python --version 2>&1)"
echo "Conda packages in environment:"
conda list | grep -E 'program|leopard'
echo "======================================"


# Set up paths (update these paths for your project)
PROJECT_DIR="/home/data/jdickerson/Leopard-EM_paper_data/ANS_data"
MICROGRAPHS_DIR="${PROJECT_DIR}/mgraphs_first"
OUTPUT_DIR="${PROJECT_DIR}/results_constrained_step3"
LARGE_RESULTS_DIR="${PROJECT_DIR}/results_constrained_step1"
SMALL_RESULTS_DIR="${PROJECT_DIR}/results_match_tm_40S-body"
TEMPLATE_YAML="${PROJECT_DIR}/configs/constrained_search_config_SSU-body_step3.yaml"
TEMPLATE_DIR="/home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc"
LARGE_SUFFIX="_constrained_results.csv"
SMALL_SUFFIX="_results.csv"
SCRIPT_PATH="${PROJECT_DIR}/process_all_micrographs_constrained.py"

# Create results directory if it doesn't exist
mkdir -p ${OUTPUT_DIR}

# Run the processing script
python ${SCRIPT_PATH} \
  --micrographs-dir "${MICROGRAPHS_DIR}" \
  --template-yaml "${TEMPLATE_YAML}" \
  --large-results-dir "${LARGE_RESULTS_DIR}" \
  --small-results-dir "${SMALL_RESULTS_DIR}" \
  --template-volume "${TEMPLATE_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --large-suffix "${LARGE_SUFFIX}" \
  --small-suffix "${SMALL_SUFFIX}" \
  --gpus "2,3" \
  --batch-size 64 \
  --false-positives 0.005

echo "All micrographs processed"

The next step was a finer search around both angles simultaneously.

In [3]:
from IPython.display import Markdown, display

# Read the YAML file
with open("configs/constrained_search_config_SSU-body_step4.yaml") as file:
    yaml_content = file.read()

# Display as markdown code block
display(Markdown(f"```yaml\n{yaml_content}\n```"))

```yaml
###################################################
#### Constrained Search configuration example #####
###################################################
# An example YAML configuration to modify.
# Call `RefineTemplateManager.from_yaml(path)` to load this configuration.
template_volume_path: /home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc # Volume of small particle
particle_stack_reference: # This is from the large particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
particle_stack_constrained: # This is from the small particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
centre_vector: [88.023109, 52.080261, 45.528008] # Vector from large particle to small particle in Angstroms
orientation_refinement_config:
  enabled: true
  base_grid_method: uniform 
  psi_step: 0.5 # psi in degrees
  theta_step: 0.5   # theta and phi in degrees
  rotation_axis_euler_angles: [76.2, 60.02, 0.0] # This is the rotation axis
  phi_min: 0.0
  phi_max: 0.0
  theta_min: -5.0
  theta_max: 5.0
  psi_min: -5.0
  psi_max: 5.0
  search_roll_axis: false
  roll_axis: [0.0,1.0] # [x,y] This defines the roll axis (orthogonal to the rotation axis). None means search
  roll_step: 2.0 
defocus_refinement_config:
  enabled: false
  defocus_max:  100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_min: -100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_step: 20.0   # in Angstroms
preprocessing_filters:
  whitening_filter:
    do_power_spectrum: true
    enabled: true
    max_freq: 1.0  # In terms of Nyquist frequency
    num_freq_bins: null
  bandpass_filter:
    enabled: false
    falloff: null
    high_freq_cutoff: null
    low_freq_cutoff: null
computational_config:
  gpu_ids: 
    - 0
    - 1
    - 2
    - 3
  num_cpus: 4

```

In [None]:
#!/bin/bash
# Print current shell and environment before activation
echo "=== ENVIRONMENT BEFORE ACTIVATION ==="
echo "Current shell: $SHELL"
echo "Current conda environments:"
conda env list
echo "Current Python: $(which python)"
echo "Current Python version: $(python --version 2>&1)"
echo "======================================"

# Activate leopard-em conda environment 
echo "=== ACTIVATING CONDA ENVIRONMENT ==="
source $(conda info --base)/etc/profile.d/conda.sh
conda activate leopard-em
ACTIVATION_STATUS=$?

# Check if activation succeeded
if [ $ACTIVATION_STATUS -ne 0 ]; then
    echo "ERROR: Failed to activate the leopard-em environment"
    echo "Available environments:"
    conda env list
    exit 1
fi

# Print environment details after activation
echo "=== ENVIRONMENT AFTER ACTIVATION ==="
echo "Active conda environment: $CONDA_PREFIX"
echo "Python interpreter: $(which python)"
echo "Python version: $(python --version 2>&1)"
echo "Conda packages in environment:"
conda list | grep -E 'program|leopard'
echo "======================================"


# Set up paths (update these paths for your project)
PROJECT_DIR="/home/data/jdickerson/Leopard-EM_paper_data/ANS_data"
MICROGRAPHS_DIR="${PROJECT_DIR}/mgraphs_first"
OUTPUT_DIR="${PROJECT_DIR}/results_constrained_step4"
LARGE_RESULTS_DIR="${PROJECT_DIR}/results_constrained_step3"
SMALL_RESULTS_DIR="${PROJECT_DIR}/results_match_tm_40S-body"
TEMPLATE_YAML="${PROJECT_DIR}/configs/constrained_search_config_SSU-body_step4.yaml"
TEMPLATE_DIR="/home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc"
LARGE_SUFFIX="_constrained_results.csv"
SMALL_SUFFIX="_results.csv"
SCRIPT_PATH="${PROJECT_DIR}/process_all_micrographs_constrained.py"

# Create results directory if it doesn't exist
mkdir -p ${OUTPUT_DIR}

# Run the processing script
python ${SCRIPT_PATH} \
  --micrographs-dir "${MICROGRAPHS_DIR}" \
  --template-yaml "${TEMPLATE_YAML}" \
  --large-results-dir "${LARGE_RESULTS_DIR}" \
  --small-results-dir "${SMALL_RESULTS_DIR}" \
  --template-volume "${TEMPLATE_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --large-suffix "${LARGE_SUFFIX}" \
  --small-suffix "${SMALL_SUFFIX}" \
  --gpus "2,3" \
  --batch-size 64 \
  --false-positives 0.005

echo "All micrographs processed"

The final search is finer still in both angles and is combined with a defocus search.

In [4]:
from IPython.display import Markdown, display

# Read the YAML file
with open("configs/constrained_search_config_SSU-body_step5.yaml") as file:
    yaml_content = file.read()

# Display as markdown code block
display(Markdown(f"```yaml\n{yaml_content}\n```"))

```yaml
###################################################
#### Constrained Search configuration example #####
###################################################
# An example YAML configuration to modify.
# Call `RefineTemplateManager.from_yaml(path)` to load this configuration.
template_volume_path: /home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc # Volume of small particle
particle_stack_reference: # This is from the large particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
particle_stack_constrained: # This is from the small particles
  df_path: /some/path/to/particles.csv  # Needs to be readable by pandas
  extracted_box_size: [520, 520]
  original_template_size: [512, 512]
centre_vector: [88.023109, 52.080261, 45.528008] # Vector from large particle to small particle in Angstroms
orientation_refinement_config:
  enabled: true
  base_grid_method: uniform 
  psi_step: 0.1 # psi in degrees
  theta_step: 0.1   # theta and phi in degrees
  rotation_axis_euler_angles: [76.2, 60.02, 0.0] # This is the rotation axis
  phi_min: 0.0
  phi_max: 0.0
  theta_min: -0.5
  theta_max: 0.5
  psi_min: -0.5
  psi_max: 0.5
  search_roll_axis: false
  roll_axis: [0.0,1.0] # [x,y] This defines the roll axis (orthogonal to the rotation axis). None means search
  roll_step: 2.0 
defocus_refinement_config:
  enabled: true
  defocus_max:  100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_min: -100.0  # in Angstroms, relative to "best" defocus value in particle stack dataframe
  defocus_step: 20.0   # in Angstroms
preprocessing_filters:
  whitening_filter:
    do_power_spectrum: true
    enabled: true
    max_freq: 1.0  # In terms of Nyquist frequency
    num_freq_bins: null
  bandpass_filter:
    enabled: false
    falloff: null
    high_freq_cutoff: null
    low_freq_cutoff: null
computational_config:
  gpu_ids: 
    - 0
    - 1
    - 2
    - 3
  num_cpus: 4

```
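
To get a feel for how the search space changes between steps, the sketch below counts the grid points implied by the four configurations shown above. It assumes each range is sampled on a uniform grid that includes both endpoints whenever the step divides the range; how Leopard-EM actually builds its grids may differ, so treat the counts as estimates. The helper `n_points` is hypothetical.

```python
import math


def n_points(lo: float, hi: float, step: float) -> int:
    """Number of samples on a uniform grid from lo to hi (assumed inclusive)."""
    if hi == lo:
        return 1
    # Small epsilon guards against floating-point error in the division
    return math.floor((hi - lo) / step + 1e-9) + 1


# (psi range, theta range, defocus range or None) as (min, max, step),
# with values copied from the YAML files for each step
steps = {
    "step1": ((-13.0, 2.5, 1.0), (0.0, 0.0, 1.0), None),
    "step3": ((0.0, 0.0, 1.0), (-8.0, 5.0, 1.0), None),
    "step4": ((-5.0, 5.0, 0.5), (-5.0, 5.0, 0.5), None),
    "step5": ((-0.5, 0.5, 0.1), (-0.5, 0.5, 0.1), (-100.0, 100.0, 20.0)),
}

counts = {}
for name, (psi, theta, defocus) in steps.items():
    n = n_points(*psi) * n_points(*theta)
    if defocus is not None:
        n *= n_points(*defocus)  # defocus search is only enabled in the last step
    counts[name] = n
    print(f"{name}: {n} orientation/defocus combinations per particle")
```

Under these assumptions, the two single-axis searches are cheap, while the combined searches dominate the cost because their grids multiply.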

In [None]:
#!/bin/bash
# Print current shell and environment before activation
echo "=== ENVIRONMENT BEFORE ACTIVATION ==="
echo "Current shell: $SHELL"
echo "Current conda environments:"
conda env list
echo "Current Python: $(which python)"
echo "Current Python version: $(python --version 2>&1)"
echo "======================================"

# Activate leopard-em conda environment 
echo "=== ACTIVATING CONDA ENVIRONMENT ==="
source $(conda info --base)/etc/profile.d/conda.sh
conda activate leopard-em
ACTIVATION_STATUS=$?

# Check if activation succeeded
if [ $ACTIVATION_STATUS -ne 0 ]; then
    echo "ERROR: Failed to activate the leopard-em environment"
    echo "Available environments:"
    conda env list
    exit 1
fi

# Print environment details after activation
echo "=== ENVIRONMENT AFTER ACTIVATION ==="
echo "Active conda environment: $CONDA_PREFIX"
echo "Python interpreter: $(which python)"
echo "Python version: $(python --version 2>&1)"
echo "Conda packages in environment:"
conda list | grep -E 'program|leopard'
echo "======================================"


# Set up paths (update these paths for your project)
PROJECT_DIR="/home/data/jdickerson/Leopard-EM_paper_data/ANS_data"
MICROGRAPHS_DIR="${PROJECT_DIR}/mgraphs_first"
OUTPUT_DIR="${PROJECT_DIR}/results_constrained_step5"
LARGE_RESULTS_DIR="${PROJECT_DIR}/results_constrained_step4"
SMALL_RESULTS_DIR="${PROJECT_DIR}/results_match_tm_40S-body"
TEMPLATE_YAML="${PROJECT_DIR}/configs/constrained_search_config_SSU-body_step5.yaml"
TEMPLATE_DIR="/home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc"
LARGE_SUFFIX="_constrained_results.csv"
SMALL_SUFFIX="_results.csv"
SCRIPT_PATH="${PROJECT_DIR}/process_all_micrographs_constrained.py"

# Create results directory if it doesn't exist
mkdir -p ${OUTPUT_DIR}

# Run the processing script
python ${SCRIPT_PATH} \
  --micrographs-dir "${MICROGRAPHS_DIR}" \
  --template-yaml "${TEMPLATE_YAML}" \
  --large-results-dir "${LARGE_RESULTS_DIR}" \
  --small-results-dir "${SMALL_RESULTS_DIR}" \
  --template-volume "${TEMPLATE_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --large-suffix "${LARGE_SUFFIX}" \
  --small-suffix "${SMALL_SUFFIX}" \
  --gpus "2,3" \
  --batch-size 64 \
  --false-positives 0.005

echo "All micrographs processed"
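
The four launch scripts above differ only in the configuration file, the output directory, and the directory and suffix of the reference (large particle) results, with each step reading the output of the previous one. The sketch below builds the equivalent command lines in a loop. It is an illustrative alternative to the separate shell scripts, not how the published data were generated, and the `build_commands` helper is hypothetical.

```python
from pathlib import Path

PROJECT_DIR = Path("/home/data/jdickerson/Leopard-EM_paper_data/ANS_data")
TEMPLATE = "/home/data/jdickerson/Leopard-EM_paper_data/maps/SSU-body_map_px1.059_bscale0.5_k3dqe.mrc"
STEPS = ["step1", "step3", "step4", "step5"]  # step labels used in the config file names


def build_commands(project_dir: Path = PROJECT_DIR) -> list[list[str]]:
    """Return one argument list per step, chaining each step to the previous output."""
    commands = []
    # The first step starts from the refined LSU results
    reference_dir = project_dir / "results_refine_tm_60S"
    reference_suffix = "_refined_results.csv"
    for step in STEPS:
        output_dir = project_dir / f"results_constrained_{step}"
        config = project_dir / "configs" / f"constrained_search_config_SSU-body_{step}.yaml"
        commands.append([
            "python", str(project_dir / "process_all_micrographs_constrained.py"),
            "--micrographs-dir", str(project_dir / "mgraphs_first"),
            "--template-yaml", str(config),
            "--large-results-dir", str(reference_dir),
            "--small-results-dir", str(project_dir / "results_match_tm_40S-body"),
            "--template-volume", TEMPLATE,
            "--output-dir", str(output_dir),
            "--large-suffix", reference_suffix,
            "--small-suffix", "_results.csv",
            "--gpus", "2,3",
            "--batch-size", "64",
            "--false-positives", "0.005",
        ])
        # Later steps use the previous step's output as the reference
        reference_dir = output_dir
        reference_suffix = "_constrained_results.csv"
    return commands


for cmd in build_commands():
    print(" ".join(cmd))
```

Each list could then be passed to `subprocess.run(cmd, check=True)` inside the activated `leopard-em` environment.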

Finally, we process the results from each step sequentially to get the final list of particles.

In [5]:
"""Processes results files from multiple directories sequentially."""

import argparse
import glob
import os
from collections import defaultdict

import numpy as np
import pandas as pd
from scipy.special import (
    erfcinv,  # Required for the gaussian_noise_zscore_cutoff function
)

import sys
sys.argv = ["process_sequential_results.py", "results_constrained_step1", "results_constrained_step3", "results_constrained_step4", "results_constrained_step5", "--output", "results_all_steps_4"]


def gaussian_noise_zscore_cutoff(num_ccg: int, false_positives: float = 0.005) -> float:
    """Determines the z-score cutoff based on Gaussian noise model and number of pixels.

    NOTE: This procedure assumes that the z-scores (normalized maximum intensity
    projections) are distributed according to a standard normal distribution. Here,
    this model is used to find the cutoff value such that there is at most
    'false_positives' number of false positives in all of the pixels.

    Parameters
    ----------
    num_ccg : int
        Total number of cross-correlograms calculated during template matching. Product
        of the number of pixels, number of defocus values, and number of orientations.
    false_positives : float, optional
        Number of false positives to allow in the image (over all pixels). Default is
        0.005 which corresponds to 0.5% false-positives.

    Returns
    -------
    float
        Z-score cutoff.
    """
    tmp = erfcinv(2.0 * false_positives / num_ccg)
    tmp *= np.sqrt(2.0)

    return float(tmp)


def get_micrograph_id(filename: str) -> str:
    """
    Extract micrograph ID from filename.

    Parameters
    ----------
    filename : str
        Filename to extract micrograph ID from

    Returns
    -------
    micrograph_id : str
        Micrograph ID
    """
    base_name = os.path.basename(filename)
    # Extract the part before _results.csv
    parts = base_name.split("_results.csv")[0]
    return parts


def process_directories_sequentially(
    directory_list: list[str],
    output_base_dir: str,
    false_positive_rate: float = 0.005,
) -> dict[str, pd.DataFrame]:
    """
    Process directories sequentially.

    Parameters
    ----------
    directory_list : list
        Ordered list of directories to process
    output_base_dir : str
        Base directory to store output files
    false_positive_rate : float
        False positive rate to use for threshold calculation

    Returns
    -------
    all_particles : dict
        Dictionary of micrograph IDs as keys and df as values
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_base_dir, exist_ok=True)

    # Dictionary to store particles from all steps
    # Key: micrograph_id
    # Value: DataFrame of particles
    all_particles = {}

    # Dictionary to track which particles were found in which step
    # Key: (micrograph_id, particle_index)
    # Value: last step where this particle was found
    particle_step_map = {}

    # Dictionary to track total correlations per micrograph
    # Key: micrograph_id
    # Value: total correlations for this micrograph across all steps
    micrograph_correlations = defaultdict(int)

    # Dictionary to track thresholds per micrograph
    # Key: micrograph_id
    # Value: threshold for this micrograph in the current step
    micrograph_thresholds = {}

    # Process each directory in order
    for step_idx, directory in enumerate(directory_list):
        step_num = step_idx + 1
        print(f"\nProcessing Step {step_num}: {directory}")

        # Create step output directory
        step_output_dir = os.path.join(output_base_dir, f"step_{step_num}")
        os.makedirs(step_output_dir, exist_ok=True)

        # Find all results.csv files in the directory
        results_files = glob.glob(
            os.path.join(directory, "**", "*_results.csv"), recursive=True
        )

        if not results_files:
            print(f"  Warning: No results files found in {directory}")
            continue

        print(f"  Found {len(results_files)} results files")

        # Dictionary to store parameters from each micrograph
        step_micrograph_parameters = {}

        # First, find and read all parameters files to update correlation counts
        for results_file in results_files:
            micrograph_id = get_micrograph_id(results_file)
            params_file = results_file.replace(
                "_results.csv", "_results_parameters.csv"
            )

            if os.path.exists(params_file):
                try:
                    # Read the parameters file
                    params_df = pd.read_csv(params_file)
                    if not params_df.empty:
                        step_micrograph_parameters[micrograph_id] = params_df.iloc[0]

                        # Add to total correlations for this micrograph
                        if "num_correlations" in params_df.columns:
                            correlations = int(params_df.iloc[0]["num_correlations"])
                            micrograph_correlations[micrograph_id] += correlations
                            print(
                                f"  {micrograph_id}: Added {correlations} correlations "
                                f"(total: {micrograph_correlations[micrograph_id]})"
                            )
                except Exception as e:
                    print(f"  Error reading parameters file {params_file}: {e}")
            else:
                print(f"  Warning: Parameters file not found for {results_file}")

        # Calculate threshold for each micrograph based on its cumulative correlations
        for micrograph_id, total_correlations in micrograph_correlations.items():
            threshold = gaussian_noise_zscore_cutoff(
                total_correlations, false_positive_rate
            )
            micrograph_thresholds[micrograph_id] = threshold
            print(
                f"  Threshold for {micrograph_id} in step {step_num}: {threshold:.4f} "
                f"(based on {total_correlations} total correlations)"
            )

        # Process each results file
        for results_file in results_files:
            micrograph_id = get_micrograph_id(results_file)

            try:
                # Read the results file
                results_df = pd.read_csv(results_file)

                if results_df.empty:
                    print(f"  Warning: Empty results file {results_file}")
                    continue

                # Get the threshold for this micrograph
                if micrograph_id not in micrograph_thresholds:
                    print(
                        f"  Warning: No correlation information for {micrograph_id}, "
                        "deriving a fallback threshold"
                    )
                    # Try to use correlations from the current step's parameters file
                    if (
                        micrograph_id in step_micrograph_parameters
                        and "num_correlations"
                        in step_micrograph_parameters[micrograph_id]
                    ):
                        correlations = int(
                            step_micrograph_parameters[micrograph_id][
                                "num_correlations"
                            ]
                        )
                        micrograph_correlations[micrograph_id] = correlations
                        threshold = gaussian_noise_zscore_cutoff(
                            correlations, false_positive_rate
                        )
                        micrograph_thresholds[micrograph_id] = threshold
                        print(
                            f"  Using threshold {threshold:.4f} for {micrograph_id} "
                            f"based on {correlations} correlations"
                        )
                    else:
                        # If no information at all, use the median of other thresholds
                        # or a reasonable default
                        if micrograph_thresholds:
                            threshold = np.median(list(micrograph_thresholds.values()))
                            print(
                                f"  Using median threshold {threshold:.4f} for "
                                f"{micrograph_id}"
                            )
                        else:
                            # Default if no other information is available
                            threshold = 5.0
                            print(
                                f"  Using default threshold {threshold:.4f} for "
                                f"{micrograph_id}"
                            )
                        micrograph_thresholds[micrograph_id] = threshold
                else:
                    threshold = micrograph_thresholds[micrograph_id]

                # Check if refined_scaled_mip column exists
                if "refined_scaled_mip" not in results_df.columns:
                    print(
                        f"  Warning: refined_scaled_mip not found in {results_file},"
                        " using scaled_mip instead"
                    )
                    compare_col = "scaled_mip"
                else:
                    compare_col = "refined_scaled_mip"

                # Filter particles above threshold using the appropriate column
                above_threshold_df = results_df[
                    results_df[compare_col] > threshold
                ].copy()

                if above_threshold_df.empty:
                    print(f"  No particles above threshold in {results_file}")
                    continue

                # Print stats
                print(
                    f"{micrograph_id}: {len(above_threshold_df)} of {len(results_df)}"
                    f" particles above threshold (using {compare_col})"
                )

                # Add a step column to track which step this is from
                above_threshold_df["step"] = step_num

                # If this is the first step, just add all particles above threshold
                if step_num == 1:
                    all_particles[micrograph_id] = above_threshold_df

                    # Update particle step map
                    for idx in above_threshold_df["particle_index"]:
                        particle_step_map[(micrograph_id, idx)] = step_num
                else:
                    # If this micrograph was not seen before, add all particles
                    if micrograph_id not in all_particles:
                        all_particles[micrograph_id] = above_threshold_df

                        # Update particle step map
                        for idx in above_threshold_df["particle_index"]:
                            particle_step_map[(micrograph_id, idx)] = step_num
                    else:
                        # For existing micrographs, merge the new results into
                        # the particles stored from earlier steps
                        existing_df = all_particles[micrograph_id]

                        # Create a new DataFrame to store updated particles
                        updated_df = existing_df.copy()

                        # For each particle in the new results
                        for _, particle in above_threshold_df.iterrows():
                            particle_idx = particle["particle_index"]

                            # Check if this particle exists previously
                            existing_particle = existing_df[
                                existing_df["particle_index"] == particle_idx
                            ]

                            if len(existing_particle) > 0:
                                # Particle exists, update parameters
                                # Find the index in the updated_df
                                idx_to_update = updated_df.index[
                                    updated_df["particle_index"] == particle_idx
                                ].tolist()[0]

                                # Offset columns that accumulate across steps
                                offset_cols = [
                                    "original_offset_phi",
                                    "original_offset_theta",
                                    "original_offset_psi",
                                ]

                                # Initialize any missing offset columns to zero
                                for col in offset_cols:
                                    if col not in updated_df.columns:
                                        updated_df[col] = 0.0

                                # Add offset values from current step
                                for col in offset_cols:
                                    # Add particle's offset to existing offset
                                    if col in particle and pd.notna(particle[col]):
                                        updated_df.at[idx_to_update, col] += particle[
                                            col
                                        ]

                                # Update other parameters
                                for col in particle.index:
                                    if col not in offset_cols and pd.notna(
                                        particle[col]
                                    ):
                                        updated_df.at[idx_to_update, col] = particle[
                                            col
                                        ]

                                # Update step
                                updated_df.at[idx_to_update, "step"] = step_num
                                particle_step_map[(micrograph_id, particle_idx)] = (
                                    step_num
                                )
                            else:
                                # New particle, add it to the DataFrame
                                updated_df = pd.concat(
                                    [updated_df, pd.DataFrame([particle])],
                                    ignore_index=True,
                                )
                                particle_step_map[(micrograph_id, particle_idx)] = (
                                    step_num
                                )

                        # Update the all_particles dictionary
                        all_particles[micrograph_id] = updated_df

            except Exception as e:
                print(f"  Error processing results file {results_file}: {e}")

        # Save intermediate results for this step
        for micrograph_id, particles_df in all_particles.items():
            # Only save particles found or updated in this step
            step_particles = particles_df[particles_df["step"] == step_num]

            if not step_particles.empty:
                output_file = os.path.join(
                    step_output_dir, f"{micrograph_id}_results_above_threshold.csv"
                )
                step_particles.to_csv(output_file, index=False)
                print(
                    f"  Saved {len(step_particles)} particles for {micrograph_id} "
                    f"in step {step_num}"
                )

    # Save final results after all steps
    final_output_dir = os.path.join(output_base_dir, "final_results")
    os.makedirs(final_output_dir, exist_ok=True)

    # Save summary of total particles per micrograph
    summary_data = []

    for micrograph_id, particles_df in all_particles.items():
        output_file = os.path.join(
            final_output_dir, f"{micrograph_id}_results_above_threshold.csv"
        )
        particles_df.to_csv(output_file, index=False)

        # Get the final threshold for this micrograph
        final_threshold = micrograph_thresholds.get(micrograph_id, "N/A")
        total_correlations = micrograph_correlations.get(micrograph_id, 0)

        # Create summary data
        n_particles = len(particles_df)
        summary_data.append(
            {
                "micrograph_id": micrograph_id,
                "total_particles": n_particles,
                "total_correlations": total_correlations,
                "final_threshold": final_threshold,
            }
        )

        print(
            f"Saved {n_particles} final particles for {micrograph_id} "
            f"(threshold: {final_threshold}, correlations: {total_correlations})"
        )

    # Save summary
    summary_df = pd.DataFrame(summary_data)
    summary_df.to_csv(
        os.path.join(final_output_dir, "processing_summary.csv"), index=False
    )

    print(f"\nProcessing complete. Final results saved to {final_output_dir}")

    # Print total particles
    total_particles = sum(len(df) for df in all_particles.values())
    print(f"Total particles across all micrographs: {total_particles}")

    return all_particles


def main() -> None:
    """Main function to process results files sequentially."""
    parser = argparse.ArgumentParser(
        description="Process results files from multiple directories sequentially"
    )
    parser.add_argument(
        "directories", nargs="+", help="Ordered list of directories to process"
    )
    parser.add_argument(
        "--output", "-o", required=True, help="Output directory for results"
    )
    parser.add_argument(
        "--false-positive-rate",
        "-f",
        type=float,
        default=0.005,
        help="False positive rate for threshold calculation (default: 0.005)",
    )

    args = parser.parse_args()

    # Check that every input path is an existing directory; parser.error
    # prints the message and exits with a non-zero status
    for directory in args.directories:
        if not os.path.isdir(directory):
            parser.error(f"Directory {directory} does not exist")

    # Process directories
    process_directories_sequentially(
        args.directories, args.output, args.false_positive_rate
    )


if __name__ == "__main__":
    main()
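
The script above recomputes each micrograph's threshold from its cumulative correlation count, so the threshold should rise as later steps add correlations. The following is a minimal sketch of that behaviour, assuming the cutoff is the one-sided Gaussian tail quantile for `false_positives / num_correlations`. The function `zscore_cutoff_sketch` is a hypothetical stand-in written for illustration; whether `gaussian_noise_zscore_cutoff` in Leopard-EM uses exactly this formula is an assumption. The per-step count of 1296 is taken from the step 1 log, and reusing it for later steps is also an assumption.

```python
from statistics import NormalDist


def zscore_cutoff_sketch(num_correlations: int, false_positives: float) -> float:
    """Z-score that Gaussian noise exceeds `false_positives` times on average.

    Hypothetical stand-in for illustration only; not the Leopard-EM function.
    """
    # Per-correlation tail probability that gives the requested number of
    # expected false positives over all correlations.
    tail_probability = false_positives / num_correlations
    # Upper-tail quantile of the standard normal, written via symmetry to
    # avoid the precision loss of computing 1 - tail_probability.
    return -NormalDist().inv_cdf(tail_probability)


# 1296 correlations per step matches the step 1 log; later steps are assumed
# to add the same number purely for illustration.
cumulative = 0
thresholds = []
for step in range(1, 5):
    cumulative += 1296
    thresholds.append(zscore_cutoff_sketch(cumulative, 0.005))
    print(f"step {step}: {cumulative} correlations -> threshold {thresholds[-1]:.4f}")
```

Under these assumptions a particle that passes in an early step must also clear the higher cumulative threshold to be updated in a later step. The script's own log output follows.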



Processing Step 1: results_constrained_step1
  Found 110 results files
  111_Feb07_11.27.26_23_0_constrained: Added 1296 correlations (total: 1296)
  42_Feb06_17.34.11_104_0_constrained: Added 1296 correlations (total: 1296)
  113_Feb07_11.30.59_27_0_constrained: Added 1296 correlations (total: 1296)
  67_Feb06_19.05.20_154_0_constrained: Added 1296 correlations (total: 1296)
  78_Feb06_19.47.10_176_0_constrained: Added 1296 correlations (total: 1296)
  66_Feb06_19.01.33_152_0_constrained: Added 1296 correlations (total: 1296)
  38_Feb06_17.21.47_97_0_constrained: Added 1296 correlations (total: 1296)
  56_Feb06_18.19.54_132_0_constrained: Added 1296 correlations (total: 1296)
  30_Feb06_16.47.43_84_0_constrained: Added 1296 correlations (total: 1296)
  60_Feb06_18.41.27_140_0_constrained: Added 1296 correlations (total: 1296)
  41_Feb06_17.31.04_102_0_constrained: Added 1296 correlations (total: 1296)
  106_Feb07_11.18.15_13_0_constrained: Added 1296 correlations (total: 1296)
  75_F