# BYU Locating Flagellar Motors

## Submission Generation Notebook

This is the fourth and final notebook in a series for the BYU Locating Bacterial Flagellar Motors 2025 Kaggle challenge. This notebook creates predictions on test data and generates the competition submission file.

### Notebook Series:
1. **[Parse Data](https://www.kaggle.com/code/andrewjdarley/parse-data)**: Extracting and preparing 2D slices containing motors to make a YOLO dataset
2. **[Visualize Data](https://www.kaggle.com/code/andrewjdarley/visualize-data)**: Exploratory data analysis and visualization of annotated motor locations
3. **[Train YOLO](https://www.kaggle.com/code/andrewjdarley/train-yolo)**: Fine-tuning a YOLOv8 object detection model on the prepared dataset
4. **Submission Notebook (Current)**: Running inference and generating submission files

## Important: Offline Execution
This notebook is designed to run in an offline environment. The Ultralytics YOLOv8 package is installed using the offline installation method from [this reference notebook](https://www.kaggle.com/code/itsuki9180/ultralytics-for-offline-install). That implementation is brilliant; I use my own copy of it as an input, which works effectively the same as the original.

## About this Notebook

This submission notebook implements an optimized inference pipeline that:

1. **Model Loading**: Loads the trained detector weights. The series trains YOLOv8, but the code in this notebook loads a Faster R-CNN (ResNet-50 FPN) checkpoint from `model_path`
2. **GPU Optimization**: Configures CUDA optimizations (cuDNN autotuning, TF32 matmul) and memory management
3. **Parallel Processing**: Uses CUDA streams and batch processing for efficient GPU utilization
4. **3D Detection**: Processes each slice to locate motors
5. **Non-Maximum Suppression**: Applies 3D NMS to cluster and merge detections across slices
6. **Submission Generation**: Creates the final CSV file with predicted motor coordinates
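
The cross-slice merging in step 5 can be sketched as a greedy, distance-based suppression over detection centers. This is a minimal illustration, not the notebook's exact function: `suppress_nearby` is a hypothetical name, and the threshold `24 * 0.2` assumes the box size and NMS threshold used later in this notebook.

```python
import math

def suppress_nearby(detections, dist_thresh):
    """Keep the most confident detection, drop everything within dist_thresh of it, repeat."""
    remaining = sorted(detections, key=lambda d: d["confidence"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Retain only detections whose center is farther than dist_thresh from `best`
        remaining = [
            d for d in remaining
            if math.dist((d["z"], d["y"], d["x"]),
                         (best["z"], best["y"], best["x"])) > dist_thresh
        ]
    return kept

dets = [
    {"z": 100, "y": 200, "x": 300, "confidence": 0.9},
    {"z": 101, "y": 201, "x": 300, "confidence": 0.8},  # same motor seen on the next slice
    {"z": 250, "y": 50, "x": 60, "confidence": 0.6},    # a distant, separate detection
]
kept = suppress_nearby(dets, dist_thresh=24 * 0.2)
print([d["confidence"] for d in kept])
```

Here the second detection lies about 1.4 voxels from the first, so it is merged into it, while the distant one survives.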

The code includes optimizations such as dynamic batch sizing based on available GPU memory, preloading the next batch while the current one is processed, and GPU profiling to monitor performance. The CONCENTRATION parameter sets the fraction of slices processed per tomogram, trading detection accuracy for speed. In practice, the only reason to lower it from 1 is to verify quickly that the notebook can produce a submission, since a full run takes a few hours.
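
As a rough illustration of the dynamic batch sizing: the notebook assumes about 4 images per GB of free GPU memory and clamps the result to the range 8 to 32. The sketch below reproduces only that arithmetic; `pick_batch_size` and its parameter names are hypothetical and do not appear in the notebook.

```python
def pick_batch_size(free_gb, images_per_gb=4, lo=8, hi=32):
    """Clamp an estimated batch size (images_per_gb * free_gb) to [lo, hi]."""
    return max(lo, min(hi, int(free_gb * images_per_gb)))

# 15.66 GB free (the figure logged for the Tesla T4 run) hits the upper clamp
print(pick_batch_size(15.66))
# A small GPU with 1 GB free falls back to the lower clamp
print(pick_batch_size(1.0))
```

Note that the lower clamp means the batch size never drops below 8, even when the memory estimate alone would suggest fewer images.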

In [4]:
!tar xfvz /kaggle/input/ultralytics-for-offline-install/archive.tar.gz
!pip install --no-index --find-links=./packages ultralytics
!rm -rf ./packages

./packages/
./packages/networkx-3.4.2-py3-none-any.whl
./packages/fsspec-2025.2.0-py3-none-any.whl
./packages/python_dateutil-2.9.0.post0-py2.py3-none-any.whl
./packages/jinja2-3.1.5-py3-none-any.whl
./packages/pyparsing-3.2.1-py3-none-any.whl
./packages/charset_normalizer-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
./packages/ultralytics_thop-2.0.14-py3-none-any.whl
./packages/nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl
./packages/urllib3-2.3.0-py3-none-any.whl
./packages/nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl
./packages/pytz-2025.1-py2.py3-none-any.whl
./packages/MarkupSafe-3.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
./packages/numpy-2.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
./packages/cycler-0.12.1-py3-none-any.whl
./packages/nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl
./packages/nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl
./package

In [5]:
import os
import numpy as np
import pandas as pd
from PIL import Image
import torch
import cv2
from tqdm.notebook import tqdm
import threading
import time
from contextlib import nullcontext
from concurrent.futures import ThreadPoolExecutor

# Import torchvision components for Faster R-CNN
import torchvision
from torchvision.models.detection.faster_rcnn import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
import torchvision.transforms as T

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Define paths
data_path = "/kaggle/input/byu-locating-bacterial-flagellar-motors-2025/"
test_dir = os.path.join(data_path, "test")
submission_path = "/kaggle/working/submission.csv"

# Faster R-CNN model path (adjust if your model is saved elsewhere)
model_path = "/kaggle/input/test-faster-rcnn/pytorch/default/1/fasterrcnn_motor_detector_2.pth"

# Detection parameters
CONFIDENCE_THRESHOLD = 0.45  # Lower threshold to catch more potential motors
MAX_DETECTIONS_PER_TOMO = 3  # Top N detections per tomogram (currently unused; only the best one is returned)
NMS_IOU_THRESHOLD = 0.2     # Scaled by the box size into a 3D distance threshold in perform_3d_nms
CONCENTRATION = 1           # Fraction of slices to process per tomogram (1 = all slices)
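
The effect of CONCENTRATION on slice selection can be sketched in pure Python. The notebook itself uses `np.linspace` followed by rounding; `select_slices` below is a hypothetical helper that is only meant to mirror that behaviour for illustration.

```python
def select_slices(n_slices, concentration):
    """Return evenly spaced slice indices covering int(n_slices * concentration) slices."""
    k = int(n_slices * concentration)
    if k <= 0:
        return []
    if k == 1:
        return [0]
    step = (n_slices - 1) / (k - 1)  # spacing that keeps both the first and last slice
    return [round(i * step) for i in range(k)]

full = select_slices(500, 1)      # every slice
tenth = select_slices(500, 0.1)   # about every tenth slice
print(len(full), len(tenth), tenth[0], tenth[-1])
```

With a CONCENTRATION of 0.1 on a 500-slice tomogram, this picks 50 slices spanning index 0 through 499, which is why low values are useful only for a quick end-to-end check.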


In [6]:
# GPU profiling context manager
class GPUProfiler:
    def __init__(self, name):
        self.name = name
        self.start_time = None
        
    def __enter__(self):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        self.start_time = time.time()
        return self
        
    def __exit__(self, *args):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed = time.time() - self.start_time
        print(f"[PROFILE] {self.name}: {elapsed:.3f}s")

# Check GPU availability and set up optimizations
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
BATCH_SIZE = 8  # Default batch size; will adjust if GPU available

if device.startswith('cuda'):
    # CUDA optimization flags
    torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.deterministic = False
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    
    # Print GPU info
    gpu_name = torch.cuda.get_device_name(0)
    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9  # GB
    print(f"Using GPU: {gpu_name} with {gpu_mem:.2f} GB memory")
    
    # Estimate free memory and adjust batch size (approx. 4 images per GB)
    free_mem = gpu_mem - torch.cuda.memory_allocated(0) / 1e9
    BATCH_SIZE = max(8, min(32, int(free_mem * 4)))
    print(f"Dynamic batch size set to {BATCH_SIZE} based on {free_mem:.2f} GB free memory")
else:
    print("GPU not available, using CPU")
    BATCH_SIZE = 4  # Smaller batch size on CPU

def normalize_slice(slice_data):
    """
    Normalize slice data using 2nd and 98th percentiles for contrast
    """
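    p2 = np.percentile(slice_data, 2)
    p98 = np.percentile(slice_data, 98)
    if p98 <= p2:
        # Constant (or near-constant) slice: avoid dividing by zero
        return np.zeros_like(slice_data, dtype=np.uint8)
    clipped_data = np.clip(slice_data, p2, p98)
    normalized = 255 * (clipped_data - p2) / (p98 - p2)
    return np.uint8(normalized)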

def preload_image_batch(file_paths):
    """Preload a batch of images into CPU memory"""
    images = []
    for path in file_paths:
        img = cv2.imread(path)
        if img is None:
            img = np.array(Image.open(path))
        images.append(img)
    return images

def process_tomogram(tomo_id, model, index=0, total=1):
    """
    Process a single tomogram folder and return the top motor detection (z, y, x).
    """
    print(f"Processing tomogram {tomo_id} ({index}/{total})")
    tomo_dir = os.path.join(test_dir, tomo_id)
    slice_files = sorted([f for f in os.listdir(tomo_dir) if f.endswith('.jpg')])
    
    # Apply CONCENTRATION to reduce slice count
    selected_indices = np.linspace(0, len(slice_files)-1, int(len(slice_files) * CONCENTRATION))
    selected_indices = np.round(selected_indices).astype(int)
    slice_files = [slice_files[i] for i in selected_indices]
    print(f"Processing {len(slice_files)} out of {len(os.listdir(tomo_dir))} slices "
          f"based on CONCENTRATION={CONCENTRATION}")
    
    all_detections = []  # to store {z, y, x, confidence}
    
    # Create CUDA streams if on GPU
    if device.startswith('cuda'):
        streams = [torch.cuda.Stream() for _ in range(min(4, BATCH_SIZE))]
    else:
        streams = [None]
    
    next_batch_thread = None
    next_batch_images = None
    
    # Precompute the torchvision transform: convert PIL image to tensor
    to_tensor = T.ToTensor()
    
    for batch_start in range(0, len(slice_files), BATCH_SIZE):
        # Wait for previous preload thread to finish
        if next_batch_thread is not None:
            next_batch_thread.join()
            next_batch_images = None
        
        batch_end = min(batch_start + BATCH_SIZE, len(slice_files))
        batch_files = slice_files[batch_start:batch_end]
        
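        # Preload next batch in a background thread. The thread's return value is
        # discarded, so this only warms the OS file cache; slices are re-read below.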
        next_batch_start = batch_end
        next_batch_end = min(next_batch_start + BATCH_SIZE, len(slice_files))
        next_batch_files = slice_files[next_batch_start:next_batch_end] if next_batch_start < len(slice_files) else []
        if next_batch_files:
            next_batch_paths = [os.path.join(tomo_dir, f) for f in next_batch_files]
            next_batch_thread = threading.Thread(target=preload_image_batch, args=(next_batch_paths,))
            next_batch_thread.start()
        else:
            next_batch_thread = None
        
        # Split this batch across CUDA streams
        sub_batches = np.array_split(batch_files, len(streams))
        
        for i, sub_batch in enumerate(sub_batches):
            if len(sub_batch) == 0:
                continue
            
            stream = streams[i % len(streams)]
            with torch.cuda.stream(stream) if (stream and device.startswith('cuda')) else nullcontext():
                # Prepare images and z-coordinates
                images_tensor_list = []
                sub_batch_slice_nums = []
                for slice_file in sub_batch:
                    img_path = os.path.join(tomo_dir, slice_file)
                    
                    # Read image (grayscale), normalize contrast, convert to RGB
                    raw = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
                    normalized = normalize_slice(raw)
                    rgb = cv2.cvtColor(normalized, cv2.COLOR_GRAY2RGB)
                    pil_img = Image.fromarray(rgb)  # PIL Image
                    
                    # Convert to tensor and move to device
                    img_tensor = to_tensor(pil_img).to(device)
                    images_tensor_list.append(img_tensor)
                    
                    # Extract z-index from filename: e.g., slice_123.jpg → z=123
                    z_index = int(slice_file.split('_')[1].split('.')[0])
                    sub_batch_slice_nums.append(z_index)
                
                if not images_tensor_list:
                    continue
                
                # Run inference with profiling
                with GPUProfiler(f"Inference batch {i+1}/{len(sub_batches)}"):
                    model.eval()
                    with torch.no_grad():
                        outputs = model(images_tensor_list)
                
                # Process each result in this sub-batch
                for j, output in enumerate(outputs):
                    boxes = output['boxes']
                    scores = output['scores']
                    
                    # Filter by confidence threshold
                    keep_idxs = (scores >= CONFIDENCE_THRESHOLD).nonzero(as_tuple=False).view(-1)
                    for idx in keep_idxs:
                        confidence = float(scores[idx].cpu().item())
                        x1, y1, x2, y2 = boxes[idx].cpu().numpy()
                        x_center = (x1 + x2) / 2.0
                        y_center = (y1 + y2) / 2.0
                        
                        all_detections.append({
                            'z': sub_batch_slice_nums[j],
                            'y': round(y_center),
                            'x': round(x_center),
                            'confidence': confidence
                        })
        
        # Synchronize CUDA streams
        if device.startswith('cuda'):
            torch.cuda.synchronize()
    
    # Ensure final preload thread is joined
    if next_batch_thread is not None:
        next_batch_thread.join()
    
    # 3D Non-Maximum Suppression to merge close detections
    final_detections = perform_3d_nms(all_detections, NMS_IOU_THRESHOLD)
    final_detections.sort(key=lambda x: x['confidence'], reverse=True)
    
    # If no detections, return -1 for all three coordinates
    if not final_detections:
        return {
            'tomo_id': tomo_id,
            'Motor axis 0': -1,
            'Motor axis 1': -1,
            'Motor axis 2': -1
        }
    
    best = final_detections[0]
    return {
        'tomo_id': tomo_id,
        'Motor axis 0': round(best['z']),
        'Motor axis 1': round(best['y']),
        'Motor axis 2': round(best['x'])
    }

def perform_3d_nms(detections, iou_threshold):
    """
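    Perform 3D Non-Maximum Suppression on detections to merge nearby motors.
    This is a greedy, distance-based suppression: the most confident detection
    is kept and every detection within dist_thresh of its center is dropped.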
    """
    if not detections:
        return []
    
    detections = sorted(detections, key=lambda x: x['confidence'], reverse=True)
    final_dets = []
    
    def distance_3d(d1, d2):
        return np.sqrt((d1['z'] - d2['z'])**2 +
                       (d1['y'] - d2['y'])**2 +
                       (d1['x'] - d2['x'])**2)
    
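    # Despite its name, iou_threshold scales box_size into a center-distance
    # threshold (24 * 0.2 = 4.8 with the default NMS_IOU_THRESHOLD)
    box_size = 24
    dist_thresh = box_size * iou_threshold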
    
    while detections:
        best = detections.pop(0)
        final_dets.append(best)
        detections = [d for d in detections if distance_3d(d, best) > dist_thresh]
    
    return final_dets

def debug_image_loading(tomo_id):
    """
    Debug function: check image loading and Faster R-CNN inference on a sample slice.
    """
    tomo_dir = os.path.join(test_dir, tomo_id)
    slice_files = sorted([f for f in os.listdir(tomo_dir) if f.endswith('.jpg')])
    if not slice_files:
        print(f"No image files found in {tomo_dir}")
        return
    
    print(f"Found {len(slice_files)} image files in {tomo_dir}")
    sample_file = slice_files[len(slice_files)//2]
    img_path = os.path.join(tomo_dir, sample_file)
    
    try:
        img_pil = Image.open(img_path)
        img_arr = np.array(img_pil)
        print(f"PIL load: shape {img_arr.shape}, dtype {img_arr.dtype}")
        
        img_cv2 = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
        print(f"OpenCV load (grayscale): shape {img_cv2.shape}, dtype {img_cv2.dtype}")
        
        rgb = cv2.cvtColor(img_cv2, cv2.COLOR_GRAY2RGB)
        print(f"Converted to RGB: shape {rgb.shape}, dtype {rgb.dtype}")
        
        print("Image loading successful!")
    except Exception as e:
        print(f"Error loading {img_path}: {e}")
    
    # Test a single inference
    try:
        # Build the same model instantiation logic to load weights temporarily
        num_classes = 2  # Adjust if your model was trained with a different number of classes
        backbone = fasterrcnn_resnet50_fpn(pretrained=False)
        in_features = backbone.roi_heads.box_predictor.cls_score.in_features
        backbone.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
        backbone.load_state_dict(torch.load(model_path, map_location=device))
        backbone.to(device).eval()
        
        # Prepare one image
        raw = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
        normalized = normalize_slice(raw)
        rgb = cv2.cvtColor(normalized, cv2.COLOR_GRAY2RGB)
        pil_img = Image.fromarray(rgb)
        img_tensor = T.ToTensor()(pil_img).to(device)
        
        with torch.no_grad():
            output = backbone([img_tensor])
        print("Faster R-CNN ran on a sample slice successfully!")
        print(f"Output keys: {list(output[0].keys())}")
    except Exception as e:
        print(f"Error during Faster R-CNN inference: {e}")

def generate_submission():
    """
    Main function to generate the submission CSV.
    """
    # List all tomogram folders
    test_tomos = sorted([d for d in os.listdir(test_dir) if os.path.isdir(os.path.join(test_dir, d))])
    total_tomos = len(test_tomos)
    print(f"Found {total_tomos} tomograms in test directory")
    
    # Debug image loading & model on the first tomogram
    if test_tomos:
        debug_image_loading(test_tomos[0])
    
    # Clear CUDA cache before starting
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    
    # Initialize Faster R-CNN model once
    print(f"Loading Faster R-CNN model from {model_path}")
    num_classes = 2  # background + motor; adjust if necessary
    model = fasterrcnn_resnet50_fpn(pretrained=False)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    state_dict = torch.load(model_path, map_location=device)
    model.load_state_dict(state_dict)
    model.to(device).eval()
    
    results = []
    motors_found = 0
    
    # Submit tomograms through a ThreadPoolExecutor; max_workers=1 processes them one at a time
    with ThreadPoolExecutor(max_workers=1) as executor:
        future_to_tomo = {}
        for i, tomo_id in enumerate(test_tomos, 1):
            future = executor.submit(process_tomogram, tomo_id, model, i, total_tomos)
            future_to_tomo[future] = tomo_id
        
        for future in future_to_tomo:
            tomo_id = future_to_tomo[future]
            try:
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                
                result = future.result()
                results.append(result)
                
                # Count detected motors
                if result['Motor axis 0'] != -1:
                    motors_found += 1
                    print(f"Motor found in {tomo_id} at z={result['Motor axis 0']}, "
                          f"y={result['Motor axis 1']}, x={result['Motor axis 2']}")
                else:
                    print(f"No motor detected in {tomo_id}")
                
                pct = motors_found / len(results) * 100
                print(f"Current detection rate: {motors_found}/{len(results)} ({pct:.1f}%)")
            
            except Exception as e:
                print(f"Error processing {tomo_id}: {e}")
                results.append({
                    'tomo_id': tomo_id,
                    'Motor axis 0': -1,
                    'Motor axis 1': -1,
                    'Motor axis 2': -1
                })
    
    # Build submission DataFrame
    submission_df = pd.DataFrame(results)
    submission_df = submission_df[['tomo_id', 'Motor axis 0', 'Motor axis 1', 'Motor axis 2']]
    submission_df.to_csv(submission_path, index=False)
    
    print(f"\nSubmission complete!")
    print(f"Motors detected: {motors_found}/{total_tomos} ({motors_found/total_tomos*100:.1f}%)")
    print(f"Submission saved to: {submission_path}")
    print("\nSubmission preview:")
    print(submission_df.head())
    
    return submission_df

# Run the pipeline
if __name__ == "__main__":
    start_time = time.time()
    submission = generate_submission()
    elapsed = time.time() - start_time
    print(f"\nTotal execution time: {elapsed:.2f} seconds ({elapsed/60:.2f} minutes)")


Using GPU: Tesla T4 with 15.83 GB memory
Dynamic batch size set to 32 based on 15.66 GB free memory
Found 3 tomograms in test directory
Found 500 image files in /kaggle/input/byu-locating-bacterial-flagellar-motors-2025/test/tomo_003acc
PIL load: shape (1912, 1847), dtype uint8
OpenCV load (grayscale): shape (1912, 1847), dtype uint8
Converted to RGB: shape (1912, 1847, 3), dtype uint8
Image loading successful!


Faster R-CNN ran on a sample slice successfully!
Output keys: ['boxes', 'labels', 'scores']
Loading Faster R-CNN model from /kaggle/input/test-faster-rcnn/pytorch/default/1/fasterrcnn_motor_detector_2.pth


Processing tomogram tomo_003acc (1/3)
Processing 500 out of 500 slices based on CONCENTRATION=1
[PROFILE] Inference batch 1/4: 7.685s
[PROFILE] Inference batch 2/4: 0.662s
[PROFILE] Inference batch 3/4: 0.660s
[PROFILE] Inference batch 4/4: 0.657s
[PROFILE] Inference batch 1/4: 0.766s
[PROFILE] Inference batch 2/4: 0.665s
[PROFILE] Inference batch 3/4: 0.665s
[PROFILE] Inference batch 4/4: 0.664s
[PROFILE] Inference batch 1/4: 0.760s
[PROFILE] Inference batch 2/4: 0.665s
[PROFILE] Inference batch 3/4: 0.666s
[PROFILE] Inference batch 4/4: 0.667s
[PROFILE] Inference batch 1/4: 0.666s
[PROFILE] Inference batch 2/4: 0.670s
[PROFILE] Inference batch 3/4: 0.670s
[PROFILE] Inference batch 4/4: 0.670s
[PROFILE] Inference batch 1/4: 0.675s
[PROFILE] Inference batch 2/4: 0.673s
[PROFILE] Inference batch 3/4: 0.677s
[PROFILE] Inference batch 4/4: 0.677s
[PROFILE] Inference batch 1/4: 0.678s
[PROFILE] Inference batch 2/4: 0.669s
[PROFILE] Inference batch 3/4: 0.675s
[PROFILE] Inference batch 4/4: