# Sampling a Subset of Training Data

This notebook extracts a representative and manageable subset of the full training data for development and experimentation. The sampled data includes both the images and their corresponding labels.

---

## 1. Setup & Configuration

### 1.1 Import Libraries
Import required libraries and configuration settings.

In [1]:
# Standard Library
import os
import random
import shutil
import sys
from pathlib import Path

# Add the project root to sys.path so that `src` modules can be imported
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Third-Party Libraries
import pandas as pd
from tqdm import tqdm

# Project Configuration
from src import config

### 1.2 Load Configuration from `config.py`

In [2]:
# Display key paths for reference
print("Full labels CSV path:", config.FULL_LABELS_PATH)
print("Original train image directory:", config.FULL_DATA_TRAIN_DIR)
print("Sampled image output directory:", config.SAMPLED_TRAIN_DIR)
print("Sampled labels CSV path:", config.SAMPLED_LABELS_PATH)

Full labels CSV path: /Users/ramy/Desktop/team_project/BYU_Locating_Bacterial_Flagellar_Motors_2025/data/raw/train_labels.csv
Original train image directory: /Users/ramy/Desktop/team_project/BYU_Locating_Bacterial_Flagellar_Motors_2025/../byu-locating-bacterial-flagellar-motors-2025/train
Sampled image output directory: /Users/ramy/Desktop/team_project/BYU_Locating_Bacterial_Flagellar_Motors_2025/data/sampled/sampled_train
Sampled labels CSV path: /Users/ramy/Desktop/team_project/BYU_Locating_Bacterial_Flagellar_Motors_2025/data/sampled/sampled_train_labels.csv


### 1.3 Verify Paths and File Existence

In [3]:
print("\n Label CSV Exists:", os.path.exists(config.FULL_LABELS_PATH))
print(" Train Folder Exists:", os.path.exists(config.FULL_DATA_TRAIN_DIR))

# Create sampled directories if they do not exist
os.makedirs(config.SAMPLED_TRAIN_DIR, exist_ok=True)
os.makedirs(os.path.dirname(config.SAMPLED_LABELS_PATH), exist_ok=True)


 Label CSV Exists: True
 Train Folder Exists: True


## 3. Sample Creation

We now create a stratified sample of the tomograms, taking **approximately 10%** (rounded, with a minimum of one tomogram) from every motor count class (`0`, `1`, `2`, `3`, `4`, `6`, `10`).

### 3.1 Load Full Labels and Identify Classes

Load the full dataset from `FULL_LABELS_PATH` and identify all unique motor count classes.

In [4]:
import pandas as pd
from src.config import FULL_LABELS_PATH

# Load full labels
full_labels = pd.read_csv(FULL_LABELS_PATH)

# Get one row per tomogram
unique_tomos = full_labels.drop_duplicates(subset="tomo_id")

# Identify all classes
all_motor_classes = sorted(unique_tomos["Number of motors"].unique())
print("Motor count classes in dataset:", all_motor_classes)

Motor count classes in dataset: [np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(6), np.int64(10)]


### 3.2 Compute 10% Sample Size Per Class

Calculate how many tomograms to sample from each motor count class by taking 10% of the class total and rounding to the nearest integer.  
To ensure every class is represented, at least **1 tomogram** is sampled from each class, so the smallest classes are sampled at well above 10%.

In [5]:
# Count total tomograms per class
motor_class_counts = unique_tomos["Number of motors"].value_counts().sort_index()

# Compute 10% sample size with minimum of 1
sample_sizes = motor_class_counts.apply(lambda x: max(1, round(x * 0.10)))

# Combine into a summary DataFrame
sampling_summary = pd.DataFrame({
    "Motor Count Class": motor_class_counts.index,
    "Total Tomograms": motor_class_counts.values,
    "Sample Size (10%)": sample_sizes.values
}).set_index("Motor Count Class")

# Display the table
print("Sampling Summary (10% per class, min=1):")
display(sampling_summary)

Sampling Summary (10% per class, min=1):


| Motor Count Class | Total Tomograms | Sample Size (10%) |
|---|---|---|
| 0 | 286 | 29 |
| 1 | 313 | 31 |
| 2 | 30 | 3 |
| 3 | 6 | 1 |
| 4 | 9 | 1 |
| 6 | 3 | 1 |
| 10 | 1 | 1 |
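As a quick check of the arithmetic, the sketch below recomputes the sample sizes from the per-class totals in the summary table and shows the effective sampling fraction for each class. The helper name `class_sample_size` is hypothetical and not part of the project; it simply mirrors the `max(1, round(x * 0.10))` rule used in the cell above.

```python
def class_sample_size(total, fraction=0.10, minimum=1):
    """Rounded share of `total`, never below `minimum` (hypothetical helper)."""
    return max(minimum, round(total * fraction))

# Per-class tomogram totals copied from the summary table
class_totals = {0: 286, 1: 313, 2: 30, 3: 6, 4: 9, 6: 3, 10: 1}

sizes = {cls: class_sample_size(n) for cls, n in class_totals.items()}
effective = {cls: sizes[cls] / n for cls, n in class_totals.items()}

for cls in class_totals:
    print(f"class {cls:>2}: {sizes[cls]:>2} of {class_totals[cls]:>3} ({effective[cls]:.0%})")
print("total sampled:", sum(sizes.values()))
```

Because of the minimum of one, the rare classes are over-represented relative to a strict 10% draw (the single tomogram in class `10` is always selected), which is worth keeping in mind when interpreting results on the sampled set.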


### 3.3 Sample the Tomograms

Draw the sample from each motor count class using the per-class sizes computed above. A fixed `random_state` keeps the selection reproducible.

In [6]:
# Sample ~10% of tomograms from each motor count class (with min=1)
# We move 'Number of motors' to the index to avoid future warnings from groupby-apply behavior.
sampled = (
    unique_tomos
    .set_index("Number of motors", append=True)  # move group key to index
    .groupby(level="Number of motors", group_keys=False)
    .apply(lambda x: x.sample(n=sample_sizes.loc[x.name], random_state=42))
    .reset_index(level="Number of motors")  # restore the class column (a plain drop=True would discard it)
    .reset_index(drop=True)  # discard the original row index
)

print(f"Total sampled tomograms: {len(sampled)}")

Total sampled tomograms: 67
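The same per-class draw can also be written as an explicit loop over groups, which some readers find easier to follow than `groupby().apply()`. The following is a minimal sketch on toy data, not the notebook's actual code: the function name `sample_per_class`, the toy `tomo_id` values, and the toy sizes are all assumptions made for illustration.

```python
import pandas as pd

def sample_per_class(df, class_col, sizes, seed=42):
    """Draw sizes[cls] rows from each class and stack the results (hypothetical helper)."""
    parts = [
        group.sample(n=sizes[cls], random_state=seed)
        for cls, group in df.groupby(class_col)
    ]
    return pd.concat(parts, ignore_index=True)

# Toy stand-in for `unique_tomos`: 20 tomograms with 0 motors, 4 with 1, 1 with 2
toy = pd.DataFrame({
    "tomo_id": [f"tomo_{i:03d}" for i in range(25)],
    "Number of motors": [0] * 20 + [1] * 4 + [2],
})
toy_sizes = {0: 2, 1: 1, 2: 1}

toy_sample = sample_per_class(toy, "Number of motors", toy_sizes)
print(toy_sample["Number of motors"].value_counts().sort_index())
```

In this form the class column stays a regular column throughout, so no index manipulation is needed before or after sampling.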


### 3.4 Save Sampled Labels

Save the sampled tomogram metadata to `SAMPLED_LABELS_PATH`, unless the file already exists.

In [7]:
import os
from src.config import SAMPLED_LABELS_PATH

# Check if the output file already exists
if os.path.exists(SAMPLED_LABELS_PATH):
    print(f"File already exists at: {SAMPLED_LABELS_PATH}")
else:
    sampled.to_csv(SAMPLED_LABELS_PATH, index=False)
    print(f"Sampled labels saved to: {SAMPLED_LABELS_PATH}")

File already exists at: /Users/ramy/Desktop/team_project/BYU_Locating_Bacterial_Flagellar_Motors_2025/data/sampled/sampled_train_labels.csv
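Since an existing file is never overwritten, the CSV on disk can drift from the in-memory `sampled` DataFrame if the sampling logic changes between runs. One way to detect this, sketched here under the assumption that `tomo_id` uniquely identifies a sampled tomogram, is to compare the saved IDs with the current ones. The helper `saved_ids_match` is hypothetical, and the demo uses a temporary file rather than the project paths.

```python
import os
import tempfile

import pandas as pd

def saved_ids_match(csv_path, sampled_df, id_col="tomo_id"):
    """Return True if the CSV on disk holds the same set of IDs as `sampled_df`."""
    saved_ids = set(pd.read_csv(csv_path)[id_col])
    return saved_ids == set(sampled_df[id_col])

# Demo with a throwaway file standing in for SAMPLED_LABELS_PATH
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "sampled_train_labels.csv")
    pd.DataFrame({"tomo_id": ["tomo_a", "tomo_b"]}).to_csv(demo_path, index=False)

    same = saved_ids_match(demo_path, pd.DataFrame({"tomo_id": ["tomo_b", "tomo_a"]}))
    stale = saved_ids_match(demo_path, pd.DataFrame({"tomo_id": ["tomo_a", "tomo_c"]}))

print("matches current sample:", same)
print("matches a different sample:", stale)
```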


### 3.5 Copy Sampled Tomogram Folders

After saving the sampled labels, copy the corresponding tomogram folders from `FULL_DATA_TRAIN_DIR` to `EXTERNAL_SAMPLED_TRAIN_DIR`.
Only folders for sampled `tomo_id`s are copied. Folders that already exist at the destination are skipped, and missing source folders are counted and reported rather than raising an error.

In [11]:
import shutil
import os
from src.config import FULL_DATA_TRAIN_DIR, EXTERNAL_SAMPLED_TRAIN_DIR
from tqdm import tqdm

sampled_tomos = sampled["tomo_id"].unique()

os.makedirs(EXTERNAL_SAMPLED_TRAIN_DIR, exist_ok=True)

copied = 0
skipped = 0
missing = 0

for tomo_id in tqdm(sampled_tomos, desc="Copying tomogram folders"):
    src_path = os.path.join(FULL_DATA_TRAIN_DIR, tomo_id)
    dst_path = os.path.join(EXTERNAL_SAMPLED_TRAIN_DIR, tomo_id)

    if not os.path.exists(src_path):
        missing += 1
        continue

    if os.path.exists(dst_path):
        skipped += 1
        continue

    shutil.copytree(src_path, dst_path)
    copied += 1

# Summary output only
print("\nCopy complete:")
print(f"- Copied folders: {copied}")
print(f"- Skipped (already exists): {skipped}")
print(f"- Missing source folders: {missing}")

Copying tomogram folders: 100%|██████████| 67/67 [00:31<00:00,  2.13it/s]


Copy complete:
- Copied folders: 67
- Skipped (already exists): 0
- Missing source folders: 0
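Because destination folders that already exist are skipped, a folder left half-copied by an interrupted run would not be repaired on the next run. A simple sanity check, sketched below with hypothetical helper names and temporary directories in place of the project paths, is to compare file counts between each source folder and its copy.

```python
import os
import tempfile

def count_files(folder):
    """Count all files under `folder`, including nested subfolders."""
    return sum(len(files) for _, _, files in os.walk(folder))

def find_incomplete_copies(src_root, dst_root, tomo_ids):
    """Return IDs whose destination folder is absent or has a different file count."""
    incomplete = []
    for tomo_id in tomo_ids:
        src = os.path.join(src_root, tomo_id)
        dst = os.path.join(dst_root, tomo_id)
        if not os.path.isdir(dst) or count_files(src) != count_files(dst):
            incomplete.append(tomo_id)
    return incomplete

def make_folder(root, name, n_files):
    """Create root/name with n_files empty placeholder files (demo only)."""
    path = os.path.join(root, name)
    os.makedirs(path)
    for i in range(n_files):
        open(os.path.join(path, f"slice_{i:04d}.jpg"), "w").close()

# Demo: tomo_a is fully copied, tomo_b is missing one file
with tempfile.TemporaryDirectory() as src_root, tempfile.TemporaryDirectory() as dst_root:
    make_folder(src_root, "tomo_a", 2)
    make_folder(src_root, "tomo_b", 2)
    make_folder(dst_root, "tomo_a", 2)
    make_folder(dst_root, "tomo_b", 1)
    incomplete = find_incomplete_copies(src_root, dst_root, ["tomo_a", "tomo_b"])

print("Incomplete copies:", incomplete)
```

Matching file counts do not prove the contents are identical, but the check is cheap and catches the most likely failure mode of an interrupted copy.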



