# LIVECell Dataset Preparation Tutorial

This tutorial walks through:
1. Setting up the LIVECell dataset directory structure and copying the images into it
2. Converting the LIVECell (COCO-format) annotations to YOLO format
3. Creating training, validation, and test text files

## Required Dependencies

First, let's ensure we have all necessary packages installed:

In [None]:
!pip install tqdm opencv-python matplotlib numpy

## 1. Creating Directory Structure

We'll start by creating the necessary directory structure for the LIVECell dataset:

```
Dataset/LIVECell/
├── images/
│   ├── train/
│   ├── val/
│   └── test/
└── labels/
    ├── train/
    ├── val/
    └── test/
```

In [1]:
import os
from pathlib import Path
import shutil
import json
from tqdm import tqdm

def create_directory_structure():
    base_dir = "Dataset/LIVECell"
    dirs = [
        "images/train",
        "images/val",
        "images/test",
        "labels/train",
        "labels/val",
        "labels/test"
    ]
    
    for dir_path in dirs:
        Path(f"{base_dir}/{dir_path}").mkdir(parents=True, exist_ok=True)
    
    return base_dir

base_dir = create_directory_structure()
print(f"Created directory structure in {base_dir}")

def copy_images():
    """Copy images from LIVECell dataset to our organized structure."""
    print("\nCopying images to dataset directories...")
    
    # Copy test images
    test_src = "LIVECell_dataset_2021/images/livecell_test_images"
    test_dst = "Dataset/LIVECell/images/test"
    print("Copying test images...")
    for img in tqdm(os.listdir(test_src)):
        if img.endswith('.tif'):
            shutil.copy2(os.path.join(test_src, img), os.path.join(test_dst, img))
    
    # Load train and val annotations to get image lists
    with open('LIVECell_dataset_2021/annotations/LIVECell/livecell_coco_train.json') as f:
        train_data = json.load(f)
    with open('LIVECell_dataset_2021/annotations/LIVECell/livecell_coco_val.json') as f:
        val_data = json.load(f)
    
    # Get filenames for train and val sets
    train_files = {img['file_name'] for img in train_data['images']}
    val_files = {img['file_name'] for img in val_data['images']}
    
    # Copy train and val images from the combined directory
    train_val_src = "LIVECell_dataset_2021/images/livecell_train_val_images"
    train_dst = "Dataset/LIVECell/images/train"
    val_dst = "Dataset/LIVECell/images/val"
    
    print("Copying train and validation images...")
    for img in tqdm(os.listdir(train_val_src)):
        if img.endswith('.tif'):
            src_path = os.path.join(train_val_src, img)
            if img in train_files:
                shutil.copy2(src_path, os.path.join(train_dst, img))
            elif img in val_files:
                shutil.copy2(src_path, os.path.join(val_dst, img))

# Copy images to their respective directories
copy_images()

Created directory structure in Dataset/LIVECell

Copying images to dataset directories...
Copying test images...


100%|██████████| 1512/1512 [00:00<00:00, 3775.08it/s]


Copying train and validation images...


100%|██████████| 3727/3727 [00:00<00:00, 4041.37it/s]
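
Because the train and validation images share one source directory and are routed by file name, it can be worth confirming that every image listed in an annotation file actually arrived in its split directory. The sketch below is one way to do that. `missing_images` is a hypothetical helper, not part of LIVECell or of the cell above, and it assumes the directory layout used in this tutorial.

```python
import json
from pathlib import Path


def missing_images(annotation_file, image_dir):
    """Return annotated file names that are absent from image_dir, sorted."""
    with open(annotation_file) as f:
        listed = {img["file_name"] for img in json.load(f)["images"]}
    on_disk = {p.name for p in Path(image_dir).iterdir() if p.is_file()}
    return sorted(listed - on_disk)


if __name__ == "__main__":
    # Paths below follow the layout assumed in this tutorial.
    ann_root = Path("LIVECell_dataset_2021/annotations/LIVECell")
    for split in ["train", "val", "test"]:
        ann_file = ann_root / f"livecell_coco_{split}.json"
        img_dir = Path("Dataset/LIVECell/images") / split
        if ann_file.exists() and img_dir.exists():
            missing = missing_images(ann_file, img_dir)
            print(f"{split}: {len(missing)} annotated images not found on disk")
```

A non-zero count is not necessarily an error, but it does mean some label files written later will have no matching image.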


## 2. Converting LIVECell Annotations to YOLO Format

Next, we'll convert the LIVECell annotations, which ship as COCO-format JSON files, to YOLO format. A YOLO label file has one line per bounding box:
```
[class_id] [x_center] [y_center] [width] [height]
```
where:
- class_id: 0-based integer, derived from the cell-type prefix of the image filename (so every box in an image gets the same class)
- x_center, y_center: center coordinates normalized to [0, 1]
- width, height: bbox dimensions normalized to [0, 1]
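
As a worked example of this conversion, the sketch below maps one COCO-style box `[x_min, y_min, width, height]` in pixels to the normalized YOLO tuple. The helper name `coco_to_yolo`, the image size, and the box values are made up for illustration and are not taken from the dataset.

```python
def coco_to_yolo(box, img_w, img_h):
    """Convert a COCO box [x_min, y_min, w, h] in pixels to normalized YOLO form."""
    x_min, y_min, w, h = box
    x_center = (x_min + w / 2) / img_w
    y_center = (y_min + h / 2) / img_h
    return x_center, y_center, w / img_w, h / img_h


# Hypothetical 800x400 image with a 200x100 box whose top-left corner is (100, 50).
yolo_box = coco_to_yolo([100, 50, 200, 100], img_w=800, img_h=400)
print(" ".join(f"{v:.6f}" for v in yolo_box))  # 0.250000 0.250000 0.250000 0.250000
```

The box center is at pixel (200, 100), which is a quarter of the way across and down the image, and the box covers a quarter of each dimension.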

In [2]:
import json
import os
from tqdm import tqdm

def convert_coco_to_yolo():
    """Convert LIVECell annotations to YOLO format."""
    
    # Define the cell type to class ID mapping
    class_mapping = {
        'a172': 0,
        'bt474': 1,
        'bv2': 2,
        'huh7': 3,
        'mcf7': 4,
        'shsy5y': 5,
        'skbr3': 6,
        'skov3': 7
    }

    def get_cell_type(filename):
        # Extract cell type from filename (e.g., "A172_Phase_C7_1_02d12h00m_1.tif" -> "a172")
        return filename.split('_')[0].lower()

    def convert_bbox(size, box):
        # Convert COCO bbox [x_min, y_min, w, h] (pixels) to normalized
        # YOLO bbox [x_center, y_center, w, h]
        img_w, img_h = size
        x_min, y_min, w, h = box
        
        # Clip the box corners to the image first, so the resulting box
        # never extends past the image border
        x_max = min(max(x_min + w, 0), img_w)
        y_max = min(max(y_min + h, 0), img_h)
        x_min = min(max(x_min, 0), img_w)
        y_min = min(max(y_min, 0), img_h)
        
        # Center and size, normalized to [0, 1] by the image dimensions
        x_center = (x_min + x_max) / 2 / img_w
        y_center = (y_min + y_max) / 2 / img_h
        w = (x_max - x_min) / img_w
        h = (y_max - y_min) / img_h
        
        return (x_center, y_center, w, h)

    # Process train, val, and test sets
    for dataset in ['train', 'val', 'test']:
        print(f"\nProcessing {dataset} set...")
        
        # Load JSON file containing the annotations
        json_file = f'LIVECell_dataset_2021/annotations/LIVECell/livecell_coco_{dataset}.json'
        with open(json_file) as f:
            data = json.load(f)

        # Create a dictionary that maps image ids to their data
        image_dict = {}
        for img in data['images']:
            image_dict[img['id']] = {
                'file_name': img['file_name'],
                'width': img['width'],
                'height': img['height'],
                'annotations': []
            }

        # Group annotations by image
        for ann in data['annotations']:
            if ann['bbox'][2] > 0 and ann['bbox'][3] > 0:  # Filter out invalid boxes
                image_dict[ann['image_id']]['annotations'].append(ann)

        # Convert annotations to YOLO format
        for img_id, img_data in tqdm(image_dict.items()):
            if not img_data['annotations']:
                continue

            # Get cell type from image filename
            cell_type = get_cell_type(img_data['file_name'])
            if cell_type not in class_mapping:
                print(f"Warning: Unknown cell type in {img_data['file_name']}")
                continue

            # Create directory for YOLO labels
            label_dir = f"Dataset/LIVECell/labels/{dataset}"
            os.makedirs(label_dir, exist_ok=True)
            
            # Get class_id from cell type
            class_id = class_mapping[cell_type]
            
            label_file = f"{label_dir}/{os.path.splitext(img_data['file_name'])[0]}.txt"
            
            with open(label_file, 'w') as f:
                for ann in img_data['annotations']:
                    # Convert bounding box to YOLO format
                    bbox = convert_bbox((img_data['width'], img_data['height']), ann['bbox'])
                    
                    # Validate bounding box values
                    if all(0 <= x <= 1 for x in bbox):
                        # Write annotation in YOLO format
                        f.write(f"{class_id} {bbox[0]:.6f} {bbox[1]:.6f} {bbox[2]:.6f} {bbox[3]:.6f}\n")

# Convert annotations
convert_coco_to_yolo()


Processing train set...


100%|██████████| 3253/3253 [00:02<00:00, 1210.17it/s]



Processing val set...


100%|██████████| 570/570 [00:00<00:00, 1179.27it/s]



Processing test set...


100%|██████████| 1564/1564 [00:01<00:00, 1246.91it/s]
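
Before moving on, the generated label files can be checked line by line. The following is a minimal sketch under the assumptions of this tutorial (five whitespace-separated fields, eight classes, coordinates in [0, 1]). `find_bad_label_lines` is a hypothetical helper introduced here for illustration.

```python
from pathlib import Path


def find_bad_label_lines(label_file, num_classes=8):
    """Return (line_number, text) pairs that are not valid YOLO box lines."""
    bad = []
    lines = Path(label_file).read_text().splitlines()
    for line_no, line in enumerate(lines, start=1):
        parts = line.split()
        try:
            ok = (
                len(parts) == 5
                and 0 <= int(parts[0]) < num_classes
                and all(0.0 <= float(v) <= 1.0 for v in parts[1:])
            )
        except ValueError:
            # Non-numeric field, or a class ID that is not an integer
            ok = False
        if not ok:
            bad.append((line_no, line))
    return bad


if __name__ == "__main__":
    label_dir = Path("Dataset/LIVECell/labels/train")
    if label_dir.exists():
        for label_file in sorted(label_dir.glob("*.txt")):
            for line_no, text in find_bad_label_lines(label_file):
                print(f"{label_file.name}:{line_no}: {text!r}")
```

Blank lines are reported as invalid too, which is intentional here since the converter above never writes them.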


## 3. Creating Training, Validation, and Test Text Files

Finally, we'll create text files listing all images in the training, validation and test sets:

In [3]:
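def create_image_lists():
    """Write train.txt, val.txt, and test.txt, each listing one image path per line."""
    dataset_path = Path('./Dataset/LIVECell')
    
    # Create train.txt, val.txt, and test.txt
    for split in ['train', 'val', 'test']:
        # Sort so the file contents are reproducible across runs and filesystems
        images = sorted((dataset_path / 'images' / split).glob('*.tif'))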
        with open(dataset_path / f'{split}.txt', 'w') as f:
            for img_path in images:
                f.write(f'./Dataset/LIVECell/images/{split}/{img_path.name}\n')
        print(f"Created {split}.txt with {len(images)} images")

create_image_lists()

Created train.txt with 3188 images
Created val.txt with 539 images
Created test.txt with 1512 images
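
The conversion step skips images that have no valid annotations, so an image in one of these lists may have no label file. The sketch below reports such images, assuming the layout used in this tutorial, where the label for `images/<split>/<name>.tif` lives at `labels/<split>/<name>.txt`. Both helper names are hypothetical.

```python
from pathlib import Path


def label_path_for(image_path):
    """Map .../images/<split>/<name>.tif to .../labels/<split>/<name>.txt."""
    parts = list(Path(image_path).parts)
    # Replace the last "images" path component with "labels".
    # Raises ValueError if the path has no "images" component.
    idx = len(parts) - 1 - parts[::-1].index("images")
    parts[idx] = "labels"
    return Path(*parts).with_suffix(".txt")


def images_without_labels(list_file):
    """Return image paths from list_file that have no matching label file."""
    paths = [ln.strip() for ln in Path(list_file).read_text().splitlines()]
    return [p for p in paths if p and not label_path_for(p).exists()]


if __name__ == "__main__":
    for split in ["train", "val", "test"]:
        list_file = Path("Dataset/LIVECell") / f"{split}.txt"
        if list_file.exists():
            unlabeled = images_without_labels(list_file)
            print(f"{split}: {len(unlabeled)} listed images have no label file")
```

The paths in the list files are relative, so this check should be run from the same working directory as the rest of the notebook.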


## Verification

Let's verify that our dataset is properly structured and all necessary files are in place:

In [4]:
def verify_dataset():
    base_dir = Path("Dataset/LIVECell")
    
    # Check directory structure
    required_dirs = [
        "images/train",
        "images/val",
        "images/test",
        "labels/train",
        "labels/val",
        "labels/test"
    ]
    
    for dir_path in required_dirs:
        full_path = base_dir / dir_path
        if not full_path.exists():
            print(f"❌ Missing directory: {dir_path}")
        else:
            print(f"✅ Found directory: {dir_path}")
    
    # Check text files
    for txt_file in ["train.txt", "val.txt", "test.txt"]:
        if (base_dir / txt_file).exists():
            print(f"✅ Found file: {txt_file}")
        else:
            print(f"❌ Missing file: {txt_file}")
    
    # Check label files
    print("\nChecking label files format...")
    for dataset in ["train", "val", "test"]:
        label_dir = base_dir / "labels" / dataset
        if label_dir.exists():
            label_files = list(label_dir.glob("*.txt"))
            if label_files:
                sample_file = label_files[0]
                print(f"\nSample {dataset} label file ({sample_file.name}):")
                with open(sample_file) as f:
                    print(f.read().strip())

verify_dataset()

✅ Found directory: images/train
✅ Found directory: images/val
✅ Found directory: images/test
✅ Found directory: labels/train
✅ Found directory: labels/val
✅ Found directory: labels/test
✅ Found file: train.txt
✅ Found file: val.txt
✅ Found file: test.txt

Checking label files format...

Sample train label file (SkBr3_Phase_E3_1_00d12h00m_1.txt):
6 0.918558 0.877202 0.025071 0.034558
6 0.247521 0.950952 0.026435 0.039442
6 0.514141 0.883183 0.020554 0.026904
6 0.474581 0.994106 0.023310 0.011788
6 0.546101 0.913452 0.021690 0.035481
6 0.100923 0.719606 0.018295 0.028135
6 0.156491 0.678923 0.035000 0.042192
6 0.534687 0.866817 0.023722 0.036404
6 0.452926 0.858865 0.023267 0.033346
6 0.148075 0.253721 0.022827 0.031442
6 0.506001 0.848923 0.026435 0.036077
6 0.559766 0.873548 0.024162 0.032712
6 0.988366 0.873692 0.023267 0.043423
6 0.570156 0.914673 0.027330 0.039154
6 0.157621 0.485962 0.023253 0.041577
6 0.464893 0.836692 0.025526 0.035462
6 0.252038 0.759510 0.029134 0.039442
...
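
Printing a whole label file produces a lot of output, so a per-class summary can be a more compact check. The sketch below counts box lines per class ID in one label directory. `count_boxes_per_class` is a hypothetical helper, and `CLASS_NAMES` simply lists the cell types in the order of the class mapping used during conversion.

```python
from collections import Counter
from pathlib import Path

# Index in this list == class ID from the conversion step's class mapping.
CLASS_NAMES = ["a172", "bt474", "bv2", "huh7", "mcf7", "shsy5y", "skbr3", "skov3"]


def count_boxes_per_class(label_dir):
    """Count YOLO box lines per class ID across all .txt files in label_dir."""
    counts = Counter()
    for label_file in Path(label_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if parts:  # skip blank lines
                counts[int(parts[0])] += 1
    return counts


if __name__ == "__main__":
    label_dir = Path("Dataset/LIVECell/labels/train")
    if label_dir.exists():
        counts = count_boxes_per_class(label_dir)
        for class_id, name in enumerate(CLASS_NAMES):
            print(f"{class_id} ({name}): {counts[class_id]} boxes")
```

A class with zero boxes in a split would suggest that a filename prefix did not match the mapping.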

## Conclusion

You have now successfully:
1. Created the necessary directory structure for the LIVECell dataset
2. Copied images from the source directories to our organized structure
3. Converted the COCO-format LIVECell annotations to YOLO format, with class IDs mapped from cell types in image filenames
4. Created text files listing all training, validation and test images
5. Verified the dataset structure and format

The dataset is now ready to be used for training YOLO models. The directory structure should look like this:

```
Dataset/LIVECell/
├── images/
│   ├── train/    (contains training images)
│   ├── val/      (contains validation images)
│   └── test/     (contains test images)
├── labels/
│   ├── train/    (contains YOLO format labels for training images)
│   ├── val/      (contains YOLO format labels for validation images)
│   └── test/     (contains YOLO format labels for test images)
├── train.txt    (list of training image paths)
├── val.txt      (list of validation image paths)
└── test.txt     (list of test image paths)
```