# 2.- Data Transformation - TFRecords

## Why Use TFRecord for Object Detection Models

When preparing datasets for **object detection**, the storage format affects both read throughput and how easily the data plugs into the training pipeline. Within the **TensorFlow** ecosystem, **TFRecord** is the usual choice. The main reasons are listed below:


### 1. **Efficiency and Performance**

- **Binary Format**: TFRecord stores data in a compact binary format, which is significantly faster to read and write compared to traditional formats like CSV or JSON. This efficiency is vital when dealing with large datasets containing millions of images.

- **Sequential Access**: Data stored in TFRecord files can be accessed sequentially, which aligns well with the way TensorFlow processes data during training. This minimizes the overhead associated with random file reads, enhancing the overall training speed.



### 2. **Seamless Integration with TensorFlow Pipelines**

- **`tf.data` API Compatibility**: TFRecord is natively supported by TensorFlow's `tf.data` API, allowing for straightforward data ingestion, preprocessing, and batching. This compatibility ensures that data loading becomes an integral and optimized part of the training pipeline.

- **Parallel Data Processing**: Leveraging TFRecord in combination with the `tf.data` API enables parallel data loading and preprocessing. This parallelism is essential for maximizing GPU/TPU utilization and reducing training times.



### 3. **Scalability for Large Datasets**

- **Handling Massive Data**: Object detection tasks often require handling extensive datasets with high-resolution images and numerous annotations. TFRecord efficiently manages such large-scale data without significant performance degradation.

- **Sharding Capability**: TFRecord allows datasets to be split into multiple shards (smaller TFRecord files). Sharding facilitates distributed training and makes it easier to manage and access data across different storage systems or machines.
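
As a rough illustration of sharding, the sketch below distributes image names round-robin across a fixed number of shards and builds one filename per shard. The `prefix-00000-of-00004.tfrecord` naming pattern and both helper names are assumptions made for this example; TFRecord itself does not require any particular naming scheme.

```python
def shard_path(prefix, index, num_shards):
    """Build a shard filename such as 'train-00001-of-00004.tfrecord'."""
    return f"{prefix}-{index:05d}-of-{num_shards:05d}.tfrecord"


def assign_to_shards(items, num_shards):
    """Distribute items round-robin so shard sizes differ by at most one."""
    shards = [[] for _ in range(num_shards)]
    for position, item in enumerate(items):
        shards[position % num_shards].append(item)
    return shards


# Hypothetical image names, used only to show the resulting shard sizes
image_names = [f"img_{i:03d}" for i in range(10)]
shards = assign_to_shards(image_names, num_shards=4)
for index, names in enumerate(shards):
    print(shard_path("train", index, 4), len(names))
```

Each shard would then be written with its own `tf.io.TFRecordWriter`, in the same way a single file is written.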



### 4. **Reduced I/O Overhead**

- **Minimized File Operations**: Instead of reading thousands of individual image files and annotation files, TFRecord consolidates all data into fewer large files. This reduction in the number of file operations decreases the I/O overhead, leading to faster data access and improved training throughput.



### 5. **Data Serialization and Consistency**

- **Structured Data Storage**: TFRecord, in conjunction with `tf.train.Example`, allows for the structured serialization of complex data types, including images, bounding boxes, and class labels. This structure ensures consistency in how data is stored and accessed, reducing potential errors during training.

- **Custom Feature Encoding**: With TFRecord, you can define custom features tailored to your specific needs, such as storing multiple bounding boxes per image or incorporating additional metadata. This flexibility is crucial for accommodating the diverse requirements of object detection models.



### 6. **Enhanced Portability and Reproducibility**

- **Cross-Platform Compatibility**: TFRecord files are platform-agnostic, meaning they can be easily shared and used across different environments without compatibility issues. This portability is beneficial for collaborative projects and reproducible research.

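- **Version Control Friendly**: A dataset snapshot becomes a small, fixed set of immutable files that can be named, checksummed, and tracked with a data-versioning tool, which is simpler than tracking thousands of loose image and annotation files. Being binary, TFRecord files do not produce meaningful diffs in Git.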



### 7. **Optimized for Distributed Training**

- **Distributed Systems Support**: TFRecord is optimized for use in distributed training environments, where data needs to be efficiently fed to multiple workers or nodes. Its binary format and sharding capabilities make it ideal for scaling training across clusters.

- **Consistency Across Workers**: Every worker reads the same immutable shard files, so all workers see identical record contents. Assigning disjoint shards to each worker also keeps the same example from being processed twice within an epoch.
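
One simple way to give each worker a disjoint subset of shard files is to take every N-th file from a sorted list. The helper below is a plain-Python sketch with a hypothetical name; TensorFlow's `tf.data.Dataset.shard(num_shards, index)` applies the same every-N-th selection to dataset elements.

```python
def files_for_worker(shard_files, num_workers, worker_index):
    """Return the shard files that one worker should read."""
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    # Sorting makes the assignment identical on every machine
    return sorted(shard_files)[worker_index::num_workers]


# Hypothetical shard filenames for illustration
shard_files = [f"train-{i:05d}-of-00008.tfrecord" for i in range(8)]
for worker in range(3):
    print(worker, files_for_worker(shard_files, num_workers=3, worker_index=worker))
```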



### 8. **Security and Data Integrity**

- **Data Integrity**: Each record in a TFRecord file is stored with CRC-32C checksums for both its length field and its payload, so truncated or corrupted records are detected at read time instead of silently producing bad training data.

- **Obfuscation**: The binary format is not human-readable, which discourages casual edits to the dataset. This is not a security measure: anyone with TensorFlow can decode the records, so sensitive data still needs access control or encryption.



In [1]:
import os
import glob
import random
import tensorflow as tf
import xml.etree.ElementTree as ET

## Helpers

In [2]:
def parse_voc_xml(xml_path):
    """
    Parses a Pascal VOC XML file and returns a dictionary with:
    {
      'filename': 'image_name.jpg',
      'width': 1280,
      'height': 720,
      'objects': [
        {
          'name': 'dog',
          'xmin': 50, 'ymin': 30, 'xmax': 150, 'ymax': 100
        },
        ...
      ]
    }
    
    Args:
        xml_path (str): Path to the XML annotation file.
        
    Returns:
        dict: Parsed data from the XML file.
    """
    tree = ET.parse(xml_path)
    root = tree.getroot()

    data = {}
    data['objects'] = []

    # Extract the filename
    filename_node = root.find('filename')
    data['filename'] = filename_node.text if filename_node is not None else None

    # Extract image size (width and height)
    size_node = root.find('size')
    if size_node is not None:
        w_node = size_node.find('width')
        h_node = size_node.find('height')
        data['width'] = int(w_node.text) if w_node is not None else 0
        data['height'] = int(h_node.text) if h_node is not None else 0
    else:
        data['width'] = 0
        data['height'] = 0

    # Extract object details
    for obj_node in root.findall('object'):
        obj_info = {}
        name_node = obj_node.find('name')
        obj_info['name'] = name_node.text if name_node is not None else "N/A"

        # Extract bounding box coordinates
        bndbox_node = obj_node.find('bndbox')
        if bndbox_node is None:
            # Skip objects without a bounding box: voc_dict_to_tfexample
            # reads all four coordinates and would raise a KeyError otherwise
            continue

        xmin_node = bndbox_node.find('xmin')
        ymin_node = bndbox_node.find('ymin')
        xmax_node = bndbox_node.find('xmax')
        ymax_node = bndbox_node.find('ymax')

        obj_info['xmin'] = float(xmin_node.text) if xmin_node is not None else 0
        obj_info['ymin'] = float(ymin_node.text) if ymin_node is not None else 0
        obj_info['xmax'] = float(xmax_node.text) if xmax_node is not None else 0
        obj_info['ymax'] = float(ymax_node.text) if ymax_node is not None else 0

        data['objects'].append(obj_info)

    return data
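
To make the expected input concrete, here is a minimal Pascal VOC style annotation and the same fields read with `xml.etree.ElementTree` path lookups. The filename `dog_001.jpg` is invented for this example; the size and box values reuse the numbers from the docstring above.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation with a single object
SAMPLE_XML = """
<annotation>
  <filename>dog_001.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>50</xmin><ymin>30</ymin><xmax>150</xmax><ymax>100</ymax></bndbox>
  </object>
</annotation>
"""

root = ET.fromstring(SAMPLE_XML)
filename = root.findtext("filename")
width = int(root.findtext("size/width"))
height = int(root.findtext("size/height"))

# One (class name, (xmin, ymin, xmax, ymax)) tuple per <object> element
boxes = [
    (
        obj.findtext("name"),
        tuple(float(obj.findtext(f"bndbox/{tag}")) for tag in ("xmin", "ymin", "xmax", "ymax")),
    )
    for obj in root.findall("object")
]
print(filename, width, height, boxes)
```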

In [3]:
def _bytes_feature(value):
    """Converts a byte string into a tf.train.Feature of bytes_list."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_list_feature(value):
    """Converts a float list into a tf.train.Feature of float_list."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def _int64_feature(value):
    """Converts an integer value into a tf.train.Feature of int64_list."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _int64_list_feature(value):
    """Converts a list of integers into a tf.train.Feature of int64_list."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

## Create a tf.train.Example from VOC dict

In [4]:
def voc_dict_to_tfexample(voc_dict, images_folder):
    """
    Takes the dictionary output by parse_voc_xml(xml_file)
    along with the folder containing images (images_folder).
    
    Returns a tf.train.Example with:
    - image/encoded
    - image/filename
    - image/height, image/width
    - image/object/bbox/xmin, xmax, ymin, ymax
    - image/object/class/text
    """
    filename = voc_dict['filename']
    if filename is None:
        # If <filename> is missing in the XML, skip this entry
        return None

    img_path = os.path.join(images_folder, filename)
    if not os.path.isfile(img_path):
        # If the image does not exist, skip this entry
        return None

    # Read the image in binary format
    with tf.io.gfile.GFile(img_path, 'rb') as fid:
        encoded_image = fid.read()

    width = voc_dict['width']
    height = voc_dict['height']

    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    class_texts = []

    for obj in voc_dict['objects']:
        if width > 0 and height > 0:
            # Normalize bounding box coordinates
            xmins.append(obj['xmin'] / width)
            xmaxs.append(obj['xmax'] / width)
            ymins.append(obj['ymin'] / height)
            ymaxs.append(obj['ymax'] / height)
        else:
            # Avoid division by zero if image size is not available
            xmins.append(0.0)
            xmaxs.append(0.0)
            ymins.append(0.0)
            ymaxs.append(0.0)

        # Encode class name as bytes
        class_texts.append(obj['name'].encode('utf8'))

    feature_dict = {
        'image/encoded': _bytes_feature(encoded_image),
        'image/filename': _bytes_feature(filename.encode('utf8')),
        'image/format': _bytes_feature(b'jpg'),  # Assuming JPEG format

        'image/height': _int64_feature(height),
        'image/width': _int64_feature(width),

        'image/object/bbox/xmin': _float_list_feature(xmins),
        'image/object/bbox/xmax': _float_list_feature(xmaxs),
        'image/object/bbox/ymin': _float_list_feature(ymins),
        'image/object/bbox/ymax': _float_list_feature(ymaxs),

        # Store class text; could also map to class IDs if needed
        'image/object/class/text':
            tf.train.Feature(bytes_list=tf.train.BytesList(value=class_texts)),
    }

    return tf.train.Example(features=tf.train.Features(feature=feature_dict))
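
The function above divides pixel coordinates by the image size and writes zeros when the size is unknown. A stricter alternative, sketched here under the assumption that unusable boxes should be dropped rather than stored as zeros, also clamps coordinates that fall slightly outside the image. The helper names are hypothetical and not part of the notebook.

```python
def _clamp01(value):
    """Limit a value to the closed interval [0, 1]."""
    return min(max(value, 0.0), 1.0)


def normalize_box(xmin, ymin, xmax, ymax, width, height):
    """Return (xmin, ymin, xmax, ymax) scaled to [0, 1], or None if unusable."""
    if width <= 0 or height <= 0:
        return None  # image size unknown, so the box cannot be normalized
    box = (
        _clamp01(xmin / width),
        _clamp01(ymin / height),
        _clamp01(xmax / width),
        _clamp01(ymax / height),
    )
    # Reject boxes that collapse to zero width or height
    if box[0] >= box[2] or box[1] >= box[3]:
        return None
    return box


print(normalize_box(50, 30, 150, 100, width=1280, height=720))
print(normalize_box(-5, 0, 1300, 720, width=1280, height=720))
```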

## Main conversion function (XML -> TFRecord)

In [5]:
def convert_voc_to_tfrecord(annotations_folder, images_folder, output_tfrecord):
    """
    Reads all .xml files from 'annotations_folder', pairs them
    with images in 'images_folder', and writes a TFRecord to
    'output_tfrecord'.

    Returns the number of successfully written examples and
    the number of errors (e.g. missing files).
    
    Args:
        annotations_folder (str): Directory containing annotation XML files.
        images_folder (str): Directory containing image files.
        output_tfrecord (str): Path to the output TFRecord file.
        
    Returns:
        tuple: (number_of_written_examples, number_of_errors)
    """
    xml_files = glob.glob(os.path.join(annotations_folder, "*.xml"))
    num_written = 0
    num_errors = 0

    with tf.io.TFRecordWriter(output_tfrecord) as writer:
        for xml_file in xml_files:
            voc_info = parse_voc_xml(xml_file)
            tf_example = voc_dict_to_tfexample(voc_info, images_folder)
            if tf_example is not None:
                writer.write(tf_example.SerializeToString())
                num_written += 1
            else:
                num_errors += 1

    return num_written, num_errors

In [6]:
def get_image_names(annotations_folder):
    """
    Retrieves a list of image names from the annotation XML files.
    
    Args:
        annotations_folder (str): Directory containing annotation XML files.
        
    Returns:
        list: List of image names without file extensions.
    """
    xml_files = glob.glob(os.path.join(annotations_folder, "*.xml"))
    image_names = [os.path.splitext(os.path.basename(f))[0] for f in xml_files]
    return image_names

In [7]:
def split_dataset(image_names, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=42):
    """
    Randomly splits the dataset into train, validation, and test sets.
    
    Args:
        image_names (list): List of image names.
        train_ratio (float): Proportion of data for training.
        val_ratio (float): Proportion of data for validation.
        test_ratio (float): Proportion of data for testing.
        seed (int): Seed for randomization to ensure reproducibility.
        
    Returns:
        dict: Dictionary with keys 'train', 'val', 'test' mapping to respective image lists.
    """
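    # Compare with a tolerance: float sums such as 0.7 + 0.2 + 0.1 are not exactly 1.0
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-9, "Ratios must sum to 1.0"
    
    # Sort a copy first so the shuffle does not depend on glob's file order
    # and the caller's list is left untouched
    image_names = sorted(image_names)
    random.Random(seed).shuffle(image_names)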
    
    total = len(image_names)
    train_end = int(train_ratio * total)
    val_end = train_end + int(val_ratio * total)
    
    splits = {
        'train': image_names[:train_end],
        'val': image_names[train_end:val_end],
        'test': image_names[val_end:]
    }
    
    return splits
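
The split sizes follow from integer truncation: the train and validation counts are rounded down, and the test split receives whatever remains. The small helper below (a hypothetical name, not part of the notebook) reproduces that arithmetic for the 20-image dataset used in the demonstration.

```python
def split_sizes(total, train_ratio=0.7, val_ratio=0.15):
    """Return (train, val, test) counts using the same truncation as split_dataset."""
    train = int(train_ratio * total)
    val = int(val_ratio * total)
    test = total - train - val  # the remainder always goes to the test split
    return train, val, test


print(split_sizes(20))  # (14, 3, 3)
```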

In [8]:
def create_tfrecord(split, image_list, annotations_folder, images_folder, output_dir):
    """
    Creates a TFRecord file for a specific data split.
    
    Args:
        split (str): Name of the split ('train', 'val', 'test').
        image_list (list): List of image names for the split.
        annotations_folder (str): Directory containing annotation XML files.
        images_folder (str): Directory containing image files.
        output_dir (str): Directory where the TFRecord will be saved.
        
    Returns:
        int: Number of examples written to the TFRecord.
    """
    output_path = os.path.join(output_dir, f"{split}.tfrecord")
    count = 0

    # The context manager closes the file even if writing an example raises
    with tf.io.TFRecordWriter(output_path) as writer:
        for image_name in image_list:
            xml_path = os.path.join(annotations_folder, f"{image_name}.xml")
            if not os.path.exists(xml_path):
                print(f"Warning: XML file not found for {image_name}.jpg")
                continue

            voc_dict = parse_voc_xml(xml_path)
            tf_example = voc_dict_to_tfexample(voc_dict, images_folder)
            if tf_example is not None:
                writer.write(tf_example.SerializeToString())
                count += 1
            else:
                print(f"Warning: Failed to create tf.train.Example for {image_name}.jpg")

    print(f"TFRecord for '{split}' created at: {output_path} with {count} examples.")
    return count

In [12]:
def generate_tfrecords(annotations_folder, images_folder, output_dir, 
                       train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, seed=42):
    """
    Generates TFRecord files for train, validation, and test splits.
    
    Args:
        annotations_folder (str): Directory containing annotation XML files.
        images_folder (str): Directory containing image files.
        output_dir (str): Directory where TFRecords will be saved.
        train_ratio (float): Proportion of data for training.
        val_ratio (float): Proportion of data for validation.
        test_ratio (float): Proportion of data for testing.
        seed (int): Seed for randomization to ensure reproducibility.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        print(f"Created output directory at: {output_dir}")
    
    # Get all image names
    image_names = get_image_names(annotations_folder)
    print(f"Total images found: {len(image_names)}")
    
    # Split the dataset
    splits = split_dataset(image_names, train_ratio, val_ratio, test_ratio, seed)
    
    for split, images in splits.items():
        print(f"Creating TFRecord for '{split}' with {len(images)} examples.")
        create_tfrecord(split, images, annotations_folder, images_folder, output_dir)


## Demonstration

In [13]:
# Define your directories
annotations_dir = "../data/Annotations"       # Path to Annotations directory
images_dir = "../data/JPEGImages"             # Path to JPEGImages directory
output_dir = "../data/TFRecords"              # Path to output TFRecords directory

In [14]:
# Generate the TFRecords with train, val, and test splits
generate_tfrecords(
    annotations_folder=annotations_dir,
    images_folder=images_dir,
    output_dir=output_dir,
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15,
    seed=42
)

Total images found: 20
Creating TFRecord for 'train' with 14 examples.
TFRecord for 'train' created at: ../data/TFRecords\train.tfrecord with 14 examples.
Creating TFRecord for 'val' with 3 examples.
TFRecord for 'val' created at: ../data/TFRecords\val.tfrecord with 3 examples.
Creating TFRecord for 'test' with 3 examples.
TFRecord for 'test' created at: ../data/TFRecords\test.tfrecord with 3 examples.


# Understanding TFRecord Structure

A **TFRecord** file is a **binary container** that stores a sequence of data records. In typical TensorFlow pipelines, each record is encoded as a [`tf.train.Example`](https://www.tensorflow.org/api_docs/python/tf/train/Example). Below is an overview of how these components are organized and why they matter.



## 1. TFRecord as a File Format

- **TFRecord** is essentially a stream of serialized protocol buffer messages, each message representing one “data example.”  
- This format is efficient for reading large datasets, especially during training on GPUs or TPUs, because it avoids the overhead of handling many small files.

In other words, you can think of a TFRecord file as “**N** examples concatenated in binary,” where each example is encoded with a protocol buffer schema.
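
On disk, each record is framed as a little-endian 64-bit length, a masked CRC-32C of that length, the payload bytes, and a masked CRC-32C of the payload. The pure-Python sketch below writes and reads that framing so the layout is visible. It is meant for illustration only, is not TensorFlow's implementation, and all function names in it are made up for this example.

```python
import io
import struct


def _build_crc32c_table():
    """Precompute the byte-wise lookup table for CRC-32C (Castagnoli)."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _build_crc32c_table()


def crc32c(data):
    """Plain CRC-32C of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """CRC-32C with the rotation and offset that TFRecord applies."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_record(stream, payload):
    """Append one framed record: length, length CRC, payload, payload CRC."""
    length = struct.pack("<Q", len(payload))
    stream.write(length)
    stream.write(struct.pack("<I", masked_crc(length)))
    stream.write(payload)
    stream.write(struct.pack("<I", masked_crc(payload)))


def read_records(stream):
    """Yield payloads in order, raising ValueError on a checksum mismatch."""
    while True:
        length = stream.read(8)
        if not length:
            return
        (length_crc,) = struct.unpack("<I", stream.read(4))
        if length_crc != masked_crc(length):
            raise ValueError("corrupted length field")
        payload = stream.read(struct.unpack("<Q", length)[0])
        (payload_crc,) = struct.unpack("<I", stream.read(4))
        if payload_crc != masked_crc(payload):
            raise ValueError("corrupted payload")
        yield payload


# In a real file each payload would be a serialized tf.train.Example
buffer = io.BytesIO()
for payload in (b"first example", b"second example"):
    write_record(buffer, payload)

buffer.seek(0)
print(list(read_records(buffer)))
```

Because records carry only a length prefix and no index, a reader has to walk the file from the start, which is why TFRecord suits sequential access rather than random lookups.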



## 2. `tf.train.Example` Messages

Within each record inside a TFRecord file, we typically store data in a structure called **`tf.train.Example`**. Conceptually:

1. **`Example`** is the top-level message that groups a set of “features.”
2. **`Features`** is a container, a map from string keys to `Feature` values.
3. **`Feature`** is a union-type message that holds data in one of three lists:
   - **BytesList** (for binary data such as raw image bytes or strings)
   - **FloatList** (for floating-point values, stored as 32-bit floats)
   - **Int64List** (for integer values)

The important takeaway is that each “feature” in an `Example` is identified by a **key** (a string) and can store either bytes, floats, or integers in list form.
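
Since a `Feature` holds exactly one of the three list types, writing code has to decide which list fits a given set of Python values. The function below is a plain-Python sketch of that decision. Its name and rules are assumptions for illustration and are not part of the TensorFlow API.

```python
def feature_kind(values):
    """Return the name of the Feature list that can hold the given values."""
    if not values:
        raise ValueError("cannot infer a list type from an empty sequence")
    if all(isinstance(v, bytes) for v in values):
        return "bytes_list"
    if all(isinstance(v, int) for v in values):
        return "int64_list"
    if all(isinstance(v, (int, float)) for v in values):
        return "float_list"
    # Text must be encoded first, e.g. "dog".encode("utf8")
    raise TypeError("values must be all bytes, all ints, or numeric")


print(feature_kind([b"dog", b"cat"]))  # class names encoded as bytes
print(feature_kind([720]))             # image height
print(feature_kind([0.04, 0.12]))      # normalized box coordinates
```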



## 3. The `Feature` Hierarchy

To visualize the hierarchy:

```plaintext

Example 
   └─ Features 
         └─ feature 
               ├─ "some_key" → Feature (bytes_list) 
               ├─ "another_key" → Feature (int64_list) 
               └─ "third_key" → Feature (float_list)

```

### Components Explained:

1. **Example**
   - **Description**: The top-level message that represents a single data record.
   - **Role**: Encapsulates all the features associated with that specific instance (e.g., one image and its annotations).

2. **Features**
   - **Description**: A container within `Example` that holds multiple feature entries.
   - **Role**: Acts as a map (dictionary) where each key is a string representing the feature name, and each value is a `Feature` object.

3. **feature**
   - **Description**: Each entry within the `Features` map.
   - **Role**: Associates a feature name (key) with its corresponding data (value).

4. **"some_key" → Feature (bytes_list)**
   - **Key**: `"some_key"`
   - **Value**: A `Feature` containing a list of bytes.
   - **Use Case**: Typically used for binary data such as encoded images or serialized objects.

5. **"another_key" → Feature (int64_list)**
   - **Key**: `"another_key"`
   - **Value**: A `Feature` containing a list of 64-bit integers.
   - **Use Case**: Often used for categorical labels or counts.

6. **"third_key" → Feature (float_list)**
   - **Key**: `"third_key"`
   - **Value**: A `Feature` containing a list of floating-point numbers.
   - **Use Case**: Commonly used for numerical features like bounding box coordinates or measurement values.


## Detailed Breakdown

### 1. `tf.train.Example`

- **Purpose**: Encapsulates all the data for a single instance in your dataset.
- **Structure**:
  - **Features**: Contains all the individual data points (features) related to that instance.

### 2. `Features`

- **Purpose**: Acts as a container mapping feature names to their data.
- **Structure**:
  - **feature**: Each entry maps a feature name (string) to a `Feature` object.

### 3. `Feature`

- **Purpose**: Represents the actual data associated with a feature name.
- **Types**:
  - **BytesList**: For binary data (e.g., images, serialized data).
  - **FloatList**: For floating-point numbers.
  - **Int64List**: For integer values.


## Common structure in Object Detection

Here's a common visualization of the TFRecord hierarchy for object detection:

```plaintext
Example 
  └─ Features 
        └─ feature 
              ├─ "image/encoded" → Feature (bytes_list) 
              ├─ "image/height" → Feature (int64_list) 
              ├─ "image/width" → Feature (int64_list) 
              ├─ "bbox/xmin" → Feature (float_list) 
              ├─ "bbox/ymin" → Feature (float_list) 
              ├─ "bbox/xmax" → Feature (float_list) 
              ├─ "bbox/ymax" → Feature (float_list) 
              └─ "label" → Feature (int64_list)
```

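- **"image/encoded"**: Stores the encoded (compressed) image bytes, for example the contents of a JPEG file, not decoded pixels.
- **"image/height" & "image/width"**: Store the image dimensions in pixels.
- **"bbox/xmin", "bbox/ymin", "bbox/xmax", "bbox/ymax"**: Store the bounding box coordinates, typically normalized to the range 0 to 1, with one list entry per object.
- **"label"**: Stores the class label of each object as an integer.

The key names in this tree are illustrative. The converter in this notebook writes the boxes under `image/object/bbox/*` and stores class names as strings under `image/object/class/text` instead of integer labels.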
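
Reading such records back requires a feature specification that mirrors the keys used when writing. The sketch below assumes the illustrative key names from the tree above and a `[ymin, xmin, ymax, xmax]` box layout, which is a choice made for this example. It builds one `Example` in memory with placeholder image bytes and parses it with `tf.io.parse_single_example`.

```python
import tensorflow as tf

# Scalars use FixedLenFeature; per-object lists vary in length, so VarLenFeature
FEATURE_SPEC = {
    "image/encoded": tf.io.FixedLenFeature([], tf.string),
    "image/height": tf.io.FixedLenFeature([], tf.int64),
    "image/width": tf.io.FixedLenFeature([], tf.int64),
    "bbox/xmin": tf.io.VarLenFeature(tf.float32),
    "bbox/ymin": tf.io.VarLenFeature(tf.float32),
    "bbox/xmax": tf.io.VarLenFeature(tf.float32),
    "bbox/ymax": tf.io.VarLenFeature(tf.float32),
    "label": tf.io.VarLenFeature(tf.int64),
}


def parse_example(serialized):
    """Decode one serialized Example into (encoded image, boxes, labels)."""
    parsed = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    # Variable-length features come back sparse; convert them to dense vectors
    coords = [
        tf.sparse.to_dense(parsed[f"bbox/{name}"])
        for name in ("ymin", "xmin", "ymax", "xmax")
    ]
    boxes = tf.stack(coords, axis=-1)  # shape: (num_objects, 4)
    labels = tf.sparse.to_dense(parsed["label"])
    return parsed["image/encoded"], boxes, labels


def _float_list(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))


def _int64_list(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


# One synthetic record with two objects; the image bytes are a placeholder
example = tf.train.Example(features=tf.train.Features(feature={
    "image/encoded": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"placeholder bytes"])),
    "image/height": _int64_list([720]),
    "image/width": _int64_list([1280]),
    "bbox/xmin": _float_list([0.1, 0.5]),
    "bbox/ymin": _float_list([0.2, 0.6]),
    "bbox/xmax": _float_list([0.3, 0.7]),
    "bbox/ymax": _float_list([0.4, 0.8]),
    "label": _int64_list([1, 2]),
}))

encoded, boxes, labels = parse_example(example.SerializeToString())
print(boxes.shape, labels.numpy())
```

In a training pipeline the same function would typically be applied with `tf.data.TFRecordDataset(path).map(parse_example)`, followed by `tf.io.decode_jpeg` on the encoded bytes once they hold a real image.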



## 4. Summary

- **A TFRecord file**: A collection of **serialized `Example`** messages.  
- **`tf.train.Example`**: Defines how each record’s data is organized (through a features map).  
- **`Feature`**: The building block that stores a list of bytes, floats, or int64s.  
- **Efficiency & Flexibility**: By encapsulating data in this manner, you can handle images, text, numerical arrays, and more, all in a single format conducive to large-scale machine learning.

Overall, TFRecord and `tf.train.Example` are core tools in TensorFlow to package and process data efficiently.



## Sanity-Check of TFRecords

In [15]:
def inspect_tfrecord(tfrecord_path, max_samples=4):
    """
    Inspects and prints a specified number of examples from a TFRecord file.
    
    Args:
        tfrecord_path (str): Path to the TFRecord file.
        max_samples (int): Number of examples to inspect.
    """
    dataset = tf.data.TFRecordDataset(tfrecord_path)
    for raw_record in dataset.take(max_samples):
        example = tf.train.Example()
        example.ParseFromString(raw_record.numpy())
        print(example)
        print("-" * 80)

In [None]:
# Inspect the generated TFRecords
print("\nInspecting Train TFRecord:")
train_tfrecord_path = os.path.join(output_dir, "train.tfrecord")
inspect_tfrecord(train_tfrecord_path, max_samples=2)

print("\nInspecting Validation TFRecord:")
val_tfrecord_path = os.path.join(output_dir, "val.tfrecord")
inspect_tfrecord(val_tfrecord_path, max_samples=2)

print("\nInspecting Test TFRecord:")
test_tfrecord_path = os.path.join(output_dir, "test.tfrecord")
inspect_tfrecord(test_tfrecord_path, max_samples=2)