# Merge Datasets

This recipe demonstrates a simple pattern for merging FiftyOne Datasets via [Dataset.merge_samples()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html?highlight=merge_samples#fiftyone.core.dataset.Dataset.merge_samples).

Merging datasets is an easy way to:

-   Combine multiple datasets that contain information about the same underlying raw media (images and videos)
-   Add model predictions to a FiftyOne dataset, to compare with ground truth annotations and/or other models

## Setup

In this recipe, we'll work with a dataset downloaded from the [FiftyOne Dataset Zoo](https://voxel51.com/docs/fiftyone/user_guide/dataset_creation/zoo.html).

To access the dataset, install `torch` and `torchvision`, if necessary:

In [None]:
# Modify as necessary (e.g., GPU install). See https://pytorch.org for options
!pip install torch
!pip install torchvision

Then download the test split of [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html):

In [1]:
# Download the test split of CIFAR-10
!fiftyone zoo datasets download cifar10 --splits test

Split 'test' already downloaded


## Merging model predictions

Load the test split of CIFAR-10 into FiftyOne:

In [2]:
import random
import os

import fiftyone as fo
import fiftyone.zoo as foz

# Load test split of CIFAR-10
dataset = foz.load_zoo_dataset("cifar10", split="test", dataset_name="merge-example")
classes = dataset.info["classes"]

print(dataset)

Split 'test' already downloaded
Loading 'cifar10' split 'test'
 100% |████████████████████████████████████████████████| 10000/10000 [13.0s elapsed, 0s remaining, 790.0 samples/s]      
Name:           merge-example
Media type      image
Num samples:    10000
Persistent:     False
Info:           {'classes': ['airplane', 'automobile', 'bird', ...]}
Tags:           ['test']
Sample fields:
    media_type:   fiftyone.core.fields.StringField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)


The dataset contains ground truth labels in its `ground_truth` field:

In [3]:
# Print a sample from the dataset
print(dataset.first())

<Sample: {
    'id': '5f778dd116c859ba8e9a0113',
    'media_type': 'image',
    'filepath': '/Users/Brian/fiftyone/cifar10/test/data/000001.jpg',
    'tags': BaseList(['test']),
    'metadata': None,
    'ground_truth': <Classification: {
        'id': '5f778dd116c859ba8e9a0112',
        'label': 'cat',
        'confidence': None,
        'logits': None,
    }>,
}>


Suppose you would like to add model predictions to some samples from the dataset.

The usual way to do this is to iterate over the samples and add your predictions to them directly:

In [4]:
def run_inference(filepath):
    # Run inference on `filepath` here.
    # For simplicity, we'll just generate a random label
    label = random.choice(classes)
    
    return fo.Classification(label=label)

In [5]:
# Choose 100 samples at random
random_samples = dataset.take(100)

# Add model predictions to dataset
for sample in random_samples:
    sample["prediction"] = run_inference(sample.filepath)
    sample.save()

print(dataset)

Name:           merge-example
Media type      image
Num samples:    10000
Persistent:     False
Info:           {'classes': ['airplane', 'automobile', 'bird', ...]}
Tags:           ['test']
Sample fields:
    media_type:   fiftyone.core.fields.StringField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    prediction:   fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)


However, suppose you store the predictions in a separate dataset:

In [6]:
# Filepaths of images to process
filepaths = [s.filepath for s in dataset.take(100)]

# Run inference
predictions = fo.Dataset()
for filepath in filepaths:
    sample = fo.Sample(filepath=filepath)
    
    sample["prediction"] = run_inference(filepath)

    predictions.add_sample(sample)

print(predictions)

Name:           2020.10.02.16.30.42
Media type      image
Num samples:    100
Persistent:     False
Info:           {}
Tags:           []
Sample fields:
    media_type: fiftyone.core.fields.StringField
    filepath:   fiftyone.core.fields.StringField
    tags:       fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:   fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    prediction: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)


You can easily merge the `predictions` dataset into the main dataset via [Dataset.merge_samples()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html?highlight=merge_samples#fiftyone.core.dataset.Dataset.merge_samples).

Let's start by loading a fresh copy of CIFAR-10 that doesn't have predictions:

In [7]:
dataset2 = foz.load_zoo_dataset("cifar10", split="test", dataset_name="merge-example2")

Split 'test' already downloaded
Loading 'cifar10' split 'test'
 100% |████████████████████████████████████████████████| 10000/10000 [12.7s elapsed, 0s remaining, 775.6 samples/s]      


Now merge the predictions:

In [8]:
# Merge predictions
dataset2.merge_samples(predictions)

# Verify that 100 samples in `dataset2` now have predictions
print(dataset2.exists("prediction"))

 100% |████████████████████████████████████████████████████| 100/100 [288.3ms elapsed, 0s remaining, 346.9 samples/s]      
Dataset:        merge-example2
Num samples:    100
Tags:           ['test']
Sample fields:
    media_type:   fiftyone.core.fields.StringField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    prediction:   fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
Pipeline stages:
    1. Exists(field='prediction', bool=True)


By default, samples with the same absolute `filepath` are merged. However, you can customize this as desired via various keyword arguments of [Dataset.merge_samples()](https://voxel51.com/docs/fiftyone/api/fiftyone.core.dataset.html?highlight=merge_samples#fiftyone.core.dataset.Dataset.merge_samples).

For example, the call below merges samples that share a base filename, ignoring their directories:

In [9]:
# Merge predictions, using the base filename of the samples to decide which samples to merge
# In this case, we've already performed the merge, so the existing data is overwritten
dataset2.merge_samples(predictions, key_fcn=lambda p: os.path.basename(p))

 100% |████████████████████████████████████████████████████| 100/100 [293.0ms elapsed, 0s remaining, 341.3 samples/s]      
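
The two matching rules above (absolute `filepath` by default, base filename via a custom key function) can be illustrated without FiftyOne at all. The block below is a conceptual sketch only, not FiftyOne's implementation: records are plain dicts, `merge_records` is a hypothetical helper, and the choice to append unmatched records is an assumption of this sketch.

```python
import os


def merge_records(base, incoming, key_fcn=None):
    """Merge `incoming` records into `base` in place, matching on a filepath key.

    Hypothetical helper for illustration; records are plain dicts that
    each contain a "filepath" entry.
    """
    if key_fcn is None:
        key_fcn = os.path.abspath  # default: match on the absolute filepath

    # Index the existing records by their merge key
    index = {key_fcn(r["filepath"]): r for r in base}

    for record in incoming:
        key = key_fcn(record["filepath"])
        if key in index:
            # Match found: copy the incoming fields onto the existing record,
            # overwriting fields of the same name but keeping its filepath
            for field, value in record.items():
                if field != "filepath":
                    index[key][field] = value
        else:
            # No match: append a copy as a new record (assumption of this sketch)
            new_record = dict(record)
            base.append(new_record)
            index[key] = new_record

    return base


base = [
    {"filepath": "/data/a/000001.jpg", "ground_truth": "cat"},
    {"filepath": "/data/a/000002.jpg", "ground_truth": "frog"},
]
incoming = [{"filepath": "/other/000002.jpg", "prediction": "frog"}]

# Default key: the absolute paths differ, so the incoming record is appended
by_path = merge_records([dict(r) for r in base], incoming)

# Basename key: "000002.jpg" matches, so the prediction is merged in
by_name = merge_records([dict(r) for r in base], incoming, key_fcn=os.path.basename)
```

Note that a basename key only makes sense when filenames are unique across directories; if two records share a basename, this sketch silently keeps the last one in its index.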


Let's print a sample with predictions to verify that the merge happened as expected:

In [10]:
# Print a sample with predictions
print(dataset2.exists("prediction").first())

<SampleView: {
    'id': '5f778df916c859ba8e9a79d1',
    'media_type': 'image',
    'filepath': '/Users/Brian/fiftyone/cifar10/test/data/000103.jpg',
    'tags': BaseList(['test']),
    'metadata': None,
    'ground_truth': <Classification: {
        'id': '5f778df916c859ba8e9a79d0',
        'label': 'frog',
        'confidence': None,
        'logits': None,
    }>,
    'prediction': <Classification: {
        'id': '5f778df216c859ba8e9a77cb',
        'label': 'frog',
        'confidence': None,
        'logits': None,
    }>,
}>
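
Once predictions sit next to the ground truth on the same samples, comparing them is a simple loop. The block below is a minimal sketch: `accuracy` is a hypothetical helper that assumes each item exposes `ground_truth.label` and `prediction.label`, as the sample printed above does, and the demo uses stand-in objects rather than real FiftyOne samples.

```python
from types import SimpleNamespace


def accuracy(samples):
    """Fraction of samples whose predicted label equals the ground truth label."""
    total = 0
    correct = 0
    for sample in samples:
        total += 1
        if sample.prediction.label == sample.ground_truth.label:
            correct += 1

    # Avoid dividing by zero when there are no samples
    return correct / total if total else 0.0


# Stand-in objects for illustration; not real FiftyOne samples
demo = [
    SimpleNamespace(
        ground_truth=SimpleNamespace(label="frog"),
        prediction=SimpleNamespace(label="frog"),
    ),
    SimpleNamespace(
        ground_truth=SimpleNamespace(label="cat"),
        prediction=SimpleNamespace(label="ship"),
    ),
]

demo_accuracy = accuracy(demo)
```

With the datasets from this recipe, passing `dataset2.exists("prediction")` to the helper should work the same way, since it restricts the loop to samples that have both fields. Because the predictions here are random labels over ten classes, the result should land near 0.1.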
