# Intro to Dataset Curation with FiftyOne and CLIP (Part 2 of 2)

In this notebook we explore the use of:

* FiftyOne's dataset curation SDK and visualization app
* Multimodal embeddings (text + image) from the [CLIP](https://arxiv.org/pdf/2103.00020) model to produce labels for images (also known as zero-shot classification)

We label a dataset of aerial images from Google Earth View. This dataset is the output of the deduplication that we performed [in the previous notebook](https://github.com/andandandand/practical-computer-vision/blob/main/notebooks/Intro_Dataset_Curation_Deduplicate_Aerial_Images.ipynb).

![](https://github.com/andandandand/practical-computer-vision/blob/main/images/clip_ensemble_labels.png?raw=true)

In part 2, we use the CLIP model to produce labels for the images, and then visualize the final result in FiftyOne.
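As a rough illustration of what zero-shot classification does, the sketch below scores one image embedding against a few text embeddings with cosine similarity and turns the scores into probabilities with a softmax. The vectors, the `scale` value, and the helper names are made-up placeholders, not real CLIP outputs; the actual model computes the embeddings and applies its own learned scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, labels, scale=100.0):
    """Return (label, confidence, logits) for the best-matching text prompt.

    `scale` is an assumed temperature-like factor chosen for illustration.
    """
    logits = [scale * cosine_similarity(image_embedding, t) for t in text_embeddings]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best], logits


# Toy 3-dimensional embeddings (hypothetical values, for illustration only)
toy_labels = ["a river", "a desert", "a city"]
toy_text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_image_embedding = [0.2, 0.9, 0.1]

label, confidence, logits = zero_shot_classify(
    toy_image_embedding, toy_text_embeddings, toy_labels
)
print(label, round(confidence, 3))
```

The label with the highest probability becomes the prediction, and that probability plays the role of the confidence value stored alongside it.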

In [22]:
import fiftyone as fo
import fiftyone.zoo as foz
import torch
from pathlib import Path
import os
import numpy as np


In [23]:
parent_path = Path("/Users/antonio/Documents/Projects/GettingStartedWithFiftyOne/local_run/")
aerial_images_path = parent_path / 'data/aerial_images'
augmented_aerial_images_path = parent_path / 'data/augmented_aerial_images'
# You can also import the full dataset if you ran the notebook on deduplication first (adapt the path accordingly)
aerial_images_without_duplicates = parent_path / 'data/aerial_images_without_duplicates/data'

In [24]:
# Check if dataset exists and delete it (dataset names are unique in FiftyOne)
dataset_name = "aerial-images-tagged"  

if dataset_name in fo.list_datasets():
    print(f"Dataset '{dataset_name}' exists. Deleting...")
    fo.delete_dataset(dataset_name)
    print(f"Dataset '{dataset_name}' deleted.")
else:
    print(f"Dataset '{dataset_name}' does not exist.")

Dataset 'aerial-images-tagged' exists. Deleting...
Dataset 'aerial-images-tagged' deleted.


In [25]:
# Create the dataset
dataset = fo.Dataset.from_dir(
    dataset_dir=aerial_images_path,
    dataset_type=fo.types.ImageDirectory,
    name=dataset_name,
    persistent=True
)

dataset.compute_metadata()

 100% |█████████████████| 113/113 [18.5ms elapsed, 0s remaining, 6.1K samples/s]
Computing metadata...
 100% |█████████████████| 113/113 [56.7ms elapsed, 0s remaining, 2.0K samples/s]


In [26]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [27]:
text_prompt = 'an aerial photo of'

classes = ['a river',
           'a river in the jungle',
           'a river next to an urban area',
           'a river delta merging with the sea',
           'a body of water',
           'a body of water next to an urban area',
           'a jungle',
           'a forest',
           'a river in a forest',
           'a farmland',
           'a coast',
           'a desert',
           "a harbor next to a desert",
           'a desert next to a river',
           'a desert next to a body of water',
           'a desert next to an urban area',
           'a desert next to a coast',
           'a desert next to a forest',
           'terrain covered by snow',
           'a city',
           'an airport',
           'a sports stadium',
           'an urban area',
           'a city next to the coast',
           'military planes parked next to each other',
           'containers in a harbor',
           'ships in the ocean',
           'the ocean',
           'a beach',
           'a beach next to an urban area',
           'a mountainscape',
           'a refinery',
           'ships and containers in a harbor',
           'ships and boats in a harbor, next to an urban area',
           'dense vegetation next to a desert',
           'an island',
           'a harbor next to an urban area',
           'antartica or artic area, ice and water',
           'railroad tracks',
           'a train station', 
           'a highway', 
           'farming terraces',
           'an oil rig in the sea']
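The class names above carry their own articles ("a river", "an airport") so that they read naturally after the prompt. Here is a minimal sketch of how the final text prompts are presumably assembled, assuming the prompt and each class name are joined with a single space; `build_prompts` is a hypothetical helper for illustration, not part of FiftyOne.

```python
def build_prompts(text_prompt, classes):
    """Join the shared prompt with each class name (assumed: single space)."""
    return [f"{text_prompt} {class_name}" for class_name in classes]


example_prompts = build_prompts("an aerial photo of", ["a river", "an airport"])
for prompt in example_prompts:
    print(prompt)
```

Reading the assembled prompts aloud is a quick way to spot class names that would produce awkward or ambiguous sentences.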

In [28]:
clip_model = foz.load_zoo_model(
    "clip-vit-base32-torch",
    text_prompt=text_prompt,
    classes=classes,
    device=device
)

In [29]:
print(f"The model is loaded on {clip_model._device}")

The model is loaded on cpu


In [30]:
dataset.apply_model(
    model=clip_model,
    label_field="clip_zero_shot_classification",
    # Number of samples passed to the model per batch
    batch_size=32,
    # Keep the raw per-class logits alongside the predicted label
    store_logits=True,
    progress=True,
)

 100% |█████████████████| 113/113 [2.5s elapsed, 0s remaining, 44.8 samples/s]      




In [31]:
session = fo.launch_app(dataset, auto=False)
print(session.url)

Session launched. Run `session.show()` to open the App in a cell output.

http://localhost:5151/

Could not connect session, trying again in 10 seconds

## CLIP variants

In [32]:
open_clip_args = {
    "clipa": {
        "clip_model": 'hf-hub:UCSC-VLAA/ViT-L-14-CLIPA-datacomp1B',
        "pretrained": '',
        },
    "dfn": {
        "clip_model": 'ViT-B-16-quickgelu',
        "pretrained": 'dfn2b',
        },
    "eva02_clip": {
        "clip_model": 'EVA02-B-16',
        "pretrained": 'merged2b_s8b_b131k',
        },
    "metaclip": {
        "clip_model": 'ViT-B-32-quickgelu',
        "pretrained": 'metaclip_400m',
        },
    }

In [33]:
for name, args in open_clip_args.items():
    print(f"Applying {name} model")
    # Read the settings straight from `args` so the `clip_model` object loaded earlier is not overwritten
    model = foz.load_zoo_model(
        "open-clip-torch",
        clip_model=args["clip_model"],
        pretrained=args["pretrained"],
        classes=classes,
        # Prepended to each class name to build the text prompts
        text_prompt=text_prompt,
    )

    # Inference options (`store_logits`, `batch_size`) are arguments of `apply_model`
    dataset.apply_model(model, label_field=name, store_logits=True, batch_size=32)
    session.refresh()

Applying clipa model
 100% |█████████████████| 113/113 [25.7s elapsed, 0s remaining, 4.4 samples/s]
Applying dfn model
 100% |█████████████████| 113/113 [9.3s elapsed, 0s remaining, 13.5 samples/s]
Applying eva02_clip model
 100% |█████████████████| 113/113 [8.3s elapsed, 0s remaining, 14.3 samples/s]
Applying metaclip model
 100% |█████████████████| 113/113 [4.8s elapsed, 0s remaining, 26.7 samples/s]


In [34]:
session.view = dataset.view()
print(session.url)

http://localhost:5151/


In [35]:
dataset.get_field_schema()

OrderedDict([('id', <fiftyone.core.fields.ObjectIdField at 0x347d1a490>),
             ('filepath', <fiftyone.core.fields.StringField at 0x3478733d0>),
             ('tags', <fiftyone.core.fields.ListField at 0x347883010>),
             ('metadata',
              <fiftyone.core.fields.EmbeddedDocumentField at 0x34788fe10>),
             ('created_at',
              <fiftyone.core.fields.DateTimeField at 0x34fe77210>),
             ('last_modified_at',
              <fiftyone.core.fields.DateTimeField at 0x34788b050>),
             ('clip_zero_shot_classification',
              <fiftyone.core.fields.EmbeddedDocumentField at 0x3b3a81350>),
             ('clipa',
              <fiftyone.core.fields.EmbeddedDocumentField at 0x34786bb90>),
             ('dfn',
              <fiftyone.core.fields.EmbeddedDocumentField at 0x34788d650>),
             ('eva02_clip',
              <fiftyone.core.fields.EmbeddedDocumentField at 0x347880dd0>),
             ('metaclip',
              <fiftyone.cor

In [36]:
sample = dataset.first()
sample

<Sample: {
    'id': '6851d07672d2dbe25ef35c3c',
    'media_type': 'image',
    'filepath': '/Users/antonio/Documents/Projects/GettingStartedWithFiftyOne/local_run/data/aerial_images/abu dhabi.jpeg',
    'tags': [],
    'metadata': <ImageMetadata: {
        'size_bytes': 444325,
        'mime_type': 'image/jpeg',
        'width': 1800,
        'height': 1200,
        'num_channels': 3,
    }>,
    'created_at': datetime.datetime(2025, 6, 17, 20, 30, 46, 530000),
    'last_modified_at': datetime.datetime(2025, 6, 17, 20, 31, 45, 399000),
    'clip_zero_shot_classification': <Classification: {
        'id': '6851d07972d2dbe25ef35cad',
        'tags': [],
        'label': 'a river delta merging with the sea',
        'confidence': 0.24778258800506592,
        'logits': array([29.820421, 27.642035, 28.140705, 31.505201, 28.19588 , 27.019852,
               26.132374, 26.099657, 27.278822, 26.562037, 28.177319, 29.433647,
               30.927937, 31.405432, 30.858131, 28.631344, 30.293571,

In [37]:

predictions_fields = ['clip_zero_shot_classification', 'clipa', 'dfn', 'eva02_clip', 'metaclip']
for sample in dataset:
    # Collect the predicted label and confidence from each model
    sample_labels = []
    confidences = []
    for prediction_field in predictions_fields:
        sample_labels.append(sample[prediction_field].label)
        confidences.append(sample[prediction_field].confidence)

    # Convert to numpy arrays
    labels_array = np.array(sample_labels)
    confidences_array = np.array(confidences)

    # Find unique labels and their counts (np.unique returns the labels sorted)
    unique_labels, counts = np.unique(labels_array, return_counts=True)

    # Find the maximum count and get all labels with that count
    max_count = np.max(counts)
    most_common_mask = counts == max_count
    most_common_labels = unique_labels[most_common_mask]

    # On a tie, keep the first label in sorted (alphabetical) order
    most_common_label = str(most_common_labels[0])

    # Get the indices of the models that predicted this label
    indices = np.where(labels_array == most_common_label)[0]

    # Calculate the mean confidence over those models only
    conf_mean = float(np.mean(confidences_array[indices]))

    # Save the most common label and its mean confidence as a Classification
    sample['most_common_label'] = fo.Classification(label=most_common_label, confidence=conf_mean)
    sample.save()

session.refresh()
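The cell above resolves ties by taking the first label in sorted order. As an alternative sketch (not what this notebook runs), the vote can be written with the standard library, breaking ties by the highest mean confidence and also reporting how many models agreed. The function name `consensus_label` and the toy inputs are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_label(labels, confidences):
    """Majority vote over per-model labels.

    Ties are broken by the highest mean confidence (an assumed policy).
    Returns (label, mean confidence of that label, fraction of models agreeing).
    """
    counts = Counter(labels)
    max_count = max(counts.values())

    # Group confidences by label so each label's mean can be computed
    confidences_by_label = defaultdict(list)
    for label, confidence in zip(labels, confidences):
        confidences_by_label[label].append(confidence)
    mean_confidence = {
        label: sum(values) / len(values)
        for label, values in confidences_by_label.items()
    }

    # Among the labels with the most votes, prefer the most confident one
    tied = [label for label, count in counts.items() if count == max_count]
    winner = max(tied, key=lambda label: mean_confidence[label])
    agreement = max_count / len(labels)
    return winner, mean_confidence[winner], agreement


# Toy predictions from five models (hypothetical values)
toy_labels = ["a city", "a desert", "a city", "a desert", "a coast"]
toy_confidences = [0.4, 0.7, 0.5, 0.9, 0.3]

winner, winner_confidence, agreement = consensus_label(toy_labels, toy_confidences)
print(winner, winner_confidence, agreement)
```

The agreement fraction can help flag samples where the models disagree and a manual review may be worthwhile.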


## Visualize the dataset with consensus labels

In [38]:
# Reset the App view to the full dataset, which now includes the consensus labels
session.view = dataset.view()
print(session.url)

http://localhost:5151/


In [39]:
# Export to disk
export_dir = str(parent_path / "data/tagged_aerial_images")
dataset.export(
    export_dir=export_dir,
    dataset_type=fo.types.FiftyOneDataset,
    export_media=True,  # Include the media files
    overwrite=True  # Overwrite existing files if they exist
)
print(f"Dataset exported to: {export_dir}")

Exporting samples...
 100% |████████████████████| 113/113 [106.1ms elapsed, 0s remaining, 1.1K docs/s]


Dataset exported to: /Users/antonio/Documents/Projects/GettingStartedWithFiftyOne/local_run/data/tagged_aerial_images


In [40]:
# Import from disk
imported_dataset = fo.Dataset.from_dir(
    dataset_dir=export_dir,
    dataset_type=fo.types.FiftyOneDataset,
)


Importing samples...
 100% |█████████████████| 113/113 [5.0ms elapsed, 0s remaining, 22.7K samples/s]


In [41]:
# Check that the custom field survived the export/import round trip
# (`take(3)` draws 3 random samples, so the output varies between runs)
print("Testing most_common_label field:")
for i, sample in enumerate(imported_dataset.take(3)):
    if hasattr(sample, 'most_common_label'):
        print(f"Sample {i+1}: {sample.most_common_label.label} (conf: {sample.most_common_label.confidence:.3f})")
    else:
        print(f"Sample {i+1}: No most_common_label field")

Testing most_common_label field:
Sample 1: farming terraces (conf: 0.559)
Sample 2: a river delta merging with the sea (conf: 0.765)
Sample 3: a farmland (conf: 0.510)
