# 🛣️ Exploring the BDD100K Dataset with FiftyOne

Welcome to this hands-on workshop on dataset exploration and semantic search using [FiftyOne](https://voxel51.com/fiftyone/), the BDD100K dataset, and CLIP embeddings.

## 📦 What is BDD100K?

The **Berkeley DeepDrive BDD100K** dataset is one of the largest and most diverse open driving datasets available. It contains **100,000 video clips**, annotated with rich metadata for tasks such as:

- **Object Detection**
- **Lane Detection**
- **Instance Segmentation**
- **Drivable Area Segmentation**
- **Multiple Object Tracking**
- **Image Classification**
- **Domain Adaptation**

Collected from dashcams mounted on vehicles, BDD100K covers a broad range of real-world driving scenarios, including various **weather conditions, times of day, and geographic locations**. This diversity makes it ideal for training and evaluating robust computer vision models in the context of autonomous driving and mobility research.

## 🎯 Workshop Objective

In this notebook, we will:

- Load and explore the BDD100K dataset using FiftyOne
- Apply filters to create views of interest (e.g., specific weather or time-of-day conditions)
- Prepare the dataset for semantic search using CLIP embeddings

By the end of this session, you will have a solid foundation in using FiftyOne to **interact with large-scale vision datasets**, enabling smarter data curation and analysis for your machine learning workflows.

Let's get started! 🚗💨


⚠️⚠️⚠️ Make sure you complete the minimum setup before running this workshop. Read the [README_FILE](readme.md) or the [DOCKER_SETUP_FILE](DOCKER_SETUP.md). ⚠️⚠️⚠️

If you set up this workshop with Docker, skip the next cell. If you are using a classic Python or Conda environment, run the next cell.

In [3]:
# Minimum setup required after installing all the dependencies.
import os

# Replace the placeholder with your own Hugging Face access token
os.environ['HUGGING_FACE_HUB_TOKEN'] = "YOUR_TOKEN"

import fiftyone.utils.huggingface as fouh

# Download the dataset from the Hugging Face Hub and keep it across sessions
dataset = fouh.load_from_hub("dgural/bdd100k", persistent=True, name="bdd100k_test")

Downloading config file fiftyone.yml from dgural/bdd100k
Loading dataset
Importing samples...
 100% |█████████████| 10000/10000 [480.1ms elapsed, 0s remaining, 20.8K samples/s]      
Migrating dataset 'bdd100k_test' to v1.5.2


Wait until this endpoint is ready. Any action taken before that can trigger a 400 or 500 HTTP error.
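The setup cell above writes a placeholder token straight into the notebook. As a minimal sketch of a safer pattern, the helper below reads the token from the environment and fails early if it is missing or still set to `YOUR_TOKEN`. The name `require_hf_token` is hypothetical; it is not part of FiftyOne or `huggingface_hub`.

```python
import os

# Placeholder value used in the setup cell; it must be replaced by a real token
PLACEHOLDER = "YOUR_TOKEN"


def require_hf_token(var_name="HUGGING_FACE_HUB_TOKEN"):
    """Return the token stored in `var_name`.

    Raises RuntimeError if the variable is unset, empty, or still the placeholder.
    (Hypothetical helper, shown for illustration only.)
    """
    token = os.environ.get(var_name, "").strip()
    if not token or token == PLACEHOLDER:
        raise RuntimeError(
            f"Set the {var_name} environment variable to a valid token "
            "before loading the dataset."
        )
    return token
```

Calling `require_hf_token()` once before `fouh.load_from_hub(...)` would surface a missing token as a clear error message instead of a failed download.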

## 📁 Load the BDD100K Dataset and Launch FiftyOne
We will use the `BDD100K` dataset from the Hugging Face Hub.

In [1]:
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

import fiftyone.utils.huggingface as fouh # Hugging Face integration

import os

# Optionally increase the connection and read timeout values (in seconds)
# os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "60"  # default is 10
# os.environ["HF_HUB_ETAG_TIMEOUT"] = "30"      # metadata fetch timeout

# Name under which the dataset is stored locally
dataset_name = "bdd100k_test"

# Reuse the local copy if it exists; otherwise download it from the Hub
if dataset_name in fo.list_datasets():
    print(f"Dataset '{dataset_name}' exists. Loading...")
    dataset = fo.load_dataset(dataset_name)
else:
    print(f"Dataset '{dataset_name}' does not exist. Creating a new one...")
    # `dataset` is not defined yet in this branch, so load it from the Hub
    dataset = fouh.load_from_hub("dgural/bdd100k", persistent=True, name=dataset_name)





Dataset 'bdd100k_test' exists. Loading...


### 📋 List Datasets
This cell prints the names of all FiftyOne datasets currently available, followed by a summary of the loaded dataset.

In [2]:
print(fo.list_datasets())
print(dataset)

['2025.03.25.13.07.42', '2025.03.25.13.08.53', '2025.05.06.12.36.35.635286', '2025.05.06.12.37.24.156656', '2025.05.08.09.11.29.118507', '2025.05.08.09.11.53.352258', '2025.05.08.09.26.21.103294', '2025.05.09.09.54.39.616869', '2025.05.13.08.40.31.401648', '2025.05.13.08.40.56.684547', '2025.05.13.11.47.29.746170', '2025.05.13.12.53.44.521773', 'ADL_Fall_Videos_Eval', 'AnomalyMerged_MVTec_ViSA', 'Voxel51/mvtec-ad', 'anomaly_predictions_grouped', 'anomaly_predictions_grouped_carpet', 'arcade-dataset', 'arcade-dataset-seg', 'arcade_dataset-seg1', 'arcade_dataset-segmentations', 'arcade_dataset_merged', 'arcade_dataset_merged-complete', 'bdd100k_100_unique', 'bdd100k_test', 'bdd100k_unique_100', 'bdd10k', 'biotrove-train-300k', 'biotrove-url-based', 'biotrove_balanced_full', 'biotrove_balanced_test', 'biotrove_unseen_full', 'car_dd', 'coffee_FO', 'coffee_FO_SAM2_process', 'coffee_FO_geolocation', 'mvtec-ad_1', 'mvtec-ad_2', 'mvtec-ad_3', 'mvtec-ad_4', 'mvtec-ad_5', 'mvtec-ad_6', 'mvtec-ad

### 🚀 Launch FiftyOne App
This cell launches the FiftyOne App for interactive dataset visualization.

In [3]:
session = fo.launch_app(dataset, port=5151, auto=False)

Connected to FiftyOne on port 5151 at localhost.
If you are not connecting to a remote session, you may need to start a new session and specify a port
Session launched. Run `session.show()` to open the App in a cell output.


### 🖨️ Display Dataset
This cell prints basic information about the currently loaded dataset.

In [4]:
print(dataset)

Name:        bdd100k_test
Media type:  image
Num samples: 10000
Persistent:  True
Tags:        []
Sample fields:
    id:               fiftyone.core.fields.ObjectIdField
    filepath:         fiftyone.core.fields.StringField
    tags:             fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:       fiftyone.core.fields.DateTimeField
    last_modified_at: fiftyone.core.fields.DateTimeField
    detections:       fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    polylines:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Polylines)
    weather:          fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    timeofday:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    scene:            fiftyone.core.fields.EmbeddedDocumentField(fiftyon

### 🔍 Access Dataset Fields
This cell uses `ViewField` to reference the nested `timeofday.label` field and match only the samples labeled `night`.

In [5]:
from fiftyone import ViewField as F

# Access the `label` inside the `timeofday` Classification object
night_view = dataset.match(F("timeofday.label") == "night")
session.view = night_view

### ☔ Filter for Rainy Weather
This cell narrows the night view to samples whose weather label is `rainy`, so the result contains only rainy night scenes.

In [6]:
rain_view = night_view.match(F("weather.label") == "rainy")
session.view = rain_view
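Chaining `match()` calls, as in the two cells above, keeps only the samples that satisfy every condition. The sketch below models that behavior in plain Python on a few invented toy records. It illustrates the semantics only and is not how FiftyOne implements views; the `match` function here is a hypothetical stand-in.

```python
# Invented toy records standing in for dataset samples (not real BDD100K data)
samples = [
    {"id": 1, "timeofday": "night", "weather": "rainy"},
    {"id": 2, "timeofday": "night", "weather": "clear"},
    {"id": 3, "timeofday": "daytime", "weather": "rainy"},
]


def match(records, predicate):
    """Keep only the records for which `predicate` is true (sample-level filter)."""
    return [r for r in records if predicate(r)]


# Two chained filters, mirroring night_view followed by rain_view
night = match(samples, lambda r: r["timeofday"] == "night")
rainy_night = match(night, lambda r: r["weather"] == "rainy")

# A single filter with both conditions selects the same records
combined = match(
    samples,
    lambda r: r["timeofday"] == "night" and r["weather"] == "rainy",
)

print([r["id"] for r in rainy_night])
print(rainy_night == combined)
```

Building one view on top of another is convenient when the intermediate view (here, all night scenes) is worth inspecting on its own.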

### 🚶 Filter for Pedestrians at Night
This cell matches the night-time samples and then keeps only the `pedestrian` detections within them.

In [7]:
night_pedestrian_view = (
    dataset
    .match(F("timeofday.label") == "night")
    .filter_labels("detections", F("label") == "pedestrian")
)
session.view = night_pedestrian_view
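The cell above combines two different kinds of filtering: `match()` selects whole samples, while `filter_labels()` trims the labels inside each sample and, with its default `only_matches=True`, drops samples left with no matching labels. The plain-Python sketch below models that difference on invented toy records. The `match` and `filter_labels` functions here are hypothetical stand-ins, not FiftyOne's implementation.

```python
# Invented toy records standing in for dataset samples (not real BDD100K data)
samples = [
    {"id": 1, "timeofday": "night", "detections": ["car", "pedestrian", "pedestrian"]},
    {"id": 2, "timeofday": "night", "detections": ["car"]},
    {"id": 3, "timeofday": "daytime", "detections": ["pedestrian"]},
]


def match(records, predicate):
    """Sample-level filter: keep or drop whole records."""
    return [r for r in records if predicate(r)]


def filter_labels(records, keep, only_matches=True):
    """Label-level filter: keep only the detections for which `keep` is true.

    With only_matches=True, records left with no detections are dropped.
    The input records are not modified.
    """
    result = []
    for r in records:
        kept = [d for d in r["detections"] if keep(d)]
        if kept or not only_matches:
            result.append({**r, "detections": kept})
    return result


night = match(samples, lambda r: r["timeofday"] == "night")
night_pedestrians = filter_labels(night, lambda d: d == "pedestrian")

for r in night_pedestrians:
    print(r["id"], r["detections"])
```

In this model, sample 2 is a night scene but disappears from the result because it contains no pedestrians, and sample 1 keeps only its pedestrian boxes.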