# 🛣️ Exploring the BDD100K Dataset with FiftyOne

Welcome to this hands-on workshop on dataset exploration and semantic search using [FiftyOne](https://voxel51.com/fiftyone/), the BDD100K dataset, and CLIP embeddings.

## 📦 What is BDD100K?

The **Berkeley DeepDrive BDD100K** dataset is one of the largest and most diverse open driving datasets available. It contains **100,000 video clips**, annotated with rich metadata for tasks such as:

- **Object Detection**
- **Lane Detection**
- **Instance Segmentation**
- **Drivable Area Segmentation**
- **Multiple Object Tracking**
- **Image Classification**
- **Domain Adaptation**

Collected from dashcams mounted on vehicles, BDD100K covers a broad range of real-world driving scenarios, including various **weather conditions, times of day, and geographic locations**. This diversity makes it ideal for training and evaluating robust computer vision models in the context of autonomous driving and mobility research.

## 🎯 Workshop Objective

In this notebook, we will:

- Load and explore the BDD100K dataset using FiftyOne
- Apply filters to create views of interest (e.g., specific weather or time-of-day conditions)
- Prepare the dataset for semantic search using CLIP embeddings

By the end of this session, you will have a solid foundation in using FiftyOne to **interact with large-scale vision datasets**, enabling smarter data curation and analysis for your machine learning workflows.

Let's get started! 🚗💨


In [None]:
# Install necessary packages
#!pip install fiftyone torch torchvision python-dotenv mlflow umap-learn

Wait until this endpoint is ready. Any action taken before then can trigger an HTTP 400 or 500 error.

## 📁 Load the BDD100K Dataset and Launch FiftyOne
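We will use a 10,000-image subset of `BDD100K`. The cell below imports it from a local FiftyOne export (`bdd100k_FO`); the commented-out lines show how to load it from the Hugging Face Hub instead.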

In [None]:
import os

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob
import fiftyone.utils.huggingface as fouh  # Hugging Face integration

# Optional: load directly from the Hugging Face Hub instead of a local export.
# Increase both connection and read timeout values (in seconds) if downloads stall.
# os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "60"  # default is 10
# os.environ["HF_HUB_ETAG_TIMEOUT"] = "30"      # metadata fetch timeout
# dataset = fouh.load_from_hub("dgural/bdd100k", persistent=True, name="bdd10k")  # , overwrite=True)

dataset_name = "bdd10k_imported"

# Path to the exported folder
export_dir = "bdd100k_FO"

# Import from disk only if the name is free: from_dir() raises an error when a
# dataset with this name already exists, e.g. when this cell is re-run
if dataset_name not in fo.list_datasets():
    dataset = fo.Dataset.from_dir(
        dataset_dir=export_dir,
        dataset_type=fo.types.FiftyOneDataset,
        name=dataset_name,  # You can choose any name here
    )

# Load the registered dataset by name
print(f"Dataset '{dataset_name}' exists. Loading...")
dataset = fo.load_dataset(dataset_name)



Importing samples...
 100% |█████████████| 10000/10000 [696.1ms elapsed, 0s remaining, 14.4K samples/s]      
Dataset 'bdd10k_imported' exists. Loading...


### 📋 List Datasets
This cell prints the names of all FiftyOne datasets in the local database, followed by a summary of the dataset loaded above.

In [2]:
print(fo.list_datasets())
print(dataset)

['2025.03.25.13.07.42', '2025.03.25.13.08.53', '2025.05.09.09.54.39.616869', '2025.05.13.11.47.29.746170', '2025.05.13.12.53.44.521773', 'ADL_Fall_Videos_Eval', 'AnomalyMerged_MVTec_ViSA', 'Voxel51/mvtec-ad', 'anomaly_predictions_grouped', 'anomaly_predictions_grouped_carpet', 'arcade-dataset', 'bdd100k_100_unique', 'bdd100k_test', 'bdd10k', 'bdd10k_imported', 'biotrove-train-300k', 'biotrove-url-based', 'biotrove_balanced_full', 'biotrove_unseen_full', 'car_dd', 'coffee_FO', 'coffee_FO_SAM2_process', 'coffee_FO_geolocation', 'mvtec-ad_1', 'mvtec-ad_2', 'mvtec-ad_3', 'mvtec-ad_4', 'mvtec-ad_5', 'mvtec-ad_6', 'mvtec-ad_ad-1', 'mvtec-ad_demo', 'mvtec-ad_no_categories', 'mvtec-bottle', 'mvtec-bottle_2', 'mvtec-carpet-1', 'mvtec-screw', 'mvtecad2', 'mvtecad2_demo', 'mvtecad2_demo_2025', 'mvtecad2_demo_cvpr', 'mvtecad2_grouped', 'pjramg/my_colombian_coffe_FO', 'potato_mvtec', 'ucf101-test']
Name:        bdd10k_imported
Media type:  image
Num samples: 10000
Persistent:  False
Tags:        []

### 🚀 Launch FiftyOne App
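This cell launches the FiftyOne App on port 5161 for interactive dataset visualization. Because `auto=False` is set, the App is not displayed automatically; run `session.show()` to open it in a cell output.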

In [4]:
session = fo.launch_app(dataset, port=5161, auto=False)

Connected to FiftyOne on port 5161 at localhost.
If you are not connecting to a remote session, you may need to start a new session and specify a port
Session launched. Run `session.show()` to open the App in a cell output.


### 🖨️ Display Dataset
This cell prints the dataset summary, including the sample field schema that the filters below rely on.

In [5]:
print(dataset)

Name:        bdd10k_imported
Media type:  image
Num samples: 10000
Persistent:  False
Tags:        []
Sample fields:
    id:                 fiftyone.core.fields.ObjectIdField
    filepath:           fiftyone.core.fields.StringField
    tags:               fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:         fiftyone.core.fields.DateTimeField
    last_modified_at:   fiftyone.core.fields.DateTimeField
    detections:         fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    polylines:          fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Polylines)
    weather:            fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    timeofday:          fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    scene:              fiftyone.core.fields.Emb

### 🔍 Access Dataset Fields
This cell uses `ViewField` (imported as `F`) to reference the nested `timeofday.label` field, keeps only the samples labeled `night`, and shows the resulting view in the App.

In [6]:
from fiftyone import ViewField as F

# Access the `label` inside the `timeofday` Classification object
night_view = dataset.match(F("timeofday.label") == "night")
session.view = night_view

### ☔ Filter for Rainy Weather
This cell narrows `night_view` further to samples whose weather label is `rainy`, so the resulting view contains only rainy night scenes.

In [7]:
rain_view = night_view.match(F("weather.label") == "rainy")
session.view = rain_view

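### 🚶 Filter for Pedestrians at Night
This cell combines a sample-level filter with a label-level filter: `match()` keeps only the night samples, then `filter_labels()` keeps only the `pedestrian` detections within them. By default, samples left with no matching detections are dropped from the view.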

In [8]:
night_pedestrian_view = (
    dataset
    .match(F("timeofday.label") == "night")
    .filter_labels("detections", F("label") == "pedestrian")
)
session.view = night_pedestrian_view