# Pseudo-Labeling Flow

This notebook implements an automated pseudo-labeling pipeline designed to streamline the annotation process for object detection and instance segmentation tasks. The tool iteratively improves model performance by using an initial model trained on a small set of manually annotated data to generate labels on new images, which can then be refined and used to retrain progressively better models.

**A "flow" is one complete pseudo-labeling run with a fixed configuration (model type, initial dataset size, correction strategy). "Iterations" are the individual training cycles within a flow, in which new data is added and the model is retrained.**

The features within this notebook include:
- **Automated Pipeline**: Complete workflow from data preparation to model training
- **Database Logging**: Database tracking for all iterations
- **CVAT Integration**: For viewing and adjusting annotations
- **Flexible Configuration**: Supports different model architectures and training settings
- **Status Monitoring**: Real-time pipeline status and progress tracking
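
The database logging and status monitoring features can be pictured with a small SQLite sketch. The table name `iterations`, its columns, and the helper functions are hypothetical and are not the schema used by `PseudoLabelingPipeline`; only the status values `MERGING` and `COMPLETED` are taken from this notebook's output.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create a (hypothetical) iteration table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS iterations (
               flow_id   TEXT    NOT NULL,
               iteration INTEGER NOT NULL,
               status    TEXT    NOT NULL,
               model_uid TEXT,
               PRIMARY KEY (flow_id, iteration)
           )"""
    )
    return conn


def set_status(conn, flow_id, iteration, status, model_uid=None):
    """Insert the row for one iteration, replacing any earlier row for it."""
    conn.execute(
        "INSERT OR REPLACE INTO iterations (flow_id, iteration, status, model_uid) "
        "VALUES (?, ?, ?, ?)",
        (flow_id, iteration, status, model_uid),
    )
    conn.commit()


def find_resume_point(conn, flow_id):
    """Return (iteration, status) of the newest unfinished iteration, or None."""
    return conn.execute(
        "SELECT iteration, status FROM iterations "
        "WHERE flow_id = ? AND status != 'COMPLETED' "
        "ORDER BY iteration DESC LIMIT 1",
        (flow_id,),
    ).fetchone()


conn = open_db()
set_status(conn, "f0", 0, "COMPLETED", model_uid="example-model-0")  # placeholder UID
set_status(conn, "f0", 1, "MERGING")
resume = find_resume_point(conn, "f0")
print(resume)
```

Keying rows on the flow and iteration pair is what would let an interrupted notebook session pick up at the last unfinished step instead of starting the flow over.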



**Note:** Automatic export to CVAT has only been tested and confirmed to work for instance segmentation and object detection.

## Imports

In [1]:
from pseudo_labeling import PseudoLabelingPipeline

## Global Initializers
Configure the pipeline with your project-specific settings:


In [2]:
# pipeline = PseudoLabelingPipeline(
#     project_name="daniel-osman---streamlining-annotation-bootstrapping/testing",
#     main_dataset_name="full-data:0", #input only
#     initial_annotated_dataset_name="initial-annotations:0",
#     validation_dataset="val:0",
#     sample_size_per_iter=150,
#     current_flow = 0,
#     min_confidence=0.5,
#     local_path='/Users/daniel/Documents/2025 BEP - VBTI Data/testing',
#     cvat_project_id=88,
#     db_path="pseudo_labeling_metadata.db"
# )
#
# print("Pipeline initialized")

pipeline = PseudoLabelingPipeline(
    project_name="vbti/interreg-broccoli",
    main_dataset_name="broccoli-semantic-segmentation-part4-may23-train", #input only
    initial_annotated_dataset_name="broccoli-semantic-segmentation-part4-may23-train",
    validation_dataset="broccoli-semantic-segmentation-part4-may23-val",
    sample_size_per_iter=425,
    current_flow = 0,
    min_confidence=0.75,
    local_path='/Users/daniel/Documents/2025 BEP - VBTI Data/testing',
    cvat_project_id=88, #can leave empty as ''
    db_path="pseudo_labeling_metadata_broc_test.db"
)

print("Pipeline initialized")


2025-06-30 00:01:31.412 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-train to 0 with remote='vbti/interreg-broccoli'.
2025-06-30 00:01:31.414 | INFO     | onedl._local_store.datasets:load:375 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-train to 0.
2025-06-30 00:01:31.415 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset broccoli-semantic-segmentation-part4-may23-train:0 from remote='vbti/interreg-broccoli' with pull_policy=missing.
2025-06-30 00:01:31.427 | INFO     | onedl._local_store.datasets:pull:858 - Dataset broccoli-semantic-segmentation-part4-may23-train:0 already exists in local store. Skipping


Resuming incomplete iteration 1 (status: MERGING)
GLOBAL INITIALIZATIONS INITIALIZED
Project: vbti/interreg-broccoli
Main dataset: broccoli-semantic-segmentation-part4-may23-train
Initial annotated dataset: broccoli-semantic-segmentation-part4-may23-train
Sample size per iteration: 425
Selected flow: f0


2025-06-30 00:01:31.731 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-train to 0 with remote='vbti/interreg-broccoli'.
2025-06-30 00:01:31.732 | INFO     | onedl._local_store.datasets:load:375 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-train to 0.
2025-06-30 00:01:31.733 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset broccoli-semantic-segmentation-part4-may23-train:0 from remote='vbti/interreg-broccoli' with pull_policy=missing.
2025-06-30 00:01:31.736 | INFO     | onedl._local_store.datasets:pull:858 - Dataset broccoli-semantic-segmentation-part4-may23-train:0 already exists in local store. Skipping


Initial annotated dataset contains: 425 samples
Flow f0 already exists in database - ready to resume
Last completed iteration: 0


2025-06-30 00:01:32.185 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset train-f0 to 0 with remote='vbti/interreg-broccoli'.
2025-06-30 00:01:32.186 | INFO     | onedl._local_store.datasets:load:375 - Resolved latest version of dataset train-f0 to 0.
2025-06-30 00:01:32.186 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset train-f0:0 from remote='vbti/interreg-broccoli' with pull_policy=missing.
2025-06-30 00:01:32.189 | INFO     | onedl._local_store.datasets:pull:858 - Dataset train-f0:0 already exists in local store. Skipping


Training dataset train-f0 exists with 425 samples
Ready for iteration 1
Pipeline initialized


## Training Config
Set up your model training parameters, either interactively through a widget or directly as a dictionary:


In [3]:
# Option 1: Interactive widget setup (uncomment to use)
# train_cfg = pipeline.setup_training_config()

# Option 2: Direct dictionary configuration (recommended for specific config)
pipeline.train_cfg = {
    'model_type': 'UPerNetConfig',
    'task_type': 'semantic_segmentation',
    'backbone': 'CONVNEXT_V2_BASE',
    'epochs': 150,
    'batch_size': 3,
    'input_size': (640, 640)
}

print("Training configuration set:")
print(pipeline.train_cfg)


Training configuration set:
{'model_type': 'UPerNetConfig', 'task_type': 'semantic_segmentation', 'backbone': 'CONVNEXT_V2_BASE', 'epochs': 150, 'batch_size': 3, 'input_size': (640, 640)}


# (1) Initial Flow, Training, and Evaluation Setup

This step initiates the current flow and establishes the baseline model using your initial annotated dataset.
1. Loads the initial annotated dataset.
2. Creates the training set for the current flow.
3. Trains the first baseline model (iteration 0) for the specified flow.
4. Evaluates the model on the validation dataset.
5. Logs metadata. The only variables generated by the pipeline itself during this process are 'predicted_dataset_name', 'model_uid', 'evaluation_uid', and 'evaluation_info'.

⚠️ ATTENTION: Skip this section if your current flow already exists and if you already have a baseline model

In [4]:
pipeline.get_pipeline_status()


PIPELINE STATUS REPORT
Flow ID: f0
Current Iteration: 1
Current Status: MERGING
Training Dataset: train-f0
Current Model UID: None
Training Configuration: {'model_type': 'UPerNetConfig', 'task_type': 'semantic_segmentation', 'backbone': 'CONVNEXT_V2_BASE', 'epochs': 150, 'batch_size': 3, 'input_size': (640, 640)}
Database Path: pseudo_labeling_metadata_broc_test.db
Sample Size Per Iteration: 425
Minimum Confidence Threshold: 0.75

RECENT ITERATIONS:
  Iteration 1: MERGING
  Iteration 0: COMPLETED (completed: 2025-06-29 17:44:58)


If you want to use an existing model as the baseline, run the function below with that model's UID, then skip to Section (2):

In [None]:
pipeline.log_iteration_0_external_model("smoky-shepherd-0")

### 1.1 Train Initial Model and Evaluate on Validation Set
Train the baseline model on your initial annotated dataset.
Evaluate the initial model performance on the validation dataset.




In [None]:
pipeline.train_model()
pipeline.evaluate_model()

### 1.2 Log
Save all metadata for the initial training iteration to the database.


In [None]:
# pipeline.log_iteration_0()
# print("Initial model training and evaluation complete, Current status:")
pipeline.get_pipeline_status()

# (2) Pseudo-Labeling Iteration Workflow

This step executes a complete pseudo-labeling iteration cycle using the model from the previous iteration to generate labels on new data.
1. Sets up the next iteration with correction strategy (manual or automated).
2. Samples new unlabeled data from the full dataset.
3. Runs inference using the previous iteration's model to generate pseudo-labels.
4. Handles corrections based on strategy: exports to CVAT for manual corrections OR merges pseudo-labels directly.
5. Trains updated model on expanded dataset (original + new data).
6. Evaluates the updated model performance.
7. Logs iteration metadata to track progress and results.

**⚠️ ATTENTION: Set 'manual_corrections=True' for CVAT workflow with human review, or 'manual_corrections=False' for fully automated pseudo-labeling**


### Local Initializer
Configure the parameters for the next iteration.

In [None]:
# Set manual_corrections=True for CVAT human review
# Set manual_corrections=False for fully automated pseudo-labeling
manual_corrections = False
pipeline.setup_next_iteration(manual_corrections)


### 2.1 Sample New Data

In [None]:
pipeline.sample_unseen_inputs()
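
One way to picture this sampling step is drawing a fixed-size random batch from the images that no earlier iteration has used. The sketch below is an illustration under that assumption; `sample_unseen`, its arguments, and the image IDs are hypothetical and do not describe the internals of `pipeline.sample_unseen_inputs()`.

```python
import random


def sample_unseen(all_ids, seen_ids, sample_size, seed=0):
    """Pick up to `sample_size` IDs that are not in `seen_ids`."""
    seen = set(seen_ids)
    # Sort so the result depends only on the seed, not on input order.
    unseen = sorted(i for i in all_ids if i not in seen)
    rng = random.Random(seed)
    k = min(sample_size, len(unseen))  # the last batch may be smaller
    return rng.sample(unseen, k)


# Hypothetical image IDs: 1000 images, of which the first 425 are already used.
all_ids = [f"img_{i:04d}" for i in range(1000)]
seen_ids = all_ids[:425]

batch = sample_unseen(all_ids, seen_ids, sample_size=425)
print(len(batch))
```

Capping the batch at the number of remaining images means the final iteration of a flow simply receives a smaller batch rather than failing.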


### 2.2 Generate Predictions/Pseudo-Labels


In [5]:
# pipeline.set_inference_model_uid('')
pipeline.set_predicted_dataset('broccoli-semantic-segmentation-part4-may23-train-0--cpu--75080')

# pipeline.run_inference()

Predicted dataset set to: broccoli-semantic-segmentation-part4-may23-train-0--cpu--75080
Ready to proceed to manual corrections or merge.
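
For object detection and instance segmentation, a threshold such as 'min_confidence' is commonly applied to predictions before they are kept as pseudo-labels. The merge log in this run states that no confidence filtering is applied for semantic segmentation. A minimal sketch of such a filter, assuming each prediction is a dictionary with a `score` key; this structure and the `filter_by_confidence` helper are hypothetical and not part of the pipeline's API.

```python
def filter_by_confidence(predictions, min_confidence=0.5):
    """Keep only predictions whose score is at or above the threshold."""
    return [p for p in predictions if p["score"] >= min_confidence]


# Made-up predictions for one image; labels reuse this project's class names.
predictions = [
    {"label": "fungus", "score": 0.91, "bbox": [10, 20, 50, 60]},
    {"label": "rotten_area", "score": 0.75, "bbox": [30, 40, 80, 90]},
    {"label": "fungus", "score": 0.42, "bbox": [5, 5, 15, 15]},
]

kept = filter_by_confidence(predictions, min_confidence=0.75)
print([p["score"] for p in kept])
```

A higher threshold admits fewer but cleaner pseudo-labels, so the value trades label noise against the amount of new training data per iteration.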


### 2.3 CVAT Export

In [6]:
if pipeline.manual_corrections_global:
    print("Manual corrections enabled - proceeding to CVAT export")
    pipeline.manually_correct_cvat()
    print("After completing corrections in CVAT, manually update the predicted dataset and run the merge cell below")
else:
    print("No Manual Correction, Proceed to merging the datasets")


No Manual Correction, Proceed to merging the datasets


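### 2.4 Merge Data
This cell merges the pseudo-labeled dataset with the current training set. If **manual_corrections = True**, the corrected annotations are exported from CVAT and merged instead of the raw pseudo-labels. With **pseudo_only=True**, as used below, the initial annotations are skipped and the training set is rebuilt from pseudo-labels only.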


In [7]:
pipeline.merge_pseudo_labels(pseudo_only=True)

Starting merge process...
Automated mode - merging pseudo-labels directly...

=== REBUILDING TRAINING DATASET FOR ITERATION 1 ===
PSEUDO-ONLY MODE: Skipping initial annotations, training only on pseudo-labels
✓ Skipped initial annotations (pseudo-only mode)
✓ No manual corrections dataset found


2025-06-30 00:02:05.816 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-train-0--cpu--75080 to 1 with remote='vbti/interreg-broccoli'.
2025-06-30 00:02:05.816 | INFO     | onedl._local_store.datasets:load:375 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-train-0--cpu--75080 to 1.
2025-06-30 00:02:05.817 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset broccoli-semantic-segmentation-part4-may23-train-0--cpu--75080:1 from remote='vbti/interreg-broccoli' with pull_policy=missing.
2025-06-30 00:02:05.819 | INFO     | onedl._local_store.datasets:pull:858 - Dataset broccoli-semantic-segmentation-part4-may23-train-0--cpu--75080:1 already exis

✓ Loaded pseudo dataset: 425 images (semantic segmentation - no confidence filtering)
Getting label map from initial dataset for consistency (not for training)...


2025-06-30 00:05:16.382 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-train to 0 with remote='vbti/interreg-broccoli'.
2025-06-30 00:05:16.382 | INFO     | onedl._local_store.datasets:load:375 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-train to 0.
2025-06-30 00:05:16.382 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset broccoli-semantic-segmentation-part4-may23-train:0 from remote='vbti/interreg-broccoli' with pull_policy=missing.
2025-06-30 00:05:16.385 | INFO     | onedl._local_store.datasets:pull:858 - Dataset broccoli-semantic-segmentation-part4-may23-train:0 already exists in local store. Skipping


Using label map from initial dataset: {0: 'background', 1: 'fungus', 2: 'rotten_area'}




✓ Using pseudo dataset as training set: 425 images


Generating label map from included label maps: 100%|██████████| 424/424 [00:02<00:00, 203.14it/s]
Generating label map from included label maps: 100%|██████████| 424/424 [00:02<00:00, 199.34it/s]
2025-06-30 00:05:23.933 | INFO     | onedl._local_store.datasets:save:179 - Saved dataset train-f0:1 to local store.
2025-06-30 00:05:23.933 | INFO     | onedl._local_store.datasets:resolve_latest_version:567 - Resolved latest version of dataset train-f0 to 1 local='vbti/interreg-broccoli'
2025-06-30 00:05:23.978 | INFO     | onedl._local_store.datasets:push:419 - Pushing 803 blobs to remote.
Files Uploaded:   0%|          | 0/804 [00:00<?, ?file/s]
Files Confirmed:   0%|          | 0/804 [00:00<?, ?file/s]
Getting upload links:   0%|          | 0/804 [00:00<?, ?file/s]
Getting upload links:  32%|███▏      | 256/804 [00:00<00:00,

✓ TRAINING DATASET REBUILD COMPLETE:
  - Pseudo-labels: 425 images
  - Total training dataset: 425 images
  - Saved as: train-f0
  - Mode: PSEUDO-ONLY
  - Label map: {0: 'background', 1: 'fungus', 2: 'rotten_area'}
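
The rebuild logged above can be summarized as combining up to three annotation sources into one training set. The sketch below is a simplified illustration with plain dictionaries keyed by image ID; `rebuild_training_set` and the precedence rules in it are assumptions made for the example, not the behavior of `pipeline.merge_pseudo_labels`.

```python
def rebuild_training_set(initial, pseudo, corrected=None, pseudo_only=False):
    """Combine annotation sources into one {image_id: annotation} mapping.

    Assumed precedence when the same image appears in several sources:
    manually corrected > initial (human-made) > raw pseudo-label.
    """
    merged = dict(pseudo)
    if not pseudo_only:
        merged.update(initial)  # human labels win over pseudo-labels
    if corrected:
        merged.update(corrected)  # reviewed labels replace pseudo versions
    return merged


# Made-up annotations; real ones would be masks or boxes, not strings.
initial = {"img_a": "human", "img_b": "human", "img_c": "human"}
pseudo = {"img_d": "pseudo", "img_e": "pseudo"}
corrected = {"img_e": "corrected"}

full = rebuild_training_set(initial, pseudo)
only_pseudo = rebuild_training_set(initial, pseudo, pseudo_only=True)
reviewed = rebuild_training_set(initial, pseudo, corrected=corrected)
print(len(full), len(only_pseudo), reviewed["img_e"])
```

In pseudo-only mode the training set size equals the pseudo-label batch size, which matches the 425-image total reported in the log above.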


### 2.5 Train Updated Model
Train a new model on the updated training set.

In [8]:
pipeline.train_model()

Starting model training...


2025-06-30 00:30:06.782 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset train-f0 to 1 with remote='vbti/interreg-broccoli'.
2025-06-30 00:30:07.838 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset train-f0 to 1 with remote='vbti/interreg-broccoli'.
2025-06-30 00:30:08.158 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-val to 0 with remote='vbti/interreg-broccoli'.
2025-06-30 00:30:09.047 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-val to 0 with remote='vbti/interreg-broccol

Training UPerNetConfig on dataset: train-f0
Configuration: 150 epochs, batch size 3
Backbone: CONVNEXT_V2_BASE
Input size: (640, 640)


2025-06-30 00:30:11.845 | INFO     | onedl.client.operations.clients._common:create_event_stream:79 - Subscribing to job events...
2025-06-30 00:30:11.846 | INFO     | onedl.client.operations.clients._common:create_event_stream:80 - Job sad-shape-0 in WAITING state
2025-06-30 00:30:13.042 | INFO     | onedl.client.operations.clients._common:create_event_stream:84 - Job sad-shape-0 in RUNNING state
2025-06-30 09:35:57.470 | INFO     | onedl.client.operations.clients._common:create_event_stream:89 - Unsubscribing from job events...


Training job submitted
Model UID: sad-shape-0


### 2.6 Evaluate Performance

In [9]:
pipeline.evaluate_model()

Evaluating sad-shape-0 on broccoli-semantic-segmentation-part4-may23-val


2025-06-30 09:37:01.342 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-val to 0 with remote='vbti/interreg-broccoli'.
2025-06-30 09:37:02.347 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset broccoli-semantic-segmentation-part4-may23-val to 0 with remote='vbti/interreg-broccoli'.
2025-06-30 09:37:03.833 | INFO     | onedl.client.operations.clients._common:create_event_stream:79 - Subscribing to job events...
2025-06-30 09:37:03.836 | INFO     | onedl.client.operations.clients._common:create_event_stream:80 - Job curious-shrike-0 in WAITING state
2025-06-30 09:37:05.074 | INFO     | onedl.client.operations.

Evaluation complete
Report URL: https://21e007818fa1dd0840eac0d6d59ba986.eu.r2.cloudflarestorage.com/onedl-data/vbti/interreg-broccoli/-/18bc97b02c605615fd1c0cf9ca076d46.html?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=bb17714b86b2e84a836c55404335cef8%2F20250630%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20250630T075258Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=c4d6a0b9846f76ce4954df4c979c40fae96ba96b3a034ff1ef2236fe06a4e5d9
Metrics: {"mAcc": 0.9889155406268831, "mIoU": 0.6343889516283391, "mDice": 0.7389399310626669, "mFscore": 0.7389399310626669}
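
For reference, per-class IoU is TP / (TP + FP + FN) and Dice is 2·TP / (2·TP + FP + FN). Dice is the same quantity as the F1 score, which is consistent with mDice and mFscore being identical above. The sketch below computes both from a confusion matrix and averages them over classes; the matrix values are made up, and how the evaluation backend aggregates its metrics (including mAcc) is not shown in this notebook, so the averaging here is an assumption.

```python
def per_class_metrics(confusion):
    """Compute IoU and Dice per class.

    confusion[i][j] = number of pixels with true class i predicted as class j.
    """
    n = len(confusion)
    results = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp  # pixels wrongly called c
        iou_den = tp + fp + fn
        dice_den = 2 * tp + fp + fn
        results.append({
            "iou": tp / iou_den if iou_den else float("nan"),
            "dice": 2 * tp / dice_den if dice_den else float("nan"),
        })
    return results


def mean_metric(results, key):
    """Unweighted mean of one metric over all classes."""
    return sum(r[key] for r in results) / len(results)


# Made-up 3-class confusion matrix (background, fungus, rotten_area).
confusion = [
    [90, 5, 5],
    [2, 6, 2],
    [1, 1, 8],
]

results = per_class_metrics(confusion)
m_iou = mean_metric(results, "iou")
m_dice = mean_metric(results, "dice")
print(round(m_iou, 4), round(m_dice, 4))
```

Because the mean is unweighted, small classes count as much as the background class, so a model can score well on pixel-level accuracy while its mean IoU stays noticeably lower.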




### 2.7 Status

In [None]:
pipeline.get_pipeline_status()

# Additional Runs
To run additional iterations, repeat Section 2 once the previous iteration has been logged. To create a new flow, update 'current_flow' in the Global Initializers cell, then run the notebook again from Section 1.
