# Pseudo-Labeling Flow

This notebook implements an automated pseudo-labeling pipeline that streamlines annotation for object detection and instance segmentation tasks. An initial model, trained on a small set of manually annotated data, generates labels for new images; these labels can be refined and then used to retrain progressively better models.

**A "flow" is a complete pseudo-labeling run with a specific configuration (model type, initial dataset size, correction strategy), while "iterations" are the individual training cycles within a flow, in which new data is added and the model is retrained.**
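The relationship between flows and iterations can be pictured with a small data model. This is only an illustrative sketch: the `Flow` and `Iteration` classes, their fields, and the `"PENDING"` status are hypothetical and are not part of `PseudoLabelingPipeline`.

```python
from dataclasses import dataclass, field


@dataclass
class Iteration:
    """One training cycle inside a flow (hypothetical structure)."""
    index: int
    status: str = "PENDING"  # illustrative placeholder status


@dataclass
class Flow:
    """One complete pseudo-labeling run with fixed settings (hypothetical structure)."""
    flow_id: str
    model_type: str
    initial_dataset_size: int
    manual_corrections: bool
    iterations: list = field(default_factory=list)

    def start_iteration(self) -> Iteration:
        """Append the next iteration; index 0 is the baseline model."""
        iteration = Iteration(index=len(self.iterations))
        self.iterations.append(iteration)
        return iteration


flow = Flow("f0", model_type="FasterRCNNConfig", initial_dataset_size=50, manual_corrections=False)
baseline = flow.start_iteration()     # iteration 0: initial annotations only
first_round = flow.start_iteration()  # iteration 1: first batch of new data added
```

Changing any of the flow-level settings would, under this picture, mean starting a new flow rather than a new iteration.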

The features within this notebook include:
- **Automated Pipeline**: Complete workflow from data preparation to model training
- **Database Logging**: Database tracking for all iterations
- **CVAT Integration**: For viewing and adjusting annotations
- **Flexible Configuration**: Supports different model architectures and training settings
- **Status Monitoring**: Real-time pipeline status and progress tracking



**Note:** Automatic export to CVAT has only been tested, and is only known to work, for instance segmentation and object detection.

## Imports

In [4]:
import onedl.client

from pseudo_labeling import PseudoLabelingPipeline

## Global Initializers
Configure the pipeline with your project-specific settings:


In [7]:
pipeline = PseudoLabelingPipeline(
    project_name="daniel-osman---streamlining-annotation-bootstrapping/testing",
    main_dataset_name="full-data:0", #input only
    initial_annotated_dataset_name="initial-annotations:0",
    validation_dataset="val:0",
    sample_size_per_iter=150,
    current_flow = 0,
    min_confidence=0.1,
    local_path='/Users/daniel/Documents/2025 BEP - VBTI Data/testing',
    cvat_project_id=88,
    db_path="pseudo_labeling_metadata.db"
)

print("Pipeline initialized")


2025-07-10 08:55:55.287 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset full-data:0 from remote='daniel-osman---streamlining-annotation-bootstrapping/testing' with pull_policy=missing.
2025-07-10 08:55:55.293 | INFO     | onedl._local_store.datasets:pull:858 - Dataset full-data:0 already exists in local store. Skipping
2025-07-10 08:55:55.326 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset initial-annotations:0 from remote='daniel-osman---streamlining-annotation-bootstrapping/testing' with pull_policy=missing.
2025-07-10 08:55:55.329 | INFO     | onedl._local_store.datasets:pull:858 - Dataset initial-annotations:0 already exists in local store. Skipping


Resuming incomplete iteration 1 (status: MERGE_COMPLETE)
GLOBAL INITIALIZATIONS INITIALIZED
Project: daniel-osman---streamlining-annotation-bootstrapping/testing
Main dataset: full-data:0
Initial annotated dataset: initial-annotations:0
Sample size per iteration: 150
Selected flow: f0
Initial annotated dataset contains: 50 samples
Flow f0 already exists in database - ready to resume
Last completed iteration: 0


2025-07-10 08:55:55.843 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset train-f0 to 5 with remote='daniel-osman---streamlining-annotation-bootstrapping/testing'.
2025-07-10 08:55:55.844 | INFO     | onedl._local_store.datasets:load:375 - Resolved latest version of dataset train-f0 to 5.
2025-07-10 08:55:55.845 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset train-f0:5 from remote='daniel-osman---streamlining-annotation-bootstrapping/testing' with pull_policy=missing.
2025-07-10 08:55:55.849 | INFO     | onedl._local_store.datasets:pull:858 - Dataset train-f0:5 already exists in local store. Skipping


Training dataset train-f0 exists with 50 samples
Ready for iteration 1

🔄 Attempting auto-recovery...
🔄 Recovering state for f0 iteration 1...
✓ Recovered predicted_dataset_name: pseudo-f0-0--cpu--266a4:0
✓ Recovered manual_corrections_mode: False
✓ Recovery complete - ready to resume
Pipeline initialized
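The resume behaviour visible in the output above (iteration 0 is completed, iteration 1 stopped at `MERGE_COMPLETE`) can be sketched roughly as follows. This is an assumption about the logic, not the pipeline's actual code: `find_resume_point` and the `history` mapping are hypothetical, and only the status names `COMPLETED` and `MERGE_COMPLETE` are taken from the notebook output.

```python
def find_resume_point(history):
    """Return (iteration, status) describing where to continue.

    `history` maps iteration number -> last recorded status.
    A status of None in the result means "start this iteration fresh".
    """
    if not history:
        return 0, None               # nothing logged yet: start with the baseline
    latest = max(history)
    status = history[latest]
    if status == "COMPLETED":
        return latest + 1, None      # last iteration finished: begin the next one
    return latest, status            # unfinished iteration: resume from its status


history = {0: "COMPLETED", 1: "MERGE_COMPLETE"}
resume_iteration, resume_status = find_resume_point(history)
print(f"Resuming incomplete iteration {resume_iteration} (status: {resume_status})")
```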


## Training Config
Set up your model training parameters, either interactively with a widget or directly with a dictionary:


In [12]:
# Option 1: Interactive widget setup (uncomment to use)
# train_cfg = pipeline.setup_training_config()


# Option 2: Direct dictionary configuration (recommended for specific config)
pipeline.train_cfg = {
    'model_type': 'FasterRCNNConfig',
    'task_type': 'object_detection',
    'backbone': 'RESNET_50',
    'epochs': 1,
    'batch_size': 3,
}

print("Training configuration set:")
print(pipeline.train_cfg)


Training configuration set:
{'model_type': 'FasterRCNNConfig', 'task_type': 'object_detection', 'backbone': 'RESNET_50', 'epochs': 1, 'batch_size': 3}


# (1) Initial Flow, Training, and Evaluation Setup

This step initiates the current flow and establishes the baseline model using your initial annotated dataset.
1. Loads the initial annotated dataset.
2. Creates the training set for the current flow.
3. Trains the first baseline model (iteration 0) for the specified flow.
4. Evaluates the model.
5. Logs metadata. The only variables generated while the pipeline runs are 'predicted_dataset_name', 'model_uid', 'evaluation_uid', and 'evaluation_info'.

⚠️ ATTENTION: Skip this section if your current flow already exists and you already have a baseline model.

In [None]:
pipeline.get_pipeline_status()

### 1.1 Train Initial Model and Evaluate on Validation Set
Train the baseline model on your initial annotated dataset.
Evaluate the initial model performance on the validation dataset.




To use an existing model instead of training one, run <span style="color:#d73a49; font-family:monospace;">pipeline.log_iteration_0_external_model("smoky-shepherd-0")</span> and skip to Section (2). Otherwise, run the cell below:


In [None]:
pipeline.train_model()
pipeline.evaluate_model()

### 1.2 Log
Save all metadata for the initial training iteration to the database.


In [None]:
pipeline.log_iteration_0()
print("Initial model training and evaluation complete. Current status:")
pipeline.get_pipeline_status()
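To give a feel for what logging an iteration might involve, here is a minimal sketch using Python's built-in `sqlite3`. The table layout, the `log_iteration` helper, and the use of SQLite at all are assumptions for illustration; only the four metadata field names come from this notebook, and `example-model-0` is a placeholder rather than a real model UID.

```python
import json
import sqlite3

# In-memory database for illustration; the notebook itself points at a file via db_path.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS iterations (
        flow_id TEXT NOT NULL,
        iteration INTEGER NOT NULL,
        status TEXT NOT NULL,
        predicted_dataset_name TEXT,
        model_uid TEXT,
        evaluation_uid TEXT,
        evaluation_info TEXT,
        PRIMARY KEY (flow_id, iteration)
    )
    """
)


def log_iteration(conn, flow_id, iteration, status, **metadata):
    """Insert or overwrite the row for one iteration of one flow (hypothetical helper)."""
    conn.execute(
        "INSERT OR REPLACE INTO iterations VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            flow_id,
            iteration,
            status,
            metadata.get("predicted_dataset_name"),
            metadata.get("model_uid"),
            metadata.get("evaluation_uid"),
            json.dumps(metadata.get("evaluation_info")),  # stored as JSON text
        ),
    )
    conn.commit()


# Placeholder values only; no real results are recorded here.
log_iteration(conn, "f0", 0, "COMPLETED", model_uid="example-model-0")
row = conn.execute(
    "SELECT status, model_uid FROM iterations WHERE flow_id = ? AND iteration = ?",
    ("f0", 0),
).fetchone()
print(row)
```

Keying rows on `(flow_id, iteration)` is one way to let an interrupted run overwrite its own partial record when it resumes.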

# (2) Pseudo-Labeling Iteration Workflow

This step executes a complete pseudo-labeling iteration cycle using the model from the previous iteration to generate labels on new data.
1. Sets up the next iteration with correction strategy (manual or automated).
2. Samples new unlabeled data from the full dataset.
3. Runs inference using the previous iteration's model to generate pseudo-labels.
4. Handles corrections based on strategy: exports to CVAT for manual corrections OR merges pseudo-labels directly.
5. Trains updated model on expanded dataset (original + new data).
6. Evaluates the updated model performance.
7. Logs iteration metadata to track progress and results.

**⚠️ ATTENTION: Set 'manual_corrections=True' for the CVAT workflow with human review, or 'manual_corrections=False' for fully automated pseudo-labeling.**


In [8]:
pipeline.get_pipeline_status()


PIPELINE STATUS REPORT
Flow ID: f0
Current Iteration: 1
Current Status: MERGE_COMPLETE
Training Dataset: train-f0
Current Model UID: None
Training Configuration: None
Database Path: pseudo_labeling_metadata.db
Sample Size Per Iteration: 150
Minimum Confidence Threshold: 0.1

RECENT ITERATIONS:
  Iteration 1: MERGE_COMPLETE
  Iteration 0: COMPLETED (completed: 2025-07-09 12:08:24)


### Local Initializer
Configure the parameters for the next iteration.

In [9]:
# Set manual_corrections=True for CVAT human review
# Set manual_corrections=False for fully automated pseudo-labeling
manual_corrections = False
pipeline.setup_next_iteration(manual_corrections)


Current iteration 1 status: MERGE_COMPLETE
Continuing with current iteration...
----------------------------------------
ITERATION INITIALIZED - PERSISTENT ARCHITECTURE
Flow ID: f0
Current iteration: 1
Manual corrections: False
Sample size this iteration: 150
GT added this iteration: 0
Pseudo added this iteration: 150
Total GT images after this step: 50
Total pseudo-labeled images after this step: 150
Total expected training set size: 200
Train dataset name: train-f0
Persistent pseudo dataset: pseudo-f0
Manual corrections dataset: manual-corrections-f0
Pseudo input dataset: pseudo-f0
Initial annotations: initial-annotations:0
Inference model UID: teal-ellipsis-0
----------------------------------------


### 2.1 Sample New Data

In [None]:
pipeline.sample_unseen_inputs()
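Conceptually, this step picks images from the full dataset that no earlier iteration has used. The sketch below assumes uniform random sampling without replacement; the actual strategy inside `sample_unseen_inputs()` is not shown in this notebook, and `sample_unseen`, the image IDs, and the dataset size of 1000 are made up for illustration.

```python
import random


def sample_unseen(all_ids, seen_ids, sample_size, seed=None):
    """Randomly pick up to `sample_size` IDs that are not in `seen_ids`."""
    unseen = sorted(set(all_ids) - set(seen_ids))  # sorted for reproducibility
    rng = random.Random(seed)
    k = min(sample_size, len(unseen))              # never ask for more than is left
    return rng.sample(unseen, k)


all_ids = [f"img_{i:04d}" for i in range(1000)]    # hypothetical full dataset
seen_ids = all_ids[:50]                            # e.g. the initial annotated images
batch = sample_unseen(all_ids, seen_ids, sample_size=150, seed=0)
print(len(batch))
```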


### 2.2 Generate Predictions/Pseudo-Labels


In [None]:
# pipeline.set_inference_model_uid('')
# pipeline.set_predicted_dataset('broccoli-semantic-segmentation-part4-may23-train-0--cpu--75080')

pipeline.run_inference()
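The `min_confidence` setting from the Global Initializers suggests that low-scoring predictions are dropped before they become pseudo-labels. A minimal sketch of such a filter is shown below; whether the pipeline applies the threshold inclusively, and at which stage, is an assumption. The prediction dictionaries and their scores are invented for illustration, using class names from this project's label map.

```python
def filter_by_confidence(predictions, min_confidence):
    """Keep only predictions whose score is at least `min_confidence`."""
    return [p for p in predictions if p["score"] >= min_confidence]


# Invented example predictions for a single image.
predictions = [
    {"label": "Stone", "score": 0.92, "bbox": [10, 20, 50, 60]},
    {"label": "Tip", "score": 0.05, "bbox": [70, 15, 95, 40]},
    {"label": "Defect", "score": 0.31, "bbox": [30, 80, 64, 120]},
]

kept = filter_by_confidence(predictions, min_confidence=0.1)
print([p["label"] for p in kept])
```

A low threshold such as 0.1 keeps most predictions, which favours recall and leaves more to clean up during manual correction; a higher threshold trades that for fewer but more reliable pseudo-labels.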

### 2.3 CVAT Export
Run this cell even if 'manual_corrections' is False; in that case it only prints a message and nothing is exported.

In [None]:
if pipeline.manual_corrections_global:
    print("Manual corrections enabled - proceeding to CVAT export")
    pipeline.manually_correct_cvat()
    print("After completing corrections in CVAT, manually update the predicted dataset and run the merge cell below")
else:
    print("No Manual Correction, Proceed to merging the datasets")


### 2.4 Merge Data
This cell merges the newly labeled dataset into the current training set. If **manual_corrections = True**, the corrected annotations are exported and merged.


In [10]:
pipeline.merge_pseudo_labels()

2025-07-10 08:56:55.188 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset initial-annotations:0 from remote='daniel-osman---streamlining-annotation-bootstrapping/testing' with pull_policy=missing.
2025-07-10 08:56:55.191 | INFO     | onedl._local_store.datasets:pull:858 - Dataset initial-annotations:0 already exists in local store. Skipping


Starting simplified merge process...

=== AUTO PSEUDO-LABELING MODE ===
✓ Pseudo dataset already updated after inference

=== REBUILDING TRAINING DATASET (SIMPLIFIED) ===
✓ Started with initial dataset: 50 images


2025-07-10 08:56:55.647 | INFO     | onedl._local_store.datasets:resolve_latest_version:576 - There is no remote version. Resolved latest version of dataset manual-corrections-f0 to 0 local='daniel-osman---streamlining-annotation-bootstrapping/testing'
2025-07-10 08:56:55.648 | INFO     | onedl._local_store.datasets:load:375 - Resolved latest version of dataset manual-corrections-f0 to 0.
2025-07-10 08:56:55.648 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset manual-corrections-f0:0 from remote='daniel-osman---streamlining-annotation-bootstrapping/testing' with pull_policy=missing.
2025-07-10 08:56:55.649 | INFO     | onedl._local_store.datasets:pull:849 - Pulling dataset manual-corrections-f0:0 from remote. pull_policy=<PullPolicy.missing: 'missing'> and dataset no

✓ No manual corrections found for this flow: This resource cannot be found. GET https://api.onedl.ai/v2/storage/contexts/daniel-osman---streamlining-annotation-bootstrapping/testing/-/datasets/manual-corrections-f0:0/info
404 Not Found - {'detail': 'Dataset manual-corrections-f0:0 was not found in project daniel-osman---streamlining-annotation-bootstrapping/testing.'}
Received Body b'{"detail":"Dataset manual-corrections-f0:0 was not found in project daniel-osman---streamlining-annotation-bootstrapping/testing."}'


2025-07-10 08:56:56.200 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset pseudo-f0 to 0 with remote='daniel-osman---streamlining-annotation-bootstrapping/testing'.
2025-07-10 08:56:56.200 | INFO     | onedl._local_store.datasets:load:375 - Resolved latest version of dataset pseudo-f0 to 0.
2025-07-10 08:56:56.201 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset pseudo-f0:0 from remote='daniel-osman---streamlining-annotation-bootstrapping/testing' with pull_policy=missing.
2025-07-10 08:56:56.203 | INFO     | onedl._local_store.datasets:pull:858 - Dataset pseudo-f0:0 already exists in local store. Skipping
2025-07-10 08:56:56.210 | INFO     | onedl._local_store.datasets:pull

✓ Added pseudo dataset: 150 images, total: 200


Generating label map from unique labels: 100%|██████████| 200/200 [00:00<00:00, 15570.21it/s]
2025-07-10 08:56:56.255 | INFO     | onedl.datasets.columns.base_column:_generate_label_map_from_unique_labels:457 - Generated label map: {0: 'AboveGround', 1: 'Defect', 2: 'Overgrown', 3: 'Stone', 4: 'Tip'}
2025-07-10 08:56:56.266 | INFO     | onedl._local_store.datasets:save:179 - Saved dataset train-f0:6 to local store.
2025-07-10 08:56:56.267 | INFO     | onedl._local_store.datasets:resolve_latest_version:567 - Resolved latest version of dataset train-f0 to 6 local='daniel-osman---streamlining-annotation-bootstrapping/testing'
2025-07-10 08:56:56.316 | INFO     | onedl._local_store.datasets:push:419 - Pushing 250 blobs to remote.
Files Uploaded:   0%|          | 0/2


✓ TRAINING DATASET REBUILT:
  - Initial GT: 50 images
  - Pseudo labels: 150 images
  - Total: 200 images
  - Saved as: train-f0
  - Label map: {0: 'AboveGround', 1: 'Defect', 2: 'Overgrown', 3: 'Stone', 4: 'Tip'}
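The rebuild reported above (50 initial ground-truth images plus 150 pseudo-labeled images, 200 in total) can be pictured as a merge of three sources. The sketch below is a simplification under stated assumptions: `rebuild_training_set` is a hypothetical helper, datasets are modeled as plain dictionaries keyed by image ID, and the precedence order (ground truth over manual corrections over pseudo-labels) is a guess rather than documented behaviour.

```python
def rebuild_training_set(initial_gt, manual_corrections, pseudo_labels):
    """Merge annotation sources; later updates win on duplicate image IDs."""
    merged = dict(pseudo_labels)       # lowest priority: raw model predictions
    merged.update(manual_corrections)  # human-corrected labels override pseudo-labels
    merged.update(initial_gt)          # original ground truth always wins
    return merged


# Stand-in datasets with the sizes reported in the output above.
initial_gt = {f"gt_{i}": "human" for i in range(50)}
pseudo_labels = {f"new_{i}": "pseudo" for i in range(150)}
manual_corrections = {}                # none exist yet for this flow

train = rebuild_training_set(initial_gt, manual_corrections, pseudo_labels)
print(len(train))  # 200
```

Rebuilding from the sources each time, rather than appending to the previous training set, means a corrected label replaces its pseudo-labeled version instead of appearing twice.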


### 2.5 Train Updated Model
Train a new model on the expanded training set

In [19]:
pipeline.train_model()

✓ Training already completed. Recovered model UID: insulated-aggregate-0


### 2.6 Evaluate Performance

In [None]:
pipeline.evaluate_model()

### 2.7 Status

In [None]:
pipeline.get_pipeline_status()

# Additional Runs
To run additional iterations, repeat Section (2) after logging. To create a new flow, update 'current_flow' in the Global Initializers cell, then run the notebook again from Section (1).
