# Pseudo-Labeling Flow

This notebook implements an automated pseudo-labeling pipeline that streamlines annotation for object detection and instance segmentation tasks. An initial model, trained on a small set of manually annotated data, generates labels for new images; these labels can then be refined and used to retrain progressively better models.

**A "flow" is a complete pseudo-labeling run with a specific configuration (model type, initial dataset size, correction strategy). "Iterations" are the individual training cycles within a flow, in which new data is added and the model is retrained.**

The features within this notebook include:
- **Automated Pipeline**: Complete workflow from data preparation to model training
- **Database Logging**: Metadata for every iteration is recorded in a database
- **CVAT Integration**: For viewing and adjusting annotations
- **Flexible Configuration**: Supports different model architectures and training settings
- **Status Monitoring**: Real-time pipeline status and progress tracking


## Imports

In [1]:
from pseudo_labeling import PseudoLabelingPipeline

## Global Initializers
Configure the pipeline with your project-specific settings:


In [2]:
pipeline = PseudoLabelingPipeline(
    project_name="daniel-osman---streamlining-annotation-bootstrapping/testing",
    main_dataset_name="full-data:0",  # input only
    initial_annotated_dataset_name="initial-annotations:0",
    validation_dataset="val:0",
    sample_size_per_iter=150,
    current_flow=0,
    min_confidence=0.5,
    local_path='/Users/daniel/Documents/2025 BEP - VBTI Data/testing',
    cvat_project_id=88,
    db_path="pseudo_labeling_metadata_test.db"
)

print("Pipeline initialized")



2025-06-13 15:09:03.686 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset full-data:0 from remote='daniel-osman---streamlining-annotation-bootstrapping/testing' with pull_policy=missing.
2025-06-13 15:09:03.691 | INFO     | onedl._local_store.datasets:pull:858 - Dataset full-data:0 already exists in local store. Skipping
2025-06-13 15:09:03.718 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset initial-annotations:0 from remote='daniel-osman---streamlining-annotation-bootstrapping/testing' with pull_policy=missing.
2025-06-13 15:09:03.721 | INFO     | onedl._local_store.datasets:pull:858 - Dataset initial-annotations:0 already exists in local store. Skipping


GLOBAL INITIALIZATIONS INITIALIZED
Project: daniel-osman---streamlining-annotation-bootstrapping/testing
Main dataset: full-data:0
Initial annotated dataset: initial-annotations:0
Sample size per iteration: 150
Selected flow: f0
Initial annotated dataset contains: 50 samples
Flow f0 already exists in database - ready to resume
Last completed iteration: 0


2025-06-13 15:09:04.051 | INFO     | onedl._local_store.datasets:resolve_latest_version:582 - Resolved latest version of dataset train-f0 to 2 with remote='daniel-osman---streamlining-annotation-bootstrapping/testing'.
2025-06-13 15:09:04.051 | INFO     | onedl._local_store.datasets:load:375 - Resolved latest version of dataset train-f0 to 2.
2025-06-13 15:09:04.052 | INFO     | onedl._local_store.datasets:pull:838 - Pulling dataset train-f0:2 from remote='daniel-osman---streamlining-annotation-bootstrapping/testing' with pull_policy=missing.
2025-06-13 15:09:04.054 | INFO     | onedl._local_store.datasets:pull:858 - Dataset train-f0:2 already exists in local store. Skipping


Training dataset train-f0 exists with 200 samples
Ready for iteration 1
Pipeline initialized


## Training Config
Set up your model training parameters interactively:


In [3]:
train_cfg = pipeline.setup_training_config()


=== TRAINING CONFIGURATION SETUP ===


VBox(children=(Dropdown(description='Model Type:', options=(('FasterRCNNConfig (Object Detection)', 'FasterRCN…

# (1) Initial Flow, Training, and Evaluation Setup

This step initiates the current flow and establishes the baseline model using your initial annotated dataset.
1. Loads the initial annotated dataset.
2. Creates the training set for the current flow.
3. Trains the first baseline model (iteration 0) for the specified flow.
4. Evaluates the model.
5. Logs metadata. The only variables generated during the pipeline run are 'predicted_dataset_name', 'model_uid', 'evaluation_uid', and 'evaluation_info'.

⚠️ ATTENTION: Skip this section if the current flow already exists and already has a baseline model.

### Local Initializer
Set up the variables and training set for the current flow.


In [4]:
pipeline.get_pipeline_status()


PIPELINE STATUS REPORT
Flow ID: f0
Current Iteration: 1
Training Dataset: train-f0
Current Model UID: None
Training Configuration: {'model_type': 'FasterRCNNConfig', 'task_type': 'object_detection', 'backbone': <FasterRCNNBackbone.REGNET_GF6_4: 'REGNET_GF6_4'>, 'epochs': 50, 'batch_size': 6}
Database Path: pseudo_labeling_metadata_test.db
Sample Size Per Iteration: 150
Minimum Confidence Threshold: 0.5


### 1.1 Train Initial Model
Train the baseline model on your initial annotated dataset.




In [None]:
pipeline.train_model()

### 1.2 Evaluate
Evaluate the initial model's performance on the validation dataset.

In [None]:
pipeline.evaluate_model()

### 1.3 Log
Save all metadata for the initial training iteration to the database.
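
As an illustration of what per-iteration logging can look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the helper names `log_iteration_row` and `last_completed_iteration`, and the use of SQLite itself are assumptions made for this example; the actual schema behind `db_path` is not shown in this notebook. Only the four field names come from the list above.

```python
import json
import sqlite3

# Hypothetical schema: the real database layout is not shown in this notebook.
SCHEMA = """
CREATE TABLE IF NOT EXISTS iterations (
    flow_id TEXT NOT NULL,
    iteration INTEGER NOT NULL,
    model_uid TEXT,
    evaluation_uid TEXT,
    predicted_dataset_name TEXT,
    evaluation_info TEXT,
    PRIMARY KEY (flow_id, iteration)
)
"""


def log_iteration_row(conn, flow_id, iteration, model_uid, evaluation_uid,
                      predicted_dataset_name=None, evaluation_info=None):
    """Insert or overwrite the metadata row for one (flow, iteration) pair."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO iterations VALUES (?, ?, ?, ?, ?, ?)",
        (flow_id, iteration, model_uid, evaluation_uid,
         predicted_dataset_name, json.dumps(evaluation_info or {})),
    )
    conn.commit()


def last_completed_iteration(conn, flow_id):
    """Return the highest logged iteration for a flow, or None if it has none."""
    conn.execute(SCHEMA)
    row = conn.execute(
        "SELECT MAX(iteration) FROM iterations WHERE flow_id = ?", (flow_id,)
    ).fetchone()
    return row[0]
```

Keying rows on `(flow_id, iteration)` is one way to support the "ready to resume" behaviour seen in the initialization output: a lookup of the highest logged iteration tells the pipeline where to continue.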


In [None]:
pipeline.log_iteration_0()
print("Initial model training and evaluation complete. Current status:")
pipeline.get_pipeline_status()

# (2) Pseudo-Labeling Iteration Workflow

This step executes a complete pseudo-labeling iteration cycle using the model from the previous iteration to generate labels on new data.
1. Sets up the next iteration with correction strategy (manual or automated).
2. Samples new unlabeled data from the full dataset.
3. Runs inference using the previous iteration's model to generate pseudo-labels.
4. Handles corrections based on strategy: exports to CVAT for manual corrections OR merges pseudo-labels directly.
5. Trains updated model on expanded dataset (original + new data).
6. Evaluates the updated model performance.
7. Logs iteration metadata to track progress and results.

**⚠️ ATTENTION: Set 'manual_corrections=True' for CVAT workflow with human review, or 'manual_corrections=False' for fully automated pseudo-labeling**
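
The seven steps above can be sketched as a single function that calls the same pipeline methods used in the cells of this section, in the same order. The wrapper `run_iteration` and the stand-in class `RecordingPipeline` are hypothetical and exist only for this illustration; in a real manual run, the CVAT step pauses for human review before the merge.

```python
def run_iteration(pipeline, manual_corrections=False):
    """Run one pseudo-labeling iteration by calling the notebook's steps in order."""
    pipeline.setup_next_iteration(manual_corrections)
    pipeline.sample_unseen_inputs()
    pipeline.run_inference()
    if manual_corrections:
        # In the notebook, a human reviews the labels in CVAT before merging.
        pipeline.manually_correct_cvat()
    pipeline.merge_pseudo_labels()
    pipeline.train_model()
    pipeline.evaluate_model()
    pipeline.log_iteration()


class RecordingPipeline:
    """Hypothetical stand-in that records method calls instead of doing any work."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Called only for attributes that do not exist, i.e. the pipeline methods.
        def method(*args, **kwargs):
            self.calls.append(name)
        return method
```

Passing a `RecordingPipeline()` to `run_iteration` and inspecting its `calls` list is a cheap way to check the step order without training anything.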


### Local Initializer
Configure the next iteration parameters

In [6]:
# Set manual_corrections=True for CVAT human review
# Set manual_corrections=False for fully automated pseudo-labeling
manual_corrections = False
pipeline.setup_next_iteration(manual_corrections)


----------------------------------------
FLOW INITIALIZATION COMPLETE
Resuming flow_id: f0
Current iteration: 1
Manual corrections: False
Sample size this iteration: 150
GT added this iteration: 0
Pseudo added this iteration: 150
Total GT images after this step: 50
Total pseudo-labeled images after this step: 150
Total expected training set size: 200
Train dataset name: train-f0
Pseudo input dataset name: pseudo-iter1-f0
Using initial annotations: initial-annotations:0
Inference model UID: mad-omega-0
----------------------------------------
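
The totals in this report follow from simple bookkeeping: 50 initial ground-truth images plus 150 pseudo-labeled images gives a training set of 200. The sketch below reproduces that arithmetic. The function `expected_sizes` is hypothetical, and it assumes that every sampled image is kept and that manually corrected samples count as ground truth; neither assumption is confirmed by this notebook.

```python
def expected_sizes(initial_gt, sample_size, iterations, manual_corrections=False):
    """Return (total_gt, total_pseudo, total_train) after a number of iterations.

    Assumption: with manual corrections, every sampled image counts as ground
    truth; without them, every sampled image counts as pseudo-labeled.
    """
    added = sample_size * iterations
    total_gt = initial_gt + (added if manual_corrections else 0)
    total_pseudo = 0 if manual_corrections else added
    return total_gt, total_pseudo, total_gt + total_pseudo
```

For the settings in this notebook, `expected_sizes(50, 150, 1)` returns `(50, 150, 200)`, matching the report above.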


### 2.1 Sample New Data
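
The idea behind this step is to draw a batch of images that the training set has not used yet. The sketch below shows one plausible way to do that with plain Python; the function `sample_unseen` and the use of string image IDs are assumptions for illustration, not the implementation behind `sample_unseen_inputs()`.

```python
import random


def sample_unseen(all_ids, seen_ids, sample_size, seed=0):
    """Pick up to `sample_size` IDs that have not been used for training yet."""
    # Sorting makes the candidate order, and therefore the sample, reproducible.
    unseen = sorted(set(all_ids) - set(seen_ids))
    k = min(sample_size, len(unseen))
    return random.Random(seed).sample(unseen, k)
```

Sampling without replacement from the set difference guarantees that no image enters the training set twice, and capping `k` avoids an error when fewer unseen images remain than `sample_size`.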

In [None]:
pipeline.sample_unseen_inputs()

### 2.2 Generate Predictions/Pseudo-Labels
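
The pipeline is configured with `min_confidence=0.5`. How that threshold is applied internally is not shown in this notebook; a common approach is to drop every predicted object whose score falls below it, as in this sketch. The function `filter_by_confidence` and the dictionary layout with a `"score"` key are hypothetical.

```python
def filter_by_confidence(predictions, min_confidence=0.5):
    """Keep only predictions whose score is at least `min_confidence`."""
    return [p for p in predictions if p["score"] >= min_confidence]
```

A higher threshold yields fewer but cleaner pseudo-labels; a lower one keeps more objects at the cost of more label noise.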


In [None]:
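# Option 1: run inference with the previous iteration's model
# pipeline.run_inference()
# Option 2: point the pipeline at an already existing predicted dataset
pipeline.set_predicted_dataset('demo-val')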

### 2.3 CVAT Export

In [19]:
if pipeline.manual_corrections_global:
    print("Manual corrections enabled - proceeding to CVAT export")
    pipeline.manually_correct_cvat()
    print("After completing corrections in CVAT, manually update the predicted dataset and run the merge cell below")
else:
    print("Manual corrections disabled - proceed to merging the datasets")


Manual corrections enabled - proceeding to CVAT export


KeyboardInterrupt: Interrupted by user

### 2.4 Merge Data
This cell merges the newly labeled dataset into the current training set. If **manual_corrections=True**, the corrected annotations are exported and merged.
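
Conceptually, merging adds the new samples to the existing training set without duplicating images. The sketch below models both sets as dictionaries keyed by image ID. The function `merge_samples` and the rule that existing entries take precedence are assumptions for illustration, not the behaviour of `merge_pseudo_labels()`.

```python
def merge_samples(train_set, new_samples):
    """Merge new samples into the training set; existing entries are kept as-is."""
    merged = dict(train_set)  # copy so the input is not modified
    for image_id, annotations in new_samples.items():
        # setdefault only inserts when the image is not already present.
        merged.setdefault(image_id, annotations)
    return merged
```

Letting existing entries win means ground-truth annotations are never overwritten by pseudo-labels for the same image.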


In [None]:
pipeline.merge_pseudo_labels()

### 2.5 Train Updated Model
Train a new model on the expanded training set

In [None]:
pipeline.train_model()

### 2.6 Evaluate Performance

In [None]:
pipeline.evaluate_model()

### 2.7 Log Iteration Results

In [None]:
pipeline.log_iteration()
print("Iteration Complete")
pipeline.get_pipeline_status()

# Additional Runs
To run additional iterations, repeat Section 2 after logging. To create a new flow, update `current_flow` in the Global Initializers cell, then run the notebook again from Section 1.
