PyLabFlow is a self-contained Python framework for managing computational research experiments. It is designed to work with various domains, from machine learning and data processing to simulation and numerical analysis. Built on the principles of flexibility, reproducibility, and data privacy, PyLabFlow allows researchers to define, run, track, and transfer entire custom workflows offline.
PyLabFlow is domain-agnostic, leveraging abstract `Component` and `WorkFlow` classes that users can customize to fit their needs. This makes it ideal for use with frameworks like PyTorch and TensorFlow, or with any other complex pipeline structure used in scientific computing.
- **Offline-First:** All experiment tracking and data management are handled locally using file systems and SQLite databases, ensuring complete data privacy and uninterrupted work, even without an internet connection.
- **100% Customization:** Built on abstract `Component` and `WorkFlow` classes, users can define their own logic for data loading, processing, simulation, and analysis.
- **Pipeline Tracking (PPLs):** Centralized management of experiment configurations, status (e.g., `init`, `running`, `frozen`, `cleaned`), and history.
- **Reproducibility Guarantee:** Each pipeline (PPL) is uniquely identified by the cryptographic hash of its entire configuration, preventing configuration drift and ensuring exact reproducibility (see the illustrative sketch after this list).
- **Seamless Transfer:** Easily archive experiments or transfer project setups (including configurations and output artifacts) between local directories, cloud storage, or high-performance computing (HPC) nodes without path reconfiguration.
- **Session Logging:** Automatic tracking of the session origin (e.g., Jupyter session, script filename) and execution time for enhanced auditability.
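To make the reproducibility guarantee concrete, here is a minimal, hypothetical sketch of how hashing a configuration yields a stable experiment identifier. This is not PyLabFlow's internal implementation; the function name `config_fingerprint` and the choice of SHA-256 are illustrative assumptions.

```python
import hashlib
import json

# Illustrative sketch only (not PyLabFlow's internal code): the same configuration
# always maps to the same hash, so any change to the config is detectable.
def config_fingerprint(config: dict) -> str:
    # Serialize with sorted keys so logically identical configs hash identically
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

cfg_a = {"algorithm": {"loc": "my_workflows.MyComponent", "args": {"param_b": 5}}}
cfg_b = {"algorithm": {"loc": "my_workflows.MyComponent", "args": {"param_b": 6}}}

print(config_fingerprint(cfg_a))                               # stable across runs and machines
print(config_fingerprint(cfg_a) == config_fingerprint(cfg_b))  # False: the configurations differ
```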
You can install PyLabFlow using pip:
```bash
pip install PyLabFlow
```
Alternatively, clone the repository for development purposes:
```bash
git clone https://github.com/ExperQuick/PyLabFlow.git
cd PyLabFlow
pip install -e .
```
PyLabFlow organizes work around Labs (your project environment) and Pipelines (PPLs), which are customizable experiment runs.
The Lab manages file paths and databases; it only needs to be set up once per project.
```python
import os
from plf.lab import create_project, lab_setup

# Define mandatory project settings
SETTINGS = {
    "project_name": "General_Research_Lab",
    "project_dir": "/path/to/my/research_projects",      # Parent directory for the project
    "component_dir": "/path/to/my/custom_components",    # Directory for custom components and workflows
}

# Create the project structure, databases, and settings file
settings_path = create_project(SETTINGS)
print(f"Project structure and settings created at: {settings_path}")

# Set up the lab environment for the current Python session
lab_setup(settings_path)
```

Since PyLabFlow is domain-independent, you define your experiment logic by subclassing `WorkFlow` and `Component`.
```python
import os
from typing import Dict, Any

from plf.utils import WorkFlow, Component


# A generic computational component
class MyComputationalComponent(Component):
    def _setup(self, args: Dict[str, Any]):
        print(f"Setting up component with args: {args}")
        self.data = args.get("initial_value", 0)


# Define the workflow that combines components and executes the run logic
class GenericDataWorkflow(WorkFlow):
    # Initialize the pipeline run configuration
    def new(self, args: Dict[str, Any]):
        # Ensure all required configuration keys are provided
        if not self.template.issubset(set(args.keys())):
            missing = ", ".join(self.template - set(args.keys()))
            raise ValueError(f"the args should include: {missing}")
        # Keep the validated configuration for use in prepare()
        self.args = args

    # Perform setup (e.g., loading large datasets/models into memory)
    def prepare(self):
        self.data_source = self.load_component(**self.args['data_source'])
        self.algorithm = self.load_component(**self.args['algorithm'])
        print("Preparing workflow: loading external data or setting up environment...")
        return True

    # Main execution logic (e.g., training loop, simulation run)
    def run(self):
        print(f"Running PPL: {self.P.pplid}")
        result = self.data_source.data + 10
        print(f"Final result: {result}")

    # Define standardized paths for saving artifacts
    def get_path(self, of: str, pplid: str, args: Dict) -> str:
        if of == 'results':
            return os.path.join(self.P.settings['data_path'], 'Results', f'{pplid}_output.txt')
        raise NotImplementedError(f"Path for artifact type '{of}' is not defined in GenericDataWorkflow.")

    # Clean up temporary files or output artifacts
    def clean(self):
        print(f"Cleaning artifacts for {self.P.pplid}...")
        # Add logic to delete files here

    # Return execution status or key metrics
    def status(self):
        return {"last_result": 100, "status_detail": "Completed successfully"}
```

Once your components are defined, you can create and execute the pipeline using the `PipeLine` class.
```python
from plf.experiment import PipeLine

# Define the full configuration with fully qualified paths to components
pipeline_config = {
    "workflow": {
        "loc": "my_workflows.GenericDataWorkflow",
        "args": {}
    },
    "args": {
        "data_source": {"loc": "my_workflows.MyComputationalComponent", "args": {"initial_value": 42}},
        "algorithm": {"loc": "my_workflows.MyComputationalComponent", "args": {"param_b": 5}},
    }
}

# Create a new pipeline. The configuration is hashed and logged.
P = PipeLine()
P.new(pplid="ppl_data_run_001", args=pipeline_config)

# Prepare the environment
P.prepare()

# Run the workflow
P.run()
```

The `plf.experiment` module provides powerful tools for managing your PPL database:
- `get_ppls()`: List all active pipeline IDs in the current Lab.
- `get_ppl_status()`: Returns a DataFrame summarizing the status, last run, and key metrics for all PPLs.
- `filter_ppls(query)`: Filters PPLs based on configuration arguments (e.g., `filter_ppls("data_source=my_workflows.MyComponent")`).
- `archive_ppl(ppls)`: Archives a pipeline, moving its configurations and artifacts to an archive folder for safe storage.
- `archive_ppl(ppls, reverse=True)`: Unarchives a pipeline and returns it to the active environment.
- `delete_ppl(ppls)`: Permanently deletes a pipeline from the archive.
- `stop_running()`: Gracefully stops a currently running pipeline after its current iteration completes.
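A rough usage sketch of these helpers, assuming they are imported directly from `plf.experiment` and that `archive_ppl` accepts a list of PPL IDs (exact signatures may differ):

```python
from plf.experiment import get_ppls, get_ppl_status, filter_ppls, archive_ppl

# List every active pipeline ID in the current Lab
print(get_ppls())

# Summarize status, last run, and key metrics as a DataFrame
status_df = get_ppl_status()
print(status_df.head())

# Find PPLs whose configuration uses a particular component
matches = filter_ppls("data_source=my_workflows.MyComputationalComponent")
print(matches)

# Archive a finished pipeline, then restore it later if needed
archive_ppl(["ppl_data_run_001"])
archive_ppl(["ppl_data_run_001"], reverse=True)
```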
This project is licensed under the Apache License 2.0. © 2025 BBEK Anand