Merged
19 commits
3f660ec
Creating an extensions directory in SYNPY to share out the first cura…
BryanFauble Oct 27, 2025
22c2f45
Correct MD formatting
BryanFauble Oct 27, 2025
07532e0
Refactor type hints to use TypeVar for DataFrame and update import st…
BryanFauble Oct 27, 2025
78d3d4b
removed unused import
andrewelamb Oct 28, 2025
030e903
fixes to create_json_schema_entity_view
andrewelamb Oct 28, 2025
350e0b5
create_file_based_metadata_task
andrewelamb Oct 28, 2025
aa08a3a
fix assert statement
andrewelamb Oct 28, 2025
9eafdef
remove print statements
andrewelamb Oct 28, 2025
c3ae215
remove print statements
andrewelamb Oct 28, 2025
16585d2
Extract convert and create json schema functions from schematic
BryanFauble Oct 30, 2025
baa0190
rename schema generation file, patch some returns, and logging. Updat…
BryanFauble Oct 30, 2025
77213b7
Add unit tests for json schema creation
BryanFauble Oct 30, 2025
b5be02a
Include curator dependencies in github run
BryanFauble Oct 30, 2025
0f02f03
Add rdflib dependency to curator requirements
BryanFauble Oct 30, 2025
9ec094f
add mock for isinstance in TestFileBasedHelperFunctions
BryanFauble Oct 30, 2025
94e2708
Clarify comment for data coordination center in metadata curation guide
BryanFauble Oct 30, 2025
543b597
Merge pull request #1265 from Sage-Bionetworks/synpy-1668-curator-ext…
andrewelamb Oct 31, 2025
7018190
Merge branch 'synpy-1668-curator-extension' into synpy-1672-extracted…
BryanFauble Oct 31, 2025
cefcbaa
Merge branch 'develop' into synpy-1672-extracted-schematic-code
BryanFauble Oct 31, 2025
4 changes: 2 additions & 2 deletions .github/workflows/build.yml
@@ -84,15 +84,15 @@ jobs:
path: |
${{ steps.get-dependencies.outputs.site_packages_loc }}
${{ steps.get-dependencies.outputs.site_bin_dir }}
key: ${{ runner.os }}-${{ matrix.python }}-build-${{ env.cache-name }}-${{ hashFiles('setup.py') }}-v27
key: ${{ runner.os }}-${{ matrix.python }}-build-${{ env.cache-name }}-${{ hashFiles('setup.py') }}-v28

- name: Install py-dependencies
if: steps.cache-dependencies.outputs.cache-hit != 'true'
shell: bash
run: |
python -m pip install --upgrade pip

pip install -e ".[boto3,pandas,pysftp,tests]"
pip install -e ".[boto3,pandas,pysftp,tests,curator]"

# ensure that numpy c extensions are installed on windows
# https://stackoverflow.com/a/59346525
2 changes: 1 addition & 1 deletion Pipfile
@@ -10,4 +10,4 @@ synapseclient = {file = ".", path = "."}
python_version = "3.12.6"

[dev-packages]
synapseclient = {file = ".", editable = true, path = ".", extras = ["dev", "tests", "pandas", "pysftp", "boto3", "docs"]}
synapseclient = {file = ".", editable = true, path = ".", extras = ["dev", "tests", "pandas", "pysftp", "boto3", "docs", "curator"]}
5 changes: 5 additions & 0 deletions setup.cfg
@@ -107,6 +107,11 @@ pandas =

curator =
%(pandas)s
pandarallel>=1.6.4
inflection>=0.5.1
networkx>=2.2.8
dataclasses-json>=0.6.1
rdflib>=6.0.0

pysftp =
pysftp>=0.2.8,<0.3
3 changes: 3 additions & 0 deletions synapseclient/extensions/curator/__init__.py
@@ -6,10 +6,13 @@

from .file_based_metadata_task import create_file_based_metadata_task
from .record_based_metadata_task import create_record_based_metadata_task
from .schema_generation import generate_jsonld, generate_jsonschema
from .schema_registry import query_schema_registry

__all__ = [
"create_file_based_metadata_task",
"create_record_based_metadata_task",
"query_schema_registry",
"generate_jsonld",
"generate_jsonschema",
]
62 changes: 59 additions & 3 deletions synapseclient/extensions/curator/readme.md
@@ -12,24 +12,30 @@ The curator extension is designed around three core principles:

## Module Structure

The curator extension consists of three focused modules:
The curator extension consists of four focused modules:

```
synapseclient/extensions/curator/
├── __init__.py # Clean public API surface
├── file_based_metadata_task.py # File-annotation workflows
├── record_based_metadata_task.py # Structured record workflows
└── schema_registry.py # Schema discovery and validation
├── schema_registry.py # Schema discovery and validation
└── schema_generation.py # Data model and JSON Schema generation
```

## Public API Design

The module exposes three main functions that follow consistent design patterns:
The module exposes five main functions that follow consistent design patterns:

**Metadata Curation Workflows:**
- **`create_file_based_metadata_task()`** - Configurable file-annotation curation workflows
- **`create_record_based_metadata_task()`** - Configurable structured-record curation workflows
- **`query_schema_registry()`** - Flexible schema discovery with custom filtering

**Data Model and Schema Generation:**
- **`generate_jsonld()`** - Convert CSV data models to JSON-LD format with validation
- **`generate_jsonschema()`** - Generate JSON Schema validation files from data models

## Configuration and Flexibility

### Extensive Parameter Control
@@ -167,6 +173,56 @@ The module provides composable building blocks that can be combined to create so
- Version filtering (latest-only or all versions)
- Dynamic filter construction using keyword arguments

### Data Model and Schema Generation

**Purpose**: Create and validate data models, then generate JSON Schema validation files.

The schema generation workflow consists of two key functions that work together:

#### JSON-LD Data Model Generation (`generate_jsonld`)

Converts CSV-based data model specifications into standardized JSON-LD format with comprehensive validation:

**Input Requirements**:
- CSV file with attributes, validation rules, dependencies, and valid values
- Columns defining display names, descriptions, requirements, and relationships

**Validation Performed**:
- Required field presence checks
- Dependency cycle detection (ensures valid DAG structure)
- Blacklisted character detection in display names
- Reserved name conflict checking
- Graph structure validation
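
The dependency cycle check above can be illustrated with a minimal sketch. It is not the extension's implementation, which is not shown here: it uses only the standard-library `graphlib` module, and the function name `find_dependency_cycle` and the attribute names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Optional


def find_dependency_cycle(dependencies: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of attribute names, or None if the graph is a DAG."""
    try:
        # Consuming static_order() raises CycleError if the graph contains a cycle.
        list(TopologicalSorter(dependencies).static_order())
    except CycleError as error:
        # args[1] holds the nodes of the cycle; its first and last entries are the same node.
        return list(error.args[1])
    return None


# Hypothetical data models: each attribute maps to the attributes it depends on.
valid_model = {"Biospecimen": ["Patient"], "Patient": []}
cyclic_model = {"Biospecimen": ["Patient"], "Patient": ["Biospecimen"]}

print(find_dependency_cycle(valid_model))
print(find_dependency_cycle(cyclic_model))
```

A model that passes this kind of check can be ordered so that every attribute appears after the attributes it depends on, which is what a valid DAG structure guarantees.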

**Configuration Levers**:
- Label format selection (`class_label` vs `display_label`)
- Custom output path or automatic naming
- Comprehensive error and warning logging
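
To make the label format choice concrete, here is a small sketch of deriving an identifier-style label from a display name. The conversion rule and the helper `to_class_label` are assumptions for illustration; the rules `generate_jsonld` actually applies are not described in this document.

```python
import re


def to_class_label(display_name: str) -> str:
    """Build an UpperCamelCase label from a display name (assumed convention)."""
    # Split on anything that is not a letter or digit, then capitalize each word's first letter.
    words = re.split(r"[^0-9A-Za-z]+", display_name)
    return "".join(word[:1].upper() + word[1:] for word in words if word)


# Hypothetical display names taken from an imagined CSV data model.
for name in ["Patient ID", "year of birth"]:
    print(f"{name!r} -> {to_class_label(name)!r}")
```

Under this assumed convention, a display-style label would keep the text as written in the CSV, while a class-style label yields a compact identifier.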

**Output**: JSON-LD file suitable for schema generation and other data model operations

#### JSON Schema Generation (`generate_jsonschema`)

Generates JSON Schema validation files from JSON-LD data models, translating validation rules into schema constraints:

**Supported Validation Rules**:
- Type validation (string, number, integer, boolean)
- Enum constraints from valid values
- Required field enforcement (including component-specific requirements)
- Range constraints (`inRange` → min/max)
- Pattern matching (`regex` → JSON Schema patterns)
- Format validation (`date`, `url`)
- Array handling (`list` rules)
- Conditional dependencies (if/then schemas)
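
As a rough illustration of how the rules listed above could map onto JSON Schema keywords, the sketch below translates a few rule strings and builds one if/then dependency. The rule string syntax (`inRange <min> <max>`, `regex <pattern>`), the helper names, and the attribute names are assumptions, not the behavior of `generate_jsonschema`.

```python
import json


def rule_to_constraints(rule: str) -> dict:
    """Translate one rule string into JSON Schema keywords (assumed rule syntax)."""
    name, _, rest = rule.partition(" ")
    if name == "inRange":  # assumed form: "inRange <min> <max>"
        low, high = rest.split()
        return {"type": "number", "minimum": float(low), "maximum": float(high)}
    if name == "regex":  # assumed form: "regex <pattern>"
        return {"type": "string", "pattern": rest}
    if name == "date":
        return {"type": "string", "format": "date"}
    if name == "url":
        return {"type": "string", "format": "uri"}
    if name == "list":
        return {"type": "array"}
    raise ValueError(f"Unsupported rule: {rule!r}")


def conditional_requirement(trigger: str, value: str, dependent: str) -> dict:
    """Require `dependent` whenever `trigger` equals `value`, using if/then."""
    return {
        "if": {"properties": {trigger: {"const": value}}, "required": [trigger]},
        "then": {"required": [dependent]},
    }


# Hypothetical component schema assembled from the pieces above.
schema = {
    "type": "object",
    "properties": {
        "age": rule_to_constraints("inRange 0 120"),
        "sampleId": rule_to_constraints("regex ^S[0-9]+$"),
        "diagnosis": {"enum": ["Cancer", "Healthy"]},
        "cancerType": {"type": "string"},
    },
    "required": ["sampleId", "diagnosis"],
    "allOf": [conditional_requirement("diagnosis", "Cancer", "cancerType")],
}

print(json.dumps(schema, indent=2))
```

Here `cancerType` becomes required only when `diagnosis` is `"Cancer"`, which is the kind of constraint a conditional dependency expresses.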

**Configuration Levers**:
- Component selection (specific data types or all components)
- Label format for property names
- Custom output directory structure
- Component-based rule application using `#Component` syntax

**Output**: JSON Schema files for each component, enabling validation of submitted manifests

## Development Philosophy

### Fail Fast with Clear Messages