feat: add LDBC dataset integration#91
Conversation
Implements support for loading LDBC Social Network Benchmark (SNB)
datasets from .tar.zst archives with multi-file CSV format.
## Changes
**Compression Support:**
- Add `compression.py` module for .tar.zst extraction
- Support .tar.gz, .tar, and .tar.zst formats
- Uses zstandard library (optional dependency)
**LDBC Schema:**
- Define 8 node types (Person, Post, Comment, Forum, etc.)
- Define 3 relationship types (KNOWS, LIKES)
- Property mappings with type converters
- Temporal property support (uses date/datetime)
**LDBC Loader:**
- Multi-file CSV parser with pipe delimiter
- Node-first loading strategy (cache for relationships)
- Required/optional property validation
- Subdirectory search for flexible archive structures
**Dataset Registration:**
- Register 4 scale factors (SF0.001, SF0.1, SF1, SF10)
- Metadata includes URLs, node/edge counts, schemas
- Auto-registration on import
## Implementation Details
Uses dataclasses for internal schema definitions (performance)
and Pydantic for dataset metadata (validation). Follows existing
CSV loader patterns but handles multi-file format.
Temporal types (DATE, DATETIME) work with LDBC timestamps via
ISO 8601 parsing. Properties with years/months use isodate.Duration.
## Testing
- 11 compression utility tests
- 25 schema definition tests
- 18 loader functionality tests
- 54 total tests, 100% coverage on new code
- All pre-push checks passing (93.67% total coverage)
## Example Usage
```python
from graphforge import GraphForge
from graphforge.datasets import load_dataset
gf = GraphForge()
load_dataset(gf, "ldbc-snb-sf0.001") # Smallest scale
# Query the data
results = gf.execute("""
MATCH (p:Person)-[k:KNOWS]->(friend:Person)
WHERE p.firstName = 'Alice'
RETURN friend.firstName AS name, k.creationDate AS since
""")
```
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughAdds LDBC SNB support: new LDBCLoader, LDBC schema definitions, compression utilities for .tar.zst/.tar.gz/.tar, dataset registrations for multiple scale factors, and unit tests for compression, loader, and schema parsing. Changes
Sequence DiagramsequenceDiagram
actor User
participant LDBCLoader
participant Compression
participant FileSystem
participant SchemaParser
participant GraphForge
User->>LDBCLoader: load(gf, path)
LDBCLoader->>FileSystem: validate path exists
alt compressed archive detected
LDBCLoader->>Compression: extract_archive(path)
Compression->>FileSystem: read archive (.tar.zst / .tar.gz / .tar)
Compression->>FileSystem: extract to temp directory
Compression-->>LDBCLoader: return extraction path
else directory
LDBCLoader->>FileSystem: use path directly
end
LDBCLoader->>LDBCLoader: _load_nodes(gf, data_dir)
loop node schemas
LDBCLoader->>FileSystem: find CSV file
LDBCLoader->>FileSystem: read CSV rows
loop rows
LDBCLoader->>SchemaParser: _parse_properties(row, mappings)
SchemaParser-->>LDBCLoader: parsed properties
LDBCLoader->>GraphForge: create_node(label, properties)
LDBCLoader->>LDBCLoader: cache node by (label,id)
end
end
LDBCLoader->>LDBCLoader: _load_relationships(gf, data_dir, node_cache)
loop relationship schemas
LDBCLoader->>FileSystem: find CSV file
LDBCLoader->>FileSystem: read CSV rows
loop rows
LDBCLoader->>LDBCLoader: resolve source/target from cache
LDBCLoader->>SchemaParser: _parse_properties(row, mappings)
SchemaParser-->>LDBCLoader: parsed properties
alt nodes exist
LDBCLoader->>GraphForge: create_relationship(source, target, type, properties)
else
LDBCLoader->>LDBCLoader: skip relationship
end
end
end
LDBCLoader-->>User: loading complete
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly Related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #91 +/- ##
==========================================
- Coverage 91.17% 90.48% -0.69%
==========================================
Files 21 25 +4
Lines 2821 3027 +206
Branches 700 737 +37
==========================================
+ Hits 2572 2739 +167
- Misses 109 136 +27
- Partials 140 152 +12
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
|
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@src/graphforge/datasets/loaders/compression.py`:
- Around line 49-55: The tar extraction using tar.extractall(compressed_file) is
vulnerable to path traversal; replace it by iterating TarFile members and
validate each member's final path before extraction (e.g., compute target_path =
os.path.join(extract_to, member.name), resolve both target_path and extract_to
and ensure target_path.startswith(resolved_extract_to)) and only extract
validated members via tar.extract(member, extract_to). Implement this as a
shared helper (e.g., safe_extract_tar(tar: tarfile.TarFile, extract_to: str))
and call it from the code paths that currently use tar.extractall (the block
that opens tar via dctx.stream_reader and the other tar-open block), ensuring no
absolute paths or parent-traversal entries are allowed.
- Around line 3-7: The module docstring incorrectly claims support for .gz and
.zip; update the top-level docstring to only list the formats actually handled
by the code (e.g., .tar.zst and .tar.gz) so it matches the behavior of
extract_archive() and is_compressed_archive(); reference those function names in
the docstring change to clarify supported formats and remove .gz/.zip from the
list (or, if you prefer to implement support, add handling for .gz/.zip inside
extract_archive() and is_compressed_archive() instead—pick one approach and make
docstring and functions consistent).
In `@src/graphforge/datasets/loaders/ldbc.py`:
- Around line 60-69: The current extraction branch (is_compressed_archive)
creates a persistent folder (.{path.stem}_extracted) and never removes it;
change extract logic in the loader so extraction uses a temporary directory
(e.g., tempfile.TemporaryDirectory or tempfile.mkdtemp with explicit cleanup)
instead of path.parent / f".{path.stem}_extracted", set data_dir to that temp
dir, and ensure the temp dir is removed after loading completes (or use a
context manager so cleanup is automatic). Update the code around
is_compressed_archive, extract_archive, and how data_dir is used so any
downstream processing runs while the temp dir is alive and is cleaned up
afterwards.
In `@src/graphforge/datasets/sources/ldbc.py`:
- Around line 23-67: Make the LDBC loader/dataset registration idempotent by
checking for existing registrations before registering: guard the
register_loader("ldbc", LDBCLoader) call (or catch ValueError) and likewise
guard register_dataset(DatasetInfo(...)) so repeated imports/calls don't raise;
reference the symbols register_loader, register_dataset, LDBCLoader and the
DatasetInfo block and implement the same pattern SNAP uses (check-if-registered
or try/except around registration) to skip re-registration.
In `@tests/unit/datasets/test_compression.py`:
- Line 13: Move the I/O-heavy compression tests out of the unit suite by
relocating tests/unit/datasets/test_compression.py to an appropriate
slow/integration directory (e.g., tests/integration/datasets/ or
tests/slow/datasets/) and change the module marker from pytestmark =
pytest.mark.unit to a more appropriate marker such as pytestmark =
pytest.mark.integration or pytestmark = pytest.mark.slow; ensure any test
collection/config (pytest.ini or conftest.py) is aware of the new
directory/marker so these tests run in the slower suite rather than unit.
In `@tests/unit/datasets/test_ldbc_loader.py`:
- Line 11: This test file is an integration test (it instantiates GraphForge and
runs Cypher) but is marked and located as unit; move the test from tests/unit to
tests/integration, change the module-level marker from pytest.mark.unit to
pytest.mark.integration (or add `@pytest.mark.integration` on the test), and
ensure any imports or test utilities referenced (e.g., GraphForge) still resolve
after the move; update test metadata accordingly so the test runner treats it as
an integration test.
🧹 Nitpick comments (3)
src/graphforge/datasets/loaders/ldbc.py (2)
98-109: Missing column access could produce cryptic KeyError.If the CSV file doesn't contain the expected
id_column, line 100 will raise a rawKeyErrorwithout context about which file or schema failed.💡 Suggested defensive access with better error message
for row in reader: # Extract node ID - node_id = row[schema.id_column] + try: + node_id = row[schema.id_column] + except KeyError: + raise ValueError( + f"Missing ID column '{schema.id_column}' in {csv_path}" + ) from None
135-146: Same KeyError risk for relationship ID columns.Similar to node loading, accessing
row[schema.source_id_column]androw[schema.target_id_column]without guards will produce cryptic errors if columns are missing.The silent skip for missing nodes (lines 144-146) is appropriate for handling partial datasets, though a debug-level log could aid troubleshooting.
src/graphforge/datasets/loaders/ldbc_schema.py (1)
103-107: Redundant empty check.The condition
not value or value == ""is redundant since an empty string is falsy.💅 Simplification
def parse_list(value: str) -> list[str]: """Parse semicolon-separated list.""" - if not value or value == "": + if not value: return [] return [item.strip() for item in value.split(";") if item.strip()]
| is_compressed_archive, | ||
| ) | ||
|
|
||
| pytestmark = pytest.mark.unit |
There was a problem hiding this comment.
Reclassify these I/O-heavy tests out of the unit suite.
Creating/extracting tar archives is not <1ms and doesn’t target parser/planner/executor/storage layers. Please move these tests out of tests/unit/ (e.g., to a slow/integration suite) and update the marker accordingly.
🧭 Suggested marker change (paired with file move)
-pytestmark = pytest.mark.unit
+pytestmark = pytest.mark.slowAs per coding guidelines: “Unit tests MUST test ONE component in isolation and complete in < 1ms” and “tests/unit/**/*.py: Unit tests in tests/unit/ MUST test parser, planner, executor, and storage layers in isolation with minimal mocking.”
🤖 Prompt for AI Agents
In `@tests/unit/datasets/test_compression.py` at line 13, Move the I/O-heavy
compression tests out of the unit suite by relocating
tests/unit/datasets/test_compression.py to an appropriate slow/integration
directory (e.g., tests/integration/datasets/ or tests/slow/datasets/) and change
the module marker from pytestmark = pytest.mark.unit to a more appropriate
marker such as pytestmark = pytest.mark.integration or pytestmark =
pytest.mark.slow; ensure any test collection/config (pytest.ini or conftest.py)
is aware of the new directory/marker so these tests run in the slower suite
rather than unit.
…urity) Fixes 3 high-severity Bandit findings (B202:tarfile_unsafe_members). ## Security Issue tar.extractall() without validation can be exploited by malicious archives containing files with paths like ../../etc/passwd that escape the intended extraction directory (CWE-22: Path Traversal). ## Fix Added filter='data' parameter to all tar.extractall() calls: - extract_tar_zst(): .tar.zst extraction - extract_archive(): .tar.gz extraction - extract_archive(): .tar extraction The 'data' filter (Python 3.12+) is the safest option: - Strips leading slashes from paths - Refuses absolute paths - Refuses symlinks pointing outside extraction dir - Refuses device files ## Verification - Bandit scan: 0 issues (was 3 high-severity) - All tests passing - All pre-push checks passing Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/graphforge/datasets/loaders/compression.py`:
- Around line 53-55: The three calls to
tar.extractall(tar.extractall(extract_to, filter="data")) use the Python
3.11-only filter kwarg and will TypeError on Python 3.10; update each occurrence
where variable tar and extract_to are used (the tar.zst, tar.gz and uncompressed
tar extraction blocks) to guard on sys.version_info >= (3,11) and call
tar.extractall(extract_to, filter="data") only for 3.11+, otherwise perform a
safe fallback: iterate tar.getmembers() and validate member paths (prevent path
traversal) before extracting (e.g., build the target path from extract_to and
member.name and skip/raise on unsafe paths) so the extraction logic is secure on
Python 3.10.
Resolves all issues from code review:
## Security (compression.py)
- Replace filter='data' with manual path validation
- Implement safe_extract_tar() helper function
- Validate each tar member before extraction
- Refuse absolute paths and parent directory references
- Ensure extracted paths remain within target directory
## Documentation (compression.py)
- Fix module docstring to only list supported formats
- Update to mention .tar.zst, .tar.gz, .tar (removed .gz, .zip)
- Reference extract_archive() and is_compressed_archive() functions
## Resource Cleanup (ldbc.py)
- Use tempfile.TemporaryDirectory for extraction
- Automatic cleanup on context manager exit
- No more persistent .{stem}_extracted directories
## Idempotent Registration (ldbc.py sources)
- Check _DATASET_REGISTRY before register_dataset()
- Try/except around register_loader() like SNAP does
- Safe for repeated imports and function calls
## Test Organization
- Move test_compression.py: unit -> integration
- Move test_ldbc_loader.py: unit -> integration
- Update pytestmark to pytest.mark.integration
- Update docstrings to explain integration classification
All tests passing (54/54), Bandit clean (0 issues), pre-push checks pass.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Implements support for loading LDBC Social Network Benchmark (SNB) datasets, completing the final piece of the temporal/spatial/LDBC integration plan.
Background
LDBC provides industry-standard benchmark datasets for graph databases. The SNB models a social network (similar to Facebook) with realistic data including:
Changes
1. Compression Support (
compression.py)zstandarddependency with graceful fallback2. LDBC Schema (
ldbc_schema.py)3. LDBC Loader (
ldbc.py)4. Dataset Registration (
ldbc.py)Architecture Decisions
Dataclasses for Schemas (Not Pydantic)
Following CLAUDE.md guidance:
The
PropertyMapping,NodeSchema, andRelationshipSchemaclasses are:Temporal Type Integration
LDBC datasets use temporal properties extensively:
creationDate: DATETIME (ISO 8601:2023-01-15T10:30:00.000+0000)birthday: DATE (1990-01-15)This leverages temporal types from #86, demonstrating the value of completing the type system first.
Testing
Unit Tests (54 total)
Coverage
Example Usage
Dependencies
Optional:
zstandard>=0.22.0for .tar.zst supportIf not installed:
Implementation Sequence
This completes the approved plan:
All three pieces now work together:
Future Enhancements
The current implementation includes 3 relationship types. Additional LDBC relationships can be easily added to
ldbc_schema.py:The schema is designed to be incrementally extended.
Performance Notes
The loader uses a node-first strategy:
(label, id) -> NodeRefThis ensures O(1) relationship creation and handles missing nodes gracefully.
Closes #51
Summary by CodeRabbit
New Features
API
Tests