feat: implement CSV loader and register 5 SNAP datasets #69
- Add CSVLoader for edge-list datasets (CSV/TSV/space-delimited)
- Support auto-delimiter detection, gzip compression, weighted edges
- Register 5 SNAP datasets: ego-facebook, email-enron, ca-astroph, web-google, twitter-combined
- Add auto-registration on module import
- Add comprehensive tests (14 CSV loader tests, 13 SNAP dataset tests)
- Fix linting issues: use Path.open(), avoid loop variable overwrite
- Add test fixture to handle registry isolation issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
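The loader features listed in this commit (delimiter auto-detection, gzip support, optional edge weights) can be illustrated with a minimal sketch. This is not the PR's `CSVLoader`; `detect_delimiter` and `read_edges` are hypothetical names, and the detection order (tab, then comma, then whitespace) and the skipping of `#` comment lines are assumptions made for illustration.

```python
from __future__ import annotations

import gzip
from pathlib import Path


def detect_delimiter(sample_line: str) -> str:
    """Guess the delimiter from one data line: tab, then comma, else space."""
    if "\t" in sample_line:
        return "\t"
    if "," in sample_line:
        return ","
    return " "


def read_edges(path: Path) -> list[tuple[str, str, float | None]]:
    """Read (source, target, weight) triples from a plain or gzipped edge list."""
    if path.suffix == ".gz":
        handle = gzip.open(path, "rt", encoding="utf-8")
    else:
        handle = path.open("r", encoding="utf-8")

    edges: list[tuple[str, str, float | None]] = []
    with handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Assumption: blank lines and '#' comment lines carry no edges.
            if not line or line.startswith("#"):
                continue
            delimiter = detect_delimiter(line)
            # split() with no argument collapses runs of whitespace.
            fields = line.split() if delimiter == " " else line.split(delimiter)
            # A third column, when present, is treated as the edge weight.
            weight = float(fields[2]) if len(fields) > 2 else None
            edges.append((fields[0], fields[1], weight))
    return edges
```

Detecting the delimiter per line keeps the sketch short; a real loader would more likely detect it once from the first data line and reuse it.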
Walkthrough
Importing the package now auto-registers the SNAP datasets. The PR also adds a CSVLoader for edge-list files (CSV, TSV, or space-delimited, optionally gzipped) and unit tests for both CSV loading and SNAP dataset registration.
Sequence Diagram
sequenceDiagram
participant Import as "datasets/__init__.py"
participant Reg as "register_snap_datasets()"
participant Registry as "Global Registry (loaders & datasets)"
participant CSV as "CSVLoader"
Import->>Reg: call register_snap_datasets()
Reg->>Registry: ensure loader registered ("csv")
Registry-->>Reg: loader registered (or already present)
Reg->>Registry: register dataset (snap-ego-facebook, metadata)
Reg->>Registry: register dataset (snap-email-enron, metadata)
Reg->>Registry: register dataset (snap-ca-astroph, metadata)
Reg->>Registry: register dataset (snap-web-google, metadata)
Reg->>Registry: register dataset (snap-twitter-combined, metadata)
Registry-->>Import: datasets and loader available
Estimated Code Review Effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #69 +/- ##
=======================================
Coverage 93.86% 93.86%
=======================================
Files 17 19 +2
Lines 2134 2201 +67
Branches 526 542 +16
=======================================
+ Hits 2003 2066 +63
- Misses 49 52 +3
- Partials 82 83 +1
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/graphforge/datasets/loaders/csv.py`:
- Around line 80-95: The parser currently uses stripped_line.split(delimiter)
which, when delimiter is a single space, yields empty tokens for consecutive
spaces; update the parsing logic in the CSV loader (around
self._detect_delimiter, delimiter, and the block that assigns
source_id/target_id) so that if delimiter == " " you call stripped_line.split()
(no argument) to collapse consecutive whitespace before extracting source_id and
target_id, and keep the existing error handling; also add a unit test for the
CSV loader that parses a line with multiple consecutive spaces (e.g., "0 1
0.5") to ensure no empty node IDs are produced.
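The difference the reviewer describes can be checked directly with Python's built-in `str.split`. The helper `split_fields` below is a hypothetical name used only to illustrate the suggested branch; it is a sketch, not the code in `csv.py`.

```python
line = "0  1   0.5"  # two spaces, then three spaces

# Splitting on a literal space yields an empty token for every extra space.
assert line.split(" ") == ["0", "", "1", "", "", "0.5"]

# Splitting with no argument collapses any run of whitespace.
assert line.split() == ["0", "1", "0.5"]


def split_fields(stripped_line: str, delimiter: str) -> list[str]:
    """Split one edge-list line, collapsing whitespace for space-delimited input."""
    if delimiter == " ":
        return stripped_line.split()
    return stripped_line.split(delimiter)
```

Other delimiters keep their empty fields on purpose: in a comma-separated file, `a,,b` usually signals a missing value rather than sloppy spacing.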
In `@src/graphforge/datasets/sources/snap.py`:
- Around line 13-19: Wrap the call to register_loader("csv", CSVLoader) inside a
try/except in register_snap_datasets so a duplicate-registration error doesn't
abort the rest of the function; specifically, call register_loader("csv",
CSVLoader) in a try block and catch the duplicate-loader exception (e.g.,
DuplicateLoaderError or the specific exception your registry throws) and
silently ignore it (or log.debug) then continue with the dataset registration
logic so datasets still get registered; re-raise any other unexpected
exceptions.
CSV Loader Improvements:
- Handle consecutive spaces in space-delimited files by using split() without arguments when delimiter is space
- Prevents empty node IDs from being created with malformed data
- Add comprehensive test for multiple consecutive spaces with weights

SNAP Registration Improvements:
- Make register_snap_datasets() idempotent by catching duplicate registration errors
- Wrap register_loader() and register_dataset() in try-except blocks
- Gracefully handle cases where registry was already initialized
- Simplify test fixture since registration is now idempotent

Testing:
- Add test_load_multiple_consecutive_spaces() test case
- Verify no empty node IDs created with consecutive spaces
- All 1108 tests passing, coverage 95.59%

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Additional Improvements
CSV Loader Enhancements
SNAP Registration Robustness
Testing
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/graphforge/datasets/sources/snap.py`:
- Around line 100-103: The `# noqa: PERF203` on the except clause is unused and
should be removed to satisfy strict linting; edit the try/except block that
calls register_dataset(dataset) (the except ValueError as e: handler) and delete
the `# noqa: PERF203` comment so the except line reads simply `except ValueError
as e:` (no other changes needed).
- Replace try-except in loop with registry check before registration
- Avoids PERF203 linting warning about performance overhead
- More efficient: checks registry directly instead of catching exceptions
- Import _DATASET_REGISTRY for idempotent registration check
- All tests passing, cleaner implementation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* docs: prepare v0.2.1 release with dataset documentation

Version Bump:
- Bump version to 0.2.1 in pyproject.toml, __init__.py, and uv.lock
- Add comprehensive v0.2.1 changelog entry

Documentation Updates:
- Update README.md with dataset loading examples and quickstart
- Add "Load Real-World Datasets" section to main README
- Update docs/index.md with dataset features and examples
- Complete rewrite of docs/datasets/snap.md:
  - Mark as available in v0.2.1 (5 datasets)
  - Add detailed dataset table with stats
  - Add comprehensive usage examples and query patterns
  - Document download/caching behavior
  - Add performance tips for large datasets
- Update docs/datasets/overview.md:
  - Reorganize to show SNAP as "Available Now"
  - Mark other sources as "Coming Soon"
  - List all 5 available SNAP datasets
- Update docs/getting-started/quickstart.md:
  - Add "Load a Dataset" section with examples
  - Add dataset browsing examples
  - Update navigation links

Release Contents (v0.2.1):
- Dataset loading infrastructure with caching (#68)
- CSV loader for edge-list datasets (#69)
- 5 SNAP datasets available
- MERGE ON CREATE SET syntax (#65)
- MERGE ON MATCH SET syntax (#66)
- WITH clause variable passing fix (#67)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: update uv.lock after version bump

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Implements CSV edge-list loader and registers 5 SNAP (Stanford Network Analysis Project) datasets.
Changes
Core Implementation
CSVLoader (`src/graphforge/datasets/loaders/csv.py`):
SNAP Dataset Registration (`src/graphforge/datasets/sources/snap.py`):
Auto-Registration: Datasets registered automatically on module import
Testing
Code Quality
Path.open() instead of open()
Examples
Testing
All pre-push checks passing:
Related
Part of v0.2.1 dataset infrastructure implementation.
Follows PR #68 (dataset loading infrastructure).