feat: implement CSV loader and register 5 SNAP datasets by DecisionNerd · Pull Request #69 · DecisionNerd/graphforge

DecisionNerd · 2026-02-03T22:25:31Z

Summary

Implements CSV edge-list loader and registers 5 SNAP (Stanford Network Analysis Project) datasets.

Changes

Core Implementation

CSVLoader (src/graphforge/datasets/loaders/csv.py):
- Auto-delimiter detection (tab, comma, space)
- Gzip compression support (.gz files)
- Comment line handling (lines starting with #)
- Empty line skipping
- Weighted and unweighted edges
- Node deduplication via cache
- Comprehensive error handling
SNAP Dataset Registration (src/graphforge/datasets/sources/snap.py):
- snap-ego-facebook (4K nodes, 88K edges)
- snap-email-enron (37K nodes, 184K edges)
- snap-ca-astroph (19K nodes, 198K edges)
- snap-web-google (876K nodes, 5.1M edges)
- snap-twitter-combined (81K nodes, 1.7M edges)
Auto-Registration: Datasets registered automatically on module import
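
For readers who want to see how the loader features listed above fit together, here is a minimal standalone sketch. It is not the code from `src/graphforge/datasets/loaders/csv.py`; the names `open_edge_list`, `detect_delimiter`, and `parse_edges` are hypothetical, and node deduplication is left out for brevity.

```python
import gzip
from pathlib import Path
from typing import IO, Iterable, Iterator, Optional, Tuple

# (source, target, weight); weight is None for unweighted edges.
Edge = Tuple[str, str, Optional[float]]


def open_edge_list(path: Path) -> IO[str]:
    """Open a plain or gzip-compressed edge list in text mode."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return path.open("rt", encoding="utf-8")


def detect_delimiter(line: str) -> str:
    """Return the first of tab, comma, space that occurs in the line."""
    for candidate in ("\t", ",", " "):
        if candidate in line:
            return candidate
    raise ValueError(f"no known delimiter in line: {line!r}")


def parse_edges(lines: Iterable[str]) -> Iterator[Edge]:
    """Yield edges, skipping blank lines and lines starting with '#'."""
    delimiter: Optional[str] = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if delimiter is None:
            # Detect once, from the first data line (an assumption of this sketch).
            delimiter = detect_delimiter(line)
        # split() with no argument collapses runs of whitespace.
        parts = line.split() if delimiter == " " else line.split(delimiter)
        if len(parts) < 2:
            raise ValueError(f"expected at least two columns: {line!r}")
        weight = float(parts[2]) if len(parts) > 2 else None
        yield parts[0].strip(), parts[1].strip(), weight


# In-memory usage: comments and empty lines are ignored.
for edge in parse_edges(["# demo", "", "0\t1", "1\t2\t0.5"]):
    print(edge)
```

Keeping file opening separate from line parsing means the parser can be exercised with plain lists of strings, and the same parser serves both `.gz` and uncompressed files.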

Testing

✅ 14 CSV loader unit tests
✅ 13 SNAP dataset unit tests
✅ All existing tests passing (1107 passed, 14 skipped)
✅ 95.69% total code coverage maintained

Code Quality

Fixed linting issues:
- PTH123: Use Path.open() instead of open()
- PLW2901: Avoid overwriting loop variables
- E501: Line length compliance
Added test fixture to handle registry isolation

Examples

from graphforge import GraphForge
from graphforge.datasets import list_datasets, load_dataset

# List available SNAP datasets
snap_datasets = list_datasets(source="snap")
for ds in snap_datasets:
    print(f"{ds.name}: {ds.nodes} nodes, {ds.edges} edges")

# Load a dataset
gf = GraphForge.from_dataset("snap-ego-facebook")

# Or load into existing instance
gf = GraphForge()
load_dataset(gf, "snap-email-enron")

# Query the graph
result = gf.execute("""
    MATCH (n)-[r:CONNECTED_TO]->(m)
    RETURN count(r) as edge_count
""")

Testing

All pre-push checks passing:

✅ Linting (ruff format + check)
✅ Type checking (mypy strict)
✅ 1107 tests passed, 14 skipped
✅ Coverage: 95.69%

Summary by CodeRabbit

New Features
- CSV/TSV/space-delimited loader with automatic delimiter detection, weighted-edge support, node caching, and gzip handling.
- Five SNAP datasets added: Facebook ego, Enron emails, AstroPh collaboration, Google web graph, and Twitter ego; these datasets become available automatically on initialization.
Tests
- Extensive tests for CSV loader behavior (formats, weights, compression, errors) and SNAP dataset metadata/registration.

- Add CSVLoader for edge-list datasets (CSV/TSV/space-delimited)
- Support auto-delimiter detection, gzip compression, weighted edges
- Register 5 SNAP datasets: ego-facebook, email-enron, ca-astroph, web-google, twitter-combined
- Add auto-registration on module import
- Add comprehensive tests (14 CSV loader tests, 13 SNAP dataset tests)
- Fix linting issues: use Path.open(), avoid loop variable overwrite
- Add test fixture to handle registry isolation issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai · 2026-02-03T22:25:46Z

Walkthrough

Imports now auto-register SNAP datasets at package import; adds a CSVLoader for edge-list CSV/TSV/space-delimited (gzipped supported) files and unit tests for both CSV loading and SNAP dataset registration.

Changes

Cohort / File(s)	Summary
CSV Loader Implementation `src/graphforge/datasets/loaders/csv.py`	Adds `CSVLoader`: loads edge lists with auto-delimiter detection, supports comments, optional weights, gzipped files, node caching/deduplication, `_load_edges`, `_detect_delimiter`, and `get_format()` returning `"csv"`.
SNAP Dataset Registration `src/graphforge/datasets/sources/snap.py`	Adds `register_snap_datasets()` which idempotently ensures the CSV loader is registered and registers five SNAP datasets with metadata (name, source, URL, counts, category, license, loader_class).
Package Initialization `src/graphforge/datasets/__init__.py`	Now imports and immediately calls `register_snap_datasets()` on package import, causing import-time registration side-effect.
Test Coverage `tests/unit/datasets/test_csv_loader.py`, `tests/unit/datasets/test_snap_datasets.py`	New unit tests: CSV loader behavior (delimiters, weights, gzip, comments, node dedup, errors, get_format) and SNAP dataset registration/metadata, URL format, loader assignment, and filtering.

Sequence Diagram

sequenceDiagram
    participant Import as "datasets/__init__.py"
    participant Reg as "register_snap_datasets()"
    participant Registry as "Global Registry (loaders & datasets)"
    participant CSV as "CSVLoader"

    Import->>Reg: call register_snap_datasets()
    Reg->>Registry: ensure loader registered ("csv")
    Registry-->>Reg: loader registered (or already present)
    Reg->>Registry: register dataset (snap-ego-facebook, metadata)
    Reg->>Registry: register dataset (snap-email-enron, metadata)
    Reg->>Registry: register dataset (snap-ca-astroph, metadata)
    Reg->>Registry: register dataset (snap-web-google, metadata)
    Reg->>Registry: register dataset (snap-twitter-combined, metadata)
    Registry-->>Import: datasets and loader available

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Register additional SNAP datasets #70: Adds register_snap_datasets and related loader registration, aligning with the issue's goal to expand/refactor SNAP dataset registration.

Possibly related PRs

feat: implement core dataset loading infrastructure #68: Introduced the dataset registry/loader abstractions and layout that this PR builds upon.

Suggested labels

v0.3

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	Title clearly summarizes the main changes: CSV loader implementation and SNAP dataset registration, which aligns with the primary objectives.
Description check	✅ Passed	Description provides comprehensive details covering implementation, testing, code quality, examples, and related context. Includes type of change (✨ New feature), test coverage, and checklist items are satisfied.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

codecov · 2026-02-03T22:27:53Z

Codecov Report

❌ Patch coverage is 94.02985% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.86%. Comparing base (3f39b82) to head (3111a65).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #69   +/-   ##
=======================================
  Coverage   93.86%   93.86%           
=======================================
  Files          17       19    +2     
  Lines        2134     2201   +67     
  Branches      526      542   +16     
=======================================
+ Hits         2003     2066   +63     
- Misses         49       52    +3     
- Partials       82       83    +1
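
The percentages in this report can be reproduced from the hit and line counts in the diff above. The sketch below assumes coverage is hits divided by total lines, and that the patch consists only of the added lines (67 lines, 63 of them hit). Truncating to two decimals rather than rounding is also an assumption; it is what matches the figures shown here.

```python
def coverage_hundredths(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated (not rounded)."""
    return hits * 10_000 // lines


# Counts taken from the coverage diff for main vs. #69.
base = coverage_hundredths(2003, 2134)
head = coverage_hundredths(2066, 2201)

# Patch coverage: only the lines this PR added (assumes no lines were removed).
patch = coverage_hundredths(2066 - 2003, 2201 - 2134)

for label, value in (("base", base), ("head", head), ("patch", patch)):
    print(f"{label}: {value // 100}.{value % 100:02d}%")
```

Under these assumptions base and head both come out at 93.86% and the patch at 94.02%, with 67 - 63 = 4 added lines lacking full coverage (3 misses and 1 partial).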

Flag	Coverage Δ
full-coverage	`93.86% <94.02%> (+<0.01%)`	⬆️
unittests	`75.96% <94.02%> (+1.50%)`	⬆️


Components	Coverage Δ
parser	`95.31% <ø> (ø)`
planner	`94.20% <ø> (ø)`
executor	`89.87% <ø> (ø)`
storage	`99.50% <ø> (ø)`
ast	`100.00% <ø> (ø)`
types	`98.42% <ø> (ø)`


Δ = absolute <relative> (impact), ø = not affected, ? = missing data

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@src/graphforge/datasets/loaders/csv.py`:
- Around line 80-95: The parser currently uses stripped_line.split(delimiter)
which, when delimiter is a single space, yields empty tokens for consecutive
spaces; update the parsing logic in the CSV loader (around
self._detect_delimiter, delimiter, and the block that assigns
source_id/target_id) so that if delimiter == " " you call stripped_line.split()
(no argument) to collapse consecutive whitespace before extracting source_id and
target_id, and keep the existing error handling; also add a unit test for the
CSV loader that parses a line with multiple consecutive spaces (e.g.,
"0  1  0.5") to ensure no empty node IDs are produced.

In `@src/graphforge/datasets/sources/snap.py`:
- Around line 13-19: Wrap the call to register_loader("csv", CSVLoader) inside a
try/except in register_snap_datasets so a duplicate-registration error doesn't
abort the rest of the function; specifically, call register_loader("csv",
CSVLoader) in a try block and catch the duplicate-loader exception (e.g.,
DuplicateLoaderError or the specific exception your registry throws) and
silently ignore it (or log.debug) then continue with the dataset registration
logic so datasets still get registered; re-raise any other unexpected
exceptions.

CSV Loader Improvements:
- Handle consecutive spaces in space-delimited files by using split() without arguments when delimiter is space
- Prevents empty node IDs from being created with malformed data
- Add comprehensive test for multiple consecutive spaces with weights

SNAP Registration Improvements:
- Make register_snap_datasets() idempotent by catching duplicate registration errors
- Wrap register_loader() and register_dataset() in try-except blocks
- Gracefully handle cases where registry was already initialized
- Simplify test fixture since registration is now idempotent

Testing:
- Add test_load_multiple_consecutive_spaces() test case
- Verify no empty node IDs created with consecutive spaces
- All 1108 tests passing, coverage 95.59%

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

DecisionNerd · 2026-02-04T01:01:19Z

Additional Improvements

CSV Loader Enhancements

- Fixed consecutive whitespace handling: Now uses split() without arguments when delimiter is a space, which properly collapses consecutive spaces
- Added test coverage: New test_load_multiple_consecutive_spaces() test verifies parsing of data with multiple consecutive spaces and weights
- Prevents empty node IDs: Ensures malformed data with excessive spacing doesn't create empty node IDs
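
The difference between the two `split` forms is easy to reproduce in isolation. This is a standalone illustration of standard `str.split` behavior, not an excerpt from the loader:

```python
line = "0  1  0.5"  # two spaces between each field

# An explicit single-space separator yields an empty string between adjacent spaces.
with_separator = line.split(" ")
print(with_separator)  # ['0', '', '1', '', '0.5']

# With no argument, any run of whitespace counts as one separator.
without_separator = line.split()
print(without_separator)  # ['0', '1', '0.5']
```

With the first form, the second token is an empty string, which is how an empty node ID could end up as the edge target.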

SNAP Registration Robustness

- Idempotent registration: register_snap_datasets() now safely handles being called multiple times
- Graceful error handling: Both loader and dataset registration wrapped in try-except blocks to handle duplicate registration
- Simplified test fixtures: No longer need error handling in test setup since registration is idempotent

Testing

✅ 1108 tests passing (added 1 new test)
✅ Coverage: 95.59%
✅ All pre-push checks passing

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/graphforge/datasets/sources/snap.py`:
- Around line 100-103: The `# noqa: PERF203` on the except clause is unused and
should be removed to satisfy strict linting; edit the try/except block that
calls register_dataset(dataset) (the except ValueError as e: handler) and delete
the `# noqa: PERF203` comment so the except line reads simply `except ValueError
as e:` (no other changes needed).

- Replace try-except in loop with registry check before registration
- Avoids PERF203 linting warning about performance overhead
- More efficient: checks registry directly instead of catching exceptions
- Import _DATASET_REGISTRY for idempotent registration check
- All tests passing, cleaner implementation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: prepare v0.2.1 release with dataset documentation

Version Bump:
- Bump version to 0.2.1 in pyproject.toml, __init__.py, and uv.lock
- Add comprehensive v0.2.1 changelog entry

Documentation Updates:
- Update README.md with dataset loading examples and quickstart
- Add "Load Real-World Datasets" section to main README
- Update docs/index.md with dataset features and examples
- Complete rewrite of docs/datasets/snap.md:
  - Mark as available in v0.2.1 (5 datasets)
  - Add detailed dataset table with stats
  - Add comprehensive usage examples and query patterns
  - Document download/caching behavior
  - Add performance tips for large datasets
- Update docs/datasets/overview.md:
  - Reorganize to show SNAP as "Available Now"
  - Mark other sources as "Coming Soon"
  - List all 5 available SNAP datasets
- Update docs/getting-started/quickstart.md:
  - Add "Load a Dataset" section with examples
  - Add dataset browsing examples
  - Update navigation links

Release Contents (v0.2.1):
- Dataset loading infrastructure with caching (#68)
- CSV loader for edge-list datasets (#69)
- 5 SNAP datasets available
- MERGE ON CREATE SET syntax (#65)
- MERGE ON MATCH SET syntax (#66)
- WITH clause variable passing fix (#67)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* chore: update uv.lock after version bump

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

DecisionNerd added the enhancement label (New feature or request) on Feb 3, 2026

coderabbitai Bot reviewed Feb 3, 2026

Comment thread src/graphforge/datasets/loaders/csv.py

Comment thread src/graphforge/datasets/sources/snap.py

coderabbitai Bot reviewed Feb 4, 2026

Comment thread src/graphforge/datasets/sources/snap.py Outdated

DecisionNerd merged commit 9793a92 into main Feb 4, 2026
19 checks passed

DecisionNerd deleted the feat/csv-loader-snap-datasets branch February 4, 2026 01:21

DecisionNerd mentioned this pull request Feb 4, 2026

docs: prepare v0.2.1 release with dataset documentation #71

Merged

DecisionNerd mentioned this pull request Feb 4, 2026

feat: add SNAP dataset integration #54

Closed

6 tasks

This was referenced Feb 4, 2026

feat: implement CypherLoader with constraint/index skipping #74

Merged

feat: register Neo4j example datasets #76

Merged

feat: expand SNAP datasets to 95 using JSON metadata #81

Merged

feat: add NetworkRepository dataset integration #110

Merged

coderabbitai Bot mentioned this pull request May 4, 2026

perf: bulk ingest API and deferred statistics (#389) #444

Merged

3 tasks

DecisionNerd merged 3 commits into main from feat/csv-loader-snap-datasets
