feat: implement CSV loader and register 5 SNAP datasets #69

Merged
DecisionNerd merged 3 commits into main from feat/csv-loader-snap-datasets
Feb 4, 2026

Conversation

@DecisionNerd (Owner) commented Feb 3, 2026

Summary

Implements a CSV edge-list loader and registers 5 SNAP (Stanford Network Analysis Project) datasets.

Changes

Core Implementation

  • CSVLoader (src/graphforge/datasets/loaders/csv.py):

    • Auto-delimiter detection (tab, comma, space)
    • Gzip compression support (.gz files)
    • Comment line handling (lines starting with #)
    • Empty line skipping
    • Weighted and unweighted edges
    • Node deduplication via cache
    • Comprehensive error handling
  • SNAP Dataset Registration (src/graphforge/datasets/sources/snap.py):

    • snap-ego-facebook (4K nodes, 88K edges)
    • snap-email-enron (37K nodes, 184K edges)
    • snap-ca-astroph (19K nodes, 198K edges)
    • snap-web-google (876K nodes, 5.1M edges)
    • snap-twitter-combined (81K nodes, 1.7M edges)
  • Auto-Registration: Datasets registered automatically on module import
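The loader features listed above (delimiter detection, gzip support, comment and empty-line skipping, optional weights) can be illustrated with a minimal sketch. This is not the PR's actual `CSVLoader` code, which is not shown in this conversation; the names `detect_delimiter`, `open_text`, and `iter_edges` are hypothetical, and the detection order (tab, comma, space) is an assumption.

```python
import gzip
from pathlib import Path
from typing import IO, Iterator, Optional, Tuple

# Assumed detection order; the real loader may use a different heuristic.
CANDIDATE_DELIMITERS = ("\t", ",", " ")


def detect_delimiter(line: str) -> str:
    """Return the first candidate delimiter that occurs in the line."""
    for delimiter in CANDIDATE_DELIMITERS:
        if delimiter in line:
            return delimiter
    raise ValueError(f"no known delimiter in line: {line!r}")


def open_text(path: Path) -> IO[str]:
    """Open a plain or gzip-compressed file in text mode."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return path.open("r", encoding="utf-8")


def iter_edges(path: Path) -> Iterator[Tuple[str, str, Optional[float]]]:
    """Yield (source_id, target_id, weight) for each data line."""
    delimiter: Optional[str] = None
    with open_text(path) as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#"):
                continue  # skip empty lines and comments
            if delimiter is None:
                delimiter = detect_delimiter(line)  # decided on first data line
            # For spaces, split() collapses runs of whitespace.
            fields = line.split() if delimiter == " " else line.split(delimiter)
            if len(fields) < 2:
                raise ValueError(f"expected at least two columns: {line!r}")
            weight = float(fields[2]) if len(fields) > 2 else None
            yield fields[0], fields[1], weight
```

Detecting the delimiter once, on the first data line, keeps the per-line work to a single split; a loader that must handle mixed delimiters would need a different strategy.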

Testing

  • ✅ 14 CSV loader unit tests
  • ✅ 13 SNAP dataset unit tests
  • ✅ All existing tests passing (1107 passed, 14 skipped)
  • ✅ 95.69% total code coverage maintained

Code Quality

  • Fixed linting issues:
    • PTH123: Use Path.open() instead of open()
    • PLW2901: Avoid overwriting loop variables
    • E501: Line length compliance
  • Added test fixture to handle registry isolation

Examples

```python
from graphforge import GraphForge
from graphforge.datasets import list_datasets, load_dataset

# List available SNAP datasets
snap_datasets = list_datasets(source="snap")
for ds in snap_datasets:
    print(f"{ds.name}: {ds.nodes} nodes, {ds.edges} edges")

# Load a dataset
gf = GraphForge.from_dataset("snap-ego-facebook")

# Or load into existing instance
gf = GraphForge()
load_dataset(gf, "snap-email-enron")

# Query the graph
result = gf.execute("""
    MATCH (n)-[r:CONNECTED_TO]->(m)
    RETURN count(r) as edge_count
""")
```

Testing

All pre-push checks passing:

  • ✅ Linting (ruff format + check)
  • ✅ Type checking (mypy strict)
  • ✅ 1107 tests passed, 14 skipped
  • ✅ Coverage: 95.69%

Related

Part of v0.2.1 dataset infrastructure implementation.

Follows PR #68 (dataset loading infrastructure).

Summary by CodeRabbit

  • New Features

    • CSV/TSV/space-delimited loader with automatic delimiter detection, weighted-edge support, node caching, and gzip handling.
    • Five SNAP datasets added: Facebook ego, Enron emails, AstroPh collaboration, Google web graph, and Twitter ego; these datasets become available automatically on initialization.
  • Tests

    • Extensive tests for CSV loader behavior (formats, weights, compression, errors) and SNAP dataset metadata/registration.
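The node caching mentioned in the summary can be pictured as a dictionary keyed by the external node ID, so that a node appearing in many edges is created only once. The sketch below is an illustration under that assumption, not the loader's real data model; `build_graph` and its integer indices are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_graph(
    edges: Iterable[Tuple[str, str]],
) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Deduplicate node IDs via a cache and return (nodes, index pairs)."""
    node_cache: Dict[str, int] = {}  # external ID -> position in `nodes`
    nodes: List[str] = []
    links: List[Tuple[int, int]] = []

    def get_or_create(node_id: str) -> int:
        # Create the node only the first time its ID is seen.
        if node_id not in node_cache:
            node_cache[node_id] = len(nodes)
            nodes.append(node_id)
        return node_cache[node_id]

    for source_id, target_id in edges:
        links.append((get_or_create(source_id), get_or_create(target_id)))
    return nodes, links
```

For the edges `("a", "b")`, `("b", "c")`, `("a", "c")` this produces three nodes rather than six, with each edge referring to the cached positions.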

- Add CSVLoader for edge-list datasets (CSV/TSV/space-delimited)
- Support auto-delimiter detection, gzip compression, weighted edges
- Register 5 SNAP datasets: ego-facebook, email-enron, ca-astroph,
  web-google, twitter-combined
- Add auto-registration on module import
- Add comprehensive tests (14 CSV loader tests, 13 SNAP dataset tests)
- Fix linting issues: use Path.open(), avoid loop variable overwrite
- Add test fixture to handle registry isolation issues

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@DecisionNerd DecisionNerd added the enhancement New feature or request label Feb 3, 2026
@coderabbitai (Bot, Contributor) commented Feb 3, 2026

Walkthrough

Imports now auto-register SNAP datasets at package import; adds a CSVLoader for edge-list CSV/TSV/space-delimited (gzipped supported) files and unit tests for both CSV loading and SNAP dataset registration.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CSV Loader Implementation: `src/graphforge/datasets/loaders/csv.py` | Adds CSVLoader: loads edge lists with auto-delimiter detection, supports comments, optional weights, gzipped files, node caching/deduplication, _load_edges, _detect_delimiter, and get_format() returning "csv". |
| SNAP Dataset Registration: `src/graphforge/datasets/sources/snap.py` | Adds register_snap_datasets() which idempotently ensures the CSV loader is registered and registers five SNAP datasets with metadata (name, source, URL, counts, category, license, loader_class). |
| Package Initialization: `src/graphforge/datasets/__init__.py` | Now imports and immediately calls register_snap_datasets() on package import, causing import-time registration side-effect. |
| Test Coverage: `tests/unit/datasets/test_csv_loader.py`, `tests/unit/datasets/test_snap_datasets.py` | New unit tests: CSV loader behavior (delimiters, weights, gzip, comments, node dedup, errors, get_format) and SNAP dataset registration/metadata, URL format, loader assignment, and filtering. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Import as "datasets/__init__.py"
    participant Reg as "register_snap_datasets()"
    participant Registry as "Global Registry (loaders & datasets)"
    participant CSV as "CSVLoader"

    Import->>Reg: call register_snap_datasets()
    Reg->>Registry: ensure loader registered ("csv")
    Registry-->>Reg: loader registered (or already present)
    Reg->>Registry: register dataset (snap-ego-facebook, metadata)
    Reg->>Registry: register dataset (snap-email-enron, metadata)
    Reg->>Registry: register dataset (snap-ca-astroph, metadata)
    Reg->>Registry: register dataset (snap-web-google, metadata)
    Reg->>Registry: register dataset (snap-twitter-combined, metadata)
    Registry-->>Import: datasets and loader available
```

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

v0.3

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | Title clearly summarizes the main changes: CSV loader implementation and SNAP dataset registration, which aligns with the primary objectives. |
| Description check | ✅ Passed | Description provides comprehensive details covering implementation, testing, code quality, examples, and related context. Includes type of change (✨ New feature), test coverage, and checklist items are satisfied. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


@codecov (Bot) commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 94.02985% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.86%. Comparing base (3f39b82) to head (3111a65).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #69   +/-   ##
=======================================
  Coverage   93.86%   93.86%           
=======================================
  Files          17       19    +2     
  Lines        2134     2201   +67     
  Branches      526      542   +16     
=======================================
+ Hits         2003     2066   +63     
- Misses         49       52    +3     
- Partials       82       83    +1     
| Flag | Coverage Δ | |
|---|---|---|
| full-coverage | 93.86% <94.02%> (+<0.01%) | ⬆️ |
| unittests | 75.96% <94.02%> (+1.50%) | ⬆️ |

| Components | Coverage Δ | |
|---|---|---|
| parser | 95.31% <ø> (ø) | |
| planner | 94.20% <ø> (ø) | |
| executor | 89.87% <ø> (ø) | |
| storage | 99.50% <ø> (ø) | |
| ast | 100.00% <ø> (ø) | |
| types | 98.42% <ø> (ø) | |
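The percentages in the report can be reproduced from the hit and line counts in the coverage diff. The sketch below assumes that the patch figure is the change in hits over the change in lines (reasonable here because the PR only adds lines) and that the displayed values are truncated rather than rounded to two decimals; both are inferences from the numbers, not documented behavior taken from this page.

```python
import math


def truncated_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (not rounded) to two decimals."""
    return math.floor(hits / lines * 10_000) / 100


# Counts copied from the coverage diff: hits and lines on base and head.
base_hits, base_lines = 2003, 2134
head_hits, head_lines = 2066, 2201

base_coverage = truncated_percent(base_hits, base_lines)
head_coverage = truncated_percent(head_hits, head_lines)

# The patch adds 67 lines, of which 63 are hit (3 misses and 1 partial remain).
patch_lines = head_lines - base_lines
patch_hits = head_hits - base_hits
patch_coverage = truncated_percent(patch_hits, patch_lines)

print(base_coverage, head_coverage, patch_coverage)
```

Under these assumptions the script yields 93.86 for both base and head and 94.02 for the patch, matching the flag table, while the untruncated patch ratio 63/67 is the 94.02985% quoted at the top of the report.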

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@coderabbitai (Bot, Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/graphforge/datasets/loaders/csv.py`:
- Around line 80-95: The parser currently uses stripped_line.split(delimiter)
which, when delimiter is a single space, yields empty tokens for consecutive
spaces; update the parsing logic in the CSV loader (around
self._detect_delimiter, delimiter, and the block that assigns
source_id/target_id) so that if delimiter == " " you call stripped_line.split()
(no argument) to collapse consecutive whitespace before extracting source_id and
target_id, and keep the existing error handling; also add a unit test for the
CSV loader that parses a line with multiple consecutive spaces (e.g., "0  1 
0.5") to ensure no empty node IDs are produced.

In `@src/graphforge/datasets/sources/snap.py`:
- Around line 13-19: Wrap the call to register_loader("csv", CSVLoader) inside a
try/except in register_snap_datasets so a duplicate-registration error doesn't
abort the rest of the function; specifically, call register_loader("csv",
CSVLoader) in a try block and catch the duplicate-loader exception (e.g.,
DuplicateLoaderError or the specific exception your registry throws) and
silently ignore it (or log.debug) then continue with the dataset registration
logic so datasets still get registered; re-raise any other unexpected
exceptions.

Comment thread src/graphforge/datasets/loaders/csv.py
Comment thread src/graphforge/datasets/sources/snap.py
CSV Loader Improvements:
- Handle consecutive spaces in space-delimited files by using split()
  without arguments when delimiter is space
- Prevents empty node IDs from being created with malformed data
- Add comprehensive test for multiple consecutive spaces with weights

SNAP Registration Improvements:
- Make register_snap_datasets() idempotent by catching duplicate
  registration errors
- Wrap register_loader() and register_dataset() in try-except blocks
- Gracefully handle cases where registry was already initialized
- Simplify test fixture since registration is now idempotent

Testing:
- Add test_load_multiple_consecutive_spaces() test case
- Verify no empty node IDs created with consecutive spaces
- All 1108 tests passing, coverage 95.59%

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@DecisionNerd (Owner, Author) commented

Additional Improvements

CSV Loader Enhancements

  • Fixed consecutive whitespace handling: Now uses split() without arguments when delimiter is a space, which properly collapses consecutive spaces
  • Added test coverage: New test_load_multiple_consecutive_spaces() test verifies parsing of data with multiple consecutive spaces and weights
  • Prevents empty node IDs: Ensures malformed data with excessive spacing doesn't create empty node IDs
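The difference between the two `split` calls is standard Python behavior and is easy to check in isolation. The helper below is a hypothetical stand-in for the loader's parsing step, written only to show why the fix prevents empty node IDs.

```python
from typing import List


def split_fields(line: str, delimiter: str) -> List[str]:
    """Split one edge-list line, collapsing whitespace for space delimiters."""
    stripped = line.strip()
    if delimiter == " ":
        # No argument: runs of whitespace count as a single separator.
        return stripped.split()
    return stripped.split(delimiter)


line = "0  1   0.5"

# Splitting on a literal space yields empty tokens between consecutive spaces.
naive = line.split(" ")

# The space-aware helper yields only the three real fields.
fixed = split_fields(line, " ")

print(naive)
print(fixed)
```

With the naive call the second field is an empty string, which is what would have been used as the target node ID before the fix.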

SNAP Registration Robustness

  • Idempotent registration: register_snap_datasets() now safely handles being called multiple times
  • Graceful error handling: Both loader and dataset registration wrapped in try-except blocks to handle duplicate registration
  • Simplified test fixtures: No longer need error handling in test setup since registration is idempotent

Testing

  • ✅ 1108 tests passing (added 1 new test)
  • ✅ Coverage: 95.59%
  • ✅ All pre-push checks passing

@coderabbitai (Bot, Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/graphforge/datasets/sources/snap.py`:
- Around line 100-103: The `# noqa: PERF203` on the except clause is unused and
should be removed to satisfy strict linting; edit the try/except block that
calls register_dataset(dataset) (the except ValueError as e: handler) and delete
the `# noqa: PERF203` comment so the except line reads simply `except ValueError
as e:` (no other changes needed).

Comment thread src/graphforge/datasets/sources/snap.py Outdated
- Replace try-except in loop with registry check before registration
- Avoids PERF203 linting warning about performance overhead
- More efficient: checks registry directly instead of catching exceptions
- Import _DATASET_REGISTRY for idempotent registration check
- All tests passing, cleaner implementation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@DecisionNerd DecisionNerd merged commit 9793a92 into main Feb 4, 2026
19 checks passed
@DecisionNerd DecisionNerd deleted the feat/csv-loader-snap-datasets branch February 4, 2026 01:21
DecisionNerd added a commit that referenced this pull request Feb 4, 2026
Version Bump:
- Bump version to 0.2.1 in pyproject.toml, __init__.py, and uv.lock
- Add comprehensive v0.2.1 changelog entry

Documentation Updates:
- Update README.md with dataset loading examples and quickstart
- Add "Load Real-World Datasets" section to main README
- Update docs/index.md with dataset features and examples
- Complete rewrite of docs/datasets/snap.md:
  - Mark as available in v0.2.1 (5 datasets)
  - Add detailed dataset table with stats
  - Add comprehensive usage examples and query patterns
  - Document download/caching behavior
  - Add performance tips for large datasets
- Update docs/datasets/overview.md:
  - Reorganize to show SNAP as "Available Now"
  - Mark other sources as "Coming Soon"
  - List all 5 available SNAP datasets
- Update docs/getting-started/quickstart.md:
  - Add "Load a Dataset" section with examples
  - Add dataset browsing examples
  - Update navigation links

Release Contents (v0.2.1):
- Dataset loading infrastructure with caching (#68)
- CSV loader for edge-list datasets (#69)
- 5 SNAP datasets available
- MERGE ON CREATE SET syntax (#65)
- MERGE ON MATCH SET syntax (#66)
- WITH clause variable passing fix (#67)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
DecisionNerd added a commit that referenced this pull request Feb 4, 2026
* docs: prepare v0.2.1 release with dataset documentation


* chore: update uv.lock after version bump

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>