fix: small-corpus path #1986

Open: wants to merge 1 commit into base branch main

KartikVashishta commented Jun 25, 2025

Description

Indexing a tiny corpus with CacheType.memory crashed with
ValueError: Columns must be same length as key (followed by related dtype errors).

This PR hardens three spots:

  1. build_noun_graph._extract_edges

    • Expand the edges column into source/target via pd.DataFrame, padding malformed rows and dropping NaNs, so the assignment can no longer raise a broadcast error.
  2. graph_to_dataframes

    • Store NumPy embeddings as lists so the single-column assignment can’t fail to broadcast.
  3. prune_graph

    • a) Return early for empty graphs.
    • b) Apply astype(str) to source/target before merges to avoid a float64 vs. object dtype mismatch.
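
For reviewers, the edge-expansion idea in item 1 can be sketched roughly as below. This is a minimal illustration and not the code in this PR: `expand_edges` and the `weight` column are hypothetical names, and it assumes each `edges` cell holds a (source, target) pair or something malformed (too short, too long, or missing).

```python
import pandas as pd


def expand_edges(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: split an `edges` column into source/target."""
    # Pad or truncate every cell to exactly two items, so a malformed row
    # becomes [value, None] or [None, None] instead of changing the shape.
    pairs = [
        (list(e)[:2] + [None, None])[:2]
        if isinstance(e, (list, tuple))
        else [None, None]
        for e in df["edges"]
    ]
    # Building the frame with explicit columns always yields two columns,
    # even when `pairs` is empty.
    expanded = pd.DataFrame(pairs, columns=["source", "target"], index=df.index)
    out = df.drop(columns=["edges"]).join(expanded)
    # Rows that could not supply both endpoints are dropped.
    return out.dropna(subset=["source", "target"]).reset_index(drop=True)


sample = pd.DataFrame(
    {
        "edges": [("a", "b"), ("c",), None, ("d", "e", "f")],
        "weight": [1.0, 2.0, 3.0, 4.0],
    }
)
print(expand_edges(sample))
```

Because the two-column shape is fixed before any assignment, an empty or ragged `edges` column yields an empty or shorter frame rather than a shape mismatch.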

Related Issues

Closes #1983

Proposed Changes

  • graphrag/index/operations/build_noun_graph/build_noun_graph.py
  • graphrag/index/operations/graph_to_dataframes.py
  • graphrag/index/operations/prune_graph.py
  • New test tests/unit/indexing/graph/test_small_corpus_bug.py
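
As a rough illustration of the graph_to_dataframes change, the sketch below stores one embedding per row as a plain Python list. It is an assumption-laden example and not the code in this PR: `attach_embeddings`, the `id` column, and the `embedding` column name are hypothetical.

```python
import numpy as np
import pandas as pd


def attach_embeddings(nodes: pd.DataFrame, embeddings: np.ndarray) -> pd.DataFrame:
    """Hypothetical helper: put one embedding vector in each row of one column."""
    out = nodes.copy()
    # Turn the (n_rows, dim) array into n_rows Python lists and wrap them in
    # an object Series, so pandas treats each vector as a single cell value.
    out["embedding"] = pd.Series(
        [row.tolist() for row in embeddings], index=out.index, dtype=object
    )
    return out


nodes = pd.DataFrame({"id": ["a", "b"]})
vectors = np.array([[0.1, 0.2], [0.3, 0.4]])
result = attach_embeddings(nodes, vectors)
print(result)
```

Each cell then holds a whole vector, so pandas never has to map a 2-D array onto a single column.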

Checklist

  • Tested locally on a one-line corpus (input/tiny.txt) with IndexingMethod.Fast & CacheType.memory.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Additional Notes

These fixes touch only the small-corpus edge cases; normal pipelines are unaffected.
Heads-up: the prompt files were also altered, and those changes can be discarded. Shout if you’d
like them dropped or squashed.

* build_noun_graph – safely expand edge tuples
* graph_to_dataframes – store embeddings as list objects
* prune_graph – guard empty graphs + dtype-match source/target
microsoft#1983
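
The prune_graph guards summarized in the commit message can be sketched as follows. This is a simplified stand-in working on plain DataFrames, not the function in this PR: `merge_edge_weights` and the `weight` column are hypothetical, and the empty float64 frame only imitates what a tiny corpus might produce.

```python
import pandas as pd


def merge_edge_weights(edges: pd.DataFrame, weights: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: join edges to weights on (source, target)."""
    keys = ["source", "target"]
    # Guard: nothing to merge, so return an empty frame with the expected columns.
    if edges.empty:
        return pd.DataFrame(columns=[*keys, "weight"])
    # Cast the key columns on both sides to str so a float64 key column
    # (typical of an empty frame) can be merged with an object one.
    left = edges.astype({k: str for k in keys})
    right = weights.astype({k: str for k in keys})
    return left.merge(right, on=keys, how="left")


edges = pd.DataFrame({"source": ["a"], "target": ["b"]})
weights = pd.DataFrame(
    {
        "source": pd.Series([], dtype="float64"),
        "target": pd.Series([], dtype="float64"),
        "weight": pd.Series([], dtype="float64"),
    }
)
merged = merge_edge_weights(edges, weights)
print(merged)
```

One caveat of this approach: astype(str) turns missing values into the literal string "nan", so it is safest when the key columns have already had NaNs dropped.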
KartikVashishta requested a review from a team as a code owner on June 25, 2025 06:36
@KartikVashishta
Author

@KartikVashishta please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree

Development

Successfully merging this pull request may close these issues.

[Bug]: ValueError: Columns must be same length as key