
kg cleanup + reintroducing deep extraction & classification #4949


Merged: 6 commits into main, Jun 26, 2025

Conversation

Orbital-Web (Contributor)

Description

Mostly just moving things around and removing unused code.
Also adds a template for where classification, deep extraction, etc. should go once we reintroduce them.

How Has This Been Tested?

Tested locally. No change in logic, at least for the currently supported KG features.

Backporting (check the box to trigger backport action)

Note: you have to check that the action passes; otherwise, resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@Orbital-Web Orbital-Web requested a review from a team as a code owner June 25, 2025 23:41

vercel bot commented Jun 25, 2025

The latest updates on your projects:

internal-search: ✅ Ready, updated Jun 26, 2025 7:49pm (UTC)


@greptile-apps greptile-apps bot left a comment


PR Summary

Major cleanup of Knowledge Graph (KG) module, removing unused code and restructuring components while preserving core functionality.

  • Removed numerous unused classes from backend/onyx/kg/models.py (KGProcessingStatus, KGChunkRelationship, etc.) and simplified remaining models like KGChunkFormat
  • Renamed key functions in backend/onyx/kg/utils/extraction_utils.py for clarity (e.g., kg_document_entities_relationships_attribute_generation → kg_implied_extraction)
  • Added placeholder infrastructure in extraction modules for future deep extraction and classification features
  • Simplified Vespa interactions by renaming get_document_chunks_for_kg_processing to get_document_vespa_contents and removing KG-specific fields
  • Improved type handling and error management across KG modules while maintaining existing supported features

5 files reviewed, 2 comments

@@ -25,10 +25,6 @@ def KG_COVERAGE_START_DATE(self) -> datetime:
return datetime.strptime(self.KG_COVERAGE_START, "%Y-%m-%d")


class KGProcessingStatus(BaseModel):
Orbital-Web (author):

all of these were unused



class KGClassificationContent(BaseModel):
class KGMetadataContent(BaseModel):
Orbital-Web (author):

renamed from KGClassificationContent (and removed unused properties)
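A minimal sketch of the rename described above, with dataclasses standing in for the Pydantic models and illustrative field names (the actual fields live in backend/onyx/kg/models.py):

```python
from dataclasses import dataclass, fields

# Old model (fields illustrative): carried classification-specific
# properties that were no longer used anywhere.
@dataclass
class KGClassificationContent:
    document_id: str
    classification_content: str
    classification_class: str = ""  # example of an unused property

# New model after the rename: trimmed to the fields that are actually read.
@dataclass
class KGMetadataContent:
    document_id: str
    classification_content: str
```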

@@ -216,7 +138,7 @@ class KGEntityTypeInstructions(BaseModel):
metadata_attribute_conversion: dict[str, KGAttributeProperty]
classification_instructions: KGClassificationInstructions
extraction_instructions: KGExtractionInstructions
filter_instructions: dict[str, Any] | None = None
entity_filter_attributes: dict[str, Any] | None = None
Orbital-Web (author):

renamed

classification_class: str


class KGImpliedExtractionResults(BaseModel):
Orbital-Web (author):

renamed from KGDocumentEntitiesRelationshipsAttributes (also renamed kg_core_document_id_name to document_entity)

for unprocessed_document in unprocessed_document_batch
],
batch_size=processing_chunk_batch_size,
batch_metadata = get_batch_documents_metadata(
Orbital-Web (author) commented Jun 26, 2025:

Documents are already batched; no need to pass batch_size here and batch them up again, so I removed that argument and simplified the function logic.
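The point above can be sketched as a before/after. This is a hypothetical illustration (function bodies and the "id" metadata are made up, not the actual Onyx implementation) of why re-batching an already-batched input is redundant:

```python
def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Before: the helper re-batched its input even though the caller already
# passes one batch of documents at a time.
def get_batch_documents_metadata_old(document_ids, batch_size):
    metadata = []
    for batch in batched(document_ids, batch_size):  # redundant second layer
        metadata.extend({"id": doc_id} for doc_id in batch)
    return metadata

# After: operate directly on the single batch the caller provides.
def get_batch_documents_metadata_new(document_ids):
    return [{"id": doc_id} for doc_id in document_ids]
```

Both produce the same result; the second simply drops the redundant batch_size plumbing.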

logger.info(f"Processing document batch {document_batch_counter}")

# Get the document attributes and entity types
batch_metadata: dict[str, KGEnhancedDocumentMetadata] = _get_batch_metadata(
batch_metadata = _get_batch_documents_enhanced_metadata(
Orbital-Web (author):

Again, the documents are already batched here.

unprocessed_document,
batch_metadata[unprocessed_document.id],
active_entity_types,
kg_config_settings,
)
)

# TODO 2. perform deep extraction and classification
# 2. prepare inputs for deep extraction and classification
if batch_metadata[unprocessed_document.id].deep_extraction:
Orbital-Web (author) commented Jun 26, 2025:

This will not be true yet, so this if case can largely be ignored for now. I'm leaving it here as a template (although it should in theory run, and may even work). A few things need to be sorted out first to fully implement it.
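The template branch described here might look roughly like this. A hedged sketch only: the KGEnhancedDocumentMetadata fields and the processing function are illustrative, not the real code:

```python
from dataclasses import dataclass

@dataclass
class KGEnhancedDocumentMetadata:
    # stays False until deep extraction is fully reintroduced
    deep_extraction: bool = False

def process_unprocessed_document(
    document_id: str, metadata: KGEnhancedDocumentMetadata
) -> dict:
    result = {"document_id": document_id, "deep_extracted": False}
    # 2. prepare inputs for deep extraction and classification
    if metadata.deep_extraction:
        # placeholder branch: kept as a template, never taken for now
        result["deep_extracted"] = True
    return result
```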

# staging objects
# this will also allow us to move the KGChunkFormat extraction inside this function,
# removing the need for slow vespa querying for non-deep extraction chunks
def _kg_chunk_batch_extraction(
Orbital-Web (author) commented Jun 26, 2025:

Everything past here was unused, and a lot of the logic is no longer needed. The actual extraction logic has been moved into kg_deep_extract_chunks.

document_classification_content: KGClassificationContent,
category_list: str,
category_definition_string: str,
def kg_deep_extraction(
Orbital-Web (author):

A mix of _kg_document_classification and _kg_chunk_batch_extraction, broken down into smaller pieces and reorganized. It will never run as of now.

@@ -113,50 +111,6 @@ def extract_relationship_type_id(relationship_id_name: str) -> str:
)


def aggregate_kg_extractions(
Orbital-Web (author):

completely unused and unnecessary now (upsert takes care of counts)
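The "upsert takes care of counts" point can be illustrated with a hypothetical sketch (SQLite here for self-containment; table and column names are made up): accumulating occurrence counts at write time makes a separate in-memory aggregation pass unnecessary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kg_entity (id_name TEXT PRIMARY KEY, occurrences INTEGER)"
)

def upsert_entity(conn, id_name, count=1):
    # ON CONFLICT bumps the stored count in place, so callers never have
    # to pre-aggregate extractions before writing
    conn.execute(
        """
        INSERT INTO kg_entity (id_name, occurrences) VALUES (?, ?)
        ON CONFLICT(id_name) DO UPDATE
        SET occurrences = occurrences + excluded.occurrences
        """,
        (id_name, count),
    )

# the same entity extracted twice simply ends up with occurrences = 2
for entity in ["ACCOUNT:acme", "ACCOUNT:acme", "PERSON:alice"]:
    upsert_entity(conn, entity)

counts = dict(conn.execute("SELECT id_name, occurrences FROM kg_entity"))
```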

from onyx.utils.logger import setup_logger

logger = setup_logger()


def get_document_classification_content_for_kg_processing(
Orbital-Web (author):

moved to get_batch_documents_metadata as this has nothing to do with vespa anymore.

category_definitions: dict[str, KGEntityTypeClassificationInfo],
kg_config_settings: KGConfigSettings,
) -> KGDocumentClassificationPrompt:
def get_batch_documents_metadata(document_ids: list[str]) -> list[KGMetadataContent]:
Orbital-Web (author):

exact same as get_document_classification_content_for_kg_processing from vespa_interactions.py. Moved here as it makes more sense

@@ -140,84 +78,3 @@ def get_document_chunks_for_kg_processing(
# Yield any remaining chunks
if current_batch:
yield current_batch


def _get_classification_content_from_call_chunks(
Orbital-Web (author) commented Jun 26, 2025:

The following two pieces of logic were moved into kg_classify_document.

@@ -292,62 +369,40 @@ def kg_process_person(
return None


def prepare_llm_content_extraction(
Orbital-Web (author):

The following logic was moved into kg_deep_extract_chunks.

)
)

return classification_instructions_dict


def get_entity_types_str(active: bool | None = None) -> str:
Orbital-Web (author) commented Jun 26, 2025:

Moved this and the relationship one to extraction_utils, as they're called by the deep extraction function in extraction_utils (otherwise we'd have a circular import).
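The circular-import concern can be sketched in a single file (a hypothetical illustration; the real functions live in extraction_utils and the entity-type values are made up): once get_entity_types_str sits next to its caller, the import graph points in one direction only.

```python
from __future__ import annotations

# Before the move: extraction_utils imported get_entity_types_str from a
# module that itself imported extraction_utils, i.e., a circular import.
# After the move: the helper lives alongside kg_deep_extraction, so other
# modules can import from here without anything importing back.

def get_entity_types_str(active: bool | None = None) -> str:
    entity_types = ["ACCOUNT", "PERSON"]  # illustrative values
    return ", ".join(entity_types)

def kg_deep_extraction(document_id: str) -> str:
    # calls the helper directly; no cross-module import needed
    return f"{document_id} -> entity types: {get_entity_types_str()}"
```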

@Orbital-Web Orbital-Web changed the title from "kg cleanup" to "kg cleanup + reintroducing deep extraction & classification" Jun 26, 2025

@evan-onyx evan-onyx left a comment


hooray deleted lines!!

@Orbital-Web Orbital-Web enabled auto-merge June 26, 2025 20:00
@Orbital-Web Orbital-Web disabled auto-merge June 26, 2025 21:46
@Orbital-Web Orbital-Web merged commit 211102f into main Jun 26, 2025
10 of 12 checks passed
@Orbital-Web Orbital-Web deleted the kg-cleanup branch June 26, 2025 21:46