
kg cleanup + reintroducing deep extraction & classification #4949


Merged: 6 commits into main, Jun 26, 2025

Conversation

Orbital-Web (Contributor)

Description

Mostly just moving things around and removing unused code.
Also adds a template for where classification, deep extraction, etc. should go once we reintroduce them.

How Has This Been Tested?

Tested locally. No change in logic, at least for the currently supported KG features.

Backporting (check the box to trigger backport action)

Note: you have to check that the action passes; otherwise, resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@Orbital-Web Orbital-Web requested a review from a team as a code owner June 25, 2025 23:41

vercel bot commented Jun 25, 2025

The latest updates on your projects:

internal-search: ✅ Ready, updated Jun 26, 2025 7:49pm (UTC)


@greptile-apps greptile-apps bot left a comment


PR Summary

Major cleanup of Knowledge Graph (KG) module, removing unused code and restructuring components while preserving core functionality.

  • Removed numerous unused classes from backend/onyx/kg/models.py (KGProcessingStatus, KGChunkRelationship, etc.) and simplified remaining models like KGChunkFormat
  • Renamed key functions in backend/onyx/kg/utils/extraction_utils.py for clarity (e.g., kg_document_entities_relationships_attribute_generation → kg_implied_extraction)
  • Added placeholder infrastructure in extraction modules for future deep extraction and classification features
  • Simplified Vespa interactions by renaming get_document_chunks_for_kg_processing to get_document_vespa_contents and removing KG-specific fields
  • Improved type handling and error management across KG modules while maintaining existing supported features

5 files reviewed, 2 comments

@@ -25,10 +25,6 @@ def KG_COVERAGE_START_DATE(self) -> datetime:
return datetime.strptime(self.KG_COVERAGE_START, "%Y-%m-%d")


class KGProcessingStatus(BaseModel):
Orbital-Web (author):

all of these were unused



class KGClassificationContent(BaseModel):
class KGMetadataContent(BaseModel):
Orbital-Web (author):

renamed from KGClassificationContent (and removed unused properties)
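A minimal sketch of the rename described above, with dataclasses standing in for the Pydantic models and illustrative field names (the actual fields live in backend/onyx/kg/models.py):

```python
from dataclasses import dataclass, fields

# Old model (fields illustrative): carried classification-specific
# properties that were no longer used anywhere.
@dataclass
class KGClassificationContent:
    document_id: str
    classification_content: str
    classification_class: str = ""  # example of an unused property

# New model after the rename: trimmed to the fields that are actually read.
@dataclass
class KGMetadataContent:
    document_id: str
    classification_content: str
```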

@@ -216,7 +138,7 @@ class KGEntityTypeInstructions(BaseModel):
metadata_attribute_conversion: dict[str, KGAttributeProperty]
classification_instructions: KGClassificationInstructions
extraction_instructions: KGExtractionInstructions
filter_instructions: dict[str, Any] | None = None
entity_filter_attributes: dict[str, Any] | None = None
Orbital-Web (author):

renamed

classification_class: str


class KGImpliedExtractionResults(BaseModel):
Orbital-Web (author):

renamed from KGDocumentEntitiesRelationshipsAttributes (also renamed kg_core_document_id_name to document_entity)

for unprocessed_document in unprocessed_document_batch
],
batch_size=processing_chunk_batch_size,
batch_metadata = get_batch_documents_metadata(
Orbital-Web (author) commented Jun 26, 2025:

Documents are already batched; no need to pass batch_size here and batch them up again, so I removed that argument and simplified the function logic.
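The point above can be sketched as a before/after. This is a hypothetical illustration (function bodies and the "id" metadata are made up, not the actual Onyx implementation) of why re-batching an already-batched input is redundant:

```python
def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Before: the helper re-batched its input even though the caller already
# passes one batch of documents at a time.
def get_batch_documents_metadata_old(document_ids, batch_size):
    metadata = []
    for batch in batched(document_ids, batch_size):  # redundant second layer
        metadata.extend({"id": doc_id} for doc_id in batch)
    return metadata

# After: operate directly on the single batch the caller provides.
def get_batch_documents_metadata_new(document_ids):
    return [{"id": doc_id} for doc_id in document_ids]
```

Both produce the same result; the second simply drops the redundant batch_size plumbing.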

logger.info(f"Processing document batch {document_batch_counter}")

# Get the document attributes and entity types
batch_metadata: dict[str, KGEnhancedDocumentMetadata] = _get_batch_metadata(
batch_metadata = _get_batch_documents_enhanced_metadata(
Orbital-Web (author):

Again, the documents are already batched here.

unprocessed_document,
batch_metadata[unprocessed_document.id],
active_entity_types,
kg_config_settings,
)
)

# TODO 2. perform deep extraction and classification
# 2. prepare inputs for deep extraction and classification
if batch_metadata[unprocessed_document.id].deep_extraction:
Orbital-Web (author) commented Jun 26, 2025:

This will not be true yet, so this if case can largely be ignored for now. I'm leaving it here as a template (although it should in theory run, and may even work). A few things need to be sorted out first to fully implement it.
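The template branch described here might look roughly like this. A hedged sketch only: the KGEnhancedDocumentMetadata fields and the processing function are illustrative, not the real code:

```python
from dataclasses import dataclass

@dataclass
class KGEnhancedDocumentMetadata:
    # stays False until deep extraction is fully reintroduced
    deep_extraction: bool = False

def process_unprocessed_document(
    document_id: str, metadata: KGEnhancedDocumentMetadata
) -> dict:
    result = {"document_id": document_id, "deep_extracted": False}
    # 2. prepare inputs for deep extraction and classification
    if metadata.deep_extraction:
        # placeholder branch: kept as a template, never taken for now
        result["deep_extracted"] = True
    return result
```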

# staging objects
# this will also allow us to move the KGChunkFormat extraction inside this function,
# removing the need for slow vespa querying for non-deep extraction chunks
def _kg_chunk_batch_extraction(
Orbital-Web (author) commented Jun 26, 2025:

Everything past here was unused, and a lot of the logic is no longer needed. The actual extraction logic has been moved into kg_deep_extract_chunks.

document_classification_content: KGClassificationContent,
category_list: str,
category_definition_string: str,
def kg_deep_extraction(
Orbital-Web (author):

A mix of _kg_document_classification and _kg_chunk_batch_extraction, broken down into smaller pieces and reorganized. It will never run as of now.

@@ -113,50 +111,6 @@ def extract_relationship_type_id(relationship_id_name: str) -> str:
)


def aggregate_kg_extractions(
Orbital-Web (author):

completely unused and unnecessary now (upsert takes care of counts)
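The "upsert takes care of counts" point can be illustrated with a hypothetical sketch (SQLite here for self-containment; table and column names are made up): accumulating occurrence counts at write time makes a separate in-memory aggregation pass unnecessary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kg_entity (id_name TEXT PRIMARY KEY, occurrences INTEGER)"
)

def upsert_entity(conn, id_name, count=1):
    # ON CONFLICT bumps the stored count in place, so callers never have
    # to pre-aggregate extractions before writing
    conn.execute(
        """
        INSERT INTO kg_entity (id_name, occurrences) VALUES (?, ?)
        ON CONFLICT(id_name) DO UPDATE
        SET occurrences = occurrences + excluded.occurrences
        """,
        (id_name, count),
    )

# the same entity extracted twice simply ends up with occurrences = 2
for entity in ["ACCOUNT:acme", "ACCOUNT:acme", "PERSON:alice"]:
    upsert_entity(conn, entity)

counts = dict(conn.execute("SELECT id_name, occurrences FROM kg_entity"))
```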

from onyx.utils.logger import setup_logger

logger = setup_logger()


def get_document_classification_content_for_kg_processing(
Orbital-Web (author):

moved to get_batch_documents_metadata as this has nothing to do with vespa anymore.

category_definitions: dict[str, KGEntityTypeClassificationInfo],
kg_config_settings: KGConfigSettings,
) -> KGDocumentClassificationPrompt:
def get_batch_documents_metadata(document_ids: list[str]) -> list[KGMetadataContent]:
Orbital-Web (author):

exact same as get_document_classification_content_for_kg_processing from vespa_interactions.py. Moved here as it makes more sense

@@ -140,84 +78,3 @@ def get_document_chunks_for_kg_processing(
# Yield any remaining chunks
if current_batch:
yield current_batch


def _get_classification_content_from_call_chunks(
Orbital-Web (author) commented Jun 26, 2025:

The following two pieces of logic were moved into kg_classify_document.

@@ -292,62 +369,40 @@ def kg_process_person(
return None


def prepare_llm_content_extraction(
Orbital-Web (author):

The following logic was moved into kg_deep_extract_chunks.

)
)

return classification_instructions_dict


def get_entity_types_str(active: bool | None = None) -> str:
Orbital-Web (author) commented Jun 26, 2025:

Moved this and the relationship one to extraction_utils, as they're called by the deep extraction function in extraction_utils (otherwise we'd have a circular import).
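The circular-import concern can be sketched in a single file (a hypothetical illustration; the real functions live in extraction_utils and the entity-type values are made up): once get_entity_types_str sits next to its caller, the import graph points in one direction only.

```python
from __future__ import annotations

# Before the move: extraction_utils imported get_entity_types_str from a
# module that itself imported extraction_utils, i.e., a circular import.
# After the move: the helper lives alongside kg_deep_extraction, so other
# modules can import from here without anything importing back.

def get_entity_types_str(active: bool | None = None) -> str:
    entity_types = ["ACCOUNT", "PERSON"]  # illustrative values
    return ", ".join(entity_types)

def kg_deep_extraction(document_id: str) -> str:
    # calls the helper directly; no cross-module import needed
    return f"{document_id} -> entity types: {get_entity_types_str()}"
```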

@Orbital-Web Orbital-Web changed the title from "kg cleanup" to "kg cleanup + reintroducing deep extraction & classification" Jun 26, 2025

@evan-onyx evan-onyx left a comment


hooray deleted lines!!

@Orbital-Web Orbital-Web enabled auto-merge June 26, 2025 20:00
@Orbital-Web Orbital-Web disabled auto-merge June 26, 2025 21:46
@Orbital-Web Orbital-Web merged commit 211102f into main Jun 26, 2025
10 of 12 checks passed
@Orbital-Web Orbital-Web deleted the kg-cleanup branch June 26, 2025 21:46