kg cleanup + reintroducing deep extraction & classification #4949
Conversation
PR Summary
Major cleanup of the Knowledge Graph (KG) module, removing unused code and restructuring components while preserving core functionality.
- Removed numerous unused classes from backend/onyx/kg/models.py (KGProcessingStatus, KGChunkRelationship, etc.) and simplified remaining models like KGChunkFormat
- Renamed key functions in backend/onyx/kg/utils/extraction_utils.py for clarity (e.g., kg_document_entities_relationships_attribute_generation → kg_implied_extraction)
- Added placeholder infrastructure in extraction modules for future deep extraction and classification features
- Simplified Vespa interactions by renaming get_document_chunks_for_kg_processing to get_document_vespa_contents and removing KG-specific fields
- Improved type handling and error management across KG modules while maintaining existing supported features
5 files reviewed, 2 comments (Greptile)
@@ -25,10 +25,6 @@ def KG_COVERAGE_START_DATE(self) -> datetime:
        return datetime.strptime(self.KG_COVERAGE_START, "%Y-%m-%d")


class KGProcessingStatus(BaseModel):
all of these were unused
-class KGClassificationContent(BaseModel):
+class KGMetadataContent(BaseModel):
renamed from KGClassificationContent (and removed unused properties)
@@ -216,7 +138,7 @@ class KGEntityTypeInstructions(BaseModel):
    metadata_attribute_conversion: dict[str, KGAttributeProperty]
    classification_instructions: KGClassificationInstructions
    extraction_instructions: KGExtractionInstructions
-   filter_instructions: dict[str, Any] | None = None
+   entity_filter_attributes: dict[str, Any] | None = None
renamed
    classification_class: str


class KGImpliedExtractionResults(BaseModel):
renamed from KGDocumentEntitiesRelationshipsAttributes (also renamed kg_core_document_id_name to document_entity)
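For orientation, here is a minimal sketch of the renamed model. Only the class name, the BaseModel parent, and the document_entity rename are confirmed by this diff; the result fields are illustrative assumptions.

```python
from pydantic import BaseModel


class KGImpliedExtractionResults(BaseModel):
    # formerly kg_core_document_id_name on KGDocumentEntitiesRelationshipsAttributes
    document_entity: str
    # hypothetical result fields; the actual model likely differs
    entities: set[str] = set()
    relationships: set[str] = set()
```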
            for unprocessed_document in unprocessed_document_batch
        ],
-       batch_size=processing_chunk_batch_size,
+   batch_metadata = get_batch_documents_metadata(
documents are already batched. No need to pass batch_size here and batch it up again, so removed that argument and simplified the function logic
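Using the get_batch_documents_metadata signature that appears later in this diff, the simplified call site presumably reduces to a sketch like this (variable names are taken from the surrounding diff context):

```python
# The incoming documents already form one batch, so the metadata lookup
# takes their IDs directly instead of re-batching with batch_size.
batch_metadata = get_batch_documents_metadata(
    document_ids=[
        unprocessed_document.id
        for unprocessed_document in unprocessed_document_batch
    ],
)
```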
    logger.info(f"Processing document batch {document_batch_counter}")

    # Get the document attributes and entity types
-   batch_metadata: dict[str, KGEnhancedDocumentMetadata] = _get_batch_metadata(
+   batch_metadata = _get_batch_documents_enhanced_metadata(
again, documents are already batched here
                unprocessed_document,
                batch_metadata[unprocessed_document.id],
                active_entity_types,
                kg_config_settings,
            )
        )

-       # TODO 2. perform deep extraction and classification
+       # 2. prepare inputs for deep extraction and classification
+       if batch_metadata[unprocessed_document.id].deep_extraction:
This will not be true yet, so this if-case can largely be ignored for now. Leaving it here as a template (it should in theory run, and maybe even work), but a few things need to be sorted out before it is fully implemented.
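As a minimal sketch, the dormant branch presumably just stages the document for the deep path; the staging list here is hypothetical, not code from the PR:

```python
# deep_extraction is never True yet; this branch only marks where deep
# extraction and classification will be wired in once prerequisites land.
if batch_metadata[unprocessed_document.id].deep_extraction:
    # hypothetical staging of inputs for the later kg_deep_extraction call
    deep_extraction_inputs.append(unprocessed_document)
```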
# staging objects
# this will also allow us to move the KGChunkFormat extraction inside this function,
# removing the need for slow vespa querying for non-deep extraction chunks
def _kg_chunk_batch_extraction(
everything past here was unused. A lot of the logic is no longer needed; the actual extraction logic has moved into kg_deep_extract_chunks now
-   document_classification_content: KGClassificationContent,
-   category_list: str,
-   category_definition_string: str,
+def kg_deep_extraction(
A mix of _kg_document_classification and _kg_chunk_batch_extraction, broken down into smaller bits and reorganized. It will never run as of now.
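As a rough sketch of the reorganization, assuming kg_deep_extraction simply sequences the two helpers named elsewhere in this PR (the signatures here are illustrative, not the actual ones):

```python
def kg_deep_extraction(document, chunks, kg_config_settings):
    # classification step, absorbed from the old _kg_document_classification
    classification = kg_classify_document(document, kg_config_settings)
    # chunk-level extraction, absorbed from the old _kg_chunk_batch_extraction
    return kg_deep_extract_chunks(chunks, classification, kg_config_settings)
```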
@@ -113,50 +111,6 @@ def extract_relationship_type_id(relationship_id_name: str) -> str:
    )


def aggregate_kg_extractions(
completely unused and unnecessary now (upsert takes care of counts)
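The separate aggregation pass is redundant when counts are maintained at write time. A generic sketch of that upsert pattern with SQLAlchemy on Postgres follows; the table and column names are assumptions, not Onyx's actual schema:

```python
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session


def upsert_entity(session: Session, entity_table, id_name: str) -> None:
    # Insert the entity with an initial count, or bump the count if the row
    # already exists; no separate aggregation over extractions is needed.
    stmt = insert(entity_table).values(id_name=id_name, occurrence_count=1)
    stmt = stmt.on_conflict_do_update(
        index_elements=[entity_table.c.id_name],
        set_={"occurrence_count": entity_table.c.occurrence_count + 1},
    )
    session.execute(stmt)
```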
from onyx.utils.logger import setup_logger

logger = setup_logger()


def get_document_classification_content_for_kg_processing(
moved to get_batch_documents_metadata as this has nothing to do with vespa anymore.
-   category_definitions: dict[str, KGEntityTypeClassificationInfo],
-   kg_config_settings: KGConfigSettings,
-) -> KGDocumentClassificationPrompt:
+def get_batch_documents_metadata(document_ids: list[str]) -> list[KGMetadataContent]:
exact same as get_document_classification_content_for_kg_processing from vespa_interactions.py. Moved here as it makes more sense.
@@ -140,84 +78,3 @@ def get_document_chunks_for_kg_processing(
    # Yield any remaining chunks
    if current_batch:
        yield current_batch
def _get_classification_content_from_call_chunks(
the following two pieces of logic were moved into kg_classify_document
@@ -292,62 +369,40 @@ def kg_process_person(
        return None


def prepare_llm_content_extraction(
the following logic was moved into kg_deep_extract_chunks
        )
    )

    return classification_instructions_dict


def get_entity_types_str(active: bool | None = None) -> str:
moved this and the relationship one to extraction_utils, as they're called by the deep extraction function in extraction_utils (otherwise we'd have a circular import)
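To see the cycle the move avoids, here is a schematic with hypothetical module names (not the actual file layout):

```python
# Before the move, the imports pointed at each other:
#
#   prompt_utils.py:       from extraction_utils import kg_deep_extraction
#   extraction_utils.py:   from prompt_utils import get_entity_types_str
#
# Python resolves imports at module load time, so loading either module pulls
# in the other mid-initialization and fails. Hosting get_entity_types_str (and
# the relationship variant) inside extraction_utils removes the second import,
# leaving a one-way dependency.
```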
hooray deleted lines!!
Description
Mostly just moving things around and removing unused code.
Writing some templates for where classification, deep extraction, etc. should go once we reintroduce them.
How Has This Been Tested?
Tested locally. There is currently no change in logic, at least for the supported features of KG.
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes; otherwise, resolve the conflicts manually and tag the patches.