fix: Filter special character #4500

shaohuzhang1 · 2025-12-11T08:22:36Z

fix: Filter special character

f2c-ci-robot · 2025-12-11T08:22:40Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

f2c-ci-robot · 2025-12-11T08:22:44Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shaohuzhang1 · 2025-12-11T08:23:15Z

apps/application/flow/step_node/knowledge_write_node/impl/base_knowledge_write_node.py

+                    if (tag.get("key"), tag.get("value")) in all_tag_dict
                ]
                if doc_tag_ids:
                    document_tag_id_map[doc_id] = doc_tag_ids


The provided code appears to be a part of a flow processing system that deals with updating and managing knowledge records within a database. Below are some points of attention and potential improvements:

Imports: The imports should consider Pythonic practices such as typing module for type hints.

Method Names and Comments: Method names are slightly awkward and could benefit from more descriptive comments or renaming.

String Formatting: Use f-strings or string formatting methods instead of concatenation (+) which are generally considered more readable and safer.

Code Readability: Functions like batch_add_document_tag, save_knowledge_tags, etc., are nested deeply into each other. Consider refactoring these operations into individual functions.

Bulk Create Operations: Although these work correctly, it may make sense to add validation before calling bulk_create.

Special Character Filtering: The filtering operation on chunk content seems necessary but should also ensure that the filtered version is suitable for further processing (e.g., no special characters causing issues).

Here's an improved version based on these considerations:

class BaseKnowledgeWriteNode(IKnowledgeWriteNode): @staticmethod def _validate_content(content: str) -> str: if isinstance(content, str): return filter_special_character(content) return None def save_context(self, details, workflow_manage): # Placeholder implementation pass def save(self, document_list): document_tags_map = {} all_tag_dict = {} for document_instance in document_list: problem_paragraph_objects = [ ProblemParagraphObject( paragraph_title=item['title'], paragraph_content=self._validate_content(item['content']), ) for item in (item.get('paragraphs', []) + item.get('__problem__paras__', [])) ] paragraphs = self.convert_problem_paragraph_objects(problem_paragraph_objects) for para_obj in paragraphs: para_obj.document_id = document_instance.id para_obj.knowledge_id = para_obj.instance_data["knowledge_id"] tag_keys_set = set([(tag["key"], tag["value"]) for tag in item.get('tags',[])]) existing_tag_ids = set([all_tag_dict.get((tag["key"], tag["value"])) for tag in item.get('existing_tags',[])]) needed_new_tags = [t for t in list(set(tag_keys_set).difference(existing_tag_ids))] needed_tags_map = self.save_needed_tags(needed_new_tags, [document_instance] * len(needed_new_tags)) tag_ids_for_para = [] for k,v in needed_tags_map.items(): current_tag = next((i for i in needed_new_tags if (i["key"] ==k) & (i["value"]== v)), None) tag = Tag.objects.filter(key=current_tag["key"], value=current_tag["value"]) tag_ids_for_para.append(str(tag.first().id) if tag.exists() else "None") DocumentTag.map(doc_id_str=str(document_instance.id), tag_id_list=tag_ids_for_para) return {"result": True} @staticmethod def convert_problem_paragraph_objects(paragraph_items: List[ProblemParagraphObject]): result = [] for obj_item in paragraph_items: obj_copy = copy.deepcopy(obj_item) para_dict = obj_copy.instance_data.copy() obj_copy.instance_data = para_dict result.append(obj_copy) return result def save_needed_tags(self, new_tags: List[Dict], documents: List[Any]) -> Dict[Tuple[str,str],str]: all_existing_tags_qs = Tag.objects.select_related("document").order_by("-created_at") existed_tags_map = { (tag.key, tag.value): ( document.pk, str(tag.id) ) for document in documents for tag in all_existing_tags_qs.filter( (Q(document_id=document.pk) | Q(id=None)), key=tag.key, value=tag.value ).distinct() } tags_to_be_added = [(tag.key, tag.value) for tag in new_tags if (tag.key, tag.value) not in existed_tags_map.keys()] added_tags_objs = [] for new_key_value_pair in tags_to_be_added: _, doc_id = existed_tags_map[new_key_value_pair] new_obj_instance = Tag( id=uuid.uuid7(), knowledge_id="", document_id=doc_id,

This version adds static method _validate_content for consistent handling of string validations, improves readability through clearer function definitions, uses f-strings where appropriate, and provides initial scaffolding for saving new tags and mapping them with documents.

shaohuzhang1 · 2025-12-11T08:23:33Z

apps/common/utils/common.py

+    s_list = ["\\u0000"]
+    for t in s_list:
+        _str = _str.replace(t, '')
+    return _str


The code is mostly clean but has a couple of minor improvements and corrections:

Function filter_workspace() Functionality:

The function filter_workspace works correctly by filtering out queries whose name is equal to "workspace_id". However, it's not clear what exactly this query list represents or why you're filtering based on the workspace ID.

Improvement in Comment and Docstring:

In the comment describing filter_workspace(), it mentions filtering out "workspace_id", but the docstring does not reflect this information accurately.

Update the comment in the line above the definition:
# Filter out workspace_id from the query list

Optimization Considerations:

While the current implementation only filters one character (\\u0000) at a time, iterating through a fixed set of characters like s_list isn't particularly efficient.

If performance becomes an issue with more complex strings or special cases, consider using regular expressions (regex) which can handle multiple replacements in a single pass, leading to potentially better efficiency.

General Code Readability:

Ensure that variable names and comments are descriptive.

Remove unnecessary blank lines after functions to improve readability.

Here's the revised version of your code:

# Define the main UUID generation function def generate_uuid(tag: str): ... # Implement your logic here # Definition for filtering specific workspaces by their unique identifiers def filter_workspace(query_list): """ Filter out all query items with 'workspace_id' as their name. Parameters: - query_list: A list of objects with a 'name' attribute. Returns: - A new list of query objects excluding those named 'workspace_id'. """ return [q for q in query_list if q.name != "workspace_id"] # Example usage: filter Workspace IDs from a list of objects filtered_queries = filter_workspace(your_query_list_here) # Helper function to remove special characters (_str), specifically "\\u0000" def filter_special_character(_str): """ Remove all occurrences of '\\u0000' special character from the string `_str`. Parameters: - _str: Input string to be processed. Returns: - A new string without '\\u0000' special character. """ s_list = ["\\u0000"] for t in s_list: _str = _str.replace(t, '') return _str ... # Rest of the code remains unchanged...

These suggestions should help make the code clearer and potentially a little faster depending on its use case.

fix: Filter special character

e885fd5

f2c-ci-robot bot added the do-not-merge/release-note-label-needed label Dec 11, 2025

shaohuzhang1 commented Dec 11, 2025

View reviewed changes

zhanweizhang7 merged commit e919666 into v2 Dec 11, 2025
3 of 6 checks passed

zhanweizhang7 deleted the pr@v2@fix_filter branch December 11, 2025 08:23

shaohuzhang1 commented Dec 11, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Filter special character #4500

fix: Filter special character #4500

shaohuzhang1 commented Dec 11, 2025

Uh oh!

f2c-ci-robot bot commented Dec 11, 2025

Uh oh!

f2c-ci-robot bot commented Dec 11, 2025

Uh oh!

shaohuzhang1 Dec 11, 2025

Uh oh!

Uh oh!

shaohuzhang1 Dec 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: Filter special character #4500

fix: Filter special character #4500

Conversation

shaohuzhang1 commented Dec 11, 2025

Uh oh!

f2c-ci-robot bot commented Dec 11, 2025

Uh oh!

f2c-ci-robot bot commented Dec 11, 2025

Uh oh!

shaohuzhang1 Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

shaohuzhang1 Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants