# SOAP: Semantic-Oriented Alignment for Prescription Generation

- author: Jiale Cai
- date: 2025-03-30
- email: mc36401@um.edu.mo

## 0. Data Overview

We use the first case from our test dataset for demonstration. Load it as follows:

In [None]:
import json
with open("test.json", "r", encoding="utf-8") as file:
    data = json.load(file)
    test = data[0]
print(test)

{
    "id": 1, 
    "question": "患者男，3岁7月。饱食后呕吐半年余，加重1日。患儿半年前无明显诱因出现餐后呕吐不消化食物，口气臭秽，脘腹胀满，吐后觉舒，大便秘结，泻下酸臭，未予以重视，1日前呕吐量较以往增多，行腹部DR提示不完全性肠梗阻。舌质红，苔厚腻，脉滑数有力，指纹紫滞。",
    "answer": {
        "symptoms": ["呕吐", "腹胀", "便秘", "指纹紫滞", "红舌", "厚舌苔", "腻舌苔", "脉有力", "数脉", "滑脉"],
        "disease": "积滞",
        "syndrome": "饮食停滞证",
        "therapy": ["消食化滞","和胃降逆"],
        "formula": "保和丸",
        "herbs": ["炒建曲", "焦山楂", "姜半夏", "炒莱菔子", "陈皮", "连翘", "生姜", "炒鸡内金", "麸炒枳壳"]
    }
}


- Diagnostic Interpretation (ID:1)
  - TCM Diagnosis: `Pediatric vomiting (小儿呕吐)` | `Food stagnation syndrome (饮食积滞证)`
  - Western Diagnosis: `Intestinal obstruction (肠梗阻)`
  - Therapeutic Principle: `Digestive regulation and stomach function normalization (消食导滞，和胃降逆)`
  - Recommended Formula: `Modified Baohe Pill` - `保和丸加减`.
  - Herbal Composition: `"炒建曲","焦山楂","姜半夏","炒莱菔子","陈皮","连翘","生姜","炒鸡内金","麸炒枳壳"`

Each test case comes from real clinical records, containing:

- **"id"**: Unique case identifier
- **"question"**: Clinical consultation information
  - Generally includes the patient's chief complaints and medical history (`subjective /S`) and examination findings (`objective /O`)
  - Covers TCM diagnostic methods (observation, auscultation, inquiry, palpation) and modern medical tests
- **"answer"**: Structured diagnosis results
  - `Assessment /A`
    - **symptoms**: Standardized symptom terms (GB/T 16751.1-2023)
    - **disease**: TCM disease diagnosis (GB/T 16751.1-2023)
    - **syndrome**: TCM syndrome pattern (GB/T 16751.2-2021)
  - `Prescription /P`
    - **therapy**: Treatment protocol (GB/T 16751.3-2023)
    - **formula**: Herbal formula (GB/T 31773-2015)
    - **herbs**: Standardized herb combinations (GB/T 31774-2015)

> **Note**: All terms used here comply with the Chinese national terminology standards (the numbers in parentheses are GB/T standard numbers). The recommended herbal combinations are effective prescriptions derived from long-term clinical tracking. In practice, physicians make personalized adjustments according to each patient's condition, so the effectiveness of a model-generated prescription should ultimately be judged by clinical practitioners. The basis for the herbal combination (syndrome, therapy methods, and prescription formula), however, is uniquely determined, so these “gold answers” can be assessed with metrics such as Recall@K.
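
To make the metric mentioned in the note concrete, here is a minimal sketch of Recall@K and exact-match precision/recall/F1 over herb lists. The function names and the `predicted_herbs` list are illustrative assumptions, not part of the SOAP codebase, and matching is by exact string, so processed variants such as `神曲` and `炒建曲` count as different herbs unless a synonym mapping is applied first.

```python
from typing import Dict, List


def recall_at_k(predicted: List[str], gold: List[str], k: int) -> float:
    """Fraction of gold items that appear among the first k predictions."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    top_k = set(predicted[:k])
    return len(gold_set & top_k) / len(gold_set)


def set_scores(predicted: List[str], gold: List[str]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 between two herb lists."""
    pred_set, gold_set = set(predicted), set(gold)
    hits = len(pred_set & gold_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Gold herbs of test case 1; the prediction is a hypothetical model output
gold_herbs = ["炒建曲", "焦山楂", "姜半夏", "炒莱菔子", "陈皮", "连翘", "生姜", "炒鸡内金", "麸炒枳壳"]
predicted_herbs = ["陈皮", "神曲", "姜半夏", "茯苓"]

print(recall_at_k(predicted_herbs, gold_herbs, k=3))
print(set_scores(predicted_herbs, gold_herbs))
```

The same two functions apply unchanged to the `therapy` lists, and a single-valued field such as `syndrome` or `formula` can be wrapped in a one-element list.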

## 1. Naive QA System

This section presents a simple question and answer system where a user inputs a question and the system generates an answer. We use the `glm-4-flash` model, which is relatively advanced among current models.

In [2]:
# language model support
import os
from typing import Dict, List, Optional

from openai import OpenAI

def chatbot(
    APIKEY: Optional[str] = None,
    BASEURL: str = "https://open.bigmodel.cn/api/paas/v4/",
    MODEL: str = "glm-4-flash",
    MESSAGE: Optional[List[Dict]] = None,
):
    # Read the key from the environment instead of hard-coding it in the notebook.
    # ZHIPUAI_API_KEY is an assumed variable name; use whatever your setup defines.
    if APIKEY is None:
        APIKEY = os.environ["ZHIPUAI_API_KEY"]
    # Build the demo conversation per call to avoid a mutable default argument.
    if MESSAGE is None:
        MESSAGE = [
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': 'Hello!'},
        ]
    client = OpenAI(
        api_key=APIKEY,
        base_url=BASEURL,
    )
    response = client.chat.completions.create(
        model=MODEL,
        messages=MESSAGE,
        temperature=0.7,  # [0, 2]; higher values make the output more random
        top_p=0.7,  # nucleus sampling; lower values restrict sampling to the most likely tokens
        presence_penalty=-0.5,  # [-2, 2]; positive values push toward new topics, negative values favor staying on topic
        frequency_penalty=1,  # [-2, 2]; positive values discourage repeating the same tokens
        n=1,
        max_tokens=2048,
        stop=None,
        stream=False,
    )
    return response.choices[0].message.content

In [None]:
message = [
    {'role':'system','content':'请你扮演一位临床中医师，你的任务是根据给出的临床信息，对当前患者的症状做进一步的推理，最终给出相应处方的中草药组成。'},
    # System prompt: "Please act as a clinical traditional Chinese medicine practitioner.
    # Your task is to reason further about the patient's symptoms based on the provided clinical information and
    # ultimately give the herbal composition of the corresponding prescription."
    {'role':'user','content':f'临床信息如下：{test["question"]}'}
    # User prompt: "The clinical information is as follows: {test["question"]}"
]

pred = chatbot(MESSAGE=message)
print(f"The prediction of case {test['id']} is: \n {pred}")

The prediction of case 1 is: 
 根据您提供的临床信息，这位3岁7个月的男孩可能患有脾胃湿热型呕吐。以下是对患者症状的推理分析及相应的中草药处方建议：

1. **症状分析**：
   - 饱食后呕吐：提示脾胃功能失调，饮食不节。
   - 口气臭秽：脾胃湿热，浊气上蒸。
   - 脘腹胀满：脾胃气机不畅，湿阻中焦。
   - 吐后觉舒：脾胃之气暂时得通。
   - 大便秘结，泻下酸臭：肠道积热，湿邪内阻。
   - 舌质红，苔厚腻：脾胃湿热内蕴。
   - 脉滑数有力：内有湿热，脉象滑数为湿邪内阻之象。
   - 指纹紫滞：血行不畅，可能为气滞血瘀。

2. **诊断**：
   - 脾胃湿热型呕吐，不完全性肠梗阻。

3. **处方建议**：
   - **中草药组成**：
     - 黄连：清热燥湿，泻火解毒。
     - 黄芩：清热燥湿，泻火解毒，善于清中焦湿热。
     - 厚朴：行气化湿，消胀除满。
     - 陈皮：理气健脾，燥湿化痰。
     - 神曲：消食化积，和中止泻。
     - 茯苓：健脾利湿，渗湿止泻。
     - 大黄：清热泻火，通便导滞。
     - 枳实：破气消积，行气止痛。
     - 甘草：调和诸药，缓急止痛。

   - **用法**：上述药物按比例配伍，煎煮成汤剂，每日一剂，分两次服用。

4. **注意事项**：
   - 饮食宜清淡易消化，避免油腻、生冷、辛辣食物。
   - 注意休息，避免过度劳累。
   - 观察病情变化，如有加重或不适，应及时就医。

请注意，以上处方仅供参考，具体用药需根据患者的实际情况和医生的建议进行调整。在用药过程中，应密切观察患者的反应，如有不适，应及时停药并咨询医生。


In [4]:
test = {
    "id": 1, 
    "question": "患者男，3岁7月。饱食后呕吐半年余，加重1日。患儿半年前无明显诱因出现餐后呕吐不消化食物，口气臭秽，脘腹胀满，吐后觉舒，大便秘结，泻下酸臭，未予以重视，1日前呕吐量较以往增多，行腹部DR提示不完全性肠梗阻。舌质红，苔厚腻，脉滑数有力，指纹紫滞。",
    "answer": {
        "symptoms": ["呕吐", "腹胀", "便秘", "指纹紫滞", "红舌", "厚舌苔", "腻舌苔", "脉有力", "数脉", "滑脉"],
        "disease": "呕吐",
        "syndrome": "饮食停滞证",
        "therapy": ["消食化滞","和胃降逆"],
        "formula": "保和丸",
        "herbs": ["炒建曲", "焦山楂", "姜半夏", "炒莱菔子", "陈皮", "连翘", "生姜", "炒鸡内金", "麸炒枳壳"]
    },
    "prediction": {
        "disease": "呕吐",
        "syndrome": "脾胃湿热证",
        "therapy": ["消食导滞"],
        "formula": "枳实导滞丸",
        "herbs": ["黄连","黄芩","厚朴","陈皮","神曲","茯苓","大黄","枳实","甘草"]
    }
}

### 1.1. Analysis

> - The model-selected prescription covers commonly used Chinese herbs for eliminating `food stagnation`, promoting qi movement to reduce distension, and clearing heat to relieve constipation: `"黄连","黄芩","厚朴","陈皮","神曲","茯苓","大黄","枳实","甘草"`. These herbs are effective for food accumulation, gastrointestinal qi stagnation, and heat-induced constipation. The addition and subtraction part also considers targeted medication for different symptoms, showing a certain flexibility.
> - `Lack of consideration for children's physiological characteristics`: Children have relatively weak spleen and stomach functions, so medication must be more cautious, and some herbs may conflict or combine poorly. For example, `大黄 (Rhubarb)` is effective in clearing heat and relieving constipation, but its dosage and combination need extra caution for a child aged 3 years and 7 months to avoid adverse reactions such as excessive purging. The model-predicted `枳实导滞丸` (Citrus Aurantium and Rhubarb Pill) is more potent than the gold-standard `保和丸` (Baohe Pill) and may therefore be unsuitable for this pediatric patient. Moreover, the syndrome diagnosed by the model (`脾胃湿热证`) is also incorrect.

## 2. Conventional Retrieval-Augmented Generation (Embedding-based Retrieval EBR)

Now, we provide the knowledge and see how the performance of the large model is enhanced with the support of the retrieval system.
- The knowledge we provide comes from the authoritative textbook `Traditional Chinese Internal Medicine`, which includes potential solutions for the aforementioned `"vomiting"` case.
- First, we split the text into `text chunks`. Generally, text is split with a `sliding window approach`. For example, with a window size of 1,500 and no overlap between windows, the text is split into consecutive chunks of 1,500 characters each.
- Then we vectorize the text, and here we use `bge-large-zh-v1.5` for embedding.
- Finally, we use the retrieval system to query the information and then generate the output with the language model.
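
The chunking step above uses consecutive, non-overlapping windows. As a minimal sketch of the more general sliding-window variant, the function below lets neighbouring chunks share `overlap` characters, so a sentence cut at one boundary still appears whole in the next chunk. The function name and the `overlap` parameter are illustrative assumptions and are not used elsewhere in this notebook.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1500, overlap: int = 0) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # how far the window moves each step
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += stride
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, overlap=2))
```

With `overlap=0` the function reduces to fixed-size chunking; a larger overlap trades more chunks (and more embedding calls) for fewer sentences split across a boundary.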

In [5]:
"""
Conventional RAG 
"""

import torch
import numpy as np
from typing import List, Dict
from FlagEmbedding import FlagModel

# embedding models: https://huggingface.co/BAAI/bge-large-zh-v1.5
EMBEDDING_MODEL_PATH = "F:/soap/models/BAAI/bge-large-zh-v1.5"

# Prompts: "Generate vectors for the following Traditional Chinese Medicine Knowledge to be used for retrieving: "
QUERY_INSTRUCTION = "为以下中医知识生成向量，以便用于检索：" 

# embedding model support
class CustomEmbeddingModel(FlagModel):
    def __init__(self, **kwargs):
        super().__init__(
            model_name_or_path=EMBEDDING_MODEL_PATH,
            query_instruction_for_retrieval=QUERY_INSTRUCTION,
            use_fp16=False,
            **kwargs
        )

embedding_model = CustomEmbeddingModel()

def get_embedding(text:str):
    """
    Get the embedding vector of the given text.
    Parameters:
        text (str): the text to be embedded

    Returns:
        np.array: the embedding vector of the given text
    """
    inputs = [text]
    with torch.no_grad():
        outputs= embedding_model.encode(inputs)
    embedding_vector = outputs[0]
    return embedding_vector

def cos_similarity(q, p):
    """
    Calculate the cosine similarity between two vectors.
    q,p: vector numpy.array
    """
    return np.dot(q, p) / ((np.linalg.norm(q) * np.linalg.norm(p)) + 1e-8)

def text_chunker(file: str, chunk_size: int = 1500) -> List[str]:
    """
    Split the text into chunks of specified size.
    Parameters:
        file (str): the file path of the text
        chunk_size (int): the size of each chunk

    Returns:
        List[str]: a list of chunks
    """
    with open(file, "r", encoding="utf-8") as f:
        text = f.read()
        text = text.replace("\n", " ")
        print("The first 50 characters of the text are: ", text[:50])
        print("The length of the text is: ", len(text))
    chunks = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks



In [6]:
vomiting_knowledge = "vomiting.txt"
vomiting_knowledge_chunks = text_chunker(vomiting_knowledge)
vomiting_knowledge_chunks_vector = []
for chunk_id, chunk in enumerate(vomiting_knowledge_chunks):
    vomiting_knowledge_chunks_vector.append({"id": chunk_id, "vector": get_embedding(chunk), "text": chunk})

# Score every chunk against the query and keep the best match
query_vector = get_embedding(test["question"])
score = -1.0  # lower bound of cosine similarity, so the first chunk always replaces it
most_relevant_chunk_id = None
for vk in vomiting_knowledge_chunks_vector:
    sim_score = cos_similarity(query_vector, vk['vector'])
    if sim_score > score:
        score = sim_score
        most_relevant_chunk_id = vk['id']
print(f"The most relevant chunk is id:{most_relevant_chunk_id} with a similarity score of {score}")
print(vomiting_knowledge_chunks_vector[most_relevant_chunk_id])



The first 50 characters of the text are:  第四节 呕吐 呕吐是由于胃失和降、胃气上逆所致的以饮食、痰涎等胃内之物从胃中上涌，自口而出为临床特征
The length of the text is:  7594
The most relevant chunk is id:4 with a similarity score of 0.6320789831755625
{'id': 4, 'vector': array([ 0.04332899,  0.04660345,  0.03002791, ..., -0.04130913,
       -0.02275172, -0.04633958], dtype=float32), 'text': '急，郁郁微烦者，为未解也，与大柴胡汤下之则愈。”  《金匮要略·呕吐哕下利病脉证治》：“呕而胸满者，茱萸汤主之。”“呕而肠鸣，心下痞者，半夏泻心汤主之。”“诸呕吐，谷不得下者，小半夏汤主之。”“食已即吐者，大黄甘草汤主之。”  《诸病源候论·呕哕候》：“呕吐者，皆由脾胃虚弱，受于风邪所为也。”  《三因极一病证方论·呕吐叙论》：“呕吐虽本于胃，然所因亦多端，故有饮食寒热气血之不同，皆使人呕吐。”  《医学正传·呕吐》：“外有伤寒，阳明实热太甚而吐逆者；有内伤饮食，填塞太阴，以致胃气不得宣通而吐者；有胃热而吐者；有胃寒而吐者；有久病气虚，胃气衰甚，闻谷气则呕哕者；有脾湿太甚，不能运化精微，致清痰留饮郁滞上中二焦，时时恶心吐清水者，宜各以类推而治之，不可执一见也。”．  《症因脉治·呕吐论》：“秦子曰：呕以声响名，吐以吐物言。有声无物曰呕，有物无声曰吐，有声有物曰呕吐，皆阳明胃家所主。”  【现代研究】  近年来对于呕吐进行了一些研究，取得了一定的成效。王氏治疗神经性呕吐40例，其中属肝胃不和型26例，胃阴不足型8例，肝胆火盛型6例。病程3个月以内23例，3个月至半年15例，半年以上2例。服用下述基本方：伏龙肝、代赭石、半夏、竹茹、茵陈、枳壳、木香、生麦芽、山药、鸡山金。每剂以伏龙肝60g布包先煎20分钟代水，后下诸药煎煮300mi药液，视呕吐轻重分2-3次温服，每次间隔20分钟，每日2剂，早晚各1剂。连续服用10日为1个疗程，并随证略有加减。结果：临床治愈31例（77．5％），好转7例（

In [8]:
message = [
    {'role':'system','content':'请你扮演一位临床中医师，你的任务是根据给出的临床信息，对当前患者的症状做进一步的推理，最终给出相应处方的中草药组成。'},
    # System prompt: "Please act as a clinical traditional Chinese medicine practitioner.
    # Your task is to reason further about the patient's symptoms based on the provided clinical information and
    # ultimately give the herbal composition of the corresponding prescription."
    {'role':'user','content':f'临床信息如下：{test["question"]}\n补充参考资料：{vomiting_knowledge_chunks[most_relevant_chunk_id]}'}
    # Add support text
]

pred = chatbot(MESSAGE=message)
print(f"The prediction of case {test['id']} is: \n {pred}")

The prediction of case 1 is: 
 根据所提供的临床信息，患者为3岁7个月大的男孩，主要症状为饱食后呕吐半年余，加重1日，伴有口气臭秽、脘腹胀满、大便秘结、泻下酸臭等。结合舌质红、苔厚腻、脉滑数有力、指纹紫滞等体征，可以推断患者可能存在脾胃湿热、气机不畅、胃热内蕴等问题。

结合《金匮要略》中的相关论述，患者症状与“呕而胸满者，茱萸汤主之”以及“呕而肠鸣，心下痞者，半夏泻心汤主之”相符合。因此，可以采用以下中草药组成方剂：

**处方：**
1. 黄连6g - 清热燥湿，泻火解毒
2. 黄芩9g - 清热燥湿，泻火解毒
3. 姜半夏9g - 化痰止呕，降逆止呕
4. 党参9g - 健脾益气，扶正固本
5. 干姜6g - 温中散寒，回阳救逆
6. 大枣4枚 - 补中益气，调和诸药
7. 炙甘草6g - 补中益气，调和诸药
8. 茯苓9g - 利水渗湿，健脾宁心
9. 竹茹9g - 清热化痰，降逆止呕
10. 枳实9g - 行气化湿，消痞除满

**煎服法：**
将上述中草药加水1000ml，煎煮30分钟，去渣取汁，分早晚两次温服。

**注意事项：**
1. 本方剂适用于脾胃湿热、气机不畅、胃热内蕴的患者，如症状不符，请及时调整。
2. 建议患者饮食清淡，避免油腻、辛辣、生冷食物。
3. 若症状加重或持续不缓解，请及时就医。

以上处方仅供参考，具体用药请遵医嘱。


In [9]:
test = {
    "id": 1, 
    "question": "患者男，3岁7月。饱食后呕吐半年余，加重1日。患儿半年前无明显诱因出现餐后呕吐不消化食物，口气臭秽，脘腹胀满，吐后觉舒，大便秘结，泻下酸臭，未予以重视，1日前呕吐量较以往增多，行腹部DR提示不完全性肠梗阻。舌质红，苔厚腻，脉滑数有力，指纹紫滞。",
    "answer": {
        "symptoms": ["呕吐", "腹胀", "便秘", "指纹紫滞", "红舌", "厚舌苔", "腻舌苔", "脉有力", "数脉", "滑脉"],
        "disease": "呕吐",
        "syndrome": "饮食停滞证",
        "therapy": ["消食化滞","和胃降逆"],
        "formula": "保和丸",
        "herbs": ["炒建曲", "焦山楂", "姜半夏", "炒莱菔子", "陈皮", "连翘", "生姜", "炒鸡内金", "麸炒枳壳"]
    },
    "prediction": {
        "disease": "呕吐",
        "syndrome": "脾胃湿热证",
        "therapy": ["消食导滞"],
        "formula": "枳实导滞丸",
        "herbs": ["黄连","黄芩","厚朴","陈皮","神曲","茯苓","大黄","枳实","甘草"]
    },
    "prediction_rag": {
        "disease": "呕吐",
        "syndrome": "脾胃湿热证",
        "therapy": ["化痰止呕","行气化湿"],
        "formula": "半夏泻心汤",
        "herbs": ["黄连","黄芩","姜半夏","党参","干姜","大枣","炙甘草","茯苓","竹茹","枳实"]
    }
}

### 2.1 Analysis

> This time, the model-selected prescription still covers commonly used Chinese herbs for eliminating food stagnation, promoting qi movement to reduce distension, and clearing heat to relieve constipation. However, because the retrieved text chunk is incoherent, the generation quality is worse than before. This is a very common problem in RAG (Retrieval-Augmented Generation), mainly caused by `the lack of contextual information` in the text chunks or by `the inconsistency and fragmentation of the text` chunks themselves.
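
The retrieval step in this section keeps only the single highest-scoring chunk. A minimal sketch of a top-k variant is shown below, in pure Python with toy 2-dimensional vectors so that it runs without the embedding model. `top_k_chunks` is a hypothetical helper, and returning more chunks does not by itself fix the fragmentation problem described above.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)


def top_k_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings
query = [1.0, 0.0]
chunks = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
print(top_k_chunks(query, chunks, k=2))
```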

## 3. Our Methods

To address the challenges posed by inconsistent text chunks and the lack of contextual information in traditional RAG models, we introduce a prescription generation system based on Large Language Models (LLM), which seamlessly integrates Retrieval-Augmented Generation (RAG) with semantic networks constructed in a Knowledge Graph (KG).

Our system (Figure 1) comprises two phases: 
- First, during the KG construction phase, the system builds a comprehensive knowledge graph from the latest medical guidelines and medical literature. Each medical concept is represented as a structured semantic node, and nodes are interlinked according to their relational context. The system also generates an embedding for each node to support later semantic search.
- Second, during the question-answering phase, the system parses clinical queries to identify named entities and intents, then navigates the KG to find the related sub-graphs used to generate answers.
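
As a rough illustration of the second phase, the sketch below collects the sub-graph reachable from a set of seed entities within a fixed number of hops. It assumes that sub-graph selection can be approximated by a hop-limited breadth-first search over outgoing edges; the text above does not specify the actual traversal strategy, and `k_hop_subgraph` is a hypothetical name. The toy edges reuse relations that appear in the parsing examples of this notebook.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)


def k_hop_subgraph(edges: List[Edge], seeds: List[str], max_hops: int = 2) -> List[Edge]:
    """Collect every edge reachable from the seeds within max_hops outgoing steps."""
    adjacency: Dict[str, List[Edge]] = {}
    for edge in edges:
        adjacency.setdefault(edge[0], []).append(edge)

    visited: Set[str] = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    collected: List[Edge] = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for edge in adjacency.get(node, []):
            collected.append(edge)
            target = edge[2]
            if target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))
    return collected


edges = [
    ("外邪犯胃证", "HAS_THERAPY", "疏邪解表"),
    ("外邪犯胃证", "HAS_THERAPY", "和胃降逆"),
    ("外邪犯胃证", "HAS_FORMULA", "藿香正气散"),
    ("藿香正气散", "HAS_HERB", "藿香"),
    ("藿香正气散", "HAS_HERB", "紫苏"),
]
for source, relation, target in k_hop_subgraph(edges, ["外邪犯胃证"], max_hops=2):
    print(f"{source} -[{relation}]-> {target}")
```

With `max_hops=1` only the therapy and formula edges are returned; the second hop adds the herbs of the formula.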

### 3.1 Semantic Network Construction

#### 3.1.1 Graph Structure Definition.

In defining the knowledge graph structure for medical knowledge representation, we employ a semantic-level architecture that segregates different medical entities based on **`semantic labels`**, as illustrated in Figure 1.

- A **Semantic Node** $N_i(A,E,R)$ models a medical concept $c_i$. The attribute $(label_{i}:id_{i}) \in A$ is a unique combination of a semantic label and a code, identifying a distinct medical concept entity $c_{i}$ that belongs to a specific semantic space $label$. Each edge $e \in E$ and relation $r \in R$ denotes a hierarchical connection and its relation type within the context of concept $c_{i}$.

- The **Semantic Network Graph** $G(N,E,R)$ represents the network of connections across different semantic concepts, incorporating both explicit links $E_{exp}$, defined in medical knowledge, and implicit connections $E_{imp}$, derived from semantic similarity between medical concepts. For implicit connections, we leverage cosine similarity between the embedding vectors of text-rich parts of medical description, a method adaptable to specific use cases.

For instance, Figure 1 portrays "Common Cold" as a semantic node. It has a direct link to "Wind-heat syndrome", indicating an explicit relationship. It is also implicitly connected with "Flu" because of their semantic similarity.
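
The definitions above can be sketched as plain data classes. This is an illustrative in-memory layout only: the class names, the `HAS_SYNDROME` relation name, and the similarity value are assumptions made for the example, not the storage format used by the system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class NodeKey:
    """The (label : id) pair that uniquely identifies a concept."""
    label: str
    id: int


@dataclass
class SemanticNode:
    key: NodeKey
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class SemanticGraph:
    nodes: Dict[NodeKey, SemanticNode] = field(default_factory=dict)
    # Explicit edges carry a relation type; implicit edges carry a similarity score
    explicit_edges: List[Tuple[NodeKey, str, NodeKey]] = field(default_factory=list)
    implicit_edges: List[Tuple[NodeKey, NodeKey, float]] = field(default_factory=list)

    def add_node(self, node: SemanticNode) -> None:
        self.nodes[node.key] = node

    def neighbors(self, key: NodeKey) -> List[str]:
        """Names of nodes reachable from `key` over explicit, then implicit, edges."""
        names = [self.nodes[t].name for s, _, t in self.explicit_edges if s == key]
        names += [self.nodes[t].name for s, t, _ in self.implicit_edges if s == key]
        return names


graph = SemanticGraph()
cold = SemanticNode(NodeKey("Disease", 0), "Common Cold")
flu = SemanticNode(NodeKey("Disease", 1), "Flu")
wind_heat = SemanticNode(NodeKey("Syndrome", 0), "Wind-heat syndrome")
for node in (cold, flu, wind_heat):
    graph.add_node(node)

graph.explicit_edges.append((cold.key, "HAS_SYNDROME", wind_heat.key))  # assumed relation name
graph.implicit_edges.append((cold.key, flu.key, 0.8))  # illustrative similarity value
print(graph.neighbors(cold.key))
```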

#### 3.1.2 Knowledge Graph Construction. 

Graph construction is delineated into two phases: **Semantic parsing** and **Relation connection**.

1) **Semantic Parsing Phase**: This phase transforms each text-based semantic entity description $n_i$ (e.g., the definition of one specific disease) into a node representation $N_{i}$. 

    We employ a hybrid methodology. Rule-based extraction is applied first to predefined fields, such as "symptoms:", which are identified via keywords. Text that is not amenable to rule-based parsing is then handed to an LLM. The LLM is directed by a YAML template $T_{template}$, which specifies, in graph form, the medical concepts routinely used in clinical support.

$$N_{i} = \mathrm{RuleParse}(n_{i,rule}) \cup \mathrm{LLMParse}(n_{i,llm}, T_{template}, prompt)$$

2) **Relation Connection Phase**: Here, the individual nodes $N_{i}$ are merged into a comprehensive graph $G$. 

- Explicit connections $E_{exp}$ are delineated as specified within medical standards, exemplified by designated fields in GB/T (Chinese National Recommended Standard). 

$$E_{exp} = \{(N_{i}, N_{j}) \mid N_{i} \text{ explicitly connected to } N_{j}\}$$

- Implicit connections $E_{imp}$ are inferred from textual-semantic similarities across different semantic concepts, employing embedding techniques and a threshold mechanism to discern the most relevant medical entity for each medical concept.

$$E_{imp} = \{(N_{i}, N_{j}) \mid \cos(\mathrm{embed}(N_{i}), \mathrm{embed}(N_{j})) > \Theta\}$$
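
The threshold rule for $E_{imp}$ can be sketched in a few lines. The example below compares every pair of nodes, which is only practical for small graphs, and uses made-up 2-dimensional vectors in place of real embeddings; the function name and the threshold value of 0.8 are assumptions for illustration.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)


def implicit_edges(
    embeddings: Dict[str, Sequence[float]],
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """Return (node_a, node_b, similarity) for every pair whose similarity exceeds the threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        similarity = cosine(vec_a, vec_b)
        if similarity > threshold:
            edges.append((name_a, name_b, similarity))
    return edges


# Toy vectors; real node embeddings would come from the text-embedding model
toy_embeddings = {
    "Common Cold": [1.0, 0.1],
    "Flu": [0.9, 0.2],
    "Constipation": [0.0, 1.0],
}
for a, b, sim in implicit_edges(toy_embeddings, threshold=0.8):
    print(f"{a} ~ {b} (cos = {sim:.3f})")
```

In a larger graph the pairwise loop would normally be replaced by a nearest-neighbour query against a vector index.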

#### 3.1.3 Embedding Generation. 

To support embedding-based retrieval, we generate embeddings for graph node values using pre-trained text-embedding models such as BGE, specifically targeting text-rich sections such as "Criteria of diagnosis and differentiation" and "Recommendation". These embeddings are then stored in a vector database (for instance, Neo4j or Milvus). In most cases the text within a node fits the context length of the text-embedding model. For the few texts that are too long, we can safely divide the text into smaller chunks and embed each one individually without losing quality, since all chunks belong to the same section.
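
A minimal sketch of the long-text case: one section of a node is split into pieces, and each piece is embedded separately while staying linked to its node and section. The record layout, the `max_chars=512` budget, and `toy_embed` are assumptions for illustration; in the notebook the `embed` argument would be a real embedding function such as `get_embedding`.

```python
from typing import Callable, Dict, List, Sequence


def embed_node_section(
    node_id: str,
    section: str,
    text: str,
    embed: Callable[[str], Sequence[float]],
    max_chars: int = 512,
) -> List[Dict]:
    """Embed one section of a node, splitting it into pieces of at most max_chars characters."""
    records = []
    for chunk_no, start in enumerate(range(0, len(text), max_chars)):
        piece = text[start:start + max_chars]
        records.append({
            "node_id": node_id,   # every piece points back to the same node
            "section": section,   # and to the same section of that node
            "chunk": chunk_no,
            "text": piece,
            "vector": list(embed(piece)),
        })
    return records


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model: returns a 1-dimensional 'vector'."""
    return [float(len(text))]


records = embed_node_section("Syndrome:0", "symptoms", "呕" * 1100, toy_embed)
print([(r["chunk"], len(r["text"])) for r in records])
```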

In [None]:
# Rule-based parse example: N_i = RuleParse(n_{i,rule})

import re
def rule_parse(text:str)->dict:
    """
    Content structure parse by rule-based method: chapter -> section -> paragraph -> knowledge point
    """
    # chapter parse
    chapter_pattern = re.compile(r'(第[一二三四五六七八九十]+章\s*[^\u4e00-\u9fa5]*[\u4e00-\u9fa5]+)(.*?)(?=第[一二三四五六七八九十]+章|$)') # chapter keywords
    chapters = chapter_pattern.findall(text)
    
    result = {}
    for chapter in chapters:
        chapter_title = chapter[0].strip()
        chapter_content = chapter[1]

        # section parse
        section_pattern = re.compile(r'(第[一二三四五六七八九十]+节\s*[^\u4e00-\u9fa5]*[\u4e00-\u9fa5]+)(.*?)(?=第[一二三四五六七八九十]+节|$)')  # section keywords
        sections = section_pattern.findall(chapter_content)

        chapter_data = {}
        for section in sections:
            section_title = section[0].strip()
            section_content = section[1]

            # paragraph parse
            paragraph_pattern = re.compile(r'(【[^】]+】)(.*?)(?=【|$)')  # paragraph keywords
            paragraphs = paragraph_pattern.findall(section_content)
            
            section_data = {}
            for paragraph in paragraphs:
                paragraph_title = paragraph[0].strip()
                paragraph_content = paragraph[1].strip()

                # knowledge point parse
                knowledge_pattern = re.compile(r'(?<=\s)(·[\u4e00-\u9fa5]{4}.*?)(?=·[\u4e00-\u9fa5]{4}|$)') # knowledge point keywords
                knowledges = knowledge_pattern.findall(paragraph_content)

                if knowledges:
                    knowledge_data = []
                    for knowledge in knowledges:
                        knowledge_data.append(knowledge.strip())
                    section_data[paragraph_title] = knowledge_data
                else:
                    section_data[paragraph_title] = paragraph_content
            
            chapter_data[section_title] = section_data
        
        result[chapter_title] = chapter_data

    return result

import json
def knowledge_parse(text:str):
    """
    Knowledge parse by rule-based method: 
    """
    # Extract components using regular expressions
    name_match = re.search(r"·(.+?)\s+症状", text)
    symptoms_match = re.search(r"症状：(.+?)\s+治法", text)
    therapy_match = re.search(r"治法：(.+?)\s+方药", text)
    formula_match = re.search(r"方药：(.+?)。", text)

    if name_match and symptoms_match and therapy_match and formula_match:
        name = name_match.group(1) + "证"
        symptoms = symptoms_match.group(1).strip()
        therapies = [t.strip() for t in therapy_match.group(1).strip("。").split("，")]
        formula = formula_match.group(1).strip()
        
        # Build the JSON structure
        result = [
            {
                "id": 0,
                "attributes": {
                    "name": name,
                    "label": "Syndrome",
                    "symptoms": symptoms,
                },
                "relations": []
            }
        ]
        
        # Add therapies to relations
        for therapy in therapies:
            result[0]["relations"].append({
                "Syndrome": name,
                "Therapy": therapy,
                "Relation": "HAS_THERAPY"
            })
        
        # Add formula to relations
        result[0]["relations"].append({
            "Syndrome": name,
            "Formula": formula,
            "Relation": "HAS_FORMULA"
        })
        
        # Convert to JSON with proper Unicode handling and indentation
        return json.dumps(result, ensure_ascii=False, indent=4)

    # The text does not follow the "症状 / 治法 / 方药" layout
    return None

# Example
with open("corpus.txt", "r", encoding="utf-8") as f:
    text = f.read().replace("\n", " ")

parsed_data = rule_parse(text)

# Example: Access the content of "Differential Diagnosis and Treatment" (辨证论治) in Section 4 of Chapter 4
# Note: The key names here need to be adjusted according to the actual text
example_chapter = "第四章 脾胃肠病证"
example_section = "第四节 呕吐"
example_paragraph = "【辨证论治】"

if example_chapter in parsed_data:
    if example_section in parsed_data[example_chapter]:
        if example_paragraph in parsed_data[example_chapter][example_section]:
            print(f"{example_chapter} - {example_section} - {example_paragraph}:")
            content = parsed_data[example_chapter][example_section][example_paragraph]
            if isinstance(content, list):
                for i, knowledge in enumerate(content, 1):
                    print(f"Knowledge  Point: {i}: {knowledge}")
                    print(f"Nodes Representation with Json format: {knowledge_parse(knowledge)}")
                    break
            else:
                print(content)
        else:
            print(f"Paragraph '{example_paragraph}' not found")
    else:
        print(f"Section '{example_section}' not found")
else:
    print(f"Chapter '{example_chapter}' not found")

第四章 脾胃肠病证 - 第四节 呕吐 - 【辨证论治】:
Knowledge  Point: 1: ·外邪犯胃  症状：呕吐食物，吐出有力，突然发生，起病较急，常伴有恶寒发热，胸脘满闷，不思饮食，舌苔白，脉濡缓。  治法：疏邪解表，和胃降逆。  方药：藿香正气散。  方中藿香、紫苏、白芷芳香化浊，疏邪解表；厚朴、大腹皮理气除满；白术、茯苓、甘草健脾化湿；陈皮、半夏和胃降逆，共奏疏邪解表，和胃降逆止呕之功。若风邪偏重，寒热无汗，可加荆芥、防风以疏风散寒；若见胸闷腹胀嗳腐。为兼食滞，可加鸡内金、神曲、莱菔子以消积化滞；若身痛，腰痛，头身困重，苔厚腻者，为兼外湿，可加羌活、独活、苍术以除湿健脾；若暑邪犯胃，身热汗出，可用新加香薷饮以解暑化湿；若秽浊犯胃，呕吐甚剧，可吞服玉枢丹以辟秽止呕；若风热犯胃、头痛身热可用银翘散去桔梗之升提，加陈皮、竹茹疏风清热，和胃降逆。
Nodes Representation with Json format: [
    {
        "id": 0,
        "attributes": {
            "name": "外邪犯胃证",
            "label": "Syndrome",
            "symptoms": "呕吐食物，吐出有力，突然发生，起病较急，常伴有恶寒发热，胸脘满闷，不思饮食，舌苔白，脉濡缓。"
        },
        "relations": [
            {
                "Syndrome": "外邪犯胃证",
                "Therapy": "疏邪解表",
                "Relation": "HAS_THERAPY"
            },
            {
                "Syndrome": "外邪犯胃证",
                "Therapy": "和胃降逆",
                "Relation": "HAS_THERAPY"
            },
            {
                "Syndrome": "外邪犯胃证",
                "Formu

In [None]:
# LLM-based Parse Example LLMParse(n{i,llm}, T{template}, prompt)

# load the customized template
import yaml
with open("template.yaml", "r", encoding="utf-8") as f:
    template = yaml.safe_load(f)

# build up the prompt based on the template
def build_prompt(template: dict) -> str:
    prompt = "请执行医学实体识别，要求：\n"
    
    # named entity recognition
    if template["entities"]:
        prompt += "1. 识别以下实体类型：\n"
        for entity in template["entities"]:
            prompt += f"   - {entity['label']}（{entity['description']}），示例：{entity['example']}。{entity['instruction']}\n"
    
    # relation detection
    if template.get("relations"):
        prompt += "\n2. 提取以下语义关系：\n"
        for rel in template["relations"]:
            prompt += f"   - {rel['from']} → {rel['to']} 的 {rel['type']} 关系：{rel['description']}\n"
    
    # output format
    prompt += "\n3. 输出要求：\n"
    prompt += "   - 使用严格JSON格式输出\n"
    prompt += "   - 包含composition（处方组成药材）、function（处方功效）字段\n"
    prompt += "   - 关系字段使用Relation指定关系类型\n"
    prompt += "   - 确保完整提取所有实体和关系\n\n"
    prompt += "输出示例：\n"
    prompt += template["output_format"]["example"]
    
    return prompt

prompt = build_prompt(template)

""" prompt translation:
prompt:

Please perform medical entity recognition with the following requirements:
1.Recognize the following entity types:
    - Formula (Prescription name), example: Mahuang Tang. Identify the prescription names in the input text.
    - Herb (Chinese herbal ingredients), example: Huoxiang, Zisu. Identify the herbal ingredients in the prescription.
    - Therapy (Therapeutic effect), example: Aromatic turbidity-removing, Evacuating exterior pathogens. Identify the therapeutic effects of the prescription.

2. Extract the following semantic relationships:
    - Formula → Herb HAS_HERB relationship: The inclusion relationship between the prescription and its herbal ingredients.
    - Formula → Therapy HAS_THERAPY relationship: The association between the prescription and its therapeutic effects.

3.Output requirements:
    - Use strict JSON format for output.
    - Include fields for composition (herbal ingredients of the prescription) and function (therapeutic effects of the prescription).
    - Use the Relation field to specify the relationship type.
    - Ensure that all entities and relationships are fully extracted.

Output example:
[
  {
    "id": 0,
    "attributes": {
      "name": "藿香正气散",
      "label": "Formula",
      "composition": ["藿香","紫苏"],
      "function": ["芳香化浊","疏邪解表"]
    },
    "relations": [
      {"Formula": "藿香正气散", "Herb": "藿香", "Relation": "HAS_HERB"}
    ]
  }
]
"""

# example
text = """藿香正气散。  方中藿香、紫苏、白芷芳香化浊，疏邪解表；厚朴、大腹皮理气除满；白术、茯苓、甘草健脾化湿；陈皮、半夏和胃降逆，共奏疏邪解表，和胃降逆止呕之功。若风邪偏重，寒热无汗，可加荆芥、防风以疏风散寒；若见胸闷腹胀嗳腐。为兼食滞，可加鸡内金、神曲、莱菔子以消积化滞；若身痛，腰痛，头身困重，苔厚腻者，为兼外湿，可加羌活、独活、苍术以除湿健脾；若暑邪犯胃，身热汗出，可用新加香薷饮以解暑化湿；若秽浊犯胃，呕吐甚剧，可吞服玉枢丹以辟秽止呕；若风热犯胃、头痛身热可用银翘散去桔梗之升提，加陈皮、竹茹疏风清热，和胃降逆。"""

message = [
    {'role':'system','content': prompt},
    {'role':'user','content':f'Input text：{text}'}
]

example = chatbot(MESSAGE=message)
print(f"LLM-based Parse Result: \n {example}")

LLM-based Parse Result: 
 ```json
[
  {
    "id": 0,
    "attributes": {
      "name": "藿香正气散",
      "label": "Formula",
      "composition": ["藿香", "紫苏", "白芷", "厚朴", "大腹皮", "白术", "茯苓", "甘草", "陈皮", "半夏", "荆芥", "防风", "鸡内金", "神曲", "莱菔子", "羌活", "独活", "苍术", "香薷", "玉枢丹", "银翘散", "陈皮", "竹茹"],
      "function": ["芳香化浊", "疏邪解表", "理气除满", "健脾化湿", "和胃降逆", "疏风散寒", "消积化滞", "除湿健脾", "解暑化湿", "辟秽止呕", "疏风清热", "和胃降逆"]
    },
    "relations": [
      {"Formula": "藿香正气散", "Herb": "藿香", "Relation": "HAS_HERB"},
      {"Formula": "藿香正气散", "Herb": "紫苏", "Relation": "HAS_HERB"},
      {"Formula": "藿香正气散", "Herb": "白芷", "Relation": "HAS_HERB"},
      {"Formula": "藿香正气散", "Herb": "厚朴", "Relation": "HAS_HERB"},
      {"Formula": "藿香正气散", "Herb": "大腹皮", "Relation": "HAS_HERB"},
      {"Formula": "藿香正气散", "Herb": "白术", "Relation": "HAS_HERB"},
      {"Formula": "藿香正气散", "Herb": "茯苓", "Relation": "HAS_HERB"},
      {"Formula": "藿香正气散", "Herb": "甘草", "Relation": "HAS_HERB"},
      {"Formula": "藿香正气散", "Herb": "陈皮", "

> `Note`:
> The example above illustrates how we use a hybrid approach to process text. We first apply rule-based methods to handle predefined fields (such as the structure of a document's table of contents). If we encounter text that cannot be processed by rules (e.g., text descriptions with irregular lengths), we then use a Large Language Model (LLM) for parsing.
> 
> Features of the Solution:
> 
> - Modular Design: Separation of template configuration and code logic through YAML files, making management and modification easier.
> - Precise Control: Clear distinction between different entity types and output rules to prevent LLM from "free-form" generation, ensuring the accuracy of results.
> - Flexible Expansion: When adding new entity types, simply modify the YAML file without changing the code, making the process straightforward.
> - Error Tolerance: Constraints on output rules to avoid empty values or placeholders in the results, thereby improving data quality.
> - Format Enhancement: Explicit declaration of output format requirements to make result parsing more reliable.
> 
> Practical Usage Suggestions:
> In practical applications, adjustments should be made according to the specific needs of the domain, such as:
> - Adding more entity types to the YAML file;
> - Enhancing the example vocabulary to provide more comprehensive learning content for the model;
> - Adjusting output rules to meet different business requirements;
> - Adding special processing logic, such as synonym mapping, to improve processing effectiveness.
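
The raw reply printed above is wrapped in a Markdown `json` fence and repeats some entries (for example `陈皮` in `composition`), so it cannot be passed to `json.loads` directly. Below is a minimal post-processing sketch; `strip_code_fence` and `parse_llm_nodes` are hypothetical helpers, and the required keys are assumed from the output example in the prompt.

```python
import json
from typing import List

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep this listing readable


def strip_code_fence(raw: str) -> str:
    """Remove a leading fence line (with optional language tag) and a trailing fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    return text.strip()


def parse_llm_nodes(raw: str) -> List[dict]:
    """Parse the reply into node dicts; raise ValueError on malformed output."""
    data = json.loads(strip_code_fence(raw))  # json.JSONDecodeError is a ValueError
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of nodes")
    for node in data:
        if not isinstance(node, dict) or not {"attributes", "relations"} <= set(node):
            raise ValueError("each node needs 'attributes' and 'relations'")
        attrs = node["attributes"]
        if not isinstance(attrs, dict):
            raise ValueError("'attributes' must be an object")
        for key in ("composition", "function"):
            if isinstance(attrs.get(key), list):
                # Drop repeated entries while keeping first-seen order
                attrs[key] = list(dict.fromkeys(attrs[key]))
    return data


reply = (
    FENCE + "json\n"
    '[{"id": 0, "attributes": {"name": "藿香正气散", "label": "Formula", '
    '"composition": ["藿香", "陈皮", "陈皮"]}, "relations": []}]\n'
    + FENCE
)
nodes = parse_llm_nodes(reply)
print(nodes[0]["attributes"]["composition"])
```

Raising on malformed replies makes it easy to retry the LLM call or route the text to manual review instead of writing a partial node to disk.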

We store all the standardized processed data in the `data/node` folder, saved in `json` format. In the next step, we will use these semantic nodes and semantic relationships to build a semantic graph.

In [None]:
# neo4j
from neo4j import GraphDatabase
from neo4j.exceptions import ServiceUnavailable, AuthError

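def neo4j_connector(
    uri="bolt://localhost:7687", 
    username="neo4j", 
    password="neo4j@soap"
):
    """
    Connect to a Neo4j instance and fail early if it cannot be reached.
    """
    driver = GraphDatabase.driver(uri, auth=(username, password))
    try:
        # The driver connects lazily; verify now so errors surface here
        driver.verify_connectivity()
    except (ServiceUnavailable, AuthError) as exc:
        driver.close()
        raise RuntimeError(f"Cannot connect to Neo4j at {uri}: {exc}") from exc
    return driver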

driver = neo4j_connector()

In [None]:
def add_entity(
    tx, 
    entity_name:str, 
    entity_label:str, 
    entity_attributes:dict, 
    entity_relations:list, 
):
    """
    Write one entity node, its attributes, and its relations into the graph database.
    """
    # Add entity attributes
    tx.run(
        f"""
        MERGE (e:`Entity`:`{entity_label}` {{name:$name}})
        SET e += $attributes
        """,
        name=entity_name,
        attributes=entity_attributes,
    )

    # Create entity relations
    if entity_relations:
        for relation in entity_relations:
            relation = list(relation.items())
            source_label = relation[0][0]
            source_name = relation[0][1]
            target_label = relation[1][0]
            target_name = relation[1][1]
            relation_label = relation[2][0]
            relation_name = relation[2][1]

            cypher = f"""
            MERGE (s:`Entity`:`{source_label}` {{name: $source_name}})
            MERGE (t:`Entity`:`{target_label}` {{name: $target_name}})
            MERGE (s)-[:`{relation_name}`]->(t)
            """
            tx.run(cypher, source_name=source_name, target_name=target_name)

def node2vec(
    tx, 
    entity_name:str, 
    entity_label:str, 
    entity_vector_text: str = ""
):
    """
    Get the embedding vector of the given text and create node embedding vector.
    """
    # Get the embedding vector of the given text.
    entity_vector = get_embedding(entity_vector_text)
    # Create node embedding vector
    tx.run(
        f"""
        MERGE (e:`Entity`:`{entity_label}` {{name:$name}})
        WITH e
        CALL db.create.setNodeVectorProperty(e, 'embedding', $embedding)
        """,
        name=entity_name,
        embedding=entity_vector
    )

def _index_exists(tx, index_name: str) -> bool:
    result = tx.run("SHOW INDEXES")
    for record in result:
        if record["name"] == index_name:
            return True
    return False

def _create_vector_index(tx, label: str, dim: int):
    """
    Create vector index for the given label.
    This is used to separate different entity types into different search spaces.
    """
    if label == "entityEmbeddings":
        label = "Entity"
        LABEL = "entityEmbeddings"
    else:
        LABEL = label.upper()
    if not _index_exists(tx, LABEL):
        tx.run(f"""
        CREATE VECTOR INDEX {LABEL}
        FOR (n: `{label}`) ON (n.embedding)
        OPTIONS {{
            indexConfig: {{
                `vector.dimensions`: {dim},
                `vector.similarity_function`: 'cosine'
            }}
        }};
        """)

In [None]:
import jsonlines
def jsonl_loader(file_path):
    with jsonlines.open(file_path, 'r') as reader:
        data = list(reader)
    return data

symptoms = jsonl_loader("./data/node/symptoms.json")
diseases = jsonl_loader("./data/node/diseases.json")
syndromes = jsonl_loader("./data/node/syndromes.json")
therapies = jsonl_loader("./data/node/therapies.json")
formulae = jsonl_loader("./data/node/formulae.json")
herbs = jsonl_loader("./data/node/herbs.json")

with driver.session() as session:
    LABELS = ["Symptom", "Disease", "Syndrome", "Therapy", "Formula", "Herb", "entityEmbeddings"]
    for label in LABELS:
        session.execute_write(_create_vector_index, label, 1024)

    for symptom in symptoms:
        session.execute_write(
            add_entity, 
            symptom["attributes"]["name"], 
            symptom["attributes"]["label"],
            symptom["attributes"], 
            symptom["relations"]
        )

    for disease in diseases:
        session.execute_write(
            add_entity, 
            disease["attributes"]["name"], 
            disease["attributes"]["label"], 
            disease["attributes"], 
            disease["relations"]
        )

    for syndrome in syndromes:
        session.execute_write(
            add_entity, 
            syndrome["attributes"]["name"], 
            syndrome["attributes"]["label"], 
            syndrome["attributes"], 
            syndrome["relations"]
        )

    # Generate embeddings for graph node values using pre-trained text-embedding models such as BERT,
    # specifically targeting nodes for text-rich sections
    for syndrome in syndromes:
        session.execute_write(
            node2vec, 
            syndrome["attributes"]["name"], 
            syndrome["attributes"]["label"], 
            syndrome["attributes"]["symptoms"] # text-rich section: the clinical manifestations of the syndrome
        )

    for therapy in therapies:
        session.execute_write(
            add_entity, 
            therapy["attributes"]["name"], 
            therapy["attributes"]["label"], 
            therapy["attributes"], 
            therapy["relations"]
        )

    for formula in formulae:
        session.execute_write(
            add_entity, 
            formula["attributes"]["name"], 
            formula["attributes"]["label"], 
            formula["attributes"], 
            formula["relations"]
        )

    for herb in herbs:
        session.execute_write(
            add_entity, 
            herb["attributes"]["name"], 
            herb["attributes"]["label"],
            herb["attributes"],
            herb["relations"]
        )
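
The vector indexes created above (named after the upper-cased label, e.g. `SYNDROME`) can also be searched directly. The following is a minimal sketch, assuming Neo4j 5.11 or later, where the `db.index.vector.queryNodes` procedure is available. `query_vector_index` and `FakeSession` are hypothetical names, and the stand-in session exists only so the sketch runs without a database:

```python
def query_vector_index(session, index_name: str, query_vector: list, k: int = 3) -> list:
    """Return the k nearest nodes from a Neo4j vector index as name/score dicts."""
    cypher = """
    CALL db.index.vector.queryNodes($index_name, $k, $query_vector)
    YIELD node, score
    RETURN node.name AS name, score
    """
    result = session.run(cypher, index_name=index_name, k=k, query_vector=query_vector)
    return [{"name": r["name"], "score": r["score"]} for r in result]


class FakeSession:
    """Stand-in for a neo4j session (assumption: only .run() is needed here)."""

    def __init__(self, rows):
        self.rows = rows
        self.calls = []

    def run(self, cypher, **params):
        # Record the call and return canned rows, truncated to k.
        self.calls.append((cypher, params))
        return self.rows[: params["k"]]


fake = FakeSession([{"name": "A", "score": 0.9}, {"name": "B", "score": 0.8}])
hits = query_vector_index(fake, "SYNDROME", [0.0] * 1024, k=1)
print(hits)
```

With a real driver, `session` would come from `driver.session()` and the query vector from `get_embedding(...)`.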

### 3.2 Semantic-Oriented Alignment

#### 3.2.1 Query Named Entity Identification and Intent Detection

In this step, we extract the named entities C, of type Map(N → V), and the query intent set I from each clinical query q. Each query q is parsed into key-value pairs, where each key n mentioned in the query corresponds to a semantic element in the semantic template $T_{template}$, and the value v is the information extracted from the query. The query intents I are the medical tasks defined in $T_{template}$ that the query aims to address. We use an LLM with a suitable prompt for this parsing.

$$C, I = \mathrm{LLM}(q, T_{template}, prompt)$$

For instance, given the query q = "What is the potential syndrome of patients with nasal congestion and rhinorrhea for 4 days, fever, chills, and headache for 2 days, a thin yellow tongue coating, and a floating and rapid pulse?", the extracted entity is C = Map(
    "symptom" → "nasal congestion and rhinorrhea for 4 days, fever, chills, and headache for 2 days",
    "pulseSymptom" → "floating and rapid pulse",
    "tongueSymptom" → "thin yellow tongue coating",
), and the intent set is I=Set("syndrome differentiation"). This method demonstrates notable flexibility in accommodating varied query formulations by leveraging the LLM’s extensive understanding and interpretive capabilities.
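
The code below loads `query_ner_template.yaml`, which is not reproduced in this note. Judging from the keys the prompt builders read and from the printed prompts, its structure is roughly the following. This is a reconstruction written as a Python dict, not the actual file; `template_sketch` and `check_template` are hypothetical names, and the long symptom example is abbreviated:

```python
# Structure inferred from named_entity_extractor() and intent_detector();
# the real YAML file may hold more fields and more entity types.
template_sketch = {
    "entities": [
        {
            "label": "Symptoms",
            "instruction": "提取所有临床症状描述，包含持续时间、加重情况等细节",
            "example": "间断性头晕1月。……",  # abbreviated here
        },
        {
            "label": "Diseases",
            "instruction": "从候选疾病列表中选择匹配的疾病名称",
            "candidates": ["呕吐", "眩晕", "感冒", "肺炎"],
            "example": "眩晕",
        },
    ],
    "intents": {
        "candidates": ["辩证", "诊断", "治疗建议", "预后评估"],
        "items": [
            {"task": "辩证", "description": "根据四诊信息进行中医证候分析"},
            {"task": "诊断", "description": "确定疾病诊断结果"},
            {"task": "治疗建议", "description": "获取治疗方案建议"},
            {"task": "预后评估", "description": "预测疾病发展及康复情况"},
        ],
    },
}

def check_template(template: dict) -> list:
    """Return a list of problems; an empty list means the prompt builders can use it."""
    problems = []
    for entity in template.get("entities", []):
        for key in ("label", "instruction", "example"):
            if key not in entity:
                problems.append(f"entity is missing '{key}': {entity}")
        # The format description indexes e["candidates"] for the Diseases entity.
        if entity.get("label") == "Diseases" and not entity.get("candidates"):
            problems.append("Diseases entity needs a non-empty 'candidates' list")
    intents = template.get("intents", {})
    described = {item["task"] for item in intents.get("items", [])}
    for candidate in intents.get("candidates", []):
        if candidate not in described:
            problems.append(f"intent '{candidate}' has no description item")
    return problems

print(check_template(template_sketch))
```

Running such a check before building prompts catches a missing `candidates` list early, instead of as a `KeyError` inside the prompt builder.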

In [None]:
# template
import yaml
with open("query_ner_template.yaml", encoding="utf-8") as file:
    template = yaml.safe_load(file)

##
# please check the query_ner_template.yaml
##

# prompt for named entity extraction
def named_entity_extractor(template:dict) -> str:
    prompt = "请执行医学实体识别，要求：\n"
    for entity in template['entities']:

        if entity['label'] == 'Diseases':
            candidates = entity.get('candidates', [])
            prompt += f"- 【{entity['label']}】{entity['instruction']}\n"
            prompt += f"  候选疾病列表：{', '.join(candidates)}\n"  # the candidates are separated by commas
            prompt += f"  示例：'{entity['example']}'\n"
        else:
            prompt += f"- 【{entity['label']}】{entity['instruction']}\n"
            prompt += f"  示例：'{entity['example']}'\n"
    
    # Dynamic Format Generation Constraints
    format_desc = "{\n"
    for e in template['entities']:
        if e['label'] == 'Diseases':
            format_desc += f'  "{e["label"]}": "",  # 必须从{len(e["candidates"])}个候选疾病中选择\n'
        else:
            format_desc += f'  "{e["label"]}": "",\n'
    format_desc += "}"
    
    prompt += f"\n请严格按以下JSON格式输出：\n{format_desc}"
    return prompt

# prompt for intent detection
def intent_detector(template:dict) -> str:
    prompt = "## 意图识别任务说明\n"
    prompt += "请分析以下医疗查询的意图，需满足：\n"
    prompt += f"- 必须从预定义标签集选择：{template['intents']['candidates']}\n"
    prompt += "- 返回所有相关意图（0到多个）\n\n"
    
    prompt += "## 候选意图定义\n"
    for item in template['intents']['items']:
        prompt += f"{item['task']}：{item['description']}\n"
    
    prompt += "\n## 输出要求\n"
    prompt += "- 格式：JSON字符串数组\n"
    prompt += "- 示例：[\"辩证\", \"治疗建议\"]\n"
    prompt += "- 注意：不要添加注释或说明文本"
    
    return prompt

# example test case id 1
# test["question"] + "What is the syndrome of this patient?"

query = "患者男，3岁7月。饱食后呕吐半年余，加重1日。患儿半年前无明显诱因出现餐后呕吐不消化食物，口气臭秽，脘腹胀满，吐后觉舒，大便秘结，泻下酸臭，未予以重视，1日前呕吐量较以往增多，行腹部DR提示不完全性肠梗阻。舌质红，苔厚腻，脉滑数有力，指纹紫滞。请问该患者的证候是什么？"
print("-----------------------")
ner_prompt = named_entity_extractor(template)
print(f"Prompt:{ner_prompt}")
print("-----------------------")
print(" ")
ner_message = [
    {'role':'system','content': ner_prompt},
    {'role':'user','content':f'Input text：{query}'}
]
concept = chatbot(MESSAGE=ner_message)
print(f"Query NER Result: \n {concept}")

print("-----------------------")
intent_prompt = intent_detector(template)
print(f"Prompt:{intent_prompt}")
print("-----------------------")
print(" ")
intent_message = [
    {'role':'system','content': intent_prompt},
    {'role':'user','content':f'Input text：{query}'}
]
intent= chatbot(MESSAGE=intent_message)
print(f"User Intent Detection Result: \n {intent}")


-----------------------
Prompt:请执行医学实体识别，要求：
- 【Symptoms】提取所有临床症状描述，包含持续时间、加重情况等细节
  示例：'间断性头晕1月。患者自述1月前因劳累出现间断头晕，无头痛，无视物旋转，无双下肢无力，休息后症状有所缓解，但未完全消失，尤以劳累后症状明显，目下症见：患者神清，精神尚可，间断性头晕，食纳欠佳，夜寐尚可，二便调。舌红，苔白腻，脉滑。'
- 【Diseases】从候选疾病列表中选择匹配的疾病名称
  候选疾病列表：呕吐, 眩晕, 感冒, 肺炎
  示例：'眩晕'

请严格按以下JSON格式输出：
{
  "Symptoms": "",
  "Diseases": "",  # 必须从4个候选疾病中选择
}
-----------------------
 
Query NER Result: 
 {
  "Symptoms": "饱食后呕吐半年余，加重1日，餐后呕吐不消化食物，口气臭秽，脘腹胀满，吐后觉舒，大便秘结，泻下酸臭，不完全性肠梗阻，舌质红，苔厚腻，脉滑数有力，指纹紫滞。",
  "Diseases": "呕吐"  # 根据症状描述，呕吐是最匹配的候选疾病
}
-----------------------
Prompt:## 意图识别任务说明
请分析以下医疗查询的意图，需满足：
- 必须从预定义标签集选择：['辩证', '诊断', '治疗建议', '预后评估']
- 返回所有相关意图（0到多个）

## 候选意图定义
辩证：根据四诊信息进行中医证候分析
诊断：确定疾病诊断结果
治疗建议：获取治疗方案建议
预后评估：预测疾病发展及康复情况

## 输出要求
- 格式：JSON字符串数组
- 示例：["辩证", "治疗建议"]
- 注意：不要添加注释或说明文本
-----------------------
 
User Intent Detection Result: 
 Output: ["辩证"]


> In the code above, we obtained `the semantic entity set C` and `the query intent set I`. In practical applications, the entity and intent sets can be customized for the specific business scenario. A natural concern is that LLM recognition may not be reliable enough. In our tests, however, limiting the candidate range and strictly specifying the output format in the prompt let the LLM recognize entities and intents with accuracy close to 100%. We therefore consider LLMs adequate for entity and intent recognition on short-text tasks.

In [None]:
C =  {
  "Symptoms": "饱食后呕吐半年余，加重1日，餐后呕吐不消化食物，口气臭秽，脘腹胀满，吐后觉舒，大便秘结，泻下酸臭，不完全性肠梗阻，舌质红，苔厚腻，脉滑数有力，指纹紫滞。",
  "Diseases": "呕吐"  # Based on the symptom description, vomiting is the most matching candidate disease.
}

I = ["辩证"]
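
The raw outputs printed earlier are not strict JSON: the NER result carries a trailing `#` comment, and the intent result is prefixed with `Output:`. In this note, `C` and `I` are copied by hand. The following is a minimal sketch of a tolerant parser for such outputs; `parse_llm_json` is a hypothetical helper, and its regular expressions are a heuristic (a `#` that directly follows a quote or comma inside a string value would be misread as a comment):

```python
import json
import re

def parse_llm_json(raw: str):
    """Best-effort parse of JSON-like LLM output (a sketch, not a full parser)."""
    # Keep only the span from the first '{' or '[' to the last '}' or ']'.
    starts = [i for i in (raw.find("{"), raw.find("[")) if i != -1]
    end = max(raw.rfind("}"), raw.rfind("]"))
    if not starts or end < min(starts):
        raise ValueError("no JSON object or array found")
    body = raw[min(starts) : end + 1]
    # Drop '#' comments that follow a closing quote or a comma.
    body = re.sub(r'(?<=[",])\s*#[^\n]*', "", body)
    # Drop trailing commas before a closing brace or bracket.
    body = re.sub(r",\s*([}\]])", r"\1", body)
    return json.loads(body)

# Shortened versions of the outputs shown above.
raw_ner = """{
  "Symptoms": "饱食后呕吐半年余，加重1日",
  "Diseases": "呕吐"  # 根据症状描述，呕吐是最匹配的候选疾病
}"""
raw_intent = 'Output: ["辩证"]'

C_parsed = parse_llm_json(raw_ner)
I_parsed = parse_llm_json(raw_intent)
print(C_parsed, I_parsed)
```

If the model allows it, asking for plain JSON without inline comments in the format description is the simpler fix; the parser is only a fallback.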

#### 3.2.2 Embedding-based Retrieval of Sub-graphs

Our method extracts pertinent sub-graphs from the knowledge graph, aligned with the specifics provided in the clinical query, such as "symptoms" and "disease", and with clinical intents such as "syndrome differentiation". This process consists of two primary steps: EBR-based concept identification and LLM-driven subgraph extraction.

In the **EBR-based concept identification step**, the top $K_{concept}$ most relevant medical concepts are identified using the named entity set C derived from the clinical query. For each entity pair (k,v) ∈ C, cosine similarity is computed between the entity value v and all graph nodes n corresponding to section k via pretrained text embeddings. 

Aggregating these node-level scores to concept-level by summing contributions from nodes belonging to the same concept, we rank and select the top $K_{concept}$ concepts. This method presupposes that the occurrence of multiple query entities is indicative of pertinent links, thus improving retrieval precision.

$$ S_{N_i} = \sum_{(k,v)\in C}\left[
    \sum_{n\in N_i}\mathbb{I}\{n.label=k\} \cdot \cos(\mathrm{embed}(v),\mathrm{embed}(n.\mathrm{text}))\right] $$
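
The aggregation in the formula above can be sketched in plain Python. This is an illustration under stated assumptions: the vectors are toy 2-d placeholders rather than real embeddings, and `cosine`, `rank_concepts`, and the node dict layout are hypothetical names chosen for the example:

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_concepts(query_vectors, nodes, top_k=3):
    """
    query_vectors: {section k: embed(v)} for every (k, v) in C
    nodes: dicts with 'concept' (N_i), 'label' (section) and 'vector' (embed(n.text))
    """
    scores = defaultdict(float)
    for k, q_vec in query_vectors.items():
        for n in nodes:
            if n["label"] == k:  # indicator I{n.label = k}
                scores[n["concept"]] += cosine(q_vec, n["vector"])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy data, purely illustrative.
query_vectors = {"Symptoms": [1.0, 0.0], "Diseases": [0.0, 1.0]}
nodes = [
    {"concept": "A", "label": "Symptoms", "vector": [1.0, 0.0]},
    {"concept": "A", "label": "Diseases", "vector": [0.0, 1.0]},
    {"concept": "B", "label": "Symptoms", "vector": [0.6, 0.8]},
]
ranked_concepts = rank_concepts(query_vectors, nodes)
print(ranked_concepts)
```

Concept A ranks first because it collects a contribution from two matching sections, which is the "multiple query entities" effect described above.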

In [None]:
def retrieve_syndromes(session, C, k=3) -> list[dict]:
    query_vector = get_embedding(C["Symptoms"]) # embed(v)
    result = session.run(
        """
        MATCH (d:Disease {name: $disease})-[:`HAS_SYNDROME`]->(s:Syndrome)
        WITH s, vector.similarity.cosine(s.embedding, $query_vector) AS score
        ORDER BY score DESC
        LIMIT $top_k
        RETURN s.name AS syndrome, score
        """,
        disease = C["Diseases"],  # key must match the NER output ("Diseases")
        query_vector = query_vector,
        top_k = k
    ) 
    
    # `vector.similarity.cosine(s.embedding, $query_vector)` computes
    # cos(embed(v), embed(n.text)).
    # Here n.text is the syndrome's "symptoms" attribute (its clinical
    # manifestations), which node2vec() embedded when the graph was built above.

    records = result.data()
    return [{"syndrome": record["syndrome"], "score": record["score"]} for record in records] if records else []

with driver.session() as session:
    ranked = retrieve_syndromes(
        session,
        C,
        k=3
    )
    for item in ranked:
        print(f"{item['syndrome']}: {item['score']:.4f}")

饮食积滞证: 0.8537
食滞胃热证: 0.8502
乳食积滞证: 0.8404


In the **LLM-driven subgraph extraction step**, the system first rephrases the original user query q to include the retrieved concept entity ID; the modified query q′ is then translated into a graph query language, such as Cypher for Neo4j, for question answering. For instance, the initial query q = "What are the therapeutic methods of Wind-heat exterior syndrome?" is reformulated as "What are the therapeutic methods of 'SYN0220'?" and thereafter translated into the Cypher query `MATCH (s:Syndrome {ID:'SYN0220'})-[:HAS_THERAPY]->(t:THERAPY) RETURN t.description`.
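
To make the two rewriting steps concrete, here is a minimal sketch that replaces the LLM translation with a fixed lookup table. This is a deliberate simplification, not the method used in this work: `CYPHER_TEMPLATES`, `rewrite_query`, and `build_cypher` are hypothetical names, and the single template mirrors the example in the text:

```python
# Hypothetical lookup table: intent -> parameterized Cypher.
CYPHER_TEMPLATES = {
    "therapy": (
        "MATCH (s:Syndrome {ID: $concept_id})-[:HAS_THERAPY]->(t:THERAPY) "
        "RETURN t.description"
    ),
}

def rewrite_query(q: str, mention: str, concept_id: str) -> str:
    """Build q' by replacing the surface mention with the retrieved concept ID."""
    if mention not in q:
        raise ValueError(f"mention not found in query: {mention}")
    return q.replace(mention, f"'{concept_id}'")

def build_cypher(intent: str, concept_id: str):
    """Return (cypher, params); the ID is passed as a parameter, not interpolated."""
    return CYPHER_TEMPLATES[intent], {"concept_id": concept_id}

q = "What are the therapeutic methods of Wind-heat exterior syndrome?"
q_prime = rewrite_query(q, "Wind-heat exterior syndrome", "SYN0220")
cypher, params = build_cypher("therapy", "SYN0220")
print(q_prime)
print(cypher, params)
```

Passing the concept ID as a query parameter keeps user-derived text out of the Cypher string, which also matters when an LLM writes the query.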

> It is worth noting that the LLM-driven query construction is sufficiently flexible. The underlying principle is to guide the language model through customized templates and prompts for semantic classification. The code for this part is similar to the LLM parsing in the knowledge graph construction and user query parsing steps above, so we will not elaborate further.

We can further retrieve the subgraphs (more specific therapy, formula and herbs) based on the top k retrieved syndromes: 

In [None]:
def retrieve_subgraph_based_on_syndromes(session, syndrome_names, hop=1):
    query = f"""
        MATCH (s:Syndrome {{name: $syndrome_names}})-[*0..{hop}]-(n)
        RETURN n.name AS concept, collect(DISTINCT n.description) AS text
    """
    result = session.run(query, syndrome_names=syndrome_names)
    data = result.data()
    return data if data else []

with driver.session() as session:
    retrieved_information = retrieve_subgraph_based_on_syndromes(session, "饮食积滞证", hop=1) # here "饮食积滞证" is the top 1 potential syndrome
    print(retrieved_information)

[{'concept': '饮食积滞证','text': ['症状：呕吐物酸腐，脘腹胀满拒按，嗳气厌食，得食更甚，吐后反快，大便或溏或结，气味臭秽，苔厚腻，脉滑实。治法：消食化滞，和胃降逆。方药：保和丸']}, {'concept': '保和丸','text': ['方中神曲、山楂、莱菔子消食化滞，陈皮、半夏、茯苓和胃降逆，连翘清散积热。尚可加谷芽、麦芽、鸡内金等消食健胃；若积滞化热，腹胀便秘，可用小承气汤以通腑泄热，使浊气下行，呕吐自止；若食已即吐，口臭干渴，胃中积热上冲，可用竹茹汤清胃降逆；若误食不洁、酸腐食物，而见腹中疼痛，胀满欲吐而不得者，可因势利导，用压舌板探吐祛邪。']}]


#### 3.2.3 Answer Generation

Answers are synthesized by correlating retrieved data from Section 3.2.2 with the initial query. The LLM serves as a decoder to formulate responses to user inquiries given the retrieved information. 

In [23]:
reference_information = f"参考信息：\n"
for info in retrieved_information:
    reference_information += f"{info['concept']}:{info['text']}\n"
print(f"The reference information is:{reference_information}")
print("------------------------------------------------")
message = [
    {'role':'system','content':'请你扮演一位临床中医师，你的任务是根据给出的临床信息，结合参考信息对当前患者的症状做进一步的推理，最终给出相应处方的中草药组成。'},
    # System prompt: "Please act as a clinical traditional Chinese medicine practitioner.
    # Your task is to reason further about the patient's symptoms based on the given clinical
    # information and the reference information, and finally give the herbal composition of the prescription."
    {'role':'user','content':f'临床信息如下：{test["question"]}\n{reference_information}'}
]

pred = chatbot(MESSAGE=message)
print(f"The prediction of case {test['id']} is: \n {pred}")

The reference information is:参考信息：
饮食积滞证:['症状：呕吐物酸腐，脘腹胀满拒按，嗳气厌食，得食更甚，吐后反快，大便或溏或结，气味臭秽，苔厚腻，脉滑实。治法：消食化滞，和胃降逆。方药：保和丸']
保和丸:['方中神曲、山楂、莱菔子消食化滞，陈皮、半夏、茯苓和胃降逆，连翘清散积热。尚可加谷芽、麦芽、鸡内金等消食健胃；若积滞化热，腹胀便秘，可用小承气汤以通腑泄热，使浊气下行，呕吐自止；若食已即吐，口臭干渴，胃中积热上冲，可用竹茹汤清胃降逆；若误食不洁、酸腐食物，而见腹中疼痛，胀满欲吐而不得者，可因势利导，用压舌板探吐祛邪。']

------------------------------------------------
The prediction of case 1 is: 
 根据患者的临床信息，结合参考信息，可以推断患者可能患有饮食积滞证。以下是详细的推理过程：

1. 患者症状：饱食后呕吐半年余，加重1日，呕吐物酸腐，口气臭秽，脘腹胀满，吐后觉舒，大便秘结，泻下酸臭。
2. 舌象：舌质红，苔厚腻，提示体内有湿热积滞。
3. 脉象：脉滑数有力，指纹紫滞，表明体内有积热，气血运行不畅。

综合以上症状和体征，可以判断患者属于饮食积滞证，治疗原则为消食化滞，和胃降逆。

处方建议如下：

方药：保和丸加减

组成：
- 神曲 10g
- 山楂 10g
- 莱菔子 10g
- 陈皮 6g
- 半夏 6g
- 茯苓 10g
- 连翘 10g
- 谷芽 10g
- 麦芽 10g
- 鸡内金 10g

加减：
- 若腹胀便秘明显，可加小承气汤（大黄、厚朴、枳实）以通腑泄热。
- 若胃中积热上冲，口臭干渴，可加竹茹汤（竹茹、生姜、大枣）清胃降逆。
- 若误食不洁、酸腐食物，可考虑用压舌板探吐祛邪。

用药时需注意：
- 本方适用于饮食积滞所致的呕吐，若症状未改善或加重，应及时就医。
- 儿童用药需根据体质和病情调整剂量。
- 服药期间，患者应避免食用油腻、辛辣、生冷等不易消化的食物。
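
The structured `prediction_soap` entry in the next cell lists the herbs from this free-text answer; how that list was derived is not shown here. One possible way to pull the herbs out of the "组成" section is sketched below; `extract_herbs` is a hypothetical helper, and the line pattern assumes the `- <herb> <dose>g` layout seen in the output above:

```python
import re

def extract_herbs(answer: str, section: str = "组成") -> list:
    """Collect '- <herb> <dose>g' bullet items listed under the given section heading."""
    herbs, in_section = [], False
    for line in answer.splitlines():
        stripped = line.strip()
        if not in_section:
            in_section = stripped.startswith(section)
            continue
        match = re.match(r"^-\s*(\S+)\s+(\d+(?:\.\d+)?)g$", stripped)
        if match:
            herbs.append((match.group(1), float(match.group(2))))
        elif stripped:
            # The first non-empty line that is not a herb bullet ends the section.
            break
    return herbs

# Shortened excerpt of the generated answer.
sample = """组成：
- 神曲 10g
- 山楂 10g
- 陈皮 6g

加减：
- 若腹胀便秘明显，可加小承气汤（大黄、厚朴、枳实）以通腑泄热。
"""
extracted = extract_herbs(sample)
print(extracted)
```

Free-text answers vary in layout between runs, so asking the model for a JSON herb list in the system prompt may be more robust than post-hoc parsing.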


In [24]:
test = {
    "id": 1, 
    "question": "患者男，3岁7月。饱食后呕吐半年余，加重1日。患儿半年前无明显诱因出现餐后呕吐不消化食物，口气臭秽，脘腹胀满，吐后觉舒，大便秘结，泻下酸臭，未予以重视，1日前呕吐量较以往增多，行腹部DR提示不完全性肠梗阻。舌质红，苔厚腻，脉滑数有力，指纹紫滞。",
    "answer": {
        "symptoms": ["呕吐", "腹胀", "便秘", "指纹紫滞", "红舌", "厚舌苔", "腻舌苔", "脉有力", "数脉", "滑脉"],
        "disease": "呕吐",
        "syndrome": "饮食停滞证",
        "therapy": ["消食化滞","和胃降逆"],
        "formula": "保和丸",
        "herbs": ["炒建曲", "焦山楂", "姜半夏", "炒莱菔子", "陈皮", "连翘", "生姜", "炒鸡内金", "麸炒枳壳"]
    },
    "prediction": {
        "disease": "呕吐",
        "syndrome": "脾胃湿热证",
        "therapy": ["消食导滞"],
        "formula": "枳实导滞丸",
        "herbs": ["黄连","黄芩","厚朴","陈皮","神曲","茯苓","大黄","枳实","甘草"]
    },
    "prediction_rag": {
        "disease": "呕吐",
        "syndrome": "脾胃湿热证",
        "therapy": ["化痰止呕","行气化湿"],
        "formula": "半夏泻心汤",
        "herbs": ["黄连","黄芩","姜半夏","党参","干姜","大枣","炙甘草","茯苓","竹茹","枳实"]
    },
    "prediction_soap": {
        "disease": "呕吐",
        "syndrome": "饮食积滞证",
        "therapy": ["消食化滞","和胃降逆"],
        "formula": "保和丸",
        "herbs": ["神曲","山楂","莱菔子","陈皮","半夏","茯苓","连翘","谷芽","麦芽","鸡内金"]
    }
}

### 3.3 Analysis

> Thanks to the hard-coded relation retrieval methods, the model accurately identified the "Dietary Stagnation Syndrome," which is consistent with the standard answer. The proposed treatment principle of "Digestive Stagnation Elimination and Stomach Harmonization" perfectly matches clinical needs, aligning with the pathogenesis of food stagnation transforming into heat and obstructed qi movement in children. The selection of "保和丸" as the main formula demonstrates a thorough grasp of traditional formula knowledge. The digestive effect of Baohe Pill is milder compared to "枳实导滞丸", making it more suitable for children's delicate spleen and stomach, avoiding the risk of excessive purgation that may be caused by drastic purgatives like rhubarb. Although the standard answer includes "Fresh Ginger" (for harmonizing the stomach and relieving vomiting) and "Wheat-Bran Fried Aurantium Shell" (for promoting qi movement and reducing bloating), the combination of "Pinellia, Radish Seed, and Tangerine Peel" still achieves the effect of reducing vomiting and harmonizing the stomach, with an overall focus on safety.

## 4. Experiment

### 4.1 Objective Metrics

Our evaluation employed a meticulously curated "gold" dataset, comprising typical symptom queries, supporting descriptions, and their authoritative prescriptions. The control group utilized traditional text-based EBR, while the experimental group adopted the method outlined in this study. For both groups, we employed the same LLM, specifically glm-4-flash, and the same embedding model, bge-large-zh-v1.5. We measured retrieval efficacy using Mean Reciprocal Rank (MRR) and Recall@K. MRR is the mean, over all $n$ queries, of the reciprocal rank of the correct answer, where $rank_i$ is the position of the correct answer for query $i$; Recall@K is the fraction of queries whose relevant item appears within the top K results.

**For classification tasks and herb recommendation evaluation, we further adopted precision, recall and F1-score metrics:**

$$ 
Precision = \frac{|\mathcal{P} \cap \mathcal{G}|}{|\mathcal{P}|},\quad 
Recall = \frac{|\mathcal{P} \cap \mathcal{G}|}{|\mathcal{G}|},\quad 
F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} 
$$

**where $\mathcal{P}$ denotes predicted herbs/syndromes and $\mathcal{G}$ denotes gold-standard items.**

$$ 
MRR = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{rank_i} 
$$
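
Over a batch of queries, the two retrieval metrics can be computed as in the following sketch. The labels are placeholders rather than real cases, and `mean_reciprocal_rank` and `recall_at_k` are hypothetical helper names:

```python
def mean_reciprocal_rank(gold_items, ranked_lists):
    """MRR = mean of 1/rank_i; a query whose gold item is missing contributes 0."""
    total = 0.0
    for gold, ranked in zip(gold_items, ranked_lists):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_items) if gold_items else 0.0

def recall_at_k(gold_items, ranked_lists, k=3):
    """Fraction of queries whose gold item appears in the top k results."""
    hits = sum(1 for gold, ranked in zip(gold_items, ranked_lists) if gold in ranked[:k])
    return hits / len(gold_items) if gold_items else 0.0

# Placeholder labels: rank 1, rank 3, and a miss.
gold_items = ["a", "b", "c"]
ranked_lists = [["a", "x", "y"], ["x", "y", "b"], ["x", "y", "z"]]

mrr = mean_reciprocal_rank(gold_items, ranked_lists)
r3 = recall_at_k(gold_items, ranked_lists, k=3)
print(f"MRR={mrr:.3f}, Recall@3={r3:.3f}")
```

Here the three queries contribute 1, 1/3, and 0 to the MRR, and two of the three count as hits for Recall@3.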

In [25]:
test = {
    "id": 1, 
    "answer": {
        "syndrome": "饮食积滞证",
        "formula": "保和丸",
        "therapy": ["消食化滞","和胃降逆"],
        "herbs": ["炒建曲","焦山楂","姜半夏","炒莱菔子","陈皮","连翘","生姜","炒鸡内金","麸炒枳壳"]
    },
    "prediction_rag": {
        "topk": ["脾胃湿热证","食滞胃热证","饮食积滞证"], # top 3 syndromes predicted by conventional RAG
        "syndrome": "脾胃湿热证",
        "formula": "半夏泻心汤",
        "therapy": ["化痰止呕","行气化湿"],
        "herbs": ["黄连","黄芩","姜半夏","党参","干姜","大枣","炙甘草","茯苓","竹茹","枳实"]
    },
    "prediction_soap": {
        "topk": ["饮食积滞证","食滞胃热证","乳食积滞证"], # top 3 syndromes predicted by SOAP-G
        "syndrome": "饮食积滞证",
        "formula": "保和丸",
        "therapy": ["消食化滞","和胃降逆"],
        "herbs": ["神曲","山楂","莱菔子","陈皮","半夏","茯苓","连翘","谷芽","麦芽","鸡内金"]
    }
}

In [26]:
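# Normalization function (adjust according to actual needs).
# NOTE: the prefixes below are English, so they never match the Chinese herb
# names in `test`; in this run names are compared exactly as written.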
def normalize_herb(herb):
    prefixes = ['roasted', 'ginger', 'charred', 'wheat-fried', 'fried', 'prepared']
    for prefix in prefixes:
        if herb.startswith(prefix):
            return herb[len(prefix):]
    return herb

# MRR calculation function
def calculate_mrr(gold, predictions):
    ranks = []
    for pred in predictions:
        if 'topk' not in pred: continue
        rank = pred['topk'].index(gold) + 1 if gold in pred['topk'] else float('inf')
        ranks.append(1/rank if rank != float('inf') else 0)
    return sum(ranks)/len(ranks) if ranks else 0

# Recall@K calculation function
def calculate_recall(gold, predictions, k=3):
    recalls = []
    for pred in predictions:
        if 'topk' not in pred: continue
        recalls.append(1 if gold in pred['topk'][:k] else 0)
    return sum(recalls)/len(recalls) if recalls else 0

# Herb evaluation function
def herb_metrics(gold_herbs, pred_herbs):
    gold_set = set(normalize_herb(h) for h in gold_herbs)
    pred_set = set(normalize_herb(h) for h in pred_herbs)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0
    recall = tp / len(gold_set) if gold_set else 0
    f1 = 2*(precision*recall)/(precision+recall) if (precision+recall) else 0
    return precision, recall, f1

# Main calculation process
if __name__ == "__main__":
    gold = test['answer']
    models = {
        'RAG': test['prediction_rag'],
        'SOAP': test['prediction_soap']
    }
    
    # Syndrome retrieval metrics
    print("Syndrome retrieval metrics:")
    for name, pred in models.items():
        if 'topk' not in pred: continue
        rank = pred['topk'].index(gold['syndrome'])+1 if gold['syndrome'] in pred['topk'] else None
        mrr = 1/rank if rank else 0
        recall = 1 if gold['syndrome'] in pred['topk'][:3] else 0
        print(f"{name}: MRR={mrr:.3f}, Recall@3={recall}")

    # Classification accuracy
    print("\nClassification accuracy:")
    for name, pred in models.items():
        syndrome_acc = int(pred['syndrome'] == gold['syndrome'])
        formula_acc = int(pred['formula'] == gold['formula'])
        therapy_acc = int(pred['therapy'] == gold['therapy'])
        print(f"{name}: Syndrome={syndrome_acc}, Formula={formula_acc}, Therapy={therapy_acc}")

    # Herb recommendation metrics
    print("\nHerb recommendation metrics:")
    for name, pred in models.items():
        p, r, f = herb_metrics(gold['herbs'], pred['herbs'])
        print(f"{name}: Precision={p:.3f}, Recall={r:.3f}, F1={f:.3f}")

Syndrome retrieval metrics:
RAG: MRR=0.333, Recall@3=1
SOAP: MRR=1.000, Recall@3=1

Classification accuracy:
RAG: Syndrome=0, Formula=0, Therapy=0
SOAP: Syndrome=1, Formula=1, Therapy=1

Herb recommendation metrics:
RAG: Precision=0.100, Recall=0.111, F1=0.105
SOAP: Precision=0.200, Recall=0.222, F1=0.211


> Syndrome Retrieval:
> - The MRR of SOAP is 1.0 (the correct syndrome is ranked first).
> - The MRR of RAG is 0.333 (the correct syndrome is ranked third).
> 
> Classification Accuracy:
> - SOAP is accurate in all three aspects: syndrome, formula, and therapy.
> - RAG is inaccurate in all three aspects.
> 
> Herb Recommendation:
> - The F1 score of SOAP is 0.211 (2 herbs, 陈皮 and 连翘, match the standard answer exactly).
> - The F1 score of RAG is 0.105 (only 1 common herb).
> - The English prefixes in `normalize_herb()` do not match the Chinese herb names, so no processing prefix is stripped in this run and only exact name matches count.
> 
> The implementation considers the following key points:
> - Standardization of traditional Chinese medicine (TCM) names (intended to remove prefixes related to processing).
> - Boundary checks on the topk list.
> - Strict matching of classification metrics.
> - Multi-dimensional evaluation (retrieval, classification, and recommendation).
> 
> In practical applications, the following are needed:
> - Expansion of the test case set.
> - Optimization of TCM standardization rules.
> - Addition of a human evaluation module.
> - Support for batch calculation and statistical summarization.
> 
> The normalization rules in the `normalize_herb()` function can be adjusted as needed.
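
As one example of adjusting the normalization rules, the sketch below strips Chinese processing prefixes instead of English ones. The prefix list is an assumption chosen to cover this test case, not a complete or authoritative list, and `normalize_herb_zh` and `herb_prf` are hypothetical names:

```python
# Assumed processing (炮制) prefixes, longest first so "麸炒" is tried before "炒".
PROCESSING_PREFIXES = ["麸炒", "炒", "焦", "姜", "炙"]

def normalize_herb_zh(herb: str) -> str:
    """Strip one leading processing prefix, keeping at least two characters."""
    for prefix in PROCESSING_PREFIXES:
        if herb.startswith(prefix) and len(herb) - len(prefix) >= 2:
            return herb[len(prefix):]
    return herb

def herb_prf(gold_herbs, pred_herbs):
    """Set-based precision, recall and F1 on normalized herb names."""
    gold = {normalize_herb_zh(h) for h in gold_herbs}
    pred = {normalize_herb_zh(h) for h in pred_herbs}
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return tp, p, r, f1

# Herb lists copied from the test case above.
gold_herbs = ["炒建曲", "焦山楂", "姜半夏", "炒莱菔子", "陈皮", "连翘", "生姜", "炒鸡内金", "麸炒枳壳"]
soap_herbs = ["神曲", "山楂", "莱菔子", "陈皮", "半夏", "茯苓", "连翘", "谷芽", "麦芽", "鸡内金"]

tp, p, r, f1 = herb_prf(gold_herbs, soap_herbs)
print(f"shared={tp}, Precision={p:.3f}, Recall={r:.3f}, F1={f1:.3f}")
```

On this case the sketch finds six shared herbs (山楂, 半夏, 莱菔子, 陈皮, 连翘, 鸡内金), giving Precision=0.600, Recall=0.667, F1=0.632. The two-character guard keeps names such as 姜黄 intact, while related names such as 建曲 and 神曲 still do not match and would need a synonym map.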

### 4.2 Professional Physician Evaluation

Given that automatically generating traditional Chinese medicine (TCM) prescriptions is an extremely complex task, I invited professional TCM practitioners to evaluate the generated TCM prescriptions. The evaluators were required to score the generated prescriptions in the following two aspects: 1) Herb Effectiveness (HE) and 2) Herb Compatibility (HC). The value range for both scores is [0,10]. Higher scores indicate better effectiveness and compatibility of the prescriptions, and vice versa. The doctors evaluated based on theories, principles, prescriptions, and their own TCM experience. In addition to the generated prescriptions, the invited doctors were also required to evaluate the labeled prescriptions as a benchmark for the generated ones. Unlike automatic evaluation methods, human evaluators focus on the potential effectiveness of candidate answers, rather than just literal similarity, which is more rational and relevant.

## Conclusion

In this script, we present some code snippets and comments on how the prescription generation system based on Large Language Models (LLM) works. This system seamlessly integrates Retrieval-Augmented Generation (RAG) with semantic networks constructed in a Knowledge Graph (KG).
To put it simply, by `introducing semantic labels`, we address the text inconsistency problem of traditional retrieval-augmented methods, which enables more flexible and accurate prescription recommendations. However, we still need to `manually write templates and construct the knowledge base`. In practice, source texts also arrive in formats that are harder to parse, such as PDF. I currently use `from llama_index.readers.file import PDFReader`, but the text conversion results are not satisfactory. I am still working on this issue; a potential solution is `olmocr` ([allenai/olmocr: Toolkit for linearizing PDFs for LLM datasets/training](https://github.com/allenai/olmocr)), but I currently do not have a high-end GPU to run it. I will continue to look for a cost-effective and efficient way to address this challenge.

If you encounter any issues or have questions while reading this note, please feel free to contact me at mc36401@um.edu.mo. I will open-source all the code and the prescription dataset in the near future.

In [None]:
import os
from pathlib import Path

"""
Some tips for you:
This is just a simple PDF reader that uses the LlamaIndex library to extract text from PDFs.
But if you have enough GPU resources, I strongly recommend using the OLM OCR model to extract text from PDFs.
It has much better performance than the other PDF reader libs. The only problem is that it is required to use high-end GPU.

- Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100, A6000...) with at least 20 GB of GPU RAM
- Follow: https://github.com/allenai/olmocr
"""

def pdfreader(file_path:str)->str:
    assert os.path.exists(file_path), "File does not exist"
    assert file_path.endswith(".pdf"), "File format not supported"

    from llama_index.readers.file import PDFReader
    doc = PDFReader().load_data(file=Path(file_path))

    text = "\n\n".join([d.get_content() for d in doc])
    return text

def plainreader(file_path:str)->str:
    assert os.path.exists(file_path), "File does not exist"
    with open(file_path, "r", encoding="utf-8") as f:  # explicit encoding for Chinese text
        text = f.read()
    return text