
# 📢 Code Showcase: How to Use the Latest Version of the Summarizer

This notebook, shared on GitHub, demonstrates how to use the latest version of our summarizer. The new version is trained by distillation, with the output of DeepSeek-R1-671B as the target. Specifically, we take disease entries from OMIM or Orphanet, together with the definitions of their HPO annotations, and feed them into DeepSeek-R1-671B to generate synthetic patient cases. For the detailed process, please refer to our manuscript.

Next, we will showcase two practical functions of the summarizer:
1. **Symptom Summary Generation**: Input the patient's HPO terms, and the summarizer will quickly generate a summary report of the patient's symptoms.
2. **Structured Clinical Report Generation**: Pass the results from the ranker and the recommender to the summarizer, which then generates a structured clinical report.

The creator of this notebook: Baole Wen (2025.03.29)

In [1]:
import re
import torch
from transformers import pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

from pyhpo import Ontology

ontology = Ontology(data_folder='../data/hpo-2025-05-06')

In [2]:
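def Get_Definition(hpo_list):
    """Concatenate the quoted definition text of each HPO term in hpo_list."""
    definition_list = []
    for t in hpo_list:
        # Guard against terms without a definition so re.search never receives None.
        definition = Ontology.get_hpo_object(t).definition or ''
        # Keep only the text between the first pair of double quotes.
        match = re.search(r'"(.*?)"', definition)
        if match:
            definition_list.append(match.group(1))
    return ' '.join(definition_list)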
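
To see what the regular expression in `Get_Definition` keeps, the sketch below applies the same pattern to hand-written strings. The raw format (quoted text followed by a bracketed source list) is an assumption about what the ontology object returns, and the sample strings and the `extract_quoted` name are invented for illustration. Because the pattern is non-greedy, it stops at the first closing double quote, so a definition that itself contains a quotation mark is cut off there. This may explain the stray backslash visible in the case prompt later in this notebook.

```python
import re

def extract_quoted(raw_definition):
    """Return the text between the first pair of double quotes, or None."""
    match = re.search(r'"(.*?)"', raw_definition or '')
    return match.group(1) if match else None

# Hypothetical raw strings, written only to illustrate the pattern.
plain = '"Onset of signs or symptoms in infancy." [EXAMPLE:ref]'
nested = '"Some refer to \\"short\\" stature loosely." [EXAMPLE:ref]'

print(extract_quoted(plain))             # full quoted text
print(extract_quoted(nested))            # truncated at the inner quote
print(extract_quoted('no quotes here'))  # no match
```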

# HPO Terms Detected in a Patient

1. **HP:0000006** - Autosomal dominant inheritance  
2. **HP:0003593** - Infantile onset  
3. **HP:0025104** - Capillary malformation  
4. **HP:0001009** - Telangiectasia  
5. **HP:0003829** - Typified by incomplete penetrance  
6. **HP:0030713** - Vein of Galen aneurysmal malformation  


In [3]:
Patient_hps = ["HP:0000006", "HP:0003593", "HP:0025104", "HP:0001009", "HP:0003829", "HP:0030713"]

In [4]:
Input_text = Get_Definition(Patient_hps)

In [5]:
model_path = '../data/model/Bio-Medical-3B-CoT-Finetuned'

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)



In [6]:
summaries = []

max_input_tokens = 2048

hpo_def = Input_text

prompt = f"""
I will provide you with the definitions of some HPO (Human Phenotype Ontology) terms exhibited by a patient. Based on these definitions, please generate a concise, clinically focused summary of the patient's symptoms in one paragraph, approximately 100-300 words in length. Ensure the summary is highly readable, with smooth transitions between ideas, logical coherence, and accurate representation of the clinical features. Emphasize clarity, fluency, and clinical relevance to create a realistic and precise description of the patient's presentation.\nText:\n{hpo_def}
"""
  
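# Truncate the prompt to the token budget, then append the marker that opens the model's reasoning.
tokenized_text = tokenizer(prompt, return_tensors="pt").input_ids[0]
truncated_tokenized_text = tokenized_text[:max_input_tokens]
truncated_text = tokenizer.decode(truncated_tokenized_text) + '<think>:\n'

summarizer = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)
# max_new_tokens limits only the generated tokens; the prompt is not counted.
response = summarizer(
    truncated_text,
    max_new_tokens=max_input_tokens + 1024,
    top_p=0.95,
    top_k=50,
    do_sample=True
)

# Split once at the marker: index 0 is the prompt, index 1 is the model output.
summary = response[0]['generated_text'].split('<think>:', 1)
summaries.append(summary)
torch.cuda.empty_cache()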


Device set to use cuda:1
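
The cell above separates the prompt from the output by splitting on `<think>:`. As a sketch of a slightly more defensive variant, the hypothetical helper below first tries to strip the exact text that was sent to the model and only falls back to the marker when that prefix is not found. It assumes the pipeline returns the prompt followed by the generated text, which is the default behaviour of text-generation pipelines; the function name and sample strings are illustrative only.

```python
def split_generation(generated_text, model_input, marker="<think>:"):
    """Return (prompt, output) from the full text returned by the generator."""
    if generated_text.startswith(model_input):
        # Usual case: the returned text begins with exactly what was sent in.
        return model_input, generated_text[len(model_input):]
    # Fallback: cut at the first occurrence of the marker (marker itself dropped).
    prompt, _, output = generated_text.partition(marker)
    return prompt, output

model_input = "Summarize the symptoms.\n<think>:\n"
full_text = model_input + "Okay, let me start. Note the <think>: tag itself."
prompt, output = split_generation(full_text, model_input)
print(output)
```

Passing `return_full_text=False` in the pipeline call is another option; in that case only the newly generated text is returned and no split is needed.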


## Input prompt

In [7]:
print(summaries[0][0])


I will provide you with the definitions of some HPO (Human Phenotype Ontology) terms exhibited by a patient. Based on these definitions, please generate a concise, clinically focused summary of the patient's symptoms in one paragraph, approximately 100-300 words in length. Ensure the summary is highly readable, with smooth transitions between ideas, logical coherence, and accurate representation of the clinical features. Emphasize clarity, fluency, and clinical relevance to create a realistic and precise description of the patient's presentation.
Text:
A mode of inheritance that is observed for traits related to a gene encoded on one of the autosomes (i.e., the human chromosomes 1-22) in which a trait manifests in heterozygotes. In the context of medical genetics, an autosomal dominant disorder is caused when a single copy of the mutant allele is present. Males and females are affected equally, and can both transmit the disorder with a risk of 50% for each child of inheriting the muta

## Output

In [8]:
print(summaries[0][1])


Okay, let me start by reading through the provided HPO terms carefully. The first term mentions autosomal dominant inheritance. So the condition is passed down from one parent, and there's a 50% chance each child inherits it. I should note that males and females are equally affected. 

Next, onset between 28 days to one year—so symptoms started within the first year. Then there's a capillary malformation and telangiectasias. Capillaries are small blood vessels, and telangiectasias are dilated ones. Both are described in different areas like tongue, lips, etc. Also, there's mention of incomplete penetrance, meaning not everyone with the mutation shows symptoms.

The Vein of Galen aneurysmal malformation is a key point. It's an arteriovenous malformation affecting the vein of Galen during fetal development, leading to blood shunting issues. These malformations typically present early, around 6-11 weeks, but symptoms might manifest later when complications arise.

Putting this together, 

### Case Demonstration
In this case, we use the prompt text generated in the **Rank** and **Recommender** example tutorials. The prompt combines the patient's input symptom information with the results from the ranker and the recommender.

In [9]:
with open('../data/case_report_prompt.txt', 'r', encoding='utf-8') as file:
    content = file.read()
    print(content)



Assume you are an experienced clinical physician. Below is a patient’s symptom description using HPO (Human Phenotype Ontology) terms, along with three candidate diagnoses. To further differentiate between these diagnoses, the physician has provided potential symptoms that the patient does not currently exhibit but could help clarify or confirm the diagnosis. Your task is to explain why these potential symptoms are critical for distinguishing between the three diseases.  

**Patient’s Symptom Description**:  
Caries is a multifactorial bacterial infection affecting the structure of the tooth. This term has been used to describe the presence of more than expected dental caries. A height below that which is expected according to age and gender norms. Although there is no universally accepted definition of short stature, many refer to \ An increased sensitivity of the skin to light. Photosensitivity may result in a rash upon exposure to the sun (which is known as photodermatosis). Photos

In [10]:
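summaries = []

max_input_tokens = 4096

# The prompt is the case-report text loaded in the previous cell.
prompt = content

# Truncate the prompt to the token budget, then append the marker that opens the model's reasoning.
tokenized_text = tokenizer(prompt, return_tensors="pt").input_ids[0]
truncated_tokenized_text = tokenized_text[:max_input_tokens]
truncated_text = tokenizer.decode(truncated_tokenized_text) + '<think>:\n'

summarizer = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)
# max_new_tokens limits only the generated tokens; the prompt is not counted.
response = summarizer(
    truncated_text,
    max_new_tokens=max_input_tokens + 4096,
    top_p=0.95,
    top_k=50,
    do_sample=True
)

# Split once at the marker: index 0 is the prompt, index 1 is the model output.
summary = response[0]['generated_text'].split('<think>:', 1)
summaries.append(summary)
torch.cuda.empty_cache()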


Device set to use cuda:1
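
Cells 6 and 10 repeat the same truncate, generate, and split steps with different budgets. One way to share that logic is a small wrapper such as the sketch below. `run_summarizer`, `WhitespaceTokenizer`, and `echo_generator` are hypothetical names; the two stand-ins only mimic the call shapes used in this notebook so that the example runs without loading the model.

```python
from types import SimpleNamespace

def run_summarizer(prompt, tokenizer, generator, max_input_tokens,
                   max_new_tokens, marker="<think>:"):
    """Truncate the prompt, append the reasoning marker, generate, and split."""
    token_ids = tokenizer(prompt).input_ids[:max_input_tokens]
    model_input = tokenizer.decode(token_ids) + marker + "\n"
    response = generator(model_input, max_new_tokens=max_new_tokens,
                         top_p=0.95, top_k=50, do_sample=True)
    full_text = response[0]["generated_text"]
    # Split once: text before the marker is the prompt, the rest is the output.
    prompt_part, _, output_part = full_text.partition(marker)
    return prompt_part, output_part

class WhitespaceTokenizer:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    def __call__(self, text):
        return SimpleNamespace(input_ids=text.split())

    def decode(self, token_ids):
        return " ".join(token_ids)

def echo_generator(text, **generation_kwargs):
    """Stand-in generator (assumption): input plus a fixed continuation."""
    return [{"generated_text": text + "Okay, reasoning.</think> Final report."}]

prompt_part, output_part = run_summarizer(
    "one two three four five", WhitespaceTokenizer(), echo_generator,
    max_input_tokens=3, max_new_tokens=16)
print(repr(prompt_part))
print(repr(output_part))
```

With the objects from this notebook, the equivalent call would be along the lines of `run_summarizer(content, tokenizer, summarizer, max_input_tokens=4096, max_new_tokens=8192)`.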


# Input

In [11]:
print(summaries[0][0])


Assume you are an experienced clinical physician. Below is a patient’s symptom description using HPO (Human Phenotype Ontology) terms, along with three candidate diagnoses. To further differentiate between these diagnoses, the physician has provided potential symptoms that the patient does not currently exhibit but could help clarify or confirm the diagnosis. Your task is to explain why these potential symptoms are critical for distinguishing between the three diseases.  

**Patient’s Symptom Description**:  
Caries is a multifactorial bacterial infection affecting the structure of the tooth. This term has been used to describe the presence of more than expected dental caries. A height below that which is expected according to age and gender norms. Although there is no universally accepted definition of short stature, many refer to \ An increased sensitivity of the skin to light. Photosensitivity may result in a rash upon exposure to the sun (which is known as photodermatosis). Photos

# Output

In [12]:
print(summaries[0][1])


Okay, let's tackle this query step by step. The user has presented a patient with several symptoms described using HPO terms and three likely diagnoses. They want me to explain why certain potential symptoms are crucial for differentiating between these diagnoses.

First, I need to understand each condition's key features. Let's list them out:

1. **Cockayne Syndrome Type A (OMIM:216400)**: Autosomal recessive, severe photosensitivity, growth failure, cataracts.
   - Key symptoms: Photophobia, skin photosensitivity, growth retardation (short stature), intellectual disability, cataracts.

2. **Cockayne Syndrome Type B (OMIM:133540)**: Also autosomal recessive, similar to Type A but milder.
   - Key symptoms: More prominent facial features like triangular face, prominent nasal bridge, microcornea.

3. **Xeroderma Pigmentosum CGF (OMIM:278760)**: DNA repair defect leading to sun damage.
   - Key symptoms: Sun-induced dermatitis, skin cancers, hearing loss, cognitive decline (dementia).



# Chain of thought

In [13]:
print(summaries[0][1].split('</think>')[0])


Okay, let's tackle this query step by step. The user has presented a patient with several symptoms described using HPO terms and three likely diagnoses. They want me to explain why certain potential symptoms are crucial for differentiating between these diagnoses.

First, I need to understand each condition's key features. Let's list them out:

1. **Cockayne Syndrome Type A (OMIM:216400)**: Autosomal recessive, severe photosensitivity, growth failure, cataracts.
   - Key symptoms: Photophobia, skin photosensitivity, growth retardation (short stature), intellectual disability, cataracts.

2. **Cockayne Syndrome Type B (OMIM:133540)**: Also autosomal recessive, similar to Type A but milder.
   - Key symptoms: More prominent facial features like triangular face, prominent nasal bridge, microcornea.

3. **Xeroderma Pigmentosum CGF (OMIM:278760)**: DNA repair defect leading to sun damage.
   - Key symptoms: Sun-induced dermatitis, skin cancers, hearing loss, cognitive decline (dementia).
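
The cells in this part of the notebook separate the chain of thought from the final report by splitting on `</think>`. Taking element `[1]` of that split raises an `IndexError` if generation stops before the closing tag is produced, for example when the token budget runs out mid-reasoning. A minimal sketch of a more tolerant split is shown below; `split_reasoning` is a hypothetical helper and the sample strings are invented.

```python
def split_reasoning(output, end_tag="</think>"):
    """Return (chain_of_thought, report); report is None if end_tag is absent."""
    reasoning, tag, report = output.partition(end_tag)
    if not tag:
        # The model never closed its reasoning, so there is no report yet.
        return reasoning.strip(), None
    return reasoning.strip(), report.strip()

complete = "\nOkay, let's compare the candidates.</think> The key symptoms are..."
cut_off = "\nOkay, let's compare the candidates. First,"

print(split_reasoning(complete))
print(split_reasoning(cut_off))
```

Returning `None` for the report makes a truncated generation easy to detect before the text is shown to a clinician.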



# Case Report

In [14]:
print(summaries[0][1].split('</think>')[1])

 The potential symptoms critical for distinguishing between the three diseases lie in their characteristic phenotypic features, which align uniquely with the underlying etiologies:

1. **Cockayne Syndrome Type A (OMIM:216400)**:  
   - **Hypogonadism**, **prominent nose**, and **cataracts** are hallmark features. These indicate a severe autosomal recessive disorder characterized by progressive photosensitivity, endocrine dysfunction, and early-onset aging-like complications. The combination of midface hypoplasia (prominent nose) and lens opacities suggests a distinct molecular mechanism, distinguishing it from Cockayne Type B.  
   - *Differentiation*: Unlike Cockayne Type B, which exhibits more pronounced craniofacial anomalies (e.g., triangular face, nasal bridge prominence), Type A lacks such prominent facial features and focuses instead on midface and ocular manifestations.

2. **Cockayne Syndrome Type B (OMIM:133540)**:  
   - **Triangular face**, **prominent nasal bridge**, and *