
# 📢 Code Showcase: How to Use the Latest Version of the Summarizer

This notebook demonstrates how to use the latest version of our summarizer. The new version is distilled from DeepSeek-R1-671B, whose output serves as the distillation target. Specifically, we feed disease entries from OMIM or Orphanet, together with the definitions of their HPO annotations, into DeepSeek-R1-671B to generate synthetic cases. For the detailed process, please refer to our manuscript.

Next, we will showcase two practical functions of the summarizer:
1. **Symptom Summary Generation**: Input the patient's HPO terms, and the summarizer will quickly generate a summary report of the patient's symptoms.
2. **Structured Clinical Report Generation**: Pass the results from the ranker and the recommender to the summarizer, which then generates a structured clinical report.

Please note that the following code runs in an environment based on PyTorch 2.0.1.

The creator of this notebook: Baole Wen (2025.03.29)

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline
from tqdm import tqdm
import pandas as pd
import torch
import re
from pyhpo import Ontology
Ontology()

<pyhpo.ontology.OntologyClass at 0x7fa0f8618940>

In [2]:
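def Get_Definition(hpo_list):
    """Concatenate the quoted definition text of each HPO term in hpo_list."""
    definition_list = []
    for t in hpo_list:
        # Raw definition string of the term, as stored by pyhpo
        definition = Ontology.get_hpo_object(t).definition
        # Keep only the text between the first pair of double quotes;
        # terms without a quoted definition are skipped
        match = re.search(r'"(.*?)"', definition)
        if match:
            definition_list.append(match.group(1))
    return ' '.join(definition_list)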
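
The sample prompt printed later in this notebook contains a definition cut off at `many refer to \`. That is consistent with the non-greedy pattern `"(.*?)"` stopping at an escaped quote (`\"`) inside the definition. Below is a minimal sketch of an escape-aware alternative, assuming the raw definition string has the form `"text" [source]`. The helper `extract_definition` and the sample string are hypothetical and not part of the original pipeline.

```python
import re

# Matches a double-quoted string and allows backslash-escaped characters inside it
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def extract_definition(raw):
    """Return the quoted definition text with escaped quotes restored, or None."""
    match = QUOTED.search(raw)
    if match is None:
        return None
    # Turn \" back into a plain double quote
    return match.group(1).replace('\\"', '"')


# Hypothetical raw string in the assumed '"text" [source]' format
raw = r'"Many refer to \"short stature\" as reduced height." [HPO:example]'

print(re.search(r'"(.*?)"', raw).group(1))  # original pattern: stops at the escaped quote
print(extract_definition(raw))              # escape-aware pattern: keeps the full sentence
```

Whether the summarizer was trained on truncated or complete definitions is not stated here, so treat this as an option to evaluate rather than a required fix.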

# HPO Terms Detected in a Patient

1. **HP:0000006** - Autosomal dominant inheritance  
2. **HP:0003593** - Infantile onset  
3. **HP:0025104** - Capillary malformation  
4. **HP:0001009** - Telangiectasia  
5. **HP:0003829** - Typified by incomplete penetrance  
6. **HP:0030713** - Vein of Galen aneurysmal malformation  


In [3]:
Patient_hps = ["HP:0000006", "HP:0003593", "HP:0025104", "HP:0001009", "HP:0003829", "HP:0030713"]
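
A typo in a hand-written ID list is easy to miss. As a minimal sketch, assuming HPO IDs follow the usual `HP:` plus seven digits format, the hypothetical helper `clean_hpo_ids` (not part of the original notebook) normalizes, validates, and de-duplicates a list before it is passed on.

```python
import re

# An HPO identifier: the prefix "HP:" followed by exactly seven digits
HPO_ID = re.compile(r"HP:\d{7}")


def clean_hpo_ids(raw_ids):
    """Strip whitespace, upper-case, reject malformed IDs, drop duplicates in order."""
    cleaned = []
    for raw in raw_ids:
        term = raw.strip().upper()
        if not HPO_ID.fullmatch(term):
            raise ValueError(f"Not a valid HPO ID: {raw!r}")
        if term not in cleaned:
            cleaned.append(term)
    return cleaned


print(clean_hpo_ids([" HP:0000006", "hp:0001009", "HP:0000006"]))
```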

In [4]:
Input_text = Get_Definition(Patient_hps)

In [5]:
Model_Path = '/remote-home/share/data3/ly/phenoDP/new-checkpoint-finetune-with-4-datasets/'

device = "cuda:1"  # the device to load the model onto

# Load the fine-tuned summarizer in half precision on the selected device
model = AutoModelForCausalLM.from_pretrained(
    Model_Path,
    torch_dtype=torch.float16,
    device_map=device
)
tokenizer = AutoTokenizer.from_pretrained(Model_Path)


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [6]:

summaries = []

max_input_tokens = 2048

hpo_def = Input_text

prompt = f"""
I will provide you with the definitions of some HPO (Human Phenotype Ontology) terms exhibited by a patient. Based on these definitions, please generate a concise, clinically focused summary of the patient's symptoms in one paragraph, approximately 100-300 words in length. Ensure the summary is highly readable, with smooth transitions between ideas, logical coherence, and accurate representation of the clinical features. Emphasize clarity, fluency, and clinical relevance to create a realistic and precise description of the patient's presentation.\nText:\n{hpo_def}
"""
  
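# Tokenize the prompt and keep at most max_input_tokens tokens
tokenized_text = tokenizer(prompt, return_tensors="pt").input_ids[0]
truncated_tokenized_text = tokenized_text[:max_input_tokens]

# Decode back to text and append the marker that opens the model's reasoning
truncated_text = tokenizer.decode(truncated_tokenized_text) + '<think>:\n'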

summarizer = pipeline(
    "text-generation",  
    model=model,  
    tokenizer=tokenizer
)
response = summarizer(
    truncated_text,
    max_new_tokens=max_input_tokens + 1024,
    top_p=0.95,
    top_k=50,
    do_sample=True
)

# Split once on the marker: [0] is the input prompt, [1] is the model output
summary = response[0]['generated_text'].split('<think>:', 1)
summaries.append(summary)
torch.cuda.empty_cache()


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
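
The cell above truncates the prompt silently, so a long list of definitions can lose its tail without any notice. The sketch below wraps the same truncate-then-append step in a helper that also reports whether anything was cut. It assumes only that the tokenizer exposes `encode` and `decode`, as Hugging Face tokenizers do. `truncate_prompt` and `WhitespaceTokenizer` are hypothetical names, and the toy tokenizer exists only so the example runs without a model.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in for a real tokenizer (illustration only)."""

    def encode(self, text):
        return text.split(" ")

    def decode(self, tokens):
        return " ".join(tokens)


def truncate_prompt(tokenizer, prompt, max_input_tokens, suffix="<think>:\n"):
    """Cap the prompt at max_input_tokens and append the reasoning marker.

    Returns (model_input, was_truncated).
    """
    token_ids = tokenizer.encode(prompt)
    was_truncated = len(token_ids) > max_input_tokens
    kept = token_ids[:max_input_tokens]
    return tokenizer.decode(kept) + suffix, was_truncated


toy = WhitespaceTokenizer()
model_input, was_truncated = truncate_prompt(toy, "one two three four five", max_input_tokens=3)
print(repr(model_input), was_truncated)
```

With a real tokenizer, `encode` may add special tokens, so the decoded text can differ slightly from the input. The flag is the useful part: it lets the caller warn or shorten the definitions instead of sending a clipped prompt.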


## Input prompt

In [7]:
print(summaries[0][0])


I will provide you with the definitions of some HPO (Human Phenotype Ontology) terms exhibited by a patient. Based on these definitions, please generate a concise, clinically focused summary of the patient's symptoms in one paragraph, approximately 100-300 words in length. Ensure the summary is highly readable, with smooth transitions between ideas, logical coherence, and accurate representation of the clinical features. Emphasize clarity, fluency, and clinical relevance to create a realistic and precise description of the patient's presentation.
Text:
A mode of inheritance that is observed for traits related to a gene encoded on one of the autosomes (i.e., the human chromosomes 1-22) in which a trait manifests in heterozygotes. In the context of medical genetics, an autosomal dominant disorder is caused when a single copy of the mutant allele is present. Males and females are affected equally, and can both transmit the disorder with a risk of 50% for each child of inheriting the muta

## Output

In [9]:
print(summaries[0][1])


Okay, let me try to process this query step by step. The user has provided several HPO terms and wants a concise clinical summary. First, I need to identify each term and understand their implications.

The first term mentions autosomal dominant inheritance. That tells me the condition is likely genetic, passed down with a 50% chance, affecting males and females equally. Then there's the onset between 28 days to a year, so early infancy symptoms. 

Next, capillary malformations and telangiectasias. These are vascular issues. Capillary malformations are flat vascular stains, maybe visible at birth. Telangiectasias are small blood vessel dilations, noted on various parts like tongue, lips, etc. Both are part of a larger vascular anomaly mentioned later: venous malformation.

The venous malformation details: vein of Galen aneurysm. That's a specific type, seen in the first trimester. Causes problems because the vessel doesn't involute properly, leading to a shunt. Symptoms would relate t

### Case Demonstration
In the following case, we reuse the prompt text generated in the example tutorials for the **Ranker** and the **Recommender**. This prompt combines the patient's input symptom information with the results from the ranker and the recommender.

In [10]:

with open('/root/HPO2SUM/github_project/Git_validation/Case_Report_Prompt.txt', 'r', encoding='utf-8') as file:
    content = file.read()
    print(content)



Assume you are an experienced clinical physician. Below is a patient’s symptom description using HPO (Human Phenotype Ontology) terms, along with three candidate diagnoses. To further differentiate between these diagnoses, the physician has provided potential symptoms that the patient does not currently exhibit but could help clarify or confirm the diagnosis. Your task is to explain why these potential symptoms are critical for distinguishing between the three diseases.  

**Patient’s Symptom Description**:  
Caries is a multifactorial bacterial infection affecting the structure of the tooth. This term has been used to describe the presence of more than expected dental caries. A height below that which is expected according to age and gender norms. Although there is no universally accepted definition of short stature, many refer to \ An increased sensitivity of the skin to light. Photosensitivity may result in a rash upon exposure to the sun (which is known as photodermatosis). Photos

In [11]:

summaries = []

max_input_tokens = 4096

# The prompt is the case-report text read from file above
prompt = content
  
tokenized_text = tokenizer(prompt, return_tensors="pt").input_ids[0]
truncated_tokenized_text = tokenized_text[:max_input_tokens]
    
truncated_text = tokenizer.decode(truncated_tokenized_text)  + '<think>:\n'

summarizer = pipeline(
    "text-generation",  
    model=model,  
    tokenizer=tokenizer
)
response = summarizer(
    truncated_text,
    max_new_tokens=max_input_tokens + 4096,
    top_p=0.95,
    top_k=50,
    do_sample=True
)

# Split once on the marker: [0] is the input prompt, [1] is the model output
summary = response[0]['generated_text'].split('<think>:', 1)
summaries.append(summary)
torch.cuda.empty_cache()
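
The cells that follow pull the prompt, the chain of thought, and the case report out of the generated text by indexing the result of `split`. Indexing `[1]` after splitting on `</think>` raises an `IndexError` if generation stopped before the model closed its reasoning. A minimal sketch of a more forgiving split is shown below. `split_generation` is a hypothetical helper, and the marker strings are the ones used in this notebook.

```python
def split_generation(generated_text, start_marker="<think>:", end_marker="</think>"):
    """Split generated text into (prompt, reasoning, report).

    Missing markers yield empty strings instead of raising IndexError.
    """
    prompt, found_start, completion = generated_text.partition(start_marker)
    if not found_start:
        # No start marker: treat the whole text as model output
        prompt, completion = "", generated_text

    reasoning, found_end, report = completion.partition(end_marker)
    if not found_end:
        # Reasoning was never closed, e.g. the token limit was reached first
        return prompt.strip(), completion.strip(), ""
    return prompt.strip(), reasoning.strip(), report.strip()


prompt_part, reasoning_part, report_part = split_generation(
    "PROMPT<think>:\nstep 1</think> final report"
)
print(prompt_part, reasoning_part, report_part, sep=" | ")
```

An empty report then signals that `max_new_tokens` may need to be raised, rather than surfacing as an exception in a later cell.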


#### Input prompt

In [12]:
print(summaries[0][0])


Assume you are an experienced clinical physician. Below is a patient’s symptom description using HPO (Human Phenotype Ontology) terms, along with three candidate diagnoses. To further differentiate between these diagnoses, the physician has provided potential symptoms that the patient does not currently exhibit but could help clarify or confirm the diagnosis. Your task is to explain why these potential symptoms are critical for distinguishing between the three diseases.  

**Patient’s Symptom Description**:  
Caries is a multifactorial bacterial infection affecting the structure of the tooth. This term has been used to describe the presence of more than expected dental caries. A height below that which is expected according to age and gender norms. Although there is no universally accepted definition of short stature, many refer to \ An increased sensitivity of the skin to light. Photosensitivity may result in a rash upon exposure to the sun (which is known as photodermatosis). Photos

#### Output

In [13]:
print(summaries[0][1])


Okay, let's tackle this query. The user wants me to explain why certain potential symptoms are critical for differentiating between three genetic disorders. First, I need to understand each of the three conditions given.

The patient has features like short stature, photosensitivity, generalized hypotonia, microcephaly, deep-set eyes, and dental caries. These point towards either Cockayne syndrome types A/B, Xeroderma pigmentosum (XP) group F, or Richieri-Costa/Guion-Almeida syndrome.

Starting with Cockayne syndrome types. Both types have severe photosensitivity and growth issues. Type A mentions ventriculomegaly, cataracts, hip contractures. Type B has microcornea, subcortical white matter calcifications, and developmental cataracts. So, if the patient had signs like ventriculomegaly or hip contractures, that would support Cockayne A. On the other hand, microcornea and white matter changes would lean towards B.

Xeroderma pigmentosum (XP) F has high risk of skin cancers and seborrhe

#### Chain of thought

In [14]:
print(summaries[0][1].split('</think>')[0])


Okay, let's tackle this query. The user wants me to explain why certain potential symptoms are critical for differentiating between three genetic disorders. First, I need to understand each of the three conditions given.

The patient has features like short stature, photosensitivity, generalized hypotonia, microcephaly, deep-set eyes, and dental caries. These point towards either Cockayne syndrome types A/B, Xeroderma pigmentosum (XP) group F, or Richieri-Costa/Guion-Almeida syndrome.

Starting with Cockayne syndrome types. Both types have severe photosensitivity and growth issues. Type A mentions ventriculomegaly, cataracts, hip contractures. Type B has microcornea, subcortical white matter calcifications, and developmental cataracts. So, if the patient had signs like ventriculomegaly or hip contractures, that would support Cockayne A. On the other hand, microcornea and white matter changes would lean towards B.

Xeroderma pigmentosum (XP) F has high risk of skin cancers and seborrhe

#### Case report

In [15]:
print(summaries[0][1].split('</think>')[1])

 The patient presents with features suggesting multisystem involvement, including photosensitivity, growth retardation, and ocular abnormalities. To differentiate between the proposed diagnoses—Cockayne Syndrome types A/B, Xeroderma Pigmentosum (XP) Group F, and Richieri-Costa/Guion-Almeida Syndrome—we analyze the following critical potential symptoms:

1. **Photosensitivity and Growth Failure**:  
   - Cockayne Syndrome (types A/B) is characterized by profound photosensitivity (skin rashes, hair loss) and severe growth failure. The absence of malignant neoplasms aligns with XP Group F, where photosensitivity increases tumor risk. Richieri-Costa/Guion-Almeida Syndrome lacks documented photosensitivity but shows craniofacial dysmorphisms like retrognathia and esotropia.

2. **Ocular Abnormalities**:  
   - **Cockayne A (Ventriculomegaly/Cataracts)** and **Cockayne B (Microcornea/Subcortical White Matter Calcifications)** involve distinct retinal/brainstem pathologies. Cataracts (seen in