## Biomedical Text Generation: Multi-Entity-to-Text Dataset

In this approach, we build a text generation dataset where the **input** is a prompt listing multiple biomedical entities found in an abstract, and the **output** is the abstract itself.

Each entry contains:
- `pmid`: PubMed ID
- `entities`: Deduplicated list of multi-word biomedical entities found in the abstract (abstracts with fewer than two such entities are skipped)
- `abstract`: The full abstract text
- `input`: A prompt like `"Write a biomedical paragraph using the terms: <entity1>, <entity2>, ..."`
- `target`: The original abstract (same text as `abstract`)

This method encourages the model to generate a coherent biomedical paragraph conditioned on multiple domain-specific concepts, which is useful for entity-aware text generation or assisted authoring systems.
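
As a minimal sketch of the entry layout described above, the snippet below builds one entry from a made-up record. The helper name `build_entry` and the sample values are hypothetical and used only for illustration. The order-preserving deduplication via `dict.fromkeys` is an assumption of this sketch.

```python
from typing import Optional


def build_entry(pmid: str, abstract: str, entities: list[str]) -> Optional[dict]:
    """Hypothetical helper: build one multi-entity-to-text entry, or return None."""
    # Keep multi-word entities only; dict.fromkeys deduplicates in first-seen order
    kept = list(dict.fromkeys(e for e in entities if len(e.split()) >= 2))
    if len(kept) < 2:
        return None  # not enough entities for a multi-entity prompt
    prompt = f"Write a biomedical paragraph using the terms: {', '.join(kept)}."
    return {
        "pmid": pmid,
        "entities": kept,
        "abstract": abstract,
        "input": prompt,
        "target": abstract,
    }


# Made-up record, for illustration only
sample = build_entry(
    pmid="00000000",
    abstract="Placeholder abstract text.",
    entities=["breast cancer", "BRCA1", "tumor suppressor", "breast cancer"],
)
print(sample["input"])
```

Here the single-word entity `BRCA1` and the repeated `breast cancer` are dropped, so the prompt lists only `breast cancer` and `tumor suppressor`.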


In [None]:
!pip install jsonlines

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


In [None]:
import json
import os

import jsonlines
from tqdm import tqdm


In [None]:
input_path = "/content/drive/MyDrive/biomedical_text_generation/data/enriched/abstracts_with_entities.json"

with open(input_path, "r", encoding="utf-8") as f:
    abstracts = json.load(f)

print(f"Loaded {len(abstracts)} abstracts.")


In [None]:
# Build one training example per abstract that has at least two multi-word entities
multi_entity_to_text = []

for entry in tqdm(abstracts):
    pmid = entry.get("pmid")
    abstract = entry["abstract"]
    entities = entry.get("entities", [])

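    # Keep only multi-word entities and deduplicate, preserving first-seen order
    # so the prompt is reproducible across runs (set() ordering is not)
    filtered_entities = list(dict.fromkeys(ent for ent in entities if len(ent.split()) >= 2))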

    if len(filtered_entities) < 2:
        continue  # We want multi-entity prompts

    input_text = f"Write a biomedical paragraph using the terms: {', '.join(filtered_entities)}."

    multi_entity_to_text.append({
        "pmid": pmid,
        "entities": filtered_entities,
        "abstract": abstract,
        "input": input_text,
        "target": abstract
    })



In [None]:
output_dir = "/content/drive/MyDrive/biomedical_text_generation/data/training/text_gen"
os.makedirs(output_dir, exist_ok=True)

output_path = os.path.join(output_dir, "multi_entity_to_text.jsonl")

with jsonlines.open(output_path, mode="w") as writer:
    writer.write_all(multi_entity_to_text)

print(f"Saved {len(multi_entity_to_text)} multi-entity entries to {output_path}")
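
A quick way to confirm the export is to read the file back and check that each line parses into a record with the expected keys. The sketch below uses only the standard library and a temporary file with made-up records, so it does not touch the Drive path. `check_jsonl` and `EXPECTED_KEYS` are hypothetical names introduced for illustration.

```python
import json
import os
import tempfile

EXPECTED_KEYS = {"pmid", "entities", "abstract", "input", "target"}


def check_jsonl(path: str) -> int:
    """Hypothetical helper: count records and verify each has the expected keys."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = EXPECTED_KEYS - set(record)
            if missing:
                raise ValueError(f"line {line_no} is missing keys: {sorted(missing)}")
            count += 1
    return count


# Made-up records written to a temporary file, for illustration only
records = [
    {"pmid": "0", "entities": ["a b", "c d"], "abstract": "text", "input": "prompt", "target": "text"},
    {"pmid": "1", "entities": ["e f", "g h"], "abstract": "more", "input": "prompt", "target": "more"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(tmp_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    n_records = check_jsonl(tmp_path)

print(n_records)
```

Pointing `check_jsonl` at `output_path` would apply the same check to the saved dataset.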
