## Biomedical Text Generation: Entity-to-Text Dataset

In this setup, we prepare a training dataset for **conditional biomedical text generation**, where the model learns to generate an abstract-like passage based on a **single biomedical entity**.

Each entry contains:

- `pmid`: The PubMed identifier of the abstract
- `entity`: The biomedical term used as a generation prompt
- `abstract`: The full abstract text
- `input`: A prompt in the form of `"Write a biomedical paragraph about <entity>."`
- `target`: The abstract itself (used as generation target)

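This dataset is useful for training models such as T5 or GPT-style LLMs to generate domain-specific text conditioned on a biomedical topic. Because every retained entity of an abstract yields its own entry, the same abstract can serve as the target for several different prompts.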
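
To make the schema above concrete, here is a minimal sketch of how a single entry could be assembled. The helper name `make_record` and the placeholder values (PMID, entity, abstract text) are assumptions made for illustration only and do not come from the real data.

```python
def make_record(pmid, entity, abstract):
    """Build one entity-to-text entry (hypothetical helper, for illustration)."""
    return {
        "pmid": pmid,
        "entity": entity,
        "abstract": abstract,
        # The prompt template follows the form described in the field list.
        "input": f"Write a biomedical paragraph about {entity}.",
        # The target is the abstract itself.
        "target": abstract,
    }


# Placeholder values, not real PubMed data.
example = make_record("00000000", "heart failure", "Placeholder abstract text.")
print(example["input"])
```

Note that `abstract` and `target` hold the same text; keeping both makes the file usable either as a record of the source abstract or directly as input/target pairs.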


In [None]:
!pip install jsonlines

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


In [None]:
import json
from tqdm import tqdm
import os
import jsonlines


In [None]:
# Load the enriched abstracts with biomedical entities
input_path = "/content/drive/MyDrive/biomedical_text_generation/data/enriched/abstracts_with_entities.json"

with open(input_path, "r", encoding="utf-8") as f:
    abstracts = json.load(f)

print(f"Loaded {len(abstracts)} abstracts.")


In [None]:
# Prepare the dataset
entity_to_text = []

for entry in tqdm(abstracts):
    pmid = entry.get("pmid")
    abstract = entry.get("abstract")
    if not abstract:
        continue  # Skip records that have no abstract text
    entities = entry.get("entities", [])

    for entity in entities:
        if len(entity.split()) < 2:
            continue  # Skip single-word entities, which tend to be too generic

        input_text = f"Write a biomedical paragraph about {entity}."

        entity_to_text.append({
            "pmid": pmid,
            "entity": entity,
            "abstract": abstract,
            "input": input_text,
            "target": abstract
        })
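
The word-count filter in the loop above can be checked in isolation. The sketch below applies the same rule to a few made-up entity strings; the helper name `is_multiword` and the candidate list are assumptions for illustration, not part of the original pipeline.

```python
def is_multiword(entity):
    """Return True when the entity has at least two whitespace-separated tokens."""
    return len(entity.split()) >= 2


# Made-up candidates; surrounding whitespace does not count as an extra token.
candidates = ["insulin", "type 2 diabetes", "  aspirin  ", "heart failure"]
kept = [entity for entity in candidates if is_multiword(entity)]
print(kept)  # ['type 2 diabetes', 'heart failure']
```

This is a coarse heuristic: it drops specific single-word terms along with generic ones, so a different criterion may be preferable if single-word entities matter for the task.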


In [None]:
# Output directory
output_dir = "/content/drive/MyDrive/biomedical_text_generation/data/training/text_gen"
os.makedirs(output_dir, exist_ok=True)

output_path = os.path.join(output_dir, "entity_to_text.jsonl")

with jsonlines.open(output_path, mode="w") as writer:
    writer.write_all(entity_to_text)

print(f"Saved {len(entity_to_text)} entries to {output_path}")
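
After writing the file, it can be worth reading it back to confirm that every line parses and carries the expected fields. The following is a minimal sketch using only the standard library; `read_jsonl` is a hypothetical helper, and the demo writes a throwaway file with placeholder data rather than touching the Drive path above.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts (hypothetical helper)."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # Ignore blank lines
                records.append(json.loads(line))
    return records


# Placeholder entry, not real PubMed data.
sample = [{
    "pmid": "00000000",
    "entity": "heart failure",
    "abstract": "Placeholder abstract text.",
    "input": "Write a biomedical paragraph about heart failure.",
    "target": "Placeholder abstract text.",
}]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")
    loaded = read_jsonl(demo_path)

# Every entry should expose exactly the five documented fields.
expected_keys = {"pmid", "entity", "abstract", "input", "target"}
assert all(set(record) == expected_keys for record in loaded)
print(f"Read back {len(loaded)} entries.")
```

To check the real output, the same helper could be called with `output_path` in place of `demo_path`.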
