# Entity-to-Abstract Dataset Creation

## Purpose

The goal of this notebook is to create a training dataset for **biomedical text generation**, where the model learns to generate or summarize abstracts based on **biomedical entities**.

This can be used to train a model to:
- generate relevant scientific text about a specific medical concept,
- retrieve or summarize known findings about a disease, treatment, or biological process.

### Example:
- **Input (Prompt):** `"Summarize findings about immune checkpoint inhibitors."`
- **Target (Output):** `"Immune checkpoint inhibitors have emerged as a promising therapy for various types of cancer, particularly in non-small cell lung cancer..."`

---

## Methodology

We use the preprocessed dataset (`abstracts_with_entities.json`), in which each abstract is already annotated with a list of extracted biomedical entities.

We will:
1. Load the enriched dataset.
2. For each abstract, create one or more `(entity → abstract)` training pairs.
3. Store the generated dataset in a format suitable for training (this notebook uses JSONL; CSV would also work).

Each training example will consist of:
- `input`: a templated prompt such as `"Summarize findings about {entity}."`
- `output`: the corresponding abstract

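This method assumes that each abstract is informative about every entity extracted from it. Because the full abstract is the target for each of its entities, an abstract with several entities appears several times in the dataset, once per prompt.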
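
To make the expansion concrete, here is a minimal sketch on a single made-up record. The record's values (including the placeholder PMID) and the names `sample_record`, `PROMPT_TEMPLATE`, and `make_pairs` are hypothetical, and the sketch assumes that entities are stored as plain strings.

```python
# Hypothetical record: field names follow this notebook, values are invented.
sample_record = {
    "pmid": "00000000",  # placeholder, not a real PMID
    "abstract": "Immune checkpoint inhibitors have emerged as a promising therapy ...",
    "entities": ["immune checkpoint inhibitors", "non-small cell lung cancer"],
}

PROMPT_TEMPLATE = "Summarize findings about {entity}."


def make_pairs(record):
    """Expand one annotated abstract into one training pair per entity."""
    return [
        {
            "input": PROMPT_TEMPLATE.format(entity=entity),
            "output": record["abstract"],
        }
        for entity in record["entities"]
    ]


sample_pairs = make_pairs(sample_record)
for pair in sample_pairs:
    print(pair["input"])
```

Both prompts share the same target text, which is the behavior the assumption above relies on.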


In [None]:
!pip install jsonlines

In [None]:
# Only needed if you are using Google Colab and want to retrieve the data from your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import json
import pandas as pd

# Load the enriched abstracts that include entities
with open("/content/drive/MyDrive/biomedical_text_generation/data/enriched/abstracts_with_entities.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(f"Total abstracts loaded: {len(data)}")


In [None]:
df = pd.DataFrame(data)

# Confirm structure
df[["pmid", "title", "entities", "abstract"]].head(3)


In [None]:
# Store generated training examples here
training_pairs = []

# Loop through all abstracts
for entry in data:
    abstract = entry["abstract"]
    pmid = entry.get("pmid")  # None if the record has no PMID
    for entity in entry["entities"]:
        # Create a templated prompt
        prompt = f"Summarize findings about {entity}."
        training_pairs.append({
            "input": prompt,
            "output": abstract,
            "pmid": pmid,
            "entity": entity  # optional: to trace which entity was used
        })

print(f"Generated {len(training_pairs)} training pairs.")
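
Since every entity yields its own pair, it may be worth checking how many pairs each abstract contributes and whether any pair is repeated exactly (for example, if an entity was extracted twice from the same abstract). The following is a rough sketch; `pair_stats` and `demo_pairs` are hypothetical names, and the demo data is invented so the snippet runs on its own.

```python
from collections import Counter


def pair_stats(pairs):
    """Return (pairs per PMID, number of exact duplicate (input, output) pairs)."""
    per_pmid = Counter(p["pmid"] for p in pairs)
    seen = Counter((p["input"], p["output"]) for p in pairs)
    # Each key seen `count` times contributes `count - 1` redundant copies.
    duplicates = sum(count - 1 for count in seen.values())
    return per_pmid, duplicates


# Invented demo data; in the notebook you would pass `training_pairs` instead.
demo_pairs = [
    {"input": "Summarize findings about A.", "output": "text 1", "pmid": "1"},
    {"input": "Summarize findings about B.", "output": "text 1", "pmid": "1"},
    {"input": "Summarize findings about A.", "output": "text 1", "pmid": "1"},
]

per_pmid, duplicates = pair_stats(demo_pairs)
print(f"Most pairs from a single abstract: {per_pmid.most_common(1)}")
print(f"Exact duplicate pairs: {duplicates}")
```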


In [None]:
import os
import jsonlines

# Create output folder
output_path = "/content/drive/MyDrive/biomedical_text_generation/data/training_data/summarization"
os.makedirs(output_path, exist_ok=True)

# Save as JSON Lines (one JSON object per line)
with jsonlines.open(os.path.join(output_path, "entity_to_abstract.jsonl"), mode="w") as writer:
    writer.write_all(training_pairs)

print("Saved dataset to entity_to_abstract.jsonl")
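
As an optional check, the file can be read back to confirm that every line parses as a JSON object. The sketch below uses only the standard library and a temporary file, so it does not touch the Drive path above; `read_jsonl` and the demo row are hypothetical.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Invented demo row written to a temporary file for a round-trip check.
demo_rows = [{"input": "Summarize findings about X.", "output": "abstract text"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for row in demo_rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    loaded_rows = read_jsonl(demo_path)

print(f"Read back {len(loaded_rows)} row(s).")
```

In the notebook itself, calling `read_jsonl` on the saved `entity_to_abstract.jsonl` and comparing the row count with `len(training_pairs)` would serve the same purpose.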
