## Combined Summarization Dataset

This dataset combines two previously created entity-centric summarization datasets:

- `entity_to_abstracts.jsonl`: abstracts linked to **single** biomedical entities.
- `multi_entity_to_abstracts.jsonl`: abstracts linked to **multiple** biomedical entities.

Each entry includes:
- A `pmid` (PubMed ID)
- A list of `entities` associated with the abstract
- The full `abstract` text
- A generated `input` field used as the **summarization prompt** (based on entity or entity combination)
- A `target` field containing the **title of the article**, used as the **summary**
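
As a rough illustration of this schema, the sketch below builds one record and checks that it carries the five fields listed above. All field values (PMID, entity name, prompt wording, title) are invented for illustration, and `validate_record` is a hypothetical helper, not part of the original pipeline.

```python
# Hypothetical example record; every value here is a placeholder.
example_record = {
    "pmid": "00000000",
    "entities": ["immune checkpoint inhibitors"],
    "abstract": "Placeholder abstract text.",
    "input": "immune checkpoint inhibitors",  # actual prompt format is an assumption
    "target": "Placeholder article title",
}

REQUIRED_FIELDS = ("pmid", "entities", "abstract", "input", "target")


def validate_record(record):
    """Return the required fields that are missing from a record, in schema order."""
    return [field for field in REQUIRED_FIELDS if field not in record]


missing = validate_record(example_record)
print("Missing fields:", missing)
```

A check like this could be run over every loaded record before merging, so that a file using a different key name is caught early rather than failing partway through deduplication.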

We merged these two files and **removed duplicate records**, i.e., records with an identical `(input, target)` pair, while **preserving multiple unique prompts per abstract**. An abstract linked to several entities therefore still appears once per distinct prompt, which keeps the connection between entities and content and adds diversity to the training data.
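
The deduplication rule can be illustrated on a toy example. This is a minimal sketch, assuming the summary is stored under the `target` key as described above; the `deduplicate` helper and the three toy records are hypothetical and only show that an exact repeat is dropped while a second prompt for the same abstract survives.

```python
def deduplicate(records):
    """Keep the first record for each (input, target) pair, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = (record["input"], record["target"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Toy records (invented values): one abstract, two distinct prompts, one exact repeat.
toy = [
    {"pmid": "1", "input": "entity A", "target": "Title X"},
    {"pmid": "1", "input": "entity A, entity B", "target": "Title X"},
    {"pmid": "1", "input": "entity A", "target": "Title X"},
]

result = deduplicate(toy)
print([r["input"] for r in result])
```

Only the third record is removed; the first two remain because their prompts differ even though they point to the same abstract and title.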

---

### Why Not Use the Vanilla Summarization Dataset?

In early experiments, we created a "vanilla" summarization dataset where:

- The `input` was the **full abstract**
- The `target` was the **title** of the article

This setup has some limitations:

- It encourages the model to generate short titles, not true abstractive summaries
- It ignores the valuable **entity-level information** we've extracted
- It doesn't allow conditional summarization based on specific biomedical concepts

---

### Why Use This Dataset Instead?

- Uses **entities as input prompts**, aligning with biomedical summarization use cases
- More flexible: can create summaries **targeted to specific topics** (e.g. “immune checkpoint inhibitors” or “quality of life”)
- Retains multiple valid prompts for the same abstract, enhancing training diversity
- Suitable for **future RAG pipelines** or **multi-input summarization tasks**
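
Because each record carries its `entities` list, a topic-specific subset can be pulled out with a simple filter. The following is a minimal sketch, assuming entity names are stored as plain strings and that case-insensitive exact matching is acceptable; `filter_by_entity` and the toy records are hypothetical.

```python
def filter_by_entity(records, entity):
    """Return records whose `entities` list contains `entity` (case-insensitive)."""
    wanted = entity.lower()
    return [
        record
        for record in records
        if any(name.lower() == wanted for name in record.get("entities", []))
    ]


# Toy records with invented values.
toy_records = [
    {"pmid": "1", "entities": ["immune checkpoint inhibitors"]},
    {"pmid": "2", "entities": ["quality of life", "Immune Checkpoint Inhibitors"]},
    {"pmid": "3", "entities": ["quality of life"]},
]

subset = filter_by_entity(toy_records, "immune checkpoint inhibitors")
print([record["pmid"] for record in subset])
```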

This dataset now serves as the **primary source for training biomedical summarization models**, including T5 and BART.


In [None]:
!pip install jsonlines

In [None]:
# Only needed if you are using Google Colab and want to retrieve the data from your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import jsonlines

# Paths to the two datasets
entity_file = "/content/drive/MyDrive/biomedical_text_generation/data/training/summarization/entity_to_abstracts.jsonl"
multi_entity_file = "/content/drive/MyDrive/biomedical_text_generation/data/training/summarization/multi_entity_to_abstracts.jsonl"

def load_jsonl(file_path):
    """Read a JSON Lines file and return its records as a list of dicts."""
    with jsonlines.open(file_path) as reader:
        return list(reader)

# Load both datasets into a combined list
all_samples = load_jsonl(entity_file) + load_jsonl(multi_entity_file)

print(f"Total loaded samples (before deduplication): {len(all_samples)}")


In [None]:
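# Remove exact duplicates using a set of (input, summary) tuples.
# The summary field may be stored as "target" or "output" depending on how
# the source files were written (assumption), so accept either key.
seen = set()
unique_samples = []

for sample in all_samples:
    summary = sample["target"] if "target" in sample else sample["output"]
    pair = (sample["input"], summary)
    if pair not in seen:
        seen.add(pair)
        unique_samples.append(sample)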

print(f"Samples after deduplication: {len(unique_samples)}")


In [None]:
# Save combined dataset
output_path = "/content/drive/MyDrive/biomedical_text_generation/data/training/summarization/combined_deduplicated.jsonl"

with jsonlines.open(output_path, mode='w') as writer:
    writer.write_all(unique_samples)

print("✅ Combined dataset saved:", output_path)


In [None]:
# Some statistics
from collections import Counter

# How many pmids do we have?
pmid_counts = Counter([s["pmid"] for s in unique_samples])
print("Unique abstracts (pmids):", len(pmid_counts))
print("Average prompts per abstract:", round(len(unique_samples) / len(pmid_counts), 2))
