# Dataset Creation for Vanilla Summarization

This notebook generates a training dataset for text summarization from biomedical abstracts. 
We use the enriched dataset (`abstracts_with_entities.json`), where each entry contains the abstract, title, and biomedical entities. 

### Objective:
Create training samples for supervised summarization tasks using models such as T5 or BART. Each sample consists of:
- `input`: The biomedical abstract
- `output`: The title of the publication

This setup allows the model to learn to generate concise and informative summaries (titles) based on the abstract content.
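As a minimal sketch of the mapping described above, the helper below turns one enriched entry into one training sample, using the `input`/`output` keys that the code cells further down write to disk. The name `build_sample` and the example entry are hypothetical and only serve as illustration; they are not part of the original notebook.

```python
from typing import Optional


def build_sample(entry: dict) -> Optional[dict]:
    """Map one enriched entry to a summarization sample, or None if unusable."""
    # `or ""` guards against keys that are present but set to None
    abstract = (entry.get("abstract") or "").strip()
    title = (entry.get("title") or "").strip()

    # Entries without an abstract or a title cannot form an (input, output) pair
    if not abstract or not title:
        return None
    return {"input": abstract, "output": title}


# Hypothetical entry, invented for illustration only
example_entry = {
    "abstract": "  We study the effect of X on Y in a mouse model.  ",
    "title": "Effect of X on Y in mice",
}

sample = build_sample(example_entry)
print(sample)
```

Returning `None` for unusable entries keeps the filtering decision in one place, so the caller only has to drop the `None` values.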

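We save the result in `.jsonl` format under:

`/content/drive/MyDrive/biomedical_text_generation/data/training/summarization/vanilla_summarization.jsonl`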

In [None]:
!pip install jsonlines

In [None]:
# Only needed if you are using Google Colab and want to retrieve the data from your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import json
import jsonlines
from tqdm import tqdm

# Path to the enriched abstracts
path = "/content/drive/MyDrive/biomedical_text_generation/data/enriched/abstracts_with_entities.json"

with open(path, "r", encoding="utf-8") as f:
    data = json.load(f)

print(f"Total abstracts loaded: {len(data)}")


In [None]:
# Create list of dicts in the format {"input": abstract, "output": title}
vanilla_dataset = []

for entry in tqdm(data):
    # `or ""` also covers keys that exist but hold None, which would break .strip()
    abstract = (entry.get("abstract") or "").strip()
    title = (entry.get("title") or "").strip()

    # Skip empty or corrupted entries
    if abstract and title:
        vanilla_dataset.append({
            "input": abstract,
            "output": title
        })

print(f"Total samples in summarization dataset: {len(vanilla_dataset)}")


In [None]:
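import os

# Define output path
output_path = "/content/drive/MyDrive/biomedical_text_generation/data/training/summarization/vanilla_summarization.jsonl"

# Create the target folder if it does not exist yet
os.makedirs(os.path.dirname(output_path), exist_ok=True)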

# Save the dataset
with jsonlines.open(output_path, mode="w") as writer:
    writer.write_all(vanilla_dataset)

print(f"Saved summarization dataset to: {output_path}")
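
After saving, it can be useful to read the file back and confirm that every line parses and carries the expected keys. The sketch below is one way to do that, under some assumptions: it uses only the standard library `json` module instead of `jsonlines`, it writes to a temporary file rather than the Drive path, and the helper `read_jsonl` and the demo samples are hypothetical.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical samples, invented for illustration only
demo_samples = [
    {"input": "Abstract one.", "output": "Title one"},
    {"input": "Abstract two.", "output": "Title two"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")

    # One JSON object per line is all the JSON Lines format requires
    with open(demo_path, "w", encoding="utf-8") as f:
        for s in demo_samples:
            f.write(json.dumps(s, ensure_ascii=False) + "\n")

    loaded = read_jsonl(demo_path)

# Round-trip check: same records, and each has exactly the two expected keys
assert loaded == demo_samples
assert all(set(s) == {"input", "output"} for s in loaded)
print(f"Verified {len(loaded)} samples")
```

To check the real dataset, the same helper could be pointed at `output_path` and the result compared against `len(vanilla_dataset)`.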
