In [None]:
!pip install datasets

## Importing Libraries / Loading Dataset and Pre-trained Model



1.   **Installing necessary tools** (libraries).
2.   **Loading training data** from a JSON Lines file.
3.   **Loading a pre-trained summarization model** and its tokenizer for fine-tuning on your data.



In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainer, Seq2SeqTrainingArguments, TrainerCallback
from datasets import load_dataset
import json
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from tqdm import tqdm

# Load Dataset for training
dataset = load_dataset("json", data_files="train_bart-large-cnn.jsonl")

print(dataset)

# Select to load and fine-tune a small model so Google Colab can handle it
model_name = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'summary'],
        num_rows: 50
    })
})


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

## Preprocessing the Data for the Model

This code snippet focuses on preparing the data for training the summarization model.

In [3]:
def preprocess_function(examples):
    """
    The `preprocess_function` takes examples with text and summary, tokenizes them with padding, and
    returns model inputs with labels for training.

    :param examples: The `examples` parameter is a dictionary containing the input text and target
    summary for each example in the dataset. The keys in the dictionary are "text" for input text and
    "summary" for target summary
    :return: The preprocess_function returns the model_inputs dictionary with tokenized and padded
    inputs for the text and summary examples, along with the labels for the model training.
    """
    inputs = examples["text"]
    targets = examples["summary"]

    # Tokenize inputs with padding
    model_inputs = tokenizer(
        inputs,
        max_length=1024,
        truncation=True,
        padding="max_length"  # pad to max length for training
    )

    # Tokenize targets with padding
    labels = tokenizer(
        targets,
        max_length=128,
        truncation=True,
        padding="max_length"  # pad to max length for training
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

## Setting Training Arguments

This section defines the configuration for training the summarization model using `Seq2SeqTrainingArguments`.

Let's go through some key parameters:

*   `output_dir`: Specifies the directory where training results will be saved (./results).
*   `report_to`="none": Disables integration with logging tools like Weights & Biases (wandb) and TensorBoard.
*   `learning_rate`: Sets the learning rate for the optimizer (2e-5).
*   `per_device_train_batch_size`: Determines the batch size for training on each device (4).
*   `weight_decay`: Applies weight decay regularization to prevent overfitting (0.01).
*   `save_total_limit`: Limits the number of saved checkpoints to 1.
*   `num_train_epochs`: Sets the total number of training epochs to 2.
*   `predict_with_generate`: Enables text generation during prediction.
*   `fp16`: Uses 16-bit floating-point precision if your GPU supports it for faster training.
*   `disable_tqdm`: False ensures a progress bar is displayed during training.

In [4]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    report_to="none",  # disable wandb, tensorboard etc.
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=2,  # Keep low for Colab!
    predict_with_generate=True,
    fp16=True,  # If GPU supports
    disable_tqdm=False,   # ensures progress bar
)

## Training the model

This code takes the model, the training settings, and the training data, then kicks off the training process to teach the model how to summarize text effectively.

In [5]:
# The `ProgressCallback` class prints a message at the beginning of each epoch during training.
class ProgressCallback(TrainerCallback):
    def on_epoch_begin(self, args, state, control, **kwargs):
        print(f"\n🚀 Starting epoch {state.epoch + 1}/{args.num_train_epochs}")

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    callbacks=[ProgressCallback]  # add custom callback
)

trainer.train()


🚀 Starting epoch 1/2


Step,Training Loss



🚀 Starting epoch 2.0/2




TrainOutput(global_step=26, training_loss=6.238956744854267, metrics={'train_runtime': 2747.1321, 'train_samples_per_second': 0.036, 'train_steps_per_second': 0.009, 'total_flos': 154791208550400.0, 'train_loss': 6.238956744854267, 'epoch': 2.0})

## Saving the Fine-tuned Model and Tokenizer
After training the model, these lines are crucial for saving your work:

Why is this important?

Saving the model and tokenizer allows you to:

1.   **Reuse the model later** without retraining, saving you time and resources.
2.   **Share the model** with others so they can use it for their own summarization tasks.
3.   **Deploy the model** to a production environment for real-world applications.

By saving both the model and tokenizer in the same directory, they can easily be loaded and used together in the future.

In [None]:
model.save_pretrained("./finetuned_bart-large-cnn")
tokenizer.save_pretrained("./finetuned_train_bart-large-cnn")

## Importing Data from a JSON Lines File

This code snippet is responsible for importing data from a JSON Lines (JSONL) file and then examining the structure of the imported data.

In [10]:
def import_json_as_list(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)  # load entire JSON array
    return data

test_data = import_json_as_list("test_summarized_articles.jsonl")
test_data[0].keys()

dict_keys(['title', 'url', 'source', 'author', 'publishedAt', 'description', 'content', 'content_len', 'clean_content', 'clean_content_len', 'summary_facebook/bart-large-cnn', 'summary_google/pegasus-xsum', 'summary_google/pegasus-multi_news', 'summary_google/pegasus-cnn_dailymail'])

## Creating the Summarization Pipeline


In [11]:
# === Load model & tokenizer if needed ===
# model_path = "./finetuned_model"
# tokenizer = AutoTokenizer.from_pretrained(model_path)
# model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, device=-1)  # device=0 if GPU

# === Load articles ===
articles = import_json_as_list("test_summarized_articles.jsonl")

# === Summarize ===
results = []
for article in tqdm(articles):
    input_text = article.get("clean_content") or ""
    if not input_text.strip():
        summary = None
    else:
        # Handle potential out-of-vocabulary tokens
        try:
            summary = summarizer(input_text, max_length=130, min_length=30, do_sample=False)[0]["summary_text"]
        except IndexError:
            print(f"Warning: Skipping article with potential out-of-vocabulary tokens: {article.get('title', 'Unknown Title')}")
            summary = None  # or some default summary

    results.append({
        "title": article.get("title"),
        "url": article.get("url"),
        "source": article.get("source"),
        "author": article.get("author"),
        "publishedAt": article.get("publishedAt"),
        "description": article.get("description"),
        "content": article.get("content"),
        "content_len": article.get("content_len"),
        "clean_content": article.get("clean_content"),
        "clean_content_len": article.get("clean_content_len"),
        "summary_facebook/bart-large-cnn": article.get("summary_facebook/bart-large-cnn"),
        # "summary_google/pegasus-xsum": article.get("summary_google/pegasus-xsum"),
        # "summary_google/pegasus-multi_news": article.get("summary_google/pegasus-multi_news"),
        # "summary_google/pegasus-cnn_dailymail": article.get("summary_google/pegasus-cnn_dailymail"),
        "summary_finetuned_pegasus-xsum": summary
    })

# === Save results ===
with open("summarized_articles_pegasus-xsum.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

print("Summaries saved to summarized_articles.json ✅")

Device set to use cpu
  3%|▎         | 1/33 [00:22<12:09, 22.78s/it]Token indices sequence length is longer than the specified maximum sequence length for this model (2481 > 1024). Running this sequence through the model will result in indexing errors
  6%|▌         | 2/33 [00:22<04:54,  9.49s/it]Your max_length is set to 130, but your input_length is only 61. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=30)




 21%|██        | 7/33 [01:17<05:31, 12.75s/it]Your max_length is set to 130, but your input_length is only 59. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=29)




 36%|███▋      | 12/33 [02:11<04:47, 13.69s/it]



 45%|████▌     | 15/33 [02:24<02:41,  8.97s/it]Your max_length is set to 130, but your input_length is only 47. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)
 52%|█████▏    | 17/33 [02:48<02:46, 10.41s/it]



 58%|█████▊    | 19/33 [03:06<02:18,  9.87s/it]



 67%|██████▋   | 22/33 [03:20<01:23,  7.62s/it]Your max_length is set to 130, but your input_length is only 90. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=45)
 70%|██████▉   | 23/33 [03:29<01:18,  7.87s/it]



 82%|████████▏ | 27/33 [04:12<01:03, 10.50s/it]



100%|██████████| 33/33 [05:19<00:00,  9.69s/it]

Summaries saved to summarized_articles.json ✅



