# Long-Document Summarization with LongT5

This notebook demonstrates how to use the `pszemraj/long-t5-tglobal-base-16384-book-summary` model, a LongT5 checkpoint that accepts inputs of up to 16,384 tokens, to summarize long-form content such as articles or books.


In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
!pip install transformers torch datasets accelerate rouge_score --quiet

### 🔹 Step 1: Load LongT5 Model and Tokenizer


In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import load_dataset
import torch

In [3]:
# Load model and tokenizer
model_name = "pszemraj/long-t5-tglobal-base-16384-book-summary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

Some weights of LongT5ForConditionalGeneration were not initialized from the model checkpoint at pszemraj/long-t5-tglobal-base-16384-book-summary and are newly initialized: ['decoder.embed_tokens.weight', 'encoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
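The warning above reports that the encoder and decoder embedding weights were newly initialized rather than loaded from the checkpoint. T5-style models usually tie these to a shared embedding matrix, so whether this is a real problem may depend on the installed `transformers` version. If the embeddings really are random, generation can produce incoherent text. The following is a minimal sketch for inspecting this programmatically, assuming `from_pretrained` is called with `output_loading_info=True`; the helper names `missing_embedding_keys` and `load_and_check` are hypothetical and not part of the original notebook.

```python
from typing import Iterable, List

# Suffix of the parameter names reported in the warning above.
EMBEDDING_SUFFIX = "embed_tokens.weight"


def missing_embedding_keys(missing_keys: Iterable[str]) -> List[str]:
    """Return the reported missing keys that look like token-embedding weights."""
    return [key for key in missing_keys if key.endswith(EMBEDDING_SUFFIX)]


def load_and_check(model_name: str):
    """Load a seq2seq model and report embedding weights flagged as missing.

    Hypothetical helper: transformers is imported lazily so that
    missing_embedding_keys can be used without it installed.
    """
    from transformers import AutoModelForSeq2SeqLM

    # output_loading_info=True makes from_pretrained return (model, info),
    # where info["missing_keys"] lists weights absent from the checkpoint.
    model, info = AutoModelForSeq2SeqLM.from_pretrained(
        model_name, output_loading_info=True
    )
    suspicious = missing_embedding_keys(info["missing_keys"])
    if suspicious:
        print("Embedding weights reported as missing:", suspicious)
    return model, suspicious
```

If `load_and_check(model_name)` returns a non-empty list, it is worth inspecting a short generation before trusting any summary the model produces.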


In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

### 🔹 Step 2: Provide Input Text (Long Document)


In [6]:
# Sample text (kept short for demonstration; replace with a long document)
text = """
OpenAI released ChatGPT in November 2022, which quickly became one of the fastest-growing applications in history.
The tool allows users to interact with an AI model capable of generating human-like text, aiding in tasks like
writing, summarization, and coding. Following its release, tech giants accelerated their work on AI assistants,
including Google’s Bard and Anthropic’s Claude. This led to an AI race, impacting sectors like education,
productivity, and research. Researchers also raised concerns about ethical risks, misinformation, and over-reliance.
"""

### 🔹 Step 3: Tokenize Input and Generate Summary


In [7]:
# Tokenize input; anything beyond 16,384 tokens is dropped by truncation=True
inputs = tokenizer(text, return_tensors="pt", max_length=16384, truncation=True).to(device)

# Generate summary
summary_ids = model.generate(
    **inputs,
    max_length=512,
    num_beams=4,
    early_stopping=True
)
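Because `truncation=True` drops everything past the token limit, a document longer than the limit would be summarized only in part. One common workaround, which is not what this notebook does, is to split the text into overlapping chunks and summarize each one. Below is a minimal sketch under stated assumptions: word counts are used as a rough proxy for token counts, and `split_into_chunks`, `summarize_long`, and the `summarize` callable are hypothetical names introduced here for illustration.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 3000, overlap: int = 200) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words.

    Word count is only an approximation of token count, so max_words
    should be chosen conservatively relative to the model's token limit.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def summarize_long(
    text: str,
    summarize: Callable[[str], str],
    max_words: int = 3000,
    overlap: int = 200,
) -> str:
    """Summarize each chunk with the given callable and join the results."""
    parts = [summarize(chunk) for chunk in split_into_chunks(text, max_words, overlap)]
    return "\n".join(parts)
```

Here `summarize` would wrap the tokenize, generate, and decode steps of this notebook in a single function that maps a string to a string. The overlap keeps some shared context between neighboring chunks, at the cost of summarizing a few passages twice.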

### 🔹 Step 4: Decode and Display Summary


In [8]:
# Decode and print
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("\n📌 Summary:\n", summary)


📌 Summary:
 thea: and to of fillse int- is de for’i that youd I withn on'o are iten be The as yourl ( or have at from an was thiser lamring can! will by? notre) wey und has all die but our their A more un dercuin so they one about myul whichà In/hef le out also des It up " timeăif This Wep do– “onh sile les în his who likeb when; been otherly"g cu care what newor some get were just there wouldS them any).al into me had se makeat than du over You how no peoplean”éit Ifk peis her workve only may its first most well use zu pourzil need these din den usable S mit very am& au many maiAth through pentru two von wayllI ce și help best),un years 2 C nu goodv 1w das ca where know year He see für auf 3deest back such shouldx after could ist now muchand... hometo ein even que day take want For said sur une să dans great este because informationului findC she imation then est par used E made Soam eine şi business right here being B those before And Pers donB life go As M each qui placecomant sich
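
The output above shares very little vocabulary with the input text, which is consistent with the weight-initialization warning printed when the model was loaded. A rough way to flag output like this automatically is to measure how many summary words also occur in the source. This is only a heuristic sketch: abstractive summaries legitimately introduce new words, so a low score is a prompt to inspect the output and not a verdict. The function name `source_overlap` is hypothetical.

```python
import re
from typing import List


def _words(text: str) -> List[str]:
    """Lowercase the text and extract simple alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def source_overlap(summary: str, source: str) -> float:
    """Fraction of summary words that also appear somewhere in the source."""
    source_vocab = set(_words(source))
    summary_words = _words(summary)
    if not summary_words:
        return 0.0
    hits = sum(word in source_vocab for word in summary_words)
    return hits / len(summary_words)
```

Calling `source_overlap(summary, text)` with the variables from this notebook gives a value between 0 and 1; any threshold for treating a summary as suspicious would need to be chosen empirically.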