# Text Summarization Assignment: Pre-trained T5 or BART with HuggingFace Transformers

## Introduction
**Text summarization** is the task of producing a concise, coherent summary of a longer text while retaining its main points. There are two main types:
1.  **Extractive Summarization:** Identifies and extracts key sentences or phrases directly from the original text.
2.  **Abstractive Summarization:** Generates new sentences and phrases that capture the essence of the original text, potentially using words and structures not present in the source.

Modern Transformer models, particularly **T5 (Text-to-Text Transfer Transformer)** and **BART (Bidirectional and Auto-Regressive Transformers)**, excel at abstractive summarization. They are trained as sequence-to-sequence models, capable of taking a text sequence as input and generating a new, summarized text sequence as output.

The **HuggingFace Transformers library** provides a straightforward way to load and use these pre-trained models for various NLP tasks, including summarization.

---

## Learning Objectives
Upon completion of this assignment, you should be able to:
- Understand the concept of abstractive text summarization.
- Load pre-trained T5 or BART models and their tokenizers using HuggingFace Transformers.
- Prepare input text for summarization (including T5's specific prefix).
- Generate summaries from single and multiple documents using `model.generate()`.
- Control summary generation parameters like `max_length`, `min_length`, and `num_beams`.
- Qualitatively evaluate the quality, coherence, and relevance of generated summaries.
- Discuss the strengths, weaknesses, and real-world applications of abstractive summarization models.

---

## Setup and Prerequisites
Ensure you have the necessary libraries installed. If not, uncomment and run the following commands:

```bash
# pip install transformers torch # This notebook uses the PyTorch model classes
# pip install sentencepiece # Required for T5 tokenizer
```

---

```python
import torch
import transformers  # needed for transformers.__version__ below
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

print(f"PyTorch Version: {torch.__version__}")
print(f"Transformers Version: {transformers.__version__}")

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# --- Sample Long Texts for Summarization ---
sample_long_texts = [
    "The Amazon rainforest is the largest rainforest in the world, covering an area of approximately 5.5 million square kilometers. It spans across nine South American countries: Brazil, Peru, Colombia, Ecuador, Bolivia, Guyana, Suriname, French Guiana, and Venezuela. The Amazon is incredibly biodiverse, housing around 10% of the world's known species, including numerous unique plants, insects, birds, and mammals. It plays a crucial role in regulating the Earth's climate by absorbing vast amounts of carbon dioxide, earning it the nickname 'the lungs of the Earth'. However, the rainforest faces severe threats from deforestation, primarily due to agricultural expansion, logging, and mining. Conservation efforts are underway to protect this vital ecosystem.",

    "Artificial intelligence (AI) is rapidly transforming various industries, from healthcare to finance and transportation. Machine learning, a subset of AI, enables systems to learn from data without explicit programming. Deep learning, in turn, is a specialized form of machine learning that uses neural networks with multiple layers to uncover intricate patterns in data. Recent advancements in AI have led to breakthroughs in areas such as natural language processing, computer vision, and autonomous driving. While AI offers immense potential for innovation and efficiency, it also raises ethical concerns regarding job displacement, privacy, and algorithmic bias. Researchers and policymakers are actively working on addressing these challenges to ensure responsible AI development.",

    "Climate change refers to long-term shifts in temperatures and weather patterns. These shifts may be natural, but since the 1800s, human activities have been the main driver of climate change, primarily due to the burning of fossil fuels (like coal, oil, and gas) which produces heat-trapping gases. The consequences of climate change include rising global temperatures, more frequent and intense heatwaves, melting glaciers and ice sheets leading to sea-level rise, altered precipitation patterns, and an increase in the intensity of extreme weather events such as floods, droughts, wildfires, and storms. Mitigating climate change requires a global effort to reduce greenhouse gas emissions, transition to renewable energy sources, and implement sustainable land use practices. Adaptation strategies are also crucial to cope with the unavoidable impacts of a changing climate."
]

print("Sample long texts loaded. Total texts:", len(sample_long_texts))
print("\nFirst sample text for summarization:\n", sample_long_texts[0][:200], "...")  # Print first 200 chars
```

---

## Assignment Questions

---

### Question 1: Model and Tokenizer Loading (T5 or BART)
Choose one of the following pre-trained models for summarization:
- **T5:** `t5-small`, `t5-base`, `t5-large` (e.g., `'t5-small'`) - *Note: T5 models require a specific prefix for summarization.*
- **BART:** `facebook/bart-large-cnn`, `sshleifer/distilbart-cnn-12-6` (e.g., `'facebook/bart-large-cnn'`)

1.  **Select Model Name:** Choose your preferred model name (e.g., `'t5-small'`).
2.  **Load Tokenizer:** Load the corresponding tokenizer using `AutoTokenizer.from_pretrained()`.
3.  **Load Model:** Load the model for sequence-to-sequence tasks using `AutoModelForSeq2SeqLM.from_pretrained()`. Move the model to your `device` (GPU if available, otherwise CPU).
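4.  **Inspect:** Print the type of the loaded tokenizer and model. Print a small part of the model's architecture to confirm it loaded correctly: `model.config` works for both model families, while `model.encoder.block[0]` is specific to T5 (for BART, the first encoder layer is `model.model.encoder.layers[0]`).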
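
As a starting point, the loading steps above might be wrapped as in the sketch below. `load_summarizer` is a hypothetical helper name introduced here for illustration, not part of the library, and `'t5-small'` is only one of the allowed choices. Nothing is downloaded until the function is actually called.

```python
def load_summarizer(model_name="t5-small", device=None):
    """Load a seq2seq checkpoint and its tokenizer, then move the model to `device`.

    `load_summarizer` is a hypothetical helper, not a library function.
    """
    # Imported inside the function so that defining the helper stays cheap;
    # in the notebook, the setup cell has already imported these names.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model, device
```

A call such as `tokenizer, model, device = load_summarizer("t5-small")` fetches the weights on first use; `type(tokenizer).__name__`, `type(model).__name__`, and `model.config` can then be printed for the inspection step.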

---

### Question 2: Single Document Summarization
Let's summarize the first `sample_long_texts` document.

1.  **Prepare Input:** Take `sample_long_texts[0]`.
    * **If using T5:** Prepend the text with `"summarize: "` (e.g., `"summarize: " + text`).
    * **If using BART:** No special prefix needed.
    Tokenize this prepared text, ensuring `return_tensors="pt"` and moving the tensors to your `device`.
2.  **Generate Summary:** Use `model.generate()` to produce the summary.
    * Set `max_length` (e.g., 50-100 tokens, experiment based on text length).
    * Set `min_length` (e.g., 10-20 tokens).
    * Set `num_beams` (e.g., 4 or 5) for beam search to get better quality.
    * Set `early_stopping=True`.
3.  **Decode and Print:** Decode the generated token IDs back to human-readable text using `tokenizer.decode()`. Print the original text and the generated summary.
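
One possible shape for these three steps is sketched below. The helper names `prepare_input` and `summarize` are hypothetical, the default parameter values are only examples from the ranges suggested above, and detecting T5 by looking for `"t5"` in the model name is an assumption that holds for the checkpoints listed in Question 1 but not necessarily for others.

```python
def prepare_input(text, model_name):
    """Prepend T5's task prefix; leave the text unchanged for BART-style models."""
    # Assumption: T5 checkpoints can be recognized by "t5" in the model name.
    if "t5" in model_name.lower():
        return "summarize: " + text
    return text


def summarize(text, tokenizer, model, device, model_name="t5-small",
              max_length=80, min_length=15, num_beams=4):
    """Tokenize one document, run beam search, and decode the first result."""
    inputs = tokenizer(
        prepare_input(text, model_name),
        return_tensors="pt",
        truncation=True,  # clip inputs longer than the model's maximum length
    ).to(device)
    summary_ids = model.generate(
        **inputs,
        max_length=max_length,
        min_length=min_length,
        num_beams=num_beams,
        early_stopping=True,
    )
    # generate() returns a batch; this call has exactly one sequence in it.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

With a tokenizer and model loaded as in Question 1, `summarize(sample_long_texts[0], tokenizer, model, device, model_name="t5-small")` would return the summary as a string, which can be printed next to the original text.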

---

### Question 3: Controlling Summary Length and Quality
The parameters in `model.generate()` significantly influence the summary's characteristics.

1.  **Experiment with `max_length`:** Summarize `sample_long_texts[1]` (the AI text) twice:
    * Once with a `max_length` of 30 tokens.
    * Once with a `max_length` of 80 tokens.
    Keep other parameters (like `num_beams`, `min_length`) consistent.
    Print both summaries.
2.  **Experiment with `num_beams`:** Summarize `sample_long_texts[2]` (the climate change text) twice:
    * Once with `num_beams=1` (greedy decoding).
    * Once with `num_beams=5` (standard beam search).
    Keep `max_length` and `min_length` consistent.
    Print both summaries.
3.  **Discussion:** Based on your observations, explain how `max_length` and `num_beams` impact the generated summaries in terms of length, coherence, and overall quality.
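
To keep the two experiments comparable, it may help to tokenize once and vary only the `generate()` arguments. The sketch below does this with a hypothetical helper, `compare_settings`; the specific values for the fixed parameters (`min_length=10`, `num_beams=4`, `max_length=80`) are assumptions chosen for illustration. The text passed in is expected to already carry the `"summarize: "` prefix when a T5 model is used.

```python
def compare_settings(text, tokenizer, model, device, settings):
    """Summarize `text` once per entry in `settings`.

    `settings` maps a label to a dict of keyword arguments for generate().
    Returns {label: (summary_text, output_sequence_length)}, where the length
    counts every token in the returned sequence, special tokens included.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    results = {}
    for label, generate_kwargs in settings.items():
        summary_ids = model.generate(**inputs, **generate_kwargs)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        results[label] = (summary, len(summary_ids[0]))
    return results


# Two experiments from the question; only one parameter varies within each.
length_settings = {
    "max_length=30": dict(max_length=30, min_length=10, num_beams=4),
    "max_length=80": dict(max_length=80, min_length=10, num_beams=4),
}
beam_settings = {
    "num_beams=1 (greedy)": dict(max_length=80, min_length=10, num_beams=1),
    "num_beams=5": dict(max_length=80, min_length=10, num_beams=5),
}
```

Printing the label, the sequence length, and the summary for each entry of `compare_settings(...)` gives a side-by-side view to draw on in the discussion.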

---

### Question 4: Batch Summarization
To efficiently summarize multiple documents, you can pass them as a batch to the tokenizer and then to the model.

1.  **Prepare Batch Input:** Take the entire `sample_long_texts` list.
    * **If using T5:** Prepend each text with `"summarize: "`.
    * **If using BART:** No special prefix needed.
    Tokenize the list. Remember to set `padding=True` and `truncation=True` to handle varying lengths, and `return_tensors="pt"`. Move tensors to `device`.
2.  **Generate Batch Summaries:** Pass the batched input to `model.generate()`. Use reasonable `max_length` and `num_beams` values.
3.  **Decode and Print:** Iterate through the generated summaries, decode each one, and print the original text alongside its corresponding summary for all documents in the batch.
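
The batch workflow can be sketched as follows. `summarize_batch` is a hypothetical helper name, and its default generation values are illustrative assumptions. Pass `prefix="summarize: "` for T5 and leave it empty for BART.

```python
def summarize_batch(texts, tokenizer, model, device, prefix="",
                    max_length=80, min_length=15, num_beams=4):
    """Summarize a list of documents in a single padded batch."""
    inputs = tokenizer(
        [prefix + text for text in texts],
        return_tensors="pt",
        padding=True,     # pad shorter documents to the longest one in the batch
        truncation=True,  # clip documents longer than the model's maximum length
    ).to(device)
    summary_ids = model.generate(
        **inputs,  # the attention mask in `inputs` tells the model to ignore padding
        max_length=max_length,
        min_length=min_length,
        num_beams=num_beams,
        early_stopping=True,
    )
    # batch_decode returns one string per input document, in the same order.
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
```

Because the output order matches the input order, `zip(sample_long_texts, summaries)` pairs each original text with its summary for printing.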

---

### Question 5: Qualitative Analysis and Limitations
Reflect on the summaries generated by the T5/BART model.

1.  **Abstractive vs. Extractive:** Based on your observations, are the summaries primarily abstractive (rewording) or extractive (copying sentences)? Provide examples.
2.  **Quality Assessment:** Do the summaries capture the main points of the original texts? Are there any instances of factual inaccuracies (hallucinations) or grammatical errors/awkward phrasing? Discuss.
3.  **Limitations:** What are some inherent limitations of using these pre-trained general-purpose summarization models (like T5/BART fine-tuned on CNN/DailyMail) for arbitrary text? Consider aspects like domain specificity, very long documents, or factual consistency.
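
For the abstractive-versus-extractive question, a rough numeric check can complement reading the summaries by eye. The sketch below computes the fraction of summary n-grams that never appear in the source; the function names and the simple word-level tokenization are assumptions made for illustration, and the score is only a heuristic, not an established quality metric. A value near 0 suggests mostly copied text, while a higher value suggests more rewording.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(source, summary, n=2):
    """Fraction of the summary's distinct n-grams that do not occur in the source."""
    summary_ngrams = ngrams(summary, n)
    if not summary_ngrams:
        return 0.0  # summary shorter than n words: nothing to compare
    novel = summary_ngrams - ngrams(source, n)
    return len(novel) / len(summary_ngrams)
```

Note that a high ratio does not imply a good summary: hallucinated content is also "novel", so the number should be read together with a manual check of factual consistency.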

---

### Question 6: Potential Applications and Extensions
1.  **Real-world Applications:** List at least three distinct real-world applications where text summarization could be highly beneficial. Explain how it would be used in each scenario.
2.  **Model Improvements/Adaptations:** How could you potentially improve the quality of the summaries for a specific domain or use case? Consider methods beyond just adjusting `generate` parameters (e.g., fine-tuning, different models, post-processing).
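
As one small illustration of the post-processing idea, generation that stops at `max_length` can leave a dangling sentence fragment. The sketch below trims such a fragment; `trim_to_complete_sentence` is a hypothetical helper, and its rule (cut after the last `.`, `!`, or `?` that is followed by whitespace or the end of the string) is a deliberately naive assumption that abbreviations such as "e.g." would confuse.

```python
import re


def trim_to_complete_sentence(summary):
    """Drop a trailing fragment after the last sentence-ending punctuation mark."""
    summary = summary.strip()
    # Only count punctuation followed by whitespace or end of string,
    # so the dot in a number like "5.5" is not treated as a sentence end.
    matches = list(re.finditer(r"[.!?](?=\s|$)", summary))
    if not matches:
        return summary  # no complete sentence found; leave the text untouched
    return summary[: matches[-1].end()]
```

This only tidies the surface form of a summary. Domain-specific gains in content quality would more likely come from the other options named above, such as fine-tuning or choosing a different checkpoint.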

---

## Submission Guidelines
- Ensure your notebook runs without errors from top to bottom.
- Save your notebook as `your_name_text_summarization_assignment.ipynb`.
- Clearly answer all questions and provide explanations where requested in Markdown cells.
- Feel free to add additional code cells or markdown cells for clarity or experimentation.

---