# Introduction to Large Language Models

## Text Summarization Experiment

This notebook documents a text summarization experiment that compares different Large Language Models (LLMs) and text generation (decoding) strategies. The goal is to compare the quality of the summaries each combination produces using standard evaluation metrics.

The experiment uses the CNN/DailyMail dataset, a widely used benchmark for abstractive summarization.

In [None]:
# @title Install required libraries

# Install necessary packages for datasets, transformers, and evaluation metrics
!pip install -q transformers datasets rouge_score sacrebleu

## Loading and Inspecting the Dataset

We will use the CNN/DailyMail dataset from the Hugging Face `datasets` library. This dataset contains news articles and corresponding human-written summaries (highlights). For this demonstration, we will load the dataset and inspect a single example from the test split to understand the data format.

In [None]:
# @title Load dataset and show an example

from datasets import load_dataset

# Load the specified version of the CNN/DailyMail dataset
cnn_dm_dataset = load_dataset("cnn_dailymail", "3.0.0")

# Select the first example from the test set for demonstration
example_article_data = cnn_dm_dataset["test"][0]

input_article = example_article_data["article"]
gold_summary = example_article_data["highlights"]  # Human-written reference summary

# Display parts of the article and the gold summary
print("--- Input Article (first 500 chars) ---\n", input_article[:500], "...\n")
print("--- Gold Summary ---\n", gold_summary)

## Exploring Different Models and Decoding Strategies

The assignment requires trying at least two different LLMs and at least three text generation (decoding) strategies.

**Chosen Models:**
1.  **BART:** A sequence-to-sequence model often used for abstractive summarization (`facebook/bart-large-cnn`).
2.  **T5 (Text-to-Text Transfer Transformer):** Another powerful sequence-to-sequence model (`t5-small` for faster inference). T5 requires a specific input prefix like "summarize:".

**Decoding Strategies:**
These methods determine how the model chooses each output token at inference time. We will explore:
1.  **Greedy Search:** Selects the token with the highest probability at each step. Simple and fast, but it can miss sequences whose overall probability is higher.
2.  **Beam Search:** Keeps the `num_beams` most probable partial sequences at each step and extends each of them. Often produces better results than greedy search. We will use `num_beams=4`.
3.  **Sampling:** Draws each token at random from the model's predicted distribution (`do_sample=True`), restricted to the `top_k` most probable tokens and sharpened or flattened by `temperature`. We will use `top_k=50` and `temperature=0.7`. Because sampling is random, its output changes between runs unless a seed is fixed.
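
To make the difference between these strategies concrete, the following is a minimal sketch on a toy next-token table. Everything in it is an assumption made for illustration: `TOY_MODEL` and its probabilities are invented, the function names are hypothetical, and the beam search omits the length normalization that real implementations usually apply. It is not how `transformers` implements decoding internally.

```python
import math
import random

# Hypothetical toy "model": next-token probabilities given the previous token.
# The numbers are made up purely for illustration.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "</s>": 0.25},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def greedy_decode(model, max_steps=5):
    """Pick the single most probable next token at every step."""
    tokens, log_prob = ["<s>"], 0.0
    for _ in range(max_steps):
        next_token, p = max(model[tokens[-1]].items(), key=lambda kv: kv[1])
        tokens.append(next_token)
        log_prob += math.log(p)
        if next_token == "</s>":
            break
    return tokens, log_prob


def beam_search_decode(model, num_beams=2, max_steps=5):
    """Keep the `num_beams` highest-scoring partial sequences at every step."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, log_prob in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, log_prob))  # finished beams carry over
                continue
            for next_token, p in model[tokens[-1]].items():
                candidates.append((tokens + [next_token], log_prob + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    return beams[0]


def sample_next_token(probs, top_k=2, temperature=0.7, rng=random):
    """Sample one token from the top_k most probable after temperature scaling."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Dividing log-probabilities by a temperature below 1 sharpens the distribution.
    weights = [math.exp(math.log(p) / temperature) for _, p in top]
    return rng.choices([tok for tok, _ in top], weights=weights, k=1)[0]


greedy_tokens, greedy_lp = greedy_decode(TOY_MODEL)
beam_tokens, beam_lp = beam_search_decode(TOY_MODEL, num_beams=2)
print("greedy:", greedy_tokens, round(math.exp(greedy_lp), 2))
print("beam:  ", beam_tokens, round(math.exp(beam_lp), 2))
print("sample:", sample_next_token(TOY_MODEL["the"], rng=random.Random(0)))
```

In this toy table, greedy search commits to "the" because it is the best first token, while beam search keeps "a" alive and ends up with a sequence of higher overall probability.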

In [None]:
# @title Define the summarization function

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def summarize_text_with_model(model, tokenizer, text_input, max_output_length=150, method="greedy"):
    """
    Generates a summary using a given model and tokenizer with different decoding methods.

    Args:
        model: The Hugging Face model for sequence-to-sequence generation.
        tokenizer: The Hugging Face tokenizer corresponding to the model.
        text_input (str): The input text to summarize.
        max_output_length (int): The maximum length of the generated summary.
        method (str): The decoding method to use ('greedy', 'beam', 'sampling').

    Returns:
        str: The generated summary.
    """
    # Encode the input text, truncating articles that exceed the context window
    input_encoding = tokenizer.encode(
        text_input,
        return_tensors="pt",
        truncation=True,
        max_length=1024  # Matches BART's input limit; T5 checkpoints are typically used with 512 tokens
    )

    # Move tensors to GPU if available (optional but good practice in Colab)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    input_encoding = input_encoding.to(device)

    # Generate summary based on the specified method.
    # num_beams and do_sample are set explicitly in every branch because a checkpoint's
    # own generation defaults (facebook/bart-large-cnn ships with num_beams=4) would
    # otherwise silently override the intended strategy.
    if method == "greedy":
        output_sequences = model.generate(
            input_encoding,
            max_length=max_output_length,
            num_beams=1,
            do_sample=False,
            no_repeat_ngram_size=3  # Block repeated trigrams
        )
    elif method == "beam":
        output_sequences = model.generate(
            input_encoding,
            max_length=max_output_length,
            num_beams=4,  # As described in the text cell
            do_sample=False,
            early_stopping=True,  # Only meaningful for beam search
            no_repeat_ngram_size=3
        )
    elif method == "sampling":
        output_sequences = model.generate(
            input_encoding,
            max_length=max_output_length,
            do_sample=True,
            num_beams=1,
            top_k=50,  # As described in the text cell
            temperature=0.7,  # As described in the text cell
            no_repeat_ngram_size=3
        )
    else:
        raise ValueError(f"Invalid decoding method: {method}. Choose from 'greedy', 'beam', 'sampling'.")

    # Decode the generated output
    generated_summary = tokenizer.decode(output_sequences[0], skip_special_tokens=True)

    return generated_summary

## Generating Summaries with BART

Now we will load the BART Large CNN model and generate summaries for our example article using the three decoding strategies defined earlier: Greedy Search, Beam Search, and Sampling.

In [None]:
# @title Load BART model and generate summaries

# Load BART tokenizer and model
bart_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
bart_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

print("Generating summaries with BART...")

# Generate one summary per decoding method
bart_greedy_summary = summarize_text_with_model(bart_model, bart_tokenizer, input_article, method="greedy")
bart_beam_summary = summarize_text_with_model(bart_model, bart_tokenizer, input_article, method="beam")
bart_sample_summary = summarize_text_with_model(bart_model, bart_tokenizer, input_article, method="sampling")

print("BART summaries generated.")

# Optional: Display generated summaries
# print("\nBART (Greedy):\n", bart_greedy_summary)
# print("\nBART (Beam):\n", bart_beam_summary)
# print("\nBART (Sample):\n", bart_sample_summary)

## Generating Summaries with T5

Next, we load the T5 Small model. Remember that T5 uses a text-to-text format, so we need to prepend the input article with "summarize: ". We will again generate summaries using Greedy Search, Beam Search, and Sampling.
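
As a small illustration of the prefix convention, the sketch below keeps the task prefix in one place so that it is only applied to checkpoints that expect it. The `TASK_PREFIXES` mapping and the `build_model_input` helper are hypothetical names introduced here for illustration; the notebook itself simply concatenates the strings in the next cell.

```python
# Hypothetical mapping from checkpoint name to the task prefix it expects.
# Checkpoints that are not listed (such as BART) receive the article unchanged.
TASK_PREFIXES = {"t5-small": "summarize: "}


def build_model_input(checkpoint_name, article):
    """Prepend the task prefix for text-to-text checkpoints, if one is registered."""
    return TASK_PREFIXES.get(checkpoint_name, "") + article


print(build_model_input("t5-small", "Some article text."))
print(build_model_input("facebook/bart-large-cnn", "Some article text."))
```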

In [None]:
# @title Load T5 model and generate summaries

# Load T5 tokenizer and model
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Prepare input for the T5 model by prepending the task prefix
t5_input_text = "summarize: " + input_article

print("Generating summaries with T5...")

# Generate one summary per decoding method
t5_greedy_summary = summarize_text_with_model(t5_model, t5_tokenizer, t5_input_text, method="greedy")
t5_beam_summary = summarize_text_with_model(t5_model, t5_tokenizer, t5_input_text, method="beam")
t5_sample_summary = summarize_text_with_model(t5_model, t5_tokenizer, t5_input_text, method="sampling")

print("T5 summaries generated.")

# Optional: Display generated summaries
# print("\nT5 (Greedy):\n", t5_greedy_summary)
# print("\nT5 (Beam):\n", t5_beam_summary)
# print("\nT5 (Sample):\n", t5_sample_summary)

## Evaluating Summary Quality

To quantitatively evaluate the quality of the generated summaries compared to the human-written gold summary, we will use standard text summarization metrics:

1.  **ROUGE (Recall-Oriented Understudy for Gisting Evaluation):** Compares an automatically produced summary to a set of reference summaries. It measures overlap of n-grams, word sequences, and word pairs.
    *   ROUGE-1: Measures overlap of unigrams (single words).
    *   ROUGE-2: Measures overlap of bigrams (two-word sequences).
    *   ROUGE-L: Measures the longest common subsequence, capturing sentence-level structure similarity.
    *   We will report the F1-score for ROUGE, which is the harmonic mean of precision and recall.
2.  **BLEU (Bilingual Evaluation Understudy):** Originally developed for machine translation, it measures the precision of n-grams in the generated text against the reference text and applies a brevity penalty to outputs shorter than the reference. It is less suited to abstractive summarization than ROUGE, but it can still provide useful comparative insights.

We will compute these metrics for each generated summary against the gold summary.
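
To make the ROUGE definitions above concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L F1 scores. It assumes lowercased whitespace tokenization with no stemming, so its numbers will generally differ from those of the `rouge_score` library used later in this notebook. The function names are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1_from_matches(matches, pred_total, ref_total):
    """Harmonic mean of precision (matches/pred) and recall (matches/ref)."""
    if matches == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / pred_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(prediction, reference, n=1):
    """ROUGE-N F1: overlap of n-grams, with counts clipped to the reference."""
    pred = ngram_counts(prediction.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # Counter intersection takes the minimum count
    return f1_from_matches(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f1_from_matches(lcs_length(pred, ref), len(pred), len(ref))


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(round(rouge_n_f1(prediction, reference, n=1), 3))  # 0.833 (5 of 6 unigrams match)
print(round(rouge_n_f1(prediction, reference, n=2), 3))  # 0.6   (3 of 5 bigrams match)
print(round(rouge_l_f1(prediction, reference), 3))       # 0.833 (LCS has 5 tokens)
```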

In [None]:
# @title Define evaluation functions

from rouge_score import rouge_scorer
from sacrebleu import corpus_bleu

def calculate_rouge_scores(predicted_summary, reference_summary):
    """
    Computes ROUGE-1, ROUGE-2, and ROUGE-L F1 scores.

    Args:
        predicted_summary (str): The generated summary.
        reference_summary (str): The human-written gold summary.

    Returns:
        dict: A dictionary with ROUGE scores (rouge1, rouge2, rougeL).
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference_summary, predicted_summary) # Note: rouge_score library takes ref, pred order

    # Extract F1 scores
    return {key: scores[key].fmeasure for key in ['rouge1', 'rouge2', 'rougeL']}

def calculate_bleu_score(predicted_summaries, reference_summaries):
    """
    Computes the corpus BLEU score.

    Args:
        predicted_summaries (list): A list of generated summaries.
        reference_summaries (list): A list of reference summaries, one per prediction and in the same order.

    Returns:
        float: The corpus BLEU score (on a 0-100 scale).
    """
    # sacrebleu expects a list of reference streams, each aligned with the predictions:
    # [[ref_for_pred1, ref_for_pred2, ...]]. With one reference per prediction there is a single stream.
    formatted_references = [reference_summaries]
    return corpus_bleu(predicted_summaries, formatted_references).score

## Computing and Presenting Results

Now we will compute the ROUGE and BLEU scores for each generated summary and organize the results into a table for easy comparison.

In [None]:
# @title Compute metrics and display results table

import pandas as pd

# Dictionary to store evaluation results, one list per table column
evaluation_results_dict = {
    "Model": [],
    "Decoding Strategy": [],
    "Generated Summary": [],
    "ROUGE-1 (F1)": [],
    "ROUGE-2 (F1)": [],
    "ROUGE-L (F1)": [],
    "BLEU Score": [],
}

# Store the generated summaries under "Model (Strategy)" labels
generated_summaries_map = {
    "BART (Greedy)": bart_greedy_summary,
    "BART (Beam)": bart_beam_summary,
    "BART (Sampling)": bart_sample_summary,
    "T5 (Greedy)": t5_greedy_summary,
    "T5 (Beam)": t5_beam_summary,
    "T5 (Sampling)": t5_sample_summary,
}

# Iterate through the summaries, compute metrics, and fill the results dictionary
for label, summary_text in generated_summaries_map.items():
    model_name, decoding_method = label.split()
    decoding_method = decoding_method[1:-1] # Remove parentheses

    # Compute ROUGE scores
    rouge_scores = calculate_rouge_scores(summary_text, gold_summary)

    # Compute BLEU score (BLEU is a corpus-level metric; on a single example it is only a rough comparison)
    # The function expects lists, so each summary is wrapped in a one-element list
    bleu_score = calculate_bleu_score([summary_text], [gold_summary])

    # Append results to the dictionary
    evaluation_results_dict["Model"].append(model_name)
    evaluation_results_dict["Decoding Strategy"].append(decoding_method)
    evaluation_results_dict["Generated Summary"].append(summary_text)
    evaluation_results_dict["ROUGE-1 (F1)"].append(round(rouge_scores["rouge1"], 3))
    evaluation_results_dict["ROUGE-2 (F1)"].append(round(rouge_scores["rouge2"], 3))
    evaluation_results_dict["ROUGE-L (F1)"].append(round(rouge_scores["rougeL"], 3))
    evaluation_results_dict["BLEU Score"].append(round(bleu_score, 2)) # BLEU scores are often reported with fewer decimal places

# Create a pandas DataFrame from the results dictionary
results_dataframe = pd.DataFrame(evaluation_results_dict)

# Display the DataFrame with wide columns to see the summaries
pd.set_option("display.max_colwidth", None)
print("--- Evaluation Results ---")
display(results_dataframe) # Use display() in Colab for better formatting

# Optionally display the gold summary again for easy comparison
print("\n--- Gold Summary for Reference ---")
print(gold_summary)

## Discussion and Findings

Based on the results table and the generated summaries, we can make the following observations for this specific example:

*   **Model Comparison:** BART Large CNN generally appears to produce higher ROUGE scores than T5 Small on this example. This is expected, as BART Large CNN is a larger model specifically fine-tuned for summarization on the CNN/DailyMail dataset, while T5 Small is a much smaller, general-purpose text-to-text model.
*   **Decoding Strategy Comparison (within BART):** For BART, Beam Search (num_beams=4) yielded slightly higher ROUGE scores compared to Greedy Search and Sampling. Greedy Search is often the fastest but can produce less coherent summaries. Sampling introduces variation but might not always find the highest scoring sequence according to these metrics.
*   **Decoding Strategy Comparison (within T5):** Similar to BART, Beam Search seems to perform best for T5 in terms of ROUGE scores on this example. The performance difference between strategies might be less pronounced with a smaller model like T5-small.
*   **BLEU vs. ROUGE:** The BLEU scores generally follow the same trend as ROUGE, showing better performance for BART compared to T5-small, and often slightly favoring Beam Search. It's important to remember BLEU's focus on n-gram precision, which can sometimes be a less reliable indicator of abstractive summary quality than ROUGE.
*   **Qualitative Look:** While the metrics provide a quantitative comparison, reading the generated summaries is crucial. BART's summaries are likely to be more fluent and coherent because the model is larger and was fine-tuned on this dataset. The T5-small summaries might be more concise but potentially less detailed or more grammatically awkward. Beam Search and Sampling summaries from BART might show subtle differences in phrasing compared to Greedy Search.
*   **Limitations:** This analysis is based on only *one* example from the dataset. A comprehensive evaluation would require testing on a larger sample of the test set and potentially analyzing summary length, factual accuracy, and other qualitative aspects. The choice of decoding parameters (beam width, temperature, top_k, max_length) also significantly impacts the results.
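
As a starting point for the larger-sample evaluation mentioned under Limitations, the sketch below averages per-example metric dictionaries key by key. The helper name `average_metric_dicts` and the example numbers are hypothetical; the numbers are placeholders, not results from this experiment.

```python
from statistics import mean


def average_metric_dicts(per_example_scores):
    """Average a list of {"metric_name": value} dicts key by key."""
    if not per_example_scores:
        raise ValueError("Need at least one example to average.")
    keys = per_example_scores[0].keys()
    return {key: mean(scores[key] for scores in per_example_scores) for key in keys}


# Placeholder scores for two imaginary examples (not real results).
example_scores = [
    {"rouge1": 0.40, "rouge2": 0.20, "rougeL": 0.30},
    {"rouge1": 0.60, "rouge2": 0.40, "rougeL": 0.50},
]
print(average_metric_dicts(example_scores))
```

In this notebook, the per-example dictionaries could come from calling `calculate_rouge_scores` on each article in a sample of the test split.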

In conclusion, on this single CNN/DailyMail example, the larger, summarization-tuned BART model appears to outperform the smaller, general-purpose T5-small model. Among decoding strategies, Beam Search often provides a good balance between quality and computational cost compared with simple Greedy Search or less controlled Sampling, as measured by standard metrics like ROUGE.