# Homework Sketch

* Develop llm_scorer for scoring text summaries and apply it to xsum and/or SAMSUM
* Develop llm_summarizer for summarization and apply it to a smaller subset of xsum or SAMSUM; score with ROUGE, BERTScore, and llm_scorer. Use a local or API-based model.
* Add llm_score to compute all metrics and apply it to the sentence pairs in the lesson. In each case, comment on the LLM similarity scores.
* Load another fine-tuned summarizer for xsum and/or SAMSUM and evaluate with the same metrics. Use T5, Pegasus, or ???
* Don't bother with fine-tuning.
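
A minimal sketch of what `llm_scorer` could look like, under stated assumptions: the 1 to 5 scale, the prompt wording, and the `llm` argument (any callable that takes a prompt string and returns the model's reply as a string) are all illustrative choices, not part of the lesson code. Keeping the model behind a plain callable lets the same scorer work with a local or an API-based model.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; the wording and the 1-5 scale are assumptions.
SCORE_PROMPT = (
    "Rate how well the summary captures the source text on a scale of 1 to 5, "
    "where 1 is poor and 5 is excellent. Reply with the number only.\n\n"
    "Source:\n{source}\n\nSummary:\n{summary}\n\nScore:"
)


def llm_scorer(source: str, summary: str, llm: Callable[[str], str]) -> Optional[int]:
    """Ask an LLM to grade a summary; return 1..5, or None if no score is found."""
    reply = llm(SCORE_PROMPT.format(source=source, summary=summary))
    # Take the first standalone digit in the range 1-5 from the reply.
    match = re.search(r"\b[1-5]\b", reply)
    return int(match.group()) if match else None
```

Returning `None` for unparseable replies keeps malformed model output from being silently counted as a score.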


# 📝 Exercise: Domain-Specific Summarization Using Fine-Tuned Models vs. General-Purpose LLMs

## Overview
In this exercise, you will fine-tune a pretrained sequence-to-sequence model, `facebook/bart-large`, on the **PubMed** dataset to perform domain-specific summarization. You will then compare its performance with that of a general-purpose LLM (`gpt-4`, or an open LLM such as `LLaMA-2`) on the same task.

---

## Learning Objectives
1. Understand the process of fine-tuning a pre-trained transformer model for domain-specific summarization.
2. Compare performance of fine-tuned models to general-purpose LLMs.
3. Use standard summarization metrics (e.g., ROUGE) to evaluate performance.
4. Explore trade-offs between specialized and general-purpose models for summarization.

---

## Part 1: Fine-Tuning `facebook/bart-large` on PubMed Dataset

### Setup
Install necessary libraries:
```bash
pip install transformers datasets accelerate evaluate rouge_score
```

### Load Dataset

In [None]:
from datasets import load_dataset

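# This dataset is defined by a loading script: some `datasets` releases require
# trust_remote_code=True here, and the newest releases no longer run loading scripts.
dataset = load_dataset("scientific_papers", "pubmed")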

### Fine-Tuning Code
Fine-tune `facebook/bart-large` on a subset of the PubMed dataset with the following settings:

- Train on a subset (2000 samples).
- Validate on a smaller subset (500 samples).
- Use `fp16=True` for efficient mixed-precision training (requires a CUDA GPU).
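
One way the settings above could be wired together is sketched below. It is a starting point under stated assumptions, not a tuned recipe: the batch size, learning rate, epoch count, sequence lengths, and `output_dir` are illustrative values, and the helper names `preprocess` and `build_trainer` are hypothetical. The `article` and `abstract` fields are the ones the `scientific_papers` PubMed configuration provides.

```python
def preprocess(batch, tokenizer, max_input_len=1024, max_target_len=256):
    """Tokenize articles as model inputs and abstracts as labels."""
    model_inputs = tokenizer(batch["article"], max_length=max_input_len, truncation=True)
    labels = tokenizer(text_target=batch["abstract"], max_length=max_target_len, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


def build_trainer(train_data, val_data, model_name="facebook/bart-large"):
    """Return (trainer, tokenizer, tokenized_val) for the given dataset splits."""
    # Imported lazily so preprocess() can be used without transformers installed.
    from transformers import (
        AutoModelForSeq2SeqLM,
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def tokenize(split):
        return split.map(
            preprocess,
            batched=True,
            fn_kwargs={"tokenizer": tokenizer},
            remove_columns=split.column_names,
        )

    tokenized_train, tokenized_val = tokenize(train_data), tokenize(val_data)

    args = Seq2SeqTrainingArguments(
        output_dir="bart-large-pubmed",   # hypothetical output path
        num_train_epochs=1,               # assumed value
        per_device_train_batch_size=2,    # assumed value; lower it if memory runs out
        per_device_eval_batch_size=2,
        learning_rate=3e-5,               # assumed value
        fp16=True,                        # mixed precision, needs a CUDA GPU
        predict_with_generate=True,       # predict() then returns generated token IDs
        generation_max_length=256,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_val,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    return trainer, tokenizer, tokenized_val


# Usage, assuming `dataset` was loaded as in the cell above:
# train_data = dataset["train"].select(range(2000))
# val_data = dataset["validation"].select(range(500))
# trainer, tokenizer, tokenized_val = build_trainer(train_data, val_data)
# trainer.train()
```

The names `trainer`, `tokenizer`, `val_data`, and `tokenized_val` are chosen to match the evaluation cell in Part 2.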


---

## Part 2: Evaluation with ROUGE Metrics

Add the following to your code after training:

In [None]:
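import numpy as np
import evaluate  # replaces `load_metric`, which recent `datasets` releases removed; needs: pip install evaluate rouge_score

# Loss and runtime statistics on the validation set
eval_results = trainer.evaluate()
print(eval_results)

rouge = evaluate.load("rouge")

# Generated token IDs; this assumes the trainer was built with predict_with_generate=True
outputs = trainer.predict(tokenized_val)

# The trainer may pad predictions with -100; swap in the pad token before decoding
pred_ids = np.where(outputs.predictions != -100, outputs.predictions, tokenizer.pad_token_id)
preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
refs = [example["abstract"] for example in val_data]

rouge_output = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
print(rouge_output)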

---

## Part 3: Comparison with General-Purpose LLM

### Instructions
1. Use an LLM (e.g., GPT-4 via API or LLaMA-2 locally) to generate summaries for the same validation set.
2. Compare outputs using the same ROUGE metrics.
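
Step 1 can be sketched as follows, using the prompt given in this part. The `generate` argument is a hypothetical stand-in for whichever model call you use (an API client or a local model wrapped in a function that maps a prompt string to a reply string), and the character-based truncation is a crude assumption for keeping long papers within a context window; a token-based limit would be more precise.

```python
from typing import Callable, List

PROMPT_TEMPLATE = "Summarize the following scientific paper: \n\n{Article}"


def build_prompt(article: str, max_chars: int = 12000) -> str:
    """Fill the prompt template, clipping very long articles (assumed limit)."""
    return PROMPT_TEMPLATE.format(Article=article[:max_chars])


def summarize_all(articles: List[str], generate: Callable[[str], str]) -> List[str]:
    """Generate one summary per article with the supplied LLM callable."""
    return [generate(build_prompt(article)).strip() for article in articles]


# Usage, assuming `val_data` from Part 1 and your own `generate` function:
# llm_preds = summarize_all([ex["article"] for ex in val_data], generate)
```

The resulting list can be passed as `predictions` to the same ROUGE computation used for the fine-tuned model.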

### Prompt Example:
"Summarize the following scientific paper: \n\n{Article}"

---

## Part 4: Analysis & Discussion

Answer the following questions:
1. How do the ROUGE scores of the fine-tuned BART model compare to those of the general-purpose LLM?
2. Are there qualitative differences in the summaries (e.g., coherence, conciseness, relevance)?
3. What are the strengths and weaknesses of each approach?
4. In what scenarios would you prefer using a fine-tuned model over a general-purpose LLM?
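
For question 1, a small helper such as the one below can line up the two sets of scores. It is a sketch that assumes each result is a dictionary mapping metric names to floats, as recent versions of the `evaluate` ROUGE metric return; the function name `compare_rouge` is hypothetical.

```python
def compare_rouge(fine_tuned: dict, llm: dict) -> list:
    """Return (metric, fine_tuned, llm, difference) rows for shared metrics."""
    rows = []
    for metric in sorted(fine_tuned.keys() & llm.keys()):
        delta = fine_tuned[metric] - llm[metric]  # positive means the fine-tuned model scored higher
        rows.append((metric, fine_tuned[metric], llm[metric], delta))
    return rows


# Usage, assuming `rouge_output` (fine-tuned BART) and `llm_rouge_output` (LLM):
# for metric, ft, gp, delta in compare_rouge(rouge_output, llm_rouge_output):
#     print(f"{metric}: fine-tuned={ft:.3f} llm={gp:.3f} diff={delta:+.3f}")
```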

---

## Extension (Optional)
- Fine-tune the model on the full PubMed dataset or another domain-specific dataset.
- Compare the performance of BART, PEGASUS, and T5 for this task.
- Explore different evaluation metrics (e.g., BERTScore).

---

## Submission
Submit a Jupyter notebook containing:
- Your fine-tuning code.
- ROUGE metric outputs for both models.
- Your analysis and discussion.

---

Good luck and have fun! 🎉

# Assignment: Comparing General and Fine-Tuned BART Models

## Objective
In this assignment, you will compare the performance of two pretrained BART models on the task of scientific text summarization:
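- A general-domain BART model: `facebook/bart-large` (the base checkpoint, pretrained on general-domain English text; the variant fine-tuned for summarization on CNN/DailyMail is `facebook/bart-large-cnn`)
- A domain-specific BART model: `ccdv/lsg-bart-base-4096-pubmed` (fine-tuned on PubMed)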

You will evaluate the models both qualitatively (by examining the generated summaries) and quantitatively (using ROUGE scores).

## Instructions

### Part 1: Setup
1. **Install the required packages**
```bash
%pip install transformers datasets rouge_score
```

2. **Import necessary libraries**

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration, pipeline

### Part 2: Load Models and Tokenizers

In [None]:
# Load general BART model
general_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
general_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

general_summarizer = pipeline("summarization", model=general_model, tokenizer=general_tokenizer)

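# Load domain-specific BART model
# This checkpoint ships custom LSG attention code, so per its model card it is
# loaded through the Auto classes with trust_remote_code=True.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

pubmed_tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bart-base-4096-pubmed", trust_remote_code=True)
pubmed_model = AutoModelForSeq2SeqLM.from_pretrained("ccdv/lsg-bart-base-4096-pubmed", trust_remote_code=True)

domain_summarizer = pipeline("summarization", model=pubmed_model, tokenizer=pubmed_tokenizer)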

### Part 3: Provide Scientific Text for Summarization
Paste a sample scientific text below. It should be at least a few paragraphs long.

In [None]:
sample_text = """
Your scientific text goes here.
"""

### Part 4: Generate Summaries

In [None]:
# Generate summaries using both models
# truncation=True clips inputs that exceed a model's maximum input length
general_summary = general_summarizer(sample_text, max_length=150, min_length=40, do_sample=False, truncation=True)[0]['summary_text']
domain_summary = domain_summarizer(sample_text, max_length=150, min_length=40, do_sample=False, truncation=True)[0]['summary_text']

print("General Model Summary:\n", general_summary)
print("\nDomain-Finetuned Model Summary:\n", domain_summary)

### Part 5: Evaluate using ROUGE Metrics

In [None]:
from rouge_score import rouge_scorer

# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Provide a reference summary for comparison
reference_summary = """
Your reference summary goes here.
"""

# Compute ROUGE scores for each model
scores_general = scorer.score(reference_summary, general_summary)
scores_domain = scorer.score(reference_summary, domain_summary)

# Display results
print("\nGeneral Model ROUGE Scores:")
for metric, score in scores_general.items():
    print(f"{metric}: {score}")

print("\nDomain Model ROUGE Scores:")
for metric, score in scores_domain.items():
    print(f"{metric}: {score}")
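
Each value returned by `scorer.score` carries a precision, a recall, and an F-measure. To make those numbers easier to interpret, the sketch below computes ROUGE-1 by hand as clipped unigram overlap. It is a simplified illustration: it splits on whitespace and does no stemming, so its values will not exactly match `rouge_score`, which tokenizes and stems differently. The function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: unigram overlap with whitespace tokens, no stemming."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)  # share of candidate tokens found in the reference
    recall = overlap / max(sum(ref.values()), 1)      # share of reference tokens found in the candidate
    fmeasure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


# Example: 4 of the 6 tokens overlap in each direction, so all three values are 4/6.
print(rouge1("the cat sat on the mat", "the cat lay on a mat"))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the n-gram overlap with the longest common subsequence.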

### Part 6: Analysis
1. **Qualitative Comparison**
    - Compare the outputs of the general and domain-specific models. What differences do you notice? Which model produces more coherent, relevant, or precise summaries?

2. **Quantitative Comparison**
    - Compare the ROUGE scores from both models. Which model achieves higher scores? Does this align with your qualitative observations?

3. **Discussion**
    - Discuss why the domain-specific model might perform better on PubMed data. What trade-offs might exist between using a general-purpose model vs. a fine-tuned model?

### Submission
Save your notebook and submit it according to your instructor’s guidelines.