# Assignment: Compare Multiple Tuned Outputs using BLEU and ROUGE


## Objective
This assignment focuses on evaluating and comparing the outputs of different Large Language Model (LLM) configurations or fine-tuning experiments, specifically for text generation tasks, using standard NLP metrics: BLEU and ROUGE. You will learn to prepare reference texts, generate hypotheses from multiple models, and calculate these metrics to quantitatively assess output quality.

## Part 1: Environment Setup and Dataset Preparation (25 Marks)

1.  **Environment Setup:**
    * Create a new Python virtual environment.
    * Install necessary libraries: `transformers`, `torch` (or `tensorflow`), `nltk` (for BLEU), `rouge_score` (for ROUGE), `datasets`, `pandas`, `numpy`.
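        * If using `nltk` for tokenization, remember to download its tokenizer data. Recent NLTK releases look for `punkt_tab` rather than `punkt`, and `wordnet` is not needed for BLEU itself (it is used by METEOR and by lemmatizers):
          ```python
          import nltk
          nltk.download('punkt')      # tokenizer models used by older NLTK releases
          nltk.download('punkt_tab')  # tokenizer data expected by recent NLTK releases
          nltk.download('wordnet')    # optional: only needed for METEOR or lemmatization
          ```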
    * Provide a `requirements.txt` file.

2.  **Dataset Acquisition:**
    * Select a small, publicly available dataset suitable for a text generation task. Examples:
        * **Summarization:** A subset of the XSum or CNN/DailyMail dataset (focus on short summaries).
        * **Question Answering:** A subset of SQuAD (where answers are extractive or short generated passages).
        * **Dialogue Generation:** A subset of a dialogue dataset where models generate responses.
    * **Minimum Requirement:** The dataset should have at least **50 samples** (ideally around 100), with each sample containing:
        * An input text/context for the LLM.
        * One or more high-quality **reference (ground truth) outputs** for comparison.
    * Load your chosen dataset into an easily iterable format (e.g., a Hugging Face `Dataset` object or a list of dictionaries).
    * Describe your chosen dataset, its source, number of samples, and the nature of the input/output pairs.
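
The loading step above can be sketched as follows. This is a minimal illustration, not a required implementation: the helper name `to_records` is hypothetical, the field names `document` and `summary` are assumed to match XSum as published on the Hugging Face Hub, and the commented-out `load_dataset` call should be checked against the dataset you actually choose. Two hand-written toy samples stand in for real data so the block runs offline.

```python
from typing import Dict, Iterable, List


def to_records(samples: Iterable[Dict], input_key: str, reference_key: str) -> List[Dict]:
    """Convert raw samples into {"input": str, "references": [str, ...]} records."""
    records = []
    for sample in samples:
        refs = sample[reference_key]
        # Wrap a single reference string in a list so every record has the same shape.
        if isinstance(refs, str):
            refs = [refs]
        records.append({
            "input": sample[input_key].strip(),
            "references": [ref.strip() for ref in refs],
        })
    return records


# With Hugging Face `datasets` (needs network access) the samples could come from, e.g.:
#   from datasets import load_dataset
#   raw = load_dataset("EdinburghNLP/xsum", split="test[:100]")  # assumed dataset id
# Toy samples (hand-written, not taken from any dataset) are used here instead.
raw = [
    {"document": "The city council met on Monday to vote on the new budget. ",
     "summary": "Council votes on budget."},
    {"document": "Heavy rain caused flooding across the region.",
     "summary": ["Rain floods region.", "Flooding follows heavy rain."]},
]

records = to_records(raw, input_key="document", reference_key="summary")
print(len(records), "records; first references:", records[0]["references"])
```

Keeping `references` as a list even when there is only one reference means the same structure can be passed unchanged to metrics that accept multiple references.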

In [None]:
# Your code for environment setup, dataset loading, and NLTK downloads.
# Provide `requirements.txt`.
# Describe your dataset.

## Part 2: Model Configuration and Output Generation (35 Marks)

1.  **Model Selection and Loading:**
    * Choose a base GPT-like model (e.g., `distilgpt2`, `gpt2`) for text generation.
    * **Define at least three distinct model configurations/"tunes"** for text generation. These could be achieved by:
        * **Option A (Recommended):** Using the same base model but loading different LoRA adapters (if you completed the LoRA assignment). This simulates different fine-tuning experiments.
        * **Option B:** Using the same base model but varying inference parameters significantly (e.g., `do_sample=True, top_k=50, temperature=0.7` vs. `num_beams=4` vs. `do_sample=True, top_p=0.9`). This simulates different decoding strategies.
        * **Option C:** Using slightly different base models (e.g., `gpt2` vs. `distilgpt2` vs. `gpt2-medium`).
    * Load each model/configuration and its corresponding tokenizer.

2.  **Generate Hypotheses:**
    * For each of your **three configurations**, iterate through your test dataset.
    * For each sample, pass the input text to the LLM and generate an output (hypothesis).
    * Store the generated outputs in separate lists for each configuration (e.g., `hypotheses_config_A`, `hypotheses_config_B`, `hypotheses_config_C`).
    * Store the reference outputs for the entire dataset in a single list (where each item is a list of reference strings if multiple references exist).
    * Ensure consistency in decoding parameters (e.g., `max_new_tokens`) across configurations unless intentionally varied for comparison.
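
One way to organise the generation loop for Option B is sketched below. The configuration names, the helper `generate_all`, and the stand-in `dummy_generate` are hypothetical; the dictionary keys are standard keyword arguments of `generate()` in Hugging Face `transformers`, and the values are examples only. The model call is injected as a function so the sketch runs without downloading a model.

```python
from typing import Callable, Dict, List

# Three hypothetical decoding configurations (Option B). Values are examples only.
CONFIGS: Dict[str, Dict] = {
    "config_A_top_k": {"do_sample": True, "top_k": 50, "temperature": 0.7},
    "config_B_beam": {"do_sample": False, "num_beams": 4},
    "config_C_top_p": {"do_sample": True, "top_p": 0.9},
}
# Parameters held constant across all configurations.
SHARED: Dict = {"max_new_tokens": 40}


def generate_all(generate_fn: Callable[..., str], inputs: List[str]) -> Dict[str, List[str]]:
    """Run every configuration over every input; return one hypothesis list per config."""
    hypotheses: Dict[str, List[str]] = {name: [] for name in CONFIGS}
    for name, kwargs in CONFIGS.items():
        for text in inputs:
            hypotheses[name].append(generate_fn(text, **SHARED, **kwargs))
    return hypotheses


def dummy_generate(prompt: str, **kwargs) -> str:
    """Stand-in for a real model call: ignores kwargs and echoes the first three words."""
    return " ".join(prompt.split()[:3])


hyps = generate_all(dummy_generate, ["the quick brown fox jumps", "hello world"])
for name, outputs in hyps.items():
    print(name, outputs)
```

With a real model, `generate_fn` would typically tokenize the prompt, call `model.generate(**encoded, **kwargs)`, and decode only the newly generated tokens so that the prompt is not counted as part of the hypothesis.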

In [None]:
# Your code for loading models/configurations and generating hypotheses for each.
# Clearly state the three configurations you are comparing.

## Part 3: Metric Calculation and Comparison (30 Marks)

1.  **BLEU Score Calculation:**
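    * Implement a function to calculate the BLEU score using `nltk.translate.bleu_score`. Use `corpus_bleu` for a dataset-level score; averaging per-sample `sentence_bleu` values gives a different number, so state which one you report.
    * Remember that BLEU requires tokenized inputs (lists of words). Short outputs often have no matching 4-grams, so consider applying a `SmoothingFunction`.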
    * For each of your three hypothesis sets, calculate the overall BLEU score against your reference set.
    * Print the BLEU score for each configuration.

2.  **ROUGE Score Calculation:**
    * Implement a function to calculate ROUGE scores using `rouge_score.rouge_scorer.RougeScorer`.
    * Calculate ROUGE-1, ROUGE-2, and ROUGE-L F1-scores for each of your three hypothesis sets against your reference set.
    * Print the ROUGE scores (F1-score for each type) for each configuration.

3.  **Tabular Comparison:**
    * Present all calculated BLEU and ROUGE scores in a clear tabular format (e.g., using a Pandas DataFrame).

4.  **Qualitative Analysis and Discussion:**
    * Select 2-3 sample inputs from your dataset.
    * For each sample, print the input, the reference output, and the generated outputs from all three configurations.
    * Based on the quantitative scores (BLEU, ROUGE) and qualitative inspection of the outputs:
        * Which configuration performs best according to each metric?
        * Do the metrics align with your human intuition? Why or why not?
        * Discuss the strengths and weaknesses of BLEU and ROUGE for evaluating text generation. When is one more appropriate than the other?
        * What insights did you gain about your different model configurations?

In [None]:
# Your code for BLEU and ROUGE score calculation.
# Tabular comparison of scores.
# Qualitative analysis with sample outputs and discussion.
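
For the tabular comparison and the side-by-side qualitative view, one possible layout is sketched here. The helper names `build_score_table` and `format_sample` are hypothetical, and the zeros in `placeholder_scores` are placeholders that only show the table shape; they are not results.

```python
import textwrap

import pandas as pd


def build_score_table(scores_by_config):
    """Build a table with one row per configuration and one column per metric."""
    table = pd.DataFrame.from_dict(scores_by_config, orient="index")
    table.index.name = "configuration"
    return table.round(4)


def format_sample(index, inputs, references, hypotheses_by_config, width=80):
    """Return one sample's input, reference(s) and each configuration's output as text."""
    lines = [f"Sample {index}",
             "INPUT: " + textwrap.shorten(inputs[index], width=width)]
    lines += [f"REFERENCE: {ref}" for ref in references[index]]
    lines += [f"{name}: {hyps[index]}" for name, hyps in hypotheses_by_config.items()]
    return "\n".join(lines)


# Placeholder values only, to show the layout; replace them with your computed scores.
placeholder_scores = {
    "config_A": {"BLEU": 0.0, "ROUGE-1": 0.0, "ROUGE-2": 0.0, "ROUGE-L": 0.0},
    "config_B": {"BLEU": 0.0, "ROUGE-1": 0.0, "ROUGE-2": 0.0, "ROUGE-L": 0.0},
    "config_C": {"BLEU": 0.0, "ROUGE-1": 0.0, "ROUGE-2": 0.0, "ROUGE-L": 0.0},
}
table = build_score_table(placeholder_scores)
print(table.to_string())

# Toy sample (hand-written) for the side-by-side view.
inputs = ["Heavy rain caused flooding across the region."]
references = [["Rain floods region."]]
hypotheses_by_config = {
    "config_A": ["Rain floods the region."],
    "config_B": ["Flooding hits region after rain."],
    "config_C": ["The weather was bad."],
}
sample_text = format_sample(0, inputs, references, hypotheses_by_config)
print(sample_text)
```

Printing the same sample for all configurations in one block makes it easier to judge whether a higher score actually corresponds to a better output.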

## Part 4: Reflection and Future Work (Bonus - 10 Marks)

1.  **Limitations of N-gram Overlap Metrics:**
    * Discuss the inherent limitations of n-gram overlap metrics (like BLEU and ROUGE) for evaluating the semantic quality and fluency of generated text.
    * In what scenarios might these metrics fail to capture true quality?

2.  **Alternative Evaluation Metrics:**
    * Suggest at least two alternative or complementary evaluation approaches for text generation (e.g., semantic similarity metrics like BERTScore, human evaluation, LLM-as-a-judge evaluators, task-specific metrics).
    * Briefly explain how they work and what advantages they offer.

3.  **Hyperparameter Tuning Strategy:**
    * How would you use these metrics in an automated hyperparameter tuning loop for a text generation model? Which metric would you prioritize and why?
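
As a starting point for the tuning question, the following sketch shows a plain grid search that scores each parameter combination with a single metric. The names `grid_search`, `toy_generate` and `toy_score` are hypothetical, and the toy functions exist only so the block runs without a model. In practice `generate_fn` would wrap your model, `score_fn` would be the metric you decide to prioritize (for example a corpus-level BLEU or a ROUGE F1), and the search would run on a validation split rather than on the test set.

```python
from itertools import product


def grid_search(param_grid, generate_fn, score_fn, inputs, references):
    """Try every parameter combination and return (best_score, best_params)."""
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        hypotheses = [generate_fn(text, **params) for text in inputs]
        score = score_fn(references, hypotheses)
        # Keep the first combination that reaches the highest score.
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def toy_generate(prompt, max_words):
    """Toy 'generation': keep the first `max_words` words of the prompt."""
    return " ".join(prompt.split()[:max_words])


def toy_score(references, hypotheses):
    """Toy 'metric': fraction of hypotheses that exactly match one of their references."""
    hits = sum(hyp in refs for refs, hyp in zip(references, hypotheses))
    return hits / len(hypotheses)


inputs = ["a b c d", "e f g h"]
references = [["a b"], ["e f"]]
best_score, best_params = grid_search(
    {"max_words": [1, 2, 3]}, toy_generate, toy_score, inputs, references
)
print(best_score, best_params)
```

A single scalar objective keeps the loop simple; if you want to balance several metrics, `score_fn` could return a weighted combination, which is a design choice you should justify in your answer.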

## Submission Guidelines

* Submit this Jupyter Notebook (.ipynb file) with all cells executed and outputs visible.
* Ensure your code is well-commented and easy to understand.
* Provide a `requirements.txt` file listing all dependencies.
* Clearly present all calculated scores, tables, and discussions.
* Make sure your notebook runs without errors in the specified environment.