# 📚 GLUE Benchmark — Task Overview

GLUE includes nine English language understanding tasks, each with a different structure and purpose. These tasks cover:

- Single-sentence classification
- Sentence-pair classification
- Semantic similarity (regression)
- Natural language inference (NLI)


## 🔹 1. CoLA (Corpus of Linguistic Acceptability)

- **Task Type**: Single-sentence classification
- **Goal**: Judge grammatical acceptability
- **Input**: One sentence
- **Label**: 
  - 1 = grammatically acceptable  
  - 0 = unacceptable
- **Example**:  
  `"The boy is sleeping." → 1`
- **Dataset Size**: ~8.5k train, 1k dev
- **Source**: Linguistic literature (Warstadt et al., 2018)

---

## 🔹 2. SST-2 (Stanford Sentiment Treebank)

- **Task Type**: Single-sentence classification
- **Goal**: Predict sentiment (positive/negative)
- **Input**: One sentence
- **Label**: 
  - 1 = positive  
  - 0 = negative
- **Example**:  
  `"A delightfully quirky film." → 1`
- **Dataset Size**: ~67k train, 872 dev
- **Source**: Stanford Sentiment Treebank

---

## 🔹 3. MRPC (Microsoft Research Paraphrase Corpus)

- **Task Type**: Sentence pair classification
- **Goal**: Determine if two sentences are paraphrases
- **Input**: Sentence1, Sentence2
- **Label**: 
  - 1 = paraphrase  
  - 0 = not paraphrase
- **Example**:  
  `"He ran the company." vs "He managed the business." → 1`
- **Dataset Size**: ~3.7k train, 408 dev
- **Source**: Microsoft Research

---

## 🔹 4. QQP (Quora Question Pairs)

- **Task Type**: Sentence pair classification
- **Goal**: Identify if two questions have the same meaning
- **Input**: Question1, Question2
- **Label**: 
  - 1 = duplicate  
  - 0 = not duplicate
- **Example**:  
  `"What is AI?" vs "What is artificial intelligence?" → 1`
- **Dataset Size**: ~364k train, 40k dev
- **Source**: Quora

---

## 🔹 5. STS-B (Semantic Textual Similarity Benchmark)

- **Task Type**: Sentence pair regression
- **Goal**: Rate how similar two sentences are (scale: 0 to 5)
- **Input**: Sentence1, Sentence2
- **Label**: Float between 0.0 and 5.0
- **Example**:  
  `"A man is playing guitar." vs "Someone is playing an instrument." → 4.2`
- **Dataset Size**: ~5.7k train, 1.5k dev
- **Source**: Sentence pairs drawn from news headlines, video and image captions, and NLI data, with human-annotated similarity scores

---

## 🔹 6. MNLI (Multi-Genre Natural Language Inference)

- **Task Type**: Sentence pair classification (NLI)
- **Goal**: Determine if the hypothesis is entailed by, contradicts, or is neutral with respect to the premise
- **Input**: Premise, Hypothesis
- **Label**: 
  - 0 = entailment  
  - 1 = neutral  
  - 2 = contradiction
- **Example**:  
  `Premise: "He ordered pizza."  
   Hypothesis: "He got food." → 0 (entailment)`
- **Dataset Size**: ~393k train, ~20k dev (matched and mismatched sets combined, ~9.8k each)
- **Source**: Multi-genre corpora (spoken, fiction, government, etc.)
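
When inspecting predictions, it helps to turn the integer labels above back into names. The following is a minimal sketch, not part of the GLUE tooling: `decode_mnli_label` is a hypothetical helper, the ID order follows the list above, and `-1` is treated as "unlabeled" on the assumption that the data comes from the Hugging Face `glue` dataset, whose test splits hide labels that way.

```python
# Label names in ID order, as listed above for MNLI.
MNLI_LABELS = ("entailment", "neutral", "contradiction")


def decode_mnli_label(label_id: int) -> str:
    """Map an MNLI label ID to its name (hypothetical helper)."""
    # Assumption: -1 marks a hidden test-set label.
    if label_id == -1:
        return "unlabeled"
    if not 0 <= label_id < len(MNLI_LABELS):
        raise ValueError(f"Unexpected MNLI label ID: {label_id}")
    return MNLI_LABELS[label_id]


for label_id in (0, 1, 2, -1):
    print(label_id, decode_mnli_label(label_id))
```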

---

## 🔹 7. QNLI (Question Natural Language Inference)

- **Task Type**: Sentence pair classification (QA reformulated as NLI)
- **Goal**: Determine if a context sentence contains the answer to a given question
- **Input**: Question, Sentence
- **Label**: 
  - 0 = entailment (the sentence contains the answer)  
  - 1 = not_entailment (it does not)
- **Dataset Size**: ~105k train, 5.4k dev
- **Source**: SQuAD reformatted
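
The "SQuAD reformatted" idea can be illustrated roughly as follows: the question is paired with every sentence of the context paragraph, and a pair counts as entailment only when the sentence contains the answer. This is a simplified sketch under stated assumptions, not the benchmark's actual conversion script: `squad_to_qnli_pairs` is a hypothetical helper, sentence splitting is a naive regex, answer matching is a plain substring test, and the real dataset additionally filters out pairs with low lexical overlap.

```python
import re


def squad_to_qnli_pairs(question: str, context: str, answer: str) -> list:
    """Pair a question with each context sentence (hypothetical helper)."""
    # Naive sentence splitting: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    return [
        {
            "question": question,
            "sentence": sentence,
            # Assumption: a substring match stands in for "contains the answer".
            "label": "entailment" if answer in sentence else "not_entailment",
        }
        for sentence in sentences
    ]


pairs = squad_to_qnli_pairs(
    question="Where is the Eiffel Tower?",
    context="The Eiffel Tower is in Paris. It was completed in 1889.",
    answer="Paris",
)
for pair in pairs:
    print(pair["label"], "|", pair["sentence"])
```

String labels are used here on purpose, so the sketch does not depend on any particular integer encoding.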

---

## 🔹 8. RTE (Recognizing Textual Entailment)

- **Task Type**: Sentence pair classification (NLI)
- **Goal**: Determine if premise entails the hypothesis
- **Input**: Premise, Hypothesis
- **Label**: 
  - 0 = entailment  
  - 1 = not_entailment
- **Dataset Size**: ~2.5k train, 277 dev
- **Source**: RTE Challenges (RTE1, RTE2, RTE3, and RTE5)

---

## 🔹 9. WNLI (Winograd NLI)

- **Task Type**: Sentence pair classification (NLI/coreference)
- **Goal**: Determine if the hypothesis is entailed based on coreference resolution
- **Input**: Sentence1, Sentence2
- **Label**: 
  - 1 = entailment  
  - 0 = not entailment
- **Dataset Size**: 635 train, 71 dev (146 test)
- **Note**: The dev set is known to be adversarial: it shares sentences with the training set, so a model that memorizes training examples can score below chance
- **Source**: Winograd Schema Challenge items
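
The conversion from a Winograd schema to sentence pairs can be sketched as follows: the ambiguous pronoun is replaced by each candidate referent, and only the pair with the correct referent is labeled as entailment. This is an illustration under assumptions, not the dataset's construction code: `winograd_to_wnli`, its parameters, and the example sentences are all hypothetical.

```python
def winograd_to_wnli(premise, hypothesis_template, candidates, referent):
    """Build one sentence pair per candidate referent (hypothetical helper)."""
    return [
        {
            "sentence1": premise,
            "sentence2": hypothesis_template.format(referent=candidate),
            # 1 = entailment only when the substituted referent is the correct one.
            "label": int(candidate == referent),
        }
        for candidate in candidates
    ]


pairs = winograd_to_wnli(
    premise="The trophy did not fit in the suitcase because it was too big.",
    hypothesis_template="{referent} was too big.",
    candidates=["The trophy", "The suitcase"],
    referent="The trophy",
)
for pair in pairs:
    print(pair["label"], pair["sentence2"])
```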

---

## ✅ GLUE Summary Table

| Task    | Input            | Label Type       | Task Type               | Goal                                |
|---------|------------------|------------------|--------------------------|-------------------------------------|
| CoLA    | Sentence          | 0/1              | Single-sentence classification | Grammatical acceptability           |
| SST-2   | Sentence          | 0/1              | Sentiment classification | Sentiment polarity                  |
| MRPC    | Sentence pair     | 0/1              | Paraphrase classification | Are the sentences paraphrases?     |
| QQP     | Question pair     | 0/1              | Duplicate question classification | Are questions semantically equivalent? |
| STS-B   | Sentence pair     | Float [0.0–5.0]  | Semantic similarity regression | Degree of semantic similarity       |
| MNLI    | Sentence pair     | 0/1/2            | Natural Language Inference | Entailment, contradiction, or neutral |
| QNLI    | Question + sentence | 0/1            | QA-style entailment      | Does sentence answer the question? |
| RTE     | Sentence pair     | 0/1              | Textual entailment       | Is the hypothesis entailed?        |
| WNLI    | Sentence pair     | 0/1              | Coreference-based NLI    | Entailment based on coreference    |
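
The label types in the table determine how many outputs a model needs: one for the STS-B regression score, three for MNLI, and two for every other task. A minimal sketch of that mapping is shown below; `GLUE_NUM_LABELS`, `glue_num_labels`, and `is_regression` are hypothetical names. The result is the kind of value one would typically pass as `num_labels` when loading a sequence classification model.

```python
# Number of model outputs per task, derived from the "Label Type" column.
GLUE_NUM_LABELS = {
    "cola": 2, "sst2": 2, "mrpc": 2, "qqp": 2,
    "stsb": 1,  # regression: a single float output
    "mnli": 3,  # entailment / neutral / contradiction
    "qnli": 2, "rte": 2, "wnli": 2,
}


def glue_num_labels(task_name: str) -> int:
    """Return the output size for a GLUE task (hypothetical helper)."""
    task = task_name.lower()
    if task not in GLUE_NUM_LABELS:
        raise ValueError(f"Unknown GLUE task: {task_name!r}")
    return GLUE_NUM_LABELS[task]


def is_regression(task_name: str) -> bool:
    """STS-B is the only GLUE task with a continuous label."""
    return glue_num_labels(task_name) == 1


for task in ("cola", "stsb", "mnli"):
    print(task, glue_num_labels(task), is_regression(task))
```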

In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

In [5]:
def preprocess_glue(task_name: str, tokenizer=None, checkpoint="bert-base-uncased"):
    """Load a GLUE task, tokenize it, and return (datasets, collator, label column)."""
    # Text column(s) for each GLUE task; single-sentence tasks have no second key
    sentence_keys = {
        "cola": ("sentence", None),
        "sst2": ("sentence", None),
        "mrpc": ("sentence1", "sentence2"),
        "qqp": ("question1", "question2"),
        "stsb": ("sentence1", "sentence2"),
        "mnli": ("premise", "hypothesis"),
        "qnli": ("question", "sentence"),
        "rte": ("sentence1", "sentence2"),
        "wnli": ("sentence1", "sentence2"),
    }

    # Normalize once so the key lookup and load_dataset use the same config name
    task = task_name.lower()
    if task not in sentence_keys:
        raise ValueError(f"Unknown GLUE task: {task_name!r}")
    key1, key2 = sentence_keys[task]

    if tokenizer is None:
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    raw_dataset = load_dataset("glue", task)

    def tokenize_function(examples):
        # Single-sentence tasks pass one text column; pair tasks pass two
        if key2 is None:
            return tokenizer(examples[key1], truncation=True)
        return tokenizer(examples[key1], examples[key2], truncation=True)

    tokenized_datasets = raw_dataset.map(tokenize_function, batched=True)

    # Cast labels explicitly (float for STS-B, int for the others)
    def format_labels(example):
        if task == "stsb":
            example["label"] = float(example["label"])
        else:
            example["label"] = int(example["label"])
        return example

    tokenized_datasets = tokenized_datasets.map(format_labels, batched=False)

    # Drop the raw text and idx columns, keeping the label and the tokenizer outputs
    columns_to_remove = [
        col for col in raw_dataset["train"].column_names if col != "label"
    ]
    tokenized_datasets = tokenized_datasets.remove_columns(columns_to_remove)

    # Return PyTorch tensors when examples are accessed
    tokenized_datasets.set_format("torch")

    # Pad each batch to its longest sequence at collation time (dynamic padding)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    return tokenized_datasets, data_collator, "label"


In [6]:
tokenized_datasets, data_collator, label_column = preprocess_glue("rte")
print(tokenized_datasets["train"].features)

{'label': ClassLabel(names=['entailment', 'not_entailment'], id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}
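
The features above have variable length (`length=-1`), which is why a padding collator is returned alongside the datasets. What dynamic padding does can be sketched in plain Python: each batch is padded only up to its own longest sequence instead of a global maximum. This is a simplified illustration, not the library's implementation; `pad_batch` is a hypothetical helper, the token IDs are made up for the example, and a padding ID of 0 is assumed.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence to the longest one in this batch (hypothetical helper)."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Padding positions get the pad ID and an attention mask of 0.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


batch = pad_batch([
    {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 7592, 2088, 999, 102], "attention_mask": [1, 1, 1, 1, 1]},
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padded length depends on the batch, short batches stay short, which usually saves computation compared with padding every example to a fixed maximum length.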
