# GPT Research Paper | Part IV

## Fine-tuning: Task-Specific Adaptation

---

**Paper:** [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)

**Authors:** Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (OpenAI, 2018)

---

The **central contribution** of GPT is showing that a single pre-trained model can be fine-tuned, with minimal architectural changes, to improve the state of the art on 9 of the 12 tasks the paper studies.

In this notebook:
1. **The fine-tuning objective** - combining task loss with language modeling
2. **Input transformations** - converting any task to GPT's format
3. **Task-specific architectures** - minimal changes for each task type
4. **Results analysis** - GPT's performance on 12 datasets

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Rectangle, FancyArrowPatch, Circle
import matplotlib.patches as mpatches
import numpy as np
import math
from dataclasses import dataclass
from typing import Optional, Tuple, List

torch.manual_seed(42)
np.random.seed(42)

print("PyTorch version:", torch.__version__)

---

## 1. The Two-Stage Training Paradigm

### 1.1 The Core Idea

From the Abstract:

> *"We demonstrate that large gains on these tasks can be realized by **generative pre-training** of a language model on a diverse corpus of unlabeled text, followed by **discriminative fine-tuning** on each specific task."*

This establishes the **pre-train then fine-tune** paradigm that now dominates NLP:

| Stage | Data | Objective | Duration |
|-------|------|-----------|----------|
| **Pre-training** | Unlabeled (BooksCorpus) | Language modeling | 100 epochs, weeks |
| **Fine-tuning** | Labeled (task-specific) | Task + LM auxiliary | 3 epochs, hours |

### 1.2 Why This Works

From Section 1:

> *"In this paper, we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks."*

The key insight: **language modeling pushes the model toward useful representations**. To predict the next word well, a model benefits from capturing:
- Syntax (grammar)
- Semantics (meaning)
- World knowledge (facts)
- Reasoning patterns

In [None]:
def visualize_two_stage_paradigm():
    """Visualize the pre-train then fine-tune paradigm."""
    fig, ax = plt.subplots(figsize=(16, 8))
    ax.set_xlim(0, 16)
    ax.set_ylim(0, 8)
    ax.axis('off')
    
    # Title
    ax.text(8, 7.5, 'GPT: Pre-train then Fine-tune Paradigm', 
            fontsize=16, fontweight='bold', ha='center')
    
    # === STAGE 1: Pre-training ===
    rect1 = FancyBboxPatch((0.5, 3.5), 6, 3, boxstyle="round,pad=0.05",
                           facecolor='#e8f4f8', edgecolor='#3498db', linewidth=2)
    ax.add_patch(rect1)
    
    ax.text(3.5, 6.2, 'Stage 1: Pre-training', fontsize=13, fontweight='bold', 
            ha='center', color='#2980b9')
    
    # Data
    ax.text(1, 5.5, 'Data:', fontsize=10, fontweight='bold')
    ax.text(2.5, 5.5, 'BooksCorpus (~1B words)', fontsize=10)
    ax.text(1, 5.0, 'Labels:', fontsize=10, fontweight='bold')
    ax.text(2.5, 5.0, 'None (self-supervised)', fontsize=10, color='#27ae60')
    ax.text(1, 4.5, 'Objective:', fontsize=10, fontweight='bold')
    ax.text(2.5, 4.5, 'Predict next token', fontsize=10)
    ax.text(1, 4.0, 'Duration:', fontsize=10, fontweight='bold')
    ax.text(2.5, 4.0, '100 epochs (~weeks)', fontsize=10)
    
    # Arrow
    ax.annotate('', xy=(7.5, 5), xytext=(6.7, 5),
                arrowprops=dict(arrowstyle='->', color='black', lw=2.5))
    ax.text(7.1, 5.4, 'Transfer\nlearned\nweights', fontsize=9, ha='center')
    
    # === STAGE 2: Fine-tuning ===
    rect2 = FancyBboxPatch((8.5, 3.5), 7, 3, boxstyle="round,pad=0.05",
                           facecolor='#fef9e7', edgecolor='#f39c12', linewidth=2)
    ax.add_patch(rect2)
    
    ax.text(12, 6.2, 'Stage 2: Fine-tuning', fontsize=13, fontweight='bold', 
            ha='center', color='#d68910')
    
    ax.text(9, 5.5, 'Data:', fontsize=10, fontweight='bold')
    ax.text(10.5, 5.5, 'Task-specific (small)', fontsize=10)
    ax.text(9, 5.0, 'Labels:', fontsize=10, fontweight='bold')
    ax.text(10.5, 5.0, 'Yes (supervised)', fontsize=10, color='#e74c3c')
    ax.text(9, 4.5, 'Objective:', fontsize=10, fontweight='bold')
    ax.text(10.5, 4.5, 'Task loss + LM auxiliary', fontsize=10)
    ax.text(9, 4.0, 'Duration:', fontsize=10, fontweight='bold')
    ax.text(10.5, 4.0, '3 epochs (~hours)', fontsize=10)
    
    # === Equation ===
    eq_box = FancyBboxPatch((2, 1), 12, 1.8, boxstyle="round,pad=0.05",
                            facecolor='#f5f5f5', edgecolor='gray', linewidth=1.5)
    ax.add_patch(eq_box)
    
    ax.text(8, 2.5, 'The Fine-tuning Objective (L3; Eq. 5 in the paper):', 
            fontsize=11, fontweight='bold', ha='center')
    ax.text(8, 1.7, r'$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$', 
            fontsize=14, ha='center', style='italic')
    ax.text(8, 1.2, 'Task loss + (weight) x Language modeling loss', 
            fontsize=10, ha='center', color='gray')
    
    plt.tight_layout()
    plt.show()

visualize_two_stage_paradigm()

---

## 2. The Fine-tuning Objective

### 2.1 The Three Equations

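The paper defines three objectives, $L_1$, $L_2$, and $L_3$. This notebook numbers them 1 to 3 after their subscripts; in the paper they appear as Equations (1), (4), and (5). Each is written as a log-likelihood to be maximized, so the code minimizes its negative (a cross-entropy loss).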

**Equation 1: Pre-training (Language Modeling)**

$$L_1(\mathcal{U}) = \sum_i \log P(u_i | u_{i-k}, ..., u_{i-1}; \Theta)$$

> *"Given an unsupervised corpus of tokens $\mathcal{U}$, we use a standard language modeling objective."*
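
To make Equation 1 concrete, the sketch below evaluates the objective on a toy sequence. It assumes random logits standing in for the model output and a hypothetical vocabulary of 10 tokens. The point is only that the sum of next-token log-probabilities equals the negative of a summed cross-entropy, which is the quantity training code minimizes.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)

vocab_size, seq_len = 10, 6                      # toy sizes, not the paper's
tokens = torch.randint(0, vocab_size, (seq_len,), generator=g)
logits = torch.randn(seq_len, vocab_size, generator=g)   # stand-in for model outputs

# Position i predicts token i+1, so drop the last logit row and the first token.
pred_logits = logits[:-1]
next_tokens = tokens[1:]

# L_1 as written in the paper: a sum of log-probabilities (to be maximized).
log_probs = F.log_softmax(pred_logits, dim=-1)
l1 = log_probs[torch.arange(seq_len - 1), next_tokens].sum()

# The equivalent quantity that optimizers minimize: summed cross-entropy.
nll = F.cross_entropy(pred_logits, next_tokens, reduction="sum")

print(f"L_1 = {l1.item():.4f}, cross-entropy = {nll.item():.4f}")
```

The two printed numbers differ only in sign.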

**Equation 2: Task-Specific Supervised Loss**

$$L_2(C) = \sum_{(x, y)} \log P(y | x^1, ..., x^m)$$

From Section 3.2:

> *"Given a labeled dataset $C$, where each instance consists of a sequence of input tokens, $x^1, ..., x^m$, along with a label $y$, the inputs are passed through our pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$."*

$$P(y | x^1, ..., x^m) = \text{softmax}(h_l^m W_y)$$
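
The superscript $m$ in $h_l^m$ refers to the last input token, which is the extract token `<e>`. When a batch is right-padded, that token is not at index `-1`, so its position has to be looked up per example. A minimal sketch, assuming hypothetical ids `EXTRACT_ID` and `PAD_ID` and random hidden states in place of the transformer output:

```python
import torch

PAD_ID, EXTRACT_ID = 0, 3          # hypothetical token ids for illustration
n_embd, num_labels = 8, 2          # toy sizes

# Two right-padded sequences; <e> sits at index 4 and index 2 respectively.
input_ids = torch.tensor([
    [1, 5, 6, 7, EXTRACT_ID],
    [1, 9, EXTRACT_ID, PAD_ID, PAD_ID],
])
g = torch.Generator().manual_seed(0)
hidden = torch.randn(2, 5, n_embd, generator=g)    # stand-in for h_l

# Position of the extract token in each row.
extract_positions = (input_ids == EXTRACT_ID).int().argmax(dim=1)

# Gather h_l^m for each example, then apply softmax(h_l^m W_y).
h_m = hidden[torch.arange(input_ids.size(0)), extract_positions]
W_y = torch.randn(n_embd, num_labels, generator=g)
probs = torch.softmax(h_m @ W_y, dim=-1)

print(extract_positions.tolist())   # [4, 2]
print(probs.sum(dim=-1))            # each row sums to 1
```

Positions computed this way are the kind of tensor a per-example `extract_positions` argument would expect.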

**Equation 3: Combined Fine-tuning Objective**

$$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$$

Keeping the language modeling objective as an **auxiliary loss** during fine-tuning is one of the paper's practical findings.

### 2.2 Why the Auxiliary LM Loss?

From Section 3.2:

> *"We additionally found that including language modeling as an auxiliary objective to the fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence."*

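The intuition:
- The LM objective acts as a **regularizer**, discouraging the model from drifting away from useful pre-trained features
- It supplies a training signal at **every token position**, whereas the task loss is computed only from the extract position
- The paper's ablation (Section 5) finds that the benefit shows up mainly on the larger datasets, such as the NLI tasks and QQP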

### 2.3 The Lambda Hyperparameter

From Section 4.1:

> *"λ was set to 0.5."*

So the actual objective is:

$$L_3(C) = L_{\text{task}} + 0.5 \cdot L_{\text{LM}}$$
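
As a worked example of the combined objective, the sketch below computes $L_3$ from toy classification logits and toy language-model logits. All tensors are random placeholders and the sizes are hypothetical; only the weight 0.5 comes from the paper.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
batch, seq_len, vocab_size, num_labels = 4, 6, 10, 2   # toy sizes

cls_logits = torch.randn(batch, num_labels, generator=g)
labels = torch.randint(0, num_labels, (batch,), generator=g)

lm_logits = torch.randn(batch, seq_len, vocab_size, generator=g)
tokens = torch.randint(0, vocab_size, (batch, seq_len), generator=g)

# Task term: cross-entropy over the class labels.
task_loss = F.cross_entropy(cls_logits, labels)

# Auxiliary term: next-token cross-entropy on the same inputs.
lm_loss = F.cross_entropy(
    lm_logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)

lam = 0.5                                  # value reported in the paper
total_loss = task_loss + lam * lm_loss
print(f"task={task_loss.item():.3f}  lm={lm_loss.item():.3f}  total={total_loss.item():.3f}")
```

Both terms here use mean reduction, whereas the paper writes sums; the choice changes the relative scale of the two terms, so the effective weighting depends on how each loss is averaged.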

In [None]:
def visualize_finetuning_objective():
    """Visualize the combined fine-tuning objective."""
    
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # === Left: The three equations ===
    ax1 = axes[0]
    ax1.set_xlim(0, 10)
    ax1.set_ylim(0, 10)
    ax1.axis('off')
    
    ax1.text(5, 9.5, 'The Three Equations', fontsize=14, fontweight='bold', ha='center')
    
    # Equation 1
    rect1 = FancyBboxPatch((0.5, 6.5), 9, 2.2, boxstyle="round,pad=0.03",
                           facecolor='#e8f4f8', edgecolor='#3498db', linewidth=2)
    ax1.add_patch(rect1)
    ax1.text(5, 8.3, 'Equation 1: Pre-training (Language Modeling)', 
             fontsize=11, fontweight='bold', ha='center', color='#2980b9')
    ax1.text(5, 7.5, r'$L_1(\mathcal{U}) = \sum_i \log P(u_i | u_{i-k}, ..., u_{i-1})$', 
             fontsize=12, ha='center')
    ax1.text(5, 6.9, '"Predict the next token"', fontsize=9, ha='center', 
             style='italic', color='gray')
    
    # Equation 2
    rect2 = FancyBboxPatch((0.5, 3.5), 9, 2.2, boxstyle="round,pad=0.03",
                           facecolor='#fef9e7', edgecolor='#f39c12', linewidth=2)
    ax1.add_patch(rect2)
    ax1.text(5, 5.3, 'Equation 2: Task-Specific Loss', 
             fontsize=11, fontweight='bold', ha='center', color='#d68910')
    ax1.text(5, 4.5, r'$L_2(C) = \sum_{(x,y)} \log P(y | x^1, ..., x^m)$', 
             fontsize=12, ha='center')
    ax1.text(5, 3.9, '"Predict the task label"', fontsize=9, ha='center', 
             style='italic', color='gray')
    
    # Equation 3
    rect3 = FancyBboxPatch((0.5, 0.5), 9, 2.2, boxstyle="round,pad=0.03",
                           facecolor='#eafaf1', edgecolor='#27ae60', linewidth=2)
    ax1.add_patch(rect3)
    ax1.text(5, 2.3, 'Equation 3: Combined Fine-tuning', 
             fontsize=11, fontweight='bold', ha='center', color='#1e8449')
    ax1.text(5, 1.5, r'$L_3(C) = L_2(C) + \lambda \cdot L_1(C)$', 
             fontsize=12, ha='center')
    ax1.text(5, 0.9, r'"Task loss + 0.5 $\times$ LM loss"', fontsize=9, ha='center', 
             style='italic', color='gray')
    
    # === Right: Why auxiliary loss helps ===
    ax2 = axes[1]
    ax2.set_xlim(0, 10)
    ax2.set_ylim(0, 10)
    ax2.axis('off')
    
    ax2.text(5, 9.5, 'Why Auxiliary LM Loss Helps', fontsize=14, fontweight='bold', ha='center')
    
    benefits = [
        ('1. Regularization', 
         'Discourages drifting away from\npre-trained features'),
        ('2. Generalization', 
         '"Improving generalization of\nthe supervised model"'),
        ('3. Faster Convergence', 
         '"Accelerating convergence"'),
        ('4. Dense Training Signal', 
         'Supervises every token position,\nnot only the extract position'),
    ]
    
    colors = ['#e74c3c', '#3498db', '#27ae60', '#9b59b6']
    
    for i, ((title, desc), color) in enumerate(zip(benefits, colors)):
        y = 8 - i * 2
        rect = FancyBboxPatch((0.5, y - 0.8), 9, 1.5, boxstyle="round,pad=0.03",
                              facecolor=color, edgecolor='black', linewidth=1.5, alpha=0.15)
        ax2.add_patch(rect)
        ax2.text(1, y + 0.3, title, fontsize=11, fontweight='bold', color=color)
        ax2.text(1, y - 0.3, desc, fontsize=10, color='black')
    
    plt.tight_layout()
    plt.show()

visualize_finetuning_objective()

---

## 3. Input Transformations: The Key Innovation

### 3.1 The Challenge

GPT is pre-trained as a **left-to-right (autoregressive)** language model on contiguous text. But NLP tasks have diverse input formats:
- **Classification**: Single text → Label
- **Entailment**: Premise + Hypothesis → Label
- **Similarity**: Text A + Text B → Score
- **QA**: Context + Question → Answer
- **Multiple Choice**: Question + Options → Choice

### 3.2 The Solution: Structured Input Transformations

From Section 3.3:

> *"For some tasks, like text classification, we can directly fine-tune our model as described above. Certain other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model was trained on contiguous sequences of text, we require some modifications to apply it to these tasks."*

The key insight:

> *"We use a traversal-style approach, where we convert structured inputs into an ordered sequence that our pre-trained model can process. These input transformations allow us to avoid making extensive changes to the architecture across tasks."*

### 3.3 Special Tokens

The paper introduces special delimiter tokens:

| Token | Purpose |
|-------|--------|
| `<s>` | Start of sequence |
| `<e>` | End of sequence / Extract token |
| `$` | Delimiter between segments |
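
The paper states that the start and end tokens are randomly initialized, since these special tokens never occur during pre-training. One way to realise this, sketched below under the assumption that the three tokens are appended after the pre-training vocabulary, is to grow the embedding matrix by three rows while copying the pre-trained rows. The helper `extend_embedding` and the id assignment are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

def extend_embedding(old: nn.Embedding, n_new: int, std: float = 0.02) -> nn.Embedding:
    """Return a larger embedding whose first rows are copied from `old`."""
    new = nn.Embedding(old.num_embeddings + n_new, old.embedding_dim)
    with torch.no_grad():
        nn.init.normal_(new.weight, mean=0.0, std=std)    # new rows: random init
        new.weight[: old.num_embeddings] = old.weight      # old rows: pre-trained values
    return new

pretrained = nn.Embedding(100, 16)          # toy stand-in for the pre-trained table
extended = extend_embedding(pretrained, n_new=3)

# Hypothetical id assignment: special tokens take the three new rows.
special_ids = {"<s>": 100, "<e>": 101, "$": 102}
print(extended.weight.shape)                # torch.Size([103, 16])
```

If the output projection shares its weight with the token embedding, as is common for GPT-style models, the tie has to be re-established after swapping in the extended table.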

In [None]:
def visualize_input_transformations():
    """
    Visualize the four input transformation types from Figure 1 of the paper.
    This is one of the most important figures in the paper!
    """
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # Colors
    c_start = '#3498db'    # Start token
    c_text = '#ecf0f1'     # Text
    c_delim = '#e74c3c'    # Delimiter
    c_extract = '#27ae60'  # Extract token
    
    def draw_token_sequence(ax, tokens, colors, y, title, subtitle):
        """Draw a sequence of tokens."""
        ax.set_xlim(0, 14)
        ax.set_ylim(0, 6)
        ax.axis('off')
        
        ax.text(7, 5.5, title, fontsize=13, fontweight='bold', ha='center')
        ax.text(7, 5, subtitle, fontsize=10, ha='center', style='italic', color='gray')
        
        # Calculate positions
        total_width = sum(len(t) * 0.15 + 0.7 for t in tokens)  # box width + 0.1 gap, matching the loop below
        start_x = (14 - total_width) / 2
        
        x = start_x
        for tok, col in zip(tokens, colors):
            width = len(tok) * 0.15 + 0.6
            rect = FancyBboxPatch((x, y), width, 0.7, boxstyle="round,pad=0.02",
                                  facecolor=col, edgecolor='black', linewidth=1.5)
            ax.add_patch(rect)
            text_color = 'white' if col in [c_start, c_delim, c_extract] else 'black'
            ax.text(x + width/2, y + 0.35, tok, ha='center', va='center', 
                   fontsize=9, color=text_color, fontweight='bold')
            x += width + 0.1
        
        return start_x, x
    
    # === 1. Classification ===
    ax1 = axes[0, 0]
    tokens1 = ['<s>', 'This', 'movie', 'was', 'great', '!', '<e>']
    colors1 = [c_start, c_text, c_text, c_text, c_text, c_text, c_extract]
    draw_token_sequence(ax1, tokens1, colors1, 3, 
                        'Classification (e.g., Sentiment)',
                        'Single text sequence')
    
    ax1.text(7, 2.2, 'Format: <s> Text <e>', fontsize=11, ha='center', 
             fontweight='bold', color='#2c3e50')
    ax1.text(7, 1.6, 'Extract representation at <e> position', fontsize=10, 
             ha='center', color='gray')
    ax1.text(7, 1.0, 'Linear layer maps to class probabilities', fontsize=10, 
             ha='center', color='gray')
    
    # === 2. Entailment ===
    ax2 = axes[0, 1]
    tokens2 = ['<s>', 'Premise', 'text', '$', 'Hypothesis', 'text', '<e>']
    colors2 = [c_start, c_text, c_text, c_delim, c_text, c_text, c_extract]
    draw_token_sequence(ax2, tokens2, colors2, 3,
                        'Entailment (e.g., MNLI, SNLI)',
                        'Premise-hypothesis pairs')
    
    ax2.text(7, 2.2, 'Format: <s> Premise $ Hypothesis <e>', fontsize=11, 
             ha='center', fontweight='bold', color='#2c3e50')
    ax2.text(7, 1.6, 'Delimiter $ separates the two segments', fontsize=10, 
             ha='center', color='gray')
    ax2.text(7, 1.0, 'Predict: entailment / contradiction / neutral', fontsize=10, 
             ha='center', color='gray')
    
    # === 3. Similarity ===
    ax3 = axes[1, 0]
    ax3.set_xlim(0, 14)
    ax3.set_ylim(0, 6)
    ax3.axis('off')
    
    ax3.text(7, 5.5, 'Similarity (e.g., QQP, STS-B)', fontsize=13, 
             fontweight='bold', ha='center')
    ax3.text(7, 5, 'Two text sequences (order independent)', fontsize=10, 
             ha='center', style='italic', color='gray')
    
    # Two orderings
    tokens3a = ['<s>', 'Text', 'A', '$', 'Text', 'B', '<e>']
    colors3 = [c_start, c_text, c_text, c_delim, c_text, c_text, c_extract]
    
    x = 1
    for tok, col in zip(tokens3a, colors3):
        width = len(tok) * 0.12 + 0.5
        rect = FancyBboxPatch((x, 3.5), width, 0.6, boxstyle="round,pad=0.02",
                              facecolor=col, edgecolor='black', linewidth=1.5)
        ax3.add_patch(rect)
        text_color = 'white' if col in [c_start, c_delim, c_extract] else 'black'
        ax3.text(x + width/2, 3.8, tok, ha='center', va='center', 
                fontsize=9, color=text_color, fontweight='bold')
        x += width + 0.08
    
    tokens3b = ['<s>', 'Text', 'B', '$', 'Text', 'A', '<e>']
    x = 1
    for tok, col in zip(tokens3b, colors3):
        width = len(tok) * 0.12 + 0.5
        rect = FancyBboxPatch((x, 2.5), width, 0.6, boxstyle="round,pad=0.02",
                              facecolor=col, edgecolor='black', linewidth=1.5)
        ax3.add_patch(rect)
        text_color = 'white' if col in [c_start, c_delim, c_extract] else 'black'
        ax3.text(x + width/2, 2.8, tok, ha='center', va='center', 
                fontsize=9, color=text_color, fontweight='bold')
        x += width + 0.08
    
    ax3.text(7.5, 3.8, '+', fontsize=16, fontweight='bold', ha='center', va='center')
    
    ax3.text(7, 1.8, 'Process BOTH orderings, add representations', fontsize=10, 
             ha='center', fontweight='bold', color='#2c3e50')
    ax3.text(7, 1.2, 'Makes the model symmetric to input order', fontsize=10, 
             ha='center', color='gray')
    
    # === 4. Multiple Choice ===
    ax4 = axes[1, 1]
    ax4.set_xlim(0, 14)
    ax4.set_ylim(0, 6)
    ax4.axis('off')
    
    ax4.text(7, 5.5, 'Multiple Choice (e.g., RACE, ROCStories)', fontsize=13, 
             fontweight='bold', ha='center')
    ax4.text(7, 5, 'Context + Question + Answer options', fontsize=10, 
             ha='center', style='italic', color='gray')
    
    # Multiple sequences
    y_positions = [4, 3.2, 2.4]
    answer_labels = ['A', 'B', 'C']
    
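    # Paper format: [z; q; $; a_k], i.e. a single delimiter placed before the answer
    colors4 = [c_start, c_text, c_text, c_delim, c_text, c_extract]
    for y, ans in zip(y_positions, answer_labels):
        tokens4 = ['<s>', 'Context', 'Question', '$', f'Ans {ans}', '<e>']
        x = 1.5
        for tok, col in zip(tokens4, colors4):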
            width = len(tok) * 0.1 + 0.4
            rect = FancyBboxPatch((x, y), width, 0.55, boxstyle="round,pad=0.02",
                                  facecolor=col, edgecolor='black', linewidth=1.2)
            ax4.add_patch(rect)
            text_color = 'white' if col in [c_start, c_delim, c_extract] else 'black'
            ax4.text(x + width/2, y + 0.27, tok, ha='center', va='center', 
                    fontsize=8, color=text_color, fontweight='bold')
            x += width + 0.05
        
        # Score
        ax4.text(x + 0.3, y + 0.27, f'-> Score({ans})', fontsize=9, va='center')
    
    ax4.text(7, 1.5, 'Create N sequences (one per answer option)', fontsize=10, 
             ha='center', fontweight='bold', color='#2c3e50')
    ax4.text(7, 0.9, 'Softmax over scores to select best answer', fontsize=10, 
             ha='center', color='gray')
    
    # Legend
    fig.text(0.5, 0.02, 'Legend:  ', fontsize=10, ha='center', fontweight='bold')
    legend_items = [(c_start, '<s> Start'), (c_text, 'Text'), 
                    (c_delim, '$ Delimiter'), (c_extract, '<e> Extract')]
    for i, (color, label) in enumerate(legend_items):
        fig.text(0.35 + i * 0.1, 0.02, f'  {label}  ', fontsize=9, 
                ha='center', backgroundcolor=color,
                color='white' if color != c_text else 'black')
    
    plt.tight_layout(rect=[0, 0.05, 1, 1])
    plt.show()

visualize_input_transformations()

### 3.4 Paper Quotes for Each Task Type

**Classification:**
> *"For some tasks, like text classification, we can directly fine-tune our model as described above."* As with every transformation, randomly initialized start and end tokens (`<s>`, `<e>`) are added to the input.

**Entailment:**
> *"For entailment tasks, we concatenate the premise p and hypothesis h token sequences, with a delimiter token ($) in between."*

**Similarity:**
> *"For similarity tasks, there is no inherent ordering of the two sentences being compared. To reflect this, we modify the input sequence to contain both possible sentence orderings (with a delimiter in between) and process each independently to produce two sequence representations $h_l^m$ which are added element-wise before being fed into the linear output layer."*

**Multiple Choice:**
> *"For these tasks, we are given a context document z, a question q, and a set of possible answers $\{a_k\}$. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get $[z; q; \$; a_k]$. Each of these sequences are processed independently with our model and then normalized via a softmax layer to produce an output distribution over possible answers."*
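
The four transformations quoted above can be written as plain list manipulations. The sketch below works on word-level strings for readability; real inputs would be BPE token ids, and the function names are illustrative.

```python
from typing import List, Tuple

START, DELIM, EXTRACT = "<s>", "$", "<e>"

def classification_input(text: List[str]) -> List[str]:
    return [START, *text, EXTRACT]

def entailment_input(premise: List[str], hypothesis: List[str]) -> List[str]:
    return [START, *premise, DELIM, *hypothesis, EXTRACT]

def similarity_inputs(a: List[str], b: List[str]) -> Tuple[List[str], List[str]]:
    # Both orderings; their final hidden states are summed downstream.
    return entailment_input(a, b), entailment_input(b, a)

def multiple_choice_inputs(context: List[str], question: List[str],
                           answers: List[List[str]]) -> List[List[str]]:
    # One sequence per candidate answer: [z; q; $; a_k]
    return [[START, *context, *question, DELIM, *ans, EXTRACT] for ans in answers]

seqs = multiple_choice_inputs(["It", "rained", "."], ["What", "fell", "?"],
                              [["Rain"], ["Snow"]])
for s in seqs:
    print(" ".join(s))
```

Each multiple-choice sequence is scored separately, and the scores are normalized with a softmax across the candidates.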

---

## 4. Implementation: Fine-tuning Architecture

### 4.1 The Classification Head

From Section 3.2:

> *"The inputs are passed through our pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$."*

$$P(y | x^1, ..., x^m) = \text{softmax}(h_l^m W_y)$$

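The architecture is remarkably simple:
1. Pre-trained GPT (same architecture; its weights are updated during fine-tuning)
2. One linear layer for classification

The only new parameters are $W_y$ and the embeddings for the special tokens, so **task-specific parameters are minimal**.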
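
A quick count shows how small the task-specific part is. The figures below assume a hidden size of 768, a binary task, three special-token embeddings, and a bias term on the output layer (the paper's $W_y$ is written without one).

```python
n_embd = 768        # hidden size assumed for this count
num_labels = 2      # e.g. binary sentiment
n_special = 3       # <s>, <e>, $

head_weights = n_embd * num_labels          # W_y
head_bias = num_labels                      # bias added by nn.Linear (assumption)
special_embeddings = n_special * n_embd     # one embedding row per special token

new_params = head_weights + head_bias + special_embeddings
print(f"Linear head:        {head_weights + head_bias:,}")    # 1,538
print(f"Special embeddings: {special_embeddings:,}")          # 2,304
print(f"Total new params:   {new_params:,}")                  # 3,842
```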

In [None]:
# === Model from Part II (simplified) ===

@dataclass
class GPTConfig:
    vocab_size: int = 40478
    n_positions: int = 512
    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12
    n_inner: int = 3072
    embd_pdrop: float = 0.1
    attn_pdrop: float = 0.1
    resid_pdrop: float = 0.1

def gelu_approx(x):
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))

class LayerNorm(nn.Module):
    def __init__(self, n_embd, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_embd))
        self.beta = nn.Parameter(torch.zeros(n_embd))
        self.eps = eps
    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)
        mask = torch.tril(torch.ones(config.n_positions, config.n_positions))
        self.register_buffer('mask', mask.view(1, 1, config.n_positions, config.n_positions))
    
    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))
        attn = self.attn_dropout(F.softmax(attn, dim=-1))
        out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.resid_dropout(self.c_proj(out))

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, config.n_inner)
        self.c_proj = nn.Linear(config.n_inner, config.n_embd)
        self.dropout = nn.Dropout(config.resid_pdrop)
    def forward(self, x):
        return self.dropout(self.c_proj(gelu_approx(self.c_fc(x))))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd)
        self.mlp = MLP(config)
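    def forward(self, x):
        # Pre-LN ordering (LayerNorm before each sub-layer), as in GPT-2.
        # The original GPT applies LayerNorm after each residual sum instead.
        x = x + self.attn(self.ln_1(x))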
        x = x + self.mlp(self.ln_2(x))
        return x

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)
        self.wpe = nn.Embedding(config.n_positions, config.n_embd)
        self.drop = nn.Dropout(config.embd_pdrop)
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight
        self._init_weights()
        
    def _init_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Linear):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    torch.nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, input_ids, targets=None):
        B, T = input_ids.shape
        x = self.drop(self.wte(input_ids) + self.wpe(torch.arange(T, device=input_ids.device)))
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss, x  # Also return hidden states

config = GPTConfig()
print(f"Base GPT model defined")

In [None]:
class GPTForSequenceClassification(nn.Module):
    """
    GPT with a classification head for fine-tuning.
    
    From paper Section 3.2:
    "The inputs are passed through our pre-trained model to obtain the 
    final transformer block's activation h_l^m, which is then fed into 
    an added linear output layer with parameters W_y to predict y."
    
    P(y | x^1, ..., x^m) = softmax(h_l^m * W_y)
    """
    
    def __init__(self, config, num_labels: int, lm_weight: float = 0.5):
        super().__init__()
        self.gpt = GPT(config)
        self.num_labels = num_labels
        self.lm_weight = lm_weight  # Lambda from Equation 3
        
        # Classification head: single linear layer
        # "fed into an added linear output layer with parameters W_y"
        self.classifier = nn.Linear(config.n_embd, num_labels)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(0.1)
        
        # Initialize classifier
        torch.nn.init.normal_(self.classifier.weight, std=0.02)
        torch.nn.init.zeros_(self.classifier.bias)
        
        n_params = sum(p.numel() for p in self.parameters())
        classifier_params = sum(p.numel() for p in self.classifier.parameters())
        print(f"GPT for Classification: {n_params:,} total parameters")
        print(f"  - Base GPT: {n_params - classifier_params:,}")
        print(f"  - Classifier head: {classifier_params:,} (only new parameters!)")
    
    def forward(
        self, 
        input_ids: torch.Tensor,
        labels: Optional[torch.Tensor] = None,
        extract_positions: Optional[torch.Tensor] = None,
        lm_targets: Optional[torch.Tensor] = None,
    ):
        """
        Forward pass with combined loss (Equation 3).
        
        Args:
            input_ids: Token IDs, shape (batch, seq_len)
            labels: Classification labels, shape (batch,)
            extract_positions: Position of <e> token for each sample
            lm_targets: Targets for auxiliary LM loss
        
        Returns:
            logits: Classification logits
            loss: Combined loss (L_3 = L_2 + lambda * L_1)
        """
        batch_size = input_ids.size(0)
        
        # Get GPT outputs
        lm_logits, lm_loss, hidden_states = self.gpt(input_ids, lm_targets)
        
        # Extract representation at <e> position (or last position)
        if extract_positions is None:
            # Default: use last position (assumes no right-padding)
            extract_positions = torch.full(
                (batch_size,), input_ids.size(1) - 1,
                dtype=torch.long, device=input_ids.device,
            )
        
        # Get h_l^m for each sample (index tensors live on the same device as the inputs)
        batch_indices = torch.arange(batch_size, device=input_ids.device)
        pooled_output = hidden_states[batch_indices, extract_positions]  # (batch, n_embd)
        
        # Apply dropout and classifier
        pooled_output = self.dropout(pooled_output)
        classification_logits = self.classifier(pooled_output)  # (batch, num_labels)
        
        # Compute combined loss (Equation 3)
        loss = None
        if labels is not None:
            # L_2: Task-specific loss
            task_loss = F.cross_entropy(classification_logits, labels)
            
            # L_3 = L_2 + lambda * L_1
            if lm_loss is not None:
                loss = task_loss + self.lm_weight * lm_loss
            else:
                loss = task_loss
        
        return classification_logits, loss


# Create model for binary classification (e.g., sentiment)
model_cls = GPTForSequenceClassification(config, num_labels=2)

In [None]:
class GPTForMultipleChoice(nn.Module):
    """
    GPT for multiple choice tasks (e.g., RACE, ROCStories).
    
    From paper Section 3.3:
    "We concatenate the document context and question with each possible answer,
    adding a delimiter token in between to get [z; q; $; a_k]. Each of these 
    sequences are processed independently with our model and then normalized 
    via a softmax layer to produce an output distribution over possible answers."
    """
    
    def __init__(self, config, lm_weight: float = 0.5):
        super().__init__()
        self.gpt = GPT(config)
        self.lm_weight = lm_weight
        
        # Score head: maps hidden state to scalar score
        self.score_head = nn.Linear(config.n_embd, 1)
        self.dropout = nn.Dropout(0.1)
        
        torch.nn.init.normal_(self.score_head.weight, std=0.02)
        torch.nn.init.zeros_(self.score_head.bias)
    
    def forward(
        self,
        input_ids: torch.Tensor,  # (batch, num_choices, seq_len)
        labels: Optional[torch.Tensor] = None,  # (batch,) index of correct answer
        lm_targets: Optional[torch.Tensor] = None,
    ):
        """
        Process each choice independently, then softmax over scores.
        """
        batch_size, num_choices, seq_len = input_ids.shape
        
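        # Flatten for processing
        flat_input_ids = input_ids.view(-1, seq_len)  # (batch * num_choices, seq_len)
        # Flatten the LM targets the same way so the auxiliary loss is actually computed
        flat_lm_targets = lm_targets.reshape(-1, seq_len) if lm_targets is not None else None
        
        # Get GPT outputs
        _, lm_loss, hidden_states = self.gpt(flat_input_ids, flat_lm_targets)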
        
        # Extract representation at last position
        pooled = hidden_states[:, -1, :]  # (batch * num_choices, n_embd)
        pooled = self.dropout(pooled)
        
        # Get scores
        scores = self.score_head(pooled).squeeze(-1)  # (batch * num_choices,)
        scores = scores.view(batch_size, num_choices)  # (batch, num_choices)
        
        # Compute loss
        loss = None
        if labels is not None:
            task_loss = F.cross_entropy(scores, labels)
            loss = task_loss
            if lm_loss is not None:
                loss = task_loss + self.lm_weight * lm_loss
        
        return scores, loss


print("\nGPT for Multiple Choice:")
model_mc = GPTForMultipleChoice(config)
print(f"  Score head parameters: {sum(p.numel() for p in model_mc.score_head.parameters()):,}")

In [None]:
class GPTForSimilarity(nn.Module):
    """
    GPT for similarity tasks (e.g., QQP, STS-B).
    
    From paper Section 3.3:
    "For similarity tasks, there is no inherent ordering of the two sentences 
    being compared. To reflect this, we modify the input sequence to contain 
    both possible sentence orderings (with a delimiter in between) and process 
    each independently to produce two sequence representations h_l^m which are 
    added element-wise before being fed into the linear output layer."
    """
    
    def __init__(self, config, num_labels: int = 1, lm_weight: float = 0.5):
        super().__init__()
        self.gpt = GPT(config)
        self.lm_weight = lm_weight
        self.num_labels = num_labels
        
        # Output head
        self.classifier = nn.Linear(config.n_embd, num_labels)
        self.dropout = nn.Dropout(0.1)
    
    def forward(
        self,
        input_ids_ab: torch.Tensor,  # (batch, seq_len) - "A $ B"
        input_ids_ba: torch.Tensor,  # (batch, seq_len) - "B $ A"
        labels: Optional[torch.Tensor] = None,
    ):
        """
        Process both orderings and add representations.
        """
        # Process A $ B
        _, _, hidden_ab = self.gpt(input_ids_ab, None)
        pooled_ab = hidden_ab[:, -1, :]  # (batch, n_embd)
        
        # Process B $ A
        _, _, hidden_ba = self.gpt(input_ids_ba, None)
        pooled_ba = hidden_ba[:, -1, :]  # (batch, n_embd)
        
        # Element-wise addition (key insight from paper!)
        combined = pooled_ab + pooled_ba  # (batch, n_embd)
        combined = self.dropout(combined)
        
        # Predict
        logits = self.classifier(combined)  # (batch, num_labels)
        
        loss = None
        if labels is not None:
            if self.num_labels == 1:
                # Regression (e.g., STS-B); squeeze only the last dim so batch size 1 keeps its shape
                loss = F.mse_loss(logits.squeeze(-1), labels.float())
            else:
                # Classification (e.g., QQP)
                loss = F.cross_entropy(logits, labels)
        
        return logits, loss


print("\nGPT for Similarity (processes both orderings):")
model_sim = GPTForSimilarity(config, num_labels=2)
print("  Summing both orderings makes the prediction independent of input order")
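
Why does adding the two representations remove the dependence on input order? The sketch below uses a toy order-sensitive encoder (a position-weighted sum of token ids, purely an assumption for illustration and not the Transformer) to show that each ordering gives a different vector while their element-wise sum is the same whichever sentence comes first.

```python
DELIM = 0  # hypothetical delimiter token id

def toy_encode(tokens):
    """Toy order-sensitive encoder: position-weighted sums of token ids (2 dims)."""
    return [
        sum((i + 1) * t for i, t in enumerate(tokens)),
        sum((i + 1) ** 2 * t for i, t in enumerate(tokens)),
    ]

def pair_representation(a, b):
    """Encode 'A $ B' and 'B $ A' separately, then add element-wise."""
    h_ab = toy_encode(a + [DELIM] + b)
    h_ba = toy_encode(b + [DELIM] + a)
    return [x + y for x, y in zip(h_ab, h_ba)]

sent_a = [5, 7, 9]
sent_b = [2, 4]

print("A $ B:", toy_encode(sent_a + [DELIM] + sent_b))
print("B $ A:", toy_encode(sent_b + [DELIM] + sent_a))
print("sum (A, B):", pair_representation(sent_a, sent_b))
print("sum (B, A):", pair_representation(sent_b, sent_a))
```

Because addition is commutative, swapping the arguments only swaps the two terms of the sum, so the classifier sees the same input either way.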

---

## 5. Fine-tuning Hyperparameters

### 5.1 What the Paper Says

From Section 4.1:

> *"For most tasks, we use a learning rate of 6.25e-5 and a batchsize of 32. Our model finetunes quickly and 3 epochs of training was sufficient for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. λ was set to 0.5."*

### 5.2 Complete Hyperparameter Table

| Hyperparameter | Pre-training | Fine-tuning |
|----------------|--------------|-------------|
| **Learning rate** | 2.5e-4 | **6.25e-5** (4x smaller) |
| **Batch size** | 64 | **32** |
| **Epochs** | 100 | **3** |
| **LR warmup** | 2000 steps | **0.2% of training** |
| **LR schedule** | Cosine | **Linear decay** |
| **Dropout** | 0.1 | **0.1** (unchanged) |
| **λ (LM weight)** | N/A | **0.5** |
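
The fine-tuning schedule in the table (linear warmup over 0.2% of training, then linear decay) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the dataset size is hypothetical, the decay is taken to end at zero, and the function names are not from the paper.

```python
import math

PEAK_LR = 6.25e-5        # fine-tuning learning rate
WARMUP_FRACTION = 0.002  # warmup over 0.2% of training

def warmup_steps_for(total_steps, warmup_fraction=WARMUP_FRACTION):
    """Number of warmup steps, at least one."""
    return max(1, math.ceil(total_steps * warmup_fraction))

def lr_at_step(step, total_steps, peak_lr=PEAK_LR, warmup_fraction=WARMUP_FRACTION):
    """Linear warmup to peak_lr, then linear decay to zero (assumed end point)."""
    warmup_steps = warmup_steps_for(total_steps, warmup_fraction)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

# Hypothetical task with 100,000 labeled examples
num_examples, batch_size, epochs = 100_000, 32, 3
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = warmup_steps_for(total_steps)

print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps - 1, total_steps // 2, total_steps - 1):
    print(f"step {step:>5}: lr = {lr_at_step(step, total_steps):.3e}")
```

With these numbers the warmup lasts only a handful of updates, so almost all of fine-tuning runs on the decaying part of the schedule.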

### 5.3 Why These Choices?

**Smaller learning rate (6.25e-5):**
- Pre-trained weights are already good
- Don't want to destroy learned features
- Just need small adjustments for the task

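**Only 3 epochs:**
- The paper reports that the model fine-tunes quickly, and 3 epochs sufficed for most tasks
- Several task datasets are small (a few thousand labeled examples), so more epochs risk overfitting
- Pre-trained features do most of the work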

**λ = 0.5:**
- Balance between task and language modeling
- Too small: lose regularization benefit
- Too large: task learning is hampered
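
The trade-off controlled by λ is easy to see numerically. The sketch below evaluates the combined objective L3 = L2 + λ·L1 for a few values of λ; the two loss values are hypothetical, chosen only to show how the language modeling term's share of the total grows with λ.

```python
def combined_loss(task_loss, lm_loss, lm_weight=0.5):
    """L3 = L2 + lambda * L1, with L2 the task loss and L1 the LM loss."""
    return task_loss + lm_weight * lm_loss

# Hypothetical per-batch losses (not measured values)
task_loss, lm_loss = 1.10, 3.40

for lam in (0.0, 0.5, 2.0):
    total = combined_loss(task_loss, lm_loss, lam)
    lm_share = lam * lm_loss / total
    print(f"lambda={lam:<3}  L3={total:.2f}  LM share of loss={lm_share:.0%}")
```

At λ = 0 the objective reduces to the task loss alone. As λ grows the gradient is increasingly dominated by next-token prediction, which is the sense in which a large λ can hamper task learning.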

In [None]:
@dataclass
class FineTuningConfig:
    """
    Fine-tuning configuration from Section 4.1.
    """
    # === From Section 4.1 ===
    learning_rate: float = 6.25e-5      # "learning rate of 6.25e-5"
    batch_size: int = 32                 # "batchsize of 32"
    epochs: int = 3                      # "3 epochs of training was sufficient for most cases"
    warmup_fraction: float = 0.002       # "warmup over 0.2% of training"
    lm_weight: float = 0.5               # "λ was set to 0.5"
    
    # === Standard ===
    dropout: float = 0.1
    weight_decay: float = 0.01
    max_grad_norm: float = 1.0


ft_config = FineTuningConfig()

print("Fine-tuning Configuration (from paper Section 4.1)")
print("=" * 60)
print(f"\n[Optimization]")
print(f"  Learning rate:     {ft_config.learning_rate} (vs 2.5e-4 pre-training)")
print(f"  Batch size:        {ft_config.batch_size}")
print(f"  Epochs:            {ft_config.epochs}")
print(f"  Warmup:            {ft_config.warmup_fraction*100}% of training")
print(f"\n[Auxiliary Loss]")
print(f"  LM weight (lambda): {ft_config.lm_weight}")
print(f"  L3 = L_task + {ft_config.lm_weight} * L_LM")

---

## 6. Results Analysis

### 6.1 Datasets

GPT was evaluated on 12 datasets across 4 task categories:

| Category | Datasets | Task Type |
|----------|----------|----------|
| **Natural Language Inference** | SNLI, MNLI, QNLI, SciTail, RTE | Entailment |
| **Question Answering** | RACE, Story Cloze | Multiple Choice |
| **Semantic Similarity** | QQP, STS-B, MRPC | Similarity |
| **Classification** | CoLA, SST-2 | Single sequence |

### 6.2 Main Results

From Tables 2-4 in the paper, GPT achieved **state-of-the-art on 9 out of 12 tasks** (the exceptions are RTE, MRPC, and SST-2):

In [None]:
def visualize_results():
    """Visualize GPT's results on various benchmarks."""
    
    # Results from paper Tables 2-4: (dataset, GPT score, best previously reported result)
    results = {
        'Natural Language Inference': [
            ('SNLI', 89.9, 'Previous SOTA: 89.3'),
            ('MNLI-m', 82.1, 'Previous SOTA: 80.6'),
            ('MNLI-mm', 81.4, 'Previous SOTA: 80.1'),
            ('SciTail', 88.3, 'Previous SOTA: 83.3'),
            ('QNLI', 88.1, 'Previous SOTA: 82.3'),
            ('RTE', 56.0, 'Previous best: 61.7 (GPT not SOTA)'),
        ],
        'Question Answering': [
            ('RACE-m', 62.9, 'Previous SOTA: 60.2'),
            ('RACE-h', 57.4, 'Previous SOTA: 50.3'),
            ('Story Cloze', 86.5, 'Previous SOTA: 77.6'),
        ],
        'Semantic Similarity': [
            ('QQP', 70.3, 'F1; previous SOTA: 66.1'),
            ('STS-B', 82.0, 'Pearson corr.; previous SOTA: 81.0'),
            ('MRPC', 82.3, 'F1; previous best: 86.0 (GPT not SOTA)'),
        ],
        'Classification': [
            ('CoLA', 45.4, 'Matthews corr.; previous SOTA: 35.0'),
            ('SST-2', 91.3, 'Accuracy; previous best: 93.2 (GPT not SOTA)'),
        ]
    }
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    axes = axes.flatten()
    
    colors = ['#3498db', '#e74c3c', '#27ae60', '#9b59b6']
    
    for idx, (category, data) in enumerate(results.items()):
        ax = axes[idx]
        
        names = [d[0] for d in data]
        scores = [d[1] for d in data]
        notes = [d[2] for d in data]
        
        y_pos = np.arange(len(names))
        bars = ax.barh(y_pos, scores, color=colors[idx], edgecolor='black', height=0.6)
        
        ax.set_yticks(y_pos)
        ax.set_yticklabels(names, fontsize=10)
        ax.set_xlabel('Score', fontsize=11)
        ax.set_title(category, fontsize=12, fontweight='bold')
        ax.set_xlim(0, 100)
        ax.grid(True, axis='x', alpha=0.3)
        
        # Add score labels
        for bar, score in zip(bars, scores):
            ax.text(score + 1, bar.get_y() + bar.get_height()/2, 
                   f'{score}', va='center', fontsize=10, fontweight='bold')
    
    plt.suptitle('GPT Results on Downstream Tasks\n(State-of-the-art on 9/12 datasets)', 
                 fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()
    
    print("\nKey achievements:")
    print("  - RACE (reading comprehension): +5.7% absolute improvement overall (59.0 vs 53.3)")
    print("  - Story Cloze: +8.9% absolute improvement")
    print("  - QNLI: +5.8% absolute improvement")
    print("  - All with same pre-trained model + minimal task-specific parameters")

visualize_results()

### 6.3 Ablation Studies

The paper includes important ablation studies (Table 5). The tables below show a subset of its columns; RACE is not part of that table.

**Effect of Auxiliary LM Loss:**

| Setting | MNLI | QQP | MRPC | CoLA | SST-2 |
|---------|------|-----|------|------|-------|
| Full model (with aux LM) | **81.8** | **70.3** | 82.3 | 45.4 | 91.3 |
| Without aux LM | 81.1 | 69.8 | **84.9** | **47.9** | **92.0** |

The auxiliary LM loss helps on MNLI and QQP but lowers the score on MRPC, CoLA, and SST-2. The paper's reading of the trend is that larger datasets benefit from the auxiliary objective while smaller ones do not.

**Effect of Pre-training:**

| Setting | MNLI | QQP | MRPC | CoLA | SST-2 |
|---------|------|-----|------|------|-------|
| **Full model** | **81.8** | **70.3** | **82.3** | **45.4** | **91.3** |
| No pre-training | 75.7 | 65.5 | 79.4 | 18.9 | 84.0 |

Removing pre-training hurts **every task**: the average score across the Table 5 tasks falls by 14.8 points (74.7 to 59.9), with CoLA dropping the most.

In [None]:
def visualize_ablations():
    """Visualize ablation study results."""
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # === Auxiliary LM loss ablation ===
    ax1 = axes[0]
    
    # Subset of the paper's Table 5 columns (RACE is not part of that table)
    tasks = ['MNLI', 'QQP', 'MRPC', 'CoLA', 'SST-2']
    with_aux = [81.8, 70.3, 82.3, 45.4, 91.3]
    without_aux = [81.1, 69.8, 84.9, 47.9, 92.0]
    
    x = np.arange(len(tasks))
    width = 0.35
    
    bars1 = ax1.bar(x - width/2, with_aux, width, label='With aux LM loss', 
                    color='#27ae60', edgecolor='black')
    bars2 = ax1.bar(x + width/2, without_aux, width, label='Without aux LM loss', 
                    color='#e74c3c', edgecolor='black')
    
    ax1.set_ylabel('Score', fontsize=11)
    ax1.set_title('Effect of Auxiliary LM Loss\n(Equation 3: L3 = L2 + lambda * L1)', 
                  fontsize=12, fontweight='bold')
    ax1.set_xticks(x)
    ax1.set_xticklabels(tasks)
    ax1.legend(fontsize=10)
    ax1.set_ylim(0, 100)
    ax1.grid(True, axis='y', alpha=0.3)
    
    # Annotate the signed change from adding the auxiliary loss
    for i, (w, wo) in enumerate(zip(with_aux, without_aux)):
        diff = w - wo
        ax1.annotate(f'{diff:+.1f}', xy=(i, max(w, wo) + 2), 
                    ha='center', fontsize=9, 
                    color='green' if diff > 0 else 'red', fontweight='bold')
    
    # === Pre-training ablation ===
    ax2 = axes[1]
    
    with_pt = [81.8, 70.3, 82.3, 45.4, 91.3]
    without_pt = [75.7, 65.5, 79.4, 18.9, 84.0]
    
    bars3 = ax2.bar(x - width/2, with_pt, width, label='With pre-training', 
                    color='#3498db', edgecolor='black')
    bars4 = ax2.bar(x + width/2, without_pt, width, label='Without pre-training', 
                    color='#f39c12', edgecolor='black')
    
    ax2.set_ylabel('Score', fontsize=11)
    ax2.set_title('Effect of Pre-training\n(The core contribution of GPT)', 
                  fontsize=12, fontweight='bold')
    ax2.set_xticks(x)
    ax2.set_xticklabels(tasks)
    ax2.legend(fontsize=10)
    ax2.set_ylim(0, 100)
    ax2.grid(True, axis='y', alpha=0.3)
    
    # Add improvement annotations
    for i, (w, wo) in enumerate(zip(with_pt, without_pt)):
        diff = w - wo
        ax2.annotate(f'+{diff:.1f}', xy=(i, max(w, wo) + 2), 
                    ha='center', fontsize=9, color='blue', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print("\nKey insights from ablations:")
    print("  1. Auxiliary LM loss helps on larger datasets (MNLI, QQP) but not on smaller ones")
    print("  2. Pre-training is CRUCIAL - CoLA drops from 45.4 to 18.9 without it!")
    print("  3. Every task scores lower without pre-training")

visualize_ablations()

### 6.4 Zero-Shot Performance

From Section 5 (Analysis, "Zero-shot Behaviors"):

> *"We observe the performance of these heuristics is stable and steadily increases over training suggesting that generative pretraining supports the learning of a wide variety of task relevant functionality."*

The "heuristics" are hand-designed ways of applying the pre-trained language model to a task **without fine-tuning**. Their steady improvement over the course of pre-training hints at the zero-shot capabilities later explored at scale in GPT-2 and GPT-3.
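
One of the zero-shot heuristics the paper describes, for SST-2, is to append the token `very` to the input and compare the probability the language model assigns to `positive` versus `negative` as the next word. Below is a minimal sketch of that decision rule. The scoring function is a toy keyword counter standing in for a real language model, and all function names and cue words are assumptions for illustration.

```python
from typing import Callable, List

def zero_shot_sentiment(
    text_tokens: List[str],
    next_token_score: Callable[[List[str], str], float],
) -> str:
    """Append 'very', then pick whichever candidate word scores higher."""
    context = text_tokens + ["very"]
    candidates = ["positive", "negative"]
    return max(candidates, key=lambda word: next_token_score(context, word))

def toy_next_token_score(context: List[str], candidate: str) -> float:
    """Stand-in for an LM's log-probability of `candidate` following `context`.
    NOT a real model: it only counts hand-picked cue words."""
    cues = {
        "positive": {"great", "loved", "wonderful"},
        "negative": {"boring", "awful", "hated"},
    }
    return float(sum(tok in cues[candidate] for tok in context))

print(zero_shot_sentiment("i loved this wonderful film".split(), toy_next_token_score))
print(zero_shot_sentiment("an awful boring mess".split(), toy_next_token_score))
```

With a real pre-trained model, `next_token_score` would return the log-probability of the candidate word given the context, and no task-specific parameters would be trained.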

---

## 7. Putting It All Together: Fine-tuning Pipeline

In [None]:
def demo_finetuning():
    """
    Demonstrate the complete fine-tuning pipeline.
    """
    print("Complete Fine-tuning Pipeline Demo")
    print("=" * 60)
    
    # 1. Load pre-trained model
    print("\n1. Load pre-trained GPT model")
    config = GPTConfig()
    model = GPTForSequenceClassification(config, num_labels=3)  # 3-way classification
    
    # 2. Prepare data
    print("\n2. Prepare input data (using input transformation)")
    batch_size = 4
    seq_len = 64
    
    # Simulated input: <s> premise $ hypothesis <e>
    input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
    labels = torch.randint(0, 3, (batch_size,))  # entail/contradict/neutral
    
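    # For auxiliary LM loss (random placeholders for this demo; real targets
    # are the next tokens of input_ids, not independent random ids)
    lm_targets = torch.randint(0, config.vocab_size, (batch_size, seq_len))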
    
    print(f"   Input shape: {input_ids.shape}")
    print(f"   Labels shape: {labels.shape}")
    
    # 3. Setup optimizer
    print("\n3. Setup optimizer (from paper: lr=6.25e-5)")
    optimizer = torch.optim.AdamW(model.parameters(), lr=6.25e-5, weight_decay=0.01)
    
    # 4. Training step
    print("\n4. Training step with combined loss")
    model.train()
    
    # Forward pass
    logits, loss = model(input_ids, labels=labels, lm_targets=lm_targets)
    
    print(f"   Logits shape: {logits.shape}")
    print(f"   Combined loss (L3 = L2 + 0.5*L1): {loss.item():.4f}")
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    
    print("   Backward pass complete!")
    
    # 5. Evaluation
    print("\n5. Evaluation")
    model.eval()
    with torch.no_grad():
        logits, _ = model(input_ids)
        preds = logits.argmax(dim=-1)
        accuracy = (preds == labels).float().mean()
    
    print(f"   Predictions: {preds.tolist()}")
    print(f"   True labels: {labels.tolist()}")
    print(f"   Accuracy: {accuracy.item()*100:.1f}%")
    
    print("\n" + "=" * 60)
    print("Fine-tuning pipeline complete!")
    print("\nIn practice:")
    print("  - Load actual pre-trained weights")
    print("  - Use real task data (MNLI, SNLI, etc.)")
    print("  - Train for 3 epochs")
    print("  - Evaluate on held-out test set")

demo_finetuning()
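
The demo uses random ids as language modeling targets. With real data, the auxiliary loss predicts each next token of the same input sequence. The sketch below shows one common way to build such targets on plain Python lists. It assumes the shift is done in the data pipeline and that the final position is masked with `-100`, the default `ignore_index` of PyTorch's cross-entropy; whether the `GPT` class from the earlier parts shifts internally is not shown here.

```python
IGNORE_ID = -100  # default ignore_index of torch.nn.functional.cross_entropy

def next_token_targets(input_ids, ignore_id=IGNORE_ID):
    """Target at position i is the token at position i + 1.
    The last position has no next token, so it is masked out."""
    return input_ids[1:] + [ignore_id]

# Hypothetical token ids for one sequence
input_ids = [10, 11, 12, 13]
lm_targets = next_token_targets(input_ids)

for token, target in zip(input_ids, lm_targets):
    print(f"input {token} -> target {target}")
```

If the model already shifts logits and labels internally, the unshifted `input_ids` should be passed as targets instead, to avoid shifting twice.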

---

## 8. Summary

### 8.1 Key Contributions of GPT Fine-tuning

| Innovation | Description | Impact |
|------------|-------------|--------|
| **Input transformations** | Convert any task to sequence format | One model, many tasks |
| **Minimal task-specific params** | One linear layer plus delimiter token embeddings | Efficient transfer |
| **Auxiliary LM loss** | L3 = L2 + λ*L1 | Better generalization |
| **Pre-train then fine-tune** | Two-stage paradigm | Now industry standard |

### 8.2 The Four Task Types

| Task Type | Format | Example Datasets |
|-----------|--------|------------------|
| **Classification** | `<s> text <e>` | SST-2, CoLA |
| **Entailment** | `<s> premise $ hypothesis <e>` | MNLI, SNLI, RTE |
| **Similarity** | Both orderings, add | QQP, STS-B, MRPC |
| **Multiple Choice** | One sequence per choice | RACE, Story Cloze |
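
The four formats in the table can be written as small functions over token lists. This is a minimal sketch: the special-token strings follow the table's notation, tokens are plain strings instead of BPE ids, and the function names are assumptions for illustration.

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def classification_input(text):
    """Single sequence: <s> text <e>"""
    return [[START, *text, EXTRACT]]

def entailment_input(premise, hypothesis):
    """One sequence: <s> premise $ hypothesis <e>"""
    return [[START, *premise, DELIM, *hypothesis, EXTRACT]]

def similarity_input(a, b):
    """Two sequences, one per ordering; their representations are added."""
    return [
        [START, *a, DELIM, *b, EXTRACT],
        [START, *b, DELIM, *a, EXTRACT],
    ]

def multiple_choice_input(context, choices):
    """One sequence per candidate answer; each is scored independently."""
    return [[START, *context, DELIM, *choice, EXTRACT] for choice in choices]

premise = ["a", "man", "sleeps"]
hypothesis = ["he", "rests"]
print(entailment_input(premise, hypothesis))
print(len(similarity_input(premise, hypothesis)), "sequences for similarity")
print(len(multiple_choice_input(premise, [["yes"], ["no"], ["maybe"]])), "sequences for 3 choices")
```

Every function returns a list of sequences, so the same Transformer forward pass handles all four task types and only the way the outputs are combined differs.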

### 8.3 Fine-tuning Recipe

```
1. Start with pre-trained GPT weights
2. Add task-specific linear head
3. Transform inputs using appropriate format
4. Train with L3 = L_task + 0.5 * L_LM
5. Use lr=6.25e-5, batch=32, epochs=3
```

### 8.4 Historical Impact

GPT established the **foundation** for:
- **BERT** (Oct 2018): Same pre-train/fine-tune, bidirectional
- **GPT-2** (Feb 2019): Larger scale, zero-shot capabilities
- **GPT-3** (Jun 2020): 175B parameters, in-context learning
- **ChatGPT** (Nov 2022): RLHF fine-tuning for dialogue

---

## References

1. Radford et al. (2018). [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)
2. Wang et al. (2018). [GLUE: A Multi-Task Benchmark](https://arxiv.org/abs/1804.07461)
3. Bowman et al. (2015). [SNLI: Stanford Natural Language Inference](https://arxiv.org/abs/1508.05326)
4. Lai et al. (2017). [RACE: Large-scale ReAding Comprehension Dataset](https://arxiv.org/abs/1704.04683)