# ![Banner](https://github.com/LittleHouse75/flatiron-resources/raw/main/NevitsBanner.png)
---
# Experiment 1 — BERT Encoder → GPT-2 Decoder  
### *“Frankenstein” Encoder–Decoder Summarization Model*
---

This notebook runs Experiment 1 for the project:

**Goal:**  
Evaluate a custom encoder–decoder architecture where:

- **Encoder:** `bert-base-uncased`  
- **Decoder:** `gpt2` (Hugging Face adds cross-attention layers, which start out randomly initialized)  

This is intentionally *not* a pretrained summarization model.  
The purpose is to test whether a glued-together architecture can learn dialogue summarization with curriculum training (warmup → finetune).

All reusable code is imported from `src/`, keeping this notebook clean.

## 1. Environment Setup

In [1]:
# Disable tokenizers parallelism warning
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import torch
from pathlib import Path
import pandas as pd

# Ensure project root is importable
PROJECT_ROOT = Path("..").resolve()
import sys
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# Mute warnings
import warnings
warnings.filterwarnings("ignore", message="Mem Efficient attention")
warnings.filterwarnings(
    "ignore",
    message=".*copy construct from a tensor.*"
)
warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    message=".*better way to train encoder-decoder models.*"
)
warnings.filterwarnings("ignore", message=".*requires_grad=True.*")
warnings.filterwarnings("ignore", message=".*Flash Efficient attention.*")

# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

## 2. Project Imports (Shared Utilities)
We import:
- SAMSum loader  
- Dataset wrapper  
- Model builder  
- Trainer  
- Qualitative preview  

In [2]:
from src.data.load_data import load_samsum
from src.data.preprocess import SummaryDataset
from src.models.build_bert_gpt2 import build_bert_gpt2_model
from src.train.trainer_seq2seq import train_model
from src.eval.qualitative import qualitative_samples

## 3. Constants & Hyperparameters  
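These values were chosen based on EDA and practical training needs. `BATCH_SIZE = 1` with `GRAD_ACCUM = 4` gives an effective batch size of 4 in the shared training loop. `BRIDGE_LR` is used for the warm-up phase and `LEARNING_RATE` for full fine-tuning.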

In [3]:
MAX_SOURCE_LEN = 512       # <= BERT's max_position_embeddings
MAX_TARGET_LEN = 128
EPOCHS = 10

BATCH_SIZE = 1
GRAD_ACCUM = 4

LEARNING_RATE = 1e-5
BRIDGE_LR = 1e-3

RUN_TRAINING = True

HIST_PATH = PROJECT_ROOT / "models" / "bert-gpt2" / "history.csv"
BEST_DIR = PROJECT_ROOT / "models" / "bert-gpt2" / "best"

## 4. Load SAMSum Data
Data is pulled from `src/data/load_data.py`.  
Local parquet cache is used automatically if available.

In [4]:
train_df, val_df, test_df = load_samsum()
len(train_df), len(val_df), len(test_df)

(14731, 818, 819)
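
As a rough sanity check on the training budget, the training split size printed above can be combined with `BATCH_SIZE` and `GRAD_ACCUM` from section 3. The sketch below assumes the trainer takes one optimizer step per `GRAD_ACCUM` batches and also steps on a final partial group; `steps_per_epoch` is a hypothetical helper, not part of `src/`.

```python
import math
from typing import Tuple


def steps_per_epoch(n_examples: int, batch_size: int, grad_accum: int) -> Tuple[int, int]:
    """Return (batches, optimizer_steps) for one pass over the data."""
    batches = math.ceil(n_examples / batch_size)
    # Assumption: a trailing partial accumulation group still triggers a step
    optimizer_steps = math.ceil(batches / grad_accum)
    return batches, optimizer_steps


n_train = 14731          # training split size printed above
batch_size = 1           # BATCH_SIZE
grad_accum = 4           # GRAD_ACCUM

batches, optimizer_steps = steps_per_epoch(n_train, batch_size, grad_accum)
effective_batch = batch_size * grad_accum

print("batches per epoch:", batches)
print("optimizer steps per epoch:", optimizer_steps)
print("effective batch size:", effective_batch)
```

Under these assumptions, each epoch of the main loop performs far fewer weight updates than forward passes, which is worth keeping in mind when reading the loss curves.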

## 5. Tokenizers & Datasets

GPT-2 has **no pad token**, so we set pad = eos.  

We then build the shared `SummaryDataset`.

In [None]:
from transformers import BertTokenizer, GPT2Tokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_tokenizer.model_max_length = 512  # native BERT limit

gpt_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Ensure we actually have a pad token
if gpt_tokenizer.pad_token is None:
    gpt_tokenizer.add_special_tokens({"pad_token": gpt_tokenizer.eos_token})

gpt_tokenizer.model_max_length = 1024 

# PyTorch datasets
train_dataset = SummaryDataset(train_df, bert_tokenizer, gpt_tokenizer,
                               MAX_SOURCE_LEN, MAX_TARGET_LEN)

val_dataset = SummaryDataset(val_df, bert_tokenizer, gpt_tokenizer,
                             MAX_SOURCE_LEN, MAX_TARGET_LEN)

from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          shuffle=True, num_workers=0)

val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,
                        shuffle=False, num_workers=0)
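
Because pad and EOS share one id here, padding cannot be told apart from a genuine end-of-sequence token by id alone. `SummaryDataset` is not shown in this notebook, so the following is only a sketch of one common approach: pad to a fixed length, then use the attention mask (rather than the token id) to replace padded label positions with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The helper name and the short token ids are illustrative.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_and_mask(token_ids: List[int], eos_id: int, max_len: int) -> Dict[str, List[int]]:
    """Append EOS, pad to max_len with the EOS id, and mask padded labels."""
    # Truncate first so the real EOS always survives
    ids = token_ids[: max_len - 1] + [eos_id]
    n_real = len(ids)
    n_pad = max_len - n_real

    input_ids = ids + [eos_id] * n_pad            # pad id == eos id
    attention_mask = [1] * n_real + [0] * n_pad
    # Mask by position, not by id, so the genuine EOS still contributes to the loss
    labels = [tok if keep else IGNORE_INDEX
              for tok, keep in zip(input_ids, attention_mask)]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = pad_and_mask([11, 22, 33], eos_id=50256, max_len=6)
print(example["labels"])
```

Masking by token id instead would also hide the real EOS from the loss, and the decoder might then never learn when to stop generating.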


## 6. Build the BERT→GPT-2 Model
This calls the modular builder in `src/models/build_bert_gpt2.py`.

In [6]:
model = build_bert_gpt2_model(
    gpt_pad_token_id=gpt_tokenizer.pad_token_id,
    gpt_bos_token_id=gpt_tokenizer.bos_token_id,
    decoder_tokenizer=gpt_tokenizer,
).to(device)

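# Keep the decoder embedding matrix in sync with the tokenizer vocabulary
model.decoder.resize_token_embeddings(len(gpt_tokenizer))

# The generation cache is incompatible with gradient checkpointing, so disable it.
# It is also set on the decoder's own config, which GPT-2 reads during forward.
model.config.use_cache = False
model.decoder.config.use_cache = False

# Turn on gradient checkpointing for both halves to save memory
model.encoder.gradient_checkpointing_enable()
model.decoder.gradient_checkpointing_enable()

# or, if your HF version supports it:
# model.gradient_checkpointing_enable()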

model

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['transformer.h.0.crossattention.c_attn.bias', 'transformer.h.0.crossattention.c_attn.weight', 'transformer.h.0.crossattention.c_proj.bias', 'transformer.h.0.crossattention.c_proj.weight', 'transformer.h.0.crossattention.q_attn.bias', 'transformer.h.0.crossattention.q_attn.weight', 'transformer.h.0.ln_cross_attn.bias', 'transformer.h.0.ln_cross_attn.weight', 'transformer.h.1.crossattention.c_attn.bias', 'transformer.h.1.crossattention.c_attn.weight', 'transformer.h.1.crossattention.c_proj.bias', 'transformer.h.1.crossattention.c_proj.weight', 'transformer.h.1.crossattention.q_attn.bias', 'transformer.h.1.crossattention.q_attn.weight', 'transformer.h.1.ln_cross_attn.bias', 'transformer.h.1.ln_cross_attn.weight', 'transformer.h.10.crossattention.c_attn.bias', 'transformer.h.10.crossattention.c_attn.weight', 'transformer.h.10.crossattention.c_proj.bias', 'transformer.h.10.cros

EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, el

## 7. Optimizer (Warm-up → Fine-tune)

The training loop is shared, but **Experiment-1’s warmup logic is unique**.  
We handle it here in the notebook and pass the correct optimizer into `train_model()`.

In [None]:
import torch.optim as optim

# Phase 1 — train decoder only (encoder frozen)
for name, p in model.named_parameters():
    if name.startswith("encoder."):
        p.requires_grad = False
    else:
        p.requires_grad = True

decoder_params = [p for p in model.parameters() if p.requires_grad]

# Note: both numbers count parameter *tensors*, not individual weights
print("Trainable params in warmup:", sum(p.requires_grad for p in model.parameters()))
print("Decoder-only params:", len(decoder_params))

optimizer = optim.AdamW(decoder_params, lr=BRIDGE_LR)

Trainable params in warmup: 244
Decoder-only params: 244
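
Both counts are numbers of parameter tensors, not individual weights. The figure of 244 can be reproduced by hand, assuming the standard 12-layer `gpt2` layout with tied input/output embeddings plus the eight cross-attention tensors per block named in the initialization warning in section 6. The breakdown below is an accounting sketch, not output from the model.

```python
N_LAYERS = 12  # gpt2 (small)

# Weight + bias for each submodule in a standard GPT-2 block
base_block = {
    "ln_1": 2, "attn.c_attn": 2, "attn.c_proj": 2,
    "ln_2": 2, "mlp.c_fc": 2, "mlp.c_proj": 2,
}

# Extra tensors per block once cross-attention is enabled
cross_block = {
    "crossattention.c_attn": 2, "crossattention.q_attn": 2,
    "crossattention.c_proj": 2, "ln_cross_attn": 2,
}

# Outside the blocks: token embeddings, position embeddings, final LayerNorm.
# The LM head shares its weight with wte, so it adds no separate tensor.
outside_blocks = {"wte": 1, "wpe": 1, "ln_f": 2}

pretrained_tensors = N_LAYERS * sum(base_block.values()) + sum(outside_blocks.values())
cross_attention_tensors = N_LAYERS * sum(cross_block.values())
total_tensors = pretrained_tensors + cross_attention_tensors

print("pretrained GPT-2 tensors:", pretrained_tensors)
print("new cross-attention tensors:", cross_attention_tensors)
print("total decoder tensors:", total_tensors)
```

Read this way, the warm-up trains the entire decoder, of which the cross-attention tensors are the only part without pretrained weights.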



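## 8. Warm-Up Phase (Decoder Only, Encoder Frozen)

We warm up for `WARMUP_STEPS` batches with the encoder frozen, so only the decoder (including its newly initialized cross-attention layers) is updated. The whole model is unfrozen afterwards.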

In [None]:
if RUN_TRAINING:

    WARMUP_STEPS = 3000
    global_step = 0

    model.train()
    loss_trace = []

    for batch in train_loader:
        global_step += 1

        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
        )
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
        optimizer.zero_grad()

        loss_trace.append(loss.item())

        if global_step >= WARMUP_STEPS:
            break

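    # A bare expression inside an `if` block is not displayed, so print it
    print("Warm-up steps run:", len(loss_trace))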

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
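
With `BATCH_SIZE = 1`, each entry in `loss_trace` reflects a single dialogue, so the raw trace is likely to be noisy. One way to judge whether the warm-up is converging is a trailing moving average. `moving_average` below is a hypothetical helper, and the numbers are made-up toy values, not results from this run.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int) -> List[float]:
    """Trailing mean over the last `window` values (shorter at the start)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # automatically drops the oldest value
    smoothed = []
    for value in values:
        recent.append(value)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


toy_trace = [4.0, 2.0, 6.0, 4.0]   # illustrative values only
smoothed = moving_average(toy_trace, window=2)
print(smoothed)
```

In the notebook this could be applied as `moving_average(loss_trace, window=100)` before plotting; the window size is a judgment call.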


## 9. Fine-Tune Phase (Unfreeze All Layers)

In [None]:
if RUN_TRAINING:

    # Unfreeze all parameters
    for p in model.parameters():
        p.requires_grad = True

    optimizer = optim.AdamW(model.parameters(), lr=LEARNING_RATE)

## 10. Full Training Loop  
This uses the shared `train_model()` from `src/train/trainer_seq2seq.py`  
which handles:
- training epochs  
- validation  
- ROUGE metrics  
- returns a summary DataFrame  

In [None]:
if RUN_TRAINING:
        
    history_df = train_model(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        optimizer=optimizer,
        tokenizer=gpt_tokenizer,
        device=device,
        epochs=EPOCHS,
        max_target_len=MAX_TARGET_LEN,
        checkpoint_dir=str(BEST_DIR),
        patience=2,
        grad_accum_steps=GRAD_ACCUM, 
    )
    print("Best checkpoint saved to:", BEST_DIR)

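    # --- SAVE HISTORY CSV ---
    # Make sure the target folder exists before writing
    HIST_PATH.parent.mkdir(parents=True, exist_ok=True)
    history_df.to_csv(HIST_PATH, index=False)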
    print("Saved training history to:", HIST_PATH)
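
The `patience=2` argument suggests early stopping on a validation metric. How `train_model()` implements it is not shown in this notebook; the sketch below assumes the usual rule of stopping once validation loss has failed to improve for `patience` consecutive epochs, while keeping the best epoch as the checkpoint. The function name and loss values are illustrative.

```python
from typing import List, Tuple


def early_stop_epoch(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: this is where a checkpoint would be written
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


toy_val_losses = [3.0, 2.5, 2.6, 2.7, 2.4]   # illustrative values only
best_epoch, last_epoch = early_stop_epoch(toy_val_losses, patience=2)
print("best epoch:", best_epoch, "| stopped after epoch:", last_epoch)
```

Note that in this toy trace a larger patience would have reached the better fifth epoch, which is the usual trade-off between compute and the risk of stopping too soon.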

In [None]:
if not RUN_TRAINING:
    
    print("Skipping training and loading best saved model...")
    from transformers import EncoderDecoderModel

    # Load Model
    model = EncoderDecoderModel.from_pretrained(BEST_DIR).to(device)

    # Load History
    history_df = pd.read_csv(HIST_PATH)
    print("Loaded saved training history from:", HIST_PATH)

## 11. Loss Curves  
(Optional small plot)

In [None]:
import matplotlib.pyplot as plt

plt.plot(history_df["epoch"], history_df["train_loss"], label="train")
plt.plot(history_df["epoch"], history_df["val_loss"], label="val")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.title("Experiment 1 — Loss Curve")
plt.show()

## 12. Qualitative Examples  

Shows 5 model summaries vs human summaries.

In [None]:
qualitative_samples(
    df=val_df,
    model=model,
    encoder_tokenizer=bert_tokenizer,
    decoder_tokenizer=gpt_tokenizer,
    device=device,
    max_target_len=MAX_TARGET_LEN,
)
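
To put a rough number on a single model/human pair, a unigram-overlap score in the spirit of ROUGE-1 can be computed by hand. This simplified version (lowercasing and whitespace splitting only, no stemming) is meant for intuition and will not exactly match the ROUGE values reported by `train_model()`. The example sentences are invented.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("amanda will bring cookies", "amanda baked cookies")
print(round(score, 4))
```

A chatty summary that repeats the reference's content but adds many extra words keeps recall high while precision drops, which is one way narrative outputs can lose ROUGE points.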

## 13. Save Model + Tokenizers

The save location matches the project README:

In [None]:
SAVE_DIR = PROJECT_ROOT / "models" / "bert-gpt2"
SAVE_DIR.mkdir(parents=True, exist_ok=True)

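model.save_pretrained(SAVE_DIR)
# Both tokenizers write tokenizer_config.json and special_tokens_map.json,
# so each gets its own subfolder (names chosen here) to avoid overwriting
bert_tokenizer.save_pretrained(SAVE_DIR / "encoder_tokenizer")
gpt_tokenizer.save_pretrained(SAVE_DIR / "decoder_tokenizer")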

print("Model saved to:", SAVE_DIR)

# Key Takeaways for Experiment-1

This section will be completed after training. Expected themes:

- Cross-attention warm-up stabilizes training  
- ROUGE improves slowly and plateaus early  
- The model tends to produce chatty, narrative summaries  
- The architecture is likely sub-optimal compared to BART/T5  

This notebook demonstrates the feasibility and limitations of a hand-assembled encoder–decoder system versus pretrained seq2seq models.