### Summarization and Q&A Using Transformer

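- We need an encoder-decoder architecture.

- The encoder reads the input text and turns it into a sequence of contextual representations.

- The decoder generates the output (the summary or the answer) token by token from those representations.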

⸻

1. Understand the Transformer Architecture

At a high level, the original Transformer (from the 2017 paper) consists of:
	•	Input Embedding (plus positional encoding)
	•	Encoder Blocks (for encoding input)
	•	Decoder Blocks (for generating output)
	•	Multi-Head Self-Attention
	•	Feedforward Neural Networks
	•	Layer Normalization & Residual Connections
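
The attention computation behind the Multi-Head Self-Attention item can be sketched in a few lines. This is a minimal single-head version, assuming PyTorch; the function name `attention` and the tensor shapes are illustrative choices, not something fixed by the notes above.

```python
import math
import torch

def attention(query, key, value, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = query.size(-1)
    # Similarity of every query position with every key position.
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a very negative score,
        # so softmax assigns them (almost) zero weight.
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = scores.softmax(dim=-1)
    return weights @ value, weights

torch.manual_seed(0)
x = torch.randn(2, 5, 8)            # (batch, sequence length, model dimension)
out, weights = attention(x, x, x)   # self-attention: Q, K and V all come from x
print(out.shape, weights.shape)
```

Multi-head attention runs several of these in parallel on learned linear projections of the input and concatenates the results.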

⸻

2. Minimal Setup (Start Here)

Use PyTorch or TensorFlow; PyTorch is the more common choice for from-scratch implementations.

Basic Structure:

class Transformer(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.encoder = Encoder(...)
        self.decoder = Decoder(...)
    
    def forward(self, src, tgt):
        encoder_output = self.encoder(src)
        output = self.decoder(tgt, encoder_output)
        return output

⸻

3. Core Components to Implement
	•	Positional Encoding
	•	Multi-Head Attention
	•	Layer Normalization
	•	Feed-Forward Layer
	•	Encoder Layer (stacked N times)
	•	Decoder Layer (stacked N times)
	•	Final Linear + Softmax for prediction
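
As one concrete example from the list above, here is a minimal sketch of the fixed sine/cosine positional encoding from the 2017 paper. The function name, the `max_len` value and the assumption that `d_model` is even are illustrative choices.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of fixed sine/cosine position signals.

    Assumes d_model is even.
    """
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # Frequencies decrease geometrically across the even dimensions.
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
embeddings = torch.zeros(1, 10, 16)     # stand-in for token embeddings
with_positions = embeddings + pe[:10]   # broadcasts over the batch dimension
print(with_positions.shape)
```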

⸻

4. Helpful Resources
	•	The Annotated Transformer (PyTorch): https://nlp.seas.harvard.edu/2018/04/03/attention.html
	•	Transformer from scratch in PyTorch (YouTube): the AI Epiphany channel is gold for this.

⸻

5. Training Your Transformer

Once you build the model:
	•	Choose your dataset (e.g., for Q&A, use SQuAD; for summarization, try CNN/DailyMail).
	•	Tokenize your data.
	•	Define your loss (usually CrossEntropyLoss for text).
	•	Use teacher forcing during training.
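
The steps above can be sketched as a single training step. This is only an illustration under several assumptions: it uses PyTorch's built-in `nn.Transformer` as a stand-in for a from-scratch model, random token ids instead of a real tokenized dataset, a pad id of 0, and it leaves out positional encodings for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, pad_id = 20, 16, 0   # toy sizes, chosen for illustration

embed = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
model = nn.Transformer(d_model=d_model, nhead=2, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=32,
                       dropout=0.0, batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)
criterion = nn.CrossEntropyLoss(ignore_index=pad_id)  # padding is excluded from the loss
params = list(embed.parameters()) + list(model.parameters()) + list(to_vocab.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

# Toy batch of token ids (1..19; 0 is reserved for padding).
src = torch.randint(1, vocab_size, (4, 7))
tgt = torch.randint(1, vocab_size, (4, 6))

# Teacher forcing: the decoder is fed the gold prefix and must predict the next token.
decoder_input = tgt[:, :-1]
labels = tgt[:, 1:]

# Causal mask so that position i cannot attend to positions after i.
T = decoder_input.size(1)
causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

hidden = model(embed(src), embed(decoder_input), tgt_mask=causal_mask)
logits = to_vocab(hidden)                              # (batch, T, vocab_size)
loss = criterion(logits.reshape(-1, vocab_size), labels.reshape(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```

Note that `CrossEntropyLoss` takes raw logits; it applies log-softmax internally.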

⸻

Optional: Start Small

You can start by building:
	•	Just the Encoder (like BERT)
	•	Or a small Encoder-Decoder (like TinyT5)

⸻

## TODO:

- collect data:
    - collect all the arXiv ids
    - use https://arxiv.org/pdf/{id} to download each PDF
    - read all the PDF articles and save their text to a CSV file

- which part of the Transformer controls summarization vs. translation?
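
A rough sketch of the data-collection steps in the TODO, using only the standard library. The helper names (`pdf_url`, `download_pdf`, `write_index`) and the `pdfs` output directory are hypothetical. Extracting text from the PDFs needs a separate PDF library and is left out here.

```python
import csv
import urllib.request
from pathlib import Path

def pdf_url(arxiv_id):
    """Build the PDF URL for one arXiv id, following the pattern in the TODO."""
    return f"https://arxiv.org/pdf/{arxiv_id}"

def download_pdf(arxiv_id, out_dir="pdfs"):
    """Download one PDF into out_dir and return the local path (hypothetical helper)."""
    # Older arXiv ids contain a slash, which is not valid inside a file name.
    out_path = Path(out_dir) / f"{arxiv_id.replace('/', '_')}.pdf"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(pdf_url(arxiv_id), out_path)
    return out_path

def write_index(rows, csv_path):
    """Save (id, text) rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "text"])
        writer.writerows(rows)

ids = ["0704.0001", "0704.0002"]
print([pdf_url(i) for i in ids])
# download_pdf(ids[0])  # uncomment to actually fetch a file (needs network access)
```

When downloading many files, it is worth pausing between requests and checking arXiv's guidance on bulk access first.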

In [12]:
import os
import kagglehub
import pandas as pd

# Data Preparation

In [None]:
# Download latest version
path = kagglehub.dataset_download("Cornell-University/arxiv")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv?dataset_version_number=225...
100%|██████████| 1.42G/1.42G [00:39<00:00, 38.1MB/s]
Extracting files...
Path to dataset files: /Users/seaqueue/.cache/kagglehub/datasets/Cornell-University/arxiv/versions/225


In [8]:
files = os.listdir(path)
print(files)

['arxiv-metadata-oai-snapshot.json']


In [13]:
# The snapshot is in JSON Lines format (one record per line), hence lines=True.
df = pd.read_json(os.path.join(path, 'arxiv-metadata-oai-snapshot.json'), lines=True)
print("First 5 records:", df.head())

First 5 records:           id           submitter  \
0  0704.0001      Pavel Nadolsky   
1  0704.0002        Louis Theran   
2  0704.0003         Hongjun Pan   
3  0704.0004        David Callan   
4  0704.0005  Alberto Torchinsky   

                                             authors  \
0  C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...   
1                    Ileana Streinu and Louis Theran   
2                                        Hongjun Pan   
3                                       David Callan   
4           Wael Abu-Shammala and Alberto Torchinsky   

                                               title  \
0  Calculation of prompt diphoton production cros...   
1           Sparsity-certifying Graph Decompositions   
2  The evolution of the Earth-Moon system based o...   
3  A determinant of Stirling cycle numbers counts...   
4  From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...   

                                  comments  \
0  37 pages, 15 figures; published version   


# Model Architecture


In [None]:
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax
import copy

In [16]:
class EncoderDecoder(nn.Module):
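    # generator is the final projection from decoder output to vocabulary
    # log-probabilities (see Generator); forward() below does not call it.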
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator
    
    # Two masks: src_mask typically hides padding in the source, while tgt_mask
    # also hides future target positions from the decoder.
    def forward(self, src, tgt, src_mask, tgt_mask):
        # Take in and process masked src and tgt sequences.
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)
    
    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)
    
    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
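
On the question of why there are two masks, here is a minimal sketch of how they are commonly built. It assumes a pad id of 0; the helper names `padding_mask` and `subsequent_mask` are illustrative.

```python
import torch

def padding_mask(tokens, pad_id=0):
    """True where a real token is present; shape (batch, 1, seq_len)."""
    return (tokens != pad_id).unsqueeze(-2)

def subsequent_mask(size):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(1, size, size, dtype=torch.bool))

src = torch.tensor([[5, 6, 7, 0, 0]])   # 0 marks padding (assumed pad id)
tgt = torch.tensor([[1, 8, 9, 0]])

src_mask = padding_mask(src)                                   # hides source padding
tgt_mask = padding_mask(tgt) & subsequent_mask(tgt.size(-1))   # hides padding and future tokens

print(src_mask.shape, tgt_mask.shape)
print(tgt_mask[0].int())
```

The source mask only needs to hide padding, because the encoder may look at the whole input. The target mask must also hide future positions; otherwise the decoder could copy the token it is supposed to predict during training.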

In [17]:
class Generator(nn.Module):
    '''
    Standard linear + log-softmax generation step (returns log-probabilities)
    '''
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)
    
    def forward(self, x):
        # TODO: which activation function is the best for summarization?
        return log_softmax(self.proj(x), dim=-1)
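
Because `Generator` returns log-probabilities, it pairs with `nn.NLLLoss` rather than `nn.CrossEntropyLoss`, which expects raw logits. A small check of that relationship, with toy sizes chosen for illustration:

```python
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax

torch.manual_seed(0)
d_model, vocab = 8, 11
proj = nn.Linear(d_model, vocab)          # same role as Generator.proj

x = torch.randn(3, d_model)               # three decoder output vectors
targets = torch.tensor([2, 5, 10])        # gold next-token ids

logits = proj(x)
log_probs = log_softmax(logits, dim=-1)   # what Generator.forward returns

loss_from_log_probs = nn.NLLLoss()(log_probs, targets)
loss_from_logits = nn.CrossEntropyLoss()(logits, targets)
print(torch.allclose(loss_from_log_probs, loss_from_logits))
```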

In [18]:
def clones(module, N):
    # Produce N identical layers
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])
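
A quick check of what `clones` produces: the copies start with identical weights, but `copy.deepcopy` makes them independent, so each layer learns its own parameters. The `nn.Linear(4, 4)` module here is just a small stand-in.

```python
import copy
import torch
import torch.nn as nn

def clones(module, N):
    # Produce N identical but independent copies of a module.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

layers = clones(nn.Linear(4, 4), 3)
same_init = torch.equal(layers[0].weight, layers[1].weight)
with torch.no_grad():
    layers[0].weight.add_(1.0)            # change only the first copy
still_same = torch.equal(layers[0].weight, layers[1].weight)
print(same_init, still_same)
```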

In [22]:
class Encoder(nn.Module):
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, mask):
        'pass the input (and mask) through each layer in turn'
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)
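
`Encoder` expects each layer to expose a `size` attribute and to accept `(x, mask)`. The sketch below shows that contract with a hypothetical `ToyLayer` (a residual linear map that ignores the mask) and uses PyTorch's built-in `nn.LayerNorm` as a stand-in for the notebook's own `LayerNorm`, so that it runs on its own.

```python
import copy
import torch
import torch.nn as nn

def clones(module, N):
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class Encoder(nn.Module):
    def __init__(self, layer, N):
        super().__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)   # built-in stand-in for the notebook's LayerNorm

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

class ToyLayer(nn.Module):
    """Hypothetical layer: a residual linear map that ignores the mask."""
    def __init__(self, size):
        super().__init__()
        self.size = size                       # Encoder reads this to size its final norm
        self.linear = nn.Linear(size, size)

    def forward(self, x, mask):
        return x + self.linear(x)

encoder = Encoder(ToyLayer(16), N=3)
x = torch.randn(2, 5, 16)                      # (batch, sequence length, model dimension)
mask = torch.ones(2, 1, 5, dtype=torch.bool)   # no padding in this toy batch
out = encoder(x, mask)
print(out.shape, len(encoder.layers))
```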


In [21]:
class LayerNorm(nn.Module):
    '''Construct a layernorm module (See citation for details).'''
    # eps is a small constant added to std to avoid division by zero.
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps
    
    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
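
On the purpose of `eps`: a small demonstration with an input whose features are all equal, so the standard deviation is exactly zero. The class is repeated only so the example runs on its own.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Same computation as the cell above, repeated so this example runs alone."""
    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

constant = torch.full((1, 4), 3.0)        # every feature equal, so std is exactly 0

with_eps = LayerNorm(4)(constant)
without_eps = LayerNorm(4, eps=0.0)(constant)
print(with_eps)      # finite: 0 / (0 + eps) = 0
print(without_eps)   # 0 / 0 gives nan

torch.manual_seed(0)
x = torch.randn(2, 4)
normed = LayerNorm(4)(x)
print(normed.mean(-1), normed.std(-1))    # roughly 0 and 1 per row
```

This version differs slightly from PyTorch's built-in `nn.LayerNorm`, which uses the biased variance and adds `eps` inside the square root, so the two will not match to the last digit.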