## Architecure Overview

> Reformer

The reformer is a transformer based architecture which was introduced to address the limitations of traditional transformer models, particularly while dealing with long sequences of data. It incorporates several key building blocks to achieve these goals.

1. Self attn mechanism - Like other transformer architectures, The reformer too relies on self attn mechanism. This mechanism allows the model to weigh the importance of diff words or tokens within a sequence when processing the data. It enables capturing dependencies and relationships between diff parts of the o/p.

2. Leveraging Locality Sensitive hashing(LSH): One of the key diff. of the Reformer is the use of LSH, which is a technique that helps approximate the self attn. mechanism more efficiently. LSH is used to divide the input seq into smaller chunks or segments, called buckets. By doing this, the model can reduce the computational complexity of self attn, making it more scalable to long sequences.

3. Sparse Factorization: In the Reformer, the self attn mechanism is also sparsely factorized, meaning that it only attends ot a subset of the tokens within a given segment. This further reduces the computational requirements, as the model doesn't need to attend to every token in the sequence. The specific subset of tokens attended to within a segment can vary across different segments, enabling the model to capture diff. dependencies.

4. Reversible Layers: The reformer employs reversible layes, which allow for memory conservation during training. Reversible layers enable the model to reconstruct the intermediate states of the layer from the final output during the backward pass, reducing the memory footprint of the model.

5. Chunking - To handle very long sequences, The reformer divides the input into smaller chunks or blocks, processing them independently. This approach enables parallelization and makes it possible to scale the model to handle sequences of arbitrary length. 

In [30]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import ReformerModel, ReformerConfig, ReformerTokenizer, ReformerForSequenceClassification, AdamW


In [31]:
# Define a custom dataset class
class CommonlitDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        excerpt = self.data['excerpt'][index]
        target = self.data['target'][index]
        inputs = self.tokenizer.encode_plus(
            excerpt,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': inputs['input_ids'][0],
            'attention_mask': inputs['attention_mask'][0],
            'target': torch.tensor(target, dtype=torch.float)
        }

In [32]:
# Load the dataset
train_df = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
test_df = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')


In [45]:
# Initialize the tokenizer
tokenizer = ReformerTokenizer.from_pretrained('google/reformer-crime-and-punishment', force_download=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/242k [00:00<?, ?B/s]

1

In [46]:
# Set the maximum sequence length
max_length = 512


In [47]:
# Create train and validation datasets
train_dataset = CommonlitDataset(train_df, tokenizer, max_length)
val_dataset = CommonlitDataset(test_df, tokenizer, max_length)


In [48]:
# Create data loaders
batch_size = 8
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)


In [49]:
# Initialize a Reformer configuration
configuration = ReformerConfig()


In [57]:
# Initialize a Reformer model
model = ReformerForSequenceClassification.from_pretrained('google/reformer-crime-and-punishment')


You might want to disable causal masking for sequence classification
Some weights of the model checkpoint at google/reformer-crime-and-punishment were not used when initializing ReformerForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.bias', 'lm_head.decoder.bias']
- This IS expected if you are initializing ReformerForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ReformerForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ReformerForSequenceClassification were not initialized from the model checkpoint at google/reformer-crime-and-punishment and are newly initialized: ['reformer.encoder.layers.1.attention.self_attention.ma

In [58]:
# Define the device to train on
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)


ReformerForSequenceClassification(
  (reformer): ReformerModel(
    (embeddings): ReformerEmbeddings(
      (word_embeddings): Embedding(320, 256)
      (position_embeddings): AxialPositionEmbeddings(
        (weights): ParameterList(
            (0): Parameter containing: [torch.float32 of size 512x1x64]
            (1): Parameter containing: [torch.float32 of size 1x1024x192]
        )
      )
    )
    (encoder): ReformerEncoder(
      (layers): ModuleList(
        (0): ReformerLayer(
          (attention): ReformerAttention(
            (layer_norm): LayerNorm((256,), eps=1e-12, elementwise_affine=True)
            (self_attention): LocalSelfAttention(
              (query): Linear(in_features=256, out_features=128, bias=False)
              (key): Linear(in_features=256, out_features=128, bias=False)
              (value): Linear(in_features=256, out_features=128, bias=False)
            )
            (output): ReformerSelfOutput(
              (dense): Linear(in_features=128, out

In [59]:
for batch in train_loader:
    print(batch)
    break

{'input_ids': tensor([[ 98, 262,  32,  ..., 320, 320, 320],
        [258, 309, 263,  ..., 320, 320, 320],
        [ 96, 183, 272,  ..., 320, 320, 320],
        ...,
        [125,  67,  16,  ..., 320, 320, 320],
        [258, 308, 261,  ..., 320, 320, 320],
        [102, 262, 222,  ..., 320, 320, 320]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'target': tensor([-0.0219,  1.1202,  0.1600,  0.3450, -0.0101, -0.1188,  0.9027, -0.1223])}


In [60]:
# Set hyperparameters
learning_rate = 1e-4
num_epochs = 10


In [61]:
# Define the optimizer and loss function
optimizer = AdamW(model.parameters(), lr=learning_rate)
loss_fn = torch.nn.MSELoss()


In [62]:
# Training loop
for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        target = batch['target'].to(device)
        print("Input IDs shape:", input_ids.shape)
        print("Input IDs content:", input_ids)

        optimizer.zero_grad()

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = outputs.logits.squeeze(-1)
        
        loss = loss_fn(predictions, target)
        loss.backward()
        optimizer.step()

        train_loss += loss.item() * input_ids.size(0)

    train_loss /= len(train_loader.dataset)

    # Validation loop
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            target = batch['target'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predictions = outputs.logits.squeeze(-1)
            
            loss = loss_fn(predictions, target)

            val_loss += loss.item() * input_ids.size(0)

        val_loss /=len(val_loader.dataset)

    print(f"Epoch {epoch+1}/{num_epochs}: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")


Input IDs shape: torch.Size([8, 512])
Input IDs content: tensor([[258, 313, 267,  ..., 320, 320, 320],
        [258, 317,  52,  ..., 320, 320, 320],
        [ 33, 100,  44,  ..., 320, 320, 320],
        ...,
        [140,  10,  59,  ...,  21,   5, 271],
        [140,  98,  45,  ...,  13,  98,  45],
        [108, 265,  24,  ..., 320, 320, 320]])
