## Extend model length with LongRoPE

Due to technical limitations, we cannot apply LongRoPE to a large pre-trained LLM, so here we only demonstrate how the code works. We start from a vanilla, freshly initialized LLM (even though the whole point of the paper is to start from a **pre-trained** one), suppose that its pre-training length is 1048, and extend it to 2048. This notebook only shows how to use the functions in ``src``; we make several simplifications so that the code runs quickly.
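
To make "extending the length" more concrete: LongRoPE rescales the rotation angle of each RoPE dimension pair by its own factor, so that longer positions produce angles closer to those seen during training. Below is a minimal sketch of that idea, assuming the standard RoPE angle formula. `rope_angles` and `lambda_factors` are hypothetical names used for illustration, not functions from ``src``, and the head dimension of 8 assumes `d_model // n_heads` with the values used later in this notebook.

```python
import math


def rope_angles(position, head_dim, base=10000.0, lambda_factors=None):
    """Return the rotary angles for one position, one angle per dimension pair.

    A factor lambda_factors[i] > 1 slows down the rotation of pair i.
    (Hypothetical helper for illustration; not part of src.)
    """
    n_pairs = head_dim // 2
    if lambda_factors is None:
        lambda_factors = [1.0] * n_pairs  # plain RoPE, no rescaling
    if len(lambda_factors) != n_pairs:
        raise ValueError("need one factor per pair of dimensions")
    return [
        position / (lambda_factors[i] * base ** (2 * i / head_dim))
        for i in range(n_pairs)
    ]


head_dim = 8  # assumption: d_model // n_heads = 256 // 32

# With a uniform factor of 2, position 2096 gets the same angles that
# position 1048 had without rescaling (plain linear interpolation).
uniform = [2.0] * (head_dim // 2)
stretched = rope_angles(1048 * 2, head_dim, lambda_factors=uniform)
original = rope_angles(1048, head_dim)
same = all(math.isclose(a, b) for a, b in zip(stretched, original))
print(same)
```

A uniform factor is only the simplest case; the search step later in the notebook looks for one factor per dimension pair instead of a single shared value.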

In [1]:
import torch
import numpy as np
import os
from src.dataset import TextDataset
from src.utils_data import load_data
from src.utils_general import truncate_ids

In [2]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.model_max_length = 2048

# The tokenizer is only used to convert the input text to token ids; any other tokenizer would work,
# because we use only the ids, not a pre-trained vector representation. The model learns its own embeddings.

tensor_list = load_data("../data/input.txt", tokenizer, tokenizer.model_max_length)

tensor_list = truncate_ids(tensor_list, 5000)

dataset = TextDataset(tensor_list)


Token indices sequence length is longer than the specified maximum sequence length for this model (338025 > 2048). Running this sequence through the model will result in indexing errors


In [3]:
from src.longrope import LongRoPEModel

model = LongRoPEModel(
    d_model=256,
    n_heads=32,
    num_layers=6,
    vocab_size=5000,
    max_len=2048,  # the model cannot take inputs longer than 2048 tokens
)

In [4]:
model_2 = model.extend_context(
    dataset=tensor_list,
    target_length=1048*2,
    max_sequence_length=2048,
    tokenizer=tokenizer,
    population_size=2,
    num_mutations=1,
    num_crossovers=1,
    max_iterations=2,
)  # adapts a model assumed to be pre-trained on 1048-token inputs to about twice that length (target_length = 1048 * 2 = 2096)


searching for lambda factors: 100%|██████████| 2/2 [01:32<00:00, 46.21s/it]
fine tuning step:   1%|          | 2/200 [00:03<04:24,  1.34s/it]

Step 0, Validation Perplexity: 3458.436767578125


fine tuning step:  26%|██▌       | 52/200 [00:21<02:06,  1.17it/s]

Step 50, Validation Perplexity: 200.75770568847656


fine tuning step:  50%|█████     | 101/200 [00:39<01:56,  1.18s/it]

Step 100, Validation Perplexity: 115.11137390136719


fine tuning step:  76%|███████▌  | 151/200 [00:57<00:57,  1.18s/it]

Step 150, Validation Perplexity: 94.21158599853516


fine tuning step: 100%|██████████| 200/200 [01:12<00:00,  2.78it/s]
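
The `population_size`, `num_mutations`, `num_crossovers`, and `max_iterations` arguments suggest an evolutionary search over the lambda factors. As a rough illustration only, and not the implementation in ``src``, a generic search loop of that kind might look like the sketch below. The function `evolutionary_search`, the bounds on the factors, the mutation noise, and `toy_fitness` (a stand-in for a validation-perplexity measurement) are all assumptions.

```python
import random


def evolutionary_search(fitness, n_factors, scale, population_size=4,
                        num_mutations=2, num_crossovers=2,
                        max_iterations=5, seed=0):
    """Toy evolutionary search that minimizes `fitness` over lists of factors.

    Each candidate is a list of n_factors values kept inside [1.0, scale].
    (Hypothetical sketch; the real search in src may differ in every detail.)
    """
    if population_size < 2:
        raise ValueError("population_size must be at least 2")
    rng = random.Random(seed)

    def random_candidate():
        return [rng.uniform(1.0, scale) for _ in range(n_factors)]

    def mutate(parent):
        # Perturb one randomly chosen factor and clamp it to the bounds.
        child = list(parent)
        i = rng.randrange(n_factors)
        child[i] = min(scale, max(1.0, child[i] + rng.gauss(0.0, 0.1)))
        return child

    def crossover(a, b):
        # Pick each factor from one of the two parents at random.
        return [rng.choice(pair) for pair in zip(a, b)]

    population = [random_candidate() for _ in range(population_size)]
    for _ in range(max_iterations):
        population.sort(key=fitness)
        parents = population[: max(2, population_size // 2)]
        children = [mutate(rng.choice(parents)) for _ in range(num_mutations)]
        children += [crossover(*rng.sample(parents, 2))
                     for _ in range(num_crossovers)]
        # Parents are kept, so the best candidate never gets worse.
        population = sorted(parents + children, key=fitness)[:population_size]
    return min(population, key=fitness)


def toy_fitness(factors):
    # Placeholder objective; a real run would measure perplexity instead.
    return sum((x - 1.5) ** 2 for x in factors)


best = evolutionary_search(toy_fitness, n_factors=4, scale=2.0)
print(best, toy_fitness(best))
```

With the very small settings used in the cell above (a population of 2 and 2 iterations), a loop like this explores only a handful of candidates, which is consistent with the notebook's goal of running quickly rather than finding good factors.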


In [6]:
model.only_embeddings(dataset[0][0].unsqueeze(0))

(tensor([[[ 0.1254, -0.1966,  0.7539,  ..., -0.5824,  0.9575,  0.0959],
          [-0.3238,  0.6256,  0.3047,  ..., -0.5419,  0.9554,  0.1364],
          [-1.7411, -0.0195, -0.0610,  ...,  1.0606,  2.0380, -0.0664],
          ...,
          [ 1.1909, -0.1648, -0.1538,  ..., -2.7744, -2.2751, -0.3570],
          [ 0.6313, -2.5097,  0.5689,  ..., -0.1238, -0.9221, -0.5455],
          [ 2.1247, -1.4082,  0.4335,  ..., -0.6183, -0.0743, -2.0550]]],
        device='cuda:0', grad_fn=<AddBackward0>),
 (tensor([[[-0.8517, -0.1966, -0.2232,  ..., -0.5824,  0.5667,  0.0959],
           [-0.8517, -0.1966, -0.2232,  ..., -0.5824,  0.5667,  0.0959],
           [-1.3345, -0.9080,  0.3456,  ...,  0.9801,  1.6555, -0.1468],
           ...,
           [ 2.1528, -0.3367,  0.8080,  ..., -2.3844, -2.2493,  0.0330],
           [ 1.2956, -1.7932,  1.2332,  ...,  0.2668, -0.9368, -0.1550],
           [ 1.8807, -0.4621,  0.1895,  ..., -0.2313, -0.1293, -1.6681]]],
         device='cuda:0', grad_fn=<EmbeddingB