## Extend model context length with LongRoPE

Due to technical limitations, we cannot apply LongRoPE to a large pre-trained LLM, so this notebook only demonstrates how to use the functions in ``src``. We start from a vanilla LLM (even though the whole point of the paper is to start from a **pre-trained one**) and suppose that its pre-training length is 1048 tokens, which we want to extend to about 2048. The code uses a scale factor of 2, so the actual target length is 2 × 1048 = 2096. We make several simplifications so that the code runs quickly.
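To make the idea of "extending the length" concrete, here is a minimal sketch of per-dimension rescaling of rotary angles, the mechanism LongRoPE searches over. It is not the implementation in ``src``: the function name `rope_angles` and the parameters `lambdas` and `n_hat` are hypothetical, and the head dimension of 8 is only an assumption derived from `d_model / n_heads = 256 / 32` in the model configuration used later.

```python
import math

def rope_angles(position, head_dim, base=10000.0, lambdas=None, n_hat=0):
    """Rotary angles for one position, optionally rescaled per dimension pair.

    lambdas: one rescale factor per (cos, sin) pair; None means plain RoPE.
    n_hat:   positions below this threshold keep their original angles.
    """
    n_pairs = head_dim // 2
    if lambdas is None:
        lambdas = [1.0] * n_pairs
    angles = []
    for i in range(n_pairs):
        theta = base ** (-2.0 * i / head_dim)  # rotation frequency of pair i
        scale = 1.0 if position < n_hat else 1.0 / lambdas[i]
        angles.append(position * theta * scale)
    return angles

base_len, target_len = 1048, 2096
s = target_len / base_len   # scale factor, 2.0 here
head_dim = 8                # assumption: d_model / n_heads = 256 / 32

# Using the same factor s for every pair is plain linear position interpolation.
uniform = [s] * (head_dim // 2)
plain = rope_angles(base_len - 1, head_dim)
stretched = rope_angles(target_len - 1, head_dim, lambdas=uniform)
print(plain[0], stretched[0])
```

With a uniform factor, the last position of the extended window is mapped back to roughly the largest angle seen at the base length. The search replaces the uniform factors with one factor per dimension pair.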

In [1]:
import torch
import numpy as np
import os
from src.dataset import TextDataset
from src.utils_data import load_data
from src.utils_general import truncate_ids

In [2]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.model_max_length = 2048

# We use this tokenizer only to convert the input text to token ids. Any other tokenizer would work,
# since we use only the ids and not a pre-trained vector representation. The embedding is learned by the model.

tensor_list = load_data("../data/input.txt", tokenizer, tokenizer.model_max_length)

tensor_list = truncate_ids(tensor_list, 5000)

dataset = TextDataset(tensor_list)


Token indices sequence length is longer than the specified maximum sequence length for this model (338025 > 2048). Running this sequence through the model will result in indexing errors


In [None]:
from src.longrope import LongRoPEModel

model = LongRoPEModel(
    d_model=256,
    n_heads=32,
    num_layers=6,
    vocab_size=5000,
    max_len=1048,  # the base (pre-training) context length
)

In [None]:
model_2 = model.extend_context(
    dataset=tensor_list,
    target_length=1048*2, # here the scale factor s is 2
    max_sequence_length=1048,
    tokenizer=tokenizer,
    population_size=2,
    num_mutations=1,
    num_crossovers=1,
    max_iterations=2,
)
# This adapts a model pre-trained on inputs of length 1048 so that it can handle inputs of about 2048 tokens.
# We perform a single extension step (from 1048 to 2096).
# We use the smallest possible search parameters, just to check that the code runs.

searching for lambda factors: 100%|██████████| 2/2 [04:15<00:00, 127.67s/it]
fine tuning step:   0%|          | 1/200 [00:08<27:30,  8.30s/it]

Step 0, Validation Perplexity: 3015.17919921875


fine tuning step:  26%|██▌       | 51/200 [00:54<07:53,  3.18s/it]

Step 50, Validation Perplexity: 207.99923706054688


fine tuning step:  50%|█████     | 101/200 [01:40<05:14,  3.18s/it]

Step 100, Validation Perplexity: 117.55162811279297


fine tuning step:  76%|███████▌  | 151/200 [02:26<02:35,  3.17s/it]

Step 150, Validation Perplexity: 94.0124740600586


fine tuning step: 100%|██████████| 200/200 [03:02<00:00,  1.09it/s]
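The arguments `population_size`, `num_mutations`, `num_crossovers` and `max_iterations` passed to `extend_context` control an evolutionary search over the rescale factors. The sketch below shows the general shape of such a search on a toy objective. It is an illustration under stated assumptions, not the code in ``src``: `evolutionary_search`, `toy_fitness` and the mutation and crossover rules are hypothetical, and in the real setting the fitness would be the model's perplexity on long inputs.

```python
import random

def evolutionary_search(fitness, n_dims, scale, population_size=4,
                        num_mutations=2, num_crossovers=2,
                        max_iterations=5, seed=0):
    """Minimise `fitness` over vectors of n_dims factors in [1, scale]."""
    if population_size < 2:
        raise ValueError("population_size must be at least 2")
    rng = random.Random(seed)

    def random_candidate():
        return [rng.uniform(1.0, scale) for _ in range(n_dims)]

    def mutate(parent):
        # Perturb one randomly chosen factor and clip it back into range.
        child = list(parent)
        i = rng.randrange(n_dims)
        child[i] = min(scale, max(1.0, child[i] + rng.gauss(0.0, 0.1)))
        return child

    def crossover(a, b):
        # Take each factor from one of the two parents at random.
        return [rng.choice(pair) for pair in zip(a, b)]

    population = [random_candidate() for _ in range(population_size)]
    for _ in range(max_iterations):
        population.sort(key=fitness)
        parents = population[:max(2, population_size // 2)]
        children = [mutate(rng.choice(parents)) for _ in range(num_mutations)]
        children += [crossover(*rng.sample(parents, 2))
                     for _ in range(num_crossovers)]
        # Parents survive, so the best candidate never gets worse.
        population = sorted(parents + children, key=fitness)[:population_size]
    return min(population, key=fitness)

def toy_fitness(factors):
    # Stand-in for perplexity: squared distance to an arbitrary target of 1.5.
    return sum((x - 1.5) ** 2 for x in factors)

best = evolutionary_search(toy_fitness, n_dims=4, scale=2.0,
                           population_size=2, num_mutations=1,
                           num_crossovers=1, max_iterations=2)
print(best)
```

Each fitness evaluation in the real search requires a forward pass on long sequences, which is why even two iterations with a population of two take several minutes in the run above.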


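To help read the validation perplexity values logged during fine-tuning, here is a minimal sketch of the usual definition: the exponential of the mean negative log-likelihood per token. The helper `perplexity` is hypothetical, and the assumption is that ``src`` computes the metric in the standard way.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that guesses uniformly over a 5000-token vocabulary
# has a perplexity equal to the vocabulary size.
vocab_size = 5000
uniform_guess = [math.log(1.0 / vocab_size)] * 10
print(perplexity(uniform_guess))
```

Under this definition, the perplexity of about 3015 at step 0 is on the order of a uniform guess over the 5000-token vocabulary, while the value of about 94 at step 150 is far below that baseline.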
In [5]:
dataset[0][0].shape

torch.Size([2048])

In [21]:
model.only_embeddings(dataset[4][0].unsqueeze(0))

(tensor([[[ 2.1492, -0.1572,  0.0369,  ...,  0.7049,  0.0450, -0.2447],
          [ 0.4464, -1.6841, -1.1774,  ..., -0.9527,  1.0362, -0.8891],
          [ 0.2646,  1.3279, -0.1183,  ..., -0.0783, -0.2850, -0.2472],
          ...,
          [-0.2188,  1.3719, -1.3156,  ...,  0.6452,  0.6182,  0.0043],
          [-0.9520, -0.9932,  0.1300,  ..., -0.3708, -0.0574,  0.6455],
          [ 1.6752, -1.4626, -0.0169,  ...,  1.8839, -0.2566, -0.0997]]],
        device='cuda:0', grad_fn=<AddBackward0>),
 (tensor([[[ 1.6492, -0.1572, -0.4631,  ...,  0.7049, -0.1550, -0.2447],
           [ 0.1763, -2.1048, -1.4475,  ..., -0.9734,  0.8373, -0.9098],
           [ 0.4727,  0.8733,  0.0898,  ..., -0.1195, -0.4807, -0.2884],
           ...,
           [-0.0147,  0.9155, -1.1115,  ...,  0.4455,  0.6069, -0.1954],
           [-0.4576, -1.0681,  0.6243,  ..., -0.5706, -0.0480,  0.4457],
           [ 2.0054, -1.0870,  0.3133,  ...,  1.6862, -0.2266, -0.2975]]],
         device='cuda:0', grad_fn=<SliceBackw

In [19]:
model(dataset[4][0][:-1].unsqueeze(0))[:,-1,:].argmax()

tensor(25, device='cuda:0')

In [20]:
dataset[4][0][-1]

tensor(25)