# LongMem with Vicuna 13B

SERVER HARDWARE SPECIFICATION:
Model: Dell Precision 7920

Processor: Intel Xeon Gold 5218 x 2  (64 Cores)

Memory: 512 GB DDR4 3200 MHz

SSD: Samsung 870 1 TB x 2 RAID 1

GPU: Nvidia RTX 3090 x 2 (nvlink enabled)

In [1]:
import gc
import torch


gc.collect()


torch.cuda.empty_cache()

In [1]:
!pip install pytorch-transformers

Collecting pytorch-transformers
  Using cached pytorch_transformers-1.2.0-py3-none-any.whl (176 kB)
Collecting sacremoses (from pytorch-transformers)
  Using cached sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... [?25ldone
INFO: pip is looking at multiple versions of botocore to determine which version is compatible with other requirements. This could take a while.
Collecting botocore<1.32.0,>=1.31.1 (from boto3->pytorch-transformers)
  Obtaining dependency information for botocore<1.32.0,>=1.31.1 from https://files.pythonhosted.org/packages/95/ef/ed8c3cf1058601218f5b9656f9305db8f1f255c71eee9b8388b73ccee892/botocore-1.31.15-py3-none-any.whl.metadata
  Downloading botocore-1.31.15-py3-none-any.whl.metadata (5.9 kB)
  Obtaining dependency information for botocore<1.32.0,>=1.31.1 from https://files.pythonhosted.org/packages/73/f0/ad1309cf4583e1b1ea76db26e99961ef236e22deaf2f40a32f9c85339034/botocore-1.31.14-py3-none-any.whl.metadata
  Downloading botocore-1.31.14-py3

  Obtaining dependency information for botocore<1.32.0,>=1.31.1 from https://files.pythonhosted.org/packages/c1/5b/43d721ba00649b8d7c2d1002176a2f3bff556e5af61cd38858ef46e9300a/botocore-1.31.9-py3-none-any.whl.metadata
  Downloading botocore-1.31.9-py3-none-any.whl.metadata (5.9 kB)
INFO: pip is still looking at multiple versions of botocore to determine which version is compatible with other requirements. This could take a while.
  Obtaining dependency information for botocore<1.32.0,>=1.31.1 from https://files.pythonhosted.org/packages/51/26/24a75bb8502fcda1b8dcfe117534ba553109cfe1bba2fbeba2bb0130dc4b/botocore-1.31.8-py3-none-any.whl.metadata
  Downloading botocore-1.31.8-py3-none-any.whl.metadata (5.9 kB)
  Obtaining dependency information for botocore<1.32.0,>=1.31.1 from https://files.pythonhosted.org/packages/1a/91/1bc3d8e3dd1081284ea7b10967e3be6d4af1cfd2d4e8d2e9eda39dbff62b/botocore-1.31.7-py3-none-any.whl.metadata
  Downloading botocore-1.31.7-py3-none-any.whl.metadata (5.9 kB)


In [2]:
!nvidia-smi

Mon Jul 31 06:13:42 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:17:00.0 Off |                  N/A |
|  0%   22C    P8              23W / 350W |  12719MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:73:00.0 Off |  

In [3]:
!pip install --upgrade torch

[33mDEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of omegaconf or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063[0m[33m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython -m pip install --upgrade pip[0m


The code you provided implements a Long-term Memory Augmented Large Language Model (LLM) that uses a memory bank to cache long-form previous context or knowledge and further takes advantage of it via a decoupled memory module. The LLM model is designed to address the issue of memory staleness in existing large language models.

Here's a summary of what the code is doing:

Import the required libraries and modules.

--Define a CachedMemoryBank class that serves as a memory bank to cache previous context or knowledge. It uses an embedding layer followed by linear layers to create keys and values representations from the input tensor.

--Define a ResidualSideNet class that serves as a decoupled memory module. It takes the transformer outputs (contextual embeddings) as input, applies linear transformations with a ReLU activation function, and then adds a residual connection to the original input. This process allows the model to retrieve and fuse memory-augmented long-context information.

--Define the main LLMWithMemory class, which is the Long-term Memory Augmented Large Language Model. This class combines the CachedMemoryBank and ResidualSideNet modules along with a transformer-based backbone language model (such as GPT-2). The LLMWithMemory class is responsible for memory augmentation and long-form language modeling.

In the forward method of the LLMWithMemory class:
a. The input tensor (input_ids) is passed through the memory bank to generate keys and values representations.
b. The backbone language model is used to obtain transformer outputs (contextual embeddings) for the input tensor.
c. The keys and transformer outputs are multiplied to get memory_attention, which is then softmaxed to obtain attention weights.
d. The attention weights are used to retrieve memory_augmented representations from the memory bank using matrix multiplication.
e. The transformer outputs are reshaped and passed through the ResidualSideNet to get side_net_output.
f. The side_net_output is linearly transformed to predict logits for language modeling.

The generate_text method is provided to generate text from the model. It takes an input tensor and performs a forward pass repeatedly, predicting the next token in a loop until the desired maximum length of the generated text is reached.

The overall goal of this Long-term Memory Augmented Large Language Model is to allow the language model to remember and utilize long-form context or knowledge efficiently, thus improving its language modeling capabilities.

Please note that the code provided may still require further adjustments or fine-tuning based on the specific context and downstream tasks.

In [2]:
import os

# Set CUDA_LAUNCH_BLOCKING environment variable
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["TORCH_USE_CUDA_DSA"] = "1"

In [1]:
import torch
import torch.nn as nn
import transformers
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline
import torch.quantization
from transformers import BitsAndBytesConfig
#from torch.nn.parallel import DataParallel
#import torch.distributed as dist
#import torch.multiprocessing as mp
from torch.cuda.amp import autocast, GradScaler


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /root/anaconda3/envs/py39/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /root/anaconda3/envs/py39/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...


  warn(msg)
  warn(msg)
  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


[2023-08-29 13:38:53,880] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)


2023-08-29 13:38:57.811915: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-08-29 13:38:57.859354: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TheBloke/stable-vicuna-13B-HF")
model = AutoModelForCausalLM.from_pretrained("TheBloke/stable-vicuna-13B-HF")

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [3]:
# Access the linear layer (final output layer)
linear_layer = model.lm_head

# Print information about the linear layer
print(linear_layer)

Linear(in_features=5120, out_features=32001, bias=False)


In [4]:
#class CachedMemoryBank(nn.Module):
#    def __init__(self, vocab_size, embedding_dim, memory_dim):
#        super(CachedMemoryBank, self).__init__()

#        self.embedding = nn.Embedding(vocab_size, embedding_dim)
#        self.memory_key = nn.Linear(embedding_dim, memory_dim)
#        self.memory_value = nn.Linear(embedding_dim, memory_dim)

#    def forward(self, input_ids):
#        embedded = self.embedding(input_ids)
#        keys = self.memory_key(embedded)
#        values = self.memory_value(embedded)
#        return keys, values
import os

class CachedMemoryBank(nn.Module):
    def __init__(self, vocab_size, embedding_dim, memory_dim, cache_folder):
        super(CachedMemoryBank, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.memory_key = nn.Linear(embedding_dim, memory_dim)
        self.memory_value = nn.Linear(embedding_dim, memory_dim)
        self.cache_folder = cache_folder

    def forward(self, input_ids):
        keys, values = self._load_from_disk(input_ids)
        return keys, values

    def _get_cache_path(self, input_id):
        return os.path.join(self.cache_folder, f"embedding_{input_id}.pt")

    def _load_from_disk(self, input_ids):
        batch_keys, batch_values = [], []
        for input_id in input_ids.view(-1).tolist():
            cache_path = self._get_cache_path(input_id)
            if os.path.exists(cache_path):
                # Load embeddings from the disk cache if available
                key_value = torch.load(cache_path)
            else:
                # Generate embeddings and save them to the disk cache
                embedded = self.embedding(torch.tensor(input_id).unsqueeze(0))
                key = self.memory_key(embedded)
                value = self.memory_value(embedded)
                key_value = (key, value)
                os.makedirs(self.cache_folder, exist_ok=True)
                torch.save(key_value, cache_path)

            batch_keys.append(key_value[0])
            batch_values.append(key_value[1])

        keys = torch.cat(batch_keys, dim=0)
        values = torch.cat(batch_values, dim=0)
        return keys, values


class ResidualSideNet(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(ResidualSideNet, self).__init__()

        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, input_dim)

    def forward(self, input_tensor):
        residual = input_tensor
        output = self.fc1(input_tensor)
        output = self.relu(output)
        output = self.fc2(output)
        output += residual  # Add residual connection
        return output


class MemoryAugmentation(nn.Module):
    def __init__(self):
        super(MemoryAugmentation, self).__init__()

    def forward(self, keys, values, transformer_outputs):
        memory_attention = torch.matmul(keys, transformer_outputs.transpose(-1, -2))
        memory_attention = nn.functional.softmax(memory_attention, dim=-1)
        memory_augmented = torch.matmul(memory_attention.transpose(-1, -2), values)
        return memory_augmented

class MemoryRetrievalFusion(nn.Module):
    def __init__(self, memory_dim, input_dim):
        super(MemoryRetrievalFusion, self).__init__()

        self.linear_query = nn.Linear(input_dim, memory_dim)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, memory_augmented, transformer_outputs):
        query = self.linear_query(transformer_outputs)
        attention_scores = torch.matmul(query, memory_augmented.transpose(-1, -2))
        attention_weights = self.softmax(attention_scores)
        fused_output = torch.matmul(attention_weights, memory_augmented)
        return fused_output

class BackboneLLM(nn.Module):
    def __init__(self, model_name):
        super(BackboneLLM, self).__init__()
        self.Llama = LlamaForCausalLM.from_pretrained(model_name,
                                            device_map='auto', 
                                            torch_dtype=torch.float16)

    def forward(self, input_ids, attention_mask=None):
        output = self.Llama(input_ids, attention_mask=attention_mask)
        return output.logits

# Rest of your code remains the same


class LongMEM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, memory_dim, backbone_model_name):
        super(LongMEM, self).__init__()

        self.frozen_llm = BackboneLLM(backbone_model_name)
        self.memory_bank = CachedMemoryBank(vocab_size, embedding_dim, memory_dim)
        self.side_net = ResidualSideNet(embedding_dim, hidden_dim)
        self.memory_fusion = MemoryRetrievalFusion(memory_dim, embedding_dim)
        self.memory_augmentation = MemoryAugmentation()
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, input_ids):
        # Frozen LLM

        frozen_last_hidden_state = self.frozen_llm(input_ids)

        # Memory Bank
        keys, values = self.memory_bank(input_ids)

        # Transformer Outputs
        transformer_outputs = frozen_last_hidden_state

        # Perform memory augmentation
        memory_augmented = self.memory_augmentation(keys, values, transformer_outputs)

        # Side Net
        side_net_output = self.side_net(transformer_outputs)

        # Memory Retrieval Fusion
        fused_output = self.memory_fusion(memory_augmented, side_net_output)

        # Final Linear Layer
        logits = self.linear(fused_output)
        return logits

    def generate_text(self, input_ids, max_length):
        #with torch.no_grad():
        output_ids = input_ids.clone()
        for _ in range(max_length):
            logits = self.forward(input_ids)
            predicted_token = torch.argmax(logits[:, -1, :], dim=-1)
            output_ids = torch.cat((output_ids, predicted_token.unsqueeze(1)), dim=1)
            input_ids = output_ids

        return output_ids


True


You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
print(torch.cuda.is_available())
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TheBloke/stable-vicuna-13B-HF")
# Example usage
vocab_size = 50257
embedding_dim = 32001
hidden_dim = 32001
num_layers = 12
memory_dim = 32001
backbone_model_name = "TheBloke/stable-vicuna-13B-HF"  # Change this if you have a different pretrained model
#backbone_config = GPT2Config.from_pretrained(backbone_model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LongMEM(vocab_size, embedding_dim, hidden_dim, num_layers, memory_dim, backbone_model_name).to(device)

# Example input
input_ids = torch.tensor([[1, 2, 3, 4, 5]])

# Forward pass
output_logits = model(input_ids)

print("Output logits shape:", output_logits.shape)

# Generate text
max_length = 20
generated_text = model.generate_text(input_ids, max_length)

print("Generated text:", generated_text)

In [None]:
import os

output_dir = "./model/"
os.makedirs(output_dir, exist_ok=True)
torch.save(model.state_dict(), os.path.join(output_dir, "model_state_dict.pth"))

In [5]:
model.eval()

LLMWithMemory(
  (memory_bank): CachedMemoryBank(
    (embedding): Embedding(70000, 32001)
    (memory_key): Linear(in_features=32001, out_features=32001, bias=True)
    (memory_value): Linear(in_features=32001, out_features=32001, bias=True)
  )
  (transformer): LlamaForCausalLM(
    (model): LlamaModel(
      (embed_tokens): Embedding(32001, 5120, padding_idx=0)
      (layers): ModuleList(
        (0-39): 40 x LlamaDecoderLayer(
          (self_attn): LlamaAttention(
            (q_proj): Linear8bitLt(in_features=5120, out_features=5120, bias=False)
            (k_proj): Linear8bitLt(in_features=5120, out_features=5120, bias=False)
            (v_proj): Linear8bitLt(in_features=5120, out_features=5120, bias=False)
            (o_proj): Linear8bitLt(in_features=5120, out_features=5120, bias=False)
            (rotary_emb): LlamaRotaryEmbedding()
          )
          (mlp): LlamaMLP(
            (gate_proj): Linear8bitLt(in_features=5120, out_features=13824, bias=False)
            (u

In [None]:
# Instantiate the GPT2 tokenizer
tokenizer = LlamaTokenizer.from_pretrained(backbone_model_name,trust_remote_code=True)

# Example text input
text_input= '### Human: What is AI? ### Assitant:'

# Tokenize the text input
input_ids = tokenizer.encode(text_input, return_tensors="pt")

# Create LONGMEM model
#longmem_model = LongMEM(vocab_size, embedding_dim, hidden_dim, num_layers, memory_dim, backbone_model_name, backbone_config)
max_length=20
outputs = model.generate_text(input_ids,max_length)
print(tokenizer.batch_decode(outputs))


In [6]:
import torch
from torch.cuda.amp import autocast, GradScaler

def generate_text_with_gpu_handling(model, input_ids, max_length, batch_size=32, cache_folder="./cached_data"):
    with torch.no_grad():
        #device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        #model = model.to(device)
        
        os.makedirs(cache_folder, exist_ok=True)
        
        def load_cached_data(batch_index):
            filename = os.path.join(cache_folder, f"batch_{batch_index}.pt")
            return torch.load(filename)

        def save_cached_data(batch_index, data):
            filename = os.path.join(cache_folder, f"batch_{batch_index}.pt")
            torch.save(data, filename)

        output_ids = input_ids.clone()
        total_batches = len(input_ids) // batch_size + 1

        for batch_index in range(total_batches):
            start_index = batch_index * batch_size
            end_index = (batch_index + 1) * batch_size
            batch_input_ids = input_ids[start_index:end_index]

            if os.path.exists(os.path.join(cache_folder, f"batch_{batch_index}.pt")):
                cached_data = load_cached_data(batch_index)
                batch_input_ids = cached_data["batch_input_ids"]
                batch_output_ids = cached_data["batch_output_ids"]
            else:
                batch_output_ids = batch_input_ids.clone()
                for _ in range(max_length):
                    with autocast():
                        logits, _, _, _ = model(batch_input_ids)
                        predicted_token = torch.argmax(logits[:, -1, :], dim=-1)
                        batch_input_ids = torch.cat((batch_input_ids, predicted_token.unsqueeze(1)), dim=1)

                save_cached_data(batch_index, {"batch_input_ids": batch_input_ids, "batch_output_ids": batch_output_ids})

            output_ids[start_index:end_index] = batch_output_ids

        return output_ids


In [8]:
import PyPDF2

def pdf_to_text(pdf_path):
    text = ""
    with open(pdf_path, "rb") as file:
        pdf_reader = PyPDF2.PdfFileReader(file)
        num_pages = pdf_reader.numPages

        for page_num in range(num_pages):
            page = pdf_reader.getPage(page_num)
            text += page.extractText()

    return text

#if __name__ == "__main__":
#    pdf_file_path = "demodoc.pdf"
#    extracted_text = pdf_to_text(pdf_file_path)
#    print(len(extracted_text))
#    print(extracted_text)


In [9]:
#modelclass = LLMmodel(modelname)
# Instantiate the tokenizer
tokenizer = LlamaTokenizer.from_pretrained(modelname,trust_remote_code=True)
pdf_file_path = "demodoc.pdf"
extracted_text = pdf_to_text(pdf_file_path)
extracted_text

'Orca: Progressive Learning from Complex\nExplanation Traces of GPT-4\nSubhabrata Mukherjee∗†, Arindam Mitra∗\nGanesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah\nMicrosoft Research\nAbstract\nRecent research has focused on enhancing the capability of smaller models\nthrough imitation learning, drawing on the outputs generated by large\nfoundation models (LFMs). A number of issues impact the quality of these\nmodels, ranging from limited imitation signals from shallow LFM outputs;\nsmall scale homogeneous training data; and most notably a lack of rigorous\nevaluation resulting in overestimating the small model’s capability as they\ntend to learn to imitate the style, but not the reasoning process of LFMs . To\naddress these challenges, we develop Orca, a 13-billion parameter model\nthat learns to imitate the reasoning process of LFMs. Orca learns from\nrich signals from GPT-4 including explanation traces; step-by-step thought\nprocesses; and other complex instructions, guide

In [6]:
len(extracted_text)

141925

In [10]:
# Set the maximum length you want to generate
# Example text input
text_input = extracted_text[1:3000]
# Tokenize the text input
input_ids = tokenizer.encode(text_input, return_tensors="pt")
max_length = 100
output_ids = generate_text_with_gpu_handling(model, input_ids, max_length, batch_size=32)
print("generated_output ",generated_output)

OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 23.69 GiB total capacity; 2.36 GiB already allocated; 10.19 MiB free; 2.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [6]:

# Example text input
text_input = extracted_text[1:3000]
# Tokenize the text input
input_ids = tokenizer.encode(text_input, return_tensors="pt")

output_logits, keys, values, memory_augmented = model(input_ids)

print("Output logits shape:", output_logits.shape)
print("Keys shape:", keys.shape)
print("Values shape:", values.shape)
print("Memory augmented shape:", memory_augmented.shape)

Output logits shape: torch.Size([1, 968, 70000])
Keys shape: torch.Size([968, 32001])
Values shape: torch.Size([968, 32001])
Memory augmented shape: torch.Size([1, 968, 32001])


In [7]:
# Generate text
max_length = 20
#generated_text = model.generate_text(input_ids, max_length)
# Get the predicted token for the next step
predicted_token = torch.argmax(output_logits[:, -1, :], dim=-1)
# Append the predicted token to the output_ids
output_ids = torch.cat((input_ids, predicted_token.unsqueeze(1)), dim=1)

text= output_ids
print("Generated Ids:", output_ids)
print("predicted_token:", predicted_token)
# Verify the token IDs (output_ids)
print("Output IDs:", output_ids)

# Check if the tokenizer is loaded correctly
print("Tokenizer:", tokenizer)

# Verify the tokenizer vocabulary size
print("Vocabulary Size:", tokenizer.vocab_size)
# Decode generated token IDs to text
# ... (Previous code remains the same)

# Decode generated token IDs to text, skipping special tokens and handling out-of-range token IDs
generated_text_tokens = []
for token_id in output_ids.squeeze():
    if token_id < tokenizer.vocab_size:
        print(tokenizer.decode(token_id.item(), skip_special_tokens=True))
        generated_text_tokens.append(tokenizer.decode(token_id.item(), skip_special_tokens=True))
    else:
        # Handle out-of-range token ID
        generated_text_tokens.append("<unk>")
        
# Concatenate the tokens to form the generated text
generated_text = " ".join(generated_text_tokens)

print("Generated text:", generated_text)

#generated_text = tokenizer.decode(output_ids.squeeze(), skip_special_tokens=True)

#print("Generated text:", generated_text)

Generated Ids: tensor([[    1,   364,  1113, 29901, 20018,   573, 29257,   515, 26596,    13,
          1252,  9018,   362,  3201,   778,   310,   402,  7982, 29899, 29946,
            13,  4035,  7308, 29878,   532,   341,  2679,  2276,  1324, 29872,
         31083, 30087, 29892,   826,   513,   314,  4573,   336, 31083,    13,
         29954,   273, 12094,   435,  1450,   801,   279, 29892, 24246,  1175,
          4059,   279, 14625, 29892,  7904,   333,  3793,   574, 29875, 29892,
          9070,  2168,   319, 13829,   284,  8083,    13, 11277, 10550,    13,
          9118,    13,  4789,   296,  5925,   756, 21309,   373,   427,  5403,
          3277,   278,  2117,  3097,   310,  7968,  4733,    13, 20678,   527,
          7018,  6509, 29892, 11580,   373,   278, 14391,  5759,   491,  2919,
            13, 11940,   362,  4733,   313, 29931, 22192, 29879,   467,   319,
          1353,   310,  5626, 10879,   278, 11029,   310,  1438,    13,  9794,
         29892,   364,  9776,   515, 

In [15]:
output_ids[0]

tensor([    1, 33469])

In [None]:
import torch

print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))  # Assuming you have one GPU (change 0 to the appropriate GPU number if needed)
import os

os.environ["TORCH_USE_CUDA_DSA"] = "0"  # Disable device-side assertions

modelclass = LLMmodel(modelname)
# Instantiate the tokenizer
tokenizer = modelclass.tokenizer

# Example text input
text_input = "Once upon a time"

# Tokenize the text input
input_ids = tokenizer.encode(text_input, return_tensors="pt")
input_ids= input_ids[:, :5]
input_ids

print("Output logits shape:", output_logits.shape)
print("Keys shape:", keys.shape)
print("Values shape:", values.shape)
print("Memory augmented shape:", memory_augmented.shape)
# Generate text
max_length = 20
generated_ids = model.generate_text(input_ids, max_length)
print("Generated ids:", generated_id)
# Decode generated token IDs to text
generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)

print("Generated text:", generated_text)