In [1]:
import os
from huggingface_hub import login

from google.colab import userdata  # to fetch secrets in Colab

# Load the Hugging Face token set in Colab Secrets
hf_token = userdata.get("HF_API_TOKEN")
if not hf_token:
    raise EnvironmentError("Hugging Face token not found in Colab Secrets.")

# Log in using the retrieved token (non-interactive and secure)
login(token=hf_token)
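
Outside Colab, `google.colab` is not importable, so the cell above fails at the import. A minimal sketch of a fallback, assuming the same secret name is also exported as an environment variable (the `resolve_hf_token` helper is hypothetical and not part of `huggingface_hub`):

```python
import os
from typing import Optional


def resolve_hf_token(name: str = "HF_API_TOKEN") -> Optional[str]:
    """Return the token from Colab Secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            token = userdata.get(name)
            if token:
                return token
        except Exception:
            # Assumption: a missing or inaccessible secret surfaces as an
            # exception here, in which case we fall through to the environment.
            pass

    # Fall back to an environment variable with the same name.
    return os.environ.get(name) or None
```

The returned value could then be passed to `login(token=...)` in the same way as `hf_token` above.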

**Prompt Reuse** (Caching by Prompt): This is what the code below does. When a prompt is passed to `generate_with_cache`, the function computes a SHA-256 hash of the prompt and checks whether that hash is already in the cache. If it is, the cached response is returned; otherwise a new response is generated and stored. This is "prompt reuse" because the response is stored against the exact input prompt, so only an identical prompt string produces a cache hit.
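
Because the key is a hash of the raw string, even a trailing space produces a different key. The sketch below illustrates this without loading a model; `normalized_cache_key` is a hypothetical variant, not something the notebook uses:

```python
import hashlib


def cache_key(prompt: str) -> str:
    """Return the SHA-256 hex digest used as the cache key for a prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def normalized_cache_key(prompt: str) -> str:
    """Hypothetical variant: collapse runs of whitespace before hashing."""
    return cache_key(" ".join(prompt.split()))


same = cache_key("Once upon a time") == cache_key("Once upon a time")
differs = cache_key("Once upon a time") == cache_key("Once upon a time ")
normalized = (
    normalized_cache_key("Once upon a time")
    == normalized_cache_key("Once  upon a time ")
)
print(same, differs, normalized)  # True False True
```

Whether to normalize is a design choice: whitespace can change what the model generates, so a normalized key trades exactness for a higher hit rate.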

In [3]:
!pip install transformers
!pip install torch

from transformers import pipeline
import hashlib
import torch

# Move model to GPU if available
device = 0 if torch.cuda.is_available() else -1
generator = pipeline("text-generation", model="gpt2", device=device)

# Cache for model outputs (responses to prompts)
prompt_cache = {}

def generate_with_cache(prompt):
    """
    This function checks if the prompt has been used before.
    If it has, return the cached response; otherwise, generate and cache it.
    """
    # Generate a unique hash for the prompt to be used as a cache key
    cache_key = hashlib.sha256(prompt.encode()).hexdigest()

    # Check if the response for this prompt is already cached
    if cache_key in prompt_cache:
        print("Using cached response.")
        return prompt_cache[cache_key]
    else:
        print("Generating new response.")
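        # Generate a response from the model.
        # Note: the logged output below shows the pipeline's default
        # max_new_tokens (=256) taking precedence over max_length=50.
        result = generator(prompt, max_length=50)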
        generated_text = result[0]['generated_text']

        # Cache the result
        prompt_cache[cache_key] = generated_text

        return generated_text

# Test with a new prompt
prompt = "Once upon a time"
response1 = generate_with_cache(prompt)
print(response1)

# Test with the same prompt (should use cached response)
response2 = generate_with_cache(prompt)
print(response2)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generating new response.
Once upon a time, the human mind was at the mercy of the spirit. However, with the power of the Spirit, even the most powerful of beings could not have a complete grasp on reality. Therefore, the ability of a person to comprehend the future was not limited to the physical world. The ability to comprehend the future was also limited to the emotions of the people.

However, this time, there wasn't a single person who could understand what was going on.

The reason why the people of the future were so concerned was the fact that the world was not completely stable. The atmosphere of the human world was now very unstable. The situation of the world felt like it was being thrown into chaos.

The only thing that could bring about a calm was a sense of peace. This peace was very important to the people of the future, but it was only when a very strong feeling of peace was introduced into the people of the future that the people could not understand what was happening.
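
The cache above is keyed on the prompt text alone, so a repeat call with different generation settings would still return the first stored response. One way to account for that, sketched here with a stand-in `fake_generate` function instead of the real pipeline (all names in this block are illustrative assumptions):

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple


def make_cache_key(prompt: str, **gen_kwargs: Any) -> str:
    """Hash the prompt together with the generation parameters."""
    payload = json.dumps({"prompt": prompt, "params": gen_kwargs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_generate(
    generate_fn: Callable[..., str],
    cache: Dict[str, str],
    prompt: str,
    **gen_kwargs: Any,
) -> str:
    """Return the cached response, calling generate_fn only on a cache miss."""
    key = make_cache_key(prompt, **gen_kwargs)
    if key not in cache:
        cache[key] = generate_fn(prompt, **gen_kwargs)
    return cache[key]


calls: List[Tuple[str, Dict[str, Any]]] = []


def fake_generate(prompt: str, **gen_kwargs: Any) -> str:
    """Stand-in for the model: records each call and returns a labeled string."""
    calls.append((prompt, gen_kwargs))
    return f"{prompt} ... ({len(calls)})"


cache: Dict[str, str] = {}
a = cached_generate(fake_generate, cache, "Once upon a time", max_new_tokens=50)
b = cached_generate(fake_generate, cache, "Once upon a time", max_new_tokens=50)
c = cached_generate(fake_generate, cache, "Once upon a time", max_new_tokens=20)
print(len(calls), a == b, a == c)  # 2 True False
```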

**Prefix Caching**, as used in this notebook, means caching generated text (for example, the beginning of an output) so that the model can continue from that point when the same or a similar context comes up again. The `generate_with_cache` function above does not do this: it stores the entire output once generation has finished and returns it unchanged on a repeat prompt.
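
In inference engines, "prefix caching" more commonly refers to reusing computation already done for a shared leading portion of the input, rather than feeding generated text back in. The lookup step behind that idea can be sketched without a model. Whitespace tokens, the `longest_cached_prefix` helper, and the placeholder `state-N` strings are all illustrative assumptions; a real system would cache internal model state instead:

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy cache: maps a tuple of tokens to a precomputed "state".
PrefixCache = Dict[Tuple[str, ...], str]


def longest_cached_prefix(
    tokens: List[str], cache: PrefixCache
) -> Tuple[int, Optional[str]]:
    """Return (length, state) for the longest prefix of `tokens` found in `cache`."""
    for end in range(len(tokens), 0, -1):
        key = tuple(tokens[:end])
        if key in cache:
            return end, cache[key]
    return 0, None


cache: PrefixCache = {
    ("Once", "upon"): "state-2",
    ("Once", "upon", "a", "time"): "state-4",
}

tokens = "Once upon a time there was".split()
matched, state = longest_cached_prefix(tokens, cache)
remaining = tokens[matched:]  # only these tokens would still need processing
print(matched, state, remaining)  # 4 state-4 ['there', 'was']
```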

In [4]:
from transformers import pipeline
import hashlib
import torch

# Move model to GPU if available
device = 0 if torch.cuda.is_available() else -1
generator = pipeline("text-generation", model="gpt2", device=device)

# Cache for prefix
prefix_cache = {}

def generate_with_prefix_cache(prompt):
    """
    Cache the text generated for a prompt. On a repeat prompt, feed the
    cached text back to the model as the prefix for further generation.
    """
    # Generate a unique hash for the prompt to be used as a cache key
    cache_key = hashlib.sha256(prompt.encode()).hexdigest()

    # Check if there's cached output for this prompt prefix
    if cache_key in prefix_cache:
        print("Using cached prefix.")
        # Continue generating from the cached prefix
        prefix = prefix_cache[cache_key]
        result = generator(prefix, max_length=50, num_return_sequences=1)
        generated_text = result[0]['generated_text']
        return generated_text
    else:
        print("Generating new response.")
        # Generate a response from the model (initial generation)
        result = generator(prompt, max_length=50, num_return_sequences=1)
        generated_text = result[0]['generated_text']

        # Cache the full generated text so it can serve as the prefix next time
        prefix_cache[cache_key] = generated_text

        return generated_text

# Test with a new prompt
prompt = "Once upon a time"
response1 = generate_with_prefix_cache(prompt)
print(response1)

# Test with the same prompt (should use cached prefix and continue generation)
response2 = generate_with_prefix_cache(prompt)
print(response2)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Generating new response.



Once upon a time, it seemed that the old man was beginning to fall into a deep, dark depression. His heart was beating fast and his bones were trembling.

But the old man had never known that a man's feelings had changed.

"My heart is on the side of my head. I am in a great, deep sense of sorrow." The old man's face was filled with a deep sense of sorrow.

"My heart is on the side of my head, but I can't say that my feelings are changing. Although it's not that I'm happy, it's rather that I'm feeling so bad that I have to go to an emergency room."

No, not anymore.

The old man's heart was pounding fast and his bones were trembling. His heart was pounding fast and his bones were trembling.

"My heart is on the side of my head, but I can't say that I'm unhappy."

"My heart is on the side of my head, but I can't say that my feelings are changing."

"My heart is on the side of my head, but I can't say that my feelings are changing."

The old man's heart was pounding fast and his bones we