<a href="https://colab.research.google.com/github/TrelisResearch/code-llama-32k/blob/main/Code_Llama_32k.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# *Llama 2 with 32k+ token context length*
---

## Model attribution
[Nous Research on HuggingFace](https://huggingface.co/conceptofmind/Yarn-Llama-2-7b-128k) and [GitHub](https://github.com/jquesnelle/yarn), and [The Bloke](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GPTQ) for quantization.

## PRO Notebooks
- Allows for saving and re-loading of conversations
- Allows for uploading and analysis of documents
- Works on Google Colab or on a Server (e.g. AWS, Azure, RunPod)
- Purchase [here](https://buy.stripe.com/fZe14Q5tP0zpaMUfZi)

## Prepared by Trelis Research
- Find Trelis on [HuggingFace](https://huggingface.co/Trelis) and [YouTube](https://www.youtube.com/@TrelisResearch) and [GitHub](https://github.com/TrelisResearch).

## Setup and Installation

! To run with a long context length, you'll need an A6000 or A100 GPU (the T4 and V100 do not support Flash Attention 2). This means you need a pay-as-you-go or Pro+ plan !

- Save a copy of this notebook: go to File -> Save a copy in Drive (optional, but needed if you want to make changes).
- Go to Runtime -> Change runtime type and select a GPU (A100).
- Then go to Runtime -> Run all.
- Installation takes about 2-5 minutes and happens entirely in the cloud, within this notebook.
- Once all cells have run, you'll find the chat interface at the bottom.

All of your data remains within your Google Drive and Google's computers.
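
The GPU requirement noted above comes down to compute capability: FlashAttention-2 targets Ampere-or-newer GPUs (compute capability 8.0 and up), while the T4 (7.5) and V100 (7.0) fall below that. Below is a minimal sketch of a pre-flight check, assuming PyTorch is installed. The helper names `supports_flash_attention` and `check_current_gpu` are hypothetical and not part of any library.

```python
def supports_flash_attention(major: int, minor: int) -> bool:
    """Return True if this compute capability is Ampere or newer (>= 8.0)."""
    return (major, minor) >= (8, 0)


def check_current_gpu() -> str:
    """Describe whether the active CUDA device meets the requirement."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA GPU detected"
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    ok = supports_flash_attention(major, minor)
    return f"{name} (sm_{major}{minor}): {'OK' if ok else 'too old'} for Flash Attention 2"


if __name__ == "__main__":
    print(check_current_gpu())
```

Running this in the first cell, before the long install and download steps, is a cheap way to catch a wrongly selected runtime.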

## Install and Load Model

In [1]:
!pip3 install -q -U git+https://github.com/huggingface/transformers.git
# !pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip3 install -q -U git+https://github.com/huggingface/optimum.git
!pip3 install -q -U flash-attn
!pip install -q -U bitsandbytes
!pip install -q -U accelerate

In [2]:
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# model_id = "TheBloke/Yarn-Llama-2-7B-64K-GPTQ"
# model_id = "TheBloke/Yarn-Llama-2-13B-128K-GPTQ"
# model_id = "TheBloke/CodeLlama-7B-Instruct-GPTQ"
# model_id = "code-llama/CodeLlama-7B-Instruct"
model_id = "codellama/CodeLlama-13b-Instruct-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # load the weights in 4-bit NF4 via bitsandbytes
    # device_map={"":0},
    device_map='auto',
    torch_dtype=torch.bfloat16,
    # torch_dtype=torch.float16,
    # Works with Llama models and reduces memory requirements.
    # Newer transformers releases spell this attn_implementation="flash_attention_2".
    use_flash_attention_2=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

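
As a rough sanity check on why 4-bit loading helps, weight memory scales with parameter count times bytes per parameter. The sketch below assumes a nominal 13 billion parameters and ignores quantization constants, activations, and the KV cache, so treat the numbers as lower bounds. `weight_memory_gb` is a hypothetical helper, not a library function.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 13e9  # assumption: nominal size of a "13b" model

for label, bits in [("bfloat16", 16), ("4-bit NF4", 4)]:
    print(f"{label:>10}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

At long context lengths the KV cache adds substantially on top of these figures, which is part of why a large-memory GPU is still recommended even with quantized weights.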

# Inference

## Streaming

In [3]:
import json

# Read the contents of the 'berkshire23.txt' file
with open('berkshire23.txt', 'r') as f:
    transcript = f.read()

# Tokenize the transcript
tokens = tokenizer.tokenize(transcript)

# Shorten the list of tokens
shortened_tokens = tokens[:16000]

# Join the list of shortened tokens back into a string
shortened_transcript = tokenizer.convert_tokens_to_string(shortened_tokens)

# Safely escape special characters using json.dumps
escaped_transcript = json.dumps(shortened_transcript)

# Note: json.dumps adds extra quotes at the beginning and end of the string, so strip them.
escaped_transcript = escaped_transcript[1:-1]

# Output the length to confirm
print(f"The length of the shortened transcript is {len(shortened_tokens)} tokens.")

The length of the shortened transcript is 16000 tokens.
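
One side effect of the `json.dumps` step is worth seeing in isolation: it turns real newlines and quotes into backslash escape sequences, so the model receives a literal `\n` rather than a line break. Here is a small standard-library sketch using a made-up sample string:

```python
import json

sample = 'good morning "everybody"\nalong with Mike'  # made-up two-line text

escaped = json.dumps(sample)[1:-1]  # same slicing trick as in the cell above

print(escaped)
print("\n" in sample)   # the original contains a real newline
print("\n" in escaped)  # the escaped version does not
print(json.loads(f'"{escaped}"') == sample)  # the escaping is reversible
```

If literal escapes in the prompt are not wanted, passing `shortened_transcript` straight into the f-string should also work, since f-string interpolation does not reinterpret the inserted text.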


In [4]:
## If a chat/instruct model
from transformers import TextStreamer

# Llama 2 chat prompt delimiters
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

system_prompt = "You are a helpful assistant and an expert on summarisation. You provide accurate summaries of the information provided and don't make things up."

# Define a stream *without* function calling capabilities
def stream(user_prompt):
    prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{escaped_transcript}\n\n{user_prompt.strip()} {E_INST}"

    # Keep input_ids and attention_mask together so generate() does not have to infer the mask.
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    streamer = TextStreamer(tokenizer)

    with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
        # Despite returning the usual output, the streamer will also print the generated text to stdout.
        # For fully deterministic output, use do_sample=False and drop temperature.
        _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500, temperature=0.01, do_sample=True, pad_token_id=tokenizer.eos_token_id)

stream('Provide a three bullet summary of the above content.')
torch.cuda.empty_cache()


<s> [INST] <<SYS>>
You are a helpful assistant and an expert on summarisation. You provide accurate summaries of the information provided and don't make things up.
<</SYS>>

we are here live in Omaha Nebraska good morning everybody I'm Becky quick\nalong with Mike santoli and in just 30 minutes time Berkshire Hathaway chairman and CEO Warren Buffett's going to be\ntaking the stage with his vice chair Charlie Munger the legendary duo will also be joined by berkshire's two other\nVice chairs Greg Abel who manages the non-insurance operations for the company and Ajit Jain who runs all of the\ninsurance businesses and as always it's pretty big crowd here lots and lots of people and a few people you might notice\ntoo Tim Cook is here Apple of course is still berkshire's largest holding big big part of its portfolio there you see\nhim backstage getting ready to go out and take his seat he gets to sit down in the special seats by the way that's\nDebbie pasonic Warren's assistant who's standin

RuntimeError: ignored
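
The prompt layout that `stream` assembles can be pulled out into a small pure function, which makes it easy to inspect without loading the model. This is a sketch of the same single-turn Llama 2 chat layout used above. `build_prompt` and its argument names are hypothetical, and the tokenizer is still expected to add the leading `<s>` token.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_prompt(system_prompt: str, context: str, user_prompt: str) -> str:
    """Assemble a single-turn prompt: system block, then context, then the question."""
    return (
        f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}"
        f"{context}\n\n{user_prompt.strip()} {E_INST}"
    )


prompt = build_prompt(
    "You are a helpful assistant.",
    "we are here live in Omaha Nebraska",
    "Provide a three bullet summary of the above content.",
)
print(prompt)
```

Printing the prompt for a short context like this is a quick way to confirm the delimiters land where expected before sending a 16,000-token transcript through the model.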