<a href="https://colab.research.google.com/github/TrelisResearch/code-llama-32k/blob/main/Code_Llama_32k.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# *Llama 2 with 32k+ token context length*
---

## Model attribution
[Nous Research on HuggingFace](https://huggingface.co/conceptofmind/Yarn-Llama-2-7b-128k) and [GitHub](https://github.com/jquesnelle/yarn), and [TheBloke](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GPTQ) for quantization.

## PRO Notebooks
- Allows for saving and re-loading of conversations
- Allows for uploading and analysis of documents
- Works on Google Colab or on a Server (e.g. AWS, Azure, RunPod)
- Purchase [here](https://buy.stripe.com/fZe14Q5tP0zpaMUfZi)

## Prepared by Trelis Research
- Find Trelis on [HuggingFace](https://huggingface.co/Trelis), [YouTube](https://www.youtube.com/@TrelisResearch), and [GitHub](https://github.com/TrelisResearch).

## Setup and Installation

Note: to run with a long context length, you'll need a V100 or A100 GPU (the T4 does not support Flash Attention). This means you need a pay-as-you-go or Pro+ plan.

- Save a copy of this notebook: go to File -> Save a copy in Drive (optional, but required if you want to make changes).
- Go to Runtime -> Change runtime type and select a GPU (V100 or A100).
- Then go to Runtime -> Run all.
- Installation takes about 2-5 minutes and happens entirely in the cloud.
- Once all cells have run, you'll find the chat interface at the bottom.
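
Before running all cells, it can help to confirm which GPU the runtime was actually given. The following is a minimal sketch, not part of the original notebook: `describe_gpu` is a hypothetical helper name, and it only reports what PyTorch sees without deciding whether Flash Attention will work on that card.

```python
def describe_gpu():
    """Return a one-line description of the first CUDA device, or say why none is available."""
    try:
        import torch  # imported lazily so the helper also runs where PyTorch is missing
    except ImportError:
        return "PyTorch is not installed in this runtime."

    if not torch.cuda.is_available():
        return "No CUDA GPU detected. Select a GPU under Runtime -> Change runtime type."

    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return f"{name} (compute capability {major}.{minor}, {total_gib:.1f} GiB)"


print(describe_gpu())
```

If the printed device is a T4, switch the runtime type before continuing.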

All of your data remains within your Google Drive and Google's computers.

## Install and Load Model

In [None]:
!pip3 install git+https://github.com/huggingface/transformers.git
!pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!pip3 install git+https://github.com/huggingface/optimum.git
!pip3 install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
!pip3 install flash-attn==2.1.1 --no-build-isolation

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-zgy9ke8t
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-zgy9ke8t
  Resolved https://github.com/huggingface/transformers.git to commit 0c67a72c9ab46996b0dc3175c80c1fee881bcc83
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Looking in indexes: https://pypi.org/simple, https://huggingface.github.io/autogptq-index/whl/cu118/
Collecting git+https://github.com/huggingface/optimum.git
  Cloning https://github.com/huggingface/optimum.git to /tmp/pip-req-build-kfgqyywc
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/optimum.git /tmp/pip-req-build-kfgqyywc
  Resolved https://github.com/huggingface/optimum.git to commit b30

In [None]:
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# model_id = "TheBloke/Yarn-Llama-2-7B-64K-GPTQ"
# model_id = "TheBloke/Yarn-Llama-2-13B-128K-GPTQ"
model_id = "TheBloke/CodeLlama-7B-Instruct-GPTQ"

# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             torch_dtype=torch.float16,
                                             device_map="auto",
                                             revision="main",
                                             trust_remote_code=True,
                                             cache_dir='')

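# Convert to BetterTransformer (provided by optimum) so attention runs through PyTorch's fused kernels
model = model.to_bettertransformer()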

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)


The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.



# Inference

## Streaming

In [None]:
import json

# Read the contents of the 'berkshire23.txt' file
with open('berkshire23.txt', 'r') as f:
    transcript = f.read()

# Tokenize the transcript
tokens = tokenizer.tokenize(transcript)

# Shorten the list of tokens
shortened_tokens = tokens[:16000]

# Join the list of shortened tokens back into a string
shortened_transcript = tokenizer.convert_tokens_to_string(shortened_tokens)

# Safely escape special characters using json.dumps
escaped_transcript = json.dumps(shortened_transcript)

# json.dumps wraps the result in double quotes; strip the first and last character to remove them.
escaped_transcript = escaped_transcript[1:-1]

# Output the length to confirm
print(f"The length of the shortened transcript is {len(shortened_tokens)} tokens.")

The length of the shortened transcript is 16000 tokens.
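
The 16,000-token cut-off above is hard-coded, and the prompt also has to leave room for the system prompt, the question, and the generated reply. A minimal sketch of deriving the cut-off from a context limit instead is shown here. The helper name `transcript_token_budget` is hypothetical, the 16384 limit is taken from the maximum-length warning in this notebook's output, and the 100-token overhead is an assumed placeholder rather than a measured value.

```python
def transcript_token_budget(context_limit, max_new_tokens, prompt_overhead_tokens):
    """Return how many transcript tokens fit once the reply and prompt wrapper are reserved."""
    budget = context_limit - max_new_tokens - prompt_overhead_tokens
    if budget <= 0:
        raise ValueError("No room left for the transcript; reduce max_new_tokens or the prompt.")
    return budget


# Assumed values: 16384-token limit, 500 new tokens, about 100 tokens of
# system prompt, question, and chat markers (placeholder, not measured).
budget = transcript_token_budget(context_limit=16384,
                                 max_new_tokens=500,
                                 prompt_overhead_tokens=100)
print(budget)  # 15784
```

In the cell above, `tokens[:budget]` could then replace `tokens[:16000]`. The overhead can be measured rather than guessed by tokenizing the system prompt and question with the same tokenizer.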


In [None]:
# Prompt markers for a chat/instruct model
from transformers import TextStreamer

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

system_prompt="You are a helpful assistant and an expert on summarisation. You provide accurate summaries of the information provided and don't make things up."

# Define a stream *without* function calling capabilities
def stream(user_prompt):
    prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{escaped_transcript}\n\n{user_prompt.strip()} {E_INST}"

    inputs = tokenizer([prompt], return_tensors="pt").input_ids.cuda()

    # TextStreamer prints the generated text to stdout as tokens arrive.
    streamer = TextStreamer(tokenizer)

    # Restrict scaled dot-product attention to the Flash Attention kernel.
    with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
        # generate() still returns the output ids; they are discarded here because the streamer prints the text.
        # For greedy decoding, use do_sample=False and omit temperature.
        _ = model.generate(inputs=inputs, streamer=streamer, max_new_tokens=500, temperature=0.01, do_sample=True)

stream('Provide a three bullet summary of the above content.')
torch.cuda.empty_cache()

<s> [INST] <<SYS>>
You are a helpful assistant. You are an expert on summarisation.
<</SYS>>

we are here live in Omaha Nebraska good morning everybody I'm Becky quick\nalong with Mike santoli and in just 30 minutes time Berkshire Hathaway chairman and CEO Warren Buffett's going to be\ntaking the stage with his vice chair Charlie Munger the legendary duo will also be joined by berkshire's two other\nVice chairs Greg Abel who manages the non-insurance operations for the company and Ajit Jain who runs all of the\ninsurance businesses and as always it's pretty big crowd here lots and lots of people and a few people you might notice\ntoo Tim Cook is here Apple of course is still berkshire's largest holding big big part of its portfolio there you see\nhim backstage getting ready to go out and take his seat he gets to sit down in the special seats by the way that's\nDebbie pasonic Warren's assistant who's standing by just went bite beside him also in the crowd Bill Murray he has\nbeen here f

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (16384). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


 * The first bullet point is about the Berkshire Hathaway annual meeting, which is taking place today.
* The second bullet point is about the earnings of Berkshire Hathaway, which were released earlier today.
* The third bullet point is about the speeches and Q&A session that will take place during the annual meeting.</s>
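
The prompt assembled inside `stream` places the system prompt between the `<<SYS>>` markers, followed by the document text and the question, all wrapped in `[INST]` ... `[/INST]`. A minimal sketch that isolates this step so the layout can be inspected without loading the model is given below. `build_prompt` is a hypothetical helper name, and the example strings are placeholders.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_prompt(system_prompt, context, user_prompt):
    """Assemble a single-turn prompt using the same layout as stream()."""
    return (
        f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}"
        f"{context}\n\n{user_prompt.strip()} {E_INST}"
    )


prompt = build_prompt(
    "You are a helpful assistant.",   # placeholder system prompt
    "Some document text.",            # placeholder for the escaped transcript
    "Summarise the text above. ",     # trailing whitespace is stripped
)
print(prompt)
```

Printing the prompt before calling `model.generate` makes it easier to spot a missing marker or stray whitespace.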
