# BetterTransformer

BetterTransformer converts 🤗 Transformers models to use the PyTorch-native fastpath execution, which calls optimized kernels like Flash Attention under the hood.

BetterTransformer also speeds up inference on single and multiple GPUs for text, image, and audio models.

# Quantization

With FP4 quantization, you can expect the model to be up to 8x smaller than its native full-precision (FP32) version.
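
The 8x figure follows from the bit widths alone: 32 bits per weight in full precision against 4 bits after quantization. The sketch below works through the arithmetic for a model with roughly 7 billion parameters. The parameter count is a round-number assumption, and the estimate ignores quantization constants and any layers kept in higher precision, which is why 8x is an upper bound.

```python
# Bits used to store one weight in each format.
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "fp4": 4}

# Assumption: a round figure for a "7B" model, not an exact parameter count.
N_PARAMS = 7e9


def model_size_gb(n_params: float, bits: int) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes), ignoring any overhead."""
    return n_params * bits / 8 / 1e9


sizes = {name: model_size_gb(N_PARAMS, bits) for name, bits in BITS_PER_PARAM.items()}
reduction = sizes["fp32"] / sizes["fp4"]

for name, size in sizes.items():
    print(f"{name}: {size:.1f} GB")
print(f"fp32 -> fp4 reduction: {reduction:.0f}x")
```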

# Resources
* [Requirements for FP4 mixed-precision inference (🤗 Transformers docs)](https://huggingface.co/docs/transformers/perf_infer_gpu_one#requirements-for-fp4-mixedprecision-inference)
* [Efficient Fine-Tuning for Llama-v2-7b on a Single GPU](https://www.youtube.com/live/g68qlo9Izf0)



In [1]:
import re

import pandas as pd
import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

from artefacts import DATASETS_PATH

from huggingface_hub import login
# login()

# https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF



In [2]:
# Import under an alias so this class does not shadow the
# `AutoModelForCausalLM` imported from `transformers`.
from ctransformers import AutoModelForCausalLM as CTAutoModelForCausalLM

MAX_CONTEXT_LENGTH = 4096
# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = CTAutoModelForCausalLM.from_pretrained(
    "/home/clem/models/TheBloke/llama-2-7b-chat.q4_K_M.gguf",
    model_type="llama",
    gpu_layers=30,
    context_length=MAX_CONTEXT_LENGTH,
)

llm.__call__

<bound method LLM.__call__ of <ctransformers.llm.LLM object at 0x7f13f5b4a950>>

In [3]:
def get_prompt(message: str, contexts: list[str],
               system_prompt: str) -> str:
    texts = [f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>> [/INST]</s>\n\n']

    for context in contexts:
        texts.append(f'<s> {context.strip()} </s>\n')

    texts.append(f'[INST] {message.strip()} \nShort answer in english:[/INST]')
    
    return ''.join(texts)

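def prompt_llm(prompt: str, result: list[str]) -> None:
    """Stream the completion for `prompt` and append each token to `result`."""
    response = llm(prompt=prompt,
        stream=True,
        max_new_tokens=512,
        top_p=0.95,
        top_k=50,
        temperature=0.8)
    
    for r in response:
        result.append(r)

result = []
generate_kwargs = dict(
    # No extra contexts and no system prompt for this smoke test.
    prompt=get_prompt("What is a watermelon?", [], ""),
    result=result
)

from threading import Thread
# Run generation in a background thread.
t = Thread(target=prompt_llm, kwargs=generate_kwargs)
t.start()

# Wait for generation to finish, then replay the tokens one at a time.
t.join()
outputs = []
for text in result:
    outputs.append(text)
    print(list(outputs))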


[' ']
[' ', ' A']
[' ', ' A', ' wat']
[' ', ' A', ' wat', 'erm']
[' ', ' A', ' wat', 'erm', 'el']
[' ', ' A', ' wat', 'erm', 'el', 'on']
[' ', ' A', ' wat', 'erm', 'el', 'on', ' is']
[' ', ' A', ' wat', 'erm', 'el', 'on', ' is', ' a']
[' ', ' A', ' wat', 'erm', 'el', 'on', ' is', ' a', ' type']
[' ', ' A', ' wat', 'erm', 'el', 'on', ' is', ' a', ' type', ' of']
[' ', ' A', ' wat', 'erm', 'el', 'on', ' is', ' a', ' type', ' of', ' fruit']
[' ', ' A', ' wat', 'erm', 'el', 'on', ' is', ' a', ' type', ' of', ' fruit', ' that']
[' ', ' A', ' wat', 'erm', 'el', 'on', ' is', ' a', ' type', ' of', ' fruit', ' that', ' is']
[' ', ' A', ' wat', 'erm', 'el', 'on', ' is', ' a', ' type', ' of', ' fruit', ' that', ' is', ' typically']
[' ', ' A', ' wat', 'erm', 'el', 'on', ' is', ' a', ' type', ' of', ' fruit', ' that', ' is', ' typically', ' round']
[' ', ' A', ' wat', 'erm', 'el', 'on', ' is', ' a', ' type', ' of', ' fruit', ' that', ' is', ' typically', ' round', ' or']
[' ', ' A', ' wat', 'erm',
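
The cell above joins the thread before reading `result`, so the tokens are only replayed once generation has finished. A minimal sketch of consuming tokens while they are still being produced is shown below. It uses a `queue.Queue` between the producer thread and the consumer, and `fake_token_stream` is a hypothetical stand-in for `llm(prompt, stream=True)` so the example runs without a model.

```python
import queue
import threading
from typing import Iterator


def fake_token_stream() -> Iterator[str]:
    """Hypothetical stand-in for `llm(prompt, stream=True)`: yields fixed tokens."""
    yield from [" A", " wat", "erm", "el", "on", " is", " a", " fruit", "."]


def produce(tokens: Iterator[str], out: queue.Queue) -> None:
    """Push each token onto the queue, then a None sentinel to mark the end."""
    for token in tokens:
        out.put(token)
    out.put(None)


def stream_text(tokens: Iterator[str]) -> Iterator[str]:
    """Yield the text generated so far every time a new token arrives."""
    out: queue.Queue = queue.Queue()
    threading.Thread(target=produce, args=(tokens, out), daemon=True).start()
    text = ""
    while (token := out.get()) is not None:
        text += token
        yield text


partials = list(stream_text(fake_token_stream()))
print(partials[-1])
```

Swapping `fake_token_stream()` for the real streaming call should give the same behaviour, since the consumer only relies on iterating over string tokens.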

# [Llama-2-7b-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ)

* Bits: The bit size of the quantised model.
* GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
* Act Order: True or False. Also known as desc_act. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
* Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
* GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
* Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
* ExLlama Compatibility: Whether this file can be loaded with ExLlama, which currently only supports Llama models in 4-bit.
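
The group-size trade-off can be illustrated with a toy example. The sketch below is not GPTQ itself: it uses plain round-to-nearest quantization with one scale per group, and it assumes each scale is stored in 16 bits with no zero points. It only shows the general effect that smaller groups track the weights more closely but store more scales.

```python
def quantize_groupwise(weights: list[float], group_size: int, bits: int = 4) -> list[float]:
    """Round-to-nearest symmetric quantization with one scale per group.

    Returns the dequantized weights so the error can be measured directly.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric quantization
    dequantized = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Fall back to a scale of 1.0 if the whole group is zero.
        scale = max(abs(w) for w in group) / qmax or 1.0
        dequantized.extend(round(w / scale) * scale for w in group)
    return dequantized


def mean_abs_error(original: list[float], approx: list[float]) -> float:
    """Average absolute difference between two equally long lists."""
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


def bits_per_weight(group_size: int, bits: int = 4, scale_bits: int = 16) -> float:
    """Storage cost per weight, assuming one `scale_bits` scale per group."""
    return bits + scale_bits / group_size


# Toy weights: four small values followed by four large ones.
weights = [0.1, -0.2, 0.15, 0.05, 4.0, -3.0, 2.0, 1.0]

err_gs4 = mean_abs_error(weights, quantize_groupwise(weights, group_size=4))
err_gs8 = mean_abs_error(weights, quantize_groupwise(weights, group_size=8))

print(f"group size 4: error {err_gs4:.3f}, {bits_per_weight(4):.2f} bits/weight")
print(f"group size 8: error {err_gs8:.3f}, {bits_per_weight(8):.2f} bits/weight")
```

With a single group of 8, the large weights set the scale and the small weights all round to zero. Splitting into two groups of 4 gives the small weights their own scale, at the cost of storing a second scale.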

In [2]:
# `BitsAndBytesConfig` wraps all the attributes and features you can use
# with a model that has been loaded using `bitsandbytes`.

# It currently supports only these quantization methods:
# * `LLM.int8()`
# * `FP4`
# * `NF4`
# If more methods are added to `bitsandbytes`, then more arguments will be added to this class.

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, # enable 4-bit quantization by replacing the Linear layers with FP4/NF4 layers from `bitsandbytes`
    bnb_4bit_quant_type='nf4', # sets the quantization data type in the bnb.nn.Linear4Bit layers, options are `fp4` or `nf4` data types
    bnb_4bit_use_double_quant=True, # used for nested quantization where the quantization constants from the first quantization are quantized again
    bnb_4bit_compute_dtype=torch.float16 # sets the computation dtype, which may differ from the input type. For example, inputs might be fp32, but computation can be set to bf16 for speedups.
)
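
To see what nested (double) quantization saves, the sketch below reproduces the per-parameter overhead arithmetic. The block sizes and bit widths are assumptions taken from the figures reported in the QLoRA paper (weight blocks of 64, 32-bit constants, second-level blocks of 256 with 8-bit quantized constants). The values used internally by a given `bitsandbytes` version may differ.

```python
def constant_overhead_bits(block_size: int = 64, constant_bits: int = 32) -> float:
    """Extra bits per weight when each block stores one full-precision constant."""
    return constant_bits / block_size


def double_quant_overhead_bits(
    block_size: int = 64,
    quantized_constant_bits: int = 8,
    second_block_size: int = 256,
    second_constant_bits: int = 32,
) -> float:
    """Extra bits per weight when the block constants are themselves quantized."""
    first_level = quantized_constant_bits / block_size
    second_level = second_constant_bits / (block_size * second_block_size)
    return first_level + second_level


plain = constant_overhead_bits()
nested = double_quant_overhead_bits()

# Assumption: a round 7e9 parameters for a "7B" model.
saved_gb = (plain - nested) * 7e9 / 8 / 1e9

print(f"without double quantization: {plain:.3f} bits/weight")
print(f"with double quantization:    {nested:.3f} bits/weight")
print(f"approximate saving for 7e9 weights: {saved_gb:.2f} GB")
```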

In [3]:
model_name = "TheBloke/Llama-2-7b-Chat-GPTQ"

# Use a [fast Rust-based tokenizer](https://huggingface.co/docs/tokenizers/index) if it is supported for
# a given model. If a fast tokenizer is not available for a given model, a normal Python-based tokenizer
# is returned instead.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(model_name, revision="main", trust_remote_code=False, device_map="auto")

# Prompt definition

In [4]:
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\
"""

PROMPT_TEMPLATE="[INST] <<SYS>>\n{system_prompt}\n<</SYS>> [/INST]\n\n{context}\n\n[INST] {user_prompt}\nPlease make a list in english:[/INST]"

entity_prompts = {
    "defendants":"What are the defendants of this case?",
    "plaintiffs":"What are the plaintiffs of this case?",
    "jurisdictions":"What are the jurisdictions at stake in this case?"
}

def get_prompt(context, user_prompt="", system_prompt=DEFAULT_SYSTEM_PROMPT):
    return PROMPT_TEMPLATE.format(**{
        "system_prompt": system_prompt,
        "context": context,
        "user_prompt": user_prompt,
    })


MAX_CONTEXT_LENGTH = 3500 # 4096

# Load datasets

In [5]:
file_path = DATASETS_PATH / "global_climate_change_litigation.csv"
new_file_path = DATASETS_PATH / "saved_global_climate_change_litigation.csv"

df = pd.read_csv(file_path)
df.columns

Index(['Title', 'ID', 'Case Permalink', 'Case Categories', 'Jurisdictions',
       'Principal Laws', 'Summary', 'Reporter Info or Case Number',
       'Filing Year', 'Status', 'Core Object'],
      dtype='object')

# Prompt for entities

In [8]:
len(case_summary)

3046

In [6]:
empty_prompt = get_prompt(context = "", user_prompt="", system_prompt="")
df["Clean_Summary"] = df["Summary"].fillna("")

saved_df = pd.read_csv(new_file_path)

for new_column in entity_prompts.keys():
    df[new_column] = [None] * len(df)

for i, row in tqdm(df[152:].iterrows()):
    for entity, question in entity_prompts.items():
        
        # Release cached GPU memory between generations.
        torch.cuda.empty_cache()

        case_summary = row["Clean_Summary"]
        # Truncate by characters, a rough stand-in for the model's token limit.
        truncated_summary = case_summary[:MAX_CONTEXT_LENGTH - len(empty_prompt) - len(question)]

        prompt = get_prompt(context = truncated_summary, user_prompt=question, system_prompt="")

        input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
        output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
        llm_response = tokenizer.decode(output[0], skip_special_tokens=True)

        # Keep only the text generated after the last instruction tag.
        answer = llm_response.split("[/INST]")[-1]

        # Extract numbered ("1. ...") or bulleted ("* ...") list items.
        llm_list = [match.group("item") for match in re.finditer(r"(\d{1,2}\.|\*)\s(?P<item>[^\n]+)", answer)]
        if not llm_list:
            # No list found: fall back to the raw answer.
            llm_list = [answer]
     
        saved_df.loc[i, entity] = "<br>".join(llm_list)
    
    # Save after every case so progress survives a crash.
    saved_df[["Title", "Case Permalink"] + list(entity_prompts.keys())].to_csv(new_file_path)
    
    

0it [00:00, ?it/s]

0it [00:23, ?it/s]


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.78 GiB total capacity; 4.95 GiB already allocated; 51.50 MiB free; 5.35 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
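
One way to reason about the error above is to estimate how much memory the key/value cache needs on top of the model weights. The sketch below is a back-of-the-envelope calculation, assuming Llama-2-7B's published architecture (32 layers, hidden size 4096) and a cache stored in fp16. It ignores activations and allocator overhead, so real usage will be higher.

```python
# Assumed Llama-2-7B architecture values, taken from the published model config.
N_LAYERS = 32
HIDDEN_SIZE = 4096
BYTES_PER_VALUE = 2  # assumption: fp16 cache entries

MIB = 1024 ** 2


def kv_cache_bytes(n_tokens: int) -> int:
    """Bytes needed to cache keys and values for `n_tokens` tokens."""
    # Factor 2 accounts for storing both a key and a value per layer.
    return 2 * N_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE * n_tokens


for n_tokens in (512, 2048, 4096):
    print(f"{n_tokens:>5} tokens: {kv_cache_bytes(n_tokens) / MIB:,.0f} MiB")
```

Under these assumptions the cache costs 0.5 MiB per token, so a long prompt plus 512 new tokens can take a sizeable share of a GPU with about 5.78 GiB in total once the quantized weights are loaded. Shortening the prompt or lowering `max_new_tokens` reduces this term.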

```python
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
llm_response = tokenizer.decode(output[0], skip_special_tokens=True)
case_keywords = re.findall(r"\d{1,2}\.{1}\s{1}([^\n]+)", llm_response)
```
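
The list-parsing step can be exercised on its own, without a GPU or a model. The sketch below wraps the same pattern in a helper, `parse_llm_list`, which is a hypothetical name introduced here. The sample responses and the names in them are made up for illustration.

```python
import re

# Matches numbered ("1. item") or bulleted ("* item") list entries.
LIST_ITEM = re.compile(r"(\d{1,2}\.|\*)\s(?P<item>[^\n]+)")


def parse_llm_list(llm_response: str) -> list[str]:
    """Return the list items found after the last [/INST] tag.

    Falls back to the whole answer when no list item is found.
    """
    answer = llm_response.split("[/INST]")[-1].strip()
    items = [match.group("item").strip() for match in LIST_ITEM.finditer(answer)]
    return items or [answer]


# Made-up responses, for illustration only.
with_list = (
    "[INST] What are the defendants of this case?\n"
    "Please make a list in english:[/INST] Sure:\n"
    "1. Acme Corp\n"
    "2. Exampleland Ministry of Energy\n"
    "* Jane Doe"
)
without_list = "[INST] What are the defendants of this case?[/INST] No defendants are named."

print(parse_llm_list(with_list))
print(parse_llm_list(without_list))
```

The pattern is permissive: text such as "2015. The court" inside a sentence also matches, because "15." looks like a list number. Anchoring the pattern to the start of a line would be one way to tighten it.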

In [23]:
inputs = tokenizer(prompt_template, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs) # , max_new_tokens=40
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [9]:
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
Tell me about AI[/INST]

Hello! I'm here to help you with any questions you may have about AI. AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, and decision-making. AI technology has been rapidly advancing in recent years and has the potential to revolutionize many industries and aspects of our lives.
There are several types