LLaMA: Open and Efficient Foundation Language Models

https://arxiv.org/abs/2302.13971

Introduction

LLaMA is an auto-regressive language model, based on the transformer architecture. The model comes in different sizes: 7B, 13B, 33B and 65B parameters.

This implementation is based on the Hugging Face LLaMA implementation, with a few modifications to the code so that it also produces a decomposition of the output.

Example usage

Setup

Make sure Hugging Face Transformers is installed; if it is not, follow its installation instructions first.
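
A quick way to confirm the setup before loading any checkpoint is to check that the required packages can be found on the import path. The sketch below uses only the standard library; the helper name `is_installed` is ours, chosen for illustration, and is not part of this repository.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


# The usage example in this README imports torch directly and relies on
# Hugging Face Transformers, so both are checked here.
for name in ("torch", "transformers"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

The check only looks the package up, so it stays fast even when the package itself is slow to import.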

To use the model, first prepare the model checkpoints and configuration files. An example is available as a ready-to-use resource.

Decompose LLaMA in generation

The 'attentions' field of the output is reused to store the decomposition results:

import torch
from dec_llama import LlamaForCausalLM, LlamaTokenizer

model_dir = "/path/to/llama"
tokenizer = LlamaTokenizer.from_pretrained(model_dir)
model = LlamaForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,  # recommended to save memory
    device_map="auto",
)

inputs = tokenizer(
    "Instruction:\nAs a language model, tell me about your name.\n\n### Response:\n",
    return_tensors="pt",
).to("cuda")
generate_output = model.generate(
    inputs.input_ids,
    max_length=500,
    decompose=True,
    use_cache=True,
    return_dict_in_generate=True,
    output_attentions=True,  # the 'attentions' field is reused to store the decomposition results
)
generate_results = tokenizer.batch_decode(
    generate_output["sequences"],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print("Generated results:\n{}".format(generate_results))

print("\n\nThe decomposition for each token:")
input_len = inputs.input_ids.size(1)
for step, composition in enumerate(
    generate_output["attentions"]
):  # one entry per generated token, holding that token's decomposition
    print(
        "\n\nGenerated token: '{}'".format(
            tokenizer.decode(generate_output["sequences"][0, input_len + step])
        ),
    )

    pred = composition.recovery.argmax(dim=0)
    print("Decomposition:")
    print("{:<16}| {}".format("Word piece", "Score"))
    for token_idx, score in enumerate(composition[pred].components):
        print(
            "{:<16}| {:.2f}".format(
                ascii(tokenizer.decode([generate_output["sequences"][0, token_idx]])),
                score.item(),
            )
        )

Output:

Generated results:
Instruction:
As a language model, tell me about your name.

### Response:
My name is Vicuna, and I'm a language model developed by Large Model Systems Organization (LMSYS).


The decomposition for each token:


Generated token: 'My'
Decomposition:
Word piece      | Score
'<s>'           | 7.36
'Inst'          | -0.00
'ruction'       | -0.07
':'             | 0.79
'\n'            | 0.24
'As'            | 0.49
'a'             | 0.04
'language'      | 0.26
'model'         | 0.12
','             | 0.03
'tell'          | 0.52
'me'            | 0.46
'about'         | 0.66
'your'          | 0.91
'name'          | 1.47
'.'             | 0.27
'\n'            | 0.02
'\n'            | 0.10
'##'            | 0.50
'#'             | 0.12
'Response'      | 1.21
':'             | -0.06
'\n'            | 0.21

...
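
To read a table like the one above more quickly, it can help to rank the input word pieces by score. The following is a minimal sketch that works on plain (word piece, score) pairs copied from the sample output for the generated token 'My'; it does not call the model. The function name `top_contributors` and the option to skip '<s>' are our own illustrative choices, not part of `dec_llama`.

```python
# Scores copied from the sample output above for the generated token 'My'.
decomposition = [
    ("<s>", 7.36), ("Inst", -0.00), ("ruction", -0.07), (":", 0.79),
    ("\n", 0.24), ("As", 0.49), ("a", 0.04), ("language", 0.26),
    ("model", 0.12), (",", 0.03), ("tell", 0.52), ("me", 0.46),
    ("about", 0.66), ("your", 0.91), ("name", 1.47), (".", 0.27),
    ("\n", 0.02), ("\n", 0.10), ("##", 0.50), ("#", 0.12),
    ("Response", 1.21), (":", -0.06), ("\n", 0.21),
]


def top_contributors(pairs, k=3, skip=("<s>",)):
    """Return the k (position, piece, score) entries with the highest score.

    Pieces listed in `skip` are left out; skipping '<s>' is a presentation
    choice made here, not a recommendation from the source.
    """
    ranked = [
        (pos, piece, score)
        for pos, (piece, score) in enumerate(pairs)
        if piece not in skip
    ]
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked[:k]


for pos, piece, score in top_contributors(decomposition):
    print("{:>3} {:<12} {:.2f}".format(pos, ascii(piece), score))
```

For this sample the three highest-scoring pieces after '<s>' are 'name', 'Response' and 'your'. Passing `skip=()` keeps '<s>' in the ranking, where it comes first with a score of 7.36.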

To reduce memory usage

Here are some suggestions to reduce memory overhead.

8-bit quantization is not currently supported.
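
The usage example loads the model with `torch_dtype=torch.float16` to save memory. As a rough illustration of why this matters, the sketch below estimates the memory needed for the weights alone at each model size named in the introduction. It assumes the nominal, rounded parameter counts (7B, 13B, 33B, 65B) and ignores activations, the generation cache, and any extra state kept for the decomposition, so real usage will be higher. The helper `weight_memory_gib` is hypothetical and written only for this estimate.

```python
# Bytes used to store one parameter in each floating-point format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

# Nominal (rounded) parameter counts taken from the model size names.
NOMINAL_PARAMS = {"7B": 7e9, "13B": 13e9, "33B": 33e9, "65B": 65e9}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate memory for the weights only, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for size, num_params in NOMINAL_PARAMS.items():
    fp32 = weight_memory_gib(num_params, "float32")
    fp16 = weight_memory_gib(num_params, "float16")
    print("{:<4} float32: {:7.1f} GiB   float16: {:7.1f} GiB".format(size, fp32, fp16))
```

Under these assumptions, half precision halves the weight footprint; for the 7B model the estimate drops from about 26 GiB to about 13 GiB.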
