# LLaMA2 70B 4-bit Inference

## Setup
- EC2 `g5.12xlarge` instance (4x NVIDIA A10G, 24 GB each, 96 GB GPU memory in total)
- Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04) 20230827
    - `nvcc --version`: 12.1
- EBS: 300G
- Python: 3.10.12
- torch: 2.2.0.dev20230911+cu121
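
As a rough sanity check on why this instance is enough, the sketch below estimates the weight footprint of a 70B-parameter model at a few precisions. The parameter count is rounded to 70e9 (an assumption), and the estimate ignores quantization constants, the KV cache, activations, and CUDA overhead, so treat it as a lower bound rather than a measurement.

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 70e9    # assumption: parameter count rounded to 70B
GPU_MEMORY_GB = 96   # total GPU memory from the setup list above

for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    size = weight_footprint_gb(NUM_PARAMS, bits)
    verdict = "fits" if size < GPU_MEMORY_GB else "does not fit"
    print(f"{label:>5}: ~{size:.0f} GB of weights, {verdict} in {GPU_MEMORY_GB} GB")
```

Under these assumptions the fp16 weights alone would exceed the 96 GB available, while the 4-bit weights leave room for the KV cache and activations.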

## Installation
```
conda create -n 0911a python=3.10
source activate 0911a
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
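# accelerate is required for device_map="auto" and for bitsandbytes 4-bit loading
pip3 install transformers accelerate bitsandbytes
# the Llama 2 weights are gated: log in with an account that has been granted access
huggingface-cli login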
```

## Inference script

In [1]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    BitsAndBytesConfig,
)
import torch



In [3]:
name = "meta-llama/Llama-2-70b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token_id = tokenizer.eos_token_id    # for open-ended generation

Downloading (…)okenizer_config.json: 100%|██████████| 776/776 [00:00<00:00, 5.38MB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 16.3MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 6.84MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 414/414 [00:00<00:00, 3.37MB/s]


In [4]:
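bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,    # dequantize to fp16 for the matmuls
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=bnb_config,
    device_map="auto",    # let accelerate spread the layers across the available GPUs
    trust_remote_code=True,
)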

Downloading (…)lve/main/config.json: 100%|██████████| 614/614 [00:00<00:00, 4.70MB/s]
Downloading (…)fetensors.index.json: 100%|██████████| 66.7k/66.7k [00:00<00:00, 160MB/s]
Downloading (…)of-00015.safetensors: 100%|██████████| 9.85G/9.85G [00:43<00:00, 225MB/s]
Downloading (…)of-00015.safetensors: 100%|██████████| 9.80G/9.80G [00:43<00:00, 224MB/s]
Downloading (…)of-00015.safetensors: 100%|██████████| 9.97G/9.97G [00:43<00:00, 231MB/s]
Downloading (…)of-00015.safetensors: 100%|██████████| 9.80G/9.80G [00:41<00:00, 234MB/s]
Downloading (…)of-00015.safetensors: 100%|██████████| 9.80G/9.80G [00:41<00:00, 235MB/s]
Downloading (…)of-00015.safetensors: 100%|██████████| 9.80G/9.80G [00:44<00:00, 218MB/s]
Downloading (…)of-00015.safetensors: 100%|██████████| 9.97G/9.97G [00:43<00:00, 229MB/s]
Downloading (…)of-00015.safetensors: 100%|██████████| 9.80G/9.80G [01:00<00:00, 162MB/s]
Downloading (…)of-00015.safetensors: 100%|██████████| 9.80G/9.80G [01:14<00:00, 131MB/s]
Downloading (…)of-00015.
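
With `device_map="auto"`, the layers end up spread over the available GPUs. A small helper like the one below can summarize that placement. This is a sketch: `summarize_device_map` is a hypothetical name, the sample dictionary is made up for illustration, and it assumes the loaded model exposes an `hf_device_map` dictionary (module name to device), which is normally set when a model is loaded with a device map.

```python
from collections import Counter


def summarize_device_map(device_map: dict) -> dict:
    """Count how many modules were placed on each device."""
    counts = Counter(str(device) for device in device_map.values())
    return dict(sorted(counts.items()))


# Made-up example; with the real model, pass model.hf_device_map instead.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.layers.2": 1,
    "lm_head": 3,
}
print(summarize_device_map(example_map))
```

For the total size, `model.get_memory_footprint()` reports the footprint of the loaded model in bytes.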

In [5]:
# The model is already loaded and placed across the GPUs, so the pipeline
# only needs the model and tokenizer objects.
generation_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [8]:
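text = "who is jeff bezos?"    # prompt goes here

sequences = generation_pipe(
    text,
    max_length=256,    # cap on prompt + generated tokens combined
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,    # sample instead of decoding greedily
    top_k=10,          # keep only the 10 most likely tokens
    temperature=0.4,   # below 1.0 makes the sampling more conservative
    top_p=0.9,         # nucleus sampling threshold
)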

print(sequences[0]["generated_text"])

who is jeff bezos?
Jeff Bezos is an American technology and retail entrepreneur, and the founder, chairman, and CEO of Amazon, the world's largest online retailer. He is widely recognized as one of the most successful entrepreneurs of our time, and has been named the richest person in the world by Forbes magazine for several years in a row.
Bezos was born in 1964 in Albuquerque, New Mexico, and grew up in Houston, Texas. He graduated from Princeton University in 1986 with a degree in electrical engineering and computer science. After working on Wall Street for several years, he left to start Amazon in 1994, initially operating the company out of his garage.
Under Bezos' leadership, Amazon has grown from a small online bookstore to a global retail giant, selling a wide range of products including electronics, clothing, home goods, and more. The company has also expanded into new areas such as cloud computing, advertising, and media production.
Bezos is known for his focus on customer sa
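
The printed text echoes the prompt and stops mid-word, which is consistent with `max_length=256` capping the prompt and the generated tokens together. Passing `max_new_tokens` instead of `max_length` limits only the newly generated tokens, and `return_full_text=False` asks the pipeline to return just the continuation. Where the full text has already been produced, a helper along these lines can drop the echoed prompt; `strip_prompt` is a hypothetical name, not part of `transformers`.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the continuation when the output starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


prompt = "who is jeff bezos?"
# Shortened stand-in for sequences[0]["generated_text"] from the cell above.
full_text = "who is jeff bezos?\nJeff Bezos is an American technology and retail entrepreneur"
print(strip_prompt(full_text, prompt))
```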