<a href="https://colab.research.google.com/github/GhostDecoder/llama-2/blob/main/Llama_2_7b_chat_hf_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation
Install necessary dependencies

In [1]:
!pip install -qU transformers accelerate einops langchain xformers bitsandbytes

# Initializing the Hugging Face pipeline

In [2]:
from torch import cuda, bfloat16
import transformers


model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the 'bitsandbytes' library

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

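# begin initializing HF items; the gated Llama 2 weights need an access token
# read the token from the HF_TOKEN environment variable instead of hardcoding it
import os
hf_auth = os.environ['HF_TOKEN']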

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=hf_auth  # `use_auth_token` is deprecated in favor of `token`
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    token=hf_auth
)

model.eval()
print(f"Model loaded on {device}")

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Model loaded on cuda:0
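
To get a feel for why the 4-bit quantization config matters, here is a rough back-of-the-envelope estimate of weight memory. It is only a sketch: it assumes the two downloaded shards (9.98 GB and 3.50 GB in the log above) hold 16-bit weights, treats 1 GB as 10^9 bytes, and ignores quantization constants, layers that stay unquantized, activations, and the KV cache. The function name `weight_memory_gb` is made up for this illustration.

```python
# Rough weight-memory estimate; all figures are approximations.
FP16_CHECKPOINT_GB = 9.98 + 3.50   # shard sizes reported by the download log
BYTES_PER_PARAM_FP16 = 2           # assumption: the checkpoint stores 16-bit weights

# Implied parameter count, in billions
approx_params_billion = FP16_CHECKPOINT_GB / BYTES_PER_PARAM_FP16


def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone at a given precision."""
    return params_billion * bits_per_param / 8


fp16_gb = weight_memory_gb(approx_params_billion, 16)
nf4_gb = weight_memory_gb(approx_params_billion, 4)

print(f"~{approx_params_billion:.2f}B parameters")
print(f"16-bit weights: ~{fp16_gb:.2f} GB")
print(f"4-bit weights:  ~{nf4_gb:.2f} GB (before quantization overhead)")
```

Under these assumptions the weights shrink to roughly a quarter of their 16-bit size, which is what lets the model fit on a smaller GPU.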


The pipeline requires a tokenizer, which translates human-readable plaintext into the token IDs the LLM reads. The Llama 2 7B model was trained with the Llama 2 tokenizer, which we initialize like so:

In [3]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    token=hf_auth  # `use_auth_token` is deprecated in favor of `token`
)



tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Finally, we need to define the stopping criteria for the model. Stopping criteria let us specify when the model should stop generating text. Without them, the model tends to go off on a tangent after answering the initial question.

In [4]:
{
    "action": "Calculator",
    "action_input": "2+2"
}

{'action': 'Calculator', 'action_input': '2+2'}

In [5]:
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

[[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]

We need to convert these into `LongTensor` objects:

In [6]:
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]

We can do a quick spot check that no `<unk>` token IDs (`0`) appear in `stop_token_ids`. There are none, so we can move on to building the stopping criteria object, which checks whether the stopping criteria have been satisfied, meaning whether any of these token ID combinations has been generated.
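
The spot check can also be written out explicitly. The sketch below uses plain Python lists copied from the tokenizer output shown earlier rather than the `LongTensor` objects, so it runs without a GPU; `UNK_ID = 0` follows the `<unk>` ID mentioned above.

```python
# Token IDs copied from the tokenizer output shown earlier
stop_token_ids = [
    [1, 29871, 13, 29950, 7889, 29901],
    [1, 29871, 13, 28956, 13],
]

UNK_ID = 0  # the <unk> token ID referred to in the text

# True if any stop sequence contains an unknown token
has_unk = any(UNK_ID in ids for ids in stop_token_ids)
print(f"Contains <unk>: {has_unk}")
```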

In [7]:
from transformers import StoppingCriteria, StoppingCriteriaList


# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
  def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
    for stop_ids in stop_token_ids:
      if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
        return True
    return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
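
The comparison inside `StopOnTokens` checks whether the tail of the generated IDs equals a stop sequence. The following plain-list sketch illustrates that comparison without `torch`. It also shows one thing worth verifying: the stop sequences above begin with `1` and `29871`, which this sketch assumes are the beginning-of-sequence token and SentencePiece's leading-space token. If the model emits the stop string mid-generation without those two IDs, the raw sequence would never match. The `generated` list and the helper names are hypothetical.

```python
BOS_ID = 1               # assumption: Llama's beginning-of-sequence token
PREFIX_SPACE_ID = 29871  # assumption: SentencePiece's leading-space token


def ends_with(generated, stop_ids):
    """Return True when the last len(stop_ids) generated IDs equal stop_ids."""
    n = len(stop_ids)
    return len(generated) >= n and generated[-n:] == stop_ids


def strip_prefix_tokens(stop_ids):
    """Drop a leading BOS token and leading-space token, if present."""
    ids = list(stop_ids)
    for special in (BOS_ID, PREFIX_SPACE_ID):
        if ids and ids[0] == special:
            ids = ids[1:]
    return ids


stop_ids = [1, 29871, 13, 29950, 7889, 29901]          # '\nHuman:' as tokenized above
generated = [1, 15043, 3186, 13, 29950, 7889, 29901]   # hypothetical generated sequence

matches_raw = ends_with(generated, stop_ids)
matches_stripped = ends_with(generated, strip_prefix_tokens(stop_ids))
print(matches_raw, matches_stripped)
```

If real generations behave like the hypothetical one here, tokenizing the stop strings without special tokens, or trimming them as above, may be needed before the criteria can trigger.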

Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [10]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True, # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    # stopping_criteria=stopping_criteria, # without this the model rambles during chat
    temperature=0.1, # 'randomness' of outputs; must be > 0, lower is more deterministic
    max_new_tokens=512, # max number of tokens to generate in the output
    repetition_penalty=1.1 # without this output begins repeating
)
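
To make the `temperature` comment concrete, here is a minimal sketch of how temperature rescales a next-token distribution: logits are divided by the temperature before the softmax, so a low value such as `0.1` concentrates almost all probability on the top token. The logits are made-up numbers and `softmax_with_temperature` is an illustrative helper, not a `transformers` function. Note also that in `transformers`, temperature only has an effect when sampling is enabled.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits

probs_t1 = softmax_with_temperature(logits, 1.0)
probs_t01 = softmax_with_temperature(logits, 0.1)

print("temperature=1.0:", [round(p, 4) for p in probs_t1])
print("temperature=0.1:", [round(p, 4) for p in probs_t01])
```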

Confirm this is working:

In [11]:
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])

Explain to me the difference between nuclear fission and fusion. Unterscheidung between nuclear fission and fusion: Nuclear fission is a process in which an atomic nucleus splits into two or more smaller nuclei, releasing energy in the process. Nuclear fusion, on the other hand, is the process by which two or more atomic nuclei combine to form a single, heavier nucleus.
Nuclear fission occurs when an atom's nucleus is split into two or more smaller nuclei, releasing energy in the process. This process typically involves the use of neutron-induced fission, where a neutron collides with an atomic nucleus and causes it to split. The most commonly used fissile materials are uranium-235 (U-235) and plutonium-239 (Pu-239). Fusion, on the other hand, is the process by which two or more atomic nuclei combine to form a single, heavier nucleus. This process releases energy in the form of light and heat, and is the same process that powers the sun and other stars. The most commonly discussed fusi