This notebook is inspired by this [blog post](https://huggingface.co/blog/llama2).  

The author of this notebook is [Christoph Schnell](mailto:christoph@schnell.de) 

# How to Perform Inference with Llama 2 Models on Lyra
To run Llama 2 on Lyra from Python, you can use the [transformers](https://huggingface.co/docs/transformers/index) library from Hugging Face.

## Apply for access
To run Llama 2 with the transformers library, you will first need to request access to Meta's models (1) and Meta's Hugging Face repositories (2).

Start by requesting access to Meta's models [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). You will *not* need to download the models manually once Meta has granted you access.

Once Meta has granted you access, you will need to request access to Meta's Hugging Face repositories. To do so, visit the repository of the Llama model you want to use (e.g. this [repository](https://huggingface.co/meta-llama/Llama-2-7b-hf)) and fill in the "Access Llama 2 on Hugging Face" form. More information can be found [here](https://huggingface.co/meta-llama).

*Note: Your Hugging Face account email address MUST match the email address you provided on the Meta website, or your request will not be approved.*

## Install dependencies
Once your access to the Hugging Face repositories has been approved, you can install the necessary dependencies. You will need to install the transformers library from Hugging Face via pip and log in to your Hugging Face account. You can do this by running the following commands in your terminal:

```bash
# accelerate is required by transformers when loading with device_map="auto"
pip install transformers accelerate
huggingface-cli login
```
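
Before loading a model, it can help to confirm that the packages are importable in the active environment. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the package list is an assumption: `torch` is imported by the code in this notebook, and `accelerate` is assumed to be needed because the model is loaded with `device_map="auto"`.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumed package list for this notebook (hypothetical helper, not part of transformers)
required = ["transformers", "torch", "accelerate"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```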

## Run inference with the Llama 2 model
Once you have installed the dependencies and logged in to your Hugging Face account, you can run inference with a Llama 2 model. On Lyra you can run all model sizes (7B, 13B, 70B).

*Note that the first load of each model will take a while, as the model is downloaded automatically.*

To load a model, run the following code:

In [None]:
# Import modules and load model
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
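
As a rough guide to why `torch.float16` matters for the larger models, the memory needed for the weights alone can be estimated as the number of parameters times the bytes per parameter. The sketch below is only arithmetic under stated assumptions: the parameter counts are the nominal 7, 13, and 70 billion from the model names, the helper `weight_memory_gib` is hypothetical, and activations and the key-value cache are not counted.

```python
# Nominal parameter counts taken from the model names (assumption: exact counts differ slightly)
NOMINAL_PARAMS = {"7B": 7e9, "13B": 13e9, "70B": 70e9}

# Bytes per parameter for two common dtypes
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params, dtype="float16"):
    """Estimate the memory for model weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for size, n in NOMINAL_PARAMS.items():
    fp16 = weight_memory_gib(n, "float16")
    fp32 = weight_memory_gib(n, "float32")
    print(f"{size}: ~{fp16:.0f} GiB in float16, ~{fp32:.0f} GiB in float32")
```

With `device_map="auto"`, the weights may be spread over several devices, so the estimate applies to the total across them rather than to a single GPU.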

To generate text with the model you can use the pipeline you just created:

In [None]:

# Generate text
sequences = pipeline(
    '[INST]Is an apple a fruit? Start with yes or no and then explain your answer.[/INST]\n\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
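
The prompt above wraps the question in `[INST]` and `[/INST]` markers, which is the instruction format used by the Llama 2 chat models. A small helper can keep prompts consistent. The sketch below follows the single-turn template described in the blog post linked at the top, including the optional `<<SYS>>` system block. The function name `build_prompt` is hypothetical, and the beginning-of-sequence token is left out on the assumption that the tokenizer adds it.

```python
def build_prompt(user_message, system_prompt=None):
    """Format a single-turn prompt for a Llama 2 chat model."""
    user_message = user_message.strip()
    if system_prompt:
        # Optional system block, placed inside the first instruction
        system_block = f"<<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
    else:
        system_block = ""
    return f"[INST] {system_block}{user_message} [/INST]"


prompt = build_prompt(
    "Is an apple a fruit? Start with yes or no and then explain your answer.",
    system_prompt="Answer concisely.",
)
print(prompt)
```

The resulting string can be passed to `pipeline(...)` in place of the hand-written prompt.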

To generate only the next line instead of an entire response, you can pass the id of the newline token as an additional end-of-sequence token, so that generation stops as soon as the model emits a newline:

In [None]:
sequences = pipeline(
    '[INST]Is an apple a fruit? Start with yes or no and then explain your answer.[/INST]\n\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=[tokenizer.eos_token_id, tokenizer.encode("\n", add_special_tokens=False)[-1]],
    max_length=200,
)
for seq in sequences:
    print("------- Result -------")
    print(f"{seq['generated_text']}")
    print("----------------------")
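
By default, the `generated_text` field returned by the text-generation pipeline contains the prompt followed by the model's continuation. To isolate just the generated line, the prompt can be stripped in post-processing. The sketch below is plain string handling on a made-up example output; the helper name `first_generated_line` is hypothetical.

```python
def first_generated_line(generated_text, prompt):
    """Return the first non-empty line the model produced after the prompt."""
    # Drop the echoed prompt if the output starts with it
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Take the first line that contains something other than whitespace
    for line in continuation.splitlines():
        if line.strip():
            return line.strip()
    return ""


prompt = "[INST]Is an apple a fruit?[/INST]\n\n"
# Made-up output for illustration only, not real model output
example_output = prompt + "Yes, an apple is a fruit.\nIt grows on trees."
print(first_generated_line(example_output, prompt))
```

Alternatively, passing `return_full_text=False` to the pipeline call should make it return only the continuation, which removes the need to strip the prompt by hand.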

To get the probability distribution over the next token only, without generating any text, you can run the following code:

*Note that a word may be represented by multiple tokens, so there is not always a one-to-one correspondence between a word and a single token.*

In [None]:
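import torch
from collections import defaultdict

prompt = '[INST]Is an apple a fruit? Start with yes or no and then explain your answer.[/INST]\n\n'

# Tokenize the prompt and move it to the device that holds the model
inputs = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0).to(pipeline.model.device)

# Forward pass without gradient tracking; outputs[0] holds the logits
with torch.no_grad():
    outputs = pipeline.model(inputs)

# Logits at the last position, converted to next-token probabilities
logits = outputs[0][:, -1, :]
probs = torch.softmax(logits.float(), dim=-1)
probs = probs.cpu().numpy()[0]

# Sum the probabilities of token ids that decode to the same string
token_probs = defaultdict(float)
for token, prob in enumerate(probs):
    token_probs[tokenizer.decode(token)] += prob

# Print the ten most likely next tokens
for token, prob in sorted(token_probs.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"{token:<10}{prob:.2%}")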
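
The cell above does two things: it turns the logits at the last position into probabilities with a softmax, and it sums the probabilities of token ids that decode to the same string. The following sketch reproduces both steps in plain Python on a made-up four-token vocabulary so the arithmetic is easy to follow. The vocabulary and logit values are invented for illustration and do not come from the model.

```python
import math
from collections import defaultdict


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented toy vocabulary: ids 0 and 1 decode to the same string
toy_vocab = {0: "Yes", 1: "Yes", 2: "No", 3: "Maybe"}
toy_logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(toy_logits)

# Sum probabilities of ids that share a decoded string
token_probs = defaultdict(float)
for token_id, prob in enumerate(probs):
    token_probs[toy_vocab[token_id]] += prob

for token, prob in sorted(token_probs.items(), key=lambda x: x[1], reverse=True):
    print(f"{token:<10}{prob:.2%}")
```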
