# Interacting with Llama 3
This notebook gives an example of how to interact with a large language model (Llama 3) from a Python script.

To use Llama, we need two main libraries: PyTorch and HuggingFace Transformers.

## PyTorch
`!pip install torch`
- a deep learning framework originally developed by Facebook AI Research (FAIR)
- flexible tools for building and training neural networks
- GPU acceleration
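
Whether GPU acceleration is actually available depends on the runtime (on Google Colab, a GPU has to be selected under Runtime -> Change runtime type). A minimal sketch for checking this, assuming `torch` may or may not be installed yet; the helper name `describe_device` is made up for this illustration:

```python
def describe_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Name of the first visible CUDA device
        return "cuda: " + torch.cuda.get_device_name(0)
    return "cpu"

print(describe_device())
```

If this prints `cpu`, the model setup later in the notebook, which targets `"cuda"`, is unlikely to work as written.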


## HuggingFace Transformers
`!pip install transformers`
- works with PyTorch (TensorFlow and JAX are also supported)
- provides state-of-the-art pretrained models for natural language processing (NLP) and other tasks


**Note:** For Google Colab, every time you open the notebook, you need to **reinstall** all the libraries.

In [1]:
!pip install torch
!pip install transformers accelerate huggingface-hub
!pip install bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1


To use Llama, we need to:
1. Register a Hugging Face account.
2. Request access at https://huggingface.co/meta-llama/Meta-Llama-3-8B. It takes around 30 minutes for access to be granted.
3. Generate a Hugging Face access token: Profile -> Settings -> Access Tokens.
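
Once the token exists, it is usually safer to keep it out of the notebook itself. The sketch below reads it from an environment variable instead; `get_hf_token` is a hypothetical helper, and `HF_TOKEN` is assumed as the variable name (it is the name `huggingface_hub` itself looks for):

```python
import os

def get_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the access token stored in `var_name`, or raise if it is missing."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token

# Demonstration with a stand-in mapping instead of the real environment
print(get_hf_token(env={"HF_TOKEN": "hf_example_token"}))
```

The value returned by `get_hf_token()` could then be passed to `login(token=...)` in place of a hard-coded string.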



In [7]:
# import libraries

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from huggingface_hub import login

login(token='your-hugging-face-access-token')

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Model Setup

We use `meta-llama/Meta-Llama-3-8B-Instruct` in this example.

In [3]:
# Model Setup
def llama_llm():
    torch.cuda.empty_cache()

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    device = "cuda"  # run on the GPU; bitsandbytes 4-bit quantization generally needs a CUDA device
    dtype = torch.bfloat16
    # Quantize the weights to 4 bits while loading, which reduces the memory and compute the model needs.
    # llm_int8_threshold is an LLM.int8() (8-bit) setting and likely has no effect with load_in_4bit.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True, llm_int8_threshold=200.0
    )

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=dtype,
        device_map=device,
        quantization_config=quantization_config,
    )

    return tokenizer, model
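
To see why quantization matters for a model of this size, a rough back-of-the-envelope estimate of the weight memory helps. The sketch below assumes roughly 8 billion parameters and ignores activations, the KV cache, and quantization overhead, so the numbers are approximate; `weight_memory_gb` is a hypothetical helper:

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 8e9  # assumption: about 8 billion parameters
for label, bits in [("bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(n_params, bits):.0f} GB")
```

Under these assumptions the weights need about 16 GB in bfloat16 but only about 4 GB in 4-bit, which is why 4-bit loading can fit on a smaller GPU.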

In [4]:
# Running the LLM
def extract(message):
    """Send a chat to the model and return its response.
    Parameters:
        message: a list of chat messages (dicts with "role" and "content" keys)
    Returns: the LLM response as a string
    """
    tokenizer, model = llama_llm()

    input_ids = tokenizer.apply_chat_template(
        message, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)  # format the chat messages into model input token IDs
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]  # token IDs tell the model when to stop generating text
    outputs = model.generate(
        input_ids,
        max_new_tokens=1024,
        eos_token_id=terminators,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        temperature=0.1,
        # top_p=0.15,
    )  # generate text
    response = outputs[0][input_ids.shape[-1] :]  # extract the generated response (i.e. remove input text)
    return tokenizer.decode(response, skip_special_tokens=True)  # decode the response tokens into text
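
Note that `extract` calls `llama_llm()` on every invocation, so the model is loaded again for each prompt. One possible way to avoid this is to cache the loader's result. The sketch below shows the idea with a stand-in loader; `load_model_stub` and `load_count` are hypothetical and only count calls, they are not part of the notebook:

```python
from functools import lru_cache

load_count = 0

@lru_cache(maxsize=1)
def load_model_stub():
    """Stand-in for an expensive loader such as llama_llm()."""
    global load_count
    load_count += 1
    return ("tokenizer", "model")

first = load_model_stub()
second = load_model_stub()
print(load_count)       # the loader body ran only once
print(first is second)  # both calls return the same cached object
```

In the notebook, the same decorator could be applied to `llama_llm`, assuming a single model configuration is used throughout.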


Given a movie title, we would like to know:
1. What year was the movie made?
2. Who directed the movie?
3. Who is the main actor in the movie?
4. What is the general rating of the movie?

The answer has to be a **single JSON string**.

In [5]:
#Packing the prompts
def main(movie):
    systemprompt = {
        "role": "system",
        "content": """You are to provide some information about some movies. You must answer with a STRICT single JSON string.
        """,
    }
    userprompt = {
        "role": "user",
        "content": """

        1. What year is this movie made
        2. Who directed the movie
        3. Who is the main actor in the movie
        4. What is the general rating of this movie
        """,
    }

    # One-shot example turn; the strings are quoted so that the example itself is valid JSON
    exampleprompt = {
        "role": "assistant",
        "content": """This is just an example output in JSON format. Please provide your own answers:
        { "Movie Title": "Troy",
          "Year": 2004,
          "Actor": "Brad Pitt",
          "Rating": 10}
        """,
    }

    message = [
        systemprompt,
        userprompt,
        exampleprompt,
        {
            "role": "user",
            "content": f"""
                This is a movie.
                "{movie}"   """,
        },
    ]
    extracted = extract(message)
    print(extracted)
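
The message list built in `main` is turned into a single prompt by `tokenizer.apply_chat_template` inside `extract`. As a rough illustration of what that prompt looks like, the sketch below hand-writes an approximation of the Llama 3 chat layout. The authoritative template ships with the tokenizer, so `render_llama3_prompt` is a hypothetical helper for reading purposes only and should not replace the real call:

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 chat layout as plain text (illustration only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply next
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

demo = [
    {"role": "system", "content": "Answer with a single JSON string."},
    {"role": "user", "content": 'This is a movie. "Troy"'},
]
print(render_llama3_prompt(demo))
```

Each turn ends with `<|eot_id|>`, which is why `extract` lists that token among its terminators.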


## Test with movie

In [8]:
movie = "Avengers: Endgame"
main(movie)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


{"Movie Title": "Avengers: Endgame", "Year": 2019, "Director": "Anthony Russo, Joe Russo", "Main Actor": "Robert Downey Jr.", "Rating": 8.5}
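
Since the answer is requested as a single JSON string, it can be checked and used programmatically. A minimal sketch, assuming the reply contains one JSON object and uses the key names seen in the output above; `parse_movie_json` is a hypothetical helper, and the model may choose different keys on another run:

```python
import json

def parse_movie_json(text, required=("Year", "Director", "Main Actor", "Rating")):
    """Parse the text between the first '{' and the last '}' and check the expected keys."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the response")
    data = json.loads(text[start : end + 1])
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

response = '{"Movie Title": "Avengers: Endgame", "Year": 2019, "Director": "Anthony Russo, Joe Russo", "Main Actor": "Robert Downey Jr.", "Rating": 8.5}'
info = parse_movie_json(response)
print(info["Year"], info["Director"])
```

Slicing between the braces tolerates extra text around the object, while `json.loads` still rejects replies that are not valid JSON.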
