<a href="https://colab.research.google.com/github/Farzad-R/LLM-playground/blob/master/LLM-function-calling/llm_function_calling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* **NOTE:** Set the Colab runtime's GPU to T4.
* **NOTE:** Pay attention to the size of the model you are using. With LLAMA-13b, you might need to start a new runtime before rerunning the model import so that the full GPU memory is available.
* **NOTE:** Before using an LLM, find the proper prompt structure for that specific model and modify the prompt preparation accordingly.
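
To make the memory note above concrete, here is a rough back-of-the-envelope estimate of how much GPU memory the weights alone need at different precisions. It is only a sketch under simple assumptions: parameter counts are rounded to 7e9 and 13e9, the T4 is taken to have a nominal 16 GB, and activations, the KV cache, and quantization overhead are ignored. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
# Rough estimate of weight memory only (assumption: ignores activations,
# KV cache, and quantization overhead).
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4bit": 0.5}

def weight_memory_gb(n_params: float, precision: str) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

T4_MEMORY_GB = 16  # nominal T4 memory (assumption); the usable amount is somewhat lower

for name, n_params in [("Llama-2-7b", 7e9), ("Llama-2-13b", 13e9)]:
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gb(n_params, precision)
        fits = "fits" if size < T4_MEMORY_GB else "does not fit"
        print(f"{name} {precision}: ~{size:.1f} GB ({fits} on a T4)")
```

Under these assumptions the 13b model leaves little headroom on a T4 even when quantized to 8-bit, so a copy left in memory from a previous import can easily exhaust the GPU.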

**Install libraries:**

In [None]:
!pip install -q pydantic transformers SentencePiece torch accelerate bitsandbytes

**Import libraries:**

In [None]:
from pydantic import create_model
import inspect, json
from inspect import Parameter
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline
    )
import sentencepiece
import torch
from typing import List
import warnings
warnings.filterwarnings("ignore")

**Define configs and necessary functions in LlamaModel:**

In [None]:
llama_models = ["NousResearch/Llama-2-13b-chat-hf", "NousResearch/Llama-2-7b-chat-hf"]
# Other models worth trying:
#   stablebeluga_seven_b_model = "stabilityai/StableBeluga-7B"
#   llama_seven_b_gptq_model = "TheBloke/Llama-2-7b-Chat-GPTQ"
class LlamaModel:
  def __init__(self, quant_four=False, quant_eight=False, model="NousResearch/Llama-2-7b-chat-hf"):
    self.base_model = model
    self.quant_four = quant_four
    # 4-bit NF4 quantization; computation runs in float16.
    self.quant_four_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=False,
      )
    self.quant_eight = quant_eight
    # 8-bit quantization. BitsAndBytesConfig has no bnb_8bit_* options,
    # so load_in_8bit is the only flag needed here.
    self.quant_eight_config = BitsAndBytesConfig(load_in_8bit=True)
    self.model = self.import_model()
    self.tokenizer_padding_side="right"
    self.tokenizer = self.import_tokenizer()

  def import_model(self):
    """Load the model onto GPU 0, quantized to 4-bit or 8-bit if requested."""
    if self.quant_four:
      print(f"Importing 4bit {self.base_model} ...\n")
      return AutoModelForCausalLM.from_pretrained(
        self.base_model,
        quantization_config=self.quant_four_config,
        device_map={"": 0},
      )
    elif self.quant_eight:
      print(f"Importing 8bit {self.base_model} ...\n")
      return AutoModelForCausalLM.from_pretrained(
        self.base_model,
        quantization_config=self.quant_eight_config,
        device_map={"": 0},
      )
    else:
      # No quantization requested: load the unquantized weights.
      print(f"Importing unquantized {self.base_model} ...\n")
      return AutoModelForCausalLM.from_pretrained(
        self.base_model,
        device_map={"": 0},
      )

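  def import_tokenizer(self):
    print("\nImporting tokenizer ...\n")
    return AutoTokenizer.from_pretrained(self.base_model, padding_side=self.tokenizer_padding_side)

  def prepare_prompt_for_llama(self, prompt):
    # Wrap the user prompt in Llama 2's [INST] ... [/INST] chat markers.
    # The tokenizer also prepends its own BOS token, so decoded outputs start with "<s><s>".
    prompt = f"<s>[INST] {prompt} [/INST]"
    return self.tokenizer(prompt, return_tensors="pt")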
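
The prompt preparation above handles only a plain user message. As a hedged sketch of how a system prompt fits into the same Llama 2 chat layout, the helper below builds the full string. `build_llama2_chat_prompt` is a hypothetical name, and the sketch assumes the tokenizer prepends the `<s>` token itself, so none is added here.

```python
from typing import Optional

def build_llama2_chat_prompt(user_message: str, system_prompt: Optional[str] = None) -> str:
    """Format a single-turn prompt in the Llama 2 chat layout.

    No "<s>" is added here; the assumption is that the tokenizer prepends BOS itself.
    """
    if system_prompt:
        return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message} [/INST]"
    return f"[INST] {user_message} [/INST]"

print(build_llama2_chat_prompt("Who is Elon Musk?"))
print(build_llama2_chat_prompt("Who is Elon Musk?", "Answer in one sentence."))
```

Other model families use different markers, so a helper like this would need to be rewritten for each one.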

**Import LLAMA_2_7b model and tokenizer:**

In [None]:
llama_instance = LlamaModel(quant_eight=True)

Importing 8bit NousResearch/Llama-2-7b-chat-hf ...



Importing tokenizer ...




**Test the model with a simple prompt:**

In [None]:
prompt = "Who is Elon Musk?"
inputs = llama_instance.prepare_prompt_for_llama(prompt).to("cuda")
result = llama_instance.model.generate(**inputs, max_new_tokens=150).to("cpu")
print(llama_instance.tokenizer.batch_decode(result))

['<s><s>[INST] Who is Elon Musk? [/INST]  Elon Musk is a South African-born entrepreneur, inventor, and business magnate who is best known for his ambitious goals in revolutionizing transportation, energy, and space exploration.\n\nMusk was born in 1971 in Pretoria, South Africa. He developed an interest in computing and programming at an early age and taught himself computer programming. He moved to Canada in 1992 to attend college, and later transferred to the University of Pennsylvania, where he graduated with a degree in economics and physics.\n\nAfter college, Musk moved to California to pursue a career in technology and entrepreneurship. He co-founded his first company,']
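
The decoded string above still contains the prompt and the special tokens. One simple way to keep only the model's answer is to split on the last `[/INST]` marker. This is a minimal sketch; `extract_answer` is a hypothetical helper, and the sample string is a shortened copy of the output above.

```python
def extract_answer(decoded: str) -> str:
    """Return the text after the last [/INST] marker, without the end-of-sequence token."""
    answer = decoded.rsplit("[/INST]", 1)[-1]
    return answer.replace("</s>", "").strip()

# Shortened copy of the decoded output shown above.
decoded = "<s><s>[INST] Who is Elon Musk? [/INST]  Elon Musk is a South African-born entrepreneur."
print(extract_answer(decoded))
```

Passing `skip_special_tokens=True` to `batch_decode` would drop `<s>` and `</s>`, but not the `[INST]` markers, which are ordinary text.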


**Function calling in LLAMA 2:**

In [None]:
prompt = """[INST] <<SYS>>

You are a helpful assistent, that only comunicates using JSON files.
The expected output from you has to be:
    {
        "function": {function_name},
        "args": [],
        "ai_notes": {explanation}
    }
The INST block will always be a json string:
    {
        "prompt": {the user request}
    }
Here are the functions available to you:
    function_name=get_local_weather_update
    args=[{country}, {state}]


<</SYS>> [/INST]
[INST]
{
    "prompt": "what is the weather in California right now?"
}
[/INST]"""
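
The function list in the system prompt above is written by hand. As a sketch of how it could be generated from a Python function instead, the snippet below reads the signature with `inspect`. Both `describe_function` and the body of `get_local_weather_update` are assumptions made for illustration; the notebook itself only names the function in the prompt.

```python
import inspect

def get_local_weather_update(country: str, state: str) -> str:
    """Placeholder implementation (assumption): a real one would query a weather service."""
    return f"Weather for {state}, {country}: unknown"

def describe_function(func) -> str:
    """Render a function in the 'function_name=... args=[...]' style of the system prompt."""
    params = inspect.signature(func).parameters
    args = ", ".join("{" + name + "}" for name in params)
    return f"    function_name={func.__name__}\n    args=[{args}]"

print(describe_function(get_local_weather_update))
```

Generating the description this way keeps the prompt in sync with the code when a function gains or loses a parameter.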

In [None]:
inputs = llama_instance.prepare_prompt_for_llama(prompt).to("cuda")
result = llama_instance.model.generate(**inputs, max_new_tokens=150).to("cpu")
print(llama_instance.tokenizer.batch_decode(result))

['<s><s>[INST] [INST] <<SYS>>\n\nYou are a helpful assistent, that only comunicates using JSON files.\nThe expected output from you has to be:\n    {\n        "function": {function_name},\n        "args": [],\n        "ai_notes": {explanation}\n    }\nThe INST block will always be a json string:\n    {\n        "prompt": {the user request}\n    }\nHere are the functions available to you:\n    function_name=get_local_weather_update\n    args=[{country}, {state}]\n\n\n<</SYS>> [/INST]\n[INST]\n{\n    "prompt": "what is the weather in California right now?"\n}\n[/INST] [/INST] {\n    "function": "get_local_weather_update",\n    "args": [\n        {\n            "country": "USA",\n            "state": "California"\n        }\n    ],\n    "ai_notes": {\n        "explanation": "Using OpenWeatherMap API to retrieve current weather conditions for California, USA."\n    }\n}</s>']
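
The model's reply above is only a description of a call, so something still has to parse it and run the function. The sketch below shows one possible way, assuming the reply is the first JSON object after the last `[/INST]` marker. `parse_function_call`, `dispatch`, `AVAILABLE_FUNCTIONS`, and the stand-in `get_local_weather_update` are all hypothetical names introduced for illustration.

```python
import json

def get_local_weather_update(country: str, state: str) -> str:
    """Hypothetical stand-in; a real version would call a weather service."""
    return f"(no live data) weather requested for {state}, {country}"

# Registry of functions the model is allowed to call.
AVAILABLE_FUNCTIONS = {"get_local_weather_update": get_local_weather_update}

def parse_function_call(decoded: str) -> dict:
    """Decode the first JSON object that follows the last [/INST] marker."""
    start = decoded.index("{", decoded.rindex("[/INST]"))
    call, _ = json.JSONDecoder().raw_decode(decoded, start)
    return call

def dispatch(call: dict) -> str:
    """Look up the requested function and call it with the model's arguments."""
    func = AVAILABLE_FUNCTIONS.get(call.get("function"))
    if func is None:
        raise ValueError(f"Unknown function: {call.get('function')!r}")
    kwargs = {}
    for arg in call.get("args", []):  # the model returned args as a list of dicts
        kwargs.update(arg)
    return func(**kwargs)

# Tail of the decoded output shown above (prompt and explanation shortened).
decoded = (
    '[INST] ... [/INST] [/INST] {\n    "function": "get_local_weather_update",\n'
    '    "args": [\n        {\n            "country": "USA",\n'
    '            "state": "California"\n        }\n    ],\n'
    '    "ai_notes": {\n        "explanation": "..."\n    }\n}</s>'
)
call = parse_function_call(decoded)
print(dispatch(call))
```

Looking the name up in an explicit registry, rather than evaluating it, limits the model to functions that were deliberately exposed. Model output is not guaranteed to be valid JSON, so real code would also need to handle `json.JSONDecodeError` and `ValueError`.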
