# Different ways to call the Llama models online

## Information

The Hugging Face Inference API exposes models that have large community interest and are in active use: https://huggingface.co/docs/api-inference/supported-models


**Precondition**: create an Access Token (https://huggingface.co/settings/tokens), set up a pro account to use the larger LLMs like Llama-3-70B (https://huggingface.co/pricing#pro) and accept the META LLAMA 3 COMMUNITY LICENSE AGREEMENT for the two different Llama models:

* for `meta-llama/Llama-3.1-8B-Instruct`: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
* for `meta-llama/Meta-Llama-3-70B-Instruct`: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct

Remark: Llama models are published under the **META LLAMA 3 COMMUNITY LICENSE AGREEMENT**. The license grants users a non-exclusive, royalty-free license (you do not need to pay ongoing fees) to use, modify, and distribute the Llama 3 materials, with requirements for attribution and naming conventions when creating derivative works. Licensees whose products or services have more than 700 million monthly active users need a separate license from Meta, and Meta disclaims all warranties and limits its liability for any use of the materials.


**Possibilities to call models online using an API**:
To call the `meta-llama/Llama-3.1-8B-Instruct` or `meta-llama/Meta-Llama-3-70B-Instruct` models from Hugging Face, you can use several different methods depending on your preferences and technical requirements. 

Here are the most common approaches:

1. Using the InferenceClient from Hugging Face
2. Using the openai API from OpenAI
3. Using langchain_huggingface from langchain



*** 
**Background information**

* ...


***
**Coding sources**

* Model page for `meta-llama/Llama-3.1-8B-Instruct`: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
* Model page for `meta-llama/Meta-Llama-3-70B-Instruct`: https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
    + Hugging Face documentation: https://huggingface.co/docs/transformers/main/en/model_doc/llama3



***
**Aim of the code template**

Exemplify different approaches to call Llama (LLMs) online.

## Load necessary libraries:

In [2]:
# loaded within the single code chunks

## Get API key(s)

In [3]:
import os
import sys

# Assuming 'src' is a sibling folder of the current working directory
path_to_src = os.path.join('..','src')  # Moves one level up, then into the 'src' folder

# Add the path to sys.path
sys.path.append(path_to_src)

# Now you can import your API_key module
import API_key as key
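
The `API_key` module itself is not shown in this notebook. As a minimal sketch, such a module could read the token from an environment variable instead of hard-coding it. The variable name `HF_TOKEN` and the helper `load_token` are assumptions made for illustration, not part of the original code:

```python
# Hypothetical content of src/API_key.py (the real module is not shown).
import os


def load_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in the given environment variable.

    Raises a clear error instead of silently passing an empty token on.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set.")
    return token


# The notebook accesses the token as `key.hugging_api_key`; inside the module
# this would be: hugging_api_key = load_token()
```

Keeping the token out of the source file makes it harder to commit it to version control by accident.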

Create simple prompts, which are identical for all of the following approaches:

In [4]:
# Create prompts
system_content = "You are a helpful assistant specialized in animal names."
user_content = """
 Please write down five animals; provide only the names, separated by commas (,).
"""



# Online approaches

The online approaches use the Serverless Inference API, see: https://huggingface.co/docs/api-inference/index

## using the "InferenceClient" from huggingface_hub

**Technical Considerations:**

When querying the API, outputs are automatically cached if the inputs are identical. This applies to our case, as explained in the documentation: https://huggingface.co/docs/api-inference/parameters

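> To bypass caching and ensure fresh results for each query, define the header `"x-use-cache": "false"`. The OpenAI-based examples below do not set this header; with the `openai` client it would have to be passed as a custom header (for example via the client's `default_headers` argument).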

In [5]:
from huggingface_hub import InferenceClient
import textwrap
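# Model used in all following cells (alternative: "meta-llama/Meta-Llama-3.1-70B-Instruct")
model_name = "meta-llama/Meta-Llama-3-70B-Instruct"

# Disable caching so that identical prompts return fresh results
client = InferenceClient(
    model=model_name,
    headers={"x-use-cache": "false"},
    token=key.hugging_api_key,
)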


# Feed prompts into model
output = client.chat_completion(
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ],
    stream=False,
    max_tokens=500,
    temperature=1
)

# Accessing the text in the output object
text = output.choices[0].message.content

# Printing the output in a more readable format
print('\n'.join(textwrap.wrap(text, 100)))

Kangaroo, Lion, Penguin, Squirrel, Gorilla


In general, LLM clients return **complex objects** that contain more than the generated text (for example the model name, an ID, and token usage). It therefore makes sense to store the complete object and, if needed, add further information:

`dir(output)`:
* Purpose: Lists the names of the attributes and methods of the object.
* Output Type: Returns a list of strings representing all attributes and methods.
* Scope: Includes all members (both built-in and user-defined) of the object, such as methods and properties.

`vars(output)`:
* Purpose: Returns the `__dict__` attribute of the object, which contains the object’s writable attributes.
* Output Type: Returns a dictionary containing the object's writable attributes and their values.
* Scope: Only includes the object's instance attributes; does not list methods.
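
The difference between the two can be seen on a small stand-in object. The `Reply` class below is hypothetical and only mimics the idea of a response object; it is not the type returned by the client:

```python
from dataclasses import dataclass


@dataclass
class Reply:
    """Tiny stand-in for a response object (hypothetical, for illustration)."""
    model: str
    text: str

    def shout(self) -> str:
        return self.text.upper()


reply = Reply(model="demo-model", text="Lion, Tiger")

# vars() returns only the instance attributes together with their values
print(vars(reply))

# dir() returns names only, and also includes methods and dunder members
public_names = [name for name in dir(reply) if not name.startswith("_")]
print(public_names)
```

Here `vars(reply)` contains `model` and `text` with their values, while the filtered `dir(reply)` additionally lists the method name `shout`.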

In [6]:
print("Get attributes and methods and writable attributes:\n")

# Get attributes and methods
attributes = dir(output)
print("Get attributes and methods:\n", attributes)

# Get writable attributes
writable_attributes = vars(output)
print("Get writable attributes:\n", writable_attributes)
print("To get generated text:\n", output.choices[0].message.content)


print("\nSee complete object:\n")
print("Complete object:\n", output)


import datetime

# Ideally, also record the current date and time of the request
current_datetime = datetime.datetime.now()
print("Current date and time:\n", current_datetime)

Get attributes and methods and writable attributes:

Get attributes and methods:
 ['__annotations__', '__class__', '__class_getitem__', '__contains__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__ior__', '__iter__', '__le__', '__len__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__or__', '__post_init__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__ror__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'choices', 'clear', 'copy', 'created', 'fromkeys', 'get', 'id', 'items', 'keys', 'model', 'parse_obj', 'parse_obj_as_instance', 'parse_obj_as_list', 'pop', 'popitem', 'setdefault', 'system_fingerprint', 'update', 'usage', 'values']
Get writable attributes:
 {'choices': [ChatCompletionOutputComplete(fi
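
Storing the complete object together with extra information could look like the sketch below. `FakeOutput` and `to_record` are hypothetical names used for illustration. The `dir(output)` listing above includes `__dataclass_fields__`, which suggests that `dataclasses.asdict` may also apply to the real response object, but that is an untested assumption here:

```python
import datetime
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FakeOutput:
    """Hypothetical stand-in for the response object returned by the client."""
    model: str
    choices: list = field(default_factory=list)


def to_record(output, prompt: str) -> dict:
    """Bundle the full response with the prompt and a timestamp."""
    return {
        "timestamp": datetime.datetime.now().isoformat(timespec="seconds"),
        "prompt": prompt,
        "response": asdict(output),
    }


record = to_record(FakeOutput(model="demo-model", choices=["Lion, Tiger"]), "five animals")
line = json.dumps(record)  # one JSON line, ready to append to a log file
```

Writing one JSON line per request keeps the raw responses available for later inspection without re-querying the API.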

An example using `stream=True`:

In [7]:
client = InferenceClient(
    model=model_name,
    headers={"x-use-cache": "false"},
    token=key.hugging_api_key,
)


# Feed prompts into model
output = client.chat_completion(
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ],
    stream=True,
    max_tokens=500,
    temperature=1
)


# Iterate over the stream and print the text accumulated so far
import json  # only needed for the JSONDecodeError handler below

text = ""
chunk = None
try:
    for chunk in output:
        # delta.content can be None (e.g. in a final chunk), so fall back to ""
        text += chunk.choices[0].delta.content or ""
        print(text)
except json.JSONDecodeError as e:
    print(f"JSON decode error: {e}")
    print(f"Last chunk received: {chunk}")  # Display the last chunk for inspection


Ele
Elephant
Elephant,
Elephant, C
Elephant, Cheet
Elephant, Cheetah
Elephant, Cheetah,
Elephant, Cheetah, Ko
Elephant, Cheetah, Koala
Elephant, Cheetah, Koala,
Elephant, Cheetah, Koala, Nar
Elephant, Cheetah, Koala, Narwh
Elephant, Cheetah, Koala, Narwhal
Elephant, Cheetah, Koala, Narwhal,
Elephant, Cheetah, Koala, Narwhal, M
Elephant, Cheetah, Koala, Narwhal, Mongoose
Elephant, Cheetah, Koala, Narwhal, Mongoose


In [8]:
print("\nSee complete object:\n")
print("Complete object:\n", output)


See complete object:

Complete object:
 <generator object _stream_chat_completion_response at 0x7bc7c078d620>
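
With `stream=True` the returned object is a generator, so it can only be read once, and printing it shows no text. The sketch below illustrates this with a fake stream; `fake_stream` and `collect` are hypothetical helpers that only imitate the `chunk.choices[0].delta.content` shape used above:

```python
from types import SimpleNamespace


def fake_stream(pieces):
    """Yield chunk-like objects shaped like chunk.choices[0].delta.content."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect(stream) -> str:
    """Join all text pieces, treating a missing (None) piece as empty."""
    return "".join(chunk.choices[0].delta.content or "" for chunk in stream)


stream = fake_stream(["Ele", "phant", ",", None])
first_pass = collect(stream)   # consumes the generator
second_pass = collect(stream)  # nothing is left to read
print(repr(first_pass), repr(second_pass))
```

If the full text is needed later, it has to be accumulated while iterating, as the cell above does with `text`.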


## using "openai" from OpenAI

In [9]:
from openai import OpenAI
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Initialize the client, pointing it to the Hugging Face Inference API endpoint
client = OpenAI(
    base_url="https://api-inference.huggingface.co/v1/",
    api_key=key.hugging_api_key
)
output = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ],
    stream=True,
    max_tokens=500,
    temperature=1
)

# Iterate over the stream and print the text accumulated so far
text = ""

for chunk in output:
    # delta.content can be None (e.g. in the final chunk), so fall back to ""
    text += chunk.choices[0].delta.content or ""
    print(text)

2024-10-29 12:13:32,412 - INFO - HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 200 OK"


T
Tiger
Tiger,
Tiger, C
Tiger, Cheet
Tiger, Cheetah
Tiger, Cheetah,
Tiger, Cheetah, Gor
Tiger, Cheetah, Gorilla
Tiger, Cheetah, Gorilla,
Tiger, Cheetah, Gorilla, Kang
Tiger, Cheetah, Gorilla, Kangaroo
Tiger, Cheetah, Gorilla, Kangaroo,
Tiger, Cheetah, Gorilla, Kangaroo, Arm
Tiger, Cheetah, Gorilla, Kangaroo, Armad
Tiger, Cheetah, Gorilla, Kangaroo, Armadillo
Tiger, Cheetah, Gorilla, Kangaroo, Armadillo


### motivation to use "openai" from OpenAI - add functionalities of openAI

see docs: https://platform.openai.com/docs/quickstart

For example, the following can be added:
* Error Handling: A try-except block is added to catch any exceptions that might occur during the API call, logging the error for debugging purposes.
* Logging: A logging setup is included to log both the successful generation of responses and any errors that occur, which can be very helpful for monitoring and debugging.

In [10]:
# Initialize the client, pointing it to the Hugging Face Inference API endpoint
client = OpenAI(
    base_url="https://api-inference.huggingface.co/v1/",
    api_key=key.hugging_api_key,
)

def generate_chat_response(model_name, system_content, user_content, max_tokens=500, temperature=1, stream=True):
    """Call the chat endpoint and return the generated text (None on failure)."""
    try:
        output = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": system_content},
                {"role": "user", "content": user_content}
            ],
            stream=stream,
            max_tokens=max_tokens,
            temperature=temperature
        )

        if stream:
            # Iterate over the stream and print the text accumulated so far
            text = ""
            for chunk in output:
                # delta.content can be None, so fall back to ""
                text += chunk.choices[0].delta.content or ""
                print(text)
        else:
            # Without streaming, the full text is available directly
            text = output.choices[0].message.content
            print(text)

        # Log the complete response
        logging.info("Generated response: %s", text)
        return text

    except Exception as e:
        logging.error("An error occurred: %s", str(e))
        return None

# Example usage of the function
generate_chat_response(
    model_name=model_name, 
    system_content=system_content, 
    user_content=user_content,
    max_tokens=500, 
    temperature=0.7  # Adjust temperature for creativity
)

2024-10-29 12:13:33,126 - INFO - HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 200 OK"


L
Lion
Lion,
Lion, Elephant
Lion, Elephant,
Lion, Elephant, Kang
Lion, Elephant, Kangaroo
Lion, Elephant, Kangaroo,
Lion, Elephant, Kangaroo, Penguin
Lion, Elephant, Kangaroo, Penguin,


2024-10-29 12:13:33,379 - INFO - Generated response: Lion, Elephant, Kangaroo, Penguin, Tiger


Lion, Elephant, Kangaroo, Penguin, Tiger
Lion, Elephant, Kangaroo, Penguin, Tiger


### motivation to use "openai" from OpenAI - add functionalities of openAI AND langchain


1. create a prompt template:

In [11]:
from langchain_core.prompts import ChatPromptTemplate

# Simplified system template
system_template = """
<Context>
You are a knowledgeable assistant whose task is to provide two arrays: one for the names of the capitals and another for the primary languages spoken in the provided list of countries. 
Respond solely with the information requested, formatted as specified.
</Context>

<Data Structure>
The provided countries are simply an array of country names.
</Data Structure>

<Task>
Write a JSON object containing two arrays: "capital_names" for the names of the requested capitals and "languages" for the primary languages spoken in those capitals. 
Ensure that the information is structured as follows:

{{
  "capitals": [
   {{
      "name": "name1",
      "language": "language1"
    }},
    {{
      "name": "name2",
      "language": "language2"
    }},
    {{
      "name": "name3",
      "language": "language3"
    }}
  ]
}}

Please respond with the entire JSON structure as a dictionary called "capitals", exactly as shown above, without any additional formatting or text.
</Task>
"""


# Simplified user template
user_template = """
What are the capitals of {userinput}, along with the languages spoken there?
"""

# Set up the ChatPromptTemplate
prompt_template = ChatPromptTemplate.from_messages(
    [("system", system_template), ("user", user_template)]
)

# Test the invoke with simplified templates
result = prompt_template.invoke({"userinput": "Germany, USA, France, China, Australia, and South Africa"})
print(result)

messages=[SystemMessage(content='\n<Context>\nYou are a knowledgeable assistant whose task is to provide two arrays: one for the names of the capitals and another for the primary languages spoken in the provided list of countries. \nRespond solely with the information requested, formatted as specified.\n</Context>\n\n<Data Structure>\nThe provided countries are simply an array of country names.\n</Data Structure>\n\n<Task>\nWrite a JSON object containing two arrays: "capital_names" for the names of the requested capitals and "languages" for the primary languages spoken in those capitals. \nEnsure that the information is structured as follows:\n\n{\n  "capitals": [\n   {\n      "name": "name1",\n      "language": "language1"\n    },\n    {\n      "name": "name2",\n      "language": "language2"\n    },\n    {\n      "name": "name3",\n      "language": "language3"\n    }\n  ]\n}\n\nPlease respond with the entire JSON structure as a dictionary called "capitals", exactly as shown above, wit
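
The doubled braces `{{` and `}}` in the system template are needed because the template is filled in with Python-style format placeholders: single braces mark a variable such as `{userinput}`, doubled braces produce a literal brace. The printed result above shows the single braces after substitution. A minimal sketch of the same rule using plain `str.format` (the shortened template text is an illustration, not the original prompt):

```python
# str.format follows the same brace rules as the format-style template above
template = 'Return {{"capitals": [...]}} for: {userinput}'

filled = template.format(userinput="Germany, France")
print(filled)
```

Forgetting to double the braces around the JSON example would make the formatter treat the JSON keys as missing template variables.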

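2. create a JSON schema for structured output [not needed]:

! not working for my code (reason unknown). Note that the schema below requires the key `capital_names`, whereas the prompt template asks for `name`; this mismatch is one possible cause.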

In [12]:
# JSON schema for structured output
json_schema = {
    "title": "Outputs",
    "description": "Structured response detailing the capitals and languages spoken in those capitals.",
    "type": "object",
    "properties": {
        "capitals": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "capital_names": {
                        "type": "string",
                        "description": "The name of the capital city."
                    },
                    "language": {
                        "type": "string",
                        "description": "The primary language spoken in the capital city."
                    }
                },
                "required": ["capital_names", "language"]
            },
            "description": "An array of objects containing the capital cities and their respective languages."
        },
    },
    "required": ["capitals"],
}
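
A quick way to see whether a model answer fits such a schema is to check the required keys by hand. The sketch below is a simplified check written for illustration, not a full JSON Schema validator; `missing_required` is a hypothetical helper, and `schema` is a reduced copy of the schema above:

```python
def missing_required(schema: dict, data: dict) -> list:
    """List required keys that are absent, one level of nesting deep."""
    problems = [k for k in schema.get("required", []) if k not in data]
    for key, sub in schema.get("properties", {}).items():
        if sub.get("type") == "array" and key in data:
            required = sub.get("items", {}).get("required", [])
            for i, item in enumerate(data[key]):
                problems += [f"{key}[{i}].{k}" for k in required if k not in item]
    return problems


# Reduced copy of the schema above, keeping only what the check reads
schema = {
    "properties": {
        "capitals": {
            "type": "array",
            "items": {"required": ["capital_names", "language"]},
        }
    },
    "required": ["capitals"],
}

# Shape requested by the prompt template ("name", not "capital_names")
answer = {"capitals": [{"name": "Berlin", "language": "German"}]}
print(missing_required(schema, answer))
```

For an answer in the shape requested by the prompt template, the check reports `capital_names` as missing, which shows that schema and prompt disagree on the key name.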

3. apply model

! If the JSON schema works for you, uncomment the `structured_llm` and `chain` lines below (and comment out the plain `chain = prompt_template | model`).

In [13]:
from langchain_openai import ChatOpenAI
from langchain.callbacks import get_openai_callback


# Initialize the ChatOpenAI model
model = ChatOpenAI(
    model=model_name,
    openai_api_key=key.hugging_api_key,
    openai_api_base="https://api-inference.huggingface.co/v1/",
    max_tokens=500,
    temperature=0.2,
)

# Configure the model with structured output
#structured_llm = model.with_structured_output(json_schema, include_raw=True)
#chain = prompt_template | structured_llm

chain = prompt_template | model

# Execute the model and output response details
with get_openai_callback() as cb:
    response = chain.invoke(
        {"userinput": "Germany, USA, France, China, Australia, Angola, Egypt"}
    )
    print(cb)
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")

2024-10-29 12:13:37,348 - INFO - HTTP Request: POST https://api-inference.huggingface.co/v1/chat/completions "HTTP/1.1 200 OK"


Tokens Used: 424
	Prompt Tokens: 275
	Completion Tokens: 149
Successful Requests: 1
Total Cost (USD): $0.0
Total Tokens: 424
Prompt Tokens: 275
Completion Tokens: 149
Total Cost (USD): $0.0


4. extract information in a readable form:

In [14]:
import json
import pandas as pd

print(response.content)


# Load the JSON data
data = json.loads(response.content)

# Convert to DataFrame
df = pd.DataFrame(data['capitals'])

# Display the DataFrame
print(df)

{
  "capitals": [
    {
      "name": "Berlin",
      "language": "German"
    },
    {
      "name": "Washington D.C.",
      "language": "English"
    },
    {
      "name": "Paris",
      "language": "French"
    },
    {
      "name": "Beijing",
      "language": "Mandarin Chinese"
    },
    {
      "name": "Canberra",
      "language": "English"
    },
    {
      "name": "Luanda",
      "language": "Portuguese"
    },
    {
      "name": "Cairo",
      "language": "Arabic"
    }
  ]
}
              name          language
0           Berlin            German
1  Washington D.C.           English
2            Paris            French
3          Beijing  Mandarin Chinese
4         Canberra           English
5           Luanda        Portuguese
6            Cairo            Arabic
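
`json.loads(response.content)` only works when the model returns nothing but JSON. Models sometimes wrap the JSON in extra text, so a more forgiving extraction step can help. The sketch below is one simple approach under that assumption; `extract_json` and the sample string `raw` are hypothetical:

```python
import json


def extract_json(text: str) -> dict:
    """Parse the outermost {...} span found in text."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])


raw = 'Here is the result:\n{"capitals": [{"name": "Berlin", "language": "German"}]}\nDone.'
data = extract_json(raw)
print(data["capitals"])
```

The result can be passed to `pd.DataFrame(data['capitals'])` in the same way as in the cell above.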


## using the "langchain_huggingface" and highlighting special tokens

see docs: https://python.langchain.com/docs/integrations/platforms/huggingface/

and blog post: https://huggingface.co/blog/langchain


**Prompt format for Llama >= 3.1:**
It is possible to define **special tokens** within prompts to trigger all kinds of behaviours. For more information, see:

* Prompt formats of llama.com (Meta) webpage: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/
* on their GitHub page: https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md
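
As a minimal sketch of this prompt format, the helper below assembles a single-turn prompt from a system and a user message, closing each message with `<|eot_id|>` and ending with the assistant header so that the model continues from there. `build_llama3_prompt` is a hypothetical helper written for illustration; the linked pages remain the reference for the exact format:

```python
def build_llama3_prompt(system: str, user: str) -> str:
    """Assemble a single-turn Llama 3 style prompt string."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system.strip()}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user.strip()}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_llama3_prompt(
    "You only respond with the name of the capital asked for.",
    "What is the capital of Germany?",
)
print(prompt)
```

Building the string in one place avoids forgetting a closing token when prompts are written by hand.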

Example 1 - Get names of the capitals of different countries:

In [15]:
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    task="text-generation",
    max_new_tokens=100,
    do_sample=False,
    huggingfacehub_api_token=key.hugging_api_key,
    temperature=0.1
)
llm.invoke("""          
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
    You are a helpful assistant who only responds with the name of the capital asked for.
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
    What is the capital of Germany?
    
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
""")

'Berlin'

Example 2 - Zero shot function calling:

In [16]:
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    task="text-generation",
    max_new_tokens=100,
    do_sample=False,
    huggingfacehub_api_token=key.hugging_api_key,
    temperature=0.1
)
llm.invoke("""          
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the function can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.

Here is a list of functions in JSON format that you can invoke.

[
    {
        "name": "get_weather",
        "description": "Get weather info for places",
        "parameters": {
            "type": "dict",
            "required": [
                "city"
            ],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The name of the city to get the weather for"
                },
                "metric": {
                    "type": "string",
                    "description": "The metric for weather. Options are: celsius, fahrenheit",
                    "default": "celsius"
                }
            }
        }
    }
]
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
What is the weather in SF and Seattle in celsius?

<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
""")

"get_weather(city='San Francisco', metric='celsius'), get_weather(city='Seattle', metric='celsius')"

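The returned string is still plain text; to act on it, the function names and arguments have to be extracted. One possible approach, sketched below, parses the string with Python's `ast` module instead of executing it. `parse_calls` is a hypothetical helper, and it assumes the model sticks to the requested call syntax with keyword arguments:

```python
import ast


def parse_calls(text: str) -> list:
    """Turn "f(a='x'), g(b=1)" into [("f", {"a": "x"}), ("g", {"b": 1})]."""
    tree = ast.parse(text.strip(), mode="eval").body
    # One call parses as ast.Call, several as ast.Tuple, bracketed ones as ast.List
    nodes = tree.elts if isinstance(tree, (ast.Tuple, ast.List)) else [tree]
    calls = []
    for node in nodes:
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            raise ValueError("unexpected expression in model output")
        # literal_eval only accepts literals, so no code from the model is run
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, kwargs))
    return calls


reply = "get_weather(city='San Francisco', metric='celsius'), get_weather(city='Seattle', metric='celsius')"
calls = parse_calls(reply)
print(calls)
```

Each parsed entry can then be dispatched to a real implementation of `get_weather`, which is not part of this notebook.
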
Example 3 - Create Python Code to check the answer to a question:

In [17]:
from langchain_huggingface import HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    task="text-generation",
    max_new_tokens=500,
    do_sample=False,
    huggingfacehub_api_token=key.hugging_api_key,
    temperature=0.1
)


output = llm.invoke("""          
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>

Environment: ipython
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Write code to check if number is prime. Use it to verify if number 7 is prime

<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
""")

print('\n')
print('\n'.join(textwrap.wrap(output, 100)))



Here is a simple function to check if a number is prime: ``` def is_prime(n):     if n <= 1:
return False     for i in range(2, int(n**0.5) + 1):         if n % i == 0:             return False
return True ``` Now, let's use this function to check if the number 7 is prime: ```
print(is_prime(7))  # Output: True ``` As expected, the output is `True`, indicating that 7 is
indeed a prime number.
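
Because `textwrap.wrap` collapses the line breaks, the generated code is hard to read in the output above. Re-indented, the function from the model answer can be run locally to check the claim; the additional test values are chosen here for illustration and are not part of the model output:

```python
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True


# Check 7 as in the prompt, plus a few more values as a sanity check
results = {n: is_prime(n) for n in (1, 2, 7, 9, 25)}
print(results)
```

Running generated code before trusting the model's own statement about its output is a cheap way to catch wrong answers.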
