# Using Open Source LLMs Natively

This notebook briefly shows how to use open source LLMs through several popular libraries and APIs:

- Hugging Face Transformers
- Hugging Face Serverless Inference APIs
- Hugging Face Inference Client
- Groq Cloud

## Install Dependencies

In [0]:
!pip install transformers==4.47.0
!pip install accelerate==1.1.0 # useful when using models with GPUs locally via huggingface
!pip install groq==0.13.0

## Get Hugging Face Access Token

You need an access token to download or access models on Hugging Face's platform:

- Hugging Face Access Token: Go [here](https://huggingface.co/settings/tokens) and create a token with write permissions. You need to set up an account first, which is free.


1. After creating your account, go to [Settings -> Access Tokens](https://huggingface.co/settings/tokens) and create a new access token with write permissions.

![](https://i.imgur.com/dtS6tFr.png)

2. Remember to __Save__ your key somewhere safe, because it is shown only once, as in the screenshot below. Copy it into a secure local file so you can use it later. If you lose it, you can create a new key at any time.

![](https://i.imgur.com/NmZmpmw.png)

## Load Hugging Face Access Token


In [0]:
from getpass import getpass

hf_key = getpass("Enter your Hugging Face Access Token: ")

## Configure Key in Environment


In [0]:
import os

os.environ["HF_TOKEN"] = hf_key
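The two cells above prompt for the token on every run. As a minimal sketch of an alternative, assuming you may already have `HF_TOKEN` exported in your shell, the helper below checks the environment first and only prompts when the variable is missing. The name `load_secret` is hypothetical and not part of any library.

```python
import os
from getpass import getpass


def load_secret(env_var, prompt):
    """Return the secret from the environment, prompting only if it is missing."""
    value = os.environ.get(env_var)
    if not value:
        # Nothing set yet: ask interactively and cache it for later cells.
        value = getpass(prompt)
        os.environ[env_var] = value
    return value
```

In this notebook it could be called as `hf_key = load_secret("HF_TOKEN", "Enter your Hugging Face Access Token: ")`.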

## Using LLMs Locally with Hugging Face

Use this approach if you want to download an LLM and run it entirely on your own machine, without sending your data to any external server. Note that you will need a GPU to run any of these models, since even the smaller language models are still quite large.

Some LLMs, such as [Meta Llama 3.2 1B Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), are gated, so make sure to apply for access as shown below. Otherwise you will get an error when loading the model.

![](https://i.imgur.com/M88MOu5.png)

## Load the LLM Locally Using Hugging Face

In [0]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

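model_id = "meta-llama/Llama-3.2-1B-Instruct"
# Downloads the tokenizer and weights on first use (requires access to the gated repo)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",           # place the model on the GPU
    torch_dtype=torch.bfloat16   # 16-bit weights to reduce memory use
)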

In [0]:
chat = [
    { "role": "user", "content": "Explain what is Generative AI in 2 bullet points" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(prompt)
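To make it clearer what `apply_chat_template` does, here is a deliberately simplified stand-in that flattens a list of role/content messages into a single prompt string. The function name `render_llama3_style_prompt` is hypothetical, and the special-token strings are an assumption based on the Llama 3 prompt format; the real template ships with the tokenizer and may add more (for example a default system block), so treat the prompt printed above as the source of truth.

```python
def render_llama3_style_prompt(chat, add_generation_prompt=True):
    """Flatten role/content messages into one string with role markers.

    Simplified illustration only, not the tokenizer's actual template.
    """
    parts = ["<|begin_of_text|>"]
    for message in chat:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo_chat = [
    {"role": "user", "content": "Explain what is Generative AI in 2 bullet points"},
]
print(render_llama3_style_prompt(demo_chat))
```

The key idea is that `add_generation_prompt=True` ends the string with an open assistant turn, which is what cues the model to answer rather than continue the user's message.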

Remember to always refer to the [__documentation__](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate), which describes all the arguments of the `generate` method in detail. Most notably:

- **max_length:** The maximum total length of the sequence, counting the prompt tokens plus the generated tokens
- **max_new_tokens:** The maximum number of tokens to generate, ignoring the number of tokens in the prompt. Set either `max_new_tokens` or `max_length`, not both, since both cap the output length
- **do_sample:** Whether or not to use sampling. `False` means greedy decoding, which always picks the most likely next token (similar in effect to a temperature close to 0)
- **temperature:** A positive value, typically between 0 and 1, used to modulate the next-token probabilities. A higher temperature means the results may vary more and be more creative. It only has an effect when `do_sample=True`
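To illustrate how temperature modulates next-token probabilities, the sketch below applies a temperature-scaled softmax to a few made-up scores. It is a toy illustration in plain Python, not the library's implementation; `softmax_with_temperature` and `toy_logits` are hypothetical names.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy decoding instead")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution so that sampling picks less likely tokens more often.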

In [0]:
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=1000)
print(tokenizer.decode(outputs[0]))
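For a causal language model, the sequence returned by `generate` starts with the prompt tokens, so the decoded text above repeats the prompt before the answer. A minimal sketch of keeping only the newly generated part, shown on plain lists so it runs without a model (`new_tokens_only` is a hypothetical helper):

```python
def new_tokens_only(output_ids, prompt_length):
    """Drop the first `prompt_length` token ids, keeping only the generated ones."""
    return output_ids[prompt_length:]


# Toy ids standing in for real tensors: 3 prompt tokens, then 2 generated tokens.
toy_prompt_ids = [101, 7, 42]
toy_output_ids = toy_prompt_ids + [900, 901]
print(new_tokens_only(toy_output_ids, len(toy_prompt_ids)))

# With this notebook's variables the same idea would look like this
# (an untested assumption, since it needs the loaded model):
# reply_ids = new_tokens_only(outputs[0], inputs.shape[-1])
# print(tokenizer.decode(reply_ids, skip_special_tokens=True))
```

Passing `skip_special_tokens=True` to `decode` additionally hides markers such as end-of-turn tokens in the printed text.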

### Pipelines make it easier to send prompts

With a pipeline, you don't need to encode your inputs and decode your outputs manually every time.

In [0]:
llama_pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cuda",
)

In [0]:
chat = [
    { "role": "user", "content": "Explain what is Generative AI in 2 bullet points" },
]

In [0]:
response = llama_pipe(chat, max_new_tokens=1000)
print(response)

In [0]:
print(response[0]["generated_text"][-1]['content'])
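The indexing in the last cell relies on the shape of the pipeline output for chat input: a list with one dictionary whose `generated_text` holds the whole conversation, with the model's reply as the final message. The sketch below makes that explicit on a hand-written stand-in response; `last_assistant_message` and `mock_response` are hypothetical names, and the reply text is a placeholder rather than real model output.

```python
def last_assistant_message(pipeline_response):
    """Return the content of the final message in the first generated conversation."""
    conversation = pipeline_response[0]["generated_text"]
    final_message = conversation[-1]
    if final_message.get("role") != "assistant":
        raise ValueError("expected the last message to come from the assistant")
    return final_message["content"]


# Hand-written stand-in with the same shape as the response printed above.
mock_response = [{"generated_text": [
    {"role": "user", "content": "Explain what is Generative AI in 2 bullet points"},
    {"role": "assistant", "content": "(model reply would appear here)"},
]}]
print(last_assistant_message(mock_response))
```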

## Using LLMs via Hugging Face Inference APIs

Hugging Face has made its [__Inference API__](https://huggingface.co/docs/api-inference/quicktour) free to use, with basic rate limits in place to prevent unlimited requests to its servers.

The best part is that you can access 150,000+ deep learning models without worrying about your own infrastructure.

## Load Hugging Face Access Token


In [0]:
from getpass import getpass

hf_key = getpass("Enter your Hugging Face Access Token: ")

## Configure Key in Environment


In [0]:
import os

os.environ["HF_TOKEN"] = hf_key

### Create LLM API Access Function

Here we create a basic function that can call any LLM API endpoint available on Hugging Face.

For more details refer to the [detailed documentation](https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task) as needed.

In [0]:
import requests

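# Every request carries the access token as a bearer token.
headers = {"Authorization": "Bearer " + hf_key}

def query(payload, MODEL_API_URL):
    """POST the payload to the given model endpoint and return the parsed JSON body."""
    response = requests.post(MODEL_API_URL, headers=headers, json=payload)
    print('API Response:', response)  # shows the HTTP status, e.g. <Response [200]>
    return response.json()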
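Because the free API is rate limited, a request can occasionally fail and succeed a moment later. A minimal sketch of retrying with exponential backoff is shown below; `call_with_retries` is a hypothetical helper that works with any zero-argument callable and is not part of `requests` or any Hugging Face library.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(); on an exception, wait and retry with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

It could wrap the function above as `call_with_retries(lambda: query(payload, url))`. Note that `query` as written does not raise on HTTP error codes, so the retry would only trigger on network exceptions unless you also call `response.raise_for_status()` inside it.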

## Create LLM API Access Config

Here we decide which LLMs we will access by getting their inference API endpoints.

We also set some general configuration settings. You can find the [detailed documentation](https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task) here.

Some useful config settings include:

- max_new_tokens: The number of new tokens to generate in the response
- do_sample: Whether or not to use sampling. False means greedy decoding, which always picks the most likely next token (similar in effect to a temperature close to 0)
- temperature: A positive value, typically between 0 and 1, used to modulate the next-token probabilities. A higher temperature means the results may vary more and be more creative
- return_full_text: If set to False, the response contains only the generated text and not your input prompt
- wait_for_model: If the model is not ready, wait for it instead of receiving a 503 error. This reduces the number of requests needed to get a result. The API documentation lists this as a request option rather than a generation parameter
- repetition_penalty: The more a token is used within a generation, the more it is penalized so that it is less likely to be picked in successive generation passes

In [0]:
HF_API_URL = "https://api-inference.huggingface.co/models/"
model_name = "meta-llama/Llama-3.2-1B-Instruct"
LLAMA_API_URL = HF_API_URL + model_name
# Generation settings go under "parameters"
params = {
    "return_full_text": False,
    "max_new_tokens": 1000,
}
# Request-level settings such as wait_for_model go under "options"
options = {
    "wait_for_model": True,
}

In [0]:
prompt = "Explain what is Generative AI in 2 bullet points"

In [0]:
output = query(payload={
                "inputs": prompt,
                "parameters": params,
                "options": options
                },
                MODEL_API_URL=LLAMA_API_URL)

print(output[0]['generated_text'])
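The `print` above assumes the call succeeded and returned a list. When something goes wrong (for example, the model is still loading or the token lacks access), the JSON body is typically a dictionary with an `error` key, and indexing it with `[0]` fails with a confusing `KeyError`. A small defensive sketch follows; `extract_generated_text` is a hypothetical helper, and the error shape is an assumption based on commonly seen responses.

```python
def extract_generated_text(result):
    """Return the generated text, or raise if the API sent back an error object."""
    if isinstance(result, dict) and "error" in result:
        raise RuntimeError(f"Inference API error: {result['error']}")
    if isinstance(result, list) and result and "generated_text" in result[0]:
        return result[0]["generated_text"]
    raise ValueError(f"Unexpected response shape: {result!r}")


# Hand-written stand-ins for a successful and a failed response.
print(extract_generated_text([{"generated_text": "(model reply would appear here)"}]))
try:
    extract_generated_text({"error": "(error message from the API)"})
except RuntimeError as exc:
    print("Caught:", exc)
```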

## Using LLMs via Hugging Face Inference Client

Hugging Face has also made its new [__Inference Client__](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client) free to use, with basic rate limits in place to prevent unlimited requests to its servers.

As with the Inference API, the best part is that you can access 150,000+ deep learning models without worrying about your own infrastructure.

In [0]:
from huggingface_hub import InferenceClient

Feel free to refer to the [documentation](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#huggingface_hub.InferenceClient) at any time as needed for more details on function names, arguments and more.

In [0]:
model_name = "meta-llama/Llama-3.2-1B-Instruct"
client = InferenceClient(model=model_name, api_key=hf_key)

chat = [
    { "role": "user", "content": "Explain what is Generative AI in 2 bullet points" },
]

response = client.chat_completion(chat, max_tokens=1000)
print(response)

In [0]:
print(response.choices[0].message.content)
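Each chat call is stateless: the model only sees the messages you send. To ask a follow-up question, you would append the assistant's reply and your next message to the same list. A minimal sketch, using a hypothetical `extend_chat` helper and placeholder texts:

```python
def extend_chat(chat, assistant_reply, next_user_message):
    """Return a new message list with the reply and a follow-up question appended."""
    return chat + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]


demo_chat = [{"role": "user", "content": "Explain what is Generative AI in 2 bullet points"}]
demo_chat = extend_chat(
    demo_chat,
    "(previous model reply)",                      # placeholder, not real output
    "Give one real-world example for each point",  # illustrative follow-up
)
for message in demo_chat:
    print(message["role"], ":", message["content"])
```

With the client above, the reply would come from `response.choices[0].message.content`, and the extended list would be passed to `client.chat_completion` again.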

## Get Groq API Key

You need an API key to access models on Groq's platform via its APIs:

- Groq API Key: Go [here](https://console.groq.com/keys) and create an API key. You need to set up an account first, which is free. Groq has a generous free tier, and paid plans are also available if you are interested.


1. Go to [Groq Cloud -> Create API Key](https://console.groq.com/keys) after creating your account and make sure to create a new API Key as shown

![](https://i.imgur.com/tgHXlcV.png)

2. Remember to __Save__ your key somewhere safe, because it is shown only once, as in the screenshot below. Copy it into a secure local file so you can use it later. If you lose it, you can create a new key at any time.

![](https://i.imgur.com/Q27AgA1.png)

## Load Groq API Credentials


In [0]:
from getpass import getpass

groq_key = getpass("Enter your Groq API Key: ")

## Using Open Source LLMs Directly via Groq API

Use this approach if you want to call Groq directly, without wrappers like LangChain. We will show how to use open LLMs like Meta Llama 3.2 Instruct through the Groq APIs. The free tier should be good enough for most experiments.

## API Pricing

At the time of writing, good models to use include Mistral, Gemma 2, and Llama 3.1 and 3.2. Check the [rate limits for the free API](https://console.groq.com/settings/limits) and the [pricing for the paid API](https://groq.com/pricing/).

![](https://i.imgur.com/JE8lfXV.png)

## Use Groq for Prompting Open Source LLMs

In [0]:
from groq import Groq

groq_client = Groq(api_key=groq_key)

In [0]:
def get_completion_chatgroq(prompt, model="llama-3.2-3b-preview"):
    messages = [{"role": "user", "content": prompt}]
    response = groq_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0, # degree of randomness of the model's output
    )
    return response.choices[0].message.content

In [0]:
prompt = 'Explain Generative AI in 2 bullet points'
response = get_completion_chatgroq(prompt=prompt, model="llama-3.2-3b-preview")

print(response)

In [0]:
prompt = 'Explain Generative AI in 2 bullet points'
response = get_completion_chatgroq(prompt=prompt, model="llama-3.2-90b-vision-preview")

print(response)
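The last two cells run the same prompt against two models by hand. A small sketch of doing this in a loop is shown below. `compare_models` is a hypothetical helper that accepts any completion function, and `fake_complete` is an offline stand-in so the example runs without an API key.

```python
def compare_models(prompt, models, complete):
    """Run one prompt through several models; `complete` is any callable(prompt, model)."""
    return {model: complete(prompt=prompt, model=model) for model in models}


# Offline stand-in for a real completion function.
def fake_complete(prompt, model):
    return f"[{model}] would answer: {prompt}"


results = compare_models(
    "Explain Generative AI in 2 bullet points",
    ["llama-3.2-3b-preview", "llama-3.2-90b-vision-preview"],
    fake_complete,
)
for model, answer in results.items():
    print(model, "->", answer)
```

To call Groq for real, pass `get_completion_chatgroq` in place of `fake_complete`.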