# Hugging Face API

The [Hugging Face](https://huggingface.co) API makes open-source and open-weight models available locally through the `transformers` library. Once a model is loaded into memory, we can reuse it for further prompts instead of reloading it each time. This is beneficial in scenarios where we want to prompt the same model many times.

In [1]:
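def prompt_hf(request, model="meta-llama/Meta-Llama-3.1-8B"):
    """Send a prompt to a locally loaded Hugging Face text-generation model."""
    import transformers
    import torch

    # Load the pipeline only on the first call and cache it on the function object.
    # Note: `model` is only used on that first call; later calls reuse the cached pipeline.
    if prompt_hf._pipeline is None:
        prompt_hf._pipeline = transformers.pipeline(
            "text-generation", model=model, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto"
        )

    # max_length limits the prompt plus the generated tokens.
    return prompt_hf._pipeline(request,
                               max_length=1000,
                               num_return_sequences=1)[0]['generated_text']

# No pipeline is loaded until the first call.
prompt_hf._pipeline = None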

We can then submit a prompt to the LLM like this:

In [2]:
prompt_hf("What is the capital of France?")

config.json:   0%|          | 0.00/826 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

Some parameters are on the meta device device because they were offloaded to the cpu.


tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  attn_output = torch.nn.functional.scaled_dot_product_attention(


'What is the capital of France? The answer is Paris. The capital of France is Paris. This is the place where the government is located. The following map shows the location of the capital of France.'

Because the pipeline stays loaded in memory, a second call skips the download and model-loading steps shown above and is typically faster:

In [None]:
prompt_hf("What is the capital of the Czech Republic?")

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
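To check whether the second call really is faster, one option is to time both calls. The following is a minimal sketch; `timed` is a hypothetical helper name that is not part of the notebook above, and the actual durations depend on hardware and on whether the model files are already downloaded.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage in this notebook (assumes prompt_hf from above is defined):
# _, first = timed(prompt_hf, "What is the capital of France?")
# _, second = timed(prompt_hf, "What is the capital of the Czech Republic?")
# print(f"first call: {first:.1f} s, second call: {second:.1f} s")
```

The first measurement includes downloading and loading the model only if no pipeline has been created yet in the current session.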


## Exercise

Explore the [Hugging Face Hub for more text-generation models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending). Download one and test it using the function above. Also read its documentation and consider updating the function above according to its recommendations and examples.