# Tutorial 6: HuggingFace Client

In this tutorial, we'll explore how to use the `HuggingFaceClient`. Be aware, however, that this is not the recommended way to use RadPrompter: API-based clients such as `vLLMClient` or `OllamaClient` offer much faster inference.

## Installation

If you don't have `RadPrompter` installed, you can install it using pip:

```bash
pip install radprompter
```

## Prompt

As always, we start by importing the `Prompt` class and creating a prompt object from a TOML file:

In [1]:
from radprompter import Prompt

prompt = Prompt('06_HuggingFace-Client.toml')
prompt





## Client and Engine

We'll use the new `HuggingFaceClient`. This client accepts a model and tokenizer loaded from HuggingFace, which allows any customization of the model, including quantization.
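
To give a feel for why quantization matters, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone at different precisions. It assumes about 4 billion parameters (taken from the "4b" in the name of the model used in this tutorial) and ignores activations, the KV cache, quantization overhead, and any layers kept in higher precision, so treat the numbers as a lower bound. The helper `weight_memory_gb` is a hypothetical name used only for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 4 billion parameters, based on the "4b" in the model name.
NUM_PARAMS = 4e9

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, 4-bit loading needs roughly a quarter of the weight memory that bfloat16 does.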

Before running the next cell, make sure you have the following libraries installed:

```bash
pip install torch transformers flash_attn SentencePiece accelerate bitsandbytes
```

In [2]:
import os
import torch
from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU to this process
model_name = "google/medgemma-4b-it"

# Load the weights in 4-bit precision and run computations in bfloat16
quantization_conf = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="eager",  # required for Gemma models (https://github.com/google-deepmind/gemma/issues/169)
    quantization_config=quantization_conf,
    device_map="auto",  # place the model on the available device(s) automatically
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


In [3]:
from radprompter import RadPrompter, HuggingFaceClient

client = HuggingFaceClient(
    hf_model = model,
    hf_tokenizer = tokenizer,
    temperature = 0.0,
    seed = 42
)

engine = RadPrompter(
    client=client,
    prompt=prompt, 
    output_file="output_tutorial_6.csv",
    concurrency=1,
    hide_blocks=False,
)



And we run it on our sample reports:

In [None]:
import glob

report_files = glob.glob("../../sample_reports/*.txt")

reports = []
for file in report_files:
    with open(file, "r") as f:
        reports.append({"report": f.read(), "file_name": file})

engine(reports)
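
One detail worth noting: `glob.glob` returns files in an arbitrary, OS-dependent order, so the row index in the output file may differ between runs. A minimal sketch of a sorted loader is shown below; `load_reports` is a hypothetical helper (not part of RadPrompter), and UTF-8 encoding of the report files is an assumption.

```python
from pathlib import Path


def load_reports(directory: str, pattern: str = "*.txt") -> list[dict[str, str]]:
    """Read every matching file into a {"report", "file_name"} dict, in sorted order."""
    items = []
    for path in sorted(Path(directory).glob(pattern)):
        items.append({
            "report": path.read_text(encoding="utf-8"),  # assumption: files are UTF-8
            "file_name": str(path),
        })
    return items


reports = load_reports("../../sample_reports")
print(f"Loaded {len(reports)} reports")
```

The resulting list has the same shape as the one built in the cell above, so it could be passed to `engine(reports)` in the same way.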

The engine processes each report and saves the results to `output_tutorial_6.csv`.

In [5]:
import pandas as pd

df = pd.read_csv("output_tutorial_6.csv", index_col='index')
df

| index | default_response | report | file_name |
|---|---|---|---|
| 0 | ERROR | Clinical Information:\n67-year-old male with s... | ../../sample_reports/sample_report_1.txt |
| 1 | ERROR | Clinical Information:\n72-year-old female with... | ../../sample_reports/sample_report_2.txt |
| 2 | ERROR | Here is an example radiology report describing... | ../../sample_reports/sample_report_3.txt |
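
Every row in the output above shows `ERROR` in the `default_response` column, so it is worth checking the output file before relying on it. The sketch below counts such rows using only the standard library. It assumes that the literal string `ERROR` is how a failed item is recorded, as the output above suggests; `summarize_responses` is a hypothetical helper, not part of RadPrompter.

```python
import csv
from pathlib import Path


def summarize_responses(csv_path: str, column: str = "default_response") -> dict[str, int]:
    """Count all rows and the rows whose response column is the literal string "ERROR"."""
    total = errors = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            # Assumption: a failed item is stored as the literal string "ERROR".
            if (row.get(column) or "").strip() == "ERROR":
                errors += 1
    return {"total": total, "errors": errors}


# Only run the check if the tutorial's output file is present.
if Path("output_tutorial_6.csv").exists():
    print(summarize_responses("output_tutorial_6.csv"))
```

If the error count equals the total, as it would for the output shown here, the model setup is the first thing to investigate before processing more reports.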


Finally, we save the log:

In [6]:
engine.save_log("log_tutorial_6.log")

with open("log_tutorial_6.log", "r") as f:
    print(f.read())

RadPrompter Version: 2.0.0
Model: Gemma3ForConditionalGeneration
Seed: 42
Temperature: 0.0
Frequency Penalty: 0.0
Top-P: 0.9
Prompt TOML: /opt/localdata/Data/Datathon_BKh/LLMSynth/RadPrompter/tutorials/06_HuggingFace-Client/06_HuggingFace-Client.toml
Prompt Version: 0.1
Prompt Hash: 83cd2cee67b7e10589df0b8f9e9fd2f2
Concurrency Factor: 1
Use Pydantic: False
Start Time: 2025-06-30 17:17:05
End Time: 2025-06-30 17:17:05
Duration: 0.0
Number of Items: 3
Average Processing Time: 0.0


-------------------- *** - Prompt Content - *** --------------------
[METADATA]
version = 0.1
description = "A sample prompt for RadPrompter"

[PROMPTS]
system_prompt = "You are a helpful assistant that has 20 years of experience in reading radiology reports and extracting data elements."

user_prompt_intro = """
Carefully review the provided chest CT report (in the <report> tag). Ensure that each data element is accurately captured. Here is the report:
<report>
{{report}}
</report>
"""

user_prompt_cot = """
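
The header of the log is a series of `Key: Value` lines, which makes it easy to read back programmatically, for example to compare settings across runs. The sketch below is based only on the layout printed above, not on a documented format, and `parse_log_header` is a hypothetical helper. The sample text is a shortened copy of the log shown in this tutorial.

```python
def parse_log_header(log_text: str) -> dict[str, str]:
    """Collect "Key: Value" lines that precede the prompt-content separator."""
    header = {}
    for line in log_text.splitlines():
        if "Prompt Content" in line:
            break  # everything after the separator is the echoed prompt TOML
        key, sep, value = line.partition(": ")
        if sep:  # skip blank lines and lines without a "Key: Value" shape
            header[key.strip()] = value.strip()
    return header


# Shortened copy of the log printed above.
sample = """RadPrompter Version: 2.0.0
Model: Gemma3ForConditionalGeneration
Start Time: 2025-06-30 17:17:05
Duration: 0.0
Number of Items: 3

-------------------- *** - Prompt Content - *** --------------------
[METADATA]
version = 0.1
"""

header = parse_log_header(sample)
print(header["Model"], float(header["Duration"]), int(header["Number of Items"]))
```

A duration of `0.0` for three items, together with the `ERROR` responses in the output file, may indicate that the items failed before any text was generated.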


The `HuggingFaceClient` is a convenient way to start exploring RadPrompter's capabilities, but it is **not the best option**. We strongly recommend the `vLLMClient` or `OllamaClient`, as they support concurrency and are more stable for batch document processing.