# An Evaluation of Medical Large Language Models – September 7, 2025 - version 0

## 0. Key Libraries in the Environment


- **llama-cpp-python** (imported as `llama_cpp`): Core library for running quantized GGUF large language models locally on CPU and GPU.

- **huggingface-hub**: Provides tools to download and cache models from the Hugging Face Hub, enabling seamless model access.

- **torch**: PyTorch, a deep learning framework often required by auxiliary tooling or conversion scripts.

- **bitsandbytes**: Efficient 8-bit optimizers and quantization utilities, helpful for memory-efficient model handling.

- **accelerate**: Hugging Face library to streamline distributed model training and inference.

- **numpy**: Fundamental package for numerical computations and array operations.

- **ipython / ipykernel / jupyter_client**: Interactive environment and kernel support for running and managing Jupyter notebooks.

- **requests**: For HTTP requests, used by huggingface-hub and other networking needs.

- **nvidia-cuda-\* packages**: CUDA toolkit components for GPU acceleration, needed to offload inference workloads to the GPU.

- **tqdm**: Progress bar utility for monitoring long-running processes.
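
To record which versions of these libraries are actually present, a small check with the standard library's `importlib.metadata` can be run in the notebook. This is a minimal sketch: the distribution names in `PACKAGES` are assumed to match the PyPI names of the libraries listed above, and `installed_versions` is a hypothetical helper, not part of any of those libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names for the libraries listed above.
PACKAGES = (
    "llama-cpp-python", "huggingface-hub", "torch", "bitsandbytes",
    "accelerate", "numpy", "ipython", "ipykernel", "jupyter-client",
    "requests", "tqdm",
)


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


for name, found in installed_versions().items():
    print(f"{name:20s} {found or 'not installed'}")
```

Pasting the printed table into the notebook makes later runs easier to compare.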


### Machine Hardware Characteristics

The machine running this notebook has the following key CPU and GPU specifications:

- **Architecture:** x86_64 (64-bit architecture supporting both 32-bit and 64-bit operation modes)
- **Addressable Memory:** 39-bit physical addressing, 48-bit virtual addressing
- **CPU Details:**
  - Logical CPUs: 32 (IDs 0-31)
  - Vendor: Intel (GenuineIntel)
  - Model: Intel Core i9-14900HX
    - 24 cores and 32 threads
    - Hybrid architecture with high-performance cores and efficient cores
    - Max turbo frequency up to 5.8 GHz
    - Supports DDR5/DDR4 RAM, up to 192 GB memory capacity
    - Built on the Intel 7 process (10 nm class)

- **GPU Details:**
  - Model: NVIDIA GeForce RTX 4080 (Laptop GPU variant)
  - Driver Version: 575.64.03, CUDA 12.9
  - GPU Temperature: 39°C
  - Power Usage: 16W out of 90W capacity
  - Memory: 12,282 MiB total, 15 MiB currently used
  - GPU Utilization: 0% (idle state)
  - Processes using GPU: Xorg display server consuming 4 MiB GPU memory

This combination of a 32-thread CPU and a laptop GPU with about 12 GB of memory is enough to run the 4-bit quantized 3B and 7B models tested below locally, with some or all layers offloaded to the GPU.
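
The figures above can be collected programmatically rather than copied by hand. The sketch below is one possible way to do it, assuming `nvidia-smi` is on the `PATH` when an NVIDIA GPU is present; `describe_machine` is a hypothetical helper name, and the function simply returns an empty GPU list when the tool is missing or fails.

```python
import os
import platform
import shutil
import subprocess

# Fields requested from nvidia-smi's --query-gpu option.
GPU_FIELDS = "name,driver_version,memory.total,memory.used,temperature.gpu,power.draw"


def describe_machine():
    """Return basic CPU info plus one dict per GPU reported by nvidia-smi."""
    info = {
        "architecture": platform.machine(),
        "logical_cpus": os.cpu_count(),
        "gpus": [],
    }
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return info  # no NVIDIA tooling available
    try:
        result = subprocess.run(
            [smi, f"--query-gpu={GPU_FIELDS}", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return info  # tool present but query failed
    for line in result.stdout.strip().splitlines():
        values = [field.strip() for field in line.split(",")]
        info["gpus"].append(dict(zip(GPU_FIELDS.split(","), values)))
    return info


print(describe_machine())
```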


## 1. Testing BioMistral-7B

**BioMistral-7B.Q4_K_M.gguf Model Summary**

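- 7.24 billion parameter biomedical language model built on Mistral-7B; the GGUF metadata reports its architecture as `llama`.
- Uses efficient 4-bit quantization (Q4_K_M) in GGUF format for fast local inference.
- The GGUF metadata reports a context length of 32,768 tokens (`llama.context_length`); the test in this section uses only `n_ctx=512`.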
- Compatible with llama.cpp and runs on consumer CPUs and GPUs.
- Excels at generating coherent biomedical text, chat, and summarization.
- Reduced memory use thanks to quantization, enabling wider hardware support.
- Some accuracy loss due to quantization trade-offs.
- Large contexts require significant RAM and GPU power.
- Static knowledge cutoff, no real-time updates or internet access.
- May generate hallucinated or incorrect outputs; human oversight needed.
- Best suited for biomedical NLP research, prototyping, and assistive tasks, not autonomous clinical decision-making.
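
The memory trade-offs in this list can be made concrete with a back-of-the-envelope estimate. The sketch below rests on stated assumptions: roughly 4.85 bits per weight for Q4_K_M (an approximate figure, not taken from this notebook), 8 key/value heads as in Mistral-7B (the loader log in this section is cut off before that key), a head dimension of 128, 32 layers, and a 16-bit KV cache. It ignores activation buffers and runtime overhead.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (factor 2) stored for every layer and context position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


# Assumed values, see the paragraph above.
N_PARAMS = 7.24e9
BITS_PER_WEIGHT = 4.85   # assumption for Q4_K_M
N_LAYERS = 32            # llama.block_count
N_KV_HEADS = 8           # assumption: Mistral-7B grouped-query attention
HEAD_DIM = 128

print(f"weights: {weight_bytes(N_PARAMS, BITS_PER_WEIGHT) / GIB:.2f} GiB")
for n_ctx in (512, 32768):
    size = kv_cache_bytes(N_LAYERS, n_ctx, N_KV_HEADS, HEAD_DIM)
    print(f"KV cache at n_ctx={n_ctx}: {size / MIB:.0f} MiB")
```

Under these assumptions the KV cache grows from 64 MiB at 512 tokens to 4 GiB at 32,768 tokens, which is why the full context window is demanding on a 12 GB GPU.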


This cell imports the `Llama` class from `llama_cpp` and loads the `BioMistral-7B.Q4_K_M.gguf` file from a local path, using a 512-token context window, 8 CPU threads, and 20 layers offloaded to the GPU.

In [25]:
from llama_cpp import Llama

# Path to the locally stored quantized model file
model_path = "./biomistral_model/BioMistral-7B.Q4_K_M.gguf"

llm = Llama(
    model_path=model_path,
    n_ctx=512,        # context window in tokens
    n_threads=8,      # CPU threads used for inference
    n_gpu_layers=20,  # offload 20 of the model's 32 layers to the GPU
)

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from ./biomistral_model/BioMistral-7B.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = hub
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_cou
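
The metadata lines in this log are regular enough to be parsed into a dictionary, which is handy when comparing models. The sketch below is a hypothetical helper (`parse_gguf_metadata` is not part of llama.cpp) and assumes the `llama_model_loader: - kv N: key type = value` layout seen above; lines that do not fit, such as the truncated last line, are skipped.

```python
import re

# Matches lines such as:
# llama_model_loader: - kv   2:   llama.context_length u32   = 32768
KV_LINE = re.compile(
    r"llama_model_loader: - kv\s+\d+:\s+(?P<key>\S+)\s+(?P<type>\S+)\s+=\s+(?P<value>.*)"
)
INT_TYPES = {"u8", "i8", "u16", "i16", "u32", "i32", "u64", "i64"}


def parse_gguf_metadata(log_text):
    """Extract key/value pairs from llama_model_loader metadata lines."""
    metadata = {}
    for line in log_text.splitlines():
        match = KV_LINE.match(line.strip())
        if match is None:
            continue  # not a complete metadata line
        value = match["value"].strip()
        if match["type"] in INT_TYPES:
            value = int(value)
        metadata[match["key"]] = value
    return metadata


sample = """\
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   7:                 llama.attention.head_cou
"""
print(parse_gguf_metadata(sample))
```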

This cell sends the prompt `"Explain why rapomycine is effective in treatment of depression"` to the loaded model `llm` and requests up to 300 tokens. It then extracts the generated text from the first entry in `output['choices']`, strips leading and trailing whitespace, and prints it after a "Response:" label. Note that the prompt misspells "rapamycin"; the model's answer uses the correct spelling.


In [22]:
prompt = "Explain why rapomycine is effective in treatment of depression"

output = llm(prompt, max_tokens=300)

# Extract and clean the generated text
response_text = output['choices'][0]['text'].strip()

print("Response:\n" + response_text)

llama_perf_context_print:        load time =     347.46 ms
llama_perf_context_print: prompt eval time =     347.36 ms /    14 tokens (   24.81 ms per token,    40.30 tokens per second)
llama_perf_context_print:        eval time =    6607.62 ms /    59 runs   (  111.99 ms per token,     8.93 tokens per second)
llama_perf_context_print:       total time =    6970.32 ms /    73 tokens
llama_perf_context_print:    graphs reused =         56


Response:
: Rapamycin is effective in treatment of depression because it inhibits the mTOR pathway. Inhibition of mTOR pathway leads to reduction in neuroinflammation and increases in neurogenesis. This neuroprotective effect may be effective in treatment of depression.
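
The throughput figures in the `llama_perf_context_print` lines can be recomputed from the raw times and token counts, which is useful when collecting results across several runs. This is a minimal sketch: `throughput` is a hypothetical helper, and the regular expression assumes the log layout shown above.

```python
import re

# Matches timing lines that report "<stage> time = <ms> ms / <n> tokens|runs".
PERF_LINE = re.compile(
    r"llama_perf_context_print:\s+(?P<stage>[a-z ]+?) time =\s+"
    r"(?P<ms>[\d.]+) ms /\s+(?P<n>\d+) (?:tokens|runs)"
)


def throughput(line):
    """Return (stage, tokens per second), or None if the line has no token count."""
    match = PERF_LINE.search(line)
    if match is None:
        return None
    seconds = float(match["ms"]) / 1000.0
    return match["stage"].strip(), int(match["n"]) / seconds


log = """\
llama_perf_context_print:        load time =     347.46 ms
llama_perf_context_print: prompt eval time =     347.36 ms /    14 tokens (   24.81 ms per token,    40.30 tokens per second)
llama_perf_context_print:        eval time =    6607.62 ms /    59 runs   (  111.99 ms per token,     8.93 tokens per second)
"""
for line in log.splitlines():
    result = throughput(line)
    if result is not None:
        stage, tokens_per_second = result
        print(f"{stage}: {tokens_per_second:.2f} tokens/s")
```

For the run above this reproduces the logged rates: about 40.30 tokens per second for prompt evaluation and 8.93 tokens per second for generation.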


In [23]:
prompt = "A patient reports chest pain radiating to the left arm. What workup should be done?"

output = llm(prompt, max_tokens=300)

# Extract and clean the generated text
response_text = output['choices'][0]['text'].strip()

print("Response:\n" + response_text)


Llama.generate: 1 prefix-match hit, remaining 19 prompt tokens to eval
llama_perf_context_print:        load time =     347.46 ms
llama_perf_context_print: prompt eval time =     412.38 ms /    19 tokens (   21.70 ms per token,    46.07 tokens per second)
llama_perf_context_print:        eval time =    4575.46 ms /    39 runs   (  117.32 ms per token,     8.52 tokens per second)
llama_perf_context_print:       total time =    4997.77 ms /    58 tokens
llama_perf_context_print:    graphs reused =         37


Response:
A: ECG, Troponin, Echocardiography
B: ECG, Troponin, CT Chest
C: ECG, Troponin, Cardiac MRI


## 2. Testing Bio-Medical-3B

**Bio-Medical-3B-CoT-012025-GGUF Model Summary**

- 3 billion parameter language model fine-tuned from Qwen2.5-3B-Instruct for biomedical tasks.
- Incorporates Chain-of-Thought (CoT) prompting to enhance logical reasoning and step-by-step interpretability.
- Trained on a high-quality mixture of over 600,000 synthetic and real biomedical samples, ensuring broad domain coverage.
- Designed to generate context-aware, reasoning-driven biomedical text suited for complex question answering and clinical summarization.
- Excels at tasks like hypothesis generation, biomedical data interpretation, and supporting evidence-based clinical decisions.
- Available in efficient GGUF quantized format optimized for local inference with llama.cpp on CPUs and GPUs.
- Balances computational efficiency with advanced reasoning capabilities suitable for research and clinical support.
- Has a static knowledge cutoff and requires human supervision to mitigate hallucinations or inaccurate outputs.
- Intended as an assistive tool for healthcare professionals, researchers, and educators—not for fully autonomous clinical use.

This code snippet demonstrates downloading a model file from Hugging Face Hub and running inference locally using the llama.cpp Python bindings:

- `hf_hub_download` fetches the specified model file (`bio-medical-3b-cot-012025-q4_0.gguf`) from the Hugging Face repository `matrixportalx/Bio-Medical-3B-CoT-012025-GGUF`. It handles caching and efficient downloads, returning the local file path.
- The `Llama` class loads the model from this downloaded file, configuring:
  - `n_ctx=2048` to allow a 2048-token context window for processing longer prompts or documents.
  - `n_threads=8` to utilize 8 CPU threads for faster computation.
  - `n_gpu_layers=-1` to offload all possible model layers to the GPU for accelerated inference.
  - `verbose=True` to print detailed runtime logs including GPU usage.
- A medical prompt about chest pain is passed to the model to generate a detailed response with up to 500 tokens, controlling randomness via `temperature=0.7` and nucleus sampling with `top_p=0.9`.
- Finally, the generated answer is extracted from the response object and printed after trimming whitespace, providing a clean and concise output.
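
To illustrate what `temperature` and `top_p` do to the next-token distribution, the following pure-Python sketch applies temperature scaling and nucleus (top-p) filtering to a toy list of logits. It is an illustration of the two ideas only, not llama.cpp's sampler implementation, which chains additional steps; the function names and the toy logits are made up for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token index from the filtered distribution."""
    candidates = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(list(candidates), weights=list(candidates.values()))[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
print(top_p_filter(softmax_with_temperature(toy_logits, 0.7), 0.9))
print(sample_token(toy_logits))
```

Lowering `temperature` concentrates probability on the top token, while lowering `top_p` removes more of the unlikely tail before sampling.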

This approach keeps the workflow local: the Hugging Face tooling downloads and caches the model file once, and llama.cpp then runs inference from that cached file with GPU acceleration.
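
The path returned by `hf_hub_download` follows the Hugging Face cache layout, where the repository ID is turned into a folder name and each revision gets its own `snapshots` subfolder. The sketch below shows that mapping with two hypothetical helpers; the example path uses a placeholder cache root and a placeholder revision hash, and the layout is assumed to be `<cache>/hub/models--<org>--<name>/snapshots/<revision>/<file>`.

```python
from pathlib import PurePosixPath


def cache_folder_name(repo_id, repo_type="model"):
    """Folder name used in the cache for a repository ID such as 'org/name'."""
    return f"{repo_type}s--" + repo_id.replace("/", "--")


def split_snapshot_path(path):
    """Split a cached file path into (repo folder, revision, file name)."""
    parts = PurePosixPath(path).parts
    index = parts.index("snapshots")
    folder, revision = parts[index - 1], parts[index + 1]
    filename = "/".join(parts[index + 2:])
    return folder, revision, filename


repo_id = "matrixportalx/Bio-Medical-3B-CoT-012025-GGUF"
# Placeholder cache root and revision hash, for illustration only.
example = (
    "/cache/hub/" + cache_folder_name(repo_id)
    + "/snapshots/0123abcd/bio-medical-3b-cot-012025-q4_0.gguf"
)
print(cache_folder_name(repo_id))
print(split_snapshot_path(example))
```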


In [24]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_name = "matrixportalx/Bio-Medical-3B-CoT-012025-GGUF"
model_file = "bio-medical-3b-cot-012025-q4_0.gguf"
model_path = hf_hub_download(model_name, filename=model_file)

llm = Llama(
    model_path=model_path,
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=-1,  # -1 means offload all possible layers to GPU
    verbose=True      # Show logs including GPU usage info
)

prompt = "A patient reports chest pain radiating to the left arm. What workup should be done?"
response = llm(prompt, max_tokens=500, temperature=0.7, top_p=0.9)

print(response['choices'][0]['text'].strip())

llama_model_loader: loaded meta data with 38 key-value pairs and 434 tensors from /home/slmar/.cache/huggingface/hub/models--matrixportalx--Bio-Medical-3B-CoT-012025-GGUF/snapshots/90191c301dfbe946f14ef47e1375860d8440f6c8/bio-medical-3b-cot-012025-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 3B Instruct
llama_model_loader: - kv   3:                            general.version str              = 012025
llama_model_loader: - kv   4:                       general.organization str              = Qwen
llama_model_loader: - kv   5:                           general.finetune str              = Instruct
llama_model_

Is it an ECG or an MRI? I know both are used to evaluate cardiac issues, but which one is more appropriate for this situation?

When someone experiences chest pain radiating to the left arm, it's crucial to choose the right diagnostic test to ensure accurate identification of the issue. An ECG is often the first step because it's quick and gives immediate information about heart rhythm and any ischemic changes. It's like having a quick look at the heart without needing too much time or equipment. But, oh, it can't show blood vessels or other structural issues inside the heart.

So, if we think about it, we need something more detailed, right? That's where an MRI comes into play. An MRI can actually show the heart's structure and blood vessels in a really detailed way. That's important because it can catch problems like issues in the coronary arteries, which might not show up on an ECG.

Alright, let's think about the situation with chest pain radiating to the arm. This could mean a hea