<a href="https://colab.research.google.com/github/RSNA/AI-Deep-Learning-Lab-2023/blob/main/sessions/nlp-text-classification/RSNA23_llama_cpp_report_labeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RSNA 2023: Deep Learning Lab
## Report Labeling with Llama.cpp

Llama.cpp is a project led by Georgi Gerganov that was initially designed as a pure C/C++ implementation of the Llama large language model developed and open-sourced by Meta's AI team.

Quoted from the llama.cpp GitHub repository:

>The main goal of llama.cpp is to run the LLaMA model using 4-bit integer quantization on a MacBook
> - Plain C/C++ implementation without dependencies
> - Apple silicon first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
> - AVX, AVX2 and AVX512 support for x86 architectures
> - Mixed F16 / F32 precision
> - 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization support
> - CUDA, Metal and OpenCL GPU backend support

In lay terms, this means that we can implement these models in such a way that they can be run on nearly any physical or virtual machine! **You don't need an industrial-grade, multi-GPU server to use open-source LLMs locally.**
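
To make the quantization point concrete, here is a rough back-of-the-envelope sketch of how weight storage scales with the number of bits per weight. The function name `approx_weight_size_gb` and the round 7-billion parameter count are illustrative assumptions; real GGUF files mix precisions and carry metadata, so actual file sizes will differ somewhat.

```python
def approx_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # assumed round figure for a "7B" model
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{approx_weight_size_gb(n_params, bits):.1f} GB")
```

Under these assumptions, 16-bit weights need roughly 14 GB while 4-bit weights need roughly 3.5 GB, which is the same order of magnitude as the 4.08 GB model file downloaded later in this notebook.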

## When to Use an LLM Locally
* You have sensitive data that you don't want to send to OpenAI's servers, where it could be stored and used to train future models
    - Virtually all healthcare data
* You want to fine-tune an open-source LLM for a specific purpose

## Overview of This Module
1. Install llama-cpp-python (the Python bindings for llama.cpp) and Hugging Face Hub (to download model files)
2. Download the 7 billion parameter Llama2 model fine-tuned for chat
3. Engineer a prompt to have the LLM read a chest radiography report and return structured labels for specific findings in JSON format.
4. Test a few example reports on Llama2-7B-Chat.
5. Repeat the process for the Mistral-7B-Instruct-v0.1 model and compare the results.

> _Note: At the time this module was developed, Mistral-7B was the best open-source, 7B parameter model available. This field is moving very quickly, so this could well change before the end of the year._

## References
- Llama.cpp on GitHub: https://github.com/ggerganov/llama.cpp
- Meta AI's Llama 2: https://ai.meta.com/llama/
- MistralAI's Mistral-7B: https://mistral.ai/news/announcing-mistral-7b/
- Hugging Face models:
    * [TheBloke/Llama-2-7B-Chat-GGUF](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF)
    * [TheBloke/Mistral-7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF)

> _Note: If you would like to experiment with other models, please search for the "GGUF" version of the model on Hugging Face._

In [None]:
# @title Install llama.cpp and HuggingFace Hub
# @markdown This cell takes approximately 2 minutes to run. The output is suppressed, so if no error is shown, you may assume that it worked.

%%capture

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.11 --force-reinstall --upgrade --no-cache-dir
!pip install huggingface_hub==0.18.0

In [None]:
# @title Importing the necessary libraries

from huggingface_hub import hf_hub_download
from llama_cpp import Llama
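import regex as re  # third-party "regex" package (not the stdlib "re"); needed for the recursive (?R) pattern used later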
import json

In [None]:
# @title Select the model you'd like to test

# Comment out these two lines and uncomment the two lines at the bottom of the cell to test out Mistral-7B
model_name = "TheBloke/Llama-2-7b-Chat-GGUF"
model_basename = "llama-2-7b-chat.Q4_K_M.gguf"

# model_name = "TheBloke/Mistral-7B-Instruct-v0.1-GGUF"
# model_basename = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"

In [None]:
# @title Download the model from Hugging Face Hub

model_path = hf_hub_download(repo_id=model_name, filename=model_basename)
print(model_path)

llama-2-7b-chat.Q4_K_M.gguf:   0%|          | 0.00/4.08G [00:00<?, ?B/s]

/root/.cache/huggingface/hub/models--TheBloke--Llama-2-7b-Chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/llama-2-7b-chat.Q4_K_M.gguf


In [None]:
# @title Initialize the llama.cpp constructor

# Feel free to play around with different hyperparameters below

lcpp_llm = Llama(
    model_path=model_path,
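    n_threads=2, # Number of CPU threads to use
    n_batch=512, # Prompt tokens processed per batch; should be between 1 and n_ctx, and is typically a power of 2. Larger values use more VRAM.
    n_gpu_layers=36, # Number of model layers to offload to the GPU; change this based on your model and your available VRAM.
    n_ctx=2048, # Context window = maximum number of tokens (prompt plus generated output)
    n_gqa=8, # Grouped-query attention setting carried over from older GGML-era examples; likely unnecessary for GGUF files, which store this in their metadata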
)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
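
Because the context window set by `n_ctx` has to hold the prompt and the generated tokens together, it can help to check the budget before calling the model. The helper below is a minimal sketch: `fits_context` is a hypothetical name, and the prompt token count is assumed to come from something like `len(lcpp_llm.tokenize(prompt.encode("utf-8")))`.

```python
def fits_context(n_prompt_tokens: int, max_tokens: int, n_ctx: int = 2048) -> bool:
    """Return True if the prompt plus the requested completion fits in the context window."""
    return n_prompt_tokens + max_tokens <= n_ctx

# Example numbers only: two prompt sizes against a 512-token completion limit
print(fits_context(n_prompt_tokens=300, max_tokens=512, n_ctx=2048))   # True
print(fits_context(n_prompt_tokens=1800, max_tokens=512, n_ctx=2048))  # False
```

If the check fails, the usual options are a shorter prompt, a smaller `max_tokens`, or a larger `n_ctx` (at the cost of more memory).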


## Prompt Engineering

Prompt engineering has emerged as an important skill set in getting LLMs to execute your desired task. For this, you should know whether the model you are using expects a specific **prompt template**. We build the complete prompt in the following steps:

1. We start with a `system` prompt. This gives the LLM a role to play in the requests that follow.
2. We implement a JSON `schema` to prompt the LLM to return structured labels for each report we submit.
3. We provide a sample `report` for the LLM to analyze.
4. We construct the `prompt` that will present the report text to the model, ask it to use the JSON schema provided, and analyze the report for the findings included in the schema.
5. Finally, we utilize the `prompt templates` for the Llama-2-Chat and Mistral-7B-Instruct-v0.1 models to construct our complete prompt.

> _Note: Mistral-7B does not have a separate delimiter for the system role, so we pass that portion of the prompt with the remainder._

For more details on prompt engineering, see this guide: [Prompt Engineering Guide](https://www.promptingguide.ai/)

In [None]:
# @title System prompt

system = '''You are an expert radiologist's assistant, skilled in analyzing radiology reports.
Please first provide a response to any specific requests. Then explain your reasoning, as appropriate.'''

In [None]:
# @title Construct JSON schema

schema = '''
{
    "cardiomegaly": { "type": "boolean" },
    "lung_opacity": { "type": "boolean" },
    "pneumothorax": { "type": "boolean" },
    "pleural_effusion": { "type": "boolean" },
    "pulmonary_edema": { "type": "boolean" },
    "abnormal_study": { "type": "boolean" }
}
'''

In [None]:
# @title Provide a sample chest radiograph report

report_text = "No focal consolidation, pneumothorax, or pleural effusion. Cardiomediastinal silhouette is stable and unremarkable. No acute osseous abnormalities are identified. No acute cardiopulmonary abnormality."

In [None]:
# @title Construct User prompt

prompt = f'''
```{report_text}```

Please extract the findings from the preceding text radiology report using the following JSON schema:
```{schema}```
Note that "lung_opacity" may include nodule, mass, atelectasis, or consolidation.
'''

In [None]:
# @title Llama-2-Chat & Mistral-7B-Instruct-v0.1 prompt templates

llama2_prompt_template = f'''[INST] <<SYS>>
{system}
<</SYS>>
{prompt}[/INST]
'''

mistral_prompt_template = f'''<s>[INST] {system} {prompt} [/INST]'''
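
Since switching models means switching templates, one option is to wrap both templates in a single helper. This is only a sketch that mirrors the two f-strings above; the function name `build_prompt` and the `model_family` labels `"llama2"` and `"mistral"` are hypothetical and not part of either model's API.

```python
def build_prompt(system: str, user: str, model_family: str) -> str:
    """Wrap system and user text in the template this notebook uses for each model family."""
    if model_family == "llama2":
        # Llama-2-Chat has a dedicated <<SYS>> block for the system role
        return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n{user}[/INST]\n"
    if model_family == "mistral":
        # Mistral-7B-Instruct has no system delimiter, so system and user text are joined
        return f"<s>[INST] {system} {user} [/INST]"
    raise ValueError(f"Unknown model family: {model_family!r}")

print(build_prompt("You are a helpful assistant.", "Summarize this report.", "mistral"))
# <s>[INST] You are a helpful assistant. Summarize this report. [/INST]
```

Raising on an unknown family is a deliberate choice here: silently falling back to the wrong template tends to degrade output quality in ways that are hard to spot.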

In [None]:
# @title Generate LLM response and print response text

response = lcpp_llm(
    prompt=llama2_prompt_template, # Comment out this line and uncomment the line below to test Mistral-7B
    # prompt=mistral_prompt_template,
    max_tokens=512,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    # echo=True, # return the prompt
)

res_txt = response["choices"][0]["text"]
print(res_txt)

Of course! I'd be happy to help you extract the findings from the radiology report using the provided JSON schema. Here are the results of my analysis:
{
    "cardiomegaly": false,
    "lung_opacity": {
        "type": "boolean",
        "value": true
    },
    "pneumothorax": false,
    "pleural_effusion": false,
    "pulmonary_emia": false,
    "abnormal_study": true
}
Explanation:
* The report states that there is no focal consolidation, pneumothorax, or pleural effusion. Therefore, the values for these fields in the JSON schema are set to false.
* The report does mention that the cardiomediastinal silhouette is stable and unremarkable, which means that the field "cardiomegaly" should be set to false as well.
* Under the "lung_opacity" field, the report mentions that there is opacity in both lungs, which could indicate nodules, masses, atelectasis, or consolidation. Therefore, the value for this field is set to true.
* The report does not mention any acute osseous abnormalities or 

## Limitations of this Approach

1. **Errors:** You may notice when using Llama-2-7B-Chat or other models that the JSON returned does not match what we requested, or even contains an error such as `pulmonary_edema` becoming `pulmonary_emia`.
    - This can be improved by simplifying your request for smaller models or using a model that is better trained for returning structured data in JSON format, like Mistral-7B.
    - Playing around with some of the model inference hyperparameters can also help. See this guide for further details: [Prompt Engineering Guide: LLM Settings](https://www.promptingguide.ai/introduction/settings)
2. **Hallucinations:** LLMs can provide very confident answers that are flat-out wrong. You may see output like `"Under the 'lung_opacity' field, the report mentions that there is opacity in both lungs, which could indicate nodules, masses, atelectasis, or consolidation. Therefore, the value for this field is set to true."`, even when there is no mention of that in the report referenced!
    - This can be improved by careful prompt engineering. You may want to include in your `system prompt` an instruction to not return an answer if the model is not confident. Or you may want to try without having the model explain its reasoning.
    - In an article published in _Radiology_, a group at NIH found that asking Vicuna-13B to perform one labeling task at a time produced more robust results: [Feasibility of Using the Privacy-preserving Large Language Model Vicuna for Labeling Radiology Reports](https://pubs.rsna.org/doi/10.1148/radiol.231147)
    - For certain use cases, retrieval-augmented generation (RAG) can be helpful. We'll cover that in the next notebook.
    - Finally, if all else fails and you have several hundred labeled examples of the task you want the LLM to perform, you may consider parameter-efficient fine-tuning (PEFT). See this guide from NVIDIA for more details: [Selecting LLM Customization Techniques](https://developer.nvidia.com/blog/selecting-large-language-model-customization-techniques/)
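
Format errors like the ones above can at least be detected automatically. The following is a minimal sketch of a check against the six keys in the schema; `EXPECTED_KEYS` and `check_labels` are hypothetical names introduced here for illustration. Note that it only catches structural problems, not hallucinated values that happen to be well-formed.

```python
EXPECTED_KEYS = {
    "cardiomegaly", "lung_opacity", "pneumothorax",
    "pleural_effusion", "pulmonary_edema", "abnormal_study",
}

def check_labels(labels: dict) -> list:
    """Return a list of problems; an empty list means the labels match the schema."""
    found = set(labels)
    problems = [f"missing key: {k}" for k in sorted(EXPECTED_KEYS - found)]
    problems += [f"unexpected key: {k}" for k in sorted(found - EXPECTED_KEYS)]
    # Every expected key that is present must hold a plain True/False
    problems += [
        f"non-boolean value for: {k}"
        for k in sorted(EXPECTED_KEYS & found)
        if not isinstance(labels[k], bool)
    ]
    return problems

# The labels parsed from the Llama-2-7B-Chat response shown in this notebook
llama2_labels = {
    "cardiomegaly": False,
    "lung_opacity": {"type": "boolean", "value": True},
    "pneumothorax": False,
    "pleural_effusion": False,
    "pulmonary_emia": False,
    "abnormal_study": True,
}
for problem in check_labels(llama2_labels):
    print(problem)
```

For this response the check reports a missing `pulmonary_edema` key, an unexpected `pulmonary_emia` key, and a non-boolean value for `lung_opacity`. A report that fails the check could be retried or flagged for manual review.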

In [None]:
# @title Define a function to postprocess the response text and extract the JSON object into a Python dict

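def json_from_str(s):
    """Return the first JSON object found in s as a dict, or None if no braces match."""
    # (?R) recursively matches nested braces; this requires the third-party regex package
    expr = re.compile(r'\{(?:[^{}]*|(?R))*\}')
    res = expr.findall(s)
    # json.loads raises json.JSONDecodeError if the matched text is not valid JSON
    return json.loads(res[0]) if res else None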
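
The function above depends on the third-party `regex` package for its recursive pattern, and `json.loads` will raise an exception if the matched text is not valid JSON. As an alternative sketch that uses only the standard library, `json.JSONDecoder.raw_decode` can parse an object starting at a given position and ignore whatever follows it. The name `first_json_object` is hypothetical.

```python
import json

def first_json_object(text: str):
    """Return the first parseable JSON object in text as a dict, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and ignores trailing text
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # This brace did not start a valid object; try the next one
        start = text.find("{", start + 1)
    return None

print(first_json_object('Sure! {"pneumothorax": false} Explanation: none'))
# {'pneumothorax': False}
```

Unlike `json_from_str`, this version returns `None` instead of raising when the model emits malformed JSON, which may or may not be what you want depending on how you handle failures downstream.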


In [None]:
# @title Assign an ID number to the report and associate extracted labels with the report ID

report_id = 1  # avoid naming this "id", which would shadow Python's built-in id()
labels = json_from_str(res_txt)
result_dict = {report_id: labels}
result_dict

{1: {'cardiomegaly': False,
  'lung_opacity': {'type': 'boolean', 'value': True},
  'pneumothorax': False,
  'pleural_effusion': False,
  'pulmonary_emia': False,
  'abnormal_study': True}}
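
The same pattern extends to many reports. The sketch below assumes two callables that you would supply yourself: `generate`, which turns report text into the model's response text (in this notebook, building the prompt and calling `lcpp_llm`), and `parse`, which turns that response into a dict (for example `json_from_str`). The names `label_reports` and `fake_generate` are hypothetical, and `fake_generate` is a stand-in that returns canned text rather than calling a model.

```python
import json

def label_reports(reports: dict, generate, parse) -> dict:
    """Map each report ID to its parsed labels, or to None if parsing fails."""
    results = {}
    for report_id, report_text in reports.items():
        try:
            results[report_id] = parse(generate(report_text))
        except ValueError:  # includes json.JSONDecodeError
            results[report_id] = None
    return results

def fake_generate(report_text: str) -> str:
    # Stand-in for the model: canned JSON, deliberately malformed for empty input
    return '{"pneumothorax": false}' if report_text else '{"pneumothorax": '

reports = {1: "No pneumothorax.", 2: ""}
print(label_reports(reports, fake_generate, json.loads))
# {1: {'pneumothorax': False}, 2: None}
```

Catching the parsing error per report means one malformed response does not stop the whole batch; the `None` entries mark the reports that need a retry or manual review.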