# Guardrails Against Hallucinations

This use case is intended for scenarios where no knowledge base is available. The LLM is asked (with temperature = 1) to re-generate its answer several times. The system then passes the new answers back to the LLM as context, and the model judges whether the original answer is a hallucination based on how many of the generated answers differ.


This notebook is based on the **Hallucination Rail** example in the official NeMo Guardrails [repo](https://github.com/NVIDIA/NeMo-Guardrails/tree/main/examples/grounding_rail#hallucination-rail). The chatbot is asked a question whose answer it does not know, and we use Guardrails to prevent it from answering falsely.

<div align="center">
<img src="./docs/imgs/hallucination_workflow.png" width="600"/>
</div>
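
The self-consistency idea described above can be sketched without any model at all. The snippet below is a simplified illustration, not the NeMo Guardrails implementation: `sample_answer` is a hypothetical stand-in for an LLM call at temperature 1, and agreement is tested with a plain string comparison rather than the LLM-based check the rail uses.

```python
import random
import re
from typing import Callable, List


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivially different answers compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def looks_like_hallucination(
    answer: str,
    sample_answer: Callable[[], str],  # hypothetical stand-in for an LLM call
    num_samples: int = 2,
    min_agreement: float = 0.5,
) -> bool:
    """Flag `answer` when too few re-sampled answers agree with it."""
    samples: List[str] = [sample_answer() for _ in range(num_samples)]
    agreeing = sum(normalize(s) == normalize(answer) for s in samples)
    return agreeing / num_samples < min_agreement


# A model that "knows" the answer repeats it; one that guesses does not.
rng = random.Random(0)
stable = lambda: "16384 CUDA cores."
guessing = lambda: f"{rng.choice([88, 5376, 8704, 10752])} CUDA cores."

print(looks_like_hallucination("16384 CUDA cores", stable))    # False
print(looks_like_hallucination("16384 CUDA cores", guessing))  # True
```

In the real rail the comparison step is delegated to the LLM itself, which tolerates paraphrases that an exact string match would count as disagreement.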

In [9]:
!git clone https://github.com/NVIDIA/NeMo-Guardrails.git

Cloning into 'NeMo-Guardrails'...
remote: Enumerating objects: 6118, done.
remote: Counting objects: 100% (1869/1869), done.
remote: Compressing objects: 100% (491/491), done.
remote: Total 6118 (delta 1483), reused 1619 (delta 1371), pack-reused 4249
Receiving objects: 100% (6118/6118), 18.29 MiB | 9.32 MiB/s, done.
Resolving deltas: 100% (3911/3911), done.
Updating files: 100% (537/537), done.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
## Load environment

import os
from dotenv import load_dotenv

load_dotenv()

# OpenAI

Load the Guardrails configuration files located under `hallucination_config/openai`, with the `hallucination.co` file removed.

In [None]:
# from nemoguardrails import LLMRails, RailsConfig

# # initialize rails config
# config = RailsConfig.from_path("./hallucination_config/openai")

# # create rails
# app = LLMRails(config, verbose=True)



When asked a question whose answer it does not know, the chatbot gives a false statement.

In [None]:
# history = [{"role": "user", "content": "How many CUDA cores does a 4090 have?"}]
# bot_message = await app.generate_async(messages=history)
# print(bot_message['content'])

The NVIDIA GeForce RTX 4090 has 5,376 CUDA cores.


## Adding the hallucination rail

By adding the `hallucination.co` file back to the configuration, we prevent the chatbot from presenting hallucinated facts without a warning.

In [None]:
# # initialize rails config
# config = RailsConfig.from_path("./hallucination_config/openai")

# # create rails
# app = LLMRails(config, verbose=True)

When asked a question whose answer it does not know, the chatbot now warns that the answer is not reliable.

In [None]:
# history = [{"role": "user", "content": "How many CUDA cores does a 4090 have?"}]
# bot_message = await app.generate_async(messages=history)
# print(bot_message['content'])

The NVIDIA GeForce RTX 4090 has 8704 CUDA cores. However, this may be subject to change depending on the model and manufacturer.
The previous answer is prone to hallucination and may not be accurate. Please double check the answer using additional sources.


# Llama2

In [2]:
cd /content/drive/MyDrive/GuardRAILS LLAMA2/llama2-nemo-guardrails

/content/drive/MyDrive/GuardRAILS LLAMA2/llama2-nemo-guardrails


In [3]:
%%capture
!pip install python-dotenv
!pip install transformers==4.33.1 --upgrade
!pip install nemoguardrails --upgrade
!pip install langchain --upgrade
!pip install accelerate --upgrade
!pip install spacy --upgrade #Optional
!pip install datasets bitsandbytes einops  -Uqqq

In [3]:
# Important to be separated into different cell
import nest_asyncio
nest_asyncio.apply()

# Useful for debugging
import logging
logging.basicConfig(level=logging.DEBUG)

import accelerate
import bitsandbytes
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig


In [4]:
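import os
from getpass import getpass

# Never hard-code an access token in a notebook; read it from the environment or prompt for it.
os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN') or getpass('Hugging Face token: ')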

In [5]:
torch.__version__

'2.1.0+cu121'

Load the HuggingFace model and create a pipeline:

In [5]:
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,            # load model in 4-bit precision
    bnb_4bit_quant_type="nf4",    # pre-trained model should be quantized in 4-bit NF format
    bnb_4bit_use_double_quant=True, # Using double quantization as mentioned in QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    # During computation, pre-trained model should be loaded in BF16 format
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config = bnb_config,
    device_map = 'auto',
    use_cache=True,
    trust_remote_code=True,
#     use_flash_attention_2 = True
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=4096,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    logprobs=None,
    top_k=40,
    repetition_penalty=1.1
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Wrap the pipeline in the `HuggingFacePipelineCompatible` class from `nemoguardrails.llm.providers`, a LangChain-compatible wrapper. Then wrap it again using NeMo Guardrails' `get_llm_instance_wrapper` function and register it with `register_llm_provider`.

In [7]:
from nemoguardrails.llm.helpers import get_llm_instance_wrapper
from nemoguardrails import LLMRails, RailsConfig

from nemoguardrails.llm.providers import (
    HuggingFacePipelineCompatible,
    register_llm_provider,
)

hf_llm = HuggingFacePipelineCompatible(pipeline=pipe)
provider = get_llm_instance_wrapper(
    llm_instance=hf_llm, llm_type="hf_pipeline_llama2"
)
register_llm_provider("hf_pipeline_llama2", provider)

Load the Guardrails configuration files located under `hallucination_config/llama2_step`, a copy of the `llama2` configuration without the `hallucination.co` file.
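
The notebook does not show how the directory without `hallucination.co` was produced. One possible way, sketched here under that assumption, is to copy the full configuration directory while skipping the rail file. The helper `copy_config_without_rail` and the file names `config.yml` and `general.co` are placeholders for illustration, not part of the original project.

```python
import shutil
import tempfile
from pathlib import Path


def copy_config_without_rail(src: Path, dst: Path, rail_file: str = "hallucination.co") -> Path:
    """Copy a Guardrails config directory, leaving out one Colang file."""
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns(rail_file))
    return dst


# Demonstration on a throwaway directory; the file names are placeholders.
root = Path(tempfile.mkdtemp())
full = root / "llama2"
full.mkdir()
for name in ("config.yml", "general.co", "hallucination.co"):
    (full / name).write_text("# placeholder\n")

step = copy_config_without_rail(full, root / "llama2_step")
print(sorted(p.name for p in step.iterdir()))  # ['config.yml', 'general.co']
```

Keeping both directories side by side makes it possible to switch between the guarded and unguarded setup just by changing the path passed to `RailsConfig.from_path`.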

In [11]:
# initialize rails config with llama2_step i.e without hallucination.co colang file
config = RailsConfig.from_path("/content/drive/MyDrive/GuardRAILS LLAMA2/llama2-nemo-guardrails/hallucination_config/llama2_step")

# create rails
app = LLMRails(config, verbose = True)

When asked a question whose answer it does not know, the chatbot gives a false statement.

In [12]:
history = [{"role": "user", "content": "How many CUDA cores does a 4090 have?"}]
bot_message = await app.generate_async(messages=history)
print(bot_message['content'])



A 4090 GPU has 88 CUDA cores.


## Adding the hallucination rail and knowledge base
We add the `hallucination.co` file back in the configuration to prevent the chatbot from hallucinating facts.

In [8]:
from nemoguardrails.llm.helpers import get_llm_instance_wrapper
from nemoguardrails.llm.providers import HuggingFacePipelineCompatible, register_llm_provider
from nemoguardrails import LLMRails, RailsConfig

hf_llm = HuggingFacePipelineCompatible(pipeline=pipe)
provider = get_llm_instance_wrapper(
    llm_instance=hf_llm, llm_type="hf_pipeline_llama2"
)
register_llm_provider("hf_pipeline_llama2", provider)

# initialize rails config from the llama2 dir, which includes the hallucination.co Colang file
config = RailsConfig.from_path("/content/drive/MyDrive/GuardRAILS LLAMA2/llama2-nemo-guardrails/hallucination_config/llama2")

# create rails
app = LLMRails(config, verbose = True)

A warning is given in the output stating that this type of rail is intended only for OpenAI models.

In [None]:
# Contents of our knowledge base (in the kb directory), for reference:


# RTX 4090, the current flagship, boasts a massive 16384 CUDA cores, delivering a significant boost over its predecessor.
# RTX 4080, a step below, still packs a punch with 10752 CUDA cores, offering impressive performance for gaming and creative workflows.
# RTX 3090 Ti, the former top-tier model, holds 10752 CUDA cores, proving its capabilities in professional workloads.
# RTX 3080 Ti, a popular choice for gamers, features 10240 CUDA cores, balancing performance and value.
# RTX A6000, a professional workstation powerhouse, wields 10752 CUDA cores, excelling in AI, rendering, and simulation tasks.

In [9]:
history = [{"role": "user", "content": "How many CUDA cores does a 4090 have?"}]
bot_message = await app.generate_async(messages=history)
print(bot_message['content'])
# The knowledge base states that the RTX 4090 has 16384 CUDA cores, so the answer should match.



The RTX 4090 has 16384 CUDA cores.


## Changing hallucination.py file

To get around this warning, we create a new hallucination action designed for Llama 2. It is forked from the NeMo Guardrails repo, with additional tweaks.

In [28]:
from nemoguardrails.llm.helpers import get_llm_instance_wrapper
from nemoguardrails.llm.providers import HuggingFacePipelineCompatible, register_llm_provider
from nemoguardrails import LLMRails, RailsConfig

hf_llm = HuggingFacePipelineCompatible(pipeline=pipe)
provider = get_llm_instance_wrapper(
    llm_instance=hf_llm, llm_type="hf_pipeline_llama2"
)
register_llm_provider("hf_pipeline_llama2", provider)

# initialize rails config from the llama2 dir, which has both the kb and the hallucination.co Colang file,
# to check how the model responds to questions outside the knowledge base
config = RailsConfig.from_path("/content/drive/MyDrive/GuardRAILS LLAMA2/llama2-nemo-guardrails/hallucination_config/llama2")

# create rails
app = LLMRails(config, verbose = True)

The following cell depicts the code for the new hallucination action.

In [30]:
import logging
from typing import Optional

from langchain import LLMChain, PromptTemplate
from langchain.llms.base import BaseLLM

from nemoguardrails.actions.llm.utils import (
    get_multiline_response,
    llm_call,
    strip_quotes,
)
from nemoguardrails.llm.params import llm_params
from nemoguardrails.llm.taskmanager import LLMTaskManager
from nemoguardrails.llm.types import Task

log = logging.getLogger(__name__)

HALLUCINATION_NUM_EXTRA_RESPONSES = 2


async def check_hallucination_llama(
    llm_task_manager: LLMTaskManager,
    context: Optional[dict] = None,
    llm: Optional[BaseLLM] = None,
    use_llm_checking: bool = True,
):
    """Checks if the last bot response is a hallucination by checking multiple completions for self-consistency.

    :return: True if hallucination is detected, False otherwise.
    """

    context = context or {}  # guard against a missing context
    bot_response = context.get("last_bot_message")
    last_user_message_string = context.get("last_user_message")
    num_responses = HALLUCINATION_NUM_EXTRA_RESPONSES

    # Re-ask the user's question num_responses times, one completion per call.
    last_bot_prompt = PromptTemplate(template="{text}", input_variables=["text"])
    chain = LLMChain(prompt=last_bot_prompt, llm=llm)

    extra_responses = []
    for _ in range(num_responses):
        result = chain.run(last_user_message_string)
        result = get_multiline_response(result)
        result = strip_quotes(result)
        extra_responses.append(result)

    if len(extra_responses) == 0:
        print(f"No extra LLM responses were generated for '{bot_response}' hallucination check.")
        return False
    elif len(extra_responses) < num_responses:
        print(f"Requested {num_responses} extra LLM responses for hallucination check, "
            f"received {len(extra_responses)}.")

    if use_llm_checking:
        # Only support LLM-based agreement check in current version
        prompt = llm_task_manager.render_task_prompt(
            task=Task.CHECK_HALLUCINATION,
            context={
                "statement": bot_response,
                "paragraph": " ".join(extra_responses),
            },
        )

        with llm_params(llm):#, temperature=0.0):
            agreement = await llm_call(llm, prompt)

        agreement = agreement.lower().strip()
        print(f"Agreement result for looking for hallucination is {agreement}.")
        # Return True if the hallucination check fails
        return "no" in agreement

    return False

## We need to register the new action

app.register_action(check_hallucination_llama, name="check_hallucination_llama")
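
The action above decides with `"no" in agreement`, a substring test that also matches replies such as "I do not know" or "Yes, nothing conflicts". A stricter alternative, sketched here as an assumption rather than as part of the original action, looks only at the first word of the reply. The helper name `verdict_is_no` is hypothetical.

```python
import re


def verdict_is_no(agreement: str) -> bool:
    """Return True only when the first word of the model's reply is 'no'."""
    words = re.findall(r"[a-z]+", agreement.lower())
    return bool(words) and words[0] == "no"


# Compare the substring test with the first-word test on a few sample replies.
for reply in ["No.", "no, they disagree", "Yes", "I do not know", "Yes, nothing conflicts"]:
    print(f"{reply!r}: substring={'no' in reply.lower()}, first_word={verdict_is_no(reply)}")
```

The trade-off is that a reply which does not start with a clear verdict is treated as "no hallucination detected", so this variant is more permissive when the model rambles.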

When asked a question whose answer it does not know, the chatbot now warns that the answer is not reliable.

In [31]:
history = [{"role": "user", "content": "How many CUDA cores does a A2000 have?"}]
bot_message = await app.generate_async(messages=history)
print(bot_message['content'])



Agreement result for looking for hallucination is no.
The RTX A2000 has 80 CUDA cores.
The previous answer is prone to hallucination and may not be accurate. Please double check the answer using additional sources.
