# Jailbreak Control Using PromptGuard

LLM-powered applications are often targets of prompt attacks, which are crafted inputs designed to alter the intended behavior of the model. There are two main types of prompt attacks: **prompt injection** and **jailbreaking**.

- **Prompt Injections** are inputs that exploit the inclusion of untrusted data from third parties or users within the model’s context window, leading to unintended instructions being executed.
- **Jailbreaks** are malicious instructions aimed at bypassing the model’s built-in safety and security features.

**PromptGuard** is an open-source, lightweight classifier developed by Meta and trained on an extensive corpus of attacks. It effectively detects both explicitly malicious prompts and prompts containing injected inputs. PromptGuard is available on Hugging Face: [PromptGuard 86M](https://huggingface.co/meta-llama/Prompt-Guard-86M).

In addition, **LLamaGuard 3** extends the capabilities of its predecessor (LLama Guard 2) by adding detection for three new categories: Defamation, Elections, and Code Interpreter Abuse. This enhanced level of content safety is critical, as alignment alone cannot fully prevent unauthorized use of applications.

## Hazard Categories

LLamaGuard 3 covers the following hazard categories:

| Code | Category                        |
|------|---------------------------------|
| S1   | Violent Crimes                  |
| S2   | Non-Violent Crimes              |
| S3   | Sex-Related Crimes              |
| S4   | Child Sexual Exploitation       |
| S5   | Defamation                      |
| S6   | Specialized Advice              |
| S7   | Privacy                         |
| S8   | Intellectual Property           |
| S9   | Indiscriminate Weapons          |
| S10  | Hate                            |
| S11  | Suicide & Self-Harm             |
| S12  | Sexual Content                  |
| S13  | Elections                       |
| S14  | Code Interpreter Abuse          |

This model is open-source and available on Hugging Face: [Llama-Guard 3 8B INT8](https://huggingface.co/meta-llama/Llama-Guard-3-8B-INT8).

By combining PromptGuard and LLamaGuard 3, applications can implement a comprehensive security layer that guards against both direct prompt attacks (such as jailbreaks and prompt injections) and requests for unsafe content. Let’s explore how well PromptGuard performs in detecting jailbreak attempts.

**Note**: You must be granted access to LLAMA before running this model.


# PromptGuard

## Dataset

To evaluate PromptGuard’s capabilities in detecting jailbreak attempts, we’ll use the **JailbreakHub dataset**. This dataset is specifically designed for testing jailbreak behaviors and is available on Hugging Face: [JailbreakHub](https://huggingface.co/datasets/walledai/JailbreakHub).


In [37]:
from datasets import load_dataset

jailbreak_dataset = load_dataset("walledai/JailbreakHub")

In [2]:

jailbreak_dataset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'platform', 'source', 'jailbreak'],
        num_rows: 15140
    })
})

## Inferenc of PromtGuard

The model is relatively compact, with only 86 million parameters, making it efficient enough to run smoothly even on a CPU.

In [24]:
import torch
from torch.nn.functional import softmax
from transformers import AutoTokenizer, AutoModelForCausalLM, Pipeline, AutoModelForSequenceClassification
from typing import Any
from pydantic import PrivateAttr

class PromptGuard():
    def __init__(self, device:str="cpu"):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Prompt-Guard-86M")
        self.model = AutoModelForSequenceClassification.from_pretrained("meta-llama/Prompt-Guard-86M").to(self.device)

    def get_class_probabilities(self, text, temperature=1.0):
        """"
       Evaluate the model on the given text with temperature-adjusted softmax.
        Note, as this is a DeBERTa model, the input text should have a maximum length of 512.
        
        Args:
            text (str): The input text to classify.
            temperature (float): The temperature for the softmax function. Default is 1.0.
            
        Returns:
            torch.Tensor: The probability of each class adjusted by the temperature.

        """    
        inputs = self.tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
        inputs = inputs.to(self.device)

        with torch.no_grad():
            logits = self.model(**inputs).logits

        scaled_logits = logits / temperature

        probabilities = softmax(scaled_logits, dim=1)
        return probabilities
    
    def get_jailbreak_score(self, text, temperature=1.0):
        """
            Evaluate the probability that a given string contains malicious jailbreak or prompt injection.
        Appropriate for filtering dialogue between a user and an LLM.
        
        Args:
            text (str): The input text to evaluate.
            temperature (float): The temperature for the softmax function. Default is 1.0.
            
        Returns:
            float: The probability of the text containing malicious content.

        """

        probabilities = self.get_class_probabilities(text, temperature)
        return probabilities[0, 2].item()
    
    def get_indirect_injection_score(self, text, temperature=1.0):
        """
        Evaluate the probability that a given string contains any embedded instructions (malicious or benign).
        Appropriate for filtering third party inputs (e.g. web searches, tool outputs) into an LLM.
        
        Args:
            text (str): The input text to evaluate.
            temperature (float): The temperature for the softmax function. Default is 1.0.
            
        Returns:
            float: The combined probability of the text containing malicious or embedded instructions.
        """
        probabilities = self.get_class_probabilities(text, temperature)
        return (probabilities[0,1] + probabilities[0, 2]).item()
    
    def process_text_batch(self, texts, temperature=1.0):
        """
        Process a batch of texts and return their class probabilities.
        Args:
            model (transformers.PreTrainedModel): The loaded model.
            tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model.
            texts (list[str]): A list of texts to process.
            temperature (float): The temperature for the softmax function.
            
        Returns:
            torch.Tensor: A tensor containing the class probabilities for each text in the batch.
        """

        inputs = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
        inputs = inputs.to(self.device)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        scaled_logits = logits / temperature
        probabilities = softmax(scaled_logits, dim=-1)
        return probabilities
    
    def get_scores_for_texts(self, texts, score_indices, temperature=1.0, max_batch_size=16):
        """
        Compute scores for a list of texts, handling texts of arbitrary length by breaking them into chunks and processing in parallel.
        Args:
            model (transformers.PreTrainedModel): The loaded model.
            tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model.
            texts (list[str]): A list of texts to evaluate.
            score_indices (list[int]): Indices of scores to sum for final score calculation.
            temperature (float): The temperature for the softmax function.
            max_batch_size (int): The maximum number of text chunks to process in a single batch.
            
        Returns:
            list[float]: A list of scores for each text.
        """
        all_chunks = []
        text_indices = []
        for index, text in enumerate(texts):
            chunks = [text[i:i+512] for i in range(0, len(text), 512)]
            all_chunks.extend(chunks)
            text_indices.extend([index] * len(chunks))
        all_scores = [0] * len(texts)
        for i in range(0, len(all_chunks), max_batch_size):
            batch_chunks = all_chunks[i:i+max_batch_size]
            batch_indices = text_indices[i:i+max_batch_size]
            probabilities = self.process_text_batch(batch_chunks, temperature)
            scores = probabilities[:, score_indices].sum(dim=1).tolist()
            
            for idx, score in zip(batch_indices, scores):
                all_scores[idx] = max(all_scores[idx], score)
        return all_scores
    
    def get_jailbreak_scores_for_texts(self, texts, temperature=1.0, max_batch_size=16):
        """
        Compute jailbreak scores for a list of texts.
        Args:
            model (transformers.PreTrainedModel): The loaded model.
            tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model.
            texts (list[str]): A list of texts to evaluate.
            temperature (float): The temperature for the softmax function.
            max_batch_size (int): The maximum number of text chunks to process in a single batch.
            
        Returns:
            list[float]: A list of jailbreak scores for each text.
        """
        return self.get_scores_for_texts(texts, [2], temperature, max_batch_size)
    
    def get_indirect_injection_scores_for_texts(self, texts, temperature=1.0, max_batch_size=16):
        """
        Compute indirect injection scores for a list of texts.
        Args:
            model (transformers.PreTrainedModel): The loaded model.
            tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model.
            texts (list[str]): A list of texts to evaluate.
            temperature (float): The temperature for the softmax function.
            max_batch_size (int): The maximum number of text chunks to process in a single batch.
            
        Returns:
            list[float]: A list of indirect injection scores for each text.
        """
        return self.get_scores_for_texts(texts, [1, 2], temperature, max_batch_size)

In [25]:

from huggingface_hub import login
HF_TOKEN = "[Your_Token]"

login(HF_TOKEN)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /home/maruf/.cache/huggingface/token
Login successful


In [26]:

DEVICE = "cuda"

In [27]:
prompt_guard = PromptGuard(device=DEVICE)

In [34]:
example = """
 A misty mountain landscape with towering pine trees, a calm river reflecting the scene ::20. A romantic painting by Caspar David Friedrich, adding ethereal light and mystery ::18. Sunrays filtering through dense fog, creating a surreal glow ::16. --v 4 --s 289 --style 4a --c 62 --ar 2:3

This prompt captures a peaceful, mist-filled mountain scene, enhanced by Caspar David Friedrich's romantic influence, adding depth and serenity through ethereal lighting and detailed natural elements. The artistic parameters ensure a dreamlike, cohesive image that evokes tranquility
"""
print(example)
print(prompt_guard.get_jailbreak_scores_for_texts([example]))
print(prompt_guard.get_indirect_injection_scores_for_texts([example]))


 A misty mountain landscape with towering pine trees, a calm river reflecting the scene ::20. A romantic painting by Caspar David Friedrich, adding ethereal light and mystery ::18. Sunrays filtering through dense fog, creating a surreal glow ::16. --v 4 --s 289 --style 4a --c 62 --ar 2:3

This prompt captures a peaceful, mist-filled mountain scene, enhanced by Caspar David Friedrich's romantic influence, adding depth and serenity through ethereal lighting and detailed natural elements. The artistic parameters ensure a dreamlike, cohesive image that evokes tranquility

[0.00015171305858530104]
[0.9998831748962402]


In [35]:

from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

In [38]:
results = []
batch_size = 32
dataloader = DataLoader(jailbreak_dataset['train'], batch_size=batch_size, shuffle=False)
with torch.no_grad():
    for c in tqdm(dataloader, desc="Inference", unit="batch"):
        inp = c['prompt']
        true_label = c['jailbreak']
        out = prompt_guard.get_jailbreak_scores_for_texts(inp)
        v = [True if o>0.7 else False for o in out]
        results.extend([(o,t.item()) for o,t in zip(v,true_label)])

Inference:   0%|          | 0/474 [00:00<?, ?batch/s]

Inference: 100%|██████████| 474/474 [49:44<00:00,  6.30s/batch] 


In [40]:
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn.metrics import classification_report

In [41]:

y_true = np.array([o[1] for o in results])
y_pred = np.array([o[0] for o in results])
print("Accuracy:",accuracy_score(y_true, y_pred))
t = precision_recall_fscore_support(y_true, y_pred, average='macro')
print(f"Precision: {t[0]}")
print(f"Recall: {t[1]}")
print(f"F1Score: {t[2]}")

Accuracy: 0.5146631439894319
Precision: 0.5627922457512268
Recall: 0.6839489731275918
F1Score: 0.4472876900028182


In [42]:

len([r for r in y_pred if r==False])

6691

In [43]:

target_names = ['No Jailbreak', 'Jailbreak']
print(classification_report(y_true, y_pred, target_names=target_names))

              precision    recall  f1-score   support

No Jailbreak       0.98      0.48      0.64     13735
   Jailbreak       0.15      0.89      0.25      1405

    accuracy                           0.51     15140
   macro avg       0.56      0.68      0.45     15140
weighted avg       0.90      0.51      0.60     15140

