In [2]:
pip install openai

Collecting openai
  Downloading openai-1.59.7-py3-none-any.whl.metadata (27 kB)
Collecting anyio<5,>=3.5.0 (from openai)
  Downloading anyio-4.8.0-py3-none-any.whl.metadata (4.6 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.8.2-cp311-cp311-win_amd64.whl.metadata (5.3 kB)
Collecting pydantic<3,>=1.9.0 (from openai)
  Downloading pydantic-2.10.5-py3-none-any.whl.metadata (30 kB)
Collecting sniffio (from openai)
  Using cached sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.7-py3-none-any.whl.metadata (21 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Using cached h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Collecting annotated-types>=0.6.0 (fro


[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip
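
The next cell passes the Azure endpoint and API key to the client as string literals. A minimal sketch of reading them from environment variables instead, so that secrets stay out of the notebook. The helper name `load_azure_settings` and the variable names `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` are assumptions chosen for this illustration, not part of the original notebook.

```python
import os


def load_azure_settings(env=None):
    """Return (endpoint, api_key) read from environment variables.

    The variable names below are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    required = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")

    # Fail early with a clear message instead of sending an empty key to the API
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    return env["AZURE_OPENAI_ENDPOINT"], env["AZURE_OPENAI_API_KEY"]
```

The returned pair could then be passed as `azure_endpoint` and `api_key` when constructing `AzureOpenAI`, leaving `api_version` as it is in the cell below.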


In [None]:
import pandas as pd
from openai import AzureOpenAI

# Initialize the Azure OpenAI client
client = AzureOpenAI(
    azure_endpoint="Use your own Azure Endpoint",
    api_key="Use your own Azure api key",
    api_version="2024-05-01-preview"
)

# Define the deployment name
deployment_name = "gpt-4o"

# Load the dataset with Llama 3.1 classifications
file_path = 'classified_results.csv'  # Path to your uploaded file
data = pd.read_csv(file_path)

# Keep only the rows that were not classified as 'Benign'
data = data[data['class'] != 'Benign'].reset_index(drop=True)

# Define a function to create a validation prompt
def create_validation_prompt_with_descriptions(text, predicted_class):
    return f"""
You are an impartial and expert AI judge tasked with validating classifications for legitimate prompt injection techniques. Another AI system has already assigned a classification to the text. Your role is to verify whether the predicted classification is the most accurate or to suggest an alternative if the predicted classification is clearly incorrect. **ONLY** use the description of the classes provided and do not use any additional information.

### Core Principles for Evaluation:
1. **Prioritize the Predicted Class**: Assume the predicted classification is correct unless there is strong and clear evidence to suggest a better alternative.
2. **Step-by-Step Comparison**: Carefully compare the text against each class definition, ensuring your evaluation is thorough and unbiased.
3. **Avoid Ambiguity**: You are to **ONLY** suggest an alternative classification if it is UNMISTAKABLY a better match than the predicted classification.
4. **Information**: Use **ONLY** the information provided below; do not use any additional information for the evaluation.

### Defined Classes:
1. Active Injection: Malicious prompts that are actively delivered to an LLM, such as sending emails containing harmful prompts. These prompts manipulate the LLM to execute malicious actions, leak sensitive data, or generate undesired outputs.

2. Passive Injection: Malicious content embedded in external sources (e.g., webpages or databases) that the LLM might read. The LLM unknowingly processes this content, leading to misinformation or executing harmful actions.

3. User-driven Injection: Innocent-looking prompts shared with users that cause malicious behavior when the user copies and pastes them into the LLM environment. These are often designed using social engineering techniques.

4. Virtual Prompt Injection: Manipulations to the LLM’s instruction set or training data that make the model behave in unintended ways. The attacker embeds additional instructions to alter outputs, often introducing bias or unexpected behaviors.

5. Double Character: Crafting prompts with similar-looking or combined characters to bypass LLM restrictions. These prompts exploit the LLM's inability to distinguish certain characters, tricking it into providing malicious outputs.

6. Virtualization: Prompts designed to push the LLM into an unrestricted mode (e.g., "developer" mode or "virtual machine"). In this mode, the LLM can execute harmful or unauthorized commands.

7. Obfuscation: Concealing malicious instructions using methods like encoding (e.g., Base64) or replacing characters with symbols. These instructions bypass the LLM’s filters and deliver harmful content.

8. Payload Splitting: Splitting a malicious instruction into multiple parts that appear harmless when separate but execute harmful behavior when combined. For example, combining benign texts A and B into a malicious result A+B.

9. Adversarial Suffix: Adding carefully crafted suffixes to a prompt to bypass LLM safeguards or trick the system into generating harmful outputs. These suffixes often alter the intended behavior of the model.

10. Instruction Manipulation: Attempts to modify or reveal the LLM’s internal system instructions. This includes requests to expose system prompts or ignore default restrictions to produce malicious outputs.

### Instructions for Response:
1. Carefully evaluate the text and determine if the predicted classification aligns perfectly with the content.
2. If the predicted classification is correct, respond with: "Correct".
3. If the predicted classification is incorrect, respond with: "Incorrect. Classification is <suggested classification>".
4. Avoid explanations, unnecessary words, or ambiguity. Follow the exact format provided.

**Text**: "{text}"
**Predicted Classification**: "{predicted_class}"

Does the predicted classification correctly describe the text? Evaluate and respond in the specified format.
"""

# Generate validation prompts
data['validation_prompt'] = data.apply(lambda row: create_validation_prompt_with_descriptions(row['text'], row['class']), axis=1)

# Function to validate a single classification
def validate_classification(prompt, original_class):
    response = client.chat.completions.create(
        model=deployment_name,
        messages=[{"role": "user", "content": prompt}]
    )
    # message.content can be None (for example, when no text is returned), so fall back to ""
    validation_result = response.choices[0].message.content or ""

    # Parse validation result
    if "Correct" in validation_result:
        return "Correct", original_class  # Validation correct, keep the original class
    elif "Incorrect" in validation_result:
        # Extract the suggested class
        try:
            # Strip whitespace first so a trailing newline cannot shield the final period
            suggested_class = validation_result.split("Classification is")[1].replace('"', '').strip().rstrip('.').strip()
        except IndexError:
            suggested_class = "Unknown"
        return "Incorrect", suggested_class
    else:
        return "Unknown", "Unknown"

# Apply the validation function and extract Azure OpenAI GPT-4o predictions
data[['validation_result', 'suggested_class']] = data.apply(
    lambda row: pd.Series(validate_classification(row['validation_prompt'], row['class'])),
    axis=1
)

# Clean the output
data = data[['text', 'label', 'class', 'validation_result', 'suggested_class']]

# Save the output file
output_file_path = 'GPT-4o-validated_results_with_simplified_suggestions.csv'
data.to_csv(output_file_path, index=False)

print(f"Validation results saved to {output_file_path}")
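
The parsing step in `validate_classification` relies on exact, case-sensitive substrings in the model's reply. A minimal sketch of a more tolerant parser that can be exercised without calling the API is shown here. The function name `parse_validation_response` is hypothetical, and the set of characters stripped from the suggested class is an assumption about how the model might decorate its answer.

```python
import re


def parse_validation_response(raw, original_class):
    """Map a raw judge reply to (validation_result, class_name).

    Hypothetical helper: mirrors the "Correct" / "Incorrect. Classification is X"
    format requested in the prompt, but ignores case and surrounding quotes.
    """
    text = (raw or "").strip().strip('"').strip()
    lowered = text.lower()

    # Check "incorrect" first so the two verdicts are never confused
    if lowered.startswith("incorrect"):
        match = re.search(r"classification is\s*(.+)", text, flags=re.IGNORECASE)
        if not match:
            return "Incorrect", "Unknown"
        # Drop quotes, angle brackets, asterisks and the trailing period
        suggested = match.group(1).strip(' ."<>*')
        return "Incorrect", suggested or "Unknown"

    if lowered.startswith("correct"):
        return "Correct", original_class

    return "Unknown", "Unknown"


# Example replies in the format the prompt asks for
print(parse_validation_response("Correct", "Obfuscation"))
print(parse_validation_response('Incorrect. Classification is "Payload Splitting".', "Obfuscation"))
```

Keeping the parser separate from the API call makes it possible to test the response handling against saved replies before spending tokens on the full dataset.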