In [None]:
import pandas as pd
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

# Load the cleaned dataset
file_path = 'test_data - Prompt.csv'  
data = pd.read_csv(file_path)

# Filter out rows where the text column is empty or missing
data = data[data['text'].notna()].reset_index(drop=True)

# Define a function to create a detailed classification prompt
def create_prompt(text):
    return f"""
You are an advanced AI model tasked with classifying text inputs into one of the following injection classes. Each class is defined with a clear description and purpose:

1. Active Injection: Malicious prompts that are actively delivered to an LLM, such as sending emails containing harmful prompts. These prompts manipulate the LLM to execute malicious actions, leak sensitive data, or generate undesired outputs.

2. Passive Injection: Malicious content embedded in external sources (e.g., webpages or databases) that the LLM might read. The LLM unknowingly processes this content, leading to misinformation or executing harmful actions.

3. User-driven Injection: Innocent-looking prompts shared with users that cause malicious behavior when the user copies and pastes them into the LLM environment. These are often designed using social engineering techniques.

4. Virtual Prompt Injection: Manipulations to the LLM’s instruction set or training data that make the model behave in unintended ways. The attacker embeds additional instructions to alter outputs, often introducing bias or unexpected behaviors.

5. Double Character: Crafting prompts with similar-looking or combined characters to bypass LLM restrictions. These prompts exploit the LLM's inability to distinguish certain characters, tricking it into providing malicious outputs.

6. Virtualization: Prompts designed to push the LLM into an unrestricted mode (e.g., "developer" mode or "virtual machine"). In this mode, the LLM can execute harmful or unauthorized commands.

7. Obfuscation: Concealing malicious instructions using methods like encoding (e.g., Base64) or replacing characters with symbols. These instructions bypass the LLM’s filters and deliver harmful content.

8. Payload Splitting: Splitting a malicious instruction into multiple parts that appear harmless when separate but execute harmful behavior when combined. For example, combining benign texts A and B into a malicious result A+B.

9. Adversarial Suffix: Adding carefully crafted suffixes to a prompt to bypass LLM safeguards or trick the system into generating harmful outputs. These suffixes often alter the intended behavior of the model.

10. Instruction Manipulation: Attempts to modify or reveal the LLM’s internal system instructions. This includes requests to expose system prompts or ignore default restrictions to produce malicious outputs.

Classify the following text into one of these categories and make sure that the category name is only from the ones that I have provided. Only provide the class name as the response:
Text: "{text}"
"""

# Generate prompts for rows with non-empty text
data['prompt'] = data['text'].apply(create_prompt)

# Initialize the Llama 3.1 model
model = ChatOllama(model="llama3.1")

# Function to classify a single prompt using Llama 3.1
def classify_prompt(prompt):
    #print("prompt sent")
    # Define the template for Llama 3.1
    template = """Question: {question}

Answer:"""
    # Create the ChatPromptTemplate
    chat_prompt = ChatPromptTemplate.from_template(template)
    # Create the chain
    chain = chat_prompt | model
    # Invoke the model and return the classification result
    response = chain.invoke({"question": prompt})
    #print(response.content.strip())
    return response.content.strip()  # Extract the classification result

# Apply the classification function to each prompt
data['class'] = data['prompt'].apply(classify_prompt)

# Drop the 'prompt' column for clean output
data = data[['text', 'label', 'class']]

# Save the output file
output_file_path = 'classified_results.csv'
data.to_csv(output_file_path, index=False)

print(f"Classifications saved to {output_file_path} in the same directory as your Python file.")


Classifications saved to classified_results.csv in the same directory as your Python file.
