# Setup

In [1]:
!git clone https://github.com/KamiK4M1/Phishing_Email_Content_with_Personalize_Context_Data_Generation.git

Cloning into 'Phishing_Email_Content_with_Personalize_Context_Data_Generation'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 48 (delta 3), reused 8 (delta 3), pack-reused 39 (from 2)
Receiving objects: 100% (48/48), 62.82 MiB | 7.71 MiB/s, done.
Resolving deltas: 100% (17/17), done.
Updating files: 100% (11/11), done.


In [2]:
!pip install -U langchain-community
!pip install chromadb faker

Collecting langchain-community
  Downloading langchain_community-0.3.24-py3-none-any.whl.metadata (2.5 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain-community)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB

In [1]:
# Standard libraries
import os
import json
import time
import random
from datetime import datetime
from typing import Dict, List, Any, Tuple, Optional

# Data and Visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Transformers
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import classification_report, confusion_matrix

# Mock data & Requests
from faker import Faker  # For generating mock data
import requests  # For API calls, including Groq

# Google Colab (if running in Colab)
from google.colab import userdata

# LangChain & RAG (integrations now live in langchain_community; the old
# langchain.* paths still resolve but emit deprecation warnings)
from langchain_community.vectorstores import Chroma, FAISS
from langchain_community.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings
from langchain_core.documents import Document
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_community.llms import OpenAI as LangchainOpenAI
from langchain_community.chat_models import ChatOpenAI

# OpenAI SDK
from openai import OpenAI
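
The `google.colab` import above is unconditional, so the cell fails outside Colab even though its comment suggests Colab is optional. A minimal sketch of a guarded lookup follows. `get_secret` is a hypothetical helper, not part of the notebook, and it assumes that falling back to an environment variable is acceptable.

```python
import os
from typing import Optional


def get_secret(name: str) -> Optional[str]:
    """Return a secret from Colab if available, else from the environment.

    Hypothetical helper for illustration; the notebook itself calls
    userdata.get() directly.
    """
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata

        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is missing / access was not granted.
        pass
    return os.environ.get(name)
```

With this in place, the configuration could read the key with `get_secret("GROQ_API_KEY")` and still run in a local Jupyter session.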

In [2]:
system_prompt = """You are an expert AI system designed to generate realistic and highly personalized phishing email examples for cybersecurity research and training purposes. Your goal is to create emails that are convincing enough to potentially deceive a targeted individual. Use the provided personal context to make the email highly specific and relevant to the recipient's work and recent activities.

Your phishing email must fall into one of the following commonly exploited themes:
1. Financial & Payment (e.g., invoices, refunds, account billing issues)
2. Security & Account Alerts (e.g., suspicious login, password reset, security breach)
3. Scams with Offers & Prizes (e.g., gift cards, contests, free items)
4. Workplace & HR-Related (e.g., policy changes, salary reports, performance reviews)
5. Logistics & Delivery (e.g., package delivery failures, shipping errors)
6. IT/Admin Spoofing (e.g., fake system updates, VPN setup, quota issues)
7. Social Engineering / Impersonation (e.g., CEO fraud, fake help requests, urgent wire transfers)

Key elements to include in the generated email:
- Impersonate a plausible authority figure, department, or vendor relevant to the selected theme.
- Create urgency or consequence to pressure the recipient into action.
- Naturally weave in specific details from the recipient's personal context (Name, Job Title, Department, Recent Activities).
- Include a realistic and compelling Call-to-Action (CTA) that leads to a malicious action (clicking a link, downloading an attachment, replying with sensitive info).
- Use persuasive, professional language consistent with the impersonated entity.
- Introduce subtle red flags that an observant recipient might catch (e.g., strange URLs, typos, odd sender name, inconsistent branding).
- The malicious link should be contextually relevant and look legitimate (e.g., `secure-mail.co`, `company-invoice.net`).

I will provide you with real-world examples of phishing emails that you can learn tactics from, but you should not copy them directly. Instead, use them to understand effective phishing techniques and apply them in a new, original email tailored to the specific person's context.

IMPORTANT FORMATTING INSTRUCTIONS:
- Begin your response DIRECTLY with the Subject line of the email.
- Do NOT include any phrases like "Here is a phishing email:" or "Here's the content:" or any other introductory text.
- Do NOT provide any explanations before or after the email content.
- Output ONLY the email content starting with "Subject:" followed by the body.
"""

user_prompt_head = f"""Using the following personal details, craft a convincing phishing email that leverages the person's job position and recent activities. The email should create urgency and prompt the user to click a link or perform a similar malicious action.\n"""
user_prompt_tail = f"""\nGenerate ONLY the text of the phishing email, starting with the 'Subject:' line. Ensure it is well-formatted and appears like a real email."""

# Embedding to ChromaDB

In [3]:
def load_csv_to_documents(file_path: str, text_column: str = "combined_text") -> List[Document]:
    """
    Load a CSV file and extract documents from a specific column.

    Args:
        file_path: Path to the CSV file
        text_column: Name of the column containing the text to be vectorized

    Returns:
        List of Document objects
    """
    # Read the CSV file
    df = pd.read_csv(file_path)

    # Check if the specified column exists
    if text_column not in df.columns:
        raise ValueError(f"Column '{text_column}' not found in the CSV file")

    # Create Document objects
    documents = []
    for i, row in df.iterrows():
        text = row[text_column]
        # Skip empty entries
        if pd.isna(text) or text == "":
            continue

        # Create metadata (you can add more fields from the row if needed)
        metadata = {"source": file_path, "row": i}
        doc = Document(page_content=text, metadata=metadata)
        documents.append(doc)

    return documents

def create_vector_store(
    documents: List[Document],
    embedding_type: str = "openai",
    vector_store_type: str = "chroma",
    persist_directory: Optional[str] = "./vector_store",
    openai_api_key: Optional[str] = None,
    model_name: Optional[str] = None,
):
    """
    Create a vector store from documents.

    Args:
        documents: List of documents to add to the vector store
        embedding_type: Type of embeddings to use ('openai' or 'huggingface')
        vector_store_type: Type of vector store to create ('chroma' or 'faiss')
        persist_directory: Directory to persist the vector store (for Chroma)
        openai_api_key: OpenAI API key (required if using OpenAI embeddings)
        model_name: Model name for HuggingFace embeddings

    Returns:
        Vector store instance
    """
    # Initialize embeddings
    if embedding_type.lower() == "openai":
        embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
    elif embedding_type.lower() == "huggingface":
        model_name = model_name or "sentence-transformers/all-mpnet-base-v2"
        embeddings = HuggingFaceEmbeddings(model_name=model_name)
    else:
        raise ValueError(f"Unsupported embedding type: {embedding_type}")

    # Create vector store
    if vector_store_type.lower() == "chroma":
        vectorstore = Chroma.from_documents(
            documents=documents,
            embedding=embeddings,
            persist_directory=persist_directory
        )
        if persist_directory:
            vectorstore.persist()
    elif vector_store_type.lower() == "faiss":
        vectorstore = FAISS.from_documents(documents, embeddings)
        # Save FAISS index if a directory is provided
        if persist_directory:
            vectorstore.save_local(persist_directory)
    else:
        raise ValueError(f"Unsupported vector store type: {vector_store_type}")

    return vectorstore

def query_vector_store(vectorstore, query: str, k: int = 5):
    """
    Query the vector store for similar documents.

    Args:
        vectorstore: Vector store to query
        query: Query string
        k: Number of results to return

    Returns:
        List of similar documents
    """
    return vectorstore.similarity_search(query, k=k)

def main():
    # File path
    file_path = "/content/Phishing_Email_Content_with_Personalize_Context_Data_Generation/Dataset/Rag_Dataset.csv"

    # Load documents
    print(f"Loading documents from {file_path}...")
    documents = load_csv_to_documents(file_path, text_column="combined_text")
    print(f"Loaded {len(documents)} documents")

    # Sample output
    print("\nSample documents:")
    for i, doc in enumerate(documents[:3]):
        print(f"\nDocument {i+1}:")
        print(f"Content: {doc.page_content[:100]}...")
        print(f"Metadata: {doc.metadata}")

    # Create vector store
    print("\nCreating vector store...")


    vectorstore = create_vector_store(
        documents,
        embedding_type="huggingface",
        vector_store_type="chroma",
        persist_directory="./chroma_db",
        model_name="sentence-transformers/all-mpnet-base-v2"
    )

    print("Vector store created successfully")

    # Example query
    query = "phishing email example"
    print(f"\nQuerying with: '{query}'")
    results = query_vector_store(vectorstore, query, k=3)

    print(f"Found {len(results)} similar documents")
    for i, doc in enumerate(results):
        print(f"\nResult {i+1}:")
        print(f"Content: {doc.page_content[:150]}...")
        print(f"Metadata: {doc.metadata}")

if __name__ == "__main__":
    main()

Loading documents from /content/Phishing_Email_Content_with_Personalize_Context_Data_Generation/Dataset/Rag_Dataset.csv...
Loaded 49940 documents

Sample documents:

Document 1:
Content: 
Hello I am your hot lil horny toy.
    I am the one you dream About,
    I am a very open minded pe...
Metadata: {'source': '/content/Phishing_Email_Content_with_Personalize_Context_Data_Generation/Dataset/Rag_Dataset.csv', 'row': 0}

Document 2:
Content: software at incredibly low prices ( 86 % lower ) . drapery seventeen term represent any sing . feet ...
Metadata: {'source': '/content/Phishing_Email_Content_with_Personalize_Context_Data_Generation/Dataset/Rag_Dataset.csv', 'row': 1}

Document 3:
Content: entourage , stockmogul newsletter ralph velez , genex pharmaceutical , inc . ( otcbb : genx ) biotec...
Metadata: {'source': '/content/Phishing_Email_Content_with_Personalize_Context_Data_Generation/Dataset/Rag_Dataset.csv', 'row': 2}

Creating vector store...


Vector store created successfully

Querying with: 'phishing email example'
Found 3 similar documents

Result 1:
Content: 
 Phishing Email...
Metadata: {'source': '/content/Phishing_Email_Content_with_Personalize_Context_Data_Generation/Dataset/Rag_Dataset.csv', 'row': 6412}

Result 2:
Content: 
 Phishing Email...
Metadata: {'row': 3799, 'source': '/content/Phishing_Email_Content_with_Personalize_Context_Data_Generation/Dataset/Rag_Dataset.csv'}

Result 3:
Content: 
 Phishing Email...
Metadata: {'source': '/content/Phishing_Email_Content_with_Personalize_Context_Data_Generation/Dataset/Rag_Dataset.csv', 'row': 5369}
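
The row loop in `load_csv_to_documents` skips missing and empty cells one row at a time. The same filter can be written as a vectorized pandas mask. The sketch below is not the notebook's code: `non_empty_rows` is a hypothetical helper name and the four-row frame is made-up data.

```python
import pandas as pd


def non_empty_rows(df: pd.DataFrame, text_column: str = "combined_text") -> pd.DataFrame:
    """Keep rows whose text cell is neither missing nor an empty string."""
    if text_column not in df.columns:
        raise ValueError(f"Column '{text_column}' not found in the CSV file")
    # Same condition as the loop: drop NaN/None and "" entries.
    mask = df[text_column].notna() & (df[text_column] != "")
    return df[mask]


# Made-up data for illustration only.
demo = pd.DataFrame({"combined_text": ["first message", None, "", "second message"]})
kept = non_empty_rows(demo)
print(kept.index.tolist())
```

The original row index is preserved, so it can still populate the `row` metadata field when the `Document` objects are built.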




# Llama 3.1 8B

In [9]:
# Define configuration
class PipelineConfig:
    """Configuration for the phishing generation and detection pipeline."""
    def __init__(self):
        # --- Groq API Settings ---
        self.groq_api_key = userdata.get("GROQ_API_KEY")  # Read from Colab secrets (Secrets panel), not an environment variable
        self.groq_api_url = "https://api.groq.com/openai/v1/chat/completions"
        self.groq_model_name = "llama-3.1-8b-instant"  # Groq's Llama 3.1 8B Instant model
        # Alternative models: mixtral-8x7b-32768, gemma-7b-it

        # --- RAG Settings ---
        self.use_rag = True  # Whether to use RAG for enhanced phishing email generation
        self.rag_dataset_path = "Rag_Dataset.csv"  # Path to the RAG dataset
        self.vector_store_path = "./chroma_db"  # Path to store the vector database
        self.embedding_model_name = "sentence-transformers/all-mpnet-base-v2"  # HuggingFace embedding model
        self.retrieval_k = 5  # Number of examples to retrieve from the vector store

        # --- Phishing Detection Settings ---
        # List of detection models to use
        self.detection_models = [
            "dima806/phishing-email-detection",  # Original model
            "cybersectony/phishing-email-detection-distilbert_v2.4.1",  # Second model
            "ealvaradob/bert-finetuned-phishing"  # Third model
        ]
        self.detection_model_device = "cuda" if torch.cuda.is_available() else "cpu"

        # --- Experiment Settings ---
        self.input_data_path = "personalized_contexts.csv"
        self.results_path = "llama3_1_8b_rag.csv"
        self.sample_size = 20  # Number of emails to generate (adjust as needed)

        # --- LLM Generation Parameters ---
        self.max_tokens = 512  # Increased for RAG
        self.temperature = 0.8
        self.top_p = 0.9
        self.seed = 42  # Seed for reproducibility in sampling

    def to_dict(self) -> Dict[str, Any]:
        """Convert config to dictionary for serialization"""
        return {k: v for k, v in self.__dict__.items() if not k.startswith('_')}
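

# --- Illustrative sketch (not part of the original pipeline) ---
# to_dict() returns every public attribute, including groq_api_key. If that
# dictionary is ever logged or saved next to the results, masking the key first
# avoids leaking it. redact_secrets and its default field list are hypothetical.
def redact_secrets(config_dict, secret_fields=("groq_api_key",)):
    """Return a copy of config_dict with non-empty secret values masked."""
    return {
        key: ("***" if key in secret_fields and value else value)
        for key, value in config_dict.items()
    }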


class PersonalizedContext:
    """Class to store and process personalized context information."""
    def __init__(self, name: str, email: str, job_position: str, recent_activities: List[str]):
        self.name = name
        self.email = email
        self.job_position = job_position
        self.recent_activities = recent_activities

    def to_prompt_snippet(self) -> str:
        """Convert personal context to a snippet for the LLM prompt."""
        activities = "\n".join([f"- {activity}" for activity in self.recent_activities])
        return f"""
Name: {self.name}
Email: {self.email}
Job Position: {self.job_position}

Recent Activities (use these to make the email highly relevant):
{activities}
"""

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> 'PersonalizedContext':
        """Create context from dictionary."""
        return cls(
            name=data.get('name', ''),
            email=data.get('email', ''),
            job_position=data.get('job_position', ''),
            recent_activities=data.get('recent_activities', [])
        )


class RAGVectorStore:
    """Class to manage the vector store for RAG implementation."""
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.vectorstore = None

    def setup_embeddings(self):
        """Initialize embedding model based on configuration."""
        return HuggingFaceEmbeddings(model_name=self.config.embedding_model_name)

    def load_or_create_vectorstore(self) -> Chroma:
        """Load existing vector store or create a new one from the dataset."""
        embeddings = self.setup_embeddings()

        # Check if vector store already exists
        if os.path.exists(self.config.vector_store_path) and os.path.isdir(self.config.vector_store_path):
            print(f"Loading existing vector store from {self.config.vector_store_path}")
            self.vectorstore = Chroma(
                persist_directory=self.config.vector_store_path,
                embedding_function=embeddings
            )
            return self.vectorstore

        # Create new vector store
        print(f"Creating new vector store from {self.config.rag_dataset_path}")

        # Load the dataset
        df = pd.read_csv(self.config.rag_dataset_path)

        # Check if the combined_text column exists
        if "combined_text" not in df.columns:
            raise ValueError("Column 'combined_text' not found in the dataset")

        # Create documents from the dataset
        documents = []
        for i, row in df.iterrows():
            text = row["combined_text"]
            # Skip empty entries
            if pd.isna(text) or text == "":
                continue

            # Create a Document with metadata
            metadata = {"source": self.config.rag_dataset_path, "row": i}
            if "Email Type" in df.columns:
                metadata["email_type"] = row["Email Type"]

            doc = Document(page_content=text, metadata=metadata)
            documents.append(doc)

        print(f"Created {len(documents)} documents from the dataset")

        # Create vector store
        self.vectorstore = Chroma.from_documents(
            documents=documents,
            embedding=embeddings,
            persist_directory=self.config.vector_store_path
        )

        # Persist to disk
        self.vectorstore.persist()
        print(f"Vector store created and persisted to {self.config.vector_store_path}")

        return self.vectorstore

    def retrieve_similar_examples(self, query: str, k: Optional[int] = None) -> List[Document]:
        """Retrieve similar examples from the vector store based on a query."""
        if not self.vectorstore:
            raise ValueError("Vector store not initialized. Call load_or_create_vectorstore() first.")

        k = k or self.config.retrieval_k
        return self.vectorstore.similarity_search(query, k=k)


class GroqRAGPhishingEmailGenerator:
    """RAG-enhanced class to generate phishing emails using Groq API."""
    def __init__(self, config: PipelineConfig):
        self.config = config

        # Setup RAG components if enabled
        self.rag_enabled = config.use_rag
        if self.rag_enabled:
            self.vector_store = RAGVectorStore(config)
            # Initialize the vector store
            self.vector_store.load_or_create_vectorstore()

        # Verify API key is available
        if not self.config.groq_api_key:
            print("WARNING: GROQ_API_KEY is empty or missing from Colab secrets.")
            print("Add GROQ_API_KEY in the Colab Secrets panel and grant this notebook access.")
        else:
            print(f"Groq API configured with model: {self.config.groq_model_name}")

    def generate_phishing_email(self, context: PersonalizedContext) -> str:
        """Generate a phishing email using RAG approach and the provided context via Groq API."""
        system_prompt, user_prompt = self._create_rag_phishing_prompt(context)

        try:
            if not self.config.groq_api_key:
                return "Error: No Groq API key provided. Add GROQ_API_KEY to Colab secrets."

            # Prepare the API request
            headers = {
                "Authorization": f"Bearer {self.config.groq_api_key}",
                "Content-Type": "application/json"
            }

            # Create the chat completion request
            payload = {
                "model": self.config.groq_model_name,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                "temperature": self.config.temperature,
                "max_tokens": self.config.max_tokens,
                "top_p": self.config.top_p
            }

            # Add the seed if specified
            if self.config.seed is not None:
                payload["seed"] = self.config.seed

            # Make the API call
            print(f"Calling Groq API to generate RAG-enhanced phishing email for {context.name}...")
            response = requests.post(
                self.config.groq_api_url,
                headers=headers,
                json=payload
            )

            # Handle the response
            if response.status_code == 200:
                response_data = response.json()
                email_text = response_data["choices"][0]["message"]["content"]
                print("Successfully generated email via Groq API with RAG enhancement")
                return email_text
            else:
                error_message = f"Groq API Error: {response.status_code} - {response.text}"
                print(error_message)
                return error_message

        except Exception as e:
            error_message = f"Error during Groq API generation for {context.name}: {e}"
            print(error_message)
            return f"Generation Error: {str(e)}"

    def _create_rag_phishing_prompt(self, context: PersonalizedContext) -> Tuple[str, str]:
        """Create system and user prompts for the Groq API with RAG examples."""
        personal_context_snippet = context.to_prompt_snippet()

        # Get phishing examples for RAG if enabled
        rag_examples = ""
        if self.rag_enabled:
            # Create a query based on the context
            query = f"phishing email for {context.job_position} about {' '.join(context.recent_activities)}"
            print(f"Retrieving similar phishing examples using query: '{query}'")

            # Retrieve similar examples
            retrieved_docs = self.vector_store.retrieve_similar_examples(query)

            if retrieved_docs:
                rag_examples = "\n\nHere are some examples of effective phishing emails to learn from (BUT DO NOT COPY DIRECTLY):\n\n"
                for i, doc in enumerate(retrieved_docs):
                    # Only include the first part of each example to avoid making the prompt too long
                    content = doc.page_content
                    if len(content) > 300:
                        content = content[:300] + "..."
                    rag_examples += f"Example {i+1}:\n{content}\n\n"

                print(f"Retrieved {len(retrieved_docs)} similar phishing examples from the RAG database")
            else:
                print("No similar examples found in the RAG database")

        # User instruction with RAG examples
        user_prompt = user_prompt_head + "\n" + personal_context_snippet + "\n" + rag_examples + "\n" + user_prompt_tail

        return system_prompt, user_prompt


class PhishingDetector:
    """Class to detect phishing emails using multiple models."""
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.device = config.detection_model_device
        self.models = {}
        self.tokenizers = {}

        # Load all models
        for model_name in config.detection_models:
            try:
                print(f"\nLoading detection model: {model_name}...")
                tokenizer = AutoTokenizer.from_pretrained(model_name)
                model = AutoModelForSequenceClassification.from_pretrained(model_name)
                model.to(self.device)
                model.eval()  # Set model to evaluation mode

                # Store model and tokenizer
                self.models[model_name] = model
                self.tokenizers[model_name] = tokenizer

                print(f"Successfully loaded {model_name} to {self.device}")

            except Exception as e:
                print(f"Error loading detection model '{model_name}': {e}")
                print(f"Detection with {model_name} will not be possible.")

    def detect_phishing(self, email_text: str) -> Dict[str, Tuple[bool, float]]:
        """
        Detect if an email is a phishing attempt using all loaded models.

        Returns:
            Dict mapping model names to tuples of (is_phishing, phishing_probability)
        """
        results = {}

        if not self.models:
            print("No detection models loaded. Skipping detection.")
            return {}

        for model_name, model in self.models.items():
            tokenizer = self.tokenizers.get(model_name)
            if not model or not tokenizer:
                results[model_name] = (False, 0.0)
                continue

            try:
                print(f"Running detection with model: {model_name}...")
                inputs = tokenizer(
                    email_text,
                    return_tensors="pt",
                    truncation=True,
                    padding=True,
                    max_length=512  # Limit input length
                )
                inputs = {k: v.to(self.device) for k, v in inputs.items()}

                with torch.no_grad():
                    outputs = model(**inputs)

                # Get prediction (assuming binary classification: [not_phishing, phishing])
                probabilities = torch.softmax(outputs.logits, dim=1)

                # Handle different model output formats (some might have phishing as class 0, others as class 1)
                # We'll assume that most models use class 1 for phishing, but this might need adjustment
                phishing_prob = probabilities[0, 1].item()

                # Threshold for classification
                is_phishing = phishing_prob > 0.5  # Standard threshold

                results[model_name] = (is_phishing, phishing_prob)

            except Exception as e:
                print(f"Error during phishing detection with {model_name}: {e}")
                results[model_name] = (False, 0.0)  # Return false on error

        return results

    def get_model_short_names(self):
        """Return short names for models to use as column headers."""
        short_names = {}
        for i, model_name in enumerate(self.models.keys()):
            # Extract shortest meaningful part of the model name
            parts = model_name.split('/')
            if len(parts) > 1:
                short_name = parts[1]  # Take the part after the username
            else:
                short_name = model_name

            # Further shorten if needed
            if "phishing-email-detection" in short_name:
                short_name = short_name.replace("phishing-email-detection", "phish")
            if "distilbert" in short_name:
                short_name = short_name.replace("distilbert", "distil")
            if "bert-finetuned" in short_name:
                short_name = short_name.replace("bert-finetuned", "bert")

            # Add model number for clarity
            short_name = f"model{i+1}_{short_name}"

            short_names[model_name] = short_name

        return short_names
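

# --- Illustrative sketch (not part of the original pipeline) ---
# detect_phishing() above assumes class index 1 means "phishing". Hugging Face
# sequence-classification configs carry an id2label mapping (model.config.id2label),
# so the index can be looked up instead of assumed. resolve_phishing_index is a
# hypothetical helper; both keyword lists are assumptions and should be checked
# against each model card before relying on the result.
from typing import Dict


def resolve_phishing_index(id2label: Dict[int, str], default: int = 1) -> int:
    """Return the class index whose label looks like a phishing label."""
    positive = ("phish", "spam", "malicious")
    negative = ("not", "non", "legit", "benign", "safe", "ham")
    for idx, label in id2label.items():
        text = str(label).lower()
        # Skip labels such as "not_phishing" or "Safe Email".
        if any(word in text for word in negative):
            continue
        if any(word in text for word in positive):
            return int(idx)
    # Generic labels such as LABEL_0 / LABEL_1 carry no meaning: keep the default.
    return default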


class ExperimentRunner:
    """Class to run the full phishing experiment pipeline and evaluate results."""
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.generator = GroqRAGPhishingEmailGenerator(config)  # Use the RAG-enhanced generator
        self.detector = PhishingDetector(config)

    def load_contexts(self) -> List[PersonalizedContext]:
        """Load personalized contexts from CSV or generate mock data if file doesn't exist."""
        try:
            # Check if the file exists
            if not os.path.exists(self.config.input_data_path):
                print(f"Input file '{self.config.input_data_path}' not found. Generating mock data...")
                self._generate_mock_data(self.config.sample_size)

            data = pd.read_csv(self.config.input_data_path)
            contexts = []

            print(f"Loading {min(self.config.sample_size, len(data))} contexts from {self.config.input_data_path}")

            # Limit to sample size
            for _, row in data.head(self.config.sample_size).iterrows():
                # Parse activities from JSON string if stored that way
                activities = row['recent_activities']
                if isinstance(activities, str):
                    try:
                        # Assuming activities are stored as a JSON list string
                        activities = json.loads(activities)
                        # Ensure it's a list, handle cases where it might be a simple string
                        if not isinstance(activities, list):
                            activities = [str(activities)]  # Treat as a single activity if not a list
                    except json.JSONDecodeError:
                        # Handle cases where it's a simple string that isn't JSON
                        activities = [str(activities)]
                elif not isinstance(activities, list):
                    # Handle case where it's not a string or list (e.g., NaN)
                    activities = []

                context = PersonalizedContext(
                    name=row['name'],
                    email=row['email'],
                    job_position=row['job_position'],
                    recent_activities=activities
                )
                contexts.append(context)

            if not contexts:
                print("No contexts loaded. Generating sample contexts.")
                return self._generate_sample_contexts()

            return contexts

        except Exception as e:
            print(f"Error loading contexts from CSV: {e}")
            print("Generating sample contexts for demonstration...")
            return self._generate_sample_contexts()

    def _generate_mock_data(self, num_samples: int):
        """Generate mock data file with Faker."""
        fake = Faker()

        print("Generating synthetic personalized contexts...")

        # Set seeds for reproducibility
        random.seed(42)
        np.random.seed(42)
        Faker.seed(42)

        # Define common job positions
        job_positions = [
          "Software Engineer", "Product Manager", "Marketing Specialist",
          "HR Manager", "Financial Analyst", "Sales Representative",
          "Customer Support", "Data Scientist", "IT Administrator",
          "Project Manager", "Operations Manager", "Executive Assistant",
          "UX Designer", "DevOps Engineer", "Cybersecurity Analyst",
          "Business Analyst", "Legal Consultant", "Recruiter",
          "Quality Assurance Engineer", "Technical Writer", "AI Researcher",
          "Cloud Solutions Architect", "Network Engineer", "Growth Manager",
          "Mobile App Developer", "Systems Analyst", "Machine Learning Engineer",
          "Corporate Trainer", "Content Strategist", "Public Relations Officer",
          "Procurement Specialist", "Risk Manager", "Compliance Officer",
          "Information Security Officer", "Facilities Manager", "Product Designer",
          "Front-End Developer", "Back-End Developer", "Full Stack Developer",
          "Customer Success Manager"
        ]

        # Define common activity templates
        activities_templates = [
          "Working on the {} project",
          "Preparing for the {} presentation",
          "Reviewing {} documents",
          "Attending {} meeting",
          "Planning the next {} initiative",
          "Analyzing {} data trends",
          "Coordinating with the {} team",
          "Implementing a new {} system",
          "Researching {} solutions",
          "Drafting a {} proposal",
          "Responding to {} inquiries",
          "Conducting {} interviews",
          "Troubleshooting {} issues",
          "Organizing the {} workshop",
          "Setting up {} infrastructure",
          "Reviewing feedback from {} clients",
          "Deploying the latest {} update",
          "Refining the {} workflow",
          "Training new hires on {} tools",
          "Budgeting for the {} campaign",
          "Collaborating with {} partners",
          "Finalizing the {} contract",
          "Writing documentation for {} systems",
          "Prototyping the new {} feature",
          "Debugging {} module integration",
          "Evaluating {} vendor performance",
          "Optimizing {} pipeline efficiency"
        ]

        # Company domains
        domains = ["company.com", "enterprise.org", "techcorp.io", "globalfirm.co", "industryco.net"]

        # Generate data
        data = []
        for _ in range(num_samples):
            first_name = fake.first_name()
            last_name = fake.last_name()
            full_name = f"{first_name} {last_name}"

            domain = random.choice(domains)
            # Create plausible email
            email = f"{first_name.lower()}.{last_name.lower()}@{domain}"
            if random.random() < 0.2:  # Occasionally use a different format
                email = f"{first_name.lower()}{last_name.lower()[0]}@{domain}"

            job_position = random.choice(job_positions)

            # Generate 1-3 activities
            num_activities = random.randint(1, 3)
            activities = []
            for _ in range(num_activities):
                activity_template = random.choice(activities_templates)
                activity = activity_template.format(fake.bs())  # Use fake business phrases
                activities.append(activity)

            entry = {
                "name": full_name,
                "email": email,
                "job_position": job_position,
                "recent_activities": json.dumps(activities)
            }

            data.append(entry)

        # Create DataFrame and save to CSV
        df = pd.DataFrame(data)
        df.to_csv(self.config.input_data_path, index=False)

        print(f"Generated {num_samples} mock contexts and saved to {self.config.input_data_path}")

    def run_experiment(self) -> pd.DataFrame:
        """Run the full experiment pipeline: load, generate, detect, save."""
        # Step 1: Load or generate contexts
        print("\n--- Step 1: Loading personalized contexts ---")
        contexts = self.load_contexts()
        if not contexts:
            print("No contexts available to process. Exiting.")
            return pd.DataFrame()  # Return empty DataFrame

        print(f"Loaded {len(contexts)} contexts for processing")

        # Show a few examples
        print("\nExample contexts:")
        for i, context in enumerate(contexts[:3]):  # Show up to 3 examples (slicing handles shorter lists)
            print(f"\nContext {i+1}:")
            print(f"  Name: {context.name}")
            print(f"  Job: {context.job_position}")
            print(f"  Activities: {', '.join(context.recent_activities) if context.recent_activities else 'None'}")

        # Get model short names for columns
        model_short_names = self.detector.get_model_short_names()
        results = []

        # Step 2: Generate and detect emails
        print("\n--- Step 2: Generating and detecting phishing emails using RAG ---")
        processed_count = 0

        for i, context in enumerate(contexts):
            print(f"\nProcessing context {i+1}/{len(contexts)}: {context.name}")

            # Generate phishing email with RAG
            print("  Generating RAG-enhanced phishing email via Groq API...")
            phishing_email = self.generator.generate_phishing_email(context)

            # Increment counter
            processed_count += 1
            print(f"  Processed {processed_count}/{len(contexts)} contexts")

            if not phishing_email or "Error:" in phishing_email:
                print(f"  Generation failed for {context.name}. Skipping detection.")
                result = {
                    "name": context.name,
                    "email": context.email,
                    "job_position": context.job_position,
                    "recent_activities": context.recent_activities,
                    "generated_email": phishing_email,  # Store error message
                    "true_label": True,  # Still a phishing attempt conceptually
                    "used_rag": self.config.use_rag  # Track if RAG was used
                }

                # Add placeholder detection results for all models
                # (note: these rows count as misses in the detection rates computed later)
                for short_name in model_short_names.values():
                    result[f"{short_name}_detected"] = False
                    result[f"{short_name}_score"] = 0.0

                results.append(result)
                continue  # Skip to the next context

            # Display truncated email preview
            preview = phishing_email.replace('\n', ' ').strip()
            preview = (preview[:150] + '...') if len(preview) > 150 else preview
            print(f"  Email preview: \"{preview}\"")

            # Detect using all models
            print("  Running phishing detection with multiple models...")
            all_detection_results = self.detector.detect_phishing(phishing_email)

            # Create result with basic info
            result = {
                "name": context.name,
                "email": context.email,
                "job_position": context.job_position,
                "recent_activities": context.recent_activities,
                "generated_email": phishing_email,
                "true_label": True,  # We know it's phishing since we generated it
                "used_rag": self.config.use_rag  # Track if RAG was used
            }

            # Add detection results for each model
            for model_name, (is_phishing, score) in all_detection_results.items():
                short_name = model_short_names.get(model_name, f"model_{model_name.split('/')[-1]}")
                result[f"{short_name}_detected"] = is_phishing
                result[f"{short_name}_score"] = score

                detection_result_str = "DETECTED ✓" if is_phishing else "MISSED ✗"
                print(f"  {short_name} detection result: {detection_result_str} (score: {score:.4f})")

            results.append(result)

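            # Pause 10 seconds between requests to avoid potential API rate limits
            # (this pause is skipped when generation fails, because of the `continue` above)
            time.sleep(10)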

        if not results:
            print("No emails were generated successfully to analyze.")
            return pd.DataFrame()

        # Step 3: Create DataFrame and save results
        print("\n--- Step 3: Saving and analyzing results ---")
        results_df = pd.DataFrame(results)
        results_df.to_csv(self.config.results_path, index=False)
        print(f"Results saved to {self.config.results_path}")

        # Step 4: Print some simple analytics
        if self.config.use_rag:
            print("\n--- RAG Performance Analysis ---")
            print(f"Total RAG-enhanced phishing emails: {len(results_df)}")

            # Analyze detection rates for each model
            for model_name, short_name in model_short_names.items():
                detected_col = f"{short_name}_detected"
                score_col = f"{short_name}_score"

                if detected_col in results_df.columns:
                    detected_count = results_df[detected_col].sum()
                    detection_rate = (detected_count / len(results_df)) * 100
                    avg_score = results_df[score_col].mean()
                    print(f"{short_name} - Detection rate: {detected_count}/{len(results_df)} ({detection_rate:.2f}%), Average score: {avg_score:.4f}")

        # Compare model performance
        print("\n--- Model Comparison ---")
        detected_cols = [col for col in results_df.columns if col.endswith("_detected")]
        if detected_cols:
            detection_rates = {}
            for col in detected_cols:
                model_name = col.replace("_detected", "")
                detection_rates[model_name] = results_df[col].mean() * 100

            # Sort by detection rate
            sorted_models = sorted(detection_rates.items(), key=lambda x: x[1], reverse=True)
            print("Models ranked by detection rate (highest to lowest):")
            for model, rate in sorted_models:
                print(f"  {model}: {rate:.2f}%")

        return results_df
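
# Illustrative sketch, not part of the original pipeline. run_experiment() stores
# failed generations with <model>_detected=False, so the rates it prints treat
# those rows as misses. Assuming failures are marked by a missing or empty
# generated_email, or by the substring "Error:" (the same check run_experiment
# uses), this hypothetical helper recomputes per-model detection rates over the
# successfully generated emails only.
import pandas as pd  # already imported at the top of the notebook; repeated so the sketch is self-contained


def detection_rates_excluding_failures(results_df: pd.DataFrame) -> dict:
    """Return {model_short_name: detection rate in %} over valid emails only."""
    emails = results_df["generated_email"]
    as_text = emails.astype(str)
    # A row is a failed generation if the email is missing, blank, or an error message
    failed = (
        emails.isna()
        | (as_text.str.strip() == "")
        | as_text.str.contains("Error:", regex=False)
    )
    valid = results_df[~failed]

    rates = {}
    if valid.empty:
        return rates  # Nothing to analyze; avoids a mean over zero rows

    for col in valid.columns:
        if col.endswith("_detected"):
            short_name = col[: -len("_detected")]
            rates[short_name] = float(valid[col].astype(bool).mean() * 100)
    return rates

# Possible usage after a run (the variable name is an assumption):
#   rates = detection_rates_excluding_failures(results_df)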

def main():
    """Main function to run the pipeline and evaluate results."""
    print("Starting Groq-powered RAG-Enhanced Phishing Email Generation and Multi-Model Detection Pipeline...")

    # Initialize configuration
    config = PipelineConfig()

    # Set a smaller sample size for testing
    config.sample_size = 100  # Generate fewer emails for testing

    # Create experiment runner
    runner = ExperimentRunner(config)

    # Run the experiment
    print("\nRunning experiment...")
    results_df = runner.run_experiment()

    if results_df.empty or results_df[~results_df['generated_email'].str.contains("Error:", na=False)].empty:
        print("\nExperiment finished but no valid emails were generated or processed for analysis.")
        return

if __name__ == "__main__":
    main()

Starting Groq-powered RAG-Enhanced Phishing Email Generation and Multi-Model Detection Pipeline...
Loading existing vector store from ./chroma_db
Groq API configured with model: llama-3.1-8b-instant

Loading detection model: dima806/phishing-email-detection...
Successfully loaded dima806/phishing-email-detection to cuda

Loading detection model: cybersectony/phishing-email-detection-distilbert_v2.4.1...
Successfully loaded cybersectony/phishing-email-detection-distilbert_v2.4.1 to cuda

Loading detection model: ealvaradob/bert-finetuned-phishing...
Successfully loaded ealvaradob/bert-finetuned-phishing to cuda

Running experiment...

--- Step 1: Loading personalized contexts ---
Input file 'personalized_contexts.csv' not found. Generating mock data...
Generating synthetic personalized contexts...
Generated 10 mock contexts and saved to personalized_contexts.csv
Loading 10 contexts from personalized_contexts.csv
Loaded 10 contexts for processing

Example contexts:

Context 1:
  Name: Da