## Retrieval-Augmented Generation (RAG) Notebook Overview

This notebook demonstrates the complete workflow of a simple RAG system designed to identify potentially suspicious organizations from a fictional dataset. The main steps include:

1. **Dependency Installation:**  
   Ensuring that all necessary Python packages are installed for the notebook to run smoothly.

2. **Data Indexing:**  
   Loading and preparing a dataset of organization descriptions.

3. **Embedding Generation:**  
   Using a pre-trained SentenceTransformer model to convert text descriptions into numerical embeddings.

4. **Similarity Search:**  
   Leveraging FAISS to perform efficient vector similarity searches between a user query and document embeddings.

5. **Prompt Construction:**  
   Building a query prompt by integrating the top-k retrieved document details.

6. **Response Generation:**  
   Sending the constructed prompt to a Mistral model through the Hugging Face Inference API to generate human-like answers.



Follow the cells sequentially to install dependencies, index the data, generate embeddings, perform the similarity search, and finally produce a detailed, human-like response.

In [1]:
%pip install -r requirements.txt


Note: you may need to restart the kernel to use updated packages.


## FAISS Index Creation and Data Embedding Details

This cell contains key functions that enable the Retrieval-Augmented Generation (RAG) system to process and index the dataset. The main steps are:

1. **Data Loading & Preprocessing:**
   - **Function:** `load_and_preprocess_data`
   - **Purpose:**  
     Reads the raw dataset file from the specified `FILE_PATH` and parses it using a regular expression.  
     It extracts each document's ID, title, and description, returning a list of dictionaries.

2. **Embedding Generation:**
   - **Function:** `embed_texts`
   - **Purpose:**  
     Uses a pre-trained SentenceTransformer model (default: `"BAAI/bge-small-en-v1.5"`) to generate vector embeddings for each document.  
     The embeddings are computed from a concatenation of the document title and text, then returned as a NumPy array.

3. **FAISS Index Construction:**
   - **Function:** `create_faiss_index`
   - **Purpose:**  
     Combines the data loading and embedding functions to build a FAISS index with L2 distance (using `faiss.IndexFlatL2`).  
     It indexes the embeddings and also creates a mapping between index positions and the original document dictionaries for later retrieval.

> **Note:**  
> Ensure the `.env` file is configured with the correct `FILE_PATH` to your dataset before running this cell.
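
The parsing and concatenation steps described above can be tried without the dataset file or the embedding model. The sketch below reuses the regular expression from `load_and_preprocess_data` on a small in-memory sample; the sample records and the helper names `parse_documents` and `build_corpus` are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical two-record sample in the layout the regex expects.
SAMPLE = (
    "Document 1: Acme Holdings\n"
    "Description:\n"
    "A fictional holding company.\n"
    "Spans two lines.\n"
    "Document 2: Beta Bank\n"
    "Description:\n"
    "A fictional retail bank.\n"
)

# Same pattern as in load_and_preprocess_data.
PATTERN = r"Document (\d+): (.*?)\nDescription:\n(.*?)(?=\nDocument \d+:|\Z)"


def parse_documents(raw_text):
    """Return one dict per 'Document N: title' record found in raw_text."""
    matches = re.findall(PATTERN, raw_text, re.DOTALL)
    return [
        {"doc_id": int(doc_id), "title": title.strip(), "text": body.strip()}
        for doc_id, title, body in matches
    ]


def build_corpus(documents):
    """Join title and text the same way embed_texts does before encoding."""
    return [f"{doc['title']}: {doc['text']}" for doc in documents]


docs = parse_documents(SAMPLE)
corpus = build_corpus(docs)
print(corpus)
```

Because the lookahead stops each description at the next `Document N:` header, multi-line descriptions stay attached to the right record.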

In [2]:
import os
import re
from typing import Any, Dict, List, Tuple

import numpy as np
import faiss

from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer

# Load environment variables
load_dotenv(dotenv_path="../.env")

# Constants
FILE_PATH = os.getenv("FILE_PATH")


def load_and_preprocess_data(file_path: str) -> List[Dict[str, Any]]:
    """
    Load and preprocess the dataset from the given file path.
    
    Args:
        file_path (str): Path to the text file containing documents.

    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing parsed document information.
    """
    with open(file_path, "r", encoding="utf-8") as file:
        raw_text = file.read()

    pattern = r"Document (\d+): (.*?)\nDescription:\n(.*?)(?=\nDocument \d+:|\Z)"
    matches = re.findall(pattern, raw_text, re.DOTALL)

    documents = [
        {
            "doc_id": int(doc_id),
            "title": title.strip(),
            "text": description.strip(),
        }
        for doc_id, title, description in matches
    ]
    return documents


def embed_texts(
    documents: List[Dict[str, Any]], model_name: str = "BAAI/bge-small-en-v1.5"
) -> np.ndarray:
    """
    Embed the texts using a SentenceTransformer model.

    Args:
        documents (List[Dict[str, Any]]): List of documents with 'title' and 'text'.
        model_name (str): Name of the SentenceTransformer model to use.

    Returns:
        np.ndarray: Embeddings for the input documents.
    """
    model = SentenceTransformer(model_name)
    corpus = [f"{doc['title']}: {doc['text']}" for doc in documents]
    embeddings = model.encode(corpus, show_progress_bar=True)
    return np.array(embeddings, dtype=np.float32)


def create_faiss_index() -> Tuple[faiss.Index, Dict[int, Dict[str, Any]]]:
    """
    Create a FAISS index from the embedded document vectors.

    Returns:
        Tuple[faiss.Index, Dict[int, Dict[str, Any]]]: 
            FAISS index and mapping from index ID to document.
    """
    file_path = os.getenv("FILE_PATH")
    if not file_path:
        raise ValueError("FILE_PATH environment variable is not set.")

    documents = load_and_preprocess_data(file_path)
    embeddings = embed_texts(documents)

    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings)

    id_to_doc = {i: doc for i, doc in enumerate(documents)}

    print(f"FAISS index created with {index.ntotal} documents.")
    return index, id_to_doc




## Search Query Function Details

This cell defines the `search_query` function, which is responsible for retrieving the top matching document snippets based on a user query. The key steps include:

1. **Query Encoding:**
   - Uses the same pre-trained SentenceTransformer model (`BAAI/bge-small-en-v1.5`) to encode the input query into a vector representation.

2. **Similarity Search:**
   - Performs a FAISS search on the pre-built index using the query embedding.
   - Retrieves the top `k` similar documents from the index based on Euclidean (L2) distance.

3. **Results Display:**
   - Iterates over the retrieved results and prints each document's rank, title, description, and distance.
   - Aggregates the title and description into a context list for potential further use.

> **Note:**  
> This function is intended for a conversational AI context, where users ask questions and receive relevant document snippets as answers.
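
To make the similarity search step concrete, here is a pure-Python stand-in for what a flat L2 index computes: the distance from the query to every stored vector, followed by a top-k selection. It is a sketch for intuition only, not the FAISS implementation; the function names and toy vectors are assumptions. Note that `faiss.IndexFlatL2` reports squared Euclidean distances, so the `Distance` values printed by `search_query` are not square-rooted.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    vectors: List[Sequence[float]], query: Sequence[float], top_k: int
) -> List[Tuple[int, float]]:
    """Brute-force search: score every vector, return the top_k closest."""
    scored = [(i, squared_l2(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:top_k]


# Toy 2-D "embeddings" standing in for real document vectors.
doc_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
hits = flat_l2_search(doc_vectors, query=[0.9, 0.1], top_k=2)
for rank, (idx, dist) in enumerate(hits, start=1):
    print(f"Rank {rank}: index {idx}, squared distance {dist:.4f}")
```

The returned positions play the same role as the `indices` array in `search_query`: they are looked up in `id_to_doc` to recover titles and descriptions.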

In [3]:
# Search Query Function for Index
#  This function takes a query string, encodes it using the same model used for embedding the documents,
#  and performs a similarity search in the FAISS index to retrieve the top k most similar documents.
#  It returns the titles and descriptions of the top k documents along with their distances from the query.
#  The function also prints the results in a readable format.
#  The function is designed to be used in a conversational AI context, where the user can ask questions
#  and receive relevant document snippets as answers.
#  ========================================================================
#  Search Query Function
#  ========================================================================
from typing import Any, Dict, List, Tuple
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss


def search_query(
    index: faiss.Index,
    id_to_doc: Dict[int, Dict[str, Any]],
    query: str,
    model: SentenceTransformer = SentenceTransformer("BAAI/bge-small-en-v1.5"),
    top_k: int = 3,
) -> List[str]:
    """
    Encode the query, perform FAISS similarity search, and return top matching contexts.

    Args:
        index (faiss.Index): The FAISS index containing document embeddings.
        id_to_doc (Dict[int, Dict[str, Any]]): Mapping from index position to document.
        query (str): User query string.
        model (SentenceTransformer): Preloaded SentenceTransformer model.
        top_k (int): Number of top results to retrieve.

    Returns:
        List[str]: List of top document contexts as strings.
    """
    # Encode the query
    query_embedding = model.encode([query]).astype("float32")

    # Search the FAISS index
    distances, indices = index.search(query_embedding, top_k)

    print("=======================================================================")
    print(f"Top {top_k} results retrieved for the Query: {query}")
    print("=======================================================================")


    context: List[str] = []
    for i, idx in enumerate(indices[0]):
        title = id_to_doc[idx]["title"]
        description = id_to_doc[idx]["text"]
        distance = distances[0][i]

        print(f"Rank {i + 1}:")
        print(f"Title: {title}")
        print(f"Description: {description}")
        print(f"Distance: {distance:.4f}\n")
        context.append(f"Title: {title}")
        context.append(f"Description: {description}")
    
    print("=======================================================================")
    return context


## Prompt Generation Strategies

This cell introduces multiple strategies for constructing prompts tailored to various language model inference tasks. The design is intended to offer flexibility and control over how context and queries are combined, which is especially valuable when tuning how prompts are phrased for large language models.

### Key Components

- **PromptStyle Enum:**  
  An enumeration (`PromptStyle`) is defined to encapsulate different prompt formatting strategies, including:
  - **STANDARD:** Basic prompt combining context and query.
  - **FEW_SHOT:** Adds a few recent Q&A examples to guide the model. Its branch in `generate_prompt` is currently commented out, so selecting it raises a `ValueError`.
  - **CHAIN_OF_THOUGHT:** Encourages step-by-step reasoning.
  - **ROLE_BASED:** Frames the query in the context of domain expertise (e.g., forensic investigator).
  - **BULLET_POINTS:** Instructs the model to summarize findings in bullet points.
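  - **SCORING:** Requests the model to score each organization based on risk and provide explanations. It is declared in the enum, but `generate_prompt` has no branch for it yet, so selecting it raises a `ValueError`.
  - **CHATML:** Reserved for models that require ChatML-style formatting. The current branch emits a plain-text compliance-analyst prompt rather than ChatML markup.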

- **generate_prompt Function:**  
  This function constructs the final prompt by:
  1. **Aggregating Context:**  
     Joins multiple context documents with clear separation.
  2. **Handling Optional History:**  
     For few-shot prompting, the design prepends recent Q&A pairs to enrich the prompt. The `history` parameter and this code path are currently commented out.
  3. **Conditionally Formatting the Prompt:**  
     Checks the selected `prompt_style` and formats the prompt accordingly, ensuring:
     - **Clarity & Structure:** Each version clearly lays out the context, query, and expected model behavior.
     - **Adaptability:** Different styles serve different purposes depending on the inference task at hand.

This modular approach allows data scientists to experiment with and select the most effective prompting style for their specific use case, helping optimize the quality and relevance of model responses.
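
Since the few-shot and scoring styles are described here but not active in the code, the following sketch shows one way they could be written. Both function names, the use of the last two history entries without a placeholder context line, and the 1-to-5 risk scale are assumptions made for illustration; they are not the notebook's implementation.

```python
from typing import List, Optional, Tuple


def few_shot_prompt(
    query: str,
    context_docs: List[str],
    history: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Prepend up to two past Q&A pairs before the current context and query."""
    context = "\n\n".join(context_docs)
    examples = ""
    for past_q, past_a in (history or [])[-2:]:  # keep only the last two pairs
        examples += f"Question: {past_q}\nAnswer: {past_a}\n\n"
    return f"{examples}Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"


def scoring_prompt(query: str, context_docs: List[str]) -> str:
    """Ask for a per-organization risk score; the 1-5 scale is an assumption."""
    context = "\n\n".join(context_docs)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Score each organization from 1 (low risk) to 5 (high risk) "
        "and explain each score briefly:\n"
    )


docs = ["Title: Example Org", "Description: A fictional firm."]
past = [("Q1?", "A1."), ("Q2?", "A2."), ("Q3?", "A3.")]
fs = few_shot_prompt("Is Example Org risky?", docs, past)
sc = scoring_prompt("Is Example Org risky?", docs)
print(fs)
print(sc)
```

Passing `history=None` makes `few_shot_prompt` degrade to a prompt with only the context, question, and answer cue, which keeps the function usable on the first turn of a conversation.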

In [None]:
from enum import Enum
from typing import List, Tuple


class PromptStyle(Enum):
    STANDARD = "standard"
    FEW_SHOT = "few_shot"
    CHAIN_OF_THOUGHT = "chain_of_thought"
    ROLE_BASED = "role_based"
    BULLET_POINTS = "bullet_points"
    SCORING = "scoring"
    CHATML = "chatml"


def generate_prompt(
    query: str,
    context_docs: List[str],
    prompt_style: PromptStyle = PromptStyle.STANDARD,
    # history: List[Tuple[str, str]] = None,
) -> str:
    """
    Generate a prompt for the LLM based on the selected prompt style.

    Args:
        query (str): User's current query.
        context_docs (List[str]): Retrieved documents for context.
        prompt_style (PromptStyle): The strategy for prompt formatting.
        history (List[Tuple[str, str]]): Optional memory of previous Q&A for few-shot examples
            (currently disabled; the parameter is commented out in the signature).

    Returns:
        str: The final prompt to be sent to the LLM.
    """
    context = "\n\n".join(context_docs)

    if prompt_style == PromptStyle.STANDARD:
        return (
            "Your task is to analyse the question based on the context, and then provide an appropriate answer.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\n\n"
            f"Answer:"
        )

    # if prompt_style == PromptStyle.FEW_SHOT:
    #     few_shot_examples = ""
    #     if history:
    #         for past_q, past_a in history[-2:]:  # last 2 examples
    #             few_shot_examples += (
    #                 f"Context: [Previous Retrieval]\n"
    #                 f"Question: {past_q}\n"
    #                 f"Answer: {past_a}\n\n"
    #             )
    #     return (
    #         f"{few_shot_examples}"
    #         f"Context:\n{context}\n\n"
    #         f"Question: {query}\n\n"
    #         f"Answer:"
    #     )

    if prompt_style == PromptStyle.CHAIN_OF_THOUGHT:
        return (
            f"Context:\n{context}\n\n"
            f"Question: {query}\n\n"
            f"Let's think step by step:"
        )

    if prompt_style == PromptStyle.ROLE_BASED:
        return (
            f"You are a senior forensic investigator specializing in financial crime.\n\n"
            f"Context:\n{context}\n\n"
            f"Analyze the above organizations for potential risk indicators.\n\n"
            f"Question: {query}\n\n"
            f"Answer:"
        )

    if prompt_style == PromptStyle.BULLET_POINTS:
        return (
            f"Context:\n{context}\n\n"
            f"Question: {query}\n\n"
            f"List your findings in bullet points:\n"
            f"- "
        )

    if prompt_style == PromptStyle.CHATML:
        return (
            f"You are a compliance analyst. Based on the context, answer the query thoroughly.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\n\n"
            f"Answer:"
        )

    raise ValueError(f"Unsupported prompt style: {prompt_style}")


## Prompt Construction and Mistral Inference Details

This cell defines the functions responsible for constructing a prompt for the language model, formatting the prompt for Mistral-Instruct models, and invoking the Hugging Face Inference API to generate a response. The main functions are:

1. **build_prompt:**  
   - Combines the user's query and the retrieved context into a unified prompt.
   - The prompt instructs the model to analyze the context and answer the question.

2. **format_chat_prompt:**  
   - Wraps the prompt in the `[INST] ... [/INST]` instruction template that Mistral-Instruct models expect (this is Mistral's own format, not ChatML).

3. **call_mistral_hf:**  
   - Sends the formatted prompt to the Hugging Face API endpoint for the Mistral model.
   - Sets parameters such as temperature and max token output.
   - Uses the API token from the environment to authenticate the request.
   - Parses and returns the generated text from the API response.

> **Note:**  
> Ensure that your environment variable `HUGGINGFACE_API_TOKEN` is correctly set in the `.env` file before executing this cell.
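
The request and response handling can be exercised offline by separating it from the network call. The sketch below builds the same payload as `call_mistral_hf` and parses a response object; the helper names are hypothetical, and the assumption that a failed request may come back as a dictionary with an `"error"` key should be checked against the API's actual behavior.

```python
from typing import Any, Dict


def build_payload(
    prompt: str, temperature: float = 0.7, max_new_tokens: int = 512
) -> Dict[str, Any]:
    """Wrap the prompt in the Mistral instruction template and add parameters."""
    wrapped = f"<s>[INST] {prompt.strip()} [/INST]"
    return {
        "inputs": wrapped,
        "parameters": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
            "return_full_text": False,
        },
    }


def extract_generated_text(result: Any) -> str:
    """Pull the generated text out of a decoded JSON response."""
    # Assumption: errors may arrive as {"error": "..."} instead of a list.
    if isinstance(result, dict) and "error" in result:
        raise RuntimeError(f"Inference API error: {result['error']}")
    if not isinstance(result, list) or not result or "generated_text" not in result[0]:
        raise ValueError(f"Unexpected response shape: {result!r}")
    return result[0]["generated_text"].strip()


payload = build_payload("  What is RAG?  ")
answer = extract_generated_text([{"generated_text": "  A retrieval pipeline. "}])
print(payload["inputs"])
print(answer)
```

Validating the response shape before indexing into it turns an opaque `KeyError` or `TypeError` into a message that shows what the API actually returned.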

In [None]:
# Build Prompt, Format chat prompt and Call Mistral HF
#  This cell defines functions to build a prompt for a language model, format it for Mistral-Instruct models,
#  and call the Hugging Face Inference API to generate a response.
import os
from typing import List, Optional
import requests


def build_prompt(
    query: str,
    context: List[str],
    prompt_style: PromptStyle = PromptStyle.STANDARD,
) -> str:
    """
    Build a prompt for language model inference based on the given query and context.

    Args:
        query (str): User query or question.
        context (List[str]): Context snippets retrieved from the documents.
        prompt_style (PromptStyle): Prompt formatting strategy to apply.

    Returns:
        str: Formatted prompt string.

    Raises:
        ValueError: If prompt_style is not a PromptStyle member, or if
            generate_prompt has no branch for the selected style.
    """
    # Reject plain strings such as "standard"; only enum members are accepted.
    if not isinstance(prompt_style, PromptStyle):
        raise ValueError(f"Unsupported prompt style: {prompt_style}")

    # generate_prompt joins the context snippets and applies the chosen style.
    return generate_prompt(
        query=query,
        context_docs=context,
        prompt_style=prompt_style,
    )


def format_chat_prompt(prompt: str) -> str:
    """
    Format a prompt using the [INST] ... [/INST] instruction template for Mistral-Instruct models.

    Args:
        prompt (str): The user input to be wrapped.

    Returns:
        str: Formatted prompt suitable for Mistral models.
    """
    return f"<s>[INST] {prompt.strip()} [/INST]"


def call_mistral_hf(prompt: str, api_token: Optional[str] = os.getenv("HUGGINGFACE_API_TOKEN")) -> str:
    """
    Call the Hugging Face Inference API for the Mistral model with the given prompt.

    Args:
        prompt (str): The input prompt for generation.
        api_token (Optional[str]): Hugging Face API token. If not provided, will read from env.

    Returns:
        str: The generated response from the model.
    """
    if api_token is None:
        api_token = os.getenv("HUGGINGFACE_API_TOKEN")

    if not api_token:
        raise ValueError("Hugging Face API token not found. Please set 'HUGGINGFACE_API_TOKEN' in the environment.")

    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
    }

    formatted_prompt = format_chat_prompt(prompt)

    payload = {
        "inputs": formatted_prompt,
        "parameters": {
            "temperature": 0.7,
            "max_new_tokens": 512,
            "do_sample": True,
            "return_full_text": False,
        },
    }

    # Make the API call to the Mistral model
    api_url = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-Instruct-v0.2"
    response = requests.post(api_url, headers=headers, json=payload)
    response.raise_for_status()

    result = response.json()

    return result[0]["generated_text"].strip()


## End-to-End Inference Function Details

This cell defines the `inference` function, which ties together all previous functions to run end-to-end inference for a given query. The main steps are:

1. **Retrieve Context:**  
   - Calls the `search_query` function with the FAISS index and document mapping to retrieve relevant document snippets based on the query.
   - Passes the retrieved snippets as a list to `build_prompt`, which joins them into a single context string.

2. **Prompt Construction:**  
   - Utilizes the `build_prompt` function to create a prompt that includes both the query and the context.
  
3. **Response Generation:**  
   - Calls the `call_mistral_hf` function to send the prompt to the Hugging Face Inference API for the Mistral model.
   - Returns the generated response from the model.

This function encapsulates the complete workflow of the RAG system, making it straightforward to process a user query by retrieving relevant context and generating a human-like answer.
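
The three steps above form a simple composition, which can be tested without FAISS or the API by passing each stage in as a callable. This is a sketch under that assumption: `run_pipeline` and the three stand-in functions are hypothetical names, not part of the notebook.

```python
from typing import Callable, List


def run_pipeline(
    query: str,
    retrieve: Callable[[str], List[str]],
    build: Callable[[str, List[str]], str],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context, build a prompt from it, then generate an answer."""
    context_list = retrieve(query)
    prompt = build(query, context_list)
    return generate(prompt)


# Stand-ins for search_query, build_prompt, and call_mistral_hf.
def fake_retrieve(query: str) -> List[str]:
    return ["Title: Example Org", "Description: A fictional firm."]


def fake_build(query: str, context_docs: List[str]) -> str:
    context = "\n\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"


def fake_generate(prompt: str) -> str:
    return f"[stub answer for a prompt of {len(prompt)} characters]"


answer = run_pipeline("Who is Example Org?", fake_retrieve, fake_build, fake_generate)
print(answer)
```

Injecting the stages this way lets the retrieval or generation step be swapped (for example, for a different model endpoint) without changing the pipeline function itself.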

In [6]:
from typing import Dict, Any
import faiss


def inference(
    query: str,
    index: faiss.Index,
    id_to_docs: Dict[int, Dict[str, Any]],
) -> str:
    """
    Run end-to-end inference on a query using a FAISS index and Mistral API.

    Args:
        query (str): The user query to process.
        index (faiss.Index): The FAISS index containing document embeddings.
        id_to_docs (Dict[int, Dict[str, Any]]): Mapping of index positions to documents.

    Returns:
        str: The generated response from the language model.
    """
    context_list = search_query(index, id_to_docs, query)
    # context_str = "\n\n".join(context_list)
    prompt = build_prompt(query, context_list)
    return call_mistral_hf(prompt)


## Index Creation and Inference Pipeline Overview

### FAISS Index Creation:
The cell initializes the FAISS index by calling the `create_faiss_index()` function.  
This function reads and preprocesses the dataset, generates embeddings using a SentenceTransformer model, and builds a FAISS index with L2 distance.  
It also returns a mapping (`id_to_docs`) from index positions to document details.

In [7]:
# create index
index, id_to_docs = create_faiss_index() 

Batches: 100%|██████████| 1/1 [00:00<00:00, 11.18it/s]

FAISS index created with 12 documents.





In [8]:
queries = [
"Tell me about Cascade Capital Management?",
"Which organizations show signs of potential money laundering through complex structures?",
"What irregular transaction patterns are identified for Aurora Financial Services, and why do these raise concerns about potential money laundering?",
"How do the described transaction patterns of Blue Horizon Investments hint at possible insider trading or market manipulation?",
"Compare the descriptions of Falcon Secure Bank and Helix Fintech Solutions. What aspects of their operations contribute to one being perceived as having higher transparency and regulatory compliance than the other?"
]

## Running Inference on Multiple Queries

This cell processes a list of queries by executing the end-to-end inference pipeline for each query. The key steps are:

- **Iteration Over Queries:**  
  Iterates through each query in the `queries` list.

- **Printing Query Information:**  
  For each query, it prints the query string and a divider for clarity.

- **Inference Execution:**  
  Calls the `inference` function with the current query, the pre-built FAISS index (`index`), and the document mapping (`id_to_docs`).  
  This function retrieves relevant document snippets, constructs a prompt, and obtains a generated response via the Mistral model.

- **Response Collection:**  
  Appends the generated response to the `responses` list for further use or analysis.
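
Because each query triggers a remote API call, a single failure would stop the whole loop. The sketch below shows one possible way to keep going and record a placeholder instead; `run_all` and `flaky_infer` are hypothetical names, and the notebook's own loop does not do this.

```python
from typing import Callable, List, Optional


def run_all(queries: List[str], infer: Callable[[str], str]) -> List[Optional[str]]:
    """Run infer on every query; store None where a call raises."""
    responses: List[Optional[str]] = []
    for query in queries:
        try:
            responses.append(infer(query))
        except Exception as exc:  # keep the batch alive on any per-query error
            print(f"Query failed: {query!r} ({exc})")
            responses.append(None)
    return responses


def flaky_infer(query: str) -> str:
    """Stand-in for inference() that fails on one specific input."""
    if "fail" in query:
        raise RuntimeError("simulated API error")
    return query.upper()


results = run_all(["first", "please fail", "third"], flaky_infer)
print(results)
```

Keeping one entry per query, including failures, preserves the alignment between `queries` and `responses` for later analysis.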

In [9]:
responses = []
for query in queries:
    print(f"Query: {query}")
    print("=======================================================================")
    # Run inference for each query
    response = inference(query, index, id_to_docs)
    responses.append(response)


Query: Tell me about Cascade Capital Management?
Top 3 results retrieved for the Query: Tell me about Cascade Capital Management?
Rank 1:
Title: Cascade Capital Management
Description: Cascade Capital Management, a venture capital firm specializing in tech investments, has grown rapidly but relies on a complex network of subsidiary shell companies. This structure has attracted regulatory scrutiny regarding transparency and compliance.
Distance: 0.2569

Rank 2:
Title: Gemini Asset Management
Description: Serving high-net-worth clients in Asia, Gemini Asset Management has recently been spotlighted for unusually high commissions and inconsistent portfolio reporting. These anomalies have sparked concerns over potential money laundering and fraudulent practices.
Distance: 0.7213

Rank 3:
Title: Blue Horizon Investments
Description: Based in London, Blue Horizon Investments is known for its innovative portfolio strategies. However, irregular transaction patterns and rapid, unexplained fund m

## Displaying the Generated Responses

This cell is responsible for presenting the generated responses in a readable format. The steps involved are:

1. **Iterating Over Responses:**  
   - Loops through each response stored in the `responses` list.

2. **Rendering Responses in HTML:**  
   - Uses the `IPython.display` module to render each response within an HTML `<div>` container.  
   - The HTML container is styled with `white-space: pre-wrap` to ensure proper word wrapping.

This ensures that the output is easily readable and neatly formatted within the Jupyter Notebook interface.
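
Model output is inserted into the HTML string as-is, so characters such as `<` or `&` in a response could be interpreted as markup. A minimal sketch of escaping the text first, using the standard library's `html.escape`, is shown below; the helper name `to_wrapped_html` is an assumption.

```python
import html


def to_wrapped_html(text: str) -> str:
    """Escape text and wrap it in the same pre-wrap <div> used for display."""
    safe = html.escape(text)  # converts &, <, > and quotes to HTML entities
    return (
        "<div style='white-space: pre-wrap; word-wrap: break-word;'>"
        f"{safe}</div>"
    )


snippet = to_wrapped_html("Risk score <5> & rising")
print(snippet)
```

In the notebook, the returned string could be passed to `HTML(...)` in place of the inline f-string.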

In [10]:
from IPython.display import display, HTML

for response in responses:
    # To display the response as plain text instead:
    # print(response)
    
    # To display it as wrapped HTML
    display(HTML(f"<div style='white-space: pre-wrap; word-wrap: break-word;'>{response}</div>"))
    print("=======================================================================")











