### 🧪 Final Project: Smart AI - Retrieval-Augmented Generation (RAG)

Use this code as the base for your final project. It builds an interactive **Gradio application** that combines document understanding with open-source Large Language Models (LLMs) using the **Retrieval-Augmented Generation (RAG)** approach.

You will learn how to:

- 📄 Process and chunk PDF documents  
- 🧠 Generate embeddings for semantic search  
- 🔎 Use FAISS for fast vector similarity retrieval  
- 🤖 Prompt LLMs with retrieved context for:
  - Question Answering  
  - Summarization  
  - Translation (to Finnish)

This exercise brings together the key components of the RAG pipeline:  
**Document Processing → Embedding → Indexing → Retrieval → Generation**
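
To see how the five stages fit together before any real models are involved, here is a toy sketch in plain Python. Everything in it is a stand-in chosen for illustration: `embed` counts words instead of calling Sentence Transformers, a Python list plays the role of the FAISS index, and `generate` only assembles a prompt instead of calling an LLM.

```python
from collections import Counter

def embed(text):
    # Toy "embedding": word counts. The project uses Sentence Transformers instead.
    return Counter(text.lower().split())

def similarity(a, b):
    # Number of word occurrences the two texts share.
    return sum((a & b).values())

# Document processing: pretend these are chunks extracted from a PDF.
chunks = [
    "faiss stores vectors for fast search",
    "gradio builds simple web interfaces",
]

# Embedding + indexing: one vector per chunk, kept in a list.
index = [embed(chunk) for chunk in chunks]

def retrieve(query):
    # Retrieval: return the chunk whose vector is most similar to the query.
    query_vector = embed(query)
    best = max(range(len(chunks)), key=lambda i: similarity(query_vector, index[i]))
    return chunks[best]

def generate(query, context):
    # Generation stand-in: show the prompt an LLM would receive.
    return f"CONTEXT: {context}\nQUESTION: {query}"

question = "which tool stores vectors"
answer = generate(question, retrieve(question))
print(answer)
```

The real pipeline in this notebook has the same shape; only the implementation of each stage changes.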

We’ll use:

- 🧠 **Sentence Transformers** for embeddings  
- 🔎 **FAISS** for similarity search  
- 🤖 **Google Gemma 2B** (or any other open model) for text generation  
- 🌐 **Gradio** for creating a user interface

### 🔧 Part 0: Setup and Installations

Before we begin, let’s install the required libraries.

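We’ll need:
- `faiss-cpu` for efficient similarity search and indexing
- `gradio` to build our interactive interface
- `pypdf` to extract text from PDF files
- `sentence-transformers` and `transformers` for the embedding and language models
- `accelerate` (required by `device_map="auto"`) and `bitsandbytes` (required for 4-bit quantization)

In [None]:
!pip install faiss-cpu
!pip install gradio
!pip install pypdf
!pip install sentence-transformers transformers accelerate bitsandbytes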

### 📦 Part 1: Import Required Libraries

Now that we’ve installed the necessary packages, let’s import them into our environment.

These include:
- Core Python libraries: `os`, `re`, `warnings`
- Numerical and ML tools: `torch`, `numpy`, `faiss`
- PDF handling: `pypdf`
- Embedding model: `sentence-transformers`
- Language model pipeline: `transformers`
- Interface: `gradio`

We’ll also suppress some warnings to keep the output clean.


In [None]:

import os
import re
import torch
import faiss
import numpy as np
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
from pypdf import PdfReader
import warnings

# Suppress specific warnings if needed (e.g., from sentence_transformers)
warnings.filterwarnings("ignore", category=FutureWarning)

print("Libraries imported successfully.")

### ⚙️ Part 2: Configuration – Device and Model Selection

In this section, we:

- Detect whether a GPU is available and set the appropriate device for model execution
- Select our embedding and language generation models
- Define the target language for translation
- Prepare global variables to hold our models (to avoid reloading them on every interaction)

In [None]:


# --- Configuration ---
# Select device (GPU if available, otherwise CPU)
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("Using CPU. Note: LLM inference will be significantly slower.")

# --- Model Selection ---
# Embedding Model (Choose one)
# 'all-MiniLM-L6-v2' is fast and efficient.
# 'all-mpnet-base-v2' offers higher quality embeddings.
EMBEDDING_MODEL_NAME = 'all-MiniLM-L6-v2'

# Large Language Model (Choose one)
# Smaller models (< 3B params) are recommended for easier running without large GPUs.
# Examples: 'google/gemma-2b-it', 'microsoft/phi-2', 'TinyLlama/TinyLlama-1.1B-Chat-v1.0'
LLM_MODEL_NAME = "google/gemma-2b-it" # Using Gemma 2B Instruct as an example

# Target language for translation task
TARGET_LANGUAGE = "Finnish"

# --- Global Variables (Loaded Models) ---
# We load models globally to avoid reloading them on every Gradio interaction.
embedder = None
text_generator = None
tokenizer = None # Keep tokenizer for potential manual formatting if needed

### 🔑 Part 3: Authenticate with Hugging Face Hub

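To download models from the Hugging Face Hub, especially larger or gated models, you may need to authenticate with your Hugging Face account. The default LLM in this notebook, `google/gemma-2b-it`, is gated, so you also need to accept its license on the model page before the download will work.

Run the following command and paste in your **Hugging Face token** when prompted.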


In [None]:
!huggingface-cli login

### 🤖 Part 4: Load Models – Embedder & LLM

In this step, we load:

- The **Sentence Transformer** model for generating document embeddings.
- The **Large Language Model (LLM)** for response generation.

We also configure:
- **4-bit quantization** (optional) for memory-efficient LLM loading using `bitsandbytes`.
- Automatic selection of optimal `torch_dtype` based on hardware (e.g., `bfloat16` or `float16`).
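
The dtype choice can be thought of as a small decision rule. The sketch below expresses one possible policy as a pure function so it can be read without a GPU; the name `choose_dtype_name` is hypothetical, and falling back to `float32` on CPU is this sketch's own choice, made because half precision is poorly supported for CPU inference.

```python
def choose_dtype_name(cuda_available, bf16_supported):
    """Return the name of the dtype to load the LLM in."""
    if cuda_available and bf16_supported:
        return "bfloat16"  # newer GPUs with native bfloat16 support
    if cuda_available:
        return "float16"   # GPUs without bfloat16 support
    return "float32"       # CPU: stay in full precision


print(choose_dtype_name(cuda_available=True, bf16_supported=True))    # bfloat16
print(choose_dtype_name(cuda_available=True, bf16_supported=False))   # float16
print(choose_dtype_name(cuda_available=False, bf16_supported=False))  # float32
```

With PyTorch, the two flags would come from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`, and the returned name maps to `getattr(torch, name)`.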


In [None]:


# Load Models (Generator LLM & Embedder)

def load_models():
    """Loads the embedding and language models."""
    global embedder, text_generator, tokenizer

    print(f"Loading embedding model: {EMBEDDING_MODEL_NAME}...")
    try:
        embedder = SentenceTransformer(EMBEDDING_MODEL_NAME, device=device)
        print("Embedding model loaded successfully.")
    except Exception as e:
        print(f"Error loading embedding model: {e}")
        raise

    print(f"Loading LLM: {LLM_MODEL_NAME}...")
    try:
        # Optional: Configuration for loading in 4-bit for memory savings
        # Requires 'bitsandbytes' library
        use_4bit = True # Set to False if you don't want 4-bit or have issues
        bnb_config = None
        if use_4bit and torch.cuda.is_available():
            try:
                bnb_config = BitsAndBytesConfig(
                    load_in_4bit=True,
                    bnb_4bit_quant_type="nf4",
                    bnb_4bit_compute_dtype=torch.bfloat16, # Use bfloat16 for faster computation if supported
                    bnb_4bit_use_double_quant=False,
                )
                print("Using 4-bit quantization.")
            except Exception as e:
                print(f"Could not set up 4-bit quantization, proceeding without it: {e}")
                bnb_config = None # Fallback if BitsAndBytesConfig fails
        elif use_4bit:
            print("4-bit quantization requires a CUDA GPU. Proceeding without it.")


        tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL_NAME)

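        # Determine torch_dtype based on device and availability
        if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
            model_dtype = torch.bfloat16
            print("Using bfloat16 dtype.")
        elif torch.cuda.is_available():
            model_dtype = torch.float16 # Half precision for GPUs without bfloat16 support
            print("Using float16 dtype.")
        else:
            model_dtype = torch.float32 # float16 is poorly supported on CPU, so stay in full precision
            print("Using float32 dtype on CPU.")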


        model = AutoModelForCausalLM.from_pretrained(
            LLM_MODEL_NAME,
            device_map="auto",  # Automatically map model layers to available devices (GPU/CPU/Disk)
            torch_dtype=model_dtype,
            quantization_config=bnb_config, # Apply 4-bit config if defined
            trust_remote_code=True # Needed for some models like Phi-2
        )
        print("LLM loaded successfully.")

        # Create a text generation pipeline
        # Note: max_new_tokens limits the length of the *generated* response.
        text_generator = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            max_new_tokens=300, # Increased max tokens for potentially longer summaries/translations
            # temperature=0.7, # Controls randomness (higher = more random)
            # top_p=0.9,       # Nucleus sampling (considers top p% probability mass)
            do_sample=True,   # Enable sampling for more creative responses
            framework="pt" # Specify PyTorch framework
        )
        print("Text generation pipeline ready.")

    except Exception as e:
        print(f"Error loading LLM or creating pipeline: {e}")
        raise

# --- Call the loading function once at the start ---
try:
    load_models()
except Exception as e:
    print(f"Failed to load models. The application might not work correctly. Error: {e}")
    # Optionally exit or handle this case more gracefully depending on deployment context
    # exit()



### 📄 Part 5: Document Processing & Vector Store Creation

In this step, we:

1. **Extract text** from the uploaded PDF
2. **Split the text into manageable chunks** using a sliding window approach
3. **Generate embeddings** for each chunk using the embedding model
4. **Store the embeddings** in a FAISS index for fast similarity-based retrieval

This sets the foundation for enabling semantic search over the document content.
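
As a minimal sketch of the sliding-window step, the helper below shows how consecutive chunks share `chunk_overlap` characters. The name `sliding_window_chunks` and the tiny sizes are illustrative only; the project code further down uses 700-character chunks with a 70-character overlap.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into chunks of at most chunk_size characters that overlap by chunk_overlap."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        # Without this check the window would never advance.
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so a sentence cut at a chunk boundary still appears intact in the neighbouring chunk as long as it is shorter than the overlap.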

In [None]:

# Document Processing and Vector Store Creation

def load_and_chunk_pdf(file_path, chunk_size=700, chunk_overlap=70):
    """
    Loads text from a PDF file, cleans it, and splits it into overlapping chunks.

    Args:
        file_path (str): Path to the PDF file.
        chunk_size (int): The maximum size of each text chunk, in characters.
        chunk_overlap (int): The number of characters to overlap between consecutive chunks.

    Returns:
        list[str]: A list of text chunks, or None if processing fails.
    """
    if not file_path or not os.path.exists(file_path):
        print(f"Error: PDF file not found at {file_path}")
        return None
    try:
        print(f"Loading PDF: {file_path}")
        reader = PdfReader(file_path)
        text = ""
        for i, page in enumerate(reader.pages):
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n" # Add newline between pages
            else:
                print(f"Warning: Could not extract text from page {i+1}.")

        if not text:
            print("Error: No text extracted from the PDF.")
            return None

        # Basic cleaning: collapse every run of whitespace (spaces, tabs, newlines) into a single space
        text = re.sub(r'\s+', ' ', text).strip()

        # Simple sliding window chunking
        step = chunk_size - chunk_overlap
        if step <= 0:
            # A non-positive step would never advance the window (infinite loop)
            print("Error: chunk_overlap must be smaller than chunk_size.")
            return None
        chunks = []
        for start_index in range(0, len(text), step):
            chunks.append(text[start_index:start_index + chunk_size])

        # Filter out very short chunks that might result from the end of the text
        chunks = [chunk for chunk in chunks if len(chunk.strip()) > 50]

        print(f"Document loaded and split into {len(chunks)} chunks.")
        return chunks
    except Exception as e:
        print(f"Error loading or chunking PDF '{file_path}': {e}")
        return None

def build_vector_store(chunks, embedder_model):
    """
    Generates embeddings for text chunks and builds a FAISS index for fast retrieval.

    Args:
        chunks (list[str]): The list of text chunks.
        embedder_model: The loaded SentenceTransformer model.

    Returns:
        tuple(faiss.Index, list[str]): The FAISS index and the original chunks, or (None, None) if failed.
    """
    if not chunks or embedder_model is None:
        print("Error: No chunks or embedder model provided for vector store creation.")
        return None, None
    try:
        print(f"Generating embeddings for {len(chunks)} chunks...")
        # Generate embeddings (move embedder to CPU temporarily if it's on GPU and causes memory issues here)
        # This depends on available VRAM vs. embedding model size.
        # For smaller models like MiniLM, GPU is usually fine.
        embeddings = embedder_model.encode(chunks, convert_to_tensor=False, show_progress_bar=True) # Get numpy arrays directly

        # Ensure embeddings are float32 for FAISS
        embeddings_np = np.array(embeddings).astype('float32')

        # Create FAISS index (using L2 distance for similarity)
        embedding_dim = embeddings_np.shape[1]
        index = faiss.IndexFlatL2(embedding_dim)
        index.add(embeddings_np)

        print(f"FAISS index created successfully with {index.ntotal} vectors.")
        return index, chunks # Return chunks for easy lookup later
    except Exception as e:
        print(f"Error building vector store: {e}")
        return None, None

### 🔎 Part 6: Semantic Retrieval from Vector Store

Once the PDF has been processed and indexed, we can perform **semantic search** to retrieve the most relevant content for the user's input.

This function retrieves the top `k` chunks from the vector store that are most semantically similar to the query using **FAISS similarity search**.
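
To make the search concrete: a flat L2 index compares the query vector with every stored vector and returns the nearest ones, ranked by squared Euclidean distance. The plain-Python sketch below imitates that brute-force search on toy 2-D vectors. The function `top_k_l2` and the vectors are hypothetical, and FAISS itself is not used here.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_l2(query, vectors, k):
    """Return (index, squared distance) pairs for the k vectors nearest to query."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
nearest = top_k_l2([1.0, 1.0], toy_vectors, k=2)
print(nearest)  # [(1, 1.0), (0, 2.0)]
```

The returned indices are what let the retrieval function look up the original text chunks, which is why the chunk list must stay in the same order as the vectors added to the index.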


In [None]:


# Retrieval

def retrieve_context(query, vector_store, embedder_model, indexed_chunks, top_k=3):
    """
    Retrieves the top_k most relevant text chunks from the vector store based on the query.

    Args:
        query (str): The user's query.
        vector_store (faiss.Index): The FAISS index.
        embedder_model: The loaded SentenceTransformer model.
        indexed_chunks (list[str]): The original chunks corresponding to the index vectors.
        top_k (int): The number of relevant chunks to retrieve.

    Returns:
        str: A string containing the concatenated relevant chunks, or an error message.
    """
    if vector_store is None or embedder_model is None or indexed_chunks is None:
        return "Error: Vector store, embedder, or chunks not initialized."
    try:
        print(f"Retrieving context for query: '{query}'")
        # Generate embedding for the query
        query_embedding = embedder_model.encode([query], convert_to_tensor=False) # Get numpy array
        query_embedding_np = np.array(query_embedding).astype('float32')

        # Search the FAISS index
        distances, indices = vector_store.search(query_embedding_np, top_k)

        # Get the actual text chunks based on the indices
        retrieved_chunks = [indexed_chunks[i] for i in indices[0] if 0 <= i < len(indexed_chunks)]

        if not retrieved_chunks:
            print("Warning: No relevant chunks found for the query.")
            return "Could not find relevant context for this query in the document."

        # Combine the chunks into a single context string
        context = "\n\n---\n\n".join(retrieved_chunks) # Separate chunks clearly
        print(f"Retrieved {len(retrieved_chunks)} chunks.")
        return context
    except Exception as e:
        print(f"Error during context retrieval: {e}")
        return f"Error retrieving context: {e}"



### 🧠 Part 7: RAG Response Generation with Prompt Templates

We now define task-specific **prompt templates** that guide the language model to produce accurate, contextual responses. This is the generation half of **Retrieval-Augmented Generation (RAG)**: the retrieved chunks are inserted into the prompt as context, and the template tells the model how to use them.
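
One detail worth seeing in isolation is how a template can mix a value that is known up front (the target language) with a placeholder that is filled in later (the retrieved context). The sketch below uses a shortened, made-up template; `target_language` stands in for the notebook's `TARGET_LANGUAGE`.

```python
target_language = "Finnish"  # stands in for TARGET_LANGUAGE

# f-string: {target_language} is substituted immediately,
# while the doubled braces {{context}} become a literal {context}.
template = f"Translate into {target_language}.\n\nCONTEXT:\n{{context}}\n\nASSISTANT:"

# Later, str.format fills the remaining placeholder.
# Keyword arguments the template does not use (here: query) are ignored.
prompt = template.format(context="Hello, world.", query="unused")
print(prompt)
```

Because unused keyword arguments are ignored, the same `.format(context=..., query=...)` call works for templates with and without a `{query}` placeholder.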


In [None]:
# RAG Generation (with Task-Specific Prompts)

# --- Prompt Templates ---
# Define templates for different tasks. These guide the LLM on how to use the context.
# Note: Gemma-Instruct uses a specific chat format. The prompts below are plain strings
#       in a generic SYSTEM/USER/ASSISTANT layout, which the pipeline passes to the model
#       unchanged; for optimal results, using the model's chat template directly might be better.

QA_PROMPT_TEMPLATE = """SYSTEM: Use the following context to answer the question concisely.
If the answer is not found in the context, state that you cannot answer based on the provided information. Do not make up information.

CONTEXT:
{context}

USER: {query}

ASSISTANT:"""


SUMMARIZATION_PROMPT_TEMPLATE = """SYSTEM: Based *only* on the following text, provide a concise summary. Focus on the main points.

CONTEXT:
{context}

USER: Provide a summary of the context above.

ASSISTANT: Summary:"""

# This template is an f-string so that TARGET_LANGUAGE is filled in right away.
# The doubled braces {{context}} leave a literal {context} placeholder for the .format() call later.
TRANSLATION_PROMPT_TEMPLATE = f"""SYSTEM: Translate the following text accurately into {TARGET_LANGUAGE}. Provide only the translation.

CONTEXT:
{{context}}

USER: Translate the context above into {TARGET_LANGUAGE}.

ASSISTANT: Translation ({TARGET_LANGUAGE}):"""


def generate_response(query, context, task_prompt_template, generator_pipeline):
    """
    Generates a response using the LLM pipeline based on the query, retrieved context,
    and a task-specific prompt template.

    Args:
        query (str): The original user query (used within the prompt template).
        context (str): The retrieved context chunks.
        task_prompt_template (str): The prompt template for the specific task.
        generator_pipeline: The Hugging Face text-generation pipeline.

    Returns:
        str: The generated response from the LLM, or an error message.
    """
    if generator_pipeline is None:
        return "Error: Text generation pipeline not initialized."
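    # Check if context retrieval itself returned an error or no context.
    # Match the prefixes that retrieve_context() produces; a substring test would also
    # reject valid context that merely contains the text "Error:".
    retrieval_failure_prefixes = ("Error:", "Error retrieving context:", "Could not find relevant context")
    if not context or context.startswith(retrieval_failure_prefixes):
        return f"Cannot generate response because context retrieval failed: {context}"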

    try:
        # Format the final prompt using the template, context, and query
        # The .format() call will correctly substitute the actual 'context' variable
        # into the {context} placeholder within the template string.
        prompt = task_prompt_template.format(context=context, query=query)

        print("Generating response with LLM...")
        # print(f"--- Prompt Start ---\n{prompt}\n--- Prompt End ---") # Uncomment for debugging

        # Use the pipeline for generation
        outputs = generator_pipeline(prompt)
        generated_text = outputs[0]['generated_text']

        # Clean up the response: Remove the prompt from the generated text
        assistant_marker = "ASSISTANT:"
        marker_pos = generated_text.rfind(assistant_marker)
        if marker_pos != -1:
            response = generated_text[marker_pos + len(assistant_marker):].strip()
            # Further cleanup for specific tasks if needed
            if task_prompt_template == SUMMARIZATION_PROMPT_TEMPLATE and response.startswith("Summary:"):
                 response = response[len("Summary:"):].strip()
            # Check against the dynamically formatted TARGET_LANGUAGE string
            translation_marker = f"Translation ({TARGET_LANGUAGE}):"
            if task_prompt_template == TRANSLATION_PROMPT_TEMPLATE and response.startswith(translation_marker):
                 response = response[len(translation_marker):].strip()

        else:
             # Fallback cleanup if ASSISTANT marker isn't found
             response = generated_text.replace(prompt, "").strip()


        print("LLM Generation complete.")
        return response

    except Exception as e:
        print(f"Error during LLM generation: {e}")
        if "CUDA out of memory" in str(e):
             return "Error: GPU out of memory during generation. Try a smaller model, shorter document, or enable 4-bit quantization if not already active."
        return f"Error generating response: {e}"






### 🧠 Part 8: Gradio Interface Logic – Backend Functionality

This section handles all the logic that powers our Gradio interface.

When a user uploads a PDF and selects a task, the `process_document_and_query` function:
1. Processes the uploaded document if it hasn't been processed already
2. Builds or retrieves the document’s vector store
3. Uses semantic search to retrieve relevant content
4. Generates a task-specific response using a prompt and the language model
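
The "if it hasn't been processed already" part is a simple cache keyed on the file path. The sketch below isolates that idea; `DocumentCache` and `fake_build` are hypothetical names used only here, whereas the notebook keeps the same information in a plain dictionary.

```python
class DocumentCache:
    """Remember what was built for the most recently seen file path."""

    def __init__(self):
        self.file_path = None
        self.payload = None

    def get_or_build(self, file_path, build):
        # Rebuild only when a different file arrives.
        if file_path != self.file_path:
            self.payload = build(file_path)  # expensive step: chunk, embed, index
            self.file_path = file_path
        return self.payload


build_calls = []

def fake_build(path):
    # Stand-in for chunking the PDF and building the FAISS index.
    build_calls.append(path)
    return f"index for {path}"

cache = DocumentCache()
cache.get_or_build("a.pdf", fake_build)
cache.get_or_build("a.pdf", fake_build)  # same file: reused, no rebuild
cache.get_or_build("b.pdf", fake_build)  # new file: rebuilt
print(build_calls)  # ['a.pdf', 'b.pdf']
```

Keeping a single entry is enough for a one-user demo; an app serving several users at once would need per-session state instead of one shared cache.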

In [None]:
#  Gradio Interface Logic

# --- Store vector store state IN MEMORY (Simple approach for demo) ---
document_state = {
    "file_path": None,
    "vector_store": None,
    "indexed_chunks": None
}

def process_document_and_query(file_obj, task, query):
    """
    Main function called by the Gradio interface.
    Handles PDF processing, vector store creation/update, context retrieval,
    and response generation based on the selected task.
    """
    global document_state # Use the global state

    status_message = ""
    result_output = ""

    # --- Step 1: Check Models ---
    if embedder is None or text_generator is None:
        status_message = "Error: Models not loaded. Please check console."
        print(status_message)
        return status_message, result_output

    # --- Step 2: Process Document (if new or not processed) ---
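    # gr.File may pass a plain path string or a file-like object with a .name
    # attribute, depending on the Gradio version; this handles both.
    current_file_path = getattr(file_obj, "name", file_obj) if file_obj else None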

    if current_file_path is None:
        status_message = "Please upload a PDF document."
        return status_message, result_output

    if current_file_path != document_state.get("file_path"):
        status_message = f"Processing new document: {os.path.basename(current_file_path)}..."
        print(status_message)
        document_state = {"file_path": None, "vector_store": None, "indexed_chunks": None} # Reset state

        chunks = load_and_chunk_pdf(current_file_path)
        if chunks is None:
            status_message = "Error: Failed to load or chunk the PDF."
            print(status_message)
            return status_message, result_output

        vector_store, indexed_chunks = build_vector_store(chunks, embedder)
        if vector_store is None:
            status_message = "Error: Failed to build the vector store."
            print(status_message)
            return status_message, result_output

        document_state["file_path"] = current_file_path
        document_state["vector_store"] = vector_store
        document_state["indexed_chunks"] = indexed_chunks
        status_message = "Document processed successfully. Ready for tasks."
        print(status_message)
    else:
        status_message = f"Using previously processed document: {os.path.basename(current_file_path)}"
        print(status_message)
        vector_store = document_state["vector_store"]
        indexed_chunks = document_state["indexed_chunks"]
        if vector_store is None or indexed_chunks is None:
             status_message = "Error: Document state is invalid. Please re-upload."
             print(status_message)
             document_state = {"file_path": None, "vector_store": None, "indexed_chunks": None} # Reset
             return status_message, result_output


    # --- Step 3: Perform Selected Task ---
    print(f"Task selected: {task}")

    # Determine prompt template, context query, and top_k based on task
    if task == "Ask a question":
        if not query:
            status_message += "\nPlease enter a question."
            return status_message, result_output
        prompt_template = QA_PROMPT_TEMPLATE
        context_query = query
        top_k = 3
    elif task == "Summarize":
        prompt_template = SUMMARIZATION_PROMPT_TEMPLATE
        context_query = "Provide a comprehensive overview of the document's content."
        top_k = 6
    # Check the task name including the language from the dropdown choices
    elif task == f"Translate (to {TARGET_LANGUAGE})":
        prompt_template = TRANSLATION_PROMPT_TEMPLATE
        if not query:
            # Ensure indexed_chunks is not empty before accessing index 0
            if not indexed_chunks:
                 status_message += "\nError: Cannot translate without document content."
                 return status_message, result_output
            context_query = indexed_chunks[0][:150] + "..."
            print(f"No specific query for translation, using first chunk topic: '{context_query}'")
        else:
            context_query = query
        top_k = 3
    else:
        status_message += "\nError: Invalid task selected."
        return status_message, result_output

    # Retrieve context
    context = retrieve_context(context_query, vector_store, embedder, indexed_chunks, top_k=top_k)

    # Generate response (handle potential errors from retrieval)
    retrieval_failure_prefixes = ("Error:", "Error retrieving context:", "Could not find relevant context")
    if context.startswith(retrieval_failure_prefixes):
        result_output = f"Failed to retrieve context: {context}" # Show retrieval error
        status_message += f"\nTask '{task}' failed during retrieval."
    else:
        result_output = generate_response(query, context, prompt_template, text_generator)
        status_message += f"\nTask '{task}' completed."

    print(status_message)

    return status_message, result_output


### 🖥️ Part 9: Create and Launch the Gradio Interface

Now that we have all the logic in place for document processing, retrieval, and generation, let's wrap everything up into a user-friendly **Gradio interface**.

This interface allows users to:
- Upload a PDF document
- Choose a task: Question Answering, Summarization, or Translation
- Enter a relevant query or topic
- Receive a smart, LLM-generated response based on the document content

In [None]:
# Create and Launch the Gradio Interface

print("Setting up Gradio interface...")

# Define Gradio components
file_input = gr.File(label="Upload PDF Document", file_types=[".pdf"])
task_dropdown = gr.Dropdown(
    label="Select Task",
    # Ensure the choices list matches the checks in process_document_and_query
    choices=["Ask a question", "Summarize", f"Translate (to {TARGET_LANGUAGE})"],
    value="Ask a question"
)
query_input = gr.Textbox(
    label="Enter Question or Topic",
    info="Required for 'Ask a question'. Optional for 'Translate' (specifies topic) or 'Summarize' (ignored)."
)
status_output = gr.Textbox(label="Status", interactive=False)
result_output = gr.Textbox(label="Result", lines=15, interactive=False)

# Create the interface
iface = gr.Interface(
    fn=process_document_and_query,
    inputs=[
        file_input,
        task_dropdown,
        query_input
    ],
    outputs=[
        status_output,
        result_output
    ],
    title="Smart Document Helper (RAG + Open Models)",
    description=f"Upload a PDF, select a task (Q&A, Summarize, Translate to {TARGET_LANGUAGE}), and enter a query if needed.\n"
                f"Uses '{EMBEDDING_MODEL_NAME}' for retrieval and '{LLM_MODEL_NAME}' for generation.",
    allow_flagging='never',
    examples=[
        [None, "Ask a question", "What is the main purpose of this document?"],
        [None, "Summarize", ""],
        [None, f"Translate (to {TARGET_LANGUAGE})", "Explain the methodology used."]
    ],
    cache_examples=False
)

print("Launching Gradio interface...")
if __name__ == "__main__":
    iface.launch(debug=False, share=True)

print("Gradio setup complete. Interface should be running.")
