## ***🧩 Element Selection Strategy for Unstructured Documents Chunk Clubbing***

| **Element** | **Type (to maintain in chunk)** | **Importance** | **Reason / Description** | **Action to Take** |
|--------------|--------------------------------|----------------|---------------------------|--------------------|
| **Text** | `Text` | ✅ **Highly Important** | Contains main document content, including paragraphs, inline formulas, and contextual information. | Always include during extraction and RAG processing. |
| **Table** | `Table` | ✅ **Highly Important** | Holds structured data such as metrics, comparisons, and datasets. | Always retain and preserve cell structure if possible. |
| **Image + FigureCaption** | `Image+Caption` | ✅ **Important (Combined)** | Images provide visual info; FigureCaptions describe the image context. | Combine both: keep the image and attach the caption as description or metadata. |
| **Formula** | `Formula` | ⚙️ **Not Needed Separately** | Formulas are often embedded inline within text; separate extraction is redundant. | Skip separate extraction; rely on text content. |
| **ListItem** | `ListItem` | ⚙️ **Not Needed** | Lists are already represented within text blocks. | Exclude individual list items. |
| **NarrativeText** | `NarrativeText` | ⚙️ **Not Needed** | Narrative text overlaps with the main text content. | Do not extract separately. |
| **Footer** | `Footer` | ✅ **Very Important** | Often includes metadata like page numbers, document versions, and timestamps. | Extract and store separately when available. |

---

## 🏁 **Conclusion**

| **Keep / Exclude** | **Elements** | **Type to Maintain** | **Notes** |
|---------------------|--------------|----------------------|------------|
| ✅ **Keep** | **Text**, **Table**, **Image + FigureCaption (combined)**, **Footer** | `Text`, `Table`, `Image+Caption`, `Footer` | These carry the most relevant and non-redundant information. |
| ❌ **Exclude** | **Formula**, **ListItem**, **NarrativeText** | `Formula`, `ListItem`, `NarrativeText` | These are redundant or already captured within text content. |

---

### ✅ **Final Recommendation**
> Focus on the following elements for your RAG or document extraction pipeline:
> - **Text** → Type: `Text`
> - **Table** → Type: `Table`
> - **Image + FigureCaption (combined)** → Type: `Image+Caption`
> - **Footer** → Type: `Footer`
>
> Maintain the **type field** in each chunk so you always know what kind of content it contains.  
> This improves traceability, retrieval accuracy, and contextual organization across your RAG workflow.
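
As a rough illustration of the recommendation above, the sketch below filters a list of `(type, content)` pairs down to the kept types and carries the type along in each chunk. The `Chunk` dataclass, `KEEP_TYPES`, and `select_chunks` are hypothetical names for this example only; they are not part of unstructured or LangChain.

```python
from dataclasses import dataclass

# Element types kept by the selection table above.
KEEP_TYPES = {"Text", "Table", "Image+Caption", "Footer"}


@dataclass
class Chunk:
    type: str     # one of KEEP_TYPES, stored so retrieval can filter on it
    content: str


def select_chunks(elements: list[tuple[str, str]]) -> list[Chunk]:
    """Keep only (type, content) pairs whose type is in KEEP_TYPES."""
    return [Chunk(t, c) for t, c in elements if t in KEEP_TYPES]


sample = [
    ("Text", "An attention function maps a query to an output."),
    ("ListItem", "first bullet"),
    ("Table", "<table>...</table>"),
    ("Formula", "softmax(QK^T)"),
    ("Footer", "Page 3"),
]
chunks = select_chunks(sample)
print([c.type for c in chunks])
```

Keeping the type as a plain string field makes it easy to copy into vector-store metadata later.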


### ***Imports Required***

In [2]:
import os
import json
import pickle
import base64
from pathlib import Path
from typing import List
from dotenv import load_dotenv

# Unstructured for document parsing
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import Element

# LangChain components
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma

# Load environment variables
load_dotenv()

True

### ***Partition of documents***

In [None]:
import os
from pathlib import Path
from typing import List

def partition_document_launcher(
    file_path: str,
    max_characters: int,
    new_after_n_chars: int,
    combine_text_under_n_chars: int,
    extract_images: bool = False,
    extract_tables: bool = False,
    languages: List[str] = ['eng']
):
    """
    Extract elements from PDF using unstructured library.
    
    Args:
        file_path: Path to the PDF file to process (REQUIRED)
        max_characters: Maximum characters per chunk (REQUIRED)
        new_after_n_chars: Start new chunk after this many characters (REQUIRED)
        combine_text_under_n_chars: Combine small text blocks under this count (REQUIRED)
        extract_images: Whether to extract images from the PDF 'True' or 'False'
        extract_tables: Whether to infer table structure 'True' or 'False'
        languages: List of language codes (defaults to ['eng'])
    
    Returns:
        List of extracted elements from the PDF
    
    Raises:
        FileNotFoundError: If the PDF file doesn't exist
        ValueError: If invalid parameters are provided
    """
    # Validate input file
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"PDF file not found: {file_path}")
    
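    # Validate chunk parameters
    # NOTE: unstructured uses max_characters as the hard limit and new_after_n_chars
    # as the soft limit, so a soft limit above the hard limit behaves like the hard limit.
    if max_characters >= new_after_n_chars:
        raise ValueError("max_characters must be less than new_after_n_chars")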
    
    # Set image output directory (fixed path)
    image_output_dir = r"D:\MultiModulRag\Backend\SmartPipelinedef\Images"
    
    # Create image directory if extracting images
    if extract_images:
        Path(image_output_dir).mkdir(parents=True, exist_ok=True)
    
    print(f"📄 Partitioning document: {file_path}")
    print(f"⚙️  Settings: Images={extract_images}, Tables={extract_tables}, Languages={languages}")
    print(f"📊 Chunk settings: max={max_characters}, new_after={new_after_n_chars}, combine={combine_text_under_n_chars}")

    elements = partition_pdf(
        ### File path is always required ###
        filename=file_path,

        ### Core parameters (Fixed Parameters) ###
        strategy="hi_res",
        hi_res_model_name="yolox",
        chunking_strategy="by_title",
        include_orig_elements=True,

        ### Language and extraction parameters ###
        languages=languages,  # Use the parameter instead of empty list
        
        ### Image extraction parameters ###
        extract_images_in_pdf=extract_images,
        extract_image_block_to_payload=extract_images,
        extract_image_block_output_dir=image_output_dir if extract_images else None,
        extract_image_block_types=["Image"] if extract_images else [],
        
        ### Table extraction ###
        infer_table_structure=extract_tables,  # Use the parameter
        
        ### Chunk parameters ###
        max_characters=max_characters,
        new_after_n_chars=new_after_n_chars,
        combine_text_under_n_chars=combine_text_under_n_chars,
    )
    
    print(f"✅ Extracted {len(elements)} elements")
    
    # Print element breakdown
    element_types = {}
    for elem in elements:
        elem_type = type(elem).__name__
        element_types[elem_type] = element_types.get(elem_type, 0) + 1
    print(f"📋 Element breakdown: {dict(element_types)}")
    
    return elements

In [18]:
checkpoint1 = partition_document_launcher(
    file_path=r"D:\MultiModulRag\docs\NIPS-2017-attention-is-all-you-need-Paper_removed.pdf",
    max_characters=3000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=200,
    extract_images=True,
    extract_tables=True,
    languages=['eng'],
)

📄 Partitioning document: D:\MultiModulRag\docs\NIPS-2017-attention-is-all-you-need-Paper_removed.pdf
⚙️  Settings: Images=True, Tables=True, Languages=['eng']
📊 Chunk settings: max=3000, new_after=3800, combine=200
✅ Extracted 4 elements
📋 Element breakdown: {'CompositeElement': 4}
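
The breakdown above only reports `CompositeElement`, because chunking wraps the original elements. A minimal sketch of counting the wrapped element types instead, assuming each chunk exposes `metadata.orig_elements` as the notebook shows further down; the `Image` and `NarrativeText` classes here are stand-ins for the unstructured ones, and `inner_type_counts` is a hypothetical helper name.

```python
from collections import Counter
from types import SimpleNamespace


# Stand-ins for the unstructured element classes (only the class name matters here).
class Image:
    pass


class NarrativeText:
    pass


def inner_type_counts(chunks) -> Counter:
    """Count the class names of the original elements wrapped inside each chunk."""
    counts = Counter()
    for chunk in chunks:
        orig = getattr(chunk.metadata, "orig_elements", None) or []
        counts.update(type(el).__name__ for el in orig)
    return counts


demo_chunks = [
    SimpleNamespace(metadata=SimpleNamespace(orig_elements=[Image(), NarrativeText()])),
    SimpleNamespace(metadata=SimpleNamespace(orig_elements=[NarrativeText()])),
    SimpleNamespace(metadata=SimpleNamespace(orig_elements=None)),
]
counts = inner_type_counts(demo_chunks)
print(dict(counts))
```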


### ***Checkpoint1***

In [19]:
import json
import pickle
from pathlib import Path
from unstructured.documents.elements import Element

def save_elements(elements, pkl_path: str, json_path: str = None):
    """
    Save `elements` to a pickle file.

    Args:
        elements: Python object to save (list, dict, etc.)
        pkl_path: Path to save the pickle file (required)
        json_path: Optional path for a JSON copy; only its parent directory is
            created here because the JSON export below is commented out
    """
    # Ensure parent directories exist
    Path(pkl_path).parent.mkdir(parents=True, exist_ok=True)
    if json_path:
        Path(json_path).parent.mkdir(parents=True, exist_ok=True)

    # Save as Pickle
    with open(pkl_path, "wb") as f:
        pickle.dump(elements, f)
    print(f"✅ Saved elements to pickle: {pkl_path}")

    # # Save as JSON (optional)
    # if json_path:
    #     # Convert Element objects to dicts automatically
    #     def to_serializable(el):
    #         return el.to_dict() if isinstance(el, Element) else el
        
    #     elements_serializable = [to_serializable(el) for el in elements]

    #     with open(json_path, "w", encoding="utf-8") as f:
    #         json.dump(elements_serializable, f, indent=4, ensure_ascii=False)
    #         print(f"✅ Saved elements to JSON: {json_path}")


# -----------------------------
# Example usage
# your Python variable, e.g., output of partition_pdf

pkl_file = r"D:\MultiModulRag\Backend\SmartPipelinedef\Pickel\checkpoint1.pkl"

save_elements(checkpoint1, pkl_file) 

✅ Saved elements to pickle: D:\MultiModulRag\Backend\SmartPipelinedef\Pickel\checkpoint1.pkl
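
The JSON branch in the cell above is commented out. A minimal sketch of what a standalone JSON export could look like, following the same `to_dict()` idea; `save_elements_json` and `FakeElement` are hypothetical names for this example, and the stand-in class only imitates an object that offers `to_dict()`.

```python
import json
import tempfile
from pathlib import Path


class FakeElement:
    """Stand-in for an object with a to_dict() method, as the commented code assumes."""

    def __init__(self, text: str):
        self.text = text

    def to_dict(self) -> dict:
        return {"type": "CompositeElement", "text": self.text}


def to_serializable(el):
    """Use to_dict() when available, otherwise pass the value through unchanged."""
    return el.to_dict() if hasattr(el, "to_dict") else el


def save_elements_json(elements, json_path) -> Path:
    """Write elements as a JSON array, creating parent directories if needed."""
    path = Path(json_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump([to_serializable(el) for el in elements], f, indent=4, ensure_ascii=False)
    return path


with tempfile.TemporaryDirectory() as tmp:
    out = save_elements_json(
        [FakeElement("Résumé"), {"note": "plain dict"}],
        Path(tmp) / "JSON" / "checkpoint1.json",
    )
    reloaded = json.loads(out.read_text(encoding="utf-8"))
print(reloaded)
```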


### ***Loading Checkpoint1 Pickle***

In [3]:
pkl_path = r"D:\MultiModulRag\Backend\SmartPipelinedef\Pickel\checkpoint1.pkl"
with open(pkl_path, "rb") as f:
    loaded1 = pickle.load(f)
print(f"✅ Loaded pickle has {len(loaded1)} elements")

✅ Loaded pickle has 4 elements


In [6]:
loaded1[0].metadata.orig_elements


[<unstructured.documents.elements.Image at 0x19013d75310>,
 <unstructured.documents.elements.FigureCaption at 0x19013d75c50>,
 <unstructured.documents.elements.NarrativeText at 0x19013d75e10>,
 <unstructured.documents.elements.NarrativeText at 0x19013d761d0>]
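
The listing above shows an `Image` immediately followed by its `FigureCaption`. The sketch below illustrates one way to fold such a pair into a single `Image+Caption` record, as the selection table recommends. It assumes the caption directly follows the image, which holds for this chunk but is not guaranteed in general; the classes are stand-ins for the unstructured ones and `merge_image_captions` is a hypothetical helper.

```python
from dataclasses import dataclass


# Stand-ins named like the unstructured element types; real elements
# could be matched the same way through type(el).__name__.
@dataclass
class Image:
    text: str = ""


@dataclass
class FigureCaption:
    text: str = ""


@dataclass
class NarrativeText:
    text: str = ""


def merge_image_captions(elements) -> list[dict]:
    """Fold each Image and the FigureCaption right after it into one record."""
    merged = []
    i = 0
    while i < len(elements):
        el = elements[i]
        name = type(el).__name__
        nxt = elements[i + 1] if i + 1 < len(elements) else None
        if name == "Image":
            has_caption = nxt is not None and type(nxt).__name__ == "FigureCaption"
            merged.append({"type": "Image+Caption", "caption": nxt.text if has_caption else ""})
            i += 2 if has_caption else 1
        else:
            merged.append({"type": name, "text": el.text})
            i += 1
    return merged


orig_elements = [
    Image(),
    FigureCaption("Figure 1: The Transformer - model architecture."),
    NarrativeText("first paragraph"),
    NarrativeText("second paragraph"),
]
merged = merge_image_captions(orig_elements)
print([m["type"] for m in merged])
```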

### ***Combining Content & Generating AI Embeddings***

##### ***Clean Image Directory Function***

In [22]:
def clean_image_directory(image_dir: str) -> None:
    """Clean existing images from directory"""
    Path(image_dir).mkdir(parents=True, exist_ok=True)
    
    for file in Path(image_dir).glob("*"):
        if file.is_file():
            try:
                file.unlink()
                print(f"     🗑️  Deleted old image: {file.name}")
            except Exception as e:
                print(f"     ⚠️  Could not delete {file.name}: {e}")

##### ***Separate content types and save Base64 images to the image directory***

In [23]:
import os
import base64
from typing import Any

def separate_content_types(chunk, image_dir: str, image_counter: dict) -> dict[str, Any]:
    """
    Analyze chunk content and extract text, tables, and images.
    Uses mutable dict to track image counter across calls.
    """
    content_data = {
        'text': chunk.text,
        'tables': [],
        'image_base64': [],
        'images_dirpath': [],  # now will hold folder + filename
        'page_no': [],
        'types': ['text']
    }

    for element in chunk.metadata.orig_elements:
        element_type = type(element).__name__

        # Handle page numbers
        if 'metadata' in element.to_dict():
            page_no = element.to_dict()['metadata'].get('page_number')
            if page_no and page_no not in content_data['page_no']:
                content_data['page_no'].append(page_no)

        # Handle tables
        if element_type == 'Table':
            if 'table' not in content_data['types']:
                content_data['types'].append('table')
            # Fall back to plain text when no HTML rendering is present (attribute missing or None)
            table_html = getattr(element.metadata, 'text_as_html', None) or element.text
            content_data['tables'].append(table_html)

        # Handle images
        elif element_type == 'Image':
            # Only treat the element as an image when a Base64 payload is actually present
            if getattr(element.metadata, 'image_base64', None):
                if 'image' not in content_data['types']:
                    content_data['types'].append('image')

                image_base64 = element.metadata.image_base64

                try:
                    # Generate filename and path
                    image_filename = f"image_{image_counter['count']:04d}.png"
                    image_path = os.path.join(image_dir, image_filename)

                    # Decode and save image
                    with open(image_path, "wb") as img_file:
                        img_file.write(base64.b64decode(image_base64))

                    # ✅ Store relative folder + filename (e.g. "Images/image_0001.png")
                    folder_name = os.path.basename(image_dir.rstrip(os.sep))
                    relative_path = os.path.join(folder_name, image_filename).replace("\\", "/")
                    content_data['images_dirpath'].append(relative_path)

                    # Keep base64 for AI processing
                    content_data['image_base64'].append(image_base64)

                    print(f"     ✅ Saved: {relative_path}")
                    image_counter['count'] += 1

                except Exception as e:
                    print(f"     ❌ Failed to save image {image_counter['count']}: {e}")

    return content_data
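
The image branch above decodes a Base64 payload and writes it to a numbered PNG file. The self-contained sketch below reproduces just that step in a temporary directory so it can be tried without a PDF; `save_base64_image` is a hypothetical helper and the payload is dummy bytes rather than a real image.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_image(image_b64: str, out_dir: Path, index: int) -> Path:
    """Decode a Base64 string and write it as image_<index>.png inside out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"image_{index:04d}.png"
    path.write_bytes(base64.b64decode(image_b64))
    return path


# Dummy payload: the PNG signature followed by filler bytes (not a viewable image).
payload = b"\x89PNG\r\n\x1a\n" + b"demo-bytes"
encoded = base64.b64encode(payload).decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    saved = save_base64_image(encoded, Path(tmp) / "Images", 1)
    saved_name = saved.name
    round_trip_ok = saved.read_bytes() == payload
    # Same "folder/filename" form the notebook stores in images_dirpath.
    relative = f"{saved.parent.name}/{saved.name}"

print(saved_name, round_trip_ok, relative)
```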


##### ***Creating AI Summary for embeddings***

In [24]:
def create_ai_enhanced_summary(text: str, tables: list[str], images: list[str]) -> str:
    """Create AI-enhanced summary for mixed content"""
    
    try:
        llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro", temperature=0)
        
        # Build comprehensive prompt
        prompt_text = f"""You are creating a searchable description for document content retrieval.

CONTENT TO ANALYZE:

TEXT CONTENT:
{text}

"""
        
        # Add tables if present
        if tables:
            prompt_text += "TABLES:\n"
            for i, table in enumerate(tables, 1):
                prompt_text += f"Table {i}:\n{table}\n\n"
        
        # Add detailed instructions
        prompt_text += """
YOUR TASK:
Generate a comprehensive, searchable description that covers:

1. Key facts, numbers, and data points from text and tables
2. Main topics and concepts discussed  
3. Questions this content could answer
4. Visual content analysis (charts, diagrams, patterns in images)
5. Alternative search terms users might use

Make it detailed and searchable - prioritize findability over brevity.

OUTPUT FORMAT:
QUESTIONS: "List all potential questions that can be answered from this content (text, images, tables)"
SUMMARY: "Comprehensive summary of all data and information"
IMAGE_INTERPRETATION: "Detailed description of image content. If images are irrelevant or contain only decorative elements, state: ***DO NOT USE THIS IMAGE***"
TABLE_INTERPRETATION: "Detailed description of table content. If tables are irrelevant, state: ***DO NOT USE THIS TABLE***"

SEARCHABLE DESCRIPTION:"""

        # Build message with text and images
        message_content = [{"type": "text", "text": prompt_text}]
        
        # Add images to message
        for img_b64 in images:
            message_content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{img_b64}"}
            })
        
        # Invoke AI
        message = HumanMessage(content=message_content)
        response = llm.invoke([message])
        
        return response.content
        
    except Exception as e:
        print(f"     ❌ AI summary failed: {e}")
        # Fallback summary
        summary = f"{text[:300]}..."
        if tables:
            summary += f"\n[Contains {len(tables)} table(s)]"
        if images:
            summary += f"\n[Contains {len(images)} image(s)]"
        return summary
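
The prompt above asks the model for `QUESTIONS`, `SUMMARY`, `IMAGE_INTERPRETATION`, and `TABLE_INTERPRETATION` sections. A minimal sketch of splitting such a response into sections and checking for the `***DO NOT USE THIS IMAGE***` marker follows. It assumes each section label starts a line, optionally wrapped in `**`, which matches the previews printed in this notebook but is not guaranteed for every model response; `split_sections` is a hypothetical helper.

```python
import re

# Section labels requested in the prompt's OUTPUT FORMAT block.
SECTION_RE = re.compile(
    r"^\*{0,2}(QUESTIONS|SUMMARY|IMAGE_INTERPRETATION|TABLE_INTERPRETATION):\*{0,2}[ \t]*",
    re.MULTILINE,
)


def split_sections(summary: str) -> dict[str, str]:
    """Return the text that follows each recognized section label."""
    matches = list(SECTION_RE.finditer(summary))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(summary)
        sections[m.group(1)] = summary[m.end():end].strip()
    return sections


sample = (
    "**QUESTIONS:**\n"
    "* What is Multi-Head Attention?\n\n"
    "**SUMMARY:**\n"
    "Describes Multi-Head Attention.\n\n"
    "IMAGE_INTERPRETATION: ***DO NOT USE THIS IMAGE***\n"
)
sections = split_sections(sample)
image_usable = "DO NOT USE THIS IMAGE" not in sections.get("IMAGE_INTERPRETATION", "")
print(list(sections), image_usable)
```

Splitting the response this way would let the pipeline drop image paths from metadata when the model flags an image as decorative.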

##### ***Build the unified pipeline***

In [None]:
def summarise_chunks(chunks, image_dir: str = r"D:\MultiModulRag\Backend\SmartPipelinedef\Images") -> list[Document]:
    
    """
    Process all chunks with AI Summaries.

    Args:
        chunks: List of document chunks to process
        image_dir: Directory to store extracted images
        
    Returns:
        List of LangChain Documents with enhanced summaries
    """

    print("🧠 Processing chunks with AI Summaries...")
    
    # Clean image directory once
    clean_image_directory(image_dir)
    
    langchain_documents = []
    total_chunks = len(chunks)
    image_counter = {'count': 1}  # Use mutable dict to pass by reference
    
    for i, chunk in enumerate(chunks, 1):
        print(f"\n   📄 Processing chunk {i}/{total_chunks}")
        
        # Analyze chunk content
        content_data = separate_content_types(chunk, image_dir, image_counter)
        
        # Debug info
        print(f"     Types found: {', '.join(content_data['types'])}")
        print(f"     Tables: {len(content_data['tables'])}, Images: {len(content_data['image_base64'])}")
        if content_data['page_no']:
            print(f"     Pages: {content_data['page_no']}")
        
        # Create AI-enhanced summary for ALL chunks
        print(f"      Creating AI summary...")
        try:
            enhanced_content = create_ai_enhanced_summary(
                content_data['text'],
                content_data['tables'], 
                content_data['image_base64']
            )
            print(f"      AI summary created")
            print(f"     Preview: {enhanced_content[:150]}...")
        except Exception as e:
            print(f"      AI summary failed, using raw text: {e}")
            enhanced_content = content_data['text']
        
        # Create LangChain Document with metadata
        # Store image paths instead of base64 to reduce memory usage
        doc = Document(
            page_content=enhanced_content,
                # 'text': chunk.text,
                # 'tables': [],
                # 'images_base64': [],
                # 'images_dirpath': [],
                # 'page_no': [],
                # 'types': ['text']
            metadata={
                "chunk_index": i,
                "original_text": content_data['text'],
                "raw_tables_html": content_data['tables'],
                "image_paths": content_data['images_dirpath'],
                "page_numbers": content_data['page_no'],
                "content_types": content_data['types'],
                # "num_tables": len(content_data['tables']),
                # "num_images": len(content_data['images_dirpath']),
                # "original_content": json.dumps({
                # "raw_text": content_data['text'],
                # Don't store base64 in metadata to save space
                # "has_images": len(content_data['images_base64']) > 0
                # })
            }
        )
        
        langchain_documents.append(doc)
    
    print(f"\n✅ Successfully processed {len(langchain_documents)} chunks")
    print(f"📊 Total images saved: {image_counter['count'] - 1}")
    
    return langchain_documents

##### ***Run pipeline***

In [26]:
output = summarise_chunks(loaded1)

🧠 Processing chunks with AI Summaries...

   📄 Processing chunk 1/4
     ✅ Saved: Images/image_0001.png
     Types found: text, image
     Tables: 0, Images: 1
     Pages: [1]
      Creating AI summary...
      AI summary created
     Preview: QUESTIONS:
*   What is the model architecture of the Transformer?
*   What are the main components of the Transformer model?
*   How is the Transforme...

   📄 Processing chunk 2/4
     Types found: text
     Tables: 0, Images: 0
     Pages: [1]
      Creating AI summary...
      AI summary created
     Preview: **QUESTIONS:**
"What is an attention function?
How can an attention function be described?
What are the inputs and outputs of an attention function?
W...

   📄 Processing chunk 3/4
     ✅ Saved: Images/image_0002.png
     ✅ Saved: Images/image_0003.png
     Types found: text, image
     Tables: 0, Images: 2
     Pages: [1, 2]
      Creating AI summary...
      AI summary created
     Preview: **QUESTIONS:**
*   What is S

In [27]:
output[3].page_content

'**QUESTIONS:**\n"What is Multi-Head Attention?\nHow does Multi-Head Attention work?\nWhat is the benefit of using Multi-Head Attention over a single attention function?\nHow are queries, keys, and values processed in Multi-Head Attention?\nWhat are the dimensions dk, dv, and dmodel in the context of attention?\nHow many times are the queries, keys, and values projected?\nWhat happens to the outputs of the parallel attention functions?\nHow does Multi-Head Attention allow a model to attend to different representation subspaces?\nWhy does single-head attention inhibit attending to different subspaces?\nWhat is the statistical explanation for why dot products get large in attention mechanisms?\nAssuming query (q) and key (k) components are independent random variables with mean 0 and variance 1, what is the mean and variance of their dot product?\nWhat is the variance of the dot product q · k?\nWhat process is depicted in Figure 2?"\n\n**SUMMARY:**\n"This section, 3.2.2, describes the M

In [28]:
output[0].metadata['image_paths']

['Images/image_0001.png']

In [30]:
# Check if images were actually saved
image_dir = r"D:\MultiModulRag\Backend\SmartPipelinedef\Images"
saved_images = list(Path(image_dir).glob("*.png"))
print(f"\n📁 Images in directory: {len(saved_images)}")
for img in saved_images:
    print(f"   - {img.name}: {img.stat().st_size} bytes")


📁 Images in directory: 3
   - image_0001.png: 71463 bytes
   - image_0002.png: 12578 bytes
   - image_0003.png: 20672 bytes


### ***Checkpointing Output to Pickle & JSON***

In [32]:
import os, json, pickle

pkl_path = r"D:\MultiModulRag\Backend\SmartPipelinedef\Pickel\output.pkl"
json_path = r"D:\MultiModulRag\Backend\SmartPipelinedef\JSON\output.json"

os.makedirs(os.path.dirname(pkl_path), exist_ok=True)
os.makedirs(os.path.dirname(json_path), exist_ok=True)

# 💾 Save Pickle
with open(pkl_path, "wb") as f:
    pickle.dump(output, f)
print(f"✅ Pickle saved: {pkl_path}")

# 🧩 Convert to clean JSON format (including enhanced content)
clean_json = [
    {
        "chunk_index": doc.metadata.get("chunk_index"),
        "enhanced_content": getattr(doc, "page_content", ""),  # from Document
        "original_text": doc.metadata.get("original_text", ""),
        "raw_tables_html": doc.metadata.get("raw_tables_html", []),
        "image_paths": doc.metadata.get("image_paths", []),
        "page_numbers": doc.metadata.get("page_numbers", []),
        "content_types": doc.metadata.get("content_types", []),
    }
    for doc in output
]

# 💾 Save JSON
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(clean_json, f, indent=4, ensure_ascii=False)

print(f"✅ JSON saved: {json_path}")


✅ Pickle saved: D:\MultiModulRag\Backend\SmartPipelinedef\Pickel\output.pkl
✅ JSON saved: D:\MultiModulRag\Backend\SmartPipelinedef\JSON\output.json


### ***Loading Output Pickle***

In [33]:
pkl_dir = r"D:\MultiModulRag\Backend\SmartPipelinedef\Pickel\output.pkl"
with open(pkl_dir, 'rb') as f:
    loaded_docs = pickle.load(f)

# Now this will work:
print(f"📄 AI content produced: {loaded_docs[0].page_content[:200]}...")
print(f"📊 Metadata: {loaded_docs[0].metadata}")

📄 AI content produced: QUESTIONS:
*   What is the model architecture of the Transformer?
*   What are the main components of the Transformer model?
*   How is the Transformer's encoder structured?
*   How is the Transformer...
📊 Metadata: {'chunk_index': 1, 'original_text': 'Output Probabilities Add & Norm Feed Forward Add & Norm Multi-Head Attention a, Add & Norm Add & Norm Feed Forward Nx | -+CAgc8 Norm) Add & Norm Masked Multi-Head Multi-Head Attention Attention Se a, ee a, Positional Positional Encoding @ © @ Encoding Input Output Embedding Embedding Inputs Outputs (shifted right)\n\nFigure 1: The Transformer - model architecture.\n\nwise fully connected feed-forward network. We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layer

In [34]:
loaded_docs[0].metadata

{'chunk_index': 1,
 'original_text': 'Output Probabilities Add & Norm Feed Forward Add & Norm Multi-Head Attention a, Add & Norm Add & Norm Feed Forward Nx | -+CAgc8 Norm) Add & Norm Masked Multi-Head Multi-Head Attention Attention Se a, ee a, Positional Positional Encoding @ © @ Encoding Input Output Embedding Embedding Inputs Outputs (shifted right)\n\nFigure 1: The Transformer - model architecture.\n\nwise fully connected feed-forward network. We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.\n\nDecoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inse

### ***Storing in the Vector DB***

In [36]:
import json

def create_vector_store(documents, persist_directory=r"D:\MultiModulRag\Backend\SmartPipelinedef\chroma_db"):
    """Create and persist ChromaDB vector store"""
    print("🔮 Creating embeddings and storing in ChromaDB...")
    
    # Convert list metadata to JSON strings. Values that are already strings are
    # skipped, so re-running this cell on the same documents does not double-encode them.
    for doc in documents:
        for key in ("raw_tables_html", "image_paths", "page_numbers", "content_types"):
            value = doc.metadata.get(key)
            if value is not None and not isinstance(value, str):
                doc.metadata[key] = json.dumps(value)
    
    embedding_model = GoogleGenerativeAIEmbeddings(model="text-embedding-004")
    
    print("--- Creating vector store ---")
    vectorstore = Chroma.from_documents(
        documents=documents,
        embedding=embedding_model,
        persist_directory=persist_directory,
        collection_metadata={"hnsw:space": "cosine"}
    )
    print("--- Finished creating vector store ---")
    
    print(f"✅ Vector store created and saved to {persist_directory}")
    return vectorstore

# Create the vector store
db = create_vector_store(loaded_docs)

🔮 Creating embeddings and storing in ChromaDB...
--- Creating vector store ---
--- Finished creating vector store ---
✅ Vector store created and saved to D:\MultiModulRag\Backend\SmartPipelinedef\chroma_db
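
The cell above converts list-valued metadata to JSON strings on the documents themselves. As a hedged alternative, the sketch below does the same conversion on a copy of a plain metadata dict and pairs it with a decoder for the retrieval side. `flatten_metadata`, `restore_metadata`, and `LIST_KEYS` are hypothetical names; the key names are the ones used in this notebook.

```python
import json

# Metadata keys that hold lists in this notebook's Documents.
LIST_KEYS = ("raw_tables_html", "image_paths", "page_numbers", "content_types")


def flatten_metadata(metadata: dict) -> dict:
    """Return a copy with list-valued keys JSON-encoded; strings are left untouched."""
    flat = dict(metadata)
    for key in LIST_KEYS:
        if key in flat and not isinstance(flat[key], str):
            flat[key] = json.dumps(flat[key])
    return flat


def restore_metadata(flat: dict) -> dict:
    """Inverse of flatten_metadata: decode the JSON strings back into lists."""
    restored = dict(flat)
    for key in LIST_KEYS:
        if isinstance(restored.get(key), str):
            restored[key] = json.loads(restored[key])
    return restored


original = {
    "chunk_index": 1,
    "original_text": "Figure 1: The Transformer - model architecture.",
    "raw_tables_html": [],
    "image_paths": ["Images/image_0001.png"],
    "page_numbers": [1],
    "content_types": ["text", "image"],
}
flat_once = flatten_metadata(original)
flat_twice = flatten_metadata(flat_once)  # second pass changes nothing
print(flat_once["content_types"], restore_metadata(flat_twice) == original)
```

Working on a copy keeps the in-memory documents reusable for other steps after the store is built.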


### ***Retrieval from the Vector DB***

In [5]:
def ask_question(query: str, vectorstore_or_path=r"D:\MultiModulRag\Backend\Pipeline_Database\chroma_db", k: int = 2):
    """
    Ask a question and get answer with context
    
    Args:
        query: User question
        vectorstore_or_path: Either ChromaDB vectorstore object or path to persisted DB
        k: Number of chunks to retrieve
    """
    import json
    from pathlib import Path
    
    # Load vectorstore if path is provided
    if isinstance(vectorstore_or_path, (str, Path)):
        print(f"📂 Loading vector store from: {vectorstore_or_path}")
        embedding_model = GoogleGenerativeAIEmbeddings(model="text-embedding-004")
        vectorstore = Chroma(
            persist_directory=str(vectorstore_or_path),
            embedding_function=embedding_model
        )
    else:
        vectorstore = vectorstore_or_path
    
    print(f"🔍 Searching for: {query}\n")
    
    # Retrieve relevant chunks
    results = vectorstore.similarity_search(query, k=k)
    
    # Parse metadata and build context
    context_parts = []
    all_images = []
    
    for i, doc in enumerate(results, 1):
        print(f"📄 Chunk {i}:")
        print(f"   Pages: {json.loads(doc.metadata['page_numbers'])}")
        print(f"   Types: {json.loads(doc.metadata['content_types'])}")
        
        # Get metadata
        original_text = doc.metadata['original_text']
        tables = json.loads(doc.metadata['raw_tables_html'])
        images = json.loads(doc.metadata['image_paths'])
        
        # Build context from metadata
        chunk_context = f"Context {i}:\n"
        chunk_context += f"Text: {original_text}\n"
        
        if tables:
            print(f"   📊 Tables: {len(tables)}")
            chunk_context += f"\nTables:\n"
            for j, table in enumerate(tables, 1):
                chunk_context += f"Table {j}:\n{table}\n"
        
        if images:
            print(f"   🖼️ Images: {len(images)}")
            all_images.extend(images)
        
        context_parts.append(chunk_context)
        print()
    
    # Combine context
    combined_context = "\n\n".join(context_parts)
    
    # Generate answer
    llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro", temperature=0)
    
    prompt = f"""Based on the following context, answer the question.

Context:
{combined_context}

Question: {query}

Answer:"""
    
    print("🤖 Generating answer...\n")
    response = llm.invoke(prompt)
    
    print("💡 Answer:")
    print(response.content)
    
    return response.content

# Usage:
# Option 1: pass the vectorstore object created above
# answer = ask_question("What is the revenue?", vectorstore_or_path=db)

# Option 2: rely on the default path to the persisted DB
answer = ask_question("What is this pdf about and how to learn it?")

📂 Loading vector store from: D:\MultiModulRag\Backend\Pipeline_Database\chroma_db
🔍 Searching for: What is this pdf about and how to learn it?

📄 Chunk 1:
   Pages: [1, 2]
   Types: ["text", "image"]
   📊 Tables: 2
   🖼️ Images: 146

📄 Chunk 2:
   Pages: [1]
   Types: ["text", "image"]
   📊 Tables: 2
   🖼️ Images: 73

🤖 Generating answer...

💡 Answer:
Based on the provided context, here is an answer to your question.

### What is this PDF about?

This document describes the **Transformer model architecture**, a highly influential neural network design primarily used for natural language processing (NLP) and other sequence-to-sequence tasks. The paper it originates from is titled "Attention Is All You Need."

Based on the context, the key concepts covered are:

1.  **The Core Mechanism: Scaled Dot-Product Attention:** This is the fundamental building block of the Transformer. It's a mechanism that allows the model to weigh the importance of different wo
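
In the output above, `Types` prints with double quotes and the table and image counts are 2, 146, and 73, which looks like `len()` applied to JSON strings rather than lists. One possible cause is that the metadata in that store was JSON-encoded more than once; this is a guess from the printed values, not something the notebook confirms. A minimal sketch of a tolerant decoder and a context builder, with hypothetical helper names `decode_list` and `build_context`:

```python
import json


def decode_list(value):
    """Decode a JSON-encoded list, tolerating values that were encoded more than once.

    Raises json.JSONDecodeError if a string value is not valid JSON.
    """
    while isinstance(value, str):
        value = json.loads(value)
    return value


def build_context(metadatas: list[dict]) -> tuple[str, list[str]]:
    """Build the prompt context and a de-duplicated image list from chunk metadata."""
    parts, images = [], []
    for i, md in enumerate(metadatas, 1):
        tables = decode_list(md.get("raw_tables_html", "[]"))
        block = f"Context {i}:\nText: {md.get('original_text', '')}\n"
        if tables:
            block += "\nTables:\n"
            block += "".join(f"Table {j}:\n{t}\n" for j, t in enumerate(tables, 1))
        parts.append(block)
        images.extend(decode_list(md.get("image_paths", "[]")))
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return "\n\n".join(parts), list(dict.fromkeys(images))


sample = [
    {   # encoded once
        "original_text": "A",
        "raw_tables_html": json.dumps([]),
        "image_paths": json.dumps(["Images/image_0001.png"]),
    },
    {   # encoded twice
        "original_text": "B",
        "raw_tables_html": json.dumps(json.dumps(["<table/>"])),
        "image_paths": json.dumps(json.dumps(["Images/image_0001.png", "Images/image_0002.png"])),
    },
]
context, unique_images = build_context(sample)
print(unique_images)
```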

In [8]:
from pprint import pprint

pprint(answer)


('Based on the provided context, here is an answer to your question.\n'
 '\n'
 '### What is this PDF about?\n'
 '\n'
 'This document describes the **Transformer model architecture**, a highly '
 'influential neural network design primarily used for natural language '
 'processing (NLP) and other sequence-to-sequence tasks. The paper it '
 'originates from is titled "Attention Is All You Need."\n'
 '\n'
 'Based on the context, the key concepts covered are:\n'
 '\n'
 '1.  **The Core Mechanism: Scaled Dot-Product Attention:** This is the '
 "fundamental building block of the Transformer. It's a mechanism that allows "
 'the model to weigh the importance of different words (or parts of a '
 'sequence) when processing a particular word.\n'
 '    *   It operates on a set of **queries (Q)**, **keys (K)**, and **values '
 '(V)**.\n'
 '    *   The attention weights are calculated using the formula: '
 '`Attention(Q,K,V) = softmax( (Q * K^T) / √dk ) * V`.\n'
 '    *   The scaling factor (`1/‚à