## ***🧩 Element Selection Strategy for Unstructured Document Chunk Clubbing***

| **Element** | **Type (to maintain in chunk)** | **Importance** | **Reason / Description** | **Action to Take** |
|--------------|--------------------------------|----------------|---------------------------|--------------------|
| **Text** | `Text` | ✅ **Highly Important** | Contains main document content, including paragraphs, inline formulas, and contextual information. | Always include during extraction and RAG processing. |
| **Table** | `Table` | ✅ **Highly Important** | Holds structured data such as metrics, comparisons, and datasets. | Always retain and preserve cell structure if possible. |
| **Image + FigureCaption** | `Image+Caption` | ✅ **Important (Combined)** | Images provide visual info; FigureCaptions describe the image context. | Combine both: keep the image and attach the caption as description or metadata. |
| **Formula** | `Formula` | ⚙️ **Not Needed Separately** | Formulas are often embedded inline within text; separate extraction is redundant. | Skip separate extraction and rely on text content. |
| **ListItem** | `ListItem` | ⚙️ **Not Needed** | Lists are already represented within text blocks. | Exclude individual list items. |
| **NarrativeText** | `NarrativeText` | ⚙️ **Not Needed** | Narrative text overlaps with the main text content. | Do not extract separately. |
| **Footer** | `Footer` | ✅ **Very Important** | Often includes metadata like page numbers, document versions, and timestamps. | Extract and store separately when available. |

---

## 🏁 **Conclusion**

| **Keep / Exclude** | **Elements** | **Type to Maintain** | **Notes** |
|---------------------|--------------|----------------------|------------|
| ✅ **Keep** | **Text**, **Table**, **Image + FigureCaption (combined)**, **Footer** | `Text`, `Table`, `Image+Caption`, `Footer` | These carry the most relevant and non-redundant information. |
| ❌ **Exclude** | **Formula**, **ListItem**, **NarrativeText** | `Formula`, `ListItem`, `NarrativeText` | These are redundant or already captured within text content. |

---

### ✅ **Final Recommendation**
> Focus on the following elements for your RAG or document extraction pipeline:
> - **Text** → Type: `Text`
> - **Table** → Type: `Table`
> - **Image + FigureCaption (combined)** → Type: `Image+Caption`
> - **Footer** → Type: `Footer`
>
> Maintain the **type field** in each chunk so you always know what kind of content it contains.  
> This improves traceability, retrieval accuracy, and contextual organization across your RAG workflow.
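
The selection rule above can be sketched in a few lines. This is a minimal illustration, not the pipeline used later in this notebook: it assumes each element is a plain dict with `type` and `text` keys (a stand-in for real parser output), and it assumes a caption, when present, directly follows its image. The name `select_elements` is hypothetical.

```python
from typing import Any

# Types kept as-is, following the Keep / Exclude table above.
KEEP_TYPES = {"Text", "Table", "Footer"}


def select_elements(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep Text, Table and Footer; merge each Image with the FigureCaption after it."""
    selected: list[dict[str, Any]] = []
    i = 0
    while i < len(elements):
        el = elements[i]
        if el["type"] in KEEP_TYPES:
            selected.append({"type": el["type"], "text": el["text"]})
        elif el["type"] == "Image":
            caption = ""
            # Assumption: the caption, if any, is the very next element.
            if i + 1 < len(elements) and elements[i + 1]["type"] == "FigureCaption":
                caption = elements[i + 1]["text"]
                i += 1  # consume the caption so it is not visited again
            selected.append({"type": "Image+Caption", "text": el["text"], "caption": caption})
        # Formula, ListItem, NarrativeText and stray captions are dropped.
        i += 1
    return selected


sample = [
    {"type": "Text", "text": "3.1 Encoder and Decoder Stacks"},
    {"type": "NarrativeText", "text": "The encoder is composed of a stack of layers."},
    {"type": "Image", "text": "<ocr text of the figure>"},
    {"type": "FigureCaption", "text": "Figure 1: The Transformer - model architecture."},
    {"type": "Footer", "text": "2"},
]
chunks = select_elements(sample)
print([c["type"] for c in chunks])
```

Every output dict carries its `type`, so a retriever can filter or route on it later.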


In [1]:
import json
from typing import List

# Unstructured for document parsing
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# LangChain components
from langchain_core.documents import Document
from langchain_google_genai import ChatGoogleGenerativeAI,GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.messages import HumanMessage
from dotenv import load_dotenv

load_dotenv()

True

### ***Partitioning the Document***

In [6]:
import os
from pathlib import Path
from typing import List

def partition_document_launcher(
    file_path: str,
    max_characters: int,
    new_after_n_chars: int,
    combine_text_under_n_chars: int,
    extract_images: bool = False,
    extract_tables: bool = False,
    languages: List[str] = ['eng']
):
    """
    Extract elements from PDF using unstructured library.
    
    Args:
        file_path: Path to the PDF file to process (REQUIRED)
        max_characters: Maximum characters per chunk (REQUIRED)
        new_after_n_chars: Start new chunk after this many characters (REQUIRED)
        combine_text_under_n_chars: Combine small text blocks under this count (REQUIRED)
        extract_images: Whether to extract images from the PDF 'True' or 'False'
        extract_tables: Whether to infer table structure 'True' or 'False'
        languages: List of language codes (defaults to ['eng'])
    
    Returns:
        List of extracted elements from the PDF
    
    Raises:
        FileNotFoundError: If the PDF file doesn't exist
        ValueError: If invalid parameters are provided
    """
    # Validate input file
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"PDF file not found: {file_path}")
    
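    # Validate chunk parameters
    # NOTE: unstructured documents max_characters as the hard chunk limit and
    # new_after_n_chars as the soft limit, so a soft limit above the hard limit
    # should add nothing. The check is kept as written so the call below still runs.
    if max_characters >= new_after_n_chars:
        raise ValueError("max_characters must be less than new_after_n_chars")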
    
    # Set image output directory (fixed path)
    image_output_dir = r"D:\MultiModulRag\Backend\SmartChunkClubing\Images"
    
    # Create image directory if extracting images
    if extract_images:
        Path(image_output_dir).mkdir(parents=True, exist_ok=True)
    
    print(f"📄 Partitioning document: {file_path}")
    print(f"⚙️  Settings: Images={extract_images}, Tables={extract_tables}, Languages={languages}")
    print(f"📊 Chunk settings: max={max_characters}, new_after={new_after_n_chars}, combine={combine_text_under_n_chars}")

    elements = partition_pdf(
        ### File path is always required ###
        filename=file_path,

        ### Core parameters (Fixed Parameters) ###
        strategy="hi_res",
        hi_res_model_name="yolox",
        chunking_strategy="by_title",
        include_orig_elements=True,
        
        ### Language and extraction parameters ###
        languages=languages,  # Use the parameter instead of empty list
        
        ### Image extraction parameters ###
        extract_images_in_pdf=extract_images,
        extract_image_block_to_payload=extract_images,
        extract_image_block_output_dir=image_output_dir if extract_images else None,
        extract_image_block_types=["Image"] if extract_images else [],
        
        ### Table extraction ###
        infer_table_structure=extract_tables,  # Use the parameter
        
        ### Chunk parameters ###
        max_characters=max_characters,
        new_after_n_chars=new_after_n_chars,
        combine_text_under_n_chars=combine_text_under_n_chars,
    )
    
    print(f"✅ Extracted {len(elements)} elements")
    
    # Print element breakdown
    element_types = {}
    for elem in elements:
        elem_type = type(elem).__name__
        element_types[elem_type] = element_types.get(elem_type, 0) + 1
    print(f"📋 Element breakdown: {dict(element_types)}")
    
    return elements

In [9]:
checkpoint = partition_document_launcher(
    file_path=r"D:\MultiModulRag\docs\NIPS-2017-attention-is-all-you-need-Paper.pdf",
    max_characters=3000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=200,
    extract_images=True,
    extract_tables=True,
    languages=['eng'],
)

📄 Partitioning document: D:\MultiModulRag\docs\NIPS-2017-attention-is-all-you-need-Paper.pdf
⚙️  Settings: Images=True, Tables=True, Languages=['eng']
📊 Chunk settings: max=3000, new_after=3800, combine=200


The `max_size` parameter is deprecated and will be removed in v4.26. Please specify in `size['longest_edge'] instead`.


✅ Extracted 24 elements
📋 Element breakdown: {'CompositeElement': 24}


In [7]:
checkpoint[5].metadata.orig_elements[3].to_dict()

{'type': 'Image',
 'element_id': 'ec7186d8-09dc-4beb-b7a8-8ad9b3324491',
 'text': 'Output Probabilities Add & Norm Feed Forward Add & Norm Multi-Head Attention a, Add & Norm Add & Norm Feed Forward Nx | -+CAgc8 Norm) Add & Norm Masked Multi-Head Multi-Head Attention Attention Se a, ee a, Positional Positional Encoding @ © @ Encoding Input Output Embedding Embedding Inputs Outputs (shifted right)',
 'metadata': {'coordinates': {'points': ((np.float64(545.9972222222221),
     np.float64(200.00555555555542)),
    (np.float64(545.9972222222221), np.float64(1095.6055555555556)),
    (np.float64(1153.997222222222), np.float64(1095.6055555555556)),
    (np.float64(1153.997222222222), np.float64(200.00555555555542))),
   'system': 'PixelSpace',
   'layout_width': 1700,
   'layout_height': 2200},
  'last_modified': '2025-10-20T15:54:36',
  'filetype': 'PPM',
  'languages': ['eng'],
  'page_number': 3,
  'image_base64': '/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh
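
The `image_base64` value above begins with `/9j/`, which is how a JPEG header looks once base64-encoded, even though `filetype` reports `PPM`. A small check like the one below can confirm the real format before choosing a file extension or a MIME type. It is only a sketch: `sniff_image_format` is a hypothetical helper, and it recognises just the JPEG and PNG signatures.

```python
import base64

# Magic-number prefixes for two common image formats.
_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
}


def sniff_image_format(image_b64: str) -> str:
    """Return 'jpeg', 'png', or 'unknown' based on the first decoded bytes."""
    # 16 base64 characters decode to 12 bytes, enough for both signatures.
    prefix = image_b64[:16]
    # Trim to a multiple of 4 so short inputs still decode without padding errors.
    prefix = prefix[: len(prefix) - len(prefix) % 4]
    head = base64.b64decode(prefix)
    for signature, name in _SIGNATURES.items():
        if head.startswith(signature):
            return name
    return "unknown"


# The first characters of the payload shown in the output above.
print(sniff_image_format("/9j/4AAQSkZJRgABAQAAAQABAAD/2wBD"))
```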

### ***Checkpointing***

In [11]:
import json
import pickle
from pathlib import Path
from typing import Optional
from unstructured.documents.elements import Element

def save_elements(elements, pkl_path: str, json_path: Optional[str] = None):
    """
    Save a Python variable `elements` to pickle and optionally to JSON.
    Automatically converts unstructured Element objects to dicts for JSON.

    Args:
        elements: Python variable to save (list, dict, etc.)
        pkl_path: Path to save the pickle file (required)
        json_path: Path to save the JSON file (optional)
    """
    # Ensure parent directories exist
    Path(pkl_path).parent.mkdir(parents=True, exist_ok=True)
    if json_path:
        Path(json_path).parent.mkdir(parents=True, exist_ok=True)

    # Save as Pickle
    with open(pkl_path, "wb") as f:
        pickle.dump(elements, f)
    print(f"✅ Saved elements to pickle: {pkl_path}")

    # Save as JSON (optional)
    if json_path:
        # Convert Element objects to dicts automatically
        def to_serializable(el):
            return el.to_dict() if isinstance(el, Element) else el
        
        elements_serializable = [to_serializable(el) for el in elements]

        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(elements_serializable, f, indent=4, ensure_ascii=False)
        print(f"✅ Saved elements to JSON: {json_path}")


# -----------------------------
# Example usage
# your Python variable, e.g., output of partition_pdf

pkl_file = r"D:\MultiModulRag\Backend\SmartChunkClubingdef\Pickel\Checkpointer1.pkl"
json_file = r"D:\MultiModulRag\Backend\SmartChunkClubingdef\JSON\Checkpointer1.json"

save_elements(checkpoint, pkl_file, json_file) 

✅ Saved elements to pickle: D:\MultiModulRag\Backend\SmartChunkClubingdef\Pickel\Checkpointer1.pkl
✅ Saved elements to JSON: D:\MultiModulRag\Backend\SmartChunkClubingdef\JSON\Checkpointer1.json


In [2]:
import pickle

# Path to your pickle file
pkl_file = r"D:\MultiModulRag\Backend\SmartChunkClubingdef\Pickel\Checkpointer1.pkl"

# Load pickle into a new variable
with open(pkl_file, "rb") as f:
    checkpoint = pickle.load(f)

print(f"✅ Loaded {len(checkpoint)} elements from pickle")


✅ Loaded 24 elements from pickle
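
The save and load cells above can be folded into one "load if present, otherwise compute and save" helper, so the slow `hi_res` partitioning only runs when no checkpoint exists. This is a minimal sketch; `load_or_compute` and the demo function `expensive` are hypothetical names, and a temporary directory stands in for the real checkpoint path.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_compute(pkl_path: str, compute: Callable[[], Any]) -> Any:
    """Return the pickled object at pkl_path, computing and saving it first if missing."""
    path = Path(pkl_path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)  # only unpickle files you created yourself
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls: list[int] = []


def expensive() -> list[str]:
    # Stand-in for the partitioning call; records how often it runs.
    calls.append(1)
    return ["chunk-1", "chunk-2"]


with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "checkpoints" / "demo.pkl")
    first = load_or_compute(target, expensive)   # computes and saves
    second = load_or_compute(target, expensive)  # loads from disk

print(first == second, len(calls))
```

In the notebook this could wrap the partition call, for example with a `lambda` that calls `partition_document_launcher` with the arguments used earlier.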


In [88]:
checkpoint[2].metadata.orig_elements[6].to_dict()

{'type': 'NarrativeText',
 'element_id': '4515f878-d2ab-4cb4-a7c6-670d2b68860f',
 'text': 'Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved signiﬁcant improvements in computational efﬁciency through factorization tricks [18] and conditional computation [26], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.',
 'metadata': {'detection_class_prob': 0.9516907334327698,
  'coordinates': {'points': ((np.float64(300.0),
     np.float64(205.97867111111094)),
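
The partition summary reports only `CompositeElement`, because chunking wraps the original elements; the real types (`Image`, `NarrativeText`, and so on) live in `metadata.orig_elements`, which is populated here because `include_orig_elements=True` was passed. A rough way to get a per-type breakdown is sketched below. The helper name `count_orig_element_types` is hypothetical, and the stand-in classes exist only so the snippet runs without the parser installed.

```python
from collections import Counter
from types import SimpleNamespace


def count_orig_element_types(chunks) -> Counter:
    """Count the class names of the original elements stored inside each chunk."""
    counts: Counter = Counter()
    for chunk in chunks:
        # orig_elements may be missing or None, so fall back to an empty list.
        orig = getattr(chunk.metadata, "orig_elements", None) or []
        counts.update(type(el).__name__ for el in orig)
    return counts


# Stand-in classes that mimic element class names (not the real library classes).
class NarrativeText:
    pass


class Image:
    pass


demo_chunks = [
    SimpleNamespace(metadata=SimpleNamespace(orig_elements=[NarrativeText(), Image()])),
    SimpleNamespace(metadata=SimpleNamespace(orig_elements=[NarrativeText()])),
    SimpleNamespace(metadata=SimpleNamespace()),  # chunk without orig_elements
]
print(count_orig_element_types(demo_chunks))
```

With the real data, `count_orig_element_types(checkpoint)` should give the same kind of breakdown for the 24 chunks.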
  

### ***Extracting Text and Keeping It in the Format We Want***

In [4]:
import os
import base64
from pathlib import Path

def separate_content_types(chunk):
    """Analyze what types of content are in a chunk"""
    content_data = {
        'text': chunk.text,
        'tables': [],
        'images_base64': [],
        'images_dirpath': [],  # paths of the images written to disk
        'page_no': [],
        'types': ['text']
    }
    
    # Directory where decoded images are written; create it if it is missing
    image_dir = r"D:\MultiModulRag\Backend\Pipeline_Database\Images"
    Path(image_dir).mkdir(parents=True, exist_ok=True)
    
    # # Clean all existing images
    # for file in Path(image_dir).glob("*"):
    #     if file.is_file():
    #         file.unlink()
    
    image_counter = 1  # Counter for naming images

    # Check for tables and images in original elements
    if hasattr(chunk, 'metadata') and hasattr(chunk.metadata, 'orig_elements'):
        for element in chunk.metadata.orig_elements:
            element_type = type(element).__name__
            
            # Handle page numbers
            page_no = element.to_dict()['metadata']['page_number']
            if page_no not in content_data['page_no']: 
                content_data['page_no'].append(page_no)
            
            # Handle tables
            if element_type == 'Table':
                if 'table' not in content_data['types']:
                    content_data['types'].append('table')
                table_html = getattr(element.metadata, 'text_as_html', element.text)
                content_data['tables'].append(table_html)
            
            # Handle images
            elif element_type == 'Image':
                if hasattr(element, 'metadata') and hasattr(element.metadata, 'image_base64'):
                    if 'image' not in content_data['types']:
                        content_data['types'].append('image')
                    
                    image_base64 = element.metadata.image_base64
                    content_data['images_base64'].append(image_base64)
                    
                    # Save image to directory and store path
                    try:
                        image_filename = f"image_{image_counter}.png"
                        image_path = os.path.join(image_dir, image_filename)
                        
                        # Decode and save image
                        with open(image_path, "wb") as img_file:
                            img_file.write(base64.b64decode(image_base64))
                        
                        # Store the path in content_data
                        content_data['images_dirpath'].append(image_path)
                        
                        print(f"     ✅ Saved image: {image_filename}")
                        image_counter += 1
                        
                    except Exception as e:
                        print(f"     ❌ Failed to save image {image_counter}: {e}")

    return content_data

In [None]:
# def separate_content_types_launcher(content_data):
#     all_content_data = []  # Store all content_data objects
#     total_chunks = len(content_data)
        
#     for i, chunk in enumerate(content_data):
#         current_chunk = i + 1
#         print(f"   Processing chunk {current_chunk}/{total_chunks}")
        
#         # Analyze chunk content
#         content_data = separate_content_types(chunk)
        
#         # Debug prints
#         print(f"     Types found: {content_data['types']}")
#         print(f"     Tables: {len(content_data['tables'])}, Images: {len(content_data['images_base64'])}")
        
#         # Store it
#         all_content_data.append(content_data)  # Save each one
#     return all_content_data
#     # Now you can access any chunk:
#     # print(all_content_data[0])  # First chunk
#     # print(all_content_data[-1]) # Last chunk
#     # print(len(all_content_data)) # Total chunks

In [3]:
# all_content_data = separate_content_types_launcher(checkpoint)

In [None]:
# all_content_data[5]

{'text': '3.1 Encoder and Decoder Stacks\n\nEncoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The ﬁrst is a multi-head self-attention mechanism, and the second is a simple, position-\n\n2\n\nOutput Probabilities Add & Norm Feed Forward Add & Norm Multi-Head Attention a, Add & Norm Add & Norm Feed Forward Nx | -+CAgc8 Norm) Add & Norm Masked Multi-Head Multi-Head Attention Attention Se a, ee a, Positional Positional Encoding @ © @ Encoding Input Output Embedding Embedding Inputs Outputs (shifted right)\n\nFigure 1: The Transformer - model architecture.\n\nwise fully connected feed-forward network. We employ a residual connection [10] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedd

### ***Creating AI-Enhanced Summaries for RAG Retrieval***

In [6]:
def create_ai_enhanced_summary(text: str, tables: List[str], images: List[str]) -> str:
    """Create AI-enhanced summary for mixed content"""
    
    try:
        llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro", temperature=0)
        
        # Build the text prompt
        prompt_text = f"""You are creating a searchable description for document content retrieval.

        CONTENT TO ANALYZE:
        TEXT CONTENT:
        {text}

        """
        
        # Add tables if present
        if tables:
            prompt_text += "TABLES:\n"
            for i, table in enumerate(tables):
                prompt_text += f"Table {i+1}:\n{table}\n\n"
        
        # Add instructions
        prompt_text += """
        YOUR TASK:
        Generate a comprehensive, searchable description that covers:

        1. Key facts, numbers, and data points from text and tables
        2. Main topics and concepts discussed  
        3. Questions this content could answer
        4. Visual content analysis (charts, diagrams, patterns in images)
        5. Alternative search terms users might use

        Make it detailed and searchable - prioritize findability over brevity.

        SEARCHABLE DESCRIPTION:"""

        # Build message content starting with text
        message_content = [{"type": "text", "text": prompt_text}]
        
        # Add images to the message
        for image_base64 in images:
            message_content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}
            })
        
        # NOTE: "output_formate" is not a content-part type that LangChain recognizes; this
        # block is what raises "Unrecognized message part type" in the run below. The revised
        # function later in the notebook moves these instructions into the text prompt.
        message_content.append({
            "type":"output_formate",
            "output": {"formate": """OUTPUT FORMAT:
            QUESTIONS:"All questions that can be asked about the retrieved content, for every text, image, and table given"
            SUMMARY:"Summary of the data inside the retrieved content"
            IMAGE_INTERPRETATION:"What the image shows. If it is unrelated to the topic (for example, a photo of a person), simply say ***dont use this image***; otherwise write the full image description"
            TABLE_INTERPRETATION:"What the table contains. If it is not useful, simply say ***dont use this table***; otherwise write the full table description"
                """}
            })
            
        # Send to AI and get response
        message = HumanMessage(content=message_content)
        response = llm.invoke([message])
        
        return response.content
        
    except Exception as e:
        print(f"     ❌ AI summary failed: {e}")
        summary = f"{text[:300]}..."
        if tables:
            summary += f" [Contains {len(tables)} table(s)]"
        if images:
            summary += f" [Contains {len(images)} image(s)]"
        return summary

In [7]:
def summarise_chunks(chunks):
    """Process all chunks with AI Summaries"""
    print("🧠 Processing chunks with AI Summaries...")

    langchain_documents = []
    total_chunks = len(chunks)

    for i, chunk in enumerate(chunks):
        current_chunk = i + 1
        print(f"   Processing chunk {current_chunk}/{total_chunks}")
        
        # Analyze chunk content
        content_data = separate_content_types(chunk)
        
        # Debug prints
        print(f"     Types found: {content_data['types']}")
        print(f"     Tables: {len(content_data['tables'])}, Images: {len(content_data['images_base64'])}")
        
        # Create AI-enhanced summary if chunk has tables/images
        if content_data['tables'] or content_data['images_base64']:
            print(f"     → Creating AI summary for mixed content...")
            try:
                enhanced_content = create_ai_enhanced_summary(
                    content_data['text'],
                    content_data['tables'], 
                    content_data['images_base64']
                )
                # NOTE: create_ai_enhanced_summary catches its own errors and returns a
                # fallback string, so this message also prints after a failed model call.
                print(f"     → AI summary created successfully")
                print(f"     → Enhanced content preview: {enhanced_content[:200]}...")
            except Exception as e:
                print(f"     ❌ AI summary failed: {e}")
                enhanced_content = content_data['text']
        else:
            print(f"     → Using raw text (no tables/images)")
            enhanced_content = content_data['text']
        
        # Create LangChain Document with rich metadata
        doc = Document(
            page_content=enhanced_content,
            metadata={
                "original_content": json.dumps({
                    "raw_text": content_data['text'],
                    "tables_html": content_data['tables'],
                    "images_base64": content_data['images_base64']
                })
            }
        )
        
        langchain_documents.append(doc)
    
    print(f"✅ Processed {len(langchain_documents)} chunks")
    return langchain_documents


# Process chunks with AI
processed_chunks = summarise_chunks(checkpoint)

🧠 Processing chunks with AI Summaries...
   Processing chunk 1/24
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 2/24
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 3/24
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 4/24
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 5/24
     Types found: ['text']
     Tables: 0, Images: 0
     → Using raw text (no tables/images)
   Processing chunk 6/24
     ✅ Saved image: image_1.png
     Types found: ['text', 'image']
     Tables: 0, Images: 1
     → Creating AI summary for mixed content...
     ❌ AI summary failed: Unrecognized message part type: output_formate.
     → AI summary created successfully
     → Enhanced content preview: 3.1 Encoder and Decoder Stac
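
The run above prints "AI summary failed" and then "AI summary created successfully" for the same chunk, because the summary function swallows its own exception and returns fallback text. One way to keep the two cases apart is to return a success flag next to the content. This is a sketch under assumptions: `summarise_with_status` and `broken_model` are hypothetical names, and the callable stands in for the real model call.

```python
from typing import Callable


def summarise_with_status(
    summarise: Callable[[str], str],
    text: str,
    preview_chars: int = 300,
) -> tuple[str, bool]:
    """Return (content, ok): the model summary, or a raw-text fallback with ok=False."""
    try:
        return summarise(text), True
    except Exception as exc:  # broad on purpose: any API failure triggers the fallback
        print(f"AI summary failed: {exc}")
        return f"{text[:preview_chars]}...", False


def broken_model(_: str) -> str:
    # Hypothetical stand-in that fails the way the run above did.
    raise ValueError("Unrecognized message part type: output_formate.")


content, ok = summarise_with_status(broken_model, "3.1 Encoder and Decoder Stacks")
print("AI summary created" if ok else "Fell back to raw text")
```

The caller can then log the right message and, if useful, store the flag in the document metadata.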

In [6]:
import os
import json
import base64
from pathlib import Path
from langchain_core.documents import Document
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import HumanMessage

In [7]:
from typing import Any

from langchain_core.documents import Document


def clean_image_directory(image_dir: str) -> None:
    """Clean existing images from directory"""
    Path(image_dir).mkdir(parents=True, exist_ok=True)
    
    for file in Path(image_dir).glob("*"):
        if file.is_file():
            try:
                file.unlink()
                print(f"     🗑️  Deleted old image: {file.name}")
            except Exception as e:
                print(f"     ⚠️  Could not delete {file.name}: {e}")


def separate_content_types(chunk, image_dir: str, image_counter: int) -> tuple[dict[str, Any], int]:
    """
    Analyze chunk content and extract text, tables, and images.
    Returns content data and updated image counter.
    """
    content_data = {
        'text': chunk.text,
        'tables': [],
        'images_base64': [],
        'images_dirpath': [],
        'page_no': [],
        'types': ['text']
    }

    # Check for tables and images in original elements
    if not (hasattr(chunk, 'metadata') and hasattr(chunk.metadata, 'orig_elements')):
        return content_data, image_counter

    for element in chunk.metadata.orig_elements:
        element_type = type(element).__name__
        
        # Handle page numbers
        if 'metadata' in element.to_dict():
            page_no = element.to_dict()['metadata'].get('page_number')
            if page_no and page_no not in content_data['page_no']:
                content_data['page_no'].append(page_no)
        
        # Handle tables
        if element_type == 'Table':
            if 'table' not in content_data['types']:
                content_data['types'].append('table')
            table_html = getattr(element.metadata, 'text_as_html', element.text)
            content_data['tables'].append(table_html)
        
        # Handle images
        elif element_type == 'Image':
            if hasattr(element.metadata, 'image_base64'):
                if 'image' not in content_data['types']:
                    content_data['types'].append('image')
                
                image_base64 = element.metadata.image_base64
                
                # Save image to directory
                try:
                    image_filename = f"image_{image_counter:04d}.png"
                    image_path = os.path.join(image_dir, image_filename)
                    
                    # Decode and save
                    with open(image_path, "wb") as img_file:
                        img_file.write(base64.b64decode(image_base64))
                    
                    # Store the file path, and keep the base64 string for the AI call
                    content_data['images_dirpath'].append(image_path)
                    content_data['images_base64'].append(image_base64)
                    
                    print(f"     ✅ Saved: {image_filename}")
                    image_counter += 1
                    
                except Exception as e:
                    print(f"     ❌ Failed to save image {image_counter}: {e}")

    return content_data, image_counter


def create_ai_enhanced_summary(text: str, tables: list[str], images: list[str]) -> str:
    """Create AI-enhanced summary for mixed content"""
    
    try:
        llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro", temperature=0)
        
        # Build comprehensive prompt
        prompt_text = f"""You are creating a searchable description for document content retrieval.

CONTENT TO ANALYZE:

TEXT CONTENT:
{text}

"""
        
        # Add tables if present
        if tables:
            prompt_text += "TABLES:\n"
            for i, table in enumerate(tables, 1):
                prompt_text += f"Table {i}:\n{table}\n\n"
        
        # Add detailed instructions
        prompt_text += """
YOUR TASK:
Generate a comprehensive, searchable description that covers:

1. Key facts, numbers, and data points from text and tables
2. Main topics and concepts discussed  
3. Questions this content could answer
4. Visual content analysis (charts, diagrams, patterns in images)
5. Alternative search terms users might use

Make it detailed and searchable - prioritize findability over brevity.

OUTPUT FORMAT:
QUESTIONS: "List all potential questions that can be answered from this content (text, images, tables)"
SUMMARY: "Comprehensive summary of all data and information"
IMAGE_INTERPRETATION: "Detailed description of image content. If images are irrelevant or contain only decorative elements, state: ***DO NOT USE THIS IMAGE***"
TABLE_INTERPRETATION: "Detailed description of table content. If tables are irrelevant, state: ***DO NOT USE THIS TABLE***"

SEARCHABLE DESCRIPTION:"""

        # Build message with text and images
        message_content = [{"type": "text", "text": prompt_text}]
        
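        # Add images to message
        # The payload shown earlier starts with "/9j/", the base64 form of a JPEG header,
        # so the data URL is labelled image/jpeg.
        for img_b64 in images:
            message_content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}
            })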
        
        # Invoke AI
        message = HumanMessage(content=message_content)
        response = llm.invoke([message])
        
        return response.content
        
    except Exception as e:
        print(f"     ❌ AI summary failed: {e}")
        # Fallback summary
        summary = f"{text[:300]}..."
        if tables:
            summary += f"\n[Contains {len(tables)} table(s)]"
        if images:
            summary += f"\n[Contains {len(images)} image(s)]"
        return summary


def summarise_chunks(chunks, image_dir: str = r"D:\MultiModulRag\Backend\Pipeline_Database\Images") -> list[Document]:
    """
    Process all chunks with AI Summaries.
    
    Args:
        chunks: List of document chunks to process
        image_dir: Directory to store extracted images
        
    Returns:
        List of LangChain Documents with enhanced summaries
    """
    print("🧠 Processing chunks with AI Summaries...")
    
    # Clean image directory once
    clean_image_directory(image_dir)
    
    langchain_documents = []
    total_chunks = len(chunks)
    image_counter = 1
    
    for i, chunk in enumerate(chunks, 1):
        print(f"\n   📄 Processing chunk {i}/{total_chunks}")
        
        # Analyze chunk content
        content_data, image_counter = separate_content_types(chunk, image_dir, image_counter)
        
        # Debug info
        print(f"     Types found: {', '.join(content_data['types'])}")
        print(f"     Tables: {len(content_data['tables'])}, Images: {len(content_data['images_base64'])}")
        if content_data['page_no']:
            print(f"     Pages: {content_data['page_no']}")
        
        # Create AI-enhanced summary for ALL chunks
        print(f"     🤖 Creating AI summary...")
        try:
            enhanced_content = create_ai_enhanced_summary(
                content_data['text'],
                content_data['tables'], 
                content_data['images_base64']
            )
            print(f"     ✅ AI summary created")
            print(f"     Preview: {enhanced_content[:150]}...")
        except Exception as e:
            print(f"     ❌ AI summary failed, using raw text: {e}")
            enhanced_content = content_data['text']
        
        # Create LangChain Document with metadata
        # Store image paths instead of base64 to reduce memory usage
        doc = Document(
            page_content=enhanced_content,
            metadata={
                "chunk_index": i,
                "page_numbers": content_data['page_no'],
                "content_types": content_data['types'],
                "num_tables": len(content_data['tables']),
                "num_images": len(content_data['images_dirpath']),
                "image_paths": content_data['images_dirpath'],
                "original_content": json.dumps({
                    "raw_text": content_data['text'],
                    "tables_html": content_data['tables'],
                    # Don't store base64 in metadata to save space
                    "has_images": len(content_data['images_base64']) > 0
                })
            }
        )
        
        langchain_documents.append(doc)
    
    print(f"\n✅ Successfully processed {len(langchain_documents)} chunks")
    print(f"📊 Total images saved: {image_counter - 1}")
    
    return langchain_documents

In [8]:
output = summarise_chunks(checkpoint)

🧠 Processing chunks with AI Summaries...
     🗑️  Deleted old image: image_0001.png
     🗑️  Deleted old image: image_0002.png
     🗑️  Deleted old image: image_0003.png

   📄 Processing chunk 1/24
     Types found: text
     Tables: 0, Images: 0
     Pages: [1]
     🤖 Creating AI summary...
     ✅ AI summary created
     Preview: QUESTIONS:
"Who are the authors of the paper 'Attention Is All You Need'?
What companies or universities were the authors affiliated with?
What are th...

   📄 Processing chunk 2/24
     Types found: text
     Tables: 0, Images: 0
     Pages: [1]
     🤖 Creating AI summary...
     ✅ AI summary created
     Preview: QUESTIONS:
"
*   What is the Transformer network architecture?
*   What traditional components of sequence transduction models does the Transformer di...

   📄 Processing chunk 3/24
     Types found: text
     Tables: 0, Images: 0
     Pages: [1, 2]
     🤖 Creating AI summary...
     ✅ AI summary created
  

Retrying langchain_google_genai.chat_models._chat_with_retry.<locals>._chat_with_retry in 2.0 seconds as it raised ResourceExhausted: 429 You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/usage?tab=rate-limit.
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 2
Please retry in 11.503331697s. [violations {
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-pro"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 2
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds:

     ✅ AI summary created
     Preview: QUESTIONS:
- What is Scaled Dot-Product Attention?
- What is the formula for Scaled Dot-Product Attention?
- What are the inputs to the Scaled Dot-Pro...

   📄 Processing chunk 9/24
     Types found: text
     Tables: 0, Images: 0
     Pages: [4, 5]
     🤖 Creating AI summary...
     ✅ AI summary created
     Preview: **QUESTIONS:**
"
*   What is Multi-Head Attention?
*   How does Multi-Head Attention work?
*   What is the benefit of using multiple attention heads i...

   📄 Processing chunk 10/24
     Types found: text
     Tables: 0, Images: 0
     Pages: [5]
     🤖 Creating AI summary...
     ✅ AI summary created
     Preview: QUESTIONS:
"What are the three ways multi-head attention is used in the Transformer model?
How does encoder-decoder attention work in a Transformer?
W...

   📄 Processing chunk 11/24
     Types found: text
     Tables: 0, Images: 0
     Pages: [5]
     🤖 Creating AI summary...
     ✅ AI summary 

Retrying langchain_google_genai.chat_models._chat_with_retry.<locals>._chat_with_retry in 2.0 seconds as it raised ResourceExhausted: 429 You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits. To monitor your current usage, head to: https://ai.dev/usage?tab=rate-limit.
* Quota exceeded for metric: generativelanguage.googleapis.com/generate_content_free_tier_requests, limit: 2
Please retry in 40.614263854s. [violations {
  quota_metric: "generativelanguage.googleapis.com/generate_content_free_tier_requests"
  quota_id: "GenerateRequestsPerMinutePerProjectPerModel-FreeTier"
  quota_dimensions {
    key: "model"
    value: "gemini-2.5-pro"
  }
  quota_dimensions {
    key: "location"
    value: "global"
  }
  quota_value: 2
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
, retry_delay {
  seconds:

     ✅ AI summary created
     Preview: QUESTIONS:
"
*   What is the impact of varying the number of attention heads on the Transformer model's performance for English-to-German translation?...

   📄 Processing chunk 22/24
     Types found: text
     Tables: 0, Images: 0
     Pages: [9]
     🤖 Creating AI summary...
     ✅ AI summary created
     Preview: QUESTIONS:
"
*   What is the Transformer model and what is its main innovation?
*   What does the Transformer model replace in traditional sequence tr...

   📄 Processing chunk 23/24
     Types found: text
     Tables: 0, Images: 0
     Pages: [10]
     🤖 Creating AI summary...
     ✅ AI summary created
     Preview: QUESTIONS:
"List all potential questions that can be answered from this content (text, images, tables)"
- Who are the authors of the "Layer normalizat...

   📄 Processing chunk 24/24
     Types found: text
     Tables: 0, Images: 0
     Pages: [10, 11]
     🤖 Creating AI summary...
     ✅ AI summ
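
The log above shows the free-tier quota (`quota_value: 2` requests per minute for `gemini-2.5-pro`) being hit while the client retries after only 2.0 seconds. One way to cope is to wrap each summary call in exponential backoff with jitter. The following is a minimal sketch, not the notebook's actual approach: `call_with_backoff` and its parameters are hypothetical names, and in real use the `except` clause should be narrowed to the rate-limit exception rather than every `Exception`.

```python
import random
import time


def call_with_backoff(fn, *args, max_retries=5, base_delay=2.0,
                      max_delay=60.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on failure.

    Hypothetical helper for illustration. It catches every Exception;
    narrow this to the rate-limit error type in real code.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delays grow as 2s, 4s, 8s, ... capped at max_delay,
            # plus up to 1s of random jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, 1))
```

Inside the chunk loop, the call could then read `call_with_backoff(create_ai_enhanced_summary, content_data['text'], content_data['tables'], content_data['images_base64'])`, leaving the existing `try`/`except` fallback to raw text in place for the case where all retries fail.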

In [9]:
output

[Document(metadata={'chunk_index': 1, 'page_numbers': [1], 'content_types': ['text'], 'num_tables': 0, 'num_images': 0, 'image_paths': [], 'original_content': '{"raw_text": "Attention Is All You Need\\n\\nAshish Vaswani\\u2217 Google Brain avaswani@google.com\\n\\nNoam Shazeer\\u2217 Google Brain noam@google.com\\n\\nNiki Parmar\\u2217\\n\\nGoogle Research nikip@google.com\\n\\nJakob Uszkoreit\\u2217 Google Research usz@google.com\\n\\nLlion Jones\\u2217 Google Research llion@google.com\\n\\nAidan N. Gomez\\u2217 \\u2020 University of Toronto aidan@cs.toronto.edu\\n\\n\\u0141ukasz Kaiser\\u2217 Google Brain lukaszkaiser@google.com", "tables_html": [], "has_images": false}'}, page_content='QUESTIONS:\n"Who are the authors of the paper \'Attention Is All You Need\'?\nWhat companies or universities were the authors affiliated with?\nWhat are the email addresses for the authors Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, and Łukasz Kaiser?\nWhi

In [11]:
import json
import pickle
from pathlib import Path
from typing import Optional

from unstructured.documents.elements import Element

def save_elements(elements, pkl_path: str, json_path: Optional[str] = None):
    """
    Save `elements` to a pickle file and, optionally, to a JSON file.

    For JSON, unstructured Element objects are converted with `to_dict()`;
    any other item is passed through unchanged and must already be
    JSON-serializable.

    Args:
        elements: Object to pickle; must be iterable when json_path is given
        pkl_path: Path to save the pickle file (required)
        json_path: Path to save the JSON file (optional)
    """
    # Ensure parent directories exist
    Path(pkl_path).parent.mkdir(parents=True, exist_ok=True)
    if json_path:
        Path(json_path).parent.mkdir(parents=True, exist_ok=True)

    # Save as Pickle
    with open(pkl_path, "wb") as f:
        pickle.dump(elements, f)
    print(f"✅ Saved elements to pickle: {pkl_path}")

    # Save as JSON (optional)
    if json_path:
        # Convert Element objects to dicts automatically
        def to_serializable(el):
            return el.to_dict() if isinstance(el, Element) else el
        
        elements_serializable = [to_serializable(el) for el in elements]

        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(elements_serializable, f, indent=4, ensure_ascii=False)
        print(f"✅ Saved elements to JSON: {json_path}")


# -----------------------------
# Example usage
# `output` is the list returned by summarise_chunks above.
# Only the pickle is written here; json_file is defined but not passed.

pkl_file = r"D:\MultiModulRag\Backend\SmartChunkClubingdef\Pickel\output.pkl"
json_file = r"D:\MultiModulRag\Backend\SmartChunkClubingdef\JSON\output.json"

save_elements(output, pkl_file)

✅ Saved elements to pickle: D:\MultiModulRag\Backend\SmartChunkClubingdef\Pickel\output.pkl
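
To resume from this checkpoint in a later session, the pickle can be read back with the standard library. A minimal sketch, assuming the file was written by `save_elements`; `load_elements` is a hypothetical counterpart, not something defined elsewhere in this notebook.

```python
import pickle
from pathlib import Path


def load_elements(pkl_path):
    """Load whatever object was pickled at pkl_path (hypothetical helper)."""
    path = Path(pkl_path)
    if not path.is_file():
        raise FileNotFoundError(f"No pickle found at {path}")
    with path.open("rb") as f:
        return pickle.load(f)
```

Two caveats apply to any pickle: only load files you trust, since unpickling can execute arbitrary code, and the classes of the stored objects must be importable in the session that loads them.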


In [16]:
clean_output = []

for doc in output:
    if hasattr(doc, "to_dict"):         # Objects exposing to_dict(), e.g. Unstructured elements
        clean_output.append(doc.to_dict())
    elif hasattr(doc, "__dict__"):      # Other objects; the output below ('id', 'metadata', ...) suggests the Documents here take this branch
        clean_output.append(doc.__dict__)
    else:
        clean_output.append(str(doc))   # Fallback for unknown types
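
The same conversion can be factored into a small function that tries the more specific conversions first. This is a sketch under assumptions: `to_plain` is a hypothetical name, and `model_dump` is included on the assumption that the objects may be pydantic v2 models, which is not confirmed by the notebook itself.

```python
def to_plain(obj):
    """Best-effort conversion of obj to JSON-friendly data (hypothetical helper)."""
    # JSON-native values pass through untouched.
    if obj is None or isinstance(obj, (dict, list, str, int, float, bool)):
        return obj
    # Prefer an explicit conversion method when the object offers one.
    for method in ("model_dump", "to_dict"):
        fn = getattr(obj, method, None)
        if callable(fn):
            return fn()
    # Otherwise take a shallow copy of the instance attributes.
    if hasattr(obj, "__dict__"):
        return dict(vars(obj))
    # Last resort: a string representation.
    return str(obj)
```

With it, the loop above reduces to `clean_output = [to_plain(doc) for doc in output]`. Copying `vars(obj)` avoids handing out a reference to the object's live attribute dictionary.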


In [19]:
clean_output[2]

{'id': None,
 'metadata': {'chunk_index': 3,
  'page_numbers': [1, 2],
  'content_types': ['text'],
  'num_tables': 0,
  'num_images': 0,
  'image_paths': [],
  'original_content': '{"raw_text": "1 Introduction\\n\\nRecurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been \\ufb01rmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [31, 21, 13].\\n\\n\\u2217Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the \\ufb01rst Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and

In [22]:
import os
import json

json_dir = r"D:\MultiModulRag\Backend\SmartChunkClubing\JSON"
json_path = os.path.join(json_dir, "output_clean.json")

# Create the directory if it doesn't exist
os.makedirs(json_dir, exist_ok=True)

# Dump JSON
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(clean_output, f, indent=4, ensure_ascii=False)

print(f"✅ JSON saved at: {json_path}")


✅ JSON saved at: D:\MultiModulRag\Backend\SmartChunkClubing\JSON\output_clean.json
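
Because `summarise_chunks` stores `original_content` with `json.dumps`, that field is a JSON string nested inside the exported JSON, so reading the raw text back takes a second decode. A minimal sketch of the round trip; `load_chunks` is a hypothetical helper, and it assumes the file holds a list like `clean_output`.

```python
import json


def load_chunks(json_path):
    """Read the exported chunk list and decode each nested original_content."""
    with open(json_path, "r", encoding="utf-8") as f:
        chunks = json.load(f)
    for chunk in chunks:
        if not isinstance(chunk, dict):
            continue  # entries that were stored via the str() fallback
        meta = chunk.get("metadata") or {}
        raw = meta.get("original_content")
        if isinstance(raw, str):
            # Second decode: the field holds a JSON document as a string.
            meta["original_content"] = json.loads(raw)
    return chunks
```

After loading, `chunks[2]["metadata"]["original_content"]["raw_text"]` gives the untouched text of the third chunk, alongside the `tables_html` and `has_images` keys written by `summarise_chunks`.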
