# Context-Aware Material Specification Extractor

This notebook demonstrates an end-to-end pipeline for extracting material specifications from technical documents using a hybrid NLP and AI approach.

**Instructions:** To run this notebook, execute each cell from top to bottom. You will be prompted to upload a document in the 'Main Execution' section.



# Step 1: Install All Dependencies

In [1]:
# This cell first uninstalls any conflicting versions and then installs the correct, locked versions of the libraries.
!pip uninstall -y transformers huggingface-hub
!pip install sentence-transformers==2.2.2 transformers==4.29.2 huggingface-hub==0.14.1 Flask==3.0.0 faiss-cpu==1.7.4 Jinja2==3.1.2 google-auth==2.21.0 google-auth-oauthlib==1.0.0 numpy==1.26.0 pandas==2.0.3 Pillow==10.0.1 pdfplumber==0.10.3 pytesseract==0.3.10 reportlab==3.6.13 scikit-learn==1.3.1 spacy==3.7.4 torch==2.0.1 urllib3==1.26.18 google-generativeai==0.8.5 python-dotenv==1.1.1 requests==2.32.4 xhtml2pdf==0.2.11

!python -m spacy download en_core_web_sm

!apt-get install -qq tesseract-ocr

Found existing installation: transformers 4.29.2
Uninstalling transformers-4.29.2:
  Successfully uninstalled transformers-4.29.2
Found existing installation: huggingface-hub 0.14.1
Uninstalling huggingface-hub-0.14.1:
  Successfully uninstalled huggingface-hub-0.14.1
Collecting transformers==4.29.2
  Using cached transformers-4.29.2-py3-none-any.whl.metadata (112 kB)
Collecting huggingface-hub==0.14.1
  Using cached huggingface_hub-0.14.1-py3-none-any.whl.metadata (7.6 kB)
Using cached transformers-4.29.2-py3-none-any.whl (7.1 MB)
Using cached huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: huggingface-hub, transformers
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
accelerate 1.8.1 requires huggingface_hub>=0.21.0, but you have huggingface-hub 0.14.1 which is incompatible.
diffusers 0.34.0 requires huggingface-hub>=0.27.0

# Step 2: Import Libraries & Configure API Keys

In [None]:
# --- Import all necessary libraries ---
import os
import json
import re
import io
import time
import pandas as pd
import numpy as np
import faiss
import spacy
import pytesseract
import pdfplumber
from PIL import Image
from sentence_transformers import SentenceTransformer
import google.generativeai as genai
from google.api_core import exceptions as google_exceptions
import requests
from xhtml2pdf import pisa
from jinja2 import Environment, FileSystemLoader
from google.colab import files
from IPython.display import display

# --- Configure API Keys ---
# For this notebook to work, please add your API keys to the Colab "Secrets" manager.
# 1. Click on the key icon (üîë) in the left sidebar.
# 2. Add two new secrets: 'GEMINI_API_KEY' and 'OPENROUTER_API_KEY'.
# 3. Paste your API keys as the values.
GEMINI_API_KEY_1 = 'YOUR_GEMINI_API_KEY_1'
GEMINI_API_KEY_2 = 'YOUR_GEMINI_API_KEY_2'
GEMINI_API_KEY_3 = 'YOUR_GEMINI_API_KEY_3'
OPENROUTER_API_KEY = 'YOUR_OPENROUTER_API_KEY'
# Load multiple Gemini keys from secrets
GEMINI_API_KEYS = [
    key for key in [
        GEMINI_API_KEY_1,
        GEMINI_API_KEY_2,
        GEMINI_API_KEY_3
    ] if key is not None
]

if not GEMINI_API_KEYS:
    print("Warning: No Gemini API keys found. Please set GEMINI_API_KEY_1, etc., in your environment.")

# Configure OpenRouter API
if not OPENROUTER_API_KEY:
    print("Warning: OPENROUTER_API_KEY not found. Fallback to Gemma will not be available.")
OPENROUTER_API_URL = "https://openrouter.ai/api/v1/chat/completions"

### **Code Block 1: Document Processing**

In [3]:
def process_pdf(file_path):
    """
    Extracts text and page numbers from a PDF file.
    Returns a list of dictionaries, each with 'page_number' and 'text'.
    """
    extracted_data = []
    print(f"Processing PDF: {os.path.basename(file_path)}")
    with pdfplumber.open(file_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text:
                extracted_data.append({
                    'page_number': i + 1,
                    'text': text
                })
    print(f"Finished processing PDF: {os.path.basename(file_path)}")
    return extracted_data

def process_image(file_path):
    """
    Extracts text from an image file using OCR.
    Returns a list of dictionaries, each with 'page_number' (image name) and 'text'.
    """
    extracted_data = []
    print(f"Processing image: {os.path.basename(file_path)}")
    try:
        image = Image.open(file_path)
        text = pytesseract.image_to_string(image)
        if text:
            extracted_data.append({
                'page_number': os.path.basename(file_path),
                'text': text
            })
        print(f"Finished processing image: {os.path.basename(file_path)}")
    except Exception as e:
        print(f"Error processing image {file_path}: {e}")
    return extracted_data

def process_document(file_path):
    """
    Processes a document (PDF or image) based on its file extension.
    """
    print(f"Starting document processing for {os.path.basename(file_path)}...")
    if file_path.lower().endswith('.pdf'):
        result = process_pdf(file_path)
    elif file_path.lower().endswith(('.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif')):
        result = process_image(file_path)
    else:
        raise ValueError("Unsupported file type. Please provide a PDF or image file.")
    print(f"Document processing complete for {os.path.basename(file_path)}.")
    return result

### **Code Block 2: Hybrid Information Extraction**

In [4]:
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("SpaCy model 'en_core_web_sm' not found. Please run 'python -m spacy download en_core_web_sm'")
    exit()

try:
    print("Initializing Sentence Transformer model...")
    model = SentenceTransformer('all-MiniLM-L6-v2')
    print("Sentence Transformer model loaded successfully.")
except Exception as e:
    print(f"Error loading Sentence Transformer model: {e}")
    model = None

CORE_KEYWORDS = [
    "Cement", "Aggregate", "Water", "Steel", "Concrete", "Admixture",
    "Fly Ash", "Bitumen", "Mortar", "Brick", "Gravel", "Sand",
    "Reinforcement", "Formwork", "Shuttering", "Piers", "Abutments", "Columns",
    "Slabs", "Beams", "Walls", "Foundations", "Piles", "Couplers", "Jali",
    "Particle Board", "Damp Proof Course", "Slump Test", "Cube Test",
    "Slag", "Pozzolana", "TMT Bars", "MS bars", "HYSD bars",
    "Galvanised Sleeves", "Polymer Block", "Waterproofing Materials",
    "Bitumen felt", "Copper plate", "lignite", "mica", "shale", "clay",
    "pyrites", "coal", "sea shells", "organic impurities", "pentachlorophenol"
]

def find_nearest_heading(all_sentences, start_index):
    """
    Scans backwards from a given index to find the nearest preceding heading,
    with improved filtering to avoid generic headings.
    """
    heading_patterns = [
        r"^\d+\.\d+(?:\.\d+)*\s+.*",      # Matches "1.2.3 Section Title"
        r"^\(?[a-zA-Z]\)\s+.*",          # Matches "(a) Title"
        r"^TABLE\s+\d+\.\d+",           # Matches "TABLE 4.1"
        r"^(?!.*\b(?:MATERIAL|SPECIFICATIONS)\b)[A-Z\s\d\.\-]+",  # Avoids generic all-caps
    ]
    for i in range(start_index, -1, -1):
        line = all_sentences[i]['text']
        # Skip very short, likely irrelevant lines
        if len(line.split()) < 2 and len(line) < 15:
            continue
        for pattern in heading_patterns:
            if re.match(pattern, line, re.IGNORECASE):
                return line.strip()
    return None

def create_semantic_index(sentences):
    """
    Creates a FAISS index for semantic search.
    """
    if not model:
        print("Sentence Transformer model not loaded. Skipping semantic index creation.")
        return None, None

    print("Creating semantic index...")
    try:
        # Generate embeddings for all sentences
        print("Generating sentence embeddings...")
        embeddings = model.encode([s['text'] for s in sentences], show_progress_bar=True)
        print("Sentence embeddings generated.")

        # Create a FAISS index
        print("Creating FAISS index...")
        index = faiss.IndexFlatL2(embeddings.shape[1])
        index.add(np.array(embeddings, dtype=np.float32))
        print("FAISS index created successfully.")
        return index, embeddings
    except Exception as e:
        print(f"Error creating semantic index: {e}")
        return None, None

def search_semantic_index(index, query, sentences, k=40):
    """
    Searches the FAISS index for the most similar sentences.
    """
    if not index or not model:
        return []

    print(f"Performing semantic search for query: '{query}' with k={k}")
    try:
        query_embedding = model.encode([query])
        distances, indices = index.search(np.array(query_embedding, dtype=np.float32), k)
        results = [sentences[i] for i in indices[0]]

        print(f"Semantic search found {len(results)} results.")
        return results
    except Exception as e:
        print(f"Error during semantic search: {e}")
        return []

def extract_information(document_data):
    """
    Extracts structured information about materials from processed document data using a hybrid approach.
    """
    print("Starting hybrid material information extraction...")

    # Smartly filter keywords to avoid redundant searches
    keywords = sorted(CORE_KEYWORDS, key=len, reverse=True)
    filtered_keywords = []
    for keyword in keywords:
        if not any(keyword in s for s in filtered_keywords):
            filtered_keywords.append(keyword)

    extracted_materials = []

    all_sentences = []
    for page_info in document_data:
        page_text = page_info['text']
        page_number = page_info['page_number']
        lines = page_text.split('\n')
        for line in lines:
            if line.strip():
                all_sentences.append({'text': line.strip(), 'page_number': page_number})

    # Create semantic index
    semantic_index, _ = create_semantic_index(all_sentences)

    processed_materials = set()

    for material_name in filtered_keywords:
        if material_name in processed_materials:
            continue

        print(f"Searching for material: {material_name}")

        # --- Hybrid Search: Keyword + Semantic ---
        found_indices = set()

        # 1. Keyword search
        for idx, sentence_info in enumerate(all_sentences):
            if re.search(r'\b' + re.escape(material_name) + r'\b', sentence_info['text'], re.IGNORECASE):
                found_indices.add(idx)

        # 2. Semantic search
        if semantic_index:
            semantic_results = search_semantic_index(semantic_index, material_name, all_sentences)
            for res in semantic_results:
                # Find the index of the result in the original list
                for idx, s in enumerate(all_sentences):
                    if s['text'] == res['text'] and s['page_number'] == res['page_number']:
                        found_indices.add(idx)
                        break

        if not found_indices:
            continue

        # --- Process found sections ---
        combined_references = []
        material_definitions = []
        other_info_list = []

        for idx in sorted(list(found_indices)):
            sentence_info = all_sentences[idx]
            sentence_text = sentence_info['text']
            page_number = sentence_info['page_number']

            # More focused context for extraction
            context_window = all_sentences[max(0, idx - 1):idx + 2]
            context = " ".join(s['text'] for s in context_window)

            heading = find_nearest_heading(all_sentences, idx)
            code_standard = extract_code_standard(sentence_text, page_number)

            # Combine heading and code standard intelligently
            reference = heading if heading else ""
            if code_standard:
                reference = f"{reference} ‚Äì {code_standard}" if reference else code_standard

            # Add page number and ensure it's not a duplicate
            if reference:
                reference_with_page = f"{reference.strip()} (Page {page_number})"
                if reference_with_page not in combined_references:
                    combined_references.append(reference_with_page)

            # Extract other info from the focused context
            material_def = extract_material_type_definition(context, material_name)
            if material_def and material_def not in material_definitions:
                material_definitions.append(material_def)

            other_info = extract_other_info(context, material_name)
            if other_info and other_info not in other_info_list:
                other_info_list.append(other_info)

        # Filter out generic or less relevant references before formatting
        final_references = [
            ref for ref in combined_references if "CHAPTER" not in ref.upper() and len(ref) > 10
        ]

        formatted_references = "\n".join(
            f"{i+1}. {ref}" for i, ref in enumerate(final_references)
        ) if final_references else "No Information Available"

        extracted_materials.append({
            'Sl. No': '',
            'Material Name': material_name,
            'Test Name/Reference Code/Standard as per the given document (with reference page number)': formatted_references,
            'Specific Material Type/Material Definition': "; ".join(material_definitions) if material_definitions else "No Information Available",
            'Any other relevant information': "; ".join(other_info_list) if other_info_list else "No Information Available"
        })
        processed_materials.add(material_name)

    df = pd.DataFrame(extracted_materials)

    if not df.empty:
        df.drop_duplicates(subset=['Material Name'], inplace=True)
        df.reset_index(drop=True, inplace=True)
        df['Sl. No'] = df.index + 1

    print("Finished hybrid material information extraction.")
    return df

def extract_code_standard(context, page_number):
    """
    Extracts IS codes and standards from the given context, with page number.
    """
    is_code_pattern = r"IS\s+\d+(?:\s*\(Part\s*[\w\d\s]+\))?"
    matches = re.findall(is_code_pattern, context)
    if matches:
        # Return only the code, page number is added later if needed
        return '; '.join(sorted(set(match.strip() for match in matches)))
    return None

def extract_material_type_definition(context, material_name):
    """
    Extracts material type or definition from the context using spaCy and enhanced fallback rules.
    """
    # Try to find a definition using spaCy's dependency parsing
    doc = nlp(context)
    for sent in doc.sents:
        if re.search(r'\b' + re.escape(material_name) + r'\b', sent.text, re.IGNORECASE):
            for token in sent:
                if token.text.lower() == material_name.lower() and token.dep_ == 'nsubj':
                    if token.head.lemma_ in ['be', 'consist', 'include', 'mean', 'refer']:
                        definition = [child.text for child in token.head.children if child.dep_ not in ['punct', 'advmod']]
                        if definition:
                            return " ".join(definition)

    # Enhanced fallback rules with more specific patterns
    fallback_patterns = {
        "Fine Aggregate": r"(fine aggregate(?: is| shall be)?\s+.*?((?:passes through|retained on)\s+\d+\.\d+\s+mm\s+IS\s+sieve|conforming to IS \d+).*?(?=\.|$))",
        "Coarse Aggregate": r"(coarse aggregate(?: is| shall be)?\s+.*?((?:retained on)\s+\d+\.\d+\s+mm\s+IS\s+sieve|conforming to IS \d+).*?(?=\.|$))",
        "Cement": r"((?:Ordinary\s+)?Portland\s+cement(?: of\s+\d+\s+Grade)?|cement(?: is| shall be)?\s+.*?(?=\.|$))",
        "Water": r"(water(?: is| shall be)?\s+.*?(?=\.|$))",
        "Steel": r"(steel(?: is| shall be)?\s+.*?(?=\.|$)|reinforcement(?: is| shall be)?\s+.*?(?=\.|$))",
        "Concrete": r"(concrete(?: is| shall be)?\s+.*?(?=\.|$))",
        "Admixture": r"(admixture(?: is| shall be)?\s+.*?(?=\.|$))",
        "Brick": r"(brick aggregate(?: is| shall be)?\s+.*?(?=\.|$))",
        "Gravel": r"(gravel(?: is| shall be)?\s+.*?(?=\.|$))",
        "Sand": r"(sand(?: is| shall be)?\s+.*?(?=\.|$))",
        "Reinforcement": r"(reinforcement(?: is| shall be)?\s+.*?(?=\.|$))",
        "Formwork": r"(formwork(?: is| shall be)?\s+.*?(?=\.|$))",
        "Shuttering": r"(shuttering(?: is| shall be)?\s+.*?(?=\.|$))",
        "Piers": r"(piers(?: are| shall be)?\s+.*?(?=\.|$))",
        "Abutments": r"(abutments(?: are| shall be)?\s+.*?(?=\.|$))",
        "Columns": r"(columns(?: are| shall be)?\s+.*?(?=\.|$))",
        "Slabs": r"(slabs(?: are| shall be)?\s+.*?(?=\.|$))",
        "Beams": r"(beams(?: are| shall be)?\s+.*?(?=\.|$))",
        "Walls": r"(walls(?: are| shall be)?\s+.*?(?=\.|$))",
        "Foundations": r"(foundations(?: are| shall be)?\s+.*?(?=\.|$))",
        "Piles": r"(piles(?: are| shall be)?\s+.*?(?=\.|$))",
        "Couplers": r"(couplers(?: are| shall be)?\s+.*?(?=\.|$))",
        "Jali": r"(jali(?: is| shall be)?\s+.*?(?=\.|$))",
        "Particle Board": r"(particle board(?: is| shall be)?\s+.*?(?=\.|$))",
        "Damp Proof Course": r"(damp proof course(?: is| shall be)?\s+.*?(?=\.|$))",
        "Slump Test": r"(slump test(?: is| shall be)?\s+.*?(?=\.|$))",
        "Cube Test": r"(cube test(?: is| shall be)?\s+.*?(?=\.|$))",
        "Slag": r"(slag(?: is| shall be)?\s+.*?(?=\.|$))",
        "Pozzolana": r"(pozzolana(?: is| shall be)?\s+.*?(?=\.|$))",
        "TMT Bars": r"(TMT bars(?: are| shall be)?\s+.*?(?=\.|$))",
        "MS bars": r"(MS bars(?: are| shall be)?\s+.*?(?=\.|$))",
        "HYSD bars": r"(HYSD bars(?: are| shall be)?\s+.*?(?=\.|$))",
        "Galvanised Sleeves": r"(galvanised sleeves(?: are| shall be)?\s+.*?(?=\.|$))",
        "Polymer Block": r"(polymer block(?: is| shall be)?\s+.*?(?=\.|$))",
        "Waterproofing Materials": r"(waterproofing materials(?: are| shall be)?\s+.*?(?=\.|$))",
        "Bitumen felt": r"(bitumen felt(?: is| shall be)?\s+.*?(?=\.|$))",
        "Copper plate": r"(copper plate(?: is| shall be)?\s+.*?(?=\.|$))",
        "lignite": r"(lignite(?: is| shall be)?\s+.*?(?=\.|$))",
        "mica": r"(mica(?: is| shall be)?\s+.*?(?=\.|$))",
        "shale": r"(shale(?: is| shall be)?\s+.*?(?=\.|$))",
        "clay": r"(clay(?: is| shall be)?\s+.*?(?=\.|$))",
        "pyrites": r"(pyrites(?: is| shall be)?\s+.*?(?=\.|$))",
        "coal": r"(coal(?: is| shall be)?\s+.*?(?=\.|$))",
        "sea shells": r"(sea shells(?: are| shall be)?\s+.*?(?=\.|$))",
        "organic impurities": r"(organic impurities(?: are| shall be)?\s+.*?(?=\.|$))",
        "pentachlorophenol": r"(pentachlorophenol(?: is| shall be)?\s+.*?(?=\.|$))"
    }

    for mat, pattern in fallback_patterns.items():
        if mat.lower() == material_name.lower():
            match = re.search(pattern, context, re.IGNORECASE)
            if match:
                # Return the first non-empty matched group
                for group in match.groups():
                    if group:
                        return group.strip()

    # If still no definition, return a more informative message
    return f"No specific definition for {material_name} could be determined from the context."

def extract_other_info(context, material_name):
    """
    Extracts supplementary information, prioritizing tables, notes, and recommendations.
    """
    # If the context contains keywords like "Table", "Note", or "IS recommends", it's likely important.
    if "Table" in context or "Note" in context or "IS recommends" in context:
        return context.strip()
    return None

Initializing Sentence Transformer model...
Error loading Sentence Transformer model: Failed to import transformers.models.bert.modeling_bert because of the following error (look up to see its traceback):
cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub' (/usr/local/lib/python3.11/dist-packages/huggingface_hub/__init__.py)


### **Code Block 3: AI-Powered Data Refinement**

In [5]:
# This part is directly from ai_buddy.py
_current_gemini_key_index = 1 # Start with key index 1 (GEMINI_API_KEY_2)

def _log_ai_response(response_text, is_error=False, model_name=""):
    """Logs AI responses or errors to a file."""
    # Logging to a file is disabled in Colab for simplicity
    pass

def _log_ai_context(prompt):
    """Logs the full prompt sent to the AI."""
    # Logging to a file is disabled in Colab for simplicity
    pass

def _get_next_gemini_key():
    """Rotates to the next available Gemini API key."""
    global _current_gemini_key_index
    _current_gemini_key_index = (_current_gemini_key_index + 1) % len(GEMINI_API_KEYS)
    return GEMINI_API_KEYS[_current_gemini_key_index]

def _extract_json_from_string(text):
    """
    Finds and extracts the first valid JSON array from a string.
    Handles cases where the AI includes text before or after the JSON.
    """
    # Find the starting bracket of the JSON array
    start_index = text.find('[')
    if start_index == -1:
        return None

    # Find the matching closing bracket for the array
    end_index = -1
    open_brackets = 0
    for i in range(start_index, len(text)):
        if text[i] == '[':
            open_brackets += 1
        elif text[i] == ']':
            open_brackets -= 1
            if open_brackets == 0:
                end_index = i + 1
                break

    if end_index == -1:
        return None

    # Return the extracted JSON string
    return text[start_index:end_index]


def refine_batch_with_ai(batch_data):
    """
    Sends a batch of data to AI for refinement.
    Cycles through Gemini keys on failure, then falls back to Gemma.
    """
    global _current_gemini_key_index

    if not batch_data:
        return []

    print(f"Starting AI refinement for a batch of {len(batch_data)} rows...")

    prompt_template = f"""
    You are an expert civil engineering assistant. Your task is to refine the following structured data, provided as a JSON array.
    The JSON must be perfectly formatted. Ensure all string values with double-quotes are properly escaped (e.g., "some \\"quoted\\" text").

    Data to refine:
    {json.dumps(batch_data, indent=2)}

    Follow these instructions for each object:
    1.  "Sl. No": Keep original value.
    2.  "Material Name": Keep original value.
    3.  "Test Name/Reference Code/Standard...": From the provided references, select the top 7 to 9 most relevant ones. Prioritize references that are specific (e.g., IS codes, table numbers, detailed section numbers) and directly support the "Any other relevant information" field. List each selected reference on a new line, numbered (1., 2., etc.). If fewer than 7 relevant references are found, list all that are relevant.
    4.  "Specific Material Type/Material Definition": Extract the specific type or definition as found in the document. Do not generate a new definition. If no specific type can be determined, state "No specific definition could be determined from the context.".
    5.  "Any other relevant information": Provide concise (1-2 paragraphs) details for a civil engineer.

    Your response MUST be ONLY the JSON array of objects. Do not include any explanatory text, comments, or any characters before or after the opening `[` and closing `]` of the JSON array.
    """

    refined_data = None

    # --- Try Gemini with Key Rotation ---
    gemini_is_down = False
    initial_key_index = _current_gemini_key_index

    for i in range(len(GEMINI_API_KEYS)):
        current_key_index = (_current_gemini_key_index + i) % len(GEMINI_API_KEYS)
        current_key = GEMINI_API_KEYS[current_key_index]

        try:
            print(f"Attempting Gemini API with key index: {current_key_index}...")
            genai.configure(api_key=current_key)
            _log_ai_context(prompt_template)

            model = genai.GenerativeModel('gemini-2.0-flash')
            response = model.generate_content(prompt_template)

            _log_ai_response(response.text, model_name=f"Gemini (Key {current_key_index})")

            json_text = _extract_json_from_string(response.text)
            if json_text:
                refined_data = json.loads(json_text)
            else:
                raise json.JSONDecodeError("No valid JSON array found in the response.", response.text, 0)

            print("Gemini API response received successfully.")
            _current_gemini_key_index = current_key_index # Update current key index on success
            break  # Success, exit the loop

        except json.JSONDecodeError as e:
            error_message = f"Failed to decode JSON from Gemini response with key {current_key_index}: {e}"
            print(error_message)
            _log_ai_response(error_message, is_error=True, model_name=f"Gemini (Key {current_key_index})")
            continue # Try next key
        except (google_exceptions.PermissionDenied, google_exceptions.Unauthenticated, google_exceptions.ResourceExhausted) as e:
            error_message = f"Gemini API Key {current_key_index} failed: {e}"
            print(error_message)
            _log_ai_response(error_message, is_error=True, model_name=f"Gemini (Key {current_key_index})")
            continue # Try next key
        except google_exceptions.ServiceUnavailable as e:
            error_message = f"Gemini API service is unavailable: {e}"
            print(error_message)
            _log_ai_response(error_message, is_error=True, model_name="Gemini")
            gemini_is_down = True
            break # Service is down, no point rotating
        except Exception as e:
            error_message = f"An unexpected error occurred with Gemini: {e}"
            print(error_message)
            _log_ai_response(error_message, is_error=True, model_name="Gemini")
            gemini_is_down = True
            break # Assume a critical failure
    else: # This block runs if the for loop completes without a 'break'
        print("All Gemini keys failed. Falling back.")
        gemini_is_down = True

    if gemini_is_down:
        refined_data = None

    # --- Fallback to OpenRouter (Gemma) ---
    if refined_data is None:
        if not OPENROUTER_API_KEY or OPENROUTER_API_KEY == "your_openrouter_api_key":
            print("OpenRouter API key not configured. Cannot fall back to Gemma.")
            return batch_data

        print("Attempting OpenRouter (Gemma) for batch refinement...")
        _log_ai_context(prompt_template)
        headers = {"Authorization": f"Bearer {OPENROUTER_API_KEY}", "Content-Type": "application/json"}
        payload = {"model": "google/gemma-3n-e2b-it:free", "messages": [{"role": "user", "content": prompt_template}]}

        try:
            response = requests.post(OPENROUTER_API_URL, headers=headers, json=payload, timeout=90)
            response.raise_for_status()

            gemma_text = response.json()['choices'][0]['message']['content']
            _log_ai_response(gemma_text, model_name="Gemma")

            json_text = _extract_json_from_string(gemma_text)
            if json_text:
                refined_data = json.loads(json_text)
            else:
                raise json.JSONDecodeError("No valid JSON array found in Gemma's response.", gemma_text, 0)

            print("OpenRouter (Gemma) API response received successfully.")
        except Exception as e:
            print(f"OpenRouter (Gemma) failed: {e}")
            _log_ai_response(str(e), is_error=True, model_name="Gemma")

    # --- Final Processing ---
    if refined_data:
        expected_columns = list(batch_data[0].keys())
        cleaned_data = [{key: item.get(key) for key in expected_columns} for item in refined_data]
        print("AI refinement for batch complete.")
        return cleaned_data
    else:
        print("AI refinement failed for the batch. Returning original batch data.")
        return batch_data

### **Code Block 4: Output Generation**

In [6]:
def generate_csv(dataframe, output_filepath):
    """
    Generates a CSV file from a pandas DataFrame.
    """
    print(f"Starting CSV generation for {os.path.basename(output_filepath)}...")
    try:
        dataframe.to_csv(output_filepath, index=False, encoding='utf-8')
        print(f"CSV file generated successfully at: {output_filepath}")
        return True
    except Exception as e:
        print(f"Error generating CSV file: {e}")
        return False

def generate_pdf(dataframe, output_filepath, title="Material Extraction Report"):
    """
    Generates a PDF file from a pandas DataFrame using xhtml2pdf.
    """
    print(f"Starting PDF generation for {os.path.basename(output_filepath)}...")
    try:
        # Prepare data for the template
        data_for_template = dataframe.to_dict(orient='records')

        # Set up Jinja2 environment
        env = Environment(loader=FileSystemLoader('.'))
        template = env.get_template('pdf_template.html')

        # Render the HTML template with the data
        html = template.render(data=data_for_template)

        # Create the PDF
        with open(output_filepath, "w+b") as pdf_file:
            pisa_status = pisa.CreatePDF(html, dest=pdf_file)

        if pisa_status.err:
            print(f"Error generating PDF: {pisa_status.err}")
            return False

        print(f"PDF file generated successfully at: {output_filepath}")
        return True
    except Exception as e:
        print(f"Error generating PDF file: {e}")
        return False

### **Main Execution: Run the Full Pipeline**

In [7]:
# Create necessary directories
os.makedirs('uploads', exist_ok=True)
os.makedirs('downloads', exist_ok=True)

# Create a dummy pdf_template.html file for the notebook environment
with open('pdf_template.html', 'w') as f:
    f.write("""<!DOCTYPE html>
<html>
<head>
    <title>Material Extraction Report</title>
    <style>
        @page {
            size: landscape;
            margin: 1cm;
        }
        body {
            font-family: sans-serif;
            font-size: 10px;
        }
        h1 {
            text-align: center;
        }
        table {
            width: 100%;
            border-collapse: collapse;
        }
        th, td {
            border: 1px solid black;
            padding: 5px;
            text-align: left;
            vertical-align: top;
        }
        th {
            background-color: #f2f2f2;
        }
        /* Style for ordered lists within cells */
        td ol {
            margin: 0;
            padding-left: 15px; /* Adjust as needed for indentation */
        }
        td ol li {
            margin-bottom: 2px;
        }
    </style>
</head>
<body>
    <h1>Material Extraction Report</h1>
    <table>
        <thead>
            <tr>
                <th>Sl. No</th>
                <th>Material Name</th>
                <th>Test Name/Reference Code/Standard as per the given document (with reference page number)</th>
                <th>Specific Material Type/Material Definition</th>
                <th>Any other relevant information</th>
            </tr>
        </thead>
        <tbody>
            {% for row in data %}
            <tr>
                <td>{{ row['Sl. No'] }}</td>
                <td>{{ row['Material Name'] }}</td>
                <td>
                    <ol>
                        {% for item in row['Test Name/Reference Code/Standard as per the given document (with reference page number)'].split('\n') %}
                            {% if item.strip() %}
                                <li>{{ item.split('.', 1)[1]|default(item)|trim }}</li>
                            {% endif %}
                        {% endfor %}
                    </ol>
                </td>
                <td>{{ row['Specific Material Type/Material Definition']|replace('\n', '<br>')|safe }}</td>
                <td>{{ row['Any other relevant information']|replace('\n', '<br>')|safe }}</td>
            </tr>
            {% endfor %}
        </tbody>
    </table>
</body>
</html>""")

print("Please upload your technical document (PDF or image)...")
uploaded = files.upload()
filepath = list(uploaded.keys())[0]

print(f"\nFile '{filepath}' uploaded successfully.")

try:
    print("--- Document processing starts ---")
    document_data = process_document(filepath)
    print("--- Document processed. ---")

    print("--- Material information extraction starts ---")
    extracted_df = extract_information(document_data)
    print("--- Material information extracted. ---")

    print("--- AI refinement starts ---")
    extracted_list = extracted_df.to_dict(orient='records')

    refined_list = []
    batch_size = 5
    for i in range(0, len(extracted_list), batch_size):
        batch = extracted_list[i:i + batch_size]
        refined_batch = refine_batch_with_ai(batch)
        refined_list.extend(refined_batch)

    refined_df = pd.DataFrame(refined_list)
    print("--- AI refinement complete. ---")

    col_name = 'Test Name/Reference Code/Standard as per the given document (with reference page number)'
    if col_name in refined_df.columns:
        refined_df = refined_df[refined_df[col_name] != "No Information Available"]
        print("Filtered out empty reference rows.")

    refined_df.reset_index(drop=True, inplace=True)
    refined_df['Sl. No'] = refined_df.index + 1
    print("Serial numbers re-assigned.")

    print("\n--- Final Extracted Data ---")
    display(refined_df)

    print("\n--- Generating Reports ---")
    csv_filepath = os.path.join('downloads', 'material_report.csv')
    pdf_filepath = os.path.join('downloads', 'material_report.pdf')
    generate_csv(refined_df, csv_filepath)
    generate_pdf(refined_df, pdf_filepath)

except Exception as e:
    print(f"\nAn error occurred during the pipeline execution: {e}")

Please upload your technical document (PDF or image)...


Saving test_doc_for_NLP_HACKATHON.pdf to test_doc_for_NLP_HACKATHON (1).pdf

File 'test_doc_for_NLP_HACKATHON (1).pdf' uploaded successfully.
--- Document processing starts ---
Starting document processing for test_doc_for_NLP_HACKATHON (1).pdf...
Processing PDF: test_doc_for_NLP_HACKATHON (1).pdf
Finished processing PDF: test_doc_for_NLP_HACKATHON (1).pdf
Document processing complete for test_doc_for_NLP_HACKATHON (1).pdf.
--- Document processed. ---
--- Material information extraction starts ---
Starting hybrid material information extraction...
Sentence Transformer model not loaded. Skipping semantic index creation.
Searching for material: Waterproofing Materials
Searching for material: Galvanised Sleeves
Searching for material: organic impurities
Searching for material: Damp Proof Course
Searching for material: pentachlorophenol
Searching for material: Particle Board
Searching for material: Reinforcement
Searching for material: Polymer Block
Searching for material: Bitumen felt
Sea

ERROR:tornado.access:503 POST /v1beta/models/gemini-2.0-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 1977.26ms


Gemini API response received successfully.
AI refinement for batch complete.
Starting AI refinement for a batch of 5 rows...
Attempting Gemini API with key index: 1...
Gemini API response received successfully.
AI refinement for batch complete.
Starting AI refinement for a batch of 5 rows...
Attempting Gemini API with key index: 1...
Gemini API response received successfully.
AI refinement for batch complete.
Starting AI refinement for a batch of 3 rows...
Attempting Gemini API with key index: 1...
Gemini API response received successfully.
AI refinement for batch complete.
--- AI refinement complete. ---
Filtered out empty reference rows.
Serial numbers re-assigned.

--- Final Extracted Data ---


Unnamed: 0,Sl. No,Material Name,Test Name/Reference Code/Standard as per the given document (with reference page number),Specific Material Type/Material Definition,Any other relevant information
0,1,Galvanised Sleeves,1. 5.6A.1.7 Mild Steel Galvanised Sleeves& Bol...,No specific definition for Galvanised Sleeves ...,"Galvanized sleeves, typically made of mild ste..."
1,2,organic impurities,1. and organic impurities in such quantity as ...,organic impurities in such quantity as to affe...,Organic impurities in concrete aggregates can ...
2,3,Damp Proof Course,1. 4.4 DAMP PROOF COURSE (Page 20)\n2. be term...,DAMP PROOF COURSE 4; damp proof course is to b...,A damp-proof course (DPC) is a crucial element...
3,4,pentachlorophenol,1. 7. IS 716 Specification for pentachlorophen...,pentachlorophenol 8; pentachlorophenol conform...,Pentachlorophenol (PCP) is a chemical compound...
4,5,Particle Board,1. particle board (Page 31)\n2. 5.7A EXPANSION...,particle board 5; PARTICLE BOARD 5; particle b...,"Particle board, particularly cement-bonded par..."
5,6,Reinforcement,1. 5.1.3 Steel for Reinforcement 129 (Page 28)...,No specific definition for Reinforcement could...,Reinforcement steel should be stored to preven...
6,7,Polymer Block,1. 5.6A.1.6 High Strength Polymer Block‚ÄìHigh S...,"Polymer Block for securing hinges, tower bolts...",High-strength polymer blocks are used for secu...
7,8,Bitumen felt,1. recommended that structures exceeding 45 m ...,bitumen felt or any such material and provisio...,Bitumen felt is mentioned in the context of ex...
8,9,Copper plate,"1. copper plate, etc. shall be paid for separa...",No specific definition for Copper plate could ...,Copper plate is mentioned as part of expansion...
9,10,Foundations,"1. (a) Foundations, footings, bases of columns...",No specific definition for Foundations could b...,Foundations are mentioned in the context of ma...



--- Generating Reports ---
Starting CSV generation for material_report.csv...
CSV file generated successfully at: downloads/material_report.csv
Starting PDF generation for material_report.pdf...
PDF file generated successfully at: downloads/material_report.pdf


### **Download Generated Reports**

In [8]:
print("Downloading generated reports...")
files.download('downloads/material_report.pdf')

Downloading generated reports...


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [8]:
files.download('downloads/material_report.csv')