<a href="https://colab.research.google.com/github/Naman1995jain/Multilingual-Knowledge-Extraction/blob/main/Multilingual_Knowledge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install Dependencies

This cell installs the Python libraries the notebook needs: `pdf2image`, `pillow`, `requests`, `sentence-transformers`, `faiss-cpu`, `langchain`, `langchain-community`, and `langchain-text-splitters`. It also installs `poppler-utils`, the command-line PDF toolkit that `pdf2image` relies on to render pages.

In [14]:
!pip install -q pdf2image pillow requests sentence-transformers faiss-cpu langchain langchain-community langchain-text-splitters
!apt-get install -y poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.12).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
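
Before running the rest of the notebook, it can help to confirm that the installs above took effect. The sketch below is one way to do that. The helper name `find_missing` is hypothetical, and the check assumes `pdftoppm` (one of the Poppler binaries that `pdf2image` wraps) is the executable to look for.

```python
import importlib.util
import shutil


def find_missing(modules, binaries):
    """Return the import names and executables that cannot be found."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_binaries = [b for b in binaries if shutil.which(b) is None]
    return missing_modules, missing_binaries


# Import names differ from some pip package names (faiss-cpu -> faiss, pillow -> PIL).
modules, binaries = find_missing(
    ["pdf2image", "PIL", "requests", "sentence_transformers", "faiss", "langchain_text_splitters"],
    ["pdftoppm"],
)
print("Missing modules:", modules or "none")
print("Missing binaries:", binaries or "none")
```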


### Import Libraries

This cell imports the modules used throughout the notebook: `os`, `re`, `json`, `time`, and `base64` from the standard library (file operations, regular expressions, JSON handling, delays between requests, and base64 encoding); `requests` for HTTP calls; `numpy` for numerical operations; `faiss` for vector indexing; `pdf2image` for PDF-to-image conversion; `google.colab.userdata` for reading Colab secrets; `langchain_text_splitters` for text splitting; and `sentence_transformers` for embedding generation.

In [15]:
import os
import re
import json
import time
import base64
import requests
import numpy as np
import faiss
from pdf2image import convert_from_path
from google.colab import userdata
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer



### Configuration Class

This `Config` class defines the application's essential parameters: `API_KEY` (read from the Colab secret `OPENROUTER_API_KEY`), `MODEL_NAME` for the LLM, `EMBEDDING_MODEL` for generating text embeddings, `PDF_PATH` for the input document, and `OUTPUT_JSON` for the processed data. The check at the end of the class body runs when the class is defined and raises a `ValueError` if the API key is empty.

In [16]:
class Config:
    API_KEY = userdata.get("OPENROUTER_API_KEY") # Ensure this is set in Colab Secrets
    MODEL_NAME = "google/gemini-2.0-flash-001" # Faster & cheaper than 1.5 Pro
    EMBEDDING_MODEL = "intfloat/multilingual-e5-large"
    PDF_PATH = "/content/test.pdf" # Upload your file here
    OUTPUT_JSON = "/content/structured_knowledge.json"

    # Runs once, when the class body is evaluated
    if not API_KEY:
        raise ValueError("OPENROUTER_API_KEY not found in Colab Secrets!")
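
Because `google.colab.userdata` is importable only inside Colab, the class above cannot be defined elsewhere. A minimal sketch of a fallback follows, assuming the key may also be supplied through an environment variable of the same name. The helper `load_api_key` is hypothetical and not part of the notebook.

```python
import os


def load_api_key(name="OPENROUTER_API_KEY"):
    """Read a secret from Colab if available, else from the environment."""
    try:
        from google.colab import userdata  # Only importable inside Colab
        key = userdata.get(name)
    except Exception:
        # Outside Colab, or the secret could not be read from this notebook
        key = None
    key = key or os.environ.get(name)
    if not key:
        raise ValueError(f"{name} not found in Colab Secrets or the environment")
    return key
```

With this helper, `API_KEY = load_api_key()` could replace the direct `userdata.get` call, so the same code runs locally once the variable is exported.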

### Document Processor Class

The `DocumentProcessor` class is responsible for extracting text content from PDF pages. It uses `pdf2image` to convert PDF pages into images and then sends these images to an LLM (via OpenRouter API) for layout-aware OCR. The `_encode_image` method handles base64 encoding of images, and `extract_page_content` constructs a prompt to instruct the LLM to output structured Markdown. The `process_pdf` method iterates through all pages, extracts content, and saves it as a JSON file.

In [17]:
class DocumentProcessor:
    def __init__(self, config):
        self.config = config
        self.headers = {
            "Authorization": f"Bearer {config.API_KEY}",
            "Content-Type": "application/json",
            "HTTP-Referer": "https://colab.research.google.com"
        }

    def _encode_image(self, image_path):
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    def extract_page_content(self, image_path, page_num):
        """Sends page image to LLM for Layout-Aware OCR"""
        b64_image = self._encode_image(image_path)

        # PROMPT ENGINEERING: Enforce Markdown structure for downstream chunking
        prompt = """
        Analyze this textbook page (Gujarati/Sanskrit). Extract the text while strictly preserving structure using Markdown:
        1. Use '#' for Main Titles (like Book Name).
        2. Use '##' for Chapters (Adhyay).
        3. Use '###' for Verses (Shlokas) or Sub-sections.
        4. Keep Sanskrit Shlokas and Gujarati commentary distinct.
        5. Do NOT translate. Output raw text exactly as seen.
        """

        payload = {
            "model": self.config.MODEL_NAME,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}}
                    ]
                }
            ]
        }

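        try:
            # A timeout keeps one stalled request from hanging the whole run
            response = requests.post(
                "https://openrouter.ai/api/v1/chat/completions",
                headers=self.headers,
                json=payload,
                timeout=120,
            )
            response.raise_for_status()
            return response.json()['choices'][0]['message']['content']
        except Exception as e:
            # On failure the page is recorded with empty content rather than aborting
            print(f"⚠️ Error on page {page_num}: {e}")
            return ""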

    def process_pdf(self):
        print(f"🚀 Processing PDF: {self.config.PDF_PATH}")
        images = convert_from_path(self.config.PDF_PATH, dpi=150) # 150 DPI is enough for LLMs
        full_document = []

        for i, img in enumerate(images):
            page_num = i + 1
            print(f"   📄 Scanning Page {page_num}/{len(images)}...")

            temp_path = f"temp_page_{page_num}.jpg"
            img.save(temp_path, "JPEG")

            content = self.extract_page_content(temp_path, page_num)

            # Structuring the raw data
            full_document.append({
                "page": page_num,
                "content": content
            })

            os.remove(temp_path)
            time.sleep(1) # Rate limit politeness

        # Save as JSON (Better than Docx for data)
        with open(self.config.OUTPUT_JSON, "w", encoding="utf-8") as f:
            json.dump(full_document, f, ensure_ascii=False, indent=2)

        print(f"✅ OCR Complete. Saved to {self.config.OUTPUT_JSON}")
        return full_document
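
The request body that `extract_page_content` sends can be inspected in isolation. The sketch below builds the same message shape from raw image bytes using a base64 data URL. The function name `build_vision_payload` and the placeholder bytes are illustrative assumptions, and no request is sent.

```python
import base64


def build_vision_payload(model, prompt, image_bytes, mime="image/jpeg"):
    """Build a chat-completions payload with one text part and one inline image."""
    b64_image = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64_image}"}},
                ],
            }
        ],
    }


# Placeholder bytes stand in for a real JPEG page image.
payload = build_vision_payload("google/gemini-2.0-flash-001", "Extract the text.", b"\xff\xd8\xff\xe0")
url = payload["messages"][0]["content"][1]["image_url"]["url"]
print(url)
```

Taking bytes rather than a file path would also let a page image be encoded straight from memory, which may make the temporary JPEG files unnecessary.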

### Knowledge Base Class

The `KnowledgeBase` class manages the creation and retrieval of information from the processed PDF. The `chunk_data` method uses `MarkdownHeaderTextSplitter` and `RecursiveCharacterTextSplitter` from `langchain_text_splitters` to divide the document into smaller, context-rich chunks while preserving hierarchical information (like Book Title, Chapter, Section). The `build_index` method generates embeddings for these chunks using a `SentenceTransformer` model and stores them in a FAISS vector index. The `retrieve` method then uses this index to find the most relevant chunks given a query.
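
The header-inheritance idea described above can be illustrated without LangChain. The function below is a simplified stand-in for what `MarkdownHeaderTextSplitter` does, not its actual implementation. The name `split_by_headers` and the sample page are hypothetical.

```python
def split_by_headers(markdown, header_names):
    """Split Markdown into sections, tagging each with the headers above it.

    header_names maps a marker such as "##" to a metadata key such as "Chapter".
    """
    sections, current_meta, buffer = [], {}, []
    markers = list(header_names)

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            sections.append({"text": text, "metadata": dict(current_meta)})
        buffer.clear()

    for line in markdown.splitlines():
        # The trailing space keeps "#" from matching "##" or "###" lines
        marker = next((m for m in markers if line.startswith(m + " ")), None)
        if marker is None:
            buffer.append(line)
            continue
        flush()
        # A new header replaces its own level and drops any deeper levels
        for m in markers:
            if len(m) >= len(marker):
                current_meta.pop(header_names[m], None)
        current_meta[header_names[marker]] = line[len(marker):].strip()
    flush()
    return sections


page = "# Gita\n## Chapter 1\n### Verse 1\nfirst verse\n### Verse 2\nsecond verse\n## Chapter 2\nintro"
sections = split_by_headers(page, {"#": "Book Title", "##": "Chapter", "###": "Section"})
for s in sections:
    print(s["metadata"], "->", s["text"])
```

Note that `chunk_data` below splits each page on its own, so a chapter header that appears on one page is not carried into the metadata of chunks from later pages.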

In [18]:
class KnowledgeBase:
    def __init__(self, config):
        self.config = config
        self.embedder = SentenceTransformer(config.EMBEDDING_MODEL)
        self.index = None
        self.chunks = []

    def chunk_data(self, json_data):
        print("🧩 Chunking data with Hierarchy Preservation...")

        # 1. Split by Headers first (Keep Chapter context)
        headers_to_split_on = [
            ("#", "Book Title"),
            ("##", "Chapter"),
            ("###", "Section"),
        ]
        markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

        # 2. Recursive split for long sections
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

        processed_chunks = []

        for page in json_data:
            # First, split by Markdown headers
            md_docs = markdown_splitter.split_text(page['content'])

            for doc in md_docs:
                # Further split long sections if needed
                splits = text_splitter.split_text(doc.page_content)

                for split in splits:
                    processed_chunks.append({
                        "text": split,
                        "metadata": {
                            "page": page['page'],
                            **doc.metadata # Inherits "Chapter", "Book Title" from headers
                        }
                    })

        self.chunks = processed_chunks
        print(f"✅ Created {len(self.chunks)} knowledge chunks.")
        return processed_chunks

    def build_index(self):
        print("🧠 Generating Embeddings (This may take a moment)...")
        texts = [c["text"] for c in self.chunks]

        # E5 requires "passage: " prefix for documents
        texts_for_embedding = [f"passage: {t}" for t in texts]

        embeddings = self.embedder.encode(texts_for_embedding, convert_to_numpy=True, normalize_embeddings=True)

        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)
        self.index.add(embeddings)
        print("✅ Vector Index Built.")

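    def retrieve(self, query, k=4):
        if self.index is None:
            raise RuntimeError("Call build_index() before retrieve().")

        # E5 requires "query: " prefix for queries
        query_vec = self.embedder.encode([f"query: {query}"], convert_to_numpy=True, normalize_embeddings=True)
        # With IndexFlatIP on normalized vectors, the scores are cosine similarities
        scores, indices = self.index.search(query_vec, k)

        # FAISS pads with -1 when fewer than k vectors are indexed; skip those
        return [self.chunks[idx] for idx in indices[0] if idx >= 0]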
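
`build_index` normalizes the embeddings and stores them in an inner-product index, which together amount to cosine-similarity search. The pure-Python sketch below shows that equivalence on toy 2-D vectors. The helper names are hypothetical, and it does not use FAISS.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, vectors, k):
    """Return (index, score) pairs for the k highest inner products."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Toy "embeddings": the second vector points almost the same way as the query.
vectors = [[1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]
results = top_k([0.5, 0.9], vectors, k=2)
print(results)
```

Because every vector has unit length, each score lies between -1 and 1, and a higher score means a closer match.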

### RAG Assistant Class

The `RAGAssistant` class integrates the retrieval and generation components. When asked a question, it calls the `KnowledgeBase`'s `retrieve` method to fetch relevant document chunks, formats them with their page and chapter metadata as `context`, and builds a prompt for the LLM. The prompt instructs the LLM to answer *only* from the provided context. This is the Retrieval-Augmented Generation (RAG) pattern, which helps keep answers grounded in the source text rather than in the model's general knowledge.

In [19]:
class RAGAssistant:
    def __init__(self, config, kb):
        self.config = config
        self.kb = kb

    def ask(self, query):
        retrieved_docs = self.kb.retrieve(query)

        # Format context with Metadata for the LLM
        context_str = ""
        for i, doc in enumerate(retrieved_docs):
            meta = doc['metadata']
            chapter_info = f" [Chapter: {meta['Chapter']}]" if 'Chapter' in meta else ""
            context_str += f"Source {i+1} (Page {meta['page']}{chapter_info}):\n{doc['text']}\n\n"

        prompt = f"""
        You are an expert on the Bhagavad Gita. Answer the question based ONLY on the context below.

        CONTEXT:
        {context_str}

        QUESTION: {query}

        ANSWER (in the same language as the question):
        """

        payload = {
            "model": self.config.MODEL_NAME,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.3
        }

        try:
            response = requests.post("https://openrouter.ai/api/v1/chat/completions",
                                     headers={"Authorization": f"Bearer {self.config.API_KEY}"},
                                     json=payload,
                                     timeout=120)
            # Surface HTTP errors directly instead of as a KeyError on 'choices'
            response.raise_for_status()
            return response.json()['choices'][0]['message']['content']
        except Exception as e:
            return f"Error: {e}"
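
The prompt assembly can be tried without calling the API. The sketch below mirrors the formatting in `ask` as two standalone functions. The names `format_context` and `build_prompt` and the sample chunks are hypothetical, and the persona line of the original prompt is omitted for brevity.

```python
def format_context(docs):
    """Render retrieved chunks as numbered, citable sources."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        meta = doc["metadata"]
        chapter = f" [Chapter: {meta['Chapter']}]" if "Chapter" in meta else ""
        blocks.append(f"Source {i} (Page {meta['page']}{chapter}):\n{doc['text']}")
    return "\n\n".join(blocks)


def build_prompt(query, docs):
    """Combine the grounding instruction, the context, and the question."""
    return (
        "Answer the question based ONLY on the context below.\n\n"
        f"CONTEXT:\n{format_context(docs)}\n\n"
        f"QUESTION: {query}\n\n"
        "ANSWER (in the same language as the question):"
    )


# Hypothetical chunks in the shape produced by KnowledgeBase.chunk_data
docs = [
    {"text": "first chunk", "metadata": {"page": 1, "Chapter": "Adhyay 1"}},
    {"text": "second chunk", "metadata": {"page": 2}},
]
print(build_prompt("What is the Gita?", docs))
```

Printing the prompt this way makes it easy to check which sources the model actually saw when an answer looks incomplete.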

### Main Execution Block

This is the main entry point of the script. It initializes the `DocumentProcessor` and checks if the OCR data (`structured_knowledge.json`) already exists. If not, it processes the PDF to extract content. Then, it initializes the `KnowledgeBase`, chunks the data, and builds the FAISS index. Finally, it sets up the `RAGAssistant` and enters an interactive loop, allowing the user to ask questions about the PDF content until they type 'exit' or 'quit'.

In [20]:
if __name__ == "__main__":
    # Initialize
    processor = DocumentProcessor(Config)

    # Check if we already did OCR to save time/cost
    if not os.path.exists(Config.OUTPUT_JSON):
        raw_data = processor.process_pdf()
    else:
        print("📂 Found existing OCR data. Loading...")
        # Read with the same encoding used when the file was written
        with open(Config.OUTPUT_JSON, "r", encoding="utf-8") as f:
            raw_data = json.load(f)

    # Build Knowledge Base
    kb = KnowledgeBase(Config)
    kb.chunk_data(raw_data)
    kb.build_index()

    # Start Chat
    bot = RAGAssistant(Config, kb)

    print("\n" + "="*50)
    print("🤖 GITA AI ASSISTANT READY")
    print("="*50)

    # Example Interaction Loop
    while True:
        q = input("\nAsk a question (or 'exit'): ")
        if q.strip().lower() in ['exit', 'quit']:
            break

        print("Thinking...")
        answer = bot.ask(q)
        print(f"\nAnswer:\n{answer}")
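
The "reuse the OCR output if it exists" step can be factored into a small helper. The sketch below is one possible shape, with a fake OCR function standing in for `process_pdf`. The names `load_or_build` and `fake_ocr` are hypothetical.

```python
import json
import os
import tempfile


def load_or_build(path, build_fn):
    """Return cached JSON from path if present; otherwise build, save, and return it."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = build_fn()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data


calls = []


def fake_ocr():
    """Stand-in for the expensive OCR pass; records each invocation."""
    calls.append(1)
    return [{"page": 1, "content": "# ગીતા"}]


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "structured_knowledge.json")
    first = load_or_build(cache, fake_ocr)   # runs fake_ocr and writes the file
    second = load_or_build(cache, fake_ocr)  # reads the file; fake_ocr is not called again

print(len(calls), first == second)
```

As in the cell above, the cache is keyed only on the output path, so the JSON file has to be deleted to force a fresh OCR pass after the PDF changes.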

🚀 Processing PDF: /content/test.pdf
   📄 Scanning Page 1/10...
   📄 Scanning Page 2/10...
   📄 Scanning Page 3/10...
   📄 Scanning Page 4/10...
   📄 Scanning Page 5/10...
   📄 Scanning Page 6/10...
   📄 Scanning Page 7/10...
   📄 Scanning Page 8/10...
   📄 Scanning Page 9/10...
   📄 Scanning Page 10/10...
✅ OCR Complete. Saved to /content/structured_knowledge.json


🧩 Chunking data with Hierarchy Preservation...
✅ Created 45 knowledge chunks.
🧠 Generating Embeddings (This may take a moment)...
✅ Vector Index Built.

🤖 GITA AI ASSISTANT READY

Ask a question (or 'exit'): what is gita
Thinking...

Answer:
ગીતા અંતઃકરણની બે પ્રવૃત્તિઓનો સંઘર્ષ છે.


Ask a question (or 'exit'): What is the significance of the "Conch Sound" (Shankh-dhwani) at the beginning of the war?
Thinking...

Answer:
યુદ્ધની શરૂઆતમાં શંખધ્વનિ પાત્રોના પરાક્રમની ઘોષણા છે. કૌરવોમાં ભીષ્મે દુર્યોધનને હર્ષ ઉપજાવતો સિંહનાદ કરતો ભયપ્રદ શંખ વગાડ્યો, જે સિંહ પ્રકૃતિના ભયજનક પાસાનું પ્રતીક છે. કૌરવોએ ભયનો સંચાર કરવા સિવાય બીજી કોઈ ઘોષણા કરી ન હતી. ત્યારબાદ પુણ્યમયી પ્રવૃત્તિઓ તરફ ઘોષણા થઈ, જેમાં પહેલી ઘોષણા યોગેશ્વર શ્રીકૃષ્ણની હતી.


Ask a question (or 'exit'): explain whole book 
Thinking...

Answer:
માફ કરશો, આપેલા સંદર્ભમાં આખી પુસ્તક સમજાવવા માટે પૂરતી માહિતી નથી. સંદર્ભમાં ફક્ત થોડા પાનાં અને શ્લોકો આપવામાં આવ્યા છે, જે પુસ્તકના અમુક ભાગો વિશે જ માહિતી આપે છે. આખી પુસ્તકને સમજાવવા મા