Open Project: Automated Metadata Generator (MARS)
Submitted by : Aagam Bandi (22124001)

IMPORTANT NOTE: 
I have worked through the whole process step by step in separate cells of the ipynb file and then combined all the steps in the last cell. Run the last code cell to generate a new local or public URL for the automated metadata generator interface; do not rely on a link saved in an earlier output.

Approach used in the Project:
1. The problem statement says the input files may be in txt, docx or pdf format. For docx and txt files we can extract the text directly, without OCR, because they already store text in a digitally recognizable form.
2. A pdf file may be scanned, so its text may not be digitally recognizable and OCR may be needed. We therefore use an ocr_fallback_threshold: if the text extracted directly from a page is shorter than this threshold, we treat the page as scanned and run OCR on it. Otherwise we use the directly extracted text. We do not run OCR on every page, because that would slow down metadata generation. If a text-rich page also carries a little text inside image elements, we skip that text, as it does not matter for our overall goal of semantically rich metadata generation.
3. Most LLMs have an input token limit, so we cannot give the whole document to the LLM at once. Instead we first split the document into chunks, use the LLM to summarize each chunk while preserving any metadata-relevant details, and then combine the summaries of all the chunks.
4. We also use RAG. Once the combined summary has been generated, we give it to the LLM and ask it to generate questions that can be answered from the document. These questions are then used as the RAG query to retrieve the chunks most relevant for metadata generation.
5. Finally, we give the chunks retrieved through RAG, together with the combined summary, to the LLM, which returns the metadata as the final output.
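
The per-page decision described in step 2 can be sketched as a small helper. This is only an illustration under stated assumptions: the function name needs_ocr and the sample pages are hypothetical, and the default of 200 characters mirrors the value used in the final Gradio cell.

```python
def needs_ocr(page_text: str, image_count: int, ocr_fallback_threshold: int = 200) -> bool:
    """Return True when a page looks scanned.

    A page is treated as scanned when it contains at least one image and the
    directly extracted text is shorter than the threshold (in characters).
    """
    return image_count > 0 and len(page_text) < ocr_fallback_threshold


# Hypothetical pages: (directly extracted text, number of images on the page)
sample_pages = [
    ("", 1),               # scanned page: no text layer, one full-page image
    ("word " * 300, 2),    # digital page with a couple of figures
    ("short caption", 0),  # little text, but nothing to run OCR on
]

for number, (text, images) in enumerate(sample_pages, start=1):
    action = "run OCR" if needs_ocr(text, images) else "direct extraction only"
    print(f"Page {number}: {action}")
```

The threshold is a trade-off: a higher value sends more pages through OCR (slower, but less text is missed), while a lower value keeps the pipeline fast at the risk of skipping sparsely typed scanned pages.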

In [1]:
##Identifying the file to find out if OCR is needed or not
import os


def get_file_type_safely(file_path):
    _ , extension = os.path.splitext(file_path)
    return extension.lower()

input_file_path = "lech205.pdf"

file_type = get_file_type_safely(input_file_path)

if file_type ==".docx":
    print("The given file is a docx file so no OCR required and text can be directly extracted.")
elif file_type ==".txt":
    print("The given file is a .txt file so no OCR required and text can be directly extracted.")
else:
    print("The given file is a .pdf file and may require OCR.")

The given file is a .pdf file and may require OCR.
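
The check above treats every extension other than .docx and .txt as a pdf. A slightly stricter variant, sketched here as an assumption rather than as the code used in this notebook, rejects unsupported extensions up front. The names SUPPORTED_EXTENSIONS and classify_file are hypothetical.

```python
import os

# Hypothetical lookup table: extension -> whether OCR may be needed
SUPPORTED_EXTENSIONS = {".txt": False, ".docx": False, ".pdf": True}


def classify_file(file_path: str) -> tuple[str, bool]:
    """Return (extension, may_need_ocr) or raise ValueError for unsupported files."""
    _, extension = os.path.splitext(file_path)
    extension = extension.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {extension or '(none)'}")
    return extension, SUPPORTED_EXTENSIONS[extension]


if __name__ == "__main__":
    print(classify_file("lech205.pdf"))
```

Failing early on something like an image or spreadsheet avoids passing it to the pdf reader by mistake.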


In [None]:
#Extract the text from the file depending on its type. No OCR is used for docx and txt files. For pdf files, this exploratory cell runs OCR on every image element
#of each page and uses the normal get_text function of PyMuPDF (fitz) to extract the rest of the digitally recognizable text. The ocr_fallback_threshold check,
#which decides whether a page needs OCR at all, is applied in the final Gradio cell.
import langchain
from langchain_community.document_loaders import TextLoader, UnstructuredWordDocumentLoader
import fitz
from PIL import Image
import io
import pytesseract
from langchain.docstore.document import Document 

if file_type ==".txt":
    loader = TextLoader(input_file_path)
    documents = loader.load()
elif file_type ==".docx" or file_type==".doc":
    loader = UnstructuredWordDocumentLoader(input_file_path)
    documents = loader.load()
else:
    ##Document is in pdf format and may require OCR
    document = fitz.open(input_file_path)
    all_pages_text = []

    for page_num in range(len(document)):
        page = document.load_page(page_num)
        page_text = page.get_text()
        
        # 2. Get all images on the page
        image_list = page.get_images(full=True)
        
        # 3. If images are found, perform OCR on them
        if image_list:
            print(f"Page No.{page_num + 1}: Found {len(image_list)} image(s). Performing targeted OCR.")
            for img_index, img in enumerate(image_list):
                xref = img[0]
                base_image = document.extract_image(xref)
                image_bytes = base_image["image"]
            
                image = Image.open(io.BytesIO(image_bytes))
                ocr_text = pytesseract.image_to_string(image)
        
                page_text += "\n" + ocr_text
        else:
             print(f"Page No.{page_num + 1}: No images found. Direct extraction only.")

        all_pages_text.append(page_text)
        
    document.close()
    full_pdf_text = "\n\n".join(all_pages_text)
    documents = [Document(page_content=full_pdf_text, metadata={"source": input_file_path})] 

content = documents[0].page_content
word_list = content.split()
word_count = len(word_list)
print(word_count)

Page No.1: Found 3 image(s). Performing targeted OCR.
Page No.2: Found 18 image(s). Performing targeted OCR.
Page No.3: Found 2 image(s). Performing targeted OCR.
Page No.4: Found 2 image(s). Performing targeted OCR.
Page No.5: Found 2 image(s). Performing targeted OCR.
Page No.6: Found 2 image(s). Performing targeted OCR.
Page No.7: Found 2 image(s). Performing targeted OCR.
Page No.8: Found 2 image(s). Performing targeted OCR.
Page No.9: Found 2 image(s). Performing targeted OCR.
Page No.10: Found 60 image(s). Performing targeted OCR.
Page No.11: Found 42 image(s). Performing targeted OCR.
Page No.12: Found 2 image(s). Performing targeted OCR.
Page No.13: Found 2 image(s). Performing targeted OCR.
Page No.14: Found 3 image(s). Performing targeted OCR.
Page No.15: Found 13 image(s). Performing targeted OCR.
Page No.16: Found 34 image(s). Performing targeted OCR.
Page No.17: Found 48 image(s). Performing targeted OCR.
Page No.18: Found 2 image(s). Performing targeted OCR.
Page No.19: F
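
The next cell splits the extracted text into overlapping chunks (chunk_size = 1250, chunk_overlap = 250). To show what the overlap does, here is a simplified fixed-window sketch. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers paragraph and sentence separators), and the name sliding_window_chunks is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1250, chunk_overlap: int = 250) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    demo_text = "".join(str(i % 10) for i in range(3000))
    demo_chunks = sliding_window_chunks(demo_text)
    # The tail of one chunk is repeated at the head of the next one.
    print(len(demo_chunks), demo_chunks[0][-250:] == demo_chunks[1][:250])
```

Because the last 250 characters of a chunk are repeated at the start of the next one, a sentence cut at a chunk boundary is more likely to appear whole in at least one chunk.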

In [8]:
##Chunk the document using RecursiveCharacterTextSplitter from langchain, keeping an overlap between the chunks so that a sentence or paragraph cut at a chunk boundary
#still appears whole in a neighbouring chunk and keeps its meaning. Then we make parallel LLM API calls to summarize each chunk and combine all the summaries. We give the
#combined summary to the LLM to generate questions that can be answered from the document and are suitable as the RAG query for retrieving the chunks
#relevant for metadata generation. We use those questions as the RAG query to retrieve the relevant chunks, and then give the retrieved chunks along with the
#combined summary to the LLM for metadata extraction. We have used gemini 2.0 flash as the LLM and all-MiniLM-L6-v2 (through HuggingFaceEmbeddings) as the embedding model.
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
import google.generativeai as genai
import os
from langchain_community.vectorstores import Chroma
import dotenv
from dotenv import load_dotenv
import asyncio
import shutil
import chromadb
from langchain_community.embeddings import HuggingFaceEmbeddings 



load_dotenv()

google_api_key = os.getenv("GOOGLE_API_KEY")

genai.configure(api_key=google_api_key)

llm_model = genai.GenerativeModel("gemini-2.0-flash")


text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1250, chunk_overlap=250)
docs = text_splitter.split_documents(documents)

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")


db = Chroma.from_documents(docs, embedding_model)

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})

async def summarize_chunk_with_sdk(model, chunk_text, chunk_number, semaphore):
    async with semaphore:
        print(f"  -> Starting API call for chunk {chunk_number}...")
        
        prompt = f"Provide a very concise 1-2 sentence summary of the following text excerpt:\n\n---{chunk_text}\n---. Any data in the excerpt which is relevant for metadata generation like file name, author name, title etc should be included in the summary as the summary will be later used for metadata generation. Also give the word count of the chunk after the summary."
        response = await model.generate_content_async(prompt)
        print(f"  <- Success on chunk {chunk_number}.")
        return response.text.strip()

llm_model_async = genai.GenerativeModel("gemini-2.0-flash")
concurrency_limit = 50
semaphore = asyncio.Semaphore(concurrency_limit)

tasks = []
for i, chunk in enumerate(docs):
    task = summarize_chunk_with_sdk(llm_model_async, chunk.page_content, i + 1, semaphore)
    
    tasks.append(task)
    
# 4. Gather the results concurrently
summaries = await asyncio.gather(*tasks)

combined_summaries = "\n".join(summaries)

# print("\n--- COMBINED SUMMARIES (Using Async SDK) ---")
# print(combined_summaries)

# f-string so that {combined_summaries} is actually substituted into the prompt
prompt = f"You have been given a sequence of summaries combined which are basically the summaries of chunks of a large pdf. You have to give me three questions that can be answered by the pdf. These questions when used in RAG should be able to retrieve chunks of the pdf which are relevant for semantically rich metadata generation. The combined sequence of summaries is: {combined_summaries}. As an output only give the questions."

response = llm_model.generate_content(prompt)
# print("------------------------------------------------------------------------------------")
# print(response.text)
query_for_retriever = response.text

relevant_docs = retriever.invoke(query_for_retriever)

retrieved_context_list = [doc.page_content for doc in relevant_docs]


retrieved_context = "\n".join(retrieved_context_list)

final_prompt = f"""
You are an expert document analyst and metadata generation specialist. Your task is to synthesize information from two distinct sources: a high-level summary of a document and specific, relevant excerpts from the original text.

Based on ALL the information provided below, generate a single, clean JSON object containing the document's metadata. The metadata should also have summary of the document. Also you've been given that the word count of the document is: {word_count}. Mention this word count as it is in the metadata.

**IMPORTANT RULES:**
1.  Your entire response MUST be only the JSON object, with no introductory text, explanations, or any characters before or after the opening and closing curly braces (`{{` and `}}`).
2.  The JSON object must strictly adhere to the specified structure and keys.

---
**INPUT SOURCE 1: OVERALL DOCUMENT SUMMARY**
(This provides the high-level context and gist of the entire document. This is basically a sequence of summaries of all the chunks of the document. Each summary also contains the word count of its chunk. In the metadata you are finally giving, also give the total word count.)

{combined_summaries}
---
**INPUT SOURCE 2: RELEVANT EXCERPTS FROM ORIGINAL TEXT**
(This provides specific, detailed evidence and facts.)

{retrieved_context}
---"""

metadata_response = llm_model.generate_content(final_prompt)

metadata = metadata_response.text

print(metadata)

  -> Starting API call for chunk 1...
  -> Starting API call for chunk 2...
  -> Starting API call for chunk 3...
  -> Starting API call for chunk 4...
  -> Starting API call for chunk 5...
  -> Starting API call for chunk 6...
  -> Starting API call for chunk 7...
  -> Starting API call for chunk 8...
  -> Starting API call for chunk 9...
  -> Starting API call for chunk 10...
  -> Starting API call for chunk 11...
  -> Starting API call for chunk 12...
  -> Starting API call for chunk 13...
  -> Starting API call for chunk 14...
  -> Starting API call for chunk 15...
  -> Starting API call for chunk 16...
  -> Starting API call for chunk 17...
  -> Starting API call for chunk 18...
  -> Starting API call for chunk 19...
  -> Starting API call for chunk 20...
  -> Starting API call for chunk 21...
  -> Starting API call for chunk 22...
  -> Starting API call for chunk 23...
  -> Starting API call for chunk 24...
  -> Starting API call for chunk 25...
  -> Starting API call for chunk 2
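
The concurrency pattern used above (an asyncio.Semaphore that caps the number of in-flight requests, combined with asyncio.gather) can be tried without any API key. In this sketch the function fake_summarize is a hypothetical stand-in for the real LLM call, and summarize_all is an illustrative name.

```python
import asyncio


async def fake_summarize(chunk_text: str) -> str:
    """Hypothetical stand-in for the LLM call: waits briefly, returns a truncated 'summary'."""
    await asyncio.sleep(0.01)
    return chunk_text[:20]


async def summarize_all(chunks: list[str], concurrency_limit: int = 50) -> list[str]:
    """Summarize all chunks with at most concurrency_limit calls running at once."""
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def bounded(chunk: str) -> str:
        async with semaphore:  # blocks here while the limit is reached
            return await fake_summarize(chunk)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(chunk) for chunk in chunks))


def run_demo() -> list[str]:
    demo_chunks = [f"chunk {i} text" for i in range(5)]
    return asyncio.run(summarize_all(demo_chunks, concurrency_limit=2))


if __name__ == "__main__":
    print(run_demo())
```

Inside a notebook cell an event loop is already running, so there one would write `await summarize_all(chunks)` directly instead of calling asyncio.run. Since gather preserves input order, the combined summary keeps the chunk order of the document.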

In [None]:
##Here I have combined all the previous steps in a single cell block of code
import langchain
from langchain_community.document_loaders import TextLoader, UnstructuredWordDocumentLoader
import fitz
from PIL import Image
import io
import pytesseract
from langchain.docstore.document import Document 
import os
import dotenv
from dotenv import load_dotenv
import asyncio
import shutil
import chromadb
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
import google.generativeai as genai
from langchain_community.embeddings import HuggingFaceEmbeddings 



load_dotenv()

openai_api_key = os.getenv("OPENAI_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")

genai.configure(api_key=google_api_key)

def get_file_type_safely(file_path):
    _ , extension = os.path.splitext(file_path)
    return extension.lower()

input_file_path = "combined.pdf"

file_type = get_file_type_safely(input_file_path)

if file_type ==".docx":
    print("The given file is a docx file so no OCR required and text can be directly extracted.")
elif file_type ==".txt":
    print("The given file is a .txt file so no OCR required and text can be directly extracted.")
else:
    print("The given file is a .pdf file and may require OCR.")

if file_type ==".txt":
    loader = TextLoader(input_file_path)
    documents = loader.load()
elif file_type ==".docx" or file_type==".doc":
    loader = UnstructuredWordDocumentLoader(input_file_path)
    documents = loader.load()
else:
    ##Document is in pdf format and may require OCR
    document = fitz.open(input_file_path)
    all_pages_text = []

    for page_num in range(len(document)):
        page = document.load_page(page_num)
        page_text = page.get_text()
        
        # 2. Get all images on the page
        image_list = page.get_images(full=True)
        
        # 3. If images are found, perform OCR on them
        if image_list:
            print(f"Page No.{page_num + 1}: Found {len(image_list)} image(s). Performing targeted OCR.")
            for img_index, img in enumerate(image_list):
                xref = img[0]
                base_image = document.extract_image(xref)
                image_bytes = base_image["image"]
            
                image = Image.open(io.BytesIO(image_bytes))
                ocr_text = pytesseract.image_to_string(image)
        
                page_text += "\n" + ocr_text
        else:
             print(f"Page No.{page_num + 1}: No images found. Direct extraction only.")

        all_pages_text.append(page_text)
        
    document.close()
    full_pdf_text = "\n\n".join(all_pages_text)
    documents = [Document(page_content=full_pdf_text, metadata={"source": input_file_path})]

content = documents[0].page_content
word_list = content.split()
word_count = len(word_list)

llm_model = genai.GenerativeModel("gemini-2.0-flash")

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1250, chunk_overlap=250)
docs = text_splitter.split_documents(documents)

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

db = Chroma.from_documents(docs, embedding_model)

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})

async def summarize_chunk_with_sdk(model, chunk_text, chunk_number, semaphore):
    async with semaphore:
        print(f"  -> Starting API call for chunk {chunk_number}...")
        
        prompt = f"Provide a very concise 1-2 sentence summary of the following text excerpt:\n\n---{chunk_text}\n---. Any data in the excerpt which is relevant for metadata generation like file name, author name, title etc should be included in the summary as the summary will be later used for metadata generation. Also give the word count of the chunk after the summary."
        # print(prompt)
        response = await model.generate_content_async(prompt)
        print(f"  <- Success on chunk {chunk_number}.")
        return response.text.strip()

llm_model_async = genai.GenerativeModel("gemini-2.0-flash")
concurrency_limit = 50
semaphore = asyncio.Semaphore(concurrency_limit)

tasks = []
for i, chunk in enumerate(docs):
    task = summarize_chunk_with_sdk(llm_model_async, chunk.page_content, i + 1, semaphore)
    
    tasks.append(task)
    
summaries = await asyncio.gather(*tasks)

combined_summaries = "\n".join(summaries)

# f-string so that {combined_summaries} is actually substituted into the prompt
prompt = f"You have been given a sequence of summaries combined which are basically the summaries of chunks of a large pdf. You have to give me three questions that can be answered by the pdf. These questions when used in RAG should be able to retrieve chunks of the pdf which are relevant for semantically rich metadata generation. The combined sequence of summaries is: {combined_summaries}. As an output only give the questions."

response = llm_model.generate_content(prompt)
query_for_retriever = response.text

relevant_docs = retriever.invoke(query_for_retriever)

retrieved_context_list = [doc.page_content for doc in relevant_docs]


retrieved_context = "\n".join(retrieved_context_list)

final_prompt = f"""
You are an expert document analyst and metadata generation specialist. Your task is to synthesize information from two distinct sources: a high-level summary of a document and specific, relevant excerpts from the original text.

Based on ALL the information provided below, generate a single, clean JSON object containing the document's metadata. The metadata should also have summary of the document. Also you've been given that the word count of the document is: {word_count}. Mention this word count as it is in the metadata.

**IMPORTANT RULES:**
1.  Your entire response MUST be only the JSON object, with no introductory text, explanations, or any characters before or after the opening and closing curly braces (`{{` and `}}`).
2.  The JSON object must strictly adhere to the specified structure and keys.

---
**INPUT SOURCE 1: OVERALL DOCUMENT SUMMARY**
(This provides the high-level context and gist of the entire document. This is basically a sequence of summaries of all the chunks of the document. Each summary also contains the word count of its chunk. In the metadata you are finally giving, also give the total word count.)

{combined_summaries}
---
**INPUT SOURCE 2: RELEVANT EXCERPTS FROM ORIGINAL TEXT**
(This provides specific, detailed evidence and facts.)

{retrieved_context}
---"""

metadata_response = llm_model.generate_content(final_prompt)

metadata = metadata_response.text

print(metadata)

The given file is a .pdf file and may require OCR.
Page No.1: Found 4 image(s). Performing targeted OCR.
Page No.2: Found 2 image(s). Performing targeted OCR.
Page No.3: Found 21 image(s). Performing targeted OCR.
Page No.4: Found 19 image(s). Performing targeted OCR.
Page No.5: Found 20 image(s). Performing targeted OCR.
Page No.6: Found 20 image(s). Performing targeted OCR.
Page No.7: Found 30 image(s). Performing targeted OCR.
Page No.8: Found 14 image(s). Performing targeted OCR.
Page No.9: Found 20 image(s). Performing targeted OCR.
Page No.10: Found 2 image(s). Performing targeted OCR.
Page No.11: Found 2 image(s). Performing targeted OCR.
Page No.12: Found 14 image(s). Performing targeted OCR.
Page No.13: Found 2 image(s). Performing targeted OCR.
Page No.14: Found 2 image(s). Performing targeted OCR.
Page No.15: Found 2 image(s). Performing targeted OCR.
Page No.16: Found 22 image(s). Performing targeted OCR.
Page No.17: Found 2 image(s). Performing targeted OCR.
Page No.18: Fo



  -> Starting API call for chunk 1...
  -> Starting API call for chunk 2...
  -> Starting API call for chunk 3...
  -> Starting API call for chunk 4...
  -> Starting API call for chunk 5...
  -> Starting API call for chunk 6...
  -> Starting API call for chunk 7...
  -> Starting API call for chunk 8...
  -> Starting API call for chunk 9...
  -> Starting API call for chunk 10...
  -> Starting API call for chunk 11...
  -> Starting API call for chunk 12...
  -> Starting API call for chunk 13...
  -> Starting API call for chunk 14...
  -> Starting API call for chunk 15...
  -> Starting API call for chunk 16...
  -> Starting API call for chunk 17...
  -> Starting API call for chunk 18...
  -> Starting API call for chunk 19...
  -> Starting API call for chunk 20...
  -> Starting API call for chunk 21...
  -> Starting API call for chunk 22...
  -> Starting API call for chunk 23...
  -> Starting API call for chunk 24...
  -> Starting API call for chunk 25...
  -> Starting API call for chunk 2
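
The model is asked to return only a JSON object, but it may still wrap the answer in a Markdown code fence. Before the result is shown in an interface, the reply can be cleaned and parsed defensively. The helper below is a minimal sketch under that assumption; the name parse_llm_json and the "raw_output" fallback key are hypothetical.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_llm_json(raw_text: str) -> dict:
    """Strip an optional Markdown fence from an LLM reply and parse it as JSON.

    Falls back to a dict holding the raw text when the reply is not valid JSON.
    """
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the whole opening fence line (with or without a language tag)
        text = text.split("\n", 1)[1] if "\n" in text else ""
    text = text.rstrip()
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {"raw_output": raw_text}
    # Always hand back a dict so the caller can treat every result the same way
    return parsed if isinstance(parsed, dict) else {"value": parsed}


if __name__ == "__main__":
    reply = FENCE + 'json\n{"title": "Sample", "word_count": 120}\n' + FENCE
    print(parse_llm_json(reply))
```

Returning a dict in every case means the caller never has to distinguish between parsed metadata and an error string.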

In [1]:
##Here I have integrated the full metadata generation code (from the previous cell) into a Gradio interface. Run this cell to generate a link for the interface.
#Please don't click on the link already shown in the output below. Run the code to generate a new link.
import gradio as gr
import langchain
from langchain_community.document_loaders import TextLoader, UnstructuredWordDocumentLoader
import fitz  
from PIL import Image
import io
import pytesseract
from langchain.docstore.document import Document 
import os
import dotenv
from dotenv import load_dotenv
import asyncio
import shutil
import chromadb
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
import google.generativeai as genai
import time
import json
from langchain_community.embeddings import HuggingFaceEmbeddings 


load_dotenv()

google_api_key = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=google_api_key)

async def generate_metadata_from_file(input_file_path):
    if not input_file_path:
        return "Please upload a file first."

    print(f"--- Starting Metadata Generation for: {os.path.basename(input_file_path)} ---")
    
    try:
        _ , extension = os.path.splitext(input_file_path)
        file_type = extension.lower()

        if file_type == ".txt":
            loader = TextLoader(input_file_path)
            documents = loader.load()
        elif file_type in [".docx", ".doc"]:
            loader = UnstructuredWordDocumentLoader(input_file_path)
            documents = loader.load()
        elif file_type == ".pdf":
            pdf_document = fitz.open(input_file_path)
            all_pages_text = []
            print(f"  -> PDF has {len(pdf_document)} pages. Starting extraction...")
            
            for page_num in range(len(pdf_document)):
                page = pdf_document.load_page(page_num)
                page_text = page.get_text()
                
                image_list = page.get_images(full=True)
                ocr_fallback_threshold = 200
                if image_list and len(page_text) < ocr_fallback_threshold:
                    print(f"    - Page {page_num + 1}: Found {len(image_list)} image(s). Performing OCR.")
                    for img_index, img in enumerate(image_list):
                        xref = img[0]
                        try:
                            base_image = pdf_document.extract_image(xref)
                            image_bytes = base_image["image"]
                            image = Image.open(io.BytesIO(image_bytes))
                            ocr_text = pytesseract.image_to_string(image)
                            page_text += "\n" + ocr_text
                        except Exception as e:
                            print(f"      - Warning: Could not process image {img_index+1} on page {page_num+1}. Error: {e}")
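                else:
                    # Reached when the page has no images OR already has at least ocr_fallback_threshold characters of extractable text.
                    print(f"    - Page {page_num + 1}: No images found. Direct text extraction only.")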

                all_pages_text.append(page_text)
                
            pdf_document.close()
            full_pdf_text = "\n\n".join(all_pages_text)
            print(full_pdf_text)
            documents = [Document(page_content=full_pdf_text, metadata={"source": os.path.basename(input_file_path)})]
        else:
            return f"Unsupported file type: {file_type}. Please upload a .pdf, .docx, or .txt file."
        content = documents[0].page_content
        word_list = content.split()
        word_count = len(word_list)
        
        print("  -> Document loaded successfully.")

    except Exception as e:
        return f"Error during file loading: {e}"

    # 2. Split documents into chunks
    print("[2/7] Splitting text into chunks...")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1250, chunk_overlap=250)
    docs = text_splitter.split_documents(documents)
    print(f"  -> Document split into {len(docs)} chunks.")

    # 3. Create embeddings and vector store
    print("[3/7] Creating embeddings and vector store...")
    embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    db = Chroma.from_documents(docs, embedding_model)
    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})
    print("  -> Vector store created.")

    # 4. Asynchronously summarize chunks
    print(f"[4/7] Summarizing {len(docs)} chunks concurrently (limit: 50)...")
    llm_model_async = genai.GenerativeModel("gemini-1.5-flash") # Updated model
    concurrency_limit = 50
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def summarize_chunk_with_sdk(model, chunk_text, chunk_number, semaphore):
        async with semaphore:
            prompt = f"Provide a very concise 1-2 sentence summary of the following text excerpt:\n\n---{chunk_text}\n---. Any data in the excerpt which is relevant for metadata generation like file name, author name, title etc should be included in the summary as the summary will be later used for metadata generation. Also give the word count of the chunk after the summary. The word count should be given after the summary as: Word Count="
            try:
                response = await model.generate_content_async(prompt)
                return response.text.strip()
            except Exception as e:
                print(f"  -> Error summarizing chunk {chunk_number}: {e}")
                return f"Error summarizing chunk {chunk_number}."

    tasks = [summarize_chunk_with_sdk(llm_model_async, chunk.page_content, i + 1, semaphore) for i, chunk in enumerate(docs)]
    summaries = await asyncio.gather(*tasks)
    combined_summaries = "\n".join(summaries)
    print("  -> All chunks summarized.")

    # 5. Generate questions for retrieval
    print("[5/7] Generating retriever questions...")
    llm_model = genai.GenerativeModel("gemini-1.5-flash") # Updated model
    prompt_for_questions = f"You have been given a sequence of summaries combined which are basically the summaries of chunks of a large document. You have to give me three questions that can be answered by the document. These questions when used in RAG should be able to retrieve chunks of the pdf which are relevant for semantically rich metadata generation. The combined sequence of summaries is: {combined_summaries}. As an output only give the questions."
    question_response = llm_model.generate_content(prompt_for_questions)
    query_for_retriever = question_response.text
    print(f"  -> Generated questions:\n{query_for_retriever}")
    
    # 6. Retrieve relevant context
    print("[6/7] Retrieving relevant context...")
    relevant_docs = retriever.invoke(query_for_retriever)
    retrieved_context = "\n".join([doc.page_content for doc in relevant_docs])
    print("  -> Context retrieved.")

    # 7. Generate final metadata JSON
    print("[7/7] Generating final metadata JSON...")
    final_prompt = f"""
    You are an expert document analyst and metadata generation specialist. Your task is to synthesize information from two distinct sources: a high-level summary of a document and specific, relevant excerpts from the original text.

    Based on ALL the information provided below, generate a single, clean JSON object containing the document's metadata. The metadata should also have a summary of the document. Also you've been given that the word count of the document is: {word_count}. Mention this word count as it is in the metadata.

    **IMPORTANT RULES:**
    1.  Your entire response MUST be only the JSON object, with no introductory text, explanations, or any characters before or after the opening and closing curly braces (`{{` and `}}`).
    2.  The JSON object must strictly adhere to the specified structure and keys.

    ---
    **INPUT SOURCE 1: OVERALL DOCUMENT SUMMARY**
    (This provides the high-level context and gist of the entire document. This is basically a sequence of summaries of all the chunks of the document. Each summary also contains the word count of its chunk. These word counts are present as: Word Count= . Sum all these word counts and give that as the total word count of the pdf in its metadata.)

    {combined_summaries}
    ---
    **INPUT SOURCE 2: RELEVANT EXCERPTS FROM ORIGINAL TEXT**
    (This provides specific, detailed evidence and facts.)

    {retrieved_context}
    ---"""

    metadata_response = llm_model.generate_content(final_prompt)
    metadata_text = metadata_response.text.strip()
    
    if metadata_text.startswith("```json"):
        metadata_text = metadata_text[7:]
    if metadata_text.endswith("```"):
        metadata_text = metadata_text[:-3]

    print("--- Metadata Generation Complete ---")
    
    return metadata_text



def create_gradio_interface():
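    def sync_wrapper(file):
        # The click handler has a single gr.JSON output, so always return one JSON-serializable value.
        if file is None:
            return {"message": "Please upload a file to begin."}
        
        # Depending on the Gradio version, the upload arrives as a path string or as an object with a .name attribute.
        input_path = getattr(file, "name", file)
        
        result = asyncio.run(generate_metadata_from_file(input_path))
        
        try:
            json_result = json.loads(result)
            return json_result
        except (json.JSONDecodeError, TypeError):
            # Error messages and non-JSON model output are wrapped in a dict (a set is not JSON-serializable).
            return {"message": result}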

    with gr.Blocks(theme=gr.themes.Soft(), title="Automated Metadata Extractor") as demo:
        gr.Markdown(
            """
            # 📄 Automated Metadata Extractor
            Upload a document (`.pdf`, `.docx`, `.txt`) to automatically extract and generate its metadata.
            The process involves text extraction, OCR for images in PDFs, chunking, summarization, and AI-powered metadata synthesis.
            """
        )
        
        with gr.Row():
            with gr.Column(scale=1):
                file_input = gr.File(label="Upload Document", file_types=[".pdf", ".docx", ".txt"])
                submit_button = gr.Button("Generate Metadata", variant="primary")
                
            with gr.Column(scale=2):
                json_output = gr.JSON(label="Extracted Metadata")
        
        submit_button.click(
            fn=sync_wrapper,
            inputs=file_input,
            outputs=json_output
        )
        

    return demo

if __name__ == "__main__":
    app = create_gradio_interface()
    app.launch(share=True)

* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://a3cd146e8c91c87d76.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


--- Starting Metadata Generation for: Machine Learning Book.pdf ---
  -> PDF has 64 pages. Starting extraction...
    - Page 1: Found 3 image(s). Performing OCR.
    - Page 2: Found 1 image(s). Performing OCR.
    - Page 3: No images found. Direct text extraction only.
    - Page 4: No images found. Direct text extraction only.
    - Page 5: No images found. Direct text extraction only.
    - Page 6: No images found. Direct text extraction only.
    - Page 7: No images found. Direct text extraction only.
    - Page 8: No images found. Direct text extraction only.
    - Page 9: No images found. Direct text extraction only.
    - Page 10: No images found. Direct text extraction only.
    - Page 11: No images found. Direct text extraction only.
    - Page 12: No images found. Direct text extraction only.
    - Page 13: No images found. Direct text extraction only.
    - Page 14: No images found. Direct text extraction only.
    - Page 15: No images found. Direct text extraction only.
    



  -> Vector store created.
[4/7] Summarizing 149 chunks concurrently (limit: 50)...
  -> All chunks summarized.
[5/7] Generating retriever questions...
  -> Generated questions:
1. What are the key differences between batch learning and online learning in machine learning, and what are the advantages and disadvantages of each approach?  Provide examples of situations where one approach might be preferred over the other.

2.  Describe the process of building a machine learning model to predict life satisfaction based on GDP per capita, including data preparation, model selection, training, and evaluation. Discuss potential challenges and how to mitigate them, such as handling missing data, addressing overfitting, and ensuring the representativeness of the training data.

3. Explain the concept of overfitting and underfitting in machine learning.  Describe how to identify these issues during model development, and discuss the various techniques used to prevent or mitigate them.  Provide 