### 🔹 **Notebook Overview**

This notebook extracts structured, clause-level content from a **DOCX document** using a **GPT-4o deployment on Azure OpenAI**. It addresses a common problem in document chunking: when a clause or section under a single heading is split across multiple chunks, a retrieval system, even one using hybrid search, may return only some of those chunks. The language model then sees only part of the clause and may hallucinate or give an incomplete answer.

### ✅ **Problems This Notebook Solves**

- **Ensures Cohesive Chunking:**  
  Prevents a single clause or heading from being split into unrelated pieces, ensuring that the entire clause is treated as a unified segment.

- **Improves Retrieval Accuracy:**  
  Because each clause is stored as a single chunk, any retrieved chunk carries the full text of its clause, which reduces the risk of hallucinations caused by missing context.

- **Extracts a Structured Overview:**  
  Organizes text into a structured JSON format that captures logical sections and cross-references for easier downstream processing.

- **Converts DOCX to TXT:**  
  Converts the DOCX file to plain text with LibreOffice so that the text can be segmented clause by clause.
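To see why fixed-size chunking causes the problem described above, here is a minimal sketch on a toy two-clause document. The sample text, the 60-character chunk size, and both helper functions are assumptions made for illustration; the notebook itself locates clause boundaries from the headers that GPT-4o returns rather than from a fixed number pattern.

```python
import re

# Toy document with two numbered clauses (invented for illustration).
DOC = (
    "1 Definitions\n"
    "In this agreement, 'Supplier' means the party delivering the goods.\n"
    "2 Termination\n"
    "Either party may terminate with 30 days' notice. "
    "Termination does not affect accrued rights under clause 1.\n"
)


def fixed_size_chunks(text: str, size: int) -> list:
    """Split text every `size` characters, ignoring clause boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def clause_chunks(text: str) -> list:
    """Split text at lines that begin with a clause number, keeping each clause whole."""
    # Zero-width split just before every line that starts with digits and whitespace.
    parts = re.split(r"(?m)^(?=\d+\s)", text)
    return [part.strip() for part in parts if part.strip()]


naive = fixed_size_chunks(DOC, 60)
whole = clause_chunks(DOC)

# With fixed-size chunks, no single chunk holds all of clause 2.
print(any(whole[1] in chunk for chunk in naive))
print(len(naive), "fixed-size chunks vs", len(whole), "clause chunks")
```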

---

### 📌 **Required Parameters**

1. **`input_path`** → Path to the input DOCX file 📄  
2. **`output_dir`** → Directory where the extracted outputs will be saved 📂  
3. **`api_key`** → Azure OpenAI API Key 🔑  
4. **`endpoint`** → Azure OpenAI API Endpoint 🌐  
5. **`deployment_name`** → Name of the GPT model deployment 🚀  
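If you would rather not paste credentials into a notebook cell, one option is to read the five parameters from environment variables and fail fast when any are missing. This is a minimal sketch: the `NotebookConfig` class, the `load_config` helper, and the environment variable names are all assumptions for illustration, not something the notebook defines.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class NotebookConfig:
    input_path: str
    output_dir: str
    api_key: str
    endpoint: str
    deployment_name: str


# Hypothetical environment variable names; rename them to suit your setup.
ENV_NAMES = {
    "input_path": "CLAUSE_INPUT_PATH",
    "output_dir": "CLAUSE_OUTPUT_DIR",
    "api_key": "AZURE_OPENAI_API_KEY",
    "endpoint": "AZURE_OPENAI_ENDPOINT",
    "deployment_name": "AZURE_OPENAI_DEPLOYMENT",
}


def load_config(env=os.environ) -> NotebookConfig:
    """Build the config from a mapping of settings, raising if any value is empty."""
    values = {f.name: env.get(ENV_NAMES[f.name], "").strip() for f in fields(NotebookConfig)}
    missing = [ENV_NAMES[name] for name, value in values.items() if not value]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")
    return NotebookConfig(**values)
```

The resulting fields could then be assigned to `Input_File_Path`, `Output_Dir`, `API_Key`, `Endpoint`, and `Deployment_Name` in the inputs cell.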

---

### 🔄 **Processing Steps**

1️⃣ **Convert DOCX to TXT** (📄 ➡️ 📜)  
2️⃣ **Extract the Structured Overview:**  
   - Segment the document into logical sections while keeping entire clauses intact. 📑  
3️⃣ **Parse and Save the Overview/Segments into a JSON File** 🗄️  
4️⃣ **Extract Each Clause with Its Cross-References:**  
   - Ensure that all parts of a clause are grouped together for coherent retrieval. 🔍  
5️⃣ **Save the Final Structured Data into JSON** ✅  
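For orientation, the sketch below shows the shape of the two JSON files that steps 3 and 5 produce. The field names follow the notebook's `Clause` schema and the dictionary built in `extract_clauses`; the clause headers, subsection numbers, and placeholder content are invented for illustration.

```python
import json

# Shape of toc.json (step 3): one entry per clause. All values are invented.
toc = {
    "clauses": [
        {
            "clause_header": "1 Definitions",
            "subsection_number": ["1.1", "1.2"],
            "reference_clause_header": [],
        },
        {
            "clause_header": "2 Termination",
            "subsection_number": ["2.1"],
            "reference_clause_header": ["1 Definitions"],
        },
    ]
}

# Shape of clause_chunks.json (step 5): one "Chunk_N" entry per clause, in ToC order.
clause_chunks = {
    f"Chunk_{i + 1}": {
        "clause_header": clause["clause_header"],
        "content": f"<full text of clause {clause['clause_header']!r}>",
        "reference_clauses": clause["reference_clause_header"],
    }
    for i, clause in enumerate(toc["clauses"])
}

print(json.dumps(clause_chunks, indent=4, ensure_ascii=False))
```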

---

By maintaining the complete structure of each clause, this notebook reduces the risk of the language model hallucinating due to missing context, leading to more accurate and reliable responses during retrieval.
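One way a downstream retriever might use the stored cross-references is to pull in the referenced clauses alongside the clause that matched the query. The sketch below assumes the `clause_chunks.json` layout produced by this notebook; `expand_with_references` and the sample data are hypothetical and not part of the notebook.

```python
def expand_with_references(chunks: dict, hit_key: str) -> list:
    """Return the matched chunk followed by every chunk it cross-references."""
    # Index chunks by header so references can be resolved by name.
    by_header = {chunk["clause_header"]: chunk for chunk in chunks.values()}
    hit = chunks[hit_key]
    context = [hit]
    for header in hit.get("reference_clauses", []):
        referenced = by_header.get(header)
        # Skip unresolved references, self-references, and duplicates.
        if referenced is not None and all(referenced is not c for c in context):
            context.append(referenced)
    return context


# Invented sample data in the same layout as clause_chunks.json.
chunks = {
    "Chunk_1": {
        "clause_header": "1 Definitions",
        "content": "'Supplier' means the party delivering the goods.",
        "reference_clauses": [],
    },
    "Chunk_2": {
        "clause_header": "2 Termination",
        "content": "Either party may terminate, subject to clause 1.",
        "reference_clauses": ["1 Definitions", "9 Missing Clause"],
    },
}

context = expand_with_references(chunks, "Chunk_2")
print([c["clause_header"] for c in context])  # ['2 Termination', '1 Definitions']
```

References that do not resolve to a known header are skipped here; a stricter pipeline might log them instead.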

# Imports

In [None]:
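# Install required libraries (pydantic is installed as a dependency of openai)
%pip install -q openai
# LibreOffice provides the `soffice` CLI used for the DOCX-to-TXT conversion; -y skips the confirmation prompt
!apt-get install -y -qq libreoffice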

# Inputs

In [10]:
import os
import openai
import asyncio
import json
import re
from pydantic import BaseModel
from typing import List

# 📌 User Inputs
Input_File_Path = "/content/Sample.docx"  # @param {type:"string", placeholder:"Enter document path"}
Output_Dir = "/content/clause_chunks_sample_doc"  # @param {type:"string", placeholder:"Enter output directory"}
API_Key    = "" # @param {type:"string", placeholder:"Enter Azure OpenAI API Key"}
Endpoint   = "" # @param {type:"string", placeholder:"Enter Azure OpenAI API Endpoint"}
Deployment_Name = "" # @param {type:"string", placeholder:"Enter deployment name"}

# 🎯 Set up the Azure OpenAI client
oc = openai.AsyncAzureOpenAI(
    azure_endpoint=Endpoint,
    azure_deployment=Deployment_Name,
    api_key=API_Key,
    api_version="2024-08-01-preview"
)

# 📄 Convert DOCX to TXT
async def convert_docx_to_txt(Input_File_Path: str, Output_Dir: str) -> str:
    print("\n📂 Converting DOCX to TXT...")
    os.makedirs(Output_Dir, exist_ok=True)
    # Pass arguments as a list instead of a shell string so paths with spaces or quotes are safe
    process = await asyncio.create_subprocess_exec(
        "soffice", "--headless", "--convert-to", "txt:Text", Input_File_Path, "--outdir", Output_Dir,
        stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
    )
    stdout, stderr = await process.communicate()

    if process.returncode != 0:
        print("❌ Error during conversion!")
        raise RuntimeError(f"Error: {stderr.decode().strip()}")

    txt_path = os.path.join(Output_Dir, os.path.splitext(os.path.basename(Input_File_Path))[0] + ".txt")
    # Guard against a conversion that exits cleanly but writes no output file
    if not os.path.exists(txt_path):
        raise FileNotFoundError(f"Expected converted file not found: {txt_path}")
    print(f"✅ Conversion complete! TXT file saved at: {txt_path}")
    return txt_path

# 📜 Load document content
def load_document(file_path):
    print("\n📖 Loading document content...")
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    print("✅ Document loaded successfully!")
    return content

# 🏛️ Define Pydantic schema for structured output
class Clause(BaseModel):
    clause_header: str
    subsection_number: List[str]
    reference_clause_header: List[str]

class TableOfContents(BaseModel):
    clauses: List[Clause]

# 📑 Extract Table of Contents (ToC) using GPT
async def extract_toc(text):
    print("\n🤖 Extracting Table of Contents using GPT-4o...")

    system_prompt = (
        "You are an advanced Table of Contents (ToC) extractor AI. "
        "Your task is to analyze the provided legal document and return a JSON object with a 'clauses' list, "
        "where each entry follows this schema: "
        "{'clause_header': str, 'subsection_number': List[str], 'reference_clause_header': List[str]}. "
        "Extract all clauses, sections, subsections, schedules, annexes, and appendices. "
        "For each clause, list any other clauses it refers to in the 'reference_clause_header' field, "
        "using their clause_header values. "
        "If you are unable to extract the ToC, return an empty 'clauses' list. "
        "Ensure accuracy and completeness in the ToC extraction. "
        "## Important Note: Never extract the ToC from the document's own Table of Contents page. "
        "Instead, extract it from the main content of the document. "
        "Return each clause_header with the exact wording, grammar, and hierarchy used in the document, even if it is wrong."
    )

    response = await oc.beta.chat.completions.parse(
        model=Deployment_Name,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Extract the Table of Contents from the following document:\n\n{text}"}
        ],
        response_format=TableOfContents,
        max_tokens=3000,
        temperature=0.1,
        top_p=1.0
    )

    print("✅ ToC extraction complete!")
    return response.choices[0].message.content

# 🔍 Extract clause-wise chunks
def extract_clauses(document, toc):
    print("\n📜 Extracting clauses from the document...")
    clause_chunks = {}
    last_end = 0

    for i, clause in enumerate(toc["clauses"]):
        clause_header = clause["clause_header"].strip()
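        # Split a leading clause number ("12 Termination") from the title, if present
        match_header = re.match(r'^(\d+)\s+(.*)$', clause_header)
        clause_number, clause_title = (match_header.groups() if match_header else ("", clause_header))

        # Exact pattern: the header as returned by the model, at the start of a line
        exact_pattern = rf"(?m)^\s*{re.escape(clause_header)}\s*[-:]*"
        # Fallback pattern: tolerate different whitespace between the clause number and the title
        fallback_pattern = rf"(?m)^\s*{re.escape(clause_number)}\s*{re.escape(clause_title)}[-:]*" if clause_number else exact_pattern
        # A clause ends where the next header begins, or at the end of the document for the last clause
        end_pattern = r"\Z" if i + 1 == len(toc["clauses"]) else rf"(?=^\s*{re.escape(toc['clauses'][i+1]['clause_header'])})"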

        exact_match = re.compile(rf"{exact_pattern}(.*?){end_pattern}", re.MULTILINE | re.DOTALL | re.IGNORECASE)
        fallback_match = re.compile(rf"{fallback_pattern}(.*?){end_pattern}", re.MULTILINE | re.DOTALL | re.IGNORECASE)

        matches = exact_match.finditer(document, last_end)
        best_match = max(matches, key=lambda m: len(m.group(1).strip()), default=None)

        if not best_match:
            matches = fallback_match.finditer(document, last_end)
            best_match = max(matches, key=lambda m: len(m.group(1).strip()), default=None)

        clause_chunks[f"Chunk_{i+1}"] = {
            "clause_header": clause_header,
            "content": best_match.group(1).strip() if best_match else "[No content found]",
            "reference_clauses": clause.get("reference_clause_header", [])
        }

        if best_match:
            last_end = best_match.end()

    print("✅ Clause extraction complete!")
    return clause_chunks

# 📝 Save JSON file
def save_to_json(data, output_path):
    json_data = json.loads(data) if isinstance(data, str) else data
    with open(output_path, 'w', encoding='utf-8') as file:
        json.dump(json_data, file, indent=4, ensure_ascii=False)
    print(f"✅ JSON saved at: {output_path}")

# 🚀 Main execution
async def main():
    try:
        txt_path = await convert_docx_to_txt(Input_File_Path, Output_Dir)
        document_text = load_document(txt_path)

        toc = await extract_toc(document_text)
        save_to_json(toc, os.path.join(Output_Dir, "toc.json"))

        toc = json.loads(toc)
        clause_chunks = extract_clauses(document_text, toc)
        save_to_json(clause_chunks, os.path.join(Output_Dir, "clause_chunks.json"))

        print("\n🎉 **Processing complete! All JSON files are saved successfully.**")
    except Exception as e:
        print(f"❌ Error: {str(e)}")

await main()


📂 Converting DOCX to TXT...
✅ Conversion complete! TXT file saved at: /content/clause_chunks_sample_doc/Sample.txt

📖 Loading document content...
✅ Document loaded successfully!

🤖 Extracting Table of Contents using GPT-4o...
✅ ToC extraction complete!
✅ JSON saved at: /content/clause_chunks_sample_doc/toc.json

📜 Extracting clauses from the document...
✅ Clause extraction complete!
✅ JSON saved at: /content/clause_chunks_sample_doc/clause_chunks.json

🎉 **Processing complete! All JSON files are saved successfully.**
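Because `extract_clauses` writes the placeholder `[No content found]` when neither regex matches a header, a quick check of the saved chunks can show which clauses need manual review. This is a minimal sketch: `find_unmatched_clauses` is a hypothetical helper, and the sample dictionary stands in for the contents of `clause_chunks.json`.

```python
PLACEHOLDER = "[No content found]"  # sentinel written by extract_clauses


def find_unmatched_clauses(chunks: dict) -> list:
    """Return the headers of clauses whose text could not be located in the document."""
    return [
        chunk["clause_header"]
        for chunk in chunks.values()
        if chunk["content"] == PLACEHOLDER
    ]


# Invented sample in the same layout as clause_chunks.json.
sample = {
    "Chunk_1": {"clause_header": "1 Definitions", "content": "Some text.", "reference_clauses": []},
    "Chunk_2": {"clause_header": "2 Termination", "content": PLACEHOLDER, "reference_clauses": []},
}

print(find_unmatched_clauses(sample))  # ['2 Termination']
```

On real output, the dictionary would come from loading `clause_chunks.json` with `json.load`.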
