<a href="https://colab.research.google.com/github/Ragunathan223/Machine_Learning/blob/main/policy_docx.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# ✅ Install required packages
!pip install -q pandas openpyxl

**1. Excel File → Multi-Sheet Parser (Unstructured)**

In [2]:
import pandas as pd
import os

# ✅ Path to your Excel file
EXCEL_FILE = "/content/FLFX_PSM_Golden Configuration-V2_04_09 3.xlsx"

# ✅ Utility: Clean and normalize the DataFrame
def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    # Remove columns that are entirely NaN
    df = df.dropna(axis=1, how='all')
    # Drop rows that are entirely NaN
    df = df.dropna(axis=0, how='all')
    # Fill NaN with empty string and strip whitespace
    df = df.fillna("").astype(str).map(str.strip)
    return df

# ✅ Utility: Read all sheets and return a dictionary
def read_all_sheets(excel_path: str) -> dict:
    print(f"📄 Reading Excel file: {excel_path}")
    xls = pd.ExcelFile(excel_path, engine='openpyxl')
    sheets = {}
    for sheet_name in xls.sheet_names:
        print(f"🔍 Processing sheet: {sheet_name}")
        df = xls.parse(sheet_name)
        df_cleaned = clean_dataframe(df)
        sheets[sheet_name] = df_cleaned
    return sheets

# ✅ Utility: Flatten a DataFrame to text rows (list of strings)
def flatten_dataframe_to_text_rows(df: pd.DataFrame) -> list:
    rows = []
    for idx, row in df.iterrows():
        row_text = " | ".join([str(val) for val in row if val])
        if row_text.strip():
            rows.append(row_text)
    return rows

# ✅ Run Parser
sheet_data = read_all_sheets(EXCEL_FILE)

# ✅ Flatten each sheet into row-wise text for future chunking
sheet_texts = {}
for sheet_name, df in sheet_data.items():
    sheet_texts[sheet_name] = flatten_dataframe_to_text_rows(df)

# ✅ Preview
for sheet, rows in sheet_texts.items():
    print(f"\n📘 Sheet: {sheet} | Rows extracted: {len(rows)}")
    print("🔹 Sample row:", rows[0] if rows else "No data")


📄 Reading Excel file: /content/FLFX_PSM_Golden Configuration-V2_04_09 3.xlsx
🔍 Processing sheet: Table of Content
🔍 Processing sheet: Credentialing agencies
🔍 Processing sheet: Summary_status
🔍 Processing sheet: Finalized Taxonomies_11_1
🔍 Processing sheet: Legacy_Taxonomies
🔍 Processing sheet: System options-Approvals
🔍 Processing sheet: SystemOptions-Rolemgmnt
🔍 Processing sheet: Sensitive Fields
🔍 Processing sheet: System Options-Screening
🔍 Processing sheet: System Options - Data Change
🔍 Processing sheet: System Options - Rules
🔍 Processing sheet: System Options - Revalidation
🔍 Processing sheet: System Options - Licensure
🔍 Processing sheet: System Options - Site Visit
🔍 Processing sheet: System Options -  Drop Downs
🔍 Processing sheet:  Dropdowns Trigger
🔍 Processing sheet: Error Messages Externalization
🔍 Processing sheet: System Options - Auto Archive
🔍 Processing sheet: System Options - Security Polic
🔍 Processing sheet: System Options - User Deactivat
🔍 Processing sheet: Sys

**2. Convert Sheets → Chunks (Text + Metadata)**

In [3]:
# ✅ Install transformers if not already installed
!pip install -q transformers

from transformers import AutoTokenizer

# ✅ Choose a HuggingFace tokenizer
# Use small tokenizer to speed up (replace later with larger model)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ✅ Define max tokens per chunk
MAX_TOKENS = 256

# ✅ Chunking utility: Token-aware chunking
def chunk_text(text, max_tokens=MAX_TOKENS):
    input_tokens = tokenizer.tokenize(text)
    chunks = []

    for i in range(0, len(input_tokens), max_tokens):
        chunk_tokens = input_tokens[i:i + max_tokens]
        chunk_text = tokenizer.convert_tokens_to_string(chunk_tokens)
        chunks.append(chunk_text.strip())

    return chunks

# ✅ Create nodes with chunked content and metadata
def create_nodes(sheet_texts: dict) -> list:
    nodes = []
    for sheet_name, rows in sheet_texts.items():
        for row_idx, row_text in enumerate(rows):
            if not row_text.strip():
                continue
            row_chunks = chunk_text(row_text)
            for chunk_idx, chunk in enumerate(row_chunks):
                node = {
                    "sheet": sheet_name,
                    "row_number": row_idx,
                    "chunk_index": chunk_idx,
                    "text_chunk": chunk
                }
                nodes.append(node)
    return nodes

# ✅ Generate all nodes from sheet_texts
nodes = create_nodes(sheet_texts)

# ✅ Preview a few nodes
print(f"✅ Total nodes created: {len(nodes)}")
print("\n📌 Sample node:\n", nodes[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Token indices sequence length is longer than the specified maximum sequence length for this model (2036 > 512). Running this sequence through the model will result in indexing errors


✅ Total nodes created: 12357

📌 Sample node:
 {'sheet': 'Table of Content', 'row_number': 0, 'chunk_index': 0, 'text_chunk': '# | worksheet | sub - section in worksheet'}


**3. Convert to Nodes**

In [4]:
# ✅ Required imports (run this even if already imported)
import hashlib
from transformers import AutoTokenizer

# ✅ Load tokenizer (same as Step 2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ✅ Chunking function (token-aware)
def chunk_text(text, max_tokens=256):
    input_tokens = tokenizer.tokenize(text)
    chunks = []
    for i in range(0, len(input_tokens), max_tokens):
        chunk_tokens = input_tokens[i:i + max_tokens]
        chunk_text = tokenizer.convert_tokens_to_string(chunk_tokens)
        chunks.append(chunk_text.strip())
    return chunks

# ✅ Generate unique, stable node ID (useful for vector DB indexing)
def generate_node_id(sheet_name, row_number, chunk_index):
    key = f"{sheet_name}_{row_number}_{chunk_index}"
    return hashlib.md5(key.encode()).hexdigest()

# ✅ Main function: Convert all sheets into structured nodes
def convert_to_nodes(sheet_texts, max_tokens=256):
    all_nodes = []

    for sheet_name, rows in sheet_texts.items():
        for row_idx, row_text in enumerate(rows):
            if not row_text.strip():
                continue  # skip empty rows

            row_chunks = chunk_text(row_text, max_tokens=max_tokens)

            for chunk_idx, chunk in enumerate(row_chunks):
                node_id = generate_node_id(sheet_name, row_idx, chunk_idx)

                node = {
                    "id": node_id,
                    "sheet_name": sheet_name,
                    "row_number": row_idx,
                    "chunk_index": chunk_idx,
                    "text": chunk,
                    "source_text": row_text,  # full original row
                }

                all_nodes.append(node)

    return all_nodes

# ✅ Call this with `sheet_texts` from Step 1
nodes = convert_to_nodes(sheet_texts, max_tokens=256)

# ✅ Preview results
print(f"🧠 Total nodes generated: {len(nodes)}")
print("📄 Sample node:\n", nodes[0])


Token indices sequence length is longer than the specified maximum sequence length for this model (2036 > 512). Running this sequence through the model will result in indexing errors


🧠 Total nodes generated: 12357
📄 Sample node:
 {'id': 'a45e7399b40bcf23da22e666a67dbb50', 'sheet_name': 'Table of Content', 'row_number': 0, 'chunk_index': 0, 'text': '# | worksheet | sub - section in worksheet', 'source_text': '# | Worksheet | Sub-Section in Worksheet'}


In [5]:
import torch
from transformers import pipeline

# ✅ Check if GPU is available
device = 0 if torch.cuda.is_available() else -1

# ✅ Load summarization model on GPU
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)

# ✅ Confirm device
print("🔥 Using device:", "GPU" if device == 0 else "CPU")


Device set to use cuda:0


🔥 Using device: GPU


**4. Summarize Nodes using HuggingFace**

In [6]:
# ✅ Install required libraries
!pip install -q transformers

import torch
from transformers import pipeline, AutoTokenizer
from tqdm import tqdm

# ✅ Load model and tokenizer
model_name = "google/flan-t5-large"
summarizer = pipeline("summarization", model=model_name, device=0 if torch.cuda.is_available() else -1)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# ✅ Clean summarize function with dynamic max/min
def summarize_node(node):
    text = node["text"].strip()
    input_tokens = tokenizer.tokenize(text)
    token_count = len(input_tokens)

    try:
        # Short text → skip summarization
        if token_count < 30:
            node["summary"] = text
            return node

        # Dynamically adjust max/min summary length
        max_length = min(50)
        min_length = max(20)

        result = summarizer(
            text,
            max_length=max_length,
            min_length=min_length,
            do_sample=False
        )
        node["summary"] = result[0]["summary_text"]

    except Exception as e:
        node["summary"] = text
        node["error"] = str(e)

    return node

# ✅ Batch summarization
def summarize_all_nodes(nodes):
    summarized_nodes = []
    print("🔄 Summarizing nodes (auto length)...")

    for node in tqdm(nodes):
        summarized_nodes.append(summarize_node(node))

    return summarized_nodes

# ✅ Run
summarized_nodes = summarize_all_nodes(nodes)

# ✅ Preview
for i in range(3):
    print(f"\n📄 Node {i+1}")
    print(f"Text: {summarized_nodes[i]['text']}")
    print(f"Summary: {summarized_nodes[i]['summary']}")

Device set to use cuda:0


🔄 Summarizing nodes (auto length)...


100%|██████████| 12357/12357 [00:01<00:00, 6490.14it/s]


📄 Node 1
Text: # | worksheet | sub - section in worksheet
Summary: # | worksheet | sub - section in worksheet

📄 Node 2
Text: 1 | status
Summary: 1 | status

📄 Node 3
Text: 2 | finalized taxonomies
Summary: 2 | finalized taxonomies





5. **Embedding + Store in ChromaDB**

In [7]:
# ✅ Install required libraries
!pip install -q chromadb sentence-transformers

# ✅ Imports
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import time

# ✅ Load SentenceTransformer model from HuggingFace
embed_model = SentenceTransformer("all-MiniLM-L6-v2")  # You can replace with another model if needed

# ✅ Initialize ChromaDB
chroma_client = chromadb.Client(Settings(
    persist_directory="./chroma_db",  # Optional: Persist to disk
    anonymized_telemetry=False
))

# ✅ Create (or get) ChromaDB collection
collection = chroma_client.get_or_create_collection(name="florida_policy_nodes")

# ✅ Function: Store summarized nodes into ChromaDB
def store_summarized_nodes_in_chroma(nodes):
    stored = 0
    failed = 0
    batch_size = 32  # adjust based on memory limits

    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i+batch_size]

        ids = []
        docs = []
        embeds = []
        metadatas = []

        for node in batch:
            try:
                node_id = node["id"]
                summary_text = node["summary"]

                embedding = embed_model.encode(summary_text).tolist()

                metadata = {
                    "sheet": node["sheet_name"],
                    "row_number": node["row_number"],
                    "chunk_index": node["chunk_index"]
                }

                ids.append(node_id)
                docs.append(summary_text)
                embeds.append(embedding)
                metadatas.append(metadata)

                stored += 1
            except Exception as e:
                print(f"⚠️ Failed to store node: {node.get('id', 'unknown')}. Error: {e}")
                failed += 1

        # Add batch to Chroma
        collection.add(
            ids=ids,
            documents=docs,
            embeddings=embeds,
            metadatas=metadatas
        )

    print(f"✅ Stored {stored} nodes in ChromaDB")
    if failed:
        print(f"⚠️ Failed nodes: {failed}")

# ✅ Run storage process
start_time = time.time()
store_summarized_nodes_in_chroma(summarized_nodes)
print(f"⏱️ Time taken: {round(time.time() - start_time, 2)} seconds")


#optional search and verify......

#------------------------------------------------------------------------------
# ✅ Count stored entries
print("📦 Total ChromaDB documents:", collection.count())

# ✅ Sample semantic search
results = collection.query(
    query_texts=["behavioral health site visit requirements"],
    n_results=3
)

print("\n🔍 Query Results:")
for i, doc in enumerate(results["documents"][0]):
    print(f"{i+1}. Summary: {doc}")
    print(f"   Metadata: {results['metadatas'][0][i]}")


  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Building wheel for pypika (pyproject.toml) ... [?25l[?25hdone


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

✅ Stored 12357 nodes in ChromaDB
⏱️ Time taken: 104.82 seconds
📦 Total ChromaDB documents: 12357


/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:02<00:00, 32.7MiB/s]



🔍 Query Results:
1. Summary: b. behavioral health | acceptable accrediting agencies
   Metadata: {'sheet': 'Credentialing agencies', 'chunk_index': 0, 'row_number': 9}
2. Summary: community behavioral health services | ravi | e | completed
   Metadata: {'row_number': 11, 'chunk_index': 0, 'sheet': 'ProviderTypeGCStatus'}
3. Summary: 5. 0 | community behavioral health services | * | * | * | * | * | *
   Metadata: {'chunk_index': 0, 'row_number': 2, 'sheet': 'Docs & Agreements'}


**6. Retrieve → Synthesize via OpenAI (LLM)**

In [19]:
!pip install -q google-generativeai chromadb

In [20]:
!pip install -q chromadb google-generativeai
import google.generativeai as genai

# ✅ Set your Gemini API key (store securely)
genai.configure(api_key="AIzaSyADbWLTtHPXFuGAE-5j45oSArtB9yClXQE")  # 🔐 Replace with env var for production


In [24]:
import google.generativeai as genai

genai.configure(api_key="AIzaSyADbWLTtHPXFuGAE-5j45oSArtB9yClXQE")  # ✅ Replace with yours

# ✅ List all available models that support generation
models = genai.list_models()
for m in models:
    if "generateContent" in m.supported_generation_methods:
        print(f"✅ {m.name} supports: generateContent")


✅ models/gemini-1.0-pro-vision-latest supports: generateContent
✅ models/gemini-pro-vision supports: generateContent
✅ models/gemini-1.5-pro-latest supports: generateContent
✅ models/gemini-1.5-pro-002 supports: generateContent
✅ models/gemini-1.5-pro supports: generateContent
✅ models/gemini-1.5-flash-latest supports: generateContent
✅ models/gemini-1.5-flash supports: generateContent
✅ models/gemini-1.5-flash-002 supports: generateContent
✅ models/gemini-1.5-flash-8b supports: generateContent
✅ models/gemini-1.5-flash-8b-001 supports: generateContent
✅ models/gemini-1.5-flash-8b-latest supports: generateContent
✅ models/gemini-2.5-pro-exp-03-25 supports: generateContent
✅ models/gemini-2.5-pro-preview-03-25 supports: generateContent
✅ models/gemini-2.5-flash-preview-04-17 supports: generateContent
✅ models/gemini-2.5-flash-preview-05-20 supports: generateContent
✅ models/gemini-2.5-flash supports: generateContent
✅ models/gemini-2.5-flash-preview-04-17-thinking supports: generateCont

In [21]:
import chromadb
from chromadb.config import Settings

# ✅ Reconnect to persistent ChromaDB (make sure no previous instance running)
chroma_client = chromadb.Client(Settings(persist_directory="./chroma_db", anonymized_telemetry=False))
collection = chroma_client.get_or_create_collection("florida_policy_nodes")

# ✅ Function: Retrieve top-k relevant summaries
def retrieve_context(query, top_k=10):
    results = collection.query(
        query_texts=[query],
        n_results=top_k
    )

    context_chunks = []
    for i in range(len(results['documents'][0])):
        chunk = results['documents'][0][i]
        meta = results['metadatas'][0][i]
        context = f"[Sheet: {meta['sheet']}, Row: {meta['row_number']}] {chunk}"
        context_chunks.append(context)

    return "\n".join(context_chunks)


In [38]:
# ✅ Load Gemini Pro model (faster variant)
model = genai.GenerativeModel("models/gemini-1.5-flash")

# ✅ Function: Generate comprehensive FLFX PSM Golden Configuration documentation
def synthesize_full_document(query, context):
    prompt = f"""
You are a senior healthcare systems architect and policy documentation specialist with expertise in Medicaid provider services management systems. You are tasked with creating **comprehensive technical and business documentation** for the **Florida Medicaid Provider Services Management (PSM) Golden Configuration System**.

🎯 **DOCUMENTATION SCOPE:**
{query}

📊 **SOURCE DATA CONTEXT:**
The following data has been extracted from the official FLFX PSM Golden Configuration Excel file containing 40 worksheets with detailed system configurations:

{context}

📋 **COMPLETE DOCUMENTATION REQUIREMENTS:**

## 1. EXECUTIVE SUMMARY & SYSTEM OVERVIEW
- **Business Purpose**: Define the strategic objectives of the Florida PSM system
- **System Architecture**: High-level technical architecture and component relationships
- **Regulatory Framework**: CMS compliance, state regulations, and audit requirements
- **Stakeholder Impact**: Benefits for providers, state agencies, and beneficiaries

## 2. PROVIDER TAXONOMY & CLASSIFICATION SYSTEM
- **Provider Type Hierarchy**: Complete breakdown of Individual, Entity, MCO, and Trading Partner categories
- **Specialty Code Mapping**: Detailed taxonomy codes with descriptions and business rules
- **Risk Level Framework**: Low/Medium/High risk classifications with specific criteria
- **Enrollment Type Matrix**: Application types and routing logic by provider category

## 3. SYSTEM CONFIGURATION SPECIFICATIONS

### 3.1 Core System Options (Document all 23 modules):
- **Approval Workflows**: Decision trees, routing rules, and authorization levels
- **Role Management**: User permissions, access controls, and security groups
- **Screening Protocols**: Background checks, fingerprinting, and verification procedures
- **Data Change Controls**: Audit trails, approval chains, and modification tracking
- **Business Rules Engine**: Validation logic, conditional requirements, and dependencies
- **Revalidation Cycles**: Periodic compliance verification and renewal processes
- **Licensure Management**: Professional credentials, CLIA, DEA, and specialty licenses
- **Site Visit Protocols**: Inspection procedures, scheduling, and compliance verification
- **Reference Data Management**: Dropdown configurations, triggers, and dependencies
- **Archive Policies**: Data retention, purging rules, and compliance requirements
- **Security Framework**: Authentication, authorization, encryption, and audit logging
- **User Lifecycle**: Account creation, deactivation, and access management
- **Payment Processing**: Fee structures, billing rules, and financial workflows
- **Profile Management**: Provider information, demographics, and contact data
- **Notification Engine**: Alerts, reminders, and communication workflows
- **Duplicate Prevention**: Matching algorithms, conflict resolution, and data quality
- **Affiliation Rules**: Provider relationships, network management, and hierarchies
- **External Integrations**: Third-party connections, APIs, and data exchange
- **Custom Configurations**: Flexible fields, forms, and state-specific adaptations
- **Enrollment Processing**: Application intake, routing, and decision workflows

### 3.2 Data Architecture & Management:
- **Sensitive Data Handling**: PII protection, field-level security, and access controls
- **Reference Data Systems**: Lookup tables, validation rules, and data dependencies
- **Integration Points**: External system connections and data synchronization
- **Error Handling**: Message externalization, troubleshooting, and resolution procedures

## 4. BUSINESS PROCESS DOCUMENTATION

### 4.1 Provider Enrollment Workflows:
- **Application Intake**: Forms, documentation requirements, and initial validation
- **Credentialing Process**: Verification steps, third-party validations, and approval criteria
- **Risk Assessment**: Scoring algorithms, screening requirements, and decision matrices
- **Approval Workflows**: Review processes, committee decisions, and final authorization
- **Onboarding**: System access, training requirements, and go-live procedures

### 4.2 Ongoing Operations:
- **Revalidation Cycles**: Periodic reviews, documentation updates, and compliance verification
- **Change Management**: Profile updates, role modifications, and system changes
- **Monitoring & Compliance**: Audit procedures, reporting requirements, and corrective actions
- **Issue Resolution**: Incident management, escalation procedures, and resolution tracking

## 5. REGULATORY COMPLIANCE & AUDIT FRAMEWORK
- **CMS Requirements**: Federal guidelines, reporting obligations, and compliance metrics
- **State Regulations**: Florida-specific rules, licensing requirements, and oversight procedures
- **HIPAA Compliance**: Privacy protections, security safeguards, and breach procedures
- **Audit Trails**: Logging requirements, data retention, and compliance reporting
- **Quality Assurance**: Monitoring procedures, performance metrics, and improvement processes

## 6. TECHNICAL SPECIFICATIONS

### 6.1 System Architecture:
- **Platform Requirements**: Hardware, software, and infrastructure specifications
- **Integration Architecture**: APIs, data formats, and communication protocols
- **Security Architecture**: Encryption, authentication, and access control mechanisms
- **Performance Requirements**: Response times, throughput, and scalability specifications

### 6.2 Data Management:
- **Database Design**: Schema structure, relationships, and optimization strategies
- **Data Quality**: Validation rules, cleansing procedures, and quality metrics
- **Backup & Recovery**: Disaster recovery plans, backup procedures, and restoration processes
- **Data Migration**: Conversion procedures, validation steps, and rollback plans

## 7. OPERATIONAL PROCEDURES

### 7.1 System Administration:
- **User Management**: Account provisioning, role assignments, and access reviews
- **System Monitoring**: Performance tracking, alert management, and issue escalation
- **Maintenance Procedures**: Updates, patches, and system optimization
- **Troubleshooting Guides**: Common issues, diagnostic procedures, and resolution steps

### 7.2 Business Operations:
- **Daily Procedures**: Routine tasks, monitoring activities, and quality checks
- **Exception Handling**: Error resolution, escalation procedures, and communication protocols
- **Reporting**: Standard reports, ad-hoc analysis, and compliance documentation
- **Training**: User education, system updates, and ongoing support

## 8. APPENDICES & REFERENCE MATERIALS
- **Data Dictionary**: Complete field definitions, validation rules, and business context
- **Configuration Tables**: Detailed parameter settings, options, and dependencies
- **Process Flow Diagrams**: Visual representations of key workflows and decision points
- **Compliance Matrix**: Regulatory requirements mapped to system features
- **Troubleshooting Reference**: Error codes, resolution procedures, and contact information

🔧 **OUTPUT FORMATTING REQUIREMENTS:**
- Use **Markdown formatting** with proper headers, subheaders, and bullet points
- Include **tables** for complex data relationships and configuration matrices
- Use **numbered lists** for sequential procedures and **bullet points** for feature lists
- Provide **cross-references** between related sections using clear linking
- Maintain **consistent terminology** throughout the document
- Include **code blocks** for technical specifications and configuration examples

🎯 **QUALITY STANDARDS:**
- **Accuracy**: Use only information from the provided context - no hallucination
- **Completeness**: Cover all 40 worksheets and their interdependencies
- **Clarity**: Write for both technical and business audiences
- **Structure**: Logical flow from high-level overview to detailed specifications
- **Actionability**: Provide specific, implementable guidance and procedures

📏 **DOCUMENT SCOPE:**
Generate the most comprehensive documentation possible within token limits, prioritizing:
1. Critical business processes and workflows
2. System configuration specifications
3. Regulatory compliance requirements
4. Operational procedures and troubleshooting
5. Technical architecture and integration details

⚠️ **CRITICAL INSTRUCTION:**
Base all content strictly on the provided context data. If information is missing or unclear, explicitly note gaps rather than creating fictional content.
"""

    try:
        # Configure generation parameters for maximum output
        generation_config = genai.GenerationConfig(
            temperature=0.1,  # Low temperature for factual accuracy
            top_p=0.9,
            top_k=40,
            max_output_tokens=8192,  # Maximum tokens for comprehensive output
            candidate_count=1,
        )

        response = model.generate_content(
            prompt,
            generation_config=generation_config,
            safety_settings=[
                {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
                {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
            ]
        )

        # Ensure we got a valid response
        if response.candidates and len(response.candidates) > 0:
            return response.text
        else:
            return "❌ No content generated. Please check your input data and try again."

    except Exception as e:
        return f"❌ Error during generation: {str(e)}\n\nPlease ensure your API key is valid and you have sufficient quota."

# ✅ Optional: Function to chunk large context if needed
def chunk_large_context(context, max_chunk_size=30000):
    """
    Split large context into manageable chunks if needed
    """
    if len(context) <= max_chunk_size:
        return [context]

    chunks = []
    start = 0
    while start < len(context):
        end = start + max_chunk_size
        # Try to break at a logical point (end of line)
        if end < len(context):
            last_newline = context.rfind('\n', start, end)
            if last_newline > start:
                end = last_newline

        chunks.append(context[start:end])
        start = end

    return chunks

# ✅ Function to process large documents in chunks if needed
def generate_comprehensive_documentation(query, full_context):
    """
    Generate documentation, handling large contexts by chunking if necessary
    """
    try:
        # Try with full context first
        if len(full_context) <= 30000:
            return synthesize_full_document(query, full_context)

        # If context is too large, process in chunks
        chunks = chunk_large_context(full_context)
        all_documentation = []

        for i, chunk in enumerate(chunks):
            print(f"🔄 Processing chunk {i+1}/{len(chunks)}...")
            chunk_query = f"{query} - Part {i+1} of {len(chunks)}"
            doc_part = synthesize_full_document(chunk_query, chunk)
            all_documentation.append(doc_part)

        # Combine all parts
        full_documentation = "\n\n---\n\n".join(all_documentation)
        return full_documentation

    except Exception as e:
        return f"❌ Error in comprehensive documentation generation: {str(e)}"

In [43]:
query = "Complete documentation of Florida Medicaid enrollment policy"

# 🔍 Retrieve comprehensive context
context = retrieve_context(query, top_k=50)  # Increase top_k to cover more Excel sheets

# 🧠 Generate the full document
generated_policy = synthesize_full_document(query, context)

# 🖨️ Preview
print("🧾 Synthesized Policy Document:\n")
print(generated_policy[:2000])  # Preview only first 2000 characters


🧾 Synthesized Policy Document:

# Florida Medicaid Provider Services Management (PSM) Golden Configuration System: Comprehensive Documentation

## 1. Executive Summary & System Overview

**Business Purpose:** The Florida PSM Golden Configuration System aims to streamline the Medicaid provider enrollment process, ensuring compliance with federal (CMS) and state regulations while providing efficient management of provider data and relationships.  This system supports timely processing of applications, accurate provider classification, and effective monitoring of provider compliance.

**System Architecture:**  (High-level architecture is missing from the provided data.  This section requires additional information from the 40 worksheets beyond the provided snippets.)  The system likely consists of a core application managing provider data, integrated with external systems for background checks, licensure verification, and potentially payment processing.  Further details on database techno

In [44]:
with open("florida_medicaid_policy_full.md", "w") as f:
    f.write(generated_policy)

print("✅ Saved to florida_medicaid_policy_full.md")


✅ Saved to florida_medicaid_policy_full.md
