# Documentation Generator with Gen AI

## Overview
This documentation generator uses generative AI to create feature descriptions from product module specifications and can deliver them in several documentation standards.

## Features

### Input Processing
- **Module Specification Parsing**: Reads product module specifications in JSON, YAML, XML, and Word formats
- **Metadata Extraction**: Automatically extracts key information like module names, functions, parameters, and dependencies
- **Validation**: Ensures input specifications meet required schema standards
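
The validation step can be pictured with a minimal sketch. The field names in `REQUIRED_FIELDS` and the `validate_spec` helper are assumptions made for illustration; the actual schema used by the generator is not defined in this document.

```python
from typing import Any

# Hypothetical minimal schema: field name -> expected Python type.
REQUIRED_FIELDS: dict[str, type] = {
    "name": str,
    "functions": list,
    "dependencies": list,
}


def validate_spec(spec: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in spec:
            problems.append(f"missing field: {field}")
        elif not isinstance(spec[field], expected):
            problems.append(
                f"field {field!r} should be {expected.__name__}, "
                f"got {type(spec[field]).__name__}"
            )
    return problems


# Illustrative spec with one deliberate error: "dependencies" is a string.
spec = {"name": "User Management", "functions": ["create_user"], "dependencies": "auth"}
print(validate_spec(spec))
```

Returning a list of problems rather than raising on the first one lets a caller report every schema violation in a single pass.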

### AI-Powered Generation
- **Natural Language Processing**: Uses NLP models to interpret technical specifications
- **Context-Aware Generation**: Maintains consistency across related modules and features
- **Multi-Standard Support**: Generates documentation compliant with various standards (API docs, user manuals, technical specs)
- **Language Rules**: Follows the language rules specified for technical documentation

### Output Formats
- **Markdown Documentation**: Clean, readable format for web and version control
- **HTML Reports**: Styled documentation with navigation and search capabilities
- **PDF Generation**: Professional documentation for distribution
- **API Documentation**: Interactive documentation with examples and testing capabilities

### Supported Standards
- **OpenAPI/Swagger**: RESTful API documentation
- **JSDoc**: JavaScript code documentation
- **Sphinx**: Python project documentation
- **GitBook**: Collaborative documentation platform
- **Confluence**: Enterprise wiki format

## Usage Workflow
1. **Input**: Feed product module specifications into the system
2. **Processing**: AI analyzes and understands the technical content
3. **Generation**: Creates human-readable feature descriptions
4. **Output**: Delivers documentation in specified standard format(s)
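
The four steps above can be sketched as a small pipeline. Everything here is an assumption made for illustration: the function names, the prompt wording, and `echo_model`, which stands in for a real model call so the example runs without any AI service.

```python
import json
from typing import Callable


def parse_spec(raw: str) -> dict:
    """Step 1 (Input): parse a JSON module specification."""
    return json.loads(raw)


def build_prompt(spec: dict, standard: str) -> str:
    """Step 2 (Processing): turn the parsed spec into a model prompt."""
    functions = ", ".join(spec.get("functions", []))
    return (
        f"Write a feature description for module '{spec['name']}' "
        f"in {standard} style. Functions: {functions}."
    )


def generate_docs(raw: str, standard: str, model: Callable[[str], str]) -> str:
    """Steps 3-4 (Generation, Output): call the model and return Markdown."""
    spec = parse_spec(raw)
    description = model(build_prompt(spec, standard))
    return f"## {spec['name']}\n\n{description}\n"


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call; it only echoes the prompt."""
    return f"(model output for: {prompt})"


# Illustrative input, not taken from a real specification.
raw_spec = '{"name": "User Management", "functions": ["create_user", "delete_user"]}'
print(generate_docs(raw_spec, "user manual", echo_model))
```

Passing the model in as a callable keeps the pipeline testable: a real client can replace `echo_model` without changing the other steps.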

## Benefits
- **Consistency**: Uniform documentation style across all modules
- **Efficiency**: Automated generation saves manual documentation time
- **Accuracy**: Reduces human error in technical documentation
- **Scalability**: Handles large codebases and complex product suites
- **Maintenance**: Easy updates when specifications change

In [None]:
%pip install -qU python-docx Pillow pytesseract

In [None]:
import os


def search_word_files(folder_path):
    """
    Recursively search for Word files (.doc, .docx) in a folder and its subdirectories.
    
    Args:
        folder_path (str): Path to the folder to search in
        
    Returns:
        list: List of full paths to Word files found
    """
    word_files = []
    word_extensions = ('.doc', '.docx')
    
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith(word_extensions):
                full_path = os.path.join(root, file)
                word_files.append(full_path)
    
    return word_files

# Example usage (uncomment to run):
# folder_to_search = "./pmds"
# word_documents = search_word_files(folder_to_search)
#
# print(f"Found {len(word_documents)} Word files:")
# for doc in word_documents:
#     print(doc)

In [None]:
from docx import Document as DocxDocument
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


# BaseLoader is already an abstract base class, so no extra ABC mixin is needed.
class DocxLoader(BaseLoader):
    """Loader that uses python-docx to load .docx files."""

    def __init__(self, file_path: str):
        """Initialize with file path."""
        self.file_path = file_path

    def load(self):
        """Load the document and return its text content."""
        doc = DocxDocument(self.file_path)
        full_text = []
        for para in doc.paragraphs:
            full_text.append(para.text)
        text = "\n".join(full_text)
        metadata = {"source": self.file_path}
        return [Document(page_content=text, metadata=metadata)]

test_doc = DocxDocument("./pmds/PRISM.USRMGMT_User_Management/PMD_PRISM_USRMGMT_User_Management.docx")
paragraphs = test_doc.paragraphs

# Step through the document one paragraph at a time, printing text and style name.
for i, paragraph in enumerate(paragraphs, start=1):
    print(paragraph.text, paragraph.style.name)
    if i == len(paragraphs):
        break
    # Any answer other than "1" (including empty input) stops the loop.
    if input("Continue (0, 1)?").strip() != "1":
        break

_ParagraphStyle('Title') id: 131002037580240
_ParagraphStyle('Title') id: 131002037663072
_ParagraphStyle('Title') id: 131002035118944
_ParagraphStyle('Subtitle') id: 131002036837968
_ParagraphStyle('Subtitle') id: 131002037571312
_ParagraphStyle('Subtitle') id: 131002038671680
_ParagraphStyle('Subtitle') id: 131002037599504
_ParagraphStyle('Subtitle') id: 131002037664944
_ParagraphStyle('Subtitle') id: 131002035213408
_ParagraphStyle('Subtitle') id: 131002035356304
_ParagraphStyle('Subtitle') id: 131002037598064
_ParagraphStyle('Subtitle') id: 131002035993056
_ParagraphStyle('Subtitle') id: 131002035117936


In [1]:
# Import under an alias so the LangChain Document class used by DocxLoader is not shadowed.
from docx import Document as DocxDocument
from docx.shared import Inches
import base64
from PIL import Image
import io
import pytesseract

word_documents = ["./pmds/PRISM.USRMGMT_User_Management/PMD_PRISM_USRMGMT_User_Management.docx"]

def extract_document_content(doc_path):
    """
    Extract and semantically separate content from a Word document.
    Records where embedded images occur so they can later be processed with OCR or tagging.

    Args:
        doc_path (str): Path to the Word document

    Returns:
        dict: Structured content with text, images, and metadata
    """
    doc = DocxDocument(doc_path)
    content = {
        'metadata': {
            'file_path': doc_path,
            'title': '',
            'paragraphs_count': 0,
            'images_count': 0
        },
        'sections': [],
        'images': [],
        'tables': []
    }
    
    # Content before the first heading goes into an untitled section with no level
    current_section = {'title': '', 'level': None, 'content': []}
    
    for para in doc.paragraphs:
        # Check if paragraph is a heading
        if para.style.name.startswith('Heading'):
            # Save previous section if it has content
            if current_section['content']:
                content['sections'].append(current_section)
            
            # Start new section
            current_section = {
                'title': para.text.strip(),
                'level': para.style.name,
                'content': []
            }
            
            # Set document title from first heading
            if not content['metadata']['title'] and para.text.strip():
                content['metadata']['title'] = para.text.strip()
        else:
            # Add paragraph content
            if para.text.strip():
                current_section['content'].append({
                    'type': 'paragraph',
                    'text': para.text.strip(),
                    'style': para.style.name
                })
        
        # Check for images in paragraph
        for run in para.runs:
            if run.element.xpath('.//a:blip'):
                content['metadata']['images_count'] += 1
                content['images'].append({
                    'index': content['metadata']['images_count'],
                    'paragraph_text': para.text.strip(),
                    'type': 'embedded_image'
                })
    
    # Add final section
    if current_section['content']:
        content['sections'].append(current_section)
    
    # Process tables
    for i, table in enumerate(doc.tables):
        table_data = []
        for row in table.rows:
            row_data = []
            for cell in row.cells:
                row_data.append(cell.text.strip())
            table_data.append(row_data)
        
        content['tables'].append({
            'index': i + 1,
            'data': table_data,
            'rows': len(table_data),
            'columns': len(table_data[0]) if table_data else 0
        })
    
    # Count the non-empty paragraphs that were stored across all sections
    content['metadata']['paragraphs_count'] = sum(len(section['content']) for section in content['sections'])
    
    return content

def process_images_with_ocr(content):
    """
    Process images in the document with OCR.
    Note: This is a placeholder implementation as extracting actual image data from docx requires additional handling.
    """
    for img in content['images']:
        img['ocr_text'] = "OCR processing would be implemented here"
        img['tags'] = ["technical_diagram", "flowchart", "screenshot"]  # Example tags
    
    return content

# Process the first Word document
if word_documents:
    first_doc = word_documents[0]
    print(f"Processing: {first_doc}")
    
    extracted_content = extract_document_content(first_doc)
    # processed_content = process_images_with_ocr(extracted_content)
    
    print("\nDocument Analysis:")
    print(f"Title: {extracted_content['metadata']['title']}")
    print(f"Sections: {len(extracted_content['sections'])}")
    print(f"Paragraphs: {extracted_content['metadata']['paragraphs_count']}")
    print(f"Images: {extracted_content['metadata']['images_count']}")
    print(f"Tables: {len(extracted_content['tables'])}")

Processing: ./pmds/PRISM.USRMGMT_User_Management/PMD_PRISM_USRMGMT_User_Management.docx

Document Analysis:
Title: History
Sections: 72
Paragraphs: 287
Images: 2
Tables: 14


In [19]:
extracted_content["sections"][4]["content"][-3:-1]

[{'type': 'paragraph',
  'text': 'Secret - Sensitive information, such as passwords or cryptographic keys, that is known only to authorized users or systems',
  'style': 'List Paragraph'},
 {'type': 'paragraph',
  'text': "Credential - Information used to verify an entity's identity, often in the form of a username and password, security token or by other means",
  'style': 'List Paragraph'}]
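
To pass a section like the one above to a generation step, it can be flattened into plain text. The sketch below assumes the section layout produced by `extract_document_content` (`title`, `level`, `content`); the `section_to_text` helper, the `Glossary` title, and the shortened sample texts are hypothetical.

```python
def section_to_text(section: dict) -> str:
    """Render one extracted section as Markdown-like text for a prompt."""
    lines = []
    if section.get("title"):
        lines.append(f"# {section['title']}")
    for item in section["content"]:
        # Paragraphs with Word's 'List Paragraph' style become bullet items.
        prefix = "- " if item["style"] == "List Paragraph" else ""
        lines.append(prefix + item["text"])
    return "\n".join(lines)


# Illustrative section; the title and the shortened texts are made up.
sample_section = {
    "title": "Glossary",
    "level": "Heading 1",
    "content": [
        {"type": "paragraph", "text": "Secret - Sensitive information", "style": "List Paragraph"},
        {"type": "paragraph", "text": "Credential - Identity information", "style": "List Paragraph"},
    ],
}
print(section_to_text(sample_section))
```

Keeping the list style as bullets preserves the glossary structure that would otherwise be lost when the paragraphs are joined.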