# Word Document Processing

In [1]:
from langchain_community.document_loaders import Docx2txtLoader, UnstructuredWordDocumentLoader

## Method 1: Using Docx2txtLoader

In [3]:
## METHOD 1: Using Docx2txtLoader
# This method uses LangChain’s Docx2txtLoader to load and extract text from a Word (.docx) document.

print("1️ Using Docx2txtLoader")

try:
    # Create an instance of the Docx2txtLoader class.
    # This class automatically reads the .docx file and extracts all text content.
    # You need to specify the full or relative path to your .docx file.
    docx_loader = Docx2txtLoader("data/word_files/proposal.docx")

    # Load the document content.
    # The 'load()' method returns a list of 'Document' objects (LangChain's standard data type).
    # Each 'Document' contains:
    #  - page_content → the extracted text from the document
    #  - metadata → file metadata (for this loader, only the 'source' path)
    docs = docx_loader.load()

    # Print how many documents were loaded.
    # Docx2txtLoader returns the whole file as a single Document, so this should be 1.
    print(f"Loaded {len(docs)} document(s)")

    # Print the first 200 characters of the document’s content as a preview.
    # Useful for checking if the text was read correctly.
    print(f"Content preview: {docs[0].page_content[:200]}...")

    # Print the metadata of the loaded document.
    # This often includes info like the file path and source.
    print(f"Metadata: {docs[0].metadata}")

# The try/except block is used to catch and display any errors that occur.
# For example, if the file path is incorrect or the document is corrupted.
except Exception as e:
    print(f"Error: {e}")

1️ Using Docx2txtLoader
Loaded 1 document(s)
Content preview: Project Proposal: RAG Implementation

Executive Summary

This proposal outlines the implementation of a Retrieval-Augmented Generation system for our organization.

Objectives

Key objectives include:...
Metadata: {'source': 'data/word_files/proposal.docx'}
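
To make the plain-text extraction above less of a black box, here is a minimal standard-library sketch of the general idea: a `.docx` file is a ZIP archive whose body lives in `word/document.xml`, with text stored in `w:t` runs inside `w:p` paragraphs. This is not a copy of how `docx2txt` or `Docx2txtLoader` is implemented. The helper names `extract_paragraphs` and `make_minimal_docx` are hypothetical, and the in-memory demo file contains only the one XML part the sketch needs, so Word itself may refuse to open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_file):
    """Return the non-empty paragraph texts of a .docx (path or file-like object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread across one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def make_minimal_docx(paragraphs):
    """Build a stripped-down in-memory .docx for demonstration only (hypothetical helper)."""
    body = "".join(f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>" for text in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


demo = make_minimal_docx(["Project Proposal: RAG Implementation", "Executive Summary"])
print("\n\n".join(extract_paragraphs(demo)))
```

Like the loader output above, the result is flat text: nothing in it records that the first line was a title.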


## Method 2: Using UnstructuredWordDocumentLoader

In [6]:
## METHOD 2: Using UnstructuredWordDocumentLoader
# This method uses LangChain’s UnstructuredWordDocumentLoader.
# It leverages the "unstructured" library to parse .docx files into fine-grained elements
# like paragraphs, titles, tables, and lists — rather than one big chunk of text.

print("\n Using UnstructuredWordDocumentLoader")

try:
    # Create an instance of the UnstructuredWordDocumentLoader.
    # Arguments:
    # - The first argument is the file path to the Word document.
    # - The 'mode' argument defines how the text is split:
    #     "single"   → loads the whole document as one text block.
    #     "elements" → loads and separates text into individual elements (paragraphs, headers, etc.)
    unstructured_loader = UnstructuredWordDocumentLoader(
        "data/word_files/proposal.docx",
        mode="elements"
    )

    # Load the document.
    # Returns a list of LangChain Document objects.
    # Each object corresponds to an "element" (e.g., paragraph, title, table).
    unstructured_docs = unstructured_loader.load()

    # Display the number of elements extracted from the Word file.
    print(f"Loaded {len(unstructured_docs)} elements")

    # Loop through and preview the first 3 elements.
    for i, doc in enumerate(unstructured_docs[:3]):
        print(f"\nElement {i + 1}:")
        
        # 'category' metadata tells what kind of element it is (e.g., 'Title', 'NarrativeText', 'Table').
        # We use get('category', 'unknown') to avoid errors if the key doesn’t exist.
        print(f"Type: {doc.metadata.get('category', 'unknown')}")
        
        # Show the first 100 characters of each element’s content for preview.
        print(f"Content: {doc.page_content[:100]}...")

# The try/except block helps handle errors gracefully
# (for example, if the file doesn’t exist or the parsing fails).
except Exception as e:
    print(f"Error: {e}")



 Using UnstructuredWordDocumentLoader
Loaded 20 elements

Element 1:
Type: Title
Content: Project Proposal: RAG Implementation...

Element 2:
Type: Title
Content: Executive Summary...

Element 3:
Type: NarrativeText
Content: This proposal outlines the implementation of a Retrieval-Augmented Generation system for our organiz...
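
The element list lends itself to simple post-processing. The sketch below counts elements per `category` and joins the element texts back into one string, which is roughly what `mode="single"` returns. It uses `SimpleNamespace` stand-ins with the same `page_content` and `metadata` attributes so that it runs without the sample file. The helper names are hypothetical, and the blank-line separator is an assumption about the loader's behavior rather than a documented guarantee.

```python
from collections import Counter
from types import SimpleNamespace


def count_categories(docs):
    """Count elements per 'category' metadata value."""
    return Counter(doc.metadata.get("category", "unknown") for doc in docs)


def join_elements(docs, separator="\n\n"):
    """Concatenate element texts into a single string."""
    return separator.join(doc.page_content for doc in docs)


# Stand-ins mirroring the first three elements printed above.
sample_docs = [
    SimpleNamespace(page_content="Project Proposal: RAG Implementation",
                    metadata={"category": "Title"}),
    SimpleNamespace(page_content="Executive Summary",
                    metadata={"category": "Title"}),
    SimpleNamespace(page_content="This proposal outlines the implementation of a "
                                 "Retrieval-Augmented Generation system for our organization.",
                    metadata={"category": "NarrativeText"}),
]

print(count_categories(sample_docs))
print(join_elements(sample_docs)[:60])
```

With the real data, pass `unstructured_docs` in place of `sample_docs`.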


In [7]:
unstructured_docs

[Document(metadata={'source': 'data/word_files/proposal.docx', 'category_depth': 0, 'file_directory': 'data/word_files', 'filename': 'proposal.docx', 'last_modified': '2025-06-28T15:37:12', 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': 'bb0410bfd160ef866f8d4357b0949db2'}, page_content='Project Proposal: RAG Implementation'),
 Document(metadata={'source': 'data/word_files/proposal.docx', 'category_depth': 0, 'file_directory': 'data/word_files', 'filename': 'proposal.docx', 'last_modified': '2025-06-28T15:37:12', 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title', 'element_id': 'c0f844859abf08d9506856b3aed4a719'}, page_content='Executive Summary'),
 Document(metadata={'source': 'data/word_files/proposal.docx', 'category_depth': 0, 'file_directory': 'data/word_files', 'filename': 'proposal.docx', 'last_modified': '2
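
The metadata shown above is what makes element mode useful for RAG: `category` distinguishes headings from body text, so elements can be regrouped into section-level chunks. The following is one possible sketch, not a LangChain feature. `group_by_title` is a hypothetical helper, the stand-in objects only mimic the `page_content` and `metadata` attributes, and treating every `Title` element as a section boundary is an assumption that may be too coarse or too fine for some documents.

```python
from types import SimpleNamespace


def group_by_title(docs):
    """Group elements into sections, starting a new section at each 'Title' element."""
    sections = []
    current = None
    for doc in docs:
        if doc.metadata.get("category") == "Title":
            current = {"title": doc.page_content, "body": []}
            sections.append(current)
        else:
            if current is None:
                # Text before the first heading goes into an untitled section.
                current = {"title": None, "body": []}
                sections.append(current)
            current["body"].append(doc.page_content)
    return sections


# Stand-ins for the first elements of the output above.
sample_elements = [
    SimpleNamespace(page_content="Project Proposal: RAG Implementation",
                    metadata={"category": "Title"}),
    SimpleNamespace(page_content="Executive Summary",
                    metadata={"category": "Title"}),
    SimpleNamespace(page_content="This proposal outlines the implementation of a "
                                 "Retrieval-Augmented Generation system for our organization.",
                    metadata={"category": "NarrativeText"}),
]

for section in group_by_title(sample_elements):
    print(section["title"], "->", len(section["body"]), "body element(s)")
```

Each section's title and body can then be joined into one chunk, so that a retrieved passage carries its heading with it.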

## Best Methods

| #     | Method                                                            | Description                                                                                                 | Advantages                                                                                                             | Limitations                                                                                            | Best For                                                         |
| ----- | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| **1** | **`Docx2txtLoader`**                                              | Uses the lightweight `docx2txt` library to extract plain text from `.docx`.                                 | - Fast and simple<br>- Only extra dependency is the `docx2txt` package<br>- Good for clean text extraction              | - Loses structure (no headings, tables, or formatting info)<br>- Metadata is minimal (only `source`)     | Quick text-only extraction for simple NLP or indexing               |
| **2** | **`UnstructuredWordDocumentLoader`**                              | Uses the `unstructured` library to parse `.docx` into logical elements (titles, narrative text, list items, tables, etc.) | - Preserves document structure<br>- Captures metadata (category, hierarchy)<br>- Works well with RAG and semantic search | - Slower than plain text extraction<br>- Requires the `unstructured` library and its `.docx` dependencies (the exact set depends on the version and install extras) | Semantic document processing, RAG pipelines, content classification |
| **3** | **`DocxLoader` (from `langchain_community.document_loaders`)**    | Described as using `python-docx` under the hood for text plus moderate structure extraction. This class may not exist in current `langchain_community` releases, so check your installed version before relying on it. | - Retains paragraph and heading separation<br>- Easy to use<br>- Pure Python                                             | - Tables and images not well handled<br>- Metadata minimal                                                | Moderate structure preservation with low setup effort               |
| **4** | **`UnstructuredFileLoader`**                                      | Auto-detects and handles multiple file types (PDF, DOCX, TXT, etc.) through the `unstructured` pipeline.    | - Unified interface for all file formats<br>- Good for multi-format pipelines                                            | - Overhead for single `.docx` use<br>- Less control over structure granularity                            | Pipelines that handle many file formats dynamically                 |
| **5** | **Manual with `python-docx`**                                     | Directly read and process `.docx` files using the low-level `python-docx` API.                              | - Full control over parsing (headings, tables, runs, styles)<br>- Can customize output                                   | - Requires manual iteration and formatting logic<br>- More code to write                                  | Custom loaders or preprocessing before feeding into LangChain       |
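
For the manual route in row 5, a minimal sketch might look like the following. It assumes the `python-docx` package is installed (imported as `docx`) and relies only on `Document(path).paragraphs`, `paragraph.text`, and `paragraph.style.name`. The function names and the record layout are hypothetical, tables are not handled, and the demo uses a stand-in object so that it runs without the package or the sample file.

```python
from types import SimpleNamespace


def paragraphs_to_records(document):
    """Turn a python-docx style document into a list of {'style', 'text'} records."""
    records = []
    for paragraph in document.paragraphs:
        text = paragraph.text.strip()
        if not text:
            continue  # Skip empty paragraphs used for spacing.
        records.append({"style": paragraph.style.name, "text": text})
    return records


def load_docx_records(path):
    """Open a .docx with python-docx and extract paragraph records."""
    from docx import Document  # Imported lazily: requires `pip install python-docx`.
    return paragraphs_to_records(Document(path))


# Stand-in with the same attributes as a python-docx document, for demonstration only.
fake_document = SimpleNamespace(paragraphs=[
    SimpleNamespace(text="Executive Summary", style=SimpleNamespace(name="Heading 1")),
    SimpleNamespace(text="   ", style=SimpleNamespace(name="Normal")),
    SimpleNamespace(text=" Body text ", style=SimpleNamespace(name="Normal")),
])

for record in paragraphs_to_records(fake_document):
    print(record)
```

On a real file, `load_docx_records("data/word_files/proposal.docx")` would return one record per non-empty paragraph. When the document uses Word's built-in heading styles, style names such as `Heading 1` can be used to separate headings from body text.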
