Python is a powerful tool for analyzing Microsoft Word documents (.docx).  
Since .docx files are essentially XML files wrapped in a ZIP archive, Python can parse them efficiently to extract text, tables, images, and metadata.  
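
To see this structure for yourself, you can list the parts of the archive with the standard-library zipfile module. The snippet below is only a sketch: `list_docx_parts` is a hypothetical helper name, and the demo builds a tiny in-memory stand-in for a .docx (not a valid Word file) so that it runs without any input document.

```python
import io
import zipfile

def list_docx_parts(docx_file):
    """Return the sorted names of the parts stored inside a .docx archive.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        return sorted(zf.namelist())

# Demo: an in-memory archive that only mimics the layout of a .docx
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("word/document.xml", "<document/>")
    zf.writestr("docProps/core.xml", "<coreProperties/>")

buffer.seek(0)
for name in list_docx_parts(buffer):
    print(name)
```

Pointing the same function at a real file path should show entries such as word/document.xml (the body text) and docProps/core.xml (the core metadata).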

The most widely used library for this task is python-docx. For tabular data found within documents, combining it with pandas is usually the most convenient approach.  

Here is a comprehensive guide on how to analyze .docx files.  

## Prerequisites  
You will need to install the necessary libraries:

In [1]:
%pip install python-docx pandas


Collecting python-docx
  Downloading python_docx-1.2.0-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.2.0-py3-none-any.whl (252 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.2.0
Note: you may need to restart the kernel to use updated packages.


### 1. Basic Text Extraction and Analysis  
The most common task is extracting raw text to perform Natural Language Processing (NLP) or simple search operations.

In [5]:
from docx import Document

def extract_text_stats(file_path):
    try:
        # Load the document
        doc = Document(file_path)
        
        full_text = []
        char_count = 0
        
        # Iterate through body paragraphs (text inside tables is not included)
        for para in doc.paragraphs:
            # filter out empty lines to reduce noise
            if para.text.strip():
                full_text.append(para.text)
                char_count += len(para.text)
        
        joined_text = "\n".join(full_text)
        
        print(f"--- Analysis for {file_path} ---")
        print(f"Total Paragraphs: {len(full_text)}")
        print(f"Total Characters: {char_count}")
        print(f"Preview: {joined_text[:200]}...") # Show first 200 chars
        
        return joined_text

    except Exception as e:
        print(f"Error processing file: {e}")
        return None


In [6]:

# Usage
text_content = extract_text_stats('Artificial Neurons.docx')


--- Analysis for Artificial Neurons.docx ---
Total Paragraphs: 557
Total Characters: 50041
Preview: Artificial Neurons:
Introduction
Artificial neurons are the fundamental building blocks of artificial neural networks (ANNs), designed to mimic the functionality of biological neurons. These computati...
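
Once the text is available as a plain string, a simple first analysis is a word-frequency count. The sketch below uses only the standard library; `top_words` and its `min_length` cut-off are illustrative choices rather than part of python-docx, and the sample string stands in for the `text_content` returned above.

```python
import re
from collections import Counter

def top_words(text, n=5, min_length=4):
    """Return the n most common words with at least min_length characters."""
    # Lowercase and keep runs of letters/apostrophes as a crude tokenizer
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    return counts.most_common(n)

sample = (
    "Artificial neurons are the fundamental building blocks of "
    "artificial neural networks. Neurons receive inputs and neurons produce outputs."
)
print(top_words(sample, n=2))
# [('neurons', 3), ('artificial', 2)]
```

The length cut-off is a rough substitute for a stop-word list; for real NLP work a dedicated tokenizer would be a better fit.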


### 2. Extracting Tables into Pandas  
Word documents often contain data tables. Extracting these into a Pandas DataFrame allows you to perform statistical analysis or export the data to CSV/Excel easily.

In [7]:
import pandas as pd
from docx import Document

def extract_tables_to_dfs(file_path):
    doc = Document(file_path)
    dataframes = []

    for i, table in enumerate(doc.tables):
        # Collect the stripped text of every cell, row by row
        rows = [[cell.text.strip() for cell in row.cells] for row in table.rows]

        # Need a header row plus at least one data row
        if len(rows) < 2:
            continue

        # Assume the first row is the header. Passing it as `columns` (instead of
        # zipping rows into dicts) keeps columns whose header text is repeated or empty.
        header, *body = rows

        # Pad or trim each row to the header width so irregular tables still load
        width = len(header)
        body = [(r + [""] * width)[:width] for r in body]

        df = pd.DataFrame(body, columns=header)
        dataframes.append(df)
        print(f"Table {i+1} extracted with shape: {df.shape}")

    return dataframes



In [8]:
# Usage
dfs = extract_tables_to_dfs('Artificial Neurons.docx')
if dfs:
    print(dfs[0].head())
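
Every cell extracted from a Word table arrives as a string, so numeric columns usually need converting before any statistics make sense. The following is a minimal sketch of one way to do that; `coerce_numeric_columns` is a hypothetical helper, and it assumes unique column names, `$` as the currency symbol, and commas used as thousands separators.

```python
import pandas as pd

def coerce_numeric_columns(df):
    """Return a copy of df where columns that parse fully as numbers are converted."""
    result = df.copy()
    for column in result.columns:
        # Strip decorations such as currency symbols, thousands separators, spaces
        cleaned = result[column].astype(str).str.replace(r"[$,\s]", "", regex=True)
        converted = pd.to_numeric(cleaned, errors="coerce")
        # Keep the conversion only when every value in the column parsed
        if converted.notna().all():
            result[column] = converted
    return result

# Demo with a small frame shaped like an extracted table
raw = pd.DataFrame({
    "Item": ["Consulting", "License"],
    "Quantity": ["10", "1"],
    "Price": ["$1,000", "$2,500"],
})
clean = coerce_numeric_columns(raw)
print(clean.dtypes)
print(clean["Price"].sum())
```

Columns with even one non-numeric value are left as text, which avoids silently turning labels into NaN.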


### 3. Analyzing Document Metadata
Metadata analysis is useful for auditing (e.g., checking who authored a document or when it was last modified).

In [9]:
from docx import Document

def analyze_metadata(file_path):
    doc = Document(file_path)
    core_props = doc.core_properties

    metadata = {
        "Author": core_props.author,
        "Created": core_props.created,
        "Last Modified By": core_props.last_modified_by,
        "Last Printed": core_props.last_printed,
        "Revision": core_props.revision,
        "Title": core_props.title
    }

    for key, value in metadata.items():
        print(f"{key}: {value}")



In [10]:

# Usage
analyze_metadata('Artificial Neurons.docx')


Author: M. R. Das, K. Behera, C. Moharana
Created: 2024-08-04 14:46:00+00:00
Last Modified By: Minus Speed
Last Printed: 2024-08-04 16:59:00+00:00
Revision: 16
Title: An NCS using ANN & Deep Learning


### 4. Advanced: Direct XML Parsing (High Performance)  
If you have thousands of documents and only need to check for specific keywords or regex patterns, python-docx might be too slow because it loads the entire object model.  

Since a .docx is a zip file, you can read the XML directly using Python's built-in zipfile library.  
This is significantly faster for simple text scraping.

In [11]:
import html
import re
import zipfile

def fast_search_docx(file_path, search_term):
    """
    Searches a docx file for a term without loading the full Document object model.
    """
    try:
        with zipfile.ZipFile(file_path) as zf:
            # The main text content is usually in word/document.xml
            xml_content = zf.read('word/document.xml').decode('utf-8')
    except (KeyError, zipfile.BadZipFile):
        print("Could not read word/document.xml (file might be encrypted, corrupt, or not a .docx)")
        return False

    # Put a space at each paragraph end so words from adjacent paragraphs don't fuse
    xml_content = xml_content.replace('</w:p>', ' </w:p>')
    # Rough tag stripping for speed; formatting is not preserved
    clean_text = re.sub(r'<[^>]+>', '', xml_content)
    # Decode XML entities such as &amp; so they match plain-text search terms
    clean_text = html.unescape(clean_text)

    return search_term.lower() in clean_text.lower()



In [13]:

# Usage
exists = fast_search_docx('Artificial Neurons.docx', 'Artificial')
print(f"Term found: {exists}")


Term found: True
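
If you need paragraph boundaries rather than a yes/no answer, a namespace-aware parse with the standard-library xml.etree.ElementTree is a reasonable middle ground between the regex above and the full python-docx object model. The sketch below assumes the usual (transitional) WordprocessingML namespace; `paragraphs_from_docx` is a hypothetical helper, and the demo archive is a hand-built stand-in containing only word/document.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace (assumed; "strict" OOXML files use another URI)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def paragraphs_from_docx(docx_file):
    """Return the text of each non-empty paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over <w:t> elements, one or more per run
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs

# Demo: two paragraphs, the first split across two runs
xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Artificial </w:t></w:r><w:r><w:t>Neurons</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>R&amp;D overview</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("word/document.xml", xml)

buffer.seek(0)
print(paragraphs_from_docx(buffer))
# ['Artificial Neurons', 'R&D overview']
```

Unlike tag stripping, the parser decodes entities and keeps paragraphs separate. Note that this walk also visits paragraphs nested inside tables, and text inside text boxes may be reported more than once.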


## 1. Creating and Modifying .docx Files  
The python-docx library is the standard tool for creating and editing Word documents.  

### Creating a New Document  
This example demonstrates how to build a document from scratch, including headings, styled text, and tables.

In [2]:
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH

def create_invoice_doc(filename):
    # 1. Create a new Document
    doc = Document()

    # 2. Add a Title
    heading = doc.add_heading('Invoice #001', 0)
    heading.alignment = WD_ALIGN_PARAGRAPH.CENTER

    # 3. Add a Paragraph with specific styling
    p = doc.add_paragraph()
    p.add_run('Date: ').bold = True
    p.add_run('January 23, 2026\n')
    p.add_run('Bill To: ').bold = True
    p.add_run('Acme Corp')

    # 4. Add a Table
    # Create table with 1 header row
    table = doc.add_table(rows=1, cols=3)
    table.style = 'Table Grid' # Apply a standard Word style

    # Set header cells
    hdr_cells = table.rows[0].cells
    hdr_cells[0].text = 'Item'
    hdr_cells[1].text = 'Quantity'
    hdr_cells[2].text = 'Price'

    # Add data rows
    items = [
        ('Consulting Services', '10', '$1,000'),
        ('Server Maintenance', '5', '$500'),
        ('Software License', '1', '$2,500')
    ]

    for item_name, qty, price in items:
        row_cells = table.add_row().cells
        row_cells[0].text = item_name
        row_cells[1].text = qty
        row_cells[2].text = price

    # 5. Save the file
    doc.save(filename)
    print(f"Document saved as {filename}")

if __name__ == "__main__":
    create_invoice_doc('generated_invoice.docx')


Document saved as generated_invoice.docx


### Modifying an Existing Document  
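Modifying text in Word documents is tricky because text is split into "runs" (chunks of text with the same formatting). If you simply assign to paragraph.text, all runs in the paragraph are replaced by a single one, so character-level formatting such as bold or italic is lost (the paragraph style itself is kept).  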

Here is a robust way to append content and a simple way to replace text.

In [3]:
from docx import Document

def modify_existing_doc(input_path, output_path):
    doc = Document(input_path)

    # 1. Appending new content to the end
    doc.add_page_break()
    doc.add_heading('Appendix', level=1)
    doc.add_paragraph('This section was added via Python.')

    # 2. Simple Text Replacement (Note: This may reset formatting in the paragraph)
    for paragraph in doc.paragraphs:
        if 'Acme Corp' in paragraph.text:
            # Replace the text while keeping the paragraph object
            # Warning: This removes bold/italic styling within the paragraph
            paragraph.text = paragraph.text.replace('Acme Corp', 'Global Industries Ltd.')

    doc.save(output_path)
    print(f"Modified document saved as {output_path}")


In [4]:

# Usage
modify_existing_doc('generated_invoice.docx', 'modified_invoice.docx')


Modified document saved as modified_invoice.docx


## 2. Extracting Images from a .docx File  
While python-docx is great for text, it does not have a built-in, simple function to extract existing images.  

However, because a .docx file is a ZIP archive containing XML files and media, a straightforward way to extract images is to read the media folder directly with Python's built-in zipfile module.

In [14]:
import zipfile
import os

def extract_images_from_docx(docx_path, output_folder):
    """
    Extracts all images from a docx file into an output directory.
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    with zipfile.ZipFile(docx_path, 'r') as zip_ref:
        # Get list of all files in the ZIP
        file_list = zip_ref.namelist()
        
        # Embedded images are normally stored under 'word/media/'; skip directory entries
        image_files = [f for f in file_list
                       if f.startswith('word/media/') and not f.endswith('/')]
        
        count = 0
        for image_file in image_files:
            # Extract the file
            # We read the bytes and write them to the new location to flatten the structure
            image_data = zip_ref.read(image_file)
            
            # Get just the filename (e.g., 'image1.png') ignoring the 'word/media/' prefix
            filename = os.path.basename(image_file)
            target_path = os.path.join(output_folder, filename)
            
            with open(target_path, 'wb') as f:
                f.write(image_data)
            
            count += 1
            print(f"Extracted: {filename}")

    print(f"Total images extracted: {count}")


In [15]:

# Usage
extract_images_from_docx('Artificial Neurons.docx', 'extracted_images')


Extracted: image1.jpg
Extracted: image2.png
Extracted: image3.png
Extracted: image4.png
Extracted: image5.png
Extracted: image6.png
Extracted: image7.png
Extracted: image8.png
Extracted: image9.png
Extracted: image13.png
Extracted: image14.png
Extracted: image15.png
Extracted: image16.png
Extracted: image17.jpeg
Extracted: image18.jpeg
Extracted: image19.jpeg
Extracted: image20.jpeg
Extracted: image21.png
Extracted: image10.png
Extracted: image12.png
Extracted: image11.png
Total images extracted: 21
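
Before writing anything to disk, it can be useful to take a quick inventory of what the media folder contains. The sketch below counts embedded media files by extension using only the standard library; `count_media_by_extension` is a hypothetical helper, and the demo archive is a hand-built stand-in with placeholder bytes rather than real images.

```python
import io
import os
import zipfile
from collections import Counter

def count_media_by_extension(docx_file):
    """Count files under word/media/ grouped by lowercase file extension."""
    with zipfile.ZipFile(docx_file) as zf:
        names = [
            n for n in zf.namelist()
            if n.startswith("word/media/") and not n.endswith("/")
        ]
    return Counter(os.path.splitext(n)[1].lower() for n in names)

# Demo: an in-memory archive with three fake media entries
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("word/document.xml", "<document/>")
    zf.writestr("word/media/image1.jpg", b"placeholder")
    zf.writestr("word/media/image2.png", b"placeholder")
    zf.writestr("word/media/image3.PNG", b"placeholder")

buffer.seek(0)
counts = count_media_by_extension(buffer)
print(dict(counts))
```

Note that the media folder can also hold non-image files (for example embedded audio or vector formats), so the extensions are worth checking before assuming everything is a picture.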
