## Introduction
This example accepts transcript data in Microsoft Word (`.docx`) format and transforms it into the JSON format our tool needs. The format:

```
Speaker1: 
This is the first line.

Speaker2: 
This is the second line.
```

In addition, it extracts comments from the `.docx` file and treats them as human open coding results.

In [6]:
%pip install python-docx

Note: you may need to restart the kernel to use updated packages.


## The Converter
The converter code here is generated by Gemini-2.5-pro with slight manual editing. Below is the prompt:
```
Write Python code to convert all `.docx` files in an input folder into the format of the uploaded example JSON. The format of the docx:

===
{$Name (May have whitespaces)} {$Time}
{$Content}
===

Following a name and time denotation, each non-empty paragraph should be its own "item". For each, identify its "paragraph" number for further processing.
Ignore content before the first name. If no time information is available, mark it as "1970-01-01 00:00:00".
The entirety should be considered as one chunk. The chunk ID should come from the file name. Return the JS object.
```

In [7]:
import os
import json
import re
from docx import Document
from datetime import datetime
from collections import defaultdict

def process_single_docx(docx_filepath):
    """
    Converts a single .docx file into a structured dictionary based on speaker text.
    Each item is tagged with its paragraph index for reliable comment mapping.

    Args:
        docx_filepath (str): The full path to the .docx file.

    Returns:
        dict: A dictionary containing the structured data from the document,
              or None if the file cannot be processed.
    """
    if not os.path.exists(docx_filepath):
        print(f"Error: File not found at {docx_filepath}")
        return None

    chunk_id = os.path.splitext(os.path.basename(docx_filepath))[0]
    document = Document(docx_filepath)
    
    chunk_data = {
        "id": chunk_id,
        "items": []
    }

    current_uid = None
    current_time = "1970-01-01 00:00:00"
    item_counter = 1
    first_speaker_found = False
    
    # Regex to capture 'Name HH:MM:SS' or 'Name YYYY-MM-DD HH:MM:SS'
    # (the line must end with the time; there is no trailing colon)
    name_time_regex = re.compile(r'^(.*?)\s*((?:\d{4}-\d{2}-\d{2}\s)?\d{2}:\d{2}:\d{2})$')

    # Use enumerate to get the paragraph index (para_counter)
    for para_counter, para in enumerate(document.paragraphs):
        text = para.text.strip()
        if not text:
            continue

        match = name_time_regex.match(text)
        if match:
            first_speaker_found = True
            current_uid = match.group(1).strip()
            time_str = match.group(2).strip()
            
            if ' ' not in time_str:
                time_str = f"1970-01-01 {time_str}"
            
            try:
                parsed_time = datetime.strptime(time_str, '%Y-%m-%d %H:%M:%S')
                current_time = parsed_time.strftime('%Y-%m-%d %H:%M:%S')
            except ValueError:
                current_time = "1970-01-01 00:00:00"
            
            continue

        if first_speaker_found and current_uid:
            content = text
            item = {
                "id": f"{chunk_id}-{item_counter}",
                "uid": current_uid,
                "time": current_time,
                "content": content,
                "paragraph_index": para_counter # Store the paragraph index
            }
            chunk_data["items"].append(item)
            item_counter += 1
                
    return chunk_data
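
To make the header detection easier to check, here is a minimal sketch that applies the same regular expression to a few lines in isolation. The helper `classify_line` and the sample lines are hypothetical and only serve as illustration; they are not part of the converter.

```python
import re

# Same pattern as in process_single_docx: a name, an optional date,
# and a required HH:MM:SS time at the end of the line.
name_time_regex = re.compile(r'^(.*?)\s*((?:\d{4}-\d{2}-\d{2}\s)?\d{2}:\d{2}:\d{2})$')

def classify_line(text):
    """Return ('header', name, time) for a speaker line, else ('content', text)."""
    stripped = text.strip()
    match = name_time_regex.match(stripped)
    if match:
        return ("header", match.group(1).strip(), match.group(2).strip())
    return ("content", stripped)

# Hypothetical sample lines, not taken from a real transcript.
samples = [
    "Jane Doe 00:01:23",
    "Jane Doe 2024-05-01 00:01:23",
    "Speaker1:",
    "This is the first line.",
]
for line in samples:
    print(classify_line(line))
```

Under this pattern, a line without a time at its end (such as `Speaker1:`) is treated as content rather than as a speaker header.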

## The Comment Reader
The comment reader code here is generated by Gemini-2.5-pro with slight manual editing. Below is the prompt:
```
Write a new procedure to extract comments as open codes from human coders into the format of the uploaded example JSON. Take the docx and the parsed chunk_data as input. Use the paragraph index to identify the comment's position.
Separate comments from each coder and send output as a Map<string, CodedData>.
Then, edit the folder-level procedure to save 1) the same data file as before, and 2) each coder's codes as a separate JSON.
```

In [8]:
def extract_comments_as_codes(docx_filepath, chunk_data):
    """
    Extracts comments from a .docx file and formats them as open codes.
    Uses the comment's associated paragraph to identify which data item it relates to.
    
    Args:
        docx_filepath (str): The path to the .docx file.
        chunk_data (dict): The parsed data from process_single_docx for the same file.

    Returns:
        dict: A dictionary where keys are coder names (comment authors) and
              values are the structured coded data.
    """
    coders_data = defaultdict(lambda: {"threads": {}})
    document = Document(docx_filepath)
    chunk_id = chunk_data['id']

    # Create a lookup map from paragraph index to item ID
    paragraph_index_to_item_id_map = {item['paragraph_index']: item['id'] for item in chunk_data['items']}
    
    # Build a mapping of comment IDs to paragraph indices by scanning the document XML
    comment_to_paragraph_map = {}
    
    # Scan through paragraphs to find comment range markers
    for para_index, para in enumerate(document.paragraphs):
        # Check the paragraph's XML element for comment range markers
        if para._element is not None:
            for element in para._element.iter():
                # Look for comment range start markers
                if element.tag.endswith('commentRangeStart'):
                    comment_id = element.get('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id')
                    if comment_id:
                        comment_to_paragraph_map[comment_id] = para_index
    
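    # Process each comment (document.comments is available in python-docx 1.2.0 and later)
    for comment in document.comments:
        # Read the comment's ID, author, and text, falling back to defaults if an attribute is missing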
        comment_id = str(comment.comment_id) if hasattr(comment, 'comment_id') else None
        coder = comment.author if hasattr(comment, 'author') else 'Unknown'
        code_text = comment.text.strip() if hasattr(comment, 'text') else ''
        
        # Skip empty comments or comments without IDs
        if not comment_id or not code_text:
            continue
            
        # Find the paragraph this comment is associated with
        para_index = comment_to_paragraph_map.get(comment_id)
        
        if para_index is not None:
            # Find the corresponding item ID using paragraph index
            item_id = paragraph_index_to_item_id_map.get(para_index)
            
            if item_id:
                # Build the nested structure for the coder's data
                thread = coders_data[coder]["threads"].setdefault(chunk_id, {"id": chunk_id, "items": {}})
                item_codes = thread["items"].setdefault(item_id, {"id": item_id, "codes": []})

                # Split codes into multiple items using ; , and . as delimiters
                codes = re.split(r'[;,.]', code_text)
                for code in codes:
                    code = code.strip()
                    # Add the code if it is non-empty and not already present
                    # (a trailing delimiter would otherwise produce an empty code)
                    if code and code not in item_codes["codes"]:
                        item_codes["codes"].append(code)

    return dict(coders_data)
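
The splitting step and the nested output shape can be tried without a `.docx` file. The sketch below is a simplified illustration: `split_codes` is a hypothetical helper, and the chunk ID, item ID, and comment text are made up. It drops empty fragments, which appear whenever a comment ends with a delimiter.

```python
import re

def split_codes(comment_text):
    """Split a comment into codes on ';', ',' and '.', dropping blanks and duplicates."""
    codes = []
    for fragment in re.split(r'[;,.]', comment_text):
        code = fragment.strip()
        if code and code not in codes:
            codes.append(code)
    return codes

# Hypothetical chunk ID, item ID, and comment text used only for illustration.
coder_data = {"threads": {}}
thread = coder_data["threads"].setdefault("example", {"id": "example", "items": {}})
item = thread["items"].setdefault("example-1", {"id": "example-1", "codes": []})
item["codes"].extend(split_codes("trust; privacy concern, trust."))
print(coder_data)
```

Note that splitting on `.` also breaks up abbreviations such as "e.g.", so coders may want to avoid periods inside a single code.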

In [9]:
def convert_folder_to_json(input_folder, output_filename='data-example.json'):
    """
    Scans a directory for .docx files, processes them for content and comments,
    and saves the output into multiple JSON files.

    Args:
        input_folder (str): The path to the folder containing .docx files.
        output_filename (str, optional): The name for the main data output file.
    """
    final_json_object = {}
    aggregated_codes_by_coder = defaultdict(lambda: {"threads": {}})
    
    try:
        docx_files = [f for f in os.listdir(input_folder) if f.endswith('.docx')]
    except FileNotFoundError:
        print(f"Error: The folder '{input_folder}' was not found.")
        return

    if not docx_files:
        print(f"No .docx files were found in the folder '{input_folder}'.")
        return

    print(f"Found {len(docx_files)} .docx file(s) to process...")

    for docx_filename in docx_files:
        full_path = os.path.join(input_folder, docx_filename)
        print(f"Processing '{docx_filename}'...")
        
        # 1. Process the document for speaker text content
        chunk_data = process_single_docx(full_path)
        if chunk_data:
            # We need to remove the paragraph_index from the final output for cleanliness
            final_chunk_data = chunk_data.copy()
            final_chunk_data['items'] = [{k: v for k, v in item.items() if k != 'paragraph_index'} for item in final_chunk_data['items']]
            final_json_object[final_chunk_data['id']] = final_chunk_data

            # 2. Process the same document for comments (codes)
            codes_by_coder = extract_comments_as_codes(full_path, chunk_data)
            
            # 3. Merge the results into the aggregated coder data
            for coder, data in codes_by_coder.items():
                for thread_id, thread_data in data["threads"].items():
                    # Ensure the thread exists for the coder
                    agg_thread = aggregated_codes_by_coder[coder]["threads"].setdefault(thread_id, {"id": thread_id, "items": {}})
                    # Merge items from the current doc into the aggregated thread
                    agg_thread["items"].update(thread_data["items"])

    # --- Save the output files ---

    # 1. Save the main data file (same as before)
    output_path = os.path.join(os.getcwd(), output_filename)
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(final_json_object, f, indent=4, ensure_ascii=False)
    print(f"\nMain data conversion complete. Output saved to '{output_path}'")

    # 2. Save a separate JSON file for each coder
    if not aggregated_codes_by_coder:
        print("No comments found to generate coder files.")
        return

    # Create the folder
    os.makedirs("human", exist_ok=True)

    print("\nSaving code files for each coder...")
    for coder, coded_data in aggregated_codes_by_coder.items():
        coder_filename = f"human/{coder.replace(' ', '_')}.json"
        coder_output_path = os.path.join(os.getcwd(), coder_filename)
        with open(coder_output_path, 'w', encoding='utf-8') as f:
            json.dump(coded_data, f, indent=4, ensure_ascii=False)
        print(f"  - Saved codes for '{coder}' to '{coder_output_path}'")
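
The merge step inside the loop can be looked at on its own. This is a minimal sketch under the assumption that each document yields the per-coder structure returned by `extract_comments_as_codes`; the function name `merge_coder_codes`, the coder name, and the sample IDs are hypothetical.

```python
from collections import defaultdict

def merge_coder_codes(aggregated, codes_by_coder):
    """Merge one document's per-coder codes into the running aggregate."""
    for coder, data in codes_by_coder.items():
        for thread_id, thread_data in data["threads"].items():
            # Create the thread for this coder on first sight, then add its items.
            agg_thread = aggregated[coder]["threads"].setdefault(
                thread_id, {"id": thread_id, "items": {}}
            )
            agg_thread["items"].update(thread_data["items"])
    return aggregated

# Hypothetical results from two documents coded by the same person.
doc_a = {"Coder A": {"threads": {"interview1": {"id": "interview1", "items": {
    "interview1-1": {"id": "interview1-1", "codes": ["trust"]}}}}}}
doc_b = {"Coder A": {"threads": {"interview2": {"id": "interview2", "items": {
    "interview2-3": {"id": "interview2-3", "codes": ["privacy"]}}}}}}

aggregated = defaultdict(lambda: {"threads": {}})
merge_coder_codes(aggregated, doc_a)
merge_coder_codes(aggregated, doc_b)
print(sorted(aggregated["Coder A"]["threads"]))
```

Because chunk IDs come from file names, threads from different documents land under different keys and do not overwrite each other.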


## Convert Your Data

In [10]:
# Convert every .docx file in the current folder
root_folder_path = './'
convert_folder_to_json(root_folder_path)  # returns None; results are written to disk
print("Conversion completed successfully!")

Found 1 .docx file(s) to process...
Processing 'example.docx'...

Main data conversion complete. Output saved to 'f:\Minor Solutions\LLM-Qualitative\examples\docx-data\data-example.json'
No comments found to generate coder files.
Conversion completed successfully!
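
As a quick sanity check on the converted data, one option is to count how many items each speaker contributed. The sketch below assumes the output shape produced by `convert_folder_to_json`; `summarize_items` and the sample data are hypothetical.

```python
import json
from collections import Counter

def summarize_items(data):
    """Count converted items per speaker across all chunks."""
    counts = Counter()
    for chunk in data.values():
        for item in chunk["items"]:
            counts[item["uid"]] += 1
    return dict(counts)

# Hypothetical data in the shape written to the main output file.
sample = {
    "example": {
        "id": "example",
        "items": [
            {"id": "example-1", "uid": "Jane Doe", "time": "1970-01-01 00:01:23", "content": "Hello."},
            {"id": "example-2", "uid": "John Roe", "time": "1970-01-01 00:01:40", "content": "Hi."},
            {"id": "example-3", "uid": "Jane Doe", "time": "1970-01-01 00:02:05", "content": "Shall we start?"},
        ],
    }
}

# Round-trip through JSON to mimic loading the saved file.
loaded = json.loads(json.dumps(sample))
print(summarize_items(loaded))

# To check a real run, load the saved file instead:
# with open("data-example.json", encoding="utf-8") as f:
#     print(summarize_items(json.load(f)))
```

A speaker with zero items, or an empty `items` list, usually means the header lines in the document did not match the expected name-and-time pattern.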
