# 📄 DOCX to JSON Data Processing Pipeline

## Source of the Data

The `.docx` files processed in this notebook were **distilled using a Large Language Model (LLM)** from complex academic and technical PDFs. This distillation step extracts structured chapter text, headings, and subheadings into a consistent format suitable for automated parsing.

## What This Notebook Does

- Loads `.docx` files containing chapter-based content using `python-docx`
- Parses and cleans text, preserving paragraph structure and headings
- **Transforms the raw `.docx` data into structured `.json` files** that can be used for further processing
- Prepares the output to be used in later stages, such as generating synthetic noisy TOCs or training/evaluation pipelines

## Adding Your Own Data

You can easily extend this workflow by adding your own `.docx` files. To ensure your data is compatible:

- Follow the **same formatting structure** (use clear section headings, avoid unusual indentation or styles)
- Save your files to the `synthetic_data/docx/` directory
- The parser expects documents to be organized with **consistent, LLM-friendly formatting**, as seen in the original examples

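Once added, your `.docx` file will be parsed and included in the same JSON-based transformation process, provided its name ends in `_full.docx` or `_chapter.docx`. The processing cell ignores files with any other name.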
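
As a quick pre-flight check before adding files, the naming rules can be sketched as below. This is only a minimal sketch: the helper name `classify_docx` is hypothetical and not part of the project, while the two suffixes and the `~` prefix rule are taken from the processing cell later in this notebook.

```python
from typing import Optional


def classify_docx(filename: str) -> Optional[str]:
    """Return "full", "chapter", or None for a file name.

    Hypothetical helper that mirrors the suffix checks in the processing cell.
    """
    # Temporary files (names starting with "~") are skipped
    if filename.startswith("~"):
        return None
    if filename.endswith("_full.docx"):
        return "full"
    if filename.endswith("_chapter.docx"):
        return "chapter"
    # Any other name is not picked up by the pipeline
    return None


for name in ["2_chapter.docx", "my_book_full.docx", "notes.docx", "~$2_chapter.docx"]:
    print(name, "->", classify_docx(name))
```

A result of `None` means the file would be silently skipped, so renaming it is the simplest fix.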


In [None]:
import json
import os
import sys

from docx import Document

# Make the project's `src` directory importable from the notebook
sys.path.append(os.path.abspath('../src'))
from distilled_data_processing.toc_parsing import parse_structured_toc

## Inspecting a Sample `.docx`

Load a `.docx` file, join its non-empty paragraphs into a single newline-separated string, and print the first ten paragraphs to inspect the document structure.
Then parse the joined text into a structured JSON format using `parse_structured_toc` and print the result for review.
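
The paragraph-joining step can be tried out without a `.docx` file. The sketch below uses plain strings as a stand-in for each paragraph's `text` attribute; the function name `join_nonempty` and the sample list are illustrative assumptions, not project code.

```python
from typing import Iterable


def join_nonempty(paragraph_texts: Iterable[str]) -> str:
    """Join paragraph texts with newlines, dropping empty or whitespace-only ones."""
    return "\n".join(text for text in paragraph_texts if text.strip())


# Stand-in for [p.text for p in doc.paragraphs]
sample = [
    "1. Artificial Intelligence and Machine Learning",
    "",
    "Chapter 1: Introduction to Artificial Intelligence",
    "   ",
    "Chapter 2: Machine Learning Fundamentals",
]
joined = join_nonempty(sample)
print(joined)
```

Note that the filter only tests `text.strip()`; the text that is kept is not stripped, so leading or trailing whitespace inside a paragraph survives the join.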


In [None]:
# Inspect data
doc = Document("../synthetic_data/docx/2_chapter.docx")

# Join every non-empty paragraph (headings included) into one string
docx_text = "\n".join(p.text for p in doc.paragraphs if p.text.strip())

# Preview the first 10 paragraphs of the raw document
for p in doc.paragraphs[:10]:
    print(p.text)

# Inspect the final JSON format
parsed = parse_structured_toc(docx_text)
print(json.dumps(parsed, indent=2, ensure_ascii=False))

1. Artificial Intelligence and Machine Learning
Chapter 1: Introduction to Artificial Intelligence
Chapter 2: Machine Learning Fundamentals
Chapter 3: Neural Networks and Deep Learning
Chapter 4: Computer Vision Applications
Chapter 5: Natural Language Processing
Chapter 6: Reinforcement Learning Systems
Chapter 7: AI Ethics and Bias Prevention
Chapter 8: Automated Decision Making
Chapter 9: AI in Healthcare and Medicine


## Processing Text and Storing It in a Local Directory

This cell lists the files in the source directory, skips temporary files (names starting with `~`, typically Word lock files), and processes files ending in `_full.docx` and `_chapter.docx` separately. Each document is loaded, its non-empty paragraphs are joined, and the text is parsed into structured TOC data using `parse_structured_toc`. The parsed results are collected into two lists and saved as `all_full.json` and `all_chapters.json` in the output directory for further use.
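
The save step relies on two standard-library details that are easy to miss: the output directory must exist before `open` is called in write mode, and `ensure_ascii=False` keeps non-ASCII characters readable in the file. The sketch below demonstrates both inside a temporary directory. The record is a made-up placeholder, and its shape is not meant to reflect what `parse_structured_toc` actually returns.

```python
import json
import os
import tempfile

# Placeholder record; the real structure comes from parse_structured_toc
records = [{"title": "Chapter 1: Réseaux de neurones"}]

with tempfile.TemporaryDirectory() as tmp:
    output_dir = os.path.join(tmp, "json")
    # Without this call, open(..., "w") fails when the directory is missing
    os.makedirs(output_dir, exist_ok=True)

    out_path = os.path.join(output_dir, "all_chapters.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

    # Read the file back, both as raw text and as parsed JSON
    with open(out_path, encoding="utf-8") as f:
        raw = f.read()
    loaded = json.loads(raw)

print("é" in raw)         # the accented character is stored literally
print(loaded == records)  # the data survives the round trip
```

With the default `ensure_ascii=True`, the same character would be written as a `\u00e9` escape, which loads identically but is harder to read when inspecting the JSON by hand.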


In [None]:
data_dir = "../synthetic_data/docx/"
output_dir = "../synthetic_data/json"
# Skip temporary files (names starting with "~", e.g. Word lock files)
files = [file for file in os.listdir(data_dir) if not file.startswith("~")]

full_results = []
chapters_results = []

for file in files:
    # Route each file to the matching result list based on its suffix
    if file.endswith("_full.docx"):
        target = full_results
    elif file.endswith("_chapter.docx"):
        target = chapters_results
    else:
        continue

    doc = Document(os.path.join(data_dir, file))
    docx_text = "\n".join(p.text for p in doc.paragraphs if p.text.strip())
    target.extend(parse_structured_toc(docx_text))

# Create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

with open(os.path.join(output_dir, "all_full.json"), "w", encoding="utf-8") as f:
    json.dump(full_results, f, indent=2, ensure_ascii=False)

with open(os.path.join(output_dir, "all_chapters.json"), "w", encoding="utf-8") as f:
    json.dump(chapters_results, f, indent=2, ensure_ascii=False)