refactor #8

Merged
CocoRoF merged 10 commits into deploy from main
Jan 22, 2026

@CocoRoF CocoRoF commented Jan 22, 2026

No description provided.

- Introduced a unified metadata extraction interface with BaseMetadataExtractor.
- Implemented HWPX, PDF, and PPT metadata extractors adhering to the new interface.
- Updated existing handlers (HWPXHandler, PDFHandler, PPTHandler) to utilize the new extractors.
- Removed legacy metadata extraction functions and replaced them with class-based extractors.
- Enhanced metadata formatting capabilities with a shared MetadataFormatter class.
- Improved logging and error handling during metadata extraction processes.
- Ensured compatibility with existing document processing workflows while enhancing maintainability.
- Implemented HWPImageProcessor for handling HWP-specific image extraction, including BinData streams and OLE embedded images.
- Created HWPXImageProcessor to manage image processing in HWPX format, supporting BinData extraction from ZIP archives.
- Introduced ImageFileImageProcessor for standalone image files, allowing for metadata preservation and format conversion.
- Developed PDFImageProcessor for PDF-specific image handling, including XRef images and page region rendering.
- Added PPTImageProcessor for processing images in PPT/PPTX files, covering slide images and embedded pictures.
- Established TextImageProcessor for text files, maintaining interface consistency despite the absence of embedded images.
- Created utility modules for image file and text processing to streamline image handling across different document types.
- Added BaseFileConverter and specific converters for PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX, CSV, HTML, HWP, HWPX, and image files.
- Introduced TextFileConverter for text file processing with encoding detection.
- Implemented NullFileConverter and PassThroughConverter for handling raw binary data.
- Enhanced CSVFileConverter to support BOM detection and delimiter handling.
- Created DOCFileConverter to auto-detect and convert various DOC formats (RTF, OLE, HTML, DOCX).
- Developed HWPFileConverter for HWP file format conversion.
- Added HWPXFileConverter for handling HWPX files as ZIP archives.
- Integrated image handling with ImageFileConverter for raw image data.
- Updated HWPXHandler to process HWPX files, including text extraction and chart handling.
- Added RTFParser class for parsing RTF files, extracting text, tables, metadata, and images.
- Introduced rtf_region_finder.py to identify excluded regions (headers, footers, footnotes) in RTF documents.
- Created rtf_table_extractor.py for extracting and parsing tables from RTF content, supporting merged cells.
- Developed rtf_text_cleaner.py to clean RTF text by removing control codes and unnecessary elements.
- Enhanced modularity by separating functionality into dedicated modules for better maintainability and readability.
- Added `BasePreprocessor` abstract class for binary preprocessing.
- Introduced `RTFPreprocessor` to handle RTF binary data, including image extraction and encoding detection.
- Created utility functions for cleaning RTF text and removing unwanted binary data.
- Enhanced `rtf_text_cleaner.py` with improved comments and structure.
- Added `PreprocessedData` dataclass to encapsulate preprocessing results.
- Implemented image processing and storage handling within the RTF preprocessing pipeline.
- Updated RTFPreprocessor to handle both bytes and file-like objects, improving flexibility in input handling.
- Introduced CSVPreprocessor, DOCPreprocessor, DOCXPreprocessor, ExcelPreprocessor, HTMLPreprocessor, HWPPreprocessor, HWPXPreprocessor, ImageFilePreprocessor, PDFPreprocessor, PPTPreprocessor, and TextPreprocessor classes, each implementing a pass-through preprocessing method.
- Enhanced metadata extraction in preprocessors to provide additional insights about the processed content.
- Improved validation methods across preprocessors to ensure data integrity before processing.
- Added logging for better traceability during preprocessing steps.
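
For readers skimming the list, here is a minimal sketch of how the preprocessing pieces named above (`BasePreprocessor`, `PreprocessedData`, the pass-through preprocessors, and the bytes / file-like input handling) could fit together. This is not the code from this PR: the field names, the `validate` and `_read_all` helpers, and the `PassThroughPreprocessor` class are assumptions made for illustration only.

```python
import logging
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, BinaryIO, Dict, Union

logger = logging.getLogger(__name__)


@dataclass
class PreprocessedData:
    """Result container. Field names are hypothetical, not taken from the PR."""
    raw_content: bytes
    clean_content: bytes
    encoding: str = "utf-8"
    metadata: Dict[str, Any] = field(default_factory=dict)


class BasePreprocessor(ABC):
    """Assumed shape of the abstract base class for binary preprocessing."""

    @staticmethod
    def _read_all(source: Union[bytes, bytearray, BinaryIO]) -> bytes:
        # Accept raw bytes directly, or any object exposing read().
        if isinstance(source, (bytes, bytearray)):
            return bytes(source)
        return source.read()

    def validate(self, data: bytes) -> bool:
        # Simplest possible integrity check: reject empty input.
        return len(data) > 0

    @abstractmethod
    def preprocess(self, source: Union[bytes, BinaryIO]) -> PreprocessedData:
        """Turn raw input into a PreprocessedData instance."""


class PassThroughPreprocessor(BasePreprocessor):
    """Hypothetical stand-in for the per-format pass-through preprocessors."""

    def preprocess(self, source: Union[bytes, BinaryIO]) -> PreprocessedData:
        data = self._read_all(source)
        if not self.validate(data):
            raise ValueError("cannot preprocess empty input")
        logger.debug("pass-through preprocessing of %d bytes", len(data))
        # Pass-through: the cleaned content is the input, unchanged.
        return PreprocessedData(
            raw_content=data,
            clean_content=data,
            metadata={"size": len(data)},
        )
```

Under these assumptions, `PassThroughPreprocessor().preprocess(b"...")` and `PassThroughPreprocessor().preprocess(open(path, "rb"))` would go through the same code path, which is the flexibility the bytes / file-like bullet describes.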
@CocoRoF CocoRoF merged commit caf79de into deploy Jan 22, 2026
1 check passed