refactor by CocoRoF · Pull Request #8 · CocoRoF/Contextifier

CocoRoF · 2026-01-22T07:56:22Z

No description provided.

…0.1.6

- Introduced a unified metadata extraction interface with BaseMetadataExtractor.
- Implemented HWPX, PDF, and PPT metadata extractors adhering to the new interface.
- Updated existing handlers (HWPXHandler, PDFHandler, PPTHandler) to utilize the new extractors.
- Removed legacy metadata extraction functions and replaced them with class-based extractors.
- Enhanced metadata formatting capabilities with a shared MetadataFormatter class.
- Improved logging and error handling during metadata extraction processes.
- Ensured compatibility with existing document processing workflows while enhancing maintainability.
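As a rough illustration of what a unified, class-based extractor interface can look like, here is a minimal sketch. Only the names `BaseMetadataExtractor` and `MetadataFormatter` come from the commit message; the `DocumentMetadata` container, the `extract` and `format` method names, and the toy `DictMetadataExtractor` are assumptions made for this example and may not match the actual code in the PR.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DocumentMetadata:
    """Hypothetical container; the real field set is not shown in this PR."""
    title: Optional[str] = None
    author: Optional[str] = None


class BaseMetadataExtractor(ABC):
    """Shared interface. The method name `extract` is an assumption."""

    @abstractmethod
    def extract(self, source: Any) -> DocumentMetadata:
        """Return metadata for a format-specific source object."""


class DictMetadataExtractor(BaseMetadataExtractor):
    """Toy extractor standing in for the HWPX, PDF, and PPT extractors."""

    def extract(self, source: dict) -> DocumentMetadata:
        return DocumentMetadata(
            title=source.get("title"),
            author=source.get("author"),
        )


class MetadataFormatter:
    """Formats populated fields only, one 'key: value' pair per line."""

    def format(self, metadata: DocumentMetadata) -> str:
        fields = {"title": metadata.title, "author": metadata.author}
        return "\n".join(f"{key}: {value}" for key, value in fields.items() if value)
```

With this shape, a handler only depends on the base class and the shared formatter, so adding a new format means adding one extractor subclass.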

- Implemented HWPImageProcessor for handling HWP-specific image extraction, including BinData streams and OLE embedded images.
- Created HWPXImageProcessor to manage image processing in HWPX format, supporting BinData extraction from ZIP archives.
- Introduced ImageFileImageProcessor for standalone image files, allowing for metadata preservation and format conversion.
- Developed PDFImageProcessor for PDF-specific image handling, including XRef images and page region rendering.
- Added PPTImageProcessor for processing images in PPT/PPTX files, covering slide images and embedded pictures.
- Established TextImageProcessor for text files, maintaining interface consistency despite the absence of embedded images.
- Created utility modules for image file and text processing to streamline image handling across different document types.
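The BinData extraction from ZIP archives mentioned for HWPX could be sketched with the standard library alone. The function name `extract_bindata_images`, the `BinData/` prefix default, and the extension list are assumptions for illustration; the actual HWPXImageProcessor may select entries differently.

```python
import io
import zipfile
from typing import Dict

# Assumed set of image extensions; the real processor may accept more.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".bmp")


def extract_bindata_images(hwpx_bytes: bytes, prefix: str = "BinData/") -> Dict[str, bytes]:
    """Return {archive entry name: raw bytes} for image entries under `prefix`."""
    images: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(hwpx_bytes)) as archive:
        for name in archive.namelist():
            # str.endswith accepts a tuple of suffixes.
            if name.startswith(prefix) and name.lower().endswith(IMAGE_EXTENSIONS):
                images[name] = archive.read(name)
    return images
```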

- Added BaseFileConverter and specific converters for PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX, CSV, HTML, HWP, HWPX, and image files.
- Introduced TextFileConverter for text file processing with encoding detection.
- Implemented NullFileConverter and PassThroughConverter for handling raw binary data.
- Enhanced CSVFileConverter to support BOM detection and delimiter handling.
- Created DOCFileConverter to auto-detect and convert various DOC formats (RTF, OLE, HTML, DOCX).
- Developed HWPFileConverter for HWP file format conversion.
- Added HWPXFileConverter for handling HWPX files as ZIP archives.
- Integrated image handling with ImageFileConverter for raw image data.
- Updated HWPXHandler to process HWPX files, including text extraction and chart handling.
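Auto-detecting what a `.doc` file really contains is typically done by inspecting its leading bytes. The sketch below shows one way to do that; the function name `detect_doc_format` and the returned labels are hypothetical, and a ZIP signature is treated as DOCX here even though any ZIP-based container starts the same way.

```python
# Well-known file signatures.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (legacy .doc)
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header (DOCX is a ZIP)


def detect_doc_format(data: bytes) -> str:
    """Classify raw bytes as 'ole', 'docx', 'rtf', 'html', or 'unknown'."""
    if data.startswith(OLE_MAGIC):
        return "ole"
    if data.startswith(ZIP_MAGIC):
        return "docx"
    # Text-based formats may be preceded by whitespace.
    head = data[:512].lstrip()
    if head.startswith(b"{\\rtf"):
        return "rtf"
    lowered = head.lower()
    if lowered.startswith(b"<!doctype html") or lowered.startswith(b"<html"):
        return "html"
    return "unknown"
```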

- Add RTFParser class for parsing RTF files, extracting text, tables, metadata, and images.
- Introduce rtf_region_finder.py to identify excluded regions (headers, footers, footnotes) in RTF documents.
- Create rtf_table_extractor.py for extracting and parsing tables from RTF content, supporting merged cells.
- Develop rtf_text_cleaner.py to clean RTF text by removing control codes and unnecessary elements.
- Enhance modularity by separating functionalities into dedicated modules for better maintainability and readability.
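To give a sense of what removing control codes involves, here is a deliberately small sketch of an RTF cleaner. It is not the code in rtf_text_cleaner.py; the function name `clean_rtf_text` is hypothetical, and the sketch ignores hex escapes (`\'hh`), Unicode escapes, and destination groups such as font tables, which a real cleaner has to handle.

```python
import re

# One pass over three token kinds:
#   1. control words such as \b, \b0, \par (optional numeric parameter and
#      one optional trailing space, which RTF treats as a delimiter)
#   2. escaped literals: \\ \{ \}
#   3. bare group braces
_TOKEN = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?|\\([\\{}])|[{}]")


def clean_rtf_text(rtf: str) -> str:
    """Strip control words and group braces, keeping visible text."""

    def replace(match: "re.Match[str]") -> str:
        word, _param, literal = match.groups()
        if literal is not None:
            return literal            # keep escaped \, { or } as plain text
        if word in ("par", "line"):
            return "\n"               # paragraph and line breaks become newlines
        return ""                     # drop other control words and braces

    return _TOKEN.sub(replace, rtf).strip()
```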

- Added `BasePreprocessor` abstract class for binary preprocessing.
- Introduced `RTFPreprocessor` to handle RTF binary data, including image extraction and encoding detection.
- Created utility functions for cleaning RTF text and removing unwanted binary data.
- Enhanced `rtf_text_cleaner.py` with improved comments and structure.
- Added `PreprocessedData` dataclass to encapsulate preprocessing results.
- Implemented image processing and storage handling within the RTF preprocessing pipeline.
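A minimal sketch of how an abstract preprocessor and its result dataclass might fit together follows. The class names `BasePreprocessor` and `PreprocessedData` come from the commit message, but every field name, the `preprocess` signature, and the toy `NulStrippingPreprocessor` are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PreprocessedData:
    """Result of a preprocessing step. Field names are assumed, not from the PR."""
    clean_content: bytes
    encoding: str = "utf-8"
    image_tags: List[str] = field(default_factory=list)
    extracted: Dict[str, Any] = field(default_factory=dict)


class BasePreprocessor(ABC):
    """Abstract binary preprocessor. The `preprocess` signature is an assumption."""

    @abstractmethod
    def preprocess(self, data: bytes) -> PreprocessedData:
        """Turn raw bytes into a PreprocessedData instance."""


class NulStrippingPreprocessor(BasePreprocessor):
    """Toy subclass that removes NUL bytes and records how many were dropped."""

    def preprocess(self, data: bytes) -> PreprocessedData:
        cleaned = data.replace(b"\x00", b"")
        return PreprocessedData(
            clean_content=cleaned,
            extracted={"removed_bytes": len(data) - len(cleaned)},
        )
```

Using `field(default_factory=...)` gives each result its own list and dict instead of sharing mutable defaults between instances.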

…RTF type

- Updated RTFPreprocessor to handle both bytes and file-like objects, improving flexibility in input handling.
- Introduced CSVPreprocessor, DOCPreprocessor, DOCXPreprocessor, ExcelPreprocessor, HTMLPreprocessor, HWPPreprocessor, HWPXPreprocessor, ImageFilePreprocessor, PDFPreprocessor, PPTPreprocessor, and TextPreprocessor classes, each implementing a pass-through preprocessing method.
- Enhanced metadata extraction in preprocessors to provide additional insights about the processed content.
- Improved validation methods across preprocessors to ensure data integrity before processing.
- Added logging for better traceability during preprocessing steps.
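Accepting both bytes and file-like objects usually comes down to normalizing the input once at the boundary. The helper below is a sketch of that idea; `read_all_bytes` is a hypothetical name, and rewinding seekable streams before reading is a design choice assumed here, not something the PR states.

```python
import io
from typing import BinaryIO, Union


def read_all_bytes(source: Union[bytes, bytearray, BinaryIO]) -> bytes:
    """Return the full content of `source` as bytes."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if hasattr(source, "read"):
        # Rewind if possible so a partially consumed stream is read in full.
        if getattr(source, "seekable", lambda: False)():
            source.seek(0)
        data = source.read()
        if not isinstance(data, bytes):
            raise TypeError("file-like object must be opened in binary mode")
        return data
    raise TypeError(f"unsupported input type: {type(source).__name__}")
```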

CocoRoF added 10 commits January 21, 2026 10:50

refactor: Consolidate chart processing modules and update version to 0.1.6

fce02ce

refactor: Remove unused Excel chart constants module

8750258

refactor: Reintroduce HWPX handler with complete implementation

14429f4

feat: Refactor RTF handling and update document processor to support RTF type

1fcd131

CocoRoF merged commit caf79de into deploy Jan 22, 2026
1 check passed

CocoRoF merged 10 commits into deploy from main
