Conversation
- Introduced a unified metadata extraction interface with BaseMetadataExtractor.
- Implemented HWPX, PDF, and PPT metadata extractors adhering to the new interface.
- Updated existing handlers (HWPXHandler, PDFHandler, PPTHandler) to utilize the new extractors.
- Removed legacy metadata extraction functions and replaced them with class-based extractors.
- Enhanced metadata formatting capabilities with a shared MetadataFormatter class.
- Improved logging and error handling during metadata extraction processes.
- Ensured compatibility with existing document processing workflows while enhancing maintainability.
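For readers who have not opened the diff, the extractor/formatter split described above might look roughly like the sketch below. Only the class names `BaseMetadataExtractor` and `MetadataFormatter` come from the commit message; the `extract` and `format` method names, the `DocumentMetadata` container, and the toy `DictMetadataExtractor` are assumptions made for illustration, not the PR's actual code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DocumentMetadata:
    """Hypothetical container; the real field set is not shown in this PR text."""
    title: Optional[str] = None
    author: Optional[str] = None


class BaseMetadataExtractor(ABC):
    """Common interface that each format-specific extractor would implement."""

    @abstractmethod
    def extract(self, source: Any) -> DocumentMetadata:
        """Return metadata for an already opened document object."""


class MetadataFormatter:
    """Shared formatter so every handler renders metadata the same way."""

    def format(self, metadata: DocumentMetadata) -> str:
        lines = []
        if metadata.title:
            lines.append(f"Title: {metadata.title}")
        if metadata.author:
            lines.append(f"Author: {metadata.author}")
        return "\n".join(lines)


class DictMetadataExtractor(BaseMetadataExtractor):
    """Toy extractor reading a plain dict, standing in for a PDF/HWPX/PPT one."""

    def extract(self, source: Any) -> DocumentMetadata:
        return DocumentMetadata(title=source.get("title"), author=source.get("author"))
```

A handler would then hold one extractor and one formatter, so swapping formats only changes which extractor is injected.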
- Implemented HWPImageProcessor for handling HWP-specific image extraction, including BinData streams and OLE embedded images.
- Created HWPXImageProcessor to manage image processing in HWPX format, supporting BinData extraction from ZIP archives.
- Introduced ImageFileImageProcessor for standalone image files, allowing for metadata preservation and format conversion.
- Developed PDFImageProcessor for PDF-specific image handling, including XRef images and page region rendering.
- Added PPTImageProcessor for processing images in PPT/PPTX files, covering slide images and embedded pictures.
- Established TextImageProcessor for text files, maintaining interface consistency despite the absence of embedded images.
- Created utility modules for image file and text processing to streamline image handling across different document types.
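As an illustration of the HWPX item above, BinData extraction from a ZIP archive can be sketched with the standard `zipfile` module. The function name `extract_bindata` and the `BinData/` entry prefix are assumptions here; the PR's HWPXImageProcessor may locate and filter entries differently.

```python
import io
import zipfile
from typing import Dict


def extract_bindata(hwpx_bytes: bytes, prefix: str = "BinData/") -> Dict[str, bytes]:
    """Return {archive entry name: raw bytes} for entries under the given prefix.

    The "BinData/" prefix is an assumption about the archive layout.
    """
    images: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(hwpx_bytes)) as archive:
        for name in archive.namelist():
            # Skip directory entries, which end with a slash.
            if name.startswith(prefix) and not name.endswith("/"):
                images[name] = archive.read(name)
    return images
```

Returning raw bytes keyed by entry name leaves decoding and storage to a later step, which matches the separation between processors and utility modules described in the list.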
- Added BaseFileConverter and specific converters for PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX, CSV, HTML, HWP, HWPX, and image files.
- Introduced TextFileConverter for text file processing with encoding detection.
- Implemented NullFileConverter and PassThroughConverter for handling raw binary data.
- Enhanced CSVFileConverter to support BOM detection and delimiter handling.
- Created DOCFileConverter to auto-detect and convert various DOC formats (RTF, OLE, HTML, DOCX).
- Developed HWPFileConverter for HWP file format conversion.
- Added HWPXFileConverter for handling HWPX files as ZIP archives.
- Integrated image handling with ImageFileConverter for raw image data.
- Updated HWPXHandler to process HWPX files, including text extraction and chart handling.
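The DOC auto-detection and CSV BOM detection mentioned above are both typically done by inspecting leading bytes. The sketch below shows one way to do that. The function names `detect_doc_format` and `strip_bom` are hypothetical, and the HTML check is a loose heuristic, so treat this as an approximation of the idea, not the converters' real logic.

```python
import codecs
from typing import Optional, Tuple

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file signature
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header (DOCX is a ZIP)

# UTF-32 is checked first because its LE BOM begins with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_doc_format(data: bytes) -> str:
    """Guess what a file with a .doc extension really contains."""
    head = data[:64].lstrip()  # tolerate leading whitespace for text formats
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if data.startswith(OLE_MAGIC):
        return "ole"
    if data.startswith(ZIP_MAGIC):
        return "docx"
    lowered = head.lower()
    if lowered.startswith(b"<!doctype html") or lowered.startswith(b"<html"):
        return "html"
    return "unknown"


def strip_bom(data: bytes) -> Tuple[bytes, Optional[str]]:
    """Return (data without BOM, encoding implied by the BOM or None)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], encoding
    return data, None
```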
- Add RTFParser class for parsing RTF files, extracting text, tables, metadata, and images.
- Introduce rtf_region_finder.py to identify excluded regions (headers, footers, footnotes) in RTF documents.
- Create rtf_table_extractor.py for extracting and parsing tables from RTF content, supporting merged cells.
- Develop rtf_text_cleaner.py to clean RTF text by removing control codes and unnecessary elements.
- Enhance modularity by separating functionalities into dedicated modules for better maintainability and readability.
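To make "removing control codes" concrete, here is a deliberately small regex-based cleaner. It is not the contents of rtf_text_cleaner.py: the function name `clean_rtf_text` is hypothetical, and the sketch ignores hex escapes (`\'xx`), `\u` Unicode escapes, and destination groups such as font tables, all of which a real cleaner would need to handle.

```python
import re

# One token is a control word (\word, optional numeric parameter, optional
# delimiting space), a control symbol (backslash + any other character),
# or a bare group brace.
_TOKEN = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?|\\(.)|[{}]", re.DOTALL)


def clean_rtf_text(fragment: str) -> str:
    """Strip RTF control codes from a fragment, keeping visible text."""

    def replace(match: "re.Match[str]") -> str:
        word = match.group(1)
        if word in ("par", "line"):
            return "\n"   # paragraph and line breaks become newlines
        if word == "tab":
            return "\t"
        if word is not None:
            return ""     # every other control word is dropped
        symbol = match.group(3)
        if symbol is not None:
            # Escaped backslash and braces are literal text; other symbols are dropped.
            return symbol if symbol in "\\{}" else ""
        return ""         # unescaped group braces carry no text

    return _TOKEN.sub(replace, fragment)
```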
- Added `BasePreprocessor` abstract class for binary preprocessing.
- Introduced `RTFPreprocessor` to handle RTF binary data, including image extraction and encoding detection.
- Created utility functions for cleaning RTF text and removing unwanted binary data.
- Enhanced `rtf_text_cleaner.py` with improved comments and structure.
- Added `PreprocessedData` dataclass to encapsulate preprocessing results.
- Implemented image processing and storage handling within the RTF preprocessing pipeline.
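A rough picture of how `BasePreprocessor` and `PreprocessedData` could fit together is sketched below. The two class names are from the commit message; every field name, the `preprocess` signature, and the toy `NullByteStrippingPreprocessor` are assumptions for illustration only.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class PreprocessedData:
    """Result bundle; these field names are assumed, not taken from the PR."""
    clean_content: bytes = b""
    encoding: str = "utf-8"
    extracted_resources: Dict[str, Any] = field(default_factory=dict)


class BasePreprocessor(ABC):
    """Contract for format-specific binary preprocessing."""

    @abstractmethod
    def preprocess(self, data: bytes) -> PreprocessedData:
        """Turn raw bytes into cleaned content plus any extracted resources."""


class NullByteStrippingPreprocessor(BasePreprocessor):
    """Toy example of removing unwanted binary data: drops NUL bytes."""

    def preprocess(self, data: bytes) -> PreprocessedData:
        cleaned = data.replace(b"\x00", b"")
        removed = len(data) - len(cleaned)
        return PreprocessedData(
            clean_content=cleaned,
            extracted_resources={"removed_bytes": removed},
        )
```

Using `field(default_factory=dict)` gives each result its own dictionary, which avoids the shared mutable default pitfall in dataclasses.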
- Updated RTFPreprocessor to handle both bytes and file-like objects, improving flexibility in input handling.
- Introduced CSVPreprocessor, DOCPreprocessor, DOCXPreprocessor, ExcelPreprocessor, HTMLPreprocessor, HWPPreprocessor, HWPXPreprocessor, ImageFilePreprocessor, PDFPreprocessor, PPTPreprocessor, and TextPreprocessor classes, each implementing a pass-through preprocessing method.
- Enhanced metadata extraction in preprocessors to provide additional insights about the processed content.
- Improved validation methods across preprocessors to ensure data integrity before processing.
- Added logging for better traceability during preprocessing steps.
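Accepting both bytes and file-like objects usually comes down to normalizing the input once, up front. A minimal sketch of that step follows. The helper name `to_bytes` is hypothetical, and the PR's RTFPreprocessor may instead seek, stream, or validate differently.

```python
import io
from typing import BinaryIO, Union


def to_bytes(source: Union[bytes, bytearray, memoryview, BinaryIO]) -> bytes:
    """Normalize raw bytes or a binary file-like object to a bytes value."""
    if isinstance(source, (bytes, bytearray, memoryview)):
        return bytes(source)
    read = getattr(source, "read", None)
    if callable(read):
        data = read()  # reads from the current position to the end
        if isinstance(data, str):
            raise TypeError("expected a binary file-like object, got text mode")
        return bytes(data)
    raise TypeError(f"unsupported input type: {type(source).__name__}")
```

With this in place, a pass-through preprocessor only needs to call the helper and return the bytes unchanged.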
No description provided.