This project provides a tool for extracting text from PDF files and saving the extracted text into CSV format. It is designed to process all PDF files within a specified directory, starting the text extraction from a user-configured page. This tool is particularly useful for converting large volumes of PDF documents into a more accessible and manageable format.
- Bulk Processing: Automatically processes all PDF files found in a specified directory.
- Configurable Start Page: Begins text extraction from a user-defined page number to skip irrelevant content.
- Logging: Includes detailed logging of the processing steps and outcomes, aiding in debugging and verification of the extraction process.
- Python 3.x
- pdfplumber
- A configuration file named
config.inifor setting up the PDF directory path and start page.
- Ensure that Python 3 is installed on your system.
- Install the required Python packages:
pip install pdfplumber - Clone this repository to your local machine.
Before running the script, configure the config.ini file with the following parameters:
PDFDirectoryPath: The path to the directory containing the PDF files to be processed.StartPage: The page number at which to start text extraction (1-indexed).
Example config.ini:
[Paths]
PDFDirectoryPath = path/to/your/pdf/directory
StartPage = 2To run the script, simply execute the following command in your terminal:
python main.py
Ensure that the config.ini file is in the same directory as main.py, or adjust the config_path variable in the script accordingly.
The script generates a log file named pdf_to_csv.log, containing information about the processing of each PDF file. This includes messages about the start of processing, successful extractions, and any errors encountered. To view the log messages in the console as well, set print_log_messages=True in the get_logger function call.
Contributions to this project are welcome. Please submit a pull request or open an issue to suggest improvements or report bugs.