A command-line tool that extracts list items from PDF documents into CSV files and merges multiple CSV files, adding a new sequential ID column so that every row in the merged file has a unique identifier.
- Extract Items from PDF: Convert list-like structures in PDF documents into structured CSV format.
- Merge CSV Files: Combine multiple CSV files into a single file, maintaining unique IDs through a newly generated sequential ID column.
- Python 3.6+
- PyMuPDF (fitz)
- Pandas
- Typer
Clone the repository and install dependencies:
git clone https://github.com/GeroZayas/PDF-itemslist-extractor.git
cd PDF-itemslist-extractor
pip install -r requirements.txt
Extract Items from PDF
python your_script_name.py extract_and_save ./path/to/your/pdf/file.pdf ./desired/output/path/
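As a rough illustration of what "list-like structures" could mean in practice, the sketch below picks bulleted or numbered lines out of plain text and writes them as CSV. It is not the tool's actual implementation: the regular expression, the function names `extract_items` and `items_to_csv`, and the `item` column name are all assumptions made for this example, and the input text is assumed to have already been pulled from the PDF (for instance with PyMuPDF's `page.get_text()`).

```python
import csv
import io
import re

# Hypothetical rule: a line is a list item if it starts with a bullet
# character (-, *, or a round bullet) or a number followed by "." or ")".
ITEM_PATTERN = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+(.+?)\s*$")


def extract_items(text):
    """Return the text of every list-like line, in document order."""
    items = []
    for line in text.splitlines():
        match = ITEM_PATTERN.match(line)
        if match:
            items.append(match.group(1))
    return items


def items_to_csv(items):
    """Serialize items as CSV text: a header row, then one item per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item"])  # column name is an assumption
    writer.writerows([item] for item in items)
    return buffer.getvalue()


# Stand-in for text extracted from a PDF page.
sample_text = "Shopping list\n1. Apples\n2) Pears\n- Plums\nNot an item\n"
sample_items = extract_items(sample_text)
sample_csv = items_to_csv(sample_items)
print(sample_csv)
```

Lines that do not begin with a bullet or a number marker, such as the title and the closing sentence in the sample, are skipped. Real PDFs often need looser or stricter rules, so treat the pattern as a starting point.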
Merge Multiple CSV Files
python your_script_name.py merge_csv_files ./file1.csv ./file2.csv ./merged_output.csv
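To show why a newly generated sequential ID column keeps rows unique after a merge, here is a minimal sketch using only the standard library `csv` module. The tool itself depends on Pandas, so this is not its real code; the function names `read_csv_text` and `merge_rows` and the `new_id` column name are hypothetical. With Pandas, a comparable effect could likely be achieved with `pd.concat(frames, ignore_index=True)` followed by inserting an ID column.

```python
import csv
import io


def read_csv_text(text):
    """Parse CSV text into a list of dict rows keyed by the header."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_rows(tables, id_column="new_id"):
    """Concatenate tables and give every row a fresh sequential ID.

    Original columns (including any old ID column) are kept unchanged.
    """
    merged = []
    next_id = 1
    for rows in tables:
        for row in rows:
            merged.append({id_column: next_id, **row})
            next_id += 1
    return merged


# Two inputs whose own "id" columns collide (both contain id 1).
file1 = "id,name\n1,Apples\n2,Pears\n"
file2 = "id,name\n1,Plums\n"

merged = merge_rows([read_csv_text(file1), read_csv_text(file2)])
for row in merged:
    print(row)
```

The per-file `id` values repeat across inputs, but the generated `new_id` values run 1, 2, 3 across the combined rows, so each merged row can still be identified unambiguously.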
Assuming you have a PDF named example.pdf and two CSV files named data1.csv and data2.csv, you can extract items from the PDF and merge the CSV files as follows:
python your_script_name.py extract_and_save ./example.pdf ./extracted_items.csv
python your_script_name.py merge_csv_files ./data1.csv ./data2.csv ./merged_data.csv
Contributions are welcome. Feel free to submit a pull request or open an issue to discuss improvements or report bugs.
Gero Zayas - @gerozayas