This project is a Python-based document parsing application developed in a Jupyter Notebook. It extracts text, metadata, and tables from PDF, DOCX, and TXT files and saves the results as cleaned text files, CSV tables, and a JSON summary.
The parser is useful for document automation, data extraction, preprocessing tasks, and as a base for NLP pipelines.
- Page-wise text extraction from PDF files
- Table extraction from PDFs using pdfplumber and export to CSV format
- Text and table extraction from DOCX files
- TXT file processing and cleaning
- Automatic text cleaning to remove extra spaces, blank lines, and hyphenation issues
- Automatic creation of input and output folders
- Generation of a summary.json file with processing details
- Graceful handling of unsupported file formats and runtime errors
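
The text-cleaning step listed above could look roughly like the following. This is a minimal sketch using only the standard library, not the notebook's actual code; the function name `clean_text` and the exact regular expressions are assumptions made for illustration.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize whitespace and rejoin words hyphenated across line breaks."""
    # Normalize Windows and old Mac line endings first.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at a line end: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop the ones that end up empty.
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

One trade-off of this approach: a genuine compound such as "page-wise" that happens to break at a line end would also lose its hyphen, so a stricter rule (for example, a dictionary check) may be preferable for some documents.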
The project is built with Python 3, Jupyter Notebook, pandas, PyPDF2, pdfplumber, and python-docx, along with the standard-library os and json modules.
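
As a rough illustration of how pdfplumber might be used for page-wise text and table export, consider the sketch below. It is written under several assumptions: the function names, the `{stem}_p{page}_t{table}.csv` naming pattern, and the use of the standard-library `csv` module (rather than pandas, which the project lists) are all hypothetical choices, and the notebook may divide the work between PyPDF2 and pdfplumber differently.

```python
import csv
from pathlib import Path


def write_table_csv(rows, csv_path):
    """Write one extracted table (a list of rows) to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for row in rows:
            # pdfplumber reports empty cells as None; store them as empty strings.
            writer.writerow(["" if cell is None else cell for cell in row])


def extract_pdf(pdf_path, out_dir):
    """Return per-page text and write each detected table to its own CSV."""
    # Third-party dependency, imported here so the CSV helper works without it.
    import pdfplumber

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    pages_text, csv_files = [], []
    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            # extract_text() can yield nothing for image-only pages.
            pages_text.append(page.extract_text() or "")
            for table_no, rows in enumerate(page.extract_tables(), start=1):
                csv_path = out_dir / f"{stem}_p{page_no}_t{table_no}.csv"
                write_table_csv(rows, csv_path)
                csv_files.append(str(csv_path))
    return pages_text, csv_files
```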
After cloning the repository, create and activate a Python virtual environment, install the required dependencies from requirements.txt, and launch Jupyter Notebook. Place input files inside the input_files folder and run all cells in the notebook to generate outputs automatically.
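
The folder setup that happens when the notebook runs might be sketched as follows. Only the `input_files` name comes from this README; the `output_files` name and both function names are assumptions for illustration.

```python
from pathlib import Path


def prepare_folders(base_dir=".", input_name="input_files", output_name="output_files"):
    """Create the input and output folders if they do not exist yet."""
    base = Path(base_dir)
    input_dir = base / input_name
    output_dir = base / output_name  # hypothetical name, not confirmed by the README
    # exist_ok=True makes the call safe to repeat on every notebook run.
    input_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)
    return input_dir, output_dir


def list_input_files(input_dir):
    """Return the regular files in the input folder in a stable, sorted order."""
    return sorted(p for p in Path(input_dir).iterdir() if p.is_file())
```

Calling `prepare_folders()` in the first notebook cell and then looping over `list_input_files(input_dir)` would give each later cell a predictable set of paths to work with.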
Supported input formats include PDF, DOCX, and TXT files.
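
One way to route files by extension while handling unsupported formats and runtime errors gracefully is sketched below. The `HANDLERS` table, the `parse_document` name, and the shape of the result dictionary are assumptions; only a TXT handler is included here, and PDF and DOCX handlers would be registered the same way.

```python
from pathlib import Path


def read_txt(path):
    """Read a plain-text file, replacing any undecodable bytes."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Hypothetical registry; ".pdf" and ".docx" entries would be added alongside ".txt".
HANDLERS = {".txt": read_txt}


def parse_document(path, handlers=HANDLERS):
    """Parse one file and report its status instead of raising."""
    path = Path(path)
    result = {"file": path.name, "status": "ok", "text": "", "error": None}
    handler = handlers.get(path.suffix.lower())
    if handler is None:
        result["status"] = "unsupported"
        result["error"] = f"Unsupported file format: {path.suffix or '(none)'}"
        return result
    try:
        result["text"] = handler(path)
    except Exception as exc:  # keep the batch running if one file fails
        result["status"] = "error"
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result
```

Returning a status record rather than raising means one corrupt or unexpected file does not stop the rest of the batch.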
The project generates cleaned text files, extracted tables in CSV format, and a summary.json file containing metadata and processing status for each document.
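
Writing the summary file could be sketched like this. The README does not document the schema, so every key shown (`generated_at`, `total_files`, `succeeded`, `documents`) and the per-document `status` field are assumptions, not the notebook's actual format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_summary(records, output_dir):
    """Write a summary.json describing each processed document."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    summary = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "total_files": len(records),
        # Assumes each record carries a "status" key; "ok" marks success.
        "succeeded": sum(1 for r in records if r.get("status") == "ok"),
        "documents": records,
    }
    summary_path = output_dir / "summary.json"
    with open(summary_path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-English file names readable.
        json.dump(summary, handle, indent=2, ensure_ascii=False)
    return summary_path
```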
This project can be used for PDF text and table extraction, document preprocessing for NLP tasks, data extraction for analytics, automated document ingestion pipelines, and parsing resumes, reports, or academic documents.
Virtual environment folders are excluded using .gitignore. Output files may be committed or ignored, depending on project requirements. The project is designed to be easy to extend with more advanced features.
Planned improvements include keyword extraction, text summarization, a Streamlit-based web interface, a command-line interface, and OCR support for scanned PDF documents.
Rakshitha
This project was created for learning and demonstration purposes.