📄 PDF & Document Parser using Python

This project is a Python-based document parsing application developed in a Jupyter Notebook. It extracts text, metadata, and tables from PDF, DOCX, and TXT files and stores the processed information in a clean, reusable format.

The parser is useful for document automation, data extraction, preprocessing tasks, and as a base for NLP pipelines.

✨ Key Features

Page-wise text extraction from PDF files
Table extraction from PDFs using pdfplumber and export to CSV format
Text and table extraction from DOCX files
TXT file processing and cleaning
Automatic text cleaning to remove extra spaces, blank lines, and hyphenation issues
Automatic creation of input and output folders
Generation of a summary.json file with processing details
Graceful handling of unsupported file formats and runtime errors
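
The cleaning step listed above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the function name clean_text is hypothetical, and "hyphenation issues" is assumed to mean words split with a hyphen across a line break.

```python
import re


def clean_text(raw: str) -> str:
    """Hypothetical cleaner: fixes line-break hyphenation, extra spaces, blank lines."""
    # Rejoin words split across a line break, e.g. "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and drop the ones that end up empty.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "Data extrac-\ntion   is  fun.\n\n\n  Next line. "
    print(clean_text(sample))
```

Note that this simple rule also joins genuinely hyphenated compounds that happen to break at a line end, so a real implementation may need a dictionary check or a more conservative pattern.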

🛠️ Technologies & Libraries Used

Python 3, Jupyter Notebook, pandas, PyPDF2, pdfplumber, python-docx, os, json

▶️ Running the Project

After cloning the repository, create and activate a Python virtual environment, install the dependencies listed in requirements.txt, and launch Jupyter Notebook. Place the input files in the input_files folder, open Main.ipynb, and run all cells to generate the outputs automatically.

Supported input formats include PDF, DOCX, and TXT files.
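
One way to route files by format while skipping unsupported ones is sketched below. It is an assumption-laden illustration, not the notebook's code: the names SUPPORTED and classify_inputs are hypothetical, and matching on the file extension (case-insensitively) is an assumed design choice.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a parser label.
SUPPORTED = {".pdf": "pdf", ".docx": "docx", ".txt": "txt"}


def classify_inputs(folder):
    """Split the files in `folder` into (name, kind) pairs and skipped names."""
    folder = Path(folder)
    # Create the folder if it is missing, mirroring the automatic folder creation feature.
    folder.mkdir(parents=True, exist_ok=True)
    supported, skipped = [], []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue  # ignore sub-folders such as .ipynb_checkpoints
        kind = SUPPORTED.get(path.suffix.lower())
        if kind is None:
            skipped.append(path.name)  # unsupported format: record it, do not raise
        else:
            supported.append((path.name, kind))
    return supported, skipped


if __name__ == "__main__":
    ok, ignored = classify_inputs("input_files")
    print("to parse:", ok)
    print("skipped:", ignored)
```

Collecting skipped files in a list instead of raising an exception lets one bad input be reported without stopping the rest of the batch.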

📤 Output

The project generates cleaned text files, extracted tables in CSV format, and a summary.json file containing metadata and processing status for each document.
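
As a rough sketch of how a summary.json file of this kind might be written: the record fields used here (file, type, pages, status) and the function name write_summary are hypothetical, since the actual schema is not documented in this README.

```python
import json
from pathlib import Path


def write_summary(records, output_dir="output"):
    """Write a list of per-document records to <output_dir>/summary.json."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    summary_path = out / "summary.json"
    summary_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return summary_path


if __name__ == "__main__":
    # Hypothetical records; the field names are illustrative only.
    records = [
        {"file": "report.pdf", "type": "pdf", "pages": 3, "status": "success"},
        {"file": "image.png", "type": None, "pages": None, "status": "unsupported format"},
    ]
    print(write_summary(records))
```

Keeping one record per input file, including the ones that failed or were skipped, makes the summary usable as a processing log.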

📊 Use Cases

This project can be used for PDF text and table extraction, document preprocessing for NLP tasks, data extraction for analytics, automated document ingestion pipelines, and parsing resumes, reports, or academic documents.

⚠️ Notes

Virtual environment folders are excluded through .gitignore. Output files may be committed or ignored, depending on project requirements. The project is designed to be easy to extend with additional features.

🚀 Future Enhancements

Planned improvements include keyword extraction, text summarization, a Streamlit-based web interface, a command-line interface, and OCR support for scanned PDF documents.

👩‍💻 Author

Rakshitha

📄 License

This project was created for learning and demonstration purposes.
