Rakshitha-a18/pdf-parser


📄 PDF & Document Parser using Python

This project is a Python-based document parsing application developed using Jupyter Notebook. It is designed to extract text, metadata, and tables from documents such as PDF, DOCX, and TXT files, and store the processed information in a clean and reusable format.

The parser is useful for document automation, data extraction, preprocessing tasks, and as a base for NLP pipelines.


✨ Key Features

  • Page-wise text extraction from PDF files
  • Table extraction from PDFs using pdfplumber and export to CSV format
  • Text and table extraction from DOCX files
  • TXT file processing and cleaning
  • Automatic text cleaning to remove extra spaces, blank lines, and hyphenation issues
  • Automatic creation of input and output folders
  • Generation of a summary.json file with processing details
  • Graceful handling of unsupported file formats and runtime errors

πŸ› οΈ Technologies & Libraries Used

Python 3, Jupyter Notebook, pandas, PyPDF2, pdfplumber, python-docx, os, json


▶️ Running the Project

After cloning the repository, create and activate a Python virtual environment, install the required dependencies from requirements.txt, and launch Jupyter Notebook. Place input files inside the input_files folder and run all cells in the notebook to generate outputs automatically.

Supported input formats include PDF, DOCX, and TXT files.
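
As an illustration of how folder creation and format checking might fit together, here is a small sketch. It assumes the input folder is named `input_files` as stated above; the output folder name `output_files` and the helper names `ensure_folders` and `classify` are hypothetical and may differ from the notebook.

```python
from pathlib import Path

# Extensions the README lists as supported inputs.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt"}


def ensure_folders(base_dir, names=("input_files", "output_files")):
    """Create the working folders under base_dir if they are missing.

    "output_files" is an assumed name used only for this sketch.
    """
    created = []
    for name in names:
        folder = Path(base_dir) / name
        # exist_ok=True makes repeated runs safe.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


def classify(path):
    """Return 'pdf', 'docx', or 'txt' for supported files, else 'unsupported'."""
    suffix = Path(path).suffix.lower()
    if suffix in SUPPORTED_SUFFIXES:
        return suffix.lstrip(".")
    return "unsupported"
```

Returning a label such as `"unsupported"` instead of raising lets a batch run record the skipped file and continue with the rest.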


📤 Output

The project generates cleaned text files, extracted tables in CSV format, and a summary.json file containing metadata and processing status for each document.
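
A summary file of this kind could be written with the standard `json` module, as sketched below. The record fields (`file`, `type`, `status`, `pages`) and the function name `write_summary` are assumptions for illustration; the README does not specify the actual schema.

```python
import json
from pathlib import Path


def write_summary(records, out_dir):
    """Write a list of per-document records to <out_dir>/summary.json."""
    out_path = Path(out_dir) / "summary.json"
    # Make sure the output folder exists before writing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII file names readable.
        json.dump(records, fh, indent=2, ensure_ascii=False)
    return out_path


# Hypothetical records: one entry per processed document.
example_records = [
    {"file": "report.pdf", "type": "pdf", "status": "ok", "pages": 3},
    {"file": "image.png", "type": "unsupported", "status": "skipped"},
]
```

Calling `write_summary(example_records, "output_files")` would then produce a single JSON file that downstream steps can load to check which documents succeeded.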


📊 Use Cases

This project can be used for PDF text and table extraction, document preprocessing for NLP tasks, data extraction for analytics, automated document ingestion pipelines, and parsing resumes, reports, or academic documents.


⚠️ Notes

Virtual environment folders are excluded using .gitignore. Output files may be committed or ignored based on project requirements. The project is designed to be easily extendable for advanced features.


🚀 Future Enhancements

Planned improvements include keyword extraction, text summarization, a Streamlit-based web interface, a command-line interface, and OCR support for scanned PDF documents.


πŸ‘©β€πŸ’» Author

Rakshitha


📄 License

This project is created for learning and demonstration purposes.
