PDF-Parser

Description

This project parses PDF documents to obtain the hierarchical relations among the various components of a PDF.

Parse a PDF and obtain its text and tables.
Find headers for paragraphs, headers for tables, and page_numbers.

What makes this project unique/valuable?

While many PDF parsers exist that extract table and text information, we couldn't find one that associates a header with each component of a PDF, such as a paragraph or a table. Many projects rely on structured information from documents, not just a dump of all the text.
The easy-to-understand JSON output structure makes the results simple to reuse in further tasks.
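
To illustrate what header-associated output could look like, here is a minimal sketch. Every key name in it (header, page_number, paragraphs, tables, children) is an assumption made for this example, not the parser's documented schema, so check the actual output before relying on any of them.

```python
# Hypothetical output shape: all key names below are assumptions for illustration only.
sample_output = [
    {
        "header": "1. Introduction",
        "page_number": 1,
        "paragraphs": ["Sample paragraph under the introduction."],
        "tables": [],
        "children": [
            {
                "header": "1.1 Scope",
                "page_number": 2,
                "paragraphs": ["Sample paragraph under the scope heading."],
                "tables": [[["Item", "Qty"], ["Widget", "3"]]],
                "children": [],
            }
        ],
    }
]


def iter_paragraphs(nodes, path=()):
    """Yield (header_path, page_number, paragraph) for every paragraph in the tree."""
    for node in nodes:
        current = path + (node["header"],)
        for para in node.get("paragraphs", []):
            yield current, node.get("page_number"), para
        # Recurse into nested sections so each paragraph keeps its full header trail.
        yield from iter_paragraphs(node.get("children", []), current)


if __name__ == "__main__":
    for headers, page, text in iter_paragraphs(sample_output):
        print(" > ".join(headers), f"(page {page}):", text)
```

Keeping the header trail with each paragraph is what separates this kind of output from a plain text dump: a downstream task can filter or group content by section without re-parsing the document.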

Instructions for a new setup

Clone the repository.
Switch to this branch.
Create a virtual environment, preferably outside the cloned folder.
Upgrade pip if required.
Install Ghostscript (sudo apt-get install ghostscript).
Install the dependencies listed in requirements.txt.
Run the get_results function in main.py, passing it the pdf_file_path.
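
The last step can be sketched as follows. This assumes that get_results takes the PDF path as its single argument and returns JSON-serializable data. Neither detail is confirmed here, and the helper name save_results is hypothetical.

```python
import json
from pathlib import Path


def save_results(pdf_file_path, parse, out_path=None):
    """Run a parser callable on a PDF path and write its result to a JSON file.

    `parse` stands in for get_results from main.py. Assumed (not confirmed):
    it accepts one positional argument, the PDF path, and returns data that
    json.dumps can serialize.
    """
    pdf_file_path = Path(pdf_file_path)
    result = parse(str(pdf_file_path))
    # Default to writing next to the PDF, with a .json extension.
    out_path = Path(out_path) if out_path else pdf_file_path.with_suffix(".json")
    out_path.write_text(
        json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_path
```

From inside the cloned folder, with the virtual environment active, this might be called as save_results("sample.pdf", get_results) after from main import get_results, where sample.pdf is a placeholder file name.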

sudo apt-get install ghostscript
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install wheel
cd PDF-Parser
pip install -r requirements.txt
python main.py
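
On a new machine it can help to confirm that Ghostscript actually ended up on the PATH before running the parser. A minimal sketch, assuming the executable is named gs (the name used on Linux); the helper name find_executable is hypothetical.

```python
import shutil


def find_executable(name="gs"):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("Ghostscript not found; try: sudo apt-get install ghostscript")
    else:
        print(f"Ghostscript found at {path}")
```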

Abhishek-Rnjn/PDF-Parser
