This script allows you to quickly and easily search multiple PDF files located in a single folder.

LearnFL/proj-article-reader

Multiple PDF files reader

Disclaimer

This project is a work in progress and was designed to accomplish a specific task. The script currently works as intended, but this is the first version, so expect modifications and bug fixes over time. Constructive feedback and suggestions are welcome.

Description

This script lets you search multiple PDF files located in a single folder. You can specify search parameters: keywords, words that must be present, and words that must not be present. The script then builds a basic regular expression pattern to search the PDFs, or you can supply your own pattern instead. Results can be printed on screen, saved to a file, or both. It is useful for quickly finding information spread across many PDFs.

How to use

Pass a dictionary whose keys are the search parameters listed below. Each value must be a list:

| Key | Description |
| --- | --- |
| include | words that must be present |
| exclude | words that must not be present |
| keywords | list of keywords |

You do not have to pass all three keys. Pass only one or two if that suits your search better.

params = {
    'include': ['Al'],
    'exclude': ['cathode'],
    'keywords': ['ion'],
}

search_pdf('c:/my_folder_with_pdf', params=params)
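
The README does not show how the pattern is assembled from these keys. As an illustration only, here is a minimal sketch of one plausible approach using lookaheads, in the same style as the custom pattern example in this README. The helper name `build_pattern` and the exact pattern shape are assumptions, not the script's actual code.

```python
import re


def build_pattern(params):
    """Hypothetical helper: combine include/exclude/keywords into one regex."""
    parts = []
    # Every "include" word must appear somewhere in the line.
    for word in params.get("include", []):
        parts.append(f"(?=.*{re.escape(word)})")
    # No "exclude" word may appear anywhere in the line.
    for word in params.get("exclude", []):
        parts.append(f"(?!.*{re.escape(word)})")
    # At least one keyword must appear (assumed meaning of "keywords").
    keywords = params.get("keywords", [])
    if keywords:
        alternation = "|".join(re.escape(k) for k in keywords)
        parts.append(f"(?=.*(?:{alternation}))")
    return "^" + "".join(parts) + ".*$"


params = {"include": ["Al"], "exclude": ["cathode"], "keywords": ["ion"]}
pattern = build_pattern(params)
print(pattern)  # ^(?=.*Al)(?!.*cathode)(?=.*(?:ion)).*$
print(bool(re.search(pattern, "Al ion mobility")))  # True
print(bool(re.search(pattern, "Al ion cathode")))   # False
```

Matching here is case-sensitive and works on one line at a time, because `.` does not cross newlines by default. The real script may behave differently.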

Alternatively, you can pass your own regular expression pattern to search for specific text in the PDFs:

my_pattern = "(?=.*mobility).*"

search_pdf('c:/my_folder_with_pdf', pattern=my_pattern)
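
To see what a lookahead pattern like this selects, you can try it on plain text with Python's standard `re` module before running it against PDFs. This is only a sketch: `page_text` is made-up sample text, and how `search_pdf` applies the pattern internally is not documented here.

```python
import re

my_pattern = "(?=.*mobility).*"

# Made-up sample text standing in for one extracted PDF page.
page_text = "Ion mobility increases with temperature.\nThe cathode was unchanged."

# The lookahead succeeds only where "mobility" occurs later on the same line,
# and the trailing ".*" then captures the rest of that line.
matches = [m.group(0) for m in re.finditer(my_pattern, page_text) if m.group(0)]
print(matches)  # ['Ion mobility increases with temperature.']
```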

Optional parameters

| Parameter | Action |
| --- | --- |
| to_txt | Path of a text file in which to save the findings, including the file name and the page of each finding |
| to_csv | Path of a CSV file in which to save the findings, including the file name and the page of each finding |
| to_excel | Path of an Excel file in which to save the findings, including the file name and the page of each finding |
| grab_extra | Positive integer: the number of characters before and after each finding to include with it |
| print_on_screen | Defaults to True. Prints the results in the terminal: file name and path, page number, and the finding itself |

Example:

my_pattern = "SOME PATTERN"

search_pdf('c:/my_folder_with_pdf', to_txt="c:/some_path/my_text_file.txt", pattern=my_pattern, 
            grab_extra=25, print_on_screen=True)
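
The effect of `grab_extra` can be pictured as widening each match by a fixed number of characters on both sides. The sketch below illustrates that idea on plain text. The function name `find_with_context` is hypothetical, and the script's real handling (for example at page boundaries) may differ.

```python
import re


def find_with_context(text, pattern, grab_extra=0):
    """Return each match plus up to grab_extra characters on either side."""
    findings = []
    for match in re.finditer(pattern, text):
        # Clamp the window so it never runs past the ends of the text.
        start = max(match.start() - grab_extra, 0)
        end = min(match.end() + grab_extra, len(text))
        findings.append(text[start:end])
    return findings


sample = "Fast cooling tends to crack the crystal easily."
print(find_with_context(sample, "crack"))                # ['crack']
print(find_with_context(sample, "crack", grab_extra=5))  # ['s to crack the ']
```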

Example of output:

Screen

Reading files, it may take a minute...
File Name: C:\Users\..\Articles\368635591.pdf, Page: 64 of 290, crystals 
cooled too quickly exhibit thermal strain and crack easily.26  Fast cooling of
File Name: C:\Users\..\Articles\268634785.pdf, Page: 144 of 290, limiting the suitable crystal growth techniques.  Because of this complication as well as additional problems with thermally-induced cracking and continued problems 
File Name: C:\Users\...\Articles\Articles\768635545.pdf, Page: 173 of 290, LiB 6O10 (CLBO) which is severely hygroscopic and tends to crack.16  A more advanced 
study
Total read: 8, Found: 212, total could not read: 1, files could not read by name: ["C:\\Users\\...\\Articles\\report356.pdf"]
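
Output lines of this shape could come from a loop over the pages of each file. The sketch below is an assumption about the general approach, not the script's source. It uses `PdfReader`, `reader.pages`, and `page.extract_text()` from the PyPDF2 version listed under Requirements. The function names `iter_page_texts` and `search_pages` are hypothetical.

```python
import re


def iter_page_texts(pdf_path):
    """Yield the extracted text of each page (requires PyPDF2)."""
    # Imported lazily so search_pages can be tried without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    for page in reader.pages:
        yield page.extract_text() or ""


def search_pages(file_name, page_texts, pattern):
    """Format one line per finding: file name, page position, matched text."""
    page_texts = list(page_texts)
    total = len(page_texts)
    lines = []
    for number, text in enumerate(page_texts, start=1):
        for match in re.finditer(pattern, text):
            if match.group(0):  # skip empty matches
                lines.append(
                    f"File Name: {file_name}, Page: {number} of {total}, "
                    f"{match.group(0)}"
                )
    return lines


# Plain strings stand in for extracted pages, so no PDF is needed here.
pages = ["no hits on this page", "tends to crack easily"]
print(search_pages("a.pdf", pages, "crack"))
# ['File Name: a.pdf, Page: 2 of 2, crack']
```

With a real file, `search_pages(path, iter_page_texts(path), pattern)` would feed the extracted page text into the same loop.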

Text file

[Image: text]

CSV

[Image: csv]

Excel

[Image: excel]

Requirements

openpyxl==3.1.1
pandas==1.5.3
PyPDF2==3.0.1
