RafaRossatto/PDF_Text_Cleaner

PDF Text Cleaner

A Python tool for extracting and cleaning text from PDF files using PyMuPDF (fitz).

Workflow

See the Step1.pdf file.
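The actual pipeline is documented in Step1.pdf. As a rough illustration only, extraction and cleaning with PyMuPDF might look like the sketch below. The function names `extract_text` and `clean_text` and the specific cleaning rules (rejoining hyphenated line breaks, collapsing whitespace) are assumptions made for this example and are not taken from the project's code.

```python
import re


def extract_text(pdf_path: str) -> str:
    """Return the raw text of every page, joined by newlines."""
    import fitz  # PyMuPDF; imported here so clean_text works without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def clean_text(raw: str) -> str:
    """Apply a few illustrative cleaning rules (assumed, not the tool's own)."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop single spaces left directly before or after a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

With PyMuPDF installed, `clean_text(extract_text("thesis.pdf"))` would return the cleaned text of a hypothetical `thesis.pdf` under these assumed rules.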

Installation

All dependencies are listed in requirements.txt. To install them, run:

    pip install -r requirements.txt

How to Run

python main.py <file_path> [-o <output_path/file_name.txt>]

Arguments

Argument        Description
file_path       Path to the PDF file (required)
-o, --output    Path to save cleaned text (optional)

If -o/--output is not provided, the cleaned text is saved in the script directory under a default filename derived from the input file name.

Examples:

    # Basic usage (saves to default filename)
    $ python main.py thesis.pdf

    # Custom output location
    $ python main.py annual_report.pdf --output ./clean/report.txt

    # Using short option
    $ python main.py article.pdf -o cleaned_article.txt

    # Processing file with spaces in name
    $ python main.py "my document.pdf" -o "my document clean.txt"

    # Using absolute paths
    $ python main.py /home/user/documents/paper.pdf -o /home/user/output/paper.txt
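For readers curious how the command-line interface above could be wired up, here is a minimal sketch using `argparse`. It is not the project's actual `main.py`. The helper names `build_parser`, `default_output_path`, and `resolve_output` are hypothetical, and the `<input name>_cleaned.txt` default pattern is an assumption, since the README only says the default name is based on the input file.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented arguments."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract and clean text from a PDF file.",
    )
    parser.add_argument("file_path", help="Path to the PDF file (required)")
    parser.add_argument(
        "-o", "--output", default=None,
        help="Path to save cleaned text (optional)",
    )
    return parser


def default_output_path(file_path: str, script_dir: Path) -> Path:
    """Derive a default output path inside the script directory."""
    # Assumed naming pattern; the real tool may use a different one.
    return script_dir / f"{Path(file_path).stem}_cleaned.txt"


def resolve_output(args: argparse.Namespace, script_dir: Path) -> Path:
    """Use --output when given, otherwise fall back to the default path."""
    if args.output is not None:
        return Path(args.output)
    return default_output_path(args.file_path, script_dir)
```

In a real script, `script_dir` would typically be `Path(__file__).resolve().parent`, so that the default file lands next to `main.py` regardless of the current working directory.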
