A Python tool for extracting and cleaning text from PDF files using PyMuPDF (fitz).
See the Step1.pdf file.
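As a rough illustration of the extract-then-clean flow described above, here is a minimal sketch. The function names `extract_text` and `clean_text` are hypothetical, and the cleaning rules (re-joining hyphenated line breaks, collapsing whitespace) are assumptions for illustration. They are not necessarily what main.py does.

```python
import re


def clean_text(raw: str) -> str:
    """Hypothetical cleaning pass; the tool's actual rules may differ."""
    # Re-join words hyphenated across a line break: "exam-\nple" -> "example".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.splitlines())
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def extract_text(pdf_path: str) -> str:
    """Concatenate the plain text of every page using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so clean_text works without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


if __name__ == "__main__":
    # Cleaning can be tried on any string, without a PDF at hand.
    print(clean_text("Hel-\nlo   world\n\n\n\nnext  line "))
```

Note that the de-hyphenation rule in this sketch also joins words that are legitimately hyphenated at a line end, so a real cleaner may need a more careful heuristic.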
All dependencies are listed in requirements.txt.
To install them, run:
pip install -r requirements.txt
To run the tool on a PDF file:
python main.py <file_path> [-o <output_path/file_name.txt>]
| Argument | Description |
|---|---|
| file_path | Path to the PDF file (required) |
| -o, --output | Path to save cleaned text (optional) |
If -o is not provided, a default filename is generated from the input file name, and the output is saved in the script directory.
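The argument handling described above could be sketched with `argparse` as follows. The helper names `build_parser` and `resolve_output` are hypothetical, and the `_cleaned.txt` suffix is an assumed naming scheme, since the exact default filename is not stated here.

```python
import argparse
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented command line."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract and clean text from a PDF file.",
    )
    parser.add_argument("file_path", help="Path to the PDF file (required)")
    parser.add_argument(
        "-o", "--output", help="Path to save cleaned text (optional)"
    )
    return parser


def resolve_output(file_path: str, output: Optional[str], script_dir: Path) -> Path:
    """Return the explicit output path, or a default inside script_dir."""
    if output:
        return Path(output)
    # Assumed naming scheme: <input file stem>_cleaned.txt
    return script_dir / f"{Path(file_path).stem}_cleaned.txt"


if __name__ == "__main__":
    args = build_parser().parse_args()
    # The directory containing this script, per the documented default.
    script_dir = Path(__file__).resolve().parent
    print(resolve_output(args.file_path, args.output, script_dir))
```

Passing the script directory in as a parameter keeps the default-path logic easy to test without touching the filesystem.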
# Basic usage (saves to default filename)
$ python main.py thesis.pdf
# Custom output location
$ python main.py annual_report.pdf --output ./clean/report.txt
# Using short option
$ python main.py article.pdf -o cleaned_article.txt
# Processing file with spaces in name
$ python main.py "my document.pdf" -o "my document clean.txt"
# Using absolute paths
$ python main.py /home/user/documents/paper.pdf -o /home/user/output/paper.txt