PDF Processing and Document Analysis API with FastAPI

This application provides an API that processes PDF files: it extracts text, performs OCR, calculates a legibility score, and classifies documents using an LLM. It is deployed on Heroku at https://ai-powered-doc-classifier-59728d598ae1.herokuapp.com/

Technologies Used

- Python
- FastAPI
- PyMuPDF (fitz)
- Pytesseract
- Pillow (for OCR fallback)
- Ollama (for LLM inference)

Setup and Installation

Prerequisites:
- Python 3.7+
- Tesseract OCR engine. You can download it from https://github.com/tesseract-ocr/tesseract. Make sure to add the Tesseract executable to your system's PATH.
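
A quick way to confirm both prerequisites before installing anything is sketched below. It uses only the Python standard library; the helper names `python_is_supported`, `find_tesseract`, and `tesseract_version` are illustrative and not part of this project.

```python
import shutil
import subprocess
import sys


def python_is_supported(minimum=(3, 7)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`."""
    completed = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract builds print the version banner to stderr.
    output = completed.stdout or completed.stderr
    return output.splitlines()[0] if output else ""
```

If `find_tesseract()` returns `None`, the executable is not on your PATH and OCR calls made through Pytesseract will likely fail until that is fixed.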

Clone the repository (or download the files):
```
git clone <repository_url>
cd <repository_directory>
```

Create a virtual environment (recommended):
```
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Install the required dependencies:
```
pip install -r requirements.txt
```
Download the required LLM models. Ollama needs the llama3.1:8b-instruct model, and optionally mistral:7b-instruct for verification:
```
ollama pull llama3.1:8b-instruct
ollama pull mistral:7b-instruct
```
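
The API server has to be running before the interactive docs are reachable, and the start command is not given in this README. The sketch below is one possible way to launch it, under these assumptions: the FastAPI instance is named `app` inside `app.py`, and `uvicorn` is installed in the active environment. The helper names `uvicorn_command` and `start_server` are illustrative only.

```python
import subprocess
import sys


def uvicorn_command(target="app:app", host="127.0.0.1", port=8001):
    """Build the argument list for running uvicorn with the current interpreter.

    `target` is an assumption: module `app` exposing a FastAPI object `app`.
    """
    return [
        sys.executable, "-m", "uvicorn", target,
        "--host", host,
        "--port", str(port),
    ]


def start_server():
    """Run the server in the foreground until it is interrupted (Ctrl+C)."""
    subprocess.run(uvicorn_command(), check=True)
```

Calling `start_server()` from the repository directory would block the terminal while the server runs. If the application object has a different name or location, adjust `target` accordingly.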

With the server running, open your browser and navigate to http://127.0.0.1:8001/docs to reach the Swagger UI, where you can interact with the API.

API Endpoint

`POST /classify/`

Description: Upload a PDF file to be processed and classified.
Request:
- file: The PDF file to be uploaded.
Response: A JSON object containing the processing results, including:
- filename: The name of the uploaded file.
- page_count: The total number of pages in the PDF.
- average_legibility_score: The average legibility score across all pages.
- total_images: The total number of images found in the PDF.
- pages: A list of results for each page, including extracted text, OCR text, legibility score, image count, and bounding boxes for found matches.
- document_classification: The LLM-based classification result for the entire document.
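
The request and response described above can be exercised from a small client. The following is a minimal sketch using only the standard library. It assumes the server is reachable at `http://127.0.0.1:8001` and that the upload is sent as `multipart/form-data` under the field name `file`; the helper names `build_multipart`, `classify_pdf`, and `summarize` are illustrative.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field, filename, payload, content_type="application/pdf"):
    """Encode one file as a multipart/form-data body.

    Returns a (body_bytes, content_type_header_value) pair.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def classify_pdf(path, base_url="http://127.0.0.1:8001"):
    """POST a PDF to /classify/ and return the decoded JSON response."""
    with open(path, "rb") as handle:
        body, header = build_multipart("file", os.path.basename(path), handle.read())
    request = urllib.request.Request(
        f"{base_url}/classify/",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize(result):
    """Format the document-level fields of a response as one line."""
    return (
        f"{result['filename']}: {result['page_count']} pages, "
        f"{result['total_images']} images, "
        f"average legibility {result['average_legibility_score']}, "
        f"classified as {result['document_classification']!r}"
    )
```

For example, `print(summarize(classify_pdf("sample.pdf")))` would upload a local file named `sample.pdf` (a placeholder) and print the document-level fields. The per-page entries under `pages` are left untouched here because their exact key names are not specified above.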
