GitHub - Aminbcf/pdf-tool: using embedding models for text extraction from pdf with mcp integration for pdf generation

Setup

pip install -r requirements.txt
# requirements.txt uses faiss-cpu; swap for faiss-gpu if a CUDA GPU is available

Running

# Start the server (model loading takes ~30-60 s on first run)
python main.py
# Visit http://localhost:8000

Run all commands from the project root. Python resolves the top-level packages (domain, application, adapters) relative to the current working directory.

Architecture — Hexagonal (Ports & Adapters)

domain/          ← pure Python, zero framework dependencies
  entities.py    ← Page, Message, Conversation, AskResult
  ports/         ← abstract interfaces (IPdfExtractor, IEmbeddingService,
                    ILanguageModel, IConversationHistory)

application/
  ask_question.py  ← AskQuestionUseCase: the only place that orchestrates the pipeline

adapters/
  inbound/
    api.py         ← FastAPI routes; constructs singletons, calls use case
  outbound/
    pdf_adapter.py        ← PyMuPDF → IPdfExtractor
    embedding_adapter.py  ← SentenceTransformer + FAISS → IEmbeddingService
    llm_adapter.py        ← Qwen2.5-3B-Instruct (4-bit NF4) → ILanguageModel
    history_adapter.py    ← JSON files in history/ → IConversationHistory

frontend/        ← vanilla HTML/CSS/JS, served by FastAPI StaticFiles

Dependency rule: domain has no imports from other layers. Application imports only from domain. Adapters import from domain and application; they never import each other.
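The dependency rule can be illustrated with a minimal sketch. Only the names Page, IPdfExtractor and extract_pages come from this README; the Page fields, FakePdfExtractor and CountPagesUseCase are hypothetical and exist only to show the direction of the imports.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Domain entity. The fields are assumed, not copied from entities.py."""
    number: int
    text: str


class IPdfExtractor(ABC):
    """Outbound port: the domain states what it needs, not how it is done."""

    @abstractmethod
    def extract_pages(self, pdf_path: str) -> list[Page]:
        ...


class FakePdfExtractor(IPdfExtractor):
    """Hypothetical in-memory adapter, standing in for the PyMuPDF one."""

    def __init__(self, texts: list[str]) -> None:
        self._texts = texts

    def extract_pages(self, pdf_path: str) -> list[Page]:
        # Ignores pdf_path and returns the canned texts as numbered pages.
        return [Page(number=i + 1, text=t) for i, t in enumerate(self._texts)]


class CountPagesUseCase:
    """Hypothetical application-layer class: it depends only on the port."""

    def __init__(self, extractor: IPdfExtractor) -> None:
        self._extractor = extractor

    def execute(self, pdf_path: str) -> int:
        return len(self._extractor.extract_pages(pdf_path))


use_case = CountPagesUseCase(FakePdfExtractor(["intro", "methods"]))
page_count = use_case.execute("ignored.pdf")
```

Because the use case only sees the abstract port, swapping the fake for a real adapter should require no change in the application layer.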

RAG Pipeline (per query)

1. ILanguageModel.extract_keywords(query): the LLM returns a JSON array of search terms.
2. IPdfExtractor.extract_pages(pdf_path): PyMuPDF extracts every page as text.
3. IEmbeddingService.search(query, keywords, pages): encodes the original query and the keyword string separately, averages the two vectors, searches a FAISS L2 index, and returns the top-k pages.
4. ILanguageModel.answer(query, pages, history): injects the retrieved pages as context into the Qwen prompt and includes the last 6 history turns for continuity.
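The retrieval step (query and keyword vectors averaged, then an L2 nearest-neighbour search) might look roughly like the sketch below. It is not the repository's code: a toy bag-of-words vector stands in for the SentenceTransformer encoding, a brute-force loop stands in for the FAISS index, and VOCAB, embed and the sample pages are made up for illustration.

```python
from collections import Counter

# Toy vocabulary; a real embedding model has no such fixed word list.
VOCAB = ["invoice", "total", "refund", "policy", "shipping"]


def embed(text: str) -> list[float]:
    """Toy bag-of-words vector, a stand-in for a sentence embedding."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def average(a: list[float], b: list[float]) -> list[float]:
    """Element-wise mean of two equal-length vectors."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def l2_squared(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query: str, keywords: list[str], pages: list[str], top_k: int = 2) -> list[int]:
    """Return the indices of the top_k pages closest to the averaged vector."""
    query_vec = average(embed(query), embed(" ".join(keywords)))
    ranked = sorted(
        range(len(pages)),
        key=lambda i: l2_squared(query_vec, embed(pages[i])),
    )
    return ranked[:top_k]


pages = [
    "shipping policy and shipping times",
    "invoice total and refund",
    "refund policy",
]
top_pages = search("what is the refund policy", ["refund", "policy"], pages)
```

Averaging the two vectors lets the extracted keywords pull the search vector toward the terms the LLM judged important while keeping the phrasing of the original question.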

Conversation History

Each session has a UUID as its key. History is persisted to history/<uuid>.json. The FileConversationHistory adapter maintains an in-process cache so repeated reads within a request don't re-open the file.
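A file-backed history with an in-process cache could be sketched as follows. The class name matches the adapter mentioned above, but the method names (load, append) and the message shape (role, content) are assumptions, and a temporary directory is used here instead of the real history/ folder.

```python
from __future__ import annotations

import json
import tempfile
import uuid
from pathlib import Path


class FileConversationHistory:
    """Sketch of a JSON-file history store with an in-process cache."""

    def __init__(self, root: Path) -> None:
        self._root = root
        self._root.mkdir(parents=True, exist_ok=True)
        self._cache: dict[str, list[dict]] = {}

    def _path(self, session_id: str) -> Path:
        return self._root / f"{session_id}.json"

    def load(self, session_id: str) -> list[dict]:
        # Read the file only on the first access; later calls hit the cache.
        if session_id not in self._cache:
            path = self._path(session_id)
            self._cache[session_id] = (
                json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
            )
        return self._cache[session_id]

    def append(self, session_id: str, role: str, content: str) -> None:
        # Update the cached list, then persist the whole conversation.
        messages = self.load(session_id)
        messages.append({"role": role, "content": content})
        self._path(session_id).write_text(
            json.dumps(messages, indent=2), encoding="utf-8"
        )


with tempfile.TemporaryDirectory() as tmp:
    store = FileConversationHistory(Path(tmp) / "history")
    session_id = str(uuid.uuid4())
    store.append(session_id, "user", "What is on page 2?")
    store.append(session_id, "assistant", "A summary table.")
    # A fresh instance has an empty cache, so this read comes from disk.
    reloaded = FileConversationHistory(Path(tmp) / "history").load(session_id)
```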

API Surface

| Method | Path | Purpose |
| --- | --- | --- |
| `POST` | `/api/sessions` | Create a new session UUID |
| `GET` | `/api/sessions` | List all sessions with labels |
| `POST` | `/api/sessions/{id}/upload` | Upload a PDF for the session |
| `POST` | `/api/sessions/{id}/ask` | Send a query, get answer + keywords + source pages |
| `GET` | `/api/sessions/{id}/history` | Fetch full conversation history |
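As a rough illustration of calling the ask endpoint from Python, the sketch below builds the request with the standard library only. The JSON field name `query` and the session id are assumptions; the README does not document the request body, so check api.py for the actual schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_ask_request(session_id: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/sessions/{id}/ask."""
    # The "query" field name is assumed, not taken from the repository.
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/sessions/{session_id}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ask_request("example-session-id", "What does section 3 conclude?")

# To send it against a running server:
#   with urllib.request.urlopen(request) as response:
#       payload = json.load(response)
```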

Key Notes

The FAISS index is rebuilt in memory on every /ask call. There is no persistent index file yet.
QwenAdapter loads with device_map="auto" and 4-bit quantization — it falls back gracefully to CPU if no GPU is present, but inference will be slow.
Session metadata (label, hasPdf flag) is stored in the browser's localStorage; the server only stores message history.
The frontend is served from frontend/ via FastAPI StaticFiles. API routes (/api/*) must be registered before the static mount in api.py, or the mount will shadow them.
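The route-ordering note can be made concrete with a toy model. The dispatcher below is not Starlette's router. It only mimics the relevant behaviour, namely that routes are tried in registration order and a mount at "/" matches every path. The handler names are placeholders.

```python
def dispatch(routes: list[tuple[str, str]], path: str) -> str:
    """Return the handler of the first route whose prefix matches the path."""
    for prefix, handler in routes:
        if path.startswith(prefix):
            return handler
    return "404"


# API routes registered first, static mount last: API calls reach the API.
api_first = [("/api/", "api_router"), ("/", "static_files")]

# Static mount registered first: it matches everything, including /api/*.
static_first = [("/", "static_files"), ("/api/", "api_router")]

correct = dispatch(api_first, "/api/sessions")
shadowed = dispatch(static_first, "/api/sessions")
```

With the static mount first, a request for /api/sessions is handed to the static file handler, which typically has no such file to serve.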
