Qt 6 library that parses Chilean bank-statement PDFs into structured
transactions through a Factory pattern. PDF text extraction is delegated
to a small Python (pdfplumber)
sidecar invoked through QProcess, keeping the C++ side free of any PDF
runtime dependency.
It mirrors the structure of NFBaeza/XLSXReader but for PDF input. The repo ships as a library / module meant to be linked into a parent application (e.g. Budget Monitor); a small test executable is included for local development.

PDFReader/
├── CMakeLists.txt
├── main.cpp                 # standalone test driver
├── requirements.txt         # Python deps for the extractor sidecar
├── tools/
│   └── extract_pdf.py       # pdfplumber-based extractor (stdout: JSON)
├── include/
│   ├── bank.h               # abstract Bank base + shared helpers
│   ├── bankFactory.h
│   ├── simpleClassifier.h
│   └── banks/               # per-bank headers
│       ├── bice.h
│       ├── chile.h
│       └── santander.h
└── src/
    ├── bank.cpp             # QProcess sidecar call, helpers
    ├── bankFactory.cpp
    ├── simpleClassifier.cpp
    └── banks/
        ├── bice.cpp
        ├── chile.cpp        # debit + credit (with cuota expansion)
        └── santander.cpp

┌──────────────────────────────────────────┐
│ Bank::extractPdfText (C++, QProcess)     │
│   ──> python tools/extract_pdf.py X      │
│   <── {"pages": ["...","...","..."]}     │
└──────────────────────────────────────────┘
                     │
                     ▼
  per-bank parser (BICE / Chile / Santander)
    • unwraps wrapped rows
    • regex-extracts named columns
    • SimpleClassifier tags the description
                     │
                     ▼
              QList<Transaction>

Bank::extractPdfText() spawns the Python sidecar, captures its JSON
output and returns a QStringList of per-page text. Each derived bank
parses that text with its own column regex.
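
As a rough illustration of the "column regex" step, the sketch below parses one already-unwrapped row with the standard library instead of Qt. The row layout (`DD/MM description amount`), the `ParsedRow` struct and the function names are assumptions made for this example; the real parsers use their own per-bank patterns and Qt types.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Hypothetical row shape, for illustration only: "DD/MM <description> <amount>"
// with dots as thousands separators, e.g. "12/04 COMPRA SUPERMERCADO 12.990".
struct ParsedRow {
    int day = 0;
    int month = 0;
    std::string description;
    long long amount = 0;  // pesos, no decimals
};

// Converts "1.234.567" to 1234567 by dropping the thousands separators.
inline long long parseClpAmount(const std::string& text) {
    std::string digits;
    for (char c : text) {
        if (c != '.') digits.push_back(c);
    }
    assert(!digits.empty());  // callers pass text already matched by the row regex
    return std::stoll(digits);
}

// Returns std::nullopt when the line does not look like a transaction row.
inline std::optional<ParsedRow> parseRow(const std::string& line) {
    // Groups: 1 = day, 2 = month, 3 = description, 4 = amount.
    static const std::regex rowRx(
        R"(^(\d{2})/(\d{2})\s+(.+?)\s+(-?\d{1,3}(?:\.\d{3})*)$)");
    std::smatch m;
    if (!std::regex_match(line, m, rowRx)) return std::nullopt;
    ParsedRow row;
    row.day = std::stoi(m[1].str());
    row.month = std::stoi(m[2].str());
    row.description = m[3].str();
    row.amount = parseClpAmount(m[4].str());
    return row;
}
```

Header lines or balance summaries that do not match the pattern come back as `std::nullopt`, so a caller can simply skip them.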
Helpers shared by every bank live on the base class:
- Bank::spanishMonth("abr") → 4
- Bank::cleanDescription(...) strips embedded dates, times, RUTs, monto $ X, long numeric IDs and bank-specific noise (O.Gerencia, Agustinas, boleta N XXX, …).
- Bank::unwrapRows(pageText, rowStartRx) joins continuation lines onto the row currently being assembled.
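
To make two of these helpers concrete, here is a minimal standard-library sketch of how `spanishMonth` and `unwrapRows` could behave. The signatures, the return value of 0 for unknown abbreviations and the choice to drop lines seen before the first row are assumptions; the library's own versions work on Qt types and may differ.

```cpp
#include <cassert>
#include <regex>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Maps a three-letter Spanish month abbreviation to 1..12; returns 0 if unknown.
inline int spanishMonth(const std::string& abbr) {
    static const std::unordered_map<std::string, int> months = {
        {"ene", 1}, {"feb", 2}, {"mar", 3}, {"abr", 4},  {"may", 5},  {"jun", 6},
        {"jul", 7}, {"ago", 8}, {"sep", 9}, {"oct", 10}, {"nov", 11}, {"dic", 12}};
    assert(months.size() == 12);  // guards against a duplicated key in the table
    const auto it = months.find(abbr);
    return it == months.end() ? 0 : it->second;
}

// A line whose beginning matches rowStartRx opens a new row; any other
// non-empty line is appended to the row currently being assembled.
// Lines that appear before the first row start (e.g. table headers) are dropped.
inline std::vector<std::string> unwrapRows(const std::string& pageText,
                                           const std::regex& rowStartRx) {
    std::vector<std::string> rows;
    std::istringstream in(pageText);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (std::regex_search(line, rowStartRx,
                              std::regex_constants::match_continuous)) {
            rows.push_back(line);
        } else if (!rows.empty()) {
            rows.back() += ' ';
            rows.back() += line;
        }
    }
    return rows;
}
```

With a row-start pattern such as `\d{2}/\d{2}\s`, a description that wraps onto a second line ends up on the same logical row as its date and amount.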

- C++ side: CMake ≥ 3.19, Qt 6.8 (Core, Sql), C++17 compiler.
- Python sidecar: Python 3.10+ with pdfplumber (pinned in requirements.txt).
No PDF library (Poppler, QtPdf) is linked into C++ anymore — the sidecar owns all PDF parsing.
# 1. C++ deps already installed via your distro / Qt installer.
# 2. Python sidecar — one-time:
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt

The build picks up .venv/bin/python and tools/extract_pdf.py
automatically (CMake injects the project root as a compile-time path).
cmake -S . -B build -DCMAKE_PREFIX_PATH=/path/to/Qt/6.8.x/gcc_64
cmake --build build -j
./build/PDFReaderTest

The test driver in main.cpp loads one PDF from files/ and
dumps the parsed transactions to stdout. Edit it to point at whatever
statement you want to test.
add_subdirectory(third_party/PDFReader) # PDFREADER_BUILD_TEST auto-OFF
target_link_libraries(MyApp PRIVATE PDFReader::PDFReader)

Headers are exposed via the target's INTERFACE include directories, so
#include "bankFactory.h" just works.
If you instead cmake --install the library, consume it with:
find_package(PDFReader CONFIG REQUIRED)
target_link_libraries(MyApp PRIVATE PDFReader::PDFReader)

Bank::extractPdfText() resolves the sidecar in this order:

1. PDFREADER_EXTRACTOR_BUNDLE: path to a single PyInstaller binary (recommended for shipping; no Python needed at runtime).
2. PDFREADER_EXTRACTOR_PYTHON + PDFREADER_EXTRACTOR_SCRIPT: explicit interpreter and script paths.
3. Fallback: <project>/.venv/bin/python <project>/tools/extract_pdf.py (auto-detected during dev via the PDFREADER_PROJECT_ROOT compile definition).
4. Last resort: python3 on PATH.
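
The resolution order above can be sketched as a small pure function. This is not the library's code: `ExtractorCommand`, `resolveExtractor` and the two lookup callbacks are hypothetical names, all three `PDFREADER_EXTRACTOR_*` settings are assumed to be environment variables, and the last-resort branch is assumed to run the project's own script.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Program plus leading arguments; the caller would append the PDF path.
struct ExtractorCommand {
    std::string program;
    std::vector<std::string> args;
};

// Callbacks keep the sketch testable; a real caller could pass thin
// wrappers around std::getenv and std::filesystem::exists.
using EnvLookup = std::function<std::optional<std::string>(const std::string&)>;
using FileExists = std::function<bool(const std::string&)>;

inline ExtractorCommand resolveExtractor(const EnvLookup& env,
                                         const FileExists& exists,
                                         const std::string& projectRoot) {
    assert(env && exists);  // both callbacks must be callable

    // 1. Single frozen binary.
    if (const auto bundle = env("PDFREADER_EXTRACTOR_BUNDLE")) {
        return {*bundle, {}};
    }
    // 2. Explicit interpreter and script.
    const auto python = env("PDFREADER_EXTRACTOR_PYTHON");
    const auto script = env("PDFREADER_EXTRACTOR_SCRIPT");
    if (python && script) {
        return {*python, {*script}};
    }
    // 3. Development fallback inside the project tree.
    const std::string devScript = projectRoot + "/tools/extract_pdf.py";
    const std::string venvPython = projectRoot + "/.venv/bin/python";
    if (exists(venvPython) && exists(devScript)) {
        return {venvPython, {devScript}};
    }
    // 4. Last resort: whatever python3 is on PATH (script choice is an assumption).
    return {"python3", {devScript}};
}
```

Passing the environment and filesystem checks in as callbacks is only a convenience for testing each branch in isolation.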
To freeze the sidecar into a single binary:
.venv/bin/pip install pyinstaller
.venv/bin/pyinstaller --onefile tools/extract_pdf.py
# resulting binary at dist/extract_pdf
export PDFREADER_EXTRACTOR_BUNDLE=/abs/path/to/dist/extract_pdf

auto bank = BankFactory::create("bice", "debit");
if (bank) {
    bank->readBankMovements("/abs/path/to/statement.pdf");
}

Internally Bank::readBankMovements() extracts text via the sidecar and
dispatches to the credit/debit implementation defined by each derived
bank class (typeAccount is "debit" or "credit").
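
The dispatch described here resembles a template-method arrangement, which the following sketch imitates with plain C++ types. `BankSketch` and `DemoBank` are stand-ins invented for this example: the real `Bank` takes a file path and calls the sidecar, whereas the sketch receives the page text directly, returns a string instead of transactions, and throws on an unknown account type purely as an illustrative choice.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the library's Bank base class; method names mirror the README,
// but the signatures and return type are assumptions.
class BankSketch {
public:
    explicit BankSketch(std::string typeAccount)
        : typeAccount_(std::move(typeAccount)) {
        assert(!typeAccount_.empty());
    }
    virtual ~BankSketch() = default;

    // Receives the per-page text once, then dispatches on the account type.
    std::string readBankMovements(const std::vector<std::string>& pages) {
        if (typeAccount_ == "credit") return readBankMovementsCredit(pages);
        if (typeAccount_ == "debit") return readBankMovementsDebit(pages);
        throw std::invalid_argument("unknown account type: " + typeAccount_);
    }

protected:
    virtual std::string readBankMovementsCredit(const std::vector<std::string>& pages) = 0;
    virtual std::string readBankMovementsDebit(const std::vector<std::string>& pages) = 0;

private:
    std::string typeAccount_;
};

// Toy derived bank that only reports which parser ran.
class DemoBank : public BankSketch {
public:
    using BankSketch::BankSketch;

protected:
    std::string readBankMovementsCredit(const std::vector<std::string>&) override {
        return "credit parser";
    }
    std::string readBankMovementsDebit(const std::vector<std::string>&) override {
        return "debit parser";
    }
};
```

Keeping the dispatch in the base class means a derived bank only has to supply the two parsers.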
SimpleClassifier tags descriptions with categories such as
groceries, drugstore, food delivery, transport, gas,
utilities, healthcare, education, subscription, online shopping,
retail, clothes, insurance, bank transfer, withdraw cash,
card payment, online payment, bank comission, paycheck,
investment, deposit, purchase, others.
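
The README does not say how SimpleClassifier decides on a category, so the following is only one plausible approach: case-insensitive keyword matching where the first hit wins. The category strings come from the list above; the `classify` function and its keyword table are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Lowercases ASCII letters so matching ignores case.
inline std::string toLowerAscii(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}

// First matching keyword wins, so more specific rules should come first.
// The keywords are illustrative and not taken from the library.
inline std::string classify(const std::string& description) {
    static const std::vector<std::pair<std::string, std::string>> rules = {
        {"farmacia", "drugstore"},
        {"supermercado", "groceries"},
        {"metro", "transport"},
        {"transferencia", "bank transfer"},
        {"giro", "withdraw cash"},
    };
    const std::string haystack = toLowerAscii(description);
    for (const auto& [keyword, category] : rules) {
        assert(!keyword.empty());  // an empty keyword would match every description
        if (haystack.find(keyword) != std::string::npos) return category;
    }
    return "others";
}
```

Descriptions that match no rule fall through to `others`, mirroring the catch-all category at the end of the list.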
- Create include/banks/<bank>.h declaring class <Bank> : public Bank.
- Create src/banks/<bank>.cpp implementing readBankMovementsCredit() and readBankMovementsDebit() (input: pre-extracted page text). Use Bank::unwrapRows, Bank::cleanDescription, Bank::spanishMonth to avoid duplicating logic.
- Add the new files to PDFREADER_PUBLIC_HEADERS and PDFREADER_SOURCES in CMakeLists.txt.
- In src/bankFactory.cpp, #include the header and add the matching case in BankFactory::create().
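
A compact picture of the factory side of the steps above, under stated assumptions: `BankBase`, `createBank` and the `Falabella` class are hypothetical stand-ins, the real `BankFactory::create()` is described as using `case` branches rather than the `if` chain shown here, and returning `nullptr` for unknown names is inferred only from the `if (bank)` check in the usage snippet.

```cpp
#include <memory>
#include <string>
#include <utility>

// Minimal stand-in; the real Bank base class has a richer interface.
class BankBase {
public:
    explicit BankBase(std::string typeAccount)
        : typeAccount_(std::move(typeAccount)) {}
    virtual ~BankBase() = default;
    virtual std::string name() const = 0;  // hypothetical, only for the demo
    const std::string& typeAccount() const { return typeAccount_; }

private:
    std::string typeAccount_;
};

// An existing bank, reduced to the bare minimum.
class Bice : public BankBase {
public:
    using BankBase::BankBase;
    std::string name() const override { return "bice"; }
};

// The newly added bank (made-up example) gets its own class.
class Falabella : public BankBase {
public:
    using BankBase::BankBase;
    std::string name() const override { return "falabella"; }
};

// One extra branch maps the new identifier to the new class.
inline std::unique_ptr<BankBase> createBank(const std::string& bankName,
                                            const std::string& typeAccount) {
    if (bankName == "bice") return std::make_unique<Bice>(typeAccount);
    if (bankName == "falabella") return std::make_unique<Falabella>(typeAccount);
    return nullptr;  // unknown bank
}
```

Callers keep working against the base-class pointer, so adding a bank touches only the new class and the factory.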
src/banks/bice.cpp is the cleanest worked example for a checking-account parser; src/banks/chile.cpp shows credit-card parsing with cuota (installment) expansion.