This repository contains a small Streamlit app that uses LangChain + sentence-transformers + an open-source LLM to answer questions about an uploaded PDF.
- Create and activate a venv (PowerShell):
python -m venv .venv
.\.venv\Scripts\Activate.ps1
- Install dependencies:
pip install -r requirements.txt
Note: requirements.txt pins pdfminer.six==20201018 for compatibility with the unstructured/langchain loaders. If you run into PDF parsing ImportErrors, see the "PDFMiner compatibility" section below.
This repository also includes a small runtime compatibility shim, pdfminer_compat.py, that attempts to patch common symbol-location differences across pdfminer.six releases. The shim runs at import time so users who clone the repo get a better out-of-the-box experience. Prefer pinning pdfminer.six in requirements.txt for reproducible environments; the shim is provided as a convenience.
- Run the app:
streamlit run streamlit.py
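For readers curious how an import-time shim of the kind described in the install note can work, here is a minimal sketch. It is not the contents of pdfminer_compat.py (that file is not reproduced here); the helper name install_alias is hypothetical, and the symbol list is an assumption based on the pdfminer.psexceptions error this README discusses. The idea is to register a stand-in module in sys.modules that re-exports names from wherever they currently live.

```python
import importlib
import sys
import types


def install_alias(source_name, alias_name, names):
    """Register alias_name in sys.modules, re-exporting names from source_name.

    Returns the alias module, or None if the real module already exists
    or the source module cannot be imported.
    """
    try:
        importlib.import_module(alias_name)
        return None  # the real module is importable; nothing to patch
    except ImportError:
        pass
    try:
        source = importlib.import_module(source_name)
    except ImportError:
        return None  # source package missing; leave the environment untouched

    alias = types.ModuleType(alias_name)
    exported = []
    for name in names:
        if hasattr(source, name):  # copy only the symbols that actually exist
            setattr(alias, name, getattr(source, name))
            exported.append(name)
    alias.__all__ = exported
    sys.modules[alias_name] = alias

    # Also expose the alias as an attribute of its parent package.
    parent_name, _, child = alias_name.rpartition(".")
    if parent_name in sys.modules:
        setattr(sys.modules[parent_name], child, alias)
    return alias


# Assumed usage: make pdfminer.psexceptions importable on releases that lack it.
install_alias(
    "pdfminer.psparser",
    "pdfminer.psexceptions",
    ["PSException", "PSSyntaxError", "PSEOF", "PSTypeError"],
)
```

A shim like this has to be imported before any library that needs the aliased module. It does nothing when pdfminer is absent or already provides the module.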
Some versions of pdfminer.six expose or locate parser exception classes in different modules. If you see errors like:
ImportError: cannot import name 'open_filename' from 'pdfminer.utils'
ModuleNotFoundError: No module named 'pdfminer.psexceptions'
then there are two supported fixes:
A) Preferred: Install the pinned version (already in requirements.txt):
pip install pdfminer.six==20201018
B) If you cannot pin versions, create a small compatibility shim in your virtualenv to re-export the missing module names (this repo includes the example command). From PowerShell (run after activating your venv):
$site = (python -c "import pdfminer, os; print(os.path.dirname(pdfminer.__file__))")
$shim = Join-Path $site 'psexceptions.py'
@"
from .psparser import PSException, PSSyntaxError, PSEOF, PSTypeError
__all__ = [
'PSException',
'PSSyntaxError',
'PSEOF',
'PSTypeError',
]
"@ | Set-Content -Path $shim -Encoding UTF8
This writes a small psexceptions.py shim into the pdfminer package in your venv so unstructured can import pdfminer.psexceptions.
The unstructured pipeline may use pdf2image, which depends on the external Poppler tools (not a Python package). On Windows you must install Poppler and add its bin folder to your PATH.
- Download a Poppler build for Windows, for example:
- Extract the archive and add the bin folder to your PATH.
- Verify (PowerShell):
.\scripts\check_poppler.ps1
If Poppler is missing, pdf2image will raise: PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
The Streamlit app now catches this and shows a helpful message. See scripts/check_poppler.ps1 for a quick check.
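The same check can be done from Python before any PDF is processed. The sketch below is not the contents of scripts/check_poppler.ps1 or of the app's error handling; the function names are hypothetical. It relies on the fact that pdf2image shells out to Poppler's pdfinfo and pdftoppm executables, so looking for them on PATH is a reasonable proxy.

```python
import shutil


def missing_poppler_tools(tools=("pdfinfo", "pdftoppm")):
    """Return the Poppler executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def poppler_status_message():
    """Build a short, user-facing message describing Poppler availability."""
    missing = missing_poppler_tools()
    if not missing:
        return "Poppler found on PATH."
    return (
        "Poppler tools not found on PATH: "
        + ", ".join(missing)
        + ". Install Poppler and add its bin folder to PATH."
    )


print(poppler_status_message())
```

In a Streamlit app, a message like this could be shown up front instead of waiting for PDFInfoNotInstalledError to surface during a conversion.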
- The app loads models at import time, which can be slow and may download large files. Consider lazy-loading or using smaller models for local testing.
- If you prefer a fully reproducible environment, generate a requirements.lock (pip-tools pip-compile) or use Poetry.
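One way to approach the lazy-loading suggestion is to wrap model construction in a cached function so nothing is loaded until the first question is asked. The sketch below uses only the standard library. The loader body and the model name are placeholders, not the app's actual code. In a Streamlit app, the st.cache_resource decorator plays a similar role.

```python
from functools import lru_cache

LOAD_CALLS = []  # demonstration only: records each real (uncached) load


def _load_embedding_model(name):
    # Placeholder for an expensive step, such as constructing a
    # sentence-transformers model; here it just returns a stub object.
    LOAD_CALLS.append(name)
    return {"model_name": name}


@lru_cache(maxsize=None)
def get_embedding_model(name="small-model-placeholder"):
    """Load the model on first use and reuse the same object afterwards."""
    return _load_embedding_model(name)
```

Importing the module costs nothing. The first call to get_embedding_model() pays the load cost, and later calls with the same name return the cached object.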
- Initialize git, commit, and push (replace <url> with your repository URL):
git init
git add .
git commit -m "Initial commit"
git branch -M main
git remote add origin <url>
git push -u origin main
- The included GitHub Actions workflow .github/workflows/python-ci.yml will run a syntax check on pushes to main.
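To run a comparable syntax check locally before pushing, something like the following may help. This is a sketch under assumptions: the workflow file's actual steps are not shown in this README, and find_syntax_errors is a hypothetical helper. It compiles every .py file under a directory without executing it and reports the files that fail to parse.

```python
import pathlib


def find_syntax_errors(root="."):
    """Compile each .py file under root and collect (path, line, message) for failures."""
    errors = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        if ".venv" in path.parts:
            continue  # skip third-party code inside the virtual environment
        try:
            # Compiling bytes lets Python handle BOMs and encoding declarations.
            compile(path.read_bytes(), str(path), "exec")
        except SyntaxError as exc:
            errors.append((str(path), exc.lineno, exc.msg))
    return errors
```

Calling find_syntax_errors() from the repository root returns an empty list when every file parses. The standard library's python -m compileall command is a ready-made alternative.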
- Create a public repository and push (see above).
- In Streamlit Cloud, click "New app", connect your GitHub repo, and choose the branch and streamlit.py as the entrypoint.
- Set any necessary secrets or environment variables in the Streamlit Cloud settings (for example OCR_AGENT if you choose to use it).