CJScripts/RAG-Based-Research-Paper-Assistant

RAG Paper Chatbot (Open-Source)

This repository contains a small Streamlit app that uses LangChain + sentence-transformers + an open-source LLM to answer questions about an uploaded PDF.

Quick setup (Windows PowerShell)

  1. Create and activate a venv (PowerShell):

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1
  2. Install dependencies:

    pip install -r requirements.txt

    Note: requirements.txt pins pdfminer.six==20201018 for compatibility with the unstructured/langchain loaders. If you run into PDF parsing ImportErrors, see the "PDFMiner compatibility" section below.

    This repository also includes a small runtime compatibility shim pdfminer_compat.py that attempts to patch common symbol-location differences across pdfminer.six releases. The shim runs at import time so users who clone the repo get a better out-of-the-box experience. Prefer pinning pdfminer.six in requirements.txt for reproducible environments; the shim is provided as a convenience.

  3. Run the app:

    streamlit run streamlit.py
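
The contents of pdfminer_compat.py are not reproduced in this README, so the following is only a sketch of how an import-time shim like the one mentioned in step 2 could work, not the repository's implementation. The helper name ensure_alias_module is hypothetical; the module and class names are the ones listed in the "PDFMiner compatibility" section.

```python
import importlib
import sys
import types


def ensure_alias_module(alias, source, names):
    """Make `alias` importable by re-exporting `names` from `source`.

    Returns the real module if `alias` already imports, the in-memory shim
    if one was created, or None if `source` is unavailable too.
    """
    try:
        return importlib.import_module(alias)  # nothing to patch
    except ImportError:
        pass
    try:
        src = importlib.import_module(source)
    except ImportError:
        return None  # the package is missing or laid out differently

    shim = types.ModuleType(alias)
    for name in names:
        if hasattr(src, name):
            setattr(shim, name, getattr(src, name))
    shim.__all__ = [name for name in names if hasattr(shim, name)]

    # Register the shim so later `import alias` statements find it.
    sys.modules[alias] = shim
    parent, _, child = alias.rpartition(".")
    if parent in sys.modules:
        setattr(sys.modules[parent], child, shim)
    return shim


# Assumed usage: run once before anything imports pdfminer.psexceptions.
ensure_alias_module(
    "pdfminer.psexceptions",
    "pdfminer.psparser",
    ["PSException", "PSSyntaxError", "PSEOF", "PSTypeError"],
)
```

Because the shim only lives in sys.modules, nothing is written into site-packages. It has to be imported before unstructured, which is why a shim of this kind runs at import time.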

PDFMiner compatibility

Different pdfminer.six releases place some helper functions and parser exception classes in different modules. If you see errors like:

  • ImportError: cannot import name 'open_filename' from 'pdfminer.utils'
  • ModuleNotFoundError: No module named 'pdfminer.psexceptions'

there are two supported fixes:

A) Preferred: Install the pinned version (already in requirements.txt):

pip install pdfminer.six==20201018

B) If you cannot pin versions, add a small compatibility shim to the pdfminer package in your virtualenv that re-exports the missing module names. From PowerShell (run after activating your venv):

$site = (python -c "import pdfminer, os; print(os.path.dirname(pdfminer.__file__))")
$shim = Join-Path $site 'psexceptions.py'
@"
from .psparser import PSException, PSSyntaxError, PSEOF, PSTypeError

__all__ = [
    'PSException',
    'PSSyntaxError',
    'PSEOF',
    'PSTypeError',
]
"@ | Set-Content -Path $shim -Encoding UTF8

This writes a small psexceptions.py shim into the pdfminer package in your venv so unstructured can import pdfminer.psexceptions.
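
To check which situation you are in before or after applying either fix, a short diagnostic can help. This is a sketch; the function name pdfminer_status is hypothetical, and it reports only the installed pdfminer.six version and whether pdfminer.psexceptions can be imported.

```python
import importlib
from importlib import metadata


def pdfminer_status():
    """Report the installed pdfminer.six version and psexceptions availability."""
    status = {"version": None, "has_psexceptions": False}
    try:
        status["version"] = metadata.version("pdfminer.six")
    except metadata.PackageNotFoundError:
        return status  # pdfminer.six is not installed in this environment
    try:
        importlib.import_module("pdfminer.psexceptions")
        status["has_psexceptions"] = True
    except ImportError:
        pass  # the module is missing, so a pin or shim is still needed
    return status
```

Calling print(pdfminer_status()) inside the activated venv shows whether the pinned version 20201018 is the one actually installed and whether the shim took effect.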

Notes

Poppler (required by pdf2image)

The unstructured pipeline may use pdf2image, which depends on the external Poppler tools (not a Python package). On Windows you must install Poppler and add its bin folder to your PATH.

  1. Download a Poppler build for Windows, for example:

    https://github.com/oschwartz10612/poppler-windows/releases/

  2. Extract the archive and add the bin folder to your PATH.

  3. Verify (PowerShell):

    .\scripts\check_poppler.ps1

If Poppler is missing, pdf2image raises PDFInfoNotInstalledError with the message "Unable to get page count. Is poppler installed and in PATH?". The Streamlit app catches this error and shows a helpful message. For a quick check, run scripts/check_poppler.ps1.
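
The same PATH check can be sketched in Python for non-PowerShell environments. This is not the contents of scripts/check_poppler.ps1, which is not shown here. The function names are hypothetical, and the tool list assumes the pdfinfo and pdftoppm executables that pdf2image calls.

```python
import shutil


def find_poppler_tools(tools=("pdfinfo", "pdftoppm")):
    """Map each Poppler executable to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def poppler_message(found):
    """Turn the lookup result into a one-line status message."""
    missing = [tool for tool, path in found.items() if path is None]
    if missing:
        return "Poppler tools missing from PATH: " + ", ".join(missing)
    return "Poppler found: " + ", ".join(sorted(found))
```

Running print(poppler_message(find_poppler_tools())) after editing PATH confirms that a new shell session picks up the bin folder.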

  • The app loads models at import time which can be slow and may download large files. Consider lazy-loading or using smaller models for local testing.
  • If you prefer a fully reproducible environment, generate a lock file such as requirements.lock with pip-tools (pip-compile), or use Poetry.

Publish to GitHub and CI

  1. Initialize git, commit, and push (replace with your repository URL):

    git init
    git add .
    git commit -m "Initial commit"
    git branch -M main
    git remote add origin <url>
    git push -u origin main

  2. The included GitHub Actions workflow .github/workflows/python-ci.yml will run a syntax check on pushes to main.
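
The workflow file is not shown here, so its exact steps are unknown. As a rough local equivalent, a syntax check can be sketched by parsing every .py file without executing it. The function name find_syntax_errors and the skipped directory names are assumptions.

```python
import ast
from pathlib import Path


def find_syntax_errors(root=".", skip_dirs=(".venv", ".git")):
    """Return (path, line, message) for each .py file under root that fails to parse."""
    root = Path(root)
    errors = []
    for path in sorted(root.rglob("*.py")):
        # Ignore the virtualenv and git metadata.
        if any(part in skip_dirs for part in path.relative_to(root).parts):
            continue
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors.append((str(path), exc.lineno, exc.msg))
    return errors
```

Running find_syntax_errors() from the repository root before pushing catches the class of errors a syntax-only CI job would report.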

Deploy to Streamlit Cloud

  1. Create a public repository and push (see above).
  2. In Streamlit Cloud, click "New app", connect your GitHub repo, choose the branch and streamlit.py as the entrypoint.
  3. Set any necessary secrets or environment variables in the Streamlit Cloud settings (for example OCR_AGENT if you choose to use it).
