Chat with your own PDF files 😀
Demo version of an end-to-end service to analyze PDF documents using LLMs.
This is a two-stage solution. First, robust OCR engineering with DocTR generates the dataset. Then an LLM is fine-tuned using LangChain and OpenAI. Finally, the chat is exposed with FastAPI.
OCR Engineering
Chat using FastAPI
- Ubuntu 20
- Python >=3.10
- Clone repo
- Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
- Install dependencies
python3 -m pip install --upgrade pip setuptools wheel
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install "python-doctr[torch]"
pip install langchain
pip install -r requirements.txt
export PYTHONPATH="${PYTHONPATH}:${PWD}"
- Set up the OpenAI key in the .env file: OPENAI_API_KEY="your-open-ai-key"
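To check that the key is picked up before starting the service, a minimal sketch along these lines may help. It uses only the standard library; `load_env_file` and `get_openai_key` are hypothetical helpers written for illustration, not functions from this repo, and the project itself may load the .env file differently.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in OPENAI_API_KEY="...".
        values[key.strip()] = value.strip().strip("\"'")
    return values


def get_openai_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    return os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
```

Calling `get_openai_key()` from the repo root returns `None` when neither the environment nor the .env file defines the key, which makes a missing setup easy to spot.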
python3 src/main.py showocr
OCR engineering in default project (Amazon report - 2022)
python3 src/main.py ocrengineering
uvicorn app.main:app --port 5000
Run with docs
http://127.0.0.1:5000/docs
Train endpoint
http://127.0.0.1:5000/docs#/chatpdf/train_chatpdf_route_train_chatpdf_post
POST the project anual_report
Chat with files
http://127.0.0.1:5000/docs#/chatpdf/chatpdf_route_chatpdf_post
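Both endpoints can also be called without the docs page. The sketch below builds the two POST requests with the standard library. The paths `/train_chatpdf` and `/chatpdf` are guessed from the operation names in the docs URLs, and the JSON field names `project` and `question` are purely hypothetical, so check the schema shown at /docs before relying on any of them.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # host and port from the uvicorn command


def build_post(path, payload):
    """Build (but do not send) a JSON POST request for the local service."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request and decode the JSON response; needs the server running."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# ASSUMPTION: paths and field names are guesses, not the confirmed API schema.
train_request = build_post("/train_chatpdf", {"project": "anual_report"})
chat_request = build_post(
    "/chatpdf",
    {"project": "anual_report", "question": "What was the net revenue in 2022?"},
)
```

With the server running, `send(train_request)` followed by `send(chat_request)` would mirror the two steps performed through the docs page.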
Manually place your PDF files in this structure:
chatpdf
+--data/
   +--projects/
      +--project_name/
         +--documents/
            1-file.pdf
            ....
            n-file.pdf
         +--text_files/
The OCR dataloaders search for PDF files in the documents folder and then write the generated text files to the text_files folder.
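The lookup described above can be sketched with pathlib. This is an illustration only, not the repo's dataloader: `plan_text_outputs` is a hypothetical helper, and the one-.txt-per-PDF naming is an assumption about how the output files are named.

```python
from pathlib import Path


def plan_text_outputs(project_dir):
    """Pair each PDF in documents/ with a target .txt path in text_files/."""
    project_dir = Path(project_dir)
    documents = project_dir / "documents"
    text_files = project_dir / "text_files"
    # sorted() keeps the order stable across runs and platforms.
    return [
        (pdf, text_files / (pdf.stem + ".txt"))  # ASSUMPTION: <name>.pdf -> <name>.txt
        for pdf in sorted(documents.glob("*.pdf"))
    ]
```

For example, `plan_text_outputs("data/projects/project_name")` lists which PDF files would be found for that project, which is a quick way to confirm the folder layout before running the OCR step.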


