A fork of RD17/Ambar with various fixes and upgrades
Ambar is an open-source document search engine with automated crawling, OCR, tagging and instant full-text search.
Ambar defines a new way to integrate full-text document search into your workflow.
- Easily deploy Ambar with a single docker-compose file
- Perform Google-like search through your documents and contents of your images
- Tag your documents
- Use a simple REST API to integrate Ambar into your workflow
Tutorial: Mastering Ambar Search Queries
- Fuzzy Search (John~3)
- Phrase Search ("John Smith")
- Search By Author (author:John)
- Search By File Path (filename:*.txt)
- Search By Date (when: yesterday, today, lastweek, etc)
- Search By Size (size>1M)
- Search By Tags (tags:ocr)
- Search As You Type
- Supported language analyzers: English (ambar_en), Russian (ambar_ru), German (ambar_de), Italian (ambar_it), Polish (ambar_pl), Chinese (ambar_cn), CJK (ambar_cjk)
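The query forms listed above can be put together in a small shell helper. This is only a sketch: `build_query` is a hypothetical helper name, and it assumes that several filters may be combined in one query by separating them with spaces, which the list above does not state explicitly.

```shell
# Join the given query parts with single spaces, skipping empty parts.
# Assumption: Ambar accepts several space-separated filters in one query.
build_query() {
    query=""
    for part in "$@"; do
        [ -n "$part" ] || continue
        query="${query:+$query }$part"
    done
    printf '%s\n' "$query"
}

# Phrase search, restricted to .txt files larger than 1M that went through OCR.
build_query '"John Smith"' 'filename:*.txt' 'size>1M' 'tags:ocr'
```

Single quotes keep the shell from expanding `*` or treating `>` as a redirection before the query text is assembled.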
Ambar only supports crawling the local file system. If you need to crawl an SMB share or an FTP location, just mount it using standard Linux tools. Crawling is automatic: no schedule is needed because the crawlers monitor file system events and automatically process new, changed and removed files.
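As an illustration of mounting a remote share with standard Linux tools, the sketch below prints a `mount -t cifs` command for an SMB share. The share name, mount point and user are hypothetical placeholders, `smb_mount_cmd` is an illustrative helper rather than part of Ambar, and actually running the printed command needs root privileges and the cifs-utils package.

```shell
# Print (do not run) a read-only CIFS mount command for an SMB share.
# All arguments are placeholders chosen for this example.
smb_mount_cmd() {
    share="$1"        # remote share, e.g. //fileserver/documents
    mount_point="$2"  # local directory, e.g. /mnt/ambar/documents
    user="$3"         # SMB account name
    printf 'mount -t cifs %s %s -o ro,username=%s\n' "$share" "$mount_point" "$user"
}

smb_mount_cmd //fileserver/documents /mnt/ambar/documents crawler
```

Once the share is mounted, it behaves like any other local directory. Mounting it read-only (`ro`) is a cautious choice for content that only needs to be indexed.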
Ambar supports large files (>30MB)
Supported file types:
- ZIP archives
- Mail archives (PST)
- MS Office documents (Word, Excel, Powerpoint, Visio, Publisher)
- OCR over images
- Email messages with attachments
- Adobe PDF (with OCR)
- OCR languages: Eng, Deu, Fra, Por
- OpenOffice documents
- RTF, Plaintext
- HTML / XHTML
- Multithread processing
Notice: Ambar requires Docker to run
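A quick preflight check can confirm that Docker and the Compose plugin are available before going further. This is a minimal sketch; the helper names `have` and `check_docker` are made up for this example.

```shell
# Succeed if the named command is available on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# Report whether docker and the "docker compose" subcommand can be used.
check_docker() {
    if ! have docker; then
        echo "docker: not found"
        return 1
    fi
    if ! docker compose version >/dev/null 2>&1; then
        echo "docker compose: not available"
        return 1
    fi
    echo "docker and docker compose: ok"
}

check_docker || true
```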
If you want to see how Ambar works without installing it, try our live demo. No signup required.
All the images required to run Ambar can be built locally. In general, each image can be built by navigating into the directory of the component in question, performing the required compilation steps and then building the image, like this:
# From project root
docker compose up --build
Hint: Run plantuml to generate the updated PNG (or an online tool like PlantText).
Yes, it's fully open-source.
Yes, it is forever free and open-source.
Yes, it performs OCR on images (jpg, tiff, bmp, etc.) and PDFs. OCR is performed by the well-known open-source library Tesseract. We tuned it to achieve the best performance and quality on scanned documents. You can easily find all files on which OCR was performed with the tags:ocr query.
Supported languages: Eng, Rus, Ita, Deu, Fra, Spa, Pl, Nld. See this commit for an example of how to add new languages.
Yes!
Yes, it can search through any PDF, even badly encoded ones or ones with scans inside. We did our best to make search over any kind of PDF document smooth.
It's limited by the amount of RAM on your machine; typically it's 500MB. It's an awesome result, as typical document management systems offer a 30MB maximum file size to be processed.
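To see in advance which files might exceed such a limit, a directory can be scanned with `find`. This is a sketch only: `list_oversized` is a hypothetical helper, and the 500 default simply mirrors the typical figure quoted above rather than any value read from Ambar.

```shell
# List regular files under a directory that are larger than a limit in MiB.
# The default of 500 is an assumption taken from the typical figure above.
list_oversized() {
    dir="$1"
    limit_mb="${2:-500}"
    find "$dir" -type f -size +"${limit_mb}M"
}

# Example (the path is a placeholder):
# list_oversized /path/to/documents 500
```

`find` rounds file sizes up to whole mebibytes before comparing, so a file is reported only when it is strictly larger than the given limit.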