Datadoc: Your AI DOC Assistant

🌟 Feature Update: Offline LLM support. You can now run the whole system offline; see Offline Support under Features.

Welcome to Datadoc, your personal AI document assistant. Datadoc is designed to help you extract information from documents without having to read or remember the entire content. It's like having a personal assistant who has read all your documents and can instantly recall any piece of information from them.

Features 🚀

  • Document RAG Search: Datadoc uses a Retrieval-Augmented Generation (RAG) approach for document search. This involves retrieving relevant documents or passages and then using them to generate a response. This allows Datadoc to provide detailed and contextually relevant answers.

  • Offline Support: Datadoc supports an offline mode in which the LLM runs locally on your system. No GPU is required. Use this feature if you prefer not to call a hosted model.

    • Download the model: Get mistral-7b-openorca.gguf2.Q4_0.gguf from the Model Explorer section of the GPT4All website.
    • Place the model: Save the file as models/mistral-7b-openorca.gguf2.Q4_0.gguf.
  • Child Mode: Makes the LLM explain a topic as it would to a child, producing detailed but easy-to-follow explanations.

    • Without child mode: [screenshot]

    • With child mode: [screenshot]

  • Vector Database: Datadoc uses ChromaDB to store embeddings of the data. Embeddings are vector representations of text that capture semantic meaning. Storing these embeddings in a vector database allows for fast and efficient similarity search, enabling Datadoc to quickly find relevant information in your documents.

  • Supports Multiple Formats: Datadoc can read documents in various formats, such as PDF, DOCX, and Markdown.

  • Image Search: Datadoc can also answer queries about the content of an uploaded image, using the gemini-pro-vision model.

  • Fast and Efficient: Built on LangChain, with ChromaDB storing the data embeddings, Datadoc returns results quickly once the embeddings have been created.
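
The retrieval step behind the RAG search and vector database features can be sketched in a few lines. This is a toy illustration only, not Datadoc's code and not the ChromaDB API: the `embed`, `cosine`, and `top_k` names are hypothetical, and the bag-of-words "embedding" stands in for the learned embedding model a real system would use.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (a stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


passages = [
    "The invoice is due on the first of March.",
    "ChromaDB stores embeddings for similarity search.",
    "The meeting notes mention a budget review.",
]
best = top_k("When is the invoice due?", passages, k=1)
print(best[0])
```

A vector database does the same job at scale: it stores the vectors once and uses an index so that a query does not have to be compared against every passage.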

How Datadoc Works

  • Intelligent Fusion: Datadoc combines Google's Gemini model, accessed through LangChain, with embeddings stored in ChromaDB.
  • Versatile Processing: Datadoc parses each supported document format into text before creating embeddings.
  • Image Understanding: For image-related queries, the Gemini API analyzes the uploaded image.
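
As a rough sketch of how retrieved passages and a question might be combined before they reach the LLM, the snippet below assembles a single prompt string. The `build_prompt` function, its `child_mode` flag, and the prompt wording are all assumptions made for illustration; Datadoc's actual prompts may differ.

```python
def build_prompt(question, passages, child_mode=False):
    """Combine retrieved passages and the user's question into one prompt string."""
    context = "\n".join(f"- {p}" for p in passages)
    # Hypothetical style instruction; the real child-mode wording is not documented here.
    style = (
        "Explain the answer as if you were talking to a child."
        if child_mode
        else "Answer concisely."
    )
    return (
        "Use only the context below to answer the question.\n"
        f"{style}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "When is the invoice due?",
    ["The invoice is due on the first of March."],
    child_mode=True,
)
print(prompt)
```

Keeping the style instruction separate from the context makes a mode switch a one-line change, while the retrieved passages stay the same.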

Getting Started 🎉

  1. Clone the repository
git clone https://github.com/Adiii1436/datadoc.git
cd datadoc
  2. Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
  3. Install the dependencies
pip install -r requirements.txt
  4. Put all your files inside the Transcripts folder.
  5. Run the main script and start asking questions!
streamlit run app.py
  6. You also need a Gemini API key, which you can get from here.
  7. Note that the first run may take some time, because Datadoc has to parse the documents and create their embeddings. Subsequent runs will be faster.
  8. Important Note: click here
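
Before launching the app, it can help to confirm that the documents and the optional offline model are where the steps above expect them. The following is a minimal sketch, not part of Datadoc: the `find_documents` and `preflight` helpers are hypothetical, and the extension set covers only the formats named in the feature list.

```python
from pathlib import Path

# Extensions named in the feature list; the real app may accept more.
SUPPORTED = {".pdf", ".docx", ".md"}
# Model location given in the Offline Support instructions.
MODEL_PATH = Path("models/mistral-7b-openorca.gguf2.Q4_0.gguf")


def find_documents(folder="Transcripts"):
    """Return supported files under the folder, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def preflight(folder="Transcripts", model_path=MODEL_PATH):
    """Summarize what is in place before running `streamlit run app.py`."""
    return {
        "documents": len(find_documents(folder)),
        "offline_model_present": Path(model_path).is_file(),
    }


print(preflight())
```

Run it from the repository root. A document count of zero means the Transcripts folder is missing or holds no files with the listed extensions.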

Contributing 🤝

We welcome contributions from developers. Feel free to fork this repository, make changes, and submit a pull request.

License 📄

This project is licensed under the MIT License.