RAGChatbot

An experimental project that combines an OpenAI LLM with retrieval-augmented generation (RAG), using Hugging Face and a Pinecone vector store.

PDF AgentForce is an AI-powered chatbot that extracts, indexes, and queries information from uploaded PDF documents. It answers from document content where possible and falls back to open-ended GPT-based responses when the document context is insufficient.

Key Features

Pinecone Integration for Vector Search: Uses Pinecone to store and retrieve document embeddings, ensuring fast and efficient similarity search.
OpenAI GPT-4 Integration: Combines document-based answers with GPT-4 to provide general knowledge or fallback responses when documents don’t contain the relevant information.
Hugging Face Integration: Downloads and processes documents from a Hugging Face repository upon initialization.
Streamlit Interface: An intuitive UI that allows users to upload PDF documents, query them, and view responses interactively.

Technology Stack

Pinecone: Used for storing and querying embeddings with high-performance vector search.
OpenAI GPT-4: Powers natural language responses and fallback search capabilities.
Streamlit: A lightweight and powerful framework used to build the app's user interface.
Hugging Face Hub: Used for storing and downloading PDF documents, which are then embedded for document-based answers.
Sentence Transformers: Used to generate embeddings for both the document content and user queries.
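
Sentence Transformers models truncate input beyond their maximum sequence length, so long PDF text is usually split into smaller pieces before embedding. This README does not say how the project splits text, so the following is only a sketch under assumptions: `chunk_words` is a hypothetical helper (not a function from this repository), and the word-based sliding window with its default sizes is an illustrative choice.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows.

    Hypothetical helper for illustration; chunk_size and overlap are
    assumed defaults, not values taken from this project.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each returned chunk could then be passed to the embedding model and stored as one vector. The overlap keeps sentences that straddle a boundary visible in both neighbouring chunks.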

How It Works

Document Upload: Users upload PDFs, which are processed to extract text using the pdf_utils module.
Text Embedding: The extracted text is converted into embeddings using a Sentence Transformers model and stored in Pinecone.
Querying: When a query is made, the bot searches the stored document embeddings in Pinecone to retrieve the most relevant information.
GPT Fallback: If no relevant context is found in the documents, the bot falls back to OpenAI GPT-4 to answer the query.
Response Streaming: Responses are streamed in real time as they are generated.
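
The querying and fallback steps above can be sketched in plain Python. This is not the project's implementation: an in-memory list of `(text, vector)` pairs stands in for the Pinecone index, and the similarity threshold `min_score` is an assumption, since this README does not describe how the bot decides that document context is insufficient. The names `retrieve` and `choose_route` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, top_k=3):
    """Return up to top_k (score, text) pairs, best match first.

    `index` is a list of (text, vector) pairs standing in for Pinecone.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def choose_route(query_vec, index, min_score=0.5, top_k=3):
    """Pick document-grounded answering or the GPT fallback.

    min_score is an assumed cutoff: matches below it are treated as
    "no relevant context found".
    """
    hits = [text for score, text in retrieve(query_vec, index, top_k) if score >= min_score]
    if hits:
        return "documents", hits
    return "fallback", []
```

With the `"documents"` route, the returned passages would be placed in the prompt sent to the model; with `"fallback"`, the query would be sent on its own.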

Project Structure

.
├── app_chatbot.py                            # The main application file for Streamlit
├── utils/
│   ├── pdf_utils.py                          # Functions for PDF text extraction
│   ├── huggingface_utils.py                  # Utility for uploading to Hugging Face
│   ├── embedding_utils.py                    # Embedding generation using Sentence Transformers
│   ├── pinecone_utils.py                     # Pinecone initialization and vector store setup
│   └── huggingface_download_embed_utils.py   # Functions for downloading from Hugging Face
└── README.md                                 # Project documentation

Install the required libraries:

pip install -r requirements.txt

Add the necessary environment variables:

PINECONE_API_KEY: Your Pinecone API key.
PINECONE_INDEX_NAME: The name of the Pinecone index.
OPENAI_API_KEY: Your OpenAI API key.
HUGGINGFACE_API_KEY: Your Hugging Face API key.
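
A missing key tends to surface only when the first API call fails, so it can help to check the variables at startup. The sketch below is an illustration, not code from this repository: `missing_env_vars` is a hypothetical helper, and treating an empty value as missing is an assumption.

```python
import os

# The four variable names listed in this README.
REQUIRED_VARS = (
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
    "OPENAI_API_KEY",
    "HUGGINGFACE_API_KEY",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```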

Run the application:

streamlit run app_chatbot.py

Future Enhancements

Advanced Document Parsing: Implement more robust parsing for a wider variety of document formats, including multimodal content.
Improved Query Accuracy: Enhance the embedding and search mechanisms to deliver more precise answers.

Contributions

This is purely an experimental hobby project, but contributions to improve it are welcome!


Sylendran/RAGChatbot
