This project demonstrates the use of LangChain with Chroma for document embedding and retrieval. It leverages Azure OpenAI for generating embeddings and executing chat-based interactions.
- PDF Embedding: Convert PDF documents into embeddings using Azure OpenAI
- Chroma Vector Store: Store and retrieve document embeddings using Chroma DB
- Chunking: Split documents into chunks with customizable size and overlap
- Chat Agent: Interact with the system using a chat-based interface powered by LangChain and Azure OpenAI
- Python 3.8+
- Azure OpenAI account
- Chroma DB (running in Docker)
- Clone the repository:

      git clone <repository-url>
      cd <repository-directory>
- Start Chroma DB:

      docker run -d -p 8000:8000 -v C:/chroma/data:/vector_data -e CHROMA_SERVER_CORS_ALLOW_ORIGINS='["http://localhost:8090"]' -e PERSIST_DIRECTORY=/vector_data --name chromadb chromadb/chroma

- Install dependencies:

      pip install -r requirements.txt
- Environment Variables:
  - Copy .env_sample to .env and fill in the required API keys and endpoints:
    - OPENAI_API_KEY: Your OpenAI API key
    - AZURE_OPENAI_API_KEY: Your Azure OpenAI API key
    - AZURE_OPENAI_ENDPOINT: The endpoint URL for Azure OpenAI
    - AZURE_OPENAI_API_VERSION: The API version for Azure OpenAI
    - LANGSMITH_TRACING: Enable or disable LangSmith tracing (true/false)
    - LANGSMITH_ENDPOINT: The endpoint URL for LangSmith
    - LANGSMITH_API_KEY: Your LangSmith API key
    - LANGSMITH_PROJECT: The project name for LangSmith
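A quick check for missing settings can save a confusing failure later. The sketch below is not part of the project: the helper name `missing_env_vars` is hypothetical, and treating only the three Azure variables as mandatory is an assumption. It also assumes the values from .env have already been loaded into the process environment (for example by python-dotenv, if the project uses it).

```python
import os

# Variable names come from the list above. Which of them are truly
# mandatory is an assumption made for this sketch.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank.

    `env` defaults to os.environ; pass a plain dict to check other values.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` before starting the agent returns an empty list when all three variables are set, and otherwise the names that still need a value.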
- Prepare PDF Embeddings:
  - Place your PDF files in the ./data/input/ directory
  - Run the embedding process to create and store document embeddings in Chroma DB
- Start the Application:

      python agent.py
- Interact with Documents:
  - The system will process PDFs from the input directory
  - Use the chat interface to ask questions about your documents
  - The agent will retrieve relevant information using the Chroma vector store
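To give a rough idea of what retrieval from a vector store involves, here is a toy sketch in plain Python. It is not the project's code and does not use Chroma: the functions `cosine_similarity` and `top_k` are illustrative names, the vectors are hand-written stand-ins for real embeddings, and Chroma's actual index and distance metric are configurable and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the texts of the k stored chunks most similar to the query.

    `stored` is a list of (text, vector) pairs, standing in for the
    chunk embeddings that a real vector store would hold.
    """
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

In the real system the query vector would come from the same embedding model as the stored chunks, and the returned chunk texts would be handed to the chat model as context.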
You can customize the document processing by adjusting these parameters:
- Chunk size (default: 1000 characters)
- Chunk overlap (default: 200 characters)
- Collection name for vector storage
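The effect of chunk size and overlap can be illustrated with a simplified fixed-size splitter. This is only a sketch under stated assumptions: `chunk_text` is a hypothetical helper, and the project most likely relies on a LangChain text splitter, which also tries to break on natural boundaries such as paragraphs rather than at exact character offsets.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character chunks that share an overlap.

    Defaults mirror the values listed above. Each chunk starts
    (chunk_size - chunk_overlap) characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text, so no
        # trailing chunk consists only of already-covered overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, a 2,500-character document yields three chunks starting at offsets 0, 800, and 1600. A larger overlap keeps more shared context between neighbouring chunks at the cost of storing and embedding more text.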
- agent.py: Main script to run the chat agent
- embeddings_tools.py: Functions to create and manage document embeddings with Chroma
- tools.py: Utility functions for file writing and message extraction
- .env_sample: Sample environment configuration file
- data/: Directory for input PDFs and output markdown files
- LangChain for the foundational framework
- Chroma for vector storage and retrieval
- Azure OpenAI for embedding and chat capabilities