A simple yet powerful document embedding and retrieval system built with LangChain, ChromaDB, and Sentence Transformers. This project demonstrates how to process PDF documents, generate embeddings, store them in a vector database, and perform semantic search queries.
- PDF Processing: Load and parse PDF documents using PyPDF
- Text Chunking: Split documents into manageable chunks for better embedding quality
- Embedding Generation: Create semantic embeddings using the `all-MiniLM-L6-v2` model from Sentence Transformers
- Vector Storage: Store embeddings in ChromaDB for efficient retrieval
- Semantic Search: Query the vector database to retrieve relevant document chunks
- LangChain Integration: Leverages LangChain for seamless workflow orchestration
## Prerequisites
- Python 3.8 or higher
- pip (Python package manager)
- Git (for cloning the repository)
## Installation
1. Clone the repository: `git clone https://github.com/yourusername/Simple-Embedding-Project.git`, then `cd Simple-Embedding-Project`
2. Create a virtual environment: `python -m venv venv`
3. Activate the virtual environment:
   - On Windows: `venv\Scripts\activate`
   - On macOS/Linux: `source venv/bin/activate`
4. Install dependencies: `pip install -r requirements.txt`

The project depends on the following packages:
- `langchain`
- `langchain-core`
- `langchain-community`
- `langchain-chroma`
- `langchain-huggingface`
- `pypdf`
- `chromadb`
- `sentence-transformers`
- `python-dotenv`
- `typesense`
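
After installing, it can be useful to confirm that every package is importable from the active virtual environment. The snippet below is a minimal sketch, not part of the project: the function name `missing_packages` and the `REQUIRED` mapping are illustrative, and the import names are the usual ones for these pip packages.

```python
from importlib.util import find_spec
from typing import Dict, List

# Maps each pip package name to the top-level module it is imported as.
REQUIRED: Dict[str, str] = {
    "langchain": "langchain",
    "langchain-core": "langchain_core",
    "langchain-community": "langchain_community",
    "langchain-chroma": "langchain_chroma",
    "langchain-huggingface": "langchain_huggingface",
    "pypdf": "pypdf",
    "chromadb": "chromadb",
    "sentence-transformers": "sentence_transformers",
    "python-dotenv": "dotenv",
    "typesense": "typesense",
}


def missing_packages(required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None  # None means the module is not installed
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

If anything is reported missing, re-running `pip install -r requirements.txt` inside the activated environment is the first thing to try.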
## Usage
1. Place your PDF file in the project directory.
2. Run the main script: `python main.py`
## How It Works
- Document Loading: The system loads PDF documents using `PyPDFLoader`
- Text Splitting: Documents are split into smaller chunks using CharacterTextSplitter for optimal embedding generation
- Embedding Generation: Each chunk is converted into a 384-dimensional vector using the `all-MiniLM-L6-v2` model
- Vector Storage: Embeddings are stored in ChromaDB, a vector database optimized for similarity search
- Retrieval: When a query is made, it's converted to an embedding and compared against stored vectors to find the most relevant chunks
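
The five steps above map onto LangChain calls roughly as follows. This is a minimal sketch under stated assumptions, not necessarily how `main.py` is written: the function names `build_vector_store` and `search`, the `chroma_db` directory, and the chunk sizes are illustrative, and it assumes the `langchain-text-splitters` package is available (it is normally installed alongside `langchain`).

```python
from typing import Any, List


def build_vector_store(pdf_path: str, persist_dir: str = "chroma_db") -> Any:
    """Load a PDF, split it into chunks, embed them, and store them in Chroma."""
    # Imported inside the function so this module can be imported without
    # loading the heavy ML stack (sentence-transformers pulls in PyTorch).
    from langchain_chroma import Chroma
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_text_splitters import CharacterTextSplitter

    # 1. Document loading: PyPDFLoader returns one Document per page.
    pages = PyPDFLoader(pdf_path).load()

    # 2. Text splitting: sizes are in characters; these values are illustrative.
    splitter = CharacterTextSplitter(
        separator="\n", chunk_size=1000, chunk_overlap=100
    )
    chunks = splitter.split_documents(pages)

    # 3. Embedding generation with the model named in this README.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

    # 4. Vector storage: embeds every chunk and writes it to a local Chroma store.
    return Chroma.from_documents(
        chunks, embedding=embeddings, persist_directory=persist_dir
    )


def search(store: Any, query: str, k: int = 3) -> List[Any]:
    """Return the k stored chunks most similar to the query."""
    # 5. Retrieval: the store embeds the query and ranks chunks by similarity.
    return store.similarity_search(query, k=k)
```

With a PDF named, say, `document.pdf` in the project directory, `store = build_vector_store("document.pdf")` followed by `search(store, "your question")` returns a list of `Document` objects whose `page_content` holds the matching chunk text. The first run downloads the embedding model from Hugging Face, so it needs network access.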
## 📝 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- [LangChain](https://github.com/langchain-ai/langchain) - Framework for developing applications with LLMs
- [ChromaDB](https://www.trychroma.com/) - AI-native open-source embedding database
- [Sentence Transformers](https://www.sbert.net/) - Framework for state-of-the-art sentence embeddings
- [Hugging Face](https://huggingface.co/) - ML model repository