# Simple Embedding Project

A small document embedding and retrieval system built with LangChain, ChromaDB, and Sentence Transformers. It demonstrates how to parse PDF documents, split them into chunks, generate embeddings, store them in a vector database, and run semantic search queries against them.

## 🚀 Features

- **PDF Processing**: Load and parse PDF documents using PyPDF
- **Text Chunking**: Split documents into manageable chunks for better embedding quality
- **Embedding Generation**: Create semantic embeddings using the all-MiniLM-L6-v2 model from Sentence Transformers
- **Vector Storage**: Store embeddings in ChromaDB for efficient retrieval
- **Semantic Search**: Query the vector database to retrieve relevant document chunks
- **LangChain Integration**: Uses LangChain to orchestrate the workflow

## 📋 Prerequisites

- Python 3.8 or higher
- pip (Python package manager)
- Git (for cloning the repository)

## 🛠️ Installation

Clone the repository:

```
git clone https://github.com/yourusername/Simple-Embedding-Project.git
cd Simple-Embedding-Project
```

Create a virtual environment:

```
python -m venv venv
```

Activate the virtual environment.

On Windows:

```
venv\Scripts\activate
```

On macOS/Linux:

```
source venv/bin/activate
```

Install dependencies:

```
pip install -r requirements.txt
```

## 📦 Dependencies

- langchain
- langchain-core
- langchain-community
- langchain-chroma
- langchain-huggingface
- pypdf
- chromadb
- sentence-transformers
- python-dotenv
- typesense
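
A quick way to confirm that the packages above are installed is to check whether each one can be imported. The snippet below is a minimal sketch and not part of the project: the `missing_packages` helper and the `PACKAGES` mapping are hypothetical names introduced here. The mapping is needed because a few pip names differ from their import names (for example, `python-dotenv` is imported as `dotenv`).

```python
from importlib.util import find_spec

# pip package name -> import name (hypothetical helper table, not from the project)
PACKAGES = {
    "langchain": "langchain",
    "langchain-core": "langchain_core",
    "langchain-community": "langchain_community",
    "langchain-chroma": "langchain_chroma",
    "langchain-huggingface": "langchain_huggingface",
    "pypdf": "pypdf",
    "chromadb": "chromadb",
    "sentence-transformers": "sentence_transformers",
    "python-dotenv": "dotenv",
    "typesense": "typesense",
}


def missing_packages(packages=PACKAGES):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

Run it inside the activated virtual environment. It only checks that the modules can be located; it does not load them or verify their versions.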

## 🎯 Usage

### Basic Usage

- Place your PDF file in the project directory
- Run the main script:

```
python main.py
```
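
The contents of `main.py` are not reproduced in this README. As a rough illustration of what a script of this kind might look like, here is a sketch built on the listed dependencies. Everything specific in it is an assumption: the chunk size and overlap, the `chroma_db` directory, the command-line arguments, and the `build_store`, `parse_args`, and `main` names are all hypothetical. It also assumes `langchain_text_splitters` is importable; if it is not, install `langchain-text-splitters`.

```python
import sys
from pathlib import Path

# Hypothetical defaults; none of these values come from the project itself.
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
PERSIST_DIR = "chroma_db"


def build_store(pdf_path):
    """Load a PDF, split it into chunks, embed them, and store them in Chroma."""
    # Imported lazily so that the usage message can be shown without
    # loading the heavy ML dependencies.
    from langchain_chroma import Chroma
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_text_splitters import CharacterTextSplitter

    pages = PyPDFLoader(pdf_path).load()  # one Document per PDF page
    splitter = CharacterTextSplitter(
        chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
    )
    chunks = splitter.split_documents(pages)
    embeddings = HuggingFaceEmbeddings(model_name=MODEL_NAME)
    return Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=PERSIST_DIR,
    )


def parse_args(argv):
    """Return (pdf_path, query), or None if the arguments are invalid."""
    if len(argv) != 3 or Path(argv[1]).suffix.lower() != ".pdf":
        return None
    return argv[1], argv[2]


def main(argv):
    args = parse_args(argv)
    if args is None:
        print("usage: python sketch.py <file.pdf> <query>")
        return 1
    store = build_store(args[0])
    # Print the page number and the first 200 characters of each hit.
    for doc in store.similarity_search(args[1], k=3):
        print(doc.metadata.get("page"), doc.page_content[:200])
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Saved as, say, `sketch.py`, it could be run as `python sketch.py report.pdf "What is the refund policy?"`. The first run downloads the embedding model, so it needs network access.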

## 🔍 How It Works

- **Document Loading**: The system loads PDF documents using PyPDFLoader
- **Text Splitting**: Documents are split into smaller chunks using CharacterTextSplitter so that each chunk can be embedded on its own
- **Embedding Generation**: Each chunk is converted into a 384-dimensional vector using the all-MiniLM-L6-v2 model
- **Vector Storage**: Embeddings are stored in ChromaDB, a vector database optimized for similarity search
- **Retrieval**: When a query is made, it is converted to an embedding and compared against the stored vectors to find the most relevant chunks
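
To make the retrieval step concrete, the toy example below ranks stored chunks by cosine similarity to a query vector. It is only an illustration: the vectors are made-up 3-dimensional stand-ins for real 384-dimensional embeddings, the `cosine_similarity` and `top_k` helpers are hypothetical, and the distance metric ChromaDB actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k chunk texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up 3-dimensional vectors standing in for real embeddings.
STORE = [
    ("Chunk about invoices", [0.9, 0.1, 0.0]),
    ("Chunk about shipping", [0.1, 0.9, 0.1]),
    ("Chunk about refunds", [0.7, 0.2, 0.3]),
]

if __name__ == "__main__":
    # The query vector points mostly along the first axis, so the
    # invoices and refunds chunks rank above the shipping chunk.
    print(top_k([1.0, 0.0, 0.1], STORE))
```

A vector database does the same kind of ranking, but uses an index so that it does not have to compare the query against every stored vector.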


## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- [LangChain](https://github.com/langchain-ai/langchain) - Framework for developing applications with LLMs
- [ChromaDB](https://www.trychroma.com/) - AI-native open-source embedding database
- [Sentence Transformers](https://www.sbert.net/) - Framework for state-of-the-art sentence embeddings
- [Hugging Face](https://huggingface.co/) - ML model repository

Repository: Mahbub292/Simple-Embadding-Project