The visual embedding workbench.
Ingest any text dataset → generate vector embeddings → explore clusters interactively → search with RAG. No coding required.
| Step | What you do | What EmbedAtlas does |
|---|---|---|
| 1 · Ingest | Upload files, point to a folder, paste a HuggingFace dataset ID, or drop in a URL | Loads and chunks your text with overlap-aware splitting |
| 2 · Embed | Pick an embedding model from the dropdown | Encodes every chunk and stores vectors in ChromaDB |
| 3 · Explore | Choose PCA, UMAP, or t-SNE | Renders an interactive Plotly scatter — hover any point to read the original text |
| 4 · Search | Type a query | Returns ranked results via keyword, semantic, or hybrid search |
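The "overlap-aware splitting" in step 1 can be pictured with a minimal sketch. This is an illustration of the general technique, not EmbedAtlas's actual chunker — the function name and default sizes here are assumptions:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into ~chunk_size-character chunks, each sharing
    `overlap` characters with the previous chunk so that sentences
    cut at a boundary still appear whole in one of the two chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 1200 characters with chunk_size=500 and overlap=50 yields 3 chunks
text = "".join(chr(97 + i % 26) for i in range(1200))
chunks = chunk_text(text)
```

The overlap means a query matching text near a chunk boundary can still retrieve a chunk that contains the full surrounding context.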
```bash
pip install embedatlas
embedatlas
```

Your browser opens automatically at http://localhost:8501.
EmbedAtlas works on CPU out of the box. For large datasets, a GPU dramatically speeds up embedding:
```bash
# NVIDIA (CUDA 12)
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install embedatlas

# Apple Silicon (MPS)
pip install torch  # MPS is included in the standard macOS wheel
pip install embedatlas
```

The default PDF parser is PyMuPDF (fast, lightweight). For layout-aware parsing of tables and structured PDFs, install Docling:
```bash
pip install "embedatlas[docling]"
```

Then toggle "Docling" in the PDF parser option on the Ingest page. (The quotes keep shells like zsh from interpreting the square brackets.)
| Source | Examples |
|---|---|
| Local files | .txt .md .rst .pdf .csv .tsv .json .jsonl |
| Local folder | Any directory — EmbedAtlas recurses into sub-folders |
| HuggingFace dataset | allenai/c4, wikitext, imdb, any public dataset repo |
| URL | Any direct link to a supported file type |
EmbedAtlas ships with a curated selection of SentenceTransformers models:
| Model | Dims | Best for |
|---|---|---|
| all-mpnet-base-v2 | 768 | High-quality general English (default) |
| all-MiniLM-L6-v2 | 384 | Fast, great general English |
| BAAI/bge-small-en-v1.5 | 384 | RAG retrieval on English |
| BAAI/bge-m3 | 1024 | Multilingual, state-of-the-art |
| paraphrase-multilingual-MiniLM-L12-v2 | 384 | Fast multilingual (50+ languages) |
Any model from the SentenceTransformers model hub can be used by editing config.py.
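As a rough illustration of what such an edit might look like — the actual structure of config.py is an assumption here, as is the added model name:

```python
# Hypothetical sketch of a config.py model registry — the real file
# layout may differ. Keys are SentenceTransformers model IDs, values
# are embedding dimensions.
EMBEDDING_MODELS = {
    "all-mpnet-base-v2": 768,
    "all-MiniLM-L6-v2": 384,
    # Custom addition pulled from the SentenceTransformers hub:
    "intfloat/multilingual-e5-base": 768,
}
DEFAULT_MODEL = "all-mpnet-base-v2"
```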
EmbedAtlas persists all embeddings in ChromaDB at data/collections/. Collections survive between sessions — open the app, pick up where you left off.
- Create a collection during Ingest
- Switch between collections from the sidebar
- Delete or rename collections via the sidebar settings panel
| Mode | How it works |
|---|---|
| Hybrid (recommended) | Keyword hits ranked first, semantic results fill the rest. Each result shows a match badge. |
| Semantic | Pure vector similarity — finds conceptually related chunks even without exact word matches |
| Keyword | BM25 term matching — fast, exact, no model required |
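The hybrid ordering described above — keyword hits first, semantic results filling the remaining slots — can be sketched in a few lines. This is a simplified illustration (plain term overlap instead of BM25, and a precomputed semantic ranking), not EmbedAtlas's internal implementation:

```python
def keyword_score(query, doc):
    """Naive keyword match: number of query terms present in the doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def hybrid_search(query, docs, semantic_ranking, k=3):
    """Rank keyword hits first, then fill remaining slots from a
    precomputed semantic ranking of document indices."""
    scored = [(i, keyword_score(query, d)) for i, d in enumerate(docs)]
    kw_hits = [i for i, s in sorted(scored, key=lambda t: -t[1]) if s > 0]
    results = kw_hits[:k]
    for i in semantic_ranking:          # semantic results fill the rest
        if len(results) >= k:
            break
        if i not in results:
            results.append(i)
    return results

docs = ["the cat sat", "dogs bark loudly", "feline behaviour"]
order = hybrid_search("cat", docs, semantic_ranking=[2, 0, 1])
```

Note how "feline behaviour" is surfaced by the semantic stage even though it shares no words with the query — exactly the case pure keyword search misses.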
- Always attach a label during ingestion. Labels colour the scatter points in Explore and make clusters interpretable.
- t-SNE is best for ≤10k points. For larger collections, use UMAP or let EmbedAtlas switch to centroid-per-label mode automatically.
- Hybrid search is usually the right default. Use pure semantic search when your query is a concept or sentence rather than a specific term.
- For very large datasets, set a max rows limit on the HuggingFace tab to avoid OOM errors during embedding.
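The centroid-per-label fallback mentioned in the t-SNE tip reduces a large cloud of points to one mean vector per label before plotting. A minimal sketch of the idea, assuming simple arithmetic means (EmbedAtlas's actual reduction may differ):

```python
from collections import defaultdict

def label_centroids(vectors, labels):
    """Collapse per-point vectors into one mean vector per label,
    so huge collections can be plotted as a handful of centroids."""
    groups = defaultdict(list)
    for vec, lab in zip(vectors, labels):
        groups[lab].append(vec)
    return {
        lab: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for lab, vecs in groups.items()
    }

vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = ["news", "news", "sport"]
centroids = label_centroids(vectors, labels)
# → {'news': [2.0, 0.0], 'sport': [0.0, 2.0]}
```

Plotting one point per label keeps the scatter responsive regardless of collection size, at the cost of hiding within-label structure.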
```bash
git clone https://github.com/your-org/embedatlas
cd embedatlas
pip install -e ".[dev]"
streamlit run embedatlas/app.py
```