embedsim

Measure semantic similarity and detect outliers in text collections using embeddings.

embedsim is a lightweight Python library for quantifying how closely texts relate to one another. It provides two core functions: pairsim, for pairwise similarity between two texts, and groupsim, for coherence scores across a collection.

Use cases:

  • Content moderation: Find off-topic comments or reviews
  • Document clustering: Identify outliers before grouping
  • Quality assurance: Verify generated content stays on topic
  • Search relevance: Score how well results match a query theme
  • Duplicate detection: Compare documents for similarity

Installation

For OpenAI models:

uv add 'embedsim[openai]'
export OPENAI_API_KEY=your-key-here

For local models:

uv add 'embedsim[sentence-transformers]'

Quick Start

Pairwise Similarity

Compare two texts directly:

import embedsim

# Similar texts
score = embedsim.pairsim(
    "The cat sat on the mat",
    "A feline rested on the rug"
)
print(score)  # 0.89

# Dissimilar texts
score = embedsim.pairsim(
    "The cat sat on the mat",
    "Python is a programming language"
)
print(score)  # 0.21

Group Coherence

Analyze a collection and find outliers:

import embedsim

texts = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Machine learning uses neural networks",
    "Pizza is a popular food"  # This doesn't belong
]

scores = embedsim.groupsim(texts)
# [0.76, 0.73, 0.71, 0.28]
#                    ~~~~ Outlier detected!
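A common follow-up is filtering out the low-scoring texts. A minimal sketch, reusing the scores from the example above; the 0.5 cutoff is an illustrative choice, not an embedsim default:

```python
texts = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Machine learning uses neural networks",
    "Pizza is a popular food",
]
scores = [0.76, 0.73, 0.71, 0.28]  # e.g. from embedsim.groupsim(texts)

CUTOFF = 0.5  # arbitrary threshold for this example
outliers = [t for t, s in zip(texts, scores) if s < CUTOFF]
print(outliers)  # ['Pizza is a popular food']
```

Pick the cutoff empirically for your data; coherent collections tend to cluster well above loosely related ones.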

API Reference

pairsim(text_a, text_b, model_id=None, **config) → float

Compute similarity between two texts.

  • Converts both texts to embeddings
  • Computes cosine similarity
  • Returns a single similarity score (0-1, higher = more similar)
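The cosine-similarity step can be sketched in plain Python. Note that raw cosine similarity ranges over [-1, 1] in general; in practice, text-embedding pairs tend to land in the upper part of that range:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```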

groupsim(texts, model_id=None, **config) → list[float]

Compute coherence scores for a collection of texts.

  • Converts all texts to embeddings
  • Calculates the centroid (average) of all embeddings
  • Measures how close each text is to the centroid
  • Returns coherence scores (0-1, higher = more coherent)

This centroid-based approach gives you a score per text showing how well it fits with the group's semantic theme.
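The steps above can be sketched without dependencies. This is a hypothetical helper illustrating the centroid approach, not embedsim's internal code:

```python
import math

def centroid_coherence(vectors: list[list[float]]) -> list[float]:
    """Score each vector by its cosine similarity to the group centroid."""
    dim = len(vectors[0])
    n = len(vectors)
    # Centroid: element-wise mean of all embedding vectors
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    return [cos(v, centroid) for v in vectors]

# Two vectors pointing roughly the same way plus one outlier:
# the outlier pulls the centroid a little, but still scores lowest.
scores = centroid_coherence([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```

Because every text contributes to the centroid, a single strong outlier slightly lowers everyone's score; with larger collections this effect shrinks.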

Configuration

Runtime Configuration

Modify the config object directly in your code:

import embedsim

# Change default model at runtime
embedsim.config.model = "jinaai/jina-embeddings-v2-base-en"

# Now all calls use the new default
score = embedsim.pairsim("hello", "hi")

Environment Variables

Alternatively, set configuration via environment variables:

# Set default model
export EMBEDSIM_MODEL=jinaai/jina-embeddings-v2-base-en

# Use custom OpenAI key
export EMBEDSIM_OPENAI_API_KEY=sk-...

Models

embedsim supports both OpenAI's API and local sentence-transformer models.

See MODELS.md for detailed model comparison and selection guide.

OpenAI (default, requires API key):

# Best for production - fast, accurate, no model downloads
score = embedsim.pairsim(text_a, text_b)  # uses openai/text-embedding-3-small
scores = embedsim.groupsim(texts, model_id="openai/text-embedding-3-large")

Local models (privacy, offline):

# Run entirely on your machine
score = embedsim.pairsim(text_a, text_b, model_id="jinaai/jina-embeddings-v2-base-en")
scores = embedsim.groupsim(texts, model_id="sentence-transformers/all-MiniLM-L6-v2")

Development

# Install with dev dependencies
uv sync --all-extras

# Run tests and benchmarks
make test

License

MIT
