embedsim

Measure semantic similarity and detect outliers in text collections using embeddings.

embedsim is a lightweight Python library for quantifying how closely texts relate to one another. It provides two core functions: pairsim, for pairwise similarity between two texts, and groupsim, for coherence scores across a collection.

Use cases:

  • Content moderation: Find off-topic comments or reviews
  • Document clustering: Identify outliers before grouping
  • Quality assurance: Verify generated content stays on topic
  • Search relevance: Score how well results match a query theme
  • Duplicate detection: Compare documents for similarity

Installation

For OpenAI models:

uv add 'embedsim[openai]'
export OPENAI_API_KEY=your-key-here

For local models:

uv add 'embedsim[sentence-transformers]'

Quick Start

Pairwise Similarity

Compare two texts directly:

import embedsim

# Similar texts
score = embedsim.pairsim(
    "The cat sat on the mat",
    "A feline rested on the rug"
)
print(score)  # 0.89

# Dissimilar texts
score = embedsim.pairsim(
    "The cat sat on the mat",
    "Python is a programming language"
)
print(score)  # 0.21

Group Coherence

Analyze a collection and find outliers:

import embedsim

texts = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Machine learning uses neural networks",
    "Pizza is a popular food"  # This doesn't belong
]

scores = embedsim.groupsim(texts)
# [0.76, 0.73, 0.71, 0.28]
#                    ~~~~ Outlier detected!
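A common follow-up is filtering out the low-scoring texts. A minimal sketch, reusing the scores from the example above; the 0.5 cutoff is an illustrative choice, not an embedsim default:

```python
texts = [
    "Python is a programming language",
    "JavaScript is used for web development",
    "Machine learning uses neural networks",
    "Pizza is a popular food",
]
scores = [0.76, 0.73, 0.71, 0.28]  # e.g. from embedsim.groupsim(texts)

CUTOFF = 0.5  # arbitrary threshold for this example
outliers = [t for t, s in zip(texts, scores) if s < CUTOFF]
print(outliers)  # ['Pizza is a popular food']
```

Pick the cutoff empirically for your data; coherent collections tend to cluster well above loosely related ones.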

API Reference

pairsim(text_a, text_b, model_id=None, **config) → float

Compute similarity between two texts.

  • Converts both texts to embeddings
  • Computes cosine similarity
  • Returns a single similarity score (0-1, higher = more similar)
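The cosine-similarity step can be sketched in plain Python. Note that raw cosine similarity ranges over [-1, 1] in general; in practice, text-embedding pairs tend to land in the upper part of that range:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```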

groupsim(texts, model_id=None, **config) → list[float]

Compute coherence scores for a collection of texts.

  • Converts all texts to embeddings
  • Calculates the centroid (average) of all embeddings
  • Measures how close each text is to the centroid
  • Returns coherence scores (0-1, higher = more coherent)

This centroid-based approach gives you a score per text showing how well it fits with the group's semantic theme.
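The steps above can be sketched without dependencies. This is a hypothetical helper illustrating the centroid approach, not embedsim's internal code:

```python
import math

def centroid_coherence(vectors: list[list[float]]) -> list[float]:
    """Score each vector by its cosine similarity to the group centroid."""
    dim = len(vectors[0])
    n = len(vectors)
    # Centroid: element-wise mean of all embedding vectors
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    return [cos(v, centroid) for v in vectors]

# Two vectors pointing roughly the same way plus one outlier:
# the outlier pulls the centroid a little, but still scores lowest.
scores = centroid_coherence([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
```

Because every text contributes to the centroid, a single strong outlier slightly lowers everyone's score; with larger collections this effect shrinks.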

Configuration

Runtime Configuration

Modify the config object directly in your code:

import embedsim

# Change default model at runtime
embedsim.config.model = "jinaai/jina-embeddings-v2-base-en"

# Now all calls use the new default
score = embedsim.pairsim("hello", "hi")

Environment Variables

Alternatively, set configuration via environment variables:

# Set default model
export EMBEDSIM_MODEL=jinaai/jina-embeddings-v2-base-en

# Use custom OpenAI key
export EMBEDSIM_OPENAI_API_KEY=sk-...

Models

embedsim supports both OpenAI's API and local sentence-transformer models.

See MODELS.md for detailed model comparison and selection guide.

OpenAI (default, requires API key):

# Best for production - fast, accurate, no model downloads
score = embedsim.pairsim(text_a, text_b)  # uses openai/text-embedding-3-small
scores = embedsim.groupsim(texts, model_id="openai/text-embedding-3-large")

Local models (privacy, offline):

# Run entirely on your machine
score = embedsim.pairsim(text_a, text_b, model_id="jinaai/jina-embeddings-v2-base-en")
scores = embedsim.groupsim(texts, model_id="sentence-transformers/all-MiniLM-L6-v2")

Development

# Install with dev dependencies
uv sync --all-extras

# Run tests and benchmarks
make test

License

MIT
