# Embeddings Service Notebook

This is the first custom notebook to document and test how Aurite will use txtai's features.

The first three notebooks (including this one) will focus on the functionality of txtai's embeddings service.

This service will be used to embed and manage documents. It is the central component of txtai.

We will start by setting up the embeddings service with a basic configuration.

In [1]:
# Cell 1 - Imports and Setup
from pathlib import Path
import os
from dotenv import load_dotenv
from txtai.embeddings import Embeddings
import logging

# Setup logging with a clear format
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

True

# Embeddings Setup

This section sets up a basic configuration for the embeddings service.

The embeddings service is a wrapper around the txtai embeddings library and starts with a minimal set of parameters: a model, content storage, a vector storage backend, hybrid search, and normalization.

NOTE: This is not the final configuration, but a starting point for testing the embeddings service.

In [2]:
# Cell 2 - Basic Configuration
# Create a minimal embeddings config to start
config = {
    "path": "sentence-transformers/nli-mpnet-base-v2",  # Model choice
    "content": True,                                    # Enable content storage
    "backend": "faiss",                                # Vector storage backend
    "hybrid": True,                                    # Enable hybrid search
    "normalize": True                                  # Normalize vectors
}

# Initialize embeddings
logger.info("Initializing embeddings...")
embeddings = Embeddings(config)

2024-11-17 16:47:37,296 - INFO - Initializing embeddings...


# Cell 3 - Test Documents

Next, we create some test documents and index them into the embeddings service.

Indexing is how documents are added to the service: each document is embedded and the resulting vector is stored in the vector database.

In [3]:
# Cell 3 - Test Documents
import json

# Create test documents with different formats
test_docs = [
    # Basic documents (id, text, tags)
    (0, "This is a test document about machine learning", None),
    (1, "Another document about cloud computing", None),
    (2, "Document about natural language processing", None),

    # Documents with metadata - convert dict to JSON string
    (3, "Document about vector databases",
     json.dumps({"category": "databases", "type": "technical"})),
    (4, "Introduction to embeddings",
     json.dumps({"category": "ml", "type": "introduction"}))
]

# Index documents
logger.info("Indexing test documents...")
embeddings.index(test_docs)
logger.info(f"Total documents indexed: {embeddings.count()}")

2024-11-17 16:47:38,241 - INFO - Indexing test documents...
2024-11-17 16:47:38,400 - INFO - Total documents indexed: 5
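
The (id, text, tags) tuples used above can be built with a small helper. The following is a minimal sketch: `to_index_tuple` is a hypothetical name, not part of txtai, and it only mirrors the convention used in this notebook of serializing metadata into the third tuple slot as a JSON string.

```python
import json

def to_index_tuple(doc_id, text, metadata=None):
    """Build an (id, text, tags) tuple; a metadata dict is stored as a JSON string."""
    # Hypothetical helper: mirrors this notebook's convention, not a txtai API
    tags = json.dumps(metadata, sort_keys=True) if metadata is not None else None
    return (doc_id, text, tags)

docs = [
    to_index_tuple(0, "This is a test document about machine learning"),
    to_index_tuple(3, "Document about vector databases",
                   {"category": "databases", "type": "technical"}),
]
```

Keeping the tuple construction in one place makes it easier to change how metadata is stored later without touching every document literal.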


# Cell 4 - Understanding Result Formats

In this section, we will test the basic search functionality of the embeddings service.

Before we display the results, we will create a utility function to inspect the results so we can understand the data structure.

The result format depends on the configuration (for example, whether content storage is enabled), so it is important to test and document it.


In [4]:
# Cell 4 - Understanding Result Formats
def inspect_results(results, label="Search Results"):
    """Utility function to inspect search result format"""
    logger.info(f"\n{label}")
    logger.info(f"Result type: {type(results)}")
    if results:
        logger.info(f"First result type: {type(results[0])}")
        logger.info(f"First result content: {results[0]}")
        if isinstance(results[0], tuple):
            logger.info("Format: (id, score) tuple")
        elif isinstance(results[0], dict):
            logger.info(f"Available keys: {results[0].keys()}")
    return results

# Test basic search
basic_results = inspect_results(
    embeddings.search("machine learning", 2),
    "Basic Search Results"
)

2024-11-17 16:47:38,434 - INFO - 
Basic Search Results


2024-11-17 16:47:38,434 - INFO - Result type: <class 'list'>
2024-11-17 16:47:38,434 - INFO - First result type: <class 'dict'>
2024-11-17 16:47:38,435 - INFO - First result content: {'id': '0', 'text': 'This is a test document about machine learning', 'score': 0.6211893865296524}
2024-11-17 16:47:38,435 - INFO - Available keys: dict_keys(['id', 'text', 'score'])
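
Since results can arrive either as dictionaries or as (id, score) tuples, downstream code may be simpler if it first converts both into one shape. The sketch below assumes only those two formats; `normalize_results` is a hypothetical helper, and the sample values are illustrative rather than real search output.

```python
def normalize_results(results):
    """Return a list of dicts with 'id', 'score', and 'text' keys for either format."""
    normalized = []
    for result in results:
        if isinstance(result, dict):
            # Dictionary format, as seen in this notebook with content storage enabled
            normalized.append({
                "id": str(result["id"]),
                "score": result["score"],
                "text": result.get("text"),
            })
        else:
            # Assumed (id, score) tuple format; no text is available in this case
            doc_id, score = result
            normalized.append({"id": str(doc_id), "score": score, "text": None})
    return normalized

# Illustrative inputs only
sample_dicts = [{"id": "0", "text": "This is a test document about machine learning", "score": 0.62}]
sample_tuples = [(0, 0.62)]
```

Ids are converted to strings here because the dictionary results in this notebook report the id as '0' rather than 0.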


# Cell 5 - Search Result Processing

Here we process the results returned by the search in the previous cell.

Because content storage is enabled, each result is a dictionary, so we read the text and score by key and then look up any metadata in the original test documents.


In [5]:
# Cell 5 - Search Result Processing (updated)
logger.info("\nProcessing Search Results:")
for result in basic_results:
    # Handle dictionary format
    logger.info(f"Text: {result['text']}")
    logger.info(f"Score: {result['score']}")

    # Look up original document for metadata
    doc_id = int(result['id'])
    if doc_id < len(test_docs):
        original_doc = test_docs[doc_id]
        if original_doc[2]:  # metadata is at index 2
            try:
                metadata = json.loads(original_doc[2])
                logger.info(f"Metadata: {metadata}")
            except json.JSONDecodeError:
                logger.info(f"Raw Metadata: {original_doc[2]}")
    logger.info("---")

2024-11-17 16:47:38,441 - INFO - 
Processing Search Results:
2024-11-17 16:47:38,442 - INFO - Text: This is a test document about machine learning
2024-11-17 16:47:38,443 - INFO - Score: 0.6211893865296524
2024-11-17 16:47:38,443 - INFO - ---
2024-11-17 16:47:38,444 - INFO - Text: Document about natural language processing
2024-11-17 16:47:38,444 - INFO - Score: 0.1778886765241623
2024-11-17 16:47:38,445 - INFO - ---
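
The lookup in the cell above uses the result id as a list position, which only works while ids match the order of the test documents. A more defensive variant, sketched below, keys the lookup on the id itself. The names `tags_by_id` and `metadata_for` are hypothetical, and the two documents are a shortened copy of the test set.

```python
import json

test_docs = [
    (0, "This is a test document about machine learning", None),
    (3, "Document about vector databases",
     json.dumps({"category": "databases", "type": "technical"})),
]

# Map string ids (as returned in search results) to the stored tags field
tags_by_id = {str(doc_id): tags for doc_id, _text, tags in test_docs}

def metadata_for(result_id):
    """Return parsed metadata for a result id, or None if absent or not valid JSON."""
    raw = tags_by_id.get(str(result_id))
    if not raw:
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None
```

With this mapping, ids no longer need to be contiguous or start at zero.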


# Cell 6 - Different Search Types

This section runs several queries against the index and logs the top result for each.

The index was created with hybrid search enabled, and both search calls in the cell below use the same arguments, so they return the same result for each query.


In [6]:
# Cell 6 - Different Search Types
# Compare different search approaches
queries = [
    "machine learning",
    "cloud storage",
    "vector database"
]

logger.info("\nComparing Search Types:")
for query in queries:
    logger.info(f"\nQuery: {query}")

    # Default search; the index was built with hybrid=True
    basic = embeddings.search(query, 1)
    logger.info(f"Basic Search Score: {basic[0]['score']}")  # Using dictionary key 'score'
    logger.info(f"Basic Search Text: {basic[0]['text']}")

    # Same call repeated: hybrid scoring (semantic + BM25) comes from the index config,
    # so this returns the same result as above. To vary the blend, txtai's search
    # accepts a weights argument.
    hybrid = embeddings.search(query, 1)
    logger.info(f"Hybrid Search Score: {hybrid[0]['score']}")  # Using dictionary key 'score'
    logger.info(f"Hybrid Search Text: {hybrid[0]['text']}")
    logger.info("---")

2024-11-17 16:47:38,454 - INFO - 
Comparing Search Types:
2024-11-17 16:47:38,454 - INFO - 
Query: machine learning
2024-11-17 16:47:38,475 - INFO - Basic Search Score: 0.6211893865296524
2024-11-17 16:47:38,476 - INFO - Basic Search Text: This is a test document about machine learning
2024-11-17 16:47:38,495 - INFO - Hybrid Search Score: 0.6211893865296524
2024-11-17 16:47:38,495 - INFO - Hybrid Search Text: This is a test document about machine learning
2024-11-17 16:47:38,495 - INFO - ---
2024-11-17 16:47:38,496 - INFO - 
Query: cloud storage
2024-11-17 16:47:38,517 - INFO - Basic Search Score: 0.5902987615700078
2024-11-17 16:47:38,518 - INFO - Basic Search Text: Another document about cloud computing
2024-11-17 16:47:38,534 - INFO - Hybrid Search Score: 0.5902987615700078
2024-11-17 16:47:38,534 - INFO - Hybrid Search Text: Another document about cloud computing
2024-11-17 16:47:38,534 - INFO - ---
2024-11-17 16:47:38,535 - INFO - 
Query: vector database
2024-11-17 16:47:38,557 - 
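
To make the idea of combining semantic and keyword scores concrete, the sketch below blends two sets of per-document scores with a single weight. This is an illustration of the general technique only: `blend_scores` and `dense_weight` are hypothetical names, the scores are made up, and txtai's internal scoring may differ (for example in how scores are normalized or fused).

```python
def blend_scores(dense, sparse, dense_weight=0.5):
    """Convex combination of per-document dense and sparse scores.

    dense and sparse map document id -> score, assumed to be on a comparable scale.
    """
    if not 0.0 <= dense_weight <= 1.0:
        raise ValueError("dense_weight must be between 0 and 1")
    combined = {}
    for doc_id in set(dense) | set(sparse):
        # A document missing from one side contributes 0 for that side
        combined[doc_id] = (dense_weight * dense.get(doc_id, 0.0)
                            + (1.0 - dense_weight) * sparse.get(doc_id, 0.0))
    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Made-up scores for illustration only
dense_scores = {"0": 0.8, "2": 0.4}
sparse_scores = {"0": 0.6, "1": 0.2}
ranked = blend_scores(dense_scores, sparse_scores, dense_weight=0.5)
```

Setting the weight to 1.0 ranks by the dense scores alone and 0.0 by the sparse scores alone, which is one way to see what each side contributes to a query.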

# Cell 7 - Switch Embeddings Config

Here we create a second embeddings instance with a different configuration (content storage disabled) and re-index the test documents.

This lets us test a different configuration side by side while keeping the original index available.

In [8]:
# Cell 7 - Switch Embeddings Config
new_config = {
    "path": "sentence-transformers/nli-mpnet-base-v2",
    "content": False,
    "backend": "faiss",
    "hybrid": True,
    "normalize": True
}

# Create new embeddings instance with new config
logger.info("Initializing new embeddings configuration...")
new_embeddings = Embeddings(new_config)

# Re-index the test documents with new configuration
logger.info("Re-indexing test documents with new configuration...")
new_embeddings.index(test_docs)

2024-11-17 16:48:19,302 - INFO - Initializing new embeddings configuration...
2024-11-17 16:48:19,660 - INFO - Re-indexing test documents with new configuration...
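
When several configurations differ by only one or two keys, deriving them from a shared base avoids copy-and-paste drift. The following is a minimal sketch; `config_variant` is a hypothetical helper, and the base values are copied from the configuration earlier in this notebook.

```python
base_config = {
    "path": "sentence-transformers/nli-mpnet-base-v2",
    "content": True,
    "backend": "faiss",
    "hybrid": True,
    "normalize": True,
}

def config_variant(base, **overrides):
    """Return a copy of base with overrides applied; base is left unchanged."""
    return {**base, **overrides}

# Same settings as the base, but with content storage disabled
no_content_config = config_variant(base_config, content=False)
```

The resulting dictionary can be passed to a new embeddings instance in the same way as the hand-written one above.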


In [9]:
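# Test the new configuration
# NOTE: this call queries the original embeddings object (content=True), which is why
# the output below still includes 'text'. Use new_embeddings.search(...) to query the
# content=False index, which returns (id, score) tuples.
new_results = embeddings.search("machine learning", 1)
inspect_results(new_results, "New Configuration Search Results")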

2024-11-17 16:48:22,465 - INFO - 
New Configuration Search Results
2024-11-17 16:48:22,469 - INFO - Result type: <class 'list'>
2024-11-17 16:48:22,469 - INFO - First result type: <class 'dict'>
2024-11-17 16:48:22,470 - INFO - First result content: {'id': '0', 'text': 'This is a test document about machine learning', 'score': 0.6211893865296524}
2024-11-17 16:48:22,470 - INFO - Available keys: dict_keys(['id', 'text', 'score'])


[{'id': '0',
  'text': 'This is a test document about machine learning',
  'score': 0.6211893865296524}]

# Conclusion

This notebook is a guide to how we can use txtai's embeddings functionality through our embeddings service.

We configured the service, indexed test documents, and ran searches against a hybrid-enabled index.

We also inspected the result format, processed the results, and created a second index with a different configuration.