Module for creating and managing text embeddings using Google Cloud's Vertex AI models.
This module allows you to:
- Create remote models in Vertex AI using Google embeddings
- Generate embeddings from tables with text fields
- Compare embedding models using a Streamlit app
- Deploy infrastructure using Terraform
embeddings/
├── src/                         # Main code
│   ├── embeddings.py            # SemanticSearch class
│   ├── scripts/                 # Utility scripts
│   │   ├── create_embeddings.py # Create BigQuery embeddings tables
│   │   └── validator.py         # Streamlit app to compare models
│   └── utils/                   # Helper functions
├── terraform/                   # Infrastructure code
├── pyproject.toml               # Dependencies
└── README.md                    # This file
Terraform creates remote models that can be called from BigQuery, backed by existing Vertex AI embedding models:
- text-embedding-005: Standard embedding model
- gemini-embedding-001: Advanced Gemini embedding model
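Once deployed, these remote models can be invoked from BigQuery with ML.GENERATE_EMBEDDING. A minimal sketch using the google-cloud-bigquery client (the project, dataset, model, and table names below are placeholders, not the module's actual names):
from google.cloud import bigquery

# Assumed names: adjust project, dataset, remote model, and source table to your setup.
client = bigquery.Client(project="your-project-id")
sql = """
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `your_dataset.text_embedding_005`,
  (SELECT description AS content FROM `your_dataset.companies`)
)
"""
# Each output row contains the input columns plus the embedding vector
# in ml_generate_embedding_result.
for row in client.query(sql).result():
    print(row["ml_generate_embedding_result"][:5])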
Costs per 1000 characters:
- text-embedding-005: $0.00002 (0.002 cents)
- gemini-embedding-001: $0.00012 (0.012 cents)
Once the models are created, use the create_embeddings.py script to:
- Check character limits before processing (default: 300M characters)
- Calculate costs:
- text-embedding-005: 300,000,000 × $0.00002 / 1,000 = $6
- gemini-embedding-001: 300,000,000 × $0.00012 / 1,000 = $36
- Create new BigQuery table with embeddings for the selected text field
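As a quick sanity check, the same arithmetic in Python (prices per 1,000 characters as listed above):
# Illustrative cost estimate matching the figures above.
PRICE_PER_1K_CHARS = {
    "text-embedding-005": 0.00002,
    "gemini-embedding-001": 0.00012,
}
characters = 300_000_000  # default character limit used by create_embeddings.py
for model, price in PRICE_PER_1K_CHARS.items():
    print(f"{model}: ${characters / 1000 * price:.2f}")
# text-embedding-005: $6.00
# gemini-embedding-001: $36.00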
The module includes example queries that work on embedding tables you've created. Replace the table names in these queries with your actual embedding tables.
The validator.py script runs a simple Streamlit app that allows you to:
- Select a query (on an embedding table)
- Input search terms to see results
- Select two models to compare embeddings
- Compare results from both models side by side
Note: You must have created the embedding table with both models beforehand.
Embedding model prices in Vertex AI:
- text-embedding-005: $0.00002 per 1000 characters
- gemini-embedding-001: $0.00012 per 1000 characters
For updated pricing, check the official Google Cloud documentation.
To compare the performance of different embedding models, check the MTEB leaderboard on Hugging Face.
GOOGLE_APPLICATION_CREDENTIALS=path/to/credentials.json
GCP_PROJECT_ID=your-project-id
GCP_BIG_QUERY_DATABASE=your-database
cd terraform
terraform init
terraform apply
cd src/utils/ml/embeddings/src/scripts
python create_embeddings.py
cd src/utils/ml/embeddings/src/scripts
python validator.py
- Models are deployed in Vertex AI for scalability
- Embedding tables are created automatically in BigQuery
- The Streamlit app allows validation and result comparison
- Support for multiple Google embedding models
- Character limits and cost calculations are built-in
The validator.py Streamlit app lets you compare two embedding model versions side by side. Each version is typically stored in its own table. You can:
- Compare results from two models with the same user query
- Parameterize the SQL with placeholders like {user_query}, {model}, {table_name}, {limit_results}, and an optional {min_date} (see the template sketch below)
- Inspect distances, overlap, and detailed content/metadata
- Select left and right models and their table names
- Write your text (search query) and set the number of results
- Select a predefined query or write a query
- Run the comparison to view side-by-side tables with results and distance metrics
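As a rough illustration of the placeholder convention (the template below is hypothetical; validator.py ships its own predefined queries), the substitution might look like:
# Hypothetical query template; the module's real templates may differ.
QUERY_TEMPLATE = """
SELECT
  company_name,
  ML.DISTANCE(
    ml_generate_embedding_result,
    (SELECT ml_generate_embedding_result
     FROM ML.GENERATE_EMBEDDING(
       MODEL `{model}`,
       (SELECT '{user_query}' AS content))),
    'COSINE') AS distance
FROM `{table_name}`
ORDER BY distance
LIMIT {limit_results}
"""

sql = QUERY_TEMPLATE.format(
    user_query="logistics companies",
    model="your_dataset.text_embedding_005",
    table_name="your_dataset.profiles_df_embeddings",
    limit_results=10,
)
# An optional {min_date} placeholder can be added as a WHERE filter in the same way.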
This module includes a migration system to transfer embeddings from BigQuery to Pinecone, enabling high-performance vector search and RAG (Retrieval-Augmented Generation) applications.
Create a pinecone_migrations.yaml file to configure your migration:
tables:
  profiles_df_embeddings:
    index_name: "profiles-df"
    dimension: 768
    fields_to_include: []
    fields_to_exclude: []
    metadata_fields: ["company_name", "sector", "description"]
    id_fields: ["company_id", "symbol"]
    text_field: "description"
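A minimal sketch of reading this configuration (assuming PyYAML; the actual migration script may parse it differently):
import yaml

# Load the per-table migration settings shown above.
with open("pinecone_migrations.yaml") as f:
    config = yaml.safe_load(f)

for table, settings in config["tables"].items():
    print(table, "->", settings["index_name"], f"({settings['dimension']} dims)")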
cd src/utils/ml/embeddings/src/scripts/migrations
python migrate_to_pinecone.py
# Required for migration
GCP_PROJECT_ID=your-project-id
GCP_BIG_QUERY_DATABASE_EMBEDDINGS=your-database
PINECONE_API_KEY=your-pinecone-key
PINECONE_ENVIRONMENT=eu-west-1
CREDENTIALS_PATH_EMBEDDINGS=path/to/credentials.json
# Optional
MIGRATION_BATCH_SIZE=200
MIGRATION_MAX_WORKERS=4
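Conceptually, the migration reads embedding rows from BigQuery and upserts them to Pinecone in batches. A rough sketch of that flow, assuming the current Pinecone Python SDK (table, column, and ID choices below are illustrative, not the script's actual code):
import os
from google.cloud import bigquery
from pinecone import Pinecone

client = bigquery.Client(project=os.environ["GCP_PROJECT_ID"])
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("profiles-df")
batch_size = int(os.getenv("MIGRATION_BATCH_SIZE", "200"))

rows = client.query(
    "SELECT company_id, symbol, company_name, sector, description, "
    "ml_generate_embedding_result AS embedding "
    "FROM `your_dataset.profiles_df_embeddings`"
).result()

batch = []
for row in rows:
    batch.append({
        "id": f"{row['company_id']}_{row['symbol']}",  # id_fields joined into a vector ID
        "values": list(row["embedding"]),
        "metadata": {
            "company_name": row["company_name"],
            "sector": row["sector"],
            "description": row["description"],
        },
    })
    if len(batch) >= batch_size:
        index.upsert(vectors=batch)
        batch = []
if batch:
    index.upsert(vectors=batch)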
The module includes an example RAG system that integrates VertexAI embeddings with Pinecone:
cd src/utils/ml/embeddings/src/scripts/examples
python rag_example.py
Components:
- VertexAI Embeddings: Uses the text-embedding-005 model
- Pinecone Vector Store: High-performance vector search
- Google Generative AI: gemini-2.0-flash-lite for text generation
- LangChain Integration: Complete RAG pipeline
from src.utils.ml.embeddings.src.scripts.examples.rag_example import RAGSystem
# Initialize RAG system
rag_system = RAGSystem()
# Ask questions
result = rag_system.ask_question("What are the best tech companies in logistics?")
print(result['answer'])
# Perform similarity search
results = rag_system.similarity_search("AI companies", k=5)
for doc in results:
    print(doc.page_content[:200])
The RAG system includes an interactive command-line interface:
RAG System with VertexAI Embeddings and Pinecone
============================================================
Initializing RAG System...
Setting up VertexAI embeddings...
Embeddings ready!
Setting up Pinecone vector store...
Index stats: 1234 vectors
Pinecone ready!

Options:
1. Ask a question (RAG)
2. Similarity search
3. Exit

Choose an option (1-3): 1
Enter your question: What are the top companies in AI?
Thinking...
Answer: Based on the available data, the top AI companies include...
Sources used: 5 documents
First source preview: Company XYZ is a leading AI company...
The module provides a LangChain-compatible embeddings wrapper:
from src.utils.ml.embeddings.src.vertex_ai_embeddings import VertexAIEmbeddings
embeddings = VertexAIEmbeddings(
    project_id="your-project-id",
    location="europe-west1",
    model="text-embedding-005"
)
# Use with LangChain
from langchain.vectorstores import Pinecone
vectorstore = Pinecone.from_existing_index(
    index_name="your-index",
    embedding=embeddings
)
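Once connected, the standard LangChain retrieval calls apply to this vectorstore, for example:
# Continues from the vectorstore created above: retrieve the 5 most similar documents.
docs = vectorstore.similarity_search("AI companies in logistics", k=5)
for doc in docs:
    print(doc.page_content[:200], doc.metadata)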
- Migration fails: Check BigQuery permissions and table schema
- RAG system won't start: Verify environment variables and API keys
- Slow search: Consider reducing vector dimensions or using metadata filters (see the sketch below)
- Memory issues: Reduce batch size in migration
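For the filtering case, the current Pinecone SDK accepts a metadata filter on queries; a small sketch (index name and metadata field follow the hypothetical configuration above):
from pinecone import Pinecone

index = Pinecone(api_key="your-pinecone-key").Index("profiles-df")

# Restrict the search to vectors whose metadata matches the filter.
results = index.query(
    vector=[0.0] * 768,  # replace with a real query embedding
    top_k=5,
    filter={"sector": {"$eq": "Technology"}},
    include_metadata=True,
)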
Enable debug logging:
import logging
logging.basicConfig(level=logging.DEBUG)
- Deploy to production: Use the migration script to move your embeddings
- Build RAG applications: Integrate with your existing systems
- Monitor performance: Track search latency and accuracy
- Scale as needed: Add more indexes for different data types
For more examples and advanced usage, check the examples/ directory in the scripts folder.