# Vector Stores Comparison with LangChain

This notebook demonstrates how to set up and use several vector stores with LangChain:
1. FAISS (local, in-memory)
2. Pinecone (cloud-based; disabled in this demo)
3. Milvus/Zilliz (self-hosted or cloud)
4. Weaviate (self-hosted or cloud)

We'll use the same dataset across all vector stores to compare their functionality.

## Setting up Environment

First, let's install all required packages and set up our environment:

In [1]:
# Install required packages
# !pip install -q langchain langchain-community langchain-google-genai pinecone faiss-cpu pymilvus weaviate-client python-dotenv

In [7]:
# Import common dependencies
import os
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_core.documents import Document
import warnings
warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv()

# Verify Google API key
google_api_key = os.getenv('GOOGLE_API_KEY')
if not google_api_key:
    raise ValueError("❌ GOOGLE_API_KEY not found in .env file")
print("✅ Google API key found")

# Initialize Gemini embeddings model
embeddings = GoogleGenerativeAIEmbeddings(
    model="gemini-embedding-001",  # Gemini embedding model; the vectors in this notebook have 3072 dimensions
    task_type="retrieval_document",  # Specify task type for better embeddings
    google_api_key=google_api_key
)
print("✅ Embeddings model initialized")

✅ Google API key found
✅ Embeddings model initialized




In [8]:
# Create sample documents
documents = [
    Document(
        page_content="Python is a high-level programming language known for its simplicity and readability.",
        metadata={"type": "programming", "language": "Python"}
    ),
    Document(
        page_content="JavaScript is a scripting language primarily used for web development.",
        metadata={"type": "programming", "language": "JavaScript"}
    ),
    Document(
        page_content="Machine Learning is a subset of AI that focuses on data and algorithms.",
        metadata={"type": "technology", "field": "AI"}
    ),
    Document(
        page_content="Deep Learning is part of machine learning based on artificial neural networks.",
        metadata={"type": "technology", "field": "AI"}
    ),
    Document(
        page_content="Docker is a platform for developing, shipping, and running applications in containers.",
        metadata={"type": "technology", "field": "DevOps"}
    )
]

## 1. FAISS Vector Store

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search over dense vectors. Here the index is built locally and held in memory, with no server to run, which makes it a good fit for small datasets and quick experimentation.
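To make concrete what an exact, in-memory search does, here is a toy brute-force version in plain Python. It is only a sketch of the idea (nearest neighbours ranked by squared L2 distance), not FAISS's implementation; the 2-D vectors and the `top_k` helper are made up for illustration.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k stored vectors closest to the query."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Hypothetical 2-D "embeddings" standing in for real high-dimensional vectors
stored = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1)]
hits = top_k((1.0, 0.0), stored, k=2)
print(hits)  # indices 0 and 2 are the two closest vectors
```

Every query scans all stored vectors, which is fine for a handful of documents and is the reason larger collections rely on approximate indexes.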

In [10]:
# Initialize FAISS vector store
from langchain_community.vectorstores import FAISS

faiss_store = FAISS.from_documents(
    documents=documents,
    embedding=embeddings
)
print("FAISS vector store created successfully! ✅")

FAISS vector store created successfully! ✅


In [11]:
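# Search with FAISS
# similarity_search_with_score returns (document, score) pairs. With the default
# FAISS index the score is an L2 distance, so a lower score means a closer match.
query = "What is artificial intelligence?"
results = faiss_store.similarity_search_with_score(query, k=2)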

print("Query:", query)
print("\nResults:")
for doc, score in results:
    print(f"\nScore: {score}")
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")

Query: What is artificial intelligence?

Results:

Score: 0.2305298000574112
Content: Machine Learning is a subset of AI that focuses on data and algorithms.
Metadata: {'type': 'technology', 'field': 'AI'}

Score: 0.2490173727273941
Content: Deep Learning is part of machine learning based on artificial neural networks.
Metadata: {'type': 'technology', 'field': 'AI'}
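The scores in this output are distances, so the lower value (0.2305) is the better match. As a rough sketch, assuming the index returns squared L2 distances and the embeddings are unit-normalized (both assumptions worth verifying for your index type and embedding model), a distance `d` corresponds to a cosine similarity of `1 - d/2`. The helper names below are illustrative.

```python
import math
from typing import Sequence


def is_unit_norm(vec: Sequence[float], tol: float = 1e-6) -> bool:
    """Check whether a vector has length 1 within a small tolerance."""
    return math.isclose(math.sqrt(sum(x * x for x in vec)), 1.0, abs_tol=tol)


def cosine_from_squared_l2(distance: float) -> float:
    """For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - d/2."""
    return 1.0 - distance / 2.0


# Sanity check of the identity on two hand-made unit vectors
a = (1.0, 0.0)
b = (0.6, 0.8)
assert is_unit_norm(a) and is_unit_norm(b)
d = sum((x - y) ** 2 for x, y in zip(a, b))  # squared L2 distance
cos = sum(x * y for x, y in zip(a, b))       # dot product of unit vectors
assert math.isclose(cosine_from_squared_l2(d), cos)

# Scores copied from the FAISS output above
for score in (0.2305298000574112, 0.2490173727273941):
    print(round(cosine_from_squared_l2(score), 4))
```

If `is_unit_norm` fails on your real embeddings, normalize them first or treat the raw distances only as a ranking signal.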


## 2. Pinecone Vector Store

# Pinecone is disabled in this demo to avoid package/version conflicts.
# To re-enable Pinecone later:
# 1. Create an index named 'langchain-demo' in the Pinecone console with a dimension matching your embeddings (3072)
# 2. Add PINECONE_API_KEY to your .env
# 3. Re-insert the Pinecone cell or use LangChain's Pinecone.from_documents/from_existing_index
print("⚠️ Pinecone section is disabled in this notebook. Skipping Pinecone tests.")

### Finding Your Pinecone Environment

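To find your environment name (required only by the legacy v2 client and pod-based indexes; the current SDK needs just the API key):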
1. Log in to Pinecone Console (https://app.pinecone.io/)
2. Go to "API Keys" in the left sidebar
3. Look for "Environment" or "Default Environment"
   - It will be something like `gcp-starter` or `us-east1-gcp-free`
4. Copy this value and use it as your `PINECONE_ENV` in the .env file

Example .env file:
```
PINECONE_API_KEY=a1b2c3d4-5e6f-7g8h-9i10-j11k12l13m14
PINECONE_ENV=gcp-starter
```

In [10]:
# Test Pinecone Connection
import os
from pinecone import Pinecone

try:
    api_key = os.getenv('PINECONE_API_KEY')
    
    if not api_key:
        print("⚠️ PINECONE_API_KEY not found in .env file")
        print("\nMake sure your .env file contains:")
        print("PINECONE_API_KEY=your_api_key_here")
    else:
        # Initialize Pinecone with your specific configuration
        pc = Pinecone(api_key=api_key)
        
        # List indexes
        active_indexes = pc.list_indexes()
        print("✅ Successfully connected to Pinecone!")
        print(f"\nActive indexes: {active_indexes}")
        
        # Get index details
        if "langchain-demo" in [index.name for index in active_indexes]:
            index = pc.describe_index("langchain-demo")
            print("\nIndex Statistics:")
            print(f"Dimension: {index.dimension}")
            print(f"Metric: {index.metric}")
            print(f"Status: {index.status}")
            
except Exception as e:
    print(f"❌ Error connecting to Pinecone: {str(e)}")

✅ Successfully connected to Pinecone!

Active indexes: [{
    "name": "langchain-demo",
    "metric": "cosine",
    "host": "langchain-demo-7lvw08o.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "vector_type": "dense",
    "dimension": 3072,
    "deletion_protection": "disabled",
    "tags": null
}]

Index Statistics:
Dimension: 3072
Metric: cosine
Status: {'ready': True, 'state': 'Ready'}



In [9]:
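# Initialize Pinecone with Google's embeddings
# NOTE: this cell uses the legacy v2 client API. `pinecone.init()` was removed in v3+
# of the SDK, which uses a `Pinecone(api_key=...)` instance as in the connection test
# above, and serverless indexes have no `environment` setting ("us-east-1" is a region).
import pinecone
from langchain_community.vectorstores import Pinecone as LangchainPinecone

try:
    # Initialize Pinecone
    pinecone.init(api_key=os.getenv('PINECONE_API_KEY'), environment="us-east-1")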
    index_name = "langchain-demo"
    
    # First, let's check the dimensionality of our embeddings
    test_embedding = embeddings.embed_query("Test text")
    embedding_dim = len(test_embedding)
    print(f"✅ Embedding dimension: {embedding_dim}")
    
    # Create the vector store using the index name directly
    vector_store = LangchainPinecone.from_existing_index(
        index_name=index_name,
        embedding=embeddings,
        namespace="gemini_embeddings"
    )
    
    # Add documents to the store
    vector_store.add_documents(documents)
    print("✅ Documents added to Pinecone vector store!")
    
    # Test search functionality
    query = "What is machine learning?"
    print("\nSearching for:", query)
    results = vector_store.similarity_search(
        query=query,
        k=2,
        namespace="gemini_embeddings"
    )
    
    print("\nResults:")
    for i, doc in enumerate(results, 1):
        print(f"\n{i}. Content: {doc.page_content}")
        print(f"   Metadata: {doc.metadata}")
    
except Exception as e:
    print(f"❌ Error: {str(e)}")
    print("\nTroubleshooting steps:")
    print("1. Make sure you've created an index named 'langchain-demo' in Pinecone console")
    print("2. Index dimension should be 3072 (matching Gemini embeddings)")
    print("3. Make sure your Pinecone API key is correct in .env file")

Exception: The official Pinecone python package has been renamed from `pinecone-client` to `pinecone`. Please remove `pinecone-client` from your project dependencies and add `pinecone` instead. See the README at https://github.com/pinecone-io/pinecone-python-client for more information on using the python SDK.
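The troubleshooting hints in the cell above stress that the index dimension must match the embedding dimension. A minimal guard for that check might look like the sketch below; `check_dimension` is a hypothetical helper, and the zero-filled lists stand in for a real vector from `embeddings.embed_query(...)`.

```python
from typing import Sequence


def check_dimension(embedding: Sequence[float], index_dimension: int) -> None:
    """Raise a descriptive error if an embedding cannot be stored in the index."""
    actual = len(embedding)
    if actual != index_dimension:
        raise ValueError(
            f"Embedding has {actual} dimensions but the index expects {index_dimension}; "
            "recreate the index or change the embedding model/output size."
        )


# Stand-in vector; in the notebook this would come from embeddings.embed_query("Test text")
fake_embedding = [0.0] * 3072
check_dimension(fake_embedding, index_dimension=3072)  # passes silently

try:
    check_dimension([0.0] * 768, index_dimension=3072)
except ValueError as err:
    print(err)
```

Running a check like this before `add_documents` turns a late upsert failure into an immediate, readable error.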

### Testing Google Generative AI Embeddings

Let's test the embeddings directly to understand how they work:

In [None]:
# Test Google's embedding model directly
test_texts = [
    "Machine learning is amazing",
    "Python programming is fun",
    "AI technology is advancing rapidly"
]

# Get embeddings for each text
try:
    # Single query embedding
    single_embedding = embeddings.embed_query(test_texts[0])
    print(f"Single embedding dimension: {len(single_embedding)}")
    
    # Multiple document embeddings
    doc_embeddings = embeddings.embed_documents(test_texts)
    print(f"\nNumber of document embeddings: {len(doc_embeddings)}")
    print(f"Each document embedding dimension: {len(doc_embeddings[0])}")
    
    print("\n✅ Embeddings generated successfully!")
    
except Exception as e:
    print(f"❌ Error generating embeddings: {str(e)}")
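Dimensions alone do not show whether the vectors are meaningful. One quick sanity check is to compare texts pairwise and confirm that related sentences score higher than unrelated ones. The sketch below uses made-up 3-D vectors in place of `doc_embeddings`, so the numbers it prints are illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical 3-D vectors standing in for the real embeddings of test_texts
toy_embeddings = {
    "Machine learning is amazing": (0.9, 0.1, 0.3),
    "Python programming is fun": (0.1, 0.9, 0.2),
    "AI technology is advancing rapidly": (0.8, 0.2, 0.4),
}

texts = list(toy_embeddings)
for i, first in enumerate(texts):
    for second in texts[i + 1:]:
        sim = cosine_similarity(toy_embeddings[first], toy_embeddings[second])
        print(f"{sim:.3f}  {first!r} vs {second!r}")
```

With real embeddings, the two AI-related sentences would be expected to score closest to each other.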

## 3. Milvus Vector Store

Milvus is an open-source vector database that can be self-hosted or used as a managed service via Zilliz Cloud. This example uses LangChain's `Milvus` wrapper, which relies on the `pymilvus` client, to connect to a local Milvus instance on port `19530`.

In [12]:
# Initialize Milvus
from langchain_community.vectorstores import Milvus

try:
    milvus_store = Milvus.from_documents(
        documents=documents,
        embedding=embeddings,
        connection_args={"host": "localhost", "port": "19530"},
        collection_name="langchain_demo"
    )
    print("Milvus vector store created successfully! ✅")

except Exception as e:
    print(f"Error initializing Milvus: {str(e)}")
    print("\nMake sure you have Milvus running locally or update connection details for cloud deployment")

Failed to create new connection using: ed2b3e4ea533416e908aff92ba76fb14


Error initializing Milvus: <MilvusException: (code=2, message=Fail connecting to server on localhost:19530, illegal connection params or server unavailable)>

Make sure you have Milvus running locally or update connection details for cloud deployment
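The connection failure above means nothing answered on `localhost:19530`. A small pre-flight check can make that explicit before LangChain is involved. The sketch below is an assumption-laden illustration: the `MILVUS_HOST` and `MILVUS_PORT` variable names are hypothetical, and an open TCP port only shows that something is listening, not that Milvus is healthy.

```python
import socket
from typing import Dict, Mapping


def milvus_connection_args(env: Mapping[str, str]) -> Dict[str, str]:
    """Build connection_args from environment-style settings, defaulting to a local server."""
    return {
        "host": env.get("MILVUS_HOST", "localhost"),  # hypothetical variable name
        "port": env.get("MILVUS_PORT", "19530"),      # hypothetical variable name
    }


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


args = milvus_connection_args({})  # pass os.environ in a real notebook
if not is_reachable(args["host"], int(args["port"])):
    print(f"Nothing is listening on {args['host']}:{args['port']}; start Milvus first.")
```

The resulting dictionary has the same shape as the `connection_args` passed to `Milvus.from_documents` in the cell above.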


## 4. Weaviate Vector Store

Weaviate is an open-source vector search engine that can be self-hosted or used via Weaviate Cloud Services.

In [13]:
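# Initialize Weaviate
# NOTE: `weaviate.Client(url=...)` is the v3 client API. With weaviate-client v4 the
# connection is made via `weaviate.connect_to_local()`, and the LangChain integration
# lives in the separate `langchain-weaviate` package (`WeaviateVectorStore`).
from langchain_community.vectorstores import Weaviate
import weaviate

try:
    client = weaviate.Client(
        url="http://localhost:8080",  # Update with your Weaviate instance URL
    )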
    
    weaviate_store = Weaviate.from_documents(
        documents=documents,
        embedding=embeddings,
        client=client,
        by_text=False
    )
    print("Weaviate vector store created successfully! ✅")
    
except Exception as e:
    print(f"Error initializing Weaviate: {str(e)}")
    print("\nMake sure you have Weaviate running locally or update connection details for cloud deployment")

Error initializing Weaviate: Client.__init__() got an unexpected keyword argument 'url'

Make sure you have Weaviate running locally or update connection details for cloud deployment


## Vector Store Comparison

Here's a quick comparison of the vector stores we've looked at:

1. **FAISS**
   - ✅ Local, in-memory storage
   - ✅ Great for quick prototyping
   - ✅ No external service to run
   - ❌ No built-in server, replication, or managed persistence, so scaling to production is up to you

2. **Pinecone**
   - ✅ Fully managed cloud service
   - ✅ Highly scalable
   - ✅ Great for production
   - ❌ Proprietary, hosted-only service; usage beyond the free tier is paid

3. **Milvus**
   - ✅ Open source
   - ✅ Can be self-hosted or cloud
   - ✅ Highly scalable
   - ❌ More complex setup

4. **Weaviate**
   - ✅ Advanced features such as hybrid (keyword + vector) search
   - ✅ GraphQL interface
   - ✅ Can be self-hosted or cloud
   - ❌ More resource intensive

### Notes — Pinecone removed for this demo

- The Pinecone section has been disabled because the notebook experienced package/version conflicts during setup.
- If you want to re-enable Pinecone later:
  1. Create an index named `langchain-demo` in the Pinecone Console with dimension `3072` and metric `cosine`.
  2. Add `PINECONE_API_KEY` to your `.env` file.
  3. Re-insert the Pinecone cell or use LangChain's Pinecone helper methods.

Troubleshooting:
- Milvus: make sure Milvus is running locally on port `19530` or change `connection_args` to your Milvus host/port.
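- Weaviate: check which `weaviate-client` version is installed. `weaviate.Client(url="http://localhost:8080")` is the v3 API; under v4 it fails with the `unexpected keyword argument 'url'` error shown above, and the replacement is `weaviate.connect_to_local()` (or the matching cloud helper) together with the `langchain-weaviate` integration.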
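Since both the Pinecone and Weaviate failures in this notebook trace back to client versions, it can help to print what is actually installed before debugging further. The following is a minimal sketch using only the standard library; the helper names are illustrative, and the v4 hint reflects the client change described in the Weaviate item above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major(version_string: str) -> int:
    """Extract the leading major version number, e.g. '4.9.3' -> 4."""
    return int(version_string.split(".")[0])


for name in ("weaviate-client", "pinecone", "pinecone-client", "pymilvus"):
    found = installed_version(name)
    print(f"{name}: {found or 'not installed'}")

weaviate_version = installed_version("weaviate-client")
if weaviate_version and major(weaviate_version) >= 4:
    print("weaviate-client v4+: use weaviate.connect_to_local() instead of weaviate.Client(url=...)")
```

If both `pinecone` and `pinecone-client` show up as installed, removing `pinecone-client` is the fix the Pinecone error message asks for.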
