# 🔍 Advanced Querying: Metadata Filtering

## Overview
This notebook covers **advanced querying techniques** in ChromaDB. You'll learn how to combine semantic search with powerful filtering capabilities.

## What You'll Learn
- Filtering queries by metadata fields
- Using comparison operators (`$in`, `$gt`, `$lt`, etc.)
- Full-text search with `where_document`
- Controlling output with the `include` parameter

## Why Metadata Filtering Matters
While semantic search finds **similar** documents, metadata filtering lets you add **precise constraints**:
- Find reviews similar to "great product" but only in the "electronics" category
- Search for documents from a specific date range
- Filter by rating, author, source, or any custom field

---

In [1]:
import chromadb
client = chromadb.Client()

## 1. Setup

Initialize ChromaDB and create a sample collection with product reviews.

In [2]:
from datetime import datetime

collection = client.create_collection(
    name="reviews", 
    metadata={
        "description": "Product reviews",
        "created": str(datetime.now())
    }  
)

In [3]:
collection.add(
    documents=[
        "The delivery was fast and the product quality is excellent!",
        "I was not able to increase TV's brightness so I returned it back",
        "The shoes I ordered were too small. Sizing is inaccurate.",
        "Great customer support. Resolved my issue in minutes."
    ],
    ids=["r1", "r2", "r3", "r4"],
    metadatas=[
        {"product_category": "electronics", "rating": 5},
        {"product_category": "electronics", "rating": 2},
        {"product_category": "apparel", "rating": 3},
        {"product_category": "services", "rating": 4}
    ]
)

### Adding Sample Data

Each customer review in the sample data carries two metadata fields:
- `product_category`: electronics, apparel, or services
- `rating`: star rating from 1 to 5

These are the fields that the `where` filters in the following sections match against.

---

## 2. Metadata Filtering with `where`

The `where` parameter filters results based on metadata fields.

### Exact Match
Filter by exact value of a metadata field.

In [4]:
collection.query(
    query_texts=["fast shipping"],
    n_results=2,
    where={"product_category": "electronics"}
)

{'ids': [['r1', 'r2']],
 'embeddings': None,
 'documents': [['The delivery was fast and the product quality is excellent!',
   "I was not able to increase TV's brightness so I returned it back"]],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'rating': 5, 'product_category': 'electronics'},
   {'product_category': 'electronics', 'rating': 2}]],
 'distances': [[0.8920919895172119, 2.0018959045410156]]}

In [5]:
collection.query(
    query_texts=["fast shipping"],
    n_results=2,
    where={"rating": { "$in": [1,2,3] }}
)

{'ids': [['r3', 'r2']],
 'embeddings': None,
 'documents': [['The shoes I ordered were too small. Sizing is inaccurate.',
   "I was not able to increase TV's brightness so I returned it back"]],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'product_category': 'apparel', 'rating': 3},
   {'product_category': 'electronics', 'rating': 2}]],
 'distances': [[1.7524369955062866, 2.0018959045410156]]}

### Using Operators

ChromaDB supports various comparison operators:

| Operator | Description | Example |
|----------|-------------|---------|
| `$eq` | Equal to | `{"rating": {"$eq": 5}}` |
| `$ne` | Not equal | `{"product_category": {"$ne": "services"}}` |
| `$gt` | Greater than | `{"rating": {"$gt": 3}}` |
| `$gte` | Greater or equal | `{"rating": {"$gte": 4}}` |
| `$lt` | Less than | `{"rating": {"$lt": 3}}` |
| `$lte` | Less or equal | `{"rating": {"$lte": 2}}` |
| `$in` | In list | `{"rating": {"$in": [1, 2, 3]}}` |
| `$nin` | Not in list | `{"product_category": {"$nin": ["services"]}}` |

**Example**: The `$in` query above finds reviews with a rating of 1, 2, or 3 (low ratings).
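
To make the operator table concrete, here is a minimal pure-Python sketch that evaluates a single-field `where` filter against a metadata dict. It only mirrors the semantics described in the table; it is not ChromaDB's implementation, and `matches` and `OPERATORS` are hypothetical names introduced for illustration. Treating a missing field as a non-match is a simplifying assumption of this sketch.

```python
import operator

# Map each operator from the table to a Python comparison.
# Illustration only: this is NOT how ChromaDB evaluates filters internally.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata, where):
    """Return True if `metadata` satisfies a single-field `where` filter."""
    (field, condition), = where.items()
    if not isinstance(condition, dict):
        # Shorthand form {"field": value} means equality.
        condition = {"$eq": condition}
    if field not in metadata:
        # Simplifying assumption of this sketch: a missing field never matches.
        return False
    (op, operand), = condition.items()
    return OPERATORS[op](metadata[field], operand)


# Same metadata as the sample collection in this notebook.
reviews = {
    "r1": {"product_category": "electronics", "rating": 5},
    "r2": {"product_category": "electronics", "rating": 2},
    "r3": {"product_category": "apparel", "rating": 3},
    "r4": {"product_category": "services", "rating": 4},
}

low_rated = [
    rid for rid, meta in reviews.items()
    if matches(meta, {"rating": {"$in": [1, 2, 3]}})
]
print(low_rated)  # ['r2', 'r3']
```

The filter only decides which records are eligible. In a real query, ChromaDB still ranks the eligible records by distance to the query text, which is why `r3` comes before `r2` in the output above.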

---

## 3. Full-Text Search with `where_document`

The `where_document` parameter filters based on **document content** (not metadata).

### `$contains` Operator
Find documents containing specific text.

In [6]:
collection.query(
    query_texts=["fast shipping"],
    n_results=2,
    where_document={ "$contains": "customer"}
)

{'ids': [['r4']],
 'embeddings': None,
 'documents': [['Great customer support. Resolved my issue in minutes.']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'product_category': 'services', 'rating': 4}]],
 'distances': [[1.7052761316299438]]}
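
Only one result comes back even though `n_results=2`, because just one review contains the word "customer". The sketch below reproduces that pre-filtering step in plain Python, assuming `$contains` behaves as a case-sensitive substring match. The `contains` helper is hypothetical and is not part of ChromaDB.

```python
# Same documents as the sample collection in this notebook.
documents = {
    "r1": "The delivery was fast and the product quality is excellent!",
    "r2": "I was not able to increase TV's brightness so I returned it back",
    "r3": "The shoes I ordered were too small. Sizing is inaccurate.",
    "r4": "Great customer support. Resolved my issue in minutes.",
}


def contains(document, where_document):
    """Substring test for a {"$contains": text} filter.

    Assumption for this sketch: the match is case-sensitive.
    """
    return where_document["$contains"] in document


hits = [
    rid for rid, text in documents.items()
    if contains(text, {"$contains": "customer"})
]
print(hits)  # ['r4']

# Under the case-sensitive assumption, a capitalized term finds nothing here.
capitalized_hits = [
    rid for rid, text in documents.items()
    if contains(text, {"$contains": "Customer"})
]
print(capitalized_hits)  # []
```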

---

## 4. Controlling Output with `include`

By default, queries return documents, metadata, and distances (see the `included` field in the outputs above). Use `include` to choose which fields come back:

| Value | Description |
|-------|-------------|
| `"documents"` | The original text |
| `"metadatas"` | Associated metadata |
| `"embeddings"` | Vector representations |
| `"distances"` | Distance between the query and each result (lower means more similar) |
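
Whatever you include, the result is a dict of nested lists with one inner list per query text, and fields that were not requested are `None`. The following sketch turns one query's results into a list of per-record rows and skips the `None` fields. `rows_for_query` is a hypothetical helper, and the `result` dict is copied from the `$contains` output shown earlier rather than produced by a live query.

```python
def rows_for_query(result, query_index=0):
    """Flatten one query's nested result lists into a list of row dicts."""
    fields = ("documents", "metadatas", "distances", "embeddings")
    rows = [{"id": rid} for rid in result["ids"][query_index]]
    for field in fields:
        values = result.get(field)
        if values is None:
            # Field was not requested via `include`, so there is nothing to copy.
            continue
        for row, value in zip(rows, values[query_index]):
            row[field] = value
    return rows


# Result-shaped dict copied from the output shown earlier in this notebook.
result = {
    "ids": [["r4"]],
    "embeddings": None,
    "documents": [["Great customer support. Resolved my issue in minutes."]],
    "metadatas": [[{"product_category": "services", "rating": 4}]],
    "distances": [[1.7052761316299438]],
}

rows = rows_for_query(result)
for row in rows:
    print(row["id"], row["distances"], row["documents"])
```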

**Example**: Return only embeddings (useful for debugging or analysis)

In [7]:
collection.query(
    query_texts=["fast shipping"],
    n_results=2,
    where_document={ "$contains": "customer"},
    include=["embeddings"]
)

{'ids': [['r4']],
 'embeddings': [array([[-1.41812982e-02, -9.90826823e-03,  2.93994211e-02,
          -1.22566503e-02,  5.10056736e-03,  2.16237293e-03,
          -9.74385068e-02, -6.17292672e-02,  6.37307987e-02,
           1.49031833e-03,  8.11432302e-02,  7.01216608e-02,
          -5.30364588e-02,  7.69385770e-02,  2.97790915e-02,
           2.28454154e-02, -5.27350977e-02, -1.03562772e-01,
          -4.23899963e-02,  1.65251102e-02, -5.12959659e-02,
          -8.13004598e-02,  2.56917067e-02, -1.95907820e-02,
           3.23317721e-02, -2.17675082e-02, -5.57334349e-02,
           3.12407278e-02, -2.24554795e-03,  9.44269914e-03,
          -7.55347386e-02,  3.44812423e-02,  5.00762311e-04,
           6.30434975e-03,  8.00479427e-02, -8.86319764e-03,
          -9.45463926e-02, -1.10361381e-02,  2.43164189e-02,
          -6.73370734e-02,  4.47158702e-02,  7.56282210e-02,
           3.62584963e-02,  2.27874201e-02,  1.72436126e-02,
           4.43170033e-02,  4.98497188e-02, -3.501664

---

## 📝 Summary

| Feature | Parameter | Purpose |
|---------|-----------|---------|
| **Metadata Filter** | `where` | Filter by metadata fields |
| **Document Filter** | `where_document` | Filter by document content |
| **Output Control** | `include` | Choose what data to return |

### Combining Filters
You can combine all these for precise queries:
```python
collection.query(
    query_texts=["search term"],
    n_results=5,
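    # Several metadata conditions are combined with a logical operator such as $and
    where={"$and": [{"product_category": "electronics"}, {"rating": {"$gte": 4}}]},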
    where_document={"$contains": "quality"},
    include=["documents", "metadatas", "distances"]
)
```
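
When the set of conditions varies at runtime, a small helper can assemble the `where` argument. This is a sketch under the assumption that ChromaDB expects each `where` dict to hold a single field or a single logical operator, so that several conditions are wrapped in `$and`. `build_where` is a hypothetical name and is not part of the ChromaDB API.

```python
def build_where(*conditions):
    """Combine single-field conditions into one `where` filter.

    Returns None for no conditions, the condition itself for exactly one,
    and an {"$and": [...]} wrapper for two or more.
    """
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": list(conditions)}


where = build_where(
    {"product_category": "electronics"},
    {"rating": {"$gte": 4}},
)
print(where)
```

The returned value can be passed as `where=` to `collection.query`. With no conditions the helper returns `None`, which leaves the query unfiltered.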