# Indexing in Azure Cosmos DB

Let's explore how [indexing](https://docs.microsoft.com/en-us/azure/cosmos-db/index-overview) works in Azure Cosmos DB! In this notebook, we will show how to update indexing policies and how indexing affects query performance.

Let's start by creating the resources we are going to need: a database and a container.

In [None]:
import os
import time
import azure.cosmos.cosmos_client as cosmos_client
client = cosmos_client.CosmosClient(os.environ["COSMOS_ENDPOINT"], {'masterKey': os.environ["COSMOS_KEY"]})

db_name    = "iddbtest"
container_name  = "idcltest"
db_link = "/dbs/" + db_name
container_link = "/dbs/" + db_name + "/colls/" + container_name

# Create the database if it doesn't exist
db_query = "select * from r where r.id = '{0}'".format(db_name)
db = list(client.QueryDatabases(db_query))
if db:
    print('Database already exists')
else:
    client.CreateDatabase({'id': db_name})
    print('Database created')
    time.sleep(3)

# Reset the container
container_query = "select * from r where r.id = '{0}'".format(container_name)
container = list(client.QueryContainers(db_link, container_query))
if container:
    client.DeleteContainer(container_link)
    print('Container dropped')
client.CreateContainer(db_link, {'id': container_name})
print('Container created')

### Importing documents

We will need some test data to work with, so we import 10,000 documents. Each document contains two fields: `field1`, which holds a random string value, and `field2`, which holds a random integer value:

```
{
  "field1":"Garry75",
  "field2":405
}
```

This is going to take some time, so feel free to get a coffee ☕ in the meantime!

In [None]:
import os
import json

# Read the test documents (one JSON document per line)
data_filename = os.path.join(os.getcwd(), 'Indexing.txt')
with open(data_filename, 'r') as f:
    docs = f.readlines()

# Insert the documents, reporting progress every 10 documents
total = len(docs)
for progress, doc in enumerate(docs, start=1):
    if progress % 10 == 0:
        print("Inserting document: %5d / %d\r" % (progress, total), end="", flush=True)
    client.CreateItem(container_link, json.loads(doc))
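
If `Indexing.txt` is not at hand, a file with the same shape (one JSON document per line) could be produced along these lines. This is only a sketch: the `make_docs` helper, the name pool, and the value ranges are assumptions made for illustration, not the way the original data set was built.

```python
import json
import random

# Hypothetical name pool; the real data set may use different values.
FIRST_NAMES = ["Garry", "Mario", "Luigi", "Anna", "Sofia"]

def make_docs(count, seed=0):
    """Return `count` JSON strings, each holding one test document."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    docs = []
    for _ in range(count):
        doc = {
            # random string value, e.g. a name followed by a number
            "field1": rng.choice(FIRST_NAMES) + str(rng.randint(0, 99)),
            # random integer value (range chosen arbitrarily)
            "field2": rng.randint(0, 9999),
        }
        docs.append(json.dumps(doc))
    return docs

lines = make_docs(10000)
print(lines[0])
```

Writing `"\n".join(lines)` to a file named `Indexing.txt` would then give the import cell something to read.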

### Testing a first query

Now that we have our test data, time to issue some queries! We start by fetching all documents with a `field1` value of `Mario86`:

In [None]:
query = 'SELECT * FROM c WHERE c.field1 = "Mario86"'
start = time.time()
results = list(client.QueryItems(container_link, query, {'enableCrossPartitionQuery': True}))
end = time.time()

print('Got ' + str(len(results)) + ' result(s)')
print('Time elapsed: %d ms'% ((end - start) * 1000))
print('Request charge: ' + client.last_response_headers['x-ms-request-charge'] + ' RUs')

We got 2 results back in a pretty short amount of time (if the elapsed time looks large, just re-run the previous cell a couple of times; Cosmos DB Notebooks run on shared infrastructure, so time measurements aren't always very precise). Also, we see that this query has consumed a very small amount of [Request Units](https://docs.microsoft.com/en-us/azure/cosmos-db/request-units) (or RUs).
This is because by default, Cosmos DB indexes *all* the fields it finds in the JSON documents you store. This lets you achieve good performance from the start, without the need to think about secondary indexes upfront.
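
For reference, the default policy that produces this "index everything" behavior looks roughly like the dictionary below. It is written out by hand as a sketch of the documented default; what `ReadContainer` actually returns for your container may contain additional or differently shaped entries depending on the service and SDK version.

```python
# Sketch of the default indexing policy, expressed as a Python dict.
# The variable name is ours; the keys follow the indexing policy JSON format.
default_indexing_policy = {
    "indexingMode": "consistent",       # index is updated along with every write
    "automatic": True,                  # items are indexed as they are written
    "includedPaths": [{"path": "/*"}],  # every property path is indexed
    # the system-managed _etag property is excluded by default
    "excludedPaths": [{"path": "/\"_etag\"/?"}],
}

print(default_indexing_policy["includedPaths"])
```

The single `/*` included path is what makes every property of every document queryable through the index.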

### Disabling the index

So what would be the performance of that same query if there was no index? To find out, we can completely disable indexing by updating our container's [indexing policy](https://docs.microsoft.com/en-us/azure/cosmos-db/index-policy). Specifically, we set the policy's indexing mode to `none`:

In [None]:
container = client.ReadContainer(container_link)
container['indexingPolicy']['indexingMode'] = 'none'
del container['indexingPolicy']['automatic']
del container['indexingPolicy']['includedPaths']
del container['indexingPolicy']['excludedPaths']
client.ReplaceContainer(container_link, container)
print('Indexing policy updated')

Now let's run the same query again and see the results.

In [None]:
query = 'SELECT * FROM c WHERE c.field1 = "Mario86"'
start = time.time()
results = list(client.QueryItems(container_link, query, {'enableCrossPartitionQuery': True}))
end = time.time()

print('Got ' + str(len(results)) + ' result(s)')
print('Time elapsed: %d ms'% ((end - start) * 1000))
print('Request charge: ' + client.last_response_headers['x-ms-request-charge'] + ' RUs')

A pretty big difference, both in terms of latency and RUs consumed!
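
Since the same measure-and-print pattern is repeated for every query in this notebook, it could be wrapped in a small helper to make runs easier to compare. The sketch below is an illustration only: `measure`, `run_query` and `get_request_charge` are hypothetical names, and the fake results and the charge value are stand-ins so that the block runs without a Cosmos DB account.

```python
import time

def measure(run_query, get_request_charge):
    """Run a query callable and collect result count, latency and RU charge."""
    start = time.time()
    results = list(run_query())                # drain the result iterator
    elapsed_ms = (time.time() - start) * 1000
    return {
        "count": len(results),
        "elapsed_ms": elapsed_ms,
        "request_charge": get_request_charge(),  # read after the query has run
    }

# Stand-ins for the real client calls (made-up data and charge value)
fake_results = [{"field1": "Mario86", "field2": 1}, {"field1": "Mario86", "field2": 2}]
stats = measure(lambda: iter(fake_results), lambda: "2.83")
print(stats)
```

In the notebook, `run_query` would be `lambda: client.QueryItems(container_link, query, {'enableCrossPartitionQuery': True})` and `get_request_charge` would be `lambda: client.last_response_headers['x-ms-request-charge']`.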

So indexes are very useful to improve the performance and cost-effectiveness of a data model, and it's great that Cosmos DB indexes every property by default.

But in some situations, you may want to fine-tune the default indexing policy by explicitly removing from the index the properties that you won't filter on in your queries. This optimization yields 2 kinds of benefits:

- it will reduce the amount of storage consumed by your container
- it will reduce the latency and RU consumption of write operations

### Excluding paths from the index

Let's do that by updating our indexing policy again, setting the indexing mode back to `consistent` (which is the default) and including all property paths except `field2` (see [this page](https://docs.microsoft.com/en-us/azure/cosmos-db/index-policy#including-and-excluding-property-paths) for a detailed explanation of indexing path syntax).
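
To make the path syntax concrete, here is a minimal sketch of a helper that builds such a policy for a list of top-level property names. The `exclude_properties` function is a hypothetical convenience, not part of the SDK, and it assumes simple property names that need no quoting.

```python
def exclude_properties(property_names):
    """Build an indexing policy that indexes everything except the given
    top-level scalar properties (hypothetical helper, simple names only)."""
    return {
        "indexingMode": "consistent",
        "includedPaths": [{"path": "/*"}],
        # "/name/?" targets the scalar value stored directly under `name`
        "excludedPaths": [{"path": "/" + name + "/?"} for name in property_names],
    }

policy = exclude_properties(["field2"])
print(policy["excludedPaths"])
```

The `/?` suffix covers only the scalar value at that path; excluding a whole nested subtree would use a path ending in `/*` instead.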

When an indexing policy is updated with an indexing mode set to `consistent`, Cosmos DB starts to rebuild the index asynchronously. We can monitor the progress of this operation by reading the corresponding container and fetching the transformation progress from a specific response header.
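
The polling idea can be sketched independently of the SDK, as below. The `wait_for_index` helper and its `read_progress` callable are hypothetical, the timeout is an addition not present in this notebook, and the sketch assumes the header value is an integer percentage encoded as a string.

```python
import time

def wait_for_index(read_progress, poll_seconds=5, timeout_seconds=600):
    """Poll `read_progress` until it reports 100 (percent) or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        progress = int(read_progress())  # assumed to be a string such as '40'
        if progress >= 100:
            return progress
        if time.monotonic() >= deadline:
            raise TimeoutError("index transformation stuck at %d%%" % progress)
        time.sleep(poll_seconds)

# Stand-in for the real header reads, so the sketch runs offline
fake_progress = iter(["0", "40", "100"])
final = wait_for_index(lambda: next(fake_progress), poll_seconds=0)
print(final)
```

In the notebook, `read_progress` would call `client.ReadContainer(container_link)` and then return `client.last_response_headers['x-ms-documentdb-collection-index-transformation-progress']`.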

In [None]:
container = client.ReadContainer(container_link)
container['indexingPolicy']['indexingMode'] = 'consistent'
container['indexingPolicy']['includedPaths'] = [ {'path' : '/*'} ]
container['indexingPolicy']['excludedPaths'] = [ {'path' : '/field2/?'} ]
client.ReplaceContainer(container_link, container)
print('Indexing policy updated')

# Poll the container until the index transformation reaches 100%
progress_header = 'x-ms-documentdb-collection-index-transformation-progress'
while True:
    container = client.ReadContainer(container_link)
    index_transformation_progress = client.last_response_headers[progress_header]
    print('\rCurrent index transformation progress: ' + index_transformation_progress + '%', end="", flush=True)
    if index_transformation_progress == '100':
        break
    time.sleep(5)
print('\nIndex transformation completed')

We now have an indexing policy that indexes everything except the `field2` property. Let's verify that with some queries!

Our previous query filtering on `field1` gets its original performance back:

In [None]:
query = 'SELECT * FROM c WHERE c.field1 = "Mario86"'
start = time.time()
results = list(client.QueryItems(container_link, query, {'enableCrossPartitionQuery': True}))
end = time.time()

print('Got ' + str(len(results)) + ' result(s)')
print('Time elapsed: %d ms'% ((end - start) * 1000))
print('Request charge: ' + client.last_response_headers['x-ms-request-charge'] + ' RUs')

But a query that filters on `field2` won't benefit from the index and will yield poor performance:

In [None]:
query = 'SELECT * FROM c WHERE c.field2 = 3188'
start = time.time()
results = list(client.QueryItems(container_link, query, {'enableCrossPartitionQuery': True}))
end = time.time()

print('Got ' + str(len(results)) + ' result(s)')
print('Time elapsed: %d ms'% ((end - start) * 1000))
print('Request charge: ' + client.last_response_headers['x-ms-request-charge'] + ' RUs')

Check [this page](https://docs.microsoft.com/en-us/azure/cosmos-db/how-to-manage-indexing-policy) to explore the different ways to manage indexing policies, including the Azure Portal, Azure CLI, or any Cosmos DB SDK.

And before we close, don't forget to clean up the resources we've created:

In [None]:
client.DeleteDatabase(db_link)