# Azure Cosmos DB Partition Key Advisor
Welcome to the Partition Key Advisor notebook! The Advisor takes in input about your workload and automatically recommends a partition key. 

### How to use this notebook
To get started with an example, select **Run all** from the command bar to see the recommendation for the sample workload.

To get a customized recommendation for your own workload, replace all inputs in each **Input** cell with your own workload details. 

After entering the information, click the **Play** icon to run the cell, select **Run Active Cell** from the command bar, or use the keyboard shortcut **Shift + Enter**. 

**Be sure to run each code cell before moving on to the next.**

![Run all cells button in command bar](https://cosmosnotebooksdata.blob.core.windows.net/notebooksamplesimages/notebooks_run_all_cells.png)

### Example shopping cart scenario

This notebook contains an example for a shopping cart workload for a retail website. Each customer's shopping cart is stored in a Cosmos DB container. The application needs to be able to 1) read and update a customer's cart and 2) find all carts updated within a timeframe. 

Sample document:

``{
      "id": "171a89a2-8698-46ee-8d57-c090eb58a73f", 
      "cartId": "171a89a2-8698-46ee-8d57-c090eb58a73f",
      "customerId": "alice392",
      "lastModifiedTimestamp": "2021-12-06T17:06:17.1531888-07:00",
      "products":[{"productId": "3942", "qty": 2}, {"productId": "3948", "qty": 1}]
}``


Potential partition keys include ``cartId``, ``customerId``, ``id``, or ``lastModifiedTimestamp``. This notebook will analyze the various choices and recommend a partition key.

### Before you start - run the cell below to load dependencies

In [10]:
#r "nuget: Cosmos.PartitionKeyAdvisor.Core, 1.0.2"
#r "nuget: System.Text.Json, 5.0.1"
#r "nuget: System.Memory"
#r "nuget: System.Text.Encodings.Web, 5.0.0.0"
#r "nuget: System.Linq"
#r "nuget: System.Linq.Expressions"

Display.AsMarkdown("Successfully loaded dependencies");

Successfully loaded dependencies

### Input #1: Enter in data size in GB and estimated RU/s
This information helps determine the scale of the workload, so we can analyze the impact of candidate partition keys on queries. If your data volume grows over time, enter the steady-state data size you expect over the next 1-3 years.

If you're not sure about the RU/s needed, set it to -1 and we'll estimate it for you, based on the other workload details. 

In [12]:
var dataSizeInGB = 50;
var estimatedRUsPerSecond = -1;

Display.AsMarkdown("Data size in GB: " + dataSizeInGB);
Display.AsMarkdown("Estimated RU/s: " + estimatedRUsPerSecond);

Data size in GB: 50

Estimated RU/s: -1

>**Did you know?** While it's best practice to set a good partition key for your workload, typically when you have less than 100GB in data and less than 30,000 RU/s required, the impact on performance with choice of partition key is minimal.


### Input #2: Tell us about your data access pattern
The "key" to choosing a good partition key is to make sure it fits with your data access patterns. Whether a partition key is a good choice depends on whether your workload is read-heavy, write-heavy, or a mix of both.

Tell us the percentage breakdown of your workload that is reads vs. writes.

Reads refer to both point reads (also known as key value lookups) and queries. Writes refer to any create, update, upsert, or delete operation. 

`workloadWritePercent` and `workloadReadPercent` should add up to 100.

`workloadRequestsPerSecond` refers to the overall number of requests per second in your workload. If the workload is variable, enter the number for peak. 

|**Input** |**Description** |
|--- | --- 
|workloadRequestsPerSecond|The overall number of requests per second in your workload. If the workload is variable, enter in the highest number you'd expect.|
|workloadReadPercent|The percent of your workload that consists of point reads (also known as key value lookups) and queries. |
|workloadWritePercent|The percent of your workload that consists of writes: create, update, upsert, or delete operations.|


In [15]:
var workloadRequestsPerSecond = 200;

var workloadReadPercent = 60;
var workloadWritePercent = 40;

Display.AsMarkdown("Workload requests per second: " + workloadRequestsPerSecond);
Display.AsMarkdown("Workload read percent: " + workloadReadPercent);
Display.AsMarkdown("Workload write percent: " + workloadWritePercent);

Workload requests per second: 200

Workload read percent: 60

Workload write percent: 40

This is a helper cell. Do not make any changes to the cell below. Make sure to run it before moving on.

In [16]:
using PartitionKeyAdvisor.Core.Models;

var basicInformation = new BasicInformation(){
    DataSizeInGB = dataSizeInGB,                        
    WorkloadRequestsPerSecond = workloadRequestsPerSecond,
    EstimatedRUsPerSecond = estimatedRUsPerSecond,
    WorkloadReadPercent = workloadReadPercent,    
    WorkloadWritePercent = workloadWritePercent,
};

Display.AsMarkdown("Basic workload details successfully recorded");
basicInformation

PartitionKeyAdvisor.Core.Models.BasicInformation

Basic workload details successfully recorded

### Input #3: Enter information about your reads
If you have a read-heavy workload, it's important to choose a key that you can include in most of your read and query operations. 

For each read operation you have, create a new `ReadOperation` object and add it to the list object below. Be sure to include a comma between each `ReadOperation` object.

|**ReadOperation input** |**Description** |
|--- | --- 
|Name|Optional property to name this query for easier reference later. This has no impact on the end recommendation.|
|PercentOfReadOperations|The percent across all reads that this operation represents. All percents across all `ReadOperations` should sum to 100.|
|Filters|Filters that appear in the WHERE clause of the query (e.g. `SELECT * FROM c WHERE c.prop1 = ...`), or as inputs in a point read. Do not include properties in the SORT or ORDER BY clauses.|
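
Because these percentages are entered by hand, a quick check can catch a list that does not add up to 100 before the Advisor runs. The snippet below is a minimal sketch and not part of the Advisor package: `EnsureSumsTo100` is a hypothetical helper, and the values are the sample percentages used in this notebook.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of the Advisor package):
// throws if the given percentages do not add up to 100.
void EnsureSumsTo100(string label, double[] percents)
{
    double total = percents.Sum();
    if (Math.Abs(total - 100) > 1e-9)
    {
        throw new ArgumentException($"{label} add up to {total}, expected 100.");
    }
}

// Sample values: PercentOfReadOperations for the three sample read operations.
EnsureSumsTo100("PercentOfReadOperations values", new double[] { 30, 50, 20 });
Console.WriteLine("Read operation percentages add up to 100");
```

The same check could be applied to `workloadReadPercent` and `workloadWritePercent` from Input #2.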


In [17]:
using System.Collections.Generic;

// Instructions: Copy and fill out the ReadOperation constructor for each read operation in your workload
// Be sure to include a comma between each ReadOperation object
var readOperations = new List<ReadOperation>(){
    new ReadOperation() {
        Name = "QueryByCustomerId",
        PercentOfReadOperations = 30,
        Filters = new List<string>() { "customerId" }
    },
    new ReadOperation() {
        Name = "QueryByCustomerIdAndCartId",
        PercentOfReadOperations = 50,
        Filters = new List<string>() { "customerId", "cartId" }
    },
    new ReadOperation() {
        Name = "QueryByLastModifiedTimestamp",
        PercentOfReadOperations = 20,
        Filters = new List<string>() { "lastModifiedTimestamp" }
    }
};

Display.AsMarkdown("Read workload details successfully recorded");

Read workload details successfully recorded

### Input #4: Any other properties we should consider?

Are there any properties that don't appear in your read operations that you think might make a good key? For write-heavy scenarios, properties that have a high number of unique values can make good candidates. A common example is the `id` property. In our sample workload, we will include `id`, which represents our shopping cart id.

If there are no additional keys you want to consider, then edit the cell to: `var otherGoodCandidates = new List<string>() { };`

In [18]:
// Instructions: Add any other properties you want to consider to the list. 
// Feel free to remove the "id" property and leave an empty list if that makes more sense for your workload
var otherGoodCandidates = new List<string>() { "id" };
Display.AsMarkdown("Other candidates successfully recorded");

Other candidates successfully recorded

The helper cell below will print out all the candidate partition keys you've told us about so far. **Run the cell to see all candidates**. No input is required.

In [23]:
using PartitionKeyAdvisor.Core;

Display.AsMarkdown("### Candidate partition keys");

foreach(var candidate in RecommendationEngine.GetCandidates(readOperations, otherGoodCandidates)){
    Display.AsMarkdown(candidate);
}

### Candidate partition keys

customerId

cartId

lastModifiedTimestamp

id

### Input #5: Tell us more about each candidate partition key

Enter information for each candidate partition key listed in the output of the helper cell above. If you don't want to consider a candidate, exclude it from the list below.

For each candidate you want to consider, create a new `CandidatePartitionKey` object and add it to the list object below. Be sure to include a comma between each `CandidatePartitionKey` object.

In the below table, **Property** refers to the candidate partition key. **Value** refers to the potential value of the property. For example, `customerId` is a property, and `Alice` is a value of the property `customerId`. 

|**CandidatePartitionKey input** |**Description** |**Example**|
--- | --- | ---|
|Name| Name of the property. This is the candidate partition key.|"customerId"
|UniqueValues|The number of unique values of this property in your data. |If the property is `customerId` and you have 10000 different customers, enter 10000.
|WritesDistribution|Enter `SkewLevel.Low` if there are many unique values being written in any given second and your writes are evenly distributed across those values. Enter `SkewLevel.High` if only one or a few unique values are written in any given second or your writes skew heavily towards only a few unique values. Enter `SkewLevel.Medium` if it is in-between or you're not sure. | If the property is `customerId`, and you have many documents being updated at the same time across many different `customerId` values, enter `SkewLevel.Low`. If the property is `customerId`, and you have many documents being updated at the same time but most of the documents are for a single `customerId`, enter `SkewLevel.High`. If the property is `day` and all writes have the same value in any given second, enter `SkewLevel.High`.|
|StorageDistribution|Enter `SkewLevel.Low` if most of your values for this property have a similar amount of data. Enter `SkewLevel.High` if a few values have a significantly larger or smaller amount of data. Enter `SkewLevel.Medium` if it is in-between or you're not sure.|If the property is `customerId`, and each customer has roughly the same amount of data, enter `SkewLevel.Low`. If one or a few customers have a lot more data than the others, enter `SkewLevel.High`.|
|QueryDistribution|Enter `SkewLevel.Low` if you query all values of this property evenly. Enter `SkewLevel.High` if there are a few values that you query for significantly more or less than others. Enter `SkewLevel.Medium` if it is in-between, you don't query on the property at all, or you're not sure.| If the property is `customerId`, and you query for all customers equally, enter `SkewLevel.Low`. If there is one or a few customers you consistently query more often than the others, enter `SkewLevel.High`. |
|HasValueWithOver20GB|If there's a value of this property that will eventually have more than 20 GB of data, enter `true`. Otherwise, enter `false`.|For `customerId`, if a single customer will ever have more than 20 GB of data, enter true.|
|IsInAllDocuments|Enter `true` if this property is in all documents and entities in your collection. This should be true for most properties. | --

In [24]:
// This enum is part of the Partition Key Advisor Nuget package, and is copied here for your reference.
// enum SkewLevel {
//    High,
//    Medium,
//    Low 
// }

// Instructions: Copy and fill out the CandidatePartitionKey initializer for each candidate partition key you want to consider
// Be sure to include a comma between each CandidatePartitionKey object
var candidates = new List<CandidatePartitionKey>(){
    new CandidatePartitionKey() {
        Name = "id",
        UniqueValues = 1000000,
        WritesDistribution = SkewLevel.Low,
        StorageDistribution = SkewLevel.Low,
        QueryDistribution = SkewLevel.Low, 
        HasValueWithOver20GB = false, 
        IsInAllDocuments = true
    },
    new CandidatePartitionKey() {
        Name = "cartId",
        UniqueValues = 1000000,
        WritesDistribution = SkewLevel.Low,
        StorageDistribution = SkewLevel.Low,
        QueryDistribution = SkewLevel.Low, 
        HasValueWithOver20GB = false, 
        IsInAllDocuments = true
    },
    new CandidatePartitionKey() {
        Name = "customerId",
        UniqueValues = 1000000,
        WritesDistribution = SkewLevel.Low,
        StorageDistribution = SkewLevel.Low,
        QueryDistribution = SkewLevel.Low,
        HasValueWithOver20GB = false, 
        IsInAllDocuments = true
    },
    new CandidatePartitionKey() {
        Name = "lastModifiedTimestamp",
        UniqueValues = 1000000,
        WritesDistribution = SkewLevel.Low,
        StorageDistribution = SkewLevel.Low,
        QueryDistribution = SkewLevel.High,
        HasValueWithOver20GB = false,
        IsInAllDocuments = true
    },    
};

Display.AsMarkdown("Candidate partition key details successfully recorded");

Candidate partition key details successfully recorded

# Recommendation
Run the cell below to get the recommendation. Based on your workload details, we've recommended a partitioning strategy and partition key, along with an explanation of why this key is a good choice.

In [30]:
using PartitionKeyAdvisor.Core.PartitionScenarios;

var rec = RecommendationEngine.GetRecommendation(basicInformation, readOperations, candidates);

// Print the details of each candidate in the top recommendation tier (0 = Recommended).
// The loop variable is named "candidate" so it does not shadow the "rec" result above.
foreach (var candidate in rec.Candidates) {
   if (candidate.RecommendationTier == 0) {
       Display.AsMarkdown("## Recommended partition key: " + candidate.Name);
       Display.AsMarkdown("## Recommended partitioning strategy:  " + candidate.PartitioningStrategy.ToString());
       Display.AsMarkdown("## This partition key is a good choice because...");

       if (candidate.HasHighCardinality == true) {
           Display.AsMarkdown("* It has high cardinality");
       }
       if (candidate.UsedInCommonQueries == true) {
           Display.AsMarkdown("* It is used in the majority of your queries");
       }
       if (candidate.ImpactOfCrossPartitionQueries == CrossPartitionQueryImpact.Low) {
           Display.AsMarkdown("* The impact of any cross partition queries will be low");
       }
   }
}

## Recommended partition key: customerId

## Recommended partitioning strategy:  SinglePartitionKey

## This partition key is a good choice because...

* It has high cardinality

* It is used in the majority of your queries

* The impact of any cross partition queries will be low

## Detailed analysis
Run the cell below to see the analysis of all candidate partition keys.

In [31]:
Display.AsMarkdown("### Estimated number of physical partitions:  " + rec.EstimatedPhysicalPartitions);
rec.Candidates

PartitionKeyAdvisor.Core.WorkloadDetails+<AnalyzeCandidates>d__42

### Estimated number of physical partitions:  2

## How to interpret the results

### Partitioning strategy

|Value|Partitioning Strategy|Details|Reference Material|
|---------|---------|-----------|-------------------|
|0|SinglePartitionKey|This is the most commonly used partitioning strategy. You can set the recommended partition key on your container to optimally partition your data.| https://aka.ms/cosmos-partitioning-overview |
|1|HierarchicalPartitionKey|[PREVIEW] This is typically recommended for workloads where using only 1 key is not enough to achieve high cardinality or good data distribution. You can set up to 3 levels of partition keys. Consider pairing a lower cardinality key with a higher cardinality key, according to the natural hierarchy of your data.| https://aka.ms/cosmos-hierarchical-partitioning |
|2|CopyContainer|This partitioning strategy is recommended for read-heavy scenarios where each of the most common queries filters on a different property, and there are a small number of such properties. As a result, choosing a single partition key leads to a high percentage of cross-partition queries. To implement this strategy, create multiple copies of your container, partitioned by different keys, and route your application's queries depending on what data is available. You can also apply the LookupContainer strategy described below.| -- |
|3|LookupContainer|This partitioning strategy is recommended for read-heavy scenarios where each of the most common queries filters on a different property, and there are a large number of such properties. As a result, choosing a single partition key leads to a high percentage of cross-partition queries. To implement this strategy, create a second container that allows you to look up what the value of the partition key would be for the "main" container, given the property you know. This is recommended when your data size and the number of containers needed make it infeasible to create multiple copies of your data.| -- |

### Recommendation tier
|Value|Recommendation Tier|Details|
|---------|---------|-----------|
|0|Recommended |Based on your workload, we've determined this is a good choice of partition key that gives even distribution of data and matches your data access patterns.|
|1|Warning|This partition key could work for your workload, but there are some potential flaws.|
|2|NotRecommended|This partition key is not recommended. Typically, this occurs when the key does not result in high enough cardinality. |

### Cardinality
It's important to choose a partition key with high cardinality. We've scored each key based on the number of unique values for each property. The score is out of 100 and anything over 80 is good. 

### Has over 20 GB of data
Azure Cosmos DB has a limit of 20 GB for a logical partition. If a particular value of the partition key is likely to exceed 20 GB of data, then it's typically recommended to use hierarchical partition keys to avoid this limit.

### Used in common queries
For read-heavy workloads, it's important to choose a partition key that is used in your top queries. This is calculated as `true` if the partition key can be used in over 50% of your specified query requests.
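
As a rough illustration of this rule, the sketch below adds up the share of read requests whose filters include each candidate key, using the sample read operations from Input #3. The names `sampleReads` and `PercentOfReadsUsingKey` are hypothetical, and the Advisor's internal calculation may differ (for example, in how it treats queries with several filters).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample read operations from Input #3: (name, percent of reads, filter properties).
var sampleReads = new List<(string Name, double PercentOfReads, string[] Filters)>
{
    ("QueryByCustomerId", 30, new[] { "customerId" }),
    ("QueryByCustomerIdAndCartId", 50, new[] { "customerId", "cartId" }),
    ("QueryByLastModifiedTimestamp", 20, new[] { "lastModifiedTimestamp" }),
};

// Hypothetical helper: percent of read requests whose filters include the candidate key.
double PercentOfReadsUsingKey(string key) =>
    sampleReads.Where(r => r.Filters.Contains(key)).Sum(r => r.PercentOfReads);

foreach (var key in new[] { "customerId", "cartId", "lastModifiedTimestamp", "id" })
{
    double percent = PercentOfReadsUsingKey(key);
    bool usedInCommonQueries = percent > 50; // assumption: strictly "over 50%"
    Console.WriteLine($"{key}: {percent}% of reads, used in common queries: {usedInCommonQueries}");
}
```

Under this reading, `customerId` appears in 80% of the sample reads and passes the check, while `cartId` sits at exactly 50% and does not.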

### Impact of cross-partition queries
For read-heavy workloads, it's important to choose a partition key that is used in your top queries, as it reduces the number of queries that need to fan out to all physical partitions. At the same time, it's totally normal to have some cross-partition queries. As long as they make up a small portion of your workload or are run infrequently (e.g. once a minute or a few times an hour), then the impact is low. Typically, an additional 3 RUs are required for each partition that is visited during a query. 

To read this table, first find the row that corresponds to the estimated number of physical partitions shown in the output of the **Detailed analysis** cell above. Then, for each candidate, find its ``ImpactOfCrossPartitionQueries`` value (0, 1, or 2) and the corresponding column.

When you have 1 physical partition, all "cross-partition queries" are effectively single partition queries, so the impact is low. As the number of physical partitions you have increases (typically corresponds to larger workloads with high scale), the impact of cross-partition queries increases. 

|-|If this partition key is chosen, and impact is low...|If this partition key is chosen, and impact is medium...|If this partition key is chosen, and impact is high...|
|---------|---------|-----------|---|
|**Number of physical partitions**|**0 (Low)**|**1 (Medium)**|**2 (High)**
|**1**|Cross partition queries make up <90% of your overall workload. |Cross partition queries make up between 90% - 100% of your overall workload.|--|
|**2-5**|Cross partition queries make up <30% of your overall workload.|Cross partition queries make up between 30% - 50% of your overall workload.| Cross partition queries make up >=50% of your overall workload.
|**6-50**|Cross partition queries make up <25% of your overall workload.|Cross partition queries make up between 25% - 40% of your overall workload. |Cross partition queries make up >=40% of your overall workload.
|**>50**|Cross partition queries make up <20% of your overall workload.|Cross partition queries make up between 20% - 30% of your overall workload. |Cross partition queries make up >=30% of your overall workload.
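
The thresholds in the table can also be read as a small lookup function. The sketch below is an illustration only: `ClassifyImpact` is a hypothetical helper, not an Advisor API, and it assumes that a value exactly on a boundary falls into the higher band.

```csharp
using System;

// Hypothetical helper mirroring the table above.
// Returns 0 (Low), 1 (Medium), or 2 (High).
// Assumption: a value exactly on a boundary falls into the higher band.
int ClassifyImpact(int physicalPartitions, double crossPartitionPercentOfWorkload)
{
    double mediumFrom, highFrom;
    if (physicalPartitions <= 1)       { mediumFrom = 90; highFrom = double.PositiveInfinity; } // never High
    else if (physicalPartitions <= 5)  { mediumFrom = 30; highFrom = 50; }
    else if (physicalPartitions <= 50) { mediumFrom = 25; highFrom = 40; }
    else                               { mediumFrom = 20; highFrom = 30; }

    if (crossPartitionPercentOfWorkload >= highFrom) return 2;
    if (crossPartitionPercentOfWorkload >= mediumFrom) return 1;
    return 0;
}

// Sample workload: 60% of requests are reads, and 20% of reads
// (QueryByLastModifiedTimestamp) do not filter on customerId.
double crossPartitionPercent = 60 * 20 / 100.0; // 12% of the overall workload
Console.WriteLine("Impact with customerId as the key: " + ClassifyImpact(2, crossPartitionPercent));
```

Assuming `QueryByLastModifiedTimestamp` is the only cross-partition query when `customerId` is the key, it makes up 12% of the overall sample workload. With 2 estimated physical partitions that falls in the Low column, which is consistent with the recommendation output above.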

### Analyze the impact of cross-partition queries

This table explores how each partition key candidate impacts cross-partition queries. To see the queries for a particular candidate, filter the table to that candidate.


In [32]:
using System.Linq;
using System.Linq.Expressions;
using System.Collections.Generic;

Display.AsMarkdown("## Estimated number of physical partitions:  " + rec.EstimatedPhysicalPartitions);

rec.CrossPartitionQueryAnalysis.ToList()

System.Collections.Generic.List`1[PartitionKeyAdvisor.Core.Models.CrossPartitionQuery]

## Estimated number of physical partitions:  2

### Analysis of cross-partition queries

When you query data from containers, if the query has a partition key filter specified, Azure Cosmos DB automatically optimizes the query. It routes the query to the physical partitions corresponding to the partition key values specified in the filter. If the query doesn't specify the partition key, then a cross-partition query will occur and the query must access multiple physical partitions. 

In general, it's OK to have cross-partition queries! As long as your partition key can be included in the majority of your queries and cross-partition queries make up a small fraction of your overall workload, the impact will be minimal.

|Property|Description|
|---------|-----------|
|CandidatePartitionKey|The candidate partition key.|
|QueryName |The name of the query provided from your workload details.|
|RequestsPerSecond |The number of requests for this query per second. This is calculated by multiplying: total number of requests for your workload \* percent of your workload that is reads/100 \* percent of your reads this query represents/100.  |
|PercentOfReadWorkload|The percent of your total read operations that is represented by this query. |
|PercentOfOverallWorkload|The percent of your overall workload that is represented by this query.|
|QueryFilters|The properties that the query filters on. |
|AdditionalRUsPerSecond|The additional RUs needed to support this cross-partition query. This is calculated by multiplying: 3 RUs \* number of estimated physical partitions \* number of times query is run per second. |
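
To make the two formulas in the table concrete, the sketch below applies them to the sample workload: 200 requests per second, 60% reads, and 2 estimated physical partitions. It assumes `QueryByLastModifiedTimestamp` (20% of reads) is cross-partition because `customerId` is the partition key. The variable and function names are hypothetical, and the Advisor's own output may round or adjust these numbers.

```csharp
using System;

// Sample workload values from the Input cells and the analysis output above.
double totalRequestsPerSecond = 200;
double readPercent = 60;
int estimatedPhysicalPartitions = 2;
const double RUsPerPartitionVisited = 3; // "3 RUs" from the AdditionalRUsPerSecond description

// RequestsPerSecond = total requests * read percent / 100 * percent of reads / 100
double RequestsPerSecond(double percentOfReads) =>
    totalRequestsPerSecond * readPercent / 100 * percentOfReads / 100;

// AdditionalRUsPerSecond = 3 RUs * estimated physical partitions * query runs per second
double AdditionalRUsPerSecond(double requestsPerSecond) =>
    RUsPerPartitionVisited * estimatedPhysicalPartitions * requestsPerSecond;

double queryRate = RequestsPerSecond(20);                 // 200 * 0.60 * 0.20 = 24
double additionalRUs = AdditionalRUsPerSecond(queryRate); // 3 * 2 * 24 = 144

Console.WriteLine($"QueryByLastModifiedTimestamp: {queryRate} requests/s, {additionalRUs} additional RU/s");
```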

# Help us improve
Do you have feedback or questions on the recommendation? Feel free to contact us at: cosmos-pka-team @ microsoft.com. Thank you for trying out the Azure Cosmos DB Partition Key Advisor!