# Getting started with Azure Cosmos DB's API for MongoDB and Synapse Link

In this sample we will execute the following tasks:

1. Insert a dataset using the traditional MongoDB client.
1. Execute aggregation queries against the Analytical Store from the transactional data we inserted.
1. Insert another dataset, but this time using the MongoSpark connector.
1. Execute aggregation queries again, consolidating both datasets.

## Prerequisites
1. Have you created a MongoDB API account in Azure Cosmos DB? If not, go to [Create an account for Azure Cosmos DB's API for MongoDB](https://docs.microsoft.com/azure/cosmos-db/create-cosmosdb-resources-portal#create-an-azure-cosmos-db-account). Be sure to create an account using MongoDB as the API option.
1. For your Cosmos DB account, have you enabled Synapse Link? If not, go to [Enable Synapse Link for Azure Cosmos DB accounts](https://docs.microsoft.com/azure/cosmos-db/configure-synapse-link#enable-synapse-link).
1. Have you created a Synapse Workspace? If not, go to [Create a Synapse Workspace](https://docs.microsoft.com/azure/synapse-analytics/quickstart-create-workspace).

## Create a Cosmos DB collection with Synapse Link
1. Create a database named `test`. 
1. Create a collection named `htap`.
    - Make sure you set the `Storage capacity` option to `Fixed` when you create your collection.
    - Make sure you set the `Analytical store` option to `On` when you create your collection.

## Connect your collection to Synapse
1. Go to your Synapse Analytics workspace.
1. Create a `Linked Data` connection for your MongoDB API account. 
    1. Under the `Data` blade, select the + (plus) sign.
    1. Select the `Connect to external data` option.
    1. Now select the `Azure Cosmos DB (MongoDB API)` option. 
    1. Enter all the information regarding your specific Azure Cosmos DB account either by using the dropdowns or by entering the connection string. Take note of the name you assigned to your `Linked Data` connection. 
    - Alternatively, you can also use the connection parameters from your account overview.
1. Test the connection by looking for your database accounts in the `Data` blade, and under the `Linked` tab.
    - There should be a list that contains all accounts and collections.
    - Collections that have an `Analytical Store` enabled will have a distinctive icon.

### Let's get the environment ready

This environment lets you install and use any Python libraries you need. For this sample, add the following libraries to your Spark pool:

```
pymongo==2.8.1
aenum==2.1.2
backports-abc==0.5
bson==0.5.10
```

Learn how to import libraries into your Spark pools in [this article](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries). We recommend creating a new pool for this.

You can execute the following command to make sure all the libraries are installed correctly:

In [1]:
import pkg_resources  # ships with setuptools; pip.get_installed_distributions() was removed in pip 10

for dist in pkg_resources.working_set:
    print(dist)

# The output might be long... you can collapse it by clicking the 'Collapse output' option in the upper-left corner of the output cell.


zipp 0.6.0
zict 1.0.0
xlwt 1.2.0
XlsxWriter 0.9.6
xlrd 1.0.0
wrapt 1.11.2
widgetsnbextension 2.0.0
wheel 0.30.0
Werkzeug 0.16.0
websocket-client 0.56.0
wcwidth 0.1.7
vega-datasets 0.7.0
urllib3 1.25.6
unicodecsv 0.14.1
typing-extensions 3.7.4
traitlets 4.3.2
tqdm 4.48.2
tornado 6.0.3
torch 1.3.0
toolz 0.10.0
testpath 0.3
terminado 0.6
termcolor 1.1.0
tensorflow 1.14.0
tensorflow-estimator 1.14.0
tensorboard 1.14.0
tblib 1.4.0
tables 3.3.0
sympy 1.0
statsmodels 0.10.1
SQLAlchemy 1.1.9
spyder 3.1.4
sortedcontainers 2.1.0
sortedcollections 0.5.3
snowballstemmer 1.2.1
smart-open 1.8.4
sklearn-pandas 1.7.0
skl2onnx 1.4.9
six 1.12.0
singledispatch 3.4.0.3
simplegeneric 0.8.1
shap 0.34.0
setuptools 41.4.0
SecretStorage 3.1.1
seaborn 0.9.0
scipy 1.1.0
scikit-learn 0.20.3
scikit-image 0.15.0
s3transfer 0.2.1
ruamel.yaml 0.15.89
rope-py3k 0.9.4.post1
retrying 1.3.3
Resource 0.2.1
requests 2.22.0
requests-oauthlib 1.2.0
QtPy 1.2.1
qtconsole 4.3.0
QtAwesome 0.4.4
pyzmq 16.0.2
PyYAML 5.1.2
PyWavele
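
Because the full package list is long, a narrower check can be easier to read. The sketch below reports only whether the libraries this sample needs can be found on the import path. The helper name `missing_modules` is hypothetical, and the import names `aenum` and `backports_abc` are assumed to correspond to the `aenum` and `backports-abc` packages listed above.

```python
import importlib.util

# Import names for the libraries this sample adds to the Spark pool.
# Assumption: backports-abc is imported as "backports_abc".
REQUIRED_MODULES = ["pymongo", "bson", "aenum", "backports_abc"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are importable.")
```

If any name is reported as missing, add the corresponding package to the Spark pool before you continue.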

### Write your database account specific secrets here!

We won't tell anybody.


In [3]:
DATABASE_ACCOUNT_NAME = '<your Cosmos DB account name>'
DATABASE_ACCOUNT_READWRITE_KEY = '<Readable and writable primary or secondary password of your Cosmos DB account>'




## Let's initialize the MongoDB client

You only need the following parameters from your account overview: 
- Connection string.
- Primary or secondary read/write key.

Remember that we named our database `test` and our collection `htap`.

The code snippet below shows how to initialize the `MongoClient` object.

In [4]:
from pymongo import MongoClient
from bson import ObjectId # For ObjectId to work

client = MongoClient("mongodb://{account}.mongo.cosmos.azure.com:10255/?ssl=true&replicaSet=globaldb".format(account = DATABASE_ACCOUNT_NAME)) # Your own database account endpoint.
db = client.test    # Select the database
db.authenticate(name=DATABASE_ACCOUNT_NAME,password=DATABASE_ACCOUNT_READWRITE_KEY) # Use your database account name and any of your read/write keys.

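If this cell fails with `ModuleNotFoundError: No module named 'pymongo'`, the libraries listed earlier are not installed on the Spark pool attached to this notebook. Add them to the pool and run the cell again.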
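
Account keys are base64 strings and can contain characters such as `/`, `+`, and `=`, which must be percent-encoded if the key is embedded in a MongoDB URI. As an alternative to calling `db.authenticate`, the sketch below builds a URI with escaped credentials from the same host pattern used above. The function name `cosmos_mongo_uri` and the sample values are hypothetical, and passing the resulting string to `MongoClient` is assumed to work the same way as the unauthenticated URI shown earlier.

```python
from urllib.parse import quote_plus


def cosmos_mongo_uri(account, key):
    """Build a MongoDB URI for a Cosmos DB account, percent-encoding the credentials."""
    host = "{account}.mongo.cosmos.azure.com:10255".format(account=account)
    return "mongodb://{user}:{password}@{host}/?ssl=true&replicaSet=globaldb".format(
        user=quote_plus(account),
        password=quote_plus(key),
        host=host,
    )


# Placeholder values for illustration only; substitute your own account name and key.
print(cosmos_mongo_uri("myaccount", "abc/def+ghi=="))
```

Keeping the escaping in one helper avoids connection failures that only appear when a key happens to contain a reserved character.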

## Inserting data with the MongoClient driver

The following sample will generate 500 items based on random data. Each item will contain the following fields:
- Item, string
- Price, float
- Rating, integer
- Timestamp, [epoch time](http://unixtimestamp.50x.eu/about.php) in seconds, stored as a float

This cell depends on the cell above to create an instance of the connection to the Cosmos DB MongoDB API account.

This data will be inserted into the MongoDB store of your database. This emulates the transactional data that an application would generate.

In [None]:
from random import choice, randint
import time

orders = db["htap"]

items = ['Pizza', 'Sandwich', 'Soup', 'Salad', 'Tacos']
prices = [2.99, 3.49, 5.49, 12.99, 54.49]

# Build and insert 500 random orders, one document at a time.
for _ in range(500):
    order = {
        'item': choice(items),
        'price': choice(prices),
        'rating': randint(1, 5),
        'timestamp': time.time()  # epoch seconds as a float
    }
    # insert() is the pymongo 2.x API, matching the version pinned above
    orders.insert(order)

print('finished creating 500 orders')
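
To inspect the shape of these documents without writing to the database, the same generation logic can be factored into a small function. This is a minimal sketch: `make_order`, the `now` parameter, and the fixed seed are illustrative additions and are not part of the original sample.

```python
import random
import time

ITEMS = ['Pizza', 'Sandwich', 'Soup', 'Salad', 'Tacos']
PRICES = [2.99, 3.49, 5.49, 12.99, 54.49]


def make_order(rng, now=None):
    """Return one random order document with the four fields described above."""
    return {
        'item': rng.choice(ITEMS),
        'price': rng.choice(PRICES),
        'rating': rng.randint(1, 5),
        # Epoch seconds as a float; `now` lets a caller pin the value.
        'timestamp': time.time() if now is None else now,
    }


rng = random.Random(42)  # fixed seed so repeated runs produce the same orders
sample = [make_order(rng) for _ in range(3)]
for order in sample:
    print(order)
```

Each printed dictionary has the same keys as the documents inserted by the cell above.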


## Read data from the Analytical Store

Now that we have inserted some transactional data, let's read it from the Analytical Store.

The data is automatically transformed into a columnar format, which makes large aggregation queries fast and easy to run.


In [1]:
# Load the Analytical Store data into a dataframe
# Make sure to run the cell with the secrets to get the DATABASE_ACCOUNT_NAME and the DATABASE_ACCOUNT_READWRITE_KEY variables.
df = spark.read.format("cosmos.olap")\
    .option("spark.cosmos.accountEndpoint", "https://{account}.documents.azure.com:443/".format(account = DATABASE_ACCOUNT_NAME))\
    .option("spark.cosmos.accountKey", DATABASE_ACCOUNT_READWRITE_KEY)\
    .option("spark.cosmos.database", "test")\
    .option("spark.cosmos.container", "htap")\
    .load()

# Total revenue per item (Pizza included); price is stored under the float64 suffix
from pyspark.sql import functions as F

df.groupBy(df.item.string.alias("item"))\
    .agg(F.sum(df.price.float64).alias("revenue"))\
    .show()



## A quick note about the MongoDB schema in Analytical Store

For MongoDB accounts we make use of a **Full Fidelity Schema**. This is a representation of property names extended with their data types to provide an accurate representation of their values and avoid ambiguity.

This is why we appended the data type as a suffix when referencing the fields above, as in the example below:

```
df.filter((df.item.string == "Pizza")).show(10)
```

Notice how we specified the `string` type after the name of the property. Here is a map of the supported data types and their suffix representations in the Analytical Store:

| Original Data Type | Suffix | Example |
|---------------|----------------|--------|
| Double        | ".float64"     |  `24.99`   |
| Array         | ".array"       |  `["a", "b"]`   |
| Binary        | ".binary"      |  `0`   |
| Boolean       | ".bool"        |  `True`   |
| Int32         | ".int32"       |  `123`   |
| Int64         | ".int64"       |  `255486129307`   |
| Null          | ".null"        |  `null`   |
| String        | ".string"      |  `"ABC"`   |
| Timestamp     | ".timestamp"   |  `Timestamp(0, 0)`   |
| DateTime      | ".date"        |  `ISODate("2020-08-21T07:43:07.375Z")`   |
| ObjectId      | ".objectId"    |  `ObjectId("5f3f7b59330ec25c132623a2")`   |
| Document      | ".object"      |  `{"a": "a"}`   |

These types are inferred from the data that is inserted in the transactional store. You can see the schema by executing the following command:
```
df.schema
```
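
To make the suffix table more concrete, the sketch below maps Python values to the suffix they would be addressed by and wraps a sample order the way its columns are referenced (`item.string`, `price.float64`, and so on). It is a conceptual illustration only, not how the service builds the schema. The function names `suffix_for` and `to_full_fidelity` are hypothetical, and the choice between `int32` and `int64` assumes integers are stored in the smallest type that fits their value.

```python
from datetime import datetime

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def suffix_for(value):
    """Return the Analytical Store suffix from the table above for a Python value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int32" if INT32_MIN <= value <= INT32_MAX else "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    if isinstance(value, bytes):
        return "binary"
    if isinstance(value, datetime):
        return "date"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError("no suffix known for {!r}".format(type(value)))


def to_full_fidelity(document):
    """Wrap each top-level value as {suffix: value}, mirroring how columns are addressed."""
    return {name: {suffix_for(value): value} for name, value in document.items()}


order = {'item': 'Pizza', 'price': 2.99, 'rating': 4, 'timestamp': 1597792391.0}
print(to_full_fidelity(order))
```

Under these assumptions, a property that holds values of two different types, such as a timestamp stored as both a float and a string, ends up with one typed sub-field per type.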

## Let's insert more orders!

This time we will use slightly different data. Each item will contain the following fields:
- Item, string
- Price, float
- Rating, integer
- Timestamp, [ISO String format](https://en.wikipedia.org/wiki/ISO_8601)

Notice how the `Timestamp` field is now in a string format. This will help us understand how the different data fields can be read based on their data type.

In [None]:
from random import choice, randint
from time import strftime

orders = db["htap"]

items = ['Pizza', 'Sandwich', 'Soup', 'Salad', 'Tacos']
prices = [2.99, 3.49, 5.49, 12.99, 54.49]

# Build and insert 500 more random orders, this time with a string timestamp.
for _ in range(500):
    order = {
        'item': choice(items),
        'price': choice(prices),
        'rating': randint(1, 5),
        'timestamp': strftime("%Y-%m-%d %H:%M:%S")  # local time as a string
    }
    # insert() is the pymongo 2.x API, matching the version pinned above
    orders.insert(order)

print('finished creating 500 orders')
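
With both batches inserted, the collection holds timestamps in two representations: epoch seconds as a float and a formatted string. The sketch below shows one way client code might normalize either form into a `datetime`. The function name `parse_timestamp` and the sample values are illustrative only, and both forms are assumed to be in local time, which matches the defaults of `time.time()` and `strftime`.

```python
from datetime import datetime


def parse_timestamp(value):
    """Convert either timestamp representation used in this sample to a datetime."""
    if isinstance(value, (int, float)):
        # First batch: epoch seconds as produced by time.time()
        return datetime.fromtimestamp(value)
    # Second batch: string produced by strftime("%Y-%m-%d %H:%M:%S")
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


print(parse_timestamp("2020-08-21 07:43:07"))
print(parse_timestamp(1597792391.0))
```

Normalizing on read keeps the stored documents unchanged while still allowing both batches to be compared on a single time axis.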

## Let's read that data again!

This time, we will read only the orders whose timestamp is an ISO string by filtering on the `timestamp.string` field.


In [26]:
# Load the Analytical Store data into a dataframe
# Make sure to run the cell with the secrets to get the DATABASE_ACCOUNT_NAME and the DATABASE_ACCOUNT_READWRITE_KEY variables.
df = spark.read.format("cosmos.olap")\
    .option("spark.cosmos.accountEndpoint", "https://{account}.documents.azure.com:443/".format(account = DATABASE_ACCOUNT_NAME))\
    .option("spark.cosmos.accountKey", DATABASE_ACCOUNT_READWRITE_KEY)\
    .option("spark.cosmos.database", "test")\
    .option("spark.cosmos.container", "htap")\
    .load()

# Show the first 10 orders whose timestamp was stored as a string
df.filter(df.timestamp.string != "").show(10)


+--------------------+----------+--------------------+--------------------+--------------+----------+-------+------+--------------------+---+--------------+
|                _rid|       _ts|                  id|               _etag|           _id|      item|  price|rating|           timestamp| pk| _partitionKey|
+--------------------+----------+--------------------+--------------------+--------------+----------+-------+------+--------------------+---+--------------+
|c8dVAMNlPsb1AQAAA...|1597792391|NWYzYzYwODdmYjVkM...|"00003909-0000-08...|[_<`��]RUj�]|   [Tacos]|[12.99]|   [1]|[, 2020-08-18 23:...|[3]|[c8dVAMNlPsY=]|
|c8dVAMNlPsb2AQAAA...|1597792391|NWYzYzYwODdmYjVkM...|"00003a09-0000-08...|[_<`��]RUj�]|[Sandwich]|[54.49]|   [5]|[, 2020-08-18 23:...|[1]|[c8dVAMNlPsY=]|
|c8dVAMNlPsb3AQAAA...|1597792391|NWYzYzYwODdmYjVkM...|"00003b09-0000-08...|[_<`��]RUj�]|    [Soup]| [5.49]|   [4]|[, 2020-08-18 23:...|[4]|[c8dVAMNlPsY=]|
|c8dVAMNlPsb4AQAAA...|1597792391|NWYzYzYwODdmYjVkM...|"000