# Getting started with Cosmos notebooks
In this notebook, we'll import some sample data into a container in Azure Cosmos DB, analyze it, and create some visualizations of the data.

### Create a new database named RetailDemo for our data

We'll use the built-in ```cosmos_client``` to run operations. This is a ready-to-use instance of [CosmosClient](https://docs.microsoft.com/python/api/azure-cosmos/azure.cosmos.cosmos_client.cosmosclient?view=azure-python) from our Python SDK. It already has the context of this account baked in.

In [4]:
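## Create a new database if it doesn't exist.
# Error types from the azure-cosmos 3.x SDK (harmless if the notebook already provides `errors`)
import azure.cosmos.errors as errors

database_id = "RetailDemo"
database_link = 'dbs/' + database_id

try:
    cosmos_client.CreateDatabase({"id": database_id})
    print('Database with id \'{0}\' created'.format(database_id))

except errors.HTTPFailure as e:
    # 409 Conflict means a database with this id already exists
    if e.status_code == 409:
        print('A database with name \'{0}\' already exists'.format(database_id))
    else:
        raise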

A database with name 'RetailDemo' already exists


### Create a new container WebsiteData inside the RetailDemo database.
Our dataset will contain events that occurred on the website, e.g. a user viewing an item, adding it to their cart, or purchasing it. We will partition by ```CartID```, which identifies the individual cart of each user. Because the events are spread across many distinct carts, this should give us an even distribution of throughput and storage in our container. Learn more about how to [choose a good partition key](https://docs.microsoft.com/azure/cosmos-db/partition-data).

In [5]:
## Create a new container if it doesn't already exist
container_id = "WebsiteData"
container_link = database_link + '/colls/' + container_id
try:
    container_definition = {
        "id": container_id,
        "partitionKey": {
            # Partition key paths are case-sensitive; this must match the "CartID" property in the documents
            "paths": [
              "/CartID"
            ]
        }
    }

    container = cosmos_client.CreateContainer(database_link, container_definition)
    print('Container with id \'{0}\' created'.format(container_id))

except errors.HTTPFailure as e:
    # 409 Conflict means a container with this id already exists
    if e.status_code == 409:
        print('A container with id \'{0}\' already exists'.format(container_id))
    else:
        raise

A container with id 'WebsiteData' already exists
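
Partition key paths are matched against document property names exactly, including letter case, so it can help to check the path against a sample document before loading data. The following is a minimal local sketch of such a check. ```resolve_partition_key``` is a hypothetical helper written for this illustration (it is not part of the SDK) and it only handles simple property paths.

```python
def resolve_partition_key(document, path):
    """Return the value found at a partition key path such as "/CartID".

    Hypothetical helper for illustration only: it walks the path one
    property at a time and returns None when a property is missing.
    The lookup is case-sensitive, like Python dict keys.
    """
    value = document
    for part in path.strip("/").split("/"):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


# A trimmed copy of the sample event shown in this notebook
sample_event = {"CartID": 5399, "Action": "Viewed", "Item": "Cosmos T-shirt"}

print(resolve_partition_key(sample_event, "/CartID"))  # 5399
print(resolve_partition_key(sample_event, "/CartId"))  # None, the letter case differs
```

If the path does not resolve for the documents being loaded, the container has no usable partition key value to spread them by, so a quick check like this is cheap insurance.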


### Load in sample JSON data and insert into the container. 
We'll use the **UpsertItem** operation to create the item if it doesn't exist, or replace it if it already exists. This will take a few minutes.

Here's a sample JSON document.
```
{
  "CartID": 5399,
  "Action": "Viewed",
  "Item": "Cosmos T-shirt",
  "Price": 350,
  "UserName": "Chadrick.Larkin87",
  "Country": "Iceland",
  "EventDate": "2015-06-25T00:00:00",
  "Year": 2015,
  "Latitude": -66.8673,
  "Longitude": -29.8214,
  "Address": "852 Modesto Loop, Port Ola, Iceland",
  "id": "00ffd39c-7e98-4451-9b91-b2bcf2f9a32d"
}
```

In [6]:
## Read data from storage
import json
import urllib.request

data_url = "https://cosmosnotebooksdata.blob.core.windows.net/notebookdata/websiteData.json"
with urllib.request.urlopen(data_url) as response:
    data = json.loads(response.read().decode())

## Upsert each event; any SDK error propagates and stops the load
for event in data:
    cosmos_client.UpsertItem(container_link, event)
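
Because the load uses upsert, rerunning the cell should not create duplicates: an item whose ```id``` is already present is replaced rather than inserted again. The sketch below models that behavior with an in-memory dictionary. It is only an illustration of the semantics, not the SDK or the service; the ```upsert``` function and ```store``` are hypothetical, and the model ignores partition keys.

```python
def upsert(store, item):
    """Insert the item, or replace the stored item that has the same id."""
    existed = item["id"] in store
    store[item["id"]] = item
    return "replaced" if existed else "created"


store = {}  # in-memory stand-in for a container, keyed by item id

first = upsert(store, {"id": "a1", "Action": "Viewed"})
second = upsert(store, {"id": "a1", "Action": "Purchased"})

# The second call replaces the first item, so only one item remains
print(first, second, len(store))
```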

## Run a query against Azure Cosmos DB, using **CosmosClient**.
We'll run the query **SELECT VALUE COUNT(1) FROM c** to count the number of documents in the container.

In [56]:
## Run a query against the container to see number of documents
query = {'query': 'SELECT VALUE COUNT(1) FROM c'}

# The count spans every partition, so cross-partition queries must be enabled
options = {'enableCrossPartitionQuery': True}

result_iterable = cosmos_client.QueryItems(container_link, query, options)
for item in result_iterable:
    print('Container with id \'{0}\' contains \'{1}\' items'.format(container_id, item))

Container with id 'WebsiteData' contains '2654' items
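
The query passed to the client is a plain dictionary, and the same dictionary can carry named parameters alongside the query text. Below is a small sketch of building such a query spec. ```build_query``` is a hypothetical convenience helper, and the ```Country``` filter is just an example based on the sample document.

```python
def build_query(query_text, **parameters):
    """Build a query spec dict with named parameters (hypothetical helper)."""
    return {
        "query": query_text,
        "parameters": [
            {"name": "@" + name, "value": value}
            for name, value in parameters.items()
        ],
    }


query = build_query(
    "SELECT VALUE COUNT(1) FROM c WHERE c.Country = @country",
    country="Iceland",
)
print(query)

# In the notebook, the spec could then be passed to the client, for example:
# cosmos_client.QueryItems(container_link, query, {'enableCrossPartitionQuery': True})
```

Keeping values in parameters rather than formatting them into the query text avoids quoting mistakes when the value comes from user input.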


## Run some queries against Azure Cosmos DB, using the built-in notebook magic
We'll use the syntax:

```
%%sql --database {database_id} --container {container_id} --output outputDataframeVar
{Query text}
```

This allows us to output the results of the query directly into a Pandas data frame.


### Query #1
Get the latest record, using the query: ```SELECT TOP 1 * FROM c ORDER BY c._ts desc```

In [60]:
%%sql --database {database_id} --container {container_id}
SELECT TOP 1 * from c order by c._ts desc

| | Action | Address | CartID | Country | EventDate | Item | Latitude | Longitude | Price | UserName | Year | _attachments | _etag | _rid | _self | _ts | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Viewed | 66786 Marlen Path, Erachester, Haiti | 9211 | Haiti | 0001-01-01T00:00:00 | Flannel Shirt | -1.0224 | -135.163 | 19.99 | Troy.Beatty | 0 | attachments/ | "0000fb14-0000-0400-0000-5d5d84cd0000" | JjpGALE5LzJeCgAAAAAAAA== | dbs/JjpGAA==/colls/JjpGALE5LzI=/docs/JjpGALE5L... | 1566409933 | 6e90664c-8ab7-44f6-8e56-8ea67e217f37 |
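
The ```_ts``` system property used for ordering is the item's last-modified time, stored as seconds since the Unix epoch. As a small illustration, the value from the result above can be converted to a readable UTC timestamp with the standard library; ```ts_to_datetime``` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def ts_to_datetime(ts):
    """Convert a _ts value (epoch seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


last_modified = ts_to_datetime(1566409933)
print(last_modified.isoformat())  # 2019-08-21T17:52:13+00:00
```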


In [59]:
%%sql?

Docstring:
::

  %sql [--database DATABASE] [--container CONTAINER] [--output OUTPUT]

Queries Azure Cosmos DB using the given Cosmos database and container.
Learn about the Cosmos query language: https://aka.ms/CosmosQuery

Example:
    %%sql --database databaseName --container containerName
    SELECT top 1 r.id, r._ts from r order by r._ts desc

optional arguments:
  --database DATABASE, -d DATABASE
                        If provided, this Cosmos database will be used;
  --container CONTAINER, -c CONTAINER
                        If provided, this Cosmos container will be used;
  --output OUTPUT       The dataframe of the result will be stored in a
                        variable with this name.
File:      /usr/local/lib/python3.6/dist-packages/cosmos_sql/__init__.py


### Query #2
Get particular fields we want to visualize and analyze, using the query: ```SELECT c.Action, c.Price as ItemRevenue, c.Country, c.Item FROM c```. The results will be saved into a Pandas dataframe named ```df_cosmos```.

In [50]:
%%sql --database {database_id} --container {container_id} --output df_cosmos
SELECT c.Action, c.Price as ItemRevenue, c.Country, c.Item FROM c

In [51]:
# See a sample of the result
df_cosmos.head(10)

| | Action | Country | Item | ItemRevenue |
|---|---|---|---|---|
| 0 | Viewed | Tunisia | Black Tee | 9.0 |
| 1 | Viewed | Antigua and Barbuda | Flannel Shirt | 19.99 |
| 2 | Added | Guinea-Bissau | Socks | 3.75 |
| 3 | Viewed | Guinea-Bissau | Socks | 3.75 |
| 4 | Viewed | Czech Republic | Rainjacket | 55.0 |
| 5 | Viewed | Iceland | Cosmos T-shirt | 350.0 |
| 6 | Added | Syrian Arab Republic | Button-Up Shirt | 19.99 |
| 7 | Viewed | Syrian Arab Republic | Button-Up Shirt | 19.99 |
| 8 | Viewed | Tuvalu | Red Top | 33.0 |
| 9 | Viewed | Cape Verde | Flip Flop Shoes | 14.0 |


### Query #3

We can also count the number of items in the container using the ```%%sql``` syntax.

In [52]:
%%sql --database RetailDemo --container WebsiteData
SELECT VALUE COUNT(1) FROM c

| | 0 |
|---|---|
| 0 | 2654 |


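### Work with the query results in a Pandas dataframe
Query #2 stored its results in the ```df_cosmos``` dataframe. We'll run a simple group by on that dataframe to sum ```ItemRevenue``` for each country and display a sample of the results. Note that this sums the item price over every event in the data (views and cart additions as well as purchases).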

### Sum revenue by country

In [53]:
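# Sum only the numeric ItemRevenue column so text columns such as Action and Item are left out
df_revenue = df_cosmos.groupby("Country")["ItemRevenue"].sum().reset_index()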

display(df_revenue.head(5))

| | Country | ItemRevenue |
|---|---|---|
| 0 | Afghanistan | 785.44 |
| 1 | Albania | 605.8 |
| 2 | Algeria | 1058.98 |
| 3 | American Samoa | 229.0 |
| 4 | Andorra | 247.49 |
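
The group by above adds up prices for every event type. As a hedged illustration of how a purchases-only total would differ, here is a plain-Python sketch over a few made-up events (the values are hypothetical, not taken from the dataset). In the notebook, the same idea would amount to filtering ```df_cosmos``` on ```Action``` before grouping.

```python
from collections import defaultdict

# Hypothetical events, shaped like the rows of df_cosmos
events = [
    {"Country": "Iceland", "Action": "Viewed", "ItemRevenue": 350.0},
    {"Country": "Iceland", "Action": "Purchased", "ItemRevenue": 350.0},
    {"Country": "Tuvalu", "Action": "Purchased", "ItemRevenue": 33.0},
    {"Country": "Tuvalu", "Action": "Purchased", "ItemRevenue": 14.0},
    {"Country": "Tuvalu", "Action": "Added", "ItemRevenue": 33.0},
]


def revenue_by_country(events, actions=("Purchased",)):
    """Sum ItemRevenue per country, counting only the given actions."""
    totals = defaultdict(float)
    for event in events:
        if event["Action"] in actions:
            totals[event["Country"]] += event["ItemRevenue"]
    return dict(totals)


purchased_only = revenue_by_country(events)
all_events = revenue_by_country(events, actions=("Viewed", "Added", "Purchased"))

print(purchased_only)
print(all_events)
```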


### Analyze the top 5 most purchased items

In [54]:
## What are the top 5 purchased items?
purchases = df_cosmos[df_cosmos['Action'] == 'Purchased']

# Count purchase events per item and keep the 5 largest counts
purchases.groupby('Item').size().sort_values(ascending=False).head(5).to_frame('Count')

Unnamed: 0_level_0,Count
Item,Unnamed: 1_level_1
Puffy Jacket,25
Athletic Shoes,20
Athletic Shorts,19
Crewneck Sweater,14
Light Jeans,13


## Visualization #1: Sales revenue by country on a world map

Now that we have our revenue data from our Cosmos container, we'll visualize it using Bokeh. Credit to https://towardsdatascience.com/a-complete-guide-to-an-interactive-geographical-map-using-python-f4c5197e23e0 for inspiration.

In [None]:
import sys
!{sys.executable} -m pip install bokeh

### Prepare our data to be plotted

In [None]:
import geopandas as gpd

# Load country information for mapping
countries = gpd.read_file("https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson")

# Merge the countries dataframe with our data from Azure Cosmos DB, joining on country name.
# The left join keeps every country on the map; countries without a match get NaN revenue.
df_merged = countries.merge(df_revenue, left_on = 'ADMIN', right_on = 'Country', how='left')

# Convert to a GeoJSON string so bokeh can plot it
json_data = df_merged.to_json()
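
Since the join above matches on country names, any country whose name is spelled differently in the two sources will silently end up without revenue on the map. A quick set difference can surface those names. The sketch below uses made-up name lists; ```unmatched_names``` is a hypothetical helper, and in the notebook it could be called with ```df_revenue['Country']``` and ```countries['ADMIN']```.

```python
def unmatched_names(revenue_countries, map_countries):
    """Return revenue country names that have no exact match in the map data."""
    return sorted(set(revenue_countries) - set(map_countries))


# Hypothetical name lists for illustration
revenue_countries = ["Iceland", "Haiti", "Example Land"]
map_countries = ["Iceland", "Haiti", "Tuvalu"]

missing = unmatched_names(revenue_countries, map_countries)
print(missing)  # ['Example Land']
```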


### Plot the sales revenue on a world map
This may take a few seconds...

In [None]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import GeoJSONDataSource, LinearColorMapper, ColorBar
from bokeh.palettes import brewer

#Input GeoJSON source that contains features for plotting.
geosource = GeoJSONDataSource(geojson = json_data)

#Choose our choropleth color palette: https://bokeh.pydata.org/en/latest/docs/reference/palettes.html
palette = brewer['YlGn'][8]

#Reverse color order so that dark green is highest revenue
palette = palette[::-1]

#Instantiate LinearColorMapper that linearly maps numbers in a range, into a sequence of colors.
#Values above `high` (revenue over $1000) are clamped to the darkest color.
color_mapper = LinearColorMapper(palette = palette, low = 0, high = 1000)

#Define custom tick labels for color bar, matching the 0-1000 range of the color mapper.
tick_labels = {'0': '$0', '250': '$250', '500':'$500', '750':'$750', '1000':'>$1000'}

#Create color bar.
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=8, width = 500, height = 20,
                     border_line_color=None, location = (0,0), orientation = 'horizontal',
                     major_label_overrides = tick_labels)

#Create figure object. Newer Bokeh releases use width/height in place of plot_width/plot_height.
p = figure(title = 'Sales revenue by country', height = 600, width = 950, toolbar_location = None)
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None

#Add patch renderer to figure. 
p.patches('xs','ys', source = geosource,fill_color = {'field' :'ItemRevenue', 'transform' : color_mapper},
          line_color = 'black', line_width = 0.25, fill_alpha = 1)

#Specify figure layout.
p.add_layout(color_bar, 'below')

#Display figure inline in Jupyter Notebook.
output_notebook()

#Display figure.
show(p)
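
To make the coloring in the cell above easier to reason about, here is a simplified model of how a linear color mapper assigns a value to one of the palette's colors. It is a sketch of the idea rather than Bokeh's actual implementation; ```palette_index``` is a hypothetical function, and it does not handle missing (NaN) values.

```python
def palette_index(value, low, high, n_colors):
    """Map a value to a palette position by splitting [low, high] into equal bins.

    Simplified illustration: values outside the range are clamped to the
    first or last color.
    """
    if value <= low:
        return 0
    if value >= high:
        return n_colors - 1
    fraction = (value - low) / (high - low)
    return min(int(fraction * n_colors), n_colors - 1)


# Revenue values from the sample output, mapped onto an 8-color palette over 0-1000
for revenue in (0, 229.0, 500, 1058.98):
    print(revenue, palette_index(revenue, low=0, high=1000, n_colors=8))
```

With this range, every country above $1000 shares the darkest color, so raising the upper bound would be one way to separate the highest-revenue countries.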

## Visualization #2: Conversion rate of Viewed -> Added to cart -> Purchased by item

In our WebsiteData container, we have a record of each time a user viewed an item, added it to their cart, or purchased it. We can visualize the conversion funnel for each item by plotting the number of events of each type. Credit to: https://bokeh.pydata.org/en/latest/docs/user_guide/categorical.html for inspiration.

### Plot our data

In [45]:
from bokeh.io import show, output_notebook

from bokeh.plotting import figure
from bokeh.palettes import Spectral3
from bokeh.transform import factor_cmap
from bokeh.models import ColumnDataSource, FactorRange


# Get the top 10 purchased items as a list
purchases = df_cosmos[df_cosmos['Action']=='Purchased']
top_10_items = purchases.groupby('Item').size().sort_values(ascending=False)[:10].index.values.tolist()

# Filter our data to only these 10 items
df_top10 = df_cosmos[df_cosmos['Item'].isin(top_10_items)]

# Group by Item and Action, count the events, and sort each item's actions by event count
df_top10_sorted = (
    df_top10.groupby(['Item', 'Action']).count()
    .rename(columns={'Country':'ResultCount'})
    .reset_index()
    .sort_values(['Item', 'ResultCount'], ascending = False)
    .set_index(['Item', 'Action'])
)

# Get sorted X-axis values - this way, we can display the funnel of view -> add -> purchase
x_axis_values = df_top10_sorted.index.values.tolist()

# Bokeh turns a GroupBy source into summary columns such as ResultCount_max, indexed by Item_Action
group = df_top10_sorted.groupby(['Item', 'Action'])

# Color each bar by its Action, the second level of the (Item, Action) factor
index_cmap = factor_cmap('Item_Action', palette=Spectral3, factors=sorted(df_top10.Action.unique()), start=1, end=2)

# Create the plot

p = figure(width=1200, height=500, title="Conversion rate of items from View -> Add to cart -> Purchase", x_range=FactorRange(*x_axis_values), toolbar_location=None, tooltips=[("Number of events", "@ResultCount_max"), ("Item, Action", "@Item_Action")])

# Bar height is the event count, the same column the tooltip reports
p.vbar(x='Item_Action', top='ResultCount_max', width=1, source=group,
       line_color="white", fill_color=index_cmap)

#Configure how the plot looks
p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = "black"
p.xaxis.axis_label = "Item"
p.yaxis.axis_label = "Count"

#Display figure inline in Jupyter Notebook.
output_notebook()

#Display figure.
show(p)
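
The chart above shows event counts for each step of the funnel. To turn those counts into actual conversion rates, the counts for one item can be divided step by step. This is a minimal sketch with made-up counts; ```conversion_rates``` is a hypothetical helper and the numbers are not taken from the dataset.

```python
def conversion_rates(counts):
    """Compute step-by-step conversion rates from per-action event counts.

    A rate is None when the step it is measured against has no events.
    """
    viewed = counts.get("Viewed", 0)
    added = counts.get("Added", 0)
    purchased = counts.get("Purchased", 0)
    return {
        "view_to_add": added / viewed if viewed else None,
        "add_to_purchase": purchased / added if added else None,
        "view_to_purchase": purchased / viewed if viewed else None,
    }


# Hypothetical counts for a single item
counts = {"Viewed": 100, "Added": 40, "Purchased": 10}
rates = conversion_rates(counts)
print(rates)
```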
