# Getting started with Cosmos notebooks
In this notebook, we'll import sample data into a container in Azure Cosmos DB, analyze it, and create some visualizations of the data.

### First, let's create an instance of CosmosClient we can use to run operations against Azure Cosmos DB.

In [2]:
## Create an instance of CosmosClient
import os
import azure.cosmos.cosmos_client as cosmos
import azure.cosmos.errors as errors

client = cosmos.CosmosClient(os.environ["COSMOS_ENDPOINT"], {'masterKey': os.environ["COSMOS_KEY"]})


In [3]:
import sys
#!{sys.executable} -m pip install cufflinks
#!{sys.executable} -m pip install bokeh
!{sys.executable} -m pip install geopandas
#!{sys.executable} -m pip install plotly==4.1.0




### Create a new database named RetailDemo for our data.

In [30]:
## Create a new database if it doesn't exist.
database_id = "RetailDemo"
database_link = 'dbs/' + database_id

try:
    client.CreateDatabase({"id": database_id})
    print('Database with id \'{0}\' created'.format(database_id))

except errors.HTTPFailure as e:
    # 409 Conflict means the database already exists
    if e.status_code == 409:
        print('A database with name \'{0}\' already exists'.format(database_id))
    else:
        raise

A database with name 'RetailDemo' already exists


### Create a new container WebsiteData inside the RetailDemo database.
Our dataset contains events that occurred on the website, e.g. a user viewing an item, adding it to their cart, or purchasing it. We will partition by CartID, which identifies the individual cart of each user. Because events are spread across many carts, this gives us an even distribution of throughput and storage in our container. Learn more about how to [choose a good partition key.](https://docs.microsoft.com/azure/cosmos-db/partition-data)

In [31]:
## Create a new container if it doesn't already exist
container_id = "WebsiteData"
container_link = database_link + '/colls/' + container_id
try:
    container_definition = {
        "id": container_id,
        "partitionKey": {
            "paths": [
                "/CartID"  # property names are case-sensitive; the sample documents use "CartID"
            ]
        }
    }

    container = client.CreateContainer(database_link, container_definition)
    print('Container with id \'{0}\' created'.format(container_id))

except errors.HTTPFailure as e:
    # 409 Conflict means the container already exists
    if e.status_code == 409:
        print('A container with id \'{0}\' already exists'.format(container_id))
    else:
        raise

A container with id 'WebsiteData' already exists
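
One quick way to judge a partition key candidate is to count how many events fall under each of its values: a key whose counts are roughly equal spreads data more evenly than one where a single value dominates. The sketch below uses a handful of made-up events shaped like the sample document (not the real dataset), and `key_distribution` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter

# Made-up events for illustration; these are not taken from the real dataset.
sample_events = [
    {"CartID": 5399, "Action": "Viewed", "Country": "Iceland"},
    {"CartID": 5399, "Action": "Added", "Country": "Iceland"},
    {"CartID": 2924, "Action": "Viewed", "Country": "Iceland"},
    {"CartID": 7113, "Action": "Viewed", "Country": "Iceland"},
]

def key_distribution(events, key):
    """Count how many events share each value of `key` (hypothetical helper)."""
    return Counter(event[key] for event in events)

# Several small groups: the events spread across many key values.
print(key_distribution(sample_events, "CartID"))
# One large group: every event lands under the same key value.
print(key_distribution(sample_events, "Country"))
```

In this toy data the cart identifier splits the events into three groups, while the country puts all four into one.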


### Load in sample JSON data and insert into the container. 
We'll use the **UpsertItem** operation to create the item if it doesn't exist, or replace it if it already exists. This will take a few minutes.

Here's a sample JSON document.
```
{"CartID":5399,
"Action":"Viewed",
"Item":"Cosmos T-shirt",
"Price":350,
"UserName":"Chadrick.Larkin87",
"Country":"Iceland",
"EventDate":"2015-06-25T00:00:00",
"Year":2015,
"Latitude":-66.8673,
"Longitude":-29.8214,
"Address":"852 Modesto Loop, Port Ola, Iceland",
"id":"00ffd39c-7e98-4451-9b91-b2bcf2f9a32d"}
```

In [6]:
## Read data from storage
import json
import urllib.request

with urllib.request.urlopen("https://cosmosnotebooksdata.blob.core.windows.net/notebookdata/websiteData.json") as url:
    data = json.loads(url.read().decode())

## Upsert the first five events; iterate over `data` instead of `data[:5]` to load the full dataset.
## Any error raised by UpsertItem propagates to the notebook.
for event in data[:5]:
    client.UpsertItem(container_link, event)
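
To make the create-or-replace behavior concrete, here is a minimal in-memory sketch. It is not the Cosmos DB implementation: the `store` dictionary only stands in for a container, `upsert` is a hypothetical helper, and the sketch ignores partitioning (in Cosmos DB an item is matched by its `id` within a logical partition).

```python
def upsert(store, item):
    """Insert `item` under its "id", or replace the existing item with that id."""
    existed = item["id"] in store
    store[item["id"]] = item
    return "replaced" if existed else "created"

store = {}  # stands in for the container; not a real Cosmos DB client

print(upsert(store, {"id": "a1", "Action": "Viewed"}))     # first write creates the item
print(upsert(store, {"id": "a1", "Action": "Purchased"}))  # same id, so the item is replaced
print(len(store))                                          # still a single item
```

Running the load cell twice is therefore safe: the second run overwrites the same items rather than duplicating them.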

## Run a query against Azure Cosmos DB, using **CosmosClient**.
We'll run the query **SELECT VALUE COUNT(1) FROM c** to count the number of documents in the container.

In [32]:
## Run a query against the container to see number of documents
query = {'query': 'SELECT VALUE COUNT(1) FROM c'}

options = {}
options['enableCrossPartitionQuery'] = True

result_iterable = client.QueryItems(container_link, query, options)
for item in result_iterable:
    print('Container with id \'{0}\' contains \'{1}\' items'.format(container_id, item))

Container with id 'WebsiteData' contains '2654' items
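
The same query dictionary can also carry parameters, which avoids building query text by string concatenation. The sketch below assumes the `query` / `parameters` layout of the Cosmos DB SQL query spec. `build_query` is a hypothetical helper, and the filter on `c.Country` is only an example.

```python
def build_query(text, **params):
    """Build a query spec dict; each keyword becomes an @-prefixed parameter."""
    return {
        "query": text,
        "parameters": [
            {"name": "@" + name, "value": value} for name, value in params.items()
        ],
    }

# Example: count only the events from one country (the filter is illustrative).
country_query = build_query(
    "SELECT VALUE COUNT(1) FROM c WHERE c.Country = @country",
    country="Iceland",
)
print(country_query)
```

In the notebook, `country_query` would be passed to `client.QueryItems` in place of `query`, with the same `options`.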


## Run some queries against Azure Cosmos DB, using the built-in notebook magic
We'll use the syntax:

```
%%sql --database {database_id} --container {container_id}
{Query text}
```

### Query #1
Get the most recently written record (`_ts` is the item's last-modified timestamp), using the query: `SELECT TOP 1 * FROM c ORDER BY c._ts DESC`

In [33]:
%%sql --database {database_id} --container {container_id}
SELECT TOP 1 * from c order by c._ts desc

Unnamed: 0,Action,Address,CartID,Country,EventDate,Item,Latitude,Longitude,Price,UserName,Year,_attachments,_etag,_rid,_self,_ts,id
0,Viewed,"09601 Kacey Mount, Bahringerhaven, Czech Republic",2924,Czech Republic,2019-04-08T00:00:00,Rainjacket,19.105,-60.2559,55,Gerhard47,2019,attachments/,"""00002800-0000-0400-0000-5d5b2e020000""",DQMeAJ+SPd4FAAAAAAAAAA==,dbs/DQMeAA==/colls/DQMeAJ+SPd4=/docs/DQMeAJ+SP...,1566256642,526d2d25-a087-4c81-917f-504958567616


### Query #2
Get the particular fields we want to visualize and analyze, using the query: `SELECT c.Action, c.Price as ItemRevenue, c.Country, c.Item FROM c`

In [34]:
%%sql --database {database_id} --container {container_id}
SELECT c.Action, c.Price as ItemRevenue, c.Country, c.Item FROM c

Unnamed: 0,Action,Country,Item,ItemRevenue
0,Viewed,Tunisia,Black Tee,9.00
1,Viewed,Antigua and Barbuda,Flannel Shirt,19.99
2,Added,Guinea-Bissau,Socks,3.75
3,Viewed,Guinea-Bissau,Socks,3.75
4,Viewed,Czech Republic,Rainjacket,55.00
5,Viewed,Iceland,Cosmos T-shirt,350.00
6,Added,Syrian Arab Republic,Button-Up Shirt,19.99
7,Viewed,Syrian Arab Republic,Button-Up Shirt,19.99
8,Viewed,Tuvalu,Red Top,33.00
9,Viewed,Cape Verde,Flip Flop Shoes,14.00


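### Load the result of the previous cell into a pandas DataFrame
In a notebook, `_` holds the output of the most recently executed cell, so we can pass it straight to pandas. We'll then run a simple group-by on the DataFrame to sum the total sales revenue for each country and display a sample of the results.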

In [35]:
import pandas as pd

## Create a dataframe using the output of the query in the previous cell 
df = pd.DataFrame(_, columns = ['Action', 'Country', 'Item', 'ItemRevenue'])

### Sum revenue by country

In [36]:
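# Sum only the numeric ItemRevenue column; the text columns (Action, Item) are not meaningful to sum
df_revenue = df.groupby("Country")["ItemRevenue"].sum().reset_index()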

display(df_revenue.head(5))

Unnamed: 0,Country,ItemRevenue
0,Afghanistan,785.44
1,Albania,605.8
2,Algeria,1058.98
3,American Samoa,229.0
4,Andorra,247.49
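
For readers new to pandas, the group-by above amounts to keeping a running revenue total per country. Here is a plain-Python sketch of the same idea, using made-up rows rather than the query results. `revenue_by_country` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up rows for illustration; these are not the query results.
rows = [
    {"Country": "Iceland", "ItemRevenue": 350.0},
    {"Country": "Tunisia", "ItemRevenue": 9.0},
    {"Country": "Iceland", "ItemRevenue": 55.0},
]

def revenue_by_country(rows):
    """Sum ItemRevenue per Country and return the totals sorted by country name."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["Country"]] += row["ItemRevenue"]
    return dict(sorted(totals.items()))

print(revenue_by_country(rows))
```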


### Analyze the top 5 most-purchased items

In [37]:
## What are the top 5 purchased items?
pd.DataFrame(df[df['Action']=='Purchased'].groupby('Item').size().sort_values(ascending=False).head(), columns=['Count'])

Item,Count
Puffy Jacket,25
Athletic Shoes,20
Athletic Shorts,19
Crewneck Sweater,14
Light Jeans,13


## Visualization #1: Sales revenue by country on a world map

Now that we have the revenue data from our Cosmos container, we'll visualize it using Bokeh. Credit to https://towardsdatascience.com/a-complete-guide-to-an-interactive-geographical-map-using-python-f4c5197e23e0 for inspiration.

### Prepare our data to be plotted

In [None]:
import json
import geopandas as gpd

# Load country information for mapping
countries = gpd.read_file("https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson")

# Merge the countries dataframe with our data from Azure Cosmos DB, joining on country name.
# The left join keeps every country shape; countries with no sales get a missing ItemRevenue.
df_merged = countries.merge(df_revenue, left_on = 'ADMIN', right_on = 'Country', how='left')

# Convert to GeoJSON so bokeh can plot it
merged_json = json.loads(df_merged.to_json())
json_data = json.dumps(merged_json)
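
Because the merge joins on country names rather than codes, any spelling difference between the two sources silently leaves that country without revenue on the map. A small check like the one below can surface such mismatches before plotting. `unmatched_names` is a hypothetical helper, and the names in the example are invented for illustration, not taken from either dataset.

```python
def unmatched_names(map_names, data_names):
    """Return the data-side names that have no exact counterpart on the map side."""
    return sorted(set(data_names) - set(map_names))

# Invented names for illustration only.
map_names = ["Iceland", "Tunisia", "Czechia"]
data_names = ["Iceland", "Czech Republic", "Tunisia"]

print(unmatched_names(map_names, data_names))
```

In the notebook, the equivalent call would be `unmatched_names(countries['ADMIN'], df_revenue['Country'])`.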


### Plot the sales revenue on a world map
This may take a few seconds...

In [None]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import GeoJSONDataSource, LinearColorMapper, ColorBar
from bokeh.palettes import brewer

#Input GeoJSON source that contains features for plotting.
geosource = GeoJSONDataSource(geojson = json_data)

#Choose our choropleth color palette: https://bokeh.pydata.org/en/latest/docs/reference/palettes.html
palette = brewer['YlGn'][8]

#Reverse color order so that dark green is highest revenue
palette = palette[::-1]

#Instantiate LinearColorMapper that linearly maps numbers in a range, into a sequence of colors.
color_mapper = LinearColorMapper(palette = palette, low = 0, high = 1000)

#Define custom tick labels for color bar. The mapper's range is 0 to 1000, and revenue above 1000 gets the darkest color.
tick_labels = {'0': '$0', '250': '$250', '500':'$500', '750':'$750', '1000':'>$1000'}

#Create color bar. 
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=8, width=500, height=20,
                     border_line_color=None, location=(0, 0), orientation='horizontal',
                     major_label_overrides=tick_labels)

#Create figure object.
p = figure(title = 'Sales revenue by country', plot_height = 600 , plot_width = 950, toolbar_location = None)
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None

#Add patch renderer to figure. 
p.patches('xs','ys', source = geosource,fill_color = {'field' :'ItemRevenue', 'transform' : color_mapper},
          line_color = 'black', line_width = 0.25, fill_alpha = 1)

#Specify figure layout.
p.add_layout(color_bar, 'below')

#Display figure inline in Jupyter Notebook.
output_notebook()

#Display figure.
show(p)
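
To see what "linearly maps numbers in a range into a sequence of colors" means in practice, here is a rough sketch of the idea: the range is cut into equal-width bins, one per palette entry, and values outside the range are clamped to the first or last color. This approximates the behavior and is not Bokeh's code. `color_for` is a hypothetical helper and the four-color palette is made up, so exact bin edges may differ from Bokeh's.

```python
def color_for(value, palette, low, high):
    """Pick the palette entry for `value`, clamping values outside [low, high]."""
    if value <= low:
        return palette[0]
    if value >= high:
        return palette[-1]
    bin_width = (high - low) / len(palette)
    return palette[int((value - low) / bin_width)]

# Made-up light-to-dark palette for illustration.
palette = ["#f7fcb9", "#addd8e", "#41ab5d", "#005a32"]

print(color_for(100, palette, 0, 1000))   # falls in the first (lightest) bin
print(color_for(600, palette, 0, 1000))   # falls in the third bin
print(color_for(2500, palette, 0, 1000))  # above the range, clamped to the darkest color
```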

## Visualization #2: Conversion rate of Viewed -> Added to cart -> Purchased by item

In our WebsiteData container, we have a record of users who viewed an item, added it to their cart, and purchased it. We can visualize the conversion rate for each item.
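
One way to define these conversion rates, sketched below, is the share of views that led to an add-to-cart and the share of add-to-carts that led to a purchase. The events are made up, `conversion_rates` is a hypothetical helper, and the choice of denominators is an assumption rather than the only reasonable definition.

```python
from collections import Counter

# Made-up (item, action) events for illustration.
events = [
    ("Socks", "Viewed"), ("Socks", "Viewed"), ("Socks", "Viewed"), ("Socks", "Viewed"),
    ("Socks", "Added"), ("Socks", "Added"),
    ("Socks", "Purchased"),
]

def conversion_rates(events):
    """Return view-to-cart and cart-to-purchase rates for each item."""
    counts = Counter(events)
    rates = {}
    for item in {item for item, _ in events}:
        viewed = counts[(item, "Viewed")]
        added = counts[(item, "Added")]
        purchased = counts[(item, "Purchased")]
        rates[item] = {
            # Guard against division by zero when an item has no views or no adds.
            "view_to_cart": added / viewed if viewed else 0.0,
            "cart_to_purchase": purchased / added if added else 0.0,
        }
    return rates

print(conversion_rates(events))
```

Here 2 of 4 views became cart additions and 1 of 2 cart additions became a purchase, so both rates are 0.5.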

### Plot our data
See the Bokeh user guide on categorical data: https://bokeh.pydata.org/en/latest/docs/user_guide/categorical.html