# Accessing Data Observatory subscriptions with BigQuery's Analytics Toolbox

This tutorial shows how to use your subscriptions to Data Observatory datasets and access them through the Analytics Toolbox from a Jupyter notebook.

We first authenticate to the CARTO Data Warehouse with the `carto_auth` library, and then use the resulting client to explore our available subscriptions and select the variables we are interested in. Finally, we enrich a sample dataset with one of our subscribed datasets.

In [1]:
!pip install carto_auth[carto-dw] pydeck pydeck-carto -q 


## Authentication to CARTO
We start by using the `carto_auth` package to authenticate to our CARTO account and to get the necessary details to interact with data available in the CARTO Data Warehouse.

In [2]:
import pydeck as pdk
from carto_auth import CartoAuth
from pydeck_carto import register_carto_layer, get_layer_credentials, get_error_notifier
from pydeck_carto.layer import MapType, CartoConnection
from pydeck_carto.styles import color_continuous, color_categories

In [None]:

# Authentication with CARTO
carto_auth = CartoAuth.from_oauth()
# CARTO Data Warehouse client
carto_dw_client = carto_auth.get_carto_dw_client()

## Listing subscriptions
We first retrieve our subscriptions as a pandas dataframe in order to explore what we have. The query below passes a filter so that only subscriptions licensed as public data are returned.
For more details about how to use the following functions, please refer to the [Analytics Toolbox documentation](https://docs.carto.com/analytics-toolbox-bigquery/sql-reference/data/#dataobs_subscriptions).

In [4]:
subscription_id = "XXXXXXXX"  # IMPORTANT: do not remove this line.
get_subscriptions_q = \
f"""
CALL `carto-un`.carto.DATAOBS_SUBSCRIPTIONS('{subscription_id}',"dataset_license = 'Public data'");
"""
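
The second argument of the call is a SQL filter passed as a string, which makes it easy to narrow the listing. As a minimal sketch, the helper below (`build_subscriptions_call` is a hypothetical name, not part of the Analytics Toolbox or `carto_auth`) builds the same statement for an arbitrary filter. It assumes that any column shown in the result, such as `dataset_country` or `dataset_category`, can be used in the filter, as `dataset_license` is in the cell above.

```python
def build_subscriptions_call(source: str, filters: str) -> str:
    """Return a CALL statement for DATAOBS_SUBSCRIPTIONS.

    `source` plays the role of the notebook's `subscription_id`;
    `filters` is a SQL condition on the columns of the result.
    """
    # The filter is embedded in a double-quoted SQL string, so it may use
    # single quotes freely but must not contain double quotes itself.
    if '"' in filters:
        raise ValueError("use single quotes inside the filter expression")
    return f"""CALL `carto-un`.carto.DATAOBS_SUBSCRIPTIONS('{source}', "{filters}");"""


# Hypothetical filter combining two columns of the subscriptions listing.
query = build_subscriptions_call(
    "XXXXXXXX",
    "dataset_country = 'Spain' AND dataset_category = 'Points of Interest'",
)
print(query)
```

The resulting string can be passed to `carto_dw_client.query(...)` exactly like `get_subscriptions_q`.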

In [6]:
subs_df = carto_dw_client.query(get_subscriptions_q).result().to_dataframe()
subs_df.sample(5)


Unnamed: 0,dataset_slug,dataset_name,dataset_country,dataset_category,dataset_license,dataset_provider,dataset_version,dataset_geo_type,dataset_table,associated_geography_table,associated_geography_slug
36,cdb_spatial_fea_94e6b1f,Spatial Features - United States of America (H...,United States of America,Derived,Public data,CARTO,v2,POLYGON,sub_carto_derived_spatialfeatures_usa_h3res8_v...,sub_carto_geography_usa_h3res8_v1,cdb_h3res8_67d48abb
42,cdb_zcta5_f4043497,5-digit Zip Code Tabulation Area - United Stat...,United States of America,Geography,Public data,CARTO,2015,MULTIPOLYGON,sub_carto_geography_usa_zcta5_2015,,
12,can_employment_ac5fa1db,Employment And Income - Canada (Census Division),Canada,Demographics,Public data,Statistics Canada,2016,MULTIPOLYGON,sub_can_statistics_demographics_employment_can...,sub_carto_geography_can_censusdivision_2016,cdb_census_divi_6e50dfae
81,osm_nodes_205a5f56,Nodes - Spain (Latitude/Longitude),Spain,Points of Interest,Public data,OpenStreetMap,v1,POINT,sub_openstreetmap_pointsofinterest_nodes_esp_l...,sub_openstreetmap_geography_esp_latlon_v1,ostm_lat_lon_326793ed
119,wp_population_6e3bd184,"Population Mosaics - Brazil (Grid 100m, 2020)",Brazil,Demographics,Public data,WorldPop,2020,POLYGON,sub_worldpop_demographics_population_bra_grid1...,sub_worldpop_geography_bra_grid100m_v1,wp_grid100m_74dd8a53


Let's take a look at what we have for the United States.

In [7]:
subs_df.query("dataset_country == 'United States of America'").sample(5)

Unnamed: 0,dataset_slug,dataset_name,dataset_country,dataset_category,dataset_license,dataset_provider,dataset_version,dataset_geo_type,dataset_table,associated_geography_table,associated_geography_slug
36,cdb_spatial_fea_94e6b1f,Spatial Features - United States of America (H...,United States of America,Derived,Public data,CARTO,v2,POLYGON,sub_carto_derived_spatialfeatures_usa_h3res8_v...,sub_carto_geography_usa_h3res8_v1,cdb_h3res8_67d48abb
37,cdb_spatial_fea_bd4173ae,Spatial Features - United States of America (Q...,United States of America,Derived,Public data,CARTO,v2,POLYGON,sub_carto_derived_spatialfeatures_usa_quadgrid...,sub_carto_geography_usa_quadgrid15_v1,cdb_quadgrid15_417f4a13
98,tigr_cbsa_5de70990,Core-based Statistical Area - United States of...,United States of America,Geography,Public data,Tiger/Line geographic data from the U.S. Censu...,2018,MULTIPOLYGON,sub_usa_tiger_geography_usa_cbsa_2018,,
103,tigr_zcta5_9942285,5-digit Zip Code Tabulation Area - United Stat...,United States of America,Geography,Public data,Tiger/Line geographic data from the U.S. Censu...,2019,MULTIPOLYGON,sub_usa_tiger_geography_usa_zcta5_2019,,
15,cdb_block_96b823a2,Census Block - United States of America (2019),United States of America,Geography,Public data,CARTO,2019,MULTIPOLYGON,sub_carto_geography_usa_block_2019,,


There is a population dataset for the USA provided by WorldPop that looks interesting (`wp_population_704f6b75`). We can take its `dataset_slug` and find out more about the variables it contains.

In [8]:
get_dataset_variables = \
f"""
CALL `carto-un`.carto.DATAOBS_SUBSCRIPTION_VARIABLES(
    "{subscription_id}",
    "dataset_slug = 'wp_population_704f6b75'"
    );
"""


In [9]:
vars_df = carto_dw_client.query(get_dataset_variables).result().to_dataframe()
vars_df

Unnamed: 0,variable_slug,variable_name,variable_description,variable_type,variable_aggregation,dataset_slug,associated_geography_slug
0,country_iso_a3_8cf77237,country_iso_a3,"Three-letter country code, following the ISO 3...",STRING,,wp_population_704f6b75,wp_grid1km_406ea53e
1,country_iso_eb2c4ed5,country_iso,"Country name, following the ISO 3166 standard",STRING,,wp_population_704f6b75,wp_grid1km_406ea53e
2,do_date_54c11595,do_date,First day of the year of the corresponding data,DATE,,wp_population_704f6b75,wp_grid1km_406ea53e
3,geoid_346e6a2e,geoid,Unique cell identifier,STRING,,wp_population_704f6b75,wp_grid1km_406ea53e
4,population_e3a78133,population,Population,FLOAT,SUM,wp_population_704f6b75,wp_grid1km_406ea53e
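
Only `population` defines an aggregation method (`SUM`); the other variables are identifiers or metadata. A minimal sketch of picking out the variables that can be aggregated is shown below. The function name `aggregable_variables` is hypothetical, and the two records are copied from the output above; the aggregation is lowercased only to match the spelling used later in this notebook.

```python
def aggregable_variables(records):
    """Return (variable_slug, aggregation) pairs for variables that define one."""
    pairs = []
    for row in records:
        agg = row.get("variable_aggregation")
        # Missing values may arrive as None, NaN, or "" depending on how the
        # dataframe was built, so only non-empty strings are accepted.
        if isinstance(agg, str) and agg:
            pairs.append((row["variable_slug"], agg.lower()))
    return pairs


# Two rows copied from the output above; with the real dataframe one could
# pass vars_df.to_dict("records") instead.
records = [
    {"variable_slug": "geoid_346e6a2e", "variable_aggregation": None},
    {"variable_slug": "population_e3a78133", "variable_aggregation": "SUM"},
]
print(aggregable_variables(records))  # [('population_e3a78133', 'sum')]
```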


## Exporting data from the Data Observatory

Now that we have had a look at the available variables, we can export some of the data.
Let's retrieve the `population` variable for a 10 km buffer around Atlanta.

We first need the ID of the table to run our query against, together with the ID of its geography table.

In [10]:
dataset_id, geography_id = subs_df.query("dataset_slug == 'wp_population_704f6b75'")[["dataset_table", "associated_geography_table"]].values.ravel()
dataset_id, geography_id

('sub_worldpop_demographics_population_usa_grid1km_v1_yearly_2020',
 'sub_worldpop_geography_usa_grid1km_v1')

In [11]:
usa_pop_q = \
f"""
WITH whole_usa AS (
SELECT population, geom
FROM `{subscription_id}.{dataset_id}` d
JOIN `{subscription_id}.{geography_id}` g
ON d.geoid = g.geoid
)
SELECT * FROM whole_usa 
WHERE ST_INTERSECTS(geom, ST_BUFFER(ST_GEOGPOINT(-84.387655, 33.760213), 10000))
"""

In [12]:
atlanta_df = carto_dw_client.query(usa_pop_q).result().to_dataframe()
atlanta_df.sample(5)

Unnamed: 0,population,geom
222,1475.661011,"POLYGON((-84.3679163142 33.69958339991, -84.35..."
352,810.963257,"POLYGON((-84.3429163143 33.71625006651, -84.33..."
67,1005.65387,"POLYGON((-84.4262496473 33.70791673321, -84.41..."
362,327.319244,"POLYGON((-84.4929163137 33.75791673301, -84.48..."
42,187.232971,"POLYGON((-84.4512496472 33.74125006641, -84.44..."
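
The query above hard-codes the point and the radius. In BigQuery, `ST_GEOGPOINT` takes longitude first and latitude second, and `ST_BUFFER` takes its distance in meters, so `10000` corresponds to the 10 km mentioned earlier. As a sketch, a small helper (the name `build_buffer_query` and its parameters are hypothetical) can make those values explicit and reusable for other locations:

```python
def build_buffer_query(source, dataset_table, geography_table, lon, lat, radius_m):
    """Return SQL selecting population cells within radius_m meters of a point."""
    # Converting to float ensures only numeric values end up in the SQL text.
    lon, lat, radius_m = float(lon), float(lat), float(radius_m)
    # ST_GEOGPOINT expects (longitude, latitude); ST_BUFFER expects meters.
    return f"""
WITH joined AS (
  SELECT d.population, g.geom
  FROM `{source}.{dataset_table}` d
  JOIN `{source}.{geography_table}` g
  ON d.geoid = g.geoid
)
SELECT * FROM joined
WHERE ST_INTERSECTS(geom, ST_BUFFER(ST_GEOGPOINT({lon}, {lat}), {radius_m}))
"""


# Same tables and location as in the cells above.
atlanta_q = build_buffer_query(
    "XXXXXXXX",
    "sub_worldpop_demographics_population_usa_grid1km_v1_yearly_2020",
    "sub_worldpop_geography_usa_grid1km_v1",
    lon=-84.387655,
    lat=33.760213,
    radius_m=10_000,
)
print(atlanta_q)
```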


Now that we have the data, we can save it to our local machine in several formats.

In [None]:
atlanta_df.to_csv("atlanta_df.csv")

In [None]:
atlanta_df.to_parquet("atlanta_df.parquet")

## Enrichment with the Data Observatory

`retail_stores` is a dataset with information about the revenue and size of retail stores in the USA, which can be found in the Data Observatory. We are going to enrich it with the population variable from the previous example (variable slug `population_e3a78133`).

We define an output table where the enriched data will be stored. Afterwards, we use pydeck-carto to visualize the results.

In [12]:
output_table_id = 'carto-dw-ac-7xhfwyml.shared.retail_stores_enriched'
enrich_q = \
f"""
CALL `carto-un`.carto.DATAOBS_ENRICH_POINTS(
   R'''
   SELECT cartodb_id, revenue, geom FROM `carto-demo-data.demo_tables.retail_stores`
   ''',
   'geom',
   [('population_e3a78133', 'sum')],
   NULL,
   ['{output_table_id}'],
   '{subscription_id}'
);
"""
carto_dw_client.delete_table(output_table_id, not_found_ok = True)
carto_dw_client.query(enrich_q).result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f06dae54110>
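
The call returns no rows; its result is the output table. The sketch below assembles the same statement from a list of `(variable_slug, aggregation)` pairs, keeping the argument order used in the cell above. Both helper names are hypothetical. The column naming (`<variable_slug>_<aggregation>`) is an assumption inferred from the `population_e3a78133_sum` column used in the visualization cell of this notebook.

```python
def build_enrich_points_call(input_query, geom_column, variables, output_table, source):
    """Return a DATAOBS_ENRICH_POINTS CALL with the argument order used above."""
    # Render the (slug, aggregation) pairs as a SQL array of structs.
    variables_sql = ", ".join(f"('{slug}', '{agg}')" for slug, agg in variables)
    return f"""
CALL `carto-un`.carto.DATAOBS_ENRICH_POINTS(
   R'''
   {input_query}
   ''',
   '{geom_column}',
   [{variables_sql}],
   NULL,
   ['{output_table}'],
   '{source}'
);
"""


def enriched_column_names(variables):
    """Assumed names of the columns added by the enrichment."""
    return [f"{slug}_{agg}" for slug, agg in variables]


variables = [("population_e3a78133", "sum")]
enrich_sql = build_enrich_points_call(
    "SELECT cartodb_id, revenue, geom FROM `carto-demo-data.demo_tables.retail_stores`",
    "geom",
    variables,
    "carto-dw-ac-7xhfwyml.shared.retail_stores_enriched",
    "XXXXXXXX",
)
print(enrich_sql)
print(enriched_column_names(variables))  # ['population_e3a78133_sum']
```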

In [None]:
# Register CartoLayer in pydeck
register_carto_layer()
credentials = get_layer_credentials(carto_auth)

enriched_layer = pdk.Layer(
    "CartoLayer",
    data=f"SELECT * FROM `{output_table_id}`",  # reuse the output table defined in the enrichment cell
    geo_column=pdk.types.String("geom"),
    type_=MapType.QUERY,
    connection=CartoConnection.CARTO_DW,
    credentials=credentials,
    opacity=0.2,
    stroked=True,
    point_radius_min_pixels=2,
    get_fill_color = color_continuous("population_e3a78133_sum", [x*100 for x in range(10)], colors = "Tropic"),
    on_data_error=get_error_notifier(),
    )

view_state = pdk.ViewState(latitude=33.64, longitude=-117.94, zoom=4)
r = pdk.Deck(
    [enriched_layer],
    initial_view_state=view_state,
    map_style=pdk.map_styles.LIGHT,
)
r.to_html(iframe_height = 700)
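
The breakpoints passed to `color_continuous` above, `[x*100 for x in range(10)]`, are ten evenly spaced values from 0 to 900. If the enriched population values fall in a different range, the breakpoints can be derived from a chosen minimum and maximum instead. This is a minimal sketch with a hypothetical helper name (`linear_breaks`); it assumes `color_continuous` accepts a plain list of numbers, as it does in the cell above.

```python
def linear_breaks(lo, hi, n):
    """Return n evenly spaced breakpoints from lo to hi, both included."""
    if n < 2:
        raise ValueError("at least two breakpoints are needed")
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


# Reproduces the breakpoints used in the layer definition above.
print(linear_breaks(0, 900, 10))

# A coarser scale over a wider range, for example 0 to 1500.
print(linear_breaks(0, 1500, 6))
```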