# Discovering content of interest in the Data Observatory

The Discovery API lets you explore the datasets available in our data lake. Its methods let you navigate the datasets and their properties, so you can tell in advance which sources may be of interest before requesting access to them.

## Catalog: the first step for discovery

The Catalog class is the starting point for discovery. For example, it gives you the complete list of categories covered by the available datasets.


### Get the list of categories

In [2]:
from cartoframes.data.observatory import Catalog
from cartoframes.data.observatory import Category

catalog = Catalog()
categories = catalog.categories

categories

[<Category('demographics')>,
 <Category('environmental')>,
 <Category('financial')>,
 <Category('housing')>,
 <Category('human_mobility')>,
 <Category('points_of_interest')>,
 <Category('road_traffic')>]

In [3]:
isinstance(categories[0], Category)

True

We can also obtain the categories as a pandas DataFrame:

In [4]:
categories_df = categories.to_dataframe()
categories_df

|   | id | name |
|---|----|------|
| 0 | demographics | Demographics |
| 1 | environmental | Environmental |
| 2 | financial | Financial |
| 3 | housing | Housing |
| 4 | human_mobility | Human Mobility |
| 5 | points_of_interest | Points of Interest |
| 6 | road_traffic | Road Traffic |


In [5]:
import pandas as pd

isinstance(categories_df, pd.DataFrame)

True

### Filter one category 

The list of categories can be indexed like a regular Python list, so we can select a single category by position:

In [6]:
filtered_category = categories[0]
filtered_category

<Category('demographics')>

In [7]:
isinstance(filtered_category, Category)

True

We can also obtain the category as a pandas Series:

In [8]:
category_series = filtered_category.to_series()
category_series


id      demographics
name    Demographics
dtype: object

In [9]:
import pandas as pd

isinstance(category_series, pd.Series)

True

## Explore a particular category

If we already know that a particular category is in the Catalog, we can retrieve it directly by its id.

In [23]:
category1 = catalog.categories.get('demographics')

category1

<Category('demographics')>

In [11]:
isinstance(category1, Category)

True

And we can access its different properties:

In [12]:
category1.name

'Demographics'

### Get the datasets for that category

Once we have a Category we can use the discovery methods to get the datasets related to that category.

In [13]:
demographics_datasets = category1.datasets
demographics_datasets

[<CatalogDataset('od_acs_181619a3')>,
 <CatalogDataset('od_acs_550657ce')>,
 <CatalogDataset('mbi_retail_turn_a40b2d9e')>,
 <CatalogDataset('od_acs_6e4b69f6')>,
 <CatalogDataset('mbi_purchasing__ff4ce909')>,
 <CatalogDataset('mbi_population_73b2034')>,
 <CatalogDataset('mbi_population_bf3a91db')>,
 <CatalogDataset('od_acs_9510981d')>,
 <CatalogDataset('od_acs_6d43ed82')>,
 <CatalogDataset('od_acs_194c5960')>,
 <CatalogDataset('mbi_population_552f1d19')>,
 <CatalogDataset('mbi_population_70c11317')>,
 <CatalogDataset('mbi_population_c207835a')>,
 <CatalogDataset('mbi_population_8b8bf92b')>,
 <CatalogDataset('mbi_education_3d6776b9')>,
 <CatalogDataset('mbi_retail_turn_36a9dbe1')>,
 <CatalogDataset('mbi_population_af1d83ba')>,
 <CatalogDataset('mbi_retail_cent_ac0457a5')>,
 <CatalogDataset('mbi_purchasing__c9645e12')>,
 <CatalogDataset('od_acs_38016c42')>,
 <CatalogDataset('od_acs_9a9c93b8')>,
 <CatalogDataset('od_acs_7b2649a9')>,
 <CatalogDataset('od_acs_478c37b8')>,
 <CatalogDataset('o

In [15]:
from cartoframes.data.observatory import CatalogDataset

isinstance(demographics_datasets[0], CatalogDataset)

True

Again, we can export the data as a pandas DataFrame:

In [16]:
demographics_datasets.to_dataframe().iloc[0]

category_id                                                   demographics
category_name                                                 Demographics
country_id                                                             usa
data_source_id                                                         acs
description                                                           None
geography_description                                                 None
geography_id             carto-do-public-data.tiger.geography_usa_pumac...
geography_name           Topologically Integrated Geographic Encoding a...
id                       carto-do-public-data.acs.demographics_acs_usa_...
is_public_data                                                        True
lang                                                                   eng
name                     American Community Survey (ACS) data at pumacl...
provider_id                                                      open_data
provider_name            

As with Category, a CatalogDataset exposes related properties:

In [17]:
d1 = demographics_datasets[0]

d1.country

'usa'

In [18]:
vars1 = d1.variables
vars1

[<Variable('households_publ_14c36c88','None')>,
 <Variable('median_income_c0ef31a5','Within a geographic area, the ...')>,
 <Variable('income_200000_o_c40255b2','The number of households in a ...')>,
 <Variable('households_reti_3d9952d0','None')>,
 <Variable('income_150000_1_28ac2dc2','The number of households in a ...')>,
 <Variable('income_125000_1_d5f3acbe','The number of households in a ...')>,
 <Variable('income_100000_1_35b5b31e','The number of households in a ...')>,
 <Variable('income_75000_99_e304f470','The number of households in a ...')>,
 <Variable('income_60000_74_89f98648','The number of households in a ...')>,
 <Variable('income_50000_59_98b16bf4','The number of households in a ...')>,
 <Variable('income_45000_49_827656c0','The number of households in a ...')>,
 <Variable('income_40000_44_96354559','The number of households in a ...')>,
 <Variable('income_35000_39_196d9f53','The number of households in a ...')>,
 <Variable('less_one_year_c_434945a4','The number of people

In [20]:
from cartoframes.data.observatory import Variable

isinstance(vars1[0], Variable)

True

If we already know which dataset we want to retrieve, we can do so through its id or its slug:

In [24]:
CatalogDataset.get('od_acs_181619a3').to_dict()

{'id': 'carto-do-public-data.acs.demographics_acs_usa_pumaclipped_2015_5yrs_20102014',
 'slug': 'od_acs_181619a3',
 'name': 'American Community Survey (ACS) data at pumaclipped level (2010 - 2014)',
 'description': None,
 'country_id': 'usa',
 'geography_id': 'carto-do-public-data.tiger.geography_usa_pumaclipped_2015',
 'geography_name': 'Topologically Integrated Geographic Encoding and Referencing (TIGER) boundaries data at pumaclipped level (2015)',
 'geography_description': None,
 'category_id': 'demographics',
 'category_name': 'Demographics',
 'provider_id': 'open_data',
 'provider_name': 'Open Data',
 'data_source_id': 'acs',
 'lang': 'eng',
 'temporal_aggregation': '5yrs',
 'time_coverage': '[2010-01-01,2014-01-01)',
 'update_frequency': None,
 'version': '20102014',
 'is_public_data': True}
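
The `time_coverage` value above looks like standard interval notation, where `[` marks an inclusive bound and `)` an exclusive one. The following is a minimal sketch of how such a string could be turned into dates, assuming that reading is correct; the source does not define the format, and `TimeCoverage` and `parse_time_coverage` are hypothetical helpers, not part of cartoframes.

```python
from datetime import date
from typing import NamedTuple


class TimeCoverage(NamedTuple):
    """Hypothetical container for a parsed time_coverage string."""
    start: date
    end: date
    start_inclusive: bool
    end_inclusive: bool

    def contains(self, day: date) -> bool:
        """Return True if `day` falls inside the interval."""
        after_start = day >= self.start if self.start_inclusive else day > self.start
        before_end = day <= self.end if self.end_inclusive else day < self.end
        return after_start and before_end


def parse_time_coverage(text: str) -> TimeCoverage:
    """Parse a string such as '[2010-01-01,2014-01-01)' (assumed format)."""
    text = text.strip()
    if len(text) < 2 or text[0] not in "[(" or text[-1] not in "])":
        raise ValueError(f"not an interval: {text!r}")
    start_text, end_text = text[1:-1].split(",")
    return TimeCoverage(
        start=date.fromisoformat(start_text.strip()),
        end=date.fromisoformat(end_text.strip()),
        start_inclusive=text[0] == "[",
        end_inclusive=text[-1] == "]",
    )


# Value copied from the to_dict() output above.
coverage = parse_time_coverage("[2010-01-01,2014-01-01)")
print(coverage.start, coverage.end)
print(coverage.contains(date(2013, 12, 31)))  # inside the interval
print(coverage.contains(date(2014, 1, 1)))    # excluded by the ')' bound
```

Keeping the inclusive and exclusive flags explicit avoids silently treating the end date as covered.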

And if we have a list of ids or slugs that we want to get from the catalog, we can obtain them in one call:

In [25]:
CatalogDataset.get_list(['od_acs_181619a3', 'od_acs_13345497'])

[<CatalogDataset('od_acs_181619a3')>, <CatalogDataset('od_acs_13345497')>]
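
To illustrate what "by id or by slug" means, here is a toy lookup in plain Python. It is only a sketch of the idea, not how cartoframes resolves datasets: `build_index`, `get`, and `get_list` are hypothetical functions, the first record reuses the id and slug shown above, and the second record is invented for the example.

```python
# First record: id and slug copied from the to_dict() output above.
# Second record: made up purely for illustration.
RECORDS = [
    {
        "id": "carto-do-public-data.acs.demographics_acs_usa_pumaclipped_2015_5yrs_20102014",
        "slug": "od_acs_181619a3",
    },
    {
        "id": "example-project.example.some_table",  # hypothetical id
        "slug": "ex_sometable_00000000",             # hypothetical slug
    },
]


def build_index(records):
    """Map every id and every slug to its record, so either key works."""
    index = {}
    for record in records:
        index[record["id"]] = record
        index[record["slug"]] = record
    return index


def get(index, id_or_slug):
    """Return one record; raises KeyError for an unknown key."""
    return index[id_or_slug]


def get_list(index, ids_or_slugs):
    """Return the records for several keys, preserving the input order."""
    return [index[key] for key in ids_or_slugs]


index = build_index(RECORDS)
by_slug = get(index, "od_acs_181619a3")
by_id = get(index, by_slug["id"])
print(by_slug is by_id)
print([r["slug"] for r in get_list(index, ["ex_sometable_00000000", "od_acs_181619a3"])])
```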

## Navigate through the catalog with filters

It is possible to navigate the catalog by adding nested filters that will be applied when requesting lists of entities.

For example, we can add a country filter before requesting the list of datasets and then we will only obtain the datasets from that country:

In [27]:
from cartoframes.data.observatory import Catalog

catalog = Catalog()

In [28]:
catalog.country('spain').datasets

Applied filters are saved in that catalog instance, so if we now add a new filter, the list of datasets is restricted by both filters:

In [29]:
catalog.category('demographics').datasets

We can also nest the filters in the same call with the same result:

In [30]:
catalog2 = Catalog()
catalog2.country('spain').category('demographics').datasets
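
The behaviour described above (filters stored on the instance, chainable calls, and a reset) can be modelled with a small toy class. This is a sketch under stated assumptions, not the cartoframes implementation: `ToyCatalog` and the rows in `ROWS` are invented for illustration and no request is made to the real catalog.

```python
class ToyCatalog:
    """Hypothetical stand-in that mimics filter accumulation on an instance."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._filters = {}

    def country(self, country_id):
        self._filters["country_id"] = country_id
        return self  # returning self is what allows chained calls

    def category(self, category_id):
        self._filters["category_id"] = category_id
        return self

    def clear_filters(self):
        """Forget every filter applied so far."""
        self._filters = {}

    @property
    def datasets(self):
        """Slugs of the rows that match all currently stored filters."""
        return [
            row["slug"]
            for row in self._rows
            if all(row[key] == value for key, value in self._filters.items())
        ]


# Made-up rows, for illustration only.
ROWS = [
    {"slug": "a", "country_id": "spain", "category_id": "demographics"},
    {"slug": "b", "country_id": "spain", "category_id": "housing"},
    {"slug": "c", "country_id": "usa", "category_id": "demographics"},
]

toy = ToyCatalog(ROWS)
print(toy.country("spain").datasets)          # ['a', 'b']
print(toy.category("demographics").datasets)  # ['a'] (country filter still active)
toy.clear_filters()
print(toy.datasets)                           # ['a', 'b', 'c']
```

Because the filters live on the instance, reusing one catalog object across unrelated queries can give surprising results unless the filters are cleared first or a fresh instance is created.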

To remove all past filters, we simply call the method to clear them:

In [31]:
catalog.clear_filters()
catalog.datasets

[<CatalogDataset('pb_pointsofint_ce35b8b8')>,
 <CatalogDataset('pb_pointsofint_2cf3538b')>,
 <CatalogDataset('od_acs_181619a3')>,
 <CatalogDataset('pb_pointsofint_66b9e4e7')>,
 <CatalogDataset('pb_pointsofint_8345a948')>,
 <CatalogDataset('pb_pointsofint_17af1fae')>,
 <CatalogDataset('pb_pointsofint_624640f3')>,
 <CatalogDataset('pb_pointsofint_23ff89c4')>,
 <CatalogDataset('pb_pointsofint_72337606')>,
 <CatalogDataset('od_acs_550657ce')>,
 <CatalogDataset('tt_trafficdens_c451dfb9')>,
 <CatalogDataset('mbi_retail_turn_a40b2d9e')>,
 <CatalogDataset('pb_pointsofint_2a9f04b8')>,
 <CatalogDataset('pb_pointsofint_dcb3020c')>,
 <CatalogDataset('pb_pointsofint_28463f13')>,
 <CatalogDataset('od_acs_6e4b69f6')>,
 <CatalogDataset('pb_pointsofint_50c122ed')>,
 <CatalogDataset('mbi_purchasing__ff4ce909')>,
 <CatalogDataset('mbi_population_73b2034')>,
 <CatalogDataset('mbi_population_bf3a91db')>,
 <CatalogDataset('pb_pointsofint_fd812dbc')>,
 <CatalogDataset('pb_pointsofint_c6e8d9ca')>,
 <CatalogDa

Nested filters can also be applied when requesting countries, categories, or geographies:

In [32]:
catalog.clear_filters()
catalog.category('demographics').countries

[<Country('gtm')>,
 <Country('tjk')>,
 <Country('tca')>,
 <Country('prt')>,
 <Country('gab')>,
 <Country('bwa')>,
 <Country('aia')>,
 <Country('png')>,
 <Country('ury')>,
 <Country('nga')>,
 <Country('twn')>,
 <Country('nzl')>,
 <Country('svn')>,
 <Country('swe')>,
 <Country('qat')>,
 <Country('ago')>,
 <Country('ven')>,
 <Country('khm')>,
 <Country('swz')>,
 <Country('omn')>,
 <Country('tza')>,
 <Country('uga')>,
 <Country('mrt')>,
 <Country('gnq')>,
 <Country('mkd')>,
 <Country('bel')>,
 <Country('aze')>,
 <Country('ecu')>,
 <Country('pyf')>,
 <Country('msr')>,
 <Country('vir')>,
 <Country('irn')>,
 <Country('pak')>,
 <Country('jpn')>,
 <Country('phl')>,
 <Country('guy')>,
 <Country('tto')>,
 <Country('jam')>,
 <Country('ner')>,
 <Country('fji')>,
 <Country('srb')>,
 <Country('bmu')>,
 <Country('cpv')>,
 <Country('mmr')>,
 <Country('pri')>,
 <Country('fro')>,
 <Country('jor')>,
 <Country('uzb')>,
 <Country('rus')>,
 <Country('mar')>,
 <Country('dnk')>,
 <Country('mli')>,
 <Country('u

In [33]:
catalog.clear_filters()
catalog.country('spain').categories

In [34]:
catalog.clear_filters()
catalog.country('usa').category('demographics').geographies

[<Geography('mbi_counties_141b61cd')>,
 <Geography('od_placeclippe_48a89947')>,
 <Geography('mbi_blockgroups_1ab060a')>,
 <Geography('od_pumaclipped_b065909')>,
 <Geography('ags_blockgroup_1c63771c')>,
 <Geography('od_schooldistr_75493a16')>,
 <Geography('od_stateclippe_8d79f5be')>,
 <Geography('od_cbsaclipped_b6a32adc')>,
 <Geography('od_schooldistr_6d5c417f')>,
 <Geography('mbi_pc_5_digit_4b1682a6')>,
 <Geography('od_countyclipp_caef1ec9')>,
 <Geography('od_censustract_5962fe30')>,
 <Geography('od_congression_6774ebb')>,
 <Geography('od_blockclippe_9c508438')>,
 <Geography('usct_censustract_784cc2ed')>,
 <Geography('od_zcta5clippe_6b6ff33c')>,
 <Geography('ags_q17_4739be4f')>,
 <Geography('od_blockgroupc_3ab29c84')>,
 <Geography('mbi_county_subd_e8e6ea23')>,
 <Geography('od_schooldistr_f70c7e28')>]

And we can use all the filters at the same time:

In [35]:
catalog.clear_filters()
catalog.country('usa').category('demographics').geography('ags_blockgroup_1c63771c').datasets

[<CatalogDataset('ags_sociodemogr_e92b1637')>,
 <CatalogDataset('ags_consumerspe_fe5d060a')>,
 <CatalogDataset('ags_retailpoten_ddf56a1a')>,
 <CatalogDataset('ags_consumerpro_e8344e2e')>,
 <CatalogDataset('ags_businesscou_a8310a11')>,
 <CatalogDataset('ags_crimerisk_9ec89442')>]

In [36]:
Catalog().country('usa').category('demographics').geography('od_countyclipp_caef1ec9').datasets

[<CatalogDataset('od_acs_550657ce')>,
 <CatalogDataset('od_acs_1f614ee8')>,
 <CatalogDataset('od_acs_c1c86582')>,
 <CatalogDataset('od_acs_5b8fdefd')>,
 <CatalogDataset('od_acs_c2868f47')>,
 <CatalogDataset('od_acs_b581bfd1')>,
 <CatalogDataset('od_acs_5c10acf4')>,
 <CatalogDataset('od_acs_c5eb4b5e')>,
 <CatalogDataset('od_tiger_66b9092c')>]
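
The slugs in the two listings above appear to start with a short prefix (`ags`, `od`). That is an observation from these outputs, not a documented guarantee, so the sketch below should be read as a quick way to summarise a result list rather than a reliable way to identify providers. `slug_prefix` is a hypothetical helper.

```python
from collections import Counter

# Slugs copied from the two outputs above.
SLUGS = [
    "ags_sociodemogr_e92b1637",
    "ags_consumerspe_fe5d060a",
    "ags_retailpoten_ddf56a1a",
    "ags_consumerpro_e8344e2e",
    "ags_businesscou_a8310a11",
    "ags_crimerisk_9ec89442",
    "od_acs_550657ce",
    "od_acs_1f614ee8",
    "od_acs_c1c86582",
    "od_acs_5b8fdefd",
    "od_acs_c2868f47",
    "od_acs_b581bfd1",
    "od_acs_5c10acf4",
    "od_acs_c5eb4b5e",
    "od_tiger_66b9092c",
]


def slug_prefix(slug):
    """Return the text before the first underscore (assumed to be a prefix)."""
    return slug.split("_", 1)[0]


counts = Counter(slug_prefix(slug) for slug in SLUGS)
print(sorted(counts.items()))  # [('ags', 6), ('od', 9)]
```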