# Discovering content of interest in the Data Observatory

The Discovery API is a tool for exploring the datasets available in our data lake. Its methods let you navigate the datasets and their properties, so you know in advance which sources may be of interest before even requesting access to them.

## Catalog: the first step for discovery

The Catalog class is the starting point for discovery. For example, it lets you get the complete list of categories covering the available datasets.


### Get the list of categories

In [1]:
from cartoframes.data.observatory.catalog import Catalog
from cartoframes.data.observatory.category import Category

catalog = Catalog()
categories = catalog.categories

categories

[Category(demographics),
 Category(environmental),
 Category(financial),
 Category(housing),
 Category(human_mobility),
 Category(points_of_interest),
 Category(road_traffic)]

In [2]:
isinstance(categories[0], Category)

True

We can also obtain the categories as a pandas DataFrame:

In [3]:
categories_df = categories.to_dataframe()
categories_df

                   id                name
0        demographics        Demographics
1       environmental       Environmental
2           financial           Financial
3             housing             Housing
4      human_mobility      Human Mobility
5  points_of_interest  Points of Interest
6        road_traffic        Road Traffic


In [4]:
import pandas as pd

isinstance(categories_df, pd.DataFrame)

True

### Filter one category 

The list of categories can be indexed like a regular Python list, so we can pick a single category by its position.

In [5]:
filtered_category = categories[0]
filtered_category

Category(demographics)

In [6]:
isinstance(filtered_category, Category)

True
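
Indexing by position only helps when the order is known. The following is a minimal sketch of selecting items by attribute instead. It uses a hypothetical stand-in, `FakeCategory`, so that it runs without access to the Data Observatory, and it assumes only that each item exposes the `id` and `name` fields listed in the `to_dataframe()` output above. The helpers `find_by_id` and `ids_containing` are illustrative names, not part of cartoframes.

```python
from collections import namedtuple

# Hypothetical stand-in for Category: only the two fields shown by to_dataframe().
FakeCategory = namedtuple("FakeCategory", ["id", "name"])

sample_categories = [
    FakeCategory("demographics", "Demographics"),
    FakeCategory("environmental", "Environmental"),
    FakeCategory("human_mobility", "Human Mobility"),
]


def find_by_id(items, wanted_id):
    """Return the first item whose id matches, or None if there is none."""
    return next((item for item in items if item.id == wanted_id), None)


def ids_containing(items, text):
    """Return the ids of items whose name contains `text`, ignoring case."""
    needle = text.lower()
    return [item.id for item in items if needle in item.name.lower()]


selected = find_by_id(sample_categories, "human_mobility")
matches = ids_containing(sample_categories, "MENT")
```

If the real list behaves like a Python list, as the indexing above suggests, the same generator expression and list comprehension should apply to it unchanged.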

We can also obtain the category as a pandas Series:

In [7]:
category_series = filtered_category.to_series()
category_series


id      demographics
name    Demographics
dtype: object

In [8]:
import pandas as pd

isinstance(category_series, pd.Series)

True

## Explore a particular category

If we already know that a particular category is present in the Catalog, we can retrieve it directly by its id.

In [9]:
category1 = catalog.categories.get('demographics')

category1

Category(demographics)

In [10]:
isinstance(category1, Category)

True

And we can access its different properties:

In [11]:
category1.name

'Demographics'

### Get the datasets for that category

Once we have a Category, we can use its discovery methods to get the datasets related to it.

In [12]:
demographics_datasets = category1.datasets
demographics_datasets

[Dataset(carto-do-public-data.acsquantiles.demographics_acsquantiles_usa_schooldistrictunifiedclipped_2015_5yrs_20062010),
 Dataset(carto-do-public-data.acs.demographics_acs_usa_stateclipped_2015_5yrs_20062010),
 Dataset(carto-do-public-data.acs.demographics_acs_usa_placeclipped_2015_yearly_2010),
 Dataset(carto-do-public-data.acs.demographics_acs_usa_countyclipped_2015_yearly_2015),
 Dataset(carto-do-public-data.acsquantiles.demographics_acsquantiles_usa_stateclipped_2015_5yrs_20102014),
 Dataset(carto-do-public-data.acsquantiles.demographics_acsquantiles_usa_schooldistrictsecondaryclipped_2015_5yrs_20062010),
 Dataset(carto-do-public-data.acsquantiles.demographics_acsquantiles_usa_cbsaclipped_2015_5yrs_20062010),
 Dataset(carto-do-public-data.acs.demographics_acs_usa_placeclipped_2015_5yrs_20132017),
 Dataset(carto-do-public-data.acs.demographics_acs_usa_schooldistrictsecondaryclipped_2015_5yrs_20102014),
 Dataset(carto-do-public-data.acs.demographics_acs_usa_countyclipped_2015_5yrs_

In [13]:
from cartoframes.data.observatory.dataset import Dataset

isinstance(demographics_datasets[0], Dataset)

True

Again, we can export the data as a pandas DataFrame:

In [14]:
demographics_datasets.to_dataframe().iloc[0]

id                      carto-do-public-data.acsquantiles.demographics...
name                    American Community Survey (ACS) quantiles data...
description                                                          None
provider_id                                                     open_data
category_id                                                  demographics
data_source_id                                               acsquantiles
country_iso_code3                                                     usa
language_iso_code3                                                   None
geography_id            carto-do-public-data.tiger.geography_usa_schoo...
temporal_aggregation                                                 5yrs
time_coverage                                     [2006-01-01,2010-01-01)
update_frequency                                                     None
version                                                          20062010
is_public_data                        
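
The dataset ids in the listing above appear to follow a dotted, three-part layout, and the last two underscore-separated tokens of the table name seem to match the `temporal_aggregation` and `version` fields of the Series. The sketch below splits an id along those lines. The layout is inferred from this sample output only, not from documented behaviour, and `split_dataset_id` and `trailing_fields` are hypothetical helper names.

```python
def split_dataset_id(dataset_id):
    """Split a dotted dataset id into (project, data_source, table).

    Assumes exactly the three-part layout seen in the sample output;
    raises ValueError if the id has a different number of parts.
    """
    project, data_source, table = dataset_id.split(".")
    return project, data_source, table


def trailing_fields(table):
    """Return the last two underscore-separated tokens of a table name."""
    _, temporal_aggregation, version = table.rsplit("_", 2)
    return temporal_aggregation, version


# Id of the first dataset in the listing above.
sample_id = (
    "carto-do-public-data.acsquantiles."
    "demographics_acsquantiles_usa_schooldistrictunifiedclipped_2015_5yrs_20062010"
)

project, data_source, table = split_dataset_id(sample_id)
temporal_aggregation, version = trailing_fields(table)
```

Splitting from the right avoids guessing where the category ends, since category ids such as `human_mobility` contain underscores themselves. When the Series is available, reading `data_source_id`, `temporal_aggregation` and `version` from it directly is the safer option.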

As with Category, a Dataset can be used to extract related properties:

In [15]:
d1 = demographics_datasets[0]

d1.country

'usa'

In [16]:
vars1 = d1.variables
vars1

[Variable(carto-do-public-data.acsquantiles.demographics_acsquantiles_usa_schooldistrictunifiedclipped_2015_5yrs_20062010.male_55_to_59_quantile),
 Variable(carto-do-public-data.acsquantiles.demographics_acsquantiles_usa_schooldistrictunifiedclipped_2015_5yrs_20062010.white_male_55_64_quantile),
 Variable(carto-do-public-data.acsquantiles.demographics_acsquantiles_usa_schooldistrictunifiedclipped_2015_5yrs_20062010.male_10_to_14_quantile),
 Variable(carto-do-public-data.acsquantiles.demographics_acsquantiles_usa_schooldistrictunifiedclipped_2015_5yrs_20062010.white_male_45_54_quantile),
 Variable(carto-do-public-data.acsquantiles.demographics_acsquantiles_usa_schooldistrictunifiedclipped_2015_5yrs_20062010.female_pop_quantile),
 Variable(carto-do-public-data.acsquantiles.demographics_acsquantiles_usa_schooldistrictunifiedclipped_2015_5yrs_20062010.male_5_to_9_quantile),
 Variable(carto-do-public-data.acsquantiles.demographics_acsquantiles_usa_schooldistrictunifiedclipped_2015_5yrs_2006

In [17]:
from cartoframes.data.observatory.variable import Variable

isinstance(vars1[0], Variable)

True
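
Each variable id in the listing above looks like the dataset id followed by a dot and a column name. Assuming that pattern holds (it is inferred from the output, not confirmed), the short column name can be recovered with plain string handling, as in this sketch. `short_name` is a hypothetical helper.

```python
def short_name(variable_id):
    """Return the text after the last dot of a variable id."""
    return variable_id.rpartition(".")[2]


# Id of the first variable in the listing above.
variable_id = (
    "carto-do-public-data.acsquantiles."
    "demographics_acsquantiles_usa_schooldistrictunifiedclipped_2015_5yrs_20062010."
    "male_55_to_59_quantile"
)

# rpartition splits on the last dot only, so the dots inside the dataset id survive.
dataset_id, _, column_name = variable_id.rpartition(".")
```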

## Filter the Catalog by country and category

In [3]:
from cartoframes.data.observatory.catalog import Catalog

catalog = Catalog()

In [None]:
len(catalog.country('spain').datasets)

In [2]:
len(catalog.category('demographics').datasets)

5

In [3]:
catalog.clear_filters()
len(catalog.category('demographics').datasets)

273

In [4]:
catalog.clear_filters()
len(catalog.country('spain').category('demographics').datasets)

5

In [5]:
catalog.clear_filters()
len(catalog.country('usa').category('demographics').datasets)

268
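
The counts above suggest that filters stay active on the `catalog` object until `clear_filters()` is called: the demographics query returns 5 while a `spain` filter is still applied and 273 once it is cleared, and 5 + 268 = 273. The toy model below reproduces that behaviour. It is a sketch of one way such chainable, persistent filters could work, not the cartoframes implementation; `ToyCatalog` and its sample rows are hypothetical, and only the counts 5 and 268 are taken from the notebook.

```python
class ToyCatalog:
    """Toy catalog whose filters persist until clear_filters() is called."""

    def __init__(self, rows):
        self._rows = rows      # list of dicts with 'country' and 'category' keys
        self._filters = {}     # active filters, e.g. {'country': 'spain'}

    def country(self, country_id):
        self._filters["country"] = country_id
        return self            # returning self is what makes chaining possible

    def category(self, category_id):
        self._filters["category"] = category_id
        return self

    def clear_filters(self):
        self._filters = {}

    @property
    def datasets(self):
        # Keep only the rows that satisfy every active filter.
        return [
            row for row in self._rows
            if all(row[key] == value for key, value in self._filters.items())
        ]


rows = (
    [{"country": "spain", "category": "demographics"}] * 5
    + [{"country": "usa", "category": "demographics"}] * 268
    + [{"country": "usa", "category": "financial"}] * 2   # arbitrary filler rows
)

toy = ToyCatalog(rows)
toy.country("spain")                                     # filter set and left in place
sticky = len(toy.category("demographics").datasets)      # country filter still active

toy.clear_filters()
total = len(toy.category("demographics").datasets)

toy.clear_filters()
usa = len(toy.country("usa").category("demographics").datasets)
```

Under this model, calling `clear_filters()` before each new query, as the cells above do, is what keeps an earlier filter from silently narrowing a later result.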

In [2]:
catalog.category('demographics').countries

[Country(usa), Country(spain)]

In [4]:
catalog.country('spain').categories

CartoException: ['column "country_id" does not exist']