## Data discovery

### Introduction

The Data Observatory is a spatial data repository that enables data scientists to augment their data and broaden their analysis. It offers a wide range of datasets from around the globe.

This guide is intended for those who want to start augmenting their own data using CARTOframes and wish to explore CARTO's public Data Observatory catalog to find datasets that best fit their use cases and analyses.

**Note: The catalog is public, and you don't need a CARTO account to search for available datasets.**

### Find demographic data for the US

In this guide we walk through the Data Observatory catalog looking for demographics data in the US.

The catalog comprises thousands of curated spatial datasets, so the easiest way to find what you are looking for is a faceted search. A faceted (or hierarchical) search lets you narrow down the results by applying multiple filters based on a faceted classification of the catalog datasets.

Datasets are organized in three main hierarchies:

- Country
- Category
- Geography (or spatial resolution)
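
To make the idea of a faceted search concrete, here is a minimal sketch that does not use CARTOframes at all. It filters a small in-memory list by successive facets and lists which values remain available. The entries, the `facet_filter` and `facet_values` helpers, and every slug are hypothetical and only illustrate the mechanism.

```python
# Hypothetical catalog entries: every slug and facet value below is made up
# for illustration and does not come from the real catalog.
ENTRIES = [
    {"slug": "example_population", "country": "usa", "category": "demographics", "geography": "blockgroup"},
    {"slug": "example_income", "country": "usa", "category": "demographics", "geography": "county"},
    {"slug": "example_card_spend", "country": "usa", "category": "financial", "geography": "blockgroup"},
    {"slug": "example_poblacion", "country": "esp", "category": "demographics", "geography": "municipality"},
]


def facet_filter(entries, **facets):
    """Keep the entries whose fields match every requested facet value."""
    return [e for e in entries if all(e.get(k) == v for k, v in facets.items())]


def facet_values(entries, facet):
    """List the distinct values still available for a facet."""
    return sorted({e[facet] for e in entries})


# Narrow down step by step: first by country, then by category.
usa = facet_filter(ENTRIES, country="usa")
usa_demographics = facet_filter(usa, category="demographics")

print(facet_values(usa, "category"))
print(facet_values(usa_demographics, "geography"))
```

Each filter reduces the candidate set, and listing the remaining values of another facet tells you which filters are still meaningful to apply.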

For our analysis we are looking for demographic datasets in the US with a spatial resolution at the block group level.

We can start by discovering which available geographies (or spatial resolutions) we have for demographic data in the US, by filtering the `catalog` by `country` and `category` and listing the available `geographies`.

Let's start exploring the available categories of data for the US:

In [1]:
from cartoframes.data.observatory import Catalog
Catalog().country('usa').categories

[<Category.get('covid19')>,
 <Category.get('demographics')>,
 <Category.get('environmental')>,
 <Category.get('financial')>,
 <Category.get('geosocial')>,
 <Category.get('housing')>,
 <Category.get('human_mobility')>,
 <Category.get('points_of_interest')>,
 <Category.get('road_traffic')>]

For the US, the Data Observatory provides nine different categories of datasets. Let's discover the available spatial resolutions for the demographics category (which, at first sight, should contain the population data we need).

In [2]:
from cartoframes.data.observatory import Catalog
geographies = Catalog().country('usa').category('demographics').geographies
geographies

[<Geography.get('mbi_blockgroups_535aed6d')>,
 <Geography.get('mbi_counties_46ea8aaa')>,
 <Geography.get('mbi_county_subd_ba170144')>,
 <Geography.get('mbi_pc_5_digit_19e769c1')>,
 <Geography.get('cdb_blockgroup_7753dd51')>,
 <Geography.get('cdb_cbsa_d1b91d3b')>,
 <Geography.get('cdb_censustract_af861cba')>,
 <Geography.get('cdb_congression_478295fd')>,
 <Geography.get('cdb_county_767e79f0')>,
 <Geography.get('cdb_county_8cf054d')>,
 <Geography.get('cdb_place_93d54d1e')>,
 <Geography.get('cdb_puma_56bbc2e')>,
 <Geography.get('cdb_schooldistr_eb48e7bc')>,
 <Geography.get('cdb_schooldistr_18547e3f')>,
 <Geography.get('cdb_schooldistr_d9ca1a26')>,
 <Geography.get('cdb_state_cd83b434')>,
 <Geography.get('cdb_zcta5_f4043497')>]

Let's filter the geographies to those that contain information at the block group level. To do so, we convert the geographies to a pandas `DataFrame` and search for the string `blockgroup` in the `id` of each geography:

In [3]:
df = geographies.to_dataframe()
df[df['id'].str.contains('blockgroup', case=False, na=False)]

Unnamed: 0,slug,name,description,country_id,provider_id,geom_type,geom_coverage,update_frequency,is_public_data,lang,version,provider_name,id
0,mbi_blockgroups_535aed6d,Blockgroups - United States of America,MBI Digital Boundaries for USA at Blockgroups ...,usa,mbi,MULTIPOLYGON,,yearly,False,eng,2020,Michael Bauer International,carto-do.mbi.geography_usa_blockgroups_2020
4,cdb_blockgroup_7753dd51,Census Block Group - United States of America,Shoreline clipped TIGER/Line boundaries. More ...,usa,carto,MULTIPOLYGON,,,True,eng,2015,CARTO,carto-do-public-data.carto.geography_usa_block...


We have two block group geographies available, from two different providers: Michael Bauer International (MBI) and CARTO. For this example, we are going to look for demographic datasets for the MBI blockgroups geography `mbi_blockgroups_535aed6d`:

In [4]:
datasets = Catalog().country('usa').category('demographics').geography('mbi_blockgroups_535aed6d').datasets
datasets

[<Dataset.get('mbi_consumer_sp_fdc16f97')>,
 <Dataset.get('mbi_households__ec03bf40')>,
 <Dataset.get('mbi_sociodemogr_1c54ac66')>,
 <Dataset.get('mbi_purchasing__faaee3c9')>,
 <Dataset.get('mbi_population_9d1b276f')>,
 <Dataset.get('mbi_retail_spen_6a1acff4')>,
 <Dataset.get('mbi_consumer_pr_c1d4e20e')>,
 <Dataset.get('mbi_households__60466314')>,
 <Dataset.get('mbi_education_8903fc2c')>]

Let's continue with the data discovery. We have nine datasets in the US with demographics information at the level of MBI blockgroups:

In [5]:
datasets.to_dataframe()

Unnamed: 0,slug,name,description,category_id,country_id,data_source_id,provider_id,geography_name,geography_description,temporal_aggregation,time_coverage,update_frequency,is_public_data,lang,version,category_name,provider_name,geography_id,id
0,mbi_consumer_sp_fdc16f97,Consumer Spending - United States of America (...,MBI Consumer Spending by product groups quanti...,demographics,usa,consumer_spending,mbi,Blockgroups - United States of America,MBI Digital Boundaries for USA at Blockgroups ...,yearly,"[2019-01-01, 2020-01-01)",yearly,False,eng,2020,Demographics,Michael Bauer International,carto-do.mbi.geography_usa_blockgroups_2020,carto-do.mbi.demographics_consumerspending_usa...
1,mbi_households__ec03bf40,Households By Type - United States of America ...,Distribution of the households in an area by t...,demographics,usa,households_by_type,mbi,Blockgroups - United States of America,MBI Digital Boundaries for USA at Blockgroups ...,yearly,"[2019-01-01, 2020-01-01)",yearly,False,eng,2020,Demographics,Michael Bauer International,carto-do.mbi.geography_usa_blockgroups_2020,carto-do.mbi.demographics_householdsbytype_usa...
2,mbi_sociodemogr_1c54ac66,Sociodemographics - United States of America (...,MBI Sociodemographics includes:\n- Population\...,demographics,usa,sociodemographics,mbi,Blockgroups - United States of America,MBI Digital Boundaries for USA at Blockgroups ...,yearly,"[2019-01-01, 2020-01-01)",yearly,False,eng,2020,Demographics,Michael Bauer International,carto-do.mbi.geography_usa_blockgroups_2020,carto-do.mbi.demographics_sociodemographics_us...
3,mbi_purchasing__faaee3c9,Purchasing Power - United States of America (B...,Purchasing Power describes the disposable inco...,demographics,usa,purchasing_power,mbi,Blockgroups - United States of America,MBI Digital Boundaries for USA at Blockgroups ...,yearly,"[2019-01-01, 2020-01-01)",yearly,False,eng,2020,Demographics,Michael Bauer International,carto-do.mbi.geography_usa_blockgroups_2020,carto-do.mbi.demographics_purchasingpower_usa_...
4,mbi_population_9d1b276f,Population - United States of America (Blockgr...,Population figures are shown as projected aver...,demographics,usa,population,mbi,Blockgroups - United States of America,MBI Digital Boundaries for USA at Blockgroups ...,yearly,"[2019-01-01, 2020-01-01)",yearly,False,eng,2020,Demographics,Michael Bauer International,carto-do.mbi.geography_usa_blockgroups_2020,carto-do.mbi.demographics_population_usa_block...
5,mbi_retail_spen_6a1acff4,Retail Spending - United States of America (Bl...,Retail Spending relates to the proportion of P...,demographics,usa,retail_spending,mbi,Blockgroups - United States of America,MBI Digital Boundaries for USA at Blockgroups ...,yearly,"[2019-01-01, 2020-01-01)",yearly,False,eng,2020,Demographics,Michael Bauer International,carto-do.mbi.geography_usa_blockgroups_2020,carto-do.mbi.demographics_retailspending_usa_b...
6,mbi_consumer_pr_c1d4e20e,Consumer Profiles - United States of America (...,The MB International Consumer Styles describe ...,demographics,usa,consumer_profiles,mbi,Blockgroups - United States of America,MBI Digital Boundaries for USA at Blockgroups ...,yearly,"[2019-01-01, 2020-01-01)",yearly,False,eng,2020,Demographics,Michael Bauer International,carto-do.mbi.geography_usa_blockgroups_2020,carto-do.mbi.demographics_consumerprofiles_usa...
7,mbi_households__60466314,Households By Income Quintiles - United States...,On the national level the number of households...,demographics,usa,households_by_income_quintiles,mbi,Blockgroups - United States of America,MBI Digital Boundaries for USA at Blockgroups ...,yearly,"[2019-01-01, 2020-01-01)",yearly,False,eng,2020,Demographics,Michael Bauer International,carto-do.mbi.geography_usa_blockgroups_2020,carto-do.mbi.demographics_householdsbyincomequ...
8,mbi_education_8903fc2c,Education - United States of America (Blockgro...,Distribution of the population in an area by t...,demographics,usa,education,mbi,Blockgroups - United States of America,MBI Digital Boundaries for USA at Blockgroups ...,yearly,"[2019-01-01, 2020-01-01)",yearly,False,eng,2020,Demographics,Michael Bauer International,carto-do.mbi.geography_usa_blockgroups_2020,carto-do.mbi.demographics_education_usa_blockg...


They cover different kinds of information: consumer spending, retail spending, consumer profiles, etc.

At first sight, it looks like a dataset with `data_source_id: sociodemographics` might contain the population information we are looking for. Let's get a better understanding of what such a dataset contains by looking at the variables of the AGS sociodemographics dataset `ags_sociodemogr_a7e14220`:

In [6]:
from cartoframes.data.observatory import Dataset
dataset = Dataset.get('ags_sociodemogr_a7e14220')
variables = dataset.variables
variables

[<Variable.get('BLOCKGROUP_30e525a6')> #'Geographic Identifier',
 <Variable.get('POPCY_4534fac4')> #'Population (2019A)',
 <Variable.get('POPCYGRP_3033ef2e')> #'Population in Group Quarters (2019A)',
 <Variable.get('POPCYGRPI_1e42899')> #'Institutional Group Quarters Population (2019A)',
 <Variable.get('AGECY0004_aaae373a')> #'Population age 0-4 (2019A)',
 <Variable.get('AGECY0509_d2d4896c')> #'Population age 5-9 (2019A)',
 <Variable.get('AGECY1014_b09611e')> #'Population age 10-14 (2019A)',
 <Variable.get('AGECY1519_7373df48')> #'Population age 15-19 (2019A)',
 <Variable.get('AGECY2024_32919d33')> #'Population age 20-24 (2019A)',
 <Variable.get('AGECY2529_4aeb2365')> #'Population age 25-29 (2019A)',
 <Variable.get('AGECY3034_9336cb17')> #'Population age 30-34 (2019A)',
 <Variable.get('AGECY3539_eb4c7541')> #'Population age 35-39 (2019A)',
 <Variable.get('AGECY4044_41a06569')> #'Population age 40-44 (2019A)',
 <Variable.get('AGECY4549_39dadb3f')> #'Population age 45-49 (2019A)',
 ...

In [7]:
from cartoframes.data.observatory import Dataset
vdf = variables.to_dataframe()
vdf

Unnamed: 0,slug,name,description,db_type,agg_method,column_name,variable_group_id,dataset_id,id
0,BLOCKGROUP_30e525a6,BLOCKGROUP,Geographic Identifier,STRING,,BLOCKGROUP,,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
1,POPCY_4534fac4,Total Population,Population (2019A),INTEGER,SUM,POPCY,,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
2,POPCYGRP_3033ef2e,POPCYGRP,Population in Group Quarters (2019A),INTEGER,SUM,POPCYGRP,,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
3,POPCYGRPI_1e42899,POPCYGRPI,Institutional Group Quarters Population (2019A),INTEGER,SUM,POPCYGRPI,,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
4,AGECY0004_aaae373a,AGECY0004,Population age 0-4 (2019A),INTEGER,SUM,AGECY0004,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
...,...,...,...,...,...,...,...,...,...
103,DWLPY_312aaf70,Number of household units,Housing units (2024A),INTEGER,SUM,DWLPY,,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
104,AGEPYMED_d5583bbb,AGEPYMED,Median Age (2024A),FLOAT,AVG,AGEPYMED,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
105,INCPYPCAP_f9c107fa,INCPYPCAP,Per capita income (2024A),FLOAT,AVG,INCPYPCAP,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
106,INCPYAVEHH_48c1d530,INCPYAVEHH,Average household Income (2024A),FLOAT,AVG,INCPYAVEHH,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...


We can see there are several variables related to population, so this is the `Dataset` we are looking for. Filtering the variables by description lists them:

In [8]:
vdf[vdf['description'].str.contains('pop', case=False, na=False)]

Unnamed: 0,slug,name,description,db_type,agg_method,column_name,variable_group_id,dataset_id,id
1,POPCY_4534fac4,Total Population,Population (2019A),INTEGER,SUM,POPCY,,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
2,POPCYGRP_3033ef2e,POPCYGRP,Population in Group Quarters (2019A),INTEGER,SUM,POPCYGRP,,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
3,POPCYGRPI_1e42899,POPCYGRPI,Institutional Group Quarters Population (2019A),INTEGER,SUM,POPCYGRPI,,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
4,AGECY0004_aaae373a,AGECY0004,Population age 0-4 (2019A),INTEGER,SUM,AGECY0004,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
5,AGECY0509_d2d4896c,AGECY0509,Population age 5-9 (2019A),INTEGER,SUM,AGECY0509,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
6,AGECY1014_b09611e,AGECY1014,Population age 10-14 (2019A),INTEGER,SUM,AGECY1014,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
7,AGECY1519_7373df48,AGECY1519,Population age 15-19 (2019A),INTEGER,SUM,AGECY1519,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
8,AGECY2024_32919d33,AGECY2024,Population age 20-24 (2019A),INTEGER,SUM,AGECY2024,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
9,AGECY2529_4aeb2365,AGECY2529,Population age 25-29 (2019A),INTEGER,SUM,AGECY2529,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...
10,AGECY3034_9336cb17,AGECY3034,Population age 30-34 (2019A),INTEGER,SUM,AGECY3034,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...,carto-do.ags.demographics_sociodemographics_us...


### Dataset and variables metadata

The Data Observatory catalog is not only a repository of curated spatial datasets; it also contains valuable information that helps you better understand the underlying data of every dataset, so you can make an informed decision on which data best fits your problem.

Some of the augmented metadata you can find for each dataset in the catalog is:

- `head` and `tail` methods to get a glimpse of the actual data. This helps you understand the available columns, data types, etc., so you can start modelling your problem right away.
- `geom_coverage` to visualize on a map the geographical coverage of the data in the `Dataset`.
- `counts`, `fields_by_type` and a full `describe` method with stats of the actual values in the dataset, such as: average, stdev, quantiles, min, max, median for each of the variables of the dataset.

You don't need a subscription to a dataset to query the augmented metadata: it is publicly available to anyone exploring the Data Observatory catalog.

Let's go over some of that information, starting with a glimpse of the first or last ten rows of the actual data in the dataset:

In [9]:
from cartoframes.data.observatory import Dataset
dataset = Dataset.get('ags_sociodemogr_a7e14220')

In [10]:
dataset.head()

Unnamed: 0,DWLCY,DWLPY,HHDCY,HHDPY,POPCY,POPPY,geoid,VPHCY1,do_date,AGECYMED,...,MARCYDIVOR,MARCYNEVER,MARCYWIDOW,RCHCYAMNHS,RCHCYASNHS,RCHCYBLNHS,RCHCYHANHS,RCHCYMUNHS,RCHCYOTNHS,RCHCYWHNHS
0,1057,1112,932,986,1500,1648,40130405071,442,2020-01-01 00:00:00+00:00,77.4,...,149,4,228,0,11,20,0,25,0,1317
1,1964,2069,1774,1877,2595,2868,40130405072,1049,2020-01-01 00:00:00+00:00,76.88,...,414,160,699,0,74,68,7,55,0,2167
2,1049,1101,897,933,1585,1716,40130610182,460,2020-01-01 00:00:00+00:00,69.88,...,31,217,246,2,55,43,9,26,0,1313
3,1084,1137,910,940,1503,1616,40138175002,392,2020-01-01 00:00:00+00:00,71.44,...,191,79,268,8,24,38,0,8,0,1290
4,682,706,574,591,980,1039,40190043241,244,2020-01-01 00:00:00+00:00,72.38,...,30,44,195,3,9,0,0,0,0,902
5,880,910,840,869,1249,1284,60133511032,539,2020-01-01 00:00:00+00:00,76.75,...,160,40,319,0,136,19,2,12,5,1024
6,1467,1534,1314,1467,1658,1800,60590995101,831,2020-01-01 00:00:00+00:00,74.58,...,423,136,496,3,226,10,1,16,0,1269
7,704,753,693,730,1078,1176,60610210391,338,2020-01-01 00:00:00+00:00,73.86,...,117,63,215,5,33,7,0,9,0,986
8,1582,1691,1553,1650,2540,2795,60610236001,818,2020-01-01 00:00:00+00:00,68.8,...,406,45,301,5,168,26,3,19,0,2183
9,1186,1268,1163,1234,1980,2176,60610236002,415,2020-01-01 00:00:00+00:00,68.59,...,253,60,223,5,97,22,1,10,0,1750


Alternatively, you can get the last ten rows with `dataset.tail()`.

An overview of the coverage of the dataset:

In [11]:
dataset.geom_coverage()

Some stats about the dataset:

In [12]:
dataset.counts()

rows                    217182.0
cells                 23890020.0
null_cells                   0.0
null_cells_percent           0.0
dtype: float64

In [13]:
dataset.fields_by_type()

float          5
string         2
integer      102
timestamp      1
dtype: int64
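
The two outputs above can be cross-checked with simple arithmetic. The sketch below copies the printed numbers into plain dictionaries and verifies that the number of cells equals rows times fields. This relation is inferred from the figures shown in this guide, not from the API documentation.

```python
# Values copied from the dataset.counts() and dataset.fields_by_type() outputs above.
counts = {"rows": 217182, "cells": 23890020, "null_cells": 0}
fields_by_type = {"float": 5, "string": 2, "integer": 102, "timestamp": 1}

# Total number of columns across all types.
total_fields = sum(fields_by_type.values())

# Expected number of cells if every row has one value per field.
expected_cells = counts["rows"] * total_fields

print(total_fields)
print(expected_cells == counts["cells"])
```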

In [14]:
dataset.describe()

Unnamed: 0,POPCY,POPCYGRP,POPCYGRPI,AGECY0004,AGECY0509,AGECY1014,AGECY1519,AGECY2024,AGECY2529,AGECY3034,...,DWLCYVACNT,DWLCYRENT,DWLCYOWNED,POPPY,HHDPY,DWLPY,AGEPYMED,INCPYPCAP,INCPYAVEHH,INCPYMEDHH
avg,1520.47,37.25678,17.98816,90.4704,93.41024,95.81574,97.2645,100.0845,108.3624,104.4326,...,49.57671,211.6296,383.5577,1568.307,607.5594,671.8555,39.89449,42928.96,107309.8,79333.28
max,67100.0,19752.0,12053.0,5393.0,5294.0,5195.0,7606.0,14804.0,5767.0,5616.0,...,6547.0,10057.0,23676.0,75845.0,28115.0,32640.0,87.5,3824975.0,11127200.0,350000.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sum,330218700.0,8091501.0,3906704.0,19648540.0,20287020.0,20809450.0,21124100.0,21736550.0,23534360.0,22680870.0,...,10767170.0,45962130.0,83301820.0,340608000.0,131951000.0,145914900.0,8664364.0,9323398000.0,23305760000.0,17229760000.0
range,67100.0,19752.0,12053.0,5393.0,5294.0,5195.0,7606.0,14804.0,5767.0,5616.0,...,6547.0,10057.0,23676.0,75845.0,28115.0,32640.0,87.5,3824975.0,11127200.0,350000.0
stdev,1063.417,242.8693,158.2057,80.44792,83.38915,83.51758,111.7828,124.4125,96.42758,93.58961,...,98.49786,235.3261,316.3312,1141.981,413.8545,446.7692,7.567177,31788.7,78351.31,42620.18
q1,850.0,0.0,0.0,44.0,44.0,46.0,45.0,43.0,50.0,49.0,...,11.0,60.0,182.0,867.0,344.0,384.0,34.07,21680.0,55361.0,47018.0
q3,1454.0,0.0,0.0,83.0,86.0,89.0,87.0,86.0,98.0,95.0,...,34.0,178.0,375.0,1485.0,581.0,648.0,41.01,40582.0,101272.0,79083.0
median,1125.0,0.0,0.0,62.0,63.0,65.0,64.0,62.0,71.0,69.0,...,20.0,108.0,274.0,1143.0,452.0,504.0,37.71,30563.0,76713.0,62122.0
interquartile_range,604.0,0.0,0.0,39.0,42.0,43.0,42.0,43.0,48.0,46.0,...,23.0,118.0,193.0,618.0,237.0,264.0,6.94,18902.0,45911.0,32065.0


Every `Dataset` instance in the catalog contains other useful metadata:

- slug: A short ID
- name and description: Free text attributes
- country
- geography: Every dataset is related to a Geography instance
- category
- provider
- data source
- lang
- temporal aggregation
- time coverage
- update frequency
- version
- is_public_data: whether you need a license to use the dataset for enrichment purposes or not

In [15]:
dataset.to_dict()

{'slug': 'ags_sociodemogr_a7e14220',
 'name': 'Sociodemographics - United States of America (Census Block Group)',
 'description': 'Census and ACS sociodemographic data estimated for the current year and data projected to five years. Projected fields are general aggregates (total population, total households, median age, avg income etc.)',
 'category_id': 'demographics',
 'country_id': 'usa',
 'data_source_id': 'sociodemographics',
 'provider_id': 'ags',
 'geography_name': 'Census Block Group - United States of America',
 'geography_description': None,
 'temporal_aggregation': 'yearly',
 'time_coverage': None,
 'update_frequency': 'yearly',
 'is_public_data': False,
 'lang': 'eng',
 'version': '2020',
 'category_name': 'Demographics',
 'provider_name': 'Applied Geographic Solutions',
 'geography_id': 'carto-do.ags.geography_usa_blockgroup_2015',
 'id': 'carto-do.ags.demographics_sociodemographics_usa_blockgroup_2015_yearly_2020'}

There is also some interesting metadata for each variable in the dataset:

- id
- slug: A short ID
- name and description
- column_name: Actual column name in the table that contains the data
- db_type: SQL type in the database
- dataset_id
- agg_method: Aggregation method used
- temporal aggregation and time coverage

Variables are the most important asset in the catalog. When exploring datasets in the Data Observatory catalog, it is very important that you clearly understand which variables are available to enrich your own data.

For each `Variable` in each dataset, the Data Observatory provides (as it does with datasets) a set of methods and attributes to understand its underlying data.

Some of them are:

- `head` and `tail` methods to get a glimpse of the actual data and start modelling your problem right away.
- `counts`, `quantiles` and a full `describe` method with stats of the actual values of the variable, such as: average, stdev, quantiles, min, max and median.
- a `histogram` plot with the distribution of the values of the variable.

Let's overview some of that augmented metadata for the variables in the AGS population dataset.

In [16]:
from cartoframes.data.observatory import Variable
variable = Variable.get('POPCY_4534fac4')
variable

<Variable.get('POPCY_4534fac4')> #'Population (2019A)'

In [17]:
variable.to_dict()

{'slug': 'POPCY_4534fac4',
 'name': 'Total Population',
 'description': 'Population (2019A)',
 'db_type': 'INTEGER',
 'agg_method': 'SUM',
 'column_name': 'POPCY',
 'variable_group_id': None,
 'dataset_id': 'carto-do.ags.demographics_sociodemographics_usa_blockgroup_2015_yearly_2020',
 'id': 'carto-do.ags.demographics_sociodemographics_usa_blockgroup_2015_yearly_2020.POPCY'}
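
Looking at the dictionary above, the variable `id` appears to be the `dataset_id` followed by a dot and the `column_name`. The sketch below checks that relation on the printed values. Reading the three dot-separated parts of the dataset id as project, provider and table is an assumption made here for illustration.

```python
# Values copied from the variable.to_dict() output above.
variable = {
    "column_name": "POPCY",
    "dataset_id": "carto-do.ags.demographics_sociodemographics_usa_blockgroup_2015_yearly_2020",
    "id": "carto-do.ags.demographics_sociodemographics_usa_blockgroup_2015_yearly_2020.POPCY",
}

# Rebuild the variable id from its parent dataset id and its column name.
rebuilt_id = f"{variable['dataset_id']}.{variable['column_name']}"
print(rebuilt_id == variable["id"])

# Assumed naming of the dot-separated parts (project, provider, table).
project, provider, table = variable["dataset_id"].split(".", 2)
print(project, provider, table)
```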

There are also some utility methods to understand the underlying data for each variable:

In [18]:
variable.head()

0    1500
1    2595
2    1585
3    1503
4     980
5    1249
6    1658
7    1078
8    2540
9    1980
dtype: int64

In [19]:
variable.counts()

all                 217182.000000
null                     0.000000
zero                   299.000000
extreme               9073.000000
distinct              6756.000000
outliers             26998.000000
null_percent             0.000000
zero_percent             0.137673
extreme_percent          0.041776
distinct_percent         3.110755
outliers_percent         0.124310
dtype: float64
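
It helps to know how the `*_percent` fields relate to the raw counts. Recomputing them from the numbers printed above suggests that `zero_percent` and `distinct_percent` are expressed on a 0-100 scale, while `extreme_percent` and `outliers_percent` match a 0-1 fraction. This reading is based only on the output shown here, so treat it as an observation rather than documented behaviour.

```python
# Raw counts copied from the variable.counts() output above.
total = 217182
zero, distinct, extreme, outliers = 299, 6756, 9073, 26998

# These two reproduce the printed values when scaled to 0-100.
zero_percent = 100 * zero / total
distinct_percent = 100 * distinct / total

# These two reproduce the printed values as plain 0-1 fractions.
extreme_share = extreme / total
outliers_share = outliers / total

print(round(zero_percent, 6), round(distinct_percent, 6))
print(round(extreme_share, 6), round(outliers_share, 6))
```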

In [20]:
variable.quantiles()

q1                      850
q3                     1454
median                 1125
interquartile_range     604
dtype: int64
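
Quartiles are often used to flag unusually large or small values. As a rough sketch, the code below applies the common 1.5 x IQR rule (Tukey fences) to the quartiles printed above and to the ten `POPCY` values returned by `variable.head()`. Whether the Data Observatory uses this rule for its own `outliers` count is not stated in this guide, so this is an assumption for illustration only.

```python
# Quartiles copied from the variable.quantiles() output above.
q1, q3 = 850, 1454
iqr = q3 - q1

# Tukey fences: a common convention, assumed here for illustration.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The ten POPCY values shown by variable.head() in this guide.
sample = [1500, 2595, 1585, 1503, 980, 1249, 1658, 1078, 2540, 1980]
flagged = [v for v in sample if v < lower_fence or v > upper_fence]

print(iqr, lower_fence, upper_fence)
print(flagged)
```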

In [21]:
variable.histogram()

<Figure size 1200x700 with 1 Axes>

In [22]:
variable.describe()

avg                    1.520470e+03
max                    6.710000e+04
min                    0.000000e+00
sum                    3.302187e+08
range                  6.710000e+04
stdev                  1.063417e+03
q1                     8.500000e+02
q3                     1.454000e+03
median                 1.125000e+03
interquartile_range    6.040000e+02
dtype: float64
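
Several of the `describe` figures are related to each other, which is handy as a sanity check. The sketch below recomputes the range, the interquartile range and the average from the other values printed in this guide. The row count comes from the `counts` output, and the sum is the rounded figure shown in the dataset `describe` table, so the average only matches approximately.

```python
# Values copied from the describe() and counts() outputs in this guide.
stats = {
    "avg": 1520.47,
    "max": 67100.0,
    "min": 0.0,
    "sum": 330218700.0,  # rounded figure as printed
    "q1": 850.0,
    "q3": 1454.0,
    "range": 67100.0,
    "interquartile_range": 604.0,
}
rows = 217182

derived_range = stats["max"] - stats["min"]
derived_iqr = stats["q3"] - stats["q1"]
derived_avg = stats["sum"] / rows

print(derived_range == stats["range"])
print(derived_iqr == stats["interquartile_range"])
print(round(derived_avg, 2))
```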

### Subscribe to a Dataset in the catalog

Once you have explored the catalog and identified a dataset with the variables you need for your analysis and the right spatial resolution, check its `is_public_data` attribute to know whether you can use it directly from CARTOframes or you first need to subscribe for a license.

Subscriptions to datasets allow you to use them from CARTOframes to enrich your own data or to download them. See the enrichment guides for more information about this.

Let's see the dataset and geography in our previous example:

In [23]:
dataset = Dataset.get('ags_sociodemogr_a7e14220')

In [24]:
dataset.is_public_data

False

In [25]:
from cartoframes.data.observatory import Geography
geography = Geography.get(dataset.geography)

In [26]:
geography.is_public_data

False

Neither `dataset` nor `geography` is public data, which means you need a subscription to be able to use them to enrich your own data.
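
The rule can be summarised in a few lines of plain Python. The `subscriptions_needed` helper below is hypothetical and not part of CARTOframes. It simply takes the two `is_public_data` flags and returns which entities would require a subscription before enrichment.

```python
def subscriptions_needed(dataset_is_public, geography_is_public):
    """Return the entities that require a subscription before enrichment.

    Hypothetical helper: it only encodes the rule described in this guide.
    """
    needed = []
    if not dataset_is_public:
        needed.append("dataset")
    if not geography_is_public:
        needed.append("geography")
    return needed


# Flags as returned in the cells above: both are False.
print(subscriptions_needed(False, False))
```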

**To subscribe to data in the Data Observatory catalog you need a CARTO account with access to Data Observatory**

In [27]:
from cartoframes.auth import set_default_credentials

set_default_credentials('creds.json')

In [28]:
dataset.subscribe()

HTML(value='\n        <h3>Subscription already purchased</h3>\n        The dataset <b>carto-do.ags.demographic…

In [29]:
geography.subscribe()

HTML(value='\n        <h3>Subscription already purchased</h3>\n        The geography <b>carto-do.ags.geography…

**Licenses to data in the Data Observatory grant you the right to use the subscribed data for a period of one year. Every dataset or geography you want to use to enrich your own data requires a valid license, unless it is public data.**

You can check the current status of your subscriptions directly from the catalog.

In [30]:
Catalog().subscriptions()

Datasets: [<Dataset.get('ags_businesscou_df363a87')>, <Dataset.get('ags_crimerisk_e9cfa4d4')>, <Dataset.get('ags_retailpoten_aaf25a8c')>, <Dataset.get('ags_sociodemogr_f510a947')>, <Dataset.get('ags_sociodemogr_a7e14220')>]
Geographies: [<Geography.get('cdb_blockgroup_7753dd51')>, <Geography.get('ags_blockgroup_1c63771c')>]

### Conclusion

In this guide you've seen how to explore the Data Observatory catalog to identify variables of datasets that you can use to enrich your own data.

You've learned how to:

- Explore the catalog using nested hierarchical filters.
- Describe the three main entities in the catalog: `Geography`, `Dataset` and their `Variables`.
- Inspect the data and stats taken from the actual repository, to make a more informed decision on which variables to choose.
- Subscribe to the chosen dataset to get a license that grants the right to enrich your own data.

We also recommend checking out the resources below to learn more about the Data Observatory catalog:

- The CARTOframes [enrichment guide](/developers/cartoframes/guides/Data-enrichment/)
- [Our public website](https://carto.com/platform/location-data-streams/)
- Your user dashboard: Under the data section
- The CARTOframes catalog [API reference](/developers/cartoframes/reference/#heading-Data-Observatory)
