# Data Observatory in CARTOframes

The [Data Observatory](https://carto.com/data-observatory/) can be accessed through CARTOframes. This is a basic demonstration of how to pull down new measures to build a feature set for training a model.

In [55]:
from cartoframes.auth import set_default_context, Context
from cartoframes.viz import Map, Layer
from cartoframes.data import Dataset

import pandas

username = 'cartovl'  # <-- insert your username here
api_key = ''  # <-- insert your API key here

context = Context('https://{}.carto.com/'.format(username), api_key)
set_default_context(context)
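
Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the credentials live in environment variables named `CARTO_USERNAME` and `CARTO_API_KEY` (hypothetical names chosen for this example, not something CARTOframes defines), the same base URL could be built like this:

```python
import os


def carto_credentials(env=os.environ):
    """Return (base_url, api_key) read from environment variables.

    CARTO_USERNAME and CARTO_API_KEY are hypothetical variable names
    used only for this example.
    """
    username = env.get('CARTO_USERNAME', 'cartovl')
    api_key = env.get('CARTO_API_KEY', '')
    # Same URL pattern as the cell above
    base_url = 'https://{}.carto.com/'.format(username)
    return base_url, api_key


base_url, api_key = carto_credentials()
print(base_url)
```

The returned pair could then be passed to `Context(base_url, api_key)` exactly as in the cell above.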

## Getting Mexico City Metro station coordinates

Use pandas to download an Excel spreadsheet into a dataframe.

In [56]:
# Metro stations from here:
# https://github.com/josecarlosgonz/mexicoCityMetro/blob/master/coordsMetro.xlsx

dataframe = pandas.read_excel('https://github.com/josecarlosgonz/mexicoCityMetro/blob/master/coordsMetro.xlsx?raw=true')
dataframe.head()

,Name,latitude,longitude,Unnamed: 3,linea,estacion,afluencia,latitude.1,longitude.1
0,Pantitlán,19.4163,-99.0747,,1,Pantitlán,4513549.0,19.4163,-99.0747
1,Zaragoza,19.4117,-99.0821,,1,Zaragoza,5144223.0,19.4117,-99.0821
2,Gómez Farías,19.4165,-99.0904,,1,Gómez Farías,3665025.0,19.4165,-99.0904
3,Boulevard Puerto Aéreo,19.4196,-99.0963,,1,Boulevard Puerto Aéreo,3611591.0,19.4196,-99.0963
4,Balbuena,19.4231,-99.1021,,1,Balbuena,1822229.0,19.4231,-99.1021


Send the dataframe to CARTO. Column names are normalized on upload (for example, `latitude.1` becomes `latitude_1`), so be sure to pass the normalized names to `with_lnglat`.

In [57]:
dataset = Dataset.from_dataframe(dataframe)

dataset.upload(
    table_name='coordsmetro_demo',
    with_lnglat=('longitude_1', 'latitude_1'),
    if_exists='replace'
)

The following columns were changed in the CARTO copy of this dataframe:
Name -> name
Unnamed: 3 -> unnamed_3
latitude.1 -> latitude_1
longitude.1 -> longitude_1

<cartoframes.data.dataset.Dataset at 0x115d07710>
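
To predict which names to pass to `with_lnglat` before uploading, the renaming reported above can be approximated locally. The function below is only a sketch that reproduces the four renames shown in the output (lowercasing, and replacing runs of non-alphanumeric characters with an underscore). `normalize_column_name` is a hypothetical helper, not CARTO's actual normalization routine, which may handle other cases differently.

```python
import re


def normalize_column_name(name):
    """Approximate the renaming shown above; not CARTO's real implementation."""
    name = name.strip().lower()
    # Collapse every run of characters outside a-z and 0-9 into one underscore
    name = re.sub(r'[^a-z0-9]+', '_', name)
    # Drop underscores left at either end
    return name.strip('_')


for original in ['Name', 'Unnamed: 3', 'latitude.1', 'longitude.1']:
    print(original, '->', normalize_column_name(original))
```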

## See the data by `linea`

_Note: by default, the basemap labels are drawn beneath the data layer._

In [58]:
data = dataset.download(decode_geom=True)

data.head()

cartodb_id,geometry,name,latitude,longitude,unnamed_3,linea,estacion,afluencia,latitude_1,longitude_1
1,,Pantitlán,19.4163,-99.0747,,1,Pantitlán,4513549.0,19.4163,-99.0747
2,,Zaragoza,19.4117,-99.0821,,1,Zaragoza,5144223.0,19.4117,-99.0821
3,,Gómez Farías,19.4165,-99.0904,,1,Gómez Farías,3665025.0,19.4165,-99.0904
4,,Boulevard Puerto Aéreo,19.4196,-99.0963,,1,Boulevard Puerto Aéreo,3611591.0,19.4196,-99.0963
5,,Balbuena,19.4231,-99.1021,,1,Balbuena,1822229.0,19.4231,-99.1021


In [52]:
Map(Layer('coordsmetro_demo', 'color: ramp($linea, sunset)'))

## Data Observatory measures in the Mexico City area

Let's get education-related Data Observatory measures around the metro stops.

In [53]:
data_observatory = context.data_discovery(region='coordsmetro_demo', keywords='education')
data_observatory.head()

,denom_aggregate,denom_colname,denom_description,denom_geomref_colname,denom_id,denom_name,denom_reltype,denom_t_description,denom_tablename,denom_type,...,numer_timespan,numer_type,score,score_rank,score_rownum,suggested_name,target_area,target_geoms,timespan_rank,timespan_rownum
0,sum,b01_tot_p_f,Selected Person Characteristics,region_id,au.data.B01_Tot_P_F,Total (Females),denominator,,obs_1699c60291c8bd72199fc1ef86b23165eff0f201,Numeric,...,2011,Numeric,37.858126,1.0,1.0,b01_age_psns_att_educ_inst_0_4_f_2011,,,1.0,1.0
1,sum,b01_tot_p_f,Selected Person Characteristics,region_id,au.data.B01_Tot_P_F,Total (Females),denominator,,obs_1699c60291c8bd72199fc1ef86b23165eff0f201,Numeric,...,2011,Numeric,37.858126,1.0,1.0,b01_age_psns_att_educ_inst_0_4_f_2011_by_b01_t...,,,1.0,1.0
2,sum,b01_tot_p_m,Selected Person Characteristics,region_id,au.data.B01_Tot_P_M,Total (Males),denominator,,obs_1699c60291c8bd72199fc1ef86b23165eff0f201,Numeric,...,2011,Numeric,37.858126,1.0,1.0,b01_age_psns_att_educ_inst_0_4_m_2011,,,1.0,1.0
3,sum,b01_tot_p_m,Selected Person Characteristics,region_id,au.data.B01_Tot_P_M,Total (Males),denominator,,obs_1699c60291c8bd72199fc1ef86b23165eff0f201,Numeric,...,2011,Numeric,37.858126,1.0,1.0,b01_age_psns_att_educ_inst_0_4_m_2011_by_b01_t...,,,1.0,1.0
4,sum,b01_tot_p_p,Selected Person Characteristics,region_id,au.data.B01_Tot_P_P,Total (Persons),denominator,,obs_1699c60291c8bd72199fc1ef86b23165eff0f201,Numeric,...,2011,Numeric,37.858126,1.0,1.0,b01_age_psns_att_educ_inst_0_4_p_2011,,,1.0,1.0


In [28]:
# See how many measures are possible
data_observatory.shape

(2060, 42)

In [38]:
# Look at the geometry levels available
data_observatory.groupby('geom_id')['geom_id'].count()

geom_id
au.geo.SED                1034
ca.statcan.geo.cd_          18
es.ine.the_geom             18
eu.geo.nuts2               600
eu.geo.nuts3               160
mx.inegi.municipio          50
us.census.tiger.block       10
us.census.tiger.cbsa        50
us.census.tiger.county     120
Name: geom_id, dtype: int64
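
The grouped count above can also be written with `Series.value_counts()`. The sketch below uses a small toy dataframe (its rows are invented for illustration, not taken from the Data Observatory) to show that both forms give the same counts per `geom_id`:

```python
import pandas

# Toy stand-in for the discovery results; the rows are invented for illustration
toy = pandas.DataFrame({
    'geom_id': [
        'mx.inegi.municipio',
        'eu.geo.nuts2',
        'mx.inegi.municipio',
        'au.geo.SED',
    ],
})

# Form used in the cell above: group by the column, then count rows per group
by_group = toy.groupby('geom_id')['geom_id'].count()

# Equivalent shortcut; sort_index() orders the result by geom_id like groupby does
by_value = toy['geom_id'].value_counts().sort_index()

print(by_group)
print(by_value)
```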

Narrow the results down to `municipio`-level measures only.

In [54]:
# select only the municipio level data
data_observatory = data_observatory[data_observatory['geom_id'] == 'mx.inegi.municipio']
data_observatory.shape

(50, 42)
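
The filter above relies on pandas boolean indexing. As a small sketch on toy data (the rows and measure names are invented for illustration), the comparison produces a boolean Series, and indexing with it keeps only the matching rows. Because several rows can share one measure name, the example also lists each name once:

```python
import pandas

# Toy stand-in for the discovery results; values are invented for illustration
toy = pandas.DataFrame({
    'geom_id': ['mx.inegi.municipio', 'eu.geo.nuts2', 'mx.inegi.municipio'],
    'numer_name': ['Measure A', 'Measure B', 'Measure A'],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows
mask = toy['geom_id'] == 'mx.inegi.municipio'
municipio = toy[mask]

# List each measure name once, preserving first-seen order
unique_names = municipio['numer_name'].drop_duplicates().tolist()

print(municipio.shape)
print(unique_names)
```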

Take a look at the measures we have.

In [41]:
data_observatory['numer_name'].values

array(['Employed population with primary education',
       'Employed population with primary education',
       'Employed female population with primary education',
       'Employed female population with primary education',
       'Employed female population with primary education',
       'Employed female population with primary education',
       'Employed male population with primary education',
       'Employed male population with primary education',
       'Employed male population with primary education',
       'Employed male population with primary education',
       'Employed population with incomplete secondary education',
       'Employed population with incomplete secondary education',
       'Employed female population with incomplete secondary education',
       'Employed female population with incomplete secondary education',
       'Employed female population with incomplete secondary education',
       'Employed female population with incomplete secondary education'