## Enrichment

### Introduction

Enrichment is the process of augmenting your data with new variables by means of a spatial join between your data and a `Dataset` in the CARTO Data Observatory that is aggregated at a given spatial resolution. In other words:

"*Enrichment is the process of adding variables to a geometry (point, line, polygon…), which we call the target, from a spatial (polygon) dataset, which we call the source.*"

We recommend reading the [CARTOframes quickstart](https://carto.com/developers/cartoframes/guides/Quickstart/) first, since we'll use some of the DataFrames generated in that example, and the [Discovery guide](https://carto.com/developers/cartoframes/guides/Data-discovery) to learn how to explore the Data Observatory catalog and find variables of interest for your analyses.

### Choose variables to enrich from the Data Observatory catalog

Let's follow up on the [Discovery guide](https://carto.com/developers/cartoframes/guides/Data-discovery), where we subscribed to the AGS demographics dataset, and list the variables available to enrich our own data.

In [1]:
from cartoframes.auth import set_default_credentials
set_default_credentials('creds.json')

In [None]:
from cartoframes.data.observatory import Catalog, Dataset, Variable, Geography
Catalog().subscriptions().datasets

In [None]:
dataset = Dataset.get('ags_sociodemogr_e92b1637')
variables = dataset.variables
variables

As we saw in the Discovery guide, the `ags_sociodemogr_e92b1637` dataset contains socio-demographic variables aggregated at the Census blockgroup level. Let's look for a variable indicating the total population.

In [None]:
vdf = variables.to_dataframe()
vdf[vdf['name'].str.contains('pop', case=False, na=False)]

We can store the variable instance we need by searching the Catalog by its `slug`, in this case `POPCY_f5800f44`:

In [None]:
variable = Variable.get('POPCY_f5800f44')
variable.to_dict()

The `POPCY` variable contains the `SUM` of the population per blockgroup for the year 2019. Let's enrich our stores DataFrame with that variable.

### Enrich a points DataFrame

We learned in the [CARTOframes quickstart](https://carto.com/developers/cartoframes/guides/Quickstart/) how to load our own data (in this case Starbucks stores) and geocode their addresses to coordinates for further analysis, so we start by loading our geocoded Starbucks stores:

In [None]:
from geopandas import read_file
stores_gdf = read_file('../examples/files/starbucks_brooklyn_geocoded.geojson')
stores_gdf.head(5)

**Note: We could alternatively load any geospatial format supported by GeoPandas or CARTO. See the Data Management guide for more information.**

As we can see, for each store we have its name, address, the total revenue by year and a `geometry` column indicating the location of the store. This is important because the enrichment service needs a DataFrame with a geometry column encoded as a [shapely](https://pypi.org/project/Shapely/) object.

We can now create a new `Enrichment` instance. Since the `stores_gdf` dataset represents store locations (points), we can use the `enrich_points` function, passing as arguments the stores DataFrame and a list of `Variables` from the Data Observatory catalog for which we have a valid subscription.

In this case we are enriching with just one variable (the total population), but we could pass a list of several.

In [None]:
from cartoframes.data.observatory import Enrichment
enriched_stores_gdf = Enrichment().enrich_points(stores_gdf, [variable])
enriched_stores_gdf.head(5)

Once the enrichment finishes, our DataFrame has a new column called `POPCY` with the `SUM` of the population projected for the year 2019 in the US Census blockgroup that contains each of our Starbucks stores.
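
Conceptually, this resembles a point-in-polygon lookup: each store takes the value of the source polygon that contains it. The following is a minimal sketch of that idea in plain Python, not the enrichment service's implementation. The helper names (`point_in_polygon`, `enrich_points_sketch`), the square "blockgroups" and the population figures are all hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if `point` (x, y) lies inside `polygon`,
    given as a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def enrich_points_sketch(points, source_polygons, column):
    """Attach `column` from the containing source polygon to every point."""
    enriched = []
    for point in points:
        value = None  # stays None when no source polygon contains the point
        for source in source_polygons:
            if point_in_polygon(point, source["geometry"]):
                value = source[column]
                break
        enriched.append({"geometry": point, column: value})
    return enriched


# Hypothetical blockgroups (unit squares) with made-up population values
block_groups = [
    {"geometry": [(0, 0), (1, 0), (1, 1), (0, 1)], "POPCY": 1200},
    {"geometry": [(1, 0), (2, 0), (2, 1), (1, 1)], "POPCY": 800},
]
stores = [(0.5, 0.5), (1.5, 0.25), (3.0, 3.0)]

result = enrich_points_sketch(stores, block_groups, "POPCY")
```

The third store falls outside every source polygon, so in this sketch its `POPCY` value stays `None`.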

We obtain the `SUM` because the data in the `ags_sociodemogr_e92b1637` dataset is aggregated at the Census blockgroup level and, more specifically, because the `POPCY` variable is aggregated by `SUM`, as we can see in the Catalog `Variable` metadata:

In [None]:
variable.agg_method

All this information is available in the `ags_sociodemogr_e92b1637` metadata. Let's take a look:

In [None]:
dataset.to_dict()

### Enrich a polygons DataFrame

Let's do a second enrichment, this time using the DataFrame with the areas of influence calculated in the [Quickstart guide](https://carto.com/developers/cartoframes/guides/Quickstart-Part-1/). There, we used the [CARTOframes isochrones](https://carto.com/developers/cartoframes/reference/#heading-Isolines) service to obtain the polygons around each store that cover the area within an 8, 17 and 25 minute walk.

In [None]:
aoi_gdf = read_file('../examples/files/starbucks_brooklyn_isolines.geojson')
aoi_gdf.head(5)

In this case we have a DataFrame that, for each index in `stores_gdf`, contains the polygons of the areas of influence around that store at 8, 17 and 25 minutes walking. Again, the `geometry` is encoded as a `shapely` object.

For polygons, the `Enrichment` service provides an `enrich_polygons` function which, in its basic form, works in the same way as the `enrich_points` function. It just needs a DataFrame with a polygon geometry and a list of variables to enrich with:

In [None]:
from cartoframes.data.observatory import Enrichment
enriched_aoi_gdf = Enrichment().enrich_polygons(aoi_gdf, [variable])
enriched_aoi_gdf.head(5)

We have obtained a new column in our areas of influence DataFrame, `SUM_POPCY`, which represents the `SUM` of the total population in the Census blockgroups that intersect with each polygon in our DataFrame.

### How enrichment works

Let's look at what happens under the hood when you run a polygon enrichment.

Imagine we have polygons representing municipalities, in blue, each of which has a population attribute, and we want to find out the population inside the green circle.

<img src="../examples/files/enrichment_01.png" width="400"/>

We don’t know how the population is distributed inside these municipalities. It is probably concentrated in cities somewhere but, since we don’t know where those are, our best guess is to assume that the population is evenly distributed in the municipality (every point inside the municipality has the same population density).

Population is an extensive property (it grows with area), so we can subset it (a region inside the municipality will always have a smaller population than the whole municipality), and also aggregate it by summing.

In this case, we’d calculate the population inside each part of the circle that intersects with a municipality.
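
The calculation described above can be sketched in plain Python. This is an illustration under simplifying assumptions, not the enrichment service's code: geometries are axis-aligned rectangles `(xmin, ymin, xmax, ymax)` so that intersections are trivial to compute (the target is a square rather than a circle), and the function names and population figures are hypothetical.

```python
def rect_area(rect):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax).
    Degenerate or inverted rectangles have area 0."""
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def rect_intersection(a, b):
    """Intersection of two rectangles; inverted (zero area) if they are disjoint."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def apportion_sum(target, sources, column):
    """Estimate an extensive value inside `target`, assuming it is evenly
    distributed within each source polygon."""
    total = 0.0
    for source in sources:
        overlap = rect_area(rect_intersection(target, source["geometry"]))
        # Fraction of the source polygon that falls inside the target
        share = overlap / rect_area(source["geometry"])
        total += source[column] * share
    return total


# Hypothetical municipalities with made-up populations
municipalities = [
    {"geometry": (0, 0, 2, 2), "population": 1000},
    {"geometry": (2, 0, 4, 2), "population": 400},
    {"geometry": (0, 2, 2, 4), "population": 600},
    {"geometry": (4, 4, 6, 6), "population": 999},  # does not touch the target
]
target_area = (1, 1, 3, 3)

estimated_population = apportion_sum(target_area, municipalities, "population")
print(estimated_population)  # 500.0
```

Each of the first three municipalities has a quarter of its area inside the target, so a quarter of each population is counted: 250 + 100 + 150 = 500.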

**Default aggregation methods**

In the Data Observatory, we suggest a default aggregation method for certain fields (always weighted by intersected area). However, some fields don’t have a clear best method, and some just can’t be aggregated. In these cases, we leave the `agg_method` field blank and let users choose the method that best fits their needs.
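
To make the idea of area-weighted aggregation methods more concrete, here is a small sketch. It is an assumption-laden illustration rather than the Data Observatory's implementation: the `aggregate_sketch` function, the `"AVG"` label, the formulas and the sample values are hypothetical. It treats `"SUM"` as apportioning an extensive value by the fraction of each source polygon that is intersected, treats `"AVG"` as an average weighted by intersected area, and refuses to guess when no method is given.

```python
def aggregate_sketch(pieces, agg_method):
    """Combine values from the intersected source polygons.

    `pieces` is a list of (value, intersected_area, source_area) tuples,
    one per source polygon that intersects the target polygon.
    """
    if agg_method == "SUM":
        # Extensive values: keep the share of each value that falls inside
        return sum(value * inter / area for value, inter, area in pieces)
    if agg_method == "AVG":
        # Intensive values: average weighted by intersected area
        total_inter = sum(inter for _, inter, _ in pieces)
        return sum(value * inter for value, inter, _ in pieces) / total_inter
    # Blank agg_method: the caller has to pick a method explicitly
    raise ValueError("No aggregation method given; choose one explicitly")


# Made-up population counts: (value, intersected area, source polygon area)
population_pieces = [(1000, 1.0, 4.0), (400, 1.0, 4.0), (600, 2.0, 4.0)]
total_population = aggregate_sketch(population_pieces, "SUM")

# Made-up median ages for two source polygons
age_pieces = [(30.0, 1.0, 4.0), (40.0, 3.0, 4.0)]
average_age = aggregate_sketch(age_pieces, "AVG")

print(total_population, average_age)  # 650.0 37.5
```

Summing a field such as a median age would give a meaningless number, which is why the choice of method depends on the kind of variable.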

### Conclusion

In this guide we've learned how to use CARTOframes together with the Data Observatory to enrich our Starbucks dataset with a new population variable for our use case of revenue prediction analysis. For that purpose we've:

- Chosen the total population variable from the Data Observatory catalog
- Calculated the sum of total population for each store
- Calculated the sum of total population within the walking areas of influence around each store

Finally, we've introduced some more advanced concepts and explained how the enrichment works.