# devlog 2023-10-26

The ADRIOs Phase 2 scope of work includes significant changes to the architecture of the Geo system. Here's what you need to know.

## Geo class hierarchy

Previously there was only one kind of geo -- it was basically just a dictionary of numpy arrays, where each key of the dictionary was the name of a geo attribute (e.g., 'population' or 'humidity'). This meant that the only way to use a geo in the epymorph simulation was to first gather all of the necessary data and load it into memory.

This isn't flexible enough for our long-term goals.

Now Geo is an abstract class, with child classes implementing different types of Geos. The public interface of the Geo, however, still uses Python's dictionary syntax for accessing data (the square brackets operator, e.g., `geo['population']`). All attributes are expected to be in the form of a numpy array when they are returned by the Geo, but how exactly they produce that numpy array is up to the implementation.
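To make the shape of that interface concrete, here is a minimal sketch. It is not epymorph's actual class definition: `GeoSketch` and `InMemoryGeoSketch` are hypothetical names, and the only thing taken from the description above is that attributes are read with square brackets and come back as numpy arrays.

```python
from abc import ABC, abstractmethod

import numpy as np


class GeoSketch(ABC):
    """Hypothetical stand-in for the abstract Geo: attributes are read by name."""

    @abstractmethod
    def __getitem__(self, name: str) -> np.ndarray:
        """Return the named attribute as a numpy array."""


class InMemoryGeoSketch(GeoSketch):
    """One possible implementation: every attribute is already in memory."""

    def __init__(self, values: dict[str, np.ndarray]):
        self._values = values

    def __getitem__(self, name: str) -> np.ndarray:
        return self._values[name]


# Made-up values, purely for illustration.
geo = InMemoryGeoSketch({
    'label': np.array(['node A', 'node B'], dtype=np.str_),
    'population': np.array([100, 200], dtype=np.int64),
})
total_population = int(geo['population'].sum())
```

Code that consumes a geo only ever touches `geo[...]`, so it does not need to know which implementation it was handed.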

So far there are two types of Geo -- StaticGeo and DynamicGeo.

**StaticGeos** are analogous to the previous implementation of geos -- they contain all of their data in memory in the form of a bunch of numpy arrays. This data is either hard-coded in Python or loaded from a compressed file, which may be included in the project or loaded from a path on the user's file system. StaticGeo compressed files will have the `.geo.tar` extension.

**DynamicGeos** on the other hand are designed to load data from third-party data repositories, say, the Census API. Rather than proactively load every attribute into memory, DynamicGeos can delay fetching an attribute until it is requested -- thus avoiding a costly fetch for attributes that are never actually used. ADRIOMakers serve as the interface to an external data source. They encode how to access the data source, which attributes are available under what restrictions, and what information is required to access them. An ADRIOMaker eventually creates one ADRIO per attribute, and each ADRIO is responsible for fetching its attribute from that source. (Once an attribute is requested, the entire range of values is loaded. ADRIOs are not presently designed to perform partial queries.)
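The deferred-fetch idea can be sketched in a few lines. Everything here is hypothetical (`LazyGeoSketch`, `fake_fetch`, and the plain callables standing in for ADRIOs), and keeping a fetched attribute in memory for later accesses is an assumption of this sketch rather than a statement about how DynamicGeo behaves.

```python
from typing import Callable

import numpy as np


class LazyGeoSketch:
    """Hypothetical lazy geo: one fetch function per attribute, run on first access."""

    def __init__(self, fetchers: dict[str, Callable[[], np.ndarray]]):
        self._fetchers = fetchers
        self._loaded: dict[str, np.ndarray] = {}

    def __getitem__(self, name: str) -> np.ndarray:
        if name not in self._loaded:
            # First request: fetch the entire attribute (no partial queries).
            self._loaded[name] = self._fetchers[name]()
        return self._loaded[name]


fetch_log: list[str] = []


def fake_fetch(name: str, values: list[int]) -> Callable[[], np.ndarray]:
    """Build a stand-in for a remote fetch that records when it runs."""
    def fetch() -> np.ndarray:
        fetch_log.append(name)
        return np.array(values, dtype=np.int64)
    return fetch


# Made-up values, purely for illustration.
geo = LazyGeoSketch({
    'population': fake_fetch('population', [100, 200]),
    'median_income': fake_fetch('median_income', [1, 2]),
})

geo['population']
geo['population']  # second access is served from memory in this sketch
```

After these two accesses `fetch_log` holds a single `'population'` entry; `'median_income'` was never requested, so its fetch never ran.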

Additional geo types with more intricate behavior may be added in future.

## GeoSpecs

Every Geo has a specification, or "spec" for short, which differs slightly based on the geo type. The job of the geo spec is to define which attributes the geo includes (name, type, and shape) as well as what time period it covers (if any).

For StaticGeos this is all the info that's required.

For DynamicGeos, the spec also includes a mapping from attribute to the ADRIO source for that attribute, as well as any geographic scope which is needed for those ADRIOs. For instance, a DynamicGeo that loads population data from the US Census needs to know what granularity of data to fetch (state? county? census tract? etc.) as well as which of those to fetch (all of them? just the counties in Arizona? etc.).

GeoSpecs are Python objects which can be serialized to text, and deserialized from text back into Python objects. As such it is possible to work with them in either form. The goal is for GeoSpecs in text form to be human-readable, such that they could be included in a scientific publication. A knowledgeable epymorph user could look at the text and understand which data attributes were used and where they came from. Another goal is for GeoSpecs to be human-writable. At least for DynamicGeos, I should be able to easily add or modify attributes given an existing spec. Geo spec files should have the `.geo` extension.

_NOTE:_ For now, GeoSpecs are serialized to and from text using Python's jsonpickle format. This is for ease of development, but it has the drawback of being rather cumbersome for a human to read and write. We will very likely replace this JSON format with a custom serialization in the near future.
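As an illustration of the round-trip property only, here is a sketch that serializes a toy spec with the standard library's `json` module. It deliberately does not use jsonpickle and is not a proposal for the eventual custom format; `AttribSketch` and `SpecSketch` are hypothetical stand-ins for the real spec classes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AttribSketch:
    """Hypothetical attribute definition, with type and shape stored as text."""
    name: str
    dtype: str  # e.g. 'int64'
    shape: str  # e.g. 'N', 'NxN', 'TxN'


@dataclass(frozen=True)
class SpecSketch:
    """Hypothetical spec: which attributes exist and which year they cover."""
    attributes: list[AttribSketch]
    year: int

    def serialize(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @staticmethod
    def deserialize(text: str) -> 'SpecSketch':
        raw = json.loads(text)
        attributes = [AttribSketch(**a) for a in raw['attributes']]
        return SpecSketch(attributes, raw['year'])


spec = SpecSketch(
    attributes=[
        AttribSketch('label', 'str', 'N'),
        AttribSketch('population', 'int64', 'N'),
    ],
    year=2015,
)
text = spec.serialize()
round_tripped = SpecSketch.deserialize(text)
```

The text form can be edited by hand (say, to add an attribute) and then deserialized, which is the "human-writable" property described above.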

## The required attribute: 'label'

One particularly tricky detail is knowing exactly how many nodes are in a geo. If a DynamicGeo specifies "give me all of the census block groups in Arizona", epymorph itself has no way of knowing how many of those there are. Even more complicated, that number is subject to change over time! So we need some uniform way to find out how many geo nodes to expect. Purely as a convention, we require that every geo include a 'label' attribute which must have exactly one value per node, and we must be able to completely load this attribute at the start of the simulation (even if other data loading is delayed or partial).

Why 'label'? As an attribute it has several suitable qualities. For one, it should be possible to come up with some kind of label for every geo node imaginable. Two, the memory requirements of $N$ string values should always be manageable, assuming the simulation over $N$ nodes is itself manageable. And three, it's likely that we will want the full label set at some point anyway, for example, when graphing results. Hence, 'label' is a suitable "source of truth" and a proxy for the number of nodes in the geo.
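A short sketch of how 'label' can act as that proxy: read the node count from the label array, then check other attributes against it. The plain dictionary stands in for a Geo object, and the values are made up.

```python
import numpy as np

# Hypothetical geo data; a plain dict stands in for a Geo object.
geo = {
    'label': np.array(['node A', 'node B', 'node C'], dtype=np.str_),
    'population': np.array([10, 20, 30], dtype=np.int64),
    'commuters': np.zeros((3, 3), dtype=np.int64),
}

# 'label' has exactly one value per node, so its length is the node count N.
n_nodes = len(geo['label'])

# Every other attribute can then be checked against N.
shapes_ok = (
    geo['population'].shape == (n_nodes,)
    and geo['commuters'].shape == (n_nodes, n_nodes)
)
```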

## Pros and Cons

StaticGeos are perfectly reliable, but aren't easy to create. To author a StaticGeo from scratch, the user must fetch all data themselves and assemble it in the proper form. Furthermore, the entire dataset must fit in memory and it will all be loaded regardless of which attributes are used. We can only include so many of these with epymorph before they start to weigh down the package download unreasonably. However, StaticGeos are a perfect fit for quick ad-hoc experiments, small exemplar datasets, and testing.

DynamicGeos by contrast are super lightweight to transmit (just send their serialized spec!), but in order to use them you need a working connection to the third-party data sources. Depending on the data source, you may need to first arrange API keys, have a working, fast internet connection, hope the data source is online, avoid running into data access limits, and so on. Therefore a DynamicGeo is not guaranteed to be available. Nor is it guaranteed to be repeatable! Because we don't control third-party data sources, it's always possible they will revise their data from day to day, or disappear entirely. And if your connection to the data source is slow, it may take a long time to fetch all data needed to run your simulation.

To give users flexibility in using geos and balancing these pros and cons, features exist to convert a DynamicGeo into a StaticGeo, and to save the StaticGeo to a file. This way it can be stored for later, efficiently pushed to HPC clusters, or published to a scientific data archive.

## Caching

epymorph provides a set of command-line commands for managing the caching of geos on a user's machine. When asked to run a simulation from the command line using a DynamicGeo, epymorph checks the cache to see if it already has a static copy. If so, it uses that. This can save the user from costly or unpredictable data fetching operations. However, caching is opt-in: users must run the cache commands themselves, so that nothing is cached without their knowledge.

`epymorph cache {fetch,list,remove,clear}` are the supported operations.
- `fetch` attempts to cache a named dynamic geo.
- `list` prints which geos are currently cached.
- `remove` evicts a single geo from the cache by name. 
- `clear` deletes the entire cache.
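The "check the cache first" step might look roughly like the sketch below. The function `find_cached_geo`, the flat cache directory, and the `<name>.geo.tar` naming are all assumptions made for illustration; the real cache location and lookup logic may differ.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_cached_geo(cache_dir: Path, geo_name: str) -> Optional[Path]:
    """Return the cached archive for `geo_name`, or None if it was never fetched."""
    candidate = cache_dir / f'{geo_name}.geo.tar'
    return candidate if candidate.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)

    # Nothing cached yet: the caller would fall back to the DynamicGeo.
    miss = find_cached_geo(cache_dir, 'my-cool-geo')

    # Simulate an earlier, explicit `fetch` by creating the archive file.
    (cache_dir / 'my-cool-geo.geo.tar').touch()
    hit = find_cached_geo(cache_dir, 'my-cool-geo')
    hit_name = hit.name if hit is not None else None
```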

Geos are stored as `.geo.tar` files (this is the file format for a StaticGeo), which are simple archives containing the numpy compressed data (npz format) and the serialized spec file for the geo.
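For a feel of what such an archive involves, here is a sketch that packs a compressed npz payload and a spec file into a tar archive and reads the array back. The member names `data.npz` and `spec.geo`, the placeholder spec text, and the helper `add_bytes` are assumptions; this is not the layout written by epymorph's own file operations.

```python
import io
import tarfile
import tempfile
from pathlib import Path

import numpy as np


def add_bytes(tar: tarfile.TarFile, member_name: str, payload: bytes) -> None:
    """Add an in-memory payload to the archive under `member_name`."""
    info = tarfile.TarInfo(member_name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


# Made-up values, purely for illustration.
population = np.array([10, 20, 30], dtype=np.int64)

# Compress the arrays to npz in memory.
npz_buffer = io.BytesIO()
np.savez_compressed(npz_buffer, population=population)

with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / 'example.geo.tar'

    # Member names 'data.npz' and 'spec.geo' are assumptions for this sketch.
    with tarfile.open(archive_path, 'w') as tar:
        add_bytes(tar, 'data.npz', npz_buffer.getvalue())
        add_bytes(tar, 'spec.geo', b'{"placeholder": "serialized spec text"}')

    # Read the archive back and recover the array.
    with tarfile.open(archive_path, 'r') as tar:
        member_names = sorted(tar.getnames())
        npz_file = tar.extractfile('data.npz')
        with np.load(io.BytesIO(npz_file.read())) as data:
            restored = data['population']
```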

## Example: StaticGeo

Here is an example StaticGeo spec, pulled from the "2023-07-06.ipynb" dev log:

---
```python
spec = StaticGeoSpec(
    attributes=[
        LABEL, # we have a pre-defined AttribDef for label since it's required, equivalent to: `AttribDef('label', np.str_, Shapes.N)`
        AttribDef('geoid', np.str_, Shapes.N),
        AttribDef('centroid', CentroidDType, Shapes.N),
        AttribDef('population', np.int64, Shapes.N),
        AttribDef('commuters', np.int64, Shapes.NxN),
        AttribDef('humidity', np.float64, Shapes.TxN),
    ],
    time_period=Year(2015))
```
---

Because we specified this geo spec covers the year 2015, we can check that any time-series data ('humidity' in this case) do in fact provide 365 values as expected.
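That check can be sketched as follows, assuming one time step per day. The helpers `days_in_year` and `matches_txn` are hypothetical and not part of epymorph.

```python
import calendar

import numpy as np


def days_in_year(year: int) -> int:
    """Number of daily time steps T in a calendar year."""
    return 366 if calendar.isleap(year) else 365


def matches_txn(values: np.ndarray, year: int, n_nodes: int) -> bool:
    """Check that `values` has one row per day of `year` and one column per node."""
    return values.shape == (days_in_year(year), n_nodes)


# Placeholder data: 365 days by 4 nodes.
humidity = np.zeros((365, 4), dtype=np.float64)

ok_2015 = matches_txn(humidity, 2015, 4)
ok_2016 = matches_txn(humidity, 2016, 4)  # 2016 is a leap year
```

The same array passes for 2015 but fails for 2016, where 366 daily values would be needed.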

And if I create numpy arrays matching the above types and shapes, I could assemble the StaticGeo as follows:

---
```python
geo = StaticGeo(spec, {
    'label': label,
    'geoid': geoid,
    'centroid': centroid,
    'population': population,
    'commuters': commuters,
    'humidity': humidity,
})
```
---

I can also validate the geo against its own specification. This is not done automatically (e.g., in the class constructor). This is so epymorph can control the timing of validation, and avoid it when it's not necessary. But it's a good idea (if you are authoring your own geo) to make sure the validation passes.

---
```python
try:
    # This will raise a GeoValidationException if any attribute defined in the spec
    # is missing or malformed when compared to the values of each of those attributes.
    geo.validate()
except GeoValidationException as e:
    print(e.pretty())
```
---

## Example: DynamicGeo

Here is an example DynamicGeo spec, pulled from the "adrio_phase_2_demo.ipynb" dev log:

---
```python
spec = DynamicGeoSpec(
    attributes=[
        LABEL,
        AttribDef('population', np.int64, Shapes.N),
        AttribDef('population_by_age', np.int64, Shapes.NxA(3)),
        AttribDef('centroid', CentroidDType, Shapes.N),
        AttribDef('geoid', np.int64, Shapes.N),
        AttribDef('dissimilarity_index', np.float64, Shapes.N),
        AttribDef('median_income', np.int64, Shapes.N),
        AttribDef('pop_density_km2', np.float64, Shapes.N),
    ],
    time_period=Year(2015),
    geography=CensusGeography(
        granularity=Granularity.COUNTY,
        filter={
            'state': ['04', '08', '49', '35', '32'],
            'county': ['*'],
            'tract': ['*'],
            'block group': ['*'],
        }),
    source={
        'label': 'Census:name',
        'population': 'Census',
        'population_by_age': 'Census',
        'centroid': 'Census',
        'geoid': 'Census',
        'dissimilarity_index': 'Census',
        'median_income': 'Census',
        'pop_density_km2': 'Census',
    }
)
```
---

Notice how it provides many of the same fields as required in a StaticGeoSpec, but also a Geography object to specify that this is county-level data within five selected states (listed here by their geo ID).

Finally there's the source map, which determines exactly which ADRIOMaker should be used to load the various attributes (Census, in this case), and can even map a geo's attribute name to an ADRIO's attribute name in case they are different. The geo's 'label' attribute is the CensusADRIOMaker's 'name' attribute.
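To spell out how a source entry like `'Census:name'` can be read, here is a small sketch. The function `parse_source` is hypothetical, and the rule "no `:suffix` means the ADRIO attribute has the same name as the geo attribute" is my reading of the description above, not a confirmed parsing rule.

```python
def parse_source(geo_attribute: str, source: str) -> tuple[str, str]:
    """Split a source entry into (maker name, maker-side attribute name)."""
    maker, separator, maker_attribute = source.partition(':')
    # Without an explicit ':name' suffix, assume the names are the same.
    return maker, (maker_attribute if separator else geo_attribute)


source = {
    'label': 'Census:name',
    'population': 'Census',
}
resolved = {attr: parse_source(attr, src) for attr, src in source.items()}
```

Here `resolved['label']` is `('Census', 'name')`, while `resolved['population']` is `('Census', 'population')`.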

In order to instantiate a DynamicGeo, you must provide it with a dictionary containing the required ADRIOMakers. You can use epymorph's built-in ADRIOMaker library for this, or provide your own (though that's a bit outside of the scope of this document).

---
```python
from epymorph.geo.adrio import adrio_maker_library

dynamic_geo = DynamicGeo.from_library(spec, adrio_maker_library)
```
---

And remember you can use a DynamicGeo in a Simulation just like you can use a StaticGeo. The only difference is when and where each gets its data.

Now let's say I load some data using a DynamicGeo but I need to save it to a file to make sure I can access the exact same data later. First I want to convert my DynamicGeo to a StaticGeo (we have a utility function for that!). Then I can save the StaticGeo to a file:

---
```python
from pathlib import Path

# convert
static_geo = convert_to_static_geo(dynamic_geo)

# and save
filename = StaticGeoFileOps.to_archive_filename('my-cool-geo')
filepath = Path('/path/to/my/data') / filename
static_geo.save(filepath)
```
---

Which creates the file: `/path/to/my/data/my-cool-geo.geo.tar`
