# Matching and linking datasets in space

This notebook shows how to use the [geomatcher](https://wurst.readthedocs.io/technical.html#wurst.geo.Geomatcher) object in [Wurst](https://github.com/IndEcol/wurst).

## Consistent global topology

The foundation for the `geomatcher` is a consistent global topology, built using the [Constructive Geometries](https://bitbucket.org/cmutel/constructive-geometries) repository, and provided in Python using a [separate library](https://bitbucket.org/cmutel/py-constructive-geometries). The base data is from [Natural Earth](http://www.naturalearthdata.com/), though there have been a number of fixes and added locations.

A [topology](https://postgis.net/docs/Topology.html) is useful because it comes with guarantees of consistency, as each edge is only stored once. This allows us to do GIS operations using set algebra on the ids of topological faces. Here is a part of the world, as provided in `Constructive Geometries`:

<img src="images/japan.png">

As you can see, each polygon is a face, no matter how big or small it is.
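To make the idea of "GIS operations as set algebra" concrete, here is a minimal sketch using plain Python sets. The region names and face ids are invented for illustration and are not taken from the real topology.

```python
# Toy face ids, invented for illustration; these are not real topology data.
faces = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {1, 2},
}

def intersects(x, y):
    """True if the two regions share at least one face."""
    return bool(faces[x] & faces[y])

def contains(x, y):
    """True if every face of y is also a face of x."""
    return faces[y] <= faces[x]

# The overlap of two regions is simply the intersection of their face sets.
overlap = faces["A"] & faces["B"]
print(overlap, intersects("B", "C"), contains("A", "C"))
```

Because every face is stored once and identified by an integer, no geometry has to be computed for these comparisons.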

## Retrieving face ids

As a first try, you can retrieve the faces associated with any given location:

In [2]:
from wurst import *
list(geomatcher['JP'])[:10]

[6144, 6145, 6146, 6147, 6148, 6149, 6150, 6151, 6152, 6153]

Geospatial definitions are namespaced, except for countries. Countries are identified by their two-letter ISO codes; all other locations should be referenced by a tuple of namespace and identifier:

In [3]:
list(geomatcher[('ecoinvent', 'NAFTA')])[:10]

[2048, 1, 2, 3, 4, 5, 6, 7, 8, 9]

You can also set a default namespace; initially it is `"ecoinvent"`. The IMAGE regions are loaded by default but must be retrieved explicitly, while ecoinvent regions are found automatically unless the default namespace is changed.

In [5]:
list(geomatcher[('IMAGE', 'Oceania')])[:10]

[5194, 5195, 5196, 5197, 5198, 5199, 5200, 5201, 5202, 5203]

In [7]:
list(geomatcher['NAFTA'])[:10]

[2048, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [6]:
geomatcher['Oceania']

KeyError: "Can't find this location"

In [8]:
geomatcher.default_namespace = 'IMAGE'

In [10]:
list(geomatcher['Oceania'])[:10]

[5194, 5195, 5196, 5197, 5198, 5199, 5200, 5201, 5202, 5203]
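The lookup behaviour shown above can be imitated with a small stand-in class. This is only a sketch of the observable behaviour: `ToyMatcher` and its data are hypothetical and do not reproduce how Wurst resolves keys internally.

```python
class ToyMatcher:
    """Hypothetical stand-in for a namespaced lookup; not Wurst's actual class."""

    def __init__(self, topology, default_namespace):
        self.topology = topology
        self.default_namespace = default_namespace

    def __getitem__(self, key):
        # Exact keys (country codes or explicit tuples) are returned directly.
        if key in self.topology:
            return self.topology[key]
        # Otherwise, try a bare string inside the default namespace.
        namespaced = (self.default_namespace, key)
        if isinstance(key, str) and namespaced in self.topology:
            return self.topology[namespaced]
        raise KeyError("Can't find this location")


# Invented face ids for illustration only.
toy = ToyMatcher(
    {
        "JP": {10, 11},
        ("ecoinvent", "NAFTA"): {1, 2},
        ("IMAGE", "Oceania"): {20},
    },
    default_namespace="ecoinvent",
)
nafta = toy["NAFTA"]            # found through the default namespace
toy.default_namespace = "IMAGE"
oceania = toy["Oceania"]        # now found without an explicit tuple
```

Note that after switching the default namespace in this sketch, the bare string `"NAFTA"` no longer resolves, although the explicit tuple `("ecoinvent", "NAFTA")` still does.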

Finally, by default `geomatcher` will also look up country names, using [country converter](https://github.com/konstantinstadler/country_converter), so a name such as `'Japan'` resolves even though it is not itself a key in the topology.

In [11]:
'Japan' in geomatcher.topology

False

In [12]:
list(geomatcher['Japan'])[:10]

[6144, 6145, 6146, 6147, 6148, 6149, 6150, 6151, 6152, 6153]

## GIS operations: intersection, contained, within

Geomatcher allows you to do quick GIS calculations.

In [13]:
geomatcher.intersects("US")

[('ecoinvent', 'UN-AMERICAS'),
 ('ecoinvent', 'RNA'),
 ('ecoinvent', 'NAFTA'),
 ('ecoinvent', 'IAI Area 2, North America'),
 ('ecoinvent', 'IAI Area 2, without Quebec'),
 ('IMAGE', 'USA'),
 ('ecoinvent', 'US-ASCC'),
 ('ecoinvent', 'NPCC'),
 ('ecoinvent', 'US-NPCC'),
 ('ecoinvent', 'US-HICC'),
 ('ecoinvent', 'WECC'),
 ('ecoinvent', 'US-WECC'),
 ('ecoinvent', 'US-SERC'),
 ('ecoinvent', 'US-RFC'),
 ('ecoinvent', 'MRO'),
 ('ecoinvent', 'US-TRE'),
 ('ecoinvent', 'US-FRCC'),
 ('ecoinvent', 'US-MRO'),
 ('ecoinvent', 'US-SPP')]

`contained` gets all locations that are *completely within* this location, whereas `within` gets all locations that *completely contain* this location.

In [15]:
geomatcher.contained("US")

['US',
 ('ecoinvent', 'US-ASCC'),
 ('ecoinvent', 'US-NPCC'),
 ('ecoinvent', 'US-HICC'),
 ('ecoinvent', 'US-WECC'),
 ('ecoinvent', 'US-SERC'),
 ('ecoinvent', 'US-RFC'),
 ('ecoinvent', 'US-FRCC'),
 ('ecoinvent', 'US-MRO'),
 ('ecoinvent', 'US-SPP')]

In [16]:
geomatcher.within("US")

[('ecoinvent', 'UN-AMERICAS'),
 ('ecoinvent', 'RNA'),
 ('ecoinvent', 'NAFTA'),
 ('ecoinvent', 'IAI Area 2, North America'),
 ('ecoinvent', 'IAI Area 2, without Quebec'),
 ('IMAGE', 'USA'),
 'US']

For all three operations, you can exclude the input location itself from the results:

In [18]:
geomatcher.within("US", include_self=False)

[('ecoinvent', 'UN-AMERICAS'),
 ('ecoinvent', 'RNA'),
 ('ecoinvent', 'NAFTA'),
 ('ecoinvent', 'IAI Area 2, North America'),
 ('ecoinvent', 'IAI Area 2, without Quebec'),
 ('IMAGE', 'USA')]

You can also change the sorting order, which is biggest first by default.

In [20]:
geomatcher.within("US", include_self=False, biggest_first=False)

[('IMAGE', 'USA'),
 ('ecoinvent', 'IAI Area 2, without Quebec'),
 ('ecoinvent', 'IAI Area 2, North America'),
 ('ecoinvent', 'NAFTA'),
 ('ecoinvent', 'RNA'),
 ('ecoinvent', 'UN-AMERICAS')]
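The relationship between `contained`, `within`, `include_self`, and `biggest_first` can be sketched with subset and superset tests on toy data. The helper `query` and the face ids are hypothetical, and measuring "size" by the number of faces is an assumption made for this sketch.

```python
# Invented face ids; region names echo the notebook but the data are toy values.
topology = {
    "US": {1, 2, 3},
    "US-WECC": {1},
    "US-SERC": {2, 3},
    "NAFTA": {1, 2, 3, 4, 5},
    "CA": {4},
}

def query(key, predicate, include_self=True, biggest_first=True):
    """Return region names whose faces satisfy predicate(candidate, target)."""
    target = topology[key]
    hits = [
        name
        for name, faces in topology.items()
        if predicate(faces, target) and (include_self or name != key)
    ]
    # Assumption: size is approximated by the number of faces.
    return sorted(hits, key=lambda name: len(topology[name]), reverse=biggest_first)

def contained(key, **kwargs):
    # Candidates lying completely within the target: subset test.
    return query(key, lambda faces, target: faces <= target, **kwargs)

def within(key, **kwargs):
    # Candidates completely containing the target: superset test.
    return query(key, lambda faces, target: faces >= target, **kwargs)

print(contained("US"))
print(within("US", include_self=False))
print(within("US", biggest_first=False))
```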

Finally, you can ask for a list where none of the regions overlap:

In [3]:
geomatcher.intersects("US", biggest_first=False, exclusive=True)

[('ecoinvent', 'US-FRCC'),
 ('ecoinvent', 'US-MRO'),
 ('ecoinvent', 'US-SPP'),
 ('ecoinvent', 'US-TRE'),
 ('ecoinvent', 'US-RFC'),
 ('ecoinvent', 'US-SERC'),
 ('ecoinvent', 'US-WECC'),
 ('ecoinvent', 'US-HICC'),
 ('ecoinvent', 'US-NPCC'),
 ('ecoinvent', 'US-ASCC')]
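One way to obtain a non-overlapping list is a greedy pass over the sorted candidates, keeping a region only if none of its faces has been claimed yet. The sketch below illustrates that idea on invented data; `exclusive_filter` is a hypothetical helper, and this is not presented as the library's own algorithm.

```python
def exclusive_filter(ordered_names, topology):
    """Keep each region only if it shares no face with regions already kept."""
    used, kept = set(), []
    for name in ordered_names:
        faces = topology[name]
        if faces.isdisjoint(used):
            kept.append(name)
            used |= faces
    return kept

# Invented face ids for illustration only.
topology = {
    "NAFTA": {1, 2, 3, 4},
    "US-WECC": {1},
    "US-SERC": {2, 3},
    "CA": {4},
}

# Smallest first: fine-grained regions claim their faces before larger ones.
smallest_first = sorted(topology, key=lambda name: len(topology[name]))
result = exclusive_filter(smallest_first, topology)

# Biggest first: the largest region claims everything it covers.
biggest_first = sorted(topology, key=lambda name: len(topology[name]), reverse=True)
result_big = exclusive_filter(biggest_first, topology)
```

The sort order matters: in this toy data, starting from the smallest regions yields the three small regions, while starting from the biggest yields only the single region that covers them all.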

## Splitting faces

Say we wanted to split the island of Honshu and develop a separate inventory for Tokyo. From our graphic above, we know that Honshu has face number 6247. So, we can split this face into two new ones - one of which we will consider Tokyo, and the other the "Rest of Honshu".

In [4]:
first, second = geomatcher.split_face(6247)
first, second

(7162, 7163)

In [7]:
geomatcher.topology[("example", "Rest of Honshu")] = {first}
geomatcher.topology[("example", "Tokyo")] = {second}

In [8]:
geomatcher.contained("JP")

['JP', ('IMAGE', 'Japan'), ('example', 'Rest of Honshu'), ('example', 'Tokyo')]

`split_face` also supports the arguments `number` (number of new faces to create), and `ids` (integer values for the new ids to create). If `ids` is passed, `number` is ignored.
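Conceptually, splitting a face means replacing one id with several new ids in every region that contains it. The following is a minimal sketch of that bookkeeping on toy data; the function below is a hypothetical standalone helper that mirrors the documented `number` and `ids` arguments, not the library's implementation.

```python
def split_face(topology, all_faces, face, number=2, ids=None):
    """Replace `face` with new ids in every region that contains it."""
    if face not in all_faces:
        raise ValueError(f"Unknown face: {face}")
    if ids is not None:
        new_ids = set(ids)          # explicit ids take precedence over `number`
    else:
        start = max(all_faces) + 1  # continue numbering after the largest id
        new_ids = set(range(start, start + number))
    for faces in topology.values():
        if face in faces:
            faces.discard(face)
            faces.update(new_ids)
    all_faces.discard(face)
    all_faces.update(new_ids)
    return new_ids

# Invented face ids for illustration only.
topology = {"JP": {1, 2, 3}, "Asia": {1, 2, 3, 4}, "KR": {4}}
all_faces = {1, 2, 3, 4}
new_ids = split_face(topology, all_faces, 3)
```

Every region that previously contained face 3 now contains both new faces, so relationships such as "JP is contained in Asia" in this toy data are preserved after the split.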

## Adding new topologies

You can also add new topologies to support custom spatial systems. This is how the IMAGE regions are added in Wurst:

    geomatcher.add_definitions(IMAGE_TOPOLOGY, "IMAGE")
    
New topologies can be either relative (default) or not. Relative topologies are defined by reference to regions already in the topology:

    {"Russia Region": [
        "AM",
        "AZ",
        "GE",
        "RU"
    ]}
    
Non-relative topologies must be defined by sets of integer face ids:

    {
        'A': {1, 2, 3},
        'B': {2, 3, 4},
    }
    
Regions added by `add_definitions` will be namespaced with the second argument passed to the function.
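The difference between relative and non-relative definitions can be sketched as follows: a relative definition is resolved to the union of the faces of the regions it names, while a non-relative one supplies the face ids directly. The function and data below are hypothetical and work on a plain dictionary rather than the real `geomatcher`.

```python
def add_definitions(topology, definitions, namespace, relative=True):
    """Register new regions under (namespace, name) keys in a toy topology."""
    for name, members in definitions.items():
        if relative:
            # Union of the faces of every referenced region.
            faces = set().union(*(topology[m] for m in members))
        else:
            # Members are already face ids.
            faces = set(members)
        topology[(namespace, name)] = faces

# Invented face ids for illustration only.
topology = {"AM": {1}, "AZ": {2}, "GE": {3}, "RU": {4, 5}}

add_definitions(topology, {"Russia Region": ["AM", "AZ", "GE", "RU"]}, "toy")
add_definitions(topology, {"A": {1, 2, 3}, "B": {2, 3, 4}}, "raw", relative=False)

print(topology[("toy", "Russia Region")])
```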

In [11]:
topoz = {"Black Sea": [
    "RO",
    "TR",
    "UA",
    "GE",
    "BG",
    "RU",
]}
geomatcher.add_definitions(topoz, "just added")

In [12]:
geomatcher.contained(("just added", "Black Sea"))

[('just added', 'Black Sea'),
 'RU',
 ('ecoinvent', 'Russia (Asia)'),
 ('ecoinvent', 'Russia (Europe)'),
 'TR',
 ('IMAGE', 'Turkey'),
 'UA',
 'BG',
 'GE',
 'RO']