## Assigning Anchorage Points to Anchorage Groups
This notebook describes how anchorage groups were created by applying the rule that any two anchorage points within 4 kilometers of one another constitute the same anchorage group. In some regions this will closely replicate **ports**.

In [None]:
import pandas as pd
from collections import defaultdict

### Identify pairs of neighboring anchorage points

Using the public BigQuery table, run the query below to get every pair of anchorages within 4 km of one another. The result is a table with ~2 million rows. The query can be run within BigQuery and the resulting table exported, or it can be run directly from Python.

There are ~100k anchorages, of which ~4k are not within 4 km of any other. These singlet anchorages are dropped in this process but are added back in the final dataset.

```
SELECT
  if(a.s2id > b.s2id, b.s2id, a.s2id ) s2id_1,
  if(a.s2id > b.s2id, a.s2id, b.s2id ) s2id_2,
FROM
  [world-fishing-827:gfw_raw.named_anchorages_20171106] a
CROSS JOIN
  [world-fishing-827:gfw_raw.named_anchorages_20171106] b
WHERE
  ACOS(COS(RADIANS(90-b.lat)) *COS(RADIANS(90-a.lat)) 
       +SIN(RADIANS(90-b.lat))*SIN(RADIANS(90-a.lat)) * COS(RADIANS(b.lon-a.lon)))*6371 < 4
  AND a.s2id != b.s2id
GROUP BY
  s2id_1, s2id_2
```
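The `WHERE` clause computes a great-circle distance with the spherical law of cosines, written in terms of colatitude (`90 - lat`). A minimal Python sketch of the same calculation can help when spot-checking individual pairs. The function names `distance_km` and `is_neighbor` are hypothetical and not part of the original notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same mean Earth radius as used in the query


def distance_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in km via the spherical law of cosines."""
    colat_a = math.radians(90 - lat_a)
    colat_b = math.radians(90 - lat_b)
    dlon = math.radians(lon_b - lon_a)
    cos_angle = (math.cos(colat_b) * math.cos(colat_a)
                 + math.sin(colat_b) * math.sin(colat_a) * math.cos(dlon))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_KM


def is_neighbor(lat_a, lon_a, lat_b, lon_b, threshold_km=4.0):
    """True if the two points fall under the 4 km rule used in the query."""
    return distance_km(lat_a, lon_a, lat_b, lon_b) < threshold_km
```

For example, two points 0.01 degrees of latitude apart are roughly 1.1 km from each other, so `is_neighbor(0, 0, 0.01, 0)` is true.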

If the query above was not run from a Python environment, export the resulting table and load it into Python as a DataFrame named `df`, which the cells below expect.
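One way to load an exported table is sketched below, assuming the export is a CSV file with the `s2id_1` and `s2id_2` columns produced by the query. The helper name `load_pairs` and the sample ids are hypothetical. Reading the ids as strings is a precaution so that they are not reinterpreted as numbers.

```python
import io

import pandas as pd


def load_pairs(source):
    """Read the exported pair table; `source` is a path or a file-like object."""
    # Read both id columns as strings so no id is parsed as a number
    pairs = pd.read_csv(source, dtype={"s2id_1": str, "s2id_2": str})
    # Drop any repeated pairs that survived the export
    return pairs[["s2id_1", "s2id_2"]].drop_duplicates()


# Stand-in for an exported file, with made-up ids
sample = io.StringIO("s2id_1,s2id_2\n0a1b,0a1c\n0a1c,0a1d\n0a1b,0a1c\n")
sample_df = load_pairs(sample)
```

With a real export, the same helper would be called with the file path and the result assigned to `df`.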

### Combine overlapping pairs

Use a union-find structure to efficiently merge pairs that share an anchorage point into groups.
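As a toy illustration of what merging overlapping pairs means, the sketch below uses a bare-bones dictionary-based union-find rather than the full class that follows. The function name `group_pairs` and the single-letter ids are hypothetical.

```python
def group_pairs(pairs):
    """Merge pairs that share a member into connected groups."""
    parent = {}

    def find(x):
        # Register unseen items as their own root
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # link the two sets

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


pairs = [("A", "B"), ("B", "C"), ("D", "E")]
print(group_pairs(pairs))  # [['A', 'B', 'C'], ['D', 'E']]
```

Grouping is transitive: `A` and `C` land in the same group through `B`, even though they were never paired directly and may be more than 4 km apart.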

In [None]:
"""UnionFind.py

Union-find data structure. Based on Josiah Carlson's code,
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/215912
with significant additional changes by D. Eppstein.
"""

class UnionFind(object):
    """Union-find data structure.

    Each unionFind instance X maintains a family of disjoint sets of
    hashable objects, supporting the following two methods:

    - X[item] returns a name for the set containing the given item.
      Each set is named by an arbitrarily-chosen one of its members; as
      long as the set remains unchanged it will keep the same name. If
      the item is not yet part of a set in X, a new singleton set is
      created for it.

    - X.union(item1, item2, ...) merges the sets containing each item
      into a single larger set.  If any item is not yet part of a set
      in X, it is added to X as one of the members of the merged set.
    """

    def __init__(self):
        """Create a new empty union-find structure."""
        self.weights = {}
        self.parents = {}

    def __getitem__(self, object):
        """Find and return the name of the set containing the object."""

        # check for previously unknown object
        if object not in self.parents:
            self.parents[object] = object
            self.weights[object] = 1
            return object

        # find path of objects leading to the root
        path = [object]
        root = self.parents[object]
        while root != path[-1]:
            path.append(root)
            root = self.parents[root]

        # compress the path and return
        for ancestor in path:
            self.parents[ancestor] = root
        return root
        
    def __iter__(self):
        """Iterate through all items ever found or unioned by this structure."""
        return iter(self.parents)

    def union(self, *objects):
        """Find the sets containing the objects and merge them all."""
        roots = [self[x] for x in objects]
        heaviest = max([(self.weights[r],r) for r in roots])[1]
        for r in roots:
            if r != heaviest:
                self.weights[heaviest] += self.weights[r]
                self.parents[r] = heaviest

In [None]:
def simplify_list(items):
    '''
    Take a list of lists and merge the lists that share any item,
    so simplify_list([[1,2,5],[5],[8]]) returns [[1,2,5],[8]]
    '''
    ufind = UnionFind()
    for x in items:
        ufind.union(*x)
    sets_by_name = defaultdict(set)
    for x in ufind:
        sets_by_name[ufind[x]].add(x)
    return sorted([sorted(x) for x in sets_by_name.values()])


In [None]:
# Build one [s2id_1, s2id_2] list per pair from the query result in `df`
rows = df[["s2id_1", "s2id_2"]].values.tolist()

#### Apply the UnionFind to the anchorage point pairs

In [None]:
anchorage_groups = simplify_list(rows)
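`simplify_list` returns a list of lists, so it needs to be flattened to one row per anchorage point before it can be loaded as a table. The sketch below is one possible way to do that. The column names `s2id` and `id` are inferred from the join query later in this notebook, and using the smallest member as the group id is an assumption, since the notebook does not say how the id was chosen. The helper names and sample ids are hypothetical.

```python
import csv
import io


def flatten_groups(groups):
    """Turn [[s2id, ...], ...] into (s2id, id) rows, one per anchorage point."""
    rows = []
    for group in groups:
        group_id = min(group)  # assumption: the smallest member names the group
        for s2id in group:
            rows.append((s2id, group_id))
    return rows


def write_groups_csv(groups, handle):
    """Write the flattened rows, with a header line, to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["s2id", "id"])
    writer.writerows(flatten_groups(groups))


# Made-up groups standing in for `anchorage_groups`
example = [["0a1b", "0a1c", "0a1d"], ["0b2a", "0b2b"]]
buffer = io.StringIO()
write_groups_csv(example, buffer)
```

Passing an open file instead of the `StringIO` buffer would produce a CSV suitable for upload.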

### Generate full dataset and include the singlet anchorage points

Upload the resulting `anchorage_groups` to BigQuery as a table, either through the UI or with the `bq load` command-line tool.
The anchorage groups can then be joined to the original dataset as follows.

```
SELECT
  anchor_points.s2id s2id,
  label,
  sublabel,
  lat,
  lon,
  iso3,
  IF(id IS NULL, anchor_points.s2id, id) anchorage_group -- if singlet anchorage, use s2id as group_id
FROM
  [world-fishing-827:gfw_raw.named_anchorages_20171116] anchor_points
LEFT JOIN
  [________.anchorages_grouped_4km] anchor_groups
ON
  anchor_points.s2id = anchor_groups.s2id
```
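For a quick local check of the join logic, the same left join with the singlet fallback can be sketched in pandas. The two small DataFrames and their values are made up for illustration. Only the column names `s2id`, `label`, `id`, and `anchorage_group` come from the query above.

```python
import pandas as pd

# Made-up stand-ins for the anchorage table and the grouped table
anchor_points = pd.DataFrame({
    "s2id": ["0a1b", "0a1c", "0c3f"],
    "label": ["PORT X", "PORT X", "PORT Y"],
})
anchor_groups = pd.DataFrame({
    "s2id": ["0a1b", "0a1c"],
    "id": ["0a1b", "0a1b"],
})

# LEFT JOIN keeps every anchorage point, grouped or not
merged = anchor_points.merge(anchor_groups, on="s2id", how="left")

# Singlet anchorages have no group id, so fall back to their own s2id
merged["anchorage_group"] = merged["id"].fillna(merged["s2id"])
merged = merged.drop(columns="id")
```

Here `0c3f` has no match in the grouped table, so it ends up as its own anchorage group.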