# Fleet Clustering

### Tim Hochberg, 2019-01-16

## Squid-Jigger Edition

We cluster vessels using HDBSCAN and a custom metric to derive fleets
whose members are related in the sense that they spend a lot of time in
the same location while at sea.

## See Also

* Other notebooks in https://github.com/GlobalFishingWatch/fleet-clustering for 
examples of clustering Squid Jiggers, etc.
* This workspace that Nate put together: https://globalfishingwatch.org/map/workspace/udw-v2-85ff8c4f-fbfe-4126-b067-4d94cdd2b737



## Open Questions

### Fleet Coherence Time

The current implementation does not take into account the coherence
time of a fleet. A vessel might be part of one fleet this season, but
move to another fleet the next season. One way to deal with this is to
cluster fleets over shorter time periods (6 months, for instance) and
then match fleets across periods by finding which previous fleet has
the largest overlap in vessels with each current fleet.
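
The matching step described above could look something like the sketch below. It is not part of the `fleet_clustering` package: `match_fleets` and the example fleets are hypothetical, and it assumes each period's clustering has already been turned into a mapping from fleet id to a set of vessel ids.

```python
def match_fleets(previous, current):
    """Map each current fleet id to the previous fleet id it overlaps most.

    `previous` and `current` map fleet id -> set of vessel ids (ssvid).
    A current fleet that shares no vessels with any previous fleet maps to None.
    """
    matches = {}
    for cur_id, cur_members in current.items():
        best_id, best_overlap = None, 0
        for prev_id, prev_members in previous.items():
            overlap = len(cur_members & prev_members)
            if overlap > best_overlap:
                best_id, best_overlap = prev_id, overlap
        matches[cur_id] = best_id
    return matches


# Hypothetical fleets from two consecutive 6-month windows.
first_half = {0: {"a", "b", "c", "d"}, 1: {"e", "f", "g"}}
second_half = {0: {"e", "f", "h"}, 1: {"a", "b", "c", "g"}, 2: {"x", "y"}}
matches = match_fleets(first_half, second_half)
print(matches)
```

Raw overlap counts favor large fleets; a normalized score such as Jaccard similarity may be a better choice when fleet sizes differ a lot.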

In [1]:
from __future__ import print_function
from __future__ import division
from collections import Counter, OrderedDict
import datetime as dt
import hdbscan
import logging
import matplotlib.pyplot as plt
import matplotlib.animation as mpl_animation
import numpy as np
import pandas as pd
from skimage import color
from IPython.display import HTML
from fleet_clustering import bq
from fleet_clustering import filters
from fleet_clustering import distances
from fleet_clustering import animation

## Load AIS Clustering Data

Load the AIS data that we use for clustering. Note that it only includes vessels away
from shore, so that we do not end up clustering on ports.

In [2]:
all_by_date = bq.load_ais_by_date('squid_jigger', dt.date(2017, 1, 1), dt.date(2017, 12, 31),
                                 fishing_only=False, min_km_from_shore=0, include_carriers=True,
                                 show_queries=True)    
pruned_by_date = {k : filters.remove_carriers(
                         filters.remove_near_shore(10,
                            filters.remove_chinese_coast(v))) for (k, v) in all_by_date.items()}
valid_ssvid = sorted(filters.find_valid_ssvid(pruned_by_date))

2017-01-01

    WITH 
    base as (
        SELECT ssvid, 
               EXTRACT(YEAR FROM timestamp) year,
               EXTRACT(MONTH FROM timestamp) month,
               EXTRACT(DAY FROM timestamp) day,
               lon,
               lat,
               TIMESTAMP_TRUNC(timestamp, MINUTE) AS minute_stamp,
               distance_from_shore_m / 1000.0 AS distance_from_shore_km
        FROM 
        `world-fishing-827.pipe_production_b.messages_scored_*`
        WHERE _TABLE_SUFFIX BETWEEN "20170101" AND "20170703"
        AND seg_id in (select seg_id from gfw_research.pipe_production_b_segs where good_seg)
        AND distance_from_shore_m >= 0
    ),
    thinned as (
        SELECT ssvid, year, month, day, 
               APPROX_QUANTILES(lon, 2)[OFFSET(1)] AS lon,
               APPROX_QUANTILES(lat, 2)[OFFSET(1)] AS lat,
               APPROX_QUANTILES(distance_from_shore_km, 2)[OFFSET(1)] AS distance_from_shore_km,
               minute_stamp
        FROM 
        base
    



2017-07-04

    WITH 
    base as (
        SELECT ssvid, 
               EXTRACT(YEAR FROM timestamp) year,
               EXTRACT(MONTH FROM timestamp) month,
               EXTRACT(DAY FROM timestamp) day,
               lon,
               lat,
               TIMESTAMP_TRUNC(timestamp, MINUTE) AS minute_stamp,
               distance_from_shore_m / 1000.0 AS distance_from_shore_km
        FROM 
        `world-fishing-827.pipe_production_b.messages_scored_*`
        WHERE _TABLE_SUFFIX BETWEEN "20170704" AND "20171231"
        AND seg_id in (select seg_id from gfw_research.pipe_production_b_segs where good_seg)
        AND distance_from_shore_m >= 0
    ),
    thinned as (
        SELECT ssvid, year, month, day, 
               APPROX_QUANTILES(lon, 2)[OFFSET(1)] AS lon,
               APPROX_QUANTILES(lat, 2)[OFFSET(1)] AS lat,
               APPROX_QUANTILES(distance_from_shore_km, 2)[OFFSET(1)] AS distance_from_shore_km,
               minute_stamp
        FROM 
        base
    

## Create Distance Metrics

Create an array of distance metrics. The details are still evolving, but in general
we want to deal with two things: days on which a boat is missing and days on which the
boat is away from the fleet.

* Distances to/from a boat on days when it is missing are represented by $\infty$ in 
  the distance matrix. HDBSCAN ignores these values.
* Only the closest N days are kept for each boat pair, allowing boats to leave the fleet
  for up to half the year without penalty.
  
In addition, distances have a floor of 1 km to prevent overclustering when boats tie
up together, etc.
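
To make these rules concrete, here is a minimal sketch of a pairwise distance that follows them. It is not the implementation of `distances.compute_distances_4`: the function names, the haversine distance, the averaging of the kept days, and the `keep_fraction` parameter are all assumptions made for illustration.

```python
import math

INF = float("inf")


def haversine_km(p, q):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, a)))


def pair_distance(track_a, track_b, keep_fraction=0.5, floor_km=1.0):
    """Illustrative distance between two vessels' daily positions.

    Each track has one entry per day: a (lon, lat) tuple, or None when the
    vessel is missing that day.
    """
    daily = []
    for p, q in zip(track_a, track_b):
        if p is None or q is None:
            daily.append(INF)  # missing day: infinite distance
        else:
            daily.append(max(haversine_km(p, q), floor_km))  # 1 km floor
    if not daily:
        return INF
    # Keep only the closest N days (assumed here: a fraction of all days).
    n_keep = max(1, int(len(daily) * keep_fraction))
    closest = sorted(daily)[:n_keep]
    return sum(closest) / len(closest)  # assumed aggregation: mean


# Hypothetical four-day tracks for two vessels.
a = [(0.0, 0.0), (0.0, 0.0), None, (0.0, 0.0)]
b = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), None]
d = pair_distance(a, b)
print(round(d, 1), "km")
```

With `keep_fraction=0.5`, a pair can be missing or far apart on up to half of the days before those days start to affect the result.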

In [3]:
import imp; imp.reload(distances)
C = distances.create_composite_lonlat_array(pruned_by_date, valid_ssvid)
dists = distances.compute_distances_4(C, gamma=2)

## Load Carrier Data

In [4]:
carriers_by_date = bq.load_carriers_by_year(2017, 2018)
pruned_carriers_by_date = {k : filters.remove_chinese_coast(v) for (k, v) in carriers_by_date.items()}
query = """
               SELECT CAST(mmsi AS STRING) FROM
               `world-fishing-827.vessel_database.all_vessels_20190102`
               WHERE  iscarriervessel AND confidence = 3
        """
valid_carrier_ssvid_df = pd.read_gbq(query, dialect='standard', project_id='world-fishing-827')
valid_carrier_ssvid = valid_carrier_ssvid_df.f0_
valid_carrier_ssvid_set = set(valid_carrier_ssvid)

## Load Encounters Data And Country Codes

These are used to filter the carrier vessels down to only those
that meet with jiggers, and to add ISO3 country labels to the outputs.

In [5]:
encounters = bq.load_carriers(2017, 2017)

In [6]:
query = """
SELECT code, iso3 FROM `world-fishing-827.gfw_research.country_codes`"""
country_codes_df = pd.read_gbq(query, dialect='standard', project_id='world-fishing-827')
iso3_map = {x.code : x.iso3 for x in country_codes_df.itertuples()}

## Fit the Clusterer

This is pretty straightforward; all the complicated stuff is
embedded in the matrix computations. Fleet size can be tweaked
using `min_cluster_size` and `min_samples`.

In [7]:
clusterer = hdbscan.HDBSCAN(metric='precomputed', 
                            min_cluster_size=10,
                           )
clusterer.fit(dists)

HDBSCAN(algorithm='best', allow_single_cluster=False, alpha=1.0,
    approx_min_span_tree=True, cluster_selection_method='eom',
    core_dist_n_jobs=4, gen_min_span_tree=False, leaf_size=40,
    match_reference_implementation=False, memory=Memory(cachedir=None),
    metric='precomputed', min_cluster_size=10, min_samples=None, p=None,
    prediction_data=False)
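
After fitting, `clusterer.labels_` holds one integer label per vessel, and HDBSCAN uses `-1` for points it leaves unclustered. A quick way to check how a choice of `min_cluster_size` played out is to count vessels per label. The sketch below uses hypothetical labels rather than the real clustering output, and `fleet_sizes` is a helper invented for this example.

```python
from collections import Counter


def fleet_sizes(labels):
    """Count vessels per fleet label, ignoring unclustered points (label -1)."""
    return Counter(int(label) for label in labels if label >= 0)


# Hypothetical labels, standing in for `clusterer.labels_`.
example_labels = [0, 0, 1, -1, 0, 1, -1, 2]
sizes = fleet_sizes(example_labels)
n_ungrouped = sum(1 for label in example_labels if label < 0)

for fleet_id, n_vessels in sizes.most_common():
    print("fleet", fleet_id, "has", n_vessels, "vessels")
print(n_ungrouped, "vessels ungrouped")
```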

## Set up Fleets

Set up the fleets for viewing.

In [8]:
all_fleet_ssvid_set = set([s for (s, f) in zip(valid_ssvid, clusterer.labels_) if f >= 0])
valid_ssvid_set = set(valid_ssvid)
all_longline_reefer_ssvid_set = set()
for x in encounters.itertuples():
    if x.ssvid_1 in all_fleet_ssvid_set and x.ssvid_2 in valid_carrier_ssvid_set:
        all_longline_reefer_ssvid_set.add(x.ssvid_2)
    if x.ssvid_2 in all_fleet_ssvid_set and x.ssvid_1 in valid_carrier_ssvid_set:
        all_longline_reefer_ssvid_set.add(x.ssvid_1)
all_longline_reefer_ssvid = sorted(all_longline_reefer_ssvid_set)

carrier_ids = [x for x in all_longline_reefer_ssvid if x not in valid_ssvid_set]
joint_ssvid = valid_ssvid + sorted(carrier_ids) 
labels = list(clusterer.labels_) + [max(clusterer.labels_) + 1] * len(carrier_ids) 

In [9]:
counts = []
skip = [2, 9, 12, 4] 
for i in range(max(labels) + 1):
    if i in skip:
        counts.append(0)
    else:
        counts.append((np.array(labels) == i).sum())
        
fleet_ids = [x for x in np.argsort(counts)[::-1] if counts[x] > 0]
fleet_ids_without_carriers = [x for x in fleet_ids if x != max(labels)]

print(len(fleet_ids), "fleets")
fleets = OrderedDict()
n_hues = int(np.ceil(len(fleet_ids) / 4.0))
used = set()
for i, fid in enumerate(fleet_ids_without_carriers):
    b = (i // (2 * n_hues)) % 2
    c = (i // 2)% n_hues
    d = i  % 2
    symbol = 'o^'[d]
    assert (b, c, d) not in used, (i, b, c, d)
    used.add((b, c, d))
    sat = 1
    val = 1
    hue = c / float(n_hues)
    assert 0 <= hue < 1, hue
    [[clr]] = color.hsv2rgb([[(hue, sat, val)]])
    fg = [(0, 0, 0), clr][b]
    bg = [clr, (1, 1, 1)][b]
    w = [1, 2][b]
    sz = [7, 7][b]
    fleets[fid] = (symbol, tuple(fg), tuple(bg), sz, w,  str(i + 1))
fleets[max(labels)] = ('1', 'k', 'k', 8, 2, 'Carrier Vessel')

3 fleets


## Create Animations

In [10]:
anim = animation.make_anim(joint_ssvid, 
                           labels, 
                           all_by_date, 
                           interval=10,
                           fleets=fleets, 
                           show_ungrouped=True,
                           alpha=1,
                           legend_cols=8,
                           ungrouped_legend="Ungrouped")
HTML(anim.to_html5_video())



In [11]:
anim = animation.make_anim(joint_ssvid, 
                           labels, 
                           all_by_date, 
                           interval=1,
                           fleets=fleets, 
                           show_ungrouped=True,
                           alpha=1,
                           legend_cols=8,
                           ungrouped_legend="Ungrouped")
Writer = mpl_animation.writers['ffmpeg']
writer = Writer(fps=8, metadata=dict(artist='Me'), bitrate=1800)
anim.save('fleet_squid.mp4', writer=writer)

## List Fleet Composition

In [12]:
for fid, v in fleets.items():
    label = v[-1]
    mask = (fid == np.array(labels))
    ssvids = np.array(joint_ssvid)[mask]
    mids = [x[:3] for x in ssvids]
    countries = [iso3_map.get(float(x), x) for x in mids]
    c = Counter(countries)
    print('Fleet: {} ({})'.format(label, fid), label)
    for country, count in c.most_common():
        print('\t', country, ':', count)

Fleet: 1 (0) 1
	 CHN : 480
	 TWN : 66
	 ARG : 64
	 KOR : 41
	 200 : 5
	 ALB : 3
	 VUT : 3
	 CYP : 1
	 ESP : 1
	 288 : 1
	 527 : 1
	 BOL : 1
	 RUS : 1
	 COK : 1
Fleet: 2 (1) 2
	 CHN : 20
	 600 : 1
Fleet: Carrier Vessel (3) Carrier Vessel
	 PAN : 37
	 RUS : 8
	 CHN : 6
	 LBR : 5
	 TWN : 4
	 KIR : 2


In [13]:
fleets

OrderedDict([(0, ('o', (0, 0, 0), (1.0, 0.0, 0.0), 7, 1, '1')),
             (1, ('^', (0, 0, 0), (1.0, 0.0, 0.0), 7, 1, '2')),
             (3, ('1', 'k', 'k', 8, 2, 'Carrier Vessel'))])