# Fleet Clustering

### Tim Hochberg, 2019-01-16

## Longliner Edition

We cluster vessels using HDBSCAN and a custom metric to derive fleets
that are related in the sense that they spend a lot of time in the same
location while at sea.

## See Also

See the other notebooks in https://github.com/GlobalFishingWatch/fleet-clustering for
examples of clustering squid jiggers, etc.

## Open Questions

### Fleet Coherence Time

The current implementation does not account for the coherence time
of a fleet. A vessel might be part of one fleet this season but move
to another fleet the next season. One way to deal with this is to
group fleets over shorter time periods (6 months, for instance) and
then match fleets across groupings by finding which previous fleet
has the largest overlap with each current fleet.
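
The matching step could be sketched as follows. This is only an illustration, not part of the current implementation: the function name `match_fleets` and the layout of its inputs (a dict mapping fleet id to a set of vessel ids, one dict per time period) are assumptions made for the example.

```python
def match_fleets(prev_fleets, curr_fleets):
    """Map each current fleet id to the previous fleet id sharing the most vessels.

    Both arguments are dicts of fleet id -> set of vessel ids (hypothetical layout).
    A current fleet with no overlap at all maps to None.
    """
    matches = {}
    for curr_id, curr_members in curr_fleets.items():
        best_id, best_overlap = None, 0
        for prev_id, prev_members in prev_fleets.items():
            overlap = len(curr_members & prev_members)
            if overlap > best_overlap:
                best_id, best_overlap = prev_id, overlap
        matches[curr_id] = best_id
    return matches


# Made-up vessel ids for two consecutive 6 month periods.
first_half = {0: {"a", "b", "c"}, 1: {"d", "e"}}
second_half = {0: {"d", "e", "f"}, 1: {"a", "b"}, 2: {"x", "y"}}
matches = match_fleets(first_half, second_half)
print(matches)
```

Note that this greedy version lets two current fleets map to the same previous fleet, which may or may not be desirable when a fleet splits.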

In [1]:
from __future__ import print_function
from __future__ import division
from collections import Counter, OrderedDict
import datetime as dt
import hdbscan
import logging
import matplotlib.pyplot as plt
import matplotlib.animation as mpl_animation
import numpy as np
import pandas as pd
from skimage import color
from IPython.display import HTML
from fleet_clustering import bq
from fleet_clustering import filters
from fleet_clustering import distances
from fleet_clustering import animation

## Load AIS Clustering Data

Load the AIS data that we use for clustering. Note that it only includes vessels away
from shore, so as to exclude clustering on ports.

In [2]:
ais_by_date = bq.load_ais_by_date('drifting_longlines', dt.date(2017, 1, 1), dt.date(2017, 12, 31),
                                 fishing_only=False, min_km_from_shore=10)
pruned_by_date = {k : filters.remove_chinese_coast(v) for (k, v) in ais_by_date.items()}
valid_ssvid = sorted(filters.find_valid_ssvid(pruned_by_date))

2017-01-01




## Create Distance Metrics

Create an array of distance metrics. The details are still evolving, but in general
we want to handle two cases: days on which a boat is missing, and days on which the
boat is away from the fleet.

* Distances to/from a boat on days when it is missing are represented by $\infty$ in 
  the distance matrix. HDBSCAN ignores these values.
* Only the closest N days are kept for each boat pair, allowing boats to leave the fleet
  for up to half the year without penalty.
  
In addition, distances have a floor of 1 km to prevent overclustering when boats tie
up together, etc.
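
To make these rules concrete, here is a minimal sketch of the distance between a single pair of boats. It is not the code in `distances.compute_distances_3`, whose internals are not shown here. The function name `pair_distance`, the use of `None` for missing days, the root-mean-square combination, and applying the floor before selecting the closest days are all assumptions made for illustration.

```python
import math


def pair_distance(daily_km, days=180, min_clip=1.0):
    """Combine per-day distances (km) between two boats into one value.

    daily_km holds one distance per day, with None on days when either boat
    is missing. Assumed behavior: floor each distance at min_clip, keep the
    `days` smallest values, and return their root mean square.
    """
    present = [max(d, min_clip) for d in daily_km if d is not None]
    if not present:
        return math.inf  # the boats never appear on the same day
    closest = sorted(present)[:days]
    return math.sqrt(sum(d * d for d in closest) / len(closest))


# Five days: tied up together, missing, nearby, far away, nearby.
daily = [0.2, None, 3.0, 500.0, 4.0]
d = pair_distance(daily, days=3)
print(round(d, 3))
```

With `days=3`, the 500 km day is dropped and the 0.2 km day is raised to the 1 km floor, so only the values 1, 3 and 4 contribute.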

In [3]:
C = distances.create_composite_lonlat_array(pruned_by_date, valid_ssvid)
dists = distances.compute_distances_3(C, days=180, min_clip=1)



## Load Carrier Data

In [4]:
carriers_by_date = bq.load_carriers_by_year(2017, 2018)
pruned_carriers_by_date = {k : filters.remove_chinese_coast(v) for (k, v) in carriers_by_date.items()}
query = """
               SELECT CAST(mmsi AS STRING) FROM
               `world-fishing-827.vessel_database.all_vessels_20190102`
               WHERE  iscarriervessel AND confidence = 3
        """
valid_carrier_ssvid_df = pd.read_gbq(query, dialect='standard', project_id='world-fishing-827')
valid_carrier_ssvid = valid_carrier_ssvid_df.f0_
valid_carrier_ssvid_set = set(valid_carrier_ssvid)

## Load AIS Data for Plotting

This is the data used for plotting, as opposed to clustering.

In [5]:
all_ais_by_date = bq.load_ais_by_date('drifting_longlines', dt.date(2017, 1, 1), dt.date(2017, 12, 31),
                                 fishing_only=False, min_km_from_shore=-1, include_carriers=True)

2017-01-01


In [6]:
all_pruned_by_date = {k : filters.remove_chinese_coast(v) for (k, v) in all_ais_by_date.items()}

## Fit the Clusterer

This is pretty straightforward; all the complicated stuff is
embedded in the matrix computations. Fleet size can be tweaked
using `min_cluster_size` and `min_samples`.

In [9]:
clusterer = hdbscan.HDBSCAN(metric='precomputed', 
                            min_cluster_size=11,
                           )
clusterer.fit(dists)

HDBSCAN(algorithm='best', allow_single_cluster=False, alpha=1.0,
    approx_min_span_tree=True, cluster_selection_method='eom',
    core_dist_n_jobs=4, gen_min_span_tree=False, leaf_size=40,
    match_reference_implementation=False, memory=Memory(cachedir=None),
    metric='precomputed', min_cluster_size=11, min_samples=None, p=None,
    prediction_data=False)

## Load Encounters Data

This is used to filter the carrier vessels down to only those
that meet with longliners.

In [10]:
encounters = bq.load_carriers(2017, 2017)
all_fleet_ssvid_set = set([s for (s, f) in zip(valid_ssvid, clusterer.labels_) if f >= 0])
valid_ssvid_set = set(valid_ssvid)
all_longline_reefer_ssvid_set = set()
for x in encounters.itertuples():
    if x.ssvid_1 in all_fleet_ssvid_set and x.ssvid_2 in valid_carrier_ssvid_set:
        all_longline_reefer_ssvid_set.add(x.ssvid_2)
    if x.ssvid_2 in all_fleet_ssvid_set and x.ssvid_1 in valid_carrier_ssvid_set:
        all_longline_reefer_ssvid_set.add(x.ssvid_1)
all_longline_reefer_ssvid = sorted(all_longline_reefer_ssvid_set)

# Skip carriers already present in valid_ssvid so joint_ssvid has no duplicate ids.
carrier_ids = [x for x in all_longline_reefer_ssvid if x not in valid_ssvid_set]
joint_ssvid = valid_ssvid + sorted(carrier_ids) 
labels = list(clusterer.labels_) + [max(clusterer.labels_) + 1] * len(carrier_ids) 
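
The filtering logic in this cell can be checked on toy data. The sketch below uses made-up ids and a hypothetical helper, `carriers_meeting_fleets`, that takes encounters as plain `(id_1, id_2)` pairs instead of a DataFrame.

```python
def carriers_meeting_fleets(encounters, fleet_ids, carrier_ids):
    """Return the sorted carrier ids that meet at least one clustered vessel.

    encounters is an iterable of (id_1, id_2) pairs; either side of a pair
    may be the carrier, so both orderings are checked.
    """
    met = set()
    for id_1, id_2 in encounters:
        if id_1 in fleet_ids and id_2 in carrier_ids:
            met.add(id_2)
        if id_2 in fleet_ids and id_1 in carrier_ids:
            met.add(id_1)
    return sorted(met)


# Made-up ids: L* are longliners, C* are carriers, X9 is some other vessel.
toy_encounters = [("L1", "C1"), ("C2", "L2"), ("L3", "C3"), ("X9", "C4")]
toy_fleet_ids = {"L1", "L2"}  # L3 was not assigned to any fleet
toy_carrier_ids = {"C1", "C2", "C3", "C4"}
met = carriers_meeting_fleets(toy_encounters, toy_fleet_ids, toy_carrier_ids)
print(met)
```

Only `C1` and `C2` survive: `C3` met a longliner that belongs to no fleet, and `C4` met a vessel that is not a longliner at all.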

## Set up Fleets

Set up the fleets for viewing.

In [11]:
counts = []
skip = [1, 6, 7, 9, 10, 23, 25]
for i in range(max(labels) + 1):
    if i in skip:
        counts.append(0)
    else:
        counts.append((np.array(labels) == i).sum())
        
fleet_ids = [x for x in np.argsort(counts)[::-1] if counts[x] > 0]
fleet_ids_without_carriers = [x for x in fleet_ids if x != max(labels)]

print(len(fleet_ids), "fleets")
fleets = OrderedDict()
n_hues = (len(fleet_ids) + 3) // 4
for i, fid in enumerate(fleet_ids_without_carriers):
    sat = [0.5, 1][i % 2]
    val = 1
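    # Exclude the endpoint: hue 0 and hue 1 are the same red in HSV, so including both repeats a color.
    hue = np.linspace(0, 1, n_hues, endpoint=False)[(i // 4) % n_hues]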
    [[clr]] = color.hsv2rgb([[(hue, sat, val)]])
    fg = [(0, 0, 0), clr][(i // 2) % 2]
    sz = [5, 4][(i // 2) % 2]
    fleets[fid] = ('o', tuple(fg), tuple(clr), sz, 1, str(i + 1))
fleets[max(labels)] = ('1', 'k', 'k', 8, 2, 'Carrier Vessel')

23 fleets
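
The marker scheme packs four fleets into each hue by alternating saturation and edge style. The sketch below reproduces that idea with the standard library's `colorsys` in place of `skimage`. The helper name `fleet_style` is an assumption for the example, and hues here are spaced evenly over [0, 1), since hue 0 and hue 1 are the same red in HSV.

```python
import colorsys


def fleet_style(i, n_hues):
    """Return (fill_rgb, edge_rgb, size) for the i-th largest fleet.

    Within each group of four fleets the hue is shared; saturation
    alternates with i, and every second pair gets a colored edge and a
    smaller marker instead of a black edge.
    """
    sat = [0.5, 1.0][i % 2]
    hue = ((i // 4) % n_hues) / n_hues  # evenly spaced, endpoint excluded
    fill = colorsys.hsv_to_rgb(hue, sat, 1.0)
    colored_edge = (i // 2) % 2 == 1
    edge = fill if colored_edge else (0.0, 0.0, 0.0)
    size = 4 if colored_edge else 5
    return fill, edge, size


styles = [fleet_style(i, n_hues=6) for i in range(24)]
print(styles[0])
print(len(set(styles)), "distinct styles")
```

With 6 hues this yields 24 distinct combinations, enough for the fleets listed in this notebook.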


## Create Animations

In [12]:
anim = animation.make_anim(joint_ssvid, 
                           labels, 
                           all_pruned_by_date, 
                           interval=2,
                           fleets=fleets, 
                           show_ungrouped=True,
                           alpha=1,
                           legend_cols=8,
                           ungrouped_legend="Ungrouped")
HTML(anim.to_html5_video())



In [13]:
anim = animation.make_anim(joint_ssvid, 
                           labels, 
                           all_pruned_by_date, 
                           interval=1,
                           fleets=fleets, 
                           show_ungrouped=True,
                           alpha=1,
                           legend_cols=8,
                           ungrouped_legend="Ungrouped")
Writer = mpl_animation.writers['ffmpeg']
writer = Writer(fps=8, metadata=dict(artist='Me'), bitrate=1800)
anim.save('fleet_longlines.mp4', writer=writer)

## List Fleet Composition

In [14]:
query = """
SELECT code, iso3 FROM `world-fishing-827.gfw_research.country_codes`"""
country_codes_df = pd.read_gbq(query, dialect='standard', project_id='world-fishing-827')
iso3_map = {x.code : x.iso3 for x in country_codes_df.itertuples()}

for fid, v in fleets.items():
    label = v[-1]
    mask = (fid == np.array(labels))
    ssvids = np.array(joint_ssvid)[mask]
    mids = [x[:3] for x in ssvids]
    countries = [iso3_map.get(float(x), x) for x in mids]
    c = Counter(countries)
    print('Fleet', label, ':')
    for country, count in c.most_common():
        print('\t', country, ':', count)

Fleet 1 :
	 CHN : 130
	 DEU : 5
	 415 : 4
	 KIR : 1
	 556 : 1
	 700 : 1
	 421 : 1
	 KOR : 1
	 PNG : 1
Fleet 2 :
	 CHN : 67
	 FJI : 25
	 KOR : 14
	 452 : 2
	 600 : 1
	 AUS : 1
Fleet 3 :
	 USA : 48
	 CAN : 46
Fleet 4 :
	 JPN : 23
	 CHN : 22
	 TWN : 15
	 ESP : 4
	 302 : 1
	 CIV : 1
	 KOR : 1
	 CPV : 1
Fleet 5 :
	 USA : 64
Fleet 6 :
	 ESP : 38
	 FRA : 2
	 ITA : 2
	 MLT : 2
	 MHL : 1
	 PRT : 1
Fleet 7 :
	 CHN : 20
	 TWN : 9
	 MHL : 4
	 FSM : 4
	 JPN : 2
	 200 : 1
Fleet 8 :
	 KOR : 38
	 TWN : 1
Fleet 9 :
	 TWN : 31
	 SYC : 4
	 TZA : 1
	 CHN : 1
Fleet 10 :
	 USA : 11
	 VUT : 8
	 TWN : 4
	 KIR : 1
	 COK : 1
Fleet 11 :
	 TWN : 20
	 FJI : 1
Fleet 12 :
	 JPN : 19
	 CHN : 1
Fleet 13 :
	 ZAF : 20
Fleet 14 :
	 NCL : 16
Fleet 15 :
	 REU : 13
	 MUS : 2
	 FRA : 1
Fleet 16 :
	 TWN : 10
	 CHN : 5
	 VUT : 1
Fleet 17 :
	 TWN : 11
	 MYS : 5
Fleet 18 :
	 CHN : 15
Fleet 19 :
	 CHN : 9
	 FSM : 4
	 TWN : 2
Fleet 20 :
	 NZL : 13
	 ESP : 1
Fleet 21 :
	 ESP : 11
Fleet 22 :
	 TWN : 7
	 ESP : 4
Fleet Carrier Vessel 
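
The numeric entries in the listing (for example `415`) are three-digit prefixes that had no match in the country code table, so the prefix itself is printed. A minimal sketch of that fallback, using a tiny hand-written lookup in place of the BigQuery table and made-up vessel ids:

```python
from collections import Counter


def flag_counts(ssvids, iso3_by_mid):
    """Count vessels per flag, keyed by the first three digits of each id.

    Prefixes missing from iso3_by_mid are reported as-is.
    """
    flags = [iso3_by_mid.get(s[:3], s[:3]) for s in ssvids]
    return Counter(flags).most_common()


# Two real prefixes for illustration; the vessel ids themselves are invented.
iso3_by_mid = {"412": "CHN", "416": "TWN"}
toy_ssvids = ["412000001", "412000002", "416000003", "415000004"]
counts_by_flag = flag_counts(toy_ssvids, iso3_by_mid)
for flag, count in counts_by_flag:
    print('\t', flag, ':', count)
```

Unlike the cell above, this sketch keys the lookup by string prefix, which avoids the `float` conversion but assumes the table stores codes as strings.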