# Fleet Clustering

### Tim Hochberg, 2019-01-16

## Longliner Edition

We cluster vessels using HDBSCAN and a custom metric to derive fleets
that are related in the sense that they spend a lot of time in the same
location while at sea.

## See Also

* Other notebooks in https://github.com/GlobalFishingWatch/fleet-clustering for 
examples of clustering Squid Jiggers, etc.
* This workspace that Nate put together: https://globalfishingwatch.org/map/workspace/udw-v2-85ff8c4f-fbfe-4126-b067-4d94cdd2b737



## Open Questions

### Fleet Coherence Time

One thing the current implementation doesn't take into account is
the coherence time of a fleet. A vessel might be part of one fleet
this season, but move to another fleet the next season. One way to
deal with this is to cluster fleets over shorter time periods (6 months,
for instance) and then match fleets across periods by finding which
earlier fleets have the largest overlap in vessels with each current
fleet.
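
The overlap matching described above can be sketched as follows. The notebook later calls a `best_match` helper whose definition is not shown, so this is only one plausible reading of it: overlap is measured as the number of shared vessel IDs, and the `min_overlap` parameter is an assumption added for illustration.

```python
def best_match(ssvid_list, candidate_fleets, min_overlap=1):
    """Return the index of the candidate fleet sharing the most vessels.

    Hypothetical helper: overlap is the count of shared vessel IDs, and
    None is returned when no candidate shares at least `min_overlap`.
    """
    current = set(ssvid_list)
    best_index, best_overlap = None, 0
    for i, candidate in enumerate(candidate_fleets):
        overlap = len(current & set(candidate))
        if overlap >= min_overlap and overlap > best_overlap:
            best_index, best_overlap = i, overlap
    return best_index


previous_fleets = [["a", "b", "c"], ["d", "e"], ["f"]]
print(best_match(["b", "c", "d"], previous_fleets))  # shares two vessels with the first fleet
print(best_match(["x", "y"], previous_fleets))       # shares no vessels with any fleet
```

Ties go to the earliest candidate here. A normalized score such as Jaccard overlap might be preferable when fleet sizes differ widely.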

In [1]:
from __future__ import print_function
from __future__ import division
from collections import Counter, OrderedDict
import datetime as dt
import hdbscan
import logging
import matplotlib.pyplot as plt
import matplotlib.animation as mpl_animation
import numpy as np
import pandas as pd
from skimage import color
from IPython.display import HTML
from fleet_clustering import bq
from fleet_clustering import filters
from fleet_clustering import distances
from fleet_clustering import animation

## Load AIS Clustering Data

Load the AIS data that we use for clustering. Note that it only includes vessels away
from shore, so as to avoid clustering on ports.

In [2]:
all_by_date = bq.load_ais_by_date('drifting_longlines', dt.date(2016, 1, 1), dt.date(2018, 12, 31),
                                 fishing_only=False, min_km_from_shore=0)    
pruned_by_date = {k : filters.remove_near_shore(10,
                            filters.remove_chinese_coast(v)) for (k, v) in all_by_date.items()}
valid_ssvid = sorted(filters.find_valid_ssvid(pruned_by_date))

2016-01-01




2016-07-03
2017-01-03
2017-07-06
2018-01-06
2018-07-09


## Create Distance Metrics

Create an array of distance metrics. The details are still evolving, but in general
we want to deal with two things: days on which a boat is missing, and days on which the
boat is away from the fleet.

* Distances to/from a boat on days when it is missing are represented by $\infty$ in 
  the distance matrix. HDBSCAN ignores these values.
* Only the closest N days are kept for each boat pair, allowing boats to leave the fleet
  for up to half the year without penalty.
  
In addition, distances have a floor of 1 km to prevent overclustering when boats tie up
together, etc.
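
The rules above can be illustrated with a small sketch. This is not the implementation in `distances.compute_distances_4`: it assumes planar x/y positions in km, averages the kept days, and uses hypothetical names (`pairwise_fleet_distance`, `keep_days`, `floor_km`) chosen for illustration.

```python
import numpy as np


def pairwise_fleet_distance(positions_km, keep_days, floor_km=1.0):
    """Sketch of a pairwise vessel distance with missing days and a floor.

    positions_km: array of shape (n_vessels, n_days, 2) holding planar x/y
    coordinates in km, with NaN on days when a vessel is missing.
    """
    n_vessels = positions_km.shape[0]
    dists = np.full((n_vessels, n_vessels), np.inf)
    for i in range(n_vessels):
        for j in range(i + 1, n_vessels):
            delta = positions_km[i] - positions_km[j]
            daily = np.sqrt((delta ** 2).sum(axis=1))  # NaN if either vessel is missing
            daily = daily[~np.isnan(daily)]
            if daily.size == 0:
                continue  # never seen on the same day: leave as infinity
            daily = np.maximum(daily, floor_km)        # apply the distance floor
            closest = np.sort(daily)[:keep_days]       # keep only the closest days
            dists[i, j] = dists[j, i] = closest.mean()
    np.fill_diagonal(dists, 0.0)
    return dists


nan = np.nan
positions = np.array([
    [[0, 0], [0, 0], [0, 0], [0, 0]],        # vessel 0 stays put
    [[0, 0.5], [3, 4], [0, 0.5], [30, 40]],  # vessel 1 wanders off on two days
    [[nan, nan]] * 4,                        # vessel 2 is never seen
], dtype=float)
dists = pairwise_fleet_distance(positions, keep_days=2)
print(dists)
```

With `keep_days=2`, the two days on which vessel 1 is far from vessel 0 do not affect their distance, and the 0.5 km separations are raised to the 1 km floor.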

In [None]:
dists_by_date = {}

In [447]:
for start_date, end_date in [('20160101', '20161231'),
    ('20170101', '20171231'), ('20170701', '20180630'), ('20180101', '20181231')]:
    if start_date in dists_by_date:
        continue
    print("computing distance for", start_date, end_date)
    subset_by_date = {k : v for (k, v) in pruned_by_date.items() if start_date <= k <= end_date}
    C = distances.create_composite_lonlat_array(subset_by_date, valid_ssvid)
    dists = distances.compute_distances_4(C, gamma=2)
    dists_by_date[start_date] = dists

computing distance for 20160101 20161231


## Load Carrier Data

In [4]:
carriers_by_date = bq.load_carriers_by_year(2017, 2018)
pruned_carriers_by_date = {k : filters.remove_chinese_coast(v) for (k, v) in carriers_by_date.items()}
query = """
               SELECT CAST(mmsi AS STRING) FROM
               `world-fishing-827.vessel_database.all_vessels_20190102`
               WHERE  iscarriervessel AND confidence = 3
        """
valid_carrier_ssvid_df = pd.read_gbq(query, dialect='standard', project_id='world-fishing-827')
valid_carrier_ssvid = valid_carrier_ssvid_df.f0_
valid_carrier_ssvid_set = set(valid_carrier_ssvid)

## Load Encounters Data And Country Codes

These are used to filter the carrier vessels down to only those
that meet with longliners, and to add iso3 labels to outputs.

In [5]:
encounters = bq.load_carriers(2017, 2017)

In [6]:
query = """
SELECT code, iso3 FROM `world-fishing-827.gfw_research.country_codes`"""
country_codes_df = pd.read_gbq(query, dialect='standard', project_id='world-fishing-827')
iso3_map = {x.code : x.iso3 for x in country_codes_df.itertuples()}

## Fit the Clusterer

This is pretty straightforward; all the complicated stuff is
embedded in the matrix computations. Fleet size can be tweaked
using `min_cluster_size` and `min_samples`.

In [93]:
clusterers = {}
for start_date, dists in dists_by_date.items():
    clusterer = hdbscan.HDBSCAN(metric='precomputed', 
                                min_cluster_size=9,
                               )
    clusterer.fit(dists)
    clusterers[start_date] = clusterer

* Refactor to get:

def create_animation(clusterer, ...):


* Apply mixing symmetrically, so 2016 + pd_2017, 2017 + 0.5 * pd_2016 + 0.5 * pd_2018, 2018 + pd_2017



## Set up Fleets

Set up the fleets for viewing.

In [532]:
clusterer = clusterers['20170101']

all_fleet_ssvid_set = set([s for (s, f) in zip(valid_ssvid, clusterer.labels_) if f >= 0])
valid_ssvid_set = set(valid_ssvid)
all_longline_reefer_ssvid_set = set()
for x in encounters.itertuples():
    if x.ssvid_1 in all_fleet_ssvid_set and x.ssvid_2 in valid_carrier_ssvid_set:
        all_longline_reefer_ssvid_set.add(x.ssvid_2)
    if x.ssvid_2 in all_fleet_ssvid_set and x.ssvid_1 in valid_carrier_ssvid_set:
        all_longline_reefer_ssvid_set.add(x.ssvid_1)
all_longline_reefer_ssvid = sorted(all_longline_reefer_ssvid_set)

valid_ssvid_set = set(valid_ssvid)
carrier_ids = [x for x in all_longline_reefer_ssvid if x not in valid_ssvid_set]
joint_ssvid = valid_ssvid + sorted(carrier_ids) 
labels = list(clusterer.labels_) + [max(clusterer.labels_) + 1] * len(carrier_ids) 

# Remove vessels that have no connection to other vessels
dists = dists_by_date['20170101']
for i, ssvid in enumerate(valid_ssvid):
    connections = (~np.isinf(dists[i])).sum()
    if connections == 0:
        labels[i] = -1

In [533]:
def to_rgb(string):
    string = string.strip('#')
    r = string[:2]
    g = string[2:4]
    b = string[4:]
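    # NOTE: the divisor is 225, not the usual 255 for 8-bit channels; the
    # recorded output and the hard-coded colors further down rely on it.
    return [int(x, 16) / 225.0 for x in (r, g, b)]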

to_rgb('#222D4B')

[0.1511111111111111, 0.2, 0.3333333333333333]

In [534]:
counts = []
skip = []
for i in range(max(labels) + 1):
    if i in skip:
        counts.append(0)
    else:
        counts.append((np.array(labels) == i).sum())
        
fleet_ids = [x for x in np.argsort(counts)[::-1] if counts[x] > 0]
fleet_ids_without_carriers = [x for x in fleet_ids if x != max(labels)]

print(len(fleet_ids), "fleets")
fleets = OrderedDict()
n_hues = int(np.ceil(len(fleet_ids) / 4.0))
used = set()
for i, fid in enumerate(fleet_ids_without_carriers):
    b = (i // (2 * n_hues)) % 2
    c = (i // 2)% n_hues
    d = i  % 2
    symbol = 'o^'[d]
    assert (b, c, d) not in used, (i, b, c, d)
    used.add((b, c, d))
    sat = 1
    val = 1
    raw_hue = c / float(n_hues)
    # We remap the raw hue in order to avoid the 60 degree segment around blue
    hue = 5. / 6. * raw_hue
    if hue > 7. / 12.:
        hue += 1. / 6.
    assert 0 <= hue < 1, hue
    [[clr]] = color.hsv2rgb([[(hue, sat, val)]])
    fg = [[0.1511111111111111, 0.2, 0.3333333333333333], clr][b]
    bg = [clr, [0.1511111111111111, 0.2, 0.3333333333333333]][b]
    w = [1, 2][b]
    sz = [7, 7][b]
    fleets[fid] = (symbol, tuple(fg), tuple(bg), sz, w,  str(i + 1))
# fleets[max(labels)] = ('1', 'k', 'k', 8, 2, 'Carrier Vessel')

41 fleets
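
The hue remapping in the cell above can be checked in isolation. The sketch below uses the standard-library `colorsys` module in place of `skimage.color.hsv2rgb`, and the function name `remap_hue` is hypothetical. The remap squeezes the hue circle into 300 degrees and then jumps over the 210 to 270 degree segment around blue, presumably so markers stay visible against the dark blue background.

```python
import colorsys


def remap_hue(raw_hue):
    """Map raw_hue in [0, 1) onto the hue circle while skipping 210-270 degrees."""
    hue = 5.0 / 6.0 * raw_hue  # compress the full circle into 300 degrees
    if hue > 7.0 / 12.0:       # past 210 degrees: jump over the blue segment
        hue += 1.0 / 6.0
    return hue


n_hues = 6
for c in range(n_hues):
    hue = remap_hue(c / float(n_hues))
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    print("{:.3f} -> ({:.2f}, {:.2f}, {:.2f})".format(hue, r, g, b))
```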


## Create 2017 Animation as a Check

In [572]:
import imp; imp.reload(animation)

all_by_date_2017 = {k : v for (k, v) in all_by_date.items() if '20170101' <= k <= '20171231'}

anim = animation.make_anim(joint_ssvid, 
                           labels, 
                           all_by_date_2017, 
                           interval=100,
                           fleets=fleets, 
                           show_ungrouped=True,
                           alpha=1,
                           legend_cols=12,
                           ungrouped_legend="Ungrouped")
HTML(anim.to_html5_video())

In [536]:
import imp; imp.reload(animation)

anim = animation.make_anim(joint_ssvid, 
                           labels, 
                           all_by_date_2017, 
                           interval=1,
                           fleets=fleets, 
                           show_ungrouped=True,
                           alpha=1,
                           legend_cols=12,
                           ungrouped_legend="Ungrouped")
Writer = mpl_animation.writers['ffmpeg']
writer = Writer(fps=8, metadata=dict(artist='Me'), bitrate=1800)
anim.save('fleet_longlines_2017.mp4', writer=writer,  
          savefig_kwargs={'facecolor':'#222D4B'})
#           savefig_kwargs={'facecolor':'#000000'})

## Create Pseudo Distance From Fleet Membership

In [573]:
pdists_2017 = np.zeros_like(dists_by_date['20170101'])
raw_labels_2017 = np.asarray(clusterers['20170101'].labels_)
SCALE = 1000
UNKNOWN_FLEET_DIST = 1 * SCALE
OTHER_FLEET_DIST = 2 * SCALE
mask = (raw_labels_2017 == -1)
for i, fid in enumerate(raw_labels_2017):
    if fid == -1:
        pdists_2017[i] = UNKNOWN_FLEET_DIST
    else:
        pdists_2017[i] = OTHER_FLEET_DIST * (raw_labels_2017 != fid)
        pdists_2017[i, mask] = UNKNOWN_FLEET_DIST
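
The loop above can also be written as a broadcast comparison, which makes the penalty structure easier to see on a toy example. This is a sketch with a hypothetical function name, `membership_pseudo_distance`. It is meant to reproduce the same rules: zero within a fleet, the larger penalty between different fleets, and the smaller penalty for any pair involving an unlabeled vessel.

```python
import numpy as np


def membership_pseudo_distance(labels, unknown_dist=1000.0, other_dist=2000.0):
    """Build a penalty matrix from fleet labels, where -1 means no fleet."""
    labels = np.asarray(labels)
    unknown = labels == -1
    # Different known fleets get the larger penalty, same fleet gets zero
    pdists = other_dist * (labels[:, None] != labels[None, :]).astype(float)
    # Any pair involving an unlabeled vessel gets the smaller penalty
    pdists[unknown, :] = unknown_dist
    pdists[:, unknown] = unknown_dist
    return pdists


toy_labels = [0, 0, 1, -1]
toy_pdists = membership_pseudo_distance(toy_labels)
print(toy_pdists)
```

Adding this matrix to the next year's distances nudges vessels that shared a fleet to stay together without forcing them to.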

In [574]:
clusterer = hdbscan.HDBSCAN(metric='precomputed', 
                            min_cluster_size=9,
                           )
clusterer.fit(dists_by_date['20180101'] + pdists_2017)

HDBSCAN(algorithm='best', allow_single_cluster=False, alpha=1.0,
    approx_min_span_tree=True, cluster_selection_method='eom',
    core_dist_n_jobs=4, gen_min_span_tree=False, leaf_size=40,
    match_reference_implementation=False, memory=Memory(cachedir=None),
    metric='precomputed', min_cluster_size=9, min_samples=None, p=None,
    prediction_data=False)

In [575]:
all_fleet_ssvid_set = set([s for (s, f) in zip(valid_ssvid, clusterer.labels_) if f >= 0])
valid_ssvid_set = set(valid_ssvid)
all_longline_reefer_ssvid_set = set()
for x in encounters.itertuples():
    if x.ssvid_1 in all_fleet_ssvid_set and x.ssvid_2 in valid_carrier_ssvid_set:
        all_longline_reefer_ssvid_set.add(x.ssvid_2)
    if x.ssvid_2 in all_fleet_ssvid_set and x.ssvid_1 in valid_carrier_ssvid_set:
        all_longline_reefer_ssvid_set.add(x.ssvid_1)
all_longline_reefer_ssvid = sorted(all_longline_reefer_ssvid_set)

valid_ssvid_set = set(valid_ssvid)
carrier_ids = [x for x in all_longline_reefer_ssvid if x not in valid_ssvid_set]
joint_ssvid_2018 = valid_ssvid + sorted(carrier_ids) 
labels_2018 = list(clusterer.labels_) + [max(clusterer.labels_) + 1] * len(carrier_ids) 

counts = []
for i in range(max(labels_2018) + 1):
    if i in skip:
        counts.append(0)
    else:
        counts.append((np.array(labels_2018) == i).sum())
        
fleet_ids_2018 = [x for x in np.argsort(counts)[::-1] if counts[x] > 0]
fleet_ids_without_carriers_2018 = [x for x in fleet_ids_2018 if x != max(labels_2018)]

In [576]:
ssvid_2017 = []
for fid in fleet_ids:
    mask = (np.array(labels) == fid)
    ssvid_2017.append(np.array(joint_ssvid)[mask])
    
ssvid_2018 = []
for fid in fleet_ids_2018:
    mask = (np.array(labels_2018) == fid)
    ssvid_2018.append(np.array(joint_ssvid_2018)[mask])
    
    
# mapping = {}
# for fid, ssvid_list in zip(fleet_ids, ssvid_2017):
#     i = best_match(ssvid_list, ssvid_2018)
#     if i is None:
#         mapping[fid] = None
#     else:
#         mapping[fid] = fleet_ids_2018[i]
        
        
rev_mapping = {}
for fid, ssvid_list in zip(fleet_ids_2018, ssvid_2018):
    i = best_match(ssvid_list, ssvid_2017)
    if i is None:
        rev_mapping[fid] = None
    else:
        rev_mapping[fid] = fleet_ids[i]
        
mapping = {v :k for (k, v) in rev_mapping.items()}

In [577]:
fleets_2018 = OrderedDict()

for i, fid in enumerate(fleets):
    if fid in mapping and mapping[fid] is not None:
        k = mapping[fid]
        if k in fleets_2018:
            print("Skipping", k, fid, "because of double match")
            fleets_2018[i + max(fleets)] = fleets[fid]
        else:
            fleets_2018[mapping[fid]] = fleets[fid]
    else:
        fleets_2018[i + max(fleets)] = fleets[fid]


In [578]:
fleets_2018

OrderedDict([(30,
              ('o',
               (0.1511111111111111, 0.2, 0.3333333333333333),
               (1.0, 0.0, 0.0),
               7,
               1,
               '1')),
             (50,
              ('^',
               (0.1511111111111111, 0.2, 0.3333333333333333),
               (1.0, 0.0, 0.0),
               7,
               1,
               '2')),
             (14,
              ('o',
               (0.1511111111111111, 0.2, 0.3333333333333333),
               (1.0, 0.4545454545454546, 0.0),
               7,
               1,
               '3')),
             (35,
              ('^',
               (0.1511111111111111, 0.2, 0.3333333333333333),
               (1.0, 0.4545454545454546, 0.0),
               7,
               1,
               '4')),
             (47,
              ('o',
               (0.1511111111111111, 0.2, 0.3333333333333333),
               (1.0, 0.9090909090909092, 0.0),
               7,
               1,
               '5')),
     

In [579]:
import imp; imp.reload(animation)

all_by_date_2018 = {k : v for (k, v) in all_by_date.items() if '20180101' <= k <= '20181231'}

anim = animation.make_anim(joint_ssvid_2018, 
                           labels_2018, 
                           all_by_date_2018, 
                           interval=100,
                           fleets=fleets_2018, 
                           show_ungrouped=True,
                           alpha=1,
                           legend_cols=12,
                           ungrouped_legend="Ungrouped")
HTML(anim.to_html5_video())

In [580]:
anim = animation.make_anim(joint_ssvid_2018, 
                           labels_2018, 
                           all_by_date_2018, 
                           interval=1,
                           fleets=fleets_2018, 
                           show_ungrouped=True,
                           alpha=1,
                           legend_cols=12,
                           ungrouped_legend="Ungrouped")
Writer = mpl_animation.writers['ffmpeg']
writer = Writer(fps=8, metadata=dict(artist='Me'), bitrate=1800)
anim.save('fleet_longlines_2018_momentum.mp4', writer=writer, savefig_kwargs={'facecolor':'#222D4B'})

In [559]:
clusterer = hdbscan.HDBSCAN(metric='precomputed', 
                            min_cluster_size=9,
                           )
clusterer.fit(dists_by_date['20160101'] + pdists_2017)

HDBSCAN(algorithm='best', allow_single_cluster=False, alpha=1.0,
    approx_min_span_tree=True, cluster_selection_method='eom',
    core_dist_n_jobs=4, gen_min_span_tree=False, leaf_size=40,
    match_reference_implementation=False, memory=Memory(cachedir=None),
    metric='precomputed', min_cluster_size=9, min_samples=None, p=None,
    prediction_data=False)

In [560]:
all_fleet_ssvid_set = set([s for (s, f) in zip(valid_ssvid, clusterer.labels_) if f >= 0])
valid_ssvid_set = set(valid_ssvid)
all_longline_reefer_ssvid_set = set()
for x in encounters.itertuples():
    if x.ssvid_1 in all_fleet_ssvid_set and x.ssvid_2 in valid_carrier_ssvid_set:
        all_longline_reefer_ssvid_set.add(x.ssvid_2)
    if x.ssvid_2 in all_fleet_ssvid_set and x.ssvid_1 in valid_carrier_ssvid_set:
        all_longline_reefer_ssvid_set.add(x.ssvid_1)
all_longline_reefer_ssvid = sorted(all_longline_reefer_ssvid_set)

valid_ssvid_set = set(valid_ssvid)
carrier_ids = [x for x in all_longline_reefer_ssvid if x not in valid_ssvid_set]
joint_ssvid_2016 = valid_ssvid + sorted(carrier_ids) 
labels_2016= list(clusterer.labels_) + [max(clusterer.labels_) + 1] * len(carrier_ids) 

counts = []
for i in range(max(labels_2016) + 1):
    if i in skip:
        counts.append(0)
    else:
        counts.append((np.array(labels_2016) == i).sum())
        
fleet_ids_2016 = [x for x in np.argsort(counts)[::-1] if counts[x] > 0]
fleet_ids_without_carriers_2016 = [x for x in fleet_ids_2016 if x != max(labels_2016)]

In [561]:
ssvid_2016 = []
for fid in fleet_ids_2016:
    mask = (np.array(labels_2016) == fid)
    ssvid_2016.append(np.array(joint_ssvid_2016)[mask])   
        
rev_mapping = {}
for fid, ssvid_list in zip(fleet_ids_2016, ssvid_2016):
    i = best_match(ssvid_list, ssvid_2017)
    if i is None:
        rev_mapping[fid] = None
    else:
        rev_mapping[fid] = fleet_ids[i]
        
mapping = {v :k for (k, v) in rev_mapping.items()}

In [562]:
fleets_2016 = OrderedDict()

for i, fid in enumerate(fleets):
    if fid in mapping and mapping[fid] is not None:
        k = mapping[fid]
        if k in fleets_2016:
            print("Skipping", k, fid, "because of double match")
            fleets_2016[i + max(fleets)] = fleets[fid]
        else:
            fleets_2016[mapping[fid]] = fleets[fid]
    else:
        fleets_2016[i + max(fleets)] = fleets[fid]

In [563]:
import imp; imp.reload(animation)

all_by_date_2016 = {k : v for (k, v) in all_by_date.items() if '20160101' <= k <= '20161231'}

anim = animation.make_anim(joint_ssvid_2016, 
                           labels_2016, 
                           all_by_date_2016, 
                           interval=100,
                           fleets=fleets_2016, 
                           show_ungrouped=True,
                           alpha=1,
                           legend_cols=12,
                           ungrouped_legend="Ungrouped")
HTML(anim.to_html5_video())

In [564]:
anim = animation.make_anim(joint_ssvid_2016, 
                           labels_2016, 
                           all_by_date_2016, 
                           interval=1,
                           fleets=fleets_2016, 
                           show_ungrouped=True,
                           alpha=1,
                           legend_cols=12,
                           ungrouped_legend="Ungrouped")
Writer = mpl_animation.writers['ffmpeg']
writer = Writer(fps=8, metadata=dict(artist='Me'), bitrate=1800)
anim.save('fleet_longlines_2016_momentum.mp4', writer=writer, savefig_kwargs={'facecolor':'#222D4B'})

In [565]:
for fid, v in fleets_2016.items():
    label = v[-1]
    mask = (fid == np.array(labels_2016))
    ssvids = np.array(joint_ssvid_2016)[mask]
    mids = [x[:3] for x in ssvids]
    countries = [iso3_map.get(float(x), x) for x in mids]
    c = Counter(countries)
    print('Fleet: {} ({})'.format(label, fid))
    for country, count in c.most_common():
        print('\t', country, ':', count)

Fleet: 1 (37)
	 CHN : 64
	 DEU : 4
	 415 : 2
	 556 : 1
	 KIR : 1
Fleet: 2 (45)
	 CHN : 44
	 KOR : 16
	 FJI : 13
Fleet: 3 (6)
	 USA : 46
	 CAN : 44
	 TCA : 1
	 ESP : 1
Fleet: 4 (40)
	 JPN : 92
	 TWN : 1
Fleet: 5 (30)
	 TWN : 28
	 SYC : 4
	 TZA : 1
Fleet: 6 (41)
	 CHN : 20
	 TWN : 9
	 FSM : 7
	 MHL : 4
Fleet: 7 (22)
	 TWN : 22
	 CHN : 19
	 KOR : 1
	 JPN : 1
Fleet: 8 (4)
	 BHR : 11
	 RUS : 1
	 FRA : 1
Fleet: 9 (26)
	 CHN : 12
	 SYC : 10
	 TWN : 10
	 MDV : 1
	 CYP : 1
Fleet: 10 (34)
	 CHN : 34
	 TWN : 21
	 VUT : 10
Fleet: 11 (11)
	 ESP : 35
	 MLT : 3
	 ITA : 1
Fleet: 12 (24)
	 USA : 27
Fleet: 13 (9)
	 GRC : 13
	 CYP : 10
Fleet: 14 (31)
	 TWN : 20
	 MYS : 5
Fleet: 15 (20)
	 VUT : 13
	 USA : 11
	 TWN : 4
	 KIR : 1
	 COK : 1
Fleet: 16 (7)
	 IND : 11
Fleet: 17 (33)
	 TWN : 24
	 FJI : 1
Fleet: 18 (38)
	 JPN : 23
Fleet: 19 (43)
	 FJI : 12
	 CHN : 6
	 452 : 2
	 600 : 1
Fleet: 20 (17)
	 ZAF : 19
Fleet: 21 (19)
	 AUS : 19
Fleet: 22 (35)
	 CHN : 18
	 FSM : 4
	 TWN : 3
Fleet: 23 (12)
	 MLT : 23
	 ITA

In [566]:
for fid, v in fleets.items():
    label = v[-1]
    mask = (fid == np.array(labels))
    ssvids = np.array(joint_ssvid)[mask]
    mids = [x[:3] for x in ssvids]
    countries = [iso3_map.get(float(x), x) for x in mids]
    c = Counter(countries)
    print('Fleet: {} ({})'.format(label, fid))
    for country, count in c.most_common():
        print('\t', country, ':', count)

Fleet: 1 (10)
	 CHN : 105
	 DEU : 4
	 415 : 4
	 KIR : 1
	 556 : 1
	 KOR : 1
Fleet: 2 (40)
	 CHN : 69
	 KOR : 16
	 FJI : 13
Fleet: 3 (11)
	 USA : 49
	 CAN : 45
	 PAN : 1
	 ESP : 1
Fleet: 4 (43)
	 JPN : 74
	 KOR : 1
Fleet: 5 (38)
	 TWN : 42
	 SYC : 8
	 TZA : 1
Fleet: 6 (36)
	 CHN : 24
	 TWN : 11
	 FSM : 6
	 MHL : 4
	 200 : 1
Fleet: 7 (19)
	 TWN : 23
	 CHN : 18
	 JPN : 1
Fleet: 8 (6)
	 BHR : 33
	 QAT : 2
	 CHN : 2
	 FRA : 1
	 TZA : 1
	 PCN : 1
	 222 : 1
Fleet: 9 (28)
	 CHN : 23
	 TWN : 9
	 SYC : 6
	 CYP : 1
Fleet: 10 (35)
	 CHN : 15
	 TWN : 13
	 VUT : 7
Fleet: 11 (16)
	 ESP : 29
	 MLT : 3
	 ITA : 1
Fleet: 12 (27)
	 USA : 32
Fleet: 13 (14)
	 GRC : 15
	 CYP : 11
Fleet: 14 (37)
	 TWN : 18
	 MYS : 6
	 CHN : 2
Fleet: 15 (29)
	 USA : 11
	 VUT : 7
	 TWN : 4
	 KIR : 1
	 COK : 1
Fleet: 16 (5)
	 IND : 22
	 420 : 1
Fleet: 17 (30)
	 TWN : 21
	 FJI : 1
Fleet: 18 (25)
	 JPN : 21
Fleet: 19 (39)
	 FJI : 14
	 CHN : 5
	 452 : 1
	 600 : 1
Fleet: 20 (21)
	 ZAF : 21
Fleet: 21 (9)
	 AUS : 21
Fleet: 22 (26)
	 C

In [567]:
for fid, v in fleets_2018.items():
    label = v[-1]
    mask = (fid == np.array(labels_2018))
    ssvids = np.array(joint_ssvid_2018)[mask]
    mids = [x[:3] for x in ssvids]
    countries = [iso3_map.get(float(x), x) for x in mids]
    c = Counter(countries)
    print('Fleet: {} ({})'.format(label, fid))
    for country, count in c.most_common():
        print('\t', country, ':', count)

Fleet: 1 (27)
	 ZAF : 20
	 JPN : 1
Fleet: 2 (36)
	 CHN : 23
	 TWN : 9
	 SYC : 7
Fleet: 3 (12)
	 GRC : 15
	 CYP : 11
Fleet: 4 (41)
	 VUT : 12
	 TWN : 1
	 ALB : 1
	 CHN : 1
Fleet: 5 (30)
	 CHN : 78
	 DEU : 4
	 415 : 2
	 200 : 1
	 KIR : 1
	 556 : 1
	 700 : 1
	 KOR : 1
Fleet: 6 (22)
	 CHN : 23
	 FSM : 7
	 MHL : 4
	 TWN : 2
	 JPN : 1
Fleet: 7 (26)
	 TWN : 23
	 CHN : 18
Fleet: 8 (9)
	 AUS : 22
Fleet: 9 (31)
	 TWN : 12
	 FJI : 1
Fleet: 10 (42)
	 TWN : 10
Fleet: 11 (17)
	 PRT : 17
	 ESP : 3
Fleet: 12 (34)
	 CHN : 21
Fleet: 13 (13)
	 CHN : 9
	 200 : 2
	 ARG : 2
Fleet: 16 (11)
	 IND : 21
Fleet: 18 (24)
	 NCL : 16
Fleet: 20 (29)
	 JPN : 12
	 KOR : 1
Fleet: 21 (10)
	 NZL : 10
	 ESP : 1
Fleet: 22 (37)
	 CHN : 22
	 FSM : 4
	 TWN : 2
Fleet: 23 (18)
	 ESP : 34
	 MLT : 4
	 ITA : 1
Fleet: 24 (20)
	 CHN : 21
	 421 : 2
Fleet: 25 (28)
	 JPN : 19
	 KOR : 6
	 433 : 1
Fleet: 29 (23)
	 TWN : 10
	 ESP : 9
	 PRT : 3
Fleet: 30 (19)
	 MLT : 20
	 ITA : 2
	 HUN : 1
Fleet: 31 (35)
	 JPN : 107
	 KOR : 1
Fleet: 32 (25)

## Look for Labor Violations

In [568]:
text = "312422000,2015;312422000,2014;312000125,2015;312000125,2014;412420941,2014;412420941,2015;412201837,2015;412201837,2016;413270430,2017;413270430,2016;440801000,2013;440801000,2014;533000000,2017;567000421,2015;567000445,2014;567000445,2015;567025800,2015;567025800,2014;416202800,2014;416202800,2015;416003928,2014;416054500,2017;416054500,2016;416001769,2013;416001769,2014;367363390,2015;576678000,2015;576678000,2014"
pairs = text.strip().split(';')
# Ignore years for now
mmsi = [x.split(',')[0] for x in pairs]

print(2017)
available = set(mmsi) & set(joint_ssvid)
for x in available:
    mask = (np.array(joint_ssvid) == x)
    [fid] = np.array(labels)[mask]
    if fid != -1:
        label = fleets[fid][-1]
        print(x, label, fid)

2017
413270430 9 28


In [569]:
print(2018)
available = set(mmsi) & set(joint_ssvid_2018)
for x in available:
    mask = (np.array(joint_ssvid_2018) == x)
    [fid] = np.array(labels_2018)[mask]
    if fid in fleets_2018:
        label = fleets_2018[fid][-1]
        print(x, label, fid)

2018
567025800 33 0
567000421 33 0
413270430 2 36
567000445 33 0


In [570]:
mask = (np.array(labels_2018) == 5)
mystery_mmsi = np.array(joint_ssvid_2018)[mask]

In [571]:
dists = dists_by_date['20180101']
mask = np.zeros([len(dists)], dtype=bool)
for mmsi in mystery_mmsi:
    mask |= (np.asarray(valid_ssvid)== mmsi)
    
(~np.isinf(dists[mask])).sum(axis=0)

array([0, 0, 0, ..., 0, 0, 0])