# Delays

Delays resulting from the late arrival of ships at port can have a significant operational and economic impact. This section investigates whether it is possible to predict if a ship will be delayed in arriving at its destination. To explore the factors that may contribute to delays, several features were taken and derived from both the AIS and CERS datasets.

At each point on a ship’s journey, features from the high-level themes below will be used:

- time and seasonality
- ship characteristics such as gross tonnage
- previous delay counts
- ship dynamics such as SOG and ROT
- distance from last port of call
- segments
- local loading
- port loading
- weather 

In [1]:
# base libraries
import pandas as pd
import math
import os
import json
import numpy as np

In [2]:
# set variable from config file
config_path = os.path.abspath('..')

with open(config_path + '/config.json', 'r') as f:
    config = json.load(f) 

processing_path = config['DEFAULT']['processing_path']
shipping_rot_filename = config['DEFAULT']['shipping_rot_filename']
shipping_filename = config['DEFAULT']['shipping_filename']
cers_eta_filename = config['DEFAULT']['cers_eta_filename']
delay_filename = config['DEFAULT']['delay_filename']

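During investigations of the CERS data it was discovered that the ETA within the downloadable CERS data is updated to equal the ATA (actual time of arrival). The ETA field in that download therefore cannot serve as the estimate; a separately obtained ETA, found by an automated process that queries CERS, is loaded below.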

In [3]:
# import data
dtype_dic = {'MMSI':int,'dt':'str', 'lat':'float', 'long':'float','SOG':'float', 'rot':'float', 
             'Type':'str', 'gross_tonnage':'float','vessel_name':'str', 'ETA':'str', 'POC_LOCODE':'str',
             'last_port_LOCODE':'str', 'next_port_LOCODE':'str', 'status':'str','voyage_id':'float','tripid':int,
            'in_hazmat':'str','out_hazmat':'str'}
parse_dates = ['dt', 'ETA']

shipping_data = pd.read_csv(processing_path + shipping_filename,header = 0,delimiter = ',',dtype = dtype_dic, parse_dates=parse_dates)

# file contains the original ETA and a new ETA found by creating an automated process to query CERS
ETA_data = pd.read_csv(processing_path + cers_eta_filename, header = 0,delimiter = ',')
ETA_data['etatoportofcall'] = pd.to_datetime(ETA_data['etatoportofcall'])

In [4]:
# merge the new ETA to the shipping data
shipping_data = shipping_data[['MMSI','voyage_id','dt','ETA']].merge(ETA_data[['voyage_id','etatoportofcall']], 
                                                                     how = 'inner', on = 'voyage_id')
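
An inner merge on `voyage_id` drops voyages missing from either file and duplicates position rows if the ETA file holds more than one row per voyage. The following is a minimal sketch of guarding against the second case with the `validate` argument of `DataFrame.merge`. The toy frames `positions` and `etas` and all their values are made up for illustration and are not taken from the project data.

```python
import pandas as pd

# Toy stand-ins for the AIS positions and the queried ETA file (values are made up).
positions = pd.DataFrame({
    'MMSI': [111, 111, 222],
    'voyage_id': [1.0, 1.0, 2.0],
    'dt': pd.to_datetime(['2019-01-01 10:00', '2019-01-01 11:00', '2019-01-02 09:00']),
})
etas = pd.DataFrame({
    'voyage_id': [1.0, 2.0, 3.0],
    'etatoportofcall': pd.to_datetime(['2019-01-01 12:00', '2019-01-02 12:00', '2019-01-03 12:00']),
})

# validate='many_to_one' raises pandas.errors.MergeError if voyage_id is
# duplicated on the right-hand side, so positions cannot be silently duplicated.
merged = positions.merge(etas, how='inner', on='voyage_id', validate='many_to_one')

# Comparing row counts shows how many positions survived the inner join.
rows_before, rows_after = len(positions), len(merged)
```

If the row count after the merge is lower than before, some voyages had no matching ETA and their positions were dropped.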

## Delays

The target field for the modelling will be a binary variable indicating whether the ship is delayed or not. The delay is calculated by subtracting the estimated time of arrival from the actual time of arrival.

Because the threshold at which a delay becomes operationally critical differs between situations, five binary target fields are created, one for each delay threshold: 15, 30, 60, 90 and 120 minutes.

In [5]:
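# calculate delay in minutes: 'ETA' holds the actual arrival time (the CERS download
# overwrites ETA with ATA, see above) and 'ETA_new' is the ETA obtained by querying CERS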
shipping_data.rename(columns = {'etatoportofcall':'ETA_new'}, inplace = True)
shipping_data['arrivalDelay'] = shipping_data['ETA'] - shipping_data['ETA_new']
shipping_data['arrivalDelayMin'] = shipping_data['arrivalDelay'].dt.total_seconds()/60

In [6]:
shipping_data['delay15'] = shipping_data['arrivalDelayMin'] >= 15
shipping_data['delay30'] = shipping_data['arrivalDelayMin'] >= 30
shipping_data['delay60'] = shipping_data['arrivalDelayMin'] >= 60
shipping_data['delay90'] = shipping_data['arrivalDelayMin'] >= 90
shipping_data['delay120'] = shipping_data['arrivalDelayMin'] >= 120
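
The delay calculation and thresholding above can be sketched on a toy frame. This assumes, following the text, that `ETA` holds the actual arrival time and `ETA_new` the estimate; the timestamps are made up. The loop is an equivalent way of writing the five assignments, not the notebook's own code.

```python
import pandas as pd

# Toy arrivals (made-up values): 'ETA' is assumed to hold the actual arrival
# time and 'ETA_new' the estimated arrival time.
toy = pd.DataFrame({
    'ETA': pd.to_datetime(['2019-01-01 12:10', '2019-01-01 13:30', '2019-01-01 09:00']),
    'ETA_new': pd.to_datetime(['2019-01-01 12:00', '2019-01-01 12:00', '2019-01-01 10:00']),
})

# Positive values mean the ship arrived after its estimate; negative means early.
toy['arrivalDelayMin'] = (toy['ETA'] - toy['ETA_new']).dt.total_seconds() / 60

# One boolean target column per threshold, built in a loop.
thresholds = [15, 30, 60, 90, 120]
for minutes in thresholds:
    toy[f'delay{minutes}'] = toy['arrivalDelayMin'] >= minutes
```

Here the three toy ships are 10 minutes late, 90 minutes late and 60 minutes early, so only the second is flagged, and only for the thresholds up to 90 minutes.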

## Previous delays

A possible predictive feature is the number of times a ship has previously been delayed.

In [7]:
# find the combination of ship and journey (MMSI and ETA) that have been delayed by at least 15 mins
delayed_15 = shipping_data[shipping_data['delay15']].copy()
delayed_15a = delayed_15[['MMSI','ETA']].drop_duplicates()

# join on each delay by MMSI
ship_delays = delayed_15a.merge(delayed_15a,how = 'left',on = ['MMSI'])
# only keep a previous delay if it's for a journey before the current one
ship_delays['previous_delays'] = ship_delays['ETA_y'].where(ship_delays['ETA_y'] < ship_delays['ETA_x'])
ship_delays.rename(index = str, columns = {'ETA_x':'ETA'}, inplace=True)

# count previous delays for each MMSI and journey (ETA)
ship_delays = ship_delays.groupby(['MMSI','ETA'],as_index=False)['previous_delays'].count()
delayed_15 = delayed_15.merge(ship_delays,how = 'left',on=['MMSI','ETA'])

In [8]:
shipping_data = shipping_data.merge(delayed_15[['MMSI','dt','previous_delays']], how = 'left', on = ['MMSI','dt'])
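
The self-merge above pairs every delayed journey with every other delayed journey of the same ship, so it grows quadratically per ship. As a sketch of an alternative, and assuming each (MMSI, ETA) pair appears only once after de-duplication, the same count can be obtained by sorting and numbering journeys within each ship. The toy frame `delayed` and its values are made up for illustration.

```python
import pandas as pd

# Toy delayed journeys (made-up values): one row per delayed (MMSI, ETA) pair.
delayed = pd.DataFrame({
    'MMSI': [111, 111, 111, 222],
    'ETA': pd.to_datetime(['2019-03-01', '2019-01-01', '2019-02-01', '2019-01-15']),
}).drop_duplicates()

# Sort journeys chronologically within each ship, then number them from zero:
# the running index equals the count of strictly earlier delayed journeys.
delayed = delayed.sort_values(['MMSI', 'ETA'])
delayed['previous_delays'] = delayed.groupby('MMSI').cumcount()
```

The resulting frame can be merged back on `MMSI` and `ETA` in the same way as the grouped counts above.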

## Export data

In [9]:
shipping_data.to_csv(processing_path + delay_filename,header=True,index=False,sep=',')

In [12]:
shipping_data.describe()

| | MMSI | arrivalDelay | arrivalDelayMin | previous_delays |
|---|---|---|---|---|
| count | 1659444.0 | 1659444 | 1659444.0 | 798051.0 |
| mean | 359022000.0 | 0 days 01:41:34.662802 | 101.5777 | 2.538025 |
| std | 161688100.0 | 0 days 17:34:13.824703 | 1054.23 | 4.178169 |
| min | 209322000.0 | -9 days +10:00:00 | -12360.0 | 0.0 |
| 25% | 220477000.0 | -1 days +23:05:00 | -55.0 | 0.0 |
| 50% | 255805700.0 | 0 days 00:06:00 | 6.0 | 1.0 |
| 75% | 477712800.0 | 0 days 02:06:00 | 126.0 | 3.0 |
| max | 636092600.0 | 29 days 21:05:00 | 43025.0 | 24.0 |
