Group name: [7] OceanWarlocks

Group number: 7

Skage Klingstedt Reistad, studentnr : 545212

Susan Desirée Bredesen Palencia, studentnr : 529305

## Exploratory data analysis

#### Domain Knowledge

After reading the dataset definitions and explanations, we searched for relevant papers on vessel trajectory prediction (VTP). Our findings fall into two parts: significant input features and state-of-the-art ML methods.

According to the literature, some of the most important and most widely used input features for VTP (in addition to timestamps, latitude and longitude) are SOG, COG, heading, ship type and information about ports (Yang et al., 2024, p. 7).

The same paper also notes that LSTM networks and their variants, especially bidirectional ones, are among the most used approaches to VTP (Yang et al., 2024, p. 7). This is mostly because AIS data always contains a temporal dimension, so VTP can be treated as a time series problem (Yang et al., 2024, p. 7).
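
To illustrate what treating VTP as a time series problem can look like, here is a minimal sketch that turns one vessel's track into fixed-length input windows and next-step targets, the shape a sequence model such as an LSTM typically consumes. The function name `make_windows`, the window length and the toy data are assumptions made for illustration, and the sketch assumes the track is longer than the window.

```python
import numpy as np

def make_windows(track: np.ndarray, window: int):
    """Split a (T, F) track into inputs (N, window, F) and next-step targets (N, F).

    Assumes T > window, otherwise there is nothing to stack.
    """
    # Each input is `window` consecutive rows; its target is the row right after.
    inputs = np.stack([track[i:i + window] for i in range(len(track) - window)])
    targets = track[window:]
    return inputs, targets

# Toy track with 6 time steps and 2 features (e.g. latitude and longitude).
toy_track = np.arange(12, dtype=float).reshape(6, 2)
X, y = make_windows(toy_track, window=3)
print(X.shape, y.shape)  # (3, 3, 2) (3, 2)
```

Note that this framing implicitly assumes evenly spaced time steps, which raw AIS data does not provide.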

We used these findings as a basis and inspiration for our feature engineering, and to develop our first (but not accurate!) predictor.



#### Checking if data is intuitive
We mostly used the AIS data, so we focused on checking whether the AIS data made sense to us. We checked for duplicated rows and missing values. We found that ETA (etaRaw) was not in a datetime format and did not include a year. We wanted it to be a datetime object, so during data cleanup we converted it and fixed the year to 2024. This is a known problem: ETA is often entered manually and therefore often comes in inconsistent formats (Yang et al., 2024, p. 4).
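
A minimal sketch of such an ETA conversion, assuming the `MM-DD HH:MM` layout visible in `etaRaw` and a fixed year of 2024. The helper name `parse_eta` and the example values are illustrative, and this is not necessarily the exact code used in our cleanup. Entries that cannot be parsed become `NaT`.

```python
import pandas as pd

def parse_eta(eta_raw: pd.Series, year: int = 2024) -> pd.Series:
    """Parse 'MM-DD HH:MM' strings into datetimes, assuming a fixed year."""
    # astype(str) turns missing values into text that fails to parse,
    # so errors="coerce" maps them (and other invalid entries) to NaT.
    with_year = str(year) + "-" + eta_raw.astype(str)
    return pd.to_datetime(with_year, format="%Y-%m-%d %H:%M", errors="coerce")

# Illustrative values: two valid ETAs, one invalid entry and one missing entry.
eta_example = pd.Series(["01-09 23:00", "12-29 20:00", "00-00 24:60", None])
eta_parsed = parse_eta(eta_example)
print(eta_parsed)
```

One caveat of fixing the year: an ETA such as `12-29 20:00` attached to a message sent on 1 January 2024 ends up almost a year after the message, although it most likely refers to the previous December.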

We found that the other features had suitable data types. Note that portId and vesselId are strings. During feature engineering we tried adding them as features and converted them to integers, as can be seen in the "Cleaning up features" section.

We found no duplicated rows, but portId had 1615 missing values. We left these missing values in place, as we saw no particular negative consequence of keeping rows with a missing portId. In particular, when we use portId we convert it to an integer, which turns the missing values into zeroes (which most predictors can easily handle).
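
One way to get integer IDs where missing values become zero is sketched below. It uses `pandas.factorize`, which assigns `-1` to missing values, and shifts all codes by one. The helper name `ids_to_int` and the shortened ID strings are hypothetical, and the actual conversion in our cleanup may differ.

```python
import pandas as pd

def ids_to_int(ids: pd.Series) -> pd.Series:
    """Map string IDs to integers 1..K and missing IDs to 0."""
    codes, _ = pd.factorize(ids)  # missing values get code -1
    return pd.Series(codes + 1, index=ids.index, name=ids.name)

# Hypothetical, shortened port IDs with one missing value.
port_example = pd.Series(["61d371c4", None, "61d37d0d", "61d371c4"], name="portId")
port_codes = ids_to_int(port_example)
print(port_codes.tolist())  # [1, 0, 2, 1]
```

The codes depend on the order in which IDs first appear, so the same mapping has to be reused for training and test data.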

We also used some of the vessel data and found that it contains many missing values (as seen below). The only feature we tried adding from it was vesselType, and we handled its missing values by making it categorical and letting missing values be their own category.
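
A minimal sketch of treating missing vesselType values as their own category. The helper name `vessel_type_category`, the label `"missing"` and the example type codes are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def vessel_type_category(vessel_type: pd.Series) -> pd.Series:
    """Convert numeric type codes to a categorical with an explicit 'missing' level."""
    # vesselType is stored as float because of the NaNs; cast valid codes back to int.
    labels = vessel_type.map(lambda v: "missing" if pd.isna(v) else str(int(v)))
    return labels.astype("category")

# Hypothetical type codes with one missing value.
type_example = pd.Series([83.0, np.nan, 21.0, 83.0], name="vesselType")
type_cat = vessel_type_category(type_example)
print(type_cat.value_counts())
```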

In [1]:
RESOURCE_FOLDER="../../resources"
import pandas as pd

In [None]:

df_ais = pd.read_csv(RESOURCE_FOLDER+'/ais_train.csv', sep='|')

df_ais['time'] = pd.to_datetime(df_ais['time'])

print("Shape of AIS: " + str(df_ais.shape))
#n = 1522065
#columns = 11
print("Columns of AIS: "+ str(df_ais.columns))
#latitude and longitude are our targets. Relevant covariates may be time, cog, sog, rot, heading, navstat, etaRaw, vesselId and portId.

print("Data types of each of the columns") 
print(df_ais.dtypes)
#time is a datetime object.
# cog, sog, latitude and longitude are floats. rot, heading and navstat are ints. etaRaw, vesselId and portId are objects (strings).

print("Description of AIS data")
print(df_ais.describe())
#cog: course over ground. From 0 to 360 degrees.
#sog: speed over ground. From 0 to 1023 knots.
#rot: rate of turn (of the heading, i.e. the compass direction in which the boat's bow/nose is pointing). Degrees per minute.
#heading: direction in which the boat's bow is pointing. Measured in degrees from 0 to 360.
#navstat: navigational status. The number encodes the status of the boat. From 0 to 15.
#latitude: north-south position. Degrees. From -90 (south) to +90 (north).
#longitude: east-west position. Degrees. From -180 (west) to +180 (east).

print("Checking where I have missing values")
print(df_ais.isna().sum()) #portId (1615 entries)

print("Checking for duplicated rows")
df_ais.loc[df_ais.duplicated()] #No duplicated rows

Shape of AIS: (1522065, 11)
Columns of AIS: Index(['time', 'cog', 'sog', 'rot', 'heading', 'navstat', 'etaRaw', 'latitude',
       'longitude', 'vesselId', 'portId'],
      dtype='object')
Data types of each of the columns
time         datetime64[ns]
cog                 float64
sog                 float64
rot                   int64
heading               int64
navstat               int64
etaRaw               object
latitude            float64
longitude           float64
vesselId             object
portId               object
dtype: object
Description of AIS data
                                time           cog           sog  \
count                        1522065  1.522065e+06  1.522065e+06   
mean   2024-03-06 03:20:23.657231360  1.782494e+02  6.331703e+00   
min              2024-01-01 00:00:25  0.000000e+00  0.000000e+00   
25%              2024-02-03 02:59:19  7.820000e+01  0.000000e+00   
50%              2024-03-07 12:34:57  1.838000e+02  5.000000e-01   
75%              2024-0



In [10]:
df_vessels = pd.read_csv(RESOURCE_FOLDER+'/vessels.csv', sep='|')

print("Data types of each of the columns") 
print(df_vessels.dtypes)

print("Checking where I have missing values")
print(df_vessels.isna().sum()) 
print("Checking for duplicated rows")
df_vessels.loc[df_vessels.duplicated()] #No duplicated rows

Data types of each of the columns
shippingLineId     object
vesselId           object
CEU                 int64
DWT               float64
GT                  int64
NT                float64
vesselType        float64
breadth           float64
depth             float64
draft             float64
enginePower       float64
freshWater        float64
fuel              float64
homePort           object
length            float64
maxHeight         float64
maxSpeed          float64
maxWidth          float64
rampCapacity      float64
yearBuilt           int64
dtype: object
Checking where I have missing values
shippingLineId      0
vesselId            0
CEU                 0
DWT                 8
GT                  0
NT                524
vesselType         12
breadth             8
depth             469
draft             701
enginePower        20
freshWater        490
fuel              490
homePort          138
length              0
maxHeight         676
maxSpeed          498
maxWidth          676



#### Understanding how the data was generated

Note that the timestamps are irregular (as seen in the head of the AIS data). This is a known problem: AIS data has many quality issues that lead to irregular timestamps. Two examples are that AIS messages tend to get lost or damaged due to meteorological and magnetic conditions, and that AIS transceivers may be intentionally turned off, which leads to missing data (Yang et al., 2024, p. 4).
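
The irregularity can be quantified by looking at the time gap between consecutive messages from the same vessel. The sketch below assumes a frame with the `vesselId` and `time` columns of the AIS data. The helper name `time_gaps` and the toy rows are illustrative.

```python
import pandas as pd

def time_gaps(df: pd.DataFrame) -> pd.Series:
    """Time since the previous message of the same vessel (NaT for a vessel's first message)."""
    ordered = df.sort_values(["vesselId", "time"])
    return ordered.groupby("vesselId")["time"].diff()

# Toy data: vessel "a" reports after 20 minutes and then after 2 h 40 min.
gap_example = pd.DataFrame({
    "vesselId": ["a", "b", "a", "a"],
    "time": pd.to_datetime([
        "2024-01-01 00:00:25",
        "2024-01-01 00:00:36",
        "2024-01-01 00:20:25",
        "2024-01-01 03:00:25",
    ]),
})
gaps = time_gaps(gap_example)
print(gaps.describe())
```

Running `time_gaps(df_ais).describe()` on the full data would give the spread of reporting intervals across all vessels.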

Irregular timestamps were a big challenge throughout the project, and we believe we could have built a more accurate model if they had been regularized. How we accounted for this issue is described in the Predictors section.
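
For reference, one common way to regularize a single vessel's track is to resample it onto a fixed time grid and interpolate the gaps. The sketch below is only an illustration of that idea and not necessarily what our predictors do. The helper name `regularize_track`, the 20-minute grid and the toy positions are assumptions.

```python
import pandas as pd

def regularize_track(track: pd.DataFrame, freq: str = "20min") -> pd.DataFrame:
    """Resample one vessel's positions onto a regular grid with time-based interpolation."""
    positions = track.set_index("time")[["latitude", "longitude"]].sort_index()
    regular = positions.resample(freq).mean()   # empty bins become NaN
    return regular.interpolate(method="time")   # fill NaN linearly in time

# Toy track for a single vessel with uneven reporting intervals.
track_example = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 01:00"]),
    "latitude": [0.0, 1.0, 3.0],
    "longitude": [10.0, 10.0, 13.0],
})
regular_track = regularize_track(track_example)
print(regular_track)
```

Linear interpolation of longitude breaks for tracks that cross the ±180° meridian, and long gaps (for example when a transceiver was turned off) are filled with straight lines that may not reflect the actual route.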

In [8]:
print(df_ais.head(10))

                 time    cog   sog  rot  heading  navstat       etaRaw  \
0 2024-01-01 00:00:25  284.0   0.7    0       88        0  01-09 23:00   
1 2024-01-01 00:00:36  109.6   0.0   -6      347        1  12-29 20:00   
2 2024-01-01 00:01:45  111.0  11.0    0      112        0  01-02 09:00   
3 2024-01-01 00:03:11   96.4   0.0    0      142        1  12-31 20:00   
4 2024-01-01 00:03:51  214.0  19.7    0      215        0  01-25 12:00   
5 2024-01-01 00:05:13  186.9   0.0    0      187        5  12-20 02:40   
6 2024-01-01 00:05:40  123.4   0.0  128      511        5  12-16 01:00   
7 2024-01-01 00:05:49  151.2   0.0    0       20        5  12-31 18:30   
8 2024-01-01 00:06:18  265.0   0.1    0      122        1  12-30 19:00   
9 2024-01-01 00:06:29   36.0   0.0    0       70        5  12-30 19:55   

   latitude  longitude                  vesselId                    portId  
0 -34.74370  -57.85130  61e9f3a8b937134a3c4bfdf7  61d371c43aeaecc07011a37f  
1   8.89440  -79.47939  61e9f3d

#### Exploring individual features

We explored some individual features.

#### Cleaning up features

## Predictors

#### LSTM

#### XGBoost random forest

##### Regularized and supervised random forest

##### Interval trained random forest

## Feature engineering

#### Feature Engineering on supervised and resampled XGBoost random forest

#### Feature Engineering on 5-day interval trained XGBoost random forest

## Model Interpretation

#### Feature importance plots

## Sources

Yang, Y., Yang, L., Li, G., Zhang, Z. & Liu, Y. (2024). Harnessing the power of Machine learning for AIS Data-Driven maritime Research: A comprehensive review. Transportation Research Part E: Logistics and Transportation Review, 183, 103426. https://doi.org/10.1016/j.tre.2024.103426