# Data Science Test Widget Brain.

This assignment focuses on transshipments at a certain port. Each day, many vessels arrive at this port and are served by one or more stevedores. Four cargo types have been identified (ore, coal, oil, and petroleum), and vessels often carry a mixture of cargo types. For each unique vessel arrival (i.e. each row in the data), we would like a prediction of how much it transships (the total of load and discharge activities) per cargo type. The variables of interest are therefore: discharge1, load1, discharge2, load2, discharge3, load3, discharge4 and load4.

The data for this case is stored in ‘VesselData.csv’ and contains historical data. 

We would like you to provide us with a Jupyter notebook or Python script containing your results, thoroughly commented to explain the steps you took, even those that did not lead anywhere. Please motivate your choices as much as possible: we would like to understand your approach and your line of thought.

Explanation of variables in the data:

| Variable | Explanation |
|:-----|:-----|
| eta | Estimated time of arrival of vessel |
| ata | Actual time of arrival of vessel |
| atd | Actual time of departure of vessel |
| vesseldwt | Vessel deadweight tonnage |
| vesseltype | Vessel type |
| discharge[x] | Discharge amount of cargo type x |
| load[x] | Load amount of cargo type x |
| stevedorenames | (Anonymized) stevedore IDs visited by vessel |
| hasnohamis | Boolean whether vessel has the HaMIS notification system |
| earliesteta | Estimated time of arrival of first entry to port |
| latesteta | Estimated time of arrival of last entry to port (vessel can spread transshipment(s) over multiple days) |
| traveltype | Travel type |
| previousportid | ID of previous port |
| nextportid | ID of next port |
| isremarkable | Boolean whether there is anything remarkable regarding the vessel |
| vesselid | Vessel ID |

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_csv("VesselData.csv")

In [36]:
# Flag every column that contains at least one missing value.
df.isna().any()

eta               False
ata               False
atd               False
vesseldwt          True
vesseltype        False
discharge1        False
load1             False
discharge2        False
load2             False
discharge3        False
load3             False
discharge4        False
load4             False
stevedorenames     True
hasnohamis         True
earliesteta       False
latesteta         False
traveltype        False
previousportid    False
nextportid        False
isremarkable      False
vesselid          False
total1            False
total2            False
total3            False
total4            False
eta_weekday       False
dtype: bool
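
The check above only says which columns contain missing values, not how many. A minimal sketch of a per-column count follows, using a small made-up frame rather than the real data: the column names mirror the dataset, but the values and the `toy` name are assumptions for illustration only. It also shows the difference between a per-column flag and a single flag for the whole frame, which requires calling `.any()` twice (with parentheses).

```python
import numpy as np
import pandas as pd

# Toy frame: column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "vesseldwt": [52000.0, np.nan, 18000.0],
    "hasnohamis": [1.0, np.nan, np.nan],
    "load1": [0, 120, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# One boolean for the whole frame: True if any cell is missing.
any_missing = toy.isna().any().any()

print(missing_per_column)
print(any_missing)
```

On the real data, the same two calls on `df` would show how many rows are affected in vesseldwt, stevedorenames and hasnohamis, which helps decide between dropping and imputing those rows.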

In [3]:
pd.unique(df['vesseltype'])

array([5, 3, 2, 4, 1], dtype=int64)

In [5]:
# Create total activities for each cargo type.
df["total1"] = df["discharge1"] + df["load1"]
df["total2"] = df["discharge2"] + df["load2"]
df["total3"] = df["discharge3"] + df["load3"]
df["total4"] = df["discharge4"] + df["load4"]

In [7]:
len(pd.unique(df['vesselid']))

3022

In [12]:
min(df["ata"])

'2017-04-15 00:00:00+00'

In [13]:
max(df["ata"])

'2017-11-15 00:00:00+00'

In [26]:
df.dtypes

eta                object
ata                object
atd                object
vesseldwt         float64
vesseltype          int64
discharge1          int64
load1               int64
discharge2          int64
load2               int64
discharge3          int64
load3               int64
discharge4          int64
load4               int64
stevedorenames     object
hasnohamis        float64
earliesteta        object
latesteta          object
traveltype         object
previousportid      int64
nextportid          int64
isremarkable       object
vesselid            int64
total1              int64
total2              int64
total3              int64
total4              int64
dtype: object

As a first iteration, a linear regression model can be fitted for each of the four cargo totals.

In [38]:
# Weekday of the estimated arrival (Monday = 0).
df["eta_weekday"] = pd.to_datetime(df["eta"]).dt.weekday
# Estimated spread of the port visit, converted from a timedelta to hours so it is numeric for the regression.
est_duration = pd.to_datetime(df["latesteta"]) - pd.to_datetime(df["earliesteta"])
df["est_duration"] = est_duration.dt.total_seconds() / 3600
# Encode the variables vesseltype, traveltype, previousportid, nextportid, isremarkable and vesselid as categorical variables.
# Choose the explanatory variables eta_weekday, vesseldwt, vesseltype, hasnohamis, est_duration, traveltype, previousportid, nextportid, isremarkable and vesselid; the response variables are total1, total2, total3 and total4.
# Split the dataset into a training set, a validation set and a test set in a 3:1:1 ratio.
# Create a model class that includes the following steps.
# For each response variable, fit a linear regression model on the training set.
# Measure the errors on the validation set, using the MSE metric.
# Examine the coefficients and refine the model by removing the least significant explanatory variables.
# Iterate until the validation error no longer decreases.
# Evaluate the final models on the test set.
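
The plan in the comments above could be sketched roughly as follows. This is an illustration under stated assumptions, not the final solution: it runs on a synthetic stand-in for the data (the `toy` frame, its values, and the travel type labels "A" and "B" are all made up), it uses only a few of the listed explanatory variables, and it leaves out the model class and the coefficient-based refinement step. The 3:1:1 split is obtained by first holding out 20% as the test set and then taking 25% of the remainder as the validation set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for VesselData.csv (all values are invented).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "eta_weekday": rng.integers(0, 7, n),
    "vesseldwt": rng.uniform(1_000, 100_000, n),
    "vesseltype": rng.integers(1, 6, n),
    "traveltype": rng.choice(["A", "B"], n),  # hypothetical labels
})
targets = [f"total{k}" for k in range(1, 5)]
for k, target in enumerate(targets, start=1):
    # Made-up relationship so the regression has something to fit.
    toy[target] = 0.1 * k * toy["vesseldwt"] + rng.normal(0, 500, n)

# One-hot encode the categorical variables; dtype=float keeps X fully numeric.
X = pd.get_dummies(
    toy.drop(columns=targets),
    columns=["vesseltype", "traveltype"],
    drop_first=True,
    dtype=float,
)
y = toy[targets]

# 3:1:1 split: 20% test, then 25% of the remaining 80% as validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# One linear regression per response variable, scored by MSE on the validation set.
models, val_mse = {}, {}
for target in targets:
    model = LinearRegression().fit(X_train, y_train[target])
    models[target] = model
    val_mse[target] = mean_squared_error(y_val[target], model.predict(X_val))

print(len(X_train), len(X_val), len(X_test))
print(val_mse)
```

On the real data, high-cardinality identifiers such as vesselid (3022 unique values) would produce a very wide one-hot matrix, so they may need grouping or a different encoding before this approach is practical. The test set is deliberately left untouched here and should only be scored once the refinement loop on the validation set has finished.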