### INFO284 Machine Learning Exam, spring 2024

#### Importing libraries and checking versions

In [11]:
import sys
print("Python version: {}".format(sys.version))
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import matplotlib
import matplotlib.pyplot as plt
print("matplotlib version: {}".format(matplotlib.__version__))
import numpy as np
print("NumPy version: {}".format(np.__version__))
import scipy as sp
print("SciPy version: {}".format(sp.__version__))
import IPython
print("IPython version: {}".format(IPython.__version__))
import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))
import seaborn as sns
print("seaborn version: {}".format(sns.__version__))

Python version: 3.12.2 (tags/v3.12.2:6abddd9, Feb  6 2024, 21:26:36) [MSC v.1937 64 bit (AMD64)]
pandas version: 2.2.1
matplotlib version: 3.8.3
NumPy version: 1.26.4
SciPy version: 1.12.0
IPython version: 8.21.0
scikit-learn version: 1.4.1.post1
seaborn version: 0.13.2


#### Importing the dataset

In [12]:
filePath = 'elektronisk-rapportering-ers-2018-fangstmelding-dca-simple.csv'
# The file is UTF-8 encoded and semicolon-delimited, so both are set explicitly.
df = pd.read_csv(filePath, encoding="UTF-8", delimiter=";")
print(f"Before pre-processing the dataset has {df.shape[1]} columns and {df.shape[0]} rows")


Before pre-processing the dataset has 45 columns and 305434 rows
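
To see how the parsing arguments shape what pandas loads, here is a minimal sketch on a small inline sample rather than the real file. The column names `id`, `lat` and `weight` and all values are made up for illustration, and the `decimal` parameter of `pd.read_csv` is shown only as an option; the notebook itself does not use it.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same style as the real file:
# semicolon-delimited, with a decimal comma in the "lat" column.
sample = "id;lat;weight\n1;74,885;9594\n2;74,91;8510\n3;67,828;134\n"

# Default parsing: "lat" stays a string column because of the decimal comma.
raw = pd.read_csv(io.StringIO(sample), delimiter=";")

# With decimal=",", pandas parses the same column as floats.
parsed = pd.read_csv(io.StringIO(sample), delimiter=";", decimal=",")

print(raw.dtypes)
print(parsed.dtypes)
print(f"{parsed.shape[1]} columns and {parsed.shape[0]} rows")
```

Checking `dtypes` right after loading is a cheap way to spot columns that were read as text when numbers were expected.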


#### Step 1: Choosing the target and pre-processing
After taking some time to understand the data, we have chosen the catches of Hyse (haddock), Torsk (cod) and Sei (saithe) as our targets, treated as continuous values.

Next we must pre-process the data into a form suitable for modeling. The exact form may change as we try different models, so the key skill here is the ability to manipulate the pandas DataFrame.

Ways we might process the data:
1. Remove unnecessary columns.
2. Collapse columns that carry the same information into one column.
3. Remove or correct nonsensical or missing values.
4. Transform values so they work with our model.

*Question: should we also show how we went about understanding the data?*

(Old) The dataset needs a lot of pre-processing. The current objective is to clean it and remove redundant and irrelevant columns. In this step we are actively discussing what the goal of our machine learning model will be, while getting to understand the data on a deeper level.

In [13]:
# Using the same seed across runs makes the results more comparable
seed = 32

In [14]:
# Excluding irrelevant columns

# Fangstår (catch year) has only 2 unique values
df.drop(columns=['Fangstår'], inplace=True)

# Lengdegruppe (kode), Lengdegruppe, Bruttotonnasje 1969, Bruttotonnasje annen, Bredde and Fartøylengde all describe the vessel doing the catching, with few unique values in each column.
# One of them should be kept as a feature and the rest discarded. We suggest Bruttotonnasje (gross tonnage) as the most relevant, since it indicates how much cargo space there is.
# "Bruttotonnasje 1969" and "Bruttotonnasje annen" seem to have NaN where the other has a value, so we first collapse them into one column.
df['Bruttotonnasje'] = df['Bruttotonnasje annen'].combine_first(df['Bruttotonnasje 1969'])
df.drop(columns=['Bruttotonnasje annen', "Bruttotonnasje 1969","Lengdegruppe (kode)", "Lengdegruppe", "Bredde", "Fartøylengde"], inplace=True)

# All columns with "(kode)" in their name are categorical code representations of another column. For human readability, and to avoid mistaking a code for a continuous value, we remove them.
df.drop(columns=["Hovedområde start (kode)", "Lokasjon start (kode)", "Hovedområde stopp (kode)", "Lokasjon stopp (kode)", "Redskap FAO (kode)", "Redskap FDIR (kode)", "Hovedart FAO (kode)", "Hovedart - FDIR (kode)", "Art FAO (kode)", "Art - FDIR (kode)", "Art - gruppe (kode)"], inplace=True)

# Both the "Redskap" and "Art" columns come in FAO and FDIR variants. FAO = Food and Agriculture Organization of the United Nations, FDIR = Fiskeridirektoratet (the Norwegian Directorate of Fisheries).
# Since "Hovedart" only has an uncoded FAO variant, we stick to FAO. For the same reason we remove "Art - gruppe".
df.drop(columns=["Art - gruppe", "Art - FDIR", "Redskap FDIR"], inplace=True)

# While time of day and date might be relevant, we don't need every variant of them, and we don't need to know when the catch was reported. For now we keep the start/stop date and time.
df.drop(columns=["Meldingstidspunkt", "Meldingsdato", "Meldingsklokkeslett", "Starttidspunkt", "Stopptidspunkt"], inplace=True)

# The start and stop areas take up 6 columns: a pair of coordinates and an area name for each. Since we prefer continuous features and the coordinates carry the same information as the area name, we remove the names.
df.drop(columns=["Hovedområde start", "Hovedområde stopp"], inplace=True)
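
The tonnage merge above relies on the two columns never being filled in the same row. A minimal sketch of how that assumption could be checked before calling `combine_first`, using a made-up four-row frame (the values are hypothetical; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical vessel data: each row has a value in at most one tonnage column.
toy = pd.DataFrame({
    "Bruttotonnasje 1969": [1476.0, np.nan, np.nan, 300.0],
    "Bruttotonnasje annen": [np.nan, 51.0, np.nan, np.nan],
})

# Count rows where both columns are filled; 0 means they never conflict.
both_filled = (
    toy["Bruttotonnasje 1969"].notna() & toy["Bruttotonnasje annen"].notna()
).sum()

# combine_first keeps the caller's value and falls back to the
# argument's value wherever the caller is NaN.
toy["Bruttotonnasje"] = toy["Bruttotonnasje annen"].combine_first(
    toy["Bruttotonnasje 1969"]
)

print(both_filled)
print(toy["Bruttotonnasje"].tolist())
```

If `both_filled` were above zero on the real data, the order of the two columns in `combine_first` would decide which value wins, so it would be worth inspecting those rows first.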

In [15]:
# Calculating the percentage of NaNs in each column to see whether we can afford to drop the affected rows
nanPercentage = df.isna().mean() * 100
#print(nanPercentage)
df = df.dropna()
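
Since `dropna()` removes every row with at least one missing value, the per-column percentages can understate how many rows are lost. A small sketch of both measures on a made-up frame (the values are hypothetical; `nan_percentage` and `rows_lost` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values spread over two columns.
toy = pd.DataFrame({
    "Varighet": [295, 267, np.nan, 63],
    "Trekkavstand": [3970.0, np.nan, np.nan, 1269.0],
    "Rundvekt": [9594.0, 8510.0, 134.0, 4.0],
})

# Share of missing values per column, highest first.
nan_percentage = (toy.isna().mean() * 100).sort_values(ascending=False)

# Share of rows that dropna() would remove (rows with at least one NaN).
rows_lost = toy.isna().any(axis=1).mean() * 100

print(nan_percentage)
print(f"dropna() would remove {rows_lost:.1f}% of the rows")
```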


In [16]:
# Narrowing the data down to the species we want to investigate
# More might be added later as categories or ranges

df = df[df['Art FAO'].isin(['Torsk', 'Sei', 'Hyse'])]

In [17]:
# Removing rows with out-of-range values

# From Varighet (duration) we remove anything above 400, since according to the lecturer that is at the high end of how long a single fishing session lasts.
# We treat longer values as outliers or as multiple sessions reported as one, and exclude them for now.

df = df[df['Varighet'] <= 400]

# Possibly remove some outliers from "Trekkavstand" (haul distance) as well. Above 50000 the frequency drops to about 100 instances per 5000-unit bin.
# Doing it for now; unsure whether it is necessary.
df = df[df['Trekkavstand'] <= 50000]
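
One way to arrive at a cut-off like the one for "Trekkavstand" is to count observations per fixed-width bin and see where the counts thin out. The sketch below does this with `pd.cut` on a made-up series; the values and the upper bin edge of 150000 are assumptions for illustration, not figures from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical haul distances; the real column is df['Trekkavstand'].
distance = pd.Series(
    [1200, 3900, 4800, 7500, 9900, 12000, 48000, 52000, 61000, 140000]
)

# Count observations per 5000-unit bin (left-closed intervals).
bins = np.arange(0, 150001, 5000)
counts = pd.cut(distance, bins=bins, right=False).value_counts().sort_index()

# Share of rows a cut-off at 50000 would keep.
kept_share = (distance <= 50000).mean()

print(counts[counts > 0])
print(f"A cut-off at 50000 keeps {kept_share:.0%} of the rows")
```

Printing the kept share next to the bin counts makes it easier to judge whether the filter is removing a thin tail or a meaningful part of the data.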

In [18]:
df.head()

Unnamed: 0,Melding ID,Startdato,Startklokkeslett,Startposisjon bredde,Startposisjon lengde,Havdybde start,Stoppdato,Stoppklokkeslett,Varighet,Stopposisjon bredde,Stopposisjon lengde,Havdybde stopp,Trekkavstand,Redskap FAO,Hovedart FAO,Art FAO,Rundvekt,Bruttotonnasje
1,1497178,30.12.2017,23:21,74885,16048,-335,31.12.2017,04:16,295,74914,15969,-334,3970.0,"Bunntrål, otter",Hyse,Hyse,9594.0,1476.0
2,1497178,30.12.2017,23:21,74885,16048,-335,31.12.2017,04:16,295,74914,15969,-334,3970.0,"Bunntrål, otter",Hyse,Torsk,8510.0,1476.0
4,1497178,30.12.2017,23:21,74885,16048,-335,31.12.2017,04:16,295,74914,15969,-334,3970.0,"Bunntrål, otter",Hyse,Sei,134.0,1476.0
5,1497178,31.12.2017,05:48,7491,15868,-403,31.12.2017,10:15,267,74901,16248,-277,11096.0,"Bunntrål, otter",Hyse,Hyse,9118.0,1476.0
6,1497178,31.12.2017,05:48,7491,15868,-403,31.12.2017,10:15,267,74901,16248,-277,11096.0,"Bunntrål, otter",Hyse,Torsk,6651.0,1476.0


In [19]:
# Manipulating columns

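# The coordinates are strings; here we change them to int so they're easier to use.
# Caveat: the raw values appear to use a decimal comma (e.g. "74,885"), so stripping the comma gives a scale that depends on the number of decimals ("74,91" becomes 7491).
# Later we might convert them in a different way.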
df['Startposisjon bredde'] = df['Startposisjon bredde'].str.replace(',', '').astype(int)
df['Startposisjon lengde'] = df['Startposisjon lengde'].str.replace(',', '').astype(int)
df['Stopposisjon bredde'] = df['Stopposisjon bredde'].str.replace(',', '').astype(int)
df['Stopposisjon lengde'] = df['Stopposisjon lengde'].str.replace(',', '').astype(int)

# Date/time could potentially be reduced to month/hour?
# df['Startmåned'] = df['Startdato'].astype(str).str[3:5]
# df['Starttime'] = df['Startklokkeslett'].astype(str).str[:2]
# df['Stoppmåned'] = df['Stoppdato'].astype(str).str[3:5]
# df['Stopptime'] = df['Stoppklokkeslett'].astype(str).str[:2]
# df.drop(columns=['Startdato', "Startklokkeslett", "Stoppdato", "Stoppklokkeslett"], inplace=True)

# Many of the sea depth values are positive, which doesn't make sense.
# But there are too many of them, relative to the number of entries, to dismiss them as random errors.
# In the lecture on fisheries it was mentioned that a lot of these are entered manually,
# and that most of these nonsensical sea depths are actually correct, just lacking a minus sign.
# Therefore we simply flip all positive sea depths to negative.
df['Havdybde start'] = -df['Havdybde start'].abs()
df['Havdybde stopp'] = -df['Havdybde stopp'].abs()
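
The column manipulations in this cell can be tried out on a two-row frame. The sketch below is an alternative under stated assumptions, not the notebook's method: it assumes the coordinates are decimal-comma strings and parses them as floats instead of ints, and it extracts month and hour with `pd.to_datetime` rather than string slicing. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical rows in the same format as the raw data.
toy = pd.DataFrame({
    "Startposisjon bredde": ["74,885", "74,91"],
    "Havdybde start": [-335, 403],
    "Startdato": ["30.12.2017", "31.12.2017"],
    "Startklokkeslett": ["23:21", "05:48"],
})

# Decimal comma -> decimal point, then parse as float so the scale is preserved.
toy["Startposisjon bredde"] = (
    toy["Startposisjon bredde"].str.replace(",", ".", regex=False).astype(float)
)

# Force every depth to be negative, whatever sign was entered.
toy["Havdybde start"] = -toy["Havdybde start"].abs()

# Parse date and time together, then pull out month and hour as integers.
start = pd.to_datetime(
    toy["Startdato"] + " " + toy["Startklokkeslett"], format="%d.%m.%Y %H:%M"
)
toy["Startmåned"] = start.dt.month
toy["Starttime"] = start.dt.hour

print(toy)
```

Parsing with an explicit `format` string raises an error on malformed dates instead of silently producing wrong slices, which may be useful for manually entered data.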
 

In [20]:
# Pivoting table

# Pivot the table so that rows describing the same session are combined into one
df = df.pivot_table(index=['Melding ID', 'Startdato', 'Startklokkeslett', 'Startposisjon bredde', 'Startposisjon lengde', 'Havdybde start', 'Stoppdato', 'Stoppklokkeslett', 'Varighet', 'Stopposisjon bredde', 'Stopposisjon lengde', 'Havdybde stopp', 'Trekkavstand', 'Redskap FAO', 'Hovedart FAO','Bruttotonnasje'], columns='Art FAO', values='Rundvekt', aggfunc='sum').reset_index()

# This creates a lot of NaN values (species not caught in a session), which we fill with 0
df = df.fillna(0)

# We add another column to indicate which species was the dominant catch during that session
# Might be removed or deemed redundant later on, as it has an 86% match with "Hovedart FAO"
df['Hovedfangst'] = df[['Hyse', 'Sei', 'Torsk']].idxmax(axis=1)

print(f"After pre-processing the dataset has {df.shape[1]} columns and {df.shape[0]} rows")


After pre-processing the dataset has 20 columns and 51083 rows
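
The long-to-wide reshaping above can be followed on a small example. This is a minimal sketch with made-up catches and only two index columns; the real call uses the full list of session columns as the index.

```python
import pandas as pd

# Hypothetical long-format catches: one row per species per haul.
long = pd.DataFrame({
    "Melding ID": [1, 1, 1, 2, 2],
    "Startklokkeslett": ["23:21", "23:21", "23:21", "05:48", "05:48"],
    "Art FAO": ["Hyse", "Torsk", "Sei", "Torsk", "Hyse"],
    "Rundvekt": [9594.0, 8510.0, 134.0, 7022.0, 6758.0],
})

# One row per haul, one column per species; species not caught become NaN.
wide = long.pivot_table(
    index=["Melding ID", "Startklokkeslett"],
    columns="Art FAO",
    values="Rundvekt",
    aggfunc="sum",
).reset_index()

# Missing species mean "nothing caught", so 0 is a sensible fill value here.
wide = wide.fillna(0)

# Label each haul with the species that has the largest weight.
wide["Hovedfangst"] = wide[["Hyse", "Sei", "Torsk"]].idxmax(axis=1)

print(wide)
```

Note that `idxmax` returns the first column when values tie, so a haul where all three weights are 0 would be labeled "Hyse".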


In [22]:
df.head()

Art FAO,Melding ID,Startdato,Startklokkeslett,Startposisjon bredde,Startposisjon lengde,Havdybde start,Stoppdato,Stoppklokkeslett,Varighet,Stopposisjon bredde,Stopposisjon lengde,Havdybde stopp,Trekkavstand,Redskap FAO,Hovedart FAO,Bruttotonnasje,Hyse,Sei,Torsk,Hovedfangst
0,1497178,30.12.2017,23:21,74885,16048,-335,31.12.2017,04:16,295,74914,15969,-334,3970.0,"Bunntrål, otter",Hyse,1476.0,9594.0,134.0,8510.0,Hyse
1,1497178,31.12.2017,05:48,7491,15868,-403,31.12.2017,10:15,267,74901,16248,-277,11096.0,"Bunntrål, otter",Hyse,1476.0,9118.0,67.0,6651.0,Hyse
2,1497178,31.12.2017,11:34,74883,16056,-346,31.12.2017,16:49,315,74924,15742,-496,10215.0,"Bunntrål, otter",Hyse,1476.0,12432.0,68.0,5097.0,Hyse
3,1497178,31.12.2017,17:44,74931,15785,-443,31.12.2017,21:47,243,74926,15894,-358,3214.0,"Bunntrål, otter",Torsk,1476.0,6758.0,0.0,7022.0,Torsk
4,1497229,01.01.2018,10:01,67828,12972,-71,01.01.2018,11:04,63,67827,12942,-56,1269.0,Snurrevad,Hyse,51.0,4.0,0.0,0.0,Hyse
