Info-284 Group exam
Group members: Heejung Yu, Tsz Ching, Sverre-Emil and Aaron Male

# Table of Contents
1. [Introduction](#Introduction)
2. [Preprocessing](#Preprocessing)
    - [Features](#Features)
        - [Species](#Species)
        - [Equipment](#Equipment)
        - [Gross Weight of Catch](#Gross-Weight-of-Catch)
        - [Boat Information](#Boat-Information)
        - [Location of Trip](#Location-of-Trip)
        - [Drag Distance](#Drag-Distance)
        - [Duration](#Duration)
        - [Time](#Time)
    - [Analyzation](#Analyzation)
        - [Hovedart FAO](#Hovedart-FAO)
        - [Lengdegruppe](#Lengdegruppe)
        - [Redskap FAO](#Redskap-FAO)
        - [Rundvekt](#Rundvekt)
3. [Supervised Learning](#Supervised-Learning)
    - [Picking the Machine Learning Models](#Picking-the-Machine-Learning-Models)
        - [K-NN](#K-NN)
        - [Linear Models](#Linear-Models)
        - [Naive Bayes](#Naive-Bayes)
        - [Decision Trees](#Decision-Trees)
        - [Ensembles of Decision Trees](#Ensembles-of-Decision-Trees)
        - [Neural Networks](#Neural-Networks)
        - [Kernelized Support Vector Machines](#Kernelized-Support-Vector-Machines)
    - [Our Choices](#Our-Choices)

## Introduction 
After taking a quick look at the dataset and the documents that were provided with it, we decided to try to predict whether an entry is a Bycatch. We believe that by being able to predict Bycatch we can find out if there are any boats that are not reporting their catches properly. We classify an entry as a Bycatch if "Hovedart FAO" does not match "Art FAO". We are aware that this definition of Bycatch is somewhat limited, especially considering the way "Hovedart FAO" is chosen, but defining Bycatch as every species outside the top 2-3 species would require more data.

In [None]:
import warnings # Got an irritating warning
warnings.filterwarnings('ignore')
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows', None)
pd.set_option('float_format', '{:f}'.format)

In [None]:
# Raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
dataset = pd.read_csv(r"H:\Informasjonsvitenskap\Programming\Python\Info-284\Info284_Project\Exam Task\Dataset\elektronisk-rapportering-ers-2018-fangstmelding-dca-simple.csv", sep=";")

# Dataset where the species isn't the same as the main-species
bycatch = dataset[dataset['Art FAO'] != dataset['Hovedart FAO']]
main_species = dataset[dataset['Art FAO'] == dataset['Hovedart FAO']]
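One detail of this definition is worth a quick check: in pandas, comparing two missing values with `!=` evaluates to True, so rows where both species columns are NaN would be counted as Bycatch. The sketch below uses a small made-up table (not rows from the real dataset) and a hypothetical `is_bycatch` column to show one way of keeping such rows out of the label.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; the last row has no species recorded.
toy = pd.DataFrame({
    "Art FAO": ["Torsk", "Hyse", "Sei", np.nan],
    "Hovedart FAO": ["Torsk", "Torsk", "Sei", np.nan],
})

# The plain comparison flags the all-NaN row as a mismatch.
naive_flag = toy["Art FAO"] != toy["Hovedart FAO"]

# Only trust the comparison when both species are actually recorded.
both_known = toy[["Art FAO", "Hovedart FAO"]].notna().all(axis=1)
toy["is_bycatch"] = naive_flag & both_known  # hypothetical label column

print(toy)
```

In this notebook the same effect could be reached by dropping the NaN rows before building the `bycatch` and `main_species` subsets.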

## Preprocessing

### Picking features
When choosing features to use from the dataset, we divided the 45 different columns into 8 different categories. 
<br>
<br>
The 8 categories are as follows:
<ul>
<li>Species</li>
<li>Equipment used</li>
<li>Gross weight of catch</li>
<li>Boat information</li>
<li>Location of trip</li>
<li>Drag distance</li>
<li>Duration</li>
<li>Time</li>
</ul>

We chose to use the following features in our ML model: art, hovedart, redskap, rundvekt, lengdegruppe, hovedområde.

Our methodology for picking the features was to first consider how the data could be relevant for our prediction. We then plotted the data from each feature in various graphs to see its distribution, and looked at the relationships between features.

#### Species

This category consists mainly of "Hovedart FAO" and "Art FAO". It is a necessity, as it is used to check whether an entry is classified as a Bycatch or not.

#### Equipment

How can we measure the usefulness of equipment when predicting Bycatch? We checked the distribution of equipment in the original dataset and compared it with the Bycatch dataset.

In [None]:
# A count of instances of equipment used for every Bycatch. 
count_of_equipment_used_for_only_Bycatch = bycatch.groupby(["Redskap FAO"])["Redskap FAO"].count()
count_of_equipment_used_for_only_Bycatch = count_of_equipment_used_for_only_Bycatch.sort_values(ascending=False)
count_of_equipment_used_for_only_Bycatch.plot(kind="bar")

In [None]:
# A count of instances of equipment used for every species. 
count_of_equipment_used_for_original_dataset = dataset.groupby(["Redskap FAO"])["Redskap FAO"].count()
count_of_equipment_used_for_original_dataset = count_of_equipment_used_for_original_dataset.sort_values(ascending=False)
count_of_equipment_used_for_original_dataset.plot(kind="bar")
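Because the two bar charts are on different scales, raw counts can be hard to compare by eye. One possible alternative, sketched here on made-up rows, is to put the relative share of each equipment type side by side. The gear names other than "Bunntrål" and the column names `all_entries`, `bycatch_only` and `difference` are placeholders chosen for this example.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Redskap FAO": ["Bunntrål", "Bunntrål", "Snurrevad", "Bunntrål", "Line", "Snurrevad"],
    "Art FAO": ["Torsk", "Hyse", "Torsk", "Sei", "Torsk", "Hyse"],
    "Hovedart FAO": ["Torsk", "Torsk", "Torsk", "Torsk", "Torsk", "Hyse"],
})
toy_bycatch = toy[toy["Art FAO"] != toy["Hovedart FAO"]]

# Share of each equipment type in all entries and in Bycatch entries only.
shares = pd.concat(
    {
        "all_entries": toy["Redskap FAO"].value_counts(normalize=True),
        "bycatch_only": toy_bycatch["Redskap FAO"].value_counts(normalize=True),
    },
    axis=1,
).fillna(0.0)  # equipment that never appears in the Bycatch subset gets share 0

# Positive values mean the equipment is over-represented among Bycatch entries.
shares["difference"] = shares["bycatch_only"] - shares["all_entries"]
print(shares.sort_values("difference", ascending=False))
```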

There were a few differences in the distribution: the Bycatch dataset drops off in more noticeable steps, whereas the original dataset shows a steady decline. Still, we didn't find much of a difference between the two and were unsure whether it was relevant. So we tried another approach and checked the relationship between the most common Bycatch species and their equipment.

In [None]:
# Finding the most common Bycatch species, ranked by total round weight (not by number of entries)
count_of_Bycatches_for_every_main_species = bycatch.groupby(["Art FAO"])["Rundvekt"].sum()

# Top 5 species 
top_5_common_bycatch = (count_of_Bycatches_for_every_main_species.sort_values(ascending=False))[:5]
top_5_common_bycatch = list(top_5_common_bycatch.index)

fig, ax = plt.subplots(figsize=(16, 12)) 

positions = np.arange(len(top_5_common_bycatch))*3  
width = 0.5

# Finding the most common equipment used for catching each of these species
for i, species in enumerate(top_5_common_bycatch):
    species_only_dataset = bycatch[bycatch["Art FAO"] == species]
    count_of_equipment_used = species_only_dataset.groupby("Redskap FAO")["Redskap FAO"].count()
    top_equipment_for_species = count_of_equipment_used.sort_values(ascending=False).head(5) 
    
    for j, equipment in enumerate(top_equipment_for_species.index):
        ax.bar(positions[i] + j*width, top_equipment_for_species[equipment], width, label=f'{species} - {equipment}')

ax.set_xlabel('Species and Equipment')
ax.set_ylabel('Count')
ax.set_title('Top Bycatch Species and Their Most Common Equipment')

ax.set_xticks(positions + width)
ax.set_xticklabels(top_5_common_bycatch)

plt.legend()
plt.xticks(rotation=45)
plt.show()

Here we see that the distribution of the most common equipment varies a lot between species. Although the top equipment is the same for these species, the rest of the equipment differs.

#### Gross weight of catch 

This one is relevant because of the way the main species is determined: "Main species caught, reported using the FAO species code. Main species is chosen using highest estimated weight in round kilograms." (datadokumentasjon-ers-rapport-varnivaa-5-140121, p.11) This means that the round weight of an entry will on average be higher when the species is the main species than when it is a Bycatch.
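To make the consequence of that rule concrete, here is a minimal sketch on made-up numbers. It assumes a hypothetical `report_id` column that groups the rows belonging to one catch report; the real dataset may identify reports differently. The column `derived_hovedart` is also just a name chosen for this example.

```python
import pandas as pd

# Made-up rows for illustration only; report_id is a hypothetical grouping key.
toy = pd.DataFrame({
    "report_id": [1, 1, 1, 2, 2],
    "Art FAO": ["Torsk", "Hyse", "Sei", "Sei", "Torsk"],
    "Rundvekt": [1200.0, 300.0, 50.0, 800.0, 150.0],
})

# For each report, pick the row with the highest round weight.
heaviest_rows = toy.loc[toy.groupby("report_id")["Rundvekt"].idxmax()]
main_by_report = heaviest_rows.set_index("report_id")["Art FAO"]

# Attach the derived main species to every row of the same report.
toy["derived_hovedart"] = toy["report_id"].map(main_by_report)
toy["is_bycatch"] = toy["Art FAO"] != toy["derived_hovedart"]

# By construction, main-species rows carry the heaviest weight in each report.
print(toy.groupby("is_bycatch")["Rundvekt"].mean())
```

Because the main species is picked by weight, round weight is closely tied to the label itself, which is something to keep in mind when interpreting how well a model does with this feature.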


#### Boat information

We chose to use Length group (Lengdegruppe) because the data in this feature is already categorized, which makes it easier to process and use. We believe that boat size is relevant when used together with equipment, as we believe bigger boats use trawl equipment more often. We checked the distribution of equipment for every boat size.

In [None]:
every_lengthgroup = ["28 m og over", "21-27,99 m", "15-20,99 m"]

for length_group in every_lengthgroup:
    lengdegruppe_dataset = dataset[dataset["Lengdegruppe"] == length_group]
    lengdegruppe_equipmentcount = lengdegruppe_dataset.groupby("Redskap FAO")["Redskap FAO"].count()
    lengdegruppe_equipmentcount = lengdegruppe_equipmentcount.sort_values(ascending=False)
    print(f"Top 3 common equipment for boats in category: {length_group}\n{lengdegruppe_equipmentcount.head(3)}\n")

We see that the equipment used on bigger boats is much more skewed towards trawls than the equipment used on smaller boats. Assuming that trawls are a less selective type of equipment, we believe that catches from bigger boats are more likely to include Bycatch.
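The same comparison can also be read off a single table. The sketch below, on made-up rows, uses `pd.crosstab` with `normalize="index"` so that each row shows the share of each equipment type within one length group. The gear names "Snurrevad" and "Line" are placeholders for this example.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Lengdegruppe": ["28 m og over"] * 4 + ["15-20,99 m"] * 4,
    "Redskap FAO": [
        "Bunntrål, otter", "Bunntrål, otter", "Bunntrål, otter", "Snurrevad",
        "Snurrevad", "Snurrevad", "Line", "Bunntrål, otter",
    ],
})

# Each row sums to 1: the share of every equipment type within one length group.
gear_share = pd.crosstab(toy["Lengdegruppe"], toy["Redskap FAO"], normalize="index")
print(gear_share.round(2))
```

Shares make the length groups comparable even when they contain very different numbers of entries.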

#### Location of trip

To check if the area is relevant, we chose to look at a Bycatch dataset and a main species dataset for torsk, because it is the most common species. We compared the areas where torsk was caught as a Bycatch with the areas where it was caught as the main species.

In [None]:
torsk_only = dataset[dataset['Art FAO'] == "Torsk"]
torsk_only_Bycatch = torsk_only[torsk_only['Hovedart FAO'] != "Torsk"]
torsk_only_main = torsk_only[torsk_only['Hovedart FAO'] == "Torsk"]

In [None]:

# Count the occurrences of each unique value in 'Hovedområde start', sort, and select top 10
top_10_areas_bycatch_start = torsk_only_Bycatch['Hovedområde start'].value_counts().head(10)

# Plot the top 10 areas
top_10_areas_bycatch_start.plot(kind='pie', figsize=(12, 10), fontsize=10, autopct='%1.1f%%', startangle=140)

plt.title('Top 10 Torsk Bycatch Main Start Areas', fontsize=12)
plt.ylabel('')
plt.legend(title='Main Start Areas', loc='upper right', bbox_to_anchor=(0, 1))
plt.show()


In [None]:

# Count the occurrences of each unique value in 'Hovedområde start', sort, and select top 10
top_10_areas = torsk_only_main['Hovedområde start'].value_counts().head(10)

# Plot the top 10 areas
top_10_areas.plot(kind='pie', figsize=(12, 10), fontsize=10, autopct='%1.1f%%', startangle=140)

plt.title('Top 10 Torsk Non-Bycatch Main Start Areas', fontsize=12)
plt.ylabel('')
plt.legend(title='Main Start Area', loc='upper right', bbox_to_anchor=(0, 1))
plt.show()

Interestingly, there is a big difference in where the fish is caught when we compare Bycatch with main species.
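To put a number on how different two pie charts are, one option (not something we used for the plots above) is the total variation distance between the two area distributions: half the sum of the absolute differences in share, ranging from 0 for identical distributions to 1 for completely disjoint ones. The helper name `total_variation_distance` and the area names are made up for this sketch.

```python
import pandas as pd


def total_variation_distance(a: pd.Series, b: pd.Series) -> float:
    """Half the summed absolute difference between two categorical distributions."""
    p = a.value_counts(normalize=True)
    q = b.value_counts(normalize=True)
    # Align on the union of categories; a category missing on one side has share 0.
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())


# Made-up start areas for illustration only.
areas_bycatch = pd.Series(["Area A", "Area A", "Area B", "Area C"])
areas_main = pd.Series(["Area A", "Area B", "Area B", "Area B"])

tvd = total_variation_distance(areas_bycatch, areas_main)
print(tvd)
```

On the real data the two inputs would be the 'Hovedområde start' columns of the Bycatch and main species subsets.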

#### Drag distance

To check if drag distance is relevant when predicting Bycatch we just took a quick look and compared values. 


In [None]:
print("Drag distance for original dataset:\n", dataset["Trekkavstand"].describe())
print("\n")
print("Drag distance for bycatch:\n", bycatch["Trekkavstand"].describe())

There doesn't seem to be anything special here. Maybe drag distance matters for torsk when a specific equipment is used? We used the most common equipment for catching torsk: Bunntrål (recorded as "Bunntrål, otter" in the dataset).

In [None]:
torsk_only_dragdistance = dataset[dataset["Art FAO"] == "Torsk"]
# Removing an outlier so it's easier to read the graph
torsk_only_dragdistance = torsk_only_dragdistance[torsk_only_dragdistance["Trekkavstand"] < 200000]

torsk_as_Bycatch_dragdistance = torsk_only_dragdistance[torsk_only_dragdistance["Hovedart FAO"] != "Torsk"]
torsk_as_Bycatch_dragdistance = torsk_as_Bycatch_dragdistance[torsk_as_Bycatch_dragdistance["Redskap FAO"] == "Bunntrål, otter"]


torsk_as_main_dragdistance = torsk_only_dragdistance[torsk_only_dragdistance["Hovedart FAO"] == "Torsk"]
torsk_as_main_dragdistance = torsk_as_main_dragdistance[torsk_as_main_dragdistance["Redskap FAO"] == "Bunntrål, otter"]

In [None]:
plt.figure(figsize=(10, 8))

# Scatter plot for bycatch
plt.scatter(torsk_as_Bycatch_dragdistance["Rundvekt"], torsk_as_Bycatch_dragdistance["Trekkavstand"], color='blue', alpha=0.5, label='Torsk as Bycatch')

# Adding labels and title
# Trekkavstand appears to be recorded in metres (typical values are around 50 000), not km
plt.xlabel("Gross Weight (kg)", fontsize=14)
plt.ylabel("Drag Distance (m)", fontsize=14)
plt.title("Torsk as Bycatch: Gross Weight vs. Drag Distance", fontsize=16)

# Adding a legend that names the plotted subset
plt.legend(title="Catch Type", title_fontsize='13', fontsize='12')

# Show the plot
plt.show()

In [None]:
plt.figure(figsize=(10, 8))

# Scatter plot for main catch
plt.scatter(torsk_as_main_dragdistance["Rundvekt"], torsk_as_main_dragdistance["Trekkavstand"], color='red', alpha=0.5, label='Torsk as Main Catch')

# Adding labels and title
# Trekkavstand appears to be recorded in metres (typical values are around 50 000), not km
plt.xlabel("Gross Weight (kg)", fontsize=14)
plt.ylabel("Drag Distance (m)", fontsize=14)
plt.title("Torsk as Main Catch: Gross Weight vs. Drag Distance", fontsize=16)

# Adding a legend that names the plotted subset
plt.legend(title="Catch Type", title_fontsize='13', fontsize='12')

# Show the plot
plt.show()

In [None]:
# amount of entries from 40 000 to 75 000 in Bycatch
filtered_dragdistance = torsk_as_Bycatch_dragdistance[(torsk_as_Bycatch_dragdistance["Trekkavstand"] > 40000) & (torsk_as_Bycatch_dragdistance["Trekkavstand"] < 75000)]

total_instances = filtered_dragdistance.shape[0]

print(total_instances)

Here we see that drag distance makes very little difference: most of the points in the scatter plots are concentrated around 50-60k. There are slightly fewer points around 60k in the main species graph, but this concerns only 800 entries.

#### Duration

To check if duration is relevant when predicting Bycatch we just took a quick look and compared values. 

In [None]:
print("Duration for original dataset:\n", dataset["Varighet"].describe())
print("\n")
print("Duration for bycatch:\n", bycatch["Varighet"].describe())

As we can see, the 25th, 50th and 75th percentiles are more or less the same, so we believe this feature doesn't really help us predict Bycatch.
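The two `describe()` outputs above compare the full dataset with the Bycatch subset. A variation that places Bycatch and non-Bycatch rows directly next to each other is sketched below on made-up durations. The `catch_type` column is a hypothetical label added for this example.

```python
import pandas as pd

# Made-up durations for illustration only.
toy = pd.DataFrame({
    "Varighet": [120, 300, 240, 180, 360, 60],
    "catch_type": ["main", "bycatch", "main", "bycatch", "main", "bycatch"],
})

# One row of summary statistics per group, so the quartiles line up in columns.
summary = toy.groupby("catch_type")["Varighet"].describe()
print(summary[["count", "25%", "50%", "75%"]])
```

The same pattern would work for "Trekkavstand" in the drag distance section.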

#### Time

To see the relevance of time for our ML model, we can compare the start times of each fishing trip for torsk where the equipment used is the same.

In [None]:
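# .copy() gives independent DataFrames, so adding columns below does not trigger SettingWithCopyWarning
torsk_only_Bycatch_bunntrål = torsk_only_Bycatch[torsk_only_Bycatch["Redskap FAO"] == "Bunntrål, otter"].copy()
torsk_only_main_bunntrål = torsk_only_main[torsk_only_main["Redskap FAO"] == "Bunntrål, otter"].copy()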

# Convert columns to datetime format to extract the hour
torsk_only_Bycatch_bunntrål['Startklokkeslett'] = pd.to_datetime(torsk_only_Bycatch_bunntrål['Startklokkeslett'], format='%H:%M')
torsk_only_main_bunntrål['Startklokkeslett'] = pd.to_datetime(torsk_only_main_bunntrål['Startklokkeslett'], format='%H:%M')

# Extract the hour from the start time in both main and Bycatch dataframe
torsk_only_Bycatch_bunntrål['Startklokkeslett_time'] = torsk_only_Bycatch_bunntrål['Startklokkeslett'].dt.hour
torsk_only_main_bunntrål['Startklokkeslett_time'] = torsk_only_main_bunntrål['Startklokkeslett'].dt.hour

# counting hours
hourly_distribution_start = torsk_only_Bycatch_bunntrål['Startklokkeslett_time'].value_counts().sort_index()
hourly_distribution_start2 = torsk_only_main_bunntrål['Startklokkeslett_time'].value_counts().sort_index()

We chose to use pie charts because they visualize the data better when we want to see percentages.

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
hourly_distribution_start.plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title('Hourly Distribution of Torsk Bycatch Start Times')

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
hourly_distribution_start2.plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title('Hourly Distribution of Torsk Main Catch Start Times')

plt.show()

As we can see, the start times for Bycatch and main species are basically the same. This makes us believe that the time of day when the fishing is done doesn't affect whether an entry is a Bycatch.

### Analyzation
The primary objective of this analysis is to explore outliers and missing values within each of the chosen features of our dataset. By examining them, we aim to understand their impact on the dataset and determine appropriate strategies for handling them. This process involves identifying outliers, assessing their significance, and deciding whether to keep, modify, or remove them.
#### Hovedart FAO
Since "Hovedart FAO" is the feature we use to identify whether an entry is a Bycatch, the only thing we have to consider here is NaN values.

In [None]:
print(dataset["Hovedart FAO"].isna().sum())
print(dataset["Art FAO"].isna().sum())

# Drop all rows where Hovedart FAO or Art FAO is NaN
dataset = dataset.dropna(subset=["Hovedart FAO", "Art FAO"])

print(dataset["Hovedart FAO"].isna().sum())
print(dataset["Art FAO"].isna().sum())

Since this feature is a vital part of our ML model, we drop all rows where it is NaN. There are only 5000 entries with a missing species, so this shouldn't impact our model much.

#### Lengdegruppe
Since this feature is already categorized, outliers aren't an issue here. We believe it's only relevant to look at NaN values.


In [None]:
print(dataset["Lengdegruppe"].isna().sum())

lengdegruppe_is_NaN = dataset[(dataset["Lengdegruppe"].isna())]

pd.DataFrame(lengdegruppe_is_NaN.head(10))

At first we were planning on dropping these rows. However, when we look at the actual values, we can see that most catches without a specified boat length are Stortare catches.

In [None]:
art_fao_counts_lengdegruppe = lengdegruppe_is_NaN['Art FAO'].value_counts()

# Printing the counts for each unique value in 'Art FAO'
print(art_fao_counts_lengdegruppe)

Rather than dropping the NaN values, we convert them into a special group, because all entries with a missing Lengdegruppe are Stortare catches.

In [None]:
dataset['Lengdegruppe'] = dataset['Lengdegruppe'].fillna('Stortare båter')
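Since the fill value only makes sense if every missing length group really belongs to a Stortare catch, a small guard can make that assumption explicit. The sketch below uses made-up rows and only fills the gaps when the check passes.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Art FAO": ["Stortare", "Stortare", "Torsk", "Hyse", "Stortare"],
    "Lengdegruppe": [np.nan, np.nan, "28 m og over", "15-20,99 m", np.nan],
})

# Which species occur on rows where the length group is missing?
missing = toy["Lengdegruppe"].isna()
species_with_missing = set(toy.loc[missing, "Art FAO"])
print(species_with_missing)

# Only use the special category if the assumption holds for every missing row.
if species_with_missing == {"Stortare"}:
    toy["Lengdegruppe"] = toy["Lengdegruppe"].fillna("Stortare båter")
```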

#### Redskap FAO
Looking at NaN values

In [None]:
print(dataset["Redskap FAO"].isna().sum())
redskap_FAO_is_NaN = dataset[dataset["Redskap FAO"].isna()]
pd.DataFrame(redskap_FAO_is_NaN)


Since there are only 200 NaN values, the easiest solution is to drop them. Although these rows aren't missing any values other than Redskap FAO, we feel that imputing or replacing the missing values would take a lot of work for an unnoticeable gain in the ML model's accuracy.

In [None]:
dataset = dataset.dropna(subset=["Redskap FAO"])

#### Rundvekt
Since different species most likely have different average gross weights, we should look for outliers within each individual species. We chose to use box plots to visualize these because they show outliers well.

In [None]:
species_grossweight = dataset.groupby("Art FAO")["Rundvekt"].apply(list)

# Define the species of interest
species_of_interest = ["Torsk", "Hyse", "Sei"]

# Filter the aggregated data to include only the species of interest
filtered_species_grossweight = {species: weights for species, weights in species_grossweight.items() if species in species_of_interest}

# Iterate over the filtered Series using .items() for species and their corresponding gross weights
for species, grossweights in filtered_species_grossweight.items():
    plt.figure(figsize=(10, 6))  # Create a new figure for each species
    
    # Create a boxplot for the species
    plt.boxplot(grossweights)
    plt.title(f'Boxplot of Rundvekt for {species}')
    plt.ylabel('Rundvekt')
    plt.xticks([1], [species])  # Set the x-tick to the name of the current species
    
    plt.show()  # Show the plot for the current species

There seem to be many outliers for each species. We believe it will be difficult to remove them outright, since there are 122 different species and some probably have few outliers while others have a lot. According to "Introduction to Machine Learning with Python: A Guide for Data Scientists" by Andreas C. Müller and Sarah Guido (p.133), a RobustScaler offers a solution to this issue. It scales the data using the median and quartiles instead of the mean and variance, so points that deviate significantly from the rest have little influence on the scaling. This makes it particularly suitable for our dataset, where outliers are prevalent but their outright removal is impractical.
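To illustrate why median-based scaling copes better with outliers, the sketch below applies both kinds of scaling by hand to a few made-up weights. The robust version subtracts the median and divides by the interquartile range, which is the centering and scaling scikit-learn's RobustScaler applies with its default settings. The standard version uses the mean and standard deviation.

```python
import pandas as pd

# Made-up weights for illustration only; the last value is an extreme outlier.
weights = pd.Series([100.0, 120.0, 130.0, 150.0, 20000.0], name="Rundvekt")

# Robust scaling: center on the median, scale by the interquartile range.
median = weights.median()
q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
robust = (weights - median) / (q3 - q1)

# Standard scaling: center on the mean, scale by the standard deviation.
standard = (weights - weights.mean()) / weights.std(ddof=0)

print(pd.DataFrame({"robust": robust, "standard": standard}).round(3))
```

With standard scaling the outlier inflates the mean and standard deviation, so the four ordinary values end up squeezed close together. With robust scaling they stay clearly separated and only the outlier lands far away.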

Checking for NaN values in Gross weight. 

In [None]:
print(dataset["Rundvekt"].isna().sum())

## Supervised learning
"Introduction to Machine Learning with Python" discusses seven different machine learning models: k-Nearest Neighbors (k-NN), linear models, Naive Bayes, decision trees, ensembles of decision trees, neural networks, and kernelized support vector machines (SVMs).

### Picking the Machine learning models 
We have 6 features, of which 5 are categorical and 1 is continuous, so we have to keep this mix in mind when picking models.
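Most of these models expect numeric input, so the categorical features need to be encoded first. A minimal sketch of one common option, one-hot encoding with `pd.get_dummies`, is shown below on made-up rows with only three of the features. The gear name "Snurrevad" is a placeholder, and this is not necessarily the encoding we end up using.

```python
import pandas as pd

# Made-up rows for illustration only, using a subset of the chosen features.
toy = pd.DataFrame({
    "Redskap FAO": ["Bunntrål, otter", "Snurrevad", "Bunntrål, otter"],
    "Lengdegruppe": ["28 m og over", "15-20,99 m", "28 m og over"],
    "Rundvekt": [1200.0, 300.0, 50.0],
})

categorical = ["Redskap FAO", "Lengdegruppe"]

# Each category becomes its own indicator column; Rundvekt is left unchanged.
encoded = pd.get_dummies(toy, columns=categorical)
print(encoded.columns.tolist())
```

With many distinct species and areas, one-hot encoding produces a wide and sparse feature matrix, which is one more thing to weigh when comparing the models.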

#### K-NN

#### Linear models

#### Naive Bayes

#### Decision trees

#### Ensembles of decision trees

#### Neural networks

#### Kernelized support vector machines

### Our choices 

####

####

####

####