# Group exam I

## Project description

We want to create a KNN model that predicts the fish species most likely to be caught, given the following features:
- Tools used 
- Start area 
- Starting time (date and time)
- Live weight

## Goal

The goal here is not to achieve high accuracy, but to practice and reflect upon the steps needed to create and train a model. We are limiting ourselves to very few features in order to keep the complexity low.

## Import libraries

First, we import the libraries we need for preprocessing our dataset and creating our model. If you encounter errors when importing a library, make sure it is installed on your computer.

If you cloned the repository from our GitHub, please follow the instructions in README.md.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

### Read file

Then we read the CSV file into a pandas dataframe. Calling head() lets us inspect the first five rows of the dataframe, so we can make sure that everything went OK.

PS: The file path used here assumes that you cloned the repository and placed your CSV file in the Resources folder. Please provide the path to your own file if this is not the case.


In [4]:
df = pd.read_csv('../Resources/elektronisk-rapportering-ers-2018-fangstmelding-dca-simple.csv', delimiter = ";")
df.head()

Unnamed: 0,Melding ID,Meldingstidspunkt,Meldingsdato,Meldingsklokkeslett,Starttidspunkt,Startdato,Startklokkeslett,Startposisjon bredde,Startposisjon lengde,Hovedområde start (kode),...,Art - FDIR,Art - gruppe (kode),Art - gruppe,Rundvekt,Lengdegruppe (kode),Lengdegruppe,Bruttotonnasje 1969,Bruttotonnasje annen,Bredde,Fartøylengde
0,1497177,01.01.2018,01.01.2018,00:00,31.12.2017,31.12.2017,00:00,-6035,-46133,,...,Antarktisk krill,506.0,Antarktisk krill,706714.0,5.0,28 m og over,9432.0,,1987,13388
1,1497178,01.01.2018,01.01.2018,00:00,30.12.2017 23:21,30.12.2017,23:21,74885,16048,20.0,...,Hyse,202.0,Hyse,9594.0,5.0,28 m og over,1476.0,,126,568
2,1497178,01.01.2018,01.01.2018,00:00,30.12.2017 23:21,30.12.2017,23:21,74885,16048,20.0,...,Torsk,201.0,Torsk,8510.0,5.0,28 m og over,1476.0,,126,568
3,1497178,01.01.2018,01.01.2018,00:00,30.12.2017 23:21,30.12.2017,23:21,74885,16048,20.0,...,Blåkveite,301.0,Blåkveite,196.0,5.0,28 m og over,1476.0,,126,568
4,1497178,01.01.2018,01.01.2018,00:00,30.12.2017 23:21,30.12.2017,23:21,74885,16048,20.0,...,Sei,203.0,Sei,134.0,5.0,28 m og over,1476.0,,126,568
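
The role of the delimiter argument can be seen on a small in-memory example. This is only a sketch: the three-column text below is made up for illustration and is not taken from the real file. In pandas, `delimiter` is an alias for `sep`.

```python
import io
import pandas as pd

# Hypothetical miniature of a semicolon-separated file; the values are made up.
raw = "Art - FDIR;Rundvekt;Redskap FDIR\nTorsk;8510;Bunntrål\nSei;134;Bunntrål\n"

# sep=";" tells pandas that fields are separated by semicolons rather than commas.
toy = pd.read_csv(io.StringIO(raw), sep=";")

print(toy.shape)
print(toy.head())
```

Without the `sep` argument, pandas would look for commas and read each line as a single column.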


Everything looks good so far! The next step is to get rid of the columns we deem unnecessary for our purpose. This can be achieved either by dropping the columns we don't want, or by copying the columns we DO want into a new dataframe. Since we care about very few features, the easiest thing is to copy the needed columns into a new frame:

In [5]:
df = df[['Art - FDIR','Rundvekt', 'Hovedområde start', 'Starttidspunkt', 'Redskap FDIR']].copy()
df

Unnamed: 0,Art - FDIR,Rundvekt,Hovedområde start,Starttidspunkt,Redskap FDIR
0,Antarktisk krill,706714.0,,31.12.2017,Flytetrål
1,Hyse,9594.0,Bjørnøya,30.12.2017 23:21,Bunntrål
2,Torsk,8510.0,Bjørnøya,30.12.2017 23:21,Bunntrål
3,Blåkveite,196.0,Bjørnøya,30.12.2017 23:21,Bunntrål
4,Sei,134.0,Bjørnøya,30.12.2017 23:21,Bunntrål
...,...,...,...,...,...
305429,Gråsteinbit,145.0,Vest-Spitsbergen,31.12.2018 19:41,Bunntrål
305430,Uer (vanlig),136.0,Vest-Spitsbergen,31.12.2018 19:41,Bunntrål
305431,Flekksteinbit,132.0,Vest-Spitsbergen,31.12.2018 19:41,Bunntrål
305432,Snabeluer,102.0,Vest-Spitsbergen,31.12.2018 19:41,Bunntrål
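
The two approaches mentioned above (dropping unwanted columns versus copying wanted ones) lead to the same frame. A minimal sketch on a toy frame, where the column names `species`, `weight` and `vessel_length` are hypothetical stand-ins:

```python
import pandas as pd

toy = pd.DataFrame({
    "species": ["Torsk", "Sei"],
    "weight": [8510.0, 134.0],
    "vessel_length": [56.8, 56.8],  # a column we do not need
})

# Option 1: drop what we do not want.
dropped = toy.drop(columns=["vessel_length"])

# Option 2: select what we do want and copy it into a new frame.
selected = toy[["species", "weight"]].copy()

# Both options produce identical frames.
print(dropped.equals(selected))
```

When the original frame has many columns and only a handful are needed, the second option is shorter to write.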


### Data types

It is nice to know what data types we are dealing with. dtypes gives us an overview:

In [6]:
df.dtypes

Art - FDIR            object
Rundvekt             float64
Hovedområde start     object
Starttidspunkt        object
Redskap FDIR          object
dtype: object

### Null values

Most datasets contain null values, for various reasons. Most models in scikit-learn will give us an error when encountering these. It is therefore important that we deal with null values during the preprocessing phase. Decisions must now be made: do we want to drop rows or columns containing null values? Or do we perhaps want to fill in the missing cells with another value (e.g. the mean)? The latter approach is usually referred to as 'imputation'.

If we choose to drop rows or columns, we run the risk of losing important information needed for training our model. On the other hand, filling in missing values could supply the model with a potentially large amount of inaccurate information. According to kaggle.com, imputation usually yields better results (https://www.kaggle.com/code/alexisbcook/missing-values).
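
As a rough illustration of what imputation can look like, here is a sketch on a toy frame with made-up values. The column names `weight` and `gear` are hypothetical. It fills a numeric column with its mean and a categorical column with its most frequent value; these are simple choices for illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "weight": [100.0, np.nan, 300.0, 200.0],
    "gear": ["Bunntrål", "Bunntrål", None, "Teiner"],
})

imputed = toy.copy()

# Numeric column: replace missing values with the column mean.
imputed["weight"] = imputed["weight"].fillna(imputed["weight"].mean())

# Categorical column: replace missing values with the most frequent category.
imputed["gear"] = imputed["gear"].fillna(imputed["gear"].mode()[0])

print(imputed)
```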

Before making any decisions, we can investigate the number of null values found in our columns. If the number is small relative to the size of our entire dataset, the consequences of either dropping or replacing should (hopefully) not be critical.

We use the isnull() method on our dataframe to iterate each column checking for null values. sum() gives us the total for each column:

In [7]:
print(df.isnull().sum())

Art - FDIR           4982
Rundvekt             4978
Hovedområde start    4124
Starttidspunkt          0
Redskap FDIR          188
dtype: int64


We see that 4 out of 5 columns contain null values. Dropping these columns would be a terrible idea. And since terrible ideas usually should not be entertained, we have to consider either dropping rows or filling in values instead.

So how many rows would we potentially have to drop?

From the numbers above we find that, out of 305434 rows (the entire dataframe), the largest number of null values is found in the column 'Art - FDIR' (4982). In the worst case, where each affected row contains only one null value, there would be 4982 + 4978 + 4124 + 0 + 188 = 14272 rows scheduled for meeting their maker (a metaphor for deletion; meeting the data scientist in charge of datasets would not be useful). If several null values share a row, the number of dropped rows will be lower.
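
The difference between this worst-case estimate and the actual number of affected rows can be sketched on a toy frame with made-up values (the column names are hypothetical). Summing the per-column counts gives the upper bound, while checking each row for any null value gives the number of rows that dropna() would remove:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "species": ["Torsk", None, "Sei", None],
    "weight": [8510.0, np.nan, 134.0, 50.0],
    "area": ["Bjørnøya", "Bjørnøya", None, "Bjørnøya"],
})

# Upper bound: every null cell counted separately.
upper_bound = int(toy.isnull().sum().sum())

# Exact count: rows that contain at least one null value.
rows_with_null = int(toy.isnull().any(axis=1).sum())

print(upper_bound, rows_with_null)
```

Here the second row has two null values, so the exact count is lower than the upper bound.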

And how about imputation?

At this stage of our machine learning journey, we don't feel we have the required knowledge to perform imputation on categorical data.

So let's do some good ol' row droppin' using the pandas dropna() method:


In [8]:
df = df.dropna()
df

Unnamed: 0,Art - FDIR,Rundvekt,Hovedområde start,Starttidspunkt,Redskap FDIR
1,Hyse,9594.0,Bjørnøya,30.12.2017 23:21,Bunntrål
2,Torsk,8510.0,Bjørnøya,30.12.2017 23:21,Bunntrål
3,Blåkveite,196.0,Bjørnøya,30.12.2017 23:21,Bunntrål
4,Sei,134.0,Bjørnøya,30.12.2017 23:21,Bunntrål
5,Hyse,9118.0,Bjørnøya,31.12.2017 05:48,Bunntrål
...,...,...,...,...,...
305429,Gråsteinbit,145.0,Vest-Spitsbergen,31.12.2018 19:41,Bunntrål
305430,Uer (vanlig),136.0,Vest-Spitsbergen,31.12.2018 19:41,Bunntrål
305431,Flekksteinbit,132.0,Vest-Spitsbergen,31.12.2018 19:41,Bunntrål
305432,Snabeluer,102.0,Vest-Spitsbergen,31.12.2018 19:41,Bunntrål


Dropped rows: 305434 - 296477 = 8957

This does not seem too bad considering the size of our dataset (just under 3% of the rows), so we can proceed.

### Dates and time

'Starttidspunkt' contains information about the start date and time of a catch operation. This is currently stored as a string in our dataframe. In order to please our model we have to convert it into numerical values. First, we use to_datetime() to convert our values into datetime types.

To make it easy to differentiate between the values for months, days and hours, we split them into separate columns and store them as integers.

Why don't we include year and minutes?

In this specific case we are curious about whether time of day and time of year make a difference for the probability of catching a certain species of fish. We assume that the model can do this work without needing to consider years and minutes:

In [9]:
df = df.copy()

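# The dates are written day first (dd.mm.yyyy). Without dayfirst=True,
# ambiguous dates such as 01.02.2018 may be read as January 2nd.
df['Starttidspunkt'] = pd.to_datetime(df['Starttidspunkt'], format='mixed', dayfirst=True)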
df['month'] = df['Starttidspunkt'].dt.month
df['day'] = df['Starttidspunkt'].dt.day
df['hour'] = df['Starttidspunkt'].dt.hour
df = df.drop(['Starttidspunkt'], axis=1)

df.dtypes

Art - FDIR            object
Rundvekt             float64
Hovedområde start     object
Redskap FDIR          object
month                  int32
day                    int32
hour                   int32
dtype: object
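
As an alternative sketch, the two date layouts visible in the output above (date only, and date with time) can be parsed with explicit format strings, which leaves no room for day and month to be swapped. This assumes that only these two layouts occur in the column; the sample strings below are made up in the same style.

```python
import pandas as pd

s = pd.Series(["31.12.2017", "30.12.2017 23:21", "01.02.2018 05:48"])

# Try the layout with a time first; values that do not match become NaT.
with_time = pd.to_datetime(s, format="%d.%m.%Y %H:%M", errors="coerce")

# Then try the date-only layout.
date_only = pd.to_datetime(s, format="%d.%m.%Y", errors="coerce")

# Use the date-only result wherever the first attempt failed.
parsed = with_time.fillna(date_only)

parts = pd.DataFrame({
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "hour": parsed.dt.hour,
})
print(parts)
```

Note that a value without a time ends up with hour 0, which the model cannot tell apart from a catch that really started at midnight.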

### One hot encoding

The KNN algorithm finds the nearest neighbours by measuring the distance between data points; the Euclidean distance is used in this case. This calculation does not work on categorical values ('Hovedområde start', 'Redskap FDIR'), so we have to find a way to translate them into numerical values. It is important that the translation does not introduce an order that is not present in the original values: if the categories have no natural ranking, one category should not be translated into a higher number than another.
E.g. 'Bjørnøya' should not be worth 3 while 'Vest-Spitsbergen' is worth 1. However, we still want to map them to distinct values so that our model is able to differentiate between them.

To do so, we use one-hot encoding. This encodes each categorical value as a binary vector whose length equals the number of unique categorical values. The entry corresponding to our original value is set to 1, while the remaining entries are set to 0.
To integrate the vector into our dataframe, each entry gets its own column.
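
The effect on distances can be sketched by hand with NumPy. The integer codes below are arbitrary and hypothetical; the helper `dist` is written only for this illustration. With integer codes, some areas end up "closer" to each other than others for no real reason, while with one-hot vectors every pair of different areas is equally far apart.

```python
import numpy as np

areas = ["Bjørnøya", "Vest-Spitsbergen", "Eigersundbanken"]

# Arbitrary integer codes (hypothetical) impose an order that is not in the data.
codes = {"Bjørnøya": 3.0, "Vest-Spitsbergen": 1.0, "Eigersundbanken": 2.0}

# One-hot vectors: row i of the identity matrix is the vector for category i.
one_hot = {name: np.eye(len(areas))[i] for i, name in enumerate(areas)}

def dist(a, b):
    """Euclidean distance between two values or vectors."""
    return float(np.linalg.norm(np.atleast_1d(a) - np.atleast_1d(b)))

# Integer codes: the distances differ depending on which codes were picked.
print(dist(codes["Bjørnøya"], codes["Vest-Spitsbergen"]))
print(dist(codes["Bjørnøya"], codes["Eigersundbanken"]))

# One-hot: every pair of different categories has the same distance.
print(dist(one_hot["Bjørnøya"], one_hot["Vest-Spitsbergen"]))
print(dist(one_hot["Bjørnøya"], one_hot["Eigersundbanken"]))
```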

Luckily we don't have to program this ourselves, as pandas offers a function called get_dummies() that does it for us:

In [10]:
df = pd.get_dummies(df, columns=['Hovedområde start'])
df = pd.get_dummies(df, columns=['Redskap FDIR'])
df

Unnamed: 0,Art - FDIR,Rundvekt,month,day,hour,Hovedområde start_Admiralityfeltet,Hovedområde start_Bjørnøya,Hovedområde start_Britvinfeltet,Hovedområde start_Danmarkstredet,Hovedområde start_Eigersundbanken,...,Redskap FDIR_Flytetrål par,Redskap FDIR_Harpun og lignende uspesifiserte typer,Redskap FDIR_Juksa/pilk,Redskap FDIR_Reketrål,Redskap FDIR_Settegarn,Redskap FDIR_Snurpenot/ringnot,Redskap FDIR_Snurrevad,Redskap FDIR_Teiner,Redskap FDIR_Udefinert garn,Redskap FDIR_Udefinert trål
1,Hyse,9594.0,12,30,23,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,Torsk,8510.0,12,30,23,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,Blåkveite,196.0,12,30,23,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,Sei,134.0,12,30,23,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,Hyse,9118.0,12,31,5,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
305429,Gråsteinbit,145.0,12,31,19,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
305430,Uer (vanlig),136.0,12,31,19,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
305431,Flekksteinbit,132.0,12,31,19,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
305432,Snabeluer,102.0,12,31,19,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### Splitting our dataset

We are doing supervised learning here, so we need to feed the model a chunk of data containing the correct labels in the training stage. This enables it to evaluate the combination of individual feature values and map the resulting value to a specific label. The model then makes predictions on the test data (labels excluded) based on this mapping.

First we need to separate the target value (label) from the rest of the features.
By convention, the target values are stored in a variable 'y' while the remaining feature values are stored in a variable 'X'.

In [11]:
X = df.drop(['Art - FDIR'], axis=1)
# Single brackets select the label column as a 1-D Series, which is what scikit-learn expects for y
y = df['Art - FDIR']


We then have to separate the values into one set used for training and one set used for testing.
Scikit-learn offers a function for this, called train_test_split(). It lets us decide how much of the dataset should be used for training and how much for testing; with test_size=0.3, 30% of the rows are held back for testing.
Setting the random_state parameter ensures that the data is divided into the same training and testing sets each time we execute our code. This way, we get the same result every time, making it easier to assess our model.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
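
The effect of test_size and random_state can be checked on a small toy array. This is a sketch with made-up data; the names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 rows, 2 features
y_toy = np.arange(10)

# Two independent calls with the same random_state.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# 30% of 10 rows go to the test set, the rest to the training set.
print(a_train.shape, a_test.shape)

# Both calls selected exactly the same rows.
print(np.array_equal(a_test, b_test))
```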

### Training our model

After preprocessing the data and splitting it into designated sets for training and testing, we are now ready to train our model. At the top of this notebook we imported the class KNeighborsClassifier from the scikit-learn library. This is the classifier that implements the K-Nearest Neighbors algorithm.
It allows us to specify the number of neighbours taken into consideration when making a prediction.
As mentioned before, we do not care about accuracy, only about creating a model that is able to predict something without errors. The number of neighbours is therefore arbitrarily chosen.
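
To make the role of the number of neighbours concrete, here is a minimal from-scratch sketch of the voting idea on made-up 2-D points. The function `knn_predict` is a hypothetical helper written for this illustration; it is not how scikit-learn implements the classifier internally.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each row
    nearest = np.argsort(distances)[:k]              # indices of the k closest rows
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_toy = np.array(["Torsk", "Torsk", "Torsk", "Sei", "Sei"])

query = np.array([4.5, 5.0])
print(knn_predict(X_toy, y_toy, query, k=1))
print(knn_predict(X_toy, y_toy, query, k=5))
```

With k=1 only the closest point decides, while with k=5 every training point votes and the largest class wins regardless of distance. This is one reason why a large, arbitrarily chosen number of neighbours can hurt accuracy.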

The next step is feeding our classifier the necessary training data. This is done using fit():

In [13]:
clf = KNeighborsClassifier(n_neighbors=50)
clf.fit(X_train, y_train)



### Prediction and score

The final step is to feed our hard-working classifier the test data. predict() will give us an array containing the predicted species for each row, while score() computes the accuracy of these predictions compared to the actual labels.

In [14]:
prediction = clf.predict(X_test)
score = clf.score(X_test, y_test)
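
That score() is simply the share of correct predictions can be verified on a toy example. The data below is made up and the variable names are hypothetical:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array(["Torsk", "Torsk", "Torsk", "Sei", "Sei", "Sei"])

toy_clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

X_new = np.array([[0.5], [11.5], [3.0], [9.0]])
y_new = np.array(["Torsk", "Sei", "Sei", "Sei"])

pred = toy_clf.predict(X_new)

# Accuracy computed by hand: fraction of predictions that match the true labels.
manual = float(np.mean(pred == y_new))

print(pred)
print(manual, toy_clf.score(X_new, y_new))
```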

## Results

With a test set accuracy of 0.24, we can conclude that the machine learning experiment performed in this notebook must be buried deep down and hidden from the world.
But we have learned a lot! The features used as a starting point for predicting fish species were probably not sufficient in the first place. And maybe we should not have dropped all those rows! We should probably also experiment more with the parameters of our classifier.

In [15]:
print("Test set predictions: {}".format(prediction))
print("Test set accuracy: {:.2f}".format(score))

Test set predictions: ['Lange' 'Hyse' 'Lange' ... 'Sild' 'Torsk' 'Gråsteinbit']
Test set accuracy: 0.24
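
Experimenting with the parameters of the classifier could start with something like the sketch below, which compares a few values of n_neighbors using cross-validation. It uses the small iris dataset bundled with scikit-learn as a stand-in, so the numbers it prints say nothing about the fish data, and the candidate values are picked arbitrarily.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): iris, not the catch reports used above.
X_demo, y_demo = load_iris(return_X_y=True)

scores = {}
for k in (1, 5, 15, 50):
    clf_k = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds.
    scores[k] = cross_val_score(clf_k, X_demo, y_demo, cv=5).mean()

for k, s in scores.items():
    print(f"n_neighbors={k}: mean accuracy {s:.2f}")
```

The same loop could be run on X_train and y_train from this notebook, although with roughly 300000 rows each fit and evaluation would take noticeably longer.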
