#### Jérémy TREMBLAY

# Project 1 : Supervised Learning

In [81]:
# Import the libraries that will be used in this notebook.
import pandas as pd
import numpy as np
import random

# Import the pyplot module from matplotlib with the plt alias.
import matplotlib.pyplot as plt

# Import the sklearn modules.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn import tree

Fix the random seeds for reproducibility.

In [82]:
np.random.seed(42)
random.seed(42)

The `datasets` subfolder contains a dataset extracted from observations made by the Bergen institute.
The mission is to estimate the age of the fish from the provided parameters in order to better regulate fish stocks.  

Constraints:
* Use the 3 models seen in class (linear regression, KNN, decision tree).
* Optimize your models by analyzing the different versions and possible parameterizations.  

**The goal of this notebook is to build the best possible model to predict the age of the fish.**

## First step : load data

The first step is to load the two CSV files used in this notebook with `pandas`.

In [83]:
# Specify the relative paths of the files.
train_file_path = 'datasets/train.csv'
test_file_path = 'datasets/test.csv'

# Load the datasets into DataFrames.
df_train = pd.read_csv(train_file_path)
df_test = pd.read_csv(test_file_path)

# Display the first few rows of the DataFrame with head.
print(df_train.head())
print("---------------------------------------------------")
print(df_test.head())

   id  weight  length  liverweight  gonadweight  age
0   1   20700   132.0        0.528        2.300   14
1   2    1308    54.0        0.082        0.002    5
2   3    2730    72.0        0.046        0.039    7
3   4    3300    76.0        0.098        0.020    7
4   5    1155    51.0        0.035        0.002    4
---------------------------------------------------
    id  weight  length  liverweight  gonadweight
0  441    2566    70.0        0.077        0.005
1  442    1235    53.0        0.035        0.006
2  443    4008    82.0        0.114        0.146
3  444    4310    78.0        0.318        0.370
4  445   16130   105.0        1.118        3.720


Perfect. We will now explore the data.

In [84]:
print(df_train.isnull().any())
print(df_test.isnull().any())

id             False
weight         False
length         False
liverweight    False
gonadweight    False
age            False
dtype: bool
id             False
weight         False
length         False
liverweight    False
gonadweight    False
dtype: bool


The datasets contain no missing values, so we can read them directly and look for some information.
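As a quick illustration of this kind of check, the sketch below counts missing values per column and also counts duplicated rows. It is only a sketch: `sample` is a hypothetical stand-in built from the five training rows printed above, not the full dataset, and the duplicate check is an extra step that the notebook itself does not run.

```python
import pandas as pd

# Hypothetical stand-in: the first five training rows shown by head().
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "weight": [20700, 1308, 2730, 3300, 1155],
    "length": [132.0, 54.0, 72.0, 76.0, 51.0],
    "liverweight": [0.528, 0.082, 0.046, 0.098, 0.035],
    "gonadweight": [2.300, 0.002, 0.039, 0.020, 0.002],
    "age": [14, 5, 7, 7, 4],
})

# Number of missing values per column (0 everywhere means nothing to impute).
missing_per_column = sample.isnull().sum()

# Number of rows that are exact copies of an earlier row.
n_duplicates = int(sample.duplicated().sum())

print(missing_per_column)
print(f"Duplicated rows: {n_duplicates}")
```

Compared with `isnull().any()`, `isnull().sum()` also tells how many values are missing, which helps decide between dropping rows and imputing.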

In [85]:
# Know the dimensions of the dataframes.
print(df_train.shape)
print(df_test.shape)

(440, 6)
(81, 5)


There are 440 rows and 6 columns in the train dataset and 81 rows and 5 columns in the test dataset. Let's look at the content in more detail with some statistics.

In [86]:
# Display useful information about the train dataset.
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440 entries, 0 to 439
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           440 non-null    int64  
 1   weight       440 non-null    int64  
 2   length       440 non-null    float64
 3   liverweight  440 non-null    float64
 4   gonadweight  440 non-null    float64
 5   age          440 non-null    int64  
dtypes: float64(3), int64(3)
memory usage: 20.8 KB


In [87]:
# Display useful information about the test dataset.
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           81 non-null     int64  
 1   weight       81 non-null     int64  
 2   length       81 non-null     float64
 3   liverweight  81 non-null     float64
 4   gonadweight  81 non-null     float64
dtypes: float64(3), int64(2)
memory usage: 3.3 KB


In [88]:
df_train.describe()

               id       weight     length  liverweight  gonadweight       age
count       440.0        440.0      440.0        440.0        440.0     440.0
mean        220.5  5134.756818       76.9     0.325775     0.472077  7.745455
std    127.161315  4296.584819  19.683868     0.366086      0.82196   2.63734
min           1.0        495.0       40.0        0.007        0.001       3.0
25%        110.75       2210.5       63.0       0.0735         0.01       6.0
50%         220.5       3715.0       75.0       0.1805       0.0925       7.0
75%        330.25      6808.75      90.25       0.4575        0.483       9.0
max         440.0      23620.0      132.0        1.823         5.24      16.0


In [89]:
df_test.describe()

              id       weight     length  liverweight  gonadweight
count       81.0         81.0       81.0         81.0         81.0
mean       481.0  4527.728395  73.981481     0.309037     0.445222
std    23.526581  4029.039696  19.954706     0.397947     0.838763
min        441.0        550.0       40.5        0.012        0.001
25%        461.0       1706.0       58.0        0.077        0.006
50%        481.0       3290.0       73.0        0.147        0.106
75%        501.0       6320.0       86.0        0.318        0.398
max        521.0      17110.0      124.0         1.68         4.01


Since we want to predict the age of the fish, we will use the columns `weight`, `length`, `liverweight` and `gonadweight` as features.
The `id` column only identifies each fish. The `age` column is the variable we want to predict, which is why it does not exist in the test dataset. Let's check how many fish there are for each age in the train dataset.

In [90]:
df_train.age.value_counts()

age
7     75
6     70
8     63
9     51
5     47
4     34
12    30
11    23
10    18
13    10
14     9
3      6
15     2
16     2
Name: count, dtype: int64

We are now ready to work with the data.

## Second step : clean and separate data

We must split our train dataset so that we can both train our model and check its performance. The test dataset cannot be used for that because the ages we want to predict are unknown for it, so we cannot check the efficiency of the model with it. As seen in the previous step, the datasets do not need cleaning, so let's simply drop the `id` and `age` columns, which are not used as features by our models.

In [91]:
X_train_real = df_train[df_train.columns.difference(["id", "age"])] # The columns used to predict the fish's age.
y_train_real = df_train.age # The answer.
X_test_real = df_test[df_test.columns.difference(["id"])] # The columns used to predict the fish's age in the test dataset.

# Let's split data: 30% for test and 70% for train.
X_train, X_test, y_train, y_test = train_test_split(X_train_real, y_train_real, test_size=0.3)
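The split can also be pinned independently of the global NumPy seed. The following is a minimal sketch, not the notebook's actual cell: `X_demo` and `y_demo` are synthetic placeholders with the same number of rows as the training file, and `random_state=42` is an assumed value chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 440 rows and four feature columns, like the training
# file, but the values themselves are meaningless.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(440, 4))
y_demo = rng.integers(3, 17, size=440)

# Passing random_state makes the shuffle reproducible even if other code
# consumes random numbers from the global NumPy generator beforehand.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)
```

With 440 rows and `test_size=0.3`, this gives 308 rows for training and 132 held-out rows.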

Now we can use this data with our models. This is a regression problem, so we will use a `LinearRegression`, a `KNeighborsRegressor` and a `DecisionTreeRegressor`. For each model, we will try different values for some parameters to see which ones produce the best results. At the end of each step, we will apply the model to our test dataset and submit the resulting predictions. The prediction files can be found in the `predictions` folder. First, let's use the `LinearRegression` model.
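Trying different values for a parameter can be sketched as a simple loop that fits one model per candidate value and compares scores on a held-out split. The example below is a hedged illustration on synthetic data, not the notebook's own tuning: the data, the candidate values of `n_neighbors`, and the use of a `StandardScaler` pipeline are all assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption): four features, noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(scale=0.1, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit one model per candidate value and record its R2 on the held-out split.
scores = {}
for k in (1, 3, 5, 7, 9):
    # Scaling matters for KNN because distances mix features of different units.
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    model.fit(X_tr, y_tr)
    scores[k] = r2_score(y_va, model.predict(X_va))

best_k = max(scores, key=scores.get)
print(scores)
print(f"Best n_neighbors on the held-out split: {best_k}")
```

The same loop works for a decision tree by swapping the estimator and iterating over, for example, `max_depth`.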

## Third step : using Linear Regression

We first reshape our data (which turns out not to be needed here), then create our model, fit it on the train split and look at its predictions on the held-out test split.

In [92]:
# Reshape the data (not useful here).
X_test_reshaped = np.array(X_test).reshape(-1, 1)
y_test_reshaped = np.array(y_test).reshape(-1, 1)
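To see why this reshape is not needed here, the short sketch below shows what `reshape(-1, 1)` does. It is an illustration under assumptions: `single_feature` reuses the five lengths printed by `head()`, and `four_features` is a placeholder array of zeros.

```python
import numpy as np

# A single feature stored as a 1-D array: scikit-learn estimators expect a
# 2-D (n_samples, n_features) input, so this case does need a reshape.
single_feature = np.array([132.0, 54.0, 72.0, 76.0, 51.0])
print(single_feature.shape)   # (5,)

column = single_feature.reshape(-1, 1)
print(column.shape)           # (5, 1)

# With four feature columns the data is already 2-D. Reshaping it to a single
# column would stack all values on top of each other and lose the row structure.
four_features = np.zeros((5, 4))
flattened = four_features.reshape(-1, 1)
print(flattened.shape)        # (20, 1)
```

This is why `linear.fit` can take the feature DataFrame directly: it already has one row per fish and one column per feature.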

In [93]:
# Create a linear regression, fit it and get its results and predictions.
linear = LinearRegression()
linear.fit(X_train, y_train)

# First, let's see how our model predicts the held-out test split.
y_predict = linear.predict(X_test)

Let's now check some metrics to see the performance of our model. Throughout this notebook, we will use the R2 score together with the mean squared error (MSE).

In [94]:
# Calculate R-squared (R2).
r2 = r2_score(y_test, y_predict)

# Calculate Mean Squared Error (MSE).
mse = mean_squared_error(y_test, y_predict)

# Display the results
print(f'R-squared (R2): {r2}')
print(f'Mean Squared Error (MSE): {mse}')

R-squared (R2): 0.7922446301406176
Mean Squared Error (MSE): 1.5684194990417326


Remember that we seek to have an R2 as close to 1 as possible (better performance) and an MSE as low as possible (more accurate predictions).
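Both metrics can be computed by hand, which makes their meaning concrete: the MSE is the average squared residual, and R2 compares the model's squared error with that of always predicting the mean. The sketch below uses the five ages printed by `head()` as `y_true` and made-up values for `y_pred`, so the numbers it prints say nothing about the actual model.

```python
import numpy as np

y_true = np.array([14, 5, 7, 7, 4], dtype=float)  # ages from head()
y_pred = np.array([12.0, 5.5, 7.0, 8.0, 4.5])     # made-up predictions

residuals = y_true - y_pred

# MSE: average of the squared residuals.
mse_manual = np.mean(residuals ** 2)

# R2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE: {mse_manual}")
print(f"R2: {r2_manual}")
```

An R2 of 1 means the residual sum of squares is zero, while an R2 of 0 means the model does no better than predicting the mean age for every fish.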

To improve these values further, we could pick a parameter and try different values for it. Because linear regression is generally not one of the best-performing models for this kind of task, we will not parameterize it and will parameterize the other models instead. We are now done with this model, so let's predict the ages for our test dataset and save them in a CSV file.

In [95]:
# Predict our test dataset.
y_predict = linear.predict(X_test_real)

# Create a dataframe to associate the fish id with its prediction.
predictions_df = pd.DataFrame({'id': df_test['id'], 'age': y_predict})

# Save data into a CSV file to submit it on Kaggle.
predictions_df.to_csv('predictions/linear_regression.csv', index=False)

Now that we have submitted our file, we can continue with the next model.

## Fourth step : using KNN

## Fifth step : using a decision tree

## Conclusion