#### Jérémy TREMBLAY

# Project 1 : Supervised Learning

In [121]:
# Import the libraries that will be used in this notebook.
import pandas as pd
import numpy as np
import random

# Import the pyplot module from matplotlib with the plt alias.
import matplotlib.pyplot as plt

# Import the sklearn modules.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

Fix the seeds for reproducibility.

In [82]:
np.random.seed(42)
random.seed(42)

The `datasets` subfolder contains a dataset extracted from observations made by the Bergen institute.
The mission is to estimate the age of the fish based on the parameters provided in order to better regulate fish stocks.  

Constraints:
* Use the 3 models seen in class (regression, knn, decision tree)
* Optimize your models by analyzing the different versions and possible parameterizations.  

**The goal of this notebook is to build the best possible model to predict the age of the fish.**

## First step : load data

The first step is to load the two CSV files that will be used in this notebook with `pandas`.

In [83]:
# Specify the relative paths of the files.
train_file_path = 'datasets/train.csv'
test_file_path = 'datasets/test.csv'

# Load the database into a DataFrame.
df_train = pd.read_csv(train_file_path)
df_test = pd.read_csv(test_file_path)

# Display the first few rows of the DataFrame with head.
print(df_train.head())
print("---------------------------------------------------")
print(df_test.head())

   id  weight  length  liverweight  gonadweight  age
0   1   20700   132.0        0.528        2.300   14
1   2    1308    54.0        0.082        0.002    5
2   3    2730    72.0        0.046        0.039    7
3   4    3300    76.0        0.098        0.020    7
4   5    1155    51.0        0.035        0.002    4
---------------------------------------------------
    id  weight  length  liverweight  gonadweight
0  441    2566    70.0        0.077        0.005
1  442    1235    53.0        0.035        0.006
2  443    4008    82.0        0.114        0.146
3  444    4310    78.0        0.318        0.370
4  445   16130   105.0        1.118        3.720


Perfect. We will now explore the data, starting with a check for missing values.

In [84]:
print(df_train.isnull().any())
print(df_test.isnull().any())

id             False
weight         False
length         False
liverweight    False
gonadweight    False
age            False
dtype: bool
id             False
weight         False
length         False
liverweight    False
gonadweight    False
dtype: bool


The datasets are already clean: no column contains missing values. We can now read them easily and look for some information.

In [85]:
# Know the dimensions of the dataframes.
print(df_train.shape)
print(df_test.shape)

(440, 6)
(81, 5)


There are 440 rows and 6 columns in the train dataset and 81 rows and 5 columns in the test dataset. Let's check the content in more detail with some stats.

In [86]:
# Display useful information about the train dataset.
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440 entries, 0 to 439
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           440 non-null    int64  
 1   weight       440 non-null    int64  
 2   length       440 non-null    float64
 3   liverweight  440 non-null    float64
 4   gonadweight  440 non-null    float64
 5   age          440 non-null    int64  
dtypes: float64(3), int64(3)
memory usage: 20.8 KB


In [87]:
# Display useful information about the test dataset.
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           81 non-null     int64  
 1   weight       81 non-null     int64  
 2   length       81 non-null     float64
 3   liverweight  81 non-null     float64
 4   gonadweight  81 non-null     float64
dtypes: float64(3), int64(2)
memory usage: 3.3 KB


In [88]:
df_train.describe()

               id       weight     length  liverweight  gonadweight       age
count       440.0        440.0      440.0        440.0        440.0     440.0
mean        220.5  5134.756818       76.9     0.325775     0.472077  7.745455
std    127.161315  4296.584819  19.683868     0.366086      0.82196   2.63734
min           1.0        495.0       40.0        0.007        0.001       3.0
25%        110.75       2210.5       63.0       0.0735         0.01       6.0
50%         220.5       3715.0       75.0       0.1805       0.0925       7.0
75%        330.25      6808.75      90.25       0.4575        0.483       9.0
max         440.0      23620.0      132.0        1.823         5.24      16.0


In [89]:
df_test.describe()

              id       weight     length  liverweight  gonadweight
count       81.0         81.0       81.0         81.0         81.0
mean       481.0  4527.728395  73.981481     0.309037     0.445222
std    23.526581  4029.039696  19.954706     0.397947     0.838763
min        441.0        550.0       40.5        0.012        0.001
25%        461.0       1706.0       58.0        0.077        0.006
50%        481.0       3290.0       73.0        0.147        0.106
75%        501.0       6320.0       86.0        0.318        0.398
max        521.0      17110.0      124.0         1.68         4.01


Since we want to predict the age of the fish, we will use the columns `weight`, `length`, `liverweight` and `gonadweight` as features.
The `id` is only there to identify the fish. The `age` is the variable we want to predict, which is why this column does not exist in the test dataset. Let's check how many fish there are for each age in the train dataset.

In [90]:
df_train.age.value_counts()

age
7     75
6     70
8     63
9     51
5     47
4     34
12    30
11    23
10    18
13    10
14     9
3      6
15     2
16     2
Name: count, dtype: int64

We are now ready to work with the data.

## Second step : clean and separate data

We must split our train dataset so that one part trains our model and the other part checks its performance. The test dataset cannot be used for that because it holds the fish whose age we want to predict, without the answer, so we cannot measure the efficiency of the model with it. As seen in the previous step, the dataset does not need cleaning, so let's simply drop the `id` and `age` columns, because they will not be used as features by our models.

In [91]:
X_train_real = df_train[df_train.columns.difference(["id", "age"])] # The columns used to predict the fish's age.
y_train_real = df_train.age # The answer.
X_test_real = df_test[df_test.columns.difference(["id"])] # The columns used to predict the fish's age in the test dataset.

# Let's split data: 30% for test and 70% for train.
X_train, X_test, y_train, y_test = train_test_split(X_train_real, y_train_real, test_size=0.3)
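
The split above relies on the global NumPy seed set earlier, so its result depends on the cells being run in order. As a minimal sketch, on made-up arrays rather than the fish dataset, passing `random_state` directly to `train_test_split` is one way to make the split repeatable on its own (the names `X_demo`, `y_demo`, `split_a` and `split_b` are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the features and the target (not the fish dataset).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Passing random_state pins the shuffle, independently of np.random.seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

X_tr, X_te, y_tr, y_te = split_a

# Both calls return the same rows in the same order.
print(np.array_equal(split_a[0], split_b[0]))
print(X_tr.shape, X_te.shape)
```

Note that switching the notebook to an explicit `random_state` could select different rows than the run shown here, so the metrics below would change slightly.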

Now we can use it with our models. We are in a regression case. We will use a `LinearRegression`, a `KNeighborsRegressor` and a `DecisionTreeRegressor`. For each model, we will try different values for some parameters to see which ones produce the best results and, at the end of each step, we will apply our model to our test dataset and submit the predictions. These prediction files can be found under the `predictions` folder. So first, let's use the linear regression.

## Third step : using Linear Regressor

We reshape our data (although this is not needed here), then create our model, fit it and look at its predictions on the held-out part of our train dataset.

In [92]:
# Reshape the data (not useful here).
X_test_reshaped = np.array(X_test).reshape(-1, 1)
y_test_reshaped = np.array(y_test).reshape(-1, 1)

In [93]:
# Create a linear regression, fit it and get its results and predictions.
linear = LinearRegression()
linear.fit(X_train, y_train)

# First, let's see how our model predicts the test data.
y_predict = linear.predict(X_test)

Let's now check some metrics to see the performance of our model. For this, we will use the R2 score together with the mean squared error throughout this notebook.

In [94]:
# Calculate R-squared (R2).
r2 = r2_score(y_test, y_predict)

# Calculate Mean Squared Error (MSE).
mse = mean_squared_error(y_test, y_predict)

# Display the results
print(f'R-squared (R2): {r2}')
print(f'Mean Squared Error (MSE): {mse}')

R-squared (R2): 0.7922446301406176
Mean Squared Error (MSE): 1.5684194990417326


Remember that we seek to have an R2 as close to 1 as possible (better performance) and an MSE as low as possible (more accurate predictions).
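
As a reminder of what these two numbers measure, here is a minimal sketch that computes them by hand with NumPy on a few made-up ages. The values and the helper names `mse_by_hand` and `r2_by_hand` are illustrative and not part of the notebook. MSE averages the squared errors, while R2 compares the model's squared error with the error of always predicting the mean age.

```python
import numpy as np

def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between truth and prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2_by_hand(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return float(1.0 - ss_res / ss_tot)

# Made-up ages: two predictions are off by one year, two are exact.
y_true_demo = [4, 6, 7, 9]
y_pred_demo = [5, 6, 8, 9]

mse_demo = mse_by_hand(y_true_demo, y_pred_demo)
r2_demo = r2_by_hand(y_true_demo, y_pred_demo)
print(f"MSE = {mse_demo:.3f}, R2 = {r2_demo:.3f}")
```

An R2 of 0 therefore means the model does no better than predicting the mean, and a negative R2 means it does worse.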

To improve these values further, we could look for a parameter and try different values. Because linear regression is generally not one of the best models for this, we will not parameterize it and will parameterize the others instead. We are now done with this model, so let's predict the results for our test dataset and save them in a CSV file. Before predicting the new values, we retrain the model on the whole train file to improve it further.

In [116]:
linear = LinearRegression()
linear.fit(X_train_real, y_train_real)

# Predict our test dataset.
y_predict = linear.predict(X_test_real)

# Create a dataframe to associate the fish id with its prediction.
predictions_df = pd.DataFrame({'id': df_test['id'], 'age': y_predict})

# Save data into a CSV file to submit it on Kaggle.
predictions_df.to_csv('predictions/linear_regression.csv', index=False)

Now that we have submitted our file, we can continue with the next model.

## Fourth step : using KNN

The process is the same as before, but we will use a KNN regressor. As previously, let's create our model, train it and predict the results.

In [97]:
# Create a K-Nearest Neighbors (KNN) model with no parameter (for now).
model = KNeighborsRegressor()
model.fit(X_train, y_train)

# Make predictions on the test set.
y_pred = model.predict(X_test)

# Calculate the coefficient of determination (R-squared).
r2 = r2_score(y_test, y_pred)

# Calculate the Mean Absolute Error (MAE).
mae = mean_absolute_error(y_test, y_pred)

# Calculate the Mean Squared Error (MSE).
mse = mean_squared_error(y_test, y_pred)

# Display the metrics.
print(f'KNN: R-squared = {r2:.4f}, MAE = {mae:.2f}, MSE = {mse:.2f}')

KNN: R-squared = 0.7265, MAE = 1.12, MSE = 2.06


Remember that we seek to have an R2 as close to 1 as possible (better performance), an MSE as low as possible (more accurate predictions) and a MAE as low as possible.

This is not a bad score, but we could probably improve it by setting the `n_neighbors` parameter to a good value. To determine this value, we will loop over different values and see from the predicted results which one generalises best for our model:

In [99]:
# List of n_neighbors values to test.
n_neighbors_values = list(range(1, 41))  # From 1 to 40 inclusive.

# Do the same as before but with a parameter and a loop...
for n_neighbors in n_neighbors_values:
    # Create a K-Nearest Neighbors (KNN) model with the current n_neighbors value.
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(X_train, y_train)
    
    # Make predictions on the test set.
    y_pred = model.predict(X_test)
    
    # Calculate the coefficient of determination (R-squared).
    r2 = r2_score(y_test, y_pred)
    
    # Calculate the Mean Absolute Error (MAE).
    mae = mean_absolute_error(y_test, y_pred)
    
    # Calculate the Mean Squared Error (MSE).
    mse = mean_squared_error(y_test, y_pred)
    
    # Display the metrics for each n_neighbors value.
    print(f'n_neighbors = {n_neighbors}: R-squared = {r2:.4f}, MAE = {mae:.2f}, MSE = {mse:.2f}')

n_neighbors = 1: R-squared = 0.6076, MAE = 1.28, MSE = 2.96
n_neighbors = 2: R-squared = 0.6678, MAE = 1.23, MSE = 2.51
n_neighbors = 3: R-squared = 0.6996, MAE = 1.15, MSE = 2.27
n_neighbors = 4: R-squared = 0.7173, MAE = 1.13, MSE = 2.13
n_neighbors = 5: R-squared = 0.7265, MAE = 1.12, MSE = 2.06
n_neighbors = 6: R-squared = 0.7362, MAE = 1.12, MSE = 1.99
n_neighbors = 7: R-squared = 0.7437, MAE = 1.09, MSE = 1.93
n_neighbors = 8: R-squared = 0.7423, MAE = 1.10, MSE = 1.95
n_neighbors = 9: R-squared = 0.7421, MAE = 1.09, MSE = 1.95
n_neighbors = 10: R-squared = 0.7508, MAE = 1.08, MSE = 1.88
n_neighbors = 11: R-squared = 0.7570, MAE = 1.06, MSE = 1.83
n_neighbors = 12: R-squared = 0.7638, MAE = 1.05, MSE = 1.78
n_neighbors = 13: R-squared = 0.7713, MAE = 1.02, MSE = 1.73
n_neighbors = 14: R-squared = 0.7688, MAE = 1.02, MSE = 1.75
n_neighbors = 15: R-squared = 0.7724, MAE = 1.01, MSE = 1.72
n_neighbors = 16: R-squared = 0.7756, MAE = 1.00, MSE = 1.69
n_neighbors = 17: R-squared = 0.7

In choosing the optimal value for `n_neighbors`, a trade-off needs to be considered. While a larger number of neighbors improves the fit here (higher R-squared) and reduces prediction errors (lower MAE and MSE), it may lead to over-smoothing and a loss of sensitivity to local patterns in the data.

Considering this trade-off, we want a value that provides a good balance between model complexity and performance. In this case, a value around 25 to 30 seems to offer a good balance, as it provides a high R-squared, a low MAE and a low MSE. The best choice is probably 26, which gives the best values.
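
Choosing `n_neighbors` from scores on a single held-out split risks tuning the parameter to that particular split. One common alternative, which is not what this notebook does, is k-fold cross-validation on the training data. The following is a minimal sketch with `GridSearchCV`, run on synthetic data from `make_regression` as a stand-in for the fish features (the names `X_demo`, `y_demo`, `search` and `best_k` are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the fish features (an assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=10.0, random_state=42
)

# Try n_neighbors from 1 to 40 with 5-fold cross-validation.
# scikit-learn maximises scores, hence the negated MSE.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 41))},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print(f"best n_neighbors = {best_k}, cross-validated MSE = {-search.best_score_:.2f}")
```

Each candidate is scored on five different validation folds, so the chosen value depends less on one lucky or unlucky split.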

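It seems sensible to use a `StandardScaler` to standardize the data and improve our model's performance further: KNN relies on distances, so a feature with a large range such as `weight` would otherwise dominate the others. We will do that.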

In [102]:
# Create scaler.
scaler = StandardScaler()

Then we should transform our data by using our scaler.

In [103]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
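
The scaler is fitted on the training split only and then reused on the test split, so the test rows do not influence the preprocessing. As a minimal sketch of what `StandardScaler` does, here it is applied to made-up weight and length values (not taken from the dataset; all `*_demo` names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (weight, length) rows: the two columns have very different ranges.
X_tr_demo = np.array([[1000.0, 50.0], [3000.0, 70.0], [5000.0, 90.0]])
X_te_demo = np.array([[3000.0, 110.0]])

scaler_demo = StandardScaler()
# fit_transform learns the mean and standard deviation of each training column.
X_tr_scaled_demo = scaler_demo.fit_transform(X_tr_demo)
# transform reuses those training statistics on unseen rows.
X_te_scaled_demo = scaler_demo.transform(X_te_demo)

print(X_tr_scaled_demo.mean(axis=0))  # close to 0 for each column
print(X_tr_scaled_demo.std(axis=0))   # close to 1 for each column
print(X_te_scaled_demo)               # expressed in training standard deviations
```

After scaling, both columns contribute on a comparable scale to the distances that KNN computes.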

Then we can train our model and predict the value based on the parameter found just before.

In [114]:
model = KNeighborsRegressor(n_neighbors=26)  # We choose an appropriate n_neighbors value which is 26.
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)  # Predict.

# Compute values.
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

# Display data.
print(f'n_neighbors = 26: R-squared = {r2:.4f}, MAE = {mae:.2f}, MSE = {mse:.2f}')

n_neighbors = 26: R-squared = 0.7656, MAE = 1.00, MSE = 1.77


Again, we have improved our model with better results. We will do another pass as before to see if there is a better value for our `n_neighbors` parameter now that we have standardized our data with a scaler.

In [115]:
# Same as previous but with scaled data...
n_neighbors_values = list(range(1, 41))

for n_neighbors in n_neighbors_values:
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(X_train_scaled, y_train)
    
    y_pred = model.predict(X_test_scaled)
    
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    
    print(f'n_neighbors = {n_neighbors}: R-squared = {r2:.4f}, MAE = {mae:.2f}, MSE = {mse:.2f}')

n_neighbors = 1: R-squared = 0.6568, MAE = 1.17, MSE = 2.59
n_neighbors = 2: R-squared = 0.7286, MAE = 1.08, MSE = 2.05
n_neighbors = 3: R-squared = 0.7453, MAE = 1.02, MSE = 1.92
n_neighbors = 4: R-squared = 0.7493, MAE = 1.04, MSE = 1.89
n_neighbors = 5: R-squared = 0.7480, MAE = 1.05, MSE = 1.90
n_neighbors = 6: R-squared = 0.7707, MAE = 1.01, MSE = 1.73
n_neighbors = 7: R-squared = 0.7763, MAE = 0.99, MSE = 1.69
n_neighbors = 8: R-squared = 0.7773, MAE = 0.99, MSE = 1.68
n_neighbors = 9: R-squared = 0.7800, MAE = 0.99, MSE = 1.66
n_neighbors = 10: R-squared = 0.7758, MAE = 1.00, MSE = 1.69
n_neighbors = 11: R-squared = 0.7720, MAE = 1.01, MSE = 1.72
n_neighbors = 12: R-squared = 0.7711, MAE = 1.01, MSE = 1.73
n_neighbors = 13: R-squared = 0.7672, MAE = 1.01, MSE = 1.76
n_neighbors = 14: R-squared = 0.7684, MAE = 1.00, MSE = 1.75
n_neighbors = 15: R-squared = 0.7642, MAE = 1.01, MSE = 1.78
n_neighbors = 16: R-squared = 0.7669, MAE = 1.00, MSE = 1.76
n_neighbors = 17: R-squared = 0.7

Again, a value between 25 and 30 seems good based on the R2 value, the MSE and the MAE. We will keep 26. Let's now use our model to predict our test data and save the predictions to a file as before.

In [118]:
# Create the model, scale the data, then fit on all the train data and predict.
knn = KNeighborsRegressor(n_neighbors=26)
X_train_real_scaled = scaler.fit_transform(X_train_real)
X_test_real_scaled = scaler.transform(X_test_real)
knn.fit(X_train_real_scaled, y_train_real)
y_predict = knn.predict(X_test_real_scaled)

# Save results.
predictions_df = pd.DataFrame({'id': df_test['id'], 'age': y_predict})
predictions_df.to_csv('predictions/knn.csv', index=False)

## Fifth step : using a decision tree

Finally, we will use a `DecisionTreeRegressor` to predict data. As before, we first create our model to see the indicators without applying parameters.

In [123]:
# Create the model and train it.
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

# Compute MSE, MAE and R2 and display data.
mse = mean_squared_error(y_test, y_predict)
mae = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
print(f"DTR: [MSE: {mse:.2f}, MAE: {mae:.2f}, R2: {r2:.2f}]")

DTR: [MSE: 2.46, MAE: 1.14, R2: 0.67]


As usual, we are going to try to improve these indicators by searching a good value for the depth parameter of the decision tree.

In [124]:
# We test depths from 1 to 40.
depths = range(1, 41)

# Iterate through each depth: create a regression tree, train it, predict the results and display the metrics.
for depth in depths:
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    
    # Compute MSE, MAE and R2 and display data.
    mse = mean_squared_error(y_test, y_predict)
    mae = mean_absolute_error(y_test, y_predict)
    r2 = r2_score(y_test, y_predict)
    print(f"Model (depth={depth}): [MSE: {mse:.2f}, MAE: {mae:.2f}, R2: {r2:.2f}]")

Model (depth=1): [MSE: 3.63, MAE: 1.50, R2: 0.52]
Model (depth=2): [MSE: 1.80, MAE: 1.03, R2: 0.76]
Model (depth=3): [MSE: 1.65, MAE: 0.99, R2: 0.78]
Model (depth=4): [MSE: 1.65, MAE: 1.00, R2: 0.78]
Model (depth=5): [MSE: 1.61, MAE: 0.96, R2: 0.79]
Model (depth=6): [MSE: 1.82, MAE: 1.01, R2: 0.76]
Model (depth=7): [MSE: 1.83, MAE: 1.01, R2: 0.76]
Model (depth=8): [MSE: 1.98, MAE: 1.05, R2: 0.74]
Model (depth=9): [MSE: 2.14, MAE: 1.08, R2: 0.72]
Model (depth=10): [MSE: 2.18, MAE: 1.08, R2: 0.71]
Model (depth=11): [MSE: 2.39, MAE: 1.13, R2: 0.68]
Model (depth=12): [MSE: 2.32, MAE: 1.11, R2: 0.69]
Model (depth=13): [MSE: 2.34, MAE: 1.11, R2: 0.69]
Model (depth=14): [MSE: 2.22, MAE: 1.08, R2: 0.71]
Model (depth=15): [MSE: 2.41, MAE: 1.15, R2: 0.68]
Model (depth=16): [MSE: 2.46, MAE: 1.14, R2: 0.67]
Model (depth=17): [MSE: 2.46, MAE: 1.14, R2: 0.67]
Model (depth=18): [MSE: 2.46, MAE: 1.14, R2: 0.67]
Model (depth=19): [MSE: 2.46, MAE: 1.14, R2: 0.67]
Model (depth=20): [MSE: 2.46, MAE: 1.14,

We can see that we have the best performance for a depth between 3 and 5, based on the indicators. Deeper trees score worse on the held-out data, which suggests overfitting. We will choose the middle value, 4, for the next parts.
We are going to train our final model on all the values as usual and predict the values for our test file.
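
The identical scores from depth 16 onward in the output above are consistent with the tree having stopped growing: `max_depth` is only an upper bound, and once it exceeds the depth the tree reaches on its own, it no longer changes the model. A minimal sketch of this behaviour using `get_depth()`, on synthetic data from `make_regression` rather than the fish dataset (the names `unbounded`, `shallow`, `deep` and `natural_depth` are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the fish features (an assumption).
X_demo, y_demo = make_regression(
    n_samples=100, n_features=4, noise=5.0, random_state=42
)

# Without max_depth, the tree grows until it cannot split any further.
unbounded = DecisionTreeRegressor(random_state=42).fit(X_demo, y_demo)
natural_depth = unbounded.get_depth()

# A small max_depth truncates the tree...
shallow = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_demo, y_demo)
# ...while a max_depth above the natural depth has no effect.
deep = DecisionTreeRegressor(
    max_depth=natural_depth + 10, random_state=42
).fit(X_demo, y_demo)

print(natural_depth, shallow.get_depth(), deep.get_depth())
```

This is why scanning depths far beyond the point where the scores stop changing adds no new information.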

In [125]:
# Create the model and train it.
dtr = DecisionTreeRegressor(max_depth=4, random_state=42)
dtr.fit(X_train_real, y_train_real)
y_predict = dtr.predict(X_test_real)

# Save results.
predictions_df = pd.DataFrame({'id': df_test['id'], 'age': y_predict})
predictions_df.to_csv('predictions/decision_tree_regressor.csv', index=False)

## Conclusion