### Content

With the data processed, we are ready to train a machine learning model to predict abalone age. We will proceed as follows:

* Import the required libraries and the data.
* Split the data into training and testing sets.
* Train and evaluate the models.

Three models will be tested: linear, ridge, and lasso regression.

### Import the necessary libraries

In [1]:
import pandas as pd
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

### Import the processed data and create the features and target arrays

In [2]:
abalone_df = pd.read_csv("abalone_processed.csv")
abalone_df

Unnamed: 0,Length,Diam,Height,Whole,Shucked,Viscera,Shell,Age
0,-0.592283,-0.433414,-1.199002,-0.625502,-0.604416,-0.719291,-0.619496,16.5
1,-1.533969,-1.517342,-1.340646,-1.274339,-1.218729,-1.237502,-1.270924,8.5
2,0.080351,0.162747,-0.065858,-0.258914,-0.447151,-0.319528,-0.130926,10.5
3,-0.726809,-0.433414,-0.349144,-0.621004,-0.648646,-0.590972,-0.578782,11.5
4,-1.713338,-1.625735,-1.623932,-1.320443,-1.267874,-1.326338,-1.393066,8.5
...,...,...,...,...,...,...,...,...
3776,0.394246,0.487926,0.784001,0.213376,0.110645,0.642863,0.186645,12.5
3777,0.618457,0.379533,-0.065858,0.391047,0.449746,0.401031,0.280287,11.5
3778,0.708142,0.758908,1.917145,0.863337,0.874851,1.121591,0.667072,10.5
3779,0.932353,0.867301,0.359071,0.680044,0.901881,0.860018,0.569358,11.5


In [3]:
X = abalone_df.iloc[:, :-1].values
y = abalone_df.iloc[:, -1].values
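
The slicing above treats every column except the last as a feature. The table shows an `Unnamed: 0` column; if that column is present in the CSV file itself (rather than being a display artifact), it would end up in `X` as well. The sketch below uses a hypothetical two-row CSV, not the real file, to show how `index_col=0` keeps a saved index out of the feature matrix.

```python
import io
import pandas as pd

# Hypothetical two-row CSV written with its index, mimicking the layout above.
csv_text = ",Length,Height,Age\n0,-0.59,-1.20,16.5\n1,-1.53,-1.34,8.5\n"

# Default read: the saved index comes back as an ordinary column.
with_index_col = pd.read_csv(io.StringIO(csv_text))
# index_col=0: the first column is used as the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_col.columns))  # ['Unnamed: 0', 'Length', 'Height', 'Age']
print(list(clean.columns))           # ['Length', 'Height', 'Age']

X_toy = clean.iloc[:, :-1].values  # every column except the last (features)
y_toy = clean.iloc[:, -1].values   # the last column (target)
print(X_toy.shape)                 # (2, 2)
```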

In [4]:
rs = 117  # random seed shared by every train/test split

### Data splitting

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = rs)
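
As a quick illustration of what this split produces, the sketch below uses placeholder arrays with 3780 rows (an assumption based on the last index shown in the table above) and checks two things: the resulting sizes, and that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with an assumed 3780 rows; the values themselves are irrelevant.
X_demo = np.arange(3780 * 2).reshape(3780, 2)
y_demo = np.arange(3780)

a_train, a_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=117)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=117)

print(a_train.shape, a_test.shape)     # (2835, 2) (945, 2)
print(np.array_equal(a_test, b_test))  # True: same seed, same split
```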

### Model definition and training

We first define a function to help us train and evaluate models.

In [6]:
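def modelTrainEval(model, X, y):
    """Fit `model` on a 75/25 train/test split and print its test RMSE."""
    # The shared seed `rs` means every model is scored on the same test set.
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size = 0.25,
                                                        random_state = rs)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = root_mean_squared_error(y_test, y_pred)
    print(f"RMSE : {rmse:.3f}")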
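
For reference, the RMSE reported by this helper is the square root of the mean squared prediction error. The toy numbers below are made up for illustration; the sketch computes the metric by hand and compares it with scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([8.0, 12.0, 16.0, 16.0])

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(f"{rmse_manual:.3f}")  # 1.414
```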

In [7]:
ridgeRegression = Ridge(solver = "auto", fit_intercept = False)
lassoRegression = Lasso(fit_intercept = False)
simpleRegression = LinearRegression(fit_intercept = False)

models = {"Ridge":ridgeRegression,
          "Lasso Regression":lassoRegression,
          "Linear Regression":simpleRegression}


# `model` avoids shadowing the built-in name `object`.
for name, model in models.items():
    print(name)
    modelTrainEval(model, X, y)

Ridge
RMSE : 11.100
Lasso Regression
RMSE : 11.210
Linear Regression
RMSE : 11.102


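With default settings and no intercept, all three models perform poorly, with an RMSE of about 11 each. Setting `fit_intercept` to `True` (the default value) would give better results. We keep it disabled because the age should depend entirely on the sample's characteristics: if all characteristics were zero, the sample would not exist, so we do not expect a bias term. Note, however, that the features loaded above appear to be standardized (they take negative values), so a zero feature vector corresponds to an average sample rather than a non-existent one. This helps explain the large error of the intercept-free linear models.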
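
To make the effect of the intercept concrete, here is a minimal sketch on synthetic data (not the abalone data): the features are roughly zero-mean and the target is centered around 11, both of which are assumptions chosen for illustration. Without an intercept the model cannot reach the target's mean, so its RMSE stays close to that mean.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic, roughly zero-mean features and a target centered around 11.
X_syn = rng.standard_normal((500, 3))
y_syn = 11.0 + X_syn @ np.array([1.0, 0.5, -0.5]) + rng.normal(0, 0.5, 500)

rmse_by_setting = {}
for fit_intercept in (False, True):
    model = LinearRegression(fit_intercept=fit_intercept).fit(X_syn, y_syn)
    rmse = np.sqrt(mean_squared_error(y_syn, model.predict(X_syn)))
    rmse_by_setting[fit_intercept] = rmse
    print(f"fit_intercept={fit_intercept}: RMSE = {rmse:.2f}")
```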

One way to improve the models' performance is to apply a polynomial transformation to the features. We test several degrees and evaluate every model at each one.
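
As a small illustration of what the transformation does, the sketch below expands a single two-feature row at degree 2, then counts the output columns for degrees 1 to 4. The count assumes seven input features (the measurement columns shown in the table), which may differ from the actual width of `X`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One row with two features, a=2 and b=3.
poly2 = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly2.fit_transform(np.array([[2.0, 3.0]]))
print(expanded)  # columns are a, b, a^2, a*b, b^2

# Number of generated columns per degree, assuming seven input features.
counts = {}
for degree in range(1, 5):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    poly.fit(np.zeros((1, 7)))
    counts[degree] = int(poly.n_output_features_)
print(counts)  # {1: 7, 2: 35, 3: 119, 4: 329}
```

The column count grows quickly with the degree, which is one reason regularization matters more at higher degrees.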

In [8]:
for degree in range(1, 5):
    polyFeatures = PolynomialFeatures(degree = degree, include_bias = False)
    X_poly = polyFeatures.fit_transform(X)
    print(f"Degree = {degree}")
    print()
    # `model` avoids shadowing the built-in name `object`.
    for name, model in models.items():
        print(name)
        modelTrainEval(model, X_poly, y)
    print("########################")

Degree = 1

Ridge
RMSE : 11.100
Lasso Regression
RMSE : 11.210
Linear Regression
RMSE : 11.102
########################
Degree = 2

Ridge
RMSE : 5.798
Lasso Regression
RMSE : 8.538
Linear Regression
RMSE : 5.807
########################
Degree = 3

Ridge
RMSE : 4.311
Lasso Regression
RMSE : 7.768
Linear Regression
RMSE : 4.533
########################
Degree = 4

Ridge
RMSE : 4.295
Lasso Regression
RMSE : 7.417
Linear Regression
RMSE : 13.118
########################


We'll continue with the Ridge regression model and a polynomial degree of 4. This combination gives the lowest RMSE (4.295), whereas unregularized linear regression degrades sharply at degree 4 (RMSE 13.118), which suggests overfitting.

In [9]:
polyFeatures = PolynomialFeatures(degree = 4, include_bias = False)
X_poly = polyFeatures.fit_transform(X)

Now we'll test the model's performance for different values of the regularization strength `alpha`.

In [10]:
for alpha in range(6):
    ridgeModel = Ridge(fit_intercept = False, alpha = alpha)
    print(f"Alpha : {alpha}")
    modelTrainEval(ridgeModel, X_poly, y)
    print()

Alpha : 0
RMSE : 13.118

Alpha : 1
RMSE : 4.295

Alpha : 2
RMSE : 4.051

Alpha : 3
RMSE : 3.986

Alpha : 4
RMSE : 3.971

Alpha : 5
RMSE : 3.976
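
The sweep above compares `alpha` values by their RMSE on the test split. A common alternative, sketched here on synthetic data and not the method used in this notebook, is to pick the degree and `alpha` by cross-validation on the training portion only, so that the test split is used just once at the end. The data, the parameter grid, and the names `pipe` and `search` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with a mildly nonlinear target.
rng = np.random.default_rng(117)
X_syn = rng.standard_normal((300, 3))
y_syn = 10.0 + X_syn[:, 0] ** 2 + X_syn[:, 1] + rng.normal(0, 0.5, 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=117)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge(fit_intercept=False)),
])

# Illustrative grid; step names prefix the parameters they belong to.
param_grid = {"poly__degree": [1, 2, 3], "ridge__alpha": [0.1, 1, 2, 3, 4, 5]}

# 5-fold cross-validation on the training data only.
search = GridSearchCV(pipe, param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_)

# The held-out test split is touched once, after selection.
test_rmse = np.sqrt(mean_squared_error(y_te, search.predict(X_te)))
print(f"Test RMSE : {test_rmse:.3f}")
```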



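Among the values tested, a degree of 4 and `alpha = 4` give the lowest test RMSE (3.971). With `alpha = 0`, ridge regression reduces to ordinary least squares, which is why its RMSE matches the linear regression result at degree 4 (13.118). The final step is to create a pipeline that chains the polynomial features transformation and the ridge regression model.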

In [11]:
finalModel = Pipeline([
    ("poly", PolynomialFeatures(degree = 4, include_bias = False)),
    ("ridge", Ridge(fit_intercept = False, alpha = 4))
])

In [12]:
modelTrainEval(finalModel, X, y)

RMSE : 3.971


### Save the model

In [13]:
import joblib

In [14]:
joblib.dump(finalModel, "Streamlit-App/ridgeModel.pkl")

['Streamlit-App/ridgeModel.pkl']
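
A saved pipeline can later be restored with `joblib.load` and used directly on untransformed features, since the polynomial step is part of the pipeline. The sketch below is a self-contained round trip with a small stand-in model trained on random data and written to a temporary directory; the file name `demo_model.pkl` and the data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Random stand-in data; seven feature columns is an assumption.
rng = np.random.default_rng(0)
X_syn = rng.standard_normal((50, 7))
y_syn = rng.uniform(5, 20, 50)

demo_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("ridge", Ridge(fit_intercept=False, alpha=4)),
]).fit(X_syn, y_syn)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, path)   # serialize the fitted pipeline
    restored = joblib.load(path)    # load it back, as an app would

# The restored pipeline reproduces the original predictions.
same = np.allclose(demo_model.predict(X_syn[:3]), restored.predict(X_syn[:3]))
print(same)  # True
```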