### Content

With the data being processed, we're ready now to train a machine learning model that will help us predict abalone age. The following approach will be adopted:

* Importing the required libraries and the data;
* Data splitting into training and testing;
* Model training and evaluation;

Three different models will be tested : ElasticNet, XGBoost, Random Forest regression.

### Import the necessary libraries

In [24]:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import ElasticNet
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

### Import the processed data and create the features and target arrays

In [2]:
abalone_df = pd.read_csv("abalone_processed.csv")
abalone_df

Unnamed: 0,Length,Diam,Height,Whole,Shucked,Viscera,Shell,Sex_F,Sex_I,Sex_M,Rings
0,-0.592283,-0.433414,-1.199002,-0.625502,-0.604416,-0.719291,-0.619496,0.0,0.0,1.0,15
1,-1.533969,-1.517342,-1.340646,-1.274339,-1.218729,-1.237502,-1.270924,0.0,0.0,1.0,7
2,0.080351,0.162747,-0.065858,-0.258914,-0.447151,-0.319528,-0.130926,1.0,0.0,0.0,9
3,-0.726809,-0.433414,-0.349144,-0.621004,-0.648646,-0.590972,-0.578782,0.0,0.0,1.0,10
4,-1.713338,-1.625735,-1.623932,-1.320443,-1.267874,-1.326338,-1.393066,0.0,1.0,0.0,7
...,...,...,...,...,...,...,...,...,...,...,...
3776,0.394246,0.487926,0.784001,0.213376,0.110645,0.642863,0.186645,1.0,0.0,0.0,11
3777,0.618457,0.379533,-0.065858,0.391047,0.449746,0.401031,0.280287,0.0,0.0,1.0,10
3778,0.708142,0.758908,1.917145,0.863337,0.874851,1.121591,0.667072,0.0,0.0,1.0,9
3779,0.932353,0.867301,0.359071,0.680044,0.901881,0.860018,0.569358,1.0,0.0,0.0,10


In [3]:
X = abalone_df.iloc[:, :-1].values
y = abalone_df.iloc[:, -1].values

In [4]:
rs = 117

### Model definition and training

In [13]:
def modelTrainEval(models, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size = 0.25, 
                                                        random_state = rs)
    for name, object in models.items():
        object.fit(X_train, y_train)
        y_pred = object.predict(X_test)
        rmse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        print(name)
        print(f"RMSE : {rmse:.3f} \t R2 Score : {r2:.3f}\n")

In [14]:
eNetRegression = ElasticNet()
rfRegression = RandomForestRegressor()
xgbRegression = XGBRegressor()

models = {"ElasticNet":eNetRegression,
          "Random Forest regression":rfRegression,
          "XGBoost":xgbRegression}


modelTrainEval(models, X, y)

ElasticNet
RMSE : 3.661 	 R2 Score : 0.311

Random Forest regression
RMSE : 2.912 	 R2 Score : 0.452

XGBoost
RMSE : 3.233 	 R2 Score : 0.391



The default models gave acceptable results, we'll see if we can improve their perforamnce by trying a set of different hyperparameters.

In [23]:
eNetParams = {"alpha":[i*0.1 for i in range(0, 11)],
              "l1_ratio":[i*0.1 for i in range(0, 11)],
              "max_iter":[1000, 1500, 2000]}

rfParams = {'bootstrap': [True, False],
             'max_depth': [2, 5, 10, 20, None],
             'max_features': ['auto', 'sqrt'],
             'min_samples_leaf': [1, 2, 4],
             'min_samples_split': [2, 5, 10],
             'n_estimators': [100, 150, 200, 250]}

xgbParams = {'n_estimators':[100, 200, 300] , 
             'max_depth':list(range(1,10)) , 
             'learning_rate':[0.006,0.007,0.008,0.05,0.09] ,
             'min_child_weight':list(range(1,10))}

params = [eNetParams, rfParams, xgbParams]

In [None]:
i = 0
for name, object in models.items():
    regressor = RandomizedSearchCV(estimator = object,
                                   param_distributions = params[i],
                                   n_iter = 10,
                                   cv = 5,
                                   scoring = "neg_mean_squared_error",
                                   n_jobs = -1)
    randomizedSearch = regressor.fit(X_train, y_train)
    print(name)
    print(f"Best parameters : {randomizedSearch.best_params_:.3f}")
    print(f"RMSE : {randomizedSearch.best_score_:.3f}")