# Abalone Age Prediction using PyCaret
This notebook demonstrates how to use PyCaret for regression analysis to predict the age of abalones based on various features. We will follow an end-to-end workflow including:

- Loading the dataset
- Setting up PyCaret for regression
- Comparing machine learning models
- Finalizing the best model
- Evaluating the performance


## 💻 Installation
Download the Abalone dataset from the UCI repository and install PyCaret.

In [2]:
!wget -q https://archive.ics.uci.edu/static/public/1/abalone.zip
!unzip -q abalone.zip
!pip install pycaret[full]

Collecting pycaret[full]
  Downloading pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting pandas<2.2.0 (from pycaret[full])
  Downloading pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scipy<=1.11.4,>=1.6.1 (from pycaret[full])
  Downloading scipy-1.11.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting joblib<1.4,>=1.2.0 (from pycaret[full])
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting pyod>=1.1.3 (from pycaret[full])
  Downloading pyod-2.0.2.tar.gz (165 kB)
  Preparing metadata (setup.py) ... done
Collecting category-encoders>=2.4.0 (from pycaret[full])
  Downloading category_encoders-2.6.4-py2.py3

In [1]:
# Check PyCaret version
import pycaret
print(f'PyCaret Version: {pycaret.__version__}')

PyCaret Version: 3.3.2


## 📊 Load and Explore the Dataset
The dataset contains attributes of abalones, including measurements and weights, along with the number of rings (a proxy for age). The age in years is estimated as:

`Age = Rings + 1.5`

Let's load and preview the dataset:

In [2]:
import pandas as pd
data_path = 'abalone.data'
column_names = ['Sex', 'Length', 'Diameter', 'Height', 'WholeWeight', 'ShuckedWeight', 'VisceraWeight', 'ShellWeight', 'Rings']
abalone_data = pd.read_csv(data_path, header=None, names=column_names)
abalone_data['Age'] = abalone_data['Rings'] + 1.5
abalone_data.head()

,Sex,Length,Diameter,Height,WholeWeight,ShuckedWeight,VisceraWeight,ShellWeight,Rings,Age
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15,16.5
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7,8.5
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9,10.5
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10,11.5
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7,8.5
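
Since `Age` is computed directly from `Rings`, it is worth confirming how the two columns relate before modeling. The sketch below uses only the five preview rows shown above (the variable names `preview`, `offsets`, and `age_is_derived` are illustrative, not part of the original notebook) and checks whether the difference between the two columns is constant.

```python
import pandas as pd

# Rings and Age for the five preview rows shown above.
preview = pd.DataFrame(
    {"Rings": [15, 7, 9, 10, 7], "Age": [16.5, 8.5, 10.5, 11.5, 8.5]}
)

# If Age minus Rings is the same for every row, Age is a deterministic
# function of Rings, and Rings should not be used as a predictor of Age.
offsets = preview["Age"] - preview["Rings"]
age_is_derived = offsets.nunique() == 1

print(f"constant offset: {age_is_derived}, value: {offsets.iloc[0]}")
```

The same check can be run on the full `abalone_data` frame by replacing `preview` with it.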


## 🔧 Setup PyCaret for Regression
We call `setup()` with `Age` as the target variable. With default settings, PyCaret imputes missing values, one-hot encodes the categorical `Sex` column, and holds out 30% of the rows as a test set. Passing `session_id=42` fixes the random seed so the split is reproducible.

In [3]:
from pycaret.regression import *
regression_setup = setup(data=abalone_data, target='Age', session_id=42)

,Description,Value
0,Session id,42
1,Target,Age
2,Target type,Regression
3,Original data shape,"(4177, 10)"
4,Transformed data shape,"(4177, 12)"
5,Transformed train set shape,"(2923, 12)"
6,Transformed test set shape,"(1254, 12)"
7,Numeric features,8
8,Categorical features,1
9,Preprocess,True
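
The summary lists 8 numeric features, which means `Rings` is still a predictor even though `Age` is just `Rings + 1.5`. A minimal sketch of a setup that excludes it, assuming `abalone_data` from the loading cell: `ignore_features` is a `setup()` parameter, while the wrapper `setup_without_leakage` and the constant `LEAKY_FEATURES` are names made up for this example.

```python
# Columns from which the target can be reconstructed exactly.
LEAKY_FEATURES = ["Rings"]


def setup_without_leakage(data, target="Age", seed=42):
    """Initialise a PyCaret regression experiment that ignores leaky columns."""
    # Imported lazily so the helper can be defined before PyCaret is loaded.
    from pycaret.regression import setup

    return setup(
        data=data,
        target=target,
        ignore_features=LEAKY_FEATURES,  # drop Rings from the feature set
        session_id=seed,
    )
```

In the notebook this would be called as `setup_without_leakage(abalone_data)` in place of the `setup(...)` call above; the scores that follow would then reflect the physical measurements only.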


## 🔬 Compare Models
`compare_models()` trains the available regressors with cross-validation (10 folds by default), ranks them by R2 unless another `sort` metric is given, and returns the top-ranked model. Let's compare models:

In [4]:
best_model = compare_models()

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lr,Linear Regression,0.0,0.0,0.0,1.0,0.0,0.0,0.748
omp,Orthogonal Matching Pursuit,0.0,0.0,0.0,1.0,0.0,0.0,0.059
br,Bayesian Ridge,0.0,0.0,0.0,1.0,0.0,0.0,0.111
huber,Huber Regressor,0.0002,0.0,0.0004,1.0,0.0,0.0,0.234
ridge,Ridge Regression,0.0001,0.0,0.0002,1.0,0.0,0.0,0.063
lar,Least Angle Regression,0.0,0.0,0.0,1.0,0.0,0.0,0.067
par,Passive Aggressive Regressor,0.0327,0.0016,0.04,0.9998,0.0036,0.003,0.107
et,Extra Trees Regressor,0.0039,0.0023,0.0344,0.9998,0.0034,0.0004,0.489
gbr,Gradient Boosting Regressor,0.0031,0.0032,0.0401,0.9997,0.0039,0.0004,0.482
xgboost,Extreme Gradient Boosting,0.004,0.006,0.0513,0.9995,0.0062,0.0005,0.13
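
Errors of exactly 0.0 across several linear models are rare on real measurements and usually deserve a second look. The sketch below shows one way to flag such rows in a score grid. The helper `flag_perfect_scores` and its threshold are assumptions made for illustration; the three rows are copied from the table above.

```python
import pandas as pd


def flag_perfect_scores(leaderboard, r2_threshold=0.9999):
    """Return the names of models whose R2 is at or above the threshold."""
    mask = leaderboard["R2"] >= r2_threshold
    return leaderboard.loc[mask, "Model"].tolist()


# Three rows copied from the comparison table above.
leaderboard = pd.DataFrame(
    {
        "Model": [
            "Linear Regression",
            "Passive Aggressive Regressor",
            "Extreme Gradient Boosting",
        ],
        "R2": [1.0, 0.9998, 0.9995],
    },
    index=["lr", "par", "xgboost"],
)

suspicious = flag_perfect_scores(leaderboard)
print(suspicious)
```

Inside the notebook, the full grid from the last `compare_models()` call can be retrieved with PyCaret's `pull()` and passed to the helper instead of the hand-built frame.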



## 🏁 Finalize the Best Model
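After identifying the best model, we finalize it. `finalize_model()` refits the entire pipeline (preprocessing steps plus the estimator) on the full dataset, including the hold-out rows, which makes the result the version intended for deployment.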

In [5]:
final_model = finalize_model(best_model)
print(final_model)

Pipeline(memory=Memory(location=None),
         steps=[('numerical_imputer',
                 TransformerWrapper(include=['Length', 'Diameter', 'Height',
                                             'WholeWeight', 'ShuckedWeight',
                                             'VisceraWeight', 'ShellWeight',
                                             'Rings'],
                                    transformer=SimpleImputer())),
                ('categorical_imputer',
                 TransformerWrapper(include=['Sex'],
                                    transformer=SimpleImputer(strategy='most_frequent'))),
                ('onehot_encoding',
                 TransformerWrapper(include=['Sex'],
                                    transformer=OneHotEncoder(cols=['Sex'],
                                                              handle_missing='return_nan',
                                                              use_cat_names=True))),
                ('actual_estimator', LinearRe
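
For deployment, the finalized pipeline is typically written to disk and reloaded elsewhere. A minimal sketch using PyCaret's `save_model` and `load_model`; the file stem `abalone_age_pipeline` and the wrapper `persist_and_reload` are hypothetical names chosen for this example.

```python
# Hypothetical file stem; PyCaret appends the .pkl extension when saving.
MODEL_NAME = "abalone_age_pipeline"


def persist_and_reload(model, name=MODEL_NAME):
    """Save a fitted PyCaret pipeline to disk and load it back."""
    # Imported lazily so the helper can be defined before PyCaret is loaded.
    from pycaret.regression import load_model, save_model

    save_model(model, name)   # writes the pipeline to <name>.pkl
    return load_model(name)   # returns the pipeline read back from disk
```

Calling `persist_and_reload(final_model)` returns a pipeline that can be passed to `predict_model(..., data=new_rows)` to score new observations.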

## 📈 Evaluate the Model
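Finally, we generate predictions on the hold-out set with `predict_model()`, which also reports the hold-out metrics. Because `final_model` was refit on the full dataset, these rows are no longer unseen by the model; for an unbiased estimate, call `predict_model(best_model)` before finalizing. Let's view some predictions: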

In [6]:
predictions = predict_model(final_model)
predictions.head()

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Linear Regression,0.0,0.0,0.0,1.0,0.0,0.0


,Sex,Length,Diameter,Height,WholeWeight,ShuckedWeight,VisceraWeight,ShellWeight,Rings,Age,prediction_label
866,M,0.605,0.455,0.16,1.1035,0.421,0.3015,0.325,9,10.5,10.5
1483,M,0.59,0.44,0.15,0.8725,0.387,0.215,0.245,8,9.5,9.5
599,F,0.56,0.445,0.195,0.981,0.305,0.2245,0.335,16,17.5,17.5
1702,F,0.635,0.49,0.17,1.2615,0.5385,0.2665,0.38,9,10.5,10.5
670,M,0.475,0.385,0.145,0.6175,0.235,0.108,0.215,14,15.5,15.5
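
The metrics in the score grid can be reproduced by hand from the `Age` and `prediction_label` columns. The sketch below does this in plain Python for the five rows shown above; the function name `regression_metrics` is illustrative, and the formulas are the standard definitions of MAE, MSE, RMSE, and R2.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}


# Age and prediction_label for the five rows shown above.
y_true = [10.5, 9.5, 17.5, 10.5, 15.5]
y_pred = [10.5, 9.5, 17.5, 10.5, 15.5]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Every prediction matches its target, so the errors are zero and R2 is 1.0, consistent with the grid reported by `predict_model()`.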


## 🎯 Conclusion
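The model comparison ranked Linear Regression first, with an MAE and MSE of 0.0 and an R2 of 1.0 both in cross-validation and on the hold-out set. These perfect scores do not reflect real predictive skill: `Age` is defined as `Rings + 1.5`, and `Rings` was left among the input features, so any linear model can recover the target exactly. For a meaningful estimate of how well the physical measurements alone predict abalone age, exclude `Rings` from the features and rerun the comparison.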