### Introduction
  
Abalone is a delicacy worldwide. It has been harvested for centuries as a food source and for decorative items, and it has considerable economic value. Determining the age of an abalone is a laborious process that involves cutting the shell through the cone, staining it, and counting the number of rings under a microscope. According to the [source](https://archive.ics.uci.edu/ml/datasets/abalone), the age of an abalone in years is the number of rings plus 1.5. The objective of this project is to build a model that predicts the age (number of rings) of an abalone from easily obtained physical measurements, such as length, diameter, and several kinds of weight. We first conduct exploratory data analysis with visualizations, then select the features to feed into the machine learning algorithm that produces the model.
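
As a small illustration of the age rule quoted above, the conversion can be sketched as follows. The helper name `rings_to_age` is made up for this example; the ring counts 1, 9, and 29 are the minimum, median, and maximum reported for this data set.

```python
def rings_to_age(rings: int) -> float:
    """Convert a ring count to an estimated age in years (rings + 1.5)."""
    if rings < 0:
        raise ValueError("ring count cannot be negative")
    return rings + 1.5

# Minimum, median, and maximum ring counts in this data set
for rings in (1, 9, 29):
    print(rings, "rings ->", rings_to_age(rings), "years")
```

Because the offset is a constant, predicting the number of rings and predicting the age are equivalent tasks.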

<br>
Abalone Close Look <img src="https://upload.wikimedia.org/wikipedia/commons/1/14/White_abalone_Haliotis_sorenseni.jpg" width="600" height="400">

First, let's import some libraries. We will use [Plotly](https://plotly.com/python/) for interactive plots.

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#pip install -U pandasql
#pip install -U plotly
from pandasql import sqldf

import plotly.graph_objects as go
from plotly.subplots import make_subplots

pd.options.display.float_format = '{:.3f}'.format
pd.set_option('display.max_colwidth', None)




Get the data from UCI. Note that this Abalone data set stores the column names separately from the data.
<br>
Data Source https://archive.ics.uci.edu/ml/datasets/Abalone

In [0]:
!wget -O abalone.data "http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"


--2023-01-12 11:51:01--  http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 191873 (187K) [application/x-httpd-php]
Saving to: ‘abalone.data’


2023-01-12 11:51:01 (1.22 MB/s) - ‘abalone.data’ saved [191873/191873]



From the source, we have the following description of the data set. We can treat this as either a regression problem (predicting a continuous value) or a classification problem.
  
| Name | Data Type | Measurement Unit | Description |
| --- | --- | --- | --- |
| Sex | nominal | -- | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | -- | +1.5 gives the age in years |

In [0]:
# record the column names
cols = ['Sex','Length','Diameter','Height','Whole weight',
                'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
# reading the .data file using pandas
df_original = pd.read_csv('./abalone.data', names=cols, na_values = "?",
                sep= ",",
                skipinitialspace=True)
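
Because the file has no header row, `names=` supplies the column labels. The following minimal sketch shows the same kind of call on an in-memory sample instead of the downloaded file. The three records use the values displayed in this notebook (rounded to three decimals), and `sample_text` and `df_sample` are illustrative names only.

```python
import io
import pandas as pd

cols = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight',
        'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']

# Three header-less records in the same comma-separated layout as abalone.data
# (values as displayed in this notebook, rounded to three decimals)
sample_text = """M,0.455,0.365,0.095,0.514,0.225,0.101,0.15,15
M,0.35,0.265,0.09,0.226,0.1,0.049,0.07,7
F,0.53,0.42,0.135,0.677,0.257,0.141,0.21,9
"""

# names= assigns the labels; without it the first record would become the header
df_sample = pd.read_csv(io.StringIO(sample_text), names=cols, na_values="?")

print(df_sample.shape)
print(df_sample.dtypes)
```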


Let's have a quick look over the original data set, as provided by the source.

In [0]:

print(df_original.info())
print(df_original.shape)
print(df_original.head(10))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB
None
(4177, 9)
  Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  \
0   M   0.455     0.365   0.095         0.514           0.225           0.101   
1   M   0.350     0.265   0.090         0.226           0.100           0.049   
2   F   0.530     0.420   0.135         0.677           0.257           0.141

### Exploratory Data Analysis & Visualization

We start with quick descriptive statistics, followed by visualizations.

In [0]:
df_original.describe()


Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
count,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0
mean,0.524,0.408,0.14,0.829,0.359,0.181,0.239,9.934
std,0.12,0.099,0.042,0.49,0.222,0.11,0.139,3.224
min,0.075,0.055,0.0,0.002,0.001,0.001,0.002,1.0
25%,0.45,0.35,0.115,0.442,0.186,0.093,0.13,8.0
50%,0.545,0.425,0.14,0.799,0.336,0.171,0.234,9.0
75%,0.615,0.48,0.165,1.153,0.502,0.253,0.329,11.0
max,0.815,0.65,1.13,2.825,1.488,0.76,1.005,29.0


Here we look at the distribution of each individual dimension and of the number of rings. Gender (the `Sex` column) is categorical, so it needs to be examined differently; we will look at it later.

In [0]:
#Review the box-plot of various dimensions

import plotly.graph_objects as go
import numpy as np

x0 = df_original['Length']
x1 = df_original['Diameter']
x2 = df_original['Height']
x3 = df_original['Whole weight']
x4 = df_original['Shucked weight']
x5 = df_original['Viscera weight']
x6 = df_original['Shell weight']

fig_bp = go.Figure()
# Use x instead of y argument for horizontal plot
fig_bp.add_trace(go.Box(x=x0, name="Length"))
fig_bp.add_trace(go.Box(x=x1, name="Diameter"))
fig_bp.add_trace(go.Box(x=x2, name="Height"))
fig_bp.add_trace(go.Box(x=x3, name="Whole weight"))
fig_bp.add_trace(go.Box(x=x4, name="Shucked weight"))
fig_bp.add_trace(go.Box(x=x5, name="Viscera weight"))
fig_bp.add_trace(go.Box(x=x6, name="Shell weight"))

fig_bp.show()
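
The points drawn beyond the whiskers of a box plot are conventionally those lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies that rule with pandas; `iqr_outliers` is a hypothetical helper, the sample holds the first ten Height values displayed in this notebook plus the maximum (1.13) from the summary statistics, and Plotly's own quartile computation may differ slightly from pandas' default interpolation.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# First ten Height values from the data set plus the reported maximum
heights = pd.Series([0.095, 0.090, 0.135, 0.125, 0.080, 0.095,
                     0.150, 0.125, 0.125, 0.150, 1.130])

flagged = iqr_outliers(heights)
print(flagged.tolist())
```

Applied to `df_original['Height']`, the same helper would list the candidate outliers that the box plot shows as isolated points.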

In [0]:
#Review the box-plot of number of rings

x7 = df_original['Rings']
fig_bp_rings = go.Figure()
# Use x instead of y argument for horizontal plot
fig_bp_rings.add_trace(go.Box(x=x7, name="Rings"))
fig_bp_rings.update_layout(height=250, width=1200)
fig_bp_rings.show()

In [0]:
fig_subp_hist = make_subplots(
    rows=4, cols=2,
    subplot_titles=("Length", "Diameter", "Height", "Whole weight","Shucked weight", "Viscera weight", "Shell weight", "Rings"))

fig_subp_hist.add_trace(go.Histogram(x=x0,name='Length'),
              row=1, col=1)

fig_subp_hist.add_trace(go.Histogram(x=x1,name='Diameter'),
              row=1, col=2)

fig_subp_hist.add_trace(go.Histogram(x=x2,name='Height'),
              row=2, col=1)

fig_subp_hist.add_trace(go.Histogram(x=x3,name='Whole weight'),
              row=2, col=2)

fig_subp_hist.add_trace(go.Histogram(x=x4,name='Shucked weight'),
              row=3, col=1)

fig_subp_hist.add_trace(go.Histogram(x=x5,name='Viscera weight'),
              row=3, col=2)

fig_subp_hist.add_trace(go.Histogram(x=x6,name='Shell weight'),
              row=4, col=1)

fig_subp_hist.add_trace(go.Histogram(x=x7,name='Rings'),
              row=4, col=2)

fig_subp_hist.update_layout(height=800, width=1200,
                  title_text="Histogram of Various Fields")

fig_subp_hist.show()



Now we look at each gender separately and examine its distribution of the number of rings.

In [0]:
df_MRings = df_original[['Rings','Sex']].query('Sex == ["M"]')
df_FRings = df_original[['Rings','Sex']].query('Sex == ["F"]')
df_IRings = df_original[['Rings','Sex']].query('Sex == ["I"]')

g0 = len(df_MRings.index)
g1 = len(df_FRings.index)
g2 = len(df_IRings.index)

x = ['Male', 'Female', 'Infant']
y = [g0, g1, g2]

# Use textposition='auto' for direct text
fig_gbar = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            textposition='auto',
        )])

fig_gbar.update_layout(height=400, width=800, title_text="Counts Per Gender")
fig_gbar.show()

In [0]:
print(df_MRings.describe())
print(df_FRings.describe())
print(df_IRings.describe())

         Rings
count 1528.000
mean    10.705
std      3.026
min      3.000
25%      9.000
50%     10.000
75%     12.000
max     27.000
         Rings
count 1307.000
mean    11.129
std      3.104
min      5.000
25%      9.000
50%     10.000
75%     12.000
max     29.000
         Rings
count 1342.000
mean     7.890
std      2.512
min      1.000
25%      6.000
50%      8.000
75%      9.000
max     21.000
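
The three per-gender summaries can also be produced in a single call with `groupby`. This is only a sketch of an alternative: it uses the first ten records displayed in this notebook as stand-in data, and `df_sample` and `summary` are illustrative names.

```python
import pandas as pd

# Sex and Rings for the first ten records of the data set
df_sample = pd.DataFrame({
    'Sex':   ['M', 'M', 'F', 'M', 'I', 'I', 'F', 'F', 'M', 'F'],
    'Rings': [15, 7, 9, 10, 7, 8, 20, 16, 9, 19],
})

# One row per Sex value, one column per statistic
summary = df_sample.groupby('Sex')['Rings'].agg(['count', 'mean', 'std', 'min', 'max'])
print(summary)
```

On the full data frame, `df_original.groupby('Sex')['Rings'].describe()` would give the same statistics as the three separate `describe()` calls.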


In [0]:
xg0 = df_MRings['Rings']
xg1 = df_FRings['Rings']
xg2 = df_IRings['Rings']

fig_bpg = go.Figure()
# Use x instead of y argument for horizontal plot
fig_bpg.add_trace(go.Box(x=xg0, name="Male"))
fig_bpg.add_trace(go.Box(x=xg1, name="Female"))
fig_bpg.add_trace(go.Box(x=xg2, name="Infant"))

fig_bpg.show()

In [0]:
fig_subp_GendRing = make_subplots(
    rows=3, cols=1,
    subplot_titles=("Number of Rings For Male", "Number of Rings For Female", "Number of Rings For Infant"))

fig_subp_GendRing.add_trace(go.Histogram(x=df_MRings['Rings'],name='Male'),
              row=1, col=1)

fig_subp_GendRing.add_trace(go.Histogram(x=df_FRings['Rings'],name='Female'),
              row=2, col=1)

fig_subp_GendRing.add_trace(go.Histogram(x=df_IRings['Rings'],name='Infant'),
              row=3, col=1)

fig_subp_GendRing.update_layout(height=600, width=1200,
                  title_text="Histograms of Number of Rings by Gender")

fig_subp_GendRing.update_xaxes(range=[0, 30], row=1, col=1)
fig_subp_GendRing.update_xaxes(range=[0, 30], row=2, col=1)
fig_subp_GendRing.update_xaxes(range=[0, 30], row=3, col=1)

fig_subp_GendRing.show()

From the plots and statistics above for each gender, we can observe that males and females differ little in number of rings: both have a mean of around 11 and a standard deviation of around 3. Infants, however, clearly have fewer rings than adults. It may be worth creating new dimensions later to capture this gender effect.

We can further illustrate the relation between each non-categorical feature and the number of rings with the following series of scatter plots.

In [0]:
fig_subp_scat = make_subplots(
    rows=4, cols=2,
    subplot_titles=("Number of Rings VS Length", "Number of Rings VS Diameter", "Number of Rings VS Height", 
                    "Number of Rings VS Whole weight","Number of Rings VS Shucked weight", "Number of Rings VS Viscera weight", 
                    "Number of Rings VS Shell weight"))

fig_subp_scat.add_trace(go.Scatter(x=x0, y=x7, mode='markers', name='Rings VS Length'),
              row=1, col=1)

fig_subp_scat.add_trace(go.Scatter(x=x1, y=x7, mode='markers', name='Rings VS Diameter'),
              row=1, col=2)

fig_subp_scat.add_trace(go.Scatter(x=x2, y=x7, mode='markers', name='Rings VS Height'),
              row=2, col=1)

fig_subp_scat.add_trace(go.Scatter(x=x3, y=x7, mode='markers', name='Rings VS Whole weight'),
              row=2, col=2)

fig_subp_scat.add_trace(go.Scatter(x=x4, y=x7, mode='markers', name='Rings VS Shucked weight'),
              row=3, col=1)

fig_subp_scat.add_trace(go.Scatter(x=x5, y=x7, mode='markers', name='Rings VS Viscera weight'),
              row=3, col=2)

fig_subp_scat.add_trace(go.Scatter(x=x6, y=x7, mode='markers', name='Rings VS Shell weight'),
              row=4, col=1)

fig_subp_scat.update_layout(height=800, width=1200,
                  title_text="Scatter Plots of Number of Rings VS Various Non-Categorical Features")

fig_subp_scat.show()




We can handle gender by introducing a few new indicator dimensions, e.g. [IsMale] or [IsFemale]. From the previous visualizations, we know that males and females differ little in their ring distributions, so we also add [IsInfant] and may use only that dimension to represent gender in the ML model.

In [0]:
df_ML = df_original.copy()
df_ML['IsMale'] = np.where(df_ML['Sex'] =='M', 1, 0 )
df_ML['IsFemale'] = np.where(df_ML['Sex'] =='F', 1, 0 )
df_ML['IsInfant'] = np.where(df_ML['Sex'] =='I', 1, 0 )
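
As an alternative sketch, pandas can build all three indicator columns at once with `pd.get_dummies`. The five-value sample and the names `sex` and `dummies` are illustrative; the renaming simply maps the category codes to the indicator names used in this notebook.

```python
import pandas as pd

sex = pd.Series(['M', 'M', 'F', 'M', 'I'], name='Sex')

# One 0/1 column per category; the columns come out in sorted order: F, I, M
dummies = pd.get_dummies(sex).astype(int)

# Rename to match the indicator names used in this notebook
dummies = dummies.rename(columns={'M': 'IsMale', 'F': 'IsFemale', 'I': 'IsInfant'})
print(dummies)
```

Each row contains exactly one 1, so the three columns always sum to 1. A linear model with an intercept therefore needs at most two of them, which is consistent with using [IsInfant] alone.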


In [0]:
df_ML.head(10)


Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,IsMale,IsFemale,IsInfant
0,M,0.455,0.365,0.095,0.514,0.225,0.101,0.15,15,1,0,0
1,M,0.35,0.265,0.09,0.226,0.1,0.049,0.07,7,1,0,0
2,F,0.53,0.42,0.135,0.677,0.257,0.141,0.21,9,0,1,0
3,M,0.44,0.365,0.125,0.516,0.215,0.114,0.155,10,1,0,0
4,I,0.33,0.255,0.08,0.205,0.089,0.04,0.055,7,0,0,1
5,I,0.425,0.3,0.095,0.351,0.141,0.077,0.12,8,0,0,1
6,F,0.53,0.415,0.15,0.777,0.237,0.141,0.33,20,0,1,0
7,F,0.545,0.425,0.125,0.768,0.294,0.149,0.26,16,0,1,0
8,M,0.475,0.37,0.125,0.509,0.216,0.113,0.165,9,1,0,0
9,F,0.55,0.44,0.15,0.894,0.315,0.151,0.32,19,0,1,0


Next we check the correlation matrix. Correlation ranges from -1 to +1: values near 0 indicate no linear relationship, while values near +1 or -1 indicate a strong positive or negative linear relationship.

In [0]:
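# Restrict to numeric columns: the string-valued Sex column cannot be correlated,
# and recent pandas versions raise an error instead of silently dropping it
df_corr = df_ML.select_dtypes(include='number').corr()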

In [0]:
fig_corr = go.Figure()
fig_corr.add_trace(
    go.Heatmap(
        x = df_corr.columns,
        y = df_corr.index,
        z = np.array(df_corr),
        text=df_corr.values,
        texttemplate='%{text:.3f}'
    )
)
fig_corr.show()
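
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. The sketch below works the formula out by hand and checks it against NumPy. The sample uses Shell weight and Rings for the first ten records displayed in this notebook, and `pearson_r` is a hypothetical helper.

```python
import numpy as np

# Shell weight and Rings for the first ten records of the data set
shell_weight = np.array([0.15, 0.07, 0.21, 0.155, 0.055, 0.12, 0.33, 0.26, 0.165, 0.32])
rings = np.array([15, 7, 9, 10, 7, 8, 20, 16, 9, 19], dtype=float)

def pearson_r(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return float((a_centered * b_centered).sum()
                 / np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum()))

r_manual = pearson_r(shell_weight, rings)
r_numpy = np.corrcoef(shell_weight, rings)[0, 1]
print(r_manual, r_numpy)
```

With only ten records the value differs from the one in the heatmap, which is computed over all 4177 rows.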

The most promising attribute for predicting Rings appears to be Shell weight, followed by Diameter, Length, and Height. The latter three can be thought of as relating to the area or volume of the abalone; they are clearly correlated with each other and therefore introduce collinearity. We can multiply two or three of them to obtain an estimated area or volume and check whether it correlates better with Rings, or we can simply choose one of the three for the fit. For the weights, since Shell weight has the highest correlation with Rings, we can use it alone and ignore the others.

In [0]:
df_ML['Area'] = df_ML['Length']*df_ML['Diameter']
df_ML['Vol'] = df_ML['Length']*df_ML['Diameter']*df_ML['Height']
df_ML

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,IsMale,IsFemale,IsInfant,Area,Vol
0,M,0.455,0.365,0.095,0.514,0.225,0.101,0.150,15,1,0,0,0.166,0.016
1,M,0.350,0.265,0.090,0.226,0.100,0.049,0.070,7,1,0,0,0.093,0.008
2,F,0.530,0.420,0.135,0.677,0.257,0.141,0.210,9,0,1,0,0.223,0.030
3,M,0.440,0.365,0.125,0.516,0.215,0.114,0.155,10,1,0,0,0.161,0.020
4,I,0.330,0.255,0.080,0.205,0.089,0.040,0.055,7,0,0,1,0.084,0.007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4172,F,0.565,0.450,0.165,0.887,0.370,0.239,0.249,11,0,1,0,0.254,0.042
4173,M,0.590,0.440,0.135,0.966,0.439,0.214,0.261,10,1,0,0,0.260,0.035
4174,M,0.600,0.475,0.205,1.176,0.525,0.287,0.308,9,1,0,0,0.285,0.058
4175,F,0.625,0.485,0.150,1.095,0.531,0.261,0.296,10,0,1,0,0.303,0.045


In [0]:
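# Numeric columns only: the string-valued Sex column cannot be correlated
df_corr02 = df_ML.select_dtypes(include='number').corr()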

fig_corr02 = go.Figure()
fig_corr02.add_trace(
    go.Heatmap(
        x = df_corr02.columns,
        y = df_corr02.index,
        z = np.array(df_corr02),
        text=df_corr02.values,
        texttemplate='%{text:.3f}'
    )
)
fig_corr02.show()


From the correlation matrix above, introducing Area and Vol gains little in correlation, so we drop that idea. Next we select the features to feed to the ML algorithm. We choose the more promising ones, those with higher correlation with the number of rings, such as [Diameter] and [Shell weight], plus the gender feature [IsInfant].

After a few more library imports and some preprocessing, let's try different linear regression methods provided by [scikit-learn](https://scikit-learn.org/).

In [0]:
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

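# Feature matrix: the three selected predictors
x=df_ML[['Diameter','Shell weight','IsInfant']].copy()

# Target: number of rings
y=df_ML['Rings']
# Note: preprocessing.normalize rescales each row (sample) to unit L2 norm;
# it does not standardize each column (feature)
x = preprocessing.normalize(x)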
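
With its default arguments, `preprocessing.normalize` rescales each sample (row) to unit L2 norm, which is different from standardizing each feature (column). The sketch below contrasts the two on three records from this data set (Diameter, Shell weight, IsInfant). `StandardScaler` is shown only for comparison; it is not what this notebook uses, and swapping it in would change the results reported here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Diameter, Shell weight, IsInfant for three records of the data set
X = np.array([
    [0.365, 0.150, 0.0],
    [0.265, 0.070, 0.0],
    [0.255, 0.055, 1.0],
])

# normalize: each ROW is divided by its own L2 norm
row_scaled = normalize(X)
print(np.linalg.norm(row_scaled, axis=1))   # every row now has length 1

# StandardScaler: each COLUMN is shifted to mean 0 and scaled to unit variance
col_scaled = StandardScaler().fit_transform(X)
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))
```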


We can use scikit-learn's built-in `train_test_split` to define the training and test data sets.

In [0]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
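
To make the effect of `test_size` and `random_state` concrete, here is a sketch on placeholder arrays with the same number of rows as the data set. The names `x_demo`, `y_demo`, and the split outputs are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 4177                      # number of records in the data set
x_demo = np.zeros((n_samples, 3))     # placeholder features, values are irrelevant here
y_demo = np.arange(n_samples)         # placeholder target, doubles as a row identifier

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0)

# Roughly a quarter of the rows land in the test set
print(x_tr.shape, x_te.shape)

# The same seed reproduces the same partition
_, _, _, y_te_again = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0)
print(np.array_equal(y_te, y_te_again))
```

Fixing `random_state` is what makes the error metrics reported below reproducible from run to run.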


In [0]:
# Ordinary least squares on the three selected features
linearRegressor = LinearRegression()
linearRegressor.fit(x_train, y_train)
y_predicted1 = linearRegressor.predict(x_test)
print(linearRegressor.intercept_)
print(linearRegressor.coef_)

-5.658227395486644
[ 8.68340486 17.91129396  9.36342046]


In [0]:
mae1=mean_absolute_error(y_test, y_predicted1)
mse1=mean_squared_error(y_test, y_predicted1)
r21=r2_score(y_test, y_predicted1)
print(mae1)
print(mse1)
print(r21)


1.6828260747867505
5.594361341636072
0.46761057934856687
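
The three numbers above are the mean absolute error, the mean squared error, and the coefficient of determination R^2. The sketch below computes them from their definitions on a tiny made-up example; the values in `y_true` and `y_pred` are invented for illustration and are not taken from the data set.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])    # made-up ring counts
y_pred = np.array([2.0, 5.0, 8.0, 11.0])   # made-up predictions

residuals = y_pred - y_true

mae = np.abs(residuals).mean()                    # average absolute error
mse = (residuals ** 2).mean()                     # average squared error
ss_res = (residuals ** 2).sum()                   # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()    # total sum of squares
r2 = 1.0 - ss_res / ss_tot                        # share of variance explained

print(mae, mse, r2)
```

MAE is in the same unit as the target, so an MAE of about 1.68 means the predictions are off by about 1.7 rings on average.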


In [0]:
# Expand the features to all degree-2 terms, then fit ordinary least squares
polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x_train)
# Reuse the transformer fitted on the training data for the test data
x_poly_test = polynomial_features.transform(x_test)
model = LinearRegression()
model.fit(x_poly, y_train)
y_predicted2 = model.predict(x_poly_test)
print(model.intercept_)
print(model.coef_)

1098119599756.1034
[ 0.00000000e+00  8.27777703e+01  1.21254337e+02  1.28124274e+02
 -1.09811960e+12 -9.20767874e+01 -7.51789225e+01 -1.09811960e+12
 -6.57472728e+01 -1.09811960e+12]
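
The intercept and three of the coefficients above share the same magnitude (about 1.098e12) with opposite signs. A likely explanation is collinearity introduced by the earlier `preprocessing.normalize(x)` call: every row has unit L2 norm, so the three squared terms always add up to 1 and duplicate the constant term. The sketch below reproduces this on synthetic data; `raw` and `x_unit` are stand-ins, not the notebook's features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, normalize

rng = np.random.default_rng(0)
raw = rng.uniform(0.1, 1.0, size=(50, 3))   # synthetic stand-in for the 3 features
x_unit = normalize(raw)                      # each row now has unit L2 norm

# For 3 inputs a, b, c the degree-2 columns are ordered:
# 1, a, b, c, a^2, a*b, a*c, b^2, b*c, c^2
x_poly = PolynomialFeatures(degree=2).fit_transform(x_unit)

# Columns 4, 7, 9 hold a^2, b^2, c^2; on unit-norm rows they sum to 1
squares_sum = x_poly[:, [4, 7, 9]].sum(axis=1)
print(np.allclose(squares_sum, 1.0))

# Compare the number of columns with the numerical rank of the design matrix
print(x_poly.shape[1], np.linalg.matrix_rank(x_poly))
```

Positions 4, 7, and 9 are exactly where the -1.098e12 coefficients appear in the output above. The predictions are unaffected because the huge terms cancel, but the individual coefficients carry no meaning.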


In [0]:
mae2=mean_absolute_error(y_test, y_predicted2)
mse2=mean_squared_error(y_test, y_predicted2)
r22=r2_score(y_test, y_predicted2)
print(mae2)
print(mse2)
print(r22)

1.653627532520933
5.412032930217862
0.4849619285724355


In [0]:
# Lasso with the penalty strength alpha chosen by cross-validation
lasso_cv = linear_model.LassoCV(alphas=[0.001, 0.01, 0.1, 1, 10], random_state=0).fit(x_train, y_train)
y_predicted3 = lasso_cv.predict(x_test)
print(lasso_cv.intercept_)
print(lasso_cv.coef_)

-4.537680992190069
[ 7.70242707 17.35137568  8.55250839]
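
`LassoCV` picks one value from the `alphas` grid by cross-validation and exposes it as the fitted attribute `alpha_`. The notebook output does not record which value was chosen, so the sketch below only demonstrates the mechanism on synthetic data; `X_demo`, `y_demo`, and `lasso_demo` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))                       # synthetic features
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

alphas = [0.001, 0.01, 0.1, 1, 10]
lasso_demo = LassoCV(alphas=alphas, random_state=0).fit(X_demo, y_demo)

# alpha_ is the penalty strength selected by cross-validation
print(lasso_demo.alpha_)
print(lasso_demo.coef_)
```

Printing `lasso_cv.alpha_` after the fit in the cell above would show which penalty the abalone model actually used.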


In [0]:
mae3=mean_absolute_error(y_test, y_predicted3)
mse3=mean_squared_error(y_test, y_predicted3)
r23=r2_score(y_test, y_predicted3)
print(mae3)
print(mse3)
print(r23)

1.685718027574483
5.604217902602817
0.4666725761588827


### Conclusion
  
Overall, all three methods produced models with similar errors and R^2 values close to 0.5. As a rough rule of thumb, an R^2 above 0.7 is considered strong and one below 0.4 weak, so an R^2 of around 0.5 indicates a moderate relationship, at least between the features we selected and the number of rings. On that basis, we consider this project a success. Linear regression with polynomial features of degree 2 produced the smallest errors, with MAE = 1.65, MSE = 5.41, and R^2 = 0.485, yet it comes with a very large intercept and very large coefficients (on the order of 10^12), which makes the fitted expression hard to interpret. These large values are a symptom of collinearity: `preprocessing.normalize` rescales every row to unit length, so the three squared terms always sum to 1 and duplicate the intercept. Ordinary Least Squares and Lasso CV produced close values, with R^2 of 0.468 and 0.467 respectively. This is because Lasso with a very small alpha behaves much like ordinary least squares.