# Can you estimate the age of an abalone?

## 📖 Background
You are working as an intern for an abalone farming operation in Japan. For operational and environmental reasons, it is important to estimate the age of the abalones when they go to market.

Determining an abalone's age involves counting the number of rings in a cross-section of the shell through a microscope. Since this method is somewhat cumbersome and complex, you are interested in helping the farmers estimate the age of the abalone using its physical characteristics.

## <h1 style="background:#e61010; border:0; border-radius: 16px; color:#D3D3D3"><center>1. Introduction</center></h1>

<div style="text-align: justify">Abalone is a shellfish considered a delicacy in many parts of the world. It is an excellent source of iron and pantothenic acid, and it is farmed as a nutritious food resource in Australia, America and East Asia. 100 grams of abalone yields more than 20% of the recommended daily intake of these nutrients. The economic value of abalone is positively correlated with its age, so determining the age accurately is important for both farmers and customers when setting its price. However, the current method for determining age is costly and inefficient: the shell is cut through the cone, stained, and the rings are counted through a microscope, a laborious task. Other measurements, which are easier to obtain, are therefore used to predict the age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem. For this analysis, however, we assume that the abalone's physical measurements are sufficient to provide an accurate age prediction.</div>


**Paper objectives**:
1. How does weight change with age for each of the three sex categories?
2. Can you estimate an abalone's age using its physical characteristics? 
3. Investigate which variables are better predictors of age for abalones.

<a id =""></a><h2 style="background:#e65010; border:0; border-radius: 12px; color:black"><center>Features of data</center></h2> 

- #### **The dataset has 4177 entries and 10 columns**:

Feature | Data Type | Measurement | Description 
:--------: | ------- | :-------: | -------  
`sex` | categorical |    | M, F, and I (Infant)
`length` | continuous | mm | longest shell measurement
`diameter` | continuous | mm | perpendicular to the length
`height` | continuous | mm | measured with meat in the shell
`whole_wt` | continuous | grams | whole abalone weight
`shucked_wt` | continuous | grams | the weight of abalone meat
`viscera_wt` | continuous | grams | gut-weight
`shell_wt` | continuous | grams | the weight of the dried shell
`rings` | continuous |  | number of rings in a shell cross-section
`age` | continuous |  | the age of the abalone: the number of rings + 1.5

In [47]:
### Loading Packages
#!pip install category_encoders

In [59]:
# Data manipulation
import pandas as pd
import numpy as np
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# sns.set_style('darkgrid')
# sns.set_palette('colorblind')

In [49]:
# Load dataset
raw_data = pd.read_csv('./abalone.csv')

# View the first five rows of the data
raw_data.head()

Unnamed: 0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
0,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
1,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


In [50]:
# Set the column names
colnames = ['sex', 'length', 'diameter', 'height', 'whole_weight',
            'shucked_weight', 'viscera_weight', 'shell_weight', 'rings']

raw_data.columns = colnames

raw_data.head()

Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings
0,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
1,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
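The first `head()` output above shows a data record (`M, 0.455, 0.365, ...`) in the header position, which suggests that the CSV has no header row and that `read_csv` consumed the first record as column names. That would leave the frame one row short of the 4177 entries listed in the feature description. A minimal sketch of loading a headerless file, assuming that is indeed the file layout; the in-memory `sample_csv` below is a stand-in for `./abalone.csv` built from the records shown above:

```python
import io
import pandas as pd

# Column names taken from the notebook's `colnames` list
colnames = ['sex', 'length', 'diameter', 'height', 'whole_weight',
            'shucked_weight', 'viscera_weight', 'shell_weight', 'rings']

# Stand-in for a headerless ./abalone.csv (assumption): three records from above
sample_csv = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
)

# header=None keeps the first record as data; names= assigns the column labels
df = pd.read_csv(sample_csv, header=None, names=colnames)

print(df.shape)
print(df.iloc[0].tolist())
```

With the real file, the same two arguments would be passed to `pd.read_csv('./abalone.csv', ...)`, and the separate `raw_data.columns = colnames` step would no longer be needed.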


In [51]:
# Add age data: age = rings + 1.5, as defined in the feature table
raw_data['age'] = raw_data.rings + 1.5

In [52]:
# Check out the data
print(raw_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4176 entries, 0 to 4175
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sex             4176 non-null   object 
 1   length          4176 non-null   float64
 2   diameter        4176 non-null   float64
 3   height          4176 non-null   float64
 4   whole_weight    4176 non-null   float64
 5   shucked_weight  4176 non-null   float64
 6   viscera_weight  4176 non-null   float64
 7   shell_weight    4176 non-null   float64
 8   rings           4176 non-null   int64  
 9   age             4176 non-null   float64
dtypes: float64(8), int64(1), object(1)
memory usage: 326.4+ KB
None


In [53]:
# Summary of the data
raw_data.describe(include='all')

Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings,age
count,4176,4176.0,4176.0,4176.0,4176.0,4176.0,4176.0,4176.0,4176.0,4176.0
unique,3,,,,,,,,,
top,M,,,,,,,,,
freq,1527,,,,,,,,,
mean,,0.524009,0.407892,0.139527,0.828818,0.3594,0.180613,0.238852,9.932471,11.432471
std,,0.120103,0.09925,0.041826,0.490424,0.22198,0.10962,0.139213,3.223601,3.223601
min,,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0,2.5
25%,,0.45,0.35,0.115,0.4415,0.186,0.093375,0.13,8.0,9.5
50%,,0.545,0.425,0.14,0.79975,0.336,0.171,0.234,9.0,10.5
75%,,0.615,0.48,0.165,1.15325,0.502,0.253,0.329,11.0,12.5


#### Observations:
- The non-null count is the same for every feature, which suggests there are no missing values in this dataset. We will verify this below.
- There is only one categorical feature (`sex`). The quantitative features are either `float64` or `int64`, so every feature in this dataset has an appropriate data type.
- The minimum height is 0, which is not physically possible. We will investigate these observations later.
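The zero-height observation can be checked with a boolean filter. The sketch below uses a small toy frame with illustrative values (not rows from the dataset) in place of `raw_data`. Dropping the affected rows is only one possible treatment and is an assumption here, not a step taken in this notebook:

```python
import pandas as pd

# Toy frame standing in for `raw_data`; the values are illustrative only
toy = pd.DataFrame({
    'sex': ['M', 'I', 'F', 'I'],
    'height': [0.095, 0.0, 0.135, 0.0],
    'whole_weight': [0.514, 0.428, 0.677, 0.134],
})

# Rows whose recorded height is not strictly positive
zero_height = toy[toy['height'] <= 0]
print(zero_height)

# One possible treatment (an assumption): keep only rows with a positive height
cleaned = toy[toy['height'] > 0].reset_index(drop=True)
print(len(toy), '->', len(cleaned))
```

Imputing the height from correlated features such as `length` or `diameter` would be an alternative if losing rows is undesirable.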

<a id ="2.3"></a><h2 style="background:#e65010; border:0; border-radius: 12px; color:black"><center>Exploratory Data Analysis</center></h2>

In [54]:
data = raw_data.copy()

In [55]:
# Checking Missing values
data.isna().sum()

sex               0
length            0
diameter          0
height            0
whole_weight      0
shucked_weight    0
viscera_weight    0
shell_weight      0
rings             0
age               0
dtype: int64

In [63]:
# Checking Duplicated Values
data.duplicated().sum()

0

In [70]:
# Checking for Outliers
fig = px.box(data_frame=data, 
            x='sex', 
            y='age', 
            color='sex')
fig.show()
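The box plot shows outliers visually. To count them, the common 1.5 × IQR rule can be applied per group. This is a minimal sketch on toy data: the helper `iqr_outlier_mask` is a hypothetical name, and the quartiles use pandas' default interpolation, which may differ slightly from the method plotly uses to draw its whiskers:

```python
import pandas as pd

# Toy frame standing in for `data`; the values are illustrative only
toy = pd.DataFrame({
    'sex': ['M'] * 6 + ['F'] * 6,
    'age': [9.0, 10.0, 10.5, 11.0, 12.0, 30.0,
            10.0, 11.0, 11.5, 12.0, 13.0, 13.5],
})

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Apply the rule within each sex group, mirroring the grouped box plot
toy['is_outlier'] = (
    toy.groupby('sex')['age'].transform(iqr_outlier_mask).astype(bool)
)

outliers_per_sex = toy.groupby('sex')['is_outlier'].sum()
print(outliers_per_sex)
```

On the notebook's data, the same call would be made with `data` in place of `toy`.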


In [71]:
# Distribution of the data based on sex
fig = px.histogram(data_frame=data, 
                    x='sex', 
                    color='sex')
fig.show()


- The histogram shows that the data is fairly evenly distributed across the three levels of `sex`: male, female and infant.
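The balance between the categories can also be quantified rather than read off the chart. A minimal sketch using `value_counts`, with a toy series in place of `data['sex']` (the counts below are illustrative, not the dataset's):

```python
import pandas as pd

# Toy series standing in for data['sex']; the counts are illustrative only
sex = pd.Series(['M'] * 4 + ['F'] * 3 + ['I'] * 3, name='sex')

counts = sex.value_counts()                       # absolute frequency per level
shares = sex.value_counts(normalize=True).round(3)  # relative frequency per level

summary = pd.DataFrame({'count': counts, 'share': shares})
print(summary)
```

Shares close to one third each would support the "evenly distributed" reading.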

In [57]:
# # Drop the rings from the data
# data = data.drop(columns=['rings'])

# # Separating the features and the target
# X = data.drop('age', axis=1)
# y = data.age

# # Splitting dataset into train and test