# Predicting Abalone Age

Abalones are a rare type of marine snail found in cold coastal saltwater and highly valued for their culinary uses. The common name abalone refers to a number of large gastropod molluscs in the family *Haliotidae*. Their popularity as a delicacy has led to overharvesting, which puts great pressure on the species and makes abalones even rarer and more expensive. Assessing the age of these organisms, whether for conservation, harvesting, or research, is a tedious task that requires cutting open the snail's shell, staining it, and counting the individual rings under a microscope. For this reason, we wish to design a model that will **predict the age of abalones** from other, more easily obtained measurements, such as physical dimensions and weight, using regression.

The dataset we will use contains 4,177 observations and 9 columns: sex (M, F, or I for infant), length in mm, diameter in mm, height in mm, whole weight in grams, shucked weight in grams (without shell), viscera weight in grams (after bleeding), shell weight in grams, and finally the number of rings, which is approximately 1.5 less than the age of the snail in years. Note that the continuous values in the UCI file are stored in scaled form (divided by 200), which is why the means reported in the tables below are all close to or below 1. Once the model is designed, we will evaluate the accuracy of its predictions to answer the question: how well can we predict the age of an abalone from its size (length and diameter), sex, and weight (whole, shucked, viscera, and shell)?
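
To make the relationship between rings and age concrete, the following minimal sketch converts a few ring counts to ages and scores a set of predictions. The ring counts and predicted ages are made-up illustrative values, and root mean squared error (RMSE) is only assumed here as one common way to measure regression accuracy; the text above does not fix a metric.

```r
# Ring counts for a few hypothetical abalones (illustrative values, not from the dataset)
rings <- c(7, 9, 10, 15)

# Age in years is approximately the ring count plus 1.5
age <- rings + 1.5

# Hypothetical model predictions for the same four abalones
predicted_age <- c(9.0, 10.0, 12.5, 14.5)

# RMSE (an assumed metric): square root of the mean squared prediction error
rmse <- sqrt(mean((predicted_age - age)^2))
print(age)
print(rmse)
```

An RMSE is expressed in the same unit as the outcome, so a value near 1 would mean predictions are typically off by about one year.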

## Setup code

In [2]:
library(tidyverse)   # data wrangling and plotting
library(tidymodels)  # data splitting and modelling
download.file('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', 'data.csv')
set.seed(695624153)  # the seed must fit in a 32-bit integer (at most 2147483647)
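
Note that `set.seed()` only accepts a value that fits in R's 32-bit integer type; a larger number, such as a 12-digit value, is rejected with "supplied seed is not a valid integer". The short sketch below checks a candidate seed against `.Machine$integer.max` and confirms that a valid seed makes random draws reproducible. The variable names and the seed 2020 are arbitrary choices for illustration.

```r
# A 12-digit candidate seed, too large for a 32-bit integer
candidate_seed <- 695624153456
too_large <- candidate_seed > .Machine$integer.max
print(too_large)

# Any value within the integer range works; 2020 is an arbitrary example
valid_seed <- 2020L
set.seed(valid_seed)
first_draw <- runif(3)
set.seed(valid_seed)
second_draw <- runif(3)

# Resetting the same seed reproduces the same random numbers
print(identical(first_draw, second_draw))
```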


## Loading and wrangling data

In [5]:
abalone <- read_csv('data.csv', col_names = c(
    'sex',
    'length',
    'diameter',
    'height',
    'whole_weight',
    'shucked_weight',
    'viscera_weight',
    'shell_weight',
    'rings'
))

abalone <- abalone %>%
    mutate(sex = as_factor(sex)) %>%  # M, F, or I (infant)
    mutate(age = rings + 1.5)         # age in years is roughly rings + 1.5

abalone_split <- initial_split(abalone, prop = 0.75, strata = age)  # 75/25 split, stratified by age
abalone_training <- training(abalone_split)
abalone_testing <- testing(abalone_split)

Parsed with column specification:
cols(
  sex = col_character(),
  length = col_double(),
  diameter = col_double(),
  height = col_double(),
  whole_weight = col_double(),
  shucked_weight = col_double(),
  viscera_weight = col_double(),
  shell_weight = col_double(),
  rings = col_double()
)
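
The `strata = age` argument asks `initial_split()` to keep the age distribution similar in the training and testing sets. The base R sketch below illustrates the general idea on synthetic data: bin the numeric outcome into quartiles, then sample 75% of the rows within each bin. The data frame `toy` and its values are hypothetical, and this is an illustration of stratified sampling rather than a reproduction of how rsample implements it internally.

```r
set.seed(1)

# Synthetic stand-in for the age column (hypothetical data, not the abalone file)
toy <- data.frame(id = 1:200, age = round(runif(200, 2.5, 30.5), 1))

# Bin the numeric outcome into quartiles
toy$bin <- cut(toy$age,
               breaks = quantile(toy$age, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)

# Within each bin, draw 75% of the row ids for the training set
train_ids <- unlist(lapply(split(toy$id, toy$bin), function(ids) {
    ids[sample.int(length(ids), size = round(0.75 * length(ids)))]
}))

toy_training <- toy[toy$id %in% train_ids, ]
toy_testing  <- toy[!(toy$id %in% train_ids), ]

# Each quartile bin contributes roughly the same share to both sets
print(table(toy_training$bin) / table(toy$bin))
```

Sampling within bins, rather than from the whole data frame at once, prevents a split in which, by chance, most of the oldest animals end up in only one of the two sets.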



## Summarization

Table 1: The number of abalones observed for each ring count (mean = 9.934 rings). The age we aim to predict is derived from this variable.

In [16]:
age_counts <- abalone %>%
    group_by(rings) %>%
    summarize(n = n())
age_counts




| rings | n |
|-------|---|
| `<dbl>` | `<int>` |
| 1 | 1 |
| 2 | 1 |
| 3 | 15 |
| 4 | 57 |
| 5 | 115 |
| 6 | 259 |
| 7 | 391 |
| 8 | 568 |
| 9 | 689 |
| 10 | 634 |
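
A mean ring count can be recovered from a count table as a weighted mean, sum(rings × n) / sum(n). The sketch below applies this to the ten rows displayed in Table 1 only, so it covers just 2,730 of the 4,177 observations and is not expected to reproduce the full-data mean of 9.934. The object names are chosen for illustration.

```r
# The ten rows of the count table that are displayed in Table 1
ring_counts <- data.frame(
    rings = 1:10,
    n = c(1, 1, 15, 57, 115, 259, 391, 568, 689, 634)
)

# Weighted mean of the ring count, using the number of abalones as weights
mean_rings_shown <- sum(ring_counts$rings * ring_counts$n) / sum(ring_counts$n)

# The corresponding mean age adds the 1.5-year offset
mean_age_shown <- mean_rings_shown + 1.5

print(sum(ring_counts$n))
print(mean_rings_shown)
print(mean_age_shown)
```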


Table 2: The mean value of each numerical predictor in the dataset.

In [17]:
predictor_means <- abalone %>%
    select(-rings) %>%
    summarize(across(length:shell_weight, mean))
predictor_means

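| length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight |
|--------|----------|--------|--------------|----------------|----------------|--------------|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0.5239921 | 0.4078813 | 0.1395164 | 0.8287422 | 0.3593675 | 0.1805936 | 0.2388309 |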


Table 3: The mean value of each numerical predictor, grouped by the sex of the abalone. From this table we observe that females are on average slightly larger and heavier than males, and that both are considerably larger and heavier than infants.

In [20]:
sex_means <- abalone %>%
    group_by(sex) %>%
    summarize(across(length:shell_weight, mean))
sex_means




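| sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight |
|-----|--------|----------|--------|--------------|----------------|----------------|--------------|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| M | 0.5613907 | 0.4392866 | 0.1513809 | 0.9914594 | 0.432946 | 0.2155445 | 0.2819692 |
| F | 0.5790933 | 0.4547322 | 0.1580107 | 1.0465321 | 0.4461878 | 0.2306886 | 0.3020099 |
| I | 0.4277459 | 0.326494 | 0.1079955 | 0.4313625 | 0.191035 | 0.09201006 | 0.1281822 |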
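
To quantify the differences described in the caption of Table 3, the sketch below re-enters the group means from the table and expresses the female and infant means as ratios of the male means. The matrix layout and the object names are choices made for this illustration only.

```r
# Group means copied from Table 3 (rows: sex, columns: predictors)
sex_means <- rbind(
    M = c(0.5613907, 0.4392866, 0.1513809, 0.9914594, 0.4329460, 0.21554450, 0.2819692),
    F = c(0.5790933, 0.4547322, 0.1580107, 1.0465321, 0.4461878, 0.23068860, 0.3020099),
    I = c(0.4277459, 0.3264940, 0.1079955, 0.4313625, 0.1910350, 0.09201006, 0.1281822)
)
colnames(sex_means) <- c("length", "diameter", "height", "whole_weight",
                         "shucked_weight", "viscera_weight", "shell_weight")

# Ratios above 1 mean the group is larger or heavier than males on average
female_vs_male <- sex_means["F", ] / sex_means["M", ]
infant_vs_male <- sex_means["I", ] / sex_means["M", ]

print(round(female_vs_male, 3))
print(round(infant_vs_male, 3))
```

On these means, females exceed males by roughly 3% to 7% on every predictor, while infants fall well below males on all of them, with the gap largest for the weight measurements.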


## Visualization

## Methods

## Expected Outcomes

We expect to find that older abalones are larger and heavier than younger abalones of the same sex, since they have had more time to grow. The ability to predict the age of an abalone without cutting open the shell, staining it, and counting the rings under a microscope would save scientists time when performing research on abalones. In principle, this would allow for more complex and extensive research on the species, because collecting essential measurements would be less labor intensive.
