In [2]:
library(tidyverse)
abalone_data <- read_csv("abalone.data", col_names = FALSE)
abalone_data <- rename(abalone_data, 
                       sex = X1,
                       length_mm = X2,
                       diameter_mm = X3,
                       height_mm = X4,
                       whole_weight_g = X5,
                       shucked_weight_g = X6,
                       viscera_weight_g = X7,
                       shell_weight_g = X8,
                       rings = X9)
abalone_data <- mutate(abalone_data, age_yrs = rings + 1.5)
head(abalone_data)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Rows: 4177 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): X1
dbl (8): X2, X3, X4, X5, X6, X7, X8, X9

ℹ Use `spec()` to retrieve the full column specification for this data.

sex,length_mm,diameter_mm,height_mm,whole_weight_g,shucked_weight_g,viscera_weight_g,shell_weight_g,rings,age_yrs
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15,16.5
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7,8.5
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9,10.5
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10,11.5
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7,8.5
I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8,9.5


In [3]:
abalone_data <- select(abalone_data, sex, length_mm, whole_weight_g, age_yrs)
head(abalone_data)

sex,length_mm,whole_weight_g,age_yrs
<chr>,<dbl>,<dbl>,<dbl>
M,0.455,0.514,16.5
M,0.35,0.2255,8.5
F,0.53,0.677,10.5
M,0.44,0.516,11.5
I,0.33,0.205,8.5
I,0.425,0.3515,9.5


### Methods
For our analysis, we will use only the length and whole weight of an abalone to predict its age using regression. In practice, the length of an abalone is used to determine the maturity of the marine snail, and the whole weight can be used to track growth; both are indicative of age. The relationship between these two predictors and age can be visualized with a scatter plot.
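
As a rough sketch of such a scatter plot, the block below plots only the six rows shown in the 'head()' output above, so that it runs on its own. The names 'abalone_sample' and 'age_plot' are illustrative and not part of the analysis; the real plot would use the full 'abalone_data' (or the training set).

```r
library(ggplot2)

# Six rows copied from the head() output above (illustration only)
abalone_sample <- data.frame(
  length_mm      = c(0.455, 0.350, 0.530, 0.440, 0.330, 0.425),
  whole_weight_g = c(0.5140, 0.2255, 0.6770, 0.5160, 0.2050, 0.3515),
  age_yrs        = c(16.5, 8.5, 10.5, 11.5, 8.5, 9.5)
)

# Length on the x-axis, whole weight on the y-axis, age mapped to point colour
age_plot <- ggplot(abalone_sample,
                   aes(x = length_mm, y = whole_weight_g, colour = age_yrs)) +
  geom_point(size = 3) +
  labs(x = "Length", y = "Whole weight", colour = "Age (years)")

age_plot
```

Mapping age to colour is only one option; plotting each predictor against age in its own panel would show the two relationships separately.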

### Splitting into Training Data

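Since 'sex' is a categorical variable, we can stratify on it when splitting the data into training and test sets with 'initial_split()', so that both sets keep similar proportions of male, female, and infant abalone.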
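
To illustrate what a stratified split does, here is a minimal sketch on made-up data. The data frame 'toy' and its 'value' column are hypothetical; only 'initial_split()', 'training()', and 'testing()' correspond to the calls used in this analysis.

```r
library(rsample)

set.seed(1)

# Hypothetical toy data: 300 rows in three equally sized groups
toy <- data.frame(
  sex   = factor(rep(c("Male", "Female", "Infant"), each = 100)),
  value = rnorm(300)
)

# A 75/25 split, sampling within each level of sex
toy_split <- initial_split(toy, prop = 0.75, strata = sex)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Group counts in each set; with strata = sex, roughly 75% of each group lands in training
table(toy_train$sex)
table(toy_test$sex)
```

Without the 'strata' argument the split is a simple random sample of rows, so the group proportions in the two sets can drift apart by chance.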


In [4]:
set.seed(1)

abalone_data <- abalone_data |>
  mutate(sex = as_factor(sex))  |>
  mutate(sex = fct_recode(sex, "Male" = "M", "Female" = "F", "Infant" = "I"))
head(abalone_data)
glimpse(abalone_data)

sex,length_mm,whole_weight_g,age_yrs
<fct>,<dbl>,<dbl>,<dbl>
Male,0.455,0.514,16.5
Male,0.35,0.2255,8.5
Female,0.53,0.677,10.5
Male,0.44,0.516,11.5
Infant,0.33,0.205,8.5
Infant,0.425,0.3515,9.5


Rows: 4,177
Columns: 4
$ sex            <fct> Male, Male, Female, Male, Infant, Infant, Female, Femal…
$ length_mm      <dbl> 0.455, 0.350, 0.530, 0.440, 0.330, 0.425, 0.530, 0.545,…
$ whole_weight_g <dbl> 0.5140, 0.2255, 0.6770, 0.5160, 0.2050, 0.3515, 0.7775,…
$ age_yrs        <dbl> 16.5, 8.5, 10.5, 11.5, 8.5, 9.5, 21.5, 17.5, 10.5, 20.5…


In [5]:
library(tidymodels)

abalone_split <- initial_split(abalone_data, prop = 0.75, strata = sex)
abalone_train <- training(abalone_split)
abalone_test <- testing(abalone_split)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.4

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()

In [6]:
glimpse(abalone_train)

Rows: 3,132
Columns: 4
$ sex            <fct> Female, Female, Female, Female, Female, Female, Female,…
$ length_mm      <dbl> 0.545, 0.525, 0.535, 0.470, 0.440, 0.615, 0.580, 0.680,…
$ whole_weight_g <dbl> 0.7680, 0.6065, 0.6845, 0.4755, 0.4510, 1.1615, 0.9955,…
$ age_yrs        <dbl> 17.5, 15.5, 11.5, 11.5, 11.5, 11.5, 12.5, 16.5, 20.5, 1…


In [7]:
glimpse(abalone_test)

Rows: 1,045
Columns: 4
$ sex            <fct> Female, Infant, Female, Female, Male, Male, Female, Fem…
$ length_mm      <dbl> 0.530, 0.425, 0.530, 0.550, 0.365, 0.450, 0.565, 0.550,…
$ whole_weight_g <dbl> 0.6770, 0.3515, 0.7775, 0.8945, 0.2555, 0.3810, 0.9395,…
$ age_yrs        <dbl> 10.5, 9.5, 21.5, 20.5, 8.5, 10.5, 13.5, 10.5, 12.5, 14.…


We can check how representative the training data is by comparing the proportion of each level of the 'sex' variable in the training data 'abalone_train' with the proportion in the original data 'abalone_data'.

In [8]:
# Check the proportions of male, female, and infant abalone in the original data
abalone_proportions <- abalone_data |>
                      group_by(sex) |>
                      summarize(n = n()) |>
                      mutate(percent = 100*n/nrow(abalone_data))

abalone_proportions

sex,n,percent
<fct>,<int>,<dbl>
Male,1528,36.58128
Female,1307,31.2904
Infant,1342,32.12832


In [9]:
# Check the proportions of male, female, and infant abalone in the training data
abalone_train_proportions <- abalone_train |>
                      group_by(sex) |>
                      summarize(n = n()) |>
                      mutate(percent = 100*n/nrow(abalone_train))

abalone_train_proportions

sex,n,percent
<fct>,<int>,<dbl>
Male,1146,36.59004
Female,980,31.28991
Infant,1006,32.12005


Both 'abalone_data' and 'abalone_train' have nearly identical proportions of each level of the stratification variable 'sex', indicating that the proportions were preserved by the split.
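
The comparison can also be made in a single table. The sketch below re-enters the counts printed in the two tables above rather than recomputing them, so it runs on its own; the names 'full_counts', 'train_counts', and 'comparison' are illustrative.

```r
library(dplyr)

# Counts copied from the two output tables above
full_counts  <- data.frame(sex = c("Male", "Female", "Infant"),
                           n   = c(1528, 1307, 1342))
train_counts <- data.frame(sex = c("Male", "Female", "Infant"),
                           n   = c(1146, 980, 1006))

# Convert counts to percentages, join on sex, and take the difference
comparison <- full_counts |>
  mutate(percent_full = 100 * n / sum(n)) |>
  select(sex, percent_full) |>
  inner_join(
    train_counts |>
      mutate(percent_train = 100 * n / sum(n)) |>
      select(sex, percent_train),
    by = "sex"
  ) |>
  mutate(difference = percent_train - percent_full)

comparison
```

For these counts, every level differs by less than 0.01 percentage points between the full data and the training data.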

Before using the training data, we need to standardize our predictor variables (scale them to unit standard deviation and center them at zero), estimating the scaling values from the training data set only.

In [10]:
# Predict age from length and whole weight, as described in Methods
abalone_recipe <- recipe(age_yrs ~ length_mm + whole_weight_g, data = abalone_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
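
To show what these two steps do, here is a small sketch on made-up numbers. The data frame 'toy' and its columns 'y', 'x1', and 'x2' are hypothetical. 'step_scale()' divides each predictor by its standard deviation, and 'step_center()' then subtracts the mean, so each predictor should end up with mean 0 and standard deviation 1 while the outcome is left unchanged.

```r
library(recipes)

# Hypothetical toy data: one outcome and two predictors on different scales
toy <- data.frame(
  y  = c(1, 2, 3, 4, 5),
  x1 = c(10, 20, 30, 40, 50),
  x2 = c(0.1, 0.4, 0.2, 0.8, 0.5)
)

toy_recipe <- recipe(y ~ x1 + x2, data = toy) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# prep() estimates the standard deviations and means from the data;
# bake(new_data = NULL) returns the processed version of that same data
toy_scaled <- toy_recipe |>
  prep() |>
  bake(new_data = NULL)

# Check the mean and standard deviation of each standardized predictor
sapply(toy_scaled[c("x1", "x2")], mean)
sapply(toy_scaled[c("x1", "x2")], sd)
```

The recipe itself only records the steps; nothing is computed until 'prep()' is called, which is what allows the values estimated on the training set to be reused later on the test set.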