# Methods and Results 

In [1]:
# Required packages and versions
# install.packages("janitor")
# install.packages("cowplot")

In [2]:
library(tidyverse)
library(janitor)
library(repr)
library(tidymodels)
library(cowplot)

── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.2 ──
[32m✔[39m [34mggplot2[39m 3.4.2     [32m✔[39m [34mpurrr  [39m 1.0.1
[32m✔[39m [34mtibble [39m 3.2.1     [32m✔[39m [34mdplyr  [39m 1.1.1
[32m✔[39m [34mtidyr  [39m 1.3.0     [32m✔[39m [34mstringr[39m 1.5.0
[32m✔[39m [34mreadr  [39m 2.1.3     [32m✔[39m [34mforcats[39m 0.5.2
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()

Attaching package: ‘janitor’


The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test


── [1mAttaching packages[22m ────────────────────────────────────── tidymodels 1.0.0 ──

[32m✔[39m [34mbroom       [39m 1.0.2     [32m✔[39m [34mrsample     [39m 1.1.1
[32m✔[39m [34mdials       [39m 1.1.0     [32m✔[39m [34mtu

## Preliminary Exploratory Data Analysis

### 1. load the dataset 
Source: https://www.kaggle.com/datasets/risakashiwabara/jpmean-and-standard-deviation-of-height-and-weight/

To answer this question we will analyze a data set of Japanese boys' heights and weights. This data, `man.csv` was downloaded from [Kaggle](https://www.kaggle.com/datasets/risakashiwabara/jpmean-and-standard-deviation-of-height-and-weight/)(same as source). Each row in the data set represents an observation (i.e. a Japanese boy). 
The columns in the data set `man.csv` represents: 
- `year` - The age of the individual (yrs), which we chose as the **response variable** that is continuous.
- `category` - The region the individual is from
- `height_average` - The average height of the boys with the same age in this region (cm)
- `height_standard deviation` - The standard deviation of the height of the boys with the same age in this region (cm)
- `body weight _average` - The average weight of the boys with the same age in this region (kg)
- `body weight _standard deviation` - The standard deviation of the weight of the boys with the same age in this region (kg)

In [3]:
man_data <- read_csv("https://raw.githubusercontent.com/Ekenny02/dsci-100-project/main/man.csv")

[1mRows: [22m[34m624[39m [1mColumns: [22m[34m6[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (1): category
[32mdbl[39m (5): year, height_average, height_standard deviation, body weight _avera...

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


#### Table 1:  *Man Dataset - untidy*
Here we take a look at the original data set `man_data` (**Table 1**), because there are too many observation, we use `head(man_data)` to get a sens of what the data set looks like. The first 6 rows of the original data set `man_data` is shown below.

In [4]:
head(man_data)

year,category,height_average,height_standard deviation,body weight _average,body weight _standard deviation
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
5,national,111.0,4.87,19.3,2.79
5,Hokkaido,111.3,4.81,19.3,2.83
5,Aomori,111.8,4.87,19.9,2.78
5,Iwate,111.0,5.08,19.6,3.02
5,Miyagi,111.3,4.9,19.7,3.04
5,Akita,112.3,5.08,19.9,3.16


From the original data set table above, we noticed a few **problems**, which we will address in the **data cleaning process**: 
-  We notice that some of the column names are not very tidy because some contain spaces between words. 
-  In the `category` column, representing the region of each observation, we have identified a region labeled **"national"** This category is the average of the values for all regions in the `category` column.
-  The variable name is not informative, for example, `category` actually means the region, and `year` actually means the age.

### 2. Data wrangling and cleaning 
1. To clean the name, we will clean the column names using `clean_names` to make them consist of only lowercase letters and underscores.
2. Because **"national"** is the average of the values(e.g., average heights, average weights) for all regions in the `category` column. We will remove all observations with "national" in the `category` column. 
3. To make the variable names informative, we will rename the columns (variables) `year` to `age` and `category` to `region` to align with our analysis and ensure consistency.

The above steps have made sure the `clean_man_data` is currently in a tidy format, as represented by **Table 2**.

In [5]:
# clean the dataset accordingly
clean_man_data <- man_data |>
        clean_names() |>
        filter(category != "national") |>
        rename("age" = "year",
               "region" = "category")

#### Table 2: *Cleaned Man Datset*
Here we take a look at the cleaned data set `clean_man_data`, because there are too many observation, we use `head(clean_man_data)` to get a sens of what the data set looks like
The first 6 rows of the original data set `clean_man_data` is shown below.

In [6]:
head(clean_man_data)

age,region,height_average,height_standard_deviation,body_weight_average,body_weight_standard_deviation
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
5,Hokkaido,111.3,4.81,19.3,2.83
5,Aomori,111.8,4.87,19.9,2.78
5,Iwate,111.0,5.08,19.6,3.02
5,Miyagi,111.3,4.9,19.7,3.04
5,Akita,112.3,5.08,19.9,3.16
5,Yamagata,111.5,4.46,19.5,3.05


### 3. Initial Split and Training Data Summaries 
1. Because we want to predict the age of the given boys, we first create a initial split of the data `man_split` and split it to training data `man_train` and testing data `man_test`. 

In [7]:
set.seed(9999) # set.seed to make sure it is reproducible 
# split the data into training and testing.
man_split <- initial_split(clean_man_data, prop = 0.75, starta = age)
man_train <- training(man_split)
man_test <- testing(man_split)

#### Table 3: *Training Datset*
Here we take a look at the training data `man_train` (**Table 3**). 
The first 6 rows of the training data set is shown below using `head(man_train)`.

In [27]:
head(man_train)

age,region,height_average,height_standard_deviation,body_weight_average,body_weight_standard_deviation
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
7,Aomori,123.3,5.08,25.8,4.97
10,Osaka,138.8,6.61,34.8,8.12
9,Miyazaki,133.6,5.91,32.0,7.17
11,Niigata,146.1,7.31,40.0,9.41
7,Kagawa,122.3,5.38,24.3,4.36
6,Osaka,116.6,4.72,21.3,3.49


2. Then we look at if there is any rows with missing data and to remove those rows if applicable.

In [30]:
missing_rows <- man_train |>
                    filter(complete.cases(man_train) == FALSE) |>
                    nrow() 
missing_rows

looks like there is no missing data so no need to remove any row. 

3. Because each row of our dataset **does not represent one individual**, instead, it represents summary statistics of individuals with certain age, region, which is more of a product of `group_by(age, region)` and `summarize(mean,sd)` from a more broader dataset that we don't have access to. As a result, we are worried that our data may suffer from **not having enough data points** and thus less accuracy, we want to take a look at the sample size and mean of the rest columns for the whole dataset, and also for each age group and region, as represented by **Table 4-6**.



In [22]:
# to get the whole sample size and summary statistics 
training_summary <- man_train |>
    summarize(count = n(), height_average = mean(height_average), 
              body_weight_average = mean(body_weight_average), 
              height_standard_deviation = mean(height_standard_deviation),
              body_weight_standard_deviation = mean(body_weight_standard_deviation))
# to get the sample size and summary statistics by age 
training_summary_age <- man_train |>
    group_by(age) |>
    summarize(count = n(), height_average = mean(height_average), 
              body_weight_average = mean(body_weight_average), 
              height_standard_deviation = mean(height_standard_deviation),
              body_weight_standard_deviation = mean(body_weight_standard_deviation))
# to get the sample size and summary statistics by region 
training_summary_region <- man_train |>
    group_by(region) |>
    summarize(count = n(), height_average = mean(height_average), 
              body_weight_average = mean(body_weight_average), 
              height_standard_deviation = mean(height_standard_deviation),
              body_weight_standard_deviation = mean(body_weight_standard_deviation))

#### Table 4: *Training Dataset Summary - Full*

In [17]:
training_summary

count,height_average,body_weight_average,height_standard_deviation,body_weight_standard_deviation
<int>,<dbl>,<dbl>,<dbl>,<dbl>
458,144.6352,40.6238,6.06976,7.783537


#### Table 5: *Training Dataset Summary - Age*

In [23]:
training_summary_age

age,count,height_average,body_weight_average,height_standard_deviation,body_weight_standard_deviation
<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
5,35,111.0029,19.32857,4.868286,2.810857
6,36,116.6222,21.68611,4.902778,3.491389
7,38,122.6158,24.55789,5.15,4.358158
8,34,128.3676,27.84706,5.411471,5.440588
9,38,133.6711,31.26842,5.706316,6.509737
10,34,139.2324,35.11176,6.329706,7.775
11,34,145.9059,39.72353,7.216765,8.880588
12,38,153.4368,45.21842,7.892105,10.088947
13,37,160.4946,50.11351,7.335676,10.282973
14,34,165.6294,54.89706,6.444412,10.343529


#### Table 6: *Training Dataset Summary - Region*

In [25]:
training_summary_region

region,count,height_average,body_weight_average,height_standard_deviation,body_weight_standard_deviation
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
Aichi,11,144.0727,39.64545,6.082727,7.400909
Akita,9,147.0889,43.0,6.033333,8.471111
Aomori,8,139.9375,38.2125,6.015,7.82375
Chiba,10,145.75,41.07,6.068,7.835
Ehime,9,143.9333,41.03333,5.783333,7.628889
Fukui,11,146.8182,41.53636,6.1,7.391818
Fukuoka,10,146.79,41.9,6.079,7.875
Fukushima,9,143.2111,40.65556,5.728889,8.1
Gifu,10,137.73,34.94,6.146,6.869
Gunma,7,152.1571,46.18571,6.231429,9.498571


Looking at the summary tables above of the training dataset, we better understand our dataset: 
- Like expected, our data set is **more of a summary data set** that have 12 age groups (from 5-17) and for each age we have around 35 data points that represent around 35 regions' height and weight summary statistics (i.e. mean and sd)
- we can see that the number of individuals for each age group is comparative/balanced (around 35) and the number of individual for each region is also comparative/balanced (around 10) with a large sample size of 458 in total. 
- we can convinced now the data is balanced and have enough sample size using `age` as the response variable. We now can have some ideas of what to better phrase the research question according to our dataset: 
**given a data point with its region summary statistics (region/mean/sd), what is the predicted age of this data point?** we will discuss which summary statistics to choose as the predictors in the next section.

### 4. The Data Type of `age` 
From the dataset structure we've explained above, we noticed a few things 
1. `age` is not nessarily a continuous variable statistically speaking because all age groups are integers in our data. However, it is clearly not categorical data type with just classes, because an age of a higher number is for sure larger than an age of a lower number (there is **order**)
2. There is around 12 age groups that run from 5 to 17. 

Based on the data type of `age`, we can either treating `age` as a continuous variable and do a regression-related analysis or treating `age` as a categorical variable and do a classification analysis. However, although we have balanced samples for each age group (about 35), there are too many categories (12 categories) to predict. **As a result, we decided to treat `age` as a continuous variable and do a regression-typed analysis ($k$-nn regression or linear regression).**