# Height and Weight of Boys Ages 5-17 Across Japan

# Introduction

Every year, the Japanese government requires schools in every region of the country to hold measurement sessions that record the height and weight of each child aged 5-17. **Using this data, we will predict a boy's age from his height and weight. We will split the dataset into a training set and a testing set, fit the model on the training set, and measure its accuracy on the testing set using the root mean squared prediction error (RMSPE).**
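To make the accuracy measure concrete, here is a minimal sketch of how RMSPE can be computed in base R. The helper name `rmspe` and the ages below are made up for illustration; they are not taken from the data set.

```r
# Root mean squared prediction error: the square root of the mean squared
# difference between observed and predicted values.
rmspe <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Made-up ages in years, for illustration only
observed_age  <- c(6, 9, 12, 15)
predicted_age <- c(7, 9, 11, 17)

# Squared errors are 1, 0, 1, 4, so the result is sqrt(1.5), about 1.22 years
rmspe(observed_age, predicted_age)
```

A lower RMSPE means the predicted ages are, on average, closer to the true ages, and the value is expressed in the same unit as the response (years).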

# Preliminary Exploratory Data Analysis

In [2]:
library(tidyverse)
library(janitor)
library(repr)
library(tidymodels)
library(cowplot)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


ERROR: Error in library(janitor): there is no package called ‘janitor’


### 1. Demonstration that the data set `man.csv` can be read into R

In [None]:
# Read the data from a local CSV file (this way of reading data is allowed according to Anthony's announcement).
man_data <- read_csv("man.csv")
head(man_data)

### 2. Data wrangling and cleaning 

1. We notice that some of the column names contain spaces between words. We will clean the column names using `clean_names` to make them consist of only lowercase letters and underscores.
2. In the `category` column, which records the region of each observation, one category is labeled "national". It holds the average over all regions rather than a region of its own, so we will remove all observations with "national" in the `category` column.
3. We will remove the unused columns (variables): `category`, `height_standard_deviation`, and `body_weight_standard_deviation`.
4. We will rename the column (variable) `year` to `age` to align with our analysis and ensure consistency.


After these steps, `clean_man_data` is in a tidy format.

In [None]:
clean_man_data <- man_data |>
    clean_names() |>
    filter(category != "national") |>
    select(-category, -height_standard_deviation, -body_weight_standard_deviation) |>
    rename(age = year)
# take a look to make sure it looks good
head(clean_man_data)

### 3. Initial split of our data set `clean_man_data` (into training and testing data)

In [None]:
set.seed(9999) # set.seed to make sure it is reproducible 

man_split <- initial_split(clean_man_data, prop = 0.75, strata = age)
man_train <- training(man_split)
man_test <- testing(man_split)

# take a look to make sure it looks good 
head(man_train)
head(man_test)

### 4. Preliminary summary tables of the training data 

1. We want to determine the number of observations (rows) for each age in the training set by creating the dataframe `number_observation_by_age`.
2. We aim to generate a summary of the predictors (weight and height). We create a dataframe that includes the means and standard deviations for the predictor variables, named `summary_predictors`.
3. We want to check for any observations that contain missing values using `complete.cases`. We calculate the number of rows containing missing values, denoted as `missing_rows`, and have found that there are **no missing values** in our training set.

In [None]:
number_observation_by_age <- man_train |>
    group_by(age) |>
    summarize(count = n())

summary_predictors <- man_train |>
    pivot_longer(cols = height_average:body_weight_average,
                 names_to = "predictors",
                 values_to = "value") |>
    group_by(predictors) |>
    summarize(mean = mean(value, na.rm = TRUE),
              SD = sd(value, na.rm = TRUE))

# count the rows that contain at least one missing value
missing_rows <- man_train |>
    filter(!complete.cases(man_train)) |>
    nrow()

In [None]:
number_observation_by_age
summary_predictors
missing_rows # here the missing_rows is 0

### 5. Preliminary visualization of the training data 


1. **Predictor Distributions:** We would like to visualize the distribution of the two predictors to gather more information, instead of relying solely on the means and standard deviations from tables above. We create two histograms `height_distribution_plot` and `body_weight_distribution_plot` side by side, referred to as `predictor_distribution_plot`, to better understand each predictor.

2. **Age Distribution:** As with the predictors, we also examine the distribution of the response variable `age` to check that each age has a comparable number of observations. We create a histogram named `age_plot`.

3. **Relationship Analysis:** In preparation for KNN regression, we aim to explore the relationship between each predictor variable and the response variable. We generate two scatter plots, `height_scatter_plot` and `body_weight_scatter_plot`, displayed side by side as `predictor_scatter_plot`. These plots reveal that age tends to increase as either height or body weight increases, suggesting that height and weight could be used for age prediction.

4. **Collinearity Assessment:** We create a scatter plot of the two predictors, `body_weight_average` against `height_average`, named `between_predictors_plot`. The plot shows a high degree of collinearity. Collinear predictors make the coefficients of a linear regression unstable and hard to interpret, whereas KNN regression estimates no coefficients, which is one reason we prefer KNN regression here.
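The collinearity seen in a scatter plot can also be summarized with a single number, the Pearson correlation between the two predictors. The sketch below uses a small made-up data frame, `toy`, as a stand-in for `man_train`; the values are illustrative only and are not results from the data set.

```r
# Made-up values for illustration only; in the notebook these columns would
# come from `man_train`.
toy <- data.frame(
  height_average      = c(110, 122, 133, 145, 160, 168, 170),
  body_weight_average = c(19, 24, 30, 38, 49, 59, 62)
)

# Pearson correlation between the two predictors; values near 1 indicate
# a strong positive linear relationship (high collinearity).
predictor_correlation <- cor(toy$height_average, toy$body_weight_average)
predictor_correlation
```

Applied to the real training data, the same call would be `cor(man_train$height_average, man_train$body_weight_average)`.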


In [1]:
options(repr.plot.width = 10, repr.plot.height = 8)
# histograms for predictor variables. 
height_distribution_plot <- man_train |>
        ggplot(aes(x = height_average)) + 
            geom_histogram() + 
            labs(x = "Height (cm)", y = "Number of Observations") +
            ggtitle("Height Distribution") + 
            theme(text = element_text(size = 12))


body_weight_distribution_plot <- man_train |>
        ggplot(aes(x = body_weight_average)) + 
            geom_histogram() + 
            labs(x = "Body Weight (kg)", y = "Number of Observations") +
            ggtitle("Body Weight Distribution") + 
            theme(text = element_text(size = 12))

predictor_distribution_plot <- plot_grid(height_distribution_plot, body_weight_distribution_plot, nrow = 1)
predictor_distribution_plot

# histogram for response variable 
age_plot <- man_train |>
        ggplot(aes(x = age)) + 
            geom_histogram(binwidth = 1) + # one bin per year of age
            labs(x = "Age (year)", y = "Number of Observations") +
            ggtitle("Age Distribution") +
            theme(text = element_text(size = 12))
age_plot # we see that the number of observations for each age is comparable

# scatter plots for predictor variables
height_scatter_plot <- man_train |>
                ggplot(aes(x = height_average, y = age)) + 
                    geom_point() + 
                    labs(x = "Height (cm)", y = "Age (year)") + 
                    theme(text = element_text(size = 12)) +
                    ggtitle("Relationship Between Height and Age") 
body_weight_scatter_plot <- man_train |>
                ggplot(aes(x = body_weight_average, y = age)) + 
                    geom_point() + 
                    labs(x = "Body Weight (kg)", y = "Age (year)") + 
                    theme(text = element_text(size = 12)) +
                    ggtitle("Relationship Between Body Weight and Age") 
predictor_scatter_plot <- plot_grid(height_scatter_plot, body_weight_scatter_plot, nrow = 1)
predictor_scatter_plot # we see that age increases with both height and body weight.

# scatter plot between predictors 
between_predictors_plot <- man_train |>
                    ggplot(aes(x = height_average, y = body_weight_average)) +
                        geom_point() + 
                        labs(x = "Height (cm)", y = "Body Weight (kg)") + 
                        theme(text = element_text(size = 12)) +
                        ggtitle("Relationship Between Height and Weight")
between_predictors_plot # we see a strong positive relationship between height and weight, which implies high collinearity.

ERROR: Error in ggplot(man_train, aes(x = height_average)): could not find function "ggplot"


# Methods

To conduct our analysis, we will read and clean our data and use the height and weight columns as predictors of age. We will tune a KNN regression model with cross-validation on the training set, then evaluate the chosen model's RMSPE on the testing set. Finally, we will visualize our results, likely using a scatterplot with the model's predictions superimposed.
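The workflow described above could be sketched with `tidymodels` roughly as follows. This is an illustration under stated assumptions, not the final analysis: `toy_data` is simulated to stand in for `clean_man_data`, and the choice of 5 folds, the grid of neighbor values in `k_grid`, and the `kknn` engine (which requires the `kknn` package) are all assumptions made for this example.

```r
library(tidymodels) # the "kknn" engine also needs the kknn package installed

set.seed(9999)

# Simulated stand-in for clean_man_data; all values are made up.
toy_data <- tibble(age = rep(5:17, each = 20)) |>
  mutate(
    height_average = 80 + 5.5 * age + rnorm(n(), sd = 2),
    body_weight_average = 5 + 3 * age + rnorm(n(), sd = 2)
  )

toy_split <- initial_split(toy_data, prop = 0.75, strata = age)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardize the predictors so both contribute equally to distances.
knn_recipe <- recipe(age ~ height_average + body_weight_average,
                     data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification with the number of neighbors left to be tuned.
knn_spec <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Cross-validation on the training set only (5 folds is an assumption).
toy_vfold <- vfold_cv(toy_train, v = 5, strata = age)
k_grid <- tibble(neighbors = seq(from = 1, to = 21, by = 4))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = toy_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# Pick the number of neighbors with the lowest cross-validated RMSE.
best_k <- cv_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen number of neighbors.
final_spec <- nearest_neighbor(weight_func = "rectangular",
                               neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(final_spec) |>
  fit(data = toy_train)

# RMSE computed on held-out test data is the RMSPE.
test_rmspe <- final_fit |>
  predict(toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = age, estimate = .pred) |>
  filter(.metric == "rmse")
test_rmspe
```

The testing set is touched only once, at the very end, so the reported RMSPE is not influenced by the choice of the number of neighbors.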

# Expected Outcomes and Significance

We expect to be able to predict a Japanese boy's age, within the 5-17 range, from his height and weight.

These findings could allow:
- Pediatric healthcare providers to assess a (Japanese male) child's development and compare it to the average growth patterns seen in this data set.
- Nutrition and dietary plans to be designed for this demographic, for example by comparing against other countries where the same height and weight correspond to a different average age.
- Physical education programs to be tailored to boys' growth and development.

Future questions: Would this model perform similarly for Japanese girls? Could it accurately predict their ages? Might the data for girls and boys be similar at prepubescent ages?
