<h1 style="text-align:center">Body Fat Calculator </h1>
<h4 style="text-align:right">Jiawen Chen, Chunyuan Jin, Han Liao </h4>



 
# 1 Introduction


## 1.1 Background

<p> Body fat percentage is a measure of fitness level, but it is not easy to measure in clinical practice. A popular way to estimate it is Siri's equation, which computes body fat from body density. However, body density is itself hard to obtain.
    
<p> This project aims to develop a simple, robust, accurate, and precise “rule-of-thumb” method for estimating body fat percentage from clinically available measurements.

## 1.2 Description of Dataset


### 1.2.1 Formula

<p> Siri's equation for estimating **body fat B (%)** is

 $$ Body\ Fat\ \%\ (i.e. 100 \times B) = \frac{495}{D} - 450,\ D = Body\ Density\ (gm/cm^{3})$$ 
 

<p> The formula for estimating **ADIPOSITY (BMI)** is
 
$$ADIPOSITY\ (BMI)= \frac{Weight\ (lbs) \times 703}{[Height\ (in)]^2}$$
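The two formulas above can be written as small R helpers. This is a minimal sketch: the function names `siri_body_fat` and `bmi_imperial` are hypothetical, and the inputs in the example calls are made-up values chosen only to illustrate the formulas.

```r
# Siri's equation: body fat percentage from body density (gm/cm^3).
siri_body_fat <- function(density) {
  495 / density - 450
}

# BMI (ADIPOSITY) from weight in pounds and height in inches.
bmi_imperial <- function(weight_lbs, height_in) {
  703 * weight_lbs / height_in^2
}

# Illustrative inputs only: a lower density gives a higher body fat estimate.
print(siri_body_fat(c(1.10, 1.05, 1.00)))
print(bmi_imperial(150, 70))
```

Both helpers are vectorized, so they can be applied directly to whole columns such as `DENSITY`, `WEIGHT`, and `HEIGHT`.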

### 1.2.2 Glimpse at the dataset

In [25]:
BodyFatData = read.csv("BodyFat.csv"); head(BodyFatData, 2)

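| IDNO | BODYFAT | DENSITY | AGE | WEIGHT | HEIGHT | ADIPOSITY | NECK | CHEST | ABDOMEN | HIP | THIGH | KNEE | ANKLE | BICEPS | FOREARM | WRIST |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
| `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 12.6 | 1.0708 | 23 | 154.25 | 67.75 | 23.7 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
| 2 | 6.9 | 1.0853 | 22 | 173.25 | 72.25 | 23.4 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |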


The dataset has **252** observations and **17** variables. Units are given in parentheses.
 
* Response variable: Body fat in percentage
* Predictors: Age (years), Weight (lbs), Height (inches), Adiposity (BMI), Neck circumference (cm), Chest circumference (cm), Abdomen circumference (cm), Hip circumference (cm), Thigh circumference (cm), Knee circumference (cm), Ankle circumference (cm), Biceps (extended) circumference (cm), Forearm circumference (cm), Wrist circumference (cm)

# 2 Data Cleaning

Our data cleaning process follows two rules:

1. Impute abnormal values that can be recovered from a redundant variable.
2. Remove records that are abnormal and cannot be fixed.

## 2.1 Use Summary Table to Check Extreme Values

We use a summary table to get an overview of the data. Surprisingly, some variables contain extreme values.

In [26]:
summary(BodyFatData[, c(2,3,5:7)]) 

    BODYFAT         DENSITY          WEIGHT          HEIGHT     
 Min.   : 0.00   Min.   :0.995   Min.   :118.5   Min.   :29.50  
 1st Qu.:12.80   1st Qu.:1.041   1st Qu.:159.0   1st Qu.:68.25  
 Median :19.00   Median :1.055   Median :176.5   Median :70.00  
 Mean   :18.94   Mean   :1.056   Mean   :178.9   Mean   :70.15  
 3rd Qu.:24.60   3rd Qu.:1.070   3rd Qu.:197.0   3rd Qu.:72.25  
 Max.   :45.10   Max.   :1.109   Max.   :363.1   Max.   :77.75  
   ADIPOSITY    
 Min.   :18.10  
 1st Qu.:23.10  
 Median :25.05  
 Mean   :25.44  
 3rd Qu.:27.32  
 Max.   :48.90  

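The summary table shows extreme values in BODYFAT (minimum 0), WEIGHT (maximum 363.1 lbs), and HEIGHT (minimum 29.5 inches). Some of the records behind these extremes are listed below.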

In [27]:
BodyFatData[c(172, 182, 216, 29, 42),]

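|  | IDNO | BODYFAT | DENSITY | AGE | WEIGHT | HEIGHT | ADIPOSITY | NECK | CHEST | ABDOMEN | HIP | THIGH | KNEE | ANKLE | BICEPS | FOREARM | WRIST |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
|  | `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 172 | 172 | 1.9 | 1.0983 | 35 | 125.75 | 65.5 | 20.6 | 34.0 | 90.8 | 75.0 | 89.2 | 50.0 | 34.8 | 22.0 | 24.8 | 25.9 | 16.9 |
| 182 | 182 | 0.0 | 1.1089 | 40 | 118.5 | 68.0 | 18.1 | 33.8 | 79.3 | 69.4 | 85.0 | 47.2 | 33.5 | 20.2 | 27.7 | 24.6 | 16.5 |
| 216 | 216 | 45.1 | 0.995 | 51 | 219.0 | 64.0 | 37.6 | 41.2 | 119.8 | 122.1 | 112.8 | 62.5 | 36.9 | 23.6 | 34.7 | 29.1 | 18.4 |
| 29 | 29 | 4.7 | 1.091 | 27 | 133.25 | 64.75 | 22.4 | 36.4 | 93.5 | 73.9 | 88.5 | 50.1 | 34.5 | 21.3 | 30.5 | 27.9 | 17.2 |
| 42 | 42 | 31.7 | 1.025 | 44 | 205.0 | 29.5 | 29.9 | 36.6 | 106.0 | 104.3 | 115.5 | 70.6 | 42.5 | 23.7 | 33.6 | 28.7 | 17.4 |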


**BODYFAT** is the response variable, and its reasonable values range from about 2% to 39%.

* Individual 172 has 1.9% body fat, close to the lowest possible value, which can be considered essential fat.
* Individual 216 has 45.1% body fat, which indicates severe obesity but is possible.
* Individual 182 has 0% body fat, which is impossible. Applying Siri's equation to his density gives a negative body fat percentage, so we remove this record from the analysis.
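As a worked check of the claim about individual 182, plugging his recorded density (1.1089) into Siri's equation gives, rounded to two decimals:

$$ \frac{495}{1.1089} - 450 \approx 446.39 - 450 = -3.61 $$

A negative percentage is not physically meaningful, so neither the recorded body fat nor the recorded density can be trusted for this record.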

There is also an extreme value in **WEIGHT** (363.1 lbs), which belongs to individual 39. This man also has the largest values of ADIPOSITY, NECK, CHEST, ABDOMEN, HIP, THIGH, KNEE, BICEPS, and WRIST, which indicates that the record is internally consistent and describes a real person rather than a recording error.

As for **HEIGHT**, individual 42's height is only 29.5 inches, which is clearly abnormal. His weight (205 lbs) and ADIPOSITY (29.9) are not consistent with this height under the BMI formula, so we assume the height was recorded incorrectly and recompute it from WEIGHT and ADIPOSITY using the BMI formula.
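The height correction can be sketched by solving the BMI formula for height and plugging in individual 42's recorded weight (205 lbs) and ADIPOSITY (29.9). The result is rounded to two decimals:

$$ Height\ (in) = \sqrt{\frac{Weight\ (lbs) \times 703}{ADIPOSITY}} = \sqrt{\frac{205 \times 703}{29.9}} \approx 69.43 $$

A height of about 69.4 inches is close to the sample median of 70 inches, which supports treating 29.5 as a recording error.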


## 2.2 Check Siri's Equation

Body fat percentage can be estimated from density with Siri's equation. We fit a linear model between the body fat percentage estimated by Siri's equation and the body fat percentage recorded in the dataset. The residual plot and the QQ plot of this model are shown below. Records 48, 76, and 96 stand out as possible outliers.

![](plots/siri.PNG)
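The check described above can be sketched in R as follows. This is an illustration under stated assumptions, not the exact code behind the plots: the helper `flag_siri_outliers` is a hypothetical name, and the small data frame contains only the records printed in this report rather than all 252 observations, so the ranking it produces may differ from the full-data result.

```r
# Records printed in this report (IDNO, recorded BODYFAT, DENSITY).
# Assumption: this subset stands in for the full BodyFat.csv data.
shown <- data.frame(
  IDNO    = c(1, 2, 29, 42, 48, 76, 96, 163, 172, 182, 216, 220, 234),
  BODYFAT = c(12.6, 6.9, 4.7, 31.7, 6.4, 18.3, 17.3, 13.3, 1.9, 0.0,
              45.1, 15.1, 25.9),
  DENSITY = c(1.0708, 1.0853, 1.091, 1.025, 1.0665, 1.0666, 1.0991, 1.069,
              1.0983, 1.1089, 0.995, 1.0646, 1.0384)
)

# Hypothetical helper: regress recorded body fat on the Siri estimate and
# return the n records with the largest absolute standardized residuals.
flag_siri_outliers <- function(df, n = 3) {
  df$SIRI <- 495 / df$DENSITY - 450
  fit <- lm(BODYFAT ~ SIRI, data = df)
  df$STD_RESID <- rstandard(fit)
  df[order(-abs(df$STD_RESID)), ][seq_len(n), ]
}

flagged <- flag_siri_outliers(shown)
print(flagged[, c("IDNO", "BODYFAT", "SIRI", "STD_RESID")])
```

Ranking by standardized residuals rather than raw differences accounts for leverage, which matters here because records with very high or very low density pull the fitted line toward themselves.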

In [28]:
BodyFatData[c(48, 76, 96),]

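|  | IDNO | BODYFAT | DENSITY | AGE | WEIGHT | HEIGHT | ADIPOSITY | NECK | CHEST | ABDOMEN | HIP | THIGH | KNEE | ANKLE | BICEPS | FOREARM | WRIST |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
|  | `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 48 | 48 | 6.4 | 1.0665 | 39 | 148.5 | 71.25 | 20.6 | 34.6 | 89.8 | 79.5 | 92.7 | 52.7 | 37.5 | 21.9 | 28.8 | 26.8 | 17.9 |
| 76 | 76 | 18.3 | 1.0666 | 61 | 148.25 | 67.5 | 22.9 | 36.0 | 91.6 | 81.8 | 94.8 | 54.5 | 37.0 | 21.4 | 29.3 | 27.0 | 18.3 |
| 96 | 96 | 17.3 | 1.0991 | 53 | 224.5 | 77.75 | 26.1 | 41.1 | 113.2 | 99.2 | 107.5 | 61.7 | 42.3 | 23.2 | 32.9 | 30.8 | 20.4 |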


Individual 96's other variables all have normal values, which indicates that his density might have been recorded incorrectly.

Individuals 48 and 76 have nearly identical densities (1.0665 and 1.0666) and similar values in the other variables, so their body fat percentages should also be similar, yet the recorded values are 6.4% and 18.3%. We therefore replace their body fat percentages with the values given by Siri's equation.
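As a worked illustration of these three cases, Siri's equation applied to the recorded densities gives the following values, rounded to one decimal. For individuals 48 and 76:

$$ \frac{495}{1.0665} - 450 \approx 14.1, \qquad \frac{495}{1.0666} - 450 \approx 14.1 $$

For individual 96:

$$ \frac{495}{1.0991} - 450 \approx 0.4 $$

The estimate for individual 96 is far from his recorded 17.3% and implausibly low for a man weighing 224.5 lbs, which is consistent with the density, not the body fat, being the faulty value.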

## 2.3 Check the BMI Formula

ADIPOSITY can be computed from WEIGHT and HEIGHT. We fit a linear model between the BMI computed from the formula and the ADIPOSITY recorded in the dataset. The residual plot and the QQ plot of this model are shown below. Records 163, 220, and 234 stand out as possible outliers.

![](plots/BMI.PNG)

In [29]:
BodyFatData[c(163, 220, 234), ]

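|  | IDNO | BODYFAT | DENSITY | AGE | WEIGHT | HEIGHT | ADIPOSITY | NECK | CHEST | ABDOMEN | HIP | THIGH | KNEE | ANKLE | BICEPS | FOREARM | WRIST |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
|  | `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 163 | 163 | 13.3 | 1.069 | 33 | 184.25 | 68.75 | 24.4 | 40.7 | 98.9 | 92.1 | 103.5 | 64.0 | 37.3 | 23.5 | 33.5 | 30.6 | 19.7 |
| 220 | 220 | 15.1 | 1.0646 | 53 | 154.5 | 69.25 | 22.7 | 37.6 | 93.9 | 88.7 | 94.5 | 53.7 | 36.2 | 22.0 | 28.5 | 25.7 | 17.1 |
| 234 | 234 | 25.9 | 1.0384 | 58 | 161.75 | 67.25 | 25.2 | 35.1 | 94.9 | 94.9 | 100.2 | 56.8 | 35.9 | 21.0 | 27.8 | 26.1 | 17.6 |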


The weights of individuals 163, 220, and 234 differ considerably (184.25, 154.5, and 161.75 lbs), which means their BMI values should not be similar to each other. We therefore replace their ADIPOSITY with the BMI calculated from WEIGHT and HEIGHT.

## 2.4 Data Cleaning Summary
 
In conclusion:

* Record **182** is removed because it has 0% body fat and there is no way to fix it.
* The HEIGHT of record **42** is recomputed from its weight and adiposity.
* The body fat of records **48** and **76** is recomputed from their density.
* The adiposity of records **163**, **220**, and **234** is recomputed from their weight and height.
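The cleaning steps summarized above could be applied in one pass along these lines. This is a sketch, not the code used for the analysis: `clean_body_fat` is a hypothetical helper, and the demo data frame holds only the affected records printed in this report plus record 1 as an untouched reference. Record 96 is left unchanged because the summary lists no fix for it.

```r
# Hypothetical helper that applies the cleaning rules from the summary.
clean_body_fat <- function(df) {
  # Record 42: recover HEIGHT by inverting the BMI formula.
  i <- df$IDNO == 42
  df$HEIGHT[i] <- sqrt(703 * df$WEIGHT[i] / df$ADIPOSITY[i])

  # Records 48 and 76: recompute BODYFAT from DENSITY with Siri's equation.
  i <- df$IDNO %in% c(48, 76)
  df$BODYFAT[i] <- 495 / df$DENSITY[i] - 450

  # Records 163, 220, 234: recompute ADIPOSITY from WEIGHT and HEIGHT.
  i <- df$IDNO %in% c(163, 220, 234)
  df$ADIPOSITY[i] <- 703 * df$WEIGHT[i] / df$HEIGHT[i]^2

  # Record 182: drop, since 0% body fat cannot be repaired.
  df[df$IDNO != 182, ]
}

# Demo on the affected records shown in this report (plus record 1).
rows <- data.frame(
  IDNO      = c(1, 42, 48, 76, 163, 182, 220, 234),
  BODYFAT   = c(12.6, 31.7, 6.4, 18.3, 13.3, 0.0, 15.1, 25.9),
  DENSITY   = c(1.0708, 1.025, 1.0665, 1.0666, 1.069, 1.1089, 1.0646, 1.0384),
  WEIGHT    = c(154.25, 205.0, 148.5, 148.25, 184.25, 118.5, 154.5, 161.75),
  HEIGHT    = c(67.75, 29.5, 71.25, 67.5, 68.75, 68.0, 69.25, 67.25),
  ADIPOSITY = c(23.7, 29.9, 20.6, 22.9, 24.4, 18.1, 22.7, 25.2)
)

cleaned <- clean_body_fat(rows)
print(round(cleaned, 2))
```

Selecting records by `IDNO` rather than by row position keeps the function correct even if rows have already been reordered or removed.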

# 3 Variable Selection and Statistical Modeling

## 3.1 Relationship between BodyFat and Predictors

![](image/all_regression.jpeg)

As the figure shows, every predictor has a roughly linear relationship with body fat, and some predictors may be collinear. We therefore try Lasso regression first, and we also use Mallows' Cp and BIC with forward and backward stepwise selection to choose variables.
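BIC-based forward and backward selection can be sketched with base R's `step()`, where setting `k = log(n)` switches the penalty from AIC to BIC. The data below are simulated purely for illustration: the coefficients and distributions are made up and are not estimates from the body fat dataset. Lasso and Mallows' Cp are not covered by this sketch.

```r
set.seed(1)
n <- 200

# Simulated stand-in data; all numbers here are assumptions for illustration.
sim <- data.frame(
  ABDOMEN = rnorm(n, mean = 92, sd = 10),
  WRIST   = rnorm(n, mean = 18, sd = 1),
  HEIGHT  = rnorm(n, mean = 70, sd = 2.5),
  AGE     = round(runif(n, min = 22, max = 80))
)
sim$BODYFAT <- -40 + 0.65 * sim$ABDOMEN + rnorm(n, sd = 4)

full <- lm(BODYFAT ~ ABDOMEN + WRIST + HEIGHT + AGE, data = sim)
null <- lm(BODYFAT ~ 1, data = sim)

# Backward elimination from the full model, using the BIC penalty.
backward <- step(full, direction = "backward", k = log(n), trace = 0)

# Forward selection from the intercept-only model, using the BIC penalty.
forward <- step(
  null,
  scope = list(lower = ~ 1, upper = ~ ABDOMEN + WRIST + HEIGHT + AGE),
  direction = "forward", k = log(n), trace = 0
)

print(formula(backward))
print(formula(forward))
```

Running both directions is a cheap consistency check: when forward and backward selection agree on the retained variables, the choice is less likely to be an artifact of the search order.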

## 3.2 Variable and Model Selection

| Method | ABDOMEN | WRIST | HEIGHT | WEIGHT | AGE | R-squared |
|------|------|------|------|------|------|------|
| Lasso-1 | 0.50 | 0 | 0 | 0 | 0 | 0.641 |

From the table, all the selected variables are reasonable:

* ABDOMEN: a direct reflection of body fat
* WRIST and HEIGHT: indicators of body frame
* WEIGHT: reflects body fat and muscle proportion


### 3.2.1 **Stepwise Backward and Forward LR**  



### 3.2.2 **Lasso Regression**



### 3.2.3 **XGBoost**




## 3.3 Model Interpretation



## 3.4 Model Diagnostics


### 3.4.1 **Using Plots**

After fitting the model, we check the model assumptions with the standardized residual plot and the QQ plot.

![](image/Qqplot.jpeg)
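Diagnostic plots of this kind can be produced with base R. The sketch below assumes a one-predictor model of BODYFAT on ABDOMEN and, for self-containment, fits it only to the records printed in this report. Those rows include the flagged records and are not the cleaned dataset, so the fit is illustrative and its plots will not match the figure above.

```r
# Records printed in this report (recorded BODYFAT, ABDOMEN in cm).
# Assumption: this small subset stands in for the cleaned data.
shown <- data.frame(
  BODYFAT = c(12.6, 6.9, 4.7, 31.7, 6.4, 18.3, 17.3, 13.3, 1.9, 0.0,
              45.1, 15.1, 25.9),
  ABDOMEN = c(85.2, 83.0, 73.9, 104.3, 79.5, 81.8, 99.2, 92.1, 75.0, 69.4,
              122.1, 88.7, 94.9)
)

fit <- lm(BODYFAT ~ ABDOMEN, data = shown)
std_resid <- rstandard(fit)

op <- par(mfrow = c(1, 2))

# Standardized residuals vs fitted values: look for trends or funnel shapes.
plot(fitted(fit), std_resid,
     xlab = "Fitted values", ylab = "Standardized residuals",
     main = "Standardized residual plot")
abline(h = 0, lty = 2)

# Normal QQ plot: points near the line support the normality assumption.
qqnorm(std_resid, main = "Normal QQ plot")
qqline(std_resid)

par(op)
```

The residual plot addresses linearity and constant variance, while the QQ plot addresses normality of the errors, so the two together cover the main linear model assumptions.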

### 3.4.2 **Using Outside Source**



# 4 Conclusion


 
## 4.1 Rule of Thumb 

Multiply your abdomen circumference (cm) by the slope $\beta_1$ and add the constant $\beta_0$:

$$ Body\ Fat\ \% = \beta_0\ (Constant) + \beta_1 \times ABDOMEN\ (cm) $$



## 4.2 Strengths & Weaknesses

Our model has the following strengths:

1. **Simplicity**

2. **Robustness**

3. **Interpretability**

As for the weakness:

1. **Precision**





# 5 Contribution

Jiawen Chen: 

Chunyuan Jin:

Han Liao: