### 1. Introduction
Our project focuses on measuring the percentage of body fat. In this module, we develop a simple, robust, accurate and precise “rule-of-thumb” method to estimate the percentage of body fat. Our project is divided into 6 steps:  
1) Data Pre-processing  
2) Variable Selection  
3) Model Construction & Evaluation  
4) Model Diagnosis  
5) Conclusions  
6) Application on Shiny  
### 2. Data Pre-processing
#### 2.1 Data Structure
First, we get a general idea of what the data look like and whether there are relationships or patterns among the variables. 
To do this, we first fit a linear regression model with all the variables and summarize it. The result suggests that there may be many outliers, possibly caused by measurement errors, which we deal with later.  
Then we run a correlation test and find strong relationships among **WEIGHT**, **HEIGHT** and **ADIPOSITY**. **BODYFAT** and **ADIPOSITY** also seem to be highly correlated.  
After that, we check the assumptions of normality, equal variance and independence using a Q-Q plot and a residual plot.
#### 2.2 Data Cleaning
After inspecting the data, we note *Siri's equation*, which the professor gave us: 
$$100 \times B = \frac{495}{D} - 450$$
where $B$ is the proportion of body fat and $D$ is the body density. According to this equation, **BODYFAT** and **DENSITY** carry the same information. However, the values in the .csv file do not satisfy *Siri's equation* exactly, so we ignore the **DENSITY** variable in the following regressions (because we ultimately need **BODYFAT** as the response variable). 
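
The consistency check between **BODYFAT** and **DENSITY** can be sketched as follows. This is a minimal illustration, not the code used in the project: the function names and the example record are hypothetical, and the function returns body fat directly in percent (that is, $100 \times B$).

```python
def siri_body_fat(density):
    """Percent body fat implied by body density via Siri's equation."""
    if density <= 0:
        raise ValueError("density must be positive")
    return 495.0 / density - 450.0


def siri_mismatch(recorded_bodyfat, density):
    """Absolute gap between a recorded BODYFAT value and Siri's estimate."""
    return abs(recorded_bodyfat - siri_body_fat(density))


# Hypothetical record for illustration; not taken from the dataset.
example_density = 1.05
example_recorded = 21.0
print(round(siri_body_fat(example_density), 2))
print(round(siri_mismatch(example_recorded, example_density), 2))
```

A large mismatch for a given row would flag either **BODYFAT** or **DENSITY** as a possible recording error.
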
To clean the data, we first use scatter plots to look for outliers and abnormal samples. For example, a scatter plot shows that the lowest **HEIGHT** value is abnormal: it is less than 30 inches. We believe it is a recording error, so we try to recover the real value. Based on the *formula of BMI (Body Mass Index)*:$ ^{[1]}$
$$BMI = \frac{Weight (lb)}{[Height (in)]^2} \times 703$$
we estimate that the 'real' height of that person is about $69.43$ inches. But not all strange values can be recovered. For example, some observations have very extreme **BODYFAT** values, such as $0$, $1.9$ and $45.1$. According to sources online, body fat values this extreme are too dangerous for a normal person to survive. We recomputed them from the **DENSITY** data with *Siri's equation*, but the estimates are very close to the original values. So, unlike the **HEIGHT** case above, we do not think these are recording errors that can be corrected, and we have to delete them.  
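
Recovering a height from the BMI formula amounts to solving it for height. The sketch below shows the rearrangement; the weight and adiposity inputs are assumed values chosen for illustration (the report does not list them), and with these inputs the result is roughly 69.43 inches.

```python
import math


def height_from_bmi(weight_lb, bmi):
    """Solve BMI = 703 * weight / height^2 for height (in inches)."""
    if weight_lb <= 0 or bmi <= 0:
        raise ValueError("weight and BMI must be positive")
    return math.sqrt(703.0 * weight_lb / bmi)


# Assumed inputs for illustration only.
assumed_weight_lb = 205.0
assumed_adiposity = 29.9  # ADIPOSITY is treated here as the BMI value
recovered_height = height_from_bmi(assumed_weight_lb, assumed_adiposity)
print(round(recovered_height, 2))
```
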
Furthermore, we find that **observation 39** has extreme values for many variables, such as **ABDOMEN**, **ADIPOSITY** and **CHEST**. It is very likely an influential point. We detect influential points formally later, but to keep this point from affecting our later tests, we delete it in advance.  
After this so-called eyeballing method, we delete $4$ observations: indices **39**, **172**, **182** and **216**. Next, we use statistical methods to find outliers and influential points. We first fit a linear regression on all variables. From the diagnostic plots of this regression, we notice the existence of influential points and outliers. We then use *Cook's Distance* to visualise influential points. Although plenty of points exceed the reference line, observations **86** and **221** are far away from the rest. Since we do not have many observations, we delete only these $2$ influential points. As for outliers, we apply an *Outlier Test* to observation **224**, which could be an outlier. However, observation **224** passes the test, so we keep this point. This completes our data cleaning.  
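
To make the idea behind *Cook's Distance* concrete, here is a minimal sketch for the simplified case of a single predictor, where the leverage has a closed form. The project's regression used all variables, so this is an illustration of the technique rather than the actual computation, and the data are hypothetical.

```python
def cooks_distance(x, y):
    """Cook's distance for each point of a one-predictor least-squares fit."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    leverages = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e * e) / (p * s2) * h / (1.0 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]


# Hypothetical data: the last point sits far from the others.
x = [80, 85, 90, 95, 100, 105, 140]
y = [12, 15, 18, 20, 24, 26, 20]
distances = cooks_distance(x, y)
most_influential = max(range(len(x)), key=lambda i: distances[i])
print(most_influential, [round(d, 3) for d in distances])
```

A point whose distance stands far above the rest, as the last one does here, is a candidate influential point.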

### 3. Variable Selection
In this part, we use several methods, namely *eyeballing*, *Mallows' Cp*, *adjusted R Square*, *AIC (forward, stepwise, backward)*, *BIC (forward, stepwise, backward)* and *VIF*, to try different subsets of variables.  
*Eyeballing* suggests that **ABDOMEN + WRIST** are the most important variables.  
*Mallows' Cp* selects **AGE + WEIGHT + ABDOMEN + THIGH + WRIST**.  
*Adjusted R Square* suggests that **AGE + WEIGHT + NECK + ABDOMEN + THIGH + FOREARM + WRIST** is a good choice.  
*AIC and BIC*:  
*AIC backward* retains 6 variables. *BIC backward* retains 3 variables: **WEIGHT + ABDOMEN + WRIST**. *AIC forward* retains 14 variables. *BIC forward* retains 14 variables. *AIC stepwise* retains 6 variables. *BIC stepwise* also retains 3 variables: **WEIGHT + ABDOMEN + WRIST**.  
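
One reason *BIC* tends to keep fewer variables than *AIC* is its heavier penalty per coefficient: $\ln(n)$ instead of $2$. The sketch below uses the usual forms for Gaussian linear models, up to an additive constant; the sample size and residual sums of squares are hypothetical numbers chosen only to show the two criteria disagreeing.

```python
import math


def aic(rss, n, k):
    """AIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k


def bic(rss, n, k):
    """BIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + k * math.log(n)


# Hypothetical values: a small model and a larger, slightly better-fitting one.
n = 250
small = {"rss": 4000.0, "k": 3}
large = {"rss": 3850.0, "k": 7}

print("AIC:", round(aic(small["rss"], n, small["k"]), 2),
      round(aic(large["rss"], n, large["k"]), 2))
print("BIC:", round(bic(small["rss"], n, small["k"]), 2),
      round(bic(large["rss"], n, large["k"]), 2))
```

With these numbers, AIC is lower for the larger model while BIC is lower for the smaller one.
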
Following the rule of thumb, it seems that we should choose the subset with the fewest variables. But we will evaluate the model for each choice of variables before making the final decision. Before that, we check for multicollinearity. First, we take the model with 3 variables, **WEIGHT + ABDOMEN + WRIST**, and draw the correlation plot. It shows that **WEIGHT** and **ABDOMEN** are strongly correlated. Then we calculate the *VIF*. The *VIF* values of **WEIGHT** and **ABDOMEN** are both large, so we keep only one of them in our model. To decide which one to keep, we try both and compare the *R Square* of the two models. We find that **BODYFAT ~ ABDOMEN + WRIST** performs much better than the other one.
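
The link between correlation and *VIF* is easiest to see with exactly two predictors, where both share the value $1 / (1 - r^2)$. The sketch below covers only that simplified case (the project computed *VIF* within the 3-variable model), and the measurements are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(a, b):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)


# Hypothetical weight (lb) and abdomen (cm) measurements that move together.
weight = [150, 160, 170, 180, 190, 200]
abdomen = [80, 84, 89, 92, 97, 100]
print(round(pearson_r(weight, abdomen), 4))
print(round(vif_two_predictors(weight, abdomen), 1))
```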

### 4. Model Construction & Evaluation
By now, we have selected several candidate models with different variables. We use linear regression to construct our final model since it is the simplest approach and costs the least. Next we evaluate our models with 2 metrics: *R square* and *MSE*.
#### 4.1 R Square  
In linear regression models, *R square* is the proportion of the variance in the dependent variable that is predictable from the independent variable(s)$ ^{[2]}$. It represents how well our models fit the observed data.  
According to the *R squares* we calculated, all of our models have an *R square* between 0.70 and 0.75. Even the one with only two variables has an *R square* of 0.7058. So dropping to the 2-variable model costs little *R square*.  

Model|R square|# of variables
-|-|-
All variables|0.7415|14
Mallows' Cp|0.7313|5
Adjusted $R^2$|0.7373|7
AIC Backward|0.7353|6
BIC Backward|0.7247|3
BIC Backward and VIF|0.7058|2
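
For reference, *R square* and its adjusted version can be computed from observed and fitted values as sketched below. The function names and the sample values are hypothetical, not taken from the project.

```python
def r_squared(observed, fitted):
    """Proportion of variance in `observed` explained by `fitted`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, n_predictors):
    """Penalise R square for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical observed and fitted body fat percentages.
observed = [10, 15, 20, 25, 30]
fitted = [12, 14, 21, 24, 29]
r2 = r_squared(observed, fitted)
print(round(r2, 4), round(adjusted_r_squared(r2, len(observed), 1), 4))
```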

#### 4.2 MSE  
*MSE* (*mean squared error*) is the average of the squared errors$ ^{[3]}$. As an estimate of accuracy, we use *MSE* to measure how close our fitted responses are to the true responses.  
To calculate the *MSE*s, we randomly divided our dataset into 2 subsets, carried out cross validation on them, and calculated the mean squared error between the true responses and the predicted responses.  
According to our calculation, the two-variable model has an *MSE* of 16.12, while the model with all the variables has the highest *MSE* of all the candidates (see the table below). So the two-variable (**ABDOMEN** and **WRIST**) model predicts more accurately than the full model while avoiding the problem of overfitting.

Model|MSE|# of variables
-|-|-
All variables|17.64|14
Mallows' Cp|14.46|5
Adjusted $R^2$|15.29|7
AIC Backward|14.37|6
BIC Backward|14.51|3
BIC Backward and VIF|16.12|2
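
The cross-validated *MSE* can be sketched as a two-fold procedure: fit on one half, predict the other, then swap. This is one reading of "divided into 2 subsets"; the exact procedure used in the project is not specified. To stay short, the sketch fits a single predictor, and the data and function names are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mse(y_true, y_pred):
    """Average squared difference between true and predicted responses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def two_fold_cv_mse(x, y, seed=0):
    """Average held-out MSE over two random halves of the data."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    fold_a, fold_b = indices[:half], indices[half:]
    fold_errors = []
    for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        predictions = [intercept + slope * x[i] for i in test]
        fold_errors.append(mse([y[i] for i in test], predictions))
    return sum(fold_errors) / len(fold_errors)


# Hypothetical abdomen (cm) and body fat (%) values, for illustration only.
abdomen = [78, 82, 85, 88, 91, 94, 97, 101, 105, 110]
body_fat = [9, 13, 14, 17, 20, 21, 25, 27, 29, 34]
print(round(two_fold_cv_mse(abdomen, body_fat), 2))
```

Because each error is measured on data the model did not see, a model with many variables can score worse here even though its in-sample *R square* is higher.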

#### 4.3 Conclusion  
To sum up, based on the two metrics above, we prefer the model with just two variables, **ABDOMEN** and **WRIST**.  
This model has a relatively high *R square* and a relatively low *MSE*. The fitted equation is:  
$$BodyFat(\%) = -9.7813 + 0.717 \times ABDOMEN(cm) - 2.06008 \times WRIST(cm)$$
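
Applying the fitted equation is a one-line calculation. The sketch below uses the coefficients from the equation above; the function name and the example measurements are hypothetical.

```python
def predict_body_fat(abdomen_cm, wrist_cm):
    """Estimate body fat (%) from abdomen and wrist circumference in cm."""
    return -9.7813 + 0.717 * abdomen_cm - 2.06008 * wrist_cm


# Hypothetical measurements for illustration.
example_abdomen_cm = 90.0
example_wrist_cm = 18.0
print(round(predict_body_fat(example_abdomen_cm, example_wrist_cm), 2))
```

For these example inputs the estimate is about 17.67% body fat.
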
### 5. Model Diagnosis
In our multiple regression model, three assumptions could be violated:  
1. Residuals may not be normally distributed;  
2. Residuals may not be independent;  
3. Residuals may not have the same variance.  

To confirm that our model satisfies these assumptions, we draw several plots. For normality, the Q-Q plot indicates that the residuals are approximately normally distributed, since most of the points lie on the diagonal. For equal variance, the residuals vs fitted plot shows that the residuals have roughly equal variance. For independence, we assume our data come from random sampling.  
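
The coordinates behind a Q-Q plot can be computed without a plotting library, as sketched below: sorted standardized residuals are paired with standard normal quantiles, and points near the diagonal indicate approximate normality. The plotting positions $(i + 0.5)/n$ are one common convention, assumed here, and the residuals are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Pair theoretical normal quantiles with sorted standardized residuals."""
    n = len(residuals)
    centre = mean(residuals)
    spread = stdev(residuals)
    sample = sorted((r - centre) / spread for r in residuals)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sample))


# Hypothetical residuals for illustration.
example_residuals = [-3.1, -1.2, 0.4, 1.5, 2.4]
for theoretical_q, sample_q in qq_points(example_residuals):
    print(round(theoretical_q, 3), round(sample_q, 3))
```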

### 6. Conclusions
#### 6.1 Model Inference
From the model summary, we find that both coefficients of our final model, a linear regression model with the two covariates **ABDOMEN** and **WRIST**, are significant. Their p-values are smaller than $2 \times 10^{-16}$ and equal to $1.48 \times 10^{-8}$ respectively, both far less than 0.05. In addition, holding the other covariate fixed, the expected change in **BODYFAT** is 0.717 for a one-unit increase in **ABDOMEN** and -2.06 for a one-unit increase in **WRIST**.  
The *R Square* of our final model is 0.7058, slightly smaller than the *R Square* of 0.7415 for the model with all variables. It means that our model explains 70.58% of the variance in body fat. We take the decrease in *R Square* as a reasonable cost for the large reduction in the number of covariates. 
#### 6.2 Strength and Weakness
##### 6.2.1 Strength
1. Our model satisfies the assumptions of linear regression;  
2. Our model has just 2 variables, **ABDOMEN** and **WRIST**, so it costs less to use;  
3. There is no serious multicollinearity between the covariates;  
4. $R^2= 0.7058$ is relatively high, which means our model explains $70.58\%$ of the variance in body fat.
##### 6.2.2 Weakness
1. **ABDOMEN** and **WRIST** may be a little difficult for people to measure;  
2. Because we trade off cost against performance, our model has neither the largest $R^2$ nor the smallest *MSE*. But we think it is worth giving up some accuracy to reduce the cost and keep the model as simple as possible.
#### 6.3 Rule of Thumb
Following the rule of thumb, we make our model as simple as possible. So finally, we choose the 2 variables **ABDOMEN** and **WRIST** to construct our model, at the cost of giving up a little accuracy, which does not affect our model much.  

### 7. Application on Shiny
We developed a Shiny app that is essentially a **BODYFAT** calculator. It uses the model selected above to predict the user's body fat from two given variables, **ABDOMEN** and **WRIST**. It also shows the category that the body fat value belongs to. Furthermore, the app can process data in bulk: the user uploads a csv file containing the two necessary variables, and the app automatically generates a column of body fat estimates. The design of this app is adapted from an online calculator$ ^{[4]}$. The link to the Shiny app is shown below.  
https://ericchenzhang.shinyapps.io/body_fat_calculator/
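
The batch feature can be illustrated outside Shiny with a short sketch. This is not the app's source code: it is a Python illustration of the same logic, and it assumes the uploaded file has columns named `ABDOMEN` and `WRIST` (in cm) and that the new column is called `BODYFAT`.

```python
import csv
import io


def predict_body_fat(abdomen_cm, wrist_cm):
    """Final model from Section 4.3."""
    return -9.7813 + 0.717 * abdomen_cm - 2.06008 * wrist_cm


def add_body_fat_column(csv_text):
    """Return the CSV text with an extra BODYFAT column of estimates."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        estimate = predict_body_fat(float(row["ABDOMEN"]), float(row["WRIST"]))
        row["BODYFAT"] = f"{estimate:.2f}"
        rows.append(row)
    fieldnames = list(reader.fieldnames or []) + ["BODYFAT"]
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return output.getvalue()


# Hypothetical upload with two people.
sample_upload = "ABDOMEN,WRIST\n90,18\n100,19\n"
print(add_body_fat_column(sample_upload))
```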

### 8. Contribution

|    __Name__ | __Contribution__                                             |
| ----------: | :----------------------------------------------------------- |
| Naiqing Cai | Conducted the slides part and finished part of the report    |
|  Yuhang Lan | Conducted the model evaluation part and finished part of the report |
|    Zihao Li | Conducted the data clean-up part and finished part of the report |
| Xinkai Chen | Conducted the Shiny App part and finished part of the report |

### 9. Reference
[1]. Body mass index from https://en.wikipedia.org/wiki/Body_mass_index  
[2]. R square from https://en.wikipedia.org/wiki/Coefficient_of_determination  
[3]. MSE from https://en.wikipedia.org/wiki/Mean_squared_error  
[4]. Calculator reference from https://www.calculator.net/body-fat-calculator.html  