# Data Cleaning

To get an overview of the data, we used boxplot() and summary(). Four points stand out: record 216's high BODYFAT, record 182's zero BODYFAT, record 39's large WEIGHT, and record 42's short HEIGHT.

In [5]:
BodyFat = read.csv("BodyFat.csv")  # Read data
summary(BodyFat[,c(2,5,6)])[c(1,6),]  # Or boxplot 

    BODYFAT          WEIGHT          HEIGHT     
 Min.   : 0.00   Min.   :118.5   Min.   :29.50  
 Max.   :45.10   Max.   :363.1   Max.   :77.75  

![](plot\\BOXPLOT.jpeg)

For data cleaning, we apply the following four criteria:

1. DO NOT impute any record of the response variable, BODYFAT.
2. Exclude records that are verified to be wrong and cannot be imputed correctly.
3. Exclude extreme and highly influential points, in order to fit a robust and commonly applicable model.
4. Exclude observations parsimoniously: keep a record unless it violates the 2nd or 3rd criterion.


## Check BODYFAT using DENSITY

Since BODYFAT is the response variable, and therefore the most crucial variable in the whole analysis, we do not impute it (criterion 1). As the following regression plot shows, BODYFAT is approximately linear in 1/DENSITY. We detected the outliers 96, 48, 76, 182, and 216 one at a time, then trained the model without them to predict their BODYFAT from DENSITY. We analyze them below in order of decreasing absolute residual.

![](plot\\BODYFAT_regression.jpeg)

In [14]:
outlier_bodyfat <- c(96,48,76,182,216);
train <- BodyFat[-outlier_bodyfat,];test <- BodyFat[outlier_bodyfat,]
pre <- predict(lm(BODYFAT~I(1/DENSITY), data=train),test)
round(rbind(Record=test$BODYFAT, Prediction=pre, Bias=test$BODYFAT-pre),2)

| | 96 | 48 | 76 | 182 | 216 |
|---|---|---|---|---|---|
| Record | 17.3 | 6.4 | 18.3 | 0.0 | 45.1 |
| Prediction | 1.6 | 14.31 | 14.27 | -2.08 | 45.1 |
| Bias | 15.7 | -7.91 | 4.03 | 2.08 | 0.0 |
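The hold-out procedure used here (fit without the suspect records, then predict them) can be sketched as a small helper. This is only a sketch under stated assumptions: `holdout_bias` is a hypothetical name, and the `toy` data frame is synthetic (it is not BodyFat.csv), built with illustrative coefficients so that the result is easy to verify by eye.

```r
# Hypothetical helper: fit `formula` without rows `idx`, then predict those rows.
holdout_bias <- function(data, idx, formula, response) {
  fit <- lm(formula, data = data[-idx, ])
  pre <- predict(fit, newdata = data[idx, ])
  rbind(Record     = data[idx, response],
        Prediction = pre,
        Bias       = data[idx, response] - pre)
}

# Synthetic data (assumption, for illustration only): an exact linear
# relation in 1/DENSITY, with row 3 deliberately corrupted by +10.
toy <- data.frame(DENSITY = seq(1.02, 1.10, by = 0.01))
toy$BODYFAT <- 495 / toy$DENSITY - 450
toy$BODYFAT[3] <- toy$BODYFAT[3] + 10

res <- holdout_bias(toy, 3, BODYFAT ~ I(1 / DENSITY), "BODYFAT")
print(round(res, 2))
```

Because the corrupted row is excluded from the fit, its bias recovers the injected error instead of being absorbed into the regression line.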


**96: Keep.** Since 96 has the largest prediction bias, we deduced that either its DENSITY or its BODYFAT must have been recorded wrongly. We compared it with observations sharing a similar DENSITY (172), a similar BODYFAT (126), and a similar BODYFAT and HEIGHT (109). Record 96 differs far more from 172 than from 126 and 109, so the DENSITY must be wrong. Although we cannot verify this further, the remaining measurements of 96 look reasonable, so we keep 96 based on the 4th criterion.

In [37]:
BodyFat[c(96, 172, 126, 109),-c(1,4)]

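| | BODYFAT | DENSITY | WEIGHT | HEIGHT | ADIPOSITY | NECK | CHEST | ABDOMEN | HIP | THIGH | KNEE | ANKLE | BICEPS | FOREARM | WRIST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96 | 17.3 | 1.0991 | 224.5 | 77.75 | 26.1 | 41.1 | 113.2 | 99.2 | 107.5 | 61.7 | 42.3 | 23.2 | 32.9 | 30.8 | 20.4 |
| 172 | 1.9 | 1.0983 | 125.75 | 65.5 | 20.6 | 34.0 | 90.8 | 75.0 | 89.2 | 50.0 | 34.8 | 22.0 | 24.8 | 25.9 | 16.9 |
| 126 | 17.4 | 1.0587 | 167.0 | 67.0 | 26.2 | 36.6 | 101.0 | 89.9 | 100.0 | 60.7 | 36.0 | 21.9 | 35.6 | 30.2 | 17.6 |
| 109 | 17.2 | 1.0593 | 194.0 | 75.5 | 24.0 | 38.5 | 110.1 | 88.7 | 102.1 | 57.5 | 40.0 | 24.8 | 35.1 | 30.7 | 19.2 |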


![](plot\\96.jpeg)

**48 & 76: Delete.**
48 and 76 have similar DENSITY but significantly different BODYFAT. Compared with observations sharing a similar DENSITY (146 and 90), all of their measurements except BODYFAT are reasonable, from which we deduced that their BODYFAT records are wrong. Although we could have estimated their BODYFAT as 14.3, the 1st criterion says we do not impute the response variable. Thus, we delete 48 and 76 based on the 1st and 2nd criteria.

In [38]:
BodyFat[order(BodyFat$DENSITY)[173:176],]

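| | IDNO | BODYFAT | DENSITY | AGE | WEIGHT | HEIGHT | ADIPOSITY | NECK | CHEST | ABDOMEN | HIP | THIGH | KNEE | ANKLE | BICEPS | FOREARM | WRIST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 146 | 146 | 14.4 | 1.0664 | 24 | 156.0 | 70.75 | 21.9 | 35.7 | 92.7 | 81.9 | 95.3 | 56.4 | 36.5 | 22.0 | 33.5 | 28.3 | 17.3 |
| 48 | 48 | 6.4 | 1.0665 | 39 | 148.5 | 71.25 | 20.6 | 34.6 | 89.8 | 79.5 | 92.7 | 52.7 | 37.5 | 21.9 | 28.8 | 26.8 | 17.9 |
| 76 | 76 | 18.3 | 1.0666 | 61 | 148.25 | 67.5 | 22.9 | 36.0 | 91.6 | 81.8 | 94.8 | 54.5 | 37.0 | 21.4 | 29.3 | 27.0 | 18.3 |
| 90 | 90 | 14.3 | 1.0666 | 48 | 176.0 | 73.0 | 23.3 | 36.7 | 96.7 | 86.5 | 98.3 | 60.4 | 39.9 | 24.4 | 28.8 | 29.6 | 18.7 |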
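As a rough cross-check of the neighbor comparison, the four rows printed above can be compared against a closed-form density-to-bodyfat relation. The sketch below uses Siri's equation, BODYFAT = 495/DENSITY - 450, which is an assumption here: this section only fits a regression on 1/DENSITY and does not state the coefficients. The cutoff of 2 percentage points and the names `near`, `SIRI`, `GAP`, and `suspect` are likewise illustrative.

```r
# Rows copied from the output above (records with nearly identical DENSITY).
near <- data.frame(
  IDNO    = c(146, 48, 76, 90),
  BODYFAT = c(14.4, 6.4, 18.3, 14.3),
  DENSITY = c(1.0664, 1.0665, 1.0666, 1.0666)
)

# Assumption: Siri's equation as the reference relation (not the fitted model).
near$SIRI <- 495 / near$DENSITY - 450
near$GAP  <- near$BODYFAT - near$SIRI

print(round(near[, c("IDNO", "BODYFAT", "SIRI", "GAP")], 2))

# Hypothetical cutoff: flag records more than 2 points away from the reference.
suspect <- near$IDNO[abs(near$GAP) > 2]
print(suspect)
```

Only 48 and 76 are flagged, while 146 and 90 stay within a fraction of a point, which is consistent with treating the BODYFAT of 48 and 76 as the faulty field.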


**182 & 216: Delete.** In the following figure, red points refer to 182 and purple points refer to 216. Although the prediction bias of 182 and 216 is not large, their extreme BODYFAT, WEIGHT, and other measurements indicate that these records are extreme and uncommon. It is unreasonable for a human to have zero or negative BODYFAT, as 182 does, and it is rare for a person to be as severely overweight as 216. Thus, based on the 3rd criterion, we sacrifice the information provided by 182 and 216 in order to fit a model that is more robust and accurate for most people.

![](plot\\182_216.jpeg)


## Check HEIGHT & WEIGHT with ADIPOSITY

Using the formula for ADIPOSITY in terms of HEIGHT and WEIGHT introduced earlier, we identified and corrected outliers in WEIGHT and HEIGHT based on ADIPOSITY. Following the same approach as in the previous section, we obtained a similar list of outliers: 39, 42, 163, and 221. In addition, 42 is an extremely influential point, as the following plot shows.

![](plot\\BMI_regression.jpeg)

In [39]:
outlier_adiposity <- c(39,42,163,221);
train <- BodyFat[-outlier_adiposity,];test <- BodyFat[outlier_adiposity,]
pre <- predict(lm(ADIPOSITY~I(WEIGHT/(HEIGHT)^2), data=train),test)
round(rbind(Record=test$ADIPOSITY, Prediction=pre, Bias=pre-test$ADIPOSITY ),2)

| | 39 | 42 | 163 | 221 |
|---|---|---|---|---|
| Record | 48.9 | 29.9 | 24.4 | 24.5 |
| Prediction | 48.91 | 165.47 | 27.44 | 21.72 |
| Bias | 0.01 | 135.57 | 3.04 | -2.78 |


**39: Delete.** Although the prediction result indicates that the records of 39 are correct, he has the highest value on the ten variables printed below, which marks this person as extremely overweight. Following our 3rd criterion, we exclude this point in order to obtain a more robust model.

In [None]:
max_cols <- c()  # Columns in which record 39 holds the maximum (avoid shadowing max())
for (i in 1:17) { if (max(BodyFat[,i]) == BodyFat[39,i]) { max_cols <- c(max_cols, i) } }
names(BodyFat)[max_cols]

**42: Impute HEIGHT.**
Record 42 is a highly influential point, and its recorded HEIGHT of 29.50 inches is unreasonable. The imputed HEIGHT is 69.48 inches, which is reasonable. However, since HEIGHT is recorded to a precision of 0.25 inch, we corrected 42's HEIGHT to 69.50 inches.


In [34]:
train <- BodyFat[-outlier_adiposity,];test <- BodyFat[42,]  # Reuse the outlier list defined above
pre <- sqrt( predict(lm(I((HEIGHT)^2)~I(WEIGHT/ADIPOSITY), data=train),test) )
round(cbind(Record=test$HEIGHT, Prediction=pre, Imputation=69.50),2)

| | Record | Prediction | Imputation |
|---|---|---|---|
| 42 | 29.5 | 69.48 | 69.5 |
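The imputation can also be sanity-checked in closed form. The sketch below assumes the standard body mass index definition for pounds and inches, ADIPOSITY = 703 * WEIGHT / HEIGHT^2, which is not restated in this section, and inverts it for HEIGHT. The helper names `height_from_bmi` and `snap` are hypothetical. Since 42's WEIGHT is not printed here, the inversion is demonstrated on record 96, whose values appear earlier.

```r
# Assumption: standard BMI conversion factor for pounds and inches.
BMI_FACTOR <- 703

# Invert ADIPOSITY = BMI_FACTOR * WEIGHT / HEIGHT^2 for HEIGHT.
height_from_bmi <- function(weight_lb, adiposity) {
  sqrt(BMI_FACTOR * weight_lb / adiposity)
}

# Snap a value to the nearest multiple of `step` (0.25 inch for HEIGHT).
snap <- function(x, step = 0.25) round(x / step) * step

# Record 96: WEIGHT 224.5, ADIPOSITY 26.1, recorded HEIGHT 77.75.
h96 <- height_from_bmi(224.5, 26.1)
print(c(raw = h96, snapped = snap(h96)))

# The regression-based prediction for record 42, snapped to the 0.25 grid.
print(snap(69.48))
```

Snapping to the measurement grid keeps the imputed value indistinguishable in form from genuinely recorded heights.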


**163 & 221: Delete.**
Comparing 163 and 221 with neighbors that have similar BODYFAT, 163's WEIGHT should have been smaller and its HEIGHT larger, while 221 errs in the opposite direction. WEIGHT and HEIGHT cannot be imputed simultaneously when both are incorrect. Thus, since their records must be wrong and cannot be imputed, we exclude 163 and 221 based on the 2nd criterion.



In [36]:
BodyFat[order(BodyFat$BODYFAT),][c(65:67),-c(1,4)]
BodyFat[order(BodyFat$BODYFAT),][c(59:61),-c(1,4)]

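| | BODYFAT | DENSITY | WEIGHT | HEIGHT | ADIPOSITY | NECK | CHEST | ABDOMEN | HIP | THIGH | KNEE | ANKLE | BICEPS | FOREARM | WRIST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 70 | 13.2 | 1.0693 | 156.75 | 71.5 | 21.6 | 36.3 | 94.4 | 84.6 | 94.3 | 51.2 | 37.4 | 21.6 | 27.3 | 27.1 | 17.3 |
| 163 | 13.3 | 1.069 | 184.25 | 68.75 | 24.4 | 40.7 | 98.9 | 92.1 | 103.5 | 64.0 | 37.3 | 23.5 | 33.5 | 30.6 | 19.7 |
| 33 | 13.4 | 1.0719 | 168.0 | 71.25 | 23.3 | 38.1 | 93.0 | 79.1 | 94.5 | 57.3 | 36.2 | 24.5 | 29.0 | 30.0 | 18.8 |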


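| | BODYFAT | DENSITY | WEIGHT | HEIGHT | ADIPOSITY | NECK | CHEST | ABDOMEN | HIP | THIGH | KNEE | ANKLE | BICEPS | FOREARM | WRIST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 12.6 | 1.0708 | 154.25 | 67.75 | 23.7 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
| 221 | 12.7 | 1.0706 | 153.25 | 70.5 | 24.5 | 38.5 | 99.0 | 91.8 | 96.2 | 57.7 | 38.1 | 23.9 | 31.4 | 29.9 | 18.9 |
| 239 | 12.7 | 1.0705 | 155.25 | 69.5 | 22.6 | 37.9 | 95.8 | 82.8 | 94.5 | 61.2 | 39.1 | 22.3 | 29.8 | 28.9 | 18.3 |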
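The inconsistency in 163 and 221 can be made explicit by recomputing the index from the printed WEIGHT and HEIGHT. This sketch again assumes the standard definition ADIPOSITY = 703 * WEIGHT / HEIGHT^2 (pounds and inches), which is not restated in this section. The cutoff of 1 and the names `rows`, `BMI`, `GAP`, and `flagged` are illustrative choices.

```r
# WEIGHT, HEIGHT, and ADIPOSITY copied from the two outputs above.
rows <- data.frame(
  IDNO      = c(70, 163, 33, 1, 221, 239),
  WEIGHT    = c(156.75, 184.25, 168.0, 154.25, 153.25, 155.25),
  HEIGHT    = c(71.5, 68.75, 71.25, 67.75, 70.5, 69.5),
  ADIPOSITY = c(21.6, 24.4, 23.3, 23.7, 24.5, 22.6)
)

# Assumption: standard BMI formula for pounds and inches.
rows$BMI <- 703 * rows$WEIGHT / rows$HEIGHT^2
rows$GAP <- rows$ADIPOSITY - rows$BMI

print(round(rows, 2))

# Hypothetical cutoff: recorded ADIPOSITY more than 1 unit from the recomputed value.
flagged <- rows$IDNO[abs(rows$GAP) > 1]
print(flagged)
```

The gap is negative for 163 (recorded ADIPOSITY is lower than its WEIGHT and HEIGHT imply) and positive for 221, matching the opposite directions described in the text, while the four neighbors agree to within about 0.1.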




## Summary
 
In conclusion, for data cleaning we exclude seven points (39, 48, 76, 163, 182, 216, and 221) and impute the HEIGHT of 42.

<dl>
<dt>Extreme Values:
<dd>   
    
1. WEIGHT: Delete 39.
    
2. BODYFAT: Delete 182 and 216.
    
<dt>Incorrect Records:
<dd>   
    
1. BODYFAT: Delete 48 and 76.

2. WEIGHT & HEIGHT: Delete 163 and 221.</dd>

<dt>Imputed Value:
<dd>   
    
 1. HEIGHT: 42.</dd>
</dl>
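Applying these decisions could look like the following sketch. The data frame here is a placeholder rather than BodyFat.csv: the assumptions are that the file has 252 rows and that IDNO runs from 1 to 252 in row order. The names `drop_ids` and `cleaned` are hypothetical.

```r
# Placeholder standing in for read.csv("BodyFat.csv")
# (assumption: 252 records with IDNO = 1..252; HEIGHT values are dummies).
BodyFat <- data.frame(IDNO = 1:252, HEIGHT = 70)
BodyFat$HEIGHT[BodyFat$IDNO == 42] <- 29.5   # the implausible recorded HEIGHT

drop_ids <- c(39, 48, 76, 163, 182, 216, 221)

cleaned <- BodyFat
# Impute first, addressing the record by IDNO rather than by row position.
cleaned$HEIGHT[cleaned$IDNO == 42] <- 69.5
# Then drop the excluded records.
cleaned <- cleaned[!(cleaned$IDNO %in% drop_ids), ]

print(nrow(cleaned))
```

Selecting by IDNO instead of row position avoids off-by-one mistakes once rows have been removed and positions no longer match record numbers.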
