## Introduction

References:
- https://cran.r-project.org/web//packages/missForest/vignettes/missForest_1.5.pdf
- https://blog.csdn.net/a358463121/article/details/52145260
- https://rpubs.com/lmorgan95/MissForest

missForest runs iteratively, updating the imputed matrix **variable-wise** and **assessing its
performance between iterations**. The assessment is based on the difference(s) between
the previous imputation result and the new one. As soon as this difference (for data with
a single variable type) or both differences (for mixed-type data) increase, the algorithm
stops.

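The procedure starts with an initial guess for the missing values, for example the mean or median of each variable. The variables are then sorted by their proportion of missing values, in increasing order. Starting with the variable that has the fewest missing values, a random forest (classification or regression) is fitted and used to fill in that variable's missing entries. This is repeated iteratively until the newest imputation barely changes from the previous one.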
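
The loop described above can be sketched in a few lines of R. This is a simplified illustration, not the package's implementation: it handles numeric columns only, tracks a single difference measure, and the function name `impute_rf_sketch` is hypothetical. It assumes the `randomForest` package is installed.

```r
library(randomForest)

# Hypothetical helper: numeric-only sketch of the iterative imputation loop.
impute_rf_sketch <- function(xmis, maxiter = 10, ntree = 100) {
  xmis <- as.matrix(xmis)
  na_mask <- is.na(xmis)
  ximp <- xmis
  # Initial guess: fill each column with the mean of its observed values.
  for (j in seq_len(ncol(ximp))) {
    ximp[na_mask[, j], j] <- mean(xmis[, j], na.rm = TRUE)
  }
  # Visit variables in order of increasing number of missing entries.
  visit <- order(colSums(na_mask))
  diff_old <- Inf
  for (iter in seq_len(maxiter)) {
    xold <- ximp
    for (j in visit) {
      mis <- na_mask[, j]
      if (!any(mis)) next
      # Fit on rows where variable j is observed, predict where it is missing.
      fit <- randomForest(x = ximp[!mis, -j, drop = FALSE],
                          y = ximp[!mis, j], ntree = ntree)
      ximp[mis, j] <- predict(fit, ximp[mis, -j, drop = FALSE])
    }
    diff_new <- sum((ximp - xold)^2) / sum(ximp^2)
    # Difference no longer shrinking: return the previous imputation.
    if (diff_new >= diff_old) return(xold)
    diff_old <- diff_new
  }
  ximp
}

set.seed(1)
x <- as.matrix(iris[, 1:4])
x[sample(length(x), 60)] <- NA
ximp <- impute_rf_sketch(x)
anyNA(ximp)
```

Observed entries are never overwritten; only the cells flagged in `na_mask` are updated in each pass.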

**Advantages:**

- Can be applied to **mixed data types** (missings in numeric & categorical variables)
- **No pre-processing** required (no dummy-coding, standardization, data splitting, etc.)
- **No assumptions** required (aside from the normal assumption of being MAR/MCAR)
- **Robust to noisy data**, as random forests effectively have built-in feature selection. Methods like KNN imputation predict poorly in datasets with weak or non-informative predictors, whereas `missForest()` makes little to no use of these features
- **Non-parametric**: makes no assumptions about the relationship between the features, unlike MICE which assumes linearity
- Excellent **predictive power**
- Can leverage **non-linear** and **interaction effects** between features to improve imputation accuracy
- Gives an **OOB error estimate** for its predictions (Numeric: NRMSE/MSE, Categorical: PFC)
- Works with **high-dimensional data** (p≫n)



**Disadvantages:**

- **Imputation time**, which increases with the number of observations, predictors and number of predictors containing missing values
- It inherits the same **lack of interpretability** of random forests
- **It is an algorithm**, not a model object you can store somewhere. This means it has to run each time missing data has to be imputed, which could be problematic in some production environments


### Description

'missForest' is used to impute missing values, particularly in the case of mixed-type data. It can impute continuous and/or categorical data, including complex interactions and nonlinear relations. It yields an out-of-bag (OOB) imputation error estimate. Moreover, it can be run in parallel to save computation time.

### Usage

```r
missForest(xmis, maxiter = 10, ntree = 100, variablewise = FALSE,
                       decreasing = FALSE, verbose = FALSE,
                       mtry = floor(sqrt(ncol(xmis))), replace = TRUE,
                       classwt = NULL, cutoff = NULL, strata = NULL,
                       sampsize = NULL, nodesize = NULL, maxnodes = NULL,
                       xtrue = NA, parallelize = c('no', 'variables', 'forests'))
```

### Arguments

| Argument       | Description                                                  |
| -------------- | ------------------------------------------------------------ |
| `xmis`         | a data matrix with missing values. The columns correspond to the variables and the rows to the observations. |
| `maxiter`      | maximum number of iterations to be performed given the stopping criterion is not met beforehand. |
| `ntree`        | number of trees to grow in each forest.                      |
| `variablewise` | logical. If 'TRUE' the OOB error is returned for each variable separately. This can be useful as a reliability check for the imputed variables w.r.t. a subsequent data analysis. |
| `decreasing`   | logical. If 'FALSE' then the variables are sorted w.r.t. increasing amount of missing entries during computation. |
| `verbose`      | logical. If 'TRUE' the user is supplied with additional output between iterations, i.e., estimated imputation error, runtime and if complete data matrix is supplied the true imputation error. See 'xtrue'. |
| `mtry`         | number of variables randomly sampled at each split. This argument is directly supplied to the 'randomForest' function. Note that the default value is sqrt(p) for both categorical and continuous variables where p is the number of variables in 'xmis'. |
| `replace`      | logical. If 'TRUE' bootstrap sampling (with replacements) is performed else subsampling (without replacements). |
| `classwt`      | list of priors of the classes in the categorical variables. This is equivalent to the randomForest argument, however, the user has to set the priors for all categorical variables in the data set (for continuous variables set it 'NULL'). |
| `cutoff`       | list of class cutoffs for each categorical variable. Same as with 'classwt' (for continuous variables set it '1'). |
| `strata`       | list of (factor) variables used for stratified sampling. Same as with 'classwt' (for continuous variables set it 'NULL'). |
| `sampsize`     | list of size(s) of sample to draw. This is equivalent to the randomForest argument, however, the user has to set the sizes for all variables. |
| `nodesize`     | minimum size of terminal nodes. Has to be a vector of length 2, with the first entry being the number for continuous variables and the second entry the number for categorical variables. Default is 1 for continuous and 5 for categorical variables. |
| `maxnodes`     | maximum number of terminal nodes for trees in the forest.    |
| `xtrue`        | optional. Complete data matrix. This can be supplied to test the performance. Upon providing the complete data matrix 'verbose' will show the true imputation error after each iteration and the output will also contain the final true imputation error. |
| `parallelize`  | should 'missForest' be run in parallel. **Default is 'no'**. If 'variables' the data is split into pieces of the size equal to the number of cores registered in the parallel backend. If 'forests' the total number of trees in each random forest is split in the same way. Whether 'variables' or 'forests' is more suitable depends on the data. See Details. |
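
As a hedged illustration of the `parallelize` argument: the sketch below assumes the `doParallel` package is installed and that two cores are available; any backend registered for `foreach` should serve the same purpose. The object name `iris.par` is only illustrative.

```r
library(missForest)
library(doParallel)

# Register a parallel backend with two workers (assumed to be available).
registerDoParallel(cores = 2)

set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)

# Split the trees of each forest across the registered workers.
iris.par <- missForest(iris.mis, parallelize = "forests")
iris.par$OOBerror
```

On a data set as small as iris the parallel overhead will likely outweigh any gain; the option is aimed at larger problems.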

### Value

| Component  | Description                                                  |
| ---------- | ------------------------------------------------------------ |
| `ximp`     | imputed data matrix of same type as 'xmis'.                  |
| `OOBerror` | estimated OOB imputation error. For the set of continuous variables in 'xmis' the NRMSE and for the set of categorical variables the proportion of falsely classified entries is returned. See Details for the exact definition of these error measures. If 'variablewise' is set to 'TRUE' then this will be a vector of length 'p' where 'p' is the number of variables and the entries will be the OOB error for each variable separately. |
| `error`    | true imputation error. This is only available if 'xtrue' was supplied. The error measures are the same as for 'OOBerror'. |

### See Also

`mixError`, `prodNA`, `randomForest`

**Description of the data used**

- **Iris data**: this complete data set contains five variables, one of which is categorical with three
levels. It ships with base R and can be loaded directly by typing `data(iris)`.

- **Oesophageal cancer data**: this complete data set comes from a case-control study of
oesophageal cancer in Ille-et-Vilaine, France. It ships with base R and can be loaded
directly by typing `data(esoph)`.

In [1]:
options(warn = -1)    # suppress all warnings
options('width' = 140)  # use the full print width
options(repr.plot.width = 15, repr.plot.height = 10)  # full-size plots

In [2]:
library(pacman)

p_load(missForest)
p_load(randomForest)
p_load(skimr)
s <- skim_tee

s(iris)
## The data contains four continuous and one categorical variable.

── Data Summary ────────────────────────
                           Values
Name                       data  
Number of rows             150   
Number of columns          5     
_______________________          
Column type frequency:           
  factor                   1     
  numeric                  4     
________________________         
Group variables            None  

── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts
1 Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50

── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist
1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7

In [3]:
## Artificially produce missing values using the 'prodNA' function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
s(iris.mis)

── Data Summary ────────────────────────
                           Values
Name                       data  
Number of rows             150   
Number of columns          5     
_______________________          
Column type frequency:           
  factor                   1     
  numeric                  4     
________________________         
Group variables            None  

── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts
1 Species              29         0.807 FALSE          3 set: 42, ver: 40, vir: 39

── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate mean    sd  p0 p25  p50  p75 p100 hist
1 Sepal.Length         24         0.84  5.83 0.852 4.3 5.1 5.75 6.4

## missForest in a nutshell

In [16]:
iris.imp <- missForest(iris.mis)

The results are stored in the R object `iris.imp`, which is a list. The imputed data matrix is available as
`iris.imp$ximp`.

In [17]:
s(iris.imp$ximp)

── Data Summary ────────────────────────
                           Values
Name                       data  
Number of rows             150   
Number of columns          5     
_______________________          
Column type frequency:           
  factor                   1     
  numeric                  4     
________________________         
Group variables            None  

── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts
1 Species               0             1 FALSE          3 ver: 51, vir: 50, set: 49

── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate mean    sd  p0  p25  p50  p75 p100 hist
1 Sepal.Length          0             1 5.84 0.819 4.3 5.10 5.8  6.4

In [18]:
iris.imp$OOBerror

- NRMSE: normalized root mean squared error
- PFC: proportion of falsely classified entries

In both cases good performance of missForest leads to a value close to 0 and bad performance to a value around 1.
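
To make the two measures concrete, here is a minimal base-R sketch evaluated on toy vectors. The helper names `nrmse_sketch` and `pfc_sketch` are hypothetical, and the formulas assume the usual definitions: squared error over the originally missing entries, normalized by the variance of the true values, and the share of originally missing categorical entries that were imputed incorrectly.

```r
# Hypothetical helpers; both look only at entries that were missing in xmis.
nrmse_sketch <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  sqrt(mean((ximp[mis] - xtrue[mis])^2) / var(xtrue[mis]))
}

pfc_sketch <- function(ximp, xmis, xtrue) {
  mis <- is.na(xmis)
  mean(as.character(ximp[mis]) != as.character(xtrue[mis]))
}

# Continuous toy example: three of six values were missing.
xtrue <- c(1, 2, 3, 4, 5, 6)
xmis  <- c(1, NA, 3, NA, 5, NA)
ximp  <- c(1, 2.5, 3, 4, 5, 5)
nrmse_value <- nrmse_sketch(ximp, xmis, xtrue)

# Categorical toy example: two of four labels were missing, one imputed wrongly.
ftrue <- factor(c("a", "b", "a", "c"))
fmis  <- ftrue
fmis[c(2, 4)] <- NA
fimp  <- factor(c("a", "b", "a", "a"), levels = levels(ftrue))
pfc_value <- pfc_sketch(fimp, fmis, ftrue)  # 0.5

c(NRMSE = nrmse_value, PFC = pfc_value)
```

A perfect imputation gives 0 for both measures; the PFC here is 0.5 because one of the two missing labels was imputed incorrectly.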

**To assess the reliability of the imputation for individual variables, e.g., to
decide which variables to use in a subsequent data analysis, missForest can return the OOB
error for each variable separately instead of aggregating over the whole data matrix. This is
done by setting `variablewise = TRUE` when calling the `missForest` function.**

In [19]:
missForest(iris.mis, variablewise = TRUE)$OOBerror

The output has one entry per variable in the data. For each
variable the resulting error and the type of error measure, i.e., mean squared error (MSE) or
PFC, is returned. Note that the NRMSE is not used here.
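
A small sketch of how the variable-wise errors might be lined up with the column names for easier reading. The table name `oob_table` is illustrative, and the code assumes that the names of the returned vector carry the error measure, as described above.

```r
library(missForest)

set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
err <- missForest(iris.mis, variablewise = TRUE)$OOBerror

# One row per variable: the measure that was used and its OOB error.
oob_table <- data.frame(
  variable = names(iris.mis),
  measure  = names(err),
  error    = unname(err)
)
oob_table
```

Because MSE is not normalized, the values for different continuous variables are on different scales and should not be compared directly.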

## Additional output using verbose

Setting `verbose = TRUE` prints additional output between iterations:
the estimated imputation error, the runtime and, if the complete data matrix is supplied,
the true imputation error.

In [20]:
set.seed(81)
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)

  missForest iteration 1 in progress...done!
    error(s): 0.2013623 0.03448276 
    estimated error(s): 0.1601467 0.0661157 
    difference(s): 0.01203429 0.1466667 
    time: 0.06 seconds

  missForest iteration 2 in progress...done!
    error(s): 0.2061798 0.03448276 
    estimated error(s): 0.1447976 0.04958678 
    difference(s): 0.0001436753 0 
    time: 0.1 seconds

  missForest iteration 3 in progress...done!
    error(s): 0.2119893 0.03448276 
    estimated error(s): 0.1433477 0.04132231 
    difference(s): 5.520233e-05 0 
    time: 0.06 seconds

  missForest iteration 4 in progress...done!
    error(s): 0.2138112 0.03448276 
    estimated error(s): 0.1443437 0.04958678 
    difference(s): 4.134513e-05 0 
    time: 0.06 seconds

  missForest iteration 5 in progress...done!
    error(s): 0.2183323 0.03448276 
    estimated error(s): 0.139025 0.04958678 
    difference(s): 4.394775e-05 0 
    time: 0.08 seconds



- `error(s)`: the error of the imputed values relative to the true values (only shown if `xtrue` is supplied).
- `estimated error(s)`: the OOB imputation error estimate for the continuous and categorical
parts of the imputed data set. If there is only one type of variable there will be only
one value with the corresponding error measure. This estimate is based on the observed (non-missing) entries.
- `difference(s)`: the difference between the previous and the new imputed continuous and
categorical parts of the data set, i.e., how much the imputed entries changed between two successive iterations.

**Note**

1. After each iteration the difference between the previous and the new imputed data matrix is assessed for the continuous and categorical parts.
2. The stopping criterion is defined such that the imputation process is stopped as soon as both differences have become larger once.
3. In case of only one type of variable the computation stops as soon as the corresponding difference goes up for the first time.
4. However, the imputation last performed, where both differences went up, is generally less accurate than the previous one. Therefore, whenever the computation stops due to the stopping criterion (and not due to `maxiter`), the second-to-last imputation matrix is returned.
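
The `difference(s)` values in the five-iteration iris log above can be reproduced with a short base-R sketch. The helper names are hypothetical. The denominator of the categorical difference (all entries of the categorical columns) is inferred from the log, where 0.1466667 equals 22/150. The stopping check assumes the loop keeps running while at least one difference still strictly decreases, which is consistent with the run ending at iteration 5 although the categorical difference stays at 0.

```r
# Hypothetical helpers mirroring the 'difference(s)' line of the verbose log.
diff_continuous  <- function(new, old) sum((new - old)^2) / sum(new^2)
diff_categorical <- function(new, old) sum(new != old) / length(new)

# 22 changed labels out of 150 entries give the first categorical difference.
old_labels <- rep("a", 150)
new_labels <- c(rep("b", 22), rep("a", 128))
first_cat_diff <- diff_categorical(new_labels, old_labels)

# Differences copied from the iris log (iterations 1 to 5).
cont  <- c(0.01203429, 0.0001436753, 5.520233e-05, 4.134513e-05, 4.394775e-05)
categ <- c(0.1466667, 0, 0, 0, 0)

# Assumed rule: continue while at least one difference still decreases.
keeps_going <- (diff(cont) < 0) | (diff(categ) < 0)
stop_iter   <- which(!keeps_going)[1] + 1   # iteration at which the loop ends
returned    <- stop_iter - 1                # iteration whose result is returned

c(stop_iter = stop_iter, returned = returned)
```

With these numbers the loop ends at iteration 5 and the result of iteration 4 is the one returned.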



In [10]:
iris.imp$OOBerror   # estimated imputation error, taken from the iteration before the difference increased

## Changing the number of iterations with maxiter

If the difference between iterations keeps shrinking
towards zero while the estimated error stagnates, the only way to keep computation time
at a reasonable level is to limit the number of iterations using the argument `maxiter`.

In [24]:
data(esoph)
esoph.mis <- prodNA(esoph, 0.05)
s(esoph.mis)

── Data Summary ────────────────────────
                           Values
Name                       data  
Number of rows             88    
Number of columns          5     
_______________________          
Column type frequency:           
  factor                   3     
  numeric                  2     
________________________         
Group variables            None  

── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts
1 agegp                 5         0.943 TRUE           6 55-: 16, 45-: 15, 25-: 14, 35-: 14
2 alcgp                 5         0.943 TRUE           4 0-3: 22, 40-: 22, 120: 20, 80-: 19
3 tobgp                 5         0.943 TRUE           4 10-: 23, 0-9: 21, 20-: 20, 30+: 19

── Variable type: numeric ────────────────────────────────────────────────────

In [25]:
set.seed(96)
esoph.imp <- missForest(esoph.mis, verbose = TRUE)

  missForest iteration 1 in progress...done!
    estimated error(s): 0.5114959 0.686747 
    difference(s): 0.001339828 0.04924242 
    time: 0.05 seconds

  missForest iteration 2 in progress...done!
    estimated error(s): 0.4342053 0.6506024 
    difference(s): 0.0001624316 0.003787879 
    time: 0.03 seconds

  missForest iteration 3 in progress...done!
    estimated error(s): 0.4520883 0.7028112 
    difference(s): 3.275361e-05 0.007575758 
    time: 0.03 seconds

  missForest iteration 4 in progress...done!
    estimated error(s): 0.4642958 0.6827309 
    difference(s): 3.086518e-05 0.01136364 
    time: 0.05 seconds

  missForest iteration 5 in progress...done!
    estimated error(s): 0.4203652 0.6666667 
    difference(s): 0.0002101057 0.007575758 
    time: 0.03 seconds

  missForest iteration 6 in progress...done!
    estimated error(s): 0.4119456 0.6626506 
    difference(s): 0.0001025898 0.003787879 
    time: 0.03 seconds

  missForest iteration 7 in progress...done!
    e

In the run above the stopping criterion is not met within the iterations shown: in every
iteration at least one of the two differences still decreases, so the algorithm keeps going.

**In the above
case of the esoph data we can get the result of the sixth iteration by doing the following:**

In [26]:
set.seed(96)
esoph.imp <- missForest(esoph.mis, verbose = TRUE, maxiter = 6)

  missForest iteration 1 in progress...done!
    estimated error(s): 0.5114959 0.686747 
    difference(s): 0.001339828 0.04924242 
    time: 0.05 seconds

  missForest iteration 2 in progress...done!
    estimated error(s): 0.4342053 0.6506024 
    difference(s): 0.0001624316 0.003787879 
    time: 0.03 seconds

  missForest iteration 3 in progress...done!
    estimated error(s): 0.4520883 0.7028112 
    difference(s): 3.275361e-05 0.007575758 
    time: 0.05 seconds

  missForest iteration 4 in progress...done!
    estimated error(s): 0.4642958 0.6827309 
    difference(s): 3.086518e-05 0.01136364 
    time: 0.03 seconds

  missForest iteration 5 in progress...done!
    estimated error(s): 0.4203652 0.6666667 
    difference(s): 0.0002101057 0.007575758 
    time: 0.03 seconds

  missForest iteration 6 in progress...done!
    estimated error(s): 0.4119456 0.6626506 
    difference(s): 0.0001025898 0.003787879 
    time: 0.03 seconds



The returned result is now the one from iteration 6. In essence, there are two uses for the
`maxiter` argument:
1. controlling the run time in case of stagnating performance;
2. extracting a preferred iteration result that the stopping criterion would not return.

## Speed and accuracy trade-off manipulating mtry and ntree

In each iteration missForest grows one random forest per variable to impute the missing
values. With a large number of variables p this can lead to impractically long computation
times. There are two ways to speed up the imputation process of missForest:
1. reducing the number of trees grown in each forest using the argument `ntree`;
2. reducing the number of variables randomly sampled at each split using the argument `mtry`.

Reducing either of these numbers will probably result in reduced
accuracy. This is why we speak of a speed and accuracy trade-off.
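
One way to look at this trade-off is to time two runs and compare their OOB errors. The sketch below is illustrative only: the values `ntree = 10` and `mtry = 1` are arbitrary choices, and on a data set as small as iris the timing difference may be negligible.

```r
library(missForest)

set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)

# Default settings versus a deliberately cheap configuration.
t_default <- system.time(imp_default <- missForest(iris.mis))
t_small   <- system.time(imp_small <- missForest(iris.mis, ntree = 10, mtry = 1))

# User, system and elapsed time for both runs.
rbind(default = t_default[1:3], small = t_small[1:3])

# OOB error estimates (NRMSE and PFC) for both runs.
rbind(default = imp_default$OOBerror, small = imp_small$OOBerror)
```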

## Testing the appropriateness by supplying xtrue

When imputing data with real missing values, the question arises how good the imputation
was. In missForest the estimated OOB imputation error gives a good indication of what to
expect. A wary user might want to make an additional assessment (or back up the OOB
estimate) by performing cross-validation or, ideally, by first testing missForest
on complete data. For both cases missForest offers the `xtrue` argument, which
takes the same data matrix as `xmis` but with no missing values present.
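
The same check can also be done after the fact with `mixError`, listed under See Also. A minimal sketch, assuming its arguments are the imputed data, the data with missing values and the complete data, in that order:

```r
library(missForest)

set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
iris.imp <- missForest(iris.mis)

# True imputation error on the artificially removed entries.
true_err <- mixError(iris.imp$ximp, iris.mis, iris)
true_err

# OOB estimate from the same run, for comparison.
iris.imp$OOBerror
```

Comparing the two vectors shows how closely the OOB estimate tracks the true error on this data set.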

Supplying `xtrue` makes this check straightforward. Combined with `verbose = TRUE`, it
also reports the true imputation error after every iteration:

In [29]:
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)

  missForest iteration 1 in progress...done!
    error(s): 0.2080787 0.03448276 
    estimated error(s): 0.1596026 0.07438017 
    difference(s): 0.01203859 0.1466667 
    time: 0.11 seconds

  missForest iteration 2 in progress...done!
    error(s): 0.2153769 0.03448276 
    estimated error(s): 0.145999 0.04132231 
    difference(s): 0.0001822433 0 
    time: 0.06 seconds

  missForest iteration 3 in progress...done!
    error(s): 0.2158092 0.03448276 
    estimated error(s): 0.1445526 0.04958678 
    difference(s): 3.156033e-05 0 
    time: 0.06 seconds

  missForest iteration 4 in progress...done!
    error(s): 0.214517 0.03448276 
    estimated error(s): 0.1441032 0.04958678 
    difference(s): 2.916028e-05 0 
    time: 0.08 seconds

  missForest iteration 5 in progress...done!
    error(s): 0.2149281 0.03448276 
    estimated error(s): 0.1428988 0.03305785 
    difference(s): 4.646177e-05 0 
    time: 0.07 seconds

