## Load the dataset
In this step I load the CSV from a URL. The file contains missing feature values. A dataset may mark a missing value with a symbol such as "?" or with a number such as 0; this file uses "?", so I ask pandas to read every "?" as NaN.

In [1]:
import pandas as pd

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = pd.read_csv(url, header=None, na_values='?')
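
As a quick offline check of what `na_values='?'` does, here is a minimal sketch on a made-up three-row sample. The sample text and the names `sample`, `df`, and `n_missing` are assumptions for illustration and are not taken from the horse colic file.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real file
sample = io.StringIO("2,1,38.5,?\n1,?,39.2,88\n2,1,38.3,40\n")

# na_values='?' tells pandas to parse every "?" cell as NaN
df = pd.read_csv(sample, header=None, na_values='?')

# count the cells that ended up as NaN
n_missing = int(df.isna().sum().sum())
print(df)
print('Missing cells:', n_missing)
```

Without `na_values`, a column containing "?" would be read as text, so later numeric steps would fail or behave unexpectedly.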

## Summarize the first few rows
This checks whether the missing values were replaced with NaN.

In [2]:
print(dataframe.head())

    0   1        2     3      4     5    6    7    8    9   ...    18    19  \
0  2.0   1   530101  38.5   66.0  28.0  3.0  3.0  NaN  2.0  ...  45.0   8.4   
1  1.0   1   534817  39.2   88.0  20.0  NaN  NaN  4.0  1.0  ...  50.0  85.0   
2  2.0   1   530334  38.3   40.0  24.0  1.0  1.0  3.0  1.0  ...  33.0   6.7   
3  1.0   9  5290409  39.1  164.0  84.0  4.0  1.0  6.0  2.0  ...  48.0   7.2   
4  2.0   1   530255  37.3  104.0  35.0  NaN  NaN  6.0  2.0  ...  74.0   7.4   

    20   21   22  23     24  25  26  27  
0  NaN  NaN  2.0   2  11300   0   0   2  
1  2.0  2.0  3.0   2   2208   0   0   2  
2  NaN  NaN  1.0   2      0   0   0   1  
3  3.0  5.3  2.0   1   2208   0   0   1  
4  NaN  NaN  2.0   2   4300   0   0   2  

[5 rows x 28 columns]


Next, we can enumerate each column and report the number of rows with a missing value in that column.

In [3]:
# summarize the number of rows with missing values for each column
for i in range(dataframe.shape[1]):
    # count number of rows with missing values (select a Series so the sum is a scalar)
    n_miss = dataframe[i].isnull().sum()
    perc = n_miss / dataframe.shape[0] * 100
    print('> %d, Missing: %d (%.1f%%)' % (i, n_miss, perc))

> 0, Missing: 1 (0.3%)
> 1, Missing: 0 (0.0%)
> 2, Missing: 0 (0.0%)
> 3, Missing: 60 (20.0%)
> 4, Missing: 24 (8.0%)
> 5, Missing: 58 (19.3%)
> 6, Missing: 56 (18.7%)
> 7, Missing: 69 (23.0%)
> 8, Missing: 47 (15.7%)
> 9, Missing: 32 (10.7%)
> 10, Missing: 55 (18.3%)
> 11, Missing: 44 (14.7%)
> 12, Missing: 56 (18.7%)
> 13, Missing: 104 (34.7%)
> 14, Missing: 106 (35.3%)
> 15, Missing: 247 (82.3%)
> 16, Missing: 102 (34.0%)
> 17, Missing: 118 (39.3%)
> 18, Missing: 29 (9.7%)
> 19, Missing: 33 (11.0%)
> 20, Missing: 165 (55.0%)
> 21, Missing: 198 (66.0%)
> 22, Missing: 1 (0.3%)
> 23, Missing: 0 (0.0%)
> 24, Missing: 0 (0.0%)
> 25, Missing: 0 (0.0%)
> 26, Missing: 0 (0.0%)
> 27, Missing: 0 (0.0%)


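We can see that some columns (such as column indexes 1 and 2) have no missing values, while other columns (such as column indexes 15 and 21) have many or even a majority of missing values.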
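
The same per-column summary can also be computed without an explicit loop. The sketch below uses a small made-up frame so it runs on its own. The names `demo` and `summary` are hypothetical; on the real data you would call the same methods on `dataframe`.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 frame used only for illustration
demo = pd.DataFrame({0: [1.0, np.nan, 3.0, 4.0],
                     1: [np.nan, np.nan, 1.0, 2.0],
                     2: [5.0, 6.0, 7.0, 8.0]})

# isnull() gives a boolean frame; sum() counts True per column,
# mean() gives the fraction of True per column
summary = pd.DataFrame({'n_miss': demo.isnull().sum(),
                        'perc': demo.isnull().mean() * 100})
print(summary)
```

Sorting `summary` by `perc` is a convenient way to spot the columns that are mostly empty.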

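## Iterative Imputation With IterativeImputer
The scikit-learn machine learning library provides the IterativeImputer class, which supports iterative imputation.
In this section, we explore how to use the IterativeImputer class effectively.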

### IterativeImputer Data Transform
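IterativeImputer is a data transform that is first configured with the method used to estimate the missing values. By default, it uses a BayesianRidge model that predicts each feature as a function of all the other input features. Features are filled in ascending order, from the feature with the fewest missing values to the feature with the most.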

In [9]:
import numpy as np
from sklearn.linear_model import BayesianRidge
# IterativeImputer is experimental, so this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print('Missing: %d' % sum(np.isnan(X).flatten()))
# define impute
impute = IterativeImputer(estimator=BayesianRidge(), n_nearest_features=None, imputation_order='ascending')
# fit on the dataset
impute.fit(X)
# transform the dataset
Xtrans = impute.transform(X)
print('Missing: %d' % sum(np.isnan(Xtrans).flatten()))

Missing: 1605
Missing: 0
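
To see what the transform does on a small scale, here is a minimal sketch on made-up data in which the second column is exactly twice the first. The array `X_toy` and the variable names are assumptions for illustration; the defaults (a BayesianRidge estimator, ascending order) are left unchanged.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data: the second column is exactly twice the first
X_toy = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0],
                  [4.0, np.nan],
                  [5.0, 10.0]])

# fit the per-column regression models and fill the gap in one call
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X_toy)
print(X_filled)
```

Only the missing cell is changed and the observed values pass through untouched. Because the filled value is predicted from the other column, one would expect it to land near 8 rather than at the column mean, although the exact number depends on the fitted model.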


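## IterativeImputer and Model Evaluation
It is good practice to evaluate machine learning models on a dataset using k-fold cross-validation.
To apply iterative imputation correctly and avoid data leakage, the model for each column must be fit on the training data only, and then applied to both the training set and the test set of each fold.
This can be achieved by creating a modeling pipeline in which the first step is the iterative imputation and the second step is the model. The Pipeline class supports this.
For example, the pipeline below uses an IterativeImputer with the default strategy, followed by a random forest model.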

In [10]:
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
imputer = IterativeImputer()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Mean Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

Mean Accuracy: 0.872 (0.053)
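
To make the leakage point concrete, the sketch below spells out by hand roughly what the pipeline does inside each fold: the imputer is fit on the training rows only and then reused on the held-out rows. The synthetic data from `make_classification`, the 10% missing rate, and all variable names are assumptions for illustration, so the printed score says nothing about the horse colic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import StratifiedKFold

# Hypothetical synthetic data with about 10% of cells knocked out
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
rng = np.random.default_rng(1)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = []
for train_ix, test_ix in folds.split(X_demo, y_demo):
    imputer = IterativeImputer(random_state=1)
    # fit the imputation models on the training rows only
    X_train = imputer.fit_transform(X_demo[train_ix])
    # reuse those fitted models to fill the held-out rows
    X_test = imputer.transform(X_demo[test_ix])
    model = RandomForestClassifier(random_state=1).fit(X_train, y_demo[train_ix])
    fold_scores.append(model.score(X_test, y_demo[test_ix]))

print('Mean Accuracy: %.3f (%.3f)' % (np.mean(fold_scores), np.std(fold_scores)))
```

Calling `fit_transform` on the full dataset before splitting would let the test rows influence the imputation models, which is the leakage the pipeline avoids.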
