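Imputing missing values. Missing values are usually replaced with the mean, the median, or the most frequent value of each column; the statistic is selected through the imputer's strategy hyperparameter.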
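
To make the idea concrete, here is a minimal sketch of column-wise imputation in plain Python. The helper impute_column and the use of None as the missing-value marker are assumptions made for illustration only; they are not part of scikit-learn.

```python
from statistics import mean, median


def impute_column(values, strategy="mean"):
    """Return a copy of `values` with None entries replaced by a column statistic.

    Hypothetical helper for illustration; None marks a missing entry.
    """
    # Compute the statistic from the observed (non-missing) entries only.
    observed = [v for v in values if v is not None]
    stat = {"mean": mean, "median": median}[strategy](observed)
    # Fill each missing entry with that statistic.
    return [stat if v is None else v for v in values]


column = [1.0, None, 3.0, 8.0]
mean_filled = impute_column(column, "mean")      # gap filled with mean of 1, 3, 8
median_filled = impute_column(column, "median")  # gap filled with median of 1, 3, 8
print(mean_filled)
print(median_filled)
```

The statistic is computed per column from the observed entries, which is also how the pipeline further down treats each feature.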

In [1]:
import numpy as np

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.model_selection import cross_val_score

In [2]:
rng = np.random.RandomState(0)

In [3]:
dataset = load_boston()

In [4]:
X_full, y_full = dataset.data, dataset.target

In [6]:
n_samples = X_full.shape[0]
n_samples

506

In [7]:
n_features = X_full.shape[1]
n_features

13

Evaluate the score on the complete dataset, using a random forest as the model.

In [8]:
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_full, y_full).mean()
print("Score with the entire dataset = %.2f" % score)

Score with the entire dataset = 0.56


Add missing values to 75% of the samples.

In [9]:
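missing_rate = 0.75
n_missing_samples = int(np.floor(n_samples * missing_rate))
# Boolean mask: False = keep the sample intact, True = sample gets a missing value.
# The builtin bool is used because the np.bool alias was removed from recent NumPy.
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=bool),
                             np.ones(n_missing_samples,
                                     dtype=bool)))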

In [10]:
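# Shuffle the mask so the affected samples are spread randomly over the dataset.
rng.shuffle(missing_samples)
# For each affected sample, pick one random feature (column) to blank out.
missing_features = rng.randint(0, n_features, n_missing_samples)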
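
The mask and the feature indices above are later combined through paired integer indexing, so that each affected sample loses exactly one feature. The following sketch shows that mechanism on a small array; the sizes (6 samples, 3 features, 4 affected samples) and the names mask, cols and rows are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features, n_missing = 6, 3, 4  # toy sizes, not the Boston shapes

# Build and shuffle a boolean mask with exactly n_missing True entries.
mask = np.hstack((np.zeros(n_samples - n_missing, dtype=bool),
                  np.ones(n_missing, dtype=bool)))
rng.shuffle(mask)

# One random column index per affected row.
cols = rng.randint(0, n_features, n_missing)

# Toy data with no zeros, so every 0 afterwards is a planted "missing" value.
X = np.arange(1, n_samples * n_features + 1, dtype=float).reshape(n_samples, n_features)

# Paired indexing: rows[i] is matched with cols[i], one cell per affected row.
rows = np.where(mask)[0]
X[rows, cols] = 0

print(mask)
print(X)
```

Because rows and cols have the same length, NumPy pairs them element by element instead of selecting a full sub-block, which is why only one cell per affected row is overwritten.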

Evaluate the score without imputation, after dropping the samples that contain missing values.

In [13]:
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print('Score without the samples containing missing values = %.2f' % score)

Score without the samples containing missing values = 0.48


Evaluate the score after imputing the missing values.

In [16]:
X_missing = X_full.copy()
# Mark one feature per affected sample as missing, using 0 as the placeholder.
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
# Impute every 0 with the column mean, then fit the random forest.
estimator = Pipeline([("imputer", Imputer(missing_values=0,
                                          strategy='mean',
                                          axis=0)),
                      ('forest', RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)

Score after imputation of the missing values = 0.57
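
The Imputer class used above was deprecated in scikit-learn 0.20 and removed in 0.22. On newer releases the same step can be written with SimpleImputer from sklearn.impute, which always imputes column-wise and therefore has no axis argument. The sketch below shows the replacement on a small made-up array rather than on the Boston data, so it does not reproduce the scores above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: 0 is the missing-value marker, as in the notebook.
X = np.array([[1.0, 10.0],
              [0.0, 20.0],
              [3.0, 0.0]])

# Replace every 0 with the mean of the non-zero entries in its column.
imputer = SimpleImputer(missing_values=0, strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

In a Pipeline, SimpleImputer(missing_values=0, strategy="mean") would take the place of the "imputer" step. Note that any genuine zero in the data is also treated as missing when 0 is the marker.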
