## Splitting the training and test sets

### 1. Random sampling

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

test_size: size of the test set (here 0.2, i.e. 20% of the data)  
random_state: random seed, which makes the split reproducible
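
To make the effect of these two parameters concrete, here is a minimal sketch on a made-up 100-row DataFrame (`toy` is a hypothetical stand-in for `data`). It checks the 80/20 sizes and shows that reusing the same seed reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `data`: 100 rows with a single column
toy = pd.DataFrame({"x": range(100)})

train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))  # 80 20

# The same seed selects the same rows for the test set
same_split = test_a.index.equals(test_b.index)
print(same_split)  # True
```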

### 2. Stratified sampling

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["income_cat"]):
    # split() yields positional indices, so select rows with iloc
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]

n_splits is the number of train/test pairs to generate (the number of re-shuffling and splitting iterations). It defaults to 10; here n_splits=1.
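
The sketch below illustrates what stratification is meant to achieve, using a made-up imbalanced column (the 90%/10% data is an assumption for illustration, not the original dataset). The category proportions in the test set should closely match those of the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: category 1 makes up 90% of rows, category 2 makes up 10%
toy = pd.DataFrame({
    "value": np.arange(1000),
    "income_cat": [1] * 900 + [2] * 100,
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["income_cat"]))
strat_test = toy.iloc[test_idx]

# Compare category proportions in the full data and in the test set
overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test["income_cat"].value_counts(normalize=True).sort_index()
print(overall.to_dict())
print(in_test.to_dict())
```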

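## Data cleaning

Ways to handle missing feature values:  
1. Drop the samples that are missing the feature  
2. Drop the whole feature  
3. Fill the missing values with a specific value (0, the mean, the median, etc.)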
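
The three options above can be sketched directly in pandas. The column names `rooms` and `price` are hypothetical; the calls themselves (`dropna`, `drop`, `fillna`) are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in "rooms"
toy = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# Option 1: drop the samples that are missing the feature
dropped_rows = toy.dropna(subset=["rooms"])

# Option 2: drop the whole feature
dropped_col = toy.drop(columns=["rooms"])

# Option 3: fill the missing values with the median
median = toy["rooms"].median()
filled = toy.assign(rooms=toy["rooms"].fillna(median))

print(len(dropped_rows), list(dropped_col.columns), filled["rooms"].tolist())
```

If option 3 is used, the median should be computed on the training set and kept, so that the same value can later fill missing values in the test set.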

In [None]:
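from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

Create a SimpleImputer instance, which fills missing values with a specific value. The median can only be computed for numerical features, so the imputer is fitted on the numerical columns only (data_num).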

In [None]:
imputer.fit(data_num)
X = imputer.transform(data_num)

fit() learns the statistics (here, the medians) from the data  
transform() applies them to the data, filling in the missing values

In [None]:
imputer.strategy

Returns the value of the imputer's strategy hyperparameter (hyperparameters have no trailing underscore)

In [None]:
imputer.statistics_

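Returns the value of the imputer's learned parameter statistics_ (learned parameters have a trailing underscore)  
This attribute only exists after fit() has been called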
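
As a small check of the fit/transform workflow, the sketch below fits a median imputer on a made-up array (`X_toy` is hypothetical) and shows that `statistics_` holds the per-column medians used for filling.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value per column
X_toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 40.0],
])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_toy)                   # learns the median of each column
X_filled = imputer.transform(X_toy)  # fills each NaN with that median

print(imputer.strategy)     # median
print(imputer.statistics_)  # [ 3. 20.]
print(X_filled)
```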

### Feature scaling

Common feature scalers include MinMaxScaler and StandardScaler

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
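
To compare the two scalers, here is a minimal sketch on a made-up column containing one outlier (`X_toy` is hypothetical). With default settings, MinMaxScaler rescales values to the range [0, 1], while StandardScaler produces zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with an outlier (100.0)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_minmax = MinMaxScaler().fit_transform(X_toy)
X_std = StandardScaler().fit_transform(X_toy)

print(X_minmax.ravel())  # values squeezed into [0, 1]
print(X_std.ravel())     # zero mean, unit variance
```

In this toy data the outlier pushes all other min-max scaled values close to 0, which is one reason standardization is often preferred when outliers are present.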

## Handling text and categorical features

In [None]:
data_encoded, data_categories = data_cat.factorize()  # data_cat: a pandas Series of text categories (placeholder name)

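Converts a Series of text categories into integer categories  
factorize() returns two objects:  
The first is the array of integer codes  
The second holds the labels that the integer codes correspond to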
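
A small illustration of the two return values, using made-up category strings: codes are assigned in order of first appearance, and each code indexes into the label array.

```python
import pandas as pd

# Hypothetical text categories
cats = pd.Series(["INLAND", "NEAR BAY", "INLAND", "ISLAND"])

codes, labels = cats.factorize()

print(codes)         # [0 1 0 2]
print(list(labels))  # ['INLAND', 'NEAR BAY', 'ISLAND']

# Each code indexes into the labels, recovering the original text
recovered = [labels[c] for c in codes]
print(recovered)
```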

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
data_cat_1hot = encoder.fit_transform(data_encoded.reshape(-1,1))

Converts the integer categories above into one-hot vectors  
The default output is a SciPy sparse matrix

In [None]:
data_cat_1hot.toarray()

toarray() converts the sparse matrix into a dense NumPy array
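
The following sketch runs the encoder on a made-up code array (`codes` is hypothetical) to show both the sparse default output and its dense form.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer categories, e.g. the codes returned by factorize()
codes = np.array([0, 1, 0, 2])

encoder = OneHotEncoder()
onehot = encoder.fit_transform(codes.reshape(-1, 1))  # one column, one row per sample

print(sparse.issparse(onehot))  # True
dense = onehot.toarray()
print(dense)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
```

The sparse format stores only the non-zero entries, which saves memory when a feature has many categories.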

## Validating the model

### Cross-validation

In [None]:
import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, data_prepared, data_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

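Cross-validation expects a utility function (greater is better) rather than a loss function (lower is better)  
The scoring function is therefore the negative of the mean squared error (MSE); to compute the root mean squared error (RMSE), multiply the scores by -1 before taking the square root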
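
A runnable sketch of this sign convention, assuming a synthetic regression problem and a plain LinearRegression model (neither comes from the original notebook): the raw scores are all non-positive, and negating them before the square root yields valid RMSE values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for data_prepared / data_labels
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
print(scores)  # negative MSE per fold

rmse_scores = np.sqrt(-scores)  # flip the sign, then take the root
print(rmse_scores)
```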

In [None]:
rmse_scores.mean()  # mean of the cv scores

In [None]:
rmse_scores.std()   # standard deviation of the cv scores

However, cross-validation trains the model several times, which is not always feasible in practice

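### Stratified cross-validation (StratifiedKFold)

This method ensures that the class proportions in each fold are similar to those in the original data  
Using a stochastic gradient descent (SGD) classifier as an example: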

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)

In [None]:
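# Plain cross-validation (with an integer cv and a classifier, cross_val_score already uses stratified folds)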
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")

In [None]:
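# Stratified cross-validation, written out by hand
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

# random_state only takes effect when shuffle=True
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)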

for train_index, test_index in skfolds.split(X_train, y_train):
    clone_clf = clone(sgd_clf)  # fresh, unfitted copy for each fold
    X_train_folds = X_train[train_index]
    y_train_folds = y_train[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # accuracy on this fold
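
To see the stratification itself, the sketch below splits made-up labels with a 75%/25% class balance (`y_toy` and `X_toy` are hypothetical) and measures the share of the minority class in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 90 samples of class 0 and 30 samples of class 1
y_toy = np.array([0] * 90 + [1] * 30)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Fraction of class 1 in the test part of each fold
fold_ratios = [y_toy[test_idx].mean() for _, test_idx in skfolds.split(X_toy, y_toy)]
print(fold_ratios)  # each close to the overall ratio of 0.25
```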

## Saving the model

In [None]:
import joblib
joblib.dump(my_model, "my_model.pkl")  # save the model
# later
my_model_loaded = joblib.load("my_model.pkl")  # load the model
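
A self-contained round trip, assuming a small LinearRegression model as a stand-in for `my_model` and a temporary directory instead of the working directory: the reloaded model should give the same predictions as the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for my_model: fit y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Save to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_model.pkl")
    joblib.dump(model, path)
    loaded = joblib.load(path)

same = np.allclose(model.predict(X), loaded.predict(X))
print(same)  # True
```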

## Tuning the model

### Grid Search

One way to tune a model is to try out different combinations of hyperparameters. Using a random forest as an example:

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set to False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# 5-fold cross-validation, i.e. (12+6)*5=90 rounds of training in total
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(data_prepared, data_labels)

In [None]:
grid_search.best_params_  # best combination of hyperparameters

In [None]:
grid_search.best_estimator_  # model with the best hyperparameters

In [None]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

Prints the evaluation score (RMSE) of every hyperparameter combination
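
The combination count used in the grid search can be verified without training anything. This sketch uses scikit-learn's ParameterGrid to enumerate the same grid and multiplies by the 5 folds.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

n_combos = len(ParameterGrid(param_grid))  # 12 + 6 combinations
n_fits = n_combos * 5                      # one fit per combination per fold (cv=5)
print(n_combos, n_fits)  # 18 90
```

With the default refit=True, GridSearchCV also retrains the best combination once more on the full training data, on top of these cross-validation fits.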

### Randomized Search

When the number of hyperparameter combinations is too large, RandomizedSearchCV is recommended

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(data_prepared, data_labels)

In [None]:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

Prints the RMSE of each hyperparameter combination sampled by RandomizedSearchCV
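
It can help to inspect what a distribution such as `randint(low=1, high=8)` actually produces. The sketch below draws samples directly from SciPy; note that `high` is exclusive, so the values range from 1 to 7.

```python
from scipy.stats import randint

# Same distribution as used for max_features in the search
dist = randint(low=1, high=8)

# Draw samples the way a randomized search would draw candidate values
samples = dist.rvs(size=1000, random_state=42)
print(sorted(set(samples.tolist())))  # distinct values drawn; high=8 is never included
```

Each of the n_iter iterations draws one value per hyperparameter from its distribution, so the search cost is set by n_iter rather than by the size of the search space.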

### Ensemble Methods

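Ensemble methods combine different models  
An ensemble usually performs better than the best individual model,  
especially when the individual models make very different types of errors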
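
The intuition can be sketched with simulated predictions rather than real models. The assumption here is that two models make independent errors of the same size; under that assumption, averaging their predictions lowers the mean squared error below that of either model alone.

```python
import numpy as np

rng = np.random.default_rng(42)
truth = np.linspace(0.0, 10.0, 10_000)

# Simulated predictions: two "models" with independent, equally sized errors
pred_a = truth + rng.normal(0.0, 1.0, truth.size)
pred_b = truth + rng.normal(0.0, 1.0, truth.size)

# Simplest possible ensemble: average the two predictions
pred_avg = (pred_a + pred_b) / 2


def mse(pred):
    """Mean squared error against the known truth."""
    return float(np.mean((pred - truth) ** 2))


print(mse(pred_a), mse(pred_b), mse(pred_avg))
```

If the two models made identical errors, averaging would not help, which is why diversity among the individual models matters.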

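### Analyzing the best models and their errors

Inspecting the best-performing models often yields insights. Using a random forest as an example: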

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances   # feature importance scores

In [None]:
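sorted(zip(feature_importances, attributes), reverse=True)  # attributes: list of feature names, in column order

Displays the feature importance scores alongside the feature names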
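
A self-contained version of this ranking, assuming synthetic data and hypothetical feature names `f0` to `f3` in place of the real `attributes` list:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 4 features, of which only 2 are informative
X, y = make_regression(n_samples=300, n_features=4, n_informative=2,
                       random_state=42)
attributes = ["f0", "f1", "f2", "f3"]  # hypothetical feature names, in column order

forest = RandomForestRegressor(n_estimators=30, random_state=42).fit(X, y)

# Pair each score with its feature name and sort from most to least important
ranked = sorted(zip(forest.feature_importances_, attributes), reverse=True)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```

Features with very low scores are candidates for removal, which can simplify the model.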