### The model_selection.train_test_split function

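#### Basic usage:
train_set, test_set = train_test_split(array, test_size, random_state, shuffle): <br/>
(1) test_size is read as a proportion of the dataset when it is a float between 0 and 1, and as an absolute number of test samples when it is an integer; <br/>
(2) once random_state is fixed, repeated calls produce the same split; <br/>
(3) shuffle=False disables shuffling before the split, which matters for time-series analysis, where the test set has to come after the training set in time; <br/>
(4) more than one array can be passed, in which case all of them are split along the same indices, so corresponding rows stay aligned.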

In [10]:
import numpy as np
from sklearn.model_selection import train_test_split

# Build some dummy data first
X = np.random.randn(20).reshape((10, 2))
y = np.arange(10)
print('X is:')
print(X)
print('y is:')
print(y)

# Then split it
print("result of train_test_split(X, y, test_size=0.2):")
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size=0.2)
print(X_test_1, y_test_1)
print("result of train_test_split(X, y, test_size=0.2) again:")
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size=0.2)
print(X_test_1, y_test_1)
print("result of train_test_split(X, y, test_size=3, random_state=42):")
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X, y, test_size=3, random_state=42)
print(X_test_2, y_test_2)
print("result of train_test_split(X, y, test_size=3, random_state=42) again:")
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X, y, test_size=3, random_state=42)
print(X_test_2, y_test_2)
print("result of train_test_split(X, y, test_size=3, shuffle=False):")
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X, y, test_size=3, shuffle=False)
print(X_test_3, y_test_3)
# With shuffle=False the split is purely deterministic: the last 3 rows become the test set.

X is:
[[-1.04028778  0.54237023]
 [-1.08536776  0.76154737]
 [ 1.04389649  0.54474058]
 [-0.59081939 -0.18466927]
 [ 0.60505079  0.15139904]
 [ 0.37615962  0.21222934]
 [ 0.97109574 -0.967902  ]
 [ 0.34014116  0.14361376]
 [-0.80693205 -0.40253761]
 [ 1.06419556  1.07561342]]
y is:
[0 1 2 3 4 5 6 7 8 9]
result of train_test_split(X, y, test_size=0.2):
[[-0.59081939 -0.18466927]
 [ 0.60505079  0.15139904]] [3 4]
result of train_test_split(X, y, test_size=0.2) again:
[[ 0.34014116  0.14361376]
 [-0.80693205 -0.40253761]] [7 8]
result of train_test_split(X, y, test_size=3, random_state=42):
[[-0.80693205 -0.40253761]
 [-1.08536776  0.76154737]
 [ 0.37615962  0.21222934]] [8 1 5]
result of train_test_split(X, y, test_size=3, random_state=42) again:
[[-0.80693205 -0.40253761]
 [-1.08536776  0.76154737]
 [ 0.37615962  0.21222934]] [8 1 5]
result of train_test_split(X, y, test_size=3, shuffle=False):
[[ 0.34014116  0.14361376]
 [-0.80693205 -0.40253761]
 [ 1.06419556  1.07561342]] [7 8 9]
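
The behaviours listed under basic usage can also be checked programmatically instead of by reading the printed arrays. The following is a minimal sketch; the toy arrays `features` and `target` are made up for illustration (row i of `features` is `[2*i, 2*i + 1]`), so that the pairing between the two arrays is easy to verify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: row i of features is [2*i, 2*i + 1], target i is i.
features = np.arange(20).reshape((10, 2))
target = np.arange(10)

# A float test_size is a fraction: 0.2 of 10 rows gives 2 test rows.
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)
print(f_train.shape, f_test.shape)

# An integer test_size is an absolute number of test rows.
_, f_test_int, _, t_test_int = train_test_split(
    features, target, test_size=3, random_state=0
)
print(f_test_int.shape)

# Both arrays are split with the same indices, so rows stay paired.
print(np.array_equal(f_test[:, 0], 2 * t_test))

# The same integer random_state reproduces the same split.
_, _, _, t_test_repeat = train_test_split(
    features, target, test_size=0.2, random_state=0
)
print(np.array_equal(t_test, t_test_repeat))

# shuffle=False keeps the original order, so the test set is the tail.
_, _, _, t_test_tail = train_test_split(
    features, target, test_size=3, shuffle=False
)
print(t_test_tail)
```

Checking the pairing through the array contents, rather than through the indices, works here only because of how the toy data was constructed.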


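#### Stratified sampling:
split = StratifiedShuffleSplit(n_splits, test_size, random_state): <br/>
(1) this creates a stratified-sampling splitter, which is then applied to concrete data with split.split(array, labels), where labels holds the class label of each row and is the basis for stratification; <br/>
(2) split.split(array, labels) returns an iterable that yields n_splits pairs of (train indices, test indices); indexing the original arrays with them gives the stratified subsets.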
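
One detail worth seeing in isolation: the object returned by `split` can only be iterated once, so it has to be requested again for a second pass. With an integer `random_state`, the second request yields the same index pairs. The sketch below assumes made-up arrays `data` and `strata`; only `StratifiedShuffleSplit` and its `split` method come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 10 rows, 5 rows in each of two strata.
data = np.arange(20).reshape((10, 2))
strata = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1])

splitter = StratifiedShuffleSplit(n_splits=3, test_size=2, random_state=42)

# First pass: one (train indices, test indices) pair per split.
pairs = splitter.split(data, strata)
first_pass = [(train.tolist(), test.tolist()) for train, test in pairs]
print(len(first_pass))

# The same iterator is now exhausted and yields nothing more.
second_pass = list(pairs)
print(second_pass)

# Calling split() again starts over; an integer random_state
# makes the new pass reproduce the first one.
again = [(train.tolist(), test.tolist())
         for train, test in splitter.split(data, strata)]
print(again == first_pass)
```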

In [19]:
from sklearn.model_selection import StratifiedShuffleSplit

X_cat = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1])

split = StratifiedShuffleSplit(n_splits=1, test_size=2, random_state=42)
split_X = split.split(X, X_cat)
print('content of split.split(X, X_cat) is:')
print(list(split_X))
print('Looping over the index pairs gives the sampled subsets:')
for train_index, test_index in split.split(X, X_cat):
    strat_X_train = X[train_index, :]
    strat_X_test = X[test_index, :]
    strat_y_train = y[train_index]
    strat_y_test = y[test_index]

print(f'Split of X: {strat_X_train}{strat_X_test}')
print(f'Split of y: {strat_y_train}{strat_y_test}')

content of split.split(X, X_cat) is:
[(array([6, 8, 1, 0, 4, 3, 5, 2]), array([7, 9]))]
Looping over the index pairs gives the sampled subsets:
Split of X: [[ 0.97109574 -0.967902  ]
 [-0.80693205 -0.40253761]
 [-1.08536776  0.76154737]
 [-1.04028778  0.54237023]
 [ 0.60505079  0.15139904]
 [-0.59081939 -0.18466927]
 [ 0.37615962  0.21222934]
 [ 1.04389649  0.54474058]][[0.34014116 0.14361376]
 [1.06419556 1.07561342]]
Split of y: [6 8 1 0 4 3 5 2][7 9]
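
When only a single stratified split is needed, `train_test_split` can also do it directly through its `stratify` parameter. This is an alternative to the loop, not the approach used in the cell. The sketch below uses made-up arrays `features` and `classes`, and counts the classes on each side with `np.bincount` to confirm that the 5:5 class balance carries over to both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data with two balanced classes (5 rows each).
features = np.arange(20).reshape((10, 2))
classes = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1])

# stratify=classes asks for the class proportions to be preserved
# in both the training and the test subset.
f_train, f_test, c_train, c_test = train_test_split(
    features, classes, test_size=2, stratify=classes, random_state=42
)

# Count how many rows of each class ended up on each side.
train_counts = np.bincount(c_train, minlength=2)
test_counts = np.bincount(c_test, minlength=2)
print(train_counts, test_counts)
```

With 2 test rows and two equally sized classes, each class contributes one row to the test set and four to the training set.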
