### 1. Split raw data into training and test sets
- train_test_split (scikit-learn): http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split
- X, y: Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
- test_size: If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size.
- shuffle: boolean, optional (default=True). Whether or not to shuffle the data before splitting.
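
As a quick illustration of the parameters listed above, the sketch below runs `train_test_split` on a small toy array. The toy data and the variable names (`X_toy`, `y_toy`, and so on) are made up for this example and are not part of the iris walkthrough that follows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical, for illustration only): 10 samples, 2 features.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Float test_size: a proportion of the dataset (0.2 of 10 samples gives 2 test samples).
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# Int test_size: an absolute number of test samples.
X_tr_int, X_te_int = train_test_split(X_toy, test_size=4, random_state=0)
print(X_tr_int.shape, X_te_int.shape)

# shuffle=False: no shuffling, so the split keeps the original order
# (the first samples go to training, the last ones to test).
y_tr_ord, y_te_ord = train_test_split(y_toy, test_size=0.2, shuffle=False)
print(y_tr_ord, y_te_ord)
```

With `shuffle=False` the test set is simply the tail of the input, which can matter for ordered data such as time series.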

In [9]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn import datasets

In [2]:
iris = datasets.load_iris()
print('total raw data:', iris.data.shape, iris.target.shape)

total raw data: (150, 4) (150,)


In [7]:
iris.data[:5]

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])

In [8]:
iris.target[:5]

array([0, 0, 0, 0, 0])

#### 1.1 We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:

In [3]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

print('training set:', X_train.shape, y_train.shape)

print('test set:', X_test.shape, y_test.shape)

training set: (90, 4) (90,)
test set: (60, 4) (60,)


In [16]:
eles_n1, counts_n1 = np.unique(y_train, return_counts=True)
eles_n1, counts_n1

(array([0, 1, 2]), array([34, 27, 29]))

- Non-stratified sampling: the proportion of each class in the training set does not match its proportion in the whole dataset.

In [17]:
counts_n1/len(y_train)

array([ 0.37777778,  0.3       ,  0.32222222])

#### 1.2 Split only X

In [4]:
X_train2, X_test2 = train_test_split(iris.data, test_size=0.2)

print('training set:', X_train2.shape)

print('test set:', X_test2.shape)

training set: (120, 4)
test set: (30, 4)
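
The split in 1.2 does not pass `random_state`, so the rows that land in each set can differ from run to run. A minimal sketch, assuming reproducibility is wanted: passing the same integer seed twice should give identical splits. The seed value 42 and the variable names here are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Same integer seed, same shuffle, so the two splits are identical.
X_a_train, X_a_test = train_test_split(iris.data, test_size=0.2, random_state=42)
X_b_train, X_b_test = train_test_split(iris.data, test_size=0.2, random_state=42)

print(np.array_equal(X_a_train, X_b_train))
print(np.array_equal(X_a_test, X_b_test))
```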


#### 1.3 Stratified sampling
- StratifiedShuffleSplit
- http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html

In [13]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_inx, test_inx in split.split(iris.data, iris.target):  # stratify on iris.target (the class labels)
    X_train3 = iris.data[train_inx]
    X_test3 = iris.data[test_inx]
    y_train3 = iris.target[train_inx]
    y_test3 = iris.target[test_inx]

In [11]:
eles, counts = np.unique(iris.target, return_counts=True)
eles, counts

(array([0, 1, 2]), array([50, 50, 50]))

In [12]:
counts/len(iris.data)

array([ 0.33333333,  0.33333333,  0.33333333])

In [14]:
eles_n, counts_n = np.unique(y_train3, return_counts=True)
eles_n, counts_n

(array([0, 1, 2]), array([40, 40, 40]))

- Stratified sampling: the proportion of each class in the training set is the same as in the whole dataset.

In [15]:
counts_n/len(y_train3)

array([ 0.33333333,  0.33333333,  0.33333333])
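
As an alternative sketch, `train_test_split` also accepts a `stratify` argument, which may be more convenient than looping over `StratifiedShuffleSplit` when only one split is needed. The variable names below (`X_train4`, `counts_s`, and so on) are introduced just for this example; the exact rows selected may differ from the split above, but the class proportions should be preserved in the same way.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# stratify=iris.target keeps the class proportions of the labels in both splits.
X_train4, X_test4, y_train4, y_test4 = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

# Count how many samples of each class ended up in the training and test sets.
classes_s, counts_s = np.unique(y_train4, return_counts=True)
classes_t, counts_t = np.unique(y_test4, return_counts=True)
print(classes_s, counts_s)
print(counts_s / len(y_train4))
```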