# Training and test data splitting
---

## Import data

The `sklearn.datasets` package includes a few small toy datasets.

In [1]:
from sklearn.datasets import load_iris

In [2]:
X, y = load_iris(return_X_y=True)

In [4]:
X.shape

(150, 4)

In [5]:
y.shape

(150,)

## Split data into training and test sets

The following code splits the arrays `X` and `y` into training and test sets in a 70:30 ratio: `X_train` and `y_train` receive 70% of the samples, and `X_test` and `y_test` receive the remaining 30%.

In [6]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [9]:
X_train.shape

(105, 4)

In [10]:
X_test.shape

(45, 4)

In [11]:
y_train.shape

(105,)

In [12]:
y_test.shape

(45,)
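Because `train_test_split()` shuffles the data before splitting, each run produces a different partition unless the shuffle is seeded. The following is a minimal sketch of a reproducible split using the `random_state` parameter; the seed value `0` and the `_a`/`_b` variable names are arbitrary choices for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two calls with the same random_state (0 is an arbitrary seed)
# shuffle the samples in the same order.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.3, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Check that both calls produced the same test set.
same_split = (np.array_equal(X_test_a, X_test_b)
              and np.array_equal(y_test_a, y_test_b))
print(same_split)
```

Fixing the seed is useful when results need to be compared across runs; leaving it unset gives a fresh random split each time.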

### Stratified splitting

`train_test_split()` can perform a stratified split through its `stratify` parameter. A stratified split keeps the proportion of each class in the resulting training and test sets the same as in the array passed to `stratify` (here, the labels `y`).

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

In [24]:
import collections
collections.Counter(y)

Counter({0: 50, 1: 50, 2: 50})

In [25]:
collections.Counter(y_train)

Counter({0: 35, 1: 35, 2: 35})

In [26]:
collections.Counter(y_test)

Counter({0: 15, 1: 15, 2: 15})
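The counts above can also be read as proportions. The sketch below computes the share of each class in `y`, `y_train`, and `y_test` after a stratified split; `class_proportions` is a hypothetical helper introduced here for illustration, and the seed `0` is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)  # arbitrary seed


def class_proportions(labels):
    """Return the fraction of samples belonging to each integer class label."""
    counts = np.bincount(labels)
    return counts / counts.sum()


# With stratification, all three arrays should show the same class shares.
print(class_proportions(y))
print(class_proportions(y_train))
print(class_proportions(y_test))
```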

### Comparison with a non-stratified split

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [33]:
collections.Counter(y)

Counter({0: 50, 1: 50, 2: 50})

In [34]:
collections.Counter(y_train)

Counter({0: 34, 1: 38, 2: 33})

In [35]:
collections.Counter(y_test)

Counter({0: 16, 1: 12, 2: 17})
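A single split shows only one possible outcome. As a rough sketch, the comparison can be repeated over a few seeds to see how the test-set class counts behave with and without `stratify`; the seeds `0` to `4` and the list names are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

stratified_counts = []
plain_counts = []
for seed in range(5):  # seeds 0..4 are arbitrary
    # Only the test labels are needed for this comparison.
    _, _, _, y_test_s = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    _, _, _, y_test_p = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    stratified_counts.append(np.bincount(y_test_s, minlength=3).tolist())
    plain_counts.append(np.bincount(y_test_p, minlength=3).tolist())

# Print per-seed test-set class counts: stratified first, then non-stratified.
for seed, (s, p) in enumerate(zip(stratified_counts, plain_counts)):
    print(seed, s, p)
```

On this balanced dataset the stratified test set holds 15 samples of each class for every seed, while the non-stratified counts are free to drift from seed to seed.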