Splitting your data so you can validate your model's performance is Data Science 101 stuff. Thankfully, sklearn comes with some pretty robust batteries-included approaches to doing that.

## Load a Dataset

Here we'll use the Iris Dataset.

In [1]:
from sklearn.datasets import load_iris

data = load_iris()
data.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

In [2]:
X = data['data']
y = data['target']

In [3]:
X.shape, y.shape

((150, 4), (150,))

## Vanilla Split

In [4]:
from sklearn.model_selection import train_test_split

Say we wanted to split our data 70/30. We'd just use the `test_size=0.3` argument.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [6]:
[arr.shape for arr in (X_train, X_test, y_train, y_test)]

[(105, 4), (45, 4), (105,), (45,)]
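The split shuffles the rows at random, so which rows land in each set changes from run to run. If you want a repeatable split, a minimal sketch (assuming a fixed seed is acceptable for your use case) is to pass an integer `random_state`. The value `42` and the names `first` and `second` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# The seed value is arbitrary; any fixed integer gives a repeatable shuffle.
first = train_test_split(X, y, test_size=0.3, random_state=42)
second = train_test_split(X, y, test_size=0.3, random_state=42)

# Same seed, same rows in every returned array.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)
```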

But this presupposes that we've already broken our data out into `X` and `y`. What if, instead, we started with a table of data and wanted to keep it that way?

In [7]:
import numpy as np

values = np.c_[X, y]
values.shape

(150, 5)

The `train_test_split` function can handle that just fine.

In [8]:
train_values, test_values = train_test_split(values, test_size=0.3)

In [9]:
[arr.shape for arr in (train_values, test_values)]

[(105, 5), (45, 5)]
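The same goes if your table is a pandas `DataFrame` rather than a NumPy array. Here's a rough sketch, assuming we build the frame ourselves from the Iris data; the `target` column name and the `df`, `train_df`, and `test_df` variables are made up for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()

# Build one table: four feature columns plus a label column.
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']

# Passing a DataFrame in gives DataFrames back, column names included.
train_df, test_df = train_test_split(df, test_size=0.3)

print(type(train_df).__name__, train_df.shape, test_df.shape)
```

Each piece keeps the original row index, so you can still trace a row back to where it sat in `df`.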

## Stratification

One thing to note here is the effect of our sampling on each class. Taken as a whole, our base dataset has a perfectly equal distribution of each kind of flower.

In [10]:
import pandas as pd

valuesDf = pd.DataFrame(values)
trainDf = pd.DataFrame(train_values)
testDf = pd.DataFrame(test_values)

In [11]:
valuesDf[4].value_counts().sort_index()

0.0    50
1.0    50
2.0    50
Name: 4, dtype: int64

However, as a result of our `train_test_split`, the class distribution is now skewed in both our train and our test datasets.

In [12]:
trainDf[4].value_counts().sort_index() / len(trainDf)

0.0    0.323810
1.0    0.361905
2.0    0.314286
Name: 4, dtype: float64

In [13]:
testDf[4].value_counts().sort_index() / len(testDf)

0.0    0.355556
1.0    0.266667
2.0    0.377778
Name: 4, dtype: float64

If we were working with a massive amount of data, we could reasonably expect a random sample to land close to the original distribution, but with a meager *150 rows of data*, we want to be careful about our sampling.

The `StratifiedShuffleSplit` object takes the typical "how do you want to split your data" arguments at instantiation.

In [14]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.3)

But then it has its own `split` method where you specify what `X` you're splitting and, more importantly, what `y` it should preserve the distribution of.

In [15]:
# n_splits=1, so this loop body runs exactly once
for train_index, test_index in split.split(X=values, y=values[:, 4]):
    strat_train_set = values[train_index]
    strat_test_set = values[test_index]

That's more like it.

In [16]:
pd.DataFrame(strat_test_set)[4].value_counts()

0.0    15
2.0    15
1.0    15
Name: 4, dtype: int64

In [17]:
pd.DataFrame(strat_train_set)[4].value_counts()

0.0    35
1.0    35
2.0    35
Name: 4, dtype: int64
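If a single stratified split is all you need, `train_test_split` also accepts a `stratify` argument. Here's a minimal sketch that rebuilds the same `values` table so the block runs on its own; the names `strat_train` and `strat_test` are just for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Rebuild the combined table: four feature columns plus the label in column 4.
X, y = load_iris(return_X_y=True)
values = np.c_[X, y]

# stratify takes the labels whose proportions should be preserved in both sets.
strat_train, strat_test = train_test_split(
    values, test_size=0.3, stratify=values[:, 4]
)

# Count rows per class in each set.
train_counts = np.unique(strat_train[:, 4], return_counts=True)[1]
test_counts = np.unique(strat_test[:, 4], return_counts=True)[1]
print(train_counts, test_counts)
```

`StratifiedShuffleSplit` is still the better fit when you want several stratified splits from one object, since `n_splits` can be larger than 1.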