<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Classification" data-toc-modified-id="Classification-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Classification</a></span><ul class="toc-item"><li><span><a href="#Create-some-data" data-toc-modified-id="Create-some-data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Create some data</a></span></li><li><span><a href="#Get-frequencies" data-toc-modified-id="Get-frequencies-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Get frequencies</a></span></li><li><span><a href="#Train-test-split-without-stratification" data-toc-modified-id="Train-test-split-without-stratification-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Train-test split without stratification</a></span></li><li><span><a href="#Train-Test-split-with-Stratification-of-y-(target)" data-toc-modified-id="Train-Test-split-with-Stratification-of-y-(target)-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Train-Test split with Stratification of y (target)</a></span></li><li><span><a href="#Train-Test-split--on-other-columns" data-toc-modified-id="Train-Test-split--on-other-columns-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Train Test split  on other columns</a></span></li><li><span><a href="#Stratification-of-X-multiple-columns" data-toc-modified-id="Stratification-of-X-multiple-columns-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Stratification of X multiple columns</a></span></li></ul></li></ul></div>

# Note on Stratification

## Introduction

Train-test splitting, as well as cross-validation, has the option to 'stratify'.
'Stratification seeks to ensure that each fold is representative of all strata of the data. Generally this is done in a supervised way for classification and aims to ensure each class is (approximately) equally represented across each test fold (which are of course combined in a complementary way to form training folds).' (https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation)
In a large, well-balanced data set for a regression or binary classification problem this option is unlikely to be needed. However, when the data set is small or the classes are imbalanced, it is a nifty feature.
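
To make the fold-level claim concrete, here is a minimal sketch, assuming a made-up binary target with 10% positives (`y_demo` and `X_demo` are illustrative names, not part of this notebook's data). It uses scikit-learn's `StratifiedKFold` and counts the positives that land in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced binary target: 90 negatives, 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many positives end up in each test fold.
positives_per_fold = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    positives_per_fold.append(int(y_demo[test_idx].sum()))

print(positives_per_fold)
```

With 10 positives and 5 folds, each test fold should receive 2 positives. A plain `KFold` gives no such guarantee, so a fold could end up with very few or no positives.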

## Classification

For classification, the target variable is typically the one that is stratified. The train and test sets then retain (approximately) the original class proportions.


### Create some data

In [263]:
# Create data objects from the iris dataset
# (a data set for multiclass classification)
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


def multiclass():
    data_multiclass = datasets.load_iris()
    df = pd.DataFrame(data_multiclass['data'],
                      columns=data_multiclass['feature_names'])
    df['ycol'] = data_multiclass['target']
    X = df.drop('ycol', axis=1)
    y = df['ycol'].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    ss = StandardScaler()
    X_train = (ss.fit_transform(X_train))
    X_test = ss.transform(X_test)
    X = pd.DataFrame(ss.transform(X), columns=df.drop('ycol', axis=1).columns)
#     print('Iris Dataset Data has been created for multiclass ML')
#     print(f'Shape df {df.shape}')
#     print(f'Shape X {X.shape}')
#     print(f'Shape y {y.shape}')
#     print(f'Shape X_train {X_train.shape}')
#     print(f'Shape y_train {y_train.shape}')
#     print(f'Shape X_test {X_test.shape}')
#     print(f'Shape y_test {y_test.shape}')
#     print(f'Mean X {X_train.mean()}, should be close to 0')
#     print(f'Std X {X_train.std()}, should be close to 1')
#     print(f'Available columns {df.columns.values}')
    return df, X, y, X_train, y_train, X_test, y_test

In [267]:
df, X, y, X_train, y_train, X_test, y_test = multiclass()

In [268]:
# abusing make_classification to create 2 extra columns with severely imbalanced classes
# (only the labels y_ are used; the generated features X_ are discarded)
from sklearn.datasets import make_classification
X_, y_ = make_classification(n_samples=len(X), 
                             n_features=20, 
                             n_informative=4, 
                             n_redundant=2, 
                             n_repeated=0, 
                             n_classes=6, 
                             n_clusters_per_class=2, 
                             weights=[.05,.05,.1,.1,.2,.5], 
                             flip_y=0.01, 
                             class_sep=1.0, 
                             hypercube=True, 
                             shift=0.0, 
                             scale=1.0, 
                             shuffle=True, 
                             random_state=None)

X['extra'], X['extra1']= y_, y_+1

### Get frequencies

In [269]:
# value proportions 'extra1' (fractions of all rows, not percentages)
print('extra1\tProp.')
for i in range(1, 7):
    print(f'{i}\t{round(X["extra1"].value_counts(sort=False)[i] / X.shape[0], 2)}')

extra1	Prop.
1	0.05
2	0.05
3	0.1
4	0.1
5	0.21
6	0.49


In [272]:
# value proportions 'extra'
round(X["extra"].value_counts(sort=False)/X.shape[0],2)

0    0.05
1    0.05
2    0.10
3    0.10
4    0.21
5    0.49
Name: extra, dtype: float64

In [276]:
# value proportions 'y'
round(pd.Series(y).value_counts(sort=False)/X.shape[0],2)

0    0.33
1    0.33
2    0.33
dtype: float64
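
The frequency checks in this notebook repeat the same expression for every column. As a convenience, a small helper could wrap it. This is only a sketch: `proportions` is a hypothetical name, and it relies on pandas' `value_counts(normalize=True)` instead of dividing by the row count by hand. It also sorts by value, whereas `value_counts(sort=False)` does not guarantee that order.

```python
import pandas as pd


def proportions(values, decimals=2):
    """Return the share of each distinct value, sorted by value."""
    shares = pd.Series(values).value_counts(normalize=True)
    return shares.sort_index().round(decimals)


# Small illustrative input: two of the four entries are the value 2.
demo_shares = proportions([0, 1, 2, 2])
print(demo_shares)
```

In the cells of this notebook it could be called as `proportions(y)` or `proportions(X_train['extra'])`.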

### Train-test split without stratification

In [283]:
#train test split, stratify = None
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=None, test_size=0.33, 
                                                    random_state=2018)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(100, 6) (100,)
(50, 6) (50,)


In [284]:
# value proportions 'y', stratify = None
round(pd.Series(y_train).value_counts(sort=False)/X_train.shape[0],2)

0    0.31
1    0.36
2    0.33
dtype: float64

In [285]:
# value proportions 'extra'
round(X_train["extra"].value_counts(sort=False)/X_train.shape[0],2)

0    0.05
1    0.05
2    0.07
3    0.09
4    0.22
5    0.52
Name: extra, dtype: float64

In [286]:
# value proportions 'extra1'
round(X_train['extra1'].value_counts(sort=False)/X_train.shape[0],2)

1    0.05
2    0.05
3    0.07
4    0.09
5    0.22
6    0.52
Name: extra1, dtype: float64

### Train-Test split with Stratification of y (target)

In [287]:
#train test split, stratify = y
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, 
                                                    random_state=2018)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(100, 6) (100,)
(50, 6) (50,)


In [288]:
# value proportions 'y', stratify = y
round(pd.Series(y_train).value_counts(sort=False)/X_train.shape[0],2)

0    0.33
1    0.34
2    0.33
dtype: float64

In [289]:
# value proportions 'extra', stratify = y
round(X_train["extra"].value_counts(sort=False)/X_train.shape[0],2)

0    0.05
1    0.05
2    0.09
3    0.05
4    0.21
5    0.55
Name: extra, dtype: float64

In [290]:
# value proportions 'extra1', stratify = y
round(X_train['extra1'].value_counts(sort=False)/X_train.shape[0],2)

1    0.05
2    0.05
3    0.09
4    0.05
5    0.21
6    0.55
Name: extra1, dtype: float64

### Train Test split  on other columns

Stratifying on a feature column instead of the target unfortunately loses the stratification on y, but this may not be a problem. Why the heck would you do this? Well, some features may warrant equal representation in both sets. Features like gender or age group spring to mind.

In [300]:
#train test split, stratify = X['extra']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=X['extra'], test_size=0.3, 
                                                    random_state=2018)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(105, 6) (105,)
(45, 6) (45,)


In [295]:
# value proportions 'y', stratify = 'extra'
round(pd.Series(y_train).value_counts(sort=False)/X_train.shape[0],2)

0    0.37
1    0.34
2    0.29
dtype: float64

In [296]:
# value proportions 'extra', stratify = 'extra'
round(X_train['extra'].value_counts(sort=False)/X_train.shape[0],2)

0    0.05
1    0.05
2    0.10
3    0.10
4    0.21
5    0.50
Name: extra, dtype: float64

In [297]:
# value proportions 'extra1', stratify = 'extra'
round(X_train["extra1"].value_counts(sort=False)/X_train.shape[0],2)

1    0.05
2    0.05
3    0.10
4    0.10
5    0.21
6    0.50
Name: extra1, dtype: float64
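
If both the target and a feature should keep their proportions, one possible workaround is to stratify on a combined label. The sketch below assumes made-up data (`y_demo`, `group` and `X_demo` are illustrative names, not this notebook's variables) and builds one stratum label per row from the target and the feature.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 3 balanced classes, each with 32 rows of group 'a'
# and 8 rows of group 'b' (120 rows in total).
y_demo = np.repeat([0, 1, 2], 40)
group = np.tile(['a'] * 32 + ['b'] * 8, 3)
X_demo = pd.DataFrame({'feature': np.arange(120), 'group': group})

# One stratum label per row, e.g. '0_a' or '2_b'.
strata = np.array([f'{target}_{grp}' for target, grp in zip(y_demo, group)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=strata, test_size=0.25, random_state=2018)

# Both the class shares and the group shares should match the full data.
print(pd.Series(y_tr).value_counts(normalize=True).sort_index().round(2))
print(X_tr['group'].value_counts(normalize=True).sort_index().round(2))
```

The catch is that every combination of target and feature value becomes its own stratum, so each combination needs enough rows to be divided between the two sets.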

### Stratification of X multiple columns

In [301]:
#train test split, stratify = X[['extra', 'extra1']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=X[['extra', 'extra1']], test_size=0.3, 
                                                    random_state=2018)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(105, 6) (105,)
(45, 6) (45,)


In [302]:
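# value proportions 'extra', stratify = ['extra', 'extra1']
# note: rounding to 1 decimal shows the classes of about 0.05 as 0.0
round(X_train['extra'].value_counts(sort=False)/X_train.shape[0],1)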

0    0.0
1    0.0
2    0.1
3    0.1
4    0.2
5    0.5
Name: extra, dtype: float64

In [303]:
# value proportions 'extra1', stratify = ['extra', 'extra1']
round(X_train['extra1'].value_counts(sort=False)/X_train.shape[0],2)

1    0.05
2    0.05
3    0.10
4    0.10
5    0.21
6    0.50
Name: extra1, dtype: float64

In [304]:
X_train.shape[0]

105
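
When several columns are passed to `stratify`, each distinct combination of values acts as one stratum, and the split fails if any combination occurs only once. A minimal sketch of checking this up front, assuming a made-up frame `demo` with illustrative columns `a` and `b`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: the combination (a=0, b=1) occurs only once.
demo = pd.DataFrame({
    'a': [0] * 10 + [1] * 10,
    'b': [0] * 9 + [1] * 1 + [0] * 5 + [1] * 5,
})

# Size of every stratum, i.e. every (a, b) combination.
stratum_sizes = demo.groupby(['a', 'b']).size()
print(stratum_sizes)

# A stratum with a single row cannot be shared between train and test.
try:
    train_test_split(demo, stratify=demo[['a', 'b']],
                     test_size=0.3, random_state=0)
    split_ok = True
except ValueError as err:
    split_ok = False
    print(f'split failed: {err}')
```

Checking `stratum_sizes.min()` before splitting shows whether rare combinations need to be merged or whether fewer columns should be used.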

In [226]:
# add a column 'test' with uniformly random integers from 0 to 5 (unseeded)
import random
X['test'] = [random.randint(0, 5) for _ in range(X.shape[0])]

In [227]:
round(X['test'].value_counts(sort=False)/X.shape[0],2)

0    0.19
1    0.13
2    0.19
3    0.13
4    0.19
5    0.16
Name: test, dtype: float64

In [None]:
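# Regression: stratify on a binned copy of a continuous target.
# This cell assumes `y` is a continuous target (the original example
# had 506 observations) and `df` holds the matching feature rows.
import numpy as np

# Create the bin edges for 50 bins. The edges must span the range
# of the y values, not the number of observations.
bins = np.linspace(y.min(), y.max(), 50 + 1)[1:-1]

# Save the bin index of every y value in a new ndarray.
y_binned = np.digitize(y, bins)

# Pass y_binned to the stratify argument, and sklearn will handle
# the rest. Every bin needs at least 2 members, so lower the number
# of bins if train_test_split raises a ValueError.
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, stratify=y_binned)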
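
Equal-width bins can leave some bins almost empty when the target is skewed. One possible alternative, swapped in here for illustration, is to use quantile-based bin edges so that every bin holds roughly the same number of rows. The sketch below runs on synthetic data; `y_cont`, `X_cont` and `n_bins` are assumptions, not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed continuous target with 500 observations.
rng = np.random.default_rng(0)
y_cont = rng.lognormal(mean=0.0, sigma=1.0, size=500)
X_cont = rng.normal(size=(500, 3))

# Interior quantile edges: 10 bins with about 50 rows each.
n_bins = 10
edges = np.quantile(y_cont, np.linspace(0, 1, n_bins + 1)[1:-1])
y_bins = np.digitize(y_cont, edges)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_cont, y_cont, test_size=0.3, stratify=y_bins, random_state=0)

# Rows per bin in the full data and in the test set.
print(np.bincount(y_bins))
print(np.bincount(np.digitize(y_te, edges)))
```

The binned labels are only used to steer the split. The model is still trained on the continuous `y_tr`.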
