<h1 style="color:White;font-size:170%;">New York City Taxi Fare Prediction</h1>
In this notebook we cover how to split a dataset more effectively using scikit-learn's `StratifiedShuffleSplit` class.

## Import libraries

In [22]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

In [10]:
data = pd.read_csv(r"datasets/avocado.csv")

## Explore the content of data

In [11]:
data.head()

| Unnamed: 0.1 | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 4 | 2015-11-29 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |


## Split train and test data

In [14]:
train_data, test_data = train_test_split(data, test_size = 0.02)

In [15]:
data['price_category'] = pd.cut(data['AveragePrice'],
                                bins= [0., 0.7, 1.2, 1.6, 2.5, 3., np.inf],
                                labels = [1, 2, 3, 4, 5, 6])
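
To make the binning step concrete, here is a minimal sketch of how `pd.cut` maps prices to categories with the same bin edges. The prices are made-up values chosen to land in each bin; they are not taken from the avocado dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, one or more per bin (illustrative only).
prices = pd.Series([0.5, 0.7, 1.0, 1.5, 2.0, 2.8, 3.2])

# Bins are right-inclusive by default: (0, 0.7], (0.7, 1.2], ..., (3, inf)
categories = pd.cut(prices,
                    bins=[0., 0.7, 1.2, 1.6, 2.5, 3., np.inf],
                    labels=[1, 2, 3, 4, 5, 6])
print(categories.tolist())
```

Because the intervals are closed on the right, a price of exactly 0.7 falls into category 1 rather than category 2.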

In [16]:
train_data, test_data = train_test_split(data, test_size = 0.02)

In [19]:
train_data['price_category'].value_counts() / len(train_data)

3    0.353668
2    0.339354
4    0.280698
1    0.015153
5    0.010792
6    0.000335
Name: price_category, dtype: float64

In [20]:
test_data['price_category'].value_counts()/len(test_data)

3    0.369863
2    0.317808
4    0.295890
5    0.010959
1    0.005479
6    0.000000
Name: price_category, dtype: float64

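The proportions are similar but not identical: category 1 makes up about 1.5% of the training set but only about 0.5% of the test set, and category 6 does not appear in the test set at all. That can be good enough for some applications, but for others we want the proportions to agree to at least two decimal places. For this, we can use a stratified shuffle split.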
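
One way to quantify how far apart the two splits are is to put the proportions side by side and take their absolute difference. The sketch below does this on synthetic labels; the `toy` frame, its class probabilities, and the `proportion_gap` helper are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the avocado data: imbalanced labels 1..4 (made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "price_category": rng.choice([1, 2, 3, 4], size=2000, p=[0.02, 0.38, 0.35, 0.25])
})

def proportion_gap(train, test, column):
    """Return per-category proportions in each split and their absolute difference."""
    table = pd.DataFrame({
        "train": train[column].value_counts(normalize=True),
        "test": test[column].value_counts(normalize=True),
    }).fillna(0.0)  # a category missing from one split counts as 0
    table["abs_diff"] = (table["train"] - table["test"]).abs()
    return table.sort_index()

# Plain random split with a small test set, as in the cells above.
toy_train, toy_test = train_test_split(toy, test_size=0.02, random_state=0)
gap = proportion_gap(toy_train, toy_test, "price_category")
print(gap)
```

With a small test set, rare categories are the ones most likely to show a large gap or to be missing entirely.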

## Split with the StratifiedShuffleSplit method

In [23]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2)

# split() yields positional indices, so select rows with .iloc
for train_ids, test_ids in split.split(data, data['price_category']):
    train_data = data.iloc[train_ids]
    test_data = data.iloc[test_ids]
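
The arrays returned by `split.split` contain row positions rather than index labels. With the default integer index the two coincide, but on a frame with any other index, positional selection is the safe choice. The sketch below uses a small made-up frame with a string index to show this; the names `frame`, `splitter`, `train_pos` and `test_pos` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy frame with a non-default index (illustrative; not the avocado data).
frame = pd.DataFrame(
    {"value": np.arange(10), "label": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]},
    index=list("abcdefghij"),
)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
# With n_splits=1, next() retrieves the single (train, test) pair of index arrays.
train_pos, test_pos = next(splitter.split(frame, frame["label"]))

# The arrays hold row positions, so .iloc is the matching accessor.
pos_train = frame.iloc[train_pos]
pos_test = frame.iloc[test_pos]
print(pos_test)
```

Passing `random_state` is optional; it is used here only so that the split is reproducible.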

In [24]:
train_data['price_category'].value_counts() / len(train_data)

3    0.353997
2    0.338927
4    0.280978
1    0.014933
5    0.010823
6    0.000342
Name: price_category, dtype: float64

In [25]:
test_data['price_category'].value_counts()/len(test_data)

3    0.353973
2    0.338904
4    0.281096
1    0.015068
5    0.010685
6    0.000274
Name: price_category, dtype: float64

With this method, the category proportions in the train and test sets agree to within about 0.0002, so neither set is skewed relative to the other.
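
When only a single split is needed, `train_test_split` also accepts a `stratify` argument, which gives a stratified split in one call. The sketch below uses synthetic data; the `demo` frame and its class probabilities are assumptions for illustration, not values from the avocado dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for `price_category` (illustrative only).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "AveragePrice": rng.uniform(0.5, 3.0, size=1000),
    "price_category": rng.choice([1, 2, 3], size=1000, p=[0.05, 0.60, 0.35]),
})

# `stratify` keeps the label proportions the same in both splits;
# `random_state` makes the split reproducible.
demo_train, demo_test = train_test_split(
    demo, test_size=0.2, stratify=demo["price_category"], random_state=42
)

train_props = demo_train["price_category"].value_counts(normalize=True).sort_index()
test_props = demo_test["price_category"].value_counts(normalize=True).sort_index()
max_gap = (train_props - test_props).abs().max()
print(max_gap)
```

Stratification requires at least two rows in every category, so very rare categories may need to be merged or widened before splitting.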