
# Tutorial 8A Data Sampling:
In the second part of this activity, we discuss data sampling. Simple random sampling is easy to implement, so this exercise focuses on stratified sampling, where each class keeps roughly the same share in the sample as in the full data.
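
Before turning to the wine data, the idea behind stratified sampling can be sketched with the standard library alone. This is a minimal illustration, not the method used later in the tutorial: the function name `stratified_sample`, the toy labels and the fixed seed are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, frac, seed=0):
    """Return sorted positions chosen so each label keeps roughly `frac` of its items."""
    rng = random.Random(seed)
    # Group the positions of the records by their label (one group per stratum).
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)
    chosen = []
    for positions in by_label.values():
        k = round(frac * len(positions))  # sample size for this stratum
        chosen.extend(rng.sample(positions, k))
    return sorted(chosen)


labels = ["a"] * 80 + ["b"] * 20  # toy labels, not the wine data
picked = stratified_sample(labels, frac=0.5)
print(Counter(labels[i] for i in picked))
```

With these toy labels the sample holds 40 records of class "a" and 10 of class "b", so the 80/20 ratio is preserved. A simple random sample of the same size would only match that ratio on average.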



We load the wine.csv file. Please note that this wine data is different from the one used later in this module.


In [1]:
import pandas as pd
import numpy as np
# UCI's wine dataset
wine = pd.read_csv("wine.csv")
wine.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color,is_red,high_quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,1.0,0.0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,1.0,0.0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,1.0,0.0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red,1.0,0.0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,1.0,0.0


In [2]:
wine.shape

(6497, 15)

In [3]:
wine.quality.value_counts()

6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
Name: quality, dtype: int64

Let's take the first 100 records for easy demonstration.

In [4]:
wine=wine.iloc[0:100,:]

In [5]:
wine.quality.value_counts()

5    66
6    22
4     7
7     5
Name: quality, dtype: int64


## Random sampling:  

Let's take a random sample of 50 records (half of the 100 rows).

In [6]:
wine.sample(frac=0.5).quality.value_counts()

5    32
6    13
4     3
7     2
Name: quality, dtype: int64

What did you find?
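
One way to approach this question is to compare class proportions rather than raw counts. The sketch below is only an illustration: it assumes a stand-in DataFrame rebuilt from the counts printed earlier (66, 22, 7 and 5 records for qualities 5, 6, 4 and 7) instead of the real rows from wine.csv, and the names `demo`, `random_half` and `comparison` are hypothetical.

```python
import pandas as pd

# Stand-in for the 100-row subset: only the quality column, rebuilt from the
# counts printed above (an assumption; the real rows come from wine.csv).
demo = pd.DataFrame({"quality": [5] * 66 + [6] * 22 + [4] * 7 + [7] * 5})

# Simple random sample of half the rows; random_state is fixed only so that
# the run can be repeated.
random_half = demo.sample(frac=0.5, random_state=0)

# Put population and sample proportions side by side; a class missing from
# the sample shows up as 0.0 instead of NaN.
comparison = pd.DataFrame({
    "population": demo.quality.value_counts(normalize=True),
    "sample": random_half.quality.value_counts(normalize=True),
}).fillna(0.0)
print(comparison)
```

Rerunning with a different `random_state` shows how much the proportions of the small classes (4 and 7) can move from one simple random sample to the next.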

## Stratified sampling:  

Now we can take another random sample stratified by quality.

In [7]:
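# Stratified split of train and test data
# Note: sklearn.cross_validation was removed in scikit-learn 0.20,
# so this cell needs an older release to run as written.
from sklearn.cross_validation import StratifiedShuffleSplit
sample = StratifiedShuffleSplit(wine.quality, n_iter=1, test_size=0.5)
# Check the splitter object
sample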

StratifiedShuffleSplit(labels=[5 5 5 6 5 5 5 7 7 5 5 5 5 5 5 5 7 5 4 6 6 5 5 5 6 5 5 5 5 6 5 6 5 6 5 6 6
 7 4 5 5 4 6 5 5 4 5 5 5 5 5 6 6 5 6 5 5 5 5 6 5 5 7 5 5 5 5 5 5 6 6 5 5 4
 5 5 5 6 5 4 5 5 5 5 6 5 6 5 5 5 5 6 5 5 4 6 5 5 5 6], n_iter=1, test_size=0.5, random_state=None)
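
On recent scikit-learn releases the same class lives in `sklearn.model_selection`, with a slightly different interface: `n_splits` takes the place of `n_iter`, and the labels are passed to `split()` rather than to the constructor. A rough equivalent might look like the sketch below, which assumes stand-in labels rebuilt from the class counts of the 100-row subset and a placeholder feature matrix; `splitter`, `X` and `y` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Stand-in labels with the same class counts as the 100-row subset
# (an assumption; in the notebook these would be wine.quality).
y = np.array([5] * 66 + [6] * 22 + [4] * 7 + [7] * 5)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the stratification

# n_splits replaces the old n_iter; random_state is fixed here for repeatability.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_index, test_index = next(splitter.split(X, y))

# Class counts in each half of the split.
print(np.unique(y[train_index], return_counts=True))
print(np.unique(y[test_index], return_counts=True))
```

In the notebook itself, `wine` and `wine.quality` would take the place of `X` and `y`, and the returned positions can be used with `wine.iloc` exactly as in the cells that follow.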

We can first print the indexes:

In [8]:
import numpy as np
np.set_printoptions(threshold=np.inf)
for train_index, test_index in sample:
    print("train index :")
    print(train_index)
    print("test index:")
    print(test_index)

train index :
[30 71 98 50 91 46 92 18  1 96 19 72 20 68 39  4 63 45 58 27 99 21 74 51  9
 75  0 73 40 69 57 43 60  8 62 12 14 15 82 59 65 34 88 86 13 52 29 61 77 38]
test index:
[ 2 35 89 93 85 32 44 16 81 56 76 10 67 55 84 31 25  7 94 70 37 66 95 17 54
 83 33 80 22 97 79 36 24 64  3  6 23 78 41 47 87  5 42 26 28 90 48 53 11 49]
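
These indexes are not reproducible: the splitter was created with `random_state=None`, so each pass over it can draw a different split, and the rows stored in later cells need not match the indexes printed here. The sketch below illustrates the effect of fixing the seed. It assumes the newer `sklearn.model_selection` API and stand-in labels with the same class counts as the subset; `first_train_index` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([5] * 66 + [6] * 22 + [4] * 7 + [7] * 5)  # stand-in labels
X = np.zeros((len(y), 1))                               # placeholder features


def first_train_index(splitter):
    """Return the train indexes of the first split produced by `splitter`."""
    train_index, _ = next(splitter.split(X, y))
    return train_index


# With an integer seed, every call to split() yields the same indexes.
seeded = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
same = np.array_equal(first_train_index(seeded), first_train_index(seeded))
print("seeded splits identical:", same)

# Without a seed, two passes will almost certainly differ.
unseeded = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=None)
differ = not np.array_equal(first_train_index(unseeded), first_train_index(unseeded))
print("unseeded splits differ:", differ)
```

Fixing `random_state` is therefore a reasonable choice whenever the printed indexes and the stored samples are expected to agree.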


Next we store the train and test rows as two DataFrames.

In [9]:
# Loop over the splits in sample (only one, since n_iter=1) and keep the rows
for train_index, test_index in sample:
    xtrain, xtest = wine.iloc[train_index], wine.iloc[test_index]
# value_counts() must be called on the quality column (a Series), not on the
# whole DataFrame; the class distribution is checked in the cells below.

Now let's check the two samples xtrain and xtest.

In [10]:
xtrain.shape

(50, 15)

In [11]:
xtest.shape

(50, 15)

Both samples have the same shape.

In [12]:
xtrain.quality.value_counts()

5    33
6    11
4     4
7     2
Name: quality, dtype: int64

In [13]:
xtest.quality.value_counts()

5    33
6    11
7     3
4     3
Name: quality, dtype: int64

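The quality distributions are nearly identical: classes 5 and 6 match exactly, while classes 4 and 7 differ by one record because their totals (7 and 5) are odd and cannot be halved evenly.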
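
A similar stratified half can also be drawn with pandas alone by sampling within each quality group. This is an alternative sketch, not what the cells above do: it assumes pandas 1.1 or later (for `groupby(...).sample`) and a stand-in DataFrame rebuilt from the class counts of the subset; `demo`, `stratified_half` and `remainder` are illustrative names.

```python
import pandas as pd

# Stand-in for the 100-row subset, rebuilt from the printed class counts.
demo = pd.DataFrame({"quality": [5] * 66 + [6] * 22 + [4] * 7 + [7] * 5})

# Draw half of each quality group; the original row index is preserved.
stratified_half = demo.groupby("quality", group_keys=False).sample(
    frac=0.5, random_state=0
)
# The rows that were not drawn form the other half.
remainder = demo.drop(stratified_half.index)

print(stratified_half.quality.value_counts())
print(remainder.quality.value_counts())
```

Because the sample is drawn group by group, the two halves contain different rows while keeping close to the same class mix.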

In [14]:
xtrain[0:5]

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color,is_red,high_quality
96,6.8,0.775,0.0,3.0,0.102,8.0,23.0,0.9965,3.45,0.56,10.7,5,red,1.0,0.0
52,6.6,0.5,0.04,2.1,0.068,6.0,14.0,0.9955,3.39,0.64,9.4,6,red,1.0,0.0
19,7.9,0.32,0.51,1.8,0.341,17.0,56.0,0.9969,3.04,1.08,9.2,6,red,1.0,0.0
77,6.8,0.785,0.0,2.4,0.104,14.0,30.0,0.9966,3.52,0.55,10.7,6,red,1.0,0.0
92,8.6,0.49,0.29,2.0,0.11,19.0,133.0,0.9972,2.93,1.98,9.8,5,red,1.0,0.0


In [15]:
xtest[0:5]

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color,is_red,high_quality
68,9.3,0.32,0.57,2.0,0.074,27.0,65.0,0.9969,3.28,0.79,10.7,5,red,1.0,0.0
75,8.8,0.41,0.64,2.2,0.093,9.0,42.0,0.9986,3.54,0.66,10.5,5,red,1.0,0.0
36,7.8,0.6,0.14,2.4,0.086,3.0,15.0,0.9975,3.42,0.6,10.8,6,red,1.0,0.0
7,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7,red,1.0,1.0
22,7.9,0.43,0.21,1.6,0.106,10.0,37.0,0.9966,3.17,0.91,9.5,5,red,1.0,0.0


#### The two samples contain different records!

#### Exercise: What do you observe in the output? Do you think these samples represent the data correctly?

Check this link for more details about Stratified sampling: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedShuffleSplit.html