# Processing

This is a quick review of outlier detection, feature selection, and various technical details of data carpentry.

This session covers the following examples:

* Delete columns
* Convert column into binary
* Convert column into np.datetime64
* Find unique values in a column
* Iterating through columns and excluding columns from iteration
* Drop rows from data frame
* One-hot encode string column
* Subset dataframe by columns
* Subset dataframe by rows
* Resample dataset


We will walk through each of them with a minimal example.

In [1]:
import os, sys
import itertools
import random
import numpy as np 
import pandas as pd

class Example(object):
    def __init__(self, inspecting = True):
        """ Reset dataset for each example. """
        global dataset
        dataset = pd.read_csv('processing_examples.csv')
        self.inspecting = inspecting
        
    def __enter__(self):
        if self.inspecting:
            print('====== before ======')
            print(dataset)
            
    def __exit__(self, type, value, traceback):
        if self.inspecting:
            print('====== after ======')
            print(dataset)


The following example zeroes out every element of the dataset.  
It is here only to show the syntax used throughout this lab and to demonstrate what the `Example` class above does:  
it reloads the dataset, prints it before the block runs, and prints it again afterwards.

In [2]:
with Example():
    dataset.iloc[:, :] = 0

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
   float  int  yes/no  date  categorical
0      0    0       0     0            0
1      0    0       0     0            0
2      0    0       0     0            0
3      0    0       0     0            0
4      0    0       0     0            0
5      0    0       0     0            0
6      0    0       0     0            0
7      0    0       0     0            0
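If `processing_examples.csv` is not at hand, the table printed above can be rebuilt in memory. The following is a minimal sketch that copies the eight rows shown in the output; the helper name `make_dataset` is hypothetical and not part of the lab.

```python
import pandas as pd

def make_dataset():
    """Rebuild the eight-row example table shown in the printed output."""
    return pd.DataFrame({
        'float': [0.283405, 0.034334, 0.773453, 0.550071,
                  0.382113, 0.921326, 0.691557, 0.526204],
        'int': [4, 5, 6, 8, 9, 1, 9, 1],
        'yes/no': ['No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No'],
        'date': ['2017-10-06', '2017-10-07', '2017-10-08', '2017-10-10',
                 '2017-10-11', '2017-10-03', '2017-10-11', '2017-10-03'],
        'categorical': ['E', 'F', 'G', 'B', 'C', 'B', 'C', 'B'],
    })

# A fresh copy on every call, similar in spirit to reloading the CSV.
dataset = make_dataset()
```

Calling `make_dataset()` again returns an untouched copy, so it can stand in for the reset that `Example` performs.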


## Delete a column

In [3]:
with Example():
    ret = dataset.drop('float', axis = 1)
    print('====== returns ======')
    print(ret)

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
   int yes/no        date categorical
0    4     No  2017-10-06           E
1    5    Yes  2017-10-07           F
2    6    Yes  2017-10-08           G
3    8     No  2017-10-10           B
4    9     No  2017-10-11           C
5    1     No  2017-10-03           B
6    9    Yes  2017-10-11           C
7    1     No  2017-10-03           B
      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
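3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B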

In [4]:
with Example():
    ret = dataset.drop('float', inplace = True, axis = 1)
    print('====== returns ======')
    print(ret)

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
None
   int yes/no        date categorical
0    4     No  2017-10-06           E
1    5    Yes  2017-10-07           F
2    6    Yes  2017-10-08           G
3    8     No  2017-10-10           B
4    9     No  2017-10-11           C
5    1     No  2017-10-03           B
6    9    Yes  2017-10-11           C
7    1     No  2017-10-03           B


**Recommended way**

In [5]:
with Example():
    del dataset['float']

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
   int yes/no        date categorical
0    4     No  2017-10-06           E
1    5    Yes  2017-10-07           F
2    6    Yes  2017-10-08           G
3    8     No  2017-10-10           B
4    9     No  2017-10-11           C
5    1     No  2017-10-03           B
6    9    Yes  2017-10-11           C
7    1     No  2017-10-03           B
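The three cells above differ mainly in what they return and what they mutate. A small sketch on a hypothetical two-column frame `df` (not the lab dataset) makes the contrast explicit. It also uses the `columns=` keyword of `DataFrame.drop`, which avoids having to remember `axis = 1`.

```python
import pandas as pd

df = pd.DataFrame({'float': [0.1, 0.2], 'int': [1, 2]})

# drop() without inplace returns a new frame and leaves df untouched.
dropped = df.drop(columns=['float'])

# drop() with inplace=True mutates the frame and returns None.
df_inplace = df.copy()
ret = df_inplace.drop(columns=['float'], inplace=True)

# del mutates the frame in place, like inplace=True.
df_del = df.copy()
del df_del['float']
```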


## Convert column into binary

In [6]:
with Example():
    dataset['yes/no'] = dataset['yes/no'].apply(['Yes', 'No'].index)

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int  yes/no        date categorical
0  0.283405    4       1  2017-10-06           E
1  0.034334    5       0  2017-10-07           F
2  0.773453    6       0  2017-10-08           G
3  0.550071    8       1  2017-10-10           B
4  0.382113    9       1  2017-10-11           C
5  0.921326    1       1  2017-10-03           B
6  0.691557    9       0  2017-10-11           C
7  0.526204    1       1  2017-10-03           B


In [7]:
with Example():
    dataset['yes/no'] = list(map(['Yes', 'No'].index, dataset['yes/no']))

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int  yes/no        date categorical
0  0.283405    4       1  2017-10-06           E
1  0.034334    5       0  2017-10-07           F
2  0.773453    6       0  2017-10-08           G
3  0.550071    8       1  2017-10-10           B
4  0.382113    9       1  2017-10-11           C
5  0.921326    1       1  2017-10-03           B
6  0.691557    9       0  2017-10-11           C
7  0.526204    1       1  2017-10-03           B


map() takes a function and an iterable object and applies the function
to all elements of that iterable object.

In [8]:
a = [0,2,4]
list(map(lambda x:x+1, a))

[1, 3, 5]
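Back to the binary conversion: `['Yes', 'No'].index` returns each value's position in the list, so `Yes` becomes 0 and `No` becomes 1. If the opposite encoding is wanted, one option is `Series.map` with an explicit dictionary. The sketch below uses a hypothetical stand-alone Series rather than the lab dataset.

```python
import pandas as pd

answers = pd.Series(['No', 'Yes', 'Yes', 'No'], name='yes/no')

# Position-based encoding, as in the cells above: Yes -> 0, No -> 1.
by_position = answers.apply(['Yes', 'No'].index)

# Explicit mapping: Yes -> 1, No -> 0. Values missing from the dict become NaN.
by_dict = answers.map({'Yes': 1, 'No': 0})
```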

## Convert column into np.datetime64

In [9]:
with Example():
    dataset['date'] = dataset['date'].apply(np.datetime64)
    
print(type(dataset['date'][0]))

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no       date categorical
0  0.283405    4     No 2017-10-06           E
1  0.034334    5    Yes 2017-10-07           F
2  0.773453    6    Yes 2017-10-08           G
3  0.550071    8     No 2017-10-10           B
4  0.382113    9     No 2017-10-11           C
5  0.921326    1     No 2017-10-03           B
6  0.691557    9    Yes 2017-10-11           C
7  0.526204    1     No 2017-10-03           B
<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [10]:
with Example():
    dataset['date'] = dataset['date'].apply(np.datetime64)
    print('====== day ======')
    print(dataset['date'].apply(lambda d: d.day))

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
0     6
1     7
2     8
3    10
4    11
5     3
6    11
7     3
Name: date, dtype: int64
      float  int yes/no       date categorical
0  0.283405    4     No 2017-10-06           E
1  0.034334    5    Yes 2017-10-07           F
2  0.773453    6    Yes 2017-10-08           G
3  0.550071    8     No 2017-10-10           B
4  0.382113    9     No 2017-10-11           C
5  0.921326    1     No 2017-10-03           B
6  0.691557    9    Yes 2017-10-11           C
7  0.526204    1     No 2017-10-03           B


In [11]:
with Example():
    dataset['date'] = dataset['date'].apply(np.datetime64)
    print('====== month ======')
    print(dataset['date'].apply(lambda d: d.month))

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
0    10
1    10
2    10
3    10
4    10
5    10
6    10
7    10
Name: date, dtype: int64
      float  int yes/no       date categorical
0  0.283405    4     No 2017-10-06           E
1  0.034334    5    Yes 2017-10-07           F
2  0.773453    6    Yes 2017-10-08           G
3  0.550071    8     No 2017-10-10           B
4  0.382113    9     No 2017-10-11           C
5  0.921326    1     No 2017-10-03           B
6  0.691557    9    Yes 2017-10-11           C
7  0.526204    1     No 2017-10-03           B
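As an alternative to applying `np.datetime64` element by element, pandas offers `pd.to_datetime`, which converts the whole column at once, and the `.dt` accessor, which exposes fields such as `day` and `month` without a lambda. A minimal sketch on a hypothetical Series holding three of the dates above:

```python
import pandas as pd

dates = pd.Series(['2017-10-06', '2017-10-07', '2017-10-10'], name='date')

# Vectorized conversion; the result has a datetime64 dtype.
converted = pd.to_datetime(dates)

# The .dt accessor reads date fields for every row at once.
days = converted.dt.day
months = converted.dt.month
```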


## Find unique values in a column

In [12]:
Example(inspecting = False)

np.unique(dataset['categorical'])

array(['B', 'C', 'E', 'F', 'G'], dtype=object)
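`np.unique` returns the values sorted. When the order of first appearance or the frequency of each value matters, `Series.unique` and `Series.value_counts` are possible alternatives. A small sketch on a hypothetical Series with the same values as the `categorical` column:

```python
import numpy as np
import pandas as pd

categorical = pd.Series(['E', 'F', 'G', 'B', 'C', 'B', 'C', 'B'])

sorted_values = np.unique(categorical)     # sorted: B, C, E, F, G
appearance_order = categorical.unique()    # first-seen order: E, F, G, B, C
counts = categorical.value_counts()        # occurrences per value
```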

## Iterating through columns

In [13]:
Example(inspecting = False)

for column_name in ['int', 'categorical']:
    print(dataset[column_name].head())

0    4
1    5
2    6
3    8
4    9
Name: int, dtype: int64
0    E
1    F
2    G
3    B
4    C
Name: categorical, dtype: object


In [14]:
Example(inspecting = False)

for column_name in dataset.columns:
    print(dataset[column_name].head())

0    0.283405
1    0.034334
2    0.773453
3    0.550071
4    0.382113
Name: float, dtype: float64
0    4
1    5
2    6
3    8
4    9
Name: int, dtype: int64
0     No
1    Yes
2    Yes
3     No
4     No
Name: yes/no, dtype: object
0    2017-10-06
1    2017-10-07
2    2017-10-08
3    2017-10-10
4    2017-10-11
Name: date, dtype: object
0    E
1    F
2    G
3    B
4    C
Name: categorical, dtype: object


In [15]:
Example(inspecting = False)

for column_name in np.array(dataset.columns)[[2,4]]:
    print(dataset[column_name].head())

0     No
1    Yes
2    Yes
3     No
4     No
Name: yes/no, dtype: object
0    E
1    F
2    G
3    B
4    C
Name: categorical, dtype: object


In [16]:
Example(inspecting = False)

for column_name in np.array(dataset.columns)[[False, True, False, False, True]]:
    print(dataset[column_name].head())

0    4
1    5
2    6
3    8
4    9
Name: int, dtype: int64
0    E
1    F
2    G
3    B
4    C
Name: categorical, dtype: object


## Iterating through columns with exclusions

In [17]:
Example(inspecting = False)

exclusion = ['float', 'yes/no', 'date']

for column_name in set(dataset.columns)-set(exclusion):
    print(dataset[column_name].head())

0    E
1    F
2    G
3    B
4    C
Name: categorical, dtype: object
0    4
1    5
2    6
3    8
4    9
Name: int, dtype: int64


In [18]:
Example(inspecting = False)

exclusion = [0,2,3]

for column_name in set(dataset.columns)-set(np.array(dataset.columns)[exclusion]):
    print(dataset[column_name].head())

0    E
1    F
2    G
3    B
4    C
Name: categorical, dtype: object
0    4
1    5
2    6
3    8
4    9
Name: int, dtype: int64


In [19]:
Example(inspecting = False)

exclusion = [0,2,3]

for column_name in [v for i,v in enumerate(dataset.columns) if i not in exclusion]:
    print(dataset[column_name].head())

0    4
1    5
2    6
3    8
4    9
Name: int, dtype: int64
0    E
1    F
2    G
3    B
4    C
Name: categorical, dtype: object


In [20]:
Example(inspecting = False)

exclusion = [True, False, True, True, False]

for column_name in np.array(dataset.columns)[~np.array(exclusion)]:
    print(dataset[column_name].head())

0    4
1    5
2    6
3    8
4    9
Name: int, dtype: int64
0    E
1    F
2    G
3    B
4    C
Name: categorical, dtype: object
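The set-difference versions above do not preserve column order (note that `categorical` is printed before `int`), since Python sets are unordered. Where order matters, one order-preserving option is to filter with `Index.isin`, or simply to drop the excluded columns. A minimal sketch on a hypothetical one-row frame with the same column names:

```python
import pandas as pd

df = pd.DataFrame([[0.28, 4, 'No', '2017-10-06', 'E']],
                  columns=['float', 'int', 'yes/no', 'date', 'categorical'])
exclusion = ['float', 'yes/no', 'date']

# Boolean mask over the column index keeps the original column order.
kept = df.columns[~df.columns.isin(exclusion)]

# Equivalent: drop the excluded columns and iterate over what remains.
kept_via_drop = df.drop(columns=exclusion).columns

for column_name in kept:
    print(df[column_name].head())
```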


## Drop rows from data frame

In [21]:
with Example():
    dataset.drop([3,4,5], inplace=True)

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B


In [22]:
with Example():
    dataset.drop([3,4,5], inplace=True)
    dataset.reset_index(drop=True, inplace=True)

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.691557    9    Yes  2017-10-11           C
4  0.526204    1     No  2017-10-03           B
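`DataFrame.drop` interprets its argument as index labels, which happen to coincide with row positions in the lab dataset. The distinction matters once the index is no longer 0..n-1, for example after an earlier drop. A small sketch on a hypothetical frame with a non-default index:

```python
import pandas as pd

df = pd.DataFrame({'int': [4, 5, 6, 8]}, index=[10, 11, 12, 13])

# drop() takes index labels, so the row labelled 11 is removed.
by_label = df.drop([11])

# To drop by position, translate positions into labels first.
by_position = df.drop(df.index[[0, 1]])

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels.
renumbered = by_position.reset_index(drop=True)
```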


## One-hot encoding string column

In [23]:
from sklearn.preprocessing import LabelBinarizer

In [24]:
Example(inspecting = False)

encoder = LabelBinarizer()
onehot = encoder.fit_transform(dataset['categorical'])
print(onehot)

[[0 0 1 0 0]
 [0 0 0 1 0]
 [0 0 0 0 1]
 [1 0 0 0 0]
 [0 1 0 0 0]
 [1 0 0 0 0]
 [0 1 0 0 0]
 [1 0 0 0 0]]


In [25]:
with Example():
    encoder = LabelBinarizer()
    onehot = encoder.fit_transform(dataset['categorical'])

    for j, class_ in enumerate(encoder.classes_):
        dataset['c({})'.format(class_)] = onehot[:, j]

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no        date categorical  c(B)  c(C)  c(E)  c(F)  c(G)
0  0.283405    4     No  2017-10-06           E     0     0     1     0     0
1  0.034334    5    Yes  2017-10-07           F     0     0     0     1     0
2  0.773453    6    Yes  2017-10-08           G     0     0     0     0     1
3  0.550071    8     No  2017-10-10           B     1     0     0     0     0
4  0.382113    9     No  2017-10-11           C     0     1     0     0     0
5  0.921326    1     No  2017-10-03           B     1     0     0     0     0
6  0.691557    9    Yes  2017-10-11           C     0     1     0     0     0
7  0.526204    1     No  2017-10-03           B     1     0     0     0     0

In [26]:
with Example():
    encoder = LabelBinarizer()
    onehot = encoder.fit_transform(dataset['categorical'])

    for j, class_ in enumerate(encoder.classes_):
        dataset['c({})'.format(class_)] = onehot[:, j]
        
    del dataset['categorical']

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no        date  c(B)  c(C)  c(E)  c(F)  c(G)
0  0.283405    4     No  2017-10-06     0     0     1     0     0
1  0.034334    5    Yes  2017-10-07     0     0     0     1     0
2  0.773453    6    Yes  2017-10-08     0     0     0     0     1
3  0.550071    8     No  2017-10-10     1     0     0     0     0
4  0.382113    9     No  2017-10-11     0     1     0     0     0
5  0.921326    1     No  2017-10-03     1     0     0     0     0
6  0.691557    9    Yes  2017-10-11     0     1     0     0     0
7  0.526204    1     No  2017-10-03     1     0     0     0     0
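pandas also ships `pd.get_dummies`, which performs the encoding and the column bookkeeping in one call. This is an alternative to the `LabelBinarizer` loop above, not what the lab itself uses; the resulting column names follow the `prefix_value` pattern (here `c_B`, `c_E`, ...) rather than `c(B)`. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

df = pd.DataFrame({'int': [4, 5, 6, 8],
                   'categorical': ['E', 'F', 'G', 'B']})

# Replaces 'categorical' with one indicator column per distinct value.
encoded = pd.get_dummies(df, columns=['categorical'], prefix='c')

# The indicator dtype varies across pandas versions, so cast to int explicitly.
indicator_columns = [c for c in encoded.columns if c.startswith('c_')]
encoded = encoded.astype({c: int for c in indicator_columns})
```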

## Subset data frame by columns

In [27]:
Example(inspecting = False)

dataset.iloc[:, [1, 2]]

   int yes/no
0    4     No
1    5    Yes
2    6    Yes
3    8     No
4    9     No
5    1     No
6    9    Yes
7    1     No


In [28]:
Example(inspecting = False)

dataset.loc[:, ['int', 'yes/no']]

   int yes/no
0    4     No
1    5    Yes
2    6    Yes
3    8     No
4    9     No
5    1     No
6    9    Yes
7    1     No


In [29]:
Example(inspecting = False)

dataset.loc[:, filter(lambda i: 'a' in i, dataset.columns)]

      float        date categorical
0  0.283405  2017-10-06           E
1  0.034334  2017-10-07           F
2  0.773453  2017-10-08           G
3  0.550071  2017-10-10           B
4  0.382113  2017-10-11           C
5  0.921326  2017-10-03           B
6  0.691557  2017-10-11           C
7  0.526204  2017-10-03           B


In [30]:
Example(inspecting = False)

dataset.loc[:, filter(lambda i: i.startswith('c') or i.startswith('d'), dataset.columns)]

         date categorical
0  2017-10-06           E
1  2017-10-07           F
2  2017-10-08           G
3  2017-10-10           B
4  2017-10-11           C
5  2017-10-03           B
6  2017-10-11           C
7  2017-10-03           B


In [31]:
Example(inspecting = False)

dataset.loc[:, [i for i in dataset.columns if i.startswith('c') or i.startswith('d')]]

         date categorical
0  2017-10-06           E
1  2017-10-07           F
2  2017-10-08           G
3  2017-10-10           B
4  2017-10-11           C
5  2017-10-03           B
6  2017-10-11           C
7  2017-10-03           B
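For name-based selections like the ones above, `DataFrame.filter` offers a built-in shortcut: `like=` keeps columns whose name contains a substring, and `regex=` keeps columns whose name matches a pattern. A minimal sketch on a hypothetical one-row frame with the same column names:

```python
import pandas as pd

df = pd.DataFrame([[0.28, 4, 'No', '2017-10-06', 'E']],
                  columns=['float', 'int', 'yes/no', 'date', 'categorical'])

# Columns whose name contains the letter 'a'.
contains_a = df.filter(like='a')

# Columns whose name starts with 'c' or 'd'.
starts_c_or_d = df.filter(regex='^[cd]')
```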


## Subset data frame by rows

In [32]:
Example(inspecting = False)

dataset.iloc[[3,4,5], :]

      float  int yes/no        date categorical
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B


In [33]:
Example(inspecting = False)

print(dataset['int']>5)

dataset[dataset['int']>5]

0    False
1    False
2     True
3     True
4     True
5    False
6     True
7    False
Name: int, dtype: bool


      float  int yes/no        date categorical
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
6  0.691557    9    Yes  2017-10-11           C


In [34]:
Example(inspecting = False)

dataset[(dataset['float']>0.5) & (dataset['yes/no']=='Yes')]

      float  int yes/no        date categorical
2  0.773453    6    Yes  2017-10-08           G
6  0.691557    9    Yes  2017-10-11           C


Sidebar: wrapping the condition in np.sum() instead of dataset[] gives the count of records  
satisfying the condition.

In [35]:
Example(inspecting = False)

np.sum((dataset['float']>0.5) & (dataset['yes/no']=='Yes'))

2
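The count works because `True` is treated as 1 when summed. The same idea gives the share of matching rows via the mean. A small sketch on a hypothetical frame containing the first four rows of the dataset:

```python
import pandas as pd

df = pd.DataFrame({'float': [0.283405, 0.034334, 0.773453, 0.550071],
                   'yes/no': ['No', 'Yes', 'Yes', 'No']})

mask = (df['float'] > 0.5) & (df['yes/no'] == 'Yes')

count = int(mask.sum())      # number of rows where the condition holds
share = float(mask.mean())   # fraction of rows where the condition holds
```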

### Notice the usage of dataset[]

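It acts on either rows or columns depending on what is passed in: a column label (or a list of labels) selects columns, while a boolean sequence with one entry per row selects rows.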

In [36]:
Example(inspecting = False)

obj = 'categorical'
dataset[obj]

0    E
1    F
2    G
3    B
4    C
5    B
6    C
7    B
Name: categorical, dtype: object

In [37]:
Example(inspecting = False)

obj = [False,False,False,False,True,False,True,False]
dataset[obj]

      float  int yes/no        date categorical
4  0.382113    9     No  2017-10-11           C
6  0.691557    9    Yes  2017-10-11           C


That is why the following statement works: the comparison produces a boolean Series, which dataset[] then uses to select rows.

In [38]:
Example(inspecting = False)

dataset[dataset['categorical']=='C']

      float  int yes/no        date categorical
4  0.382113    9     No  2017-10-11           C
6  0.691557    9    Yes  2017-10-11           C
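Because `dataset[]` switches between columns and rows based on the type of its argument, the explicit `.loc[rows, columns]` form can be easier to read, and it allows both axes to be selected at once. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

df = pd.DataFrame({'int': [9, 1, 9],
                   'categorical': ['C', 'B', 'C']})

is_c = df['categorical'] == 'C'

rows_only = df.loc[is_c, :]       # same rows as df[is_c]
column_only = df.loc[:, 'int']    # same column as df['int']
both = df.loc[is_c, ['int']]      # rows and columns in one step
```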


### And this indexer is also writable on both axes

In [39]:
with Example():
    dataset['categorical'] = ['A']*8

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           A
1  0.034334    5    Yes  2017-10-07           A
2  0.773453    6    Yes  2017-10-08           A
3  0.550071    8     No  2017-10-10           A
4  0.382113    9     No  2017-10-11           A
5  0.921326    1     No  2017-10-03           A
6  0.691557    9    Yes  2017-10-11           A
7  0.526204    1     No  2017-10-03           A


In [40]:
with Example():
    dataset[dataset['categorical']=='C'] = [-1, -128, 'No', '0000-00-00', '<<<']

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4 -1.000000 -128     No  0000-00-00         <<<
5  0.921326    1     No  2017-10-03           B
6 -1.000000 -128     No  0000-00-00         <<<
7  0.526204    1     No  2017-10-03           B


In [41]:
with Example():
    dataset[dataset['categorical']=='C'] = [
        [-1, -128, 'No', '0000-00-00', '<<<'],
        [0, 127, 'Yes', '1900-12-01', '<<<']
    ]

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4 -1.000000 -128     No  0000-00-00         <<<
5  0.921326    1     No  2017-10-03           B
6  0.000000  127    Yes  1900-12-01         <<<
7  0.526204    1     No  2017-10-03           B
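Assigning through a boolean mask alone overwrites entire rows. To change a single column for the matching rows, `.loc[mask, column]` can be used instead. A small sketch on a hypothetical four-row frame:

```python
import pandas as pd

df = pd.DataFrame({'int': [8, 9, 1, 9],
                   'categorical': ['B', 'C', 'B', 'C']})

# Overwrite only the 'int' column where categorical == 'C'.
df.loc[df['categorical'] == 'C', 'int'] = -128
```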


## Resample dataset

In [42]:
with Example():
    dataset = dataset.sample(frac = 0.2)

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no        date categorical
4  0.382113    9     No  2017-10-11           C
7  0.526204    1     No  2017-10-03           B


In [43]:
with Example():
    dataset = dataset.sample(frac = 0.9, replace = True)

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no        date categorical
2  0.773453    6    Yes  2017-10-08           G
7  0.526204    1     No  2017-10-03           B
7  0.526204    1     No  2017-10-03           B
2  0.773453    6    Yes  2017-10-08           G
6  0.691557    9    Yes  2017-10-11           C
2  0.773453    6    Yes  2017-10-08           G
5  0.921326    1     No  2017-10-03           B


In [44]:
with Example():
    dataset = dataset.sample(frac = 1)

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no        date categorical
3  0.550071    8     No  2017-10-10           B
1  0.034334    5    Yes  2017-10-07           F
4  0.382113    9     No  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
2  0.773453    6    Yes  2017-10-08           G
0  0.283405    4     No  2017-10-06           E


In [45]:
with Example():
    dataset = dataset.sample(frac = 1.5, replace = True)

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
      float  int yes/no        date categorical
7  0.526204    1     No  2017-10-03           B
5  0.921326    1     No  2017-10-03           B
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
0  0.283405    4     No  2017-10-06           E
7  0.526204    1     No  2017-10-03           B
4  0.382113    9     No  2017-10-11           C
3  0.550071    8     No  2017-10-10           B
5  0.921326    1     No  2017-10-03           B
4  0.382113    9     No  2017-10-11     

In [46]:
with Example():
    dataset = dataset.sample(frac = 1.5, replace = True).reset_index(drop=True)

      float  int yes/no        date categorical
0  0.283405    4     No  2017-10-06           E
1  0.034334    5    Yes  2017-10-07           F
2  0.773453    6    Yes  2017-10-08           G
3  0.550071    8     No  2017-10-10           B
4  0.382113    9     No  2017-10-11           C
5  0.921326    1     No  2017-10-03           B
6  0.691557    9    Yes  2017-10-11           C
7  0.526204    1     No  2017-10-03           B
       float  int yes/no        date categorical
0   0.526204    1     No  2017-10-03           B
1   0.773453    6    Yes  2017-10-08           G
2   0.773453    6    Yes  2017-10-08           G
3   0.283405    4     No  2017-10-06           E
4   0.773453    6    Yes  2017-10-08           G
5   0.283405    4     No  2017-10-06           E
6   0.283405    4     No  2017-10-06           E
7   0.691557    9    Yes  2017-10-11           C
8   0.773453    6    Yes  2017-10-08           G
9   0.382113    9     No  2017-10-11           C
10  0.283405    4     No  201
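Each call to `sample` draws rows at random, so the outputs above will differ from run to run. Passing `random_state` makes the draw repeatable. A minimal sketch on a hypothetical single-column frame; the seed value 0 is arbitrary:

```python
import pandas as pd

df = pd.DataFrame({'int': [4, 5, 6, 8, 9, 1, 9, 1]})

# Same seed, same draw: both bootstrap samples are identical.
first = df.sample(frac=1.5, replace=True, random_state=0).reset_index(drop=True)
second = df.sample(frac=1.5, replace=True, random_state=0).reset_index(drop=True)

# frac=1 without replacement is a shuffle: every row appears exactly once.
shuffled = df.sample(frac=1, random_state=0)
```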

### All done!  Clear Cells, Save Notebook, `File > Close and Halt`