# 02.A: Working with Datasets

It's important to have a clear and sensible way of representing the datasets that learning algorithms train on. A dataset consists of $n$ examples, and each example consists of $m$ features, so $m$ is the number of dimensions of the dataset. In supervised learning, the dataset is a matrix like this:

$\boldsymbol{D} =\left[\begin{array}{cccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)} & y^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)} & y^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)} & y^{(3)}\\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)} & y^{(n)}
\end{array}\right]$

Each row of this matrix is an example consisting of the $m$ features plus the target label as the last element in the row. In other words, $\boldsymbol{D}$ consists of both the input matrix $\boldsymbol{X}$ and the target vector $\boldsymbol{y}$, where:

$\boldsymbol{X} =\left[\begin{array}{ccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)}\\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)}
\end{array}\right]$

and

$\boldsymbol{y} =\left[\begin{array}{c} 
  y^{(1)}\\ 
  y^{(2)}\\
  y^{(3)}\\
  \vdots \\
  y^{(n)}
\end{array}\right]$
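
As a minimal sketch of this layout (the array values below are made up for illustration and are not part of the course data), $\boldsymbol{X}$ and $\boldsymbol{y}$ can be sliced out of a NumPy array that stores $\boldsymbol{D}$:

```python
import numpy as np

# Hypothetical dataset D: n = 4 examples, m = 2 features, label in the last column.
D = np.array([
    [5.1, 3.5, 0],
    [4.9, 3.0, 0],
    [6.2, 2.9, 1],
    [5.9, 3.0, 1],
])

X = D[:, :-1]   # input matrix: every column except the last, shape (n, m)
y = D[:, -1:]   # target: the last column kept as a column vector, shape (n, 1)

n, m = X.shape
print(n, m)      # 4 2
print(y.shape)   # (4, 1)
```

Slicing with `-1:` rather than `-1` keeps `y` two-dimensional, which matches the $n \times 1$ column vector shown above.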

For unsupervised learning, there is no target, so $\boldsymbol{D}$ is the same as $\boldsymbol{X}$.

Another name for $\boldsymbol{X}$ is `inputs`, and another name for $\boldsymbol{y}$ is `target`. In addition, features have names. Let's put all of this together in a class named `DataSet`, built on pandas' `DataFrame`, that we will be using in subsequent weeks.

In [86]:
import numpy as np
import pandas as pd

class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A pandas DataFrame of examples. Each row contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An n by 1 array containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column.
        If y is None or False, no target is available.
        Otherwise, y is an array to be added as the last column of the examples dataframe.
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
    
    @property
    def examples(self):
        return self.__examples
    
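    @property
    def features(self):
        # Exclude the target column only when one is present (supervised case).
        if 'y' in self.__examples.columns:
            return self.__examples.columns[:-1].values
        return self.__examples.columns.values
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values.reshape(self.N, 1)
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
        # Without a target column (unsupervised case), every column is an input.
        if 'y' in self.__examples.columns:
            return self.__examples.iloc[:, :-1].values
        return self.__examples.values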
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        return DataSet(self.__examples.iloc[indexes])
    
    def train_test_split(self, start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set.
        If test_portion is specified, return that portion of the dataset as test 
        and the rest as training. 
        Otherwise, return the examples between start and end as test and the 
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            end = end or self.N
        else:
            if not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
                raise TypeError("Only fractions between ]0,1[ are allowed")

            start = self.N - int(self.N * test_portion)
            end = self.N

        test = DataSet(self.examples.iloc[indexes[range(start, end)]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[range(start)]], 
                                      self.examples.iloc[indexes[range(end, self.N)]]], axis=0))    
        return train, test
    
    def train_validation_test_split(self, portions, shuffle=False):
        # We use the train_test_split method twice:
        # first to split the data into training data and everything else,
        # then to split everything else into validation and test data.

        # train_test_split takes the size of its test portion, so the portion
        # that is not training data is 100% - training%.
        held_out_portion = 1 - portions["training"]

        # Share of the held-out data that goes to the test set (used in the second split).
        test_share = portions["test"] / (portions["test"] + portions["validation"])

        # Split the data into training data and everything else.
        train, ee = self.train_test_split(test_portion=held_out_portion, shuffle=shuffle)

        # Split everything else again; the "train" part of this split is the validation set.
        val, test = ee.train_test_split(test_portion=test_share)

        # Finally, return all three datasets.
        return train, val, test
        
    
    def __repr__(self):
        return repr(self.examples)

This class has several properties, including `name` (informational), `features` (the names of the features), `inputs`, `target`, `X`, `y`, `N` (number of examples), and `M` (number of dimensions).

A DataSet object is created using a NumPy array or a Pandas dataframe. If it is a NumPy array, the class uses it to create a Pandas dataframe. The dataframe storing the data can be retrieved back using the `examples` property.
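
The construction steps can be illustrated without the class itself. The following is a minimal sketch that uses pandas directly and assumes the `y=True` convention, where the last column holds the target; the names `raw` and `frame` and the data values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: 3 examples, 2 features, label in the last column.
raw = np.array([
    [1.0, 2.0, 0.0],
    [3.0, 4.0, 1.0],
    [5.0, 6.0, 1.0],
])

# A NumPy array is wrapped in a DataFrame, as the constructor does.
frame = pd.DataFrame(raw, columns=["x1", "x2", "label"])

# With y=True, the last column is treated as the target and renamed to 'y'.
frame.columns = [*frame.columns[:-1], "y"]

X = frame.iloc[:, :-1].values                  # inputs, shape (3, 2)
y = frame["y"].values.reshape(len(frame), 1)   # target, shape (3, 1)

print(list(frame.columns))   # ['x1', 'x2', 'y']
print(X.shape, y.shape)      # (3, 2) (3, 1)
```

Renaming the last column to `'y'` is what lets the `target` property find the labels by name later on.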


Let's test this class by creating a $27 \times 3$ input matrix and a separate $y$ column.

In [2]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")

ds

     x1   x2         x1  y
0   2.0  2.0  10.791134  0
1   5.0  8.0  10.809887  0
2   8.0  7.0  13.070595  1
3   6.0  6.0   8.716489  0
4   2.0  7.0   8.767886  0
5   8.0  6.0  13.178941  1
6   3.0  4.0   7.185933  0
7   8.0  7.0  10.337348  0
8   6.0  1.0   5.458405  1
9   7.0  8.0   8.072134  1
10  5.0  5.0  11.969767  0
11  5.0  7.0   9.288796  0
12  2.0  3.0   9.987115  0
13  6.0  5.0  12.562984  1
14  4.0  7.0   9.990945  0
15  4.0  1.0  12.158995  1
16  7.0  4.0   8.738938  0
17  4.0  2.0   7.521280  0
18  7.0  6.0  11.440596  0
19  2.0  6.0   8.143957  0
20  2.0  6.0  12.382754  0
21  6.0  5.0   9.736898  1
22  2.0  4.0   7.198887  0
23  2.0  4.0   8.285740  0
24  4.0  8.0  10.210583  1
25  3.0  1.0   7.538604  1
26  3.0  7.0  11.446310  1

In [3]:
ds.examples

    x1   x2         x1  y
0  2.0  2.0  10.791134  0
1  5.0  8.0  10.809887  0
2  8.0  7.0  13.070595  1
3  6.0  6.0   8.716489  0
4  2.0  7.0   8.767886  0
5  8.0  6.0  13.178941  1
6  3.0  4.0   7.185933  0
7  8.0  7.0  10.337348  0
8  6.0  1.0   5.458405  1
9  7.0  8.0   8.072134  1


In [4]:
ds.features

array(['x1', 'x2', 'x1'], dtype=object)

In [5]:
ds.target 

array([[0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1]])

In [6]:
ds.y 

array([[0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1]])

In [7]:
ds.inputs

array([[ 2.        ,  2.        , 10.79113426],
       [ 5.        ,  8.        , 10.80988706],
       [ 8.        ,  7.        , 13.07059495],
       [ 6.        ,  6.        ,  8.71648906],
       [ 2.        ,  7.        ,  8.76788599],
       [ 8.        ,  6.        , 13.17894097],
       [ 3.        ,  4.        ,  7.18593259],
       [ 8.        ,  7.        , 10.33734798],
       [ 6.        ,  1.        ,  5.4584051 ],
       [ 7.        ,  8.        ,  8.07213408],
       [ 5.        ,  5.        , 11.96976748],
       [ 5.        ,  7.        ,  9.28879632],
       [ 2.        ,  3.        ,  9.98711483],
       [ 6.        ,  5.        , 12.56298435],
       [ 4.        ,  7.        ,  9.99094466],
       [ 4.        ,  1.        , 12.15899519],
       [ 7.        ,  4.        ,  8.73893823],
       [ 4.        ,  2.        ,  7.52128025],
       [ 7.        ,  6.        , 11.44059598],
       [ 2.        ,  6.        ,  8.14395673],
       [ 2.        ,  6.        , 12.382

In [8]:
ds.X

array([[ 2.        ,  2.        , 10.79113426],
       [ 5.        ,  8.        , 10.80988706],
       [ 8.        ,  7.        , 13.07059495],
       [ 6.        ,  6.        ,  8.71648906],
       [ 2.        ,  7.        ,  8.76788599],
       [ 8.        ,  6.        , 13.17894097],
       [ 3.        ,  4.        ,  7.18593259],
       [ 8.        ,  7.        , 10.33734798],
       [ 6.        ,  1.        ,  5.4584051 ],
       [ 7.        ,  8.        ,  8.07213408],
       [ 5.        ,  5.        , 11.96976748],
       [ 5.        ,  7.        ,  9.28879632],
       [ 2.        ,  3.        ,  9.98711483],
       [ 6.        ,  5.        , 12.56298435],
       [ 4.        ,  7.        ,  9.99094466],
       [ 4.        ,  1.        , 12.15899519],
       [ 7.        ,  4.        ,  8.73893823],
       [ 4.        ,  2.        ,  7.52128025],
       [ 7.        ,  6.        , 11.44059598],
       [ 2.        ,  6.        ,  8.14395673],
       [ 2.        ,  6.        , 12.382

In [9]:
ds.name

'Sample Data'

In [10]:
ds.N

27

In [11]:
ds.M

3

## Shuffling

The above class also supports a few useful methods. One of them shuffles the data, which we often do before training. This method returns a new DataSet instance with the shuffled data. Here is how this method is implemented:

```python
    ...
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        return DataSet(self.__examples.iloc[indexes])
   ...
```

Here is an example using the `shuffled` method.

In [12]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ds.shuffled()

     x1   x2         x1  y
14  3.0  8.0   9.122986  0
5   3.0  3.0   7.645586  0
20  6.0  6.0  10.593223  0
16  2.0  1.0   9.583503  1
3   8.0  1.0   8.742644  0
9   8.0  5.0   8.969866  0
18  7.0  2.0  12.076855  0
13  3.0  2.0   8.866998  1
17  4.0  2.0   8.323809  1
12  7.0  3.0   9.200096  0
15  5.0  1.0   9.904611  0
24  5.0  8.0   7.456931  1
7   8.0  5.0  10.221422  0
21  5.0  6.0   9.569629  0
2   5.0  8.0  10.379536  0
10  8.0  1.0  13.589025  1
19  6.0  4.0   9.050564  0
1   8.0  7.0  10.311068  1
25  2.0  1.0  12.703250  1
23  2.0  6.0  12.914921  1
0   7.0  3.0  11.245627  0
8   5.0  1.0  13.664378  1
22  8.0  6.0  10.877865  1
6   8.0  7.0   8.733300  0
11  7.0  8.0   8.800582  0
4   3.0  1.0  12.231882  0
26  7.0  8.0  11.198532  0
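
The `random_state` parameter makes a shuffle repeatable. The sketch below reproduces the same shuffling logic on a plain DataFrame; `shuffled_frame` is a hypothetical stand-alone helper written for illustration, not part of the class:

```python
import numpy as np
import pandas as pd

def shuffled_frame(frame, random_state=None):
    """Return the rows of `frame` in a random order (hypothetical helper)."""
    rgen = np.random.RandomState(random_state)
    indexes = np.arange(len(frame))
    rgen.shuffle(indexes)          # in-place permutation of the row positions
    return frame.iloc[indexes]

frame = pd.DataFrame({"x1": [1, 2, 3, 4, 5], "y": [0, 1, 0, 1, 0]})

a = shuffled_frame(frame, random_state=17)
b = shuffled_frame(frame, random_state=17)

print(a.index.equals(b.index))                            # True: same seed, same order
print(sorted(a.index.tolist()) == frame.index.tolist())   # True: no rows lost or duplicated
```

As in the output above, the shuffled rows keep their original index labels; calling `reset_index(drop=True)` on the result would renumber them if that is preferred.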

## Splitting a dataset into training and test datasets

Another useful method provided by the above dataset class is `train_test_split`. This method splits the dataset into training and test sets. Here is how this method is implemented:

```python
    ...
    def train_test_split(self,start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set.
        If test_portion is specified, return that portion of the dataset as test 
        and the rest as training. 
        Otherwise, return the examples between start and end as test and the 
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            end = end or self.N
        else:
            if not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
                raise TypeError("Only fractions between ]0,1[ are allowed")

            start = self.N - int(self.N * test_portion)
            end = self.N

        test = DataSet(self.examples.iloc[indexes[range(start, end)]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[range(start)]], 
                                      self.examples.iloc[indexes[range(end, self.N)]]], axis=0))    
        return train, test
   ...
```

If `test_portion` is provided, the last `test_portion` fraction of the examples is returned as the test set and the rest as the training set. Otherwise, the examples from position `start` up to (but not including) position `end` are returned as the test set, and the examples before `start` and from `end` onward are returned as the training set. The `shuffle` parameter instructs the method to shuffle the data before splitting it, and `random_state` seeds that shuffle. The method returns two dataset instances: the training set and the test set.
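
The index arithmetic can be sketched on its own. The helper below, `split_indexes`, is hypothetical and works on positions only; unlike the method above, it tests `end is None` instead of using `end or self.N`:

```python
import numpy as np

def split_indexes(n, start=0, end=None, test_portion=None):
    """Hypothetical helper: return (train_idx, test_idx) for n examples."""
    indexes = np.arange(n)
    if test_portion is None:
        end = n if end is None else end
    else:
        # The last test_portion of the examples becomes the test set.
        start = n - int(n * test_portion)
        end = n
    test_idx = indexes[start:end]
    train_idx = np.concatenate([indexes[:start], indexes[end:]])
    return train_idx, test_idx

train_idx, test_idx = split_indexes(10, start=2, end=5)
print(test_idx)    # [2 3 4]
print(train_idx)   # [0 1 5 6 7 8 9]

train_idx, test_idx = split_indexes(100, test_portion=0.25)
print(len(train_idx), len(test_idx))   # 75 25
```

With 100 examples and `test_portion=0.25`, the test set covers positions 75 to 99, which agrees with the row labels in the example output below.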

Here is an example using this method.

In [87]:
ds = DataSet(np.array([
    np.random.randint(2,9, 100),
    np.random.randint(1,9, 100),
    np.random.normal(loc=10, scale=2, size=100)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 100), name="Sample Data")


ta, te = ds.train_test_split(test_portion=.25, shuffle=False, random_state=17)
print('Training set = \n', ta)
print('Test set = \n', te)

Training set = 
      x1   x2         x1  y
0   7.0  4.0   9.614875  0
1   5.0  5.0  10.447379  0
2   5.0  4.0  11.336589  0
3   7.0  7.0  10.120180  1
4   3.0  7.0   9.185731  0
..  ...  ...        ... ..
70  3.0  6.0  11.082104  0
71  2.0  8.0   8.377550  1
72  2.0  4.0  12.078999  0
73  6.0  3.0  10.504142  1
74  3.0  3.0   9.592217  0

[75 rows x 4 columns]
Test set = 
      x1   x2         x1  y
75  8.0  2.0   8.808691  1
76  4.0  5.0   9.247665  0
77  6.0  3.0   9.811697  0
78  7.0  2.0   7.825488  1
79  7.0  2.0   7.539259  1
80  2.0  8.0  10.886686  1
81  2.0  5.0  10.668977  1
82  2.0  2.0  11.083604  1
83  7.0  6.0   8.024430  0
84  8.0  1.0  11.858081  1
85  7.0  1.0  12.388543  1
86  4.0  6.0  10.636686  1
87  2.0  2.0  10.359255  1
88  6.0  6.0  10.911248  0
89  7.0  3.0  13.851956  0
90  4.0  5.0   9.274406  1
91  5.0  4.0  10.978550  1
92  2.0  6.0  10.840270  1
93  8.0  5.0  12.114026  1
94  2.0  3.0  10.289328  0
95  8.0  1.0  14.294511  0
96  4.0  4.0  10.556315  0
97

## Using this dataset class inside other notebooks

This class is part of the `mylib` library of this course, which is provided to you. Here is how to import this library:

In [88]:
import mylib as my

Once imported, one can use it like this:

In [89]:
mds = my.DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ta, te = mds.train_test_split(test_portion=.25, shuffle=False, random_state=17)
print('Training set = \n', ta)
print('Test set = \n', te)

Training set = 
      x1   x2         x1  y
0   7.0  7.0  10.560561  0
1   5.0  1.0  10.428156  0
2   7.0  8.0  11.087958  0
3   8.0  3.0   9.057385  1
4   4.0  2.0   9.552450  0
5   8.0  8.0  11.427155  1
6   3.0  8.0   7.979947  0
7   6.0  8.0  11.211843  0
8   6.0  4.0   6.256771  0
9   4.0  7.0  10.547796  0
10  2.0  3.0   8.304792  1
11  4.0  3.0  10.451665  0
12  2.0  7.0   9.976279  0
13  8.0  3.0  10.513063  1
14  2.0  2.0   8.231380  0
15  2.0  5.0  10.334825  1
16  4.0  3.0   9.449676  1
17  7.0  4.0  10.431901  0
18  5.0  5.0  11.089372  0
19  4.0  1.0  10.384973  0
20  3.0  4.0   9.710119  1
Test set = 
      x1   x2         x1  y
21  4.0  7.0   9.105623  1
22  4.0  3.0   7.522822  0
23  5.0  4.0   8.908300  1
24  2.0  1.0   6.840327  0
25  5.0  6.0   6.534439  0
26  2.0  2.0  11.140901  1


## EXERCISE

Extend the above DataSet class by adding a method named `train_validation_test_split` to it. This method should split the data into three sets: training, validation, and test. It should receive a dictionary parameter named `portions` specifying how much of the data goes into each set. For a 75%/15%/10% split, one can use the following `portions` parameter:

```python
portions={"training": .75, 'validation': .15, 'test': .10 }
```

The method should support the `shuffle` parameter as well. You may call the `train_test_split` method internally. Make sure to include a comment describing how your implementation of the method works. Test your method on the `ds` dataset above and show that it works.
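
One possible way to reason about the portions, when building the method from two successive calls to `train_test_split`, is sketched below. It only shows the arithmetic; the dataset size `n = 100` and the variable names are assumptions made for illustration:

```python
portions = {"training": .75, "validation": .15, "test": .10}
n = 100  # assumed dataset size

# Sanity check: the three portions should cover the whole dataset.
assert abs(sum(portions.values()) - 1.0) < 1e-9

# First split: everything that is not training data is held out.
held_out_portion = 1 - portions["training"]
n_held_out = int(n * held_out_portion)
n_train = n - n_held_out

# Second split: share of the held-out data that goes to the test set.
test_share = portions["test"] / (portions["test"] + portions["validation"])
n_test = int(n_held_out * test_share)
n_val = n_held_out - n_test

print(n_train, n_val, n_test)   # 75 15 10
```

Because each split truncates with `int`, the sizes can be off by one example when `n` times a portion is not a whole number.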

In [90]:
# I renamed the dataset in the cell above this one to mds because I was editing the DataSet class at the top of the page,
# not the one in the mylib library.

# The dataset two cells up (ds) is the one used here; I changed it to generate 100 rows so that the portion sizes are easier to see.

#Portions of the dataset, 75% training, 15% validation, 10% test
portions={"training": .75, 'validation': .15, 'test': .10}

#I call the new method that I added to the class at the top of the page and shuffle the data
tr, val, te = ds.train_validation_test_split(portions, shuffle=True)

#I output the 3 returned datasets to show it works correctly
print('Training set = \n', tr)
print('\nValidation set = \n', val)
print('\nTest set = \n', te)

Training set = 
      x1   x2         x1  y
1   5.0  5.0  10.447379  0
84  8.0  1.0  11.858081  1
25  8.0  1.0   9.746106  0
27  2.0  7.0  11.794226  1
76  4.0  5.0   9.247665  0
..  ...  ...        ... ..
18  4.0  2.0  11.860081  0
85  7.0  1.0  12.388543  1
61  2.0  3.0  12.751706  0
36  5.0  6.0   7.614472  0
52  7.0  1.0   9.157423  1

[75 rows x 4 columns]

Validation set = 
      x1   x2         x1  y
66  7.0  6.0   7.561955  1
64  4.0  1.0  12.230733  1
68  8.0  5.0   9.331282  1
67  6.0  3.0   9.199876  1
29  7.0  2.0  10.718158  1
11  3.0  1.0   8.655584  1
71  2.0  8.0   8.377550  1
9   3.0  8.0  11.675119  0
50  8.0  8.0  11.374227  0
74  3.0  3.0   9.592217  0
86  4.0  6.0  10.636686  1
92  2.0  6.0  10.840270  1
43  4.0  3.0   7.798173  1
28  5.0  2.0  10.783370  1
63  8.0  3.0  12.010859  1

Test set = 
      x1   x2         x1  y
8   5.0  6.0   9.126255  1
62  5.0  3.0   9.560953  0
39  3.0  3.0  11.458475  0
51  3.0  3.0  14.119410  0
82  2.0  2.0  11.083604  1
87  2.0 