# 02.A: Working with Datasets

It's important to have a clear and sensible way of representing the datasets that learning algorithms train on. A dataset consists of $n$ examples, and each example consists of $m$ features, so $m$ is the number of dimensions of the dataset. In supervised learning, the dataset is a matrix like this:

$\boldsymbol{D} =\left[\begin{array}{cccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)} & y^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)} & y^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)} & y^{(3)}\\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)} & y^{(n)}
\end{array}\right]$

Each row of this matrix is an example consisting of the $m$ features plus the target label as the last element in the row. In other words, $\boldsymbol{D}$ consists of both the input matrix $\boldsymbol{X}$ and the target vector $\boldsymbol{y}$, where:

$\boldsymbol{X} =\left[\begin{array}{ccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)}\\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)}
\end{array}\right]$

and

$\boldsymbol{y} =\left[\begin{array}{c} 
  y^{(1)}\\ 
  y^{(2)}\\
  y^{(3)}\\
  \vdots \\
  y^{(n)}
\end{array}\right]$
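
As a small illustration of the notation above, the following sketch slices a matrix $\boldsymbol{D}$ into $\boldsymbol{X}$ and $\boldsymbol{y}$ with NumPy. The numbers are made up for the example and are not taken from any dataset used in this notebook.

```python
import numpy as np

# Hypothetical dataset with n = 4 examples and m = 2 features.
# The last column plays the role of the target y.
D = np.array([
    [5.1, 3.5, 0],
    [4.9, 3.0, 0],
    [6.2, 3.4, 1],
    [5.9, 3.0, 1],
])

X = D[:, :-1]   # every column except the last: shape (n, m)
y = D[:, -1:]   # the last column, kept two-dimensional: shape (n, 1)

n, m = X.shape
print(n, m)      # 4 2
print(y.shape)   # (4, 1)
```

Slicing with `-1:` rather than `-1` keeps $\boldsymbol{y}$ as a column vector, which matches the $n \times 1$ shape shown above.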

For unsupervised learning, there is no target, so $\boldsymbol{D}$ is the same as $\boldsymbol{X}$.

Another name for $\boldsymbol{X}$ is `inputs`, and another name for $\boldsymbol{y}$ is `target`. In addition, features have names. Let's put all of this together in a class named `DataSet`, built on a pandas DataFrame, that we will be using in subsequent weeks.

In [62]:
import numpy as np
import pandas as pd

class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A list of examples. Each one contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An n by 1 array containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column
        If y is None or False, No target is available
        Else y is an array to be added as the last column of the examples  dataframe
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
    
    @property
    def examples(self):
        return self.__examples
    
    @property
    def features(self):
        return self.__examples.columns[:-1].values
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values.reshape(self.N, 1)
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
        return self.__examples.iloc[:, :-1].values
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        return DataSet(self.__examples.iloc[indexes])
    
    def train_test_split(self, start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set. 
        If test_portion is specified, return that portion of the dataset as test 
        and the rest as training. 
        Otherwise, return the examples between start and end as test and the 
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            end = end or self.N
        else:
            if not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
                raise TypeError("Only fractions between ]0,1[ are allowed")

            start = self.N - int(self.N * test_portion)
            end = self.N

        test = DataSet(self.examples.iloc[indexes[range(start, end)]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[range(start)]], 
                                      self.examples.iloc[indexes[range(end, self.N)]]], axis=0))    
        return train, test
    
    def __repr__(self):
        return repr(self.examples)
    
    '''
    My section of the code for the exercise
    '''
    def train_validation_test_split(self, portions, shuffle=False, random_state=None):
        """
        Splits the data into train, validation, and test sets.
        Will receive a dictionary parameter named portions specifying 
        how much of the data is in each set.
        """
        
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)
            
        'Use Dictionary.get() to get the values out of the portions'
        trainVal = portions.get("training")
        validVal = portions.get("validation")
        testVal = portions.get("test")
        
        
        
        """
        Here we are using train_test_split to break the set into its first two parts.
        We will have the set of data remaining (the validation and training) and the
        test set
        """
        dataLeft, test = self.train_test_split(test_portion=testVal, shuffle=False)
        
        '''
        We will then find the portion of the remaining data that will be used for validation
        by using the original validVal to find the total number in the original dataset.
        Then we will get a new percentage by taking that number and dividing it by the new
        total in the new dataset.
        '''
        validStart = self.N - int(self.N * validVal)
        
        validPortion = validStart/dataLeft.N
        
        """
        We will then do the train_test_split on our new dataLeft set to get our validation and training sets
        """
        
        train, validation = dataLeft.train_test_split(test_portion=1-validPortion, shuffle=False)
        
        return train, validation, test

This class has several properties, including `name` (informational), `features` (the names of the features), `inputs`, `target`, `X`, `y`, `N` (number of examples), and `M` (number of dimensions).

A DataSet object is created from a NumPy array or a pandas DataFrame. If it is given a NumPy array, the class uses it to create a DataFrame. The DataFrame storing the data can be retrieved using the `examples` property.
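
To make the constructor's behavior concrete, here is a minimal sketch that repeats its main steps on a plain DataFrame instead of going through the class. The array values and the column names `a`, `b`, and `label` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: two features followed by a target column.
raw = np.array([[1.0, 2.0, 0.0],
                [3.0, 4.0, 1.0]])

# Same step as the non-DataFrame branch: wrap the array in a DataFrame.
examples = pd.DataFrame(raw, columns=['a', 'b', 'label'])

# Same step as the `y is True` branch: rename the last column to 'y'.
examples.columns = [*examples.columns[:-1], 'y']

# The properties are then simple views on this DataFrame.
features = examples.columns[:-1].values                      # feature names
inputs = examples.iloc[:, :-1].values                        # n by m array
target = examples['y'].values.reshape(examples.shape[0], 1)  # n by 1 array

print(features)                    # ['a' 'b']
print(inputs.shape, target.shape)  # (2, 2) (2, 1)
```

Note that when `data` is already a DataFrame, the constructor as written stores that same object rather than a copy, so the column rename would also be visible to the caller.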


Let's test this class by creating a $27 \times 3$ input matrix and a separate $y$ column.

In [34]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")

ds

     x1   x2         x1  y
0   7.0  5.0  14.071137  0
1   7.0  4.0  10.999822  0
2   5.0  6.0  11.821084  1
3   6.0  3.0   8.643734  0
4   8.0  7.0   9.530007  1
5   2.0  5.0   9.314699  0
6   7.0  8.0   9.542695  1
7   5.0  8.0  10.032778  0
8   7.0  4.0  13.806657  1
9   5.0  6.0  11.252529  1
10  6.0  4.0   9.952814  1
11  5.0  7.0  10.852085  0
12  5.0  5.0   8.845012  0
13  2.0  4.0  12.951186  1
14  7.0  6.0  11.854058  0
15  3.0  5.0  10.272491  0
16  2.0  4.0  10.128849  0
17  3.0  3.0  12.440408  0
18  3.0  2.0  11.028291  0
19  3.0  2.0  12.732264  1
20  2.0  6.0  10.550466  1
21  5.0  2.0   7.379811  0
22  4.0  1.0   7.815355  0
23  6.0  2.0   8.059168  1
24  3.0  2.0   8.235369  0
25  5.0  7.0   9.143197  1
26  6.0  6.0   8.195049  0

In [35]:
ds.examples

    x1   x2         x1  y
0  7.0  5.0  14.071137  0
1  7.0  4.0  10.999822  0
2  5.0  6.0  11.821084  1
3  6.0  3.0   8.643734  0
4  8.0  7.0   9.530007  1
5  2.0  5.0   9.314699  0
6  7.0  8.0   9.542695  1
7  5.0  8.0  10.032778  0
8  7.0  4.0  13.806657  1
9  5.0  6.0  11.252529  1


In [36]:
ds.features

array(['x1', 'x2', 'x1'], dtype=object)

In [37]:
ds.target 

array([[0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0]])

In [38]:
ds.y 

array([[0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0]])

In [39]:
ds.inputs

array([[ 7.        ,  5.        , 14.07113691],
       [ 7.        ,  4.        , 10.99982182],
       [ 5.        ,  6.        , 11.82108367],
       [ 6.        ,  3.        ,  8.64373365],
       [ 8.        ,  7.        ,  9.53000722],
       [ 2.        ,  5.        ,  9.31469862],
       [ 7.        ,  8.        ,  9.54269509],
       [ 5.        ,  8.        , 10.03277825],
       [ 7.        ,  4.        , 13.80665662],
       [ 5.        ,  6.        , 11.25252891],
       [ 6.        ,  4.        ,  9.95281416],
       [ 5.        ,  7.        , 10.85208498],
       [ 5.        ,  5.        ,  8.84501223],
       [ 2.        ,  4.        , 12.9511859 ],
       [ 7.        ,  6.        , 11.85405818],
       [ 3.        ,  5.        , 10.27249061],
       [ 2.        ,  4.        , 10.12884872],
       [ 3.        ,  3.        , 12.44040794],
       [ 3.        ,  2.        , 11.02829054],
       [ 3.        ,  2.        , 12.73226393],
       [ 2.        ,  6.        , 10.550

In [40]:
ds.X

array([[ 7.        ,  5.        , 14.07113691],
       [ 7.        ,  4.        , 10.99982182],
       [ 5.        ,  6.        , 11.82108367],
       [ 6.        ,  3.        ,  8.64373365],
       [ 8.        ,  7.        ,  9.53000722],
       [ 2.        ,  5.        ,  9.31469862],
       [ 7.        ,  8.        ,  9.54269509],
       [ 5.        ,  8.        , 10.03277825],
       [ 7.        ,  4.        , 13.80665662],
       [ 5.        ,  6.        , 11.25252891],
       [ 6.        ,  4.        ,  9.95281416],
       [ 5.        ,  7.        , 10.85208498],
       [ 5.        ,  5.        ,  8.84501223],
       [ 2.        ,  4.        , 12.9511859 ],
       [ 7.        ,  6.        , 11.85405818],
       [ 3.        ,  5.        , 10.27249061],
       [ 2.        ,  4.        , 10.12884872],
       [ 3.        ,  3.        , 12.44040794],
       [ 3.        ,  2.        , 11.02829054],
       [ 3.        ,  2.        , 12.73226393],
       [ 2.        ,  6.        , 10.550

In [41]:
ds.name

'Sample Data'

In [42]:
ds.N

27

In [43]:
ds.M

3

## Shuffling

The above class also supports a few useful methods. One such method is for shuffling the data, which we do often before training. This method returns a new DataSet instance with the shuffled data. Here is how this method is implemented:

```python
    ...
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        return DataSet(self.__examples.iloc[indexes])
   ...
```
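
The same idea can be tried outside the class. The sketch below applies the index-shuffling approach to a small made-up DataFrame; `shuffled_rows` is a hypothetical helper name, not part of `DataSet`.

```python
import numpy as np
import pandas as pd

# Small made-up frame so the effect is easy to see.
df = pd.DataFrame({'x1': [10, 20, 30, 40, 50], 'y': [0, 1, 0, 1, 1]})

def shuffled_rows(frame, random_state=None):
    """Return the rows of `frame` in a random order (hypothetical helper)."""
    rgen = np.random.RandomState(random_state)
    indexes = np.arange(frame.shape[0])
    rgen.shuffle(indexes)        # permutes the row positions in place
    return frame.iloc[indexes]   # original index labels travel with the rows

a = shuffled_rows(df, random_state=17)
b = shuffled_rows(df, random_state=17)

print(a.equals(b))                        # True: same seed, same order
print(sorted(a.index) == list(df.index))  # True: no row lost or duplicated
```

Because the rows keep their original index labels, a shuffled dataset prints with a scrambled index column, and passing the same `random_state` reproduces the same order.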

Here is an example using this method.

In [44]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ds.shuffled()

     x1   x2         x1  y
22  7.0  7.0   8.610422  1
11  7.0  7.0   6.919561  1
0   8.0  4.0  10.041026  1
1   4.0  6.0   7.976706  1
12  7.0  3.0   8.532990  0
25  8.0  2.0   8.868058  0
7   8.0  8.0  13.279906  1
5   8.0  3.0   9.438017  1
8   8.0  3.0  11.907139  1
4   2.0  1.0  10.987137  0
14  8.0  8.0  10.109049  1
2   8.0  4.0   8.939255  1
15  8.0  7.0   9.092498  1
6   6.0  6.0   7.827284  0
20  4.0  3.0   7.958531  0
18  4.0  8.0   8.609564  0
9   4.0  3.0   7.529688  1
19  5.0  5.0  13.056071  1
16  6.0  8.0  11.481829  1
10  8.0  5.0   8.559268  0
3   5.0  5.0   8.240834  1
21  5.0  2.0  11.155936  0
24  3.0  4.0   6.478901  1
26  6.0  5.0  10.382259  1
13  4.0  8.0   8.980648  1
17  2.0  8.0   9.513256  0
23  7.0  8.0  10.539011  1

## Splitting a dataset into training and test datasets

Another useful method provided by the above dataset class is the `train_test_split` method. This method splits the dataset into training and test sets. Here is how this method is implemented:

```python
    ...
    def train_test_split(self,start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set. 
        If test_portion is specified, return that portion of the dataset as test 
        and the rest as training. 
        Otherwise, return the examples between start and end as test and the 
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            end = end or self.N
        else:
            if not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
                raise TypeError("Only fractions between ]0,1[ are allowed")

            start = self.N - int(self.N * test_portion)
            end = self.N

        test = DataSet(self.examples.iloc[indexes[range(start, end)]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[range(start)]], 
                                      self.examples.iloc[indexes[range(end, self.N)]]], axis=0))    
        return train, test
   ...
```

If `test_portion` is not given, the method returns the examples from position `start` up to (but not including) `end` as the test set and the rest of the data as the training set. If `test_portion` is provided, the last `test_portion` fraction of the examples is returned as test and the rest as training; it must be a float between 0 and 1. The `shuffle` parameter instructs the method to shuffle the data before splitting it, and `random_state` seeds that shuffle. The method returns two `DataSet` instances: the training set and the test set.
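
The boundary arithmetic can be checked in isolation. The sketch below copies only the part of the method that computes where the test set begins and ends; `split_bounds` is a hypothetical helper name used for illustration.

```python
def split_bounds(n, start=0, end=None, test_portion=None):
    """Return (start, end) of the test slice for a dataset of n examples.

    Hypothetical helper mirroring the index arithmetic of train_test_split.
    """
    if test_portion is None:
        end = end or n            # a missing end means "up to the last example"
    else:
        # int() rounds down, so the test set never exceeds the requested share.
        start = n - int(n * test_portion)
        end = n
    return start, end

print(split_bounds(27, test_portion=.25))  # (21, 27)
print(split_bounds(27, start=5, end=10))   # (5, 10)
```

With 27 examples and `test_portion=.25`, `int(27 * .25)` is 6, so positions 21 through 26 form the test set and the remaining 21 examples form the training set.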

Here is an example using this method.

In [45]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ta, te = ds.train_test_split(test_portion=.25, shuffle=False, random_state=17)
print('Training set = \n', ta)
print('Test set = \n', te)

Training set = 
      x1   x2         x1  y
0   7.0  3.0  11.506196  1
1   8.0  5.0  10.905182  1
2   4.0  5.0   9.054415  0
3   8.0  8.0  10.423384  0
4   6.0  5.0   9.488277  0
5   5.0  3.0   6.281749  0
6   3.0  6.0  10.770672  0
7   4.0  1.0   9.551137  1
8   8.0  5.0  11.872486  1
9   2.0  2.0  10.019604  1
10  4.0  5.0  10.099496  1
11  2.0  4.0   9.350159  0
12  8.0  7.0   9.219867  0
13  4.0  4.0   9.946844  1
14  4.0  5.0  16.170566  0
15  5.0  6.0   8.283784  0
16  7.0  4.0   9.858579  1
17  8.0  6.0   5.834166  1
18  4.0  2.0  10.303100  0
19  4.0  3.0  11.199927  0
20  7.0  5.0   8.724143  0
Test set = 
      x1   x2         x1  y
21  7.0  4.0  11.350370  0
22  4.0  4.0  13.075067  1
23  2.0  4.0  10.234860  1
24  5.0  7.0  10.154091  0
25  6.0  2.0   8.094857  0
26  2.0  2.0   6.011618  0


## Using this dataset class inside other notebooks

This class is part of the `mylib` library of this course, which is provided to you. Here is how to import this library:

In [27]:
import mylib as my

Once imported, one can use it like this:

In [46]:
ds = my.DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ta, te = ds.train_test_split(test_portion=.25, shuffle=False, random_state=17)
print('Training set = \n', ta)
print('Test set = \n', te)

Training set = 
      x1   x2         x1  y
0   2.0  7.0  12.887140  1
1   4.0  5.0   9.683472  0
2   3.0  3.0  10.322752  1
3   5.0  7.0   5.791450  1
4   7.0  5.0   7.556257  1
5   8.0  5.0   7.276206  0
6   2.0  7.0  12.139865  1
7   5.0  3.0  12.603362  1
8   4.0  4.0   9.702698  0
9   6.0  8.0   8.819233  0
10  5.0  7.0   8.986496  0
11  2.0  1.0   8.620929  1
12  5.0  6.0  10.357897  1
13  8.0  2.0   7.220334  1
14  3.0  3.0   8.465271  0
15  5.0  7.0  10.082646  0
16  5.0  4.0   9.471886  0
17  5.0  6.0   8.229844  1
18  4.0  3.0  12.095454  1
19  4.0  7.0  10.265396  1
20  6.0  7.0   6.748197  1
Test set = 
      x1   x2         x1  y
21  5.0  8.0  11.178396  1
22  5.0  7.0  12.315620  0
23  4.0  5.0  11.591339  1
24  4.0  6.0   9.993349  1
25  4.0  7.0  10.976606  0
26  4.0  2.0   7.586242  0


## EXERCISE

Refactor the above DataSet class by adding a method named `train_validation_test_split` to it. This method should split the data into three sets: training, validation, and test. This method should receive a dictionary parameter named `portions` specifying how much of the data is in each set. For a 75%/15%/10% split, one can use the following portions parameter:

```python
portions={"training": .75, 'validation': .15, 'test': .10 }
```

The method should support the `shuffle` parameter as well. You may call the `train_test_split` method internally. Make sure to include a comment describing how your implementation of the method works. Test your method on the `ds` dataset above and show that it works.

In [63]:
# TODO

def train_validation_test_split(self, portions, shuffle=False, random_state=None):
        """
        Splits the data into train, validation, and test sets.
        Will receive a dictionary parameter named portions specifying 
        how much of the data is in each set.
        """
        
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)
            
        'Use Dictionary.get() to get the values out of the portions'
        trainVal = portions.get("training")
        validVal = portions.get("validation")
        testVal = portions.get("test")  
        
        """
        Here we are using train_test_split to break the set into its first two parts.
        We will have the set of data remaining (the validation and training) and the
        test set
        """
        dataLeft, test = self.train_test_split(test_portion=testVal, shuffle=False)
        
        '''
        We will then find the portion of the remaining data that will be used for validation
        by using the original validVal to find the total number in the original dataset.
        Then we will get a new percentage by taking that number and dividing it by the new
        total in the new dataset.
        '''
        validStart = self.N - int(self.N * validVal)
        
        validPortion = validStart/dataLeft.N
        
        """
        We will then do the train_test_split on our new dataLeft set to get our validation and training sets
        """
        
        train, validation = dataLeft.train_test_split(test_portion=1-validPortion, shuffle=False)
        
        return train, validation, test

ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


portions={"training": .75, 'validation': .15, 'test': .10 }

train, validation, test = ds.train_validation_test_split(portions=portions, shuffle=True)

print('Training set: \n\n', train)
print('\nValidation set: \n', validation)
print('\nTest set: \n\n', test)


Training set: 

      x1   x2         x1  y
0   3.0  5.0  10.301051  0
1   4.0  5.0   8.230561  0
2   4.0  3.0   8.256209  1
3   2.0  4.0  11.183143  0
4   7.0  2.0   6.950015  0
5   5.0  6.0  10.610652  0
6   6.0  4.0   8.670299  1
7   3.0  2.0   9.745855  0
8   5.0  8.0   9.002924  1
9   4.0  7.0   9.398452  1
10  8.0  6.0  12.342579  0
11  3.0  1.0   8.958457  0
12  3.0  8.0  11.881169  1
13  5.0  1.0   6.823429  0
14  6.0  4.0  10.117719  0
15  7.0  1.0   9.964405  1
16  7.0  6.0  11.597774  0
17  5.0  5.0   9.318671  0
18  3.0  5.0   8.810315  0
19  4.0  2.0   8.233956  1
20  7.0  5.0   6.993485  1
21  4.0  7.0   9.835555  0
22  6.0  1.0  11.634258  1
23  8.0  3.0  10.699003  1

Validation set: 
      x1   x2         x1  y
24  5.0  6.0  10.614052  0

Test set: 

      x1   x2         x1  y
25  5.0  1.0  10.916891  1
26  8.0  1.0   9.575260  1
