# 04: Working with datasets

It's important to have a clear and sensible way of representing datasets. In machine learning, a dataset is a table that consists of $n$ rows. Each row is called an example or a sample. Its columns are divided into two portions: input and output. The input portion consists of $m$ columns called features. In other words, $m$ represents the dimensionality of the dataset. The Iris dataset, for example, has *4* input features, meaning that it is a dataset with a dimensionality of *4*. The output portion exists only in supervised learning. It consists of one or more columns called targets.

Mathematically, a supervised learning dataset can be thought of as a matrix of the form:

$\boldsymbol{D} =\left[\begin{array}{cccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)} & y^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)} & y^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)} & y^{(3)}\\
  \vdots    & \vdots    & \vdots    & \cdots & \vdots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)} & y^{(n)}
\end{array}\right]$

As you can see, each row of this matrix is a data example consisting of the $m$ input features plus the target column (typically the last column). The $\boldsymbol{D}$ matrix can be broken into two components: the input matrix $\boldsymbol{X}$ and the target vector $\boldsymbol{y}$, where:

$\boldsymbol{X} =\left[\begin{array}{ccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)}\\
  \vdots    & \vdots    & \vdots    & \cdots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)}
\end{array}\right]$

and

$\boldsymbol{y} =\left[\begin{array}{c} 
  y^{(1)}\\ 
  y^{(2)}\\
  y^{(3)}\\
  \vdots \\
  y^{(n)}
\end{array}\right]$

For unsupervised learning, $\boldsymbol{D}$ is the same as $\boldsymbol{X}$; there is no target vector.
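As a minimal sketch of this decomposition, assuming the dataset is stored as a NumPy array with the target in the last column (the small array `D` below is made up for illustration), slicing recovers $\boldsymbol{X}$ and $\boldsymbol{y}$:

```python
import numpy as np

# A made-up dataset with n = 4 examples, m = 3 features, and one target column.
D = np.array([
    [2.0, 3.0,  9.1, 0.0],
    [3.0, 4.0, 11.0, 1.0],
    [4.0, 4.0, 12.5, 0.0],
    [2.0, 8.0, 11.0, 1.0],
])

X = D[:, :-1]   # every column except the last: the input matrix
y = D[:, -1]    # the last column: the target vector

n, m = X.shape
print(n, m)      # 4 3
print(y.shape)   # (4,)
```

For an unsupervised dataset there would be nothing to slice off, and `X` would simply be `D`.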

We could use a Python class to represent datasets, but that would be unnecessary. We will instead use NumPy arrays and sometimes Pandas dataframes for representing datasets. This is the approach that the popular machine learning library `scikit-learn` follows, and we will do the same.

Here is an example dataset with three input features and one target. We start with the input feature matrix $\mathbf{X}$; the uppercase $\mathbf{X}$ indicates a matrix with more than one column. The target column is typically called $\mathbf{y}$; the lowercase $\mathbf{y}$ indicates a vector (one column).

Here is the input matrix $\mathbf{X}$: 

In [1]:
import numpy as np
import pandas as pd

X = np.array([
        np.random.randint(2,6, 27),                # x1
        np.random.randint(1,9, 27),                # x2
        np.random.normal(loc=10, scale=2, size=27) # x3
    ]).T

X

array([[ 2.        ,  3.        ,  9.08571715],
       [ 3.        ,  4.        , 11.02867118],
       [ 4.        ,  4.        , 12.49777946],
       [ 2.        ,  8.        , 10.99971967],
       [ 5.        ,  6.        ,  9.27535658],
       [ 3.        ,  5.        ,  9.29852643],
       [ 5.        ,  7.        , 11.67872293],
       [ 3.        ,  8.        ,  7.99787447],
       [ 2.        ,  2.        ,  8.52725862],
       [ 4.        ,  1.        , 12.75729435],
       [ 4.        ,  3.        , 12.77159384],
       [ 3.        ,  6.        ,  9.28819879],
       [ 3.        ,  5.        ,  9.36185114],
       [ 3.        ,  8.        ,  8.37275863],
       [ 4.        ,  6.        ,  8.07569704],
       [ 2.        ,  3.        ,  9.61295022],
       [ 2.        ,  2.        , 11.79023666],
       [ 4.        ,  3.        ,  9.82731865],
       [ 3.        ,  4.        ,  9.71159121],
       [ 2.        ,  1.        , 11.89886391],
       [ 3.        ,  1.        ,  9.204

And here is the target $\mathbf{y}$:

In [2]:
y = np.random.randint(0,2, 27)
y

array([0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 0])

We can then use both components $\mathbf{X}$ and $\mathbf{y}$ to create the full dataset $\mathbf{D}$:

In [3]:
D = np.concatenate([X, y.reshape(len(X), -1)], axis=1)
D

array([[ 2.        ,  3.        ,  9.08571715,  0.        ],
       [ 3.        ,  4.        , 11.02867118,  1.        ],
       [ 4.        ,  4.        , 12.49777946,  0.        ],
       [ 2.        ,  8.        , 10.99971967,  1.        ],
       [ 5.        ,  6.        ,  9.27535658,  1.        ],
       [ 3.        ,  5.        ,  9.29852643,  0.        ],
       [ 5.        ,  7.        , 11.67872293,  1.        ],
       [ 3.        ,  8.        ,  7.99787447,  1.        ],
       [ 2.        ,  2.        ,  8.52725862,  0.        ],
       [ 4.        ,  1.        , 12.75729435,  0.        ],
       [ 4.        ,  3.        , 12.77159384,  1.        ],
       [ 3.        ,  6.        ,  9.28819879,  0.        ],
       [ 3.        ,  5.        ,  9.36185114,  0.        ],
       [ 3.        ,  8.        ,  8.37275863,  1.        ],
       [ 4.        ,  6.        ,  8.07569704,  1.        ],
       [ 2.        ,  3.        ,  9.61295022,  0.        ],
       [ 2.        ,  2.

## Using dataframes

To display this dataset in a nice tabular format with column headings, we can use a Pandas dataframe.

In [4]:
ds = pd.DataFrame(D, columns=['x1', 'x2', 'x3', 'y'])
ds

    x1   x2         x3    y
0  2.0  3.0   9.085717  0.0
1  3.0  4.0  11.028671  1.0
2  4.0  4.0  12.497779  0.0
3  2.0  8.0  10.999720  1.0
4  5.0  6.0   9.275357  1.0
5  3.0  5.0   9.298526  0.0
6  5.0  7.0  11.678723  1.0
7  3.0  8.0   7.997874  1.0
8  2.0  2.0   8.527259  0.0
9  4.0  1.0  12.757294  0.0


To make it easier to move data back and forth between NumPy arrays and Pandas dataframes, here are a few functions.
* `to_dataframe`: converts NumPy $\mathbf{X}$ and $\mathbf{y}$ arrays into a dataframe $\mathbf{D}$.
* `from_dataframe`: converts a dataframe $\mathbf{D}$ to NumPy $\mathbf{X}$ and $\mathbf{y}$ arrays.
* `print_dataset`: prints NumPy $\mathbf{X}$ and $\mathbf{y}$ arrays in a nice tabular format.

In [5]:
def to_dataframe(X, y=None, features=None, target=None):
    """
    Puts X and y into a data frame and returns it.
    - X, y: The input and target data.
    - features: The names of the input data features.
    - target: The name (or list of names) of the target column(s).
    """
    M = X.shape[1]
    # Copy the given names so the caller's list is not modified below.
    columns = [ f"x{i + 1}" for i in range(M) ] if features is None else list(features)

    # Without targets (unsupervised data), the data frame holds X only.
    if y is None:
        return pd.DataFrame(X, columns=columns)

    if y.ndim == 1:
        y = y.reshape(len(X), -1)

    T = y.shape[1] # number of target columns
    if T == 1:
        columns.append("y" if target is None else target)
    else:
        columns += [ f"y{i + 1}" for i in range(T) ] if target is None else list(target)

    return pd.DataFrame(np.concatenate([X, y], axis=1), columns=columns)
    
def from_dataframe(df, ntargets=1):
    """
    Separates a given data frame into NumPy input and target arrays.
    - df: The data frame 
    - ntargets: The number of target columns at the end of the data frame.
    """
    ntargets = 0 if ntargets is None else ntargets
    if ntargets > 0:
        X = df.iloc[:, :-ntargets].values.squeeze()
        y = df.iloc[:, -ntargets:].values.squeeze()
        features = list(df.columns[:-ntargets])
        targets = list(df.columns[-ntargets:])
        
        return X, y, features, targets
    else:
        X = df.values
        features = list(df.columns)
        
        return X, features

def print_dataset(X, y=None, name=None, features=None, target=None):
    """
    Puts X and y into a data frame and prints it.
    - X, y: The input and target data.
    - name: An optional title printed above the table.
    - features: The names of the input data features.
    - target: The name of the target column.
    """
    if name is not None:
        print(name)
        
    print(to_dataframe(X, y=y, features=features, target=target))

Here is how the above dataset prints using the `print_dataset` function.

In [6]:
print_dataset(X, y)

     x1   x2         x3    y
0   2.0  3.0   9.085717  0.0
1   3.0  4.0  11.028671  1.0
2   4.0  4.0  12.497779  0.0
3   2.0  8.0  10.999720  1.0
4   5.0  6.0   9.275357  1.0
5   3.0  5.0   9.298526  0.0
6   5.0  7.0  11.678723  1.0
7   3.0  8.0   7.997874  1.0
8   2.0  2.0   8.527259  0.0
9   4.0  1.0  12.757294  0.0
10  4.0  3.0  12.771594  1.0
11  3.0  6.0   9.288199  0.0
12  3.0  5.0   9.361851  0.0
13  3.0  8.0   8.372759  1.0
14  4.0  6.0   8.075697  1.0
15  2.0  3.0   9.612950  0.0
16  2.0  2.0  11.790237  1.0
17  4.0  3.0   9.827319  1.0
18  3.0  4.0   9.711591  0.0
19  2.0  1.0  11.898864  1.0
20  3.0  1.0   9.204777  1.0
21  4.0  5.0   6.944344  1.0
22  2.0  6.0  13.933578  0.0
23  3.0  8.0  10.062039  0.0
24  5.0  4.0   8.553481  0.0
25  5.0  1.0   7.778517  0.0
26  2.0  1.0  10.824182  0.0
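The round trip between NumPy arrays and a dataframe can also be sketched with Pandas calls alone. This is a simplified illustration on a made-up three-row dataset, not the `to_dataframe` and `from_dataframe` functions themselves, and it assumes a single target stored in the last column:

```python
import numpy as np
import pandas as pd

X = np.array([[2.0, 3.0], [3.0, 4.0], [4.0, 4.0]])
y = np.array([0, 1, 0])

# NumPy -> dataframe: stack y as an extra column and name the columns.
df = pd.DataFrame(np.column_stack([X, y]), columns=["x1", "x2", "y"])

# Dataframe -> NumPy: the last column is the target, the rest are features.
X_back = df.iloc[:, :-1].to_numpy()
y_back = df.iloc[:, -1].to_numpy()

print(np.array_equal(X, X_back))  # True
print(y_back.dtype)               # float64
```

Because a NumPy array holds a single data type, stacking integer targets next to floating-point features converts the targets to floats. This is why the `y` column prints as `0.0` and `1.0` in the tables above.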


## Shuffling data

Datasets are almost always shuffled before being used for training. Here is a function for doing so. The `random_state` parameter seeds the random number generator so that the shuffle can be reproduced.

In [7]:
def shuffled(X, y, random_state=None):
    """
    Shuffles the X, y.
    """
    rgen = np.random.RandomState(random_state)
    
    indexes = rgen.permutation(len(X))
    
    return X[indexes], y[indexes]

In [8]:
shuffled(X, y)

(array([[ 4.        ,  3.        ,  9.82731865],
        [ 2.        ,  2.        , 11.79023666],
        [ 4.        ,  1.        , 12.75729435],
        [ 3.        ,  8.        ,  7.99787447],
        [ 2.        ,  2.        ,  8.52725862],
        [ 3.        ,  4.        , 11.02867118],
        [ 3.        ,  4.        ,  9.71159121],
        [ 2.        ,  1.        , 10.82418181],
        [ 2.        ,  3.        ,  9.08571715],
        [ 3.        ,  5.        ,  9.29852643],
        [ 5.        ,  7.        , 11.67872293],
        [ 3.        ,  5.        ,  9.36185114],
        [ 5.        ,  6.        ,  9.27535658],
        [ 3.        ,  8.        ,  8.37275863],
        [ 5.        ,  4.        ,  8.55348133],
        [ 5.        ,  1.        ,  7.77851658],
        [ 3.        ,  6.        ,  9.28819879],
        [ 3.        ,  1.        ,  9.20477673],
        [ 4.        ,  5.        ,  6.9443438 ],
        [ 3.        ,  8.        , 10.06203902],
        [ 4.        
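Two properties of this approach are worth checking: the same `random_state` reproduces the same order, and a single permutation keeps each row of `X` paired with its target. The sketch below uses a hypothetical helper, `shuffled_pair`, that mirrors the logic of `shuffled` on made-up data so that it can run on its own:

```python
import numpy as np

X = np.arange(10).reshape(5, 2)   # rows: [0 1], [2 3], [4 5], [6 7], [8 9]
y = np.arange(5)                  # y[i] labels row i, so X[i, 0] == 2 * y[i]

def shuffled_pair(X, y, random_state=None):
    # One permutation indexes both arrays, so rows and targets stay aligned.
    rgen = np.random.RandomState(random_state)
    indexes = rgen.permutation(len(X))
    return X[indexes], y[indexes]

Xa, ya = shuffled_pair(X, y, random_state=17)
Xb, yb = shuffled_pair(X, y, random_state=17)

print(np.array_equal(Xa, Xb) and np.array_equal(ya, yb))  # True: same seed, same order
print(np.array_equal(Xa[:, 0], 2 * ya))                   # True: rows still match their targets
```

Shuffling `X` and `y` with two separate permutations would break this pairing, which is why both arrays are indexed with the same `indexes`.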

## Splitting data

Another useful operation commonly performed on datasets is splitting them into training and testing sets. Here is a function that does that.

The `test_size` parameter is the fraction of the data to return as the test set; the rest of the data is returned as the training set. The `shuffle` parameter instructs the function to shuffle the data before splitting it, and `random_state` seeds that shuffle. The function returns four arrays: the training and test inputs, followed by the training and test targets.

The function is defined as follows.

In [9]:
def train_test_split(X, y, test_size=.25, shuffle=True, random_state=None):
    """
    Splits the dataset into a training set and a test set. Returns the
    test_size fraction of the dataset as test and the rest as training.
    """
    # Validate the input before doing any work.
    if not isinstance(test_size, float) or test_size < 0.0 or test_size > 1.0:
        raise TypeError("test_size must be a float between 0 and 1.")

    if shuffle:
        rgen = np.random.RandomState(random_state)
        indexes = rgen.permutation(len(X))
        X, y = X[indexes], y[indexes]

    # The first split_ndx examples form the test set; the rest form the training set.
    split_ndx = int(test_size * len(X))

    # Slicing along the first axis works for both 1-D and 2-D targets.
    return X[split_ndx:], X[:split_ndx], y[split_ndx:], y[:split_ndx]

Here is an example:

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)
print_dataset(X_train, y_train, name="Training Dataset")
print_dataset(X_test, y_test, name="Testing Dataset")

X_train.shape, X_test.shape, y_train.shape, y_test.shape

Training Dataset
     x1   x2         x3    y
0   3.0  8.0  10.062039  0.0
1   4.0  5.0   6.944344  1.0
2   2.0  8.0  10.999720  1.0
3   4.0  4.0  12.497779  0.0
4   4.0  3.0  12.771594  1.0
5   2.0  2.0   8.527259  0.0
6   4.0  1.0  12.757294  0.0
7   5.0  4.0   8.553481  0.0
8   4.0  6.0   8.075697  1.0
9   3.0  4.0  11.028671  1.0
10  3.0  4.0   9.711591  0.0
11  3.0  5.0   9.361851  0.0
12  3.0  8.0   7.997874  1.0
13  5.0  6.0   9.275357  1.0
14  3.0  8.0   8.372759  1.0
15  2.0  2.0  11.790237  1.0
16  2.0  6.0  13.933578  0.0
17  5.0  7.0  11.678723  1.0
18  2.0  1.0  10.824182  0.0
19  4.0  3.0   9.827319  1.0
20  2.0  3.0   9.612950  0.0
Testing Dataset
    x1   x2         x3    y
0  5.0  1.0   7.778517  0.0
1  2.0  3.0   9.085717  0.0
2  2.0  1.0  11.898864  1.0
3  3.0  1.0   9.204777  1.0
4  3.0  5.0   9.298526  0.0
5  3.0  6.0   9.288199  0.0


((21, 3), (6, 3), (21,), (6,))
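These sizes follow directly from the split index computed inside the function. The hypothetical helper `split_sizes` below, written only for illustration, repeats that arithmetic:

```python
def split_sizes(n, test_size):
    # Same arithmetic as in train_test_split above: the test set gets the
    # first int(test_size * n) rows, and the training set gets the rest.
    n_test = int(test_size * n)
    return n - n_test, n_test

print(split_sizes(27, 0.25))  # (21, 6)
print(split_sizes(30, 0.33))  # (21, 9)
```

Since `int()` truncates, the test set is rounded down: a quarter of 27 examples is 6.75, which yields 6 test examples and 21 training examples.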

## Using the `mylib` package

The above functions will be used in the upcoming weeks of this class. To facilitate such usage, they have been placed inside a simple for-this-class-only package named `mylib`, which you can download from GitHub at https://github.com/aalgahmi/mylib. Here is how you can use the `git` command line to do so:

* Open a terminal window and change its current directory to where your handout notebooks are. For example:
  ```
  cd handouts
  ```
  
  Note that you can launch a terminal from inside JupyterLab using the menu **File/New/Terminal**.
* Run the following command to download this package (clone it) from GitHub:

  ```
  git clone https://github.com/aalgahmi/mylib.git
  ```

This will only need to be done once. After that, you can import this package using a statement like this:

In [11]:
import mylib as my

Once imported, let's create a dataset:

In [12]:
X = np.array([
    np.random.randint(-2,3, 30),
    np.random.randint(1,2, 30)]).T

y = np.array([
    np.random.randint(0,3, 30),
    np.random.randint(2,4, 30)]).T

Let's use this library to print this dataset:

In [13]:
my.print_dataset(X, y, target=['t1', 't2'], features=['a1', 'a2'])

    a1  a2  t1  t2
0    2   1   2   2
1   -1   1   1   2
2   -2   1   2   3
3    2   1   0   3
4    2   1   0   2
5   -2   1   1   2
6   -1   1   1   3
7   -1   1   1   3
8    0   1   0   3
9    0   1   0   3
10   1   1   0   2
11  -1   1   2   3
12   2   1   0   2
13   1   1   2   3
14   2   1   1   2
15  -1   1   0   2
16  -1   1   2   3
17   2   1   1   2
18  -1   1   1   3
19   0   1   0   3
20   1   1   0   2
21   2   1   0   2
22  -2   1   2   2
23   0   1   1   2
24  -1   1   2   2
25   2   1   2   3
26   1   1   0   2
27   1   1   1   2
28   2   1   0   3
29   1   1   1   3


We can also use the `train_test_split` function to shuffle and split this dataset into two sets for training and testing.

In [14]:
X_train, X_test, y_train, y_test = my.train_test_split(X, y, test_size=.33, random_state=17)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((21, 2), (9, 2), (21, 2), (9, 2))

Let's print these splits:

In [15]:
my.print_dataset(X_train, y_train, name="Training Dataset")

Training Dataset
    x1  x2  y1  y2
0    1   1   0   2
1    2   1   0   3
2   -2   1   2   3
3    1   1   0   2
4    0   1   0   3
5    0   1   0   3
6    2   1   0   3
7    2   1   1   2
8   -1   1   1   2
9    1   1   0   2
10   2   1   0   2
11  -1   1   1   3
12   2   1   0   2
13   2   1   2   3
14   1   1   2   3
15  -1   1   2   3
16  -2   1   2   2
17  -1   1   1   3
18   1   1   1   3
19   2   1   1   2
20  -1   1   0   2


In [16]:
my.print_dataset(X_test, y_test, name="Testing Dataset")

Testing Dataset
   x1  x2  y1  y2
0   2   1   2   2
1  -1   1   1   3
2   2   1   0   2
3  -1   1   2   3
4  -2   1   1   2
5   0   1   1   2
6   1   1   1   2
7   0   1   0   3
8  -1   1   2   2


## Introducing `scikit-learn`

Scikit-learn is a popular general-purpose machine learning library for Python programmers. It is "an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities." 

The `scikit-learn` library is very well designed. [This paper](https://arxiv.org/abs/1309.0238) outlines the main design principles that went into its creation. In a nutshell, they are:

**Consistency**
All objects share a consistent and simple interface that distinguishes between three different kinds of objects:

* **Estimators**: Any object that can estimate some parameters based on a dataset is called an estimator. All estimators implement a `fit()` method. This is where the estimation (learning) itself is performed.
* **Transformers**: These are estimators that can transform datasets. The transformation is performed by the `transform()` method. All transformers also have a convenient method called `fit_transform()`, which is equivalent to calling `fit()` and then `transform()`.
* **Predictors**: Some estimators, like those of supervised learning, are capable of making predictions and therefore are called predictors. All predictors implement two methods: `predict()` and `score()`. While the `predict()` method returns the predicted values, the `score()` method returns a measure of the quality of these predictions (for example, accuracy for classification problems and the coefficient of determination $R^2$ for regression problems).
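The three kinds of objects can be seen side by side in a short sketch. The choice of `StandardScaler` as the transformer and `KNeighborsClassifier` as the predictor is an assumption made for illustration; any other transformer or classifier from the library would follow the same pattern:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer: fit() learns the column means and standard deviations,
# transform() applies them; fit_transform() does both in one call.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Predictor: fit() learns from the data, predict() returns class labels,
# and score() returns the accuracy for a classifier.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_scaled, y)
print(model.predict(X_scaled[:3]))
print(model.score(X_scaled, y))
```

Scoring on the same data used for fitting, as done here, only demonstrates the interface; it is not a fair estimate of how well the model generalizes.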

**Inspection**
All hyperparameters are accessible as public instance variables, and all learned (estimated from data) parameters are accessible as public instance variables with an underscore suffix.
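As a small sketch of this convention, again assuming `StandardScaler` as the example estimator, `with_mean` is a hyperparameter set in the constructor, while `mean_` is learned from the data and only exists after `fit()` is called:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
scaler = StandardScaler(with_mean=True)

print(scaler.with_mean)           # hyperparameter, set in the constructor
print(hasattr(scaler, "mean_"))   # False: nothing has been learned yet

scaler.fit(X)
print(scaler.mean_)               # learned parameter: the per-feature means
print(scaler.get_params())        # all hyperparameters as a dictionary
```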

**Nonproliferation of classes**
Datasets are represented as NumPy arrays or SciPy sparse matrices. Classes are used for estimators, transformers, and predictors.

**Composition**
New estimators can be created from existing building blocks. This is done using pipelines and feature unions.
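A minimal sketch of composition with a pipeline is shown below. The two building blocks, `StandardScaler` and `LogisticRegression`, and the `max_iter=1000` setting are choices made for illustration only:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The pipeline is itself an estimator: fit() fits the scaler, transforms X,
# and then fits the classifier on the scaled data.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

print(pipe.predict(X[:3]))
print(pipe.score(X, y))
```

Because the pipeline exposes the same `fit()`, `predict()`, and `score()` methods as a single predictor, it can be used anywhere a predictor is expected.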

**Sensible defaults**
Finally, `scikit-learn` provides reasonable default values for most parameters, making it easier to use.

As you use the classes and functions of `scikit-learn`, make sure to always reference [their documentation pages](https://scikit-learn.org/stable/) for explanations and code examples.

Let's import `scikit-learn`, inspect its version, and use it to split a dataset.

In [17]:
import sklearn

Here is the current version we are using:

In [18]:
sklearn.__version__

'1.2.0'

`scikit-learn` consists of multiple packages/modules. One such package is called `datasets`. It contains a few popular sample datasets as well as various functions for creating sample datasets. Let's use this package to load the Iris dataset.

In [19]:
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

where `X` is an input array with 150 examples (also known as samples or rows) and 4 features, and `y` is the target (output or ground truth) column with three classes (encoded as 0, 1, and 2) representing the following Iris flower types:

In [20]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

Here is how this dataset looks:

In [21]:
my.to_dataframe(X, y)

      x1   x2   x3   x4    y
0    5.1  3.5  1.4  0.2  0.0
1    4.9  3.0  1.4  0.2  0.0
2    4.7  3.2  1.3  0.2  0.0
3    4.6  3.1  1.5  0.2  0.0
4    5.0  3.6  1.4  0.2  0.0
..   ...  ...  ...  ...  ...
145  6.7  3.0  5.2  2.3  2.0
146  6.3  2.5  5.0  1.9  2.0
147  6.5  3.0  5.2  2.0  2.0
148  6.2  3.4  5.4  2.3  2.0
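The generic headings `x1` to `x4` can be replaced with the names that ship with the dataset. The sketch below builds the dataframe with Pandas directly; the column name `species` is an assumption chosen for readability:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Use the dataset's own feature names as column headings and map the
# integer targets (0, 1, 2) to their flower type names.
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

print(df.head())
print(df["species"].value_counts())
```

Each of the three flower types accounts for 50 of the 150 examples, so the dataset is balanced.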


## Splitting data in `scikit-learn`

To train supervised machine learning models to recognize these flower types, we need to split the dataset into two portions: training and testing. We can use the `train_test_split` function from the `model_selection` package to do that. This function shuffles the data by default and can be called in the same way as the above `train_test_split` function.

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((120, 4), (30, 4), (120,), (30,))

where `X_train` and `X_test` are the input portions of the training and test datasets, and `y_train` and `y_test` are the corresponding output portions. We'll set the test dataset aside for now. Let's print the first few examples of the training dataset and make sure that it's shuffled.

In [23]:
X_train[:5, :], y_train[:5]

(array([[6.8, 2.8, 4.8, 1.4],
        [6.3, 2.5, 5. , 1.9],
        [5.7, 3.8, 1.7, 0.3],
        [6.2, 3.4, 5.4, 2.3],
        [6.9, 3.1, 5.4, 2.1]]),
 array([1, 2, 0, 2, 2]))
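Because the shuffle is random, the rows shown above will differ from run to run. A hedged sketch of two optional parameters of the `scikit-learn` function follows: `random_state` makes the split reproducible, and `stratify` keeps the class proportions equal in both sets. The seed value 17 is arbitrary:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps the class proportions
# of y the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=17, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_test))          # [10 10 10]
```

Without `stratify`, a small test set can end up with noticeably more examples of one class than another, which makes the test score less representative.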

## EXERCISE

Do Exercise 2 - PART A