# Data for ML


## Introduction

>When an ML model is trained, data that are relevant to the specific problem must be provided. For example, if you are attempting to predict the price of a house, you need to provide the model with a dataset of houses and their prices.

Furthering the example, here is a non-exhaustive list of the data required to define a house:
- The square footage.
- The number of bedrooms.
- The number of bathrooms.
- The year the house was built.

The important point is that you need a clear idea of what data you will provide. Define a good data structure up front: the data that will eventually be passed to the model should be consistent, with every example described by the same fields in the same format.

These steps are part of what is known as 'ML system design'.
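As a rough illustration of a consistent data structure, the sketch below describes every house with the same four fields from the list above. The class name `House`, the field names and the values are assumptions made up for this example.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class House:
    """One example; every house is described by the same four fields."""
    square_footage: float
    bedrooms: int
    bathrooms: int
    year_built: int


# Hypothetical values, invented purely for illustration.
houses = [
    House(square_footage=1500.0, bedrooms=3, bathrooms=2, year_built=1995),
    House(square_footage=900.0, bedrooms=2, bathrooms=1, year_built=1978),
]

# Every record exposes exactly the same keys, in the same order.
field_names = [list(asdict(h).keys()) for h in houses]
print(field_names[0])
```

Because the structure is fixed in one place, a record with a missing or misspelled field fails when it is created rather than later, during training.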

## ML system design

ML system design consists of interface, data (the data to be fed to the model) and model definitions. This is not always a step-by-step process; you may have to add more data/features or tune your model in an iterative process.

The interface refers to the way in which the model will be used. It includes the formats of the input features and the output labels (e.g. are the inputs vectors or scalars? Are the images in color or greyscale? Are the inputs required to have already been processed?), as well as the methods of the model that the user will be required to call. For example, if you are attempting to predict the price of a house, the interface defines which attributes of a house the user passes in, and in what format, to receive a predicted price. Additionally, your system should be intuitive to use.
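One possible way to write such an interface down is sketched here. The names `PriceModel` and `MeanPriceModel` are hypothetical, and the `fit`/`predict` method names are chosen to loosely mirror the convention used by `sklearn` estimators; the numbers are invented.

```python
from typing import List, Protocol, Sequence


class PriceModel(Protocol):
    """Hypothetical interface: the methods a user is expected to call."""

    def fit(self, features: Sequence[Sequence[float]],
            prices: Sequence[float]) -> None: ...

    def predict(self, features: Sequence[Sequence[float]]) -> List[float]: ...


class MeanPriceModel:
    """Trivial implementation that always predicts the mean training price."""

    def __init__(self) -> None:
        self.mean_price = 0.0

    def fit(self, features, prices) -> None:
        # The features are ignored; only the average price is remembered.
        self.mean_price = sum(prices) / len(prices)

    def predict(self, features) -> List[float]:
        # One prediction per input row.
        return [self.mean_price for _ in features]


model: PriceModel = MeanPriceModel()
# Each row is [square_footage, bedrooms]; values are made up.
model.fit([[1500.0, 3.0], [900.0, 2.0]], [300_000.0, 200_000.0])
predictions = model.predict([[1200.0, 3.0]])
print(predictions)
```

Any model that offers the same two methods with the same input and output formats can be swapped in without changing the calling code.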

### Data

Preparing the data is one of the most important steps in ML system design. As mentioned above, you need to have a clear idea of what data you will provide, create a good data structure, and subsequently use the same structure to both train the model and make predictions.

You can gain insights into the data by performing exploratory data analysis (EDA), which lets you form hypotheses about the data and identify the features that are relevant to the problem.

These data are divided into features and labels. A feature is a characteristic of the data that you use to make a prediction, while a label is the value that you are attempting to predict. For example, in the case of predicting the price of a house, the features include the square footage, number of bedrooms, number of bathrooms and year the house was built. The label is the price of the house.
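The split into features and labels can be sketched as follows. The records and their values are hypothetical; the names `X` and `y` follow the common convention for features and labels.

```python
# Hypothetical records, invented for illustration.
records = [
    {"square_footage": 1500.0, "bedrooms": 3, "bathrooms": 2,
     "year_built": 1995, "price": 300_000.0},
    {"square_footage": 900.0, "bedrooms": 2, "bathrooms": 1,
     "year_built": 1978, "price": 200_000.0},
]

feature_names = ["square_footage", "bedrooms", "bathrooms", "year_built"]

# Features: one row per house, one column per characteristic.
X = [[record[name] for name in feature_names] for record in records]

# Labels: the single value to predict for each house.
y = [record["price"] for record in records]

print(X)
print(y)
```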

However, you can derive more data about a house if you possess domain knowledge. For example, if you know that the square footage per number of rooms is related to the price of the house, you can create a new column to compute the square footage per number of rooms and use the new feature to train the model and make predictions. This is called `feature engineering`: deriving new features from the data that you already have. Such engineered features can improve the performance of the model. Feature engineering is an art that is quite difficult to master; however, as with all arts, the more you practice, the better you become.
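A minimal sketch of this engineered feature is shown below. It assumes that the number of rooms is the sum of bedrooms and bathrooms, which is only one possible definition; the helper name `add_sqft_per_room` and the values are hypothetical.

```python
def add_sqft_per_room(house: dict) -> dict:
    """Return a copy of the record with an extra engineered feature."""
    # Assumption: rooms = bedrooms + bathrooms.
    rooms = house["bedrooms"] + house["bathrooms"]
    return {**house, "sqft_per_room": house["square_footage"] / rooms}


# Hypothetical records, invented for illustration.
houses = [
    {"square_footage": 1500.0, "bedrooms": 3, "bathrooms": 2},
    {"square_footage": 1000.0, "bedrooms": 2, "bathrooms": 2},
]

engineered = [add_sqft_per_room(house) for house in houses]
print([house["sqft_per_room"] for house in engineered])
```

The same function must be applied to the data used for training and to any data later passed in for prediction, so that both share one structure.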


### Model


Once the data are available, the next step involves training the model. Over the next lessons, we will explore this in more detail; however, for now, be aware that training a model implies that the algorithm will learn to predict the labels from the features. There are many different algorithms for training models, and the choice of one depends on the problem, data, number of features, requirements of the problem, etcetera.

However, there are some rules of thumb that you can follow when training a model:
1. Train a simple model first to get a good baseline.
2. Train more complex models that surpass the baseline. Explore different algorithms, and compare them.
3. Once you have obtained the best model, you can improve it by adding more features/samples, increasing the complexity, etcetera.
4. Be careful not to overfit the model (we will discuss this concept later). One way to prevent overfitting is to remove features that are not relevant to the model; thus, you can go back in the process and train a model with fewer features. This process can be repeated until you obtain a good model.
5. Refine the model by finding good hyperparameters. A hyperparameter is a parameter that cannot be learnt by a model on its own; it must be set manually by the programmer. For example, if you are attempting to predict the price of a house, you can set the learning rate of the model, i.e. the speed at which the model learns. This is a hyperparameter because it is not learned by the model, but set manually.
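The rules of thumb above can be sketched with `sklearn` as follows. This is only an illustration under stated assumptions: the data are synthetic (generated with `make_regression`, not real house prices), the models are arbitrary choices, and the hyperparameter tuned here is the regularisation strength `alpha` of `Ridge` rather than a learning rate.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption; not a real housing dataset).
X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Rule 1: a simple baseline that always predicts the mean training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mse = mean_squared_error(y_val, baseline.predict(X_val))

# Rule 2: a more capable model that should surpass the baseline.
linear = LinearRegression().fit(X_train, y_train)
linear_mse = mean_squared_error(y_val, linear.predict(X_val))

# Rule 5: try several values of a hyperparameter and keep the best one.
ridge_mse = {}
for alpha in (0.1, 1.0, 10.0):
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    ridge_mse[alpha] = mean_squared_error(y_val, ridge.predict(X_val))
best_alpha = min(ridge_mse, key=ridge_mse.get)

print(f"baseline MSE: {baseline_mse:.1f}")
print(f"linear MSE:   {linear_mse:.1f}")
print(f"best alpha:   {best_alpha}")
```

Comparing every model on the same held-out validation split keeps the comparison fair; the baseline tells you whether a more complex model is actually learning anything.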

Over the next lessons, we will explore these concepts in more detail.


Here, we will briefly explain how to obtain a dataset using sklearn and train a model using that dataset.

## scikit-learn

> scikit-learn (abbreviated as `sklearn` and pronounced as "S-K-Learn") is a __high-level ML library__ containing:
> - ML algorithms
> - example datasets
> - data pre-processing & pipelines

Pair that with a simple API, and you obtain a powerful and easy-to-use tool for ML tasks.

In [1]:
import sklearn

print(sklearn.__version__)

1.0.1


__Note:__ Although `sklearn` only reached version 1.0 in 2021, it has been around for __more than 10 years__ and is used throughout the industry.

### Use cases
`sklearn` is used:

- in fast prototyping and testing ideas.
- as a __part__ of highly complicated pipelines.
- often as part of ML research.
- widely in production for particular models, such as decision trees and others.

### Simple demonstration

We will introduce `sklearn` as a simple tool that enables the easy demonstration of concepts without delving into details.

__Rest assured that we will explore algorithms in more detail later.__


### Data loading

As mentioned, `sklearn` provides a few ready-to-use datasets. Data are returned either as `np.array`s or, when `as_frame=True` is passed, as pandas objects such as `pd.DataFrame` (`np.array` is the more common of the two).

### Example

Load California Housing using `sklearn.datasets.fetch_california_housing()`. For more information, consult the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html).

Load the features and targets, and display the features.

In [15]:
from sklearn import datasets

X, y = datasets.fetch_california_housing(return_X_y=True, as_frame=True)
X

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 |


This is the `California` house-pricing dataset with `20640` examples, `8` features and the respective `y` targets.



#### Features


There are `8` features, including
- Median Income
- House Age
- Average number of Rooms

The features are represented by a matrix with as many rows as there are examples and as many columns as there are features for each example. We call this matrix the __design matrix__.
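As a small illustration of a design matrix, the sketch below builds one with NumPy from the first four rows and first three columns (`MedInc`, `HouseAge`, `AveRooms`) of the table above. The variable name `X_small` is an assumption made for this example.

```python
import numpy as np

# Rows are examples; columns are the features MedInc, HouseAge, AveRooms.
X_small = np.array([
    [8.3252, 41.0, 6.984127],
    [8.3014, 21.0, 6.238137],
    [7.2574, 52.0, 8.288136],
    [5.6431, 52.0, 5.817352],
])

n_examples, n_features = X_small.shape
print(n_examples, n_features)  # 4 3
```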

In [18]:
X.head()

|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


#### Targets

The targets `y` are simply the median house values (`MedHouseVal`, expressed in units of $100,000) that correspond to each row of features; these are the values that we wish to predict at the end of this notebook.

The targets are represented in the same way as the features, i.e. as a matrix with as many rows as there are examples. However, instead of the number of columns representing the number of features for each example, it now represents the number of targets for each example.

Most commonly, there is a single label for each example (such as a house price), which means that the labels form a vector. In other cases, each example may have multiple labels, such as in a detection problem, where you are predicting the corner coordinates of a box containing the item you intend to detect.
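The difference between the two cases shows up in the shape of the target array, as sketched below with NumPy. The single-target values are taken from the `MedHouseVal` column of this dataset; the box coordinates and the names `y_single` and `y_boxes` are hypothetical.

```python
import numpy as np

# One target per example: a vector with one entry per example.
y_single = np.array([4.526, 3.585, 3.521])

# Several targets per example: a matrix with one row per example.
# Hypothetical box corners as (x_min, y_min, x_max, y_max).
y_boxes = np.array([
    [10.0, 20.0, 50.0, 80.0],
    [15.0, 25.0, 60.0, 90.0],
    [5.0, 5.0, 40.0, 45.0],
])

print(y_single.shape)  # (3,)
print(y_boxes.shape)   # (3, 4)
```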

In [19]:
y[:5]

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal, dtype: float64

The features are also floating-point values, and we will use them together with the targets to train our algorithm.

## Conclusion

At this point, you should have a fair understanding of

- ML system design, as the process of defining the interface, data, and model.
- the iterative process of designing an ML system.
- `sklearn`, __features__, __targets__ and __labels__.