# Data for ML


When you train a machine learning model, you need to provide it with data that is relevant to the specific problem. For example, if you are trying to predict the price of a house, you need to provide the model with a dataset of houses and their prices.

So, you can ask yourself, "What data do I need to define a house?". Some examples could be:
- The square footage of the house
- The number of bedrooms in the house
- The number of bathrooms in the house
- The year the house was built

There are many other characteristics you could use.

The point is that you need a clear idea of what data you are going to provide, and a well-defined data structure, so that the data you collect and eventually pass to the model is consistent.
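One way to keep the data consistent is to write the structure down explicitly. The following is a minimal sketch, assuming a hypothetical `House` record with the four characteristics listed above; the field names and values are illustrative only.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class House:
    """One example: every house is described by the same fields, in the same order."""
    square_footage: float
    bedrooms: int
    bathrooms: int
    year_built: int


# Hypothetical values, for illustration only.
houses = [
    House(square_footage=1500.0, bedrooms=3, bathrooms=2, year_built=1995),
    House(square_footage=900.0, bedrooms=2, bathrooms=1, year_built=1978),
]

# Every record converts to a row with the same length and column order.
rows = [astuple(house) for house in houses]
print(rows)
```

Because every record goes through the same definition, a missing or misnamed field fails when the record is created rather than silently reaching the model.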

These steps are part of the process of what is known as "Machine Learning system design".

## Machine Learning system design

Machine Learning system design consists of defining the interface, the data that is going to be fed to the model, and the model itself. This process is not always done step by step: you may find that you have to add more data, add more features, or tune your model in an iterative process.

When we talk about the interface, we mean the way the model is intended to be used. It includes the format of the input features and of the output labels (for example, are they vectors, scalars, color or greyscale images? Must the inputs already have been processed in some way?), and the methods of the model that the user will need to call. For example, if you are trying to predict the price of a house, the user needs to provide the model with a dataset of houses and their prices, so you need to make it easy for the user to know how to use your system.
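As an illustration, an interface can be written down before any real model exists. The sketch below assumes a hypothetical `PricePredictor` protocol with `fit` and `predict` methods, plus a trivial `MeanPricePredictor` that satisfies it; none of these names come from a library.

```python
from typing import List, Protocol, Sequence


class PricePredictor(Protocol):
    """Hypothetical interface: the methods and formats a user has to know."""

    def fit(self, features: Sequence[Sequence[float]], prices: Sequence[float]) -> None:
        ...

    def predict(self, features: Sequence[Sequence[float]]) -> List[float]:
        ...


class MeanPricePredictor:
    """Trivial implementation: always predicts the mean training price."""

    def __init__(self) -> None:
        self.mean_price = 0.0

    def fit(self, features, prices) -> None:
        # The interface requires one price per row of features.
        if len(features) != len(prices):
            raise ValueError("features and prices must have the same number of rows")
        self.mean_price = sum(prices) / len(prices)

    def predict(self, features) -> List[float]:
        # One prediction per input row.
        return [self.mean_price for _ in features]


model: PricePredictor = MeanPricePredictor()
model.fit([[1500.0, 3], [900.0, 2]], [300_000.0, 200_000.0])
predictions = model.predict([[1200.0, 3]])
print(predictions)  # [250000.0]
```

The point of the sketch is the contract, not the model: inputs are rows of numbers, outputs are one price per row, and the user only needs `fit` and `predict`.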

### Data

Preparing the data is one of the most important steps in the process of machine learning system design. As mentioned above, you need to have a clear idea of what data you are going to provide, create a good data structure, and then use the same structure for both training the model, and making predictions with it.

You can get insights into the data by performing exploratory data analysis (EDA). This lets you form hypotheses about the data and see which features are relevant to the problem.
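A first pass at EDA often consists of summary statistics and correlations. The following is a small sketch using `pandas`; the table of houses is made-up toy data, included only so that the example runs.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
houses = pd.DataFrame(
    {
        "square_footage": [900, 1200, 1500, 2000, 2400],
        "bedrooms": [2, 3, 3, 4, 4],
        "year_built": [1978, 1985, 1995, 2005, 2015],
        "price": [200_000, 250_000, 300_000, 390_000, 450_000],
    }
)

# Summary statistics: count, mean, spread, and range of every column.
print(houses.describe())

# Correlation of each candidate feature with the price, strongest first.
correlations = houses.corr()["price"].drop("price").sort_values(ascending=False)
print(correlations)
```

A strong correlation is only a hint that a feature may be useful; it is a hypothesis to test with a model, not a conclusion.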

This data is divided into features and labels. A feature is a characteristic of an example that you use to make the prediction, and a label is the value that you are trying to predict. For example, when predicting the price of a house, the features could be the square footage, the number of bedrooms, the number of bathrooms, and the year the house was built. The label is the price of the house.
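In code, this split usually means separating one column from the rest. A minimal sketch with `pandas`, assuming a hypothetical table whose `price` column is the label:

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
houses = pd.DataFrame(
    {
        "square_footage": [1500, 900, 2400],
        "bedrooms": [3, 2, 4],
        "bathrooms": [2, 1, 3],
        "year_built": [1995, 1978, 2010],
        "price": [300_000, 200_000, 450_000],
    }
)

# Features: the columns used to make the prediction.
X = houses.drop(columns=["price"])
# Label: the column we want to predict.
y = houses["price"]

print(X.shape, y.shape)  # (3, 4) (3,)
```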

However, you can get more out of the data if you have domain knowledge. For example, if you know that the square footage per room is related to the price of the house, you can create a new column that computes it, and use the new feature both to train the model and to make predictions. This is called feature engineering: deriving new features from the data that you already have. Good engineered features can improve the performance of the model. Feature engineering is an art, and it is quite difficult to master, but the more you practice, the easier it will come to you.
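The square-footage-per-room example can be sketched as follows. The helper name `add_sqft_per_room` and the column names are assumptions made for this illustration.

```python
import pandas as pd


def add_sqft_per_room(houses: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the table with an engineered sqft_per_room column."""
    result = houses.copy()
    result["sqft_per_room"] = result["square_footage"] / result["rooms"]
    return result


# Hypothetical toy data, for illustration only.
houses = pd.DataFrame(
    {
        "square_footage": [1500.0, 900.0, 2400.0],
        "rooms": [5, 4, 8],
    }
)

engineered = add_sqft_per_room(houses)
print(engineered)
```

Putting the transformation in a function makes it easy to apply exactly the same step to the training data and to any new data you want predictions for.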


### Model


Once we have the data, we need to train the model. We will go into more depth on this in the next lessons, but for now, you should know that training a model means that the algorithm learns to predict the labels from the features. There are many different algorithms that you can use to train a model, and the choice can depend on the problem, the data, the number of features, the requirements of the problem, and so on. As with many things in AI, the answer is always: "It depends"!

However, there are some rules of thumb that you can follow when you are training a model:
1. Train a simple model first to get a good baseline.
2. Train more complex models that surpass the baseline. Try different algorithms and compare them all.
3. Once you have obtained the best model, you can improve it by adding more features, adding more samples, increasing the complexity of the model, etc.
4. Be careful not to overfit the model; we will come to this concept later. One way to reduce overfitting is to remove features that are not relevant, so you can go back in the process and train a model with fewer features. This process can be repeated until you obtain a good model.
5. Fine-tune the model by finding good hyperparameters. A hyperparameter is a parameter that the model can't learn from the data, so we have to set it manually. For example, if you are trying to predict the price of a house with a model that is trained iteratively, you can set the learning rate, which controls how large each update to the model is during training.
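The first two rules can be sketched with `sklearn` as below. This is an illustration under stated assumptions, not a recipe: the data is synthetic, the three models are arbitrary choices, and `max_depth` stands in as the hyperparameter (instead of a learning rate) because it is one a decision tree actually has.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): the target depends
# linearly on two features, plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, size=500)

# Hold out part of the data so that models are compared on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    # Rule 1: a simple baseline that always predicts the mean training target.
    "baseline (mean)": DummyRegressor(strategy="mean"),
    # Rule 2: other algorithms, compared against the baseline.
    "linear regression": LinearRegression(),
    # max_depth is a hyperparameter: we set it, the model does not learn it.
    "decision tree (max_depth=3)": DecisionTreeRegressor(max_depth=3, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out data
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

A model is only worth keeping if it clearly beats the baseline on data it was not trained on.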

Over the next lessons, you will explore these concepts in more detail. 



In this notebook we are going to briefly explain how to obtain a dataset using sklearn and to train a model using that dataset.


## scikit-learn

> scikit-learn (abbreviated `sklearn` and pronounced "S-K-Learn") is a __high-level machine learning library__ containing:
> - machine learning algorithms
> - example datasets
> - data pre-processing & pipelines

Pair that with a simple API and we get a powerful and easy-to-use tool to get the job done.

In [1]:
import sklearn

print(sklearn.__version__)

1.0.1


Although it only reached its first stable release (version 1.0) in 2021, it has been around for __more than 10 years__ and is used throughout the industry.

## Where it is used

- Fast prototyping and testing ideas
- __Part__ of more complicated pipelines
- Often as part of Machine Learning research (if possible)
- Widely in production for particular models such as decision trees and others that we will look at shortly

## Simple example

We will introduce `sklearn` as a simple tool which allows us to easily show you concepts without delving into details.

__Do not worry about what those algorithms are right now; we will go over them in detail!__

## Data Loading

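As mentioned, `sklearn` provides a few ready-made datasets for us to use. Data is returned either as NumPy arrays (`np.ndarray`), which is the default and the more common format, or as `pandas` objects when `as_frame=True` is passed. In this notebook we pass `as_frame=True` so that the column names are displayed.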
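The two return formats can be compared directly. This sketch assumes the bundled `load_diabetes` dataset as a stand-in, only because it ships with `sklearn` and needs no download; `fetch_california_housing` accepts the same `return_X_y` and `as_frame` parameters.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

# Default: features and targets come back as NumPy arrays.
X_array, y_array = datasets.load_diabetes(return_X_y=True)

# With as_frame=True: features are a DataFrame and targets are a Series.
X_frame, y_frame = datasets.load_diabetes(return_X_y=True, as_frame=True)

print(type(X_array).__name__, type(y_array).__name__)
print(type(X_frame).__name__, type(y_frame).__name__)

# Both formats hold the same data, so the shapes match.
print(X_array.shape, X_frame.shape)
```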

## Exercise

Load the California Housing dataset using `sklearn.datasets.fetch_california_housing()`. Check the documentation for more information [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html).

Print the shape of both the features and the targets.

In [15]:
from sklearn import datasets

X, y = datasets.fetch_california_housing(return_X_y=True, as_frame=True)
X

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


This is the California housing dataset, with `20640` examples, `8` features, and one `y` target per example.



## Features



The dataset has `8` features, among which we can find:
- Median Income
- House Age
- Average number of Rooms

The features are represented by a matrix with as many rows as there are examples, and as many columns as there are features for each example. We call this matrix the __design matrix__.

In [18]:
X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25



## Targets

`y` (the targets) is the median house value that corresponds to each row of features. It is the quantity we would like to predict by the end of this notebook. In this dataset it is expressed in units of $100,000.

The targets are represented in the same way as the features, as a matrix with as many rows as there are examples. However, instead of the number of columns representing the number of features for each example, it now represents the number of targets for each example.

Most commonly, there is a single label for each example (such as a house price), which means that the labels will be a vector. Other times, each example may have multiple labels (such as in a detection problem, where you are predicting the corner coordinates of a box containing the item you want to detect).
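The difference between the two cases shows up in the shape of the array. A minimal sketch with NumPy, using made-up values:

```python
import numpy as np

# Hypothetical values, for illustration only.

# One label per example (such as a price): a vector of shape (n_examples,).
prices = np.array([4.5, 3.6, 3.5])

# Several labels per example (such as box corners x_min, y_min, x_max, y_max):
# a matrix of shape (n_examples, n_targets).
boxes = np.array(
    [
        [10, 20, 110, 220],
        [5, 15, 60, 90],
        [30, 40, 130, 140],
    ]
)

print(prices.shape)  # (3,)
print(boxes.shape)   # (3, 4)
```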

In [19]:
y[:5]

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal, dtype: float64

Features are also floating point arrays. We will use them to train our algorithm.
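To check shapes and data types without downloading anything, the sketch below rebuilds only the first five rows of `X` and `y` from the outputs shown above. The names `X_head` and `y_head` are introduced here for illustration.

```python
import pandas as pd

feature_names = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]

# First five rows of X, copied from the output shown above.
X_head = pd.DataFrame(
    [
        [8.3252, 41.0, 6.984127, 1.023810, 322.0, 2.555556, 37.88, -122.23],
        [8.3014, 21.0, 6.238137, 0.971880, 2401.0, 2.109842, 37.86, -122.22],
        [7.2574, 52.0, 8.288136, 1.073446, 496.0, 2.802260, 37.85, -122.24],
        [5.6431, 52.0, 5.817352, 1.073059, 558.0, 2.547945, 37.85, -122.25],
        [3.8462, 52.0, 6.281853, 1.081081, 565.0, 2.181467, 37.85, -122.25],
    ],
    columns=feature_names,
)

# First five targets, copied from the output of y[:5].
y_head = pd.Series([4.526, 3.585, 3.521, 3.413, 3.422], name="MedHouseVal")

print(X_head.shape)  # (5, 8): 5 examples, 8 features
print(y_head.shape)  # (5,): one target per example
print(X_head.dtypes.unique(), y_head.dtype)
```

On the full dataset, `X.shape` and `y.shape` follow the same pattern, with 20640 rows instead of 5.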

## Summary

- Machine learning system design is the process of defining the interface, the data, and the model
- The process of designing a machine learning system is iterative
- `sklearn` is a high-level library used to quickly prototype solutions
- The attributes we want to make predictions from are called __features__
- The attributes which we want to predict are called __targets__ or __labels__