# 2. End-to-end machine learning project 
In this chapter you will work through an example project end to end. 

The main steps you will go through: 
1. Look at the big picture. 
2. Get the data. 
3. Discover and visualize the data to gain insights. 
4. Prepare the data for Machine Learning algorithms. 
5. Select a model and train it. 
6. Fine-tune your model. 
7. Present your solution. 
8. Launch, monitor, and maintain your system.

## Working with real data 
It is best to experiment with real-world data, not artificial datasets.
There are thousands of open datasets to choose from.

In this chapter we'll use the California Housing Prices dataset from the StatLib repository. 

## Look at the big picture 
Your first task is to use California census data to build a model of housing prices in the state.
This data includes metrics such as the population, median income, and median housing price for each block group ("district") in California.
Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics. 

> Since you are a well-organized data scientist, the first thing you should do is pull out your Machine Learning project checklist. 

### Frame the problem 
The first question to ask your boss is what exactly the business objective is.
How does the company expect to use and benefit from this model? 
Knowing the objective is important because it will determine how you frame the problem, which algorithms you will select, which performance measure you will use to evaluate your model, and how much effort you will spend tweaking it. 

Your boss answers that your model's output (a prediction of a district's median housing price) will be fed to another Machine Learning system, along with many other signals. 
This downstream system will determine whether it is worth investing in a given area or not. 
Getting this right is critical, as it directly affects revenue. 

> A piece of information fed to a Machine Learning system is often called a **signal**, in reference to Claude Shannon's information theory, which he developed at Bell Labs to improve telecommunications. His theory: you want a high signal-to-noise ratio.

[See figure 2.2 A machine learning pipeline for real estate investments]

> **Pipelines** A sequence of data processing components is called a data *pipeline*. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply. 
> Each component is fairly self-contained: the interface between components is simply the data store. This makes the system simple to grasp (with the help of a data flow graph).
> If a component breaks down, the downstream components can often continue to run normally (at least for a while) by using the last output from the broken component. This makes the architecture quite robust.
> On the other hand, a broken component can go unnoticed for some time if proper monitoring is not implemented. The data gets stale and the overall system's performance drops. 

The next question to ask your boss is what the current solution looks like (if any). The current situation will often give you a reference for performance, as well as insights on how to solve the problem. 

With all this information, you are ready to start **designing your system**. 
First, you need to **frame the problem**: 
Is it supervised, unsupervised, or reinforcement learning? 
Is it a classification task, a regression task, or something else? 
Should you use batch learning or online learning techniques? 
<!-- IMO: Supervised, regression task, batch learning -->

In this example, it is clearly a typical supervised learning task, since you are given **labeled** training examples (each instance comes with the expected output, i.e., the district's median housing price). 
It is also a typical regression task, since you are asked to predict a value. 
More specifically, this is a *multiple regression* problem, since the system will use multiple features to make a prediction. 
It is also a *univariate regression* problem, since we are only trying to predict a single value for each district.
Finally, there is no continuous flow of data coming into the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine. 

> If the data were huge, you could either split your batch learning work across multiple servers (using the **MapReduce technique**) or use an online learning technique.

### Select a performance measure
A typical performance measure for regression problems is the **Root Mean Square Error (RMSE)**.
It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors. 

[Equation 2.1 Root Mean Square Error (RMSE) on pp. 39]

> **Notations** 
> * $m$ is the number of instances in the dataset you are measuring the RMSE on. 
> * $\mathbf{x}^{(i)}$ is a vector of all feature values (excluding the label) of the $i^{th}$ instance in the dataset, and $y^{(i)}$ is its label (the desired output value for that instance).
> * $\mathbf{X}$ is a matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the $i^{th}$ row is equal to the transpose of $\mathbf{x}^{(i)}$, noted $( \mathbf{x}^{(i)} )^T$. 
> * $h$ is your system's prediction function, also called a *hypothesis*. When your system is given an instance's feature vector $\mathbf{x}^{(i)}$, it outputs a predicted value $\hat{y}^{(i)} = h(\mathbf{x}^{(i)})$ for that instance ($\hat{y}$ is pronounced "y-hat").
> * $RMSE (\mathbf{X}, h)$ is the cost function measured on the set of examples using your hypothesis $h$. 
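
For reference, using the notation above, the RMSE is usually written as follows (a reconstruction from the standard definition, not a copy of the book's typesetting):

$$
\text{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right)^2}
$$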

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. 

For example, suppose that there are many outlier districts.
In that case, you may consider using the **mean absolute error** (MAE, also called the average absolute deviation). 

[Equation 2-2. Mean absolute error (MAE)]
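
With the same symbols as for the RMSE ($m$ instances, feature vectors $\mathbf{x}^{(i)}$, labels $y^{(i)}$, hypothesis $h$), the standard definition of the MAE reads:

$$
\text{MAE}(\mathbf{X}, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(\mathbf{x}^{(i)}) - y^{(i)} \right|
$$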

Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. 
Various distance measures, or **norms**, are possible: 
* Computing the root of a sum of squares (RMSE) corresponds to the *Euclidean norm*. It is also called the $\ell_2$ norm, noted $||\cdot||_2$ or just $||\cdot||$.
* Computing the sum of absolutes (MAE) corresponds to the $\ell_1$ norm, noted $||\cdot||_1$. This is sometimes called the *Manhattan norm* because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
* The $\ell_k$ norm of a vector $\mathbf{v}$ containing $n$ elements is defined as $||\mathbf{v}||_k = (|v_1|^k + |v_2|^k + \dots + |v_n|^k)^{1/k}$.
* The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. 
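
To make the outlier sensitivity concrete, here is a minimal sketch assuming NumPy is installed. The prices and the helper names `rmse` and `mae` are made up for illustration; they are not part of any library.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared error."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean(errors ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute errors."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.mean(np.abs(errors))

# Hypothetical median house values (in dollars) for four districts
y_true = np.array([200_000, 150_000, 300_000, 250_000])

# Predictions that are all off by exactly 10,000
y_even = y_true + np.array([10_000, -10_000, 10_000, -10_000])

# Same predictions, except one district is off by 100,000 (an outlier)
y_outlier = y_true + np.array([10_000, -10_000, 10_000, -100_000])

print("even errors:   RMSE =", rmse(y_true, y_even), " MAE =", mae(y_true, y_even))
print("with outlier:  RMSE =", rmse(y_true, y_outlier), " MAE =", mae(y_true, y_outlier))
```

When all errors have the same magnitude, both measures agree. Once a single large error appears, the RMSE rises noticeably more than the MAE, because squaring gives large errors a higher weight.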


### Check the assumptions 
Lastly, it is good practice to list and verify the assumptions that have been made so far (by you or others); this can help you catch serious issues early on. 


## Get the data 
### Create the workspace 

First, install Python.
Next, you need to create a workspace directory for your Machine Learning code and datasets. 

`$ export ML_PATH="$HOME/ml"` # You can change the path if you prefer 
`$ mkdir -p $ML_PATH`

To install the required modules, you can use your system's packaging system (e.g., Homebrew), install a Scientific Python distribution such as Anaconda, or just use Python's own packaging system, `pip`.

#### Creating an Isolated Environment 
If you would like to work in an isolated environment, install `virtualenv` by running the following pip command: 

`$ python3 -m pip install --user -U virtualenv `

Now you can create an isolated Python environment:

`$ cd $ML_PATH`
`$ python3 -m virtualenv my_env`

Now every time you want to activate this environment, just open a terminal and type: 

`$ cd $ML_PATH`
`$ source my_env/bin/activate`

To deactivate this environment, type `deactivate`.
While the environment is active, any package you install using pip will be installed in this isolated environment, and Python will only have access to these packages. 

Now you can install all the required modules and their dependencies: 
`$ python3 -m pip install -U jupyter matplotlib numpy pandas scipy scikit-learn`

If you created a virtualenv, you need to register it to Jupyter and give it a name: 
`$ python3 -m ipykernel install --user --name=python3`

Now you can fire up Jupyter: 
`$ jupyter notebook`

----

### Download the data 
In typical environments your data would be available in a relational database and spread across multiple tables/documents/files. 

Having a function that downloads the data is useful in particular if the data changes regularly: you can write a small script that uses the function to fetch the latest data. 

Now when you call `fetch_housing_data()`, it creates a `datasets/housing` directory in your workspace, downloads the `housing.tgz` file, and extracts the `housing.csv` file from it in this directory. 
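
The notes do not show the function itself, so here is a minimal sketch of what `fetch_housing_data()` might look like, using only the standard library. The download URL and the default paths are assumptions (the URL is the one I believe the book's companion repository uses); adjust them to wherever your copy of `housing.tgz` lives.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed locations: change these if your data lives elsewhere.
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
HOUSING_PATH = Path("datasets") / "housing"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz into housing_path and extract it there."""
    housing_path = Path(housing_path)
    # Create datasets/housing (and parents) if it does not exist yet
    housing_path.mkdir(parents=True, exist_ok=True)
    tgz_path = housing_path / "housing.tgz"
    # Download the archive to disk
    urllib.request.urlretrieve(housing_url, tgz_path)
    # Extract every member of the archive into the same directory
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
    return housing_path / "housing.csv"
```

Calling `fetch_housing_data()` with no arguments uses the assumed defaults. Passing a different `housing_url` (including a local `file://` URL) makes the function easy to try without network access.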

You should write a small function to load the data: `load_housing_data`. 
This function returns a pandas DataFrame object containing all the data. 
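
A minimal sketch of such a loader, assuming pandas is installed and the CSV sits in the `datasets/housing` directory mentioned above (the default path is an assumption):

```python
from pathlib import Path

import pandas as pd

HOUSING_PATH = Path("datasets") / "housing"  # assumed default location

def load_housing_data(housing_path=HOUSING_PATH):
    """Read housing.csv from housing_path and return it as a DataFrame."""
    csv_path = Path(housing_path) / "housing.csv"
    return pd.read_csv(csv_path)
```

Typical usage would be `housing = load_housing_data()` once the file has been downloaded and extracted.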

### Take a quick look at the data structure

Take a look at the top five rows using the DataFrame's `head()` method. 
The `info()` method is useful to get a quick description of the data, in particular the total number of rows, each attribute's type, and the number of non-null values.

For a categorical attribute (such as `ocean_proximity` in this dataset), you can find out what categories exist and how many districts belong to each category by using the `value_counts()` method.
The `describe()` method shows a summary of the numerical attributes. 
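
The four methods can be tried on a tiny stand-in DataFrame. The rows below are made up for illustration (they only mimic the housing columns), so the sketch runs without downloading anything:

```python
import pandas as pd

# Made-up sample with housing-like columns; not real census data
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0, 2.1],
    "total_bedrooms": [129.0, 1106.0, None, 235.0, 280.0, 213.0],
    "median_house_value": [452600, 358500, 352100, 341300, 342200, 269700],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND",
                        "INLAND", "INLAND", "<1H OCEAN"],
})

print(housing.head())   # top five rows
housing.info()          # row count, dtypes, non-null counts per column

# Number of districts per category of the text attribute
category_counts = housing["ocean_proximity"].value_counts()
print(category_counts)

# Summary statistics for the numerical attributes only
summary = housing.describe()
print(summary)
```

Note that `describe()` ignores missing values, so the `count` row for `total_bedrooms` is lower than the number of rows.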

Another quick way to get a feel of the type of data you are dealing with is to plot a histogram for each numerical attribute. 
You can either plot this one attribute at a time, or you can call the `hist()` method on the whole dataset, and it will plot a histogram for each numerical attribute. 

The `hist()` method relies on Matplotlib, which in turn relies on a user-specified graphical backend to draw on your screen.
The simplest option is to use Jupyter's magic command `%matplotlib inline`.
Note that calling `show()` is optional in a Jupyter notebook, as Jupyter will automatically display plots when a cell is executed. 
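
A minimal sketch of plotting one histogram per numerical attribute. The data is randomly generated for illustration, and the `Agg` backend is selected only so the snippet also runs as a plain script; in a Jupyter notebook you would drop that line and use `%matplotlib inline` instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are not real)
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.lognormal(mean=1.2, sigma=0.5, size=500),
    "housing_median_age": rng.integers(1, 53, size=500),
    "population": rng.lognormal(mean=7.0, sigma=0.7, size=500),
})

# One histogram per numerical column; returns an array of Matplotlib Axes
axes = housing.hist(bins=50, figsize=(12, 8))

# Outside Jupyter, call plt.show() with an interactive backend to display the figure
```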


### Create a test set 

Creating a test set is theoretically simple: pick some instances randomly, typically 20% of the dataset (or less if your dataset is very large), and set them aside. 
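
A minimal sketch of a purely random 80/20 split, assuming scikit-learn is installed. The DataFrame is synthetic, and fixing `random_state` is one way to get the same split on every run:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (values are not real)
rng = np.random.default_rng(0)
housing = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=1000),
    "median_house_value": rng.uniform(50_000, 500_000, size=1000),
})

# Set aside 20% of the rows as the test set
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(len(train_set), "train rows,", len(test_set), "test rows")
```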

*Stratified sampling*: the population is divided into homogeneous subgroups called *strata*, and the right number of instances are sampled from each stratum to guarantee that the test set is representative of the overall population. 
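
A minimal sketch of stratified sampling with scikit-learn's `StratifiedShuffleSplit`, assuming median income is the attribute to stratify on. The data is synthetic and the bin edges for the `income_cat` column are illustrative choices:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data (values are not real)
rng = np.random.default_rng(0)
housing = pd.DataFrame({"median_income": rng.lognormal(1.2, 0.5, size=2000)})

# Bucket the continuous income into five strata (illustrative bin edges)
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Sample 20% of the rows while preserving the share of each stratum
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(housing, housing["income_cat"]))
strat_train_set = housing.iloc[train_idx]
strat_test_set = housing.iloc[test_idx]

# Compare stratum proportions in the full dataset and in the test set
overall = housing["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "test": in_test}))
```

The two columns of the printed table should be nearly identical, which is the guarantee that stratified sampling provides and that a purely random split does not.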



## Discover and visualize the data to gain insights 
### Visualizing geographical data 
### Looking for correlations
### Experimenting with attribute combinations

## Prepare the data for machine learning algorithms
### Data cleaning 
### Handling text and categorical attributes 
### Custom transformers 
### Feature scaling 
### Transformation pipelines 

## Select and train a model 
### Training and evaluation on the training set 
### Better evaluation using cross-validation

## Fine-tune your model
### Grid search 
### Randomized search 
### Ensemble methods 
### Analyze the best models and their errors
### Evaluate your system on the test set 

## Launch, monitor, and maintain your system
## Try it out!

## Exercise 