# End-to-End Machine Learning Project

The goal is to illustrate the main steps of a Machine Learning project, with you playing the role of a recently hired data scientist at a real estate company. These are the main steps you will go through:

1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor and maintain your system.


## Working with Real Data
Here are a few places you can look to get data:

* Popular open data repositories:
  * [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml/)
  * [Kaggle datasets](https://www.kaggle.com/datasets)
  * [Amazon’s AWS datasets](http://aws.amazon.com/fr/datasets/)
* Meta portals (they list open data repositories):
  * http://dataportals.org/
  * http://opendatamonitor.eu/
  * http://quandl.com/
* Other pages listing many popular open data repositories:
  * [Wikipedia’s list of Machine Learning datasets](https://goo.gl/SJHN2k)
  * [Quora.com question](http://goo.gl/zDR78y)
  * [Datasets subreddit](https://www.reddit.com/r/datasets)
  
In this example we chose the California Housing Price dataset from the StatLib repository. This dataset is based on data from the 1990 California census. We also added a categorical attribute and removed a few features for teaching purposes.

#### Look at the big picture

The problem is to build a model of housing prices in California using the dataset mentioned above. This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them “districts” for short. **Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.**


#### Frame the problem

<img src="./img/ml_pipeline.png" alt="ml pipeline" title="A Machine Learning pipeline for real estate investments." width="470" height="330" />
<center><strong>Figure 1.-</strong> A Machine Learning pipeline for real estate investments. </center>

Having looked at the problem, it is clearly a typical supervised learning task, since you are given labeled training examples (each instance comes with the expected output, i.e., the district’s median housing price). It is also a typical regression task, since you are asked to predict a value. More specifically, this is a multiple regression problem, since the system will use multiple features to make a prediction (it will use the district’s population, the median income, etc.). Finally, there is no continuous flow of data coming into the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.

#### Select a Performance Measure
A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It measures the *standard deviation* of the errors the system makes in its predictions.

\begin{equation*}
RMSE(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^m (h(x^{(i)}) - y^{(i)})^2}
\end{equation*}

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. For example, suppose that there are many outlier districts. In that case, you may consider using the Mean Absolute Error (MAE):

\begin{equation*}
MAE(X, h) = \frac{1}{m} \sum_{i=1}^m | h(x^{(i)}) - y^{(i)}|
\end{equation*}

Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of
predictions and the vector of target values. Various distance measures, or *norms*, are possible:

* Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: it is the notion of distance you are familiar with. It is also called the $\ell_2$ norm, noted $\| \cdot \|_2$ (or just $\| \cdot \|$).
* Computing the sum of absolutes (MAE) corresponds to the $\ell_1$ norm, noted $\| \cdot \|_1$. It is sometimes called the *Manhattan norm* because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
* More generally, the $\ell_k$ *norm* of a vector **v** containing $n$ elements is defined as $\|v\|_k = (|v_1|^k+|v_2|^k + \cdots + |v_n|^k)^{\frac{1}{k}}$. $\ell_0$ just gives the number of non-zero elements in the vector, and $\ell_\infty$ gives the maximum absolute value in the vector.
* The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
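
The relationship between these error measures and the vector norms can be checked numerically. The following is a minimal sketch using NumPy; the prediction and target values are made up for illustration only (the last target is a deliberate outlier).

```python
import numpy as np

# Hypothetical predictions and targets, chosen only for illustration.
predictions = np.array([210_000.0, 340_000.0, 150_000.0, 500_000.0])
targets = np.array([200_000.0, 300_000.0, 160_000.0, 100_000.0])  # last one is an outlier

errors = predictions - targets
m = len(errors)

# Direct definitions of the two performance measures.
rmse = np.sqrt(np.mean(errors ** 2))
mae = np.mean(np.abs(errors))

# The same quantities expressed through vector norms of the error vector.
rmse_from_l2 = np.linalg.norm(errors, ord=2) / np.sqrt(m)
mae_from_l1 = np.linalg.norm(errors, ord=1) / m

l0 = np.count_nonzero(errors)               # number of non-zero elements
l_inf = np.linalg.norm(errors, ord=np.inf)  # maximum absolute value

print(f"RMSE={rmse:.0f}  MAE={mae:.0f}  l0={l0}  l_inf={l_inf:.0f}")
```

Because the squared term gives more weight to the single large error, the RMSE ends up well above the MAE on this toy data.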

### Get the Data

In [None]:
# Use the fetch_housing_data function to download the data.

In [None]:
# Inspect the structure of the downloaded data using the load_housing_data function.

In [None]:
# Get a description of the data with housing.info()

In [None]:
# Show the categories (ocean_proximity)

In [None]:
housing.describe()  # Show summary statistics of the numerical attributes

In [None]:
# Plot a grid of histograms, one per column

#### Create a Test Set
If you look at the test set, you may stumble upon some seemingly interesting pattern in the test data that leads you to select a particular kind of ML model. When you then estimate the generalization error using the test set, your estimate will be too optimistic, and you will launch a system that does not perform as well as expected. This is called *data snooping bias*.

In [None]:
import numpy as np

# Write a function that splits the data into training and test sets.
def split_train_test(data, test_ratio):
    pass

In [None]:
# Call split_train_test with 0.2, i.e. 20% of the data goes to the test set

It works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, your ML algorithm will get to see the whole dataset, which is what you want to avoid.
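
One possible shape for such a random split is sketched below. It is only an illustration: the `seed` parameter is an addition that is not part of the exercise stub, and the `toy` DataFrame is hypothetical. Fixing the seed makes repeated runs agree, but the split would still change if the dataset were updated.

```python
import numpy as np
import pandas as pd


def split_train_test(data, test_ratio, seed=None):
    """Randomly split a DataFrame into (train, test) by shuffling row positions."""
    rng = np.random.default_rng(seed)  # seed=None gives a different split on every run
    shuffled = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    test_idx, train_idx = shuffled[:test_size], shuffled[test_size:]
    return data.iloc[train_idx], data.iloc[test_idx]


# Hypothetical toy data standing in for the housing DataFrame.
toy = pd.DataFrame({"value": np.arange(100)})

train_a, test_a = split_train_test(toy, 0.2, seed=42)
train_b, test_b = split_train_test(toy, 0.2, seed=42)

print(len(train_a), "train +", len(test_a), "test")
```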

A common solution is to use each instance's identifier to decide whether or not it should go in the test set. For example, you could compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if this value is lower than or equal to 51 (~20% of 256).

In [None]:
from zlib import crc32
# Define two functions: test_set_check and split_train_test_by_id.

In [None]:
# The housing data has no identifier column, so add the row index as one
# and call split_train_test_by_id

In [None]:
# Or build a unique identifier from the most "stable" columns.

In [None]:
# Show the data with head()

In [None]:
# Use sklearn.model_selection.train_test_split and show the result

So far we have considered purely random sampling methods. This is generally fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing a significant sampling bias. In *stratified sampling*, the population is divided into homogeneous subgroups called *strata*, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population.

Suppose you chatted with experts who told you that the median income is a very important attribute to predict median housing prices. Let’s look at the median income histogram more closely.

In [None]:
# Plot the histogram of median_income

It is important to have a sufficient number of instances in your dataset for each
stratum, or else the estimate of the stratum’s importance may be biased. This means that you should not
have too many strata, and each stratum should be large enough. The following code creates an income category attribute by dividing the median income by 1.5 (to limit the number of income categories), and rounding up using ceil (to have discrete categories), and then merging all the categories greater than 5 into category 5.
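
The category construction described above, followed by a stratified split, might look like the sketch below. The `median_income` values are synthetic (drawn from a log-normal distribution purely for illustration), and the use of `StratifiedShuffleSplit` is one reasonable choice rather than the only way to do it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic incomes standing in for the real median_income column.
rng = np.random.default_rng(42)
housing = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000)})

# Divide by 1.5 and round up to get discrete categories...
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# ...then merge every category of 5 or above into category 5.
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

# One stratified 80/20 split based on the income category.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(splitter.split(housing, housing["income_cat"]))
strat_train_set = housing.iloc[train_index]
strat_test_set = housing.iloc[test_index]

# Category proportions in the full data and in the stratified test set.
overall = housing["income_cat"].value_counts(normalize=True).sort_index()
stratified = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "stratified": stratified}))
```

The two columns of proportions should be nearly identical, which is the point of stratifying.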

In [None]:
# Divide by 1.5 to limit the number of income categories

# Label everything above 5 as 5


In [None]:
# Use value_counts to count the instances in each category.

In [None]:
# Plot the histogram

In [None]:
# Write the code that generates the stratified sets

In [None]:
# Compute the category proportions in the stratified set

In [None]:
# and in the full dataset

In [None]:
import pandas as pd

# Compare the income category proportions against those of the overall dataset.



In [None]:
compare_props

In [None]:
# Remove the income_cat attribute so the data is back to its original state.

## Discover and Visualize the Data to Gain insights

In [None]:
# Create a fresh copy of strat_train_set

In [None]:
# Draw a scatter plot using longitude and latitude

In [None]:
# Improve the map by setting alpha to 0.1
# The darker areas show where the districts are most densely clustered

In [None]:
# Now look at the housing prices. The radius of each circle represents the district's
# population and the color represents the price.

In [None]:
# Show the data over the map of California, combining all of the above

### Looking for correlations

Since the dataset is not too large, you can easily compute the *standard correlation coefficient* (also called *Pearson's r*) between every pair of attributes using the corr() method.
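
As a small illustration of `corr()`, the sketch below builds a synthetic DataFrame in which one column depends linearly on another plus noise. The column names mirror the housing data, but the values are made up.

```python
import numpy as np
import pandas as pd

# Synthetic data: house value depends on income, "noise" is unrelated.
rng = np.random.default_rng(0)
n = 500
median_income = rng.uniform(1, 10, size=n)
toy = pd.DataFrame({
    "median_income": median_income,
    "median_house_value": 40_000 * median_income + rng.normal(0, 30_000, size=n),
    "noise": rng.normal(size=n),
})

# Pearson's r between every pair of columns.
corr_matrix = toy.corr()

# Correlation of every attribute with the target, strongest first.
ranking = corr_matrix["median_house_value"].sort_values(ascending=False)
print(ranking)
```

The coefficient only captures linear relationships, so a value near zero does not rule out a nonlinear dependency.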

In [None]:
# Compute the correlation matrix

In [None]:
# Sort the correlations with "median_house_value" from highest to lowest

In [None]:
# Inspect the correlations between attributes using pandas' *scatter_matrix*

In [None]:
# The most promising attribute for predicting the median house value is the median income.
# Look at its correlation plot

### Experimenting with attribute combinations
One last thing you may want to do before actually preparing the data for Machine Learning algorithms is to try out various attribute combinations. For example, the total number of rooms in a district is not very useful if you don’t know how many households there are. What you really want is the number of rooms per household. Similarly, the total number of bedrooms by itself is not very useful: you probably want to compare it to the number of rooms. And the population per household also seems like an interesting attribute combination to look at. Let’s create these new attributes:

In [None]:
# Create new attributes from the existing ones.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

In [None]:
# Compute the correlation matrix again

In [None]:
# Look at the correlation plot for rooms_per_household

In [None]:
# Call describe()

### Prepare the data for Machine Learning algorithms
It's time to prepare the data for your Machine Learning algorithms. Instead of doing this manually, you should write functions for it, for several good reasons:
* This will allow you to reproduce these transformations easily on any dataset (e.g., the next time you get a fresh dataset).
* You will gradually build a library of transformation functions that you can reuse in future projects.
* You can use these functions in your live systems to transform the new data before feeding it to your algorithms.
* This will make it possible for you to easily try various transformations and see which combination of transformations works best.

In [None]:
# Prepare the data.
# Separate the features from their labels

#### Data Cleaning

Most ML algorithms cannot work with missing features, so let's take care of them. The *total_bedrooms* attribute has some missing values; to fix this, we have three options:

* Get rid of the corresponding districts.
* Get rid of the whole attribute.
* Set the missing values to some value (zero, the mean, the median, etc.).
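
The three options map onto three pandas calls, sketched here on a tiny hypothetical DataFrame with one missing `total_bedrooms` value. Each option returns a new object and leaves `housing` unchanged.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample with one missing value.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: get rid of the districts (rows) with a missing value.
option_1 = housing.dropna(subset=["total_bedrooms"])

# Option 2: get rid of the whole attribute (column).
option_2 = housing.drop("total_bedrooms", axis=1)

# Option 3: fill the missing values with the median of the known values.
median = housing["total_bedrooms"].median()
option_3 = housing.assign(total_bedrooms=housing["total_bedrooms"].fillna(median))

print(option_3)
```

With option 3, remember to keep the median you computed on the training set: you will need the same value to fill gaps in the test set and in new data.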

In [None]:
# Get the rows that have missing (null) values

In [None]:
# Option 1: drop the rows that have missing values

In [None]:
# Option 2: drop the column

In [None]:
# Option 3: replace the null values with the median.

*Scikit-Learn* provides a handy class to take care of missing values: *SimpleImputer* (called *Imputer* in older releases).

In [None]:
# We can use SimpleImputer to fill in the missing values

Since the median can only be computed on numerical attributes, we need to create a copy of the data without the text attribute *ocean_proximity*:
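
A minimal sketch of this workflow with `sklearn.impute.SimpleImputer` follows. The four-row DataFrame is hypothetical and only serves to make the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny hypothetical sample with one missing value and one text attribute.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# The median is only defined for numerical attributes, so drop the text column.
housing_num = housing.drop("ocean_proximity", axis=1)

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)             # learns one median per column
print(imputer.statistics_)           # the learned medians

X = imputer.transform(housing_num)   # plain NumPy array with the gaps filled
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
```

Fitting on all numerical columns, not just `total_bedrooms`, is a precaution: new data may have missing values in other attributes.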

In [None]:
# Drop ocean_proximity since it is not a numerical attribute

In [None]:
# Fit the imputer

In [None]:
# Show the statistics_ attribute

In [None]:
# Check the values manually

In [None]:
# Use the imputer to transform the training set

In [None]:
# Build a DataFrame from X

In [None]:
# Show the first 5 rows

### Handling text and Categorical Attributes
Earlier we left out the categorical attribute *ocean_proximity* because it is a text attribute, so we cannot compute its median. Most ML algorithms prefer to work with numbers anyway, so let's convert these categories from text to numbers.

In [None]:
# Create a housing_cat variable from the "ocean_proximity" column
# and show the first 10 rows.

In [None]:
# Use sklearn.preprocessing.OrdinalEncoder to turn the text categories into numerical labels.

In [None]:
# Look at the categories

This is better: the encoded categories are now purely numerical. The *OrdinalEncoder* exposes the list of categories through its *categories_* attribute.

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. To fix this issue, a common solution is to create one binary attribute per category. This is called *one-hot encoding*, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold).
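
The difference between the two encodings can be seen on a handful of values. This is a sketch on a hypothetical four-row column; the category names are taken from the `ocean_proximity` attribute.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column, shaped (n_samples, 1) as the encoders expect.
housing_cat = np.array([["INLAND"], ["NEAR BAY"], ["<1H OCEAN"], ["INLAND"]])

# Ordinal encoding: one integer code per category (codes follow sorted order).
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
print(ordinal_encoder.categories_)
print(housing_cat_encoded.ravel())

# One-hot encoding: one binary column per category.
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)  # SciPy sparse matrix
housing_cat_dense = housing_cat_1hot.toarray()             # dense NumPy array
print(housing_cat_dense)
```

The sparse output stores only the positions of the non-zero entries, which saves a lot of memory when there are many categories.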

In [None]:
# Use sklearn.preprocessing.OneHotEncoder to one-hot encode the categorical attributes

In [None]:
# Show the dense matrix

You can apply both transformations (from text categories to integers, then from integers to one-hot vectors) using the *OneHotEncoder* class.

In [None]:
# Repeat the same process

In [None]:
# Show the matrix

In [None]:
# Convert it to a dense matrix

In [None]:
# Get the categories

> If a categorical attribute has a large number of possible categories, then one-hot encoding will result in a large number of input features. This may slow down training and degrade performance. If this happens, you will want to produce denser representations called *embeddings*, but this requires a good understanding of neural networks.

#### Custom transformers

Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. Remember, Scikit-Learn relies on duck typing, so all you need to do is create a class and implement three methods: *fit()* (returning *self*), *transform()*, and *fit_transform()*.
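
A custom transformer along these lines could be sketched as follows. The column positions (`rooms_ix` and so on) and the toy array are assumptions made for this example. Inheriting from `TransformerMixin` supplies `fit_transform()` automatically, and `BaseEstimator` supplies `get_params()` and `set_params()`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column positions are assumptions that match the toy array below.
rooms_ix, bedrooms_ix, population_ix, households_ix = 0, 1, 2, 3


class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Append ratio attributes to a NumPy array of housing features."""

    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        extra = [rooms_per_household, population_per_household]
        if self.add_bedrooms_per_room:
            extra.append(X[:, bedrooms_ix] / X[:, rooms_ix])
        # Stack the original columns and the new ones side by side.
        return np.column_stack([X, *extra])


# Toy rows: total_rooms, total_bedrooms, population, households.
X = np.array([[880.0, 129.0, 322.0, 126.0],
              [7099.0, 1106.0, 2401.0, 1138.0]])
housing_extra_attribs = CombinedAttributesAdder().fit_transform(X)
print(housing_extra_attribs.shape)
```

The `add_bedrooms_per_room` flag is a hyperparameter of the preparation step itself, so you can later check whether the extra attribute actually helps.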

In [None]:
# Write our own class to combine the attributes
# from combined_attributes_adder import CombinedAttributesAdder

In [None]:
# Build the DataFrame with the extra attributes

#### Transformation Pipelines

There are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the *Pipeline* class to help with such sequences of transformations.
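
A short numerical pipeline might look like the sketch below. To stay self-contained it chains only an imputer and a scaler on a hypothetical array, leaving out the attribute-combination step.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, transformer) pair; steps run in the order listed.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Hypothetical numerical data with one missing value.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [4.0, 40.0]])

# fit_transform() fits and applies every step in sequence.
X_prepared = num_pipeline.fit_transform(X)
print(X_prepared)
```

All steps except the last must be transformers (they need a `fit_transform()` method); the last one can be any estimator.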

In [None]:
# Build a pipeline with an imputer, the attribute combiner and a StandardScaler

In [None]:
# Show the numerical data produced by the pipeline


Now it would be nice if we could feed a Pandas DataFrame containing non-numerical columns directly into our pipeline, instead of having to first manually extract the numerical columns into a NumPy array. Older versions of Scikit-Learn had nothing to handle Pandas DataFrames (recent versions provide *ColumnTransformer*), but we can write a custom transformer for this task:
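
One way to write such a transformer is sketched below. The class name `DataFrameSelector` and the sample DataFrame are illustrative choices, not something the notebook prescribes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.attribute_names].values


# Hypothetical DataFrame mixing numerical and text columns.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "households": [126.0, 1138.0, 177.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})

selector = DataFrameSelector(["total_rooms", "households"])
X_num = selector.fit_transform(housing)
print(X_num)
```

Placed as the first step of a pipeline, a selector like this lets one pipeline handle the numerical columns and another the categorical ones.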

In [None]:
# Add the transformation of the categorical attributes to the pipeline

In [None]:
# Show the prepared data

In [None]:
# Show the dimensions of the dataset

### Select and Train a Model

#### Training and Evaluating on the Training Set
Let's first train a **Linear Regression** model, as we did previously.

In [None]:
# Create and train a linear model (sklearn.linear_model.LinearRegression).

Done! You now have a working Linear Regression model. Let's try it out on a few instances from the training set:

In [None]:
# Make predictions

It works, but it's not perfect: the first prediction is off by close to 40%. Let's measure this model on the whole training set using Scikit-Learn's mean_squared_error function:
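
The measurement itself takes only a few lines. The sketch below uses synthetic data (a linear signal plus Gaussian noise) instead of the prepared housing set, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic regression data: linear signal plus noise with standard deviation 1.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=200)

lin_reg = LinearRegression()
lin_reg.fit(X, y)

# RMSE on the training set: the square root of the mean squared error.
predictions = lin_reg.predict(X)
lin_mse = mean_squared_error(y, predictions)
lin_rmse = np.sqrt(lin_mse)
print(f"Training RMSE: {lin_rmse:.3f}")
```

Taking the square root by hand keeps the snippet independent of which RMSE helper a given Scikit-Learn version offers.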

In [None]:
# Compare the expected values with the predicted ones

In [None]:
# Use sklearn.metrics.mean_squared_error to compute the RMSE of the predictions.

In [None]:
# Use mean_absolute_error to compute the MAE

This is an example of a model underfitting the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough.

Let's train a *DecisionTreeRegressor*. This is a powerful model, capable of finding complex nonlinear relationships in the data.

In [None]:
# Create a new model using sklearn.tree.DecisionTreeRegressor

In [None]:
# Make predictions with the new model.

### Better Evaluation Using Cross-Validation

We can use Scikit-Learn's *cross-validation* feature. The following code performs *K-fold cross-validation*: it randomly splits the training set into 10 distinct subsets called *folds*, then it trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores:

In [None]:
# Use sklearn.model_selection.cross_val_score to evaluate the model with cross-validation

> Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better).
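
In practice this means asking for the negative MSE and flipping the sign before taking the square root. The sketch below shows the idea on synthetic data; the dataset and the choice of a Decision Tree are only for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared housing set.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=200)

tree_reg = DecisionTreeRegressor(random_state=42)

# "neg_mean_squared_error" is a utility: higher (closer to zero) is better.
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)

# Flip the sign to recover the MSE, then take the root to get one RMSE per fold.
tree_rmse_scores = np.sqrt(-scores)
print("Mean:", tree_rmse_scores.mean())
print("Standard deviation:", tree_rmse_scores.std())
```

The standard deviation across folds gives a rough idea of how precise the estimate is, which a single validation set cannot provide.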

In [None]:
# Define a function that displays the evaluation metrics
def display_scores(scores):
    pass


display_scores(tree_rmse_scores)

In [None]:
# Use cross-validation to evaluate the linear model trained earlier

Note that the Decision Tree model is overfitting so badly that it performs worse than the Linear Regression Model.

Let's try one last model: *RandomForestRegressor*.

In [None]:
# Create a new model using sklearn.ensemble.RandomForestRegressor

In [None]:
# Make predictions and compute the RMSE

In [None]:
# Use cross-validation to evaluate the model

Random Forests look promising. However, note that the RMSE on the training set is still much lower than on the validation sets, meaning that the model is still overfitting the training set. Possible solutions for overfitting are to simplify the model, constrain it (i.e., regularize it), or get a lot more training data.

> To save the models you experiment with, you can use Python's *pickle* module or the *joblib* library.
> ```python
>  import joblib  # sklearn.externals.joblib was removed from recent Scikit-Learn releases
>  
>  joblib.dump(my_model, "my_model.pkl") # save
>  my_model_loaded = joblib.load("my_model.pkl") # load
> ```

### Fine-Tune your model
Let’s assume that you now have a shortlist of promising models. You now need to fine-tune them. 

#### Grid Search
One way to do that would be to fiddle with the hyperparameters manually until you find a great combination of hyperparameter values. This would be very tedious and time-consuming work.

In [None]:
# Use cross-validation to obtain the RMSE of tree_reg

In [None]:
# Show its scores with the display_scores function defined earlier


In [None]:
# Recompute the scores of the lin_reg model using cross-validation

In [None]:
# Train a new model using sklearn.ensemble.RandomForestRegressor

In [None]:
# Compute its RMSE
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
# Compute its scores using cross-validation

In [None]:
# Describe the scores obtained from cross-validating the lin_reg model

Instead, you should get Scikit-Learn’s *GridSearchCV* to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation.
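
A small grid search could be set up as in the sketch below. The data is synthetic, and the grid values are deliberately tiny so that the example runs quickly; they are not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the prepared housing set.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=200)

# 2 x 2 = 4 combinations, each evaluated with 3-fold cross-validation.
param_grid = [
    {"n_estimators": [3, 10], "max_features": [1, 2]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=3,
                           scoring="neg_mean_squared_error")
grid_search.fit(X, y)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# One RMSE per hyperparameter combination.
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
```

If the best value lies at the edge of the grid, it is usually worth searching again with a range extended in that direction.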

In [None]:
# Use GridSearchCV to find a model with optimized hyperparameters

In [None]:
# Show the best parameters

In [None]:
# Get the best model

> If *GridSearchCV* is initialized with refit=True, then once it finds the best estimator using cross-validation, it retrains it on the whole training set.

Let's look at the score of each hyperparameter combination tested during the grid search:

In [None]:
# Show the results for each hyperparameter combination

#### Randomized Search
The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter *search space* is large, it is often preferable to use *RandomizedSearchCV*. This approach has two main benefits:

* If you let the randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).
* You have more control over the computing budget you want to allocate to hyperparameter search, simply by setting the number of iterations.
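
The sketch below illustrates a randomized search on synthetic data. The distributions from `scipy.stats.randint` and the small `n_iter` are illustrative choices that keep the run short.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the prepared housing set.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=200)

# Each hyperparameter is sampled from a distribution instead of a fixed list.
param_distribs = {
    "n_estimators": randint(low=1, high=20),  # integers 1..19
    "max_features": randint(low=1, high=3),   # integers 1..2
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=5, cv=3,
                                scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(X, y)

# One RMSE per sampled combination; n_iter sets the computing budget.
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
```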

In [None]:
# Use sklearn.model_selection.RandomizedSearchCV to run a randomized search

In [None]:
# Show the results for each model

#### Analyze the Best models and their errors
You will often gain good insights on the problem by inspecting the best models. For example, the
RandomForestRegressor can indicate the relative importance of each attribute for making accurate
predictions:
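
A sketch of how the importances can be read and paired with attribute names is shown below. The data is synthetic, and the attribute names (including the deliberately useless `noise` column) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the first feature dominates, the third is pure noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=300)

# Hypothetical attribute names, in the same order as the columns of X.
attributes = ["median_income", "housing_median_age", "noise"]

forest_reg = RandomForestRegressor(n_estimators=30, random_state=42)
forest_reg.fit(X, y)

# One importance score per input feature; the scores sum to 1.
feature_importances = forest_reg.feature_importances_

# Pair each score with its name and sort from most to least important.
ranked = sorted(zip(feature_importances, attributes), reverse=True)
for importance, name in ranked:
    print(f"{importance:.3f}  {name}")
```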

In [None]:
# Show the feature importances

In [None]:
# Show the importances next to their attribute names, sorted.

With this information, you may want to try dropping some of the less useful features.

#### Evaluate Your system on the test set
Now is the time to evaluate the final model on the test set.

In [None]:
# Evaluate the final model on the test set