## Intro to Machine Learning

*Welcome to Machine Learning Housing Corp.! Your task is to predict median house values in Californian districts, given a number of features from these districts.*

# Setup

First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (Python 2.x may work, but it is deprecated, so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20.

In [656]:
# Python ≥3.5 is required


# Scikit-Learn ≥0.20 is required


# Common imports
# To plot pretty figures


# Where to save the figures


# Define a function to save the figures
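
A minimal sketch of this setup, assuming Matplotlib and Scikit-Learn are installed. The `version_tuple` helper and the `images_path` parameter are illustrative names rather than part of the original notebook, and the `%matplotlib inline` magic is left out because it only works inside Jupyter:

```python
import re
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import sklearn


def version_tuple(version):
    """Return the leading (major, minor) numbers of a version string."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:2])


# Python >= 3.5 and Scikit-Learn >= 0.20 are required
assert sys.version_info >= (3, 5)
assert version_tuple(sklearn.__version__) >= (0, 20)


def save_fig(fig_id, images_path="images", fig_extension="png", resolution=300):
    """Save the current Matplotlib figure as <images_path>/<fig_id>.<fig_extension>."""
    target_dir = Path(images_path)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    path = target_dir / "{}.{}".format(fig_id, fig_extension)
    plt.tight_layout()
    plt.savefig(str(path), format=fig_extension, dpi=resolution)
    return path
```

Calling `save_fig("my_plot")` right after drawing a figure would write `images/my_plot.png`.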


# Get the Data

## Download the Data

In [657]:
# Import the necessary libraries


# Get the path to the file


# Write and call a function to fetch the data
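
One possible way to write the fetch function, assuming the dataset is published as a gzipped tar archive. The archive URL and the target folder are passed in by the caller because this notebook does not state them, and the local file name `housing.tgz` is an arbitrary choice:

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_housing_data(housing_url, housing_path):
    """Download the archive at housing_url and extract it into housing_path."""
    target_dir = Path(housing_path)
    target_dir.mkdir(parents=True, exist_ok=True)
    tgz_path = target_dir / "housing.tgz"  # assumed local name for the archive
    urllib.request.urlretrieve(housing_url, str(tgz_path))
    with tarfile.open(str(tgz_path)) as housing_tgz:
        housing_tgz.extractall(path=str(target_dir))
    return target_dir
```

The function only downloads and unpacks; it returns the folder so the next step can load the CSV file from it.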


In [659]:
# Write and call a function to load the data
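
A short sketch of the load function, assuming the extracted data is a CSV file. The default file name `housing.csv` is an assumption:

```python
from pathlib import Path

import pandas as pd


def load_housing_data(housing_path, filename="housing.csv"):
    """Read the extracted CSV file into a pandas DataFrame."""
    return pd.read_csv(Path(housing_path) / filename)
```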


## Take a Quick Look at the Data Structure

## Create a Test Set

In [665]:
# to make this notebook's output identical at every run


In [666]:
# Write and call a function to split the training set
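
Here is a minimal sketch of a purely random split, shown on a made-up `DataFrame` so that it runs on its own. Seeding NumPy first keeps the permutation, and therefore the split, identical at every run:

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # same shuffle at every run


def split_train_test(data, test_ratio):
    """Shuffle the row positions and cut off the first test_ratio share as the test set."""
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]


demo = pd.DataFrame({"value": np.arange(100)})  # stand-in for the housing data
train_set, test_set = split_train_test(demo, 0.2)
print(len(train_set), len(test_set))
```

Note that this approach produces a different test set as soon as the dataset is refreshed or reordered, which is what motivates splitting by identifier.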


In [None]:
# What is the length of the training set?


In [None]:
# What is the length of the test set?

In [669]:
# check the test set

# splitting the training set by id
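
A sketch of an id-based split, assuming each row has a stable integer identifier. An instance goes to the test set when the CRC32 hash of its identifier falls in the lowest `test_ratio` share of the hash range, so existing rows keep their assignment when new rows are added. The helper name `is_in_test_set` and the demo column names are illustrative:

```python
from zlib import crc32

import pandas as pd


def is_in_test_set(identifier, test_ratio):
    """Return True if the identifier's hash lands in the lowest test_ratio share."""
    id_bytes = int(identifier).to_bytes(8, byteorder="little", signed=True)
    return crc32(id_bytes) & 0xFFFFFFFF < test_ratio * 2**32


def split_train_test_by_id(data, test_ratio, id_column):
    """Split data into (train, test) based only on the values of id_column."""
    in_test_set = data[id_column].apply(lambda id_: is_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]


demo = pd.DataFrame({"index": range(1000), "value": range(1000)})
train_set, test_set = split_train_test_by_id(demo, 0.2, "index")
print(len(train_set), len(test_set))
```

Because the hash is only approximately uniform, the test set holds roughly, not exactly, 20% of the rows.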


In [674]:
# splitting the data using stratified shuffle split
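
A sketch of stratified sampling with `StratifiedShuffleSplit`. It assumes the strata come from an income category derived from a `median_income` column; the synthetic incomes and the bin edges below are illustrative choices, not values taken from this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(42)
housing = pd.DataFrame(
    {"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)}
)

# Bucket the continuous income into 5 categories to stratify on
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

# Category proportions should be nearly identical in the full set and the test set
overall = housing["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "test": in_test}))
```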


# Discover and Visualize the Data to Gain Insights

## Visualizing Geographical Data

The argument `sharex=False` fixes a display bug (the x-axis values and legend were not displayed). 

In [None]:
# Download the California image


## Looking for Correlations

In [None]:
# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas
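
As a small illustration of looking for correlations, the sketch below computes the standard (Pearson) correlation matrix with `DataFrame.corr()` and ranks the attributes against the target. The data is synthetic, and the column names are assumed to mirror the housing dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
income = rng.uniform(0.5, 15.0, size=500)
housing = pd.DataFrame({
    "median_income": income,
    "housing_median_age": rng.randint(1, 53, size=500),
    # Target built to depend on income only, plus noise
    "median_house_value": 40000 * income + rng.normal(0, 50000, size=500),
})

corr_matrix = housing.corr()
ranking = corr_matrix["median_house_value"].sort_values(ascending=False)
print(ranking)
```

In recent versions of Pandas, `scatter_matrix` lives in `pandas.plotting` and can be used to plot the most promising attributes against each other.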


## Experimenting with Attribute Combinations
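
A sketch of the kind of attribute combinations one might try, assuming the dataset has `total_rooms`, `total_bedrooms`, `population` and `households` columns. The three rows are made-up values chosen to keep the arithmetic easy to check:

```python
import pandas as pd

housing = pd.DataFrame({
    "total_rooms": [800.0, 6000.0, 1500.0],
    "total_bedrooms": [200.0, 1200.0, 300.0],
    "population": [400.0, 2400.0, 500.0],
    "households": [200.0, 1000.0, 250.0],
})

# Ratios are often more informative than raw district totals
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
print(housing)
```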

# Prepare the Data for Machine Learning Algorithms

In [694]:
# drop labels for training set


## Data Cleaning

Remove the text attribute because median can only be calculated on numerical attributes:

In [701]:

# alternatively: housing_num = housing.select_dtypes(include=[np.number])

Check that this is the same as manually computing the median of each attribute:

Transform the training set:
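
The cleaning steps above can be sketched as follows on a tiny, made-up `DataFrame`: drop the text attribute, fit a median `SimpleImputer`, compare its learned statistics with the manually computed medians, then transform the data and wrap the result back into a `DataFrame`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

housing = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "population": [1000.0, 2000.0, np.nan, 4000.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})

# The median is only defined for numerical attributes
housing_num = housing.drop("ocean_proximity", axis=1)

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)

print(imputer.statistics_)          # medians learned by the imputer
print(housing_num.median().values)  # medians computed manually

X = imputer.transform(housing_num)  # plain NumPy array, missing values filled
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
```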

## Handling Text and Categorical Attributes

Now let's preprocess the categorical input feature, `ocean_proximity`:

By default, the `OneHotEncoder` class returns a sparse array, but we can convert it to a dense array if needed by calling the `toarray()` method:

Alternatively, you can set `sparse=False` when creating the `OneHotEncoder` (in Scikit-Learn 1.2 and later, this parameter is named `sparse_output`):
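
A minimal sketch of one-hot encoding on a made-up categorical column. It uses only the default sparse output and `toarray()`, so it does not depend on how the dense-output parameter is named in your Scikit-Learn version:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

housing_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]}
)

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)  # SciPy sparse result
housing_cat_dense = housing_cat_1hot.toarray()             # dense NumPy array

print(cat_encoder.categories_)
print(housing_cat_dense)
```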

## Custom Transformers

Let's create a custom transformer to add extra attributes:

Note that I hard-coded the indices (3, 4, 5, 6) for concision and clarity in the book, but it would be much cleaner to get them dynamically, like this:

Also, `housing_extra_attribs` is a NumPy array, so we've lost the column names (unfortunately, that's a problem with Scikit-Learn). To recover a `DataFrame`, you could run this:
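
The sketch below puts these pieces together on a made-up `DataFrame`: a custom transformer that appends ratio attributes, column indices looked up dynamically with `get_loc()`, and a `DataFrame` rebuilt from the NumPy output. The column names and the class body are assumptions based on the description above, not the notebook's original code:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

housing = pd.DataFrame({
    "total_rooms": [800.0, 6000.0],
    "total_bedrooms": [200.0, 1200.0],
    "population": [400.0, 2400.0],
    "households": [200.0, 1000.0],
})

# Look the column indices up instead of hard-coding them
col_names = ["total_rooms", "total_bedrooms", "population", "households"]
rooms_ix, bedrooms_ix, population_ix, households_ix = [
    housing.columns.get_loc(c) for c in col_names
]


class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]


attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

# Recover a DataFrame with explicit column names
housing_extra_df = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns) + ["rooms_per_household",
                                     "population_per_household"],
    index=housing.index,
)
print(housing_extra_df)
```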

## Transformation Pipelines

Now let's build a pipeline for preprocessing the numerical attributes:
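
A sketch of such a pipeline on made-up data: the numerical columns go through median imputation and standardization, and a `ColumnTransformer` applies that pipeline to the numerical columns and a `OneHotEncoder` to the categorical one. The column names are assumptions, and the custom attribute-adding step is left out to keep the example short:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

housing = pd.DataFrame({
    "total_rooms": [800.0, 6000.0, np.nan, 1500.0],
    "households": [200.0, 1000.0, 400.0, 250.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})
num_attribs = ["total_rooms", "households"]
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)
```

Depending on how sparse the combined output is, `ColumnTransformer` may return either a dense array or a SciPy sparse matrix.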

For reference, here is the old solution based on a `DataFrameSelector` transformer (to just select a subset of the Pandas `DataFrame` columns), and a `FeatureUnion`:

In [726]:


# Create a class to select numerical or categorical columns


Now let's join all these components into a big pipeline that will preprocess both the numerical and the categorical features:

The result is the same as with the `ColumnTransformer`:
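
To illustrate the comparison, here is a self-contained sketch of the older approach next to a `ColumnTransformer`, on made-up data. The `DataFrameSelector` body, the `make_num_steps` and `to_dense` helpers, and the column names are assumptions written for this example:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


def make_num_steps():
    """Fresh imputation and scaling steps, shared by both solutions."""
    return [("imputer", SimpleImputer(strategy="median")),
            ("std_scaler", StandardScaler())]


def to_dense(matrix):
    """Convert a possibly sparse result to a dense NumPy array."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


housing = pd.DataFrame({
    "total_rooms": [800.0, 6000.0, np.nan, 1500.0],
    "households": [200.0, 1000.0, 400.0, 250.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})
num_attribs = ["total_rooms", "households"]
cat_attribs = ["ocean_proximity"]

# Old solution: one selector-led pipeline per column type, joined by FeatureUnion
old_num_pipeline = Pipeline(
    [("selector", DataFrameSelector(num_attribs))] + make_num_steps()
)
old_cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(cat_attribs)),
    ("cat_encoder", OneHotEncoder()),
])
old_full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", old_num_pipeline),
    ("cat_pipeline", old_cat_pipeline),
])
old_housing_prepared = to_dense(old_full_pipeline.fit_transform(housing))

# Newer solution: ColumnTransformer routes the columns itself
full_pipeline = ColumnTransformer([
    ("num", Pipeline(make_num_steps()), num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = to_dense(full_pipeline.fit_transform(housing))

same = np.allclose(housing_prepared, old_housing_prepared)
print(same)
```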

# Select and Train a Model

## Training and Evaluating on the Training Set

In [None]:
# let's try the full preprocessing pipeline on a few training instances


Compare against the actual values:

**Note**: since Scikit-Learn 0.22, you can get the RMSE directly by calling the `mean_squared_error()` function with `squared=False`. Scikit-Learn 1.4 and later provide a dedicated `root_mean_squared_error()` function instead.
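
A version-independent way to get the RMSE is to take the square root of the MSE yourself. The sketch below does this for a `LinearRegression` trained on synthetic data, which stands in for the prepared housing data and its labels:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 1.0, size=200)

lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Try the model on a few training instances and compare with the actual values
some_data, some_labels = X[:5], y[:5]
print("Predictions:", lin_reg.predict(some_data))
print("Labels:", some_labels)

# RMSE on the whole training set
predictions = lin_reg.predict(X)
lin_mse = mean_squared_error(y, predictions)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)
```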

## Better Evaluation Using Cross-Validation

**Note**: we specify `n_estimators=100` explicitly so that the results do not depend on the Scikit-Learn version, since the default value changed from 10 to 100 in Scikit-Learn 0.22 (for simplicity, this is not shown in the book).
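
A sketch of K-fold cross-validation with a random forest, on synthetic data standing in for the prepared housing set. Scikit-Learn's scoring functions follow a "greater is better" convention, so the scorer returns the negative MSE and the sign is flipped before taking the square root. The choice of 5 folds here is arbitrary and keeps the example fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(0, 1.0, size=200)

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
scores = cross_val_score(forest_reg, X, y,
                         scoring="neg_mean_squared_error", cv=5)
forest_rmse_scores = np.sqrt(-scores)

print("Scores:", forest_rmse_scores)
print("Mean:", forest_rmse_scores.mean())
print("Standard deviation:", forest_rmse_scores.std())
```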

# Fine-Tune Your Model

## Grid Search

In [None]:
    # try 12 (3×4) combinations of hyperparameters
    # then try 6 (2×3) combinations with bootstrap set as False

# train across 5 folds, that's a total of (12+6)*5=90 rounds of training


The best hyperparameter combination found:

Let's look at the score of each hyperparameter combination tested during the grid search:
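
The grid search described above can be sketched as follows. The hyperparameter values are assumptions chosen to give the 3×4 and 2×3 grids mentioned in the comments, and the data is synthetic with 8 features so that every `max_features` value is valid:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(200, 8))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(0, 1.0, size=200)

param_grid = [
    # 12 (3×4) combinations of hyperparameters
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    # then 6 (2×3) combinations with bootstrap set to False
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# 5 folds, so (12+6)*5=90 rounds of training, plus a final refit
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(X, y)

print(grid_search.best_params_)

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
```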

## Randomized Search

## Analyze the Best Models and Their Errors

## Evaluate Your System on the Test Set

We can compute a 95% confidence interval for the test RMSE:

We could compute the interval manually like this:

Alternatively, we could use z-scores rather than t-scores:
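
The three computations can be sketched side by side. The test labels and predictions below are synthetic stand-ins; the interval is built on the mean of the squared errors, and the square root of its bounds gives the interval for the RMSE:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(42)
y_test = rng.uniform(50000, 500000, size=500)
final_predictions = y_test + rng.normal(0, 50000, size=500)

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
m = len(squared_errors)
mean = squared_errors.mean()

# 1. Using SciPy's t-distribution interval directly
t_interval = np.sqrt(stats.t.interval(confidence, m - 1,
                                      loc=mean,
                                      scale=stats.sem(squared_errors)))

# 2. Computing the same t-interval manually
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
manual_interval = (np.sqrt(mean - tmargin), np.sqrt(mean + tmargin))

# 3. Using z-scores (normal distribution) instead of t-scores
zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
z_interval = (np.sqrt(mean - zmargin), np.sqrt(mean + zmargin))

print(t_interval, manual_interval, z_interval)
```

With a test set this large the t- and z-intervals are almost identical; the z-interval is always slightly narrower.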

# Extra material

## A full pipeline with both preparation and prediction

## Model persistence using joblib

In [763]:

 # DIFF
#...
 # DIFF
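
A minimal sketch of saving and reloading a model with `joblib`. A small `LinearRegression` on synthetic data stands in for the full pipeline, and the temporary folder and the file name `my_model.pkl` are arbitrary choices:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(50, 2))
y = X[:, 0] * 2.0 + X[:, 1]

my_model = LinearRegression().fit(X, y)

model_path = os.path.join(tempfile.mkdtemp(), "my_model.pkl")
joblib.dump(my_model, model_path)          # save the fitted model to disk
my_model_loaded = joblib.load(model_path)  # load it back later

print(np.allclose(my_model.predict(X), my_model_loaded.predict(X)))
```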

## Example SciPy distributions for `RandomizedSearchCV`
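
As an illustration, here are two SciPy distributions that could be passed as values in the `param_distributions` dictionary of `RandomizedSearchCV`. The choice of distributions and their parameters is an assumption made for this sketch: an exponential distribution favors small values, while a log-uniform distribution spreads samples evenly across orders of magnitude:

```python
import numpy as np
from scipy.stats import expon, loguniform

# Exponential distribution with mean 1.0
expon_distrib = expon(scale=1.0)
expon_samples = expon_distrib.rvs(10000, random_state=42)

# Log-uniform distribution between 20 and 200,000
loguniform_distrib = loguniform(20, 200000)
loguniform_samples = loguniform_distrib.rvs(10000, random_state=42)

print(expon_samples.mean())
print(loguniform_samples.min(), loguniform_samples.max())
print(np.median(loguniform_samples))
```

The log-uniform distribution is a reasonable default when you have no idea of the right scale for a hyperparameter; the exponential one suits cases where you already know roughly what the scale should be.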

# Exercises

## 1.

Question: Try adding a transformer in the preparation pipeline to select only the most important attributes.

Note: this feature selector assumes that you have already computed the feature importances somehow (for example using a `RandomForestRegressor`). You may be tempted to compute them directly in the `TopFeatureSelector`'s `fit()` method; however, this would likely slow down grid/randomized search, since the feature importances would have to be computed for every hyperparameter combination (unless you implement some sort of cache).
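
One possible implementation of such a selector is sketched below. It takes precomputed importances and keeps the columns with the `k` largest values; the `indices_of_top_k` helper and the made-up importances are illustrative:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


def indices_of_top_k(arr, k):
    """Return the sorted indices of the k largest values in arr."""
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])


class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k

    def fit(self, X, y=None):
        # Only index bookkeeping happens here; importances are precomputed
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]


# Made-up importances for a 5-feature dataset
demo_importances = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
demo_X = np.arange(10).reshape(2, 5)

selector = TopFeatureSelector(demo_importances, k=3)
demo_X_top = selector.fit_transform(demo_X)
print(selector.feature_indices_)
print(demo_X_top)
```

Because the selected columns are returned in their original order, the output can be compared directly with the corresponding columns of the input.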

Let's define the number of top features we want to keep:

Now let's look for the indices of the top k features:

Let's double check that these are indeed the top k features:

Looking good... Now let's create a new pipeline that runs the previously defined preparation pipeline, and adds top k feature selection:

Let's look at the features of the first 3 instances:

Now let's double check that these are indeed the top k features:

Works great! :)