**AUTHOR: RAIHAN SALMAN BAEHAQI (1103220180)**

**PART I**  

**The Fundamentals of Machine Learning**  

---

**CHAPTER 2 - End-to-End Machine Learning Project**  

---

This chapter presents a comprehensive end-to-end machine learning project built on the California Housing Prices dataset. It demonstrates the complete workflow, from data acquisition to model deployment, on a practical real estate price prediction problem. The chapter guides readers through eight main steps: looking at the big picture, getting the data, discovering and visualizing data, preparing data for ML algorithms, selecting and training a model, fine-tuning the model, presenting the solution, and launching/monitoring the system.  

---


**Working with Real Data**  

The chapter uses the California Housing Prices dataset from the StatLib repository, which is based on 1990 census data. Recommended data sources include the UC Irvine ML Repository, Kaggle, AWS datasets, Data Portals, OpenDataMonitor, and Quandl.

![Figure2-1.jpg](./02.Chapter-02/Figure2-1.jpg)  

---

**Look at the Big Picture**

**Frame the Problem**  

The business objective is building a model to predict median housing prices in California districts for a real estate investment ML pipeline. The model output feeds into a downstream system that determines investment worthiness.

![Figure2-2.jpg](./02.Chapter-02/Figure2-2.jpg)

Pipelines are sequences of data processing components running asynchronously, where each component pulls data, processes it, and outputs results to another data store. This architecture provides simplicity, allows different teams to focus on different components, and creates robustness, though broken components can go unnoticed without proper monitoring.  

The problem is classified as:
* Supervised learning (labeled training examples with expected outputs)
* Regression task (predicting a continuous value)
* Multiple regression (uses multiple features)
* Univariate regression (predicting a single value per district)
* Batch learning (no continuous data flow, data fits in memory)

**Select a Performance Measure**  

Root Mean Square Error (RMSE) is the typical regression performance measure, providing error magnitude with higher weight for large errors:  

Equation 2-1. Root Mean Square Error (RMSE)  
![Eq2-1.jpg](./02.Chapter-02/Eq2-1.jpg)

Where:
* m = number of instances in the dataset
* x<sup>(i)</sup> = vector of feature values for the ith instance
* y<sup>(i)</sup> = label (desired output) for that instance
* X = matrix containing all feature values of all instances
* h = prediction function (hypothesis)  

<br> Mean Absolute Error (MAE) is preferred when dealing with many outliers:  

Equation 2-2. Mean absolute error (MAE)  
![Eq2-2.jpg](./02.Chapter-02/Eq2-2.jpg)  

Distance Measures and Norms:
* Euclidean norm (ℓ2): Root of sum of squares (RMSE)
* Manhattan norm (ℓ1): Sum of absolutes (MAE)
* ℓk norm: ||v||<sub>k</sub> = (|v<sub>0</sub>|<sup>k</sup> + |v<sub>1</sub>|<sup>k</sup> + ... + |v<sub>n</sub>|<sup>k</sup>)<sup>1/k</sup>
* ℓ0: Number of nonzero elements
* ℓ∞: Maximum absolute value  

Higher norm indices focus more on large values and neglect small ones, making RMSE more sensitive to outliers than MAE.
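
The effect of the norm index can be checked on a small, made-up error vector. The sketch below is only illustrative: the `lk_norm`, `mae`, and `rmse` helpers and the sample numbers are assumptions, not part of the chapter's code. It computes MAE and RMSE with and without a single large error.

```python
import numpy as np

def lk_norm(v, k):
    """Return the l_k norm of vector v: (sum |v_i|^k)^(1/k)."""
    v = np.abs(np.asarray(v, dtype=float))
    return float((v ** k).sum() ** (1.0 / k))

def mae(errors):
    """Mean absolute error: l_1 norm divided by the number of instances."""
    return lk_norm(errors, 1) / len(errors)

def rmse(errors):
    """Root mean square error: l_2 norm divided by sqrt(number of instances)."""
    return lk_norm(errors, 2) / np.sqrt(len(errors))

# Hypothetical prediction errors (prediction minus label), arbitrary units
clean_errors = [1.0, -1.0, 1.0, -1.0]
outlier_errors = [1.0, -1.0, 1.0, -10.0]  # same, but with one large error

print(mae(clean_errors), rmse(clean_errors))
print(mae(outlier_errors), rmse(outlier_errors))
```

With these numbers the single outlier raises MAE from 1 to 3.25, while RMSE rises from 1 to about 5.07, which is the extra sensitivity to large errors described above.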

**Check the Assumptions**

Verify all assumptions early to catch serious issues. For example, confirm that downstream systems need actual prices rather than categories to avoid framing the problem incorrectly.

---

**Get the Data**

**Create the Workspace**  

Make sure Python 3 is installed, then create a workspace directory for the project:

In [None]:
$ export ML_PATH="$HOME/ml"
$ mkdir -p $ML_PATH

Check pip installation:

In [None]:
$ python3 -m pip --version
pip 19.3.1 from [...]/lib/python3.7/site-packages/pip (python 3.7)

Upgrade pip:

In [None]:
$ python3 -m pip install --user -U pip

**Creating an Isolated Environment**  

Install virtualenv, create an isolated environment for the project, and activate it (use the activation command that matches your operating system):

In [None]:
$ python3 -m pip install --user -U virtualenv
$ cd $ML_PATH
$ python3 -m virtualenv my_env
$ source my_env/bin/activate  # on Linux/macOS
$ .\my_env\Scripts\activate   # on Windows

Install required packages:

In [None]:
$ python3 -m pip install -U jupyter matplotlib numpy pandas scipy scikit-learn

Register the virtualenv with Jupyter and give it a name:

In [None]:
$ python3 -m ipykernel install --user --name=python3

Start Jupyter server:

In [None]:
$ jupyter notebook

![Figure2-3.jpg](./02.Chapter-02/Figure2-3.jpg)  

![Figure2-4.jpg](./02.Chapter-02/Figure2-4.jpg)

**Download the Data**  

Create an automated function to fetch data:

In [None]:
import os
import tarfile
import urllib.request  # import the submodule explicitly; plain "import urllib" may not load it

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # the context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

Load data with pandas:

In [None]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)


**Take a Quick Look at the Data Structure**

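After calling fetch_housing_data() once to download and extract the archive, load the data and view the first five rows:

In [None]:
housing = load_housing_data()
housing.head()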

![Figure2-5.jpg](./02.Chapter-02/Figure2-5.jpg)  

The dataset contains 10 attributes: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.

Get dataset information:

In [None]:
housing.info()

![Figure2-6.jpg](./02.Chapter-02/Figure2-6.jpg)  

All attributes are numerical except ocean_proximity (object/text). Check categorical values:

In [None]:
>>> housing["ocean_proximity"].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

View statistical summary:

In [None]:
housing.describe()

![Figure2-7.jpg](./02.Chapter-02/Figure2-7.jpg)  

Create histograms for all numerical attributes:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

![Figure2-8.jpg](./02.Chapter-02/Figure2-8.jpg)

Key observations from histograms:
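1. Median income is scaled and capped (at roughly 0.5 for the lowest and 15 for the highest values); the numbers represent roughly tens of thousands of dollars (e.g., 3 means about $30,000)
2. Housing median age and median house value are also capped; the cap on median house value is potentially a problem because it is the target attribute, so the model may learn that prices never go beyond that limit
3. Attributes have very different scales, so feature scaling is needed
4. Many histograms are tail-heavy, extending much farther to the right of the median than to the left; transforming these attributes toward more bell-shaped distributions can make patterns easier for some algorithms to detect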
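
The last observation can be illustrated on synthetic data. One common transformation is the logarithm; this is an assumption here, since the notes do not name a specific transformation. The sketch below draws a right-skewed sample (a made-up stand-in for an attribute such as population) and compares its skewness before and after taking the log. The `skewness` helper is hypothetical.

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean of cubed standardized values (0 for symmetric data)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

rng = np.random.default_rng(42)
# Synthetic right-skewed attribute (log-normal), not taken from the housing data
population_like = rng.lognormal(mean=7.0, sigma=0.8, size=10_000)

skew_before = skewness(population_like)
skew_after = skewness(np.log(population_like))

print(f"skewness before: {skew_before:.2f}, after log: {skew_after:.2f}")
```

A skewness close to zero after the transformation indicates a roughly symmetric, bell-shaped distribution.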

**Create a Test Set**

Simple random sampling function:

In [None]:
import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

>>> train_set, test_set = split_train_test(housing, 0.2)
>>> len(train_set)
16512
>>> len(test_set)
4128

For stable splits across updates, use identifier-based hashing:

In [None]:
from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
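
To see why hashing gives a stable split, the sketch below applies the same idea to plain integer identifiers and checks what happens when new rows are appended. It is a standard-library variant written for illustration: the `in_test_set` name and the use of `int.to_bytes` instead of `np.int64` are assumptions, not the chapter's code.

```python
from zlib import crc32

def in_test_set(identifier, test_ratio):
    """Assign an instance to the test set based on a hash of its identifier."""
    id_bytes = int(identifier).to_bytes(8, "little", signed=True)
    return crc32(id_bytes) & 0xffffffff < test_ratio * 2**32

old_ids = range(1000)   # identifiers in the original dataset
new_ids = range(1200)   # same dataset after 200 new rows are appended

old_test = {i for i in old_ids if in_test_set(i, 0.2)}
new_test = {i for i in new_ids if in_test_set(i, 0.2)}

# Every instance from the old test set is still in the new test set ...
print(old_test <= new_test)
# ... and no instance from the old training set has moved into it.
print({i for i in new_test if i < 1000} == old_test)
```

Because membership depends only on the identifier, refreshing the dataset never moves a previously seen instance between the training and test sets.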

Alternatively, build a stable identifier from each district's longitude and latitude:

In [None]:
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

Use Scikit-Learn's train_test_split:

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

Stratified sampling keeps the test set representative of the overall population. Because median income is an important predictor of median housing prices, create an income category attribute to stratify on:

In [None]:
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing["income_cat"].hist()

![Figure2-9.jpg](./02.Chapter-02/Figure2-9.jpg)  

Perform stratified sampling:

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Verify stratification worked:

In [None]:
>>> strat_test_set["income_cat"].value_counts() / len(strat_test_set)
3    0.350533
2    0.318798
4    0.176357
5    0.114583
1    0.039729
Name: income_cat, dtype: float64

![Figure2-10.jpg](./02.Chapter-02/Figure2-10.jpg)  
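
The benefit of stratification can be reproduced on synthetic categories. The sketch below uses `train_test_split` with its `stratify` parameter rather than `StratifiedShuffleSplit`, purely to keep the example short. The category proportions and the `proportions` helper are made-up assumptions, not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic category labels with imbalanced proportions (not the housing data)
categories = rng.choice([1, 2, 3, 4, 5], size=2000,
                        p=[0.04, 0.32, 0.35, 0.18, 0.11])
X = np.arange(len(categories)).reshape(-1, 1)

def proportions(labels):
    """Return the share of each category 1..5 in labels."""
    return np.array([(labels == c).mean() for c in range(1, 6)])

# Stratified split: category proportions are preserved in the test set
_, _, _, strat_test = train_test_split(
    X, categories, test_size=0.2, stratify=categories, random_state=42)
# Purely random split for comparison
_, _, _, rand_test = train_test_split(
    X, categories, test_size=0.2, random_state=42)

overall = proportions(categories)
strat_error = np.abs(proportions(strat_test) - overall).max()
rand_error = np.abs(proportions(rand_test) - overall).max()
print(f"max proportion error, stratified: {strat_error:.4f}")
print(f"max proportion error, random:     {rand_error:.4f}")
```

The stratified test set matches the overall proportions almost exactly, while a purely random split typically deviates more, especially for the rare categories.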

Remove income_cat to restore original data:

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

---  

**Discover and Visualize the Data to Gain Insights**  

Create a copy of training set for exploration:

In [None]:
housing = strat_train_set.copy()

**Visualizing Geographical Data**  

Basic scatterplot:

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude")

![Figure2-11.jpg](./02.Chapter-02/Figure2-11.jpg)  

Enhanced visualization with transparency:

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

![Figure2-12.jpg](./02.Chapter-02/Figure2-12.jpg)  

Advanced visualization with price and population:

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.legend()

![Figure2-13.jpg](./02.Chapter-02/Figure2-13.jpg)  

**Looking for Correlations**  

Compute correlation matrix:

In [None]:
corr_matrix = housing.corr()  # on pandas 2.0 or later, use housing.corr(numeric_only=True)
>>> corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64

The correlation coefficient ranges from -1 to +1 and measures only linear relationships, so it can completely miss nonlinear ones. Median income shows the strongest positive correlation (0.687) with median house value.  
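
A quick numerical check shows what "linear only" means. In this illustrative sketch (synthetic values, not the housing data), one target depends linearly on `x` and the other is fully determined by `x` through a quadratic relationship.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric around zero
y_linear = 3.0 * x + 2.0          # perfectly linear in x
y_quadratic = x ** 2              # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear: {r_linear:.3f}, quadratic: {r_quadratic:.3f}")
```

The linear case gives a coefficient of 1, while the quadratic case gives a coefficient of essentially 0 even though `y_quadratic` is completely predictable from `x`.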

Use scatter matrix to visualize correlations:

In [None]:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

![Figure2-14.jpg](./02.Chapter-02/Figure2-14.jpg)

Focus on median_income correlation:

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)

![Figure2-15.jpg](./02.Chapter-02/Figure2-15.jpg)  

**Experimenting with Attribute Combinations**  

Create new combined attributes:

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]

corr_matrix = housing.corr()  # on pandas 2.0 or later, use housing.corr(numeric_only=True)
>>> corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64

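The new bedrooms_per_room attribute is much more strongly correlated with median house value (-0.260) than total_rooms (0.135) or total_bedrooms (0.048): districts with a lower bedroom-to-room ratio tend to have more expensive houses.  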

---

**Prepare the Data for Machine Learning Algorithms**  

Revert to clean training set and separate predictors from labels:

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

**Data Cleaning**  

Handle missing values (total_bedrooms has 207 missing values in the full dataset). There are three options:  

Option 1: Drop districts with missing values

In [None]:
housing.dropna(subset=["total_bedrooms"])

Option 2: Drop the whole attribute

In [None]:
housing.drop("total_bedrooms", axis=1)

Option 3: Set missing values to median (zero, mean, etc.)

In [None]:
median = housing["total_bedrooms"].median()
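# assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions may warn about
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)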

Use Scikit-Learn's SimpleImputer:

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)

>>> imputer.statistics_
array([   -118.51  ,     34.26  ,     29.    ,   2119.5   ,    433.    ,
          1164.    ,    408.    ,      3.5409])
>>> housing_num.median().values
array([   -118.51  ,     34.26  ,     29.    ,   2119.5   ,    433.    ,
          1164.    ,    408.    ,      3.5409])

X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

Scikit-Learn Design Principles:
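* Consistency: all objects share a simple interface: estimators (fit()), transformers (transform()), and predictors (predict())
* Inspection: hyperparameters are accessible as public instance variables (e.g., imputer.strategy), and learned parameters as public instance variables with an underscore suffix (e.g., imputer.statistics_)
* Nonproliferation of classes: datasets are represented as NumPy arrays or SciPy sparse matrices
* Composition: existing building blocks are reused as much as possible (e.g., pipelines)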
* Sensible defaults: Reasonable default parameter values

**Handling Text and Categorical Attributes**  

Most ML algorithms prefer numbers. Use OrdinalEncoder:

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat = housing[["ocean_proximity"]]
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

>>> housing_cat_encoded[:10]
array([[0.],
       [0.],
       [4.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]])

>>> ordinal_encoder.categories_
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)]

For non-ordinal categories, use one-hot encoding:

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

>>> housing_cat_1hot
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
    with 16512 stored elements in Compressed Sparse Row format>

>>> housing_cat_1hot.toarray()
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

>>> cat_encoder.categories_
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)]

One-hot encoding creates binary attributes (dummy variables) to avoid implying similarity between unrelated categories.

**Custom Transformers**  

Create transformers for custom cleanup operations:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

Exposing data preparation choices as hyperparameters (here, add_bedrooms_per_room) lets automatic hyperparameter tuning test whether each step actually helps.

**Feature Scaling**  

With few exceptions, ML algorithms don't perform well when the input numerical attributes have very different scales.  

Min-max scaling (normalization) shifts and rescales values so that they end up in the 0-1 range, using MinMaxScaler:

In [None]:
from sklearn.preprocessing import MinMaxScaler

Standardization subtracts mean and divides by standard deviation, resulting in zero mean and unit variance using StandardScaler:

In [None]:
from sklearn.preprocessing import StandardScaler

Standardization is much less affected by outliers than min-max scaling, but it does not bound values to a specific range. Fit scalers on the training data only: call fit_transform() on the training set, then transform() on the test set and on new data.
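
A minimal sketch of both scalers on toy data follows. The values are hypothetical and include one deliberate outlier. The scalers learn their statistics from the training set and then reuse them on the test set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature data (hypothetical values, not the housing data)
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier
X_test = np.array([[2.5], [200.0]])

min_max = MinMaxScaler()
X_train_mm = min_max.fit_transform(X_train)  # learns min and max from training data
X_test_mm = min_max.transform(X_test)        # reuses them, no refitting

std = StandardScaler()
X_train_std = std.fit_transform(X_train)     # learns mean and std from training data
X_test_std = std.transform(X_test)

print(X_train_mm.ravel())
print(X_test_mm.ravel())
print(X_train_std.ravel())
```

The outlier squeezes the other min-max-scaled training values into roughly the 0 to 0.03 range, and the test value 200 lands above 1 because the scaler keeps the minimum and maximum it learned from the training set.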

**Transformation Pipelines**  

Chain transformations using Pipeline:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

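A Pipeline runs its steps in sequence: calling fit() on the pipeline calls fit_transform() on every step except the last, passing each output to the next step, and then calls fit() on the last step. All steps except the last must be transformers (that is, they must have a fit_transform() method).  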

Handle categorical and numerical columns together with ColumnTransformer:

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

ColumnTransformer applies appropriate transformations to each column subset, concatenating outputs.  

---

**Select a Model and Train It**

**Training and Evaluating on the Training Set**  

Train Linear Regression:

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

>>> some_data = housing.iloc[:5]
>>> some_labels = housing_labels.iloc[:5]
>>> some_data_prepared = full_pipeline.transform(some_data)
>>> print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [ 210644.60459286  317768.80697211  210956.43331178   59218.98886849
  189747.55849879]
>>> print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

Measure RMSE on training set:

In [None]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
>>> lin_rmse
68628.19819848922

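A typical prediction error of $68,628 is not very satisfying, since most districts' median house values range between $120,000 and $265,000. This is an example of underfitting (high bias): either the features don't provide enough information or the model isn't powerful enough.  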

Solutions for underfitting:
* Select a more powerful model
* Feed better features to the algorithm
* Reduce the constraints on the model (not applicable here, since this Linear Regression model is not regularized)  

Train Decision Tree Regressor:

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
>>> tree_rmse
0.0

A training error of zero almost certainly indicates severe overfitting: the model has memorized the training data.

**Better Evaluation Using Cross-Validation**

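K-fold cross-validation randomly splits the training set into K distinct folds, then trains and evaluates the model K times, picking a different fold for evaluation each time and training on the other K-1 folds. Scikit-Learn's cross-validation expects a utility function (greater is better), so the scoring is the negative MSE, and the code negates it before taking the square root: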

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

>>> display_scores(tree_rmse_scores)
Scores: [70194.33680785 66855.16363941 72432.58244769 70758.73896782
 71115.88230639 75585.14172901 70262.86139133 70273.6325285
 75366.87952553 71231.65726027]
Mean: 71407.68766037929
Standard deviation: 2439.4345041191004

With cross-validation, the Decision Tree (mean RMSE of about 71,408) actually performs slightly worse than Linear Regression (mean RMSE of about 69,052).  
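
The mechanics of K-fold cross-validation can be sketched by hand. The example below is a simplified illustration, not Scikit-Learn's implementation: the `k_fold_rmse` helper, the ordinary least squares model, and the synthetic data are all assumptions chosen to keep the sketch dependent on NumPy only.

```python
import numpy as np

def k_fold_rmse(X, y, k=5, seed=42):
    """Return one validation RMSE per fold for an ordinary least squares model."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)   # k roughly equal, disjoint folds
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the k-1 training folds (bias term added as a column of ones)
        A_train = np.c_[np.ones(len(train_idx)), X[train_idx]]
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Evaluate on the held-out fold only
        A_val = np.c_[np.ones(len(val_idx)), X[val_idx]]
        errors = A_val @ coef - y[val_idx]
        scores.append(np.sqrt(np.mean(errors ** 2)))
    return np.array(scores)

# Synthetic regression data: y = 3*x0 - 2*x1 + noise (noise std = 0.5)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

scores = k_fold_rmse(X, y, k=5)
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())
```

Each instance is used for validation exactly once, so the mean score estimates generalization error and the standard deviation indicates how precise that estimate is.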

Cross-validate Linear Regression:

In [None]:
>>> lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                                  scoring="neg_mean_squared_error", cv=10)
>>> lin_rmse_scores = np.sqrt(-lin_scores)
>>> display_scores(lin_rmse_scores)
Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard deviation: 2731.674001798348

Train Random Forest Regressor:

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
>>> forest_rmse
18603.515021376355

>>> forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                     scoring="neg_mean_squared_error", cv=10)
>>> forest_rmse_scores = np.sqrt(-forest_scores)
>>> display_scores(forest_rmse_scores)
Scores: [49519.80364233 47461.9115823  50029.02762854 52325.28068953
 49308.39426421 53446.37892622 48634.8036574  47585.73832311
 53490.10699751 50021.5852922 ]
Mean: 50182.303100336096
Standard deviation: 2097.0810550985693

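Random Forest performs much better (cross-validation mean RMSE of 50,182), but the training RMSE (18,604) is still far lower than the validation scores, so the model is still overfitting the training set. Save each model you experiment with so that you can come back to it and compare later: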

In [None]:
import joblib
my_model = forest_reg  # placeholder name: any trained model can be saved this way
joblib.dump(my_model, "my_model.pkl")
my_model_loaded = joblib.load("my_model.pkl")

---  

**Fine-Tune Your Model**

**Grid Search**  

Systematically explore hyperparameter combinations:

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

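Grid search evaluates 12 + 6 = 18 hyperparameter combinations, and with 5-fold cross-validation each one is trained 5 times, for 18 × 5 = 90 rounds of training in total.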
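
The count can be verified by expanding the grid by hand. The sketch below uses only the standard library; the `expand` helper is a hypothetical name introduced for illustration and is not part of Scikit-Learn.

```python
from itertools import product

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
cv_folds = 5

def expand(grid):
    """List every hyperparameter combination in one grid dictionary."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]

combinations = [combo for grid in param_grid for combo in expand(grid)]
print(len(combinations))             # 18
print(len(combinations) * cv_folds)  # 90
```

With the default refit=True, GridSearchCV also retrains the best combination once on the whole training set after the search finishes.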

In [None]:
>>> grid_search.best_params_
{'max_features': 8, 'n_estimators': 30}

>>> grid_search.best_estimator_
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features=8, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=30,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)

Access evaluation scores:

In [None]:
>>> cvres = grid_search.cv_results_
>>> for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
...     print(np.sqrt(-mean_score), params)
...
63669.11631261028 {'max_features': 2, 'n_estimators': 3}
55627.099719926795 {'max_features': 2, 'n_estimators': 10}
53384.57275149205 {'max_features': 2, 'n_estimators': 30}
60965.950449450494 {'max_features': 4, 'n_estimators': 3}
52741.04704299915 {'max_features': 4, 'n_estimators': 10}
50377.40461678399 {'max_features': 4, 'n_estimators': 30}
58663.93866579625 {'max_features': 6, 'n_estimators': 3}
52006.19873526564 {'max_features': 6, 'n_estimators': 10}
50146.51167415009 {'max_features': 6, 'n_estimators': 30}
57869.25276169646 {'max_features': 8, 'n_estimators': 3}
51711.127883959234 {'max_features': 8, 'n_estimators': 10}
49682.273345071546 {'max_features': 8, 'n_estimators': 30}
62895.06951262424 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.176157539405 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.40652318466 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52724.9822587892 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.5691951261 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.495668875716 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

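Best combination: max_features=8, n_estimators=30, achieving RMSE of 49,682. Since 8 and 30 are the maximum values that were evaluated, it is worth searching again with higher values, as the score may continue to improve.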

**Randomized Search**  

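When the hyperparameter search space is large, RandomizedSearchCV is often preferable to grid search. Instead of trying every combination, it evaluates a fixed number of random combinations, sampling a different value for each hyperparameter at every iteration. This explores more distinct values per hyperparameter, and the number of iterations directly controls the computing budget.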
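
A minimal sketch of a randomized search follows. It runs on a small synthetic regression problem rather than the housing data so that it finishes quickly, and the sampling ranges, `n_iter`, and `cv` values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem so the sketch runs quickly
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
y = X[:, 0] * 3.0 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Distributions to sample from (ranges are illustrative assumptions)
param_distribs = {
    'n_estimators': randint(low=5, high=50),   # integers 5..49
    'max_features': randint(low=1, high=8),    # integers 1..7
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distribs,
    n_iter=5,                          # computing budget: 5 sampled combinations
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)
rnd_search.fit(X, y)

print(rnd_search.best_params_)
print(np.sqrt(-rnd_search.best_score_))  # cross-validated RMSE of the best combination
```

Raising `n_iter` widens the search at a proportional cost in training time, which is the "controlled iterations" trade-off.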

**Ensemble Methods**  

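Another way to fine-tune the system is to combine the best-performing models. The ensemble often performs better than the best individual model (just as a Random Forest performs better than the individual Decision Trees it relies on), especially when the individual models make very different types of errors.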

**Analyze the Best Models and Their Errors**  

Examine feature importances:

In [None]:
>>> feature_importances = grid_search.best_estimator_.feature_importances_
>>> feature_importances
array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
       1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
       5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
       1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])

>>> extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
>>> cat_encoder = full_pipeline.named_transformers_["cat"]
>>> cat_one_hot_attribs = list(cat_encoder.categories_[0])
>>> attributes = num_attribs + extra_attribs + cat_one_hot_attribs
>>> sorted(zip(feature_importances, attributes), reverse=True)
[(0.3661589806181342, 'median_income'),
 (0.1647809935615905, 'INLAND'),
 (0.10879295677551573, 'pop_per_hhold'),
 (0.07334423551601243, 'longitude'),
 (0.06290907048262032, 'latitude'),
 (0.056419179181954014, 'rooms_per_hhold'),
 (0.053351077347675815, 'bedrooms_per_room'),
 (0.04114379847872964, 'housing_median_age'),
 (0.014874280890402769, 'population'),
 (0.014672685420543239, 'total_rooms'),
 (0.014257599323407808, 'households'),
 (0.014106483453584104, 'total_bedrooms'),
 (0.010311488326303788, '<1H OCEAN'),
 (0.0028564746373201584, 'NEAR OCEAN'),
 (0.0019604155994780706, 'NEAR BAY'),
 (6.0280386727366e-05, 'ISLAND')]

This information helps decide which features to drop. For example, only one ocean_proximity category (INLAND) is really useful.

**Evaluate Your System on the Test Set**  

Final evaluation on test set:

In [None]:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
>>> final_rmse
47730.22690385927

Compute 95% confidence interval:

In [None]:
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
>>> np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                              loc=squared_errors.mean(),
                              scale=stats.sem(squared_errors)))
array([45685.10470776, 49691.25001878])

The generalization RMSE is estimated to lie between $45,685 and $49,691 with 95% confidence.

---

**Launch, Monitor, and Maintain Your System**  

**Present Your Solution**

Document findings, highlight what worked and what didn't, list assumptions and system limitations, and create clear visualizations and presentations.

**Launch**

Get solution ready for production:
* Plug into production input data sources
* Write unit tests
* Write monitoring code for live performance and trigger alerts on drops

Deployment options:
* Save trained model and load in production environment
* Wrap model in web service (REST API)
* Deploy to cloud (Google Cloud AI Platform provides simple API for loading and deploying models)

**Monitor and Maintain**

Models degrade over time ("model rot") as data evolves. Monitor regularly and retrain on fresh data.

Monitoring strategies:
* Sample predictions and evaluate through human raters or downstream system performance
* Monitor input data quality to catch upstream issues
* Train models on regular schedules or when performance drops
* Automate the entire process
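
As a rough illustration of monitoring live performance, the sketch below compares the RMSE on a recent labeled sample against a baseline and raises an alert when it degrades. The `check_live_performance` function, the 10% tolerance, and the sample values are all hypothetical. The baseline reuses the test-set RMSE reported earlier in these notes.

```python
import math

def rmse(predictions, labels):
    """Root mean square error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, labels)]
    return math.sqrt(sum(squared) / len(squared))

def check_live_performance(predictions, labels, baseline_rmse, tolerance=0.10):
    """Return (live_rmse, alert); alert is True when the live RMSE exceeds
    the baseline by more than the given relative tolerance."""
    live_rmse = rmse(predictions, labels)
    alert = live_rmse > baseline_rmse * (1.0 + tolerance)
    return live_rmse, alert

# Hypothetical sample of recent predictions with labels obtained later
recent_predictions = [210_000.0, 320_000.0, 150_000.0, 95_000.0]
recent_labels = [280_000.0, 250_000.0, 240_000.0, 60_000.0]

live_rmse, alert = check_live_performance(
    recent_predictions, recent_labels, baseline_rmse=47_730.0)
print(round(live_rmse), alert)
```

In a real system the alert would feed a notification or retraining job, and the tolerance would be chosen from the normal variation observed in live scores.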

Automation best practices:
* Collect fresh data regularly and label it
* Write scripts to train and fine-tune models automatically
* Write scripts to evaluate new models against previous ones on updated test sets
* Deploy new models if significantly better than existing ones
* Monitor input quality and alert on degradation
* Keep backups of datasets and models for rollback capability

The chapter concludes by emphasizing that successful ML projects require extensive infrastructure beyond algorithm selection, including robust data pipelines, monitoring systems, human evaluation frameworks, and automated training processes.