# Chapter 2: End-to-End Machine Learning Project

## 1. Chapter Overview
**Goal:** In this chapter, we work through a complete Machine Learning project from start to finish, playing the role of a data scientist at a real estate company. Our task is to predict median house values in California districts, given a number of features describing each district.

**Key Concepts:**
* **Big Picture:** Understanding the problem (Supervised, Regression, Batch).
* **Data Splitting:** Creating a test set and stratified sampling.
* **Data Visualization:** Gaining insights from geographical data.
* **Data Preparation:** Handling missing values, categorical features, and feature scaling.
* **Transformation Pipelines:** Automating data processing.
* **Model Selection:** Training and evaluating multiple models (Linear Regression, Decision Tree, Random Forest).
* **Fine-Tuning:** Using Grid Search to find the best hyperparameters.

**Practical Skills:**
* Using `Scikit-Learn` pipelines (`Pipeline`, `ColumnTransformer`).
* Data cleaning with `SimpleImputer`.
* Encoding text data with `OneHotEncoder`.
* Cross-Validation techniques.

In [None]:
# Setup
# Python ≥3.6 is required (the code below uses f-strings)
import sys
assert sys.version_info >= (3, 6)

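# Scikit-Learn ≥0.20 is required
import sklearn
# Compare the numeric (major, minor) parts: comparing raw strings would
# misorder versions such as "0.9" and "0.20"
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)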

# Common imports
import numpy as np
import os

# To plot pretty figures directly within Jupyter
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## 2. Theoretical Explanation

### 1. Problem Definition
* **Supervised Learning:** We have labeled training data (the median house value is known for each district).
* **Regression Task:** We are predicting a continuous value (price).
    * *Multiple Regression:* We use multiple features (population, median income, etc.) to make a prediction.
    * *Univariate Regression:* We are predicting a single value per district.
* **Batch Learning:** The system is trained on all available data offline, not incrementally.

### 2. Performance Measures
To evaluate how good our regression model is, we typically use:

**RMSE (Root Mean Square Error):**
$$ RMSE(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 } $$
* It gives more weight to large errors, because each error is squared before averaging.
* It is the typical performance measure for regression tasks.

**MAE (Mean Absolute Error):**
$$ MAE(X, h) = \frac{1}{m} \sum_{i=1}^{m} |h(x^{(i)}) - y^{(i)}| $$
* It is less sensitive to outliers than RMSE, so it may be preferred when the data contains many outlier districts.
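
To make the difference concrete, here is a minimal sketch using made-up numbers (the arrays and the helper names `rmse` and `mae` are illustrative, not part of the housing project). Both sets of predictions have the same MAE, but the one that concentrates the error in a single large miss has twice the RMSE.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: square the errors, average, take the root."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute errors."""
    return np.mean(np.abs(y_pred - y_true))

# Hypothetical labels and two hypothetical sets of predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_spread = y_true + np.array([10.0, -10.0, 10.0, -10.0])  # four errors of 10
y_pred_outlier = y_true + np.array([0.0, 0.0, 0.0, 40.0])      # one error of 40

rmse_spread, mae_spread = rmse(y_true, y_pred_spread), mae(y_true, y_pred_spread)
rmse_outlier, mae_outlier = rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier)

print("Spread errors  -> RMSE:", rmse_spread, "MAE:", mae_spread)
print("Single outlier -> RMSE:", rmse_outlier, "MAE:", mae_outlier)
```

In this toy case the MAE is 10 for both, while the RMSE is 10 for the evenly spread errors and 20 for the single large error.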

### 3. Stratified Sampling
Random sampling works for large datasets, but for smaller ones, you risk **sampling bias**. Stratified sampling ensures that the test set is representative of the overall population by dividing the population into homogeneous subgroups (strata) and sampling from each stratum to match the overall distribution (e.g., ensuring income categories are represented proportionally).
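
The effect can be sketched on synthetic data (the strata sizes below are invented for illustration and are unrelated to the housing dataset). We split the same labels twice with `train_test_split`, once purely at random and once with `stratify=`, and compare the share of each stratum in the resulting test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic population: three strata with shares 5%, 30% and 65%
strata = np.array([1] * 50 + [2] * 300 + [3] * 650)
rng = np.random.default_rng(42)
X = rng.normal(size=(len(strata), 2))  # placeholder features

def proportions(labels):
    """Return the share of each stratum as a {label: fraction} dict."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

# Purely random split: stratum shares in the test set can drift
_, _, _, random_test = train_test_split(
    X, strata, test_size=0.2, random_state=42)

# Stratified split: shares in the test set follow the overall shares
_, _, _, strat_test = train_test_split(
    X, strata, test_size=0.2, random_state=42, stratify=strata)

print("Overall:   ", proportions(strata))
print("Random:    ", proportions(random_test))
print("Stratified:", proportions(strat_test))
```

The stratified test set should reproduce the 5% / 30% / 65% shares, whereas the random split may over- or under-represent the small stratum.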

### 4. Pipelines
In ML, data must go through a sequence of processing steps (imputation $\rightarrow$ scaling $\rightarrow$ modeling). A **Pipeline** ensures these steps are executed in the correct order and makes the code reproducible and deployable.
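
As a small illustration of the idea, the sketch below chains imputation, scaling and a model into one `Pipeline` on synthetic data (the data, the step names and the choice of `LinearRegression` are assumptions made for this example). Calling `fit` runs every step in order, and `predict` applies the already-fitted preprocessing before the model.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with a few missing values
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out roughly 5% of the entries

# Steps run in the order listed: impute -> scale -> fit the model
model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

model.fit(X, y)
predictions = model.predict(X[:5])  # preprocessing is reapplied automatically
print(predictions)
```

Because the preprocessing lives inside the pipeline, the statistics learned on the training data (medians, means, standard deviations) are reused unchanged on any new data.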

## 3. Code Reproduction

We will implement the project step-by-step as described in the book.

### 3.1 Fetching and Loading Data

In [None]:
import os
import tarfile
import urllib.request
import pandas as pd

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz and extract it into housing_path."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

# Fetch and load
fetch_housing_data()
housing = load_housing_data()
housing.head()

### 3.2 Creating a Test Set (Stratified Sampling)
We create an income category attribute to perform stratified sampling, as median income is a very important attribute for predicting housing prices.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Create income categories to categorize the continuous 'median_income' variable
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

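# Drop the temporary income_cat column so both sets contain only the original attributes
# (reassigning avoids modifying a slice of the original DataFrame in place)
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)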

print("Training set size:", len(strat_train_set))
print("Test set size:", len(strat_test_set))

### 3.3 Discover and Visualize the Data
We explore only the training set, so that nothing we notice in the test set can influence our modeling choices (data snooping bias).

In [None]:
housing = strat_train_set.copy()

# Visualize geographical data
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.title("California Housing Prices")
plt.legend()

### 3.4 Data Cleaning and Preparation Pipeline
We will separate the predictors (features) from the labels (target). Then we build a pipeline to handle numerical and categorical attributes separately.

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Separate numerical and categorical columns
housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

# Pipeline for numerical attributes
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")), # Fill missing values with median
    ('std_scaler', StandardScaler()),              # Scale features
])

# Full pipeline handling both numerical and categorical
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),         # Convert text categories to numbers
])

housing_prepared = full_pipeline.fit_transform(housing)
print("Data prepared shape:", housing_prepared.shape)

### 3.5 Select and Train a Model
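We start by training two models, Linear Regression and a Decision Tree, and measure their RMSE on the training set. A Random Forest is evaluated in the next section using cross-validation.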

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# 1. Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print(f"Linear Regression RMSE: {lin_rmse}")

# 2. Decision Tree Regressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print(f"Decision Tree RMSE: {tree_rmse}") # Likely 0.0 due to overfitting

### 3.6 Evaluation using Cross-Validation
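A training RMSE of 0 strongly suggests that the Decision Tree overfit the training data. K-fold cross-validation gives a more trustworthy estimate: the training set is split into 10 folds, and the model is trained 10 times, each time being evaluated on the one fold it did not see.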

In [None]:
from sklearn.model_selection import cross_val_score

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

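# Cross-validation for Decision Tree
# Scikit-Learn scorers follow a "greater is better" convention, so the MSE is
# returned negated; we flip the sign before taking the square root.
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)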

print("\nDecision Tree Cross-Validation:")
display_scores(tree_rmse_scores)

# Cross-validation for Random Forest
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)

print("\nRandom Forest Cross-Validation:")
display_scores(forest_rmse_scores)

### 3.7 Fine-Tuning the Model (Grid Search)
We use GridSearchCV to find the best combination of hyperparameters for the Random Forest.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

print("Best Parameters:", grid_search.best_params_)

# Evaluate final model on test set
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print("Final RMSE on Test Set:", final_rmse)

## 4. Step-by-Step Explanation

### 1. Data Pipeline (`full_pipeline`)
**Input**: Raw pandas DataFrame.

**Process**:
1.  **Imputation**: The `SimpleImputer` replaces `NaN` values in numerical columns with the column median learned from the training set. This is needed because most Scikit-Learn estimators cannot handle missing values.
2.  **Scaling**: The `StandardScaler` standardizes each feature (mean=0, variance=1). This matters most for models trained with gradient descent (such as neural networks) and for distance-based or regularized models. The predictions of `LinearRegression` (ordinary least squares) and of tree-based models are largely unaffected by feature scale.
3.  **Encoding**: The `OneHotEncoder` transforms the categorical `ocean_proximity` attribute (e.g., "INLAND", "NEAR BAY") into one-hot vectors, with one binary column (dummy variable) per category.

**Output**: A clean, numerical NumPy array ready for training.
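
The three steps can be traced on a tiny hand-made DataFrame (the `toy` data below is invented for illustration; only the `ocean_proximity` column name is borrowed from the dataset). Setting `sparse_threshold=0` is a choice made here so that the result is always a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numerical column with a missing value, one categorical column
toy = pd.DataFrame({
    "rooms": [2.0, 4.0, np.nan, 6.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # NaN -> median of 2, 4, 6 = 4
    ("std_scaler", StandardScaler()),               # [2, 4, 4, 6] -> mean 0, variance 1
])

toy_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["rooms"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),  # one column per category
], sparse_threshold=0)  # always return a dense NumPy array

prepared = toy_pipeline.fit_transform(toy)
print(prepared)
```

The first output column holds the scaled `rooms` values (the imputed row ends up at 0 because the median equals the mean here), and the last two columns are the one-hot encoding of "INLAND" and "NEAR BAY".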

### 2. Model Training & Evaluation
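* **Linear Regression**: Fits a linear function of the features (a hyperplane rather than a single straight line, since there are several features). The RMSE was high (underfitting), meaning the model is too simple to capture the structure of the data.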
* **Decision Tree**: Achieved 0.0 RMSE on training data but performed poorly on cross-validation. This is a classic example of **overfitting**; it memorized the training data.
* **Random Forest**: An ensemble of Decision Trees. It performed significantly better because it averages out the predictions of many trees, reducing overfitting.
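
The gap between training error and cross-validated error can be reproduced on synthetic data. The sketch below uses `make_regression` with added noise as a stand-in for the housing data (an assumption for illustration only) and compares a fully grown tree with a Random Forest.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression problem (stand-in for the real dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)

def cv_rmse(model):
    """Mean RMSE over 5 cross-validation folds."""
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=5)
    return np.sqrt(-scores).mean()

# An unconstrained tree can memorize every training point
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X, y)
tree_train_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))
tree_cv_rmse = cv_rmse(tree)

# A forest averages many trees grown on bootstrap samples
forest_cv_rmse = cv_rmse(RandomForestRegressor(n_estimators=100, random_state=42))

print(f"Tree RMSE on training data:  {tree_train_rmse:.2f}")
print(f"Tree RMSE, cross-validated:  {tree_cv_rmse:.2f}")
print(f"Forest RMSE, cross-validated: {forest_cv_rmse:.2f}")
```

The tree's training RMSE is essentially zero while its cross-validated RMSE is far larger, which is the signature of overfitting. On data like this the forest typically scores better than the single tree under cross-validation, although the exact numbers depend on the data.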

### 3. Grid Search
Instead of manually guessing hyperparameters (like `n_estimators` for the number of trees), `GridSearchCV` tries every combination defined in `param_grid`. It scores each combination with cross-validation, so the selection is based on an estimate of how well each combination generalizes rather than on training error.
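
To see how much work this implies, we can count the combinations with `ParameterGrid`, which expands a grid the same way `GridSearchCV` does. The sketch below reuses the `param_grid` from section 3.7; the variable names `n_combinations` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# First dict: 3 * 4 = 12 combinations; second dict: 1 * 2 * 3 = 6 combinations
n_combinations = len(ParameterGrid(param_grid))

# Each combination is trained once per fold (cv=5 in section 3.7)
n_fits = n_combinations * 5

print("Combinations:", n_combinations)
print("Model fits during the search:", n_fits)
```

This gives 18 combinations and 90 training runs. With the default `refit=True`, `GridSearchCV` then trains the best combination one more time on the full training set, which is the model exposed as `best_estimator_`.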

## 5. Chapter Summary

* **End-to-End Workflow**: We covered the entire process: fetching data, cleaning it, choosing a model, tuning it, and evaluating it.
* **Data Exploration**: Visualization reveals patterns (like location clustering) that can inform feature engineering.
* **Preprocessing is Key**: Real-world data is messy. Pipelines handle missing values and categorical data systematically.
* **Evaluation**: RMSE is a standard metric for regression. Cross-validation provides a more reliable error estimate than a single validation set, and also shows how much that estimate varies across folds.
* **Automation**: `GridSearchCV` automates the tedious process of hyperparameter tuning.