# California Housing Prices
We'll fit a model to predict house prices in California using data from the 1990 census.

1. **Problem Statement or Project Overview**

Current methodologies for price forecasting often rely on deterministic heuristics and basic statistical tools, resulting in high variance and error margins exceeding 30%. These conventional approaches struggle to capture the stochastic nature of market data.

To address this, this notebook implements a Machine Learning pipeline designed to uncover latent patterns and non-linear dependencies that escape traditional linear analysis. By leveraging high-dimensional feature extraction, the model aims to minimize loss and improve generalization capabilities.

2. **Objective Function Definition**

To guide the model's optimization, we must select an appropriate Loss Function (also referred to as the Cost Function). This metric quantifies the divergence between the model's predictions ($\hat{y}$) and the ground truth ($y$).

This task is a regression problem because the objective is to forecast house prices, which are continuous values. Loss functions suitable for regression problems include:
1. RMSE (Root Mean Squared Error)
2. MAE (Mean Absolute Error)

$$RMSE(X,h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)})-y^{(i)}\right)^{2}}$$

$$MAE(X,h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(x^{(i)})-y^{(i)}\right|$$
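To make the two metrics concrete, here is a minimal sketch that computes them in plain Python, using the standard definitions in which both are averaged over the $m$ samples. The function names `rmse` and `mae` and the sample prices are illustrative assumptions, not part of the notebook's pipeline or of the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(y_true)
    squared_errors = ((p - t) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sum(squared_errors) / m)


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(y_true)
    absolute_errors = (abs(p - t) for t, p in zip(y_true, y_pred))
    return sum(absolute_errors) / m


# Made-up prices in USD, for illustration only (not taken from the dataset).
y_true = [200_000.0, 150_000.0, 300_000.0]
y_pred = [210_000.0, 150_000.0, 280_000.0]

print(f"RMSE: {rmse(y_true, y_pred):.1f}")  # roughly 12909.9
print(f"MAE:  {mae(y_true, y_pred):.1f}")   # 10000.0
```

In this example the RMSE is larger than the MAE because squaring gives the single 20,000 USD miss more weight. RMSE is never smaller than MAE on the same data, which is why it is the more outlier-sensitive of the two.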

3. **Data Acquisition**

To initialize the training pipeline, we must first ingest the raw data. The dataset used for this project is the California Housing Prices dataset, which is sourced directly from Aurélien Géron's repository.
We retrieve the archive (.tgz) programmatically to ensure reproducibility.

Source URL: https://github.com/ageron/data/raw/main/housing.tgz

In [1]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data():
    # local path where the downloaded tarball is cached
    tarball_path = Path("../data/datasets/housing.tgz")
    # download only if the tarball is not already present
    if not tarball_path.is_file():
        # create datasets directory if it doesn't exist
        Path("../data/datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        # download the tarball
        urllib.request.urlretrieve(url, tarball_path)
    with tarfile.open(tarball_path) as housing_tarball:
        # extract the contents
        housing_tarball.extractall(path="../data/datasets")
    return pd.read_csv(Path("../data/datasets/housing/housing.csv"))

housing = load_housing_data()