# End-to-end Machine Learning project

Based on the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by Aurélien Géron

Adapted from original [notebook](https://github.com/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb) by Aurélien Géron.
This notebook requires Python 3.7 or later and scikit-learn 1.0.1 or later.

I highly recommend using [Colab](https://colab.research.google.com/) or a [virtual environment](https://docs.python.org/3/tutorial/venv.html) to keep all your dependencies contained and frozen to a specific version. Note that TensorFlow (to be used later in the course) only supports up to Python 3.11 right now, and only supports GPUs on Linux (including WSL2).

In [None]:
from packaging import version
import sklearn

assert version.parse(sklearn.__version__) >= version.parse("1.0.1")

## High-level Machine Learning Project Checklist
See Appendix B (2019 edition) or Appendix A (2022 edition) of the book.

1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

## Frame the problem and look at the big picture
*Welcome to Machine Learning Housing Corp.! Your task is to predict median house values in Californian districts, given a number of features from these districts.*

**How does the company expect to use and benefit from this model?**
The model's output, along with many other signals, will be used to determine whether it is worth investing in a given area or not.

**What does the current solution look like (if any)?** The current situation will often give you a reference for performance, as well as insights on how to solve the problem.

Answer: The district housing prices are currently estimated manually by experts: a team gathers up-to-date information about a district, and when they cannot get the median housing price, they estimate it using complex rules.
This is costly and time-consuming, and their estimates are not great; in cases where they manage to find out the actual median housing price, they often realize that their estimates were off by more than 20%.

**❓ Discussions for class:**
- What kind of ML task is this?
- What kind of performance measure should we use?

## Download the Data

In [None]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data():
    dir_path = Path("../datasets")
    tarball_path = dir_path / "housing.tgz"
    if not tarball_path.is_file():
        dir_path.mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path=dir_path)
    return pd.read_csv(dir_path / "housing" / "housing.csv")

housing = load_housing_data()

## Take a Quick Look at the Data Structure

In [None]:
# look at the first few rows of the housing dataframe
housing.head()

In [None]:
# summarize the data
housing.info()

In [None]:
# look at the categories in the ocean_proximity column
housing["ocean_proximity"].value_counts()

In [None]:
import matplotlib.pyplot as plt

# plot a histogram for each numerical attribute
housing.hist(bins=50, figsize=(12, 8))

### ❓ Discussions for class:
- What do you notice about the data?
- Do the values make sense for the labels?
- Is the scale of the features comparable? Does this matter?
- What possible biases might be present in the data?

## Create a Test Set

**❓ Discussions for class:**
- Why do we need a test set?
- What is data snooping bias?
- How should we create the test set?

In [None]:
# Naive approach: use the index as the identifier and randomly select 20%
import numpy as np
np.random.seed(42)

def shuffle_and_split_data(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

In [None]:
train_set, test_set = shuffle_and_split_data(housing, 0.2)
print("Training samples: ", len(train_set), "Testing samples: ", len(test_set))

### ❓ Discussions for class:
- What is the purpose of the random seed?
- What is naive about this approach? (Hint: what if the dataset is updated?)
- What alternative identifier could we use?

### Hash-based identifier
Instead of randomly permuting the indices, we can compute a hash of each instance's identifier and put the instance in the test set if its hash is lower than 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if we refresh the dataset.

In [None]:
from zlib import crc32

def is_id_in_test_set(identifier, test_ratio):
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

In [None]:
housing_with_id = housing.reset_index()  # adds an `index` column
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "index")
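
To check the "remains consistent" claim, here is a small sketch on toy data. The sequential ids and the 200 appended rows are made up for illustration, and `hashed_test_ids` is a hypothetical helper, not part of the original notebook. It verifies that rows already assigned to the test set stay there after new rows are appended.

```python
from zlib import crc32

import numpy as np
import pandas as pd

def is_id_in_test_set(identifier, test_ratio):
    # same rule as above: compare the 32-bit hash to a fraction of its range
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def hashed_test_ids(ids, test_ratio=0.2):
    """Hypothetical helper: the set of ids that the hash rule sends to the test set."""
    mask = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return set(ids[mask])

# Toy identifiers (assumption): 1,000 rows at first, 200 more appended later
ids_before = pd.Series(range(1000))
ids_after = pd.Series(range(1200))

test_before = hashed_test_ids(ids_before)
test_after = hashed_test_ids(ids_after)

# Restrict the new test set to the old rows and compare with the old test set
old_ids_in_new_test = {id_ for id_ in test_after if id_ < 1000}
print("Old rows keep their assignment:", old_ids_in_new_test == test_before)
print("Share of old rows in the test set:", len(test_before) / len(ids_before))
```

Each row's assignment depends only on its own id, so appending rows can never move an existing row between sets. The same guarantee does not hold if rows are inserted or deleted in a way that changes existing ids.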

Unfortunately this dataset doesn't have a unique identifier other than the row number, which doesn't protect against insertions in the dataset. Another solution is to pick something that uniquely identifies the sample, such as a combination of the district's latitude and longitude (although this isn't perfect either, as some districts may be close enough together that their ids are computed to be the same).

In [None]:
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "id")
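
One way these ids can collide: `is_id_in_test_set` casts the identifier with `np.int64`, which drops the fractional part before hashing. The sketch below uses three made-up districts (not rows from the dataset) to show two distinct float ids ending up as the same integer.

```python
import numpy as np
import pandas as pd

# Hypothetical districts: the first two share a longitude and are 0.01 degrees apart
toy = pd.DataFrame({
    "longitude": [-122.25, -122.25, -118.25],
    "latitude": [37.85, 37.84, 34.05],
})

# Same id recipe as above
toy["id"] = toy["longitude"] * 1000 + toy["latitude"]

# What actually gets hashed: np.int64 truncates the fractional part
toy["hashed_as"] = toy["id"].apply(np.int64)

print(toy)
print("Distinct float ids:  ", toy["id"].nunique())
print("Distinct hashed ids: ", toy["hashed_as"].nunique())
```

Districts that collide like this always land in the same set, which is harmless for consistency but means the id is coarser than it looks.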

In [None]:
# scikit-learn's train_test_split function does basically the same thing as shuffle_and_split_data, plus a random_state parameter and the ability to split several datasets on the same indices
# This function is commonly used but it's important to understand its assumptions and limitations!
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

### Random sampling bias
Random sampling is fine if the dataset is large enough relative to the number of attributes, but if it is not, you run the risk of introducing **sampling bias**.

**🧮 Math break!**

Say we want to make sure that survey respondents represent the population ratio of 51.1% female and 48.9% male ($\pm 3\%$). The [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution) can be used to model the probability of choosing $k$ female participants from $n$ total participants:

$$P(X = k) = \binom{n}{k}p^k(1-p)^{n-k}$$

where

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

$P(X = k)$ is the probability mass function, and the corresponding cumulative distribution function is just the sum up to $k$:

$$P(X \leq k) = \sum_{i=0}^k \binom{n}{i}p^i(1-p)^{n-i}$$

You can see how this depends on $n$!

In [None]:
from scipy.stats import binom

p = 0.511  # proportion of females in the population
buffer = 0.03
sample_sizes = [10, 100, 500, 1000, 5000, 10000]
prob_bias = []

for n in sample_sizes:
    too_small = n * (p - buffer)
    too_large = n * (p + buffer)
    proba_too_small = binom(n, p).cdf(too_small - 1)
    proba_too_large = 1 - binom(n, p).cdf(too_large)
    prob_bias.append((proba_too_small + proba_too_large) * 100)

print(sample_sizes)
print(prob_bias)
plt.plot(sample_sizes, prob_bias, "o-")
plt.xlabel("Sample size")
plt.ylabel("Probability of sampling bias (%)")
plt.show()
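
If you want to see the formulas at work without `scipy`, here is a minimal sketch that evaluates the PMF and CDF directly with the standard library. The helper names (`binom_pmf`, `binom_cdf`, `prob_outside`) are made up for this example, and the bounds are written in integer arithmetic (48.1% and 54.1% as 481 and 541 per thousand) to sidestep floating-point rounding.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), straight from the formula."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """P(X <= k): the PMF summed from 0 to k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def prob_outside(n, p=0.511):
    """Probability that the female share of n respondents is off by more than 3 points."""
    k_low = n * 481 // 1000 - 1   # same cutoff as cdf(too_small - 1) above
    k_high = n * 541 // 1000      # same cutoff as cdf(too_large) above
    too_few = binom_cdf(k_low, n, p)
    too_many = 1 - binom_cdf(k_high, n, p)
    return too_few + too_many

for n in (100, 1000):
    print(f"n = {n}: probability of a skewed sample = {prob_outside(n) * 100:.1f}%")
```

The sum in `binom_cdf` is the same one written out in the CDF formula above, so this should agree with `binom(n, p).cdf(k)` up to floating-point error.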

### Stratified sampling
Instead of taking a naive random sample, we can use **stratified sampling** to ensure that the test set is representative of the overall population. The population is divided into smaller subgroups called **strata**, and a representative random sample is drawn from each.

Scikit-learn provides a handy option for this in `train_test_split`, but we need to decide what the strata should be. Here's where **domain knowledge** comes in - in this case, let's say we were told that the current manual process uses median income as a proxy for median housing price, so we should make sure that the test set is representative of the income distribution.

In [None]:
# check to see that it's a reasonable proxy
plt.scatter(housing["median_income"], housing["median_house_value"], alpha=0.1)
plt.grid(True)
plt.xlabel("Median income")
plt.ylabel("Median house value")

A few things look odd in this plot, most obviously the hard ceiling at 500k. There are also fainter horizontal lines at 450k and 350k. Eventually we might want to deal with these outliers, but for now we'll ignore them.

### ❓ Discussions for class:
- How might we deal with the outliers?

In [None]:
# split the median income into reasonable categories in order to create strata buckets
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing["income_cat"].hist()

In [None]:
strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

In [None]:
# verify the sampling by looking at the histogram of the test set
strat_test_set["income_cat"].hist()
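
Histograms are hard to compare by eye, so here is a sketch that puts the category proportions side by side. It runs on synthetic, log-normally distributed incomes (an assumption, so that the block is self-contained), and `income_cat_proportions` is a helper defined just for this example. The same comparison should work on `housing` as long as it is run before `income_cat` is dropped.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption): skewed "incomes"
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
toy["income_cat"] = pd.cut(toy["median_income"],
                           bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                           labels=[1, 2, 3, 4, 5])

def income_cat_proportions(data):
    """Share of rows falling in each income category."""
    return data["income_cat"].value_counts(normalize=True).sort_index()

# One stratified split and one purely random split of the same data
_, strat_test = train_test_split(
    toy, test_size=0.2, stratify=toy["income_cat"], random_state=42)
_, rand_test = train_test_split(toy, test_size=0.2, random_state=42)

compare = pd.DataFrame({
    "overall": income_cat_proportions(toy),
    "stratified": income_cat_proportions(strat_test),
    "random": income_cat_proportions(rand_test),
})
# Relative error of each test set against the overall proportions, in percent
compare["strat_error_%"] = (compare["stratified"] / compare["overall"] - 1) * 100
compare["rand_error_%"] = (compare["random"] / compare["overall"] - 1) * 100
print(compare.round(3))
```

The stratified column should track the overall proportions almost exactly, while the random column drifts by an amount that depends on the seed.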

In [None]:
# We don't actually want the income_cat column sticking around, so let's drop it
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

# Discover and Visualize the Data to Gain Insights

In [None]:
# Make a copy of the training set to mess around with
housing = strat_train_set.copy()

## Visualizing Geographical Data

Not every data set makes sense to plot on a map, but in this case we have latitude and longitude. Might as well plot it.

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", grid=True)
plt.show()

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", grid=True, alpha=0.2)
plt.show()

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", grid=True,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True,
             legend=True, figsize=(10, 7))
plt.show()

## Looking for Correlations

**🧮 Math break!**

The [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is a measure of the linear correlation between two variables $X$ and $Y$ (commonly denoted as $r$):

$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of $X$ and $Y$, respectively.

Coefficients close to 1 indicate strong positive correlation, coefficients close to -1 indicate strong negative correlation, and coefficients close to 0 indicate no linear correlation.

![](https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg)

We'll revisit the concept of correlation later on when we talk about convolution.
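
To make the formula concrete, here is a small sketch that computes $r$ term by term and checks it against `np.corrcoef`. The function name `pearson_r` and the toy arrays are made up for illustration. The last case is a reminder that $r$ only measures linear relationships: a perfect parabola centered on the mean of $x$ comes out as uncorrelated.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, computed directly from the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the sample mean of x
    dy = y - y.mean()  # deviations from the sample mean of y
    return (dx * dy).sum() / (np.sqrt((dx**2).sum()) * np.sqrt((dy**2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_up = pearson_r(x, 2 * x + 1)        # perfectly linear, increasing
r_down = pearson_r(x, -x)             # perfectly linear, decreasing
r_curve = pearson_r(x, (x - 3) ** 2)  # perfectly dependent, but not linearly

print("increasing line:", r_up)
print("decreasing line:", r_down)
print("parabola:       ", r_curve)
print("np.corrcoef on the increasing line:", np.corrcoef(x, 2 * x + 1)[0, 1])
```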

In [None]:
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)

In [None]:
# pandas' scatter_matrix function plots every numerical attribute against every other numerical attribute
# This can be useful, but you probably want to restrict the number of attributes to plot

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()

## Experimenting with Attribute Combinations

Aka **feature engineering**. This is another place where domain knowledge is important!

In [None]:
housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

In [None]:
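# recompute the correlation matrix so that it includes the new attributes
corr_matrix = housing.corr(numeric_only=True)
corr_sorted = corr_matrix["median_house_value"].sort_values(ascending=False)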
corr_sorted.plot(kind="bar")
plt.plot([-0.5, len(corr_sorted)], [0, 0], "k--", linewidth=2)
plt.xticks(rotation=90)
plt.xlabel("Attribute")
plt.ylabel("Correlation Coefficient")
plt.title("Correlation of Attributes with Median House Value")
plt.show()