# End-to-End Machine Learning Project

- In this chapter you will work on a machine learning project end-to-end, pretending to be a recently hired data scientist at a real estate company.
- Here are the main steps you will go through:
    1. Frame the problem & look at the big picture
        - [ ] Define the objective in business terms
        - [ ] How will your solution be used?
        - [ ] What are the current solutions/workarounds?
        - [ ] How should you frame this problem?
            - Supervised/Unsupervised
            - Batch/Online
            - Instance-based/Model-based
        - [ ] How should performance be measured?
        - [ ] Is the performance measure aligned with the business objective?
        - [ ] What would be the minimum performance needed to reach the business objective?
        - [ ] Are there comparable problems? Can you reuse previous methods or tools?
        - [ ] Is human expertise available?
        - [ ] How would you solve the problem manually?
        - [ ] List the assumptions you or others have made so far
        - [ ] Verify assumptions if possible
    2. Get the data
        - [ ] List the data you need and how much you need
        - [ ] Find & Document where you can get that data
        - [ ] Check how much space it will take
        - [ ] Check legal obligations & get authorization if necessary
        - [ ] Get access authorization
        - [ ] Create a workspace with enough storage space
        - [ ] Get the data
        - [ ] Convert the data into a format you can easily manipulate
        - [ ] Ensure sensitive information is either deleted or protected
        - [ ] Check the size & type of data
        - [ ] Sample a test set, put it aside, and never look at it
    3. Explore the Data
        - [ ] Create a copy of the data for exploration, sample it down if necessary
        - [ ] Create a Jupyter notebook to keep records on your data exploration
        - [ ] Study each attribute and its characteristics
            - Name
            - Type (Categorical, Continuous, Int/Float, Structured/Unstructured, Text ..)
            - % of missing values
            - Noisiness and type of noise (Stochastic, rounding error, ..)
            - Usefulness for the task
            - Type of distribution (Gaussian, Logarithmic, Uniform ..)
        - [ ] For supervised learning tasks, identify the target attribute
        - [ ] Visualize the data
        - [ ] Study the correlations between attributes
        - [ ] Study how you would solve the problem manually
        - [ ] Identify the promising transformations you may want to apply
        - [ ] Identify extra data that would be useful
        - [ ] Document what you have learned
    4. Prepare the data for machine learning algorithms
        - [ ] Create a copy of the data
        - [ ] Write functions for all the data transformations you want to apply
            - You can easily prepare the data the next time you get a fresh dataset
            - You can apply these transformations in future projects
            - You can clean and prepare the test set
            - You can clean and prepare new data instances in production
            - You can treat cleaning/processing steps as hyper-parameters.
        - [ ] Data Cleaning
            - Fix or remove outliers (optional)
            - Fill in missing values (with 0, mean, median, inference, ...) or drop their rows/columns
        - [ ] Feature Selection
            - Drop the attributes that provide no useful information for the task
        - [ ] Feature Engineering
            - Discretize continuous features
            - Decompose features (Categorical, datetime, ...)
            - Add promising feature transformations ($\log(x)$, $\sqrt{x}$, $x^{2}$, ...)
            - Aggregate features into promising new features
        - [ ] Feature scaling
            - Standardize or normalize features
    5. Shortlist promising models
        - [ ] If the data set is big, sample smaller datasets for experimentation
        - [ ] Try many models from different categories (naive Bayes, linear regression, random forests, neural networks, ...) using standard parameters.
        - [ ] Measure and compare their performance
            - For each model, run N-fold cross-validation and record the mean and standard deviation of the performance measure across the N folds.
        - [ ] Analyze the most significant variable for each algorithm
        - [ ] Analyze the types of errors the models make
            - What data would a human use to avoid these errors?
        - [ ] Perform a quick round of feature selection and engineering
        - [ ] Perform one or two more quick iterations of the previous steps
        - [ ] Shortlist the top three to five most promising models, preferring models that make different types of errors
    6. Fine-tune your models & combine them into a great solution
        - [ ] Use the whole dataset
        - [ ] Fine-tune hyper-parameters using cross-validation
            - Treat your data transformation choices as hyper-parameters
            - Unless there are very few hyper-parameters to explore, prefer random search to grid search.
                - If training takes a long time, you may prefer Bayesian Optimization.
        - [ ] Try ensemble methods. Combining your best models will often produce better results than running them individually.
        - [ ] Once you are confident about your model, measure its performance on the test set to estimate its generalization error.
    7. Present your solution
        - [ ] Document what you have done
        - [ ] Create a nice presentation
        - [ ] Explain why your solution achieves the business objective
        - [ ] Showcase interesting things you noticed along the way
        - [ ] Ensure your key findings are easily communicated through beautiful visualization and one-line statements
    8. Launch, Monitor, and maintain your system
        - [ ] Get your solution ready for production
        - [ ] Write monitoring code that checks your system's live performance at regular intervals and triggers alerts when it drops.
            - Beware of slow degradation: models tend to rot as data evolves
            - Also monitor the quality of your inputs
        - [ ] Retrain your model on a regular basis on fresh data
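
- As a rough illustration of step 5 of the checklist (N-fold cross-validation, keeping the mean and standard deviation of the scores), here is a minimal standard-library sketch. The helper names (`k_fold_indices`, `cross_val_rmse`), the toy data, and the mean-predicting baseline are all assumptions made for illustration; in practice a library such as Scikit-Learn would typically do this work.

```python
import math
import statistics


def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, nearly equal folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_rmse(fit, predict, X, y, n_folds=5):
    """Train on all folds but one, validate on the held-out fold, return one RMSE per fold."""
    scores = []
    for val_idx in k_fold_indices(len(X), n_folds):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        sq_errors = [(predict(model, X[i]) - y[i]) ** 2 for i in val_idx]
        scores.append(math.sqrt(sum(sq_errors) / len(sq_errors)))
    return scores


# Hypothetical baseline "model": always predict the mean of the training targets.
def fit_mean(X_train, y_train):
    return statistics.mean(y_train)


def predict_mean(model, x):
    return model


# Toy data (not the housing dataset). Real data should be shuffled before splitting.
X = [[float(i)] for i in range(10)]
y = [2.0 * i for i in range(10)]

scores = cross_val_rmse(fit_mean, predict_mean, X, y, n_folds=5)
print("mean RMSE:", statistics.mean(scores))
print("std of RMSE:", statistics.stdev(scores))
```

- Reporting the standard deviation alongside the mean shows how much the score depends on which fold was held out, which matters when comparing models whose mean scores are close.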

## Working with Real Data

- When you are learning about ML, it's best to work with real datasets, not artificial ones.
- Popular open data repositories:
    - [UC Irvine ML repo](https://archive.ics.uci.edu/ml/index.php)
    - [Kaggle Datasets](https://www.kaggle.com/datasets)
    - [Amazon AWS Datasets](https://registry.opendata.aws/)
- Meta portals (they list open data repositories):
    - [Data Portals](http://dataportals.org/)
    - [OpenDataMonitor](https://opendatamonitor.eu/frontend/web/index.php?r=dashboard%2Findex)
    - [Quandl](https://www.quandl.com/)
- Other pages listing many open data repositories:
    - [Wikipedia](https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research)
    - [Quora](https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public)
    - [The Datasets Subreddit](https://www.reddit.com/r/datasets)
- In this chapter we'll use the California housing prices dataset, taken from the StatLib repository.
- The dataset is based on data from the 1990 California census.
- For teaching purposes, a categorical feature has been added and a few features have been removed.

## 1. Look at the big picture

- Your first task is to use the California census data to build a model of housing prices in the state.
- This data includes metrics such as:
    - Population
    - Median Income
    - Median housing price for each block group in California
        - A block group is the smallest geographical unit for which census data is published
            - A block group typically has a population of 600 to 3,000 people.
        - We will call them "districts" for short.
- Your model should be able to predict the median housing price for any district, given the other features.
- Frame the problem
    - The first question to ask your boss is: "What is the business objective?"
        - Building a model is probably not the end goal.
    - How does the company expect to benefit from the model?
    - Your boss answers that your model's output (a prediction of a district's median housing price) will be fed, along with other signals, to another model that decides whether or not to invest in the district.
    - Getting this right is critical, as it directly affects revenue.
    - A sequence of data processing components is called a **data pipeline**.
        - Pipelines are very common in machine learning systems, since data needs to be preprocessed, manipulated, and transformed before the final predictions are produced.
    - Components typically run asynchronously.
    - Each component pulls in a large amount of data, processes it, and spits out a result (a prediction) into a data store.
    - Each component is fairly self-contained: the interface between components is simply the data store.
    - Different teams can focus on different components.
    - The next question to ask your boss is: "What does the current solution look like?"
    - Answer the following questions $\to$
        - Is the problem supervised, unsupervised, or reinforcement learning?
            - A supervised learning problem.
        - Is it a classification or regression task?
            - Since we're predicting the median housing price for a district, and since housing prices are numerical, this is a **regression task**.
        - Should you use batch learning or online learning techniques?
            - Since census data is historical and is not updated continuously, a **batch learning** approach is a better fit.
    - Note: if the data were huge, you could either split the batch learning work across multiple servers (for example, with MapReduce) or use online learning.
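
- To make the pipeline idea concrete, here is a toy sketch in which two components communicate only through data stores. Everything in it is hypothetical: the in-memory lists standing in for data stores, the component names, the placeholder pricing rule, and the `max_price` threshold are illustration only, not the company's actual system.

```python
# In-memory stand-ins for the data stores that sit between components.
district_store = [
    {"district": "A", "median_income": 8.0},
    {"district": "B", "median_income": 2.0},
]
price_store = []     # written by the pricing component
decision_store = []  # written by the downstream investment component


def pricing_component(source, sink):
    """Read districts from `source` and write one price prediction per district to `sink`."""
    for row in source:
        # Placeholder rule standing in for a trained model (hypothetical).
        predicted = 50_000 * row["median_income"]
        sink.append({"district": row["district"], "predicted_price": predicted})


def investment_component(source, sink, max_price=200_000):
    """Read price predictions and write an invest / do-not-invest decision."""
    for row in source:
        sink.append({"district": row["district"],
                     "invest": row["predicted_price"] <= max_price})


# Each component only knows about the stores it reads from and writes to.
pricing_component(district_store, price_store)
investment_component(price_store, decision_store)
print(decision_store)
```

- Because neither function calls the other, either component could be replaced or run on a different schedule as long as the records in the shared store keep the same shape.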

- Select a Performance Measure
    - A typical performance measure for regression problems is the **Root Mean Squared Error** (RMSE).
        - It gives a higher weight to large errors.
        - Its general formula is:
$$ RMSE(X,h) = \sqrt{ {1 \over m} \sum_{i=1}^m ( h(x^{(i)}) - y^{(i)} )^2  }$$
- $m$ is the number of instances in the dataset you are measuring the $RMSE$ on.
- $x^{(i)}$ is a vector containing all of the input feature values (excluding the label) for the $i^{th}$ instance.
- $y^{(i)}$ is the label or the desired output of input $x^{(i)}$.
- $X$ is a **matrix** containing all feature values, excluding the labels/targets, of every instance. The $i^{th}$ row of $X$ is the transpose of $x^{(i)}$, so we can write:
$$X=\left(
\begin{array}{c}
  {x^{(1)}}^{T}\\
  {x^{(2)}}^{T}\\
  {x^{(3)}}^{T}\\
  \vdots\\
  {x^{(m)}}^{T}\\
\end{array}
\right)$$
- $h$ is your system's prediction function, also called hypothesis.
- $RMSE(X,h)$ is the cost function measured on the set of examples $X$ and the hypothesis $h$.
- $RMSE$ is generally the preferred performance measure for regression tasks
    - But sometimes, we prefer to use other cost functions.
- When the data contains many outliers, you may consider using the mean absolute error ($MAE$) as a cost function.
    - It is also called the average absolute deviation.
$$ MAE(X,h) = {1 \over m} \sum_{i=1}^m |h(x^{(i)}) - y^{(i)}| $$
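
- The two formulas above translate directly into code. The sketch below uses plain Python and made-up numbers (not the housing data) to show how a single large error moves $RMSE$ more than $MAE$; the function names `rmse` and `mae` are chosen here for illustration.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    m = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / m)


def mae(predictions, targets):
    """Mean absolute error between two equal-length sequences."""
    m = len(targets)
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / m


# Illustrative values only.
targets = [100.0, 200.0, 300.0, 400.0]
close_preds = [110.0, 190.0, 310.0, 390.0]  # every error has magnitude 10
one_outlier = [110.0, 190.0, 310.0, 500.0]  # the last error has magnitude 100

print(rmse(close_preds, targets), mae(close_preds, targets))  # 10.0 10.0
print(rmse(one_outlier, targets), mae(one_outlier, targets))  # about 50.74 and 32.5
```

- With identical error magnitudes the two measures agree; once one error grows to 100, $RMSE$ rises roughly fivefold while $MAE$ rises by a factor of 3.25.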
- Both $RMSE$ and $MAE$ are ways to measure the distance between two vectors, in our case, the distance between the vector of predictions and the vector of targets/labels.
- Various distance measures, or norms, are possible:
    - Computing the root of a sum of squares ($RMSE$) corresponds to the Euclidean norm.
        - It is also called the $l_{2}$ norm.
    - Computing the sum of absolute values ($MAE$) corresponds to the $l_{1}$ norm.
    - More generally, the $l_{k}$ norm of a vector $v$ containing $m$ elements is:
$$||v||_{k} = (\sum_{i=1}^{m} |v_{i}|^{k})^{1 \over k}$$
        - $l_{0}$ gives the number of non-zero elements in the vector $v$
        - $l_{\infty}$ gives the maximum absolute value in the vector $v$
- The higher the norm index, the more it focuses on large values and neglects small ones.
    - This is why $RMSE$ is more sensitive to large errors than $MAE$.
        - But when outliers are exponentially rare (as in a bell-shaped curve), $RMSE$ performs very well and is generally preferred.
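
- The behaviour of the $l_{k}$ norm as $k$ grows can be checked numerically. This is a minimal sketch in plain Python; the function name `lk_norm` and the example vector are assumptions for illustration, and $l_{0}$ and $l_{\infty}$ are handled as special cases following the definitions above.

```python
import math


def lk_norm(v, k):
    """l_k norm of a sequence of numbers, with l_0 and l_inf as special cases."""
    if k == 0:
        return sum(1 for x in v if x != 0)  # number of non-zero elements
    if k == math.inf:
        return max(abs(x) for x in v)       # maximum absolute value
    return sum(abs(x) ** k for x in v) ** (1 / k)


v = [3.0, -4.0, 0.0]  # illustrative vector

print(lk_norm(v, 0))         # 2
print(lk_norm(v, 1))         # 7.0
print(lk_norm(v, 2))         # 5.0
print(lk_norm(v, 4))         # between 4 and 5
print(lk_norm(v, math.inf))  # 4.0

# RMSE and MAE are scaled norms of the error vector.
errors = [10.0, -10.0, 10.0, -100.0]
m = len(errors)
rmse_from_norm = lk_norm(errors, 2) / math.sqrt(m)
mae_from_norm = lk_norm(errors, 1) / m
```

- As $k$ increases, the value moves from the sum of all magnitudes (7) toward the single largest magnitude (4), which is the sense in which higher norms focus on large values.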

- Check the assumptions
    - Lastly, it is good practice to list and verify the assumptions made so far by you or others. This can help you catch serious mistakes early on.
    - Example: you spend 6 months working on an algorithm to predict the median housing price per district, only to find out later that the downstream system converts your predictions into categories ("Cheap", "Medium", "Expensive").
        - In that case, getting the exact price right matters less than getting the category right, so the problem should have been framed as a classification task rather than a regression task.
    - Fortunately, you find out that the downstream team does need the actual median housing prices. Good to go!

## 2. Get the Data

- It is preferable to write small utility functions that automate downloading and extracting web-based datasets.

In [12]:
import os
import tarfile
import urllib.request  # `import urllib` alone does not guarantee that `urllib.request` is loaded

In [13]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("data", "01")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

In [14]:
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Creates `HOUSING_PATH`, Downloads & Extracts the contents of `HOUSING_URL` into `HOUSING_PATH`
    
    # Arguments:
        housing_url, string: the download link
        housing_path, string: where to download & extract data
    """
    os.makedirs(name=housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(url=housing_url, filename=tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(name=tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

- Now we download the data:

In [15]:
fetch_housing_data()
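
- `fetch_housing_data()` downloads the archive again every time the cell is run. A possible variant, sketched below under the assumption that a previously downloaded archive can be trusted, skips the download when the file is already on disk. The name `fetch_if_missing`, its parameters, and its boolean return value are hypothetical and not part of the original notebook.

```python
import os
import tarfile
import urllib.request


def fetch_if_missing(url, target_dir, archive_name="housing.tgz"):
    """Download `url` into `target_dir` unless the archive already exists, then extract it.

    Returns True if a download happened, False if the cached archive was reused.
    """
    os.makedirs(target_dir, exist_ok=True)
    archive_path = os.path.join(target_dir, archive_name)

    downloaded = False
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
        downloaded = True

    # Extraction is cheap compared with the download, so it is always repeated.
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=target_dir)
    return downloaded


# Example call, reusing the constants defined earlier in the notebook:
# fetch_if_missing(HOUSING_URL, HOUSING_PATH)
```

- The trade-off is that a stale or partially downloaded archive will be reused silently, so deleting the file is the way to force a fresh download.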

- Let's write a small function to load the data using pandas:

In [16]:
import pandas as pd

In [17]:
def load_housing_data(housing_path=HOUSING_PATH):
    """Loads Housing data into a pandas dataframe.
    
    # Arguments:
        housing_path: the path where `housing.csv` exists
    
    # Returns:
        data, pd.DataFrame: the housing data as a pandas dataframe
    """
    data_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(data_path)

- Take a quick look at the data structure:

In [18]:
housing = load_housing_data()

In [19]:
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


- Each row represents one district.
- There are 10 attributes:
    - `longitude`
    - `latitude`
    - `housing_median_age`
    - `total_rooms`
    - `total_bedrooms`
    - `population`
    - `households`
    - `median_income`
    - `median_house_value`
    - `ocean_proximity`
- The `info()` method gives a quick description of the data, in particular:
    - The total number of rows
    - The number of non-null values in each column (and therefore how many values are missing)
    - The data type of each column

In [21]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


- There are $20,640$ instances in the dataset.
    - Which means that it is fairly small by machine learning standards.
    - $207$ districts are missing the `total_bedrooms` attribute
    - We will need to take care of this later.
- All attributes are numerical, except `ocean_proximity`.
- Since the `ocean_proximity` values repeat in the first 5 rows, it is probably a categorical column. Let's check:

In [23]:
housing['ocean_proximity'].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
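
- The two checks made so far (missing values per column and category frequencies) can also be made directly. The sketch below uses a tiny hand-made frame, not the real dataset, so the numbers are illustrative only; `isna().sum()` counts missing values per column and `value_counts(normalize=True)` reports shares instead of raw counts.

```python
import pandas as pd

# Toy frame that mimics two of the housing columns (made-up values).
toy = pd.DataFrame({
    "total_bedrooms": [129.0, None, 190.0, None],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "INLAND"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Share of rows falling in each category.
category_shares = toy["ocean_proximity"].value_counts(normalize=True)
print(category_shares)
```

- On the real data, the same `isna().sum()` call on `housing` should report the 207 missing `total_bedrooms` values that were derived from the `info()` output (20,640 rows minus 20,433 non-null values).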

- `.describe()` shows a summary of all numerical values:

In [24]:
housing.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


- `.describe()` ignores null values (for example, the `count` of `total_bedrooms` is 20,433, not 20,640).
- The `std` row shows the standard deviation, which measures how dispersed the values are.
- The `25%`, `50%`, and `75%` rows show the corresponding percentiles of each column.
    - Example: 25% of districts have a housing median age of 18 years or less.
- Another way to get a feel for numerical data is to draw a histogram of each numerical column.
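
- To make the rows of `.describe()` concrete, the sketch below recomputes them on a tiny hand-made series (illustrative values, not the housing data). It relies on pandas' defaults: `std()` is the sample standard deviation and `quantile()` interpolates linearly between data points.

```python
import math

import pandas as pd

# Five known values plus one missing value.
ages = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None])

summary = ages.describe()
print(summary["count"])  # 5.0, the missing value is ignored

# The percentile rows match direct calls to quantile().
print(ages.quantile(0.25), ages.quantile(0.50), ages.quantile(0.75))  # 2.0 3.0 4.0

# The std row matches the sample standard deviation (divides by n - 1).
values = [1.0, 2.0, 3.0, 4.0, 5.0]
mean = sum(values) / len(values)
manual_std = math.sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1))
print(summary["std"], manual_std)
```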