# Checklist
This checklist can guide you through your Machine Learning projects. There are eight main steps:
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.
Obviously, you should feel free to adapt this checklist to your needs.

## Frame the Problem and Look at the Big Picture
1. Define the objective in business terms.
2. How will your solution be used?
3. What are the current solutions/workarounds (if any)?
4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
5. How should performance be measured?
6. Is the performance measure aligned with the business objective?
7. What would be the minimum performance needed to reach the business objective?
8. What are comparable problems? Can you reuse experience or tools?
9. Is human expertise available?
10. How would you solve the problem manually?
11. List the assumptions you (or others) have made so far.
12. Verify assumptions if possible.

## Get the Data
Note: automate as much as possible so you can easily get fresh data.
1. List the data you need and how much you need.
2. Find and document where you can get that data.
3. Check how much space it will take.
4. Check legal obligations, and get authorization if necessary.
5. Get access authorizations.
6. Create a workspace (with enough storage space).
7. Get the data.
8. Convert the data to a format you can easily manipulate (without changing the data itself).
9. Ensure sensitive information is deleted or protected (e.g., anonymized).
10. Check the size and type of data (time series, sample, geographical, etc.).
11. Sample a test set, put it aside, and never look at it (no data snooping!).

## Explore the Data
Note: try to get insights from a field expert for these steps.
1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
2. Create a Jupyter notebook to keep a record of your data exploration.
3. Study each attribute and its characteristics:
   - Name
   - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
   - % of missing values
   - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
   - Possibly useful for the task?
   - Type of distribution (Gaussian, uniform, logarithmic, etc.)
4. For supervised learning tasks, identify the target attribute(s).
5. Visualize the data.
6. Study the correlations between attributes.
7. Study how you would solve the problem manually.
8. Identify the promising transformations you may want to apply.
9. Identify extra data that would be useful (go back to “Get the Data”).
10. Document what you have learned.
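
Part of step 3 (studying each attribute) can be automated. The sketch below is one possible way to do it with pandas; the helper name `summarize_attributes` and the toy values are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd


def summarize_attributes(df):
    """Return one row per attribute: dtype, % of missing values, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "pct_missing": df.isna().mean() * 100,
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "rooms": [3.0, 4.0, np.nan, 5.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "INLAND"],
})
summary = summarize_attributes(toy)
print(summary)
```

A table like this is a convenient starting point for the notes recommended in step 10.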

## Prepare the Data
Notes:
- Work on copies of the data (keep the original dataset intact).
- Write functions for all data transformations you apply, for five reasons:
  - So you can easily prepare the data the next time you get a fresh dataset
  - So you can apply these transformations in future projects
  - To clean and prepare the test set
  - To clean and prepare new data instances once your solution is live
  - To make it easy to treat your preparation choices as hyperparameters

1. Data cleaning:
   - Fix or remove outliers (optional).
   - Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).
2. Feature selection (optional):
   - Drop the attributes that provide no useful information for the task.
3. Feature engineering, where appropriate:
   - Discretize continuous features.
   - Decompose features (e.g., categorical, date/time, etc.).
   - Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
   - Aggregate features into promising new features.
4. Feature scaling: standardize or normalize features.
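
The advice to write a function for every transformation might look like the following sketch. It assumes pandas and a hypothetical helper named `prepare`; the imputation strategy is exposed as a parameter so that it can later be treated as a hyperparameter. The column names and values are made up.

```python
import numpy as np
import pandas as pd


def prepare(df, impute="median", log_columns=()):
    """Return a cleaned copy of df; the original DataFrame is left intact."""
    out = df.copy()
    numeric = out.select_dtypes(include="number").columns
    # Data cleaning: fill in missing values with the chosen statistic.
    if impute == "median":
        fill = out[numeric].median()
    elif impute == "mean":
        fill = out[numeric].mean()
    elif impute == "zero":
        fill = 0
    else:
        raise ValueError(f"unknown impute strategy: {impute!r}")
    out[numeric] = out[numeric].fillna(fill)
    # Feature engineering: log-transform the requested columns.
    for col in log_columns:
        out[col] = np.log1p(out[col])
    # Feature scaling: standardize to zero mean and unit variance.
    out[numeric] = (out[numeric] - out[numeric].mean()) / out[numeric].std(ddof=0)
    return out


# Hypothetical toy data with one missing value.
raw = pd.DataFrame({"rooms": [2.0, 4.0, np.nan, 6.0],
                    "income": [1.0, 2.0, 3.0, 10.0]})
clean = prepare(raw, impute="median", log_columns=["income"])
print(clean)
```

Because `prepare` works on a copy, the same call can be reused on the test set and on new instances once the system is live.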

## Short-List Promising Models
Notes:
- If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests).
- Once again, try to automate these steps as much as possible.

1. Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
2. Measure and compare their performance.
   - For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds.
3. Analyze the most significant variables for each algorithm.
4. Analyze the types of errors the models make.
   - What data would a human have used to avoid these errors?
5. Have a quick round of feature selection and engineering.
6. Have one or two more quick iterations of the five previous steps.
7. Short-list the top three to five most promising models, preferring models that make different types of errors.

## Fine-Tune the System
Notes:
- You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning.
- As always, automate what you can.

1. Fine-tune the hyperparameters using cross-validation.
   - Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or with the median value? Or just drop the rows?).
   - Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams).
2. Try Ensemble methods. Combining your best models will often perform better than running them individually.
3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.

WARNING: Don’t tweak your model after measuring the generalization error: you would just start overfitting the test set.
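
As a rough illustration of step 1, the sketch below runs a random search over a single hyperparameter and scores each candidate with K-fold cross-validation, reporting the mean and standard deviation of the RMSE across folds. It is a NumPy-only toy that assumes a closed-form ridge regression on synthetic data; the helper names (`ridge_fit`, `cv_rmse`) and the search range are hypothetical. In practice a library such as Scikit-Learn would normally handle this.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data (hypothetical): y is linear in X plus noise.
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression weights: (X^T X + alpha I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def cv_rmse(X, y, alpha, n_folds=5):
    """Return the mean and std of the validation RMSE over n_folds folds."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    scores = []
    for val_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[val_idx] = False  # hold this fold out for validation
        w = ridge_fit(X[train_mask], y[train_mask], alpha)
        errors = X[val_idx] @ w - y[val_idx]
        scores.append(np.sqrt(np.mean(errors ** 2)))
    return np.mean(scores), np.std(scores)


# Random search: sample alpha log-uniformly instead of walking a fixed grid.
candidates = 10 ** rng.uniform(-3, 3, size=20)
results = [(alpha, *cv_rmse(X, y, alpha)) for alpha in candidates]
best_alpha, best_mean, best_std = min(results, key=lambda r: r[1])
print(f"best alpha={best_alpha:.4g}, RMSE={best_mean:.3f} +/- {best_std:.3f}")
```

The test set plays no part in this loop: every score comes from validation folds carved out of the training data.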

## Present Your Solution
1. Document what you have done.
2. Create a nice presentation.
   - Make sure you highlight the big picture first.
3. Explain why your solution achieves the business objective.
4. Don’t forget to present interesting points you noticed along the way.
   - Describe what worked and what did not.
   - List your assumptions and your system’s limitations.
5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).




In [3]:
from __future__ import division, print_function, unicode_literals

#common imports 
import numpy as np
import os

# seed the random generator so the notebook's output is the same on every run
np.random.seed(42)

#plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

#where to save the figures
PROJECT_ROOT_DIR = '.'
CHAPTER_ID = 'end_to_end_practical'
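# save_fig() below reads IMAGES_PATH, so the name must match
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, 'images', CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)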

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

## Get the data

In [4]:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

In [5]:
fetch_housing_data()

In [6]:
import pandas as pd 

def load_housing_data(housing_path = HOUSING_PATH):
    csv_path = os.path.join(housing_path, 'housing.csv')
    return pd.read_csv(csv_path)


housing = load_housing_data()
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [10]:
housing.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')

We have 10 columns: 9 numerical and 1 categorical (ocean_proximity).

In [11]:
housing.shape

(20640, 10)

In [13]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [14]:
housing.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


The dataset contains 20,640 entries (instances).

total_bedrooms has only 20,433 non-null values, so 207 districts are missing this feature. We will need to take care of this later on.


The count, mean, min, and max rows are self-explanatory. Note that the null values are ignored (so, for
example, count of total_bedrooms is 20,433, not 20,640). The std row shows the standard deviation
(which measures how dispersed the values are). The 25%, 50%, and 75% rows show the corresponding
percentiles: a percentile indicates the value below which a given percentage of observations in a group
of observations falls. For example, 25% of the districts have a housing_median_age lower than 18,
while 50% are lower than 29 and 75% are lower than 37. These are often called the 25th percentile (or
1st quartile), the median, and the 75th percentile (or 3rd quartile).
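
To make the percentile rows concrete, this small sketch computes them with NumPy on a made-up list of ages. The values are hypothetical and are not taken from the housing dataset.

```python
import numpy as np

ages = np.array([5, 12, 18, 18, 25, 29, 33, 37, 41, 52])  # hypothetical values

# 25th percentile (1st quartile), median, and 75th percentile (3rd quartile).
q25, q50, q75 = np.percentile(ages, [25, 50, 75])
print(q25, q50, q75)

# Share of observations that fall strictly below the 25th percentile.
print(np.mean(ages < q25))
```

On a sample this small the share below the 25th percentile is only approximately 25%, because of ties and interpolation between neighboring values.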

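### RMSE
RMSE (root mean squared error) measures the standard deviation of the errors the system makes in its predictions.

In other words, the RMSE plays the role of 1 $\sigma$. If the errors are roughly Gaussian and centered on zero, then about:

- 68% of predictions fall within 1 $\sigma$ of the actual value
- 95% fall within 2 $\sigma$
- 99.7% fall within 3 $\sigma$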
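
For $m$ instances with predictions $\hat{y}_i$ and labels $y_i$, the RMSE is $\sqrt{\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2}$. A minimal sketch of that formula with NumPy follows; the function name `rmse` and the sample labels and predictions are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


# Hypothetical labels and predictions (house values in USD).
y_true = [200_000, 150_000, 300_000, 250_000]
y_pred = [210_000, 140_000, 320_000, 250_000]
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, the single 20,000 error weighs more heavily in the result than the two 10,000 errors combined.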

In [21]:
# plot a histogram for each numerical attribute
from ipywidgets import interact

@interact
def plot_housing_hist():
    housing.hist(bins=50, figsize=(20, 15))

pd.options.plotting.backend = 'plotly'


The plots are displayed when the cell is executed. Notice a few things in these histograms:

1. First, the median income attribute does not look like it is expressed in US dollars (USD). After checking with the team that collected the data, you are told that the data has been scaled and capped at 15 (actually 15.0001) for higher median incomes, and at 0.5 (actually 0.4999) for lower median incomes. Working with preprocessed attributes is common in Machine Learning, and it is not necessarily a problem, but you should try to understand how the data was computed.
2. The housing median age and the median house value were also capped. The latter may be a serious problem since it is your target attribute (your labels). Your Machine Learning algorithms may learn that prices never go beyond that limit. You need to check with your client team (the team that will use your system’s output) to see if this is a problem or not. If they tell you that they need precise predictions even beyond $500,000, then you have mainly two options:

    <ul>
        <li>Collect proper labels for the districts whose labels were capped.</li>
        <li>Remove those districts from the training set (and also from the test set, since your system should not be evaluated poorly if it predicts values beyond $500,000).</li>
    </ul>

3. These attributes have very different scales. We will discuss this later in this chapter when we explore feature scaling.
4. Finally, many histograms are tail heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will try transforming these attributes later on to have more bell-shaped distributions.
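
Two of these observations, capping and heavy tails, can also be checked numerically. The sketch below uses synthetic data (a log-normal sample clipped at a hypothetical cap), not the housing dataset: it counts how many values sit exactly at the cap and compares the skewness before and after a log transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cap = 500_001.0  # hypothetical cap, mirroring the capped target described above
values = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=10_000)).clip(upper=cap)

# A spike at the maximum value is a sign that the attribute was capped.
n_capped = int((values == cap).sum())
print("capped share:", n_capped / len(values))

# Tail-heavy data has positive skewness; a log transform shrinks its magnitude.
skew_before = values.skew()
skew_after = np.log(values).skew()
print("skew before:", skew_before)
print("skew after: ", skew_after)
```

A skewness close to zero after the transform suggests a more bell-shaped distribution, though the spike left by the cap remains either way.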


# Create a test set


In [23]:
import numpy as np


#similar to train_test_split(dataset, ...)
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data)*test_ratio)
    test_indices=shuffled_indices[:test_set_size]
    train_indices=shuffled_indices[test_set_size:]
    return data.iloc[train_indices],data.iloc[test_indices]




In [24]:
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set)," train +",len(test_set), " test")

16512  train + 4128  test
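
One caveat with `split_train_test` is that it depends on NumPy's global random state, so the split only repeats if the seed is set beforehand. A possible variant, sketched here under the assumption that a local generator is acceptable, takes an explicit `seed` argument. Both the `seed` parameter and the name `split_train_test_seeded` are hypothetical additions, and the toy DataFrame merely stands in for `housing`.

```python
import numpy as np
import pandas as pd


def split_train_test_seeded(data, test_ratio, seed=42):
    """Shuffle row positions with a local generator, then slice off the test set."""
    rng = np.random.default_rng(seed)
    shuffled_indices = rng.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]


# Hypothetical stand-in for the housing DataFrame.
toy = pd.DataFrame({"x": np.arange(100)})
train_a, test_a = split_train_test_seeded(toy, 0.2)
train_b, test_b = split_train_test_seeded(toy, 0.2)
print(len(train_a), "train +", len(test_a), "test")  # 80 train + 20 test
print(test_a.index.equals(test_b.index))             # True: same seed, same split
```

Note that even a seeded split changes once rows are added to or removed from the dataset, since the permutation depends on the number of rows.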


# now continue
