# End-to-End Machine Learning Project

- In this chapter you will work on a machine learning project end-to-end, pretending to be a recently hired data scientist at a real estate company.
- Here are the main steps you will go through:
    1. Frame the problem & look at the big picture
        - [ ] Define the objective in business terms
        - [ ] How will your solution be used?
        - [ ] What are the current solutions/workarounds?
        - [ ] How should you frame this problem?
            - Supervised/Unsupervised
            - Batch/Online
            - Instance-based/Model-based
        - [ ] How should performance be measured?
        - [ ] Is the performance measure aligned with the business objective?
        - [ ] What would be the minimum performance needed to reach the business objective?
        - [ ] Are there comparable problems? Can you reuse previous methods or tools?
        - [ ] Is human expertise available?
        - [ ] How would you solve the problem manually?
        - [ ] List the assumptions you or others have made so far
        - [ ] Verify assumptions if possible
    2. Get the data
        - [ ] List the data you need and how much you need
        - [ ] Find and document where you can get that data
        - [ ] Check how much space it will take
        - [ ] Check legal obligations & get authorization if necessary
        - [ ] Get access authorization
        - [ ] Create a workspace with enough storage space
        - [ ] Get the data
        - [ ] Convert the data into a format you can easily manipulate
        - [ ] Ensure sensitive information is either deleted or protected
        - [ ] Check the size & type of data
        - [ ] Sample a test set, put it aside, and never look at it
    3. Explore the data
        - [ ] Create a copy of the data for exploration, sample it down if necessary
        - [ ] Create a Jupyter notebook to keep records on your data exploration
        - [ ] Study each attribute and its characteristics
            - Name
            - Type (categorical, continuous, int/float, structured/unstructured, text, ...)
            - % of missing values
            - Noisiness and type of noise (stochastic, rounding errors, ...)
            - Usefulness for the task
            - Type of distribution (Gaussian, logarithmic, uniform, ...)
        - [ ] For supervised learning tasks, identify the target attribute
        - [ ] Visualize the data
        - [ ] Study the correlations between attributes
        - [ ] Study how you would solve the problem manually
        - [ ] Identify the promising transformations you may want to apply
        - [ ] Identify extra data that would be useful
        - [ ] Document what you have learned
    4. Prepare the data for machine learning algorithms
        - [ ] Create a copy of the data
        - [ ] Write functions for all the data transformations you want to apply, so that:
            - You can easily prepare the data the next time you get a fresh dataset
            - You can apply these transformations in future projects
            - You can clean and prepare the test set
            - You can clean and prepare new data instances in production
            - You can treat cleaning/processing steps as hyper-parameters
        - [ ] Data Cleaning
            - Fix or remove outliers (optional)
            - Fill in missing values (with 0, mean, median, inference, ...) or drop their rows/columns
        - [ ] Feature Selection
            - Drop the attributes that provide no useful information for the task
        - [ ] Feature Engineering
            - Discretize continuous features
            - Decompose features (Categorical, datetime, ...)
            - Add promising feature transformations ($\log(x)$, $\sqrt{x}$, $x^{2}$, ...)
            - Aggregate features into promising new features
        - [ ] Feature scaling
            - Standardize or normalize features
    5. Shortlist promising models
        - [ ] If the data set is big, sample smaller datasets for experimentation
        - [ ] Try many models from different categories (naive Bayes, linear regression, random forests, neural networks, ...) using standard parameters.
        - [ ] Measure and compare their performance
            - For each model, use N-fold cross-validation and record the mean and standard deviation of the performance measure across the N folds.
        - [ ] Analyze the most significant variables for each algorithm
        - [ ] Analyze the types of errors the models make
            - What data would a human use to avoid these errors?
        - [ ] Perform a quick round of feature selection and engineering
        - [ ] Perform one or two more quick iterations of the previous steps
        - [ ] Shortlist the top three to five most promising models, preferring models that make different types of errors
    6. Fine-tune your models & combine them into a great solution
        - [ ] Use as much of the training data as possible for this step (not the smaller samples used for quick experiments)
        - [ ] Fine-tune hyper-parameters using cross-validation
            - Treat your data transformation choices as hyper-parameters
            - Unless there are very few hyper-parameters to explore, prefer random search to grid search.
                - If training takes a long time, you may prefer Bayesian Optimization.
        - [ ] Try ensemble methods. Combining your best models will often produce better results than running them individually.
        - [ ] Once you are confident about your model, measure its performance on the test set to estimate its generalization error.
    7. Present your solution
        - [ ] Document what you have done
        - [ ] Create a nice presentation
        - [ ] Explain why your solution achieves the business objective
        - [ ] Showcase interesting things you noticed along the way
        - [ ] Ensure your key findings are communicated through clear visualizations and easy-to-remember one-line statements
    8. Launch, monitor, and maintain your system
        - [ ] Get your solution ready for production
        - [ ] Write monitoring code that checks your system's live performance at regular intervals and triggers alerts when it drops.
            - Beware of slow degradation: models tend to rot as data evolves
            - Also monitor the quality of your inputs
        - [ ] Retrain your model on a regular basis on fresh data
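
Two of the checklist items above, setting aside a test set (step 2) and recording the mean and standard deviation of N-fold cross-validation scores (step 5), can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: it assumes each row has a stable identifier, and the helper names (`is_in_test_set`, `k_fold_indices`, `cross_validate`) as well as the `fit` and `score` callables are hypothetical placeholders. In practice a library such as Scikit-Learn offers equivalents.

```python
import random
import statistics
import zlib


def is_in_test_set(identifier, test_ratio):
    """Decide test-set membership from a stable hash of the row identifier."""
    # crc32 returns an unsigned 32-bit integer, so roughly `test_ratio` of all
    # identifiers fall below the threshold, and a given identifier always
    # lands on the same side, even after the dataset is refreshed.
    return zlib.crc32(str(identifier).encode("utf-8")) < test_ratio * 2**32


def k_fold_indices(n_rows, n_folds, seed=42):
    """Yield (train_indices, validation_indices) for each of the N folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into N folds of (almost) equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(rows, targets, fit, score, n_folds=5):
    """Return (mean, standard deviation) of the per-fold validation scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(rows), n_folds):
        # Train on N-1 folds, then evaluate on the held-out fold.
        model = fit([rows[i] for i in train_idx], [targets[i] for i in train_idx])
        scores.append(
            score(model, [rows[i] for i in val_idx], [targets[i] for i in val_idx])
        )
    return statistics.mean(scores), statistics.stdev(scores)
```

Here `fit(rows, targets)` is expected to return a trained model and `score(model, rows, targets)` a single number, so the same loop can compare any model and metric. Hashing the identifier, rather than shuffling with a random seed, keeps previously held-out rows out of the training set when new rows are appended later.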

## Working with Real Data

- When you are learning about ML, it's best to work with real datasets, not artificial ones.
- Popular open data repositories:
    - [UC Irvine ML repo](https://archive.ics.uci.edu/ml/index.php)
    - [Kaggle Datasets](https://www.kaggle.com/datasets)
    - [Amazon AWS Datasets](https://registry.opendata.aws/)
- Meta portals (they list open data repositories):
    - [Data Portals](http://dataportals.org/)
    - [OpenDataMonitor](https://opendatamonitor.eu/frontend/web/index.php?r=dashboard%2Findex)
    - [Quandl](https://www.quandl.com/)
- Other pages listing many open data repositories:
    - [Wikipedia](https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research)
    - [Quora](https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public)
    - [The Datasets Subreddit](https://www.reddit.com/r/datasets)
- In this chapter we'll use the California housing prices dataset, taken from the StatLib repository.
- The dataset is based on data from the 1990 California census.
- For teaching purposes, a categorical feature has been added and a few features have been removed.

## 1. Look at the big picture

- Your first task is to use the California census data to build a model of the housing prices in the state.
- This data includes metrics such as:
    - Population
    - Median Income
    - Median housing price for each block group in California
        - A block group is the smallest geographical unit for which census data is published
            - A block group typically has a population of 600 to 3,000 people.
        - We will call them "districts" for short.
- Your model should be able to predict the median housing price for any district, given the other features.
- Frame the problem
    - ...
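
The prediction task described above (median housing price per district, given the other features) amounts to separating one target value from the remaining attributes of each district. A minimal sketch, assuming hypothetical field names and made-up numbers that do not come from the actual dataset:

```python
from dataclasses import asdict, dataclass


@dataclass
class District:
    """One block group ("district"); the field names are hypothetical."""
    population: float
    median_income: float
    median_house_value: float  # the value the model should predict


TARGET = "median_house_value"


def split_features_and_target(district):
    """Separate the model inputs from the value to be predicted."""
    record = asdict(district)
    target = record.pop(TARGET)
    return record, target


# Made-up numbers, for illustration only.
example = District(population=1200.0, median_income=3.5, median_house_value=180000.0)
features, target = split_features_and_target(example)
```

After the split, `features` holds only the inputs the model is allowed to see, and `target` holds the value it is asked to predict.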