# E2E ML Project Example:

* Main Steps:
1. Look at the big picture.
2. Get the data.
3. Explore and visualize the data to gain insights.
4. Prepare the data for machine learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.

# A. Working w/ Real Data:

* When you are learning about machine learning, it is best to experiment with real-world data, not artificial datasets.

* Popular Open Data Repositories:
    * [OpenML.org](https://openml.org/)
    * [Kaggle.com](https://www.kaggle.com/datasets)
    * [PapersWithCode.com](https://paperswithcode.com/datasets)
    * [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/)
    * [Amazon’s AWS datasets](https://registry.opendata.aws/)
    * [TensorFlow datasets](https://www.tensorflow.org/datasets)

* Meta Portals (List Open Data Repositories):
    * [DataPortals.org](https://dataportals.org/)
    * [OpenDataMonitor.eu](https://opendatamonitor.eu/frontend/web/index.php?r=dashboard%2Findex)
    * [Wikipedia’s list of machine learning datasets](https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research)
    * [Quora.com](https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public)
    * [The datasets subreddit](https://www.reddit.com/r/datasets/)

# B. Look at the Big Picture:

* [Machine Learning Project Checklist](./ML%20Project%20Checklist.ipynb)

### B.1. Frame the Problem

* What is the business objective?
    * How does the company expect to use and benefit from this model?
    * Knowing the objective is important because it will determine how you frame the problem, which algorithms you will select, which performance measure you will use to evaluate your model, and how much effort you will spend tweaking it.

* What does the current solution look like (if any)?
    * The current situation will often give you a reference for performance, as well as insights on how to solve the problem.

* Determine what kind of training supervision the model will need:
    * Supervised learning, since the dataset has labels (each district's median housing price).
    * Multiple regression problem, since the system will use multiple features to make a prediction (the district’s population, the median income, etc.).
    * Univariate regression problem, since we are only trying to predict a single value for each district.
    * There is no continuous flow of data coming into the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.
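
To make the framing above concrete, here is a minimal sketch of what "multiple, univariate regression with batch learning" looks like in terms of data shapes. Everything in it is an assumption for illustration: the toy feature values are made up, and ordinary least squares via NumPy is just one convenient stand-in for a model, not the model these notes settle on.

```python
import numpy as np

# Hypothetical toy data: 5 districts (rows), 2 features each (columns).
# Multiple regression: each prediction uses several features.
X = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 1.0],
    [3.0, 2.0],
])

# Univariate regression: exactly one target value per district.
# The labels are constructed as y = 1 + 2*x1 + 3*x2, so the fit is exact.
y = np.array([3.0, 4.0, 6.0, 8.0, 13.0])

# Batch learning: the whole dataset fits in memory and is used in a single fit.
X_b = np.c_[np.ones(len(X)), X]                  # prepend a bias column of ones
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)  # ordinary least squares

y_pred = X_b @ theta  # one predicted value per district

print("X shape:", X.shape)            # (instances, features)
print("y shape:", y.shape)            # (instances,)
print("theta:", np.round(theta, 6))   # bias followed by one weight per feature
```

With labels present, several features per row, and a single target per row, the problem is supervised, multiple, and univariate. Fitting once on the full array is what batch learning amounts to here.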

### B.2. Select a Performance Measure

* A typical performance measure for regression problems is the root mean square error (RMSE).
* It gives an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors.
* $RMSE(X,h)\ =\ \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)})-y^{(i)}\right)^2}$, where:
    * $m$ = number of instances in the dataset you are measuring the RMSE on.
    * $x^{(i)}$ = vector of all the feature values (excluding the label) of the $i^{\text{th}}$ instance in the dataset. The superscript $(i)$ is an index, not an exponent.
    * $y^{(i)}$ = label of $x^{(i)}$, the desired output value for that instance.
    * $X$ = matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the $i^{\text{th}}$ row is the transpose of $x^{(i)}$, written $(x^{(i)})^T$.
    * $h$ = prediction function, also called a **hypothesis**.
        * When your system is given an instance’s feature vector $x^{(i)}$, it outputs a predicted value $\hat{y}^{(i)}\ =\ h(x^{(i)})$.
        * The prediction error is then equal to $\hat{y}^{(i)}-y^{(i)}$.
    * $RMSE(X,h)$ = the cost function measured on the set of examples using your hypothesis $h$.
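
The RMSE definition above translates almost directly into NumPy. The sketch below is only an illustration: the function name `rmse`, the toy data, and the hypothesis `h_sum` (which simply adds up each row's features) are all made up for this example and are not part of the housing project.

```python
import numpy as np

def rmse(X, y, h):
    """Root mean square error of hypothesis h on feature matrix X with labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    predictions = h(X)        # one predicted value per row (instance) of X
    errors = predictions - y  # prediction error for each instance
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical toy data: m = 4 instances, 2 features each.
X_toy = np.array([[1.0, 2.0],
                  [2.0, 0.0],
                  [3.0, 1.0],
                  [4.0, 3.0]])
y_toy = np.array([3.0, 2.0, 4.0, 9.0])

def h_sum(X):
    """Hypothetical hypothesis: predict the sum of each instance's features."""
    return X.sum(axis=1)

toy_rmse = rmse(X_toy, y_toy, h_sum)
print(toy_rmse)  # predictions are [3, 2, 4, 7], so only the last one is off (by 2)
```

Here three predictions are exact and one misses by 2, so the squared errors are 0, 0, 0, and 4. Their mean is 1, and the RMSE is 1.0. That single large miss accounts for the whole score, which is the "higher weight given to large errors" behavior described above.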