# Chapter 2, End-to-End Machine Learning Project

The main steps of any ML project are as follows:
1. Look at the big picture
2. Get the data
3. Discover and visualize the data to gain insights
4. Prepare the data for ML algorithms
5. Select a model and train it
6. Fine-tune the model
7. Present your solution
8. Launch, monitor, and maintain your system

For a worked example that predicts house prices, see: https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb
## Look at the Big Picture
### Frame the Problem
Pipeline - a sequence of data processing components (possibly several ML algorithms), where the output of each component is fed into the next
<br>Multiple Regression - using multiple features to make a prediction
<br>Univariate Regression - predicting a single value
<br>Multivariate Regression - predicting multiple values
### Select a Performance Measure
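Root Mean Square Error (RMSE) - common performance measure for regression; gives more weight to large errors
<br>Mean Absolute Error (MAE) - performance measure that is better suited to data with many outliers
<br>Both measure the distance between two vectors: the vector of predictions and the vector of target values. RMSE corresponds to the Euclidean (ℓ2) norm and MAE to the Manhattan (ℓ1) norm.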
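
As a minimal sketch of the two measures, assuming NumPy and made-up labels and predictions (the helper names `rmse` and `mae` are illustrative, not from the chapter):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: square the errors, average, take the root."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute errors."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(errors)))

# Made-up values; the last prediction is off by 100, an outlier-sized error.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 500.0]

print("RMSE:", rmse(y_true, y_pred))
print("MAE: ", mae(y_true, y_pred))
```

The single large error pulls RMSE well above MAE, which is why MAE is the gentler choice when outliers are common.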
## Get the Data
### Download the Data
Consider writing a function that automatically downloads the latest version of the data.
### Take a Quick Look at Data Structure
Inspect the data in pandas tables, e.g. with `head()`, `info()`, and `describe()`.
<br>Plot histograms of the numerical features to see how their values are distributed.
### Create a Test Set
Set aside ~20% of the data as a test set.
<br>Ensure that the SAME ~80% of the data is used for training on every run (e.g. by fixing the random seed or splitting on a stable identifier); never let test data leak into the training set.
<br>Stratified Sampling - divide the population into homogeneous subgroups (strata) and sample from each subgroup in proportion to its size
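
A stratified 80/20 split might look like the following sketch. It assumes scikit-learn's `train_test_split` with its `stratify` argument; the data, the `median_income` column, and the bin edges are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical data set: one numerical feature, binned into five strata.
data = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1000)})
data["income_cat"] = pd.cut(
    data["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
).astype(int)

# A fixed random_state yields the same split on every run.
train_set, test_set = train_test_split(
    data, test_size=0.2, stratify=data["income_cat"], random_state=42
)

# Compare the share of each stratum overall and in the test set.
overall = data["income_cat"].value_counts(normalize=True).sort_index()
in_test = test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "test": in_test}))
```

The two printed columns should be nearly identical, which is the point of stratifying.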
## Discover and Visualize the Data to Gain Insights
### Visualizing Geographic Data
Different kinds of plots can clearly display important characteristics of data.
### Looking for Correlations
Standard Correlation Coefficient (Pearson's r) - value in [-1, 1] that describes how strong the positive/negative linear correlation is; it has nothing to do with slope
<br>`pandas.plotting.scatter_matrix` plots every numerical attribute against every other numerical attribute.
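
To illustrate that the coefficient is independent of slope, here is a small sketch using `numpy.corrcoef` on synthetic straight lines (the helper `pearson_r` is a name chosen for this example):

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

x = np.linspace(0.0, 10.0, 50)
steep = 100.0 * x + 3.0    # steep rising line
shallow = 0.01 * x - 7.0   # almost flat rising line
falling = -2.0 * x + 1.0   # falling line

print(pearson_r(x, steep))    # approximately 1
print(pearson_r(x, shallow))  # approximately 1, despite the tiny slope
print(pearson_r(x, falling))  # approximately -1
```

Both rising lines score the same even though their slopes differ by four orders of magnitude; only the direction and tightness of the linear relationship matter.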
### Experimenting with Attribute Combinations
Attributes can be combined in thoughtful ways to produce additional meaningful attributes.
<br>For example, combine household income and people per household to understand how income is split up per person.
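
The income example could be sketched in pandas like this; the column names and figures are hypothetical:

```python
import pandas as pd

# Hypothetical rows; column names are assumptions made for this example.
districts = pd.DataFrame({
    "household_income": [60000.0, 90000.0, 45000.0],
    "people_per_household": [2.0, 4.5, 1.5],
})

# Derived attribute: income divided by household size.
districts["income_per_person"] = (
    districts["household_income"] / districts["people_per_household"]
)
print(districts)
```

A useful follow-up is to check how strongly the new attribute correlates with the target compared with the attributes it was built from.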
## Prepare the Data for Machine Learning Algorithms
### Data Cleaning
Data with missing feature values can be fixed in three ways:
1. Get rid of the data elements (rows) missing the feature
2. Get rid of the entire attribute
3. Fill in missing values with a default, e.g. the mean
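
The three options above can be sketched with pandas on a tiny made-up table (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rooms": [3.0, 4.0, 5.0, 6.0],
    "bedrooms": [1.0, np.nan, 2.0, 3.0],
})

# Option 1: drop the rows that are missing the feature.
option1 = df.dropna(subset=["bedrooms"])

# Option 2: drop the whole attribute.
option2 = df.drop(columns=["bedrooms"])

# Option 3: fill the gaps with a default value, here the column mean.
fill_value = df["bedrooms"].mean()
option3 = df.fillna({"bedrooms": fill_value})

print(option1, option2, option3, sep="\n\n")
```

With option 3, compute the fill value on the training set and reuse the same value for the test set and for new data.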

### Handling Text and Categorical Attributes
ML algorithms work best with numerical values, not strings.
<br>One-Hot Encoding - creating one binary attribute per category, with a value of 1 (hot) or 0 (cold) indicating whether a piece of data belongs to that category
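
A short sketch with scikit-learn's `OneHotEncoder`; the category strings are sample values chosen for illustration:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with three distinct values.
categories = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it for display.
one_hot = encoder.fit_transform(categories).toarray()

print(encoder.categories_[0])  # column order of the encoded matrix
print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to its category.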
### Custom Transformers
Create a custom transformer by implementing these three methods, so that it works through duck typing with the other Scikit-Learn transformers: `fit`, `transform`, `fit_transform`
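
A minimal sketch of such a transformer follows. `RatioAdder` is a hypothetical class invented for this example; inheriting from scikit-learn's `TransformerMixin` supplies `fit_transform` automatically, and `BaseEstimator` supplies `get_params`/`set_params`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that appends one column divided by another."""

    def __init__(self, numerator=0, denominator=1):
        self.numerator = numerator
        self.denominator = denominator

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return self.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, self.numerator] / X[:, self.denominator]
        return np.c_[X, ratio]

X = np.array([[6.0, 2.0], [9.0, 3.0], [10.0, 4.0]])
X_extended = RatioAdder().fit_transform(X)
print(X_extended)
```

Exposing the column indices as constructor arguments turns them into hyperparameters that a later search can vary.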
### Feature Scaling
Feature Scaling - putting numerical features on a similar scale; many ML algorithms perform poorly when feature scales differ widely
<br>Min-Max Scaling - subtract the min value, then divide by (max - min); results in values in [0, 1]
<br>Standardization - subtract the mean, then divide by the standard deviation; results in values that describe how many standard deviations each value lies from the mean
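
Both formulas can be written directly in NumPy, as in this sketch with made-up values:

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: (x - min) / (max - min)
min_max_scaled = (values - values.min()) / (values.max() - values.min())

# Standardization: (x - mean) / standard deviation
standardized = (values - values.mean()) / values.std()

print(min_max_scaled)
print(standardized)
```

In practice scikit-learn's `MinMaxScaler` and `StandardScaler` apply the same formulas per column, and they should be fitted on the training data only.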
### Transformation Pipelines
Use Scikit-Learn's `Pipeline` to chain transformers so that they are applied to the features in sequence.
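
As an illustration, the sketch below chains an imputer and a scaler. The step names and the choice of median imputation are assumptions made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, transformer) pair; steps run in the order listed.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_prepared = num_pipeline.fit_transform(X)
print(X_prepared)
```

Calling `fit_transform` on the pipeline fits and applies every step in turn, so the missing value is filled before scaling happens.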
## Select and Train a Model
### Training and Evaluating on the Training Set
A LinearRegression fit may badly underfit the data and a DecisionTreeRegressor may badly overfit the data.
### Better Evaluation Using Cross-Validation
Cross-validation can reveal serious flaws in models, such as overfitting that a score on the training set hides.
<br>In the end, RandomForestRegressor works best on this problem.
<br>Try out several types of models before spending time tuning hyperparameters.
<br>Save models with joblib.
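
The sketch below, on synthetic data, contrasts the training error of a decision tree with its cross-validated error and then saves the model with joblib. The data, the 10-fold setting, and the file name are assumptions made for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)  # noisy target

# Error on the data the tree was trained on.
tree = DecisionTreeRegressor(random_state=0).fit(X, y)
train_rmse = float(np.sqrt(np.mean((tree.predict(X) - y) ** 2)))

# scikit-learn scorers are "greater is better", hence the negated MSE.
scores = cross_val_score(
    DecisionTreeRegressor(random_state=0), X, y,
    scoring="neg_mean_squared_error", cv=10,
)
cv_rmse = np.sqrt(-scores)

print("training RMSE:", train_rmse)
print("cross-validation RMSE: mean", cv_rmse.mean(), "std", cv_rmse.std())

# Save the fitted model and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tree_model.pkl")
    joblib.dump(tree, path)
    reloaded = joblib.load(path)
```

The unconstrained tree memorizes the training set, so its training error says little; the cross-validated error is the number to compare across models.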
## Fine-Tune Your Model
### Grid Search
Scikit's GridSearchCV can train models with every combination of the given hyperparameter values, evaluate each with cross-validation, and select the best one.
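
A small sketch of a grid search on synthetic data; the grid values, the 3-fold setting, and the data are chosen only to keep the example fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

# 2 x 2 = 4 combinations, each evaluated with 3-fold cross-validation.
param_grid = {
    "n_estimators": [10, 30],
    "max_features": [2, 4],
}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

print(grid_search.best_params_)
```

After fitting, `grid_search.best_estimator_` holds the winning model, retrained on all of the data passed to `fit`.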
### Randomized Search
When thousands of hyperparameter combinations are possible, Scikit's RandomizedSearchCV evaluates a fixed number of randomly sampled combinations instead of trying them all.
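
A comparable sketch for a randomized search, assuming SciPy's `randint` distributions and synthetic data; the ranges and the number of iterations are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

# Distributions to sample from; randint(low, high) excludes high.
param_distributions = {
    "n_estimators": randint(5, 50),
    "max_features": randint(1, 5),
}
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,  # number of sampled combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
random_search.fit(X, y)

print(random_search.best_params_)
```

The `n_iter` argument caps the compute budget directly, no matter how large the search space is.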
### Ensemble Methods
Ensemble Learning - combining several models so that the group often performs better than the best individual model.
### Analyze the Best Models and Their Errors
Some models, such as RandomForestRegressor, expose the relative importance of each feature (`feature_importances_`), which can be sorted into an ordered list.
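
For instance, on synthetic data where one feature dominates the target, the ordered list could be produced as in this sketch (the feature names are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["signal", "weak_signal", "noise"]
X = rng.uniform(size=(300, 3))
# The target depends strongly on column 0, weakly on column 1, not on column 2.
y = 10.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its name and sort from most to least important.
ranked = sorted(zip(forest.feature_importances_, feature_names), reverse=True)
for importance, name in ranked:
    print(f"{name}: {importance:.3f}")
```

Features that land at the bottom of the list are candidates for removal.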
### Evaluate Your System on the Test Set
Do not tweak the model based upon the test set.
## Launch, Monitor, and Maintain Your System
An ML application must be automatically monitored for performance.
<br>Models decay with time, regardless of the algorithm.
<br>A good product must be able to be easily updated.
<br>Several tasks can be automated: 
1. Collecting and labeling fresh data
2. Training a new model and fine-tuning its hyperparameters
3. Evaluating the new model and deploying it if it is more effective than the old one