# Hands-On Machine Learning by Géron

Part I. The Fundamentals of Machine Learning 

# 1. The Machine Learning Landscape

This chapter takes a look at the map of Machine Learning to learn about its main regions and most notable landmarks:
- supervised versus unsupervised learning
- online versus batch learning
- instance-based versus model-based learning

Then we will look at the workflow of a typical ML project, discuss the main challenges you may face, and cover how to evaluate and fine-tune a Machine Learning system. 

## What is Machine Learning? 

Machine Learning is the science (and art) of programming computers so they can *learn from data*.

## Why Use Machine Learning? 

Machine Learning can help humans learn. ML algorithms can be inspected to see what they have learned (although for some algorithms this can be tricky). 
Applying ML techniques to dig into large amounts of data can help discover patterns that were not immediately apparent. This is called *data mining*. 

To summarize, Machine Learning is great for: 
- Problems for which existing solutions require a lot of fine-tuning or long lists of rules: a single Machine Learning algorithm can often simplify the code and perform better than the traditional approach.

- Complex problems for which using a traditional approach yields no good solution: the best Machine Learning techniques can perhaps find a solution. 

- Fluctuating environments: a Machine Learning system can adapt to new data. 

- Getting insights about complex problems and large amounts of data. 

## Types of Machine Learning Systems


### Supervised/Unsupervised Learning 

Machine Learning systems can be classified according to the amount and type of supervision they get during training. 
There are four major categories: 
1. supervised learning,
2. unsupervised learning,
3. semisupervised learning,
4. and reinforcement learning.

#### Supervised Learning

In *supervised learning*, the training set you feed to the algorithm includes the desired solutions, called *labels*. 

A typical supervised learning task is *classification*: the system is trained on a set of examples along with their *class*, and it must learn how to classify new examples.

Another typical task is to predict a *target* numeric value given a set of *features* called *predictors*. This sort of task is called *regression*. To train the system, you need to give it many examples of instances including both their predictors and their labels. 

Some regression algorithms can be used for classification as well, and vice versa. For example, *Logistic Regression* is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class. 
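
To make the Logistic Regression remark concrete, here is a minimal sketch in plain Python. It assumes a single feature, made-up toy data, and plain batch gradient descent; the function names (`train_logistic_regression`, `predict_proba`, `predict_class`) are illustrative and not taken from any library.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(xs, ys, learning_rate=0.3, n_epochs=3000):
    """Fit a weight w and a bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up toy data: one feature, label 1 for the larger values.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic_regression(xs, ys)

def predict_proba(x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(w * x + b)

def predict_class(x):
    """Turn the probability into a class with a 0.5 threshold."""
    return int(predict_proba(x) >= 0.5)
```

The model itself outputs a number between 0 and 1 (a regression-style output), and the classification only appears once a threshold is applied to it.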

> Note: In machine learning an *attribute* is a data type, while a *feature* has several meanings, depending on the context, but generally means an attribute plus its value. 
Many people use the words *attribute* and *feature* interchangeably. 

Some of the most important supervised learning algorithms: 
- k-nearest neighbors
- linear regression
- logistic regression 
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests 
- Neural Networks 

Some neural network architectures can be unsupervised. They can also be semisupervised, for example when they rely on unsupervised pretraining.


#### Unsupervised learning 

In *unsupervised learning* the training data is unlabeled. The system tries to learn without a teacher. 

Some of the most important unsupervised learning algorithms (Chapters 8 and 9):
* Clustering 
    - K-Means
    - DBSCAN
    - Hierarchical Cluster Analysis (HCA)
* Anomaly detection and novelty detection
    - One-class SVM
    - Isolation Forest
* Visualization and dimensionality reduction 
    - Principal Component Analysis (PCA) 
    - Kernel PCA
    - Locally Linear Embedding (LLE)
    - t-Distributed Stochastic Neighbor Embedding (t-SNE)
* Association rule learning 
    - Apriori
    - Eclat

*Visualization* algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted.
These algorithms try to preserve as much structure as they can so that you can understand how the data is organized and perhaps identify unsuspected patterns. 

A related task is *dimensionality reduction*, in which the goal is to simplify the data without losing too much information.
One way to do this is to merge several correlated features into one. 
This is called *feature extraction*. 
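
As a rough illustration of merging two correlated features into one, the sketch below projects 2D points onto their direction of greatest variance, which is what PCA does in the two-feature case. The data is made up, and the helper names (`fit_principal_axis`, `extract_feature`) are hypothetical.

```python
import math

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def fit_principal_axis(xs, ys):
    """Return the feature means and the unit direction of greatest variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Angle of the leading eigenvector of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cov_xy, variance(xs) - variance(ys))
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))

def extract_feature(xs, ys, means, direction):
    """Project each centered (x, y) pair onto the principal direction."""
    return [
        (x - means[0]) * direction[0] + (y - means[1]) * direction[1]
        for x, y in zip(xs, ys)
    ]

# Made-up data: two strongly correlated features.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

means, direction = fit_principal_axis(xs, ys)
merged = extract_feature(xs, ys, means, direction)

# Fraction of the total variance kept by the single merged feature.
retained = variance(merged) / (variance(xs) + variance(ys))
```

When `retained` is close to 1, the single merged feature carries almost all of the information that the two original features did.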

> Note: It is often a good idea to try to reduce the dimension of your training data using a dimensionality reduction algorithm before you feed it to another Machine Learning algorithm (such as a supervised learning algorithm). It will run much faster, the data will take up less disk and memory space, and in some cases it may also perform better. 

Yet another important unsupervised task is *anomaly detection*. The system is shown mostly normal instances during training, so it learns to recognize them; then, when it sees a new instance, it can tell whether it looks like a normal one or whether it is likely an anomaly. 
A very similar task is *novelty detection*: it aims to detect new instances that look different from all instances in the training set. 
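
The idea of learning what "normal" looks like and then flagging departures from it can be sketched with a simple z-score rule. This is a deliberately basic stand-in, not One-class SVM or Isolation Forest; the class name, the threshold of 3 standard deviations, and the data are all assumptions made for illustration.

```python
import statistics

class ZScoreDetector:
    """Flags values that lie far from the mean of the training data."""

    def fit(self, normal_values):
        # Learn what "normal" looks like from mostly normal instances.
        self.mean = statistics.mean(normal_values)
        self.std = statistics.pstdev(normal_values)
        return self

    def is_anomaly(self, value, threshold=3.0):
        # A value is anomalous if it is more than `threshold`
        # standard deviations away from the training mean.
        return abs(value - self.mean) / self.std > threshold

# Made-up sensor readings, assumed to be mostly normal.
normal_readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
detector = ZScoreDetector().fit(normal_readings)

looks_normal = detector.is_anomaly(10.1)
looks_odd = detector.is_anomaly(12.0)
```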

Finally, another common unsupervised task is *association rule learning*, in which the goal is to dig into large amounts of data and discover interesting relations between attributes. 
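
Association rules are usually scored with two quantities, *support* and *confidence*. The sketch below only computes these two measures on made-up shopping baskets; it is not an implementation of Apriori or Eclat, which add an efficient search over candidate itemsets.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Made-up shopping baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk"},
    {"bread", "butter", "milk"},
]

pair_support = support({"bread", "butter"}, transactions)
butter_implies_bread = confidence({"butter"}, {"bread"}, transactions)
bread_implies_butter = confidence({"bread"}, {"butter"}, transactions)
```

In this toy data every basket with butter also has bread, so the rule "butter implies bread" has higher confidence than the reverse rule.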


#### Semisupervised Learning 
Since labeling data is usually time-consuming and costly, you will often have plenty of unlabeled instances, and few labeled instances. 
Some algorithms can deal with data that's partialy labeled. 
This is called *semisupervised learning*. 

Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms. 
For example, *deep belief networks* (DBNs) are based on unsupervised components called *restricted Boltzmann machines* (RBMs) stacked on top of one another. 
RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques. 

#### Reinforcement Learning 

*Reinforcement Learning* is a very different beast.
The learning system, called an *agent* in this context, can observe the environment, select and perform actions, and get *rewards* in return (or *penalties* in the form of negative rewards). 
It must then learn by itself what the best strategy is, called a *policy*, to get the most reward over time.
A policy defines what action the agent should choose when it is in a given situation.
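
The agent, action, reward, and policy vocabulary can be illustrated with a very small sketch. It assumes a hypothetical environment with a single situation and three actions that each pay a fixed reward, and it uses an epsilon-greedy rule (explore at random 10% of the time), which is one common choice among many and is not prescribed by the text above.

```python
import random

# Hypothetical environment: each action always pays the same reward.
REWARDS = [0.1, 0.5, 0.9]

def run_agent(n_steps=1000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    n_actions = len(REWARDS)
    estimates = [0.0] * n_actions   # agent's running estimate of each reward
    counts = [0] * n_actions        # how often each action was tried
    total_reward = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(n_actions), key=lambda a: estimates[a])
        reward = REWARDS[action]
        counts[action] += 1
        # Incremental average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    # With a single situation, the learned policy is just the best action.
    policy = max(range(n_actions), key=lambda a: estimates[a])
    return policy, estimates, total_reward

policy, estimates, total_reward = run_agent()
```

Nobody tells the agent which action is best; it finds out by trying actions and keeping track of the rewards it receives.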



### Batch and Online Learning 

Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data. 

#### Batch Learning 
In *batch learning*, the system is **incapable of learning incrementally**: it must be trained using all the available data. 
This will generally take a lot of time and computing resources, so it is typically done offline. 
First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called *offline learning*. 

If you want a batch learning system to know about new data, you need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one. 

Fortunately, the whole process of training, evaluating, and launching a Machine Learning system can be automated fairly easily, so even a batch learning system can adapt to change. 
Simply update the data and train a new version of the system from scratch as often as needed. 

This solution is simple and often works fine, but training using the full set of data can take many hours. If your system needs to adapt to rapidly changing data, then you need a more reactive solution.

#### Online Learning
In *online learning*, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called *mini-batches*. 
Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives. 

Online learning is great for systems that receive data as a continuous flow and need to adapt to change rapidly or autonomously. 
It is also a good option if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and "replay" the data).
This can save a huge amount of space. 

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (this is called **out-of-core learning**). 
The algorithm loads part of the data, runs a training step on the data, and repeats the process until it has run on all of the data. 
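
The load, train, repeat loop can be sketched as follows. This is a minimal illustration that assumes a linear model trained by mini-batch gradient descent; the generator `stream_batches` is a hypothetical stand-in for reading chunks from disk and produces synthetic, noise-free data.

```python
def stream_batches(n_instances, batch_size):
    """Yield mini-batches of (x, y) pairs lazily, as if read from disk,
    so the full dataset never has to sit in memory at once."""
    batch = []
    for i in range(n_instances):
        x = (i % 100) / 100.0      # synthetic feature in [0, 1)
        y = 3.0 * x + 2.0          # synthetic target on a known line
        batch.append((x, y))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def partial_fit(w, b, batch, learning_rate):
    """Run one gradient step on the squared error of a single mini-batch."""
    n = len(batch)
    grad_w = sum((w * x + b - y) * x for x, y in batch) / n
    grad_b = sum((w * x + b - y) for x, y in batch) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b

w, b = 0.0, 0.0
for batch in stream_batches(n_instances=100_000, batch_size=50):
    # Each batch is used for one training step and then discarded.
    w, b = partial_fit(w, b, batch, learning_rate=0.5)
```

Only one batch of 50 instances is held in memory at any time, yet the parameters end up close to the line that generated the data.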

> Note: Out-of-core learning is usually done offline (i.e., not on the live system), so *online learning* can be a confusing name. Think of it as **incremental learning**. 

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the **learning rate**. 
If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data.
Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points (outliers). 
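
The trade-off can be seen with a toy online estimator that nudges its current estimate toward each new value by a fraction equal to the learning rate. The update rule and the data stream (a steady signal with one outlier) are assumptions chosen for illustration.

```python
def online_estimate(stream, learning_rate):
    """Track a value online; return the estimate after each instance."""
    estimate = None
    history = []
    for value in stream:
        if estimate is None:
            estimate = value  # initialize with the first instance
        else:
            # Move a fraction `learning_rate` of the way to the new value.
            estimate += learning_rate * (value - estimate)
        history.append(estimate)
    return history

# Made-up stream: a steady signal, one outlier, then the signal again.
stream = [10.0] * 50 + [100.0] + [10.0] * 5

fast = online_estimate(stream, learning_rate=0.9)
slow = online_estimate(stream, learning_rate=0.05)

# Estimates right after the outlier (index 50).
fast_after_outlier = fast[50]
slow_after_outlier = slow[50]
```

With the high learning rate the estimate jumps almost all the way to the outlier, while the low learning rate barely moves, which is the inertia described above.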

A big challenge with online learning is that if bad data is fed to the system, the system's performance will gradually decline.
If it's a live system, your clients will notice. 
To reduce this risk, you need to monitor your system closely and promptly switch learning off (and possibly revert to a previously working state) if you detect a drop in performance. 
You may also want to monitor the input data and react to abnormal data (e.g., using an anomaly detection algorithm). 

### Instance-Based versus Model-Based Learning 
One more way to categorize Machine Learning systems is by how they *generalize*. 
Most Machine Learning tasks are about making predictions. 
This means that given a number of training examples, the system needs to be able to make good predictions for (generalize to) examples it has never seen before.
Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.

There are two main approaches to generalization: instance-based learning and model-based learning. 

#### Instance-based Learning 
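Possibly the most trivial form of learning is simply to learn by heart.
A spam filter built this way would flag only emails that are identical to emails already flagged by users. A better filter would also flag emails that are very similar to known spam emails.
This requires a *measure of similarity* between two emails. A (very basic) similarity measure between two emails could be to count the number of words they have in common.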

This is called *instance-based learning*: the system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples (or a subset of them). 
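
A minimal sketch of this idea, using the word-count similarity described above: the "training" step is just storing labeled emails, and a new email gets the label of the most similar stored one. The emails and labels are made up.

```python
def similarity(email_a, email_b):
    """Number of distinct words the two emails have in common."""
    words_a = set(email_a.lower().split())
    words_b = set(email_b.lower().split())
    return len(words_a & words_b)

# The "learned" examples: the system simply stores them by heart.
training_emails = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

def classify(email):
    """Return the label of the most similar stored email."""
    most_similar = max(training_emails, key=lambda ex: similarity(email, ex[0]))
    return most_similar[1]

label_1 = classify("claim your free prize")
label_2 = classify("agenda for the team meeting")
```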

#### Model-based Learning 
Another way to generalize from a set of examples is to build a model of these examples and then use that model to make *predictions*. This is called *model-based learning*. 


## Example: does money make people happier? 

In summary: 
- You studied the data
- You selected a model
- You trained it on the training data (i.e., the learning algorithm searched for the model parameter values that minimize a cost function). 
- Finally, you applied the model to make predictions on new cases (this is called *inference*), hoping that this model will generalize well. 
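
The steps above can be sketched with a linear model fitted by ordinary least squares, which minimizes the mean squared error cost function. The numbers below are made up for illustration and are not the data from the example; the parameter names `theta0` and `theta1` are also assumptions.

```python
def fit_linear_model(xs, ys):
    """Training: find the intercept and slope that minimize the
    mean squared error on the training data (closed-form solution)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

def cost(theta0, theta1, xs, ys):
    """Mean squared error of the model on the given data."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def predict(theta0, theta1, x):
    """Inference: apply the trained model to a new case."""
    return theta0 + theta1 * x

# Made-up training data that happens to lie exactly on a line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = fit_linear_model(xs, ys)
training_cost = cost(theta0, theta1, xs, ys)
prediction = predict(theta0, theta1, 5.0)
```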

This is what a typical Machine Learning project looks like. 
We have covered a lot of ground so far: you now know what Machine Learning is really about, why it is useful, what some of the most common categories of ML systems are, and what a typical project workflow looks like. 
Now let's look at what can go wrong in learning and prevent you from making accurate predictions. 

## Main Challenges of Machine Learning 

In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are "bad algorithm" and "bad data". 

### Insufficient Quantity of Training Data 
It takes a lot of data for most Machine Learning algorithms to work properly. 


### Nonrepresentative Training Data 
In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. 
This is true whether you use instance-based learning or model-based learning. 

If the sample is too small, you will have *sampling noise* (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called *sampling bias*. 

### Poor-Quality Data 
Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well. 
It is often well worth the effort to spend time cleaning up your training data. 

### Irrelevant Features 
Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. 
A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. 
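This process, called *feature engineering*, involves selecting the most useful existing features to train on, combining existing features to produce more useful ones (feature extraction), and creating new features by gathering new data.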


### Overfitting the Training Data 

*Overfitting* means that the model performs well on the training data, but it does not generalize well. 

Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), then the model is likely to detect patterns in the noise itself. 
Obviously these patterns will not generalize to new instances. 

> Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. Here are possible solutions: 
> * Simplify the model by selecting one with fewer parameters, by reducing the number of attributes in the training data, or by constraining the model. 
> * Gather more training data
> * Reduce the noise in the training data (e.g. fix data errors and remove outliers). 

Constraining a model to make it simpler and reduce the risk of overfitting is called **regularization**. 
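
One way to picture regularization is to penalize large slopes in a linear model. The sketch below assumes an L2 (ridge-style) penalty on the slope only, controlled by a hypothetical hyperparameter `alpha`; the text above does not commit to a particular kind of constraint, and the data is made up.

```python
def fit_line(xs, ys, alpha=0.0):
    """Fit y = intercept + slope * x by minimizing the squared error
    plus alpha * slope**2. With alpha = 0 this is ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)   # the penalty shrinks the slope toward 0
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up noisy training data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.2, 4.7, 7.4, 8.9]

# The same data fitted with increasing amounts of regularization.
slopes = [fit_line(xs, ys, alpha)[1] for alpha in (0.0, 1.0, 10.0, 100.0)]
```

A larger `alpha` gives a flatter, simpler model; a very large value pushes the model toward predicting the mean, which risks underfitting instead.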


### Underfitting the Training Data 
*Underfitting* is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. 
Reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples. 

Main options for fixing this problem: 
* Select a more powerful model, with more parameters. 
* Feed better features to the learning algorithm (feature engineering)
* Reduce the constraints on the model (e.g., reduce the regularization hyperparameter).

### Stepping Back 

The big picture: 
* Machine Learning is about making machines get better at some task by learning from data, instead of having to explicitly code rules. 
* There are many different types of ML systems: supervised or not, batch or online, instance-based or model-based. 
* In an ML project you gather data in a training set, and you feed the training set to a learning algorithm. If the algorithm is model-based, it tunes some parameters to fit the model to the training set (i.e., to make good predictions on the training set itself), and then hopefully it will be able to make good predictions on new cases as well. If the algorithm is instance-based, it just learns the examples by heart and generalizes to new instances by using a similarity measure to compare them to the learned instances. 
* The system will not perform well if your training set is too small, or if the data is not representative, is noisy, or is polluted with irrelevant features (garbage in, garbage out). Lastly, your model needs to be neither too simple (in which case it will underfit) nor too complex (in which case it will overfit).



## Testing and Validating
The only way to know how well a model will generalize to new cases is to actually try it out on new cases. 
One way to do that is to put your model in production and monitor how well it performs. 
This works well, but if your model is horribly bad, your users will complain, so it is not the best idea. 

A better option is to split your data into two sets: 
the **training set** and the **test set**. 
The error rate on new cases is called the *generalization error* (or *out-of-sample error*), and by evaluating your model on the test set, you get an estimate of this error. 
This value tells you how well your model will perform on instances it has never seen before. 

If the training error is low but the generalization error is high, it means that your model is overfitting the training data. 

> It is common to use 80% of the data for training and *hold out* 20% for testing. However, this depends on the size of the dataset: if it contains 10 million instances, then holding out 1% means your test set will contain 100,000 instances, probably more than enough to get a good estimate of the generalization error. 
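
A minimal sketch of such a split, assuming the instances fit in a list and that a random shuffle is appropriate for the data. The function name `split_train_test` and the fixed seed are choices made for this illustration.

```python
import random

def split_train_test(instances, test_ratio=0.2, seed=42):
    """Shuffle the instances and hold out a fraction of them for testing."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))  # stand-in for 100 instances
train_set, test_set = split_train_test(data)
```

The model is then trained on `train_set` only, and its error on `test_set` serves as the estimate of the generalization error.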

### Hyperparameter Tuning and Model Selection 
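Evaluating a model is simple enough: just use a test set. 
Suppose you are hesitating between two types of models. One option is to train both and compare how well they generalize using the test set. 
The problem is that if you measure the generalization error many times on the test set, you adapt the model and its hyperparameters to that particular set, so the chosen model is unlikely to perform as well on new data. 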

**Holdout validation**

You hold out part of the training set to evaluate several candidate models and select the best one. 
The new held-out set is called the **validation set** (or sometimes the **development set**, or **dev set**). 
More specifically, you train multiple models with various hyperparameters on the reduced training set (i.e., the full training set minus the validation set), and you select the model that performs best on the validation set. 
After this holdout validation process, you train the best model on the full training set (including the validation set), and this gives you the final model. Lastly, you evaluate this final model on the test set to get an estimate of the generalization error. 
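
The whole procedure can be sketched end to end. This example assumes a k-nearest-neighbors regressor whose hyperparameter is `k`, synthetic data, and a simple deterministic split; for k-nearest neighbors, "training" just means choosing which instances the model keeps in memory.

```python
def knn_predict(memory, x, k):
    """Predict the average target of the k stored points closest to x."""
    nearest = sorted(memory, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mean_squared_error(memory, k, eval_set):
    """Error of a k-NN model (defined by `memory` and `k`) on eval_set."""
    return sum((knn_predict(memory, x, k) - y) ** 2 for x, y in eval_set) / len(eval_set)

# Synthetic data: a line plus a small alternating perturbation.
data = [(i / 10, 2.0 * (i / 10) + 1.0 + (0.3 if i % 2 else -0.3)) for i in range(60)]

# 1. Set the test set aside; it is not used until the very end.
test_set = data[0::5]
full_train = [p for i, p in enumerate(data) if i % 5 != 0]

# 2. Hold out part of the training set as the validation set.
validation_set = full_train[0::4]
reduced_train = [p for i, p in enumerate(full_train) if i % 4 != 0]

# 3. Train candidates on the reduced training set, compare on validation.
candidate_ks = [1, 2, 4, 8]
validation_errors = {
    k: mean_squared_error(reduced_train, k, validation_set) for k in candidate_ks
}
best_k = min(validation_errors, key=validation_errors.get)

# 4. Retrain the winner on the full training set, then evaluate once on test.
test_error = mean_squared_error(full_train, best_k, test_set)
```

The test set is touched exactly once, after the hyperparameter has been chosen, so `test_error` remains a fair estimate of the generalization error.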

This solution usually works quite well. 
However, if the validation set is too small, then model evaluations will be imprecise: you may end up selecting a suboptimal model by mistake. 
Conversely, if the validation set is too large, then the remaining training set will be much smaller than the full training set. Since the final model will be trained on the full training set, it is not ideal to compare candidate models trained on a much smaller training set. 
One way to solve this problem is to perform repeated **cross-validation**, using many small validation sets. 
Each model is evaluated once per validation set after it is trained on the rest of the data. 
By averaging out all the evaluations of a model, you get a much more accurate measure of its performance. 
There is a drawback, however: the training time is multiplied by the number of validation sets. 
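
A minimal sketch of cross-validation, assuming the common k-fold scheme in which the data is cut into `n_folds` contiguous parts and each part serves once as the validation set. The `fit` and `evaluate` callables are hypothetical placeholders for any model, and the toy model here simply predicts the mean of its training targets.

```python
def k_fold_indices(n_instances, n_folds):
    """Return (train_indices, validation_indices) pairs, one per fold."""
    folds = []
    start = 0
    for fold in range(n_folds):
        # Spread any remainder over the first folds.
        size = n_instances // n_folds + (1 if fold < n_instances % n_folds else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_instances))
        folds.append((train, validation))
        start += size
    return folds

def cross_val_score(fit, evaluate, data, n_folds=5):
    """Average the evaluation of a model over all folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), n_folds):
        model = fit([data[i] for i in train_idx])          # one training run per fold
        scores.append(evaluate(model, [data[i] for i in val_idx]))
    return sum(scores) / len(scores)

# Toy model: always predict the mean of the training targets.
def fit_mean(train):
    return sum(train) / len(train)

def squared_error(model, validation):
    return sum((model - y) ** 2 for y in validation) / len(validation)

targets = [3.0, 5.0, 4.0, 6.0, 5.0, 4.0, 6.0, 5.0, 3.0, 4.0]
average_error = cross_val_score(fit_mean, squared_error, targets, n_folds=5)
folds = k_fold_indices(10, 3)
```

The loop in `cross_val_score` calls `fit` once per fold, which is where the multiplied training time comes from.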


### Data Mismatch 

In some cases, it's easy to get a large amount of data for training, but this data probably won't be perfectly representative of the data that will be used in production. 

The validation set and the test set must be as representative as possible of the data you expect to use in production. 


### No Free Lunch Theorem 
A model is a simplified version of the observations. 
The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. 
To decide what data to discard and what data to keep, you must make **assumptions**. 
David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. 
This is called the **No Free Lunch (NFL) theorem**. 
There is no model that is *a priori* guaranteed to work better. 
The only way to know for sure which model is best is to evaluate them all. 
Since this is not possible, in practice you make some reasonable assumptions about the data and evaluate only a few reasonable models. 


## Exercise 

