# The Fundamentals of Machine Learning

## What is Machine Learning?

Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.

—Arthur Samuel, 1959

A computer program is said to learn from experience E with respect to some task T
and some performance measure P, if its performance on T, as measured by P, improves
with experience E.

—Tom Mitchell, 1997

## Machine Learning Solutions Are Good For
1. Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one ML algorithm can often simplify code and perform better.
2. Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can often find a solution.
3. Fluctuating environments: a Machine Learning system can adapt to new data.
4. Getting insights about complex problems and large amounts of data.

## Types of Machine Learning Systems

### Supervised and Unsupervised Learning

**Supervised Learning**

In supervised learning, the training data includes the desired solution, called labels.

A typical supervised learning task is classification. Another typical task is to predict a target numeric value, such as the price of a car, given a set of features called predictors. This sort of task is called regression.

Some regression algorithms can be used for classification as well, and vice versa. Logistic Regression is one example.

Examples of supervised learning algorithms
* K-Nearest Neighbors
* Linear Regression
* Logistic Regression
* Support Vector Machines (SVMs)
* Decision Trees and Random Forests
* Neural Networks
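
As a rough illustration of a supervised regression task, the sketch below fits a straight line to labeled examples with ordinary least squares. The data (car age as the predictor, price as the label) and the name `fit_line` are made up for this example; they are assumptions, not part of the original notes.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical labeled training data: car age in years -> price.
ages = [1, 2, 3, 4, 5]
prices = [18000, 16000, 14000, 12000, 10000]

slope, intercept = fit_line(ages, prices)

# Use the learned parameters to predict the price of a 6-year-old car.
predicted_price = slope * 6 + intercept
```

The labels (`prices`) are what make this supervised: the algorithm is told the desired answer for every training instance.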

**Unsupervised Learning**

In unsupervised learning, the training data is unlabeled. The system tries to learn without a teacher.

Visualization algorithms are good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of the data that can be easily plotted. These algorithms try to preserve as much structure as they can, so you can understand how the data is organized and perhaps identify unsuspected patterns.

A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. This is called **feature extraction**.

*It is often a good idea to try to reduce the dimension of your training data using a dimensionality reduction algorithm before you feed it to another ML algorithm. It will run much faster, the data will take up less disk and memory space, and in some cases it may also perform better.*

Another important unsupervised task is anomaly detection. The system is shown mostly normal instances during training, so it learns to recognize them; when it sees a new instance, it can tell whether it looks like a normal one or whether it is likely an anomaly. A similar task is novelty detection. The difference is that novelty detection algorithms expect to see only normal data during training, while anomaly detection algorithms are usually more tolerant: they can often perform well even with a small percentage of outliers in the training set.
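
To make the idea concrete, here is a minimal sketch that learns what "normal" looks like from unlabeled readings and flags values that sit far from it. It uses a simple z-score rule, which is a stand-in chosen for brevity and not one of the algorithms listed in these notes; the readings, the threshold of 3 standard deviations, and the function names are all assumptions.

```python
from statistics import mean, stdev


def fit_normal_profile(values):
    """Summarize the (mostly normal) training data by its mean and spread."""
    return mean(values), stdev(values)


def is_anomaly(value, profile, threshold=3.0):
    """Flag a value that lies more than `threshold` standard deviations from the mean."""
    center, spread = profile
    return abs(value - center) / spread > threshold


# Hypothetical sensor readings seen during training (no labels).
normal_readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
profile = fit_normal_profile(normal_readings)

looks_normal = is_anomaly(20.4, profile)   # close to the training data
looks_unusual = is_anomaly(35.0, profile)  # far from anything seen in training
```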

Another common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes.

Examples of unsupervised learning algorithms
* Clustering
    - K-Means
    - DBSCAN
    - Hierarchical Cluster Analysis (HCA)
* Anomaly detection and novelty detection
    - One-class SVM
    - Isolation Forest
* Visualization and dimensionality reduction
    - Principal Component Analysis (PCA)
    - Kernel PCA
    - Locally-Linear Embedding (LLE)
    - t-distributed Stochastic Neighbor Embedding (t-SNE)
* Association rule learning
    - Apriori
    - Eclat
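
As a small illustration of clustering, the sketch below runs a bare-bones K-Means on one-dimensional points. It assumes the starting centroids are supplied by the caller and runs a fixed number of iterations, which keeps the example deterministic; practical implementations usually pick starting centroids randomly and stop on convergence. The data and the name `k_means_1d` are hypothetical.

```python
def k_means_1d(points, initial_centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: an empty cluster keeps its previous centroid.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Unlabeled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(points, initial_centroids=[0.0, 5.0])
```

No labels are involved at any point: the grouping comes only from the distances between the points.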
    
**Semisupervised Learning**

Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a small bit of labeled data. This is called semisupervised learning.

Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, *Deep Belief Networks* (DBNs) are based on unsupervised components called *Restricted Boltzmann Machines* (RBMs) stacked on top of one another. RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.

**Reinforcement Learning**

In Reinforcement Learning, the learning system, called an agent, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what the best strategy, called a policy, is to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.

Example: DeepMind's AlphaGo.
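
The reward-driven loop can be sketched with a very small example: an epsilon-greedy agent choosing between three actions with fixed rewards. This is a deliberately simplified setting with a single situation, so the learned policy is just "which action to pick"; it is in no way how AlphaGo works. The reward values, `epsilon`, and the name `run_bandit` are assumptions for illustration.

```python
import random


def run_bandit(rewards, steps=1000, epsilon=0.1, seed=0):
    """Learn by trial and error which action yields the highest reward."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # agent's current estimate of each action's reward
    counts = [0] * len(rewards)
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))  # explore: try a random action
        else:
            action = max(range(len(rewards)), key=lambda a: estimates[a])  # exploit
        reward = rewards[action]  # the environment's response (negative = penalty)
        counts[action] += 1
        # Incremental average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    # The learned policy: always choose the action with the best estimate.
    policy = max(range(len(rewards)), key=lambda a: estimates[a])
    return policy, estimates, total_reward


policy, estimates, total_reward = run_bandit(rewards=[-1.0, 1.0, 0.5])
```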

### Batch and Online Learning

Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data.

**Batch Learning**

In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production. This is called offline learning.

**Online Learning**

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives. Online learning is great for systems that receive data as a continuous flow and need to adapt to change rapidly or autonomously.

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (this is called out-of-core learning). Out-of-core learning is usually done offline, so online learning can be a confusing name. Think of it as incremental learning.

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. A big challenge with online learning is that if bad data is fed to the system, the system's performance will gradually decline.
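
The sketch below shows one way incremental training and the learning rate fit together: a single weight is nudged after every mini-batch, and the learning rate controls the size of each nudge. The stream (points on the line y = 3x), the squared-error gradient step, and the names `online_update` and `train_online` are assumptions made for this example.

```python
def online_update(weight, batch, learning_rate):
    """One incremental step that uses only the current mini-batch."""
    # Gradient of the mean squared error of the model y = weight * x.
    gradient = sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)
    return weight - learning_rate * gradient


def stream_of_batches(n_batches=50):
    """Hypothetical data stream: mini-batches of two points from y = 3x."""
    xs = [0.5, 1.0, 1.5, 2.0]
    for step in range(n_batches):
        x1, x2 = xs[step % 4], xs[(step + 1) % 4]
        yield [(x1, 3 * x1), (x2, 3 * x2)]


def train_online(learning_rate):
    weight = 0.0
    for batch in stream_of_batches():
        weight = online_update(weight, batch, learning_rate)  # learn as data arrives
    return weight


fast_weight = train_online(learning_rate=0.1)   # adapts quickly
slow_weight = train_online(learning_rate=0.01)  # adapts slowly
```

After the same 50 mini-batches, the higher learning rate has moved the weight much closer to 3. On noisy or corrupted data that same responsiveness becomes a liability, which is the trade-off the learning rate controls.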

### Instance-Based Versus Model-Based Learning

One way to categorize Machine Learning systems is by how they generalize.

**Instance-based Learning**

Possibly the most trivial form of learning is simply to learn by heart. In instance-based learning, the system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples (or a subset of them) using a similarity measure.
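
A k-nearest-neighbors classifier is a common instance-based method and makes a compact sketch of this idea. Here "learning" is just storing the examples, and the similarity measure is assumed to be Euclidean distance. The data points and class names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(stored_examples, new_point, k=3):
    """Classify `new_point` by majority vote among its k most similar stored examples."""
    by_distance = sorted(stored_examples, key=lambda example: math.dist(example[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# The "model" is simply the memorized training examples: (features, label).
stored_examples = [
    ((1.0, 1.0), "class_a"), ((1.2, 0.8), "class_a"), ((0.9, 1.1), "class_a"),
    ((5.0, 5.0), "class_b"), ((5.2, 4.9), "class_b"), ((4.8, 5.1), "class_b"),
]

prediction = knn_predict(stored_examples, (1.1, 1.0))
```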

**Model-based Learning**

Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions. This is called model-based learning. To choose a good model, you define a performance measure: either a utility function (or fitness function) that measures how good the model is, or a cost function that measures how bad it is.
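
As a sketch of how a cost function ranks candidate models, the example below scores two hand-written linear models with the mean squared error and keeps the one with the lower cost. Both models, the data, and the choice of mean squared error are assumptions for illustration.

```python
def mse_cost(model, examples):
    """Cost function: mean squared error, lower is better."""
    return sum((model(x) - y) ** 2 for x, y in examples) / len(examples)


def utility(model, examples):
    """A matching utility function: higher is better."""
    return -mse_cost(model, examples)


def model_a(x):
    return 2.0 * x


def model_b(x):
    return 0.5 * x + 3.0


# Hypothetical examples that roughly follow y = 2x.
examples = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]

cost_a = mse_cost(model_a, examples)
cost_b = mse_cost(model_b, examples)
best_model = min([model_a, model_b], key=lambda model: mse_cost(model, examples))
```

Training a model amounts to searching for the parameter values that minimize the cost function (or maximize the utility function) on the training data.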

## Main Challenges of Machine Learning
The two main things that can go wrong are a bad algorithm and bad data.

### Insufficient Quantity of Training Data
Most Machine Learning algorithms take a lot of data to work properly, even for very simple problems.

A paper published in 2001 showed that very different ML algorithms, including fairly simple ones, performed well on a complex problem of natural language disambiguation once they were given enough data. The results suggest that we may want to reconsider the trade-off between investing in algorithm development and investing in data collection. This idea that data matters more than the algorithm for complex problems was further popularized by the paper *The Unreasonable Effectiveness of Data*, published in 2009.

However, small and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data, so don't abandon algorithms just yet.


### Nonrepresentative Training Data
It is crucial to use a training set that is representative of the cases you want to generalize to. If the sample is too small, you will have **sampling noise** (nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called **sampling bias**.

### Poor-Quality Data
If the training data is full of errors, outliers, and noise, it will make it harder for the system to detect the underlying patterns. It is often well worth the effort to spend time cleaning up training data.

### Irrelevant Features
As the saying goes: garbage in, garbage out. Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:
* Feature selection: selecting the most useful features to train on among existing features.
* Feature extraction: combining existing features to produce a more useful one (dimensionality reduction algorithms can help).
* Creating new features by gathering new data.

### Overfitting the Training Data
Overfitting means that the model performs well on the training data, but it does not generalize well.
Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:
* To simplify the model by selecting one with fewer parameters, by reducing the number of attributes in the training data, or by constraining the model.
* To gather more training data.
* To reduce the noise in the training data.

Constraining a model to make it simpler and reduce the risk of overfitting is called **regularization**.

The amount of regularization to apply during learning can be controlled by a **hyperparameter**. A hyperparameter is a parameter of the learning algorithm (not of the model): it is set before training and stays constant while the model is trained.
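
To show the difference between a model parameter and a hyperparameter, the sketch below fits a one-parameter model y = w·x while penalizing large values of w (ridge-style regularization, used here as one possible form of constraint). The slope `w` is learned from the data; `alpha` is the hyperparameter and is fixed before training. The data and the name `fit_regularized_slope` are assumptions.

```python
def fit_regularized_slope(xs, ys, alpha):
    """Minimize sum((w * x - y)**2) + alpha * w**2 over w.

    Setting the derivative to zero gives w = sum(x * y) / (sum(x**2) + alpha).
    """
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)


xs = [1, 2, 3]
ys = [2, 4, 6]

unconstrained = fit_regularized_slope(xs, ys, alpha=0.0)         # no regularization
constrained = fit_regularized_slope(xs, ys, alpha=14.0)          # slope is pulled toward zero
heavily_constrained = fit_regularized_slope(xs, ys, alpha=1e6)   # model is almost flat
```

A larger `alpha` gives a simpler, flatter model that is less likely to overfit but may start to underfit, so its value has to be tuned.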

### Underfitting the Training Data

Underfitting occurs when the model is too simple to learn the underlying structure of the data.

The main options to fix this problem are:
* Selecting a more powerful model, with more parameters.
* Feeding better features to the learning algorithm (feature engineering).
* Reducing the constraints on the model (e.g., reducing the regularization hyperparameter).

## Testing and Validation

The only way to know how well a model will generalize to new cases is to actually try it out on new cases.
For this the best option is to split the data into two sets: the training set and the test set.
The error rate on new cases is called the **generalization error** (or out-of-sample error), and by evaluating the model on the test set, you get an estimate of this error. This value tells you how well the model will perform on instances it has never seen before.

If the training error is low but the generalization error is high, it means that your model is overfitting the training data.

_It is common to use 80% of the data for training and hold out 20% for testing._
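
A minimal sketch of such a split, assuming the data fits in memory and that a random shuffle is appropriate (it would not be for time-ordered data, for example). The function name `split_train_test` and the fixed seed are choices made for this example.

```python
import random


def split_train_test(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and hold out `test_ratio` of it for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    test_size = int(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical dataset of 100 instances, represented by their IDs.
train_set, test_set = split_train_test(range(100))
```

Fixing the seed matters: if the split changed on every run, the model would eventually be tuned against every instance and the test set would stop being unseen data.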

### Hyperparameter Tuning and Model Selection

Suppose you train 100 different models using 100 different hyperparameter values and keep the one with the lowest generalization error on the test set. That model is unlikely to perform as well on new data. The problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that particular set.

A common solution is holdout validation: you simply hold out part of the training set to evaluate several candidate models and select the best one. The new held-out set is called the validation set (or sometimes the development set, or dev set).
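
Holdout validation can be sketched as follows: candidate models are trained on a reduced training set, compared on the validation set, and the winner is retrained on the full training set. The candidates here differ only in a regularization hyperparameter `alpha` for a one-parameter model y = w·x; the data, the candidate values, and all function names are assumptions for illustration.

```python
def fit_slope(examples, alpha):
    """Regularized least-squares slope for the model y = w * x."""
    sum_xy = sum(x * y for x, y in examples)
    sum_xx = sum(x * x for x, _ in examples)
    return sum_xy / (sum_xx + alpha)


def mean_squared_error(slope, examples):
    return sum((slope * x - y) ** 2 for x, y in examples) / len(examples)


# Hypothetical training set; the test set is kept elsewhere and not touched here.
training_set = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 9.0), (5, 9.9), (6, 12.1)]

# Hold out part of the training set as the validation set.
reduced_training_set, validation_set = training_set[:4], training_set[4:]

# Train one candidate per hyperparameter value and compare them on the validation set.
candidate_alphas = [0.0, 1.0, 10.0, 100.0]
best_alpha = min(
    candidate_alphas,
    key=lambda alpha: mean_squared_error(fit_slope(reduced_training_set, alpha), validation_set),
)

# Retrain the selected candidate on the full training set (validation set included).
final_slope = fit_slope(training_set, best_alpha)
```

Only after this step would the final model be evaluated once on the test set to estimate the generalization error.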

This solution usually works quite well. However, if the validation set is too small, model evaluations will be imprecise. Conversely, if the validation set is too large, the remaining training set will be much smaller than the full training set.
One way to solve this problem is to perform repeated cross-validation, using many validation sets. However, the training time is multiplied by the number of validation sets.
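
One common variant is k-fold cross-validation, sketched below: the training set is cut into `n_folds` parts, and each part serves once as the validation set while the model is trained on the rest. The helper names and the `train_and_score` callback are hypothetical; the callback is assumed to train a model on the given training indices and return its score on the validation indices.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, validation_indices), one pair per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if fold < n_samples % n_folds else 0)
        for fold in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validate(train_and_score, n_samples, n_folds=5):
    """Average the validation score over all folds (one training run per fold)."""
    scores = [
        train_and_score(train, validation)
        for train, validation in k_fold_splits(n_samples, n_folds)
    ]
    return sum(scores) / len(scores)


splits = list(k_fold_splits(n_samples=10, n_folds=3))
```

Each call to `cross_validate` trains the model `n_folds` times, which is where the extra training cost comes from.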
