# The Machine Learning Landscape

### What is Machine Learning?

- Machine Learning is the science (and art) of programming computers so they can
learn from data.

### Why use Machine Learning?

- Problems for which existing solutions require a lot of hand-tuning or long lists of
rules: one Machine Learning algorithm can often simplify code and perform better.
- Complex problems for which there is no good solution at all using a traditional
approach: the best Machine Learning techniques can find a solution.
- Fluctuating environments: a Machine Learning system can adapt to new data.
- Getting insights about complex problems and large amounts of data.

### Types of ML Systems

Machine Learning systems can be classified by three criteria:

1. Whether or not they are trained with human supervision
2. Whether or not they can learn incrementally on the fly
3. Whether they work by simply comparing new data points to known data points,
or instead detect patterns in the training data and build a predictive model

#### 1. Supervised/Unsupervised Learning
- **Supervised:** In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels.

- **Unsupervised:** In unsupervised learning, the training data is unlabeled, so the system tries to learn without a teacher.

- **Semisupervised:** Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning. Semisupervised learning algorithms are typically combinations of unsupervised and supervised algorithms.

- **Reinforcement:** Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
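The difference between the first two categories can be sketched on toy one-dimensional data. This is a minimal illustration, not a reference implementation: the data and the helper names `fit_centroids`, `predict`, and `two_means` are assumptions made up for this example.

```python
from statistics import mean

def fit_centroids(values, labels):
    """Supervised: use the labels to compute one centroid per class."""
    classes = sorted(set(labels))
    return {c: mean(v for v, l in zip(values, labels) if l == c) for c in classes}

def predict(centroids, value):
    """Assign the class whose centroid is closest to the value."""
    return min(centroids, key=lambda c: abs(centroids[c] - value))

def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (1-D k-means, k=2).

    Assumes the input contains at least two distinct values.
    """
    low, high = min(values), max(values)          # initial cluster centers
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(near_low), mean(near_high)
    return low, high

values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["small", "small", "small", "large", "large", "large"]

centroids = fit_centroids(values, labels)   # training data includes labels
prediction = predict(centroids, 4.0)        # classify a new value

centers = two_means(values)                 # same data, no labels
```

The supervised part needs `labels` to learn anything, while `two_means` discovers the two groups from the values alone.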

#### 2. Batch and Online Learning

- **Batch:** In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline.

- **Online:** In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.
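As a rough sketch of the contrast, the snippet below fits the one-parameter model `y ~ w * x` in both ways. The toy data, the learning rate, and the function names `fit_batch` and `partial_fit` are assumptions chosen for illustration; gradient descent is only one of several ways to learn incrementally.

```python
def fit_batch(xs, ys):
    """Batch learning: least-squares slope for y ~ w * x from the full dataset."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def partial_fit(w, batch, learning_rate=0.05):
    """Online learning: one gradient step on a mini-batch of (x, y) pairs."""
    gradient = sum((w * x - y) * x for x, y in batch) / len(batch)
    return w - learning_rate * gradient

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]          # toy data generated by y = 2x

w_batch = fit_batch(xs, ys)        # needs all the data at once

w_online = 0.0
stream = list(zip(xs, ys)) * 20    # stand-in for data arriving over time
for start in range(0, len(stream), 2):
    mini_batch = stream[start:start + 2]
    w_online = partial_fit(w_online, mini_batch)   # cheap incremental update
```

Both estimates approach the true slope of 2, but `partial_fit` only ever sees two instances at a time and could keep running as new data arrives.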

#### 3. Instance-Based Versus Model-Based Learning

One more way to categorize Machine Learning systems is by how they generalize. There are two main approaches to generalization:

- **Instance-based:** the system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples (or a subset of them), using a similarity measure.

- **Model-based:** the system builds a model from the training data, which is then used to make predictions without needing to store all the instances.
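The two approaches can be compared on a small regression task. The sketch below assumes k-nearest neighbors as the instance-based learner (with absolute distance as the similarity measure) and a least-squares line as the model-based learner; the data and the names `knn_predict` and `fit_line` are illustrative.

```python
from statistics import mean

def knn_predict(train, x_new, k=3):
    """Instance-based: average the targets of the k stored examples nearest to x_new."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:k]
    return mean(y for _, y in nearest)

def fit_line(train):
    """Model-based: fit y = slope * x + intercept by ordinary least squares."""
    x_mean = mean(x for x, _ in train)
    y_mean = mean(y for _, y in train)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in train)
             / sum((x - x_mean) ** 2 for x, _ in train))
    return slope, y_mean - slope * x_mean

train = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0), (5.0, 11.0)]  # y = 2x + 1

knn_estimate = knn_predict(train, 3.0)        # must keep `train` around to predict

slope, intercept = fit_line(train)            # two numbers summarize the data
line_estimate = slope * 3.0 + intercept
```

`knn_predict` needs the stored training set at prediction time, whereas the fitted line needs only `slope` and `intercept`.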

### Main Challenges

#### Insufficient Quantity of Training Data

It takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples. However, small- and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data, so don’t abandon algorithms just yet.

#### Nonrepresentative Training Data

In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.

#### Poor Quality Data

If your training data is full of errors, outliers, and noise, it will make it harder for the system to detect the underlying patterns.  It is often well worth the effort to spend time cleaning up your training data.
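What "cleaning up" means depends on the dataset. As one possible sketch, the function below drops missing entries and then discards values that lie far from the median. The rule (median absolute deviation), the `threshold` value, and the name `clean` are assumptions for illustration, not a prescribed method.

```python
from statistics import median

def clean(values, threshold=5.0):
    """Drop missing entries, then drop values far from the median.

    'Far' means more than `threshold` median absolute deviations (MAD)
    away; both the rule and the threshold are illustrative choices.
    """
    present = [v for v in values if v is not None]
    center = median(present)
    mad = median(abs(v - center) for v in present)
    if mad == 0:
        return present            # no spread to measure outliers against
    return [v for v in present if abs(v - center) <= threshold * mad]

raw = [10.1, 9.8, None, 10.3, 97.0, 10.0, 9.9]
cleaned = clean(raw)              # the missing entry and 97.0 are removed
```

In practice, it is worth inspecting what gets removed, since an apparent outlier can also be a legitimate rare case.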

#### Irrelevant Features

Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process is called feature engineering.

#### Overfitting Training Data

Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well. 

- **Possible Solutions:** simplify the model (by selecting one with fewer parameters, by reducing the number of attributes in the training data, or by **constraining** the model), gather more training data, or reduce the noise in the training data.
    - Constraining a model to make it simpler and reduce the risk of overfitting is called **regularization.** You want to find the right balance between fitting the training data perfectly and keeping the model simple enough to ensure that it will generalize well.

The amount of regularization to apply during learning can be controlled by a **hyperparameter**. A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training.
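To make the role of the hyperparameter concrete, here is a small sketch that assumes L2 (ridge) regularization on the one-parameter model `y ~ w * x`. The penalty form, the name `alpha`, and the toy data are assumptions for this example; other regularization schemes exist.

```python
def fit_ridge_slope(xs, ys, alpha):
    """Fit y ~ w * x by minimizing sum((w*x - y)**2) + alpha * w**2.

    `alpha` is the regularization hyperparameter: it is chosen before
    training and is not changed by the fitting procedure.
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

w_unregularized = fit_ridge_slope(xs, ys, alpha=0.0)    # plain least squares
w_regularized = fit_ridge_slope(xs, ys, alpha=14.0)     # slope shrunk toward zero
```

The learned parameter is `w`; `alpha` only controls how strongly `w` is pulled toward zero. A larger `alpha` gives a simpler (flatter) model at the cost of a worse fit to the training data.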

#### Underfitting Training Data

Underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data.

- **Possible Solutions:** select a more powerful model with more parameters, feed better features to the learning algorithm (feature engineering), or reduce the constraints on the model (for example, by lowering the regularization hyperparameter).
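A common way to tell the two failure modes apart is to compare the error on the training data with the error on data the model has not seen. The sketch below assumes a held-out set and three deliberately chosen toy models; the data, the noise pattern, and all function names are illustrative.

```python
from statistics import mean

def mse(model, data):
    """Mean squared error of `model` (a function x -> prediction) on (x, y) pairs."""
    return mean((model(x) - y) ** 2 for x, y in data)

def fit_constant(train):
    """Too simple: always predicts the mean training target."""
    y_mean = mean(y for _, y in train)
    return lambda x: y_mean

def fit_line(train):
    """Ordinary least-squares line."""
    x_mean = mean(x for x, _ in train)
    y_mean = mean(y for _, y in train)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in train)
             / sum((x - x_mean) ** 2 for x, _ in train))
    return lambda x: slope * (x - x_mean) + y_mean

def fit_memorizer(train):
    """Too flexible: returns the target of the single nearest training example."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

# Toy data: y = 2x plus noise of +1 or -1; the held-out set has the opposite noise.
train = [(float(x), 2.0 * x + (1 if x % 2 == 0 else -1)) for x in range(8)]
held_out = [(float(x), 2.0 * x - (1 if x % 2 == 0 else -1)) for x in range(8)]

# Maps each model name to (training error, held-out error).
errors = {
    name: (mse(model, train), mse(model, held_out))
    for name, model in [
        ("constant", fit_constant(train)),
        ("line", fit_line(train)),
        ("memorizer", fit_memorizer(train)),
    ]
}
```

On this data the constant model has a high error on both sets (underfitting), the memorizer has zero training error but a clearly higher held-out error (overfitting), and the line has a low, similar error on both.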