# The Machine Learning Landscape

# What is Machine Learning?

## Definitions

Machine Learning is the science (and art) of programming computers so they can learn from data.

- General definition:
  - Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
    - Arthur Samuel, 1959
- A more engineering-oriented definition:
  - A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
    - Tom Mitchell, 1997

Consider the *spam filter*, one of the first widely used applications of ML. Here, the task T is to flag spam among new emails, the experience E is the *training data*, and the performance measure P can be *accuracy* (the ratio of correctly classified emails).
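
To make the performance measure P concrete, here is a minimal sketch of accuracy computed as the fraction of correctly classified emails. The function name and the tiny label lists are made up for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of emails whose predicted class matches the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical labels for five emails: True means spam.
true_labels = [True, False, True, False, False]
predicted = [True, False, False, False, False]

print(accuracy(predicted, true_labels))  # 4 of 5 correct -> 0.8
```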

## Why use Machine Learning?

Consider how you would write a spam filter using traditional programming techniques:

1. First, you would examine spam messages and note the patterns that typically appear in them.
2. Then, you would write a set of rules (an algorithm) that classifies messages according to those patterns.
3. Finally, you would test your program and repeat steps 1 and 2 until it was good enough.

Since the problem is difficult, your program will likely become a long list of complex rules that is hard to maintain.

In contrast, a spam filter based on ML automatically learns from example emails which patterns are good predictors of spam. The program is much shorter, easier to maintain, and most likely more accurate.
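
As a rough illustration of learning the patterns from data, the toy sketch below counts which words show up far more often in spam than in normal email (often called ham) and flags messages that contain them. The function names, the `min_ratio` threshold, and the example emails are all assumptions made for this sketch; real spam filters use much more robust techniques.

```python
from collections import Counter

def learn_spam_words(emails, labels, min_ratio=2.0):
    """Learn words that occur much more often in spam than in ham."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in zip(emails, labels):
        counts = spam_counts if is_spam else ham_counts
        counts.update(set(text.lower().split()))  # count each word once per email
    # Keep words whose spam count is at least min_ratio * (ham count + 1).
    # The +1 means a word needs several spam occurrences even if never seen in ham.
    return {w for w, c in spam_counts.items() if c >= min_ratio * (ham_counts[w] + 1)}

def looks_like_spam(text, spam_words):
    """Flag an email if it contains any of the learned spam words."""
    return any(word in spam_words for word in text.lower().split())

# Hypothetical training data: True means spam.
emails = ["free money now", "claim your free prize", "free credit card",
          "meeting at noon", "your invoice is attached", "lunch now?"]
labels = [True, True, True, False, False, False]

spam_words = learn_spam_words(emails, labels)
print(spam_words)                                          # {'free'}
print(looks_like_spam("free tickets inside", spam_words))  # True
print(looks_like_spam("see you at noon", spam_words))      # False
```

When spammers change their wording, calling `learn_spam_words` again on updated examples picks up the new words without any hand-written rule changes.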

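If spammers notice that your filter blocks a particular pattern, they will change their messages to get around it. A traditional filter would need new hand-written rules each time, whereas an ML-based filter can be retrained on updated data and adapt automatically.

ML is also useful for problems that are too complex for traditional approaches or that have no known algorithm.

ML can also help humans learn. For instance, once a spam filter has been trained, it can be inspected to see which patterns it considers the best predictors of spam. This can reveal new trends or unsuspected correlations, and thereby lead to a better understanding of the problem.

Applying ML techniques to dig into large amounts of data can help discover patterns that were not immediately apparent. This is called *data mining*.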

### Summary

ML is a good fit for:

- Problems whose existing solutions require long lists of rules: an ML algorithm can often simplify the code and perform better.
- Complex problems for which traditional approaches yield no good solution.
- Fluctuating environments: an ML system can adapt to new data.
- Getting insights about complex problems and large amounts of data.

## Examples of Applications

Here are some concrete, widely used applications, along with the techniques typically used for each:

- Analyzing images of products to automatically classify them
  - *Image classification* performed by *Convolutional Neural Networks (CNNs).*
- Detecting tumors in brain scans
  - *Semantic segmentation*, where each pixel in the image is classified, typically using *CNNs*.
- Automatically classifying news articles
  - *Text classification*, a *Natural Language Processing (NLP)* task, performed by *Recurrent Neural Networks (RNNs)* or *Transformers*.
- Automatically flagging offensive comments
  - *Text classification* using NLP.
- Summarizing long documents
  - *Text summarization* using NLP.
- Creating a Chatbot or personal assistant
  - This involves many NLP components, including *Natural Language Understanding (NLU)* and *Question-Answering* modules.
- Forecasting your company's revenue next year
  - *Regression*, using any regression algorithm such as *Linear Regression* or *Polynomial Regression*, a regression *Support Vector Machine (SVM)*, a regression *Random Forest*, or an *Artificial Neural Network (ANN)*.
- Making your app react to voice commands
  - *Speech recognition*, using RNNs or *Transformers*.
- Detecting credit card fraud
  - *Anomaly Detection*.
- Segmenting clients based on their purchases for market strategy
  - *Clustering*.
- Representing a complex, high-dimensional dataset in a clear and insightful diagram
  - *Data Visualization*, often involving *Dimensionality reduction* techniques.
- Recommending a product based on client interest and past purchases
  - *Recommender system*.
- Building an intelligent bot for a game
  - *Reinforcement Learning*.

# Types of Machine Learning Systems

We classify ML systems based on the following criteria:

- Whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised, and Reinforcement Learning)
- Whether or not they can learn incrementally on the fly (online versus batch learning)
- Whether they work by simply comparing new data points to known data points, or instead by detecting patterns in the training data and building a predictive model, much like scientists do (instance-based versus model-based learning)

These criteria are not exclusive; you can combine them in any way you like.

## Supervised/Unsupervised Learning

ML systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised, unsupervised, semi-supervised, and Reinforcement Learning.

### Supervised Learning

In *supervised learning*, the training set you feed to the algorithm includes the desired solutions, called *labels*.

There are two typical tasks in supervised learning:

- *Classification*: The spam filter is a good example. It is trained on many example emails along with their *class* (spam or not spam), and it must learn to classify new emails.
- *Regression*: The task is to predict a *target* numerical value, such as the price of a car, given a set of *features* (mileage, age, brand, etc.) called *predictors*. To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e. their prices).

> In ML, an *attribute* is a data type (e.g. "mileage"), while a *feature* has several meanings depending on the context, but generally means an attribute plus its value (e.g. "mileage = 15,000"). Many people use the two words interchangeably.

> Some regression algorithms can be used for classification as well, and vice versa. For example, *Logistic Regression* is commonly used for classification, because it can output a value that corresponds to the probability of belonging to a given class.

Here are some of the most important supervised learning algorithms:

- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural Networks

> Some Neural Networks can be unsupervised, such as *Autoencoders*.

### Unsupervised Learning

Here, the training data is unlabeled.

Here are some important examples:

- Clustering
  - K-Means
  - DBSCAN
  - Hierarchical Cluster Analysis (HCA)
- Anomaly detection and novelty detection
  - One-class SVM
  - Isolation Forest
- Visualization and dimensionality reduction
  - Principal Component Analysis (PCA)
  - Kernel PCA
  - Locally Linear Embedding (LLE)
  - t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning
  - Apriori
  - Eclat

For example, if you have a lot of data about your customers, a *clustering* algorithm can divide them into groups of similar customers without you ever telling it which group a customer belongs to.
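
A minimal sketch of how a clustering algorithm can find such groups, using K-Means (Lloyd's algorithm) in plain Python. The customer features (visits per month, average basket value), the data, and the starting centroids are invented for illustration; in practice you would normally use a library implementation and scale the features first.

```python
def kmeans(points, centroids, n_iters=10):
    """Lloyd's algorithm on 2D points, starting from the given centroids."""
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters  # clusters come from the last assignment step

# Hypothetical customers: (visits per month, average basket value).
customers = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 78)]
centroids, clusters = kmeans(customers, centroids=[(0, 0), (10, 100)])

print(clusters[0])   # [(1, 10), (2, 12), (1, 11)]
print(clusters[1])   # [(9, 80), (10, 85), (8, 78)]
print(centroids[1])  # (9.0, 81.0)
```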

In *visualization* algorithms, you feed in a lot of complex, unlabeled data, and they output a 2D or 3D representation that can easily be plotted. This helps you understand how the data is organized and perhaps identify unsuspected patterns.

In *dimensionality reduction*, the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. This is called *feature extraction*.

> It is often a good idea to reduce the dimensionality of your training data before feeding it to another ML algorithm. It will reduce training time and memory use, and in some cases it may also improve performance.

The goal of *anomaly detection* is to identify unusual instances or outliers in the data, for example to detect credit card fraud or catch manufacturing defects. A very similar task is *novelty detection*: it aims to detect new instances that look different from all the instances in the training set.
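
The idea can be sketched with a simple z-score rule: learn what normal values look like, then flag anything too far from them. This is a much simpler stand-in for the algorithms listed above (One-class SVM, Isolation Forest); the function names, the threshold of 3 standard deviations, and the transaction amounts are assumptions for illustration.

```python
import statistics

def fit_normal_profile(amounts):
    """Summarize normal training data by its mean and standard deviation."""
    return statistics.mean(amounts), statistics.stdev(amounts)

def is_anomaly(amount, profile, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mean, std = profile
    return abs(amount - mean) / std > threshold

# Hypothetical card transactions known to be normal.
normal_amounts = [20, 25, 22, 30, 18, 27, 24, 21]
profile = fit_normal_profile(normal_amounts)

print(is_anomaly(26, profile))   # False: close to the usual amounts
print(is_anomaly(500, profile))  # True: far outside the usual range
```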

*Association rule learning* digs into large amounts of data to discover interesting relations between attributes. For example, people who buy macaroni may also tend to buy cheese.
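
Association rules are usually scored with *support* (how often the items appear together) and *confidence* (how often the rule holds when its left-hand side is present). The sketch below only computes these two scores for one rule on a made-up list of shopping baskets; algorithms such as Apriori search for all rules whose scores exceed chosen thresholds.

```python
def support(transactions, items):
    """Fraction of transactions that contain all the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets.
baskets = [
    {"macaroni", "cheese"},
    {"macaroni", "cheese", "bread"},
    {"bread", "milk"},
    {"macaroni", "milk"},
    {"cheese"},
]

print(support(baskets, {"macaroni", "cheese"}))       # 2 of 5 baskets -> 0.4
print(confidence(baskets, {"macaroni"}, {"cheese"}))  # 2 of the 3 macaroni baskets, about 0.67
```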

### Semi-supervised Learning

Sometimes only part of the data is labeled: typically there are many unlabeled instances and few labeled ones. Algorithms that can deal with such partially labeled data fall under *semi-supervised learning*.

Most of these algorithms are combinations of unsupervised and supervised algorithms. For example, *Deep Belief Networks (DBNs)* are based on unsupervised components called *Restricted Boltzmann Machines (RBMs)* stacked on top of one another. The RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.

A photo-hosting service such as Google Photos is a good example: it can group photos that show the same person (the unsupervised part), and once you tell it who each person is, it can name the people in your photos.

### Reinforcement Learning

The learning system, called an *agent* in this context, can observe the environment, select and perform actions, and get *rewards* in return (or penalties in the form of negative rewards). It must then learn by itself what the best strategy, called a *policy*, is to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
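
The agent, action, reward, and policy loop can be sketched on a deliberately tiny problem with a single situation and fixed rewards. The agent below is greedy with optimistic initial estimates, one of the simplest ways to make an agent try every action at least once. The action names, reward values, and function name are hypothetical, and real environments have many situations and noisy rewards.

```python
def learn_policy(rewards, n_steps=20, optimistic_value=10.0):
    """Greedy agent with optimistic initial estimates on a one-situation problem.

    `optimistic_value` must exceed every real reward so each action gets tried.
    """
    estimates = {action: optimistic_value for action in rewards}
    counts = {action: 0 for action in rewards}
    total_reward = 0.0
    for _ in range(n_steps):
        action = max(estimates, key=estimates.get)  # policy: pick the best-looking action
        reward = rewards[action]                    # the environment returns a reward
        total_reward += reward
        counts[action] += 1
        # Running average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, total_reward

# Hypothetical environment: each action always yields the same reward.
environment = {"left": 1.0, "stay": 0.0, "right": 5.0}
estimates, total_reward = learn_policy(environment)

print(max(estimates, key=estimates.get))  # right
print(total_reward)                       # 1 + 0 + 5 + 17 * 5 = 91.0
```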

For example, DeepMind's AlphaGo learned to play Go well enough to defeat the world champion.

> Learning was turned off during the games against the champion; AlphaGo was only applying the policy it had already learned.

## Batch and Online Learning

Another criterion used to classify ML systems is whether or not they can learn incrementally from a stream of incoming data.

### Batch Learning

In *batch learning*, the system is incapable of learning incrementally: it must be trained using all the available data.

This generally takes a lot of time and computing resources, so it is typically done offline. For this reason it is also called *offline learning*.

If you want a batch learning system to know about new data, you need to train a new version of the system from scratch on the full dataset (old and new data together), then replace the old system with the new one. This can be automated, but training on the full dataset can take a long time and requires a lot of computing resources (CPU, memory, disk space, and so on).

### Online Learning

In *online learning*, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called *mini-batches*. Each learning step is fast and cheap.

Online learning is great for systems that receive data as a continuous flow and need to adapt to change rapidly or autonomously.

It is also a good option if you have limited computing resources: once an online learning system has learned from new data instances, it no longer needs them, so you can discard them. This can save a huge amount of space.

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory; this is called *out-of-core learning*.

One important parameter of online learning systems is the *learning rate*: how fast they should adapt to changing data. If you set a high learning rate, the system adapts rapidly to new data but also tends to forget the old data quickly. Conversely, if you set a low learning rate, the system learns more slowly, but it is also less sensitive to noise or outliers in the new data.
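
The role of the learning rate can be seen in a minimal sketch of online learning for a linear model, using one stochastic gradient descent step per incoming instance. The data stream (points on the line y = 2x + 1) and the function names are assumptions for illustration.

```python
def online_step(theta, x, y, learning_rate):
    """Update (theta0, theta1) from a single instance, which can then be discarded."""
    theta0, theta1 = theta
    error = (theta0 + theta1 * x) - y  # prediction error on this instance
    # Gradient step on the squared error of this one instance.
    return (theta0 - learning_rate * error,
            theta1 - learning_rate * error * x)

def data_stream(n):
    """Hypothetical stream of instances lying on y = 2x + 1."""
    for i in range(n):
        x = float(i % 4)
        yield x, 2.0 * x + 1.0

theta = (0.0, 0.0)
for x, y in data_stream(2000):
    theta = online_step(theta, x, y, learning_rate=0.05)

print(theta)  # close to (1.0, 2.0)
```

A larger `learning_rate` makes each new instance pull the parameters harder, so the model tracks recent data more closely at the cost of stability; a smaller one changes the parameters more gently.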

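A big challenge with online learning is that if bad data is fed to the system, its performance will gradually decline. To reduce this risk, you need to monitor the system closely and promptly switch learning off if you detect a drop in performance. You may also want to monitor the input data and react to abnormal data, for example using an anomaly detection algorithm.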

## Instance-Based Versus Model-Based Learning

One more way to categorize ML systems is by how they *generalize*. Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.

### Instance-Based Learning

Take the spam filter again: instead of flagging only emails that are identical to known spam emails, it could be programmed to also flag emails that are very similar to known spam emails.

This requires a *measure of similarity* between two emails, such as the number of words they have in common. This approach is called *instance-based learning*: the system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples.
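
A minimal sketch of this idea, assuming the similarity measure is simply the number of words two emails share and that a new email takes the label of its single most similar known email (a 1-nearest-neighbor rule). The function names and example emails are hypothetical.

```python
def similarity(email_a, email_b):
    """Similarity measure: number of distinct words the two emails share."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

def predict_spam(new_email, known_emails):
    """Return the label of the most similar known email (1-nearest neighbor)."""
    _, label = max(known_emails, key=lambda item: similarity(new_email, item[0]))
    return label

# The "training" consists of simply storing the labeled examples.
known_emails = [
    ("win free money now", True),
    ("free prize claim now", True),
    ("project meeting at noon", False),
    ("lunch at noon tomorrow", False),
]

print(predict_spam("claim free money", known_emails))          # True
print(predict_spam("meeting tomorrow at noon", known_emails))  # False
```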

### Model-Based Learning

Another way to generalize from a set of examples is to build a *model* of these examples and then use that model to make predictions. This is called *model-based learning*.

In this approach you first collect the data, then select a model; this step is called *model selection*. Say you choose a *linear model* with a single input feature: it has two *model parameters*, θ0 and θ1. Before you can use the model, you need to find values for these parameters, and to know which values are best you need a performance measure. You can either define a *utility function* (or *fitness function*) that measures how *good* your model is, or a *cost function* that measures how *bad* it is. For Linear Regression, people typically use a cost function that measures the distance between the model's predictions and the training examples. The learning algorithm then finds the parameter values that make the model fit the data best; this is called *training* the model.
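
As a minimal sketch of training a linear model, the code below uses the mean squared error as the cost function and the standard closed-form least-squares solution for a single feature. The data points and function names are made up for illustration.

```python
def mse(theta0, theta1, xs, ys):
    """Cost function: mean squared error of the line theta0 + theta1 * x."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_linear(xs, ys):
    """Training: find the theta0 and theta1 that minimize the MSE (least squares)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    theta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
              / sum((x - mean_x) ** 2 for x in xs))
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

# Hypothetical training data, roughly following y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]

theta0, theta1 = fit_linear(xs, ys)
print(theta0, theta1)               # approximately 1.15 and 1.94
print(mse(theta0, theta1, xs, ys))  # lower than the cost of any other line
print(theta0 + theta1 * 5)          # prediction for a new instance, x = 5
```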

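This is how a typical model-based ML system works: you collect the data, select a model, train it on the data, and then apply it to make predictions on new cases.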



# Main Challenges of Machine Learning

Since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are "bad algorithm" and "bad data".

## Bad Data

### Insufficient Quantity of Training Data

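It takes a lot of data for most ML algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples.

> Research has shown that very different algorithms, including fairly simple ones, can perform almost equally well on a complex problem once they are given enough data. This suggests that data can matter more than the choice of algorithm. However, small and medium-sized datasets are still very common, and extra training data is not always easy or cheap to get, so algorithms still matter.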

### Non-representative Training Data

In order to generalize well, training data should be representative of the new cases you want to generalize to.

For example, if you train a linear model on a dataset that is missing some kinds of cases, adding the missing points can change the fitted line significantly. A model trained on the non-representative set is unlikely to make accurate predictions for the kinds of cases that were left out.

It is crucial to use a training set that is representative of the cases you want to generalize to. If the sample is too small, you will have *sampling noise* (non-representative data as a result of chance), but even very large samples can be non-representative if the sampling method is flawed. This is called *sampling bias*.

### Poor-Quality Data

If your training data is full of errors, outliers, and noise, it will be harder for the system to detect the underlying patterns.

It is often well worth the effort to clean up your training data. The truth is, most data scientists spend a significant part of their time doing just that.

There are various ways to clean up the data: filling in missing values (for example, with the median), discarding clear outliers or fixing them manually, or ignoring features or instances that are mostly missing.
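
One of these clean-up steps, filling in missing values with the median, can be sketched as follows. Representing a missing value as `None`, the function name, and the example ages are assumptions for illustration.

```python
import statistics

def fill_missing_with_median(values):
    """Replace missing entries (None) with the median of the known values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]

# Hypothetical feature column with two missing ages.
ages = [25, None, 31, 40, None, 28]
print(fill_missing_with_median(ages))  # [25, 29.5, 31, 40, 29.5, 28]
```

The median should be computed on the training set only and then reused to fill missing values in new data, so that the model sees consistent inputs.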

### Irrelevant Features

As the saying goes: garbage in, garbage out. Your system can only learn if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of an ML project is coming up with a good set of features to train on. This process is called *feature engineering*.

It involves the following steps:

- *Feature selection*: selecting the most useful features to train on among the existing features.
- *Feature extraction*: combining existing features to produce a more useful one.
- Creating new features by gathering new data.

## Bad Algorithms

### Overfitting the Training Data

*Overfitting* means that the model performs very well on training data, but it does not generalize well.

Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy or too small, the model is likely to detect patterns in the noise itself. Such patterns will not generalize to new instances.

Overfitting generally occurs when the model is too complex relative to the amount and noisiness of the training data. Here are possible solutions:

- Simplify the model.
- Gather more training data.
- Reduce the noise in the training data.

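Constraining a model to make it simpler and reduce the risk of overfitting is called *regularization*. For example, a linear model with one input feature has two parameters (an intercept and a slope), which gives the learning algorithm two degrees of freedom to adapt the model to the training data. If we force the slope to be zero, the algorithm has only one degree of freedom and can only move the line up or down. If we allow the slope to change but force it to stay small, the model ends up somewhere in between: simpler than with two free parameters, but more flexible than with just one.

The amount of regularization to apply during learning can be controlled by a *hyperparameter*. A hyperparameter is a parameter of the learning algorithm, not of the model. As such, it is not affected by the learning algorithm itself: it must be set prior to training and remains constant during training. *Tuning hyperparameters* is an important part of building an ML system.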
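
A minimal sketch of regularization on a one-feature linear model, assuming a ridge-style penalty `alpha * theta1**2` added to the sum of squared errors (the intercept is left unpenalized). Here `alpha` plays the role of the regularization hyperparameter; the function name and data are made up for illustration.

```python
def fit_ridge(xs, ys, alpha):
    """Fit y = theta0 + theta1 * x, penalizing theta1**2 with weight alpha."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = s_xy / (s_xx + alpha)  # alpha = 0 gives plain least squares
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

# Hypothetical training data.
xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]

for alpha in (0.0, 5.0, 1000.0):
    theta0, theta1 = fit_ridge(xs, ys, alpha)
    print(alpha, round(theta0, 3), round(theta1, 3))
# The slope shrinks from about 1.94 (alpha = 0) to about 0.97 (alpha = 5),
# and approaches 0 for very large alpha, leaving an almost flat line.
```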

### Underfitting the Training Data

*Underfitting* is the opposite of overfitting: it occurs when the model is too simple to learn the underlying structure of the data.

Here are some solutions:

- Select a more complex model, with more parameters.
- Feed better features to the learning algorithm.
- Reduce the constraints on the model (for example, by reducing the regularization hyperparameter).



# Testing and Validating

The only way to know how well a model will generalize to new cases is to actually try it out on new cases. A convenient way to do that is to split your data into two sets: the *training set* and the *test set*. As the names imply, you train your model using the training set and test it using the test set. The error rate on new cases is called the *generalization error* (or *out-of-sample error*), and evaluating your model on the test set gives you an estimate of it.

If the training error is low but the generalization error is high, your model is overfitting the training data.

> It is common to use 80% of the data for training and *hold out* 20% for testing, but this depends on the size of the dataset. If you have millions of examples, holding out 1% can be more than enough for testing.
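
A minimal sketch of such a split in plain Python, assuming the instances can be shuffled freely (which is not appropriate for time-ordered data, for example). The function name, the default ratio, and the fixed seed are choices made for this illustration.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and hold out a fraction of it for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split on every run
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))  # stand-in for 100 instances
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))  # 80 20
```

Fixing the seed keeps the test set stable across runs, so the model never gets to see test instances during a later training run on the same data.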

## Hyperparameter Tuning and Model Selection

Say you want to apply regularization to avoid overfitting. How do you choose the value of the regularization hyperparameter? One option is to train a model many times with different hyperparameter values and keep the value that gives the lowest error on the test set. Say the best model has a 5% generalization error on the test set, but once in production it produces about 15% error.

The problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model *for that particular test set*. As a result, the model is unlikely to perform as well on new data.

A common solution is *holdout validation*: you simply hold out part of the training set to evaluate several candidate models and hyperparameter values. This held-out set is called the *validation set* (or *development set*, or *dev set*). More specifically, you train multiple models with various hyperparameter values on the reduced training set (the training set minus the validation set), then select the model that performs best on the validation set. Finally, you train that model on the full training set (including the validation set) and evaluate it on the test set to estimate the generalization error.
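
The procedure can be sketched as follows, assuming a one-feature linear model with a ridge-style penalty whose weight `alpha` is the hyperparameter being tuned. The data, the candidate values, and the function names are all made up for illustration.

```python
def fit_ridge(points, alpha):
    """Fit y = theta0 + theta1 * x, penalizing theta1**2 with weight alpha."""
    xs, ys = zip(*points)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = s_xy / (s_xx + alpha)
    return mean_y - theta1 * mean_x, theta1

def mse(model, points):
    """Mean squared error of a (theta0, theta1) model on the given points."""
    theta0, theta1 = model
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in points) / len(points)

def holdout_select(train_set, valid_set, alphas):
    """Pick the alpha with the lowest validation error, then retrain on all the data."""
    best_alpha = min(alphas, key=lambda a: mse(fit_ridge(train_set, a), valid_set))
    final_model = fit_ridge(train_set + valid_set, best_alpha)
    return best_alpha, final_model

# Hypothetical noisy training data and a small validation set.
reduced_train = [(1, 2.0), (2, 5.5), (3, 6.5), (4, 10.0)]
validation = [(0, 1.0), (5, 11.0)]

best_alpha, final_model = holdout_select(reduced_train, validation, [0.0, 1.0, 10.0])
print(best_alpha)  # 1.0
```

The test set is not touched anywhere in this selection step; it is only used once at the end to estimate the generalization error of `final_model`.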

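This solution usually works quite well. However, if the validation set is too small or not representative, the model evaluations will be imprecise, and you may select a suboptimal model by mistake. Conversely, if the validation set is too large, the remaining training set will be much smaller than the full training set, so the candidate models are compared under conditions quite different from the final training. One way to address this is repeated *cross-validation*, using many small validation sets. Each model is evaluated once per validation set after being trained on the rest of the data, and averaging all the evaluations gives a much more accurate measure of its performance.

> The drawback is that training time is multiplied by the number of validation sets.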
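
A minimal sketch of k-fold cross-validation, one common way to build these validation sets. To keep it short, the "model" simply predicts the mean of its training values and the error is the mean absolute error; these choices, the data, and the function names are assumptions for illustration.

```python
def k_fold_splits(data, k):
    """Yield (train, validation) pairs; each instance is validated exactly once."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, folds[i]

def cross_val_error(data, k, fit, error):
    """Train k times, each time validating on a different fold, and average the errors."""
    scores = [error(fit(train), valid) for train, valid in k_fold_splits(data, k)]
    return sum(scores) / len(scores)

def fit_mean(train):
    """Toy model: always predict the mean of the training values."""
    return sum(train) / len(train)

def mean_abs_error(prediction, valid):
    return sum(abs(prediction - y) for y in valid) / len(valid)

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cross_val_error(values, k=3, fit=fit_mean, error=mean_abs_error))  # 1.5
```

With `k=3` the model is trained three times, which is the multiplied training cost mentioned above.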

## Data Mismatch

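It is sometimes easy to get a large amount of training data, but this data may not be perfectly representative of the data your model will see in production.

For example, suppose you want to build a mobile app that recognizes flower species. You can easily download many pictures of flowers from the web, but they will not be perfectly representative of pictures actually taken with the app on a mobile device. So you also collect some pictures taken with the app.

The most important rule to remember is that the validation set and the test set must be as representative as possible of the data you expect in production. So they should consist only of representative pictures: shuffle them and put half in the validation set and half in the test set, making sure that no duplicates or near-duplicates end up in both sets.

If your model then performs poorly on the validation set, you will not know whether this is because it has overfit the training set or because of the mismatch between the web pictures and the mobile app pictures. One solution is to hold out some of the training pictures (from the web) in yet another set, called the *train-dev set*. Train the model on the training set (without the train-dev set), then evaluate it on the train-dev set. If it performs poorly there, it must have overfit the training set, so you should try to simplify or regularize the model, get more training data, or clean up the training data. If it performs well on the train-dev set but poorly on the validation set, the problem must come from the data mismatch. You can tackle this by preprocessing the web pictures to make them look more like pictures taken by the mobile app, and then retraining the model.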
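
The reasoning behind the train-dev set can be written down as a small decision rule over three error rates. The `gap` threshold of 5 percentage points and the function name are arbitrary assumptions for illustration; what counts as a significant gap depends on the problem.

```python
def diagnose(train_error, train_dev_error, dev_error, gap=0.05):
    """Compare error rates to guess why a model does poorly on the dev set."""
    if train_dev_error - train_error > gap:
        # Worse on unseen data from the SAME distribution as the training set.
        return "overfitting"
    if dev_error - train_dev_error > gap:
        # Fine on held-out web pictures, worse on mobile app pictures.
        return "data mismatch"
    return "no obvious problem"

print(diagnose(train_error=0.02, train_dev_error=0.15, dev_error=0.16))  # overfitting
print(diagnose(train_error=0.02, train_dev_error=0.03, dev_error=0.20))  # data mismatch
print(diagnose(train_error=0.02, train_dev_error=0.03, dev_error=0.04))  # no obvious problem
```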