**Table of contents**<a id='toc0_'></a>    
- [Different types of Machine Learning](#toc1_)    
  - [Supervised learning](#toc1_1_)    
  - [Unsupervised learning](#toc1_2_)    
  - [Semi-supervised learning](#toc1_3_)    
  - [Self-supervised learning](#toc1_4_)    
  - [Reinforcement learning](#toc1_5_)    
- [Online learning and Batch learning](#toc2_)    
- [Instance-based learning and Model-based learning](#toc3_)    
- [Main challenges of Machine Learning](#toc4_)    
- [Model evaluation, Hyperparameter tuning and Model selection](#toc5_)    
  - [Testing and validating](#toc5_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=5
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Different types of Machine Learning](#toc0_)

Three of the most common types of ML are supervised, unsupervised, and reinforcement learning (image source: ML with PyTorch and Scikit-Learn, Seb Raschka).

<img src="./imgs/the_three_different_types_of_machine_learning.png">

### <a id='toc1_1_'></a>[Supervised learning](#toc0_)

Supervised learning is used in scenarios where the data is already labeled and the goal is to train a model to learn the relationship between the input features and the target. The target is the label that we are trying to predict, such as whether an email is spam or not, whether a tumor is malignant or benign, or the price of a house.

<u>**Naming conventions**</u>

- **Features** are also known as predictors, inputs, attributes, independent variables etc.
- **Target** is also known as response, outcome, label, dependent variable etc.

<u>**Different types of supervised learning problems**</u>

- `Classification:` classification is a supervised learning task where the target variable is categorical, such as spam or not spam, or malignant or benign. The goal is to train a model to learn the relationship between the input features and the target which can then be used to predict the target variable for new unlabeled data.

- `Regression:` regression is a supervised learning task where the target variable is continuous, such as house price or stock price prediction. 

- `Logistic regression:` despite its name, logistic regression is an algorithm for classification rather than a separate type of problem. In its basic (binary) form, the dependent variable is coded as 1 (yes, success, etc.) or 0 (no, failure, etc.), and the model predicts P(Y=1) as a function of X. It is often used for classification tasks such as spam detection.
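
As a rough illustration of "predicts P(Y=1) as a function of X", the sketch below passes a weighted sum of the features through the sigmoid function. The weights, bias, and feature values are hypothetical, hand-picked numbers, not parameters learned from data.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Return the modelled P(Y=1 | x) for one feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical, hand-picked parameters (assumption: not learned from data).
weights, bias = [1.5, -2.0], 0.25
p = predict_proba([2.0, 0.5], weights, bias)

# Thresholding the probability at 0.5 turns it into a class label.
label = int(p >= 0.5)
```

The 0.5 threshold is a common default, but it is a choice; a spam filter might use a higher threshold to avoid flagging legitimate email.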

### <a id='toc1_2_'></a>[Unsupervised learning](#toc0_)

In unsupervised learning, the data is unlabeled and the goal is to model the underlying structure or distribution in the data in order to learn more about the data.

<u>**Different types of unsupervised learning problems**</u>

- `Clustering:` clustering is an unsupervised learning task where the goal is to group the data into distinct clusters such that observations within a cluster are very similar to each other, whereas observations in different clusters are very different from each other. Clustering is often used for exploratory analysis and/or as a component of a preprocessing pipeline in order to get the labels for a supervised learning task.

- `Dimensionality reduction:` dimensionality reduction is an unsupervised learning task where the goal is to reduce the dimensionality (i.e., the number of features) of a data set by extracting a new set of features that preserves the most important information. It often does so by combining highly correlated features to form a smaller set of features that are more easily interpreted. Dimensionality reduction is often used as a component of a preprocessing pipeline.

- `Association rule learning:` association rule learning is an unsupervised learning task where the goal is to discover rules that describe the relationship between variables in a data set. One example of association rule learning is market basket analysis. Market basket analysis is used to analyze customer behavior by finding associations between the different items that customers place in their “shopping baskets” while shopping at a supermarket or an online store. For example, a supermarket might discover that customers who buy butter and eggs also tend to buy bread, so they can put butter, eggs, and bread close to each other to increase sales.

- `Anomaly detection:` anomaly detection (also called outlier detection) is an unsupervised learning task where the goal is to identify observations that are significantly different from the rest of the data. For example, anomaly detection is used to detect fraud or defective items in manufacturing. The training data for anomaly detection is often highly imbalanced (i.e., most observations are not anomalies), so the model mostly learns what normal observations look like. When it then sees a new observation, it can output whether it is an anomaly or not. Anomaly detection is often used to automatically remove outliers from a data set before another model is applied.

- `Novelty detection:` novelty detection is similar to anomaly detection, except that the model is trained only on normal data and is then expected to flag new observations that look different from everything in the training set. The training set should therefore not contain any instance that you would like to classify as a novelty.
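
A minimal sketch of the novelty-detection setup described above: the model is fitted on normal data only and then asked about new observations. It assumes a simple z-score rule with a threshold of 3 standard deviations, which is just one of many possible detectors; the sensor readings are made up.

```python
import statistics

def fit_normal_profile(normal_values):
    """Learn the mean and standard deviation of 'normal' training data."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def is_novelty(value, mean, std, threshold=3.0):
    """Flag a value whose z-score exceeds the (assumed) threshold."""
    return abs(value - mean) / std > threshold

# The training data contains only normal readings (made-up numbers).
normal_readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
mean, std = fit_normal_profile(normal_readings)

# Ask the fitted profile about two new observations.
flags = [is_novelty(v, mean, std) for v in (20.0, 25.0)]
```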

Unsupervised learning can be a goal in itself, such as in clustering, or it can be used as a preprocessing step for supervised learning, such as dimensionality reduction with PCA.
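
To make clustering concrete, here is a small sketch of k-means on one-dimensional data. K-means is only one clustering algorithm among many, and the data points and starting centroids are made up for illustration.

```python
def k_means_1d(points, centroids, n_iters=10):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(n_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]  # made-up data with two obvious groups
centroids, clusters = k_means_1d(points, centroids=[0.0, 6.0])
```

No labels are involved at any point: the groups emerge purely from the distances between observations.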

### <a id='toc1_3_'></a>[Semi-supervised learning](#toc0_)

In semi-supervised learning, the data is a mix of labeled and unlabeled examples. Most semi-supervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, a clustering algorithm may be used to group similar instances together, and then every unlabeled instance can be labeled with the most common label in its cluster. Once the whole dataset is labeled, it is possible to use any supervised learning algorithm.
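
The label-propagation step described above can be sketched as follows. The cluster assignments are assumed to come from some clustering algorithm that has already been run; the function name and the example data are hypothetical.

```python
from collections import Counter

def propagate_labels(cluster_ids, labels):
    """Fill in missing labels (None) with the most common label in each cluster."""
    # Count the known labels per cluster.
    known = {}
    for cid, label in zip(cluster_ids, labels):
        if label is not None:
            known.setdefault(cid, Counter())[label] += 1
    # Give every unlabeled instance its cluster's majority label.
    # Instances in a cluster without any labeled member stay None.
    return [
        label if label is not None
        else (known[cid].most_common(1)[0][0] if cid in known else None)
        for cid, label in zip(cluster_ids, labels)
    ]

cluster_ids = [0, 0, 0, 1, 1, 1, 2]
labels = ["spam", None, "spam", None, "ham", None, None]
full_labels = propagate_labels(cluster_ids, labels)
```

The last instance sits in a cluster with no labeled member, so it stays unlabeled; in practice such instances would need manual labeling or would be left out of the supervised step.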

### <a id='toc1_4_'></a>[Self-supervised learning](#toc0_)

Self-supervised learning sits between supervised and unsupervised learning: the labels are generated automatically from the unlabeled data itself. First the model is trained without human-provided labels on a task whose targets can be created automatically (e.g., image colorization, predicting the missing part of an image, etc.). Then, the model is fine-tuned using supervised learning to perform the actual task of interest (e.g., classification). This makes it possible to train a good model using only a small amount of labeled training data, as long as there is a lot of unlabeled training data available.

### <a id='toc1_5_'></a>[Reinforcement learning](#toc0_)

Reinforcement learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
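
The agent, action, reward, and policy vocabulary can be illustrated with a deliberately tiny sketch. It assumes a toy environment with two states, two actions, and an immediate reward of +1 or -1. Full reinforcement learning also has to handle delayed rewards, which this sketch ignores. All names and numbers are hypothetical.

```python
import random

def reward(state, action):
    """Toy environment: action 1 is right in state 0, action 0 in state 1."""
    best_action = 1 if state == 0 else 0
    return 1.0 if action == best_action else -1.0

def train_agent(n_steps=200, alpha=0.5, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # estimated reward per (state, action)
    for step in range(n_steps):
        state = step % 2                 # the environment alternates states
        if rng.random() < epsilon:       # explore: try a random action
            action = rng.randrange(2)
        else:                            # exploit: best-looking action so far
            action = max((0, 1), key=lambda a: q[state][a])
        r = reward(state, action)
        # Move the estimate a fraction alpha toward the observed reward.
        q[state][action] += alpha * (r - q[state][action])
    # The learned policy maps each state to its highest-valued action.
    return {s: max((0, 1), key=lambda a: q[s][a]) for s in q}

policy = train_agent()
```

Nobody tells the agent which action is correct; it only sees rewards and penalties and derives the policy from them.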

## <a id='toc2_'></a>[Online learning and Batch learning](#toc0_)

`Batch Learning:` Batch learning is also known as offline learning. In batch learning, the system is incapable of learning incrementally (i.e., the model can't be updated as new data comes in). Instead, it must be trained using all the available data.

The advantage of batch learning is that it is generally simple (the whole process of training, evaluating, and launching a model can be automated) and the resulting model is usually accurate and stable, since it only changes when you decide to launch a new version.

The downside is that the model needs to be retrained from scratch every time new data is available. This can be very time consuming and/or expensive, so it is typically done offline at regular intervals. Model retraining frequency depends on the model's performance decay rate.

`Online Learning:` In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the ***learning rate***. If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data (and you don’t want a spam filter to flag only the latest kinds of spam it was shown). Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points (outliers).
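
A minimal sketch of incremental learning with a learning rate, assuming a one-feature linear model trained with per-instance gradient steps on the squared error. The `stream` generator is a hypothetical stand-in for a live data source and produces noiseless data from y = 2x + 1.

```python
def online_update(w, b, x, y, learning_rate):
    """One incremental step on a single instance (squared-error gradient)."""
    error = (w * x + b) - y
    w -= learning_rate * error * x
    b -= learning_rate * error
    return w, b

def stream():
    """Stand-in for a live data source; yields (x, y) with y = 2x + 1."""
    for _ in range(500):
        for i in range(11):
            x = i / 10
            yield x, 2 * x + 1

w, b = 0.0, 0.0
for x, y in stream():
    # Each instance is used once, as it arrives, and is then discarded.
    w, b = online_update(w, b, x, y, learning_rate=0.1)
```

Each step moves the parameters by an amount proportional to `learning_rate`, so a larger value makes the model follow recent instances more closely, while a smaller value makes it change more slowly.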

The main challenge with online learning is that if bad data is fed to the model, the model's performance will decline, often quickly (depending on the data quality and learning rate). To deal with this, it is advised to monitor the input data and react to abnormal data (e.g., using anomaly detection algorithms). Also monitor the model's performance and, if it drops, roll back to a previously working state.

Online learning algorithms can also be used to train models on huge datasets that cannot fit in one machine’s main memory; this is called ***out-of-core learning***. Out-of-core learning is usually done offline and not on the live system.

## <a id='toc3_'></a>[Instance-based learning and Model-based learning](#toc0_)

`Instance-based learning:` In instance-based learning, the system learns the examples by heart, then generalizes to new cases using a similarity measure to compare them to the learned examples or a subset of them.

Instance-based learning is also known as memory-based learning or lazy learning.

Examples of popular instance-based learning algorithms:

- K-nearest neighbors (KNN)
- Kernel support vector machines (SVMs), which keep a subset of the training instances (the support vectors) for prediction
- Gaussian processes
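
As an illustration of the first item, here is a small k-nearest-neighbors classifier. It assumes Euclidean distance as the similarity measure and a simple majority vote; the training points and labels are made up.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Learning by heart": the stored training set itself is the model.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(train_points, train_labels, query=(1.1, 1.0))
```

There is no training step at all; the work happens at prediction time, which is why this family is called lazy learning.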


`Model-based learning:` Model-based learning algorithms learn the underlying relationships and patterns in the data by creating a mathematical representation of the data. The model can then be used to make predictions on new data.

Examples of popular model-based learning algorithms:

- Linear regression
- Logistic regression
- Decision trees
- Random forests
- Neural networks
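
For contrast with instance-based learning, the sketch below fits a one-feature linear regression by ordinary least squares. After fitting, the training data can be thrown away: predictions only need the two learned parameters. The data is made up and lies exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Use only the learned parameters, not the training data."""
    return slope * x + intercept
```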


## <a id='toc4_'></a>[Main challenges of Machine Learning](#toc0_)

1. Insufficient quantity of training data.
2. Non-representative training data.
3. Poor quality data.
4. Irrelevant features.
5. Overfitting the training data (Usually overfitting happens when the model is too complex relative to the amount and noisiness of the training data).
6. Underfitting the training data (This happens when the model is too simple to learn the underlying structure of the data).

## <a id='toc5_'></a>[Model evaluation, Hyperparameter tuning and Model selection](#toc0_)

`Hyperparameter:` A hyperparameter is a parameter of a learning algorithm (not of the model). The value of a hyperparameter has to be set before the training process begins. A hyperparameter controls the learning process and the model's capacity. 

`Regularization:` Regularization is a technique used to reduce overfitting by constraining a model’s complexity. For example, a simple way to regularize a polynomial regression model is to reduce its polynomial degree. The amount of regularization to apply during learning can be controlled by a hyperparameter called the regularization hyperparameter.
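
To show how a regularization hyperparameter constrains a model, the sketch below uses an L2 (ridge) penalty on a one-parameter linear model y = w * x. Note that this is a different technique from the example in the text. The parameter name `l2_penalty` and the data are assumptions for illustration.

```python
def fit_slope(xs, ys, l2_penalty=0.0):
    """Least-squares slope for y = w * x, with an optional L2 penalty on w.

    Minimizes sum((w * x - y) ** 2) + l2_penalty * w ** 2.
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2_penalty)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_unregularized = fit_slope(xs, ys)                 # penalty switched off
w_regularized = fit_slope(xs, ys, l2_penalty=14.0)  # penalty pulls w toward 0
```

The penalty is set before training and is not learned from the data, which is what makes it a hyperparameter rather than a model parameter.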

`Generalization error:` The generalization error is the error rate on new cases (i.e., cases that were not used for training the model). It is also known as ***out-of-sample error***.

`Performance measure:` To know how good a model is, we need to specify a performance measure. You can either define a ***utility function (or fitness function)*** that measures how good your model is, or you can define a ***cost function*** that measures how bad it is.

`Loss function:` Often used synonymously with a ***cost function***. Sometimes the loss function is also called an ***error function***. In some literature, the term “loss” refers to the loss measured for a single data point, and the “cost” is a measurement that computes the loss (average or summed) over the entire dataset.
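
Under the convention where "loss" is per data point and "cost" is over the dataset, the distinction can be sketched with the squared error. This is one possible choice of loss; the values are made up.

```python
def squared_loss(y_true, y_pred):
    """Loss: the error measured for a single data point."""
    return (y_true - y_pred) ** 2

def mse_cost(y_true, y_pred):
    """Cost: the per-point losses averaged over the entire dataset."""
    losses = [squared_loss(t, p) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
cost = mse_cost(y_true, y_pred)  # mean of the per-point losses 1, 0 and 4
```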

### <a id='toc5_1_'></a>[Testing and validating](#toc0_)

`Training set:` The training set is the dataset used to train the model.

`Test set:` The test set is the dataset used to test the model's performance on new data. This gives an estimate of the generalization error.

`Validation set:` Once a particular learning algorithm has been selected, we want to fine-tune the hyperparameters to get the best possible model. Comparing different candidate models (i.e., models with different hyperparameter values) on the same test set would select the model that performs best on that particular test set, but this model may not perform well on new unseen data. To avoid this, we hold out part of the training set as a ***validation set***. The validation set is used *to compare the performance of different candidate models and select the best performing model*. Once the best performing model is selected, it is retrained on the full training set (i.e., the reduced training set plus the validation set) and the generalization error is measured on the test set. The validation set is also known as the ***dev set*** or ***development set***.
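
A minimal sketch of a three-way hold-out split. The 60/20/20 proportions and the fixed seed are assumptions chosen for illustration, not recommendations from the text.

```python
import random

def train_val_test_split(data, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then carve out test, validation and training subsets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

data = list(range(100))
train, val, test = train_val_test_split(data)

# Candidate models are fitted on `train` and compared on `val`;
# `test` is touched only once, for the final generalization estimate.
```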

`Repeated cross-validation:` If the validation set is too small, the model evaluation may be imprecise. If the validation set is too large, then the remaining training set will be much smaller than the full training set, so comparing candidate models trained on it will not be a fair indication of how the final model behaves. To solve this problem, we can use ***repeated cross-validation***, which uses many small validation sets. Each candidate model is evaluated once per validation set after it is trained on the rest of the data. By averaging out all the evaluations of a model, you get a much more accurate measure of its performance. There is a drawback, however: the training time is multiplied by the number of validation sets.
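
One common way to obtain many small validation sets is k-fold cross-validation, sketched below. The `evaluate` callback is hypothetical: it is assumed to train a candidate model on the training indices and return its score on the validation indices.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, val_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    for fold in range(n_folds):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        stop = n_samples if fold == n_folds - 1 else start + fold_size
        val_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, val_idx

def average_validation_score(evaluate, n_samples, n_folds):
    """Average a model's score over all folds (one training run per fold)."""
    scores = [evaluate(tr, va) for tr, va in k_fold_splits(n_samples, n_folds)]
    return sum(scores) / len(scores)

splits = list(k_fold_splits(n_samples=10, n_folds=5))
```

Because `evaluate` is called once per fold, the cost of evaluating a candidate grows linearly with the number of folds, which is the drawback mentioned above.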

`Data mismatch and the train-dev set:` Data mismatch is when the training data is different from the data used in production. This can happen for many reasons, such as the data being sampled from different distributions or at different times, and it can result in the model performing poorly in production. To detect and limit this, the validation set and the test set should be as representative as possible of the data used in production.

When real data is scarce, you may use similar abundant data for training and hold out some of it in a ***train-dev set*** to evaluate overfitting; the real data is then used to evaluate data mismatch (dev set) and to evaluate the final model’s performance (test set).