# A

**Anomaly detection**

---

Picking out new, anomalous data points based on training with many "normal" instances

<img src="img/anomaly_detection.png" width="25%">

Typically, the challenge with this task is gathering a very "clean" training set that is representative of most normal values
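A minimal sketch of the idea, assuming a simple z-score rule (only one of many possible detectors). The names `fit_normal_profile` and `is_anomaly`, the threshold of 3 standard deviations and the readings are all made up for illustration:

```python
import statistics


def fit_normal_profile(normal_values):
    """Learn the mean and standard deviation of the 'normal' training data."""
    return statistics.mean(normal_values), statistics.pstdev(normal_values)


def is_anomaly(value, profile, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mean, std = profile
    return abs(value - mean) > threshold * std


# Illustrative, made-up "clean" training set of normal readings
normal_readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
profile = fit_normal_profile(normal_readings)

print(is_anomaly(10.1, profile))  # False: a typical value
print(is_anomaly(15.0, profile))  # True: far from anything seen in training
```

If the training set itself contained anomalies, the learned mean and standard deviation would be distorted, which is why a clean training set matters.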

**Association rule learning**

---

Digging into large amounts of data and discovering interesting **trends between features**

e.g. discovering that customers who buy BBQ sauce also tend to buy burgers

**Attribute**

---

A data type/column present in the data

e.g. "mileage"

# B

**Batch learning**

---

The system is incapable of learning incrementally - it must be trained with **all available data**

This requires a lot of time and resources, so it is typically done **"offline"**. A static model is then brought into production and runs without learning any further

<img src="img/batch_learning.png" width="25%" >

# C

**Cost function**

---

A function that measures how **badly** a model performs, so that training can improve the model by minimising it

**Cross-validation**

---

**Splitting** the validation data into many **smaller sets** and evaluating the model once against each subset, after training it on the rest of the data. The model's errors can then be averaged to get a more reliable measure of its performance than a single validation set would give.

However, the training time is now multiplied by the number of subsets

# D

**Data mining**

---

Digging into large amounts of data and discovering **patterns** that were *not immediately apparent* (with ML techniques)

**Dimensionality reduction**

---

Simplifying a dataset without losing too much information. The goal is to improve performance and make the dataset easier to visualise

# E

# F

**Feature extraction**

---

Merging multiple **related** columns into one feature that represents them all

e.g. a car's mileage may be strongly related to its age, so we could extract these two columns into one representing its general "wear and tear"

**Feature**

---

An **attribute** plus its value

e.g. "mileage = 15,000"

# G

# H

**Hyperparameter tuning**

---

Once we have chosen a model, we need to choose its hyperparameters (i.e. the parameters of the learning algorithm, which are set before training rather than learned from the data).

This is done by holding out part of the training set as a validation set, evaluating the model against it and tweaking the values (either automatically or manually) until the lowest validation error is found. Then, this optimal model needs to be checked against the test set, which played no part in the tuning, to ensure that its hyperparameters aren't simply tuned for the validation set -> this is called **holdout validation**
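A rough sketch of the data split behind holdout validation, assuming the common convention that hyperparameters are tuned on the validation set and the test set is kept for one final check. The function name `holdout_split` and the 60/20/20 ratios are illustrative choices:

```python
import random


def holdout_split(instances, validation_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle the data and split it into train, validation and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_ratio)
    n_validation = int(len(shuffled) * validation_ratio)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


train, validation, test = holdout_split(range(10))
print(len(train), len(validation), len(test))  # 6 2 2
```

Each candidate set of hyperparameters would be trained on `train` and compared on `validation`; `test` is only touched once, at the end.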

# I

**Instance-based learning**

---

The system learns the training examples off by heart and then uses a **"similarity measure"** to compare new instances to the ones it has stored

<img src="img/instance_based_learning.png" width="25%" >
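A minimal sketch, assuming Euclidean distance as the similarity measure and a single nearest neighbour (the simplest instance-based learner). The function names and the toy points are made up for illustration:

```python
import math


def euclidean_distance(a, b):
    """Smaller distance means more similar instances."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_nearest(training_set, new_instance):
    """Return the label of the stored instance closest to `new_instance`."""
    _, label = min(
        training_set,
        key=lambda pair: euclidean_distance(pair[0], new_instance),
    )
    return label


# The "learned" model is simply the stored training instances and their labels
training_set = [
    ((1.0, 1.0), "triangle"),
    ((1.2, 0.8), "triangle"),
    ((5.0, 5.0), "square"),
    ((5.5, 4.5), "square"),
]

print(predict_nearest(training_set, (1.1, 0.9)))  # triangle
print(predict_nearest(training_set, (5.2, 4.9)))  # square
```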

# J

# K

**K-fold cross-validation**

---

Instead of relying on a single train/validation split, we can randomly split the training dataset into $k$ distinct subsets called folds.

K-fold cross-validation will then train and evaluate the model $k$ times, each time holding back a different fold for validation and using the other $k - 1$ folds as training data.

<img src="img/k_fold_cross_validation.jpg" width="25%">
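A sketch of how the folds could be generated, using only index lists. The helper name `k_fold_splits` and the fixed seed are assumptions for illustration; libraries such as scikit-learn provide their own splitters:

```python
import random


def k_fold_splits(n_instances, k, seed=0):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        # Training data is every fold except the one held back
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for train, validation in k_fold_splits(n_instances=10, k=5):
    print(sorted(validation), "held out;", len(train), "instances used for training")
```

Every instance is used for validation exactly once, and the $k$ validation scores can then be averaged.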

# L

**Learning rate**

---

How fast a model should adapt to changing data

High learning rate = system will rapidly adapt *but* will **forget** old data quickly

Low learning rate = system will adapt more slowly *but* will be **less sensitive** to outliers
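A toy sketch of this trade-off, assuming the "model" is just a running estimate that moves a fraction `learning_rate` of the way towards each new value. The function name `online_estimate` and the stream are made up for illustration:

```python
def online_estimate(stream, learning_rate, initial=0.0):
    """Update a running estimate one value at a time."""
    estimate = initial
    for value in stream:
        # Step towards the new value; a larger learning rate means a larger step
        estimate += learning_rate * (value - estimate)
    return estimate


# Made-up stream: stable around 10, then a single outlier of 100
stream = [10.0] * 20 + [100.0]

fast = online_estimate(stream, learning_rate=0.9, initial=10.0)
slow = online_estimate(stream, learning_rate=0.1, initial=10.0)

print(fast)  # roughly 91: the old data is almost forgotten after one outlier
print(slow)  # roughly 19: the outlier has far less influence
```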

# M

**Machine learning**

---

The science (and art) of programming computers so they can *learn from data*.

Traditionally, we write programs with a series of hard-coded rules based on data. Machine learning is an alternate approach that allows the computer to define these rules for itself and adjust them as the data changes

The main data challenges of ML are:
* Insufficient **quantity** of data -> typically need thousands or millions of samples, data is usually more important than algorithm
* **Nonrepresentative** training data -> in order to generalise well, you need training data that is representative of the whole population
* **Poor quality** data -> data needs to be cleansed of outliers or errors so the model doesn't try to account for them
* Irrelevant features -> rubbish in, rubbish out - need to keep only relevant features and discard unrelated ones

The main algorithm challenges of ML are:
* Overfitting
* Underfitting
* The trade-off between validation and training time

**Model**

---

Can refer to
* The type of model (e.g. Linear Regression)
* A fully specified model architecture (e.g. Linear Regression with one input and one output)
* The final trained model ready to be used for predictions (e.g. Linear Regression with one input and one output, using the learned parameters $\theta_0 = 4.85$ and $\theta_1 = 4.91 \times 10^{-5}$)

**Model-based learning**

---

An alternative to instance-based learning for generalising from a set of examples is to create a model that can **"predict"** new instances

<img src="img/model_based_learning.png" width="25%">

**Multiple regression**

---

When the system uses **multiple features** to make a prediction

**Multivariate regression**

---

A regression task where the goal is to predict **multiple values** per set of inputs

**Mean absolute error (MAE)**

---

An equation that measures the average error a model makes. It is similar to RMSE but gives less weight to large errors, because the errors are not squared before averaging.

Used when outliers are common

$MAE\left(X,h\right)=\frac{1}{m}\sum_{i=1}^{m}\left|h\left(x^{\left(i\right)}\right)-y^{\left(i\right)}\right|$
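The formula above translates almost directly into code. A minimal sketch, where `predictions` stands for the values $h(x^{(i)})$ and `labels` for the values $y^{(i)}$ (the function name and data are illustrative):

```python
def mean_absolute_error(predictions, labels):
    """Average of the absolute differences between predictions and labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    m = len(labels)
    return sum(abs(h - y) for h, y in zip(predictions, labels)) / m


predictions = [2.0, 4.0, 6.0, 8.0]
labels = [1.0, 4.0, 6.0, 18.0]  # the last label is an outlier

print(mean_absolute_error(predictions, labels))  # (1 + 0 + 0 + 10) / 4 = 2.75
```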

**Min-max scaling**

---

Min-max scaling (or normalisation) is where values are shifted and rescaled so they lie between 0 and 1.

It works by subtracting the min from the current value and dividing by the difference between the max and the min.

This always guarantees the range of the values, but can be severely affected by outliers
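A minimal sketch, assuming the usual formula $(x - min) / (max - min)$. The function name `min_max_scale` and the mileage values are made up for illustration:

```python
def min_max_scale(values):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("cannot scale a column where every value is identical")
    return [(v - lowest) / (highest - lowest) for v in values]


mileages = [10_000, 15_000, 20_000, 30_000]
print(min_max_scale(mileages))  # [0.0, 0.25, 0.5, 1.0]

# A single extreme outlier squashes every other value towards 0
with_outlier = [10_000, 15_000, 20_000, 1_000_000]
print(min_max_scale(with_outlier))
```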

# N

# O

**Online learning**

---

The system can be trained incrementally by feeding it data instances sequentially, either individually or in small "mini-batches"

Each learning step must be fast and cheap so that it can learn on the fly

Best used for quickly-changing data (such as stocks)

<img src="img/online_learning.png" width="25%" >

**Overfitting**

---

When a model replicates its training dataset too closely and **loses the ability to generalise**. It happens when the model is **too complex** relative to the amount and noisiness of the training data.

You can tell a model is overfitting when the training error is low but the validation (or generalisation) error is high

<img src="img/overfitting.png" width="25%" >

Solutions:
* Simplify the model by choosing one with fewer parameters (e.g. linear rather than polynomial)
* Reduce the number of attributes in the training dataset
* Restrict the model's parameters (regularisation)
* Gather more training data
* Reduce noise in training data (e.g. fix errors and remove outliers)

**One-hot encoding**

---

Creates one binary attribute (column) per text category that will be 1 if that category is true, or 0 if it is not.

Standard way of encoding text categories as numbers.

<img src="img/one_hot_encoding.jpg" width="25%">
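A small sketch of the encoding, assuming the columns are ordered alphabetically by category name (an arbitrary choice). The function name `one_hot_encode` and the colour values are illustrative:

```python
def one_hot_encode(categories):
    """Return the column names and one binary row per input value."""
    names = sorted(set(categories))
    rows = [[1 if value == name else 0 for name in names] for value in categories]
    return names, rows


names, rows = one_hot_encode(["red", "green", "blue", "green"])

print(names)  # ['blue', 'green', 'red']
for row in rows:
    print(row)
# [0, 0, 1]
# [0, 1, 0]
# [1, 0, 0]
# [0, 1, 0]
```

Exactly one column is 1 in each row, so no artificial ordering between categories is implied.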

# P

**Pipeline**

---

A sequence of data processing components to transform data from the source to the end goal

<img src="img/pipeline.png" width="25%">

# Q

# R

**Regularisation**

---

Constraining a model to make it simpler and reduce the risk of **overfitting**

e.g. force the slope of a linear model to be smaller to reduce degrees of freedom -> this may fit worse with training data but be more generalisable

<img src="img/regularisation.png" width="25%">

*in the above image, circles are training data and squares are new data*
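A minimal sketch of the slope example, assuming ridge-style (L2) regularisation of a one-feature linear model, where a penalty `alpha` on the slope is added to the squared error. The function name `fit_ridge_line`, the value of `alpha` and the data are made up for illustration:

```python
def fit_ridge_line(xs, ys, alpha=0.0):
    """Fit y = intercept + slope * x, penalising large slopes when alpha > 0."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    # alpha = 0 gives ordinary least squares; a larger alpha shrinks the slope
    slope = s_xy / (s_xx + alpha)
    intercept = y_mean - slope * x_mean
    return intercept, slope


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.0]

_, plain_slope = fit_ridge_line(xs, ys)
_, regularised_slope = fit_ridge_line(xs, ys, alpha=5.0)

print(plain_slope)        # roughly 1.98
print(regularised_slope)  # roughly 0.99, a flatter and more constrained line
```

Here `alpha` is a hyperparameter: it controls how strongly the model is constrained and is set before training.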

**Reinforcement learning**

---

Where the learning system (agent) observes the environment, carries out actions and then receives either a **reward** or a **penalty**. It must then learn the best strategy over time to maximise the reward

<img src="img/reinforcement.png" width="25%" />

e.g. AIs can analyse historical results and then play millions of games against themselves to find the best policy to win

**Root Mean Square Error (RMSE)**

---

An equation that measures the **average error** a system makes in its predictions and **gives more weight to higher errors** (because each error is squared before averaging).

Used when outliers are rare.

$RMSE\left(X,h\right)=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h\left(x^{\left(i\right)}\right)-y^{\left(i\right)}\right)^2}$

Where $m$ is the number of instances in the validation dataset that you are evaluating RMSE on

$x^{(i)}$ is a vector with all the feature values for the $i^{th}$ instance

e.g. $x^{(1)}$ might be $\left(\begin{matrix}-118.29\\33.91\\1,416\\38,372\\\end{matrix}\right)$ (the first item in the dataset)

$y^{(i)}$ is simply the label (the desired output for that instance)

Finally, $X$ is a matrix containing all the feature values and $h$ is the prediction function that gives us the value to evaluate
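A minimal sketch of the calculation, where `predictions` stands for the values $h(x^{(i)})$ and `labels` for the values $y^{(i)}$ (the function name and data are illustrative):

```python
import math


def root_mean_square_error(predictions, labels):
    """Square root of the average squared difference between predictions and labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    m = len(labels)
    squared_errors = [(h - y) ** 2 for h, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / m)


predictions = [2.0, 4.0, 6.0, 8.0]
labels = [1.0, 4.0, 6.0, 18.0]  # the last label is an outlier

# Squared errors are 1, 0, 0 and 100, so the mean is 25.25
print(root_mean_square_error(predictions, labels))  # about 5.02
```

The plain average of the absolute errors for the same data is 2.75, so the single large error dominates the RMSE.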

**Random forest algorithm**

---

Used for regression or classification. It works by training many decision trees on random subsets of the data, and then combining their predictions (averaging them for regression, taking a majority vote for classification)

# S

**Semisupervised learning**

---

A machine learning system that can deal with a combination of labelled and unlabelled data. Such systems often use a combination of supervised and unsupervised algorithms

<img src="img/semisupervised.png" width="25%" />

e.g. the unlabelled circles in the image above help classify a new instance (the cross) as a triangle and not a square - this can be done with a combination of clustering and classification algorithms

e.g. Google Photos allows you to label a person once and then will identify all the photos that person has appeared in

**Supervised learning**

---

A machine learning system where you provide the desired solutions (called **labels**) in the training data

<img src="img/supervised.png" width="25%" />

Examples of supervised techniques:
* Classification
* Regression
* Neural networks

**Standardisation**

---

A method of bringing all data values onto the same scale. It subtracts the mean from each value, and divides by the standard deviation.

This results in data with a mean of 0 and a standard deviation of 1. It is **much less affected by outliers** than min-max scaling, but does not guarantee data lies **within a specific range**.

<img src="img/standardisation.png" width="25%" />
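A minimal sketch, assuming the population standard deviation is used (some tools use the sample standard deviation instead). The function name `standardise` and the values are made up for illustration:

```python
import statistics


def standardise(values):
    """Shift values to a mean of 0 and rescale to a standard deviation of 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("cannot standardise a column where every value is identical")
    return [(v - mean) / std for v in values]


# Mean is 5 and standard deviation is 2 for this made-up data
scaled = standardise([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scaled)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Note that the results are not confined to a fixed range such as 0 to 1.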

# T

# U

**Underfitting**

--- 

When the model is **too simple** to learn the underlying structure of the training data, so its **predictions are inaccurate**, even on the training data

Solutions:
* Select a more powerful model, with more parameters
* Feed better features to the learning algorithm (feature engineering)
* Reduce the constraints on the model (reduce any regularisation hyperparameters)

**Unsupervised learning**

---

A machine learning system where the training data is **unlabelled**. The system tries to learn without a teacher

<img src="img/unsupervised.png" width="25%" />

Examples of unsupervised techniques:
* Clustering
* Anomaly detection
* Visualisation/dimensionality reduction
* Association rule learning

**Utility/fitness function**

--- 

A function that measures how well a model is performing so it can improve itself

**Univariate regression**

---

A regression task where the goal is to only predict **one value** per set of inputs

# V

# W

# X

# Y

# Z