# The Machine Learning Landscape

## 1.1 ML is a subset of AI

Machine Learning, and Artificial Intelligence more broadly, are common buzzwords, often used to make data analysis methods appear sophisticated and innovative. But what do these terms actually mean? What do they imply?

- **Artificial Intelligence** (AI) is a broad field of computer science that aims to create software that can perform tasks that would typically require human intelligence.
- **Machine Learning** (ML) is a subset of AI that involves developing models that can make predictions or decisions without being explicitly programmed for the task.

## 1.2 ML algorithms have 3 major classifications

### 1.2.1 by training

- Supervised learning (with labels)
    + Classification, e.g. identifying a species of plant, determining the malignancy of a tumour
    + Regression, e.g. forecasting reactivities
- Unsupervised learning (without labels)
    + Clustering, e.g. grouping families of similar DNA sequences (without any predefined categories)
    + Anomaly detection, e.g. identifying unusual patterns or outliers in data
    + Dimension reduction
    + Association rule learning
- Semi-supervised learning (with some labels)
    + e.g. person identification in Google Photos
- Reinforcement learning (penalty/reward)
    + e.g. AlphaGo
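
The difference between learning with and without labels can be sketched on a toy problem. The measurements, labels, and function names below are made up for illustration; nearest-centroid classification and 2-means clustering are just two simple stand-ins for the supervised and unsupervised families.

```python
# Hypothetical 1-D measurements (e.g. petal lengths in cm) from two plant species.
measurements = [1.2, 1.4, 1.5, 1.3, 4.6, 4.9, 5.1, 4.7]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]  # only the supervised model sees these

def mean(values):
    return sum(values) / len(values)

# Supervised: use the labels to compute one centroid per class.
centroids = {
    label: mean([m for m, l in zip(measurements, labels) if l == label])
    for label in set(labels)
}

def classify(x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Unsupervised: 2-means clustering, which never looks at the labels.
centres = [min(measurements), max(measurements)]  # simple initialisation
for _ in range(10):
    groups = [[], []]
    for m in measurements:
        nearest = 0 if abs(m - centres[0]) <= abs(m - centres[1]) else 1
        groups[nearest].append(m)
    centres = [mean(group) for group in groups]

print(classify(1.6))  # a class name taken from the labels
print(centres)        # two cluster centres, with no names attached
```

The supervised model returns a named class, while the clustering only reports that two groups exist; interpreting those groups is left to the analyst.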

### 1.2.2 by adaptability to data

- Batch learning
    + System is trained, then launched. To incorporate new data, the whole system must be retrained from scratch.
    + Takes a lot of time, memory, and computational power
- Online learning
    + System updates itself incrementally as a continuous flow of data arrives
    + Potential performance decline if the system is updated with poor, non-representative data
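
A minimal sketch of the online idea, assuming a linear model updated by one stochastic gradient step per incoming instance. The data stream, the learning rate, and the `sgd_update` helper are hypothetical and chosen only for illustration.

```python
def sgd_update(weights, x, target, learning_rate=0.05):
    """Apply one online update to a linear model y = w0 + w1 * x."""
    w0, w1 = weights
    error = (w0 + w1 * x) - target
    # Gradient step on the squared error of this single instance.
    return (w0 - learning_rate * error, w1 - learning_rate * error * x)

# Hypothetical data stream: noise-free points on y = 2x + 1, arriving repeatedly.
stream = [(x, 2 * x + 1) for x in (0.0, 1.0, 2.0, 3.0)] * 500

weights = (0.0, 0.0)
for x, target in stream:
    # The model changes with every instance; no full retraining is needed.
    weights = sgd_update(weights, x, target)

print(weights)  # approaches (1.0, 2.0) for this stream
```

The same mechanism explains the risk noted above: every instance nudges the weights, so a run of bad instances would pull the model away from a good fit just as readily.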

### 1.2.3 by generalisation method

- Instance-based learning
    + Memorise the training instances, then make predictions for new instances based on a similarity measure
    + e.g. K-nearest neighbours
- Model-based learning
    + Fit a model to the training instances; the fitted model alone is then used to make predictions
    + e.g. Least-squares regression
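
The two approaches can be contrasted on a toy 1-D regression problem. This is a sketch only: the dataset is made up, and `knn_predict` and `fit_least_squares` are illustrative helper names, not part of any library.

```python
# Toy 1-D dataset (made up for illustration): y is roughly 2*x + 1.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 9.0, 10.8]

def knn_predict(x_new, X, y, k=2):
    """Instance-based: keep all training points, average the k nearest targets."""
    nearest = sorted(zip(X, y), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(target for _, target in nearest) / k

def fit_least_squares(X, y):
    """Model-based: reduce the training data to a slope and an intercept."""
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(X, y))
    sxx = sum((xi - mean_x) ** 2 for xi in X)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_least_squares(X, y)

x_new = 3.4
knn_prediction = knn_predict(x_new, X, y)      # needs X and y at prediction time
linear_prediction = slope * x_new + intercept  # needs only the two fitted parameters
print(knn_prediction, linear_prediction)
```

Once the line is fitted, the training data could be discarded; the k-nearest-neighbours predictor must keep it around for every prediction.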

## 1.3 There are 2 main challenges for successful ML

### 1.3.1 Bad data

- Insufficient training data
    + May need thousands or millions of data points
- Non-representative training data
    + Sample too small: sampling noise
    + Flawed sampling method: sampling bias
- Poor-quality data
    + Outliers
    + Missing features
- Irrelevant features
    + May require feature engineering such as feature selection and extraction

### 1.3.2 Bad algorithm

- Overfitting
    + Model is too complex relative to the amount and noisiness of the training data
    + Solutions:
        + Simplify/constrain/regularise
        + Gather more training data
        + Reduce data noise (e.g. fix errors, remove outliers)
- Underfitting
    + Model is too simple to capture the underlying structure of the data
    + Solutions:
        + Select more powerful model
        + Feed better features
        + Reduce constraints
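
Both failure modes can be seen by comparing training and test error for models of different flexibility. The sketch below uses made-up noisy linear data and three illustrative models: a constant (too simple), 1-nearest-neighbour (too flexible), and a least-squares line. All names and numbers are assumptions for the example.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

def make_data(n):
    """Hypothetical noisy data: y = 2x + 1 plus Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + 1 + random.gauss(0, 2)) for x in xs]

train, test = make_data(30), make_data(30)

def mse(predict, data):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

# Underfit: ignore x and always predict the mean training target.
mean_y = sum(y for _, y in train) / len(train)
def constant_model(x):
    return mean_y

# Overfit: 1-nearest neighbour reproduces every training target, noise included.
def one_nn_model(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

# Reasonable fit: least-squares line.
mean_x = sum(x for x, _ in train) / len(train)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
def linear_model(x):
    return mean_y + slope * (x - mean_x)

results = {
    name: (mse(model, train), mse(model, test))
    for name, model in [("constant", constant_model),
                        ("1-NN", one_nn_model),
                        ("linear", linear_model)]
}
for name, (train_mse, test_mse) in results.items():
    print(f"{name:>8}: train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```

The 1-NN model has zero training error but a clearly non-zero test error, which is the overfitting signature. The constant model is poor on both sets, which is the underfitting signature.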

## 1.4 ML requires testing and validation of models

To assess the generalisation error of a model, a dataset is typically split into two sets: **the training set and the test set**.

The model is developed on the training set. Its error rate on the test set, which it has not seen during training, is an estimate of the *generalisation error*.
- Low training error but high generalisation error indicates overfitting.
- Training set will often be ~80% of the dataset and the test set ~20%, but this is not a set rule.
    + 1% of a 10 million instance dataset (100,000 instances) could be sufficient to assess generalisation.
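
A split like this can be sketched in a few lines. The `train_test_split` helper below is a hand-written illustration (not the scikit-learn function of the same name), and the dataset is a stand-in list of integers.

```python
import random

def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and hold out test_ratio of it for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

dataset = list(range(100))  # stand-in for 100 labelled instances
train_set, test_set = train_test_split(dataset)
print(len(train_set), len(test_set))
```

Fixing the seed keeps the same instances in the test set across runs, so the model is never accidentally trained on data that was previously held out.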

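For additional rigour, you can also hold out a **validation set**, which is a subset of the training set used to compare candidate models and settings.
- Using multiple small validation sets in turn is called cross-validation.
- This allows a model to be improved systematically before testing, while the test set stays untouched until the final evaluation.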
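
One common scheme is k-fold cross-validation, sketched below under simple assumptions: the training data is made up, and `k_fold_indices`, `cross_val_mse`, `fit_mean`, and `fit_line` are illustrative helpers rather than library functions. Each candidate model is trained on all folds but one and scored on the fold left out.

```python
def k_fold_indices(n_instances, n_folds):
    """Yield (training_indices, validation_indices) for each fold."""
    indices = list(range(n_instances))
    for fold in range(n_folds):
        validation = indices[fold::n_folds]  # every n_folds-th instance
        held_out = set(validation)
        training = [i for i in indices if i not in held_out]
        yield training, validation

def cross_val_mse(xs, ys, fit, n_folds=5):
    """Average validation MSE of the model returned by fit, over all folds."""
    fold_scores = []
    for training, validation in k_fold_indices(len(xs), n_folds):
        predict = fit([xs[i] for i in training], [ys[i] for i in training])
        errors = [(predict(xs[i]) - ys[i]) ** 2 for i in validation]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / n_folds

def fit_mean(xs, ys):
    """Candidate 1: always predict the mean training target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Candidate 2: least-squares line."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)

# Hypothetical training data lying exactly on y = 3x - 2.
xs = [float(i) for i in range(20)]
ys = [3 * x - 2 for x in xs]

scores = {"mean": cross_val_mse(xs, ys, fit_mean),
          "line": cross_val_mse(xs, ys, fit_line)}
best = min(scores, key=scores.get)
print(scores, best)
```

Only the training data is used to choose between the candidates, so the test set still gives an unbiased estimate for whichever model is selected.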