# Chapter 1: Giving Computers the Ability to Learn from Data
## Building intelligent machines to turn data into knowledge
### The 3 Different Types of ML

1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

##### Supervised learning
- Labelled data
- Direct feedback
- Predict outcome/future
##### Unsupervised learning
- No labels/targets
- No feedback
- Find hidden structure in data
##### Reinforcement learning
- Decision process
- Reward system
- Learn series of actions

## Supervised Learning
The main goal of supervised learning is to build a model that predicts the labels of future data. The model is given training data with known labels (the desired outputs) and is adjusted step by step until its outputs match those labels well for the training inputs.

*Figure 1.1* below shows the workflow for creating and using a supervised learning model.
<p style="text-align: center;">
<img src="Figures\01_02.png" width="600" height="500">
</p>
<p style="text-align: center;"><b>Figure 1.1: Process diagram for a supervised learning model</b></p>
An example of supervised learning is a model that decides whether an email is legitimate or spam. Because the outcomes are discrete, this is a sub-category of supervised learning called classification. Where the desired outcome lies on a continuous scale, e.g. predicting sales of a certain product, the task is one of regression, which is also a sub-category of supervised ML.

##### Classification
Tagging an email as spam or not spam is an example of binary classification. The concept is shown in *figure 1.2*, which has 30 example data points, 15 of each class, where each data point has 2 inputs associated with it, $x_{1}$ and $x_{2}$. The SL model learns where the decision boundary lies between the 2 classes so that it can evaluate which side a new data point lies on and hence which class to assign it to.
<p style="text-align: center;">
<img src="Figures\01_03.png" width="500" height="500">
</p>
<p style="text-align: center;"><b>Figure 1.2: Example concept of a binary classification</b></p>
A classification model can have an arbitrary number of classes; such a task is called <b>multiclass classification</b>. An example is identifying handwritten letters and digits, which gives 36 different classes in standard English (26 letters plus 10 digits, ignoring case). A multiclass classification model can only return labels that were present in the training dataset.
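
To make the idea of a decision boundary concrete, here is a minimal sketch of a binary classifier. It uses a nearest-centroid rule, which is an assumption chosen for brevity and not a method named in these notes: each class is summarised by the mean of its training points, and a new point gets the label of the closer mean. The boundary is then the line halfway between the two centroids. The data points and the names `fit` and `predict` are made up for illustration.

```python
def centroid(points):
    """Mean position of a list of (x1, x2) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def fit(examples):
    """examples: list of ((x1, x2), label). Returns one centroid per label."""
    by_label = {}
    for point, label in examples:
        by_label.setdefault(label, []).append(point)
    return {label: centroid(points) for label, points in by_label.items()}


def predict(centroids, point):
    """Assign the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], point))


# Toy training data (made up): two inputs x1, x2 per example, known labels.
training = [
    ((1.0, 1.2), "negative"), ((1.5, 0.8), "negative"), ((0.8, 1.0), "negative"),
    ((4.0, 4.2), "positive"), ((4.5, 3.8), "positive"), ((3.9, 4.1), "positive"),
]
model = fit(training)
print(predict(model, (1.1, 0.9)))
print(predict(model, (4.2, 4.0)))
```

The same `fit` and `predict` functions work unchanged with more than two labels, which is the multiclass case.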

##### Regression for predicting continuous outcomes

Another type of supervised learning is the prediction of continuous (non-discrete) outcomes, also called <b>regression analysis</b>. We start with a number of predictor (explanatory) variables and a continuous response variable and try to find the relationship between them, so that given a new set of predictor values we can predict the outcome. By convention the predictor variables are called "features" and the response variables are called "target variables".

An example is predicting the score on a test (the target variable) from the time spent studying (the feature).
<p style="text-align: center;">
<img src="Figures\01_04.png" width="500" height="500">
</p>
<p style="text-align: center;"><b>Figure 1.3: Linear regression visualised</b></p>
Given a feature variable x and a target variable y, linear regression fits a straight line to the data points that minimizes the distance between the points and the line, most commonly the average squared distance. Once this line is fitted we can use its slope and intercept to predict the target value y of a new data point x.
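
As a rough sketch of the fitting step, the closed-form least-squares solution for one feature can be written in a few lines. It minimizes the average squared vertical distance between the points and the line. The function names and the study-time numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    return slope * x + intercept


# Made-up data: hours spent studying (feature) and test score (target).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 61.0, 68.0, 80.0, 89.0]

slope, intercept = fit_line(hours, scores)
print(predict(slope, intercept, 6.0))  # predicted score for a new data point
```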

##### Solving interactive problems with reinforcement learning

Here the goal is to develop a system (<b>agent</b>) that improves its performance based on interactions with the environment. Since the information about the current state of the environment typically also includes a so-called <b>reward signal</b>, we can think of reinforcement learning as a field related to supervised learning. The difference is that the feedback to the agent is not the true label or value but a measure, given by a reward function, of how well the agent performed the desired task. Through interaction with the environment, an agent can then use RL to learn a series of actions that maximizes this reward, via an exploratory trial-and-error approach or deliberative planning.

A popular example is a chess program: the agent decides on a series of moves depending on the state of the board (the environment), and the reward can be defined as winning or losing the game. A general scheme for how RL works is shown in figure 1.4 below. The agent generates an action or series of actions which alter the state of the environment. The change in state is then scored by the reward function we define, which judges whether the actions changed things in a good or a bad way. This reward is passed back to the agent so it knows whether it improved or got worse. Repeating this loop many times eventually leads the agent towards an optimal set of actions for the task.
<p style="text-align: center;">
<img src="Figures\01_05.png" width="700" height="500">
</p>
<p style="text-align: center;"><b>Figure 1.4: Process diagram for reinforcement learning</b></p>
In summary RL is concerned with learning to choose the series of actions that maximize the total reward.
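
The agent, environment and reward loop can be sketched on a toy problem. The example below is an assumption made for illustration, not something taken from these notes: the environment is a corridor of 5 cells, the agent starts in cell 0 and receives a reward of 1 when it reaches cell 4. The learning rule is tabular Q-learning, one common RL algorithm, with purely random exploration. All names and constants are hypothetical.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)    # index 0 moves left, index 1 moves right
GAMMA = 0.9           # discount applied to future reward
ALPHA = 0.5           # learning rate


def step(state, action):
    """Environment: apply the action and return (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action_index]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            a = rng.randrange(2)                 # trial and error: random action
            next_state, reward = step(state, ACTIONS[a])
            # Move the estimate towards reward plus discounted future value.
            target = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q


q_table = train()
# Greedy policy: in each non-goal cell pick the action with the higher value.
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(policy)
```

After training, the learned series of actions is read off by choosing the highest-valued action in every cell.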

##### Discovering hidden structures with unsupervised learning

In SL we know the right answer or the right goal outcome when training the model. In unsupervised learning we deal with unlabelled data or data of an unknown structure. With UL we can explore the structure of our data to extract meaningful information without the guidance of a known or desired outcome or reward function.
##### Finding subgroups with clustering
<b>Clustering</b> is an exploratory data analysis or pattern discovery technique that allows us to organise a pile of information into meaningful clusters without prior knowledge of their existence. Each cluster that arises defines a group of objects that share a certain degree of similarity but are more dissimilar to objects in other clusters, hence this is sometimes referred to as <b>unsupervised classification</b>.
<p style="text-align: center;">
<img src="Figures\01_06.png" width="500" height="500">
</p>
<p style="text-align: center;"><b>Figure 1.5: Illustration of how clustering can be applied to organise unlabelled data into 3 groups based on their similarities across different features x_1 & x_2</b></p>
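
A minimal sketch of clustering, assuming k-means as the algorithm (one common choice, not named in these notes). It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The starting centroids are passed in explicitly so the example stays deterministic, and the data points are made up.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, initial_centroids, max_iterations=100):
    """Return (assignments, centroids) once assignments stop changing."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return assignments, centroids


# Unlabelled, made-up data with two features per point.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centres = k_means(points, initial_centroids=[points[0], points[3]])
print(labels)
```

No labels are supplied at any point; the group indices come only from the similarity between the points.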

##### Dimensionality reduction for data compression
Dimensionality reduction is another subfield of UL. Datasets often have many measurements per observation, which presents problems for storage space and for the computational performance of ML models. Unsupervised dimensionality reduction is a commonly used approach in feature preprocessing to remove noise from data, noise which can degrade the predictive performance of certain algorithms. It compresses the data onto a lower-dimensional subspace while retaining most of the relevant information. It is also useful for visualising high-dimensional datasets: reducing them to 1, 2 or 3 dimensions lets us plot the data much more easily.
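
As a hedged illustration of compressing data onto a lower-dimensional subspace, the sketch below reduces 2D points to 1D by projecting them onto the direction of greatest variance. This is the idea behind principal component analysis, which is an assumption here since these notes do not name a specific technique. For two features the direction has a closed form, so no linear algebra library is needed. The data and function names are made up.

```python
import math


def first_principal_direction(points):
    """Return the mean and the unit vector along the greatest variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Orientation of the major axis of a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(angle), math.sin(angle))


def project(points):
    """Compress each 2D point to one coordinate along that direction."""
    (mean_x, mean_y), (dx, dy) = first_principal_direction(points)
    return [(x - mean_x) * dx + (y - mean_y) * dy for x, y in points]


# Made-up data: two strongly correlated features.
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
compressed = project(points)
print(compressed)
```

Each point is now described by one number instead of two, and because the two features were almost perfectly correlated very little information is lost.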

##### Introduction to the basic terminology and notations
##### Notations and conventions used in this book
Figure 1.6 shows an excerpt from the Iris dataset, a dataset commonly used as an example in ML. It contains measurements of 150 Iris flowers from 3 different species.
<p style="text-align: center;">
<img src="Figures\01_08.png" width="700" height="500">
</p>
<p style="text-align: center;"><b>Figure 1.6: Excerpt of Iris dataset</b></p>
Each row represents one flower measured, and the measurements themselves are stored in columns which we call features. To keep the notation and implementation simple and efficient we make use of some basic linear algebra. We use matrix notation to refer to our data and follow the common convention of representing each example as a row in a feature matrix <b>X</b>, where each feature is stored as a separate column. The Iris dataset, consisting of 150 examples and 4 features, can be written as a 150x4 matrix, formally denoted as

\begin{equation}
    \textbf{X} \in \mathbb{R}^{150 \times 4}, \quad \textbf{X} = \begin{bmatrix}
        x_{1}^{(1)} & x_{2}^{(1)} & x_{3}^{(1)} & x_{4}^{(1)}\\
        x_{1}^{(2)} & x_{2}^{(2)} & x_{3}^{(2)} & x_{4}^{(2)}\\
        \vdots & \vdots & \vdots & \vdots\\
        x_{1}^{(150)} & x_{2}^{(150)} & x_{3}^{(150)} & x_{4}^{(150)}
        \end{bmatrix}
\end{equation}

Unless stated otherwise, the superscript $(i)$ refers to the $i$th training example and the subscript $j$ refers to the $j$th feature of that example, i.e. the $j$th dimension of the training dataset. Lowercase boldface letters represent vectors, $\textbf{x} \in \mathbb{R}^{n \times 1}$, and uppercase boldface letters represent matrices, $\textbf{X} \in \mathbb{R}^{n \times m}$. When referring to single elements of a vector or matrix, italics are used, $\textit{x}^{\textit{(n)}}$ or $\textit{x}^{\textit{(n)}}_{\textit{m}}$; e.g. $\textit{x}^{\textit{(150)}}_{\textit{1}}$ refers to the first dimension of flower example 150, the sepal length. Each row in matrix <b>X</b> represents one flower instance and can be written as a four-dimensional row vector $\textbf{x}^{(i)} \in \mathbb{R}^{1 \times 4}$:

\begin{equation}
    \textbf{x}^{(i)} = 
    \begin{bmatrix}
        \textit{x}^{\textit{(i)}}_{\textit{1}} & \textit{x}^{\textit{(i)}}_{\textit{2}} & \textit{x}^{\textit{(i)}}_{\textit{3}} & \textit{x}^{\textit{(i)}}_{\textit{4}}
    \end{bmatrix}
\end{equation}
and each feature dimension is a 150-dimensional column vector $\textbf{x}_{j} \in \mathbb{R}^{150 \times 1}$:
\begin{equation}
    \textbf{x}_{j} = 
    \begin{bmatrix}
        \textit{x}^{\textit{(1)}}_{\textit{j}} \\ \textit{x}^{\textit{(2)}}_{\textit{j}} \\ \textit{x}^{\textit{(3)}}_{\textit{j}} \\ \vdots \\ \textit{x}^{\textit{(150)}}_{\textit{j}}
    \end{bmatrix}
\end{equation}
We can similarly represent the target variables, here the class labels, as a 150-dimensional column vector:
\begin{equation}
    \textbf{y} = 
    \begin{bmatrix}
        \textit{y}^{\textit{(1)}} \\ \vdots \\ \textit{y}^{\textit{(150)}}
    \end{bmatrix} \text{, where }\textit{y}^{\textit{(i)}} \in \{\text{Setosa}, \text{Versicolor}, \text{Virginica}\}
\end{equation}
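
The indexing convention maps directly onto code. The sketch below stores a feature matrix as a list of rows and uses small helper functions, hypothetical names introduced here, that take the 1-based indices of the notation above. Only three rows are included as a stand-in for the full 150x4 matrix; the values are the first rows as they commonly appear in distributed copies of the Iris data.

```python
# Feature matrix X: one row per example, one column per feature
# (sepal length, sepal width, petal length, petal width).
X = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
]
# Target vector y: one class label per example.
y = ["Setosa", "Setosa", "Setosa"]


def element(X, i, j):
    """x_j^(i): feature j of example i, using the 1-based indices of the text."""
    return X[i - 1][j - 1]


def example_row(X, i):
    """x^(i): all features of example i as a row vector."""
    return X[i - 1]


def feature_column(X, j):
    """x_j: feature j across all examples as a column vector."""
    return [row[j - 1] for row in X]


n_examples, n_features = len(X), len(X[0])
print(n_examples, n_features)
print(element(X, 2, 1))
print(example_row(X, 3))
print(feature_column(X, 1))
```

Python lists are 0-indexed, which is why each helper subtracts 1 from the indices used in the notation.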

#### A roadmap for building ML systems

<p style="text-align: center;">
<img src="Figures\01_09.png" width="700" height="500">
</p>
<p style="text-align: center;"><b>Figure 1.7: ML roadmap</b></p>

Figure 1.7 shows a typical development process for an ML model.

### Preprocessing - Getting data into shape

Datasets for ML models rarely come ready to use out of the box; often the data needs to be cleaned or extracted from some other source. For the Iris dataset, for example, we can think of the raw data as a set of images of flowers from which we need to extract the key features such as colour, height and number of petals. Many algorithms also require the features to be on the same scale for optimal performance. This is often achieved by transforming each feature into the range [0,1], or to a standard normal distribution with zero mean and unit variance.
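
The two scaling approaches can be sketched for a single feature column as follows. The function names and example values are made up, and both functions assume the feature is not constant, otherwise they would divide by zero.

```python
import statistics


def min_max_scale(values):
    """Rescale a feature linearly so its values lie in the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Shift and scale a feature to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / std for v in values]


# Made-up feature column on an arbitrary scale.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
print(min_max_scale(heights_cm))
print(standardize(heights_cm))
```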

Some features may be highly correlated and therefore redundant to a certain degree. Dimensionality reduction techniques are useful for compressing such features onto a lower-dimensional subspace. This reduces the storage space needed for the dataset, and because the algorithm has fewer inputs it can run faster. It can also improve the predictive performance of an algorithm when the dataset contains many irrelevant features (noise), i.e. when it has a low signal-to-noise ratio.

Best practice is to randomly split a dataset into a training subset and a test subset. The model is trained on the training subset, and the test subset is held back to check how well the model generalises to new data.
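
A random split can be sketched as a shuffle followed by a cut. The helper name `split_train_test`, the 30% test fraction and the fixed seed are assumptions made for illustration.

```python
import random


def split_train_test(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the examples and cut it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


examples = list(range(10))   # stand-in for 10 labelled examples
train, test = split_train_test(examples)
print(len(train), len(test))
```

Every example ends up in exactly one of the two subsets, so nothing used for testing has been seen during training.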

### Training and selecting a predictive model

Many different algorithms have been developed for many different tasks, and it is always important to choose the right algorithm for the task. For example, each classification algorithm has inherent biases, and no classification algorithm is superior to the others if we make no assumptions about the task. It is usually best to compare a few different algorithms and select the best one. Comparing models first requires us to decide how to measure each algorithm's performance by choosing a metric. One common metric is accuracy, the proportion of correctly classified examples.
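
Accuracy as defined above can be computed directly from the true and predicted labels. The labels in this small sketch are made up.

```python
def accuracy(true_labels, predicted_labels):
    """Proportion of examples whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


true_labels = ["spam", "ham", "spam", "spam"]
predicted = ["spam", "ham", "ham", "spam"]
print(accuracy(true_labels, predicted))  # 3 of the 4 predictions match
```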

How can we decide which algorithm will work best on the test dataset or on future real-world data if that data is not used when selecting the algorithm? To address this, different techniques summarized as "cross-validation" can be used. In cross-validation, we further divide the training data into training and validation subsets in order to estimate the generalization performance of the model.
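
One common form of cross-validation is k-fold, assumed here for illustration: the data is cut into k parts, and each part takes one turn as the validation subset while the rest is used for training. The sketch below only generates the index sets; the function name is hypothetical and the data is assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_examples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


for training, validation in k_fold_indices(10, 3):
    print(training, validation)
```

Averaging a metric such as accuracy over the k validation subsets gives the estimate of generalization performance.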

We cannot assume that the default parameter values provided by the various libraries are optimal for our problem. Hyperparameter optimization helps to fine-tune performance. Hyperparameters are not learned from the data; they can be thought of as the knobs on the front of the ML model machine that we adjust to suit our needs.
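
A simple way to tune a hyperparameter is to try a list of candidate values and keep the one that scores best on validation data. The sketch below does this for the number of neighbours `k` in a small k-nearest-neighbours classifier, which is an assumption chosen for illustration and not a method described in these notes. The data contains one deliberately mislabelled point and is made up.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def validation_accuracy(train, validation, k):
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)


# Made-up 1D data; the point (1.3, "b") acts as label noise.
train = [(1.0, "a"), (1.2, "a"), (1.4, "a"), (1.3, "b"),
         (3.0, "b"), (3.2, "b"), (3.4, "b")]
validation = [(1.28, "a"), (1.05, "a"), (3.1, "b"), (3.3, "b")]

candidates = [1, 3, 5, 7]   # hyperparameter values to try
scores = {k: validation_accuracy(train, validation, k) for k in candidates}
best_k = max(candidates, key=scores.get)
print(scores, best_k)
```

The value of `k` is never learned from the training data; it is chosen from outside the model by comparing validation scores.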

### Evaluating models and predicting unseen data instances

After a model has been selected and trained on the training dataset, we can use the test dataset to estimate how well it performs on unseen data, i.e. to estimate the generalization error. Any parameters used in the earlier steps, such as those for feature scaling or dimensionality reduction, are obtained from the training dataset only and are then applied unchanged to the test dataset and to any future data.
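
The point about reusing parameters can be sketched with feature standardization: the mean and standard deviation are computed from the training values only and then applied as they are to the test values. The function names and numbers are made up for illustration.

```python
import statistics


def fit_standardizer(training_values):
    """Learn the scaling parameters from the training data only."""
    return statistics.fmean(training_values), statistics.pstdev(training_values)


def apply_standardizer(values, mean, std):
    """Apply previously learned parameters without recomputing them."""
    return [(v - mean) / std for v in values]


train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 10.0]

mean, std = fit_standardizer(train_values)
scaled_train = apply_standardizer(train_values, mean, std)
scaled_test = apply_standardizer(test_values, mean, std)   # same mean and std
print(scaled_test)
```

Recomputing the mean and standard deviation on the test values would let information from the test data leak into the preprocessing, which would make the performance estimate too optimistic.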