# Introduction to Machine Learning

Let's look at the following questions:

1. What is machine learning?

2. How do we do machine learning?

3. What are some of the challenges and issues we need to consider when doing machine learning?

4. What are some of the biomedical applications of machine learning?

## 1. What is machine learning?

### AI vs Machine Learning vs Deep Learning

Artificial Intelligence is the broader umbrella under which Machine Learning and Deep Learning come. As the diagram below shows, deep learning is in turn a subset of machine learning. The three therefore form a nested hierarchy: deep learning sits inside machine learning, which sits inside AI. Let us look at how they differ from each other.

"Artificial Intelligence is a technique that allows machines to act like humans by replicating their behavior and nature."

"Machine Learning is a subset of artificial intelligence. It allows machines to learn and make predictions based on their experience (data)."

"Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts or abstractions."

 <img src='https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-2-Introduction-to-machine-learning/imgs/AI-vs-ML-vs-Deep-Learning.png' width=600px>

### Goal of machine learning

The goal of machine learning is to create a *model* that makes a *prediction*. The model is a mathematical function that takes input (which is some information that we can measure or observe about samples) and returns a prediction. Usually we train our model by providing it with samples where we already know the correct answer (*training data*); this type of machine learning is called _supervised learning_. But ideally we want our model to return good predictions even for new or unseen data. Sometimes we will use machine learning just to identify patterns in data; this type of machine learning is called _unsupervised learning_. Machine learning is often considered a type of Artificial Intelligence (AI).

### Supervised vs. unsupervised machine learning

When we do _supervised learning_ we have some training data for which we already know the answers: the true class of the data (or at least the class assigned by the best method available), or the true value of some quantitative variable. Most of this course will cover supervised learning.

When we do _unsupervised learning_ we don't have answers available (or at least we analyze the data as if the answers are unavailable). Most commonly we do this by _clustering_, where we apply a method to divide the data into subsets (called clusters). We will discuss clustering during Week 5. Dimensionality reduction methods such as PCA and ICA are also forms of unsupervised learning.

There are also approaches where we combine supervised learning with unsupervised learning (*semi-supervised learning*).

Another kind of machine learning is *reinforcement learning*, where the system (or agent) interacts with its environment and receives rewards or punishments based on its responses, and learns through that process.

We won't be covering semi-supervised or reinforcement learning in this course.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-2-Introduction-to-machine-learning/imgs/SupervisedLearning.png" width = "250" style="float: right;">

#### Example: Supervised learning

Diagnosis of heart failure. Here we recorded two measurements of patient heart function and have each patient labeled in two classes: "healthy" or "heart failure". The goal of machine learning is to generate a model that predicts the patient class.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-2-Introduction-to-machine-learning/imgs/Biopsy.png" width = "250" style="float: right;">
<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-2-Introduction-to-machine-learning/imgs/BreastCancerClustering.png" width = "250" style="float: right;">

#### Example: Unsupervised learning (clustering)

Diagnosis of breast cancer. Here we have two features describing cells in a biopsy sample. The goal is to learn if there is a structure in the dataset, without considering the diagnosis.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-2-Introduction-to-machine-learning/imgs/BrainMRI.png" width = "250" style="float: right;">
<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-2-Introduction-to-machine-learning/imgs/MRIClustering.png" width = "250" style="float: right;">

#### Example: Unsupervised learning (clustering)

Classification of brain tissues. Here we have two features describing pixels in a brain MRI. The goal is to segment the brain into 3 types of regions: WM (white matter), GM (gray matter) and CSF (cerebrospinal fluid).
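As a rough illustration of clustering, the sketch below groups points described by two features into three clusters. The data are synthetic random points, not real MRI intensities, and the choice of k-means (scikit-learn's `KMeans`) is an assumption made for the example rather than the method used to produce the figure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for "two features per pixel": three well-separated groups.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Ask for three clusters; no labels are given to the algorithm.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Number of points assigned to each cluster
print(np.bincount(cluster_labels))
```

The cluster numbers themselves are arbitrary: the algorithm finds groups, but it is up to us to decide what each group means.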

## 2. How do we do machine learning?

### Machine learning protocol

When we do machine learning, we normally follow a protocol consisting of the following steps.

Step 1. Define the problem

Step 2. Prepare the data

Step 3. Exploratory data analysis

Step 4. Feature selection and extraction

Step 5. Create training, validation and test sets

Step 6. Select model type and optimization algorithm

Step 7. Fit the model

Step 8. Evaluate the model

#### Step 1. Define the problem

It is very important to understand and describe what exactly we are trying to predict. The classes or values we are trying to predict are called our targets.

Are we trying to predict discrete *classes* (or *labels*)? For example, does a patient have a disease or not? This is a *classification* problem. If we have two classes, it is a _binary classification_ problem. Sometimes we are interested in a _multiclass_ problem, where we want to assign each sample to one of three or more classes.

On the other hand, sometimes we are trying to predict a continuous numerical value, for example, what is the patient's expected survival time? Then we have a _regression_ problem.

We will talk about classification in Week 3 and regression in Week 4.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-2-Introduction-to-machine-learning/imgs/Classification.png" width = "400" style="float: right;">

#### Example: Classification

Diagnosis of heart failure. Here we recorded two measurements of patient heart function and have each patient labeled in two classes: "healthy" or "heart failure". The goal of machine learning is to generate a model that predicts the patient class.
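A minimal sketch of such a binary classifier is shown below. The two features and their values are synthetic and purely illustrative (they are not real clinical measurements), and logistic regression is just one possible model choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: two made-up heart-function measurements per patient.
healthy = rng.normal(loc=[60.0, 120.0], scale=5.0, size=(40, 2))
failure = rng.normal(loc=[35.0, 180.0], scale=5.0, size=(40, 2))
X = np.vstack([healthy, failure])
y = np.array([0] * 40 + [1] * 40)  # 0 = "healthy", 1 = "heart failure"

# Fit the model on the labeled samples.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Predict the class of a new (hypothetical) patient.
new_patient = np.array([[58.0, 125.0]])
print(clf.predict(new_patient))
```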

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-2-Introduction-to-machine-learning/imgs/Regression.png" width = "400" style="float: right;">

#### Example: Regression

Prediction of brain volume. Here we are trying to predict the brain volumes (measured from MRI scans) for preterm babies from gestational age.
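The sketch below shows what a simple regression of this kind could look like. The gestational ages and volumes are randomly generated with an arbitrary linear trend, so the numbers have no clinical meaning; a straight-line model is assumed only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: gestational age (weeks) and a made-up volume with noise.
gestational_age = rng.uniform(25, 40, size=100)
volume = 20.0 * gestational_age - 300.0 + rng.normal(scale=15.0, size=100)

# scikit-learn expects a 2D feature matrix of shape (n_samples, n_features).
X = gestational_age.reshape(-1, 1)
reg = LinearRegression().fit(X, volume)

# Predict a continuous value for a new sample at 30 weeks.
predicted_volume = reg.predict(np.array([[30.0]]))
print(reg.coef_[0], reg.intercept_, predicted_volume[0])
```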

#### Step 2. Prepare the data

This is often the most time-consuming step of any machine learning project!

Usually, we will need to preprocess the data so that it is in a format suitable for the type of machine learning we want to do. Sometimes we need to remove missing or low quality samples from the data, or handle some of the data in a special way.

Another type of data preparation that is sometimes useful is _data augmentation_. This is particularly important when we have a small amount of data. For example, if we have a set of images (assuming there is no natural orientation), we could rotate each image randomly and generate an augmented dataset with multiple rotations for each initial image.
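Here is a small sketch of rotation-based augmentation. As a simplification, it uses the four 90-degree rotations of each image rather than random angles, and the "images" are random arrays. The helper name `augment_with_rotations` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four fake 8x8 grayscale images, stacked as (n_images, height, width).
images = rng.random((4, 8, 8))

def augment_with_rotations(images):
    """Return the images rotated by 0, 90, 180 and 270 degrees, stacked together."""
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    return np.concatenate(rotated, axis=0)

augmented = augment_with_rotations(images)
print(images.shape, augmented.shape)
```

Any labels attached to the original images would need to be repeated in the same order for the rotated copies.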

Data can be *numerical* or *categorical*. For numerical data we may need to scale or shift the data. For categorical data we may need to encode it properly for our methods. For binary classes, we will usually encode the data with 0 or 1 values. For multiclass data, sometimes we can encode the classes using integer values (0, 1, 2, etc.). But sometimes we will need to use *one-hot* encoding, where multiple classes are encoded as a set of "dummy" binary variables.
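The following sketch illustrates both ideas with scikit-learn: standardizing a numerical column and one-hot encoding a categorical one. The values are made up, and `StandardScaler` is only one of several possible scaling choices.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numerical feature: shift to zero mean and scale to unit variance.
ages = np.array([[25.0], [40.0], [55.0], [70.0]])
scaled = StandardScaler().fit_transform(ages)
print(scaled.ravel())

# Categorical feature with three classes: one binary "dummy" column per class.
tissue = np.array([["WM"], ["GM"], ["CSF"], ["GM"]])
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(tissue).toarray()

print(encoder.categories_[0])  # column order: CSF, GM, WM (sorted)
print(one_hot)
```

Each row of `one_hot` contains a single 1 marking the class of that sample.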

#### Step 3. Exploratory data analysis

This is a very useful step whenever you are starting machine learning with a new dataset. We can use various techniques to summarize and visualize our data so as to gain an initial understanding of the data. This will help us to identify issues that will require additional data preparation (so back to step 2), and to select the best methods to use for the rest of the protocol.

For example, we might examine correlations among the features, and between the features and the target values.
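As one possible way to look at correlations, the sketch below computes a correlation matrix with NumPy. The three features are synthetic: `feature_b` is constructed to depend on `feature_a`, while `feature_c` is independent noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

feature_a = rng.normal(size=n)
feature_b = 0.8 * feature_a + 0.2 * rng.normal(size=n)  # closely related to feature_a
feature_c = rng.normal(size=n)                          # unrelated to the others

# One column per feature; rowvar=False tells NumPy that columns are variables.
data = np.column_stack([feature_a, feature_b, feature_c])
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))
```

Entries near 1 or -1 point to strongly related features, which may be redundant as model inputs.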

#### Step 4. Feature selection and extraction

Each sample in our data will be represented with one or more *features*. Sometimes we can use the raw data as the features, but in most cases we need to do at least some processing of the data to generate the features needed for machine learning.

Features may be numerical or categorical. The features describe our data samples and will be used as input for machine learning. We represent each sample as a feature vector, with each element of the vector being one of the features.

Sometimes calculating the features may be a time-consuming step. Or we may need to do some preliminary work to identify the appropriate features. Or we may even need to perform feature engineering, creating new features to best describe our data as input to the model.

#### Step 5. Create training, validation and test sets

This is a key step that will help us determine if our model has good *generalizability*, i.e. how good the model will be at predicting new or unseen data.

To do this, we divide our data into training and test sets. We use the training set to find the best parameters for our model. We use the test set to evaluate how good the model actually is. When reporting the performance (e.g. accuracy) of the model it is important to use only the test set.

One common pitfall in machine learning is to learn too much about the training set, and to have low generalizability. This is called *overfitting*. By evaluating the model performance on a test set we can determine the extent of *overfitting*.

Often we will further divide the training set into a training set and a validation set. The validation set will be used to select _hyperparameters_ (see below), depending on the particular model we are using.

We need to select the size of the sets. For example, a common division is to put 80% of the data in the training set and 20% in the test set.

The simplest method to generate these sets is to just divide the data randomly. But sometimes we need to be smarter when dividing the sets. For example, we might need to ensure that there is about the same proportion of positive and negative samples in each set (this is called a *stratified* split).
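A minimal sketch of such a split using scikit-learn's `train_test_split` is shown below. The data are random, the 80/20 and 60/20/20 proportions are only examples, and `stratify` is used to keep the class proportions similar in every set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 70 + [1] * 30)  # imbalanced synthetic labels

# First hold out 20% of the data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Then split the remainder into training and validation sets
# (25% of the remaining 80% is 20% of the original data).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())  # fraction of positives in each set
```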

#### Step 6. Select model type and optimization algorithm

There are many different types of models available for both classification and regression problems. We will cover a few of them during the course!

Each model describes a functional form and some *parameters* (or *weights*). We will then apply an optimization algorithm that will determine the best parameters for our training set. The models can vary between very simple linear models with one or two parameters, all the way to complicated *neural networks* with thousands or even billions of parameters!

For simpler models, we can often select and configure the model with one line of code. On the other hand, for neural networks, we will create a complicated architecture consisting of many layers. We will cover neural networks in the second half of the course.

We will sometimes distinguish *hyperparameters* from other types of parameters. The distinction is just that hyperparameters are parameters that aren't optimized by our optimization algorithm. For example, when we do polynomial regression in scikit-learn, the coefficients will be the parameters, and the degree of the polynomial will be a hyperparameter. We can use methods such as grid search to find the best hyperparameters.
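To make the distinction concrete, here is a sketch of polynomial regression in scikit-learn where a grid search picks the degree (the hyperparameter) and the linear fit finds the coefficients (the parameters). The data are synthetic, generated from a quadratic with noise, and the candidate degrees are an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=80)
y = 0.5 * x**2 - x + rng.normal(scale=0.3, size=80)
X = x.reshape(-1, 1)

# Polynomial regression = polynomial feature expansion followed by a linear model.
pipeline = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("linear", LinearRegression()),
])

# The degree is a hyperparameter: try several values with 5-fold cross-validation.
param_grid = {"poly__degree": [1, 2, 3, 4, 5]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

best_degree = search.best_params_["poly__degree"]
coefficients = search.best_estimator_.named_steps["linear"].coef_
print(best_degree, coefficients)
```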

The *loss function* describes what the optimization algorithm is trying to minimize. For many types of models, the loss function is implicit (such as sum of squared errors for linear regression), but for neural networks we'll specify a loss function.

#### Step 7. Fit the model

We run our optimization algorithm and find the parameters.

The type of optimization algorithm used to find the best parameters will depend on the type of model and our data. For example, when fitting a neural network, we will use a variant of the gradient descent algorithm. We also need to consider the time and computer resources that will be necessary for training. When we study training neural networks, we will need to consider training batches and learning rates.
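To give a feel for gradient descent, the sketch below fits a straight line $\hat{y} = wx + b$ by repeatedly stepping against the gradient of the mean squared error. The data, the learning rate and the number of steps are arbitrary choices for illustration; in practice a library routine would normally do this for us.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=100)  # true slope 2, intercept 1

w, b = 0.0, 0.0        # initial parameter guesses
learning_rate = 0.5    # step size

for step in range(2000):
    residual = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * np.mean(residual * x)
    grad_b = 2.0 * np.mean(residual)
    # Move the parameters a small step downhill
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)
```

If the learning rate is too large the updates can diverge, and if it is too small convergence becomes very slow.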

For most of our examples, this will only take a split second, but in some cases some serious computational power and time will be required!

#### Step 8. Evaluate the model

This is actually the most important step. We have multiple metrics that we can use to evaluate the performance of the model and to compare different models.

Different performance metrics are appropriate for different kinds of problems. For classification problems, we will use accuracy, along with additional metrics that account for false positives and false negatives. For regression problems, we will use $R^2$ or RMSE.
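The sketch below computes these metrics with scikit-learn on small hand-made arrays. The true and predicted values are invented for the example.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    mean_squared_error,
    r2_score,
)

# Classification: 1 = positive class, 0 = negative class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

accuracy = accuracy_score(y_true, y_pred)                   # 7 of 10 correct -> 0.7
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # 3, 1, 2, 4
print(accuracy, tn, fp, fn, tp)

# Regression: compare predicted and true continuous values.
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_reg = np.array([2.5, 5.5, 7.0, 10.0])

r2 = r2_score(y_true_reg, y_pred_reg)                       # 0.925
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))  # square root of 0.375
print(r2, rmse)
```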

Of course, in a real project we will need to go back and repeat the steps as many times as necessary until we have a model that is suitable.

### Mathematical notation

Sometimes we will use mathematical notation to describe machine learning.

Each sample is characterised by a $D$-dimensional feature vector $\overline{x}$

$\overline{x} = (x_1, \ldots, x_D)^T$

Model $f$ returns predicted targets $\hat{y}$

$\hat{y} = f(\overline{x})$

Consider $N$ training samples $\overline{x}_1, ..., \overline{x}_N$

The features are represented by data matrix $X$

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-2-Introduction-to-machine-learning/imgs/FeatureMatrix.png" width = "200">

And in supervised learning, the output is a target vector $\overline{y}$ of dimension $N$

$\overline{y} = (y_1, \ldots, y_N)^T$
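In code, this notation maps naturally onto NumPy arrays. The sketch below uses random numbers and a placeholder model `f` (a fixed weighted sum, chosen only for illustration) to show the shapes involved.

```python
import numpy as np

N, D = 5, 3                       # N samples, D features per sample
rng = np.random.default_rng(0)

X = rng.normal(size=(N, D))       # data matrix: one row per sample
y = rng.normal(size=N)            # target vector: one value per sample

x_first = X[0]                    # feature vector of the first sample, length D

weights = np.array([0.5, -1.0, 2.0])

def f(x):
    """Placeholder model: a fixed weighted sum of the features."""
    return float(weights @ x)

# Predicted targets, one per sample
y_hat = np.array([f(x) for x in X])
print(X.shape, y.shape, x_first.shape, y_hat.shape)
```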

## 3. What are some of the challenges and issues we need to consider when doing machine learning?

### Overfitting

This is one of the biggest challenges with machine learning. It is actually rather easy to train a model that has extremely high accuracy on the training data. But this isn't our goal: we want a model that is good at predictions on new or unseen data.

What is happening is that the model is just learning about noise in the training set, instead of learning useful information about the data.

Our main weapon against overfitting is to split our data into a training set and a test set. If the trained model performs markedly better on the training set than on the test set, it is overfitting.

Even better than using a test set, sometimes we will have the opportunity to test the model on actual new data, which is very useful for evaluating its true performance.

Overfitting is particularly a problem when your model has a large number of parameters, for example with a large neural network. Often it is better to use a simpler model with fewer parameters.

The converse, underfitting, is also possible, but it is generally less of an issue; it is pretty easy to use a more complicated model with more parameters!

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-2-Introduction-to-machine-learning/imgs/sphx_glr_plot_underfitting_overfitting_001.png" width = "800">

Figure source: https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html
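The sketch below is loosely modelled on the scikit-learn example in the figure: it fits polynomials of degree 1, 4 and 15 to noisy samples of a cosine curve and compares the error on the training data with the error on separate test data. The sample sizes and noise level are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a cosine curve on the interval [0, 1]."""
    x = rng.uniform(0, 1, size=n)
    y = np.cos(1.5 * np.pi * x) + rng.normal(scale=0.1, size=n)
    return x.reshape(-1, 1), y

X_train, y_train = make_data(30)
X_test, y_test = make_data(200)

train_mse, test_mse = {}, {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(degree, train_mse[degree], test_mse[degree])
```

The training error can only go down as the degree increases, so it is the test error that tells us whether the extra flexibility actually helps.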

### White box vs black box

Most of the time when we do machine learning, we treat the models as a *black box*, where we are only interested in the performance of the model; we don't care about the actual parameters.

Occasionally, we may treat a model as a *white box*, and examine the values of its parameters. This can provide insights into which features were useful for the model, and sometimes a better understanding of the data and the model.
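As a small white-box illustration, the sketch below fits a linear regression to synthetic data in which only the first of two features influences the target, and then inspects the fitted coefficients. The data and the choice of a linear model are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two features, but the target depends only on the first one.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Looking inside the model: one coefficient per feature.
print(model.coef_)
```

A coefficient close to zero suggests that the corresponding feature contributed little to this particular model, although such readings are only straightforward when the features are on comparable scales.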

### Handling outliers

Often a dataset will contain some extreme values for certain features or targets. 

Sometimes outliers are erroneous (or missing) values. If we believe this to be the case, it is usually best to exclude the outliers from the data.

Be very cautious when excluding outliers, since they can often have information about the true variability of the data.

Remember, you can almost always improve the performance of the model by excluding outliers, but you may end up with a model that is less useful on real data.

### Handling missing or invalid data

It is common for datasets to contain some missing or invalid datapoints. For example, for patient data, the same tests may not have been performed on every patient. There may also have been mistakes in the measurements or errors in data entry.

We have multiple options to handle this situation:

1. Remove samples with missing or invalid values. If there are only a few samples with these issues, removing those samples is usually the best solution.
2. Replace the missing or invalid values with a default value. This is called *imputing* the values. For example, if we are missing a feature for some of the samples, we could calculate the median value of the rest of the dataset, and assign that value. We could even do machine learning to try to predict the missing values.
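Both options can be sketched in a few lines. The example below uses a tiny made-up array with missing values marked as `np.nan`, and scikit-learn's `SimpleImputer` with the median strategy as one possible imputation method.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Five samples, two features; np.nan marks a missing value.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [np.nan, 40.0],
    [5.0, 50.0],
])

# Option 1: remove every sample that has at least one missing value.
complete_rows = X[~np.isnan(X).any(axis=1)]
print(complete_rows.shape)

# Option 2: impute each missing value with the median of its feature.
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

When a test set is involved, the imputer should be fitted on the training set only and then applied to the test set, so that no information leaks from the test data.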

Most of the example datasets we will use won't have these issues. But almost every real-world dataset will!

### Handling poor quality data

In addition to missing or invalid data, the dataset may have errors that we can't easily detect.

[Add more here]

### Dimensionality reduction: PCA & ICA

The aim of dimensionality reduction is to reduce the number of features (dimension of the feature vector) while preserving the important distinguishing characteristics of samples. This can be used to help visualize the structure of the dataset. There are two methods commonly used here:

- Principal Component Analysis (PCA)
- Independent Component Analysis (ICA)
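As a brief illustration of dimensionality reduction, the sketch below applies scikit-learn's `PCA` to synthetic data with five features that were generated from only two underlying variables. The data and the choice of two components are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Five observed features built from two hidden variables plus a little noise.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 5))

# Reduce from 5 features to 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

The two-dimensional output can be shown directly in a scatter plot to look for structure in the dataset.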

## 4. What are some of the biomedical applications of machine learning?

There are many biomedical applications of machine learning:

- *Diagnostic*: In this case we are interested in predicting if a patient is likely to have a disease condition, based on information about the patient.
- *Prognostic*: In this case we want to predict the likely progress or survival of a patient.
- *Treatment*: Here we want to predict the best treatment option.
- *Drug discovery*: Machine learning has many applications to drug discovery. It can be used to predict the activity of a chemical compound, toxicity, and pharmacokinetic properties.

Here are some specific example datasets we'll use during this course:

- Neonatal brain volumes. With this dataset we are interested in the relation between brain volumes and gestational ages for premature babies.
- Heart disease. With this dataset we are interested in predicting if the patient has heart disease, and the seriousness of the disease, based on measurements of heart function.
- MRI images.
- And many others