# Introduction to Machine Learning

Let's look at the following questions:

1. What is machine learning?

2. How do we do machine learning?

3. What are some of the challenges and issues we need to consider when doing machine learning?

4. What are some of the biomedical applications of machine learning?

## 1. What is machine learning?

### Goal of machine learning

The goal of machine learning is to create a model that makes a prediction about something that we can observe. The model is a mathematical function that takes input (some information we can measure or collect about samples) and returns a prediction. Usually we will train our model by providing it with examples where we already know the correct answer (_supervised learning_). But we want our model to return good predictions even for new or unseen data. Sometimes we will use machine learning just to identify patterns in the data (_unsupervised learning_).

### Machine learning and AI

Machine learning is considered a branch of Artificial Intelligence. It is a type of predictive AI. AI includes other branches, such as generative AI and agent-based AI.

### Deep learning

Deep learning is just machine learning using deep neural networks as the type of mathematical model. It has been very successful at working with complicated input data, such as images. We will cover neural networks in the second half of the course.

### Supervised vs. unsupervised machine learning

When we do _supervised_ machine learning we have some training data for which we already know the answers: the true class of the data (or the class of the data that is assigned by the best method available), or the true value of some quantitative variable.

When we do _unsupervised_ machine learning we don't have the answers available. Most commonly this is _clustering_, where we apply a method to divide the data into some subsets (called clusters).
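As a rough illustration of clustering, here is a minimal sketch of k-means, which is only one of many clustering methods and is chosen here purely as an example. The function name `kmeans_1d`, the toy measurements and the starting centers are all hypothetical, and the data has a single feature to keep the code short.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Cluster 1-D values around the given starting centers (k-means)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(n_iter):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical measurements with no labels attached.
measurements = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, clusters = kmeans_1d(measurements, centers=[0.0, 6.0])
```

No answers were supplied here: the method only sees the measurements and groups them by similarity.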

Examples ...

Most of our machine learning will be supervised learning. We will look at unsupervised learning during Week 5.

### Fitting a model

When doing machine learning, we need to select the mathematical form of our model. Then we will apply an algorithm to find the best parameters for our model.
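To make the two steps concrete, here is a minimal sketch that assumes the simplest possible model form, a straight line `y = slope * x + intercept`, and finds its two parameters by ordinary least squares. The toy data and the function names `fit_line` and `predict` are made up for illustration.

```python
# Hypothetical toy data: x is a measured feature, y is the value to predict.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # generated from y = 2x + 1


def fit_line(xs, ys):
    """Return the (slope, intercept) that minimise the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted model to a new input."""
    return slope * x + intercept


slope, intercept = fit_line(xs, ys)
prediction = predict(6.0, slope, intercept)  # a value the model has not seen
```

Here the model form is the line, the parameters are `slope` and `intercept`, and the fitting algorithm is the closed-form least-squares solution.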

Hyperparameters

## 2. How do we do machine learning?

### Machine learning protocol

When we do machine learning, we normally follow a protocol (or pipeline) consisting of the following steps.

Step 1. Define the problem

Step 2. Prepare the data

Step 3. Exploratory data analysis

Step 4. Feature selection and extraction

Step 5. Creating training, validation and test sets

Step 6. Select model type and optimization algorithm

Step 7. Fit the model!

Step 8. Evaluate the model

#### Step 1. Define the problem

It is very important to understand and describe exactly what we are trying to predict. Are we trying to predict classes, e.g. does a patient have a disease or not? This is a classification problem. If we have two classes, it is a _binary classification_ problem. Sometimes we are interested in a _multiclass_ problem, with more than two classes. Are we trying to predict a number, e.g. risk? Then we have a _regression_ problem.

##### Classification versus regression

##### Supervised versus unsupervised

#### Step 2. Prepare the data

This is often the most time-consuming step of a machine learning protocol. We may need to pre-process the data so that it is in a format suitable for the type of machine learning we want to do. Sometimes we need to remove low-quality samples from the data, or handle some of the data in a special way.

Another type of data preparation that is sometimes useful is _data augmentation_. This is particularly important when we have a small amount of data. For example, if we have a set of images (assuming there is no natural orientation), we could rotate all of the images randomly and generate an augmented dataset with multiple rotations for each initial image.
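The rotation idea can be sketched as follows. As a simplifying assumption, this version uses the three fixed rotations of 90, 180 and 270 degrees rather than random angles, and it represents each image as a plain list of pixel rows. The function names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2-D image (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment_with_rotations(images):
    """Return every image together with its 90, 180 and 270 degree rotations."""
    augmented = []
    for image in images:
        current = image
        for _ in range(4):
            augmented.append(current)
            current = rotate90(current)
    return augmented


# Hypothetical 2x2 "image" with four pixel intensities.
tiny_image = [[1, 2],
              [3, 4]]
augmented_images = augment_with_rotations([tiny_image])
```

Each original image yields four samples, so the augmented dataset is four times the size of the original one.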

Data can be numerical or categorical. Categorical data could be binary or multiclass. [Discuss here methods for handling categorical data] [Discuss one-hot encoding]

#### Step 3. Exploratory data analysis

This is a very useful step whenever you are starting a machine learning protocol. We will use various techniques to summarize and visualize our data to gain an initial understanding of the data. This will help us to select the best methods to use for the rest of the protocol.

#### Step 4. Feature selection and extraction

Sometimes we can put the raw data directly into a machine learning method. But more commonly we need to convert the data into features.

Features are numbers that describe our data samples and will be used as input for machine learning. We will represent each sample as a vector, with each element of the vector being one of the features.

One common task will be to convert text data into numeric data. For example, we would transform "healthy" or "diseased" into 0 or 1.

One-hot encoding. Explain it here...
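A minimal sketch of both conversions is shown below: a binary mapping for the "healthy"/"diseased" case, and one-hot encoding, where each category gets its own 0/1 element in the feature vector. The function names and the blood-group values are hypothetical and only serve as an illustration.

```python
def encode_binary(labels, positive="diseased"):
    """Map the positive label to 1 and every other label to 0."""
    return [1 if label == positive else 0 for label in labels]


def one_hot(values):
    """Return the sorted categories and one 0/1 vector per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1  # switch on the element for this category
        vectors.append(vector)
    return categories, vectors


status = encode_binary(["healthy", "diseased", "healthy"])

# Hypothetical multiclass feature with three categories.
categories, vectors = one_hot(["A", "B", "O", "A"])
```

One-hot encoding avoids implying an order between categories, which a single integer code such as 0, 1, 2 would do.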

Examples ...

#### Step 5. Creating training, validation and test sets

This is a key step to help us create a model that we can evaluate for generalizability: how good the model would be at predicting new or unseen data.

The first division is between training and test sets. We will use the training set to find the best parameters with our optimization algorithm. We will use the test set to evaluate how well the model would perform on unseen or new data. When reporting the performance (such as accuracy) of the model it is important to use only the test set. Comparing performance on the training set with performance on the test set is a way to detect overfitting.

Often we will further divide the training set into a training set and a validation set. The validation set will be used to select _hyperparameters_, depending on the particular model we are using.

We need to select the size of the sets. For example, we could put 80% of the data in the training set, 10% in the validation set and 10% in the test set.

The simplest method to generate these sets is to just divide the data randomly. But more often we need to take some care when dividing the data. For example, we might need to ensure that there is about the same proportion of positive and negative samples in each set. Sometimes we will need more complicated methods to create the sets.
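The 80%/10%/10% split with matching class proportions can be sketched as below. This is a minimal illustration, not a recommended implementation: the function name `stratified_split`, the fixed random seed and the toy labels are assumptions, and the function returns sample indices rather than the samples themselves.

```python
import random


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train/validation/test sets, class by class."""
    rng = random.Random(seed)

    # Group the sample indices by class label.
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random division within each class
        n_train = round(fractions[0] * len(indices))
        n_validation = round(fractions[1] * len(indices))
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_validation]
        test += indices[n_train + n_validation:]  # whatever remains
    return train, validation, test


# Hypothetical dataset: 70 negative (0) and 30 positive (1) samples.
labels = [0] * 70 + [1] * 30
train, validation, test = stratified_split(labels)
```

Because each class is shuffled and divided separately, every set keeps roughly the 70/30 balance of the full dataset.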

Describe cross-validation

Testing on completely new and unseen data!

#### Step 6. Select model type and optimization algorithm

There are many different types of models available for both classification and regression problems. We will cover many of them during the course!

For some models we need to write down a loss function...

Each model describes a functional form and some parameters (or weights) that we need to discover using an optimization algorithm. The models can vary from a very simple linear model all the way to complicated neural networks with thousands or even billions of parameters!

The type of optimization algorithm to find the best parameters will depend on the type of model and our data. We also need to consider the time and computer resources that will be necessary for training.

For example, when fitting a neural network, we will use a variant of a gradient descent algorithm.
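Gradient descent can be sketched on a much smaller model than a neural network. The example below assumes a linear model `y = w * x + b` and a mean squared error loss, and repeatedly moves the parameters a small step against the gradient of the loss. The learning rate, the number of steps and the toy data are arbitrary choices for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = w * x + b by minimising the mean squared error."""
    w, b = 0.0, 0.0  # starting guess for the parameters
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step in the direction that decreases the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
```

A learning rate that is too large makes the updates diverge, while one that is too small makes training slow, so in practice it is treated as a hyperparameter to tune.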

Parameters vs hyperparameters... Explain the difference!

If we have many hyperparameters that may be relevant, we can use a grid search, which tries every combination of a chosen set of values for each hyperparameter.
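A grid search can be sketched as a loop over all combinations of candidate hyperparameter values. In the sketch below, `toy_validation_score` is a made-up stand-in for "fit the model with these hyperparameters and score it on the validation set", and the hyperparameter names `learning_rate` and `n_layers` are only examples.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(learning_rate, n_layers):
    """Hypothetical score; a real one would train and validate a model."""
    return -(learning_rate - 0.01) ** 2 - (n_layers - 3) ** 2


param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_layers": [1, 2, 3, 4],
}
best_params, best_score = grid_search(param_grid, toy_validation_score)
```

The number of combinations grows multiplicatively with each added hyperparameter (here 3 × 4 = 12 fits), which is the main cost of this approach.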

Describe cross-validation

Loss function

##### Neural Networks

#### Step 7. Fit the model!

We run our optimization algorithm and find the parameters...

#### Step 8. Evaluate the model

This is the most important step. We have multiple metrics that we can use to evaluate the performance of the model and to compare different models. 

There are many different performance metrics that will be appropriate for different kinds of problems. For classification problems, we will use accuracy, sensitivity and specificity [also recall and precision and F1-score; maybe introduce these first since they are the standard for machine learning?]. For regression problems we will use R² or RMSE (etc...)
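The metrics named above can be computed directly from true and predicted values. The following is a minimal sketch that assumes binary labels coded as 0 and 1, with at least one sample of each class in the data; the function names and toy values are hypothetical.

```python
import math


def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of true positives found
        "specificity": tn / (tn + fp),  # fraction of true negatives found
    }


def rmse(y_true, y_pred):
    """Root mean squared error for regression."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination (R squared) for regression."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical test-set results.
class_scores = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                      [1, 0, 1, 0, 0, 1, 0, 1])
regression_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
regression_r2 = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

Sensitivity is the same quantity as recall, so the two names can be used interchangeably when comparing results.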

Of course, we will need to go back and repeat the steps as many times as necessary until we have a model that is suitable.

## 3. What are some of the challenges and issues we need to consider when doing machine learning?

### Overfitting

### White box vs black box

### Extrapolation vs interpolation

### Handling outliers

### Handling missing or invalid data

We have multiple options to handle this situation:

1. Remove samples with missing or invalid values
2. Replace the bad values with a default value (e.g. the median of the whole dataset) (imputed data)
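Both options can be sketched in a few lines. This example assumes missing values are marked with `None`, that each sample is a row of numeric features, and that the replacement value is the median of each feature column; the function names and toy rows are hypothetical.

```python
import statistics


def drop_missing(rows):
    """Option 1: keep only the samples that have no missing values."""
    return [row for row in rows if None not in row]


def impute_median(rows):
    """Option 2: replace each missing value with the median of its column."""
    n_columns = len(rows[0])
    medians = [
        statistics.median(row[j] for row in rows if row[j] is not None)
        for j in range(n_columns)
    ]
    return [
        [medians[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical samples with two features each; None marks a missing value.
rows = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [4.0, 40.0],
]
complete_rows = drop_missing(rows)
imputed_rows = impute_median(rows)
```

Removing samples is the simpler option but shrinks the dataset, whereas imputation keeps every sample at the cost of introducing values that were never measured.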

### Handling poor quality data

### Dimensionality reduction: PCA & ICA

## 4. What are some of the biomedical applications of machine learning?

There are many biomedical applications of machine learning:

1. Diagnostic applications
2. Prognostic applications
3. Drug discovery

Here are some specific example datasets we'll use during this course:



[Add more examples]