### Introduction to Machine Learning

We will

* Outline the three basic types of machine learning
* Set out the notation we will use in the course

The material comes from Chapter 1 of *Python Machine Learning* by Raschka and Mirjalili.

### Three ways to learn

Traditionally ML is broken up into three types: supervised learning, unsupervised learning, and reinforcement learning.

![types](types.png)


### Online vs Offline

Learning can furthermore be "offline" or "online".

* **Online**: The dataset is continually updated as more data comes in, and the learner updates as the data arrives.

* **Offline**: The dataset is complete before learning starts.

All three categories mentioned above have online and offline techniques.
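
The difference can be sketched with a toy "learner" that only estimates a mean. This is a stand-in for fitting a real model, and the data and the names `offline_mean` and `OnlineMean` are made up for illustration.

```python
def offline_mean(values):
    """Offline: the whole dataset is available before we compute anything."""
    return sum(values) / len(values)


class OnlineMean:
    """Online: keep a running estimate and update it as each value arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Move the current estimate toward the new value by 1/count of the gap.
        self.mean += (value - self.mean) / self.count
        return self.mean


stream = [2.0, 4.0, 6.0, 8.0]  # made-up data

running = OnlineMean()
for value in stream:
    running.update(value)      # one update per arriving data point

batch_result = offline_mean(stream)
online_result = running.mean
```

Both approaches reach the same estimate here, but the online version never needs the full dataset in memory at once.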


### Supervised Learning

In supervised learning we are trying to predict a target variable from a list of features.

Examples:

* Given height, weight, age, BMI, income, zip code, etc. of a person, predict their five year cancer risk.

* Given a 128x128 greyscale image, predict whether the image is of a cat.

* Given a 128x128 greyscale image of a cow, predict its weight.

Supervised learning happens when the *right answer* is included in the data we have.

Sometimes this is called **labeled** data.  

The labels are often, but not always, supplied by humans.




### Supervised Learning: Discrete Target

Supervised learning when the target variable is discrete is called **classification**.  

There are finitely many classes to which a data point might belong.

Examples:

* Given a 128x128 greyscale image, predict whether the image is of a cat.
* Classify an email as spam or not spam.
* Classify a political blog as conservative or liberal.
* Classify a tweet as happy, sad, angry, or neutral.
* Classify behavior on a server as normal or malicious.

In each case the true label is one of finitely many possibilities.

![classification](classification.png)
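
A minimal classification sketch, assuming made-up two-feature data and a nearest-centroid rule (one simple classifier among many, chosen here only because it is short). The function names `fit_centroids` and `predict` are hypothetical.

```python
import numpy as np

# Made-up labeled data: two features per point, labels 0 or 1.
X_train = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
                    [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])


def fit_centroids(X, y):
    """Return the class labels and the mean feature vector of each class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(X, classes, centroids):
    """Assign each row of X to the class with the closest centroid."""
    # distances[i, k] = Euclidean distance from point i to centroid k
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(distances, axis=1)]


classes, centroids = fit_centroids(X_train, y_train)
X_new = np.array([[1.0, 1.0], [3.0, 3.0]])
predictions = predict(X_new, classes, centroids)
```

The prediction for each new point is always one of the finitely many labels seen in training.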

### Supervised Learning: Continuous Target

Supervised learning when the target variable is a continuous quantity is called **regression**.

Examples:

* Given a 128x128 greyscale image of a cow, predict its weight.
* Given a chemical description of concrete, predict how long it will take to dry.
* Given a description of the microbiome of a dead body, predict how long it has been dead.
* Given the online behavior of an individual, predict his or her net worth.

In each case the variable to be predicted takes values in a continuous range rather than in a finite set of classes.

![regression](regression.png)
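
A minimal regression sketch, assuming made-up data with one feature and a straight-line model fitted by least squares. The numbers are invented for illustration.

```python
import numpy as np

# Made-up data: one feature x and a continuous target y that is roughly 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix with a column of ones so the model can learn an intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [slope, intercept] ≈ y.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)


def predict(x_new):
    """Predict a real-valued target for a new feature value."""
    return slope * x_new + intercept


prediction = predict(5.0)
```

Unlike a classifier, `predict` can return any real number, not just one of a fixed set of labels.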


### Supervised ML, Overview

![supervised](supervised.png)

### Unsupervised Machine Learning

In unsupervised machine learning, we do not know the value of the target variable for the data instances.

Examples:

* A bunch of emails
* A bunch of images of stuff
* Data gathered from the James Webb Space Telescope
* Logs from a server in a datacenter

Notice that unlabeled data is **much** more readily available than labeled data.

But what can you do with it?



### Unsupervised ML: Clustering

Sometimes the data can be neatly organized into different types.  These are called **clusters** and discovering them is called **clustering**.

![rocks](https://images.unsplash.com/photo-1507832321772-e86cc0452e9c?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxzZWFyY2h8MXx8cm9ja3N8ZW58MHx8MHx8&w=1000&q=80)

![clustering](clustering.png)
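
One common clustering method is k-means. The sketch below is a bare-bones version on made-up data. The naive initialization from the first `k` rows is an assumption made to keep the example deterministic, not a recommended practice.

```python
import numpy as np


def kmeans(X, k, n_iter=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    # Naive initialization (an assumption of this sketch): use the first k rows.
    centroids = X[:k].astype(float)
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave a centroid in place if it has no points
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Made-up unlabeled data with two obvious groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
```

No labels were given to the algorithm. The cluster indices it returns are arbitrary names for the groups it found.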

### Unsupervised ML: Dimensionality Reduction

Often data appears to be high dimensional but actually mostly lies on a low dimensional manifold.

![dimred](dimred.png)

There are automatic methods of finding low dimensional representations of data.

Even if the data is inherently high dimensional, we can still find the most informative low dimensional representation.

This can speed up learning and provide interesting visualizations.
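
One such automatic method is principal component analysis (PCA). The sketch below computes it with a singular value decomposition on made-up data that is three dimensional but lies close to a line. The data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 points in 3 dimensions that lie close to a 1-D line.
t = rng.normal(size=200)
direction = np.array([1.0, 2.0, 2.0]) / 3.0  # unit vector along the line
X = np.outer(t, direction) + 0.01 * rng.normal(size=(200, 3))

# Center the data, then take the SVD; rows of Vt are the principal directions.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Fraction of total variance captured by each principal direction.
explained = S**2 / np.sum(S**2)

# Project onto the first principal direction: a 1-D representation of the data.
Z = X_centered @ Vt[:1].T
```

Here almost all of the variance falls on the first direction, so the single column `Z` keeps nearly all the information in `X`.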


### Unsupervised ML: Hierarchical Clustering

Sometimes things don't just cluster... they cluster "hierarchically".

The tree of life is an example.

![hcluster](hcluster.png)
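
A hierarchy can be built bottom-up by repeatedly merging the two closest clusters. The sketch below uses single linkage (distance between the closest pair of members), which is one of several possible choices, on made-up one-dimensional data. The function name `single_linkage` is hypothetical.

```python
import numpy as np


def single_linkage(X):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns the merges as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of row indices into X.
    """
    clusters = [(i,) for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


# Made-up data: two tight pairs and one far-away point.
X = np.array([[0.0], [1.0], [5.0], [5.5], [20.0]])
merges = single_linkage(X)
```

The sequence of merges, read from first to last, describes a tree: tight pairs join first, then the pairs join each other, and the outlier joins last.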

### Reinforcement Learning

In reinforcement learning the learner is an **agent** that interacts with an **environment**.  The agent encounters **states** of the environment and responds with **actions**.  In response to the action, the environment issues a positive or negative **reward** and updates the state.

![reinforcement](reinforcement.png)

Examples:

* An agent learns to play Atari games by playing them over and over. Environment: images from the game.

* An agent learns to play chess or Go by playing against itself over and over ([AlphaZero](https://en.wikipedia.org/wiki/AlphaZero)).  It beat Stockfish, the previous world champion engine, in a shock upset.

* An agent learns to rotate a cube in a robotic hand.
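
The agent/environment loop can be sketched with a toy problem. The corridor environment below is made up, and the learning rule is tabular Q-learning, one standard algorithm that is not otherwise covered here. All names and parameter values are assumptions for illustration.

```python
import random

N_STATES = 5            # states 0..4 in a corridor; state 4 is the goal
ACTIONS = (-1, +1)      # move left or move right


def step(state, action):
    """Environment: apply the action, return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=200, seed=0):
    """Tabular Q-learning with an epsilon-greedy agent."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)          # explore
            else:
                # exploit, breaking ties between equal values at random
                action = max(ACTIONS, key=lambda a: (q[(state, a)], rng.random()))
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q


q = train()
# Greedy policy: the best-valued action in each non-goal state.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
```

Nobody tells the agent that "right" is correct. It discovers this from the rewards the environment issues.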



### Terminology and Notation

There are many synonymous notations and names because ML emerged out of many disciplines.

We think of datasets as matrices.  The rows are the **instances** or **data points**. The columns are **features**.

In supervised learning there is often a **target** column.  The aim of supervised learning is to predict the target column using the other columns.

![terminology](termnote.png)


### Mathematical Notation


Suppose we have some sample of data ${\bf {x}}^{(1)},{\bf{x}}^{(2)},\ldots,{\bf {x}}^{(n)}$ for some integer $n$ drawn from some natural source.

Typically for all $i =1,2,\ldots,n$, ${\bf {x}}^{(i)} \in \mathbb{R}^d$ for some $d \in \mathbb{N}$.  

The integer $d$ is the *dimension* of the data.

The ${\bf {x}}^{(i)}$ might be virtually anything.  Some examples...

* Pictures of cats
* Measurements of shellfish parts
* Measurements of flower parts
* *etc*

We think of the data as being arranged in one large matrix $X$ with $n$ rows and $d$ columns.  The $i$th row of $X$ is the data point ${\bf {x}}^{(i)}$.  (Because all vectors are column vectors by default, the $i$th row of $X$ is technically ${\bf {x}}^{(i)T}$, the transpose of ${\bf {x}}^{(i)}$.)

Usually we have $n \gg d$ and the matrix $X$ is "tall and skinny".
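
In code, the matrix $X$ is typically a two-dimensional array. A small sketch with made-up numbers, assuming NumPy:

```python
import numpy as np

# Made-up dataset: n = 5 data points, each with d = 3 features.
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
    [6.5, 3.0, 5.8],
])

n, d = X.shape          # n rows (instances), d columns (features)

# NumPy indexes from 0, so the first data point x^(1) is X[0].
x_1 = X[0]

# A single feature across all data points is a column of X.
first_feature = X[:, 0]
```

Note the off-by-one between the mathematical index $i = 1, \ldots, n$ and the array index `0, ..., n - 1`.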


![X matrix](Xmat.png)

![more notation](moreNote.png)

### Feature Engineering

Suppose you are in the position of selecting the features of an iris flower that will be useful for predicting the species.

How would you go about selecting them?

What about a column that is the **ratio** of sepal length to petal length? 

There can be additional computed columns in addition to the base observations.
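
A computed column like the ratio above can be added in a couple of lines. The measurements below are made up, and the column order is an assumption for illustration.

```python
import numpy as np

# Made-up measurements in cm: columns are sepal length and petal length.
X = np.array([
    [5.1, 1.4],
    [7.0, 4.7],
    [6.3, 6.0],
])

sepal_length = X[:, 0]
petal_length = X[:, 1]

# Engineered feature: ratio of sepal length to petal length.
ratio = sepal_length / petal_length

# Append it as a new column, giving a matrix with d + 1 features.
X_extended = np.column_stack([X, ratio])
```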

In a neural network often the lower layers transform the raw observations into a feature set that the higher layers process.

For example the first few layers might standardize an image of a handwritten digit (centering it, straightening it, removing noise, removing color).

Often we do not understand exactly how this "automatic" feature engineering works. 

![fe](https://www.researchgate.net/publication/337619475/figure/fig2/AS:830399039156224@1574993967836/The-general-structure-of-convolutional-neural-network-Convolutional-neural-network.png)

### Cost/Benefit of more features

As you add features you have more information about the problem, and the problem usually becomes easier, at least on the training set.  

But when $X$ is "wide" and "short", spurious correlations with the target variable can occur. 

This is a simple form of "overfitting" which becomes more problematic as the number of features increases.
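
A small simulation illustrates the effect. Everything below is random noise, so any correlation between a feature and the target is spurious by construction. The sizes (20 rows, 1000 columns) are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 20                                   # few data points ("short")
X = rng.normal(size=(n, 1000))           # many pure-noise features ("wide")
y = rng.normal(size=n)                   # target unrelated to every feature

# Pearson correlation of each column of X with y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Strongest apparent correlation when only the first k features are available.
strongest = {k: float(np.max(np.abs(corr[:k]))) for k in (5, 50, 1000)}
```

The more noise features there are to choose from, the stronger the best-looking (but meaningless) correlation becomes.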

More features also usually means more training time. 

Also, the points become "further apart" in high-dimensional space, and it is harder to see what the form of the data is.

![curse](http://www.visiondummy.com/wp-content/uploads/2014/04/curseofdimensionality.png)
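
The growth in distance can be checked numerically. The sketch below samples random points in the unit cube and measures the average distance between pairs as the dimension grows. The sample size and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)


def mean_pairwise_distance(n, d):
    """Average Euclidean distance between n random points in the unit cube [0, 1]^d."""
    X = rng.random((n, d))
    diffs = X[:, None, :] - X[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    # Average over distinct pairs only (exclude the zero diagonal).
    return float(distances[np.triu_indices(n, k=1)].mean())


results = {d: mean_pairwise_distance(100, d) for d in (2, 10, 100)}
```

With the same number of points, the typical gap between neighbors widens as `d` increases, so the data covers the space more and more sparsely.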


### Models

A model in ML refers to a learning algorithm plus a particular set of weights or parameters.

Neural networks are a class of models.

A specific neural network with specific values for its weights is a model.
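
The same distinction can be sketched in code. For brevity the example uses linear models rather than neural networks, and the class name `LinearModel` and all weights are made up.

```python
import numpy as np


class LinearModel:
    """One member of the class of linear models, fixed by its weights and bias."""

    def __init__(self, weights, bias):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = float(bias)

    def predict(self, X):
        # Prediction for each row x of X is w . x + b.
        return X @ self.weights + self.bias


# Same class of models, two different parameter settings, so two different models.
model_a = LinearModel(weights=[1.0, 0.0], bias=0.0)
model_b = LinearModel(weights=[0.5, 2.0], bias=-1.0)

X = np.array([[2.0, 1.0], [0.0, 3.0]])
pred_a = model_a.predict(X)
pred_b = model_b.predict(X)
```

The Python class plays the role of the class of models. Each instance, with its own weights, is one model.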


![per](https://www.allaboutcircuits.com/uploads/articles/how-to-train-a-basic-perceptron-neural-network_rk_aac_image1.jpg)


### Model Selection

Model **selection** refers to the process of finding the best model for the data you have.

Usually you would pick a class of models (e.g. linear regression).

Then, using the data, you pick a specific model from this class.

Your hope is that this will be the best performing model from your class on **new** data.
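
A toy version of this process, assuming made-up data and a small hand-picked list of candidate lines. In practice the search over the class is done by an optimization procedure rather than by listing candidates.

```python
import numpy as np

# Made-up training data, roughly y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# The class of models: lines y = slope * x + intercept.
# Candidate members of the class, one per (slope, intercept) pair.
candidates = [(s, b) for s in (0.0, 1.0, 2.0, 3.0) for b in (0.0, 1.0, 2.0)]


def training_mse(params):
    """Mean squared error of one candidate line on the training data."""
    slope, intercept = params
    return float(np.mean((slope * x + intercept - y) ** 2))


# Model selection: keep the candidate with the smallest training error.
best = min(candidates, key=training_mse)
```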


### Model Evaluation

Model **evaluation** refers to estimating how well a model will perform **out of sample**.  That means on **new data** that it has never seen before.

While we can use "best performance on the training set" for model selection, we cannot use training set performance for model evaluation.

That is because good performance on the training set is due partly to the model genuinely working well and partly to luck.

From the training set alone, we cannot distinguish the two.

When the model sees new data, the "luck" goes away, and performance typically drops.

**Model evaluation** must be done with **test data** which is set aside and not used in training.
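
The gap between training and test performance can be seen in a small simulation. The data is made up (20 features, only the first of which matters), and the 30/1000 split sizes are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset: 20 features, but only the first one affects the target.
n, d = 1030, 20
X = rng.normal(size=(n, d))
y = 3.0 * X[:, 0] + rng.normal(size=n)

# Set aside test data that is never used in training.
X_train, y_train = X[:30], y[:30]
X_test, y_test = X[30:], y[30:]

# Fit a linear model (no intercept, for brevity) on the training data only.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def mse(X_part, y_part):
    """Mean squared error of the fitted model on one part of the data."""
    return float(np.mean((X_part @ w - y_part) ** 2))


train_mse = mse(X_train, y_train)   # optimistic: the model has seen these points
test_mse = mse(X_test, y_test)      # estimate of out-of-sample performance
```

The training error looks better than the test error because the model has partly fitted the noise in the 30 training points.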
