# Lecture 1
## Machine Learning Examples

Run the following cell with `Shift + Enter` to watch the video.

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/2b9da809" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>')

## Organization of this Course

Run the following cell with `Shift + Enter` to watch the video.

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/0ff9ee1d" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>')

## The Life Expectancy Dataset
Run the following cell with `Shift + Enter` to watch the video.

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/cda29cc5" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>')

In this section you will load the life expectancy dataset,
look at it and produce some plots.
You will learn about the `R` functions `file.path`, `read.csv`, `str`, `?`,
`<-`, `$`, `:`, `c`, `pdf`, `par`, `plot`.

Let us load the life expectancy dataset from the csv file.
You can run the following cell by clicking it and pressing `Shift + Enter`.

In [None]:
data <- read.csv(file.path("..", "data", "life_expectancy.csv"))
str(data)

- We used the function `file.path` to generate a valid path
  in an operating-system-independent way.
- We used the function `read.csv` and assigned the output to the variable `data`.
- In `R` it is common to use `<-` for (left-)assignment,
  but `=` can also be used.
- On the second line we used the function `str`
  to look at the names and data types of the columns of `data`;
  if you just want to extract the names, you can use the function `names`.
- In `R` the dot `.` does not have any special meaning
  and can be used in any variable or function name.
- `R` usually has excellent documentation. You can access it with `?`, e.g.

In [None]:
?read.csv

In [None]:
?"<-"

Let us actually look at the data.

In [None]:
data$LifeExpectancy

- The output is a vector with the life expectancies
  measured in different countries and in different years.
- We can access the data in the columns of `data` by using the extraction
  operator `$`. Type `?"$"` in the empty field below and have a look at the
  examples at the bottom of the documentation.
- We could have also accessed this data with `data[,6]`. Try it out.
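
As a minimal sketch of these access patterns, the following cell uses a small made-up data frame `toy` (not the life expectancy data; its column names and values are invented for illustration).

```r
# Hypothetical toy data frame; names and values are made up for illustration.
toy <- data.frame(Country = c("A", "B", "C"),
                  LifeExpectancy = c(71.2, 80.5, 65.0))

by.name  <- toy$LifeExpectancy        # extract a column with `$`
by.index <- toy[, 2]                  # the same column, selected by position
by.str   <- toy[["LifeExpectancy"]]   # the same column, selected by a string

names(toy)                            # only the column names
identical(by.name, by.index)          # TRUE: both give the same numeric vector
```

Selecting by name is usually more robust than selecting by position, because it keeps working if the column order changes.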

Let us continue to explore the data.
First we look at rows 30 to 40.

In [None]:
data[30:40,]

Do you wonder what `NA` means? Look it up with `?NA`.
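
To get a feeling for how missing values behave in computations, here is a small sketch with a made-up vector `v`.

```r
v <- c(1, NA, 3)                      # a vector with one missing value
missing <- is.na(v)                   # FALSE TRUE FALSE
m.default <- mean(v)                  # NA: missing values propagate by default
m.dropped <- mean(v, na.rm = TRUE)    # 2: missing values are removed first
```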

With the following command we look at rows 13, 33, 41 and 72.

In [None]:
data[c(13, 33, 41, 72),]

The combine function `c` is important to know.
You may want to have a look at its documentation
or play with some examples, like

In [None]:
x <- c(1, 2, 3)
y <- c(4, 5, 6)
x + y
c(x, y)

Before you move on to the next video, you will find in the cell below
the code to generate the figures used in the slides.
The first and the last line are commented out,
so that the plots are shown in this notebook
instead of being written to the pdf.
Use the documentation if you want to know more
about the usage of the functions `pdf`, `par` and `plot`.

In [None]:
# pdf("life_expectancy_example_plots.pdf", width = 5.8, height = 2.8)
par(mfcol = c(1, 3), cex = .7)
plot(data$GDP, data$LifeExpectancy, xlab = "GDP per capita [USD]",
                                    ylab = "Life Expectancy [Years]",
                                    xlim = c(0, 100000))
plot(data$BMI, data$LifeExpectancy, xlab = "BMI [kg/m^2]",
                                    ylab = "Life Expectancy [Years]")
plot(data$Year, data$LifeExpectancy, ylab = "Life Expectancy [Years]",
                                     xlab = "Year", xlim = c(1999, 2016))
# dev.off()

## Error Decomposition and Parametric versus Non-parametric Methods

In [None]:
IRdisplay::display_html('<iframe width="640" height="360" src="https://tube.switch.ch/embed/55a9ae3c" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>')

### Artificial Data Generation Process

With the following code we define the custom function `f`.

In [None]:
f <- function(x) {x^2}

You may want to look at the documentation `?"function"`,
evaluate `f` at different points or create your own function.

In [None]:
f(3)

In the slides you saw data generated with the following function `myfunc`.

In [None]:
myfunc <- function(x) {sin(2*x) + 2*(x - 0.5)^3 - 0.5*x}

We will generate `N = 60` data points.

In [None]:
set.seed(12)
N <- 60
x <- sort(runif(N))
error <- .06*rnorm(N)
y <- myfunc(x) + error
par(mfcol = c(1, 1))
plot(x, y)

In the first line we set the pseudo-random number generator seed to 12.
This means that we will obtain the same pseudo-random numbers
every time we run the code above.
The functions `runif` and `rnorm` generate uniformly and normally distributed
pseudo-random numbers, respectively. And the function `sort`, well, does the
obvious :)

You may want to convince yourself that
running the cell above always gives the same data.
What happens if you remove the first line or replace it by `set.seed(123)`?

If you feel you don't understand something in the code above,
it would be a good idea to insert a cell below
(you can use e.g. the + button above) and experiment a bit.
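
One such experiment could look as follows. This is only a sketch; the variable names `a`, `b` and `d` are arbitrary.

```r
set.seed(12)
a <- runif(3)          # three uniform pseudo-random numbers
set.seed(12)           # resetting the same seed restarts the same sequence
b <- runif(3)
set.seed(123)          # a different seed starts a different sequence
d <- runif(3)

identical(a, b)        # TRUE
identical(a, d)        # FALSE
```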

### Parametric Method

As an example of a parametric method to estimate the function `myfunc`,
we will fit a linear function to the data.

In [None]:
linear.fit <- lm(y ~ x)
plot(x, y)
curve(myfunc, 0, 1, col = 'blue', add = TRUE, lwd = 2)
abline(linear.fit, col = "dark green", lwd = 2)

The cryptic-looking `linear.fit <- lm(y ~ x)` means simply:
"fit a linear model with response `y` and predictor `x`
and assign the result to variable `linear.fit`".
The functions `curve` and `abline` plot the true function
and the fitted line.
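
To see what a fitted linear model contains, it can help to fit one to data where the answer is known. The sketch below uses made-up, noise-free points on the line $y = 1 + 2x$; the names `x.toy`, `y.toy` and `toy.fit` are chosen for illustration only.

```r
# Made-up, noise-free data on the line y = 1 + 2 * x.
x.toy <- c(0, 1, 2, 3)
y.toy <- 1 + 2 * x.toy

toy.fit <- lm(y.toy ~ x.toy)   # response y.toy, predictor x.toy
coef(toy.fit)                  # intercept and slope, here 1 and 2 up to rounding
```

The two coefficients are the parameters of this parametric method: once they are estimated, the whole fitted function is determined.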

### Non-Parametric Method

As an example of a non-parametric method to estimate the function `myfunc`,
we define the `kNN` method with three mandatory arguments and one optional
argument with default value `k = 2`.

In [None]:
kNN <- function(x0, x, y, k = 2) { # test point x0, predictor values x, response values y, default k = 2 nearest neighbours
    d = abs(x - x0)                # compute all distances between the test point and the data
    o = order(d)                   # compute the order of the distances (smallest to largest)
    mean(y[o[1:k]])                # take the average response of the k nearest neighbours
}

If you feel comfortable with the `kNN` function you can skip this paragraph and
move to the next cell. Otherwise, create a new cell below and experiment a bit,
e.g. `tmp.x <- c(5, 2, 3, 1)`, `tmp.x0 <- 1.4`, `abs(tmp.x - tmp.x0)`,
`order(tmp.x)`, `kNN(tmp.x0, tmp.x, c(1, 2, 3, 4), k = 1)`.
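
Worked through step by step, such an experiment could look like this. The sketch repeats the definition of `kNN` so that it runs on its own; `tmp.y` is a made-up response vector.

```r
kNN <- function(x0, x, y, k = 2) {   # same logic as the definition above
    d <- abs(x - x0)
    o <- order(d)
    mean(y[o[1:k]])
}

tmp.x  <- c(5, 2, 3, 1)
tmp.y  <- c(1, 2, 3, 4)
tmp.x0 <- 1.4

d <- abs(tmp.x - tmp.x0)   # distances: 3.6 0.6 1.6 0.4
o <- order(d)              # 4 2 3 1: the 4th point is closest, then the 2nd
kNN(tmp.x0, tmp.x, tmp.y, k = 1)   # 4: the response of the nearest point
kNN(tmp.x0, tmp.x, tmp.y, k = 2)   # 3: the mean of the responses 4 and 2
```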

In [None]:
grid <- seq(0, 1, length.out = 1000)
y.hat <- sapply(grid, kNN, x, y, k = 1)
plot(x, y)
curve(myfunc, 0, 1, col = 'blue', add = TRUE, lwd = 2)
lines(grid, y.hat, col = 'red', lwd = 2)

With the `seq` function we generated 1000 evenly spaced points in the interval
[0, 1]. The `sapply` function lets us apply the function `kNN` to all values of
`grid`; from the third argument onward, the `sapply` function passes the arguments
to the function `kNN`, i.e. you can change the `k` to 50, for example, if you
want to see the figure of the slide on the curse of dimensionality.
It is highly recommended to execute the cell above for different values of `k`
between 1 and 50 and observe how well the curve fits the data.
We call kNN with `k = 1` a flexible method, because it can fit very rough
data with many jumps, and kNN with `k = 50` an inflexible method, because it can
fit only rather smooth data.
We will later assess the kNN method with different values of `k` more formally.
If you want to better understand the `sapply` function, it may be worthwhile to
experiment a bit, e.g. `tmp.f <- function(x, y) { x + y }; tmp.x <- c(1, 2, 3);
sapply(tmp.x, tmp.f, 2)`, or look at its documentation.
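
As a sketch of what `sapply` does with the extra arguments, the cell below compares it with an explicit loop. The names `res.sapply`, `res.named` and `res.loop` are made up for this illustration.

```r
tmp.f <- function(x, y) { x + y }
tmp.x <- c(1, 2, 3)

res.sapply <- sapply(tmp.x, tmp.f, 2)       # calls tmp.f(1, 2), tmp.f(2, 2), tmp.f(3, 2)
res.named  <- sapply(tmp.x, tmp.f, y = 2)   # extra arguments can also be passed by name

res.loop <- numeric(length(tmp.x))          # the same computation as an explicit loop
for (i in seq_along(tmp.x)) {
    res.loop[i] <- tmp.f(tmp.x[i], 2)
}
res.sapply                                  # 3 4 5
```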

Take a little moment to think about the definitions of the reducible and the
irreducible error as well as parametric and non-parametric methods.
When you are ready, move over to [the quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1088128).

## Assessing Model Accuracy

In [None]:
IRdisplay::display_html(' <iframe width="640" height="360" src="https://tube.switch.ch/embed/e7180ba6" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>')

### Assessing Model Accuracy with Artificial Data
Let us generate a test set with the same generative process as above.

In [None]:
set.seed(42)
N.test <- 10^3
x.test <- sort(runif(N.test))
error.test <- 0.06 * rnorm(N.test)
y.test <- myfunc(x.test) + error.test

Now we compute the training error for the linear model. To do so we first
compute the predicted responses `y.pred`.

In [None]:
y.pred <- predict(linear.fit, data.frame(x = x))
1/length(y) * sum((y - y.pred)^2)

and the test error for the linear model

In [None]:
y.test.pred <- predict(linear.fit, data.frame(x = x.test))
1/length(y.test) * sum((y.test - y.test.pred)^2)

The predict function takes as first argument some fitted model, like the
`linear.fit` we obtained above. As a second argument it expects a `data.frame`
with some values in a column called `x`.
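
The following sketch illustrates two details on made-up data (`x.toy`, `y.toy`): the column of the `data.frame` passed to `predict` has to carry the name of the predictor used in the formula, and the mean squared error written as `1/length(y) * sum(...)` is the same as `mean(...)` of the squared residuals.

```r
x.toy <- c(1, 2, 3, 4)
y.toy <- c(1.1, 1.9, 3.2, 3.8)
toy.fit <- lm(y.toy ~ x.toy)

# The new data must name its column after the predictor in the formula (x.toy).
y.toy.pred <- predict(toy.fit, data.frame(x.toy = x.toy))

mse.sum  <- 1/length(y.toy) * sum((y.toy - y.toy.pred)^2)
mse.mean <- mean((y.toy - y.toy.pred)^2)    # shorter, same value
all.equal(mse.sum, mse.mean)                # TRUE
```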

Let us do the same with the kNN method.

In [None]:
y.pred.kNN <- sapply(x, kNN, x, y, k = 5)
1/length(y) * sum((y - y.pred.kNN)^2)

In [None]:
y.test.pred.kNN <- sapply(x.test, kNN, x, y, k = 5)
1/length(y.test) * sum((y.test - y.test.pred.kNN)^2)

As expected, the kNN method has a much lower training and test error than the
linear method.

Let us now investigate how the training and test errors depend on the choice of
`k`. To do so, we will define the following `assess.kNN` function, and evaluate
it with different `k`.

In [None]:
assess.kNN <- function(k, data.train, data.test) {
    x <- data.train$x
    y <- data.train$y
    x.test <- data.test$x
    y.test <- data.test$y
    y.pred <- sapply(x, kNN, x, y, k = k)
    error.train <- 1/length(y) * sum((y - y.pred)^2)
    y.test.pred <- sapply(x.test, kNN, x, y, k = k)
    error.test <- 1/length(y.test) * sum((y.test - y.test.pred)^2)
    c(error.train, error.test)
}

This function takes `k`, a training set and a test set as input and returns both
the training and the test error.

We will evaluate this function now for different values of `k`.

In [None]:
ks <- seq(1, 20)
errors.kNN <- sapply(ks, assess.kNN, data.frame(x, y), data.frame(x = x.test, y = y.test))
plot(ks, errors.kNN[1,], col = "red", ylab = "MSE", xlab = "k", type = "b")
points(ks, errors.kNN[2,], col = "blue", type = "b")
abline(h = 0.0036, lty = 2)
legend("bottomright", legend = c("training error", "test error", "irreducible error"),
       col = c("red", "blue", "black"), lty = c(1, 1, 2))

We repeat here three important observations we already made in the video:
1. The test error is always above the irreducible error.
2. The very flexible method with `k=1` can perfectly fit the training data
(zero training error) but its test error is higher than the one of a less
flexible method, with `k = 5` for example.
3. For large `k`, both training and test error increase with decreasing flexibility of the method (increasing `k`).
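
The dashed line at `0.0036` in the plot above is the variance of the noise: the errors were generated as `0.06 * rnorm(N)`, so $\sigma^2 = 0.06^2 = 0.0036$. Even a prediction that equals the true function `myfunc` would have this expected test error. The sketch below checks this numerically; the seed and the sample size `1e5` are arbitrary choices.

```r
sigma <- 0.06
sigma^2                       # 0.0036, the height of the dashed line

set.seed(1)                   # arbitrary seed
eps <- sigma * rnorm(1e5)     # noise generated as in the cells above
mse.truth <- mean(eps^2)      # MSE of a prediction that equals the true function
mse.truth                     # close to sigma^2
```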

### Assessing Model Accuracy with Real Data

For real datasets we typically cannot easily generate an additional test set.
Common practice is therefore to split the dataset into two parts.
We will do this for the life expectancy dataset.
For now, we will only look at the GDP as input and the life expectancy as
output.

In [None]:
data1 <- na.omit(data[, c("GDP", "LifeExpectancy")])

The function `na.omit` removes all rows where either the GDP or the life
expectancy is not available (`NA`).
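
A small sketch of what `na.omit` does, on a made-up data frame `toy` with the same two column names:

```r
# Made-up values; rows 2 and 3 each contain one missing entry.
toy <- data.frame(GDP = c(100, NA, 300, 400),
                  LifeExpectancy = c(60, 70, NA, 80))

toy.complete <- na.omit(toy)   # keeps only rows without any NA
nrow(toy.complete)             # 2: only rows 1 and 4 are complete
```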

Now we split into training and test set.

In [None]:
set.seed(199)
idx.train <- sample(nrow(data1), nrow(data1)/2)
data1.train <- data1[idx.train,]
data1.test <- data1[-idx.train,]

There are `n = nrow(data1)` data points in total. After setting the seed, the
second line above samples `n/2` indices from the indices 1 to n (without
replacement). In the third and fourth line we extract every row with index
occurring in `idx.train` or its complement `-idx.train` to form the training
set `data1.train` and the test set `data1.test`.
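
To convince yourself that positive and negative indexing really split the rows into two disjoint parts, you can try it on a tiny made-up data frame. In this sketch `toy` has a single column `id`, which is why `drop = FALSE` is needed to keep the result a data frame.

```r
set.seed(199)                            # any seed works here
n.toy <- 10
toy <- data.frame(id = 1:n.toy)

idx <- sample(n.toy, n.toy/2)            # 5 distinct indices between 1 and 10
toy.train <- toy[idx, , drop = FALSE]    # rows whose index is in idx
toy.test  <- toy[-idx, , drop = FALSE]   # all remaining rows

length(intersect(toy.train$id, toy.test$id))      # 0: no row is in both sets
setequal(c(toy.train$id, toy.test$id), 1:n.toy)   # TRUE: together they cover all rows
```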

Next we fit a linear model to the training set, define a function to compute
the MSE and compute the training and test error.

In [None]:
fit <- lm(LifeExpectancy ~ GDP, data1.train)
mse <- function(fit, data) {
    1/nrow(data) * sum((data$LifeExpectancy - predict(fit, data))^2)
}
c(mse(fit, data1.train), mse(fit, data1.test))

Interestingly, the training error is higher than the test error. This is an
indication that the model is not flexible enough. We can also see this in the
plot.

In [None]:
plot(data1)
abline(fit, col = 'dark green', lwd = 2)

Instead of linear fits, we could use a quadratic fit.

In [None]:
q.fit <- lm(LifeExpectancy ~ poly(GDP, 2), data1.train)
c(mse(q.fit, data1.train), mse(q.fit, data1.test))

We used the function `poly(GDP, 2)` to fit a polynomial of degree 2, i.e.
$$\beta_0 + \beta_1 \mathrm{GDP} + \beta_2 \mathrm{GDP}^2.$$
The training error is still higher than the test error, but both decreased.
What does the plot look like?

In [None]:
plot(data1)
grid <- seq(min(data1$GDP), max(data1$GDP), length.out = 1000)
lines(grid, predict(q.fit, data.frame(GDP = grid)), col = 'orange', lwd = 2)
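
A side note on the `poly` call used for the quadratic fit: by default `poly` builds orthogonal polynomials rather than the raw powers $x$ and $x^2$, so the reported coefficients are not directly the $\beta_1$ and $\beta_2$ of the formula above. The fitted curve is the same in both parametrizations, though. The sketch below checks this on made-up data; `x.toy`, `y.toy` and the chosen coefficients are assumptions for illustration.

```r
set.seed(1)                                           # arbitrary seed
x.toy <- runif(50)
y.toy <- 1 + 2 * x.toy - 3 * x.toy^2 + 0.1 * rnorm(50)

fit.orth <- lm(y.toy ~ poly(x.toy, 2))                # default: orthogonal polynomials
fit.raw  <- lm(y.toy ~ poly(x.toy, 2, raw = TRUE))    # raw powers x and x^2

all.equal(fitted(fit.orth), fitted(fit.raw))          # TRUE: same fitted curve
coef(fit.raw)                                         # estimates of beta_0, beta_1, beta_2
```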

Hm, a quadratic model does not seem ideal either.
Let us move on to polynomials of arbitrary degree.
To do so we create the function `poly.fit` that takes the degree `d` and
training and test sets as input, and returns the training error and the test
error.

In [None]:
poly.fit <- function(d, data.train, data.test) {
    fit <- lm(LifeExpectancy ~ poly(GDP, d), data.train)
    # return only the two errors; c() would flatten a fit object into its components
    c(mse(fit, data.train), mse(fit, data.test))
}
ds <- 1:14
results.poly <- sapply(ds, poly.fit, data1.train, data1.test)
plot(ds, results.poly[1,], col = "red", ylab = "MSE", xlab = "d", type = "b")
points(ds, results.poly[2,], col = "blue", type = "b")
legend("topright", legend = c("training error", "test error"),
       col = c("red", "blue"), lty = c(1, 1))

Here, a polynomial of degree `d = 1` is the least flexible method and it
underfits the data, while the polynomial with degree `d = 14` is the most
flexible method we considered and it overfits the data.  We see again the
typical U-shaped curve of the test error: first it decreases with increasing
flexibility, but at some point it starts to increase again.  In contrast, the
training error decreases continually.
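
The behaviour of the training error can be reproduced in a setting where we control everything. The sketch below uses the artificial data generating process from earlier in this notebook instead of the life expectancy data; the degrees `1:6`, the seed and the helper name `mse.toy` are choices made for this illustration only.

```r
myfunc <- function(x) {sin(2*x) + 2*(x - 0.5)^3 - 0.5*x}   # as defined earlier
set.seed(12)
train <- data.frame(x = runif(60))
train$y <- myfunc(train$x) + 0.06 * rnorm(60)
test <- data.frame(x = runif(1000))
test$y <- myfunc(test$x) + 0.06 * rnorm(1000)

mse.toy <- function(fit, data) mean((data$y - predict(fit, data))^2)

ds.toy <- 1:6
errs <- sapply(ds.toy, function(d) {
    fit <- lm(y ~ poly(x, d), train)
    c(train = mse.toy(fit, train), test = mse.toy(fit, test))
})

errs["train", ]                        # training error for d = 1, ..., 6
all(diff(errs["train", ]) <= 1e-12)    # TRUE: it never increases with d
ds.toy[which.min(errs["test", ])]      # degree with the lowest test error in this run
```

The training error cannot increase here because every polynomial of degree `d` is also a polynomial of degree `d + 1`, so the least-squares fit with the higher degree is at least as close to the training data. No such guarantee holds for the test error.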

Let us look at how well the best performing polynomial (with degree `d = 11`)
fits the data:

In [None]:
plot(data1)
grid <- seq(min(data1$GDP), max(data1$GDP), length.out = 1000)
lines(grid, predict(lm(LifeExpectancy ~ poly(GDP, 11), data1.train),
                    data.frame(GDP = grid)), col = 'orange', lwd = 2)

In my opinion, this does not look like it comes very close to the true
generating function. The downswing for very high GDPs looks wrong, and the
wiggles at high GDPs, where there is only little data, do not look convincing
either. Maybe we have to try our luck with other methods.

Take a little moment to think about the definitions of the test and training
error and the flexibility of methods.
When you are ready, move over to [the quiz](https://moodle.epfl.ch/mod/quiz/view.php?id=1088139).

## Exercises

**Q1.** For each of parts (a) through (d), indicate whether we would generally
expect the performance of a flexible statistical learning method to be better or
worse than an inflexible method. Justify your answer.

a) The sample size $n$ is extremely large, and the number of predictors $p$ is small.

b) The number of predictors $p$ is extremely large, and the number of
observations $n$ is small.

c) The relationship between the predictors and response is highly non-linear.

d) The variance of the error terms, i.e. $\sigma^2 = \mathrm{var}(\epsilon)$, is extremely high.

**Q2.** Describe the differences between a parametric and a non-parametric
machine learning approach. What are the advantages of a parametric approach (as
opposed to a non-parametric approach)? What are its disadvantages?

**Q3.** In this exercise you will look at a [Parkinsons Telemonitoring Data
Set](https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring).
Navigate to that page to find some information about the dataset (if you click
on the link in the previous sentence it will typically open a new browser tab).
You can load the data with the following command.

In [None]:
data.parkinsons <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data")

In order to make progress in programming in R, I recommend that you refrain
from copy-pasting as much as possible.

We will only focus on the PPE feature and the total UPDRS as a response here.

a) Plot the PPE feature on the x-axis and the total UPDRS response on the y-axis.

b) Create a training and a test set.

c) Fit kNN to the training data for different values of k and compute the
training and the test error. For which value of k do you neither see
underfitting nor overfitting?

d) Estimate an upper bound for the irreducible error of this dataset.