# Introduction to supervised learning

## 1. Data

First, we need some data!  For this exercise, we'll use a dataset with information about the physical and performance characteristics of automobiles.  Our goal is to predict the fuel efficiency of a car, given its other attributes.  As usual, let's explore the data a bit before we move on to step 2.

In [3]:
import pandas as pd

df = pd.read_csv('data/auto_mpg.csv')
#df = pd.read_csv('/blue/zoo4926/share/Jupyter_Content/data/auto_mpg.csv')

## 2. Model

Next, we need a _model_ that specifies, in a very general sense, how the input data (features) are related to the output data.  For these data, it looks like a linear relationship might be a reasonable model.  Something like this, perhaps:

$$ \hat{mpg} = a + b \cdot hp. $$

Here, $ a $ and $ b $ are unknown constants, $ hp $ is our input, and $ \hat{mpg} $ is our output.

To simplify the notation, let's replace $ hp $ with $ x $ and $ \hat{mpg} $ with $ \hat{y} $.

$$ \hat{y} = a + bx $$

By now, you might be thinking, "Hey, this looks like linear regression!"  If so, you are correct!  Regression is one of the major problem areas in machine learning, and simple linear regression provides an excellent introduction to thinking about problems from a machine learning perspective.  So, we're going to implement linear regression, from first principles, using a machine learning approach.

## 3. Calculating the loss

Having specified a way to model the relationship between horsepower and fuel efficiency, our main problem now is choosing values for $a$ and $b$, the parameters of our model.  As a first step, consider this question: Given _some_ values of $a$ and $b$, and a single observation of $x$ and $y$, which we'll call $x_0$ and $y_0$, how can we evaluate how well our model works?  In other words, how far from "the truth" is our model's prediction, $\hat{y_0} = a + bx_0$?

One option is to calculate the _squared error_:

$$ SE = (y_0 - \hat{y_0})^2 . $$

For a variety of reasons (some of which we'll see in a moment), the squared error ends up being a good loss function for linear regression.  For a given set of $N$ observations, $(x_i, y_i),\ i = 1, \ldots, N$, we can calculate the average squared error over the entire dataset, which gives us the _mean squared error_:

$$ MSE = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y_i})^2 = \frac{1}{N} \sum_{i=1}^N (y_i - (a + bx_i))^2 .$$

One idea for choosing values for $a$ and $b$ is to find the values that make $MSE$ as small as possible.
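To make the definition concrete, here is a minimal sketch that evaluates the $MSE$ for two candidate lines.  The function name `mse_by_hand` and the four data points are made up for illustration; they are not taken from the automobile dataset.

```python
# Toy data, invented for illustration only (not from auto_mpg.csv).
# The points lie exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def mse_by_hand(a, b, xs, ys):
    """Average of the squared errors (y_i - (a + b*x_i))**2."""
    total = 0.0
    for x_i, y_i in zip(xs, ys):
        y_hat = a + b * x_i          # model prediction for this observation
        total += (y_i - y_hat) ** 2  # squared error for this observation
    return total / len(xs)

print(mse_by_hand(1.0, 2.0, xs, ys))  # the line that generated the data → 0.0
print(mse_by_hand(0.0, 2.0, xs, ys))  # same slope, intercept off by 1 → 1.0
```

The line with the smaller $MSE$ is the better fit, which is exactly the criterion we'll use to choose $a$ and $b$.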

## 4. Minimizing the loss

### a. Analytical solution

So, how do we find values for $a$ and $b$ that make $MSE$ as small as possible?  Let's explore that question by visualizing the loss function and using pencil and paper.

Okay, so we _could_ find an exact, analytical solution to our minimization problem without too much trouble, which is kind of neat.  The solution was:

$$ a = \bar{y} - b \bar{x} $$

$$ b = \frac{\sum_{i=1}^N {x_iy_i} - N\bar{x}\bar{y}} {\sum_{i=1}^N {{x_i}^2} - N\bar{x}^2} $$

(I figured I'd better put it here ahead of time in case I messed up the math in class.)
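
For reference, here is a sketch of the pencil-and-paper argument, using only the definition of $MSE$ given earlier.  At the minimum, both partial derivatives of $MSE$ must be zero.  Starting with $a$:

$$ \frac{\partial MSE}{\partial a} = -\frac{2}{N} \sum_{i=1}^N (y_i - a - bx_i) = 0 \quad \Rightarrow \quad a = \bar{y} - b \bar{x} . $$

Then for $b$:

$$ \frac{\partial MSE}{\partial b} = -\frac{2}{N} \sum_{i=1}^N x_i (y_i - a - bx_i) = 0 \quad \Rightarrow \quad \sum_{i=1}^N {x_iy_i} - aN\bar{x} - b\sum_{i=1}^N {{x_i}^2} = 0 . $$

Substituting $a = \bar{y} - b \bar{x}$ into the second condition and collecting the terms that contain $b$ gives

$$ \sum_{i=1}^N {x_iy_i} - N\bar{x}\bar{y} = b \left( \sum_{i=1}^N {{x_i}^2} - N\bar{x}^2 \right) , $$

which rearranges to the expression for $b$ above.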

**Challenge:** Implement a Python function to calculate values for $a$ and $b$, given vectors $\boldsymbol{x}$ and $\boldsymbol{y}$.  What estimates does it give for $a$ and $b$?
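
If you'd like something to compare your answer against, here is one possible sketch that translates the two formulas directly into NumPy.  The function name `fit_simple_linear` and the toy data are assumptions made for this example.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Closed-form least-squares estimates (a, b) for y_hat = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: (sum(x_i * y_i) - N * x_bar * y_bar) / (sum(x_i^2) - N * x_bar^2)
    b = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)
    # Intercept: y_bar - b * x_bar
    a = y_bar - b * x_bar
    return a, b

# Quick check on made-up data that lies exactly on y = 1 + 2x.
a_toy, b_toy = fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])
print(a_toy, b_toy)  # should be very close to 1.0 and 2.0
```

For the automobile data, the call would presumably look like `fit_simple_linear(df['hp'], df['mpg'])`, assuming the columns are named `hp` and `mpg` as in the plotting code later in this notebook.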

Do those numbers seem reasonable?  Let's graph our model and see.  We can use the plotnine geometry `geom_abline()` for this.

Okay, so there obviously is nothing ground-breaking about solving the linear regression problem.  But, I hope this exercise helps build some intuition about what "line of best fit" really means.  From a machine learning perspective, it is the line that minimizes our loss function!

### b. Numerical solution

We were quite fortunate to be able to solve our machine learning problem analytically.  Often, it will be impossible to find an exact solution to the problem of minimizing the loss over a given dataset.  In those cases, we have to use _numerical optimization_ to find an approximate solution.

Because we could obtain an exact solution to the linear regression problem, it makes an excellent test case for exploring numerical methods: we know the correct solution, so it is easy to check the results of our numerical solutions.

Numerical optimization is a computational cornerstone of machine learning.  My goal here is to introduce numerical optimization methods and illustrate how they work without going into much detail.  We will take a much closer look at the principles behind numerical optimization when we get to artificial neural networks.

For now, though, let's implement a numerical solution to the linear regression problem.  We'll begin by implementing our loss function as a Python function.

**Challenge:** Implement the $MSE$ loss function for simple linear regression in Python.  Your function should take as arguments values for $a$ and $b$ and the data vectors $\boldsymbol{x}$ and $\boldsymbol{y}$.

Instead of implementing a numerical optimization algorithm ourselves (don't worry, we'll do that later!), we'll use the sophisticated optimization routines available via the [`minimize`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize) function in the `scipy.optimize` package.  The basic interface is

`minimize(function, (starting_params), args=(other_args), method='METHOD')`.

In [None]:
from scipy.optimize import minimize
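
As a rough illustration of that interface, the sketch below fits a line to a small made-up dataset.  The function name `mse`, the toy arrays, the starting guess of zeros, and the choice of `method='BFGS'` are all assumptions made for this example.  Note that `minimize()` passes the parameters being optimized as a single array in the first argument, so $a$ and $b$ are packed together into `params` here.

```python
import numpy as np
from scipy.optimize import minimize

def mse(params, x, y):
    """MSE for y_hat = a + b*x; params is the sequence (a, b)."""
    a, b = params
    return np.mean((y - (a + b * x)) ** 2)

# Made-up toy data lying near y = 1 + 2x (not the automobile data).
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Start from a = 0, b = 0 and let the optimizer search for a smaller loss.
result = minimize(mse, (0.0, 0.0), args=(x_toy, y_toy), method='BFGS')

print(result.x)    # estimated (a, b)
print(result.fun)  # MSE at the estimate
```

The returned object's `x` attribute holds the estimated parameters and `fun` holds the final loss value, so the estimates can be checked against the analytical solution.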

What is `minimize()` actually doing?  We can get some idea by visualizing each step of the optimization algorithm.
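
One way to capture those steps is the `callback` argument of `minimize()`, which is called after each iteration with the current parameter vector.  The sketch below is only an illustration: the names `record_step` and `path`, the toy data, and the use of `method='BFGS'` are assumptions, not necessarily what we'll use in class.

```python
import numpy as np
from scipy.optimize import minimize

def mse(params, x, y):
    """MSE for y_hat = a + b*x; params is the sequence (a, b)."""
    a, b = params
    return np.mean((y - (a + b * x)) ** 2)

# Made-up toy data lying near y = 1 + 2x (not the automobile data).
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

start = np.array([0.0, 0.0])
path = [start.copy()]  # the first entry is the initial guess

def record_step(current_params):
    """Store a copy of the parameters after each optimizer iteration."""
    path.append(np.copy(current_params))

result = minimize(mse, start, args=(x_toy, y_toy), method='BFGS',
                  callback=record_step)

# Print the iteration number, the parameters, and the loss at each step.
for step, (a_k, b_k) in enumerate(path):
    print(step, a_k, b_k, mse((a_k, b_k), x_toy, y_toy))
```

The recorded `path` could then be drawn on top of a plot of the loss surface to see how the optimizer moves toward the minimum.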

Now that we've figured out how to implement simple linear regression using numerical optimization, it doesn't take a big leap to extend our implementation to _multiple linear regression_, where we have more than one predictor variable.

Let's look again at the graph of our data with our regression line.

In [None]:
import plotnine as pn  # a and b below are the estimates found earlier
pn.ggplot(df, pn.aes(x='hp', y='mpg')) + pn.geom_point() + pn.geom_abline(intercept=a, slope=b)

The graph suggests that the relationship between horsepower and fuel efficiency is not strictly linear.  Perhaps a quadratic relationship would fit the data better?  Let's try this:

$$ \hat{y} = a + bx + cx^2 . $$

Now, we have:

$$ MSE = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y_i})^2 = \frac{1}{N} \sum_{i=1}^N (y_i - (a + bx_i + cx_i^2))^2 .$$

**Challenge:** Building upon the simple linear regression solution we developed above, implement the loss function for this more sophisticated regression problem and use `minimize()` to estimate the parameters of the regression equation.
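
As a hint at the general shape of a solution, here is a minimal sketch on made-up data.  The function name `mse_quadratic`, the toy arrays, the starting guess, and `method='BFGS'` are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def mse_quadratic(params, x, y):
    """MSE for y_hat = a + b*x + c*x**2; params is the sequence (a, b, c)."""
    a, b, c = params
    y_hat = a + b * x + c * x ** 2
    return np.mean((y - y_hat) ** 2)

# Made-up toy data generated exactly from y = 2 - 3x + 0.5x^2.
x_toy = np.linspace(-2.0, 2.0, 9)
y_toy = 2.0 - 3.0 * x_toy + 0.5 * x_toy ** 2

result = minimize(mse_quadratic, (0.0, 0.0, 0.0), args=(x_toy, y_toy),
                  method='BFGS')

print(result.x)    # should be close to (2.0, -3.0, 0.5)
print(result.fun)  # final loss; close to 0 because the toy data has no noise
```

With the automobile data, horsepower values are in the hundreds, so the $x^2$ term is much larger than the other terms.  If the optimizer struggles to converge, trying a different `method` or rescaling $x$ may help.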

Does our new model fit the data better?  Comparing the final loss function values gives us one way to check.

Let's also plot the regression line to see how it fits the data.

In [None]:
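import numpy as np

# Placeholder: replace with the fitted [a, b, c], e.g. the .x attribute of the
# result returned by minimize().  The code below fails while this list is empty.
params = []

# Evaluate the fitted curve on an evenly spaced grid of horsepower values.
# The grid has len(df) points so it can be plotted alongside df.
xvals = np.linspace(df.hp.min(), df.hp.max(), len(df))
yvals = params[0] + params[1]*xvals + params[2]*(xvals**2)

(pn.ggplot(df, pn.aes(x='hp', y='mpg')) + pn.geom_point() +
     pn.geom_line(pn.aes(x=xvals, y=yvals), color='blue', size=1.2)
)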

Assuming you found the correct model parameters, the fit should look pretty good!  We should be suspicious about what our model is saying for cars with more than 200 horsepower, though!