In [None]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi = False

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np

In [None]:
mpl.rcParams['mathtext.fontset'] = 'stix'
mpl.rcParams['font.family'] = 'STIXGeneral'
mpl.rcParams['text.usetex'] = False
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)
plt.rc('axes', labelsize=12)
mpl.rcParams['figure.dpi'] = 300

# Regression and Classification

Written by: [Matthew R. Carbone](https://www.bnl.gov/staff/mcarbone) | _Assistant Computational Scientist, Computational Science Initiative, Brookhaven National Laboratory_

In this tutorial, we're going to go over the fundamentals of regression and classification, which are the two most common types of supervised learning. We will also discuss some of the best practices in machine learning, such as a train-validation-testing split. In regression problems, the objective is to learn a _continuous_ output. In classification problems, the objective is to learn a _discrete_ output. Here are some examples:
- Predicting the cost of a house from its properties, such as square footage, number of bathrooms, etc. is a _regression_ problem.
- Deciding whether an image shows a cat or a dog is a _classification_ problem.
- Predicting the type of animal in an image is a _classification_ problem.

**Learning objectives:**
- Understand a variety of regression and classification algorithms.
- Start to explore some of the fundamental concepts in machine learning, such as splitting data, overfitting, etc.

Regression and classification problems can be solved via a variety of different methods (or _models_). In this tutorial, we're going to go over the following types of models, which will form the backbone of your understanding of neural networks and other types of machine learning later on.
- Linear regression
- Polynomial regression
- Logistic regression

There are numerous [other types of models](https://www.listendata.com/2018/03/regression-analysis.html#Linear-Regression) which we simply won't have the time to dive into, but the objective of all of these models is the same: given some input, predict some output.

## Ingredients for regression

There are a few "ingredients" to always consider when approaching a regression problem.
- Your available data ("dataset")
- Your choice of model ("model")
- How you choose to fit the model to the data ("optimizer")
- An indicator for how well your model fits the data ("metric"/"criterion")

We will discuss all of these components today.
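
As a rough preview, the sketch below maps each of the four ingredients onto a few lines of code. The names (`x_demo`, `demo_model`, `demo_metric`) and the slope and intercept values are assumptions made up for this illustration, and `np.polyfit` stands in for the optimizer here; it is not the approach used later in this tutorial.

```python
import numpy as np

# Dataset: noisy samples from a line. The slope 2.0 and intercept 1.0
# are illustrative assumptions, not values used elsewhere in the tutorial.
rng = np.random.default_rng(0)
x_demo = np.linspace(-1, 1, 50)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(scale=0.1, size=x_demo.shape)

# Model: a straight line with two free parameters.
def demo_model(x, slope, intercept):
    return slope * x + intercept

# Metric: mean squared error between targets and predictions.
def demo_metric(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Optimizer: ordinary least squares via np.polyfit (degree 1).
# Coefficients are returned highest power first: (slope, intercept).
slope_fit, intercept_fit = np.polyfit(x_demo, y_demo, deg=1)

demo_error = demo_metric(y_demo, demo_model(x_demo, slope_fit, intercept_fit))
print(f"slope={slope_fit:.2f}, intercept={intercept_fit:.2f}, MSE={demo_error:.4f}")
```

Swapping any one ingredient (a different model, metric, or optimizer) leaves the other three in place, which is why it helps to think of them separately.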

## Other resources

- [Andrew Ng's flagship Coursera course on machine learning](https://www.coursera.org/specializations/machine-learning-introduction)
- [Intro to regression analysis](https://towardsdatascience.com/introduction-to-regression-analysis-9151d8ac14b3)
- [15 types of regression](https://www.listendata.com/2018/03/regression-analysis.html#Linear-Regression)

# Linear regression

Let's begin with the simplest form of regression: that of fitting a line to data. We can recast this problem as learning a function $f(x) = y,$ where the form of $f$ is simply the familiar $f(x) = mx + b.$ Given a dataset $\{x_i, y_i\}$, we can "learn" the coefficients $m$ and $b$ that best model the data.

In [None]:
def linear_model(x, m, b):
    return m * x + b

In [None]:
def linear_data_with_noise(seed=123, scale=0.5, N=100, slope=2.4, y_intercept=0.8):
    np.random.seed(seed)
    x = np.linspace(-1, 1, N)
    y = linear_model(x, slope, y_intercept) + np.random.normal(scale=scale, size=(N,))
    return x, y

In [None]:
x, y = linear_data_with_noise()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(3, 2))

ax.scatter(x, y)
ax.set_xlabel("$x$")
ax.set_ylabel("$y$")

plt.show()

Suppose we fix $m$ and $b$ to some values $m_0$ and $b_0$. We now have a model which is completely defined, and given some value for $x$, we can predict $y.$ But how do we know if these are good choices? We need to define a metric, or a measure of how well the model fits the data.
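
One common choice, and the one used in this tutorial, is the mean squared error. As a sketch, for a dataset of $N$ points and the fixed parameters $m_0$ and $b_0$ (the symbol $\mathcal{L}$ is notation introduced here for illustration only):

$$\mathcal{L}(m_0, b_0) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m_0 x_i + b_0) \right)^2$$

Smaller values indicate a closer fit; a value of zero would mean that the line passes through every data point.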

In [None]:
def criterion(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

Above, our `criterion` is the mean squared error (MSE). Its square root is the root-mean-squared error (RMSE), which you are likely familiar with. For the purposes of this example, the choice between the two doesn't matter: the square root is monotonic, so the parameters that minimize one also minimize the other.
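
To see why the choice between the mean squared error and its square root does not affect which model looks best, here is a minimal sketch on toy data. The helper names `mse` and `rmse`, the three data points, and the candidate guesses are all assumptions chosen for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

# Toy data lying exactly on y = 2x + 1 (illustrative values only).
x_toy = np.array([-1.0, 0.0, 1.0])
y_toy = 2.0 * x_toy + 1.0

# Three hypothetical (slope, intercept) guesses.
guesses = [(0.0, 0.0), (2.0, 0.0), (1.5, 1.0)]

mse_scores = [mse(y_toy, m * x_toy + b) for m, b in guesses]
rmse_scores = [rmse(y_toy, m * x_toy + b) for m, b in guesses]

# Compare the ordering of the guesses under each metric.
print(np.argsort(mse_scores), np.argsort(rmse_scores))
```

Because the square root is a monotonically increasing function, both metrics rank the guesses in the same order, so they agree on which guess is best.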

## Evaluating an arbitrary model using a "random" optimizer

We know what the ground truth slope `m` and y-intercept `b` actually are, but let's pretend we don't, and evaluate a set of linear models against the data that we have. We can take a bunch of values for `m0` (guesses for the slope) and `b0` (guesses for the y-intercept), create a linear model with those parameters, and evaluate those against the actual data.

In [None]:
np.random.seed(123)
m0 = np.random.random(size=100) * 6 - 3  # 100 random numbers between -3 and 3
b0 = np.random.random(size=100)          # 100 random numbers between 0 and 1

Let's try every one of these combinations for `m0` and `b0`, and evaluate the model against our data.
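
"Every combination" here means the Cartesian product of the two lists of guesses. A small sketch with made-up values (`slopes` and `intercepts` are hypothetical names used only in this example) shows what `itertools.product` returns:

```python
from itertools import product

slopes = [-1.0, 0.0, 1.0]   # hypothetical slope guesses
intercepts = [0.25, 0.75]   # hypothetical intercept guesses

# Each slope is paired with each intercept, in order.
pairs = list(product(slopes, intercepts))
print(len(pairs))   # 3 * 2 = 6 combinations
print(pairs[:2])    # [(-1.0, 0.25), (-1.0, 0.75)]
```

With 100 guesses for each parameter, the same construction yields 100 × 100 = 10,000 candidate models.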

In [None]:
import pandas as pd
from itertools import product

In [None]:
m0b0 = list(product(m0, b0))

In [None]:
df = pd.DataFrame({
    "m0": [params[0] for params in m0b0],
    "b0": [params[1] for params in m0b0],
    "criterion": [criterion(y, linear_model(x, *params)) for params in m0b0]
})

In [None]:
df

In [None]:
argmin = df["criterion"].argmin()

In [None]:
df.iloc[argmin, :]

## ⚠️ Check your understanding/Discussion

What happened here?
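
One way to put the random-search result in context is to compare it with an ordinary least-squares fit. The sketch below uses `np.polyfit` as a reference optimizer and regenerates the dataset with the same seed and settings as `linear_data_with_noise` so that it runs on its own. The names `x_ref`, `y_ref`, `m_ls`, `b_ls`, and `mse_ls` are introduced here for illustration.

```python
import numpy as np

# Regenerate the tutorial's dataset (same seed and default settings as
# linear_data_with_noise) so this cell is self-contained.
np.random.seed(123)
x_ref = np.linspace(-1, 1, 100)
y_ref = 2.4 * x_ref + 0.8 + np.random.normal(scale=0.5, size=(100,))

# Least-squares fit of a degree-1 polynomial. Coefficients come back
# highest power first, i.e. (slope, intercept).
m_ls, b_ls = np.polyfit(x_ref, y_ref, deg=1)

# Mean squared error of the least-squares line on the same data.
mse_ls = np.mean((y_ref - (m_ls * x_ref + b_ls)) ** 2)

print(f"least squares: m={m_ls:.3f}, b={b_ls:.3f}, MSE={mse_ls:.4f}")
```

Least squares minimizes the mean squared error over all possible lines, so no `(m0, b0)` pair from the random search can score lower than `mse_ls` on this dataset. How close the best random guess gets is a useful point for discussion.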

# Logistic Regression