## Supervised Learning Series

# Part 2: Logistic regression

In the previous post we discussed fitting a linear model to a set of input/output points, otherwise known as *linear regression*.  In general, all sorts of nonlinear phenomena present themselves, and the data they generate - whose input and output share a nonlinear relationship - are poorly modeled by a linear model, causing linear regression to perform poorly.  This naturally leads to fitting *nonlinear* functions to data, referred to in general as *nonlinear regression*.

In this post we describe a particular form of nonlinear regression called *logistic regression*, designed for a kind of dataset that is common in machine learning and deep learning: *two-class classification data*.  This sort of data is distinguished by the fact that its output values are constrained to be one of two fixed values.  As we will see, this constraint naturally leads to the choice of a *logistic sigmoid function* as the ideal nonlinear function to fit to such data, hence the name *logistic regression*.

In [41]:
# imports from custom library
import sys
sys.path.append('../../')
import matplotlib.pyplot as plt
from mlrefined_libraries import superlearn_library as superlearn
import autograd.numpy as np
import math
import pandas as pd
%matplotlib notebook

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2



# 1.  Setting the scene

In this Section we set the scene for logistic regression by describing the problem setup and how linear regression - as well as reasonable extensions of it - naturally fails with such data.

## 1.1 The data

Two-class classification is a particular instance of *regression* or *surface-fitting*, wherein the output of a dataset of $P$ points $\left\{ \left(\mathbf{x}_{p},y_{p}\right)\right\} _{p=1}^{P}$ is no longer continuous but takes on one of two fixed values.  The actual values are in principle arbitrary, but particular value pairs are more helpful than others for derivation purposes (i.e., it is easier to determine a proper nonlinear function to regress on the data for particular output value pairs).  We will typically use $y_{p}\in\left\{ -1,\,+1\right\}$ - that is, every output takes on either the value $+1$ or $-1$.  Often in the context of classification the output values $y_p$ are called *labels*, and all points sharing the same label value are referred to as a *class* of data.  Hence a dataset containing points with label values $y_{p}\in\left\{ -1,\,+1\right\}$ is said to be a dataset consisting of two classes.
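As a minimal sketch of this labeling convention, the snippet below builds a small made-up dataset (the arrays `x` and `y01` and their values are assumptions for illustration, not data used elsewhere in this post). It remaps labels written with the arbitrary pair $\{0, 1\}$ to the pair $\{-1, +1\}$ and then groups the inputs into their two classes. Plain `numpy` is used here so the snippet stands on its own.

```python
import numpy as np

# Hypothetical toy dataset: P = 6 one-dimensional inputs (values are illustrative only)
x = np.array([-2.0, -1.5, -0.5, 0.5, 1.0, 2.5])

# The labels written with an arbitrary pair of values, here {0, 1} ...
y01 = np.array([0, 0, 0, 1, 1, 1])

# ... and remapped to the pair {-1, +1} used in this post
y = 2 * y01 - 1

# All points sharing the same label value form one class
class_neg = x[y == -1]
class_pos = x[y == +1]

print(y)
print(class_neg)
print(class_pos)
```

The remapping changes only the names of the two classes, not which points belong together, which is the sense in which the label values are arbitrary.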

The simplest shape such a dataset can take is that of a set of adjacent 'steps', as illustrated in the Figure below.  Here the 'bottom' step is the region of space containing most of the points that have label value $y_p = -1$.  The 'top' step likewise contains most of the points having label value $y_p = +1$.  These steps are largely separated by a point when the input dimension $N = 1$, a line when $N = 2$, and a hyperplane when $N$ is larger.  As illustrated in the figure, because its output takes on a discrete set of values one can view a classification dataset 'from above'.  In this visualization we remove the vertical $y$ dimension of the data and designate the dataset by its input only, displaying the output values of each class by coloring the input of its members in two unique colors (we choose blue for points with label $y_p = -1$, and red for those having label $y_p = +1$).

<figure>
  <img src= '../../mlrefined_images/superlearn_images/Fig_4_10.png' width="60%" height="60%" alt=""/>
  <figcaption>   
<strong>Figure 1:</strong> <em> Classification
from a regression/surface-fitting perspective for $1$-dimensional (left panels) and $2$-dimensional (right panels) toy datasets. This surface-fitting view is equivalent to the 'separator' perspective looking at each respective dataset 'from above' where the separating hyperplane is precisely where the step function (shown here in yellow) transitions from its bottom to top step.  In the separator view the actual $y$ value (or label) is represented by coloring the points red or blue to denote their respective classes. </em>  </figcaption> 
</figure>
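The two views above can be sketched for the $N = 1$ case as follows. This is an idealized illustration under stated assumptions: the helper `step`, its `boundary` parameter, and the input values are hypothetical, and real data would typically have some points on the 'wrong' step. The snippet generates labels from a perfect step function and then builds the 'from above' view by dropping $y$ and encoding each label as a color.

```python
import numpy as np

def step(x, boundary=0.0):
    # Idealized two-step function (hypothetical helper):
    # -1 on the bottom step, +1 on the top step, switching at `boundary`
    return np.where(x > boundary, 1, -1)

# Hypothetical one-dimensional inputs (N = 1), so the separator is a single point
x = np.array([-2.0, -1.5, -0.5, 0.5, 1.0, 2.5])

# Regression / surface-fitting view: each input paired with its output value
y = step(x)

# 'From above' view: discard y and color each input by its label
# (blue for y = -1, red for y = +1, matching the convention in the text)
colors = np.where(y == -1, 'blue', 'red')

print(y)
print(colors)
```

With `matplotlib` available, the 'from above' view could be drawn with `plt.scatter(x, np.zeros_like(x), c=colors)`, and the regression view with `plt.scatter(x, y, c=colors)`.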

## 1.2  Two views on two class data

## 1.3  A first attempt at a solution